* [PATCH] irq: fasteoi handler re-runs on concurrent invoke
@ 2023-03-17  9:53 James Gowans
  2023-03-17 10:12 ` Yipeng Zou
                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: James Gowans @ 2023-03-17  9:53 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, James Gowans, David Woodhouse, Marc Zyngier,
	KarimAllah Raslan

Update the generic handle_fasteoi_irq to cater for the case when the
next interrupt comes in while the previous handler is still running.
Currently when that happens the irq_may_run() early out causes the next
IRQ to be lost. Change the behaviour to mark the interrupt as pending
and re-run the handler when handle_fasteoi_irq sees that the pending
flag has been set again. This is largely inspired by handle_edge_irq.

Generally it should not be possible for the next interrupt to arrive
while the previous handler is still running: the next interrupt should
only arrive after the EOI message has been sent and the previous handler
has returned. However, there is a race where if the interrupt affinity
is changed while the previous handler is running, then the next
interrupt can arrive at a different CPU while the previous handler is
still running. In that case there will be a concurrent invoke and the
early out will be taken.

For example:

           CPU 0             |          CPU 1
-----------------------------|-----------------------------
interrupt start              |
  handle_fasteoi_irq         | set_affinity(CPU 1)
    handler                  |
    ...                      | interrupt start
    ...                      |   handle_fasteoi_irq -> early out
  handle_fasteoi_irq return  | interrupt end
interrupt end                |

This issue was observed specifically on an arm64 system with a GIC-v3
handling MSIs; GIC-v3 uses the handle_fasteoi_irq handler. The issue is
that the global ITS is responsible for affinity but does not know
whether interrupts are pending/running; only the CPU-local redistributor
handles the EOI. Hence when the affinity is changed in the ITS, the new
CPU's redistributor does not know that the original CPU is still running
the handler.

There are a few uncertainties about this implementation compared to the
prior art in handle_edge_irq:

1. Do we need to mask the IRQ and then unmask it later? I don't think so
but it's not entirely clear why handle_edge_irq does this anyway; it's
an edge IRQ so not sure why it needs to be masked.

2. Should the EOI delivery be inside the do loop after every handler
run? I think outside the loops is best; only inform the chip to deliver
more IRQs once all pending runs have been finished.

3. Do we need to check that desc->action is still set? I don't know if
it can be un-set while the IRQ is still enabled.

4. There is now more overlap with the handle_edge_eoi_irq handler;
should we try to unify these?

Signed-off-by: James Gowans <jgowans@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Marc Zyngier <maz@kernel.org>
Cc: KarimAllah Raslan <karahmed@amazon.com>
---
 Documentation/core-api/genericirq.rst | 9 ++++++++-
 kernel/irq/chip.c                     | 9 +++++++--
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/Documentation/core-api/genericirq.rst b/Documentation/core-api/genericirq.rst
index f959c9b53f61..b54485eca8b5 100644
--- a/Documentation/core-api/genericirq.rst
+++ b/Documentation/core-api/genericirq.rst
@@ -240,7 +240,14 @@ which only need an EOI at the end of the handler.
 
 The following control flow is implemented (simplified excerpt)::
 
-    handle_irq_event(desc->action);
+    if (desc->status & running) {
+        desc->status |= pending;
+        return;
+    }
+    do {
+        desc->status &= ~pending;
+        handle_irq_event(desc->action);
+    } while (status & pending);
     desc->irq_data.chip->irq_eoi();
 
 
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index 49e7bc871fec..4e5fc2b7e8a9 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -692,8 +692,10 @@ void handle_fasteoi_irq(struct irq_desc *desc)
 
 	raw_spin_lock(&desc->lock);
 
-	if (!irq_may_run(desc))
+	if (!irq_may_run(desc)) {
+		desc->istate |= IRQS_PENDING;
 		goto out;
+	}
 
 	desc->istate &= ~(IRQS_REPLAY | IRQS_WAITING);
 
@@ -711,7 +713,10 @@ void handle_fasteoi_irq(struct irq_desc *desc)
 	if (desc->istate & IRQS_ONESHOT)
 		mask_irq(desc);
 
-	handle_irq_event(desc);
+	do {
+		handle_irq_event(desc);
+	} while (unlikely((desc->istate & IRQS_PENDING) &&
+			!irqd_irq_disabled(&desc->irq_data)));
 
 	cond_unmask_eoi_irq(desc, chip);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
  2023-03-17  9:53 [PATCH] irq: fasteoi handler re-runs on concurrent invoke James Gowans
@ 2023-03-17 10:12 ` Yipeng Zou
  2023-03-17 11:49   ` Gowans, James
  2023-04-09 11:41 ` Marc Zyngier
  2023-05-23  3:15 ` liaochang (A)
  2 siblings, 1 reply; 28+ messages in thread
From: Yipeng Zou @ 2023-03-17 10:12 UTC (permalink / raw)
  To: James Gowans, Thomas Gleixner
  Cc: linux-kernel, David Woodhouse, Marc Zyngier, KarimAllah Raslan


On 2023/3/17 17:53, James Gowans wrote:
> Update the generic handle_fasteoi_irq to cater for the case when the
> next interrupt comes in while the previous handler is still running.
> Currently when that happens the irq_may_run() early out causes the next
> IRQ to be lost. Change the behaviour to mark the interrupt as pending
> and re-run the handler when handle_fasteoi_irq sees that the pending
> flag has been set again. This is largely inspired by handle_edge_irq.
>
> Generally it should not be possible for the next interrupt to arrive
> while the previous handler is still running: the next interrupt should
> only arrive after the EOI message has been sent and the previous handler
> has returned. However, there is a race where if the interrupt affinity
> is changed while the previous handler is running, then the next
> interrupt can arrive at a different CPU while the previous handler is
> still running. In that case there will be a concurrent invoke and the
> early out will be taken.
>
> For example:
>
>             CPU 0             |          CPU 1
> -----------------------------|-----------------------------
> interrupt start              |
>    handle_fasteoi_irq         | set_affinity(CPU 1)
>      handler                  |
>      ...                      | interrupt start
>      ...                      |   handle_fasteoi_irq -> early out
>    handle_fasteoi_irq return  | interrupt end
> interrupt end                |
>
> This issue was observed specifically on an arm64 system with a GIC-v3
> handling MSIs; GIC-v3 uses the handle_fasteoi_irq handler. The issue is
> that the global ITS is responsible for affinity but does not know
> whether interrupts are pending/running, only the CPU-local redistributor
> handles the EOI. Hence when the affinity is changed in the ITS, the new
> CPU's redistributor does not know that the original CPU is still running
> the handler.
>
> There are a few uncertainties about this implementation compared to the
> prior art in handle_edge_irq:
>
> 1. Do we need to mask the IRQ and then unmask it later? I don't think so
> but it's not entirely clear why handle_edge_irq does this anyway; it's
> an edge IRQ so not sure why it needs to be masked.
>
> 2. Should the EOI delivery be inside the do loop after every handler
> run? I think outside the loops is best; only inform the chip to deliver
> more IRQs once all pending runs have been finished.
>
> 3. Do we need to check that desc->action is still set? I don't know if
> it can be un-set while the IRQ is still enabled.
>
> 4. There is now more overlap with the handle_edge_eoi_irq handler;
> should we try to unify these?
>
> Signed-off-by: James Gowans <jgowans@amazon.com>
> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: KarimAllah Raslan <karahmed@amazon.com>
> ---
>   Documentation/core-api/genericirq.rst | 9 ++++++++-
>   kernel/irq/chip.c                     | 9 +++++++--
>   2 files changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/core-api/genericirq.rst b/Documentation/core-api/genericirq.rst
> index f959c9b53f61..b54485eca8b5 100644
> --- a/Documentation/core-api/genericirq.rst
> +++ b/Documentation/core-api/genericirq.rst
> @@ -240,7 +240,14 @@ which only need an EOI at the end of the handler.
>   
>   The following control flow is implemented (simplified excerpt)::
>   
> -    handle_irq_event(desc->action);
> +    if (desc->status & running) {
> +        desc->status |= pending;
> +        return;
> +    }
> +    do {
> +        desc->status &= ~pending;
> +        handle_irq_event(desc->action);
> +    } while (status & pending);
>       desc->irq_data.chip->irq_eoi();
>   
>   
> diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
> index 49e7bc871fec..4e5fc2b7e8a9 100644
> --- a/kernel/irq/chip.c
> +++ b/kernel/irq/chip.c
> @@ -692,8 +692,10 @@ void handle_fasteoi_irq(struct irq_desc *desc)
>   
>   	raw_spin_lock(&desc->lock);
>   
> -	if (!irq_may_run(desc))
> +	if (!irq_may_run(desc)) {
> +		desc->istate |= IRQS_PENDING;
>   		goto out;
> +	}
>   
>   	desc->istate &= ~(IRQS_REPLAY | IRQS_WAITING);
>   
> @@ -711,7 +713,10 @@ void handle_fasteoi_irq(struct irq_desc *desc)
>   	if (desc->istate & IRQS_ONESHOT)
>   		mask_irq(desc);
>   
> -	handle_irq_event(desc);
> +	do {
> +		handle_irq_event(desc);
> +	} while (unlikely((desc->istate & IRQS_PENDING) &&
> +			!irqd_irq_disabled(&desc->irq_data)));
>   
>   	cond_unmask_eoi_irq(desc, chip);
>   

Hi:

Finally, someone also hit this problem.

I just sent a patch a few weeks ago.

It seems that we have the same solution (I introduced a new flow handler).

Hopefully this issue will be fixed as soon as possible.

[1] 
https://lore.kernel.org/all/20230310101417.1081434-1-zouyipeng@huawei.com/

-- 
Regards,
Yipeng Zou


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
  2023-03-17 10:12 ` Yipeng Zou
@ 2023-03-17 11:49   ` Gowans, James
  2023-03-22  6:26     ` Yipeng Zou
  0 siblings, 1 reply; 28+ messages in thread
From: Gowans, James @ 2023-03-17 11:49 UTC (permalink / raw)
  To: zouyipeng, tglx; +Cc: maz, Raslan, KarimAllah, Woodhouse, David, linux-kernel

On Fri, 2023-03-17 at 18:12 +0800, Yipeng Zou wrote:
> It seems that we have the same solution.

That's a good sign! :D

> (I introduced a new flow handler).

I considered this, but IMO a new handler isn't the way to go: we already have a
bit of handler proliferation going on here. As mentioned in my commit message,
this is starting to get closer to handle_edge_eoi_irq, and adding a new generic
handler which is a mix of the two existing ones seems to just add more
confusion: which one should a driver owner use? I think it would be great if we
can enhance the existing generic handlers to cater for the various edge cases
and perhaps even merge these generic handlers in future.

What are your thoughts on this approach compared to your proposal?

There is also the "delay the affinity change of the LPI until the next interrupt
acknowledge" option described in the previous thread [0]. I considered that too,
but seeing as handle_edge_irq already takes the approach implemented here of
setting the PENDING flag and then re-running the handler, it seemed like good
prior art to draw on. Is enabling CONFIG_GENERIC_PENDING_IRQ a viable option?
IMO the generic handlers should be resilient to this, so I would prefer this fix
rather than depending on the user to know to set this config option.

JG

[0] https://lore.kernel.org/all/b0f2623b-ec70-d57e-b744-26c62b1ce523@huawei.com/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
  2023-03-17 11:49   ` Gowans, James
@ 2023-03-22  6:26     ` Yipeng Zou
  2023-03-22  7:48       ` Gowans, James
  0 siblings, 1 reply; 28+ messages in thread
From: Yipeng Zou @ 2023-03-22  6:26 UTC (permalink / raw)
  To: Gowans, James, tglx
  Cc: maz, Raslan, KarimAllah, Woodhouse, David, linux-kernel


On 2023/3/17 19:49, Gowans, James wrote:
> On Fri, 2023-03-17 at 18:12 +0800, Yipeng Zou wrote:
>> It seems that we have the same solution.
> That's a good sign! :D
>
>> (I introduced a new flow handler).
> I considered this, but IMO a new handler isn't the way to go: we already have a
> bit of handler proliferation going on here. As mentioned in my commit message
> there this is starting to get closer to handle_edge_eoi_irq, and adding a new
> generic handler which is a mix of the two existing seems to just add more
> confusion: which one should a driver owner use? I think it'll be great if we can
> enhance the existing generic handlers to cater for the various edge cases and
> perhaps even merge these generic handlers in future.
>
> What are your thoughts on this approach compared to your proposal?

Hi,

I also agree with you: enhancing the existing generic handlers is a good way
to go.

Too many generic handlers really confuse developers.

> There is also the "delay the affinity change of LPI until the next interrupt
> acknowledge" option described in the previous thread [0]. I also considered that
> but seeing as the handle_edge_irq does the approach implemented here of setting
> the PENDING flag and then re-running it, it seemed like good prior art to draw
> on. Is that option of enabling CONFIG_GENERIC_PENDING_IRQ a viable? IMO the
> generic handlers should be resilient to this so I would prefer this fix than
> depending on the user to know to set this config option.


CONFIG_GENERIC_PENDING_IRQ is actually something we tried before, at
Thomas's suggestion.

That approach does fix our problem. However, the config is currently only
supported on x86, and some code modifications are required on arm.

In our patch, the config cannot be supported on arm as cleanly as on x86;
refer to the patch below.

It also changes the original behaviour of modifying the interrupt affinity:
the change now takes effect only at the interrupt after the next one, rather
than at the next one.

So, in general, I also prefer the fix which makes the generic handlers
resilient over introducing CONFIG_GENERIC_PENDING_IRQ for the GIC.


diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 1cb392fb16d0..64cfa5e38d89 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -1678,6 +1678,12 @@ static int its_select_cpu_other(const struct cpumask *mask_val)
  }
  #endif

+static void its_irq_chip_eoi(struct irq_data *d)
+{
+       irq_move_irq(d);
+       irq_chip_eoi_parent(d);
+}
+
  static int its_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
                             bool force)
  {
@@ -2026,7 +2032,7 @@ static struct irq_chip its_irq_chip = {
         .name                   = "ITS",
         .irq_mask               = its_mask_irq,
         .irq_unmask             = its_unmask_irq,
-       .irq_eoi                = irq_chip_eoi_parent,
+       .irq_eoi                = its_irq_chip_eoi,
         .irq_set_affinity       = its_set_affinity,
         .irq_compose_msi_msg    = its_irq_compose_msi_msg,
         .irq_set_irqchip_state  = its_irq_set_irqchip_state,
diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig
index ab5505d8caf1..3c829bb4f649 100644
--- a/kernel/irq/Kconfig
+++ b/kernel/irq/Kconfig
@@ -33,7 +33,9 @@ config GENERIC_IRQ_LEGACY_ALLOC_HWIRQ

  # Support for delayed migration from interrupt context
  config GENERIC_PENDING_IRQ
-       bool
+       bool "Support for delayed migration from interrupt context"
+       depends on SMP
+       default n

  # Support for generic irq migrating off cpu before the cpu is offline.
  config GENERIC_IRQ_MIGRATION
diff --git a/kernel/irq/migration.c b/kernel/irq/migration.c
index def48589ea48..bcb61ee69c20 100644
--- a/kernel/irq/migration.c
+++ b/kernel/irq/migration.c
@@ -117,3 +117,5 @@ void __irq_move_irq(struct irq_data *idata)
         if (!masked)
                 idata->chip->irq_unmask(idata);
  }
+
+void __weak irq_force_complete_move(struct irq_desc *desc) { }

> JG
>
> [0]https://lore.kernel.org/all/b0f2623b-ec70-d57e-b744-26c62b1ce523@huawei.com/

-- 
Regards,
Yipeng Zou


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
  2023-03-22  6:26     ` Yipeng Zou
@ 2023-03-22  7:48       ` Gowans, James
  2023-03-22 10:37         ` Thomas Gleixner
  2023-03-22 10:38         ` Yipeng Zou
  0 siblings, 2 replies; 28+ messages in thread
From: Gowans, James @ 2023-03-22  7:48 UTC (permalink / raw)
  To: zouyipeng, tglx; +Cc: maz, Raslan, KarimAllah, Woodhouse, David, linux-kernel

On Wed, 2023-03-22 at 14:26 +0800, Yipeng Zou wrote:
> > On 2023/3/17 19:49, Gowans, James wrote:
> > What are your thoughts on this approach compared to your proposal?
> 
> Hi,
> 
> I also agree with you, enhance the existing generic handlers is a good
> way to go.
> 
> Too many generic handlers really confuse developers.

Thomas, would you be open to taking the patch to tweak the handle_fasteoi_irq
handler? Or is there a different solution to this problem which you prefer?

> About CONFIG_GENERIC_PENDING_IRQ is actually some attempts we made
> before under the suggestion of Thomas.
> 
> This patch is valid for our problem. However, the current config is only
> supported on x86, and some code modifications are required on arm.

Thanks for the patch! I have been trying out CONFIG_GENERIC_PENDING_IRQ too, but
couldn't get it to work; it seems the IRQ never actually moved. I see from your
patch that we would need to tweak the callbacks and explicitly do the affinity
move in the EOI handler of the chip; the generic code won't do it for us.

> This has led to some changes in the original behavior of modifying
> interrupting affinity, from the next interrupt taking effect to the next
> to the next interrupt taking effect.

So this means that even if it's safe to change the affinity right now, the
change will actually be delayed until the *next* interrupt? Specifically because
the interrupt doesn't have the IRQD_MOVE_PCNTXT state flag set, and hence
irq_set_affinity_locked won't call irq_try_set_affinity?
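
(For reference, my rough reading of the relevant branch in
irq_set_affinity_locked() - a paraphrased sketch of kernel/irq/manage.c rather
than the exact code, so details may differ by kernel version:)

	if (irq_can_move_pcntxt(data)) {	/* tests IRQD_MOVE_PCNTXT */
		/* Safe to apply the new affinity right here. */
		ret = irq_try_set_affinity(data, mask, force);
	} else {
		/*
		 * Defer: remember the requested mask; it gets applied
		 * later via irq_move_irq(), typically from the chip's
		 * ack/eoi path in hard interrupt context.
		 */
		irqd_set_move_pending(data);
		irq_copy_pending(desc, mask);
	}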


JG

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
  2023-03-22  7:48       ` Gowans, James
@ 2023-03-22 10:37         ` Thomas Gleixner
  2023-04-03 13:17           ` zhangjianhua (E)
  2023-03-22 10:38         ` Yipeng Zou
  1 sibling, 1 reply; 28+ messages in thread
From: Thomas Gleixner @ 2023-03-22 10:37 UTC (permalink / raw)
  To: Gowans, James, zouyipeng
  Cc: maz, Raslan, KarimAllah, Woodhouse, David, linux-kernel

On Wed, Mar 22 2023 at 07:48, James Gowans wrote:
> On Wed, 2023-03-22 at 14:26 +0800, Yipeng Zou wrote:
> Thomas, would you be open to taking the patch to tweak the handle_fasteoi_irq
> handler? Or is there a different solution to this problem which you
> prefer?

It's in my backlog.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
  2023-03-22  7:48       ` Gowans, James
  2023-03-22 10:37         ` Thomas Gleixner
@ 2023-03-22 10:38         ` Yipeng Zou
  1 sibling, 0 replies; 28+ messages in thread
From: Yipeng Zou @ 2023-03-22 10:38 UTC (permalink / raw)
  To: Gowans, James, tglx
  Cc: maz, Raslan, KarimAllah, Woodhouse, David, linux-kernel


On 2023/3/22 15:48, Gowans, James wrote:
> On Wed, 2023-03-22 at 14:26 +0800, Yipeng Zou wrote:
>>> On 2023/3/17 19:49, Gowans, James wrote:
>>> What are your thoughts on this approach compared to your proposal?
>> Hi,
>>
>> I also agree with you, enhance the existing generic handlers is a good
>> way to go.
>>
>> Too many generic handlers really confuse developers.
> Thomas, would you be open to taking the patch to tweak the handle_fasteoi_irq
> handler? Or is there a different solution to this problem which you prefer?

Our workarounds are generally similar, but the implementation details are
somewhat different.

Maybe we should wait for some comments from the maintainer and other people.

>> About CONFIG_GENERIC_PENDING_IRQ is actually some attempts we made
>> before under the suggestion of Thomas.
>>
>> This patch is valid for our problem. However, the current config is only
>> supported on x86, and some code modifications are required on arm.
> Thanks for the patch! I have been trying out CONFIG_GENERIC_PENDING_IRQ too, but
> couldn't get it to work; it seems the IRQ never actually moved. I see from your
> patch that we would need to tweak the callbacks and explicitly do the affinity
> move in the EOI handler of the chip; the generic code won't do it for us.
>
>> This has led to some changes in the original behavior of modifying
>> interrupting affinity, from the next interrupt taking effect to the next
>> to the next interrupt taking effect.
> So this means that even if it's safe to change the affinity right now, the
> change will actually be delayed until the *next* interrupt? Specifically because
> interrupt doesn't have the IRQD_MOVE_PCNTXT state flag isn't set hence
> irq_set_affinity_locked won't call irq_try_set_affinity?

Yes, the modification of the interrupt affinity will be delayed until the
*next* interrupt's EOI handler (in hardirq context).

This differs from x86, which does irq_move_irq in the ack handler and then
moves the interrupt to the new CPU without any other side effects.
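
For comparison, x86 does the move in its ack path; roughly, apic_ack_irq()
in arch/x86/kernel/apic/vector.c looks like:

void apic_ack_irq(struct irq_data *irqd)
{
	/* Apply any pending affinity change before acking the local APIC. */
	irq_move_irq(irqd);
	ack_APIC_irq();
}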

> JG

-- 
Regards,
Yipeng Zou


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
  2023-03-22 10:37         ` Thomas Gleixner
@ 2023-04-03 13:17           ` zhangjianhua (E)
  2023-04-03 13:19             ` Marc Zyngier
  0 siblings, 1 reply; 28+ messages in thread
From: zhangjianhua (E) @ 2023-04-03 13:17 UTC (permalink / raw)
  To: Thomas Gleixner, Gowans, James, zouyipeng
  Cc: maz, Raslan, KarimAllah, Woodhouse, David, linux-kernel

Hi Thomas

I have the same problem as James and Yipeng: modifying the IRQ affinity
concurrently during LPI interrupt processing does cause interrupt loss.

Fortunately, James and Yipeng have each posted patches with similar
solutions; I hope you'll take this issue seriously, thanks.

On 2023/3/22 18:37, Thomas Gleixner wrote:
> On Wed, Mar 22 2023 at 07:48, James Gowans wrote:
>> On Wed, 2023-03-22 at 14:26 +0800, Yipeng Zou wrote:
>> Thomas, would you be open to taking the patch to tweak the handle_fasteoi_irq
>> handler? Or is there a different solution to this problem which you
>> prefer?
> It's in my backlog.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
  2023-04-03 13:17           ` zhangjianhua (E)
@ 2023-04-03 13:19             ` Marc Zyngier
  2023-04-03 13:39               ` Gowans, James
  0 siblings, 1 reply; 28+ messages in thread
From: Marc Zyngier @ 2023-04-03 13:19 UTC (permalink / raw)
  To: zhangjianhua (E)
  Cc: Thomas Gleixner, Gowans, James, zouyipeng, Raslan, KarimAllah,
	Woodhouse, David, linux-kernel

On 2023-04-03 14:17, zhangjianhua (E) wrote:
> Hi Thomas
> 
> I have the same problem as James and Yipeng, while modify the irq 
> affinity
> 
> concurrently during LPI interrupt processing, it does cause interrupt 
> loss.
> 
> Fortunately, James and Yipeng give their patches separately with the 
> similar
> 
> solutions, and hope you'll take this issue seriously, thanks.

Neither solution is remotely acceptable.

I'll look at something a bit more palatable.

         M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
  2023-04-03 13:19             ` Marc Zyngier
@ 2023-04-03 13:39               ` Gowans, James
  0 siblings, 0 replies; 28+ messages in thread
From: Gowans, James @ 2023-04-03 13:39 UTC (permalink / raw)
  To: maz, chris.zjh
  Cc: tglx, zouyipeng, Raslan, KarimAllah, Woodhouse, David, linux-kernel

On Mon, 2023-04-03 at 14:19 +0100, Marc Zyngier wrote:
> Neither solution is remotely acceptable.
> I'll look at something a bit more palatable.

Thanks Marc, that will be much appreciated.

Out of interest though, why is the suggestion to mark the irq_desc as pending
and re-run the handler not acceptable?

JG

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
  2023-03-17  9:53 [PATCH] irq: fasteoi handler re-runs on concurrent invoke James Gowans
  2023-03-17 10:12 ` Yipeng Zou
@ 2023-04-09 11:41 ` Marc Zyngier
  2023-04-11 10:27   ` Gowans, James
  2023-05-23  3:15 ` liaochang (A)
  2 siblings, 1 reply; 28+ messages in thread
From: Marc Zyngier @ 2023-04-09 11:41 UTC (permalink / raw)
  To: James Gowans
  Cc: Thomas Gleixner, linux-kernel, David Woodhouse, KarimAllah Raslan

James,

On Fri, 17 Mar 2023 09:53:00 +0000,
James Gowans <jgowans@amazon.com> wrote:
> 
> Update the generic handle_fasteoi_irq to cater for the case when the
> next interrupt comes in while the previous handler is still running.
> Currently when that happens the irq_may_run() early out causes the next
> IRQ to be lost. Change the behaviour to mark the interrupt as pending
> and re-run the handler when handle_fasteoi_irq sees that the pending
> flag has been set again. This is largely inspired by handle_edge_irq.
> 
> Generally it should not be possible for the next interrupt to arrive
> while the previous handler is still running: the next interrupt should
> only arrive after the EOI message has been sent and the previous handler
> has returned.

There isn't necessarily an EOI message. On bare metal, EOI of a LPI
amounts to sweet nothing (see the pseudocode to convince yourself),
because the physical LPI doesn't have an active state. When
virtualised, the active state is limited to the state stored in the
LRs, and isn't propagated anywhere else.

> However, there is a race where if the interrupt affinity
> is changed while the previous handler is running, then the next
> interrupt can arrive at a different CPU while the previous handler is
> still running. In that case there will be a concurrent invoke and the
> early out will be taken.
> 
> For example:
> 
>            CPU 0             |          CPU 1
> -----------------------------|-----------------------------
> interrupt start              |
>   handle_fasteoi_irq         | set_affinity(CPU 1)
>     handler                  |
>     ...                      | interrupt start
>     ...                      |   handle_fasteoi_irq -> early out
>   handle_fasteoi_irq return  | interrupt end
> interrupt end                |
> 
> This issue was observed specifically on an arm64 system with a GIC-v3
> handling MSIs; GIC-v3 uses the handle_fasteoi_irq handler. The issue is
> that the global ITS is responsible for affinity but does not know
> whether interrupts are pending/running, only the CPU-local redistributor
> handles the EOI. Hence when the affinity is changed in the ITS, the new
> CPU's redistributor does not know that the original CPU is still running
> the handler.

You seem to be misunderstanding the architecture.

The local redistributor doesn't know anything either. A redistributor
doesn't contain any state about what is currently serviced. At best,
it knows of the pending state of interrupts, and that about it. This
is even more true in a VM. Only the CPU knows about this, and there is
no EOI that ever gets propagated to the redistributor.

> 
> There are a few uncertainties about this implementation compared to the
> prior art in handle_edge_irq:
> 
> 1. Do we need to mask the IRQ and then unmask it later? I don't think so
> but it's not entirely clear why handle_edge_irq does this anyway; it's
> an edge IRQ so not sure why it needs to be masked.

Please measure that cost and weep, specially in the context of
multiple concurrent interrupts serviced by a single ITS (cost of
locking the command queue, of waiting for a full round trip to the ITS
for a couple of commands...).

> 
> 2. Should the EOI delivery be inside the do loop after every handler
> run? I think outside the loops is best; only inform the chip to deliver
> more IRQs once all pending runs have been finished.
> 
> 3. Do we need to check that desc->action is still set? I don't know if
> it can be un-set while the IRQ is still enabled.
> 
> 4. There is now more overlap with the handle_edge_eoi_irq handler;
> should we try to unify these?
> 
> Signed-off-by: James Gowans <jgowans@amazon.com>
> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: KarimAllah Raslan <karahmed@amazon.com>
> ---
>  Documentation/core-api/genericirq.rst | 9 ++++++++-
>  kernel/irq/chip.c                     | 9 +++++++--
>  2 files changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/core-api/genericirq.rst b/Documentation/core-api/genericirq.rst
> index f959c9b53f61..b54485eca8b5 100644
> --- a/Documentation/core-api/genericirq.rst
> +++ b/Documentation/core-api/genericirq.rst
> @@ -240,7 +240,14 @@ which only need an EOI at the end of the handler.
>  
>  The following control flow is implemented (simplified excerpt)::
>  
> -    handle_irq_event(desc->action);
> +    if (desc->status & running) {
> +        desc->status |= pending;
> +        return;
> +    }
> +    do {
> +        desc->status &= ~pending;
> +        handle_irq_event(desc->action);
> +    } while (status & pending);
>      desc->irq_data.chip->irq_eoi();
>  
>  
> diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
> index 49e7bc871fec..4e5fc2b7e8a9 100644
> --- a/kernel/irq/chip.c
> +++ b/kernel/irq/chip.c
> @@ -692,8 +692,10 @@ void handle_fasteoi_irq(struct irq_desc *desc)
>  
>  	raw_spin_lock(&desc->lock);
>  
> -	if (!irq_may_run(desc))
> +	if (!irq_may_run(desc)) {
> +		desc->istate |= IRQS_PENDING;
>  		goto out;
> +	}

What is currently unclear to me is how we get there on another CPU if
we're still handling the interrupt, which happens in the critical
section delineated by desc->lock. So there is some additional state
change here which you don't seem to be describing.

>  
>  	desc->istate &= ~(IRQS_REPLAY | IRQS_WAITING);
>  
> @@ -711,7 +713,10 @@ void handle_fasteoi_irq(struct irq_desc *desc)
>  	if (desc->istate & IRQS_ONESHOT)
>  		mask_irq(desc);
>  
> -	handle_irq_event(desc);
> +	do {
> +		handle_irq_event(desc);
> +	} while (unlikely((desc->istate & IRQS_PENDING) &&
> +			!irqd_irq_disabled(&desc->irq_data)));
>  
>  	cond_unmask_eoi_irq(desc, chip);
>  

Why do we need to inflict any of this on all other users of this flow?
The lack of active state on LPIs is the issue, as it allows a new
interrupt to be presented to another CPU as soon as the previous one
has been ACK'ed.

This also has the effect of keeping the handling of the interrupt on
the original CPU, negating the effects of the change of affinity.

It feels that we should instead resend the interrupt, either by making
it pending again by generating an INT command to the ITS, or by using
the SW resend mechanism (which the lack of deactivation requirement
makes possible).

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
  2023-04-09 11:41 ` Marc Zyngier
@ 2023-04-11 10:27   ` Gowans, James
  2023-04-12 13:32     ` Marc Zyngier
  0 siblings, 1 reply; 28+ messages in thread
From: Gowans, James @ 2023-04-11 10:27 UTC (permalink / raw)
  To: maz
  Cc: tglx, Raslan, KarimAllah, Woodhouse, David, zouyipeng,
	linux-kernel, Sironi, Filippo, chris.zjh

Hi Marc, thanks for taking the time to put thought into this!

On Sun, 2023-04-09 at 12:41 +0100, Marc Zyngier wrote:
> > Generally it should not be possible for the next interrupt to arrive
> > while the previous handler is still running: the next interrupt should
> > only arrive after the EOI message has been sent and the previous handler
> > has returned.
> 
> There isn't necessarily an EOI message. On bare metal, EOI of a LPI
> amounts to sweet nothing (see the pseudocode to convince yourself),
> because the physical LPI doesn't have an active state.

Okay, I saw the gic_eoi_irq() function and thought there might be something
going on there. I guess it's not relevant to this issue.

> > 
> > The issue is
> > that the global ITS is responsible for affinity but does not know
> > whether interrupts are pending/running, only the CPU-local redistributor
> > handles the EOI. Hence when the affinity is changed in the ITS, the new
> > CPU's redistributor does not know that the original CPU is still running
> > the handler.
> 
> You seem to be misunderstanding the architecture.
> 
> The local redistributor doesn't know anything either. A redistributor
> doesn't contain any state about what is currently serviced. At best,
> it knows of the pending state of interrupts, and that about it. This
> is even more true in a VM. Only the CPU knows about this, and there is
> no EOI that ever gets propagated to the redistributor.

Right. This misunderstanding basically stems from my confusion above where I
thought that the EOI would move it back into "pending." That obviously makes no
sense because if another IRQ arrives while the handler is already running, then
another handler run must be queued, hence it must be something really early in
the flow which ACKs the IRQ so that it gets moved back to "inactive."
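
So my rough mental model of the flow is now (pseudocode, not the actual GICv3
driver code):

    irqnr = read(ICC_IAR1_EL1);   /* ack: the LPI goes from pending back to inactive here */
    handle_fasteoi_irq(desc);     /* flow handler -> driver handler(s) run                */
    write(ICC_EOIR1_EL1, irqnr);  /* with EOImode=0 this is only a priority drop for LPIs */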

> > 1. Do we need to mask the IRQ and then unmask it later? I don't think so
> > but it's not entirely clear why handle_edge_irq does this anyway; it's
> > an edge IRQ so not sure why it needs to be masked.
> 
> Please measure that cost and weep, specially in the context of
> multiple concurrent interrupts serviced by a single ITS (cost of
> locking the command queue, of waiting for a full round trip to the ITS
> for a couple of commands...).

Fortunately this mask/unmasking would only happen in the rare(ish) case of the
race condition described here being hit. Exactly the same as
with handle_edge_irq(), the masking and later unmasking would only be done
when irq_may_run() == false due to the race being hit. Considering that this is
a rare occurrence, I think we could stomach the occasional overhead? I was more
asking if it's actually *necessary* to do this masking/unmasking. I'm not sure
it's necessary anyway, hence it wasn't implemented in my patch.

> > index 49e7bc871fec..4e5fc2b7e8a9 100644
> > --- a/kernel/irq/chip.c
> > +++ b/kernel/irq/chip.c
> > @@ -692,8 +692,10 @@ void handle_fasteoi_irq(struct irq_desc *desc)
> > 
> >       raw_spin_lock(&desc->lock);
> > 
> > -     if (!irq_may_run(desc))
> > +     if (!irq_may_run(desc)) {
> > +             desc->istate |= IRQS_PENDING;
> >               goto out;
> > +     }
> 
> What is currently unclear to me is how we get there on another CPU if
> we're still handling the interrupt, which happens in the critical
> section delineated by desc->lock. So there is some additional state
> change here which you don't seem to be describing.

The lock is actually released and later re-acquired in the handle_irq_event()
function which is called below. Only this flag wrangling code is done with the
irq_desc lock held; all of the logic in the later IRQ-specific handler code is
actually run unlocked. This is why there is a window for the irq_desc to get
into the irq_may_run() function concurrently with the handler being run on a
different CPU.
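
For reference, this is roughly what handle_irq_event() does (paraphrasing
kernel/irq/handle.c):

irqreturn_t handle_irq_event(struct irq_desc *desc)
{
	irqreturn_t ret;

	desc->istate &= ~IRQS_PENDING;
	irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
	raw_spin_unlock(&desc->lock);		/* lock dropped here...             */

	ret = handle_irq_event_percpu(desc);	/* ...so the handlers run unlocked  */

	raw_spin_lock(&desc->lock);		/* re-acquired after the handlers   */
	irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
	return ret;
}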
> 
> >       desc->istate &= ~(IRQS_REPLAY | IRQS_WAITING);
> > 
> > @@ -711,7 +713,10 @@ void handle_fasteoi_irq(struct irq_desc *desc)
> >       if (desc->istate & IRQS_ONESHOT)
> >               mask_irq(desc);
> > 
> > -     handle_irq_event(desc);
> > +     do {
> > +             handle_irq_event(desc);
> > +     } while (unlikely((desc->istate & IRQS_PENDING) &&
> > +                     !irqd_irq_disabled(&desc->irq_data)));
> > 
> >       cond_unmask_eoi_irq(desc, chip);
> > 
> 
> Why do we need to inflict any of this on all other users of this flow?
> The lack of active state on LPIs is the issue, as it allows a new
> interrupt to be presented to another CPU as soon as the previous one
> has been ACK'ed.

Do you mean an active state in hardware? Even if we had an active state, that
state would be CPU-local, right? The issue here is that when the CPU affinity
changes while the handler is running, the new CPU will early out because the
flags in irq_desc indicate it's already running elsewhere. A CPU-local hardware
active state would not help here; it would only help if the active/pending
states were global.

As for inflicting this on other users, this is pretty light in general. It's
basically just looking at that flag again; in general it will be unset and we'll
exit the loop after one invoke.

> 
> This also has the effect of keeping the handling of the interrupt on
> the original CPU, negating the effects of the change of affinity.

Yes. This bothered me too initially, but on reflection I'm not sure it's
actually a problem. One possible issue that came to mind was around CPU
offlining, but in the event that a CPU being offlined was running interrupt
handlers it wouldn't be able to complete the offline anyway until the handlers
were finished, so I don't think this is an issue. Do you see any practical issue
with running the handler once more on the original CPU immediately after the
affinity has been changed?

> It feels that we should instead resend the interrupt, either by making
> it pending again by generating an INT command to the ITS, or by using
> the SW resend mechanism (which the lack of deactivation requirement
> makes possible).

I'm open to other suggestions here, just not sure how this re-sending would work
in practice. Do you mean that when the original CPU observes that the pending
flag is set, instead of running the handler again locally the original CPU would
instruct the ITS to mark the interrupt as pending again, hence re-triggering it
on the new CPU? That could work, but may be tricky to shoe-horn into the generic
code, unless we use the EOI callback for this?

Fundamentally I don't think that the solution here should be GIC specific. We
hit this issue on the GIC, but potentially any irq chip could cause this lost
interrupt issue to manifest? That's one of the reasons I preferred a tweak to
the generic code to fix this for everyone.

JG

[0] https://developer.arm.com/documentation/102923/0100/Redistributors/Initial-configuration-of-a-Redistributor?lang=en

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
  2023-04-11 10:27   ` Gowans, James
@ 2023-04-12 13:32     ` Marc Zyngier
  2023-04-18  2:39       ` Yipeng Zou
                         ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Marc Zyngier @ 2023-04-12 13:32 UTC (permalink / raw)
  To: Gowans, James
  Cc: tglx, Raslan, KarimAllah, Woodhouse, David, zouyipeng,
	linux-kernel, Sironi, Filippo, chris.zjh

On Tue, 11 Apr 2023 11:27:26 +0100,
"Gowans, James" <jgowans@amazon.com> wrote:
> 
> Hi Marc, thanks for taking the time to put thought into this!
> 
> On Sun, 2023-04-09 at 12:41 +0100, Marc Zyngier wrote:
> > > Generally it should not be possible for the next interrupt to arrive
> > > while the previous handler is still running: the next interrupt should
> > > only arrive after the EOI message has been sent and the previous handler
> > > has returned.
> > 
> > There isn't necessarily an EOI message. On bare metal, EOI of a LPI
> > amounts to sweet nothing (see the pseudocode to convince yourself),
> > because the physical LPI doesn't have an active state.
> 
> Okay, I saw the gic_eoi_irq() function and thought there might be something
> going on there. I guess it's not relevant to this issue.

It is relevant in a way. If LPIs had an active state, then it wouldn't
be able to creep back up behind your back until the first CPU had
issued a deactivation (triggered by EOI when EOImode=0). But the
architecture was broken since day one, because some architect thought
they knew everything there was to know about the subject...

> 
> > > 
> > > The issue is
> > > that the global ITS is responsible for affinity but does not know
> > > whether interrupts are pending/running, only the CPU-local redistributor
> > > handles the EOI. Hence when the affinity is changed in the ITS, the new
> > > CPU's redistributor does not know that the original CPU is still running
> > > the handler.
> > 
> > You seem to be misunderstanding the architecture.
> > 
> > The local redistributor doesn't know anything either. A redistributor
> > doesn't contain any state about what is currently serviced. At best,
> > it knows of the pending state of interrupts, and that about it. This
> > is even more true in a VM. Only the CPU knows about this, and there is
> > no EOI that ever gets propagated to the redistributor.
> 
> Right. This misunderstanding basically stems from my confusion above where I
> thought that the EOI would move it back into "pending." That obviously makes no
> sense because if another IRQ arrives while the handler is already running, then
> another handler run must be queued, hence it must be something really early in
> the flow which ACKs the IRQ so that it gets moved back to "inactive."

The ACK is the read of ICC_IAR_EL1 in the GICv3 driver, which indeed
sends the interrupt back to the inactive state, despite the
corresponding priority being active on the CPU. With this setup, EOI
only performs the priority drop.

> > > 1. Do we need to mask the IRQ and then unmask it later? I don't think so
> > > but it's not entirely clear why handle_edge_irq does this anyway; it's
> > > an edge IRQ so not sure why it needs to be masked.
> > 
> > Please measure that cost and weep, specially in the context of
> > multiple concurrent interrupts serviced by a single ITS (cost of
> > locking the command queue, of waiting for a full round trip to the ITS
> > for a couple of commands...).
> 
> Fortunately this mask/unmasking would only happen in the rare(ish) cased of the
> race condition described here being hit. Exactly the same as
> with handle_edge_irq(), the masking and later unmasking would only be done
> when irq_may_run() == false due to the race being hit. Considering that this is
> a rare occurrence, I think we could stomach the occasional overhead? I was more
> asking if it's actually *necessary* to do this masking/unmasking. I'm not sure
> it's necessary anyway, hence it wasn't implemented in my patch.

But does it solve anything? At the point where you mask the interrupt,
you have already consumed it. You'd still need to make it pending
somehow, which is what your patch does anyway.

>
> > > index 49e7bc871fec..4e5fc2b7e8a9 100644
> > > --- a/kernel/irq/chip.c
> > > +++ b/kernel/irq/chip.c
> > > @@ -692,8 +692,10 @@ void handle_fasteoi_irq(struct irq_desc *desc)
> > > 
> > >       raw_spin_lock(&desc->lock);
> > > 
> > > -     if (!irq_may_run(desc))
> > > +     if (!irq_may_run(desc)) {
> > > +             desc->istate |= IRQS_PENDING;
> > >               goto out;
> > > +     }
> > 
> > What is currently unclear to me is how we get there on another CPU if
> > we're still handling the interrupt, which happens in the critical
> > section delineated by desc->lock. So there is some additional state
> > change here which you don't seem to be describing.
> 
> The lock is actually released and later re-acquired in the handle_irq_event()
> function which is called below. Only this flag wrangling code is done with the
> irq_desc lock held; all of the logic in the later IRQ-specific handler code is
> actually run unlocked. This is why there is a window for the irq_desc to get
> into the irq_may_run() function concurrently with the handler being run on a
> difference CPU.

Huh, indeed! Interrupts are still disabled (TFFT!), but this makes
firing on another CPU entirely possible. Thanks for the eye-opening
comment.

> > 
> > >       desc->istate &= ~(IRQS_REPLAY | IRQS_WAITING);
> > > 
> > > @@ -711,7 +713,10 @@ void handle_fasteoi_irq(struct irq_desc *desc)
> > >       if (desc->istate & IRQS_ONESHOT)
> > >               mask_irq(desc);
> > > 
> > > -     handle_irq_event(desc);
> > > +     do {
> > > +             handle_irq_event(desc);
> > > +     } while (unlikely((desc->istate & IRQS_PENDING) &&
> > > +                     !irqd_irq_disabled(&desc->irq_data)));
> > > 
> > >       cond_unmask_eoi_irq(desc, chip);
> > > 
> > 
> > Why do we need to inflict any of this on all other users of this flow?
> > The lack of active state on LPIs is the issue, as it allows a new
> > interrupt to be presented to another CPU as soon as the previous one
> > has been ACK'ed.
> 
> Do you mean an active state in hardware? Even if we had an active state, that
> state would be CPU-local, right?

No, and that's the whole point. An active state is global. See for
example how SPIs, which are as global as it gets, have their active
state honoured *system wide*.

> The issue here is that when the CPU affinity
> changes while the handler is running, the new CPU will early out because the
> flags in irq_desc indicate it's already running elsewhere. A CPU-local hardware
> active state would not help here; only if the active/pending states were global,
> then it may help.

See above.

> 
> As for inflicting this on other users, this is pretty light in general. It's
> basically just looking at that flag again; in general it will be unset and we'll
> exit the loop after one invoke.
> 
> > 
> > This also has the effect of keeping the handling of the interrupt on
> > the original CPU, negating the effects of the change of affinity.
> 
> Yes. This bothered me too initially, but on reflection I'm not sure it's
> actually a problem. One possible issue that came to mind was around CPU
> offlining, but in the event that a CPU being offlined was running interrupt
> handlers it wouldn't be able to complete the offline anyway until the handlers
> were finished, so I don't think this is an issue. Do you see any practical issue
> with running the handler once more on the original CPU immediately after the
> affinity has been changed?

My take on this is that we put the pressure on the CPU we want to move
away from. I'd rather we put it on the GIC itself, and use its
Turing-complete powers to force it to redeliver the interrupt at a
more convenient time.

> 
> > It feels that we should instead resend the interrupt, either by making
> > it pending again by generating an INT command to the ITS, or by using
> > the SW resend mechanism (which the lack of deactivation requirement
> > makes possible).
> 
> I'm open to other suggestions here, just not sure how this re-sending would work
> in practice. Do you mean that when the original CPU observes that the pending
> flag is set, instead of running the handler again locally the original CPU would
> instruct the ITS to mark the interrupt as pending again, hence re-triggering it
> on the new CPU? That could work, but may be tricky to shoe-horn into the generic
> code, unless we use the EOI callback for this?

I think we have most of what we need already, see the hack below. It
is totally untested, but you'll hopefully get the idea.

Thanks,

	M.

From c96d2ab37fe273724f1264fba5f4913259875d56 Mon Sep 17 00:00:00 2001
From: Marc Zyngier <maz@kernel.org>
Date: Mon, 10 Apr 2023 10:56:32 +0100
Subject: [PATCH] irqchip/gicv3-its: Force resend of LPIs taken while already
 in-progress

Signed-off-by: Marc Zyngier <maz@kernel.org>
---
 drivers/irqchip/irq-gic-v3-its.c |  3 +++
 include/linux/irq.h              | 13 +++++++++++++
 kernel/irq/chip.c                |  5 ++++-
 kernel/irq/debugfs.c             |  2 ++
 4 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 586271b8aa39..d04c73a2bc6b 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -3574,6 +3574,7 @@ static int its_irq_domain_alloc(struct irq_domain *domain, unsigned int virq,
 		irqd = irq_get_irq_data(virq + i);
 		irqd_set_single_target(irqd);
 		irqd_set_affinity_on_activate(irqd);
+		irqd_set_resend_when_in_progress(irqd);
 		pr_debug("ID:%d pID:%d vID:%d\n",
 			 (int)(hwirq + i - its_dev->event_map.lpi_base),
 			 (int)(hwirq + i), virq + i);
@@ -4512,6 +4513,8 @@ static int its_vpe_irq_domain_alloc(struct irq_domain *domain, unsigned int virq
 		irq_domain_set_hwirq_and_chip(domain, virq + i, i,
 					      irqchip, vm->vpes[i]);
 		set_bit(i, bitmap);
+
+		irqd_set_resend_when_in_progress(irq_get_irq_data(virq + i));
 	}
 
 	if (err) {
diff --git a/include/linux/irq.h b/include/linux/irq.h
index b1b28affb32a..4b2a7cc96eb2 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -223,6 +223,8 @@ struct irq_data {
  *				  irq_chip::irq_set_affinity() when deactivated.
  * IRQD_IRQ_ENABLED_ON_SUSPEND	- Interrupt is enabled on suspend by irq pm if
  *				  irqchip have flag IRQCHIP_ENABLE_WAKEUP_ON_SUSPEND set.
+ * IRQD_RESEND_WHEN_IN_PROGRESS - Interrupt may fire when already in progress,
+ *				  needs resending.
  */
 enum {
 	IRQD_TRIGGER_MASK		= 0xf,
@@ -249,6 +251,7 @@ enum {
 	IRQD_HANDLE_ENFORCE_IRQCTX	= (1 << 28),
 	IRQD_AFFINITY_ON_ACTIVATE	= (1 << 29),
 	IRQD_IRQ_ENABLED_ON_SUSPEND	= (1 << 30),
+	IRQD_RESEND_WHEN_IN_PROGRESS	= (1 << 31),
 };
 
 #define __irqd_to_state(d) ACCESS_PRIVATE((d)->common, state_use_accessors)
@@ -448,6 +451,16 @@ static inline bool irqd_affinity_on_activate(struct irq_data *d)
 	return __irqd_to_state(d) & IRQD_AFFINITY_ON_ACTIVATE;
 }
 
+static inline bool irqd_needs_resend_when_in_progress(struct irq_data *d)
+{
+	return __irqd_to_state(d) & IRQD_RESEND_WHEN_IN_PROGRESS;
+}
+
+static inline void irqd_set_resend_when_in_progress(struct irq_data *d)
+{
+	__irqd_to_state(d) |= IRQD_RESEND_WHEN_IN_PROGRESS;
+}
+
 #undef __irqd_to_state
 
 static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d)
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index 49e7bc871fec..73546ba8bc43 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -692,8 +692,11 @@ void handle_fasteoi_irq(struct irq_desc *desc)
 
 	raw_spin_lock(&desc->lock);
 
-	if (!irq_may_run(desc))
+	if (!irq_may_run(desc)) {
+		if (irqd_needs_resend_when_in_progress(&desc->irq_data))
+			check_irq_resend(desc, true);
 		goto out;
+	}
 
 	desc->istate &= ~(IRQS_REPLAY | IRQS_WAITING);
 
diff --git a/kernel/irq/debugfs.c b/kernel/irq/debugfs.c
index bbcaac64038e..5971a66be034 100644
--- a/kernel/irq/debugfs.c
+++ b/kernel/irq/debugfs.c
@@ -133,6 +133,8 @@ static const struct irq_bit_descr irqdata_states[] = {
 	BIT_MASK_DESCR(IRQD_HANDLE_ENFORCE_IRQCTX),
 
 	BIT_MASK_DESCR(IRQD_IRQ_ENABLED_ON_SUSPEND),
+
+	BIT_MASK_DESCR(IRQD_RESEND_WHEN_IN_PROGRESS),
 };
 
 static const struct irq_bit_descr irqdesc_states[] = {
-- 
2.34.1


-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
  2023-04-12 13:32     ` Marc Zyngier
@ 2023-04-18  2:39       ` Yipeng Zou
  2023-04-18 10:56       ` Gowans, James
  2023-05-23  3:16       ` liaochang (A)
  2 siblings, 0 replies; 28+ messages in thread
From: Yipeng Zou @ 2023-04-18  2:39 UTC (permalink / raw)
  To: Marc Zyngier, Gowans, James
  Cc: tglx, Raslan, KarimAllah, Woodhouse, David, linux-kernel, Sironi,
	Filippo, chris.zjh


On 2023/4/12 21:32, Marc Zyngier wrote:
> I think we have most of what we need already, see the hack below. It
> is totally untested, but you'll hopefully get the idea.
>
> Thanks,
>
> 	M.
>
> From c96d2ab37fe273724f1264fba5f4913259875d56 Mon Sep 17 00:00:00 2001
> From: Marc Zyngier <maz@kernel.org>
> Date: Mon, 10 Apr 2023 10:56:32 +0100
> Subject: [PATCH] irqchip/gicv3-its: Force resend of LPIs taken while already
>   in-progress
>
> Signed-off-by: Marc Zyngier <maz@kernel.org>
> ---
>   drivers/irqchip/irq-gic-v3-its.c |  3 +++
>   include/linux/irq.h              | 13 +++++++++++++
>   kernel/irq/chip.c                |  5 ++++-
>   kernel/irq/debugfs.c             |  2 ++
>   4 files changed, 22 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
> index 586271b8aa39..d04c73a2bc6b 100644
> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -3574,6 +3574,7 @@ static int its_irq_domain_alloc(struct irq_domain *domain, unsigned int virq,
>   		irqd = irq_get_irq_data(virq + i);
>   		irqd_set_single_target(irqd);
>   		irqd_set_affinity_on_activate(irqd);
> +		irqd_set_resend_when_in_progress(irqd);
>   		pr_debug("ID:%d pID:%d vID:%d\n",
>   			 (int)(hwirq + i - its_dev->event_map.lpi_base),
>   			 (int)(hwirq + i), virq + i);
> @@ -4512,6 +4513,8 @@ static int its_vpe_irq_domain_alloc(struct irq_domain *domain, unsigned int virq
>   		irq_domain_set_hwirq_and_chip(domain, virq + i, i,
>   					      irqchip, vm->vpes[i]);
>   		set_bit(i, bitmap);
> +
> +		irqd_set_resend_when_in_progress(irq_get_irq_data(virq + i));
>   	}
>   
>   	if (err) {
> diff --git a/include/linux/irq.h b/include/linux/irq.h
> index b1b28affb32a..4b2a7cc96eb2 100644
> --- a/include/linux/irq.h
> +++ b/include/linux/irq.h
> @@ -223,6 +223,8 @@ struct irq_data {
>    *				  irq_chip::irq_set_affinity() when deactivated.
>    * IRQD_IRQ_ENABLED_ON_SUSPEND	- Interrupt is enabled on suspend by irq pm if
>    *				  irqchip have flag IRQCHIP_ENABLE_WAKEUP_ON_SUSPEND set.
> + * IRQD_RESEND_WHEN_IN_PROGRESS - Interrupt may fire when already in progress,
> + *				  needs resending.
>    */
>   enum {
>   	IRQD_TRIGGER_MASK		= 0xf,
> @@ -249,6 +251,7 @@ enum {
>   	IRQD_HANDLE_ENFORCE_IRQCTX	= (1 << 28),
>   	IRQD_AFFINITY_ON_ACTIVATE	= (1 << 29),
>   	IRQD_IRQ_ENABLED_ON_SUSPEND	= (1 << 30),
> +	IRQD_RESEND_WHEN_IN_PROGRESS	= (1 << 31),
>   };
>   
>   #define __irqd_to_state(d) ACCESS_PRIVATE((d)->common, state_use_accessors)
> @@ -448,6 +451,16 @@ static inline bool irqd_affinity_on_activate(struct irq_data *d)
>   	return __irqd_to_state(d) & IRQD_AFFINITY_ON_ACTIVATE;
>   }
>   
> +static inline bool irqd_needs_resend_when_in_progress(struct irq_data *d)
> +{
> +	return __irqd_to_state(d) & IRQD_RESEND_WHEN_IN_PROGRESS;
> +}
> +
> +static inline void irqd_set_resend_when_in_progress(struct irq_data *d)
> +{
> +	__irqd_to_state(d) |= IRQD_RESEND_WHEN_IN_PROGRESS;
> +}
> +
>   #undef __irqd_to_state
>   
>   static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d)
> diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
> index 49e7bc871fec..73546ba8bc43 100644
> --- a/kernel/irq/chip.c
> +++ b/kernel/irq/chip.c
> @@ -692,8 +692,11 @@ void handle_fasteoi_irq(struct irq_desc *desc)
>   
>   	raw_spin_lock(&desc->lock);
>   
> -	if (!irq_may_run(desc))
> +	if (!irq_may_run(desc)) {
> +		if (irqd_needs_resend_when_in_progress(&desc->irq_data))
> +			check_irq_resend(desc, true);
>   		goto out;
> +	}
>   
>   	desc->istate &= ~(IRQS_REPLAY | IRQS_WAITING);
>   
> diff --git a/kernel/irq/debugfs.c b/kernel/irq/debugfs.c
> index bbcaac64038e..5971a66be034 100644
> --- a/kernel/irq/debugfs.c
> +++ b/kernel/irq/debugfs.c
> @@ -133,6 +133,8 @@ static const struct irq_bit_descr irqdata_states[] = {
>   	BIT_MASK_DESCR(IRQD_HANDLE_ENFORCE_IRQCTX),
>   
>   	BIT_MASK_DESCR(IRQD_IRQ_ENABLED_ON_SUSPEND),
> +
> +	BIT_MASK_DESCR(IRQD_RESEND_WHEN_IN_PROGRESS),
>   };
>   
>   static const struct irq_bit_descr irqdesc_states[] = {

Hi Marc:

     Thanks for your patch; I tested it on my board and it works for me.

     We force a resend of an IRQ when irq_may_run() returns false. Besides
IRQD_IRQ_INPROGRESS, are there other situations (such as IRQD_WAKEUP_ARMED)
that will also resend an IRQ?
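
(For context, irq_may_run() in kernel/irq/chip.c is roughly:)

static bool irq_may_run(struct irq_desc *desc)
{
	unsigned int mask = IRQD_IRQ_INPROGRESS | IRQD_WAKEUP_ARMED;

	/* Not in progress and not an armed wakeup source: run normally. */
	if (!irqd_has_set(&desc->irq_data, mask))
		return true;

	/* Armed wakeup source: mark pending/suspended and notify the PM core. */
	if (irq_pm_check_wakeup(desc))
		return false;

	/* Otherwise deal with a potential concurrent poll on another CPU. */
	return irq_check_poll(desc);
}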

-- 
Regards,
Yipeng Zou


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
  2023-04-12 13:32     ` Marc Zyngier
  2023-04-18  2:39       ` Yipeng Zou
@ 2023-04-18 10:56       ` Gowans, James
  2023-04-19  3:08         ` Yipeng Zou
                           ` (2 more replies)
  2023-05-23  3:16       ` liaochang (A)
  2 siblings, 3 replies; 28+ messages in thread
From: Gowans, James @ 2023-04-18 10:56 UTC (permalink / raw)
  To: maz
  Cc: tglx, Raslan, KarimAllah, Woodhouse, David, zouyipeng,
	linux-kernel, Sironi, Filippo, chris.zjh

On Wed, 2023-04-12 at 14:32 +0100, Marc Zyngier wrote:
> 
> > > > 1. Do we need to mask the IRQ and then unmask it later? I don't think so
> > > > but it's not entirely clear why handle_edge_irq does this anyway; it's
> > > > an edge IRQ so not sure why it needs to be masked.
> > > 
> > > Please measure that cost and weep, specially in the context of
> > > multiple concurrent interrupts serviced by a single ITS (cost of
> > > locking the command queue, of waiting for a full round trip to the ITS
> > > for a couple of commands...).
> > 
>> Fortunately this mask/unmasking would only happen in the rare(ish) case of the
> > race condition described here being hit. Exactly the same as
> > with handle_edge_irq(), the masking and later unmasking would only be done
> > when irq_may_run() == false due to the race being hit. Considering that this is
> > a rare occurrence, I think we could stomach the occasional overhead? I was more
> > asking if it's actually *necessary* to do this masking/unmasking. I'm not sure
> > it's necessary anyway, hence it wasn't implemented in my patch.
> 
> But does it solve anything? At the point where you mask the interrupt,
> you already have consumed it. You'd still need to make it pending
> somehow, which is what your patch somehow does.

I don't really know - the reason I asked the question is that the related
handle_edge_irq() does this mask/unmasking, and I wasn't quite sure why it
did that and hence if we needed to do something similar.
Anyway, let's focus on your patch rather - I think it's more compelling.


> > Yes. This bothered me too initially, but on reflection I'm not sure it's
> > actually a problem. One possible issue that came to mind was around CPU
> > offlining, but in the event that a CPU being offlined was running interrupt
> > handlers it wouldn't be able to complete the offline anyway until the handlers
> > were finished, so I don't think this is an issue. Do you see any practical issue
> > with running the handler once more on the original CPU immediately after the
> > affinity has been changed?
> 
> My take on this is that we put the pressure on the CPU we want to move
> away from. I'd rather we put it on the GIC itself, and use its
> Turing-complete powers to force it to redeliver the interrupt at a
> more convenient time.

This idea and implementation looks and works great! It may need a few
tweaks; discussing below.

> 
> From c96d2ab37fe273724f1264fba5f4913259875d56 Mon Sep 17 00:00:00 2001
> From: Marc Zyngier <maz@kernel.org>
> Date: Mon, 10 Apr 2023 10:56:32 +0100
> Subject: [PATCH] irqchip/gicv3-its: Force resend of LPIs taken while
> already
>  in-progress

Perhaps you can pillage some of my commit message to explain the race here
when you send this patch?
> 
> Signed-off-by: Marc Zyngier <maz@kernel.org>
> 
> diff --git a/include/linux/irq.h b/include/linux/irq.h
> index b1b28affb32a..4b2a7cc96eb2 100644
> --- a/include/linux/irq.h
> +++ b/include/linux/irq.h
> @@ -223,6 +223,8 @@ struct irq_data {
>   *                               irq_chip::irq_set_affinity() when
> deactivated.
>   * IRQD_IRQ_ENABLED_ON_SUSPEND - Interrupt is enabled on suspend by irq
> pm if
>   *                               irqchip have flag
> IRQCHIP_ENABLE_WAKEUP_ON_SUSPEND set.
> + * IRQD_RESEND_WHEN_IN_PROGRESS - Interrupt may fire when already in
> progress,
> + *                               needs resending.
>   */
>  enum {
>         IRQD_TRIGGER_MASK               = 0xf,
> @@ -249,6 +251,7 @@ enum {
>         IRQD_HANDLE_ENFORCE_IRQCTX      = (1 << 28),
>         IRQD_AFFINITY_ON_ACTIVATE       = (1 << 29),
>         IRQD_IRQ_ENABLED_ON_SUSPEND     = (1 << 30),
> +       IRQD_RESEND_WHEN_IN_PROGRESS    = (1 << 31),
>  };

Do we really want a new flag here? I'd be keen to fix this race for all
drivers, not just those who know to set this flag. I think the patch
you're suggesting is pretty close to being safe to enable generally? If so
my preference is for one less config option - just run it always.

>  static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d)
> diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
> index 49e7bc871fec..73546ba8bc43 100644
> --- a/kernel/irq/chip.c
> +++ b/kernel/irq/chip.c
> @@ -692,8 +692,11 @@ void handle_fasteoi_irq(struct irq_desc *desc)
> 
>         raw_spin_lock(&desc->lock);
> 
> -       if (!irq_may_run(desc))
> +       if (!irq_may_run(desc)) {
> +               if (irqd_needs_resend_when_in_progress(&desc->irq_data))
> +                       check_irq_resend(desc, true);
>                 goto out;
> +       }

This will run check_irq_resend() on the *newly affined* CPU, while the old
one is still running the original handler. AFAICT what will happen is:
check_irq_resend
  try_retrigger
    irq_chip_retrigger_hierarchy
      its_irq_retrigger
... which will cause the ITS to *immediately* re-trigger the IRQ. The
original CPU can still be running the handler in that case.

If that happens, consider what will happen in check_irq_resend:
- first IRQ comes in, successfully runs try_retrigger and sets IRQS_REPLAY.
- it is *immediately* retriggered by ITS, and because the original handler
on the other CPU is still running, comes into check_irq_resend again.
- check_irq_resend now observes that IRQS_REPLAY is set and early outs.
- No more resends, the IRQ is still lost. :-(
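
For reference, the relevant part of check_irq_resend() is roughly the
following (a trimmed, paraphrased sketch rather than the exact source; see
kernel/irq/resend.c):

	int check_irq_resend(struct irq_desc *desc, bool inject)
	{
		int err = 0;
		...
		/* a resend is already in flight: the second retrigger early-outs here */
		if (desc->istate & IRQS_REPLAY)
			return -EBUSY;

		if (!(desc->istate & IRQS_PENDING) && !inject)
			return 0;

		desc->istate &= ~IRQS_PENDING;

		/* hardware retrigger if the chip supports it, else the SW resend tasklet */
		if (!try_retrigger(desc))
			err = irq_sw_resend(desc);

		/* mark REPLAY; it is cleared when the flow handler actually runs */
		if (!err)
			desc->istate |= IRQS_REPLAY;
		return err;
	}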

Now I admit the failure mode is getting a bit pathological: two re-
triggers while the original handler is still running, but I was able to
hit this on my test machine by intentionally slowing
the handler down by a few dozen micros. Should we cater for this?

I can see two possibilities:
- tweak check_irq_resend() to not early-out in this case but to keep re-
triggering until it eventually runs.
- move the check_irq_resend to only happen later, *after* the original
handler has finished running. This would be very similar to what I
suggested in my original patch, except instead of running a do/while loop,
the code would observe that the pending flag was set again and run
check_irq_resend (rough sketch below).
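
A rough sketch of that second option, reusing IRQS_PENDING the way
handle_edge_irq() does (the conditional check_irq_resend() call at the end is
the new part, everything else is the existing flow):

	/* handle_fasteoi_irq(), sketch of the second option */
	if (!irq_may_run(desc)) {
		/* note for the CPU that is still busy in the handler */
		desc->istate |= IRQS_PENDING;
		goto out;
	}

	desc->istate &= ~(IRQS_REPLAY | IRQS_WAITING);
	...
	handle_irq_event(desc);

	cond_unmask_eoi_irq(desc, chip);

	/* another CPU took the IRQ while we were running: ask the chip to redeliver */
	if (unlikely(desc->istate & IRQS_PENDING))
		check_irq_resend(desc, false);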

I'm also wondering what will happen for users who don't have the
chip->irq_retrigger callback set and fall back to the tasklet
via irq_sw_resend()... Looks like it will work fine. However if we do my
suggestion and move check_irq_resend to the end of handle_fasteoi_irq then
the tasklet will be scheduled on the old CPU again, which may be sub-
optimal.

JG

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
  2023-04-18 10:56       ` Gowans, James
@ 2023-04-19  3:08         ` Yipeng Zou
  2023-05-02  8:43         ` Gowans, James
  2023-05-02 10:28         ` Marc Zyngier
  2 siblings, 0 replies; 28+ messages in thread
From: Yipeng Zou @ 2023-04-19  3:08 UTC (permalink / raw)
  To: Gowans, James, maz
  Cc: tglx, Raslan, KarimAllah, Woodhouse, David, linux-kernel, Sironi,
	Filippo, chris.zjh


On 2023/4/18 18:56, Gowans, James wrote:
> On Wed, 2023-04-12 at 14:32 +0100, Marc Zyngier wrote:
>>>>> 1. Do we need to mask the IRQ and then unmask it later? I don't think so
>>>>> but it's not entirely clear why handle_edge_irq does this anyway; it's
>>>>> an edge IRQ so not sure why it needs to be masked.
>>>> Please measure that cost and weep, specially in the context of
>>>> multiple concurrent interrupts serviced by a single ITS (cost of
>>>> locking the command queue, of waiting for a full round trip to the ITS
>>>> for a couple of commands...).
>>> Fortunately this mask/unmasking would only happen in the rare(ish) case of the
>>> race condition described here being hit. Exactly the same as
>>> with handle_edge_irq(), the masking and later unmasking would only be done
>>> when irq_may_run() == false due to the race being hit. Considering that this is
>>> a rare occurrence, I think we could stomach the occasional overhead? I was more
>>> asking if it's actually *necessary* to do this masking/unmasking. I'm not sure
>>> it's necessary anyway, hence it wasn't implemented in my patch.
>> But does it solve anything? At the point where you mask the interrupt,
>> you already have consumed it. You'd still need to make it pending
>> somehow, which is what your patch somehow does.
> I don't really know - the reason I asked the question is that the related
> handle_edge_irq() does this mask/unmasking, and I wasn't quite sure why it
> did that and hence if we needed to do something similar.
> Anyway, let's focus on your patch rather - I think it's more compelling.
>
>
>>> Yes. This bothered me too initially, but on reflection I'm not sure it's
>>> actually a problem. One possible issue that came to mind was around CPU
>>> offlining, but in the event that a CPU being offlined was running interrupt
>>> handlers it wouldn't be able to complete the offline anyway until the handlers
>>> were finished, so I don't think this is an issue. Do you see any practical issue
>>> with running the handler once more on the original CPU immediately after the
>>> affinity has been changed?
>> My take on this is that we put the pressure on the CPU we want to move
>> away from. I'd rather we put it on the GIC itself, and use its
>> Turing-complete powers to force it to redeliver the interrupt at a
>> more convenient time.
> This idea and implementation looks and works great! It may need a few
> tweaks; discussing below.
>
>> >From c96d2ab37fe273724f1264fba5f4913259875d56 Mon Sep 17 00:00:00 2001
>> From: Marc Zyngier<maz@kernel.org>
>> Date: Mon, 10 Apr 2023 10:56:32 +0100
>> Subject: [PATCH] irqchip/gicv3-its: Force resend of LPIs taken while
>> already
>>   in-progress
> Perhaps you can pillage some of my commit message to explain the race here
> when you send this patch?
>> Signed-off-by: Marc Zyngier<maz@kernel.org>
>>
>> diff --git a/include/linux/irq.h b/include/linux/irq.h
>> index b1b28affb32a..4b2a7cc96eb2 100644
>> --- a/include/linux/irq.h
>> +++ b/include/linux/irq.h
>> @@ -223,6 +223,8 @@ struct irq_data {
>>    *                               irq_chip::irq_set_affinity() when
>> deactivated.
>>    * IRQD_IRQ_ENABLED_ON_SUSPEND - Interrupt is enabled on suspend by irq
>> pm if
>>    *                               irqchip have flag
>> IRQCHIP_ENABLE_WAKEUP_ON_SUSPEND set.
>> + * IRQD_RESEND_WHEN_IN_PROGRESS - Interrupt may fire when already in
>> progress,
>> + *                               needs resending.
>>    */
>>   enum {
>>          IRQD_TRIGGER_MASK               = 0xf,
>> @@ -249,6 +251,7 @@ enum {
>>          IRQD_HANDLE_ENFORCE_IRQCTX      = (1 << 28),
>>          IRQD_AFFINITY_ON_ACTIVATE       = (1 << 29),
>>          IRQD_IRQ_ENABLED_ON_SUSPEND     = (1 << 30),
>> +       IRQD_RESEND_WHEN_IN_PROGRESS    = (1 << 31),
>>   };
> Do we really want a new flag here? I'd be keen to fix this race for all
> drivers, not just those who know to set this flag. I think the patch
> you're suggesting is pretty close to being safe to enable generally? If so
> my preference is for one less config option - just run it always.
>
>>   static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d)
>> diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
>> index 49e7bc871fec..73546ba8bc43 100644
>> --- a/kernel/irq/chip.c
>> +++ b/kernel/irq/chip.c
>> @@ -692,8 +692,11 @@ void handle_fasteoi_irq(struct irq_desc *desc)
>>
>>          raw_spin_lock(&desc->lock);
>>
>> -       if (!irq_may_run(desc))
>> +       if (!irq_may_run(desc)) {
>> +               if (irqd_needs_resend_when_in_progress(&desc->irq_data))
>> +                       check_irq_resend(desc, true);
>>                  goto out;
>> +       }
> This will run check_irq_resend() on the *newly affined* CPU, while the old
> one is still running the original handler. AFAICT what will happen is:
> check_irq_resend
>    try_retrigger
>      irq_chip_retrigger_hierarchy
>        its_irq_retrigger
> ... which will cause the ITS to *immediately* re-trigger the IRQ. The
> original CPU can still be running the handler in that case.
>
> If that happens, consider what will happen in check_irq_resend:
> - first IRQ comes in, successfully runs try_retrigger and sets IRQS_REPLAY.
> - it is *immediately* retriggered by ITS, and because the original handler
> on the other CPU is still running, comes into check_irq_resend again.
> - check_irq_resend now observes that IRQS_REPLAY is set and early outs.
> - No more resends, the IRQ is still lost. :-(
>
> Now I admit the failure mode is getting a bit pathological: two re-
> triggers while the original handler is still running, but I was able to
> hit this on my test machine by intentionally slowing
> the handler down by a few dozen micros. Should we cater for this?
>
> I can see two possibilities:
> - tweak check_irq_resend() to not early-out in this case but to keep re-
> triggering until it eventually runs.
> - move the check_irq_resend to only happen later, *after* the original
> handler has finished running. This would be very similar to what I
> suggested in my original patch, except instead of running a do/while loop,
> the code would observe that the pending flag was set again and run
> check_irq_resend.
>
> I'm also wondering what will happen for users who don't have the
> chip->irq_retrigger callback set and fall back to the tasklet
> via irq_sw_resend()... Looks like it will work fine. However if we do my
> suggestion and move check_irq_resend to the end of handle_fasteoi_irq then
> the tasklet will be scheduled on the old CPU again, which may be sub-
> optimal.
>
> JG

Hi James:

     This does have a problem; I didn't hit it on my test machine because my
interrupt handler finishes quickly, but this scenario really should be
considered.

     However, if check_irq_resend() is executed at the end of
handle_fasteoi_irq(), a flag is needed to identify that case, serving the
same purpose as the original IRQS_PENDING flag. In effect, the do/while loop
from our original patch is just replaced with check_irq_resend().

     As for the irq_sw_resend() path, which is taken when chip->irq_retrigger
is not implemented, I think there may be another problem here: interrupt
response latency. If irq_sw_resend() is used to trigger a new interrupt, the
delay of that interrupt's delivery is uncertain (tasklet scheduling, etc.),
which may matter for devices with low-latency requirements.

-- 
Regards,
Yipeng Zou


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
  2023-04-18 10:56       ` Gowans, James
  2023-04-19  3:08         ` Yipeng Zou
@ 2023-05-02  8:43         ` Gowans, James
  2023-05-23  3:16           ` liaochang (A)
       [not found]           ` <86sfcfghqh.wl-maz@kernel.org>
  2023-05-02 10:28         ` Marc Zyngier
  2 siblings, 2 replies; 28+ messages in thread
From: Gowans, James @ 2023-05-02  8:43 UTC (permalink / raw)
  To: maz, tglx
  Cc: zouyipeng, Raslan, KarimAllah, Woodhouse, David, linux-kernel,
	Sironi, Filippo, chris.zjh

Hi Marc and Thomas,

On Tue, 2023-04-18 at 12:56 +0200, James Gowans wrote:
> >   static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d)
> > diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
> > index 49e7bc871fec..73546ba8bc43 100644
> > --- a/kernel/irq/chip.c
> > +++ b/kernel/irq/chip.c
> > @@ -692,8 +692,11 @@ void handle_fasteoi_irq(struct irq_desc *desc)
> >          raw_spin_lock(&desc->lock);
> > -       if (!irq_may_run(desc))
> > +       if (!irq_may_run(desc)) {
> > +               if (irqd_needs_resend_when_in_progress(&desc->irq_data))
> > +                       check_irq_resend(desc, true);
> >                  goto out;
> > +       }
> 
> 
> This will run check_irq_resend() on the *newly affined* CPU, while the old
> one is still running the original handler. AFAICT what will happen is:
> check_irq_resend
>   try_retrigger
>     irq_chip_retrigger_hierarchy
>       its_irq_retrigger
> ... which will cause the ITS to *immediately* re-trigger the IRQ. The
> original CPU can still be running the handler in that case.
> 
> If that happens, consider what will happen in check_irq_resend:
> - first IRQ comes in, successfully runs try_retrigger and sets IRQS_REPLAY.
> - it is *immediately* retriggered by ITS, and because the original handler
> on the other CPU is still running, comes into check_irq_resend again.
> - check_irq_resend now observes that IRQS_REPLAY is set and early outs.
> - No more resends, the IRQ is still lost. :-(
> 
> Now I admit the failure mode is getting a bit pathological: two re-
> triggers while the original handler is still running, but I was able to
> hit this on my test machine by intentionally slowing
> the handler down by a few dozen micros. Should we cater for this?
> 
> I can see two possibilities:
> - tweak check_irq_resend() to not early-out in this case but to keep re-
> triggering until it eventually runs.
> - move the check_irq_resend to only happen later, *after* the original
> handler has finished running. This would be very similar to what I
> suggested in my original patch, except instead of running a do/while loop,
> the code would observe that the pending flag was set again and run
> check_irq_resend.
> 
> I'm also wondering what will happen for users who don't have the
> chip->irq_retrigger callback set and fall back to the tasklet
> via irq_sw_resend()... Looks like it will work fine. However if we do my
> suggestion and move check_irq_resend to the end of handle_fasteoi_irq then
> the tasklet will be scheduled on the old CPU again, which may be sub-
> optimal.

Just checking to see if you've had a chance to consider these
issues/thoughts, and if/how they should be handled?
I'm still tending towards saying that the check_irq_resend() should run
after handle_irq_event() and the IRQS_PENDING flag should be wrangled to
decide whether or not to resend.

I just don't know if having the tasklet scheduled and run on the original
CPU via irq_sw_resend() would be problematic or not. In general it
probably won't be, but in the CPU offlining case... maybe? I realise that for
GIC-v3 the tasklet won't be used because the GIC has the chip->irq_retrigger
callback defined - I'm just thinking in general here, especially if we
drop the new IRQD_RESEND_WHEN_IN_PROGRESS flag.

Thoughts?

I can put together a PoC and test it along with Yipeng from Huawei if you
think it sounds reasonable.

JG

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
  2023-04-18 10:56       ` Gowans, James
  2023-04-19  3:08         ` Yipeng Zou
  2023-05-02  8:43         ` Gowans, James
@ 2023-05-02 10:28         ` Marc Zyngier
  2 siblings, 0 replies; 28+ messages in thread
From: Marc Zyngier @ 2023-05-02 10:28 UTC (permalink / raw)
  To: Gowans, James
  Cc: tglx, Raslan, KarimAllah, Woodhouse, David, zouyipeng,
	linux-kernel, Sironi, Filippo, chris.zjh

Catching up...

On Tue, 18 Apr 2023 11:56:07 +0100,
"Gowans, James" <jgowans@amazon.com> wrote:
> 
> On Wed, 2023-04-12 at 14:32 +0100, Marc Zyngier wrote:
> > 
> > From c96d2ab37fe273724f1264fba5f4913259875d56 Mon Sep 17 00:00:00 2001
> > From: Marc Zyngier <maz@kernel.org>
> > Date: Mon, 10 Apr 2023 10:56:32 +0100
> > Subject: [PATCH] irqchip/gicv3-its: Force resend of LPIs taken while
> > already
> >  in-progress
> 
> Perhaps you can pillage some of my commit message to explain the race here
> when you send this patch?

Sure. At the moment, we're still far from a final patch though.

> > 
> > Signed-off-by: Marc Zyngier <maz@kernel.org>
> > 
> > diff --git a/include/linux/irq.h b/include/linux/irq.h
> > index b1b28affb32a..4b2a7cc96eb2 100644
> > --- a/include/linux/irq.h
> > +++ b/include/linux/irq.h
> > @@ -223,6 +223,8 @@ struct irq_data {
> >   *                               irq_chip::irq_set_affinity() when
> > deactivated.
> >   * IRQD_IRQ_ENABLED_ON_SUSPEND - Interrupt is enabled on suspend by irq
> > pm if
> >   *                               irqchip have flag
> > IRQCHIP_ENABLE_WAKEUP_ON_SUSPEND set.
> > + * IRQD_RESEND_WHEN_IN_PROGRESS - Interrupt may fire when already in
> > progress,
> > + *                               needs resending.
> >   */
> >  enum {
> >         IRQD_TRIGGER_MASK               = 0xf,
> > @@ -249,6 +251,7 @@ enum {
> >         IRQD_HANDLE_ENFORCE_IRQCTX      = (1 << 28),
> >         IRQD_AFFINITY_ON_ACTIVATE       = (1 << 29),
> >         IRQD_IRQ_ENABLED_ON_SUSPEND     = (1 << 30),
> > +       IRQD_RESEND_WHEN_IN_PROGRESS    = (1 << 31),
> >  };
> 
> Do we really want a new flag here? I'd be keen to fix this race for all
> drivers, not just those who know to set this flag. I think the patch
> you're suggesting is pretty close to being safe to enable generally? If so
> my preference is for one less config option - just run it always.

I contend that this really is a GICv3 architectural bug. The lack of
an active state on LPIs leads to it, and as far as I can tell, no
other interrupt architecture has the same issue. So the onus should be
on the GIC, the GIC only, and only the parts of the GIC that require
it (SPIs, PPIs and SGIs are fine, either because they have an active
state, or because the lock isn't dropped when calling the handler).

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
  2023-03-17  9:53 [PATCH] irq: fasteoi handler re-runs on concurrent invoke James Gowans
  2023-03-17 10:12 ` Yipeng Zou
  2023-04-09 11:41 ` Marc Zyngier
@ 2023-05-23  3:15 ` liaochang (A)
  2023-05-23 11:59   ` Gowans, James
  2 siblings, 1 reply; 28+ messages in thread
From: liaochang (A) @ 2023-05-23  3:15 UTC (permalink / raw)
  To: James Gowans, Thomas Gleixner
  Cc: linux-kernel, David Woodhouse, Marc Zyngier, KarimAllah Raslan,
	Yipeng Zou, zhangjianhua (E)

Hi, James

On 2023/3/17 17:53, James Gowans wrote:
> Update the generic handle_fasteoi_irq to cater for the case when the
> next interrupt comes in while the previous handler is still running.
> Currently when that happens the irq_may_run() early out causes the next
> IRQ to be lost. Change the behaviour to mark the interrupt as pending
> and re-run the handler when handle_fasteoi_irq sees that the pending
> flag has been set again. This is largely inspired by handle_edge_irq.
> 
> Generally it should not be possible for the next interrupt to arrive
> while the previous handler is still running: the next interrupt should
> only arrive after the EOI message has been sent and the previous handler
> has returned. However, there is a race where if the interrupt affinity
> is changed while the previous handler is running, then the next
> interrupt can arrive at a different CPU while the previous handler is
> still running. In that case there will be a concurrent invoke and the
> early out will be taken.
> 
> For example:
> 
>            CPU 0             |          CPU 1
> -----------------------------|-----------------------------
> interrupt start              |
>   handle_fasteoi_irq         | set_affinity(CPU 1)
>     handler                  |
>     ...                      | interrupt start
>     ...                      |   handle_fasteoi_irq -> early out
>   handle_fasteoi_irq return  | interrupt end
> interrupt end                |
> 
> This issue was observed specifically on an arm64 system with a GIC-v3
> handling MSIs; GIC-v3 uses the handle_fasteoi_irq handler. The issue is
> that the global ITS is responsible for affinity but does not know
> whether interrupts are pending/running, only the CPU-local redistributor
> handles the EOI. Hence when the affinity is changed in the ITS, the new
> CPU's redistributor does not know that the original CPU is still running
> the handler.
> 
> There are a few uncertainties about this implementation compared to the
> prior art in handle_edge_irq:
> 
> 1. Do we need to mask the IRQ and then unmask it later? I don't think so
> but it's not entirely clear why handle_edge_irq does this anyway; it's
> an edge IRQ so not sure why it needs to be masked.

In the GIC architecture, a CPU that is handling an IRQ cannot be signaled
with a new interrupt until it satisfies the interrupt preemption requirements
defined in the GIC architecture. One of these requirements is that the
priority of the new pending interrupt is higher than the priority of the
currently running interrupt. Obviously, the same interrupt source cannot
preempt itself. Additionally, interrupt priority is rarely enabled in
the Linux kernel, except for pseudo-NMI.

If the interrupt is an LPI, and interrupt handling is still in progress on
the original CPU, the lack of the ACTIVE state means that the LPI can be
distributed to another CPU once its affinity has been changed. On the other
hand, since the desc->istate is marked IRQS_PENDING and this interrupt has
been consumed on the new CPU, there is no need to mask it and then unmask it later.

> 
> 2. Should the EOI delivery be inside the do loop after every handler
> run? I think outside the loops is best; only inform the chip to deliver
> more IRQs once all pending runs have been finished.

The GIC architecture requires that writes to the EOI register be in the exact
reverse order of reads from the IAR register. Therefore, IAR and EOI must be paired.
Writing EOI inside the do loop after every handler may cause subtle problems.
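
For illustration, a minimal sketch of that pairing (gic_read_iar() and
gic_write_eoir() are the arm64 accessors; in the real fasteoi flow the EOI is
issued via chip->irq_eoi from cond_unmask_eoi_irq(), this only shows the
required nesting):

	u32 irqnr;

	irqnr = gic_read_iar();	/* ack: one IAR read per delivered interrupt */
	/*
	 * ... dispatch to handle_fasteoi_irq(); whether or not the action is
	 * invoked, exactly one EOI must eventually be written for this read ...
	 */
	gic_write_eoir(irqnr);	/* EOI paired with, and in reverse order of, the IAR reads */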

> 
> 3. Do we need to check that desc->action is still set? I don't know if
> it can be un-set while the IRQ is still enabled.

This check is necessary here. When the code enters this critical section, the kernel
running on another CPU might have already unregistered the IRQ action via the free_irq API.
Although free_irq is typically used in code paths called on module unload or exception
handling, we have also observed that virtualization using VFIO as a PCI backend
frequently tends to use free_irq in some regular code paths.

Thanks.

> 
> 4. There is now more overlap with the handle_edge_eoi_irq handler;
> should we try to unify these?
> 
> Signed-off-by: James Gowans <jgowans@amazon.com>
> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: KarimAllah Raslan <karahmed@amazon.com>
> ---
>  Documentation/core-api/genericirq.rst | 9 ++++++++-
>  kernel/irq/chip.c                     | 9 +++++++--
>  2 files changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/core-api/genericirq.rst b/Documentation/core-api/genericirq.rst
> index f959c9b53f61..b54485eca8b5 100644
> --- a/Documentation/core-api/genericirq.rst
> +++ b/Documentation/core-api/genericirq.rst
> @@ -240,7 +240,14 @@ which only need an EOI at the end of the handler.
>  
>  The following control flow is implemented (simplified excerpt)::
>  
> -    handle_irq_event(desc->action);
> +    if (desc->status & running) {
> +        desc->status |= pending;
> +        return;
> +    }
> +    do {
> +        desc->status &= ~pending;
> +        handle_irq_event(desc->action);
> +    } while (status & pending);
>      desc->irq_data.chip->irq_eoi();
>  
>  
> diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
> index 49e7bc871fec..4e5fc2b7e8a9 100644
> --- a/kernel/irq/chip.c
> +++ b/kernel/irq/chip.c
> @@ -692,8 +692,10 @@ void handle_fasteoi_irq(struct irq_desc *desc)
>  
>  	raw_spin_lock(&desc->lock);
>  
> -	if (!irq_may_run(desc))
> +	if (!irq_may_run(desc)) {
> +		desc->istate |= IRQS_PENDING;
>  		goto out;
> +	}
>  
>  	desc->istate &= ~(IRQS_REPLAY | IRQS_WAITING);
>  
> @@ -711,7 +713,10 @@ void handle_fasteoi_irq(struct irq_desc *desc)
>  	if (desc->istate & IRQS_ONESHOT)
>  		mask_irq(desc);
>  
> -	handle_irq_event(desc);
> +	do {
> +		handle_irq_event(desc);
> +	} while (unlikely((desc->istate & IRQS_PENDING) &&
> +			!irqd_irq_disabled(&desc->irq_data)));
>  
>  	cond_unmask_eoi_irq(desc, chip);
>  

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
  2023-04-12 13:32     ` Marc Zyngier
  2023-04-18  2:39       ` Yipeng Zou
  2023-04-18 10:56       ` Gowans, James
@ 2023-05-23  3:16       ` liaochang (A)
  2 siblings, 0 replies; 28+ messages in thread
From: liaochang (A) @ 2023-05-23  3:16 UTC (permalink / raw)
  To: Marc Zyngier, Gowans, James
  Cc: tglx, Raslan, KarimAllah, Woodhouse, David, zouyipeng,
	linux-kernel, Sironi, Filippo, chris.zjh



On 2023/4/12 21:32, Marc Zyngier wrote:
> On Tue, 11 Apr 2023 11:27:26 +0100,
> "Gowans, James" <jgowans@amazon.com> wrote:
>>
>> Hi Marc, thanks for taking the time to put thought into this!
>>
>> On Sun, 2023-04-09 at 12:41 +0100, Marc Zyngier wrote:
>>>> Generally it should not be possible for the next interrupt to arrive
>>>> while the previous handler is still running: the next interrupt should
>>>> only arrive after the EOI message has been sent and the previous handler
>>>> has returned.
>>>
>>> There isn't necessarily an EOI message. On bare metal, EOI of a LPI
>>> amounts to sweet nothing (see the pseudocode to convince yourself),
>>> because the physical LPI doesn't have an active state.
>>
>> Okay, I saw the gic_eoi_irq() function and thought there might be something
>> going on there. I guess it's not relevant to this issue.
> 
> It is relevant in a way. If LPIs had an active state, then it wouldn't
> be able to creep back up behind your back until the first CPU had
> issued a deactivation (triggered by EOI when EOImode=0). But the
> architecture was broken since day one, because some architect thought
> they knew everything there was to know about the subject...
> 
>>
>>>>
>>>> The issue is
>>>> that the global ITS is responsible for affinity but does not know
>>>> whether interrupts are pending/running, only the CPU-local redistributor
>>>> handles the EOI. Hence when the affinity is changed in the ITS, the new
>>>> CPU's redistributor does not know that the original CPU is still running
>>>> the handler.
>>>
>>> You seem to be misunderstanding the architecture.
>>>
>>> The local redistributor doesn't know anything either. A redistributor
>>> doesn't contain any state about what is currently serviced. At best,
>>> it knows of the pending state of interrupts, and that about it. This
>>> is even more true in a VM. Only the CPU knows about this, and there is
>>> no EOI that ever gets propagated to the redistributor.
>>
>> Right. This misunderstanding basically stems from my confusion above where I
>> thought that the EOI would move it back into "pending." That obviously makes no
>> sense because if another IRQ arrives while the handler is already running, then
>> another handler run must be queued, hence it must be something really early in
>> the flow which ACKs the IRQ so that it gets moved back to "inactive."
> 
> The ACK is the read of ICC_IAR_EL1 in the GICv3 driver, which indeed
> sends the interrupt back to the inactive state, despite the
> corresponding priority being active on the CPU. With this setup, EOI
> only performs the priority drop.
> 
>>>> 1. Do we need to mask the IRQ and then unmask it later? I don't think so
>>>> but it's not entirely clear why handle_edge_irq does this anyway; it's
>>>> an edge IRQ so not sure why it needs to be masked.
>>>
>>> Please measure that cost and weep, specially in the context of
>>> multiple concurrent interrupts serviced by a single ITS (cost of
>>> locking the command queue, of waiting for a full round trip to the ITS
>>> for a couple of commands...).
>>
>> Fortunately this mask/unmasking would only happen in the rare(ish) case of the
>> race condition described here being hit. Exactly the same as
>> with handle_edge_irq(), the masking and later unmasking would only be done
>> when irq_may_run() == false due to the race being hit. Considering that this is
>> a rare occurrence, I think we could stomach the occasional overhead? I was more
>> asking if it's actually *necessary* to do this masking/unmasking. I'm not sure
>> it's necessary anyway, hence it wasn't implemented in my patch.
> 
> But does it solve anything? At the point where you mask the interrupt,
> you already have consumed it. You'd still need to make it pending
> somehow, which is what your patch somehow does.
> 
>>
>>>> index 49e7bc871fec..4e5fc2b7e8a9 100644
>>>> --- a/kernel/irq/chip.c
>>>> +++ b/kernel/irq/chip.c
>>>> @@ -692,8 +692,10 @@ void handle_fasteoi_irq(struct irq_desc *desc)
>>>>
>>>>       raw_spin_lock(&desc->lock);
>>>>
>>>> -     if (!irq_may_run(desc))
>>>> +     if (!irq_may_run(desc)) {
>>>> +             desc->istate |= IRQS_PENDING;
>>>>               goto out;
>>>> +     }
>>>
>>> What is currently unclear to me is how we get there on another CPU if
>>> we're still handling the interrupt, which happens in the critical
>>> section delineated by desc->lock. So there is some additional state
>>> change here which you don't seem to be describing.
>>
>> The lock is actually released and later re-acquired in the handle_irq_event()
>> function which is called below. Only this flag wrangling code is done with the
>> irq_desc lock held; all of the logic in the later IRQ-specific handler code is
>> actually run unlocked. This is why there is a window for the irq_desc to get
>> into the irq_may_run() function concurrently with the handler being run on a
>> different CPU.
> 
> Huh, indeed! Interrupts are still disabled (TFFT!), but this makes
> firing on another CPU entirely possible. Thanks for the eye-opening
> comment.
> 
>>>
>>>>       desc->istate &= ~(IRQS_REPLAY | IRQS_WAITING);
>>>>
>>>> @@ -711,7 +713,10 @@ void handle_fasteoi_irq(struct irq_desc *desc)
>>>>       if (desc->istate & IRQS_ONESHOT)
>>>>               mask_irq(desc);
>>>>
>>>> -     handle_irq_event(desc);
>>>> +     do {
>>>> +             handle_irq_event(desc);
>>>> +     } while (unlikely((desc->istate & IRQS_PENDING) &&
>>>> +                     !irqd_irq_disabled(&desc->irq_data)));
>>>>
>>>>       cond_unmask_eoi_irq(desc, chip);
>>>>
>>>
>>> Why do we need to inflict any of this on all other users of this flow?
>>> The lack of active state on LPIs is the issue, as it allows a new
>>> interrupt to be presented to another CPU as soon as the previous one
>>> has been ACK'ed.
>>
>> Do you mean an active state in hardware? Even if we had an active state, that
>> state would be CPU-local, right?
> 
> No, and that's the whole point. An active state is global. See for
> example how SPIs, which are as global as it gets, have their active
> state honoured *system wide*.
> 
>> The issue here is that when the CPU affinity
>> changes while the handler is running, the new CPU will early out because the
>> flags in irq_desc indicate it's already running elsewhere. A CPU-local hardware
>> active state would not help here; only if the active/pending states were global,
>> then it may help.
> 
> See above.
> 
>>
>> As for inflicting this on other users, this is pretty light in general. It's
>> basically just looking at that flag again; in general it will be unset and we'll
>> exit the loop after one invoke.
>>
>>>
>>> This also has the effect of keeping the handling of the interrupt on
>>> the original CPU, negating the effects of the change of affinity.
>>
>> Yes. This bothered me too initially, but on reflection I'm not sure it's
>> actually a problem. One possible issue that came to mind was around CPU
>> offlining, but in the event that a CPU being offlined was running interrupt
>> handlers it wouldn't be able to complete the offline anyway until the handlers
>> were finished, so I don't think this is an issue. Do you see any practical issue
>> with running the handler once more on the original CPU immediately after the
>> affinity has been changed?
> 
> My take on this is that we put the pressure on the CPU we want to move
> away from. I'd rather we put it on the GIC itself, and use its
> Turing-complete powers to force it to redeliver the interrupt at a
> more convenient time.
> 
>>
>>> It feels that we should instead resend the interrupt, either by making
>>> it pending again by generating an INT command to the ITS, or by using
>>> the SW resend mechanism (which the lack of deactivation requirement
>>> makes possible).
>>
>> I'm open to other suggestions here, just not sure how this re-sending would work
>> in practice. Do you mean that when the original CPU observes that the pending
>> flag is set, instead of running the handler again locally the original CPU would
>> instruct the ITS to mark the interrupt as pending again, hence re-triggering it
>> on the new CPU? That could work, but may be tricky to shoe-horn into the generic
>> code, unless we use the EOI callback for this?
> 
> I think we have most of what we need already, see the hack below. It
> is totally untested, but you'll hopefully get the idea.
> 
> Thanks,
> 
> 	M.
> 
>>From c96d2ab37fe273724f1264fba5f4913259875d56 Mon Sep 17 00:00:00 2001
> From: Marc Zyngier <maz@kernel.org>
> Date: Mon, 10 Apr 2023 10:56:32 +0100
> Subject: [PATCH] irqchip/gicv3-its: Force resend of LPIs taken while already
>  in-progress
> 
> Signed-off-by: Marc Zyngier <maz@kernel.org>
> ---
>  drivers/irqchip/irq-gic-v3-its.c |  3 +++
>  include/linux/irq.h              | 13 +++++++++++++
>  kernel/irq/chip.c                |  5 ++++-
>  kernel/irq/debugfs.c             |  2 ++
>  4 files changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
> index 586271b8aa39..d04c73a2bc6b 100644
> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -3574,6 +3574,7 @@ static int its_irq_domain_alloc(struct irq_domain *domain, unsigned int virq,
>  		irqd = irq_get_irq_data(virq + i);
>  		irqd_set_single_target(irqd);
>  		irqd_set_affinity_on_activate(irqd);
> +		irqd_set_resend_when_in_progress(irqd);
>  		pr_debug("ID:%d pID:%d vID:%d\n",
>  			 (int)(hwirq + i - its_dev->event_map.lpi_base),
>  			 (int)(hwirq + i), virq + i);
> @@ -4512,6 +4513,8 @@ static int its_vpe_irq_domain_alloc(struct irq_domain *domain, unsigned int virq
>  		irq_domain_set_hwirq_and_chip(domain, virq + i, i,
>  					      irqchip, vm->vpes[i]);
>  		set_bit(i, bitmap);
> +
> +		irqd_set_resend_when_in_progress(irq_get_irq_data(virq + i));
>  	}
>  
>  	if (err) {
> diff --git a/include/linux/irq.h b/include/linux/irq.h
> index b1b28affb32a..4b2a7cc96eb2 100644
> --- a/include/linux/irq.h
> +++ b/include/linux/irq.h
> @@ -223,6 +223,8 @@ struct irq_data {
>   *				  irq_chip::irq_set_affinity() when deactivated.
>   * IRQD_IRQ_ENABLED_ON_SUSPEND	- Interrupt is enabled on suspend by irq pm if
>   *				  irqchip have flag IRQCHIP_ENABLE_WAKEUP_ON_SUSPEND set.
> + * IRQD_RESEND_WHEN_IN_PROGRESS - Interrupt may fire when already in progress,
> + *				  needs resending.
>   */
>  enum {
>  	IRQD_TRIGGER_MASK		= 0xf,
> @@ -249,6 +251,7 @@ enum {
>  	IRQD_HANDLE_ENFORCE_IRQCTX	= (1 << 28),
>  	IRQD_AFFINITY_ON_ACTIVATE	= (1 << 29),
>  	IRQD_IRQ_ENABLED_ON_SUSPEND	= (1 << 30),
> +	IRQD_RESEND_WHEN_IN_PROGRESS	= (1 << 31),
>  };
>  
>  #define __irqd_to_state(d) ACCESS_PRIVATE((d)->common, state_use_accessors)
> @@ -448,6 +451,16 @@ static inline bool irqd_affinity_on_activate(struct irq_data *d)
>  	return __irqd_to_state(d) & IRQD_AFFINITY_ON_ACTIVATE;
>  }
>  
> +static inline bool irqd_needs_resend_when_in_progress(struct irq_data *d)
> +{
> +	return __irqd_to_state(d) & IRQD_RESEND_WHEN_IN_PROGRESS;
> +}
> +
> +static inline void irqd_set_resend_when_in_progress(struct irq_data *d)
> +{
> +	__irqd_to_state(d) |= IRQD_RESEND_WHEN_IN_PROGRESS;
> +}
> +
>  #undef __irqd_to_state
>  
>  static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d)
> diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
> index 49e7bc871fec..73546ba8bc43 100644
> --- a/kernel/irq/chip.c
> +++ b/kernel/irq/chip.c
> @@ -692,8 +692,11 @@ void handle_fasteoi_irq(struct irq_desc *desc)
>  
>  	raw_spin_lock(&desc->lock);
>  
> -	if (!irq_may_run(desc))
> +	if (!irq_may_run(desc)) {
> +		if (irqd_needs_resend_when_in_progress(&desc->irq_data))
> +			check_irq_resend(desc, true);

Marc,

I like your idea of using interrupt redelivery to enhance the fasteoi handler.
However, I have a few concerns about the way you are proposing to implement it.

First, the cost issue you mentioned is still a concern in the context of multiple
concurrent interrupts serviced by a single ITS. No matter what kind of
operation is performed (mask, unmask or resend), it ends up injecting an ITS
command with the ITS lock held. This could lead to performance issues.

Second, I am concerned that repeatedly resending the interrupt could be harmful to
the current task running on the new CPU. When an interrupt is delivered, the task
will be interrupted and forced to save and restore its context, and the CPU must
switch exception levels. This can be a costly operation, especially on
virtualization systems.

Thanks.

>  		goto out;
> +	}
>  
>  	desc->istate &= ~(IRQS_REPLAY | IRQS_WAITING);
>  
> diff --git a/kernel/irq/debugfs.c b/kernel/irq/debugfs.c
> index bbcaac64038e..5971a66be034 100644
> --- a/kernel/irq/debugfs.c
> +++ b/kernel/irq/debugfs.c
> @@ -133,6 +133,8 @@ static const struct irq_bit_descr irqdata_states[] = {
>  	BIT_MASK_DESCR(IRQD_HANDLE_ENFORCE_IRQCTX),
>  
>  	BIT_MASK_DESCR(IRQD_IRQ_ENABLED_ON_SUSPEND),
> +
> +	BIT_MASK_DESCR(IRQD_RESEND_WHEN_IN_PROGRESS),
>  };
>  
>  static const struct irq_bit_descr irqdesc_states[] = {

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
  2023-05-02  8:43         ` Gowans, James
@ 2023-05-23  3:16           ` liaochang (A)
  2023-05-25 10:04             ` Gowans, James
       [not found]           ` <86sfcfghqh.wl-maz@kernel.org>
  1 sibling, 1 reply; 28+ messages in thread
From: liaochang (A) @ 2023-05-23  3:16 UTC (permalink / raw)
  To: Gowans, James, maz, tglx
  Cc: zouyipeng, Raslan, KarimAllah, Woodhouse, David, linux-kernel,
	Sironi, Filippo, chris.zjh



On 2023/5/2 16:43, Gowans, James wrote:
> Hi Marc and Thomas,
> 
> On Tue, 2023-04-18 at 12:56 +0200, James Gowans wrote:
>>>   static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d)
>>> diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
>>> index 49e7bc871fec..73546ba8bc43 100644
>>> --- a/kernel/irq/chip.c
>>> +++ b/kernel/irq/chip.c
>>> @@ -692,8 +692,11 @@ void handle_fasteoi_irq(struct irq_desc *desc)
>>>          raw_spin_lock(&desc->lock);
>>> -       if (!irq_may_run(desc))
>>> +       if (!irq_may_run(desc)) {
>>> +               if (irqd_needs_resend_when_in_progress(&desc->irq_data))
>>> +                       check_irq_resend(desc, true);
>>>                  goto out;
>>> +       }
>>
>>
>> This will run check_irq_resend() on the *newly affined* CPU, while the old
>> one is still running the original handler. AFAICT what will happen is:
>> check_irq_resend
>>   try_retrigger
>>     irq_chip_retrigger_hierarchy
>>       its_irq_retrigger
>> ... which will cause the ITS to *immediately* re-trigger the IRQ. The
>> original CPU can still be running the handler in that case.
>>
>> If that happens, consider what will happen in check_irq_resend:
>> - first IRQ comes in, successfully runs try_retrigger and sets IRQS_REPLAY.
>> - it is *immediately* retriggered by ITS, and because the original handler
>> on the other CPU is still running, comes into check_irq_resend again.
>> - check_irq_resend now observes that IRQS_REPLAY is set and early outs.
>> - No more resends, the IRQ is still lost. :-(
>>
>> Now I admit the failure mode is getting a bit pathological: two re-
>> triggers while the original handler is still running, but I was able to
>> hit this on my test machine by intentionally slowing
>> the handler down by a few dozen micros. Should we cater for this?
>>
>> I can see two possibilities:
>> - tweak check_irq_resend() to not early-out in this case but to keep re-
>> triggering until it eventually runs.
>> - move the check_irq_resend to only happen later, *after* the original
>> handler has finished running. This would be very similar to what I
>> suggested in my original patch, except instead of running a do/while loop,
>> the code would observe that the pending flag was set again and run
>> check_irq_resend.

Hi, James and Marc,

After studying your discussions, I list some requirements that need to be satisfied for
the final practical solution:

1. Use the GIC to maintain the unhandled LPI.
2. Do not change the semantics of set_irq_affinity, which means that the interrupt
   action must be performed on the new CPU when the next interrupt occurs after a
   successful set_irq_affinity operation.
3. Minimize the cost, especially to other tasks running on the CPUs, which means avoiding
   a do/while loop on the original CPU and repeated resends of the interrupt on the new CPU.

Based on these requirements and Linux v6.3 rev, I propose the following hack:

diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index 49e7bc871fec..1b49518b19bd 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -692,8 +692,14 @@ void handle_fasteoi_irq(struct irq_desc *desc)

 	raw_spin_lock(&desc->lock);

-	if (!irq_may_run(desc))
+	/*
+	 * Ack another interrupt from the same source can occurs on new
+	 * CPU even before the first one is handled on original CPU.
+	 */
+	if (!irq_may_run(desc)) {
+		desc->istate |= IRQS_PENDING;
 		goto out;
+	}

 	desc->istate &= ~(IRQS_REPLAY | IRQS_WAITING);

@@ -715,6 +721,8 @@ void handle_fasteoi_irq(struct irq_desc *desc)

 	cond_unmask_eoi_irq(desc, chip);

+	check_irq_resend(desc, true);
+
 	raw_spin_unlock(&desc->lock);
 	return;
 out:

Looking forward to your feedback, thanks.

>>
>> I'm also wondering what will happen for users who don't have the
>> chip->irq_retrigger callback set and fall back to the tasklet
>> via irq_sw_resend()... Looks like it will work fine. However if we do my
>> suggestion and move check_irq_resend to the end of handle_fasteoi_irq then
>> the tasklet will be scheduled on the old CPU again, which may be sub-
>> optimal.
> 
> Just checking to see if you've had a chance to consider these
> issues/thoughts, and if/how they should be handled?
> I'm still tending towards saying that the check_irq_resend() should run
> after handle_irq_event() and the IRQS_PENDING flag should be wrangled to
> decide whether or not to resend.
> 
> I just don't know if having the tasklet scheduled and run on the original
> CPU via irq_sw_resend() would be problematic or not. In general it
> probably won't but in the CPU offlining case.... maybe? I realise that for
> GIC-v3 the tasklet won't be used because GIC has chip->irq_retrigger
> callback defined - I'm just thinking in general here, especially so
> assuming we drop the new IRQD_RESEND_WHEN_IN_PROGRESS flag).
> 
> Thoughts?
> 
> I can put together a PoC and test it along with Yipeng from Huawei if you
> think it sounds reasonable.
> 
> JG

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
  2023-05-23  3:15 ` liaochang (A)
@ 2023-05-23 11:59   ` Gowans, James
  2023-05-25 12:31     ` Liao, Chang
  0 siblings, 1 reply; 28+ messages in thread
From: Gowans, James @ 2023-05-23 11:59 UTC (permalink / raw)
  To: tglx, maz, liaochang1
  Cc: zouyipeng, Raslan, KarimAllah, Woodhouse, David, linux-kernel, chris.zjh

On Tue, 2023-05-23 at 11:15 +0800, liaochang (A) wrote:
> > 1. Do we need to mask the IRQ and then unmask it later? I don't think so
> > but it's not entirely clear why handle_edge_irq does this anyway; it's
> > an edge IRQ so not sure why it needs to be masked.
> 
> 
> In the GIC architecture, a CPU that is handling an IRQ cannot be signaled
> with a new interrupt until it satisfies the interrupt preemption requirements
> defined in the GIC architecture. One of these requirements is that the
> priority of the new pending interrupt is higher than the priority of the
> current running interrupt. Obviously, the same interrupt source cannot
> preempt itself. Additionally, interrupt priority is rarely enabled in
> the Linux kernel,except for the PESUDO_NMI.
> 
> If the interrupt is an LPI, and interrupt handling is still in progress on
> the original CPU, the lack of the ACITVE state means that the LPI can be
> distributed to another CPU once its affinity has been changed. On the other
> hand, since the desc->istate is marked IRQS_PENDING and this interrupt has
> been consumed on the new CPU, there is no need to mask it and then unmask it later.

Thanks, this makes sense to me and matches my understanding. What I was
actually unsure about was why handle_edge_irq() *does* do the masking. I
guess it's because handle_edge_irq() runs desc->irq_data.chip->irq_ack(),
and perhaps if that isn't run (in the concurrent invoke case) then the IRQ
will keep getting re-triggered? Wild speculation. :-)
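
For reference, the early-out in handle_edge_irq() goes roughly like this
(paraphrased sketch, not the exact source):

	/* handle_edge_irq(), concurrent invoke / may-not-run case */
	if (!irq_may_run(desc)) {
		desc->istate |= IRQS_PENDING;
		mask_ack_irq(desc);	/* ack (and mask) so the line doesn't keep firing */
		goto out_unlock;
	}

which at least fits the theory that the ack is the reason the masking is there.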

> > 2. Should the EOI delivery be inside the do loop after every handler
> > run? I think outside the loops is best; only inform the chip to deliver
> > more IRQs once all pending runs have been finished.
> 
> The GIC architecture requires that writes to the EOI register be in the exact
> reverse order of reads from the IAR register. Therefore, IAR and EOI must be paired.
> Writing EOI inside the do loop after every handler may cause subtle problems.
> 
Ah ha! (where is this documented, btw?) And because the IAR read happens
really early in (for example) __gic_handle_irq_from_irqson() it only makes
sense to call EOI once when returning from the interrupt context. That
also means that in the concurrent invoke case we would need to run EOI
even if the handler was not invoked.
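
For what it's worth, the existing out: path in handle_fasteoi_irq() does seem
to cater for that already - roughly:

	out:
		if (!(chip->flags & IRQCHIP_EOI_IF_HANDLED))
			chip->irq_eoi(&desc->irq_data);
		raw_spin_unlock(&desc->lock);

so the EOI still goes out even when the handler is skipped, unless the chip
explicitly asks for EOI only on handled interrupts.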

> > 3. Do we need to check that desc->action is still set? I don't know if
> > it can be un-set while the IRQ is still enabled.
> 
> This check is necessary here. When the code enters this critical section, the kernel
> running on another CPU might have already unregistered the IRQ action via the free_irq API.
> Although free_irq is typically used in code paths called on module unload or exception
> handling, we have also observed that virtualization using VFIO as a PCI backend
> frequently intends to use free_irq in some regular code paths.
> 

Okay, I'm a bit hazy on whether it is or should be possible to unregister
the IRQ action while the handler is still running - it sounds like you're
saying this is possible and safe. More code spelunking would be necessary
to verify this but I won't bother seeing as it looks like the solution
we're tending towards uses check_irq_resend(), instead of invoking the
handler directly.

JG

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
       [not found]           ` <86sfcfghqh.wl-maz@kernel.org>
@ 2023-05-23 12:47             ` Gowans, James
  2023-05-25 12:31               ` Liao, Chang
  0 siblings, 1 reply; 28+ messages in thread
From: Gowans, James @ 2023-05-23 12:47 UTC (permalink / raw)
  To: maz, liaochang1
  Cc: tglx, zouyipeng, Raslan, KarimAllah, Woodhouse, David,
	linux-kernel, Sironi, Filippo, chris.zjh

On Tue, 2023-05-02 at 11:17 +0100, Marc Zyngier wrote:
> 
> Sorry for the delay, I've been travelling the past couple of weeks.

Me too - just getting back to this now. :-)
> 
> On Tue, 02 May 2023 09:43:02 +0100,
> "Gowans, James" <jgowans@amazon.com> wrote:
> > 
> > > This will run check_irq_resend() on the *newly affined* CPU, while the old
> > > one is still running the original handler. AFAICT what will happen is:
> > > check_irq_resend
> > >   try_retrigger
> > >     irq_chip_retrigger_hierarchy
> > >       its_irq_retrigger
> > > ... which will cause the ITS to *immediately* re-trigger the IRQ. The
> > > original CPU can still be running the handler in that case.
> 
> Immediately is a relative thing. The GIC processes interrupts far
> slower than the CPU can handle them.

Ack - I've tested with your patch below and empirically on my system it
seems that a resend storm (running resend in a tight loop on the new CPU
while the original one is still slowly running a handler) causes
interrupts to get delivered at a period of about 40 µs. That is indeed less
"immediately" than handlers *typically* take.

> Yes, this is clearly missing from my patch, thanks for pointing this
> out. I think something like the hack below should handle it.
> 
> diff --git a/kernel/irq/resend.c b/kernel/irq/resend.c
> index 0c46e9fe3a89..5fa96842a882 100644
> --- a/kernel/irq/resend.c
> +++ b/kernel/irq/resend.c
> @@ -117,7 +117,8 @@ int check_irq_resend(struct irq_desc *desc, bool inject)
>                 return -EINVAL;
>         }
> 
> -       if (desc->istate & IRQS_REPLAY)
> +       if ((desc->istate & IRQS_REPLAY) &&
> +           !irqd_needs_resend_when_in_progress(&desc->irq_data))
>                 return -EBUSY;
> 
>         if (!(desc->istate & IRQS_PENDING) && !inject)

Have tested this and confirm it mitigates the issue.

> I really think that we want to move away from the old CPU asap.
> Imagine the following case: we want to turn a CPU off, and for this we
> need to migrate the interrupts away from it. But the source of the
> interrupt is a guest which keeps the interrupt coming, and this could
> at least in theory delay the offlining indefinitely.

Agreed - just note that whether we do the hardware-assisted
irq_retrigger() before the handler on the newly affined CPU (as your patch
does) or after the handler on the old CPU (as my original patch suggested)
doesn't make a difference - either way the next actual handler run will
happen on the newly affined CPU and we get off the old CPU for the next
handler runs.

The only time it does make a difference is in the SW resend case, but as
you point out:

> With SW resend, the tasklet should get moved to CPUs that are still
> online, and there may be some additional work to do here.

> > Just checking to see if you've had a chance to consider these
> > issues/thoughts, and if/how they should be handled?
> > I'm still tending towards saying that the check_irq_resend() should run
> > after handle_irq_event() and the IRQS_PENDING flag should be wrangled to
> > decide whether or not to resend.
> 
> I still think this is the wrong approach. This mixes the pending and
> replay states, which are significantly different in the IRQ code.

Okay, TBH I'm not too sure what the difference between these two flags is;
the comments on the enum values aren't too detailed. I was inspired by
handle_edge_irq() to use IRQS_PENDING. Any idea why it's correct to use
IRQS_PENDING there but not here? We could tweak this to use the IRQS_REPLAY
flag and use that to conditionally run check_irq_resend() after the handler.

> I definitely want to see more testing on this, but I'm not sure the SW
> resend part is that critical. The issue at hand really is a GIC
> specific problem.

Ack, I'm also happy to not bother with the SW resend case too much. Just
cross-posting from your other email (unifying the threads):

> I contend that this really is a GICv3 architectural bug. The lack of
> an active state on LPIs leads to it, and as far as I can tell, no
> other interrupt architecture has the same issue. So the onus should be
> on the GIC, the GIC only, and only the parts of the GIC that require
> it (SPIs, PPIs and SGIs are fine, either because they have an active
> state, or because the lock isn't dropped when calling the handler).

Again I'm happy to defer to you here. I'm still not sure why we wouldn't
prefer to solve this in a way that *doesn't* need flags, and will just
never run for interrupt controllers which are not impacted? That seems
like we could have simpler code with no downside.

JG


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
  2023-05-23  3:16           ` liaochang (A)
@ 2023-05-25 10:04             ` Gowans, James
  2023-05-29  2:47               ` Liao, Chang
  0 siblings, 1 reply; 28+ messages in thread
From: Gowans, James @ 2023-05-25 10:04 UTC (permalink / raw)
  To: maz, tglx, liaochang1
  Cc: zouyipeng, Raslan, KarimAllah, Woodhouse, David, linux-kernel,
	Sironi, Filippo, chris.zjh

On Tue, 2023-05-23 at 11:16 +0800, liaochang (A) wrote:
> Hi, James and Marc,
> 
> After studying your discussions, I list some requirements that need to be satisfied for
> the final practical solution:
> 
> 1. Use the GIC to maintain the unhandled LPI.
> 2. Do not change the semantics of set_irq_affinity, which means that the interrupt
>    action must be performed on the new CPU when the next interrupt occurs after a
>    successful set_irq_affinity operation.
> 3. Minimize the cost, especially to other tasks running on the CPUs, which means avoiding
>    a do/while loop on the original CPU and repeated resends of the interrupt on the new CPU.

Seems reasonable to me.
> 
> Based on these requirements and Linux v6.3 rev, I propose the following hack:
> 
> diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
> index 49e7bc871fec..1b49518b19bd 100644
> --- a/kernel/irq/chip.c
> +++ b/kernel/irq/chip.c
> @@ -692,8 +692,14 @@ void handle_fasteoi_irq(struct irq_desc *desc)
> 
>         raw_spin_lock(&desc->lock);
> 
> -       if (!irq_may_run(desc))
> +       /*
> +        * Acking another interrupt from the same source can occur on the new
> +        * CPU even before the first one is handled on the original CPU.
> +        */
> +       if (!irq_may_run(desc)) {
> +               desc->istate |= IRQS_PENDING;
>                 goto out;
> +       }

Wording may need a slight tweak, especially pointing out why PENDING is
set.
> 
>         desc->istate &= ~(IRQS_REPLAY | IRQS_WAITING);
> 
> @@ -715,6 +721,8 @@ void handle_fasteoi_irq(struct irq_desc *desc)
> 
>         cond_unmask_eoi_irq(desc, chip);
> 
> +       check_irq_resend(desc, true);
> +

I'm pretty sure the second param, "inject", must not be true. As is this
will *always* resend the interrupt after every interrupt. Infinite
interrupt loop! (did you test this?)
I've changed it to false and tested, seems good to me.

I'd suggest adding a comment here which pairs with the comment above
explaining why this is being called here. Also perhaps adding:
if (unlikely(desc->istate & IRQS_PENDING))
...so that in general this function call will be skipped over.
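
Putting those tweaks together, the hack would end up looking roughly like
this (untested sketch against the same handle_fasteoi_irq() your diff
touches):

	raw_spin_lock(&desc->lock);

	/*
	 * When the affinity is changed while the handler is still running
	 * on the original CPU, the next interrupt can be acked on the new
	 * CPU before the first handler has returned. Flag it as pending so
	 * the running handler re-triggers it when it completes.
	 */
	if (!irq_may_run(desc)) {
		desc->istate |= IRQS_PENDING;
		goto out;
	}

	...

	cond_unmask_eoi_irq(desc, chip);

	/*
	 * Pairs with the IRQS_PENDING flagging above: only ask the chip to
	 * re-trigger if a concurrent invoke was detected, and pass
	 * inject=false so check_irq_resend() keys off IRQS_PENDING itself.
	 */
	if (unlikely(desc->istate & IRQS_PENDING))
		check_irq_resend(desc, false);

	raw_spin_unlock(&desc->lock);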

If you want I'm happy to make these tweaks and post as V2?
There are also some other comments I'm keen to add to the flag enums to
make it a bit clearer what some of the flags mean.

Marc, what are your thoughts on this approach?
The main (only?) downside I see compared to your suggestion is that in the
irq_sw_resend() case the tasklet will be scheduled on the original CPU. As
you mentioned in a previous reply, perhaps we shouldn't bother thinking
about the sw_resend case now as only GIC is impacted. In future if
necessary the sw_resend implementation could be tweaked to schedule the
tasklet on the CPU which the interrupt is affined to.

JG

>         raw_spin_unlock(&desc->lock);
>         return;
>  out:
> 
> Looking forward to your feedbacks, thanks.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
  2023-05-23 11:59   ` Gowans, James
@ 2023-05-25 12:31     ` Liao, Chang
  0 siblings, 0 replies; 28+ messages in thread
From: Liao, Chang @ 2023-05-25 12:31 UTC (permalink / raw)
  To: Gowans, James, tglx, maz
  Cc: zouyipeng, Raslan, KarimAllah, Woodhouse, David, linux-kernel, chris.zjh



On 2023/5/23 19:59, Gowans, James wrote:
> On Tue, 2023-05-23 at 11:15 +0800, liaochang (A) wrote:
>>> 1. Do we need to mask the IRQ and then unmask it later? I don't think so
>>> but it's not entirely clear why handle_edge_irq does this anyway; it's
>>> an edge IRQ so not sure why it needs to be masked.
>>
>>
>> In the GIC architecture, a CPU that is handling an IRQ cannot be signaled
>> with a new interrupt until it satisfies the interrupt preemption requirements
>> defined in the GIC architecture. One of these requirements is that the
>> priority of the new pending interrupt is higher than the priority of the
>> current running interrupt. Obviously, the same interrupt source cannot
>> preempt itself. Additionally, interrupt priority is rarely enabled in
>> the Linux kernel, except for PSEUDO_NMI.
>>
>> If the interrupt is an LPI, and interrupt handling is still in progress on
>> the original CPU, the lack of the ACTIVE state means that the LPI can be
>> distributed to another CPU once its affinity has been changed. On the other
>> hand, since the desc->istate is marked IRQS_PENDING and this interrupt has
>> been consumed on the new CPU, there is no need to mask it and then unmask it later.
> 
> Thanks, this makes sense to me and matches my understanding. What I was
> actually unsure about was why handle_edge_irq() *does* do the masking. I
> guess it's because handle_edge_irq() runs desc->irq_data.chip->irq_ack(),
> and perhaps if that isn't run (in the concurrent invoke case) then the IRQ
> will keep getting re-triggered? Wild speculation. :-)

James,

I am not very familiar with the history of these flow handlers. However, according
to the comments, I believe that handle_edge_irq() masks interrupts to ensure a window
in which the first interrupt can be handled. This may be because, when handle_edge_irq()
was introduced into the kernel, the controller hardware may have been too simple to gate
interrupts from the same source. As a result, the kernel had to explicitly mask interrupts.

Modern Arm and GIC(v3) handle this differently. Once the CPU acknowledges an edge-triggered
LPI, it inherits the priority of the interrupt into the RPR (running priority) register. This
means that pending interrupts with insufficient priority cannot preempt the interrupt being
handled on the CPU until the CPU drops the priority by writing EOI. In other words, an
interrupt from the same source will be gated by the GIC hardware, so there is no need to do
masking in handle_edge_irq().

> 
>>> 2. Should the EOI delivery be inside the do loop after every handler
>>> run? I think outside the loops is best; only inform the chip to deliver
>>> more IRQs once all pending runs have been finished.
>>
>> The GIC architecture requires that writes to the EOI register be in the exact
>> reverse order of reads from the IAR register. Therefore, IAR and EOI must be paired.
>> Writing EOI inside the do loop after every handler may cause subtle problems.
>>
> Ah ha! (where is this documented, btw?) And because the IAR read happens
> really early in (for example) __gic_handle_irq_from_irqson() it only makes
> sense to call EOI once when returning from the interrupt context. That
> also means that in the concurrent invoke case we would need to run EOI
> even if the handler was not invoked.

Download the official GIC specification from [1] and check out Chapter 4.
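
To illustrate the ordering rule: for nested interrupts, the IAR reads and EOI
writes have to nest strictly, something like this (simplified sketch; see the
spec for the authoritative rules):

	read  IAR  -> INTID A        /* handling of A starts         */
	  read  IAR  -> INTID B      /* higher-priority B preempts A */
	  write EOIR <- INTID B      /* innermost EOI written first  */
	write EOIR <- INTID A        /* then the outer one           */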

> 
>>> 3. Do we need to check that desc->action is still set? I don't know if
>>> it can be un-set while the IRQ is still enabled.
>>
>> This check is necessary here. When the code enters this critical section, the kernel
>> running on another CPU might have already unregistered the IRQ action via the free_irq API.
>> Although free_irq is typically used in code paths called on module unload or exception
>> handling, we have also observed that virtualization using VFIO as a PCI backend
>> frequently intends to use free_irq in some regular code paths.
>>
> 
> Okay, I'm a bit hazy on whether it is or should be possible to unregister
> the IRQ action while the handler is still running - it sounds like you're
> saying this is possible and safe. More code spelunking would be necessary
> to verify this but I won't bother seeing as it looks like the solution
> we're tending towards uses check_irq_resend(), instead of invoking the
> handler directly.

> 
> JG

[1] https://developer.arm.com/documentation/ihi0069/latest/

-- 
BR
Liao, Chang

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
  2023-05-23 12:47             ` Gowans, James
@ 2023-05-25 12:31               ` Liao, Chang
  0 siblings, 0 replies; 28+ messages in thread
From: Liao, Chang @ 2023-05-25 12:31 UTC (permalink / raw)
  To: Gowans, James, maz
  Cc: tglx, zouyipeng, Raslan, KarimAllah, Woodhouse, David,
	linux-kernel, Sironi, Filippo, chris.zjh



On 2023/5/23 20:47, Gowans, James wrote:
> On Tue, 2023-05-02 at 11:17 +0100, Marc Zyngier wrote:
>>
>> Sorry for the delay, I've been travelling the past couple of weeks.
> 
> Me too - just getting back to this now. :-)
>>
>> On Tue, 02 May 2023 09:43:02 +0100,
>> "Gowans, James" <jgowans@amazon.com> wrote:
>>>
>>>> This will run check_irq_resend() on the *newly affined* CPU, while the old
>>>> one is still running the original handler. AFAICT what will happen is:
>>>> check_irq_resend
>>>>   try_retrigger
>>>>     irq_chip_retrigger_hierarchy
>>>>       its_irq_retrigger
>>>> ... which will cause the ITS to *immediately* re-trigger the IRQ. The
>>>> original CPU can still be running the handler in that case.
>>
>> Immediately is a relative thing. The GIC processes interrupts far
>> slower than the CPU can handle them.
> 
> Ack - I've tested with your patch below and empirically on my system it
> seems that a resend storm (running resend in a tight loop on the new CPU
> while the original one is still slowly running a handler) causes
> interrupts to get delivered with a period of about 40 µs. That is indeed less
> "immediately" than handlers *typically* take.
> 
>> Yes, this is clearly missing from my patch, thanks for pointing this
>> out. I think something like the hack below should handle it.
>>
>> diff --git a/kernel/irq/resend.c b/kernel/irq/resend.c
>> index 0c46e9fe3a89..5fa96842a882 100644
>> --- a/kernel/irq/resend.c
>> +++ b/kernel/irq/resend.c
>> @@ -117,7 +117,8 @@ int check_irq_resend(struct irq_desc *desc, bool inject)
>>                 return -EINVAL;
>>         }
>>
>> -       if (desc->istate & IRQS_REPLAY)
>> +       if ((desc->istate & IRQS_REPLAY) &&
>> +           !irqd_needs_resend_when_in_progress(&desc->irq_data))
>>                 return -EBUSY;

Reposting from my earlier email in this thread:

"Second, I am concerned that repeatedly resending the interrupt could be harmful to
the current task running on the new CPU. When an interrupt is delivered, the task
will be interrupted and forced to save and restore its context, and the CPU must switch
exception level. This can be a costly operation, especially for virtualization systems."

Fortunately, this problem is supposed to occur rarely. This means that the performance
degradation may not be so noticeable at the business level. However, I just wonder if
there is another way to handle this situation that does not involve tweaking IRQS_PENDING
or repeatedly resending the interrupt to the new CPU.

Thanks.

>>
>>         if (!(desc->istate & IRQS_PENDING) && !inject)
> 
> Have tested this and confirm it mitigates the issue.
> 
>> I really think that we want to move away from the old CPU asap.
>> Imagine the following case: we want to turn a CPU off, and for this we
>> need to migrate the interrupts away from it. But the source of the
>> interrupt is a guest which keeps the interrupt coming, and this could
>> at least in theory delay the offlining indefinitely.
> 
> Agreed - just note that whether we do the hardware-assisted
> irq_retrigger() before the handler on the newly affined CPU (as your patch
> does) or after the handler on the old CPU (as my original patch suggested)
> doesn't make a difference - either way the next actual handler run will
> happen on the newly affined CPU and we get off the old CPU for the next
> handler runs.
> 
> The only time it does make a difference is in the SW resend case but as
> you point out:
> 
>> With SW resend, the tasklet should get moved to CPUs that are still
>> online, and there may be some additional work to do here.
> 
>>> Just checking to see if you've had a chance to consider these
>>> issues/thoughts, and if/how they should be handled?
>>> I'm still tending towards saying that the check_irq_resend() should run
>>> after handle_irq_event() and the IRQS_PENDING flag should be wrangled to
>>> decide whether or not to resend.
>>
>> I still think this is the wrong approach. This mixes the pending and
>> replay states, which are significantly different in the IRQ code.
> 
> Okay, TBH I'm not too sure what the difference between these two flags is;
> the comments on the enum values aren't too detailed. I was inspired by
> handle_edge_irq() to use IRQS_PENDING. Any idea why it's correct to use
> IRQS_PENDING there but not here? We could tweak this to use the REPLAY flag
> and use that to conditionally run check_irq_resend() after the handler.
> 
>> I definitely want to see more testing on this, but I'm not sure the SW
>> resend part is that critical. The issue at hand really is a GIC
>> specific problem.
> 
> Ack, I'm also happy to not bother with the SW resend case too much. Just
> cross-posting from your other email (unifying the threads):
> 
>> I contend that this really is a GICv3 architectural bug. The lack of
>> an active state on LPIs leads to it, and as far as I can tell, no
>> other interrupt architecture has the same issue. So the onus should be
>> on the GIC, the GIC only, and only the parts of the GIC that require
>> it (SPIs, PPIs and SGIs are fine, either because they have an active
>> state, or because the lock isn't dropped when calling the handler).
> 
> Again I'm happy to defer to you here. I'm still not sure why we wouldn't
> prefer to solve this in a way that *doesn't* need flags, and will just
> never run for interrupt controllers which are not impacted? That seems
> like we could have simpler code with no downside.
> 
> JG
> 

-- 
BR
Liao, Chang

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
  2023-05-25 10:04             ` Gowans, James
@ 2023-05-29  2:47               ` Liao, Chang
  2023-05-30 21:47                 ` Gowans, James
  0 siblings, 1 reply; 28+ messages in thread
From: Liao, Chang @ 2023-05-29  2:47 UTC (permalink / raw)
  To: Gowans, James, maz, tglx
  Cc: zouyipeng, Raslan, KarimAllah, Woodhouse, David, linux-kernel,
	Sironi, Filippo, chris.zjh

Hi, James.

On 2023/5/25 18:04, Gowans, James wrote:
> On Tue, 2023-05-23 at 11:16 +0800, liaochang (A) wrote:
>> Hi, James and Marc,
>>
>> After studying your discussions, I list some requirements that need to be satisfied
>> by the final practical solution:
>>
>> 1. Use the GIC to maintain the unhandled LPI.
>> 2. Do not change the semantics of set_irq_affinity, which means that the interrupt
>>    action must be performed on the new CPU when the next interrupt occurs after a
>>    successful set_irq_affinity operation.
>> 3. Minimize the cost, especially to other tasks running on CPUs, which means avoiding
>>    a do/while loop on the original CPU and repeatedly resending the interrupt on the new CPU.
> 
> Seems reasonable to me.
>>
>> Based on these requirements and Linux v6.3 rev, I propose the following hack:
>>
>> diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
>> index 49e7bc871fec..1b49518b19bd 100644
>> --- a/kernel/irq/chip.c
>> +++ b/kernel/irq/chip.c
>> @@ -692,8 +692,14 @@ void handle_fasteoi_irq(struct irq_desc *desc)
>>
>>         raw_spin_lock(&desc->lock);
>>
>> -       if (!irq_may_run(desc))
>> +       /*
>> +        * Acking another interrupt from the same source can occur on the new
>> +        * CPU even before the first one is handled on the original CPU.
>> +        */
>> +       if (!irq_may_run(desc)) {
>> +               desc->istate |= IRQS_PENDING;
>>                 goto out;
>> +       }
> 
> Wording may need a slight tweak, especially pointing out why PENDING is
> set.

Sure, see the comment below:

"Ack another interrupt from the same source can occurs on new CPU even before
the first one is handled on original CPU, IRQS_PENDING bit can be reused to
indicate this situation, which will defer the execution of the interrupt handler
function associated with the irq_desc until the first interrupt handler returns."

In summary, IRQS_PENDING ensures that only one interrupt handler is ever
running for a particular source at a time. The major usages of IRQS_PENDING
in the kernel are as follows:

1. Used in the irq flow handlers to indicate that an acknowledged interrupt cannot be
handled immediately, for three different reasons:
- Case 1: the interrupt handler function has been unregistered via free_irq().
- Case 2: the interrupt has been disabled via irq_disable().
- Case 3: the interrupt is an edge-triggered interrupt and its handler is already
         running on a CPU.

In any of these cases, the kernel defers the execution of the interrupt handler
until the interrupt is enabled and a handler is established again, either via
check_irq_resend() or via the inner loop in handle_edge_irq() once the previous
handler returns.

2. Used in the spurious interrupt detector: a few systems with misdescribed IRQ
routing can cause an interrupt to be handled on the wrong CPU. In this situation,
the spurious interrupt detector searches for a recovery handler for the interrupt.
If the found handler is running on another CPU, the spurious interrupt detector
also defers the execution of the recovery handler, similar to case 3 in #1.

I hope this is helpful.
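
For case 3 specifically, the relevant part of handle_edge_irq() goes roughly
like this (simplified paraphrase from memory - see kernel/irq/chip.c for the
real code):

	if (!irq_may_run(desc)) {
		desc->istate |= IRQS_PENDING;
		mask_ack_irq(desc);
		goto out_unlock;
	}
	...
	do {
		...
		handle_irq_event(desc);
	} while ((desc->istate & IRQS_PENDING) &&
		 !irqd_irq_disabled(&desc->irq_data));

So the interrupt that arrived during the handler is not lost; its handler
simply runs again on the same CPU once the first invocation returns.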

>>
>>         desc->istate &= ~(IRQS_REPLAY | IRQS_WAITING);
>>
>> @@ -715,6 +721,8 @@ void handle_fasteoi_irq(struct irq_desc *desc)
>>
>>         cond_unmask_eoi_irq(desc, chip);
>>
>> +       check_irq_resend(desc, true);
>> +
> 
> I'm pretty sure the second param, "inject", must not be true. As is this
> will *always* resend the interrupt after every interrupt. Infinite
> interrupt loop! (did you test this?)
> I've changed it to false and tested, seems good to me.

Sorry, this code is a very trivial prototype and has not yet been tested.
You are correct that the "inject" parameter should be set to false, otherwise
the IRQS_PENDING check in check_irq_resend() will be skipped. Thank you for
pointing out my mistake.

> 
> I'd suggest adding a comment here which pairs with the comment above
> explaining why this is being called here. Also perhaps adding:
> if (unlikely(desc->istate & IRQS_PENDING))
> ...so that in general this function call will be skipped over.

Agree.

> 
> If you want I'm happy to make these tweaks and post as V2?
> There are also some other comments I'm keen to add to the flag enums to
> make it a bit clearer what some of the flags mean.

Don't see why not :)

Thanks.

> 
> Marc, what are your thoughts on this approach?
> The main (only?) downside I see compared to your suggestion is that in the
> irq_sw_resend() case the tasklet will be scheduled on the original CPU. As
> you mentioned in a previous reply, perhaps we shouldn't bother thinking
> about the sw_resend case now as only GIC is impacted. In future if
> necessary the sw_resend implementation could be tweaked to schedule the
> tasklet on the CPU which the interrupt is affined to.
> 
> JG
> 
>>         raw_spin_unlock(&desc->lock);
>>         return;
>>  out:
>>
>> Looking forward to your feedbacks, thanks.

-- 
BR
Liao, Chang

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] irq: fasteoi handler re-runs on concurrent invoke
  2023-05-29  2:47               ` Liao, Chang
@ 2023-05-30 21:47                 ` Gowans, James
  0 siblings, 0 replies; 28+ messages in thread
From: Gowans, James @ 2023-05-30 21:47 UTC (permalink / raw)
  To: maz, tglx, liaochang1
  Cc: zouyipeng, Raslan, KarimAllah, Woodhouse, David, linux-kernel,
	Sironi, Filippo, chris.zjh

On Mon, 2023-05-29 at 10:47 +0800, Liao, Chang wrote:
> 
> > 
> > Wording may need a slight tweak, especially pointing out why PENDING is
> > set.
> 
> Sure, see the comment below:
> 
> "Ack another interrupt from the same source can occurs on new CPU even before
> the first one is handled on original CPU, IRQS_PENDING bit can be reused to
> indicate this situation, which will defer the execution of the interrupt handler
> function associated with the irq_desc until the first interrupt handler returns."
> 
> In summary, IRQS_PENDING ensures that only one interrupt handler is ever
> running for a particular source at a time. The major usages of IRQS_PENDING
> in the kernel are as follows:
> 
> 1. Used in the irq flow handlers to indicate that an acknowledged interrupt cannot be
> handled immediately, for three different reasons:
> - Case 1: the interrupt handler function has been unregistered via free_irq().
> - Case 2: the interrupt has been disabled via irq_disable().
> - Case 3: the interrupt is an edge-triggered interrupt and its handler is already
>          running on a CPU.
> 
> In any of these cases, the kernel defers the execution of the interrupt handler
> until the interrupt is enabled and a handler is established again, either via
> check_irq_resend() or via the inner loop in handle_edge_irq() once the previous
> handler returns.
> 
> 2. Used in the spurious interrupt detector: a few systems with misdescribed IRQ
> routing can cause an interrupt to be handled on the wrong CPU. In this situation,
> the spurious interrupt detector searches for a recovery handler for the interrupt.
> If the found handler is running on another CPU, the spurious interrupt detector
> also defers the execution of the recovery handler, similar to case 3 in #1.
> 
> I hope this is helpful.

Stunning! I tried to pillage some of this for the commit message, but I
think it was too much detail for it - it probably belongs in a doc or code
comment somewhere though. Useful!

> > If you want I'm happy to make these tweaks and post as V2?
> > There are also some other comments I'm keen to add to the flag enums to
> > make it a bit clearer what some of the flags mean.
> 
> Don't see why not :)

Done, V2: https://lore.kernel.org/lkml/20230530213848.3273006-2-jgowans@amazon.com/

JG

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2023-05-30 21:48 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-17  9:53 [PATCH] irq: fasteoi handler re-runs on concurrent invoke James Gowans
2023-03-17 10:12 ` Yipeng Zou
2023-03-17 11:49   ` Gowans, James
2023-03-22  6:26     ` Yipeng Zou
2023-03-22  7:48       ` Gowans, James
2023-03-22 10:37         ` Thomas Gleixner
2023-04-03 13:17           ` zhangjianhua (E)
2023-04-03 13:19             ` Marc Zyngier
2023-04-03 13:39               ` Gowans, James
2023-03-22 10:38         ` Yipeng Zou
2023-04-09 11:41 ` Marc Zyngier
2023-04-11 10:27   ` Gowans, James
2023-04-12 13:32     ` Marc Zyngier
2023-04-18  2:39       ` Yipeng Zou
2023-04-18 10:56       ` Gowans, James
2023-04-19  3:08         ` Yipeng Zou
2023-05-02  8:43         ` Gowans, James
2023-05-23  3:16           ` liaochang (A)
2023-05-25 10:04             ` Gowans, James
2023-05-29  2:47               ` Liao, Chang
2023-05-30 21:47                 ` Gowans, James
     [not found]           ` <86sfcfghqh.wl-maz@kernel.org>
2023-05-23 12:47             ` Gowans, James
2023-05-25 12:31               ` Liao, Chang
2023-05-02 10:28         ` Marc Zyngier
2023-05-23  3:16       ` liaochang (A)
2023-05-23  3:15 ` liaochang (A)
2023-05-23 11:59   ` Gowans, James
2023-05-25 12:31     ` Liao, Chang
