linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] arm64: Allow the rescheduling IPI to bypass irq_enter/exit
@ 2020-11-01 13:14 Marc Zyngier
  2020-11-01 13:14 ` [PATCH 1/2] genirq: Allow an interrupt to be marked as 'naked' Marc Zyngier
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Marc Zyngier @ 2020-11-01 13:14 UTC (permalink / raw)
  To: LAK, linux-kernel
  Cc: Will Deacon, Catalin Marinas, Thomas Gleixner,
	Valentin Schneider, Peter Zijlstra, Android Kernel Team

Vincent recently reported [1] that 5.10-rc1 showed a significant
regression when running "perf bench sched pipe" on arm64, and
pinpointed it to the recent move to handling IPIs as normal
interrupts.

The culprit is the use of irq_enter/irq_exit around the handling of
the rescheduling IPI, meaning that we enter the scheduler right after
the handling of the IPI instead of deferring it to the next preemption
event. This accounts for most of the overhead introduced.

On architectures that have architected IPIs at the CPU level (x86
being the obvious one), the absence of irq_enter/exit is natural. ARM
(both 32 and 64bits) mimicked this behaviour by having some
arch-specific handling for the interrupts that are used to implement
IPIs. Moving IPIs on the normal interrupt path introduced the
regression.

This couple of patches try to acknowledge the fact that some IPIs are
"special", in the sense that they don't need to follow the standard
interrupt handling flow.

The good news is that it cures the regression on arm64, and could
be similarly beneficial to both 32bit ARM, MIPS, or any other
architecture that uses a unique IRQ to represent the scheduler IPI.

The bad news is that these patches are ugly as sin, and I really don't
like them. I specially hate that they can give driver authors the idea
that they can make random interrupts "faster".

Comments, suggestions and hate mails appreciated, as always.

	M.

[1] https://lore.kernel.org/r/CAKfTPtDjPpri5Gt6kLeFp_B_zJUZ5DYXEqtJ+0VKohU-y9bFEQ@mail.gmail.com

Marc Zyngier (2):
  genirq: Allow an interrupt to be marked as 'naked'
  arm64: Mark the recheduling IPI as naked interrupt

 arch/arm64/kernel/smp.c |  4 ++++
 include/linux/irq.h     |  4 +++-
 kernel/irq/debugfs.c    |  1 +
 kernel/irq/irqdesc.c    | 17 ++++++++++++-----
 kernel/irq/settings.h   |  7 +++++++
 5 files changed, 27 insertions(+), 6 deletions(-)

-- 
2.28.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/2] genirq: Allow an interrupt to be marked as 'naked'
  2020-11-01 13:14 [PATCH 0/2] arm64: Allow the rescheduling IPI to bypass irq_enter/exit Marc Zyngier
@ 2020-11-01 13:14 ` Marc Zyngier
  2020-11-01 14:33   ` David Laight
  2020-11-01 13:14 ` [PATCH 2/2] arm64: Mark the recheduling IPI as naked interrupt Marc Zyngier
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Marc Zyngier @ 2020-11-01 13:14 UTC (permalink / raw)
  To: LAK, linux-kernel
  Cc: Will Deacon, Catalin Marinas, Thomas Gleixner,
	Valentin Schneider, Peter Zijlstra, Android Kernel Team

Some interrupts (such as the rescheduling IPI) rely on not going through
the irq_enter()/irq_exit() calls. To distinguish such interrupts, add
a new IRQ flag that allows the low-level handling code to sidestep the
enter()/exit() calls.

Only the architecture code is expected to use this. It will do the wrong
thing on normal interrupts.

Signed-off-by: Marc Zyngier <maz@kernel.org>
---
 include/linux/irq.h   |  4 +++-
 kernel/irq/debugfs.c  |  1 +
 kernel/irq/irqdesc.c  | 17 ++++++++++++-----
 kernel/irq/settings.h |  7 +++++++
 4 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/include/linux/irq.h b/include/linux/irq.h
index c54365309e97..af5ba7336925 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -72,6 +72,7 @@ enum irqchip_irq_state;
  *				  mechanism and from core side polling.
  * IRQ_DISABLE_UNLAZY		- Disable lazy irq disable
  * IRQ_HIDDEN			- Don't show up in /proc/interrupts
+ * IRQ_NAKED			- Bypass irq_enter()/irq_exit()
  */
 enum {
 	IRQ_TYPE_NONE		= 0x00000000,
@@ -99,13 +100,14 @@ enum {
 	IRQ_IS_POLLED		= (1 << 18),
 	IRQ_DISABLE_UNLAZY	= (1 << 19),
 	IRQ_HIDDEN		= (1 << 20),
+	IRQ_NAKED		= (1 << 21),
 };
 
 #define IRQF_MODIFY_MASK	\
 	(IRQ_TYPE_SENSE_MASK | IRQ_NOPROBE | IRQ_NOREQUEST | \
 	 IRQ_NOAUTOEN | IRQ_MOVE_PCNTXT | IRQ_LEVEL | IRQ_NO_BALANCING | \
 	 IRQ_PER_CPU | IRQ_NESTED_THREAD | IRQ_NOTHREAD | IRQ_PER_CPU_DEVID | \
-	 IRQ_IS_POLLED | IRQ_DISABLE_UNLAZY | IRQ_HIDDEN)
+	 IRQ_IS_POLLED | IRQ_DISABLE_UNLAZY | IRQ_HIDDEN | IRQ_NAKED)
 
 #define IRQ_NO_BALANCING_MASK	(IRQ_PER_CPU | IRQ_NO_BALANCING)
 
diff --git a/kernel/irq/debugfs.c b/kernel/irq/debugfs.c
index e4cff358b437..e031d6afc0f8 100644
--- a/kernel/irq/debugfs.c
+++ b/kernel/irq/debugfs.c
@@ -140,6 +140,7 @@ static const struct irq_bit_descr irqdesc_states[] = {
 	BIT_MASK_DESCR(_IRQ_IS_POLLED),
 	BIT_MASK_DESCR(_IRQ_DISABLE_UNLAZY),
 	BIT_MASK_DESCR(_IRQ_HIDDEN),
+	BIT_MASK_DESCR(_IRQ_NAKED),
 };
 
 static const struct irq_bit_descr irqdesc_istates[] = {
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 1a7723604399..c08a1c19d061 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -667,10 +667,9 @@ int __handle_domain_irq(struct irq_domain *domain, unsigned int hwirq,
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
 	unsigned int irq = hwirq;
+	struct irq_desc *desc;
 	int ret = 0;
 
-	irq_enter();
-
 #ifdef CONFIG_IRQ_DOMAIN
 	if (lookup)
 		irq = irq_find_mapping(domain, hwirq);
@@ -680,14 +679,22 @@ int __handle_domain_irq(struct irq_domain *domain, unsigned int hwirq,
 	 * Some hardware gives randomly wrong interrupts.  Rather
 	 * than crashing, do something sensible.
 	 */
-	if (unlikely(!irq || irq >= nr_irqs)) {
+	desc = irq_to_desc(irq);
+	if (unlikely(!desc || irq >= nr_irqs)) {
 		ack_bad_irq(irq);
 		ret = -EINVAL;
+		goto out;
+	}
+
+	if (unlikely(irq_settings_is_naked(desc))) {
+		generic_handle_irq_desc(desc);
 	} else {
-		generic_handle_irq(irq);
+		irq_enter();
+		generic_handle_irq_desc(desc);
+		irq_exit();
 	}
 
-	irq_exit();
+out:
 	set_irq_regs(old_regs);
 	return ret;
 }
diff --git a/kernel/irq/settings.h b/kernel/irq/settings.h
index 403378b9947b..587e67f9c302 100644
--- a/kernel/irq/settings.h
+++ b/kernel/irq/settings.h
@@ -18,6 +18,7 @@ enum {
 	_IRQ_IS_POLLED		= IRQ_IS_POLLED,
 	_IRQ_DISABLE_UNLAZY	= IRQ_DISABLE_UNLAZY,
 	_IRQ_HIDDEN		= IRQ_HIDDEN,
+	_IRQ_NAKED		= IRQ_NAKED,
 	_IRQF_MODIFY_MASK	= IRQF_MODIFY_MASK,
 };
 
@@ -33,6 +34,7 @@ enum {
 #define IRQ_IS_POLLED		GOT_YOU_MORON
 #define IRQ_DISABLE_UNLAZY	GOT_YOU_MORON
 #define IRQ_HIDDEN		GOT_YOU_MORON
+#define IRQ_NAKED		GOT_YOU_MORON
 #undef IRQF_MODIFY_MASK
 #define IRQF_MODIFY_MASK	GOT_YOU_MORON
 
@@ -174,3 +176,8 @@ static inline bool irq_settings_is_hidden(struct irq_desc *desc)
 {
 	return desc->status_use_accessors & _IRQ_HIDDEN;
 }
+
+static inline bool irq_settings_is_naked(struct irq_desc *desc)
+{
+	return desc->status_use_accessors & _IRQ_NAKED;
+}
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/2] arm64: Mark the recheduling IPI as naked interrupt
  2020-11-01 13:14 [PATCH 0/2] arm64: Allow the rescheduling IPI to bypass irq_enter/exit Marc Zyngier
  2020-11-01 13:14 ` [PATCH 1/2] genirq: Allow an interrupt to be marked as 'naked' Marc Zyngier
@ 2020-11-01 13:14 ` Marc Zyngier
  2020-11-01 14:30 ` [PATCH 0/2] arm64: Allow the rescheduling IPI to bypass irq_enter/exit David Laight
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Marc Zyngier @ 2020-11-01 13:14 UTC (permalink / raw)
  To: LAK, linux-kernel
  Cc: Will Deacon, Catalin Marinas, Thomas Gleixner,
	Valentin Schneider, Peter Zijlstra, Android Kernel Team

Flag the rescheduling IPI as 'naked', making sure such interrupt
doesn't trigger a rescheduling event by itself.

Signed-off-by: Marc Zyngier <maz@kernel.org>
---
 arch/arm64/kernel/smp.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 82e75fc2c903..6c11be3e74d3 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -993,6 +993,10 @@ void __init set_smp_ipi_range(int ipi_base, int n)
 
 		ipi_desc[i] = irq_to_desc(ipi_base + i);
 		irq_set_status_flags(ipi_base + i, IRQ_HIDDEN);
+
+		/* The recheduling IPI is special... */
+		if (i == IPI_RESCHEDULE)
+			irq_set_status_flags(ipi_base + i, IRQ_NAKED);
 	}
 
 	ipi_irq_base = ipi_base;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* RE: [PATCH 0/2] arm64: Allow the rescheduling IPI to bypass irq_enter/exit
  2020-11-01 13:14 [PATCH 0/2] arm64: Allow the rescheduling IPI to bypass irq_enter/exit Marc Zyngier
  2020-11-01 13:14 ` [PATCH 1/2] genirq: Allow an interrupt to be marked as 'naked' Marc Zyngier
  2020-11-01 13:14 ` [PATCH 2/2] arm64: Mark the recheduling IPI as naked interrupt Marc Zyngier
@ 2020-11-01 14:30 ` David Laight
  2020-11-02 10:30 ` Valentin Schneider
  2020-11-03 20:32 ` Thomas Gleixner
  4 siblings, 0 replies; 12+ messages in thread
From: David Laight @ 2020-11-01 14:30 UTC (permalink / raw)
  To: 'Marc Zyngier', LAK, linux-kernel
  Cc: Will Deacon, Catalin Marinas, Thomas Gleixner,
	Valentin Schneider, Peter Zijlstra, Android Kernel Team

From: Marc Zyngier
> Sent: 01 November 2020 13:14
> 
> Vincent recently reported [1] that 5.10-rc1 showed a significant
> regression when running "perf bench sched pipe" on arm64, and
> pinpointed it to the recent move to handling IPIs as normal
> interrupts.
> 
> The culprit is the use of irq_enter/irq_exit around the handling of
> the rescheduling IPI, meaning that we enter the scheduler right after
> the handling of the IPI instead of deferring it to the next preemption
> event. This accounts for most of the overhead introduced.
> 
> On architectures that have architected IPIs at the CPU level (x86
> being the obvious one), the absence of irq_enter/exit is natural. ARM
> (both 32 and 64bits) mimicked this behaviour by having some
> arch-specific handling for the interrupts that are used to implement
> IPIs. Moving IPIs on the normal interrupt path introduced the
> regression.
> 
> This couple of patches try to acknowledge the fact that some IPIs are
> "special", in the sense that they don't need to follow the standard
> interrupt handling flow.
> 
> The good news is that it cures the regression on arm64, and could
> be similarly beneficial to both 32bit ARM, MIPS, or any other
> architecture that uses a unique IRQ to represent the scheduler IPI.
> 
> The bad news is that these patches are ugly as sin, and I really don't
> like them. I specially hate that they can give driver authors the idea
> that they can make random interrupts "faster".
> 
> Comments, suggestions and hate mails appreciated, as always.

One problem is that the only explanation is in the cover note.
Nothing in the patches even gives a hint as to what difference
the 'NAKED' flag makes nor why its is applied to one interrupt.

Which all means someone might revert it all as part of a cleanup.
A couple of comments is probably all it needs.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH 1/2] genirq: Allow an interrupt to be marked as 'naked'
  2020-11-01 13:14 ` [PATCH 1/2] genirq: Allow an interrupt to be marked as 'naked' Marc Zyngier
@ 2020-11-01 14:33   ` David Laight
  0 siblings, 0 replies; 12+ messages in thread
From: David Laight @ 2020-11-01 14:33 UTC (permalink / raw)
  To: 'Marc Zyngier', LAK, linux-kernel
  Cc: Will Deacon, Catalin Marinas, Thomas Gleixner,
	Valentin Schneider, Peter Zijlstra, Android Kernel Team

From: Marc Zyngier
> Sent: 01 November 2020 13:14
> 
> Some interrupts (such as the rescheduling IPI) rely on not going through
> the irq_enter()/irq_exit() calls. To distinguish such interrupts, add
> a new IRQ flag that allows the low-level handling code to sidestep the
> enter()/exit() calls.
> 
> Only the architecture code is expected to use this. It will do the wrong
> thing on normal interrupts.
> 
> Signed-off-by: Marc Zyngier <maz@kernel.org>
> ---
>  include/linux/irq.h   |  4 +++-
>  kernel/irq/debugfs.c  |  1 +
>  kernel/irq/irqdesc.c  | 17 ++++++++++++-----
>  kernel/irq/settings.h |  7 +++++++
>  4 files changed, 23 insertions(+), 6 deletions(-)
> 
...
> diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
> index 1a7723604399..c08a1c19d061 100644
> --- a/kernel/irq/irqdesc.c
> +++ b/kernel/irq/irqdesc.c
> @@ -667,10 +667,9 @@ int __handle_domain_irq(struct irq_domain *domain, unsigned int hwirq,
>  {
>  	struct pt_regs *old_regs = set_irq_regs(regs);
>  	unsigned int irq = hwirq;
> +	struct irq_desc *desc;
>  	int ret = 0;
> 
> -	irq_enter();
> -
>  #ifdef CONFIG_IRQ_DOMAIN
>  	if (lookup)
>  		irq = irq_find_mapping(domain, hwirq);
> @@ -680,14 +679,22 @@ int __handle_domain_irq(struct irq_domain *domain, unsigned int hwirq,
>  	 * Some hardware gives randomly wrong interrupts.  Rather
>  	 * than crashing, do something sensible.
>  	 */
> -	if (unlikely(!irq || irq >= nr_irqs)) {
> +	desc = irq_to_desc(irq);
> +	if (unlikely(!desc || irq >= nr_irqs)) {

Should those tests be in the other order?
Probably as:
	if (unlikely(irq >= nr_irqs || !(desc = irq_to_desc(irq))) {
		...

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] arm64: Allow the rescheduling IPI to bypass irq_enter/exit
  2020-11-01 13:14 [PATCH 0/2] arm64: Allow the rescheduling IPI to bypass irq_enter/exit Marc Zyngier
                   ` (2 preceding siblings ...)
  2020-11-01 14:30 ` [PATCH 0/2] arm64: Allow the rescheduling IPI to bypass irq_enter/exit David Laight
@ 2020-11-02 10:30 ` Valentin Schneider
  2020-11-10 13:03   ` Peter Zijlstra
  2020-11-03 20:32 ` Thomas Gleixner
  4 siblings, 1 reply; 12+ messages in thread
From: Valentin Schneider @ 2020-11-02 10:30 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: LAK, linux-kernel, Will Deacon, Catalin Marinas, Thomas Gleixner,
	Peter Zijlstra, Android Kernel Team


Hi Marc,

On 01/11/20 13:14, Marc Zyngier wrote:
> Vincent recently reported [1] that 5.10-rc1 showed a significant
> regression when running "perf bench sched pipe" on arm64, and
> pinpointed it to the recent move to handling IPIs as normal
> interrupts.
>
> The culprit is the use of irq_enter/irq_exit around the handling of
> the rescheduling IPI, meaning that we enter the scheduler right after
> the handling of the IPI instead of deferring it to the next preemption
> event. This accounts for most of the overhead introduced.
>
> On architectures that have architected IPIs at the CPU level (x86
> being the obvious one), the absence of irq_enter/exit is natural. ARM
> (both 32 and 64bits) mimicked this behaviour by having some
> arch-specific handling for the interrupts that are used to implement
> IPIs. Moving IPIs on the normal interrupt path introduced the
> regression.
>
> This couple of patches try to acknowledge the fact that some IPIs are
> "special", in the sense that they don't need to follow the standard
> interrupt handling flow.
>
> The good news is that it cures the regression on arm64, and could
> be similarly beneficial to both 32bit ARM, MIPS, or any other
> architecture that uses a unique IRQ to represent the scheduler IPI.
>
> The bad news is that these patches are ugly as sin, and I really don't
> like them. I specially hate that they can give driver authors the idea
> that they can make random interrupts "faster".
>
> Comments, suggestions and hate mails appreciated, as always.

So since

  90b5363acd47 ("sched: Clean up scheduler_ipi()")
  https://lore.kernel.org/lkml/20200505134058.361859938@linutronix.de/

scheduler_ipi() is indeed pretty lean, though I wonder if moving SGI
handling to genirq didn't just bring us back to before it went on a diet
(< ~5.8).

I don't really have a much better idea, sadly. Perhaps a variant of
handle_domain_irq() that doesn't issue the irq_{enter, exit}() pair we
would call within the GIC driver itself, but then:
- we're back to adding special SGI sauce to the GIC drivers
- it does prevent random drivers from "accelerating" their IRQs, though
  truly invested speed addicts will go and hack the GIC driver anyway :(


Now, I'd like to pen exactly why we think it's okay to forgo irq_{enter,
exit}() for that one IRQ and not any other.

AIUI proper irq_{enter,exit}() signalling is required by RCU, but AFAICT
there is no read side section down this specific path
(CONFIG_RCU_EQS_DEBUG=y may or may not agree with me).

Regarding CONFIG_IRQ_TIME_ACCOUNTING=y, I would tend to say that given this
is genuinely happening within an IRQ, it should be accounted as such. The
other side of the coin is that the actual work is nigh-insignificant, so
one may argue about ignoring it. IMO this is somewhat borderline on the
correctness side of things :/

>       M.
>
> [1] https://lore.kernel.org/r/CAKfTPtDjPpri5Gt6kLeFp_B_zJUZ5DYXEqtJ+0VKohU-y9bFEQ@mail.gmail.com
>
> Marc Zyngier (2):
>   genirq: Allow an interrupt to be marked as 'naked'
>   arm64: Mark the recheduling IPI as naked interrupt
>
>  arch/arm64/kernel/smp.c |  4 ++++
>  include/linux/irq.h     |  4 +++-
>  kernel/irq/debugfs.c    |  1 +
>  kernel/irq/irqdesc.c    | 17 ++++++++++++-----
>  kernel/irq/settings.h   |  7 +++++++
>  5 files changed, 27 insertions(+), 6 deletions(-)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] arm64: Allow the rescheduling IPI to bypass irq_enter/exit
  2020-11-01 13:14 [PATCH 0/2] arm64: Allow the rescheduling IPI to bypass irq_enter/exit Marc Zyngier
                   ` (3 preceding siblings ...)
  2020-11-02 10:30 ` Valentin Schneider
@ 2020-11-03 20:32 ` Thomas Gleixner
  2020-11-20  9:20   ` Marc Zyngier
  4 siblings, 1 reply; 12+ messages in thread
From: Thomas Gleixner @ 2020-11-03 20:32 UTC (permalink / raw)
  To: Marc Zyngier, LAK, linux-kernel
  Cc: Will Deacon, Catalin Marinas, Valentin Schneider, Peter Zijlstra,
	Android Kernel Team

On Sun, Nov 01 2020 at 13:14, Marc Zyngier wrote:
> Vincent recently reported [1] that 5.10-rc1 showed a significant
> regression when running "perf bench sched pipe" on arm64, and
> pinpointed it to the recent move to handling IPIs as normal
> interrupts.
>
> The culprit is the use of irq_enter/irq_exit around the handling of
> the rescheduling IPI, meaning that we enter the scheduler right after
> the handling of the IPI instead of deferring it to the next preemption
> event. This accounts for most of the overhead introduced.

irq_enter()/exit() does not end up in the scheduler. If it does then
please show the call chain.

Scheduling happens when the IPI returns just before returning into the
low level code (or on ARM in the low level code) when NEED_RESCHED is
set (which is usually the case when the IPI is sent) and:

  the IPI hit user space

or

  the IPI hit in preemptible kernel context and CONFIG_PREEMPT[_RT] is
  enabled.

Not doing so would be a bug. So I really don't understand your reasoning
here.

> On architectures that have architected IPIs at the CPU level (x86
> being the obvious one), the absence of irq_enter/exit is natural.

It's not really architected IPIs. We reserve the top 20ish vectors on
each CPU for IPIs and other per processor interrupts, e.g. the per cpu
timer.

Now lets look at what x86 does:

Interrupts and regular IPIs (function call ....) do

    irqentry_enter()   <- handles rcu_irq_enter() or context tracking
      ...
      irq_enter_rcu()
      ...
      irq_exit_rcu()
    irqentry_exit()     <- handles need_resched()

The scheduler IPI does:

    irqentry_enter()   <- handles rcu_irq_enter() or context tracking
      ...
      __irq_enter_raw()
      ...
      __irq_exit_raw()
    irqentry_exit()     <- handles need_resched()

So we don't invoke irq_enter_rcu() on enter and on exit we skip
irq_exit_rcu(). That's fine because

  - Calling the tick management is pointless because this is going to
    schedule anyway or something consumed the need_resched already.

  - The irqtime accounting is silly because it covers only the call and
    returns. The time spent in the accounting is more than the time
    we are accounting (empty call).

So what your hack fails to invoke is rcu_irq_enter()/exit() in case that
the IPI hits the idle task in an RCU off region. You also fail to tell
lockdep. No cookies!

> The bad news is that these patches are ugly as sin, and I really don't
> like them.

Yes, they are ugly and the extra conditional in the irq handling path is
not pretty either.

> I specially hate that they can give driver authors the idea that they
> can make random interrupts "faster".

Just split the guts of irq_modify_status() into __irq_modify_status()
and call that from irq_modify_status().

Reject IRQ_HIDDEN (which should have been IRQ_IPI really) and IRQ_NAKED
(IRQ_RAW perhaps) in irq_modify_status().

Do not export __irq_modify_status() so it can be only used from built-in
code which takes it away from driver writers.

We probably should add a few of the other IRQ_ bits to the reject mask
for completness sake, but that's not urgent.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] arm64: Allow the rescheduling IPI to bypass irq_enter/exit
  2020-11-02 10:30 ` Valentin Schneider
@ 2020-11-10 13:03   ` Peter Zijlstra
       [not found]     ` <19286daf276f46aa@fake-msgid>
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2020-11-10 13:03 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Marc Zyngier, LAK, linux-kernel, Will Deacon, Catalin Marinas,
	Thomas Gleixner, Android Kernel Team

On Mon, Nov 02, 2020 at 10:30:50AM +0000, Valentin Schneider wrote:

> Now, I'd like to pen exactly why we think it's okay to forgo irq_{enter,
> exit}() for that one IRQ and not any other.

Thomas already said a few words on this, but basically scheduler_ipi()
is a NOP (*almost*), the IPI has no body. All it does is tickle the
return-from-interrupt path. So any setup and tear-down done for the
non-existing body is a waste of time.



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] arm64: Allow the rescheduling IPI to bypass irq_enter/exit
       [not found]     ` <19286daf276f46aa@fake-msgid>
@ 2020-11-10 15:48       ` Valentin Schneider
  0 siblings, 0 replies; 12+ messages in thread
From: Valentin Schneider @ 2020-11-10 15:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Marc Zyngier, LAK, linux-kernel, Will Deacon, Catalin Marinas,
	Thomas Gleixner, Android Kernel Team


On 01/01/70 01:00, Valentin Schneider wrote:
> On 10/11/20 13:03, Peter Zijlstra wrote:
>> On Mon, Nov 02, 2020 at 10:30:50AM +0000, Valentin Schneider wrote:
>>
>>> Now, I'd like to pen exactly why we think it's okay to forgo irq_{enter,
>>> exit}() for that one IRQ and not any other.
>>
>> Thomas already said a few words on this, but basically scheduler_ipi()
>> is a NOP (*almost*), the IPI has no body. All it does is tickle the
>> return-from-interrupt path. So any setup and tear-down done for the
>> non-existing body is a waste of time.

Gotcha.

The pedant in me thinks this makes it more of a handler property than an
IRQ one, but I don't see a nice way to e.g. have this as a flag passed to
__request_percpu_irq() and not have it usable by random modules.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] arm64: Allow the rescheduling IPI to bypass irq_enter/exit
  2020-11-03 20:32 ` Thomas Gleixner
@ 2020-11-20  9:20   ` Marc Zyngier
  2020-11-20 14:17     ` Thomas Gleixner
  0 siblings, 1 reply; 12+ messages in thread
From: Marc Zyngier @ 2020-11-20  9:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LAK, linux-kernel, Will Deacon, Catalin Marinas,
	Valentin Schneider, Peter Zijlstra, Android Kernel Team,
	mark.rutland

[+ Mark who has been hacking in the same area lately]

Hi Thomas,

On 2020-11-03 20:32, Thomas Gleixner wrote:
> On Sun, Nov 01 2020 at 13:14, Marc Zyngier wrote:
>> Vincent recently reported [1] that 5.10-rc1 showed a significant
>> regression when running "perf bench sched pipe" on arm64, and
>> pinpointed it to the recent move to handling IPIs as normal
>> interrupts.
>> 
>> The culprit is the use of irq_enter/irq_exit around the handling of
>> the rescheduling IPI, meaning that we enter the scheduler right after
>> the handling of the IPI instead of deferring it to the next preemption
>> event. This accounts for most of the overhead introduced.
> 
> irq_enter()/exit() does not end up in the scheduler. If it does then
> please show the call chain.
> 
> Scheduling happens when the IPI returns just before returning into the
> low level code (or on ARM in the low level code) when NEED_RESCHED is
> set (which is usually the case when the IPI is sent) and:
> 
>   the IPI hit user space
> 
> or
> 
>   the IPI hit in preemptible kernel context and CONFIG_PREEMPT[_RT] is
>   enabled.
> 
> Not doing so would be a bug. So I really don't understand your 
> reasoning
> here.

You are of course right. I somehow associated the overhead of the 
resched
IPI with the scheduler itself. I stand corrected.

> 
>> On architectures that have architected IPIs at the CPU level (x86
>> being the obvious one), the absence of irq_enter/exit is natural.
> 
> It's not really architected IPIs. We reserve the top 20ish vectors on
> each CPU for IPIs and other per processor interrupts, e.g. the per cpu
> timer.
> 
> Now lets look at what x86 does:
> 
> Interrupts and regular IPIs (function call ....) do
> 
>     irqentry_enter()   <- handles rcu_irq_enter() or context tracking
>       ...
>       irq_enter_rcu()
>       ...
>       irq_exit_rcu()
>     irqentry_exit()     <- handles need_resched()
> 
> The scheduler IPI does:
> 
>     irqentry_enter()   <- handles rcu_irq_enter() or context tracking
>       ...
>       __irq_enter_raw()
>       ...
>       __irq_exit_raw()
>     irqentry_exit()     <- handles need_resched()
> 
> So we don't invoke irq_enter_rcu() on enter and on exit we skip
> irq_exit_rcu(). That's fine because
> 
>   - Calling the tick management is pointless because this is going to
>     schedule anyway or something consumed the need_resched already.
> 
>   - The irqtime accounting is silly because it covers only the call and
>     returns. The time spent in the accounting is more than the time
>     we are accounting (empty call).
> 
> So what your hack fails to invoke is rcu_irq_enter()/exit() in case 
> that
> the IPI hits the idle task in an RCU off region. You also fail to tell
> lockdep. No cookies!

Who needs cookies when you can have cheese? ;-)

More seriously, it seems to me that we have a bit of a 
cross-architecture
disconnect here. I have been trying to join the dots between what you
describe above, and the behaviour of arm64 (and probably a large number
of the non-x86 architectures), and I feel massively confused.

Up to 5.9, our handling of the rescheduling IPI was "do as little as
possible": decode the interrupt from the lowest possible level (the
root irqchip), call into arch code, end-up in scheduler_ipi(), the end.

No lockdep, no RCU, no nothing.

What changed? Have we missed some radical change in the way the core
kernel expects the arch code to do thing? I'm aware of the 
kernel/entry/common.c stuff, which implements most of the goodies you
mention, but this effectively is x86-only at the moment.

If arm64 has forever been broken, I'd really like to know and fix it.

>> The bad news is that these patches are ugly as sin, and I really don't
>> like them.
> 
> Yes, they are ugly and the extra conditional in the irq handling path 
> is
> not pretty either.
> 
>> I specially hate that they can give driver authors the idea that they
>> can make random interrupts "faster".
> 
> Just split the guts of irq_modify_status() into __irq_modify_status()
> and call that from irq_modify_status().
> 
> Reject IRQ_HIDDEN (which should have been IRQ_IPI really) and IRQ_NAKED
> (IRQ_RAW perhaps) in irq_modify_status().
> 
> Do not export __irq_modify_status() so it can be only used from 
> built-in
> code which takes it away from driver writers.

Yup, done.

Thanks,

         M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] arm64: Allow the rescheduling IPI to bypass irq_enter/exit
  2020-11-20  9:20   ` Marc Zyngier
@ 2020-11-20 14:17     ` Thomas Gleixner
  2020-11-22 16:13       ` Marc Zyngier
  0 siblings, 1 reply; 12+ messages in thread
From: Thomas Gleixner @ 2020-11-20 14:17 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: LAK, linux-kernel, Will Deacon, Catalin Marinas,
	Valentin Schneider, Peter Zijlstra, Android Kernel Team,
	mark.rutland

Marc,

On Fri, Nov 20 2020 at 09:20, Marc Zyngier wrote:
> On 2020-11-03 20:32, Thomas Gleixner wrote:
>> On Sun, Nov 01 2020 at 13:14, Marc Zyngier wrote:
>

> More seriously, it seems to me that we have a bit of a
> cross-architecture disconnect here. I have been trying to join the
> dots between what you describe above, and the behaviour of arm64 (and
> probably a large number of the non-x86 architectures), and I feel
> massively confused.
>
> Up to 5.9, our handling of the rescheduling IPI was "do as little as
> possible": decode the interrupt from the lowest possible level (the
> root irqchip), call into arch code, end-up in scheduler_ipi(), the end.
>
> No lockdep, no RCU, no nothing.
>
> What changed? Have we missed some radical change in the way the core
> kernel expects the arch code to do thing? I'm aware of the 
> kernel/entry/common.c stuff, which implements most of the goodies you
> mention, but this effectively is x86-only at the moment.
>
> If arm64 has forever been broken, I'd really like to know and fix it.

So with the pre 5.8 scheduler IPI we had scheduler_ipi() doing wonderful
things

scheduler_ipi()
       ...
       if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
               return;
       irq_enter();
       ...
       irq_exit();

When Peter and I started to investigate the correctness and robustness
of syscalls, interrupts and exceptions versus RCU, instrumentation,
breakpoints for live patching etc., we stumbled over this and a lot of
other things.

Especially instrumentation had grown warts over time to deal with the
problem that it can hit into a RCU idle section. That started to get
worse which made us look deeper and the decision was to have strict
non-instrumentable sections which call out into instrumentable sections
_after_ establishing state. That moved all of the context tracking, RCU,
tracing muck out into C code which is what we have in kernel/entry/ now.

Back to the scheduler IPI. 5.8 made the scheduler IPI an almost empty
function which just folds need resched on architectures which need that.

Of course when I wrote the last reply I had the current x86
implementation in mind which does:

DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_reschedule_ipi)
{
        ack_APIC_irq();
        trace_reschedule_entry(RESCHEDULE_VECTOR);
        inc_irq_stat(irq_resched_count);
        scheduler_ipi();
        trace_reschedule_exit(RESCHEDULE_VECTOR);
}

ack_APIC_irq() can be traced and the tracepoints of course want a proper
established context too.

Now you surely could do:

ASM-IPI-enter:
       asm_ack_irq();
       asm_fold_need_resched();
       reti();

But that's not enough, because the point of the scheduler_ipi is
unsurprisingly to reevaluate need_resched() and to schedule. So you get:

ASM-IPI-enter:
       asm_ack_irq();
       asm_fold_need_resched()
       if (!asm_need_resched())
            reti();

       /* context tracking, rcu, .... */
       handle_enter_from_user_or_kernel();

       preempt_schedule_irq();

       /* context tracking, rcu, .... */
       handle_exit_to_user_or_kernel();
       reti();

We did not do that because we have the tracepoints there and I couldn't
find a reason why having yet another slightly different codepath to get
straight vs. context tracking and RCU was desirable. So we ended up
with:

ASM-IPI-enter:
  asm_enter();

    irqentry_enter() {
      handle_enter_from_user_or_kernel();

      /* Start of safe and instrumentable section */
    }

    __irq_enter_raw();

    sysvec_scheduler_ipi();

    __irq_exit_raw();

    irqentry_exit() {
      if (need_resched())
        preempt_schedule_irq();

      /* End of safe and instrumentable section */

      handle_exit_to_user_or_kernel();
    }

  asm_exit();

The only difference to other IPIs is that it avoids the accounting
overhead because accounting the almost empty scheduler_ipi() function is
pretty much pointless.

Hope that clarifies it.

> If arm64 has forever been broken, I'd really like to know and fix it.

Define broken. It kinda works with all the warts and bolts in tracing,
context tracking and other places. Is it entirely correct? Probably not.

The scheduler IPI is trivial compared to the more interesting ones like
NMI, MCE, breakpoint, debug exceptions etc. We found quite some
interesting issues with all of that muck during our 'noinstr' chase.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] arm64: Allow the rescheduling IPI to bypass irq_enter/exit
  2020-11-20 14:17     ` Thomas Gleixner
@ 2020-11-22 16:13       ` Marc Zyngier
  0 siblings, 0 replies; 12+ messages in thread
From: Marc Zyngier @ 2020-11-22 16:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LAK, linux-kernel, Will Deacon, Catalin Marinas,
	Valentin Schneider, Peter Zijlstra, Android Kernel Team,
	mark.rutland

On Fri, 20 Nov 2020 14:17:12 +0000,
Thomas Gleixner <tglx@linutronix.de> wrote:

Thomas,

> So with the pre 5.8 scheduler IPI we had scheduler_ipi() doing wonderful
> things

[... tons of interesting and helpful stuff ...]

> Hope that clarifies it.

It does in a way, as it maps some of the low-level x86 things onto
generic concepts. It certainly outlines that we have a lot of work
ahead of us, none of which can be expected to be a quick fix. It
requires restructuring a lot of the low-level arm64 code (Mark has
been toying with it for a couple of years now), and rejig it in funny
ways to match the expectations of the core code.

I haven't tried to think about 32bit ARM just yet, and we may have to
sever the IRQ ties that exist between the two architectures if we want
to make forward progress in a reasonable time frame. In general, it
looks that we'll need a new interface from the generic low-level IRQ
entry code into this new common framework.

> > If arm64 has forever been broken, I'd really like to know and fix it.
> 
> Define broken. It kinda works with all the warts and bolts in tracing,
> context tracking and other places. Is it entirely correct? Probably
> not.
>
> The scheduler IPI is trivial compared to the more interesting ones like
> NMI, MCE, breakpoint, debug exceptions etc. We found quite some
> interesting issues with all of that muck during our 'noinstr' chase.

Yup, point taken. However, given the complexity of the above and the
fact that we currently have a measurable performance regression in
5.10, I'll respin the original series with the cleanups you mentioned,
and the added note that we *really* need to sort this.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-11-22 16:13 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-01 13:14 [PATCH 0/2] arm64: Allow the rescheduling IPI to bypass irq_enter/exit Marc Zyngier
2020-11-01 13:14 ` [PATCH 1/2] genirq: Allow an interrupt to be marked as 'naked' Marc Zyngier
2020-11-01 14:33   ` David Laight
2020-11-01 13:14 ` [PATCH 2/2] arm64: Mark the recheduling IPI as naked interrupt Marc Zyngier
2020-11-01 14:30 ` [PATCH 0/2] arm64: Allow the rescheduling IPI to bypass irq_enter/exit David Laight
2020-11-02 10:30 ` Valentin Schneider
2020-11-10 13:03   ` Peter Zijlstra
     [not found]     ` <19286daf276f46aa@fake-msgid>
2020-11-10 15:48       ` Valentin Schneider
2020-11-03 20:32 ` Thomas Gleixner
2020-11-20  9:20   ` Marc Zyngier
2020-11-20 14:17     ` Thomas Gleixner
2020-11-22 16:13       ` Marc Zyngier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).