[PATCH HACK RFC] cpu: Prevent late-arriving interrupts from disrupting offline

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH HACK RFC] cpu: Prevent late-arriving interrupts from disrupting offline
@ 2019-06-02  1:12 Paul E. McKenney
  2019-06-03  8:38 ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Paul E. McKenney @ 2019-06-02  1:12 UTC (permalink / raw)
  To: tglx, mingo, jpoimboe, peterz, mojha; +Cc: linux-kernel

Scheduling-clock interrupts can arrive late in the CPU-offline process,
after idle entry and the subsequent call to cpuhp_report_idle_dead().
Once execution passes the call to rcu_report_dead(), RCU is ignoring
the CPU, which results in lockdep complaints when the interrupt handler
uses RCU:

------------------------------------------------------------------------

=============================
WARNING: suspicious RCU usage
5.2.0-rc1+ #681 Not tainted
-----------------------------
kernel/sched/fair.c:9542 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

RCU used illegally from offline CPU!
rcu_scheduler_active = 2, debug_locks = 1
no locks held by swapper/5/0.

stack backtrace:
CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.2.0-rc1+ #681
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
Call Trace:
 <IRQ>
 dump_stack+0x5e/0x8b
 trigger_load_balance+0xa8/0x390
 ? tick_sched_do_timer+0x60/0x60
 update_process_times+0x3b/0x50
 tick_sched_handle+0x2f/0x40
 tick_sched_timer+0x32/0x70
 __hrtimer_run_queues+0xd3/0x3b0
 hrtimer_interrupt+0x11d/0x270
 ? sched_clock_local+0xc/0x74
 smp_apic_timer_interrupt+0x79/0x200
 apic_timer_interrupt+0xf/0x20
 </IRQ>
RIP: 0010:delay_tsc+0x22/0x50
Code: ff 0f 1f 80 00 00 00 00 65 44 8b 05 18 a7 11 48 0f ae e8 0f 31 48 89 d6 48 c1 e6 20 48 09 c6 eb 0e f3 90 65 8b 05 fe a6 11 48 <41> 39 c0 75 18 0f ae e8 0f 31 48 c1 e2 20 48 09 c2 48 89 d0 48 29
RSP: 0000:ffff8f92c0157ed0 EFLAGS: 00000212 ORIG_RAX: ffffffffffffff13
RAX: 0000000000000005 RBX: ffff8c861f356400 RCX: ffff8f92c0157e64
RDX: 000000321214c8cc RSI: 00000032120daa7f RDI: 0000000000260f15
RBP: 0000000000000005 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
R13: 0000000000000000 R14: ffff8c861ee18000 R15: ffff8c861ee18000
 cpuhp_report_idle_dead+0x31/0x60
 do_idle+0x1d5/0x200
 ? _raw_spin_unlock_irqrestore+0x2d/0x40
 cpu_startup_entry+0x14/0x20
 start_secondary+0x151/0x170
 secondary_startup_64+0xa4/0xb0

------------------------------------------------------------------------

This happens rarely, but can be forced by happen more often by
placing delays in cpuhp_report_idle_dead() following the call to
rcu_report_dead().  With this in place, the folloiwng rcutorture
scenario reproduces the problem within a few minute:

tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 8 --duration 5 --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y" --configs "TREE04"

This commit uses the crude by effective expedient of disabling interrupts
via local_irq_disable() just prior to the call to rcu_report_dead().
This passes light testing.  Of course, preventing the scheduling-clock
interrupt might be preferable.  However, this commit does have the
advantage of allowing progress to be made on other RCU bugs.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 448efc06bb2d..3b33d83b793d 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -930,6 +930,7 @@ void cpuhp_report_idle_dead(void)
 	struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
 
 	BUG_ON(st->state != CPUHP_AP_OFFLINE);
+	local_irq_disable();
 	rcu_report_dead(smp_processor_id());
 	st->state = CPUHP_AP_IDLE_DEAD;
 	udelay(1000);


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH HACK RFC] cpu: Prevent late-arriving interrupts from disrupting offline
  2019-06-02  1:12 [PATCH HACK RFC] cpu: Prevent late-arriving interrupts from disrupting offline Paul E. McKenney
@ 2019-06-03  8:38 ` Peter Zijlstra
  2019-06-03 11:44   ` Mark Rutland
  2019-06-04  8:14   ` Paul E. McKenney
  0 siblings, 2 replies; 13+ messages in thread
From: Peter Zijlstra @ 2019-06-03  8:38 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: tglx, mingo, jpoimboe, mojha, linux-kernel

On Sat, Jun 01, 2019 at 06:12:53PM -0700, Paul E. McKenney wrote:
> Scheduling-clock interrupts can arrive late in the CPU-offline process,
> after idle entry and the subsequent call to cpuhp_report_idle_dead().
> Once execution passes the call to rcu_report_dead(), RCU is ignoring
> the CPU, which results in lockdep complaints when the interrupt handler
> uses RCU:

> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 448efc06bb2d..3b33d83b793d 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -930,6 +930,7 @@ void cpuhp_report_idle_dead(void)
>  	struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
>  
>  	BUG_ON(st->state != CPUHP_AP_OFFLINE);
> +	local_irq_disable();
>  	rcu_report_dead(smp_processor_id());
>  	st->state = CPUHP_AP_IDLE_DEAD;
>  	udelay(1000);

Urgh... I'd almost suggest we do something like the below.


But then I started looking at the various arch_cpu_idle_dead()
implementations and ran into arm's implementation, which is calling
complete() where generic code already established this isn't possible
(see for example cpuhp_report_idle_dead()).

And then there's powerpc which for some obscure reason thinks it needs
to enable preemption when dying ?! pseries_cpu_die() actually calls
msleep() ?!?!

Sparc64 agains things it should enable preemption when playing dead.

So clearly this isn't going to work well :/

---
 include/linux/tick.h | 10 ----------
 kernel/sched/idle.c  |  5 +++--
 2 files changed, 3 insertions(+), 12 deletions(-)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index f92a10b5e112..196a0a7bfc4f 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -134,14 +134,6 @@ extern unsigned long tick_nohz_get_idle_calls(void);
 extern unsigned long tick_nohz_get_idle_calls_cpu(int cpu);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
-
-static inline void tick_nohz_idle_stop_tick_protected(void)
-{
-	local_irq_disable();
-	tick_nohz_idle_stop_tick();
-	local_irq_enable();
-}
-
 #else /* !CONFIG_NO_HZ_COMMON */
 #define tick_nohz_enabled (0)
 static inline int tick_nohz_tick_stopped(void) { return 0; }
@@ -164,8 +156,6 @@ static inline ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
 }
 static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
 static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
-
-static inline void tick_nohz_idle_stop_tick_protected(void) { }
 #endif /* !CONFIG_NO_HZ_COMMON */
 
 #ifdef CONFIG_NO_HZ_FULL
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 80940939b733..e4bc4aa739b8 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -241,13 +241,14 @@ static void do_idle(void)
 		check_pgt_cache();
 		rmb();
 
+		local_irq_disable();
+
 		if (cpu_is_offline(cpu)) {
-			tick_nohz_idle_stop_tick_protected();
+			tick_nohz_idle_stop_tick();
 			cpuhp_report_idle_dead();
 			arch_cpu_idle_dead();
 		}
 
-		local_irq_disable();
 		arch_cpu_idle_enter();
 
 		/*

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH HACK RFC] cpu: Prevent late-arriving interrupts from disrupting offline
  2019-06-03  8:38 ` Peter Zijlstra
@ 2019-06-03 11:44   ` Mark Rutland
  2019-06-03 13:39     ` Dietmar Eggemann
  2019-06-04  8:14   ` Paul E. McKenney
  1 sibling, 1 reply; 13+ messages in thread
From: Mark Rutland @ 2019-06-03 11:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, tglx, mingo, jpoimboe, mojha, linux-kernel

On Mon, Jun 03, 2019 at 10:38:48AM +0200, Peter Zijlstra wrote:
> On Sat, Jun 01, 2019 at 06:12:53PM -0700, Paul E. McKenney wrote:
> > Scheduling-clock interrupts can arrive late in the CPU-offline process,
> > after idle entry and the subsequent call to cpuhp_report_idle_dead().
> > Once execution passes the call to rcu_report_dead(), RCU is ignoring
> > the CPU, which results in lockdep complaints when the interrupt handler
> > uses RCU:
> 
> > diff --git a/kernel/cpu.c b/kernel/cpu.c
> > index 448efc06bb2d..3b33d83b793d 100644
> > --- a/kernel/cpu.c
> > +++ b/kernel/cpu.c
> > @@ -930,6 +930,7 @@ void cpuhp_report_idle_dead(void)
> >  	struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
> >  
> >  	BUG_ON(st->state != CPUHP_AP_OFFLINE);
> > +	local_irq_disable();
> >  	rcu_report_dead(smp_processor_id());
> >  	st->state = CPUHP_AP_IDLE_DEAD;
> >  	udelay(1000);
> 
> Urgh... I'd almost suggest we do something like the below.
> 
> 
> But then I started looking at the various arch_cpu_idle_dead()
> implementations and ran into arm's implementation, which is calling
> complete() where generic code already established this isn't possible
> (see for example cpuhp_report_idle_dead()).

IIRC, that should have been migrated over to cpu_report_death(), as
arm64 was in commit:

  05981277a4de1ad6 ("arm64: Use common outgoing-CPU-notification code")

... but it looks like Paul's patch to do so [1] fell through the cracks;
I'm not aware of any reason that shouldn't have been taken.
  
[1] https://lore.kernel.org/lkml/1431467407-1223-8-git-send-email-paulmck@linux.vnet.ibm.com/

Paul, do you want to resend that?

For arm64 we mask SError, debug, and FIQ exceptions later in our
cpu_die(). FIQ should never happen, but we could take SError or debug
exceptions, and I think we end up using RCU behind the scenes in the
handlers for those. :/

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH HACK RFC] cpu: Prevent late-arriving interrupts from disrupting offline
  2019-06-03 11:44   ` Mark Rutland
@ 2019-06-03 13:39     ` Dietmar Eggemann
  2019-06-04  7:45       ` Paul E. McKenney
  0 siblings, 1 reply; 13+ messages in thread
From: Dietmar Eggemann @ 2019-06-03 13:39 UTC (permalink / raw)
  To: Mark Rutland, Peter Zijlstra
  Cc: Paul E. McKenney, tglx, mingo, jpoimboe, mojha, linux-kernel

On 6/3/19 1:44 PM, Mark Rutland wrote:
> On Mon, Jun 03, 2019 at 10:38:48AM +0200, Peter Zijlstra wrote:
>> On Sat, Jun 01, 2019 at 06:12:53PM -0700, Paul E. McKenney wrote:
>>> Scheduling-clock interrupts can arrive late in the CPU-offline process,
>>> after idle entry and the subsequent call to cpuhp_report_idle_dead().
>>> Once execution passes the call to rcu_report_dead(), RCU is ignoring
>>> the CPU, which results in lockdep complaints when the interrupt handler
>>> uses RCU:
>>
>>> diff --git a/kernel/cpu.c b/kernel/cpu.c
>>> index 448efc06bb2d..3b33d83b793d 100644
>>> --- a/kernel/cpu.c
>>> +++ b/kernel/cpu.c
>>> @@ -930,6 +930,7 @@ void cpuhp_report_idle_dead(void)
>>>   	struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
>>>   
>>>   	BUG_ON(st->state != CPUHP_AP_OFFLINE);
>>> +	local_irq_disable();
>>>   	rcu_report_dead(smp_processor_id());
>>>   	st->state = CPUHP_AP_IDLE_DEAD;
>>>   	udelay(1000);
>>
>> Urgh... I'd almost suggest we do something like the below.
>>
>>
>> But then I started looking at the various arch_cpu_idle_dead()
>> implementations and ran into arm's implementation, which is calling
>> complete() where generic code already established this isn't possible
>> (see for example cpuhp_report_idle_dead()).
> 
> IIRC, that should have been migrated over to cpu_report_death(), as
> arm64 was in commit:
> 
>    05981277a4de1ad6 ("arm64: Use common outgoing-CPU-notification code")
> 
> ... but it looks like Paul's patch to do so [1] fell through the cracks;
> I'm not aware of any reason that shouldn't have been taken.
>    
> [1] https://lore.kernel.org/lkml/1431467407-1223-8-git-send-email-paulmck@linux.vnet.ibm.com/
> 
> Paul, do you want to resend that?

Please do. We're carrying this patch out-of-tree for while now in our 
EAS integration to get cpu hotplug tests passing on TC2 (arm).

-- Dietmar

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH HACK RFC] cpu: Prevent late-arriving interrupts from disrupting offline
  2019-06-03 13:39     ` Dietmar Eggemann
@ 2019-06-04  7:45       ` Paul E. McKenney
  2019-06-04 13:29         ` Dietmar Eggemann
  0 siblings, 1 reply; 13+ messages in thread
From: Paul E. McKenney @ 2019-06-04  7:45 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Mark Rutland, Peter Zijlstra, tglx, mingo, jpoimboe, mojha, linux-kernel

On Mon, Jun 03, 2019 at 03:39:18PM +0200, Dietmar Eggemann wrote:
> On 6/3/19 1:44 PM, Mark Rutland wrote:
> >On Mon, Jun 03, 2019 at 10:38:48AM +0200, Peter Zijlstra wrote:
> >>On Sat, Jun 01, 2019 at 06:12:53PM -0700, Paul E. McKenney wrote:
> >>>Scheduling-clock interrupts can arrive late in the CPU-offline process,
> >>>after idle entry and the subsequent call to cpuhp_report_idle_dead().
> >>>Once execution passes the call to rcu_report_dead(), RCU is ignoring
> >>>the CPU, which results in lockdep complaints when the interrupt handler
> >>>uses RCU:
> >>
> >>>diff --git a/kernel/cpu.c b/kernel/cpu.c
> >>>index 448efc06bb2d..3b33d83b793d 100644
> >>>--- a/kernel/cpu.c
> >>>+++ b/kernel/cpu.c
> >>>@@ -930,6 +930,7 @@ void cpuhp_report_idle_dead(void)
> >>>  	struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
> >>>  	BUG_ON(st->state != CPUHP_AP_OFFLINE);
> >>>+	local_irq_disable();
> >>>  	rcu_report_dead(smp_processor_id());
> >>>  	st->state = CPUHP_AP_IDLE_DEAD;
> >>>  	udelay(1000);
> >>
> >>Urgh... I'd almost suggest we do something like the below.
> >>
> >>
> >>But then I started looking at the various arch_cpu_idle_dead()
> >>implementations and ran into arm's implementation, which is calling
> >>complete() where generic code already established this isn't possible
> >>(see for example cpuhp_report_idle_dead()).
> >
> >IIRC, that should have been migrated over to cpu_report_death(), as
> >arm64 was in commit:
> >
> >   05981277a4de1ad6 ("arm64: Use common outgoing-CPU-notification code")
> >
> >... but it looks like Paul's patch to do so [1] fell through the cracks;
> >I'm not aware of any reason that shouldn't have been taken.
> >[1] https://lore.kernel.org/lkml/1431467407-1223-8-git-send-email-paulmck@linux.vnet.ibm.com/
> >
> >Paul, do you want to resend that?
> 
> Please do. We're carrying this patch out-of-tree for while now in
> our EAS integration to get cpu hotplug tests passing on TC2 (arm).

Huh.  It still applies.  But I have no means of testing it.

And it looks like the reason I dropped it was that I didn't get any
response from the maintainer.  I sent a message to this effect to
linux-arm-kernel@lists.infradead.org and linux@arm.linux.org.uk on May
21, 2015.

So here it is again.  ;-)

I have queued this locally.  Left to myself, I add the two of you on its
Cc: list and run it through my normal process.  But given the history,
I would still want either an ack from the maintainer or, better, for
the maintainer to take the patch.

Or is there a better way for us to proceed on this?

							Thanx, Paul

------------------------------------------------------------------------

arm: Use common outgoing-CPU-notification code

This commit removes the open-coded CPU-offline notification with new
common code.  In particular, this change avoids calling scheduler code
using RCU from an offline CPU that RCU is ignoring.  This is a minimal
change.  A more intrusive change might invoke the cpu_check_up_prepare()
and cpu_set_state_online() functions at CPU-online time, which would
allow onlining throw an error if the CPU did not go offline properly.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Russell King <linux@arm.linux.org.uk>

diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index ebc53804d57b..8687d619260f 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -267,15 +267,13 @@ int __cpu_disable(void)
 	return 0;
 }
 
-static DECLARE_COMPLETION(cpu_died);
-
 /*
  * called on the thread which is asking for a CPU to be shutdown -
  * waits until shutdown has completed, or it is timed out.
  */
 void __cpu_die(unsigned int cpu)
 {
-	if (!wait_for_completion_timeout(&cpu_died, msecs_to_jiffies(5000))) {
+	if (!cpu_wait_death(cpu, 5)) {
 		pr_err("CPU%u: cpu didn't die\n", cpu);
 		return;
 	}
@@ -322,7 +320,7 @@ void arch_cpu_idle_dead(void)
 	 * this returns, power and/or clocks can be removed at any point
 	 * from this CPU and its cache by platform_cpu_kill().
 	 */
-	complete(&cpu_died);
+	(void)cpu_report_death();
 
 	/*
 	 * Ensure that the cache lines associated with that completion are


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH HACK RFC] cpu: Prevent late-arriving interrupts from disrupting offline
  2019-06-03  8:38 ` Peter Zijlstra
  2019-06-03 11:44   ` Mark Rutland
@ 2019-06-04  8:14   ` Paul E. McKenney
  2019-06-04 12:06     ` Peter Zijlstra
  1 sibling, 1 reply; 13+ messages in thread
From: Paul E. McKenney @ 2019-06-04  8:14 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: tglx, mingo, jpoimboe, mojha, linux-kernel

On Mon, Jun 03, 2019 at 10:38:48AM +0200, Peter Zijlstra wrote:
> On Sat, Jun 01, 2019 at 06:12:53PM -0700, Paul E. McKenney wrote:
> > Scheduling-clock interrupts can arrive late in the CPU-offline process,
> > after idle entry and the subsequent call to cpuhp_report_idle_dead().
> > Once execution passes the call to rcu_report_dead(), RCU is ignoring
> > the CPU, which results in lockdep complaints when the interrupt handler
> > uses RCU:
> 
> > diff --git a/kernel/cpu.c b/kernel/cpu.c
> > index 448efc06bb2d..3b33d83b793d 100644
> > --- a/kernel/cpu.c
> > +++ b/kernel/cpu.c
> > @@ -930,6 +930,7 @@ void cpuhp_report_idle_dead(void)
> >  	struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
> >  
> >  	BUG_ON(st->state != CPUHP_AP_OFFLINE);
> > +	local_irq_disable();
> >  	rcu_report_dead(smp_processor_id());
> >  	st->state = CPUHP_AP_IDLE_DEAD;
> >  	udelay(1000);
> 
> Urgh... I'd almost suggest we do something like the below.
> 
> 
> But then I started looking at the various arch_cpu_idle_dead()
> implementations and ran into arm's implementation, which is calling
> complete() where generic code already established this isn't possible
> (see for example cpuhp_report_idle_dead()).

Yeah, my patch that would have changed that never was acked or taken
by the maintainer, as discussed later in this thread.

> And then there's powerpc which for some obscure reason thinks it needs
> to enable preemption when dying ?! pseries_cpu_die() actually calls
> msleep() ?!?!

Isn't pseries_cpu_die() invoked via the smp_ops->cpu_die() function
pointer, whch is invoked from __cpu_die() in arch/powerpc/kernel/smp.c?
Then, if I am reading the code correctly, __cpu_die() is invoked from
takedown_cpu(), which is invoked not from the dying CPU but rather from
a surviving CPU.  Or am I misreading the code?

> Sparc64 agains things it should enable preemption when playing dead.
> 
> So clearly this isn't going to work well :/

Well, it looks like it will work at least as well as my patch.  I will
test it out this evening, ten timezones east of my usual location.  ;-)

							Thanx, Paul

> ---
>  include/linux/tick.h | 10 ----------
>  kernel/sched/idle.c  |  5 +++--
>  2 files changed, 3 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index f92a10b5e112..196a0a7bfc4f 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -134,14 +134,6 @@ extern unsigned long tick_nohz_get_idle_calls(void);
>  extern unsigned long tick_nohz_get_idle_calls_cpu(int cpu);
>  extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
>  extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
> -
> -static inline void tick_nohz_idle_stop_tick_protected(void)
> -{
> -	local_irq_disable();
> -	tick_nohz_idle_stop_tick();
> -	local_irq_enable();
> -}
> -
>  #else /* !CONFIG_NO_HZ_COMMON */
>  #define tick_nohz_enabled (0)
>  static inline int tick_nohz_tick_stopped(void) { return 0; }
> @@ -164,8 +156,6 @@ static inline ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
>  }
>  static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
>  static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
> -
> -static inline void tick_nohz_idle_stop_tick_protected(void) { }
>  #endif /* !CONFIG_NO_HZ_COMMON */
>  
>  #ifdef CONFIG_NO_HZ_FULL
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 80940939b733..e4bc4aa739b8 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -241,13 +241,14 @@ static void do_idle(void)
>  		check_pgt_cache();
>  		rmb();
>  
> +		local_irq_disable();
> +
>  		if (cpu_is_offline(cpu)) {
> -			tick_nohz_idle_stop_tick_protected();
> +			tick_nohz_idle_stop_tick();
>  			cpuhp_report_idle_dead();
>  			arch_cpu_idle_dead();
>  		}
>  
> -		local_irq_disable();
>  		arch_cpu_idle_enter();
>  
>  		/*
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH HACK RFC] cpu: Prevent late-arriving interrupts from disrupting offline
  2019-06-04  8:14   ` Paul E. McKenney
@ 2019-06-04 12:06     ` Peter Zijlstra
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2019-06-04 12:06 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: tglx, mingo, jpoimboe, mojha, linux-kernel

On Tue, Jun 04, 2019 at 01:14:35AM -0700, Paul E. McKenney wrote:
> On Mon, Jun 03, 2019 at 10:38:48AM +0200, Peter Zijlstra wrote:

> > And then there's powerpc which for some obscure reason thinks it needs
> > to enable preemption when dying ?! pseries_cpu_die() actually calls
> > msleep() ?!?!
> 
> Isn't pseries_cpu_die() invoked via the smp_ops->cpu_die() function
> pointer, whch is invoked from __cpu_die() in arch/powerpc/kernel/smp.c?
> Then, if I am reading the code correctly, __cpu_die() is invoked from
> takedown_cpu(), which is invoked not from the dying CPU but rather from
> a surviving CPU.  Or am I misreading the code?

Argh..

arch_cpu_idle_dead() -> cpu_die() -> ppc_md.cpu_die()

which is _NOT_ smp_ops.cpu_die()

this one ends in pseries_mach_cpu_die()

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH HACK RFC] cpu: Prevent late-arriving interrupts from disrupting offline
  2019-06-04  7:45       ` Paul E. McKenney
@ 2019-06-04 13:29         ` Dietmar Eggemann
  2019-06-08 16:41           ` Paul E. McKenney
  0 siblings, 1 reply; 13+ messages in thread
From: Dietmar Eggemann @ 2019-06-04 13:29 UTC (permalink / raw)
  To: paulmck
  Cc: Mark Rutland, Peter Zijlstra, tglx, mingo, jpoimboe, mojha, linux-kernel

On 6/4/19 9:45 AM, Paul E. McKenney wrote:
> On Mon, Jun 03, 2019 at 03:39:18PM +0200, Dietmar Eggemann wrote:
>> On 6/3/19 1:44 PM, Mark Rutland wrote:
>>> On Mon, Jun 03, 2019 at 10:38:48AM +0200, Peter Zijlstra wrote:
>>>> On Sat, Jun 01, 2019 at 06:12:53PM -0700, Paul E. McKenney wrote:
>>>>> Scheduling-clock interrupts can arrive late in the CPU-offline process,

[...]

>>>    05981277a4de1ad6 ("arm64: Use common outgoing-CPU-notification code")
>>>
>>> ... but it looks like Paul's patch to do so [1] fell through the cracks;
>>> I'm not aware of any reason that shouldn't have been taken.
>>> [1] https://lore.kernel.org/lkml/1431467407-1223-8-git-send-email-paulmck@linux.vnet.ibm.com/
>>>
>>> Paul, do you want to resend that?
>>
>> Please do. We're carrying this patch out-of-tree for while now in
>> our EAS integration to get cpu hotplug tests passing on TC2 (arm).
> 
> Huh.  It still applies.  But I have no means of testing it.

We can do the testing part on our TC2 platform, i.e. we're testing it 
with each of our EAS mainline integration right now.

https://developer.arm.com/tools-and-software/open-source-software/linux-kernel/energy-aware-scheduling/eas-mainline-development

http://linux-arm.org/git?p=linux-power.git;a=commit;h=8cd16f1dc2cd896a0b1e2010b4992b33fdc11fe0

> And it looks like the reason I dropped it was that I didn't get any
> response from the maintainer.  I sent a message to this effect to
> linux-arm-kernel@lists.infradead.org and linux@arm.linux.org.uk on May
> 21, 2015.
> 
> So here it is again.  ;-)
> 
> I have queued this locally.  Left to myself, I add the two of you on its
> Cc: list and run it through my normal process.  But given the history,
> I would still want either an ack from the maintainer or, better, for
> the maintainer to take the patch.
> 
> Or is there a better way for us to proceed on this?

You could send this patch also to linux-arm-kernel@lists.infradead.org 
and cc rmk to get his opinion on the patch.

[...]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH HACK RFC] cpu: Prevent late-arriving interrupts from disrupting offline
  2019-06-04 13:29         ` Dietmar Eggemann
@ 2019-06-08 16:41           ` Paul E. McKenney
  2019-06-11 13:14             ` Dietmar Eggemann
  0 siblings, 1 reply; 13+ messages in thread
From: Paul E. McKenney @ 2019-06-08 16:41 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Mark Rutland, Peter Zijlstra, tglx, mingo, jpoimboe, mojha, linux-kernel

On Tue, Jun 04, 2019 at 03:29:32PM +0200, Dietmar Eggemann wrote:
> On 6/4/19 9:45 AM, Paul E. McKenney wrote:
> >On Mon, Jun 03, 2019 at 03:39:18PM +0200, Dietmar Eggemann wrote:
> >>On 6/3/19 1:44 PM, Mark Rutland wrote:
> >>>On Mon, Jun 03, 2019 at 10:38:48AM +0200, Peter Zijlstra wrote:
> >>>>On Sat, Jun 01, 2019 at 06:12:53PM -0700, Paul E. McKenney wrote:
> >>>>>Scheduling-clock interrupts can arrive late in the CPU-offline process,
> 
> [...]
> 
> >>>   05981277a4de1ad6 ("arm64: Use common outgoing-CPU-notification code")
> >>>
> >>>... but it looks like Paul's patch to do so [1] fell through the cracks;
> >>>I'm not aware of any reason that shouldn't have been taken.
> >>>[1] https://lore.kernel.org/lkml/1431467407-1223-8-git-send-email-paulmck@linux.vnet.ibm.com/
> >>>
> >>>Paul, do you want to resend that?
> >>
> >>Please do. We're carrying this patch out-of-tree for while now in
> >>our EAS integration to get cpu hotplug tests passing on TC2 (arm).
> >
> >Huh.  It still applies.  But I have no means of testing it.
> 
> We can do the testing part on our TC2 platform, i.e. we're testing
> it with each of our EAS mainline integration right now.
> 
> https://developer.arm.com/tools-and-software/open-source-software/linux-kernel/energy-aware-scheduling/eas-mainline-development
> 
> http://linux-arm.org/git?p=linux-power.git;a=commit;h=8cd16f1dc2cd896a0b1e2010b4992b33fdc11fe0
> 
> >And it looks like the reason I dropped it was that I didn't get any
> >response from the maintainer.  I sent a message to this effect to
> >linux-arm-kernel@lists.infradead.org and linux@arm.linux.org.uk on May
> >21, 2015.
> >
> >So here it is again.  ;-)
> >
> >I have queued this locally.  Left to myself, I add the two of you on its
> >Cc: list and run it through my normal process.  But given the history,
> >I would still want either an ack from the maintainer or, better, for
> >the maintainer to take the patch.
> >
> >Or is there a better way for us to proceed on this?
> 
> You could send this patch also to
> linux-arm-kernel@lists.infradead.org and cc rmk to get his opinion
> on the patch.

OK, please let me know how the testing goes.  My thought is to send the
patch as you suggest with your Tested-by.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH HACK RFC] cpu: Prevent late-arriving interrupts from disrupting offline
  2019-06-08 16:41           ` Paul E. McKenney
@ 2019-06-11 13:14             ` Dietmar Eggemann
  2019-06-11 13:54               ` Paul E. McKenney
  0 siblings, 1 reply; 13+ messages in thread
From: Dietmar Eggemann @ 2019-06-11 13:14 UTC (permalink / raw)
  To: paulmck
  Cc: Mark Rutland, Peter Zijlstra, tglx, mingo, jpoimboe, mojha, linux-kernel

On 6/8/19 6:41 PM, Paul E. McKenney wrote:
> On Tue, Jun 04, 2019 at 03:29:32PM +0200, Dietmar Eggemann wrote:
>> On 6/4/19 9:45 AM, Paul E. McKenney wrote:
>>> On Mon, Jun 03, 2019 at 03:39:18PM +0200, Dietmar Eggemann wrote:
>>>> On 6/3/19 1:44 PM, Mark Rutland wrote:
>>>>> On Mon, Jun 03, 2019 at 10:38:48AM +0200, Peter Zijlstra wrote:
>>>>>> On Sat, Jun 01, 2019 at 06:12:53PM -0700, Paul E. McKenney wrote:
>>>>>>> Scheduling-clock interrupts can arrive late in the CPU-offline process,

[...]

>>> And it looks like the reason I dropped it was that I didn't get any
>>> response from the maintainer.  I sent a message to this effect to
>>> linux-arm-kernel@lists.infradead.org and linux@arm.linux.org.uk on May
>>> 21, 2015.
>>>
>>> So here it is again.  ;-)
>>>
>>> I have queued this locally.  Left to myself, I add the two of you on its
>>> Cc: list and run it through my normal process.  But given the history,
>>> I would still want either an ack from the maintainer or, better, for
>>> the maintainer to take the patch.
>>>
>>> Or is there a better way for us to proceed on this?
>>
>> You could send this patch also to
>> linux-arm-kernel@lists.infradead.org and cc rmk to get his opinion
>> on the patch.
> 
> OK, please let me know how the testing goes.  My thought is to send the
> patch as you suggest with your Tested-by.

Tested your patch on top of v5.2-rc4* on Arm TC2 (32bit) and CPU hotplug 
stress test. W/o your patch, the test fails within seconds since CPUs 
are not coming up again. W/ your patch, the test runs for hours just fine.

You can add my:

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

* just for the record: one additional unrelated patch (to disable the 
NOR flash) is necessary on Arm TC2: 
https://patchwork.kernel.org/patch/10968391 .

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH HACK RFC] cpu: Prevent late-arriving interrupts from disrupting offline
  2019-06-11 13:14             ` Dietmar Eggemann
@ 2019-06-11 13:54               ` Paul E. McKenney
  2019-06-11 14:39                 ` Dietmar Eggemann
  0 siblings, 1 reply; 13+ messages in thread
From: Paul E. McKenney @ 2019-06-11 13:54 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Mark Rutland, Peter Zijlstra, tglx, mingo, jpoimboe, mojha, linux-kernel

On Tue, Jun 11, 2019 at 03:14:54PM +0200, Dietmar Eggemann wrote:
> On 6/8/19 6:41 PM, Paul E. McKenney wrote:
> >On Tue, Jun 04, 2019 at 03:29:32PM +0200, Dietmar Eggemann wrote:
> >>On 6/4/19 9:45 AM, Paul E. McKenney wrote:
> >>>On Mon, Jun 03, 2019 at 03:39:18PM +0200, Dietmar Eggemann wrote:
> >>>>On 6/3/19 1:44 PM, Mark Rutland wrote:
> >>>>>On Mon, Jun 03, 2019 at 10:38:48AM +0200, Peter Zijlstra wrote:
> >>>>>>On Sat, Jun 01, 2019 at 06:12:53PM -0700, Paul E. McKenney wrote:
> >>>>>>>Scheduling-clock interrupts can arrive late in the CPU-offline process,
> 
> [...]
> 
> >>>And it looks like the reason I dropped it was that I didn't get any
> >>>response from the maintainer.  I sent a message to this effect to
> >>>linux-arm-kernel@lists.infradead.org and linux@arm.linux.org.uk on May
> >>>21, 2015.
> >>>
> >>>So here it is again.  ;-)
> >>>
> >>>I have queued this locally.  Left to myself, I add the two of you on its
> >>>Cc: list and run it through my normal process.  But given the history,
> >>>I would still want either an ack from the maintainer or, better, for
> >>>the maintainer to take the patch.
> >>>
> >>>Or is there a better way for us to proceed on this?
> >>
> >>You could send this patch also to
> >>linux-arm-kernel@lists.infradead.org and cc rmk to get his opinion
> >>on the patch.
> >
> >OK, please let me know how the testing goes.  My thought is to send the
> >patch as you suggest with your Tested-by.
> 
> Tested your patch on top of v5.2-rc4* on Arm TC2 (32bit) and CPU
> hotplug stress test. W/o your patch, the test fails within seconds
> since CPUs are not coming up again. W/ your patch, the test runs for
> hours just fine.
> 
> You can add my:
> 
> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

Thank you!!!

> * just for the record: one additional unrelated patch (to disable
> the NOR flash) is necessary on Arm TC2:
> https://patchwork.kernel.org/patch/10968391 .

Is this progressing, or does it also need help getting to mainline?

Left to myself, I will push my patch and assume that the NOR flash patch
will make it in its own good time -- or, alternatively, that there is
someone better positioned than me to push it.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH HACK RFC] cpu: Prevent late-arriving interrupts from disrupting offline
  2019-06-11 13:54               ` Paul E. McKenney
@ 2019-06-11 14:39                 ` Dietmar Eggemann
  2019-06-11 19:25                   ` Paul E. McKenney
  0 siblings, 1 reply; 13+ messages in thread
From: Dietmar Eggemann @ 2019-06-11 14:39 UTC (permalink / raw)
  To: paulmck
  Cc: Mark Rutland, Peter Zijlstra, tglx, mingo, jpoimboe, mojha, linux-kernel

On 6/11/19 3:54 PM, Paul E. McKenney wrote:
> On Tue, Jun 11, 2019 at 03:14:54PM +0200, Dietmar Eggemann wrote:
>> On 6/8/19 6:41 PM, Paul E. McKenney wrote:
>>> On Tue, Jun 04, 2019 at 03:29:32PM +0200, Dietmar Eggemann wrote:
>>>> On 6/4/19 9:45 AM, Paul E. McKenney wrote:
>>>>> On Mon, Jun 03, 2019 at 03:39:18PM +0200, Dietmar Eggemann wrote:
>>>>>> On 6/3/19 1:44 PM, Mark Rutland wrote:
>>>>>>> On Mon, Jun 03, 2019 at 10:38:48AM +0200, Peter Zijlstra wrote:
>>>>>>>> On Sat, Jun 01, 2019 at 06:12:53PM -0700, Paul E. McKenney wrote:
>>>>>>>>> Scheduling-clock interrupts can arrive late in the CPU-offline process,

[...]

>> Tested your patch on top of v5.2-rc4* on Arm TC2 (32bit) and CPU
>> hotplug stress test. W/o your patch, the test fails within seconds
>> since CPUs are not coming up again. W/ your patch, the test runs for
>> hours just fine.
>>
>> You can add my:
>>
>> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> 
> Thank you!!!
> 
>> * just for the record: one additional unrelated patch (to disable
>> the NOR flash) is necessary on Arm TC2:
>> https://patchwork.kernel.org/patch/10968391 .
> 
> Is this progressing, or does it also need help getting to mainline?

This is an unrelated specific issue w/ the TC2 platform which will 
progress independently. Other Arm32 platforms should profit from your 
patch independently of that. I just wanted to mention it here in case 
people try to recreate the test on this specific platform.
> Left to myself, I will push my patch and assume that the NOR flash patch
> will make it in its own good time -- or, alternatively, that there is
> someone better positioned than me to push it.

IMHO, the best thing is you push your patch.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH HACK RFC] cpu: Prevent late-arriving interrupts from disrupting offline
  2019-06-11 14:39                 ` Dietmar Eggemann
@ 2019-06-11 19:25                   ` Paul E. McKenney
  0 siblings, 0 replies; 13+ messages in thread
From: Paul E. McKenney @ 2019-06-11 19:25 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Mark Rutland, Peter Zijlstra, tglx, mingo, jpoimboe, mojha, linux-kernel

On Tue, Jun 11, 2019 at 04:39:34PM +0200, Dietmar Eggemann wrote:
> On 6/11/19 3:54 PM, Paul E. McKenney wrote:
> >On Tue, Jun 11, 2019 at 03:14:54PM +0200, Dietmar Eggemann wrote:
> >>On 6/8/19 6:41 PM, Paul E. McKenney wrote:
> >>>On Tue, Jun 04, 2019 at 03:29:32PM +0200, Dietmar Eggemann wrote:
> >>>>On 6/4/19 9:45 AM, Paul E. McKenney wrote:
> >>>>>On Mon, Jun 03, 2019 at 03:39:18PM +0200, Dietmar Eggemann wrote:
> >>>>>>On 6/3/19 1:44 PM, Mark Rutland wrote:
> >>>>>>>On Mon, Jun 03, 2019 at 10:38:48AM +0200, Peter Zijlstra wrote:
> >>>>>>>>On Sat, Jun 01, 2019 at 06:12:53PM -0700, Paul E. McKenney wrote:
> >>>>>>>>>Scheduling-clock interrupts can arrive late in the CPU-offline process,
> 
> [...]
> 
> >>Tested your patch on top of v5.2-rc4* on Arm TC2 (32bit) and CPU
> >>hotplug stress test. W/o your patch, the test fails within seconds
> >>since CPUs are not coming up again. W/ your patch, the test runs for
> >>hours just fine.
> >>
> >>You can add my:
> >>
> >>Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> >
> >Thank you!!!
> >
> >>* just for the record: one additional unrelated patch (to disable
> >>the NOR flash) is necessary on Arm TC2:
> >>https://patchwork.kernel.org/patch/10968391 .
> >
> >Is this progressing, or does it also need help getting to mainline?
> 
> This is an unrelated specific issue w/ the TC2 platform which will
> progress independently. Other Arm32 platforms should profit from
> your patch independently of that. I just wanted to mention it here
> in case people try to recreate the test on this specific platform.
> >Left to myself, I will push my patch and assume that the NOR flash patch
> >will make it in its own good time -- or, alternatively, that there is
> >someone better positioned than me to push it.
> 
> IMHO, the best thing is you push your patch.

I just now sent it.  Here is hoping.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2019-06-11 19:26 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-02  1:12 [PATCH HACK RFC] cpu: Prevent late-arriving interrupts from disrupting offline Paul E. McKenney
2019-06-03  8:38 ` Peter Zijlstra
2019-06-03 11:44   ` Mark Rutland
2019-06-03 13:39     ` Dietmar Eggemann
2019-06-04  7:45       ` Paul E. McKenney
2019-06-04 13:29         ` Dietmar Eggemann
2019-06-08 16:41           ` Paul E. McKenney
2019-06-11 13:14             ` Dietmar Eggemann
2019-06-11 13:54               ` Paul E. McKenney
2019-06-11 14:39                 ` Dietmar Eggemann
2019-06-11 19:25                   ` Paul E. McKenney
2019-06-04  8:14   ` Paul E. McKenney
2019-06-04 12:06     ` Peter Zijlstra

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.