FAILED: patch "[PATCH] hrtimer: Reset hrtimer cpu base proper on CPU hotplug" failed to apply to 3.18-stable tree

* FAILED: patch "[PATCH] hrtimer: Reset hrtimer cpu base proper on CPU hotplug" failed to apply to 3.18-stable tree
@ 2018-01-29  8:08 gregkh
  2018-01-29 14:20 ` [PATCH] hrtimer: Reset hrtimer cpu base proper on CPU hotplug Sebastian Andrzej Siewior
  0 siblings, 1 reply; 11+ messages in thread
From: gregkh @ 2018-01-29  8:08 UTC (permalink / raw)
  To: tglx, anna-maria, bigeasy, paulmck, peterz; +Cc: stable

The patch below does not apply to the 3.18-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.

thanks,

greg k-h

------------------ original commit in Linus's tree ------------------

From d5421ea43d30701e03cadc56a38854c36a8b4433 Mon Sep 17 00:00:00 2001
From: Thomas Gleixner <tglx@linutronix.de>
Date: Fri, 26 Jan 2018 14:54:32 +0100
Subject: [PATCH] hrtimer: Reset hrtimer cpu base proper on CPU hotplug

The hrtimer interrupt code contains a hang detection and mitigation
mechanism, which prevents that a long delayed hrtimer interrupt causes a
continuous retriggering of interrupts which prevent the system from making
progress. If a hang is detected then the timer hardware is programmed with
a certain delay into the future and a flag is set in the hrtimer cpu base
which prevents newly enqueued timers from reprogramming the timer hardware
prior to the chosen delay. The subsequent hrtimer interrupt after the delay
clears the flag and resumes normal operation.

If such a hang happens in the last hrtimer interrupt before a CPU is
unplugged then the hang_detected flag is set and stays that way when the
CPU is plugged in again. At that point the timer hardware is not armed and
it cannot be armed because the hang_detected flag is still active, so
nothing clears that flag. As a consequence the CPU does not receive hrtimer
interrupts and no timers expire on that CPU which results in RCU stalls and
other malfunctions.

Clear the flag along with some other less critical members of the hrtimer
cpu base to ensure starting from a clean state when a CPU is plugged in.
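
The failure mode described above can be illustrated with a small user-space
model. This is only a sketch under stated assumptions, not the kernel code:
the struct and every model_* function are hypothetical stand-ins, and only
the field names hang_detected, hres_active and expires_next are taken from
the patch. It shows that a stale hang_detected flag blocks reprogramming
after the CPU comes back, and that clearing the flag on plug-in lets the
next enqueued timer arm the hardware again.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_KTIME_MAX INT64_MAX

/* Hypothetical, heavily simplified stand-in for struct hrtimer_cpu_base. */
struct model_cpu_base {
	int64_t expires_next;	/* next programmed expiry time */
	bool hang_detected;	/* set by the hang mitigation */
	bool hres_active;
	bool hw_armed;		/* models whether the timer hardware will fire */
};

/*
 * Models an interrupt in which a hang was detected: the hardware is armed
 * with a delay and reprogramming is blocked until that interrupt arrives.
 */
static void model_hang(struct model_cpu_base *b, int64_t now, int64_t delay)
{
	b->hang_detected = true;
	b->expires_next = now + delay;
	b->hw_armed = true;
}

/* Models the delayed interrupt: it clears the flag and resumes normal operation. */
static void model_delayed_interrupt(struct model_cpu_base *b)
{
	b->hang_detected = false;
	b->hw_armed = false;
	b->expires_next = MODEL_KTIME_MAX;
}

/* Models enqueueing a timer; returns true if the hardware was reprogrammed. */
static bool model_enqueue(struct model_cpu_base *b, int64_t expires)
{
	if (b->hang_detected)
		return false;	/* reprogramming is blocked */
	if (expires >= b->expires_next)
		return false;	/* an earlier event is already programmed */
	b->expires_next = expires;
	b->hw_armed = true;
	return true;
}

/* CPU unplug: the hardware is no longer armed, software state is left alone. */
static void model_cpu_offline(struct model_cpu_base *b)
{
	b->hw_armed = false;
}

/* CPU plug-in: with_fix selects whether hang_detected is reset as well. */
static void model_cpu_online(struct model_cpu_base *b, bool with_fix)
{
	b->expires_next = MODEL_KTIME_MAX;
	if (with_fix)
		b->hang_detected = false;
	b->hres_active = false;
}
```

In this model, calling model_hang(), model_cpu_offline() and then
model_cpu_online() without the fix leaves every later model_enqueue()
returning false with the hardware unarmed, which corresponds to the CPU
never receiving another hrtimer interrupt.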

Thanks to Paul, Sebastian and Anna-Maria for their help to get down to the
root cause of that hard to reproduce heisenbug. Once understood it's
trivial and certainly justifies a brown paperbag.

Fixes: 41d2e4949377 ("hrtimer: Tune hrtimer_interrupt hang logic")
Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Sewior <bigeasy@linutronix.de>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801261447590.2067@nanos

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index d32520840fde..aa9d2a2b1210 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -655,7 +655,9 @@ static void hrtimer_reprogram(struct hrtimer *timer,
 static inline void hrtimer_init_hres(struct hrtimer_cpu_base *base)
 {
 	base->expires_next = KTIME_MAX;
+	base->hang_detected = 0;
 	base->hres_active = 0;
+	base->next_timer = NULL;
 }

 /*
@@ -1589,6 +1591,7 @@ int hrtimers_prepare_cpu(unsigned int cpu)
 		timerqueue_init_head(&cpu_base->clock_base[i].active);
 	}

+	cpu_base->active_bases = 0;
 	cpu_base->cpu = cpu;
 	hrtimer_init_hres(cpu_base);
 	return 0;

* [PATCH] hrtimer: Reset hrtimer cpu base proper on CPU hotplug
  2018-01-29  8:08 FAILED: patch "[PATCH] hrtimer: Reset hrtimer cpu base proper on CPU hotplug" failed to apply to 3.18-stable tree gregkh
@ 2018-01-29 14:20 ` Sebastian Andrzej Siewior
  2018-01-29 14:32   ` Greg KH
  0 siblings, 1 reply; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-01-29 14:20 UTC (permalink / raw)
  To: gregkh; +Cc: tglx, anna-maria, paulmck, peterz, stable

From: Thomas Gleixner <tglx@linutronix.de>

commit d5421ea43d30701e03cadc56a38854c36a8b4433 upstream.

The hrtimer interrupt code contains a hang detection and mitigation
mechanism, which prevents that a long delayed hrtimer interrupt causes a
continuous retriggering of interrupts which prevent the system from making
progress. If a hang is detected then the timer hardware is programmed with
a certain delay into the future and a flag is set in the hrtimer cpu base
which prevents newly enqueued timers from reprogramming the timer hardware
prior to the chosen delay. The subsequent hrtimer interrupt after the delay
clears the flag and resumes normal operation.

If such a hang happens in the last hrtimer interrupt before a CPU is
unplugged then the hang_detected flag is set and stays that way when the
CPU is plugged in again. At that point the timer hardware is not armed and
it cannot be armed because the hang_detected flag is still active, so
nothing clears that flag. As a consequence the CPU does not receive hrtimer
interrupts and no timers expire on that CPU which results in RCU stalls and
other malfunctions.

Clear the flag along with some other less critical members of the hrtimer
cpu base to ensure starting from a clean state when a CPU is plugged in.

Thanks to Paul, Sebastian and Anna-Maria for their help to get down to the
root cause of that hard to reproduce heisenbug. Once understood it's
trivial and certainly justifies a brown paperbag.

Fixes: 41d2e4949377 ("hrtimer: Tune hrtimer_interrupt hang logic")
Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Sewior <bigeasy@linutronix.de>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801261447590.2067@nanos
[bigeasy: backport to v3.18, drop ->next_timer it was introduced later]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/time/hrtimer.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 210b84882935..e4c722437708 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -612,6 +612,7 @@ static int hrtimer_reprogram(struct hrtimer *timer,
 static inline void hrtimer_init_hres(struct hrtimer_cpu_base *base)
 {
 	base->expires_next.tv64 = KTIME_MAX;
+	base->hang_detected = 0;
 	base->hres_active = 0;
 }

@@ -1632,6 +1633,7 @@ static void init_hrtimers_cpu(int cpu)
 		timerqueue_init_head(&cpu_base->clock_base[i].active);
 	}

+	cpu_base->active_bases = 0;
 	cpu_base->cpu = cpu;
 	hrtimer_init_hres(cpu_base);
 }
-- 
2.15.1

* Re: [PATCH] hrtimer: Reset hrtimer cpu base proper on CPU hotplug
  2018-01-29 14:20 ` [PATCH] hrtimer: Reset hrtimer cpu base proper on CPU hotplug Sebastian Andrzej Siewior
@ 2018-01-29 14:32   ` Greg KH
  0 siblings, 0 replies; 11+ messages in thread
From: Greg KH @ 2018-01-29 14:32 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: tglx, anna-maria, paulmck, peterz, stable

On Mon, Jan 29, 2018 at 03:20:32PM +0100, Sebastian Andrzej Siewior wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> commit d5421ea43d30701e03cadc56a38854c36a8b4433 upstream.
> 
> The hrtimer interrupt code contains a hang detection and mitigation
> mechanism, which prevents that a long delayed hrtimer interrupt causes a
> continuous retriggering of interrupts which prevent the system from making
> progress. If a hang is detected then the timer hardware is programmed with
> a certain delay into the future and a flag is set in the hrtimer cpu base
> which prevents newly enqueued timers from reprogramming the timer hardware
> prior to the chosen delay. The subsequent hrtimer interrupt after the delay
> clears the flag and resumes normal operation.
> 
> If such a hang happens in the last hrtimer interrupt before a CPU is
> unplugged then the hang_detected flag is set and stays that way when the
> CPU is plugged in again. At that point the timer hardware is not armed and
> it cannot be armed because the hang_detected flag is still active, so
> nothing clears that flag. As a consequence the CPU does not receive hrtimer
> interrupts and no timers expire on that CPU which results in RCU stalls and
> other malfunctions.
> 
> Clear the flag along with some other less critical members of the hrtimer
> cpu base to ensure starting from a clean state when a CPU is plugged in.
> 
> Thanks to Paul, Sebastian and Anna-Maria for their help to get down to the
> root cause of that hard to reproduce heisenbug. Once understood it's
> trivial and certainly justifies a brown paperbag.
> 
> Fixes: 41d2e4949377 ("hrtimer: Tune hrtimer_interrupt hang logic")
> Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Sebastian Sewior <bigeasy@linutronix.de>
> Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
> Cc: stable@vger.kernel.org
> Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801261447590.2067@nanos
> [bigeasy: backport to v3.18, drop ->next_timer it was introduced later]

Thanks for the backport, now queued up.

greg k-h

* Re: [PATCH] hrtimer: Reset hrtimer cpu base proper on CPU hotplug
  2018-01-30 21:03         ` Thomas Gleixner
@ 2018-01-31  0:52           ` Paul E. McKenney
  0 siblings, 0 replies; 11+ messages in thread
From: Paul E. McKenney @ 2018-01-31  0:52 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Sebastian Sewior, LKML, Anna-Maria Gleixner, Peter Zijlstra, Ingo Molnar

On Tue, Jan 30, 2018 at 10:03:17PM +0100, Thomas Gleixner wrote:
> On Mon, 29 Jan 2018, Paul E. McKenney wrote:
> 
> > On Mon, Jan 29, 2018 at 01:57:38AM -0800, Paul E. McKenney wrote:
> > > On Mon, Jan 29, 2018 at 09:20:48AM +0100, Sebastian Sewior wrote:
> > > > On 2018-01-26 14:09:17 [-0800], Paul E. McKenney wrote:
> > > > > find this one.  ;-)  But it did pass rcutorture testing for a great many
> > > > > years, didn't it?  :-/
> > > > 
> > > > It started to trigger better (or at all) on our test box with
> > > > 	modprobe kvm_intel preemption_timer=n
> > > > 
> > > > on the host kernel so maybe a completely unrelated change helped to
> > > > trigger this.
> > > 
> > > Good point!
> > > 
> > > And testing continues, currently at 108 hours of TREE01 without any
> > > waylaid timers, so looking good!  ;-)
> > > 
> > > Just kicked off another 70 hours worth.
> > 
> > And those completed without incident for a total of 178 hours.  I believe
> > we can call this one fixed.  Thank you all!!!
> > 
> > One question...  Is the patch shown below needed, or is this just yet
> > another case of me being confused?  (The lack of it is not triggering,
> > but...)
> 
> See commit 26456f87aca7

Got it, thank you!  I will remove my patch from my queue.

							Thanx, Paul

* Re: [PATCH] hrtimer: Reset hrtimer cpu base proper on CPU hotplug
  2018-01-29 23:43       ` Paul E. McKenney
@ 2018-01-30 21:03         ` Thomas Gleixner
  2018-01-31  0:52           ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: Thomas Gleixner @ 2018-01-30 21:03 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Sebastian Sewior, LKML, Anna-Maria Gleixner, Peter Zijlstra, Ingo Molnar

On Mon, 29 Jan 2018, Paul E. McKenney wrote:

> On Mon, Jan 29, 2018 at 01:57:38AM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 29, 2018 at 09:20:48AM +0100, Sebastian Sewior wrote:
> > > On 2018-01-26 14:09:17 [-0800], Paul E. McKenney wrote:
> > > > find this one.  ;-)  But it did pass rcutorture testing for a great many
> > > > years, didn't it?  :-/
> > > 
> > > It started to trigger better (or at all) on our test box with
> > > 	modprobe kvm_intel preemption_timer=n
> > > 
> > > on the host kernel so maybe a completely unrelated change helped to
> > > trigger this.
> > 
> > Good point!
> > 
> > And testing continues, currently at 108 hours of TREE01 without any
> > waylaid timers, so looking good!  ;-)
> > 
> > Just kicked off another 70 hours worth.
> 
> And those completed without incident for a total of 178 hours.  I believe
> we can call this one fixed.  Thank you all!!!
> 
> One question...  Is the patch shown below needed, or is this just yet
> another case of me being confused?  (The lack of it is not triggering,
> but...)

See commit 26456f87aca7

* Re: [PATCH] hrtimer: Reset hrtimer cpu base proper on CPU hotplug
  2018-01-29  9:57     ` Paul E. McKenney
@ 2018-01-29 23:43       ` Paul E. McKenney
  2018-01-30 21:03         ` Thomas Gleixner
  0 siblings, 1 reply; 11+ messages in thread
From: Paul E. McKenney @ 2018-01-29 23:43 UTC (permalink / raw)
  To: Sebastian Sewior
  Cc: Thomas Gleixner, LKML, Anna-Maria Gleixner, Peter Zijlstra, Ingo Molnar

On Mon, Jan 29, 2018 at 01:57:38AM -0800, Paul E. McKenney wrote:
> On Mon, Jan 29, 2018 at 09:20:48AM +0100, Sebastian Sewior wrote:
> > On 2018-01-26 14:09:17 [-0800], Paul E. McKenney wrote:
> > > find this one.  ;-)  But it did pass rcutorture testing for a great many
> > > years, didn't it?  :-/
> > 
> > It started to trigger better (or at all) on our test box with
> > 	modprobe kvm_intel preemption_timer=n
> > 
> > on the host kernel so maybe a completely unrelated change helped to
> > trigger this.
> 
> Good point!
> 
> And testing continues, currently at 108 hours of TREE01 without any
> waylaid timers, so looking good!  ;-)
> 
> Just kicked off another 70 hours worth.

And those completed without incident for a total of 178 hours.  I believe
we can call this one fixed.  Thank you all!!!

One question...  Is the patch shown below needed, or is this just yet
another case of me being confused?  (The lack of it is not triggering,
but...)

							Thanx, Paul

------------------------------------------------------------------------

commit accb0edb85526a05b934eac49658d05ea0216fc4
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Thu Dec 7 13:18:44 2017 -0800

    timers: Ensure that timer_base ->clk accounts for time offline
    
    The timer_base ->must_forward_clk is set to indicate that the next timer
    operation on that timer_base must check for passage of time.  One instance
    of time passage is when the timer wheel goes idle, and another is when
    the corresponding CPU is offline.  Note that it is not appropriate to set
    ->is_idle because that could result in IPIing an offline CPU.  Therefore,
    this commit instead sets ->must_forward_clk at CPU-offline time.
    
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index ffebcf878fba..94cce780c574 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1875,6 +1875,7 @@ int timers_dead_cpu(unsigned int cpu)
 
 		BUG_ON(old_base->running_timer);
 
+		old_base->must_forward_clk = true;
 		for (i = 0; i < WHEEL_SIZE; i++)
 			migrate_timer_list(new_base, old_base->vectors + i);
 

* Re: [PATCH] hrtimer: Reset hrtimer cpu base proper on CPU hotplug
  2018-01-29  8:20   ` Sebastian Sewior
@ 2018-01-29  9:57     ` Paul E. McKenney
  2018-01-29 23:43       ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: Paul E. McKenney @ 2018-01-29  9:57 UTC (permalink / raw)
  To: Sebastian Sewior
  Cc: Thomas Gleixner, LKML, Anna-Maria Gleixner, Peter Zijlstra, Ingo Molnar

On Mon, Jan 29, 2018 at 09:20:48AM +0100, Sebastian Sewior wrote:
> On 2018-01-26 14:09:17 [-0800], Paul E. McKenney wrote:
> > find this one.  ;-)  But it did pass rcutorture testing for a great many
> > years, didn't it?  :-/
> 
> It started to trigger better (or at all) on our test box with
> 	modprobe kvm_intel preemption_timer=n
> 
> on the host kernel so maybe a completely unrelated change helped to
> trigger this.

Good point!

And testing continues, currently at 108 hours of TREE01 without any
waylaid timers, so looking good!  ;-)

Just kicked off another 70 hours worth.

							Thanx, Paul

* Re: [PATCH] hrtimer: Reset hrtimer cpu base proper on CPU hotplug
  2018-01-26 22:09 ` Paul E. McKenney
  2018-01-28  0:53   ` Paul E. McKenney
@ 2018-01-29  8:20   ` Sebastian Sewior
  2018-01-29  9:57     ` Paul E. McKenney
  1 sibling, 1 reply; 11+ messages in thread
From: Sebastian Sewior @ 2018-01-29  8:20 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, LKML, Anna-Maria Gleixner, Peter Zijlstra, Ingo Molnar

On 2018-01-26 14:09:17 [-0800], Paul E. McKenney wrote:
> find this one.  ;-)  But it did pass rcutorture testing for a great many
> years, didn't it?  :-/

It started to trigger better (or at all) on our test box with
	modprobe kvm_intel preemption_timer=n

on the host kernel so maybe a completely unrelated change helped to
trigger this.

Sebastian

* Re: [PATCH] hrtimer: Reset hrtimer cpu base proper on CPU hotplug
  2018-01-26 22:09 ` Paul E. McKenney
@ 2018-01-28  0:53   ` Paul E. McKenney
  2018-01-29  8:20   ` Sebastian Sewior
  1 sibling, 0 replies; 11+ messages in thread
From: Paul E. McKenney @ 2018-01-28  0:53 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Sebastian Sewior, Anna-Maria Gleixner, Peter Zijlstra, Ingo Molnar

On Fri, Jan 26, 2018 at 02:09:17PM -0800, Paul E. McKenney wrote:
> On Fri, Jan 26, 2018 at 02:54:32PM +0100, Thomas Gleixner wrote:
> > The hrtimer interrupt code contains a hang detection and mitigation
> > mechanism, which prevents that a long delayed hrtimer interrupt causes a
> > continuous retriggering of interrupts which prevent the system from making
> > progress. If a hang is detected then the timer hardware is programmed with
> > a certain delay into the future and a flag is set in the hrtimer cpu base
> > which prevents newly enqueued timers from reprogramming the timer hardware
> > prior to the chosen delay. The subsequent hrtimer interrupt after the delay
> > clears the flag and resumes normal operation.
> > 
> > If such a hang happens in the last hrtimer interrupt before a CPU is
> > unplugged then the hang_detected flag is set and stays that way when the
> > CPU is plugged in again. At that point the timer hardware is not armed and
> > it cannot be armed because the hang_detected flag is still active, so
> > nothing clears that flag. As a consequence the CPU does not receive hrtimer
> > interrupts and no timers expire on that CPU which results in RCU stalls and
> > other malfunctions.
> > 
> > Clear the flag along with some other less critical members of the hrtimer
> > cpu base to ensure starting from a clean state when a CPU is plugged in.
> > 
> > Thanks to Paul, Sebastian and Anna-Maria for their help to get down to the
> > root cause of that hard to reproduce heisenbug. Once understood it's
> > trivial and certainly justifies a brown paperbag.
> 
> Thank you very much, and I do know that feeling!  After reading the
> commit log, I feel significantly less incompetent for having failed to
> find this one.  ;-)  But it did pass rcutorture testing for a great many
> years, didn't it?  :-/
> 
> I have started an eight-hour seven-way test on the dreaded rcutorture
> TREE01 scenario.  In the meantime, off to the train!

And bozo here forgot to disable tracing, so the runs take much longer
than the stated time.  And because I applied against v4.15-rc9, which
lacks the recent code to suppress stalls when dumping the trace log,
all runs get RCU CPU stall warnings.  :-/

But I can filter out those tracing-induced stall warnings easily enough,
and thus far there have been 84 successful 30-minute runs out of 112
total, for no failures in 42 hours of TREE01 execution.  Given the base
failure rate of 0.33 per hour, the probability of this happening by
chance is something like ten to the minus sixth power, so:

Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
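
The "ten to the minus sixth" figure is consistent with a simple Poisson
model. This is an assumption for illustration; the message does not say
which model was used. With a failure rate of 0.33 per hour and 42
failure-free hours, the chance of seeing zero failures if the bug were
still present works out as:

```latex
P(\text{0 failures}) = e^{-\lambda t}
                     = e^{-0.33 \times 42}
                     = e^{-13.86}
                     \approx 9.6 \times 10^{-7}
                     \approx 10^{-6}
```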

I will be running longer runs without tracing, but looking extremely
good thus far!  Thank you all very much!!!

							Thanx, Paul

* Re: [PATCH] hrtimer: Reset hrtimer cpu base proper on CPU hotplug
  2018-01-26 13:54 Thomas Gleixner
@ 2018-01-26 22:09 ` Paul E. McKenney
  2018-01-28  0:53   ` Paul E. McKenney
  2018-01-29  8:20   ` Sebastian Sewior
  0 siblings, 2 replies; 11+ messages in thread
From: Paul E. McKenney @ 2018-01-26 22:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Sebastian Sewior, Anna-Maria Gleixner, Peter Zijlstra, Ingo Molnar

On Fri, Jan 26, 2018 at 02:54:32PM +0100, Thomas Gleixner wrote:
> The hrtimer interrupt code contains a hang detection and mitigation
> mechanism, which prevents that a long delayed hrtimer interrupt causes a
> continuous retriggering of interrupts which prevent the system from making
> progress. If a hang is detected then the timer hardware is programmed with
> a certain delay into the future and a flag is set in the hrtimer cpu base
> which prevents newly enqueued timers from reprogramming the timer hardware
> prior to the chosen delay. The subsequent hrtimer interrupt after the delay
> clears the flag and resumes normal operation.
> 
> If such a hang happens in the last hrtimer interrupt before a CPU is
> unplugged then the hang_detected flag is set and stays that way when the
> CPU is plugged in again. At that point the timer hardware is not armed and
> it cannot be armed because the hang_detected flag is still active, so
> nothing clears that flag. As a consequence the CPU does not receive hrtimer
> interrupts and no timers expire on that CPU which results in RCU stalls and
> other malfunctions.
> 
> Clear the flag along with some other less critical members of the hrtimer
> cpu base to ensure starting from a clean state when a CPU is plugged in.
> 
> Thanks to Paul, Sebastian and Anna-Maria for their help to get down to the
> root cause of that hard to reproduce heisenbug. Once understood it's
> trivial and certainly justifies a brown paperbag.

Thank you very much, and I do know that feeling!  After reading the
commit log, I feel significantly less incompetent for having failed to
find this one.  ;-)  But it did pass rcutorture testing for a great many
years, didn't it?  :-/

I have started an eight-hour seven-way test on the dreaded rcutorture
TREE01 scenario.  In the meantime, off to the train!

							Thanx, Paul

> Fixes: 41d2e4949377 ("hrtimer: Tune hrtimer_interrupt hang logic")
> Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: stable@vger.kernel.org
> ---
>  kernel/time/hrtimer.c |    3 +++
>  1 file changed, 3 insertions(+)
> 
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -655,7 +655,9 @@ static void hrtimer_reprogram(struct hrt
>  static inline void hrtimer_init_hres(struct hrtimer_cpu_base *base)
>  {
>  	base->expires_next = KTIME_MAX;
> +	base->hang_detected = 0;
>  	base->hres_active = 0;
> +	base->next_timer = NULL;
>  }
> 
>  /*
> @@ -1589,6 +1591,7 @@ int hrtimers_prepare_cpu(unsigned int cp
>  		timerqueue_init_head(&cpu_base->clock_base[i].active);
>  	}
> 
> +	cpu_base->active_bases = 0;
>  	cpu_base->cpu = cpu;
>  	hrtimer_init_hres(cpu_base);
>  	return 0;
> 

* [PATCH] hrtimer: Reset hrtimer cpu base proper on CPU hotplug
@ 2018-01-26 13:54 Thomas Gleixner
  2018-01-26 22:09 ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: Thomas Gleixner @ 2018-01-26 13:54 UTC (permalink / raw)
  To: LKML
  Cc: Paul E. McKenney, Sebastian Sewior, Anna-Maria Gleixner,
	Peter Zijlstra, Ingo Molnar

The hrtimer interrupt code contains a hang detection and mitigation
mechanism, which prevents that a long delayed hrtimer interrupt causes a
continuous retriggering of interrupts which prevent the system from making
progress. If a hang is detected then the timer hardware is programmed with
a certain delay into the future and a flag is set in the hrtimer cpu base
which prevents newly enqueued timers from reprogramming the timer hardware
prior to the chosen delay. The subsequent hrtimer interrupt after the delay
clears the flag and resumes normal operation.

If such a hang happens in the last hrtimer interrupt before a CPU is
unplugged then the hang_detected flag is set and stays that way when the
CPU is plugged in again. At that point the timer hardware is not armed and
it cannot be armed because the hang_detected flag is still active, so
nothing clears that flag. As a consequence the CPU does not receive hrtimer
interrupts and no timers expire on that CPU which results in RCU stalls and
other malfunctions.

Clear the flag along with some other less critical members of the hrtimer
cpu base to ensure starting from a clean state when a CPU is plugged in.

Thanks to Paul, Sebastian and Anna-Maria for their help to get down to the
root cause of that hard to reproduce heisenbug. Once understood it's
trivial and certainly justifies a brown paperbag.

Fixes: 41d2e4949377 ("hrtimer: Tune hrtimer_interrupt hang logic")
Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
---
 kernel/time/hrtimer.c |    3 +++
 1 file changed, 3 insertions(+)

--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -655,7 +655,9 @@ static void hrtimer_reprogram(struct hrt
 static inline void hrtimer_init_hres(struct hrtimer_cpu_base *base)
 {
 	base->expires_next = KTIME_MAX;
+	base->hang_detected = 0;
 	base->hres_active = 0;
+	base->next_timer = NULL;
 }

 /*
@@ -1589,6 +1591,7 @@ int hrtimers_prepare_cpu(unsigned int cp
 		timerqueue_init_head(&cpu_base->clock_base[i].active);
 	}

+	cpu_base->active_bases = 0;
 	cpu_base->cpu = cpu;
 	hrtimer_init_hres(cpu_base);
 	return 0;
