linux-kernel.vger.kernel.org archive mirror
* [PATCH V1 0/1] tick: broadcast-hrtimer: Fix a race in bc_set_next
@ 2019-09-18 14:41 Balasubramani Vivekanandan
  2019-09-18 14:41 ` [PATCH V1 1/1] " Balasubramani Vivekanandan
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Balasubramani Vivekanandan @ 2019-09-18 14:41 UTC (permalink / raw)
  To: fweisbec, tglx, mingo; +Cc: balasubramani_vivekanandan, erosca, linux-kernel

I was investigating an RCU stall warning on the ARM64 Renesas Rcar3
platform. On analysis I found that the RCU stall warning occurred
because the rcu_preempt kthread was starved of CPU time. rcu_preempt was
blocked in the function schedule_timeout() and never woken up. On
further investigation I found that local timer interrupts were not
happening on the CPU where the rcu_preempt kthread was blocked, so
rcu_preempt was not woken up after the timeout.

I continued my analysis to debug why the timer failed on that CPU. I
found that the timer failure happens when the CPU goes through an idle
state cycle. When the CPU enters the idle state, it subscribes to the
tick broadcast clock and shuts down its local timer. Then, on exit from
the idle state, the local timer is programmed to fire interrupts. But I
found that during the error scenario, the CPU fails to program the local
timer on exit from the idle state. The code below, in
__tick_broadcast_oneshot_control(), is where the idle exit path fails to
program the timer hardware:

now = ktime_get();
if (dev->next_event <= now) {
	cpumask_set_cpu(cpu, tick_broadcast_force_mask);
		goto out;
}

The value in next_event will be earlier than the current time because
the tick broadcast clock did not wake up the CPU at its subscribed
timeout. When the CPU is later woken up by some other event, this
condition arises: any further timeout requests by any task on that CPU
might fail to program the timer hardware, because the value in
next_event is earlier than the current time.
I then focused on why the tick broadcast clock failed to wake up the
CPU. I noticed a race condition in the hrtimer based tick broadcast
clock, which results in the tick broadcast hrtimer never being
restarted. I have created a patch to fix the race condition. Please
review.

Balasubramani Vivekanandan (1):
  tick: broadcast-hrtimer: Fix a race in bc_set_next

 kernel/time/tick-broadcast-hrtimer.c | 58 ++++++++++++++++++++++------
 kernel/time/tick-broadcast.c         |  2 +
 2 files changed, 48 insertions(+), 12 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH V1 1/1] tick: broadcast-hrtimer: Fix a race in bc_set_next
  2019-09-18 14:41 [PATCH V1 0/1] tick: broadcast-hrtimer: Fix a race in bc_set_next Balasubramani Vivekanandan
@ 2019-09-18 14:41 ` Balasubramani Vivekanandan
  2019-09-23 19:51   ` Thomas Gleixner
  2019-09-20 10:39 ` [PATCH V1 0/1] " Eugeniu Rosca
  2019-09-20 10:58 ` Eugeniu Rosca
  2 siblings, 1 reply; 16+ messages in thread
From: Balasubramani Vivekanandan @ 2019-09-18 14:41 UTC (permalink / raw)
  To: fweisbec, tglx, mingo; +Cc: balasubramani_vivekanandan, erosca, linux-kernel

When a CPU requests broadcasting, before starting the tick broadcast
hrtimer, bc_set_next() checks whether the timer callback (bc_handler) is
active using hrtimer_try_to_cancel(). But hrtimer_try_to_cancel() does
not provide the required synchronization when the callback is active on
another core.
The callback could have already executed
tick_handle_oneshot_broadcast() and returned, but there is still a small
window in which hrtimer_try_to_cancel() returns -1. In that case
bc_set_next() returns without doing anything, while the next_event of
the tick broadcast clock device has already been set to a timeout value.

In the race condition diagram below, CPU #1 is running the timer
callback and CPU #2 is entering the idle state and therefore calls
bc_set_next().

In the worst case, next_event will contain an expiry time but the
hrtimer will not be started; this happens when the racing callback
returns HRTIMER_NORESTART. The hrtimer might never recover if all
further requests from CPUs subscribing to tick broadcast have timeouts
greater than the next_event of the tick broadcast clock device. This
leads to a cascade of failures, finally noticed as RCU stall warnings.

Here is a depiction of the race condition:

CPU #1 (Running timer callback)                   CPU #2 (Enter idle
                                                  and subscribe to
                                                  tick broadcast)
---------------------                             ---------------------

__run_hrtimer()                                   tick_broadcast_enter()

  bc_handler()                                      __tick_broadcast_oneshot_control()

    tick_handle_oneshot_broadcast()

      raw_spin_lock(&tick_broadcast_lock);

      dev->next_event = KTIME_MAX;                  //wait for tick_broadcast_lock
      //next_event for tick broadcast clock
      set to KTIME_MAX since no other cores
      subscribed to tick broadcasting

      raw_spin_unlock(&tick_broadcast_lock);

    if (dev->next_event == KTIME_MAX)
      return HRTIMER_NORESTART
    // callback function exits without
       restarting the hrtimer                      //tick_broadcast_lock acquired
                                                   raw_spin_lock(&tick_broadcast_lock);

                                                   tick_broadcast_set_event()

                                                     clockevents_program_event()

                                                       dev->next_event = expires;

                                                       bc_set_next()

                                                         hrtimer_try_to_cancel()
                                                         //returns -1 since the timer
                                                         callback is active. Exits without
                                                         restarting the timer
  cpu_base->running = NULL;

In the fix, if hrtimer_try_to_cancel() reports that the callback is
active, an additional check is added in bc_set_next() to determine the
state of the timer callback. If the timer callback has already released
the tick_broadcast_lock, then bc_set_next() can safely call
hrtimer_cancel() to cancel the timer and restart it.

When there are no more CPUs subscribed to broadcast, the timer callback
might not set the expiry time for the hrtimer. Therefore the broadcast
handler (tick_handle_oneshot_broadcast()) is modified to set the state
of the broadcast clock to CLOCK_EVT_STATE_ONESHOT_STOPPED, which in turn
sets the expiry time of the hrtimer to KTIME_MAX.

Signed-off-by: Balasubramani Vivekanandan <balasubramani_vivekanandan@mentor.com>
---
 kernel/time/tick-broadcast-hrtimer.c | 58 ++++++++++++++++++++++------
 kernel/time/tick-broadcast.c         |  2 +
 2 files changed, 48 insertions(+), 12 deletions(-)

diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c
index c1f5bb590b5e..09644403a320 100644
--- a/kernel/time/tick-broadcast-hrtimer.c
+++ b/kernel/time/tick-broadcast-hrtimer.c
@@ -42,18 +42,13 @@ static int bc_shutdown(struct clock_event_device *evt)
  */
 static int bc_set_next(ktime_t expires, struct clock_event_device *bc)
 {
+	ktime_t now;
+	ktime_t bc_expiry;
 	int bc_moved;
+
 	/*
-	 * We try to cancel the timer first. If the callback is on
-	 * flight on some other cpu then we let it handle it. If we
-	 * were able to cancel the timer nothing can rearm it as we
-	 * own broadcast_lock.
-	 *
-	 * However we can also be called from the event handler of
-	 * ce_broadcast_hrtimer itself when it expires. We cannot
-	 * restart the timer because we are in the callback, but we
-	 * can set the expiry time and let the callback return
-	 * HRTIMER_RESTART.
+	 * We try to cancel the timer first. If we were able to cancel
+	 * the timer nothing can rearm it as we own broadcast_lock.
 	 *
 	 * Since we are in the idle loop at this point and because
 	 * hrtimer_{start/cancel} functions call into tracing,
@@ -72,9 +67,47 @@ static int bc_set_next(ktime_t expires, struct clock_event_device *bc)
 	if (bc_moved) {
 		/* Bind the "device" to the cpu */
 		bc->bound_on = smp_processor_id();
-	} else if (bc->bound_on == smp_processor_id()) {
-		hrtimer_set_expires(&bctimer, expires);
+	} else {
+		if (bc->bound_on == smp_processor_id()) {
+			/* We are called from the callback timer itself, we
+			 * just set expiry time and let the callback return
+			 * HRTIMER_RESTART
+			 */
+			hrtimer_set_expires(&bctimer, expires);
+		} else {
+			now = ktime_get();
+			bc_expiry = hrtimer_get_expires(&bctimer);
+
+			/* If the expiry time of the tick broadcast hrtimer is
+			 * in future means that hrtimer callback is running on
+			 * another core which has set the expiry time for the
+			 * broadcast timer. We are holding the broadcast_lock
+			 * means the timer callback has already released it.
+			 * We can now safely cancel the timer and restart it.
+			 */
+			if (bc_expiry > now) {
+				RCU_NONIDLE(
+					{
+						hrtimer_cancel(&bctimer);
+						hrtimer_start(&bctimer, expires,
+							      HRTIMER_MODE_ABS_PINNED_HARD);
+					}
+				);
+				bc->bound_on = smp_processor_id();
+			}
+			/* Else the hrtimer callback is waiting on the other
+			 * core for the broadcast_lock. There is no need for
+			 * us to do anything
+			 */
+		}
 	}
+
+	return 0;
+}
+
+static int bc_set_oneshot_stopped(struct clock_event_device *bc)
+{
+	hrtimer_set_expires(&bctimer, KTIME_MAX);
 	return 0;
 }
 
@@ -85,6 +118,7 @@ static struct clock_event_device ce_broadcast_hrtimer = {
 	.features		= CLOCK_EVT_FEAT_ONESHOT |
 				  CLOCK_EVT_FEAT_KTIME |
 				  CLOCK_EVT_FEAT_HRTIMER,
+	.set_state_oneshot_stopped = bc_set_oneshot_stopped,
 	.rating			= 0,
 	.bound_on		= -1,
 	.min_delta_ns		= 1,
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index e51778c312f1..28f04527ff54 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -672,6 +672,8 @@ static void tick_handle_oneshot_broadcast(struct clock_event_device *dev)
 	 */
 	if (next_event != KTIME_MAX)
 		tick_broadcast_set_event(dev, next_cpu, next_event);
+	else
+		clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT_STOPPED);
 
 	raw_spin_unlock(&tick_broadcast_lock);
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH V1 0/1] tick: broadcast-hrtimer: Fix a race in bc_set_next
  2019-09-18 14:41 [PATCH V1 0/1] tick: broadcast-hrtimer: Fix a race in bc_set_next Balasubramani Vivekanandan
  2019-09-18 14:41 ` [PATCH V1 1/1] " Balasubramani Vivekanandan
@ 2019-09-20 10:39 ` Eugeniu Rosca
  2019-09-20 10:58 ` Eugeniu Rosca
  2 siblings, 0 replies; 16+ messages in thread
From: Eugeniu Rosca @ 2019-09-20 10:39 UTC (permalink / raw)
  To: Balasubramani Vivekanandan, fweisbec, tglx, mingo
  Cc: erosca, linux-kernel, Eugeniu Rosca

+linux-renesas-soc

In hope that the issue reported in [1] was seen by other members of
Renesas community.

[1] https://lkml.org/lkml/2019/9/18/711

On Wed, Sep 18, 2019 at 04:41:37PM +0200, Balasubramani Vivekanandan wrote:
> I was investigating an RCU stall warning on the ARM64 Renesas Rcar3
> platform. On analysis I found that the RCU stall warning occurred
> because the rcu_preempt kthread was starved of CPU time. rcu_preempt was
> blocked in the function schedule_timeout() and never woken up. On
> further investigation I found that local timer interrupts were not
> happening on the CPU where the rcu_preempt kthread was blocked, so
> rcu_preempt was not woken up after the timeout.
> 
> I continued my analysis to debug why the timer failed on that CPU. I
> found that the timer failure happens when the CPU goes through an idle
> state cycle. When the CPU enters the idle state, it subscribes to the
> tick broadcast clock and shuts down its local timer. Then, on exit from
> the idle state, the local timer is programmed to fire interrupts. But I
> found that during the error scenario, the CPU fails to program the local
> timer on exit from the idle state. The code below, in
> __tick_broadcast_oneshot_control(), is where the idle exit path fails to
> program the timer hardware:
> 
> now = ktime_get();
> if (dev->next_event <= now) {
> 	cpumask_set_cpu(cpu, tick_broadcast_force_mask);
> 		goto out;
> }
> 
> The value in next_event will be earlier than the current time because
> the tick broadcast clock did not wake up the CPU at its subscribed
> timeout. When the CPU is later woken up by some other event, this
> condition arises: any further timeout requests by any task on that CPU
> might fail to program the timer hardware, because the value in
> next_event is earlier than the current time.
> 
> I then focused on why the tick broadcast clock failed to wake up the
> CPU. I noticed a race condition in the hrtimer based tick broadcast
> clock, which results in the tick broadcast hrtimer never being
> restarted. I have created a patch to fix the race condition. Please
> review.
> 
> Balasubramani Vivekanandan (1):
>   tick: broadcast-hrtimer: Fix a race in bc_set_next
> 
>  kernel/time/tick-broadcast-hrtimer.c | 58 ++++++++++++++++++++++------
>  kernel/time/tick-broadcast.c         |  2 +
>  2 files changed, 48 insertions(+), 12 deletions(-)
> 
> -- 
> 2.17.1
> 

-- 
Best Regards,
Eugeniu

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V1 0/1] tick: broadcast-hrtimer: Fix a race in bc_set_next
  2019-09-18 14:41 [PATCH V1 0/1] tick: broadcast-hrtimer: Fix a race in bc_set_next Balasubramani Vivekanandan
  2019-09-18 14:41 ` [PATCH V1 1/1] " Balasubramani Vivekanandan
  2019-09-20 10:39 ` [PATCH V1 0/1] " Eugeniu Rosca
@ 2019-09-20 10:58 ` Eugeniu Rosca
  2 siblings, 0 replies; 16+ messages in thread
From: Eugeniu Rosca @ 2019-09-20 10:58 UTC (permalink / raw)
  To: Balasubramani Vivekanandan, linux-renesas-soc
  Cc: fweisbec, tglx, mingo, erosca, linux-kernel, Eugeniu Rosca

+linux-renesas-soc

In hope that the issue reported in [1] was seen by other members of
Renesas community.

[1] https://lkml.org/lkml/2019/9/18/711

On Wed, Sep 18, 2019 at 04:41:37PM +0200, Balasubramani Vivekanandan wrote:
> I was investigating an RCU stall warning on the ARM64 Renesas Rcar3
> platform. On analysis I found that the RCU stall warning occurred
> because the rcu_preempt kthread was starved of CPU time. rcu_preempt was
> blocked in the function schedule_timeout() and never woken up. On
> further investigation I found that local timer interrupts were not
> happening on the CPU where the rcu_preempt kthread was blocked, so
> rcu_preempt was not woken up after the timeout.
> 
> I continued my analysis to debug why the timer failed on that CPU. I
> found that the timer failure happens when the CPU goes through an idle
> state cycle. When the CPU enters the idle state, it subscribes to the
> tick broadcast clock and shuts down its local timer. Then, on exit from
> the idle state, the local timer is programmed to fire interrupts. But I
> found that during the error scenario, the CPU fails to program the local
> timer on exit from the idle state. The code below, in
> __tick_broadcast_oneshot_control(), is where the idle exit path fails to
> program the timer hardware:
> 
> now = ktime_get();
> if (dev->next_event <= now) {
> 	cpumask_set_cpu(cpu, tick_broadcast_force_mask);
> 		goto out;
> }
> 
> The value in next_event will be earlier than the current time because
> the tick broadcast clock did not wake up the CPU at its subscribed
> timeout. When the CPU is later woken up by some other event, this
> condition arises: any further timeout requests by any task on that CPU
> might fail to program the timer hardware, because the value in
> next_event is earlier than the current time.
> 
> I then focused on why the tick broadcast clock failed to wake up the
> CPU. I noticed a race condition in the hrtimer based tick broadcast
> clock, which results in the tick broadcast hrtimer never being
> restarted. I have created a patch to fix the race condition. Please
> review.
> 
> Balasubramani Vivekanandan (1):
>   tick: broadcast-hrtimer: Fix a race in bc_set_next
> 
>  kernel/time/tick-broadcast-hrtimer.c | 58 ++++++++++++++++++++++------
>  kernel/time/tick-broadcast.c         |  2 +
>  2 files changed, 48 insertions(+), 12 deletions(-)
> 
> -- 
> 2.17.1
> 

-- 
Best Regards,
Eugeniu

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V1 1/1] tick: broadcast-hrtimer: Fix a race in bc_set_next
  2019-09-18 14:41 ` [PATCH V1 1/1] " Balasubramani Vivekanandan
@ 2019-09-23 19:51   ` Thomas Gleixner
  2019-09-24 17:14     ` Eugeniu Rosca
  2019-09-25 11:55     ` [PATCH V2 " Balasubramani Vivekanandan
  0 siblings, 2 replies; 16+ messages in thread
From: Thomas Gleixner @ 2019-09-23 19:51 UTC (permalink / raw)
  To: Balasubramani Vivekanandan
  Cc: Frederic Weisbecker, Ingo Molnar, erosca, LKML, Peter Zijlstra

On Wed, 18 Sep 2019, Balasubramani Vivekanandan wrote:
> 
> When there are no more CPUs subscribed to broadcast, the timer callback
> might not set the expiry time for the hrtimer. Therefore the broadcast
> handler (tick_handle_oneshot_broadcast()) is modified to set the state
> of the broadcast clock to CLOCK_EVT_STATE_ONESHOT_STOPPED, which in
> turn sets the expiry time of the hrtimer to KTIME_MAX.

That's an ugly layering violation, really.

Aside from that, the whole try-to-cancel logic and the comment that the
hrtimer cannot be armed from within the callback are wrong. All of this
can go away.

Find a completely untested patch below.

Thanks,

	tglx

8<---------------------
--- a/kernel/time/tick-broadcast-hrtimer.c
+++ b/kernel/time/tick-broadcast-hrtimer.c
@@ -42,39 +42,35 @@ static int bc_shutdown(struct clock_even
  */
 static int bc_set_next(ktime_t expires, struct clock_event_device *bc)
 {
-	int bc_moved;
 	/*
-	 * We try to cancel the timer first. If the callback is on
-	 * flight on some other cpu then we let it handle it. If we
-	 * were able to cancel the timer nothing can rearm it as we
-	 * own broadcast_lock.
+	 * This is called either from enter/exit idle code or from the
+	 * broadcast handler. In all cases tick_broadcast_lock is held.
 	 *
-	 * However we can also be called from the event handler of
-	 * ce_broadcast_hrtimer itself when it expires. We cannot
-	 * restart the timer because we are in the callback, but we
-	 * can set the expiry time and let the callback return
-	 * HRTIMER_RESTART.
+	 * hrtimer_cancel() cannot be called here neither from the
+	 * broadcast handler nor from the enter/exit idle code. The idle
+	 * code can run into the problem described in bc_shutdown() and the
+	 * broadcast handler cannot wait for itself to complete for obvious
+	 * reasons.
 	 *
-	 * Since we are in the idle loop at this point and because
-	 * hrtimer_{start/cancel} functions call into tracing,
-	 * calls to these functions must be bound within RCU_NONIDLE.
+	 * Each caller tries to arm the hrtimer on its own CPU, but if the
+	 * handler is currently running hrtimer_start() does not move
+	 * it. It keeps it on the CPU on which it is assigned at the
+	 * moment.
 	 */
-	RCU_NONIDLE(
-		{
-			bc_moved = hrtimer_try_to_cancel(&bctimer) >= 0;
-			if (bc_moved) {
-				hrtimer_start(&bctimer, expires,
-					      HRTIMER_MODE_ABS_PINNED_HARD);
-			}
-		}
-	);
-
-	if (bc_moved) {
-		/* Bind the "device" to the cpu */
-		bc->bound_on = smp_processor_id();
-	} else if (bc->bound_on == smp_processor_id()) {
-		hrtimer_set_expires(&bctimer, expires);
-	}
+	RCU_NONIDLE( {
+		hrtimer_start(&bctimer, expires, HRTIMER_MODE_ABS_PINNED_HARD);
+		/*
+		 * The core tick broadcast mode expects bc->bound_on to be set
+		 * correctly to prevent a CPU which has the broadcast hrtimer
+		 * armed from going deep idle.
+		 *
+		 * As tick_broadcast_lock is held, nothing can change the
+		 * cpu base which was just established in hrtimer_start()
+		 * above. So the below access is safe even without holding
+		 * the hrtimer base lock.
+		 */
+		bc->bound_on = bctimer.base->cpu_base->cpu;
+	} );
 	return 0;
 }
 
@@ -100,10 +96,6 @@ static enum hrtimer_restart bc_handler(s
 {
 	ce_broadcast_hrtimer.event_handler(&ce_broadcast_hrtimer);
 
-	if (clockevent_state_oneshot(&ce_broadcast_hrtimer))
-		if (ce_broadcast_hrtimer.next_event != KTIME_MAX)
-			return HRTIMER_RESTART;
-
 	return HRTIMER_NORESTART;
 }
 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V1 1/1] tick: broadcast-hrtimer: Fix a race in bc_set_next
  2019-09-23 19:51   ` Thomas Gleixner
@ 2019-09-24 17:14     ` Eugeniu Rosca
  2019-09-25 11:55     ` [PATCH V2 " Balasubramani Vivekanandan
  1 sibling, 0 replies; 16+ messages in thread
From: Eugeniu Rosca @ 2019-09-24 17:14 UTC (permalink / raw)
  To: Thomas Gleixner, Balasubramani Vivekanandan
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Peter Zijlstra,
	linux-renesas-soc, Eugeniu Rosca, Eugeniu Rosca

On Mon, Sep 23, 2019 at 09:51:54PM +0200, Thomas Gleixner wrote:
> On Wed, 18 Sep 2019, Balasubramani Vivekanandan wrote:
> > 
> > When there are no more CPUs subscribed to broadcast, the timer callback
> > might not set the expiry time for the hrtimer. Therefore the broadcast
> > handler (tick_handle_oneshot_broadcast()) is modified to set the state
> > of the broadcast clock to CLOCK_EVT_STATE_ONESHOT_STOPPED, which in
> > turn sets the expiry time of the hrtimer to KTIME_MAX.
> 
> That's an ugly layering violation, really.
> 
> Aside from that, the whole try-to-cancel logic and the comment that the
> hrtimer cannot be armed from within the callback are wrong. All of this
> can go away.
> 
> Find a completely untested patch below [1]
> 
> Thanks,
> 
> 	tglx
> 

jFTR, for the sake of anyone who lives on older kernels and happens to
deal with the same issue as reported by Balasubramani in this thread:
after enabling cpuidle [2-3] on R-Car H3ULCB (which seems to be a hard
prerequisite for reproducing this issue), I got the backtraces below
[4,5,6] on recent v5.3-12025-g4c07e2ddab5b using various intermittent
stress-ng loads. I believe they portray the issue reported by Bala.

I was not able to reproduce the RCU stalls after applying [1].

[1] https://lkml.org/lkml/2019/9/23/803
[2] https://patchwork.kernel.org/patch/10769701/
    ("[v2,1/5] arm64: dts: r8a7795: Add cpuidle support for CA57 cores")
[3] https://patchwork.kernel.org/patch/10769689/
    ("[v2,2/5] arm64: dts: r8a7795: Add cpuidle support for CA53 cores")

[4] RCU stall #1 reproduced on v5.3-12025-g4c07e2ddab5b + patches [2-3]

[  701.628630] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  701.635259] rcu:     1-...!: (0 ticks this GP) idle=5b4/0/0x0 softirq=178953/178953 fqs=0 last_accelerate: c1a9/209c, Nonlazy posted: ..D
[  701.648195] rcu:     3-...!: (667 GPs behind) idle=1a0/0/0x0 softirq=188640/188641 fqs=0 last_accelerate: c1a6/20a0, Nonlazy posted: .LD
[  701.660666] rcu:     5-...!: (0 ticks this GP) idle=674/0/0x0 softirq=39323/39323 fqs=0 last_accelerate: c380/20a0, Nonlazy posted: ..D
[  701.673046] rcu:     6-...!: (0 ticks this GP) idle=628/0/0x0 softirq=27147/27147 fqs=0 last_accelerate: c00c/20a0, Nonlazy posted: ..D
[  701.685419]  (detected by 2, t=25012 jiffies, g=392945, q=968)
[  701.691487] Task dump for CPU 1:
[  701.694873] swapper/1       R  running task        0     0      1 0x00000028
[  701.702227] Call trace:
[  701.704822]  __switch_to+0x2c4/0x2e8
[  701.708578]  num_spec.70129+0x177e7d/0x1d336c
[  701.713130] Task dump for CPU 3:
[  701.716514] swapper/3       R  running task        0     0      1 0x00000028
[  701.723865] Call trace:
[  701.726445]  __switch_to+0x2c4/0x2e8
[  701.730192]  0x3
[  701.732140] Task dump for CPU 5:
[  701.735524] swapper/5       R  running task        0     0      1 0x00000028
[  701.742874] Call trace:
[  701.745455]  __switch_to+0x2c4/0x2e8
[  701.749204]  num_spec.70129+0x177e7d/0x1d336c
[  701.753756] Task dump for CPU 6:
[  701.757139] swapper/6       R  running task        0     0      1 0x00000028
[  701.764489] Call trace:
[  701.767070]  __switch_to+0x2c4/0x2e8
[  701.770819]  num_spec.70129+0x177e7d/0x1d336c
[  701.775391] rcu: rcu_preempt kthread starved for 25068 jiffies! g392945 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=5
[  701.786394] rcu: RCU grace-period kthread stack dump:
[  701.791652] rcu_preempt     I    0    10      2 0x00000028
[  701.797369] Call trace:
[  701.799944]  __switch_to+0x2c4/0x2e8
[  701.803688]  __schedule+0xe1c/0xf58
[  701.807337]  schedule+0xc0/0x104
[  701.810721]  schedule_timeout+0x298/0x6ec
[  701.814911]  rcu_gp_kthread+0xf8c/0xf94
[  701.818923]  kthread+0x1c4/0x1dc
[  701.822304]  ret_from_fork+0x10/0x18
[  701.826061] Kernel panic - not syncing: RCU Stall
[  701.830964] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.3.0+ #18
[  701.837201] Hardware name: Renesas H3ULCB Kingfisher board based on r8a7795 ES2.0+ (DT)
[  701.845503] Call trace:
[  701.848072]  dump_backtrace+0x0/0x1ec
[  701.851893]  show_stack+0x24/0x30
[  701.855361]  dump_stack+0xcc/0x128
[  701.858915]  panic+0x288/0x5a4
[  701.862112]  panic_on_rcu_stall+0x2c/0x34
[  701.866293]  rcu_sched_clock_irq+0xe2c/0x1554
[  701.870835]  update_process_times+0x34/0x60
[  701.875196]  tick_sched_handle+0x80/0x98
[  701.879286]  tick_sched_timer+0x64/0xbc
[  701.883289]  __hrtimer_run_queues+0x5c8/0xb94
[  701.887829]  hrtimer_interrupt+0x1ec/0x454
[  701.892107]  arch_timer_handler_phys+0x40/0x58
[  701.896737]  handle_percpu_devid_irq+0x170/0x6f4
[  701.901548]  generic_handle_irq+0x3c/0x54
[  701.905729]  __handle_domain_irq+0x114/0x118
[  701.910178]  gic_handle_irq+0x70/0xac
[  701.913999]  el1_irq+0xbc/0x180
[  701.917289]  __asan_load8+0x40/0xb0
[  701.920934]  tick_check_broadcast_expired+0x70/0x8c
[  701.926011]  cpu_idle_poll+0x1ec/0x380
[  701.929925]  do_idle+0x188/0x354
[  701.933301]  cpu_startup_entry+0x28/0x2c
[  701.937396]  secondary_start_kernel+0x344/0x374
[  701.942184] SMP: stopping secondary CPUs
[  701.946316] Dumping ftrace buffer:
[  701.950694]    (ftrace buffer empty)
[  701.954445] Kernel Offset: disabled
[  701.958093] CPU features: 0x0002,21006004
[  701.962274] Memory Limit: none
[  701.965552] ---[ end Kernel panic - not syncing: RCU Stall ]---

[5] RCU stall #2 reproduced on v5.3-12025-g4c07e2ddab5b + patches [2-3]

[ 1193.212038] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:max-time=5 run; sleep 5; done                                                                                                                                    
[ 1193.218757] rcu:     4-...!: (0 ticks this GP) idle=17c/0/0x0 softirq=130361/130361 fqs=1 last_accelerate: 9cdc/a0d9, Nonlazy posted: .LD
[ 1193.231958]  (detected by 3, t=25021 jiffies, g=774405, q=58539)
[ 1193.238238] Task dump for CPU 4:
[ 1193.241641] swapper/4       R  running task        0     0      1 0x00000028
[ 1193.249021] Call trace:
[ 1193.251631]  __switch_to+0x2c4/0x2e8
[ 1193.255395]  0x1fffe000db4d0fe4
[ 1193.258720] rcu: rcu_preempt kthread starved for 7974 jiffies! g774405 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=4
[ 1193.270900] rcu: RCU grace-period kthread stack dump:
[ 1193.276542] rcu_preempt     I    0    10      2 0x00000028
[ 1193.282278] Call trace:
[ 1193.284864]  __switch_to+0x2c4/0x2e8
[ 1193.288619]  __schedule+0xe1c/0xf58
[ 1193.292278]  schedule+0xc0/0x104
[ 1193.295671]  schedule_timeout+0x298/0x6ec
[ 1193.299875]  rcu_gp_kthread+0xf8c/0xf94
[ 1193.303901]  kthread+0x1c4/0x1dc
[ 1193.307295]  ret_from_fork+0x10/0x18
[ 1193.311069] Kernel panic - not syncing: RCU Stall
[ 1193.315986] CPU: 3 PID: 2892 Comm: bash Not tainted 5.3.0+ #18
[ 1193.322056] Hardware name: Renesas H3ULCB Kingfisher board based on r8a7795 ES2.0+ (DT)
[ 1193.330372] Call trace:
[ 1193.332951]  dump_backtrace+0x0/0x1ec
[ 1193.336782]  show_stack+0x24/0x30
[ 1193.340259]  dump_stack+0xcc/0x128
[ 1193.343822]  panic+0x288/0x5a4
[ 1193.347026]  panic_on_rcu_stall+0x2c/0x34
[ 1193.351217]  rcu_sched_clock_irq+0xe2c/0x1554
[ 1193.355770]  update_process_times+0x34/0x60
[ 1193.360141]  tick_sched_handle+0x80/0x98
[ 1193.364240]  tick_sched_timer+0x64/0xbc
[ 1193.368252]  __hrtimer_run_queues+0x5c8/0xb94
[ 1193.372804]  hrtimer_interrupt+0x1ec/0x454
[ 1193.377092]  arch_timer_handler_phys+0x40/0x58
[ 1193.381734]  handle_percpu_devid_irq+0x170/0x6f4
[ 1193.386558]  generic_handle_irq+0x3c/0x54
[ 1193.390749]  __handle_domain_irq+0x114/0x118
[ 1193.395208]  gic_handle_irq+0x70/0xac
[ 1193.399037]  el1_irq+0xbc/0x180
[ 1193.402333]  lock_acquire+0x154/0x39c
[ 1193.406164]  _raw_spin_lock+0x4c/0x88
[ 1193.409997]  set_close_on_exec+0x44/0x1ac
[ 1193.414190]  do_fcntl+0x578/0xbf4
[ 1193.417664]  __arm64_sys_fcntl+0xa4/0xdc
[ 1193.421767]  el0_svc_common.constprop.0+0x158/0x1f8
[ 1193.426854]  el0_svc_handler+0x94/0xa0
[ 1193.430775]  el0_svc+0x8/0xc
[ 1193.433878] SMP: stopping secondary CPUs
[ 1193.438059] Dumping ftrace buffer:
[ 1193.442613]    (ftrace buffer empty)
[ 1193.446373] Kernel Offset: disabled
[ 1193.450028] CPU features: 0x0002,21006004
[ 1193.454217] Memory Limit: none
[ 1193.457508] ---[ end Kernel panic - not syncing: RCU Stall ]---

[6] RCU stall #3 reproduced on v5.3-12025-g4c07e2ddab5b + patches [2-3]

[66194.249069] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[66194.255718] rcu:     7-...!: (0 ticks this GP) idle=b34/0/0x0 softirq=1969905/1969905 fqs=1 last_accelerate: 36d8/7727, Nonlazy posted: .LD
[66194.268853]  (detected by 2, t=25021 jiffies, g=8936349, q=98599)
[66194.275195] Task dump for CPU 7:
[66194.278585] swapper/7       R  running task        0     0      1 0x00000028
[66194.285941] Call trace:
[66194.288537]  __switch_to+0x2c4/0x2e8
[66194.292298]  num_spec.70129+0x177e7d/0x1d336c
[66194.296872] rcu: rcu_preempt kthread starved for 16505 jiffies! g8936349 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=7
[66194.307968] rcu: RCU grace-period kthread stack dump:
[66194.313228] rcu_preempt     I    0    10      2 0x00000028
[66194.318949] Call trace:
[66194.321527]  __switch_to+0x2c4/0x2e8
[66194.325275]  __schedule+0xe1c/0xf58
[66194.328926]  schedule+0xc0/0x104
[66194.332310]  schedule_timeout+0x298/0x6ec
[66194.336502]  rcu_gp_kthread+0xf8c/0xf94
[66194.340516]  kthread+0x1c4/0x1dc
[66194.343900]  ret_from_fork+0x10/0x18
[66194.347664] Kernel panic - not syncing: RCU Stall
[66194.352574] CPU: 2 PID: 17232 Comm: stress-ng-cpu Not tainted 5.3.0+ #18
[66194.359535] Hardware name: Renesas H3ULCB Kingfisher board based on r8a7795 ES2.0+ (DT)
[66194.367848] Call trace:
[66194.370428]  dump_backtrace+0x0/0x1ec
[66194.374256]  show_stack+0x24/0x30
[66194.377730]  dump_stack+0xcc/0x128
[66194.381288]  panic+0x288/0x5a4
[66194.384493]  panic_on_rcu_stall+0x2c/0x34
[66194.388680]  rcu_sched_clock_irq+0xe2c/0x1554
[66194.393229]  update_process_times+0x34/0x60
[66194.397595]  tick_sched_handle+0x80/0x98
[66194.401691]  tick_sched_timer+0x64/0xbc
[66194.405700]  __hrtimer_run_queues+0x5c8/0xb94
[66194.410246]  hrtimer_interrupt+0x1ec/0x454
[66194.414528]  arch_timer_handler_phys+0x40/0x58
[66194.419167]  handle_percpu_devid_irq+0x170/0x6f4
[66194.423990]  generic_handle_irq+0x3c/0x54
[66194.428178]  __handle_domain_irq+0x114/0x118
[66194.432630]  gic_handle_irq+0x70/0xac
[66194.436454]  el0_irq_naked+0x50/0x5c
[66194.440261] SMP: stopping secondary CPUs
[66194.444571] Dumping ftrace buffer:
[66194.449132]    (ftrace buffer empty)
[66194.452890] Kernel Offset: disabled
[66194.456538] CPU features: 0x0002,21006004
[66194.460719] Memory Limit: none
[66194.463997] ---[ end Kernel panic - not syncing: RCU Stall ]---

-- 
Best Regards,
Eugeniu

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH V2 1/1] tick: broadcast-hrtimer: Fix a race in bc_set_next
  2019-09-23 19:51   ` Thomas Gleixner
  2019-09-24 17:14     ` Eugeniu Rosca
@ 2019-09-25 11:55     ` Balasubramani Vivekanandan
  2019-09-25 11:55       ` Balasubramani Vivekanandan
                         ` (2 more replies)
  1 sibling, 3 replies; 16+ messages in thread
From: Balasubramani Vivekanandan @ 2019-09-25 11:55 UTC (permalink / raw)
  To: fweisbec, tglx, mingo; +Cc: balasubramani_vivekanandan, erosca, linux-kernel

> Find a completely untested patch below

Thanks for the patch. I tested it and it works fine; I no longer see the
rcu stall warnings. I have only updated the bracket formatting. Please
check and apply.





* [PATCH V2 1/1] tick: broadcast-hrtimer: Fix a race in bc_set_next
  2019-09-25 11:55     ` [PATCH V2 " Balasubramani Vivekanandan
@ 2019-09-25 11:55       ` Balasubramani Vivekanandan
  2019-09-25 13:32       ` Balasubramani Vivekanandan
  2019-09-25 14:20       ` Balasubramani Vivekanandan
  2 siblings, 0 replies; 16+ messages in thread
From: Balasubramani Vivekanandan @ 2019-09-25 11:55 UTC (permalink / raw)
  To: fweisbec, tglx, mingo; +Cc: balasubramani_vivekanandan, erosca, linux-kernel

When a cpu requests broadcasting, before starting the tick broadcast
hrtimer, bc_set_next() checks if the timer callback (bc_handler) is
active using hrtimer_try_to_cancel(). But hrtimer_try_to_cancel() does
not provide the required synchronization when the callback is active on
another core.

The callback could have already executed tick_handle_oneshot_broadcast()
and could also have returned, but there is still a small time window in
which hrtimer_try_to_cancel() returns -1. In that case bc_set_next()
returns without doing anything, although the next_event of the tick
broadcast clock device has already been set to a timeout value.

In the race condition diagram below, CPU #1 is running the timer
callback and CPU #2 is entering idle state and so calls bc_set_next().

In the worst case, the next_event will contain an expiry time, but the
hrtimer will not be started; this happens when the racing callback
returns HRTIMER_NORESTART. The hrtimer might never recover if all
further requests from the CPUs subscribing to tick broadcast have a
timeout greater than the next_event of the tick broadcast clock device.
This leads to cascading failures, finally noticed as the rcu stall
warnings shown in [1].

Here is a depiction of the race condition

CPU #1 (Running timer callback)                   CPU #2 (Enter idle
                                                  and subscribe to
                                                  tick broadcast)
---------------------                             ---------------------

__run_hrtimer()                                   tick_broadcast_enter()

  bc_handler()                                      __tick_broadcast_oneshot_control()

    tick_handle_oneshot_broadcast()

      raw_spin_lock(&tick_broadcast_lock);

      dev->next_event = KTIME_MAX;                  //wait for tick_broadcast_lock
      //next_event for tick broadcast clock
      set to KTIME_MAX since no other cores
      subscribed to tick broadcasting

      raw_spin_unlock(&tick_broadcast_lock);

    if (dev->next_event == KTIME_MAX)
      return HRTIMER_NORESTART
    // callback function exits without
       restarting the hrtimer                      //tick_broadcast_lock acquired
                                                   raw_spin_lock(&tick_broadcast_lock);

                                                   tick_broadcast_set_event()

                                                     clockevents_program_event()

                                                       dev->next_event = expires;

                                                       bc_set_next()

                                                         hrtimer_try_to_cancel()
                                                         //returns -1 since the timer
                                                         callback is active. Exits without
                                                         restarting the timer
  cpu_base->running = NULL;

Since it is now allowed to start the hrtimer from the callback, there is
no need for the try to cancel logic. All that is now removed.

[1]: rcu stall warnings noticed before this patch

[   26.477514] INFO: rcu_preempt detected stalls on CPUs/tasks:
[   26.483197]  4-...: (0 ticks this GP) idle=718/0/0 softirq=1436/1436 fqs=0
[   26.490171]  (detected by 0, t=1755 jiffies, g=778, c=777, q=10243)
[   26.496456] Task dump for CPU 4:
[   26.499688] swapper/4       R  running task        0     0      1 0x00000020
[   26.506756] Call trace:
[   26.509221] [<ffff000008086214>] __switch_to+0x94/0xd8
[   26.514378] [<ffff000008ed9000>] bp_hardening_data+0x0/0x10
[   26.519964] rcu_preempt kthread starved for 1762 jiffies! g778 c777 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=4
[   26.530245] rcu_preempt     I    0     8      2 0x00000020
[   26.535742] Call trace:
[   26.538192] [<ffff000008086214>] __switch_to+0x94/0xd8
[   26.543344] [<ffff000008a6365c>] __schedule+0x274/0x940
[   26.548578] [<ffff000008a63d68>] schedule+0x40/0xa8
[   26.553467] [<ffff000008a6847c>] schedule_timeout+0x94/0x430
[   26.559141] [<ffff00000813cb04>] rcu_gp_kthread+0x76c/0x1068
[   26.564813] [<ffff0000080e1610>] kthread+0x138/0x140
[   26.569787] [<ffff000008084f74>] ret_from_fork+0x10/0x1c
[   30.481515] INFO: rcu_sched detected stalls on CPUs/tasks:
[   30.487030]  4-...: (0 ticks this GP) idle=718/0/0 softirq=1436/1436 fqs=0
[   30.494004]  (detected by 1, t=1755 jiffies, g=-24, c=-25, q=45)
[   30.500028] Task dump for CPU 4:
[   30.503261] swapper/4       R  running task        0     0      1 0x00000020
[   30.510330] Call trace:
[   30.512796] [<ffff000008086214>] __switch_to+0x94/0xd8
[   30.517953] [<ffff000008ed9000>] bp_hardening_data+0x0/0x10
[   30.523541] rcu_sched kthread starved for 1762 jiffies! g18446744073709551592 c18446744073709551591 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=4
[   30.536608] rcu_sched       I    0     9      2 0x00000020
[   30.542105] Call trace:
[   30.544554] [<ffff000008086214>] __switch_to+0x94/0xd8
[   30.549707] [<ffff000008a6365c>] __schedule+0x274/0x940
[   30.554942] [<ffff000008a63d68>] schedule+0x40/0xa8
[   30.559830] [<ffff000008a6847c>] schedule_timeout+0x94/0x430
[   30.565504] [<ffff00000813cb04>] rcu_gp_kthread+0x76c/0x1068
[   30.571176] [<ffff0000080e1610>] kthread+0x138/0x140
[   30.576149] [<ffff000008084f74>] ret_from_fork+0x10/0x1c

Signed-off-by: Balasubramani Vivekanandan <balasubramani_vivekanandan@mentor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/tick-broadcast-hrtimer.c | 52 ++++++++++++----------------
 1 file changed, 23 insertions(+), 29 deletions(-)

diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c
index c1f5bb590b5e..6828fe3baacd 100644
--- a/kernel/time/tick-broadcast-hrtimer.c
+++ b/kernel/time/tick-broadcast-hrtimer.c
@@ -42,39 +42,37 @@ static int bc_shutdown(struct clock_event_device *evt)
  */
 static int bc_set_next(ktime_t expires, struct clock_event_device *bc)
 {
-	int bc_moved;
 	/*
-	 * We try to cancel the timer first. If the callback is on
-	 * flight on some other cpu then we let it handle it. If we
-	 * were able to cancel the timer nothing can rearm it as we
-	 * own broadcast_lock.
+	 * This is called either from enter/exit idle code or from the
+	 * broadcast handler. In all cases tick_broadcast_lock is held.
 	 *
-	 * However we can also be called from the event handler of
-	 * ce_broadcast_hrtimer itself when it expires. We cannot
-	 * restart the timer because we are in the callback, but we
-	 * can set the expiry time and let the callback return
-	 * HRTIMER_RESTART.
+	 * hrtimer_cancel() cannot be called here neither from the
+	 * broadcast handler nor from the enter/exit idle code. The idle
+	 * code can run into the problem described in bc_shutdown() and the
+	 * broadcast handler cannot wait for itself to complete for obvious
+	 * reasons.
 	 *
-	 * Since we are in the idle loop at this point and because
-	 * hrtimer_{start/cancel} functions call into tracing,
-	 * calls to these functions must be bound within RCU_NONIDLE.
+	 * Each caller tries to arm the hrtimer on its own CPU, but if the
+	 * handler is currently running hrtimer_start() does not move
+	 * it. It keeps it on the CPU on which it is assigned at the
+	 * moment.
 	 */
 	RCU_NONIDLE(
 		{
-			bc_moved = hrtimer_try_to_cancel(&bctimer) >= 0;
-			if (bc_moved) {
-				hrtimer_start(&bctimer, expires,
-					      HRTIMER_MODE_ABS_PINNED_HARD);
-			}
+			hrtimer_start(&bctimer, expires, HRTIMER_MODE_ABS_PINNED_HARD);
+			/*
+			 * The core tick broadcast mode expects bc->bound_on to be set
+			 * correctly to prevent a CPU which has the broadcast hrtimer
+			 * armed from going deep idle.
+			 *
+			 * As tick_broadcast_lock is held, nothing can change the
+			 * cpu base which was just established in hrtimer_start()
+			 * above. So the below access is safe even without holding
+			 * the hrtimer base lock.
+			 */
+			bc->bound_on = bctimer.base->cpu_base->cpu;
 		}
 	);
-
-	if (bc_moved) {
-		/* Bind the "device" to the cpu */
-		bc->bound_on = smp_processor_id();
-	} else if (bc->bound_on == smp_processor_id()) {
-		hrtimer_set_expires(&bctimer, expires);
-	}
 	return 0;
 }
 
@@ -100,10 +98,6 @@ static enum hrtimer_restart bc_handler(struct hrtimer *t)
 {
 	ce_broadcast_hrtimer.event_handler(&ce_broadcast_hrtimer);
 
-	if (clockevent_state_oneshot(&ce_broadcast_hrtimer))
-		if (ce_broadcast_hrtimer.next_event != KTIME_MAX)
-			return HRTIMER_RESTART;
-
 	return HRTIMER_NORESTART;
 }
 
-- 
2.17.1



* Re: [PATCH V2 1/1] tick: broadcast-hrtimer: Fix a race in bc_set_next
  2019-09-25 11:55     ` [PATCH V2 " Balasubramani Vivekanandan
  2019-09-25 11:55       ` Balasubramani Vivekanandan
@ 2019-09-25 13:32       ` Balasubramani Vivekanandan
  2019-09-25 14:20       ` Balasubramani Vivekanandan
  2 siblings, 0 replies; 16+ messages in thread
From: Balasubramani Vivekanandan @ 2019-09-25 13:32 UTC (permalink / raw)
  To: fweisbec, tglx, mingo, peterz; +Cc: erosca, linux-kernel, linux-renesas-soc

+ Peter Zijlstra, linux-renesas-soc



* [PATCH V2 1/1] tick: broadcast-hrtimer: Fix a race in bc_set_next
  2019-09-25 11:55     ` [PATCH V2 " Balasubramani Vivekanandan
  2019-09-25 11:55       ` Balasubramani Vivekanandan
  2019-09-25 13:32       ` Balasubramani Vivekanandan
@ 2019-09-25 14:20       ` Balasubramani Vivekanandan
  2019-09-25 14:20         ` Balasubramani Vivekanandan
  2 siblings, 1 reply; 16+ messages in thread
From: Balasubramani Vivekanandan @ 2019-09-25 14:20 UTC (permalink / raw)
  To: fweisbec, tglx, mingo, peterz
  Cc: balasubramani_vivekanandan, erosca, linux-renesas-soc, linux-kernel

Sorry, I missed fixing a few checkpatch warnings. Corrected now.




* [PATCH V2 1/1] tick: broadcast-hrtimer: Fix a race in bc_set_next
  2019-09-25 14:20       ` Balasubramani Vivekanandan
@ 2019-09-25 14:20         ` Balasubramani Vivekanandan
  2019-09-26 10:01           ` Thomas Gleixner
  0 siblings, 1 reply; 16+ messages in thread
From: Balasubramani Vivekanandan @ 2019-09-25 14:20 UTC (permalink / raw)
  To: fweisbec, tglx, mingo, peterz
  Cc: balasubramani_vivekanandan, erosca, linux-renesas-soc, linux-kernel

When a cpu requests broadcasting, before starting the tick broadcast
hrtimer, bc_set_next() checks if the timer callback (bc_handler) is
active using hrtimer_try_to_cancel(). But hrtimer_try_to_cancel() does
not provide the required synchronization when the callback is active on
another core.

The callback could have already executed tick_handle_oneshot_broadcast()
and could also have returned, but there is still a small time window in
which hrtimer_try_to_cancel() returns -1. In that case bc_set_next()
returns without doing anything, although the next_event of the tick
broadcast clock device has already been set to a timeout value.

In the race condition diagram below, CPU #1 is running the timer
callback and CPU #2 is entering idle state and so calls bc_set_next().

In the worst case, the next_event will contain an expiry time, but the
hrtimer will not be started; this happens when the racing callback
returns HRTIMER_NORESTART. The hrtimer might never recover if all
further requests from the CPUs subscribing to tick broadcast have a
timeout greater than the next_event of the tick broadcast clock device.
This leads to cascading failures, finally noticed as the rcu stall
warnings shown in [1].

Here is a depiction of the race condition

CPU #1 (Running timer callback)                   CPU #2 (Enter idle
                                                  and subscribe to
                                                  tick broadcast)
---------------------                             ---------------------

__run_hrtimer()                                   tick_broadcast_enter()

  bc_handler()                                      __tick_broadcast_oneshot_control()

    tick_handle_oneshot_broadcast()

      raw_spin_lock(&tick_broadcast_lock);

      dev->next_event = KTIME_MAX;                  //wait for tick_broadcast_lock
      //next_event for tick broadcast clock
      set to KTIME_MAX since no other cores
      subscribed to tick broadcasting

      raw_spin_unlock(&tick_broadcast_lock);

    if (dev->next_event == KTIME_MAX)
      return HRTIMER_NORESTART
    // callback function exits without
       restarting the hrtimer                      //tick_broadcast_lock acquired
                                                   raw_spin_lock(&tick_broadcast_lock);

                                                   tick_broadcast_set_event()

                                                     clockevents_program_event()

                                                       dev->next_event = expires;

                                                       bc_set_next()

                                                         hrtimer_try_to_cancel()
                                                         //returns -1 since the timer
                                                         callback is active. Exits without
                                                         restarting the timer
  cpu_base->running = NULL;

Since it is now allowed to start the hrtimer from the callback, there is
no need for the try to cancel logic. All that is now removed.

[1]: rcu stall warnings noticed before this patch

[   26.477514] INFO: rcu_preempt detected stalls on CPUs/tasks:
[   26.483197]  4-...: (0 ticks this GP) idle=718/0/0 softirq=1436/1436 fqs=0
[   26.490171]  (detected by 0, t=1755 jiffies, g=778, c=777, q=10243)
[   26.496456] Task dump for CPU 4:
[   26.499688] swapper/4       R  running task        0     0      1 0x00000020
[   26.506756] Call trace:
[   26.509221] [<ffff000008086214>] __switch_to+0x94/0xd8
[   26.514378] [<ffff000008ed9000>] bp_hardening_data+0x0/0x10
[   26.519964] rcu_preempt kthread starved for 1762 jiffies! g778 c777 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=4
[   26.530245] rcu_preempt     I    0     8      2 0x00000020
[   26.535742] Call trace:
[   26.538192] [<ffff000008086214>] __switch_to+0x94/0xd8
[   26.543344] [<ffff000008a6365c>] __schedule+0x274/0x940
[   26.548578] [<ffff000008a63d68>] schedule+0x40/0xa8
[   26.553467] [<ffff000008a6847c>] schedule_timeout+0x94/0x430
[   26.559141] [<ffff00000813cb04>] rcu_gp_kthread+0x76c/0x1068
[   26.564813] [<ffff0000080e1610>] kthread+0x138/0x140
[   26.569787] [<ffff000008084f74>] ret_from_fork+0x10/0x1c
[   30.481515] INFO: rcu_sched detected stalls on CPUs/tasks:
[   30.487030]  4-...: (0 ticks this GP) idle=718/0/0 softirq=1436/1436 fqs=0
[   30.494004]  (detected by 1, t=1755 jiffies, g=-24, c=-25, q=45)
[   30.500028] Task dump for CPU 4:
[   30.503261] swapper/4       R  running task        0     0      1 0x00000020
[   30.510330] Call trace:
[   30.512796] [<ffff000008086214>] __switch_to+0x94/0xd8
[   30.517953] [<ffff000008ed9000>] bp_hardening_data+0x0/0x10
[   30.523541] rcu_sched kthread starved for 1762 jiffies! g18446744073709551592 c18446744073709551591 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=4
[   30.536608] rcu_sched       I    0     9      2 0x00000020
[   30.542105] Call trace:
[   30.544554] [<ffff000008086214>] __switch_to+0x94/0xd8
[   30.549707] [<ffff000008a6365c>] __schedule+0x274/0x940
[   30.554942] [<ffff000008a63d68>] schedule+0x40/0xa8
[   30.559830] [<ffff000008a6847c>] schedule_timeout+0x94/0x430
[   30.565504] [<ffff00000813cb04>] rcu_gp_kthread+0x76c/0x1068
[   30.571176] [<ffff0000080e1610>] kthread+0x138/0x140
[   30.576149] [<ffff000008084f74>] ret_from_fork+0x10/0x1c

Signed-off-by: Balasubramani Vivekanandan <balasubramani_vivekanandan@mentor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/tick-broadcast-hrtimer.c | 53 +++++++++++++---------------
 1 file changed, 24 insertions(+), 29 deletions(-)

diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c
index c1f5bb590b5e..2fb16d49939a 100644
--- a/kernel/time/tick-broadcast-hrtimer.c
+++ b/kernel/time/tick-broadcast-hrtimer.c
@@ -42,39 +42,38 @@ static int bc_shutdown(struct clock_event_device *evt)
  */
 static int bc_set_next(ktime_t expires, struct clock_event_device *bc)
 {
-	int bc_moved;
 	/*
-	 * We try to cancel the timer first. If the callback is on
-	 * flight on some other cpu then we let it handle it. If we
-	 * were able to cancel the timer nothing can rearm it as we
-	 * own broadcast_lock.
+	 * This is called either from enter/exit idle code or from the
+	 * broadcast handler. In all cases tick_broadcast_lock is held.
 	 *
-	 * However we can also be called from the event handler of
-	 * ce_broadcast_hrtimer itself when it expires. We cannot
-	 * restart the timer because we are in the callback, but we
-	 * can set the expiry time and let the callback return
-	 * HRTIMER_RESTART.
+	 * hrtimer_cancel() cannot be called here neither from the
+	 * broadcast handler nor from the enter/exit idle code. The idle
+	 * code can run into the problem described in bc_shutdown() and the
+	 * broadcast handler cannot wait for itself to complete for obvious
+	 * reasons.
 	 *
-	 * Since we are in the idle loop at this point and because
-	 * hrtimer_{start/cancel} functions call into tracing,
-	 * calls to these functions must be bound within RCU_NONIDLE.
+	 * Each caller tries to arm the hrtimer on its own CPU, but if the
+	 * handler is currently running hrtimer_start() does not move
+	 * it. It keeps it on the CPU on which it is assigned at the
+	 * moment.
 	 */
 	RCU_NONIDLE(
 		{
-			bc_moved = hrtimer_try_to_cancel(&bctimer) >= 0;
-			if (bc_moved) {
-				hrtimer_start(&bctimer, expires,
-					      HRTIMER_MODE_ABS_PINNED_HARD);
-			}
+			hrtimer_start(&bctimer, expires,
+				      HRTIMER_MODE_ABS_PINNED_HARD);
+			/*
+			 * The core tick broadcast mode expects bc->bound_on to
+			 * be set correctly to prevent a CPU which has the
+			 * broadcast hrtimer armed from going deep idle.
+			 *
+			 * As tick_broadcast_lock is held, nothing can change
+			 * the cpu base which was just established in
+			 * hrtimer_start() above. So the below access is safe
+			 * even without holding the hrtimer base lock.
+			 */
+			bc->bound_on = bctimer.base->cpu_base->cpu;
 		}
 	);
-
-	if (bc_moved) {
-		/* Bind the "device" to the cpu */
-		bc->bound_on = smp_processor_id();
-	} else if (bc->bound_on == smp_processor_id()) {
-		hrtimer_set_expires(&bctimer, expires);
-	}
 	return 0;
 }
 
@@ -100,10 +99,6 @@ static enum hrtimer_restart bc_handler(struct hrtimer *t)
 {
 	ce_broadcast_hrtimer.event_handler(&ce_broadcast_hrtimer);
 
-	if (clockevent_state_oneshot(&ce_broadcast_hrtimer))
-		if (ce_broadcast_hrtimer.next_event != KTIME_MAX)
-			return HRTIMER_RESTART;
-
 	return HRTIMER_NORESTART;
 }
 
-- 
2.17.1



* Re: [PATCH V2 1/1] tick: broadcast-hrtimer: Fix a race in bc_set_next
  2019-09-25 14:20         ` Balasubramani Vivekanandan
@ 2019-09-26 10:01           ` Thomas Gleixner
  2019-09-26 13:51             ` [PATCH V3 " Balasubramani Vivekanandan
  0 siblings, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2019-09-26 10:01 UTC (permalink / raw)
  To: Balasubramani Vivekanandan
  Cc: fweisbec, mingo, peterz, erosca, linux-renesas-soc, linux-kernel

On Wed, 25 Sep 2019, Balasubramani Vivekanandan wrote:
> 
> Since it is now allowed to start the hrtimer from the callback, there is

Is now allowed? 

> no need for the try to cancel logic. All that is now removed.

Sure, I can see that it is removed from the patch, but why, and why is it
correct?

> [1]: rcu stall warnings noticed before this patch
> 
> [   26.477514] INFO: rcu_preempt detected stalls on CPUs/tasks:

<SNIP>

In which way is this backtrace giving any useful information about the
problem?

> 
> Signed-off-by: Balasubramani Vivekanandan <balasubramani_vivekanandan@mentor.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Interesting. You claim authorship on that patch and then you put my SOB
after yours just because you feel entitled to do so?

I surely appreciate the time you spent analyzing the problem, and I
wouldn't have said anything if you had just done the right thing:

 1) Write a changelog which explains why the change is actually correct

 2) Not wreck the formatting which I intentionally did for readability's
    sake

 3) Add 'Originally-by: Thomas Gleixner' or at least have the courtesy to
    mention that this is not your work.

Thanks,

	tglx




* [PATCH V3 1/1] tick: broadcast-hrtimer: Fix a race in bc_set_next
  2019-09-26 10:01           ` Thomas Gleixner
@ 2019-09-26 13:51             ` Balasubramani Vivekanandan
  2019-09-26 13:51               ` Balasubramani Vivekanandan
  0 siblings, 1 reply; 16+ messages in thread
From: Balasubramani Vivekanandan @ 2019-09-26 13:51 UTC (permalink / raw)
  To: fweisbec, tglx, mingo, peterz
  Cc: balasubramani_vivekanandan, erosca, linux-renesas-soc, linux-kernel


> In which way is this backtrace giving any useful information about the problem?

My intention in including the callstack was to help anyone find this
commit when searching for the prints from the callstack. I have removed
it since it was creating noise.

> Interesting. You claim authorship on that patch and then you put my
> SOB after yours just because you feel entitled to do so?

Apologies for that. It was due to my limited experience with sharing
patches upstream for review and reworking them. I have included the
"Originally-by" tag in the updated patch.

> 1) Write a changelog which explains why the change is actually correct
Done. Please check the update. Let me know if I need to add more
details.

> 2) Not wreck the formatting which I intentionally did for
> readability's sake
Done.

> 3) Add 'Originally-by: Thomas Gleixner' or at least have the courtesy to
>    mention that this is not your work.
Done.

Regards,
Bala




* [PATCH V3 1/1] tick: broadcast-hrtimer: Fix a race in bc_set_next
  2019-09-26 13:51             ` [PATCH V3 " Balasubramani Vivekanandan
@ 2019-09-26 13:51               ` Balasubramani Vivekanandan
  2019-09-26 15:43                 ` Eugeniu Rosca
  2019-09-27 12:48                 ` [tip: timers/core] " tip-bot2 for Balasubramani Vivekanandan
  0 siblings, 2 replies; 16+ messages in thread
From: Balasubramani Vivekanandan @ 2019-09-26 13:51 UTC (permalink / raw)
  To: fweisbec, tglx, mingo, peterz
  Cc: balasubramani_vivekanandan, erosca, linux-renesas-soc, linux-kernel

When a cpu requests broadcasting, before starting the tick broadcast
hrtimer, bc_set_next() checks if the timer callback (bc_handler) is
active using hrtimer_try_to_cancel(). But hrtimer_try_to_cancel() does
not provide the required synchronization when the callback is active on
another core.

The callback could have already executed tick_handle_oneshot_broadcast()
and could also have returned, but there is still a small time window in
which hrtimer_try_to_cancel() returns -1. In that case bc_set_next()
returns without doing anything, although the next_event of the tick
broadcast clock device has already been set to a timeout value.

In the race condition diagram below, CPU #1 is running the timer
callback and CPU #2 is entering idle state and so calls bc_set_next().

In the worst case, the next_event will contain an expiry time, but the
hrtimer will not be started; this happens when the racing callback
returns HRTIMER_NORESTART. The hrtimer might never recover if all
further requests from the CPUs subscribing to tick broadcast have a
timeout greater than the next_event of the tick broadcast clock device.
This leads to cascading failures, finally noticed as rcu stall
warnings.

Here is a depiction of the race condition

CPU #1 (Running timer callback)                   CPU #2 (Enter idle
                                                  and subscribe to
                                                  tick broadcast)
---------------------                             ---------------------

__run_hrtimer()                                   tick_broadcast_enter()

  bc_handler()                                      __tick_broadcast_oneshot_control()

    tick_handle_oneshot_broadcast()

      raw_spin_lock(&tick_broadcast_lock);

      dev->next_event = KTIME_MAX;                  //wait for tick_broadcast_lock
      //next_event for tick broadcast clock
      set to KTIME_MAX since no other cores
      subscribed to tick broadcasting

      raw_spin_unlock(&tick_broadcast_lock);

    if (dev->next_event == KTIME_MAX)
      return HRTIMER_NORESTART
    // callback function exits without
       restarting the hrtimer                      //tick_broadcast_lock acquired
                                                   raw_spin_lock(&tick_broadcast_lock);

                                                   tick_broadcast_set_event()

                                                     clockevents_program_event()

                                                       dev->next_event = expires;

                                                       bc_set_next()

                                                         hrtimer_try_to_cancel()
                                                         //returns -1 since the timer
                                                         callback is active. Exits without
                                                         restarting the timer
  cpu_base->running = NULL;

The comment that the hrtimer cannot be armed from within the callback is
wrong. It is fine to start the hrtimer from within the callback. It is
also safe to start the hrtimer from the enter/exit idle code while the
broadcast handler is active, because the idle code and the broadcast
handler are synchronized by tick_broadcast_lock. So there is no need for
the existing try-to-cancel logic; all of it can be removed, which
eliminates the race condition as well.

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balasubramani Vivekanandan <balasubramani_vivekanandan@mentor.com>
---
 kernel/time/tick-broadcast-hrtimer.c | 58 ++++++++++++----------------
 1 file changed, 25 insertions(+), 33 deletions(-)

diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c
index c1f5bb590b5e..f070f9734792 100644
--- a/kernel/time/tick-broadcast-hrtimer.c
+++ b/kernel/time/tick-broadcast-hrtimer.c
@@ -42,39 +42,35 @@ static int bc_shutdown(struct clock_event_device *evt)
  */
 static int bc_set_next(ktime_t expires, struct clock_event_device *bc)
 {
-	int bc_moved;
 	/*
-	 * We try to cancel the timer first. If the callback is on
-	 * flight on some other cpu then we let it handle it. If we
-	 * were able to cancel the timer nothing can rearm it as we
-	 * own broadcast_lock.
+	 * This is called either from enter/exit idle code or from the
+	 * broadcast handler. In all cases tick_broadcast_lock is held.
 	 *
-	 * However we can also be called from the event handler of
-	 * ce_broadcast_hrtimer itself when it expires. We cannot
-	 * restart the timer because we are in the callback, but we
-	 * can set the expiry time and let the callback return
-	 * HRTIMER_RESTART.
+	 * hrtimer_cancel() cannot be called here neither from the
+	 * broadcast handler nor from the enter/exit idle code. The idle
+	 * code can run into the problem described in bc_shutdown() and the
+	 * broadcast handler cannot wait for itself to complete for obvious
+	 * reasons.
 	 *
-	 * Since we are in the idle loop at this point and because
-	 * hrtimer_{start/cancel} functions call into tracing,
-	 * calls to these functions must be bound within RCU_NONIDLE.
+	 * Each caller tries to arm the hrtimer on its own CPU, but if the
+	 * handler is currently running hrtimer_start() does not move
+	 * it. It keeps it on the CPU on which it is assigned at the
+	 * moment.
 	 */
-	RCU_NONIDLE(
-		{
-			bc_moved = hrtimer_try_to_cancel(&bctimer) >= 0;
-			if (bc_moved) {
-				hrtimer_start(&bctimer, expires,
-					      HRTIMER_MODE_ABS_PINNED_HARD);
-			}
-		}
-	);
-
-	if (bc_moved) {
-		/* Bind the "device" to the cpu */
-		bc->bound_on = smp_processor_id();
-	} else if (bc->bound_on == smp_processor_id()) {
-		hrtimer_set_expires(&bctimer, expires);
-	}
+	RCU_NONIDLE( {
+		hrtimer_start(&bctimer, expires, HRTIMER_MODE_ABS_PINNED_HARD);
+		/*
+		 * The core tick broadcast mode expects bc->bound_on to be set
+		 * correctly to prevent a CPU which has the broadcast hrtimer
+		 * armed from going deep idle.
+		 *
+		 * As tick_broadcast_lock is held, nothing can change the cpu
+		 * base which was just established in hrtimer_start() above. So
+		 * the below access is safe even without holding the hrtimer
+		 * base lock.
+		 */
+		bc->bound_on = bctimer.base->cpu_base->cpu;
+	} );
 	return 0;
 }
 
@@ -100,10 +96,6 @@ static enum hrtimer_restart bc_handler(struct hrtimer *t)
 {
 	ce_broadcast_hrtimer.event_handler(&ce_broadcast_hrtimer);
 
-	if (clockevent_state_oneshot(&ce_broadcast_hrtimer))
-		if (ce_broadcast_hrtimer.next_event != KTIME_MAX)
-			return HRTIMER_RESTART;
-
 	return HRTIMER_NORESTART;
 }
 
-- 
2.17.1



* Re: [PATCH V3 1/1] tick: broadcast-hrtimer: Fix a race in bc_set_next
  2019-09-26 13:51               ` Balasubramani Vivekanandan
@ 2019-09-26 15:43                 ` Eugeniu Rosca
  2019-09-27 12:48                 ` [tip: timers/core] " tip-bot2 for Balasubramani Vivekanandan
  1 sibling, 0 replies; 16+ messages in thread
From: Eugeniu Rosca @ 2019-09-26 15:43 UTC (permalink / raw)
  To: Balasubramani Vivekanandan
  Cc: fweisbec, tglx, mingo, peterz, linux-renesas-soc, linux-kernel,
	Eugeniu Rosca, Eugeniu Rosca

On Thu, Sep 26, 2019 at 03:51:01PM +0200, Balasubramani Vivekanandan wrote:
> When a cpu requests broadcasting, before starting the tick broadcast
> hrtimer, bc_set_next() checks if the timer callback (bc_handler) is
> active using hrtimer_try_to_cancel(). But hrtimer_try_to_cancel() does
> not provide the required synchronization when the callback is active on
> another core.

[..]

> diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c
> index c1f5bb590b5e..f070f9734792 100644
> --- a/kernel/time/tick-broadcast-hrtimer.c
> +++ b/kernel/time/tick-broadcast-hrtimer.c

[..]

FWIW, the patch seems to fix the very first commit adding hrtimer
broadcast, i.e. v3.15-rc1 commit 5d1638acb9f62f ("tick: Introduce
hrtimer based broadcast"), so maybe adding a Fixes: tag could be
relevant/beneficial for the stable trees?

-- 
Best Regards,
Eugeniu

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [tip: timers/core] tick: broadcast-hrtimer: Fix a race in bc_set_next
  2019-09-26 13:51               ` Balasubramani Vivekanandan
  2019-09-26 15:43                 ` Eugeniu Rosca
@ 2019-09-27 12:48                 ` tip-bot2 for Balasubramani Vivekanandan
  1 sibling, 0 replies; 16+ messages in thread
From: tip-bot2 for Balasubramani Vivekanandan @ 2019-09-27 12:48 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Balasubramani Vivekanandan, stable, Ingo Molnar,
	Borislav Petkov, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     b9023b91dd020ad7e093baa5122b6968c48cc9e0
Gitweb:        https://git.kernel.org/tip/b9023b91dd020ad7e093baa5122b6968c48cc9e0
Author:        Balasubramani Vivekanandan <balasubramani_vivekanandan@mentor.com>
AuthorDate:    Thu, 26 Sep 2019 15:51:01 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Fri, 27 Sep 2019 14:45:55 +02:00

tick: broadcast-hrtimer: Fix a race in bc_set_next

When a cpu requests broadcasting, before starting the tick broadcast
hrtimer, bc_set_next() checks if the timer callback (bc_handler) is active
using hrtimer_try_to_cancel(). But hrtimer_try_to_cancel() does not provide
the required synchronization when the callback is active on another core.

The callback could have already executed tick_handle_oneshot_broadcast()
and could also have returned already. But there is still a small time
window during which hrtimer_try_to_cancel() returns -1. In that case
bc_set_next() returns without doing anything, although the next_event of
the tick broadcast clock device has already been set to a timeout value.

In the race condition diagram below, CPU #1 is running the timer callback
and CPU #2 is entering idle state and so calls bc_set_next().

In the worst case, next_event will contain an expiry time, but the
hrtimer will not be started, which happens when the racing callback
returns HRTIMER_NORESTART. The hrtimer might never recover if all
further requests from the CPUs to subscribe to tick broadcast have a
timeout greater than the next_event of the tick broadcast clock device.
This leads to cascading failures which are finally noticed as RCU stall
warnings.

Here is a depiction of the race condition:

CPU #1 (Running timer callback)                   CPU #2 (Enter idle
                                                  and subscribe to
                                                  tick broadcast)
---------------------                             ---------------------

__run_hrtimer()                                   tick_broadcast_enter()

  bc_handler()                                      __tick_broadcast_oneshot_control()

    tick_handle_oneshot_broadcast()

      raw_spin_lock(&tick_broadcast_lock);

      dev->next_event = KTIME_MAX;                  //wait for tick_broadcast_lock
      //next_event for tick broadcast clock
      set to KTIME_MAX since no other cores
      subscribed to tick broadcasting

      raw_spin_unlock(&tick_broadcast_lock);

    if (dev->next_event == KTIME_MAX)
      return HRTIMER_NORESTART
    // callback function exits without
       restarting the hrtimer                      //tick_broadcast_lock acquired
                                                   raw_spin_lock(&tick_broadcast_lock);

                                                   tick_broadcast_set_event()

                                                     clockevents_program_event()

                                                       dev->next_event = expires;

                                                       bc_set_next()

                                                         hrtimer_try_to_cancel()
                                                         //returns -1 since the timer
                                                         callback is active. Exits without
                                                         restarting the timer
  cpu_base->running = NULL;

The comment that hrtimer cannot be armed from within the callback is
wrong. It is fine to start the hrtimer from within the callback. Also it is
safe to start the hrtimer from the enter/exit idle code while the broadcast
handler is active. The enter/exit idle code and the broadcast handler are
synchronized using tick_broadcast_lock. So there is no need for the
existing try to cancel logic. All this can be removed which will eliminate
the race condition as well.

Fixes: 5d1638acb9f6 ("tick: Introduce hrtimer based broadcast")
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balasubramani Vivekanandan <balasubramani_vivekanandan@mentor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20190926135101.12102-2-balasubramani_vivekanandan@mentor.com

---
 kernel/time/tick-broadcast-hrtimer.c | 62 ++++++++++++---------------
 1 file changed, 29 insertions(+), 33 deletions(-)

diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c
index c1f5bb5..b5a65e2 100644
--- a/kernel/time/tick-broadcast-hrtimer.c
+++ b/kernel/time/tick-broadcast-hrtimer.c
@@ -42,39 +42,39 @@ static int bc_shutdown(struct clock_event_device *evt)
  */
 static int bc_set_next(ktime_t expires, struct clock_event_device *bc)
 {
-	int bc_moved;
 	/*
-	 * We try to cancel the timer first. If the callback is on
-	 * flight on some other cpu then we let it handle it. If we
-	 * were able to cancel the timer nothing can rearm it as we
-	 * own broadcast_lock.
+	 * This is called either from enter/exit idle code or from the
+	 * broadcast handler. In all cases tick_broadcast_lock is held.
 	 *
-	 * However we can also be called from the event handler of
-	 * ce_broadcast_hrtimer itself when it expires. We cannot
-	 * restart the timer because we are in the callback, but we
-	 * can set the expiry time and let the callback return
-	 * HRTIMER_RESTART.
+	 * hrtimer_cancel() cannot be called here neither from the
+	 * broadcast handler nor from the enter/exit idle code. The idle
+	 * code can run into the problem described in bc_shutdown() and the
+	 * broadcast handler cannot wait for itself to complete for obvious
+	 * reasons.
 	 *
-	 * Since we are in the idle loop at this point and because
-	 * hrtimer_{start/cancel} functions call into tracing,
-	 * calls to these functions must be bound within RCU_NONIDLE.
+	 * Each caller tries to arm the hrtimer on its own CPU, but if the
+	 * hrtimer callback function is currently running, then
+	 * hrtimer_start() cannot move it and the timer stays on the CPU on
+	 * which it is assigned at the moment.
+	 *
+	 * As this can be called from idle code, the hrtimer_start()
+	 * invocation has to be wrapped with RCU_NONIDLE() as
+	 * hrtimer_start() can call into tracing.
 	 */
-	RCU_NONIDLE(
-		{
-			bc_moved = hrtimer_try_to_cancel(&bctimer) >= 0;
-			if (bc_moved) {
-				hrtimer_start(&bctimer, expires,
-					      HRTIMER_MODE_ABS_PINNED_HARD);
-			}
-		}
-	);
-
-	if (bc_moved) {
-		/* Bind the "device" to the cpu */
-		bc->bound_on = smp_processor_id();
-	} else if (bc->bound_on == smp_processor_id()) {
-		hrtimer_set_expires(&bctimer, expires);
-	}
+	RCU_NONIDLE( {
+		hrtimer_start(&bctimer, expires, HRTIMER_MODE_ABS_PINNED_HARD);
+		/*
+		 * The core tick broadcast mode expects bc->bound_on to be set
+		 * correctly to prevent a CPU which has the broadcast hrtimer
+		 * armed from going deep idle.
+		 *
+		 * As tick_broadcast_lock is held, nothing can change the cpu
+		 * base which was just established in hrtimer_start() above. So
+		 * the below access is safe even without holding the hrtimer
+		 * base lock.
+		 */
+		bc->bound_on = bctimer.base->cpu_base->cpu;
+	} );
 	return 0;
 }
 
@@ -100,10 +100,6 @@ static enum hrtimer_restart bc_handler(struct hrtimer *t)
 {
 	ce_broadcast_hrtimer.event_handler(&ce_broadcast_hrtimer);
 
-	if (clockevent_state_oneshot(&ce_broadcast_hrtimer))
-		if (ce_broadcast_hrtimer.next_event != KTIME_MAX)
-			return HRTIMER_RESTART;
-
 	return HRTIMER_NORESTART;
 }
 

^ permalink raw reply related	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2019-09-27 12:48 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-18 14:41 [PATCH V1 0/1] tick: broadcast-hrtimer: Fix a race in bc_set_next Balasubramani Vivekanandan
2019-09-18 14:41 ` [PATCH V1 1/1] " Balasubramani Vivekanandan
2019-09-23 19:51   ` Thomas Gleixner
2019-09-24 17:14     ` Eugeniu Rosca
2019-09-25 11:55     ` [PATCH V2 " Balasubramani Vivekanandan
2019-09-25 11:55       ` Balasubramani Vivekanandan
2019-09-25 13:32       ` Balasubramani Vivekanandan
2019-09-25 14:20       ` Balasubramani Vivekanandan
2019-09-25 14:20         ` Balasubramani Vivekanandan
2019-09-26 10:01           ` Thomas Gleixner
2019-09-26 13:51             ` [PATCH V3 " Balasubramani Vivekanandan
2019-09-26 13:51               ` Balasubramani Vivekanandan
2019-09-26 15:43                 ` Eugeniu Rosca
2019-09-27 12:48                 ` [tip: timers/core] " tip-bot2 for Balasubramani Vivekanandan
2019-09-20 10:39 ` [PATCH V1 0/1] " Eugeniu Rosca
2019-09-20 10:58 ` Eugeniu Rosca
