linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch V2 00/10] hrtimer: Reprogramming and clock_was_set() overhaul
@ 2021-07-13 13:39 Thomas Gleixner
  2021-07-13 13:39 ` [patch V2 01/10] hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns() Thomas Gleixner
                   ` (9 more replies)
  0 siblings, 10 replies; 36+ messages in thread
From: Thomas Gleixner @ 2021-07-13 13:39 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Anna-Maria Behnsen, Marcelo Tosatti, Frederic Weisbecker

Folks!

This is the second version of the clock_was_set() overhaul together with
the avoidance of double reprogramming when rearming an enqueued timer.

Version 1 of those patches can be found here:

   https://lore.kernel.org/r/20210427082537.611978720@linutronix.de
and
   https://lore.kernel.org/r/87v989topu.ffs@nanos.tec.linutronix.de

Both the reprogramming logic and the clock_was_set() logic are not really
optimized. The series addresses this.

  - Avoid double reprogramming when an enqueued timer is rearmed

  - Consolidate the reprogramming variants (Peter Zijlstra)

  - Fix the cases where the clock_was_set() handling is incorrect depending
    on configuration or runtime conditions.

  - Distangle the resume notification and the clock-was-set mechanism to
    prepare for IPI avoidance

  - Adopt Marcelo's patch to the modified code and add some more smarts on
    top.

The series is also available from git:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git hrtimer

Changes vs V1:

  - Pick up Peter's consolidation patch
  - Polish a few things

Thanks,

	tglx
---
 fs/timerfd.c                |   16 ++
 include/linux/hrtimer.h     |    8 -
 kernel/time/hrtimer.c       |  337 +++++++++++++++++++++++++++++++-------------
 kernel/time/tick-common.c   |    7 
 kernel/time/tick-internal.h |   12 +
 kernel/time/timekeeping.c   |   36 ++--
 6 files changed, 299 insertions(+), 117 deletions(-)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [patch V2 01/10] hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns()
  2021-07-13 13:39 [patch V2 00/10] hrtimer: Reprogramming and clock_was_set() overhaul Thomas Gleixner
@ 2021-07-13 13:39 ` Thomas Gleixner
  2021-08-10 16:02   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  2021-07-13 13:39 ` [patch V2 02/10] hrtimer: Consolidate reprogramming code Thomas Gleixner
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 36+ messages in thread
From: Thomas Gleixner @ 2021-07-13 13:39 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Anna-Maria Behnsen, Marcelo Tosatti,
	Frederic Weisbecker, Lorenzo Colitti

From: Thomas Gleixner <tglx@linutronix.de>

If __hrtimer_start_range_ns() is invoked with an already armed hrtimer then
the timer has to be canceled first and then added back. If the timer is the
first expiring timer then on removal the clockevent device is reprogrammed
to the next expiring timer to avoid that the pending expiry fires needlessly.

If the new expiry time ends up to be the first expiry again then the clock
event device has to reprogrammed again.

Avoid this by checking whether the timer is the first to expire and in that
case, keep the timer on the current CPU and delay the reprogramming up to
the point where the timer has been enqueued again.

Reported-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/time/hrtimer.c |   60 ++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 53 insertions(+), 7 deletions(-)

--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1030,12 +1030,13 @@ static void __remove_hrtimer(struct hrti
  * remove hrtimer, called with base lock held
  */
 static inline int
-remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base, bool restart)
+remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
+	       bool restart, bool keep_local)
 {
 	u8 state = timer->state;
 
 	if (state & HRTIMER_STATE_ENQUEUED) {
-		int reprogram;
+		bool reprogram;
 
 		/*
 		 * Remove the timer and force reprogramming when high
@@ -1048,8 +1049,16 @@ remove_hrtimer(struct hrtimer *timer, st
 		debug_deactivate(timer);
 		reprogram = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
 
+		/*
+		 * If the timer is not restarted then reprogramming is
+		 * required if the timer is local. If it is local and about
+		 * to be restarted, avoid programming it twice (on removal
+		 * and a moment later when it's requeued).
+		 */
 		if (!restart)
 			state = HRTIMER_STATE_INACTIVE;
+		else
+			reprogram &= !keep_local;
 
 		__remove_hrtimer(timer, base, state, reprogram);
 		return 1;
@@ -1103,9 +1112,31 @@ static int __hrtimer_start_range_ns(stru
 				    struct hrtimer_clock_base *base)
 {
 	struct hrtimer_clock_base *new_base;
+	bool force_local, first;
 
-	/* Remove an active timer from the queue: */
-	remove_hrtimer(timer, base, true);
+	/*
+	 * If the timer is on the local cpu base and is the first expiring
+	 * timer then this might end up reprogramming the hardware twice
+	 * (on removal and on enqueue). To avoid that by prevent the
+	 * reprogram on removal, keep the timer local to the current CPU
+	 * and enforce reprogramming after it is queued no matter whether
+	 * it is the new first expiring timer again or not.
+	 */
+	force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
+	force_local &= base->cpu_base->next_timer == timer;
+
+	/*
+	 * Remove an active timer from the queue. In case it is not queued
+	 * on the current CPU, make sure that remove_hrtimer() updates the
+	 * remote data correctly.
+	 *
+	 * If it's on the current CPU and the first expiring timer, then
+	 * skip reprogramming, keep the timer local and enforce
+	 * reprogramming later if it was the first expiring timer.  This
+	 * avoids programming the underlying clock event twice (once at
+	 * removal and once after enqueue).
+	 */
+	remove_hrtimer(timer, base, true, force_local);
 
 	if (mode & HRTIMER_MODE_REL)
 		tim = ktime_add_safe(tim, base->get_time());
@@ -1115,9 +1146,24 @@ static int __hrtimer_start_range_ns(stru
 	hrtimer_set_expires_range_ns(timer, tim, delta_ns);
 
 	/* Switch the timer base, if necessary: */
-	new_base = switch_hrtimer_base(timer, base, mode & HRTIMER_MODE_PINNED);
+	if (!force_local) {
+		new_base = switch_hrtimer_base(timer, base,
+					       mode & HRTIMER_MODE_PINNED);
+	} else {
+		new_base = base;
+	}
 
-	return enqueue_hrtimer(timer, new_base, mode);
+	first = enqueue_hrtimer(timer, new_base, mode);
+	if (!force_local)
+		return first;
+
+	/*
+	 * Timer was forced to stay on the current CPU to avoid
+	 * reprogramming on removal and enqueue. Force reprogram the
+	 * hardware by evaluating the new first expiring timer.
+	 */
+	hrtimer_force_reprogram(new_base->cpu_base, 1);
+	return 0;
 }
 
 /**
@@ -1183,7 +1229,7 @@ int hrtimer_try_to_cancel(struct hrtimer
 	base = lock_hrtimer_base(timer, &flags);
 
 	if (!hrtimer_callback_running(timer))
-		ret = remove_hrtimer(timer, base, false);
+		ret = remove_hrtimer(timer, base, false, false);
 
 	unlock_hrtimer_base(timer, &flags);
 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [patch V2 02/10] hrtimer: Consolidate reprogramming code
  2021-07-13 13:39 [patch V2 00/10] hrtimer: Reprogramming and clock_was_set() overhaul Thomas Gleixner
  2021-07-13 13:39 ` [patch V2 01/10] hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns() Thomas Gleixner
@ 2021-07-13 13:39 ` Thomas Gleixner
  2021-08-10 16:02   ` [tip: timers/core] " tip-bot2 for Peter Zijlstra
       [not found]   ` <CGME20210812130945eucas1p117fc1e90f31c8d9fd177932cd1a18512@eucas1p1.samsung.com>
  2021-07-13 13:39 ` [patch V2 03/10] hrtimer: Ensure timerfd notification for HIGHRES=n Thomas Gleixner
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 36+ messages in thread
From: Thomas Gleixner @ 2021-07-13 13:39 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Anna-Maria Behnsen, Marcelo Tosatti, Frederic Weisbecker

From: Peter Zijlstra <peterz@infradead.org>

This code is mostly duplicated. The redudant store in the force reprogram
case does no harm and the in hrtimer interrupt condition cannot be true for
the force reprogram invocations.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/hrtimer.c |   72 ++++++++++++++++++++------------------------------
 1 file changed, 29 insertions(+), 43 deletions(-)

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -652,21 +652,24 @@ static inline int hrtimer_hres_active(vo
 	return __hrtimer_hres_active(this_cpu_ptr(&hrtimer_bases));
 }
 
-/*
- * Reprogram the event source with checking both queues for the
- * next event
- * Called with interrupts disabled and base->lock held
- */
 static void
-hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
+__hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal,
+		    struct hrtimer *next_timer, ktime_t expires_next)
 {
-	ktime_t expires_next;
+	/*
+	 * If the hrtimer interrupt is running, then it will reevaluate the
+	 * clock bases and reprogram the clock event device.
+	 */
+	if (cpu_base->in_hrtirq)
+		return;
 
-	expires_next = hrtimer_update_next_event(cpu_base);
+	if (expires_next > cpu_base->expires_next)
+		return;
 
 	if (skip_equal && expires_next == cpu_base->expires_next)
 		return;
 
+	cpu_base->next_timer = next_timer;
 	cpu_base->expires_next = expires_next;
 
 	/*
@@ -689,7 +692,23 @@ hrtimer_force_reprogram(struct hrtimer_c
 	if (!__hrtimer_hres_active(cpu_base) || cpu_base->hang_detected)
 		return;
 
-	tick_program_event(cpu_base->expires_next, 1);
+	tick_program_event(expires_next, 1);
+}
+
+/*
+ * Reprogram the event source with checking both queues for the
+ * next event
+ * Called with interrupts disabled and base->lock held
+ */
+static void
+hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
+{
+	ktime_t expires_next;
+
+	expires_next = hrtimer_update_next_event(cpu_base);
+
+	__hrtimer_reprogram(cpu_base, skip_equal, cpu_base->next_timer,
+			    expires_next);
 }
 
 /* High resolution timer related functions */
@@ -835,40 +854,7 @@ static void hrtimer_reprogram(struct hrt
 	if (base->cpu_base != cpu_base)
 		return;
 
-	/*
-	 * If the hrtimer interrupt is running, then it will
-	 * reevaluate the clock bases and reprogram the clock event
-	 * device. The callbacks are always executed in hard interrupt
-	 * context so we don't need an extra check for a running
-	 * callback.
-	 */
-	if (cpu_base->in_hrtirq)
-		return;
-
-	if (expires >= cpu_base->expires_next)
-		return;
-
-	/* Update the pointer to the next expiring timer */
-	cpu_base->next_timer = timer;
-	cpu_base->expires_next = expires;
-
-	/*
-	 * If hres is not active, hardware does not have to be
-	 * programmed yet.
-	 *
-	 * If a hang was detected in the last timer interrupt then we
-	 * do not schedule a timer which is earlier than the expiry
-	 * which we enforced in the hang detection. We want the system
-	 * to make progress.
-	 */
-	if (!__hrtimer_hres_active(cpu_base) || cpu_base->hang_detected)
-		return;
-
-	/*
-	 * Program the timer hardware. We enforce the expiry for
-	 * events which are already in the past.
-	 */
-	tick_program_event(expires, 1);
+	__hrtimer_reprogram(cpu_base, true, timer, expires);
 }
 
 /*


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [patch V2 03/10] hrtimer: Ensure timerfd notification for HIGHRES=n
  2021-07-13 13:39 [patch V2 00/10] hrtimer: Reprogramming and clock_was_set() overhaul Thomas Gleixner
  2021-07-13 13:39 ` [patch V2 01/10] hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns() Thomas Gleixner
  2021-07-13 13:39 ` [patch V2 02/10] hrtimer: Consolidate reprogramming code Thomas Gleixner
@ 2021-07-13 13:39 ` Thomas Gleixner
  2021-08-10 16:02   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  2021-07-13 13:39 ` [patch V2 04/10] hrtimer: Force clock_was_set() handling for the HIGHRES=n, NOHZ=y case Thomas Gleixner
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 36+ messages in thread
From: Thomas Gleixner @ 2021-07-13 13:39 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Anna-Maria Behnsen, Marcelo Tosatti, Frederic Weisbecker

From: Thomas Gleixner <tglx@linutronix.de>

If high resolution timers are disabled the timerfd notification about a
clock was set event is not happening for all cases which use
clock_was_set_delayed() because that's a NOP for HIGHRES=n, which is wrong.

Make clock_was_set_delayed() unconditially available to fix that.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/hrtimer.h     |    5 -----
 kernel/time/hrtimer.c       |   32 ++++++++++++++++----------------
 kernel/time/tick-internal.h |    3 +++
 3 files changed, 19 insertions(+), 21 deletions(-)

--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -318,16 +318,12 @@ struct clock_event_device;
 
 extern void hrtimer_interrupt(struct clock_event_device *dev);
 
-extern void clock_was_set_delayed(void);
-
 extern unsigned int hrtimer_resolution;
 
 #else
 
 #define hrtimer_resolution	(unsigned int)LOW_RES_NSEC
 
-static inline void clock_was_set_delayed(void) { }
-
 #endif
 
 static inline ktime_t
@@ -351,7 +347,6 @@ hrtimer_expires_remaining_adjusted(const
 						    timer->base->get_time());
 }
 
-extern void clock_was_set(void);
 #ifdef CONFIG_TIMERFD
 extern void timerfd_clock_was_set(void);
 #else
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -777,22 +777,6 @@ static void hrtimer_switch_to_hres(void)
 	retrigger_next_event(NULL);
 }
 
-static void clock_was_set_work(struct work_struct *work)
-{
-	clock_was_set();
-}
-
-static DECLARE_WORK(hrtimer_work, clock_was_set_work);
-
-/*
- * Called from timekeeping and resume code to reprogram the hrtimer
- * interrupt device on all cpus.
- */
-void clock_was_set_delayed(void)
-{
-	schedule_work(&hrtimer_work);
-}
-
 #else
 
 static inline int hrtimer_is_hres_enabled(void) { return 0; }
@@ -877,6 +861,22 @@ void clock_was_set(void)
 	timerfd_clock_was_set();
 }
 
+static void clock_was_set_work(struct work_struct *work)
+{
+	clock_was_set();
+}
+
+static DECLARE_WORK(hrtimer_work, clock_was_set_work);
+
+/*
+ * Called from timekeeping and resume code to reprogram the hrtimer
+ * interrupt device on all cpus and to notify timerfd.
+ */
+void clock_was_set_delayed(void)
+{
+	schedule_work(&hrtimer_work);
+}
+
 /*
  * During resume we might have to reprogram the high resolution timer
  * interrupt on all online CPUs.  However, all other CPUs will be
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -165,3 +165,6 @@ DECLARE_PER_CPU(struct hrtimer_cpu_base,
 
 extern u64 get_next_timer_interrupt(unsigned long basej, u64 basem);
 void timer_clear_idle(void);
+
+void clock_was_set(void);
+void clock_was_set_delayed(void);


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [patch V2 04/10] hrtimer: Force clock_was_set() handling for the HIGHRES=n, NOHZ=y case
  2021-07-13 13:39 [patch V2 00/10] hrtimer: Reprogramming and clock_was_set() overhaul Thomas Gleixner
                   ` (2 preceding siblings ...)
  2021-07-13 13:39 ` [patch V2 03/10] hrtimer: Ensure timerfd notification for HIGHRES=n Thomas Gleixner
@ 2021-07-13 13:39 ` Thomas Gleixner
  2021-08-10 16:02   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  2021-07-13 13:39 ` [patch V2 05/10] timerfd: Provide timerfd_resume() Thomas Gleixner
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 36+ messages in thread
From: Thomas Gleixner @ 2021-07-13 13:39 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Anna-Maria Behnsen, Marcelo Tosatti, Frederic Weisbecker

From: Thomas Gleixner <tglx@linutronix.de>

When CONFIG_HIGH_RES_TIMERS is disabled, but NOHZ is enabled then
clock_was_set() is not doing anything. With HIGHRES=n the kernel relies on
the periodic tick to update the clock offsets, but when NOHZ is enabled and
active then CPUs which are in a deep idle sleep do not have a periodic tick
which means the expiry of timers affected by clock_was_set() can be
arbitrarily delayed up to the point where the CPUs are brought out of idle
again.

Make the clock_was_set() logic unconditionaly available so that idle CPUs
are kicked out of idle to handle the update.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/hrtimer.c |   87 +++++++++++++++++++++++++++++++++-----------------
 1 file changed, 59 insertions(+), 28 deletions(-)

--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -739,23 +739,7 @@ static inline int hrtimer_is_hres_enable
 	return hrtimer_hres_enabled;
 }
 
-/*
- * Retrigger next event is called after clock was set
- *
- * Called with interrupts disabled via on_each_cpu()
- */
-static void retrigger_next_event(void *arg)
-{
-	struct hrtimer_cpu_base *base = this_cpu_ptr(&hrtimer_bases);
-
-	if (!__hrtimer_hres_active(base))
-		return;
-
-	raw_spin_lock(&base->lock);
-	hrtimer_update_base(base);
-	hrtimer_force_reprogram(base, 0);
-	raw_spin_unlock(&base->lock);
-}
+static void retrigger_next_event(void *arg);
 
 /*
  * Switch to high resolution mode
@@ -781,9 +765,50 @@ static void hrtimer_switch_to_hres(void)
 
 static inline int hrtimer_is_hres_enabled(void) { return 0; }
 static inline void hrtimer_switch_to_hres(void) { }
-static inline void retrigger_next_event(void *arg) { }
 
 #endif /* CONFIG_HIGH_RES_TIMERS */
+/*
+ * Retrigger next event is called after clock was set with interrupts
+ * disabled through an SMP function call or directly from low level
+ * resume code.
+ *
+ * This is only invoked when:
+ *	- CONFIG_HIGH_RES_TIMERS is enabled.
+ *	- CONFIG_NOHZ_COMMON is enabled
+ *
+ * For the other cases this function is empty and because the call sites
+ * are optimized out it vanishes as well, i.e. no need for lots of
+ * #ifdeffery.
+ */
+static void retrigger_next_event(void *arg)
+{
+	struct hrtimer_cpu_base *base = this_cpu_ptr(&hrtimer_bases);
+
+	/*
+	 * When high resolution mode or nohz is active, then the offsets of
+	 * CLOCK_REALTIME/TAI/BOOTTIME have to be updated. Otherwise the
+	 * next tick will take care of that.
+	 *
+	 * If high resolution mode is active then the next expiring timer
+	 * must be reevaluated and the clock event device reprogrammed if
+	 * necessary.
+	 *
+	 * In the NOHZ case the update of the offset and the reevaluation
+	 * of the next expiring timer is enough. The return from the SMP
+	 * function call will take care of the reprogramming in case the
+	 * CPU was in a NOHZ idle sleep.
+	 */
+	if (!__hrtimer_hres_active(base) && !tick_nohz_active)
+		return;
+
+	raw_spin_lock(&base->lock);
+	hrtimer_update_base(base);
+	if (__hrtimer_hres_active(base))
+		hrtimer_force_reprogram(base, 0);
+	else
+		hrtimer_update_next_event(base);
+	raw_spin_unlock(&base->lock);
+}
 
 /*
  * When a timer is enqueued and expires earlier than the already enqueued
@@ -842,22 +867,28 @@ static void hrtimer_reprogram(struct hrt
 }
 
 /*
- * Clock realtime was set
- *
- * Change the offset of the realtime clock vs. the monotonic
- * clock.
+ * Clock was set. This might affect CLOCK_REALTIME, CLOCK_TAI and
+ * CLOCK_BOOTTIME (for late sleep time injection).
  *
- * We might have to reprogram the high resolution timer interrupt. On
- * SMP we call the architecture specific code to retrigger _all_ high
- * resolution timer interrupts. On UP we just disable interrupts and
- * call the high resolution interrupt code.
+ * This requires to update the offsets for these clocks
+ * vs. CLOCK_MONOTONIC. When high resolution timers are enabled, then this
+ * also requires to eventually reprogram the per CPU clock event devices
+ * when the change moves an affected timer ahead of the first expiring
+ * timer on that CPU. Obviously remote per CPU clock event devices cannot
+ * be reprogrammed. The other reason why an IPI has to be sent is when the
+ * system is in !HIGH_RES and NOHZ mode. The NOHZ mode updates the offsets
+ * in the tick, which obviously might be stopped, so this has to bring out
+ * the remote CPU which might sleep in idle to get this sorted.
  */
 void clock_was_set(void)
 {
-#ifdef CONFIG_HIGH_RES_TIMERS
+	if (!hrtimer_hres_active() && !tick_nohz_active)
+		goto out_timerfd;
+
 	/* Retrigger the CPU local events everywhere */
 	on_each_cpu(retrigger_next_event, NULL, 1);
-#endif
+
+out_timerfd:
 	timerfd_clock_was_set();
 }
 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [patch V2 05/10] timerfd: Provide timerfd_resume()
  2021-07-13 13:39 [patch V2 00/10] hrtimer: Reprogramming and clock_was_set() overhaul Thomas Gleixner
                   ` (3 preceding siblings ...)
  2021-07-13 13:39 ` [patch V2 04/10] hrtimer: Force clock_was_set() handling for the HIGHRES=n, NOHZ=y case Thomas Gleixner
@ 2021-07-13 13:39 ` Thomas Gleixner
  2021-08-10 16:02   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  2021-07-13 13:39 ` [patch V2 06/10] timekeeping: Distangle resume and clock-was-set events Thomas Gleixner
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 36+ messages in thread
From: Thomas Gleixner @ 2021-07-13 13:39 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Anna-Maria Behnsen, Marcelo Tosatti, Frederic Weisbecker

From: Thomas Gleixner <tglx@linutronix.de>

Resuming timekeeping is a clock-was-set event and uses the clock-was-set
notification mechanism. This is in the way of making the clock-was-set
update for hrtimers selective so unnecessary IPIs are avoided when a CPU
base does not have timers queued which are affected by the clock setting.

Provide a seperate timerfd_resume() interface so the resume logic and the
clock-was-set mechanism can be distangled in the core code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 fs/timerfd.c            |   16 ++++++++++++++++
 include/linux/hrtimer.h |    2 ++
 2 files changed, 18 insertions(+)

--- a/fs/timerfd.c
+++ b/fs/timerfd.c
@@ -115,6 +115,22 @@ void timerfd_clock_was_set(void)
 	rcu_read_unlock();
 }
 
+static void timerfd_resume_work(struct work_struct *work)
+{
+	timerfd_clock_was_set();
+}
+
+static DECLARE_WORK(timerfd_work, timerfd_resume_work);
+
+/*
+ * Invoked from timekeeping_resume(). Defer the actual update to work so
+ * timerfd_clock_was_set() runs in task context.
+ */
+void timerfd_resume(void)
+{
+	schedule_work(&timerfd_work);
+}
+
 static void __timerfd_remove_cancel(struct timerfd_ctx *ctx)
 {
 	if (ctx->might_cancel) {
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -349,8 +349,10 @@ hrtimer_expires_remaining_adjusted(const
 
 #ifdef CONFIG_TIMERFD
 extern void timerfd_clock_was_set(void);
+extern void timerfd_resume(void);
 #else
 static inline void timerfd_clock_was_set(void) { }
+static inline void timerfd_resume(void) { }
 #endif
 extern void hrtimers_resume(void);
 



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [patch V2 06/10] timekeeping: Distangle resume and clock-was-set events
  2021-07-13 13:39 [patch V2 00/10] hrtimer: Reprogramming and clock_was_set() overhaul Thomas Gleixner
                   ` (4 preceding siblings ...)
  2021-07-13 13:39 ` [patch V2 05/10] timerfd: Provide timerfd_resume() Thomas Gleixner
@ 2021-07-13 13:39 ` Thomas Gleixner
  2021-08-10 16:02   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  2021-07-13 13:39 ` [patch V2 07/10] time/timekeeping: Avoid invoking clock_was_set() twice Thomas Gleixner
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 36+ messages in thread
From: Thomas Gleixner @ 2021-07-13 13:39 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Anna-Maria Behnsen, Marcelo Tosatti, Frederic Weisbecker

From: Thomas Gleixner <tglx@linutronix.de>

Resuming timekeeping is a clock-was-set event and uses the clock-was-set
notification mechanism. This is in the way of making the clock-was-set
update for hrtimers selective so unnecessary IPIs are avoided when a CPU
base does not have timers queued which are affected by the clock setting.

Distangle it by invoking hrtimer_resume() on each unfreezing CPU and invoke
the new timerfd_resume() function from timekeeping_resume() which is the
only place where this is needed.

Rename hrtimer_resume() to hrtimer_resume_local() to reflect the change.

With this the clock_was_set*() functions are not longer required to IPI all
CPUs unconditionally and can get some smarts to avoid them.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/hrtimer.h     |    1 -
 kernel/time/hrtimer.c       |   15 ++++++---------
 kernel/time/tick-common.c   |    7 +++++++
 kernel/time/tick-internal.h |    2 ++
 kernel/time/timekeeping.c   |    4 +++-
 5 files changed, 18 insertions(+), 11 deletions(-)

--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -354,7 +354,6 @@ extern void timerfd_resume(void);
 static inline void timerfd_clock_was_set(void) { }
 static inline void timerfd_resume(void) { }
 #endif
-extern void hrtimers_resume(void);
 
 DECLARE_PER_CPU(struct tick_device, tick_cpu_device);
 
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -900,8 +900,8 @@ static void clock_was_set_work(struct wo
 static DECLARE_WORK(hrtimer_work, clock_was_set_work);
 
 /*
- * Called from timekeeping and resume code to reprogram the hrtimer
- * interrupt device on all cpus and to notify timerfd.
+ * Called from timekeeping code to reprogram the hrtimer interrupt device
+ * on all cpus and to notify timerfd.
  */
 void clock_was_set_delayed(void)
 {
@@ -909,18 +909,15 @@ void clock_was_set_delayed(void)
 }
 
 /*
- * During resume we might have to reprogram the high resolution timer
- * interrupt on all online CPUs.  However, all other CPUs will be
- * stopped with IRQs interrupts disabled so the clock_was_set() call
- * must be deferred.
+ * Called during resume either directly from via timekeeping_resume()
+ * or in the case of s2idle from tick_unfreeze() to ensure that the
+ * hrtimers are up to date.
  */
-void hrtimers_resume(void)
+void hrtimers_resume_local(void)
 {
 	lockdep_assert_irqs_disabled();
 	/* Retrigger on the local CPU */
 	retrigger_next_event(NULL);
-	/* And schedule a retrigger for all others */
-	clock_was_set_delayed();
 }
 
 /*
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -470,6 +470,13 @@ void tick_resume_local(void)
 		else
 			tick_resume_oneshot();
 	}
+
+	/*
+	 * Ensure that hrtimers are up to date and the clockevents device
+	 * is reprogrammed correctly when high resolution timers are
+	 * enabled.
+	 */
+	hrtimers_resume_local();
 }
 
 /**
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -168,3 +168,5 @@ void timer_clear_idle(void);
 
 void clock_was_set(void);
 void clock_was_set_delayed(void);
+
+void hrtimers_resume_local(void);
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1810,8 +1810,10 @@ void timekeeping_resume(void)
 
 	touch_softlockup_watchdog();
 
+	/* Resume the clockevent device(s) and hrtimers */
 	tick_resume();
-	hrtimers_resume();
+	/* Notify timerfd as resume is equivalent to clock_was_set() */
+	timerfd_resume();
 }
 
 int timekeeping_suspend(void)


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [patch V2 07/10] time/timekeeping: Avoid invoking clock_was_set() twice
  2021-07-13 13:39 [patch V2 00/10] hrtimer: Reprogramming and clock_was_set() overhaul Thomas Gleixner
                   ` (5 preceding siblings ...)
  2021-07-13 13:39 ` [patch V2 06/10] timekeeping: Distangle resume and clock-was-set events Thomas Gleixner
@ 2021-07-13 13:39 ` Thomas Gleixner
  2021-08-10 16:02   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  2021-07-13 13:39 ` [patch V2 08/10] hrtimer: Add bases argument to clock_was_set() Thomas Gleixner
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 36+ messages in thread
From: Thomas Gleixner @ 2021-07-13 13:39 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Anna-Maria Behnsen, Marcelo Tosatti, Frederic Weisbecker

From: Thomas Gleixner <tglx@linutronix.de>

do_adjtimex() might end up scheduling a delayed clock_was_set() via
timekeeping_advance() and then invoke clock_was_set() directly which is
pointless.

Make timekeeping_advance() return whether an invocation of clock_was_set()
is required and handle it at the call sites which allows do_adjtimex() to
issue a single direct call if required.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/timekeeping.c |   18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -2127,7 +2127,7 @@ static u64 logarithmic_accumulation(stru
  * timekeeping_advance - Updates the timekeeper to the current time and
  * current NTP tick length
  */
-static void timekeeping_advance(enum timekeeping_adv_mode mode)
+static bool timekeeping_advance(enum timekeeping_adv_mode mode)
 {
 	struct timekeeper *real_tk = &tk_core.timekeeper;
 	struct timekeeper *tk = &shadow_timekeeper;
@@ -2198,9 +2198,8 @@ static void timekeeping_advance(enum tim
 	write_seqcount_end(&tk_core.seq);
 out:
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
-	if (clock_set)
-		/* Have to call _delayed version, since in irq context*/
-		clock_was_set_delayed();
+
+	return !!clock_set;
 }
 
 /**
@@ -2209,7 +2208,8 @@ static void timekeeping_advance(enum tim
  */
 void update_wall_time(void)
 {
-	timekeeping_advance(TK_ADV_TICK);
+	if (timekeeping_advance(TK_ADV_TICK))
+		clock_was_set_delayed();
 }
 
 /**
@@ -2389,8 +2389,9 @@ int do_adjtimex(struct __kernel_timex *t
 {
 	struct timekeeper *tk = &tk_core.timekeeper;
 	struct audit_ntp_data ad;
-	unsigned long flags;
+	bool clock_set = false;
 	struct timespec64 ts;
+	unsigned long flags;
 	s32 orig_tai, tai;
 	int ret;
 
@@ -2425,6 +2426,7 @@ int do_adjtimex(struct __kernel_timex *t
 	if (tai != orig_tai) {
 		__timekeeping_set_tai_offset(tk, tai);
 		timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);
+		clock_set = true;
 	}
 	tk_update_leap_state(tk);
 
@@ -2435,9 +2437,9 @@ int do_adjtimex(struct __kernel_timex *t
 
 	/* Update the multiplier immediately if frequency was set directly */
 	if (txc->modes & (ADJ_FREQUENCY | ADJ_TICK))
-		timekeeping_advance(TK_ADV_FREQ);
+		clock_set |= timekeeping_advance(TK_ADV_FREQ);
 
-	if (tai != orig_tai)
+	if (clock_set)
 		clock_was_set();
 
 	ntp_notify_cmos_timer();


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [patch V2 08/10] hrtimer: Add bases argument to clock_was_set()
  2021-07-13 13:39 [patch V2 00/10] hrtimer: Reprogramming and clock_was_set() overhaul Thomas Gleixner
                   ` (6 preceding siblings ...)
  2021-07-13 13:39 ` [patch V2 07/10] time/timekeeping: Avoid invoking clock_was_set() twice Thomas Gleixner
@ 2021-07-13 13:39 ` Thomas Gleixner
  2021-08-10 16:02   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  2021-07-13 13:39 ` [patch V2 09/10] hrtimer: Avoid unnecessary SMP function calls in clock_was_set() Thomas Gleixner
  2021-07-13 13:39 ` [patch V2 10/10] hrtimer: Avoid more " Thomas Gleixner
  9 siblings, 1 reply; 36+ messages in thread
From: Thomas Gleixner @ 2021-07-13 13:39 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Anna-Maria Behnsen, Marcelo Tosatti, Frederic Weisbecker

From: Thomas Gleixner <tglx@linutronix.de>

clock_was_set() unconditionaly invokes retrigger_next_event() on all online
CPUs. This was necessary because that mechanism was also used for resume
from suspend to idle which is not longer the case.

The bases arguments allows the callers of clock_was_set() to hand in a mask
which tells clock_was_set() which of the hrtimer clock bases are affected
by the clock setting. This mask will be used in the next step to check
whether a CPU base has timers queued on a clock base affected by the event
and avoid the SMP function call if there are none.

Add a @bases argument, provide defines for the active bases masking and
fixup all callsites.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/hrtimer.c       |    4 ++--
 kernel/time/tick-internal.h |    9 ++++++++-
 kernel/time/timekeeping.c   |   14 +++++++-------
 3 files changed, 17 insertions(+), 10 deletions(-)

--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -880,7 +880,7 @@ static void hrtimer_reprogram(struct hrt
  * in the tick, which obviously might be stopped, so this has to bring out
  * the remote CPU which might sleep in idle to get this sorted.
  */
-void clock_was_set(void)
+void clock_was_set(unsigned int bases)
 {
 	if (!hrtimer_hres_active() && !tick_nohz_active)
 		goto out_timerfd;
@@ -894,7 +894,7 @@ void clock_was_set(void)
 
 static void clock_was_set_work(struct work_struct *work)
 {
-	clock_was_set();
+	clock_was_set(CLOCK_SET_WALL);
 }
 
 static DECLARE_WORK(hrtimer_work, clock_was_set_work);
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -166,7 +166,14 @@ DECLARE_PER_CPU(struct hrtimer_cpu_base,
 extern u64 get_next_timer_interrupt(unsigned long basej, u64 basem);
 void timer_clear_idle(void);
 
-void clock_was_set(void);
+#define CLOCK_SET_WALL							\
+	(BIT(HRTIMER_BASE_REALTIME) | BIT(HRTIMER_BASE_REALTIME_SOFT) |	\
+	 BIT(HRTIMER_BASE_TAI) | BIT(HRTIMER_BASE_TAI_SOFT))
+
+#define CLOCK_SET_BOOT							\
+	(BIT(HRTIMER_BASE_BOOTTIME) | BIT(HRTIMER_BASE_BOOTTIME_SOFT))
+
+void clock_was_set(unsigned int bases);
 void clock_was_set_delayed(void);
 
 void hrtimers_resume_local(void);
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1323,8 +1323,8 @@ int do_settimeofday64(const struct times
 	write_seqcount_end(&tk_core.seq);
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
 
-	/* signal hrtimers about time change */
-	clock_was_set();
+	/* Signal hrtimers about time change */
+	clock_was_set(CLOCK_SET_WALL);
 
 	if (!ret)
 		audit_tk_injoffset(ts_delta);
@@ -1371,8 +1371,8 @@ error: /* even if we error out, we forwa
 	write_seqcount_end(&tk_core.seq);
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
 
-	/* signal hrtimers about time change */
-	clock_was_set();
+	/* Signal hrtimers about time change */
+	clock_was_set(CLOCK_SET_WALL);
 
 	return ret;
 }
@@ -1746,8 +1746,8 @@ void timekeeping_inject_sleeptime64(cons
 	write_seqcount_end(&tk_core.seq);
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
 
-	/* signal hrtimers about time change */
-	clock_was_set();
+	/* Signal hrtimers about time change */
+	clock_was_set(CLOCK_SET_WALL | CLOCK_SET_BOOT);
 }
 #endif
 
@@ -2440,7 +2440,7 @@ int do_adjtimex(struct __kernel_timex *t
 		clock_set |= timekeeping_advance(TK_ADV_FREQ);
 
 	if (clock_set)
-		clock_was_set();
+		clock_was_set(CLOCK_REALTIME);
 
 	ntp_notify_cmos_timer();
 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [patch V2 09/10] hrtimer: Avoid unnecessary SMP function calls in clock_was_set()
  2021-07-13 13:39 [patch V2 00/10] hrtimer: Reprogramming and clock_was_set() overhaul Thomas Gleixner
                   ` (7 preceding siblings ...)
  2021-07-13 13:39 ` [patch V2 08/10] hrtimer: Add bases argument to clock_was_set() Thomas Gleixner
@ 2021-07-13 13:39 ` Thomas Gleixner
  2021-08-10 16:02   ` [tip: timers/core] " tip-bot2 for Marcelo Tosatti
  2021-07-13 13:39 ` [patch V2 10/10] hrtimer: Avoid more " Thomas Gleixner
  9 siblings, 1 reply; 36+ messages in thread
From: Thomas Gleixner @ 2021-07-13 13:39 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Anna-Maria Behnsen, Marcelo Tosatti, Frederic Weisbecker

From: Marcelo Tosatti <mtosatti@redhat.com>

Setting of clocks triggers an unconditional SMP function call on all online
CPUs to reprogram the clock event device.

However, only some clocks have their offsets updated and therefore
potentially require a reprogram. That's CLOCK_REALTIME and CLOCK_TAI and in
the case of resume (delayed sleep time injection) also CLOCK_BOOTTIME.

Instead of sending an IPI unconditionally, check each per CPU hrtimer base
whether it has active timers in the affected clock bases which are
indicated by the caller in the @bases argument of clock_was_set().

If that's not the case, skip the IPI and update the offsets remotely which
ensures that any subsequently armed timers on the affected clocks are
evaluated with the correct offsets.

[ tglx: Adopted to the new bases argument, removed the softirq_active
  	check, added comment, fixed up stale comment ]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
v6:
  - Remove the softirq_active check (Peter Xu)
  - Adopt to the new base argument
  - Add comment and fixup the stale one
  - Bring back CLOCK_BOOTTIME conditionally for late sleeptime
    injection on resume (missed that in the review of V1)

v5:
  - Add missing hrtimer_update_base (Peter Xu).

v4:
   - Drop unused code (Thomas).

v3:
   - Nicer changelog  (Thomas).
   - Code style fixes (Thomas).
   - Compilation warning with CONFIG_HIGH_RES_TIMERS=n (Thomas).
   - Shrink preemption disabled section (Thomas).

v2:
   - Only REALTIME and TAI bases are affected by offset-to-monotonic changes (Thomas).
   - Don't special case nohz_full CPUs (Thomas).


---
 kernel/time/hrtimer.c |   35 +++++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -896,11 +896,42 @@ static void hrtimer_reprogram(struct hrt
  */
 void clock_was_set(unsigned int bases)
 {
+	cpumask_var_t mask;
+	int cpu;
+
 	if (!hrtimer_hres_active() && !tick_nohz_active)
 		goto out_timerfd;
 
-	/* Retrigger the CPU local events everywhere */
-	on_each_cpu(retrigger_next_event, NULL, 1);
+	if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
+		on_each_cpu(retrigger_next_event, NULL, 1);
+		goto out_timerfd;
+	}
+
+	/* Avoid interrupting CPUs if possible */
+	cpus_read_lock();
+	for_each_online_cpu(cpu) {
+		struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
+		unsigned long flags;
+
+		raw_spin_lock_irqsave(&cpu_base->lock, flags);
+		/*
+		 * Only send the IPI when there are timers queued in one of
+		 * the affected clock bases. Otherwise update the base
+		 * remote to ensure that the next enqueue of a timer on
+		 * such a clock base will see the correct offsets.
+		 */
+		if (cpu_base->active_bases & bases)
+			cpumask_set_cpu(cpu, mask);
+		else
+			hrtimer_update_base(cpu_base);
+		raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
+	}
+
+	preempt_disable();
+	smp_call_function_many(mask, retrigger_next_event, NULL, 1);
+	preempt_enable();
+	cpus_read_unlock();
+	free_cpumask_var(mask);
 
 out_timerfd:
 	timerfd_clock_was_set();



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [patch V2 10/10] hrtimer: Avoid more SMP function calls in clock_was_set()
  2021-07-13 13:39 [patch V2 00/10] hrtimer: Reprogramming and clock_was_set() overhaul Thomas Gleixner
                   ` (8 preceding siblings ...)
  2021-07-13 13:39 ` [patch V2 09/10] hrtimer: Avoid unnecessary SMP function calls in clock_was_set() Thomas Gleixner
@ 2021-07-13 13:39 ` Thomas Gleixner
  2021-08-10 16:02   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  9 siblings, 1 reply; 36+ messages in thread
From: Thomas Gleixner @ 2021-07-13 13:39 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Anna-Maria Behnsen, Marcelo Tosatti, Frederic Weisbecker

From: Thomas Gleixner <tglx@linutronix.de>

By unconditionally updating the offsets there are more indicators
whether the SMP function calls on clock_was_set() can be avoided:

  - When the offset update already happened on the remote CPU then the
    remote update attempt will yield the same seqeuence number and no
    IPI is required.

  - When the remote CPU is currently handling hrtimer_interrupt(). In
    that case the remote CPU will reevaluate the timer bases before
    reprogramming anyway, so nothing to do.

  - After updating it can be checked whether the first expiring timer in
    the affected clock bases moves before the first expiring (softirq)
    timer of the CPU. If that's not the case then sending the IPI is not
    required.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V2: Fix the in_hrtirq thinko (Marcelo)
    Add the missing masking (reported by 0day)

P.S.: The git branch is updated as well
---
 kernel/time/hrtimer.c |   74 +++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 65 insertions(+), 9 deletions(-)

--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -866,6 +866,68 @@ static void hrtimer_reprogram(struct hrt
 	__hrtimer_reprogram(cpu_base, true, timer, expires);
 }
 
+static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base,
+			     unsigned int active)
+{
+	struct hrtimer_clock_base *base;
+	unsigned int seq;
+	ktime_t expires;
+
+	/*
+	 * Update the base offsets unconditionally so the following
+	 * checks whether the SMP function call is required works.
+	 *
+	 * The update is safe even when the remote CPU is in the hrtimer
+	 * interrupt or the hrtimer soft interrupt and expiring affected
+	 * bases. Either it will see the update before handling a base or
+	 * it will see it when it finishes the processing and reevaluates
+	 * the next expiring timer.
+	 */
+	seq = cpu_base->clock_was_set_seq;
+	hrtimer_update_base(cpu_base);
+
+	/*
+	 * If the sequence did not change over the update then the
+	 * remote CPU already handled it.
+	 */
+	if (seq == cpu_base->clock_was_set_seq)
+		return false;
+
+	/*
+	 * If the remote CPU is currently handling an hrtimer interrupt, it
+	 * will reevaluate the first expiring timer of all clock bases
+	 * before reprogramming. Nothing to do here.
+	 */
+	if (cpu_base->in_hrtirq)
+		return false;
+
+	/*
+	 * Walk the affected clock bases and check whether the first expiring
+	 * timer in a clock base is moving ahead of the first expiring timer of
+	 * @cpu_base. If so, the IPI must be invoked because per CPU clock
+	 * event devices cannot be remotely reprogrammed.
+	 */
+	active &= cpu_base->active_bases;
+
+	for_each_active_base(base, cpu_base, active) {
+		struct timerqueue_node *next;
+
+		next = timerqueue_getnext(&base->active);
+		expires = ktime_sub(next->expires, base->offset);
+		if (expires < cpu_base->expires_next)
+			return true;
+
+		/* Extra check for softirq clock bases */
+		if (base->clockid < HRTIMER_BASE_MONOTONIC_SOFT)
+			continue;
+		if (cpu_base->softirq_activated)
+			continue;
+		if (expires < cpu_base->softirq_expires_next)
+			return true;
+	}
+	return false;
+}
+
 /*
  * Clock was set. This might affect CLOCK_REALTIME, CLOCK_TAI and
  * CLOCK_BOOTTIME (for late sleep time injection).
@@ -900,16 +962,10 @@ void clock_was_set(unsigned int bases)
 		unsigned long flags;
 
 		raw_spin_lock_irqsave(&cpu_base->lock, flags);
-		/*
-		 * Only send the IPI when there are timers queued in one of
-		 * the affected clock bases. Otherwise update the base
-		 * remote to ensure that the next enqueue of a timer on
-		 * such a clock base will see the correct offsets.
-		 */
-		if (cpu_base->active_bases & bases)
+
+		if (update_needs_ipi(cpu_base, bases))
 			cpumask_set_cpu(cpu, mask);
-		else
-			hrtimer_update_base(cpu_base);
+
 		raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
 	}
 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [tip: timers/core] hrtimer: Avoid more SMP function calls in clock_was_set()
  2021-07-13 13:39 ` [patch V2 10/10] hrtimer: Avoid more " Thomas Gleixner
@ 2021-08-10 16:02   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2021-08-10 16:02 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     1e7f7fbcd40c69d23e3fe641ead9f3dc128fa8aa
Gitweb:        https://git.kernel.org/tip/1e7f7fbcd40c69d23e3fe641ead9f3dc128fa8aa
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 13 Jul 2021 15:39:55 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 10 Aug 2021 17:57:23 +02:00

hrtimer: Avoid more SMP function calls in clock_was_set()

By unconditionally updating the offsets there are more indicators
whether the SMP function calls on clock_was_set() can be avoided:

  - When the offset update already happened on the remote CPU then the
    remote update attempt will yield the same seqeuence number and no
    IPI is required.

  - When the remote CPU is currently handling hrtimer_interrupt(). In
    that case the remote CPU will reevaluate the timer bases before
    reprogramming anyway, so nothing to do.

  - After updating it can be checked whether the first expiring timer in
    the affected clock bases moves before the first expiring (softirq)
    timer of the CPU. If that's not the case then sending the IPI is not
    required.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.887322464@linutronix.de

---
 kernel/time/hrtimer.c | 74 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 65 insertions(+), 9 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 5d44c90..88aefc3 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -866,6 +866,68 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
 	__hrtimer_reprogram(cpu_base, true, timer, expires);
 }
 
+static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base,
+			     unsigned int active)
+{
+	struct hrtimer_clock_base *base;
+	unsigned int seq;
+	ktime_t expires;
+
+	/*
+	 * Update the base offsets unconditionally so the following
+	 * checks whether the SMP function call is required works.
+	 *
+	 * The update is safe even when the remote CPU is in the hrtimer
+	 * interrupt or the hrtimer soft interrupt and expiring affected
+	 * bases. Either it will see the update before handling a base or
+	 * it will see it when it finishes the processing and reevaluates
+	 * the next expiring timer.
+	 */
+	seq = cpu_base->clock_was_set_seq;
+	hrtimer_update_base(cpu_base);
+
+	/*
+	 * If the sequence did not change over the update then the
+	 * remote CPU already handled it.
+	 */
+	if (seq == cpu_base->clock_was_set_seq)
+		return false;
+
+	/*
+	 * If the remote CPU is currently handling an hrtimer interrupt, it
+	 * will reevaluate the first expiring timer of all clock bases
+	 * before reprogramming. Nothing to do here.
+	 */
+	if (cpu_base->in_hrtirq)
+		return false;
+
+	/*
+	 * Walk the affected clock bases and check whether the first expiring
+	 * timer in a clock base is moving ahead of the first expiring timer of
+	 * @cpu_base. If so, the IPI must be invoked because per CPU clock
+	 * event devices cannot be remotely reprogrammed.
+	 */
+	active &= cpu_base->active_bases;
+
+	for_each_active_base(base, cpu_base, active) {
+		struct timerqueue_node *next;
+
+		next = timerqueue_getnext(&base->active);
+		expires = ktime_sub(next->expires, base->offset);
+		if (expires < cpu_base->expires_next)
+			return true;
+
+		/* Extra check for softirq clock bases */
+		if (base->clockid < HRTIMER_BASE_MONOTONIC_SOFT)
+			continue;
+		if (cpu_base->softirq_activated)
+			continue;
+		if (expires < cpu_base->softirq_expires_next)
+			return true;
+	}
+	return false;
+}
+
 /*
  * Clock was set. This might affect CLOCK_REALTIME, CLOCK_TAI and
  * CLOCK_BOOTTIME (for late sleep time injection).
@@ -900,16 +962,10 @@ void clock_was_set(unsigned int bases)
 		unsigned long flags;
 
 		raw_spin_lock_irqsave(&cpu_base->lock, flags);
-		/*
-		 * Only send the IPI when there are timers queued in one of
-		 * the affected clock bases. Otherwise update the base
-		 * remote to ensure that the next enqueue of a timer on
-		 * such a clock base will see the correct offsets.
-		 */
-		if (cpu_base->active_bases & bases)
+
+		if (update_needs_ipi(cpu_base, bases))
 			cpumask_set_cpu(cpu, mask);
-		else
-			hrtimer_update_base(cpu_base);
+
 		raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
 	}
 

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [tip: timers/core] hrtimer: Avoid unnecessary SMP function calls in clock_was_set()
  2021-07-13 13:39 ` [patch V2 09/10] hrtimer: Avoid unnecessary SMP function calls in clock_was_set() Thomas Gleixner
@ 2021-08-10 16:02   ` tip-bot2 for Marcelo Tosatti
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Marcelo Tosatti @ 2021-08-10 16:02 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Marcelo Tosatti, Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     81d741d3460ca422843ce0ec8351083f259c6166
Gitweb:        https://git.kernel.org/tip/81d741d3460ca422843ce0ec8351083f259c6166
Author:        Marcelo Tosatti <mtosatti@redhat.com>
AuthorDate:    Tue, 13 Jul 2021 15:39:54 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 10 Aug 2021 17:57:23 +02:00

hrtimer: Avoid unnecessary SMP function calls in clock_was_set()

Setting of clocks triggers an unconditional SMP function call on all online
CPUs to reprogram the clock event device.

However, only some clocks have their offsets updated and therefore
potentially require a reprogram. That's CLOCK_REALTIME and CLOCK_TAI and in
the case of resume (delayed sleep time injection) also CLOCK_BOOTTIME.

Instead of sending an IPI unconditionally, check each per CPU hrtimer base
whether it has active timers in the affected clock bases which are
indicated by the caller in the @bases argument of clock_was_set().

If that's not the case, skip the IPI and update the offsets remotely which
ensures that any subsequently armed timers on the affected clocks are
evaluated with the correct offsets.

[ tglx: Adopted to the new bases argument, removed the softirq_active
  	check, added comment, fixed up stale comment ]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.787536542@linutronix.de

---
 kernel/time/hrtimer.c | 35 +++++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index c8af165..5d44c90 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -882,11 +882,42 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
  */
 void clock_was_set(unsigned int bases)
 {
+	cpumask_var_t mask;
+	int cpu;
+
 	if (!hrtimer_hres_active() && !tick_nohz_active)
 		goto out_timerfd;
 
-	/* Retrigger the CPU local events everywhere */
-	on_each_cpu(retrigger_next_event, NULL, 1);
+	if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
+		on_each_cpu(retrigger_next_event, NULL, 1);
+		goto out_timerfd;
+	}
+
+	/* Avoid interrupting CPUs if possible */
+	cpus_read_lock();
+	for_each_online_cpu(cpu) {
+		struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
+		unsigned long flags;
+
+		raw_spin_lock_irqsave(&cpu_base->lock, flags);
+		/*
+		 * Only send the IPI when there are timers queued in one of
+		 * the affected clock bases. Otherwise update the base
+		 * remote to ensure that the next enqueue of a timer on
+		 * such a clock base will see the correct offsets.
+		 */
+		if (cpu_base->active_bases & bases)
+			cpumask_set_cpu(cpu, mask);
+		else
+			hrtimer_update_base(cpu_base);
+		raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
+	}
+
+	preempt_disable();
+	smp_call_function_many(mask, retrigger_next_event, NULL, 1);
+	preempt_enable();
+	cpus_read_unlock();
+	free_cpumask_var(mask);
 
 out_timerfd:
 	timerfd_clock_was_set();

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [tip: timers/core] hrtimer: Add bases argument to clock_was_set()
  2021-07-13 13:39 ` [patch V2 08/10] hrtimer: Add bases argument to clock_was_set() Thomas Gleixner
@ 2021-08-10 16:02   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2021-08-10 16:02 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     17a1b8826b451c80e7999a7c68e06b70579b2b8f
Gitweb:        https://git.kernel.org/tip/17a1b8826b451c80e7999a7c68e06b70579b2b8f
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 13 Jul 2021 15:39:53 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 10 Aug 2021 17:57:23 +02:00

hrtimer: Add bases argument to clock_was_set()

clock_was_set() unconditionaly invokes retrigger_next_event() on all online
CPUs. This was necessary because that mechanism was also used for resume
from suspend to idle which is not longer the case.

The bases arguments allows the callers of clock_was_set() to hand in a mask
which tells clock_was_set() which of the hrtimer clock bases are affected
by the clock setting. This mask will be used in the next step to check
whether a CPU base has timers queued on a clock base affected by the event
and avoid the SMP function call if there are none.

Add a @bases argument, provide defines for the active bases masking and
fixup all callsites.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.691083465@linutronix.de

---
 kernel/time/hrtimer.c       |  4 ++--
 kernel/time/tick-internal.h |  9 ++++++++-
 kernel/time/timekeeping.c   | 14 +++++++-------
 3 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 68e56f0..c8af165 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -880,7 +880,7 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
  * in the tick, which obviously might be stopped, so this has to bring out
  * the remote CPU which might sleep in idle to get this sorted.
  */
-void clock_was_set(void)
+void clock_was_set(unsigned int bases)
 {
 	if (!hrtimer_hres_active() && !tick_nohz_active)
 		goto out_timerfd;
@@ -894,7 +894,7 @@ out_timerfd:
 
 static void clock_was_set_work(struct work_struct *work)
 {
-	clock_was_set();
+	clock_was_set(CLOCK_SET_WALL);
 }
 
 static DECLARE_WORK(hrtimer_work, clock_was_set_work);
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index 22de98c..3548f08 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -166,7 +166,14 @@ DECLARE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases);
 extern u64 get_next_timer_interrupt(unsigned long basej, u64 basem);
 void timer_clear_idle(void);
 
-void clock_was_set(void);
+#define CLOCK_SET_WALL							\
+	(BIT(HRTIMER_BASE_REALTIME) | BIT(HRTIMER_BASE_REALTIME_SOFT) |	\
+	 BIT(HRTIMER_BASE_TAI) | BIT(HRTIMER_BASE_TAI_SOFT))
+
+#define CLOCK_SET_BOOT							\
+	(BIT(HRTIMER_BASE_BOOTTIME) | BIT(HRTIMER_BASE_BOOTTIME_SOFT))
+
+void clock_was_set(unsigned int bases);
 void clock_was_set_delayed(void);
 
 void hrtimers_resume_local(void);
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 19ed58e..b348749 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1323,8 +1323,8 @@ out:
 	write_seqcount_end(&tk_core.seq);
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
 
-	/* signal hrtimers about time change */
-	clock_was_set();
+	/* Signal hrtimers about time change */
+	clock_was_set(CLOCK_SET_WALL);
 
 	if (!ret)
 		audit_tk_injoffset(ts_delta);
@@ -1371,8 +1371,8 @@ error: /* even if we error out, we forwarded the time, so call update */
 	write_seqcount_end(&tk_core.seq);
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
 
-	/* signal hrtimers about time change */
-	clock_was_set();
+	/* Signal hrtimers about time change */
+	clock_was_set(CLOCK_SET_WALL);
 
 	return ret;
 }
@@ -1746,8 +1746,8 @@ void timekeeping_inject_sleeptime64(const struct timespec64 *delta)
 	write_seqcount_end(&tk_core.seq);
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
 
-	/* signal hrtimers about time change */
-	clock_was_set();
+	/* Signal hrtimers about time change */
+	clock_was_set(CLOCK_SET_WALL | CLOCK_SET_BOOT);
 }
 #endif
 
@@ -2440,7 +2440,7 @@ int do_adjtimex(struct __kernel_timex *txc)
 		clock_set |= timekeeping_advance(TK_ADV_FREQ);
 
 	if (clock_set)
-		clock_was_set();
+		clock_was_set(CLOCK_REALTIME);
 
 	ntp_notify_cmos_timer();
 

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [tip: timers/core] time/timekeeping: Avoid invoking clock_was_set() twice
  2021-07-13 13:39 ` [patch V2 07/10] time/timekeeping: Avoid invoking clock_was_set() twice Thomas Gleixner
@ 2021-08-10 16:02   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2021-08-10 16:02 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     1b267793f4fd9a089ea8558f3b6698186b9a3214
Gitweb:        https://git.kernel.org/tip/1b267793f4fd9a089ea8558f3b6698186b9a3214
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 13 Jul 2021 15:39:52 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 10 Aug 2021 17:57:23 +02:00

time/timekeeping: Avoid invoking clock_was_set() twice

do_adjtimex() might end up scheduling a delayed clock_was_set() via
timekeeping_advance() and then invoke clock_was_set() directly which is
pointless.

Make timekeeping_advance() return whether an invocation of clock_was_set()
is required and handle it at the call sites which allows do_adjtimex() to
issue a single direct call if required.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.580966888@linutronix.de

---
 kernel/time/timekeeping.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index c8a9b9e..19ed58e 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -2127,7 +2127,7 @@ static u64 logarithmic_accumulation(struct timekeeper *tk, u64 offset,
  * timekeeping_advance - Updates the timekeeper to the current time and
  * current NTP tick length
  */
-static void timekeeping_advance(enum timekeeping_adv_mode mode)
+static bool timekeeping_advance(enum timekeeping_adv_mode mode)
 {
 	struct timekeeper *real_tk = &tk_core.timekeeper;
 	struct timekeeper *tk = &shadow_timekeeper;
@@ -2198,9 +2198,8 @@ static void timekeeping_advance(enum timekeeping_adv_mode mode)
 	write_seqcount_end(&tk_core.seq);
 out:
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
-	if (clock_set)
-		/* Have to call _delayed version, since in irq context*/
-		clock_was_set_delayed();
+
+	return !!clock_set;
 }
 
 /**
@@ -2209,7 +2208,8 @@ out:
  */
 void update_wall_time(void)
 {
-	timekeeping_advance(TK_ADV_TICK);
+	if (timekeeping_advance(TK_ADV_TICK))
+		clock_was_set_delayed();
 }
 
 /**
@@ -2389,8 +2389,9 @@ int do_adjtimex(struct __kernel_timex *txc)
 {
 	struct timekeeper *tk = &tk_core.timekeeper;
 	struct audit_ntp_data ad;
-	unsigned long flags;
+	bool clock_set = false;
 	struct timespec64 ts;
+	unsigned long flags;
 	s32 orig_tai, tai;
 	int ret;
 
@@ -2425,6 +2426,7 @@ int do_adjtimex(struct __kernel_timex *txc)
 	if (tai != orig_tai) {
 		__timekeeping_set_tai_offset(tk, tai);
 		timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);
+		clock_set = true;
 	}
 	tk_update_leap_state(tk);
 
@@ -2435,9 +2437,9 @@ int do_adjtimex(struct __kernel_timex *txc)
 
 	/* Update the multiplier immediately if frequency was set directly */
 	if (txc->modes & (ADJ_FREQUENCY | ADJ_TICK))
-		timekeeping_advance(TK_ADV_FREQ);
+		clock_set |= timekeeping_advance(TK_ADV_FREQ);
 
-	if (tai != orig_tai)
+	if (clock_set)
 		clock_was_set();
 
 	ntp_notify_cmos_timer();

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [tip: timers/core] timekeeping: Distangle resume and clock-was-set events
  2021-07-13 13:39 ` [patch V2 06/10] timekeeping: Distangle resume and clock-was-set events Thomas Gleixner
@ 2021-08-10 16:02   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2021-08-10 16:02 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     a761a67f591a8c7476c30bb20ed0f09fdfb1a704
Gitweb:        https://git.kernel.org/tip/a761a67f591a8c7476c30bb20ed0f09fdfb1a704
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 13 Jul 2021 15:39:51 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 10 Aug 2021 17:57:23 +02:00

timekeeping: Distangle resume and clock-was-set events

Resuming timekeeping is a clock-was-set event and uses the clock-was-set
notification mechanism. This is in the way of making the clock-was-set
update for hrtimers selective so unnecessary IPIs are avoided when a CPU
base does not have timers queued which are affected by the clock setting.

Distangle it by invoking hrtimer_resume() on each unfreezing CPU and invoke
the new timerfd_resume() function from timekeeping_resume() which is the
only place where this is needed.

Rename hrtimer_resume() to hrtimer_resume_local() to reflect the change.

With this the clock_was_set*() functions are not longer required to IPI all
CPUs unconditionally and can get some smarts to avoid them.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.488853478@linutronix.de

---
 include/linux/hrtimer.h     |  1 -
 kernel/time/hrtimer.c       | 15 ++++++---------
 kernel/time/tick-common.c   |  7 +++++++
 kernel/time/tick-internal.h |  2 ++
 kernel/time/timekeeping.c   |  4 +++-
 5 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 253c6e2..0ee1401 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -354,7 +354,6 @@ extern void timerfd_resume(void);
 static inline void timerfd_clock_was_set(void) { }
 static inline void timerfd_resume(void) { }
 #endif
-extern void hrtimers_resume(void);
 
 DECLARE_PER_CPU(struct tick_device, tick_cpu_device);
 
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 214fd65..68e56f0 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -900,8 +900,8 @@ static void clock_was_set_work(struct work_struct *work)
 static DECLARE_WORK(hrtimer_work, clock_was_set_work);
 
 /*
- * Called from timekeeping and resume code to reprogram the hrtimer
- * interrupt device on all cpus and to notify timerfd.
+ * Called from timekeeping code to reprogram the hrtimer interrupt device
+ * on all cpus and to notify timerfd.
  */
 void clock_was_set_delayed(void)
 {
@@ -909,18 +909,15 @@ void clock_was_set_delayed(void)
 }
 
 /*
- * During resume we might have to reprogram the high resolution timer
- * interrupt on all online CPUs.  However, all other CPUs will be
- * stopped with IRQs interrupts disabled so the clock_was_set() call
- * must be deferred.
+ * Called during resume either directly from via timekeeping_resume()
+ * or in the case of s2idle from tick_unfreeze() to ensure that the
+ * hrtimers are up to date.
  */
-void hrtimers_resume(void)
+void hrtimers_resume_local(void)
 {
 	lockdep_assert_irqs_disabled();
 	/* Retrigger on the local CPU */
 	retrigger_next_event(NULL);
-	/* And schedule a retrigger for all others */
-	clock_was_set_delayed();
 }
 
 /*
diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index d663249..4678935 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -470,6 +470,13 @@ void tick_resume_local(void)
 		else
 			tick_resume_oneshot();
 	}
+
+	/*
+	 * Ensure that hrtimers are up to date and the clockevents device
+	 * is reprogrammed correctly when high resolution timers are
+	 * enabled.
+	 */
+	hrtimers_resume_local();
 }
 
 /**
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index cd610fa..22de98c 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -168,3 +168,5 @@ void timer_clear_idle(void);
 
 void clock_was_set(void);
 void clock_was_set_delayed(void);
+
+void hrtimers_resume_local(void);
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 8a364aa..c8a9b9e 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1810,8 +1810,10 @@ void timekeeping_resume(void)
 
 	touch_softlockup_watchdog();
 
+	/* Resume the clockevent device(s) and hrtimers */
 	tick_resume();
-	hrtimers_resume();
+	/* Notify timerfd as resume is equivalent to clock_was_set() */
+	timerfd_resume();
 }
 
 int timekeeping_suspend(void)

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [tip: timers/core] timerfd: Provide timerfd_resume()
  2021-07-13 13:39 ` [patch V2 05/10] timerfd: Provide timerfd_resume() Thomas Gleixner
@ 2021-08-10 16:02   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2021-08-10 16:02 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     66f7b0c8aadd2785fc29f2c71477ebc16f4e38cc
Gitweb:        https://git.kernel.org/tip/66f7b0c8aadd2785fc29f2c71477ebc16f4e38cc
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 13 Jul 2021 15:39:50 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 10 Aug 2021 17:57:22 +02:00

timerfd: Provide timerfd_resume()

Resuming timekeeping is a clock-was-set event and uses the clock-was-set
notification mechanism. This is in the way of making the clock-was-set
update for hrtimers selective so unnecessary IPIs are avoided when a CPU
base does not have timers queued which are affected by the clock setting.

Provide a seperate timerfd_resume() interface so the resume logic and the
clock-was-set mechanism can be distangled in the core code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.395287410@linutronix.de

---
 fs/timerfd.c            | 16 ++++++++++++++++
 include/linux/hrtimer.h |  2 ++
 2 files changed, 18 insertions(+)

diff --git a/fs/timerfd.c b/fs/timerfd.c
index c5509d2..e9c96a0 100644
--- a/fs/timerfd.c
+++ b/fs/timerfd.c
@@ -115,6 +115,22 @@ void timerfd_clock_was_set(void)
 	rcu_read_unlock();
 }
 
+static void timerfd_resume_work(struct work_struct *work)
+{
+	timerfd_clock_was_set();
+}
+
+static DECLARE_WORK(timerfd_work, timerfd_resume_work);
+
+/*
+ * Invoked from timekeeping_resume(). Defer the actual update to work so
+ * timerfd_clock_was_set() runs in task context.
+ */
+void timerfd_resume(void)
+{
+	schedule_work(&timerfd_work);
+}
+
 static void __timerfd_remove_cancel(struct timerfd_ctx *ctx)
 {
 	if (ctx->might_cancel) {
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 77295af..253c6e2 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -349,8 +349,10 @@ hrtimer_expires_remaining_adjusted(const struct hrtimer *timer)
 
 #ifdef CONFIG_TIMERFD
 extern void timerfd_clock_was_set(void);
+extern void timerfd_resume(void);
 #else
 static inline void timerfd_clock_was_set(void) { }
+static inline void timerfd_resume(void) { }
 #endif
 extern void hrtimers_resume(void);
 

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [tip: timers/core] hrtimer: Ensure timerfd notification for HIGHRES=n
  2021-07-13 13:39 ` [patch V2 03/10] hrtimer: Ensure timerfd notification for HIGHRES=n Thomas Gleixner
@ 2021-08-10 16:02   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2021-08-10 16:02 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     8c3b5e6ec0fee18bc2ce38d1dfe913413205f908
Gitweb:        https://git.kernel.org/tip/8c3b5e6ec0fee18bc2ce38d1dfe913413205f908
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 13 Jul 2021 15:39:48 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 10 Aug 2021 17:57:22 +02:00

hrtimer: Ensure timerfd notification for HIGHRES=n

If high resolution timers are disabled the timerfd notification about a
clock was set event is not happening for all cases which use
clock_was_set_delayed() because that's a NOP for HIGHRES=n, which is wrong.

Make clock_was_set_delayed() unconditially available to fix that.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.196661266@linutronix.de

---
 include/linux/hrtimer.h     |  5 -----
 kernel/time/hrtimer.c       | 32 ++++++++++++++++----------------
 kernel/time/tick-internal.h |  3 +++
 3 files changed, 19 insertions(+), 21 deletions(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index bb5e7b0..77295af 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -318,16 +318,12 @@ struct clock_event_device;
 
 extern void hrtimer_interrupt(struct clock_event_device *dev);
 
-extern void clock_was_set_delayed(void);
-
 extern unsigned int hrtimer_resolution;
 
 #else
 
 #define hrtimer_resolution	(unsigned int)LOW_RES_NSEC
 
-static inline void clock_was_set_delayed(void) { }
-
 #endif
 
 static inline ktime_t
@@ -351,7 +347,6 @@ hrtimer_expires_remaining_adjusted(const struct hrtimer *timer)
 						    timer->base->get_time());
 }
 
-extern void clock_was_set(void);
 #ifdef CONFIG_TIMERFD
 extern void timerfd_clock_was_set(void);
 #else
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 5f7c465..7ebf642 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -777,22 +777,6 @@ static void hrtimer_switch_to_hres(void)
 	retrigger_next_event(NULL);
 }
 
-static void clock_was_set_work(struct work_struct *work)
-{
-	clock_was_set();
-}
-
-static DECLARE_WORK(hrtimer_work, clock_was_set_work);
-
-/*
- * Called from timekeeping and resume code to reprogram the hrtimer
- * interrupt device on all cpus.
- */
-void clock_was_set_delayed(void)
-{
-	schedule_work(&hrtimer_work);
-}
-
 #else
 
 static inline int hrtimer_is_hres_enabled(void) { return 0; }
@@ -877,6 +861,22 @@ void clock_was_set(void)
 	timerfd_clock_was_set();
 }
 
+static void clock_was_set_work(struct work_struct *work)
+{
+	clock_was_set();
+}
+
+static DECLARE_WORK(hrtimer_work, clock_was_set_work);
+
+/*
+ * Called from timekeeping and resume code to reprogram the hrtimer
+ * interrupt device on all cpus and to notify timerfd.
+ */
+void clock_was_set_delayed(void)
+{
+	schedule_work(&hrtimer_work);
+}
+
 /*
  * During resume we might have to reprogram the high resolution timer
  * interrupt on all online CPUs.  However, all other CPUs will be
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index 6a742a2..cd610fa 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -165,3 +165,6 @@ DECLARE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases);
 
 extern u64 get_next_timer_interrupt(unsigned long basej, u64 basem);
 void timer_clear_idle(void);
+
+void clock_was_set(void);
+void clock_was_set_delayed(void);

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [tip: timers/core] hrtimer: Force clock_was_set() handling for the HIGHRES=n, NOHZ=y case
  2021-07-13 13:39 ` [patch V2 04/10] hrtimer: Force clock_was_set() handling for the HIGHRES=n, NOHZ=y case Thomas Gleixner
@ 2021-08-10 16:02   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2021-08-10 16:02 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     e71a4153b7c256ec103e79875398553808aeffd2
Gitweb:        https://git.kernel.org/tip/e71a4153b7c256ec103e79875398553808aeffd2
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 13 Jul 2021 15:39:49 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 10 Aug 2021 17:57:22 +02:00

hrtimer: Force clock_was_set() handling for the HIGHRES=n, NOHZ=y case

When CONFIG_HIGH_RES_TIMERS is disabled, but NOHZ is enabled then
clock_was_set() is not doing anything. With HIGHRES=n the kernel relies on
the periodic tick to update the clock offsets, but when NOHZ is enabled and
active then CPUs which are in a deep idle sleep do not have a periodic tick
which means the expiry of timers affected by clock_was_set() can be
arbitrarily delayed up to the point where the CPUs are brought out of idle
again.

Make the clock_was_set() logic unconditionaly available so that idle CPUs
are kicked out of idle to handle the update.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.288697903@linutronix.de

---
 kernel/time/hrtimer.c | 87 ++++++++++++++++++++++++++++--------------
 1 file changed, 59 insertions(+), 28 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 7ebf642..214fd65 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -739,23 +739,7 @@ static inline int hrtimer_is_hres_enabled(void)
 	return hrtimer_hres_enabled;
 }
 
-/*
- * Retrigger next event is called after clock was set
- *
- * Called with interrupts disabled via on_each_cpu()
- */
-static void retrigger_next_event(void *arg)
-{
-	struct hrtimer_cpu_base *base = this_cpu_ptr(&hrtimer_bases);
-
-	if (!__hrtimer_hres_active(base))
-		return;
-
-	raw_spin_lock(&base->lock);
-	hrtimer_update_base(base);
-	hrtimer_force_reprogram(base, 0);
-	raw_spin_unlock(&base->lock);
-}
+static void retrigger_next_event(void *arg);
 
 /*
  * Switch to high resolution mode
@@ -781,9 +765,50 @@ static void hrtimer_switch_to_hres(void)
 
 static inline int hrtimer_is_hres_enabled(void) { return 0; }
 static inline void hrtimer_switch_to_hres(void) { }
-static inline void retrigger_next_event(void *arg) { }
 
 #endif /* CONFIG_HIGH_RES_TIMERS */
+/*
+ * Retrigger next event is called after clock was set with interrupts
+ * disabled through an SMP function call or directly from low level
+ * resume code.
+ *
+ * This is only invoked when:
+ *	- CONFIG_HIGH_RES_TIMERS is enabled.
+ *	- CONFIG_NOHZ_COMMON is enabled
+ *
+ * For the other cases this function is empty and because the call sites
+ * are optimized out it vanishes as well, i.e. no need for lots of
+ * #ifdeffery.
+ */
+static void retrigger_next_event(void *arg)
+{
+	struct hrtimer_cpu_base *base = this_cpu_ptr(&hrtimer_bases);
+
+	/*
+	 * When high resolution mode or nohz is active, then the offsets of
+	 * CLOCK_REALTIME/TAI/BOOTTIME have to be updated. Otherwise the
+	 * next tick will take care of that.
+	 *
+	 * If high resolution mode is active then the next expiring timer
+	 * must be reevaluated and the clock event device reprogrammed if
+	 * necessary.
+	 *
+	 * In the NOHZ case the update of the offset and the reevaluation
+	 * of the next expiring timer is enough. The return from the SMP
+	 * function call will take care of the reprogramming in case the
+	 * CPU was in a NOHZ idle sleep.
+	 */
+	if (!__hrtimer_hres_active(base) && !tick_nohz_active)
+		return;
+
+	raw_spin_lock(&base->lock);
+	hrtimer_update_base(base);
+	if (__hrtimer_hres_active(base))
+		hrtimer_force_reprogram(base, 0);
+	else
+		hrtimer_update_next_event(base);
+	raw_spin_unlock(&base->lock);
+}
 
 /*
  * When a timer is enqueued and expires earlier than the already enqueued
@@ -842,22 +867,28 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
 }
 
 /*
- * Clock realtime was set
- *
- * Change the offset of the realtime clock vs. the monotonic
- * clock.
+ * Clock was set. This might affect CLOCK_REALTIME, CLOCK_TAI and
+ * CLOCK_BOOTTIME (for late sleep time injection).
  *
- * We might have to reprogram the high resolution timer interrupt. On
- * SMP we call the architecture specific code to retrigger _all_ high
- * resolution timer interrupts. On UP we just disable interrupts and
- * call the high resolution interrupt code.
+ * This requires to update the offsets for these clocks
+ * vs. CLOCK_MONOTONIC. When high resolution timers are enabled, then this
+ * also requires to eventually reprogram the per CPU clock event devices
+ * when the change moves an affected timer ahead of the first expiring
+ * timer on that CPU. Obviously remote per CPU clock event devices cannot
+ * be reprogrammed. The other reason why an IPI has to be sent is when the
+ * system is in !HIGH_RES and NOHZ mode. The NOHZ mode updates the offsets
+ * in the tick, which obviously might be stopped, so this has to bring out
+ * the remote CPU which might sleep in idle to get this sorted.
  */
 void clock_was_set(void)
 {
-#ifdef CONFIG_HIGH_RES_TIMERS
+	if (!hrtimer_hres_active() && !tick_nohz_active)
+		goto out_timerfd;
+
 	/* Retrigger the CPU local events everywhere */
 	on_each_cpu(retrigger_next_event, NULL, 1);
-#endif
+
+out_timerfd:
 	timerfd_clock_was_set();
 }
 

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [tip: timers/core] hrtimer: Consolidate reprogramming code
  2021-07-13 13:39 ` [patch V2 02/10] hrtimer: Consolidate reprogramming code Thomas Gleixner
@ 2021-08-10 16:02   ` tip-bot2 for Peter Zijlstra
  2021-08-12  7:19     ` Mike Galbraith
       [not found]   ` <CGME20210812130945eucas1p117fc1e90f31c8d9fd177932cd1a18512@eucas1p1.samsung.com>
  1 sibling, 1 reply; 36+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2021-08-10 16:02 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra, Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     b14bca97c9f5c3e3f133445b01c723e95490d843
Gitweb:        https://git.kernel.org/tip/b14bca97c9f5c3e3f133445b01c723e95490d843
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 13 Jul 2021 15:39:47 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 10 Aug 2021 17:57:22 +02:00

hrtimer: Consolidate reprogramming code

This code is mostly duplicated. The redudant store in the force reprogram
case does no harm and the in hrtimer interrupt condition cannot be true for
the force reprogram invocations.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.054424875@linutronix.de

---
 kernel/time/hrtimer.c | 72 ++++++++++++++++--------------------------
 1 file changed, 29 insertions(+), 43 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index ba2e0d0..5f7c465 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -652,21 +652,24 @@ static inline int hrtimer_hres_active(void)
 	return __hrtimer_hres_active(this_cpu_ptr(&hrtimer_bases));
 }
 
-/*
- * Reprogram the event source with checking both queues for the
- * next event
- * Called with interrupts disabled and base->lock held
- */
 static void
-hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
+__hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal,
+		    struct hrtimer *next_timer, ktime_t expires_next)
 {
-	ktime_t expires_next;
+	/*
+	 * If the hrtimer interrupt is running, then it will reevaluate the
+	 * clock bases and reprogram the clock event device.
+	 */
+	if (cpu_base->in_hrtirq)
+		return;
 
-	expires_next = hrtimer_update_next_event(cpu_base);
+	if (expires_next > cpu_base->expires_next)
+		return;
 
 	if (skip_equal && expires_next == cpu_base->expires_next)
 		return;
 
+	cpu_base->next_timer = next_timer;
 	cpu_base->expires_next = expires_next;
 
 	/*
@@ -689,7 +692,23 @@ hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
 	if (!__hrtimer_hres_active(cpu_base) || cpu_base->hang_detected)
 		return;
 
-	tick_program_event(cpu_base->expires_next, 1);
+	tick_program_event(expires_next, 1);
+}
+
+/*
+ * Reprogram the event source with checking both queues for the
+ * next event
+ * Called with interrupts disabled and base->lock held
+ */
+static void
+hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
+{
+	ktime_t expires_next;
+
+	expires_next = hrtimer_update_next_event(cpu_base);
+
+	__hrtimer_reprogram(cpu_base, skip_equal, cpu_base->next_timer,
+			    expires_next);
 }
 
 /* High resolution timer related functions */
@@ -835,40 +854,7 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
 	if (base->cpu_base != cpu_base)
 		return;
 
-	/*
-	 * If the hrtimer interrupt is running, then it will
-	 * reevaluate the clock bases and reprogram the clock event
-	 * device. The callbacks are always executed in hard interrupt
-	 * context so we don't need an extra check for a running
-	 * callback.
-	 */
-	if (cpu_base->in_hrtirq)
-		return;
-
-	if (expires >= cpu_base->expires_next)
-		return;
-
-	/* Update the pointer to the next expiring timer */
-	cpu_base->next_timer = timer;
-	cpu_base->expires_next = expires;
-
-	/*
-	 * If hres is not active, hardware does not have to be
-	 * programmed yet.
-	 *
-	 * If a hang was detected in the last timer interrupt then we
-	 * do not schedule a timer which is earlier than the expiry
-	 * which we enforced in the hang detection. We want the system
-	 * to make progress.
-	 */
-	if (!__hrtimer_hres_active(cpu_base) || cpu_base->hang_detected)
-		return;
-
-	/*
-	 * Program the timer hardware. We enforce the expiry for
-	 * events which are already in the past.
-	 */
-	tick_program_event(expires, 1);
+	__hrtimer_reprogram(cpu_base, true, timer, expires);
 }
 
 /*

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [tip: timers/core] hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns()
  2021-07-13 13:39 ` [patch V2 01/10] hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns() Thomas Gleixner
@ 2021-08-10 16:02   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2021-08-10 16:02 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Lorenzo Colitti, Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     627ef5ae2df8eeccb20d5af0e4cfa4df9e61ed28
Gitweb:        https://git.kernel.org/tip/627ef5ae2df8eeccb20d5af0e4cfa4df9e61ed28
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Tue, 13 Jul 2021 15:39:46 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Tue, 10 Aug 2021 17:57:22 +02:00

hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns()

If __hrtimer_start_range_ns() is invoked with an already armed hrtimer then
the timer has to be canceled first and then added back. If the timer is the
first expiring timer then on removal the clockevent device is reprogrammed
to the next expiring timer to avoid that the pending expiry fires needlessly.

If the new expiry time ends up to be the first expiry again then the clock
event device has to reprogrammed again.

Avoid this by checking whether the timer is the first to expire and in that
case, keep the timer on the current CPU and delay the reprogramming up to
the point where the timer has been enqueued again.

Reported-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135157.873137732@linutronix.de


---
 kernel/time/hrtimer.c | 60 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 53 insertions(+), 7 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 4a66725..ba2e0d0 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1030,12 +1030,13 @@ static void __remove_hrtimer(struct hrtimer *timer,
  * remove hrtimer, called with base lock held
  */
 static inline int
-remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base, bool restart)
+remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
+	       bool restart, bool keep_local)
 {
 	u8 state = timer->state;
 
 	if (state & HRTIMER_STATE_ENQUEUED) {
-		int reprogram;
+		bool reprogram;
 
 		/*
 		 * Remove the timer and force reprogramming when high
@@ -1048,8 +1049,16 @@ remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base, bool rest
 		debug_deactivate(timer);
 		reprogram = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
 
+		/*
+		 * If the timer is not restarted then reprogramming is
+		 * required if the timer is local. If it is local and about
+		 * to be restarted, avoid programming it twice (on removal
+		 * and a moment later when it's requeued).
+		 */
 		if (!restart)
 			state = HRTIMER_STATE_INACTIVE;
+		else
+			reprogram &= !keep_local;
 
 		__remove_hrtimer(timer, base, state, reprogram);
 		return 1;
@@ -1103,9 +1112,31 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
 				    struct hrtimer_clock_base *base)
 {
 	struct hrtimer_clock_base *new_base;
+	bool force_local, first;
 
-	/* Remove an active timer from the queue: */
-	remove_hrtimer(timer, base, true);
+	/*
+	 * If the timer is on the local cpu base and is the first expiring
+	 * timer then this might end up reprogramming the hardware twice
+	 * (on removal and on enqueue). To avoid that by prevent the
+	 * reprogram on removal, keep the timer local to the current CPU
+	 * and enforce reprogramming after it is queued no matter whether
+	 * it is the new first expiring timer again or not.
+	 */
+	force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
+	force_local &= base->cpu_base->next_timer == timer;
+
+	/*
+	 * Remove an active timer from the queue. In case it is not queued
+	 * on the current CPU, make sure that remove_hrtimer() updates the
+	 * remote data correctly.
+	 *
+	 * If it's on the current CPU and the first expiring timer, then
+	 * skip reprogramming, keep the timer local and enforce
+	 * reprogramming later if it was the first expiring timer.  This
+	 * avoids programming the underlying clock event twice (once at
+	 * removal and once after enqueue).
+	 */
+	remove_hrtimer(timer, base, true, force_local);
 
 	if (mode & HRTIMER_MODE_REL)
 		tim = ktime_add_safe(tim, base->get_time());
@@ -1115,9 +1146,24 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
 	hrtimer_set_expires_range_ns(timer, tim, delta_ns);
 
 	/* Switch the timer base, if necessary: */
-	new_base = switch_hrtimer_base(timer, base, mode & HRTIMER_MODE_PINNED);
+	if (!force_local) {
+		new_base = switch_hrtimer_base(timer, base,
+					       mode & HRTIMER_MODE_PINNED);
+	} else {
+		new_base = base;
+	}
+
+	first = enqueue_hrtimer(timer, new_base, mode);
+	if (!force_local)
+		return first;
 
-	return enqueue_hrtimer(timer, new_base, mode);
+	/*
+	 * Timer was forced to stay on the current CPU to avoid
+	 * reprogramming on removal and enqueue. Force reprogram the
+	 * hardware by evaluating the new first expiring timer.
+	 */
+	hrtimer_force_reprogram(new_base->cpu_base, 1);
+	return 0;
 }
 
 /**
@@ -1183,7 +1229,7 @@ int hrtimer_try_to_cancel(struct hrtimer *timer)
 	base = lock_hrtimer_base(timer, &flags);
 
 	if (!hrtimer_callback_running(timer))
-		ret = remove_hrtimer(timer, base, false);
+		ret = remove_hrtimer(timer, base, false, false);
 
 	unlock_hrtimer_base(timer, &flags);
 

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [tip: timers/core] hrtimer: Consolidate reprogramming code
  2021-08-10 16:02   ` [tip: timers/core] " tip-bot2 for Peter Zijlstra
@ 2021-08-12  7:19     ` Mike Galbraith
  2021-08-12 14:11       ` Thomas Gleixner
  0 siblings, 1 reply; 36+ messages in thread
From: Mike Galbraith @ 2021-08-12  7:19 UTC (permalink / raw)
  To: linux-kernel, linux-tip-commits; +Cc: Peter Zijlstra, Thomas Gleixner, x86

Greetings Peter, may your day get off to a better start than my box's
did :)

On Tue, 2021-08-10 at 16:02 +0000, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the timers/core branch of tip:
>
> Commit-ID:     b14bca97c9f5c3e3f133445b01c723e95490d843
> Gitweb:        https://git.kernel.org/tip/b14bca97c9f5c3e3f133445b01c723e95490d843
> Author:        Peter Zijlstra <peterz@infradead.org>
> AuthorDate:    Tue, 13 Jul 2021 15:39:47 +02:00
> Committer:     Thomas Gleixner <tglx@linutronix.de>
> CommitterDate: Tue, 10 Aug 2021 17:57:22 +02:00
>
> hrtimer: Consolidate reprogramming code

Per git-bisect, this is the tip.today commit that's bricking my box
early in boot.  Another inspires the moan below, courtesy of VM, which
unlike desktop box does not brick immediately thereafter.

[    0.556416] rtc_cmos 00:04: RTC can wake from S4
[    0.557184] rtc_cmos 00:04: registered as rtc0
[    0.557636] BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1
[    0.558169] caller is debug_smp_processor_id+0x13/0x20
[    0.558511] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 5.14.0.g1fd628c-tip #15
[    0.558946] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
[    0.559623] Call Trace:
[    0.559796]  dump_stack_lvl+0x62/0x78
[    0.560041]  dump_stack+0xc/0xd
[    0.560263]  check_preemption_disabled+0xd3/0xe0
[    0.560576]  debug_smp_processor_id+0x13/0x20
[    0.560954]  clock_was_set+0x1c/0x310
[    0.561118]  ? timekeeping_update+0x298/0x2b0
[    0.561118]  do_settimeofday64+0x31e/0x340
[    0.561118]  __devm_rtc_register_device+0x371/0x450
[    0.561118]  cmos_do_probe+0x4a2/0x6e0
[    0.561118]  ? cmos_interrupt+0x120/0x120
[    0.561118]  ? cmos_nvram_read+0x90/0x90
[    0.561118]  cmos_pnp_probe+0x91/0xe0
[    0.561118]  pnp_device_probe+0x15e/0x1d0
[    0.561118]  ? cmos_irq_enable+0x150/0x150
[    0.561118]  call_driver_probe+0x4a/0x130
[    0.561118]  really_probe+0x150/0x540
[    0.561118]  __driver_probe_device+0x160/0x200
[    0.561118]  driver_probe_device+0x3a/0x2b0
[    0.561118]  __driver_attach+0xb4/0x370
[    0.561118]  ? driver_attach+0x30/0x30
[    0.561118]  bus_for_each_dev+0xb0/0xe0
[    0.561118]  driver_attach+0x27/0x30
[    0.561118]  bus_add_driver+0x1ba/0x310
[    0.561118]  driver_register+0x104/0x200
[    0.561118]  pnp_register_driver+0x3e/0x50
[    0.561118]  ? rtc_dev_init+0x33/0x33
[    0.561118]  cmos_init+0x14/0xbc
[    0.561118]  ? rtc_dev_init+0x33/0x33
[    0.561118]  do_one_initcall+0xcf/0x2c0
[    0.561118]  ? strlen+0x18/0x30
[    0.561118]  ? parse_one+0x2b9/0x350
[    0.561118]  ? do_initcall_level+0x106/0x106
[    0.561118]  ? parse_args+0x133/0x280
[    0.561118]  ? parse_args+0x94/0x280
[    0.561118]  do_initcall_level+0x95/0x106
[    0.561118]  do_initcalls+0x61/0x8b
[    0.561118]  do_basic_setup+0x20/0x21
[    0.561118]  kernel_init_freeable+0x171/0x1de
[    0.561118]  ? rest_init+0xf0/0xf0
[    0.561118]  kernel_init+0x17/0x1e0
[    0.561118]  ? rest_init+0xf0/0xf0
[    0.561118]  ret_from_fork+0x1f/0x30



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch V2 02/10] hrtimer: Consolidate reprogramming code
       [not found]   ` <CGME20210812130945eucas1p117fc1e90f31c8d9fd177932cd1a18512@eucas1p1.samsung.com>
@ 2021-08-12 13:09     ` Marek Szyprowski
  0 siblings, 0 replies; 36+ messages in thread
From: Marek Szyprowski @ 2021-08-12 13:09 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Peter Zijlstra, Anna-Maria Behnsen, Marcelo Tosatti, Frederic Weisbecker

Hi,

On 13.07.2021 15:39, Thomas Gleixner wrote:
> From: Peter Zijlstra <peterz@infradead.org>
>
> This code is mostly duplicated. The redudant store in the force reprogram
> case does no harm and the in hrtimer interrupt condition cannot be true for
> the force reprogram invocations.
>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   kernel/time/hrtimer.c |   72 ++++++++++++++++++++------------------------------
>   1 file changed, 29 insertions(+), 43 deletions(-)
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

This patch landed in today's linux-next (next-20210812) as commit 
b14bca97c9f5 ("hrtimer: Consolidate reprogramming code"). It breaks 
booting of many of my test machines: ARM 32bit Exynos based boards, ARM 
64bit QEmu virt machine or ARM64 Qualcomm DragonBoard410c board.

I've managed to catch the following log on QEmu's virt ARM64 machine:

rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu:     0-...!: (0 ticks this GP) idle=330/0/0x0 softirq=42/42 fqs=0  
(false positive?)
  (detected by 1, t=6502 jiffies, g=-1091, q=115)

============================================
WARNING: possible recursive locking detected
5.14.0-rc5+ #10668 Not tainted
--------------------------------------------
swapper/1/0 is trying to acquire lock:
ffffbb9c1e4ca1d8 (rcu_node_0){-.-.}-{2:2}, at: 
rcu_dump_cpu_stacks+0x68/0x1c4

but task is already holding lock:
ffffbb9c1e4ca1d8 (rcu_node_0){-.-.}-{2:2}, at: 
rcu_sched_clock_irq+0x83c/0x1778

other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(rcu_node_0);
   lock(rcu_node_0);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

1 lock held by swapper/1/0:
  #0: ffffbb9c1e4ca1d8 (rcu_node_0){-.-.}-{2:2}, at: 
rcu_sched_clock_irq+0x83c/0x1778

stack backtrace:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.14.0-rc5+ #10668
Hardware name: linux,dummy-virt (DT)
Call trace:
  dump_backtrace+0x0/0x1d0
  show_stack+0x14/0x20
  dump_stack_lvl+0x88/0xb0
  dump_stack+0x14/0x2c
  __lock_acquire+0x17a4/0x1840
  lock_acquire+0x130/0x3e8
  _raw_spin_lock_irqsave+0x78/0x148
  rcu_dump_cpu_stacks+0x68/0x1c4
  rcu_sched_clock_irq+0x11e8/0x1778
  update_process_times+0x88/0xd0
  tick_sched_handle.isra.19+0x30/0x50
  tick_sched_timer+0x48/0x98
  __hrtimer_run_queues+0x380/0x5b0
  hrtimer_interrupt+0xe4/0x240
  arch_timer_handler_virt+0x30/0x40
  handle_percpu_devid_irq+0xc0/0x3d0
  handle_domain_irq+0x58/0x88
  gic_handle_irq+0xa8/0xc8
  call_on_irq_stack+0x28/0x38
  do_interrupt_handler+0x54/0x60
  el1_interrupt+0x2c/0x108
  el1h_64_irq_handler+0x14/0x20
  el1h_64_irq+0x74/0x78
  arch_cpu_idle+0x14/0x20
  default_idle_call+0x88/0x390
  do_idle+0x200/0x290
  cpu_startup_entry+0x20/0x80
  secondary_start_kernel+0x1c0/0x1f0
  __secondary_switched+0x7c/0x80

I hope it helps fixing the issue.

> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -652,21 +652,24 @@ static inline int hrtimer_hres_active(vo
>   	return __hrtimer_hres_active(this_cpu_ptr(&hrtimer_bases));
>   }
>   
> -/*
> - * Reprogram the event source with checking both queues for the
> - * next event
> - * Called with interrupts disabled and base->lock held
> - */
>   static void
> -hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
> +__hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal,
> +		    struct hrtimer *next_timer, ktime_t expires_next)
>   {
> -	ktime_t expires_next;
> +	/*
> +	 * If the hrtimer interrupt is running, then it will reevaluate the
> +	 * clock bases and reprogram the clock event device.
> +	 */
> +	if (cpu_base->in_hrtirq)
> +		return;
>   
> -	expires_next = hrtimer_update_next_event(cpu_base);
> +	if (expires_next > cpu_base->expires_next)
> +		return;
>   
>   	if (skip_equal && expires_next == cpu_base->expires_next)
>   		return;
>   
> +	cpu_base->next_timer = next_timer;
>   	cpu_base->expires_next = expires_next;
>   
>   	/*
> @@ -689,7 +692,23 @@ hrtimer_force_reprogram(struct hrtimer_c
>   	if (!__hrtimer_hres_active(cpu_base) || cpu_base->hang_detected)
>   		return;
>   
> -	tick_program_event(cpu_base->expires_next, 1);
> +	tick_program_event(expires_next, 1);
> +}
> +
> +/*
> + * Reprogram the event source with checking both queues for the
> + * next event
> + * Called with interrupts disabled and base->lock held
> + */
> +static void
> +hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
> +{
> +	ktime_t expires_next;
> +
> +	expires_next = hrtimer_update_next_event(cpu_base);
> +
> +	__hrtimer_reprogram(cpu_base, skip_equal, cpu_base->next_timer,
> +			    expires_next);
>   }
>   
>   /* High resolution timer related functions */
> @@ -835,40 +854,7 @@ static void hrtimer_reprogram(struct hrt
>   	if (base->cpu_base != cpu_base)
>   		return;
>   
> -	/*
> -	 * If the hrtimer interrupt is running, then it will
> -	 * reevaluate the clock bases and reprogram the clock event
> -	 * device. The callbacks are always executed in hard interrupt
> -	 * context so we don't need an extra check for a running
> -	 * callback.
> -	 */
> -	if (cpu_base->in_hrtirq)
> -		return;
> -
> -	if (expires >= cpu_base->expires_next)
> -		return;
> -
> -	/* Update the pointer to the next expiring timer */
> -	cpu_base->next_timer = timer;
> -	cpu_base->expires_next = expires;
> -
> -	/*
> -	 * If hres is not active, hardware does not have to be
> -	 * programmed yet.
> -	 *
> -	 * If a hang was detected in the last timer interrupt then we
> -	 * do not schedule a timer which is earlier than the expiry
> -	 * which we enforced in the hang detection. We want the system
> -	 * to make progress.
> -	 */
> -	if (!__hrtimer_hres_active(cpu_base) || cpu_base->hang_detected)
> -		return;
> -
> -	/*
> -	 * Program the timer hardware. We enforce the expiry for
> -	 * events which are already in the past.
> -	 */
> -	tick_program_event(expires, 1);
> +	__hrtimer_reprogram(cpu_base, true, timer, expires);
>   }
>   
>   /*
>
>
Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [tip: timers/core] hrtimer: Consolidate reprogramming code
  2021-08-12  7:19     ` Mike Galbraith
@ 2021-08-12 14:11       ` Thomas Gleixner
  2021-08-12 14:32         ` Thomas Gleixner
  2021-08-12 20:31         ` [PATCH] hrtimer: Use raw_cpu_ptr() in clock_was_set() Thomas Gleixner
  0 siblings, 2 replies; 36+ messages in thread
From: Thomas Gleixner @ 2021-08-12 14:11 UTC (permalink / raw)
  To: Mike Galbraith, linux-kernel, linux-tip-commits; +Cc: Peter Zijlstra, x86

On Thu, Aug 12 2021 at 09:19, Mike Galbraith wrote:
> Greetings Peter, may your day get off to a better start than my box's
> did :)
>
> On Tue, 2021-08-10 at 16:02 +0000, tip-bot2 for Peter Zijlstra wrote:
>> The following commit has been merged into the timers/core branch of tip:
>>
>> Commit-ID:     b14bca97c9f5c3e3f133445b01c723e95490d843
>> Gitweb:        https://git.kernel.org/tip/b14bca97c9f5c3e3f133445b01c723e95490d843
>> Author:        Peter Zijlstra <peterz@infradead.org>
>> AuthorDate:    Tue, 13 Jul 2021 15:39:47 +02:00
>> Committer:     Thomas Gleixner <tglx@linutronix.de>
>> CommitterDate: Tue, 10 Aug 2021 17:57:22 +02:00
>>
>> hrtimer: Consolidate reprogramming code
>
> Per git-bisect, this is the tip.today commit that's bricking my box
> early in boot.

Let me stare at that.

> Another inspires the moan below, courtesy of VM, which
> unlike desktop box does not brick immediately thereafter.

> [    0.556416] rtc_cmos 00:04: RTC can wake from S4
> [    0.557184] rtc_cmos 00:04: registered as rtc0
> [    0.557636] BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1
> [    0.558169] caller is debug_smp_processor_id+0x13/0x20
> [    0.558511] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 5.14.0.g1fd628c-tip #15
> [    0.558946] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
> [    0.559623] Call Trace:
> [    0.559796]  dump_stack_lvl+0x62/0x78
> [    0.560041]  dump_stack+0xc/0xd
> [    0.560263]  check_preemption_disabled+0xd3/0xe0
> [    0.560576]  debug_smp_processor_id+0x13/0x20
> [    0.560954]  clock_was_set+0x1c/0x310

My stupidity. Fix for that below.

Thanks,

        tglx
---
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -944,10 +944,11 @@ static bool update_needs_ipi(struct hrti
  */
 void clock_was_set(unsigned int bases)
 {
+	struct hrtimer_cpu_base *cpu_base = raw_cpu_ptr(&hrtimer_bases);
 	cpumask_var_t mask;
 	int cpu;
 
-	if (!hrtimer_hres_active() && !tick_nohz_active)
+	if (!__hrtimer_hres_active(cpu_base) && !tick_nohz_active)
 		goto out_timerfd;
 
 	if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
@@ -958,9 +959,9 @@ void clock_was_set(unsigned int bases)
 	/* Avoid interrupting CPUs if possible */
 	cpus_read_lock();
 	for_each_online_cpu(cpu) {
-		struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
 		unsigned long flags;
 
+		cpu_base = &per_cpu(hrtimer_bases, cpu);
 		raw_spin_lock_irqsave(&cpu_base->lock, flags);
 
 		if (update_needs_ipi(cpu_base, bases))


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [tip: timers/core] hrtimer: Consolidate reprogramming code
  2021-08-12 14:11       ` Thomas Gleixner
@ 2021-08-12 14:32         ` Thomas Gleixner
  2021-08-12 15:04           ` Mike Galbraith
  2021-08-12 20:32           ` [PATCH] hrtimer: Unbreak hrtimer_force_reprogram() Thomas Gleixner
  2021-08-12 20:31         ` [PATCH] hrtimer: Use raw_cpu_ptr() in clock_was_set() Thomas Gleixner
  1 sibling, 2 replies; 36+ messages in thread
From: Thomas Gleixner @ 2021-08-12 14:32 UTC (permalink / raw)
  To: Mike Galbraith, linux-kernel, linux-tip-commits; +Cc: Peter Zijlstra, x86

On Thu, Aug 12 2021 at 16:11, Thomas Gleixner wrote:

> On Thu, Aug 12 2021 at 09:19, Mike Galbraith wrote:
>> Greetings Peter, may your day get off to a better start than my box's
>> did :)
>>
>> On Tue, 2021-08-10 at 16:02 +0000, tip-bot2 for Peter Zijlstra wrote:
>>> The following commit has been merged into the timers/core branch of tip:
>>>
>>> Commit-ID:     b14bca97c9f5c3e3f133445b01c723e95490d843
>>> Gitweb:        https://git.kernel.org/tip/b14bca97c9f5c3e3f133445b01c723e95490d843
>>> Author:        Peter Zijlstra <peterz@infradead.org>
>>> AuthorDate:    Tue, 13 Jul 2021 15:39:47 +02:00
>>> Committer:     Thomas Gleixner <tglx@linutronix.de>
>>> CommitterDate: Tue, 10 Aug 2021 17:57:22 +02:00
>>>
>>> hrtimer: Consolidate reprogramming code
>>
>> Per git-bisect, this is the tip.today commit that's bricking my box
>> early in boot.
>
> Let me stare at that.

Can you please test whether the below fixes it for you?

I have yet to find a machine which reproduces it as I really want to
understand which particular part is causing this.

Thanks,

        tglx
---
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -652,24 +652,10 @@ static inline int hrtimer_hres_active(vo
 	return __hrtimer_hres_active(this_cpu_ptr(&hrtimer_bases));
 }
 
-static void
-__hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal,
-		    struct hrtimer *next_timer, ktime_t expires_next)
+static void __hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base,
+				struct hrtimer *next_timer,
+				ktime_t expires_next)
 {
-	/*
-	 * If the hrtimer interrupt is running, then it will reevaluate the
-	 * clock bases and reprogram the clock event device.
-	 */
-	if (cpu_base->in_hrtirq)
-		return;
-
-	if (expires_next > cpu_base->expires_next)
-		return;
-
-	if (skip_equal && expires_next == cpu_base->expires_next)
-		return;
-
-	cpu_base->next_timer = next_timer;
 	cpu_base->expires_next = expires_next;
 
 	/*
@@ -707,8 +693,10 @@ hrtimer_force_reprogram(struct hrtimer_c
 
 	expires_next = hrtimer_update_next_event(cpu_base);
 
-	__hrtimer_reprogram(cpu_base, skip_equal, cpu_base->next_timer,
-			    expires_next);
+	if (skip_equal && expires_next == cpu_base->expires_next)
+		return;
+
+	__hrtimer_reprogram(cpu_base, cpu_base->next_timer, expires_next);
 }
 
 /* High resolution timer related functions */
@@ -863,7 +851,19 @@ static void hrtimer_reprogram(struct hrt
 	if (base->cpu_base != cpu_base)
 		return;
 
-	__hrtimer_reprogram(cpu_base, true, timer, expires);
+	if (expires >= cpu_base->expires_next)
+		return;
+
+	/*
+	 * If the hrtimer interrupt is running, then it will reevaluate the
+	 * clock bases and reprogram the clock event device.
+	 */
+	if (cpu_base->in_hrtirq)
+		return;
+
+	cpu_base->next_timer = timer;
+
+	__hrtimer_reprogram(cpu_base, timer, expires);
 }
 
 static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base,

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [tip: timers/core] hrtimer: Consolidate reprogramming code
  2021-08-12 14:32         ` Thomas Gleixner
@ 2021-08-12 15:04           ` Mike Galbraith
  2021-08-12 15:22             ` Thomas Gleixner
  2021-08-12 15:27             ` Mike Galbraith
  2021-08-12 20:32           ` [PATCH] hrtimer: Unbreak hrtimer_force_reprogram() Thomas Gleixner
  1 sibling, 2 replies; 36+ messages in thread
From: Mike Galbraith @ 2021-08-12 15:04 UTC (permalink / raw)
  To: Thomas Gleixner, linux-kernel, linux-tip-commits; +Cc: Peter Zijlstra, x86

[-- Attachment #1: Type: text/plain, Size: 3086 bytes --]

On Thu, 2021-08-12 at 16:32 +0200, Thomas Gleixner wrote:
> 
> Can you please test whether the below fixes it for you?

Yes, that boots.

> I have yet to find a machine which reproduces it as I really want to
> understand which particular part is causing this.

Config attached just in case.

> Thanks,
> 
>         tglx
> ---
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -652,24 +652,10 @@ static inline int hrtimer_hres_active(vo
>         return __hrtimer_hres_active(this_cpu_ptr(&hrtimer_bases));
>  }
>  
> -static void
> -__hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal,
> -                   struct hrtimer *next_timer, ktime_t expires_next)
> +static void __hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base,
> +                               struct hrtimer *next_timer,
> +                               ktime_t expires_next)
>  {
> -       /*
> -        * If the hrtimer interrupt is running, then it will reevaluate
> the
> -        * clock bases and reprogram the clock event device.
> -        */
> -       if (cpu_base->in_hrtirq)
> -               return;
> -
> -       if (expires_next > cpu_base->expires_next)
> -               return;
> -
> -       if (skip_equal && expires_next == cpu_base->expires_next)
> -               return;
> -
> -       cpu_base->next_timer = next_timer;
>         cpu_base->expires_next = expires_next;
>  
>         /*
> @@ -707,8 +693,10 @@ hrtimer_force_reprogram(struct hrtimer_c
>  
>         expires_next = hrtimer_update_next_event(cpu_base);
>  
> -       __hrtimer_reprogram(cpu_base, skip_equal, cpu_base->next_timer,
> -                           expires_next);
> +       if (skip_equal && expires_next == cpu_base->expires_next)
> +               return;
> +
> +       __hrtimer_reprogram(cpu_base, cpu_base->next_timer,
> expires_next);
>  }
>  
>  /* High resolution timer related functions */
> @@ -863,7 +851,19 @@ static void hrtimer_reprogram(struct hrt
>         if (base->cpu_base != cpu_base)
>                 return;
>  
> -       __hrtimer_reprogram(cpu_base, true, timer, expires);
> +       if (expires >= cpu_base->expires_next)
> +               return;
> +
> +       /*
> +        * If the hrtimer interrupt is running, then it will reevaluate
> the
> +        * clock bases and reprogram the clock event device.
> +        */
> +       if (cpu_base->in_hrtirq)
> +               return;
> +
> +       cpu_base->next_timer = timer;
> +
> +       __hrtimer_reprogram(cpu_base, timer, expires);
>  }
>  
>  static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base,


[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 39362 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [tip: timers/core] hrtimer: Consolidate reprogramming code
  2021-08-12 15:04           ` Mike Galbraith
@ 2021-08-12 15:22             ` Thomas Gleixner
  2021-08-12 15:31               ` Mike Galbraith
  2021-08-12 15:27             ` Mike Galbraith
  1 sibling, 1 reply; 36+ messages in thread
From: Thomas Gleixner @ 2021-08-12 15:22 UTC (permalink / raw)
  To: Mike Galbraith, linux-kernel, linux-tip-commits; +Cc: Peter Zijlstra, x86

On Thu, Aug 12 2021 at 17:04, Mike Galbraith wrote:

> On Thu, 2021-08-12 at 16:32 +0200, Thomas Gleixner wrote:
>> 
>> Can you please test whether the below fixes it for you?
>
> Yes, that boots.

Not that I'm surprised, but I still do not know why :)

>> I have yet to find a machine which reproduces it as I really want to
>> understand which particular part is causing this.
>
> Config attached just in case.

I rather assume it's a hardware dependency. What kind of machine are you
using?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [tip: timers/core] hrtimer: Consolidate reprogramming code
  2021-08-12 15:04           ` Mike Galbraith
  2021-08-12 15:22             ` Thomas Gleixner
@ 2021-08-12 15:27             ` Mike Galbraith
  1 sibling, 0 replies; 36+ messages in thread
From: Mike Galbraith @ 2021-08-12 15:27 UTC (permalink / raw)
  To: Thomas Gleixner, linux-kernel, linux-tip-commits; +Cc: Peter Zijlstra, x86

On Thu, 2021-08-12 at 17:04 +0200, Mike Galbraith wrote:
> On Thu, 2021-08-12 at 16:32 +0200, Thomas Gleixner wrote:
> >
> > Can you please test whether the below fixes it for you?
>
> Yes, that boots.
>
> > I have yet to find a machine which reproduces it as I really want to
> > understand which particular part is causing this.
>
> Config attached just in case.

My HP lappy bricks as well, and with tip-rt. Fix verified there too.

	-Mike


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [tip: timers/core] hrtimer: Consolidate reprogramming code
  2021-08-12 15:22             ` Thomas Gleixner
@ 2021-08-12 15:31               ` Mike Galbraith
  2021-08-12 16:58                 ` Thomas Gleixner
  0 siblings, 1 reply; 36+ messages in thread
From: Mike Galbraith @ 2021-08-12 15:31 UTC (permalink / raw)
  To: Thomas Gleixner, linux-kernel, linux-tip-commits; +Cc: Peter Zijlstra, x86

On Thu, 2021-08-12 at 17:22 +0200, Thomas Gleixner wrote:
> >
> > Config attached just in case.
>
> I rather assume it's a hardware dependency. What kind of machine are you
> using?

Desktop box is a garden variety MEDION i4790 box with an Nvidia GTX980.
Lappy is an HP Spectre 360 i5-6200U w. i915 graphics.  Very mundane.

	-Mike


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [tip: timers/core] hrtimer: Consolidate reprogramming code
  2021-08-12 15:31               ` Mike Galbraith
@ 2021-08-12 16:58                 ` Thomas Gleixner
  0 siblings, 0 replies; 36+ messages in thread
From: Thomas Gleixner @ 2021-08-12 16:58 UTC (permalink / raw)
  To: Mike Galbraith, linux-kernel, linux-tip-commits; +Cc: Peter Zijlstra, x86

On Thu, Aug 12 2021 at 17:31, Mike Galbraith wrote:
> On Thu, 2021-08-12 at 17:22 +0200, Thomas Gleixner wrote:
>> >
>> > Config attached just in case.
>>
>> I rather assume it's a hardware dependency. What kind of machine are you
>> using?
>
> Desktop box is a garden variety MEDION i4790 box with an Nvidia GTX980.
> Lappy is an HP Spectre 360 i5-6200U w. i915 graphics.  Very mundane.

Hmm, spectre induced timer meltdown? :)

Anyway I found a box which exposes the problem. Investigating...

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH] hrtimer: Use raw_cpu_ptr() in clock_was_set()
  2021-08-12 14:11       ` Thomas Gleixner
  2021-08-12 14:32         ` Thomas Gleixner
@ 2021-08-12 20:31         ` Thomas Gleixner
  2021-08-12 20:40           ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  2021-08-13  6:42           ` [PATCH] " Marek Szyprowski
  1 sibling, 2 replies; 36+ messages in thread
From: Thomas Gleixner @ 2021-08-12 20:31 UTC (permalink / raw)
  To: Mike Galbraith, linux-kernel, linux-tip-commits
  Cc: Peter Zijlstra, x86, Marek Szyprowski

clock_was_set() can be invoked from preemptible context. Use raw_cpu_ptr()
to check whether high resolution mode is active or not. It does not matter
whether the task migrates after acquiring the pointer.

Fixes: e71a4153b7c2 ("hrtimer: Force clock_was_set() handling for the HIGHRES=n, NOHZ=y case")
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/hrtimer.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -944,10 +944,11 @@ static bool update_needs_ipi(struct hrti
  */
 void clock_was_set(unsigned int bases)
 {
+	struct hrtimer_cpu_base *cpu_base = raw_cpu_ptr(&hrtimer_bases);
 	cpumask_var_t mask;
 	int cpu;
 
-	if (!hrtimer_hres_active() && !tick_nohz_active)
+	if (!__hrtimer_hres_active(cpu_base) && !tick_nohz_active)
 		goto out_timerfd;
 
 	if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
@@ -958,9 +959,9 @@ void clock_was_set(unsigned int bases)
 	/* Avoid interrupting CPUs if possible */
 	cpus_read_lock();
 	for_each_online_cpu(cpu) {
-		struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
 		unsigned long flags;
 
+		cpu_base = &per_cpu(hrtimer_bases, cpu);
 		raw_spin_lock_irqsave(&cpu_base->lock, flags);
 
 		if (update_needs_ipi(cpu_base, bases))

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH] hrtimer: Unbreak hrtimer_force_reprogram()
  2021-08-12 14:32         ` Thomas Gleixner
  2021-08-12 15:04           ` Mike Galbraith
@ 2021-08-12 20:32           ` Thomas Gleixner
  2021-08-12 20:40             ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
  2021-08-13  7:58             ` [PATCH] " Thomas Gleixner
  1 sibling, 2 replies; 36+ messages in thread
From: Thomas Gleixner @ 2021-08-12 20:32 UTC (permalink / raw)
  To: Mike Galbraith, linux-kernel, linux-tip-commits; +Cc: Peter Zijlstra, x86

Since the recent consoliation of reprogramming functions,
hrtimer_force_reprogram() is affected by a check whether the new expiry
time is past the current expiry time.

This breaks the NOHZ logic as that relies on the fact that the tick hrtimer
is moved into the future. That means cpu_base->expires_next becomes stale
and subsequent reprogramming attempts fail as well until the situation is
cleaned up by an hrtimer interrupts.

For some yet unknown reason this leads to a complete stall, so for now
partially revert the offending commit to a known working state. The root
cause for the stall is still investigated and will be fixed in a subsequent
commit.

Fixes: b14bca97c9f5 ("hrtimer: Consolidate reprogramming code")
Reported-by: Mike Galbraith <efault@gmx.de>
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mike Galbraith <efault@gmx.de>
---
 kernel/time/hrtimer.c |   40 ++++++++++++++++++++--------------------
 1 file changed, 20 insertions(+), 20 deletions(-)

--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -652,24 +652,10 @@ static inline int hrtimer_hres_active(vo
 	return __hrtimer_hres_active(this_cpu_ptr(&hrtimer_bases));
 }
 
-static void
-__hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal,
-		    struct hrtimer *next_timer, ktime_t expires_next)
+static void __hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base,
+				struct hrtimer *next_timer,
+				ktime_t expires_next)
 {
-	/*
-	 * If the hrtimer interrupt is running, then it will reevaluate the
-	 * clock bases and reprogram the clock event device.
-	 */
-	if (cpu_base->in_hrtirq)
-		return;
-
-	if (expires_next > cpu_base->expires_next)
-		return;
-
-	if (skip_equal && expires_next == cpu_base->expires_next)
-		return;
-
-	cpu_base->next_timer = next_timer;
 	cpu_base->expires_next = expires_next;
 
 	/*
@@ -707,8 +693,10 @@ hrtimer_force_reprogram(struct hrtimer_c
 
 	expires_next = hrtimer_update_next_event(cpu_base);
 
-	__hrtimer_reprogram(cpu_base, skip_equal, cpu_base->next_timer,
-			    expires_next);
+	if (skip_equal && expires_next == cpu_base->expires_next)
+		return;
+
+	__hrtimer_reprogram(cpu_base, cpu_base->next_timer, expires_next);
 }
 
 /* High resolution timer related functions */
@@ -863,7 +851,19 @@ static void hrtimer_reprogram(struct hrt
 	if (base->cpu_base != cpu_base)
 		return;
 
-	__hrtimer_reprogram(cpu_base, true, timer, expires);
+	if (expires >= cpu_base->expires_next)
+		return;
+
+	/*
+	 * If the hrtimer interrupt is running, then it will reevaluate the
+	 * clock bases and reprogram the clock event device.
+	 */
+	if (cpu_base->in_hrtirq)
+		return;
+
+	cpu_base->next_timer = timer;
+
+	__hrtimer_reprogram(cpu_base, timer, expires);
 }
 
 static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base,

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [tip: timers/core] hrtimer: Unbreak hrtimer_force_reprogram()
  2021-08-12 20:32           ` [PATCH] hrtimer: Unbreak hrtimer_force_reprogram() Thomas Gleixner
@ 2021-08-12 20:40             ` tip-bot2 for Thomas Gleixner
  2021-08-13  7:58             ` [PATCH] " Thomas Gleixner
  1 sibling, 0 replies; 36+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2021-08-12 20:40 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Mike Galbraith, Marek Szyprowski, Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     f80e21489590c00f46226d5802d900e6f66e5633
Gitweb:        https://git.kernel.org/tip/f80e21489590c00f46226d5802d900e6f66e5633
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Thu, 12 Aug 2021 22:32:30 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 12 Aug 2021 22:34:40 +02:00

hrtimer: Unbreak hrtimer_force_reprogram()

Since the recent consoliation of reprogramming functions,
hrtimer_force_reprogram() is affected by a check whether the new expiry
time is past the current expiry time.

This breaks the NOHZ logic as that relies on the fact that the tick hrtimer
is moved into the future. That means cpu_base->expires_next becomes stale
and subsequent reprogramming attempts fail as well until the situation is
cleaned up by an hrtimer interrupts.

For some yet unknown reason this leads to a complete stall, so for now
partially revert the offending commit to a known working state. The root
cause for the stall is still investigated and will be fixed in a subsequent
commit.

Fixes: b14bca97c9f5 ("hrtimer: Consolidate reprogramming code")
Reported-by: Mike Galbraith <efault@gmx.de>
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mike Galbraith <efault@gmx.de>
Link: https://lore.kernel.org/r/8735recskh.ffs@tglx

---
 kernel/time/hrtimer.c | 40 ++++++++++++++++++++--------------------
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 33b00e2..0ea8702 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -652,24 +652,10 @@ static inline int hrtimer_hres_active(void)
 	return __hrtimer_hres_active(this_cpu_ptr(&hrtimer_bases));
 }
 
-static void
-__hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal,
-		    struct hrtimer *next_timer, ktime_t expires_next)
+static void __hrtimer_reprogram(struct hrtimer_cpu_base *cpu_base,
+				struct hrtimer *next_timer,
+				ktime_t expires_next)
 {
-	/*
-	 * If the hrtimer interrupt is running, then it will reevaluate the
-	 * clock bases and reprogram the clock event device.
-	 */
-	if (cpu_base->in_hrtirq)
-		return;
-
-	if (expires_next > cpu_base->expires_next)
-		return;
-
-	if (skip_equal && expires_next == cpu_base->expires_next)
-		return;
-
-	cpu_base->next_timer = next_timer;
 	cpu_base->expires_next = expires_next;
 
 	/*
@@ -707,8 +693,10 @@ hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
 
 	expires_next = hrtimer_update_next_event(cpu_base);
 
-	__hrtimer_reprogram(cpu_base, skip_equal, cpu_base->next_timer,
-			    expires_next);
+	if (skip_equal && expires_next == cpu_base->expires_next)
+		return;
+
+	__hrtimer_reprogram(cpu_base, cpu_base->next_timer, expires_next);
 }
 
 /* High resolution timer related functions */
@@ -863,7 +851,19 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
 	if (base->cpu_base != cpu_base)
 		return;
 
-	__hrtimer_reprogram(cpu_base, true, timer, expires);
+	if (expires >= cpu_base->expires_next)
+		return;
+
+	/*
+	 * If the hrtimer interrupt is running, then it will reevaluate the
+	 * clock bases and reprogram the clock event device.
+	 */
+	if (cpu_base->in_hrtirq)
+		return;
+
+	cpu_base->next_timer = timer;
+
+	__hrtimer_reprogram(cpu_base, timer, expires);
 }
 
 static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base,

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [tip: timers/core] hrtimer: Use raw_cpu_ptr() in clock_was_set()
  2021-08-12 20:31         ` [PATCH] hrtimer: Use raw_cpu_ptr() in clock_was_set() Thomas Gleixner
@ 2021-08-12 20:40           ` tip-bot2 for Thomas Gleixner
  2021-08-13  6:42           ` [PATCH] " Marek Szyprowski
  1 sibling, 0 replies; 36+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2021-08-12 20:40 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Mike Galbraith, Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     9482fd71dbb8f0d1a61821a83e467dc0a9d7b429
Gitweb:        https://git.kernel.org/tip/9482fd71dbb8f0d1a61821a83e467dc0a9d7b429
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Thu, 12 Aug 2021 22:31:24 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 12 Aug 2021 22:34:40 +02:00

hrtimer: Use raw_cpu_ptr() in clock_was_set()

clock_was_set() can be invoked from preemptible context. Use raw_cpu_ptr()
to check whether high resolution mode is active or not. It does not matter
whether the task migrates after acquiring the pointer.

Fixes: e71a4153b7c2 ("hrtimer: Force clock_was_set() handling for the HIGHRES=n, NOHZ=y case")
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/875ywacsmb.ffs@tglx

---
 kernel/time/hrtimer.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 88aefc3..33b00e2 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -944,10 +944,11 @@ static bool update_needs_ipi(struct hrtimer_cpu_base *cpu_base,
  */
 void clock_was_set(unsigned int bases)
 {
+	struct hrtimer_cpu_base *cpu_base = raw_cpu_ptr(&hrtimer_bases);
 	cpumask_var_t mask;
 	int cpu;
 
-	if (!hrtimer_hres_active() && !tick_nohz_active)
+	if (!__hrtimer_hres_active(cpu_base) && !tick_nohz_active)
 		goto out_timerfd;
 
 	if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
@@ -958,9 +959,9 @@ void clock_was_set(unsigned int bases)
 	/* Avoid interrupting CPUs if possible */
 	cpus_read_lock();
 	for_each_online_cpu(cpu) {
-		struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
 		unsigned long flags;
 
+		cpu_base = &per_cpu(hrtimer_bases, cpu);
 		raw_spin_lock_irqsave(&cpu_base->lock, flags);
 
 		if (update_needs_ipi(cpu_base, bases))

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH] hrtimer: Use raw_cpu_ptr() in clock_was_set()
  2021-08-12 20:31         ` [PATCH] hrtimer: Use raw_cpu_ptr() in clock_was_set() Thomas Gleixner
  2021-08-12 20:40           ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
@ 2021-08-13  6:42           ` Marek Szyprowski
  1 sibling, 0 replies; 36+ messages in thread
From: Marek Szyprowski @ 2021-08-13  6:42 UTC (permalink / raw)
  To: Thomas Gleixner, Mike Galbraith, linux-kernel, linux-tip-commits
  Cc: Peter Zijlstra, x86

On 12.08.2021 22:31, Thomas Gleixner wrote:
> clock_was_set() can be invoked from preemptible context. Use raw_cpu_ptr()
> to check whether high resolution mode is active or not. It does not matter
> whether the task migrates after acquiring the pointer.
>
> Fixes: e71a4153b7c2 ("hrtimer: Force clock_was_set() handling for the HIGHRES=n, NOHZ=y case")
> Reported-by: Mike Galbraith <efault@gmx.de>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>

This fixes the following issue:

BUG: using smp_processor_id() in preemptible [00000000] code: hwclock/227

> ---
>   kernel/time/hrtimer.c |    5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
>
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -944,10 +944,11 @@ static bool update_needs_ipi(struct hrti
>    */
>   void clock_was_set(unsigned int bases)
>   {
> +	struct hrtimer_cpu_base *cpu_base = raw_cpu_ptr(&hrtimer_bases);
>   	cpumask_var_t mask;
>   	int cpu;
>   
> -	if (!hrtimer_hres_active() && !tick_nohz_active)
> +	if (!__hrtimer_hres_active(cpu_base) && !tick_nohz_active)
>   		goto out_timerfd;
>   
>   	if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
> @@ -958,9 +959,9 @@ void clock_was_set(unsigned int bases)
>   	/* Avoid interrupting CPUs if possible */
>   	cpus_read_lock();
>   	for_each_online_cpu(cpu) {
> -		struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
>   		unsigned long flags;
>   
> +		cpu_base = &per_cpu(hrtimer_bases, cpu);
>   		raw_spin_lock_irqsave(&cpu_base->lock, flags);
>   
>   		if (update_needs_ipi(cpu_base, bases))
>
Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] hrtimer: Unbreak hrtimer_force_reprogram()
  2021-08-12 20:32           ` [PATCH] hrtimer: Unbreak hrtimer_force_reprogram() Thomas Gleixner
  2021-08-12 20:40             ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
@ 2021-08-13  7:58             ` Thomas Gleixner
  1 sibling, 0 replies; 36+ messages in thread
From: Thomas Gleixner @ 2021-08-13  7:58 UTC (permalink / raw)
  To: Mike Galbraith, linux-kernel, linux-tip-commits; +Cc: Peter Zijlstra, x86

On Thu, Aug 12 2021 at 22:32, Thomas Gleixner wrote:
> Since the recent consoliation of reprogramming functions,
> hrtimer_force_reprogram() is affected by a check whether the new expiry
> time is past the current expiry time.
>
> This breaks the NOHZ logic as that relies on the fact that the tick hrtimer
> is moved into the future. That means cpu_base->expires_next becomes stale
> and subsequent reprogramming attempts fail as well until the situation is
> cleaned up by an hrtimer interrupts.
>
> For some yet unknown reason this leads to a complete stall, so for now
> partially revert the offending commit to a known working state. The root
> cause for the stall is still investigated and will be fixed in a subsequent
> commit.

So with brain more awake I actually managed to decode the problem. It's
definitely the

           expires > cpu_base->expires_next

check. It not only prevents the NOHZ idle case from moving the next
timer interrupt into the future, it also causes the stall when switching
into high resolution / NOHZ mode. At that point the initial base value
can be smaller than the next event which prevents reprogramming and as
the base value stays stale it prevents any further reprogramming unless
there is a full update of the base which makes the problem go away.

TBH, that optimization logic to prevent reprogramming the timer hardware
for nothing is a bit fragile and non-obvious. I'll have a look to make
this more robust and less obscure.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2021-08-13  7:59 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-13 13:39 [patch V2 00/10] hrtimer: Reprogramming and clock_was_set() overhaul Thomas Gleixner
2021-07-13 13:39 ` [patch V2 01/10] hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns() Thomas Gleixner
2021-08-10 16:02   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2021-07-13 13:39 ` [patch V2 02/10] hrtimer: Consolidate reprogramming code Thomas Gleixner
2021-08-10 16:02   ` [tip: timers/core] " tip-bot2 for Peter Zijlstra
2021-08-12  7:19     ` Mike Galbraith
2021-08-12 14:11       ` Thomas Gleixner
2021-08-12 14:32         ` Thomas Gleixner
2021-08-12 15:04           ` Mike Galbraith
2021-08-12 15:22             ` Thomas Gleixner
2021-08-12 15:31               ` Mike Galbraith
2021-08-12 16:58                 ` Thomas Gleixner
2021-08-12 15:27             ` Mike Galbraith
2021-08-12 20:32           ` [PATCH] hrtimer: Unbreak hrtimer_force_reprogram() Thomas Gleixner
2021-08-12 20:40             ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2021-08-13  7:58             ` [PATCH] " Thomas Gleixner
2021-08-12 20:31         ` [PATCH] hrtimer: Use raw_cpu_ptr() in clock_was_set() Thomas Gleixner
2021-08-12 20:40           ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2021-08-13  6:42           ` [PATCH] " Marek Szyprowski
     [not found]   ` <CGME20210812130945eucas1p117fc1e90f31c8d9fd177932cd1a18512@eucas1p1.samsung.com>
2021-08-12 13:09     ` [patch V2 02/10] hrtimer: Consolidate reprogramming code Marek Szyprowski
2021-07-13 13:39 ` [patch V2 03/10] hrtimer: Ensure timerfd notification for HIGHRES=n Thomas Gleixner
2021-08-10 16:02   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2021-07-13 13:39 ` [patch V2 04/10] hrtimer: Force clock_was_set() handling for the HIGHRES=n, NOHZ=y case Thomas Gleixner
2021-08-10 16:02   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2021-07-13 13:39 ` [patch V2 05/10] timerfd: Provide timerfd_resume() Thomas Gleixner
2021-08-10 16:02   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2021-07-13 13:39 ` [patch V2 06/10] timekeeping: Distangle resume and clock-was-set events Thomas Gleixner
2021-08-10 16:02   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2021-07-13 13:39 ` [patch V2 07/10] time/timekeeping: Avoid invoking clock_was_set() twice Thomas Gleixner
2021-08-10 16:02   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2021-07-13 13:39 ` [patch V2 08/10] hrtimer: Add bases argument to clock_was_set() Thomas Gleixner
2021-08-10 16:02   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2021-07-13 13:39 ` [patch V2 09/10] hrtimer: Avoid unnecessary SMP function calls in clock_was_set() Thomas Gleixner
2021-08-10 16:02   ` [tip: timers/core] " tip-bot2 for Marcelo Tosatti
2021-07-13 13:39 ` [patch V2 10/10] hrtimer: Avoid more " Thomas Gleixner
2021-08-10 16:02   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).