From: Anna-Maria Behnsen <anna-maria@linutronix.de>
To: linux-kernel@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>,
	John Stultz <jstultz@google.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Eric Dumazet <edumazet@google.com>,
	"Rafael J . Wysocki" <rafael.j.wysocki@intel.com>,
	Arjan van de Ven <arjan@infradead.org>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	Rik van Riel <riel@surriel.com>,
	Anna-Maria Behnsen <anna-maria@linutronix.de>,
	Richard Cochran <richardcochran@gmail.com>,
	Frederic Weisbecker <frederic@kernel.org>
Subject: [PATCH v5 09/18] timer: Keep the pinned timers separate from the others
Date: Wed,  1 Mar 2023 15:17:35 +0100	[thread overview]
Message-ID: <20230301141744.16063-10-anna-maria@linutronix.de> (raw)
In-Reply-To: <20230301141744.16063-1-anna-maria@linutronix.de>

Separate the storage space for pinned timers. Deferrable timers (whether
pinned or not) are still enqueued into their own base.

This is preparatory work for changing the NOHZ timer placement from a push
at enqueue time to a pull at expiry time model.
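
The resulting mapping from timer flags to base index can be summarized
with a small helper (purely illustrative; the helper name is invented
for this sketch, the real selection lives in get_timer_cpu_base() and
get_timer_this_cpu_base() in the diff below):

  /* Sketch only: which timer_bases[] slot a timer ends up in */
  static inline int timer_base_index(u32 tflags)
  {
  	/* Deferrable timers keep their own base, pinned or not */
  	if (IS_ENABLED(CONFIG_NO_HZ_COMMON) && (tflags & TIMER_DEFERRABLE))
  		return BASE_DEF;
  	/* Pinned timers stay local, everything else goes global */
  	return (tflags & TIMER_PINNED) ? BASE_LOCAL : BASE_GLOBAL;
  }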

When a timer is added via add_timer_on(), the TIMER_PINNED flag is
required to ensure it expires on the specified CPU. Otherwise it would be
enqueued into the global timer base, which could be expired by a remote
CPU. A WARN_ONCE() is added to catch such misuse.

Apart from that there is no functional change, because all callers of
add_timer_on() already use the TIMER_PINNED flag.
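
A minimal usage sketch (names are hypothetical, not taken from the
patch): a timer that must fire on a specific CPU carries TIMER_PINNED
from the start, so add_timer_on() does not trigger the new WARN_ONCE():

  static struct timer_list my_timer;

  static void my_timer_fn(struct timer_list *t)
  {
  	/* Runs on the CPU that was passed to add_timer_on() */
  }

  	/* 'cpu' is the target CPU, e.g. taken from a cpumask walk */
  	timer_setup(&my_timer, my_timer_fn, TIMER_PINNED);
  	my_timer.expires = jiffies + HZ;
  	add_timer_on(&my_timer, cpu);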

Originally-by: Richard Cochran (linutronix GmbH) <richardcochran@gmail.com>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
---
v5:
  - Add WARN_ONCE() in add_timer_on()
  - Decrease patch size by splitting into three patches (this patch and the
    two before)
---
 kernel/time/timer.c | 91 +++++++++++++++++++++++++++++++++------------
 1 file changed, 68 insertions(+), 23 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 1629ccf24dd0..7656eab1bf20 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -187,12 +187,18 @@ EXPORT_SYMBOL(jiffies_64);
 #define WHEEL_SIZE	(LVL_SIZE * LVL_DEPTH)
 
 #ifdef CONFIG_NO_HZ_COMMON
-# define NR_BASES	2
-# define BASE_STD	0
-# define BASE_DEF	1
+/*
+ * If multiple bases need to be locked, use the base ordering for lock
+ * nesting, i.e. lowest number first.
+ */
+# define NR_BASES	3
+# define BASE_LOCAL	0
+# define BASE_GLOBAL	1
+# define BASE_DEF	2
 #else
 # define NR_BASES	1
-# define BASE_STD	0
+# define BASE_LOCAL	0
+# define BASE_GLOBAL	0
 # define BASE_DEF	0
 #endif
 
@@ -902,7 +908,10 @@ static int detach_if_pending(struct timer_list *timer, struct timer_base *base,
 
 static inline struct timer_base *get_timer_cpu_base(u32 tflags, u32 cpu)
 {
-	struct timer_base *base = per_cpu_ptr(&timer_bases[BASE_STD], cpu);
+	int index = tflags & TIMER_PINNED ? BASE_LOCAL : BASE_GLOBAL;
+	struct timer_base *base;
+
+	base = per_cpu_ptr(&timer_bases[index], cpu);
 
 	/*
 	 * If the timer is deferrable and NO_HZ_COMMON is set then we need
@@ -915,7 +924,10 @@ static inline struct timer_base *get_timer_cpu_base(u32 tflags, u32 cpu)
 
 static inline struct timer_base *get_timer_this_cpu_base(u32 tflags)
 {
-	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
+	int index = tflags & TIMER_PINNED ? BASE_LOCAL : BASE_GLOBAL;
+	struct timer_base *base;
+
+	base = this_cpu_ptr(&timer_bases[index]);
 
 	/*
 	 * If the timer is deferrable and NO_HZ_COMMON is set then we need
@@ -1264,6 +1276,12 @@ void add_timer_on(struct timer_list *timer, int cpu)
 	if (WARN_ON_ONCE(timer_pending(timer)))
 		return;
 
+	WARN_ONCE(!(timer->flags & TIMER_PINNED), "TIMER_PINNED flag for "
+		  "add_timer_on() is missing: timer=%p function=%ps",
+		  timer, timer->function);
+	/* Make sure timer flags have TIMER_PINNED flag set */
+	timer->flags |= TIMER_PINNED;
+
 	new_base = get_timer_cpu_base(timer->flags, cpu);
 
 	/*
@@ -1950,9 +1968,10 @@ static void forward_base_clk(struct timer_base *base, unsigned long nextevt,
  */
 u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
 {
-	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
+	unsigned long nextevt, nextevt_local, nextevt_global;
+	struct timer_base *base_local, *base_global;
+	bool local_first, is_idle;
 	u64 expires = KTIME_MAX;
-	unsigned long nextevt;
 
 	/*
 	 * Pretend that there is no timer pending if the cpu is offline.
@@ -1961,32 +1980,57 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
 	if (cpu_is_offline(smp_processor_id()))
 		return expires;
 
-	raw_spin_lock(&base->lock);
+	base_local = this_cpu_ptr(&timer_bases[BASE_LOCAL]);
+	base_global = this_cpu_ptr(&timer_bases[BASE_GLOBAL]);
 
-	nextevt = next_timer_interrupt(base);
+	raw_spin_lock(&base_local->lock);
+	raw_spin_lock_nested(&base_global->lock, SINGLE_DEPTH_NESTING);
+
+	nextevt_local = next_timer_interrupt(base_local);
+	nextevt_global = next_timer_interrupt(base_global);
 
 	/*
 	 * We have a fresh next event. Check whether we can forward the
 	 * base.
 	 */
-	forward_base_clk(base, nextevt, basej);
+	forward_base_clk(base_local, nextevt_local, basej);
+	forward_base_clk(base_global, nextevt_global, basej);
 
 	/*
-	 * Base is idle if the next event is more than a tick away. Also
+	 * Check whether the local event is expiring before or at the same
+	 * time as the global event.
+	 *
+	 * Note, that nextevt_global and nextevt_local might be based on
+	 * different base->clk values. So it's not guaranteed that
+	 * comparing with empty bases results in a correct local_first.
+	 */
+	if (base_local->timers_pending && base_global->timers_pending)
+		local_first = time_before_eq(nextevt_local, nextevt_global);
+	else
+		local_first = base_local->timers_pending;
+
+	nextevt = local_first ? nextevt_local : nextevt_global;
+
+	/*
+	 * Bases are idle if the next event is more than a tick away. Also
 	 * the tick is stopped so any added timer must forward the base clk
 	 * itself to keep granularity small. This idle logic is only
-	 * maintained for the BASE_STD base, deferrable timers may still
-	 * see large granularity skew (by design).
+	 * maintained for the BASE_LOCAL and BASE_GLOBAL base, deferrable
+	 * timers may still see large granularity skew (by design).
 	 */
-	base->is_idle = time_after(nextevt, basej + 1);
+	is_idle = time_after(nextevt, basej + 1);
+
+	/* We need to mark both bases in sync */
+	base_local->is_idle = base_global->is_idle = is_idle;
 
-	if (base->timers_pending) {
+	if (base_local->timers_pending || base_global->timers_pending) {
 		/* If we missed a tick already, force 0 delta */
 		if (time_before(nextevt, basej))
 			nextevt = basej;
 		expires = basem + (u64)(nextevt - basej) * TICK_NSEC;
 	}
-	raw_spin_unlock(&base->lock);
+	raw_spin_unlock(&base_global->lock);
+	raw_spin_unlock(&base_local->lock);
 
 	return cmp_next_hrtimer_event(basem, expires);
 }
@@ -1998,15 +2042,14 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
  */
 void timer_clear_idle(void)
 {
-	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
-
 	/*
 	 * We do this unlocked. The worst outcome is a remote enqueue sending
 	 * a pointless IPI, but taking the lock would just make the window for
 	 * sending the IPI a few instructions smaller for the cost of taking
 	 * the lock in the exit from idle path.
 	 */
-	base->is_idle = false;
+	__this_cpu_write(timer_bases[BASE_LOCAL].is_idle, false);
+	__this_cpu_write(timer_bases[BASE_GLOBAL].is_idle, false);
 }
 #endif
 
@@ -2052,11 +2095,13 @@ static inline void __run_timers(struct timer_base *base)
  */
 static __latent_entropy void run_timer_softirq(struct softirq_action *h)
 {
-	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
+	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_LOCAL]);
 
 	__run_timers(base);
-	if (IS_ENABLED(CONFIG_NO_HZ_COMMON))
+	if (IS_ENABLED(CONFIG_NO_HZ_COMMON)) {
+		__run_timers(this_cpu_ptr(&timer_bases[BASE_GLOBAL]));
 		__run_timers(this_cpu_ptr(&timer_bases[BASE_DEF]));
+	}
 }
 
 /*
@@ -2064,7 +2109,7 @@ static __latent_entropy void run_timer_softirq(struct softirq_action *h)
  */
 static void run_local_timers(void)
 {
-	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
+	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_LOCAL]);
 
 	hrtimer_run_queues();
 
-- 
2.30.2


Thread overview: 50+ messages
2023-03-01 14:17 [PATCH v5 00/18] timer: Move from a push remote at enqueue to a pull at expiry model Anna-Maria Behnsen
2023-03-01 14:17 ` [PATCH v5 01/18] tick-sched: Warn when next tick seems to be in the past Anna-Maria Behnsen
2023-03-01 14:17 ` [PATCH v5 02/18] timer: Add comment to get_next_timer_interrupt() description Anna-Maria Behnsen
2023-04-11  9:36   ` Frederic Weisbecker
2023-04-11 16:10     ` Anna-Maria Behnsen
2023-04-12 11:29       ` Frederic Weisbecker
2023-03-01 14:17 ` [PATCH v5 03/18] timer: Move store of next event into __next_timer_interrupt() Anna-Maria Behnsen
2023-03-21 12:48   ` Peter Zijlstra
2023-04-12 11:32   ` Frederic Weisbecker
2023-03-01 14:17 ` [PATCH v5 04/18] timer: Split next timer interrupt logic Anna-Maria Behnsen
2023-03-01 14:17 ` [PATCH v5 05/18] timer: Rework idle logic Anna-Maria Behnsen
2023-03-01 14:17 ` [PATCH v5 06/18] add_timer_on(): Make sure callers have TIMER_PINNED flag Anna-Maria Behnsen
2023-03-01 14:17 ` [PATCH v5 07/18] timers: Ease code in run_local_timers() Anna-Maria Behnsen
2023-04-12 14:32   ` Frederic Weisbecker
2023-03-01 14:17 ` [PATCH v5 08/18] timers: Create helper function to forward timer base clk Anna-Maria Behnsen
2023-04-12 14:40   ` Frederic Weisbecker
2023-03-01 14:17 ` Anna-Maria Behnsen [this message]
2023-03-01 14:17 ` [PATCH v5 10/18] timer: Retrieve next expiry of pinned/non-pinned timers seperately Anna-Maria Behnsen
2023-03-01 14:17 ` [PATCH v5 11/18] timer: Split out "get next timer interrupt" functionality Anna-Maria Behnsen
2023-03-09 16:30   ` Frederic Weisbecker
2023-03-09 17:45     ` Frederic Weisbecker
2023-03-21 14:30   ` Peter Zijlstra
2023-04-12 20:34   ` Frederic Weisbecker
2023-03-01 14:17 ` [PATCH v5 12/18] timer: Add get next timer interrupt functionality for remote CPUs Anna-Maria Behnsen
2023-03-01 14:17 ` [PATCH v5 13/18] timer: Restructure internal locking Anna-Maria Behnsen
2023-03-01 14:17 ` [PATCH v5 14/18] timer: Check if timers base is handled already Anna-Maria Behnsen
2023-03-21 14:43   ` Peter Zijlstra
2023-03-01 14:17 ` [PATCH v5 15/18] tick/sched: Split out jiffies update helper function Anna-Maria Behnsen
2023-03-01 14:17 ` [PATCH v5 16/18] timer: Implement the hierarchical pull model Anna-Maria Behnsen
2023-03-14 13:24   ` Frederic Weisbecker
2023-03-14 14:49     ` Anna-Maria Behnsen
2023-03-14 16:01       ` Frederic Weisbecker
2023-03-21 11:17   ` Frederic Weisbecker
2023-04-04 14:05     ` Anna-Maria Behnsen
2023-04-04 14:32       ` Frederic Weisbecker
2023-03-21 13:25   ` Frederic Weisbecker
2023-04-06  9:12     ` Anna-Maria Behnsen
2023-03-21 15:29   ` Peter Zijlstra
2023-03-21 15:34   ` Peter Zijlstra
2023-03-21 15:40   ` Peter Zijlstra
2023-03-23  9:22   ` Peter Zijlstra
2023-03-23  9:34   ` Peter Zijlstra
2023-03-23  9:47   ` Peter Zijlstra
2023-03-23 12:47   ` Peter Zijlstra
2023-03-23 14:24   ` Peter Zijlstra
2023-04-04 14:56     ` Anna-Maria Behnsen
2023-03-01 14:17 ` [PATCH v5 17/18] timer_migration: Add tracepoints Anna-Maria Behnsen
2023-03-01 14:17 ` [PATCH v5 18/18] timer: Always queue timers on the local CPU Anna-Maria Behnsen
2023-03-21 12:46 ` [PATCH v5 00/18] timer: Move from a push remote at enqueue to a pull at expiry model Peter Zijlstra
2023-04-04 13:35   ` Anna-Maria Behnsen
