linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/3] sched/nohz: Update nohz.next_balance directly without IPIs (v2)
@ 2023-10-20  1:40 Joel Fernandes (Google)
  2023-10-20  1:40 ` [PATCH 2/3] sched/nohz: Update comments about NEWILB_KICK Joel Fernandes (Google)
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Joel Fernandes (Google) @ 2023-10-20  1:40 UTC (permalink / raw)
  To: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider
  Cc: Joel Fernandes (Google),
	Suleiman Souhlal, Frederic Weisbecker, Paul E . McKenney,
	Vineeth Pillai

Whenever a CPU stops its tick, it now requires another idle CPU to handle the
balancing for it because it can't perform its own periodic load balancing.
This means it might need to update 'nohz.next_balance' to 'rq->next_balance' if
the upcoming nohz-idle load balancing is too distant in the future. This update
process is done by triggering an ILB, as the general ILB handler
(_nohz_idle_balance) that manages regular nohz balancing also refreshes
'nohz.next_balance' by looking at the 'rq->next_balance' of all other idle CPUs
and selecting the smallest value.

Triggering this ILB is achieved in current mainline by setting the
NOHZ_NEXT_KICK flag. This primarily results in the ILB handler updating
'nohz.next_balance' while possibly not doing any load balancing at all.
However, sending an IPI merely to refresh 'nohz.next_balance' seems
excessive. This patch therefore directly sets nohz.next_balance from the
CPU stopping the tick.
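
For illustration, the direct update is just a "keep the earliest deadline"
compare-and-swap loop, the same shape as the helper added below. A minimal
user-space sketch of the pattern (illustrative names and C11 atomics, not
the kernel API):

  #include <stdatomic.h>
  #include <stdbool.h>

  /* Wrap-safe "a is after b", in the spirit of the kernel's time_after(). */
  static bool after(unsigned long a, unsigned long b)
  {
          return (long)(b - a) < 0;
  }

  static _Atomic unsigned long next_deadline;

  /* Under races, only ever move the deadline earlier, never later. */
  static void update_next_deadline(unsigned long candidate)
  {
          unsigned long cur = atomic_load(&next_deadline);

          while (after(cur, candidate)) {
                  if (atomic_compare_exchange_weak(&next_deadline, &cur, candidate))
                          break;
                  /* A failed CAS reloads 'cur'; re-check against the new value. */
          }
  }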

Testing shows a considerable reduction in IPIs when doing this:

Running "cyclictest -i 100 -d 100 --latency=1000 -t -m" on a 4vcpu VM,
the IPI call count profiled over a 10s period is as follows:
without fix: ~10500
with fix: ~1000

Also just to note, without this patch we observe the following pattern:

1. A CPU is about to stop its tick.
2. It sets nohz.needs_update to 1.
3. It then stops its tick and goes idle.
4. The scheduler tick on another CPU checks this flag and decides an ILB kick
   is needed.
5. The ILB CPU ends up being the one that just stopped its tick!
6. This results in an IPI to the tick-stopped CPU which ends up waking it up
   and disturbing it!

Finally, this patch also avoids having to go through all the idle CPUs
just to update nohz.next_balance when the tick is stopped.

The previous version of this patch had some issues, which are addressed now:
https://lore.kernel.org/all/20231005161727.1855004-1-joel@joelfernandes.org/

Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Fixes: 7fd7a9e0caba ("sched/fair: Trigger nohz.next_balance updates when a CPU goes NOHZ-idle")
Co-developed-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/fair.c  | 44 +++++++++++++++++++++++++++++---------------
 kernel/sched/sched.h |  5 +----
 2 files changed, 30 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cb225921bbca..965c30fbbe5c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6627,7 +6627,6 @@ static struct {
 	cpumask_var_t idle_cpus_mask;
 	atomic_t nr_cpus;
 	int has_blocked;		/* Idle CPUS has blocked load */
-	int needs_update;		/* Newly idle CPUs need their next_balance collated */
 	unsigned long next_balance;     /* in jiffy units */
 	unsigned long next_blocked;	/* Next update of blocked load in jiffies */
 } nohz ____cacheline_aligned;
@@ -11687,9 +11686,6 @@ static void nohz_balancer_kick(struct rq *rq)
 unlock:
 	rcu_read_unlock();
 out:
-	if (READ_ONCE(nohz.needs_update))
-		flags |= NOHZ_NEXT_KICK;
-
 	if (flags)
 		kick_ilb(flags);
 }
@@ -11740,6 +11736,20 @@ static void set_cpu_sd_state_idle(int cpu)
 	rcu_read_unlock();
 }
 
+static inline void
+update_nohz_next_balance(unsigned long next_balance)
+{
+	unsigned long nohz_next_balance;
+
+	/* In event of a race, only update with the earliest next_balance. */
+	do {
+		nohz_next_balance = READ_ONCE(nohz.next_balance);
+		if (!time_after(nohz_next_balance, next_balance))
+			break;
+	} while (!try_cmpxchg(&nohz.next_balance, &nohz_next_balance,
+			     next_balance));
+}
+
 /*
  * This routine will record that the CPU is going idle with tick stopped.
  * This info will be used in performing idle load balancing in the future.
@@ -11786,13 +11796,13 @@ void nohz_balance_enter_idle(int cpu)
 	/*
 	 * Ensures that if nohz_idle_balance() fails to observe our
 	 * @idle_cpus_mask store, it must observe the @has_blocked
-	 * and @needs_update stores.
+	 * store.
 	 */
 	smp_mb__after_atomic();
 
 	set_cpu_sd_state_idle(cpu);
 
-	WRITE_ONCE(nohz.needs_update, 1);
+	update_nohz_next_balance(rq->next_balance);
 out:
 	/*
 	 * Each time a cpu enter idle, we assume that it has blocked load and
@@ -11829,6 +11839,7 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags)
 	/* Earliest time when we have to do rebalance again */
 	unsigned long now = jiffies;
 	unsigned long next_balance = now + 60*HZ;
+	unsigned long old_nohz_next_balance;
 	bool has_blocked_load = false;
 	int update_next_balance = 0;
 	int this_cpu = this_rq->cpu;
@@ -11837,6 +11848,8 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags)
 
 	SCHED_WARN_ON((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
 
+	old_nohz_next_balance = READ_ONCE(nohz.next_balance);
+
 	/*
 	 * We assume there will be no idle load after this update and clear
 	 * the has_blocked flag. If a cpu enters idle in the mean time, it will
@@ -11844,13 +11857,9 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags)
 	 * Because a cpu that becomes idle, is added to idle_cpus_mask before
 	 * setting the flag, we are sure to not clear the state and not
 	 * check the load of an idle cpu.
-	 *
-	 * Same applies to idle_cpus_mask vs needs_update.
 	 */
 	if (flags & NOHZ_STATS_KICK)
 		WRITE_ONCE(nohz.has_blocked, 0);
-	if (flags & NOHZ_NEXT_KICK)
-		WRITE_ONCE(nohz.needs_update, 0);
 
 	/*
 	 * Ensures that if we miss the CPU, we must see the has_blocked
@@ -11874,8 +11883,6 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags)
 		if (need_resched()) {
 			if (flags & NOHZ_STATS_KICK)
 				has_blocked_load = true;
-			if (flags & NOHZ_NEXT_KICK)
-				WRITE_ONCE(nohz.needs_update, 1);
 			goto abort;
 		}
 
@@ -11906,12 +11913,19 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags)
 	}
 
 	/*
-	 * next_balance will be updated only when there is a need.
+	 * nohz.next_balance will be updated only when there is a need.
 	 * When the CPU is attached to null domain for ex, it will not be
 	 * updated.
+	 *
+	 * Also, if it changed since we scanned the nohz CPUs above, do nothing as:
+	 * 1. A concurrent call to _nohz_idle_balance() moved nohz.next_balance forward.
+	 * 2. nohz_balance_enter_idle moved it backward.
 	 */
-	if (likely(update_next_balance))
-		nohz.next_balance = next_balance;
+	if (likely(update_next_balance)) {
+		/* Pairs with the smp_mb() above. */
+		cmpxchg_release(&nohz.next_balance, old_nohz_next_balance,
+				next_balance);
+	}
 
 	if (flags & NOHZ_STATS_KICK)
 		WRITE_ONCE(nohz.next_blocked,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 04846272409c..cf3597d91977 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2874,7 +2874,6 @@ extern void cfs_bandwidth_usage_dec(void);
 #define NOHZ_BALANCE_KICK_BIT	0
 #define NOHZ_STATS_KICK_BIT	1
 #define NOHZ_NEWILB_KICK_BIT	2
-#define NOHZ_NEXT_KICK_BIT	3
 
 /* Run rebalance_domains() */
 #define NOHZ_BALANCE_KICK	BIT(NOHZ_BALANCE_KICK_BIT)
@@ -2882,10 +2881,8 @@ extern void cfs_bandwidth_usage_dec(void);
 #define NOHZ_STATS_KICK		BIT(NOHZ_STATS_KICK_BIT)
 /* Update blocked load when entering idle */
 #define NOHZ_NEWILB_KICK	BIT(NOHZ_NEWILB_KICK_BIT)
-/* Update nohz.next_balance */
-#define NOHZ_NEXT_KICK		BIT(NOHZ_NEXT_KICK_BIT)
 
-#define NOHZ_KICK_MASK	(NOHZ_BALANCE_KICK | NOHZ_STATS_KICK | NOHZ_NEXT_KICK)
+#define NOHZ_KICK_MASK	(NOHZ_BALANCE_KICK | NOHZ_STATS_KICK)
 
 #define nohz_flags(cpu)	(&cpu_rq(cpu)->nohz_flags)
 
-- 
2.42.0.655.g421f12c284-goog


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/3] sched/nohz: Update comments about NEWILB_KICK
  2023-10-20  1:40 [PATCH 1/3] sched/nohz: Update nohz.next_balance directly without IPIs (v2) Joel Fernandes (Google)
@ 2023-10-20  1:40 ` Joel Fernandes (Google)
  2023-10-20  7:51   ` Ingo Molnar
  2023-10-20  8:02   ` [tip: sched/core] " tip-bot2 for Joel Fernandes (Google)
  2023-10-20  1:40 ` [PATCH 3/3] sched: Update ->next_balance correctly during newidle balance Joel Fernandes (Google)
  2023-10-20  4:17 ` [PATCH 1/3] sched/nohz: Update nohz.next_balance directly without IPIs (v2) Joel Fernandes
  2 siblings, 2 replies; 19+ messages in thread
From: Joel Fernandes (Google) @ 2023-10-20  1:40 UTC (permalink / raw)
  To: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider
  Cc: Joel Fernandes (Google),
	Suleiman Souhlal, Frederic Weisbecker, Paul E . McKenney

How ILB is triggered without IPIs is cryptic. Out of mercy for future
code readers, document it in code comments.

The comments are derived from a discussion with Vincent in a past
review.

Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/fair.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 965c30fbbe5c..8e276d12c3cb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11959,8 +11959,19 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 }
 
 /*
- * Check if we need to run the ILB for updating blocked load before entering
- * idle state.
+ * Check if we need to directly run the ILB for updating blocked load before
+ * entering idle state. Here we run ILB directly without issuing IPIs.
+ *
+ * Note that when this function is called, the tick may not yet be stopped on
+ * this CPU. nohz.idle_cpus_mask is updated only when the tick is stopped and
+ * cleared on the next busy tick. In other words, nohz.idle_cpus_mask updates
+ * don't align with CPUs entering/exiting idle, to avoid bottlenecks due to high idle
+ * entry/exit rate (usec). So it is possible that _nohz_idle_balance() is
+ * called from this function on (this) CPU that's not yet in the mask. That's
+ * OK because the goal of nohz_run_idle_balance() is to run ILB only for
+ * updating the blocked load of already idle CPUs without waking up one of
+ * those idle CPUs and outside the preempt disable / irq off phase of the local
+ * cpu about to enter idle, because it can take a long time.
  */
 void nohz_run_idle_balance(int cpu)
 {
-- 
2.42.0.655.g421f12c284-goog


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/3] sched: Update ->next_balance correctly during newidle balance
  2023-10-20  1:40 [PATCH 1/3] sched/nohz: Update nohz.next_balance directly without IPIs (v2) Joel Fernandes (Google)
  2023-10-20  1:40 ` [PATCH 2/3] sched/nohz: Update comments about NEWILB_KICK Joel Fernandes (Google)
@ 2023-10-20  1:40 ` Joel Fernandes (Google)
  2023-10-20  7:53   ` Ingo Molnar
                     ` (2 more replies)
  2023-10-20  4:17 ` [PATCH 1/3] sched/nohz: Update nohz.next_balance directly without IPIs (v2) Joel Fernandes
  2 siblings, 3 replies; 19+ messages in thread
From: Joel Fernandes (Google) @ 2023-10-20  1:40 UTC (permalink / raw)
  To: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider
  Cc: Vineeth Pillai (Google),
	Suleiman Souhlal, Frederic Weisbecker, Paul E . McKenney,
	Joel Fernandes

From: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>

When newidle balancing triggers, we see that it constantly clobbers
rq->next_balance even when there is no newidle balance happening due to
the cost estimates. Due to this, we see that periodic load balance
(rebalance_domains) may trigger way more often when the CPU is going in
and out of idle at a high rate but is not really idle. Repeatedly
triggering load balance there is a bad idea as it is a heavy operation.
It also causes an increase in softirqs.

Another issue is that ->last_balance is not updated after a newidle
balance, causing mistakes in the ->next_balance calculations.

Fix by updating last_balance when a newidle load balance actually
happens and then updating next_balance. This is also how it is done in
other load balance paths.
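
For reference, a simplified model of that bookkeeping (hypothetical types
and names, not a verbatim copy of rebalance_domains()):

  #include <stdbool.h>

  struct dom {
          unsigned long last_balance;      /* jiffy of the last balance that ran */
          unsigned long balance_interval;  /* jiffies between balances */
  };

  static bool time_after_eq(unsigned long a, unsigned long b)
  {
          return (long)(a - b) >= 0;
  }

  /*
   * Returns this domain's next-balance deadline. last_balance only moves
   * forward when a balance actually ran, which is what this patch makes
   * the newidle path do as well.
   */
  static unsigned long periodic_balance(struct dom *sd, unsigned long jiffies,
                                        bool (*try_balance)(struct dom *))
  {
          if (time_after_eq(jiffies, sd->last_balance + sd->balance_interval)) {
                  try_balance(sd);
                  sd->last_balance = jiffies;
          }
          return sd->last_balance + sd->balance_interval;
  }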

Testing shows a significant drop in softirqs when running:
cyclictest -i 100 -d 100 --latency=1000 -D 5 -t -m  -q

Goes from ~6k to ~800.

Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/sched/fair.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8e276d12c3cb..b147ad09126a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12076,11 +12076,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 	if (!READ_ONCE(this_rq->rd->overload) ||
 	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
-
-		if (sd)
-			update_next_balance(sd, &next_balance);
 		rcu_read_unlock();
-
 		goto out;
 	}
 	rcu_read_unlock();
@@ -12095,8 +12091,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 		int continue_balancing = 1;
 		u64 domain_cost;
 
-		update_next_balance(sd, &next_balance);
-
 		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
 			break;
 
@@ -12109,6 +12103,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 			t1 = sched_clock_cpu(this_cpu);
 			domain_cost = t1 - t0;
 			update_newidle_cost(sd, domain_cost);
+			sd->last_balance = jiffies;
+			update_next_balance(sd, &next_balance);
 
 			curr_cost += domain_cost;
 			t0 = t1;
-- 
2.42.0.655.g421f12c284-goog


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/3] sched/nohz: Update nohz.next_balance directly without IPIs (v2)
  2023-10-20  1:40 [PATCH 1/3] sched/nohz: Update nohz.next_balance directly without IPIs (v2) Joel Fernandes (Google)
  2023-10-20  1:40 ` [PATCH 2/3] sched/nohz: Update comments about NEWILB_KICK Joel Fernandes (Google)
  2023-10-20  1:40 ` [PATCH 3/3] sched: Update ->next_balance correctly during newidle balance Joel Fernandes (Google)
@ 2023-10-20  4:17 ` Joel Fernandes
  2 siblings, 0 replies; 19+ messages in thread
From: Joel Fernandes @ 2023-10-20  4:17 UTC (permalink / raw)
  To: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider
  Cc: Suleiman Souhlal, Frederic Weisbecker, Paul E . McKenney, Vineeth Pillai



> On Oct 19, 2023, at 9:40 PM, Joel Fernandes (Google) <joel@joelfernandes.org> wrote:
> 
> Whenever a CPU stops its tick, it now requires another idle CPU to handle the
> balancing for it because it can't perform its own periodic load balancing.
> This means it might need to update 'nohz.next_balance' to 'rq->next_balance' if
> the upcoming nohz-idle load balancing is too distant in the future. This update
> process is done by triggering an ILB, as the general ILB handler
> (_nohz_idle_balance) that manages regular nohz balancing also refreshes
> 'nohz.next_balance' by looking at the 'rq->next_balance' of all other idle CPUs
> and selecting the smallest value.

Unfortunately this patch still has a subtle issue where we can miss a nohz balance
when the tick is stopped.

Sorry, we realized this only after sending. Thanks for bearing with us. We have
a fix in the works.

Feel free to review the other 2 patches though. We will keep working on it, and thanks!

- Joel & Vineeth



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] sched/nohz: Update comments about NEWILB_KICK
  2023-10-20  1:40 ` [PATCH 2/3] sched/nohz: Update comments about NEWILB_KICK Joel Fernandes (Google)
@ 2023-10-20  7:51   ` Ingo Molnar
  2023-10-20  8:02   ` [tip: sched/core] " tip-bot2 for Joel Fernandes (Google)
  1 sibling, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2023-10-20  7:51 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Suleiman Souhlal, Frederic Weisbecker, Paul E . McKenney


* Joel Fernandes (Google) <joel@joelfernandes.org> wrote:

> How ILB is triggered without IPIs is cryptic. Out of mercy for future
> code readers, document it in code comments.
> 
> The comments are derived from a discussion with Vincent in a past
> review.
> 
> Cc: Suleiman Souhlal <suleiman@google.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Frederic Weisbecker <frederic@kernel.org>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/sched/fair.c | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 965c30fbbe5c..8e276d12c3cb 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11959,8 +11959,19 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  }
>  
>  /*
> - * Check if we need to run the ILB for updating blocked load before entering
> - * idle state.
> + * Check if we need to directly run the ILB for updating blocked load before
> + * entering idle state. Here we run ILB directly without issuing IPIs.
> + *
> + * Note that when this function is called, the tick may not yet be stopped on
> + * this CPU yet. nohz.idle_cpus_mask is updated only when tick is stopped and
> + * cleared on the next busy tick. In other words, nohz.idle_cpus_mask updates
> + * don't align with CPUs enter/exit idle to avoid bottlenecks due to high idle
> + * entry/exit rate (usec). So it is possible that _nohz_idle_balance() is
> + * called from this function on (this) CPU that's not yet in the mask. That's
> + * OK because the goal of nohz_run_idle_balance() is to run ILB only for
> + * updating the blocked load of already idle CPUs without waking up one of
> + * those idle CPUs and outside the preempt disable / irq off phase of the local
> + * cpu about to enter idle, because it can take a long time.

Much appreciated! Feel free to update comments for the entire relevant code
base; a lot of it has become cryptic and under-documented at best as
complexity increased ...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] sched: Update ->next_balance correctly during newidle balance
  2023-10-20  1:40 ` [PATCH 3/3] sched: Update ->next_balance correctly during newidle balance Joel Fernandes (Google)
@ 2023-10-20  7:53   ` Ingo Molnar
  2023-10-20 13:48     ` Vincent Guittot
  2023-10-20  8:02   ` [tip: sched/core] sched/fair: " tip-bot2 for Vineeth Pillai (Google)
  2023-10-20 13:40   ` [PATCH 3/3] sched: " Vincent Guittot
  2 siblings, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2023-10-20  7:53 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Vineeth Pillai (Google),
	Suleiman Souhlal, Frederic Weisbecker, Paul E . McKenney


* Joel Fernandes (Google) <joel@joelfernandes.org> wrote:

> From: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
> 
> When newidle balancing triggers, we see that it constantly clobbers 
> rq->next_balance even when there is no newidle balance happening due to 
> the cost estimates.  Due to this, we see that periodic load balance 
> (rebalance_domains) may trigger way more often when the CPU is going in 
> and out of idle at a high rate but is no really idle. Repeatedly 
> triggering load balance there is a bad idea as it is a heavy operation. 
> It also causes increases in softirq.
> 
> Another issue is ->last_balance is not updated after newidle balance 
> causing mistakes in the ->next_balance calculations.
> 
> Fix by updating last_balance when a newidle load balance actually happens 
> and then updating next_balance. This is also how it is done in other load 
> balance paths.
> 
> Testing shows a significant drop in softirqs when running:
> cyclictest -i 100 -d 100 --latency=1000 -D 5 -t -m  -q
> 
> Goes from ~6k to ~800.
> 
> Cc: Suleiman Souhlal <suleiman@google.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Frederic Weisbecker <frederic@kernel.org>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
> Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/sched/fair.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8e276d12c3cb..b147ad09126a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12076,11 +12076,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  
>  	if (!READ_ONCE(this_rq->rd->overload) ||
>  	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
> -
> -		if (sd)
> -			update_next_balance(sd, &next_balance);
>  		rcu_read_unlock();
> -
>  		goto out;
>  	}
>  	rcu_read_unlock();
> @@ -12095,8 +12091,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  		int continue_balancing = 1;
>  		u64 domain_cost;
>  
> -		update_next_balance(sd, &next_balance);
> -
>  		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
>  			break;
>  
> @@ -12109,6 +12103,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  			t1 = sched_clock_cpu(this_cpu);
>  			domain_cost = t1 - t0;
>  			update_newidle_cost(sd, domain_cost);
> +			sd->last_balance = jiffies;
> +			update_next_balance(sd, &next_balance);
>  
>  			curr_cost += domain_cost;
>  			t0 = t1;

Okay, I'm applying patches #2 and #3, without #1: it should be safe
out of order, but let me know if I missed something ...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [tip: sched/core] sched/fair: Update ->next_balance correctly during newidle balance
  2023-10-20  1:40 ` [PATCH 3/3] sched: Update ->next_balance correctly during newidle balance Joel Fernandes (Google)
  2023-10-20  7:53   ` Ingo Molnar
@ 2023-10-20  8:02   ` tip-bot2 for Vineeth Pillai (Google)
  2023-10-20 13:40   ` [PATCH 3/3] sched: " Vincent Guittot
  2 siblings, 0 replies; 19+ messages in thread
From: tip-bot2 for Vineeth Pillai (Google) @ 2023-10-20  8:02 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Vineeth Pillai (Google), Joel Fernandes (Google),
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     1c0482707c42960ec46b88aadd6bffca8685db11
Gitweb:        https://git.kernel.org/tip/1c0482707c42960ec46b88aadd6bffca8685db11
Author:        Vineeth Pillai (Google) <vineeth@bitbyteword.org>
AuthorDate:    Fri, 20 Oct 2023 01:40:28 
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Fri, 20 Oct 2023 09:56:21 +02:00

sched/fair: Update ->next_balance correctly during newidle balance

When newidle balancing triggers, we see that it constantly clobbers
rq->next_balance even when there is no newidle balance happening due to
the cost estimates. Due to this, we see that periodic load balance
(rebalance_domains) may trigger way more often when the CPU is going in
and out of idle at a high rate but is not really idle. Repeatedly
triggering load balance there is a bad idea as it is a heavy operation.
It also causes an increase in softirqs.

Another issue is that ->last_balance is not updated after a newidle
balance, causing mistakes in the ->next_balance calculations.

Fix by updating last_balance when a newidle load balance actually
happens and then updating next_balance. This is also how it is done in
other load balance paths.

Testing shows a significant drop in softirqs when running:

  $ cyclictest -i 100 -d 100 --latency=1000 -D 5 -t -m  -q

... goes from ~6,000 to ~800.

Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20231020014031.919742-3-joel@joelfernandes.org
---
 kernel/sched/fair.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8c486ff..393d0dc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12122,11 +12122,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 	if (!READ_ONCE(this_rq->rd->overload) ||
 	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
-
-		if (sd)
-			update_next_balance(sd, &next_balance);
 		rcu_read_unlock();
-
 		goto out;
 	}
 	rcu_read_unlock();
@@ -12141,8 +12137,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 		int continue_balancing = 1;
 		u64 domain_cost;
 
-		update_next_balance(sd, &next_balance);
-
 		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
 			break;
 
@@ -12155,6 +12149,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 			t1 = sched_clock_cpu(this_cpu);
 			domain_cost = t1 - t0;
 			update_newidle_cost(sd, domain_cost);
+			sd->last_balance = jiffies;
+			update_next_balance(sd, &next_balance);
 
 			curr_cost += domain_cost;
 			t0 = t1;

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [tip: sched/core] sched/nohz: Update comments about NEWILB_KICK
  2023-10-20  1:40 ` [PATCH 2/3] sched/nohz: Update comments about NEWILB_KICK Joel Fernandes (Google)
  2023-10-20  7:51   ` Ingo Molnar
@ 2023-10-20  8:02   ` tip-bot2 for Joel Fernandes (Google)
  1 sibling, 0 replies; 19+ messages in thread
From: tip-bot2 for Joel Fernandes (Google) @ 2023-10-20  8:02 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Joel Fernandes (Google), Ingo Molnar, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     fb064e5ae1657595c090ebbc5b15787a3ef603e9
Gitweb:        https://git.kernel.org/tip/fb064e5ae1657595c090ebbc5b15787a3ef603e9
Author:        Joel Fernandes (Google) <joel@joelfernandes.org>
AuthorDate:    Fri, 20 Oct 2023 01:40:27 
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Fri, 20 Oct 2023 09:56:21 +02:00

sched/nohz: Update comments about NEWILB_KICK

How ILB is triggered without IPIs is cryptic. Out of mercy for future
code readers, document it in code comments.

The comments are derived from a discussion with Vincent in a past
review.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231020014031.919742-2-joel@joelfernandes.org
---
 kernel/sched/fair.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9ae2208..8c486ff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12005,8 +12005,19 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 }
 
 /*
- * Check if we need to run the ILB for updating blocked load before entering
- * idle state.
+ * Check if we need to directly run the ILB for updating blocked load before
+ * entering idle state. Here we run ILB directly without issuing IPIs.
+ *
+ * Note that when this function is called, the tick may not yet be stopped on
+ * this CPU. nohz.idle_cpus_mask is updated only when the tick is stopped and
+ * cleared on the next busy tick. In other words, nohz.idle_cpus_mask updates
+ * don't align with CPUs entering/exiting idle, to avoid bottlenecks due to high idle
+ * entry/exit rate (usec). So it is possible that _nohz_idle_balance() is
+ * called from this function on (this) CPU that's not yet in the mask. That's
+ * OK because the goal of nohz_run_idle_balance() is to run ILB only for
+ * updating the blocked load of already idle CPUs without waking up one of
+ * those idle CPUs and outside the preempt disable / irq off phase of the local
+ * cpu about to enter idle, because it can take a long time.
  */
 void nohz_run_idle_balance(int cpu)
 {

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] sched: Update ->next_balance correctly during newidle balance
  2023-10-20  1:40 ` [PATCH 3/3] sched: Update ->next_balance correctly during newidle balance Joel Fernandes (Google)
  2023-10-20  7:53   ` Ingo Molnar
  2023-10-20  8:02   ` [tip: sched/core] sched/fair: " tip-bot2 for Vineeth Pillai (Google)
@ 2023-10-20 13:40   ` Vincent Guittot
  2023-10-20 13:56     ` Ingo Molnar
  2023-10-22  0:28     ` Joel Fernandes
  2 siblings, 2 replies; 19+ messages in thread
From: Vincent Guittot @ 2023-10-20 13:40 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider,
	Vineeth Pillai (Google),
	Suleiman Souhlal, Frederic Weisbecker, Paul E . McKenney

On Fri, 20 Oct 2023 at 03:40, Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
>
> From: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
>
> When newidle balancing triggers, we see that it constantly clobbers
> rq->next_balance even when there is no newidle balance happening due to
> the cost estimates.  Due to this, we see that periodic load balance
> (rebalance_domains) may trigger way more often when the CPU is going in
> and out of idle at a high rate but is no really idle. Repeatedly
> triggering load balance there is a bad idea as it is a heavy operation.
> It also causes increases in softirq.

we have 2 balance intervals:
- one when idle based on the sd->balance_interval = sd_weight
- one when busy which increases the period by multiplying it with
busy_factor = 16

When becoming idle, rq->next_balance may have been set using the
16*sd_weight period, so load_balance can wait for a long time before
running an idle load balance for this cpu.
As a typical example, instead of waiting at most 8ms, we will wait
128ms before we try to pull a task onto the idle CPU.

That's the reason for updating rq->next_balance in newidle_balance()
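
Roughly, the effective interval comes out of those two knobs like this (a
simplified sketch, not the exact get_sd_balance_interval() code):

  #include <stdbool.h>

  /* Simplified model: idle interval ~ sd_weight (ms), stretched when busy. */
  static unsigned long balance_interval_ms(unsigned int sd_weight, bool cpu_busy)
  {
          unsigned long interval = sd_weight;   /* e.g. 8 for an 8-CPU domain */

          if (cpu_busy)
                  interval *= 16;               /* busy_factor */

          return interval;                      /* 8ms when idle vs 128ms when busy */
  }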

>
> Another issue is ->last_balance is not updated after newidle balance
> causing mistakes in the ->next_balance calculations.

A newly idle load balance is not the same as an idle load balance. It's a
light load balance trying to pull one task, and you can't really
equate it with the normal load balance.

>
> Fix by updating last_balance when a newidle load balance actually
> happens and then updating next_balance. This is also how it is done in
> other load balance paths.
>
> Testing shows a significant drop in softirqs when running:
> cyclictest -i 100 -d 100 --latency=1000 -D 5 -t -m  -q
>
> Goes from ~6k to ~800.

Even if your figures look interesting, your patch adds a regression in
load balancing and fairness.

We can probably improve the current behavior to decrease the number of
ILBs, but your proposal is not the right solution IMO


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] sched: Update ->next_balance correctly during newidle balance
  2023-10-20  7:53   ` Ingo Molnar
@ 2023-10-20 13:48     ` Vincent Guittot
  2023-10-21  6:50       ` Ingo Molnar
  0 siblings, 1 reply; 19+ messages in thread
From: Vincent Guittot @ 2023-10-20 13:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Joel Fernandes (Google),
	linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider,
	Vineeth Pillai (Google),
	Suleiman Souhlal, Frederic Weisbecker, Paul E . McKenney

Hi Ingo,

On Fri, 20 Oct 2023 at 09:53, Ingo Molnar <mingo@kernel.org> wrote:
>
>

...

> >                       curr_cost += domain_cost;
> >                       t0 = t1;
>
> Okay, I'm applying patches #2 and #3, without #1: it should be safe
> out of order, but let me know if I missed something ...

Could you hold off on patch 3? As I replied to Joel, the patch has a
regression in the update of rq->next_balance.

>
> Thanks,
>
>         Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] sched: Update ->next_balance correctly during newidle balance
  2023-10-20 13:40   ` [PATCH 3/3] sched: " Vincent Guittot
@ 2023-10-20 13:56     ` Ingo Molnar
  2023-10-20 15:50       ` Joel Fernandes
  2023-10-22  0:28     ` Joel Fernandes
  1 sibling, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2023-10-20 13:56 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Joel Fernandes (Google),
	linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider,
	Vineeth Pillai (Google),
	Suleiman Souhlal, Frederic Weisbecker, Paul E . McKenney


* Vincent Guittot <vincent.guittot@linaro.org> wrote:

> Even if your figures look interesting, your patch adds regression in
> the load balance and the fairness.

Indeed - I've removed it from tip:sched/core for the time being.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] sched: Update ->next_balance correctly during newidle balance
  2023-10-20 13:56     ` Ingo Molnar
@ 2023-10-20 15:50       ` Joel Fernandes
  0 siblings, 0 replies; 19+ messages in thread
From: Joel Fernandes @ 2023-10-20 15:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Vincent Guittot, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Vineeth Pillai (Google),
	Suleiman Souhlal, Frederic Weisbecker, Paul E . McKenney

On Fri, Oct 20, 2023 at 9:57 AM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> > Even if your figures look interesting, your patch adds regression in
> > the load balance and the fairness.
>
> Indeed - I've removed it from tip:sched/core for the time being.

Sorry, we should have marked it as RFC. We still feel there are issues
in the newidle balance code and that the fix is in the right direction;
we'll continue discussing with Vincent on the other thread and try to
come up with a proper solution.

Thank you Ingo and Vincent!

 - Joel & Vineeth

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] sched: Update ->next_balance correctly during newidle balance
  2023-10-20 13:48     ` Vincent Guittot
@ 2023-10-21  6:50       ` Ingo Molnar
  0 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2023-10-21  6:50 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Joel Fernandes (Google),
	linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider,
	Vineeth Pillai (Google),
	Suleiman Souhlal, Frederic Weisbecker, Paul E . McKenney


* Vincent Guittot <vincent.guittot@linaro.org> wrote:

> Hi Ingo,
> 
> On Fri, 20 Oct 2023 at 09:53, Ingo Molnar <mingo@kernel.org> wrote:
> >
> >
> 
> ...
> 
> > >                       curr_cost += domain_cost;
> > >                       t0 = t1;
> >
> > Okay, I'm applying patches #2 and #3, without #1: it should be safe
> > out of order, but let me know if I missed something ...
> 
> Could you hold on for patch 3 ? As replied to Joel, the patch had
> regression in the update of rq->next_balance

Yeah, patch #3 is gone already from tip:sched/core.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] sched: Update ->next_balance correctly during newidle balance
  2023-10-20 13:40   ` [PATCH 3/3] sched: " Vincent Guittot
  2023-10-20 13:56     ` Ingo Molnar
@ 2023-10-22  0:28     ` Joel Fernandes
  2023-10-26 14:23       ` Vincent Guittot
  1 sibling, 1 reply; 19+ messages in thread
From: Joel Fernandes @ 2023-10-22  0:28 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider,
	Vineeth Pillai (Google),
	Suleiman Souhlal, Frederic Weisbecker, Paul E . McKenney

On Fri, Oct 20, 2023 at 03:40:14PM +0200, Vincent Guittot wrote:
> On Fri, 20 Oct 2023 at 03:40, Joel Fernandes (Google)
> <joel@joelfernandes.org> wrote:
> >
> > From: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
> >
> > When newidle balancing triggers, we see that it constantly clobbers
> > rq->next_balance even when there is no newidle balance happening due to
> > the cost estimates.  Due to this, we see that periodic load balance
> > (rebalance_domains) may trigger way more often when the CPU is going in
> > and out of idle at a high rate but is no really idle. Repeatedly
> > triggering load balance there is a bad idea as it is a heavy operation.
> > It also causes increases in softirq.
> 
> we have 2 balance intervals:
> - one when idle based on the sd->balance_interval = sd_weight
> - one when busy which increases the period by multiplying it with
> busy_factor = 16

On my production system I see load balance triggering every 4 jiffies! In a
QEMU system (which I use for the traces below), I see it every 8 jiffies. I'll go
look into that more as well; it could be something going on in
get_sd_balance_interval().

> When becoming idle, the rq->next_balance can have been set using the
> 16*sd_weight period so load_balance can wait for a long time before
> running idle load balance for this cpu.

Got it.

> As a typical example, instead of waiting at most 8ms, we will wait
> 128ms before we try to pull a task on the idle CPU.
> 
> That's the reason for updating rq->next_balance in newidle_balance()

Got it, makes sense. But I still feel the mechanism is too aggressive, see
below.

> > Another issue is ->last_balance is not updated after newidle balance
> > causing mistakes in the ->next_balance calculations.
> 
> newly idle load balance is not equal to idle load balance. It's a
> light load balance trying to pull one  task and you can't really
> consider it to the normal load balance

True. However, the point is that it is coupled with the other load balance
mechanism and the two are not independent. As you can see below, modifying
rq->next_balance in newidle also causes the periodic balance to happen more
aggressively if there is a high rate of transitions from busy to idle and
vice versa.

> > Fix by updating last_balance when a newidle load balance actually
> > happens and then updating next_balance. This is also how it is done in
> > other load balance paths.
> >
> > Testing shows a significant drop in softirqs when running:
> > cyclictest -i 100 -d 100 --latency=1000 -D 5 -t -m  -q
> >
> > Goes from ~6k to ~800.
> 
> Even if your figures look interesting, your patch adds regression in
> the load balance and the fairness.

Yes I see that now. However it does illustrate the problem IMO.

> We can probably do improve the current behavior for decreasing number
> of ILB but your proposal is not the right solution IMO

One of the problems is that if you have a task that goes idle a lot, the
newidle_balance mechanism triggers the periodic balance every jiffy (once
per millisecond at HZ=1000).
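
To make the traces below easier to read, here is a toy model of the per-tick
check they show (illustrative names, not the kernel's trigger_load_balance()):

  #include <stdbool.h>
  #include <stdio.h>

  /* Every scheduler tick compares jiffies against rq->next_balance and
   * raises the balance softirq once the deadline has passed. */
  struct toy_rq { unsigned long next_balance; };

  static bool time_after_eq(unsigned long a, unsigned long b)
  {
          return (long)(a - b) >= 0;
  }

  static void tick(struct toy_rq *rq, unsigned long jiffies)
  {
          if (time_after_eq(jiffies, rq->next_balance))
                  printf("jiffy %lu: raise SCHED_SOFTIRQ (rebalance_domains)\n", jiffies);
  }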

Following are some traces I collected.

cyclictest-123   [003]   522.650574  newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,157)
   <idle>-0      [003]   522.651443  trigger_load_balance: time_after_eq(jiffies=221,158, rq->next_balance=221,145) = 1
   <idle>-0      [003]   522.651461  rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,158)
cyclictest-123   [003]   522.651494  newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,158)
   <idle>-0      [003]   522.652522  trigger_load_balance: time_after_eq(jiffies=221,159, rq->next_balance=221,145) = 1
   <idle>-0      [003]   522.652560  rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,159)
cyclictest-124   [003]   522.652586  newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,159)
   <idle>-0      [003]   522.654492  trigger_load_balance: time_after_eq(jiffies=221,161, rq->next_balance=221,145) = 1
   <idle>-0      [003]   522.654534  rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,161)

Triggering it so aggressively is likely useless; it increases the softirq
count and hurts power when no real work is done, IMHO. And it probably makes
things worse for power on ARM, where you have uclamp stuff happening in the
load balance paths, which was quite heavy when I last traced it.

Further, we have observed in our tracing on a real device that the update of
rq->next_balance from the newidle path is itself buggy... we observed that
because newidle balance may not update rq->last_balance, it is possible that
rq->next_balance, when updated by update_next_balance(), will be set to a
value that is in the past and will be stuck there for a long time! Perhaps
we should investigate more and fix that bug separately. Vineeth could provide
more details on the "getting stuck in the past" behavior as well.
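
To illustrate the "stuck in the past" case with made-up numbers (a toy
sketch, not the kernel code):

  #include <stdio.h>

  int main(void)
  {
          unsigned long last_balance = 1000;  /* never refreshed by the newidle path */
          unsigned long interval     = 8;     /* jiffies */
          unsigned long jiffies      = 5000;  /* current time */

          unsigned long next = last_balance + interval;  /* 1008: already expired */

          /* time_after_eq(jiffies, next) stays true on every subsequent tick,
           * so the balance softirq keeps getting raised. */
          if ((long)(jiffies - next) >= 0)
                  printf("deadline %lu is in the past\n", next);
          return 0;
  }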

I hope these patches highlight some of the issues we found and can trigger
improvements by us or others. From our side, we'll continue to work on it, and
thank you for explaining some of the load balance code!

thanks,

 - Joel & Vineeth



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] sched: Update ->next_balance correctly during newidle balance
  2023-10-22  0:28     ` Joel Fernandes
@ 2023-10-26 14:23       ` Vincent Guittot
  2023-11-09 10:02         ` Joel Fernandes
  0 siblings, 1 reply; 19+ messages in thread
From: Vincent Guittot @ 2023-10-26 14:23 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider,
	Vineeth Pillai (Google),
	Suleiman Souhlal, Frederic Weisbecker, Paul E . McKenney

On Sun, 22 Oct 2023 at 02:28, Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Fri, Oct 20, 2023 at 03:40:14PM +0200, Vincent Guittot wrote:
> > On Fri, 20 Oct 2023 at 03:40, Joel Fernandes (Google)
> > <joel@joelfernandes.org> wrote:
> > >
> > > From: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
> > >
> > > When newidle balancing triggers, we see that it constantly clobbers
> > > rq->next_balance even when there is no newidle balance happening due to
> > > the cost estimates.  Due to this, we see that periodic load balance
> > > (rebalance_domains) may trigger way more often when the CPU is going in
> > > and out of idle at a high rate but is no really idle. Repeatedly
> > > triggering load balance there is a bad idea as it is a heavy operation.
> > > It also causes increases in softirq.
> >
> > we have 2 balance intervals:
> > - one when idle based on the sd->balance_interval = sd_weight
> > - one when busy which increases the period by multiplying it with
> > busy_factor = 16
>
> On my production system I see load balance triggering every 4 jiffies! In a

What kind of system do you have? sd->balance_interval is in ms.

> Qemu system (which I use for traces below), I see every 8 jiffies. I'll go
> look into that more as well, it could be something going on in
> get_sd_balance_interval().
>
> > When becoming idle, the rq->next_balance can have been set using the
> > 16*sd_weight period so load_balance can wait for a long time before
> > running idle load balance for this cpu.
>
> Got it.
>
> > As a typical example, instead of waiting at most 8ms, we will wait
> > 128ms before we try to pull a task on the idle CPU.
> >
> > That's the reason for updating rq->next_balance in newidle_balance()
>
> Got it, makes sense. But I still feel the mechanism is too aggressive, see
> below.
>
> > > Another issue is ->last_balance is not updated after newidle balance
> > > causing mistakes in the ->next_balance calculations.
> >
> > newly idle load balance is not equal to idle load balance. It's a
> > light load balance trying to pull one task and you can't really
> > compare it to the normal load balance
>
> True. However the point is that it is coupled with the other load balance
> mechanism and the two are not independent. As you can see below, modifying
> rq->next_balance in newidle also causes the periodic balance to happen more
> aggressively as well if there is a high transition from busy to idle and
> viceversa.

As mentioned, rq->next_balance is updated whenever a cpu enters idle
(i.e. in newidle_balance()), but that's not related to doing a newly
idle load balance. But your problem is more related to the fact that
nohz.needs_update is set when stopping the cpu tick in order to update
nohz.next_balance, which is then used to kick a "real" idle load
balance.
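
To make the tick-side flow concrete, the periodic and nohz paths meet in
trigger_load_balance(). The sketch below is paraphrased from memory of the
mainline code around this series, so treat it as an approximation rather than
an exact copy:

void trigger_load_balance(struct rq *rq)
{
	/* Don't rebalance while attached to NULL domain or CPU is inactive */
	if (unlikely(on_null_domain(rq) || !cpu_active(cpu_of(rq))))
		return;

	/* Periodic path: raise SCHED_SOFTIRQ once rq->next_balance is due */
	if (time_after_eq(jiffies, rq->next_balance))
		raise_softirq(SCHED_SOFTIRQ);

	/* nohz path: possibly kick an idle CPU to run the idle load balance */
	nohz_balancer_kick(rq);
}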

>
> > > Fix by updating last_balance when a newidle load balance actually
> > > happens and then updating next_balance. This is also how it is done in
> > > other load balance paths.
> > >
> > > Testing shows a significant drop in softirqs when running:
> > > cyclictest -i 100 -d 100 --latency=1000 -D 5 -t -m  -q
> > >
> > > Goes from ~6k to ~800.
> >
> > Even if your figures look interesting, your patch adds a regression in
> > the load balancing and the fairness.
>
> Yes I see that now. However it does illustrate the problem IMO.
>
> > We can probably improve the current behavior to decrease the number
> > of ILBs but your proposal is not the right solution IMO
>
> One of the problems is that if you have a task that goes idle a lot, then the
> newidle_balance mechanism triggers the periodic balance every jiffy (once
> per millisecond on HZ=1000).

every msec seems quite a lot.

>
> Following are some traces I collected.
>
> cyclictest-123   [003]   522.650574  newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,157)
>    <idle>-0      [003]   522.651443  trigger_load_balance: time_after_eq(jiffies=221,158, rq->next_balance=221,145) = 1
>    <idle>-0      [003]   522.651461  rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,158)
> cyclictest-123   [003]   522.651494  newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,158)
>    <idle>-0      [003]   522.652522  trigger_load_balance: time_after_eq(jiffies=221,159, rq->next_balance=221,145) = 1
>    <idle>-0      [003]   522.652560  rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,159)
> cyclictest-124   [003]   522.652586  newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,159)
>    <idle>-0      [003]   522.654492  trigger_load_balance: time_after_eq(jiffies=221,161, rq->next_balance=221,145) = 1
>    <idle>-0      [003]   522.654534  rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,161)

Ok, so IIUC your trace above, this happens because the tick is not
stopped after entering idle, so it continues to fire and triggers a load
balance without checking whether there is a need, like what is done for
nohz mode.

>
> Triggering it so aggressively is likely useless; it increases the softirq
> count and hurts power when no real work is done IMHO. And it probably makes

In the case above, the tick has already fired because it was not
stopped. I mean that it's not the ilb which triggers this and wakes up
the cpu; it just takes advantage of the tick, so most of the pain has
already happened (waking up the cpu, which was probably in a shallow
c-state, to let its tick fire).

> things worse for power on ARM, where you have uclamp stuff happening in the
> load balance paths, which is quite heavy when I last traced that...
>
> Further, we have observed in our tracing on a real device that the update of
> rq->next_balance from the newidle path is itself buggy... we observed that
> because newidle balance may not update rq->last_balance, it is possible that
> rq->next_balance when updated by update_next_balance() will be updated to a
> value that is in the past and it will be stuck there for a long time! Perhaps
> we should investigate more and fix that bug separately. Vineeth could provide
> more details on the "getting stuck in the past" behavior as well.

sd->last_balance reflects the last time an idle/busy load_balance happened
(newly idle is out of scope for the points that I mentioned
previously). So if no load balance happens for a while, the
rq->next_balance can be in the past, but I don't see a problem here. It
just means that a load balance hasn't happened for a while. It can
even move backward if it has been set when busy but the cpu is now
idle.
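
For reference, the helper being discussed computes its estimate directly from
sd->last_balance. This is a rough rendering of the mainline version (from
memory, so the comment wording may differ):

static inline void
update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
{
	unsigned long interval, next;

	/* used by idle balance, so cpu_busy = 0 */
	interval = get_sd_balance_interval(sd, 0);
	next = sd->last_balance + interval;

	if (time_after(*next_balance, next))
		*next_balance = next;
}

Since next is derived from sd->last_balance, a next_balance in the past simply
reflects a last_balance that hasn't advanced recently.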

>
> I hope these patches highlight some of the issues we found and can trigger
> improvements by us or others. From our side we'll continue to work on it and
> thank you for explaining some of the load balance code!
>
> thanks,
>
>  - Joel & Vineeth
>
>
> >
> > >
> > > Cc: Suleiman Souhlal <suleiman@google.com>
> > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > Cc: Frederic Weisbecker <frederic@kernel.org>
> > > Cc: Paul E. McKenney <paulmck@kernel.org>
> > > Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
> > > Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > ---
> > >  kernel/sched/fair.c | 8 ++------
> > >  1 file changed, 2 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 8e276d12c3cb..b147ad09126a 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -12076,11 +12076,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > >
> > >         if (!READ_ONCE(this_rq->rd->overload) ||
> > >             (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
> > > -
> > > -               if (sd)
> > > -                       update_next_balance(sd, &next_balance);
> > >                 rcu_read_unlock();
> > > -
> > >                 goto out;
> > >         }
> > >         rcu_read_unlock();
> > > @@ -12095,8 +12091,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > >                 int continue_balancing = 1;
> > >                 u64 domain_cost;
> > >
> > > -               update_next_balance(sd, &next_balance);
> > > -
> > >                 if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
> > >                         break;
> > >
> > > @@ -12109,6 +12103,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > >                         t1 = sched_clock_cpu(this_cpu);
> > >                         domain_cost = t1 - t0;
> > >                         update_newidle_cost(sd, domain_cost);
> > > +                       sd->last_balance = jiffies;
> > > +                       update_next_balance(sd, &next_balance);
> > >
> > >                         curr_cost += domain_cost;
> > >                         t0 = t1;
> > > --
> > > 2.42.0.655.g421f12c284-goog
> > >

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] sched: Update ->next_balance correctly during newidle balance
  2023-10-26 14:23       ` Vincent Guittot
@ 2023-11-09 10:02         ` Joel Fernandes
  2023-11-09 12:31           ` Joel Fernandes
  2023-11-14 15:43           ` Vincent Guittot
  0 siblings, 2 replies; 19+ messages in thread
From: Joel Fernandes @ 2023-11-09 10:02 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider,
	Vineeth Pillai (Google),
	Suleiman Souhlal, Frederic Weisbecker, Paul E . McKenney

Hi Vincent,

Sorry for the late reply, I was in Tokyo all these days and was waiting to get
to writing a proper reply. See my replies below:

On Thu, Oct 26, 2023 at 04:23:35PM +0200, Vincent Guittot wrote:
> On Sun, 22 Oct 2023 at 02:28, Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > On Fri, Oct 20, 2023 at 03:40:14PM +0200, Vincent Guittot wrote:
> > > On Fri, 20 Oct 2023 at 03:40, Joel Fernandes (Google)
> > > <joel@joelfernandes.org> wrote:
> > > >
> > > > From: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
> > > >
> > > > When newidle balancing triggers, we see that it constantly clobbers
> > > > rq->next_balance even when there is no newidle balance happening due to
> > > > the cost estimates.  Due to this, we see that periodic load balance
> > > > (rebalance_domains) may trigger way more often when the CPU is going in
> > > > and out of idle at a high rate but is no really idle. Repeatedly
> > > > triggering load balance there is a bad idea as it is a heavy operation.
> > > > It also causes increases in softirq.
> > >
> > > we have 2 balance intervals:
> > > - one when idle based on the sd->balance_interval = sd_weight
> > > - one when busy which increases the period by multiplying it with
> > > busy_factor = 16
> >
> > On my production system I see load balance triggering every 4 jiffies! In a
> 
> Which kind of system do you have? sd->balance_interval is in ms

Yes, sorry, I meant it triggers every jiffy, which is extreme sometimes. It
is an ADL SoC (12th gen Intel, 4 P cores 8 E cores); get_sd_balance_interval()
returns 4 jiffies there. On my Qemu system, I see 8 jiffies.
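
For context, that interval is computed roughly as follows in mainline (a
paraphrased sketch; the clamping details and comments may vary between
versions):

static inline unsigned long
get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
{
	unsigned long interval = sd->balance_interval;	/* ms, ~= sd weight */

	if (cpu_busy)
		interval *= sd->busy_factor;

	/* scale ms to jiffies */
	interval = msecs_to_jiffies(interval);

	/* keep busy periods from being exact multiples of idle ones */
	if (cpu_busy)
		interval -= 1;

	return clamp(interval, 1UL, max_load_balance_interval);
}

So at HZ=1000 a 4 ms balance_interval maps to the 4 jiffies above, and the
busy interval would be roughly 16x that.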

[...]
> > > > Another issue is ->last_balance is not updated after newidle balance
> > > > causing mistakes in the ->next_balance calculations.
> > >
> > > newly idle load balance is not equal to idle load balance. It's a
> > > light load balance trying to pull one  task and you can't really
> > > consider it to the normal load balance
> >
> > True. However the point is that it is coupled with the other load balance
> > mechanism and the two are not independent. As you can see below, modifying
> > rq->next_balance in newidle also causes the periodic balance to happen more
> > aggressively as well if there is a high transition from busy to idle and
> > viceversa.
> 
> As mentioned, rq->next_balance is updated whenever cpu enters idle
> (i.e. in newidle_balance() but it's not related with doing a newly
> idle load balance.

Yes, I understand that. But my point was that the update of rq->next_balance
from the newidle path is itself buggy and interferes with the load balance
happening from the tick (trigger_load_balance -> run_rebalance_domains).

> But your problem is more related with the fact that
> nohz.needs_update is set when stopping cpu timer in order to update
> nohz.next_balance which is then used to kick a "real" idle load
> balance

Well, independent of the nohz idle balance, I think we need to fix this issue
as mentioned above. This affects the periodic one as mentioned in the commit
log.

Here is another trace I collected, this time also dumping the 'interval'. There
is a tug of war happening between the newidle balance and the periodic balance.

The periodic one sets rq->next_balance for cpu 0 to 760,143 and then the
newidle path comes in and pulls it back 119 jiffies to 760,024. This is
actually in the past because jiffies is currently 760,045!!

This triggers the periodic balance again, which sets rq->next_balance back to
760,143.

Rinse and repeat. The end result is that you have a periodic balance every
jiffy. With this patch the issue goes away, but we could fix it differently:
as you mentioned, we do need to pull next_balance back when going newidle, but
perhaps not so aggressively. How about something like the untested diff I
enclosed at the end of this email?

<idle>-0   [000]  13.081781: rebalance_domains: rq[cpu=0]->next_balance: 760,024 -> 760,143 (jiffies=760,045)
cyclictest-120   [000]  13.081806: update_next_balance: sd->next_balance: 761,045 -> 760,024 (jiffies=760,045, interval=8)
cyclictest-120   [000]  13.081807: newidle_balance: this_rq[cpu=0]->next_balance: 760,143 -> 760,024 (jiffies=760,045)
cyclictest-120   [000]  13.082130: update_next_balance: sd->next_balance: 761,045 -> 760,024 (jiffies=760,045, interval=8)
cyclictest-120   [000]  13.082338: update_next_balance: sd->next_balance: 761,045 -> 760,024 (jiffies=760,045, interval=8)
cyclictest-120   [000]  13.082636: update_next_balance: sd->next_balance: 761,045 -> 760,024 (jiffies=760,045, interval=8)
<idle>-0   [000]  13.082823: trigger_load_balance: time_after_eq(jiffies=760,046, rq->next_balance=760,024) = 1
<idle>-0   [000]  13.082823: softirq_raise: vec=7 [action=SCHED]
<idle>-0   [000]  13.082871: rebalance_domains: rq[cpu=0]->next_balance: 760,024 -> 760,143 (jiffies=760,046)
trace-cmd-114   [000]  13.082876: update_next_balance: sd->next_balance: 761,046 -> 760,024 (jiffies=760,046, interval=8)
trace-cmd-114   [000]  13.082876: newidle_balance: this_rq[cpu=0]->next_balance: 760,143 -> 760,024 (jiffies=760,046)
cyclictest-120   [000]  13.083333: update_next_balance: sd->next_balance: 761,046 -> 760,024 (jiffies=760,046, interval=8)
cyclictest-120   [000]  13.083633: update_next_balance: sd->next_balance: 761,046 -> 760,024 (jiffies=760,046, interval=8)
<idle>-0   [000]  13.083656: trigger_load_balance: time_after_eq(jiffies=760,047, rq->next_balance=760,024) = 1
<idle>-0   [000]  13.083656: softirq_raise: vec=7 [action=SCHED]
<idle>-0   [000]  13.083702: rebalance_domains: rq[cpu=0]->next_balance: 760,024 -> 760,143 (jiffies=760,047)
cyclictest-120   [000]  13.083729: update_next_balance: sd->next_balance: 761,047 -> 760,024 (jiffies=760,047, interval=8)
cyclictest-120   [000]  13.083730: newidle_balance: this_rq[cpu=0]->next_balance: 760,143 -> 760,024 (jiffies=760,047)
cyclictest-120   [000]  13.083960: update_next_balance: sd->next_balance: 761,047 -> 760,024 (jiffies=760,047, interval=8)
cyclictest-120   [000]  13.084069: update_next_balance: sd->next_balance: 761,047 -> 760,024 (jiffies=760,047, interval=8)
cyclictest-120   [000]  13.084423: update_next_balance: sd->next_balance: 761,047 -> 760,024 (jiffies=760,047, interval=8)
cyclictest-120   [000]  13.084633: trigger_load_balance: time_after_eq(jiffies=760,048, rq->next_balance=760,024) = 1
cyclictest-120   [000]  13.084633: softirq_raise: vec=7 [action=SCHED]
cyclictest-120   [000]  13.084678: rebalance_domains: rq[cpu=0]->next_balance:
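
If I am reading the interval=8 lines right, sd->next_balance is being computed
as sd->last_balance + interval, i.e. 760,024 = 760,016 + 8, while jiffies is
already 760,045. In other words, sd->last_balance appears stuck about 29
jiffies in the past, so every newidle entry re-proposes a next_balance that is
already due and the periodic balance fires again on the very next tick.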

> > > > Fix by updating last_balance when a newidle load balance actually
> > > > happens and then updating next_balance. This is also how it is done in
> > > > other load balance paths.
> > > >
> > > > Testing shows a significant drop in softirqs when running:
> > > > cyclictest -i 100 -d 100 --latency=1000 -D 5 -t -m  -q
> > > >
> > > > Goes from ~6k to ~800.
> > >
> > > Even if your figures look interesting, your patch adds regression in
> > > the load balance and the fairness.
> >
> > Yes I see that now. However it does illustrate the problem IMO.
> >
> > > We can probably do improve the current behavior for decreasing number
> > > of ILB but your proposal is not the right solution IMO
> >
> > One of the problems is if you have task goes idle a lot, then the
> > newidle_balance mechanism triggers the periodic balance every jiffie (once
> > per millisecond on HZ=1000).
> 
> every msec seems quite a lot.

Yeah!

> >
> > Following are some traces I collected.
> >
> > cyclictest-123   [003]   522.650574  newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,157)
> >    <idle>-0      [003]   522.651443  trigger_load_balance: time_after_eq(jiffies=221,158, rq->next_balance=221,145) = 1
> >    <idle>-0      [003]   522.651461  rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,158)
> > cyclictest-123   [003]   522.651494  newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,158)
> >    <idle>-0      [003]   522.652522  trigger_load_balance: time_after_eq(jiffies=221,159, rq->next_balance=221,145) = 1
> >    <idle>-0      [003]   522.652560  rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,159)
> > cyclictest-124   [003]   522.652586  newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,159)
> >    <idle>-0      [003]   522.654492  trigger_load_balance: time_after_eq(jiffies=221,161, rq->next_balance=221,145) = 1
> >    <idle>-0      [003]   522.654534  rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,161)
> 
> Ok, so IIUC your trace above, this happens because the tick is not
> stop after entering idle so it continues to fire and triggers a load
> balance without checking if there is a need like what is done for nohz
> mode

The tick is normally not stopped if the CPU is awakened too soon by a timer.
That's pretty normal AFAIK. As you can see in the traces above, cyclictest
keeps waking up.

> > things worse for power on ARM where you have uclamp stuff happening in the
> > load balance paths which is quite heavy when I last traced that..
> >
> > Further, we have observed in our tracing on real device that the update of
> > rq->next_balance from the newidle path is itself buggy... we observed that
> > because newidle balance may not update rq->last_balance, it is possible that
> > rq->next_balance when updated by update_next_balance() will be updated to a
> > value that is in the past and it will be stuck there for a long time! Perhaps
> > we should investigate more and fix that bug separately. Vineeth could provide
> > more details on the "getting stuck in the past" behavior as well.
> 
> sd->last_balance reflects last time an idle/busy load_balance happened
> (newly idle is out of the scope for the points that I mentioned
> previously).  So if no load balance happens for a while, the
> rq->next_balance can be in the past but I don't see a problem here. It
> just means that a load balance hasn't happened for a while. It can
> even move backward if it has been set when busy but the cpu is now
> idle

Sure, but I think it should at least be pushed get_sd_balance_interval()
jiffies into the future. Like so (untested)? Let me know what you think, and thanks!

---8<-----------------------

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a3318aeff9e8..0d6667d31c51 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11314,6 +11314,30 @@ get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
 	return interval;
 }
 
+/*
+ * Update the next balance from newidle balance.
+ * The update of next_balance from newidle balance tries to make sure that
+ * we don't trigger periodic balance too far in the future on a now-idle
+ * system.  This is just like update_next_balance except that since
+ * sd->last_balance may not have been updated for a while, we're careful to
+ * not set next_balance in the past.
+ */
+static inline void
+update_next_balance_newidle(struct sched_domain *sd, unsigned long *next_balance)
+{
+	unsigned long interval, next;
+
+	/* used by new idle balance, so cpu_busy = 0 */
+	interval = get_sd_balance_interval(sd, 0);
+	next = sd->last_balance + interval;
+
+	next = max(next, jiffies + interval);
+
+	if (time_after(*next_balance, next)) {
+		*next_balance = next;
+	}
+}
+
 static inline void
 update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
 {
@@ -12107,7 +12131,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
 
 		if (sd)
-			update_next_balance(sd, &next_balance);
+			update_next_balance_newidle(sd, &next_balance);
 		rcu_read_unlock();
 
 		goto out;
@@ -12124,7 +12148,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 		int continue_balancing = 1;
 		u64 domain_cost;
 
-		update_next_balance(sd, &next_balance);
+		update_next_balance_newidle(sd, &next_balance);
 
 		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
 			break;
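
One note on the clamp in update_next_balance_newidle() above: max() compares
raw jiffies values, so it is not wraparound-safe the way time_after()/
time_before() are. A wrap-safe variant of the same idea would be something
like this (equally untested):

	if (time_before(next, jiffies + interval))
		next = jiffies + interval;

Away from the jiffies wrap point the two behave identically; this is purely
about overflow.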

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] sched: Update ->next_balance correctly during newidle balance
  2023-11-09 10:02         ` Joel Fernandes
@ 2023-11-09 12:31           ` Joel Fernandes
  2023-11-14 15:43           ` Vincent Guittot
  1 sibling, 0 replies; 19+ messages in thread
From: Joel Fernandes @ 2023-11-09 12:31 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider,
	Vineeth Pillai (Google),
	Suleiman Souhlal, Frederic Weisbecker, Paul E . McKenney

On Thu, Nov 9, 2023 at 5:02 AM Joel Fernandes <joel@joelfernandes.org> wrote:
[...]
> > > things worse for power on ARM where you have uclamp stuff happening in the
> > > load balance paths which is quite heavy when I last traced that..
> > >
> > > Further, we have observed in our tracing on real device that the update of
> > > rq->next_balance from the newidle path is itself buggy... we observed that
> > > because newidle balance may not update rq->last_balance, it is possible that
> > > rq->next_balance when updated by update_next_balance() will be updated to a
> > > value that is in the past and it will be stuck there for a long time! Perhaps
> > > we should investigate more and fix that bug separately. Vineeth could provide
> > > more details on the "getting stuck in the past" behavior as well.
> >
> > sd->last_balance reflects last time an idle/busy load_balance happened
> > (newly idle is out of the scope for the points that I mentioned
> > previously).  So if no load balance happens for a while, the
> > rq->next_balance can be in the past but I don't see a problem here. It
> > just means that a load balance hasn't happened for a while. It can
> > even move backward if it has been set when busy but the cpu is now
> > idle
>
> Sure, but I think it should at least set it by get_sd_balance_interval() into
> the future. Like so (untested)? Let me know what you think and thanks!

Btw, I also drew a graph showing the issue without the patch:
https://i.imgur.com/RgTr45l.png

Each "x" mark is run_rebalance_domains() running on a CPU. As can be
seen, there were some 10 occurrences in a span of 15ms in one
instance.
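
That works out to roughly one rebalance_domains() run every 1.5 ms, i.e. on
the order of every jiffy or two assuming HZ=1000 on that machine.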

Thanks,

 - Joel


> ---8<-----------------------
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a3318aeff9e8..0d6667d31c51 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11314,6 +11314,30 @@ get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
>         return interval;
>  }
>
> +/*
> + * Update the next balance from newidle balance.
> + * The update of next_balance from newidle balance tries to make sure that
> + * we don't trigger periodic balance too far in the future on a now-idle
> + * system.  This is just like update_next_balance except that since
> + * sd->last_balance may not have been updated for a while, we're careful to
> + * not set next_balance in the past.
> + */
> +static inline void
> +update_next_balance_newidle(struct sched_domain *sd, unsigned long *next_balance)
> +{
> +       unsigned long interval, next;
> +
> +       /* used by new idle balance, so cpu_busy = 0 */
> +       interval = get_sd_balance_interval(sd, 0);
> +       next = sd->last_balance + interval;
> +
> +       next = max(next, jiffies + interval);
> +
> +       if (time_after(*next_balance, next)) {
> +               *next_balance = next;
> +       }
> +}
> +
>  static inline void
>  update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
>  {
> @@ -12107,7 +12131,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>             (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>
>                 if (sd)
> -                       update_next_balance(sd, &next_balance);
> +                       update_next_balance_newidle(sd, &next_balance);
>                 rcu_read_unlock();
>
>                 goto out;
> @@ -12124,7 +12148,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>                 int continue_balancing = 1;
>                 u64 domain_cost;
>
> -               update_next_balance(sd, &next_balance);
> +               update_next_balance_newidle(sd, &next_balance);
>
>                 if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
>                         break;

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] sched: Update ->next_balance correctly during newidle balance
  2023-11-09 10:02         ` Joel Fernandes
  2023-11-09 12:31           ` Joel Fernandes
@ 2023-11-14 15:43           ` Vincent Guittot
  2023-11-14 17:43             ` Joel Fernandes
  1 sibling, 1 reply; 19+ messages in thread
From: Vincent Guittot @ 2023-11-14 15:43 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider,
	Vineeth Pillai (Google),
	Suleiman Souhlal, Frederic Weisbecker, Paul E . McKenney

On Thursday, 09 Nov 2023 at 10:02:54 (+0000), Joel Fernandes wrote:
> Hi Vincent,
> 
> Sorry for late reply, I was in Tokyo all these days and was waiting to get to
> writing a proper reply. See my replies below:
> 
> On Thu, Oct 26, 2023 at 04:23:35PM +0200, Vincent Guittot wrote:
> > On Sun, 22 Oct 2023 at 02:28, Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > > On Fri, Oct 20, 2023 at 03:40:14PM +0200, Vincent Guittot wrote:
> > > > On Fri, 20 Oct 2023 at 03:40, Joel Fernandes (Google)
> > > > <joel@joelfernandes.org> wrote:
> > > > >
> > > > > From: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
> > > > >
> > > > > When newidle balancing triggers, we see that it constantly clobbers
> > > > > rq->next_balance even when there is no newidle balance happening due to
> > > > > the cost estimates.  Due to this, we see that periodic load balance
> > > > > (rebalance_domains) may trigger way more often when the CPU is going in
> > > > > and out of idle at a high rate but is no really idle. Repeatedly
> > > > > triggering load balance there is a bad idea as it is a heavy operation.
> > > > > It also causes increases in softirq.
> > > >
> > > > we have 2 balance intervals:
> > > > - one when idle based on the sd->balance_interval = sd_weight
> > > > - one when busy which increases the period by multiplying it with
> > > > busy_factor = 16
> > >
> > > On my production system I see load balance triggering every 4 jiffies! In a
> > 
> > Which kind of system do you have? sd->balance_interval is in ms
> 
> Yes, sorry I meant it triggers every jiffies which is extreme sometimes. It
> is an ADL SoC (12th gen Intel, 4 P cores 8 E cores) get_sd_balance_interval()
> returns 4 jiffies there. On my Qemu system, I see 8 jiffies.

Do you have details about the sched_domain hierarchy?
That could be part of your problem (see below).

> 
> [...]
> > > > > Another issue is ->last_balance is not updated after newidle balance
> > > > > causing mistakes in the ->next_balance calculations.
> > > >
> > > > newly idle load balance is not equal to idle load balance. It's a
> > > > light load balance trying to pull one  task and you can't really
> > > > consider it to the normal load balance
> > >
> > > True. However the point is that it is coupled with the other load balance
> > > mechanism and the two are not independent. As you can see below, modifying
> > > rq->next_balance in newidle also causes the periodic balance to happen more
> > > aggressively as well if there is a high transition from busy to idle and
> > > viceversa.
> > 
> > As mentioned, rq->next_balance is updated whenever cpu enters idle
> > (i.e. in newidle_balance() but it's not related with doing a newly
> > idle load balance.
> 
> Yes, I understand that. But my point was that the update of rq->next_balance
> from the newidle path is itself buggy and interferes with the load balance
> happening from the tick (trigger_load_balance -> run_rebalance_domains).

The newidle path is not buggy. It only uses sd->last_balance + interval to
estimate the next balance, which is the correct thing to do. Your problem
comes from the update of sd->last_balance, which never happens and remains
in the past, whereas you call run_rebalance_domains() which should
run load_balance for all domains with a sd->last_balance + interval in the
past.
Your problem most probably comes from should_we_balance(), which always or
"almost always" returns false in your use case for some sched_domain and
prevents sd->last_balance from being updated. Could you try the patch below?
(See also the sketch of the surrounding rebalance_domains() loop right after
the patch.) It should fix your problem of rebalancing every tick even though
rebalance_domains() is called.
At least this should show whether it's your problem, but I'm not sure it's the
right thing to do all the time ...

---
 kernel/sched/fair.c | 18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3745ca289240..9ea1f42e5362 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11568,17 +11568,6 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 		need_decay = update_newidle_cost(sd, 0);
 		max_cost += sd->max_newidle_lb_cost;

-		/*
-		 * Stop the load balance at this level. There is another
-		 * CPU in our sched group which is doing load balancing more
-		 * actively.
-		 */
-		if (!continue_balancing) {
-			if (need_decay)
-				continue;
-			break;
-		}
-
 		interval = get_sd_balance_interval(sd, busy);

 		need_serialize = sd->flags & SD_SERIALIZE;
@@ -11588,7 +11577,12 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 		}

 		if (time_after_eq(jiffies, sd->last_balance + interval)) {
-			if (load_balance(cpu, rq, sd, idle, &continue_balancing)) {
+			/*
+			 * Stop the load balance at this level. There is another
+			 * CPU in our sched group which is doing load balancing more
+			 * actively.
+			 */
+			if (continue_balancing && load_balance(cpu, rq, sd, idle, &continue_balancing)) {
 				/*
 				 * The LBF_DST_PINNED logic could have changed
 				 * env->dst_cpu, so we can't know our idle
--
2.34.1
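
For reference, the hunk above sits in a loop that looks roughly like this in
mainline (paraphrased from memory, so details may differ slightly). The point
is that sd->last_balance only advances inside the time_after_eq() block, so
breaking out of the loop early leaves it stuck for the remaining domains:

	for_each_domain(cpu, sd) {
		...
		interval = get_sd_balance_interval(sd, busy);

		if (time_after_eq(jiffies, sd->last_balance + interval)) {
			if (load_balance(cpu, rq, sd, idle, &continue_balancing)) {
				/* LBF_DST_PINNED may have changed dst_cpu */
				idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;
				busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
			}
			sd->last_balance = jiffies;
			interval = get_sd_balance_interval(sd, busy);
		}
		...
	}

With the check moved as in the patch, the time_after_eq() block (and hence the
sd->last_balance update) still runs for every domain even when this CPU is not
the one that should actively balance its group.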

> 
> > But your problem is more related with the fact that
> > nohz.needs_update is set when stopping cpu timer in order to update
> > nohz.next_balance which is then used to kick a "real" idle load
> > balance
> 
> Well, independent of nohz idle balance, I think we need to fix this issue as
> mentioned above. This effect the periodic one as mentioned in the commit log.
> 
> See here another trace I collected this time dumping the 'interval'. There is
> a tug of war happening between the newidle balance and the periodic balance.
> 
> The periodic one sets rq->next_balance for cpu 0 to 760,143 and then the
> newidle comes in pulls it back a 118 jiffies to 760,024. This is actually in
> the past because jiffies is currently 760,045 !!
> 
> This triggers the periodic balance againwhich sets rq->next_balance back to
> 760,143.
> 
> Rinse and repeat. End result is you have periodic balance every jiffies. With
> this patch the issue goes away but we could fix it differently as you
> mentioned, we need to pull newidle balance back but perhaps not so
> aggressively. How about something like the untested diff I enclosed at the
> end of this email?
> 
> <idle>-0   [000]  13.081781: rebalance_domains: rq[cpu=0]->next_balance: 760,024 -> 760,143 (jiffies=760,045)
> cyclictest-120   [000]  13.081806: update_next_balance: sd->next_balance: 761,045 -> 760,024 (jiffies=760,045, interval=8)
> cyclictest-120   [000]  13.081807: newidle_balance: this_rq[cpu=0]->next_balance: 760,143 -> 760,024 (jiffies=760,045)
> cyclictest-120   [000]  13.082130: update_next_balance: sd->next_balance: 761,045 -> 760,024 (jiffies=760,045, interval=8)
> cyclictest-120   [000]  13.082338: update_next_balance: sd->next_balance: 761,045 -> 760,024 (jiffies=760,045, interval=8)
> cyclictest-120   [000]  13.082636: update_next_balance: sd->next_balance: 761,045 -> 760,024 (jiffies=760,045, interval=8)
> <idle>-0   [000]  13.082823: trigger_load_balance: time_after_eq(jiffies=760,046, rq->next_balance=760,024) = 1
> <idle>-0   [000]  13.082823: softirq_raise: vec=7 [action=SCHED]
> <idle>-0   [000]  13.082871: rebalance_domains: rq[cpu=0]->next_balance: 760,024 -> 760,143 (jiffies=760,046)
> trace-cmd-114   [000]  13.082876: update_next_balance: sd->next_balance: 761,046 -> 760,024 (jiffies=760,046, interval=8)
> trace-cmd-114   [000]  13.082876: newidle_balance: this_rq[cpu=0]->next_balance: 760,143 -> 760,024 (jiffies=760,046)
> cyclictest-120   [000]  13.083333: update_next_balance: sd->next_balance: 761,046 -> 760,024 (jiffies=760,046, interval=8)
> cyclictest-120   [000]  13.083633: update_next_balance: sd->next_balance: 761,046 -> 760,024 (jiffies=760,046, interval=8)
> <idle>-0   [000]  13.083656: trigger_load_balance: time_after_eq(jiffies=760,047, rq->next_balance=760,024) = 1
> <idle>-0   [000]  13.083656: softirq_raise: vec=7 [action=SCHED]
> <idle>-0   [000]  13.083702: rebalance_domains: rq[cpu=0]->next_balance: 760,024 -> 760,143 (jiffies=760,047)
> cyclictest-120   [000]  13.083729: update_next_balance: sd->next_balance: 761,047 -> 760,024 (jiffies=760,047, interval=8)
> cyclictest-120   [000]  13.083730: newidle_balance: this_rq[cpu=0]->next_balance: 760,143 -> 760,024 (jiffies=760,047)
> cyclictest-120   [000]  13.083960: update_next_balance: sd->next_balance: 761,047 -> 760,024 (jiffies=760,047, interval=8)
> cyclictest-120   [000]  13.084069: update_next_balance: sd->next_balance: 761,047 -> 760,024 (jiffies=760,047, interval=8)
> cyclictest-120   [000]  13.084423: update_next_balance: sd->next_balance: 761,047 -> 760,024 (jiffies=760,047, interval=8)
> cyclictest-120   [000]  13.084633: trigger_load_balance: time_after_eq(jiffies=760,048, rq->next_balance=760,024) = 1
> cyclictest-120   [000]  13.084633: softirq_raise: vec=7 [action=SCHED]
> cyclictest-120   [000]  13.084678: rebalance_domains: rq[cpu=0]->next_balance:
> 
> > > > > Fix by updating last_balance when a newidle load balance actually
> > > > > happens and then updating next_balance. This is also how it is done in
> > > > > other load balance paths.
> > > > >
> > > > > Testing shows a significant drop in softirqs when running:
> > > > > cyclictest -i 100 -d 100 --latency=1000 -D 5 -t -m  -q
> > > > >
> > > > > Goes from ~6k to ~800.
> > > >
> > > > Even if your figures look interesting, your patch adds regression in
> > > > the load balance and the fairness.
> > >
> > > Yes I see that now. However it does illustrate the problem IMO.
> > >
> > > > We can probably do improve the current behavior for decreasing number
> > > > of ILB but your proposal is not the right solution IMO
> > >
> > > One of the problems is if you have task goes idle a lot, then the
> > > newidle_balance mechanism triggers the periodic balance every jiffie (once
> > > per millisecond on HZ=1000).
> > 
> > every msec seems quite a lot.
> 
> Yeah!
> 
> > >
> > > Following are some traces I collected.
> > >
> > > cyclictest-123   [003]   522.650574  newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,157)
> > >    <idle>-0      [003]   522.651443  trigger_load_balance: time_after_eq(jiffies=221,158, rq->next_balance=221,145) = 1
> > >    <idle>-0      [003]   522.651461  rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,158)
> > > cyclictest-123   [003]   522.651494  newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,158)
> > >    <idle>-0      [003]   522.652522  trigger_load_balance: time_after_eq(jiffies=221,159, rq->next_balance=221,145) = 1
> > >    <idle>-0      [003]   522.652560  rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,159)
> > > cyclictest-124   [003]   522.652586  newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,159)
> > >    <idle>-0      [003]   522.654492  trigger_load_balance: time_after_eq(jiffies=221,161, rq->next_balance=221,145) = 1
> > >    <idle>-0      [003]   522.654534  rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,161)
> > 
> > Ok, so IIUC your trace above, this happens because the tick is not
> > stop after entering idle so it continues to fire and triggers a load
> > balance without checking if there is a need like what is done for nohz
> > mode
> 
> The tick is normally not stopped if the CPU is awakened too soon by a timer.
> That's pretty normal AFAIK. As you can see in the traces above, cyclictest
> keeps waking up.
> 
> > > things worse for power on ARM where you have uclamp stuff happening in the
> > > load balance paths which is quite heavy when I last traced that..
> > >
> > > Further, we have observed in our tracing on real device that the update of
> > > rq->next_balance from the newidle path is itself buggy... we observed that
> > > because newidle balance may not update rq->last_balance, it is possible that
> > > rq->next_balance when updated by update_next_balance() will be updated to a
> > > value that is in the past and it will be stuck there for a long time! Perhaps
> > > we should investigate more and fix that bug separately. Vineeth could provide
> > > more details on the "getting stuck in the past" behavior as well.
> > 
> > sd->last_balance reflects last time an idle/busy load_balance happened
> > (newly idle is out of the scope for the points that I mentioned
> > previously).  So if no load balance happens for a while, the
> > rq->next_balance can be in the past but I don't see a problem here. It
> > just means that a load balance hasn't happened for a while. It can
> > even move backward if it has been set when busy but the cpu is now
> > idle
> 
> Sure, but I think it should at least set it by get_sd_balance_interval() into
> the future. Like so (untested)? Let me know what you think and thanks!
> 
> ---8<-----------------------
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a3318aeff9e8..0d6667d31c51 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11314,6 +11314,30 @@ get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
>  	return interval;
>  }
>  
> +/*
> + * Update the next balance from newidle balance.
> + * The update of next_balance from newidle balance tries to make sure that
> + * we don't trigger periodic balance too far in the future on a now-idle
> + * system.  This is just like update_next_balance except that since
> + * sd->last_balance may not have been updated for a while, we're careful to
> + * not set next_balance in the past.
> + */
> +static inline void
> +update_next_balance_newidle(struct sched_domain *sd, unsigned long *next_balance)
> +{
> +	unsigned long interval, next;
> +
> +	/* used by new idle balance, so cpu_busy = 0 */
> +	interval = get_sd_balance_interval(sd, 0);
> +	next = sd->last_balance + interval;
> +
> +	next = max(next, jiffies + interval);
> +
> +	if (time_after(*next_balance, next)) {
> +		*next_balance = next;
> +	}
> +}
> +
>  static inline void
>  update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
>  {
> @@ -12107,7 +12131,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>  
>  		if (sd)
> -			update_next_balance(sd, &next_balance);
> +			update_next_balance_newidle(sd, &next_balance);
>  		rcu_read_unlock();
>  
>  		goto out;
> @@ -12124,7 +12148,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  		int continue_balancing = 1;
>  		u64 domain_cost;
>  
> -		update_next_balance(sd, &next_balance);
> +		update_next_balance_newidle(sd, &next_balance);
>  
>  		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
>  			break;

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] sched: Update ->next_balance correctly during newidle balance
  2023-11-14 15:43           ` Vincent Guittot
@ 2023-11-14 17:43             ` Joel Fernandes
  0 siblings, 0 replies; 19+ messages in thread
From: Joel Fernandes @ 2023-11-14 17:43 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider,
	Vineeth Pillai (Google),
	Suleiman Souhlal, Frederic Weisbecker, Paul E . McKenney

Hi Vincent,

> On Nov 14, 2023, at 10:43 AM, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> 
> On Thursday, 09 Nov 2023 at 10:02:54 (+0000), Joel Fernandes wrote:
>> Hi Vincent,
>> 
>> Sorry for late reply, I was in Tokyo all these days and was waiting to get to
>> writing a proper reply. See my replies below:
>> 
>>> On Thu, Oct 26, 2023 at 04:23:35PM +0200, Vincent Guittot wrote:
>>> On Sun, 22 Oct 2023 at 02:28, Joel Fernandes <joel@joelfernandes.org> wrote:
>>>> 
>>>> On Fri, Oct 20, 2023 at 03:40:14PM +0200, Vincent Guittot wrote:
>>>>> On Fri, 20 Oct 2023 at 03:40, Joel Fernandes (Google)
>>>>> <joel@joelfernandes.org> wrote:
>>>>>> 
>>>>>> From: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
>>>>>> 
>>>>>> When newidle balancing triggers, we see that it constantly clobbers
>>>>>> rq->next_balance even when there is no newidle balance happening due to
>>>>>> the cost estimates.  Due to this, we see that periodic load balance
>>>>>> (rebalance_domains) may trigger way more often when the CPU is going in
>>>>>> and out of idle at a high rate but is no really idle. Repeatedly
>>>>>> triggering load balance there is a bad idea as it is a heavy operation.
>>>>>> It also causes increases in softirq.
>>>>> 
>>>>> we have 2 balance intervals:
>>>>> - one when idle based on the sd->balance_interval = sd_weight
>>>>> - one when busy which increases the period by multiplying it with
>>>>> busy_factor = 16
>>>> 
>>>> On my production system I see load balance triggering every 4 jiffies! In a
>>> 
>>> Which kind of system do you have? sd->balance_interval is in ms
>> 
>> Yes, sorry I meant it triggers every jiffies which is extreme sometimes. It
>> is an ADL SoC (12th gen Intel, 4 P cores 8 E cores) get_sd_balance_interval()
>> returns 4 jiffies there. On my Qemu system, I see 8 jiffies.
> 
> Do you have details about the sched_domain  hierarchy ?
> That could be part of your problem (see below)

Since I am at LPC, I am not at that machine right now, but I could provide it
next week. I replied below:

> 
>> 
>> [...]
>>>>>> Another issue is ->last_balance is not updated after newidle balance
>>>>>> causing mistakes in the ->next_balance calculations.
>>>>> 
>>>>> newly idle load balance is not equal to idle load balance. It's a
>>>>> light load balance trying to pull one  task and you can't really
>>>>> consider it to the normal load balance
>>>> 
>>>> True. However the point is that it is coupled with the other load balance
>>>> mechanism and the two are not independent. As you can see below, modifying
>>>> rq->next_balance in newidle also causes the periodic balance to happen more
>>>> aggressively as well if there is a high transition from busy to idle and
>>>> viceversa.
>>> 
>>> As mentioned, rq->next_balance is updated whenever cpu enters idle
>>> (i.e. in newidle_balance() but it's not related with doing a newly
>>> idle load balance.
>> 
>> Yes, I understand that. But my point was that the update of rq->next_balance
>> from the newidle path is itself buggy and interferes with the load balance
>> happening from the tick (trigger_load_balance -> run_rebalance_domains).
> 
> Newidle path is not buggy.

Sure, perhaps it is not directly newidle, but something else is buggy, as you
indicated below:

> It only uses sd->last_balance + interval to
> estimate the next balance  which is the correct thing to do. Your problem
> comes from the update of sd->last_balance which never happens and remains
> in the past whereas you call run_rebalance_domains() which should
> run load_balance for all domains with a sd->last_balance + interval in the
> past.
> Your problem most probably comes from the should_we_balance which always or
> "almost always" returns false in your use case for some sched_domain and
> prevents to updat sd->last_balance. Could you try the patch below ?
> It should fix your problem of trying to rebalance every tick whereas
> rebalance_domain is called.
> At least this should show if it's your problem but I'm not sure it's the right
> things to do all the time ...

You raise a good point; the root cause is indeed last_balance being
stuck in the past, or something along those lines.

I will try the patch below and let you know. My previous diff, where I cap
the next_balance setting, also made the issue go away when I last tested.

Thanks Vincent!

 - Joel 


> 
> ---
> kernel/sched/fair.c | 18 ++++++------------
> 1 file changed, 6 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3745ca289240..9ea1f42e5362 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11568,17 +11568,6 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
>        need_decay = update_newidle_cost(sd, 0);
>        max_cost += sd->max_newidle_lb_cost;
> 
> -        /*
> -         * Stop the load balance at this level. There is another
> -         * CPU in our sched group which is doing load balancing more
> -         * actively.
> -         */
> -        if (!continue_balancing) {
> -            if (need_decay)
> -                continue;
> -            break;
> -        }
> -
>        interval = get_sd_balance_interval(sd, busy);
> 
>        need_serialize = sd->flags & SD_SERIALIZE;
> @@ -11588,7 +11577,12 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
>        }
> 
>        if (time_after_eq(jiffies, sd->last_balance + interval)) {
> -            if (load_balance(cpu, rq, sd, idle, &continue_balancing)) {
> +            /*
> +             * Stop the load balance at this level. There is another
> +             * CPU in our sched group which is doing load balancing more
> +             * actively.
> +             */
> +            if (continue_balancing && load_balance(cpu, rq, sd, idle, &continue_balancing)) {
>                /*
>                 * The LBF_DST_PINNED logic could have changed
>                 * env->dst_cpu, so we can't know our idle
> --
> 2.34.1
> 
>> 
>>> But your problem is more related with the fact that
>>> nohz.needs_update is set when stopping cpu timer in order to update
>>> nohz.next_balance which is then used to kick a "real" idle load
>>> balance
>> 
>> Well, independent of nohz idle balance, I think we need to fix this issue as
>> mentioned above. This effect the periodic one as mentioned in the commit log.
>> 
>> See here another trace I collected this time dumping the 'interval'. There is
>> a tug of war happening between the newidle balance and the periodic balance.
>> 
>> The periodic one sets rq->next_balance for cpu 0 to 760,143 and then the
>> newidle comes in pulls it back a 118 jiffies to 760,024. This is actually in
>> the past because jiffies is currently 760,045 !!
>> 
>> This triggers the periodic balance againwhich sets rq->next_balance back to
>> 760,143.
>> 
>> Rinse and repeat. End result is you have periodic balance every jiffies. With
>> this patch the issue goes away but we could fix it differently as you
>> mentioned, we need to pull newidle balance back but perhaps not so
>> aggressively. How about something like the untested diff I enclosed at the
>> end of this email?
>> 
>> <idle>-0   [000]  13.081781: rebalance_domains: rq[cpu=0]->next_balance: 760,024 -> 760,143 (jiffies=760,045)
>> cyclictest-120   [000]  13.081806: update_next_balance: sd->next_balance: 761,045 -> 760,024 (jiffies=760,045, interval=8)
>> cyclictest-120   [000]  13.081807: newidle_balance: this_rq[cpu=0]->next_balance: 760,143 -> 760,024 (jiffies=760,045)
>> cyclictest-120   [000]  13.082130: update_next_balance: sd->next_balance: 761,045 -> 760,024 (jiffies=760,045, interval=8)
>> cyclictest-120   [000]  13.082338: update_next_balance: sd->next_balance: 761,045 -> 760,024 (jiffies=760,045, interval=8)
>> cyclictest-120   [000]  13.082636: update_next_balance: sd->next_balance: 761,045 -> 760,024 (jiffies=760,045, interval=8)
>> <idle>-0   [000]  13.082823: trigger_load_balance: time_after_eq(jiffies=760,046, rq->next_balance=760,024) = 1
>> <idle>-0   [000]  13.082823: softirq_raise: vec=7 [action=SCHED]
>> <idle>-0   [000]  13.082871: rebalance_domains: rq[cpu=0]->next_balance: 760,024 -> 760,143 (jiffies=760,046)
>> trace-cmd-114   [000]  13.082876: update_next_balance: sd->next_balance: 761,046 -> 760,024 (jiffies=760,046, interval=8)
>> trace-cmd-114   [000]  13.082876: newidle_balance: this_rq[cpu=0]->next_balance: 760,143 -> 760,024 (jiffies=760,046)
>> cyclictest-120   [000]  13.083333: update_next_balance: sd->next_balance: 761,046 -> 760,024 (jiffies=760,046, interval=8)
>> cyclictest-120   [000]  13.083633: update_next_balance: sd->next_balance: 761,046 -> 760,024 (jiffies=760,046, interval=8)
>> <idle>-0   [000]  13.083656: trigger_load_balance: time_after_eq(jiffies=760,047, rq->next_balance=760,024) = 1
>> <idle>-0   [000]  13.083656: softirq_raise: vec=7 [action=SCHED]
>> <idle>-0   [000]  13.083702: rebalance_domains: rq[cpu=0]->next_balance: 760,024 -> 760,143 (jiffies=760,047)
>> cyclictest-120   [000]  13.083729: update_next_balance: sd->next_balance: 761,047 -> 760,024 (jiffies=760,047, interval=8)
>> cyclictest-120   [000]  13.083730: newidle_balance: this_rq[cpu=0]->next_balance: 760,143 -> 760,024 (jiffies=760,047)
>> cyclictest-120   [000]  13.083960: update_next_balance: sd->next_balance: 761,047 -> 760,024 (jiffies=760,047, interval=8)
>> cyclictest-120   [000]  13.084069: update_next_balance: sd->next_balance: 761,047 -> 760,024 (jiffies=760,047, interval=8)
>> cyclictest-120   [000]  13.084423: update_next_balance: sd->next_balance: 761,047 -> 760,024 (jiffies=760,047, interval=8)
>> cyclictest-120   [000]  13.084633: trigger_load_balance: time_after_eq(jiffies=760,048, rq->next_balance=760,024) = 1
>> cyclictest-120   [000]  13.084633: softirq_raise: vec=7 [action=SCHED]
>> cyclictest-120   [000]  13.084678: rebalance_domains: rq[cpu=0]->next_balance:
>> 
>>>>>> Fix by updating last_balance when a newidle load balance actually
>>>>>> happens and then updating next_balance. This is also how it is done in
>>>>>> other load balance paths.
>>>>>> 
>>>>>> Testing shows a significant drop in softirqs when running:
>>>>>> cyclictest -i 100 -d 100 --latency=1000 -D 5 -t -m  -q
>>>>>> 
>>>>>> Goes from ~6k to ~800.
>>>>> 
>>>>> Even if your figures look interesting, your patch adds regression in
>>>>> the load balance and the fairness.
>>>> 
>>>> Yes I see that now. However it does illustrate the problem IMO.
>>>> 
>>>>> We can probably do improve the current behavior for decreasing number
>>>>> of ILB but your proposal is not the right solution IMO
>>>> 
>>>> One of the problems is if you have task goes idle a lot, then the
>>>> newidle_balance mechanism triggers the periodic balance every jiffie (once
>>>> per millisecond on HZ=1000).
>>> 
>>> every msec seems quite a lot.
>> 
>> Yeah!
>> 
>>>> 
>>>> Following are some traces I collected.
>>>> 
>>>> cyclictest-123   [003]   522.650574  newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,157)
>>>>   <idle>-0      [003]   522.651443  trigger_load_balance: time_after_eq(jiffies=221,158, rq->next_balance=221,145) = 1
>>>>   <idle>-0      [003]   522.651461  rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,158)
>>>> cyclictest-123   [003]   522.651494  newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,158)
>>>>   <idle>-0      [003]   522.652522  trigger_load_balance: time_after_eq(jiffies=221,159, rq->next_balance=221,145) = 1
>>>>   <idle>-0      [003]   522.652560  rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,159)
>>>> cyclictest-124   [003]   522.652586  newidle_balance: this_rq[cpu=3]->next_balance: 221,264 -> 221,145 (jiffies=221,159)
>>>>   <idle>-0      [003]   522.654492  trigger_load_balance: time_after_eq(jiffies=221,161, rq->next_balance=221,145) = 1
>>>>   <idle>-0      [003]   522.654534  rebalance_domains: rq[cpu=3]->next_balance: 221,145 -> 221,264 (jiffies=221,161)
>>> 
>>> Ok, so IIUC your trace above, this happens because the tick is not
>>> stop after entering idle so it continues to fire and triggers a load
>>> balance without checking if there is a need like what is done for nohz
>>> mode
>> 
>> The tick is normally not stopped if the CPU is awakened too soon by a timer.
>> That's pretty normal AFAIK. As you can see in the traces above, cyclictest
>> keeps waking up.
>> 
>>>> things worse for power on ARM where you have uclamp stuff happening in the
>>>> load balance paths which is quite heavy when I last traced that..
>>>> 
>>>> Further, we have observed in our tracing on real device that the update of
>>>> rq->next_balance from the newidle path is itself buggy... we observed that
>>>> because newidle balance may not update rq->last_balance, it is possible that
>>>> rq->next_balance when updated by update_next_balance() will be updated to a
>>>> value that is in the past and it will be stuck there for a long time! Perhaps
>>>> we should investigate more and fix that bug separately. Vineeth could provide
>>>> more details on the "getting stuck in the past" behavior as well.
>>> 
>>> sd->last_balance reflects last time an idle/busy load_balance happened
>>> (newly idle is out of the scope for the points that I mentioned
>>> previously).  So if no load balance happens for a while, the
>>> rq->next_balance can be in the past but I don't see a problem here. It
>>> just means that a load balance hasn't happened for a while. It can
>>> even move backward if it has been set when busy but the cpu is now
>>> idle
>> 
>> Sure, but I think it should at least set it by get_sd_balance_interval() into
>> the future. Like so (untested)? Let me know what you think and thanks!
>> 
>> ---8<-----------------------
>> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index a3318aeff9e8..0d6667d31c51 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -11314,6 +11314,30 @@ get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
>>    return interval;
>> }
>> 
>> +/*
>> + * Update the next balance from newidle balance.
>> + * The update of next_balance from newidle balance tries to make sure that
>> + * we don't trigger periodic balance too far in the future on a now-idle
>> + * system.  This is just like update_next_balance except that since
>> + * sd->last_balance may not have been updated for a while, we're careful to
>> + * not set next_balance in the past.
>> + */
>> +static inline void
>> +update_next_balance_newidle(struct sched_domain *sd, unsigned long *next_balance)
>> +{
>> +    unsigned long interval, next;
>> +
>> +    /* used by new idle balance, so cpu_busy = 0 */
>> +    interval = get_sd_balance_interval(sd, 0);
>> +    next = sd->last_balance + interval;
>> +
>> +    next = max(next, jiffies + interval);
>> +
>> +    if (time_after(*next_balance, next)) {
>> +        *next_balance = next;
>> +    }
>> +}
>> +
>> static inline void
>> update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
>> {
>> @@ -12107,7 +12131,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>        (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>> 
>>        if (sd)
>> -            update_next_balance(sd, &next_balance);
>> +            update_next_balance_newidle(sd, &next_balance);
>>        rcu_read_unlock();
>> 
>>        goto out;
>> @@ -12124,7 +12148,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>        int continue_balancing = 1;
>>        u64 domain_cost;
>> 
>> -        update_next_balance(sd, &next_balance);
>> +        update_next_balance_newidle(sd, &next_balance);
>> 
>>        if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
>>            break;

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2023-11-14 17:43 UTC | newest]

Thread overview: 19+ messages
2023-10-20  1:40 [PATCH 1/3] sched/nohz: Update nohz.next_balance directly without IPIs (v2) Joel Fernandes (Google)
2023-10-20  1:40 ` [PATCH 2/3] sched/nohz: Update comments about NEWILB_KICK Joel Fernandes (Google)
2023-10-20  7:51   ` Ingo Molnar
2023-10-20  8:02   ` [tip: sched/core] " tip-bot2 for Joel Fernandes (Google)
2023-10-20  1:40 ` [PATCH 3/3] sched: Update ->next_balance correctly during newidle balance Joel Fernandes (Google)
2023-10-20  7:53   ` Ingo Molnar
2023-10-20 13:48     ` Vincent Guittot
2023-10-21  6:50       ` Ingo Molnar
2023-10-20  8:02   ` [tip: sched/core] sched/fair: " tip-bot2 for Vineeth Pillai (Google)
2023-10-20 13:40   ` [PATCH 3/3] sched: " Vincent Guittot
2023-10-20 13:56     ` Ingo Molnar
2023-10-20 15:50       ` Joel Fernandes
2023-10-22  0:28     ` Joel Fernandes
2023-10-26 14:23       ` Vincent Guittot
2023-11-09 10:02         ` Joel Fernandes
2023-11-09 12:31           ` Joel Fernandes
2023-11-14 15:43           ` Vincent Guittot
2023-11-14 17:43             ` Joel Fernandes
2023-10-20  4:17 ` [PATCH 1/3] sched/nohz: Update nohz.next_balance directly without IPIs (v2) Joel Fernandes
