linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/4] Improve newidle lb cost tracking and early abort
@ 2021-10-15 12:46 Vincent Guittot
  2021-10-15 12:46 ` [PATCH v2 1/4] sched/fair: Account update_blocked_averages in newidle_balance cost Vincent Guittot
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Vincent Guittot @ 2021-10-15 12:46 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, linux-kernel, tim.c.chen
  Cc: Vincent Guittot

This patchset updates newidle lb cost tracking and early abort:

The time spent running update_blocked_averages is now accounted in the 1st
sched_domain level. This time can be significant and move the cost of
newidle lb above the avg_idle time.

The decay of max_newidle_lb_cost is modified to start only when the field
has not been updated for a while: a recent update will not be decayed
immediately but only after a while.

The condition of an avg_idle lower than sysctl_sched_migration_cost has
been removed, as the 500us value is quite large and prevents opportunities
to pull a task onto the newly idle CPU for at least the 1st domain levels.
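
Condensed, the resulting early-abort logic looks like the sketch below.
This is an illustrative userspace program, not the kernel code; the
per-level costs are made-up values of the same order as the measurements
that follow:

/* Illustrative sketch of the newidle early abort; not kernel code.
 * Costs are in ns and only stand in for the real scheduler fields. */
#include <stdio.h>
#include <stdint.h>

struct sd { const char *name; uint64_t max_newidle_lb_cost; };

int main(void)
{
	struct sd doms[] = {
		{ "SMT",  33000 },	/* includes update_blocked_averages */
		{ "MC",   49000 },
		{ "NUMA", 45000 },
	};
	uint64_t avg_idle = 60000;	/* expected remaining idle time */
	uint64_t curr_cost = 0;
	int i;

	/* abort before update_blocked_averages(): even the 1st level
	 * does not fit in avg_idle */
	if (avg_idle < doms[0].max_newidle_lb_cost) {
		printf("skip newidle balance entirely\n");
		return 0;
	}

	for (i = 0; i < 3; i++) {
		/* per-domain early abort, as in newidle_balance() */
		if (avg_idle < curr_cost + doms[i].max_newidle_lb_cost) {
			printf("abort before %s\n", doms[i].name);
			break;
		}
		printf("balance %s\n", doms[i].name);
		curr_cost += doms[i].max_newidle_lb_cost;
	}
	return 0;
}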

Monitoring sd->max_newidle_lb_cost on cpu0 of an Arm64 THX2 system
(2 nodes * 28 cores * 4 cpus) during the benchmarks gives the
following results:
       min    avg   max
SMT:   1us   33us  273us - this one includes the update of blocked load
MC:    7us   49us  398us
NUMA: 10us   45us  158us


Some results for hackbench -l $LOOPS -g $group:
group      tip/sched/core     + this patchset
1           15.189(+/- 2%)       14.987(+/- 2%)  +1%
4            4.336(+/- 3%)        4.322(+/- 5%)  +0%
16           3.654(+/- 1%)        2.922(+/- 3%) +20%
32           3.209(+/- 1%)        2.919(+/- 3%)  +9%
64           2.965(+/- 1%)        2.826(+/- 1%)  +4%
128          2.954(+/- 1%)        2.993(+/- 8%)  -1%
256          2.951(+/- 1%)        2.894(+/- 1%)  +2%

tbench and reaim have not shown any difference.

Changes since v1:
- account the time spent in update_blocked_averages() in the 1st domain

- reduce the number of calls to sched_clock_cpu()

- change the way max_newidle_lb_cost is decayed. Peter suggested using an
  IIR filter, but keeping track of the current max value gave the best
  results

- removed the condition (this_rq->avg_idle < sysctl_sched_migration_cost)
  as suggested by Peter

Vincent Guittot (4):
  sched/fair: Account update_blocked_averages in newidle_balance cost
  sched/fair: Skip update_blocked_averages if we are deferring load
    balance
  sched/fair: Wait before decaying max_newidle_lb_cost
  sched/fair: Remove sysctl_sched_migration_cost condition

 include/linux/sched/topology.h |  2 +-
 kernel/sched/fair.c            | 29 ++++++++++++++++++-----------
 kernel/sched/topology.c        |  2 +-
 3 files changed, 20 insertions(+), 13 deletions(-)

-- 
2.17.1



* [PATCH v2 1/4] sched/fair: Account update_blocked_averages in newidle_balance cost
  2021-10-15 12:46 [PATCH v2 0/4] Improve newidle lb cost tracking and early abort Vincent Guittot
@ 2021-10-15 12:46 ` Vincent Guittot
  2021-10-15 12:46 ` [PATCH v2 2/4] sched/fair: Skip update_blocked_averages if we are deferring load balance Vincent Guittot
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Vincent Guittot @ 2021-10-15 12:46 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, linux-kernel, tim.c.chen
  Cc: Vincent Guittot

The time spent updating the blocked load can be significant depending on
the complexity of the cgroup hierarchy. Take this time into account in
the cost of the 1st load balance of a newly idle cpu.

Also reduce the number of calls to sched_clock_cpu() and track more of
the actual work.
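
The chained-timestamp pattern can be sketched in userspace as below; this
is illustrative only, and stage() is a hypothetical stand-in for
update_blocked_averages()/load_balance(), not a kernel API:

/* Illustrative sketch of chained timestamps; stage() is a hypothetical
 * stand-in, not a kernel API. */
#include <stdio.h>
#include <time.h>

static long long now_ns(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static void stage(int n)
{
	volatile int i;
	for (i = 0; i < n; i++)
		;	/* placeholder work */
}

int main(void)
{
	long long t0, t1;
	int lvl;

	t0 = now_ns();		/* read once, before the "blocked load" work */
	stage(1000000);		/* like update_blocked_averages() */

	for (lvl = 0; lvl < 3; lvl++) {
		stage(500000);	/* like load_balance() at this level */
		t1 = now_ns();
		/* the 1st interval includes the pre-loop work, as in the
		 * patch, and each level needs one clock read, not two */
		printf("level %d cost: %lld ns\n", lvl, t1 - t0);
		t0 = t1;	/* the next level is measured from here */
	}
	return 0;
}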

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2468d1d1edef..a7429dec8e2f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10841,9 +10841,9 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 {
 	unsigned long next_balance = jiffies + HZ;
 	int this_cpu = this_rq->cpu;
+	u64 t0, t1, curr_cost = 0;
 	struct sched_domain *sd;
 	int pulled_task = 0;
-	u64 curr_cost = 0;
 
 	update_misfit_status(NULL, this_rq);
 
@@ -10888,11 +10888,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 	raw_spin_rq_unlock(this_rq);
 
+	t0 = sched_clock_cpu(this_cpu);
 	update_blocked_averages(this_cpu);
+
 	rcu_read_lock();
 	for_each_domain(this_cpu, sd) {
 		int continue_balancing = 1;
-		u64 t0, domain_cost;
+		u64 domain_cost;
 
 		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
 			update_next_balance(sd, &next_balance);
@@ -10900,17 +10902,18 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 		}
 
 		if (sd->flags & SD_BALANCE_NEWIDLE) {
-			t0 = sched_clock_cpu(this_cpu);
 
 			pulled_task = load_balance(this_cpu, this_rq,
 						   sd, CPU_NEWLY_IDLE,
 						   &continue_balancing);
 
-			domain_cost = sched_clock_cpu(this_cpu) - t0;
+			t1 = sched_clock_cpu(this_cpu);
+			domain_cost = t1 - t0;
 			if (domain_cost > sd->max_newidle_lb_cost)
 				sd->max_newidle_lb_cost = domain_cost;
 
 			curr_cost += domain_cost;
+			t0 = t1;
 		}
 
 		update_next_balance(sd, &next_balance);
-- 
2.17.1



* [PATCH v2 2/4] sched/fair: Skip update_blocked_averages if we are deferring load balance
  2021-10-15 12:46 [PATCH v2 0/4] Improve newidle lb cost tracking and early abort Vincent Guittot
  2021-10-15 12:46 ` [PATCH v2 1/4] sched/fair: Account update_blocked_averages in newidle_balance cost Vincent Guittot
@ 2021-10-15 12:46 ` Vincent Guittot
  2021-10-15 12:46 ` [PATCH v2 3/4] sched/fair: Wait before decaying max_newidle_lb_cost Vincent Guittot
  2021-10-15 12:46 ` [PATCH v2 4/4] sched/fair: Remove sysctl_sched_migration_cost condition Vincent Guittot
  3 siblings, 0 replies; 10+ messages in thread
From: Vincent Guittot @ 2021-10-15 12:46 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, linux-kernel, tim.c.chen
  Cc: Vincent Guittot

In newidle_balance(), the scheduler skips the load balance of a newly
idle cpu when the 1st sd of this_rq satisfies:

   this_rq->avg_idle < sd->max_newidle_lb_cost

When this condition is true, a costly call to update_blocked_averages()
is not useful and simply adds overhead.

Check the condition early in newidle_balance() to skip
update_blocked_averages() when possible.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a7429dec8e2f..6b8065b72847 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10874,17 +10874,20 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	 */
 	rq_unpin_lock(this_rq, rf);
 
+	rcu_read_lock();
+	sd = rcu_dereference_check_sched_domain(this_rq->sd);
+
 	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-	    !READ_ONCE(this_rq->rd->overload)) {
+	    !READ_ONCE(this_rq->rd->overload) ||
+	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
 
-		rcu_read_lock();
-		sd = rcu_dereference_check_sched_domain(this_rq->sd);
 		if (sd)
 			update_next_balance(sd, &next_balance);
 		rcu_read_unlock();
 
 		goto out;
 	}
+	rcu_read_unlock();
 
 	raw_spin_rq_unlock(this_rq);
 
-- 
2.17.1



* [PATCH v2 3/4] sched/fair: Wait before decaying max_newidle_lb_cost
  2021-10-15 12:46 [PATCH v2 0/4] Improve newidle lb cost tracking and early abort Vincent Guittot
  2021-10-15 12:46 ` [PATCH v2 1/4] sched/fair: Account update_blocked_averages in newidle_balance cost Vincent Guittot
  2021-10-15 12:46 ` [PATCH v2 2/4] sched/fair: Skip update_blocked_averages if we are deferring load balance Vincent Guittot
@ 2021-10-15 12:46 ` Vincent Guittot
  2021-10-15 17:40   ` Peter Zijlstra
  2021-10-15 12:46 ` [PATCH v2 4/4] sched/fair: Remove sysctl_sched_migration_cost condition Vincent Guittot
  3 siblings, 1 reply; 10+ messages in thread
From: Vincent Guittot @ 2021-10-15 12:46 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, linux-kernel, tim.c.chen
  Cc: Vincent Guittot

Decay max_newidle_lb_cost only when it has not been updated for a while,
and ensure that a recently changed value is not decayed immediately.
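
For reference (not part of the patch): the decay in rebalance_domains()
multiplies by 253/256 ~= 0.9883, so each step removes ~1.17% of the
value, matching the "~1% per second" comment; with
last_decay_max_lb_cost, the step is taken at most once per second and
only after a full second without the max being refreshed.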

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched/topology.h | 2 +-
 kernel/sched/fair.c            | 8 +++++---
 kernel/sched/topology.c        | 2 +-
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8f0f778b7c91..9c8878f201ab 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -98,7 +98,7 @@ struct sched_domain {
 
 	/* idle_balance() stats */
 	u64 max_newidle_lb_cost;
-	unsigned long next_decay_max_lb_cost;
+	unsigned long last_decay_max_lb_cost;
 
 	u64 avg_scan_cost;		/* select_idle_sibling */
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b8065b72847..2967d13737b7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10265,10 +10265,10 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 		 * Decay the newidle max times here because this is a regular
 		 * visit to all the domains. Decay ~1% per second.
 		 */
-		if (time_after(jiffies, sd->next_decay_max_lb_cost)) {
+		if (time_after(jiffies, sd->last_decay_max_lb_cost + HZ)) {
 			sd->max_newidle_lb_cost =
 				(sd->max_newidle_lb_cost * 253) / 256;
-			sd->next_decay_max_lb_cost = jiffies + HZ;
+			sd->last_decay_max_lb_cost = jiffies;
 			need_decay = 1;
 		}
 		max_cost += sd->max_newidle_lb_cost;
@@ -10912,8 +10912,10 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 			t1 = sched_clock_cpu(this_cpu);
 			domain_cost = t1 - t0;
-			if (domain_cost > sd->max_newidle_lb_cost)
+			if (domain_cost > sd->max_newidle_lb_cost) {
 				sd->max_newidle_lb_cost = domain_cost;
+				sd->last_decay_max_lb_cost = jiffies;
+			}
 
 			curr_cost += domain_cost;
 			t0 = t1;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index c56faae461d9..edd71649d579 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1572,7 +1572,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		.last_balance		= jiffies,
 		.balance_interval	= sd_weight,
 		.max_newidle_lb_cost	= 0,
-		.next_decay_max_lb_cost	= jiffies,
+		.last_decay_max_lb_cost	= jiffies,
 		.child			= child,
 #ifdef CONFIG_SCHED_DEBUG
 		.name			= tl->name,
-- 
2.17.1



* [PATCH v2 4/4] sched/fair: Remove sysctl_sched_migration_cost condition
  2021-10-15 12:46 [PATCH v2 0/4] Improve newidle lb cost tracking and early abort Vincent Guittot
                   ` (2 preceding siblings ...)
  2021-10-15 12:46 ` [PATCH v2 3/4] sched/fair: Wait before decaying max_newidle_lb_cost Vincent Guittot
@ 2021-10-15 12:46 ` Vincent Guittot
  2021-10-15 17:42   ` Peter Zijlstra
  3 siblings, 1 reply; 10+ messages in thread
From: Vincent Guittot @ 2021-10-15 12:46 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, linux-kernel, tim.c.chen
  Cc: Vincent Guittot

With a default value of 500us, sysctl_sched_migration_cost is
significantly higher than the cost of load_balance. Remove the
condition and rely on the sd->max_newidle_lb_cost to abort
newidle_balance.
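
For scale (numbers from the cover letter, not from this patch): the
monitored max_newidle_lb_cost on the THX2 peaked at 273us (SMT), 398us
(MC) and 158us (NUMA), all below the 500us threshold, so the removed
check could veto a newidle balance even when the measured cost would
have fit within avg_idle.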

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2967d13737b7..924a5f0a8145 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10877,8 +10877,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	rcu_read_lock();
 	sd = rcu_dereference_check_sched_domain(this_rq->sd);
 
-	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-	    !READ_ONCE(this_rq->rd->overload) ||
+	if (!READ_ONCE(this_rq->rd->overload) ||
 	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
 
 		if (sd)
-- 
2.17.1



* Re: [PATCH v2 3/4] sched/fair: Wait before decaying max_newidle_lb_cost
  2021-10-15 12:46 ` [PATCH v2 3/4] sched/fair: Wait before decaying max_newidle_lb_cost Vincent Guittot
@ 2021-10-15 17:40   ` Peter Zijlstra
  2021-10-15 18:02     ` Vincent Guittot
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2021-10-15 17:40 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, linux-kernel, tim.c.chen

On Fri, Oct 15, 2021 at 02:46:53PM +0200, Vincent Guittot wrote:
> Decay max_newidle_lb_cost only when it has not been updated for a while,
> and ensure that a recently changed value is not decayed immediately.

I was more thinking something along these lines; of course, no idea how
well it actually behaves.

Index: linux-2.6/include/linux/sched/topology.h
===================================================================
--- linux-2.6.orig/include/linux/sched/topology.h
+++ linux-2.6/include/linux/sched/topology.h
@@ -98,7 +98,6 @@ struct sched_domain {
 
 	/* idle_balance() stats */
 	u64 max_newidle_lb_cost;
-	unsigned long next_decay_max_lb_cost;
 
 	u64 avg_scan_cost;		/* select_idle_sibling */
 
Index: linux-2.6/kernel/sched/fair.c
===================================================================
--- linux-2.6.orig/kernel/sched/fair.c
+++ linux-2.6/kernel/sched/fair.c
@@ -10241,6 +10241,17 @@ void update_max_interval(void)
 }
 
 /*
+ * Asymmetric IIR filter, 1/4th down, 3/4th up.
+ */
+static void update_newidle_cost(u64 *cost, u64 new)
+{
+	s64 diff = new - *cost;
+	if (diff > 0)
+		diff *= 3;
+	*cost += diff / 4;
+}
+
+/*
  * It checks each scheduling domain to see if it is due to be balanced,
  * and initiates a balancing operation if so.
  *
@@ -10256,33 +10267,18 @@ static void rebalance_domains(struct rq
 	/* Earliest time when we have to do rebalance again */
 	unsigned long next_balance = jiffies + 60*HZ;
 	int update_next_balance = 0;
-	int need_serialize, need_decay = 0;
-	u64 max_cost = 0;
+	int need_serialize;
 
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
-		/*
-		 * Decay the newidle max times here because this is a regular
-		 * visit to all the domains. Decay ~1% per second.
-		 */
-		if (time_after(jiffies, sd->next_decay_max_lb_cost)) {
-			sd->max_newidle_lb_cost =
-				(sd->max_newidle_lb_cost * 253) / 256;
-			sd->next_decay_max_lb_cost = jiffies + HZ;
-			need_decay = 1;
-		}
-		max_cost += sd->max_newidle_lb_cost;
 
 		/*
 		 * Stop the load balance at this level. There is another
 		 * CPU in our sched group which is doing load balancing more
 		 * actively.
 		 */
-		if (!continue_balancing) {
-			if (need_decay)
-				continue;
+		if (!continue_balancing)
 			break;
-		}
 
 		interval = get_sd_balance_interval(sd, busy);
 
@@ -10313,14 +10309,7 @@ out:
 			update_next_balance = 1;
 		}
 	}
-	if (need_decay) {
-		/*
-		 * Ensure the rq-wide value also decays but keep it at a
-		 * reasonable floor to avoid funnies with rq->avg_idle.
-		 */
-		rq->max_idle_balance_cost =
-			max((u64)sysctl_sched_migration_cost, max_cost);
-	}
+
 	rcu_read_unlock();
 
 	/*
@@ -10909,8 +10898,7 @@ static int newidle_balance(struct rq *th
 
 			t1 = sched_clock_cpu(this_cpu);
 			domain_cost = t1 - t0;
-			if (domain_cost > sd->max_newidle_lb_cost)
-				sd->max_newidle_lb_cost = domain_cost;
+			update_newidle_cost(&sd->max_newidle_lb_cost, domain_cost);
 
 			curr_cost += domain_cost;
 			t0 = t1;
@@ -10930,8 +10918,7 @@ static int newidle_balance(struct rq *th
 
 	raw_spin_rq_lock(this_rq);
 
-	if (curr_cost > this_rq->max_idle_balance_cost)
-		this_rq->max_idle_balance_cost = curr_cost;
+	update_newidle_cost(&this_rq->max_idle_balance_cost, curr_cost);
 
 	/*
 	 * While browsing the domains, we released the rq lock, a task could



* Re: [PATCH v2 4/4] sched/fair: Remove sysctl_sched_migration_cost condition
  2021-10-15 12:46 ` [PATCH v2 4/4] sched/fair: Remove sysctl_sched_migration_cost condition Vincent Guittot
@ 2021-10-15 17:42   ` Peter Zijlstra
  2021-10-15 17:57     ` Vincent Guittot
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2021-10-15 17:42 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, linux-kernel, tim.c.chen

On Fri, Oct 15, 2021 at 02:46:54PM +0200, Vincent Guittot wrote:
> With a default value of 500us, sysctl_sched_migration_cost is
> significantly higher than the cost of load_balance. Remove the
> condition and rely on the sd->max_newidle_lb_cost to abort
> newidle_balance.
> 
> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

This is separate and not folded in for ease of bisection?


* Re: [PATCH v2 4/4] sched/fair: Remove sysctl_sched_migration_cost condition
  2021-10-15 17:42   ` Peter Zijlstra
@ 2021-10-15 17:57     ` Vincent Guittot
  0 siblings, 0 replies; 10+ messages in thread
From: Vincent Guittot @ 2021-10-15 17:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, linux-kernel,
	Tim Chen

On Fri, 15 Oct 2021 at 19:42, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Oct 15, 2021 at 02:46:54PM +0200, Vincent Guittot wrote:
> > With a default value of 500us, sysctl_sched_migration_cost is
> > significantly higher than the cost of load_balance. Remove the
> > condition and rely on the sd->max_newidle_lb_cost to abort
> > newidle_balance.
> >
> > Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> This is separate and not folded in for ease of bisection?

yes, to ease bisection, and to ease a revert if a regression finally shows up


* Re: [PATCH v2 3/4] sched/fair: Wait before decaying max_newidle_lb_cost
  2021-10-15 17:40   ` Peter Zijlstra
@ 2021-10-15 18:02     ` Vincent Guittot
  2021-10-15 18:29       ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: Vincent Guittot @ 2021-10-15 18:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, linux-kernel,
	Tim Chen

On Fri, 15 Oct 2021 at 19:41, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Oct 15, 2021 at 02:46:53PM +0200, Vincent Guittot wrote:
> > Decay max_newidle_lb_cost only when it has not been updated for a while,
> > and ensure that a recently changed value is not decayed immediately.
>
> I was more thinking something along these lines; of course, no idea how
> well it actually behaves.
>
> Index: linux-2.6/include/linux/sched/topology.h
> ===================================================================
> --- linux-2.6.orig/include/linux/sched/topology.h
> +++ linux-2.6/include/linux/sched/topology.h
> @@ -98,7 +98,6 @@ struct sched_domain {
>
>         /* idle_balance() stats */
>         u64 max_newidle_lb_cost;
> -       unsigned long next_decay_max_lb_cost;
>
>         u64 avg_scan_cost;              /* select_idle_sibling */
>
> Index: linux-2.6/kernel/sched/fair.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched/fair.c
> +++ linux-2.6/kernel/sched/fair.c
> @@ -10241,6 +10241,17 @@ void update_max_interval(void)
>  }
>
>  /*
> + * Asymmetric IIR filter, 1/4th down, 3/4th up.
> + */
> +static void update_newidle_cost(u64 *cost, u64 new)
> +{
> +       s64 diff = new - *cost;
> +       if (diff > 0)
> +               diff *= 3;
> +       *cost += diff / 4;
> +}

I tried something similar based on update_avg() but there were some
performance regressions: some were linked to not jumping to the new max
directly (I assume balancing at some levels was started even though it
would have taken too much time), and some happened when the decay was
too quick

> +
> +/*
>   * It checks each scheduling domain to see if it is due to be balanced,
>   * and initiates a balancing operation if so.
>   *
> @@ -10256,33 +10267,18 @@ static void rebalance_domains(struct rq
>         /* Earliest time when we have to do rebalance again */
>         unsigned long next_balance = jiffies + 60*HZ;
>         int update_next_balance = 0;
> -       int need_serialize, need_decay = 0;
> -       u64 max_cost = 0;
> +       int need_serialize;
>
>         rcu_read_lock();
>         for_each_domain(cpu, sd) {
> -               /*
> -                * Decay the newidle max times here because this is a regular
> -                * visit to all the domains. Decay ~1% per second.
> -                */
> -               if (time_after(jiffies, sd->next_decay_max_lb_cost)) {
> -                       sd->max_newidle_lb_cost =
> -                               (sd->max_newidle_lb_cost * 253) / 256;
> -                       sd->next_decay_max_lb_cost = jiffies + HZ;
> -                       need_decay = 1;
> -               }
> -               max_cost += sd->max_newidle_lb_cost;
>
>                 /*
>                  * Stop the load balance at this level. There is another
>                  * CPU in our sched group which is doing load balancing more
>                  * actively.
>                  */
> -               if (!continue_balancing) {
> -                       if (need_decay)
> -                               continue;
> +               if (!continue_balancing)
>                         break;
> -               }
>
>                 interval = get_sd_balance_interval(sd, busy);
>
> @@ -10313,14 +10309,7 @@ out:
>                         update_next_balance = 1;
>                 }
>         }
> -       if (need_decay) {
> -               /*
> -                * Ensure the rq-wide value also decays but keep it at a
> -                * reasonable floor to avoid funnies with rq->avg_idle.
> -                */
> -               rq->max_idle_balance_cost =
> -                       max((u64)sysctl_sched_migration_cost, max_cost);
> -       }
> +
>         rcu_read_unlock();
>
>         /*
> @@ -10909,8 +10898,7 @@ static int newidle_balance(struct rq *th
>
>                         t1 = sched_clock_cpu(this_cpu);
>                         domain_cost = t1 - t0;
> -                       if (domain_cost > sd->max_newidle_lb_cost)
> -                               sd->max_newidle_lb_cost = domain_cost;
> +                       update_newidle_cost(&sd->max_newidle_lb_cost, domain_cost);
>
>                         curr_cost += domain_cost;
>                         t0 = t1;
> @@ -10930,8 +10918,7 @@ static int newidle_balance(struct rq *th
>
>         raw_spin_rq_lock(this_rq);
>
> -       if (curr_cost > this_rq->max_idle_balance_cost)
> -               this_rq->max_idle_balance_cost = curr_cost;
> +       update_newidle_cost(&this_rq->max_idle_balance_cost, curr_cost);
>
>         /*
>          * While browsing the domains, we released the rq lock, a task could
>


* Re: [PATCH v2 3/4] sched/fair: Wait before decaying max_newidle_lb_cost
  2021-10-15 18:02     ` Vincent Guittot
@ 2021-10-15 18:29       ` Peter Zijlstra
  0 siblings, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2021-10-15 18:29 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, linux-kernel,
	Tim Chen

On Fri, Oct 15, 2021 at 08:02:01PM +0200, Vincent Guittot wrote:
> On Fri, 15 Oct 2021 at 19:41, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Oct 15, 2021 at 02:46:53PM +0200, Vincent Guittot wrote:
> > > Decay max_newidle_lb_cost only when it has not been updated for a while,
> > > and ensure that a recently changed value is not decayed immediately.
> >
> > I was more thinking something along these lines; of course, no idea how
> > well it actually behaves.
> >
> > Index: linux-2.6/include/linux/sched/topology.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/sched/topology.h
> > +++ linux-2.6/include/linux/sched/topology.h
> > @@ -98,7 +98,6 @@ struct sched_domain {
> >
> >         /* idle_balance() stats */
> >         u64 max_newidle_lb_cost;
> > -       unsigned long next_decay_max_lb_cost;
> >
> >         u64 avg_scan_cost;              /* select_idle_sibling */
> >
> > Index: linux-2.6/kernel/sched/fair.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/sched/fair.c
> > +++ linux-2.6/kernel/sched/fair.c
> > @@ -10241,6 +10241,17 @@ void update_max_interval(void)
> >  }
> >
> >  /*
> > + * Asymmetric IIR filter, 1/4th down, 3/4th up.
> > + */
> > +static void update_newidle_cost(u64 *cost, u64 new)
> > +{
> > +       s64 diff = new - *cost;
> > +       if (diff > 0)
> > +               diff *= 3;
> > +       *cost += diff / 4;
> > +}
> 
> I tried something similar based on update_avg() but there were some
> performance regressions: some were linked to not jumping to the new max
> directly (I assume balancing at some levels was started even though it
> would have taken too much time), and some happened when the decay was
> too quick

Hmm, fair enough..

There's always something like:

       s64 diff = new - *cost;
       if (diff < 0)
               diff = 3*diff/256;
       *cost += diff;

Which jumps up instantly and decays rather more slowly. The advantage of
something like that, as I see it, is all those lines it deletes, but if
it doesn't actually work, it doesn't work.

Ah well. Thanks for trying.
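
The two variants can be compared side by side with a small userspace
program; this is a sketch for illustration only (typedefs added to make
it standalone, sample values invented):

/* Standalone sketch comparing the two filter variants from this thread
 * on a cost spike followed by cheap balances. Not kernel code. */
#include <stdio.h>
#include <stdint.h>

typedef int64_t s64;
typedef uint64_t u64;

/* First variant: asymmetric IIR, 1/4th down, 3/4th up. */
static void update_cost_iir(u64 *cost, u64 new)
{
	s64 diff = new - *cost;
	if (diff > 0)
		diff *= 3;
	*cost += diff / 4;
}

/* Second variant: jump up instantly, decay by 3/256 per step. */
static void update_cost_maxdecay(u64 *cost, u64 new)
{
	s64 diff = new - *cost;
	if (diff < 0)
		diff = 3 * diff / 256;
	*cost += diff;
}

int main(void)
{
	/* one 400us spike, then quiet 10us balances */
	u64 samples[] = { 400000, 10000, 10000, 10000, 10000 };
	u64 a = 0, b = 0;
	int i;

	for (i = 0; i < 5; i++) {
		update_cost_iir(&a, samples[i]);
		update_cost_maxdecay(&b, samples[i]);
		printf("sample %6llu  iir %6llu  max+decay %6llu\n",
		       (unsigned long long)samples[i],
		       (unsigned long long)a,
		       (unsigned long long)b);
	}
	return 0;
}

With these inputs the IIR reaches only 3/4 of the spike and then drops
by a quarter of the gap per call, while the second variant latches the
spike and bleeds off ~1.2% per call, which matches Vincent's point that
not jumping to the new max directly can hurt.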

