linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 0/5] Improve newidle lb cost tracking and early abort
@ 2021-10-19 12:35 Vincent Guittot
  2021-10-19 12:35 ` [PATCH v3 1/5] sched/fair: Account update_blocked_averages in newidle_balance cost Vincent Guittot
                   ` (8 more replies)
  0 siblings, 9 replies; 34+ messages in thread
From: Vincent Guittot @ 2021-10-19 12:35 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, linux-kernel, tim.c.chen
  Cc: Vincent Guittot

This patchset updates newidle lb cost tracking and early abort:

The time spent running update_blocked_averages is now accounted in the 1st
sched_domain level. This time can be significant and move the cost of
newidle lb above the avg_idle time.

The decay of max_newidle_lb_cost is modified to start only when the field
has not been updated for a while. A recent update will not be decayed
immediately but only after a while.

The condition of an avg_idle lower than sysctl_sched_migration_cost has
been removed as the 500us value is quite large and prevents opportunities to
pull a task on the newly idle CPU for at least the 1st domain levels.
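
Condensed into pseudo-C, the resulting newidle_balance() flow is the
following (a sketch of the logic described above, not the exact kernel
code):

        sd = rcu_dereference_check_sched_domain(this_rq->sd);
        if (!READ_ONCE(this_rq->rd->overload) ||
            (sd && this_rq->avg_idle < sd->max_newidle_lb_cost))
                goto out;       /* skip the lb and update_blocked_averages() */

        t0 = sched_clock_cpu(this_cpu);
        update_blocked_averages(this_cpu);      /* charged to the 1st level */

        for_each_domain(this_cpu, sd) {
                update_next_balance(sd, &next_balance);

                if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
                        break;                  /* early abort */

                if (sd->flags & SD_BALANCE_NEWIDLE) {
                        pulled_task = load_balance(this_cpu, this_rq, sd,
                                                   CPU_NEWLY_IDLE, ...);
                        t1 = sched_clock_cpu(this_cpu);
                        update_newidle_cost(sd, t1 - t0);
                        curr_cost += t1 - t0;
                        t0 = t1;
                }
        }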

Monitoring sd->max_newidle_lb_cost on cpu0 of an Arm64 system
THX2 (2 nodes * 28 cores * 4 cpus) during the benchmarks gives the
following results:
       min    avg   max
SMT:   1us   33us  273us - this one includes the update of blocked load
MC:    7us   49us  398us
NUMA: 10us   45us  158us


Some results for hackbench -l $LOOPS -g $group:
group      tip/sched/core     + this patchset
1           15.189(+/- 2%)       14.987(+/- 2%)  +1%
4            4.336(+/- 3%)        4.322(+/- 5%)  +0%
16           3.654(+/- 1%)        2.922(+/- 3%) +20%
32           3.209(+/- 1%)        2.919(+/- 3%)  +9%
64           2.965(+/- 1%)        2.826(+/- 1%)  +4%
128          2.954(+/- 1%)        2.993(+/- 8%)  -1%
256          2.951(+/- 1%)        2.894(+/- 1%)  +2%

tbench and reaim have not shown any difference

Change since v2:
- Update and decay of sd->last_decay_max_lb_cost are gathered in
  update_newidle_cost(). The behavior remains almost the same except that
  the decay can happen during newidle_balance now.

  Test results haven't shown any differences
  
  I haven't modified rq->max_idle_balance_cost. It acts as the max value
  for avg_idle and prevents the latter from reaching a high value during long
  idle phases. Moving to an IIR filter instead could delay the convergence
  of avg_idle to a reasonable value that reflects the current situation.

- Added a minor cleanup of newidle_balance

Change since v1:
- account the time spent in update_blocked_averages() in the 1st domain

- reduce the number of calls to sched_clock_cpu()

- change the way max_newidle_lb_cost is decayed. Peter suggested using an
  IIR but keeping track of the current max value gave the best result

- removed the condition (this_rq->avg_idle < sysctl_sched_migration_cost)
  as suggested by Peter

Vincent Guittot (5):
  sched/fair: Account update_blocked_averages in newidle_balance cost
  sched/fair: Skip update_blocked_averages if we are defering load
    balance
  sched/fair: Wait before decaying max_newidle_lb_cost
  sched/fair: Remove sysctl_sched_migration_cost condition
  sched/fair: cleanup newidle_balance

 include/linux/sched/topology.h |  2 +-
 kernel/sched/fair.c            | 65 ++++++++++++++++++++++------------
 kernel/sched/topology.c        |  2 +-
 3 files changed, 45 insertions(+), 24 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v3 1/5] sched/fair: Account update_blocked_averages in newidle_balance cost
  2021-10-19 12:35 [PATCH v3 0/5] Improve newidle lb cost tracking and early abort Vincent Guittot
@ 2021-10-19 12:35 ` Vincent Guittot
  2021-10-21  9:43   ` Mel Gorman
  2021-10-31 10:16   ` [tip: sched/core] " tip-bot2 for Vincent Guittot
  2021-10-19 12:35 ` [PATCH v3 2/5] sched/fair: Skip update_blocked_averages if we are defering load balance Vincent Guittot
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 34+ messages in thread
From: Vincent Guittot @ 2021-10-19 12:35 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, linux-kernel, tim.c.chen
  Cc: Vincent Guittot

The time spent to update the blocked load can be significant depending on
the complexity of the cgroup hierarchy. Take this time into account in
the cost of the 1st load balance of a newly idle cpu.

Also reduce the number of calls to sched_clock_cpu() and track more actual
work.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)
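
As an illustration of the timestamp chaining used below, here is a
standalone userspace analogue (not kernel code; now_ns() stands in for
sched_clock_cpu()). Each interval boundary costs a single clock read and
the first interval absorbs the preamble:

        #include <stdio.h>
        #include <time.h>

        static long long now_ns(void)
        {
                struct timespec ts;

                clock_gettime(CLOCK_MONOTONIC, &ts);
                return ts.tv_sec * 1000000000LL + ts.tv_nsec;
        }

        int main(void)
        {
                long long t0, t1;
                int d;

                t0 = now_ns();          /* single read before the preamble */
                /* ... preamble work, i.e. update_blocked_averages() ... */
                for (d = 0; d < 3; d++) {
                        /* ... per-domain work, i.e. load_balance() ... */
                        t1 = now_ns();
                        /* the 1st interval includes the preamble, by design */
                        printf("domain %d cost: %lld ns\n", d, t1 - t0);
                        t0 = t1;        /* reuse the end as the next start */
                }
                return 0;
        }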

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 87db481e8a56..c0145677ee99 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10840,9 +10840,9 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 {
 	unsigned long next_balance = jiffies + HZ;
 	int this_cpu = this_rq->cpu;
+	u64 t0, t1, curr_cost = 0;
 	struct sched_domain *sd;
 	int pulled_task = 0;
-	u64 curr_cost = 0;
 
 	update_misfit_status(NULL, this_rq);
 
@@ -10887,11 +10887,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 	raw_spin_rq_unlock(this_rq);
 
+	t0 = sched_clock_cpu(this_cpu);
 	update_blocked_averages(this_cpu);
+
 	rcu_read_lock();
 	for_each_domain(this_cpu, sd) {
 		int continue_balancing = 1;
-		u64 t0, domain_cost;
+		u64 domain_cost;
 
 		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
 			update_next_balance(sd, &next_balance);
@@ -10899,17 +10901,18 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 		}
 
 		if (sd->flags & SD_BALANCE_NEWIDLE) {
-			t0 = sched_clock_cpu(this_cpu);
 
 			pulled_task = load_balance(this_cpu, this_rq,
 						   sd, CPU_NEWLY_IDLE,
 						   &continue_balancing);
 
-			domain_cost = sched_clock_cpu(this_cpu) - t0;
+			t1 = sched_clock_cpu(this_cpu);
+			domain_cost = t1 - t0;
 			if (domain_cost > sd->max_newidle_lb_cost)
 				sd->max_newidle_lb_cost = domain_cost;
 
 			curr_cost += domain_cost;
+			t0 = t1;
 		}
 
 		update_next_balance(sd, &next_balance);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 2/5] sched/fair: Skip update_blocked_averages if we are defering load balance
  2021-10-19 12:35 [PATCH v3 0/5] Improve newidle lb cost tracking and early abort Vincent Guittot
  2021-10-19 12:35 ` [PATCH v3 1/5] sched/fair: Account update_blocked_averages in newidle_balance cost Vincent Guittot
@ 2021-10-19 12:35 ` Vincent Guittot
  2021-10-21  9:44   ` Mel Gorman
  2021-10-31 10:16   ` [tip: sched/core] " tip-bot2 for Vincent Guittot
  2021-10-19 12:35 ` [PATCH 3/5] sched/fair: Wait before decaying max_newidle_lb_cost Vincent Guittot
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 34+ messages in thread
From: Vincent Guittot @ 2021-10-19 12:35 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, linux-kernel, tim.c.chen
  Cc: Vincent Guittot

In newidle_balance(), the scheduler skips load balancing for the newly idle
cpu when, for the 1st sd of this_rq:

   this_rq->avg_idle < sd->max_newidle_lb_cost

Doing a costly call to update_blocked_averages() will not be useful and
simply adds overhead when this condition is true.

Check the condition early in newidle_balance() to skip
update_blocked_averages() when possible.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c0145677ee99..c4c36865321b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10873,17 +10873,20 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	 */
 	rq_unpin_lock(this_rq, rf);
 
+	rcu_read_lock();
+	sd = rcu_dereference_check_sched_domain(this_rq->sd);
+
 	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-	    !READ_ONCE(this_rq->rd->overload)) {
+	    !READ_ONCE(this_rq->rd->overload) ||
+	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
 
-		rcu_read_lock();
-		sd = rcu_dereference_check_sched_domain(this_rq->sd);
 		if (sd)
 			update_next_balance(sd, &next_balance);
 		rcu_read_unlock();
 
 		goto out;
 	}
+	rcu_read_unlock();
 
 	raw_spin_rq_unlock(this_rq);
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 3/5] sched/fair: Wait before decaying max_newidle_lb_cost
  2021-10-19 12:35 [PATCH v3 0/5] Improve newidle lb cost tracking and early abort Vincent Guittot
  2021-10-19 12:35 ` [PATCH v3 1/5] sched/fair: Account update_blocked_averages in newidle_balance cost Vincent Guittot
  2021-10-19 12:35 ` [PATCH v3 2/5] sched/fair: Skip update_blocked_averages if we are defering load balance Vincent Guittot
@ 2021-10-19 12:35 ` Vincent Guittot
  2021-10-21  9:45   ` Mel Gorman
                     ` (2 more replies)
  2021-10-19 12:35 ` [PATCH v3 4/5] sched/fair: Remove sysctl_sched_migration_cost condition Vincent Guittot
                   ` (5 subsequent siblings)
  8 siblings, 3 replies; 34+ messages in thread
From: Vincent Guittot @ 2021-10-19 12:35 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, linux-kernel, tim.c.chen
  Cc: Vincent Guittot

Decay max_newidle_lb_cost only when it has not been updated for a while
and ensure that a recently changed value is not decayed.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched/topology.h |  2 +-
 kernel/sched/fair.c            | 36 +++++++++++++++++++++++++---------
 kernel/sched/topology.c        |  2 +-
 3 files changed, 29 insertions(+), 11 deletions(-)
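
For reference, the 253/256 factor used below removes ~1.2% per step and is
applied at most once per second, so a max that stops being refreshed
roughly halves within a minute. A standalone demo of the arithmetic (an
illustration, not kernel code):

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
                uint64_t cost = 400;    /* say, a 400us max_newidle_lb_cost */
                int s;

                /* one decay step per second without update */
                for (s = 0; s < 60; s++)
                        cost = cost * 253 / 256;

                /* (253/256)^60 ~= 0.49; integer truncation shaves a bit
                 * more, so this prints about 175us */
                printf("after 60s without update: %llu us\n",
                       (unsigned long long)cost);
                return 0;
        }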

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 2f9166f6dec8..c07bfa2d80f2 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -105,7 +105,7 @@ struct sched_domain {
 
 	/* idle_balance() stats */
 	u64 max_newidle_lb_cost;
-	unsigned long next_decay_max_lb_cost;
+	unsigned long last_decay_max_lb_cost;
 
 	u64 avg_scan_cost;		/* select_idle_sibling */
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c4c36865321b..e50fd751e1df 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10239,6 +10239,30 @@ void update_max_interval(void)
 	max_load_balance_interval = HZ*num_online_cpus()/10;
 }
 
+static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
+{
+	if (cost > sd->max_newidle_lb_cost) {
+		/*
+		 * Track max cost of a domain to make sure to not delay the
+		 * next wakeup on the CPU.
+		 */
+		sd->max_newidle_lb_cost = cost;
+		sd->last_decay_max_lb_cost = jiffies;
+	} else if (time_after(jiffies, sd->last_decay_max_lb_cost + HZ)) {
+		/*
+		 * Decay the newidle max times by ~1% per second to ensure that
+		 * it is not outdated and the current max cost is actually
+		 * shorter.
+		 */
+		sd->max_newidle_lb_cost = (sd->max_newidle_lb_cost * 253) / 256;
+		sd->last_decay_max_lb_cost = jiffies;
+
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * It checks each scheduling domain to see if it is due to be balanced,
  * and initiates a balancing operation if so.
@@ -10262,14 +10286,9 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 	for_each_domain(cpu, sd) {
 		/*
 		 * Decay the newidle max times here because this is a regular
-		 * visit to all the domains. Decay ~1% per second.
+		 * visit to all the domains.
 		 */
-		if (time_after(jiffies, sd->next_decay_max_lb_cost)) {
-			sd->max_newidle_lb_cost =
-				(sd->max_newidle_lb_cost * 253) / 256;
-			sd->next_decay_max_lb_cost = jiffies + HZ;
-			need_decay = 1;
-		}
+		need_decay = update_newidle_cost(sd, 0);
 		max_cost += sd->max_newidle_lb_cost;
 
 		/*
@@ -10911,8 +10930,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 			t1 = sched_clock_cpu(this_cpu);
 			domain_cost = t1 - t0;
-			if (domain_cost > sd->max_newidle_lb_cost)
-				sd->max_newidle_lb_cost = domain_cost;
+			update_newidle_cost(sd, domain_cost);
 
 			curr_cost += domain_cost;
 			t0 = t1;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index e81246787560..30169c7685b6 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1568,7 +1568,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		.last_balance		= jiffies,
 		.balance_interval	= sd_weight,
 		.max_newidle_lb_cost	= 0,
-		.next_decay_max_lb_cost	= jiffies,
+		.last_decay_max_lb_cost	= jiffies,
 		.child			= child,
 #ifdef CONFIG_SCHED_DEBUG
 		.name			= tl->name,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 4/5] sched/fair: Remove sysctl_sched_migration_cost condition
  2021-10-19 12:35 [PATCH v3 0/5] Improve newidle lb cost tracking and early abort Vincent Guittot
                   ` (2 preceding siblings ...)
  2021-10-19 12:35 ` [PATCH 3/5] sched/fair: Wait before decaying max_newidle_lb_cost Vincent Guittot
@ 2021-10-19 12:35 ` Vincent Guittot
  2021-10-21  9:46   ` Mel Gorman
                     ` (2 more replies)
  2021-10-19 12:35 ` [PATCH v3 5/5] sched/fair: cleanup newidle_balance Vincent Guittot
                   ` (4 subsequent siblings)
  8 siblings, 3 replies; 34+ messages in thread
From: Vincent Guittot @ 2021-10-19 12:35 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, linux-kernel, tim.c.chen
  Cc: Vincent Guittot

With a default value of 500us, sysctl_sched_migration_cost is
significantly higher than the cost of load_balance. Remove the
condition and rely on the sd->max_newidle_lb_cost to abort
newidle_balance.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e50fd751e1df..57eae0ebc492 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10895,8 +10895,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	rcu_read_lock();
 	sd = rcu_dereference_check_sched_domain(this_rq->sd);
 
-	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-	    !READ_ONCE(this_rq->rd->overload) ||
+	if (!READ_ONCE(this_rq->rd->overload) ||
 	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
 
 		if (sd)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 5/5] sched/fair: cleanup newidle_balance
  2021-10-19 12:35 [PATCH v3 0/5] Improve newidle lb cost tracking and early abort Vincent Guittot
                   ` (3 preceding siblings ...)
  2021-10-19 12:35 ` [PATCH v3 4/5] sched/fair: Remove sysctl_sched_migration_cost condition Vincent Guittot
@ 2021-10-19 12:35 ` Vincent Guittot
  2021-10-21  9:46   ` Mel Gorman
  2021-10-31 10:16   ` [tip: sched/core] sched/fair: Cleanup newidle_balance tip-bot2 for Vincent Guittot
  2021-10-21  9:52 ` [PATCH v3 0/5] Improve newidle lb cost tracking and early abort Mel Gorman
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 34+ messages in thread
From: Vincent Guittot @ 2021-10-19 12:35 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, linux-kernel, tim.c.chen
  Cc: Vincent Guittot

update_next_balance() uses sd->last_balance, which is not modified by
load_balance(), so we can merge the 2 calls in one place.

No functional change

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 57eae0ebc492..13950beb01a2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10916,10 +10916,10 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 		int continue_balancing = 1;
 		u64 domain_cost;
 
-		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
-			update_next_balance(sd, &next_balance);
+		update_next_balance(sd, &next_balance);
+
+		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
 			break;
-		}
 
 		if (sd->flags & SD_BALANCE_NEWIDLE) {
 
@@ -10935,8 +10935,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 			t0 = t1;
 		}
 
-		update_next_balance(sd, &next_balance);
-
 		/*
 		 * Stop searching for tasks to pull if there are
 		 * now runnable tasks on this rq.
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 1/5] sched/fair: Account update_blocked_averages in newidle_balance cost
  2021-10-19 12:35 ` [PATCH v3 1/5] sched/fair: Account update_blocked_averages in newidle_balance cost Vincent Guittot
@ 2021-10-21  9:43   ` Mel Gorman
  2021-10-31 10:16   ` [tip: sched/core] " tip-bot2 for Vincent Guittot
  1 sibling, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2021-10-21  9:43 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	bristot, linux-kernel, tim.c.chen

On Tue, Oct 19, 2021 at 02:35:33PM +0200, Vincent Guittot wrote:
> The time spent to update the blocked load can be significant depending on
> the complexity of the cgroup hierarchy. Take this time into account in
> the cost of the 1st load balance of a newly idle cpu.
> 
> Also reduce the number of calls to sched_clock_cpu() and track more actual
> work.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 2/5] sched/fair: Skip update_blocked_averages if we are defering load balance
  2021-10-19 12:35 ` [PATCH v3 2/5] sched/fair: Skip update_blocked_averages if we are defering load balance Vincent Guittot
@ 2021-10-21  9:44   ` Mel Gorman
  2021-10-21 12:45     ` Vincent Guittot
  2021-10-31 10:16   ` [tip: sched/core] " tip-bot2 for Vincent Guittot
  1 sibling, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2021-10-21  9:44 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	bristot, linux-kernel, tim.c.chen

On Tue, Oct 19, 2021 at 02:35:34PM +0200, Vincent Guittot wrote:
> In newidle_balance(), the scheduler skips load balancing for the newly idle
> cpu when, for the 1st sd of this_rq:
> 
>    this_rq->avg_idle < sd->max_newidle_lb_cost
> 
> Doing a costly call to update_blocked_averages() will not be useful and
> simply adds overhead when this condition is true.
> 
> Check the condition early in newidle_balance() to skip
> update_blocked_averages() when possible.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>

The Signed-off-by seems to be in the wrong order but otherwise

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/5] sched/fair: Wait before decaying max_newidle_lb_cost
  2021-10-19 12:35 ` [PATCH 3/5] sched/fair: Wait before decaying max_newidle_lb_cost Vincent Guittot
@ 2021-10-21  9:45   ` Mel Gorman
  2021-10-29 10:01   ` Dietmar Eggemann
  2021-10-31 10:16   ` [tip: sched/core] " tip-bot2 for Vincent Guittot
  2 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2021-10-21  9:45 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	bristot, linux-kernel, tim.c.chen

Note that the change in the subject format causes b4 to miss this patch
when downloading the series as an mbox. Just something for the maintainers
to watch for if they use b4.

On Tue, Oct 19, 2021 at 02:35:35PM +0200, Vincent Guittot wrote:
> Decay max_newidle_lb_cost only when it has not been updated for a while
> and ensure that a recently changed value is not decayed.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 4/5] sched/fair: Remove sysctl_sched_migration_cost condition
  2021-10-19 12:35 ` [PATCH v3 4/5] sched/fair: Remove sysctl_sched_migration_cost condition Vincent Guittot
@ 2021-10-21  9:46   ` Mel Gorman
  2021-10-29 10:01   ` Dietmar Eggemann
  2021-10-31 10:16   ` [tip: sched/core] " tip-bot2 for Vincent Guittot
  2 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2021-10-21  9:46 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	bristot, linux-kernel, tim.c.chen

On Tue, Oct 19, 2021 at 02:35:36PM +0200, Vincent Guittot wrote:
> With a default value of 500us, sysctl_sched_migration_cost is
> significantly higher than the cost of load_balance. Remove the
> condition and rely on the sd->max_newidle_lb_cost to abort
> newidle_balance.
> 
> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 5/5] sched/fair: cleanup newidle_balance
  2021-10-19 12:35 ` [PATCH v3 5/5] sched/fair: cleanup newidle_balance Vincent Guittot
@ 2021-10-21  9:46   ` Mel Gorman
  2021-10-31 10:16   ` [tip: sched/core] sched/fair: Cleanup newidle_balance tip-bot2 for Vincent Guittot
  1 sibling, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2021-10-21  9:46 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	bristot, linux-kernel, tim.c.chen

On Tue, Oct 19, 2021 at 02:35:37PM +0200, Vincent Guittot wrote:
> update_next_balance() uses sd->last_balance, which is not modified by
> load_balance(), so we can merge the 2 calls in one place.
> 
> No functional change
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 0/5] Improve newidle lb cost tracking and early abort
  2021-10-19 12:35 [PATCH v3 0/5] Improve newidle lb cost tracking and early abort Vincent Guittot
                   ` (4 preceding siblings ...)
  2021-10-19 12:35 ` [PATCH v3 5/5] sched/fair: cleanup newidle_balance Vincent Guittot
@ 2021-10-21  9:52 ` Mel Gorman
  2021-10-21 12:42   ` Vincent Guittot
  2021-10-26 17:25 ` Tim Chen
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2021-10-21  9:52 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	bristot, linux-kernel, tim.c.chen

On Tue, Oct 19, 2021 at 02:35:32PM +0200, Vincent Guittot wrote:
> This patchset updates newidle lb cost tracking and early abort:
> 
> The time spent running update_blocked_averages is now accounted in the 1st
> sched_domain level. This time can be significant and move the cost of
> newidle lb above the avg_idle time.
> 
> The decay of max_newidle_lb_cost is modified to start only when the field
> has not been updated for a while. A recent update will not be decayed
> immediately but only after a while.
> 
> The condition of an avg_idle lower than sysctl_sched_migration_cost has
> been removed as the 500us value is quite large and prevents opportunities to
> pull a task on the newly idle CPU for at least the 1st domain levels.
> 
> Monitoring sd->max_newidle_lb_cost on cpu0 of an Arm64 system
> THX2 (2 nodes * 28 cores * 4 cpus) during the benchmarks gives the
> following results:
>        min    avg   max
> SMT:   1us   33us  273us - this one includes the update of blocked load
> MC:    7us   49us  398us
> NUMA: 10us   45us  158us
> 
> 
> Some results for hackbench -l $LOOPS -g $group:
> group      tip/sched/core     + this patchset
> 1           15.189(+/- 2%)       14.987(+/- 2%)  +1%
> 4            4.336(+/- 3%)        4.322(+/- 5%)  +0%
> 16           3.654(+/- 1%)        2.922(+/- 3%) +20%
> 32           3.209(+/- 1%)        2.919(+/- 3%)  +9%
> 64           2.965(+/- 1%)        2.826(+/- 1%)  +4%
> 128          2.954(+/- 1%)        2.993(+/- 8%)  -1%
> 256          2.951(+/- 1%)        2.894(+/- 1%)  +2%
> 

I read the patches earlier but had queued tests and was waiting on the results
before Acking. The hackbench results were not bad, not a universal win,
but it wins more than it loses, with small decreases in system CPU usage.

Most other results showed small gains or losses, nothing overly dramatic
and mostly within the noise.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 0/5] Improve newidle lb cost tracking and early abort
  2021-10-21  9:52 ` [PATCH v3 0/5] Improve newidle lb cost tracking and early abort Mel Gorman
@ 2021-10-21 12:42   ` Vincent Guittot
  0 siblings, 0 replies; 34+ messages in thread
From: Vincent Guittot @ 2021-10-21 12:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
	linux-kernel, Tim Chen

On Thu, 21 Oct 2021 at 11:52, Mel Gorman <mgorman@suse.de> wrote:
>
> On Tue, Oct 19, 2021 at 02:35:32PM +0200, Vincent Guittot wrote:
> > This patchset updates newidle lb cost tracking and early abort:
> >
> > The time spent running update_blocked_averages is now accounted in the 1st
> > sched_domain level. This time can be significant and move the cost of
> > newidle lb above the avg_idle time.
> >
> > The decay of max_newidle_lb_cost is modified to start only when the field
> > has not been updated for a while. A recent update will not be decayed
> > immediately but only after a while.
> >
> > The condition of an avg_idle lower than sysctl_sched_migration_cost has
> > been removed as the 500us value is quite large and prevents opportunities to
> > pull a task on the newly idle CPU for at least the 1st domain levels.
> >
> > Monitoring sd->max_newidle_lb_cost on cpu0 of an Arm64 system
> > THX2 (2 nodes * 28 cores * 4 cpus) during the benchmarks gives the
> > following results:
> >        min    avg   max
> > SMT:   1us   33us  273us - this one includes the update of blocked load
> > MC:    7us   49us  398us
> > NUMA: 10us   45us  158us
> >
> >
> > Some results for hackbench -l $LOOPS -g $group:
> > group      tip/sched/core     + this patchset
> > 1           15.189(+/- 2%)       14.987(+/- 2%)  +1%
> > 4            4.336(+/- 3%)        4.322(+/- 5%)  +0%
> > 16           3.654(+/- 1%)        2.922(+/- 3%) +20%
> > 32           3.209(+/- 1%)        2.919(+/- 3%)  +9%
> > 64           2.965(+/- 1%)        2.826(+/- 1%)  +4%
> > 128          2.954(+/- 1%)        2.993(+/- 8%)  -1%
> > 256          2.951(+/- 1%)        2.894(+/- 1%)  +2%
> >
>
> I read the patches earlier but had queued tests and was waiting on the results
> before Acking. The hackbench results were not bad, not a universal win,
> but it wins more than it loses, with small decreases in system CPU usage.

Thanks for running tests

>
> Most other results showed small gains or losses, nothing overly dramatic
> and mostly within the noise.
>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 2/5] sched/fair: Skip update_blocked_averages if we are defering load balance
  2021-10-21  9:44   ` Mel Gorman
@ 2021-10-21 12:45     ` Vincent Guittot
  0 siblings, 0 replies; 34+ messages in thread
From: Vincent Guittot @ 2021-10-21 12:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira,
	linux-kernel, Tim Chen

On Thu, 21 Oct 2021 at 11:44, Mel Gorman <mgorman@suse.de> wrote:
>
> On Tue, Oct 19, 2021 at 02:35:34PM +0200, Vincent Guittot wrote:
> > In newidle_balance(), the scheduler skips load balancing for the newly idle
> > cpu when, for the 1st sd of this_rq:
> >
> >    this_rq->avg_idle < sd->max_newidle_lb_cost
> >
> > Doing a costly call to update_blocked_averages() will not be useful and
> > simply adds overhead when this condition is true.
> >
> > Check the condition early in newidle_balance() to skip
> > update_blocked_averages() when possible.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>
> The Signed-off-by seems to be in the wrong order but otherwise

The right signoff sequence should be:

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

But I skipped the last one, which was only a rebase and my signoff was
already there.

>
> Acked-by: Mel Gorman <mgorman@suse.de>
>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 0/5] Improve newidle lb cost tracking and early abort
  2021-10-19 12:35 [PATCH v3 0/5] Improve newidle lb cost tracking and early abort Vincent Guittot
                   ` (5 preceding siblings ...)
  2021-10-21  9:52 ` [PATCH v3 0/5] Improve newidle lb cost tracking and early abort Mel Gorman
@ 2021-10-26 17:25 ` Tim Chen
  2021-10-27  8:49   ` Vincent Guittot
  2021-10-28 23:25 ` Joel Fernandes
  2021-10-29 10:01 ` Dietmar Eggemann
  8 siblings, 1 reply; 34+ messages in thread
From: Tim Chen @ 2021-10-26 17:25 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot, linux-kernel

On Tue, 2021-10-19 at 14:35 +0200, Vincent Guittot wrote:
> This patchset updates newidle lb cost tracking and early abort:
> 
> The time spent running update_blocked_averages is now accounted in
> the 1st
> sched_domain level. This time can be significant and move the cost of
> newidle lb above the avg_idle time.
> 
> The decay of max_newidle_lb_cost is modified to start only when the
> field
> has not been updated for a while. A recent update will not be decayed
> immediately but only after a while.
> 
> The condition of an avg_idle lower than sysctl_sched_migration_cost
> has
> been removed as the 500us value is quite large and prevents
> opportunities to
> pull a task on the newly idle CPU for at least the 1st domain levels.
> 
> Monitoring sd->max_newidle_lb_cost on cpu0 of an Arm64 system
> THX2 (2 nodes * 28 cores * 4 cpus) during the benchmarks gives the
> following results:
>        min    avg   max
> SMT:   1us   33us  273us - this one includes the update of blocked
> load
> MC:    7us   49us  398us
> NUMA: 10us   45us  158us
> 
> 
> Some results for hackbench -l $LOOPS -g $group:
> group      tip/sched/core     + this patchset
> 1           15.189(+/- 2%)       14.987(+/- 2%)  +1%
> 4            4.336(+/- 3%)        4.322(+/- 5%)  +0%
> 16           3.654(+/- 1%)        2.922(+/- 3%) +20%
> 32           3.209(+/- 1%)        2.919(+/- 3%)  +9%
> 64           2.965(+/- 1%)        2.826(+/- 1%)  +4%
> 128          2.954(+/- 1%)        2.993(+/- 8%)  -1%
> 256          2.951(+/- 1%)        2.894(+/- 1%)  +2%
> 
> tbench and reaim have not shown any difference
> 

Vincent,

Our benchmark team tested the patches for our OLTP benchmark
on a 2 socket Cascade Lake
with 28 cores/socket.  It is a smaller configuration
than the 2 socket Ice Lake we have tested previously, which has 40
cores/socket, so the overhead on update_blocked_averages is smaller
(~4%).

Here's a summary of the results:
					Relative Performance 
					(higher better)
5.15 rc4 vanilla (cgroup disabled)	100%
5.15 rc4 vanilla (cgroup enabled)	96%
patch v2				96%
patch v3				96%

We didn't see much change in performance from the patch set.

Looking at the profile on update_blocked_averages a bit more,
the majority of the calls to update_blocked_averages
happen in run_rebalance_domains.  And we are not
including the cost of update_blocked_averages for
run_rebalance_domains in our current patch set. I think
the patch set should account for that too.


      0.60%     0.00%             3  [kernel.vmlinux]    [k] run_rebalance_domains                                                                                                                                                  -      -            
            |          
             --0.59%--run_rebalance_domains
                       |          
                        --0.57%--update_blocked_averages

Thanks.

Tim


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 0/5] Improve newidle lb cost tracking and early abort
  2021-10-26 17:25 ` Tim Chen
@ 2021-10-27  8:49   ` Vincent Guittot
  2021-10-27 20:53     ` Tim Chen
  2021-10-29 17:00     ` Tim Chen
  0 siblings, 2 replies; 34+ messages in thread
From: Vincent Guittot @ 2021-10-27  8:49 UTC (permalink / raw)
  To: Tim Chen
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, linux-kernel

On Tue, 26 Oct 2021 at 19:25, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> On Tue, 2021-10-19 at 14:35 +0200, Vincent Guittot wrote:
> > This patchset updates newidle lb cost tracking and early abort:
> >
> > The time spent running update_blocked_averages is now accounted in
> > the 1st
> > sched_domain level. This time can be significant and move the cost of
> > newidle lb above the avg_idle time.
> >
> > The decay of max_newidle_lb_cost is modified to start only when the
> > field
> > has not been updated for a while. A recent update will not be decayed
> > immediately but only after a while.
> >
> > The condition of an avg_idle lower than sysctl_sched_migration_cost
> > has
> > been removed as the 500us value is quite large and prevents
> > opportunities to
> > pull a task on the newly idle CPU for at least the 1st domain levels.
> >
> > Monitoring sd->max_newidle_lb_cost on cpu0 of an Arm64 system
> > THX2 (2 nodes * 28 cores * 4 cpus) during the benchmarks gives the
> > following results:
> >        min    avg   max
> > SMT:   1us   33us  273us - this one includes the update of blocked
> > load
> > MC:    7us   49us  398us
> > NUMA: 10us   45us  158us
> >
> >
> > Some results for hackbench -l $LOOPS -g $group:
> > group      tip/sched/core     + this patchset
> > 1           15.189(+/- 2%)       14.987(+/- 2%)  +1%
> > 4            4.336(+/- 3%)        4.322(+/- 5%)  +0%
> > 16           3.654(+/- 1%)        2.922(+/- 3%) +20%
> > 32           3.209(+/- 1%)        2.919(+/- 3%)  +9%
> > 64           2.965(+/- 1%)        2.826(+/- 1%)  +4%
> > 128          2.954(+/- 1%)        2.993(+/- 8%)  -1%
> > 256          2.951(+/- 1%)        2.894(+/- 1%)  +2%
> >
> > tbench and reaim have not shown any difference
> >
>
> Vincent,
>
> Our benchmark team tested the patches for our OLTP benchmark
> on a 2 socket Cascade Lake
> with 28 cores/socket.  It is a smaller configuration
> than the 2 socket Ice Lake we have tested previously, which has 40
> cores/socket, so the overhead on update_blocked_averages is smaller
> (~4%).
>
> Here's a summary of the results:
>                                         Relative Performance
>                                         (higher better)
> 5.15 rc4 vanilla (cgroup disabled)      100%
> 5.15 rc4 vanilla (cgroup enabled)       96%
> patch v2                                96%
> patch v3                                96%
>
> We didn't see much change in performance from the patch set.

Thanks for testing.
I was not expecting this patch to fix all your problems, but at least
those that were highlighted during previous exchanges.

A few problems still remain in your case, if I'm not wrong:
There is a patch that ensures that rq->next_balance is never set in the past.
IIRC, you also mentioned in a previous thread that the idle LB
happening during the tick just after the new idle balance was a
problem in your case, which is probably linked to your figures below

>
> Looking at the profile on update_blocked_averages a bit more,
> the majority of the calls to update_blocked_averages
> happen in run_rebalance_domains.  And we are not
> including the cost of update_blocked_averages for
> run_rebalance_domains in our current patch set. I think
> the patch set should account for that too.

nohz_newidle_balance keeps using sysctl_sched_migration_cost to
trigger a _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK, CPU_IDLE);
This would probably benefit from taking the cost of
update_blocked_averages into account instead.
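
For reference, the gate in question in nohz_newidle_balance() is roughly:

        /* Will wake up very soon. No time for doing anything else */
        if (this_rq->avg_idle < sysctl_sched_migration_cost)
                return;
        ...
        /* trigger an ILB to update blocked load before entering idle */
        atomic_or(NOHZ_NEWILB_KICK, &nohz.flags);
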

>
>
>       0.60%     0.00%             3  [kernel.vmlinux]    [k] run_rebalance_domains                                                                                                                                                  -      -
>             |
>              --0.59%--run_rebalance_domains
>                        |
>                         --0.57%--update_blocked_averages
>
> Thanks.
>
> Tim
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 0/5] Improve newidle lb cost tracking and early abort
  2021-10-27  8:49   ` Vincent Guittot
@ 2021-10-27 20:53     ` Tim Chen
  2021-10-28 12:15       ` Vincent Guittot
  2021-10-29 17:00     ` Tim Chen
  1 sibling, 1 reply; 34+ messages in thread
From: Tim Chen @ 2021-10-27 20:53 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, linux-kernel

On Wed, 2021-10-27 at 10:49 +0200, Vincent Guittot wrote:
> 
> > Looking at the profile on update_blocked_averages a bit more,
> > the majority of the calls to update_blocked_averages
> > happen in run_rebalance_domains.  And we are not
> > including the cost of update_blocked_averages for
> > run_rebalance_domains in our current patch set. I think
> > the patch set should account for that too.
> 
> nohz_newidle_balance keeps using sysctl_sched_migration_cost to
> trigger a _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK, CPU_IDLE);
> This would probably benefit from taking the cost of
> update_blocked_averages into account instead.
> 

For the case where

	this_rq->avg_idle < sysctl_sched_migration_cost

in newidle_balance(), we skip to the out: label

out:
        /* Move the next balance forward */
        if (time_after(this_rq->next_balance, next_balance))
                this_rq->next_balance = next_balance;

        if (pulled_task)
                this_rq->idle_stamp = 0;
        else
                nohz_newidle_balance(this_rq);

and we call nohz_newidle_balance as we don't have a pulled_task.

It seems to make sense to skip the call
to nohz_newidle_balance() for this case?  
We expect a very short idle and a task to wake shortly.  
So we do not have to pull a task
to this idle cpu and incur the migration cost.

Tim





^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 0/5] Improve newidle lb cost tracking and early abort
  2021-10-27 20:53     ` Tim Chen
@ 2021-10-28 12:15       ` Vincent Guittot
  2021-10-28 21:36         ` Tim Chen
  0 siblings, 1 reply; 34+ messages in thread
From: Vincent Guittot @ 2021-10-28 12:15 UTC (permalink / raw)
  To: Tim Chen
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, linux-kernel

On Wed, 27 Oct 2021 at 13:53:32 (-0700), Tim Chen wrote:
> On Wed, 2021-10-27 at 10:49 +0200, Vincent Guittot wrote:
> > 
> > > Looking at the profile on update_blocked_averages a bit more,
> > > the majority of the calls to update_blocked_averages
> > > happen in run_rebalance_domains.  And we are not
> > > including the cost of update_blocked_averages for
> > > run_rebalance_domains in our current patch set. I think
> > > the patch set should account for that too.
> > 
> > nohz_newidle_balance keeps using sysctl_sched_migration_cost to
> > trigger a _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK, CPU_IDLE);
> > This would probably benefit from taking the cost of
> > update_blocked_averages into account instead.
> > 
> 
> For the case where
> 
> 	this_rq->avg_idle < sysctl_sched_migration_cost
> 
> in newidle_balance(), we skip to the out: label
> 
> out:
>         /* Move the next balance forward */
>         if (time_after(this_rq->next_balance, next_balance))
>                 this_rq->next_balance = next_balance;
> 
>         if (pulled_task)
>                 this_rq->idle_stamp = 0;
>         else
>                 nohz_newidle_balance(this_rq);
> 
> and we call nohz_newidle_balance as we don't have a pulled_task.
> 
> It seems to make sense to skip the call
> to nohz_newidle_balance() for this case? 

nohz_newidle_balance() also tests this condition:
(this_rq->avg_idle < sysctl_sched_migration_cost)
and doesn't set NOHZ_NEWILB_KICK in such a case.

But this patch now uses the condition:
this_rq->avg_idle < sd->max_newidle_lb_cost
and sd->max_newidle_lb_cost can be higher than sysctl_sched_migration_cost

which means that we can set NOHZ_NEWILB_KICK:
- although we decided to skip the newidle loop
- or when we abort because this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost

This is even more true when sysctl_sched_migration_cost is lowered, which is your case IIRC.

The patch below ensures that we don't set NOHZ_NEWILB_KICK in such cases:

---
 kernel/sched/fair.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c19f4bb3df1a..36ddae208959 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10779,7 +10779,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	int this_cpu = this_rq->cpu;
 	u64 t0, t1, curr_cost = 0;
 	struct sched_domain *sd;
-	int pulled_task = 0;
+	int pulled_task = 0, early_stop = 0;
 
 	update_misfit_status(NULL, this_rq);
 
@@ -10816,8 +10816,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	if (!READ_ONCE(this_rq->rd->overload) ||
 	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
 
-		if (sd)
+		if (sd) {
 			update_next_balance(sd, &next_balance);
+
+			/*
+			 * We skip new idle LB because there is not enough
+			 * time before next wake up. Make sure that we will
+			 * not kick NOHZ_NEWILB_KICK
+			 */
+			early_stop = 1;
+		}
 		rcu_read_unlock();
 
 		goto out;
@@ -10836,8 +10844,10 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 		update_next_balance(sd, &next_balance);
 
-		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
+		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
+			early_stop = 1;
 			break;
+		}
 
 		if (sd->flags & SD_BALANCE_NEWIDLE) {
 
@@ -10887,7 +10897,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 	if (pulled_task)
 		this_rq->idle_stamp = 0;
-	else
+	else if (!early_stop)
 		nohz_newidle_balance(this_rq);
 
 	rq_repin_lock(this_rq, rf);
-- 

> We expect a very short idle and a task to wake shortly.  
> So we do not have to pull a task
> to this idle cpu and incur the migration cost.
> 
> Tim
> 
> 
> 
> 

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 0/5] Improve newidle lb cost tracking and early abort
  2021-10-28 12:15       ` Vincent Guittot
@ 2021-10-28 21:36         ` Tim Chen
  0 siblings, 0 replies; 34+ messages in thread
From: Tim Chen @ 2021-10-28 21:36 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, linux-kernel

On Thu, 2021-10-28 at 14:15 +0200, Vincent Guittot wrote:
> 
> > It seems to make sense to skip the call
> > to nohz_newidle_balance() for this case? 
> 
> nohz_newidle_balance() also tests this condition:
> (this_rq->avg_idle < sysctl_sched_migration_cost)
> and doesn't set NOHZ_NEWILB_KICK in such a case.
> 
> But this patch now uses the condition:
> this_rq->avg_idle < sd->max_newidle_lb_cost
> and sd->max_newidle_lb_cost can be higher than
> sysctl_sched_migration_cost
> 
> which means that we can set NOHZ_NEWILB_KICK:
> - although we decided to skip the newidle loop
> - or when we abort because this_rq->avg_idle < curr_cost +
>   sd->max_newidle_lb_cost
> 
> This is even more true when sysctl_sched_migration_cost is lowered,
> which is your case IIRC.
> 
> The patch below ensures that we don't set NOHZ_NEWILB_KICK in such
> cases:
> 

Thanks. Will ask our benchmark team to give it a spin.

Tim


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 0/5] Improve newidle lb cost tracking and early abort
  2021-10-19 12:35 [PATCH v3 0/5] Improve newidle lb cost tracking and early abort Vincent Guittot
                   ` (6 preceding siblings ...)
  2021-10-26 17:25 ` Tim Chen
@ 2021-10-28 23:25 ` Joel Fernandes
  2021-10-29  7:49   ` Vincent Guittot
  2021-10-29 10:01 ` Dietmar Eggemann
  8 siblings, 1 reply; 34+ messages in thread
From: Joel Fernandes @ 2021-10-28 23:25 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, linux-kernel, tim.c.chen, joelaf, dianders,
	qais.yousef, Chris.Redpath

Hi Vincent, Peter,

On Tue, Oct 19, 2021 at 02:35:32PM +0200, Vincent Guittot wrote:
> This patchset updates newidle lb cost tracking and early abort:
> 
> The time spent running update_blocked_averages is now accounted in the 1st
> sched_domain level. This time can be significant and move the cost of
> newidle lb above the avg_idle time.
> 
> The decay of max_newidle_lb_cost is modified to start only when the field
> has not been updated for a while. A recent update will not be decayed
> immediately but only after a while.
> 
> The condition of an avg_idle lower than sysctl_sched_migration_cost has
> been removed as the 500us value is quite large and prevents opportunities to
> pull a task on the newly idle CPU for at least the 1st domain levels.

It appears this series is not yet in upstream Linus's tree. What's the latest on it?

I often see on ARM64 devices that load balance is skipped due to the
high sysctl_sched_migration_cost. I saw another thread as well where
someone complained the performance varies and the default might be too high:
https://lkml.org/lkml/2021/9/14/150

Any other thoughts? Happy to help on any progress on this series as well. Thanks,

 - Joel

> 
> Monitoring sd->max_newidle_lb_cost on cpu0 of an Arm64 system
> THX2 (2 nodes * 28 cores * 4 cpus) during the benchmarks gives the
> following results:
>        min    avg   max
> SMT:   1us   33us  273us - this one includes the update of blocked load
> MC:    7us   49us  398us
> NUMA: 10us   45us  158us
> 
> 
> Some results for hackbench -l $LOOPS -g $group:
> group      tip/sched/core     + this patchset
> 1           15.189(+/- 2%)       14.987(+/- 2%)  +1%
> 4            4.336(+/- 3%)        4.322(+/- 5%)  +0%
> 16           3.654(+/- 1%)        2.922(+/- 3%) +20%
> 32           3.209(+/- 1%)        2.919(+/- 3%)  +9%
> 64           2.965(+/- 1%)        2.826(+/- 1%)  +4%
> 128          2.954(+/- 1%)        2.993(+/- 8%)  -1%
> 256          2.951(+/- 1%)        2.894(+/- 1%)  +2%
> 
> tbench and reaim have not shown any difference
> 
> Change since v2:
> - Update and decay of sd->last_decay_max_lb_cost are gathered in
>   update_newidle_cost(). The behavior remains almost the same except that
>   the decay can happen during newidle_balance now.
> 
>   Test results haven't shown any differences
>   
>   I haven't modified rq->max_idle_balance_cost. It acts as the max value
>   for avg_idle and prevents the latter from reaching a high value during long
>   idle phases. Moving to an IIR filter instead could delay the convergence
>   of avg_idle to a reasonable value that reflects the current situation.
> 
> - Added a minor cleanup of newidle_balance
> 
> Change since v1:
> - account the time spent in update_blocked_averages() in the 1st domain
> 
> - reduce the number of calls to sched_clock_cpu()
> 
> - change the way max_newidle_lb_cost is decayed. Peter suggested using an
>   IIR but keeping track of the current max value gave the best result
> 
> - removed the condition (this_rq->avg_idle < sysctl_sched_migration_cost)
>   as suggested by Peter
> 
> Vincent Guittot (5):
>   sched/fair: Account update_blocked_averages in newidle_balance cost
>   sched/fair: Skip update_blocked_averages if we are defering load
>     balance
>   sched/fair: Wait before decaying max_newidle_lb_cost
>   sched/fair: Remove sysctl_sched_migration_cost condition
>   sched/fair: cleanup newidle_balance
> 
>  include/linux/sched/topology.h |  2 +-
>  kernel/sched/fair.c            | 65 ++++++++++++++++++++++------------
>  kernel/sched/topology.c        |  2 +-
>  3 files changed, 45 insertions(+), 24 deletions(-)
> 
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 0/5] Improve newidle lb cost tracking and early abort
  2021-10-28 23:25 ` Joel Fernandes
@ 2021-10-29  7:49   ` Vincent Guittot
  2021-10-29 19:37     ` Joel Fernandes
  0 siblings, 1 reply; 34+ messages in thread
From: Vincent Guittot @ 2021-10-29  7:49 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, linux-kernel, tim.c.chen, joelaf, dianders,
	qais.yousef, Chris.Redpath, Yicong Yang, Barry Song

On Fri, 29 Oct 2021 at 01:25, Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Hi Vincent, Peter,
>
> On Tue, Oct 19, 2021 at 02:35:32PM +0200, Vincent Guittot wrote:
> > This patchset updates newidle lb cost tracking and early abort:
> >
> > The time spent running update_blocked_averages is now accounted in the 1st
> > sched_domain level. This time can be significant and move the cost of
> > newidle lb above the avg_idle time.
> >
> > The decay of max_newidle_lb_cost is modified to start only when the field
> > has not been updated for a while. A recent update will not be decayed
> > immediately but only after a while.
> >
> > The condition of an avg_idle lower than sysctl_sched_migration_cost has
> > been removed as the 500us value is quite large and prevents opportunities to
> > pull a task on the newly idle CPU for at least the 1st domain levels.
>
> It appears this series is not yet in upstream Linus's tree. What's the latest on it?
>

I sent an add-on yesterday to cover the cases that Tim cares about.

> I often see on ARM64 devices that load balance is skipped due to the
> high sysctl_sched_migration_cost. I saw another thread as well where

Have you tested the patchset? Does it enable more load balancing on
your platform?

> someone complained the performance varies and the default might be too high:
> https://lkml.org/lkml/2021/9/14/150

Added Yicong and Barry to the list.

>
> Any other thoughts? Happy to help on any progress on this series as well. Thanks,
>
>  - Joel
>
> >
> > Monitoring sd->max_newidle_lb_cost on cpu0 of an Arm64 system
> > THX2 (2 nodes * 28 cores * 4 cpus) during the benchmarks gives the
> > following results:
> >        min    avg   max
> > SMT:   1us   33us  273us - this one includes the update of blocked load
> > MC:    7us   49us  398us
> > NUMA: 10us   45us  158us
> >
> >
> > Some results for hackbench -l $LOOPS -g $group:
> > group      tip/sched/core     + this patchset
> > 1           15.189(+/- 2%)       14.987(+/- 2%)  +1%
> > 4            4.336(+/- 3%)        4.322(+/- 5%)  +0%
> > 16           3.654(+/- 1%)        2.922(+/- 3%) +20%
> > 32           3.209(+/- 1%)        2.919(+/- 3%)  +9%
> > 64           2.965(+/- 1%)        2.826(+/- 1%)  +4%
> > 128          2.954(+/- 1%)        2.993(+/- 8%)  -1%
> > 256          2.951(+/- 1%)        2.894(+/- 1%)  +2%
> >
> > tbench and reaim have not shown any difference
> >
> > Change since v2:
> > - Update and decay of sd->last_decay_max_lb_cost are gathered in
> >   update_newidle_cost(). The behavior remains almost the same except that
> >   the decay can happen during newidle_balance now.
> >
> >   Test results haven't shown any differences
> >
> >   I haven't modified rq->max_idle_balance_cost. It acts as the max value
> >   for avg_idle and prevents the latter from reaching a high value during long
> >   idle phases. Moving to an IIR filter instead could delay the convergence
> >   of avg_idle to a reasonable value that reflects the current situation.
> >
> > - Added a minor cleanup of newidle_balance
> >
> > Change since v1:
> > - account the time spent in update_blocked_averages() in the 1st domain
> >
> > - reduce the number of calls to sched_clock_cpu()
> >
> > - change the way max_newidle_lb_cost is decayed. Peter suggested using an
> >   IIR but keeping track of the current max value gave the best result
> >
> > - removed the condition (this_rq->avg_idle < sysctl_sched_migration_cost)
> >   as suggested by Peter
> >
> > Vincent Guittot (5):
> >   sched/fair: Account update_blocked_averages in newidle_balance cost
> >   sched/fair: Skip update_blocked_averages if we are defering load
> >     balance
> >   sched/fair: Wait before decaying max_newidle_lb_cost
> >   sched/fair: Remove sysctl_sched_migration_cost condition
> >   sched/fair: cleanup newidle_balance
> >
> >  include/linux/sched/topology.h |  2 +-
> >  kernel/sched/fair.c            | 65 ++++++++++++++++++++++------------
> >  kernel/sched/topology.c        |  2 +-
> >  3 files changed, 45 insertions(+), 24 deletions(-)
> >
> > --
> > 2.17.1
> >

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 0/5] Improve newidle lb cost tracking and early abort
  2021-10-19 12:35 [PATCH v3 0/5] Improve newidle lb cost tracking and early abort Vincent Guittot
                   ` (7 preceding siblings ...)
  2021-10-28 23:25 ` Joel Fernandes
@ 2021-10-29 10:01 ` Dietmar Eggemann
  8 siblings, 0 replies; 34+ messages in thread
From: Dietmar Eggemann @ 2021-10-29 10:01 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, rostedt, bsegall,
	mgorman, bristot, linux-kernel, tim.c.chen

On 19/10/2021 14:35, Vincent Guittot wrote:
> This patchset updates newidle lb cost tracking and early abort:
> 
> The time spent running update_blocked_averages is now accounted in the 1st
> sched_domain level. This time can be significant and move the cost of
> newidle lb above the avg_idle time.
> 
> The decay of max_newidle_lb_cost is modified to start only when the field
> has not been updated for a while. A recent update will not be decayed
> immediately but only after a while.
> 
> The condition of an avg_idle lower than sysctl_sched_migration_cost has
> been removed as the 500us value is quite large and prevents opportunities to
> pull a task on the newly idle CPU for at least the 1st domain levels.
> 
> Monitoring sd->max_newidle_lb_cost on cpu0 of an Arm64 system
> THX2 (2 nodes * 28 cores * 4 cpus) during the benchmarks gives the
> following results:
>        min    avg   max
> SMT:   1us   33us  273us - this one includes the update of blocked load
> MC:    7us   49us  398us
> NUMA: 10us   45us  158us
> 
> 
> Some results for hackbench -l $LOOPS -g $group:
> group      tip/sched/core     + this patchset
> 1           15.189(+/- 2%)       14.987(+/- 2%)  +1%
> 4            4.336(+/- 3%)        4.322(+/- 5%)  +0%
> 16           3.654(+/- 1%)        2.922(+/- 3%) +20%
> 32           3.209(+/- 1%)        2.919(+/- 3%)  +9%
> 64           2.965(+/- 1%)        2.826(+/- 1%)  +4%
> 128          2.954(+/- 1%)        2.993(+/- 8%)  -1%
> 256          2.951(+/- 1%)        2.894(+/- 1%)  +2%
> 
> tbench and reaim have not shown any difference
> 
> Change since v2:
> - Update and decay of sd->last_decay_max_lb_cost are gathered in
>   update_newidle_cost(). The behavior remains almost the same except that
>   the decay can happen during newidle_balance now.
> 
>   Test results haven't shown any differences
>   
>   I haven't modified rq->max_idle_balance_cost. It acts as the max value
>   for avg_idle and prevents the latter from reaching a high value during long
>   idle phases. Moving to an IIR filter instead could delay the convergence
>   of avg_idle to a reasonable value that reflects the current situation.
> 
> - Added a minor cleanup of newidle_balance
> 
> Change since v1:
> - account the time spent in update_blocked_averages() in the 1st domain
> 
> - reduce the number of calls to sched_clock_cpu()
> 
> - change the way max_newidle_lb_cost is decayed. Peter suggested using an
>   IIR but keeping track of the current max value gave the best result
> 
> - removed the condition (this_rq->avg_idle < sysctl_sched_migration_cost)
>   as suggested by Peter
> 
> Vincent Guittot (5):
>   sched/fair: Account update_blocked_averages in newidle_balance cost
>   sched/fair: Skip update_blocked_averages if we are defering load
>     balance
>   sched/fair: Wait before decaying max_newidle_lb_cost
>   sched/fair: Remove sysctl_sched_migration_cost condition
>   sched/fair: cleanup newidle_balance
> 
>  include/linux/sched/topology.h |  2 +-
>  kernel/sched/fair.c            | 65 ++++++++++++++++++++++------------
>  kernel/sched/topology.c        |  2 +-
>  3 files changed, 45 insertions(+), 24 deletions(-)

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

LGTM, just a couple of questions in 3/5 and 4/5.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/5] sched/fair: Wait before decaying max_newidle_lb_cost
  2021-10-19 12:35 ` [PATCH 3/5] sched/fair: Wait before decaying max_newidle_lb_cost Vincent Guittot
  2021-10-21  9:45   ` Mel Gorman
@ 2021-10-29 10:01   ` Dietmar Eggemann
  2021-10-29 12:19     ` Vincent Guittot
  2021-10-31 10:16   ` [tip: sched/core] " tip-bot2 for Vincent Guittot
  2 siblings, 1 reply; 34+ messages in thread
From: Dietmar Eggemann @ 2021-10-29 10:01 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, rostedt, bsegall,
	mgorman, bristot, linux-kernel, tim.c.chen

On 19/10/2021 14:35, Vincent Guittot wrote:
> Decay max_newidle_lb_cost only when it has not been updated for a while,
> and ensure that a recently changed value is not decayed.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  include/linux/sched/topology.h |  2 +-
>  kernel/sched/fair.c            | 36 +++++++++++++++++++++++++---------
>  kernel/sched/topology.c        |  2 +-
>  3 files changed, 29 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 2f9166f6dec8..c07bfa2d80f2 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -105,7 +105,7 @@ struct sched_domain {
>  
>  	/* idle_balance() stats */
>  	u64 max_newidle_lb_cost;
> -	unsigned long next_decay_max_lb_cost;
> +	unsigned long last_decay_max_lb_cost;
>  
>  	u64 avg_scan_cost;		/* select_idle_sibling */
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c4c36865321b..e50fd751e1df 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10239,6 +10239,30 @@ void update_max_interval(void)
>  	max_load_balance_interval = HZ*num_online_cpus()/10;
>  }
>  
> +static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
> +{
> +	if (cost > sd->max_newidle_lb_cost) {
> +		/*
> +		 * Track max cost of a domain to make sure to not delay the
> +		 * next wakeup on the CPU.
> +		 */
> +		sd->max_newidle_lb_cost = cost;
> +		sd->last_decay_max_lb_cost = jiffies;

That's the actual change of the patch: sd->last_decay_max_lb_cost is now
also moved forward when the newidle cost is updated from newidle_balance()?

> +	} else if (time_after(jiffies, sd->last_decay_max_lb_cost + HZ)) {
> +		/*
> +		 * Decay the newidle max times by ~1% per second to ensure that
> +		 * it is not outdated and the current max cost is actually
> +		 * shorter.

I assume that `max cost` refers here to a local variable of the only
caller of update_newidle_cost(..., 0) - rebalance_domains()?

"the current max cost" has to be shorter so that
rq->max_idle_balance_cost also decays in this case. Is this what this
comment tries to say here?

[...]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 4/5] sched/fair: Remove sysctl_sched_migration_cost condition
  2021-10-19 12:35 ` [PATCH v3 4/5] sched/fair: Remove sysctl_sched_migration_cost condition Vincent Guittot
  2021-10-21  9:46   ` Mel Gorman
@ 2021-10-29 10:01   ` Dietmar Eggemann
  2021-10-29 12:30     ` Vincent Guittot
  2021-10-31 10:16   ` [tip: sched/core] " tip-bot2 for Vincent Guittot
  2 siblings, 1 reply; 34+ messages in thread
From: Dietmar Eggemann @ 2021-10-29 10:01 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, rostedt, bsegall,
	mgorman, bristot, linux-kernel, tim.c.chen

On 19/10/2021 14:35, Vincent Guittot wrote:
> With a default value of 500us, sysctl_sched_migration_cost is
> significantly higher than the cost of load_balance. Remove the

Shouldn't this rather be `load balance cost on the lowest sd`? I assume
here that lb cost stands for the sd->max_newidle_lb_cost of the 1st sd.

We still use sysctl_sched_migration_cost as a floor against max_cost
(i.e. lb cost of all sd's) when setting rq->max_idle_balance_cost in
rebalance_domains().

And in the add-on discussion (disabling the call to
nohz_newidle_balance()), you mention that sd->max_newidle_lb_cost can be
higher than sysctl_sched_migration_cost (even with the default 500us).


> condition and rely on the sd->max_newidle_lb_cost to abort
> newidle_balance.
> 
> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e50fd751e1df..57eae0ebc492 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10895,8 +10895,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  	rcu_read_lock();
>  	sd = rcu_dereference_check_sched_domain(this_rq->sd);
>  
> -	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> -	    !READ_ONCE(this_rq->rd->overload) ||
> +	if (!READ_ONCE(this_rq->rd->overload) ||
>  	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>  
>  		if (sd)
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/5] sched/fair: Wait before decaying max_newidle_lb_cost
  2021-10-29 10:01   ` Dietmar Eggemann
@ 2021-10-29 12:19     ` Vincent Guittot
  0 siblings, 0 replies; 34+ messages in thread
From: Vincent Guittot @ 2021-10-29 12:19 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot,
	linux-kernel, tim.c.chen

On Fri, 29 Oct 2021 at 12:01, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 19/10/2021 14:35, Vincent Guittot wrote:
> > Decay max_newidle_lb_cost only when it has not been updated for a while,
> > and ensure that a recently changed value is not decayed.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  include/linux/sched/topology.h |  2 +-
> >  kernel/sched/fair.c            | 36 +++++++++++++++++++++++++---------
> >  kernel/sched/topology.c        |  2 +-
> >  3 files changed, 29 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > index 2f9166f6dec8..c07bfa2d80f2 100644
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -105,7 +105,7 @@ struct sched_domain {
> >
> >       /* idle_balance() stats */
> >       u64 max_newidle_lb_cost;
> > -     unsigned long next_decay_max_lb_cost;
> > +     unsigned long last_decay_max_lb_cost;
> >
> >       u64 avg_scan_cost;              /* select_idle_sibling */
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index c4c36865321b..e50fd751e1df 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10239,6 +10239,30 @@ void update_max_interval(void)
> >       max_load_balance_interval = HZ*num_online_cpus()/10;
> >  }
> >
> > +static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
> > +{
> > +     if (cost > sd->max_newidle_lb_cost) {
> > +             /*
> > +              * Track max cost of a domain to make sure to not delay the
> > +              * next wakeup on the CPU.
> > +              */
> > +             sd->max_newidle_lb_cost = cost;
> > +             sd->last_decay_max_lb_cost = jiffies;
>
> That's the actual change of the patch: sd->last_decay_max_lb_cost being
> moved forward also when newidle cost is updated from newidle_balance() ?
>
> > +     } else if (time_after(jiffies, sd->last_decay_max_lb_cost + HZ)) {
> > +             /*
> > +              * Decay the newidle max times by ~1% per second to ensure that
> > +              * it is not outdated and the current max cost is actually
> > +              * shorter.
>
> I assume that `max cost` refers here to a local variable of the only
> caller of update_newidle_cost(..., 0) - rebalance_domains()?
>
> "the current max cost" has to be shorter so that
> rq->max_idle_balance_cost also decays in this case. Is this what this
> comment tries to say here?

I refer to the time tracked in sd->max_newidle_lb_cost.
Here I set the current cost to zero to trigger a possible decay of
sd->max_newidle_lb_cost.
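
A condensed view of the two call patterns as merged in tip (an
illustrative reduction of the hunks in this patch, not new code):

    /* newidle_balance(): a measured cost may raise the tracked max
     * and refreshes sd->last_decay_max_lb_cost. */
    update_newidle_cost(sd, domain_cost);

    /* rebalance_domains(): a cost of 0 can never exceed the max, so
     * only the decay branch can fire, and the HZ guard rate-limits
     * it to about once per second. */
    need_decay = update_newidle_cost(sd, 0);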

>
> [...]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 4/5] sched/fair: Remove sysctl_sched_migration_cost condition
  2021-10-29 10:01   ` Dietmar Eggemann
@ 2021-10-29 12:30     ` Vincent Guittot
  0 siblings, 0 replies; 34+ messages in thread
From: Vincent Guittot @ 2021-10-29 12:30 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot,
	linux-kernel, tim.c.chen

On Fri, 29 Oct 2021 at 12:02, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 19/10/2021 14:35, Vincent Guittot wrote:
> > With a default value of 500us, sysctl_sched_migration_cost is
> > significantly higher than the cost of load_balance. Remove the
>
> Shouldn't this rather be `load balance cost on the lowest sd`? I assume
> here that lb cost stands for the sd->max_newidle_lb_cost of the 1st sd.

Both.

During the tests I did on thx2, the sum of sd->max_newidle_lb_cost
could range between 18us and 829us.

During those tests, I had a limited number of cgroups, and
update_blocked_averages() could certainly go above the 273us of the SMT
sd->max_newidle_lb_cost and the 500us of sysctl_sched_migration_cost.

>
> We still use sysctl_sched_migration_cost as a floor against max_cost
> (i.e. lb cost of all sd's) when setting rq->max_idle_balance_cost in
> rebalance_domains().

rq->max_idle_balance_cost is used to cap the value used to compute rq->avg_idle.
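
For reference, a simplified sketch of that capping on wakeup (reduced
from mainline kernel/sched/core.c of that era; not part of this
patchset):

    /* avg_idle is an EWMA of observed idle durations, clamped to
     * 2 * rq->max_idle_balance_cost so that one long idle phase
     * cannot inflate it. update_avg() does *avg += (sample - *avg) / 8.
     */
    u64 delta = rq_clock(rq) - rq->idle_stamp;
    u64 max = 2 * rq->max_idle_balance_cost;

    update_avg(&rq->avg_idle, delta);
    if (rq->avg_idle > max)
            rq->avg_idle = max;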

>
> And in the add-on discussion (disabling the call to
> nohz_newidle_balance()), you mention that sd->max_newidle_lb_cost can be
> higher than sysctl_sched_migration_cost (even with the default 500us).

Yes, I have reached 273us on thx2 with 2 empty cgroups.

>
>
> > condition and rely on the sd->max_newidle_lb_cost to abort
> > newidle_balance.
> >
> > Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  kernel/sched/fair.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e50fd751e1df..57eae0ebc492 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10895,8 +10895,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >       rcu_read_lock();
> >       sd = rcu_dereference_check_sched_domain(this_rq->sd);
> >
> > -     if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> > -         !READ_ONCE(this_rq->rd->overload) ||
> > +     if (!READ_ONCE(this_rq->rd->overload) ||
> >           (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
> >
> >               if (sd)
> >
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 0/5] Improve newidle lb cost tracking and early abort
  2021-10-27  8:49   ` Vincent Guittot
  2021-10-27 20:53     ` Tim Chen
@ 2021-10-29 17:00     ` Tim Chen
  2021-10-31 10:18       ` Vincent Guittot
  1 sibling, 1 reply; 34+ messages in thread
From: Tim Chen @ 2021-10-29 17:00 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, linux-kernel

On Wed, 2021-10-27 at 10:49 +0200, Vincent Guittot wrote:
> 
> 
> A few problems still remain in your case, if I'm not wrong:
> There is a patch that ensures that rq->next_balance is never set in
> the past.
> 

Vincent,

Were you planning to take the patch to prevent the next_balance to be
in the past?

Tim

---
From 2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3 Mon Sep 17 00:00:00 2001
From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Tue, 11 May 2021 09:55:41 -0700
Subject: [PATCH] sched: Fix rq->next_balance time updated to earlier
 than current time
To: hmem@eclists.intel.com

In traces on newidle_balance(), the this_rq->next_balance
time goes backward, earlier than the current time in jiffies, e.g.

11.602 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
11.624 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
13.856 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73b
13.910 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
14.637 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73c
14.666 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c

It doesn't make sense to have a next_balance in the past.
Fix newidle_balance() and update_next_balance() so the next
balance time is at least jiffies+1.
---
 kernel/sched/fair.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d75af1ecfb4..740a0572cbf1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9901,7 +9901,10 @@ update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
 
 	/* used by idle balance, so cpu_busy = 0 */
 	interval = get_sd_balance_interval(sd, 0);
-	next = sd->last_balance + interval;
+	if (time_after(jiffies+1, sd->last_balance + interval))
+		next = jiffies+1;
+	else
+		next = sd->last_balance + interval;
 
 	if (time_after(*next_balance, next))
 		*next_balance = next;
@@ -10681,6 +10684,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 out:
 	/* Move the next balance forward */
+	if (time_after(jiffies+1, this_rq->next_balance))
+		this_rq->next_balance = jiffies+1;
 	if (time_after(this_rq->next_balance, next_balance))
 		this_rq->next_balance = next_balance;
 
-- 
2.20.1
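
Both hunks apply the same clamp; written out as a helper for clarity
(an illustrative restatement, not part of the patch above):

    /* Never let a balance target land at or before the current
     * jiffy: a stale sd->last_balance + interval is pulled up to
     * jiffies + 1, i.e. the next tick. */
    static inline unsigned long clamp_next_balance(unsigned long next)
    {
            if (time_after(jiffies + 1, next))
                    next = jiffies + 1;
            return next;
    }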



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 0/5] Improve newidle lb cost tracking and early abort
  2021-10-29  7:49   ` Vincent Guittot
@ 2021-10-29 19:37     ` Joel Fernandes
  0 siblings, 0 replies; 34+ messages in thread
From: Joel Fernandes @ 2021-10-29 19:37 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, linux-kernel, tim.c.chen, dianders,
	qais.yousef, Chris.Redpath, Yicong Yang, Barry Song

Hi Vincent,

On Fri, Oct 29, 2021 at 09:49:23AM +0200, Vincent Guittot wrote:
> On Fri, 29 Oct 2021 at 01:25, Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > Hi,  Vincent, Peter,
> >
> > On Tue, Oct 19, 2021 at 02:35:32PM +0200, Vincent Guittot wrote:
> > > This patchset updates newidle lb cost tracking and early abort:
> > >
> > > The time spent running update_blocked_averages is now accounted in the 1st
> > > sched_domain level. This time can be significant and move the cost of
> > > newidle lb above the avg_idle time.
> > >
> > > The decay of max_newidle_lb_cost is modified to start only when the field
> > > has not been updated for a while. A recent update will not be decayed
> > > immediately but only after a while.
> > >
> > > The condition of an avg_idle lower than sysctl_sched_migration_cost has
> > > been removed, as the 500us value is quite large and prevents opportunities
> > > to pull a task onto the newly idle CPU, at least for the 1st domain levels.
> >
> > It appears this series is not yet in upstream Linus's tree. What's the latest on it?
> >
> 
> I sent an addon yesterday to cover cases that Tim cares about

Oh ok, I'll check it out. Thanks for letting me know.

> > I often see on ARM64 devices that load balance is skipped due to the
> > high sysctl_sched_migration_cost. I saw another thread as well where
> 
> Have you tested the patchset ? Does it enable more load balance on
> your platform ?

Yes, I tested and noticed load balance happening more often. I ran some
workloads and noticed it makes things smoother. I am doing more tests as well.

Also, we need this series to fix the update_blocked_averages() latency issues,
and I verified that the preemptoff tracer does not show the same high latency
with this series.

thanks,

 - Joel

> > someone complained the performance varies and the default might be too high:
> > https://lkml.org/lkml/2021/9/14/150
> 
> Added  Yicong and Barry in the list
> 
> >
> > Any other thoughts? Happy to help on any progress on this series as well. Thanks,
> >
> >  - Joel
> >
> > >
> > > Monitoring sd->max_newidle_lb_cost on cpu0 of an Arm64 system
> > > THX2 (2 nodes * 28 cores * 4 cpus) during the benchmarks gives the
> > > following results:
> > >        min    avg   max
> > > SMT:   1us   33us  273us - this one includes the update of blocked load
> > > MC:    7us   49us  398us
> > > NUMA: 10us   45us  158us
> > >
> > >
> > > Some results for hackbench -l $LOOPS -g $group :
> > > group      tip/sched/core     + this patchset
> > > 1           15.189(+/- 2%)       14.987(+/- 2%)  +1%
> > > 4            4.336(+/- 3%)        4.322(+/- 5%)  +0%
> > > 16           3.654(+/- 1%)        2.922(+/- 3%) +20%
> > > 32           3.209(+/- 1%)        2.919(+/- 3%)  +9%
> > > 64           2.965(+/- 1%)        2.826(+/- 1%)  +4%
> > > 128          2.954(+/- 1%)        2.993(+/- 8%)  -1%
> > > 256          2.951(+/- 1%)        2.894(+/- 1%)  +2%
> > >
> > > tbench and reaim have not shown any difference
> > >
> > > Change since v2:
> > > - Update and decay of sd->last_decay_max_lb_cost are gathered in
> > >   update_newidle_cost(). The behavior remains almost the same except that
> > >   the decay can happen during newidle_balance now.
> > >
> > >   Test results haven't shown any difference
> > >
> > >   I haven't modified rq->max_idle_balance_cost. It acts as the max value
> > >   for avg_idle and prevents the latter from reaching a high value during
> > >   long idle phases. Moving to an IIR filter instead could delay the
> > >   convergence of avg_idle to a reasonable value that reflects the
> > >   current situation.
> > >
> > > - Added a minor cleanup of newidle_balance
> > >
> > > Change since v1:
> > > - account the time spent in update_blocked_averages() in the 1st domain
> > >
> > > - reduce the number of calls to sched_clock_cpu()
> > >
> > > - change the way max_newidle_lb_cost is decayed. Peter suggested using an
> > >   IIR, but keeping track of the current max value gave the best result
> > >
> > > - removed the condition (this_rq->avg_idle < sysctl_sched_migration_cost)
> > >   as suggested by Peter
> > >
> > > Vincent Guittot (5):
> > >   sched/fair: Account update_blocked_averages in newidle_balance cost
> > >   sched/fair: Skip update_blocked_averages if we are defering load
> > >     balance
> > >   sched/fair: Wait before decaying max_newidle_lb_cost
> > >   sched/fair: Remove sysctl_sched_migration_cost condition
> > >   sched/fair: cleanup newidle_balance
> > >
> > >  include/linux/sched/topology.h |  2 +-
> > >  kernel/sched/fair.c            | 65 ++++++++++++++++++++++------------
> > >  kernel/sched/topology.c        |  2 +-
> > >  3 files changed, 45 insertions(+), 24 deletions(-)
> > >
> > > --
> > > 2.17.1
> > >

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [tip: sched/core] sched/fair: Cleanup newidle_balance
  2021-10-19 12:35 ` [PATCH v3 5/5] sched/fair: cleanup newidle_balance Vincent Guittot
  2021-10-21  9:46   ` Mel Gorman
@ 2021-10-31 10:16   ` tip-bot2 for Vincent Guittot
  1 sibling, 0 replies; 34+ messages in thread
From: tip-bot2 for Vincent Guittot @ 2021-10-31 10:16 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Vincent Guittot, Peter Zijlstra (Intel),
	Dietmar Eggemann, Mel Gorman, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     8ea9183db4ad8afbcb7089a77c23eaf965b0cacd
Gitweb:        https://git.kernel.org/tip/8ea9183db4ad8afbcb7089a77c23eaf965b0cacd
Author:        Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate:    Tue, 19 Oct 2021 14:35:37 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sun, 31 Oct 2021 11:11:38 +01:00

sched/fair: Cleanup newidle_balance

update_next_balance() uses sd->last_balance, which is not modified by
load_balance(), so we can merge the 2 calls in one place.

No functional change.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20211019123537.17146-6-vincent.guittot@linaro.org
---
 kernel/sched/fair.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 57eae0e..13950be 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10916,10 +10916,10 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 		int continue_balancing = 1;
 		u64 domain_cost;
 
-		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
-			update_next_balance(sd, &next_balance);
+		update_next_balance(sd, &next_balance);
+
+		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
 			break;
-		}
 
 		if (sd->flags & SD_BALANCE_NEWIDLE) {
 
@@ -10935,8 +10935,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 			t0 = t1;
 		}
 
-		update_next_balance(sd, &next_balance);
-
 		/*
 		 * Stop searching for tasks to pull if there are
 		 * now runnable tasks on this rq.

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [tip: sched/core] sched/fair: Remove sysctl_sched_migration_cost condition
  2021-10-19 12:35 ` [PATCH v3 4/5] sched/fair: Remove sysctl_sched_migration_cost condition Vincent Guittot
  2021-10-21  9:46   ` Mel Gorman
  2021-10-29 10:01   ` Dietmar Eggemann
@ 2021-10-31 10:16   ` tip-bot2 for Vincent Guittot
  2 siblings, 0 replies; 34+ messages in thread
From: tip-bot2 for Vincent Guittot @ 2021-10-31 10:16 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Vincent Guittot, Dietmar Eggemann, Mel Gorman, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     c5b0a7eefc70150caf23e37bc9d639c68c87a097
Gitweb:        https://git.kernel.org/tip/c5b0a7eefc70150caf23e37bc9d639c68c87a097
Author:        Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate:    Tue, 19 Oct 2021 14:35:36 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sun, 31 Oct 2021 11:11:38 +01:00

sched/fair: Remove sysctl_sched_migration_cost condition

With a default value of 500us, sysctl_sched_migration_cost is
significantly higher than the cost of load_balance. Remove the
condition and rely on the sd->max_newidle_lb_cost to abort
newidle_balance.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20211019123537.17146-5-vincent.guittot@linaro.org
---
 kernel/sched/fair.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e50fd75..57eae0e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10895,8 +10895,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	rcu_read_lock();
 	sd = rcu_dereference_check_sched_domain(this_rq->sd);
 
-	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-	    !READ_ONCE(this_rq->rd->overload) ||
+	if (!READ_ONCE(this_rq->rd->overload) ||
 	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
 
 		if (sd)

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [tip: sched/core] sched/fair: Wait before decaying max_newidle_lb_cost
  2021-10-19 12:35 ` [PATCH 3/5] sched/fair: Wait before decaying max_newidle_lb_cost Vincent Guittot
  2021-10-21  9:45   ` Mel Gorman
  2021-10-29 10:01   ` Dietmar Eggemann
@ 2021-10-31 10:16   ` tip-bot2 for Vincent Guittot
  2 siblings, 0 replies; 34+ messages in thread
From: tip-bot2 for Vincent Guittot @ 2021-10-31 10:16 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Vincent Guittot, Peter Zijlstra (Intel),
	Dietmar Eggemann, Mel Gorman, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     e60b56e46b384cee1ad34e6adc164d883049c6c3
Gitweb:        https://git.kernel.org/tip/e60b56e46b384cee1ad34e6adc164d883049c6c3
Author:        Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate:    Tue, 19 Oct 2021 14:35:35 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sun, 31 Oct 2021 11:11:38 +01:00

sched/fair: Wait before decaying max_newidle_lb_cost

Decay max_newidle_lb_cost only when it has not been updated for a while,
and ensure that a recently changed value is not decayed.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20211019123537.17146-4-vincent.guittot@linaro.org
---
 include/linux/sched/topology.h |  2 +-
 kernel/sched/fair.c            | 36 ++++++++++++++++++++++++---------
 kernel/sched/topology.c        |  2 +-
 3 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 2f9166f..c07bfa2 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -105,7 +105,7 @@ struct sched_domain {
 
 	/* idle_balance() stats */
 	u64 max_newidle_lb_cost;
-	unsigned long next_decay_max_lb_cost;
+	unsigned long last_decay_max_lb_cost;
 
 	u64 avg_scan_cost;		/* select_idle_sibling */
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c4c3686..e50fd75 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10239,6 +10239,30 @@ void update_max_interval(void)
 	max_load_balance_interval = HZ*num_online_cpus()/10;
 }
 
+static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
+{
+	if (cost > sd->max_newidle_lb_cost) {
+		/*
+		 * Track max cost of a domain to make sure to not delay the
+		 * next wakeup on the CPU.
+		 */
+		sd->max_newidle_lb_cost = cost;
+		sd->last_decay_max_lb_cost = jiffies;
+	} else if (time_after(jiffies, sd->last_decay_max_lb_cost + HZ)) {
+		/*
+		 * Decay the newidle max times by ~1% per second to ensure that
+		 * it is not outdated and the current max cost is actually
+		 * shorter.
+		 */
+		sd->max_newidle_lb_cost = (sd->max_newidle_lb_cost * 253) / 256;
+		sd->last_decay_max_lb_cost = jiffies;
+
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * It checks each scheduling domain to see if it is due to be balanced,
  * and initiates a balancing operation if so.
@@ -10262,14 +10286,9 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 	for_each_domain(cpu, sd) {
 		/*
 		 * Decay the newidle max times here because this is a regular
-		 * visit to all the domains. Decay ~1% per second.
+		 * visit to all the domains.
 		 */
-		if (time_after(jiffies, sd->next_decay_max_lb_cost)) {
-			sd->max_newidle_lb_cost =
-				(sd->max_newidle_lb_cost * 253) / 256;
-			sd->next_decay_max_lb_cost = jiffies + HZ;
-			need_decay = 1;
-		}
+		need_decay = update_newidle_cost(sd, 0);
 		max_cost += sd->max_newidle_lb_cost;
 
 		/*
@@ -10911,8 +10930,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 			t1 = sched_clock_cpu(this_cpu);
 			domain_cost = t1 - t0;
-			if (domain_cost > sd->max_newidle_lb_cost)
-				sd->max_newidle_lb_cost = domain_cost;
+			update_newidle_cost(sd, domain_cost);
 
 			curr_cost += domain_cost;
 			t0 = t1;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index e812467..30169c7 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1568,7 +1568,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		.last_balance		= jiffies,
 		.balance_interval	= sd_weight,
 		.max_newidle_lb_cost	= 0,
-		.next_decay_max_lb_cost	= jiffies,
+		.last_decay_max_lb_cost	= jiffies,
 		.child			= child,
 #ifdef CONFIG_SCHED_DEBUG
 		.name			= tl->name,
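
As a quick check of the constant (back-of-envelope arithmetic, not from
the patch itself): 253/256 ~= 0.9883, so each decay step removes ~1.2%
of the tracked max, and the time_after(jiffies, last_decay_max_lb_cost
+ HZ) guard limits steps to about one per second, which matches the
"~1% per second" comment. At that rate a stale value halves in about
ln(0.5)/ln(253/256) ~= 59 steps, i.e. roughly one minute.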

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [tip: sched/core] sched/fair: Skip update_blocked_averages if we are defering load balance
  2021-10-19 12:35 ` [PATCH v3 2/5] sched/fair: Skip update_blocked_averages if we are defering load balance Vincent Guittot
  2021-10-21  9:44   ` Mel Gorman
@ 2021-10-31 10:16   ` tip-bot2 for Vincent Guittot
  1 sibling, 0 replies; 34+ messages in thread
From: tip-bot2 for Vincent Guittot @ 2021-10-31 10:16 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Vincent Guittot, Tim Chen, Peter Zijlstra (Intel),
	Dietmar Eggemann, Mel Gorman, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     9d783c8dd112ad4b619e74e4bf57d2be0b400693
Gitweb:        https://git.kernel.org/tip/9d783c8dd112ad4b619e74e4bf57d2be0b400693
Author:        Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate:    Tue, 19 Oct 2021 14:35:34 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sun, 31 Oct 2021 11:11:37 +01:00

sched/fair: Skip update_blocked_averages if we are defering load balance

In newidle_balance(), the scheduler skips load balancing the newly idle cpu
when, for the 1st sd of this_rq:

   this_rq->avg_idle < sd->max_newidle_lb_cost

Making a costly call to update_blocked_averages() will not be useful and
simply adds overhead when this condition is true.

Check the condition early in newidle_balance() to skip
update_blocked_averages() when possible.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20211019123537.17146-3-vincent.guittot@linaro.org
---
 kernel/sched/fair.c |  9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c014567..c4c3686 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10873,17 +10873,20 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	 */
 	rq_unpin_lock(this_rq, rf);
 
+	rcu_read_lock();
+	sd = rcu_dereference_check_sched_domain(this_rq->sd);
+
 	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-	    !READ_ONCE(this_rq->rd->overload)) {
+	    !READ_ONCE(this_rq->rd->overload) ||
+	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
 
-		rcu_read_lock();
-		sd = rcu_dereference_check_sched_domain(this_rq->sd);
 		if (sd)
 			update_next_balance(sd, &next_balance);
 		rcu_read_unlock();
 
 		goto out;
 	}
+	rcu_read_unlock();
 
 	raw_spin_rq_unlock(this_rq);
 

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [tip: sched/core] sched/fair: Account update_blocked_averages in newidle_balance cost
  2021-10-19 12:35 ` [PATCH v3 1/5] sched/fair: Account update_blocked_averages in newidle_balance cost Vincent Guittot
  2021-10-21  9:43   ` Mel Gorman
@ 2021-10-31 10:16   ` tip-bot2 for Vincent Guittot
  1 sibling, 0 replies; 34+ messages in thread
From: tip-bot2 for Vincent Guittot @ 2021-10-31 10:16 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Vincent Guittot, Peter Zijlstra (Intel),
	Dietmar Eggemann, Mel Gorman, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     9e9af819db5dbe4bf99101628955a26e2a41a1a5
Gitweb:        https://git.kernel.org/tip/9e9af819db5dbe4bf99101628955a26e2a41a1a5
Author:        Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate:    Tue, 19 Oct 2021 14:35:33 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sun, 31 Oct 2021 11:11:37 +01:00

sched/fair: Account update_blocked_averages in newidle_balance cost

The time spent updating the blocked load can be significant depending on
the complexity of the cgroup hierarchy. Take this time into account in
the cost of the 1st load balance of a newly idle cpu.

Also reduce the number of calls to sched_clock_cpu() and track more of
the actual work.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20211019123537.17146-2-vincent.guittot@linaro.org
---
 kernel/sched/fair.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 87db481..c014567 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10840,9 +10840,9 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 {
 	unsigned long next_balance = jiffies + HZ;
 	int this_cpu = this_rq->cpu;
+	u64 t0, t1, curr_cost = 0;
 	struct sched_domain *sd;
 	int pulled_task = 0;
-	u64 curr_cost = 0;
 
 	update_misfit_status(NULL, this_rq);
 
@@ -10887,11 +10887,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 	raw_spin_rq_unlock(this_rq);
 
+	t0 = sched_clock_cpu(this_cpu);
 	update_blocked_averages(this_cpu);
+
 	rcu_read_lock();
 	for_each_domain(this_cpu, sd) {
 		int continue_balancing = 1;
-		u64 t0, domain_cost;
+		u64 domain_cost;
 
 		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
 			update_next_balance(sd, &next_balance);
@@ -10899,17 +10901,18 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 		}
 
 		if (sd->flags & SD_BALANCE_NEWIDLE) {
-			t0 = sched_clock_cpu(this_cpu);
 
 			pulled_task = load_balance(this_cpu, this_rq,
 						   sd, CPU_NEWLY_IDLE,
 						   &continue_balancing);
 
-			domain_cost = sched_clock_cpu(this_cpu) - t0;
+			t1 = sched_clock_cpu(this_cpu);
+			domain_cost = t1 - t0;
 			if (domain_cost > sd->max_newidle_lb_cost)
 				sd->max_newidle_lb_cost = domain_cost;
 
 			curr_cost += domain_cost;
+			t0 = t1;
 		}
 
 		update_next_balance(sd, &next_balance);
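
To make the new accounting explicit, a worked illustration with made-up
timestamps (the flow above, reduced to straight-line pseudo-C):

    t0 = sched_clock_cpu(this_cpu);     /* say 1000us */
    update_blocked_averages(this_cpu);  /* costs e.g. 80us */

    /* 1st domain (e.g. SMT): */
    pulled_task = load_balance(...);    /* costs e.g. 20us */
    t1 = sched_clock_cpu(this_cpu);     /* 1100us */
    domain_cost = t1 - t0;              /* 100us: includes the blocked
                                         * load update */
    t0 = t1;                            /* next domain measured from here */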

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 0/5] Improve newidle lb cost tracking and early abort
  2021-10-29 17:00     ` Tim Chen
@ 2021-10-31 10:18       ` Vincent Guittot
  0 siblings, 0 replies; 34+ messages in thread
From: Vincent Guittot @ 2021-10-31 10:18 UTC (permalink / raw)
  To: Tim Chen
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, linux-kernel

On Fri, 29 Oct 2021 at 19:00, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> On Wed, 2021-10-27 at 10:49 +0200, Vincent Guittot wrote:
> >
> >
> > A few problems still remain in your case, if I'm not wrong:
> > There is a patch that ensures that rq->next_balance is never set in
> > the past.
> >
>
> Vincent,
>
> Were you planning to take the patch to prevent the next_balance to be
> in the past?

I can add it to this series.


>
> Tim
>
> ---
> From 2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3 Mon Sep 17 00:00:00 2001
> From: Tim Chen <tim.c.chen@linux.intel.com>
> Date: Tue, 11 May 2021 09:55:41 -0700
> Subject: [PATCH] sched: Fix rq->next_balance time updated to earlier
>  than current time
> To: hmem@eclists.intel.com
>
> In traces on newidle_balance(), the this_rq->next_balance
> time goes backward, earlier than the current time in jiffies, e.g.
>
> 11.602 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
> 11.624 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> 13.856 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73b
> 13.910 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> 14.637 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73c
> 14.666 (         ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>
> It doesn't make sense to have a next_balance in the past.
> Fix newidle_balance() and update_next_balance() so the next
> balance time is at least jiffies+1.
> ---
>  kernel/sched/fair.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1d75af1ecfb4..740a0572cbf1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9901,7 +9901,10 @@ update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
>
>         /* used by idle balance, so cpu_busy = 0 */
>         interval = get_sd_balance_interval(sd, 0);
> -       next = sd->last_balance + interval;
> +       if (time_after(jiffies+1, sd->last_balance + interval))
> +               next = jiffies+1;
> +       else
> +               next = sd->last_balance + interval;
>
>         if (time_after(*next_balance, next))
>                 *next_balance = next;
> @@ -10681,6 +10684,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>
>  out:
>         /* Move the next balance forward */
> +       if (time_after(jiffies+1, this_rq->next_balance))
> +               this_rq->next_balance = jiffies+1;
>         if (time_after(this_rq->next_balance, next_balance))
>                 this_rq->next_balance = next_balance;
>
> --
> 2.20.1
>
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2021-10-31 10:18 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-19 12:35 [PATCH v3 0/5] Improve newidle lb cost tracking and early abort Vincent Guittot
2021-10-19 12:35 ` [PATCH v3 1/5] sched/fair: Account update_blocked_averages in newidle_balance cost Vincent Guittot
2021-10-21  9:43   ` Mel Gorman
2021-10-31 10:16   ` [tip: sched/core] " tip-bot2 for Vincent Guittot
2021-10-19 12:35 ` [PATCH v3 2/5] sched/fair: Skip update_blocked_averages if we are defering load balance Vincent Guittot
2021-10-21  9:44   ` Mel Gorman
2021-10-21 12:45     ` Vincent Guittot
2021-10-31 10:16   ` [tip: sched/core] " tip-bot2 for Vincent Guittot
2021-10-19 12:35 ` [PATCH 3/5] sched/fair: Wait before decaying max_newidle_lb_cost Vincent Guittot
2021-10-21  9:45   ` Mel Gorman
2021-10-29 10:01   ` Dietmar Eggemann
2021-10-29 12:19     ` Vincent Guittot
2021-10-31 10:16   ` [tip: sched/core] " tip-bot2 for Vincent Guittot
2021-10-19 12:35 ` [PATCH v3 4/5] sched/fair: Remove sysctl_sched_migration_cost condition Vincent Guittot
2021-10-21  9:46   ` Mel Gorman
2021-10-29 10:01   ` Dietmar Eggemann
2021-10-29 12:30     ` Vincent Guittot
2021-10-31 10:16   ` [tip: sched/core] " tip-bot2 for Vincent Guittot
2021-10-19 12:35 ` [PATCH v3 5/5] sched/fair: cleanup newidle_balance Vincent Guittot
2021-10-21  9:46   ` Mel Gorman
2021-10-31 10:16   ` [tip: sched/core] sched/fair: Cleanup newidle_balance tip-bot2 for Vincent Guittot
2021-10-21  9:52 ` [PATCH v3 0/5] Improve newidle lb cost tracking and early abort Mel Gorman
2021-10-21 12:42   ` Vincent Guittot
2021-10-26 17:25 ` Tim Chen
2021-10-27  8:49   ` Vincent Guittot
2021-10-27 20:53     ` Tim Chen
2021-10-28 12:15       ` Vincent Guittot
2021-10-28 21:36         ` Tim Chen
2021-10-29 17:00     ` Tim Chen
2021-10-31 10:18       ` Vincent Guittot
2021-10-28 23:25 ` Joel Fernandes
2021-10-29  7:49   ` Vincent Guittot
2021-10-29 19:37     ` Joel Fernandes
2021-10-29 10:01 ` Dietmar Eggemann
