* [PATCH 0/3] sched: Generalize misfit load balance
@ 2023-12-09  1:17 Qais Yousef
  2023-12-09  1:17 ` [PATCH 1/3] sched/fair: Add is_misfit_task() function Qais Yousef
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Qais Yousef @ 2023-12-09  1:17 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann
  Cc: linux-kernel, Lukasz Luba, Wei Wang, Rick Yiu, Chung-Kai Mei,
	Qais Yousef

Misfit load balance was added to help handle HMP systems, where we can make
a wrong decision at wake up thinking a task can run on a smaller core, but
its characteristics change and it needs to migrate to a bigger core to meet
its performance demands.

With the addition of uclamp, we can encounter more cases where such wrong
placement decisions can be made and require the load balancer to take
corrective action.

Specifically, if a big task capped by uclamp_max was placed on a big core at
wake up because EAS thought it was the most energy efficient core at the
time, the dynamics of the system might change: other uncapped tasks might
wake up on the cluster, and a new, more energy efficient placement might
exist for the capped task(s).

We can generalize the misfit load balance to handle different types of
misfits (whatever they may be) by simply giving it a reason. The reason then
decides the type of corrective action required.

The current misfit implementation is considered MISFIT_PERF, which means we
need to move a task to a better CPU to meet its performance requirements.

For UCLAMP_MAX I propose MISFIT_POWER, where we need to find a better
placement for the task to control its impact on power.

Once we have an API to annotate latency sensitive tasks, it is anticipated
that a MISFIT_LATENCY load balance will be required to handle oversubscribed
situations by better distributing latency sensitive tasks to reduce their
wake up latency.
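
As a rough sketch (the names may change, and MISFIT_LATENCY is only
anticipated and not part of this series), the reasons could end up looking
like:

	typedef enum misfit_reason {
		MISFIT_PERF,	/* needs a more performant CPU (current behavior) */
		MISFIT_POWER,	/* needs a more energy efficient CPU */
		MISFIT_LATENCY,	/* anticipated: spread latency sensitive tasks */
	} misfit_reason_t;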

Patch 1 splits misfit status update from misfit detection by adding a new
function is_misfit_task().

Patch 2 implements the generalization logic by adding a misfit reason,
propagating it correctly, and guarding the current misfit code with the
MISFIT_PERF reason.

Patch 3 is an RFC on a potential implementation for MISFIT_POWER.

Patches 1 and 2 were tested standalone with no regressions observed. They
should not introduce a functional change and can be considered for merging,
after addressing any review comments, if they make sense.

Patch 3 was only tested to verify it does what I expected it to do; no real
power/perf testing was done. Mainly because I was expecting to remove uclamp
max-aggregation [1], and the RFC I currently have (which I wrote many months
ago) is tied to detecting a task being uncapped by max-aggregation. I need to
rethink the detection mechanism.

Besides that, the logic relies on using find_energy_efficient_cpu() to find
the best potential new placement for the task. To do that though, we need to
force every CPU to do the MISFIT_POWER load balance, as we don't know in
advance which CPU should do the pull. There might be better ways to handle
this, so feedback and thoughts would be appreciated.
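
As a quick preview, the pull decision patch 3 makes in detach_tasks() boils
down to the following (simplified pseudo-C, not a drop-in snippet):

	if (reason == MISFIT_POWER) {
		if (sched_energy_enabled()) {
			/* feec() forced to ignore the overutilized flag */
			if (find_energy_efficient_cpu(p, env->src_cpu, true) != env->dst_cpu)
				goto next;	/* wrong CPU; let another one do the pull */
		} else if (uclamp_eff_value(p, UCLAMP_MAX) >
			   uclamp_rq_get(env->dst_rq, UCLAMP_MAX)) {
			goto next;		/* pulling would uncap dst_cpu */
		}
	}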

[1] https://lore.kernel.org/lkml/20231208015242.385103-1-qyousef@layalina.io/

Thanks!

--
Qais Yousef

Qais Yousef (3):
  sched/fair: Add is_misfit_task() function
  sched/fair: Generalize misfit lb by adding a misfit reason
  sched/fair: Implement new type of misfit MISFIT_POWER

 kernel/sched/fair.c  | 115 +++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |   9 ++++
 2 files changed, 110 insertions(+), 14 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/3] sched/fair: Add is_misfit_task() function
  2023-12-09  1:17 [PATCH 0/3] sched: Generalize misfit load balance Qais Yousef
@ 2023-12-09  1:17 ` Qais Yousef
  2023-12-09  1:17 ` [PATCH 2/3] sched/fair: Generalize misfit lb by adding a misfit reason Qais Yousef
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Qais Yousef @ 2023-12-09  1:17 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann
  Cc: linux-kernel, Lukasz Luba, Wei Wang, Rick Yiu, Chung-Kai Mei,
	Qais Yousef

Split the misfit detection out of update_misfit_status() into a new
is_misfit_task() helper. Use it in detach_tasks() and
update_sg_wakeup_stats().

This should help us generalize misfit lb to handle more than
upmigration.

Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
 kernel/sched/fair.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bcea3d55d95d..eb9e891182cc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5063,17 +5063,23 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu)
 	return (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0);
 }
 
+static inline int is_misfit_task(struct task_struct *p, struct rq *rq)
+{
+	if (!p || p->nr_cpus_allowed == 1)
+		return 0;
+
+	if (task_fits_cpu(p, cpu_of(rq)))
+		return 0;
+
+	return 1;
+}
+
 static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
 {
 	if (!sched_asym_cpucap_active())
 		return;
 
-	if (!p || p->nr_cpus_allowed == 1) {
-		rq->misfit_task_load = 0;
-		return;
-	}
-
-	if (task_fits_cpu(p, cpu_of(rq))) {
+	if (!is_misfit_task(p, rq)) {
 		rq->misfit_task_load = 0;
 		return;
 	}
@@ -9105,7 +9111,7 @@ static int detach_tasks(struct lb_env *env)
 
 		case migrate_misfit:
 			/* This is not a misfit task */
-			if (task_fits_cpu(p, env->src_cpu))
+			if (!is_misfit_task(p, cpu_rq(env->src_cpu)))
 				goto next;
 
 			env->imbalance = 0;
@@ -10207,7 +10213,7 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
 		/* Check if task fits in the CPU */
 		if (sd->flags & SD_ASYM_CPUCAPACITY &&
 		    sgs->group_misfit_task_load &&
-		    task_fits_cpu(p, i))
+		    !is_misfit_task(p, rq))
 			sgs->group_misfit_task_load = 0;
 
 	}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/3] sched/fair: Generalize misfit lb by adding a misfit reason
  2023-12-09  1:17 [PATCH 0/3] sched: Generalize misfit load balance Qais Yousef
  2023-12-09  1:17 ` [PATCH 1/3] sched/fair: Add is_misfit_task() function Qais Yousef
@ 2023-12-09  1:17 ` Qais Yousef
  2023-12-09  1:17 ` [PATCH RFC 3/3] sched/fair: Implement new type of misfit MISFIT_POWER Qais Yousef
  2023-12-21 15:26 ` [PATCH 0/3] sched: Generalize misfit load balance Pierre Gondois
  3 siblings, 0 replies; 9+ messages in thread
From: Qais Yousef @ 2023-12-09  1:17 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann
  Cc: linux-kernel, Lukasz Luba, Wei Wang, Rick Yiu, Chung-Kai Mei,
	Qais Yousef

MISFIT_PERF is what is currently implemented. It indicates that the task
requires moving to a more performant CPU.

Guard misfit handling in find_busiest_queue() and update_sd_pick_busiest()
with MISFIT_PERF, as they explicitly assume this type of misfit.

This generalizes misfit lb to allow more types of misfits to be added, like
MISFIT_POWER to make uclamp_max more effective, and MISFIT_LATENCY to help
spread latency sensitive tasks under oversubscribed conditions.

Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
 kernel/sched/fair.c  | 28 +++++++++++++++++++++++-----
 kernel/sched/sched.h |  8 ++++++++
 2 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eb9e891182cc..dd49b89a6e3e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5063,7 +5063,8 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu)
 	return (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0);
 }
 
-static inline int is_misfit_task(struct task_struct *p, struct rq *rq)
+static inline int is_misfit_task(struct task_struct *p, struct rq *rq,
+				 misfit_reason_t *reason)
 {
 	if (!p || p->nr_cpus_allowed == 1)
 		return 0;
@@ -5071,16 +5072,21 @@ static inline int is_misfit_task(struct task_struct *p, struct rq *rq)
 	if (task_fits_cpu(p, cpu_of(rq)))
 		return 0;
 
+	if (reason)
+		*reason = MISFIT_PERF;
 	return 1;
 }
 
 static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
 {
+	misfit_reason_t reason;
+
 	if (!sched_asym_cpucap_active())
 		return;
 
-	if (!is_misfit_task(p, rq)) {
+	if (!is_misfit_task(p, rq, &reason)) {
 		rq->misfit_task_load = 0;
+		rq->misfit_reason = -1;
 		return;
 	}
 
@@ -5089,6 +5095,7 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
 	 * task_h_load() returns 0.
 	 */
 	rq->misfit_task_load = max_t(unsigned long, task_h_load(p), 1);
+	rq->misfit_reason = reason;
 }
 
 #else /* CONFIG_SMP */
@@ -9111,7 +9118,7 @@ static int detach_tasks(struct lb_env *env)
 
 		case migrate_misfit:
 			/* This is not a misfit task */
-			if (!is_misfit_task(p, cpu_rq(env->src_cpu)))
+			if (!is_misfit_task(p, cpu_rq(env->src_cpu), NULL))
 				goto next;
 
 			env->imbalance = 0;
@@ -9426,6 +9433,7 @@ struct sg_lb_stats {
 	unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
 	unsigned int group_smt_balance;  /* Task on busy SMT be moved */
 	unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
+	misfit_reason_t group_misfit_reason;
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
@@ -9904,6 +9912,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			/* Check for a misfit task on the cpu */
 			if (sgs->group_misfit_task_load < rq->misfit_task_load) {
 				sgs->group_misfit_task_load = rq->misfit_task_load;
+				sgs->group_misfit_reason = rq->misfit_reason;
 				*sg_status |= SG_OVERLOAD;
 			}
 		} else if ((env->idle != CPU_NOT_IDLE) &&
@@ -9969,6 +9978,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	 */
 	if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
 	    (sgs->group_type == group_misfit_task) &&
+	    (sgs->group_misfit_reason == MISFIT_PERF) &&
 	    (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
 	     sds->local_stat.group_type != group_has_spare))
 		return false;
@@ -10193,6 +10203,7 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
 
 	for_each_cpu(i, sched_group_span(group)) {
 		struct rq *rq = cpu_rq(i);
+		misfit_reason_t reason;
 		unsigned int local;
 
 		sgs->group_load += cpu_load_without(rq, p);
@@ -10212,9 +10223,15 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
 
 		/* Check if task fits in the CPU */
 		if (sd->flags & SD_ASYM_CPUCAPACITY &&
-		    sgs->group_misfit_task_load &&
-		    !is_misfit_task(p, rq))
+		    sgs->group_misfit_task_load) {
+		    if (!is_misfit_task(p, rq, &reason)) {
 			sgs->group_misfit_task_load = 0;
+			sgs->group_misfit_reason = -1;
+		    } else {
+			sgs->group_misfit_task_load = max_t(unsigned long, task_h_load(p), 1);
+			sgs->group_misfit_reason = reason;
+		    }
+		}
 
 	}
 
@@ -11008,6 +11025,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 * average load.
 		 */
 		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+		    rq->misfit_reason == MISFIT_PERF &&
 		    !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
 		    nr_running == 1)
 			continue;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e58a54bda77d..399b6526afab 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -962,6 +962,10 @@ struct balance_callback {
 	void (*func)(struct rq *rq);
 };
 
+typedef enum misfit_reason {
+	MISFIT_PERF,		/* Requires moving to a more performant CPU */
+} misfit_reason_t;
+
 /*
  * This is the main, per-CPU runqueue data structure.
  *
@@ -1168,6 +1172,10 @@ struct rq {
 	call_single_data_t	cfsb_csd;
 	struct list_head	cfsb_csd_list;
 #endif
+
+#ifdef CONFIG_SMP
+	misfit_reason_t		misfit_reason;
+#endif
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH RFC 3/3] sched/fair: Implement new type of misfit MISFIT_POWER
  2023-12-09  1:17 [PATCH 0/3] sched: Generalize misfit load balance Qais Yousef
  2023-12-09  1:17 ` [PATCH 1/3] sched/fair: Add is_misfit_task() function Qais Yousef
  2023-12-09  1:17 ` [PATCH 2/3] sched/fair: Generalize misfit lb by adding a misfit reason Qais Yousef
@ 2023-12-09  1:17 ` Qais Yousef
  2023-12-11 16:14   ` Pierre Gondois
  2024-01-04 14:28   ` Pierre Gondois
  2023-12-21 15:26 ` [PATCH 0/3] sched: Generalize misfit load balance Pierre Gondois
  3 siblings, 2 replies; 9+ messages in thread
From: Qais Yousef @ 2023-12-09  1:17 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann
  Cc: linux-kernel, Lukasz Luba, Wei Wang, Rick Yiu, Chung-Kai Mei,
	Qais Yousef

MISFIT_POWER requires moving the task to a more efficient CPU.

This can happen when a big task is capped by uclamp_max, but another task
that can lift the capping wakes up on this CPU. In this case we need to
migrate the capped task to another, likely smaller, CPU to save power.

To enable that, we need to be smarter about which CPU should do the pull.
This requires enabling load balance on all CPUs so that the correct CPU does
the pull, instead of the current behavior of nominating the CPU with the
largest capacity in the group to do the pull.

This is important to ensure the MISFIT_POWER tasks don't end up on the most
performant CPUs, which is the default behavior of the load balancer. We could
end up wasting energy unnecessarily or interfering with more important tasks
on these big CPUs - leading to worse user experience.

To ensure an optimal decision is made, we call feec() to pick the most
efficient CPU for us, which means we need to force feec() to ignore the
overutilized flag. If feec() returns the same CPU as the one doing the
balance, we perform the pull. Otherwise we defer to another CPU to do the
pull.

To minimize the overhead, this is only done for MISFIT_POWER.

For capacity aware scheduling or non-HMP systems, we will only pull the task
onto a CPU whose rq uclamp_max it would not uncap.

Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
 kernel/sched/fair.c  | 77 ++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h |  1 +
 2 files changed, 71 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dd49b89a6e3e..328467dbe88b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5066,10 +5066,33 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu)
 static inline int is_misfit_task(struct task_struct *p, struct rq *rq,
 				 misfit_reason_t *reason)
 {
+	unsigned long rq_util_max;
+	unsigned long p_util_min;
+	unsigned long p_util_max;
+	unsigned long util;
+
 	if (!p || p->nr_cpus_allowed == 1)
 		return 0;
 
-	if (task_fits_cpu(p, cpu_of(rq)))
+	rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
+	p_util_min = uclamp_eff_value(p, UCLAMP_MIN);
+	p_util_max = uclamp_eff_value(p, UCLAMP_MAX);
+	util = task_util_est(p);
+
+	if (uclamp_is_used()) {
+		/*
+		* Check if a big task is capped by uclamp max is now sharing
+		* the cpu with something else uncapped and must be moved away
+		*/
+		if (rq_util_max > p_util_max && util > p_util_max) {
+			if (reason)
+				*reason = MISFIT_POWER;
+
+			return 1;
+		}
+	}
+
+	if (util_fits_cpu(util, p_util_min, p_util_max, cpu_of(rq)))
 		return 0;
 
 	if (reason)
@@ -7923,7 +7946,8 @@ compute_energy(struct energy_env *eenv, struct perf_domain *pd,
  * other use-cases too. So, until someone finds a better way to solve this,
  * let's keep things simple by re-using the existing slow path.
  */
-static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
+static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu,
+				     bool ignore_ou)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
 	unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
@@ -7940,7 +7964,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 
 	rcu_read_lock();
 	pd = rcu_dereference(rd->pd);
-	if (!pd || READ_ONCE(rd->overutilized))
+	if (!pd || (READ_ONCE(rd->overutilized) && !ignore_ou))
 		goto unlock;
 
 	/*
@@ -8144,7 +8168,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 			return cpu;
 
 		if (sched_energy_enabled()) {
-			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
+			new_cpu = find_energy_efficient_cpu(p, prev_cpu, false);
 			if (new_cpu >= 0)
 				return new_cpu;
 			new_cpu = prev_cpu;
@@ -9030,6 +9054,7 @@ static int detach_tasks(struct lb_env *env)
 {
 	struct list_head *tasks = &env->src_rq->cfs_tasks;
 	unsigned long util, load;
+	misfit_reason_t reason;
 	struct task_struct *p;
 	int detached = 0;
 
@@ -9118,9 +9143,28 @@ static int detach_tasks(struct lb_env *env)
 
 		case migrate_misfit:
 			/* This is not a misfit task */
-			if (!is_misfit_task(p, cpu_rq(env->src_cpu), NULL))
+			if (!is_misfit_task(p, cpu_rq(env->src_cpu), &reason))
 				goto next;
 
+			if (reason == MISFIT_POWER) {
+				if (sched_energy_enabled()) {
+					int new_cpu = find_energy_efficient_cpu(p, env->src_cpu, true);
+					if (new_cpu != env->dst_cpu)
+						goto next;
+				} else {
+					unsigned long dst_uclamp_max = uclamp_rq_get(env->dst_rq, UCLAMP_MAX);
+					unsigned long p_uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
+
+					/*
+					 * Pick a task that will not cause us
+					 * to uncap dst_cpu. Or give up until
+					 * another CPU tries to do the pull.
+					 */
+					if (p_uclamp_max > dst_uclamp_max)
+						goto next;
+				}
+			}
+
 			env->imbalance = 0;
 			break;
 		}
@@ -11203,6 +11247,18 @@ static int should_we_balance(struct lb_env *env)
 	if (!cpumask_test_cpu(env->dst_cpu, env->cpus))
 		return 0;
 
+	/*
+	 * For MISFIT_POWER, we need every CPU to do the lb so that we can pick
+	 * the most energy efficient one via EAS if available or by making sure
+	 * the dst_rq uclamp_max higher than the misfit task's uclamp_max.
+	 *
+	 * We don't want to do a pull if both src and dst cpus are in
+	 * MISFIT_POWER state.
+	 */
+	if (env->src_rq->misfit_reason == MISFIT_POWER &&
+	    env->dst_rq->misfit_reason != MISFIT_POWER)
+			return 1;
+
 	/*
 	 * In the newly idle case, we will allow all the CPUs
 	 * to do the newly idle load balance.
@@ -11431,8 +11487,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		 * We do not want newidle balance, which can be very
 		 * frequent, pollute the failure counter causing
 		 * excessive cache_hot migrations and active balances.
+		 *
+		 * MISFIT_POWER can also trigger a lot of failed misfit
+		 * migrations as we need to ask every CPU to do the pull and
+		 * expectedly lots of failures will incur.
 		 */
-		if (idle != CPU_NEWLY_IDLE)
+		if (idle != CPU_NEWLY_IDLE && env.src_rq->misfit_reason != MISFIT_POWER)
 			sd->nr_balance_failed++;
 
 		if (need_active_balance(&env)) {
@@ -11515,8 +11575,11 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	 * repeatedly reach this code, which would lead to balance_interval
 	 * skyrocketing in a short amount of time. Skip the balance_interval
 	 * increase logic to avoid that.
+	 *
+	 * So does MISFIT_POWER which asks every CPU to do the pull as we can't
+	 * tell which one would be the best one to move to before hand.
 	 */
-	if (env.idle == CPU_NEWLY_IDLE)
+	if (env.idle == CPU_NEWLY_IDLE || env.src_rq->misfit_reason == MISFIT_POWER)
 		goto out;
 
 	/* tune up the balancing interval */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 399b6526afab..3852109ffe62 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -964,6 +964,7 @@ struct balance_callback {
 
 typedef enum misfit_reason {
 	MISFIT_PERF,		/* Requires moving to a more performant CPU */
+	MISFIT_POWER,		/* Requires moving to a more efficient CPU */
 } misfit_reason_t;
 
 /*
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC 3/3] sched/fair: Implement new type of misfit MISFIT_POWER
  2023-12-09  1:17 ` [PATCH RFC 3/3] sched/fair: Implement new type of misfit MISFIT_POWER Qais Yousef
@ 2023-12-11 16:14   ` Pierre Gondois
  2024-01-04 14:28   ` Pierre Gondois
  1 sibling, 0 replies; 9+ messages in thread
From: Pierre Gondois @ 2023-12-11 16:14 UTC (permalink / raw)
  To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann
  Cc: linux-kernel, Lukasz Luba, Wei Wang, Rick Yiu, Chung-Kai Mei



On 12/9/23 02:17, Qais Yousef wrote:
> MISFIT_POWER requires moving the task to a more efficient CPU.
> 
> This can happen when a big task is capped by uclamp_max, but another
> task wakes up on this CPU that can lift the capping, in this case we
> need to migrate it to another, likely smaller, CPU to save power.
> 
> To enable that we need to be smarter about which CPU should do the pull.
> But this requires enabling load balance on all CPUs so that the correct
> CPU does the pull. Instead of the current behavior of nominating the CPU
> with the largest capacity in the group to do the pull.
> 
> This is important to ensure the MISFIT_POWER tasks don't end up on most
> performant CPUs, which is the default behavior of the load balance.  We
> could end up wasting energy unnecessarily or interfere with more
> important tasks on these big CPUs - leading to worse user experience.
> 
> To ensure optimal decision is made, we need to enable calling feec() to
> pick the most efficient CPU for us. Which means we need to force feec()
> to ignore overutilized flag. If feec() returns the same value as the CPU
> that is doing the balance, we perform the pull. Otherwise we'd have to
> defer for another CPU to do the pull.
> 
> To minimize the overhead, this is only done for MISFIT_POWER.
> 
> For capacity aware scheduling or none HMP systems, we will pick a CPU
> that we won't cause its uclamp_max to be uncapped.
> 
> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> ---
>   kernel/sched/fair.c  | 77 ++++++++++++++++++++++++++++++++++++++++----
>   kernel/sched/sched.h |  1 +
>   2 files changed, 71 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index dd49b89a6e3e..328467dbe88b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5066,10 +5066,33 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu)
>   static inline int is_misfit_task(struct task_struct *p, struct rq *rq,
>   				 misfit_reason_t *reason)
>   {
> +	unsigned long rq_util_max;
> +	unsigned long p_util_min;
> +	unsigned long p_util_max;
> +	unsigned long util;
> +
>   	if (!p || p->nr_cpus_allowed == 1)
>   		return 0;
>   
> -	if (task_fits_cpu(p, cpu_of(rq)))
> +	rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
> +	p_util_min = uclamp_eff_value(p, UCLAMP_MIN);
> +	p_util_max = uclamp_eff_value(p, UCLAMP_MAX);
> +	util = task_util_est(p);
> +
> +	if (uclamp_is_used()) {
> +		/*
> +		* Check if a big task is capped by uclamp max is now sharing
> +		* the cpu with something else uncapped and must be moved away
> +		*/
> +		if (rq_util_max > p_util_max && util > p_util_max) {
> +			if (reason)
> +				*reason = MISFIT_POWER;
> +
> +			return 1;
> +		}
> +	}
> +
> +	if (util_fits_cpu(util, p_util_min, p_util_max, cpu_of(rq)))
>   		return 0;
>   
>   	if (reason)
> @@ -7923,7 +7946,8 @@ compute_energy(struct energy_env *eenv, struct perf_domain *pd,
>    * other use-cases too. So, until someone finds a better way to solve this,
>    * let's keep things simple by re-using the existing slow path.
>    */
> -static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> +static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu,
> +				     bool ignore_ou)
>   {
>   	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
>   	unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
> @@ -7940,7 +7964,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>   
>   	rcu_read_lock();
>   	pd = rcu_dereference(rd->pd);
> -	if (!pd || READ_ONCE(rd->overutilized))
> +	if (!pd || (READ_ONCE(rd->overutilized) && !ignore_ou))
>   		goto unlock;
>   
>   	/*
> @@ -8144,7 +8168,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>   			return cpu;
>   
>   		if (sched_energy_enabled()) {
> -			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> +			new_cpu = find_energy_efficient_cpu(p, prev_cpu, false);
>   			if (new_cpu >= 0)
>   				return new_cpu;
>   			new_cpu = prev_cpu;
> @@ -9030,6 +9054,7 @@ static int detach_tasks(struct lb_env *env)
>   {
>   	struct list_head *tasks = &env->src_rq->cfs_tasks;
>   	unsigned long util, load;
> +	misfit_reason_t reason;
>   	struct task_struct *p;
>   	int detached = 0;
>   
> @@ -9118,9 +9143,28 @@ static int detach_tasks(struct lb_env *env)
>   
>   		case migrate_misfit:
>   			/* This is not a misfit task */
> -			if (!is_misfit_task(p, cpu_rq(env->src_cpu), NULL))
> +			if (!is_misfit_task(p, cpu_rq(env->src_cpu), &reason))
>   				goto next;
>   
> +			if (reason == MISFIT_POWER) {
> +				if (sched_energy_enabled()) {
> +					int new_cpu = find_energy_efficient_cpu(p, env->src_cpu, true);
> +					if (new_cpu != env->dst_cpu)
> +						goto next;
> +				} else {
> +					unsigned long dst_uclamp_max = uclamp_rq_get(env->dst_rq, UCLAMP_MAX);
> +					unsigned long p_uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
> +
> +					/*
> +					 * Pick a task that will not cause us
> +					 * to uncap dst_cpu. Or give up until
> +					 * another CPU tries to do the pull.
> +					 */
> +					if (p_uclamp_max > dst_uclamp_max)
> +						goto next;
> +				}
> +			}
> +
>   			env->imbalance = 0;
>   			break;
>   		}
> @@ -11203,6 +11247,18 @@ static int should_we_balance(struct lb_env *env)
>   	if (!cpumask_test_cpu(env->dst_cpu, env->cpus))
>   		return 0;
>   
> +	/*
> +	 * For MISFIT_POWER, we need every CPU to do the lb so that we can pick
> +	 * the most energy efficient one via EAS if available or by making sure
> +	 * the dst_rq uclamp_max higher than the misfit task's uclamp_max.
> +	 *
> +	 * We don't want to do a pull if both src and dst cpus are in
> +	 * MISFIT_POWER state.
> +	 */
> +	if (env->src_rq->misfit_reason == MISFIT_POWER &&

In case someone tries the patch, it seems the src_rq field of the
struct lb_env env in load_balance() is not initialized, so I think accesses
to env->src_rq->misfit_reason should be replaced by:
   (env->src_rq && env->src_rq->misfit_reason)

> +	    env->dst_rq->misfit_reason != MISFIT_POWER)
> +			return 1;
> +
>   	/*
>   	 * In the newly idle case, we will allow all the CPUs
>   	 * to do the newly idle load balance.
> @@ -11431,8 +11487,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,


>   		 * We do not want newidle balance, which can be very
>   		 * frequent, pollute the failure counter causing
>   		 * excessive cache_hot migrations and active balances.
> +		 *
> +		 * MISFIT_POWER can also trigger a lot of failed misfit
> +		 * migrations as we need to ask every CPU to do the pull and
> +		 * expectedly lots of failures will incur.
>   		 */
> -		if (idle != CPU_NEWLY_IDLE)
> +		if (idle != CPU_NEWLY_IDLE && env.src_rq->misfit_reason != MISFIT_POWER)
>   			sd->nr_balance_failed++;
>   
>   		if (need_active_balance(&env)) {
> @@ -11515,8 +11575,11 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>   	 * repeatedly reach this code, which would lead to balance_interval
>   	 * skyrocketing in a short amount of time. Skip the balance_interval
>   	 * increase logic to avoid that.
> +	 *
> +	 * So does MISFIT_POWER which asks every CPU to do the pull as we can't
> +	 * tell which one would be the best one to move to before hand.
>   	 */
> -	if (env.idle == CPU_NEWLY_IDLE)
> +	if (env.idle == CPU_NEWLY_IDLE || env.src_rq->misfit_reason == MISFIT_POWER)
>   		goto out;
>   
>   	/* tune up the balancing interval */
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 399b6526afab..3852109ffe62 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -964,6 +964,7 @@ struct balance_callback {
>   
>   typedef enum misfit_reason {
>   	MISFIT_PERF,		/* Requires moving to a more performant CPU */
> +	MISFIT_POWER,		/* Requires moving to a more efficient CPU */
>   } misfit_reason_t;
>   
>   /*

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/3] sched: Generalize misfit load balance
  2023-12-09  1:17 [PATCH 0/3] sched: Generalize misfit load balance Qais Yousef
                   ` (2 preceding siblings ...)
  2023-12-09  1:17 ` [PATCH RFC 3/3] sched/fair: Implement new type of misfit MISFIT_POWER Qais Yousef
@ 2023-12-21 15:26 ` Pierre Gondois
  2023-12-28 23:38   ` Qais Yousef
  3 siblings, 1 reply; 9+ messages in thread
From: Pierre Gondois @ 2023-12-21 15:26 UTC (permalink / raw)
  To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann
  Cc: linux-kernel, Lukasz Luba, Wei Wang, Rick Yiu, Chung-Kai Mei

Hello Qais,

On 12/9/23 02:17, Qais Yousef wrote:
> Misfit load balance was added to help handle HMP systems where we can make
> a wrong decision at wake up thinking a task can run at a smaller core, but its
> characteristics change and requires to migrate to a bigger core to meet its
> performance demands.
> 
> With the addition of uclamp, we can encounter more cases where such wrong
> placement decisions can be made and require load balancer to do a corrective
> action.
> 
> Specifically if a big task capped by uclamp_max was placed on a big core at
> wake up because EAS thought it is the most energy efficient core at the time,
> the dynamics of the system might change where other uncapped tasks might wake
> up on the cluster and there could be a better new more energy efficient
> placement for the capped task(s).
> 
> We can generalize the misfit load balance to handle different type of misfits
> (whatever they may be) by simply giving it a reason. The reason can decide the
> type of action required then.
> 
> Current misfit implementation is considered MISFIT_PERF. Which means we need to
> move a task to a better CPU to meet its performance requirement.
> 
> For UCLAMP_MAX I propose MISFIT_POWER, where we need to find a better placement
> to control its impact on power.
> 
> Once we have an API to annotate latency sensitive tasks, it is anticipated
> MISFIT_LATENCY load balance will be required to help handle oversubscribe
> situations to help better distribute the latency sensitive tasks to help reduce
> their wake up latency.
> 
> Patch 1 splits misfit status update from misfit detection by adding a new
> function is_misfit_task().
> 
> Patch 2 implements the generalization logic by adding a misfit reason and
> propagating that correctly and guarding the current misfit code with
> MISFIT_PERF reason.
> 
> Patch 3 is an RFC on a potential implementation for MISFIT_POWER.
> 
> Patch 1 and 2 were tested stand alone and had no regression observed and should
> not introduce a functional change and can be considered for merge if they make
> sense after addressing any review comments.
> 
> Patch 3 was only tested to verify it does what I expected it to do. But no real
> power/perf testing was done. Mainly because I was expecting to remove uclamp
> max-aggregation [1] and the RFC I currently have (which I wrote many many
> months ago) is tied to detecting a task being uncapped by max-aggregation.
> I need to rethink the detection mechanism.

I tried to trigger the MISFIT_POWER misfit reason without success so far.
Would it be possible to provide a workload/test to reliably trigger the
condition?

Regards,
Pierre

> 
> Beside that, the logic relies on using find_energy_efficient_cpu() to find the
> best potential new placement for the task. To do that though, we need to force
> every CPU to do the MISFIT_POWER load balance as we don't know which CPU should
> do the pull. But there might be better thoughts on how to handle this. So
> feedback and thoughts would be appreciated.
> 
> [1] https://lore.kernel.org/lkml/20231208015242.385103-1-qyousef@layalina.io/
> 
> Thanks!
> 
> --
> Qais Yousef
> 
> Qais Yousef (3):
>    sched/fair: Add is_misfit_task() function
>    sched/fair: Generalize misfit lb by adding a misfit reason
>    sched/fair: Implement new type of misfit MISFIT_POWER
> 
>   kernel/sched/fair.c  | 115 +++++++++++++++++++++++++++++++++++++------
>   kernel/sched/sched.h |   9 ++++
>   2 files changed, 110 insertions(+), 14 deletions(-)
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/3] sched: Generalize misfit load balance
  2023-12-21 15:26 ` [PATCH 0/3] sched: Generalize misfit load balance Pierre Gondois
@ 2023-12-28 23:38   ` Qais Yousef
  0 siblings, 0 replies; 9+ messages in thread
From: Qais Yousef @ 2023-12-28 23:38 UTC (permalink / raw)
  To: Pierre Gondois
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	linux-kernel, Lukasz Luba, Wei Wang, Rick Yiu, Chung-Kai Mei

[-- Attachment #1: Type: text/plain, Size: 3194 bytes --]

On 12/21/23 16:26, Pierre Gondois wrote:
> Hello Qais,
> 
> On 12/9/23 02:17, Qais Yousef wrote:
> > Misfit load balance was added to help handle HMP systems where we can make
> > a wrong decision at wake up thinking a task can run at a smaller core, but its
> > characteristics change and requires to migrate to a bigger core to meet its
> > performance demands.
> > 
> > With the addition of uclamp, we can encounter more cases where such wrong
> > placement decisions can be made and require load balancer to do a corrective
> > action.
> > 
> > Specifically if a big task capped by uclamp_max was placed on a big core at
> > wake up because EAS thought it is the most energy efficient core at the time,
> > the dynamics of the system might change where other uncapped tasks might wake
> > up on the cluster and there could be a better new more energy efficient
> > placement for the capped task(s).
> > 
> > We can generalize the misfit load balance to handle different type of misfits
> > (whatever they may be) by simply giving it a reason. The reason can decide the
> > type of action required then.
> > 
> > Current misfit implementation is considered MISFIT_PERF. Which means we need to
> > move a task to a better CPU to meet its performance requirement.
> > 
> > For UCLAMP_MAX I propose MISFIT_POWER, where we need to find a better placement
> > to control its impact on power.
> > 
> > Once we have an API to annotate latency sensitive tasks, it is anticipated
> > MISFIT_LATENCY load balance will be required to help handle oversubscribe
> > situations to help better distribute the latency sensitive tasks to help reduce
> > their wake up latency.
> > 
> > Patch 1 splits misfit status update from misfit detection by adding a new
> > function is_misfit_task().
> > 
> > Patch 2 implements the generalization logic by adding a misfit reason and
> > propagating that correctly and guarding the current misfit code with
> > MISFIT_PERF reason.
> > 
> > Patch 3 is an RFC on a potential implementation for MISFIT_POWER.
> > 
> > Patch 1 and 2 were tested stand alone and had no regression observed and should
> > not introduce a functional change and can be considered for merge if they make
> > sense after addressing any review comments.
> > 
> > Patch 3 was only tested to verify it does what I expected it to do. But no real
> > power/perf testing was done. Mainly because I was expecting to remove uclamp
> > max-aggregation [1] and the RFC I currently have (which I wrote many many
> > months ago) is tied to detecting a task being uncapped by max-aggregation.
> > I need to rethink the detection mechanism.
> 
> I tried to trigger the MISFIT_POWER misfit reason without success so far.
> Would it be possible to provide a workload/test to reliably trigger the
> condition ?

I spawn a busy loop like

	cat /dev/zero > /dev/null

Then use

	uclampset -M 0 -p $PID

to change uclamp_max to 0 and 1024 back and forth.
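
If it helps, a minimal C equivalent of the toggling is below (a sketch
assuming the uapi struct sched_attr layout and the SCHED_FLAG_* values from
include/uapi/linux/sched.h; error handling omitted):

	#include <stdint.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	/* Mirrors include/uapi/linux/sched/types.h; double check for your kernel. */
	struct sched_attr {
		uint32_t size;
		uint32_t sched_policy;
		uint64_t sched_flags;
		int32_t  sched_nice;
		uint32_t sched_priority;
		uint64_t sched_runtime;
		uint64_t sched_deadline;
		uint64_t sched_period;
		uint32_t sched_util_min;
		uint32_t sched_util_max;
	};

	#define SCHED_FLAG_KEEP_ALL		0x18	/* KEEP_POLICY | KEEP_PARAMS */
	#define SCHED_FLAG_UTIL_CLAMP_MAX	0x40

	static void set_uclamp_max(pid_t pid, unsigned int max)
	{
		struct sched_attr attr = {
			.size		= sizeof(attr),
			.sched_flags	= SCHED_FLAG_KEEP_ALL | SCHED_FLAG_UTIL_CLAMP_MAX,
			.sched_util_max	= max,
		};

		syscall(SYS_sched_setattr, pid, &attr, 0);
	}

	int main(int argc, char **argv)
	{
		pid_t pid;

		if (argc < 2)
			return 1;

		pid = atoi(argv[1]);

		/* flip the cap every second: fully capped <-> uncapped */
		for (;;) {
			set_uclamp_max(pid, 0);
			sleep(1);
			set_uclamp_max(pid, 1024);
			sleep(1);
		}
	}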

Try to load the system with some workload and you should see something like
the attached picture. Red boxes are periods where uclamp_max is 0; the rest
is for uclamp_max = 1024. Note how the task is constantly moved between CPUs
when capped.


Cheers

--
Qais Yousef

[-- Attachment #2: misfit_power.png --]
[-- Type: image/png, Size: 384120 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC 3/3] sched/fair: Implement new type of misfit MISFIT_POWER
  2023-12-09  1:17 ` [PATCH RFC 3/3] sched/fair: Implement new type of misfit MISFIT_POWER Qais Yousef
  2023-12-11 16:14   ` Pierre Gondois
@ 2024-01-04 14:28   ` Pierre Gondois
  2024-01-05  1:21     ` Qais Yousef
  1 sibling, 1 reply; 9+ messages in thread
From: Pierre Gondois @ 2024-01-04 14:28 UTC (permalink / raw)
  To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann
  Cc: linux-kernel, Lukasz Luba, Wei Wang, Rick Yiu, Chung-Kai Mei

Hello Qais,

I tried to do as you indicated at:
    https://lore.kernel.org/all/20231228233848.piyodw2s2ytli37a@airbuntu/
without success. I can see that the task is migrated from a big CPU to
smaller CPUs, but it doesn't seem to be related to the new MISFIT_POWER
feature.

Indeed, if the uclamp_max value of a CPU-bound task is set to 0, isn't it
normal to have EAS/feec() migrate the task to smaller CPUs? I added tracing
inside is_misfit_task() and load_balance()'s misfit path and could not see
this path being used.

On 12/9/23 02:17, Qais Yousef wrote:
> MISFIT_POWER requires moving the task to a more efficient CPU.
> 
> This can happen when a big task is capped by uclamp_max, but another
> task wakes up on this CPU that can lift the capping, in this case we
> need to migrate it to another, likely smaller, CPU to save power.

Just to be sure, are we talking about the following path, where sugov
decides which OPP to select?
sugov_get_util()
\-effective_cpu_util()
   \-uclamp_rq_util_with()

To try to describe the issue in my own words, IIUC, the issue comes from
the fact that during energy estimations in feec(), we don't estimate the
impact of enqueuing a task on the rq's UCLAMP_MAX value. So a rq with a
small UCLAMP_MAX value might see the value grow if an uncapped task is
enqueued, leading to a higher frequency being selected and more power being
consumed.
Thus, this patch tries to detect such a scenario and migrate the clamped
tasks.
Maybe another approach would be to estimate the impact of enqueuing a task
on the rq's UCLAMP_MAX value?
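
To make sure I understood it, here is a toy model of that max-aggregation
behavior (definitely not the kernel's actual data structures):

	/* Toy model: rq-wide UCLAMP_MAX ~= max over the runnable tasks' clamps. */
	static unsigned int rq_uclamp_max(const unsigned int *task_caps, int nr_running)
	{
		unsigned int rq_cap = 0;
		int i;

		for (i = 0; i < nr_running; i++)
			if (task_caps[i] > rq_cap)
				rq_cap = task_caps[i];

		return rq_cap;
	}

	/*
	 * {0}       -> rq cap 0:    the capped busy loop runs at a low OPP.
	 * {0, 1024} -> rq cap 1024: the busy loop's util now drives the CPU to
	 *              a high OPP, which is the MISFIT_POWER situation.
	 */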

Regards,
Pierre

> 
> To enable that we need to be smarter about which CPU should do the pull.
> But this requires enabling load balance on all CPUs so that the correct
> CPU does the pull. Instead of the current behavior of nominating the CPU
> with the largest capacity in the group to do the pull.
> 
> This is important to ensure the MISFIT_POWER tasks don't end up on most
> performant CPUs, which is the default behavior of the load balance.  We
> could end up wasting energy unnecessarily or interfere with more
> important tasks on these big CPUs - leading to worse user experience.
> 
> To ensure optimal decision is made, we need to enable calling feec() to
> pick the most efficient CPU for us. Which means we need to force feec()
> to ignore overutilized flag. If feec() returns the same value as the CPU
> that is doing the balance, we perform the pull. Otherwise we'd have to
> defer for another CPU to do the pull.
> 
> To minimize the overhead, this is only done for MISFIT_POWER.
> 
> For capacity aware scheduling or none HMP systems, we will pick a CPU
> that we won't cause its uclamp_max to be uncapped.
> 
> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> ---
>   kernel/sched/fair.c  | 77 ++++++++++++++++++++++++++++++++++++++++----
>   kernel/sched/sched.h |  1 +
>   2 files changed, 71 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index dd49b89a6e3e..328467dbe88b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5066,10 +5066,33 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu)
>   static inline int is_misfit_task(struct task_struct *p, struct rq *rq,
>   				 misfit_reason_t *reason)
>   {
> +	unsigned long rq_util_max;
> +	unsigned long p_util_min;
> +	unsigned long p_util_max;
> +	unsigned long util;
> +
>   	if (!p || p->nr_cpus_allowed == 1)
>   		return 0;
>   
> -	if (task_fits_cpu(p, cpu_of(rq)))
> +	rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
> +	p_util_min = uclamp_eff_value(p, UCLAMP_MIN);
> +	p_util_max = uclamp_eff_value(p, UCLAMP_MAX);
> +	util = task_util_est(p);
> +
> +	if (uclamp_is_used()) {
> +		/*
> +		* Check if a big task is capped by uclamp max is now sharing
> +		* the cpu with something else uncapped and must be moved away
> +		*/
> +		if (rq_util_max > p_util_max && util > p_util_max) {
> +			if (reason)
> +				*reason = MISFIT_POWER;
> +
> +			return 1;
> +		}
> +	}
> +
> +	if (util_fits_cpu(util, p_util_min, p_util_max, cpu_of(rq)))
>   		return 0;
>   
>   	if (reason)
> @@ -7923,7 +7946,8 @@ compute_energy(struct energy_env *eenv, struct perf_domain *pd,
>    * other use-cases too. So, until someone finds a better way to solve this,
>    * let's keep things simple by re-using the existing slow path.
>    */
> -static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> +static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu,
> +				     bool ignore_ou)
>   {
>   	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
>   	unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
> @@ -7940,7 +7964,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>   
>   	rcu_read_lock();
>   	pd = rcu_dereference(rd->pd);
> -	if (!pd || READ_ONCE(rd->overutilized))
> +	if (!pd || (READ_ONCE(rd->overutilized) && !ignore_ou))
>   		goto unlock;
>   
>   	/*
> @@ -8144,7 +8168,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>   			return cpu;
>   
>   		if (sched_energy_enabled()) {
> -			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> +			new_cpu = find_energy_efficient_cpu(p, prev_cpu, false);
>   			if (new_cpu >= 0)
>   				return new_cpu;
>   			new_cpu = prev_cpu;
> @@ -9030,6 +9054,7 @@ static int detach_tasks(struct lb_env *env)
>   {
>   	struct list_head *tasks = &env->src_rq->cfs_tasks;
>   	unsigned long util, load;
> +	misfit_reason_t reason;
>   	struct task_struct *p;
>   	int detached = 0;
>   
> @@ -9118,9 +9143,28 @@ static int detach_tasks(struct lb_env *env)
>   
>   		case migrate_misfit:
>   			/* This is not a misfit task */
> -			if (!is_misfit_task(p, cpu_rq(env->src_cpu), NULL))
> +			if (!is_misfit_task(p, cpu_rq(env->src_cpu), &reason))
>   				goto next;
>   
> +			if (reason == MISFIT_POWER) {
> +				if (sched_energy_enabled()) {
> +					int new_cpu = find_energy_efficient_cpu(p, env->src_cpu, true);
> +					if (new_cpu != env->dst_cpu)
> +						goto next;
> +				} else {
> +					unsigned long dst_uclamp_max = uclamp_rq_get(env->dst_rq, UCLAMP_MAX);
> +					unsigned long p_uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
> +
> +					/*
> +					 * Pick a task that will not cause us
> +					 * to uncap dst_cpu. Or give up until
> +					 * another CPU tries to do the pull.
> +					 */
> +					if (p_uclamp_max > dst_uclamp_max)
> +						goto next;
> +				}
> +			}
> +
>   			env->imbalance = 0;
>   			break;
>   		}
> @@ -11203,6 +11247,18 @@ static int should_we_balance(struct lb_env *env)
>   	if (!cpumask_test_cpu(env->dst_cpu, env->cpus))
>   		return 0;
>   
> +	/*
> +	 * For MISFIT_POWER, we need every CPU to do the lb so that we can pick
> +	 * the most energy efficient one via EAS if available or by making sure
> +	 * the dst_rq uclamp_max higher than the misfit task's uclamp_max.
> +	 *
> +	 * We don't want to do a pull if both src and dst cpus are in
> +	 * MISFIT_POWER state.
> +	 */
> +	if (env->src_rq->misfit_reason == MISFIT_POWER &&
> +	    env->dst_rq->misfit_reason != MISFIT_POWER)
> +			return 1;
> +
>   	/*
>   	 * In the newly idle case, we will allow all the CPUs
>   	 * to do the newly idle load balance.
> @@ -11431,8 +11487,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>   		 * We do not want newidle balance, which can be very
>   		 * frequent, pollute the failure counter causing
>   		 * excessive cache_hot migrations and active balances.
> +		 *
> +		 * MISFIT_POWER can also trigger a lot of failed misfit
> +		 * migrations as we need to ask every CPU to do the pull and
> +		 * expectedly lots of failures will incur.
>   		 */
> -		if (idle != CPU_NEWLY_IDLE)
> +		if (idle != CPU_NEWLY_IDLE && env.src_rq->misfit_reason != MISFIT_POWER)
>   			sd->nr_balance_failed++;
>   
>   		if (need_active_balance(&env)) {
> @@ -11515,8 +11575,11 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>   	 * repeatedly reach this code, which would lead to balance_interval
>   	 * skyrocketing in a short amount of time. Skip the balance_interval
>   	 * increase logic to avoid that.
> +	 *
> +	 * So does MISFIT_POWER which asks every CPU to do the pull as we can't
> +	 * tell which one would be the best one to move to before hand.
>   	 */
> -	if (env.idle == CPU_NEWLY_IDLE)
> +	if (env.idle == CPU_NEWLY_IDLE || env.src_rq->misfit_reason == MISFIT_POWER)
>   		goto out;
>   
>   	/* tune up the balancing interval */
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 399b6526afab..3852109ffe62 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -964,6 +964,7 @@ struct balance_callback {
>   
>   typedef enum misfit_reason {
>   	MISFIT_PERF,		/* Requires moving to a more performant CPU */
> +	MISFIT_POWER,		/* Requires moving to a more efficient CPU */
>   } misfit_reason_t;
>   
>   /*

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC 3/3] sched/fair: Implement new type of misfit MISFIT_POWER
  2024-01-04 14:28   ` Pierre Gondois
@ 2024-01-05  1:21     ` Qais Yousef
  0 siblings, 0 replies; 9+ messages in thread
From: Qais Yousef @ 2024-01-05  1:21 UTC (permalink / raw)
  To: Pierre Gondois
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	linux-kernel, Lukasz Luba, Wei Wang, Rick Yiu, Chung-Kai Mei

Hi Pierre

On 01/04/24 15:28, Pierre Gondois wrote:
> Hello Qais,
> 
> I tried to do as you indicated at:
>    https://lore.kernel.org/all/20231228233848.piyodw2s2ytli37a@airbuntu/
> without success. I can see that the task is migrated from a big CPU to
> smaller CPUs, but it doesn't seem to be related to the new MISFIT_POWER
> feature.

Hmmm. It is possible something went wrong while preparing this set of patches.
I do remember trying this patch quickly, but judging by the bug you found
I might have missed doing a recent run after a set of changes. So something
could have gotten broken.

Let me retry it and see what's going on.

> Indeed, if the uclamp_max value of a CPU-bound task is set to 0, isn't it
> normal have EAS/feec() migrating the task to smaller CPUs ? I added tracing
> inside is_misfit_task() and load_balance()'s misfit path and could not see
> this path being used.

I did have similar debug messages and I could see them triggered. To be
honest I spent most of the time working on this in the past against 5.10 and
5.15 kernels. And when I started the forward port I was already working on
removing max aggregation, and this whole patch needed to be rewritten, so
I kept it as a guideline. My focus was on getting the misfit generalization
done (patches 1 and 2) and demonstrating how this can potentially be used to
implement better logic to balance based on power.

The main ideas are:

1. We need to detect MISFIT_POWER.
2. We need to force every CPU to try to pull.
3. We need to use feec() to decide which CPU should do the pull.

I'm not sure whether there's a better way, so I was hoping to see if there
are other PoVs to consider.

> 
> On 12/9/23 02:17, Qais Yousef wrote:
> > MISFIT_POWER requires moving the task to a more efficient CPU.
> > 
> > This can happen when a big task is capped by uclamp_max, but another
> > task wakes up on this CPU that can lift the capping, in this case we
> > need to migrate it to another, likely smaller, CPU to save power.
> 
> Just to be sure, are we talking about the following path, where sugov
> decides which OPP to select ?
> sugov_get_util()
> \-effective_cpu_util()
>   \-uclamp_rq_util_with()
> 
> To try to describe the issue in my own words, IIUC, the issue comes from
> the fact that during energy estimations in feec(), we don't estimate the
> impact of enqueuing a task on the rq's UCLAMP_MAX value. So a rq with a
> little UCLAMP_MAX value might see the value grows if an uncapped task
> is enqueued, leading to raising the frequency and consuming more
> power.
> Thus, this patch tries to detect such scenario and migrate the clamped
> tasks.

Yes, to a big degree. See below.

> Maybe another approach would be to estimate the impact of enqueuing a
> task on the rq's UCLAMP_MAX value ?

I'd like to think we'll remove the rq uclamp values altogether, hopefully.

Generally I'd welcome ideas on what kinds of MISFIT_POWER scenarios we have.
With uclamp_max, the issue is that these tasks can be busy loops (their
util_avg is too high) which will cause anything else RUNNING to run at max
frequency regardless of its own util_avg. UCLAMP_MAX tells us we have an
opportunity to move them somewhere less expensive.

Detection logic will be harder without rq uclamp_max.

My first ever detection logic was actually to check if the task is running
on the smallest fitting CPU; if not, then move it there. Then I switched to
detection based on whether the task is capped or not, with feec() deciding
the best next place for it to go.

There's another problem: when these tasks end up on a big core, they'll make
the CPU look busy and can prevent other tasks from running 'comfortably'
alongside them. So not only do they waste power, but they also get in the way
of other work getting done with less interference. I'm not sure if this
should be treated as a different type of misfit though.

We need to get MISFIT_POWER in first and then see if the interference issue
is automatically resolved. From a power perspective, it is wrong to keep
a busy loop that is capped such that it can run on a mid or little CPU on the
big for an extended period of time. And as a by-product the interference
issue should be resolved, in theory at least.

I'm also not sure if we can have non-UCLAMP_MAX based MISFIT_POWER.
I couldn't come up with a scenario yet, but I don't think we need to restrict
ourselves to UCLAMP_MAX only ones. So ideas are welcome :-)


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 9+ messages in thread


Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-09  1:17 [PATCH 0/3] sched: Generalize misfit load balance Qais Yousef
2023-12-09  1:17 ` [PATCH 1/3] sched/fair: Add is_misfit_task() function Qais Yousef
2023-12-09  1:17 ` [PATCH 2/3] sched/fair: Generalize misfit lb by adding a misfit reason Qais Yousef
2023-12-09  1:17 ` [PATCH RFC 3/3] sched/fair: Implement new type of misfit MISFIT_POWER Qais Yousef
2023-12-11 16:14   ` Pierre Gondois
2024-01-04 14:28   ` Pierre Gondois
2024-01-05  1:21     ` Qais Yousef
2023-12-21 15:26 ` [PATCH 0/3] sched: Generalize misfit load balance Pierre Gondois
2023-12-28 23:38   ` Qais Yousef
