linux-pm.vger.kernel.org archive mirror
* [PATCH 0/4] Fix dvfs_headroom escaping uclamp constraints
@ 2023-08-20 21:06 Qais Yousef
  2023-08-20 21:06 ` [PATCH 1/4] sched: cpufreq: Rename map_util_perf to apply_dvfs_headroom Qais Yousef
                   ` (4 more replies)
  0 siblings, 5 replies; 32+ messages in thread
From: Qais Yousef @ 2023-08-20 21:06 UTC (permalink / raw)
  To: Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, linux-pm, Vincent Guittot, Dietmar Eggemann,
	Lukasz Luba, Qais Yousef

DVFS headroom, or map_util_perf(), is applied after uclamp restrictions have
been applied in effective_cpu_util(). This leads to two problems for uclamp:

	1. If util < uclamp_min, we'll run faster than uclamp_min. For example
	   util = 50, uclamp_min = 100. map_util_perf() = 125 instead of 100.

	2. If util > uclamp_max, we'll run faster than uclamp_max. For example
	   util = 900, uclamp_max = 800, map_util_perf() = 1000.
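
A standalone sketch of the two orderings (plain C mirroring the math above,
not the kernel code):

	#include <stdio.h>

	static unsigned long map_util_perf(unsigned long util)
	{
		return util + (util >> 2);	/* +25% headroom */
	}

	static unsigned long clamp_util(unsigned long util, unsigned long lo,
					unsigned long hi)
	{
		return util < lo ? lo : (util > hi ? hi : util);
	}

	int main(void)
	{
		/* current order: clamp first, the headroom escapes the clamps */
		printf("%lu\n", map_util_perf(clamp_util(50, 100, 1024))); /* 125 */
		printf("%lu\n", map_util_perf(clamp_util(900, 0, 800)));   /* 1000 */

		/* fixed order: headroom first, then the clamps hold */
		printf("%lu\n", clamp_util(map_util_perf(50), 100, 1024)); /* 100 */
		printf("%lu\n", clamp_util(map_util_perf(900), 0, 800));   /* 800 */
		return 0;
	}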

The first patch renames the function to apply_dvfs_headroom() to reflect
what it really does. It is not really mapping util, but providing some
headroom for the util to grow. It also adds documentation for the function.

The second patch is the actual fix.

The third patch moves apply_dvfs_headroom() to sched.h as there are no
longer any users outside the scheduler.

The fourth patch is an RFC to redefine what the headroom means for RT, DL
and IRQ pressures.

Thanks!

--
Qais Yousef

Qais Yousef (4):
  sched: cpufreq: Rename map_util_perf to apply_dvfs_headroom
  sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints
  sched: cpufreq: Move apply_dvfs_headroom() to sched.h
  sched: cpufreq: Apply DVFS headroom to CFS only

 include/linux/energy_model.h     |  1 -
 include/linux/sched/cpufreq.h    |  5 -----
 kernel/sched/core.c              |  7 +++++--
 kernel/sched/cpufreq_schedutil.c |  5 ++---
 kernel/sched/sched.h             | 24 ++++++++++++++++++++++++
 5 files changed, 31 insertions(+), 11 deletions(-)

-- 
2.34.1



* [PATCH 1/4] sched: cpufreq: Rename map_util_perf to apply_dvfs_headroom
  2023-08-20 21:06 [PATCH 0/4] Fix dvfs_headroom escaping uclamp constraints Qais Yousef
@ 2023-08-20 21:06 ` Qais Yousef
  2023-08-20 21:13   ` Qais Yousef
  2023-08-21 16:38   ` Dietmar Eggemann
  2023-08-20 21:06 ` [PATCH 2/4] sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints Qais Yousef
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 32+ messages in thread
From: Qais Yousef @ 2023-08-20 21:06 UTC (permalink / raw)
  To: Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, linux-pm, Vincent Guittot, Dietmar Eggemann,
	Lukasz Luba, Qais Yousef

We are providing headroom for the utilization to grow until the next
decision point where we pick the next frequency. Give the function a better
name and give it some documentation. It is not really mapping anything.

Provide a dummy definition for !CONFIG_CPU_FREQ which will be required
for later patches.

Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
 include/linux/energy_model.h     |  2 +-
 include/linux/sched/cpufreq.h    | 27 ++++++++++++++++++++++++++-
 kernel/sched/cpufreq_schedutil.c |  6 +++---
 3 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index b9caa01dfac4..6ebde4e69e98 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -243,7 +243,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	scale_cpu = arch_scale_cpu_capacity(cpu);
 	ps = &pd->table[pd->nr_perf_states - 1];
 
-	max_util = map_util_perf(max_util);
+	max_util = apply_dvfs_headroom(max_util);
 	max_util = min(max_util, allowed_cpu_cap);
 	freq = map_util_freq(max_util, ps->frequency, scale_cpu);
 
diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index bdd31ab93bc5..f0069b354ac8 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -29,10 +29,35 @@ static inline unsigned long map_util_freq(unsigned long util,
 	return freq * util / cap;
 }
 
-static inline unsigned long map_util_perf(unsigned long util)
+/*
+ * DVFS decisions are made at discrete points. If the CPU stays busy, the util
+ * will continue to grow, which means it could need to run at a higher
+ * frequency before the next decision point is reached. IOW, we can't follow
+ * the util immediately as it grows; there's a delay before we issue a request
+ * to go to a higher frequency. The headroom caters for this delay so the
+ * system continues to run at an adequate performance point.
+ *
+ * This function provides enough headroom to maintain adequate performance
+ * assuming the CPU continues to be busy.
+ *
+ * At the moment it is a constant multiplication by 1.25.
+ *
+ * TODO: The headroom should be a function of the delay. 25% is too high,
+ * especially on powerful systems. For example, if the delay is 500us, it makes
+ * more sense to give a small headroom as the next decision point is not far
+ * away and will follow the util if it continues to rise. On the other hand, if
+ * the delay is 10ms, then we need a bigger headroom so the CPU won't struggle
+ * at a lower frequency if it never goes to idle until then.
+ */
+static inline unsigned long apply_dvfs_headroom(unsigned long util)
 {
 	return util + (util >> 2);
 }
+#else
+static inline unsigned long apply_dvfs_headroom(unsigned long util)
+{
+	return util;
+}
 #endif /* CONFIG_CPU_FREQ */
 
 #endif /* _LINUX_SCHED_CPUFREQ_H */
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 4492608b7d7f..916c4d3d6192 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -143,7 +143,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 	unsigned int freq = arch_scale_freq_invariant() ?
 				policy->cpuinfo.max_freq : policy->cur;
 
-	util = map_util_perf(util);
+	util = apply_dvfs_headroom(util);
 	freq = map_util_freq(util, freq, max);
 
 	if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
@@ -406,8 +406,8 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
 	    sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
 		sg_cpu->util = prev_util;
 
-	cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl),
-				   map_util_perf(sg_cpu->util), max_cap);
+	cpufreq_driver_adjust_perf(sg_cpu->cpu, apply_dvfs_headroom(sg_cpu->bw_dl),
+				   apply_dvfs_headroom(sg_cpu->util), max_cap);
 
 	sg_cpu->sg_policy->last_freq_update_time = time;
 }
-- 
2.34.1



* [PATCH 2/4] sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints
  2023-08-20 21:06 [PATCH 0/4] Fix dvfs_headroom escaping uclamp constraints Qais Yousef
  2023-08-20 21:06 ` [PATCH 1/4] sched: cpufreq: Rename map_util_perf to apply_dvfs_headroom Qais Yousef
@ 2023-08-20 21:06 ` Qais Yousef
  2023-08-21 16:39   ` Dietmar Eggemann
  2023-08-29 14:35   ` Vincent Guittot
  2023-08-20 21:06 ` [PATCH 3/4] sched: cpufreq: Move apply_dvfs_headroom() to sched.h Qais Yousef
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 32+ messages in thread
From: Qais Yousef @ 2023-08-20 21:06 UTC (permalink / raw)
  To: Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, linux-pm, Vincent Guittot, Dietmar Eggemann,
	Lukasz Luba, Qais Yousef

DVFS headroom is applied after we calculate the effective_cpu_util()
which is where we honour uclamp constraints. It makes more sense to
apply the headroom there once and let all users naturally get the right
thing without having to sprinkle the call around in various places.

Before this fix, running

	uclampset -M 800 cat /dev/zero > /dev/null

will cause the test system to run at a max freq of 2.8GHz. After the fix
it runs at 2.2GHz instead, which is the correct value that matches the
capacity of 800.

Note that a similar problem exists for uclamp_min. If util is 50 and
uclamp_min is 100, since we apply_dvfs_headroom() after applying uclamp
constraints, we'll end up with a util of 125 instead of 100. IOW, we get
boosted twice: first by uclamp_min, and a second time by the dvfs
headroom.

Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
 include/linux/energy_model.h     |  1 -
 kernel/sched/core.c              | 11 ++++++++---
 kernel/sched/cpufreq_schedutil.c |  5 ++---
 3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 6ebde4e69e98..adec808b371a 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -243,7 +243,6 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	scale_cpu = arch_scale_cpu_capacity(cpu);
 	ps = &pd->table[pd->nr_perf_states - 1];
 
-	max_util = apply_dvfs_headroom(max_util);
 	max_util = min(max_util, allowed_cpu_cap);
 	freq = map_util_freq(max_util, ps->frequency, scale_cpu);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index efe3848978a0..441d433c83cd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7439,8 +7439,10 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 	 * frequency will be gracefully reduced with the utilization decay.
 	 */
 	util = util_cfs + cpu_util_rt(rq);
-	if (type == FREQUENCY_UTIL)
+	if (type == FREQUENCY_UTIL) {
+		util = apply_dvfs_headroom(util);
 		util = uclamp_rq_util_with(rq, util, p);
+	}
 
 	dl_util = cpu_util_dl(rq);
 
@@ -7471,9 +7473,12 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 	 *              max - irq
 	 *   U' = irq + --------- * U
 	 *                 max
+	 *
+	 * We only need to apply dvfs headroom to the irq part since the util part
+	 * already had it applied.
 	 */
 	util = scale_irq_capacity(util, irq, max);
-	util += irq;
+	util += type ==  FREQUENCY_UTIL ? apply_dvfs_headroom(irq) : irq;
 
 	/*
 	 * Bandwidth required by DEADLINE must always be granted while, for
@@ -7486,7 +7491,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 	 * an interface. So, we only do the latter for now.
 	 */
 	if (type == FREQUENCY_UTIL)
-		util += cpu_bw_dl(rq);
+		util += apply_dvfs_headroom(cpu_bw_dl(rq));
 
 	return min(max, util);
 }
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 916c4d3d6192..0c7565ac31fb 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -143,7 +143,6 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 	unsigned int freq = arch_scale_freq_invariant() ?
 				policy->cpuinfo.max_freq : policy->cur;
 
-	util = apply_dvfs_headroom(util);
 	freq = map_util_freq(util, freq, max);
 
 	if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
@@ -406,8 +405,8 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
 	    sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
 		sg_cpu->util = prev_util;
 
-	cpufreq_driver_adjust_perf(sg_cpu->cpu, apply_dvfs_headroom(sg_cpu->bw_dl),
-				   apply_dvfs_headroom(sg_cpu->util), max_cap);
+	cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_dl,
+				   sg_cpu->util, max_cap);
 
 	sg_cpu->sg_policy->last_freq_update_time = time;
 }
-- 
2.34.1



* [PATCH 3/4] sched: cpufreq: Move apply_dvfs_headroom() to sched.h
  2023-08-20 21:06 [PATCH 0/4] Fix dvfs_headroom escaping uclamp constraints Qais Yousef
  2023-08-20 21:06 ` [PATCH 1/4] sched: cpufreq: Rename map_util_perf to apply_dvfs_headroom Qais Yousef
  2023-08-20 21:06 ` [PATCH 2/4] sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints Qais Yousef
@ 2023-08-20 21:06 ` Qais Yousef
  2023-08-21 16:40   ` Dietmar Eggemann
  2023-08-20 21:06 ` [PATCH RFC 4/4] sched: cpufreq: Apply DVFS headroom to CFS only Qais Yousef
  2023-08-21 10:34 ` [PATCH 0/4] Fix dvfs_headroom escaping uclamp constraints Rafael J. Wysocki
  4 siblings, 1 reply; 32+ messages in thread
From: Qais Yousef @ 2023-08-20 21:06 UTC (permalink / raw)
  To: Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, linux-pm, Vincent Guittot, Dietmar Eggemann,
	Lukasz Luba, Qais Yousef

This function relies on updating the util signal appropriately to give
it headroom to grow. This is more of a scheduler functionality than
a cpufreq one. Move it to sched.h where all the other util handling code
belongs.

Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
 include/linux/sched/cpufreq.h | 30 ------------------------------
 kernel/sched/sched.h          | 24 ++++++++++++++++++++++++
 2 files changed, 24 insertions(+), 30 deletions(-)

diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index f0069b354ac8..d01755d3142f 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -28,36 +28,6 @@ static inline unsigned long map_util_freq(unsigned long util,
 {
 	return freq * util / cap;
 }
-
-/*
- * DVFS decisions are made at discrete points. If the CPU stays busy, the util
- * will continue to grow, which means it could need to run at a higher
- * frequency before the next decision point is reached. IOW, we can't follow
- * the util immediately as it grows; there's a delay before we issue a request
- * to go to a higher frequency. The headroom caters for this delay so the
- * system continues to run at an adequate performance point.
- *
- * This function provides enough headroom to maintain adequate performance
- * assuming the CPU continues to be busy.
- *
- * At the moment it is a constant multiplication by 1.25.
- *
- * TODO: The headroom should be a function of the delay. 25% is too high,
- * especially on powerful systems. For example, if the delay is 500us, it makes
- * more sense to give a small headroom as the next decision point is not far
- * away and will follow the util if it continues to rise. On the other hand, if
- * the delay is 10ms, then we need a bigger headroom so the CPU won't struggle
- * at a lower frequency if it never goes to idle until then.
- */
-static inline unsigned long apply_dvfs_headroom(unsigned long util)
-{
-	return util + (util >> 2);
-}
-#else
-static inline unsigned long apply_dvfs_headroom(unsigned long util)
-{
-	return util;
-}
 #endif /* CONFIG_CPU_FREQ */
 
 #endif /* _LINUX_SCHED_CPUFREQ_H */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3a01b7a2bf66..56eeb5b05b50 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2997,6 +2997,30 @@ enum cpu_util_type {
 unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 				 enum cpu_util_type type,
 				 struct task_struct *p);
+/*
+ * DVFS decisions are made at discrete points. If the CPU stays busy, the util
+ * will continue to grow, which means it could need to run at a higher
+ * frequency before the next decision point is reached. IOW, we can't follow
+ * the util immediately as it grows; there's a delay before we issue a request
+ * to go to a higher frequency. The headroom caters for this delay so the
+ * system continues to run at an adequate performance point.
+ *
+ * This function provides enough headroom to maintain adequate performance
+ * assuming the CPU continues to be busy.
+ *
+ * At the moment it is a constant multiplication by 1.25.
+ *
+ * TODO: The headroom should be a function of the delay. 25% is too high,
+ * especially on powerful systems. For example, if the delay is 500us, it makes
+ * more sense to give a small headroom as the next decision point is not far
+ * away and will follow the util if it continues to rise. On the other hand, if
+ * the delay is 10ms, then we need a bigger headroom so the CPU won't struggle
+ * at a lower frequency if it never goes to idle until then.
+ */
+static inline unsigned long apply_dvfs_headroom(unsigned long util)
+{
+	return util + (util >> 2);
+}
 
 /*
  * Verify the fitness of task @p to run on @cpu taking into account the
-- 
2.34.1



* [PATCH RFC 4/4] sched: cpufreq: Apply DVFS headroom to CFS only
  2023-08-20 21:06 [PATCH 0/4] Fix dvfs_headroom escaping uclamp constraints Qais Yousef
                   ` (2 preceding siblings ...)
  2023-08-20 21:06 ` [PATCH 3/4] sched: cpufreq: Move apply_dvfs_headroom() to sched.h Qais Yousef
@ 2023-08-20 21:06 ` Qais Yousef
  2023-08-21 16:41   ` Dietmar Eggemann
  2023-08-21 10:34 ` [PATCH 0/4] Fix dvfs_headroom escaping uclamp constraints Rafael J. Wysocki
  4 siblings, 1 reply; 32+ messages in thread
From: Qais Yousef @ 2023-08-20 21:06 UTC (permalink / raw)
  To: Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, linux-pm, Vincent Guittot, Dietmar Eggemann,
	Lukasz Luba, Qais Yousef

RT and Deadline have exact performance requirements when running. RT runs
at max or at a specific OPP defined by uclamp_min. Deadline's OPP is
defined by its bandwidth. Both are known ahead of time and
don't require a headroom to grow into.

IRQs on the other hand have no specific performance requirement and
cruise along at whatever the current OPP happens to be when they occur.

Now they all have PELT pressure signals that do impact frequency
selection and task placement. The question is: do they need DVFS
headroom?

I think the answer is no, because when CFS is not running at all, these
pressure signals have no real impact on performance for RT, DL or IRQ.

If CFS util is not zero, we already add their pressure as an
*additional* headroom to account for the lost/stolen time. So I argue
that the pressures are headroom themselves and shouldn't need an
additional DVFS headroom applied on top.

In summary, the final outcome should be:

	CFS + DVFS headroom + (RT, DL, IRQ) pressure headroom

Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
 kernel/sched/core.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 441d433c83cd..602e369753a3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7438,10 +7438,11 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 	 * When there are no CFS RUNNABLE tasks, clamps are released and
 	 * frequency will be gracefully reduced with the utilization decay.
 	 */
-	util = util_cfs + cpu_util_rt(rq);
 	if (type == FREQUENCY_UTIL) {
-		util = apply_dvfs_headroom(util);
+		util = apply_dvfs_headroom(util_cfs) + cpu_util_rt(rq);
 		util = uclamp_rq_util_with(rq, util, p);
+	} else {
+		util = util_cfs + cpu_util_rt(rq);
 	}
 
 	dl_util = cpu_util_dl(rq);
@@ -7473,12 +7474,9 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 	 *              max - irq
 	 *   U' = irq + --------- * U
 	 *                 max
-	 *
-	 * We only need to apply dvfs headroom to the irq part since the util part
-	 * already had it applied.
 	 */
 	util = scale_irq_capacity(util, irq, max);
-	util += type ==  FREQUENCY_UTIL ? apply_dvfs_headroom(irq) : irq;
+	util += irq;
 
 	/*
 	 * Bandwidth required by DEADLINE must always be granted while, for
@@ -7491,7 +7489,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 	 * an interface. So, we only do the latter for now.
 	 */
 	if (type == FREQUENCY_UTIL)
-		util += apply_dvfs_headroom(cpu_bw_dl(rq));
+		util += cpu_bw_dl(rq);
 
 	return min(max, util);
 }
-- 
2.34.1



* Re: [PATCH 1/4] sched: cpufreq: Rename map_util_perf to apply_dvfs_headroom
  2023-08-20 21:06 ` [PATCH 1/4] sched: cpufreq: Rename map_util_perf to apply_dvfs_headroom Qais Yousef
@ 2023-08-20 21:13   ` Qais Yousef
  2023-08-21 16:38   ` Dietmar Eggemann
  1 sibling, 0 replies; 32+ messages in thread
From: Qais Yousef @ 2023-08-20 21:13 UTC (permalink / raw)
  To: Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, linux-pm, Vincent Guittot, Dietmar Eggemann, Lukasz Luba

On 08/20/23 22:06, Qais Yousef wrote:
> We are providing headroom for the utilization to grow until the next
> decision point where we pick the next frequency. Give the function a better
> name and give it some documentation. It is not really mapping anything.
> 
> Provide a dummy definition for !CONFIG_CPU_FREQ which will be required
> for later patches.
> 
> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> ---
>  include/linux/energy_model.h     |  2 +-
>  include/linux/sched/cpufreq.h    | 27 ++++++++++++++++++++++++++-
>  kernel/sched/cpufreq_schedutil.c |  6 +++---
>  3 files changed, 30 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index b9caa01dfac4..6ebde4e69e98 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -243,7 +243,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>  	scale_cpu = arch_scale_cpu_capacity(cpu);
>  	ps = &pd->table[pd->nr_perf_states - 1];
>  
> -	max_util = map_util_perf(max_util);
> +	max_util = apply_dvfs_headroom(max_util);
>  	max_util = min(max_util, allowed_cpu_cap);
>  	freq = map_util_freq(max_util, ps->frequency, scale_cpu);
>  
> diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
> index bdd31ab93bc5..f0069b354ac8 100644
> --- a/include/linux/sched/cpufreq.h
> +++ b/include/linux/sched/cpufreq.h
> @@ -29,10 +29,35 @@ static inline unsigned long map_util_freq(unsigned long util,
>  	return freq * util / cap;
>  }
>  
> -static inline unsigned long map_util_perf(unsigned long util)
> +/*
> + * DVFS decisions are made at discrete points. If the CPU stays busy, the util
> + * will continue to grow, which means it could need to run at a higher
> + * frequency before the next decision point is reached. IOW, we can't follow
> + * the util immediately as it grows; there's a delay before we issue a request
> + * to go to a higher frequency. The headroom caters for this delay so the
> + * system continues to run at an adequate performance point.
> + *
> + * This function provides enough headroom to maintain adequate performance
> + * assuming the CPU continues to be busy.
> + *
> + * At the moment it is a constant multiplication by 1.25.
> + *
> + * TODO: The headroom should be a function of the delay. 25% is too high,
> + * especially on powerful systems. For example, if the delay is 500us, it makes
> + * more sense to give a small headroom as the next decision point is not far
> + * away and will follow the util if it continues to rise. On the other hand, if
> + * the delay is 10ms, then we need a bigger headroom so the CPU won't struggle
> + * at a lower frequency if it never goes to idle until then.
> + */

I'm happy to drop this TODO if it is controversial. I already have a series
that works to that effect which I will hopefully get a chance to post soon
enough.


Cheers

--
Qais Yousef

> +static inline unsigned long apply_dvfs_headroom(unsigned long util)
>  {
>  	return util + (util >> 2);
>  }
> +#else
> +static inline unsigned long apply_dvfs_headroom(unsigned long util)
> +{
> +	return util;
> +}
>  #endif /* CONFIG_CPU_FREQ */
>  
>  #endif /* _LINUX_SCHED_CPUFREQ_H */
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 4492608b7d7f..916c4d3d6192 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -143,7 +143,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
>  	unsigned int freq = arch_scale_freq_invariant() ?
>  				policy->cpuinfo.max_freq : policy->cur;
>  
> -	util = map_util_perf(util);
> +	util = apply_dvfs_headroom(util);
>  	freq = map_util_freq(util, freq, max);
>  
>  	if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
> @@ -406,8 +406,8 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
>  	    sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
>  		sg_cpu->util = prev_util;
>  
> -	cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl),
> -				   map_util_perf(sg_cpu->util), max_cap);
> +	cpufreq_driver_adjust_perf(sg_cpu->cpu, apply_dvfs_headroom(sg_cpu->bw_dl),
> +				   apply_dvfs_headroom(sg_cpu->util), max_cap);
>  
>  	sg_cpu->sg_policy->last_freq_update_time = time;
>  }
> -- 
> 2.34.1
> 


* Re: [PATCH 0/4] Fix dvfs_headroom escaping uclamp constraints
  2023-08-20 21:06 [PATCH 0/4] Fix dvfs_headroom escaping uclamp constraints Qais Yousef
                   ` (3 preceding siblings ...)
  2023-08-20 21:06 ` [PATCH RFC 4/4] sched: cpufreq: Apply DVFS headroom to CFS only Qais Yousef
@ 2023-08-21 10:34 ` Rafael J. Wysocki
  2023-08-26 19:17   ` Qais Yousef
  4 siblings, 1 reply; 32+ messages in thread
From: Rafael J. Wysocki @ 2023-08-21 10:34 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-pm, Vincent Guittot, Dietmar Eggemann,
	Lukasz Luba

On Sun, Aug 20, 2023 at 11:08 PM Qais Yousef <qyousef@layalina.io> wrote:
>
> DVFS headroom, or map_util_perf(), is applied after uclamp restrictions have
> been applied in effective_cpu_util(). This leads to two problems for uclamp:
>
>         1. If util < uclamp_min, we'll run faster than uclamp_min. For example
>            util = 50, uclamp_min = 100. map_util_perf() = 125 instead of 100.
>
>         2. If util > uclamp_max, we'll run faster than uclamp_max. For example
>            util = 900, uclamp_max = 800, map_util_perf() = 1000.
>
> The first patch renames the function to apply_dvfs_headroom() to reflect
> what it really does. It is not really mapping util, but providing some
> headroom for the util to grow. It also adds documentation for the function.
>
> The second patch is the actual fix.
>
> The third patch moves apply_dvfs_headroom() to sched.h as there are no
> longer any users outside the scheduler.
>
> The fourth patch is an RFC to redefine what the headroom means for RT, DL
> and IRQ pressures.
>
> Thanks!

For the first 3 patches in the series

Acked-by: Rafael J. Wysocki <rafael@kernel.org>


* Re: [PATCH 1/4] sched: cpufreq: Rename map_util_perf to apply_dvfs_headroom
  2023-08-20 21:06 ` [PATCH 1/4] sched: cpufreq: Rename map_util_perf to apply_dvfs_headroom Qais Yousef
  2023-08-20 21:13   ` Qais Yousef
@ 2023-08-21 16:38   ` Dietmar Eggemann
  2023-08-26 20:03     ` Qais Yousef
  1 sibling, 1 reply; 32+ messages in thread
From: Dietmar Eggemann @ 2023-08-21 16:38 UTC (permalink / raw)
  To: Qais Yousef, Rafael J. Wysocki, Viresh Kumar, Ingo Molnar,
	Peter Zijlstra
  Cc: linux-kernel, linux-pm, Vincent Guittot, Lukasz Luba

On 20/08/2023 23:06, Qais Yousef wrote:
> We are providing headroom for the utilization to grow until the next
> decision point where we pick the next frequency. Give the function a better
> name and give it some documentation. It is not really mapping anything.

Wasn't the original aim to have a counterpart to task scheduler's
fits_capacity(), i.e. implement a frequency tipping point at 80%?

  #define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)


(util / max) = 0.8, hence 1.25 for the frequency-invariant case?

https://lkml.kernel.org/r/11678919.CQLTrQTYxG@vostro.rjw.lan

  next_freq = 1.25 * max_freq * util / max

              1.25 *            util  <-- map_util_perf()
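
(A standalone toy check - plain C, not kernel code - showing that both
constants encode the same ~80% tipping point:)

	#include <stdio.h>

	#define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)

	static unsigned long map_util_perf(unsigned long util)
	{
		return util + (util >> 2);	/* util * 1.25 */
	}

	int main(void)
	{
		/* 819 ~= 0.8 * 1024 is the last util that still fits ... */
		printf("%d\n", fits_capacity(819UL, 1024UL));	/* 1 */
		printf("%d\n", fits_capacity(820UL, 1024UL));	/* 0 */
		/* ... and where the 1.25 headroom just reaches max */
		printf("%lu\n", map_util_perf(819));		/* 1023 */
		return 0;
	}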

[...]

Difference is that EAS deals with `util_cfs` and `capacity` whereas
power (CPUfreq and EM) deals with `util` and `capacity_orig`. And this
is where `capacity pressure` comes in for EAS (or fair.c).

In this regard, I'm not sure why we should rename the function?

> + * This function provides enough headroom to maintain adequate performance
> + * assuming the CPU continues to be busy.
> + *
> + * At the moment it is a constant multiplication by 1.25.
> + *
> + * TODO: The headroom should be a function of the delay. 25% is too high,
> + * especially on powerful systems. For example, if the delay is 500us, it makes
> + * more sense to give a small headroom as the next decision point is not far
> + * away and will follow the util if it continues to rise. On the other hand, if
> + * the delay is 10ms, then we need a bigger headroom so the CPU won't struggle
> + * at a lower frequency if it never goes to idle until then.

I wouldn't add this here since this implementation is not provided.

[...]


* Re: [PATCH 2/4] sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints
  2023-08-20 21:06 ` [PATCH 2/4] sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints Qais Yousef
@ 2023-08-21 16:39   ` Dietmar Eggemann
  2023-08-26 20:08     ` Qais Yousef
  2023-08-29 14:35   ` Vincent Guittot
  1 sibling, 1 reply; 32+ messages in thread
From: Dietmar Eggemann @ 2023-08-21 16:39 UTC (permalink / raw)
  To: Qais Yousef, Rafael J. Wysocki, Viresh Kumar, Ingo Molnar,
	Peter Zijlstra
  Cc: linux-kernel, linux-pm, Vincent Guittot, Lukasz Luba

On 20/08/2023 23:06, Qais Yousef wrote:
> DVFS headroom is applied after we calculate the effective_cpu_util()
> which is where we honour uclamp constraints. It makes more sense to
> apply the headroom there once and let all users naturally get the right
> thing without having to sprinkle the call around in various places.

uclamp is applied in effective_cpu_util(..., FREQUENCY_UTIL, ...) which
IMHO currently has 2 power callers: (1) schedutil: sugov_get_util() and
(2) EAS: eenv_pd_max_util()

> Before this fix, running
> 
> 	uclampset -M 800 cat /dev/zero > /dev/null
> 
> will cause the test system to run at a max freq of 2.8GHz. After the fix
> it runs at 2.2GHz instead, which is the correct value that matches the
> capacity of 800.

IMHO, a system at util = 800 (w/o uclamp) would also run at 2.8GHz since
we would call map_util_perf() on 800, no matter where we call it from.

> Note that a similar problem exists for uclamp_min. If util is 50 and
> uclamp_min is 100, since we apply_dvfs_headroom() after applying uclamp
> constraints, we'll end up with a util of 125 instead of 100. IOW, we get
> boosted twice: first by uclamp_min, and a second time by the dvfs
> headroom.

I see what you want to change here but:

So far we have `util -> uclamp -> map_util_perf()`

which is fine when we see uclamp as an entity which constrains util, not
the util after being mapped to a capacity constraint.

[...]


* Re: [PATCH 3/4] sched: cpufreq: Move apply_dvfs_headroom() to sched.h
  2023-08-20 21:06 ` [PATCH 3/4] sched: cpufreq: Move apply_dvfs_headroom() to sched.h Qais Yousef
@ 2023-08-21 16:40   ` Dietmar Eggemann
  0 siblings, 0 replies; 32+ messages in thread
From: Dietmar Eggemann @ 2023-08-21 16:40 UTC (permalink / raw)
  To: Qais Yousef, Rafael J. Wysocki, Viresh Kumar, Ingo Molnar,
	Peter Zijlstra
  Cc: linux-kernel, linux-pm, Vincent Guittot, Lukasz Luba

On 20/08/2023 23:06, Qais Yousef wrote:
> This function relies on updating the util signal appropriately to give
> it headroom to grow. This is more of a scheduler functionality than
> a cpufreq one. Move it to sched.h where all the other util handling code
> belongs.

To me map_util_perf() is the power counterpart to fits_capacity()
[fair.c] which is used in schedutil (cpufreq) and EM to do the same*
thing as EAS in the task scheduler (fair.c).

* With the difference already mentioned in PATCH 1/4: EAS deals with
`util_cfs` vs `capacity` whereas power deals with `util` vs `capacity_orig`.

[...]


* Re: [PATCH RFC 4/4] sched: cpufreq: Apply DVFS headroom to CFS only
  2023-08-20 21:06 ` [PATCH RFC 4/4] sched: cpufreq: Apply DVFS headroom to CFS only Qais Yousef
@ 2023-08-21 16:41   ` Dietmar Eggemann
  2023-08-26 20:27     ` Qais Yousef
  0 siblings, 1 reply; 32+ messages in thread
From: Dietmar Eggemann @ 2023-08-21 16:41 UTC (permalink / raw)
  To: Qais Yousef, Rafael J. Wysocki, Viresh Kumar, Ingo Molnar,
	Peter Zijlstra
  Cc: linux-kernel, linux-pm, Vincent Guittot, Lukasz Luba

On 20/08/2023 23:06, Qais Yousef wrote:
> RT and Deadline have exact performance requirements when running. RT runs
> at max or at a specific OPP defined by uclamp_min. Deadline's OPP is
> defined by its bandwidth. Both are known ahead of time and
> don't require a headroom to grow into.
> 
> IRQs on the other hand have no specific performance requirement and
> cruise along at whatever the current OPP happens to be when they occur.
> 
> Now they all have PELT pressure signals that do impact frequency
> selection and task placement. The question is: do they need DVFS
> headroom?
> 
> I think the answer is no, because when CFS is not running at all, these
> pressure signals have no real impact on performance for RT, DL or IRQ.
> 
> If CFS util is not zero, we already add their pressure as an
> *additional* headroom to account for the lost/stolen time. So I argue
> that the pressures are headroom themselves and shouldn't need an
> additional DVFS headroom applied on top.
> 
> In summary, the final outcome should be:
> 
> 	CFS + DVFS headroom + (RT, DL, IRQ) pressure headroom

I assume here you want to align with the difference that EAS deals with
`util_cfs` vs `capacity` whereas power deals with `util` vs
`capacity_orig`? You want power to only apply the 1.25 to util_cfs?

> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> ---
>  kernel/sched/core.c | 12 +++++-------
>  1 file changed, 5 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 441d433c83cd..602e369753a3 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7438,10 +7438,11 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
>  	 * When there are no CFS RUNNABLE tasks, clamps are released and
>  	 * frequency will be gracefully reduced with the utilization decay.
>  	 */
> -	util = util_cfs + cpu_util_rt(rq);
>  	if (type == FREQUENCY_UTIL) {
> -		util = apply_dvfs_headroom(util);
> +		util = apply_dvfs_headroom(util_cfs) + cpu_util_rt(rq);
>  		util = uclamp_rq_util_with(rq, util, p);
> +	} else {
> +		util = util_cfs + cpu_util_rt(rq);
>  	}
>  
>  	dl_util = cpu_util_dl(rq);
> @@ -7473,12 +7474,9 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
>  	 *              max - irq
>  	 *   U' = irq + --------- * U
>  	 *                 max
> -	 *
> -	 * We only need to apply dvfs headroom to the irq part since the util part
> -	 * already had it applied.
>  	 */
>  	util = scale_irq_capacity(util, irq, max);
> -	util += type ==  FREQUENCY_UTIL ? apply_dvfs_headroom(irq) : irq;
> +	util += irq;
>  
>  	/*
>  	 * Bandwidth required by DEADLINE must always be granted while, for
> @@ -7491,7 +7489,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
>  	 * an interface. So, we only do the latter for now.
>  	 */
>  	if (type == FREQUENCY_UTIL)
> -		util += apply_dvfs_headroom(cpu_bw_dl(rq));
> +		util += cpu_bw_dl(rq);
>  
>  	return min(max, util);
>  }



* Re: [PATCH 0/4] Fix dvfs_headroom escaping uclamp constraints
  2023-08-21 10:34 ` [PATCH 0/4] Fix dvfs_headroom escaping uclamp constraints Rafael J. Wysocki
@ 2023-08-26 19:17   ` Qais Yousef
  0 siblings, 0 replies; 32+ messages in thread
From: Qais Yousef @ 2023-08-26 19:17 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Viresh Kumar, Ingo Molnar, Peter Zijlstra, linux-kernel,
	linux-pm, Vincent Guittot, Dietmar Eggemann, Lukasz Luba

On 08/21/23 12:34, Rafael J. Wysocki wrote:
> On Sun, Aug 20, 2023 at 11:08 PM Qais Yousef <qyousef@layalina.io> wrote:
> >
> > DVFS headroom, or map_util_perf(), is applied after uclamp restrictions have
> > been applied in effective_cpu_util(). This leads to two problems for uclamp:
> >
> >         1. If util < uclamp_min, we'll run faster than uclamp_min. For example
> >            util = 50, uclamp_min = 100. map_util_perf() = 125 instead of 100.
> >
> >         2. If util > uclamp_max, we'll run faster than uclamp_max. For example
> >            util = 900, uclamp_max = 800, map_util_perf() = 1000.
> >
> > The first patch renames the function to apply_dvfs_headroom() to reflect
> > what it really does. It is not really mapping util, but providing some
> > headroom for the util to grow. It also adds documentation for the function.
> >
> > The second patch is the actual fix.
> >
> > The third patch moves apply_dvfs_headroom() to sched.h as there are no
> > longer any users outside the scheduler.
> >
> > The fourth patch is an RFC to redefine what the headroom means for RT, DL
> > and IRQ pressures.
> >
> > Thanks!
> 
> For the first 3 patches in the series
> 
> Acked-by: Rafael J. Wysocki <rafael@kernel.org>

Thanks for having a look!


Cheers

--
Qais Yousef


* Re: [PATCH 1/4] sched: cpufreq: Rename map_util_perf to apply_dvfs_headroom
  2023-08-21 16:38   ` Dietmar Eggemann
@ 2023-08-26 20:03     ` Qais Yousef
  0 siblings, 0 replies; 32+ messages in thread
From: Qais Yousef @ 2023-08-26 20:03 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-pm, Vincent Guittot, Lukasz Luba

On 08/21/23 18:38, Dietmar Eggemann wrote:
> On 20/08/2023 23:06, Qais Yousef wrote:
> > We are providing headroom for the utilization to grow until the next
> > decision point where we pick the next frequency. Give the function a better
> > name and give it some documentation. It is not really mapping anything.
> 
> Wasn't the original aim to have a counterpart to task scheduler's
> fits_capacity(), i.e. implement a frequency tipping point at 80%?
> 
>   #define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)
> 
> 
> (util / max) = 0.8, hence 1.25 for the frequency-invariant case?
> 
> https://lkml.kernel.org/r/11678919.CQLTrQTYxG@vostro.rjw.lan
> 
>   next_freq = 1.25 * max_freq * util / max
> 
>               1.25 *            util  <-- map_util_perf()
> 
> [...]
> 
> Difference is that EAS deals with `util_cfs` and `capacity` whereas
> power (CPUfreq and EM) deals with `util` and `capacity_orig`. And this
> is where `capacity pressure` comes in for EAS (or fair.c).
> 
> In this regard, I'm not sure why we should rename the function?

I don't see any relation between the two, to be honest. Both are actually
magic numbers that no longer make any sense in the real world, and people
modify them in practice, as you know.

As we brought up in the PELT half-life discussion and other avenues, this 25%
is a headroom for the util to grow. And this headroom is required for DVFS
reasons, as evidenced by the current definition living in cpufreq.h.

fits_capacity() is about misfit detection - I don't see any relation to
selecting frequency. A voodoo relationship based on past circumstances is not
one I see holding true, as both code and systems have evolved significantly
since then. I think you're trying to make sure we've hit the top frequency
before we force an early upmigration. But this is artificial and not a real
necessity.

Consider for example a single task running on a medium core with a capacity of
750. It has a steady state utilization of 300. Why should it run at a 25%
higher OPP, which can be one or two freq jumps away? And since the power cost
is not linear, the power difference could be big. Not a cost I'd pay to
artificially coordinate with misfit detection. And what about SMP systems that
don't care about misfit detection? Why should they be tied to this magical 25%
headroom?

Note that DVFS hardware is a lot better nowadays; it is common to see
a reaction time of 500us. So if this 300-util task goes through a transition
and its util starts growing again, we'll react to this within 500us; that's
barely a few percent of a jump compared to the 25% one. I think this is what
the headroom should be about, hence the new name.
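
Just to sketch the direction (illustrative only, not the actual series; the
1%..25% taper and the 10ms anchor are made-up numbers):

	static inline unsigned long
	apply_dvfs_headroom(unsigned long util, u64 delay_ns)
	{
		/* taper from ~1% headroom at <=500us up to 25% at >=10ms */
		u64 pct = clamp_val(25 * delay_ns / (10 * NSEC_PER_MSEC), 1, 25);

		return util + util * pct / 100;
	}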

I will send the patches to address this issue separately soon enough; they're
almost ready actually. I didn't expect the name of the function to be an issue
here, to be honest, and thought it was clearly a headroom for the DVFS reaction
time.


Thanks!

--
Qais Yousef


* Re: [PATCH 2/4] sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints
  2023-08-21 16:39   ` Dietmar Eggemann
@ 2023-08-26 20:08     ` Qais Yousef
  2023-09-12 14:34       ` Dietmar Eggemann
  0 siblings, 1 reply; 32+ messages in thread
From: Qais Yousef @ 2023-08-26 20:08 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-pm, Vincent Guittot, Lukasz Luba

On 08/21/23 18:39, Dietmar Eggemann wrote:
> On 20/08/2023 23:06, Qais Yousef wrote:
> > DVFS headroom is applied after we calculate the effective_cpu_util()
> > which is where we honour uclamp constraints. It makes more sense to
> > apply the headroom there once and let all users naturally get the right
> > thing without having to sprinkle the call around in various places.
> 
> uclamp is applied in effective_cpu_util(..., FREQUENCY_UTIL, ...) which
> IMHO currently has 2 power callers: (1) schedutil: sugov_get_util() and
> (2) EAS: eenv_pd_max_util()
> 
> > Before this fix, running
> > 
> > 	uclampset -M 800 cat /dev/zero > /dev/null
> > 
> > will cause the test system to run at a max freq of 2.8GHz. After the fix
> > it runs at 2.2GHz instead, which is the correct value that matches the
> > capacity of 800.
> 
> IMHO, a system at util = 800 (w/o uclamp) would also run at 2.8GHz since
> we would call map_util_perf() on 800, no matter where we call it from.

Sorry, I would very strongly disagree here. What you're saying is that the
effective range of uclamp_max ends at 800 and anything above that will always
go to max. How can this be acceptable?

> 
> > Note that a similar problem exists for uclamp_min. If util is 50 and
> > uclamp_min is 100, since we apply_dvfs_headroom() after applying uclamp
> > constraints, we'll end up with a util of 125 instead of 100. IOW, we get
> > boosted twice: first by uclamp_min, and a second time by the dvfs
> > headroom.
> 
> I see what you want to change here but:
> 
> So far we have `util -> uclamp -> map_util_perf()`

:-O

So when I set the system uclamp_max to 800 it will still run at max; and this
is normal?!!

> 
> which is fine when we see uclamp as an entity which constrains util, not
> the util after being mapped to a capacity constraint.

-ENOPARSE.


Cheers

--
Qais Yousef


* Re: [PATCH RFC 4/4] sched: cpufreq: Apply DVFS headroom to CFS only
  2023-08-21 16:41   ` Dietmar Eggemann
@ 2023-08-26 20:27     ` Qais Yousef
  0 siblings, 0 replies; 32+ messages in thread
From: Qais Yousef @ 2023-08-26 20:27 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-pm, Vincent Guittot, Lukasz Luba

On 08/21/23 18:41, Dietmar Eggemann wrote:
> On 20/08/2023 23:06, Qais Yousef wrote:
> > RT and Deadline have exact performance requirements when running. RT runs
> > at max or at a specific OPP defined by uclamp_min. Deadline's OPP is
> > defined by its bandwidth. Both are known ahead of time and
> > don't require a headroom to grow into.
> > 
> > IRQs on the other hand have no specific performance requirement and
> > cruise along at whatever the current OPP happens to be when they occur.
> > 
> > Now they all have PELT pressure signals that do impact frequency
> > selection and task placement. The question is: do they need DVFS
> > headroom?
> > 
> > I think the answer is no, because when CFS is not running at all, these
> > pressure signals have no real impact on performance for RT, DL or IRQ.
> > 
> > If CFS util is not zero, we already add their pressure as an
> > *additional* headroom to account for the lost/stolen time. So I argue
> > that the pressures are headroom themselves and shouldn't need an
> > additional DVFS headroom applied on top.
> > 
> > In summary, the final outcome should be:
> > 
> > 	CFS + DVFS headroom + (RT, DL, IRQ) pressure headroom
> 
> I assume here you want to align with the difference that EAS deals with

This function is used on all systems that use schedutil - EAS being one of them
but not the only one. The definition isn't, and shouldn't be, tied to EAS.
I'm certainly intending this change for all possible users of schedutil.

> `util_cfs` vs `capacity` whereas power deals with `util` vs
> `capacity_orig`? You want power to only apply the 1.25 to util_cfs?

I don't get what you're saying. But I think it's similar to what I'm saying.

To clarify: what I'm saying is that when we try to calculate the effective
util, CFS is the only entity that in practice interacts with DVFS. DL and RT
by design 'disable' DVFS and, when they become runnable, set the frequency to
a constant fixed point. For them DVFS latencies are not acceptable - although
in practice they do take a single hit for the freq change on wake up. IRQ on
the other hand doesn't really care about DVFS. So in practice we end up with
CFS as the only entity that interacts with DVFS, and when we calculate the
DVFS headroom, we should only take its util into account.


Thanks!

--
Qais Yousef


* Re: [PATCH 2/4] sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints
  2023-08-20 21:06 ` [PATCH 2/4] sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints Qais Yousef
  2023-08-21 16:39   ` Dietmar Eggemann
@ 2023-08-29 14:35   ` Vincent Guittot
  2023-08-29 16:37     ` Qais Yousef
  1 sibling, 1 reply; 32+ messages in thread
From: Vincent Guittot @ 2023-08-29 14:35 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-pm, Dietmar Eggemann, Lukasz Luba

On Sun, 20 Aug 2023 at 23:08, Qais Yousef <qyousef@layalina.io> wrote:
>
> DVFS headroom is applied after we calculate the effective_cpu_util()
> which is where we honour uclamp constraints. It makes more sense to
> apply the headroom there once and let all users naturally get the right
> thing without having to sprinkle the call around in various places.

You have to take care not to mix scheduler and cpufreq constraints and policies.

uclamp is a scheduler feature to highlight that the utilization
requirement of a task can't go above a level.

DVFS headroom is a cpufreq decision to anticipate either hw
limitations and responsiveness problems, or performance hints.

They come from different sources and rationales, and this patch mixes
them, which I'm not sure is a correct thing to do.

>
> Before this fix, running
>
>         uclampset -M 800 cat /dev/zero > /dev/null
>
> will cause the test system to run at a max freq of 2.8GHz. After the fix
> it runs at 2.2GHz instead, which is the correct value that matches the
> capacity of 800.

So a task with a utilization of 800 will run at a higher freq than a
task clamped to 800 by uclamp? Why should they run at different freqs
for the same utilization?

>
> Note that a similar problem exists for uclamp_min. If util is 50 and
> uclamp_min is 100, since we apply_dvfs_headroom() after applying uclamp
> constraints, we'll end up with a util of 125 instead of 100. IOW, we get
> boosted twice: first by uclamp_min, and a second time by the dvfs
> headroom.
>
> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> ---
>  include/linux/energy_model.h     |  1 -
>  kernel/sched/core.c              | 11 ++++++++---
>  kernel/sched/cpufreq_schedutil.c |  5 ++---
>  3 files changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index 6ebde4e69e98..adec808b371a 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -243,7 +243,6 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>         scale_cpu = arch_scale_cpu_capacity(cpu);
>         ps = &pd->table[pd->nr_perf_states - 1];
>
> -       max_util = apply_dvfs_headroom(max_util);
>         max_util = min(max_util, allowed_cpu_cap);
>         freq = map_util_freq(max_util, ps->frequency, scale_cpu);
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index efe3848978a0..441d433c83cd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7439,8 +7439,10 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
>          * frequency will be gracefully reduced with the utilization decay.
>          */
>         util = util_cfs + cpu_util_rt(rq);
> -       if (type == FREQUENCY_UTIL)
> +       if (type == FREQUENCY_UTIL) {
> +               util = apply_dvfs_headroom(util);

This is not the same as before because utilization has not been
scaled by irq steal time yet

>                 util = uclamp_rq_util_with(rq, util, p);
> +       }
>
>         dl_util = cpu_util_dl(rq);
>
> @@ -7471,9 +7473,12 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
>          *              max - irq
>          *   U' = irq + --------- * U
>          *                 max
> +        *
> > +        * We only need to apply dvfs headroom to the irq part since the util part
> +        * already had it applied.
>          */
>         util = scale_irq_capacity(util, irq, max);
> -       util += irq;
> +       util += type ==  FREQUENCY_UTIL ? apply_dvfs_headroom(irq) : irq;
>
>         /*
>          * Bandwidth required by DEADLINE must always be granted while, for
> @@ -7486,7 +7491,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
>          * an interface. So, we only do the latter for now.
>          */
>         if (type == FREQUENCY_UTIL)
> -               util += cpu_bw_dl(rq);
> +               util += apply_dvfs_headroom(cpu_bw_dl(rq));

If we follow your reasoning with uclamp on the dl bandwidth, should we
not skip this as well?

>
>         return min(max, util);
>  }
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 916c4d3d6192..0c7565ac31fb 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -143,7 +143,6 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
>         unsigned int freq = arch_scale_freq_invariant() ?
>                                 policy->cpuinfo.max_freq : policy->cur;
>
> -       util = apply_dvfs_headroom(util);
>         freq = map_util_freq(util, freq, max);
>
>         if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
> @@ -406,8 +405,8 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
>             sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
>                 sg_cpu->util = prev_util;
>
> -       cpufreq_driver_adjust_perf(sg_cpu->cpu, apply_dvfs_headroom(sg_cpu->bw_dl),
> -                                  apply_dvfs_headroom(sg_cpu->util), max_cap);
> +       cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_dl,
> +                                  sg_cpu->util, max_cap);
>
>         sg_cpu->sg_policy->last_freq_update_time = time;
>  }
> --
> 2.34.1
>


* Re: [PATCH 2/4] sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints
  2023-08-29 14:35   ` Vincent Guittot
@ 2023-08-29 16:37     ` Qais Yousef
  2023-09-07 13:47       ` Vincent Guittot
  0 siblings, 1 reply; 32+ messages in thread
From: Qais Yousef @ 2023-08-29 16:37 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-pm, Dietmar Eggemann, Lukasz Luba

On 08/29/23 16:35, Vincent Guittot wrote:
> On Sun, 20 Aug 2023 at 23:08, Qais Yousef <qyousef@layalina.io> wrote:
> >
> > DVFS headroom is applied after we calculate the effective_cpu_util()
> > which is where we honour uclamp constraints. It makes more sense to
> > apply the headroom there once and let all users naturally get the right
> > thing without having to sprinkle the call around in various places.
> 
> You have to take care not to mix scheduler and cpufreq constraints and policies.
> 
> uclamp is a scheduler feature to highlight that the utilization
> requirement of a task can't go above a level.

uclamp is a performance hint; utilization is how we represent it
internally. A task with a uclamp of 800 is not the same as a task whose util
is actually 800. In our previous discussions around util_fits_cpu() we had
a similar discussion on how the two can't be treated the same.

Same with uclamp_min; if it is set to 1024 but there is a task with a util of,
say, 100, this task shouldn't cause overutilized, as its actual utilization
fits; it just asked to run at a higher performance point. The actual
utilization has to be in the overutilized range for that. Unless uclamp_max
forces it to fit, of course.

So uclamp_max sets a cap to the performance that the task is allowed to reach.
And this translates into frequency selection and cpu selection (in case of HMP)
by design.

I don't think we're mixing matters here. But the headroom should be applied to
actual utilization, not uclamp. If the rq is capped to a specific performance
level, we should honour this.

We do the same with iowait boost, where it is capped by uclamp_max. A task
capped by uclamp_max shouldn't be the trigger for running at a frequency higher
than this point. Otherwise we'd need to expose all these internal details to
the user, which I think we all agree isn't something to consider at all.

> 
> DVFS headroom is a cpufreq decision to anticipate either hw
> limitations and responsiveness problems, or performance hints.
> 
> They come from different sources and rationales, and this patch mixes
> them, which I'm not sure is a correct thing to do.

I don't think I'm mixing them up to be honest.

The governor is driven by effective_cpu_util() to tell it what the effective
utilization of the cpu is when making frequency selection. This function takes
into account all the factors that could impact frequency selection, including
all types of rq pressure (except thermal). I think it is appropriate to take
the headroom into account there and make sure we honour the uclamp_max hint to
cap the performance appropriately based on the effective uclamp_max value of
the rq.

For example, if the actual util is 640, then after the headroom it'd be 800.
And if uclamp_max is 800, then this task will still get the full 1.25 headroom.
We are not changing this behavior.

But if the task goes beyond that, then it'll stop at 800 because this is what
the request is all about. A headroom beyond this point is not required because
the task (or rq, to be more precise) is capped to this performance point, and
regardless of how busy the cpu gets in terms of real utilization or headroom,
it shouldn't go beyond this point. ie: if a task is at 1024 but uclamp_max is
800, then it'll only get the performance that the OPP matching 800 would give.
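
Concretely, with the fixed ordering and a uclamp_max of 800:

	util 640: apply_dvfs_headroom(640) = 800  -> full headroom preserved
	util 900: apply_dvfs_headroom(900) = 1125 -> capped at 800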

If we don't do that, the uclamp_max range effectively becomes 0-819 (based on
the current headroom setting of 1.25), which is not what we want or what we're
telling userspace. Userspace sees the whole system's performance levels
abstracted as 0-1024, as it should. The only effect they would observe, and
there's no way around it, is that OPPs are discrete points. So in reality our
0-1024 is a staircase where a range of util values will map to the same OPP and
then we'll get a jump. So the user can end up requesting, for example, 700 and
720 and not notice any difference because they both map to the same OPP.
I don't think we can fix that - but maybe I should add it to the uclamp doc as
a caveat when setting uclamp.
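
The 0-819 bound is just the arithmetic of the current 1.25 factor, where
anything that maps to >= 1024 means max frequency:

	apply_dvfs_headroom(819) = 819 + 204 = 1023 -> still below max
	apply_dvfs_headroom(820) = 820 + 205 = 1025 -> 1024, i.e. max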

> 
> >
> > Before this fix, running
> >
> >         uclampset -M 800 cat /dev/zero > /dev/null
> >
> > will cause the test system to run at a max freq of 2.8GHz. After the fix
> > it runs at 2.2GHz instead, which is the correct value that matches the
> > capacity of 800.
> 
> > So a task with a utilization of 800 will run at a higher freq than a
> > task clamped to 800 by uclamp? Why should they run at different freqs
> > for the same utilization?

Because uclamp sets an upper limit on the performance level the task should be
able to achieve. Imagine a task that is 1024 and capped to 800; it should not
run at max frequency, right? What's the point of the uclamp_max hint if the
headroom will cause it to run at max anyway? We'd lose the meaning of the hint.
And if this headroom changes in the future, people will start observing
different behavior for existing uclamp_max settings on the same system because
of this rightfully hidden and unaccounted-for factor.

> 
> >
> > Note that similar problem exist for uclamp_min. If util was 50, and
> > uclamp_min is 100. Since we apply_dvfs_headroom() after applying uclamp
> > constraints, we'll end up with util of 125 instead of 100. IOW, we get
> > boosted twice, first time by uclamp_min, and second time by dvfs
> > headroom.
> >
> > Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> > ---
> >  include/linux/energy_model.h     |  1 -
> >  kernel/sched/core.c              | 11 ++++++++---
> >  kernel/sched/cpufreq_schedutil.c |  5 ++---
> >  3 files changed, 10 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> > index 6ebde4e69e98..adec808b371a 100644
> > --- a/include/linux/energy_model.h
> > +++ b/include/linux/energy_model.h
> > @@ -243,7 +243,6 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
> >         scale_cpu = arch_scale_cpu_capacity(cpu);
> >         ps = &pd->table[pd->nr_perf_states - 1];
> >
> > -       max_util = apply_dvfs_headroom(max_util);
> >         max_util = min(max_util, allowed_cpu_cap);
> >         freq = map_util_freq(max_util, ps->frequency, scale_cpu);
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index efe3848978a0..441d433c83cd 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -7439,8 +7439,10 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> >          * frequency will be gracefully reduced with the utilization decay.
> >          */
> >         util = util_cfs + cpu_util_rt(rq);
> > -       if (type == FREQUENCY_UTIL)
> > +       if (type == FREQUENCY_UTIL) {
> > +               util = apply_dvfs_headroom(util);
> 
> This is not the same as before because utilization has not been
> scaled by irq steal time yet

We do the scaling below, no?

AFAICS, we had:

	(util_cfs + util_rt + irq + ((max-irq)*(util_cfs + util_rt)/max)+ dl_bw) * scale

Using U = (util_cfs + util_rt) * scale

we can write this after the multiplication

	U + irq * scale + ((max-irq)*U/max) + dl_bw * scale

> 
> >                 util = uclamp_rq_util_with(rq, util, p);
> > +       }
> >
> >         dl_util = cpu_util_dl(rq);
> >
> > @@ -7471,9 +7473,12 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> >          *              max - irq
> >          *   U' = irq + --------- * U
> >          *                 max
> > +        *
> > +        * We only need to apply dvfs headroom to irq part since the util part
> > +        * already had it applied.
> >          */
> >         util = scale_irq_capacity(util, irq, max);
> > -       util += irq;
> > +       util += type ==  FREQUENCY_UTIL ? apply_dvfs_headroom(irq) : irq;
> >
> >         /*
> >          * Bandwidth required by DEADLINE must always be granted while, for
> > @@ -7486,7 +7491,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> >          * an interface. So, we only do the latter for now.
> >          */
> >         if (type == FREQUENCY_UTIL)
> > -               util += cpu_bw_dl(rq);
> > +               util += apply_dvfs_headroom(cpu_bw_dl(rq));
> 
> If we follow your reasoning with uclamp on the dl bandwidth, should we
> not skip this as well ?

I do remove this in patch 4. I can fold that one into this one if you like.
I opted to keep the current behavior in this patch and remove these later in
patch 4.

I do think that both RT and DL shouldn't need DVFS headroom in general as they
both 'disable' it by default.


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints
  2023-08-29 16:37     ` Qais Yousef
@ 2023-09-07 13:47       ` Vincent Guittot
  2023-09-07 21:55         ` Qais Yousef
  0 siblings, 1 reply; 32+ messages in thread
From: Vincent Guittot @ 2023-09-07 13:47 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-pm, Dietmar Eggemann, Lukasz Luba

On Tue, 29 Aug 2023 at 18:37, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 08/29/23 16:35, Vincent Guittot wrote:
> > On Sun, 20 Aug 2023 at 23:08, Qais Yousef <qyousef@layalina.io> wrote:
> > >
> > > DVFS headroom is applied after we calculate the effective_cpu_util()
> > > which is where we honour uclamp constraints. It makes more sense to
> > > apply the headroom there once and let all users naturally get the right
> > > thing without having to sprinkle the call around in various places.
> >
> > You have to take care of not mixing scheduler and cpufreq constraint and policy.
> >
> > uclamp is a scheduler feature to highlight that the utilization
> > requirement of a task can't go above a level.
>
> uclamp is a performance hint, which utilization is how we represent it
> internally. A task with uclamp of 800 is not the same as util being actually
> 800. In our previous discussions around util_fits_cpu() we had similar
> discussion on how the two can't be treated the same.
>
> Same with uclamp_min; if it is set to 1024 but there is a task with util say
> 100, this task shouldn't cause overutilized as its actual utilization actually
> fits, but it just asked to run at a higher performance point. The actual
> utilization has to be in the over utilized range. Unless uclamp_max forces it
> to fit of course.
>
> So uclamp_max sets a cap to the performance that the task is allowed to reach.
> And this translates into frequency selection and cpu selection (in case of HMP)
> by design.

The dvfs headroom is a sole property of cpufreq and the scheduler
doesn't care about it. The scheduler should provide some cpu utilization
hints. Then what cpufreq will do based on the utilization, the HW
constraints and the policy is out of the scheduler's scope.

This headroom is there only because the energy model wants to forecast
which frequency cpufreq will select later on, but the scheduler doesn't
care.
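
For reference, that forecast is visible in the em_cpu_energy() hunk quoted
earlier in the thread; a condensed sketch of the pre-series logic, which
mirrors the governor by applying the same headroom before mapping to a
frequency:

	/* Condensed from em_cpu_energy() before this series */
	max_util = apply_dvfs_headroom(max_util);
	max_util = min(max_util, allowed_cpu_cap);
	freq = map_util_freq(max_util, ps->frequency, scale_cpu);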

[...]
> > > Before this fix running
> > >
> > >         uclampset -M 800 cat /dev/zero > /dev/null
> > >
> > > Will cause the test system to run at max freq of 2.8GHz. After the fix
> > > it runs at 2.2GHz instead which is the correct value that matches the
> > > capacity of 800.
> >
> > > So a task with a utilization of 800 will run at a higher freq than a
> > task clamped to 800 by uclamp ? Why should they run at different freq
> > for the same utilization ?
>
> Because uclamp sets an upper limit on the performance level the task should be
> able to achieve. Imagine a task is 1024 and capped to 800, it should not run at
> max frequency, right? What's the point of the uclamp_max hint if the headroom

Who knows? Here we try to predict the coming utilization of the cpu, and
this prediction is only partial, so cpufreq might want to keep some headroom
even if uclamp_max is applied to a cfs rq, to anticipate unpredictable irq
stolen time.

IMO, dvfs_headroom should never be visible in scheduler code. This is
the case now; it's only visible to cpufreq, which is the "owner", and the
energy model, which tries to predict cpufreq behavior.

[...]
> >
> > This is not the same as before because utilization has not been
> > scaled by irq steal time yet
>
> We do the scaling below, no?
>
> AFAICS, we had:
>
>         (util_cfs + util_rt + irq + ((max-irq)*(util_cfs + util_rt)/max)+ dl_bw) * scale

I think it's only:
  ((util_cfs + util_rt) * (max - irq) / max + irq + dl_bw) * scale

and it's effectively similar
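
Plugging numbers in makes the difference concrete. scale_irq_capacity(), as
it exists in kernel/sched/sched.h, implements the (max - irq) / max scaling:

	static inline
	unsigned long scale_irq_capacity(unsigned long util, unsigned long irq,
					 unsigned long max)
	{
		util *= (max - irq);
		util /= max;

		return util;
	}

	/*
	 * With util_cfs + util_rt = 512, irq = 128, max = 1024, dl_bw = 0:
	 *   scale_irq_capacity(512, 128, 1024) + 128 = 448 + 128 = 576
	 * The formula as written a few lines up would also add the unscaled
	 * 512 and give 1088; presumably a transcription slip, as the code
	 * does the former.
	 */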

[...]
> >
> > If we follow your reasoning with uclamp on the dl bandwidth, should we
> > not skip this as well ?
>
> I do remove this in patch 4. Can fold that one into this one if you like.
> I opted to keep the current behavior in this patch and remove these later in
> patch 4.
>
> I do think that both RT and DL shouldn't need DVFS headroom in general as they
> both 'disable' it by default.

and the scheduler in general doesn't care about DVFS headroom.

I don't think it's the right way to fix your problem of selecting the
right frequency on your platform. It would be better to get a proper
interface between EM and cpufreq like
cpufreq_give_me_freq_for_this_utilization_ctx.

And cpufreq needs 2 pieces of information:
- the utilization of the cpu
- the uclamp/performance/bandwidth hints on this cpu, because the
uclamp_max hints are screwed up by irq scaling anyway.

For example:

irq = 256
util = 800 but uclamp=512

If I follow your reasoning about the uclamp being a performance hint
and uclamp=512 means that we should not go above 512 just for cfs+rt,
we should stay at 512 in this case because irq only needs 256 which is
below the 512 bandwidth. The utilization reported to cpufreq will be
above 512 whatever the current (720) formula or your proposal (608).
In the case of uclamp, it should be applied after having been scaled
by irq time.

So we should have reported utilization of 720 with a bandwidth
requirement of 512 and then cpufreq can apply its headroom if needed
>
>
> Thanks!
>
> --
> Qais Yousef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints
  2023-09-07 13:47       ` Vincent Guittot
@ 2023-09-07 21:55         ` Qais Yousef
  2023-09-08 14:30           ` Vincent Guittot
  0 siblings, 1 reply; 32+ messages in thread
From: Qais Yousef @ 2023-09-07 21:55 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-pm, Dietmar Eggemann, Lukasz Luba

On 09/07/23 15:47, Vincent Guittot wrote:
> On Tue, 29 Aug 2023 at 18:37, Qais Yousef <qyousef@layalina.io> wrote:
[...]
> >
> > Because uclamp sets an upper limit on the performance level the task should be
> > able to achieve. Imagine a task is 1024 and capped to 800, it should not run at
> > max frequency, right? What's the point of the uclamp_max hint if the headroom
> 
> Who knows ? Here we try to predict the coming utilization of the cpu
> and this prediction is only partial so cpufreq might want to keep some
> headroom even if uclamp max is applied to a cfs rq. Anticipate
> unpredictable irq stolen time

I struggle to see why we need headroom after uclamp_max. But I think I'm
starting to see what you're getting at. I think you'd like not to make
assumptions in the scheduler and to hide this logic in the governor itself?

> 
> IMO, dvfs_headroom should never be visible in scheduler code. This is
> the case now; it's only visible by cpufreq which is the "owner" and
> energy model which tries to predict cpufreq behavior

Okay.

[...]
> > >
> > > This is not the same as before because utilization has not been
> > > scaled by irq steal time yet
> >
> > We do the scaling below, no?
> >
> > AFAICS, we had:
> >
> >         (util_cfs + util_rt + irq + ((max-irq)*(util_cfs + util_rt)/max)+ dl_bw) * scale
> 
> I think it's only:
>   ((util_cfs + util_rt) * (max - irq) / max + irq + dl_bw) * scale
> 
> and it's effectively similar

+1

[...]
> and the scheduler in general doesn't care about DVFS headroom.
> 
> I don't think it's the right way to fix your problem of selecting the
> right frequency on your platform. It would be better to get a proper
> interface between EM and cpufreq like
> cpufreq_give_me_freq_for_this_utilization_ctx.

Hmm, is the problem then that effective_cpu_util() was moved to the scheduler
and it should move back to schedutil, providing a proper interface to connect
the two in an abstracted way?

> 
> And cpufreq needs 2 pieces of information:
> - the utilization of the cpu
> - the uclamp/performance/bandwidth hints on this cpu because the
> uclamp_max hints are screwed up by irq scaling anyway.
> 
> For example:
> 
> irq = 256
> util = 800 but uclamp=512
> 
> If I follow your reasoning about the uclamp being a performance hint
> and uclamp=512 means that we should not go above 512 just for cfs+rt,

Yes. We already limit it to 512 now. With the change we'll limit it after the
headroom is applied to them.

> we should stay at 512 in this case because irq only needs 256 which is
> below the 512 bandwidth. The utilization reported to cpufreq will be

Hmm IIUC, no. irq pressure will be added on top as normal. It's just cfs+rt
will be restricted by their uclamp settings _after_ the headroom is applied.

> above 512 whatever the current (720)  formula or your proposal (608).
> In the case of uclamp, it should be applied after having been scaled
> by irq time.

I lost you a bit here. I'm not sure how you reached the 720 and 608 numbers.

So the way I'm proposing it here:

	util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000
	util = uclamp_rq_util_with(rq, util, NULL) = 512
	util = scale_irq_capacity(512, 256, 1024) = 0.75 * 512 = 384
	util += dvfs_headroom(irq) = 384 + 256 + 0.25 * 256 = 704
	util += dvfs_headroom(dl_bw) = 704

> 
> So we should have reported utilization of 720 with a bandwidth
> requirement of 512 and then cpufreq can apply its headroom if needed

The key part I'm changing is this

	util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000
	util = uclamp_rq_util_with(rq, util, NULL) = 512

Before we had (assume irq, rt and dl are 0 for simplicity and a single task is
running)

	util = cfs = 800
	util = uclamp_rq_util_with(rq, util, NULL) = 512
	util = dvfs_headroom(util) = 512 + 0.25 * 512 = 640

So we are running higher than we are allowed to. So applying the headroom
after taking uclamp constraints into account is the problem.
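
In code terms the change boils down to the ordering (simplified; irq, rt and
dl left out as above):

	/* before: clamp first, so the headroom escapes the cap */
	util = uclamp_rq_util_with(rq, util_cfs, NULL);	/* 800 -> 512 */
	util = apply_dvfs_headroom(util);		/* 512 -> 640 */

	/* after (this series): headroom first, then honour the cap */
	util = apply_dvfs_headroom(util_cfs);		/* 800 -> 1000 */
	util = uclamp_rq_util_with(rq, util, NULL);	/* 1000 -> 512 */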

IIUC, if your concern is that the way the scheduler and schedutil interact
loses some abstraction, then yeah, this makes sense and we can refactor the
code to better abstract the two and keep them decoupled.

But if you think the order in which the headroom is applied should stay the
way it is, then this doesn't fix the problem.

A good use case to consider is a single task running at 1024. If I want to
limit it to 80% using uclamp_max, how can this be enforced? With the current
code, the task will continue to run at 1024, which is the problem.

Running at 640 instead of 512 is still a problem too, as this could be 1 or
2 OPPs higher than what we want to allow; and there could be a significant
power difference between those OPPs.


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints
  2023-09-07 21:55         ` Qais Yousef
@ 2023-09-08 14:30           ` Vincent Guittot
  2023-09-10 17:46             ` Qais Yousef
  0 siblings, 1 reply; 32+ messages in thread
From: Vincent Guittot @ 2023-09-08 14:30 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-pm, Dietmar Eggemann, Lukasz Luba

On Thu, 7 Sept 2023 at 23:55, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 09/07/23 15:47, Vincent Guittot wrote:
[...]
> I struggle to see why we need headroom after uclamp_max. But I think I'm
> starting to see what you're getting at. I think you'd like not to make
> assumptions in the scheduler and to hide this logic in the governor itself?

yes

[...]
> > and the scheduler in general doesn't care about DVFS headroom.
> >
> > I don't think it's the right way to fix your problem of selecting the
> > right frequency on your platform. It would be better to get a proper
> > interface between EM and cpufreq like
> > cpufreq_give_me_freq_for_this_utilization_ctx.
>
> Hmm, is the problem then that effective_cpu_util() was moved to the scheduler
> and it should move back to schedutil, providing a proper interface to connect
> the two in an abstracted way?

probably yes

>
[...]
> > If I follow your reasoning about the uclamp being a performance hint
> > and uclamp=512 means that we should not go above 512 just for cfs+rt,
>
> Yes. We already limit it to 512 now. With the change we'll limit it after the
> headroom is applied to them.
>
> > we should stay at 512 in this case because irq only needs 256 which is
> > below the 512 bandwidth. The utilization reported to cpufreq will be
>
> Hmm IIUC, no. irq pressure will be added on top as normal. It's just cfs+rt
> will be restricted by their uclamp settings _after_ the headroom is applied.

But then you mixed some utilization level (irq pressure), which is
time related, with some performance/bandwidth limitation, which is freq
related. And I think that's part of the reason why we can't agree on
where to put the headroom and how to take uclamp into account.

>
> > above 512 whatever the current (720)  formula or your proposal (608).
> > In the case of uclamp, it should be applied after having been scaled
> > by irq time.
>
> I lost you a bit here. I'm not sure how you reached the 720 and 608 numbers.

My bad, I finally decided to use an irq pressure of 128 in my
calculation but forgot to change the value in my email

>
> So the way I'm proposing it here
>
>         util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000
>         util = uclamp_rq_util_with(rq, util, NULL) = 512
>         util = scale_irq_capacity(512, 256, 1024) = 0.75 * 512 = 384
>         util += dvfs_headroom(irq) = 384 + 256 + 0.25 * 256 = 704
>         util += dvfs_headroom(dl_bw) = 704
>
> >
> > So we should have reported utilization of 720 with a bandwidth
> > requirement of 512 and then cpufreq can apply its headroom if needed
>
> The key part I'm changing is this
>
>         util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000
>         util = uclamp_rq_util_with(rq, util, NULL) = 512
>
> Before we had (assume irq, rt and dl are 0 for simplicity and a single task is
> running)
>
>         util = cfs = 800
>         util = uclamp_rq_util_with(rq, util, NULL) = 512
>         util = dvfs_headroom(util) = 512 + 0.25 * 512 = 640
>
> So we are running higher than we are allowed to. So applying the headroom
> after taking uclamp constraints into account is the problem.

But I think it's not correct, because you mixed some utilization level
with some performance requirement. IIRC, you said that uclamp min/max
must not be seen as a utilization level but as a performance hint. In
that case, you can't add it to some other utilization because it
defeats its original purpose.

>
> IIUC and your concerns are about how scheduler and schedutil are interacting
> loses some abstraction, then yeah this makes sense and can refactor code to
> better abstract the two and keep them decoupled.
>
> But if you think the order the headroom is applied should be the way it is,
> then this doesn't fix the problem.
>
> A good use case to consider is a single task running at 1024. If I want to
> limit it to 80% using uclamp_max, how can this be enforced? With current code,
> the task will continue to run at 1024; which is the problem.

Single case is the easy part

>
> Running at 640 instead of 512 is still a problem too as this could be 1 or
> 2 OPP higher than what we want to allow; and there could be significant power
> difference between those OPPs.

That's my point.

Your proposal tries to fix the simple case of 1 task with a uclamp_max,
but if we want to move in this direction, we should fix all cases.

I probably need to put my idea in real code to see what it means, but
we should provide 2 values to cpufreq: a utilization level and a
performance hint which would max-aggregate the various performance
hints that we have.
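
A rough sketch of what those two values could look like (hypothetical
aggregation; uclamp_rq_get() and cpu_bw_dl() exist today, the rest is made
up for illustration):

	/* Sketch: freq-based hints, aggregated separately from the
	 * time-based utilization */
	struct perf_hints {
		unsigned long floor;	/* max-aggregated: uclamp_min, dl_bw, ... */
		unsigned long ceil;	/* cap from uclamp_max */
	};

	static void cpu_perf_hints(struct rq *rq, struct perf_hints *h)
	{
		h->floor = max(uclamp_rq_get(rq, UCLAMP_MIN), cpu_bw_dl(rq));
		h->ceil  = uclamp_rq_get(rq, UCLAMP_MAX);
	}

	/* the governor would then pick a frequency from the raw utilization,
	 * its own headroom, and these two hints */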

>
>
> Thanks!

>
> --
> Qais Yousef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints
  2023-09-08 14:30           ` Vincent Guittot
@ 2023-09-10 17:46             ` Qais Yousef
  2023-09-12 13:40               ` Dietmar Eggemann
  2023-09-12 16:03               ` Vincent Guittot
  0 siblings, 2 replies; 32+ messages in thread
From: Qais Yousef @ 2023-09-10 17:46 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-pm, Dietmar Eggemann, Lukasz Luba

On 09/08/23 16:30, Vincent Guittot wrote:

> > Hmm IIUC, no. irq pressure will be added on top as normal. It's just cfs+rt
> > will be restricted by their uclamp settings _after_ the headroom is applied.
> 
> But then you mixed some utilization level (irq pressure), which is
> time related, with some performance/bandwidth limitation, which is freq
> related. And I think that's part of the reason why we can't agree on
> where to put the headroom and how to take uclamp into account.

I did not change how this part works. We can drop patch 4 completely if this is
what is causing the confusion.

What I changed is the order of applying uclamp_rq_util_with() and
apply_dvfs_headroom(). The rest is still the same as-is in the code today.

> 
> >
> > > above 512 whatever the current (720)  formula or your proposal (608).
> > > In the case of uclamp, it should be applied after having been scaled
> > > by irq time.
> >
> > I lost you a bit here. I'm not sure how you reached the 720 and 608 numbers.
> 
> My bad, I finally decided to use an irq pressure of 128 in my
> calculation but forgot to change the value in my email
> 
> >
> > So the way I'm proposing it here
> >
> >         util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000
> >         util = uclamp_rq_util_with(rq, util, NULL) = 512
> >         util = scale_irq_capacity(512, 256, 1024) = 0.75 * 512 = 384
> >         util += dvfs_headroom(irq) = 384 + 256 + 0.25 * 256 = 704
> >         util += dvfs_headroom(dl_bw) = 704
> >
> > >
> > > So we should have reported utilization of 720 with a bandwidth
> > > requirement of 512 and then cpufreq can apply its headroom if needed
> >
> > The key part I'm changing is this
> >
> >         util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000
> >         util = uclamp_rq_util_with(rq, util, NULL) = 512
> >
> > Before we had (assume irq, rt and dl are 0 for simplicity and a single task is
> > running)
> >
> >         util = cfs = 800
> >         util = uclamp_rq_util_with(rq, util, NULL) = 512
> >         util = dvfs_headroom(util) = 512 + 0.25 * 512 = 640
> >
> > So we are running higher than we are allowed to. So applying the headroom
> > after taking uclamp constraints into account is the problem.
> 
> But I think it's not correct because you mixed some utilization level
> with some performance requirement. IIRC, you said that the
> uclamp/min/max must not be seen as an utilization level but a
> performance hint. In such case, you can't add it to some other
> utilization because it defeats its original purpose

But this is what we do today already. I didn't change this part. See below.

> 
> >
> > IIUC, if your concern is that the way the scheduler and schedutil interact
> > loses some abstraction, then yeah, this makes sense and we can refactor the
> > code to better abstract the two and keep them decoupled.
> >
> > But if you think the order the headroom is applied should be the way it is,
> > then this doesn't fix the problem.
> >
> > A good use case to consider is a single task running at 1024. If I want to
> > limit it to 80% using uclamp_max, how can this be enforced? With current code,
> > the task will continue to run at 1024; which is the problem.
> 
> Single case is the easy part
> 
> >
> > Running at 640 instead of 512 is still a problem too as this could be 1 or
> > 2 OPP higher than what we want to allow; and there could be significant power
> > difference between those OPPs.
> 
> That's my point.
> 
> Your proposal tries to fix the simple case of 1 task with a uclamp_max
> but If we want to move in this direction, we should fix all cases.

I think I am addressing all cases. I don't think I understand the problem
you're trying to highlight. Is the RFC patch 4 what is causing the
confusion? That one is not related to fixing uclamp_max; it's just me
questioning the status quo of which pressure signal requires the headroom.

The main change being done here is to apply apply_dvfs_headroom() *before*
doing uclamp_rq_util_with(). I am not sure how you see this as mixing.

Current code applies apply_dvfs_headroom() *after*, which is what causes the
CPU to run at a performance level higher than rq->uclamp[UCLAMP_MAX].

It doesn't matter how many tasks are on the rq; if rq->uclamp[UCLAMP_MAX] is
set to 800, then the CPU should not vote for max (assuming all other pressures
are 0).

Handling of irq pressure etc is not changed.

> I probably need to put my idea in real code to see what it means but
> we should provide 2 value to cpufreq: an utilization level and a
> performance hint which would max aggregate the various performance
> hints that we have

Hmm. This reads to me as a code structure change. From my perspective, the
correct behavior is to apply the headroom before restricting the performance
level.

It seems you're seeing a better way to aggregate all the pressures when taking
uclamp into account, and I am starting to see your point if this is what
you're saying. Though I haven't changed the existing relationship. I did try
to question some things in patch 4; my thoughts were specific to the headroom,
but it seems you have a more generalized form in mind.

I do agree in general that the scheduler and schedutil are getting tightly
coupled code-wise, and it would be good to provide generic cpufreq interfaces
to potentially allow other governors to hook in and benefit.

Whether this is desired or not, I don't know, as I remember Peter and
Rafael wanted schedutil to grow to become the de-facto governor. It seems the
right thing anyway.


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints
  2023-09-10 17:46             ` Qais Yousef
@ 2023-09-12 13:40               ` Dietmar Eggemann
  2023-09-16 19:25                 ` Qais Yousef
  2023-09-12 16:03               ` Vincent Guittot
  1 sibling, 1 reply; 32+ messages in thread
From: Dietmar Eggemann @ 2023-09-12 13:40 UTC (permalink / raw)
  To: Qais Yousef, Vincent Guittot
  Cc: Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-pm, Lukasz Luba

On 10/09/2023 19:46, Qais Yousef wrote:
> On 09/08/23 16:30, Vincent Guittot wrote:
> 

[...]

>>>> above 512 whatever the current (720)  formula or your proposal (608).
>>>> In the case of uclamp, it should be applied after having been scaled
>>>> by irq time.
>>>
>>> I lost you a bit here. I'm not sure how you reached the 720 and 608 numbers.
>>
>> My bad, I finally decided to use an irq pressure of 128 in my
>> calculation but forgot to change the value in my email
>>
>>>
>>> So the way I'm proposing it here
>>>
>>>         util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000
>>>         util = uclamp_rq_util_with(rq, util, NULL) = 512
>>>         util = scale_irq_capacity(512, 256, 1024) = 0.75 * 512 = 384
>>>         util += dvfs_headroom(irq) = 384 + 256 + 0.25 * 256 = 704
>>>         util += dvfs_headroom(dl_bw) = 704
>>>
>>>>
>>>> So we should have reported utilization of 720 with a bandwidth
>>>> requirement of 512 and then cpufreq can apply its headroom if needed
>>>
>>> The key part I'm changing is this
>>>
>>>         util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000
>>>         util = uclamp_rq_util_with(rq, util, NULL) = 512
>>>
>>> Before we had (assume irq, rt and dl are 0 for simplicity and a single task is
>>> running)
>>>
>>>         util = cfs = 800
>>>         util = uclamp_rq_util_with(rq, util, NULL) = 512
>>>         util = dvfs_headroom(util) = 512 + 0.25 * 512 = 640
>>>
>>> So we are running higher than we are allowed to. So applying the headroom
>>> after taking uclamp constraints into account is the problem.

I'm not sure I understood all the example math in this thread correctly:

Examples:

irq = 128 or 256

util = 800 uclamp = 512


--- current code:

((util_cfs + util_rt) * ((max - irq) / max) + irq + dl_bw) * scale

<- uclamped(cfs+rt) ->

<--               scale_irq_capacity()                  -->|<-- map_util_perf() 
                                                               / (headroom())  

irq = 128: (512 * (1024 - 128) / 1024 + 128 + 0) * 1.25 = 576 * 1.25 = 720

irq = 256: (512 * (1024 - 256) / 1024 + 256 + 0) * 1.25 = 640 * 1.25 = 800


--- new approach:

irq = 128: (512 * (1024 - 128) / 1024 + 128 + 0.25 * 128)            = 608

irq = 256: (512 * (1024 - 256) / 1024 + 256 + 0.25 * 256)            = 704

            <->
            uclamped(cfs+rt+headroom(cfs+rt))

            <- scale_irq_capacity() ->

            <--               headroom(irq) ?        -->


Is this correct?
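
A standalone userspace check of the arithmetic above (1.25 = 5/4), nothing
kernel-side:

	#include <stdio.h>

	int main(void)
	{
		const unsigned long max = 1024, uclamp = 512;
		const unsigned long irqs[] = { 128, 256 };
		int i;

		for (i = 0; i < 2; i++) {
			unsigned long irq = irqs[i];
			unsigned long scaled = uclamp * (max - irq) / max;
			unsigned long cur  = (scaled + irq) * 5 / 4;
			unsigned long prop = scaled + irq + irq / 4;

			/* prints: irq=128 current=720 new=608
			 *         irq=256 current=800 new=704 */
			printf("irq=%lu current=%lu new=%lu\n", irq, cur, prop);
		}

		return 0;
	}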

[...]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints
  2023-08-26 20:08     ` Qais Yousef
@ 2023-09-12 14:34       ` Dietmar Eggemann
  2023-09-16 20:30         ` Qais Yousef
  0 siblings, 1 reply; 32+ messages in thread
From: Dietmar Eggemann @ 2023-09-12 14:34 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-pm, Vincent Guittot, Lukasz Luba

On 26/08/2023 22:08, Qais Yousef wrote:
> On 08/21/23 18:39, Dietmar Eggemann wrote:
>> On 20/08/2023 23:06, Qais Yousef wrote:
>>> DVFS headroom is applied after we calculate the effective_cpu_util()
>>> which is where we honour uclamp constraints. It makes more sense to
>>> apply the headroom there once and let all users naturally get the right
>>> thing without having to sprinkle the call around in various places.
>>
>> uclamp is applied in effective_cpu_util(..., FREQUENCY_UTIL, ...) which
>> IMHO currently has 2 power callers: (1) schedutil: sugov_get_util() and
>> (2) EAS: eenv_pd_max_util()
>>
>>> Before this fix running
>>>
>>> 	uclampset -M 800 cat /dev/zero > /dev/null
>>>
>>> Will cause the test system to run at max freq of 2.8GHz. After the fix
>>> it runs at 2.2GHz instead which is the correct value that matches the
>>> capacity of 800.
>>
>> IMHO, a system at util = 800 (w/o uclamp) would also run at 2.8GHz since
>> we would call map_util_to_perf() on 800, no matter from where we call it.
> 
> Sorry, I would very strongly disagree here. What you're saying is that the
> effective range of uclamp_max is 800 and anything above that will always go
> to max. How can this be acceptable?

No, that's not what I wanted to say here.

I wanted to highlight the different treatment of `(1) a task with
(natural) util = 800` and `(2) a task with uclamp_max = 800`.

(1) util = 800

util = (1.25 * 800 * (1024 - irq) / 1024 + ...

        <-      ->
        uclamped(cfs+rt+headroom(cfs+rt))


(2) uclamp_max = 800

util = (800 * (1024 - irq) / 1024 + ...

        <->
        uclamped(cfs+rt+headroom(cfs+rt))

So for (1) the scheduler would ask for more than in (2).

That's essentially the same question which was raised here:

https://lkml.kernel.org/r/CAKfTPtDY48jpO+b-2KXawzxh-ty+FMKX6YUXioNR7kpgO=ua6Q@mail.gmail.com

>>> Note that a similar problem exists for uclamp_min. If util was 50 and
>>> uclamp_min is 100, since we apply_dvfs_headroom() after applying uclamp
>>> constraints, we'll end up with a util of 125 instead of 100. IOW, we get
>>> boosted twice, first by uclamp_min, and a second time by the dvfs
>>> headroom.
>>
>> I see what you want to change here but:
>>
>> So far we have `util -> uclamp -> map_util_perf()`
> 
> :-O
> 
> So when I set the system uclamp_max to 800 it will still run at max; and this
> is normal?!!

No that's an issue (A) as well. But the diff between (1) and (2) is IMHO a
new issue introduced by this patch-set.

>> which is fine when we see uclamp as an entity which constrains util, not
>> the util after being mapped to a capacity constraint.
> 
> -ENOPARSE.

What I meant is 'clamping the util' before the scheduler hands over to
'cpufreq' is fine:

    util -> uclamp -> map_util_perf()
               
    scheduler    -->|<-- cpufreq

I do understand that you guys are already discussing a new
cpufreq_give_me_freq_for_this_utilization_ctx() between EM and CPUfreq
in the other thread of this patch to maybe sort out (A).

[...]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints
  2023-09-10 17:46             ` Qais Yousef
  2023-09-12 13:40               ` Dietmar Eggemann
@ 2023-09-12 16:03               ` Vincent Guittot
  2023-09-16 19:25                 ` Qais Yousef
  1 sibling, 1 reply; 32+ messages in thread
From: Vincent Guittot @ 2023-09-12 16:03 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-pm, Dietmar Eggemann, Lukasz Luba

On Sun, 10 Sept 2023 at 19:46, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 09/08/23 16:30, Vincent Guittot wrote:
>
> > > Hmm IIUC, no. irq pressure will be added on top as normal. It's just cfs+rt
> > > will be restricted by their uclamp settings _after_ the headroom is applied.
> >
> > But then you mixed some utilization level (irq pressure) which is
> > time related with some performance/bandwidth limitation which is freq
> > related. And I think that's part of the reason why we can't agree on
> > where to put the headroom and how to take uclamp into account
>
> I did not change how this part works. We can drop patch 4 completely if this is
> what is causing the confusion.
>
> What I changed is the order of applying uclamp_rq_util_with() and
> apply_dvfs_headroom(). The rest is still the same as-is in the code today.
>
> >
> > >
> > > > above 512 whatever the current (720)  formula or your proposal (608).
> > > > In the case of uclamp, it should be applied after having been scaled
> > > > by irq time.
> > >
> > > I lost you a bit here. I'm not sure how you reached the 720 and 608 numbers.
> >
> > My bad, I finally decided to use an irq pressure of 128 in my
> > calculation but forgot to change the value in my email
> >
> > >
> > > So the way I'm proposing it here
> > >
> > >         util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000
> > >         util = uclamp_rq_util_with(rq, util, NULL) = 512
> > >         util = scale_rq_capacity(512, 256, 1024) = 0.75 * 512 = 384
> > >         util += dvfs_headroom(irq) = 384 + 256 + 0.25 * 256 = 704
> > >         util += dvfs_headroom(dl_bw) = 704
> > >
> > > >
> > > > So we should have reported utilization of 720 with a bandwidth
> > > > requirement of 512 and then cpufreq can applies its headroom if needed
> > >
> > > The key part I'm changing is this
> > >
> > >         util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000
> > >         util = uclamp_rq_util_with(rq, util, NULL) = 512
> > >
> > > Before we had (assume irq, rt and dl are 0 for simplicity and a single task is
> > > running)
> > >
> > >         util = cfs = 800
> > >         util = uclamp_rq_util_with(rq, util, NULL) = 512
> > >         util = dvfs_headroom(util) = 512 + 0.25 * 512 = 640
> > >
> > > So we are running higher than we are allowed to. So applying the headroom
> > > after taking uclamp constraints into account is the problem.
> >
> > But I think it's not correct because you mixed some utilization level
> > with some performance requirement. IIRC, you said that the
> > uclamp min/max must not be seen as a utilization level but as a
> > performance hint. In that case, you can't add it to some other
> > utilization because it defeats its original purpose
>
> But this is what we do today already. I didn't change this part. See below.

And it seems that what is done today doesn't work correctly for you.
Your proposal to include the cpufreq headroom into the scheduler is not
correct IMO and it only applies to some cases. Then, the cpufreq
driver can have some really good reasons to apply some headroom even
with a uclamp value, but it can't make any decision.

I think that we should use another way to fix your problem with
uclamp than reordering how the headroom is applied by cpufreq. Mixing
utilization and performance in one signal hides some differences that
cpufreq can make good use of.

As an example:

cfs util = 650
cfs uclamp = 800
irq = 128

cfs with headroom 650*1.25=812 is clamped to 800

Final utilization will be: 800*(1024-128)/1024 + 128*1.25 = 860, which is
above the target of 800.

When we look at the detail, we have:

cfs util once scaled to the full range is only 650*(1024-128)/1024 = 568

After applying irq (even with some headroom) 568 + 128*1.25 = 728, which
is below the uclamp of 800, so shouldn't we stay at 800 in this case?
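
(A minimal sketch of the two orderings, in userspace C with the values above
hard-coded and uclamp simplified to a plain min():

	#include <stdio.h>

	#define SCALE 1024UL

	/* 1.25x headroom, as map_util_perf() */
	static unsigned long headroom(unsigned long util)
	{
		return util + (util >> 2);
	}

	int main(void)
	{
		unsigned long cfs = 650, uclamp_max = 800, irq = 128;

		/* clamp the headroomed cfs first, then account for irq time */
		unsigned long a = headroom(cfs);
		if (a > uclamp_max)
			a = uclamp_max;
		a = a * (SCALE - irq) / SCALE + headroom(irq);

		/* scale cfs for irq time first, no cfs headroom at all */
		unsigned long b = cfs * (SCALE - irq) / SCALE + headroom(irq);

		printf("a=%lu b=%lu (limit %lu)\n", a, b, uclamp_max);
		return 0;
	}

prints a=860 and b=728, reproducing the two numbers above.)
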

>
> >
> > >
> > > IIUC and your concern is that the way the scheduler and schedutil are interacting
> > > loses some abstraction, then yeah this makes sense and we can refactor the code to
> > > better abstract the two and keep them decoupled.
> > >
> > > But if you think the order the headroom is applied should be the way it is,
> > > then this doesn't fix the problem.
> > >
> > > A good use case to consider is a single task running at 1024. If I want to
> > > limit it to 80% using uclamp_max, how can this be enforced? With current code,
> > > the task will continue to run at 1024; which is the problem.
> >
> > Single case is the easy part
> >
> > >
> > > Running at 640 instead of 512 is still a problem too as this could be 1 or
> > > 2 OPP higher than what we want to allow; and there could be significant power
> > > difference between those OPPs.
> >
> > That's my point.
> >
> > Your proposal tries to fix the simple case of 1 task with a uclamp_max
> > but if we want to move in this direction, we should fix all cases.
>
> I think I am addressing all cases. I don't think I understand the problem
> you're trying to highlight. Is the RFC patch 4 what is causing the
> confusion? That one is not related to fixing uclamp_max; it's just me
> questioning the status quo of which pressure signal requires the headroom.

No, patch 4 doesn't cause me confusion.

>
> The main change being done here actually is to apply_dvfs_headroom() *before*
> doing uclamp_rq_util_with(). I am not sure how you see this mixing.

Because dvfs_headroom is a cpufreq hint and you want to apply it
somewhere else.

>
> Current code performs apply_dvfs_headroom() *after*; which is what causes the CPU
> to run at a performance level higher than rq->uclamp[UCLAMP_MAX].
>
> It doesn't matter how many tasks are on the rq, if rq->uclamp[UCLAMP_MAX] is set to
> 800, then the CPU should not vote to max (assuming all other pressures are 0).

You can't remove the irq pressure from the picture. If
rq->uclamp[UCLAMP_MAX] is set to 800, meaning that the cpu must not go above
800, then it should also apply after taking the other inputs into account. At
least up to some level, as described in my example above

>
> Handling of irq pressure etc is not changed.
>
> > I probably need to put my idea in real code to see what it means, but
> > we should provide 2 values to cpufreq: a utilization level and a
> > performance hint which would max-aggregate the various performance
> > hints that we have
>
> Hmm. This reads to me as a code structure change. From my perspective, the correct
> behavior is to apply the headroom before restricting the performance level.
>
> It seems you're seeing a better way to aggregate all the pressures when taking
> uclamp into account, and I am starting to see your point if this is what
> you're saying. Though I haven't changed the existing relationship. I did try
> to question some things in patch 4; my thoughts were specific to the headroom, but
> it seems you have a more generalized form.
>
> I do agree in general it seems the scheduler and schedutil are getting tightly
> coupled code-wise and it would be good to provide generic cpufreq interfaces to
> potentially allow other governors to hook in and benefit.
>
> Whether this is something desired or not, I don't know as I remember Peter and
> Rafael wanted schedutil to grow to become the de-facto governor. It seems the
> right thing anyway.
>
>
> Thanks!
>
> --
> Qais Yousef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints
  2023-09-12 16:03               ` Vincent Guittot
@ 2023-09-16 19:25                 ` Qais Yousef
  2023-09-24  7:58                   ` Vincent Guittot
  0 siblings, 1 reply; 32+ messages in thread
From: Qais Yousef @ 2023-09-16 19:25 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-pm, Dietmar Eggemann, Lukasz Luba

On 09/12/23 18:03, Vincent Guittot wrote:

> And it seems that what is done today doesn't work correctly for you.
> Your proposal to include the cpufreq headroom into the scheduler is not
> correct IMO and it only applies to some cases. Then, the cpufreq
> driver can have some really good reasons to apply some headroom even
> with a uclamp value, but it can't make any decision.
> 
> I think that we should use another way to fix your problem with
> uclamp than reordering how the headroom is applied by cpufreq. Mixing
> utilization and performance in one signal hides some differences that
> cpufreq can make good use of.
> 
> As an example:
> 
> cfs util = 650
> cfs uclamp = 800
> irq = 128
> 
> cfs with headroom 650*1.25=812 is clamped to 800
> 
> Final utilization will be: 800*(1024-128)/1024 + 128*1.25 = 860, which is
> above the target of 800.
> 
> When we look at the detail, we have:
> 
> cfs util once scaled to the full range is only 650*(1024-128)/1024 = 568
> 
> After applying irq (even with some headroom) 568 + 128*1.25 = 728, which
> is below the uclamp of 800, so shouldn't we stay at 800 in this case?

Shouldn't it be (568+128)*1.25 = 870? Which is almost the 860 above. We clamped
the 812 to 800, and rounding errors almost account for the 10 points
difference between 870 and 860.

I might have gotten the math wrong. But what I saw is that we have

	util = (X + Y + Z) * A

and what I did

	util = AX + AY + AZ

so maybe I messed something up, but I just applied the headroom multiplication
to each element individually rather than after the sum.

So yeah, if I messed that part up, then that wasn't intentional and should be
done differently. But I still can't see it.

> >
> > The main change being done here actually is to apply_dvfs_headroom() *before*
> > doing uclamp_rq_util_with(). I am not sure how you see this mixing.
> 
> > Because dvfs_headroom is a cpufreq hint and you want to apply it
> > somewhere else.

I am still not sure if you mean we are mixing up the code and we need better
abstraction or something else.

Besides the abstraction problem, which I agree with, I can't see what I am
mixing up yet :( Sorry, I think I need a bit more help to see it.

> > > Current code performs apply_dvfs_headroom() *after*; which is what causes the CPU
> > > to run at a performance level higher than rq->uclamp[UCLAMP_MAX].
> >
> > > It doesn't matter how many tasks are on the rq, if rq->uclamp[UCLAMP_MAX] is set to
> > > 800, then the CPU should not vote to max (assuming all other pressures are 0).
> 
> You can't remove the irq pressure from the picture. If
> rq->uclamp[UCLAMP_MAX] is set to 800, meaning that the cpu must not go above
> 800, then it should also apply after taking the other inputs into account. At
> least up to some level, as described in my example above

I was trying to simplify to understand what you mean, as I still don't think
I see the problem you're highlighting.


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints
  2023-09-12 13:40               ` Dietmar Eggemann
@ 2023-09-16 19:25                 ` Qais Yousef
  0 siblings, 0 replies; 32+ messages in thread
From: Qais Yousef @ 2023-09-16 19:25 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, Rafael J. Wysocki, Viresh Kumar, Ingo Molnar,
	Peter Zijlstra, linux-kernel, linux-pm, Lukasz Luba

On 09/12/23 15:40, Dietmar Eggemann wrote:
> On 10/09/2023 19:46, Qais Yousef wrote:
> > On 09/08/23 16:30, Vincent Guittot wrote:
> > 
> 
> [...]
> 
> >>>> above 512 whatever the current (720)  formula or your proposal (608).
> >>>> In the case of uclamp, it should be applied after having been scaled
> >>>> by irq time.
> >>>
> >>> I lost you a bit here. I'm not sure how you reached the 720 and 608 numbers.
> >>
> >> My bad, I finally decided to use an irq pressure of 128 in my
> >> calculation but forgot to change the value in my email
> >>
> >>>
> >>> So the way I'm proposing it here
> >>>
> >>>         util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000
> >>>         util = uclamp_rq_util_with(rq, util, NULL) = 512
> >>>         util = scale_rq_capacity(512, 256, 1024) = 0.75 * 512 = 384
> >>>         util += dvfs_headroom(irq) = 384 + 256 + 0.25 * 256 = 704
> >>>         util += dvfs_headroom(dl_bw) = 704
> >>>
> >>>>
> >>>> So we should have reported utilization of 720 with a bandwidth
> >>>> requirement of 512 and then cpufreq can applies its headroom if needed
> >>>
> >>> The key part I'm changing is this
> >>>
> >>>         util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000
> >>>         util = uclamp_rq_util_with(rq, util, NULL) = 512
> >>>
> >>> Before we had (assume irq, rt and dl are 0 for simplicity and a single task is
> >>> running)
> >>>
> >>>         util = cfs = 800
> >>>         util = uclamp_rq_util_with(rq, util, NULL) = 512
> >>>         util = dvfs_headroom(util) = 512 + 0.25 * 512 = 640
> >>>
> >>> So we are running higher than we are allowed to. So applying the headroom
> >>> after taking uclamp constraints into account is the problem.
> 
> I'm not sure I understood all the example math in this thread correctly:
> 
> Examples:
> 
> irq = 128 or 256
> 
> util = 800, uclamp = 512
> 
> 
> --- current code:
> 
> ((util_cfs + util_rt) * ((max - irq) / max) + irq + dl_bw) * scale
> 
> <- uclamped(cfs+rt) ->
> 
> <--               scale_irq_capacity()                  -->|<-- map_util_perf() 
>                                                                / (headroom())  
> 
> irq = 128: (512 * (1024 - 128) / 1024 + 128 + 0) * 1.25 = 576 * 1.25 = 720
> 
> irq = 256: (512 * (1024 - 256) / 1024 + 256 + 0) * 1.25 = 640 * 1.25 = 800
> 
> 
> --- new approach:
> 
> irq = 128: (512 * (1024 - 128) / 1024 + 128 + 0.25 * 128)            = 608
> 
> irq = 256: (512 * (1024 - 256) / 1024 + 256 + 0.25 * 256)            = 704
> 
>             <->
>             uclamped(cfs+rt+headroom(cfs+rt))
> 
>             <- scale_irq_capacity() ->
> 
>             <--               headroom(irq) ?        -->
> 
> 
> Is this correct?

Yes, this is my understanding too. But I'm not sure anymore as it seems I'm
missing something from what Vincent is saying.


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints
  2023-09-12 14:34       ` Dietmar Eggemann
@ 2023-09-16 20:30         ` Qais Yousef
  0 siblings, 0 replies; 32+ messages in thread
From: Qais Yousef @ 2023-09-16 20:30 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-pm, Vincent Guittot, Lukasz Luba

On 09/12/23 16:34, Dietmar Eggemann wrote:

> No that's not what I wanted to say here.
> 
> I wanted to highlight the different treatment of `(1) a task with
> (natural) util = 800` and `(2) a task with uclamp_max = 800`.
> 
> (1) util = 800
> 
> util = (1.25 * 800 * (1024 - irq) / 1024 + ...
> 
>         <-      ->
>         uclamped(cfs+rt+headroom(cfs+rt))
> 
> 
> (2) uclamp_max = 800
> 
> util = (800 * (1024 - irq) / 1024 + ...
> 
>         <->
>         uclamped(cfs+rt+headroom(cfs+rt))
> 
> So for (1) the scheduler would ask for more than in (2).

Yes, which is the intention IMHO. If they behave the same, then uclamp_max is
not doing anything. And really this is the critical region in the system as
this is where the expensive OPPs lie.

> 
> That's essentially the same question which was raised here:
> 
> https://lkml.kernel.org/r/CAKfTPtDY48jpO+b-2KXawzxh-ty+FMKX6YUXioNR7kpgO=ua6Q@mail.gmail.com
> 
> >>> Note that a similar problem exists for uclamp_min. If util was 50 and
> >>> uclamp_min is 100, since we apply_dvfs_headroom() after applying uclamp
> >>> constraints, we'll end up with a util of 125 instead of 100. IOW, we get
> >>> boosted twice, first by uclamp_min, and a second time by the dvfs
> >>> headroom.
> >>
> >> I see what you want to change here but:
> >>
> >> So far we have `util -> uclamp -> map_util_perf()`
> > 
> > :-O
> > 
> > So when I set the system uclamp_max to 800 it will still run at max; and this
> > is normal?!!
> 
> No that's an issue (A) as well. But the diff between (1) and (2) is IMHO a
> new issue introduced by this patch-set.

It is also the problem to fix :-)

> 
> >> which is fine when we see uclamp as an entity which constrains util, not
> >> the util after being mapped to a capacity constraint.
> > 
> > -ENOPARSE.
> 
> What I meant is 'clamping the util' before the scheduler hands over to
> 'cpufreq' is fine:
> 
>     util -> uclamp -> map_util_perf()
>                
>     scheduler    -->|<-- cpufreq
> 
> I do understand that you guys are already discussing a new
> cpufreq_give_me_freq_for_this_utilization_ctx() between EM and CPUfreq
> in the other thread of this patch to maybe sort out (A).

This will help with abstraction. But behavior-wise, I still think that if
a task (or a cgroup, or the system) has uclamp_max set to 800, then we should not run
faster than this because of the dvfs headroom.

If we don't do this, then max performance is effectively mapped to anything at
800 and above. And likely less than that.

For example if

	util@OPP[Fmax-1] = 950

then any util value at

	util = 950 * 0.8 = 760

will saturate the system and run at Fmax, because all values above 950
will map to Fmax, and because of the headroom this becomes all values at 760 or
above.

This means from the user's perspective the effective range of uclamp_min and
uclamp_max is 0-760 on that system. But the abstraction we're providing is
that the performance range is from 0-1024.
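
A rough sketch of that saturation point (userspace C, not kernel code; 950 is
the hypothetical util at the second-highest OPP from the example above):

	#include <stdio.h>

	/* 1.25x headroom, as map_util_perf() */
	static unsigned long headroom(unsigned long util)
	{
		return util + (util >> 2);
	}

	int main(void)
	{
		/* hypothetical util at OPP[Fmax-1] */
		unsigned long fmax_minus_1_util = 950;

		/* lowest clamp whose headroomed request already hits it */
		unsigned long clamp = fmax_minus_1_util * 4 / 5;

		printf("uclamp_max = %lu -> request %lu\n",
		       clamp, headroom(clamp));
		return 0;
	}

It prints 'uclamp_max = 760 -> request 950': any clamp above 760 already
requests past the second-highest OPP, i.e. Fmax.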

If uclamp_min is set to 512 and the task has a util of 100, why should we run
at 512*1.25?

Similarly, if uclamp_max wants to cap the task/cgroup/system from consuming the
top resources, why does it need the 1.25 headroom? If there are two tasks running
on the CPU and one of them is not capped, then yeah, the headroom will be
applied. But if rq->uclamp[UCLAMP_MAX] = 800, then the frequency vote from
this CPU should not exceed this. I don't see what the headroom is for in this
case.

As another example, do you expect a task that has a util of 1024 but uclamp_max
= 800 to run at max, or at an OPP equivalent to 800, which would usually be 2 or
3 OPPs below Fmax? The latter is the expectation from the interface. Users shouldn't
need to know there's a dvfs headroom and cater for that when they set
uclamp_max or uclamp_min.


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints
  2023-09-16 19:25                 ` Qais Yousef
@ 2023-09-24  7:58                   ` Vincent Guittot
  2023-09-24 17:23                     ` Qais Yousef
  0 siblings, 1 reply; 32+ messages in thread
From: Vincent Guittot @ 2023-09-24  7:58 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-pm, Dietmar Eggemann, Lukasz Luba

On Sat, 16 Sept 2023 at 21:25, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 09/12/23 18:03, Vincent Guittot wrote:
>
> > And it seems that what is done today doesn't work correctly for you.
> > Your proposal to include the cpufreq headroom into the scheduler is not
> > correct IMO and it only applies to some cases. Then, the cpufreq
> > driver can have some really good reasons to apply some headroom even
> > with a uclamp value, but it can't make any decision.
> >
> > I think that we should use another way to fix your problem with
> > uclamp than reordering how the headroom is applied by cpufreq. Mixing
> > utilization and performance in one signal hides some differences that
> > cpufreq can make good use of.
> >
> > As an example:
> >
> > cfs util = 650
> > cfs uclamp = 800
> > irq = 128
> >
> > cfs with headroom 650*1.25=812 is clamped to 800
> >
> > Final utilization will be: 800*(1024-128)/1024 + 128*1.25 = 860, which is
> > above the target of 800.
> >
> > When we look at the detail, we have:
> >
> > cfs util once scaled to the full range is only 650*(1024-128)/1024 = 568
> >
> > After applying irq (even with some headroom) 568 + 128*1.25 = 728, which
> > is below the uclamp of 800, so shouldn't we stay at 800 in this case?
>
> Shouldn't it be (568+128)*1.25 = 870? Which is almost the 860 above. We clamped
> the 812 to 800, and rounding errors almost account for the 10 points
> difference between 870 and 860.

No, I voluntarily used 568 + 128*1.25. I added the dvfs headroom for irq
just to ensure that you would not point out that I removed the headroom for
irq, and to focus on the use case, but it might have created more
confusion.

My example above demonstrates that only taking care of cases with null
irq pressure is not enough and you can still end up above 800

IIUC your point with uclamp_max: it is a performance limit that you
don't want to cross because of CFS. This means that we should not go
above 800 in my example because of cfs utilization: irq needs
128 and CFS asks for 568, so the system needs 696, which is below the 800
uclamp. Even if you add the dvfs headroom on irq, the system is still
below 800. Only when you add the dvfs headroom to cfs do you go above
800, but it's not needed because uclamp says that you should not go
above 800 because of CFS. So we should stay at 800, whereas both the current
formula and your new formula return a value above 800

>
> I might have gotten the math wrong. But what I saw is that we have
>
>         util = (X + Y + Z) * A
>
> and what I did
>
>         util = AX + AY + AZ
>
> so maybe I messed something up, but I just applied the headroom multiplication
> to each element individually rather than after the sum.
>
> So yeah, if I messed that part up, then that wasn't intentional and should be
> done differently. But I still can't see it.
>
> > >
> > > The main change being done here actually is to apply_dvfs_headroom() *before*
> > > doing uclamp_rq_util_with(). I am not sure how you see this mixing.
> >
> > Because dvfs_headroom is a cpufreq hint and you want to apply it
> > somewhere else.
>
> I am still not sure if you mean we are mixing up the code and we need better
> abstraction or something else.
>
> Besides the abstraction problem, which I agree with, I can't see what I am
> mixing up yet :( Sorry, I think I need a bit more help to see it.

There is a mix between actual utilization and performance limits, and
when we add both we lose important information, as highlighted by
my example. If the current formula is not correct because we can go
above the uclamp_max value, your proposal is not better. And the root
cause is mainly coming from adding utilization to a performance limit
(i.e. uclamp).

That's why I said that we need a new interface to enable cpufreq to
not blindly apply its headroom but to make smarter decisions at the
cpufreq level.


>
> > > Current code performs apply_dvfs_headroom() *after*; which is what causes the CPU
> > > to run at a performance level higher than rq->uclamp[UCLAMP_MAX].
> > >
> > > It doesn't matter how many tasks are on the rq, if rq->uclamp[UCLAMP_MAX] is set to
> > > 800, then the CPU should not vote to max (assuming all other pressures are 0).
> >
> > You can't remove the irq pressure from the picture. If
> > rq->uclamp[UCLAMP_MAX] is set to 800, meaning that the cpu must not go above
> > 800, then it should also apply after taking the other inputs into account. At
> > least up to some level, as described in my example above
>
> I was trying to simplify to understand what you mean, as I still don't think
> I see the problem you're highlighting.
>
>
> Thanks!
>
> --
> Qais Yousef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints
  2023-09-24  7:58                   ` Vincent Guittot
@ 2023-09-24 17:23                     ` Qais Yousef
  2023-09-28 17:50                       ` Vincent Guittot
  0 siblings, 1 reply; 32+ messages in thread
From: Qais Yousef @ 2023-09-24 17:23 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-pm, Dietmar Eggemann, Lukasz Luba

On 09/24/23 09:58, Vincent Guittot wrote:

> > Shouldn't it be (568+128)*1.25 = 870? Which is almost the 860 above. We clamped
> > the 812 to 800, and rounding errors almost account for the 10 points
> > difference between 870 and 860.
> 
> No, I voluntarily used 568 + 128*1.25. I added the dvfs headroom for irq
> just to ensure that you would not point out that I removed the headroom for
> irq, and to focus on the use case, but it might have created more
> confusion.
> 
> My example above demonstrates that only taking care of cases with null
> irq pressure is not enough and you can still end up above 800
> 
> IIUC your point with uclamp_max: it is a performance limit that you
> don't want to cross because of CFS. This means that we should not go
> above 800 in my example because of cfs utilization: irq needs
> 128 and CFS asks for 568, so the system needs 696, which is below the 800
> uclamp. Even if you add the dvfs headroom on irq, the system is still
> below 800. Only when you add the dvfs headroom to cfs do you go above
> 800, but it's not needed because uclamp says that you should not go

Yep, absolutely. It seems we agree that CFS shouldn't go above 800 if it is
capped even if there's headroom, but the question you have is about the way it is
being applied. As long as we agree on this part, which is the fundamental behavior
question that I thought was the pain point, the implementation details are
certainly something that I can improve on.

> above 800 because of CFS. So we should stay at 800, whereas both the current
> formula and your new formula return a value above 800

I'm not sure how to handle irq, rt and dl here to be honest. They seem to have
been taken as an 'additional' demand on top of CFS. So yes, we'll go above, but
irq and dl don't have knowledge about uclamp_max. RT does and will be equally
capped like CFS. I kept the current behavior the same, but I did wonder about them
too in patch 4.

So in a system where there are active CFS, RT, DL and IRQ and both CFS and RT
had a cap of 800, then they won't ask for more. But once we add IRQ and DL on
top, then we'll go above.

You think we shouldn't? See below for a suggestion.

> > I am still not sure if you mean we are mixing up the code and we need better
> > abstraction or something else.
> >
> > Besides the abstraction problem, which I agree with, I can't see what I am
> > mixing up yet :( Sorry, I think I need a bit more help to see it.
> 
> There is a mix between actual utilization and performance limits, and
> when we add both we lose important information, as highlighted by
> my example. If the current formula is not correct because we can go
> above the uclamp_max value, your proposal is not better. And the root
> cause is mainly coming from adding utilization to a performance limit
> (i.e. uclamp).
> 
> That's why I said that we need a new interface to enable cpufreq to
> not blindly apply its headroom but to make smarter decisions at the
> cpufreq level.

Okay I see. I tend to agree here too. The question is: should cpufreq take each
util (cfs, rt, dl, irq) as input and make the decision on its own, or should the
scheduler add them and pass the aggregated value? If the latter, how can
cpufreq know how to apply the limit? From what I see, all these decisions have
to happen in the same function, not split up.

It seems the sticking point is how we interpret irq pressure with uclamp. It
seems you think we should apply any uclamp capping to this, which I think would
make sense.

And DL bandwidth we need to max() with the aggregated value.

So I think the formula could be

	util = cfs + rt pressure + irq pressure

	unsigned long cpufreq_convert_util_to_freq(rq, util, dl_bw)
	{
		eff_util = apply_dvfs_headroom(util);
		eff_util = uclamp_rq_util_with(rq, eff_util, NULL);

		eff_util = max(eff_util, dl_bw);
	}

so we add the utilization of cfs, rt and irq (as per current formula). And then
let cpufreq do the headroom and limit management.

I changed the way we handle dl_bw as it is actually requesting to run at
a specific level and not really a pressure. So we max() it with eff_util.

If there's a DL task on the rq then it'd be running and the frequency it
needs is defined by its bandwidth.

We could also keep it as it is with

	unsigned long cpufreq_convert_util_to_freq(rq, util, dl_bw)
	{
		eff_util = apply_dvfs_headroom(util);
		eff_util = uclamp_rq_util_with(rq, eff_util, NULL);

		eff_util += dl_bw;
	}

RT has uclamp knowledge so it'll either run at max or whatever value it might
have requested via uclamp_min. But DL doesn't set any uclamp_min and must be
either added or max()ed. I'm not sure which is more correct yet, but maybe
adding actually is better to ensure the CPU runs higher to handle all the tasks
on the rq.
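
To make the difference concrete, a toy comparison (userspace C; the numbers
are made up, with eff_util standing in for the headroomed-and-clamped sum
above):

	#include <stdio.h>

	int main(void)
	{
		unsigned long eff_util = 600, dl_bw = 256;

		/* max(): DL bandwidth is a floor, already covered by the sum */
		unsigned long with_max = eff_util > dl_bw ? eff_util : dl_bw;

		/* +=: DL bandwidth is extra demand on top of everyone else */
		unsigned long with_add = eff_util + dl_bw;

		printf("max(): %lu, +=: %lu\n", with_max, with_add);
		return 0;
	}

which gives 600 for the max() variant and 856 for the += variant.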

What do you think?


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints
  2023-09-24 17:23                     ` Qais Yousef
@ 2023-09-28 17:50                       ` Vincent Guittot
  2023-09-28 22:05                         ` Qais Yousef
  0 siblings, 1 reply; 32+ messages in thread
From: Vincent Guittot @ 2023-09-28 17:50 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-pm, Dietmar Eggemann, Lukasz Luba

On Sunday 24 Sept 2023 at 18:23:01 (+0100), Qais Yousef wrote:
> On 09/24/23 09:58, Vincent Guittot wrote:
> 
> > > Shouldn't it be (568+128)*1.25 = 870? Which is almost the 860 above. We clamped
> > > the 812 to 800, and rounding errors almost account for the 10 points
> > > difference between 870 and 860.
> > 
> > No, I voluntarily used 568 + 128*1.25. I added the dvfs headroom for irq
> > just to ensure that you would not point out that I removed the headroom for
> > irq, and to focus on the use case, but it might have created more
> > confusion.
> >
> > My example above demonstrates that only taking care of cases with null
> > irq pressure is not enough and you can still end up above 800
> > 
> > IIUC your point with uclamp_max: it is a performance limit that you
> > don't want to cross because of CFS. This means that we should not go
> > above 800 in my example because of cfs utilization: irq needs
> > 128 and CFS asks for 568, so the system needs 696, which is below the 800
> > uclamp. Even if you add the dvfs headroom on irq, the system is still
> > below 800. Only when you add the dvfs headroom to cfs do you go above
> > 800, but it's not needed because uclamp says that you should not go
> 
> Yep, absolutely. It seems we agree that CFS shouldn't go above 800 if it is
> capped even if there's headroom, but the question you have is about the way it is

At least I want to ensure that cpufreq has the right information to make a
smart decision. In the example above, there is no need to go above 800 for
either cfs or irq.

> being applied. As long as we agree on this part, which is the fundamental behavior
> question that I thought was the pain point, the implementation details are
> certainly something that I can improve on.
> 
> > above 800 because of CFS. So we should stay at 800, whereas both the current
> > formula and your new formula return a value above 800
> 
> I'm not sure how to handle irq, rt and dl here to be honest. They seem to have
> been taken as an 'additional' demand on top of CFS. So yes, we'll go above, but
> irq and dl don't have knowledge about uclamp_max. RT does and will be equally
> capped like CFS. I kept the current behavior the same, but I did wonder about them
> too in patch 4.
> 
> So in a system where there are active CFS, RT, DL and IRQ and both CFS and RT
> had a cap of 800, then they won't ask for more. But once we add IRQ and DL on
> top, then we'll go above.
> 
> You think we shouldn't? See below for a suggestion.

I'm afraid that the answer is not that straightforward

In the example above, should we still stay at the uclamp value if we now set
uclamp to 680?

And what about a uclamp of 160?

> 
> > > I am still not sure if you mean we are mixing up the code and we need better
> > > abstraction or something else.
> > >
> > > Besides the abstraction problem, which I agree with, I can't see what I am
> > > mixing up yet :( Sorry, I think I need a bit more help to see it.
> > 
> > There is a mix between actual utilization and performance limits, and
> > when we add both we lose important information, as highlighted by
> > my example. If the current formula is not correct because we can go
> > above the uclamp_max value, your proposal is not better. And the root
> > cause is mainly coming from adding utilization to a performance limit
> > (i.e. uclamp).
> > 
> > That's why I said that we need a new interface to enable cpufreq to
> > not blindly apply its headroom but to make smarter decisions at the
> > cpufreq level.
> 
> Okay I see. I tend to agree here too. The question is: should cpufreq take each
> util (cfs, rt, dl, irq) as input and make the decision on its own, or should the
> scheduler add them and pass the aggregated value? If the latter, how can
> cpufreq know how to apply the limit? From what I see, all these decisions have
> to happen in the same function, not split up.

I'm not in favor of showing all details to cpufreq because it will have to
follow the internal changes. Instead, I was thinking of something like:

/* Function name to be changed */
unsigned long effective_cpu_util(int cpu, unsigned int *min, unsigned int *max)

The function returns the actual utilization of the CPU and some minimum and
maximum limits, with the possibility to have the min and/or Actual values > Max,
because the min would be a hard minimum value whereas max is only a soft maximum
value.

Min would be the minimum perf to provide to the cpu: typically DL_bw + irq
Actual would be the actual utilization of the cpu: cfs+rt+dl+irq (after scaling
  everything in the normal range)
Max would be the maximum needed performance for normal work: typically the
minimum between uclamp and capacity

Then cpufreq can use these 3 values to compute a performance level and it 
will know up to which perf level it should go and if it is worth it.
Something like:

/* get scheduler statistics */
actual = effective_cpu_util(cpu, util, &min, &max)

/* ensure min perf for dl and irq + some margin for others */
target = min + headroom

/* check if we really need to go to max */
if ((actual + headroom) < max)
  max = actual + headroom

/* use max of the 2 values */
target = max(target, max)

I put all this in the quick patch below, which only compiled but was not tested


---
 include/linux/energy_model.h     |  1 -
 kernel/sched/core.c              | 98 +++++++++++++++++---------------
 kernel/sched/cpufreq_schedutil.c |  6 +-
 kernel/sched/fair.c              |  4 +-
 kernel/sched/sched.h             |  7 ++-
 5 files changed, 61 insertions(+), 55 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index e4cf9baf5f9e..c424a1bcec38 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -261,7 +261,6 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	scale_cpu = arch_scale_cpu_capacity(cpu);
 	ref_freq = em_get_capacity_ref_freq(cpu, pd);

-	max_util = map_util_perf(max_util);
 	max_util = min(max_util, allowed_cpu_cap);
 	freq = map_util_freq(max_util, ref_freq, scale_cpu);

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6560392f2f83..030564f5be24 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7404,18 +7404,13 @@ int sched_core_idle_cpu(int cpu)
  * required to meet deadlines.
  */
 unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
-				 enum cpu_util_type type,
-				 struct task_struct *p)
+				 unsigned long *min,
+				 unsigned long *max)
 {
-	unsigned long dl_util, util, irq, max;
+	unsigned long dl_util, util, irq, scale;
 	struct rq *rq = cpu_rq(cpu);

-	max = arch_scale_cpu_capacity(cpu);
-
-	if (!uclamp_is_used() &&
-	    type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
-		return max;
-	}
+	scale = arch_scale_cpu_capacity(cpu);

 	/*
 	 * Early check to see if IRQ/steal time saturates the CPU, can be
@@ -7423,9 +7418,16 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 	 * update_irq_load_avg().
 	 */
 	irq = cpu_util_irq(rq);
-	if (unlikely(irq >= max))
-		return max;
+	if (unlikely(irq >= scale)) {
+		if (min)
+			*min = scale;
+		if (max)
+			*max = scale;
+		return scale;
+	}

+	if (min)
+		*min = irq + cpu_bw_dl(rq);
 	/*
 	 * Because the time spend on RT/DL tasks is visible as 'lost' time to
 	 * CFS tasks and we use the same metric to track the effective
@@ -7439,29 +7441,13 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 	 * frequency will be gracefully reduced with the utilization decay.
 	 */
 	util = util_cfs + cpu_util_rt(rq);
-	if (type == FREQUENCY_UTIL)
-		util = uclamp_rq_util_with(rq, util, p);
+	util += cpu_util_dl(rq);

-	dl_util = cpu_util_dl(rq);
-
-	/*
-	 * For frequency selection we do not make cpu_util_dl() a permanent part
-	 * of this sum because we want to use cpu_bw_dl() later on, but we need
-	 * to check if the CFS+RT+DL sum is saturated (ie. no idle time) such
-	 * that we select f_max when there is no idle time.
-	 *
-	 * NOTE: numerical errors or stop class might cause us to not quite hit
-	 * saturation when we should -- something for later.
-	 */
-	if (util + dl_util >= max)
-		return max;
-
-	/*
-	 * OTOH, for energy computation we need the estimated running time, so
-	 * include util_dl and ignore dl_bw.
-	 */
-	if (type == ENERGY_UTIL)
-		util += dl_util;
+	if (util >= scale) {
+		if (max)
+			*max = scale;
+		return scale;
+	}

 	/*
 	 * There is still idle time; further improve the number by using the
@@ -7472,28 +7458,48 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 	 *   U' = irq + --------- * U
 	 *                 max
 	 */
-	util = scale_irq_capacity(util, irq, max);
+	util = scale_irq_capacity(util, irq, scale);
 	util += irq;

+	if (max)
+		*max = uclamp_rq_util_with(rq, util, NULL);
+
+	return min(scale, util);
+}
+
+
+/*
+ *  TODO: move this in cpufreq
+ */
+unsigned long effective_cpu_perf(int cpu, unsigned long util_cfs,
+				 struct task_struct *p)
+{
+	unsigned long actual, target;
+
+	/* Get utilization stats */
+	actual = effective_cpu_util(cpu, util_cfs, &min, &max);
+
 	/*
-	 * Bandwidth required by DEADLINE must always be granted while, for
-	 * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
-	 * to gracefully reduce the frequency when no tasks show up for longer
-	 * periods of time.
-	 *
-	 * Ideally we would like to set bw_dl as min/guaranteed freq and util +
-	 * bw_dl as requested freq. However, cpufreq is not yet ready for such
-	 * an interface. So, we only do the latter for now.
+	 * Provide at least enough capacity for DL + irq plus some headroom
+	 * for other activities
 	 */
-	if (type == FREQUENCY_UTIL)
-		util += cpu_bw_dl(rq);
+	target =  map_util_perf(min);

-	return min(max, util);
+	actual = map_util_perf(actual);
+	/* Actually we don't need to target the max performance */
+	if (actual < max)
+		max = actual;
+
+	/*
+	 * Ensure at least minimum perf target while providing more compute capacity when
+	 * possible
+	 */
+	return max(target, max);
 }

 unsigned long sched_cpu_util(int cpu)
 {
-	return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL);
+	return effective_cpu_util(cpu, cpu_util_cfs(cpu), NULL, NULL);
 }
 #endif /* CONFIG_SMP */

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index e2b9c8c3d69a..ef6b4b09ac12 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -162,7 +162,6 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 	unsigned int freq;
 	struct cpufreq_policy *policy = sg_policy->policy;

-	util = map_util_perf(util);
 	freq = get_capacity_ref_freq(policy);
 	freq = map_util_freq(util, freq, max);

@@ -179,8 +178,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
 	struct rq *rq = cpu_rq(sg_cpu->cpu);

 	sg_cpu->bw_dl = cpu_bw_dl(rq);
-	sg_cpu->util = effective_cpu_util(sg_cpu->cpu, util,
-					  FREQUENCY_UTIL, NULL);
+	sg_cpu->util = effective_cpu_perf(sg_cpu->cpu, util, NULL);
 }

 /**
@@ -427,7 +425,7 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
 		sg_cpu->util = prev_util;

 	cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl),
-				   map_util_perf(sg_cpu->util), max_cap);
+				   sg_cpu->util, max_cap);

 	sg_cpu->sg_policy->last_freq_update_time = time;
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 06d6d0dde48a..50568e2fa1ba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7570,7 +7570,7 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
 	for_each_cpu(cpu, pd_cpus) {
 		unsigned long util = cpu_util(cpu, p, -1, 0);

-		busy_time += effective_cpu_util(cpu, util, ENERGY_UTIL, NULL);
+		busy_time += effective_cpu_util(cpu, util, NULL, NULL);
 	}

 	eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
@@ -7602,7 +7602,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
 		 * NOTE: in case RT tasks are running, by default the
 		 * FREQUENCY_UTIL's utilization can be max OPP.
 		 */
-		eff_util = effective_cpu_util(cpu, util, FREQUENCY_UTIL, tsk);
+		eff_util = effective_cpu_perf(cpu, util, tsk);
 		max_util = max(max_util, eff_util);
 	}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 17ae151e90c0..e79cb1110e8d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2988,10 +2988,13 @@ enum cpu_util_type {
 	ENERGY_UTIL,
 };

-unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
-				 enum cpu_util_type type,
+unsigned long effective_cpu_perf(int cpu, unsigned long util_cfs,
 				 struct task_struct *p);

+unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
+				 unsigned long *min,
+				 unsigned long *max);
+
 /*
  * Verify the fitness of task @p to run on @cpu taking into account the
  * CPU original capacity and the runtime/deadline ratio of the task.
--
2.34.1


> 
> It seems the sticking point is how we interpret irq pressure with uclamp. It
> seems you think we should apply any uclamp capping to this, which I think would
> make sense.
> 
> And DL bandwidth we need to max() with the aggregated value.
> 
> So I think the formula could be
> 
> 	util = cfs + rt pressure + irq pressure
> 
> 	unsigned long cpufreq_convert_util_to_freq(rq, util, dl_bw)
> 	{
> 		eff_util = apply_dvfs_headroom(util);
> 		eff_util = uclamp_rq_util_with(rq, eff_util, NULL);
> 
> 		eff_util = max(eff_util, dl_bw);
> 	}
> 
> so we add the utilization of cfs, rt and irq (as per current formula). And then
> let cpufreq do the headroom and limit management.
> 
> I changed the way we handle dl_bw as it is actually requesting to run at
> a specific level and not really a pressure. So we max() it with eff_util.
> 
> If there's a DL task on the rq then it'd be running and the frequency it
> needs is defined by its bandwidth.
> 
> We could also keep it as it is with
> 
> 	unsigned long cpufreq_convert_util_to_freq(rq, util, dl_bw)
> 	{
> 		eff_util = apply_dvfs_headroom(util);
> 		eff_util = uclamp_rq_util_with(rq, eff_util, NULL);
> 
> 		eff_util += dl_bw;
> 	}
> 
> RT has uclamp knowledge so it'll either run at max or whatever value it might
> have requested via uclamp_min. But DL doesn't set any uclamp_min and must be
> either added or max()ed. I'm not sure which is more correct yet, but maybe
> adding actually is better to ensure the CPU runs higher to handle all the tasks
> on the rq.
> 
> What do you think?
> 
> 
> Thanks!
> 
> --
> Qais Yousef

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints
  2023-09-28 17:50                       ` Vincent Guittot
@ 2023-09-28 22:05                         ` Qais Yousef
  2023-09-29  8:01                           ` Vincent Guittot
  0 siblings, 1 reply; 32+ messages in thread
From: Qais Yousef @ 2023-09-28 22:05 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-pm, Dietmar Eggemann, Lukasz Luba

On 09/28/23 19:50, Vincent Guittot wrote:

> > 
> > Yep, absolutely. It seems we agree that CFS shouldn't go above 800 if it is
> > capped even if there's headroom, but the question you have on the way it is
> 
> At least I want to ensure that cpufreq has the right information to make a
> smart decision. In the example above, there is no need to go above 800 for
> either cfs or irq.

Okay, you want to do an even bigger rework :-) I thought I might have pushed
some boundaries with the rework I had in mind hehe.

> I'm not in favor of showing all details to cpufreq because it will have to
> follow the internal changes. Instead, I was thinking of something like:
> 
> /* Function name to be changed */
> unsigned long effective_cpu_util(int cpu, unsigned int *min, unsigned int *max)
> 
> The function returns the actual utilization of the CPU and some minimum and
> maximum limits, with the possibility to have the min and/or Actual values > Max,
> because the min would be a hard minimum value whereas max is only a soft maximum
> value.
> 
> Min would be the minimum perf to provide to the cpu: typically DL_bw + irq
> Actual would be the actual utilization of the cpu: cfs+rt+dl+irq (after scaling
>   everything in the normal range)
> Max would be the maximum needed performance for normal work: typically the
> minimum between uclamp and capacity
> 
> Then cpufreq can use these 3 values to compute a performance level and it 
> will know up to which perf level it should go and if it is worth it.
> Something like:

Okay thanks! I think I have better clarity now. Let me try to rework the
patches.


Cheers

--
Qais Yousef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints
  2023-09-28 22:05                         ` Qais Yousef
@ 2023-09-29  8:01                           ` Vincent Guittot
  0 siblings, 0 replies; 32+ messages in thread
From: Vincent Guittot @ 2023-09-29  8:01 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Rafael J. Wysocki, Viresh Kumar, Ingo Molnar, Peter Zijlstra,
	linux-kernel, linux-pm, Dietmar Eggemann, Lukasz Luba

On Thursday 28 Sept 2023 at 23:05:04 (+0100), Qais Yousef wrote:
> On 09/28/23 19:50, Vincent Guittot wrote:
> 
> > > 
> > > Yep, absolutely. It seems we agree that CFS shouldn't go above 800 if it is
> > > capped even if there's headroom, but the question you have on the way it is
> > 
> > At least I want to ensure that cpufreq has the right information to make a
> > smart decision. In the example above, it's not needed to go above 800 for
> > neither cfs nor irq.
> 
> Okay you want to do even bigger rework :-) I thought I might have pushed some
> boundary with the rework I had in mind hehe.
> 
> > I'm not in favor of showing all details to cpufreq because it will have to
> > follow the internal changes. Instead, I was thinking of something like:
> > 
> > /* Function name to be changed */
> > unsigned long effective_cpu_util(int cpu, unsigned int *min, unsigned int *max)
> > 
> > The function returns the actual utilization of the CPU and some minimum and
> > maximum limits, with the possibility to have the min and/or Actual values > Max,
> > because the min would be a hard minimum value whereas max is only a soft maximum
> > value.
> > 
> > Min would be the minimum perf to provide to the cpu: typically DL_bw + irq
> > Actual would be the actual utilization of the cpu: cfs+rt+dl+irq (after scaling
> >   everything in the normal range)
> > Max would be the maximum needed performance for normal work: typically the
> > minimum between uclamp and capacity
> > 
> > Then cpufreq can use these 3 values to compute a performance level and it 
> > will know up to which perf level it should go and if it is worth it.
> > Something like:
> 
> Okay thanks! I think I have better clarity now. Let me try to rework the
> patches.

This is the patch with everything amended, some pieces were missing in the previous version.

---
 include/linux/energy_model.h     |   1 -
 kernel/sched/core.c              | 103 +++++++++++++++++--------------
 kernel/sched/cpufreq_schedutil.c |   6 +-
 kernel/sched/fair.c              |   4 +-
 kernel/sched/sched.h             |   7 ++-
 5 files changed, 66 insertions(+), 55 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index e4cf9baf5f9e..c424a1bcec38 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -261,7 +261,6 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	scale_cpu = arch_scale_cpu_capacity(cpu);
 	ref_freq = em_get_capacity_ref_freq(cpu, pd);

-	max_util = map_util_perf(max_util);
 	max_util = min(max_util, allowed_cpu_cap);
 	freq = map_util_freq(max_util, ref_freq, scale_cpu);

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6560392f2f83..e5476703ba49 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7404,18 +7404,13 @@ int sched_core_idle_cpu(int cpu)
  * required to meet deadlines.
  */
 unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
-				 enum cpu_util_type type,
-				 struct task_struct *p)
+				 unsigned long *min,
+				 unsigned long *max)
 {
-	unsigned long dl_util, util, irq, max;
+	unsigned long util, irq, scale;
 	struct rq *rq = cpu_rq(cpu);

-	max = arch_scale_cpu_capacity(cpu);
-
-	if (!uclamp_is_used() &&
-	    type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
-		return max;
-	}
+	scale = arch_scale_cpu_capacity(cpu);

 	/*
 	 * Early check to see if IRQ/steal time saturates the CPU, can be
@@ -7423,9 +7418,16 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 	 * update_irq_load_avg().
 	 */
 	irq = cpu_util_irq(rq);
-	if (unlikely(irq >= max))
-		return max;
+	if (unlikely(irq >= scale)) {
+		if (min)
+			*min = scale;
+		if (max)
+			*max = scale;
+		return scale;
+	}

+	if (min)
+		*min = irq + cpu_bw_dl(rq);
 	/*
 	 * Because the time spend on RT/DL tasks is visible as 'lost' time to
 	 * CFS tasks and we use the same metric to track the effective
@@ -7439,29 +7441,13 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 	 * frequency will be gracefully reduced with the utilization decay.
 	 */
 	util = util_cfs + cpu_util_rt(rq);
-	if (type == FREQUENCY_UTIL)
-		util = uclamp_rq_util_with(rq, util, p);
-
-	dl_util = cpu_util_dl(rq);
-
-	/*
-	 * For frequency selection we do not make cpu_util_dl() a permanent part
-	 * of this sum because we want to use cpu_bw_dl() later on, but we need
-	 * to check if the CFS+RT+DL sum is saturated (ie. no idle time) such
-	 * that we select f_max when there is no idle time.
-	 *
-	 * NOTE: numerical errors or stop class might cause us to not quite hit
-	 * saturation when we should -- something for later.
-	 */
-	if (util + dl_util >= max)
-		return max;
+	util += cpu_util_dl(rq);

-	/*
-	 * OTOH, for energy computation we need the estimated running time, so
-	 * include util_dl and ignore dl_bw.
-	 */
-	if (type == ENERGY_UTIL)
-		util += dl_util;
+	if (util >= scale) {
+		if (max)
+			*max = scale;
+		return scale;
+	}

 	/*
 	 * There is still idle time; further improve the number by using the
@@ -7472,28 +7458,53 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 	 *   U' = irq + --------- * U
 	 *                 max
 	 */
-	util = scale_irq_capacity(util, irq, max);
+	util = scale_irq_capacity(util, irq, scale);
 	util += irq;

+	if (max)
+		*max = uclamp_rq_util_with(rq, util, NULL);
+
+	return min(scale, util);
+}
+
+
+/*
+ *  TODO: move this in cpufreq
+ */
+unsigned long effective_cpu_perf(int cpu, unsigned long util_cfs,
+				 struct task_struct *p)
+{
+	unsigned long actual, target, min, max;
+	struct rq *rq = cpu_rq(cpu);
+
+	/* Get utilization stats */
+	actual = effective_cpu_util(cpu, util_cfs, &min, &max);
+
+	/* Check how max would be changed with p */
+	if (p)
+		max = min(max, uclamp_rq_util_with(rq, util_cfs, p));
+
 	/*
-	 * Bandwidth required by DEADLINE must always be granted while, for
-	 * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
-	 * to gracefully reduce the frequency when no tasks show up for longer
-	 * periods of time.
-	 *
-	 * Ideally we would like to set bw_dl as min/guaranteed freq and util +
-	 * bw_dl as requested freq. However, cpufreq is not yet ready for such
-	 * an interface. So, we only do the latter for now.
+	 * Provide at least enough capacity for DL + irq plus some headroom
+	 * for other activities
 	 */
-	if (type == FREQUENCY_UTIL)
-		util += cpu_bw_dl(rq);
+	target =  map_util_perf(min);

-	return min(max, util);
+	actual = map_util_perf(actual);
+	/* Actually we don't need to target the max performance */
+	if (actual < max)
+		max = actual;
+
+	/*
+	 * Ensure at least minimum perf target while providing more compute capacity when
+	 * possible
+	 */
+	return max(target, max);
 }

 unsigned long sched_cpu_util(int cpu)
 {
-	return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL);
+	return effective_cpu_util(cpu, cpu_util_cfs(cpu), NULL, NULL);
 }
 #endif /* CONFIG_SMP */

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index e2b9c8c3d69a..ef6b4b09ac12 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -162,7 +162,6 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 	unsigned int freq;
 	struct cpufreq_policy *policy = sg_policy->policy;

-	util = map_util_perf(util);
 	freq = get_capacity_ref_freq(policy);
 	freq = map_util_freq(util, freq, max);

@@ -179,8 +178,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
 	struct rq *rq = cpu_rq(sg_cpu->cpu);

 	sg_cpu->bw_dl = cpu_bw_dl(rq);
-	sg_cpu->util = effective_cpu_util(sg_cpu->cpu, util,
-					  FREQUENCY_UTIL, NULL);
+	sg_cpu->util = effective_cpu_perf(sg_cpu->cpu, util, NULL);
 }

 /**
@@ -427,7 +425,7 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
 		sg_cpu->util = prev_util;

 	cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl),
-				   map_util_perf(sg_cpu->util), max_cap);
+				   sg_cpu->util, max_cap);

 	sg_cpu->sg_policy->last_freq_update_time = time;
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 06d6d0dde48a..50568e2fa1ba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7570,7 +7570,7 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
 	for_each_cpu(cpu, pd_cpus) {
 		unsigned long util = cpu_util(cpu, p, -1, 0);

-		busy_time += effective_cpu_util(cpu, util, ENERGY_UTIL, NULL);
+		busy_time += effective_cpu_util(cpu, util, NULL, NULL);
 	}

 	eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
@@ -7602,7 +7602,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
 		 * NOTE: in case RT tasks are running, by default the
 		 * FREQUENCY_UTIL's utilization can be max OPP.
 		 */
-		eff_util = effective_cpu_util(cpu, util, FREQUENCY_UTIL, tsk);
+		eff_util = effective_cpu_perf(cpu, util, tsk);
 		max_util = max(max_util, eff_util);
 	}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 17ae151e90c0..4cae9d7c4d8f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2988,10 +2988,13 @@ enum cpu_util_type {
 	ENERGY_UTIL,
 };

-unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
-				 enum cpu_util_type type,
+unsigned long effective_cpu_perf(int cpu, unsigned long util_cfs,
 				 struct task_struct *p);

+unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
+				 unsigned long *min,
+				 unsigned long *max);
+
 /*
  * Verify the fitness of task @p to run on @cpu taking into account the
  * CPU original capacity and the runtime/deadline ratio of the task.
--
2.34.1

> 
> 
> Cheers
> 
> --
> Qais Yousef

^ permalink raw reply related	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2023-09-29  8:01 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-20 21:06 [PATCH 0/4] Fix dvfs_headroom escaping uclamp constraints Qais Yousef
2023-08-20 21:06 ` [PATCH 1/4] sched: cpufreq: Rename map_util_perf to apply_dvfs_headroom Qais Yousef
2023-08-20 21:13   ` Qais Yousef
2023-08-21 16:38   ` Dietmar Eggemann
2023-08-26 20:03     ` Qais Yousef
2023-08-20 21:06 ` [PATCH 2/4] sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints Qais Yousef
2023-08-21 16:39   ` Dietmar Eggemann
2023-08-26 20:08     ` Qais Yousef
2023-09-12 14:34       ` Dietmar Eggemann
2023-09-16 20:30         ` Qais Yousef
2023-08-29 14:35   ` Vincent Guittot
2023-08-29 16:37     ` Qais Yousef
2023-09-07 13:47       ` Vincent Guittot
2023-09-07 21:55         ` Qais Yousef
2023-09-08 14:30           ` Vincent Guittot
2023-09-10 17:46             ` Qais Yousef
2023-09-12 13:40               ` Dietmar Eggemann
2023-09-16 19:25                 ` Qais Yousef
2023-09-12 16:03               ` Vincent Guittot
2023-09-16 19:25                 ` Qais Yousef
2023-09-24  7:58                   ` Vincent Guittot
2023-09-24 17:23                     ` Qais Yousef
2023-09-28 17:50                       ` Vincent Guittot
2023-09-28 22:05                         ` Qais Yousef
2023-09-29  8:01                           ` Vincent Guittot
2023-08-20 21:06 ` [PATCH 3/4] sched: cpufreq: Move apply_dvfs_headroom() to sched.h Qais Yousef
2023-08-21 16:40   ` Dietmar Eggemann
2023-08-20 21:06 ` [PATCH RFC 4/4] sched: cpufreq: Apply DVFS headroom to CFS only Qais Yousef
2023-08-21 16:41   ` Dietmar Eggemann
2023-08-26 20:27     ` Qais Yousef
2023-08-21 10:34 ` [PATCH 0/4] Fix dvfs_headroom escaping uclamp constraints Rafael J. Wysocki
2023-08-26 19:17   ` Qais Yousef
