All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 0/3] Add allowed CPU capacity knowledge to EAS
@ 2021-06-14 18:58 Lukasz Luba
  2021-06-14 19:10 ` [PATCH 1/3] thermal: cpufreq_cooling: Update also offline CPUs per-cpu thermal_pressure Lukasz Luba
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Lukasz Luba @ 2021-06-14 18:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-pm, peterz, rjw, viresh.kumar, vincent.guittot, qperret,
	dietmar.eggemann, vincent.donnefort, lukasz.luba,
	Beata.Michalska, mingo, juri.lelli, rostedt, segall, mgorman,
	bristot, thara.gopinath, amit.kachhap, amitk, rui.zhang,
	daniel.lezcano

Hi all,

The patch set v4 aims to add knowledge about reduced CPU capacity
into the Energy Model (EM) and Energy Aware Scheduler (EAS). Currently the
issue is that SchedUtil CPU frequency and EM frequency are not aligned,
when there is a CPU thermal capping. This causes an estimation error.
This patch set provides the information about allowed CPU capacity
into the EM (thanks to thermal pressure information). This improves the
energy estimation. More info about this mechanism can be found in the
patches description.

Changelog:
v4:
- removed local variable and improved description in patch 2/3
- added Reviewed-by from Vincent for patch 2/3
- added Acked-by from Viresh for patch 1/3
v3 [3]:
- switched to 'raw' per-cpu thermal pressure instead of thermal pressure
  geometric series signal, since it more suited for purpose of
  this use case: predicting SchedUtil frequency (Vincent, Dietmar)
- added more comment in the patch 2/3 header for use case when thermal
  capping might be applied even the CPUs are not over-utilized
  (Dietmar)
- added ACK tag from Rafael for SchedUtil part
- added a fix patch for offline CPUs in cpufreq_cooling and per-cpu
  thermal_pressure missing update
v2 [2]:
- clamp the returned value from effective_cpu_util() and avoid irq
  util scaling issues (Quentin)
v1 is available at [1]

Regards,
Lukasz

[1] https://lore.kernel.org/linux-pm/20210602135609.10867-1-lukasz.luba@arm.com/
[2] https://lore.kernel.org/lkml/20210604080954.13915-1-lukasz.luba@arm.com/
[3] https://lore.kernel.org/lkml/20210610150324.22919-1-lukasz.luba@arm.com/

Lukasz Luba (3):
  thermal: cpufreq_cooling: Update also offline CPUs per-cpu
    thermal_pressure
  sched/fair: Take thermal pressure into account while estimating energy
  sched/cpufreq: Consider reduced CPU capacity in energy calculation

 drivers/thermal/cpufreq_cooling.c |  2 +-
 include/linux/energy_model.h      | 16 +++++++++++++---
 include/linux/sched/cpufreq.h     |  2 +-
 kernel/sched/cpufreq_schedutil.c  |  1 +
 kernel/sched/fair.c               | 13 +++++++++----
 5 files changed, 25 insertions(+), 9 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/3] thermal: cpufreq_cooling: Update also offline CPUs per-cpu thermal_pressure
  2021-06-14 18:58 [PATCH v4 0/3] Add allowed CPU capacity knowledge to EAS Lukasz Luba
@ 2021-06-14 19:10 ` Lukasz Luba
  2021-06-18  8:46   ` [tip: sched/core] thermal/cpufreq_cooling: Update " tip-bot2 for Lukasz Luba
  2021-06-14 19:11 ` [PATCH v4 2/3] sched/fair: Take thermal pressure into account while estimating energy Lukasz Luba
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Lukasz Luba @ 2021-06-14 19:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-pm, peterz, rjw, viresh.kumar, vincent.guittot, qperret,
	dietmar.eggemann, vincent.donnefort, lukasz.luba,
	Beata.Michalska, mingo, juri.lelli, rostedt, segall, mgorman,
	bristot, thara.gopinath, amit.kachhap, amitk, rui.zhang,
	daniel.lezcano

The thermal pressure signal gives information to the scheduler about
reduced CPU capacity due to thermal. It is based on a value stored in a
per-cpu 'thermal_pressure' variable. The online CPUs will get the new
value there, while the offline won't. Unfortunately, when the CPU is back
online, the value read from per-cpu variable might be wrong (stale data).
This might affect the scheduler decisions, since it sees the CPU capacity
differently than what is actually available.

Fix it by making sure that all online+offline CPUs would get the proper
value in their per-cpu variable when thermal framework sets capping.

Fixes: f12e4f66ab6a3 ("thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping")
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 drivers/thermal/cpufreq_cooling.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
index eeb4e4b76c0b..43b1ae8a7789 100644
--- a/drivers/thermal/cpufreq_cooling.c
+++ b/drivers/thermal/cpufreq_cooling.c
@@ -478,7 +478,7 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
 	ret = freq_qos_update_request(&cpufreq_cdev->qos_req, frequency);
 	if (ret >= 0) {
 		cpufreq_cdev->cpufreq_state = state;
-		cpus = cpufreq_cdev->policy->cpus;
+		cpus = cpufreq_cdev->policy->related_cpus;
 		max_capacity = arch_scale_cpu_capacity(cpumask_first(cpus));
 		capacity = frequency * max_capacity;
 		capacity /= cpufreq_cdev->policy->cpuinfo.max_freq;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v4 2/3] sched/fair: Take thermal pressure into account while estimating energy
  2021-06-14 18:58 [PATCH v4 0/3] Add allowed CPU capacity knowledge to EAS Lukasz Luba
  2021-06-14 19:10 ` [PATCH 1/3] thermal: cpufreq_cooling: Update also offline CPUs per-cpu thermal_pressure Lukasz Luba
@ 2021-06-14 19:11 ` Lukasz Luba
  2021-06-15 15:31   ` Dietmar Eggemann
  2021-06-18  8:46   ` [tip: sched/core] " tip-bot2 for Lukasz Luba
  2021-06-14 19:12 ` [PATCH v4 3/3] sched/cpufreq: Consider reduced CPU capacity in energy calculation Lukasz Luba
  2021-06-16 13:33 ` [PATCH v4 0/3] Add allowed CPU capacity knowledge to EAS Lukasz Luba
  3 siblings, 2 replies; 14+ messages in thread
From: Lukasz Luba @ 2021-06-14 19:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-pm, peterz, rjw, viresh.kumar, vincent.guittot, qperret,
	dietmar.eggemann, vincent.donnefort, lukasz.luba,
	Beata.Michalska, mingo, juri.lelli, rostedt, segall, mgorman,
	bristot, thara.gopinath, amit.kachhap, amitk, rui.zhang,
	daniel.lezcano

Energy Aware Scheduling (EAS) needs to be able to predict the frequency
requests made by the SchedUtil governor to properly estimate energy used
in the future. It has to take into account CPUs utilization and forecast
Performance Domain (PD) frequency. There is a corner case when the max
allowed frequency might be reduced due to thermal. SchedUtil is aware of
that reduced frequency, so it should be taken into account also in EAS
estimations.

SchedUtil, as a CPUFreq governor, knows the maximum allowed frequency of
a CPU, thanks to cpufreq_driver_resolve_freq() and internal clamping
to 'policy::max'. SchedUtil is responsible to respect that upper limit
while setting the frequency through CPUFreq drivers. This effective
frequency is stored internally in 'sugov_policy::next_freq' and EAS has
to predict that value.

In the existing code the raw value of arch_scale_cpu_capacity() is used
for clamping the returned CPU utilization from effective_cpu_util().
This patch fixes issue with too big single CPU utilization, by introducing
clamping to the allowed CPU capacity. The allowed CPU capacity is a CPU
capacity reduced by thermal pressure raw value.

Thanks to knowledge about allowed CPU capacity, we don't get too big value
for a single CPU utilization, which is then added to the util sum. The
util sum is used as a source of information for estimating whole PD energy.
To avoid wrong energy estimation in EAS (due to capped frequency), make
sure that the calculation of util sum is aware of allowed CPU capacity.

This thermal pressure might be visible in scenarios where the CPUs are not
heavily loaded, but some other component (like GPU) drastically reduced
available power budget and increased the SoC temperature. Thus, we still
use EAS for task placement and CPUs are not over-utilized.

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 kernel/sched/fair.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 161b92aa1c79..3634e077051d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6527,8 +6527,11 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 	struct cpumask *pd_mask = perf_domain_span(pd);
 	unsigned long cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
 	unsigned long max_util = 0, sum_util = 0;
+	unsigned long _cpu_cap = cpu_cap;
 	int cpu;
 
+	_cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask));
+
 	/*
 	 * The capacity state of CPUs of the current rd can be driven by CPUs
 	 * of another rd if they belong to the same pd. So, account for the
@@ -6564,8 +6567,10 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 		 * is already enough to scale the EM reported power
 		 * consumption at the (eventually clamped) cpu_capacity.
 		 */
-		sum_util += effective_cpu_util(cpu, util_running, cpu_cap,
-					       ENERGY_UTIL, NULL);
+		cpu_util = effective_cpu_util(cpu, util_running, cpu_cap,
+					      ENERGY_UTIL, NULL);
+
+		sum_util += min(cpu_util, _cpu_cap);
 
 		/*
 		 * Performance domain frequency: utilization clamping
@@ -6576,7 +6581,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 		 */
 		cpu_util = effective_cpu_util(cpu, util_freq, cpu_cap,
 					      FREQUENCY_UTIL, tsk);
-		max_util = max(max_util, cpu_util);
+		max_util = max(max_util, min(cpu_util, _cpu_cap));
 	}
 
 	return em_cpu_energy(pd->em_pd, max_util, sum_util);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v4 3/3] sched/cpufreq: Consider reduced CPU capacity in energy calculation
  2021-06-14 18:58 [PATCH v4 0/3] Add allowed CPU capacity knowledge to EAS Lukasz Luba
  2021-06-14 19:10 ` [PATCH 1/3] thermal: cpufreq_cooling: Update also offline CPUs per-cpu thermal_pressure Lukasz Luba
  2021-06-14 19:11 ` [PATCH v4 2/3] sched/fair: Take thermal pressure into account while estimating energy Lukasz Luba
@ 2021-06-14 19:12 ` Lukasz Luba
  2021-06-18  8:46   ` [tip: sched/core] " tip-bot2 for Lukasz Luba
  2021-06-16 13:33 ` [PATCH v4 0/3] Add allowed CPU capacity knowledge to EAS Lukasz Luba
  3 siblings, 1 reply; 14+ messages in thread
From: Lukasz Luba @ 2021-06-14 19:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-pm, peterz, rjw, viresh.kumar, vincent.guittot, qperret,
	dietmar.eggemann, vincent.donnefort, lukasz.luba,
	Beata.Michalska, mingo, juri.lelli, rostedt, segall, mgorman,
	bristot, thara.gopinath, amit.kachhap, amitk, rui.zhang,
	daniel.lezcano

Energy Aware Scheduling (EAS) needs to predict the decisions made by
SchedUtil. The map_util_freq() exists to do that.

There are corner cases where the max allowed frequency might be reduced
(due to thermal). SchedUtil as a CPUFreq governor, is aware of that
but EAS is not. This patch aims to address it.

SchedUtil stores the maximum allowed frequency in
'sugov_policy::next_freq' field. EAS has to predict that value, which is
the real used frequency. That value is made after a call to
cpufreq_driver_resolve_freq() which clamps to the CPUFreq policy limits.
In the existing code EAS is not able to predict that real frequency.
This leads to energy estimation errors.

To avoid wrong energy estimation in EAS (due to frequency miss prediction)
make sure that the step which calculates Performance Domain frequency,
is also aware of the allowed CPU capacity.

Furthermore, modify map_util_freq() to not extend the frequency value.
Instead, use map_util_perf() to extend the util value in both places:
SchedUtil and EAS, but for EAS clamp it to max allowed CPU capacity.
In the end, we achieve the same desirable behavior for both subsystems
and alignment in regards to the real CPU frequency.

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> (For the schedutil part)
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 include/linux/energy_model.h     | 16 +++++++++++++---
 include/linux/sched/cpufreq.h    |  2 +-
 kernel/sched/cpufreq_schedutil.c |  1 +
 kernel/sched/fair.c              |  2 +-
 4 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 757fc60658fa..3f221dbf5f95 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -91,6 +91,8 @@ void em_dev_unregister_perf_domain(struct device *dev);
  * @pd		: performance domain for which energy has to be estimated
  * @max_util	: highest utilization among CPUs of the domain
  * @sum_util	: sum of the utilization of all CPUs in the domain
+ * @allowed_cpu_cap	: maximum allowed CPU capacity for the @pd, which
+			  might reflect reduced frequency (due to thermal)
  *
  * This function must be used only for CPU devices. There is no validation,
  * i.e. if the EM is a CPU type and has cpumask allocated. It is called from
@@ -100,7 +102,8 @@ void em_dev_unregister_perf_domain(struct device *dev);
  * a capacity state satisfying the max utilization of the domain.
  */
 static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
-				unsigned long max_util, unsigned long sum_util)
+				unsigned long max_util, unsigned long sum_util,
+				unsigned long allowed_cpu_cap)
 {
 	unsigned long freq, scale_cpu;
 	struct em_perf_state *ps;
@@ -112,11 +115,17 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	/*
 	 * In order to predict the performance state, map the utilization of
 	 * the most utilized CPU of the performance domain to a requested
-	 * frequency, like schedutil.
+	 * frequency, like schedutil. Take also into account that the real
+	 * frequency might be set lower (due to thermal capping). Thus, clamp
+	 * max utilization to the allowed CPU capacity before calculating
+	 * effective frequency.
 	 */
 	cpu = cpumask_first(to_cpumask(pd->cpus));
 	scale_cpu = arch_scale_cpu_capacity(cpu);
 	ps = &pd->table[pd->nr_perf_states - 1];
+
+	max_util = map_util_perf(max_util);
+	max_util = min(max_util, allowed_cpu_cap);
 	freq = map_util_freq(max_util, ps->frequency, scale_cpu);
 
 	/*
@@ -209,7 +218,8 @@ static inline struct em_perf_domain *em_pd_get(struct device *dev)
 	return NULL;
 }
 static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
-			unsigned long max_util, unsigned long sum_util)
+			unsigned long max_util, unsigned long sum_util,
+			unsigned long allowed_cpu_cap)
 {
 	return 0;
 }
diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index 6205578ab6ee..bdd31ab93bc5 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -26,7 +26,7 @@ bool cpufreq_this_cpu_can_update(struct cpufreq_policy *policy);
 static inline unsigned long map_util_freq(unsigned long util,
 					unsigned long freq, unsigned long cap)
 {
-	return (freq + (freq >> 2)) * util / cap;
+	return freq * util / cap;
 }
 
 static inline unsigned long map_util_perf(unsigned long util)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 4f09afd2f321..57124614363d 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -151,6 +151,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 	unsigned int freq = arch_scale_freq_invariant() ?
 				policy->cpuinfo.max_freq : policy->cur;
 
+	util = map_util_perf(util);
 	freq = map_util_freq(util, freq, max);
 
 	if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3634e077051d..75e082964250 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6584,7 +6584,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 		max_util = max(max_util, min(cpu_util, _cpu_cap));
 	}
 
-	return em_cpu_energy(pd->em_pd, max_util, sum_util);
+	return em_cpu_energy(pd->em_pd, max_util, sum_util, _cpu_cap);
 }
 
 /*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 2/3] sched/fair: Take thermal pressure into account while estimating energy
  2021-06-14 19:11 ` [PATCH v4 2/3] sched/fair: Take thermal pressure into account while estimating energy Lukasz Luba
@ 2021-06-15 15:31   ` Dietmar Eggemann
  2021-06-15 16:09     ` Lukasz Luba
  2021-06-18  8:46   ` [tip: sched/core] " tip-bot2 for Lukasz Luba
  1 sibling, 1 reply; 14+ messages in thread
From: Dietmar Eggemann @ 2021-06-15 15:31 UTC (permalink / raw)
  To: Lukasz Luba, linux-kernel
  Cc: linux-pm, peterz, rjw, viresh.kumar, vincent.guittot, qperret,
	vincent.donnefort, Beata.Michalska, mingo, juri.lelli, rostedt,
	segall, mgorman, bristot, thara.gopinath, amit.kachhap, amitk,
	rui.zhang, daniel.lezcano

On 14/06/2021 21:11, Lukasz Luba wrote:
> Energy Aware Scheduling (EAS) needs to be able to predict the frequency
> requests made by the SchedUtil governor to properly estimate energy used
> in the future. It has to take into account CPUs utilization and forecast
> Performance Domain (PD) frequency. There is a corner case when the max
> allowed frequency might be reduced due to thermal. SchedUtil is aware of
> that reduced frequency, so it should be taken into account also in EAS
> estimations.

It's important to highlight that this will only fix this issue between
schedutil and EAS when it's due to `thermal pressure` (today only via
CPU cooling). There are other places which could restrict policy->max
via freq_qos_update_request() and EAS will be unaware of it.

> SchedUtil, as a CPUFreq governor, knows the maximum allowed frequency of
> a CPU, thanks to cpufreq_driver_resolve_freq() and internal clamping
> to 'policy::max'. SchedUtil is responsible to respect that upper limit
> while setting the frequency through CPUFreq drivers. This effective
> frequency is stored internally in 'sugov_policy::next_freq' and EAS has
> to predict that value.
> 
> In the existing code the raw value of arch_scale_cpu_capacity() is used
> for clamping the returned CPU utilization from effective_cpu_util().
> This patch fixes issue with too big single CPU utilization, by introducing
> clamping to the allowed CPU capacity. The allowed CPU capacity is a CPU
> capacity reduced by thermal pressure raw value.
> 
> Thanks to knowledge about allowed CPU capacity, we don't get too big value
> for a single CPU utilization, which is then added to the util sum. The
> util sum is used as a source of information for estimating whole PD energy.
> To avoid wrong energy estimation in EAS (due to capped frequency), make
> sure that the calculation of util sum is aware of allowed CPU capacity.
> 
> This thermal pressure might be visible in scenarios where the CPUs are not
> heavily loaded, but some other component (like GPU) drastically reduced
> available power budget and increased the SoC temperature. Thus, we still
> use EAS for task placement and CPUs are not over-utilized.

IMHO, this means that this is catered for the IPA governor then. I'm not
sure if this would be beneficial when another thermal governor is used?

The mechanical side of the code would allow for such benefits, I just
don't know if their CPU cooling device + thermal zone setups would cater
for this?

> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  kernel/sched/fair.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 161b92aa1c79..3634e077051d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6527,8 +6527,11 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>  	struct cpumask *pd_mask = perf_domain_span(pd);
>  	unsigned long cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
>  	unsigned long max_util = 0, sum_util = 0;
> +	unsigned long _cpu_cap = cpu_cap;
>  	int cpu;
>  
> +	_cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask));
> +

Maybe shorter?

        struct cpumask *pd_mask = perf_domain_span(pd);
-       unsigned long cpu_cap =
arch_scale_cpu_capacity(cpumask_first(pd_mask));
+       int cpu = cpumask_first(pd_mask);
+       unsigned long cpu_cap = arch_scale_cpu_capacity(cpu);
+       unsigned long _cpu_cap = cpu_cap - arch_scale_thermal_pressure(cpu);
        unsigned long max_util = 0, sum_util = 0;
-       unsigned long _cpu_cap = cpu_cap;
-       int cpu;
-
-       _cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask));

>  	/*
>  	 * The capacity state of CPUs of the current rd can be driven by CPUs
>  	 * of another rd if they belong to the same pd. So, account for the
> @@ -6564,8 +6567,10 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>  		 * is already enough to scale the EM reported power
>  		 * consumption at the (eventually clamped) cpu_capacity.
>  		 */
> -		sum_util += effective_cpu_util(cpu, util_running, cpu_cap,
> -					       ENERGY_UTIL, NULL);
> +		cpu_util = effective_cpu_util(cpu, util_running, cpu_cap,
> +					      ENERGY_UTIL, NULL);
> +
> +		sum_util += min(cpu_util, _cpu_cap);
>  
>  		/*
>  		 * Performance domain frequency: utilization clamping
> @@ -6576,7 +6581,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>  		 */
>  		cpu_util = effective_cpu_util(cpu, util_freq, cpu_cap,
>  					      FREQUENCY_UTIL, tsk);
> -		max_util = max(max_util, cpu_util);
> +		max_util = max(max_util, min(cpu_util, _cpu_cap));
>  	}
>  
>  	return em_cpu_energy(pd->em_pd, max_util, sum_util);

There is IPA specific code in cpufreq_set_cur_state() ->
get_state_freq() which accesses the EM:

    ...
    return cpufreq_cdev->em->table[idx].frequency;
    ...

Has it been discussed that the `per-PD max (allowed) CPU capacity` (1)
could be stored in the EM from there so that code like the EAS wakeup
code (compute_energy()) could retrieve this information from the EM?
And there wouldn't be any need to pass (1) into the EM (like now via
em_cpu_energy()).
This would be signalling within the EM compared to external signalling
via `CPU cooling -> thermal pressure <- EAS wakeup -> EM`.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 2/3] sched/fair: Take thermal pressure into account while estimating energy
  2021-06-15 15:31   ` Dietmar Eggemann
@ 2021-06-15 16:09     ` Lukasz Luba
  2021-06-16 17:24       ` Dietmar Eggemann
  0 siblings, 1 reply; 14+ messages in thread
From: Lukasz Luba @ 2021-06-15 16:09 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: linux-kernel, linux-pm, peterz, rjw, viresh.kumar,
	vincent.guittot, qperret, vincent.donnefort, Beata.Michalska,
	mingo, juri.lelli, rostedt, segall, mgorman, bristot,
	thara.gopinath, amit.kachhap, amitk, rui.zhang, daniel.lezcano



On 6/15/21 4:31 PM, Dietmar Eggemann wrote:
> On 14/06/2021 21:11, Lukasz Luba wrote:
>> Energy Aware Scheduling (EAS) needs to be able to predict the frequency
>> requests made by the SchedUtil governor to properly estimate energy used
>> in the future. It has to take into account CPUs utilization and forecast
>> Performance Domain (PD) frequency. There is a corner case when the max
>> allowed frequency might be reduced due to thermal. SchedUtil is aware of
>> that reduced frequency, so it should be taken into account also in EAS
>> estimations.
> 
> It's important to highlight that this will only fix this issue between
> schedutil and EAS when it's due to `thermal pressure` (today only via
> CPU cooling). There are other places which could restrict policy->max
> via freq_qos_update_request() and EAS will be unaware of it.

True, but for this I have some other plans.

> 
>> SchedUtil, as a CPUFreq governor, knows the maximum allowed frequency of
>> a CPU, thanks to cpufreq_driver_resolve_freq() and internal clamping
>> to 'policy::max'. SchedUtil is responsible to respect that upper limit
>> while setting the frequency through CPUFreq drivers. This effective
>> frequency is stored internally in 'sugov_policy::next_freq' and EAS has
>> to predict that value.
>>
>> In the existing code the raw value of arch_scale_cpu_capacity() is used
>> for clamping the returned CPU utilization from effective_cpu_util().
>> This patch fixes issue with too big single CPU utilization, by introducing
>> clamping to the allowed CPU capacity. The allowed CPU capacity is a CPU
>> capacity reduced by thermal pressure raw value.
>>
>> Thanks to knowledge about allowed CPU capacity, we don't get too big value
>> for a single CPU utilization, which is then added to the util sum. The
>> util sum is used as a source of information for estimating whole PD energy.
>> To avoid wrong energy estimation in EAS (due to capped frequency), make
>> sure that the calculation of util sum is aware of allowed CPU capacity.
>>
>> This thermal pressure might be visible in scenarios where the CPUs are not
>> heavily loaded, but some other component (like GPU) drastically reduced
>> available power budget and increased the SoC temperature. Thus, we still
>> use EAS for task placement and CPUs are not over-utilized.
> 
> IMHO, this means that this is catered for the IPA governor then. I'm not
> sure if this would be beneficial when another thermal governor is used?

Yes, it will be, the cpufreq_set_cur_state() is called by
thermal exported function:
thermal_cdev_update()
   __thermal_cdev_update()
     thermal_cdev_set_cur_state()
       cdev->ops->set_cur_state(cdev, target)

So it can be called not only by IPA. All governors call it, because
that's the default mechanism.

> 
> The mechanical side of the code would allow for such benefits, I just
> don't know if their CPU cooling device + thermal zone setups would cater
> for this?

Yes, it's possible. Even for custom vendor governors (modified clones
of IPA)

> 
>> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
>>   kernel/sched/fair.c | 11 ++++++++---
>>   1 file changed, 8 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 161b92aa1c79..3634e077051d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6527,8 +6527,11 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>>   	struct cpumask *pd_mask = perf_domain_span(pd);
>>   	unsigned long cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
>>   	unsigned long max_util = 0, sum_util = 0;
>> +	unsigned long _cpu_cap = cpu_cap;
>>   	int cpu;
>>   
>> +	_cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask));
>> +
> 
> Maybe shorter?
> 
>          struct cpumask *pd_mask = perf_domain_span(pd);
> -       unsigned long cpu_cap =
> arch_scale_cpu_capacity(cpumask_first(pd_mask));
> +       int cpu = cpumask_first(pd_mask);
> +       unsigned long cpu_cap = arch_scale_cpu_capacity(cpu);
> +       unsigned long _cpu_cap = cpu_cap - arch_scale_thermal_pressure(cpu);
>          unsigned long max_util = 0, sum_util = 0;
> -       unsigned long _cpu_cap = cpu_cap;
> -       int cpu;
> -
> -       _cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask));

Could be, but still, the definitions should be sorted from longest on
top, to shortest at the bottom. I wanted to avoid modifying too many
lines with this simple patch.

> 
>>   	/*
>>   	 * The capacity state of CPUs of the current rd can be driven by CPUs
>>   	 * of another rd if they belong to the same pd. So, account for the
>> @@ -6564,8 +6567,10 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>>   		 * is already enough to scale the EM reported power
>>   		 * consumption at the (eventually clamped) cpu_capacity.
>>   		 */
>> -		sum_util += effective_cpu_util(cpu, util_running, cpu_cap,
>> -					       ENERGY_UTIL, NULL);
>> +		cpu_util = effective_cpu_util(cpu, util_running, cpu_cap,
>> +					      ENERGY_UTIL, NULL);
>> +
>> +		sum_util += min(cpu_util, _cpu_cap);
>>   
>>   		/*
>>   		 * Performance domain frequency: utilization clamping
>> @@ -6576,7 +6581,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>>   		 */
>>   		cpu_util = effective_cpu_util(cpu, util_freq, cpu_cap,
>>   					      FREQUENCY_UTIL, tsk);
>> -		max_util = max(max_util, cpu_util);
>> +		max_util = max(max_util, min(cpu_util, _cpu_cap));
>>   	}
>>   
>>   	return em_cpu_energy(pd->em_pd, max_util, sum_util);
> 
> There is IPA specific code in cpufreq_set_cur_state() ->
> get_state_freq() which accesses the EM:
> 
>      ...
>      return cpufreq_cdev->em->table[idx].frequency;
>      ...
> 
> Has it been discussed that the `per-PD max (allowed) CPU capacity` (1)
> could be stored in the EM from there so that code like the EAS wakeup
> code (compute_energy()) could retrieve this information from the EM?

No, we haven't think about this approach in these patch sets.
The EM structure given to the cpufreq_cooling device and stored in:
cpufreq_cdev->em should not be modified. There are a few places which
receive the EM, but they all should not touch it. For those clients
it's a read-only data structure.

> And there wouldn't be any need to pass (1) into the EM (like now via
> em_cpu_energy()).
> This would be signalling within the EM compared to external signalling
> via `CPU cooling -> thermal pressure <- EAS wakeup -> EM`.
> 

I see what you mean, but this might cause some issues in the design
(per-cpu scmi cpu perf control). Let's use this EM pointer gently ;)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 0/3] Add allowed CPU capacity knowledge to EAS
  2021-06-14 18:58 [PATCH v4 0/3] Add allowed CPU capacity knowledge to EAS Lukasz Luba
                   ` (2 preceding siblings ...)
  2021-06-14 19:12 ` [PATCH v4 3/3] sched/cpufreq: Consider reduced CPU capacity in energy calculation Lukasz Luba
@ 2021-06-16 13:33 ` Lukasz Luba
  3 siblings, 0 replies; 14+ messages in thread
From: Lukasz Luba @ 2021-06-16 13:33 UTC (permalink / raw)
  To: peterz
  Cc: linux-kernel, linux-pm, rjw, viresh.kumar, vincent.guittot,
	qperret, dietmar.eggemann, vincent.donnefort, Beata.Michalska,
	mingo, juri.lelli, rostedt, segall, mgorman, bristot,
	thara.gopinath, amit.kachhap, amitk, rui.zhang, daniel.lezcano

Hi Peter,


On 6/14/21 7:58 PM, Lukasz Luba wrote:
> Hi all,
> 
> The patch set v4 aims to add knowledge about reduced CPU capacity
> into the Energy Model (EM) and Energy Aware Scheduler (EAS). Currently the
> issue is that SchedUtil CPU frequency and EM frequency are not aligned,
> when there is a CPU thermal capping. This causes an estimation error.
> This patch set provides the information about allowed CPU capacity
> into the EM (thanks to thermal pressure information). This improves the
> energy estimation. More info about this mechanism can be found in the
> patches description.
> 
> Changelog:
> v4:
> - removed local variable and improved description in patch 2/3
> - added Reviewed-by from Vincent for patch 2/3
> - added Acked-by from Viresh for patch 1/3
> v3 [3]:
> - switched to 'raw' per-cpu thermal pressure instead of thermal pressure
>    geometric series signal, since it more suited for purpose of
>    this use case: predicting SchedUtil frequency (Vincent, Dietmar)
> - added more comment in the patch 2/3 header for use case when thermal
>    capping might be applied even the CPUs are not over-utilized
>    (Dietmar)
> - added ACK tag from Rafael for SchedUtil part
> - added a fix patch for offline CPUs in cpufreq_cooling and per-cpu
>    thermal_pressure missing update
> v2 [2]:
> - clamp the returned value from effective_cpu_util() and avoid irq
>    util scaling issues (Quentin)
> v1 is available at [1]
> 
> Regards,
> Lukasz
> 
> [1] https://lore.kernel.org/linux-pm/20210602135609.10867-1-lukasz.luba@arm.com/
> [2] https://lore.kernel.org/lkml/20210604080954.13915-1-lukasz.luba@arm.com/
> [3] https://lore.kernel.org/lkml/20210610150324.22919-1-lukasz.luba@arm.com/
> 
> Lukasz Luba (3):
>    thermal: cpufreq_cooling: Update also offline CPUs per-cpu
>      thermal_pressure
>    sched/fair: Take thermal pressure into account while estimating energy
>    sched/cpufreq: Consider reduced CPU capacity in energy calculation
> 
>   drivers/thermal/cpufreq_cooling.c |  2 +-
>   include/linux/energy_model.h      | 16 +++++++++++++---
>   include/linux/sched/cpufreq.h     |  2 +-
>   kernel/sched/cpufreq_schedutil.c  |  1 +
>   kernel/sched/fair.c               | 13 +++++++++----
>   5 files changed, 25 insertions(+), 9 deletions(-)
> 

Could you take these 3 patches via your tree, please?
I'm asking you because the fair.c has most changes
(apart from energy_model.h) and the patches got
ACKs from Rafael and Viresh. The patch which touches
fair.c got Reviewed-by Vincent Guittot. I have address
all the comment, thus, IMHO it could fly now.

Please let me know if you like me to re-base on top
of some of your branches.

Regards,
Lukasz

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 2/3] sched/fair: Take thermal pressure into account while estimating energy
  2021-06-15 16:09     ` Lukasz Luba
@ 2021-06-16 17:24       ` Dietmar Eggemann
  2021-06-16 18:31         ` Lukasz Luba
  2021-06-16 19:25         ` Vincent Guittot
  0 siblings, 2 replies; 14+ messages in thread
From: Dietmar Eggemann @ 2021-06-16 17:24 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, peterz, rjw, viresh.kumar,
	vincent.guittot, qperret, vincent.donnefort, Beata.Michalska,
	mingo, juri.lelli, rostedt, segall, mgorman, bristot,
	thara.gopinath, amit.kachhap, amitk, rui.zhang, daniel.lezcano

On 15/06/2021 18:09, Lukasz Luba wrote:
> 
> On 6/15/21 4:31 PM, Dietmar Eggemann wrote:
>> On 14/06/2021 21:11, Lukasz Luba wrote:

[...]

>> It's important to highlight that this will only fix this issue between
>> schedutil and EAS when it's due to `thermal pressure` (today only via
>> CPU cooling). There are other places which could restrict policy->max
>> via freq_qos_update_request() and EAS will be unaware of it.
> 
> True, but for this I have some other plans.

As long as people are aware of the fact that this was developed to be
beneficial for `EAS - IPA` integration, I'm fine with this.

[...]

>> IMHO, this means that this is catered for the IPA governor then. I'm not
>> sure if this would be beneficial when another thermal governor is used?
> 
> Yes, it will be, the cpufreq_set_cur_state() is called by
> thermal exported function:
> thermal_cdev_update()
>   __thermal_cdev_update()
>     thermal_cdev_set_cur_state()
>       cdev->ops->set_cur_state(cdev, target)
> 
> So it can be called not only by IPA. All governors call it, because
> that's the default mechanism.

True, but I'm still not convinced that it is useful outside `EAS - IPA`.

>> The mechanical side of the code would allow for such benefits, I just
>> don't know if their CPU cooling device + thermal zone setups would cater
>> for this?
> 
> Yes, it's possible. Even for custom vendor governors (modified clones
> of IPA)

Let's stick to mainline here ;-) It's complicated enough ...

[...]

>> Maybe shorter?
>>
>>          struct cpumask *pd_mask = perf_domain_span(pd);
>> -       unsigned long cpu_cap =
>> arch_scale_cpu_capacity(cpumask_first(pd_mask));
>> +       int cpu = cpumask_first(pd_mask);
>> +       unsigned long cpu_cap = arch_scale_cpu_capacity(cpu);
>> +       unsigned long _cpu_cap = cpu_cap -
>> arch_scale_thermal_pressure(cpu);
>>          unsigned long max_util = 0, sum_util = 0;
>> -       unsigned long _cpu_cap = cpu_cap;
>> -       int cpu;
>> -
>> -       _cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask));
> 
> Could be, but still, the definitions should be sorted from longest on
> top, to shortest at the bottom. I wanted to avoid modifying too many
> lines with this simple patch.

Only if there are no dependencies, but here we have already `cpu_cap ->
pd_mask`. OK, not a big deal.

[...]

>> There is IPA specific code in cpufreq_set_cur_state() ->
>> get_state_freq() which accesses the EM:
>>
>>      ...
>>      return cpufreq_cdev->em->table[idx].frequency;
>>      ...
>>
>> Has it been discussed that the `per-PD max (allowed) CPU capacity` (1)
>> could be stored in the EM from there so that code like the EAS wakeup
>> code (compute_energy()) could retrieve this information from the EM?
> 
> No, we haven't think about this approach in these patch sets.
> The EM structure given to the cpufreq_cooling device and stored in:
> cpufreq_cdev->em should not be modified. There are a few places which
> receive the EM, but they all should not touch it. For those clients
> it's a read-only data structure.
> 
>> And there wouldn't be any need to pass (1) into the EM (like now via
>> em_cpu_energy()).
>> This would be signalling within the EM compared to external signalling
>> via `CPU cooling -> thermal pressure <- EAS wakeup -> EM`.
> 
> I see what you mean, but this might cause some issues in the design
> (per-cpu scmi cpu perf control). Let's use this EM pointer gently ;)

OK, with the requirement that clients see the EM as ro:

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 2/3] sched/fair: Take thermal pressure into account while estimating energy
  2021-06-16 17:24       ` Dietmar Eggemann
@ 2021-06-16 18:31         ` Lukasz Luba
  2021-06-16 19:25         ` Vincent Guittot
  1 sibling, 0 replies; 14+ messages in thread
From: Lukasz Luba @ 2021-06-16 18:31 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: linux-kernel, linux-pm, peterz, rjw, viresh.kumar,
	vincent.guittot, qperret, vincent.donnefort, Beata.Michalska,
	mingo, juri.lelli, rostedt, segall, mgorman, bristot,
	thara.gopinath, amit.kachhap, amitk, rui.zhang, daniel.lezcano



On 6/16/21 6:24 PM, Dietmar Eggemann wrote:
> On 15/06/2021 18:09, Lukasz Luba wrote:
>>
>> On 6/15/21 4:31 PM, Dietmar Eggemann wrote:
>>> On 14/06/2021 21:11, Lukasz Luba wrote:
> 
> [...]
> 
>>> It's important to highlight that this will only fix this issue between
>>> schedutil and EAS when it's due to `thermal pressure` (today only via
>>> CPU cooling). There are other places which could restrict policy->max
>>> via freq_qos_update_request() and EAS will be unaware of it.
>>
>> True, but for this I have some other plans.
> 
> As long as people are aware of the fact that this was developed to be
> beneficial for `EAS - IPA` integration, I'm fine with this.

Good. I had in mind that I will have to do some re-work on this
thermal pressure code in the cpufreq cooling, to satisfy our roadmap
goals...

> 
> [...]
> 
>>> IMHO, this means that this is catered for the IPA governor then. I'm not
>>> sure if this would be beneficial when another thermal governor is used?
>>
>> Yes, it will be, the cpufreq_set_cur_state() is called by
>> thermal exported function:
>> thermal_cdev_update()
>>    __thermal_cdev_update()
>>      thermal_cdev_set_cur_state()
>>        cdev->ops->set_cur_state(cdev, target)
>>
>> So it can be called not only by IPA. All governors call it, because
>> that's the default mechanism.
> 
> True, but I'm still not convinced that it is useful outside `EAS - IPA`.

It is. So in mainline thermal there is another governor: fair_share [1],
which uses 'weights' to split the cooling effort across cooling devices
in the thermal zone. That governor would manage CPUs and GPU and
set throttling like IPA.

> 
>>> The mechanical side of the code would allow for such benefits, I just
>>> don't know if their CPU cooling device + thermal zone setups would cater
>>> for this?
>>
>> Yes, it's possible. Even for custom vendor governors (modified clones
>> of IPA)
> 
> Let's stick to mainline here ;-) It's complicated enough ...

I agree, so there isn't only IPA in mainline.

> 
> [...]
> 
>>> Maybe shorter?
>>>
>>>           struct cpumask *pd_mask = perf_domain_span(pd);
>>> -       unsigned long cpu_cap =
>>> arch_scale_cpu_capacity(cpumask_first(pd_mask));
>>> +       int cpu = cpumask_first(pd_mask);
>>> +       unsigned long cpu_cap = arch_scale_cpu_capacity(cpu);
>>> +       unsigned long _cpu_cap = cpu_cap -
>>> arch_scale_thermal_pressure(cpu);
>>>           unsigned long max_util = 0, sum_util = 0;
>>> -       unsigned long _cpu_cap = cpu_cap;
>>> -       int cpu;
>>> -
>>> -       _cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask));
>>
>> Could be, but still, the definitions should be sorted from longest on
>> top, to shortest at the bottom. I wanted to avoid modifying too many
>> lines with this simple patch.
> 
> Only if there are no dependencies, but here we have already `cpu_cap ->
> pd_mask`. OK, not a big deal.

True, those dependencies are tricky to sort them properly, so I coded
it this way.

[snip]

>> I see what you mean, but this might cause some issues in the design
>> (per-cpu scmi cpu perf control). Let's use this EM pointer gently ;)
> 
> OK, with the requirement that clients see the EM as ro:
> 
> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> 

Thank you Dietmar for the review!

Regards,
Lukasz

[1] 
https://elixir.bootlin.com/linux/v5.13-rc6/source/drivers/thermal/gov_fair_share.c#L111

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 2/3] sched/fair: Take thermal pressure into account while estimating energy
  2021-06-16 17:24       ` Dietmar Eggemann
  2021-06-16 18:31         ` Lukasz Luba
@ 2021-06-16 19:25         ` Vincent Guittot
  2021-06-16 20:22           ` Lukasz Luba
  1 sibling, 1 reply; 14+ messages in thread
From: Vincent Guittot @ 2021-06-16 19:25 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Lukasz Luba, linux-kernel, open list:THERMAL, Peter Zijlstra,
	Rafael J. Wysocki, Viresh Kumar, Quentin Perret,
	Vincent Donnefort, Beata Michalska, Ingo Molnar, Juri Lelli,
	Steven Rostedt, segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thara Gopinath, Amit Kachhap, Amit Kucheria, Zhang Rui,
	Daniel Lezcano

On Wed, 16 Jun 2021 at 19:24, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 15/06/2021 18:09, Lukasz Luba wrote:
> >
> > On 6/15/21 4:31 PM, Dietmar Eggemann wrote:
> >> On 14/06/2021 21:11, Lukasz Luba wrote:
>
> [...]
>
> >> It's important to highlight that this will only fix this issue between
> >> schedutil and EAS when it's due to `thermal pressure` (today only via
> >> CPU cooling). There are other places which could restrict policy->max
> >> via freq_qos_update_request() and EAS will be unaware of it.
> >
> > True, but for this I have some other plans.
>
> As long as people are aware of the fact that this was developed to be
> beneficial for `EAS - IPA` integration, I'm fine with this.

I don't think it's only for EAS - IPA. Thermal_pressure can be used by
HW throttling like here:
https://lkml.org/lkml/2021/6/8/1791

EAS is involved but not IPA

>
> [...]
>
> >> IMHO, this means that this is catered for the IPA governor then. I'm not
> >> sure if this would be beneficial when another thermal governor is used?
> >
> > Yes, it will be, the cpufreq_set_cur_state() is called by
> > thermal exported function:
> > thermal_cdev_update()
> >   __thermal_cdev_update()
> >     thermal_cdev_set_cur_state()
> >       cdev->ops->set_cur_state(cdev, target)
> >
> > So it can be called not only by IPA. All governors call it, because
> > that's the default mechanism.
>
> True, but I'm still not convinced that it is useful outside `EAS - IPA`.
>
> >> The mechanical side of the code would allow for such benefits, I just
> >> don't know if their CPU cooling device + thermal zone setups would cater
> >> for this?
> >
> > Yes, it's possible. Even for custom vendor governors (modified clones
> > of IPA)
>
> Let's stick to mainline here ;-) It's complicated enough ...
>
> [...]
>
> >> Maybe shorter?
> >>
> >>          struct cpumask *pd_mask = perf_domain_span(pd);
> >> -       unsigned long cpu_cap =
> >> arch_scale_cpu_capacity(cpumask_first(pd_mask));
> >> +       int cpu = cpumask_first(pd_mask);
> >> +       unsigned long cpu_cap = arch_scale_cpu_capacity(cpu);
> >> +       unsigned long _cpu_cap = cpu_cap -
> >> arch_scale_thermal_pressure(cpu);
> >>          unsigned long max_util = 0, sum_util = 0;
> >> -       unsigned long _cpu_cap = cpu_cap;
> >> -       int cpu;
> >> -
> >> -       _cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask));
> >
> > Could be, but still, the definitions should be sorted from longest on
> > top, to shortest at the bottom. I wanted to avoid modifying too many
> > lines with this simple patch.
>
> Only if there are no dependencies, but here we have already `cpu_cap ->
> pd_mask`. OK, not a big deal.
>
> [...]
>
> >> There is IPA specific code in cpufreq_set_cur_state() ->
> >> get_state_freq() which accesses the EM:
> >>
> >>      ...
> >>      return cpufreq_cdev->em->table[idx].frequency;
> >>      ...
> >>
> >> Has it been discussed that the `per-PD max (allowed) CPU capacity` (1)
> >> could be stored in the EM from there so that code like the EAS wakeup
> >> code (compute_energy()) could retrieve this information from the EM?
> >
> > No, we haven't think about this approach in these patch sets.
> > The EM structure given to the cpufreq_cooling device and stored in:
> > cpufreq_cdev->em should not be modified. There are a few places which
> > receive the EM, but they all should not touch it. For those clients
> > it's a read-only data structure.
> >
> >> And there wouldn't be any need to pass (1) into the EM (like now via
> >> em_cpu_energy()).
> >> This would be signalling within the EM compared to external signalling
> >> via `CPU cooling -> thermal pressure <- EAS wakeup -> EM`.
> >
> > I see what you mean, but this might cause some issues in the design
> > (per-cpu scmi cpu perf control). Let's use this EM pointer gently ;)
>
> OK, with the requirement that clients see the EM as ro:
>
> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 2/3] sched/fair: Take thermal pressure into account while estimating energy
  2021-06-16 19:25         ` Vincent Guittot
@ 2021-06-16 20:22           ` Lukasz Luba
  0 siblings, 0 replies; 14+ messages in thread
From: Lukasz Luba @ 2021-06-16 20:22 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, linux-kernel, open list:THERMAL,
	Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar, Quentin Perret,
	Vincent Donnefort, Beata Michalska, Ingo Molnar, Juri Lelli,
	Steven Rostedt, segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thara Gopinath, Amit Kachhap, Amit Kucheria, Zhang Rui,
	Daniel Lezcano



On 6/16/21 8:25 PM, Vincent Guittot wrote:
> On Wed, 16 Jun 2021 at 19:24, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 15/06/2021 18:09, Lukasz Luba wrote:
>>>
>>> On 6/15/21 4:31 PM, Dietmar Eggemann wrote:
>>>> On 14/06/2021 21:11, Lukasz Luba wrote:
>>
>> [...]
>>
>>>> It's important to highlight that this will only fix this issue between
>>>> schedutil and EAS when it's due to `thermal pressure` (today only via
>>>> CPU cooling). There are other places which could restrict policy->max
>>>> via freq_qos_update_request() and EAS will be unaware of it.
>>>
>>> True, but for this I have some other plans.
>>
>> As long as people are aware of the fact that this was developed to be
>> beneficial for `EAS - IPA` integration, I'm fine with this.
> 
> I don't think it's only for EAS - IPA. Thermal_pressure can be used by
> HW throttling like here:
> https://lkml.org/lkml/2021/6/8/1791
> 
> EAS is involved but not IPA

Thank you Vincent for pointing to Thara's patches. Indeed, this is a
good example. We will have to provide similar for our SCMI perf
notifications - these are the plans that I've mentioned. In both
new examples, the IPA (or other governors) won't be even involved.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [tip: sched/core] sched/cpufreq: Consider reduced CPU capacity in energy calculation
  2021-06-14 19:12 ` [PATCH v4 3/3] sched/cpufreq: Consider reduced CPU capacity in energy calculation Lukasz Luba
@ 2021-06-18  8:46   ` tip-bot2 for Lukasz Luba
  0 siblings, 0 replies; 14+ messages in thread
From: tip-bot2 for Lukasz Luba @ 2021-06-18  8:46 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Lukasz Luba, Peter Zijlstra (Intel),
	Rafael J. Wysocki, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     8f1b971b4750e83e8fbd2f91a9efd4a38ad0ae51
Gitweb:        https://git.kernel.org/tip/8f1b971b4750e83e8fbd2f91a9efd4a38ad0ae51
Author:        Lukasz Luba <lukasz.luba@arm.com>
AuthorDate:    Mon, 14 Jun 2021 20:12:38 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 17 Jun 2021 14:11:43 +02:00

sched/cpufreq: Consider reduced CPU capacity in energy calculation

Energy Aware Scheduling (EAS) needs to predict the decisions made by
SchedUtil. The map_util_freq() exists to do that.

There are corner cases where the max allowed frequency might be reduced
(due to thermal). SchedUtil as a CPUFreq governor, is aware of that
but EAS is not. This patch aims to address it.

SchedUtil stores the maximum allowed frequency in
'sugov_policy::next_freq' field. EAS has to predict that value, which is
the real used frequency. That value is made after a call to
cpufreq_driver_resolve_freq() which clamps to the CPUFreq policy limits.
In the existing code EAS is not able to predict that real frequency.
This leads to energy estimation errors.

To avoid wrong energy estimation in EAS (due to frequency miss prediction)
make sure that the step which calculates Performance Domain frequency,
is also aware of the allowed CPU capacity.

Furthermore, modify map_util_freq() to not extend the frequency value.
Instead, use map_util_perf() to extend the util value in both places:
SchedUtil and EAS, but for EAS clamp it to max allowed CPU capacity.
In the end, we achieve the same desirable behavior for both subsystems
and alignment in regards to the real CPU frequency.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> (For the schedutil part)
Link: https://lore.kernel.org/r/20210614191238.23224-1-lukasz.luba@arm.com
---
 include/linux/energy_model.h     | 16 +++++++++++++---
 include/linux/sched/cpufreq.h    |  2 +-
 kernel/sched/cpufreq_schedutil.c |  1 +
 kernel/sched/fair.c              |  2 +-
 4 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 757fc60..3f221db 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -91,6 +91,8 @@ void em_dev_unregister_perf_domain(struct device *dev);
  * @pd		: performance domain for which energy has to be estimated
  * @max_util	: highest utilization among CPUs of the domain
  * @sum_util	: sum of the utilization of all CPUs in the domain
+ * @allowed_cpu_cap	: maximum allowed CPU capacity for the @pd, which
+			  might reflect reduced frequency (due to thermal)
  *
  * This function must be used only for CPU devices. There is no validation,
  * i.e. if the EM is a CPU type and has cpumask allocated. It is called from
@@ -100,7 +102,8 @@ void em_dev_unregister_perf_domain(struct device *dev);
  * a capacity state satisfying the max utilization of the domain.
  */
 static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
-				unsigned long max_util, unsigned long sum_util)
+				unsigned long max_util, unsigned long sum_util,
+				unsigned long allowed_cpu_cap)
 {
 	unsigned long freq, scale_cpu;
 	struct em_perf_state *ps;
@@ -112,11 +115,17 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	/*
 	 * In order to predict the performance state, map the utilization of
 	 * the most utilized CPU of the performance domain to a requested
-	 * frequency, like schedutil.
+	 * frequency, like schedutil. Take also into account that the real
+	 * frequency might be set lower (due to thermal capping). Thus, clamp
+	 * max utilization to the allowed CPU capacity before calculating
+	 * effective frequency.
 	 */
 	cpu = cpumask_first(to_cpumask(pd->cpus));
 	scale_cpu = arch_scale_cpu_capacity(cpu);
 	ps = &pd->table[pd->nr_perf_states - 1];
+
+	max_util = map_util_perf(max_util);
+	max_util = min(max_util, allowed_cpu_cap);
 	freq = map_util_freq(max_util, ps->frequency, scale_cpu);
 
 	/*
@@ -209,7 +218,8 @@ static inline struct em_perf_domain *em_pd_get(struct device *dev)
 	return NULL;
 }
 static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
-			unsigned long max_util, unsigned long sum_util)
+			unsigned long max_util, unsigned long sum_util,
+			unsigned long allowed_cpu_cap)
 {
 	return 0;
 }
diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index 6205578..bdd31ab 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -26,7 +26,7 @@ bool cpufreq_this_cpu_can_update(struct cpufreq_policy *policy);
 static inline unsigned long map_util_freq(unsigned long util,
 					unsigned long freq, unsigned long cap)
 {
-	return (freq + (freq >> 2)) * util / cap;
+	return freq * util / cap;
 }
 
 static inline unsigned long map_util_perf(unsigned long util)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 4f09afd..5712461 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -151,6 +151,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 	unsigned int freq = arch_scale_freq_invariant() ?
 				policy->cpuinfo.max_freq : policy->cur;
 
+	util = map_util_perf(util);
 	freq = map_util_freq(util, freq, max);
 
 	if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0d6d190..ed7df1b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6592,7 +6592,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 		max_util = max(max_util, min(cpu_util, _cpu_cap));
 	}
 
-	return em_cpu_energy(pd->em_pd, max_util, sum_util);
+	return em_cpu_energy(pd->em_pd, max_util, sum_util, _cpu_cap);
 }
 
 /*

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [tip: sched/core] sched/fair: Take thermal pressure into account while estimating energy
  2021-06-14 19:11 ` [PATCH v4 2/3] sched/fair: Take thermal pressure into account while estimating energy Lukasz Luba
  2021-06-15 15:31   ` Dietmar Eggemann
@ 2021-06-18  8:46   ` tip-bot2 for Lukasz Luba
  1 sibling, 0 replies; 14+ messages in thread
From: tip-bot2 for Lukasz Luba @ 2021-06-18  8:46 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Lukasz Luba, Peter Zijlstra (Intel),
	Vincent Guittot, Dietmar Eggemann, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     489f16459e0008c7a5c4c5af34bd80898aa82c2d
Gitweb:        https://git.kernel.org/tip/489f16459e0008c7a5c4c5af34bd80898aa82c2d
Author:        Lukasz Luba <lukasz.luba@arm.com>
AuthorDate:    Mon, 14 Jun 2021 20:11:28 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 17 Jun 2021 14:11:43 +02:00

sched/fair: Take thermal pressure into account while estimating energy

Energy Aware Scheduling (EAS) needs to be able to predict the frequency
requests made by the SchedUtil governor to properly estimate energy used
in the future. It has to take into account CPUs utilization and forecast
Performance Domain (PD) frequency. There is a corner case when the max
allowed frequency might be reduced due to thermal. SchedUtil is aware of
that reduced frequency, so it should be taken into account also in EAS
estimations.

SchedUtil, as a CPUFreq governor, knows the maximum allowed frequency of
a CPU, thanks to cpufreq_driver_resolve_freq() and internal clamping
to 'policy::max'. SchedUtil is responsible to respect that upper limit
while setting the frequency through CPUFreq drivers. This effective
frequency is stored internally in 'sugov_policy::next_freq' and EAS has
to predict that value.

In the existing code the raw value of arch_scale_cpu_capacity() is used
for clamping the returned CPU utilization from effective_cpu_util().
This patch fixes issue with too big single CPU utilization, by introducing
clamping to the allowed CPU capacity. The allowed CPU capacity is a CPU
capacity reduced by thermal pressure raw value.

Thanks to knowledge about allowed CPU capacity, we don't get too big value
for a single CPU utilization, which is then added to the util sum. The
util sum is used as a source of information for estimating whole PD energy.
To avoid wrong energy estimation in EAS (due to capped frequency), make
sure that the calculation of util sum is aware of allowed CPU capacity.

This thermal pressure might be visible in scenarios where the CPUs are not
heavily loaded, but some other component (like GPU) drastically reduced
available power budget and increased the SoC temperature. Thus, we still
use EAS for task placement and CPUs are not over-utilized.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20210614191128.22735-1-lukasz.luba@arm.com
---
 kernel/sched/fair.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 06c8ba7..0d6d190 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6535,8 +6535,11 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 	struct cpumask *pd_mask = perf_domain_span(pd);
 	unsigned long cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
 	unsigned long max_util = 0, sum_util = 0;
+	unsigned long _cpu_cap = cpu_cap;
 	int cpu;
 
+	_cpu_cap -= arch_scale_thermal_pressure(cpumask_first(pd_mask));
+
 	/*
 	 * The capacity state of CPUs of the current rd can be driven by CPUs
 	 * of another rd if they belong to the same pd. So, account for the
@@ -6572,8 +6575,10 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 		 * is already enough to scale the EM reported power
 		 * consumption at the (eventually clamped) cpu_capacity.
 		 */
-		sum_util += effective_cpu_util(cpu, util_running, cpu_cap,
-					       ENERGY_UTIL, NULL);
+		cpu_util = effective_cpu_util(cpu, util_running, cpu_cap,
+					      ENERGY_UTIL, NULL);
+
+		sum_util += min(cpu_util, _cpu_cap);
 
 		/*
 		 * Performance domain frequency: utilization clamping
@@ -6584,7 +6589,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 		 */
 		cpu_util = effective_cpu_util(cpu, util_freq, cpu_cap,
 					      FREQUENCY_UTIL, tsk);
-		max_util = max(max_util, cpu_util);
+		max_util = max(max_util, min(cpu_util, _cpu_cap));
 	}
 
 	return em_cpu_energy(pd->em_pd, max_util, sum_util);

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [tip: sched/core] thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure
  2021-06-14 19:10 ` [PATCH 1/3] thermal: cpufreq_cooling: Update also offline CPUs per-cpu thermal_pressure Lukasz Luba
@ 2021-06-18  8:46   ` tip-bot2 for Lukasz Luba
  0 siblings, 0 replies; 14+ messages in thread
From: tip-bot2 for Lukasz Luba @ 2021-06-18  8:46 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Lukasz Luba, Peter Zijlstra (Intel), Viresh Kumar, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     2ad8ccc17d1e4270cf65a3f2a07a7534aa23e3fb
Gitweb:        https://git.kernel.org/tip/2ad8ccc17d1e4270cf65a3f2a07a7534aa23e3fb
Author:        Lukasz Luba <lukasz.luba@arm.com>
AuthorDate:    Mon, 14 Jun 2021 20:10:30 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 17 Jun 2021 14:11:43 +02:00

thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure

The thermal pressure signal gives information to the scheduler about
reduced CPU capacity due to thermal. It is based on a value stored in
a per-cpu 'thermal_pressure' variable. The online CPUs will get the
new value there, while the offline won't. Unfortunately, when the CPU
is back online, the value read from per-cpu variable might be wrong
(stale data).  This might affect the scheduler decisions, since it
sees the CPU capacity differently than what is actually available.

Fix it by making sure that all online+offline CPUs would get the
proper value in their per-cpu variable when thermal framework sets
capping.

Fixes: f12e4f66ab6a3 ("thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping")
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lore.kernel.org/r/20210614191030.22241-1-lukasz.luba@arm.com
---
 drivers/thermal/cpufreq_cooling.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
index eeb4e4b..43b1ae8 100644
--- a/drivers/thermal/cpufreq_cooling.c
+++ b/drivers/thermal/cpufreq_cooling.c
@@ -478,7 +478,7 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
 	ret = freq_qos_update_request(&cpufreq_cdev->qos_req, frequency);
 	if (ret >= 0) {
 		cpufreq_cdev->cpufreq_state = state;
-		cpus = cpufreq_cdev->policy->cpus;
+		cpus = cpufreq_cdev->policy->related_cpus;
 		max_capacity = arch_scale_cpu_capacity(cpumask_first(cpus));
 		capacity = frequency * max_capacity;
 		capacity /= cpufreq_cdev->policy->cpuinfo.max_freq;

^ permalink raw reply related	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-06-18  8:46 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-14 18:58 [PATCH v4 0/3] Add allowed CPU capacity knowledge to EAS Lukasz Luba
2021-06-14 19:10 ` [PATCH 1/3] thermal: cpufreq_cooling: Update also offline CPUs per-cpu thermal_pressure Lukasz Luba
2021-06-18  8:46   ` [tip: sched/core] thermal/cpufreq_cooling: Update " tip-bot2 for Lukasz Luba
2021-06-14 19:11 ` [PATCH v4 2/3] sched/fair: Take thermal pressure into account while estimating energy Lukasz Luba
2021-06-15 15:31   ` Dietmar Eggemann
2021-06-15 16:09     ` Lukasz Luba
2021-06-16 17:24       ` Dietmar Eggemann
2021-06-16 18:31         ` Lukasz Luba
2021-06-16 19:25         ` Vincent Guittot
2021-06-16 20:22           ` Lukasz Luba
2021-06-18  8:46   ` [tip: sched/core] " tip-bot2 for Lukasz Luba
2021-06-14 19:12 ` [PATCH v4 3/3] sched/cpufreq: Consider reduced CPU capacity in energy calculation Lukasz Luba
2021-06-18  8:46   ` [tip: sched/core] " tip-bot2 for Lukasz Luba
2021-06-16 13:33 ` [PATCH v4 0/3] Add allowed CPU capacity knowledge to EAS Lukasz Luba

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.