* [PATCH V2 0/2] cpufreq_cooling: Get effective CPU utilization from scheduler
@ 2020-10-23 10:20 Viresh Kumar
  2020-10-23 10:20 ` [PATCH V2 1/2] sched/core: Rename and move schedutil_cpu_util() to core.c Viresh Kumar
  2020-10-23 10:20 ` [PATCH V2 2/2] thermal: cpufreq_cooling: Reuse sched_cpu_util() Viresh Kumar
  0 siblings, 2 replies; 12+ messages in thread
From: Viresh Kumar @ 2020-10-23 10:20 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Amit Daniel Kachhap, Amit Kucheria, Ben Segall,
	Daniel Bristot de Oliveira, Daniel Lezcano, Dietmar Eggemann,
	Javi Merino, Juri Lelli, Mel Gorman, Rafael J. Wysocki,
	Steven Rostedt, Viresh Kumar, Zhang Rui
  Cc: linux-kernel, Quentin Perret, Lukasz Luba, linux-pm

Hi Peter/Rafael,

I thought about having the scheduler register the fallback thing with
the cpufreq core, but after Peter's comment about the two interactions
with cpufreq I didn't like it much. Either way we end up exposing the
CPU utilization algorithm to the rest of the kernel, through cpufreq or
otherwise.

So I kept it simple for now. The scheduler exposes a single routine,
sched_cpu_util(), along with the related enum, and that is all the
schedutil and cpufreq_cooling code needs.
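
Concretely, the interface from patch 1 looks like this (the enum and
declaration are taken from the patch; the two-line caller underneath is
only an illustration of the intended usage, not code from the series):

	/* include/linux/sched.h (patch 1/2) */
	enum cpu_util_type {
		FREQUENCY_UTIL,
		ENERGY_UTIL,
	};

	unsigned long sched_cpu_util(int cpu, enum cpu_util_type type,
				     unsigned long max);

	/* e.g. a driver asking for the energy-oriented utilization */
	unsigned long max = arch_scale_cpu_capacity(cpu);
	unsigned long util = sched_cpu_util(cpu, ENERGY_UTIL, max);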

V1->V2:
- Renamed the routine to sched_cpu_util().
- Made it more self-sufficient and removed a few parameters that no
  longer need to be exposed to the rest of the kernel.
- Better cleanups in schedutil and cpufreq_cooling.

---

Schedutil and fair.c use schedutil_cpu_util() to get an idea of how busy
a CPU is. Do the same for cpufreq_cooling, which currently derives the
load from the CPU's idle time and uses it to calculate the current power
consumption of the CPUs; that approach isn't very accurate.

Tested with hackbench and sysbench on Hikey (octa-core SMP) and no
regression was observed.

--
Viresh

Viresh Kumar (2):
  sched/core: Rename and move schedutil_cpu_util() to core.c
  thermal: cpufreq_cooling: Reuse sched_cpu_util()

 drivers/thermal/cpufreq_cooling.c |  70 +++++-------------
 include/linux/sched.h             |  19 +++++
 kernel/sched/core.c               | 113 +++++++++++++++++++++++++++++
 kernel/sched/cpufreq_schedutil.c  | 116 +-----------------------------
 kernel/sched/fair.c               |   6 +-
 kernel/sched/sched.h              |  29 +-------
 6 files changed, 156 insertions(+), 197 deletions(-)

-- 
2.25.0.rc1.19.g042ed3e048af



* [PATCH V2 1/2] sched/core: Rename and move schedutil_cpu_util() to core.c
  2020-10-23 10:20 [PATCH V2 0/2] cpufreq_cooling: Get effective CPU utilization from scheduler Viresh Kumar
@ 2020-10-23 10:20 ` Viresh Kumar
  2020-10-23 10:34   ` Peter Zijlstra
  2020-10-23 10:43   ` Vincent Guittot
  2020-10-23 10:20 ` [PATCH V2 2/2] thermal: cpufreq_cooling: Reuse sched_cpu_util() Viresh Kumar
  1 sibling, 2 replies; 12+ messages in thread
From: Viresh Kumar @ 2020-10-23 10:20 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Rafael J. Wysocki, Viresh Kumar
  Cc: linux-kernel, Quentin Perret, Lukasz Luba, linux-pm

There is nothing schedutil-specific in schedutil_cpu_util(); move it to
core.c and rename it to sched_cpu_util() so it can be used from other
parts of the kernel as well.

The cpufreq_cooling stuff will make use of this in a later commit.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 include/linux/sched.h            |  19 +++++
 kernel/sched/core.c              | 113 ++++++++++++++++++++++++++++++
 kernel/sched/cpufreq_schedutil.c | 116 +------------------------------
 kernel/sched/fair.c              |   6 +-
 kernel/sched/sched.h             |  29 +-------
 5 files changed, 140 insertions(+), 143 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 393db0690101..3c27c10141cb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1930,6 +1930,25 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
 #define TASK_SIZE_OF(tsk)	TASK_SIZE
 #endif
 
+/**
+ * enum cpu_util_type - CPU utilization type
+ * @FREQUENCY_UTIL:	Utilization used to select frequency
+ * @ENERGY_UTIL:	Utilization used during energy calculation
+ *
+ * The utilization signals of all scheduling classes (CFS/RT/DL) and IRQ time
+ * need to be aggregated differently depending on the usage made of them. This
+ * enum is used within sched_cpu_util() to differentiate the types of
+ * utilization expected by the callers, and adjust the aggregation accordingly.
+ */
+enum cpu_util_type {
+	FREQUENCY_UTIL,
+	ENERGY_UTIL,
+};
+
+/* Returns effective CPU utilization, as seen by the scheduler */
+unsigned long sched_cpu_util(int cpu, enum cpu_util_type type,
+			     unsigned long max);
+
 #ifdef CONFIG_RSEQ
 
 /*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d2003a7d5ab5..369ff54d11d4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5117,6 +5117,119 @@ struct task_struct *idle_task(int cpu)
 	return cpu_rq(cpu)->idle;
 }
 
+/*
+ * This function computes an effective utilization for the given CPU, to be
+ * used for frequency selection given the linear relation: f = u * f_max.
+ *
+ * The scheduler tracks the following metrics:
+ *
+ *   cpu_util_{cfs,rt,dl,irq}()
+ *   cpu_bw_dl()
+ *
+ * Where the cfs,rt and dl util numbers are tracked with the same metric and
+ * synchronized windows and are thus directly comparable.
+ *
+ * The cfs,rt,dl utilization are the running times measured with rq->clock_task
+ * which excludes things like IRQ and steal-time. These latter are then accrued
+ * in the irq utilization.
+ *
+ * The DL bandwidth number otoh is not a measured metric but a value computed
+ * based on the task model parameters and gives the minimal utilization
+ * required to meet deadlines.
+ */
+unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
+				 unsigned long max, enum cpu_util_type type,
+				 struct task_struct *p)
+{
+	unsigned long dl_util, util, irq;
+	struct rq *rq = cpu_rq(cpu);
+
+	if (!uclamp_is_used() &&
+	    type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
+		return max;
+	}
+
+	/*
+	 * Early check to see if IRQ/steal time saturates the CPU, can be
+	 * because of inaccuracies in how we track these -- see
+	 * update_irq_load_avg().
+	 */
+	irq = cpu_util_irq(rq);
+	if (unlikely(irq >= max))
+		return max;
+
+	/*
+	 * Because the time spend on RT/DL tasks is visible as 'lost' time to
+	 * CFS tasks and we use the same metric to track the effective
+	 * utilization (PELT windows are synchronized) we can directly add them
+	 * to obtain the CPU's actual utilization.
+	 *
+	 * CFS and RT utilization can be boosted or capped, depending on
+	 * utilization clamp constraints requested by currently RUNNABLE
+	 * tasks.
+	 * When there are no CFS RUNNABLE tasks, clamps are released and
+	 * frequency will be gracefully reduced with the utilization decay.
+	 */
+	util = util_cfs + cpu_util_rt(rq);
+	if (type == FREQUENCY_UTIL)
+		util = uclamp_rq_util_with(rq, util, p);
+
+	dl_util = cpu_util_dl(rq);
+
+	/*
+	 * For frequency selection we do not make cpu_util_dl() a permanent part
+	 * of this sum because we want to use cpu_bw_dl() later on, but we need
+	 * to check if the CFS+RT+DL sum is saturated (ie. no idle time) such
+	 * that we select f_max when there is no idle time.
+	 *
+	 * NOTE: numerical errors or stop class might cause us to not quite hit
+	 * saturation when we should -- something for later.
+	 */
+	if (util + dl_util >= max)
+		return max;
+
+	/*
+	 * OTOH, for energy computation we need the estimated running time, so
+	 * include util_dl and ignore dl_bw.
+	 */
+	if (type == ENERGY_UTIL)
+		util += dl_util;
+
+	/*
+	 * There is still idle time; further improve the number by using the
+	 * irq metric. Because IRQ/steal time is hidden from the task clock we
+	 * need to scale the task numbers:
+	 *
+	 *              max - irq
+	 *   U' = irq + --------- * U
+	 *                 max
+	 */
+	util = scale_irq_capacity(util, irq, max);
+	util += irq;
+
+	/*
+	 * Bandwidth required by DEADLINE must always be granted while, for
+	 * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
+	 * to gracefully reduce the frequency when no tasks show up for longer
+	 * periods of time.
+	 *
+	 * Ideally we would like to set bw_dl as min/guaranteed freq and util +
+	 * bw_dl as requested freq. However, cpufreq is not yet ready for such
+	 * an interface. So, we only do the latter for now.
+	 */
+	if (type == FREQUENCY_UTIL)
+		util += cpu_bw_dl(rq);
+
+	return min(max, util);
+}
+
+unsigned long sched_cpu_util(int cpu, enum cpu_util_type type,
+			     unsigned long max)
+{
+	return effective_cpu_util(cpu, cpu_util_cfs(cpu_rq(cpu)), max, type,
+				  NULL);
+}
+
 /**
  * find_process_by_pid - find a process with a matching PID value.
  * @pid: the pid in question.
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 5ae7b4e6e8d6..0c5c61a095f6 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -169,122 +169,12 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 	return cpufreq_driver_resolve_freq(policy, freq);
 }
 
-/*
- * This function computes an effective utilization for the given CPU, to be
- * used for frequency selection given the linear relation: f = u * f_max.
- *
- * The scheduler tracks the following metrics:
- *
- *   cpu_util_{cfs,rt,dl,irq}()
- *   cpu_bw_dl()
- *
- * Where the cfs,rt and dl util numbers are tracked with the same metric and
- * synchronized windows and are thus directly comparable.
- *
- * The cfs,rt,dl utilization are the running times measured with rq->clock_task
- * which excludes things like IRQ and steal-time. These latter are then accrued
- * in the irq utilization.
- *
- * The DL bandwidth number otoh is not a measured metric but a value computed
- * based on the task model parameters and gives the minimal utilization
- * required to meet deadlines.
- */
-unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,
-				 unsigned long max, enum schedutil_type type,
-				 struct task_struct *p)
-{
-	unsigned long dl_util, util, irq;
-	struct rq *rq = cpu_rq(cpu);
-
-	if (!uclamp_is_used() &&
-	    type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
-		return max;
-	}
-
-	/*
-	 * Early check to see if IRQ/steal time saturates the CPU, can be
-	 * because of inaccuracies in how we track these -- see
-	 * update_irq_load_avg().
-	 */
-	irq = cpu_util_irq(rq);
-	if (unlikely(irq >= max))
-		return max;
-
-	/*
-	 * Because the time spend on RT/DL tasks is visible as 'lost' time to
-	 * CFS tasks and we use the same metric to track the effective
-	 * utilization (PELT windows are synchronized) we can directly add them
-	 * to obtain the CPU's actual utilization.
-	 *
-	 * CFS and RT utilization can be boosted or capped, depending on
-	 * utilization clamp constraints requested by currently RUNNABLE
-	 * tasks.
-	 * When there are no CFS RUNNABLE tasks, clamps are released and
-	 * frequency will be gracefully reduced with the utilization decay.
-	 */
-	util = util_cfs + cpu_util_rt(rq);
-	if (type == FREQUENCY_UTIL)
-		util = uclamp_rq_util_with(rq, util, p);
-
-	dl_util = cpu_util_dl(rq);
-
-	/*
-	 * For frequency selection we do not make cpu_util_dl() a permanent part
-	 * of this sum because we want to use cpu_bw_dl() later on, but we need
-	 * to check if the CFS+RT+DL sum is saturated (ie. no idle time) such
-	 * that we select f_max when there is no idle time.
-	 *
-	 * NOTE: numerical errors or stop class might cause us to not quite hit
-	 * saturation when we should -- something for later.
-	 */
-	if (util + dl_util >= max)
-		return max;
-
-	/*
-	 * OTOH, for energy computation we need the estimated running time, so
-	 * include util_dl and ignore dl_bw.
-	 */
-	if (type == ENERGY_UTIL)
-		util += dl_util;
-
-	/*
-	 * There is still idle time; further improve the number by using the
-	 * irq metric. Because IRQ/steal time is hidden from the task clock we
-	 * need to scale the task numbers:
-	 *
-	 *              max - irq
-	 *   U' = irq + --------- * U
-	 *                 max
-	 */
-	util = scale_irq_capacity(util, irq, max);
-	util += irq;
-
-	/*
-	 * Bandwidth required by DEADLINE must always be granted while, for
-	 * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
-	 * to gracefully reduce the frequency when no tasks show up for longer
-	 * periods of time.
-	 *
-	 * Ideally we would like to set bw_dl as min/guaranteed freq and util +
-	 * bw_dl as requested freq. However, cpufreq is not yet ready for such
-	 * an interface. So, we only do the latter for now.
-	 */
-	if (type == FREQUENCY_UTIL)
-		util += cpu_bw_dl(rq);
-
-	return min(max, util);
-}
-
 static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
 {
-	struct rq *rq = cpu_rq(sg_cpu->cpu);
-	unsigned long util = cpu_util_cfs(rq);
-	unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
-
-	sg_cpu->max = max;
-	sg_cpu->bw_dl = cpu_bw_dl(rq);
+	sg_cpu->max = arch_scale_cpu_capacity(sg_cpu->cpu);
+	sg_cpu->bw_dl = cpu_bw_dl(cpu_rq(sg_cpu->cpu));
 
-	return schedutil_cpu_util(sg_cpu->cpu, util, max, FREQUENCY_UTIL, NULL);
+	return sched_cpu_util(sg_cpu->cpu, FREQUENCY_UTIL, sg_cpu->max);
 }
 
 /**
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aa4c6227cd6d..52e2d866e875 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6499,7 +6499,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 		 * is already enough to scale the EM reported power
 		 * consumption at the (eventually clamped) cpu_capacity.
 		 */
-		sum_util += schedutil_cpu_util(cpu, util_cfs, cpu_cap,
+		sum_util += effective_cpu_util(cpu, util_cfs, cpu_cap,
 					       ENERGY_UTIL, NULL);
 
 		/*
@@ -6509,7 +6509,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 		 * NOTE: in case RT tasks are running, by default the
 		 * FREQUENCY_UTIL's utilization can be max OPP.
 		 */
-		cpu_util = schedutil_cpu_util(cpu, util_cfs, cpu_cap,
+		cpu_util = effective_cpu_util(cpu, util_cfs, cpu_cap,
 					      FREQUENCY_UTIL, tsk);
 		max_util = max(max_util, cpu_util);
 	}
@@ -6607,7 +6607,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			 * IOW, placing the task there would make the CPU
 			 * overutilized. Take uclamp into account to see how
 			 * much capacity we can get out of the CPU; this is
-			 * aligned with schedutil_cpu_util().
+			 * aligned with sched_cpu_util().
 			 */
 			util = uclamp_rq_util_with(cpu_rq(cpu), util, p);
 			if (!fits_capacity(util, cpu_cap))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index df80bfcea92e..0f0439344eec 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2486,25 +2486,8 @@ static inline unsigned long capacity_orig_of(int cpu)
 }
 #endif
 
-/**
- * enum schedutil_type - CPU utilization type
- * @FREQUENCY_UTIL:	Utilization used to select frequency
- * @ENERGY_UTIL:	Utilization used during energy calculation
- *
- * The utilization signals of all scheduling classes (CFS/RT/DL) and IRQ time
- * need to be aggregated differently depending on the usage made of them. This
- * enum is used within schedutil_freq_util() to differentiate the types of
- * utilization expected by the callers, and adjust the aggregation accordingly.
- */
-enum schedutil_type {
-	FREQUENCY_UTIL,
-	ENERGY_UTIL,
-};
-
-#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
-
-unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,
-				 unsigned long max, enum schedutil_type type,
+unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
+				 unsigned long max, enum cpu_util_type type,
 				 struct task_struct *p);
 
 static inline unsigned long cpu_bw_dl(struct rq *rq)
@@ -2533,14 +2516,6 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
 {
 	return READ_ONCE(rq->avg_rt.util_avg);
 }
-#else /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
-static inline unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,
-				 unsigned long max, enum schedutil_type type,
-				 struct task_struct *p)
-{
-	return 0;
-}
-#endif /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
 
 #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
 static inline unsigned long cpu_util_irq(struct rq *rq)
-- 
2.25.0.rc1.19.g042ed3e048af
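
To make the aggregation in effective_cpu_util() concrete, here is a
worked trace with illustrative numbers (all in capacity units, max =
1024, uclamp disabled; the input values are assumptions for the
example):

   util_cfs = 200, cpu_util_rt() = 100, cpu_util_dl() = 50,
   cpu_bw_dl() = 64, cpu_util_irq() = 128

   FREQUENCY_UTIL:
      util = 200 + 100 = 300                     (CFS + RT)
      300 + 50 < 1024                            (not saturated)
      util = 300 * (1024 - 128) / 1024 = 262     (scale for IRQ time)
      util = 262 + 128 = 390                     (add IRQ)
      util = 390 + 64  = 454                     (add DL bandwidth)
      result = min(1024, 454) = 454

   ENERGY_UTIL:
      util = 200 + 100 + 50 = 350                (CFS + RT + DL running)
      util = 350 * (1024 - 128) / 1024 = 306     (scale for IRQ time)
      util = 306 + 128 = 434                     (add IRQ, no bw_dl)
      result = min(1024, 434) = 434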



* [PATCH V2 2/2] thermal: cpufreq_cooling: Reuse sched_cpu_util()
  2020-10-23 10:20 [PATCH V2 0/2] cpufreq_cooling: Get effective CPU utilization from scheduler Viresh Kumar
  2020-10-23 10:20 ` [PATCH V2 1/2] sched/core: Rename and move schedutil_cpu_util() to core.c Viresh Kumar
@ 2020-10-23 10:20 ` Viresh Kumar
  2020-10-23 10:37   ` Peter Zijlstra
  1 sibling, 1 reply; 12+ messages in thread
From: Viresh Kumar @ 2020-10-23 10:20 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Amit Daniel Kachhap, Daniel Lezcano, Viresh Kumar, Javi Merino,
	Zhang Rui, Amit Kucheria
  Cc: linux-kernel, Quentin Perret, Lukasz Luba, linux-pm

Several parts of the kernel are already using the effective CPU
utilization (as seen by the scheduler) to get the current load on the
CPU. Do the same here instead of depending on the idle time of the CPU,
which isn't as accurate by comparison.
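
For example, with a utilization of 410 on a CPU of capacity 1024, the
new get_load() below reports (410 * 100) / 1024 = 40 percent load (the
numbers here are purely illustrative).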

Note that this (and CPU frequency scaling in general) doesn't work that
well with idle injection, as that is done from RT threads and is counted
as load while it tries to do quite the opposite. That should be solved
separately, though.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 drivers/thermal/cpufreq_cooling.c | 70 +++++++------------------------
 1 file changed, 16 insertions(+), 54 deletions(-)

diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
index cc2959f22f01..1315e4d4758b 100644
--- a/drivers/thermal/cpufreq_cooling.c
+++ b/drivers/thermal/cpufreq_cooling.c
@@ -19,6 +19,7 @@
 #include <linux/idr.h>
 #include <linux/pm_opp.h>
 #include <linux/pm_qos.h>
+#include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/thermal.h>
 
@@ -38,16 +39,6 @@
  *	...
  */
 
-/**
- * struct time_in_idle - Idle time stats
- * @time: previous reading of the absolute time that this cpu was idle
- * @timestamp: wall time of the last invocation of get_cpu_idle_time_us()
- */
-struct time_in_idle {
-	u64 time;
-	u64 timestamp;
-};
-
 /**
  * struct cpufreq_cooling_device - data for cooling device with cpufreq
  * @id: unique integer value corresponding to each cpufreq_cooling_device
@@ -62,7 +53,6 @@ struct time_in_idle {
  *	registered cooling device.
  * @policy: cpufreq policy.
  * @node: list_head to link all cpufreq_cooling_device together.
- * @idle_time: idle time stats
  * @qos_req: PM QoS contraint to apply
  *
  * This structure is required for keeping information of each registered
@@ -76,7 +66,6 @@ struct cpufreq_cooling_device {
 	struct em_perf_domain *em;
 	struct cpufreq_policy *policy;
 	struct list_head node;
-	struct time_in_idle *idle_time;
 	struct freq_qos_request qos_req;
 };
 
@@ -132,34 +121,18 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
 }
 
 /**
- * get_load() - get load for a cpu since last updated
- * @cpufreq_cdev:	&struct cpufreq_cooling_device for this cpu
+ * get_load() - get current load for a cpu
  * @cpu:	cpu number
- * @cpu_idx:	index of the cpu in time_in_idle*
  *
- * Return: The average load of cpu @cpu in percentage since this
- * function was last called.
+ * Return: The current load of cpu @cpu in percentage.
  */
-static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
-		    int cpu_idx)
+static u32 get_load(int cpu)
 {
-	u32 load;
-	u64 now, now_idle, delta_time, delta_idle;
-	struct time_in_idle *idle_time = &cpufreq_cdev->idle_time[cpu_idx];
-
-	now_idle = get_cpu_idle_time(cpu, &now, 0);
-	delta_idle = now_idle - idle_time->time;
-	delta_time = now - idle_time->timestamp;
+	unsigned long max = arch_scale_cpu_capacity(cpu);
+	unsigned long util;
 
-	if (delta_time <= delta_idle)
-		load = 0;
-	else
-		load = div64_u64(100 * (delta_time - delta_idle), delta_time);
-
-	idle_time->time = now_idle;
-	idle_time->timestamp = now;
-
-	return load;
+	util = sched_cpu_util(cpu, ENERGY_UTIL, max);
+	return (util * 100) / max;
 }
 
 /**
@@ -191,13 +164,12 @@ static u32 get_dynamic_power(struct cpufreq_cooling_device *cpufreq_cdev,
  * Instead, we calculate the current power on the assumption that the
  * immediate future will look like the immediate past.
  *
- * We use the current frequency and the average load since this
- * function was last called.  In reality, there could have been
- * multiple opps since this function was last called and that affects
- * the load calculation.  While it's not perfectly accurate, this
- * simplification is good enough and works.  REVISIT this, as more
- * complex code may be needed if experiments show that it's not
- * accurate enough.
+ * We use the current frequency and the current load.  In reality,
+ * there could have been multiple opps since this function was last
+ * called and that affects the load calculation.  While it's not
+ * perfectly accurate, this simplification is good enough and works.
+ * REVISIT this, as more complex code may be needed if experiments show
+ * that it's not accurate enough.
  *
  * Return: 0 on success, -E* if getting the static power failed.
  */
@@ -223,7 +195,7 @@ static int cpufreq_get_requested_power(struct thermal_cooling_device *cdev,
 		u32 load;
 
 		if (cpu_online(cpu))
-			load = get_load(cpufreq_cdev, cpu, i);
+			load = get_load(cpu);
 		else
 			load = 0;
 
@@ -517,13 +489,6 @@ __cpufreq_cooling_register(struct device_node *np,
 
 	cpufreq_cdev->policy = policy;
 	num_cpus = cpumask_weight(policy->related_cpus);
-	cpufreq_cdev->idle_time = kcalloc(num_cpus,
-					 sizeof(*cpufreq_cdev->idle_time),
-					 GFP_KERNEL);
-	if (!cpufreq_cdev->idle_time) {
-		cdev = ERR_PTR(-ENOMEM);
-		goto free_cdev;
-	}
 
 	/* max_level is an index, not a counter */
 	cpufreq_cdev->max_level = i - 1;
@@ -531,7 +496,7 @@ __cpufreq_cooling_register(struct device_node *np,
 	ret = ida_simple_get(&cpufreq_ida, 0, 0, GFP_KERNEL);
 	if (ret < 0) {
 		cdev = ERR_PTR(ret);
-		goto free_idle_time;
+		goto free_cdev;
 	}
 	cpufreq_cdev->id = ret;
 
@@ -580,8 +545,6 @@ __cpufreq_cooling_register(struct device_node *np,
 	freq_qos_remove_request(&cpufreq_cdev->qos_req);
 remove_ida:
 	ida_simple_remove(&cpufreq_ida, cpufreq_cdev->id);
-free_idle_time:
-	kfree(cpufreq_cdev->idle_time);
 free_cdev:
 	kfree(cpufreq_cdev);
 	return cdev;
@@ -674,7 +637,6 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
 	thermal_cooling_device_unregister(cdev);
 	freq_qos_remove_request(&cpufreq_cdev->qos_req);
 	ida_simple_remove(&cpufreq_ida, cpufreq_cdev->id);
-	kfree(cpufreq_cdev->idle_time);
 	kfree(cpufreq_cdev);
 }
 EXPORT_SYMBOL_GPL(cpufreq_cooling_unregister);
-- 
2.25.0.rc1.19.g042ed3e048af



* Re: [PATCH V2 1/2] sched/core: Rename and move schedutil_cpu_util() to core.c
  2020-10-23 10:20 ` [PATCH V2 1/2] sched/core: Rename and move schedutil_cpu_util() to core.c Viresh Kumar
@ 2020-10-23 10:34   ` Peter Zijlstra
  2020-10-23 10:54     ` Viresh Kumar
  2020-10-23 10:43   ` Vincent Guittot
  1 sibling, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2020-10-23 10:34 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Ingo Molnar, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Rafael J. Wysocki, linux-kernel,
	Quentin Perret, Lukasz Luba, linux-pm

On Fri, Oct 23, 2020 at 03:50:20PM +0530, Viresh Kumar wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d2003a7d5ab5..369ff54d11d4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5117,6 +5117,119 @@ struct task_struct *idle_task(int cpu)
>  	return cpu_rq(cpu)->idle;
>  }
>  
> +/*
> + * This function computes an effective utilization for the given CPU, to be
> + * used for frequency selection given the linear relation: f = u * f_max.
> + *
> + * The scheduler tracks the following metrics:
> + *
> + *   cpu_util_{cfs,rt,dl,irq}()
> + *   cpu_bw_dl()
> + *
> + * Where the cfs,rt and dl util numbers are tracked with the same metric and
> + * synchronized windows and are thus directly comparable.
> + *
> + * The cfs,rt,dl utilization are the running times measured with rq->clock_task
> + * which excludes things like IRQ and steal-time. These latter are then accrued
> + * in the irq utilization.
> + *
> + * The DL bandwidth number otoh is not a measured metric but a value computed
> + * based on the task model parameters and gives the minimal utilization
> + * required to meet deadlines.
> + */
> +unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> +				 unsigned long max, enum cpu_util_type type,
> +				 struct task_struct *p)
> +{
	...
> +}
> +
> +unsigned long sched_cpu_util(int cpu, enum cpu_util_type type,
> +			     unsigned long max)
> +{
> +	return effective_cpu_util(cpu, cpu_util_cfs(cpu_rq(cpu)), max, type,
> +				  NULL);
> +}

Shouldn't all that be: #ifdef CONFIG_SMP ?
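
A minimal sketch of such a guard in include/linux/sched.h, assuming a
zero-returning stub is acceptable for !SMP (an assumption about a
possible fix, not something this v2 does):

	#ifdef CONFIG_SMP
	unsigned long sched_cpu_util(int cpu, enum cpu_util_type type,
				     unsigned long max);
	#else
	static inline unsigned long sched_cpu_util(int cpu,
						   enum cpu_util_type type,
						   unsigned long max)
	{
		return 0;
	}
	#endif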


* Re: [PATCH V2 2/2] thermal: cpufreq_cooling: Reuse sched_cpu_util()
  2020-10-23 10:20 ` [PATCH V2 2/2] thermal: cpufreq_cooling: Reuse sched_cpu_util() Viresh Kumar
@ 2020-10-23 10:37   ` Peter Zijlstra
  2020-11-12  9:41     ` Viresh Kumar
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2020-10-23 10:37 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Ingo Molnar, Vincent Guittot, Amit Daniel Kachhap,
	Daniel Lezcano, Javi Merino, Zhang Rui, Amit Kucheria,
	linux-kernel, Quentin Perret, Lukasz Luba, linux-pm

On Fri, Oct 23, 2020 at 03:50:21PM +0530, Viresh Kumar wrote:
> Several parts of the kernel are already using the effective CPU
> utilization (as seen by the scheduler) to get the current load on the
> CPU. Do the same here instead of depending on the idle time of the CPU,
> which isn't as accurate by comparison.
> 
> Note that this (and CPU frequency scaling in general) doesn't work that
> well with idle injection, as that is done from RT threads and is counted
> as load while it tries to do quite the opposite. That should be solved
> separately, though.

Actual numbers that show the goodness would be nice ;-) Because clearly
we're doing this to make it better.


* Re: [PATCH V2 1/2] sched/core: Rename and move schedutil_cpu_util() to core.c
  2020-10-23 10:20 ` [PATCH V2 1/2] sched/core: Rename and move schedutil_cpu_util() to core.c Viresh Kumar
  2020-10-23 10:34   ` Peter Zijlstra
@ 2020-10-23 10:43   ` Vincent Guittot
  2020-10-23 10:55     ` Viresh Kumar
  1 sibling, 1 reply; 12+ messages in thread
From: Vincent Guittot @ 2020-10-23 10:43 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Rafael J. Wysocki, linux-kernel,
	Quentin Perret, Lukasz Luba, open list:THERMAL

On Fri, 23 Oct 2020 at 12:20, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> There is nothing schedutil-specific in schedutil_cpu_util(); move it to
> core.c and rename it to sched_cpu_util() so it can be used from other

I wonder if pelt.c would be a better place for this than core.c ?

> parts of the kernel as well.
>
> The cpufreq_cooling stuff will make use of this in a later commit.
>
> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
> ---
>  include/linux/sched.h            |  19 +++++
>  kernel/sched/core.c              | 113 ++++++++++++++++++++++++++++++
>  kernel/sched/cpufreq_schedutil.c | 116 +------------------------------
>  kernel/sched/fair.c              |   6 +-
>  kernel/sched/sched.h             |  29 +-------
>  5 files changed, 140 insertions(+), 143 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 393db0690101..3c27c10141cb 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1930,6 +1930,25 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
>  #define TASK_SIZE_OF(tsk)      TASK_SIZE
>  #endif
>
> +/**
> + * enum cpu_util_type - CPU utilization type
> + * @FREQUENCY_UTIL:    Utilization used to select frequency
> + * @ENERGY_UTIL:       Utilization used during energy calculation
> + *
> + * The utilization signals of all scheduling classes (CFS/RT/DL) and IRQ time
> + * need to be aggregated differently depending on the usage made of them. This
> + * enum is used within sched_cpu_util() to differentiate the types of
> + * utilization expected by the callers, and adjust the aggregation accordingly.
> + */
> +enum cpu_util_type {
> +       FREQUENCY_UTIL,
> +       ENERGY_UTIL,
> +};
> +
> +/* Returns effective CPU utilization, as seen by the scheduler */
> +unsigned long sched_cpu_util(int cpu, enum cpu_util_type type,
> +                            unsigned long max);
> +
>  #ifdef CONFIG_RSEQ
>
>  /*
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d2003a7d5ab5..369ff54d11d4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5117,6 +5117,119 @@ struct task_struct *idle_task(int cpu)
>         return cpu_rq(cpu)->idle;
>  }
>
> +/*
> + * This function computes an effective utilization for the given CPU, to be
> + * used for frequency selection given the linear relation: f = u * f_max.
> + *
> + * The scheduler tracks the following metrics:
> + *
> + *   cpu_util_{cfs,rt,dl,irq}()
> + *   cpu_bw_dl()
> + *
> + * Where the cfs,rt and dl util numbers are tracked with the same metric and
> + * synchronized windows and are thus directly comparable.
> + *
> + * The cfs,rt,dl utilization are the running times measured with rq->clock_task
> + * which excludes things like IRQ and steal-time. These latter are then accrued
> + * in the irq utilization.
> + *
> + * The DL bandwidth number otoh is not a measured metric but a value computed
> + * based on the task model parameters and gives the minimal utilization
> + * required to meet deadlines.
> + */
> +unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> +                                unsigned long max, enum cpu_util_type type,
> +                                struct task_struct *p)
> +{
> +       unsigned long dl_util, util, irq;
> +       struct rq *rq = cpu_rq(cpu);
> +
> +       if (!uclamp_is_used() &&
> +           type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
> +               return max;
> +       }
> +
> +       /*
> +        * Early check to see if IRQ/steal time saturates the CPU, can be
> +        * because of inaccuracies in how we track these -- see
> +        * update_irq_load_avg().
> +        */
> +       irq = cpu_util_irq(rq);
> +       if (unlikely(irq >= max))
> +               return max;
> +
> +       /*
> +        * Because the time spend on RT/DL tasks is visible as 'lost' time to
> +        * CFS tasks and we use the same metric to track the effective
> +        * utilization (PELT windows are synchronized) we can directly add them
> +        * to obtain the CPU's actual utilization.
> +        *
> +        * CFS and RT utilization can be boosted or capped, depending on
> +        * utilization clamp constraints requested by currently RUNNABLE
> +        * tasks.
> +        * When there are no CFS RUNNABLE tasks, clamps are released and
> +        * frequency will be gracefully reduced with the utilization decay.
> +        */
> +       util = util_cfs + cpu_util_rt(rq);
> +       if (type == FREQUENCY_UTIL)
> +               util = uclamp_rq_util_with(rq, util, p);
> +
> +       dl_util = cpu_util_dl(rq);
> +
> +       /*
> +        * For frequency selection we do not make cpu_util_dl() a permanent part
> +        * of this sum because we want to use cpu_bw_dl() later on, but we need
> +        * to check if the CFS+RT+DL sum is saturated (ie. no idle time) such
> +        * that we select f_max when there is no idle time.
> +        *
> +        * NOTE: numerical errors or stop class might cause us to not quite hit
> +        * saturation when we should -- something for later.
> +        */
> +       if (util + dl_util >= max)
> +               return max;
> +
> +       /*
> +        * OTOH, for energy computation we need the estimated running time, so
> +        * include util_dl and ignore dl_bw.
> +        */
> +       if (type == ENERGY_UTIL)
> +               util += dl_util;
> +
> +       /*
> +        * There is still idle time; further improve the number by using the
> +        * irq metric. Because IRQ/steal time is hidden from the task clock we
> +        * need to scale the task numbers:
> +        *
> +        *              max - irq
> +        *   U' = irq + --------- * U
> +        *                 max
> +        */
> +       util = scale_irq_capacity(util, irq, max);
> +       util += irq;
> +
> +       /*
> +        * Bandwidth required by DEADLINE must always be granted while, for
> +        * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
> +        * to gracefully reduce the frequency when no tasks show up for longer
> +        * periods of time.
> +        *
> +        * Ideally we would like to set bw_dl as min/guaranteed freq and util +
> +        * bw_dl as requested freq. However, cpufreq is not yet ready for such
> +        * an interface. So, we only do the latter for now.
> +        */
> +       if (type == FREQUENCY_UTIL)
> +               util += cpu_bw_dl(rq);
> +
> +       return min(max, util);
> +}
> +
> +unsigned long sched_cpu_util(int cpu, enum cpu_util_type type,
> +                            unsigned long max)
> +{
> +       return effective_cpu_util(cpu, cpu_util_cfs(cpu_rq(cpu)), max, type,
> +                                 NULL);
> +}
> +
>  /**
>   * find_process_by_pid - find a process with a matching PID value.
>   * @pid: the pid in question.
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 5ae7b4e6e8d6..0c5c61a095f6 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -169,122 +169,12 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
>         return cpufreq_driver_resolve_freq(policy, freq);
>  }
>
> -/*
> - * This function computes an effective utilization for the given CPU, to be
> - * used for frequency selection given the linear relation: f = u * f_max.
> - *
> - * The scheduler tracks the following metrics:
> - *
> - *   cpu_util_{cfs,rt,dl,irq}()
> - *   cpu_bw_dl()
> - *
> - * Where the cfs,rt and dl util numbers are tracked with the same metric and
> - * synchronized windows and are thus directly comparable.
> - *
> - * The cfs,rt,dl utilization are the running times measured with rq->clock_task
> - * which excludes things like IRQ and steal-time. These latter are then accrued
> - * in the irq utilization.
> - *
> - * The DL bandwidth number otoh is not a measured metric but a value computed
> - * based on the task model parameters and gives the minimal utilization
> - * required to meet deadlines.
> - */
> -unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,
> -                                unsigned long max, enum schedutil_type type,
> -                                struct task_struct *p)
> -{
> -       unsigned long dl_util, util, irq;
> -       struct rq *rq = cpu_rq(cpu);
> -
> -       if (!uclamp_is_used() &&
> -           type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
> -               return max;
> -       }
> -
> -       /*
> -        * Early check to see if IRQ/steal time saturates the CPU, can be
> -        * because of inaccuracies in how we track these -- see
> -        * update_irq_load_avg().
> -        */
> -       irq = cpu_util_irq(rq);
> -       if (unlikely(irq >= max))
> -               return max;
> -
> -       /*
> -        * Because the time spend on RT/DL tasks is visible as 'lost' time to
> -        * CFS tasks and we use the same metric to track the effective
> -        * utilization (PELT windows are synchronized) we can directly add them
> -        * to obtain the CPU's actual utilization.
> -        *
> -        * CFS and RT utilization can be boosted or capped, depending on
> -        * utilization clamp constraints requested by currently RUNNABLE
> -        * tasks.
> -        * When there are no CFS RUNNABLE tasks, clamps are released and
> -        * frequency will be gracefully reduced with the utilization decay.
> -        */
> -       util = util_cfs + cpu_util_rt(rq);
> -       if (type == FREQUENCY_UTIL)
> -               util = uclamp_rq_util_with(rq, util, p);
> -
> -       dl_util = cpu_util_dl(rq);
> -
> -       /*
> -        * For frequency selection we do not make cpu_util_dl() a permanent part
> -        * of this sum because we want to use cpu_bw_dl() later on, but we need
> -        * to check if the CFS+RT+DL sum is saturated (ie. no idle time) such
> -        * that we select f_max when there is no idle time.
> -        *
> -        * NOTE: numerical errors or stop class might cause us to not quite hit
> -        * saturation when we should -- something for later.
> -        */
> -       if (util + dl_util >= max)
> -               return max;
> -
> -       /*
> -        * OTOH, for energy computation we need the estimated running time, so
> -        * include util_dl and ignore dl_bw.
> -        */
> -       if (type == ENERGY_UTIL)
> -               util += dl_util;
> -
> -       /*
> -        * There is still idle time; further improve the number by using the
> -        * irq metric. Because IRQ/steal time is hidden from the task clock we
> -        * need to scale the task numbers:
> -        *
> -        *              max - irq
> -        *   U' = irq + --------- * U
> -        *                 max
> -        */
> -       util = scale_irq_capacity(util, irq, max);
> -       util += irq;
> -
> -       /*
> -        * Bandwidth required by DEADLINE must always be granted while, for
> -        * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
> -        * to gracefully reduce the frequency when no tasks show up for longer
> -        * periods of time.
> -        *
> -        * Ideally we would like to set bw_dl as min/guaranteed freq and util +
> -        * bw_dl as requested freq. However, cpufreq is not yet ready for such
> -        * an interface. So, we only do the latter for now.
> -        */
> -       if (type == FREQUENCY_UTIL)
> -               util += cpu_bw_dl(rq);
> -
> -       return min(max, util);
> -}
> -
>  static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
>  {
> -       struct rq *rq = cpu_rq(sg_cpu->cpu);
> -       unsigned long util = cpu_util_cfs(rq);
> -       unsigned long max = arch_scale_cpu_capacity(sg_cpu->cpu);
> -
> -       sg_cpu->max = max;
> -       sg_cpu->bw_dl = cpu_bw_dl(rq);
> +       sg_cpu->max = arch_scale_cpu_capacity(sg_cpu->cpu);
> +       sg_cpu->bw_dl = cpu_bw_dl(cpu_rq(sg_cpu->cpu));
>
> -       return schedutil_cpu_util(sg_cpu->cpu, util, max, FREQUENCY_UTIL, NULL);
> +       return sched_cpu_util(sg_cpu->cpu, FREQUENCY_UTIL, sg_cpu->max);
>  }
>
>  /**
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index aa4c6227cd6d..52e2d866e875 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6499,7 +6499,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>                  * is already enough to scale the EM reported power
>                  * consumption at the (eventually clamped) cpu_capacity.
>                  */
> -               sum_util += schedutil_cpu_util(cpu, util_cfs, cpu_cap,
> +               sum_util += effective_cpu_util(cpu, util_cfs, cpu_cap,
>                                                ENERGY_UTIL, NULL);
>
>                 /*
> @@ -6509,7 +6509,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>                  * NOTE: in case RT tasks are running, by default the
>                  * FREQUENCY_UTIL's utilization can be max OPP.
>                  */
> -               cpu_util = schedutil_cpu_util(cpu, util_cfs, cpu_cap,
> +               cpu_util = effective_cpu_util(cpu, util_cfs, cpu_cap,
>                                               FREQUENCY_UTIL, tsk);
>                 max_util = max(max_util, cpu_util);
>         }
> @@ -6607,7 +6607,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>                          * IOW, placing the task there would make the CPU
>                          * overutilized. Take uclamp into account to see how
>                          * much capacity we can get out of the CPU; this is
> -                        * aligned with schedutil_cpu_util().
> +                        * aligned with sched_cpu_util().
>                          */
>                         util = uclamp_rq_util_with(cpu_rq(cpu), util, p);
>                         if (!fits_capacity(util, cpu_cap))
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index df80bfcea92e..0f0439344eec 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2486,25 +2486,8 @@ static inline unsigned long capacity_orig_of(int cpu)
>  }
>  #endif
>
> -/**
> - * enum schedutil_type - CPU utilization type
> - * @FREQUENCY_UTIL:    Utilization used to select frequency
> - * @ENERGY_UTIL:       Utilization used during energy calculation
> - *
> - * The utilization signals of all scheduling classes (CFS/RT/DL) and IRQ time
> - * need to be aggregated differently depending on the usage made of them. This
> - * enum is used within schedutil_freq_util() to differentiate the types of
> - * utilization expected by the callers, and adjust the aggregation accordingly.
> - */
> -enum schedutil_type {
> -       FREQUENCY_UTIL,
> -       ENERGY_UTIL,
> -};
> -
> -#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> -
> -unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,
> -                                unsigned long max, enum schedutil_type type,
> +unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> +                                unsigned long max, enum cpu_util_type type,
>                                  struct task_struct *p);
>
>  static inline unsigned long cpu_bw_dl(struct rq *rq)
> @@ -2533,14 +2516,6 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
>  {
>         return READ_ONCE(rq->avg_rt.util_avg);
>  }
> -#else /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
> -static inline unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,
> -                                unsigned long max, enum schedutil_type type,
> -                                struct task_struct *p)
> -{
> -       return 0;
> -}
> -#endif /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
>
>  #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
>  static inline unsigned long cpu_util_irq(struct rq *rq)
> --
> 2.25.0.rc1.19.g042ed3e048af
>


* Re: [PATCH V2 1/2] sched/core: Rename and move schedutil_cpu_util() to core.c
  2020-10-23 10:34   ` Peter Zijlstra
@ 2020-10-23 10:54     ` Viresh Kumar
  2020-10-23 11:08       ` Lukasz Luba
  2020-10-23 12:34       ` Vincent Guittot
  0 siblings, 2 replies; 12+ messages in thread
From: Viresh Kumar @ 2020-10-23 10:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Rafael J. Wysocki, linux-kernel,
	Quentin Perret, Lukasz Luba, linux-pm

On 23-10-20, 12:34, Peter Zijlstra wrote:
> On Fri, Oct 23, 2020 at 03:50:20PM +0530, Viresh Kumar wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index d2003a7d5ab5..369ff54d11d4 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5117,6 +5117,119 @@ struct task_struct *idle_task(int cpu)
> >  	return cpu_rq(cpu)->idle;
> >  }
> >  
> > +/*
> > + * This function computes an effective utilization for the given CPU, to be
> > + * used for frequency selection given the linear relation: f = u * f_max.
> > + *
> > + * The scheduler tracks the following metrics:
> > + *
> > + *   cpu_util_{cfs,rt,dl,irq}()
> > + *   cpu_bw_dl()
> > + *
> > + * Where the cfs,rt and dl util numbers are tracked with the same metric and
> > + * synchronized windows and are thus directly comparable.
> > + *
> > + * The cfs,rt,dl utilization are the running times measured with rq->clock_task
> > + * which excludes things like IRQ and steal-time. These latter are then accrued
> > + * in the irq utilization.
> > + *
> > + * The DL bandwidth number otoh is not a measured metric but a value computed
> > + * based on the task model parameters and gives the minimal utilization
> > + * required to meet deadlines.
> > + */
> > +unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> > +				 unsigned long max, enum cpu_util_type type,
> > +				 struct task_struct *p)
> > +{
> 	...
> > +}
> > +
> > +unsigned long sched_cpu_util(int cpu, enum cpu_util_type type,
> > +			     unsigned long max)
> > +{
> > +	return effective_cpu_util(cpu, cpu_util_cfs(cpu_rq(cpu)), max, type,
> > +				  NULL);
> > +}
> 
> Shouldn't all that be: #ifdef CONFIG_SMP ?

I didn't realize that these metrics are only available in case of SMP
and that's why schedutil isn't available for !SMP. I wonder what we
should be doing in cpufreq_cooling now? Make it depend on SMP? Or
calculate load the traditional way (the stuff I just removed) for the
!SMP case?

:)

-- 
viresh


* Re: [PATCH V2 1/2] sched/core: Rename and move schedutil_cpu_util() to core.c
  2020-10-23 10:43   ` Vincent Guittot
@ 2020-10-23 10:55     ` Viresh Kumar
  0 siblings, 0 replies; 12+ messages in thread
From: Viresh Kumar @ 2020-10-23 10:55 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Rafael J. Wysocki, linux-kernel,
	Quentin Perret, Lukasz Luba, open list:THERMAL

On 23-10-20, 12:43, Vincent Guittot wrote:
> On Fri, 23 Oct 2020 at 12:20, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> >
> > There is nothing schedutil-specific in schedutil_cpu_util(); move it to
> > core.c and rename it to sched_cpu_util() so it can be used from other
> 
> I wonder if pelt.c would be a better place for this than core.c ?

Wherever you guys ask me to move it :)

-- 
viresh


* Re: [PATCH V2 1/2] sched/core: Rename and move schedutil_cpu_util() to core.c
  2020-10-23 10:54     ` Viresh Kumar
@ 2020-10-23 11:08       ` Lukasz Luba
  2020-10-23 12:34       ` Vincent Guittot
  1 sibling, 0 replies; 12+ messages in thread
From: Lukasz Luba @ 2020-10-23 11:08 UTC (permalink / raw)
  To: Viresh Kumar, Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Rafael J. Wysocki, linux-kernel,
	Quentin Perret, linux-pm



On 10/23/20 11:54 AM, Viresh Kumar wrote:
> On 23-10-20, 12:34, Peter Zijlstra wrote:
>> On Fri, Oct 23, 2020 at 03:50:20PM +0530, Viresh Kumar wrote:
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index d2003a7d5ab5..369ff54d11d4 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -5117,6 +5117,119 @@ struct task_struct *idle_task(int cpu)
>>>   	return cpu_rq(cpu)->idle;
>>>   }
>>>   
>>> +/*
>>> + * This function computes an effective utilization for the given CPU, to be
>>> + * used for frequency selection given the linear relation: f = u * f_max.
>>> + *
>>> + * The scheduler tracks the following metrics:
>>> + *
>>> + *   cpu_util_{cfs,rt,dl,irq}()
>>> + *   cpu_bw_dl()
>>> + *
>>> + * Where the cfs,rt and dl util numbers are tracked with the same metric and
>>> + * synchronized windows and are thus directly comparable.
>>> + *
>>> + * The cfs,rt,dl utilization are the running times measured with rq->clock_task
>>> + * which excludes things like IRQ and steal-time. These latter are then accrued
>>> + * in the irq utilization.
>>> + *
>>> + * The DL bandwidth number otoh is not a measured metric but a value computed
>>> + * based on the task model parameters and gives the minimal utilization
>>> + * required to meet deadlines.
>>> + */
>>> +unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
>>> +				 unsigned long max, enum cpu_util_type type,
>>> +				 struct task_struct *p)
>>> +{
>> 	...
>>> +}
>>> +
>>> +unsigned long sched_cpu_util(int cpu, enum cpu_util_type type,
>>> +			     unsigned long max)
>>> +{
>>> +	return effective_cpu_util(cpu, cpu_util_cfs(cpu_rq(cpu)), max, type,
>>> +				  NULL);
>>> +}
>>
>> Shouldn't all that be: #ifdef CONFIG_SMP ?
> 
> I didn't realize that these metrics are only available in case of SMP
> and that's why schedutil isn't available for !SMP. I wonder what we
> should be doing in cpufreq_cooling now? Make it depend on SMP? Or
> calculate load the traditional way (the stuff I just removed) for the
> !SMP case?

IMO !SMP can live with the old design, so keeping two implementations
under #ifdef CONFIG_SMP is fair in this case, I would say.

There are popular !SMP platforms (BeagleBone, RPi1, RPiZero), but I
haven't heard of anyone using IPA on them.
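
As a rough sketch of that idea in cpufreq_cooling.c (the SMP branch is
get_load() from patch 2, the !SMP branch reuses the idle-time code that
patch 2 removes; keeping the old signature for both variants is an
assumption):

	#ifdef CONFIG_SMP
	/* Load from the scheduler's utilization signal (patch 2) */
	static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
			    int cpu_idx)
	{
		unsigned long max = arch_scale_cpu_capacity(cpu);
		unsigned long util = sched_cpu_util(cpu, ENERGY_UTIL, max);

		return (util * 100) / max;
	}
	#else /* !CONFIG_SMP */
	/* Idle-time based load, as before patch 2 */
	static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
			    int cpu_idx)
	{
		u32 load;
		u64 now, now_idle, delta_time, delta_idle;
		struct time_in_idle *idle_time = &cpufreq_cdev->idle_time[cpu_idx];

		now_idle = get_cpu_idle_time(cpu, &now, 0);
		delta_idle = now_idle - idle_time->time;
		delta_time = now - idle_time->timestamp;

		if (delta_time <= delta_idle)
			load = 0;
		else
			load = div64_u64(100 * (delta_time - delta_idle), delta_time);

		idle_time->time = now_idle;
		idle_time->timestamp = now;

		return load;
	}
	#endif /* CONFIG_SMP */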

Regards,
Lukasz

> 
> :)
> 


* Re: [PATCH V2 1/2] sched/core: Rename and move schedutil_cpu_util() to core.c
  2020-10-23 10:54     ` Viresh Kumar
  2020-10-23 11:08       ` Lukasz Luba
@ 2020-10-23 12:34       ` Vincent Guittot
  2020-10-28  5:49         ` Viresh Kumar
  1 sibling, 1 reply; 12+ messages in thread
From: Vincent Guittot @ 2020-10-23 12:34 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Rafael J. Wysocki, linux-kernel,
	Quentin Perret, Lukasz Luba, open list:THERMAL

On Fri, 23 Oct 2020 at 12:54, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 23-10-20, 12:34, Peter Zijlstra wrote:
> > On Fri, Oct 23, 2020 at 03:50:20PM +0530, Viresh Kumar wrote:
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index d2003a7d5ab5..369ff54d11d4 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -5117,6 +5117,119 @@ struct task_struct *idle_task(int cpu)
> > >     return cpu_rq(cpu)->idle;
> > >  }
> > >
> > > +/*
> > > + * This function computes an effective utilization for the given CPU, to be
> > > + * used for frequency selection given the linear relation: f = u * f_max.
> > > + *
> > > + * The scheduler tracks the following metrics:
> > > + *
> > > + *   cpu_util_{cfs,rt,dl,irq}()
> > > + *   cpu_bw_dl()
> > > + *
> > > + * Where the cfs,rt and dl util numbers are tracked with the same metric and
> > > + * synchronized windows and are thus directly comparable.
> > > + *
> > > + * The cfs,rt,dl utilization are the running times measured with rq->clock_task
> > > + * which excludes things like IRQ and steal-time. These latter are then accrued
> > > + * in the irq utilization.
> > > + *
> > > + * The DL bandwidth number otoh is not a measured metric but a value computed
> > > + * based on the task model parameters and gives the minimal utilization
> > > + * required to meet deadlines.
> > > + */
> > > +unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> > > +                            unsigned long max, enum cpu_util_type type,
> > > +                            struct task_struct *p)
> > > +{
> >       ...
> > > +}
> > > +
> > > +unsigned long sched_cpu_util(int cpu, enum cpu_util_type type,
> > > +                        unsigned long max)
> > > +{
> > > +   return effective_cpu_util(cpu, cpu_util_cfs(cpu_rq(cpu)), max, type,
> > > +                             NULL);
> > > +}
> >
> > Shouldn't all that be: #ifdef CONFIG_SMP ?
>
> I didn't realize that these metrics are only available in case of SMP
> and that's why schedutil isn't available for !SMP. I wonder what we

Maybe it's time to make sched_util and pelt available for !SMP too.

With util_est and uclamp, I can see some benefits for !SMP compared to ondemand

> should be doing in cpufreq_cooling now? Make it depend on SMP? Or
> calculate load the traditional way (the stuff I just removed) for the
> !SMP case?
>
> :)
>
> --
> viresh


* Re: [PATCH V2 1/2] sched/core: Rename and move schedutil_cpu_util() to core.c
  2020-10-23 12:34       ` Vincent Guittot
@ 2020-10-28  5:49         ` Viresh Kumar
  0 siblings, 0 replies; 12+ messages in thread
From: Viresh Kumar @ 2020-10-28  5:49 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Rafael J. Wysocki, linux-kernel,
	Quentin Perret, Lukasz Luba, open list:THERMAL

On 23-10-20, 14:34, Vincent Guittot wrote:
> Maybe it's time to make sched_util and pelt available for !SMP too.
> 
> With util_est and uclamp, I can see some benefits for !SMP compared to ondemand

That's a decision you guys (sched maintainers) need to make :)

-- 
viresh


* Re: [PATCH V2 2/2] thermal: cpufreq_cooling: Reuse sched_cpu_util()
  2020-10-23 10:37   ` Peter Zijlstra
@ 2020-11-12  9:41     ` Viresh Kumar
  0 siblings, 0 replies; 12+ messages in thread
From: Viresh Kumar @ 2020-11-12  9:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Amit Daniel Kachhap,
	Daniel Lezcano, Javi Merino, Zhang Rui, Amit Kucheria,
	linux-kernel, Quentin Perret, Lukasz Luba, linux-pm

On 23-10-20, 12:37, Peter Zijlstra wrote:
> Actual numbers that show the goodness would be nice ;-) Because clearly
> we're doing this to make it better.

Hi Peter,

I tried the patchset with hackbench, sysbench and schbench. None of them
showed any regression or significant improvement. schbench was the one I
was most hopeful about, as it creates the scenario where the utilization
numbers provide a better estimate of the future.

Scenario 1: The CPUs were mostly idle in the previous polling window of
the IPA governor as the tasks were sleeping, and here are the details
from the traces (load is in %):

   thermal_power_cpu_get_power: Old: cpus=00000000,000000ff freq=1200000 total_load=203 load={{0x35,0x1,0x0,0x31,0x0,0x0,0x64,0x0}} dynamic_power=1339
   thermal_power_cpu_get_power: New: cpus=00000000,000000ff freq=1200000 total_load=600 load={{0x60,0x46,0x45,0x45,0x48,0x3b,0x61,0x44}} dynamic_power=3960

Here, the "Old" line gives the load and requested_power (dynamic_power here)
numbers calculated using the idle time based implementation. And "New" is based
on CPU utilization from this patchset.

As can be clearly seen, the load and requested_power numbers are simply
incorrect in the idle time based approach and the numbers collected from CPU's
utilization are much better and will also match the expectations of the
schedutil governor.
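
Decoding the hex per-CPU loads confirms the totals:

   Old: 0x35 + 0x01 + 0x00 + 0x31 + 0x00 + 0x00 + 0x64 + 0x00
      =   53 +    1 +    0 +   49 +    0 +    0 +  100 +    0 = 203
   New: 0x60 + 0x46 + 0x45 + 0x45 + 0x48 + 0x3b + 0x61 + 0x44
      =   96 +   70 +   69 +   69 +   72 +   59 +   97 +   68 = 600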

Scenario 2: The CPUs were busy in the previous polling window of the IPA
governor:

   thermal_power_cpu_get_power: Old: cpus=00000000,000000ff freq=1200000 total_load=800 load={{0x64,0x64,0x64,0x64,0x64,0x64,0x64,0x64}} dynamic_power=5280
   thermal_power_cpu_get_power: New: cpus=00000000,000000ff freq=1200000 total_load=708 load={{0x4d,0x5c,0x5c,0x5b,0x5c,0x5c,0x51,0x5b}} dynamic_power=4672

As can be seen, the idle-time based load is 100% for all the CPUs, as it
took only the last window into account, but in reality the CPUs aren't
that loaded, as shown by the utilization numbers.
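
Here too the per-CPU loads add up to the reported totals:

   Old: 8 * 0x64 = 8 * 100 = 800
   New: 0x4d + 0x5c + 0x5c + 0x5b + 0x5c + 0x5c + 0x51 + 0x5b
      =   77 +   92 +   92 +   91 +   92 +   92 +   81 +   91 = 708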

Though this patchset improves the power estimation done by the
cpufreq_cooling driver (which now matches the frequency scaling
governor, schedutil, as well), the IPA governor doesn't necessarily
benefit from that correctness, as it chooses the next cooling state
based on multiple factors, like the current temperature, target
temperature, requested_power, all the power players (which request power
from it), etc. The algorithm there is complex and I am afraid the
improved numbers here don't necessarily translate to better numbers for
benchmarks like schbench. Another factor can be IPA's tuning for my
platform (Hikey6220).

Irrespective of the IPA governor, the estimates provided by the
cpufreq_cooling driver do improve a lot with this patchset and are
better aligned with the schedutil governor, and I believe it would be
better to merge this nevertheless.

I have already prepared the next version, which takes care of the !SMP
case; I was just holding it off until I could get some numbers out.

-- 
viresh

