linux-kernel.vger.kernel.org archive mirror
* [Patch v4 0/6] Introduce Thermal Pressure
@ 2019-10-22 20:34 Thara Gopinath
  2019-10-22 20:34 ` [Patch v4 1/6] sched/pelt.c: Add support to track thermal pressure Thara Gopinath
                   ` (7 more replies)
  0 siblings, 8 replies; 51+ messages in thread
From: Thara Gopinath @ 2019-10-22 20:34 UTC (permalink / raw)
  To: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang,
	edubezval, qperret
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

Thermal governors can respond to an overheat event on a cpu by
capping the cpu's maximum possible frequency. This in turn
means that the maximum available compute capacity of the
cpu is restricted. But today the kernel's task scheduler is
not notified when the maximum frequency of a cpu is capped.
In other words, the scheduler is unaware of maximum capacity
restrictions placed on a cpu due to thermal activity.
This patch series attempts to address this issue.
The identified benefit is better task placement among the available
cpus in the event of overheating, which in turn leads to better
performance numbers.

The reduction in the maximum possible capacity of a cpu due to a
thermal event can be considered as thermal pressure. Instantaneous
thermal pressure is hard to record and can sometimes be erroneous,
as there can be a mismatch between the actual capping of capacity
and the scheduler recording it. The solution is therefore to have a
weighted average per-cpu value for thermal pressure over time.
The weight reflects the amount of time the cpu has spent at a
capped maximum frequency. Since thermal pressure is recorded as
an average, it must be decayed periodically. The existing algorithm
in the kernel scheduler's PELT framework is reused to calculate
the weighted average. This patch series also defines a command line
interface to allow for a configurable decay period.
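A minimal sketch of the accumulate-and-decay idea described above; the
fixed-point constant and the helper name thermal_pressure_step are
illustrative only, not the kernel's actual PELT implementation (which uses
a precomputed lookup table and segmented sums):

```c
#include <stdint.h>

#define DECAY_SHIFT 10
#define DECAY_MULT  1002	/* ~0.5^(1/32) in 10-bit fixed point: half-life of 32 periods */

/*
 * One bookkeeping step of the weighted average: decay the old sum so
 * past capping loses weight, then add the current capacity delta at
 * full weight.
 */
uint64_t thermal_pressure_step(uint64_t sum, uint64_t delta)
{
	sum = (sum * DECAY_MULT) >> DECAY_SHIFT;	/* old samples lose weight */
	return sum + delta;				/* newest sample, full weight */
}
```

Run over successive periods, this converges toward the capacity delta when
capping persists and decays toward zero once capping is lifted.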

Regarding testing, basic build, boot and sanity testing have been
performed on a db845c platform with a Debian file system.
Further, dhrystone and hackbench tests have been
run with the thermal pressure algorithm. During testing, due to
constraints of the step-wise governor in dealing with big.LITTLE systems,
the trip point 0 temperature was made asymmetric between the cpus in the
little cluster and the big cluster; the idea being that the
big cores will heat up and the cpu cooling device will throttle the
frequency of the big cores faster, thereby limiting the maximum available
capacity, and the scheduler will spread out tasks to the little cores as well.

Test Results

Hackbench: 1 group, 30000 loops, 10 runs
                                               Result         SD             
                                               (Secs)     (% of mean)     
 No Thermal Pressure                            14.03       2.69%           
 Thermal Pressure PELT Algo. Decay : 32 ms      13.29       0.56%         
 Thermal Pressure PELT Algo. Decay : 64 ms      12.57       1.56%           
 Thermal Pressure PELT Algo. Decay : 128 ms     12.71       1.04%         
 Thermal Pressure PELT Algo. Decay : 256 ms     12.29       1.42%           
 Thermal Pressure PELT Algo. Decay : 512 ms     12.42       1.15%  

Dhrystone Run Time  : 20 threads, 3000 MLOOPS
                                                 Result      SD             
                                                 (Secs)    (% of mean)     
 No Thermal Pressure                              9.452      4.49%
 Thermal Pressure PELT Algo. Decay : 32 ms        8.793      5.30%
 Thermal Pressure PELT Algo. Decay : 64 ms        8.981      5.29%
 Thermal Pressure PELT Algo. Decay : 128 ms       8.647      6.62%
 Thermal Pressure PELT Algo. Decay : 256 ms       8.774      6.45%
 Thermal Pressure PELT Algo. Decay : 512 ms       8.603      5.41%  

A Brief History

The first version of this patch series was posted reusing the
PELT algorithm to decay the thermal pressure signal. The discussions
that followed were around whether an instantaneous thermal pressure
solution is better and whether a stand-alone algorithm to accumulate
and decay thermal pressure is more appropriate than reusing the
PELT framework.
Tests on Hikey960 showed the stand-alone algorithm performing slightly
better than reusing the PELT algorithm, and v2 was posted with the
stand-alone algorithm. Test results were shared as part of that series.
Discussions were around reusing the PELT algorithm and running
further tests with more granular decay periods.

For some time after this, development was impeded due to hardware
unavailability and some other unforeseen and possibly unfortunate events.
For this version, the h/w was switched from Hikey960 to db845c.
Instantaneous thermal pressure was not tested as part of this
cycle, as it is clear that a weighted average is a better implementation.
In this round of testing, the non-PELT algorithm never gave any conclusive
results to prove that it is better than reusing the PELT algorithm.
Also, reusing the PELT algorithm means thermal pressure tracks the
other utilization signals in the scheduler.

v3->v4:
	- "Patch 3/7:sched: Initialize per cpu thermal pressure structure"
	   is dropped as it is no longer needed following changes in other
	   other patches.
	- rest of the change log mentioned in specific patches.

Thara Gopinath (6):
  sched/pelt.c: Add support to track thermal pressure
  sched: Add infrastructure to store and update instantaneous thermal
    pressure
  sched/fair: Enable CFS periodic tick to update thermal pressure
  sched/fair: update cpu_capcity to reflect thermal pressure
  thermal/cpu-cooling: Update thermal pressure in case of a maximum
    frequency capping
  sched: thermal: Enable tuning of decay period

 Documentation/admin-guide/kernel-parameters.txt |  5 ++
 drivers/thermal/cpu_cooling.c                   | 31 ++++++++++-
 include/linux/sched.h                           |  8 +++
 kernel/sched/Makefile                           |  2 +-
 kernel/sched/fair.c                             |  6 +++
 kernel/sched/pelt.c                             | 13 +++++
 kernel/sched/pelt.h                             |  7 +++
 kernel/sched/sched.h                            |  1 +
 kernel/sched/thermal.c                          | 68 +++++++++++++++++++++++++
 kernel/sched/thermal.h                          | 13 +++++
 10 files changed, 151 insertions(+), 3 deletions(-)
 create mode 100644 kernel/sched/thermal.c
 create mode 100644 kernel/sched/thermal.h

-- 
2.1.4


^ permalink raw reply	[flat|nested] 51+ messages in thread

* [Patch v4 1/6] sched/pelt.c: Add support to track thermal pressure
  2019-10-22 20:34 [Patch v4 0/6] Introduce Thermal Pressure Thara Gopinath
@ 2019-10-22 20:34 ` Thara Gopinath
  2019-10-31  9:47   ` Ionela Voinescu
  2019-10-22 20:34 ` [Patch v4 2/6] sched: Add infrastructure to store and update instantaneous " Thara Gopinath
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 51+ messages in thread
From: Thara Gopinath @ 2019-10-22 20:34 UTC (permalink / raw)
  To: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang,
	edubezval, qperret
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

Extrapolating on the existing framework to track rt/dl utilization using
PELT signals, add a similar mechanism to track thermal pressure. The
difference here from rt/dl utilization tracking is that, instead of
tracking time spent by a cpu running an rt/dl task through util_avg,
the average thermal pressure is tracked through load_avg. This is
because the thermal pressure signal is a weighted "delta" capacity
and is not binary (util_avg is binary). "Delta capacity" here
means the delta between the actual capacity of a cpu and the decreased
capacity of the cpu due to a thermal event.
In order to track average thermal pressure, a new sched_avg variable
avg_thermal is introduced. The function update_thermal_load_avg can be
called to do the periodic bookkeeping (accumulate, decay and average)
of the thermal pressure.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---
v3->v4:
	- Renamed update_thermal_avg to update_thermal_load_avg.
	- Fixed typos as per review comments on mailing list
	- Reordered the code as per review comments

 kernel/sched/pelt.c  | 13 +++++++++++++
 kernel/sched/pelt.h  |  7 +++++++
 kernel/sched/sched.h |  1 +
 3 files changed, 21 insertions(+)

diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index a96db50..3821069 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -353,6 +353,19 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 	return 0;
 }
 
+int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
+{
+	if (___update_load_sum(now, &rq->avg_thermal,
+			       capacity,
+			       capacity,
+			       capacity)) {
+		___update_load_avg(&rq->avg_thermal, 1, 1);
+		return 1;
+	}
+
+	return 0;
+}
+
 #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
 /*
  * irq:
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index afff644..c74226d 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -6,6 +6,7 @@ int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se
 int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
 int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
+int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity);
 
 #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
 int update_irq_load_avg(struct rq *rq, u64 running);
@@ -159,6 +160,12 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 }
 
 static inline int
+update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
+{
+	return 0;
+}
+
+static inline int
 update_irq_load_avg(struct rq *rq, u64 running)
 {
 	return 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0db2c1b..d5d82c8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -944,6 +944,7 @@ struct rq {
 #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
 	struct sched_avg	avg_irq;
 #endif
+	struct sched_avg	avg_thermal;
 	u64			idle_stamp;
 	u64			avg_idle;
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [Patch v4 2/6] sched: Add infrastructure to store and update instantaneous thermal pressure
  2019-10-22 20:34 [Patch v4 0/6] Introduce Thermal Pressure Thara Gopinath
  2019-10-22 20:34 ` [Patch v4 1/6] sched/pelt.c: Add support to track thermal pressure Thara Gopinath
@ 2019-10-22 20:34 ` Thara Gopinath
  2019-10-28 15:21   ` Peter Zijlstra
  2019-11-01 12:17   ` Dietmar Eggemann
  2019-10-22 20:34 ` [Patch v4 3/6] sched/fair: Enable CFS periodic tick to update " Thara Gopinath
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 51+ messages in thread
From: Thara Gopinath @ 2019-10-22 20:34 UTC (permalink / raw)
  To: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang,
	edubezval, qperret
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

Add thermal.c and thermal.h files that provide interface
APIs to initialize, update/average, track, accumulate and decay
thermal pressure on a per-cpu basis. A per-cpu variable delta_capacity is
introduced to keep track of the instantaneous per-cpu thermal pressure.
Thermal pressure is the delta between the maximum capacity and the capped
capacity due to a thermal event.
The API trigger_thermal_pressure_average is called for periodic accumulation
and decay of the thermal pressure. It is to be called from a
periodic tick function. This API passes the instantaneous delta
capacity of a cpu on to update_thermal_load_avg to do the necessary
accumulate, decay and average.
The API update_thermal_pressure is for the system to update the thermal
pressure by providing a capped frequency ratio.
Considering that trigger_thermal_pressure_average reads delta_capacity and
update_thermal_pressure writes into delta_capacity, one can argue for
some sort of locking mechanism to avoid a stale value.
But considering that trigger_thermal_pressure_average can be called from a
system-critical path like the scheduler tick function, a locking mechanism
is not ideal. This means that it is possible the delta_capacity value
used to calculate the average thermal pressure for a cpu can be
stale for up to one tick period.
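A hypothetical userspace model of the lockless pattern argued for above:
the writer and reader touch a single aligned word through volatile
accesses, so the reader observes either the old or the new value, never a
torn one. The helper names and the NR_CPUS value are made up for
illustration and are not the kernel's API:

```c
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))
#define READ_ONCE(x)     (*(volatile __typeof__(x) *)&(x))

#define NR_CPUS 4

static unsigned long delta_capacity[NR_CPUS];

/* Writer side: the thermal framework records the new instantaneous pressure. */
void set_thermal_pressure(int cpu, unsigned long delta)
{
	WRITE_ONCE(delta_capacity[cpu], delta);
}

/* Reader side: the tick path samples the current value, possibly one tick stale. */
unsigned long get_thermal_pressure(int cpu)
{
	return READ_ONCE(delta_capacity[cpu]);
}
```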

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---
v3->v4:
	- Dropped per cpu max_capacity_info struct and instead added a per
	  delta_capacity variable to store the delta between maximum
	  capacity and capped capacity. The delta is now calculated when
	  thermal pressure is updated and not every tick.
	- Dropped populate_max_capacity_info api as only per cpu delta
	  capacity is stored.
	- Renamed update_periodic_maxcap to
	  trigger_thermal_pressure_average and update_maxcap_capacity to
	  update_thermal_pressure.

 include/linux/sched.h  |  8 ++++++++
 kernel/sched/Makefile  |  2 +-
 kernel/sched/thermal.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/thermal.h | 13 +++++++++++++
 4 files changed, 67 insertions(+), 1 deletion(-)
 create mode 100644 kernel/sched/thermal.c
 create mode 100644 kernel/sched/thermal.h

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2c2e56b..d7ef543 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1983,6 +1983,14 @@ static inline void rseq_syscall(struct pt_regs *regs)
 
 #endif
 
+#ifdef CONFIG_SMP
+void update_thermal_pressure(int cpu, u64 capacity);
+#else
+static inline void update_thermal_pressure(int cpu, u64 capacity)
+{
+}
+#endif
+
 const struct sched_avg *sched_trace_cfs_rq_avg(struct cfs_rq *cfs_rq);
 char *sched_trace_cfs_rq_path(struct cfs_rq *cfs_rq, char *str, int len);
 int sched_trace_cfs_rq_cpu(struct cfs_rq *cfs_rq);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 21fb5a5..4d3b820 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -20,7 +20,7 @@ obj-y += core.o loadavg.o clock.o cputime.o
 obj-y += idle.o fair.o rt.o deadline.o
 obj-y += wait.o wait_bit.o swait.o completion.o
 
-obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o
+obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o thermal.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
diff --git a/kernel/sched/thermal.c b/kernel/sched/thermal.c
new file mode 100644
index 0000000..0c84960
--- /dev/null
+++ b/kernel/sched/thermal.c
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Scheduler Thermal Interactions
+ *
+ *  Copyright (C) 2018 Linaro, Inc., Thara Gopinath <thara.gopinath@linaro.org>
+ */
+
+#include <linux/sched.h>
+#include "sched.h"
+#include "pelt.h"
+#include "thermal.h"
+
+static DEFINE_PER_CPU(unsigned long, delta_capacity);
+
+/**
+ * update_thermal_pressure: Update thermal pressure
+ * @cpu: the cpu for which thermal pressure is to be updated
+ * @capped_freq_ratio: capped max frequency << SCHED_CAPACITY_SHIFT / max freq
+ *
+ * capped_freq_ratio is normalized into capped capacity and the delta between
+ * the arch_scale_cpu_capacity and capped capacity is stored in per cpu
+ * delta_capacity.
+ */
+void update_thermal_pressure(int cpu, u64 capped_freq_ratio)
+{
+	unsigned long __capacity, delta;
+
+	/* Normalize the capped freq ratio */
+	__capacity = (capped_freq_ratio * arch_scale_cpu_capacity(cpu)) >>
+							SCHED_CAPACITY_SHIFT;
+	delta = arch_scale_cpu_capacity(cpu) -  __capacity;
+	pr_debug("updating cpu%d thermal pressure to %lu\n", cpu, delta);
+
+	per_cpu(delta_capacity, cpu) = delta;
+}
+
+/**
+ * trigger_thermal_pressure_average: Trigger the thermal pressure accumulate
+ *				     and average algorithm
+ */
+void trigger_thermal_pressure_average(struct rq *rq)
+{
+	update_thermal_load_avg(rq_clock_task(rq), rq,
+				per_cpu(delta_capacity, cpu_of(rq)));
+}
diff --git a/kernel/sched/thermal.h b/kernel/sched/thermal.h
new file mode 100644
index 0000000..26e5b07
--- /dev/null
+++ b/kernel/sched/thermal.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Scheduler thermal interaction internal methods.
+ */
+
+#ifdef CONFIG_SMP
+void trigger_thermal_pressure_average(struct rq *rq);
+
+#else
+static inline void trigger_thermal_pressure_average(struct rq *rq)
+{
+}
+#endif
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [Patch v4 3/6] sched/fair: Enable CFS periodic tick to update thermal pressure
  2019-10-22 20:34 [Patch v4 0/6] Introduce Thermal Pressure Thara Gopinath
  2019-10-22 20:34 ` [Patch v4 1/6] sched/pelt.c: Add support to track thermal pressure Thara Gopinath
  2019-10-22 20:34 ` [Patch v4 2/6] sched: Add infrastructure to store and update instantaneous " Thara Gopinath
@ 2019-10-22 20:34 ` Thara Gopinath
  2019-10-28 15:24   ` Peter Zijlstra
  2019-10-31 16:11   ` Dietmar Eggemann
  2019-10-22 20:34 ` [Patch v4 4/6] sched/fair: update cpu_capcity to reflect " Thara Gopinath
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 51+ messages in thread
From: Thara Gopinath @ 2019-10-22 20:34 UTC (permalink / raw)
  To: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang,
	edubezval, qperret
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

Introduce support in CFS periodic tick to trigger the process of
computing average thermal pressure for a cpu.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---
 kernel/sched/fair.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 682a754..4f9c2cb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -21,6 +21,7 @@
  *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
  */
 #include "sched.h"
+#include "thermal.h"
 
 #include <trace/events/sched.h>
 
@@ -7574,6 +7575,8 @@ static void update_blocked_averages(int cpu)
 		done = false;
 
 	update_blocked_load_status(rq, !done);
+
+	trigger_thermal_pressure_average(rq);
 	rq_unlock_irqrestore(rq, &rf);
 }
 
@@ -9933,6 +9936,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	update_misfit_status(curr, rq);
 	update_overutilized_status(task_rq(curr));
+
+	trigger_thermal_pressure_average(rq);
 }
 
 /*
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [Patch v4 4/6] sched/fair: update cpu_capcity to reflect thermal pressure
  2019-10-22 20:34 [Patch v4 0/6] Introduce Thermal Pressure Thara Gopinath
                   ` (2 preceding siblings ...)
  2019-10-22 20:34 ` [Patch v4 3/6] sched/fair: Enable CFS periodic tick to update " Thara Gopinath
@ 2019-10-22 20:34 ` Thara Gopinath
  2019-10-23 12:28   ` Qais Yousef
  2019-10-22 20:34 ` [Patch v4 5/6] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping Thara Gopinath
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 51+ messages in thread
From: Thara Gopinath @ 2019-10-22 20:34 UTC (permalink / raw)
  To: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang,
	edubezval, qperret
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

cpu_capacity reflects the maximum available capacity of a cpu. Thermal
pressure on a cpu means this maximum available capacity is reduced. This
patch subtracts the average thermal pressure for a cpu from its maximum
available capacity so that cpu_capacity reflects the actual
available capacity.
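
A simplified model of what the one-line change does inside
scale_rt_capacity(): thermal pressure is accounted alongside rt and dl
utilization when computing the capacity left for CFS tasks. The function
name effective_capacity and the sample numbers are illustrative, and the
real kernel function additionally accounts irq time and uses READ_ONCE on
the averages:

```c
/*
 * Capacity remaining for CFS after subtracting rt/dl utilization and the
 * average thermal pressure, on a SCHED_CAPACITY_SCALE (1024) basis.
 */
unsigned long effective_capacity(unsigned long max,
				 unsigned long rt_util,
				 unsigned long dl_util,
				 unsigned long thermal_avg)
{
	unsigned long used = rt_util + dl_util + thermal_avg;

	if (used >= max)
		return 1;	/* never report a zero capacity */

	return max - used;
}
```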

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---
 kernel/sched/fair.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4f9c2cb..be3e802 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7727,6 +7727,7 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
 
 	used = READ_ONCE(rq->avg_rt.util_avg);
 	used += READ_ONCE(rq->avg_dl.util_avg);
+	used += READ_ONCE(rq->avg_thermal.load_avg);
 
 	if (unlikely(used >= max))
 		return 1;
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [Patch v4 5/6] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping
  2019-10-22 20:34 [Patch v4 0/6] Introduce Thermal Pressure Thara Gopinath
                   ` (3 preceding siblings ...)
  2019-10-22 20:34 ` [Patch v4 4/6] sched/fair: update cpu_capcity to reflect " Thara Gopinath
@ 2019-10-22 20:34 ` Thara Gopinath
  2019-10-28 15:33   ` Peter Zijlstra
  2019-10-31 16:29   ` Dietmar Eggemann
  2019-10-22 20:34 ` [Patch v4 6/6] sched: thermal: Enable tuning of decay period Thara Gopinath
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 51+ messages in thread
From: Thara Gopinath @ 2019-10-22 20:34 UTC (permalink / raw)
  To: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang,
	edubezval, qperret
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

Thermal governors can request a cpu's maximum supported frequency
to be capped in case of an overheat event. This in turn means that the
maximum capacity available for tasks to run on that particular cpu is
reduced. The delta between the original maximum capacity and the capped
maximum capacity is known as thermal pressure. Enable the cpufreq cooling
device to update the thermal pressure in the event of a capped
maximum frequency.
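
The arithmetic performed by the cooling device and by
update_thermal_pressure() can be sketched as below; the helper names are
hypothetical, with example figures for a cpu of capacity 1024 whose 3 GHz
maximum frequency is capped to 1.5 GHz:

```c
#define SCHED_CAPACITY_SHIFT 10

/* Ratio passed by the cooling device: capped max freq scaled against max freq. */
unsigned long capped_freq_ratio(unsigned long cur_max_freq, unsigned long max_freq)
{
	return (cur_max_freq << SCHED_CAPACITY_SHIFT) / max_freq;
}

/* The scheduler side then turns the ratio back into a capacity delta. */
unsigned long thermal_delta(unsigned long ratio, unsigned long cpu_capacity)
{
	unsigned long capped = (ratio * cpu_capacity) >> SCHED_CAPACITY_SHIFT;

	return cpu_capacity - capped;
}
```

Halving the frequency yields a ratio of 512/1024 and hence a thermal
pressure of half the cpu's capacity.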

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---
 drivers/thermal/cpu_cooling.c | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index 391f397..2e6a979 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -218,6 +218,23 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
 }
 
 /**
+ * update_sched_max_capacity - update scheduler about change in cpu
+ *					max frequency.
+ * @cpus - cpumask of cpus whose max frequency is capped.
+ */
+static void update_sched_max_capacity(struct cpumask *cpus,
+				      unsigned int cur_max_freq,
+				      unsigned int max_freq)
+{
+	int cpu;
+	unsigned long capacity = (cur_max_freq << SCHED_CAPACITY_SHIFT) /
+				  max_freq;
+
+	for_each_cpu(cpu, cpus)
+		update_thermal_pressure(cpu, capacity);
+}
+
+/**
  * get_load() - get load for a cpu since last updated
  * @cpufreq_cdev:	&struct cpufreq_cooling_device for this cpu
  * @cpu:	cpu number
@@ -320,6 +337,7 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
 				 unsigned long state)
 {
 	struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
+	int ret;
 
 	/* Request state should be less than max_level */
 	if (WARN_ON(state > cpufreq_cdev->max_level))
@@ -331,8 +349,17 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
 
 	cpufreq_cdev->cpufreq_state = state;
 
-	return dev_pm_qos_update_request(&cpufreq_cdev->qos_req,
-				cpufreq_cdev->freq_table[state].frequency);
+	ret = dev_pm_qos_update_request
+				(&cpufreq_cdev->qos_req,
+				 cpufreq_cdev->freq_table[state].frequency);
+
+	if (ret > 0)
+		update_sched_max_capacity
+				(cpufreq_cdev->policy->cpus,
+				 cpufreq_cdev->freq_table[state].frequency,
+				 cpufreq_cdev->policy->cpuinfo.max_freq);
+
+	return ret;
 }
 
 /**
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [Patch v4 6/6] sched: thermal: Enable tuning of decay period
  2019-10-22 20:34 [Patch v4 0/6] Introduce Thermal Pressure Thara Gopinath
                   ` (4 preceding siblings ...)
  2019-10-22 20:34 ` [Patch v4 5/6] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping Thara Gopinath
@ 2019-10-22 20:34 ` Thara Gopinath
  2019-10-28 15:42   ` Peter Zijlstra
  2019-11-04 16:12   ` Ionela Voinescu
  2019-10-29 15:34 ` [Patch v4 0/6] Introduce Thermal Pressure Daniel Lezcano
  2019-10-31  9:44 ` Ionela Voinescu
  7 siblings, 2 replies; 51+ messages in thread
From: Thara Gopinath @ 2019-10-22 20:34 UTC (permalink / raw)
  To: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang,
	edubezval, qperret
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

Thermal pressure follows PELT signals, which means the
decay period for thermal pressure is the default PELT
decay period. Depending on SoC characteristics and thermal
activity, it might be beneficial to decay thermal pressure
slower, but still in tune with the PELT signals.
One way to achieve this is to provide a command line parameter
to set the decay coefficient to an integer between 0 and 10.
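
A sketch of the effect of the coefficient: right-shifting the timestamp
fed into the PELT update makes time appear to pass 2^coeff times slower,
stretching the default 32 ms decay period to 32 << coeff ms. The helper
names are illustrative, not the kernel's:

```c
#include <stdint.h>

/* Clock value handed to the PELT update, slowed down by the coefficient. */
uint64_t thermal_pelt_clock(uint64_t now_ns, int decay_coeff)
{
	return now_ns >> decay_coeff;
}

/* Resulting effective decay period of the thermal pressure signal. */
unsigned int thermal_decay_period_ms(int decay_coeff)
{
	return 32U << decay_coeff;	/* default PELT half-life is 32 ms */
}
```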

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---
v3->v4:
	- Removed the sysctl setting to tune the decay period and instead
	  introduced a command line parameter to control it. The rationale
	  here being that changing the decay period of a PELT signal at
	  runtime can result in a skewed average value for at least some cycles.

 Documentation/admin-guide/kernel-parameters.txt |  5 +++++
 kernel/sched/thermal.c                          | 25 ++++++++++++++++++++++++-
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a84a83f..61d7baa 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4273,6 +4273,11 @@
 			incurs a small amount of overhead in the scheduler
 			but is useful for debugging and performance tuning.
 
+	sched_thermal_decay_coeff=
+			[KNL, SMP] Set decay coefficient for thermal pressure signal.
+			Format: integer between 0 and 10
+			Default is 0.
+
 	skew_tick=	[KNL] Offset the periodic timer tick per cpu to mitigate
 			xtime_lock contention on larger systems, and/or RCU lock
 			contention on all systems with CONFIG_MAXSMP set.
diff --git a/kernel/sched/thermal.c b/kernel/sched/thermal.c
index 0c84960..0da31e1 100644
--- a/kernel/sched/thermal.c
+++ b/kernel/sched/thermal.c
@@ -10,6 +10,28 @@
 #include "pelt.h"
 #include "thermal.h"
 
+/**
+ * By default the decay period is the default PELT decay period.
+ * The decay coefficient can change the decay period in
+ * multiples of 32.
+ *   Decay coefficient    Decay period(ms)
+ *	0			32
+ *	1			64
+ *	2			128
+ *	3			256
+ *	4			512
+ */
+static int sched_thermal_decay_coeff;
+
+static int __init setup_sched_thermal_decay_coeff(char *str)
+{
+	if (kstrtoint(str, 0, &sched_thermal_decay_coeff))
+		pr_warn("Unable to set scheduler thermal pressure decay coefficient\n");
+
+	return 1;
+}
+__setup("sched_thermal_decay_coeff=", setup_sched_thermal_decay_coeff);
+
 static DEFINE_PER_CPU(unsigned long, delta_capacity);
 
 /**
@@ -40,6 +62,7 @@ void update_thermal_pressure(int cpu, u64 capped_freq_ratio)
  */
 void trigger_thermal_pressure_average(struct rq *rq)
 {
-	update_thermal_load_avg(rq_clock_task(rq), rq,
+	update_thermal_load_avg(rq_clock_task(rq) >>
+				sched_thermal_decay_coeff, rq,
 				per_cpu(delta_capacity, cpu_of(rq)));
 }
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [Patch v4 4/6] sched/fair: update cpu_capcity to reflect thermal pressure
  2019-10-22 20:34 ` [Patch v4 4/6] sched/fair: update cpu_capcity to reflect " Thara Gopinath
@ 2019-10-23 12:28   ` Qais Yousef
  2019-10-28 15:30     ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Qais Yousef @ 2019-10-23 12:28 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang,
	edubezval, qperret, linux-kernel, amit.kachhap, javi.merino,
	daniel.lezcano

On 10/22/19 16:34, Thara Gopinath wrote:
> cpu_capacity reflects the maximum available capacity of a cpu. Thermal
> pressure on a cpu means this maximum available capacity is reduced. This
> patch subtracts the average thermal pressure for a cpu from its maximum
> available capacity so that cpu_capacity reflects the actual
> available capacity.
> 
> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> ---
>  kernel/sched/fair.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4f9c2cb..be3e802 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7727,6 +7727,7 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
>  
>  	used = READ_ONCE(rq->avg_rt.util_avg);
>  	used += READ_ONCE(rq->avg_dl.util_avg);
> +	used += READ_ONCE(rq->avg_thermal.load_avg);

Maybe a naive question - but can we add util_avg with load_avg without
a conversion? I thought the 2 signals have different properties.

Cheers

--
Qais Yousef

>  
>  	if (unlikely(used >= max))
>  		return 1;
> -- 
> 2.1.4
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Patch v4 2/6] sched: Add infrastructure to store and update instantaneous thermal pressure
  2019-10-22 20:34 ` [Patch v4 2/6] sched: Add infrastructure to store and update instantaneous " Thara Gopinath
@ 2019-10-28 15:21   ` Peter Zijlstra
  2019-10-30 21:37     ` Thara Gopinath
  2019-11-01 12:17   ` Dietmar Eggemann
  1 sibling, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2019-10-28 15:21 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: mingo, ionela.voinescu, vincent.guittot, rui.zhang, edubezval,
	qperret, linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On Tue, Oct 22, 2019 at 04:34:21PM -0400, Thara Gopinath wrote:
> Add thermal.c and thermal.h files that provide interface
> APIs to initialize, update/average, track, accumulate and decay
> thermal pressure on a per-cpu basis. A per-cpu variable delta_capacity is
> introduced to keep track of the instantaneous per-cpu thermal pressure.
> Thermal pressure is the delta between the maximum capacity and the capped
> capacity due to a thermal event.

> The API trigger_thermal_pressure_average is called for periodic accumulation
> and decay of the thermal pressure. It is to be called from a
> periodic tick function. This API passes the instantaneous delta
> capacity of a cpu on to update_thermal_load_avg to do the necessary
> accumulate, decay and average.

> The API update_thermal_pressure is for the system to update the thermal
> pressure by providing a capped frequency ratio.

> Considering that trigger_thermal_pressure_average reads delta_capacity and
> update_thermal_pressure writes into delta_capacity, one can argue for
> some sort of locking mechanism to avoid a stale value.

> But considering that trigger_thermal_pressure_average can be called from a
> system-critical path like the scheduler tick function, a locking mechanism
> is not ideal. This means that it is possible the delta_capacity value
> used to calculate the average thermal pressure for a cpu can be
> stale for up to one tick period.

Please use a blank line at the end of a paragraph.

> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> ---

>  include/linux/sched.h  |  8 ++++++++
>  kernel/sched/Makefile  |  2 +-
>  kernel/sched/thermal.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/thermal.h | 13 +++++++++++++
>  4 files changed, 67 insertions(+), 1 deletion(-)
>  create mode 100644 kernel/sched/thermal.c
>  create mode 100644 kernel/sched/thermal.h

These are some tiny files, do these functions really need their own
little files?


> diff --git a/kernel/sched/thermal.c b/kernel/sched/thermal.c
> new file mode 100644
> index 0000000..0c84960
> --- /dev/null
> +++ b/kernel/sched/thermal.c
> @@ -0,0 +1,45 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Scheduler Thermal Interactions
> + *
> + *  Copyright (C) 2018 Linaro, Inc., Thara Gopinath <thara.gopinath@linaro.org>
> + */
> +
> +#include <linux/sched.h>
> +#include "sched.h"
> +#include "pelt.h"
> +#include "thermal.h"
> +
> +static DEFINE_PER_CPU(unsigned long, delta_capacity);
> +
> +/**
> + * update_thermal_pressure: Update thermal pressure
> + * @cpu: the cpu for which thermal pressure is to be updated
> + * @capped_freq_ratio: capped max frequency << SCHED_CAPACITY_SHIFT / max freq
> + *
> + * capped_freq_ratio is normalized into capped capacity and the delta between
> + * the arch_scale_cpu_capacity and capped capacity is stored in per cpu
> + * delta_capacity.
> + */
> +void update_thermal_pressure(int cpu, u64 capped_freq_ratio)
> +{
> +	unsigned long __capacity, delta;
> +
> +	/* Normalize the capped freq ratio */
> +	__capacity = (capped_freq_ratio * arch_scale_cpu_capacity(cpu)) >>
> +							SCHED_CAPACITY_SHIFT;
> +	delta = arch_scale_cpu_capacity(cpu) -  __capacity;
> +	pr_debug("updating cpu%d thermal pressure to %lu\n", cpu, delta);

Surely we can do without the pr_debug() here?

> +	per_cpu(delta_capacity, cpu) = delta;
> +}

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Patch v4 3/6] sched/fair: Enable CFS periodic tick to update thermal pressure
  2019-10-22 20:34 ` [Patch v4 3/6] sched/fair: Enable CFS periodic tick to update " Thara Gopinath
@ 2019-10-28 15:24   ` Peter Zijlstra
  2019-10-28 15:27     ` Peter Zijlstra
  2019-10-30 21:41     ` Thara Gopinath
  2019-10-31 16:11   ` Dietmar Eggemann
  1 sibling, 2 replies; 51+ messages in thread
From: Peter Zijlstra @ 2019-10-28 15:24 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: mingo, ionela.voinescu, vincent.guittot, rui.zhang, edubezval,
	qperret, linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On Tue, Oct 22, 2019 at 04:34:22PM -0400, Thara Gopinath wrote:
> Introduce support in CFS periodic tick to trigger the process of
> computing average thermal pressure for a cpu.
> 
> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> ---
>  kernel/sched/fair.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 682a754..4f9c2cb 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -21,6 +21,7 @@
>   *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
>   */
>  #include "sched.h"
> +#include "thermal.h"
>  
>  #include <trace/events/sched.h>
>  
> @@ -7574,6 +7575,8 @@ static void update_blocked_averages(int cpu)
>  		done = false;
>  
>  	update_blocked_load_status(rq, !done);
> +
> +	trigger_thermal_pressure_average(rq);
>  	rq_unlock_irqrestore(rq, &rf);
>  }

This changes only 1 of the 2 implementations of
update_blocked_averages(). Also, how does this interact with
rq->has_blocked_load ?

> @@ -9933,6 +9936,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>  
>  	update_misfit_status(curr, rq);
>  	update_overutilized_status(task_rq(curr));
> +
> +	trigger_thermal_pressure_average(rq);
>  }

This seems to imply all this thermal stuff is fair only, in which case I
could suggest putting those few lines in fair.c. ~45 extra lines on
1e5+ lines really doesn't matter.


* Re: [Patch v4 3/6] sched/fair: Enable CFS periodic tick to update thermal pressure
  2019-10-28 15:24   ` Peter Zijlstra
@ 2019-10-28 15:27     ` Peter Zijlstra
  2019-10-30 21:41     ` Thara Gopinath
  1 sibling, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2019-10-28 15:27 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: mingo, ionela.voinescu, vincent.guittot, rui.zhang, edubezval,
	qperret, linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On Mon, Oct 28, 2019 at 04:24:21PM +0100, Peter Zijlstra wrote:
> On Tue, Oct 22, 2019 at 04:34:22PM -0400, Thara Gopinath wrote:
> > Introduce support in CFS periodic tick to trigger the process of
> > computing average thermal pressure for a cpu.
> > 
> > Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> > ---
> >  kernel/sched/fair.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 682a754..4f9c2cb 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -21,6 +21,7 @@
> >   *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
> >   */
> >  #include "sched.h"
> > +#include "thermal.h"
> >  
> >  #include <trace/events/sched.h>
> >  
> > @@ -7574,6 +7575,8 @@ static void update_blocked_averages(int cpu)
> >  		done = false;
> >  
> >  	update_blocked_load_status(rq, !done);
> > +
> > +	trigger_thermal_pressure_average(rq);
> >  	rq_unlock_irqrestore(rq, &rf);
> >  }
> 
> This changes only 1 of the 2 implementations of
> update_blocked_averages(). Also, how does this interact with
> rq->has_blocked_load ?

Specifically, I'm thinking this wants a line in others_have_blocked():

+	if (READ_ONCE(rq->avg_thermal.load_avg))
+		return true;


* Re: [Patch v4 4/6] sched/fair: update cpu_capcity to reflect thermal pressure
  2019-10-23 12:28   ` Qais Yousef
@ 2019-10-28 15:30     ` Peter Zijlstra
  2019-10-31 10:53       ` Qais Yousef
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2019-10-28 15:30 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Thara Gopinath, mingo, ionela.voinescu, vincent.guittot,
	rui.zhang, edubezval, qperret, linux-kernel, amit.kachhap,
	javi.merino, daniel.lezcano

On Wed, Oct 23, 2019 at 01:28:40PM +0100, Qais Yousef wrote:
> On 10/22/19 16:34, Thara Gopinath wrote:
> > cpu_capacity reflects the maximum available capacity of a cpu. Thermal
> > pressure on a cpu means this maximum available capacity is reduced. This
> > patch reduces the average thermal pressure for a cpu from its maximum
> > available capacity so that cpu_capacity reflects the actual
> > available capacity.
> > 
> > Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> > ---
> >  kernel/sched/fair.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 4f9c2cb..be3e802 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7727,6 +7727,7 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
> >  
> >  	used = READ_ONCE(rq->avg_rt.util_avg);
> >  	used += READ_ONCE(rq->avg_dl.util_avg);
> > +	used += READ_ONCE(rq->avg_thermal.load_avg);
> 
> Maybe a naive question - but can we add util_avg with load_avg without
> a conversion? I thought the 2 signals have different properties.

Changelog of patch #1 explains, it's in that dense blob of text.

But yes, you're quite right that that wants a comment here.


* Re: [Patch v4 5/6] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping
  2019-10-22 20:34 ` [Patch v4 5/6] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping Thara Gopinath
@ 2019-10-28 15:33   ` Peter Zijlstra
  2019-10-31 16:29   ` Dietmar Eggemann
  1 sibling, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2019-10-28 15:33 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: mingo, ionela.voinescu, vincent.guittot, rui.zhang, edubezval,
	qperret, linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On Tue, Oct 22, 2019 at 04:34:24PM -0400, Thara Gopinath wrote:
> Thermal governors can request for a cpu's maximum supported frequency
> to be capped in case of an overheat event. This in turn means that the
> maximum capacity available for tasks to run on the particular cpu is
> reduced. Delta between the original maximum capacity and capped
> maximum capacity is known as thermal pressure. Enable cpufreq cooling
> device to update the thermal pressure in event of a capped
> maximum frequency.
> 
> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> ---
>  drivers/thermal/cpu_cooling.c | 31 +++++++++++++++++++++++++++++--
>  1 file changed, 29 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> index 391f397..2e6a979 100644
> --- a/drivers/thermal/cpu_cooling.c
> +++ b/drivers/thermal/cpu_cooling.c
> @@ -218,6 +218,23 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
>  }
>  
>  /**
> + * update_sched_max_capacity - update scheduler about change in cpu
> + *					max frequency.
> + * @policy - cpufreq policy whose max frequency is capped.

Uhm, this function doesn't have a @policy argument.

> + */
> +static void update_sched_max_capacity(struct cpumask *cpus,
> +				      unsigned int cur_max_freq,
> +				      unsigned int max_freq)
> +{
> +	int cpu;
> +	unsigned long capacity = (cur_max_freq << SCHED_CAPACITY_SHIFT) /
> +				  max_freq;

check your types and ranges. What units is _freq in and does 'freq *
SCHED_CAPACITY' fit in 'unsigned int'? If so, why do you assign it to an
'unsigned long'?

> +
> +	for_each_cpu(cpu, cpus)
> +		update_thermal_pressure(cpu, capacity);
> +}

> @@ -331,8 +349,17 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
>  
>  	cpufreq_cdev->cpufreq_state = state;
>  
> -	return dev_pm_qos_update_request(&cpufreq_cdev->qos_req,
> -				cpufreq_cdev->freq_table[state].frequency);
> +	ret = dev_pm_qos_update_request
> +				(&cpufreq_cdev->qos_req,
> +				 cpufreq_cdev->freq_table[state].frequency);
> +
> +	if (ret > 0)
> +		update_sched_max_capacity
> +				(cpufreq_cdev->policy->cpus,
> +				 cpufreq_cdev->freq_table[state].frequency,
> +				 cpufreq_cdev->policy->cpuinfo.max_freq);

Codingstyle wants that in { }.

> +
> +	return ret;
>  }
>  
>  /**
> -- 
> 2.1.4
> 


* Re: [Patch v4 6/6] sched: thermal: Enable tuning of decay period
  2019-10-22 20:34 ` [Patch v4 6/6] sched: thermal: Enable tuning of decay period Thara Gopinath
@ 2019-10-28 15:42   ` Peter Zijlstra
  2019-11-04 16:12   ` Ionela Voinescu
  1 sibling, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2019-10-28 15:42 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: mingo, ionela.voinescu, vincent.guittot, rui.zhang, edubezval,
	qperret, linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On Tue, Oct 22, 2019 at 04:34:25PM -0400, Thara Gopinath wrote:
> Thermal pressure follows pelt signals which means the
> decay period for thermal pressure is the default pelt
> decay period. Depending on soc characteristics and thermal
> activity, it might be beneficial to decay thermal pressure
> slower, but still in-tune with the pelt signals.
> One way to achieve this is to provide a command line parameter
> to set the decay coefficient to an integer between 0 and 10.
> 
> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> ---
> v3->v4:
> 	- Removed the sysctl setting to tune decay period and instead
> 	  introduced a command line parameter to control it. The rationale
> 	  here being changing decay period of a PELT signal runtime can
> 	  result in a skewed average value for at least some cycles.

TBH, I don't care too much about that. If you touch a knob, you'd better
know what it does anyway.

>  void trigger_thermal_pressure_average(struct rq *rq)
>  {
> -	update_thermal_load_avg(rq_clock_task(rq), rq,
> +	update_thermal_load_avg(rq_clock_task(rq) >>
> +				sched_thermal_decay_coeff, rq,

You see, 'coefficient' means 'multiplicative factor', but what we have
here is a negative exponent. More specifically it is a power-of-2
exponent, and we typically call them '_shift', given how we use them
with 'shift' operators.


* Re: [Patch v4 0/6] Introduce Thermal Pressure
  2019-10-22 20:34 [Patch v4 0/6] Introduce Thermal Pressure Thara Gopinath
                   ` (5 preceding siblings ...)
  2019-10-22 20:34 ` [Patch v4 6/6] sched: thermal: Enable tuning of decay period Thara Gopinath
@ 2019-10-29 15:34 ` Daniel Lezcano
  2019-10-31 10:07   ` Ionela Voinescu
  2019-10-31  9:44 ` Ionela Voinescu
  7 siblings, 1 reply; 51+ messages in thread
From: Daniel Lezcano @ 2019-10-29 15:34 UTC (permalink / raw)
  To: Thara Gopinath, mingo, peterz, ionela.voinescu, vincent.guittot,
	rui.zhang, edubezval, qperret
  Cc: linux-kernel, amit.kachhap, javi.merino

Hi Thara,

On 22/10/2019 22:34, Thara Gopinath wrote:
> Thermal governors can respond to an overheat event of a cpu by
> capping the cpu's maximum possible frequency. This in turn
> means that the maximum available compute capacity of the
> cpu is restricted. But today in the kernel, task scheduler is 
> not notified of capping of maximum frequency of a cpu.
> In other words, scheduler is unware of maximum capacity
> restrictions placed on a cpu due to thermal activity.
> This patch series attempts to address this issue.
> The benefits identified are better task placement among available
> cpus in event of overheating which in turn leads to better
> performance numbers.
> 
> The reduction in the maximum possible capacity of a cpu due to a 
> thermal event can be considered as thermal pressure. Instantaneous
> thermal pressure is hard to record and can sometimes be erroneous
> as there can be a mismatch between the actual capping of capacity
> and the scheduler recording it. Thus the solution is to have a weighted
> average per cpu value for thermal pressure over time.
> The weight reflects the amount of time the cpu has spent at a
> capped maximum frequency. Since thermal pressure is recorded as
> an average, it must be decayed periodically. The existing algorithm
> in the kernel scheduler pelt framework is re-used to calculate
> the weighted average. This patch series also defines a sysctl
> interface to allow for a configurable decay period.
> 
> Regarding testing, basic build, boot and sanity testing have been
> performed on db845c platform with debian file system.
> Further, dhrystone and hackbench tests have been
> run with the thermal pressure algorithm. During testing, due to
> constraints of step wise governor in dealing with big little systems,
> trip point 0 temperature was made asymmetric between cpus in little
> cluster and big cluster; the idea being that
> big core will heat up and cpu cooling device will throttle the
> frequency of the big cores faster, thereby limiting the maximum available
> capacity and the scheduler will spread out tasks to little cores as well.
> 
> Test Results
> 
> Hackbench: 1 group , 30000 loops, 10 runs       
>                                                Result         SD             
>                                                (Secs)     (% of mean)     
>  No Thermal Pressure                            14.03       2.69%           
>  Thermal Pressure PELT Algo. Decay : 32 ms      13.29       0.56%         
>  Thermal Pressure PELT Algo. Decay : 64 ms      12.57       1.56%           
>  Thermal Pressure PELT Algo. Decay : 128 ms     12.71       1.04%         
>  Thermal Pressure PELT Algo. Decay : 256 ms     12.29       1.42%           
>  Thermal Pressure PELT Algo. Decay : 512 ms     12.42       1.15%  
> 
> Dhrystone Run Time  : 20 threads, 3000 MLOOPS
>                                                  Result      SD             
>                                                  (Secs)    (% of mean)     
>  No Thermal Pressure                              9.452      4.49%
>  Thermal Pressure PELT Algo. Decay : 32 ms        8.793      5.30%
>  Thermal Pressure PELT Algo. Decay : 64 ms        8.981      5.29%
>  Thermal Pressure PELT Algo. Decay : 128 ms       8.647      6.62%
>  Thermal Pressure PELT Algo. Decay : 256 ms       8.774      6.45%
>  Thermal Pressure PELT Algo. Decay : 512 ms       8.603      5.41%  

I took the opportunity to try glmark2 on the db845c platform with the
default decay and got the following glmark2 scores:

Without thermal pressure:

# NumSamples = 9; Min = 790.00; Max = 805.00
# Mean = 794.888889; Variance = 19.209877; SD = 4.382907; Median 794.000000
# each ∎ represents a count of 1
  790.0000 -   791.5000 [     2]: ∎∎
  791.5000 -   793.0000 [     2]: ∎∎
  793.0000 -   794.5000 [     2]: ∎∎
  794.5000 -   796.0000 [     1]: ∎
  796.0000 -   797.5000 [     0]:
  797.5000 -   799.0000 [     1]: ∎
  799.0000 -   800.5000 [     0]:
  800.5000 -   802.0000 [     0]:
  802.0000 -   803.5000 [     0]:
  803.5000 -   805.0000 [     1]: ∎


With thermal pressure:

# NumSamples = 9; Min = 933.00; Max = 960.00
# Mean = 940.777778; Variance = 64.172840; SD = 8.010795; Median 937.000000
# each ∎ represents a count of 1
  933.0000 -   935.7000 [     3]: ∎∎∎
  935.7000 -   938.4000 [     2]: ∎∎
  938.4000 -   941.1000 [     2]: ∎∎
  941.1000 -   943.8000 [     0]:
  943.8000 -   946.5000 [     0]:
  946.5000 -   949.2000 [     1]: ∎
  949.2000 -   951.9000 [     0]:
  951.9000 -   954.6000 [     0]:
  954.6000 -   957.3000 [     0]:
  957.3000 -   960.0000 [     1]: ∎



-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs




* Re: [Patch v4 2/6] sched: Add infrastructure to store and update instantaneous thermal pressure
  2019-10-28 15:21   ` Peter Zijlstra
@ 2019-10-30 21:37     ` Thara Gopinath
  0 siblings, 0 replies; 51+ messages in thread
From: Thara Gopinath @ 2019-10-30 21:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, ionela.voinescu, vincent.guittot, rui.zhang, edubezval,
	qperret, linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

Hello Peter,
Thanks for the review.


On 10/28/2019 11:21 AM, Peter Zijlstra wrote:
> On Tue, Oct 22, 2019 at 04:34:21PM -0400, Thara Gopinath wrote:
>> Add thermal.c and thermal.h files that provides interface
>> APIs to initialize, update/average, track, accumulate and decay
>> thermal pressure per cpu basis. A per cpu variable delta_capacity is
>> introduced to keep track of instantaneous per cpu thermal pressure.
>> Thermal pressure is the delta between maximum capacity and capped
>> capacity due to a thermal event.
> 
>> API trigger_thermal_pressure_average is called for periodic accumulate
>> and decay of the thermal pressure. It is to to be called from a
>> periodic tick function. This API passes on the instantaneous delta
>> capacity of a cpu to update_thermal_load_avg to do the necessary
>> accumulate, decay and average.
> 
>> API update_thermal_pressure is for the system to update the thermal
>> pressure by providing a capped frequency ratio.
> 
>> Considering, trigger_thermal_pressure_average reads delta_capacity and
>> update_thermal_pressure writes into delta_capacity, one can argue for
>> some sort of locking mechanism to avoid a stale value.
> 
>> But considering trigger_thermal_pressure_average can be called from a
>> system critical path like scheduler tick function, a locking mechanism
>> is not ideal. This means that it is possible the delta_capacity value
>> used to calculate average thermal pressure for a cpu can be
>> stale for up to 1 tick period.
> 
> Please use a blank line at the end of a paragraph.

Will do

> 
>> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
>> ---
> 
>>  include/linux/sched.h  |  8 ++++++++
>>  kernel/sched/Makefile  |  2 +-
>>  kernel/sched/thermal.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>  kernel/sched/thermal.h | 13 +++++++++++++
>>  4 files changed, 67 insertions(+), 1 deletion(-)
>>  create mode 100644 kernel/sched/thermal.c
>>  create mode 100644 kernel/sched/thermal.h
> 
> These are some tiny files, do these functions really need their own
> little files?

I will merge them into fair.c. There will be an update_thermal_pressure
that will be called from cpu_cooling or another relevant framework.
> 
> 
>> diff --git a/kernel/sched/thermal.c b/kernel/sched/thermal.c
>> new file mode 100644
>> index 0000000..0c84960
>> --- /dev/null
>> +++ b/kernel/sched/thermal.c
>> @@ -0,0 +1,45 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Scheduler Thermal Interactions
>> + *
>> + *  Copyright (C) 2018 Linaro, Inc., Thara Gopinath <thara.gopinath@linaro.org>
>> + */
>> +
>> +#include <linux/sched.h>
>> +#include "sched.h"
>> +#include "pelt.h"
>> +#include "thermal.h"
>> +
>> +static DEFINE_PER_CPU(unsigned long, delta_capacity);
>> +
>> +/**
>> + * update_thermal_pressure: Update thermal pressure
>> + * @cpu: the cpu for which thermal pressure is to be updated for
>> + * @capped_freq_ratio: capped max frequency << SCHED_CAPACITY_SHIFT / max freq
>> + *
>> + * capped_freq_ratio is normalized into capped capacity and the delta between
>> + * the arch_scale_cpu_capacity and capped capacity is stored in per cpu
>> + * delta_capacity.
>> + */
>> +void update_thermal_pressure(int cpu, u64 capped_freq_ratio)
>> +{
>> +	unsigned long __capacity, delta;
>> +
>> +	/* Normalize the capped freq ratio */
>> +	__capacity = (capped_freq_ratio * arch_scale_cpu_capacity(cpu)) >>
>> +							SCHED_CAPACITY_SHIFT;
>> +	delta = arch_scale_cpu_capacity(cpu) -  __capacity;
>> +	pr_debug("updating cpu%d thermal pressure to %lu\n", cpu, delta);
> 
> Surely we can do without the pr_debug() here?

Will remove. I had it while developing to check if the thermal pressure
is calculated correctly.
> 
>> +	per_cpu(delta_capacity, cpu) = delta;
>> +}


-- 
Warm Regards
Thara


* Re: [Patch v4 3/6] sched/fair: Enable CFS periodic tick to update thermal pressure
  2019-10-28 15:24   ` Peter Zijlstra
  2019-10-28 15:27     ` Peter Zijlstra
@ 2019-10-30 21:41     ` Thara Gopinath
  1 sibling, 0 replies; 51+ messages in thread
From: Thara Gopinath @ 2019-10-30 21:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, ionela.voinescu, vincent.guittot, rui.zhang, edubezval,
	qperret, linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On 10/28/2019 11:24 AM, Peter Zijlstra wrote:
> On Tue, Oct 22, 2019 at 04:34:22PM -0400, Thara Gopinath wrote:
>> Introduce support in CFS periodic tick to trigger the process of
>> computing average thermal pressure for a cpu.
>>
>> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
>> ---
>>  kernel/sched/fair.c | 5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 682a754..4f9c2cb 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -21,6 +21,7 @@
>>   *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
>>   */
>>  #include "sched.h"
>> +#include "thermal.h"
>>  
>>  #include <trace/events/sched.h>
>>  
>> @@ -7574,6 +7575,8 @@ static void update_blocked_averages(int cpu)
>>  		done = false;
>>  
>>  	update_blocked_load_status(rq, !done);
>> +
>> +	trigger_thermal_pressure_average(rq);
>>  	rq_unlock_irqrestore(rq, &rf);
>>  }
> 
> This changes only 1 of the 2 implementations of
> update_blocked_averages(). Also, how does this interact with
> rq->has_blocked_load ?
I will add it to the other implementation of update_blocked_averages and
will also update others_have_blocked.
> 
>> @@ -9933,6 +9936,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>>  
>>  	update_misfit_status(curr, rq);
>>  	update_overutilized_status(task_rq(curr));
>> +
>> +	trigger_thermal_pressure_average(rq);
>>  }
> 
> This seems to imply all this thermal stuff is fair only, in which case I
> could suggest putting those few lines in fair.c. ~45 extra lines on
> 1e5+ lines really doesn't matter.
> 


-- 
Warm Regards
Thara


* Re: [Patch v4 0/6] Introduce Thermal Pressure
  2019-10-22 20:34 [Patch v4 0/6] Introduce Thermal Pressure Thara Gopinath
                   ` (6 preceding siblings ...)
  2019-10-29 15:34 ` [Patch v4 0/6] Introduce Thermal Pressure Daniel Lezcano
@ 2019-10-31  9:44 ` Ionela Voinescu
  2019-10-31 16:41   ` Thara Gopinath
  7 siblings, 1 reply; 51+ messages in thread
From: Ionela Voinescu @ 2019-10-31  9:44 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: mingo, peterz, vincent.guittot, rui.zhang, edubezval, qperret,
	linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

Hi Thara,

On Tuesday 22 Oct 2019 at 16:34:19 (-0400), Thara Gopinath wrote:
> Thermal governors can respond to an overheat event of a cpu by
> capping the cpu's maximum possible frequency. This in turn
> means that the maximum available compute capacity of the
> cpu is restricted. But today in the kernel, task scheduler is 
> not notified of capping of maximum frequency of a cpu.
> In other words, scheduler is unware of maximum capacity

Nit: s/unware/unaware

> restrictions placed on a cpu due to thermal activity.
> This patch series attempts to address this issue.
> The benefits identified are better task placement among available
> cpus in event of overheating which in turn leads to better
> performance numbers.
> 
> The reduction in the maximum possible capacity of a cpu due to a 
> thermal event can be considered as thermal pressure. Instantaneous
> thermal pressure is hard to record and can sometimes be erroneous
> as there can be a mismatch between the actual capping of capacity
> and the scheduler recording it. Thus the solution is to have a weighted
> average per cpu value for thermal pressure over time.
> The weight reflects the amount of time the cpu has spent at a
> capped maximum frequency. Since thermal pressure is recorded as
> an average, it must be decayed periodically. The existing algorithm
> in the kernel scheduler pelt framework is re-used to calculate
> the weighted average. This patch series also defines a sysctl
> interface to allow for a configurable decay period.
> 
> Regarding testing, basic build, boot and sanity testing have been
> performed on db845c platform with debian file system.
> Further, dhrystone and hackbench tests have been
> run with the thermal pressure algorithm. During testing, due to
> constraints of step wise governor in dealing with big little systems,
> trip point 0 temperature was made asymmetric between cpus in little
> cluster and big cluster; the idea being that
> big core will heat up and cpu cooling device will throttle the
> frequency of the big cores faster, thereby limiting the maximum available
> capacity and the scheduler will spread out tasks to little cores as well.
>

Can you please share the changes you've made to sdm845.dtsi and a kernel
base on top of which to apply your patches? I would like to reproduce
your results and run more tests and it would be good if our setups were
as close as possible.

> Test Results
> 
> Hackbench: 1 group , 30000 loops, 10 runs       
>                                                Result         SD             
>                                                (Secs)     (% of mean)     
>  No Thermal Pressure                            14.03       2.69%           
>  Thermal Pressure PELT Algo. Decay : 32 ms      13.29       0.56%         
>  Thermal Pressure PELT Algo. Decay : 64 ms      12.57       1.56%           
>  Thermal Pressure PELT Algo. Decay : 128 ms     12.71       1.04%         
>  Thermal Pressure PELT Algo. Decay : 256 ms     12.29       1.42%           
>  Thermal Pressure PELT Algo. Decay : 512 ms     12.42       1.15%  
> 
> Dhrystone Run Time  : 20 threads, 3000 MLOOPS
>                                                  Result      SD             
>                                                  (Secs)    (% of mean)     
>  No Thermal Pressure                              9.452      4.49%
>  Thermal Pressure PELT Algo. Decay : 32 ms        8.793      5.30%
>  Thermal Pressure PELT Algo. Decay : 64 ms        8.981      5.29%
>  Thermal Pressure PELT Algo. Decay : 128 ms       8.647      6.62%
>  Thermal Pressure PELT Algo. Decay : 256 ms       8.774      6.45%
>  Thermal Pressure PELT Algo. Decay : 512 ms       8.603      5.41%  
>

Do you happen to know by how much the CPUs were capped during these
experiments?

Thanks,
Ionela.

> A Brief History
> 
> The first version of this patch-series was posted with reusing
> PELT algorithm to decay thermal pressure signal. The discussions
> that followed were around whether instantaneous thermal pressure
> solution is better and whether a stand-alone algorithm to accumulate
> and decay thermal pressure is more appropriate than re-using the
> PELT framework. 
> Tests on Hikey960 showed the stand-alone algorithm performing slightly
> better than reusing PELT algorithm and V2 was posted with the stand
> alone algorithm. Test results were shared as part of this series.
> Discussions were around re-using PELT algorithm and running
> further tests with more granular decay period.
> 
> For some time after this development was impeded due to hardware
> unavailability, some other unforeseen and possibly unfortunate events.
> For this version, h/w was switched from hikey960 to db845c.
> Also Instantaneous thermal pressure was never tested as part of this
> cycle as it is clear that weighted average is a better implementation.
> The non-PELT algorithm never gave any conclusive results to prove that it
> is better than reusing the PELT algorithm in this round of testing.
> Also reusing PELT algorithm means thermal pressure tracks the
> other utilization signals in the scheduler.
> 
> v3->v4:
> 	- "Patch 3/7:sched: Initialize per cpu thermal pressure structure"
> 	   is dropped as it is no longer needed following changes in other
> 	   other patches.
> 	- rest of the change log mentioned in specific patches.
> 
> Thara Gopinath (6):
>   sched/pelt.c: Add support to track thermal pressure
>   sched: Add infrastructure to store and update instantaneous thermal
>     pressure
>   sched/fair: Enable CFS periodic tick to update thermal pressure
>   sched/fair: update cpu_capcity to reflect thermal pressure
>   thermal/cpu-cooling: Update thermal pressure in case of a maximum
>     frequency capping
>   sched: thermal: Enable tuning of decay period
> 
>  Documentation/admin-guide/kernel-parameters.txt |  5 ++
>  drivers/thermal/cpu_cooling.c                   | 31 ++++++++++-
>  include/linux/sched.h                           |  8 +++
>  kernel/sched/Makefile                           |  2 +-
>  kernel/sched/fair.c                             |  6 +++
>  kernel/sched/pelt.c                             | 13 +++++
>  kernel/sched/pelt.h                             |  7 +++
>  kernel/sched/sched.h                            |  1 +
>  kernel/sched/thermal.c                          | 68 +++++++++++++++++++++++++
>  kernel/sched/thermal.h                          | 13 +++++
>  10 files changed, 151 insertions(+), 3 deletions(-)
>  create mode 100644 kernel/sched/thermal.c
>  create mode 100644 kernel/sched/thermal.h
> 
> -- 
> 2.1.4
> 


* Re: [Patch v4 1/6] sched/pelt.c: Add support to track thermal pressure
  2019-10-22 20:34 ` [Patch v4 1/6] sched/pelt.c: Add support to track thermal pressure Thara Gopinath
@ 2019-10-31  9:47   ` Ionela Voinescu
  0 siblings, 0 replies; 51+ messages in thread
From: Ionela Voinescu @ 2019-10-31  9:47 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: mingo, peterz, vincent.guittot, rui.zhang, edubezval, qperret,
	linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On Tuesday 22 Oct 2019 at 16:34:20 (-0400), Thara Gopinath wrote:
> Extrapolating on the existing framework to track rt/dl utilization using
> pelt signals, add a similar mechanism to track thermal pressure. The
> difference here from rt/dl utilization tracking is that, instead of
> tracking time spent by a cpu running a rt/dl task through util_avg,
> the average thermal pressure is tracked through load_avg. This is
> because thermal pressure signal is weighted "delta" capacity
> and is not binary(util_avg is binary). "delta capacity" here
> means delta between the actual capacity of a cpu and the decreased
> capacity a cpu due to a thermal event.
> In order to track average thermal pressure, a new sched_avg variable
> avg_thermal is introduced. Function update_thermal_avg can be called

Nit: s/update_thermal_avg/update_thermal_load_avg

> to do the periodic bookkeeping (accumulate, decay and average)
> of the thermal pressure.
> 
> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> ---
> v3->v4:
> 	- Renamed update_thermal_avg to update_thermal_load_avg.
> 	- Fixed typos as per review comments on mailing list
> 	- Reordered the code as per review comments
> 
>  kernel/sched/pelt.c  | 13 +++++++++++++
>  kernel/sched/pelt.h  |  7 +++++++
>  kernel/sched/sched.h |  1 +
>  3 files changed, 21 insertions(+)
> 
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index a96db50..3821069 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -353,6 +353,19 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
>  	return 0;
>  }
>  
> +int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
> +{
> +	if (___update_load_sum(now, &rq->avg_thermal,
> +			       capacity,
> +			       capacity,
> +			       capacity)) {
> +		___update_load_avg(&rq->avg_thermal, 1, 1);
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +
>  #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
>  /*
>   * irq:
> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> index afff644..c74226d 100644
> --- a/kernel/sched/pelt.h
> +++ b/kernel/sched/pelt.h
> @@ -6,6 +6,7 @@ int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se
>  int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
>  int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
>  int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
> +int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity);
>  
>  #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
>  int update_irq_load_avg(struct rq *rq, u64 running);
> @@ -159,6 +160,12 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
>  }
>  
>  static inline int
> +update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
> +{
> +	return 0;
> +}
> +
> +static inline int
>  update_irq_load_avg(struct rq *rq, u64 running)
>  {
>  	return 0;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 0db2c1b..d5d82c8 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -944,6 +944,7 @@ struct rq {
>  #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
>  	struct sched_avg	avg_irq;
>  #endif
> +	struct sched_avg	avg_thermal;
>  	u64			idle_stamp;
>  	u64			avg_idle;
>  
> -- 
> 2.1.4
> 


* Re: [Patch v4 0/6] Introduce Thermal Pressure
  2019-10-29 15:34 ` [Patch v4 0/6] Introduce Thermal Pressure Daniel Lezcano
@ 2019-10-31 10:07   ` Ionela Voinescu
  2019-10-31 11:54     ` Daniel Lezcano
  0 siblings, 1 reply; 51+ messages in thread
From: Ionela Voinescu @ 2019-10-31 10:07 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Thara Gopinath, mingo, peterz, vincent.guittot, rui.zhang,
	edubezval, qperret, linux-kernel, amit.kachhap, javi.merino

Hi Daniel,

On Tuesday 29 Oct 2019 at 16:34:11 (+0100), Daniel Lezcano wrote:
> Hi Thara,
> 
> On 22/10/2019 22:34, Thara Gopinath wrote:
> > Thermal governors can respond to an overheat event on a cpu by
> > capping the cpu's maximum possible frequency. This in turn means
> > that the maximum available compute capacity of the cpu is
> > restricted. But today in the kernel, the task scheduler is not
> > notified when the maximum frequency of a cpu is capped. In other
> > words, the scheduler is unaware of the maximum capacity restrictions
> > placed on a cpu due to thermal activity. This patch series attempts
> > to address this issue. The benefit identified is better task
> > placement among the available cpus in the event of overheating,
> > which in turn leads to better performance numbers.
> > 
> > The reduction in the maximum possible capacity of a cpu due to a
> > thermal event can be considered as thermal pressure. Instantaneous
> > thermal pressure is hard to record and can sometimes be erroneous,
> > as there can be a mismatch between the actual capping of capacity
> > and the scheduler recording it. The solution is thus to maintain a
> > weighted average per-cpu value for thermal pressure over time. The
> > weight reflects the amount of time the cpu has spent at a capped
> > maximum frequency. Since thermal pressure is recorded as an average,
> > it must be decayed periodically. The existing algorithm in the
> > kernel scheduler pelt framework is re-used to calculate the weighted
> > average. This patch series also defines a sysctl interface to allow
> > for a configurable decay period.
> > 
> > Regarding testing, basic build, boot and sanity testing have been
> > performed on a db845c platform with a debian file system. Further,
> > dhrystone and hackbench tests have been run with the thermal
> > pressure algorithm. During testing, due to constraints of the
> > step-wise governor in dealing with big.LITTLE systems, the trip
> > point 0 temperature was made asymmetric between the cpus in the
> > little cluster and the big cluster; the idea being that the big
> > cores will heat up and the cpu cooling device will throttle the
> > frequency of the big cores faster, thereby limiting the maximum
> > available capacity, and the scheduler will spread out tasks to the
> > little cores as well.
> > 
> > Test Results
> > 
> > Hackbench: 1 group , 30000 loops, 10 runs       
> >                                                Result         SD             
> >                                                (Secs)     (% of mean)     
> >  No Thermal Pressure                            14.03       2.69%           
> >  Thermal Pressure PELT Algo. Decay : 32 ms      13.29       0.56%         
> >  Thermal Pressure PELT Algo. Decay : 64 ms      12.57       1.56%           
> >  Thermal Pressure PELT Algo. Decay : 128 ms     12.71       1.04%         
> >  Thermal Pressure PELT Algo. Decay : 256 ms     12.29       1.42%           
> >  Thermal Pressure PELT Algo. Decay : 512 ms     12.42       1.15%  
> > 
> > Dhrystone Run Time  : 20 threads, 3000 MLOOPS
> >                                                  Result      SD             
> >                                                  (Secs)    (% of mean)     
> >  No Thermal Pressure                              9.452      4.49%
> >  Thermal Pressure PELT Algo. Decay : 32 ms        8.793      5.30%
> >  Thermal Pressure PELT Algo. Decay : 64 ms        8.981      5.29%
> >  Thermal Pressure PELT Algo. Decay : 128 ms       8.647      6.62%
> >  Thermal Pressure PELT Algo. Decay : 256 ms       8.774      6.45%
> >  Thermal Pressure PELT Algo. Decay : 512 ms       8.603      5.41%  
> 
> I took the opportunity to try glmark2 on the db845c platform with the
> default decay and got the following glmark2 scores:
> 
> Without thermal pressure:
> 
> # NumSamples = 9; Min = 790.00; Max = 805.00
> # Mean = 794.888889; Variance = 19.209877; SD = 4.382907; Median 794.000000
> # each ∎ represents a count of 1
>   790.0000 -   791.5000 [     2]: ∎∎
>   791.5000 -   793.0000 [     2]: ∎∎
>   793.0000 -   794.5000 [     2]: ∎∎
>   794.5000 -   796.0000 [     1]: ∎
>   796.0000 -   797.5000 [     0]:
>   797.5000 -   799.0000 [     1]: ∎
>   799.0000 -   800.5000 [     0]:
>   800.5000 -   802.0000 [     0]:
>   802.0000 -   803.5000 [     0]:
>   803.5000 -   805.0000 [     1]: ∎
> 
> 
> With thermal pressure:
> 
> # NumSamples = 9; Min = 933.00; Max = 960.00
> # Mean = 940.777778; Variance = 64.172840; SD = 8.010795; Median 937.000000
> # each ∎ represents a count of 1
>   933.0000 -   935.7000 [     3]: ∎∎∎
>   935.7000 -   938.4000 [     2]: ∎∎
>   938.4000 -   941.1000 [     2]: ∎∎
>   941.1000 -   943.8000 [     0]:
>   943.8000 -   946.5000 [     0]:
>   946.5000 -   949.2000 [     1]: ∎
>   949.2000 -   951.9000 [     0]:
>   951.9000 -   954.6000 [     0]:
>   954.6000 -   957.3000 [     0]:
>   957.3000 -   960.0000 [     1]: ∎
> 

Interesting! If I'm interpreting these correctly there seems to be
significant improvement when applying thermal pressure.

I'm not familiar with glmark2, can you tell me more about the process
and the work that the benchmark does? I assume this is a GPU benchmark,
but not knowing more about it I fail to see the correlation between
applying thermal pressure to CPU capacities and the improvement of GPU
performance.

Do you happen to know more about the behaviour that resulted in these
benchmark scores?

Thanks,
Ionela.

> 
> 
> -- 
>  <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
> 
> Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter |
> <http://www.linaro.org/linaro-blog/> Blog
> 


* Re: [Patch v4 4/6] sched/fair: update cpu_capcity to reflect thermal pressure
  2019-10-28 15:30     ` Peter Zijlstra
@ 2019-10-31 10:53       ` Qais Yousef
  2019-10-31 15:38         ` Dietmar Eggemann
  2019-10-31 16:03         ` Thara Gopinath
  0 siblings, 2 replies; 51+ messages in thread
From: Qais Yousef @ 2019-10-31 10:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thara Gopinath, mingo, ionela.voinescu, vincent.guittot,
	rui.zhang, edubezval, qperret, linux-kernel, amit.kachhap,
	javi.merino, daniel.lezcano

On 10/28/19 16:30, Peter Zijlstra wrote:
> On Wed, Oct 23, 2019 at 01:28:40PM +0100, Qais Yousef wrote:
> > On 10/22/19 16:34, Thara Gopinath wrote:
> > > cpu_capacity reflects the maximum available capacity of a cpu. Thermal
> > > pressure on a cpu means this maximum available capacity is reduced. This
> > > patch subtracts the average thermal pressure for a cpu from its maximum
> > > available capacity so that cpu_capacity reflects the actual available
> > > capacity.
> > > 
> > > Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> > > ---
> > >  kernel/sched/fair.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > > 
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 4f9c2cb..be3e802 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7727,6 +7727,7 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
> > >  
> > >  	used = READ_ONCE(rq->avg_rt.util_avg);
> > >  	used += READ_ONCE(rq->avg_dl.util_avg);
> > > +	used += READ_ONCE(rq->avg_thermal.load_avg);
> > 
> > Maybe a naive question - but can we add util_avg with load_avg without
> > a conversion? I thought the 2 signals have different properties.
> 
> Changelog of patch #1 explains, it's in that dense blob of text.
> 
> But yes, you're quite right that that wants a comment here.

Thanks for the pointer! A comment would be nice indeed.

To make sure I got this correctly - it's because avg_thermal.load_avg
represents delta_capacity, which is already a 'converted' form of load.
So this makes avg_thermal.load_avg really a util_avg. Correct?

If I managed to get it right, it'd be nice if we could do the inverse
conversion on delta_capacity so that the meaning of
avg_thermal.{load_avg, util_avg} is consistent across the board. But I
don't feel strongly about it if this gets documented properly.

Thanks

--
Qais Yousef


* Re: [Patch v4 0/6] Introduce Thermal Pressure
  2019-10-31 10:07   ` Ionela Voinescu
@ 2019-10-31 11:54     ` Daniel Lezcano
  2019-10-31 12:57       ` Ionela Voinescu
  0 siblings, 1 reply; 51+ messages in thread
From: Daniel Lezcano @ 2019-10-31 11:54 UTC (permalink / raw)
  To: Ionela Voinescu
  Cc: Thara Gopinath, mingo, peterz, vincent.guittot, rui.zhang,
	edubezval, qperret, linux-kernel, amit.kachhap, javi.merino

Hi Ionela,

On 31/10/2019 11:07, Ionela Voinescu wrote:
> Hi Daniel,
> 
> On Tuesday 29 Oct 2019 at 16:34:11 (+0100), Daniel Lezcano wrote:
>> Hi Thara,
>>
>> On 22/10/2019 22:34, Thara Gopinath wrote:
>>> Thermal governors can respond to an overheat event on a cpu by
>>> capping the cpu's maximum possible frequency. This in turn means
>>> that the maximum available compute capacity of the cpu is
>>> restricted. But today in the kernel, the task scheduler is not
>>> notified when the maximum frequency of a cpu is capped. In other
>>> words, the scheduler is unaware of the maximum capacity restrictions
>>> placed on a cpu due to thermal activity. This patch series attempts
>>> to address this issue. The benefit identified is better task
>>> placement among the available cpus in the event of overheating,
>>> which in turn leads to better performance numbers.
>>>
>>> The reduction in the maximum possible capacity of a cpu due to a
>>> thermal event can be considered as thermal pressure. Instantaneous
>>> thermal pressure is hard to record and can sometimes be erroneous,
>>> as there can be a mismatch between the actual capping of capacity
>>> and the scheduler recording it. The solution is thus to maintain a
>>> weighted average per-cpu value for thermal pressure over time. The
>>> weight reflects the amount of time the cpu has spent at a capped
>>> maximum frequency. Since thermal pressure is recorded as an average,
>>> it must be decayed periodically. The existing algorithm in the
>>> kernel scheduler pelt framework is re-used to calculate the weighted
>>> average. This patch series also defines a sysctl interface to allow
>>> for a configurable decay period.
>>>
>>> Regarding testing, basic build, boot and sanity testing have been
>>> performed on a db845c platform with a debian file system. Further,
>>> dhrystone and hackbench tests have been run with the thermal
>>> pressure algorithm. During testing, due to constraints of the
>>> step-wise governor in dealing with big.LITTLE systems, the trip
>>> point 0 temperature was made asymmetric between the cpus in the
>>> little cluster and the big cluster; the idea being that the big
>>> cores will heat up and the cpu cooling device will throttle the
>>> frequency of the big cores faster, thereby limiting the maximum
>>> available capacity, and the scheduler will spread out tasks to the
>>> little cores as well.
>>>
>>> Test Results
>>>
>>> Hackbench: 1 group , 30000 loops, 10 runs       
>>>                                                Result         SD             
>>>                                                (Secs)     (% of mean)     
>>>  No Thermal Pressure                            14.03       2.69%           
>>>  Thermal Pressure PELT Algo. Decay : 32 ms      13.29       0.56%         
>>>  Thermal Pressure PELT Algo. Decay : 64 ms      12.57       1.56%           
>>>  Thermal Pressure PELT Algo. Decay : 128 ms     12.71       1.04%         
>>>  Thermal Pressure PELT Algo. Decay : 256 ms     12.29       1.42%           
>>>  Thermal Pressure PELT Algo. Decay : 512 ms     12.42       1.15%  
>>>
>>> Dhrystone Run Time  : 20 threads, 3000 MLOOPS
>>>                                                  Result      SD             
>>>                                                  (Secs)    (% of mean)     
>>>  No Thermal Pressure                              9.452      4.49%
>>>  Thermal Pressure PELT Algo. Decay : 32 ms        8.793      5.30%
>>>  Thermal Pressure PELT Algo. Decay : 64 ms        8.981      5.29%
>>>  Thermal Pressure PELT Algo. Decay : 128 ms       8.647      6.62%
>>>  Thermal Pressure PELT Algo. Decay : 256 ms       8.774      6.45%
>>>  Thermal Pressure PELT Algo. Decay : 512 ms       8.603      5.41%  
>>
>> I took the opportunity to try glmark2 on the db845c platform with the
>> default decay and got the following glmark2 scores:
>>
>> Without thermal pressure:
>>
>> # NumSamples = 9; Min = 790.00; Max = 805.00
>> # Mean = 794.888889; Variance = 19.209877; SD = 4.382907; Median 794.000000
>> # each ∎ represents a count of 1
>>   790.0000 -   791.5000 [     2]: ∎∎
>>   791.5000 -   793.0000 [     2]: ∎∎
>>   793.0000 -   794.5000 [     2]: ∎∎
>>   794.5000 -   796.0000 [     1]: ∎
>>   796.0000 -   797.5000 [     0]:
>>   797.5000 -   799.0000 [     1]: ∎
>>   799.0000 -   800.5000 [     0]:
>>   800.5000 -   802.0000 [     0]:
>>   802.0000 -   803.5000 [     0]:
>>   803.5000 -   805.0000 [     1]: ∎
>>
>>
>> With thermal pressure:
>>
>> # NumSamples = 9; Min = 933.00; Max = 960.00
>> # Mean = 940.777778; Variance = 64.172840; SD = 8.010795; Median 937.000000
>> # each ∎ represents a count of 1
>>   933.0000 -   935.7000 [     3]: ∎∎∎
>>   935.7000 -   938.4000 [     2]: ∎∎
>>   938.4000 -   941.1000 [     2]: ∎∎
>>   941.1000 -   943.8000 [     0]:
>>   943.8000 -   946.5000 [     0]:
>>   946.5000 -   949.2000 [     1]: ∎
>>   949.2000 -   951.9000 [     0]:
>>   951.9000 -   954.6000 [     0]:
>>   954.6000 -   957.3000 [     0]:
>>   957.3000 -   960.0000 [     1]: ∎
>>
> 
> Interesting! If I'm interpreting these correctly there seems to be
> significant improvement when applying thermal pressure.
>
> I'm not familiar with glmark2, can you tell me more about the process
> and the work that the benchmark does?

glmark2 is a 3D benchmark. I ran it without parameters, so all the tests
are run. At the end, it gives a score, which is what the values above
show.

> I assume this is a GPU benchmark,
> but not knowing more about it I fail to see the correlation between
> applying thermal pressure to CPU capacities and the improvement of GPU
> performance.
> Do you happen to know more about the behaviour that resulted in these
> benchmark scores?

My hypothesis is that glmark2 makes the GPU contribute a lot to the
heating effect, thus increasing the temperature of the CPUs close to it.







* Re: [Patch v4 0/6] Introduce Thermal Pressure
  2019-10-31 11:54     ` Daniel Lezcano
@ 2019-10-31 12:57       ` Ionela Voinescu
  2019-10-31 17:48         ` Daniel Lezcano
  0 siblings, 1 reply; 51+ messages in thread
From: Ionela Voinescu @ 2019-10-31 12:57 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Thara Gopinath, mingo, peterz, vincent.guittot, rui.zhang,
	edubezval, qperret, linux-kernel, amit.kachhap, javi.merino

On Thursday 31 Oct 2019 at 12:54:03 (+0100), Daniel Lezcano wrote:
> Hi Ionela,
> 
> On 31/10/2019 11:07, Ionela Voinescu wrote:
> > Hi Daniel,
> > 
> > On Tuesday 29 Oct 2019 at 16:34:11 (+0100), Daniel Lezcano wrote:
> >> Hi Thara,
> >>
> >> On 22/10/2019 22:34, Thara Gopinath wrote:
> >>> Thermal governors can respond to an overheat event on a cpu by
> >>> capping the cpu's maximum possible frequency. This in turn means
> >>> that the maximum available compute capacity of the cpu is
> >>> restricted. But today in the kernel, the task scheduler is not
> >>> notified when the maximum frequency of a cpu is capped. In other
> >>> words, the scheduler is unaware of the maximum capacity restrictions
> >>> placed on a cpu due to thermal activity. This patch series attempts
> >>> to address this issue. The benefit identified is better task
> >>> placement among the available cpus in the event of overheating,
> >>> which in turn leads to better performance numbers.
> >>>
> >>> The reduction in the maximum possible capacity of a cpu due to a
> >>> thermal event can be considered as thermal pressure. Instantaneous
> >>> thermal pressure is hard to record and can sometimes be erroneous,
> >>> as there can be a mismatch between the actual capping of capacity
> >>> and the scheduler recording it. The solution is thus to maintain a
> >>> weighted average per-cpu value for thermal pressure over time. The
> >>> weight reflects the amount of time the cpu has spent at a capped
> >>> maximum frequency. Since thermal pressure is recorded as an average,
> >>> it must be decayed periodically. The existing algorithm in the
> >>> kernel scheduler pelt framework is re-used to calculate the weighted
> >>> average. This patch series also defines a sysctl interface to allow
> >>> for a configurable decay period.
> >>>
> >>> Regarding testing, basic build, boot and sanity testing have been
> >>> performed on a db845c platform with a debian file system. Further,
> >>> dhrystone and hackbench tests have been run with the thermal
> >>> pressure algorithm. During testing, due to constraints of the
> >>> step-wise governor in dealing with big.LITTLE systems, the trip
> >>> point 0 temperature was made asymmetric between the cpus in the
> >>> little cluster and the big cluster; the idea being that the big
> >>> cores will heat up and the cpu cooling device will throttle the
> >>> frequency of the big cores faster, thereby limiting the maximum
> >>> available capacity, and the scheduler will spread out tasks to the
> >>> little cores as well.
> >>>
> >>> Test Results
> >>>
> >>> Hackbench: 1 group , 30000 loops, 10 runs       
> >>>                                                Result         SD             
> >>>                                                (Secs)     (% of mean)     
> >>>  No Thermal Pressure                            14.03       2.69%           
> >>>  Thermal Pressure PELT Algo. Decay : 32 ms      13.29       0.56%         
> >>>  Thermal Pressure PELT Algo. Decay : 64 ms      12.57       1.56%           
> >>>  Thermal Pressure PELT Algo. Decay : 128 ms     12.71       1.04%         
> >>>  Thermal Pressure PELT Algo. Decay : 256 ms     12.29       1.42%           
> >>>  Thermal Pressure PELT Algo. Decay : 512 ms     12.42       1.15%  
> >>>
> >>> Dhrystone Run Time  : 20 threads, 3000 MLOOPS
> >>>                                                  Result      SD             
> >>>                                                  (Secs)    (% of mean)     
> >>>  No Thermal Pressure                              9.452      4.49%
> >>>  Thermal Pressure PELT Algo. Decay : 32 ms        8.793      5.30%
> >>>  Thermal Pressure PELT Algo. Decay : 64 ms        8.981      5.29%
> >>>  Thermal Pressure PELT Algo. Decay : 128 ms       8.647      6.62%
> >>>  Thermal Pressure PELT Algo. Decay : 256 ms       8.774      6.45%
> >>>  Thermal Pressure PELT Algo. Decay : 512 ms       8.603      5.41%  
> >>
> >> I took the opportunity to try glmark2 on the db845c platform with the
> >> default decay and got the following glmark2 scores:
> >>
> >> Without thermal pressure:
> >>
> >> # NumSamples = 9; Min = 790.00; Max = 805.00
> >> # Mean = 794.888889; Variance = 19.209877; SD = 4.382907; Median 794.000000
> >> # each ∎ represents a count of 1
> >>   790.0000 -   791.5000 [     2]: ∎∎
> >>   791.5000 -   793.0000 [     2]: ∎∎
> >>   793.0000 -   794.5000 [     2]: ∎∎
> >>   794.5000 -   796.0000 [     1]: ∎
> >>   796.0000 -   797.5000 [     0]:
> >>   797.5000 -   799.0000 [     1]: ∎
> >>   799.0000 -   800.5000 [     0]:
> >>   800.5000 -   802.0000 [     0]:
> >>   802.0000 -   803.5000 [     0]:
> >>   803.5000 -   805.0000 [     1]: ∎
> >>
> >>
> >> With thermal pressure:
> >>
> >> # NumSamples = 9; Min = 933.00; Max = 960.00
> >> # Mean = 940.777778; Variance = 64.172840; SD = 8.010795; Median 937.000000
> >> # each ∎ represents a count of 1
> >>   933.0000 -   935.7000 [     3]: ∎∎∎
> >>   935.7000 -   938.4000 [     2]: ∎∎
> >>   938.4000 -   941.1000 [     2]: ∎∎
> >>   941.1000 -   943.8000 [     0]:
> >>   943.8000 -   946.5000 [     0]:
> >>   946.5000 -   949.2000 [     1]: ∎
> >>   949.2000 -   951.9000 [     0]:
> >>   951.9000 -   954.6000 [     0]:
> >>   954.6000 -   957.3000 [     0]:
> >>   957.3000 -   960.0000 [     1]: ∎
> >>
> > 
> > Interesting! If I'm interpreting these correctly there seems to be
> > significant improvement when applying thermal pressure.
> >
> > I'm not familiar with glmark2, can you tell me more about the process
> > and the work that the benchmark does?
> 
> glmark2 is a 3D benchmark. I ran it without parameters, so all the tests
> are run. At the end, it gives a score, which is what the values above
> show.
> 
> > I assume this is a GPU benchmark,
> > but not knowing more about it I fail to see the correlation between
> > applying thermal pressure to CPU capacities and the improvement of GPU
> > performance.
> > Do you happen to know more about the behaviour that resulted in these
> > benchmark scores?
> 
> My hypothesis is that glmark2 makes the GPU contribute a lot to the
> heating effect, thus increasing the temperature of the CPUs close to it.
>

Hhmm.. yes, I am assuming that there is some thermal mitigation (CPU
frequency capping) done as a result of the heat inflicted by the work
on the GPU, but these patches do not improve thermal management itself
in a way that would make the GPU perform better. They only inform the
scheduler of the reduced capacity of the CPUs so that it can decide how
to better use the compute capacity it has available.

There could be a second hand effect of the more efficient use of the
CPUs which would release thermal headroom for the GPU to use, but I
would not expect the differences to be as high as in the results above.

Another possibility is that work on the CPUs impacts the scores more
than I would expect for such a benchmark, but again I would not expect
the work on the CPUs to be significant enough to result in such
differences in the scores.

If you have the chance to look more into exactly what is the behaviour,
with and without thermal pressure - cooling states, average frequency,
use of CPUs, use of GPU, etc, it would be very valuable.

Thank you,
Ionela.

> 
> 
> 
> 


* Re: [Patch v4 4/6] sched/fair: update cpu_capcity to reflect thermal pressure
  2019-10-31 10:53       ` Qais Yousef
@ 2019-10-31 15:38         ` Dietmar Eggemann
  2019-10-31 15:48           ` Vincent Guittot
  2019-10-31 16:03         ` Thara Gopinath
  1 sibling, 1 reply; 51+ messages in thread
From: Dietmar Eggemann @ 2019-10-31 15:38 UTC (permalink / raw)
  To: Qais Yousef, Peter Zijlstra
  Cc: Thara Gopinath, mingo, ionela.voinescu, vincent.guittot,
	rui.zhang, edubezval, qperret, linux-kernel, amit.kachhap,
	javi.merino, daniel.lezcano

On 31.10.19 11:53, Qais Yousef wrote:
> On 10/28/19 16:30, Peter Zijlstra wrote:
>> On Wed, Oct 23, 2019 at 01:28:40PM +0100, Qais Yousef wrote:
>>> On 10/22/19 16:34, Thara Gopinath wrote:
>>>> cpu_capacity reflects the maximum available capacity of a cpu. Thermal
>>>> pressure on a cpu means this maximum available capacity is reduced. This
>>>> patch subtracts the average thermal pressure for a cpu from its maximum
>>>> available capacity so that cpu_capacity reflects the actual available
>>>> capacity.
>>>>
>>>> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
>>>> ---
>>>>  kernel/sched/fair.c | 1 +
>>>>  1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 4f9c2cb..be3e802 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -7727,6 +7727,7 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
>>>>  
>>>>  	used = READ_ONCE(rq->avg_rt.util_avg);
>>>>  	used += READ_ONCE(rq->avg_dl.util_avg);
>>>> +	used += READ_ONCE(rq->avg_thermal.load_avg);
>>>
>>> Maybe a naive question - but can we add util_avg with load_avg without
>>> a conversion? I thought the 2 signals have different properties.
>>
>> Changelog of patch #1 explains, it's in that dense blob of text.
>>
>> But yes, you're quite right that that wants a comment here.
> 
> Thanks for the pointer! A comment would be nice indeed.
> 
> To make sure I got this correctly - it's because avg_thermal.load_avg
> represents delta_capacity, which is already a 'converted' form of load.
> So this makes avg_thermal.load_avg really a util_avg. Correct?
> 
> If I managed to get it right, it'd be nice if we could do the inverse
> conversion on delta_capacity so that the meaning of
> avg_thermal.{load_avg, util_avg} is consistent across the board. But I
> don't feel strongly about it if this gets documented properly.

So why can't we use rq->avg_thermal.util_avg here? Since capacity is
closer to util than to load?

Is it because you want the 'load' argument of ___update_load_sum()
(i.e. the per-cpu delta capacity) to influence your signal?

Why not call it this way then?

diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 38210691c615..d3035457483f 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -357,9 +357,9 @@ int update_thermal_load_avg(u64 now, struct rq *rq,
u64 capacity)
 {
        if (___update_load_sum(now, &rq->avg_thermal,
                               capacity,
-                              capacity,
-                              capacity)) {
-               ___update_load_avg(&rq->avg_thermal, 1, 1);
+                              0,
+                              0)) {
+               ___update_load_avg(&rq->avg_thermal, 1, 0);
                return 1;
        }


* Re: [Patch v4 4/6] sched/fair: update cpu_capcity to reflect thermal pressure
  2019-10-31 15:38         ` Dietmar Eggemann
@ 2019-10-31 15:48           ` Vincent Guittot
  2019-10-31 16:17             ` Dietmar Eggemann
  0 siblings, 1 reply; 51+ messages in thread
From: Vincent Guittot @ 2019-10-31 15:48 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Qais Yousef, Peter Zijlstra, Thara Gopinath, Ingo Molnar,
	Ionela Voinescu, Zhang Rui, Eduardo Valentin, Quentin Perret,
	linux-kernel, Amit Kachhap, Javi Merino, Daniel Lezcano

On Thu, 31 Oct 2019 at 16:38, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 31.10.19 11:53, Qais Yousef wrote:
> > On 10/28/19 16:30, Peter Zijlstra wrote:
> >> On Wed, Oct 23, 2019 at 01:28:40PM +0100, Qais Yousef wrote:
> >>> On 10/22/19 16:34, Thara Gopinath wrote:
> >>>> cpu_capacity reflects the maximum available capacity of a cpu. Thermal
> >>>> pressure on a cpu means this maximum available capacity is reduced. This
> >>>> patch subtracts the average thermal pressure for a cpu from its maximum
> >>>> available capacity so that cpu_capacity reflects the actual available
> >>>> capacity.
> >>>>
> >>>> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> >>>> ---
> >>>>  kernel/sched/fair.c | 1 +
> >>>>  1 file changed, 1 insertion(+)
> >>>>
> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>> index 4f9c2cb..be3e802 100644
> >>>> --- a/kernel/sched/fair.c
> >>>> +++ b/kernel/sched/fair.c
> >>>> @@ -7727,6 +7727,7 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
> >>>>
> >>>>    used = READ_ONCE(rq->avg_rt.util_avg);
> >>>>    used += READ_ONCE(rq->avg_dl.util_avg);
> >>>> +  used += READ_ONCE(rq->avg_thermal.load_avg);
> >>>
> >>> Maybe a naive question - but can we add util_avg with load_avg without
> >>> a conversion? I thought the 2 signals have different properties.
> >>
> >> Changelog of patch #1 explains, it's in that dense blob of text.
> >>
> >> But yes, you're quite right that that wants a comment here.
> >
> > Thanks for the pointer! A comment would be nice indeed.
> >
> > To make sure I got this correctly - it's because avg_thermal.load_avg
> > represents delta_capacity, which is already a 'converted' form of load.
> > So this makes avg_thermal.load_avg really a util_avg. Correct?
> >
> > If I managed to get it right, it'd be nice if we could do the inverse
> > conversion on delta_capacity so that the meaning of
> > avg_thermal.{load_avg, util_avg} is consistent across the board. But I
> > don't feel strongly about it if this gets documented properly.
>
> So why can't we use rq->avg_thermal.util_avg here? Since capacity is
> closer to util than to load?
>
> Is it because you want the 'load' argument of ___update_load_sum()
> (i.e. the per-cpu delta capacity) to influence your signal?
>
> Why not call it this way then?

util_avg tracks a binary state with 2 fixed weights: running (1024) vs
not running (0).
In the case of thermal pressure, we want to track how much pressure is
put on the CPU: capping to half the max frequency is not the same as
capping only 10%.
load_avg is not boolean; you set the weight you want to apply, and this
weight reflects the amount of pressure.

>
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index 38210691c615..d3035457483f 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -357,9 +357,9 @@ int update_thermal_load_avg(u64 now, struct rq *rq,
> u64 capacity)
>  {
>         if (___update_load_sum(now, &rq->avg_thermal,
>                                capacity,
> -                              capacity,
> -                              capacity)) {
> -               ___update_load_avg(&rq->avg_thermal, 1, 1);
> +                              0,
> +                              0)) {
> +               ___update_load_avg(&rq->avg_thermal, 1, 0);
>                 return 1;
>         }

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Patch v4 4/6] sched/fair: update cpu_capcity to reflect thermal pressure
  2019-10-31 10:53       ` Qais Yousef
  2019-10-31 15:38         ` Dietmar Eggemann
@ 2019-10-31 16:03         ` Thara Gopinath
  2019-10-31 16:56           ` Qais Yousef
  1 sibling, 1 reply; 51+ messages in thread
From: Thara Gopinath @ 2019-10-31 16:03 UTC (permalink / raw)
  To: Qais Yousef, Peter Zijlstra
  Cc: mingo, ionela.voinescu, vincent.guittot, rui.zhang, edubezval,
	qperret, linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On 10/31/2019 06:53 AM, Qais Yousef wrote:
> On 10/28/19 16:30, Peter Zijlstra wrote:
>> On Wed, Oct 23, 2019 at 01:28:40PM +0100, Qais Yousef wrote:
>>> On 10/22/19 16:34, Thara Gopinath wrote:
>>>> cpu_capacity relflects the maximum available capacity of a cpu. Thermal
>>>> pressure on a cpu means this maximum available capacity is reduced. This
>>>> patch reduces the average thermal pressure for a cpu from its maximum
>>>> available capacity so that cpu_capacity reflects the actual
>>>> available capacity.
>>>>
>>>> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
>>>> ---
>>>>  kernel/sched/fair.c | 1 +
>>>>  1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 4f9c2cb..be3e802 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -7727,6 +7727,7 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
>>>>  
>>>>  	used = READ_ONCE(rq->avg_rt.util_avg);
>>>>  	used += READ_ONCE(rq->avg_dl.util_avg);
>>>> +	used += READ_ONCE(rq->avg_thermal.load_avg);
>>>
>>> Maybe a naive question - but can we add util_avg with load_avg without
>>> a conversion? I thought the 2 signals have different properties.
>>
>> Changelog of patch #1 explains, it's in that dense blob of text.
>>
>> But yes, you're quite right that that wants a comment here.
> 
> Thanks for the pointer! A comment would be nice indeed.
> 
> To make sure I got this correctly - it's because avg_thermal.load_avg
> represents delta_capacity which is already a 'converted' form of load. So this
> makes avg_thermal.load_avg a util_avg really. Correct?
Hello Qais,

Sorry for not replying to your earlier email. Thanks for the review.
So if you look at the code, util_sum is calculated as a binary signal
converted into capacity. Check out the below snippet from accumulate_sum():

	if (load)
		sa->load_sum += load * contrib;
	if (runnable)
		sa->runnable_load_sum += runnable * contrib;
	if (running)
		sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;

So the actual delta for the thermal pressure will never be considered
if util_avg is used.

I will update this patch with relevant comment.

-- 
Warm Regards
Thara


* Re: [Patch v4 3/6] sched/fair: Enable CFS periodic tick to update thermal pressure
  2019-10-22 20:34 ` [Patch v4 3/6] sched/fair: Enable CFS periodic tick to update " Thara Gopinath
  2019-10-28 15:24   ` Peter Zijlstra
@ 2019-10-31 16:11   ` Dietmar Eggemann
  2019-10-31 16:46     ` Thara Gopinath
  1 sibling, 1 reply; 51+ messages in thread
From: Dietmar Eggemann @ 2019-10-31 16:11 UTC (permalink / raw)
  To: Thara Gopinath, mingo, peterz, ionela.voinescu, vincent.guittot,
	rui.zhang, edubezval, qperret
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On 22.10.19 22:34, Thara Gopinath wrote:
> Introduce support in CFS periodic tick to trigger the process of
> computing average thermal pressure for a cpu.
> 
> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> ---
>  kernel/sched/fair.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 682a754..4f9c2cb 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -21,6 +21,7 @@
>   *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
>   */
>  #include "sched.h"
> +#include "thermal.h"
>  
>  #include <trace/events/sched.h>
>  
> @@ -7574,6 +7575,8 @@ static void update_blocked_averages(int cpu)
>  		done = false;
>  
>  	update_blocked_load_status(rq, !done);
> +
> +	trigger_thermal_pressure_average(rq);
>  	rq_unlock_irqrestore(rq, &rf);
>  }

Since you update the thermal pressure signal in CFS's
update_blocked_averages() as well, I guess the patch title has to change.

>  
> @@ -9933,6 +9936,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>  
>  	update_misfit_status(curr, rq);
>  	update_overutilized_status(task_rq(curr));
> +
> +	trigger_thermal_pressure_average(rq);
>  }
>  
>  /*
> 


* Re: [Patch v4 4/6] sched/fair: update cpu_capcity to reflect thermal pressure
  2019-10-31 15:48           ` Vincent Guittot
@ 2019-10-31 16:17             ` Dietmar Eggemann
  2019-10-31 16:31               ` Vincent Guittot
  0 siblings, 1 reply; 51+ messages in thread
From: Dietmar Eggemann @ 2019-10-31 16:17 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Qais Yousef, Peter Zijlstra, Thara Gopinath, Ingo Molnar,
	Ionela Voinescu, Zhang Rui, Eduardo Valentin, Quentin Perret,
	linux-kernel, Amit Kachhap, Javi Merino, Daniel Lezcano

On 31.10.19 16:48, Vincent Guittot wrote:
> On Thu, 31 Oct 2019 at 16:38, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 31.10.19 11:53, Qais Yousef wrote:
>>> On 10/28/19 16:30, Peter Zijlstra wrote:
>>>> On Wed, Oct 23, 2019 at 01:28:40PM +0100, Qais Yousef wrote:
>>>>> On 10/22/19 16:34, Thara Gopinath wrote:

[...]

>>> To make sure I got this correctly - it's because avg_thermal.load_avg
>>> represents delta_capacity which is already a 'converted' form of load. So this
>>> makes avg_thermal.load_avg a util_avg really. Correct?
>>>
>>> If I managed to get it right somehow. It'd be nice if we can do inverse
>>> conversion on delta_capacity so that avg_thermal.{load_avg, util_avg} meaning
>>> is consistent across the board. But I don't feel strongly about it if this gets
>>> documented properly.
>>
>> So why can't we use rq->avg_thermal.util_avg here? Since capacity is
>> closer to util than to load?
>>
>> Is it because you want to use the influence of ___update_load_sum(...,
>> unsigned long load eq. per-cpu delta_capacity in your signal?
>>
>> Why not call it this way then?
> 
> util_avg tracks a binary state with 2 fixed weights: running(1024)  vs
> not running (0)
> In the case of thermal pressure, we want to track how much pressure is
> put on the CPU: capping to half the max frequency is not the same as
> capping only 10%
> load_avg is not boolean but you set the weight  you want to apply and
> this weight reflects the amount of pressure.

I see. This is what I meant by 'load (weight) eq. per-cpu delta_capacity
(pressure)'.


>> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
>> index 38210691c615..d3035457483f 100644
>> --- a/kernel/sched/pelt.c
>> +++ b/kernel/sched/pelt.c
>> @@ -357,9 +357,9 @@ int update_thermal_load_avg(u64 now, struct rq *rq,
>> u64 capacity)
>>  {
>>         if (___update_load_sum(now, &rq->avg_thermal,
>>                                capacity,
>> -                              capacity,
>> -                              capacity)) {
>> -               ___update_load_avg(&rq->avg_thermal, 1, 1);
>> +                              0,
>> +                              0)) {
>> +               ___update_load_avg(&rq->avg_thermal, 1, 0);
>>                 return 1;
>>         }

So we could call it this way since we don't care about runnable_load or
util?


* Re: [Patch v4 5/6] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping
  2019-10-22 20:34 ` [Patch v4 5/6] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping Thara Gopinath
  2019-10-28 15:33   ` Peter Zijlstra
@ 2019-10-31 16:29   ` Dietmar Eggemann
  2019-10-31 16:38     ` Vincent Guittot
  2019-10-31 16:46     ` Thara Gopinath
  1 sibling, 2 replies; 51+ messages in thread
From: Dietmar Eggemann @ 2019-10-31 16:29 UTC (permalink / raw)
  To: Thara Gopinath, mingo, peterz, ionela.voinescu, vincent.guittot,
	rui.zhang, edubezval, qperret
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On 22.10.19 22:34, Thara Gopinath wrote:
> Thermal governors can request for a cpu's maximum supported frequency
> to be capped in case of an overheat event. This in turn means that the
> maximum capacity available for tasks to run on the particular cpu is
> reduced. Delta between the original maximum capacity and capped
> maximum capacity is known as thermal pressure. Enable cpufreq cooling
> device to update the thermal pressure in event of a capped
> maximum frequency.
> 
> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> ---
>  drivers/thermal/cpu_cooling.c | 31 +++++++++++++++++++++++++++++--
>  1 file changed, 29 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> index 391f397..2e6a979 100644
> --- a/drivers/thermal/cpu_cooling.c
> +++ b/drivers/thermal/cpu_cooling.c
> @@ -218,6 +218,23 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
>  }
>  
>  /**
> + * update_sched_max_capacity - update scheduler about change in cpu
> + *					max frequency.
> + * @policy - cpufreq policy whose max frequency is capped.
> + */
> +static void update_sched_max_capacity(struct cpumask *cpus,
> +				      unsigned int cur_max_freq,
> +				      unsigned int max_freq)
> +{
> +	int cpu;
> +	unsigned long capacity = (cur_max_freq << SCHED_CAPACITY_SHIFT) /
> +				  max_freq;
> +
> +	for_each_cpu(cpu, cpus)
> +		update_thermal_pressure(cpu, capacity);
> +}
> +
> +/**
>   * get_load() - get load for a cpu since last updated
>   * @cpufreq_cdev:	&struct cpufreq_cooling_device for this cpu
>   * @cpu:	cpu number
> @@ -320,6 +337,7 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
>  				 unsigned long state)
>  {
>  	struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
> +	int ret;
>  
>  	/* Request state should be less than max_level */
>  	if (WARN_ON(state > cpufreq_cdev->max_level))
> @@ -331,8 +349,17 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
>  
>  	cpufreq_cdev->cpufreq_state = state;
>  
> -	return dev_pm_qos_update_request(&cpufreq_cdev->qos_req,
> -				cpufreq_cdev->freq_table[state].frequency);
> +	ret = dev_pm_qos_update_request
> +				(&cpufreq_cdev->qos_req,
> +				 cpufreq_cdev->freq_table[state].frequency);
> +
> +	if (ret > 0)
> +		update_sched_max_capacity
> +				(cpufreq_cdev->policy->cpus,
> +				 cpufreq_cdev->freq_table[state].frequency,
> +				 cpufreq_cdev->policy->cpuinfo.max_freq);
> +
> +	return ret;
>  }
>  
>  /**
> 

Why not get rid of update_sched_max_capacity() entirely and call
update_thermal_pressure() in cpu_cooling.c directly? That saves one level in
the call chain and would mean less code for this feature.

Just compile tested on arm64:

diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index 3211b4d3a899..bf36995013b0 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -217,23 +217,6 @@ static u32 cpu_power_to_freq(struct
cpufreq_cooling_device *cpufreq_cdev,
        return freq_table[i - 1].frequency;
 }

-/**
- * update_sched_max_capacity - update scheduler about change in cpu
- *                                     max frequency.
- * @policy - cpufreq policy whose max frequency is capped.
- */
-static void update_sched_max_capacity(struct cpumask *cpus,
-                                     unsigned int cur_max_freq,
-                                     unsigned int max_freq)
-{
-       int cpu;
-       unsigned long capacity = (cur_max_freq << SCHED_CAPACITY_SHIFT) /
-                                 max_freq;
-
-       for_each_cpu(cpu, cpus)
-               update_thermal_pressure(cpu, capacity);
-}
-
 /**
  * get_load() - get load for a cpu since last updated
  * @cpufreq_cdev:      &struct cpufreq_cooling_device for this cpu
@@ -353,7 +336,7 @@ static int cpufreq_set_cur_state(struct
thermal_cooling_device *cdev,
                                cpufreq_cdev->freq_table[state].frequency);

        if (ret > 0)
-               update_sched_max_capacity
+               update_thermal_pressure
                                (cpufreq_cdev->policy->cpus,
                                 cpufreq_cdev->freq_table[state].frequency,
                                 cpufreq_cdev->policy->cpuinfo.max_freq);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 55dfe9634f67..5707813c7621 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1985,9 +1985,9 @@ static inline void rseq_syscall(struct pt_regs *regs)
 #endif

 #ifdef CONFIG_SMP
-void update_thermal_pressure(int cpu, u64 capacity);
+void update_thermal_pressure(struct cpumask *cpus, unsigned int cur,
unsigned int max);
 #else
-static inline void update_thermal_pressure(int cpu, u64 capacity)
+static inline void update_thermal_pressure(struct cpumask *cpus,
unsigned int cur, unsigned int max);
 {
 }
 #endif
diff --git a/kernel/sched/thermal.c b/kernel/sched/thermal.c
index 0da31e12a5ff..691bdd79597a 100644
--- a/kernel/sched/thermal.c
+++ b/kernel/sched/thermal.c
@@ -43,17 +43,16 @@ static DEFINE_PER_CPU(unsigned long, delta_capacity);
  * the arch_scale_cpu_capacity and capped capacity is stored in per cpu
  * delta_capacity.
  */
-void update_thermal_pressure(int cpu, u64 capped_freq_ratio)
+void update_thermal_pressure(struct cpumask *cpus, unsigned int cur,
unsigned int max)
 {
-       unsigned long __capacity, delta;
+       int cpu;

-       /* Normalize the capped freq ratio */
-       __capacity = (capped_freq_ratio * arch_scale_cpu_capacity(cpu)) >>
-
SCHED_CAPACITY_SHIFT;
-       delta = arch_scale_cpu_capacity(cpu) -  __capacity;
-       pr_debug("updating cpu%d thermal pressure to %lu\n", cpu, delta);
+       for_each_cpu(cpu, cpus) {
+               unsigned long scale_cap = arch_scale_cpu_capacity(cpu);
+               unsigned long cur_cap = cur * scale_cap / max;

-       per_cpu(delta_capacity, cpu) = delta;
+               per_cpu(delta_capacity, cpu) = scale_cap - cur_cap;
+       }
 }


* Re: [Patch v4 4/6] sched/fair: update cpu_capcity to reflect thermal pressure
  2019-10-31 16:17             ` Dietmar Eggemann
@ 2019-10-31 16:31               ` Vincent Guittot
  2019-10-31 16:44                 ` Dietmar Eggemann
  0 siblings, 1 reply; 51+ messages in thread
From: Vincent Guittot @ 2019-10-31 16:31 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Qais Yousef, Peter Zijlstra, Thara Gopinath, Ingo Molnar,
	Ionela Voinescu, Zhang Rui, Eduardo Valentin, Quentin Perret,
	linux-kernel, Amit Kachhap, Javi Merino, Daniel Lezcano

On Thu, 31 Oct 2019 at 17:17, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 31.10.19 16:48, Vincent Guittot wrote:
> > On Thu, 31 Oct 2019 at 16:38, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 31.10.19 11:53, Qais Yousef wrote:
> >>> On 10/28/19 16:30, Peter Zijlstra wrote:
> >>>> On Wed, Oct 23, 2019 at 01:28:40PM +0100, Qais Yousef wrote:
> >>>>> On 10/22/19 16:34, Thara Gopinath wrote:
>
> [...]
>
> >>> To make sure I got this correctly - it's because avg_thermal.load_avg
> >>> represents delta_capacity which is already a 'converted' form of load. So this
> >>> makes avg_thermal.load_avg a util_avg really. Correct?
> >>>
> >>> If I managed to get it right somehow. It'd be nice if we can do inverse
> >>> conversion on delta_capacity so that avg_thermal.{load_avg, util_avg} meaning
> >>> is consistent across the board. But I don't feel strongly about it if this gets
> >>> documented properly.
> >>
> >> So why can't we use rq->avg_thermal.util_avg here? Since capacity is
> >> closer to util than to load?
> >>
> >> Is it because you want to use the influence of ___update_load_sum(...,
> >> unsigned long load eq. per-cpu delta_capacity in your signal?
> >>
> >> Why not call it this way then?
> >
> > util_avg tracks a binary state with 2 fixed weights: running(1024)  vs
> > not running (0)
> > In the case of thermal pressure, we want to track how much pressure is
> > put on the CPU: capping to half the max frequency is not the same as
> > capping only 10%
> > load_avg is not boolean but you set the weight  you want to apply and
> > this weight reflects the amount of pressure.
>
> I see. This is what I meant by 'load (weight) eq. per-cpu delta_capacity
> (pressure)'.
>
>
> >> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> >> index 38210691c615..d3035457483f 100644
> >> --- a/kernel/sched/pelt.c
> >> +++ b/kernel/sched/pelt.c
> >> @@ -357,9 +357,9 @@ int update_thermal_load_avg(u64 now, struct rq *rq,
> >> u64 capacity)
> >>  {
> >>         if (___update_load_sum(now, &rq->avg_thermal,
> >>                                capacity,
> >> -                              capacity,
> >> -                              capacity)) {
> >> -               ___update_load_avg(&rq->avg_thermal, 1, 1);
> >> +                              0,
> >> +                              0)) {
> >> +               ___update_load_avg(&rq->avg_thermal, 1, 0);
> >>                 return 1;
> >>         }
>
> So we could call it this way since we don't care about runnable_load or
> util?

One way or the other is quite similar, but the current solution is
aligned with the other irq, rt, dl signals, which duplicate the same state
in each field.


* Re: [Patch v4 5/6] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping
  2019-10-31 16:29   ` Dietmar Eggemann
@ 2019-10-31 16:38     ` Vincent Guittot
  2019-11-01 15:47       ` Ionela Voinescu
  2019-10-31 16:46     ` Thara Gopinath
  1 sibling, 1 reply; 51+ messages in thread
From: Vincent Guittot @ 2019-10-31 16:38 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Thara Gopinath, Ingo Molnar, Peter Zijlstra, Ionela Voinescu,
	Zhang Rui, Eduardo Valentin, Quentin Perret, linux-kernel,
	Amit Kachhap, Javi Merino, Daniel Lezcano

On Thu, 31 Oct 2019 at 17:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 22.10.19 22:34, Thara Gopinath wrote:
> > Thermal governors can request for a cpu's maximum supported frequency
> > to be capped in case of an overheat event. This in turn means that the
> > maximum capacity available for tasks to run on the particular cpu is
> > reduced. Delta between the original maximum capacity and capped
> > maximum capacity is known as thermal pressure. Enable cpufreq cooling
> > device to update the thermal pressure in event of a capped
> > maximum frequency.
> >
> > Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> > ---
> >  drivers/thermal/cpu_cooling.c | 31 +++++++++++++++++++++++++++++--
> >  1 file changed, 29 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> > index 391f397..2e6a979 100644
> > --- a/drivers/thermal/cpu_cooling.c
> > +++ b/drivers/thermal/cpu_cooling.c
> > @@ -218,6 +218,23 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
> >  }
> >
> >  /**
> > + * update_sched_max_capacity - update scheduler about change in cpu
> > + *                                   max frequency.
> > + * @policy - cpufreq policy whose max frequency is capped.
> > + */
> > +static void update_sched_max_capacity(struct cpumask *cpus,
> > +                                   unsigned int cur_max_freq,
> > +                                   unsigned int max_freq)
> > +{
> > +     int cpu;
> > +     unsigned long capacity = (cur_max_freq << SCHED_CAPACITY_SHIFT) /
> > +                               max_freq;
> > +
> > +     for_each_cpu(cpu, cpus)
> > +             update_thermal_pressure(cpu, capacity);
> > +}
> > +
> > +/**
> >   * get_load() - get load for a cpu since last updated
> >   * @cpufreq_cdev:    &struct cpufreq_cooling_device for this cpu
> >   * @cpu:     cpu number
> > @@ -320,6 +337,7 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
> >                                unsigned long state)
> >  {
> >       struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
> > +     int ret;
> >
> >       /* Request state should be less than max_level */
> >       if (WARN_ON(state > cpufreq_cdev->max_level))
> > @@ -331,8 +349,17 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
> >
> >       cpufreq_cdev->cpufreq_state = state;
> >
> > -     return dev_pm_qos_update_request(&cpufreq_cdev->qos_req,
> > -                             cpufreq_cdev->freq_table[state].frequency);
> > +     ret = dev_pm_qos_update_request
> > +                             (&cpufreq_cdev->qos_req,
> > +                              cpufreq_cdev->freq_table[state].frequency);
> > +
> > +     if (ret > 0)
> > +             update_sched_max_capacity
> > +                             (cpufreq_cdev->policy->cpus,
> > +                              cpufreq_cdev->freq_table[state].frequency,
> > +                              cpufreq_cdev->policy->cpuinfo.max_freq);
> > +
> > +     return ret;
> >  }
> >
> >  /**
> >
>
> Why not getting rid of update_sched_max_capacity() entirely and call
> update_thermal_pressure() in cpu_cooling.c directly? Saves one level in
> the call chain and would mean less code for this feature.

But you add complexity in update_thermal_pressure(), which now has to
deal with a cpumask and compute some frequency ratio.
IMHO, it's cleaner to keep update_thermal_pressure() simple, as it is now.

>
> Just compile tested on arm64:
>
> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> index 3211b4d3a899..bf36995013b0 100644
> --- a/drivers/thermal/cpu_cooling.c
> +++ b/drivers/thermal/cpu_cooling.c
> @@ -217,23 +217,6 @@ static u32 cpu_power_to_freq(struct
> cpufreq_cooling_device *cpufreq_cdev,
>         return freq_table[i - 1].frequency;
>  }
>
> -/**
> - * update_sched_max_capacity - update scheduler about change in cpu
> - *                                     max frequency.
> - * @policy - cpufreq policy whose max frequency is capped.
> - */
> -static void update_sched_max_capacity(struct cpumask *cpus,
> -                                     unsigned int cur_max_freq,
> -                                     unsigned int max_freq)
> -{
> -       int cpu;
> -       unsigned long capacity = (cur_max_freq << SCHED_CAPACITY_SHIFT) /
> -                                 max_freq;
> -
> -       for_each_cpu(cpu, cpus)
> -               update_thermal_pressure(cpu, capacity);
> -}
> -
>  /**
>   * get_load() - get load for a cpu since last updated
>   * @cpufreq_cdev:      &struct cpufreq_cooling_device for this cpu
> @@ -353,7 +336,7 @@ static int cpufreq_set_cur_state(struct
> thermal_cooling_device *cdev,
>                                 cpufreq_cdev->freq_table[state].frequency);
>
>         if (ret > 0)
> -               update_sched_max_capacity
> +               update_thermal_pressure
>                                 (cpufreq_cdev->policy->cpus,
>                                  cpufreq_cdev->freq_table[state].frequency,
>                                  cpufreq_cdev->policy->cpuinfo.max_freq);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 55dfe9634f67..5707813c7621 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1985,9 +1985,9 @@ static inline void rseq_syscall(struct pt_regs *regs)
>  #endif
>
>  #ifdef CONFIG_SMP
> -void update_thermal_pressure(int cpu, u64 capacity);
> +void update_thermal_pressure(struct cpumask *cpus, unsigned int cur,
> unsigned int max);
>  #else
> -static inline void update_thermal_pressure(int cpu, u64 capacity)
> +static inline void update_thermal_pressure(struct cpumask *cpus,
> unsigned int cur, unsigned int max);
>  {
>  }
>  #endif
> diff --git a/kernel/sched/thermal.c b/kernel/sched/thermal.c
> index 0da31e12a5ff..691bdd79597a 100644
> --- a/kernel/sched/thermal.c
> +++ b/kernel/sched/thermal.c
> @@ -43,17 +43,16 @@ static DEFINE_PER_CPU(unsigned long, delta_capacity);
>   * the arch_scale_cpu_capacity and capped capacity is stored in per cpu
>   * delta_capacity.
>   */
> -void update_thermal_pressure(int cpu, u64 capped_freq_ratio)
> +void update_thermal_pressure(struct cpumask *cpus, unsigned int cur,
> unsigned int max)
>  {
> -       unsigned long __capacity, delta;
> +       int cpu;
>
> -       /* Normalize the capped freq ratio */
> -       __capacity = (capped_freq_ratio * arch_scale_cpu_capacity(cpu)) >>
> -
> SCHED_CAPACITY_SHIFT;
> -       delta = arch_scale_cpu_capacity(cpu) -  __capacity;
> -       pr_debug("updating cpu%d thermal pressure to %lu\n", cpu, delta);
> +       for_each_cpu(cpu, cpus) {
> +               unsigned long scale_cap = arch_scale_cpu_capacity(cpu);
> +               unsigned long cur_cap = cur * scale_cap / max;
>
> -       per_cpu(delta_capacity, cpu) = delta;
> +               per_cpu(delta_capacity, cpu) = scale_cap - cur_cap;
> +       }
>  }


* Re: [Patch v4 0/6] Introduce Thermal Pressure
  2019-10-31  9:44 ` Ionela Voinescu
@ 2019-10-31 16:41   ` Thara Gopinath
  2019-10-31 16:52     ` Thara Gopinath
  2019-11-05 21:04     ` Ionela Voinescu
  0 siblings, 2 replies; 51+ messages in thread
From: Thara Gopinath @ 2019-10-31 16:41 UTC (permalink / raw)
  To: Ionela Voinescu
  Cc: mingo, peterz, vincent.guittot, rui.zhang, edubezval, qperret,
	linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On 10/31/2019 05:44 AM, Ionela Voinescu wrote:
> Hi Thara,
> 
> On Tuesday 22 Oct 2019 at 16:34:19 (-0400), Thara Gopinath wrote:
>> Thermal governors can respond to an overheat event of a cpu by
>> capping the cpu's maximum possible frequency. This in turn
>> means that the maximum available compute capacity of the
>> cpu is restricted. But today in the kernel, task scheduler is 
>> not notified of capping of maximum frequency of a cpu.
>> In other words, scheduler is unware of maximum capacity
> 
> Nit: s/unware/unaware
> 
>> restrictions placed on a cpu due to thermal activity.
>> This patch series attempts to address this issue.
>> The benefits identified are better task placement among available
>> cpus in event of overheating which in turn leads to better
>> performance numbers.
>>
>> The reduction in the maximum possible capacity of a cpu due to a 
>> thermal event can be considered as thermal pressure. Instantaneous
>> thermal pressure is hard to record and can sometime be erroneous
>> as there can be mismatch between the actual capping of capacity
>> and scheduler recording it. Thus solution is to have a weighted
>> average per cpu value for thermal pressure over time.
>> The weight reflects the amount of time the cpu has spent at a
>> capped maximum frequency. Since thermal pressure is recorded as
>> an average, it must be decayed periodically. Exisiting algorithm
>> in the kernel scheduler pelt framework is re-used to calculate
>> the weighted average. This patch series also defines a sysctl
>> inerface to allow for a configurable decay period.
>>
>> Regarding testing, basic build, boot and sanity testing have been
>> performed on db845c platform with debian file system.
>> Further, dhrystone and hackbench tests have been
>> run with the thermal pressure algorithm. During testing, due to
>> constraints of step wise governor in dealing with big little systems,
>> trip point 0 temperature was made assymetric between cpus in little
>> cluster and big cluster; the idea being that
>> big core will heat up and cpu cooling device will throttle the
>> frequency of the big cores faster, there by limiting the maximum available
>> capacity and the scheduler will spread out tasks to little cores as well.
>>
> 
> Can you please share the changes you've made to sdm845.dtsi and a kernel
> base on top of which to apply your patches? I would like to reproduce
> your results and run more tests and it would be good if our setups were
> as close as possible.
Hi Ionela,

Thank you for the review.
So I tested this on the 5.4-rc1 kernel. The dtsi change is to reduce the
thermal trip points for the big CPUs to 60000 or 70000 from the default
90000. I did this for two reasons:
1. I could never get the db845 to heat up sufficiently for my test cases
with the default trip.
2. I was using the default step-wise governor for thermal. I did not
want little and big to start throttling by the same % because then the
task placement ratio would remain the same between little and big cores.


> 
>> Test Results
>>
>> Hackbench: 1 group , 30000 loops, 10 runs       
>>                                                Result         SD             
>>                                                (Secs)     (% of mean)     
>>  No Thermal Pressure                            14.03       2.69%           
>>  Thermal Pressure PELT Algo. Decay : 32 ms      13.29       0.56%         
>>  Thermal Pressure PELT Algo. Decay : 64 ms      12.57       1.56%           
>>  Thermal Pressure PELT Algo. Decay : 128 ms     12.71       1.04%         
>>  Thermal Pressure PELT Algo. Decay : 256 ms     12.29       1.42%           
>>  Thermal Pressure PELT Algo. Decay : 512 ms     12.42       1.15%  
>>
>> Dhrystone Run Time  : 20 threads, 3000 MLOOPS
>>                                                  Result      SD             
>>                                                  (Secs)    (% of mean)     
>>  No Thermal Pressure                              9.452      4.49%
>>  Thermal Pressure PELT Algo. Decay : 32 ms        8.793      5.30%
>>  Thermal Pressure PELT Algo. Decay : 64 ms        8.981      5.29%
>>  Thermal Pressure PELT Algo. Decay : 128 ms       8.647      6.62%
>>  Thermal Pressure PELT Algo. Decay : 256 ms       8.774      6.45%
>>  Thermal Pressure PELT Algo. Decay : 512 ms       8.603      5.41%  
>>
> 
> Do you happen to know by how much the CPUs were capped during these
> experiments?

I don't have any captured results here. I know that big cores were
capped and at times there was capacity inversion.

Also I will fix the nit comments above.

> 
> Thanks,
> Ionela.
> 



-- 
Warm Regards
Thara


* Re: [Patch v4 4/6] sched/fair: update cpu_capcity to reflect thermal pressure
  2019-10-31 16:31               ` Vincent Guittot
@ 2019-10-31 16:44                 ` Dietmar Eggemann
  0 siblings, 0 replies; 51+ messages in thread
From: Dietmar Eggemann @ 2019-10-31 16:44 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Qais Yousef, Peter Zijlstra, Thara Gopinath, Ingo Molnar,
	Ionela Voinescu, Zhang Rui, Eduardo Valentin, Quentin Perret,
	linux-kernel, Amit Kachhap, Javi Merino, Daniel Lezcano

On 31.10.19 17:31, Vincent Guittot wrote:
> On Thu, 31 Oct 2019 at 17:17, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 31.10.19 16:48, Vincent Guittot wrote:
>>> On Thu, 31 Oct 2019 at 16:38, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>>
>>>> On 31.10.19 11:53, Qais Yousef wrote:
>>>>> On 10/28/19 16:30, Peter Zijlstra wrote:
>>>>>> On Wed, Oct 23, 2019 at 01:28:40PM +0100, Qais Yousef wrote:
>>>>>>> On 10/22/19 16:34, Thara Gopinath wrote:

[...]

>>>> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
>>>> index 38210691c615..d3035457483f 100644
>>>> --- a/kernel/sched/pelt.c
>>>> +++ b/kernel/sched/pelt.c
>>>> @@ -357,9 +357,9 @@ int update_thermal_load_avg(u64 now, struct rq *rq,
>>>> u64 capacity)
>>>>  {
>>>>         if (___update_load_sum(now, &rq->avg_thermal,
>>>>                                capacity,
>>>> -                              capacity,
>>>> -                              capacity)) {
>>>> -               ___update_load_avg(&rq->avg_thermal, 1, 1);
>>>> +                              0,
>>>> +                              0)) {
>>>> +               ___update_load_avg(&rq->avg_thermal, 1, 0);
>>>>                 return 1;
>>>>         }
>>
>> So we could call it this way since we don't care about runnable_load or
>> util?
> 
> One way or the other is quite similar, but the current solution is
> aligned with the other irq, rt, dl signals, which duplicate the same
> state in each field

I see. But there is a subtle difference. For irq, rt, dl, we have to
also set load (even though we only use util) because of:

___update_load_sum() {

        ...
        if (!load)
                runnable = running = 0;
        ...
}

which is there for se's only.

I like self-explanatory code but I agree in this case it's not obvious.
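The gating Dietmar points at can be seen in a stand-alone toy model (illustrative names only, not the kernel's actual helper): with load passed as 0, runnable and running are forced to 0 as well, so a caller that only cares about util still has to pass a non-zero load.

```c
#include <assert.h>

/* Toy model of the load/runnable/running gating in
 * ___update_load_sum(): when load is 0, runnable and running are
 * zeroed too.  The gating exists for sched entities, which is why
 * the irq/rt/dl callers duplicate the value in all three fields. */
struct toy_avg {
	unsigned long load_sum;
	unsigned long runnable_sum;
	unsigned long util_sum;
};

static void toy_update_load_sum(struct toy_avg *sa, unsigned long contrib,
				unsigned long load, int runnable, int running)
{
	if (!load)
		runnable = running = 0;

	if (load)
		sa->load_sum += load * contrib;
	if (runnable)
		sa->runnable_sum += contrib;
	if (running)
		sa->util_sum += contrib << 10;	/* SCHED_CAPACITY_SHIFT */
}
```

With load = 0, util_sum never accumulates, which is the subtle difference being discussed above.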

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Patch v4 5/6] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping
  2019-10-31 16:29   ` Dietmar Eggemann
  2019-10-31 16:38     ` Vincent Guittot
@ 2019-10-31 16:46     ` Thara Gopinath
  1 sibling, 0 replies; 51+ messages in thread
From: Thara Gopinath @ 2019-10-31 16:46 UTC (permalink / raw)
  To: Dietmar Eggemann, mingo, peterz, ionela.voinescu,
	vincent.guittot, rui.zhang, edubezval, qperret
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On 10/31/2019 12:29 PM, Dietmar Eggemann wrote:
> On 22.10.19 22:34, Thara Gopinath wrote:
>> Thermal governors can request for a cpu's maximum supported frequency
>> to be capped in case of an overheat event. This in turn means that the
>> maximum capacity available for tasks to run on the particular cpu is
>> reduced. Delta between the original maximum capacity and capped
>> maximum capacity is known as thermal pressure. Enable cpufreq cooling
>> device to update the thermal pressure in event of a capped
>> maximum frequency.
>>
>> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
>> ---
>>  drivers/thermal/cpu_cooling.c | 31 +++++++++++++++++++++++++++++--
>>  1 file changed, 29 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
>> index 391f397..2e6a979 100644
>> --- a/drivers/thermal/cpu_cooling.c
>> +++ b/drivers/thermal/cpu_cooling.c
>> @@ -218,6 +218,23 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
>>  }
>>  
>>  /**
>> + * update_sched_max_capacity - update scheduler about change in cpu
>> + *					max frequency.
>> + * @policy - cpufreq policy whose max frequency is capped.
>> + */
>> +static void update_sched_max_capacity(struct cpumask *cpus,
>> +				      unsigned int cur_max_freq,
>> +				      unsigned int max_freq)
>> +{
>> +	int cpu;
>> +	unsigned long capacity = (cur_max_freq << SCHED_CAPACITY_SHIFT) /
>> +				  max_freq;
>> +
>> +	for_each_cpu(cpu, cpus)
>> +		update_thermal_pressure(cpu, capacity);
>> +}
>> +
>> +/**
>>   * get_load() - get load for a cpu since last updated
>>   * @cpufreq_cdev:	&struct cpufreq_cooling_device for this cpu
>>   * @cpu:	cpu number
>> @@ -320,6 +337,7 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
>>  				 unsigned long state)
>>  {
>>  	struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
>> +	int ret;
>>  
>>  	/* Request state should be less than max_level */
>>  	if (WARN_ON(state > cpufreq_cdev->max_level))
>> @@ -331,8 +349,17 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
>>  
>>  	cpufreq_cdev->cpufreq_state = state;
>>  
>> -	return dev_pm_qos_update_request(&cpufreq_cdev->qos_req,
>> -				cpufreq_cdev->freq_table[state].frequency);
>> +	ret = dev_pm_qos_update_request
>> +				(&cpufreq_cdev->qos_req,
>> +				 cpufreq_cdev->freq_table[state].frequency);
>> +
>> +	if (ret > 0)
>> +		update_sched_max_capacity
>> +				(cpufreq_cdev->policy->cpus,
>> +				 cpufreq_cdev->freq_table[state].frequency,
>> +				 cpufreq_cdev->policy->cpuinfo.max_freq);
>> +
>> +	return ret;
>>  }
>>  
>>  /**
>>
> 
> Why not getting rid of update_sched_max_capacity() entirely and call
> update_thermal_pressure() in cpu_cooling.c directly? Saves one level in
> the call chain and would mean less code for this feature.

Hi Dietmar,
Thanks for the review.

I did not want the scheduler piece of code to loop through the cpus.
Do you feel strongly about this one?
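For reference, the normalization done in update_sched_max_capacity() in the patch above can be sketched stand-alone (the helper name here is made up for illustration): it maps a capped maximum frequency onto the scheduler's 0..1024 capacity scale.

```c
#include <assert.h>

#define SCHED_CAPACITY_SHIFT	10

/* Illustrative stand-alone version of the frequency-to-capacity
 * normalization in update_sched_max_capacity(): a cpu capped to half
 * its maximum frequency gets a normalized capacity of 512. */
static unsigned long capped_capacity(unsigned long cur_max_freq,
				     unsigned long max_freq)
{
	return (cur_max_freq << SCHED_CAPACITY_SHIFT) / max_freq;
}
```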

Warm Regards
Thara
> 
> Just compile tested on arm64:
> 
> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> index 3211b4d3a899..bf36995013b0 100644
> --- a/drivers/thermal/cpu_cooling.c
> +++ b/drivers/thermal/cpu_cooling.c
> @@ -217,23 +217,6 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
>         return freq_table[i - 1].frequency;
>  }
> 
> -/**
> - * update_sched_max_capacity - update scheduler about change in cpu
> - *                                     max frequency.
> - * @policy - cpufreq policy whose max frequency is capped.
> - */
> -static void update_sched_max_capacity(struct cpumask *cpus,
> -                                     unsigned int cur_max_freq,
> -                                     unsigned int max_freq)
> -{
> -       int cpu;
> -       unsigned long capacity = (cur_max_freq << SCHED_CAPACITY_SHIFT) /
> -                                 max_freq;
> -
> -       for_each_cpu(cpu, cpus)
> -               update_thermal_pressure(cpu, capacity);
> -}
> -
>  /**
>   * get_load() - get load for a cpu since last updated
>   * @cpufreq_cdev:      &struct cpufreq_cooling_device for this cpu
> @@ -353,7 +336,7 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
>                                 cpufreq_cdev->freq_table[state].frequency);
> 
>         if (ret > 0)
> -               update_sched_max_capacity
> +               update_thermal_pressure
>                                 (cpufreq_cdev->policy->cpus,
>                                  cpufreq_cdev->freq_table[state].frequency,
>                                  cpufreq_cdev->policy->cpuinfo.max_freq);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 55dfe9634f67..5707813c7621 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1985,9 +1985,9 @@ static inline void rseq_syscall(struct pt_regs *regs)
>  #endif
> 
>  #ifdef CONFIG_SMP
> -void update_thermal_pressure(int cpu, u64 capacity);
> +void update_thermal_pressure(struct cpumask *cpus, unsigned int cur,
> +			     unsigned int max);
>  #else
> -static inline void update_thermal_pressure(int cpu, u64 capacity)
> +static inline void update_thermal_pressure(struct cpumask *cpus,
> +					   unsigned int cur, unsigned int max)
>  {
>  }
>  #endif
> diff --git a/kernel/sched/thermal.c b/kernel/sched/thermal.c
> index 0da31e12a5ff..691bdd79597a 100644
> --- a/kernel/sched/thermal.c
> +++ b/kernel/sched/thermal.c
> @@ -43,17 +43,16 @@ static DEFINE_PER_CPU(unsigned long, delta_capacity);
>   * the arch_scale_cpu_capacity and capped capacity is stored in per cpu
>   * delta_capacity.
>   */
> -void update_thermal_pressure(int cpu, u64 capped_freq_ratio)
> +void update_thermal_pressure(struct cpumask *cpus, unsigned int cur,
> +			     unsigned int max)
>  {
> -       unsigned long __capacity, delta;
> +       int cpu;
> 
> -       /* Normalize the capped freq ratio */
> -       __capacity = (capped_freq_ratio * arch_scale_cpu_capacity(cpu)) >>
> -                                                      SCHED_CAPACITY_SHIFT;
> -       delta = arch_scale_cpu_capacity(cpu) -  __capacity;
> -       pr_debug("updating cpu%d thermal pressure to %lu\n", cpu, delta);
> +       for_each_cpu(cpu, cpus) {
> +               unsigned long scale_cap = arch_scale_cpu_capacity(cpu);
> +               unsigned long cur_cap = cur * scale_cap / max;
> 
> -       per_cpu(delta_capacity, cpu) = delta;
> +               per_cpu(delta_capacity, cpu) = scale_cap - cur_cap;
> +       }
>  }
> 


-- 
Warm Regards
Thara

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Patch v4 3/6] sched/fair: Enable CFS periodic tick to update thermal pressure
  2019-10-31 16:11   ` Dietmar Eggemann
@ 2019-10-31 16:46     ` Thara Gopinath
  0 siblings, 0 replies; 51+ messages in thread
From: Thara Gopinath @ 2019-10-31 16:46 UTC (permalink / raw)
  To: Dietmar Eggemann, mingo, peterz, ionela.voinescu,
	vincent.guittot, rui.zhang, edubezval, qperret
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On 10/31/2019 12:11 PM, Dietmar Eggemann wrote:
> On 22.10.19 22:34, Thara Gopinath wrote:
>> Introduce support in CFS periodic tick to trigger the process of
>> computing average thermal pressure for a cpu.
>>
>> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
>> ---
>>  kernel/sched/fair.c | 5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 682a754..4f9c2cb 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -21,6 +21,7 @@
>>   *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
>>   */
>>  #include "sched.h"
>> +#include "thermal.h"
>>  
>>  #include <trace/events/sched.h>
>>  
>> @@ -7574,6 +7575,8 @@ static void update_blocked_averages(int cpu)
>>  		done = false;
>>  
>>  	update_blocked_load_status(rq, !done);
>> +
>> +	trigger_thermal_pressure_average(rq);
>>  	rq_unlock_irqrestore(rq, &rf);
>>  }
> 
> Since you update the thermal pressure signal in CFS's
> update_blocked_averages() as well, I guess the patch title has to change.
Will do. Thanks.
> 
>>  
>> @@ -9933,6 +9936,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>>  
>>  	update_misfit_status(curr, rq);
>>  	update_overutilized_status(task_rq(curr));
>> +
>> +	trigger_thermal_pressure_average(rq);
>>  }
>>  
>>  /*
>>


-- 
Warm Regards
Thara

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Patch v4 0/6] Introduce Thermal Pressure
  2019-10-31 16:41   ` Thara Gopinath
@ 2019-10-31 16:52     ` Thara Gopinath
  2019-11-05 21:04     ` Ionela Voinescu
  1 sibling, 0 replies; 51+ messages in thread
From: Thara Gopinath @ 2019-10-31 16:52 UTC (permalink / raw)
  To: Ionela Voinescu
  Cc: mingo, peterz, vincent.guittot, rui.zhang, edubezval, qperret,
	linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On 10/31/2019 12:41 PM, Thara Gopinath wrote:
> On 10/31/2019 05:44 AM, Ionela Voinescu wrote:
>> Hi Thara,
>>
>> On Tuesday 22 Oct 2019 at 16:34:19 (-0400), Thara Gopinath wrote:
>>> Thermal governors can respond to an overheat event of a cpu by
>>> capping the cpu's maximum possible frequency. This in turn
>>> means that the maximum available compute capacity of the
>>> cpu is restricted. But today in the kernel, task scheduler is 
>>> not notified of capping of maximum frequency of a cpu.
>>> In other words, scheduler is unware of maximum capacity
>>
>> Nit: s/unware/unaware
>>
>>> restrictions placed on a cpu due to thermal activity.
>>> This patch series attempts to address this issue.
>>> The benefits identified are better task placement among available
>>> cpus in event of overheating which in turn leads to better
>>> performance numbers.
>>>
>>> The reduction in the maximum possible capacity of a cpu due to a 
>>> thermal event can be considered as thermal pressure. Instantaneous
>>> thermal pressure is hard to record and can sometimes be erroneous
>>> as there can be mismatch between the actual capping of capacity
>>> and the scheduler recording it. Thus the solution is to have a weighted
>>> average per cpu value for thermal pressure over time.
>>> The weight reflects the amount of time the cpu has spent at a
>>> capped maximum frequency. Since thermal pressure is recorded as
>>> an average, it must be decayed periodically. Existing algorithm
>>> in the kernel scheduler pelt framework is re-used to calculate
>>> the weighted average. This patch series also defines a sysctl
>>> interface to allow for a configurable decay period.
>>>
>>> Regarding testing, basic build, boot and sanity testing have been
>>> performed on db845c platform with debian file system.
>>> Further, dhrystone and hackbench tests have been
>>> run with the thermal pressure algorithm. During testing, due to
>>> constraints of step wise governor in dealing with big little systems,
>>> trip point 0 temperature was made asymmetric between cpus in little
>>> cluster and big cluster; the idea being that
>>> big core will heat up and cpu cooling device will throttle the
>>> frequency of the big cores faster, thereby limiting the maximum available
>>> capacity and the scheduler will spread out tasks to little cores as well.
>>>
>>
>> Can you please share the changes you've made to sdm845.dtsi and a kernel
>> base on top of which to apply your patches? I would like to reproduce
>> your results and run more tests and it would be good if our setups were
>> as close as possible.
> Hi Ionela
> Thank you for the review.
> So I tested this on the 5.4-rc1 kernel. The dtsi change is to reduce the
> thermal trip points for the big CPUs to 60000 or 70000 from the default
> 90000. I did this for 2 reasons:
> 1. I could never get the db845 to heat up sufficiently for my test cases
> with the default trip.
> 2. I was using the default step-wise governor for thermal. I did not
> want little and big to start throttling by the same % because then the
> task placement ratio will remain the same between little and big cores.
> 

I am not sure, though, if this is the setup under which Daniel ran
glmark2. I will let him comment on it.

> 
> 
> 


-- 
Warm Regards
Thara

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Patch v4 4/6] sched/fair: update cpu_capacity to reflect thermal pressure
  2019-10-31 16:03         ` Thara Gopinath
@ 2019-10-31 16:56           ` Qais Yousef
  0 siblings, 0 replies; 51+ messages in thread
From: Qais Yousef @ 2019-10-31 16:56 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: Peter Zijlstra, mingo, ionela.voinescu, vincent.guittot,
	rui.zhang, edubezval, qperret, linux-kernel, amit.kachhap,
	javi.merino, daniel.lezcano

On 10/31/19 12:03, Thara Gopinath wrote:
> On 10/31/2019 06:53 AM, Qais Yousef wrote:
> > On 10/28/19 16:30, Peter Zijlstra wrote:
> >> On Wed, Oct 23, 2019 at 01:28:40PM +0100, Qais Yousef wrote:
> >>> On 10/22/19 16:34, Thara Gopinath wrote:
> >>>> cpu_capacity reflects the maximum available capacity of a cpu. Thermal
> >>>> pressure on a cpu means this maximum available capacity is reduced. This
> >>>> patch reduces the average thermal pressure for a cpu from its maximum
> >>>> available capacity so that cpu_capacity reflects the actual
> >>>> available capacity.
> >>>>
> >>>> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> >>>> ---
> >>>>  kernel/sched/fair.c | 1 +
> >>>>  1 file changed, 1 insertion(+)
> >>>>
> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>> index 4f9c2cb..be3e802 100644
> >>>> --- a/kernel/sched/fair.c
> >>>> +++ b/kernel/sched/fair.c
> >>>> @@ -7727,6 +7727,7 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
> >>>>  
> >>>>  	used = READ_ONCE(rq->avg_rt.util_avg);
> >>>>  	used += READ_ONCE(rq->avg_dl.util_avg);
> >>>> +	used += READ_ONCE(rq->avg_thermal.load_avg);
> >>>
> >>> Maybe a naive question - but can we add util_avg with load_avg without
> >>> a conversion? I thought the 2 signals have different properties.
> >>
> >> Changelog of patch #1 explains, it's in that dense blob of text.
> >>
> >> But yes, you're quite right that that wants a comment here.
> > 
> > Thanks for the pointer! A comment would be nice indeed.
> > 
> > To make sure I got this correctly - it's because avg_thermal.load_avg
> > represents delta_capacity which is already a 'converted' form of load. So this
> > makes avg_thermal.load_avg a util_avg really. Correct?
> Hello Qais,
> 
> Sorry for not replying to your earlier email. Thanks for the review.
> So if you look at the code, util_sum is calculated as a binary signal
> converted into capacity. Check out the below snippet from accumulate_sum:
> 
> 	if (load)
> 		sa->load_sum += load * contrib;
> 	if (runnable)
> 		sa->runnable_load_sum += runnable * contrib;
> 	if (running)
> 		sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
> 
> So the actual delta for the thermal pressure will never be considered
> if util_avg is used.
> 
> I will update this patch with relevant comment.

Okay thanks for the explanation.

--
Qais Yousef
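(Editorial aside: the conversion Thara describes, where the recorded signal is already a capacity delta rather than a running/not-running binary, can be sketched stand-alone. This mirrors the delta computed by update_thermal_pressure() earlier in the series; the helper name is illustrative.)

```c
#include <assert.h>

#define SCHED_CAPACITY_SHIFT	10

/* Illustrative per-cpu thermal pressure delta, following
 * update_thermal_pressure() in this series: capped_freq_ratio is the
 * capped max frequency already normalized to 0..1024, cpu_scale
 * stands in for arch_scale_cpu_capacity(cpu). */
static unsigned long thermal_delta(unsigned long cpu_scale,
				   unsigned long capped_freq_ratio)
{
	unsigned long capped = (capped_freq_ratio * cpu_scale) >>
			       SCHED_CAPACITY_SHIFT;

	return cpu_scale - capped;
}
```

E.g. a big core with scale 1024 capped to half its frequency carries a delta of 512, which is what avg_thermal.load_avg then tracks.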

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Patch v4 0/6] Introduce Thermal Pressure
  2019-10-31 12:57       ` Ionela Voinescu
@ 2019-10-31 17:48         ` Daniel Lezcano
  0 siblings, 0 replies; 51+ messages in thread
From: Daniel Lezcano @ 2019-10-31 17:48 UTC (permalink / raw)
  To: Ionela Voinescu
  Cc: Thara Gopinath, mingo, peterz, vincent.guittot, rui.zhang,
	edubezval, qperret, linux-kernel, amit.kachhap, javi.merino

On 31/10/2019 13:57, Ionela Voinescu wrote:
> On Thursday 31 Oct 2019 at 12:54:03 (+0100), Daniel Lezcano wrote:
>> Hi Ionela,
>>
>> On 31/10/2019 11:07, Ionela Voinescu wrote:
>>> Hi Daniel,
>>>
>>> On Tuesday 29 Oct 2019 at 16:34:11 (+0100), Daniel Lezcano wrote:
>>>> Hi Thara,
>>>>
>>>> On 22/10/2019 22:34, Thara Gopinath wrote:
>>>>> Thermal governors can respond to an overheat event of a cpu by
>>>>> capping the cpu's maximum possible frequency. This in turn
>>>>> means that the maximum available compute capacity of the
>>>>> cpu is restricted. But today in the kernel, task scheduler is 
>>>>> not notified of capping of maximum frequency of a cpu.
>>>>> In other words, scheduler is unaware of maximum capacity
>>>>> restrictions placed on a cpu due to thermal activity.
>>>>> This patch series attempts to address this issue.
>>>>> The benefits identified are better task placement among available
>>>>> cpus in event of overheating which in turn leads to better
>>>>> performance numbers.
>>>>>
>>>>> The reduction in the maximum possible capacity of a cpu due to a 
>>>>> thermal event can be considered as thermal pressure. Instantaneous
>>>>> thermal pressure is hard to record and can sometimes be erroneous
>>>>> as there can be mismatch between the actual capping of capacity
>>>>> and the scheduler recording it. Thus the solution is to have a weighted
>>>>> average per cpu value for thermal pressure over time.
>>>>> The weight reflects the amount of time the cpu has spent at a
>>>>> capped maximum frequency. Since thermal pressure is recorded as
>>>>> an average, it must be decayed periodically. Existing algorithm
>>>>> in the kernel scheduler pelt framework is re-used to calculate
>>>>> the weighted average. This patch series also defines a sysctl
>>>>> interface to allow for a configurable decay period.
>>>>>
>>>>> Regarding testing, basic build, boot and sanity testing have been
>>>>> performed on db845c platform with debian file system.
>>>>> Further, dhrystone and hackbench tests have been
>>>>> run with the thermal pressure algorithm. During testing, due to
>>>>> constraints of step wise governor in dealing with big little systems,
>>>>> trip point 0 temperature was made asymmetric between cpus in little
>>>>> cluster and big cluster; the idea being that
>>>>> big core will heat up and cpu cooling device will throttle the
>>>>> frequency of the big cores faster, thereby limiting the maximum available
>>>>> capacity and the scheduler will spread out tasks to little cores as well.
>>>>>
>>>>> Test Results
>>>>>
>>>>> Hackbench: 1 group , 30000 loops, 10 runs       
>>>>>                                                Result         SD             
>>>>>                                                (Secs)     (% of mean)     
>>>>>  No Thermal Pressure                            14.03       2.69%           
>>>>>  Thermal Pressure PELT Algo. Decay : 32 ms      13.29       0.56%         
>>>>>  Thermal Pressure PELT Algo. Decay : 64 ms      12.57       1.56%           
>>>>>  Thermal Pressure PELT Algo. Decay : 128 ms     12.71       1.04%         
>>>>>  Thermal Pressure PELT Algo. Decay : 256 ms     12.29       1.42%           
>>>>>  Thermal Pressure PELT Algo. Decay : 512 ms     12.42       1.15%  
>>>>>
>>>>> Dhrystone Run Time  : 20 threads, 3000 MLOOPS
>>>>>                                                  Result      SD             
>>>>>                                                  (Secs)    (% of mean)     
>>>>>  No Thermal Pressure                              9.452      4.49%
>>>>>  Thermal Pressure PELT Algo. Decay : 32 ms        8.793      5.30%
>>>>>  Thermal Pressure PELT Algo. Decay : 64 ms        8.981      5.29%
>>>>>  Thermal Pressure PELT Algo. Decay : 128 ms       8.647      6.62%
>>>>>  Thermal Pressure PELT Algo. Decay : 256 ms       8.774      6.45%
>>>>>  Thermal Pressure PELT Algo. Decay : 512 ms       8.603      5.41%  
>>>>
>>>> I took the opportunity to try glmark2 on the db845c platform with the
>>>> default decay and got the following glmark2 scores:
>>>>
>>>> Without thermal pressure:
>>>>
>>>> # NumSamples = 9; Min = 790.00; Max = 805.00
>>>> # Mean = 794.888889; Variance = 19.209877; SD = 4.382907; Median 794.000000
>>>> # each ∎ represents a count of 1
>>>>   790.0000 -   791.5000 [     2]: ∎∎
>>>>   791.5000 -   793.0000 [     2]: ∎∎
>>>>   793.0000 -   794.5000 [     2]: ∎∎
>>>>   794.5000 -   796.0000 [     1]: ∎
>>>>   796.0000 -   797.5000 [     0]:
>>>>   797.5000 -   799.0000 [     1]: ∎
>>>>   799.0000 -   800.5000 [     0]:
>>>>   800.5000 -   802.0000 [     0]:
>>>>   802.0000 -   803.5000 [     0]:
>>>>   803.5000 -   805.0000 [     1]: ∎
>>>>
>>>>
>>>> With thermal pressure:
>>>>
>>>> # NumSamples = 9; Min = 933.00; Max = 960.00
>>>> # Mean = 940.777778; Variance = 64.172840; SD = 8.010795; Median 937.000000
>>>> # each ∎ represents a count of 1
>>>>   933.0000 -   935.7000 [     3]: ∎∎∎
>>>>   935.7000 -   938.4000 [     2]: ∎∎
>>>>   938.4000 -   941.1000 [     2]: ∎∎
>>>>   941.1000 -   943.8000 [     0]:
>>>>   943.8000 -   946.5000 [     0]:
>>>>   946.5000 -   949.2000 [     1]: ∎
>>>>   949.2000 -   951.9000 [     0]:
>>>>   951.9000 -   954.6000 [     0]:
>>>>   954.6000 -   957.3000 [     0]:
>>>>   957.3000 -   960.0000 [     1]: ∎
>>>>
>>>
>>> Interesting! If I'm interpreting these correctly there seems to be
>>> significant improvement when applying thermal pressure.
>>>
>>> I'm not familiar with glmark2, can you tell me more about the process
>>> and the work that the benchmark does?
>>
>> glmark2 is a 3D benchmark. I ran it without parameters, so all tests are
>> run. At the end it gives a score; those are the values given above.
>>
>>> I assume this is a GPU benchmark,
>>> but not knowing more about it I fail to see the correlation between
>>> applying thermal pressure to CPU capacities and the improvement of GPU
>>> performance.
>>> Do you happen to know more about the behaviour that resulted in these
>>> benchmark scores?
>>
>> My hypothesis is that glmark2 makes the GPU contribute a lot to the
>> heating effect, thus increasing the temperature of the CPUs close to it.
>>
> 
> Hhmm.. yes, I am assuming that there is some thermal mitigation (CPU
> frequency capping) done as a result of the heat inflicted by the work
> on the GPU, but these patches do not result in better thermal
> management such that the GPU itself would perform better. They only
> inform the scheduler of the reduced capacity of the CPUs so it can
> decide to better use the compute capacity it has available.
> 
> There could be a second-order effect from the more efficient use of the
> CPUs which would release thermal headroom for the GPU to use, but I
> would not expect the differences to be as high as in the results above.

Indeed, you may be right.

> Another possibility is that work on the CPUs impacts the scores more
> than I would expect for such a benchmark, but again I would not
> expect the work on the CPUs to be significant enough to result in such
> differences in the scores.
> 
> If you have the chance to look more into exactly what is the behaviour,
> with and without thermal pressure - cooling states, average frequency,
> use of CPUs, use of GPU, etc, it would be very valuable.

Not sure I have enough bandwidth to do all of that. I'll double-check
whether there is a difference when testing both versions.



-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Patch v4 2/6] sched: Add infrastructure to store and update instantaneous thermal pressure
  2019-10-22 20:34 ` [Patch v4 2/6] sched: Add infrastructure to store and update instantaneous " Thara Gopinath
  2019-10-28 15:21   ` Peter Zijlstra
@ 2019-11-01 12:17   ` Dietmar Eggemann
  2019-11-01 20:57     ` Thara Gopinath
  1 sibling, 1 reply; 51+ messages in thread
From: Dietmar Eggemann @ 2019-11-01 12:17 UTC (permalink / raw)
  To: Thara Gopinath, mingo, peterz, ionela.voinescu, vincent.guittot,
	rui.zhang, edubezval, qperret
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On 22.10.19 22:34, Thara Gopinath wrote:

[...]

> +/**
> + * trigger_thermal_pressure_average: Trigger the thermal pressure accumulate
> + *				     and average algorithm
> + */
> +void trigger_thermal_pressure_average(struct rq *rq)
> +{
> +	update_thermal_load_avg(rq_clock_task(rq), rq,
> +				per_cpu(delta_capacity, cpu_of(rq)));
> +}

Why not call update_thermal_load_avg() directly in fair.c? We do this for all
the other update_foo_load_avg() functions (foo being irq, rt_rq, dl_rq, ...).

You don't have to pass 'u64 now', so you can hide it plus the 
sched_thermal_decay_coeff add-on within update_thermal_load_avg().
(Similar to update_irq_load_avg()).

You could even hide 'u64 capacity' in it.

So we save one function layer (trigger_thermal_pressure_average()), thermal becomes
more aligned with the other PELT users when it comes to call sites, and we have less
code for this feature.

Something like this (only for 'u64 now' and only compile-tested on arm64):

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index be3e802a2dc5..ac3ec3a04469 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7576,7 +7576,7 @@ static void update_blocked_averages(int cpu)
 
        update_blocked_load_status(rq, !done);
 
-       trigger_thermal_pressure_average(rq);
+       update_thermal_load_avg(rq, per_cpu(delta_capacity, cpu_of(rq)));
        rq_unlock_irqrestore(rq, &rf);
 }
 
@@ -9938,7 +9938,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
        update_misfit_status(curr, rq);
        update_overutilized_status(task_rq(curr));
 
-       trigger_thermal_pressure_average(rq);
+       update_thermal_load_avg(rq, per_cpu(delta_capacity, cpu_of(rq)));
 }
 
 /*
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 38210691c615..7dd0d7e43854 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -353,8 +353,12 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
        return 0;
 }
 
-int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
+extern int sched_thermal_decay_coeff;
+
+int update_thermal_load_avg(struct rq *rq, u64 capacity)
 {
+       u64 now = rq_clock_task(rq) >> sched_thermal_decay_coeff;
+
        if (___update_load_sum(now, &rq->avg_thermal,
                               capacity,
                               capacity,
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index c74226de716e..91483c957b6c 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -6,7 +6,7 @@ int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se
 int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
 int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
-int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity);
+int update_thermal_load_avg(struct rq *rq, u64 capacity);
 
 #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
 int update_irq_load_avg(struct rq *rq, u64 running);
diff --git a/kernel/sched/thermal.c b/kernel/sched/thermal.c
index 0da31e12a5ff..7b8c4e35c28d 100644
--- a/kernel/sched/thermal.c
+++ b/kernel/sched/thermal.c
@@ -21,7 +21,7 @@
  *     3                       256
  *     4                       512
  */
-static int sched_thermal_decay_coeff;
+int sched_thermal_decay_coeff;
 
 static int __init setup_sched_thermal_decay_coeff(char *str)
 {
@@ -32,7 +32,7 @@ static int __init setup_sched_thermal_decay_coeff(char *str)
 }
 __setup("sched_thermal_decay_coeff=", setup_sched_thermal_decay_coeff);
 
-static DEFINE_PER_CPU(unsigned long, delta_capacity);
+DEFINE_PER_CPU(unsigned long, delta_capacity);
 
 /**
  * update_thermal_pressure: Update thermal pressure
@@ -55,14 +55,3 @@ void update_thermal_pressure(int cpu, u64 capped_freq_ratio)
 
        per_cpu(delta_capacity, cpu) = delta;
 }
-
-/**
- * trigger_thermal_pressure_average: Trigger the thermal pressure accumulate
- *                                  and average algorithm
- */
-void trigger_thermal_pressure_average(struct rq *rq)
-{
-       update_thermal_load_avg(rq_clock_task(rq) >>
-                               sched_thermal_decay_coeff, rq,
-                               per_cpu(delta_capacity, cpu_of(rq)));
-}
diff --git a/kernel/sched/thermal.h b/kernel/sched/thermal.h
index 26e5b07e9c29..a6ee973db41b 100644
--- a/kernel/sched/thermal.h
+++ b/kernel/sched/thermal.h
@@ -2,12 +2,4 @@
 /*
  * Scheduler thermal interaction internal methods.
  */
-
-#ifdef CONFIG_SMP
-void trigger_thermal_pressure_average(struct rq *rq);
-
-#else
-static inline void trigger_thermal_pressure_average(struct rq *rq)
-{
-}
-#endif
+DECLARE_PER_CPU(unsigned long, delta_capacity);
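(Editorial aside on the `rq_clock_task(rq) >> sched_thermal_decay_coeff` shift in the sketch above: right-shifting the PELT clock slows it by a power of two, which stretches the default 32 ms half-life into the decay periods listed in the coefficient table quoted from thermal.c. An illustrative sketch of that relationship:)

```c
#include <assert.h>

/* Illustrative: the effective PELT half-life after right-shifting the
 * clock by the decay coefficient, matching the table in the quoted
 * patch (0 -> 32 ms, 1 -> 64 ms, ... 4 -> 512 ms). */
static unsigned int effective_half_life_ms(unsigned int coeff)
{
	return 32u << coeff;
}
```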

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [Patch v4 5/6] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping
  2019-10-31 16:38     ` Vincent Guittot
@ 2019-11-01 15:47       ` Ionela Voinescu
  2019-11-01 21:04         ` Thara Gopinath
  0 siblings, 1 reply; 51+ messages in thread
From: Ionela Voinescu @ 2019-11-01 15:47 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, Thara Gopinath, Ingo Molnar, Peter Zijlstra,
	Zhang Rui, Eduardo Valentin, Quentin Perret, linux-kernel,
	Amit Kachhap, Javi Merino, Daniel Lezcano

Hi guys,

On Thursday 31 Oct 2019 at 17:38:25 (+0100), Vincent Guittot wrote:
> On Thu, 31 Oct 2019 at 17:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >
> > On 22.10.19 22:34, Thara Gopinath wrote:
> > > Thermal governors can request for a cpu's maximum supported frequency
> > > to be capped in case of an overheat event. This in turn means that the
> > > maximum capacity available for tasks to run on the particular cpu is
> > > reduced. Delta between the original maximum capacity and capped
> > > maximum capacity is known as thermal pressure. Enable cpufreq cooling
> > > device to update the thermal pressure in event of a capped
> > > maximum frequency.
> > >
> > > Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> > > ---
> > >  drivers/thermal/cpu_cooling.c | 31 +++++++++++++++++++++++++++++--
> > >  1 file changed, 29 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> > > index 391f397..2e6a979 100644
> > > --- a/drivers/thermal/cpu_cooling.c
> > > +++ b/drivers/thermal/cpu_cooling.c
> > > @@ -218,6 +218,23 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
> > >  }
> > >
> > >  /**
> > > + * update_sched_max_capacity - update scheduler about change in cpu
> > > + *                                   max frequency.
> > > + * @policy - cpufreq policy whose max frequency is capped.
> > > + */
> > > +static void update_sched_max_capacity(struct cpumask *cpus,
> > > +                                   unsigned int cur_max_freq,
> > > +                                   unsigned int max_freq)
> > > +{
> > > +     int cpu;
> > > +     unsigned long capacity = (cur_max_freq << SCHED_CAPACITY_SHIFT) /
> > > +                               max_freq;
> > > +
> > > +     for_each_cpu(cpu, cpus)
> > > +             update_thermal_pressure(cpu, capacity);
> > > +}
> > > +
> > > +/**
> > >   * get_load() - get load for a cpu since last updated
> > >   * @cpufreq_cdev:    &struct cpufreq_cooling_device for this cpu
> > >   * @cpu:     cpu number
> > > @@ -320,6 +337,7 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
> > >                                unsigned long state)
> > >  {
> > >       struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
> > > +     int ret;
> > >
> > >       /* Request state should be less than max_level */
> > >       if (WARN_ON(state > cpufreq_cdev->max_level))
> > > @@ -331,8 +349,17 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
> > >
> > >       cpufreq_cdev->cpufreq_state = state;
> > >
> > > -     return dev_pm_qos_update_request(&cpufreq_cdev->qos_req,
> > > -                             cpufreq_cdev->freq_table[state].frequency);
> > > +     ret = dev_pm_qos_update_request
> > > +                             (&cpufreq_cdev->qos_req,
> > > +                              cpufreq_cdev->freq_table[state].frequency);
> > > +
> > > +     if (ret > 0)
> > > +             update_sched_max_capacity
> > > +                             (cpufreq_cdev->policy->cpus,
> > > +                              cpufreq_cdev->freq_table[state].frequency,
> > > +                              cpufreq_cdev->policy->cpuinfo.max_freq);
> > > +
> > > +     return ret;
> > >  }
> > >
> > >  /**
> > >
> >
> > Why not getting rid of update_sched_max_capacity() entirely and call
> > update_thermal_pressure() in cpu_cooling.c directly? Saves one level in
> > the call chain and would mean less code for this feature.
> 
> But you add complexity in update_thermal_pressure which now has to
> deal with a cpumask and to compute some frequency ratio
> IMHO, it's cleaner to keep update_thermal_pressure simple as it is now
> 

How about removing update_thermal_pressure altogether and doing all the
work in update_sched_max_capacity? That is, have
update_sched_max_capacity compute the capped_freq_ratio, do the
normalization, and set it per_cpu for all CPUs in the frequency domain.
You'll save some calculations that you're now doing in
update_thermal_pressure for each cpu and you avoid shifting back and
forth.

If you're doing so it would be worth renaming update_sched_max_capacity
to something like update_sched_thermal_pressure.

Thanks,
Ionela.

> >
> > Just compile tested on arm64:
> >
> > diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> > index 3211b4d3a899..bf36995013b0 100644
> > --- a/drivers/thermal/cpu_cooling.c
> > +++ b/drivers/thermal/cpu_cooling.c
> > @@ -217,23 +217,6 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
> >         return freq_table[i - 1].frequency;
> >  }
> >
> > -/**
> > - * update_sched_max_capacity - update scheduler about change in cpu
> > - *                                     max frequency.
> > - * @policy - cpufreq policy whose max frequency is capped.
> > - */
> > -static void update_sched_max_capacity(struct cpumask *cpus,
> > -                                     unsigned int cur_max_freq,
> > -                                     unsigned int max_freq)
> > -{
> > -       int cpu;
> > -       unsigned long capacity = (cur_max_freq << SCHED_CAPACITY_SHIFT) /
> > -                                 max_freq;
> > -
> > -       for_each_cpu(cpu, cpus)
> > -               update_thermal_pressure(cpu, capacity);
> > -}
> > -
> >  /**
> >   * get_load() - get load for a cpu since last updated
> >   * @cpufreq_cdev:      &struct cpufreq_cooling_device for this cpu
> > @@ -353,7 +336,7 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
> >                                 cpufreq_cdev->freq_table[state].frequency);
> >
> >         if (ret > 0)
> > -               update_sched_max_capacity
> > +               update_thermal_pressure
> >                                 (cpufreq_cdev->policy->cpus,
> >                                  cpufreq_cdev->freq_table[state].frequency,
> >                                  cpufreq_cdev->policy->cpuinfo.max_freq);
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 55dfe9634f67..5707813c7621 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1985,9 +1985,9 @@ static inline void rseq_syscall(struct pt_regs *regs)
> >  #endif
> >
> >  #ifdef CONFIG_SMP
> > -void update_thermal_pressure(int cpu, u64 capacity);
> > +void update_thermal_pressure(struct cpumask *cpus, unsigned int cur, unsigned int max);
> >  #else
> > -static inline void update_thermal_pressure(int cpu, u64 capacity)
> > +static inline void update_thermal_pressure(struct cpumask *cpus, unsigned int cur, unsigned int max)
> >  {
> >  }
> >  #endif
> > diff --git a/kernel/sched/thermal.c b/kernel/sched/thermal.c
> > index 0da31e12a5ff..691bdd79597a 100644
> > --- a/kernel/sched/thermal.c
> > +++ b/kernel/sched/thermal.c
> > @@ -43,17 +43,16 @@ static DEFINE_PER_CPU(unsigned long, delta_capacity);
> >   * the arch_scale_cpu_capacity and capped capacity is stored in per cpu
> >   * delta_capacity.
> >   */
> > -void update_thermal_pressure(int cpu, u64 capped_freq_ratio)
> > +void update_thermal_pressure(struct cpumask *cpus, unsigned int cur, unsigned int max)
> >  {
> > -       unsigned long __capacity, delta;
> > +       int cpu;
> >
> > -       /* Normalize the capped freq ratio */
> > -       __capacity = (capped_freq_ratio * arch_scale_cpu_capacity(cpu)) >>
> > -                    SCHED_CAPACITY_SHIFT;
> > -       delta = arch_scale_cpu_capacity(cpu) -  __capacity;
> > -       pr_debug("updating cpu%d thermal pressure to %lu\n", cpu, delta);
> > +       for_each_cpu(cpu, cpus) {
> > +               unsigned long scale_cap = arch_scale_cpu_capacity(cpu);
> > +               unsigned long cur_cap = cur * scale_cap / max;
> >
> > -       per_cpu(delta_capacity, cpu) = delta;
> > +               per_cpu(delta_capacity, cpu) = scale_cap - cur_cap;
> > +       }
> >  }

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Patch v4 2/6] sched: Add infrastructure to store and update instantaneous thermal pressure
  2019-11-01 12:17   ` Dietmar Eggemann
@ 2019-11-01 20:57     ` Thara Gopinath
  2019-11-04 17:29       ` Dietmar Eggemann
  0 siblings, 1 reply; 51+ messages in thread
From: Thara Gopinath @ 2019-11-01 20:57 UTC (permalink / raw)
  To: Dietmar Eggemann, mingo, peterz, ionela.voinescu,
	vincent.guittot, rui.zhang, edubezval, qperret
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On 11/01/2019 08:17 AM, Dietmar Eggemann wrote:
> On 22.10.19 22:34, Thara Gopinath wrote:
> 
> [...]
> 
>> +/**
>> + * trigger_thermal_pressure_average: Trigger the thermal pressure accumulate
>> + *				     and average algorithm
>> + */
>> +void trigger_thermal_pressure_average(struct rq *rq)
>> +{
>> +	update_thermal_load_avg(rq_clock_task(rq), rq,
>> +				per_cpu(delta_capacity, cpu_of(rq)));
>> +}
> 
> Why not call update_thermal_load_avg() directly in fair.c? We do this for all
> the other update_foo_load_avg() functions (foo eq. irq, rt_rq, dl_rq ...)
thermal.c is going away in the next version and I am moving everything
to fair.c, so this is taken care of.

> 
> You don't have to pass 'u64 now', so you can hide it plus the 

You still need now. All the update_*_avg APIs take now as a parameter.


-- 
Warm Regards
Thara

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Patch v4 5/6] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping
  2019-11-01 15:47       ` Ionela Voinescu
@ 2019-11-01 21:04         ` Thara Gopinath
  2019-11-04 14:41           ` Ionela Voinescu
  0 siblings, 1 reply; 51+ messages in thread
From: Thara Gopinath @ 2019-11-01 21:04 UTC (permalink / raw)
  To: Ionela Voinescu, Vincent Guittot
  Cc: Dietmar Eggemann, Ingo Molnar, Peter Zijlstra, Zhang Rui,
	Eduardo Valentin, Quentin Perret, linux-kernel, Amit Kachhap,
	Javi Merino, Daniel Lezcano

On 11/01/2019 11:47 AM, Ionela Voinescu wrote:
> Hi guys,
> 
> On Thursday 31 Oct 2019 at 17:38:25 (+0100), Vincent Guittot wrote:
>> On Thu, 31 Oct 2019 at 17:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>
>>> On 22.10.19 22:34, Thara Gopinath wrote:
>>>> Thermal governors can request for a cpu's maximum supported frequency
>>>> to be capped in case of an overheat event. This in turn means that the
>>>> maximum capacity available for tasks to run on the particular cpu is
>>>> reduced. Delta between the original maximum capacity and capped
>>>> maximum capacity is known as thermal pressure. Enable cpufreq cooling
>>>> device to update the thermal pressure in event of a capped
>>>> maximum frequency.
>>>>
>>>> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
>>>> ---
>>>>  drivers/thermal/cpu_cooling.c | 31 +++++++++++++++++++++++++++++--
>>>>  1 file changed, 29 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
>>>> index 391f397..2e6a979 100644
>>>> --- a/drivers/thermal/cpu_cooling.c
>>>> +++ b/drivers/thermal/cpu_cooling.c
>>>> @@ -218,6 +218,23 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
>>>>  }
>>>>
>>>>  /**
>>>> + * update_sched_max_capacity - update scheduler about change in cpu
>>>> + *                                   max frequency.
>>>> + * @policy - cpufreq policy whose max frequency is capped.
>>>> + */
>>>> +static void update_sched_max_capacity(struct cpumask *cpus,
>>>> +                                   unsigned int cur_max_freq,
>>>> +                                   unsigned int max_freq)
>>>> +{
>>>> +     int cpu;
>>>> +     unsigned long capacity = (cur_max_freq << SCHED_CAPACITY_SHIFT) /
>>>> +                               max_freq;
>>>> +
>>>> +     for_each_cpu(cpu, cpus)
>>>> +             update_thermal_pressure(cpu, capacity);
>>>> +}
>>>> +
>>>> +/**
>>>>   * get_load() - get load for a cpu since last updated
>>>>   * @cpufreq_cdev:    &struct cpufreq_cooling_device for this cpu
>>>>   * @cpu:     cpu number
>>>> @@ -320,6 +337,7 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
>>>>                                unsigned long state)
>>>>  {
>>>>       struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
>>>> +     int ret;
>>>>
>>>>       /* Request state should be less than max_level */
>>>>       if (WARN_ON(state > cpufreq_cdev->max_level))
>>>> @@ -331,8 +349,17 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
>>>>
>>>>       cpufreq_cdev->cpufreq_state = state;
>>>>
>>>> -     return dev_pm_qos_update_request(&cpufreq_cdev->qos_req,
>>>> -                             cpufreq_cdev->freq_table[state].frequency);
>>>> +     ret = dev_pm_qos_update_request
>>>> +                             (&cpufreq_cdev->qos_req,
>>>> +                              cpufreq_cdev->freq_table[state].frequency);
>>>> +
>>>> +     if (ret > 0)
>>>> +             update_sched_max_capacity
>>>> +                             (cpufreq_cdev->policy->cpus,
>>>> +                              cpufreq_cdev->freq_table[state].frequency,
>>>> +                              cpufreq_cdev->policy->cpuinfo.max_freq);
>>>> +
>>>> +     return ret;
>>>>  }
>>>>
>>>>  /**
>>>>
>>>
>>> Why not getting rid of update_sched_max_capacity() entirely and call
>>> update_thermal_pressure() in cpu_cooling.c directly? Saves one level in
>>> the call chain and would mean less code for this feature.
>>
>> But you add complexity in update_thermal_pressure which now has to
>> deal with a cpumask and to compute some frequency ratio
>> IMHO, it's cleaner to keep update_thermal_pressure simple as it is now
>>
> 
> How about removing update_thermal_pressure altogether and doing all the
> work in update_sched_max_capacity? That is, have
> update_sched_max_capacity compute the capped_freq_ratio, do the
> normalization, and set it per_cpu for all CPUs in the frequency domain.
> You'll save some calculations that you're now doing in
> update_thermal_pressure for each cpu and you avoid shifting back and
> forth.

Yes.  I can pass the delta to update_thermal_pressure. I will still want
to keep update_thermal_pressure and a per cpu variable in fair.c to
store this.
> 
> If you're doing so it would be worth renaming update_sched_max_capacity
> to something like update_sched_thermal_pressure.
Will do.


-- 
Warm Regards
Thara

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Patch v4 5/6] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping
  2019-11-01 21:04         ` Thara Gopinath
@ 2019-11-04 14:41           ` Ionela Voinescu
  0 siblings, 0 replies; 51+ messages in thread
From: Ionela Voinescu @ 2019-11-04 14:41 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: Vincent Guittot, Dietmar Eggemann, Ingo Molnar, Peter Zijlstra,
	Zhang Rui, Eduardo Valentin, Quentin Perret, linux-kernel,
	Amit Kachhap, Javi Merino, Daniel Lezcano

Hi Thara,

On Friday 01 Nov 2019 at 17:04:47 (-0400), Thara Gopinath wrote:
> On 11/01/2019 11:47 AM, Ionela Voinescu wrote:
> > Hi guys,
> > 
> > On Thursday 31 Oct 2019 at 17:38:25 (+0100), Vincent Guittot wrote:
> >> On Thu, 31 Oct 2019 at 17:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>>
> >>> On 22.10.19 22:34, Thara Gopinath wrote:
> >>>> Thermal governors can request for a cpu's maximum supported frequency
> >>>> to be capped in case of an overheat event. This in turn means that the
> >>>> maximum capacity available for tasks to run on the particular cpu is
> >>>> reduced. Delta between the original maximum capacity and capped
> >>>> maximum capacity is known as thermal pressure. Enable cpufreq cooling
> >>>> device to update the thermal pressure in event of a capped
> >>>> maximum frequency.
> >>>>
> >>>> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> >>>> ---
> >>>>  drivers/thermal/cpu_cooling.c | 31 +++++++++++++++++++++++++++++--
> >>>>  1 file changed, 29 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> >>>> index 391f397..2e6a979 100644
> >>>> --- a/drivers/thermal/cpu_cooling.c
> >>>> +++ b/drivers/thermal/cpu_cooling.c
> >>>> @@ -218,6 +218,23 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
> >>>>  }
> >>>>
> >>>>  /**
> >>>> + * update_sched_max_capacity - update scheduler about change in cpu
> >>>> + *                                   max frequency.
> >>>> + * @policy - cpufreq policy whose max frequency is capped.
> >>>> + */
> >>>> +static void update_sched_max_capacity(struct cpumask *cpus,
> >>>> +                                   unsigned int cur_max_freq,
> >>>> +                                   unsigned int max_freq)
> >>>> +{
> >>>> +     int cpu;
> >>>> +     unsigned long capacity = (cur_max_freq << SCHED_CAPACITY_SHIFT) /
> >>>> +                               max_freq;
> >>>> +
> >>>> +     for_each_cpu(cpu, cpus)
> >>>> +             update_thermal_pressure(cpu, capacity);
> >>>> +}
> >>>> +
> >>>> +/**
> >>>>   * get_load() - get load for a cpu since last updated
> >>>>   * @cpufreq_cdev:    &struct cpufreq_cooling_device for this cpu
> >>>>   * @cpu:     cpu number
> >>>> @@ -320,6 +337,7 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
> >>>>                                unsigned long state)
> >>>>  {
> >>>>       struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
> >>>> +     int ret;
> >>>>
> >>>>       /* Request state should be less than max_level */
> >>>>       if (WARN_ON(state > cpufreq_cdev->max_level))
> >>>> @@ -331,8 +349,17 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
> >>>>
> >>>>       cpufreq_cdev->cpufreq_state = state;
> >>>>
> >>>> -     return dev_pm_qos_update_request(&cpufreq_cdev->qos_req,
> >>>> -                             cpufreq_cdev->freq_table[state].frequency);
> >>>> +     ret = dev_pm_qos_update_request
> >>>> +                             (&cpufreq_cdev->qos_req,
> >>>> +                              cpufreq_cdev->freq_table[state].frequency);
> >>>> +
> >>>> +     if (ret > 0)
> >>>> +             update_sched_max_capacity
> >>>> +                             (cpufreq_cdev->policy->cpus,
> >>>> +                              cpufreq_cdev->freq_table[state].frequency,
> >>>> +                              cpufreq_cdev->policy->cpuinfo.max_freq);
> >>>> +
> >>>> +     return ret;
> >>>>  }
> >>>>
> >>>>  /**
> >>>>
> >>>
> >>> Why not getting rid of update_sched_max_capacity() entirely and call
> >>> update_thermal_pressure() in cpu_cooling.c directly? Saves one level in
> >>> the call chain and would mean less code for this feature.
> >>
> >> But you add complexity in update_thermal_pressure which now has to
> >> deal with a cpumask and to compute some frequency ratio
> >> IMHO, it's cleaner to keep update_thermal_pressure simple as it is now
> >>
> > 
> > How about removing update_thermal_pressure altogether and doing all the
> > work in update_sched_max_capacity? That is, have
> > update_sched_max_capacity compute the capped_freq_ratio, do the
> > normalization, and set it per_cpu for all CPUs in the frequency domain.
> > You'll save some calculations that you're now doing in
> > update_thermal_pressure for each cpu and you avoid shifting back and
> > forth.
> 
> Yes.  I can pass the delta to update_thermal_pressure. I will still want
> to keep update_thermal_pressure and a per cpu variable in fair.c to
> store this.
> > 

Why do you want to keep the variable in fair.c? To me this thermal
pressure delta seems more of a CPU thermal characteristic, not a
CFS characteristic, so I would be tempted to define it and set it
in cpu_cooling.c and let fair.c/pelt.c be just the consumers of
thermal pressure delta, either directly or through some interface.

What do you think?

Thanks,
Ionela.

> > If you're doing so it would be worth renaming update_sched_max_capacity
> > to something like update_sched_thermal_pressure.
> Will do.
> 
> 
> -- 
> Warm Regards
> Thara

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Patch v4 6/6] sched: thermal: Enable tuning of decay period
  2019-10-22 20:34 ` [Patch v4 6/6] sched: thermal: Enable tuning of decay period Thara Gopinath
  2019-10-28 15:42   ` Peter Zijlstra
@ 2019-11-04 16:12   ` Ionela Voinescu
  2019-11-05 20:26     ` Thara Gopinath
  1 sibling, 1 reply; 51+ messages in thread
From: Ionela Voinescu @ 2019-11-04 16:12 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: mingo, peterz, vincent.guittot, rui.zhang, edubezval, qperret,
	linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

Hi Thara,

On Tuesday 22 Oct 2019 at 16:34:25 (-0400), Thara Gopinath wrote:
> Thermal pressure follows pelt signals which means the
> decay period for thermal pressure is the default pelt
> decay period. Depending on SoC characteristics and thermal
> activity, it might be beneficial to decay thermal pressure
> slower, but still in-tune with the pelt signals.

I wonder if it can be beneficial to decay thermal pressure faster as
well.

This implementation makes 32 (LOAD_AVG_PERIOD) the minimum half-life
of the thermal pressure samples. This results in more than 100ms for a
sample to decay significantly and therefore let's say it can take more
than 100ms for capacity to return to (close to) max when the CPU is no
longer capped. This value seems high to me considering that a minimum
value should result in close to 'instantaneous' behaviour, when there
are thermal capping mechanisms that can react in ~20ms (hikey960 has a
polling delay of 25ms, if I'm remembering correctly).

I agree 32ms seems like a good default but given that you've made this
configurable as to give users options, I'm wondering if it would be
better to cover a wider range.

> One way to achieve this is to provide a command line parameter
> to set the decay coefficient to an integer between 0 and 10.
> 
> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> ---
> v3->v4:
> 	- Removed the sysctl setting to tune decay period and instead
> 	  introduced a command line parameter to control it. The rationale
> 	  here being changing decay period of a PELT signal runtime can
> 	  result in a skewed average value for at least some cycles.
> 
>  Documentation/admin-guide/kernel-parameters.txt |  5 +++++
>  kernel/sched/thermal.c                          | 25 ++++++++++++++++++++++++-
>  2 files changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index a84a83f..61d7baa 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4273,6 +4273,11 @@
>  			incurs a small amount of overhead in the scheduler
>  			but is useful for debugging and performance tuning.
>  
> +	sched_thermal_decay_coeff=
> +			[KNL, SMP] Set decay coefficient for thermal pressure signal.
> +			Format: integer between 0 and 10
> +			Default is 0.
> +
>  	skew_tick=	[KNL] Offset the periodic timer tick per cpu to mitigate
>  			xtime_lock contention on larger systems, and/or RCU lock
>  			contention on all systems with CONFIG_MAXSMP set.
> diff --git a/kernel/sched/thermal.c b/kernel/sched/thermal.c
> index 0c84960..0da31e1 100644
> --- a/kernel/sched/thermal.c
> +++ b/kernel/sched/thermal.c
> @@ -10,6 +10,28 @@
>  #include "pelt.h"
>  #include "thermal.h"
>  
> +/**
> + * By default the decay is the default pelt decay period.
> > + * The decay coefficient can change this decay period in
> + * multiples of 32.

This description has to be corrected as well, as per Peter's comment.

Also, it might be good not to use the value 32 directly but to mention
that the decay period is a shift of LOAD_AVG_PERIOD. If that changes,
the translation from decay shift to decay period below will change as
well.

Thank you,
Ionela.

> + *   Decay coefficient    Decay period(ms)
> + *	0			32
> + *	1			64
> + *	2			128
> + *	3			256
> + *	4			512
> + */
> +static int sched_thermal_decay_coeff;
> +
> +static int __init setup_sched_thermal_decay_coeff(char *str)
> +{
> +	if (kstrtoint(str, 0, &sched_thermal_decay_coeff))
> +		pr_warn("Unable to set scheduler thermal pressure decay coefficient\n");
> +
> +	return 1;
> +}
> +__setup("sched_thermal_decay_coeff=", setup_sched_thermal_decay_coeff);
> +
>  static DEFINE_PER_CPU(unsigned long, delta_capacity);
>  
>  /**
> @@ -40,6 +62,7 @@ void update_thermal_pressure(int cpu, u64 capped_freq_ratio)
>   */
>  void trigger_thermal_pressure_average(struct rq *rq)
>  {
> -	update_thermal_load_avg(rq_clock_task(rq), rq,
> +	update_thermal_load_avg(rq_clock_task(rq) >>
> +				sched_thermal_decay_coeff, rq,
>  				per_cpu(delta_capacity, cpu_of(rq)));
>  }
> -- 
> 2.1.4
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Patch v4 2/6] sched: Add infrastructure to store and update instantaneous thermal pressure
  2019-11-01 20:57     ` Thara Gopinath
@ 2019-11-04 17:29       ` Dietmar Eggemann
  2019-11-04 17:34         ` Vincent Guittot
  0 siblings, 1 reply; 51+ messages in thread
From: Dietmar Eggemann @ 2019-11-04 17:29 UTC (permalink / raw)
  To: Thara Gopinath, mingo, peterz, ionela.voinescu, vincent.guittot,
	rui.zhang, edubezval, qperret
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On 01/11/2019 21:57, Thara Gopinath wrote:
> On 11/01/2019 08:17 AM, Dietmar Eggemann wrote:
>> On 22.10.19 22:34, Thara Gopinath wrote:
>>
>> [...]
>>
>>> +/**
>>> + * trigger_thermal_pressure_average: Trigger the thermal pressure accumulate
>>> + *				     and average algorithm
>>> + */
>>> +void trigger_thermal_pressure_average(struct rq *rq)
>>> +{
>>> +	update_thermal_load_avg(rq_clock_task(rq), rq,
>>> +				per_cpu(delta_capacity, cpu_of(rq)));
>>> +}
>>
>> Why not call update_thermal_load_avg() directly in fair.c? We do this for all
>> the other update_foo_load_avg() functions (foo eq. irq, rt_rq, dl_rq ...)
> thermal.c is going away in next version and I am moving everything to
> fair.c. So this is taken care of
> 
>>
>> You don't have to pass 'u64 now', so you can hide it plus the 
> 
> You still need now. All the update_*_avg APIs take now as a parameter.

You do need it for the ___update_load_sum() call inside the
foo_load_avg() functions. But that doesn't mean you have to pass it into
foo_load_avg(). Look at update_irq_load_avg() for example. We don't pass
rq->clock as now in there.

-int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
+extern int sched_thermal_decay_coeff;
+
+int update_thermal_load_avg(struct rq *rq, u64 capacity)
 {
+       u64 now = rq_clock_task(rq) >> sched_thermal_decay_coeff;
+
        if (___update_load_sum(now, &rq->avg_thermal,
                               capacity,
                               capacity,

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Patch v4 2/6] sched: Add infrastructure to store and update instantaneous thermal pressure
  2019-11-04 17:29       ` Dietmar Eggemann
@ 2019-11-04 17:34         ` Vincent Guittot
  2019-11-04 17:41           ` Dietmar Eggemann
  0 siblings, 1 reply; 51+ messages in thread
From: Vincent Guittot @ 2019-11-04 17:34 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Thara Gopinath, Ingo Molnar, Peter Zijlstra, Ionela Voinescu,
	Zhang Rui, Eduardo Valentin, Quentin Perret, linux-kernel,
	Amit Kachhap, Javi Merino, Daniel Lezcano

On Mon, 4 Nov 2019 at 18:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 01/11/2019 21:57, Thara Gopinath wrote:
> > On 11/01/2019 08:17 AM, Dietmar Eggemann wrote:
> >> On 22.10.19 22:34, Thara Gopinath wrote:
> >>
> >> [...]
> >>
> >>> +/**
> >>> + * trigger_thermal_pressure_average: Trigger the thermal pressure accumulate
> >>> + *                              and average algorithm
> >>> + */
> >>> +void trigger_thermal_pressure_average(struct rq *rq)
> >>> +{
> >>> +   update_thermal_load_avg(rq_clock_task(rq), rq,
> >>> +                           per_cpu(delta_capacity, cpu_of(rq)));
> >>> +}
> >>
> >> Why not call update_thermal_load_avg() directly in fair.c? We do this for all
> >> the other update_foo_load_avg() functions (foo eq. irq, rt_rq, dl_rq ...)
> > thermal.c is going away in next version and I am moving everything to
> > fair.c. So this is taken care of
> >
> >>
> >> You don't have to pass 'u64 now', so you can hide it plus the
> >
> > You still need now. All the update_*_avg APIs take now as a parameter.
>
> You do need it for the ___update_load_sum() call inside the
> foo_load_avg() functions. But that doesn't mean you have to pass it into
> foo_load_avg(). Look at update_irq_load_avg() for example. We don't pass
> rq->clock as now in there.

update_irq_load_avg is the exception, but having now as a parameter is
the default behavior that update_thermal_load_avg has to follow.

>
> -int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
> +extern int sched_thermal_decay_coeff;
> +
> +int update_thermal_load_avg(struct rq *rq, u64 capacity)
>  {
> +       u64 now = rq_clock_task(rq) >> sched_thermal_decay_coeff;
> +
>         if (___update_load_sum(now, &rq->avg_thermal,
>                                capacity,
>                                capacity,

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Patch v4 2/6] sched: Add infrastructure to store and update instantaneous thermal pressure
  2019-11-04 17:34         ` Vincent Guittot
@ 2019-11-04 17:41           ` Dietmar Eggemann
  2019-11-04 17:48             ` Vincent Guittot
  0 siblings, 1 reply; 51+ messages in thread
From: Dietmar Eggemann @ 2019-11-04 17:41 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Thara Gopinath, Ingo Molnar, Peter Zijlstra, Ionela Voinescu,
	Zhang Rui, Eduardo Valentin, Quentin Perret, linux-kernel,
	Amit Kachhap, Javi Merino, Daniel Lezcano

On 04/11/2019 18:34, Vincent Guittot wrote:
> On Mon, 4 Nov 2019 at 18:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 01/11/2019 21:57, Thara Gopinath wrote:
>>> On 11/01/2019 08:17 AM, Dietmar Eggemann wrote:
>>>> On 22.10.19 22:34, Thara Gopinath wrote:

[...]

>>> You still need now. All the update_*_avg APIs take now as a parameter.
>>
>> You do need it for the ___update_load_sum() call inside the
>> foo_load_avg() functions. But that doesn't mean you have to pass it into
>> foo_load_avg(). Look at update_irq_load_avg() for example. We don't pass
>> rq->clock as now in there.
> 
> update_irq_load_avg is the exception, but having now as a parameter is
> the default behavior that update_thermal_load_avg has to follow.

Why would this be? Just so the functions have the same parameters?

In this case you could argue that update_irq_load_avg() has to pass in
rq->clock as now.

>> -int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
>> +extern int sched_thermal_decay_coeff;
>> +
>> +int update_thermal_load_avg(struct rq *rq, u64 capacity)
>>  {
>> +       u64 now = rq_clock_task(rq) >> sched_thermal_decay_coeff;
>> +
>>         if (___update_load_sum(now, &rq->avg_thermal,
>>                                capacity,
>>                                capacity,

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Patch v4 2/6] sched: Add infrastructure to store and update instantaneous thermal pressure
  2019-11-04 17:41           ` Dietmar Eggemann
@ 2019-11-04 17:48             ` Vincent Guittot
  0 siblings, 0 replies; 51+ messages in thread
From: Vincent Guittot @ 2019-11-04 17:48 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Thara Gopinath, Ingo Molnar, Peter Zijlstra, Ionela Voinescu,
	Zhang Rui, Eduardo Valentin, Quentin Perret, linux-kernel,
	Amit Kachhap, Javi Merino, Daniel Lezcano

On Mon, 4 Nov 2019 at 18:42, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 04/11/2019 18:34, Vincent Guittot wrote:
> > On Mon, 4 Nov 2019 at 18:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 01/11/2019 21:57, Thara Gopinath wrote:
> >>> On 11/01/2019 08:17 AM, Dietmar Eggemann wrote:
> >>>> On 22.10.19 22:34, Thara Gopinath wrote:
>
> [...]
>
> >>> You still need now. All the update_*_avg APIs take now as a parameter.
> >>
> >> You do need it for the ___update_load_sum() call inside the
> >> foo_load_avg() functions. But that doesn't mean you have to pass it into
> >> foo_load_avg(). Look at update_irq_load_avg() for example. We don't pass
> >> rq->clock as now in there.
> >
> > update_irq_load_avg is the exception, but having now as a parameter is
> > the default behavior that update_thermal_load_avg has to follow.
>
> Why would this be? Just so the functions have the same parameters?

That's the default behavior: keep all pelt functions behaving
similarly, and keep what is not strictly related to pelt outside, so
it will ease any further modification.
sched_thermal_decay_coeff is not a pelt parameter but a thermal one;
irq_avg is an exception, not the default behavior to follow.

>
> In this case you could argue that update_irq_load_avg() has to pass in
> rq->clock as now.
>
> >> -int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
> >> +extern int sched_thermal_decay_coeff;
> >> +
> >> +int update_thermal_load_avg(struct rq *rq, u64 capacity)
> >>  {
> >> +       u64 now = rq_clock_task(rq) >> sched_thermal_decay_coeff;
> >> +
> >>         if (___update_load_sum(now, &rq->avg_thermal,
> >>                                capacity,
> >>                                capacity,


* Re: [Patch v4 6/6] sched: thermal: Enable tuning of decay period
  2019-11-04 16:12   ` Ionela Voinescu
@ 2019-11-05 20:26     ` Thara Gopinath
  2019-11-05 21:29       ` Ionela Voinescu
  0 siblings, 1 reply; 51+ messages in thread
From: Thara Gopinath @ 2019-11-05 20:26 UTC (permalink / raw)
  To: Ionela Voinescu
  Cc: mingo, peterz, vincent.guittot, rui.zhang, edubezval, qperret,
	linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On 11/04/2019 11:12 AM, Ionela Voinescu wrote:
> Hi Thara,
> 
> On Tuesday 22 Oct 2019 at 16:34:25 (-0400), Thara Gopinath wrote:
>> Thermal pressure follows PELT signals, which means the
>> decay period for thermal pressure is the default PELT
>> decay period. Depending on SoC characteristics and thermal
>> activity, it might be beneficial to decay thermal pressure
>> slower, but still in tune with the PELT signals.
> 
> I wonder if it can be beneficial to decay thermal pressure faster as
> well.
> 
> This implementation makes 32 (LOAD_AVG_PERIOD) the minimum half-life
> of the thermal pressure samples. This results in more than 100ms for a
> sample to decay significantly and therefore let's say it can take more
> than 100ms for capacity to return to (close to) max when the CPU is no
> longer capped. This value seems high to me considering that a minimum
> value should result in close to 'instantaneous' behaviour, when there
> are thermal capping mechanisms that can react in ~20ms (hikey960 has a
> polling delay of 25ms, if I'm remembering correctly).
> 
> I agree 32ms seems like a good default but given that you've made this
> configurable so as to give users options, I'm wondering if it would be
> better to cover a wider range.
> 
>> One way to achieve this is to provide a command line parameter
>> to set the decay coefficient to an integer between 0 and 10.
>>
>> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
>> ---
>> v3->v4:
>> 	- Removed the sysctl setting to tune decay period and instead
>> 	  introduced a command line parameter to control it. The rationale
>> 	  here being that changing the decay period of a PELT signal at
>> 	  runtime can result in a skewed average value for at least some cycles.
>>
>>  Documentation/admin-guide/kernel-parameters.txt |  5 +++++
>>  kernel/sched/thermal.c                          | 25 ++++++++++++++++++++++++-
>>  2 files changed, 29 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>> index a84a83f..61d7baa 100644
>> --- a/Documentation/admin-guide/kernel-parameters.txt
>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>> @@ -4273,6 +4273,11 @@
>>  			incurs a small amount of overhead in the scheduler
>>  			but is useful for debugging and performance tuning.
>>  
>> +	sched_thermal_decay_coeff=
>> +			[KNL, SMP] Set decay coefficient for thermal pressure signal.
>> +			Format: integer between 0 and 10
>> +			Default is 0.
>> +
>>  	skew_tick=	[KNL] Offset the periodic timer tick per cpu to mitigate
>>  			xtime_lock contention on larger systems, and/or RCU lock
>>  			contention on all systems with CONFIG_MAXSMP set.
>> diff --git a/kernel/sched/thermal.c b/kernel/sched/thermal.c
>> index 0c84960..0da31e1 100644
>> --- a/kernel/sched/thermal.c
>> +++ b/kernel/sched/thermal.c
>> @@ -10,6 +10,28 @@
>>  #include "pelt.h"
>>  #include "thermal.h"
>>  
>> +/**
>> + * By default the decay is the default pelt decay period.
>> + * The decay coefficient can change is decay period in
>> + * multiples of 32.
> 
> This description has to be corrected as well, as per Peter's comment.
> 
> Also, it might be good not to use the value 32 directly but to mention
> that the decay period is a shift of LOAD_AVG_PERIOD. If that changes,
> the translation from decay shift to decay period below will change as
> well.

Hi Ionela,

I sent out the v5 without fixing this. Even if there are no other
comments on v5 I will send out a v6 fixing this.

Regarding a slower decay, we need a strong case for it.



-- 
Warm Regards
Thara


* Re: [Patch v4 0/6] Introduce Thermal Pressure
  2019-10-31 16:41   ` Thara Gopinath
  2019-10-31 16:52     ` Thara Gopinath
@ 2019-11-05 21:04     ` Ionela Voinescu
  1 sibling, 0 replies; 51+ messages in thread
From: Ionela Voinescu @ 2019-11-05 21:04 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: mingo, peterz, vincent.guittot, rui.zhang, edubezval, qperret,
	linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

Hi Thara,

On Thursday 31 Oct 2019 at 12:41:20 (-0400), Thara Gopinath wrote:
[...]
> >> Regarding testing, basic build, boot and sanity testing have been
> >> performed on db845c platform with debian file system.
> >> Further, dhrystone and hackbench tests have been
> >> run with the thermal pressure algorithm. During testing, due to
> >> constraints of the step-wise governor in dealing with big.LITTLE systems,
> >> the trip point 0 temperature was made asymmetric between cpus in the little
> >> cluster and the big cluster; the idea being that the
> >> big cores will heat up and the cpu cooling device will throttle the
> >> frequency of the big cores faster, thereby limiting the maximum available
> >> capacity and the scheduler will spread out tasks to little cores as well.
> >>
> > 
> > Can you please share the changes you've made to sdm845.dtsi and a kernel
> > base on top of which to apply your patches? I would like to reproduce
> > your results and run more tests and it would be good if our setups were
> > as close as possible.
> Hi Ionela
> Thank you for the review.
> So I tested this on the 5.4-rc1 kernel. The dtsi change is to reduce the
> thermal trip points for the big CPUs to 60000 or 70000 from the default
> 90000. I did this for 2 reasons:
> 1. I could never get the db845 to heat up sufficiently for my test cases
> with the default trip.
> 2. I was using the default step-wise governor for thermal. I did not
> want little and big to start throttling by the same % because then the
> task placement ratio will remain the same between little and big cores.
> 
> 

Some early testing on this showed that when setting the trip point to
60000 for the big CPUs and the big cluster, and running hackbench (1
group, 30000 loops), the cooling state of the big cluster ends up
always set to the maximum (the lowest OPP), which results in
capacity inversion (almost) continuously.

For 70000 the average cooling state of the bigs is around 20, so a
few more OPPs are left available on the bigs more of the time, but
the capacity of the bigs is probably still mostly lower than the
capacity of the little CPUs during this test as well.

I think that explains the difference in results that you obtained
below. This is good as it shows that thermal pressure is useful but
it shouldn't show much difference between the different decay
periods, as can also be observed in your results below.

This being said, I did not obtain such significant results on my
side, but I'll try again with the kernel you've pointed me to offline.

Thanks,
Ionela.

> > 
> >> Test Results
> >>
> >> Hackbench: 1 group , 30000 loops, 10 runs       
> >>                                                Result         SD             
> >>                                                (Secs)     (% of mean)     
> >>  No Thermal Pressure                            14.03       2.69%           
> >>  Thermal Pressure PELT Algo. Decay : 32 ms      13.29       0.56%         
> >>  Thermal Pressure PELT Algo. Decay : 64 ms      12.57       1.56%           
> >>  Thermal Pressure PELT Algo. Decay : 128 ms     12.71       1.04%         
> >>  Thermal Pressure PELT Algo. Decay : 256 ms     12.29       1.42%           
> >>  Thermal Pressure PELT Algo. Decay : 512 ms     12.42       1.15%  
> >>
> >> Dhrystone Run Time  : 20 threads, 3000 MLOOPS
> >>                                                  Result      SD             
> >>                                                  (Secs)    (% of mean)     
> >>  No Thermal Pressure                              9.452      4.49%
> >>  Thermal Pressure PELT Algo. Decay : 32 ms        8.793      5.30%
> >>  Thermal Pressure PELT Algo. Decay : 64 ms        8.981      5.29%
> >>  Thermal Pressure PELT Algo. Decay : 128 ms       8.647      6.62%
> >>  Thermal Pressure PELT Algo. Decay : 256 ms       8.774      6.45%
> >>  Thermal Pressure PELT Algo. Decay : 512 ms       8.603      5.41%  
> >>
> > 
> > Do you happen to know by how much the CPUs were capped during these
> > experiments?
> 
> I don't have any captured results here. I know that big cores were
> capped and at times there was capacity inversion.
> 
> Also I will fix the nit comments above.
> 
> > 
> > Thanks,
> > Ionela.
> > 
> 
> 
> 
> -- 
> Warm Regards
> Thara


* Re: [Patch v4 6/6] sched: thermal: Enable tuning of decay period
  2019-11-05 20:26     ` Thara Gopinath
@ 2019-11-05 21:29       ` Ionela Voinescu
  0 siblings, 0 replies; 51+ messages in thread
From: Ionela Voinescu @ 2019-11-05 21:29 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: mingo, peterz, vincent.guittot, rui.zhang, edubezval, qperret,
	linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On Tuesday 05 Nov 2019 at 15:26:06 (-0500), Thara Gopinath wrote:
> On 11/04/2019 11:12 AM, Ionela Voinescu wrote:
> > Hi Thara,
> > 
> > On Tuesday 22 Oct 2019 at 16:34:25 (-0400), Thara Gopinath wrote:
> >> Thermal pressure follows PELT signals, which means the
> >> decay period for thermal pressure is the default PELT
> >> decay period. Depending on SoC characteristics and thermal
> >> activity, it might be beneficial to decay thermal pressure
> >> slower, but still in tune with the PELT signals.
> > 
> > I wonder if it can be beneficial to decay thermal pressure faster as
> > well.
> > 
> > This implementation makes 32 (LOAD_AVG_PERIOD) the minimum half-life
> > of the thermal pressure samples. This results in more than 100ms for a
> > sample to decay significantly and therefore let's say it can take more
> > than 100ms for capacity to return to (close to) max when the CPU is no
> > longer capped. This value seems high to me considering that a minimum
> > value should result in close to 'instantaneous' behaviour, when there
> > are thermal capping mechanisms that can react in ~20ms (hikey960 has a
> > polling delay of 25ms, if I'm remembering correctly).
> > 
> > I agree 32ms seems like a good default but given that you've made this
> > configurable so as to give users options, I'm wondering if it would be
> > better to cover a wider range.
> >

[...]

> 
> Hi Ionela,
>

[...]

> 
> Regarding a slower decay, we need a strong case for it.
> 
>

I think you mean faster decay, if you refer to my comment above.

To be blunt, I'm not sure there is a strong case for either kind of
decay if we look at the test results only. There is a theoretical case
for both, in my opinion, and given that the purpose of this patch is to
give options to platforms with different thermal characteristics, I do
believe it's worth providing a good range of options for the decay
period.

Thanks,
Ionela.

> 
> -- 
> Warm Regards
> Thara


end of thread, other threads:[~2019-11-05 21:29 UTC | newest]

Thread overview: 51+ messages
2019-10-22 20:34 [Patch v4 0/6] Introduce Thermal Pressure Thara Gopinath
2019-10-22 20:34 ` [Patch v4 1/6] sched/pelt.c: Add support to track thermal pressure Thara Gopinath
2019-10-31  9:47   ` Ionela Voinescu
2019-10-22 20:34 ` [Patch v4 2/6] sched: Add infrastructure to store and update instantaneous " Thara Gopinath
2019-10-28 15:21   ` Peter Zijlstra
2019-10-30 21:37     ` Thara Gopinath
2019-11-01 12:17   ` Dietmar Eggemann
2019-11-01 20:57     ` Thara Gopinath
2019-11-04 17:29       ` Dietmar Eggemann
2019-11-04 17:34         ` Vincent Guittot
2019-11-04 17:41           ` Dietmar Eggemann
2019-11-04 17:48             ` Vincent Guittot
2019-10-22 20:34 ` [Patch v4 3/6] sched/fair: Enable CFS periodic tick to update " Thara Gopinath
2019-10-28 15:24   ` Peter Zijlstra
2019-10-28 15:27     ` Peter Zijlstra
2019-10-30 21:41     ` Thara Gopinath
2019-10-31 16:11   ` Dietmar Eggemann
2019-10-31 16:46     ` Thara Gopinath
2019-10-22 20:34 ` [Patch v4 4/6] sched/fair: update cpu_capcity to reflect " Thara Gopinath
2019-10-23 12:28   ` Qais Yousef
2019-10-28 15:30     ` Peter Zijlstra
2019-10-31 10:53       ` Qais Yousef
2019-10-31 15:38         ` Dietmar Eggemann
2019-10-31 15:48           ` Vincent Guittot
2019-10-31 16:17             ` Dietmar Eggemann
2019-10-31 16:31               ` Vincent Guittot
2019-10-31 16:44                 ` Dietmar Eggemann
2019-10-31 16:03         ` Thara Gopinath
2019-10-31 16:56           ` Qais Yousef
2019-10-22 20:34 ` [Patch v4 5/6] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping Thara Gopinath
2019-10-28 15:33   ` Peter Zijlstra
2019-10-31 16:29   ` Dietmar Eggemann
2019-10-31 16:38     ` Vincent Guittot
2019-11-01 15:47       ` Ionela Voinescu
2019-11-01 21:04         ` Thara Gopinath
2019-11-04 14:41           ` Ionela Voinescu
2019-10-31 16:46     ` Thara Gopinath
2019-10-22 20:34 ` [Patch v4 6/6] sched: thermal: Enable tuning of decay period Thara Gopinath
2019-10-28 15:42   ` Peter Zijlstra
2019-11-04 16:12   ` Ionela Voinescu
2019-11-05 20:26     ` Thara Gopinath
2019-11-05 21:29       ` Ionela Voinescu
2019-10-29 15:34 ` [Patch v4 0/6] Introduce Thermal Pressure Daniel Lezcano
2019-10-31 10:07   ` Ionela Voinescu
2019-10-31 11:54     ` Daniel Lezcano
2019-10-31 12:57       ` Ionela Voinescu
2019-10-31 17:48         ` Daniel Lezcano
2019-10-31  9:44 ` Ionela Voinescu
2019-10-31 16:41   ` Thara Gopinath
2019-10-31 16:52     ` Thara Gopinath
2019-11-05 21:04     ` Ionela Voinescu
