linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [Patch v3 0/7] Introduce Thermal Pressure
@ 2019-10-14  0:58 Thara Gopinath
  2019-10-14  0:58 ` [Patch v3 1/7] sched/pelt.c: Add support to track thermal pressure Thara Gopinath
                   ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Thara Gopinath @ 2019-10-14  0:58 UTC (permalink / raw)
  To: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang, edubezval
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

Thermal governors can respond to an overheat event of a cpu by
capping the cpu's maximum possible frequency. This in turn
means that the maximum available compute capacity of the
cpu is restricted. But today in the kernel, task scheduler is 
not notified of capping of maximum frequency of a cpu.
In other words, scheduler is unware of maximum capacity
restrictions placed on a cpu due to thermal activity.
This patch series attempts to address this issue.
The benefits identified are better task placement among available
cpus in event of overheating which in turn leads to better
performance numbers.

The reduction in the maximum possible capacity of a cpu due to a 
thermal event can be considered as thermal pressure. Instantaneous
thermal pressure is hard to record and can sometime be erroneous
as there can be mismatch between the actual capping of capacity
and scheduler recording it. Thus solution is to have a weighted
average per cpu value for thermal pressure over time.
The weight reflects the amount of time the cpu has spent at a
capped maximum frequency. Since thermal pressure is recorded as
an average, it must be decayed periodically. Exisiting algorithm
in the kernel scheduler pelt framework is re-used to calculate
the weighted average. This patch series also defines a sysctl
inerface to allow for a configurable decay period.

Regarding testing, basic build, boot and sanity testing have been
performed on db845c platform with debian file system.
Further, dhrystone and hackbench tests have been
run with the thermal pressure algorithm. During testing, due to
constraints of step wise governor in dealing with big little systems,
trip point 0 temperature was made assymetric between cpus in little
cluster and big cluster; the idea being that
big core will heat up and cpu cooling device will throttle the
frequency of the big cores faster, there by limiting the maximum available
capacity and the scheduler will spread out tasks to little cores as well.

Test Results

Hackbench: 1 group , 30000 loops, 10 runs       
                                               Result         SD             
                                               (Secs)     (% of mean)     
 No Thermal Pressure                            14.03       2.69%           
 Thermal Pressure PELT Algo. Decay : 32 ms      13.29       0.56%         
 Thermal Pressure PELT Algo. Decay : 64 ms      12.57       1.56%           
 Thermal Pressure PELT Algo. Decay : 128 ms     12.71       1.04%         
 Thermal Pressure PELT Algo. Decay : 256 ms     12.29       1.42%           
 Thermal Pressure PELT Algo. Decay : 512 ms     12.42       1.15%  


Dhrystone Run Time  : 20 threads, 3000 MLOOPS
                                                 Result      SD             
                                                 (Secs)    (% of mean)     
 No Thermal Pressure                              9.452      4.49%
 Thermal Pressure PELT Algo. Decay : 32 ms        8.793      5.30%
 Thermal Pressure PELT Algo. Decay : 64 ms        8.981      5.29%
 Thermal Pressure PELT Algo. Decay : 128 ms       8.647      6.62%
 Thermal Pressure PELT Algo. Decay : 256 ms       8.774      6.45%
 Thermal Pressure PELT Algo. Decay : 512 ms       8.603      5.41%  

A Brief History

The first version of this patch-series was posted with resuing
PELT algorithm to decay thermal pressure signal. The discussions
that followed were around whether intanteneous thermal pressure
solution is better and whether a stand-alone algortihm to accumulate
and decay thermal pressure is more appropriate than re-using the
PELT framework. 
Tests on Hikey960 showed the stand-alone algorithm performing slightly
better than resuing PELT algorithm and V2 was posted with the stand
alone algorithm. Test results were shared as part of this series.
Discussions were around re-using PELT algorithm and running
further tests with more granular decay period.

For some time after this development was impeded due to hardware
unavailability, some other unforseen and possibly unfortunate events.
For this version, h/w was switched from hikey960 to db845c.
Also Instantaneous thermal pressure was never tested as part of this
cycle as it is clear that weighted average is a better implementation.
The non-PELT algorithm never gave any conclusive results to prove that it
is better than reusing PELT algorithm, in this round of testing.
Also reusing PELT algorithm means thermal pressure tracks the
other utilization signals in the scheduler.

Thara Gopinath (7):
  sched/pelt.c: Add support to track thermal pressure
  sched: Add infrastructure to store and update instantaneous thermal
    pressure
  sched: Initialize per cpu thermal pressure structure
  sched/fair: Enable CFS periodic tick to update thermal pressure
  sched/fair: update cpu_capcity to reflect thermal pressure
  thermal/cpu-cooling: Update thermal pressure in case of a maximum
    frequency capping
  sched: thermal: Enable tuning of decay period

 drivers/base/arch_topology.c  |  1 +
 drivers/thermal/cpu_cooling.c | 31 +++++++++++++++--
 include/linux/sched.h         | 14 ++++++++
 include/linux/sched/sysctl.h  |  1 +
 kernel/sched/Makefile         |  2 +-
 kernel/sched/core.c           |  2 ++
 kernel/sched/fair.c           |  6 ++++
 kernel/sched/pelt.c           | 13 +++++++
 kernel/sched/pelt.h           |  7 ++++
 kernel/sched/sched.h          |  1 +
 kernel/sched/thermal.c        | 80 +++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/thermal.h        | 13 +++++++
 kernel/sysctl.c               |  7 ++++
 13 files changed, 175 insertions(+), 3 deletions(-)
 create mode 100644 kernel/sched/thermal.c
 create mode 100644 kernel/sched/thermal.h

-- 
2.1.4


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Patch v3 1/7] sched/pelt.c: Add support to track thermal pressure
  2019-10-14  0:58 [Patch v3 0/7] Introduce Thermal Pressure Thara Gopinath
@ 2019-10-14  0:58 ` Thara Gopinath
  2019-10-14 13:55   ` Vincent Guittot
  2019-10-14  0:58 ` [Patch v3 2/7] sched: Add infrastructure to store and update instantaneous " Thara Gopinath
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Thara Gopinath @ 2019-10-14  0:58 UTC (permalink / raw)
  To: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang, edubezval
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

Extrapolating on the exisitng framework to track rt/dl utilization using
pelt signals, add a similar mechanism to track thermal pressue. The
difference here from rt/dl utilization tracking is that, instead of
tracking time spent by a cpu running a rt/dl task through util_avg,
the average thermal pressure is tracked through load_avg.
In order to track average thermal pressure, a new sched_avg variable
avg_thermal is introduced. Function update_thermal_avg can be called
to do the periodic bookeeping (accumulate, decay and average)
of the thermal pressure.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---
 kernel/sched/pelt.c  | 13 +++++++++++++
 kernel/sched/pelt.h  |  7 +++++++
 kernel/sched/sched.h |  1 +
 3 files changed, 21 insertions(+)

diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index a96db50..f06aae3 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -353,6 +353,19 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 	return 0;
 }
 
+int update_thermal_avg(u64 now, struct rq *rq, u64 capacity)
+{
+	if (___update_load_sum(now, &rq->avg_thermal,
+			       capacity,
+			       capacity,
+			       capacity)) {
+		___update_load_avg(&rq->avg_thermal, 1, 1);
+		return 1;
+	}
+
+	return 0;
+}
+
 #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
 /*
  * irq:
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index afff644..01c5436 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -6,6 +6,7 @@ int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se
 int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
 int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
+int update_thermal_avg(u64 now, struct rq *rq, u64 capacity);
 
 #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
 int update_irq_load_avg(struct rq *rq, u64 running);
@@ -175,6 +176,12 @@ update_rq_clock_pelt(struct rq *rq, s64 delta) { }
 static inline void
 update_idle_rq_clock_pelt(struct rq *rq) { }
 
+static inline int
+update_thermal_avg(u64 now, struct rq *rq, u64 capacity)
+{
+	return 0;
+}
+
 #endif
 
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0db2c1b..d5d82c8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -944,6 +944,7 @@ struct rq {
 #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
 	struct sched_avg	avg_irq;
 #endif
+	struct sched_avg	avg_thermal;
 	u64			idle_stamp;
 	u64			avg_idle;
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [Patch v3 2/7] sched: Add infrastructure to store and update instantaneous thermal pressure
  2019-10-14  0:58 [Patch v3 0/7] Introduce Thermal Pressure Thara Gopinath
  2019-10-14  0:58 ` [Patch v3 1/7] sched/pelt.c: Add support to track thermal pressure Thara Gopinath
@ 2019-10-14  0:58 ` Thara Gopinath
  2019-10-14 15:50   ` Vincent Guittot
  2019-10-14  0:58 ` [Patch v3 3/7] sched: Initialize per cpu thermal pressure structure Thara Gopinath
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Thara Gopinath @ 2019-10-14  0:58 UTC (permalink / raw)
  To: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang, edubezval
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

Add thermal.c and thermal.h files that provides interface
APIs to initialize, update/average, track, accumulate and decay
thermal pressure per cpu basis. A per cpu structure max_capacity_info is
introduced to keep track of instantaneous per cpu thermal pressure.
Thermal pressure is the delta between max_capacity and cap_capacity.
API update_periodic_maxcap is called for periodic accumulate and decay
of the thermal pressure. It is to to be called from a periodic tick
function. This API calculates the delta between max_capacity and
cap_capacity and passes on the delta to update_thermal_avg to do the
necessary accumulate, decay and average. API update_maxcap_capacity is for
the system to update the thermal pressure by updating cap_capacity.
Considering, update_periodic_maxcap reads cap_capacity and
update_maxcap_capacity writes into cap_capacity, one can argue for
some sort of locking mechanism to avoid a stale value.
But considering update_periodic_maxcap can be called from a system
critical path like scheduler tick function, a locking mechanism is not
ideal. This means that it is possible the value used to
calculate average thermal pressure for a cpu can be stale for upto 1
tick period.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---
 include/linux/sched.h  | 14 +++++++++++
 kernel/sched/Makefile  |  2 +-
 kernel/sched/thermal.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/thermal.h | 13 ++++++++++
 4 files changed, 94 insertions(+), 1 deletion(-)
 create mode 100644 kernel/sched/thermal.c
 create mode 100644 kernel/sched/thermal.h

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2c2e56b..875ce2b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1983,6 +1983,20 @@ static inline void rseq_syscall(struct pt_regs *regs)
 
 #endif
 
+#ifdef CONFIG_SMP
+void update_maxcap_capacity(int cpu, u64 capacity);
+
+void populate_max_capacity_info(void);
+#else
+static inline void update_maxcap_capacity(int cpu, u64 capacity)
+{
+}
+
+static inline void populate_max_capacity_info(void)
+{
+}
+#endif
+
 const struct sched_avg *sched_trace_cfs_rq_avg(struct cfs_rq *cfs_rq);
 char *sched_trace_cfs_rq_path(struct cfs_rq *cfs_rq, char *str, int len);
 int sched_trace_cfs_rq_cpu(struct cfs_rq *cfs_rq);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 21fb5a5..4d3b820 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -20,7 +20,7 @@ obj-y += core.o loadavg.o clock.o cputime.o
 obj-y += idle.o fair.o rt.o deadline.o
 obj-y += wait.o wait_bit.o swait.o completion.o
 
-obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o
+obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o thermal.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
diff --git a/kernel/sched/thermal.c b/kernel/sched/thermal.c
new file mode 100644
index 0000000..5f0b2d4
--- /dev/null
+++ b/kernel/sched/thermal.c
@@ -0,0 +1,66 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Sceduler Thermal Interactions
+ *
+ *  Copyright (C) 2018 Linaro, Inc., Thara Gopinath <thara.gopinath@linaro.org>
+ */
+
+#include <linux/sched.h>
+#include "sched.h"
+#include "pelt.h"
+#include "thermal.h"
+
+struct max_capacity_info {
+	unsigned long max_capacity;
+	unsigned long cap_capacity;
+};
+
+static DEFINE_PER_CPU(struct max_capacity_info, max_cap);
+
+void update_maxcap_capacity(int cpu, u64 capacity)
+{
+	struct max_capacity_info *__max_cap;
+	unsigned long __capacity;
+
+	__max_cap = (&per_cpu(max_cap, cpu));
+	if (!__max_cap) {
+		pr_err("no max_capacity_info structure for cpu %d\n", cpu);
+		return;
+	}
+
+	/* Normalize the capacity */
+	__capacity = (capacity * arch_scale_cpu_capacity(cpu)) >>
+							SCHED_CAPACITY_SHIFT;
+	pr_debug("updating cpu%d capped capacity from %lu to %lu\n", cpu, __max_cap->cap_capacity, __capacity);
+
+	__max_cap->cap_capacity = __capacity;
+}
+
+void populate_max_capacity_info(void)
+{
+	struct max_capacity_info *__max_cap;
+	u64 capacity;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		__max_cap = (&per_cpu(max_cap, cpu));
+		if (!__max_cap)
+			continue;
+		capacity = arch_scale_cpu_capacity(cpu);
+		__max_cap->max_capacity = capacity;
+		__max_cap->cap_capacity = capacity;
+		pr_debug("cpu %d max capacity set to %ld\n", cpu, __max_cap->max_capacity);
+	}
+}
+
+void update_periodic_maxcap(struct rq *rq)
+{
+	struct max_capacity_info *__max_cap = (&per_cpu(max_cap, cpu_of(rq)));
+	unsigned long delta;
+
+	if (!__max_cap)
+		return;
+
+	delta = __max_cap->max_capacity - __max_cap->cap_capacity;
+	update_thermal_avg(rq_clock_task(rq), rq, delta);
+}
diff --git a/kernel/sched/thermal.h b/kernel/sched/thermal.h
new file mode 100644
index 0000000..5657cb4
--- /dev/null
+++ b/kernel/sched/thermal.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Scheduler thermal interaction internal methods.
+ */
+
+#ifdef CONFIG_SMP
+void update_periodic_maxcap(struct rq *rq);
+
+#else
+static inline void update_periodic_maxcap(struct rq *rq)
+{
+}
+#endif
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [Patch v3 3/7] sched: Initialize per cpu thermal pressure structure
  2019-10-14  0:58 [Patch v3 0/7] Introduce Thermal Pressure Thara Gopinath
  2019-10-14  0:58 ` [Patch v3 1/7] sched/pelt.c: Add support to track thermal pressure Thara Gopinath
  2019-10-14  0:58 ` [Patch v3 2/7] sched: Add infrastructure to store and update instantaneous " Thara Gopinath
@ 2019-10-14  0:58 ` Thara Gopinath
  2019-10-14  0:58 ` [Patch v3 4/7] sched/fair: Enable CFS periodic tick to update thermal pressure Thara Gopinath
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Thara Gopinath @ 2019-10-14  0:58 UTC (permalink / raw)
  To: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang, edubezval
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

Initialize per cpu max_capacity_info during scheduler init.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---
 drivers/base/arch_topology.c | 1 +
 kernel/sched/core.c          | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 1eb81f11..7ac9f2f 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -123,6 +123,7 @@ void topology_normalize_cpu_scale(void)
 		pr_debug("cpu_capacity: CPU%d cpu_capacity=%lu\n",
 			cpu, topology_get_cpu_scale(cpu));
 	}
+	populate_max_capacity_info();
 }
 
 bool __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7880f4f..744f026 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6534,6 +6534,8 @@ void __init sched_init_smp(void)
 	init_sched_rt_class();
 	init_sched_dl_class();
 
+	populate_max_capacity_info();
+
 	sched_smp_initialized = true;
 }
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [Patch v3 4/7] sched/fair: Enable CFS periodic tick to update thermal pressure
  2019-10-14  0:58 [Patch v3 0/7] Introduce Thermal Pressure Thara Gopinath
                   ` (2 preceding siblings ...)
  2019-10-14  0:58 ` [Patch v3 3/7] sched: Initialize per cpu thermal pressure structure Thara Gopinath
@ 2019-10-14  0:58 ` Thara Gopinath
  2019-10-14  0:58 ` [Patch v3 5/7] sched/fair: update cpu_capcity to reflect " Thara Gopinath
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Thara Gopinath @ 2019-10-14  0:58 UTC (permalink / raw)
  To: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang, edubezval
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

Introduce support in CFS periodic tick to trigger the process of
computing average thermal pressure for a cpu.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---
 kernel/sched/fair.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 83ab35e..fe7c165 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -21,6 +21,7 @@
  *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
  */
 #include "sched.h"
+#include "thermal.h"
 
 #include <trace/events/sched.h>
 
@@ -7566,6 +7567,8 @@ static void update_blocked_averages(int cpu)
 		done = false;
 
 	update_blocked_load_status(rq, !done);
+
+	update_periodic_maxcap(rq);
 	rq_unlock_irqrestore(rq, &rf);
 }
 
@@ -9925,6 +9928,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	update_misfit_status(curr, rq);
 	update_overutilized_status(task_rq(curr));
+
+	update_periodic_maxcap(rq);
 }
 
 /*
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [Patch v3 5/7] sched/fair: update cpu_capcity to reflect thermal pressure
  2019-10-14  0:58 [Patch v3 0/7] Introduce Thermal Pressure Thara Gopinath
                   ` (3 preceding siblings ...)
  2019-10-14  0:58 ` [Patch v3 4/7] sched/fair: Enable CFS periodic tick to update thermal pressure Thara Gopinath
@ 2019-10-14  0:58 ` Thara Gopinath
  2019-10-14  0:58 ` [Patch v3 6/7] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping Thara Gopinath
  2019-10-14  0:58 ` [Patch v3 7/7] sched: thermal: Enable tuning of decay period Thara Gopinath
  6 siblings, 0 replies; 19+ messages in thread
From: Thara Gopinath @ 2019-10-14  0:58 UTC (permalink / raw)
  To: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang, edubezval
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

cpu_capacity relflects the maximum available capacity of a cpu. Thermal
pressure on a cpu means this maximum available capacity is reduced. This
patch reduces the average thermal pressure for a cpu from its maximum
available capacity so that cpu_capacity reflects the actual
available capacity.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---
 kernel/sched/fair.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fe7c165..cbc2208 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7719,6 +7719,7 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
 
 	used = READ_ONCE(rq->avg_rt.util_avg);
 	used += READ_ONCE(rq->avg_dl.util_avg);
+	used += READ_ONCE(rq->avg_thermal.load_avg);
 
 	if (unlikely(used >= max))
 		return 1;
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [Patch v3 6/7] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping
  2019-10-14  0:58 [Patch v3 0/7] Introduce Thermal Pressure Thara Gopinath
                   ` (4 preceding siblings ...)
  2019-10-14  0:58 ` [Patch v3 5/7] sched/fair: update cpu_capcity to reflect " Thara Gopinath
@ 2019-10-14  0:58 ` Thara Gopinath
  2019-10-14  0:58 ` [Patch v3 7/7] sched: thermal: Enable tuning of decay period Thara Gopinath
  6 siblings, 0 replies; 19+ messages in thread
From: Thara Gopinath @ 2019-10-14  0:58 UTC (permalink / raw)
  To: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang, edubezval
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

Thermal governors can request for a cpu's maximum supported frequency
to be capped in case of an overheat event. This in turn means that the
maximum capacity available for tasks to run on the particular cpu is
reduced. Delta between the original maximum capacity and capped
maximum capacity is known as thermal pressure. Enable cpufreq cooling
device to update the thermal pressure in event of a capped
maximum frequency.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---
 drivers/thermal/cpu_cooling.c | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index 391f397..9e2764a 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -218,6 +218,23 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
 }
 
 /**
+ * update_sched_max_capacity - update scheduler about change in cpu
+ *					max frequency.
+ * @policy - cpufreq policy whose max frequency is capped.
+ */
+static void update_sched_max_capacity(struct cpumask *cpus,
+				      unsigned int cur_max_freq,
+				      unsigned int max_freq)
+{
+	int cpu;
+	unsigned long capacity = (cur_max_freq << SCHED_CAPACITY_SHIFT) /
+				  max_freq;
+
+	for_each_cpu(cpu, cpus)
+		update_maxcap_capacity(cpu, capacity);
+}
+
+/**
  * get_load() - get load for a cpu since last updated
  * @cpufreq_cdev:	&struct cpufreq_cooling_device for this cpu
  * @cpu:	cpu number
@@ -320,6 +337,7 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
 				 unsigned long state)
 {
 	struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
+	int ret;
 
 	/* Request state should be less than max_level */
 	if (WARN_ON(state > cpufreq_cdev->max_level))
@@ -331,8 +349,17 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
 
 	cpufreq_cdev->cpufreq_state = state;
 
-	return dev_pm_qos_update_request(&cpufreq_cdev->qos_req,
-				cpufreq_cdev->freq_table[state].frequency);
+	ret = dev_pm_qos_update_request
+				(&cpufreq_cdev->qos_req,
+				 cpufreq_cdev->freq_table[state].frequency);
+
+	if (ret > 0)
+		update_sched_max_capacity
+				(cpufreq_cdev->policy->cpus,
+				 cpufreq_cdev->freq_table[state].frequency,
+				 cpufreq_cdev->policy->cpuinfo.max_freq);
+
+	return ret;
 }
 
 /**
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [Patch v3 7/7] sched: thermal: Enable tuning of decay period
  2019-10-14  0:58 [Patch v3 0/7] Introduce Thermal Pressure Thara Gopinath
                   ` (5 preceding siblings ...)
  2019-10-14  0:58 ` [Patch v3 6/7] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping Thara Gopinath
@ 2019-10-14  0:58 ` Thara Gopinath
  2019-10-15 10:14   ` Quentin Perret
  6 siblings, 1 reply; 19+ messages in thread
From: Thara Gopinath @ 2019-10-14  0:58 UTC (permalink / raw)
  To: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang, edubezval
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

Thermal pressure follows pelt signas which means the
decay period for thermal pressure is the default pelt
decay period. Depending on soc charecteristics and thermal
activity, it might be beneficial to decay thermal pressure
slower, but still in-tune with the pelt signals.
One way to achieve this slow decay is to right shift the now
run time.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---
 include/linux/sched/sysctl.h |  1 +
 kernel/sched/thermal.c       | 16 +++++++++++++++-
 kernel/sysctl.c              |  7 +++++++
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index d4f6215..f4c4afd 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -41,6 +41,7 @@ extern unsigned int sysctl_numa_balancing_scan_size;
 #ifdef CONFIG_SCHED_DEBUG
 extern __read_mostly unsigned int sysctl_sched_migration_cost;
 extern __read_mostly unsigned int sysctl_sched_nr_migrate;
+extern __read_mostly unsigned int sysctl_sched_thermal_decay_coeff;
 
 int sched_proc_update_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *length,
diff --git a/kernel/sched/thermal.c b/kernel/sched/thermal.c
index 5f0b2d4..2a00488 100644
--- a/kernel/sched/thermal.c
+++ b/kernel/sched/thermal.c
@@ -10,6 +10,19 @@
 #include "pelt.h"
 #include "thermal.h"
 
+/**
+ * By default the decay is the default pelt decay period.
+ * The decay coefficient can change is decay period in
+ * multiples of 32.
+ *   Decay coefficient    Decay period(ms)
+ *	0			32
+ *	1			64
+ *	2			128
+ *	3			256
+ *	4			512
+ */
+unsigned int sysctl_sched_thermal_decay_coeff __read_mostly;
+
 struct max_capacity_info {
 	unsigned long max_capacity;
 	unsigned long cap_capacity;
@@ -62,5 +75,6 @@ void update_periodic_maxcap(struct rq *rq)
 		return;
 
 	delta = __max_cap->max_capacity - __max_cap->cap_capacity;
-	update_thermal_avg(rq_clock_task(rq), rq, delta);
+	update_thermal_avg((rq_clock_task(rq) >>
+			   sysctl_sched_thermal_decay_coeff), rq, delta);
 }
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 00fcea2..5056c08 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -376,6 +376,13 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "sched_thermal_decay_coeff",
+		.data		= &sysctl_sched_thermal_decay_coeff,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 #ifdef CONFIG_SCHEDSTATS
 	{
 		.procname	= "sched_schedstats",
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [Patch v3 1/7] sched/pelt.c: Add support to track thermal pressure
  2019-10-14  0:58 ` [Patch v3 1/7] sched/pelt.c: Add support to track thermal pressure Thara Gopinath
@ 2019-10-14 13:55   ` Vincent Guittot
  2019-10-16 21:15     ` Thara Gopinath
  0 siblings, 1 reply; 19+ messages in thread
From: Vincent Guittot @ 2019-10-14 13:55 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: Ingo Molnar, Peter Zijlstra, Ionela Voinescu, Zhang Rui,
	Eduardo Valentin, linux-kernel, Amit Kachhap, Javi Merino,
	Daniel Lezcano

Hi Thara,

On Mon, 14 Oct 2019 at 02:58, Thara Gopinath <thara.gopinath@linaro.org> wrote:
>
> Extrapolating on the exisitng framework to track rt/dl utilization using

s/exisitng/existing/

> pelt signals, add a similar mechanism to track thermal pressue. The

s/pessure/pressure/

> difference here from rt/dl utilization tracking is that, instead of
> tracking time spent by a cpu running a rt/dl task through util_avg,
> the average thermal pressure is tracked through load_avg.

It would be good to mention why you use load_avg field instead of
util_avg field: because the signal is weighted with the capped
capacity and is not binary
And also explained a bit what capacity refer to

> In order to track average thermal pressure, a new sched_avg variable
> avg_thermal is introduced. Function update_thermal_avg can be called
> to do the periodic bookeeping (accumulate, decay and average)
> of the thermal pressure.
>
> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> ---
>  kernel/sched/pelt.c  | 13 +++++++++++++
>  kernel/sched/pelt.h  |  7 +++++++
>  kernel/sched/sched.h |  1 +
>  3 files changed, 21 insertions(+)
>
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index a96db50..f06aae3 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -353,6 +353,19 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
>         return 0;
>  }
>
> +int update_thermal_avg(u64 now, struct rq *rq, u64 capacity)

All the other functions are named :
update_cfs_rq/rt_rq/dl_rq/irq_load_avg

might be good to keep similar name with update_thermal_load_avg

> +{
> +       if (___update_load_sum(now, &rq->avg_thermal,
> +                              capacity,
> +                              capacity,
> +                              capacity)) {
> +               ___update_load_avg(&rq->avg_thermal, 1, 1);
> +               return 1;
> +       }
> +
> +       return 0;
> +}
> +
>  #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
>  /*
>   * irq:
> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> index afff644..01c5436 100644
> --- a/kernel/sched/pelt.h
> +++ b/kernel/sched/pelt.h
> @@ -6,6 +6,7 @@ int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se
>  int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
>  int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
>  int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
> +int update_thermal_avg(u64 now, struct rq *rq, u64 capacity);
>
>  #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
>  int update_irq_load_avg(struct rq *rq, u64 running);
> @@ -175,6 +176,12 @@ update_rq_clock_pelt(struct rq *rq, s64 delta) { }
>  static inline void
>  update_idle_rq_clock_pelt(struct rq *rq) { }
>
> +static inline int

can you keep some function ordering as above and move
update_thermal_avg() just after update_dl_rq_load_avg


> +update_thermal_avg(u64 now, struct rq *rq, u64 capacity)
> +{
> +       return 0;
> +}
> +
>  #endif
>
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 0db2c1b..d5d82c8 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -944,6 +944,7 @@ struct rq {
>  #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
>         struct sched_avg        avg_irq;
>  #endif
> +       struct sched_avg        avg_thermal;
>         u64                     idle_stamp;
>         u64                     avg_idle;
>
> --
> 2.1.4
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Patch v3 2/7] sched: Add infrastructure to store and update instantaneous thermal pressure
  2019-10-14  0:58 ` [Patch v3 2/7] sched: Add infrastructure to store and update instantaneous " Thara Gopinath
@ 2019-10-14 15:50   ` Vincent Guittot
  2019-10-16 21:22     ` Thara Gopinath
  0 siblings, 1 reply; 19+ messages in thread
From: Vincent Guittot @ 2019-10-14 15:50 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: Ingo Molnar, Peter Zijlstra, Ionela Voinescu, Zhang Rui,
	Eduardo Valentin, linux-kernel, Amit Kachhap, Javi Merino,
	Daniel Lezcano

Hi Thara,

On Mon, 14 Oct 2019 at 02:58, Thara Gopinath <thara.gopinath@linaro.org> wrote:
>
> Add thermal.c and thermal.h files that provides interface
> APIs to initialize, update/average, track, accumulate and decay
> thermal pressure per cpu basis. A per cpu structure max_capacity_info is
> introduced to keep track of instantaneous per cpu thermal pressure.
> Thermal pressure is the delta between max_capacity and cap_capacity.
> API update_periodic_maxcap is called for periodic accumulate and decay
> of the thermal pressure. It is to to be called from a periodic tick
> function. This API calculates the delta between max_capacity and
> cap_capacity and passes on the delta to update_thermal_avg to do the
> necessary accumulate, decay and average. API update_maxcap_capacity is for
> the system to update the thermal pressure by updating cap_capacity.
> Considering, update_periodic_maxcap reads cap_capacity and
> update_maxcap_capacity writes into cap_capacity, one can argue for
> some sort of locking mechanism to avoid a stale value.
> But considering update_periodic_maxcap can be called from a system
> critical path like scheduler tick function, a locking mechanism is not
> ideal. This means that it is possible the value used to
> calculate average thermal pressure for a cpu can be stale for upto 1
> tick period.
>
> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> ---
>  include/linux/sched.h  | 14 +++++++++++
>  kernel/sched/Makefile  |  2 +-
>  kernel/sched/thermal.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/thermal.h | 13 ++++++++++
>  4 files changed, 94 insertions(+), 1 deletion(-)
>  create mode 100644 kernel/sched/thermal.c
>  create mode 100644 kernel/sched/thermal.h
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 2c2e56b..875ce2b 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1983,6 +1983,20 @@ static inline void rseq_syscall(struct pt_regs *regs)
>
>  #endif
>
> +#ifdef CONFIG_SMP
> +void update_maxcap_capacity(int cpu, u64 capacity);
> +
> +void populate_max_capacity_info(void);
> +#else
> +static inline void update_maxcap_capacity(int cpu, u64 capacity)
> +{
> +}
> +
> +static inline void populate_max_capacity_info(void)
> +{
> +}
> +#endif
> +
>  const struct sched_avg *sched_trace_cfs_rq_avg(struct cfs_rq *cfs_rq);
>  char *sched_trace_cfs_rq_path(struct cfs_rq *cfs_rq, char *str, int len);
>  int sched_trace_cfs_rq_cpu(struct cfs_rq *cfs_rq);
> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> index 21fb5a5..4d3b820 100644
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -20,7 +20,7 @@ obj-y += core.o loadavg.o clock.o cputime.o
>  obj-y += idle.o fair.o rt.o deadline.o
>  obj-y += wait.o wait_bit.o swait.o completion.o
>
> -obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o
> +obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o thermal.o
>  obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
>  obj-$(CONFIG_SCHEDSTATS) += stats.o
>  obj-$(CONFIG_SCHED_DEBUG) += debug.o
> diff --git a/kernel/sched/thermal.c b/kernel/sched/thermal.c
> new file mode 100644
> index 0000000..5f0b2d4
> --- /dev/null
> +++ b/kernel/sched/thermal.c
> @@ -0,0 +1,66 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Sceduler Thermal Interactions
> + *
> + *  Copyright (C) 2018 Linaro, Inc., Thara Gopinath <thara.gopinath@linaro.org>
> + */
> +
> +#include <linux/sched.h>
> +#include "sched.h"
> +#include "pelt.h"
> +#include "thermal.h"
> +
> +struct max_capacity_info {
> +       unsigned long max_capacity;
> +       unsigned long cap_capacity;
> +};
> +
> +static DEFINE_PER_CPU(struct max_capacity_info, max_cap);
> +
> +void update_maxcap_capacity(int cpu, u64 capacity)
> +{
> +       struct max_capacity_info *__max_cap;
> +       unsigned long __capacity;
> +
> +       __max_cap = (&per_cpu(max_cap, cpu));
> +       if (!__max_cap) {
> +               pr_err("no max_capacity_info structure for cpu %d\n", cpu);
> +               return;
> +       }
> +
> +       /* Normalize the capacity */
> +       __capacity = (capacity * arch_scale_cpu_capacity(cpu)) >>
> +                                                       SCHED_CAPACITY_SHIFT;
> +       pr_debug("updating cpu%d capped capacity from %lu to %lu\n", cpu, __max_cap->cap_capacity, __capacity);
> +
> +       __max_cap->cap_capacity = __capacity;
> +}
> +
> +void populate_max_capacity_info(void)
> +{
> +       struct max_capacity_info *__max_cap;
> +       u64 capacity;
> +       int cpu;
> +
> +       for_each_possible_cpu(cpu) {
> +               __max_cap = (&per_cpu(max_cap, cpu));
> +               if (!__max_cap)
> +                       continue;
> +               capacity = arch_scale_cpu_capacity(cpu);
> +               __max_cap->max_capacity = capacity;
> +               __max_cap->cap_capacity = capacity;
> +               pr_debug("cpu %d max capacity set to %ld\n", cpu, __max_cap->max_capacity);
> +       }
> +}

everything above seems to be there for the cpu cooling device and
should be included in it instead. The scheduler only need the capacity
capping
The cpu cooling device should just set the delta capacity in the
per-cpu variable (see my comment below)

> +
> +void update_periodic_maxcap(struct rq *rq)
> +{
> +       struct max_capacity_info *__max_cap = (&per_cpu(max_cap, cpu_of(rq)));
> +       unsigned long delta;
> +
> +       if (!__max_cap)
> +               return;
> +
> +       delta = __max_cap->max_capacity - __max_cap->cap_capacity;

Why don't you just save the delta in the per_cpu variable instead of
the struct max_capacity_info ? You have to compute the delta every
tick whereas we can expect it to not change that much compared to the
number of tick

> +       update_thermal_avg(rq_clock_task(rq), rq, delta);
> +}
> diff --git a/kernel/sched/thermal.h b/kernel/sched/thermal.h
> new file mode 100644
> index 0000000..5657cb4
> --- /dev/null
> +++ b/kernel/sched/thermal.h
> @@ -0,0 +1,13 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Scheduler thermal interaction internal methods.
> + */
> +
> +#ifdef CONFIG_SMP
> +void update_periodic_maxcap(struct rq *rq);
> +
> +#else
> +static inline void update_periodic_maxcap(struct rq *rq)
> +{
> +}
> +#endif
> --
> 2.1.4
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Patch v3 7/7] sched: thermal: Enable tuning of decay period
  2019-10-14  0:58 ` [Patch v3 7/7] sched: thermal: Enable tuning of decay period Thara Gopinath
@ 2019-10-15 10:14   ` Quentin Perret
  2019-10-21 21:03     ` Thara Gopinath
  0 siblings, 1 reply; 19+ messages in thread
From: Quentin Perret @ 2019-10-15 10:14 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang,
	edubezval, linux-kernel, amit.kachhap, javi.merino,
	daniel.lezcano

Hi Thara,

On Sunday 13 Oct 2019 at 20:58:25 (-0400), Thara Gopinath wrote:
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 00fcea2..5056c08 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -376,6 +376,13 @@ static struct ctl_table kern_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec,
>  	},
> +	{
> +		.procname	= "sched_thermal_decay_coeff",
> +		.data		= &sysctl_sched_thermal_decay_coeff,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,

Perhaps change this for 'sched_proc_update_handler' with min and max
values ? Otherwise userspace is allowed to write nonsensical values
here. And since sysctl_sched_thermal_decay_coeff is used to shift, this
can lead to an undefined behaviour.

Also, could we take this sysctl out of SCHED_DEBUG ? I expect this to be
used/tuned on production devices where SCHED_DEBUG should theoretically
be off.

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Patch v3 1/7] sched/pelt.c: Add support to track thermal pressure
  2019-10-14 13:55   ` Vincent Guittot
@ 2019-10-16 21:15     ` Thara Gopinath
  0 siblings, 0 replies; 19+ messages in thread
From: Thara Gopinath @ 2019-10-16 21:15 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Ionela Voinescu, Zhang Rui,
	Eduardo Valentin, linux-kernel, Amit Kachhap, Javi Merino,
	Daniel Lezcano

Hi Vincent,

Thanks for the review.
On 10/14/2019 09:55 AM, Vincent Guittot wrote:
> Hi Thara,
> 
> On Mon, 14 Oct 2019 at 02:58, Thara Gopinath <thara.gopinath@linaro.org> wrote:
>>
>> Extrapolating on the exisitng framework to track rt/dl utilization using
> 
> s/exisitng/existing/
> 
>> pelt signals, add a similar mechanism to track thermal pressue. The
> 
> s/pessure/pressure/
Will fix all the typo.
> 
>> difference here from rt/dl utilization tracking is that, instead of
>> tracking time spent by a cpu running a rt/dl task through util_avg,
>> the average thermal pressure is tracked through load_avg.
> 
> It would be good to mention why you use load_avg field instead of
> util_avg field: because the signal is weighted with the capped
> capacity and is not binary
> And also explained a bit what capacity refer to
> 
>> In order to track average thermal pressure, a new sched_avg variable
>> avg_thermal is introduced. Function update_thermal_avg can be called
>> to do the periodic bookeeping (accumulate, decay and average)
>> of the thermal pressure.
>>
>> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
>> ---
>>  kernel/sched/pelt.c  | 13 +++++++++++++
>>  kernel/sched/pelt.h  |  7 +++++++
>>  kernel/sched/sched.h |  1 +
>>  3 files changed, 21 insertions(+)
>>
>> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
>> index a96db50..f06aae3 100644
>> --- a/kernel/sched/pelt.c
>> +++ b/kernel/sched/pelt.c
>> @@ -353,6 +353,19 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
>>         return 0;
>>  }
>>
>> +int update_thermal_avg(u64 now, struct rq *rq, u64 capacity)
> 
> All the other functions are named :
> update_cfs_rq/rt_rq/dl_rq/irq_load_avg
> 
> might be good to keep similar name with update_thermal_load_avg

Sure. Will rename the function.
> 
>> +{
>> +       if (___update_load_sum(now, &rq->avg_thermal,
>> +                              capacity,
>> +                              capacity,
>> +                              capacity)) {
>> +               ___update_load_avg(&rq->avg_thermal, 1, 1);
>> +               return 1;
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>>  #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
>>  /*
>>   * irq:
>> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
>> index afff644..01c5436 100644
>> --- a/kernel/sched/pelt.h
>> +++ b/kernel/sched/pelt.h
>> @@ -6,6 +6,7 @@ int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se
>>  int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
>>  int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
>>  int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
>> +int update_thermal_avg(u64 now, struct rq *rq, u64 capacity);
>>
>>  #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
>>  int update_irq_load_avg(struct rq *rq, u64 running);
>> @@ -175,6 +176,12 @@ update_rq_clock_pelt(struct rq *rq, s64 delta) { }
>>  static inline void
>>  update_idle_rq_clock_pelt(struct rq *rq) { }
>>
>> +static inline int
> 
> can you keep some function ordering as above and move
> update_thermal_avg() just after update_dl_rq_load_avg

will do.

Warm Regards
Thara


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Patch v3 2/7] sched: Add infrastructure to store and update instantaneous thermal pressure
  2019-10-14 15:50   ` Vincent Guittot
@ 2019-10-16 21:22     ` Thara Gopinath
  2019-10-17  8:44       ` Vincent Guittot
  0 siblings, 1 reply; 19+ messages in thread
From: Thara Gopinath @ 2019-10-16 21:22 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Ionela Voinescu, Zhang Rui,
	Eduardo Valentin, linux-kernel, Amit Kachhap, Javi Merino,
	Daniel Lezcano

Hi Vincent,

Thanks for the review
On 10/14/2019 11:50 AM, Vincent Guittot wrote:
> Hi Thara,
> 
> On Mon, 14 Oct 2019 at 02:58, Thara Gopinath <thara.gopinath@linaro.org> wrote:
>>
>> Add thermal.c and thermal.h files that provides interface
>> APIs to initialize, update/average, track, accumulate and decay
>> thermal pressure per cpu basis. A per cpu structure max_capacity_info is
>> introduced to keep track of instantaneous per cpu thermal pressure.
>> Thermal pressure is the delta between max_capacity and cap_capacity.
>> API update_periodic_maxcap is called for periodic accumulate and decay
>> of the thermal pressure. It is to to be called from a periodic tick
>> function. This API calculates the delta between max_capacity and
>> cap_capacity and passes on the delta to update_thermal_avg to do the
>> necessary accumulate, decay and average. API update_maxcap_capacity is for
>> the system to update the thermal pressure by updating cap_capacity.
>> Considering, update_periodic_maxcap reads cap_capacity and
>> update_maxcap_capacity writes into cap_capacity, one can argue for
>> some sort of locking mechanism to avoid a stale value.
>> But considering update_periodic_maxcap can be called from a system
>> critical path like scheduler tick function, a locking mechanism is not
>> ideal. This means that it is possible the value used to
>> calculate average thermal pressure for a cpu can be stale for upto 1
>> tick period.
>>
>> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
>> ---
>>  include/linux/sched.h  | 14 +++++++++++
>>  kernel/sched/Makefile  |  2 +-
>>  kernel/sched/thermal.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>  kernel/sched/thermal.h | 13 ++++++++++
>>  4 files changed, 94 insertions(+), 1 deletion(-)
>>  create mode 100644 kernel/sched/thermal.c
>>  create mode 100644 kernel/sched/thermal.h
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 2c2e56b..875ce2b 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1983,6 +1983,20 @@ static inline void rseq_syscall(struct pt_regs *regs)
>>
>>  #endif
>>
>> +#ifdef CONFIG_SMP
>> +void update_maxcap_capacity(int cpu, u64 capacity);
>> +
>> +void populate_max_capacity_info(void);
>> +#else
>> +static inline void update_maxcap_capacity(int cpu, u64 capacity)
>> +{
>> +}
>> +
>> +static inline void populate_max_capacity_info(void)
>> +{
>> +}
>> +#endif
>> +
>>  const struct sched_avg *sched_trace_cfs_rq_avg(struct cfs_rq *cfs_rq);
>>  char *sched_trace_cfs_rq_path(struct cfs_rq *cfs_rq, char *str, int len);
>>  int sched_trace_cfs_rq_cpu(struct cfs_rq *cfs_rq);
>> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
>> index 21fb5a5..4d3b820 100644
>> --- a/kernel/sched/Makefile
>> +++ b/kernel/sched/Makefile
>> @@ -20,7 +20,7 @@ obj-y += core.o loadavg.o clock.o cputime.o
>>  obj-y += idle.o fair.o rt.o deadline.o
>>  obj-y += wait.o wait_bit.o swait.o completion.o
>>
>> -obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o
>> +obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o thermal.o
>>  obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
>>  obj-$(CONFIG_SCHEDSTATS) += stats.o
>>  obj-$(CONFIG_SCHED_DEBUG) += debug.o
>> diff --git a/kernel/sched/thermal.c b/kernel/sched/thermal.c
>> new file mode 100644
>> index 0000000..5f0b2d4
>> --- /dev/null
>> +++ b/kernel/sched/thermal.c
>> @@ -0,0 +1,66 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Sceduler Thermal Interactions
>> + *
>> + *  Copyright (C) 2018 Linaro, Inc., Thara Gopinath <thara.gopinath@linaro.org>
>> + */
>> +
>> +#include <linux/sched.h>
>> +#include "sched.h"
>> +#include "pelt.h"
>> +#include "thermal.h"
>> +
>> +struct max_capacity_info {
>> +       unsigned long max_capacity;
>> +       unsigned long cap_capacity;
>> +};
>> +
>> +static DEFINE_PER_CPU(struct max_capacity_info, max_cap);
>> +
>> +void update_maxcap_capacity(int cpu, u64 capacity)
>> +{
>> +       struct max_capacity_info *__max_cap;
>> +       unsigned long __capacity;
>> +
>> +       __max_cap = (&per_cpu(max_cap, cpu));
>> +       if (!__max_cap) {
>> +               pr_err("no max_capacity_info structure for cpu %d\n", cpu);
>> +               return;
>> +       }
>> +
>> +       /* Normalize the capacity */
>> +       __capacity = (capacity * arch_scale_cpu_capacity(cpu)) >>
>> +                                                       SCHED_CAPACITY_SHIFT;
>> +       pr_debug("updating cpu%d capped capacity from %lu to %lu\n", cpu, __max_cap->cap_capacity, __capacity);
>> +
>> +       __max_cap->cap_capacity = __capacity;
>> +}
>> +
>> +void populate_max_capacity_info(void)
>> +{
>> +       struct max_capacity_info *__max_cap;
>> +       u64 capacity;
>> +       int cpu;
>> +
>> +       for_each_possible_cpu(cpu) {
>> +               __max_cap = (&per_cpu(max_cap, cpu));
>> +               if (!__max_cap)
>> +                       continue;
>> +               capacity = arch_scale_cpu_capacity(cpu);
>> +               __max_cap->max_capacity = capacity;
>> +               __max_cap->cap_capacity = capacity;
>> +               pr_debug("cpu %d max capacity set to %ld\n", cpu, __max_cap->max_capacity);
>> +       }
>> +}
> 
> everything above seems to be there for the cpu cooling device and
> should be included in it instead. The scheduler only need the capacity
> capping
> The cpu cooling device should just set the delta capacity in the
> per-cpu variable (see my comment below)
It can be a firmware  updating the thermal pressure instead of cpu
cooling device. Or may be some other entity .So instead of replicating
this code, isnt't it better to reside in one place ?
> 
>> +
>> +void update_periodic_maxcap(struct rq *rq)
>> +{
>> +       struct max_capacity_info *__max_cap = (&per_cpu(max_cap, cpu_of(rq)));
>> +       unsigned long delta;
>> +
>> +       if (!__max_cap)
>> +               return;
>> +
>> +       delta = __max_cap->max_capacity - __max_cap->cap_capacity;
> 
> Why don't you just save the delta in the per_cpu variable instead of
> the struct max_capacity_info ? You have to compute the delta every
> tick whereas we can expect it to not change that much compared to the
> number of tick.

Again I think thermal pressure can be applied from other entities like
firmware as well. But  I agree with your point that calculating delta
every tick is not a good idea. How about I move it to
update_maxcap_capacity so that delta is calculate every time an update
comes from cpu cooling device or anybody else ?

Warm Regards
Thara


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Patch v3 2/7] sched: Add infrastructure to store and update instantaneous thermal pressure
  2019-10-16 21:22     ` Thara Gopinath
@ 2019-10-17  8:44       ` Vincent Guittot
  2019-10-17 16:40         ` Thara Gopinath
  0 siblings, 1 reply; 19+ messages in thread
From: Vincent Guittot @ 2019-10-17  8:44 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: Ingo Molnar, Peter Zijlstra, Ionela Voinescu, Zhang Rui,
	Eduardo Valentin, linux-kernel, Amit Kachhap, Javi Merino,
	Daniel Lezcano

Hi Thara,

On Wed, 16 Oct 2019 at 23:22, Thara Gopinath <thara.gopinath@linaro.org> wrote:
>
> Hi Vincent,
>
> Thanks for the review
> On 10/14/2019 11:50 AM, Vincent Guittot wrote:
> > Hi Thara,
> >
> > On Mon, 14 Oct 2019 at 02:58, Thara Gopinath <thara.gopinath@linaro.org> wrote:
> >>
> >> Add thermal.c and thermal.h files that provides interface
> >> APIs to initialize, update/average, track, accumulate and decay
> >> thermal pressure per cpu basis. A per cpu structure max_capacity_info is
> >> introduced to keep track of instantaneous per cpu thermal pressure.
> >> Thermal pressure is the delta between max_capacity and cap_capacity.
> >> API update_periodic_maxcap is called for periodic accumulate and decay
> >> of the thermal pressure. It is to to be called from a periodic tick
> >> function. This API calculates the delta between max_capacity and
> >> cap_capacity and passes on the delta to update_thermal_avg to do the
> >> necessary accumulate, decay and average. API update_maxcap_capacity is for
> >> the system to update the thermal pressure by updating cap_capacity.
> >> Considering, update_periodic_maxcap reads cap_capacity and
> >> update_maxcap_capacity writes into cap_capacity, one can argue for
> >> some sort of locking mechanism to avoid a stale value.
> >> But considering update_periodic_maxcap can be called from a system
> >> critical path like scheduler tick function, a locking mechanism is not
> >> ideal. This means that it is possible the value used to
> >> calculate average thermal pressure for a cpu can be stale for upto 1
> >> tick period.
> >>
> >> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> >> ---
> >>  include/linux/sched.h  | 14 +++++++++++
> >>  kernel/sched/Makefile  |  2 +-
> >>  kernel/sched/thermal.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >>  kernel/sched/thermal.h | 13 ++++++++++
> >>  4 files changed, 94 insertions(+), 1 deletion(-)
> >>  create mode 100644 kernel/sched/thermal.c
> >>  create mode 100644 kernel/sched/thermal.h
> >>
> >> diff --git a/include/linux/sched.h b/include/linux/sched.h
> >> index 2c2e56b..875ce2b 100644
> >> --- a/include/linux/sched.h
> >> +++ b/include/linux/sched.h
> >> @@ -1983,6 +1983,20 @@ static inline void rseq_syscall(struct pt_regs *regs)
> >>
> >>  #endif
> >>
> >> +#ifdef CONFIG_SMP
> >> +void update_maxcap_capacity(int cpu, u64 capacity);
> >> +
> >> +void populate_max_capacity_info(void);
> >> +#else
> >> +static inline void update_maxcap_capacity(int cpu, u64 capacity)
> >> +{
> >> +}
> >> +
> >> +static inline void populate_max_capacity_info(void)
> >> +{
> >> +}
> >> +#endif
> >> +
> >>  const struct sched_avg *sched_trace_cfs_rq_avg(struct cfs_rq *cfs_rq);
> >>  char *sched_trace_cfs_rq_path(struct cfs_rq *cfs_rq, char *str, int len);
> >>  int sched_trace_cfs_rq_cpu(struct cfs_rq *cfs_rq);
> >> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> >> index 21fb5a5..4d3b820 100644
> >> --- a/kernel/sched/Makefile
> >> +++ b/kernel/sched/Makefile
> >> @@ -20,7 +20,7 @@ obj-y += core.o loadavg.o clock.o cputime.o
> >>  obj-y += idle.o fair.o rt.o deadline.o
> >>  obj-y += wait.o wait_bit.o swait.o completion.o
> >>
> >> -obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o
> >> +obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o thermal.o
> >>  obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
> >>  obj-$(CONFIG_SCHEDSTATS) += stats.o
> >>  obj-$(CONFIG_SCHED_DEBUG) += debug.o
> >> diff --git a/kernel/sched/thermal.c b/kernel/sched/thermal.c
> >> new file mode 100644
> >> index 0000000..5f0b2d4
> >> --- /dev/null
> >> +++ b/kernel/sched/thermal.c
> >> @@ -0,0 +1,66 @@
> >> +// SPDX-License-Identifier: GPL-2.0
> >> +/*
> >> + * Sceduler Thermal Interactions
> >> + *
> >> + *  Copyright (C) 2018 Linaro, Inc., Thara Gopinath <thara.gopinath@linaro.org>
> >> + */
> >> +
> >> +#include <linux/sched.h>
> >> +#include "sched.h"
> >> +#include "pelt.h"
> >> +#include "thermal.h"
> >> +
> >> +struct max_capacity_info {
> >> +       unsigned long max_capacity;
> >> +       unsigned long cap_capacity;
> >> +};
> >> +
> >> +static DEFINE_PER_CPU(struct max_capacity_info, max_cap);
> >> +
> >> +void update_maxcap_capacity(int cpu, u64 capacity)
> >> +{
> >> +       struct max_capacity_info *__max_cap;
> >> +       unsigned long __capacity;
> >> +
> >> +       __max_cap = (&per_cpu(max_cap, cpu));
> >> +       if (!__max_cap) {
> >> +               pr_err("no max_capacity_info structure for cpu %d\n", cpu);
> >> +               return;
> >> +       }
> >> +
> >> +       /* Normalize the capacity */
> >> +       __capacity = (capacity * arch_scale_cpu_capacity(cpu)) >>
> >> +                                                       SCHED_CAPACITY_SHIFT;
> >> +       pr_debug("updating cpu%d capped capacity from %lu to %lu\n", cpu, __max_cap->cap_capacity, __capacity);
> >> +
> >> +       __max_cap->cap_capacity = __capacity;
> >> +}
> >> +
> >> +void populate_max_capacity_info(void)
> >> +{
> >> +       struct max_capacity_info *__max_cap;
> >> +       u64 capacity;
> >> +       int cpu;
> >> +
> >> +       for_each_possible_cpu(cpu) {
> >> +               __max_cap = (&per_cpu(max_cap, cpu));
> >> +               if (!__max_cap)
> >> +                       continue;
> >> +               capacity = arch_scale_cpu_capacity(cpu);
> >> +               __max_cap->max_capacity = capacity;
> >> +               __max_cap->cap_capacity = capacity;
> >> +               pr_debug("cpu %d max capacity set to %ld\n", cpu, __max_cap->max_capacity);
> >> +       }
> >> +}
> >
> > everything above seems to be there for the cpu cooling device and
> > should be included in it instead. The scheduler only need the capacity
> > capping
> > The cpu cooling device should just set the delta capacity in the
> > per-cpu variable (see my comment below)
> It can be a firmware  updating the thermal pressure instead of cpu
> cooling device. Or may be some other entity .So instead of replicating
> this code, isnt't it better to reside in one place ?

But you have now idea which kind of metrics will be provided by
firmware. It might not be a capped frequency and a max frequency but
other metrics.
Also, __max_cap->max_capacity just save in a new per cpu struct the
return of arch_scale_cpu_capacity but this doesn't give us real
benefit




> >
> >> +
> >> +void update_periodic_maxcap(struct rq *rq)
> >> +{
> >> +       struct max_capacity_info *__max_cap = (&per_cpu(max_cap, cpu_of(rq)));
> >> +       unsigned long delta;
> >> +
> >> +       if (!__max_cap)
> >> +               return;
> >> +
> >> +       delta = __max_cap->max_capacity - __max_cap->cap_capacity;
> >
> > Why don't you just save the delta in the per_cpu variable instead of
> > the struct max_capacity_info ? You have to compute the delta every
> > tick whereas we can expect it to not change that much compared to the
> > number of tick.
>
> Again I think thermal pressure can be applied from other entities like
> firmware as well. But  I agree with your point that calculating delta
> every tick is not a good idea. How about I move it to
> update_maxcap_capacity so that delta is calculate every time an update
> comes from cpu cooling device or anybody else ?
>
> Warm Regards
> Thara
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Patch v3 2/7] sched: Add infrastructure to store and update instantaneous thermal pressure
  2019-10-17  8:44       ` Vincent Guittot
@ 2019-10-17 16:40         ` Thara Gopinath
  2019-10-18  8:05           ` Vincent Guittot
  0 siblings, 1 reply; 19+ messages in thread
From: Thara Gopinath @ 2019-10-17 16:40 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Ionela Voinescu, Zhang Rui,
	Eduardo Valentin, linux-kernel, Amit Kachhap, Javi Merino,
	Daniel Lezcano

On 10/17/2019 04:44 AM, Vincent Guittot wrote:
> Hi Thara,
> 
> On Wed, 16 Oct 2019 at 23:22, Thara Gopinath <thara.gopinath@linaro.org> wrote:
>>
>> Hi Vincent,
>>
>> Thanks for the review
>> On 10/14/2019 11:50 AM, Vincent Guittot wrote:
>>> Hi Thara,
>>>
>>> On Mon, 14 Oct 2019 at 02:58, Thara Gopinath <thara.gopinath@linaro.org> wrote:
>>>>
>>>> Add thermal.c and thermal.h files that provides interface
>>>> APIs to initialize, update/average, track, accumulate and decay
>>>> thermal pressure per cpu basis. A per cpu structure max_capacity_info is
>>>> introduced to keep track of instantaneous per cpu thermal pressure.
>>>> Thermal pressure is the delta between max_capacity and cap_capacity.
>>>> API update_periodic_maxcap is called for periodic accumulate and decay
>>>> of the thermal pressure. It is to to be called from a periodic tick
>>>> function. This API calculates the delta between max_capacity and
>>>> cap_capacity and passes on the delta to update_thermal_avg to do the
>>>> necessary accumulate, decay and average. API update_maxcap_capacity is for
>>>> the system to update the thermal pressure by updating cap_capacity.
>>>> Considering, update_periodic_maxcap reads cap_capacity and
>>>> update_maxcap_capacity writes into cap_capacity, one can argue for
>>>> some sort of locking mechanism to avoid a stale value.
>>>> But considering update_periodic_maxcap can be called from a system
>>>> critical path like scheduler tick function, a locking mechanism is not
>>>> ideal. This means that it is possible the value used to
>>>> calculate average thermal pressure for a cpu can be stale for upto 1
>>>> tick period.
>>>>
>>>> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
>>>> ---
>>>>  include/linux/sched.h  | 14 +++++++++++
>>>>  kernel/sched/Makefile  |  2 +-
>>>>  kernel/sched/thermal.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>  kernel/sched/thermal.h | 13 ++++++++++
>>>>  4 files changed, 94 insertions(+), 1 deletion(-)
>>>>  create mode 100644 kernel/sched/thermal.c
>>>>  create mode 100644 kernel/sched/thermal.h
>>>>
>>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>>> index 2c2e56b..875ce2b 100644
>>>> --- a/include/linux/sched.h
>>>> +++ b/include/linux/sched.h
>>>> @@ -1983,6 +1983,20 @@ static inline void rseq_syscall(struct pt_regs *regs)
>>>>
>>>>  #endif
>>>>
>>>> +#ifdef CONFIG_SMP
>>>> +void update_maxcap_capacity(int cpu, u64 capacity);
>>>> +
>>>> +void populate_max_capacity_info(void);
>>>> +#else
>>>> +static inline void update_maxcap_capacity(int cpu, u64 capacity)
>>>> +{
>>>> +}
>>>> +
>>>> +static inline void populate_max_capacity_info(void)
>>>> +{
>>>> +}
>>>> +#endif
>>>> +
>>>>  const struct sched_avg *sched_trace_cfs_rq_avg(struct cfs_rq *cfs_rq);
>>>>  char *sched_trace_cfs_rq_path(struct cfs_rq *cfs_rq, char *str, int len);
>>>>  int sched_trace_cfs_rq_cpu(struct cfs_rq *cfs_rq);
>>>> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
>>>> index 21fb5a5..4d3b820 100644
>>>> --- a/kernel/sched/Makefile
>>>> +++ b/kernel/sched/Makefile
>>>> @@ -20,7 +20,7 @@ obj-y += core.o loadavg.o clock.o cputime.o
>>>>  obj-y += idle.o fair.o rt.o deadline.o
>>>>  obj-y += wait.o wait_bit.o swait.o completion.o
>>>>
>>>> -obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o
>>>> +obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o thermal.o
>>>>  obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
>>>>  obj-$(CONFIG_SCHEDSTATS) += stats.o
>>>>  obj-$(CONFIG_SCHED_DEBUG) += debug.o
>>>> diff --git a/kernel/sched/thermal.c b/kernel/sched/thermal.c
>>>> new file mode 100644
>>>> index 0000000..5f0b2d4
>>>> --- /dev/null
>>>> +++ b/kernel/sched/thermal.c
>>>> @@ -0,0 +1,66 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/*
>>>> + * Sceduler Thermal Interactions
>>>> + *
>>>> + *  Copyright (C) 2018 Linaro, Inc., Thara Gopinath <thara.gopinath@linaro.org>
>>>> + */
>>>> +
>>>> +#include <linux/sched.h>
>>>> +#include "sched.h"
>>>> +#include "pelt.h"
>>>> +#include "thermal.h"
>>>> +
>>>> +struct max_capacity_info {
>>>> +       unsigned long max_capacity;
>>>> +       unsigned long cap_capacity;
>>>> +};
>>>> +
>>>> +static DEFINE_PER_CPU(struct max_capacity_info, max_cap);
>>>> +
>>>> +void update_maxcap_capacity(int cpu, u64 capacity)
>>>> +{
>>>> +       struct max_capacity_info *__max_cap;
>>>> +       unsigned long __capacity;
>>>> +
>>>> +       __max_cap = (&per_cpu(max_cap, cpu));
>>>> +       if (!__max_cap) {
>>>> +               pr_err("no max_capacity_info structure for cpu %d\n", cpu);
>>>> +               return;
>>>> +       }
>>>> +
>>>> +       /* Normalize the capacity */
>>>> +       __capacity = (capacity * arch_scale_cpu_capacity(cpu)) >>
>>>> +                                                       SCHED_CAPACITY_SHIFT;
>>>> +       pr_debug("updating cpu%d capped capacity from %lu to %lu\n", cpu, __max_cap->cap_capacity, __capacity);
>>>> +
>>>> +       __max_cap->cap_capacity = __capacity;
>>>> +}
>>>> +
>>>> +void populate_max_capacity_info(void)
>>>> +{
>>>> +       struct max_capacity_info *__max_cap;
>>>> +       u64 capacity;
>>>> +       int cpu;
>>>> +
>>>> +       for_each_possible_cpu(cpu) {
>>>> +               __max_cap = (&per_cpu(max_cap, cpu));
>>>> +               if (!__max_cap)
>>>> +                       continue;
>>>> +               capacity = arch_scale_cpu_capacity(cpu);
>>>> +               __max_cap->max_capacity = capacity;
>>>> +               __max_cap->cap_capacity = capacity;
>>>> +               pr_debug("cpu %d max capacity set to %ld\n", cpu, __max_cap->max_capacity);
>>>> +       }
>>>> +}
>>>
>>> everything above seems to be there for the cpu cooling device and
>>> should be included in it instead. The scheduler only need the capacity
>>> capping
>>> The cpu cooling device should just set the delta capacity in the
>>> per-cpu variable (see my comment below)
>> It can be a firmware  updating the thermal pressure instead of cpu
>> cooling device. Or may be some other entity .So instead of replicating
>> this code, isnt't it better to reside in one place ?
> 
> But you have now idea which kind of metrics will be provided by
> firmware. It might not be a capped frequency and a max frequency but
> other metrics.
hmm. Why ? It is thermal pressure. It is finally applied to the
scheduler metric cpu_capacity. It should not be anything other than
capacity reduction due to thermal events that should be provided.
> Also, __max_cap->max_capacity just save in a new per cpu struct the
> return of arch_scale_cpu_capacity but this doesn't give us real
> benefit
I agree. I will change the struct into a single variable delta and
initialize it to 0. populate_max_capacity_info need not be there. I will
have update_maxcap_capacity calculate the delta with
arch_scale_cpu_capacity and store it into a per cpu variable.


-- 
Warm Regards
Thara

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Patch v3 2/7] sched: Add infrastructure to store and update instantaneous thermal pressure
  2019-10-17 16:40         ` Thara Gopinath
@ 2019-10-18  8:05           ` Vincent Guittot
  0 siblings, 0 replies; 19+ messages in thread
From: Vincent Guittot @ 2019-10-18  8:05 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: Ingo Molnar, Peter Zijlstra, Ionela Voinescu, Zhang Rui,
	Eduardo Valentin, linux-kernel, Amit Kachhap, Javi Merino,
	Daniel Lezcano

Hi Thara,

On Thu, 17 Oct 2019 at 18:40, Thara Gopinath <thara.gopinath@linaro.org> wrote:
>
> On 10/17/2019 04:44 AM, Vincent Guittot wrote:
> > Hi Thara,
> >
> > On Wed, 16 Oct 2019 at 23:22, Thara Gopinath <thara.gopinath@linaro.org> wrote:
> >>
> >> Hi Vincent,
> >>
> >> Thanks for the review
> >> On 10/14/2019 11:50 AM, Vincent Guittot wrote:
> >>> Hi Thara,
> >>>
> >>> On Mon, 14 Oct 2019 at 02:58, Thara Gopinath <thara.gopinath@linaro.org> wrote:
> >>>>
> >>>> Add thermal.c and thermal.h files that provides interface
> >>>> APIs to initialize, update/average, track, accumulate and decay
> >>>> thermal pressure per cpu basis. A per cpu structure max_capacity_info is
> >>>> introduced to keep track of instantaneous per cpu thermal pressure.
> >>>> Thermal pressure is the delta between max_capacity and cap_capacity.
> >>>> API update_periodic_maxcap is called for periodic accumulate and decay
> >>>> of the thermal pressure. It is to to be called from a periodic tick
> >>>> function. This API calculates the delta between max_capacity and
> >>>> cap_capacity and passes on the delta to update_thermal_avg to do the
> >>>> necessary accumulate, decay and average. API update_maxcap_capacity is for
> >>>> the system to update the thermal pressure by updating cap_capacity.
> >>>> Considering, update_periodic_maxcap reads cap_capacity and
> >>>> update_maxcap_capacity writes into cap_capacity, one can argue for
> >>>> some sort of locking mechanism to avoid a stale value.
> >>>> But considering update_periodic_maxcap can be called from a system
> >>>> critical path like scheduler tick function, a locking mechanism is not
> >>>> ideal. This means that it is possible the value used to
> >>>> calculate average thermal pressure for a cpu can be stale for upto 1
> >>>> tick period.
> >>>>
> >>>> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> >>>> ---
> >>>>  include/linux/sched.h  | 14 +++++++++++
> >>>>  kernel/sched/Makefile  |  2 +-
> >>>>  kernel/sched/thermal.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>  kernel/sched/thermal.h | 13 ++++++++++
> >>>>  4 files changed, 94 insertions(+), 1 deletion(-)
> >>>>  create mode 100644 kernel/sched/thermal.c
> >>>>  create mode 100644 kernel/sched/thermal.h
> >>>>
> >>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
> >>>> index 2c2e56b..875ce2b 100644
> >>>> --- a/include/linux/sched.h
> >>>> +++ b/include/linux/sched.h
> >>>> @@ -1983,6 +1983,20 @@ static inline void rseq_syscall(struct pt_regs *regs)
> >>>>
> >>>>  #endif
> >>>>
> >>>> +#ifdef CONFIG_SMP
> >>>> +void update_maxcap_capacity(int cpu, u64 capacity);
> >>>> +
> >>>> +void populate_max_capacity_info(void);
> >>>> +#else
> >>>> +static inline void update_maxcap_capacity(int cpu, u64 capacity)
> >>>> +{
> >>>> +}
> >>>> +
> >>>> +static inline void populate_max_capacity_info(void)
> >>>> +{
> >>>> +}
> >>>> +#endif
> >>>> +
> >>>>  const struct sched_avg *sched_trace_cfs_rq_avg(struct cfs_rq *cfs_rq);
> >>>>  char *sched_trace_cfs_rq_path(struct cfs_rq *cfs_rq, char *str, int len);
> >>>>  int sched_trace_cfs_rq_cpu(struct cfs_rq *cfs_rq);
> >>>> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> >>>> index 21fb5a5..4d3b820 100644
> >>>> --- a/kernel/sched/Makefile
> >>>> +++ b/kernel/sched/Makefile
> >>>> @@ -20,7 +20,7 @@ obj-y += core.o loadavg.o clock.o cputime.o
> >>>>  obj-y += idle.o fair.o rt.o deadline.o
> >>>>  obj-y += wait.o wait_bit.o swait.o completion.o
> >>>>
> >>>> -obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o
> >>>> +obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o thermal.o
> >>>>  obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
> >>>>  obj-$(CONFIG_SCHEDSTATS) += stats.o
> >>>>  obj-$(CONFIG_SCHED_DEBUG) += debug.o
> >>>> diff --git a/kernel/sched/thermal.c b/kernel/sched/thermal.c
> >>>> new file mode 100644
> >>>> index 0000000..5f0b2d4
> >>>> --- /dev/null
> >>>> +++ b/kernel/sched/thermal.c
> >>>> @@ -0,0 +1,66 @@
> >>>> +// SPDX-License-Identifier: GPL-2.0
> >>>> +/*
> >>>> + * Sceduler Thermal Interactions
> >>>> + *
> >>>> + *  Copyright (C) 2018 Linaro, Inc., Thara Gopinath <thara.gopinath@linaro.org>
> >>>> + */
> >>>> +
> >>>> +#include <linux/sched.h>
> >>>> +#include "sched.h"
> >>>> +#include "pelt.h"
> >>>> +#include "thermal.h"
> >>>> +
> >>>> +struct max_capacity_info {
> >>>> +       unsigned long max_capacity;
> >>>> +       unsigned long cap_capacity;
> >>>> +};
> >>>> +
> >>>> +static DEFINE_PER_CPU(struct max_capacity_info, max_cap);
> >>>> +
> >>>> +void update_maxcap_capacity(int cpu, u64 capacity)
> >>>> +{
> >>>> +       struct max_capacity_info *__max_cap;
> >>>> +       unsigned long __capacity;
> >>>> +
> >>>> +       __max_cap = (&per_cpu(max_cap, cpu));
> >>>> +       if (!__max_cap) {
> >>>> +               pr_err("no max_capacity_info structure for cpu %d\n", cpu);
> >>>> +               return;
> >>>> +       }
> >>>> +
> >>>> +       /* Normalize the capacity */
> >>>> +       __capacity = (capacity * arch_scale_cpu_capacity(cpu)) >>
> >>>> +                                                       SCHED_CAPACITY_SHIFT;
> >>>> +       pr_debug("updating cpu%d capped capacity from %lu to %lu\n", cpu, __max_cap->cap_capacity, __capacity);
> >>>> +
> >>>> +       __max_cap->cap_capacity = __capacity;
> >>>> +}
> >>>> +
> >>>> +void populate_max_capacity_info(void)
> >>>> +{
> >>>> +       struct max_capacity_info *__max_cap;
> >>>> +       u64 capacity;
> >>>> +       int cpu;
> >>>> +
> >>>> +       for_each_possible_cpu(cpu) {
> >>>> +               __max_cap = (&per_cpu(max_cap, cpu));
> >>>> +               if (!__max_cap)
> >>>> +                       continue;
> >>>> +               capacity = arch_scale_cpu_capacity(cpu);
> >>>> +               __max_cap->max_capacity = capacity;
> >>>> +               __max_cap->cap_capacity = capacity;
> >>>> +               pr_debug("cpu %d max capacity set to %ld\n", cpu, __max_cap->max_capacity);
> >>>> +       }
> >>>> +}
> >>>
> >>> everything above seems to be there for the cpu cooling device and
> >>> should be included in it instead. The scheduler only need the capacity
> >>> capping
> >>> The cpu cooling device should just set the delta capacity in the
> >>> per-cpu variable (see my comment below)
> >> It can be a firmware  updating the thermal pressure instead of cpu
> >> cooling device. Or may be some other entity .So instead of replicating
> >> this code, isnt't it better to reside in one place ?
> >
> > But you have now idea which kind of metrics will be provided by
> > firmware. It might not be a capped frequency and a max frequency but
> > other metrics.
> hmm. Why ? It is thermal pressure. It is finally applied to the
> scheduler metric cpu_capacity. It should not be anything other than
> capacity reduction due to thermal events that should be provided.

What I mean is that you don't know how the value will be provided and
if you will need a cap and a max value. It might easily provide
directly the delta,  or a percent or even something else


> > Also, __max_cap->max_capacity just save in a new per cpu struct the
> > return of arch_scale_cpu_capacity but this doesn't give us real
> > benefit
> I agree. I will change the struct into a single variable delta and
> initialize it to 0. populate_max_capacity_info need not be there. I will
> have update_maxcap_capacity calculate the delta with
> arch_scale_cpu_capacity and store it into a per cpu variable.

Also could you rename this function. You use thermal for the pelt
function so could you align the function names ? either to use thermal
everywhere or another name but could you not mix the naming
If you want to use thermal why not using update_thermal_pressure ? at
least it's easy to understand what it does

>


>
> --
> Warm Regards
> Thara

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Patch v3 7/7] sched: thermal: Enable tuning of decay period
  2019-10-15 10:14   ` Quentin Perret
@ 2019-10-21 21:03     ` Thara Gopinath
  2019-10-22  9:03       ` Quentin Perret
  0 siblings, 1 reply; 19+ messages in thread
From: Thara Gopinath @ 2019-10-21 21:03 UTC (permalink / raw)
  To: Quentin Perret
  Cc: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang,
	edubezval, linux-kernel, amit.kachhap, javi.merino,
	daniel.lezcano

Hi Quentin,

Thanks for the review.
On 10/15/2019 06:14 AM, Quentin Perret wrote:
> Hi Thara,
> 
> On Sunday 13 Oct 2019 at 20:58:25 (-0400), Thara Gopinath wrote:
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index 00fcea2..5056c08 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -376,6 +376,13 @@ static struct ctl_table kern_table[] = {
>>  		.mode		= 0644,
>>  		.proc_handler	= proc_dointvec,
>>  	},
>> +	{
>> +		.procname	= "sched_thermal_decay_coeff",
>> +		.data		= &sysctl_sched_thermal_decay_coeff,
>> +		.maxlen		= sizeof(unsigned int),
>> +		.mode		= 0644,
>> +		.proc_handler	= proc_dointvec,
> 
> Perhaps change this for 'sched_proc_update_handler' with min and max
> values ? Otherwise userspace is allowed to write nonsensical values
> here. And since sysctl_sched_thermal_decay_coeff is used to shift, this
> can lead to an undefined behaviour.
Will do
> 
> Also, could we take this sysctl out of SCHED_DEBUG ? I expect this to be
> used/tuned on production devices where SCHED_DEBUG should theoretically
> be off.

I will take it out of SCHED_DEBUG. I am wondering if this should be
a runtime control at all. Because this is a shift this changes the
accumulating window for the thermal pressure signal. A runtime change
will not guarantee a clean start of the window. May be I should make
this a config option.

> 
> Thanks,
> Quentin
> 


-- 
Warm Regards
Thara

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Patch v3 7/7] sched: thermal: Enable tuning of decay period
  2019-10-21 21:03     ` Thara Gopinath
@ 2019-10-22  9:03       ` Quentin Perret
  2019-10-22 20:36         ` Thara Gopinath
  0 siblings, 1 reply; 19+ messages in thread
From: Quentin Perret @ 2019-10-22  9:03 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang,
	edubezval, linux-kernel, amit.kachhap, javi.merino,
	daniel.lezcano, kernel-team

Hi Thara,

On Monday 21 Oct 2019 at 17:03:56 (-0400), Thara Gopinath wrote:
> On 10/15/2019 06:14 AM, Quentin Perret wrote:
> > Hi Thara,
> > 
> > On Sunday 13 Oct 2019 at 20:58:25 (-0400), Thara Gopinath wrote:
> >> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> >> index 00fcea2..5056c08 100644
> >> --- a/kernel/sysctl.c
> >> +++ b/kernel/sysctl.c
> >> @@ -376,6 +376,13 @@ static struct ctl_table kern_table[] = {
> >>  		.mode		= 0644,
> >>  		.proc_handler	= proc_dointvec,
> >>  	},
> >> +	{
> >> +		.procname	= "sched_thermal_decay_coeff",
> >> +		.data		= &sysctl_sched_thermal_decay_coeff,
> >> +		.maxlen		= sizeof(unsigned int),
> >> +		.mode		= 0644,
> >> +		.proc_handler	= proc_dointvec,
> > 
> > Perhaps change this for 'sched_proc_update_handler' with min and max
> > values ? Otherwise userspace is allowed to write nonsensical values
> > here. And since sysctl_sched_thermal_decay_coeff is used to shift, this
> > can lead to an undefined behaviour.
> Will do
> > 
> > Also, could we take this sysctl out of SCHED_DEBUG ? I expect this to be
> > used/tuned on production devices where SCHED_DEBUG should theoretically
> > be off.
> 
> I will take it out of SCHED_DEBUG. I am wondering if this should be
> a runtime control at all. Because this is a shift this changes the
> accumulating window for the thermal pressure signal. A runtime change
> will not guarantee a clean start of the window. May be I should make
> this a config option.

I'd personally prefer if it wan't a Kconfig option. We'd like to make
Android devices (which are going to use this) work with a Generic Kernel
Image, which means there will be a single config for everyone. But I
expect this knob to be tuned to different values depending on the SoC.

If you really don't want a sysctl, perhaps a cmdline option could work ?

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Patch v3 7/7] sched: thermal: Enable tuning of decay period
  2019-10-22  9:03       ` Quentin Perret
@ 2019-10-22 20:36         ` Thara Gopinath
  0 siblings, 0 replies; 19+ messages in thread
From: Thara Gopinath @ 2019-10-22 20:36 UTC (permalink / raw)
  To: Quentin Perret
  Cc: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang,
	edubezval, linux-kernel, amit.kachhap, javi.merino,
	daniel.lezcano, kernel-team

On 10/22/2019 05:03 AM, Quentin Perret wrote:
> Hi Thara,
> 
> On Monday 21 Oct 2019 at 17:03:56 (-0400), Thara Gopinath wrote:
>> On 10/15/2019 06:14 AM, Quentin Perret wrote:
>>> Hi Thara,
>>>
>>> On Sunday 13 Oct 2019 at 20:58:25 (-0400), Thara Gopinath wrote:
>>>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>>>> index 00fcea2..5056c08 100644
>>>> --- a/kernel/sysctl.c
>>>> +++ b/kernel/sysctl.c
>>>> @@ -376,6 +376,13 @@ static struct ctl_table kern_table[] = {
>>>>  		.mode		= 0644,
>>>>  		.proc_handler	= proc_dointvec,
>>>>  	},
>>>> +	{
>>>> +		.procname	= "sched_thermal_decay_coeff",
>>>> +		.data		= &sysctl_sched_thermal_decay_coeff,
>>>> +		.maxlen		= sizeof(unsigned int),
>>>> +		.mode		= 0644,
>>>> +		.proc_handler	= proc_dointvec,
>>>
>>> Perhaps change this for 'sched_proc_update_handler' with min and max
>>> values ? Otherwise userspace is allowed to write nonsensical values
>>> here. And since sysctl_sched_thermal_decay_coeff is used to shift, this
>>> can lead to an undefined behaviour.
>> Will do
>>>
>>> Also, could we take this sysctl out of SCHED_DEBUG ? I expect this to be
>>> used/tuned on production devices where SCHED_DEBUG should theoretically
>>> be off.
>>
>> I will take it out of SCHED_DEBUG. I am wondering if this should be
>> a runtime control at all. Because this is a shift this changes the
>> accumulating window for the thermal pressure signal. A runtime change
>> will not guarantee a clean start of the window. May be I should make
>> this a config option.
> 
> I'd personally prefer if it wan't a Kconfig option. We'd like to make
> Android devices (which are going to use this) work with a Generic Kernel
> Image, which means there will be a single config for everyone. But I
> expect this knob to be tuned to different values depending on the SoC.
> 
> If you really don't want a sysctl, perhaps a cmdline option could work ?
yes . I will. I have sent across v4 with these and other review comments
fixed.
> 
> Thanks,
> Quentin
> 


-- 
Warm Regards
Thara

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2019-10-22 20:36 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-14  0:58 [Patch v3 0/7] Introduce Thermal Pressure Thara Gopinath
2019-10-14  0:58 ` [Patch v3 1/7] sched/pelt.c: Add support to track thermal pressure Thara Gopinath
2019-10-14 13:55   ` Vincent Guittot
2019-10-16 21:15     ` Thara Gopinath
2019-10-14  0:58 ` [Patch v3 2/7] sched: Add infrastructure to store and update instantaneous " Thara Gopinath
2019-10-14 15:50   ` Vincent Guittot
2019-10-16 21:22     ` Thara Gopinath
2019-10-17  8:44       ` Vincent Guittot
2019-10-17 16:40         ` Thara Gopinath
2019-10-18  8:05           ` Vincent Guittot
2019-10-14  0:58 ` [Patch v3 3/7] sched: Initialize per cpu thermal pressure structure Thara Gopinath
2019-10-14  0:58 ` [Patch v3 4/7] sched/fair: Enable CFS periodic tick to update thermal pressure Thara Gopinath
2019-10-14  0:58 ` [Patch v3 5/7] sched/fair: update cpu_capcity to reflect " Thara Gopinath
2019-10-14  0:58 ` [Patch v3 6/7] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping Thara Gopinath
2019-10-14  0:58 ` [Patch v3 7/7] sched: thermal: Enable tuning of decay period Thara Gopinath
2019-10-15 10:14   ` Quentin Perret
2019-10-21 21:03     ` Thara Gopinath
2019-10-22  9:03       ` Quentin Perret
2019-10-22 20:36         ` Thara Gopinath

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).