linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/7] Introduce thermal pressure
@ 2018-10-09 16:24 ` Thara Gopinath
  2018-10-09 16:24   ` [RFC PATCH 1/7] sched/pelt.c: Add option to make load and util calculations frequency invariant Thara Gopinath
                     ` (9 more replies)
  0 siblings, 10 replies; 67+ messages in thread
From: Thara Gopinath @ 2018-10-09 16:24 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, rui.zhang
  Cc: gregkh, rafael, amit.kachhap, viresh.kumar, javi.merino,
	edubezval, daniel.lezcano, linux-pm, quentin.perret,
	ionela.voinescu, vincent.guittot

Thermal governors can respond to an overheat event for a cpu by
capping the cpu's maximum possible frequency. This in turn
means that the maximum available compute capacity of the
cpu is restricted. But today in linux kernel, in event of maximum
frequency capping of a cpu, the maximum available compute
capacity of the cpu is not adjusted at all. In other words, scheduler
is unware maximum cpu capacity restrictions placed due to thermal
activity. This patch series attempts to address this issue.
The benefits identified are better task placement among available
cpus in event of overheating which in turn leads to better
performance numbers.

The delta between the maximum possible capacity of a cpu and
maximum available capacity of a cpu due to thermal event can
be considered as thermal pressure. Instantaneous thermal pressure
is hard to record and can sometime be erroneous as there can be mismatch
between the actual capping of capacity and scheduler recording it.
Thus solution is to have a weighted average per cpu value for thermal
pressure over time. The weight reflects the amount of time the cpu has
spent at a capped maximum frequency. To accumulate, average and
appropriately decay thermal pressure, this patch series uses pelt
signals and reuses the available framework that does a similar
bookkeeping of rt/dl task utilization.

Regarding testing, basic build, boot and sanity testing have been
performed on hikey960 mainline kernel with debian file system.
Further aobench (An occlusion renderer for benchmarking realworld
floating point performance) showed the following results on hikey960
with debain.

                                        Result          Standard        Standard
                                        (Time secs)     Error           Deviation
Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%
Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%

Thara Gopinath (7):
  sched/pelt: Add option to make load and util calculations frequency
    invariant
  sched/pelt.c: Add support to track thermal pressure
  sched: Add infrastructure to store and update instantaneous thermal
    pressure
  sched: Initialize per cpu thermal pressure structure
  sched/fair: Enable CFS periodic tick to update thermal pressure
  sched/fair: update cpu_capcity to reflect thermal pressure
  thermal/cpu-cooling: Update thermal pressure in case of a maximum
    frequency capping

 drivers/base/arch_topology.c  |  1 +
 drivers/thermal/cpu_cooling.c | 20 ++++++++++++-
 include/linux/sched.h         | 14 +++++++++
 kernel/sched/Makefile         |  2 +-
 kernel/sched/core.c           |  2 ++
 kernel/sched/fair.c           |  4 +++
 kernel/sched/pelt.c           | 40 ++++++++++++++++++--------
 kernel/sched/pelt.h           |  7 +++++
 kernel/sched/sched.h          |  1 +
 kernel/sched/thermal.c        | 66 +++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/thermal.h        | 13 +++++++++
 11 files changed, 157 insertions(+), 13 deletions(-)
 create mode 100644 kernel/sched/thermal.c
 create mode 100644 kernel/sched/thermal.h

-- 
2.1.4


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [RFC PATCH 1/7] sched/pelt.c: Add option to make load and util calculations frequency invariant
  2018-10-09 16:24 ` [RFC PATCH 0/7] Introduce thermal pressure Thara Gopinath
@ 2018-10-09 16:24   ` Thara Gopinath
  2018-10-09 16:24   ` [RFC PATCH 2/7] sched/pelt.c: Add support to track thermal pressure Thara Gopinath
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 67+ messages in thread
From: Thara Gopinath @ 2018-10-09 16:24 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, rui.zhang
  Cc: gregkh, rafael, amit.kachhap, viresh.kumar, javi.merino,
	edubezval, daniel.lezcano, linux-pm, quentin.perret,
	ionela.voinescu, vincent.guittot

Add an additional parametr in accumulate_sum to allow optional
frequency adjustment of load and utilization. When considering
rt/dl load/util, it is correct to scale it to the current cpu
frequency. On the other hand, thermal pressure(max capped frequency)
is frequency invariant.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---
 kernel/sched/pelt.c | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 35475c0..05b8798 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -107,7 +107,8 @@ static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
  */
 static __always_inline u32
 accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
-	       unsigned long load, unsigned long runnable, int running)
+	       unsigned long load, unsigned long runnable, int running,
+		int freq_adjusted)
 {
 	unsigned long scale_freq, scale_cpu;
 	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
@@ -137,7 +138,8 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
 	}
 	sa->period_contrib = delta;
 
-	contrib = cap_scale(contrib, scale_freq);
+	if (freq_adjusted)
+		contrib = cap_scale(contrib, scale_freq);
 	if (load)
 		sa->load_sum += load * contrib;
 	if (runnable)
@@ -178,7 +180,8 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
  */
 static __always_inline int
 ___update_load_sum(u64 now, int cpu, struct sched_avg *sa,
-		  unsigned long load, unsigned long runnable, int running)
+		  unsigned long load, unsigned long runnable, int running,
+			int freq_adjusted)
 {
 	u64 delta;
 
@@ -221,7 +224,8 @@ ___update_load_sum(u64 now, int cpu, struct sched_avg *sa,
 	 * Step 1: accumulate *_sum since last_update_time. If we haven't
 	 * crossed period boundaries, finish.
 	 */
-	if (!accumulate_sum(delta, cpu, sa, load, runnable, running))
+	if (!accumulate_sum(delta, cpu, sa, load, runnable, running,
+						freq_adjusted))
 		return 0;
 
 	return 1;
@@ -272,7 +276,7 @@ int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se)
 	if (entity_is_task(se))
 		se->runnable_weight = se->load.weight;
 
-	if (___update_load_sum(now, cpu, &se->avg, 0, 0, 0)) {
+	if (___update_load_sum(now, cpu, &se->avg, 0, 0, 0, 1)) {
 		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
 		return 1;
 	}
@@ -286,7 +290,7 @@ int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_e
 		se->runnable_weight = se->load.weight;
 
 	if (___update_load_sum(now, cpu, &se->avg, !!se->on_rq, !!se->on_rq,
-				cfs_rq->curr == se)) {
+				cfs_rq->curr == se, 1)) {
 
 		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
 		cfs_se_util_change(&se->avg);
@@ -301,7 +305,7 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
 	if (___update_load_sum(now, cpu, &cfs_rq->avg,
 				scale_load_down(cfs_rq->load.weight),
 				scale_load_down(cfs_rq->runnable_weight),
-				cfs_rq->curr != NULL)) {
+				cfs_rq->curr != NULL, 1)) {
 
 		___update_load_avg(&cfs_rq->avg, 1, 1);
 		return 1;
@@ -326,7 +330,7 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
 	if (___update_load_sum(now, rq->cpu, &rq->avg_rt,
 				running,
 				running,
-				running)) {
+				running, 1)) {
 
 		___update_load_avg(&rq->avg_rt, 1, 1);
 		return 1;
@@ -349,7 +353,7 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 	if (___update_load_sum(now, rq->cpu, &rq->avg_dl,
 				running,
 				running,
-				running)) {
+				running, 1)) {
 
 		___update_load_avg(&rq->avg_dl, 1, 1);
 		return 1;
@@ -385,11 +389,11 @@ int update_irq_load_avg(struct rq *rq, u64 running)
 	ret = ___update_load_sum(rq->clock - running, rq->cpu, &rq->avg_irq,
 				0,
 				0,
-				0);
+				0, 1);
 	ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,
 				1,
 				1,
-				1);
+				1, 1);
 
 	if (ret)
 		___update_load_avg(&rq->avg_irq, 1, 1);
-- 
2.1.4


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [RFC PATCH 2/7] sched/pelt.c: Add support to track thermal pressure
  2018-10-09 16:24 ` [RFC PATCH 0/7] Introduce thermal pressure Thara Gopinath
  2018-10-09 16:24   ` [RFC PATCH 1/7] sched/pelt.c: Add option to make load and util calculations frequency invariant Thara Gopinath
@ 2018-10-09 16:24   ` Thara Gopinath
  2018-10-09 16:24   ` [RFC PATCH 3/7] sched: Add infrastructure to store and update instantaneous " Thara Gopinath
                     ` (7 subsequent siblings)
  9 siblings, 0 replies; 67+ messages in thread
From: Thara Gopinath @ 2018-10-09 16:24 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, rui.zhang
  Cc: gregkh, rafael, amit.kachhap, viresh.kumar, javi.merino,
	edubezval, daniel.lezcano, linux-pm, quentin.perret,
	ionela.voinescu, vincent.guittot

Extrapolating on the exisitng framework to track rt/dl utilization using
pelt signals, add a similar mechanism to track thermal pressue. The
difference here from rt/dl utilization tracking is that, instead of
tracking time spent by a cpu running a rt/dl task through util_avg,
the average thermal pressure is tracked through load_avg.
In order to track average thermal pressure, a new sched_avg variable
avg_thermal is introduced. Function update_thermal_avg can be called
to do the periodic bookeeping (accumulate, decay and average)
of the thermal pressure.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---
 kernel/sched/pelt.c  | 14 ++++++++++++++
 kernel/sched/pelt.h  |  7 +++++++
 kernel/sched/sched.h |  1 +
 3 files changed, 22 insertions(+)

diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 05b8798..7034ede 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -362,6 +362,20 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 	return 0;
 }
 
+int update_thermal_avg(u64 now, struct rq *rq, u64 capacity)
+{
+	if (___update_load_sum(now, rq->cpu, &rq->avg_thermal,
+				capacity,
+				capacity,
+				capacity, 0)) {
+
+		___update_load_avg(&rq->avg_thermal, 1, 1);
+		return 1;
+	}
+
+	return 0;
+}
+
 #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
 /*
  * irq:
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index d2894db..cc5a3ad 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -5,6 +5,7 @@ int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_e
 int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
 int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
+int update_thermal_avg(u64 now, struct rq *rq, u64 capacity);
 
 #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
 int update_irq_load_avg(struct rq *rq, u64 running);
@@ -67,6 +68,12 @@ update_irq_load_avg(struct rq *rq, u64 running)
 {
 	return 0;
 }
+
+static inline int
+update_thermal_avg(u64 now, struct rq *rq, u64 capacity)
+{
+	return 0;
+}
 #endif
 
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4a2e8ca..30b40a5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -859,6 +859,7 @@ struct rq {
 #define HAVE_SCHED_AVG_IRQ
 	struct sched_avg	avg_irq;
 #endif
+	struct sched_avg	avg_thermal;
 	u64			idle_stamp;
 	u64			avg_idle;
 
-- 
2.1.4


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [RFC PATCH 3/7] sched: Add infrastructure to store and update instantaneous thermal pressure
  2018-10-09 16:24 ` [RFC PATCH 0/7] Introduce thermal pressure Thara Gopinath
  2018-10-09 16:24   ` [RFC PATCH 1/7] sched/pelt.c: Add option to make load and util calculations frequency invariant Thara Gopinath
  2018-10-09 16:24   ` [RFC PATCH 2/7] sched/pelt.c: Add support to track thermal pressure Thara Gopinath
@ 2018-10-09 16:24   ` Thara Gopinath
  2018-10-09 16:24   ` [RFC PATCH 4/7] sched: Initialize per cpu thermal pressure structure Thara Gopinath
                     ` (6 subsequent siblings)
  9 siblings, 0 replies; 67+ messages in thread
From: Thara Gopinath @ 2018-10-09 16:24 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, rui.zhang
  Cc: gregkh, rafael, amit.kachhap, viresh.kumar, javi.merino,
	edubezval, daniel.lezcano, linux-pm, quentin.perret,
	ionela.voinescu, vincent.guittot

Add thermal.c and thermal.h files that provides interface
APIs to initialize, update/average, track, accumulate and decay
thermal pressure per cpu basis. A per cpu structure max_capacity_info is
introduced to keep track of instantaneous per cpu thermal pressure.
Thermal pressure the delta between max_capacity and cap_capacity.
API update_periodic_maxcap is called for periodic accumulate and decay
of the thermal pressure. It is to to be called from a periodic tick
function. The API calculates the delta between max_capacity and
cap_capacity and passes on the delta to update_thermal_avg to do the
necessary accumulate, decay and average. update_maxcap_capacity is for
the system to update the thermal pressure by updating cap_capacity.
Considering, update_periodic_maxcap reads cap_capacity and
update_maxcap_capacity writes into cap_capacity, one can argue for
some sort of locking mechanism to avoid a stale value.
But considering update_periodic_maxcap can be called from a system
critical path like scheduler tick function, a locking mechanism is not
ideal. This means that possibly for 1 tick period the value used to
calculate average thermal pressure for a cpu can be stale.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---
 include/linux/sched.h  | 14 +++++++++++
 kernel/sched/Makefile  |  2 +-
 kernel/sched/thermal.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/thermal.h | 13 ++++++++++
 4 files changed, 94 insertions(+), 1 deletion(-)
 create mode 100644 kernel/sched/thermal.c
 create mode 100644 kernel/sched/thermal.h

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 977cb57..931b76d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1891,3 +1891,17 @@ static inline void rseq_syscall(struct pt_regs *regs)
 #endif
 
 #endif
+
+#ifdef CONFIG_SMP
+void update_maxcap_capacity(int cpu, u64 capacity);
+
+void populate_max_capacity_info(void);
+#else
+static inline void update_maxcap_capacity(int cpu, u64 capacity)
+{
+}
+
+static inline void populate_max_capacity_info(void)
+{
+}
+#endif
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 7fe1834..232a0cf 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -20,7 +20,7 @@ obj-y += core.o loadavg.o clock.o cputime.o
 obj-y += idle.o fair.o rt.o deadline.o
 obj-y += wait.o wait_bit.o swait.o completion.o
 
-obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o
+obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o thermal.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
diff --git a/kernel/sched/thermal.c b/kernel/sched/thermal.c
new file mode 100644
index 0000000..dd8300d
--- /dev/null
+++ b/kernel/sched/thermal.c
@@ -0,0 +1,66 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Sceduler Thermal Interactions
+ *
+ *  Copyright (C) 2018 Linaro, Inc., Thara Gopinath <thara.gopinath@linaro.org>
+ */
+
+#include <linux/sched.h>
+#include "sched.h"
+#include "pelt.h"
+#include "thermal.h"
+
+struct max_capacity_info {
+	unsigned long max_capacity;
+	unsigned long cap_capacity;
+};
+
+static DEFINE_PER_CPU(struct max_capacity_info, max_cap);
+
+void update_maxcap_capacity(int cpu, u64 capacity)
+{
+	struct max_capacity_info *__max_cap;
+	u64 __capacity;
+
+	__max_cap = (&per_cpu(max_cap, cpu));
+	if (!__max_cap) {
+		pr_err("no max_capacity_info structure for cpu %d\n", cpu);
+		return;
+	}
+
+	/* Normalize the capacity */
+	__capacity = (capacity * arch_scale_cpu_capacity(NULL, cpu)) >>
+							SCHED_CAPACITY_SHIFT;
+	pr_debug("updating cpu%d capped capacity from %ld to %ld\n", cpu, __max_cap->cap_capacity, __capacity);
+
+	__max_cap->cap_capacity = __capacity;
+}
+
+void populate_max_capacity_info(void)
+{
+	struct max_capacity_info *__max_cap;
+	u64 capacity;
+	int cpu;
+
+
+	for_each_possible_cpu(cpu) {
+		__max_cap = (&per_cpu(max_cap, cpu));
+		if (!__max_cap)
+			continue;
+		capacity = arch_scale_cpu_capacity(NULL, cpu);
+		__max_cap->max_capacity = __max_cap->cap_capacity = capacity;
+		pr_debug("cpu %d max capacity set to %ld\n", cpu, __max_cap->max_capacity);
+	}
+}
+
+void update_periodic_maxcap(struct rq *rq)
+{
+	struct max_capacity_info *__max_cap = (&per_cpu(max_cap, cpu_of(rq)));
+	unsigned long delta;
+
+	if (!__max_cap)
+		return;
+
+	delta = __max_cap->max_capacity - __max_cap->cap_capacity;
+	update_thermal_avg(rq_clock_task(rq), rq, delta);
+}
diff --git a/kernel/sched/thermal.h b/kernel/sched/thermal.h
new file mode 100644
index 0000000..20a0270
--- /dev/null
+++ b/kernel/sched/thermal.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Scheduler thermal interaction internal methods.
+ */
+
+#ifdef CONFIG_SMP
+void update_periodic_maxcap(struct rq *rq);
+
+#else
+static inline void update_periodic_maxcap(struct rq *rq)
+{
+}
+#endif
-- 
2.1.4


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [RFC PATCH 4/7] sched: Initialize per cpu thermal pressure structure
  2018-10-09 16:24 ` [RFC PATCH 0/7] Introduce thermal pressure Thara Gopinath
                     ` (2 preceding siblings ...)
  2018-10-09 16:24   ` [RFC PATCH 3/7] sched: Add infrastructure to store and update instantaneous " Thara Gopinath
@ 2018-10-09 16:24   ` Thara Gopinath
  2018-10-09 16:25   ` [RFC PATCH 5/7] sched/fair: Enable CFS periodic tick to update thermal pressure Thara Gopinath
                     ` (5 subsequent siblings)
  9 siblings, 0 replies; 67+ messages in thread
From: Thara Gopinath @ 2018-10-09 16:24 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, rui.zhang
  Cc: gregkh, rafael, amit.kachhap, viresh.kumar, javi.merino,
	edubezval, daniel.lezcano, linux-pm, quentin.perret,
	ionela.voinescu, vincent.guittot

Initialize per cpu max_capacity_info during scheduler init.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---
 drivers/base/arch_topology.c | 1 +
 kernel/sched/core.c          | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index e7cb0c6..542745f 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -126,6 +126,7 @@ void topology_normalize_cpu_scale(void)
 		pr_debug("cpu_capacity: CPU%d cpu_capacity=%lu\n",
 			cpu, topology_get_cpu_scale(NULL, cpu));
 	}
+	populate_max_capacity_info();
 	mutex_unlock(&cpu_scale_mutex);
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 625bc98..f0eed1a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5868,6 +5868,8 @@ void __init sched_init_smp(void)
 	init_sched_rt_class();
 	init_sched_dl_class();
 
+	populate_max_capacity_info();
+
 	sched_smp_initialized = true;
 }
 
-- 
2.1.4


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [RFC PATCH 5/7] sched/fair: Enable CFS periodic tick to update thermal pressure
  2018-10-09 16:24 ` [RFC PATCH 0/7] Introduce thermal pressure Thara Gopinath
                     ` (3 preceding siblings ...)
  2018-10-09 16:24   ` [RFC PATCH 4/7] sched: Initialize per cpu thermal pressure structure Thara Gopinath
@ 2018-10-09 16:25   ` Thara Gopinath
  2018-12-04 15:43     ` Vincent Guittot
  2018-10-09 16:25   ` [RFC PATCH 6/7] sched/fair: update cpu_capcity to reflect " Thara Gopinath
                     ` (4 subsequent siblings)
  9 siblings, 1 reply; 67+ messages in thread
From: Thara Gopinath @ 2018-10-09 16:25 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, rui.zhang
  Cc: gregkh, rafael, amit.kachhap, viresh.kumar, javi.merino,
	edubezval, daniel.lezcano, linux-pm, quentin.perret,
	ionela.voinescu, vincent.guittot

Introduce support in CFS periodic tick to trigger the process of
computing average thermal pressure for a cpu.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---
 kernel/sched/fair.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b39fb59..7deb1d0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -21,6 +21,7 @@
  *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
  */
 #include "sched.h"
+#include "thermal.h"
 
 #include <trace/events/sched.h>
 
@@ -9557,6 +9558,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
+
+	update_periodic_maxcap(rq);
 }
 
 /*
-- 
2.1.4


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [RFC PATCH 6/7] sched/fair: update cpu_capcity to reflect thermal pressure
  2018-10-09 16:24 ` [RFC PATCH 0/7] Introduce thermal pressure Thara Gopinath
                     ` (4 preceding siblings ...)
  2018-10-09 16:25   ` [RFC PATCH 5/7] sched/fair: Enable CFS periodic tick to update thermal pressure Thara Gopinath
@ 2018-10-09 16:25   ` Thara Gopinath
  2018-10-10  5:57     ` Javi Merino
  2018-10-09 16:25   ` [RFC PATCH 7/7] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping Thara Gopinath
                     ` (3 subsequent siblings)
  9 siblings, 1 reply; 67+ messages in thread
From: Thara Gopinath @ 2018-10-09 16:25 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, rui.zhang
  Cc: gregkh, rafael, amit.kachhap, viresh.kumar, javi.merino,
	edubezval, daniel.lezcano, linux-pm, quentin.perret,
	ionela.voinescu, vincent.guittot

cpu_capacity relflects the maximum available capacity of a cpu. Thermal
pressure on a cpu means this maximum available capacity is reduced. This
patch reduces the average thermal pressure for a cpu from its maximum
available capacity so that cpu_capacity reflects the actual
available capacity.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---
 kernel/sched/fair.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7deb1d0..8651e55 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7497,6 +7497,7 @@ static unsigned long scale_rt_capacity(int cpu)
 
 	used = READ_ONCE(rq->avg_rt.util_avg);
 	used += READ_ONCE(rq->avg_dl.util_avg);
+	used += READ_ONCE(rq->avg_thermal.load_avg);
 
 	if (unlikely(used >= max))
 		return 1;
-- 
2.1.4


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [RFC PATCH 7/7] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping
  2018-10-09 16:24 ` [RFC PATCH 0/7] Introduce thermal pressure Thara Gopinath
                     ` (5 preceding siblings ...)
  2018-10-09 16:25   ` [RFC PATCH 6/7] sched/fair: update cpu_capcity to reflect " Thara Gopinath
@ 2018-10-09 16:25   ` Thara Gopinath
  2018-10-10  5:44   ` [RFC PATCH 0/7] Introduce thermal pressure Javi Merino
                     ` (2 subsequent siblings)
  9 siblings, 0 replies; 67+ messages in thread
From: Thara Gopinath @ 2018-10-09 16:25 UTC (permalink / raw)
  To: linux-kernel, mingo, peterz, rui.zhang
  Cc: gregkh, rafael, amit.kachhap, viresh.kumar, javi.merino,
	edubezval, daniel.lezcano, linux-pm, quentin.perret,
	ionela.voinescu, vincent.guittot

Thermal governors can request for a cpu's maximum supported frequency
to be capped in case of an overheat event. This in turn means that the
maximum capacity available for tasks to run on the particular cpu is
reduced. Delta between the original maximum capacity and capped
maximum capacity is known as thermal pressure. Enable cpufreq cooling
device to update the thermal pressure in event of a capped
maximum frequency.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---
 drivers/thermal/cpu_cooling.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index dfd2324..da8de66 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -31,6 +31,7 @@
 #include <linux/slab.h>
 #include <linux/cpu.h>
 #include <linux/cpu_cooling.h>
+#include <linux/sched/topology.h>
 
 #include <trace/events/thermal.h>
 
@@ -132,6 +133,21 @@ static unsigned long get_level(struct cpufreq_cooling_device *cpufreq_cdev,
 }
 
 /**
+ * cpufreq_update_sched_max_capacity - update scheduler about change in cpu
+ *					max frequency.
+ * @policy - cpufreq policy whose max frequency is capped.
+ */
+static void cpufreq_update_sched_max_capacity(struct cpufreq_policy *policy)
+{
+	int cpu;
+	unsigned long capacity = (policy->max << SCHED_CAPACITY_SHIFT) /
+					policy->cpuinfo.max_freq;
+
+	for_each_cpu(cpu, policy->cpus)
+		update_maxcap_capacity(cpu, capacity);
+}
+
+/**
  * cpufreq_thermal_notifier - notifier callback for cpufreq policy change.
  * @nb:	struct notifier_block * with callback info.
  * @event: value showing cpufreq event for which this function invoked.
@@ -175,8 +191,10 @@ static int cpufreq_thermal_notifier(struct notifier_block *nb,
 		 */
 		clipped_freq = cpufreq_cdev->clipped_freq;
 
-		if (policy->max > clipped_freq)
+		if (policy->max > clipped_freq) {
 			cpufreq_verify_within_limits(policy, 0, clipped_freq);
+			cpufreq_update_sched_max_capacity(policy);
+		}
 		break;
 	}
 	mutex_unlock(&cooling_list_lock);
-- 
2.1.4


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-09 16:24 ` [RFC PATCH 0/7] Introduce thermal pressure Thara Gopinath
                     ` (6 preceding siblings ...)
  2018-10-09 16:25   ` [RFC PATCH 7/7] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping Thara Gopinath
@ 2018-10-10  5:44   ` Javi Merino
  2018-10-10 14:15     ` Thara Gopinath
  2018-10-10  6:17   ` Ingo Molnar
  2018-10-10 15:35   ` Lukasz Luba
  9 siblings, 1 reply; 67+ messages in thread
From: Javi Merino @ 2018-10-10  5:44 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: linux-kernel, mingo, peterz, rui.zhang, gregkh, rafael,
	amit.kachhap, viresh.kumar, edubezval, daniel.lezcano, linux-pm,
	quentin.perret, ionela.voinescu, vincent.guittot

On Tue, Oct 09, 2018 at 12:24:55PM -0400, Thara Gopinath wrote:
> Thermal governors can respond to an overheat event for a cpu by
> capping the cpu's maximum possible frequency. This in turn
> means that the maximum available compute capacity of the
> cpu is restricted. But today in linux kernel, in event of maximum
> frequency capping of a cpu, the maximum available compute
> capacity of the cpu is not adjusted at all. In other words, scheduler
> is unware maximum cpu capacity restrictions placed due to thermal
> activity.

Interesting, I would have sworn that I tested this years ago by
lowering the maximum frequency of a cpufreq domain, and the scheduler
reacted accordingly to the new maximum capacities of the cpus.

>           This patch series attempts to address this issue.
> The benefits identified are better task placement among available
> cpus in event of overheating which in turn leads to better
> performance numbers.
> 
> The delta between the maximum possible capacity of a cpu and
> maximum available capacity of a cpu due to thermal event can
> be considered as thermal pressure. Instantaneous thermal pressure
> is hard to record and can sometime be erroneous as there can be mismatch
> between the actual capping of capacity and scheduler recording it.
> Thus solution is to have a weighted average per cpu value for thermal
> pressure over time. The weight reflects the amount of time the cpu has
> spent at a capped maximum frequency. To accumulate, average and
> appropriately decay thermal pressure, this patch series uses pelt
> signals and reuses the available framework that does a similar
> bookkeeping of rt/dl task utilization.
> 
> Regarding testing, basic build, boot and sanity testing have been
> performed on hikey960 mainline kernel with debian file system.
> Further aobench (An occlusion renderer for benchmarking realworld
> floating point performance) showed the following results on hikey960
> with debain.
> 
>                                         Result          Standard        Standard
>                                         (Time secs)     Error           Deviation
> Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%
> Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%
> 
> Thara Gopinath (7):
>   sched/pelt: Add option to make load and util calculations frequency
>     invariant
>   sched/pelt.c: Add support to track thermal pressure
>   sched: Add infrastructure to store and update instantaneous thermal
>     pressure
>   sched: Initialize per cpu thermal pressure structure
>   sched/fair: Enable CFS periodic tick to update thermal pressure
>   sched/fair: update cpu_capcity to reflect thermal pressure
>   thermal/cpu-cooling: Update thermal pressure in case of a maximum
>     frequency capping
> 
>  drivers/base/arch_topology.c  |  1 +
>  drivers/thermal/cpu_cooling.c | 20 ++++++++++++-

thermal?  There are other ways in which the maximum frequency of a cpu
can be limited, for example from userspace via scaling_max_freq.

When something (anything) changes the maximum frequency of a cpufreq
policy, the scheduler should be notified.  I think this change should
be done in cpufreq instead to make it generic and not particular to
a given maximum frequency "capper".

Cheers,
Javi

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 6/7] sched/fair: update cpu_capcity to reflect thermal pressure
  2018-10-09 16:25   ` [RFC PATCH 6/7] sched/fair: update cpu_capcity to reflect " Thara Gopinath
@ 2018-10-10  5:57     ` Javi Merino
  2018-10-10 14:22       ` Thara Gopinath
  0 siblings, 1 reply; 67+ messages in thread
From: Javi Merino @ 2018-10-10  5:57 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: linux-kernel, mingo, peterz, rui.zhang, gregkh, rafael,
	amit.kachhap, viresh.kumar, edubezval, daniel.lezcano, linux-pm,
	quentin.perret, ionela.voinescu, vincent.guittot

On Tue, Oct 09, 2018 at 12:25:01PM -0400, Thara Gopinath wrote:
> cpu_capacity relflects the maximum available capacity of a cpu. Thermal
> pressure on a cpu means this maximum available capacity is reduced. This
> patch reduces the average thermal pressure for a cpu from its maximum
> available capacity so that cpu_capacity reflects the actual
> available capacity.
> 
> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> ---
>  kernel/sched/fair.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7deb1d0..8651e55 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7497,6 +7497,7 @@ static unsigned long scale_rt_capacity(int cpu)
>  
>  	used = READ_ONCE(rq->avg_rt.util_avg);
>  	used += READ_ONCE(rq->avg_dl.util_avg);
> +	used += READ_ONCE(rq->avg_thermal.load_avg);

IIUIC, you are treating thermal pressure as an artificial load on the
cpu.  If so, this sounds like a hard to maintain hack.  Thermal
pressure have different characteristics to utilization.  What happens
if thermal sets the cpu cooling state back to 0 because there is
thermal headroom again?  Do we keep adding this artificial load to the
cpu just because there was thermal pressure in the past and let it
decay as if it was cpu load?

Cheers,
Javi


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-09 16:24 ` [RFC PATCH 0/7] Introduce thermal pressure Thara Gopinath
                     ` (7 preceding siblings ...)
  2018-10-10  5:44   ` [RFC PATCH 0/7] Introduce thermal pressure Javi Merino
@ 2018-10-10  6:17   ` Ingo Molnar
  2018-10-10  8:29     ` Quentin Perret
  2018-10-10 15:43     ` Thara Gopinath
  2018-10-10 15:35   ` Lukasz Luba
  9 siblings, 2 replies; 67+ messages in thread
From: Ingo Molnar @ 2018-10-10  6:17 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: linux-kernel, mingo, peterz, rui.zhang, gregkh, rafael,
	amit.kachhap, viresh.kumar, javi.merino, edubezval,
	daniel.lezcano, linux-pm, quentin.perret, ionela.voinescu,
	vincent.guittot


* Thara Gopinath <thara.gopinath@linaro.org> wrote:

> Thermal governors can respond to an overheat event for a cpu by
> capping the cpu's maximum possible frequency. This in turn
> means that the maximum available compute capacity of the
> cpu is restricted. But today in linux kernel, in event of maximum
> frequency capping of a cpu, the maximum available compute
> capacity of the cpu is not adjusted at all. In other words, scheduler
> is unware maximum cpu capacity restrictions placed due to thermal
> activity. This patch series attempts to address this issue.
> The benefits identified are better task placement among available
> cpus in event of overheating which in turn leads to better
> performance numbers.
> 
> The delta between the maximum possible capacity of a cpu and
> maximum available capacity of a cpu due to thermal event can
> be considered as thermal pressure. Instantaneous thermal pressure
> is hard to record and can sometime be erroneous as there can be mismatch
> between the actual capping of capacity and scheduler recording it.
> Thus solution is to have a weighted average per cpu value for thermal
> pressure over time. The weight reflects the amount of time the cpu has
> spent at a capped maximum frequency. To accumulate, average and
> appropriately decay thermal pressure, this patch series uses pelt
> signals and reuses the available framework that does a similar
> bookkeeping of rt/dl task utilization.
> 
> Regarding testing, basic build, boot and sanity testing have been
> performed on hikey960 mainline kernel with debian file system.
> Further aobench (An occlusion renderer for benchmarking realworld
> floating point performance) showed the following results on hikey960
> with debain.
> 
>                                         Result          Standard        Standard
>                                         (Time secs)     Error           Deviation
> Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%
> Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%

Wow, +13% speedup, impressive! We definitely want this outcome.

I'm wondering what happens if we do not track and decay the thermal load at all at the PELT 
level, but instantaneously decrease/increase effective CPU capacity in reaction to thermal 
events we receive from the CPU.

You describe the averaging as:

> Instantaneous thermal pressure is hard to record and can sometime be erroneous as there can 
> be mismatch between the actual capping of capacity and scheduler recording it.

Not sure I follow the argument here: are there bogus thermal throttling events? If so then
they are hopefully not frequent enough and should average out over time even if we follow
it instantly.

I.e. what is 'can sometimes be erroneous', exactly?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-10  6:17   ` Ingo Molnar
@ 2018-10-10  8:29     ` Quentin Perret
  2018-10-10  8:50       ` Vincent Guittot
  2018-10-10 15:43     ` Thara Gopinath
  1 sibling, 1 reply; 67+ messages in thread
From: Quentin Perret @ 2018-10-10  8:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thara Gopinath, linux-kernel, mingo, peterz, rui.zhang, gregkh,
	rafael, amit.kachhap, viresh.kumar, javi.merino, edubezval,
	daniel.lezcano, linux-pm, ionela.voinescu, vincent.guittot

Hi Thara,

On Wednesday 10 Oct 2018 at 08:17:51 (+0200), Ingo Molnar wrote:
> 
> * Thara Gopinath <thara.gopinath@linaro.org> wrote:
> 
> > Thermal governors can respond to an overheat event for a cpu by
> > capping the cpu's maximum possible frequency. This in turn
> > means that the maximum available compute capacity of the
> > cpu is restricted. But today in linux kernel, in event of maximum
> > frequency capping of a cpu, the maximum available compute
> > capacity of the cpu is not adjusted at all. In other words, scheduler
> > is unware maximum cpu capacity restrictions placed due to thermal
> > activity. This patch series attempts to address this issue.
> > The benefits identified are better task placement among available
> > cpus in event of overheating which in turn leads to better
> > performance numbers.
> > 
> > The delta between the maximum possible capacity of a cpu and
> > maximum available capacity of a cpu due to thermal event can
> > be considered as thermal pressure. Instantaneous thermal pressure
> > is hard to record and can sometime be erroneous as there can be mismatch
> > between the actual capping of capacity and scheduler recording it.
> > Thus solution is to have a weighted average per cpu value for thermal
> > pressure over time. The weight reflects the amount of time the cpu has
> > spent at a capped maximum frequency. To accumulate, average and
> > appropriately decay thermal pressure, this patch series uses pelt
> > signals and reuses the available framework that does a similar
> > bookkeeping of rt/dl task utilization.
> > 
> > Regarding testing, basic build, boot and sanity testing have been
> > performed on hikey960 mainline kernel with debian file system.
> > Further aobench (An occlusion renderer for benchmarking realworld
> > floating point performance) showed the following results on hikey960
> > with debain.
> > 
> >                                         Result          Standard        Standard
> >                                         (Time secs)     Error           Deviation
> > Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%
> > Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%
> 
> Wow, +13% speedup, impressive! We definitely want this outcome.
> 
> I'm wondering what happens if we do not track and decay the thermal load at all at the PELT 
> level, but instantaneously decrease/increase effective CPU capacity in reaction to thermal 
> events we receive from the CPU.

+1, it's not that obvious (to me at least) that averaging the thermal
pressure over time is necessarily what we want. Say the thermal governor
caps a CPU and 'removes' 70% of its capacity, it will take forever for
the PELT signal to ramp-up to that level before the scheduler can react.
And the other way around, if you release the cap, it'll take a while
before we actually start using the newly available capacity. I can also
imagine how reacting too fast can be counter-productive, but I guess
having numbers and/or use-cases to show that would be great :-)

Thara, have you tried to experiment with a simpler implementation as
suggested by Ingo ?

Also, assuming that we do want to average things, do we actually want to
tie the thermal ramp-up time to the PELT half life ? That provides
nice maths properties wrt the other signals, but it's not obvious to me
that this thermal 'constant' should be the same on all platforms. Or
maybe it should ?

Thanks,
Quentin

> 
> You describe the averaging as:
> 
> > Instantaneous thermal pressure is hard to record and can sometime be erroneous as there can 
> > be mismatch between the actual capping of capacity and scheduler recording it.
> 
> Not sure I follow the argument here: are there bogus thermal throttling events? If so then
> they are hopefully not frequent enough and should average out over time even if we follow
> it instantly.
> 
> I.e. what is 'can sometimes be erroneous', exactly?
> 
> Thanks,
> 
> 	Ingo

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-10  8:29     ` Quentin Perret
@ 2018-10-10  8:50       ` Vincent Guittot
  2018-10-10  9:55         ` Quentin Perret
  0 siblings, 1 reply; 67+ messages in thread
From: Vincent Guittot @ 2018-10-10  8:50 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Ingo Molnar, Thara Gopinath, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Zhang Rui, gregkh, Rafael J. Wysocki,
	Amit Kachhap, viresh kumar, Javi Merino, Eduardo Valentin,
	Daniel Lezcano, open list:THERMAL, Ionela Voinescu

On Wed, 10 Oct 2018 at 10:29, Quentin Perret <quentin.perret@arm.com> wrote:
>
> Hi Thara,
>
> On Wednesday 10 Oct 2018 at 08:17:51 (+0200), Ingo Molnar wrote:
> >
> > * Thara Gopinath <thara.gopinath@linaro.org> wrote:
> >
> > > Thermal governors can respond to an overheat event for a cpu by
> > > capping the cpu's maximum possible frequency. This in turn
> > > means that the maximum available compute capacity of the
> > > cpu is restricted. But today in linux kernel, in event of maximum
> > > frequency capping of a cpu, the maximum available compute
> > > capacity of the cpu is not adjusted at all. In other words, scheduler
> > > is unware maximum cpu capacity restrictions placed due to thermal
> > > activity. This patch series attempts to address this issue.
> > > The benefits identified are better task placement among available
> > > cpus in event of overheating which in turn leads to better
> > > performance numbers.
> > >
> > > The delta between the maximum possible capacity of a cpu and
> > > maximum available capacity of a cpu due to thermal event can
> > > be considered as thermal pressure. Instantaneous thermal pressure
> > > is hard to record and can sometime be erroneous as there can be mismatch
> > > between the actual capping of capacity and scheduler recording it.
> > > Thus solution is to have a weighted average per cpu value for thermal
> > > pressure over time. The weight reflects the amount of time the cpu has
> > > spent at a capped maximum frequency. To accumulate, average and
> > > appropriately decay thermal pressure, this patch series uses pelt
> > > signals and reuses the available framework that does a similar
> > > bookkeeping of rt/dl task utilization.
> > >
> > > Regarding testing, basic build, boot and sanity testing have been
> > > performed on hikey960 mainline kernel with debian file system.
> > > Further aobench (An occlusion renderer for benchmarking realworld
> > > floating point performance) showed the following results on hikey960
> > > with debain.
> > >
> > >                                         Result          Standard        Standard
> > >                                         (Time secs)     Error           Deviation
> > > Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%
> > > Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%
> >
> > Wow, +13% speedup, impressive! We definitely want this outcome.
> >
> > I'm wondering what happens if we do not track and decay the thermal load at all at the PELT
> > level, but instantaneously decrease/increase effective CPU capacity in reaction to thermal
> > events we receive from the CPU.
>
> +1, it's not that obvious (to me at least) that averaging the thermal
> pressure over time is necessarily what we want. Say the thermal governor
> caps a CPU and 'removes' 70% of its capacity, it will take forever for
> the PELT signal to ramp-up to that level before the scheduler can react.
> And the other way around, if you release the cap, it'll take a while
> before we actually start using the newly available capacity. I can also
> imagine how reacting too fast can be counter-productive, but I guess
> having numbers and/or use-cases to show that would be great :-)

The problem with reflecting directly the capping is that it happens
far more often than the pace at which cpu_capacity_orig is updated in
the scheduler. This means that at the moment when scheduler uses the
value, it might not be correct anymore. Then, this value are also used
when building the sched_domain and setting max_cpu_capacity which
would also implies the rebuilt the sched_domain topology ...

The pace of changing the capping is to fast to reflect that in the
whole scheduler topology

>
> Thara, have you tried to experiment with a simpler implementation as
> suggested by Ingo ?
>
> Also, assuming that we do want to average things, do we actually want to
> tie the thermal ramp-up time to the PELT half life ? That provides
> nice maths properties wrt the other signals, but it's not obvious to me
> that this thermal 'constant' should be the same on all platforms. Or
> maybe it should ?

The main interest of using PELT signal is that thermal pressure will
evolve at the same pace as other signals used in the scheduler. With
thermal pressure, we have the exact same problem as with RT tasks. The
thermal will cap the max frequency which will cap the utilization of
the tasks running on the CPU

>
> Thanks,
> Quentin
>
> >
> > You describe the averaging as:
> >
> > > Instantaneous thermal pressure is hard to record and can sometime be erroneous as there can
> > > be mismatch between the actual capping of capacity and scheduler recording it.
> >
> > Not sure I follow the argument here: are there bogus thermal throttling events? If so then
> > they are hopefully not frequent enough and should average out over time even if we follow
> > it instantly.
> >
> > I.e. what is 'can sometimes be erroneous', exactly?
> >
> > Thanks,
> >
> >       Ingo

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-10  8:50       ` Vincent Guittot
@ 2018-10-10  9:55         ` Quentin Perret
  2018-10-10 10:14           ` Vincent Guittot
  2018-10-10 17:03           ` Thara Gopinath
  0 siblings, 2 replies; 67+ messages in thread
From: Quentin Perret @ 2018-10-10  9:55 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Thara Gopinath, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Zhang Rui, gregkh, Rafael J. Wysocki,
	Amit Kachhap, viresh kumar, Javi Merino, Eduardo Valentin,
	Daniel Lezcano, open list:THERMAL, Ionela Voinescu

Hi Vincent,

On Wednesday 10 Oct 2018 at 10:50:05 (+0200), Vincent Guittot wrote:
> The problem with reflecting directly the capping is that it happens
> far more often than the pace at which cpu_capacity_orig is updated in
> the scheduler.

Hmm, how can you be so sure ? That most likely depends on the workload,
the platform and the thermal governor. Some platforms heat up slowly,
some quickly. The pace at which the thermal governor will change things
should depend on that I assume.

> This means that at the moment when scheduler uses the
> value, it might not be correct anymore.

And OTOH, when you remove a cap for example, it will take time before
the scheduler can see the newly available capacity if you need to wait
for the signal to decay. So you are using a wrong information too in
that scenario.

> Then, this value are also used
> when building the sched_domain and setting max_cpu_capacity which
> would also implies the rebuilt the sched_domain topology ...

Wait what ? I thought the thermal cap was reflected in capacity_of, not
capacity_orig_of ... You need to rebuild the sched_domain in case of
thermal pressure ?

Hmm, let me have a closer look at the patches, I must have missed
something ...

> The pace of changing the capping is to fast to reflect that in the
> whole scheduler topology

That's probably true in some cases, but it'd be cool to have numbers to
back up that statement, I think.

Now, if you do need to rebuild the sched domain topology every time you
update the thermal pressure, I think the PELT HL is _way_ too short for
that ... You can't rebuild the whole thing every 32ms or so. Or am I
misunderstanding something ?

> > Thara, have you tried to experiment with a simpler implementation as
> > suggested by Ingo ?
> >
> > Also, assuming that we do want to average things, do we actually want to
> > tie the thermal ramp-up time to the PELT half life ? That provides
> > nice maths properties wrt the other signals, but it's not obvious to me
> > that this thermal 'constant' should be the same on all platforms. Or
> > maybe it should ?
> 
> The main interest of using PELT signal is that thermal pressure will
> evolve at the same pace as other signals used in the scheduler.

Right, I think this is a nice property too (assuming that we actually
want to average things out).

> With
> thermal pressure, we have the exact same problem as with RT tasks. The
> thermal will cap the max frequency which will cap the utilization of
> the tasks running on the CPU

Well, the nature of the signal is slightly different IMO. Yes it's
capacity, but you're no actually measuring time spent on the CPU. All
other PELT signals are based on time, this thermal thing isn't, so it is
kinda different in a way. And I'm still wondering if it could be helpful
to be able to have a different HL for that thermal signal. That would
'break' the nice maths properties we have, yes, but is it a problem or is
it actually helpful to cope with the thermal characteristics of
different platforms ?

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-10  9:55         ` Quentin Perret
@ 2018-10-10 10:14           ` Vincent Guittot
  2018-10-10 10:36             ` Quentin Perret
  2018-10-10 17:03           ` Thara Gopinath
  1 sibling, 1 reply; 67+ messages in thread
From: Vincent Guittot @ 2018-10-10 10:14 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Ingo Molnar, Thara Gopinath, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Zhang Rui, gregkh, Rafael J. Wysocki,
	Amit Kachhap, viresh kumar, Javi Merino, Eduardo Valentin,
	Daniel Lezcano, open list:THERMAL, Ionela Voinescu

On Wed, 10 Oct 2018 at 11:55, Quentin Perret <quentin.perret@arm.com> wrote:
>
> Hi Vincent,
>
> On Wednesday 10 Oct 2018 at 10:50:05 (+0200), Vincent Guittot wrote:
> > The problem with reflecting directly the capping is that it happens
> > far more often than the pace at which cpu_capacity_orig is updated in
> > the scheduler.
>
> Hmm, how can you be so sure ? That most likely depends on the workload,
> the platform and the thermal governor. Some platforms heat up slowly,
> some quickly. The pace at which the thermal governor will change things
> should depend on that I assume.
>
> > This means that at the moment when scheduler uses the
> > value, it might not be correct anymore.
>
> And OTOH, when you remove a cap for example, it will take time before
> the scheduler can see the newly available capacity if you need to wait
> for the signal to decay. So you are using a wrong information too in
> that scenario.

But we stay consistant with all other metrics. The same happen when a
task decide to stay idle for a long time after some activity... You
will have to wait for the signal to decay

>
> > Then, this value are also used
> > when building the sched_domain and setting max_cpu_capacity which
> > would also implies the rebuilt the sched_domain topology ...
>
> Wait what ? I thought the thermal cap was reflected in capacity_of, not
> capacity_orig_of ... You need to rebuild the sched_domain in case of
> thermal pressure ?
>
> Hmm, let me have a closer look at the patches, I must have missed
> something ...
>
> > The pace of changing the capping is to fast to reflect that in the
> > whole scheduler topology
>
> That's probably true in some cases, but it'd be cool to have numbers to
> back up that statement, I think.
>
> Now, if you do need to rebuild the sched domain topology every time you
> update the thermal pressure, I think the PELT HL is _way_ too short for
> that ... You can't rebuild the whole thing every 32ms or so. Or am I
> misunderstanding something ?
>
> > > Thara, have you tried to experiment with a simpler implementation as
> > > suggested by Ingo ?
> > >
> > > Also, assuming that we do want to average things, do we actually want to
> > > tie the thermal ramp-up time to the PELT half life ? That provides
> > > nice maths properties wrt the other signals, but it's not obvious to me
> > > that this thermal 'constant' should be the same on all platforms. Or
> > > maybe it should ?
> >
> > The main interest of using PELT signal is that thermal pressure will
> > evolve at the same pace as other signals used in the scheduler.
>
> Right, I think this is a nice property too (assuming that we actually
> want to average things out).
>
> > With
> > thermal pressure, we have the exact same problem as with RT tasks. The
> > thermal will cap the max frequency which will cap the utilization of
> > the tasks running on the CPU
>
> Well, the nature of the signal is slightly different IMO. Yes it's
> capacity, but you're no actually measuring time spent on the CPU. All
> other PELT signals are based on time, this thermal thing isn't, so it is
> kinda different in a way. And I'm still wondering if it could be helpful

hmmm ... it is based on time too.
Both signals (current ones and thermal one) are really close. The main
difference with other utilization signal is that instead of providing
a running/not running boolean that is then weighted by the current
capacity, the signal uses direclty the capped max capacity to reflect
the amount of cycle that is stolen by thermal mitigation.

> to be able to have a different HL for that thermal signal. That would
> 'break' the nice maths properties we have, yes, but is it a problem or is
> it actually helpful to cope with the thermal characteristics of
> different platforms ?

If you don't use the sign kind of signal with the same responsiveness,
you will start to see some OPP toggles as an example when the thermal
state change because one metrics will change faster than the other one
and you can't have a correct view of the system. Same problem was
happening with rt task.

I take the example of RT task because it quite similar in the effect
except that RT task steal all cycles when it runs whereas thermal
mitigation steal only some by capping the frequency
>
> Thanks,
> Quentin

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-10 10:14           ` Vincent Guittot
@ 2018-10-10 10:36             ` Quentin Perret
  2018-10-10 12:04               ` Vincent Guittot
  0 siblings, 1 reply; 67+ messages in thread
From: Quentin Perret @ 2018-10-10 10:36 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Thara Gopinath, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Zhang Rui, gregkh, Rafael J. Wysocki,
	Amit Kachhap, viresh kumar, Javi Merino, Eduardo Valentin,
	Daniel Lezcano, open list:THERMAL, Ionela Voinescu

On Wednesday 10 Oct 2018 at 12:14:32 (+0200), Vincent Guittot wrote:
> On Wed, 10 Oct 2018 at 11:55, Quentin Perret <quentin.perret@arm.com> wrote:
> >
> > Hi Vincent,
> >
> > On Wednesday 10 Oct 2018 at 10:50:05 (+0200), Vincent Guittot wrote:
> > > The problem with reflecting directly the capping is that it happens
> > > far more often than the pace at which cpu_capacity_orig is updated in
> > > the scheduler.
> >
> > Hmm, how can you be so sure ? That most likely depends on the workload,
> > the platform and the thermal governor. Some platforms heat up slowly,
> > some quickly. The pace at which the thermal governor will change things
> > should depend on that I assume.
> >
> > > This means that at the moment when scheduler uses the
> > > value, it might not be correct anymore.
> >
> > And OTOH, when you remove a cap for example, it will take time before
> > the scheduler can see the newly available capacity if you need to wait
> > for the signal to decay. So you are using a wrong information too in
> > that scenario.
> 
> But we stay consistant with all other metrics. The same happen when a
> task decide to stay idle for a long time after some activity... You
> will have to wait for the signal to decay

Because you see that as a task going idle. You could also say that
removing the cap from a CPU is equivalent to migrating a task off that
CPU. In this case you should see the newly available cap pretty much
immediately.

Also, I feel like the whole thermal capping story could interest the
Intel folks who, IIRC, were discussing how to represent their 'boosted'
OPPs in the scheduler some time ago. You can see the Intel boost thing
as a cap I think -- some OPPs can be inaccessible in some case. I'd be
interested to see what's their take on this.

> > > Then, this value are also used
> > > when building the sched_domain and setting max_cpu_capacity which
> > > would also implies the rebuilt the sched_domain topology ...
> >
> > Wait what ? I thought the thermal cap was reflected in capacity_of, not
> > capacity_orig_of ... You need to rebuild the sched_domain in case of
> > thermal pressure ?

I can't locate where you need to do this in the series. Do you actually
need to rebuild the sd hierarchy ?

> > Hmm, let me have a closer look at the patches, I must have missed
> > something ...
> >
> > > The pace of changing the capping is to fast to reflect that in the
> > > whole scheduler topology
> >
> > That's probably true in some cases, but it'd be cool to have numbers to
> > back up that statement, I think.
> >
> > Now, if you do need to rebuild the sched domain topology every time you
> > update the thermal pressure, I think the PELT HL is _way_ too short for
> > that ... You can't rebuild the whole thing every 32ms or so. Or am I
> > misunderstanding something ?
> >
> > > > Thara, have you tried to experiment with a simpler implementation as
> > > > suggested by Ingo ?
> > > >
> > > > Also, assuming that we do want to average things, do we actually want to
> > > > tie the thermal ramp-up time to the PELT half life ? That provides
> > > > nice maths properties wrt the other signals, but it's not obvious to me
> > > > that this thermal 'constant' should be the same on all platforms. Or
> > > > maybe it should ?
> > >
> > > The main interest of using PELT signal is that thermal pressure will
> > > evolve at the same pace as other signals used in the scheduler.
> >
> > Right, I think this is a nice property too (assuming that we actually
> > want to average things out).
> >
> > > With
> > > thermal pressure, we have the exact same problem as with RT tasks. The
> > > thermal will cap the max frequency which will cap the utilization of
> > > the tasks running on the CPU
> >
> > Well, the nature of the signal is slightly different IMO. Yes it's
> > capacity, but you're no actually measuring time spent on the CPU. All
> > other PELT signals are based on time, this thermal thing isn't, so it is
> > kinda different in a way. And I'm still wondering if it could be helpful
> 
> hmmm ... it is based on time too.

You're not actually measuring the time spent on the CPU by the 'thermal
task'. There is no such thing as a 'thermal task'. You're just trying to
model things like that, but the thermal stuff isn't actually
interrupting running tasks to eat CPU cycles. It just makes thing run
slower, which isn't exactly the same thing IMO.

But maybe that's a detail.

> Both signals (current ones and thermal one) are really close. The main
> difference with other utilization signal is that instead of providing
> a running/not running boolean that is then weighted by the current
> capacity, the signal uses direclty the capped max capacity to reflect
> the amount of cycle that is stolen by thermal mitigation.
> 
> > to be able to have a different HL for that thermal signal. That would
> > 'break' the nice maths properties we have, yes, but is it a problem or is
> > it actually helpful to cope with the thermal characteristics of
> > different platforms ?
> 
> If you don't use the sign kind of signal with the same responsiveness,
> you will start to see some OPP toggles as an example when the thermal
> state change because one metrics will change faster than the other one
> and you can't have a correct view of the system. Same problem was
> happening with rt task.

Well, that wasn't the problem with rt tasks. The problem with RT tasks
was that the time they spend on the CPU wasn't accounted _at all_ when
selecting frequency for CFS, not that the accounting was at a different
pace ...

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-10 10:36             ` Quentin Perret
@ 2018-10-10 12:04               ` Vincent Guittot
  2018-10-10 12:23                 ` Juri Lelli
  2018-10-10 13:05                 ` Quentin Perret
  0 siblings, 2 replies; 67+ messages in thread
From: Vincent Guittot @ 2018-10-10 12:04 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Ingo Molnar, Thara Gopinath, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Zhang Rui, gregkh, Rafael J. Wysocki,
	Amit Kachhap, viresh kumar, Javi Merino, Eduardo Valentin,
	Daniel Lezcano, open list:THERMAL, Ionela Voinescu

On Wed, 10 Oct 2018 at 12:36, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Wednesday 10 Oct 2018 at 12:14:32 (+0200), Vincent Guittot wrote:
> > On Wed, 10 Oct 2018 at 11:55, Quentin Perret <quentin.perret@arm.com> wrote:
> > >
> > > Hi Vincent,
> > >
> > > On Wednesday 10 Oct 2018 at 10:50:05 (+0200), Vincent Guittot wrote:
> > > > The problem with reflecting directly the capping is that it happens
> > > > far more often than the pace at which cpu_capacity_orig is updated in
> > > > the scheduler.
> > >
> > > Hmm, how can you be so sure ? That most likely depends on the workload,
> > > the platform and the thermal governor. Some platforms heat up slowly,
> > > some quickly. The pace at which the thermal governor will change things
> > > should depend on that I assume.
> > >
> > > > This means that at the moment when scheduler uses the
> > > > value, it might not be correct anymore.
> > >
> > > And OTOH, when you remove a cap for example, it will take time before
> > > the scheduler can see the newly available capacity if you need to wait
> > > for the signal to decay. So you are using a wrong information too in
> > > that scenario.
> >
> > But we stay consistant with all other metrics. The same happen when a
> > task decide to stay idle for a long time after some activity... You
> > will have to wait for the signal to decay
>
> Because you see that as a task going idle. You could also say that
> removing the cap from a CPU is equivalent to migrating a task off that
> CPU. In this case you should see the newly available cap pretty much
> immediately.
>
> Also, I feel like the whole thermal capping story could interest the
> Intel folks who, IIRC, were discussing how to represent their 'boosted'
> OPPs in the scheduler some time ago. You can see the Intel boost thing
> as a cap I think -- some OPPs can be inaccessible in some case. I'd be
> interested to see what's their take on this.
>
> > > > Then, this value are also used
> > > > when building the sched_domain and setting max_cpu_capacity which
> > > > would also implies the rebuilt the sched_domain topology ...
> > >
> > > Wait what ? I thought the thermal cap was reflected in capacity_of, not
> > > capacity_orig_of ... You need to rebuild the sched_domain in case of
> > > thermal pressure ?
>
> I can't locate where you need to do this in the series. Do you actually
> need to rebuild the sd hierarchy ?

This patchset doesn't touch cpu_capacity_orig and doesn't need to  as
it assume that the max capacity is unchanged but some capacity is
momentary stolen by thermal.
 If you want to reflect immediately all thermal capping change, you
have to update this field and all related fields and struct around

>
> > > Hmm, let me have a closer look at the patches, I must have missed
> > > something ...
> > >
> > > > The pace of changing the capping is to fast to reflect that in the
> > > > whole scheduler topology
> > >
> > > That's probably true in some cases, but it'd be cool to have numbers to
> > > back up that statement, I think.
> > >
> > > Now, if you do need to rebuild the sched domain topology every time you
> > > update the thermal pressure, I think the PELT HL is _way_ too short for
> > > that ... You can't rebuild the whole thing every 32ms or so. Or am I
> > > misunderstanding something ?
> > >
> > > > > Thara, have you tried to experiment with a simpler implementation as
> > > > > suggested by Ingo ?
> > > > >
> > > > > Also, assuming that we do want to average things, do we actually want to
> > > > > tie the thermal ramp-up time to the PELT half life ? That provides
> > > > > nice maths properties wrt the other signals, but it's not obvious to me
> > > > > that this thermal 'constant' should be the same on all platforms. Or
> > > > > maybe it should ?
> > > >
> > > > The main interest of using PELT signal is that thermal pressure will
> > > > evolve at the same pace as other signals used in the scheduler.
> > >
> > > Right, I think this is a nice property too (assuming that we actually
> > > want to average things out).
> > >
> > > > With
> > > > thermal pressure, we have the exact same problem as with RT tasks. The
> > > > thermal will cap the max frequency which will cap the utilization of
> > > > the tasks running on the CPU
> > >
> > > Well, the nature of the signal is slightly different IMO. Yes it's
> > > capacity, but you're no actually measuring time spent on the CPU. All
> > > other PELT signals are based on time, this thermal thing isn't, so it is
> > > kinda different in a way. And I'm still wondering if it could be helpful
> >
> > hmmm ... it is based on time too.
>
> You're not actually measuring the time spent on the CPU by the 'thermal
> task'. There is no such thing as a 'thermal task'. You're just trying to
> model things like that, but the thermal stuff isn't actually
> interrupting running tasks to eat CPU cycles. It just makes thing run
> slower, which isn't exactly the same thing IMO.
>
> But maybe that's a detail.
>
> > Both signals (current ones and thermal one) are really close. The main
> > difference with other utilization signal is that instead of providing
> > a running/not running boolean that is then weighted by the current
> > capacity, the signal uses direclty the capped max capacity to reflect
> > the amount of cycle that is stolen by thermal mitigation.
> >
> > > to be able to have a different HL for that thermal signal. That would
> > > 'break' the nice maths properties we have, yes, but is it a problem or is
> > > it actually helpful to cope with the thermal characteristics of
> > > different platforms ?
> >
> > If you don't use the sign kind of signal with the same responsiveness,
> > you will start to see some OPP toggles as an example when the thermal
> > state change because one metrics will change faster than the other one
> > and you can't have a correct view of the system. Same problem was
> > happening with rt task.
>
> Well, that wasn't the problem with rt tasks. The problem with RT tasks
> was that the time they spend on the CPU wasn't accounted _at all_ when
> selecting frequency for CFS, not that the accounting was at a different
> pace ...

The problem was the same with RT, the cfs utilization was lower than
reality because RT steals soem cycle to CFS
So schedutil was selecting a lower frequency when cfs was running
whereas the CPU was fully used.
The same can happen with thermal:
cap the max freq because of thermal
the utilization with decrease.
remove the cap
the utilization is still low and you will select a low OPP because you
don't take into account cycle stolen by thermal like with RT

Regards,
Vincent
>
> Thanks,
> Quentin

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-10 12:04               ` Vincent Guittot
@ 2018-10-10 12:23                 ` Juri Lelli
  2018-10-10 12:34                   ` Vincent Guittot
  2018-10-10 13:05                 ` Quentin Perret
  1 sibling, 1 reply; 67+ messages in thread
From: Juri Lelli @ 2018-10-10 12:23 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Quentin Perret, Ingo Molnar, Thara Gopinath, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Zhang Rui, gregkh,
	Rafael J. Wysocki, Amit Kachhap, viresh kumar, Javi Merino,
	Eduardo Valentin, Daniel Lezcano, open list:THERMAL,
	Ionela Voinescu

On 10/10/18 14:04, Vincent Guittot wrote:

[...]

> The problem was the same with RT, the cfs utilization was lower than
> reality because RT steals soem cycle to CFS
> So schedutil was selecting a lower frequency when cfs was running
> whereas the CPU was fully used.
> The same can happen with thermal:
> cap the max freq because of thermal
> the utilization with decrease.
> remove the cap
> the utilization is still low and you will select a low OPP because you
> don't take into account cycle stolen by thermal like with RT

What if we scale frequency component considering the capped temporary
max?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-10 12:23                 ` Juri Lelli
@ 2018-10-10 12:34                   ` Vincent Guittot
  2018-10-10 12:50                     ` Juri Lelli
  0 siblings, 1 reply; 67+ messages in thread
From: Vincent Guittot @ 2018-10-10 12:34 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Quentin Perret, Ingo Molnar, Thara Gopinath, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Zhang Rui, gregkh,
	Rafael J. Wysocki, Amit Kachhap, viresh kumar, Javi Merino,
	Eduardo Valentin, Daniel Lezcano, open list:THERMAL,
	Ionela Voinescu

Hi  Juri,

On Wed, 10 Oct 2018 at 14:23, Juri Lelli <juri.lelli@gmail.com> wrote:
>
> On 10/10/18 14:04, Vincent Guittot wrote:
>
> [...]
>
> > The problem was the same with RT, the cfs utilization was lower than
> > reality because RT steals soem cycle to CFS
> > So schedutil was selecting a lower frequency when cfs was running
> > whereas the CPU was fully used.
> > The same can happen with thermal:
> > cap the max freq because of thermal
> > the utilization with decrease.
> > remove the cap
> > the utilization is still low and you will select a low OPP because you
> > don't take into account cycle stolen by thermal like with RT
>
> What if we scale frequency component considering the capped temporary
> max?

Do you mean using a kind of scale_thermal_capacity in accumulate_sum
when computing utilization ?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-10 12:34                   ` Vincent Guittot
@ 2018-10-10 12:50                     ` Juri Lelli
  2018-10-10 13:08                       ` Vincent Guittot
  2018-10-10 13:11                       ` Quentin Perret
  0 siblings, 2 replies; 67+ messages in thread
From: Juri Lelli @ 2018-10-10 12:50 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Quentin Perret, Ingo Molnar, Thara Gopinath, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Zhang Rui, gregkh,
	Rafael J. Wysocki, Amit Kachhap, viresh kumar, Javi Merino,
	Eduardo Valentin, Daniel Lezcano, open list:THERMAL,
	Ionela Voinescu

On 10/10/18 14:34, Vincent Guittot wrote:
> Hi  Juri,
> 
> On Wed, 10 Oct 2018 at 14:23, Juri Lelli <juri.lelli@gmail.com> wrote:
> >
> > On 10/10/18 14:04, Vincent Guittot wrote:
> >
> > [...]
> >
> > > The problem was the same with RT, the cfs utilization was lower than
> > > reality because RT steals soem cycle to CFS
> > > So schedutil was selecting a lower frequency when cfs was running
> > > whereas the CPU was fully used.
> > > The same can happen with thermal:
> > > cap the max freq because of thermal
> > > the utilization with decrease.
> > > remove the cap
> > > the utilization is still low and you will select a low OPP because you
> > > don't take into account cycle stolen by thermal like with RT
> >
> > What if we scale frequency component considering the capped temporary
> > max?
> 
> Do you mean using a kind of scale_thermal_capacity in accumulate_sum
> when computing utilization ?

Yeah, something like that I guess. So that we account for temporary
"fake" 1024..

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-10 12:04               ` Vincent Guittot
  2018-10-10 12:23                 ` Juri Lelli
@ 2018-10-10 13:05                 ` Quentin Perret
  2018-10-10 13:27                   ` Vincent Guittot
  1 sibling, 1 reply; 67+ messages in thread
From: Quentin Perret @ 2018-10-10 13:05 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Thara Gopinath, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Zhang Rui, gregkh, Rafael J. Wysocki,
	Amit Kachhap, viresh kumar, Javi Merino, Eduardo Valentin,
	Daniel Lezcano, open list:THERMAL, Ionela Voinescu

On Wednesday 10 Oct 2018 at 14:04:40 (+0200), Vincent Guittot wrote:
> This patchset doesn't touch cpu_capacity_orig and doesn't need to  as
> it assume that the max capacity is unchanged but some capacity is
> momentary stolen by thermal.
>  If you want to reflect immediately all thermal capping change, you
> have to update this field and all related fields and struct around

I don't follow you here. I never said I wanted to change
cpu_capacity_orig. I don't think we should do that actually. Changing
capacity_of (which is updated during LB	IIRC) is just fine. The question
is about what you want to do there: reflect an averaged value or the
instantaneous one.

It's not obvious (to me) that the complex one (the averaged value) is
better than the other, simpler, one. All I'm saying from the beginning
is that it'd be nice to have numbers and use cases to discuss the pros
and cons of both approaches.

> > > > Hmm, let me have a closer look at the patches, I must have missed
> > > > something ...
> > > >
> > > > > The pace of changing the capping is to fast to reflect that in the
> > > > > whole scheduler topology
> > > >
> > > > That's probably true in some cases, but it'd be cool to have numbers to
> > > > back up that statement, I think.
> > > >
> > > > Now, if you do need to rebuild the sched domain topology every time you
> > > > update the thermal pressure, I think the PELT HL is _way_ too short for
> > > > that ... You can't rebuild the whole thing every 32ms or so. Or am I
> > > > misunderstanding something ?
> > > >
> > > > > > Thara, have you tried to experiment with a simpler implementation as
> > > > > > suggested by Ingo ?
> > > > > >
> > > > > > Also, assuming that we do want to average things, do we actually want to
> > > > > > tie the thermal ramp-up time to the PELT half life ? That provides
> > > > > > nice maths properties wrt the other signals, but it's not obvious to me
> > > > > > that this thermal 'constant' should be the same on all platforms. Or
> > > > > > maybe it should ?
> > > > >
> > > > > The main interest of using PELT signal is that thermal pressure will
> > > > > evolve at the same pace as other signals used in the scheduler.
> > > >
> > > > Right, I think this is a nice property too (assuming that we actually
> > > > want to average things out).
> > > >
> > > > > With
> > > > > thermal pressure, we have the exact same problem as with RT tasks. The
> > > > > thermal will cap the max frequency which will cap the utilization of
> > > > > the tasks running on the CPU
> > > >
> > > > Well, the nature of the signal is slightly different IMO. Yes it's
> > > > capacity, but you're no actually measuring time spent on the CPU. All
> > > > other PELT signals are based on time, this thermal thing isn't, so it is
> > > > kinda different in a way. And I'm still wondering if it could be helpful
> > >
> > > hmmm ... it is based on time too.
> >
> > You're not actually measuring the time spent on the CPU by the 'thermal
> > task'. There is no such thing as a 'thermal task'. You're just trying to
> > model things like that, but the thermal stuff isn't actually
> > interrupting running tasks to eat CPU cycles. It just makes thing run
> > slower, which isn't exactly the same thing IMO.
> >
> > But maybe that's a detail.
> >
> > > Both signals (current ones and thermal one) are really close. The main
> > > difference with other utilization signal is that instead of providing
> > > a running/not running boolean that is then weighted by the current
> > > capacity, the signal uses direclty the capped max capacity to reflect
> > > the amount of cycle that is stolen by thermal mitigation.
> > >
> > > > to be able to have a different HL for that thermal signal. That would
> > > > 'break' the nice maths properties we have, yes, but is it a problem or is
> > > > it actually helpful to cope with the thermal characteristics of
> > > > different platforms ?
> > >
> > > If you don't use the sign kind of signal with the same responsiveness,
> > > you will start to see some OPP toggles as an example when the thermal
> > > state change because one metrics will change faster than the other one
> > > and you can't have a correct view of the system. Same problem was
> > > happening with rt task.
> >
> > Well, that wasn't the problem with rt tasks. The problem with RT tasks
> > was that the time they spend on the CPU wasn't accounted _at all_ when
> > selecting frequency for CFS, not that the accounting was at a different
> > pace ...
> 
> The problem was the same with RT, the cfs utilization was lower than
> reality because RT steals soem cycle to CFS
> So schedutil was selecting a lower frequency when cfs was running
> whereas the CPU was fully used.
> The same can happen with thermal:
> cap the max freq because of thermal
> the utilization with decrease.
> remove the cap
> the utilization is still low and you will select a low OPP because you
> don't take into account cycle stolen by thermal like with RT

I'm not arguing with the fact that we need to reflect the thermal
pressure in the scheduler to see that a CPU is fully busy. I agree with
that.

I'm saying you don't necessarily have to update the thermal stuff and
the existing PELT signals *at the same pace*, because different
platforms have different thermal characteristics.

Thanks,*
Quentin

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-10 12:50                     ` Juri Lelli
@ 2018-10-10 13:08                       ` Vincent Guittot
  2018-10-10 13:34                         ` Juri Lelli
  2018-10-10 13:11                       ` Quentin Perret
  1 sibling, 1 reply; 67+ messages in thread
From: Vincent Guittot @ 2018-10-10 13:08 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Quentin Perret, Ingo Molnar, Thara Gopinath, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Zhang Rui, gregkh,
	Rafael J. Wysocki, Amit Kachhap, viresh kumar, Javi Merino,
	Eduardo Valentin, Daniel Lezcano, open list:THERMAL,
	Ionela Voinescu

On Wed, 10 Oct 2018 at 14:50, Juri Lelli <juri.lelli@gmail.com> wrote:
>
> On 10/10/18 14:34, Vincent Guittot wrote:
> > Hi  Juri,
> >
> > On Wed, 10 Oct 2018 at 14:23, Juri Lelli <juri.lelli@gmail.com> wrote:
> > >
> > > On 10/10/18 14:04, Vincent Guittot wrote:
> > >
> > > [...]
> > >
> > > > The problem was the same with RT, the cfs utilization was lower than
> > > > reality because RT steals soem cycle to CFS
> > > > So schedutil was selecting a lower frequency when cfs was running
> > > > whereas the CPU was fully used.
> > > > The same can happen with thermal:
> > > > cap the max freq because of thermal
> > > > the utilization with decrease.
> > > > remove the cap
> > > > the utilization is still low and you will select a low OPP because you
> > > > don't take into account cycle stolen by thermal like with RT
> > >
> > > What if we scale frequency component considering the capped temporary
> > > max?
> >
> > Do you mean using a kind of scale_thermal_capacity in accumulate_sum
> > when computing utilization ?
>
> Yeah, something like that I guess. So that we account for temporary
> "fake" 1024..

But the utilization will not be invariant anymore across the system

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-10 12:50                     ` Juri Lelli
  2018-10-10 13:08                       ` Vincent Guittot
@ 2018-10-10 13:11                       ` Quentin Perret
  1 sibling, 0 replies; 67+ messages in thread
From: Quentin Perret @ 2018-10-10 13:11 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Vincent Guittot, Ingo Molnar, Thara Gopinath, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Zhang Rui, gregkh,
	Rafael J. Wysocki, Amit Kachhap, viresh kumar, Javi Merino,
	Eduardo Valentin, Daniel Lezcano, open list:THERMAL,
	Ionela Voinescu

On Wednesday 10 Oct 2018 at 14:50:33 (+0200), Juri Lelli wrote:
> On 10/10/18 14:34, Vincent Guittot wrote:
> > Hi  Juri,
> > 
> > On Wed, 10 Oct 2018 at 14:23, Juri Lelli <juri.lelli@gmail.com> wrote:
> > >
> > > On 10/10/18 14:04, Vincent Guittot wrote:
> > >
> > > [...]
> > >
> > > > The problem was the same with RT, the cfs utilization was lower than
> > > > reality because RT steals soem cycle to CFS
> > > > So schedutil was selecting a lower frequency when cfs was running
> > > > whereas the CPU was fully used.
> > > > The same can happen with thermal:
> > > > cap the max freq because of thermal
> > > > the utilization with decrease.
> > > > remove the cap
> > > > the utilization is still low and you will select a low OPP because you
> > > > don't take into account cycle stolen by thermal like with RT
> > >
> > > What if we scale frequency component considering the capped temporary
> > > max?
> > 
> > Do you mean using a kind of scale_thermal_capacity in accumulate_sum
> > when computing utilization ?
> 
> Yeah, something like that I guess. So that we account for temporary
> "fake" 1024..

But wouldn't that break frequency invariance ? A task would look bigger
on a capped CPU than a non-capped one no ?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-10 13:05                 ` Quentin Perret
@ 2018-10-10 13:27                   ` Vincent Guittot
  2018-10-10 13:47                     ` Quentin Perret
  0 siblings, 1 reply; 67+ messages in thread
From: Vincent Guittot @ 2018-10-10 13:27 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Ingo Molnar, Thara Gopinath, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Zhang Rui, gregkh, Rafael J. Wysocki,
	Amit Kachhap, viresh kumar, Javi Merino, Eduardo Valentin,
	Daniel Lezcano, open list:THERMAL, Ionela Voinescu

On Wed, 10 Oct 2018 at 15:05, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Wednesday 10 Oct 2018 at 14:04:40 (+0200), Vincent Guittot wrote:
> > This patchset doesn't touch cpu_capacity_orig and doesn't need to  as
> > it assume that the max capacity is unchanged but some capacity is
> > momentary stolen by thermal.
> >  If you want to reflect immediately all thermal capping change, you
> > have to update this field and all related fields and struct around
>
> I don't follow you here. I never said I wanted to change
> cpu_capacity_orig. I don't think we should do that actually. Changing
> capacity_of (which is updated during LB IIRC) is just fine. The question
> is about what you want to do there: reflect an averaged value or the
> instantaneous one.

Sorry I though your were speaking about updating this cpu_capacity_orig.
With using instantaneous max  value in capacity_of(), we are back to
the problem raised by Thara that  the value will most probably not
reflect the current capping value when it is used in LB, because LB
period can quite long on busy CPU (default max value is 32*sd_weight
ms)

>
> It's not obvious (to me) that the complex one (the averaged value) is
> better than the other, simpler, one. All I'm saying from the beginning
> is that it'd be nice to have numbers and use cases to discuss the pros
> and cons of both approaches.
>
> > > > > Hmm, let me have a closer look at the patches, I must have missed
> > > > > something ...
> > > > >
> > > > > > The pace of changing the capping is to fast to reflect that in the
> > > > > > whole scheduler topology
> > > > >

[snip]

> > >
> > > Well, that wasn't the problem with rt tasks. The problem with RT tasks
> > > was that the time they spend on the CPU wasn't accounted _at all_ when
> > > selecting frequency for CFS, not that the accounting was at a different
> > > pace ...
> >
> > The problem was the same with RT, the cfs utilization was lower than
> > reality because RT steals soem cycle to CFS
> > So schedutil was selecting a lower frequency when cfs was running
> > whereas the CPU was fully used.
> > The same can happen with thermal:
> > cap the max freq because of thermal
> > the utilization with decrease.
> > remove the cap
> > the utilization is still low and you will select a low OPP because you
> > don't take into account cycle stolen by thermal like with RT
>
> I'm not arguing with the fact that we need to reflect the thermal
> pressure in the scheduler to see that a CPU is fully busy. I agree with
> that.
>
> I'm saying you don't necessarily have to update the thermal stuff and
> the existing PELT signals *at the same pace*, because different
> platforms have different thermal characteristics.

But you also need to take into account how fast other metrics in the
scheduler are updated otherwise a metric will reflect a change not
already reflected in the other metrics and you might take wrong
decision as my example above where utilization is still low but
thermal pressure is nul and you assume that you have lot of spare
capacity
Having metrics that use same responsiveness and are synced, help to
get a consolidated view of the system.

Vincent
>
> Thanks,*
> Quentin

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-10 13:08                       ` Vincent Guittot
@ 2018-10-10 13:34                         ` Juri Lelli
  2018-10-10 13:38                           ` Vincent Guittot
  2018-10-10 17:08                           ` Thara Gopinath
  0 siblings, 2 replies; 67+ messages in thread
From: Juri Lelli @ 2018-10-10 13:34 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Quentin Perret, Ingo Molnar, Thara Gopinath, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Zhang Rui, gregkh,
	Rafael J. Wysocki, Amit Kachhap, viresh kumar, Javi Merino,
	Eduardo Valentin, Daniel Lezcano, open list:THERMAL,
	Ionela Voinescu

On 10/10/18 15:08, Vincent Guittot wrote:
> On Wed, 10 Oct 2018 at 14:50, Juri Lelli <juri.lelli@gmail.com> wrote:
> >
> > On 10/10/18 14:34, Vincent Guittot wrote:
> > > Hi  Juri,
> > >
> > > On Wed, 10 Oct 2018 at 14:23, Juri Lelli <juri.lelli@gmail.com> wrote:
> > > >
> > > > On 10/10/18 14:04, Vincent Guittot wrote:
> > > >
> > > > [...]
> > > >
> > > > > The problem was the same with RT, the cfs utilization was lower than
> > > > > reality because RT steals soem cycle to CFS
> > > > > So schedutil was selecting a lower frequency when cfs was running
> > > > > whereas the CPU was fully used.
> > > > > The same can happen with thermal:
> > > > > cap the max freq because of thermal
> > > > > the utilization with decrease.
> > > > > remove the cap
> > > > > the utilization is still low and you will select a low OPP because you
> > > > > don't take into account cycle stolen by thermal like with RT
> > > >
> > > > What if we scale frequency component considering the capped temporary
> > > > max?
> > >
> > > Do you mean using a kind of scale_thermal_capacity in accumulate_sum
> > > when computing utilization ?
> >
> > Yeah, something like that I guess. So that we account for temporary
> > "fake" 1024..
> 
> But the utilization will not be invariant anymore across the system

Mmm, I guess I might be wrong, but I was thinking we should be able to
deal with this similarly to what we do with cpus with different max
capacities. So, another factor?  Because then, how do we handle other
ways in which max freq can be restricted (e.g. from userspace as Javi
was also mentioning)?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-10 13:34                         ` Juri Lelli
@ 2018-10-10 13:38                           ` Vincent Guittot
  2018-10-10 17:08                           ` Thara Gopinath
  1 sibling, 0 replies; 67+ messages in thread
From: Vincent Guittot @ 2018-10-10 13:38 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Quentin Perret, Ingo Molnar, Thara Gopinath, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Zhang Rui, gregkh,
	Rafael J. Wysocki, Amit Kachhap, viresh kumar, Javi Merino,
	Eduardo Valentin, Daniel Lezcano, open list:THERMAL,
	Ionela Voinescu

On Wed, 10 Oct 2018 at 15:35, Juri Lelli <juri.lelli@gmail.com> wrote:
>
> On 10/10/18 15:08, Vincent Guittot wrote:
> > On Wed, 10 Oct 2018 at 14:50, Juri Lelli <juri.lelli@gmail.com> wrote:
> > >
> > > On 10/10/18 14:34, Vincent Guittot wrote:
> > > > Hi  Juri,
> > > >
> > > > On Wed, 10 Oct 2018 at 14:23, Juri Lelli <juri.lelli@gmail.com> wrote:
> > > > >
> > > > > On 10/10/18 14:04, Vincent Guittot wrote:
> > > > >
> > > > > [...]
> > > > >
> > > > > > The problem was the same with RT, the cfs utilization was lower than
> > > > > > reality because RT steals soem cycle to CFS
> > > > > > So schedutil was selecting a lower frequency when cfs was running
> > > > > > whereas the CPU was fully used.
> > > > > > The same can happen with thermal:
> > > > > > cap the max freq because of thermal
> > > > > > the utilization with decrease.
> > > > > > remove the cap
> > > > > > the utilization is still low and you will select a low OPP because you
> > > > > > don't take into account cycle stolen by thermal like with RT
> > > > >
> > > > > What if we scale frequency component considering the capped temporary
> > > > > max?
> > > >
> > > > Do you mean using a kind of scale_thermal_capacity in accumulate_sum
> > > > when computing utilization ?
> > >
> > > Yeah, something like that I guess. So that we account for temporary
> > > "fake" 1024..
> >
> > But the utilization will not be invariant anymore across the system
>
> Mmm, I guess I might be wrong, but I was thinking we should be able to
> deal with this similarly to what we do with cpus with different max
> capacities. So, another factor?  Because then, how do we handle other
> ways in which max freq can be restricted (e.g. from userspace as Javi
> was also mentioning)?

IMHO, userspace capping is a different story because it is not
expected to happen so often but it should stay for a while and in this
case, a solution is probably to rebuild the sched_domain and update
all the cpu_capacity struct and fields

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-10 13:27                   ` Vincent Guittot
@ 2018-10-10 13:47                     ` Quentin Perret
  2018-10-10 15:19                       ` Vincent Guittot
  2018-10-10 16:15                       ` Ionela Voinescu
  0 siblings, 2 replies; 67+ messages in thread
From: Quentin Perret @ 2018-10-10 13:47 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Thara Gopinath, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Zhang Rui, gregkh, Rafael J. Wysocki,
	Amit Kachhap, viresh kumar, Javi Merino, Eduardo Valentin,
	Daniel Lezcano, open list:THERMAL, Ionela Voinescu

On Wednesday 10 Oct 2018 at 15:27:57 (+0200), Vincent Guittot wrote:
> On Wed, 10 Oct 2018 at 15:05, Quentin Perret <quentin.perret@arm.com> wrote:
> >
> > On Wednesday 10 Oct 2018 at 14:04:40 (+0200), Vincent Guittot wrote:
> > > This patchset doesn't touch cpu_capacity_orig and doesn't need to  as
> > > it assume that the max capacity is unchanged but some capacity is
> > > momentary stolen by thermal.
> > >  If you want to reflect immediately all thermal capping change, you
> > > have to update this field and all related fields and struct around
> >
> > I don't follow you here. I never said I wanted to change
> > cpu_capacity_orig. I don't think we should do that actually. Changing
> > capacity_of (which is updated during LB IIRC) is just fine. The question
> > is about what you want to do there: reflect an averaged value or the
> > instantaneous one.
> 
> Sorry I though your were speaking about updating this cpu_capacity_orig.

N/p, communication via email can easily become confusing :-)

> With using instantaneous max  value in capacity_of(), we are back to
> the problem raised by Thara that  the value will most probably not
> reflect the current capping value when it is used in LB, because LB
> period can quite long on busy CPU (default max value is 32*sd_weight
> ms)

But averaging the capping value over time doesn't make LB happen more
often ... That will help you account for capping that happened in the
past, but it's not obvious this is actually a good thing. Probably not
all the time anyway.

Say a CPU was capped at 50% of it's capacity, then the cap is removed.
At that point it'll take 100ms+ for the thermal signal to decay and let
the scheduler know about the newly available capacity. That can probably
be a performance hit in some use cases ... And the other way around, it
can also take forever for the scheduler to notice that a CPU has a
reduced capacity before reacting to it.

If you want to filter out very short transient capping events to avoid
over-reacting in the scheduler (is this actually happening ?), then
maybe the average should be done on the cooling device side or something
like that ?

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-10  5:44   ` [RFC PATCH 0/7] Introduce thermal pressure Javi Merino
@ 2018-10-10 14:15     ` Thara Gopinath
  0 siblings, 0 replies; 67+ messages in thread
From: Thara Gopinath @ 2018-10-10 14:15 UTC (permalink / raw)
  To: Javi Merino
  Cc: linux-kernel, mingo, peterz, rui.zhang, gregkh, rafael,
	amit.kachhap, viresh.kumar, edubezval, daniel.lezcano, linux-pm,
	quentin.perret, ionela.voinescu, vincent.guittot

Hello Javi,

Thanks for the interest.

On 10/10/2018 01:44 AM, Javi Merino wrote:
> On Tue, Oct 09, 2018 at 12:24:55PM -0400, Thara Gopinath wrote:
>> Thermal governors can respond to an overheat event for a cpu by
>> capping the cpu's maximum possible frequency. This in turn
>> means that the maximum available compute capacity of the
>> cpu is restricted. But today in linux kernel, in event of maximum
>> frequency capping of a cpu, the maximum available compute
>> capacity of the cpu is not adjusted at all. In other words, scheduler
>> is unware maximum cpu capacity restrictions placed due to thermal
>> activity.
> 
> Interesting, I would have sworn that I tested this years ago by
> lowering the maximum frequency of a cpufreq domain, and the scheduler
> reacted accordingly to the new maximum capacities of the cpus.
> 
>>           This patch series attempts to address this issue.
>> The benefits identified are better task placement among available
>> cpus in event of overheating which in turn leads to better
>> performance numbers.
>>
>> The delta between the maximum possible capacity of a cpu and
>> maximum available capacity of a cpu due to thermal event can
>> be considered as thermal pressure. Instantaneous thermal pressure
>> is hard to record and can sometime be erroneous as there can be mismatch
>> between the actual capping of capacity and scheduler recording it.
>> Thus solution is to have a weighted average per cpu value for thermal
>> pressure over time. The weight reflects the amount of time the cpu has
>> spent at a capped maximum frequency. To accumulate, average and
>> appropriately decay thermal pressure, this patch series uses pelt
>> signals and reuses the available framework that does a similar
>> bookkeeping of rt/dl task utilization.
>>
>> Regarding testing, basic build, boot and sanity testing have been
>> performed on hikey960 mainline kernel with debian file system.
>> Further aobench (An occlusion renderer for benchmarking realworld
>> floating point performance) showed the following results on hikey960
>> with debain.
>>
>>                                         Result          Standard        Standard
>>                                         (Time secs)     Error           Deviation
>> Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%
>> Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%
>>
>> Thara Gopinath (7):
>>   sched/pelt: Add option to make load and util calculations frequency
>>     invariant
>>   sched/pelt.c: Add support to track thermal pressure
>>   sched: Add infrastructure to store and update instantaneous thermal
>>     pressure
>>   sched: Initialize per cpu thermal pressure structure
>>   sched/fair: Enable CFS periodic tick to update thermal pressure
>>   sched/fair: update cpu_capcity to reflect thermal pressure
>>   thermal/cpu-cooling: Update thermal pressure in case of a maximum
>>     frequency capping
>>
>>  drivers/base/arch_topology.c  |  1 +
>>  drivers/thermal/cpu_cooling.c | 20 ++++++++++++-
> 
> thermal?  There are other ways in which the maximum frequency of a cpu
> can be limited, for example from userspace via scaling_max_freq.

Yes there are other ways in which maximum frequency of a cpu can be
capped. The difference probably is that in case of a user-space cap, the
time duration the cpu remains in the capped state is significantly
higher than capping due to a thermal event. So may be the response of
the scheduler should be different in that scenario (like rebuilding the
capacities of all cpus etc).

This patch series does not rebuild the scheduler structures. This just
tells the scheduler that some cpu capacity is stolen.
> 
> When something (anything) changes the maximum frequency of a cpufreq
> policy, the scheduler should be notified.  I think this change should
> be done in cpufreq instead to make it generic and not particular to
> a given maximum frequency "capper".

I am glad to hear you say this!  So far, all I have heard whenever I
bring up this topic is issues with such an update from cpufreq(delays,
lost updates etc). Personally, I have not seen these issues enough to
comment on them. I will really like to hear more about these issues for
an update from cpufreq here on the list.

For me, the patch series is a mechanism to let scheduler know that a
thermal event has stolen some cpu capacity. The update itself can happen
from any framework which can track the event and we all mutually agree on.

Regards
Thara

> 
> Cheers,
> Javi
> 


-- 
Regards
Thara

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 6/7] sched/fair: update cpu_capcity to reflect thermal pressure
  2018-10-10  5:57     ` Javi Merino
@ 2018-10-10 14:22       ` Thara Gopinath
  0 siblings, 0 replies; 67+ messages in thread
From: Thara Gopinath @ 2018-10-10 14:22 UTC (permalink / raw)
  To: Javi Merino
  Cc: linux-kernel, mingo, peterz, rui.zhang, gregkh, rafael,
	amit.kachhap, viresh.kumar, edubezval, daniel.lezcano, linux-pm,
	quentin.perret, ionela.voinescu, vincent.guittot

On 10/10/2018 01:57 AM, Javi Merino wrote:
> On Tue, Oct 09, 2018 at 12:25:01PM -0400, Thara Gopinath wrote:
>> cpu_capacity relflects the maximum available capacity of a cpu. Thermal
>> pressure on a cpu means this maximum available capacity is reduced. This
>> patch reduces the average thermal pressure for a cpu from its maximum
>> available capacity so that cpu_capacity reflects the actual
>> available capacity.
>>
>> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
>> ---
>>  kernel/sched/fair.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7deb1d0..8651e55 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7497,6 +7497,7 @@ static unsigned long scale_rt_capacity(int cpu)
>>  
>>  	used = READ_ONCE(rq->avg_rt.util_avg);
>>  	used += READ_ONCE(rq->avg_dl.util_avg);
>> +	used += READ_ONCE(rq->avg_thermal.load_avg);
> 
> IIUIC, you are treating thermal pressure as an artificial load on the
> cpu.  If so, this sounds like a hard to maintain hack.  Thermal
> pressure have different characteristics to utilization.  What happens
> if thermal sets the cpu cooling state back to 0 because there is
> thermal headroom again?  Do we keep adding this artificial load to the
> cpu just because there was thermal pressure in the past and let it
> decay as if it was cpu load?

Setting cpu cooling state back to 0 will decay the thermal pressure back
to 0 ? Yes, cpu will not register an instantaneous drop in the cap, it
will be decayed down following the PELT signals.

> 
> Cheers,
> Javi
> 


-- 
Regards
Thara

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-10 13:47                     ` Quentin Perret
@ 2018-10-10 15:19                       ` Vincent Guittot
  2018-10-10 16:15                       ` Ionela Voinescu
  1 sibling, 0 replies; 67+ messages in thread
From: Vincent Guittot @ 2018-10-10 15:19 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Ingo Molnar, Thara Gopinath, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Zhang Rui, gregkh, Rafael J. Wysocki,
	Amit Kachhap, viresh kumar, Javi Merino, Eduardo Valentin,
	Daniel Lezcano, open list:THERMAL, Ionela Voinescu

On Wed, 10 Oct 2018 at 15:48, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Wednesday 10 Oct 2018 at 15:27:57 (+0200), Vincent Guittot wrote:
> > On Wed, 10 Oct 2018 at 15:05, Quentin Perret <quentin.perret@arm.com> wrote:
> > >
> > > On Wednesday 10 Oct 2018 at 14:04:40 (+0200), Vincent Guittot wrote:
> > > > This patchset doesn't touch cpu_capacity_orig and doesn't need to  as
> > > > it assume that the max capacity is unchanged but some capacity is
> > > > momentary stolen by thermal.
> > > >  If you want to reflect immediately all thermal capping change, you
> > > > have to update this field and all related fields and struct around
> > >
> > > I don't follow you here. I never said I wanted to change
> > > cpu_capacity_orig. I don't think we should do that actually. Changing
> > > capacity_of (which is updated during LB IIRC) is just fine. The question
> > > is about what you want to do there: reflect an averaged value or the
> > > instantaneous one.
> >
> > Sorry I though your were speaking about updating this cpu_capacity_orig.
>
> N/p, communication via email can easily become confusing :-)
>
> > With using instantaneous max  value in capacity_of(), we are back to
> > the problem raised by Thara that  the value will most probably not
> > reflect the current capping value when it is used in LB, because LB
> > period can quite long on busy CPU (default max value is 32*sd_weight
> > ms)
>
> But averaging the capping value over time doesn't make LB happen more
> often ... That will help you account for capping that happened in the

But you know what happens in average between 2 LB

> past, but it's not obvious this is actually a good thing. Probably not
> all the time anyway.
>
> Say a CPU was capped at 50% of it's capacity, then the cap is removed.
> At that point it'll take 100ms+ for the thermal signal to decay and let
> the scheduler know about the newly available capacity. That can probably

But the point is that you don't know:
- if the capping will not happen soon. If the pressure has reached the
50%,  it means that it already happened quite often in the past 100ms.
- if there is really available capacity as the current sum of
utilization reflects what was available for tasks and not what the
tasks really wants to use


> be a performance hit in some use cases ... And the other way around, it
> can also take forever for the scheduler to notice that a CPU has a

What do you mean by forever ?

> reduced capacity before reacting to it.
>
> If you want to filter out very short transient capping events to avoid
> over-reacting in the scheduler (is this actually happening ?), then
> maybe the average should be done on the cooling device side or something
> like that ?
>
> Thanks,
> Quentin

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-09 16:24 ` [RFC PATCH 0/7] Introduce thermal pressure Thara Gopinath
                     ` (8 preceding siblings ...)
  2018-10-10  6:17   ` Ingo Molnar
@ 2018-10-10 15:35   ` Lukasz Luba
  2018-10-10 16:54     ` Daniel Lezcano
  2018-10-10 17:30     ` Thara Gopinath
  9 siblings, 2 replies; 67+ messages in thread
From: Lukasz Luba @ 2018-10-10 15:35 UTC (permalink / raw)
  To: Thara Gopinath, linux-kernel
  Cc: mingo, peterz, rui.zhang, gregkh, rafael, amit.kachhap,
	viresh.kumar, javi.merino, edubezval, daniel.lezcano, linux-pm,
	quentin.perret, ionela.voinescu, vincent.guittot,
	Bartlomiej Zolnierkiewicz

Hi Thara,

I have run it on Exynos5433 mainline.
When it is enabled with step_wise thermal governor,
some of my tests are showing ~30-50% regression (i.e. hackbench),
dhrystone ~10%.

Could you tell me which thermal governor was used in your case?
Please also share the name of that benchmark, i will give it a try.
Is it single threaded compute-intensive?

Regards,
Lukasz

On 10/09/2018 06:24 PM, Thara Gopinath wrote:
> Thermal governors can respond to an overheat event for a cpu by
> capping the cpu's maximum possible frequency. This in turn
> means that the maximum available compute capacity of the
> cpu is restricted. But today in linux kernel, in event of maximum
> frequency capping of a cpu, the maximum available compute
> capacity of the cpu is not adjusted at all. In other words, scheduler
> is unware maximum cpu capacity restrictions placed due to thermal
> activity. This patch series attempts to address this issue.
> The benefits identified are better task placement among available
> cpus in event of overheating which in turn leads to better
> performance numbers.
> 
> The delta between the maximum possible capacity of a cpu and
> maximum available capacity of a cpu due to thermal event can
> be considered as thermal pressure. Instantaneous thermal pressure
> is hard to record and can sometime be erroneous as there can be mismatch
> between the actual capping of capacity and scheduler recording it.
> Thus solution is to have a weighted average per cpu value for thermal
> pressure over time. The weight reflects the amount of time the cpu has
> spent at a capped maximum frequency. To accumulate, average and
> appropriately decay thermal pressure, this patch series uses pelt
> signals and reuses the available framework that does a similar
> bookkeeping of rt/dl task utilization.
> 
> Regarding testing, basic build, boot and sanity testing have been
> performed on hikey960 mainline kernel with debian file system.
> Further aobench (An occlusion renderer for benchmarking realworld
> floating point performance) showed the following results on hikey960
> with debain.
> 
>                                          Result          Standard        Standard
>                                          (Time secs)     Error           Deviation
> Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%
> Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%
> 
> Thara Gopinath (7):
>    sched/pelt: Add option to make load and util calculations frequency
>      invariant
>    sched/pelt.c: Add support to track thermal pressure
>    sched: Add infrastructure to store and update instantaneous thermal
>      pressure
>    sched: Initialize per cpu thermal pressure structure
>    sched/fair: Enable CFS periodic tick to update thermal pressure
>    sched/fair: update cpu_capcity to reflect thermal pressure
>    thermal/cpu-cooling: Update thermal pressure in case of a maximum
>      frequency capping
> 
>   drivers/base/arch_topology.c  |  1 +
>   drivers/thermal/cpu_cooling.c | 20 ++++++++++++-
>   include/linux/sched.h         | 14 +++++++++
>   kernel/sched/Makefile         |  2 +-
>   kernel/sched/core.c           |  2 ++
>   kernel/sched/fair.c           |  4 +++
>   kernel/sched/pelt.c           | 40 ++++++++++++++++++--------
>   kernel/sched/pelt.h           |  7 +++++
>   kernel/sched/sched.h          |  1 +
>   kernel/sched/thermal.c        | 66 +++++++++++++++++++++++++++++++++++++++++++
>   kernel/sched/thermal.h        | 13 +++++++++
>   11 files changed, 157 insertions(+), 13 deletions(-)
>   create mode 100644 kernel/sched/thermal.c
>   create mode 100644 kernel/sched/thermal.h
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-10  6:17   ` Ingo Molnar
  2018-10-10  8:29     ` Quentin Perret
@ 2018-10-10 15:43     ` Thara Gopinath
  2018-10-16  7:33       ` Ingo Molnar
  1 sibling, 1 reply; 67+ messages in thread
From: Thara Gopinath @ 2018-10-10 15:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, mingo, peterz, rui.zhang, gregkh, rafael,
	amit.kachhap, viresh.kumar, javi.merino, edubezval,
	daniel.lezcano, linux-pm, quentin.perret, ionela.voinescu,
	vincent.guittot

Hello Ingo,
Thank you for the review.

On 10/10/2018 02:17 AM, Ingo Molnar wrote:
> 
> * Thara Gopinath <thara.gopinath@linaro.org> wrote:
> 
>> Thermal governors can respond to an overheat event for a cpu by
>> capping the cpu's maximum possible frequency. This in turn
>> means that the maximum available compute capacity of the
>> cpu is restricted. But today in linux kernel, in event of maximum
>> frequency capping of a cpu, the maximum available compute
>> capacity of the cpu is not adjusted at all. In other words, scheduler
>> is unware maximum cpu capacity restrictions placed due to thermal
>> activity. This patch series attempts to address this issue.
>> The benefits identif

ied are better task placement among available
>> cpus in event of overheating which in turn leads to better
>> performance numbers.
>>
>> The delta between the maximum possible capacity of a cpu and
>> maximum available capacity of a cpu due to thermal event can
>> be considered as thermal pressure. Instantaneous thermal pressure
>> is hard to record and can sometime be erroneous as there can be mismatch
>> between the actual capping of capacity and scheduler recording it.
>> Thus solution is to have a weighted average per cpu value for thermal
>> pressure over time. The weight reflects the amount of time the cpu has
>> spent at a capped maximum frequency. To accumulate, average and
>> appropriately decay thermal pressure, this patch series uses pelt
>> signals and reuses the available framework that does a similar
>> bookkeeping of rt/dl task utilization.
>>
>> Regarding testing, basic build, boot and sanity testing have been
>> performed on hikey960 mainline kernel with debian file system.
>> Further aobench (An occlusion renderer for benchmarking realworld
>> floating point performance) showed the following results on hikey960
>> with debain.
>>
>>                                         Result          Standard        Standard
>>                                         (Time secs)     Error           Deviation
>> Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%
>> Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%
> 
> Wow, +13% speedup, impressive! We definitely want this outcome.
> 
> I'm wondering what happens if we do not track and decay the thermal load at all at the PELT 
> level, but instantaneously decrease/increase effective CPU capacity in reaction to thermal 
> events we receive from the CPU.

The problem with instantaneous update is that sometimes thermal events
happen at a much faster pace than cpu_capacity is updated in
the scheduler. This means that at the moment when scheduler uses the
value, it might not be correct anymore.

Having said that, today Android common kernel has a solution which
instantaneously updates cpu_capacity in case of a thermal event.
To give a bit of background on the evolution of the solution I have
proposed, below is a time line of analysis I have done.

1.  I started this activity by analyzing the existing framework on
android common kernel. I ran android benchmark tests (Jankbench,
Vellamo, Geekbench) with and without the existing instantaneous update
mechanism. I found that there is no real performance difference to be
observed with an instantaneous updated of cpu_capacity at least in my
test scenarios.
2. Then I developed an algorithm to track, accumulate and decay the
capacity capping i.e an algorithm without using the pelt signals(this
was prior to the new pelt framework in mainline). With this android
benchmarks showed performance improvements. At this point I also ported
the solution to mainline kernel and ran the aobench analysis which again
showed a performance improvement.
3. Finally with the new pelt framework in place, I replaced my algorithm
with the one used for rt and dl utilization tracking which is the
current patch series. I have not been able to run tests with this on
Android yet.

All tests were performed on hikey960.
I have a Google spreadsheet, documenting results at various stages of
analysis. I am not sure how to share it with the group here.


> 
> You describe the averaging as:
> 
>> Instantaneous thermal pressure is hard to record and can sometime be erroneous as there can 
>> be mismatch between the actual capping of capacity and scheduler recording it.
> 
> Not sure I follow the argument here: are there bogus thermal throttling events? If so then
> they are hopefully not frequent enough and should average out over time even if we follow
> it instantly.

No bogus events. It is more like sometimes capping happens at a much
faster rate than cpu_capacity is updated and the scheduler looses these
events.

> 
> I.e. what is 'can sometimes be erroneous', exactly?
> 
> Thanks,
> 
> 	Ingo
> 


-- 
Regards
Thara

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-10 13:47                     ` Quentin Perret
  2018-10-10 15:19                       ` Vincent Guittot
@ 2018-10-10 16:15                       ` Ionela Voinescu
  1 sibling, 0 replies; 67+ messages in thread
From: Ionela Voinescu @ 2018-10-10 16:15 UTC (permalink / raw)
  To: Quentin Perret, Vincent Guittot
  Cc: Ingo Molnar, Thara Gopinath, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Zhang Rui, gregkh, Rafael J. Wysocki,
	Amit Kachhap, viresh kumar, Javi Merino, Eduardo Valentin,
	Daniel Lezcano, open list:THERMAL

Hi guys,

On 10/10/18 14:47, Quentin Perret wrote:
> On Wednesday 10 Oct 2018 at 15:27:57 (+0200), Vincent Guittot wrote:
>> On Wed, 10 Oct 2018 at 15:05, Quentin Perret <quentin.perret@arm.com> wrote:
>>>
>>> On Wednesday 10 Oct 2018 at 14:04:40 (+0200), Vincent Guittot wrote:
>>>> This patchset doesn't touch cpu_capacity_orig and doesn't need to  as
>>>> it assume that the max capacity is unchanged but some capacity is
>>>> momentary stolen by thermal.
>>>>  If you want to reflect immediately all thermal capping change, you
>>>> have to update this field and all related fields and struct around
>>>
>>> I don't follow you here. I never said I wanted to change
>>> cpu_capacity_orig. I don't think we should do that actually. Changing
>>> capacity_of (which is updated during LB IIRC) is just fine. The question
>>> is about what you want to do there: reflect an averaged value or the
>>> instantaneous one.
>>
>> Sorry I though your were speaking about updating this cpu_capacity_orig.
> 
> N/p, communication via email can easily become confusing :-)
> 
>> With using instantaneous max  value in capacity_of(), we are back to
>> the problem raised by Thara that  the value will most probably not
>> reflect the current capping value when it is used in LB, because LB
>> period can quite long on busy CPU (default max value is 32*sd_weight
>> ms)
> 
> But averaging the capping value over time doesn't make LB happen more
> often ... That will help you account for capping that happened in the
> past, but it's not obvious this is actually a good thing. Probably not
> all the time anyway.
> 
> Say a CPU was capped at 50% of it's capacity, then the cap is removed.
> At that point it'll take 100ms+ for the thermal signal to decay and let
> the scheduler know about the newly available capacity. That can probably
> be a performance hit in some use cases ... And the other way around, it
> can also take forever for the scheduler to notice that a CPU has a
> reduced capacity before reacting to it.
> 
> If you want to filter out very short transient capping events to avoid
> over-reacting in the scheduler (is this actually happening ?), then
> maybe the average should be done on the cooling device side or something
> like that ?
> 

I think there isn't just the issue of the *occasional* overreaction of a
thermal governor due to noise in the temperature sensors or some spike
in environmental temperature that determines a delayed reaction in the
scheduler due to when capacity is updated.

I'm seeing a bigger issue for *sustained* high temperatures that are not
treated effectively by governors. Depending on the platform, heat can
be dissipated over longer or shorter periods of time. This can determine
a seesaw effect on the maximum frequency: it determines the temperature
is over a threshold and it starts capping, but heat is not dissipated
quickly enough for that to reflect in the value of the temperature sensor,
so it continues to cap; when the temperature gets to normal, capping
is lifted, which in turn results access to higher OPPs and a return to
high temperatures, etc.

What will happen is that, *depending on platform* and the moment when
capacity is updated, you can see either a CPU with a capacity of 1024, or
let's say 800, or (on hikey960 :)) around 500, and back and forth
between them.

Because of these I tend to think that a regulated (averaged) value of
thermal pressure is better than an instantaneous one. Thermal mitigation
measures are there for the well-being and safety of a device, not for
optimizations so it can and should be allowed to overreact, or have a
delayed reaction. But ping-pong-ing tasks back and forth between CPUs
due to changes in CPU capacity is harmful for performance. What would be
awesome to achieve with this is (close to) optimal use of restricted
capacities of CPUs, and I tend to believe instantaneous and most probably
out of date capacity values would not lead to this.

But this is almost a gut feeling and of course it should be validated on
devices with different thermal characteristics. Given the high variation
between devices with regards to this I'd be reluctant to tie it to the
PELT half life.

Regards,
Ionela.

> Thanks,
> Quentin
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-10 15:35   ` Lukasz Luba
@ 2018-10-10 16:54     ` Daniel Lezcano
  2018-10-11  7:35       ` Lukasz Luba
  2018-10-10 17:30     ` Thara Gopinath
  1 sibling, 1 reply; 67+ messages in thread
From: Daniel Lezcano @ 2018-10-10 16:54 UTC (permalink / raw)
  To: Lukasz Luba, Thara Gopinath, linux-kernel
  Cc: mingo, peterz, rui.zhang, gregkh, rafael, amit.kachhap,
	viresh.kumar, javi.merino, edubezval, linux-pm, quentin.perret,
	ionela.voinescu, vincent.guittot, Bartlomiej Zolnierkiewicz

On 10/10/2018 17:35, Lukasz Luba wrote:
> Hi Thara,
> 
> I have run it on Exynos5433 mainline.
> When it is enabled with step_wise thermal governor,
> some of my tests are showing ~30-50% regression (i.e. hackbench),
> dhrystone ~10%.
> 
> Could you tell me which thermal governor was used in your case?
> Please also share the name of that benchmark, i will give it a try.
> Is it single threaded compute-intensive?

aobench AFAICT

It would be interesting if you can share the thermal profile of your board.


> On 10/09/2018 06:24 PM, Thara Gopinath wrote:
>> Thermal governors can respond to an overheat event for a cpu by
>> capping the cpu's maximum possible frequency. This in turn
>> means that the maximum available compute capacity of the
>> cpu is restricted. But today in linux kernel, in event of maximum
>> frequency capping of a cpu, the maximum available compute
>> capacity of the cpu is not adjusted at all. In other words, scheduler
>> is unware maximum cpu capacity restrictions placed due to thermal
>> activity. This patch series attempts to address this issue.
>> The benefits identified are better task placement among available
>> cpus in event of overheating which in turn leads to better
>> performance numbers.
>>
>> The delta between the maximum possible capacity of a cpu and
>> maximum available capacity of a cpu due to thermal event can
>> be considered as thermal pressure. Instantaneous thermal pressure
>> is hard to record and can sometime be erroneous as there can be mismatch
>> between the actual capping of capacity and scheduler recording it.
>> Thus solution is to have a weighted average per cpu value for thermal
>> pressure over time. The weight reflects the amount of time the cpu has
>> spent at a capped maximum frequency. To accumulate, average and
>> appropriately decay thermal pressure, this patch series uses pelt
>> signals and reuses the available framework that does a similar
>> bookkeeping of rt/dl task utilization.
>>
>> Regarding testing, basic build, boot and sanity testing have been
>> performed on hikey960 mainline kernel with debian file system.
>> Further aobench (An occlusion renderer for benchmarking realworld
>> floating point performance) showed the following results on hikey960
>> with debain.
>>
>>                                          Result          Standard        Standard
>>                                          (Time secs)     Error           Deviation
>> Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%
>> Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%
>>
>> Thara Gopinath (7):
>>    sched/pelt: Add option to make load and util calculations frequency
>>      invariant
>>    sched/pelt.c: Add support to track thermal pressure
>>    sched: Add infrastructure to store and update instantaneous thermal
>>      pressure
>>    sched: Initialize per cpu thermal pressure structure
>>    sched/fair: Enable CFS periodic tick to update thermal pressure
>>    sched/fair: update cpu_capcity to reflect thermal pressure
>>    thermal/cpu-cooling: Update thermal pressure in case of a maximum
>>      frequency capping
>>
>>   drivers/base/arch_topology.c  |  1 +
>>   drivers/thermal/cpu_cooling.c | 20 ++++++++++++-
>>   include/linux/sched.h         | 14 +++++++++
>>   kernel/sched/Makefile         |  2 +-
>>   kernel/sched/core.c           |  2 ++
>>   kernel/sched/fair.c           |  4 +++
>>   kernel/sched/pelt.c           | 40 ++++++++++++++++++--------
>>   kernel/sched/pelt.h           |  7 +++++
>>   kernel/sched/sched.h          |  1 +
>>   kernel/sched/thermal.c        | 66 +++++++++++++++++++++++++++++++++++++++++++
>>   kernel/sched/thermal.h        | 13 +++++++++
>>   11 files changed, 157 insertions(+), 13 deletions(-)
>>   create mode 100644 kernel/sched/thermal.c
>>   create mode 100644 kernel/sched/thermal.h
>>


-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-10  9:55         ` Quentin Perret
  2018-10-10 10:14           ` Vincent Guittot
@ 2018-10-10 17:03           ` Thara Gopinath
  1 sibling, 0 replies; 67+ messages in thread
From: Thara Gopinath @ 2018-10-10 17:03 UTC (permalink / raw)
  To: Quentin Perret, Vincent Guittot
  Cc: Ingo Molnar, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Zhang Rui, gregkh, Rafael J. Wysocki, Amit Kachhap, viresh kumar,
	Javi Merino, Eduardo Valentin, Daniel Lezcano, open list:THERMAL,
	Ionela Voinescu

Hello Quentin,
On 10/10/2018 05:55 AM, Quentin Perret wrote:
> Hi Vincent,
> 
> On Wednesday 10 Oct 2018 at 10:50:05 (+0200), Vincent Guittot wrote:
>> The problem with reflecting directly the capping is that it happens
>> far more often than the pace at which cpu_capacity_orig is updated in
>> the scheduler.
> 
> Hmm, how can you be so sure ? That most likely depends on the workload,
> the platform and the thermal governor. Some platforms heat up slowly,
> some quickly. The pace at which the thermal governor will change things
> should depend on that I assume.

Yes I agree. How often a thermal event occurs is indeed dependent on
workload, platform and governors. For e.g. hikey960  the same workload
displays different behavior with and without a fan. What we want is a
generic solution that will work across "spiky" events and more spread
out events. An instantaneous update of cpu_capacity most certainly does
not capture spiky events. Also it can lead to cpu_capacity swinging
wildly from the original value to a much lower value and back again.
Averaging of thermal capacity limitation (Across any pre-determined
duration i.e using Pelt signals or without) takes care of smoothening
the effect of thermal capping and making sure that capping events are
not entirely missed.  For me it is a much more elegant solution than an
instantaneous solution.
Having said that, like you mentioned below, when the thermal capping
goes away, it is not reflected as an instantaneous availability of spare
capacity. It is slowly increased. But it is not necessarily wrong
information. All it tells the scheduler is that during the last
"pre-determined" duration on an average "X" was the capacity available
for a CFS task on this cpu.

> 
>> This means that at the moment when scheduler uses the
>> value, it might not be correct anymore.
> 
> And OTOH, when you remove a cap for example, it will take time before
> the scheduler can see the newly available capacity if you need to wait
> for the signal to decay. So you are using a wrong information too in
> that scenario.
> 
>> Then, this value are also used
>> when building the sched_domain and setting max_cpu_capacity which
>> would also implies the rebuilt the sched_domain topology ...
> 
> Wait what ? I thought the thermal cap was reflected in capacity_of, not
> capacity_orig_of ... You need to rebuild the sched_domain in case of
> thermal pressure ?
> 
> Hmm, let me have a closer look at the patches, I must have missed
> something ...
> 
>> The pace of changing the capping is to fast to reflect that in the
>> whole scheduler topology
> 
> That's probably true in some cases, but it'd be cool to have numbers to
> back up that statement, I think.
> 
> Now, if you do need to rebuild the sched domain topology every time you
> update the thermal pressure, I think the PELT HL is _way_ too short for
> that ... You can't rebuild the whole thing every 32ms or so. Or am I
> misunderstanding something ?
> 
>>> Thara, have you tried to experiment with a simpler implementation as
>>> suggested by Ingo ?
>>>
>>> Also, assuming that we do want to average things, do we actually want to
>>> tie the thermal ramp-up time to the PELT half life ? That provides
>>> nice maths properties wrt the other signals, but it's not obvious to me
>>> that this thermal 'constant' should be the same on all platforms. Or
>>> maybe it should ?
>>
>> The main interest of using PELT signal is that thermal pressure will
>> evolve at the same pace as other signals used in the scheduler.
> 
> Right, I think this is a nice property too (assuming that we actually
> want to average things out).
> 
>> With
>> thermal pressure, we have the exact same problem as with RT tasks. The
>> thermal will cap the max frequency which will cap the utilization of
>> the tasks running on the CPU
> 
> Well, the nature of the signal is slightly different IMO. Yes it's
> capacity, but you're no actually measuring time spent on the CPU. All
> other PELT signals are based on time, this thermal thing isn't, so it is
> kinda different in a way. And I'm still wondering if it could be helpful
> to be able to have a different HL for that thermal signal. That would
> 'break' the nice maths properties we have, yes, but is it a problem or is
> it actually helpful to cope with the thermal characteristics of
> different platforms ?


> 
> Thanks,
> Quentin
> 


-- 
Regards
Thara

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-10 13:34                         ` Juri Lelli
  2018-10-10 13:38                           ` Vincent Guittot
@ 2018-10-10 17:08                           ` Thara Gopinath
  1 sibling, 0 replies; 67+ messages in thread
From: Thara Gopinath @ 2018-10-10 17:08 UTC (permalink / raw)
  To: Juri Lelli, Vincent Guittot
  Cc: Quentin Perret, Ingo Molnar, linux-kernel, Ingo Molnar,
	Peter Zijlstra, Zhang Rui, gregkh, Rafael J. Wysocki,
	Amit Kachhap, viresh kumar, Javi Merino, Eduardo Valentin,
	Daniel Lezcano, open list:THERMAL, Ionela Voinescu

On 10/10/2018 09:34 AM, Juri Lelli wrote:
> On 10/10/18 15:08, Vincent Guittot wrote:
>> On Wed, 10 Oct 2018 at 14:50, Juri Lelli <juri.lelli@gmail.com> wrote:
>>>
>>> On 10/10/18 14:34, Vincent Guittot wrote:
>>>> Hi  Juri,
>>>>
>>>> On Wed, 10 Oct 2018 at 14:23, Juri Lelli <juri.lelli@gmail.com> wrote:
>>>>>
>>>>> On 10/10/18 14:04, Vincent Guittot wrote:
>>>>>
>>>>> [...]
>>>>>
>>>>>> The problem was the same with RT, the cfs utilization was lower than
>>>>>> reality because RT steals soem cycle to CFS
>>>>>> So schedutil was selecting a lower frequency when cfs was running
>>>>>> whereas the CPU was fully used.
>>>>>> The same can happen with thermal:
>>>>>> cap the max freq because of thermal
>>>>>> the utilization with decrease.
>>>>>> remove the cap
>>>>>> the utilization is still low and you will select a low OPP because you
>>>>>> don't take into account cycle stolen by thermal like with RT
>>>>>
>>>>> What if we scale frequency component considering the capped temporary
>>>>> max?
>>>>
>>>> Do you mean using a kind of scale_thermal_capacity in accumulate_sum
>>>> when computing utilization ?
>>>
>>> Yeah, something like that I guess. So that we account for temporary
>>> "fake" 1024..
>>
>> But the utilization will not be invariant anymore across the system
> 
> Mmm, I guess I might be wrong, but I was thinking we should be able to
> deal with this similarly to what we do with cpus with different max
> capacities. So, another factor?  Because then, how do we handle other
> ways in which max freq can be restricted (e.g. from userspace as Javi
> was also mentioning)?

IMHO, user-space restrictions should be handled separately. It should
probably reflect as an update of capacity_orig and rebuilding of
scheduler structures as such a restriction is meant to stay for a long
duration.

Regards
Thara
> 


-- 
Regards
Thara

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-10 15:35   ` Lukasz Luba
  2018-10-10 16:54     ` Daniel Lezcano
@ 2018-10-10 17:30     ` Thara Gopinath
  2018-10-11 11:10       ` Lukasz Luba
  1 sibling, 1 reply; 67+ messages in thread
From: Thara Gopinath @ 2018-10-10 17:30 UTC (permalink / raw)
  To: Lukasz Luba, linux-kernel
  Cc: mingo, peterz, rui.zhang, gregkh, rafael, amit.kachhap,
	viresh.kumar, javi.merino, edubezval, daniel.lezcano, linux-pm,
	quentin.perret, ionela.voinescu, vincent.guittot,
	Bartlomiej Zolnierkiewicz

Hello Lukasz,

On 10/10/2018 11:35 AM, Lukasz Luba wrote:
> Hi Thara,
> 
> I have run it on Exynos5433 mainline.
> When it is enabled with step_wise thermal governor,
> some of my tests are showing ~30-50% regression (i.e. hackbench),
> dhrystone ~10%.

That is interesting. If I understand correctly, dhrystone spawns 16
threads or so and floods the system. In "theory", such a test should not
see any performance improvement and degradation. What is the thermal
activity like in your system? I will try running one of these tests on
hikey960.
> 
> Could you tell me which thermal governor was used in your case?
> Please also share the name of that benchmark, i will give it a try.
> Is it single threaded compute-intensive?

Step-wise governor.
I use aobench which is part of phoronix-test-suite.

Regards
Thara

> 
> Regards,
> Lukasz
> 
> On 10/09/2018 06:24 PM, Thara Gopinath wrote:
>> Thermal governors can respond to an overheat event for a cpu by
>> capping the cpu's maximum possible frequency. This in turn
>> means that the maximum available compute capacity of the
>> cpu is restricted. But today in linux kernel, in event of maximum
>> frequency capping of a cpu, the maximum available compute
>> capacity of the cpu is not adjusted at all. In other words, scheduler
>> is unware maximum cpu capacity restrictions placed due to thermal
>> activity. This patch series attempts to address this issue.
>> The benefits identified are better task placement among available
>> cpus in event of overheating which in turn leads to better
>> performance numbers.
>>
>> The delta between the maximum possible capacity of a cpu and
>> maximum available capacity of a cpu due to thermal event can
>> be considered as thermal pressure. Instantaneous thermal pressure
>> is hard to record and can sometime be erroneous as there can be mismatch
>> between the actual capping of capacity and scheduler recording it.
>> Thus solution is to have a weighted average per cpu value for thermal
>> pressure over time. The weight reflects the amount of time the cpu has
>> spent at a capped maximum frequency. To accumulate, average and
>> appropriately decay thermal pressure, this patch series uses pelt
>> signals and reuses the available framework that does a similar
>> bookkeeping of rt/dl task utilization.
>>
>> Regarding testing, basic build, boot and sanity testing have been
>> performed on hikey960 mainline kernel with debian file system.
>> Further aobench (An occlusion renderer for benchmarking realworld
>> floating point performance) showed the following results on hikey960
>> with debain.
>>
>>                                          Result          Standard        Standard
>>                                          (Time secs)     Error           Deviation
>> Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%
>> Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%
>>
>> Thara Gopinath (7):
>>    sched/pelt: Add option to make load and util calculations frequency
>>      invariant
>>    sched/pelt.c: Add support to track thermal pressure
>>    sched: Add infrastructure to store and update instantaneous thermal
>>      pressure
>>    sched: Initialize per cpu thermal pressure structure
>>    sched/fair: Enable CFS periodic tick to update thermal pressure
>>    sched/fair: update cpu_capcity to reflect thermal pressure
>>    thermal/cpu-cooling: Update thermal pressure in case of a maximum
>>      frequency capping
>>
>>   drivers/base/arch_topology.c  |  1 +
>>   drivers/thermal/cpu_cooling.c | 20 ++++++++++++-
>>   include/linux/sched.h         | 14 +++++++++
>>   kernel/sched/Makefile         |  2 +-
>>   kernel/sched/core.c           |  2 ++
>>   kernel/sched/fair.c           |  4 +++
>>   kernel/sched/pelt.c           | 40 ++++++++++++++++++--------
>>   kernel/sched/pelt.h           |  7 +++++
>>   kernel/sched/sched.h          |  1 +
>>   kernel/sched/thermal.c        | 66 +++++++++++++++++++++++++++++++++++++++++++
>>   kernel/sched/thermal.h        | 13 +++++++++
>>   11 files changed, 157 insertions(+), 13 deletions(-)
>>   create mode 100644 kernel/sched/thermal.c
>>   create mode 100644 kernel/sched/thermal.h
>>


-- 
Regards
Thara

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-10 16:54     ` Daniel Lezcano
@ 2018-10-11  7:35       ` Lukasz Luba
  2018-10-11  8:23         ` Daniel Lezcano
  0 siblings, 1 reply; 67+ messages in thread
From: Lukasz Luba @ 2018-10-11  7:35 UTC (permalink / raw)
  To: Daniel Lezcano, Thara Gopinath, linux-kernel
  Cc: mingo, peterz, rui.zhang, gregkh, rafael, amit.kachhap,
	viresh.kumar, javi.merino, edubezval, linux-pm, quentin.perret,
	ionela.voinescu, vincent.guittot, Bartlomiej Zolnierkiewicz

Hi Daniel,

On 10/10/2018 06:54 PM, Daniel Lezcano wrote:
> On 10/10/2018 17:35, Lukasz Luba wrote:
>> Hi Thara,
>>
>> I have run it on Exynos5433 mainline.
>> When it is enabled with step_wise thermal governor,
>> some of my tests are showing ~30-50% regression (i.e. hackbench),
>> dhrystone ~10%.
>>
>> Could you tell me which thermal governor was used in your case?
>> Please also share the name of that benchmark, i will give it a try.
>> Is it single threaded compute-intensive?
> 
> aobench AFAICT
> 
> It would be interesting if you can share the thermal profile of your board.
> 
Thanks for the benchmark name.
It was tested on Samsung TM2 device with Exynos 5433 with debian.
Thermal stuff you can find in mainline:
arch/arm64/boot/dts/exynos/exynos5433-tmu.dtsi

Regards,
Lukasz

> 
>> On 10/09/2018 06:24 PM, Thara Gopinath wrote:
>>> Thermal governors can respond to an overheat event for a cpu by
>>> capping the cpu's maximum possible frequency. This in turn
>>> means that the maximum available compute capacity of the
>>> cpu is restricted. But today in linux kernel, in event of maximum
>>> frequency capping of a cpu, the maximum available compute
>>> capacity of the cpu is not adjusted at all. In other words, scheduler
>>> is unware maximum cpu capacity restrictions placed due to thermal
>>> activity. This patch series attempts to address this issue.
>>> The benefits identified are better task placement among available
>>> cpus in event of overheating which in turn leads to better
>>> performance numbers.
>>>
>>> The delta between the maximum possible capacity of a cpu and
>>> maximum available capacity of a cpu due to thermal event can
>>> be considered as thermal pressure. Instantaneous thermal pressure
>>> is hard to record and can sometime be erroneous as there can be mismatch
>>> between the actual capping of capacity and scheduler recording it.
>>> Thus solution is to have a weighted average per cpu value for thermal
>>> pressure over time. The weight reflects the amount of time the cpu has
>>> spent at a capped maximum frequency. To accumulate, average and
>>> appropriately decay thermal pressure, this patch series uses pelt
>>> signals and reuses the available framework that does a similar
>>> bookkeeping of rt/dl task utilization.
>>>
>>> Regarding testing, basic build, boot and sanity testing have been
>>> performed on hikey960 mainline kernel with debian file system.
>>> Further aobench (An occlusion renderer for benchmarking realworld
>>> floating point performance) showed the following results on hikey960
>>> with debain.
>>>
>>>                                           Result          Standard        Standard
>>>                                           (Time secs)     Error           Deviation
>>> Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%
>>> Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%
>>>
>>> Thara Gopinath (7):
>>>     sched/pelt: Add option to make load and util calculations frequency
>>>       invariant
>>>     sched/pelt.c: Add support to track thermal pressure
>>>     sched: Add infrastructure to store and update instantaneous thermal
>>>       pressure
>>>     sched: Initialize per cpu thermal pressure structure
>>>     sched/fair: Enable CFS periodic tick to update thermal pressure
>>>     sched/fair: update cpu_capcity to reflect thermal pressure
>>>     thermal/cpu-cooling: Update thermal pressure in case of a maximum
>>>       frequency capping
>>>
>>>    drivers/base/arch_topology.c  |  1 +
>>>    drivers/thermal/cpu_cooling.c | 20 ++++++++++++-
>>>    include/linux/sched.h         | 14 +++++++++
>>>    kernel/sched/Makefile         |  2 +-
>>>    kernel/sched/core.c           |  2 ++
>>>    kernel/sched/fair.c           |  4 +++
>>>    kernel/sched/pelt.c           | 40 ++++++++++++++++++--------
>>>    kernel/sched/pelt.h           |  7 +++++
>>>    kernel/sched/sched.h          |  1 +
>>>    kernel/sched/thermal.c        | 66 +++++++++++++++++++++++++++++++++++++++++++
>>>    kernel/sched/thermal.h        | 13 +++++++++
>>>    11 files changed, 157 insertions(+), 13 deletions(-)
>>>    create mode 100644 kernel/sched/thermal.c
>>>    create mode 100644 kernel/sched/thermal.h
>>>
> 
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-11  7:35       ` Lukasz Luba
@ 2018-10-11  8:23         ` Daniel Lezcano
  2018-10-12  9:37           ` Lukasz Luba
  0 siblings, 1 reply; 67+ messages in thread
From: Daniel Lezcano @ 2018-10-11  8:23 UTC (permalink / raw)
  To: Lukasz Luba, Thara Gopinath, linux-kernel
  Cc: mingo, peterz, rui.zhang, gregkh, rafael, amit.kachhap,
	viresh.kumar, javi.merino, edubezval, linux-pm, quentin.perret,
	ionela.voinescu, vincent.guittot, Bartlomiej Zolnierkiewicz

On 11/10/2018 09:35, Lukasz Luba wrote:
> Hi Daniel,
> 
> On 10/10/2018 06:54 PM, Daniel Lezcano wrote:
>> On 10/10/2018 17:35, Lukasz Luba wrote:
>>> Hi Thara,
>>>
>>> I have run it on Exynos5433 mainline.
>>> When it is enabled with step_wise thermal governor,
>>> some of my tests are showing ~30-50% regression (i.e. hackbench),
>>> dhrystone ~10%.
>>>
>>> Could you tell me which thermal governor was used in your case?
>>> Please also share the name of that benchmark, i will give it a try.
>>> Is it single threaded compute-intensive?
>>
>> aobench AFAICT
>>
>> It would be interesting if you can share the thermal profile of your board.
>>
> Thanks for the benchmark name.
> It was tested on Samsung TM2 device with Exynos 5433 with debian.
> Thermal stuff you can find in mainline:
> arch/arm64/boot/dts/exynos/exynos5433-tmu.dtsi

By thermal profile, I was meaning a figure with the temperature
regulation (temperature idle, then workload, then temperature increase,
mitigated temperature, end of workload, temperature decrease).

The thermal description looks wrong in the DT. I suggest to experiment
the series with the DT fixed.

eg. from hi6220.dtsi


                thermal-zones {

                        cls0: cls0 {
                                polling-delay = <1000>;
                                polling-delay-passive = <100>;
                                sustainable-power = <3326>;

                                /* sensor ID */
                                thermal-sensors = <&tsensor 2>;

                                trips {
                                        threshold: trip-point@0 {
                                                temperature = <65000>;
                                                hysteresis = <0>;
                                                type = "passive";
                                        };

                                        target: trip-point@1 {
                                                temperature = <75000>;
                                                hysteresis = <0>;
                                                type = "passive";
                                        };
                                };

                                cooling-maps {
                                        map0 {
                                                trip = <&target>;
                                                cooling-device = <&cpu0
THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
                                        };
                                };
                        };
                };

Note the cooling devices are *passive*


-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-10 17:30     ` Thara Gopinath
@ 2018-10-11 11:10       ` Lukasz Luba
  2018-10-16 17:11         ` Vincent Guittot
  0 siblings, 1 reply; 67+ messages in thread
From: Lukasz Luba @ 2018-10-11 11:10 UTC (permalink / raw)
  To: Thara Gopinath, linux-kernel
  Cc: mingo, peterz, rui.zhang, gregkh, rafael, amit.kachhap,
	viresh.kumar, javi.merino, edubezval, daniel.lezcano, linux-pm,
	quentin.perret, ionela.voinescu, vincent.guittot,
	Bartlomiej Zolnierkiewicz



On 10/10/2018 07:30 PM, Thara Gopinath wrote:
> Hello Lukasz,
> 
> On 10/10/2018 11:35 AM, Lukasz Luba wrote:
>> Hi Thara,
>>
>> I have run it on Exynos5433 mainline.
>> When it is enabled with step_wise thermal governor,
>> some of my tests are showing ~30-50% regression (i.e. hackbench),
>> dhrystone ~10%.
> 
> That is interesting. If I understand correctly, dhrystone spawns 16
> threads or so and floods the system. In "theory", such a test should not
> see any performance improvement and degradation. What is the thermal
> activity like in your system? I will try running one of these tests on
> hikey960.
I use this dhrystone implementation:
https://github.com/Keith-S-Thompson/dhrystone/blob/master/v2.2/dry.c
It does not span new threads/processes and I pinned it to a single cpu.

My thermal setup is probably different than yours.
You have (on hikey960) probably 1 sensor for whole SoC and one thermal
zone (if it is this mainline file:
arch/arm64/boot/dts/hisilicon/hi3660.dtsi).
This thermal zone has two cooling devices - two clusters with dvfs.
Your temperature signal read out from that sensor is probably much
smoother. When you have sensor inside cluster, the rising factor
can be even 20deg/s (for big cores).
In my case, there are 4 thermal zones, each cluster has it's private
sensor and thermal zone. There is no 'SoC sensor' or 'PCB sensor',
which is recommended for IPA.
>>
>> Could you tell me which thermal governor was used in your case?
>> Please also share the name of that benchmark, i will give it a try.
>> Is it single threaded compute-intensive?
> 
> Step-wise governor.
> I use aobench which is part of phoronix-test-suite.
> 
> Regards
> Thara
> 
I have built this aobench and run it pinned to single big cpu:
time taskset -c 4 ./aobench
The results:
3min-5:30min [mainline]
5:15min-5:50min [+patchset]

The idea is definitely worth to investigate further.

Regards,
Lukasz




^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-11  8:23         ` Daniel Lezcano
@ 2018-10-12  9:37           ` Lukasz Luba
  0 siblings, 0 replies; 67+ messages in thread
From: Lukasz Luba @ 2018-10-12  9:37 UTC (permalink / raw)
  To: Daniel Lezcano, linux-kernel
  Cc: Thara Gopinath, mingo, peterz, rui.zhang, gregkh, rafael,
	amit.kachhap, viresh.kumar, javi.merino, edubezval, linux-pm,
	quentin.perret, ionela.voinescu, vincent.guittot,
	Bartlomiej Zolnierkiewicz


On 10/11/2018 10:23 AM, Daniel Lezcano wrote:
> On 11/10/2018 09:35, Lukasz Luba wrote:
>> Hi Daniel,
>>
>> On 10/10/2018 06:54 PM, Daniel Lezcano wrote:
>>> On 10/10/2018 17:35, Lukasz Luba wrote:
>>>> Hi Thara,
>>>>
>>>> I have run it on Exynos5433 mainline.
>>>> When it is enabled with step_wise thermal governor,
>>>> some of my tests are showing ~30-50% regression (i.e. hackbench),
>>>> dhrystone ~10%.
>>>>
>>>> Could you tell me which thermal governor was used in your case?
>>>> Please also share the name of that benchmark, i will give it a try.
>>>> Is it single threaded compute-intensive?
>>>
>>> aobench AFAICT
>>>
>>> It would be interesting if you can share the thermal profile of your board.
>>>
>> Thanks for the benchmark name.
>> It was tested on Samsung TM2 device with Exynos 5433 with debian.
>> Thermal stuff you can find in mainline:
>> arch/arm64/boot/dts/exynos/exynos5433-tmu.dtsi
> 
> By thermal profile, I was meaning a figure with the temperature
> regulation (temperature idle, then workload, then temperature increase,
> mitigated temperature, end of workload, temperature decrease).
Currently, I cannot share these data.
> 
> The thermal description looks wrong in the DT. I suggest to experiment
> the series with the DT fixed.

Could you tell more what looks wrong or maybe send a draft/patch?
> eg. from hi6220.dtsi
> 
> 
>                  thermal-zones {
> 
>                          cls0: cls0 {
>                                  polling-delay = <1000>;
>                                  polling-delay-passive = <100>;
>                                  sustainable-power = <3326>;
> 
>                                  /* sensor ID */
>                                  thermal-sensors = <&tsensor 2>;
> 
>                                  trips {
>                                          threshold: trip-point@0 {
>                                                  temperature = <65000>;
>                                                  hysteresis = <0>;
>                                                  type = "passive";
>                                          };
> 
>                                          target: trip-point@1 {
>                                                  temperature = <75000>;
>                                                  hysteresis = <0>;
>                                                  type = "passive";
>                                          };
>                                  };
> 
>                                  cooling-maps {
>                                          map0 {
>                                                  trip = <&target>;
>                                                  cooling-device = <&cpu0
> THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
>                                          };
>                                  };
>                          };
>                  };
> 
> Note the cooling devices are *passive*
> 
>
For me this DT looks like it is copied from ARM Juno board for IPA,
where AFAIR there were no interrupts for the temperature sensor
(due to some psci/scpi/firmware lack of support).
The cooling map is also short. I am not sure if it would work efficient
with step-wise, with IPA it will work.

The DT configuration for Exynos has been working OK for years.
Exynos5433 supports 8 trip points, Exynos5422 has 4.
So if you have more trip points you need to start 'polling'.
Therefore, for Exynos there is no need to run queued work
in thermal framework every 1s or 100ms just to
read the current temperature.
The exynos-tmu driver gets irq and calls thermal framework
when such trip point is crossed. Then there is 'monitoring
window' for the trip point...
Long story short: 'active' is used for that reason.

Regards,
Lukasz


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-10 15:43     ` Thara Gopinath
@ 2018-10-16  7:33       ` Ingo Molnar
  2018-10-16  9:28         ` Lukasz Luba
  2018-10-17 16:21         ` Thara Gopinath
  0 siblings, 2 replies; 67+ messages in thread
From: Ingo Molnar @ 2018-10-16  7:33 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: linux-kernel, mingo, peterz, rui.zhang, gregkh, rafael,
	amit.kachhap, viresh.kumar, javi.merino, edubezval,
	daniel.lezcano, linux-pm, quentin.perret, ionela.voinescu,
	vincent.guittot


* Thara Gopinath <thara.gopinath@linaro.org> wrote:

> >> Regarding testing, basic build, boot and sanity testing have been
> >> performed on hikey960 mainline kernel with debian file system.
> >> Further aobench (An occlusion renderer for benchmarking realworld
> >> floating point performance) showed the following results on hikey960
> >> with debain.
> >>
> >>                                         Result          Standard        Standard
> >>                                         (Time secs)     Error           Deviation
> >> Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%
> >> Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%
> > 
> > Wow, +13% speedup, impressive! We definitely want this outcome.
> > 
> > I'm wondering what happens if we do not track and decay the thermal 
> > load at all at the PELT level, but instantaneously decrease/increase 
> > effective CPU capacity in reaction to thermal events we receive from 
> > the CPU.
> 
> The problem with instantaneous update is that sometimes thermal events 
> happen at a much faster pace than cpu_capacity is updated in the 
> scheduler. This means that at the moment when scheduler uses the 
> value, it might not be correct anymore.

Let me offer a different interpretation: if we average throttling events 
then we create a 'smooth' average of 'true CPU capacity' that doesn't 
fluctuate much. This allows more stable yet asymmetric task placement if 
the thermal characteristics of the different cores is different 
(asymmetric). This, compared to instantaneous updates, would reduce 
unnecessary task migrations between cores.

Is that accurate?

If the thermal characteristics of the cores is roughly symmetric and the 
measured CPU-intense load itself is symmetric as well, then I have 
trouble seeing why reacting to thermal events should make any difference 
at all.

Are there any inherent asymmetries in the thermal properties of the 
cores, or in the benchmarked workload itself?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-16  7:33       ` Ingo Molnar
@ 2018-10-16  9:28         ` Lukasz Luba
  2018-10-17 16:21         ` Thara Gopinath
  1 sibling, 0 replies; 67+ messages in thread
From: Lukasz Luba @ 2018-10-16  9:28 UTC (permalink / raw)
  To: Ingo Molnar, Thara Gopinath
  Cc: linux-kernel, mingo, peterz, rui.zhang, gregkh, rafael,
	amit.kachhap, viresh.kumar, javi.merino, edubezval,
	daniel.lezcano, linux-pm, quentin.perret, ionela.voinescu,
	vincent.guittot, l.luba, Bartlomiej Zolnierkiewicz


On 10/16/2018 09:33 AM, Ingo Molnar wrote:
> 
> * Thara Gopinath <thara.gopinath@linaro.org> wrote:
> 
>>>> Regarding testing, basic build, boot and sanity testing have been
>>>> performed on hikey960 mainline kernel with debian file system.
>>>> Further aobench (An occlusion renderer for benchmarking realworld
>>>> floating point performance) showed the following results on hikey960
>>>> with debain.
>>>>
>>>>                                          Result          Standard        Standard
>>>>                                          (Time secs)     Error           Deviation
>>>> Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%
>>>> Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%
>>>
>>> Wow, +13% speedup, impressive! We definitely want this outcome.
>>>
>>> I'm wondering what happens if we do not track and decay the thermal
>>> load at all at the PELT level, but instantaneously decrease/increase
>>> effective CPU capacity in reaction to thermal events we receive from
>>> the CPU.
>>
>> The problem with instantaneous update is that sometimes thermal events
>> happen at a much faster pace than cpu_capacity is updated in the
>> scheduler. This means that at the moment when scheduler uses the
>> value, it might not be correct anymore.
> 
> Let me offer a different interpretation: if we average throttling events
> then we create a 'smooth' average of 'true CPU capacity' that doesn't
> fluctuate much. This allows more stable yet asymmetric task placement if
> the thermal characteristics of the different cores is different
> (asymmetric). This, compared to instantaneous updates, would reduce
> unnecessary task migrations between cores.
> 
> Is that accurate?
> 
> If the thermal characteristics of the cores is roughly symmetric and the
> measured CPU-intense load itself is symmetric as well, then I have
> trouble seeing why reacting to thermal events should make any difference
> at all.
> 
> Are there any inherent asymmetries in the thermal properties of the
> cores, or in the benchmarked workload itself?
The aobench that at least I have built is a single threaded app.
If there is migration of the process to cluster and core which is in
avg faster, then it will gain.
The hikey960 platform has limited number of OPPs.
big cluster: 2.36, 2.1, 1.8, 1.4, 0.9 [GHz]
little cluster: 1.84, 1.7, 1.4, 1.0, 0.5 [GHz]
Comparing to Exynos5433 which has 15 OPPs for big cluster every 100MHZ,
it is harder to pick-up the right one.
I can imagine that the thermal governor is jumping around 1.8, 1.4, 0.9
for the big cluster. Maybe little cluster is at higher OPP
and running there longer would help. Thermal has time slots are 100ms
(based on this DT).

Regarding other asymmetries, there are different parts of the cluster
and core utilized depending of workload and data set.
There might be floating point or vectorized code utilizing long piplines
in NEON and also causing less cache misses.
That will warm up more than integer unit or copy using load/store unit
(which occupy less silicon (and C 'capacitance')) at the same frequency.

There are also SoCs which have single power rail from DCDC in PMIC
for both asymmetric clusters. In SoC on front of these clusters,
there is internal LDO, which reduces the voltage to the cluster.
In such system cpufreq driver chooses max of the voltages for the
clusters and sets it to the PMIC, then sets LDOx voltage diff for
cluster with smaller voltage. This causes another asymmetries,
because more current going through LDO causes more heat than
direct DCDC voltage (i.e. seen as a heat on big cluster).

There are also cache portion power down asymmetries.
I have been developing such driver. Based on memory traffic
and cache hit/miss ratio it chooses how much cache can be powered down.
I can image that some HW does it without the need of SW assist.

There are SoCs with DDR modules mounted on top - PoP.
I still have to investigate what is different in SoC power budget
in such setup (depending on workload).

There are also workloads for UI using GPU, which can also
be utilized in 'portions' (shader cores from 1 to 32).

These asymmetries cause that simple assumptio
P_dynamic = C * V^2 * f
is probably not enough.

I would suggest to choose platform with more fine grained OPPs or
add more points to hikey960 and repeat the tests.

Regards,
Lukasz Luba

> 
> Thanks,
> 
> 	Ingo
> 
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-11 11:10       ` Lukasz Luba
@ 2018-10-16 17:11         ` Vincent Guittot
  2018-10-17 16:24           ` Thara Gopinath
  2018-10-18  8:12           ` Lukasz Luba
  0 siblings, 2 replies; 67+ messages in thread
From: Vincent Guittot @ 2018-10-16 17:11 UTC (permalink / raw)
  To: l.luba
  Cc: Thara Gopinath, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Zhang Rui, gregkh, Rafael J. Wysocki, Amit Kachhap, viresh kumar,
	Javi Merino, Eduardo Valentin, Daniel Lezcano, open list:THERMAL,
	Quentin Perret, Ionela Voinescu, b.zolnierkie

Hi Lukasz,

On Thu, 11 Oct 2018 at 13:10, Lukasz Luba <l.luba@partner.samsung.com> wrote:
>
>
>
> On 10/10/2018 07:30 PM, Thara Gopinath wrote:
> > Hello Lukasz,
> >
> > On 10/10/2018 11:35 AM, Lukasz Luba wrote:
> >> Hi Thara,
> >>
> >> I have run it on Exynos5433 mainline.
> >> When it is enabled with step_wise thermal governor,
> >> some of my tests are showing ~30-50% regression (i.e. hackbench),
> >> dhrystone ~10%.
> >
> > That is interesting. If I understand correctly, dhrystone spawns 16
> > threads or so and floods the system. In "theory", such a test should not
> > see any performance improvement and degradation. What is the thermal
> > activity like in your system? I will try running one of these tests on
> > hikey960.
> I use this dhrystone implementation:
> https://github.com/Keith-S-Thompson/dhrystone/blob/master/v2.2/dry.c
> It does not span new threads/processes and I pinned it to a single cpu.
>
> My thermal setup is probably different than yours.
> You have (on hikey960) probably 1 sensor for whole SoC and one thermal
> zone (if it is this mainline file:
> arch/arm64/boot/dts/hisilicon/hi3660.dtsi).
> This thermal zone has two cooling devices - two clusters with dvfs.
> Your temperature signal read out from that sensor is probably much
> smoother. When you have sensor inside cluster, the rising factor
> can be even 20deg/s (for big cores).
> In my case, there are 4 thermal zones, each cluster has it's private
> sensor and thermal zone. There is no 'SoC sensor' or 'PCB sensor',
> which is recommended for IPA.
> >>
> >> Could you tell me which thermal governor was used in your case?
> >> Please also share the name of that benchmark, i will give it a try.
> >> Is it single threaded compute-intensive?
> >
> > Step-wise governor.
> > I use aobench which is part of phoronix-test-suite.
> >
> > Regards
> > Thara
> >
> I have built this aobench and run it pinned to single big cpu:
> time taskset -c 4 ./aobench

Why have you pinned the test only on CPU4 ?
Goal of thermal pressure is to inform the scheduler of reduced compute
capacity and help the scheduler to take better decision in tasks
placement.
So I would not expect perf impact on your test as the bench will stay
pinned on the cpu4
That being said you obviously have perf impact as shown below in your results
And other tasks on the system are not pinned and might come and
disturb your bench

> The results:
> 3min-5:30min [mainline]
> 5:15min-5:50min [+patchset]
>
> The idea is definitely worth to investigate further.

Yes I agree

Vincent
>
> Regards,
> Lukasz
>
>
>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-16  7:33       ` Ingo Molnar
  2018-10-16  9:28         ` Lukasz Luba
@ 2018-10-17 16:21         ` Thara Gopinath
  2018-10-18  6:48           ` Ingo Molnar
  1 sibling, 1 reply; 67+ messages in thread
From: Thara Gopinath @ 2018-10-17 16:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, mingo, peterz, rui.zhang, gregkh, rafael,
	amit.kachhap, viresh.kumar, javi.merino, edubezval,
	daniel.lezcano, linux-pm, quentin.perret, ionela.voinescu,
	vincent.guittot

On 10/16/2018 03:33 AM, Ingo Molnar wrote:
> 
> * Thara Gopinath <thara.gopinath@linaro.org> wrote:
> 
>>>> Regarding testing, basic build, boot and sanity testing have been
>>>> performed on hikey960 mainline kernel with debian file system.
>>>> Further aobench (An occlusion renderer for benchmarking realworld
>>>> floating point performance) showed the following results on hikey960
>>>> with debain.
>>>>
>>>>                                         Result          Standard        Standard
>>>>                                         (Time secs)     Error           Deviation
>>>> Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%
>>>> Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%
>>>
>>> Wow, +13% speedup, impressive! We definitely want this outcome.
>>>
>>> I'm wondering what happens if we do not track and decay the thermal 
>>> load at all at the PELT level, but instantaneously decrease/increase 
>>> effective CPU capacity in reaction to thermal events we receive from 
>>> the CPU.
>>
>> The problem with instantaneous update is that sometimes thermal events 
>> happen at a much faster pace than cpu_capacity is updated in the 
>> scheduler. This means that at the moment when scheduler uses the 
>> value, it might not be correct anymore.
> 
> Let me offer a different interpretation: if we average throttling events 
> then we create a 'smooth' average of 'true CPU capacity' that doesn't 
> fluctuate much. This allows more stable yet asymmetric task placement if 
> the thermal characteristics of the different cores is different 
> (asymmetric). This, compared to instantaneous updates, would reduce 
> unnecessary task migrations between cores.
> 
> Is that accurate?

Yes. I think it is accurate. I will also add that if we don't average
throttling events, we will miss the events that occur in between load
balancing(LB) period.

> 
> If the thermal characteristics of the cores is roughly symmetric and the 
> measured CPU-intense load itself is symmetric as well, then I have 
> trouble seeing why reacting to thermal events should make any difference 
> at all.
In this scenario, i agree that scheduler reaction to thermal events
should not make any difference in fact we should not observe any
improvement or degradation in performance.

> 
> Are there any inherent asymmetries in the thermal properties of the 
> cores, or in the benchmarked workload itself?

The benchmarked workload , meaning aobench? I don't think there arre any
asymmetries there. On Hikey960, there are two clusters with different
frequency domains. So yes I will say there is asymmetry there. Asides,
IMHO, any other tasks running on the system can create an inherent
asymmetry as cpu utilizations can vary.

Regards
Thara
> 
> Thanks,
> 
> 	Ingo
> 


-- 
Regards
Thara

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-16 17:11         ` Vincent Guittot
@ 2018-10-17 16:24           ` Thara Gopinath
  2018-10-18  8:00             ` Lukasz Luba
  2018-10-18  8:12           ` Lukasz Luba
  1 sibling, 1 reply; 67+ messages in thread
From: Thara Gopinath @ 2018-10-17 16:24 UTC (permalink / raw)
  To: Vincent Guittot, l.luba
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Zhang Rui, gregkh,
	Rafael J. Wysocki, Amit Kachhap, viresh kumar, Javi Merino,
	Eduardo Valentin, Daniel Lezcano, open list:THERMAL,
	Quentin Perret, Ionela Voinescu, b.zolnierkie

On 10/16/2018 01:11 PM, Vincent Guittot wrote:
> Hi Lukasz,
> 
> On Thu, 11 Oct 2018 at 13:10, Lukasz Luba <l.luba@partner.samsung.com> wrote:
>>
>>
>>
>> On 10/10/2018 07:30 PM, Thara Gopinath wrote:
>>> Hello Lukasz,
>>>
>>> On 10/10/2018 11:35 AM, Lukasz Luba wrote:
>>>> Hi Thara,
>>>>
>>>> I have run it on Exynos5433 mainline.
>>>> When it is enabled with step_wise thermal governor,
>>>> some of my tests are showing ~30-50% regression (i.e. hackbench),
>>>> dhrystone ~10%.
>>>
>>> That is interesting. If I understand correctly, dhrystone spawns 16
>>> threads or so and floods the system. In "theory", such a test should not
>>> see any performance improvement and degradation. What is the thermal
>>> activity like in your system? I will try running one of these tests on
>>> hikey960.
>> I use this dhrystone implementation:
>> https://github.com/Keith-S-Thompson/dhrystone/blob/master/v2.2/dry.c
>> It does not span new threads/processes and I pinned it to a single cpu.
>>
>> My thermal setup is probably different than yours.
>> You have (on hikey960) probably 1 sensor for whole SoC and one thermal
>> zone (if it is this mainline file:
>> arch/arm64/boot/dts/hisilicon/hi3660.dtsi).
>> This thermal zone has two cooling devices - two clusters with dvfs.
>> Your temperature signal read out from that sensor is probably much
>> smoother. When you have sensor inside cluster, the rising factor
>> can be even 20deg/s (for big cores).
>> In my case, there are 4 thermal zones, each cluster has it's private
>> sensor and thermal zone. There is no 'SoC sensor' or 'PCB sensor',
>> which is recommended for IPA.
>>>>
>>>> Could you tell me which thermal governor was used in your case?
>>>> Please also share the name of that benchmark, i will give it a try.
>>>> Is it single threaded compute-intensive?
>>>
>>> Step-wise governor.
>>> I use aobench which is part of phoronix-test-suite.
>>>
>>> Regards
>>> Thara
>>>
>> I have built this aobench and run it pinned to single big cpu:
>> time taskset -c 4 ./aobench
> 
> Why have you pinned the test only on CPU4 ?
> Goal of thermal pressure is to inform the scheduler of reduced compute
> capacity and help the scheduler to take better decision in tasks
> placement.

Hi Lukasz,

I agree with Vincent's observation. I had not seen this earlier. Pinning
a task to a cpu will obviously prevent migration. The performance
regression could be due to as Vincent mentioned below other tasks in the
system. On another note, which cpufreq governor are you using? Is the
core being bumped up to highest possible OPP during the test?

Regards
Thara
> So I would not expect perf impact on your test as the bench will stay
> pinned on the cpu4
> That being said you obviously have perf impact as shown below in your results
> And other tasks on the system are not pinned and might come and
> disturb your bench
> 
>> The results:
>> 3min-5:30min [mainline]
>> 5:15min-5:50min [+patchset]
>>
>> The idea is definitely worth to investigate further.
> 
> Yes I agree
> 
> Vincent
>>
>> Regards,
>> Lukasz
>>
>>
>>


-- 
Regards
Thara

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-17 16:21         ` Thara Gopinath
@ 2018-10-18  6:48           ` Ingo Molnar
  2018-10-18  7:08             ` Rafael J. Wysocki
  2018-10-18 16:17             ` [RFC PATCH 0/7] Introduce thermal pressure Thara Gopinath
  0 siblings, 2 replies; 67+ messages in thread
From: Ingo Molnar @ 2018-10-18  6:48 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: linux-kernel, mingo, peterz, rui.zhang, gregkh, rafael,
	amit.kachhap, viresh.kumar, javi.merino, edubezval,
	daniel.lezcano, linux-pm, quentin.perret, ionela.voinescu,
	vincent.guittot


* Thara Gopinath <thara.gopinath@linaro.org> wrote:

> On 10/16/2018 03:33 AM, Ingo Molnar wrote:
> > 
> > * Thara Gopinath <thara.gopinath@linaro.org> wrote:
> > 
> >>>> Regarding testing, basic build, boot and sanity testing have been
> >>>> performed on hikey960 mainline kernel with debian file system.
> >>>> Further aobench (An occlusion renderer for benchmarking realworld
> >>>> floating point performance) showed the following results on hikey960
> >>>> with debain.
> >>>>
> >>>>                                         Result          Standard        Standard
> >>>>                                         (Time secs)     Error           Deviation
> >>>> Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%
> >>>> Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%
> >>>
> >>> Wow, +13% speedup, impressive! We definitely want this outcome.
> >>>
> >>> I'm wondering what happens if we do not track and decay the thermal 
> >>> load at all at the PELT level, but instantaneously decrease/increase 
> >>> effective CPU capacity in reaction to thermal events we receive from 
> >>> the CPU.
> >>
> >> The problem with instantaneous update is that sometimes thermal events 
> >> happen at a much faster pace than cpu_capacity is updated in the 
> >> scheduler. This means that at the moment when scheduler uses the 
> >> value, it might not be correct anymore.
> > 
> > Let me offer a different interpretation: if we average throttling events 
> > then we create a 'smooth' average of 'true CPU capacity' that doesn't 
> > fluctuate much. This allows more stable yet asymmetric task placement if 
> > the thermal characteristics of the different cores is different 
> > (asymmetric). This, compared to instantaneous updates, would reduce 
> > unnecessary task migrations between cores.
> > 
> > Is that accurate?
> 
> Yes. I think it is accurate. I will also add that if we don't average
> throttling events, we will miss the events that occur in between load
> balancing(LB) period.

Yeah, so I'd definitely suggest to not integrate this averaging into 
pelt.c in the fashion presented, because:

 - This couples your thermal throttling averaging to the PELT decay
   half-time AFAICS, which would break the other user every time the
   decay is changed/tuned.

 - The boolean flag that changes behavior in pelt.c is not particularly
   clean either and complicates the code.

 - Instead maybe factor out a decaying average library into
   kernel/sched/avg.h perhaps (if this truly improves the code), and use
   those methods both in pelt.c and any future thermal.c - and maybe
   other places where we do decaying averages.

 - But simple decaying averages are not that complex either, so I think
   your original solution of open coding it is probably fine as well. 

Furthermore, any logic introduced by thermal.c and the resulting change 
to load-balancing behavior would have to be in perfect sync with cpufreq 
governor actions - one mechanism should not work against the other.

The only long term maintainable solution is to move all high level 
cpufreq logic and policy handling code into kernel/sched/cpufreq*.c, 
which has been done to a fair degree already in the past ~2 years - but 
it's unclear to me to what extent this is true for thermal throttling 
policy currently: there might be more governor surgery and code 
reshuffling required?

The short term goal would be to at minimum have all the bugs lined up in 
kernel/sched/* neatly, so that we have the chance to see and fix them in 
a single place. ;-)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-18  6:48           ` Ingo Molnar
@ 2018-10-18  7:08             ` Rafael J. Wysocki
  2018-10-18  7:50               ` Ingo Molnar
  2018-10-18 16:17             ` [RFC PATCH 0/7] Introduce thermal pressure Thara Gopinath
  1 sibling, 1 reply; 67+ messages in thread
From: Rafael J. Wysocki @ 2018-10-18  7:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thara Gopinath, Linux Kernel Mailing List, Ingo Molnar,
	Peter Zijlstra, Zhang, Rui, Greg Kroah-Hartman,
	Rafael J. Wysocki, Amit Kachhap, Viresh Kumar, Javi Merino,
	Eduardo Valentin, Daniel Lezcano, Linux PM, Quentin Perret,
	ionela.voinescu, Vincent Guittot

On Thu, Oct 18, 2018 at 8:48 AM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Thara Gopinath <thara.gopinath@linaro.org> wrote:
>
> > On 10/16/2018 03:33 AM, Ingo Molnar wrote:
> > >
> > > * Thara Gopinath <thara.gopinath@linaro.org> wrote:
> > >
> > >>>> Regarding testing, basic build, boot and sanity testing have been
> > >>>> performed on hikey960 mainline kernel with debian file system.
> > >>>> Further aobench (An occlusion renderer for benchmarking realworld
> > >>>> floating point performance) showed the following results on hikey960
> > >>>> with debain.
> > >>>>
> > >>>>                                         Result          Standard        Standard
> > >>>>                                         (Time secs)     Error           Deviation
> > >>>> Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%
> > >>>> Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%
> > >>>
> > >>> Wow, +13% speedup, impressive! We definitely want this outcome.
> > >>>
> > >>> I'm wondering what happens if we do not track and decay the thermal
> > >>> load at all at the PELT level, but instantaneously decrease/increase
> > >>> effective CPU capacity in reaction to thermal events we receive from
> > >>> the CPU.
> > >>
> > >> The problem with instantaneous update is that sometimes thermal events
> > >> happen at a much faster pace than cpu_capacity is updated in the
> > >> scheduler. This means that at the moment when scheduler uses the
> > >> value, it might not be correct anymore.
> > >
> > > Let me offer a different interpretation: if we average throttling events
> > > then we create a 'smooth' average of 'true CPU capacity' that doesn't
> > > fluctuate much. This allows more stable yet asymmetric task placement if
> > > the thermal characteristics of the different cores is different
> > > (asymmetric). This, compared to instantaneous updates, would reduce
> > > unnecessary task migrations between cores.
> > >
> > > Is that accurate?
> >
> > Yes. I think it is accurate. I will also add that if we don't average
> > throttling events, we will miss the events that occur in between load
> > balancing(LB) period.
>
> Yeah, so I'd definitely suggest to not integrate this averaging into
> pelt.c in the fashion presented, because:
>
>  - This couples your thermal throttling averaging to the PELT decay
>    half-time AFAICS, which would break the other user every time the
>    decay is changed/tuned.
>
>  - The boolean flag that changes behavior in pelt.c is not particularly
>    clean either and complicates the code.
>
>  - Instead maybe factor out a decaying average library into
>    kernel/sched/avg.h perhaps (if this truly improves the code), and use
>    those methods both in pelt.c and any future thermal.c - and maybe
>    other places where we do decaying averages.
>
>  - But simple decaying averages are not that complex either, so I think
>    your original solution of open coding it is probably fine as well.
>
> Furthermore, any logic introduced by thermal.c and the resulting change
> to load-balancing behavior would have to be in perfect sync with cpufreq
> governor actions - one mechanism should not work against the other.

Right, that really is required.

> The only long term maintainable solution is to move all high level
> cpufreq logic and policy handling code into kernel/sched/cpufreq*.c,
> which has been done to a fair degree already in the past ~2 years - but
> it's unclear to me to what extent this is true for thermal throttling
> policy currently: there might be more governor surgery and code
> reshuffling required?

It doesn't cover thermal management directly ATM.

The EAS work kind of hopes to make a connection in there by adding a
common energy model to underlie both the performance scaling and
thermal management, but it doesn't change the thermal decision making
part AFAICS.

So it is fair to say that additional governor surgery and code
reshuffling will be required IMO.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-18  7:08             ` Rafael J. Wysocki
@ 2018-10-18  7:50               ` Ingo Molnar
  2018-10-18  8:14                 ` Rafael J. Wysocki
  0 siblings, 1 reply; 67+ messages in thread
From: Ingo Molnar @ 2018-10-18  7:50 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Thara Gopinath, Linux Kernel Mailing List, Ingo Molnar,
	Peter Zijlstra, Zhang, Rui, Greg Kroah-Hartman, Amit Kachhap,
	Viresh Kumar, Javi Merino, Eduardo Valentin, Daniel Lezcano,
	Linux PM, Quentin Perret, ionela.voinescu, Vincent Guittot


* Rafael J. Wysocki <rafael@kernel.org> wrote:

> > The only long term maintainable solution is to move all high level
> > cpufreq logic and policy handling code into kernel/sched/cpufreq*.c,
> > which has been done to a fair degree already in the past ~2 years - but
> > it's unclear to me to what extent this is true for thermal throttling
> > policy currently: there might be more governor surgery and code
> > reshuffling required?
> 
> It doesn't cover thermal management directly ATM.
> 
> The EAS work kind of hopes to make a connection in there by adding a
> common energy model to underlie both the performance scaling and
> thermal management, but it doesn't change the thermal decision making
> part AFAICS.
> 
> So it is fair to say that additional governor surgery and code
> reshuffling will be required IMO.

BTW., when factoring out high level thermal management code it might make 
sense to increase the prominence of the cpufreq code within the scheduler 
and organize it a bit better, by introducing its own 
kernel/sched/cpufreq/ directory and renaming things the following way:

  kernel/sched/cpufreq.c		=> kernel/sched/cpufreq/core.c
  kernel/sched/cpufreq_schedutil.c	=> kernel/sched/cpufreq/metrics.c
  kernel/sched/thermal.c		=> kernel/sched/cpufreq/thermal.c

... or so?

With no change to functionality, this is just a re-organization and 
expansion/preparation for the bright future. =B-)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-17 16:24           ` Thara Gopinath
@ 2018-10-18  8:00             ` Lukasz Luba
  0 siblings, 0 replies; 67+ messages in thread
From: Lukasz Luba @ 2018-10-18  8:00 UTC (permalink / raw)
  To: Thara Gopinath, Vincent Guittot
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Zhang Rui, gregkh,
	Rafael J. Wysocki, Amit Kachhap, viresh kumar, Javi Merino,
	Eduardo Valentin, Daniel Lezcano, open list:THERMAL,
	Quentin Perret, Ionela Voinescu, b.zolnierkie


On 10/17/2018 06:24 PM, Thara Gopinath wrote:
> On 10/16/2018 01:11 PM, Vincent Guittot wrote:
>> Hi Lukasz,
>>
>> On Thu, 11 Oct 2018 at 13:10, Lukasz Luba <l.luba@partner.samsung.com> wrote:
>>>
>>>
>>>
>>> On 10/10/2018 07:30 PM, Thara Gopinath wrote:
>>>> Hello Lukasz,
>>>>
>>>> On 10/10/2018 11:35 AM, Lukasz Luba wrote:
>>>>> Hi Thara,
>>>>>
>>>>> I have run it on Exynos5433 mainline.
>>>>> When it is enabled with step_wise thermal governor,
>>>>> some of my tests are showing ~30-50% regression (i.e. hackbench),
>>>>> dhrystone ~10%.
>>>>
>>>> That is interesting. If I understand correctly, dhrystone spawns 16
>>>> threads or so and floods the system. In "theory", such a test should not
>>>> see any performance improvement and degradation. What is the thermal
>>>> activity like in your system? I will try running one of these tests on
>>>> hikey960.
>>> I use this dhrystone implementation:
>>> https://github.com/Keith-S-Thompson/dhrystone/blob/master/v2.2/dry.c
>>> It does not span new threads/processes and I pinned it to a single cpu.
>>>
>>> My thermal setup is probably different than yours.
>>> You have (on hikey960) probably 1 sensor for whole SoC and one thermal
>>> zone (if it is this mainline file:
>>> arch/arm64/boot/dts/hisilicon/hi3660.dtsi).
>>> This thermal zone has two cooling devices - two clusters with dvfs.
>>> Your temperature signal read out from that sensor is probably much
>>> smoother. When you have sensor inside cluster, the rising factor
>>> can be even 20deg/s (for big cores).
>>> In my case, there are 4 thermal zones, each cluster has it's private
>>> sensor and thermal zone. There is no 'SoC sensor' or 'PCB sensor',
>>> which is recommended for IPA.
>>>>>
>>>>> Could you tell me which thermal governor was used in your case?
>>>>> Please also share the name of that benchmark, i will give it a try.
>>>>> Is it single threaded compute-intensive?
>>>>
>>>> Step-wise governor.
>>>> I use aobench which is part of phoronix-test-suite.
>>>>
>>>> Regards
>>>> Thara
>>>>
>>> I have built this aobench and run it pinned to single big cpu:
>>> time taskset -c 4 ./aobench
>>
>> Why have you pinned the test only on CPU4 ?
>> Goal of thermal pressure is to inform the scheduler of reduced compute
>> capacity and help the scheduler to take better decision in tasks
>> placement.
> 
> Hi Lukasz,
> 
> I agree with Vincent's observation. I had not seen this earlier. Pinning
> a task to a cpu will obviously prevent migration. The performance
> regression could be due to as Vincent mentioned below other tasks in the
> system. On another note, which cpufreq governor are you using? Is the
> core being bumped up to highest possible OPP during the test?
The governor is ondemand. No, it is not at highest OPP.
Could you send me the needed test configuration and condition?
We would then align in this area.

Regards,
Lukasz
> 
> Regards
> Thara
>> So I would not expect perf impact on your test as the bench will stay
>> pinned on the cpu4
>> That being said you obviously have perf impact as shown below in your results
>> And other tasks on the system are not pinned and might come and
>> disturb your bench
>>
>>> The results:
>>> 3min-5:30min [mainline]
>>> 5:15min-5:50min [+patchset]
>>>
>>> The idea is definitely worth to investigate further.
>>
>> Yes I agree
>>
>> Vincent
>>>
>>> Regards,
>>> Lukasz
>>>
>>>
>>>
> 
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-16 17:11         ` Vincent Guittot
  2018-10-17 16:24           ` Thara Gopinath
@ 2018-10-18  8:12           ` Lukasz Luba
  1 sibling, 0 replies; 67+ messages in thread
From: Lukasz Luba @ 2018-10-18  8:12 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Thara Gopinath, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Zhang Rui, gregkh, Rafael J. Wysocki, Amit Kachhap, viresh kumar,
	Javi Merino, Eduardo Valentin, Daniel Lezcano, open list:THERMAL,
	Quentin Perret, Ionela Voinescu, b.zolnierkie

Hi Vincent,

On 10/16/2018 07:11 PM, Vincent Guittot wrote:
> Hi Lukasz,
> 
> On Thu, 11 Oct 2018 at 13:10, Lukasz Luba <l.luba@partner.samsung.com> wrote:
>>
>>
>>
>> On 10/10/2018 07:30 PM, Thara Gopinath wrote:
>>> Hello Lukasz,
>>>
>>> On 10/10/2018 11:35 AM, Lukasz Luba wrote:
>>>> Hi Thara,
>>>>
>>>> I have run it on Exynos5433 mainline.
>>>> When it is enabled with step_wise thermal governor,
>>>> some of my tests are showing ~30-50% regression (i.e. hackbench),
>>>> dhrystone ~10%.
>>>
>>> That is interesting. If I understand correctly, dhrystone spawns 16
>>> threads or so and floods the system. In "theory", such a test should not
>>> see any performance improvement and degradation. What is the thermal
>>> activity like in your system? I will try running one of these tests on
>>> hikey960.
>> I use this dhrystone implementation:
>> https://github.com/Keith-S-Thompson/dhrystone/blob/master/v2.2/dry.c
>> It does not span new threads/processes and I pinned it to a single cpu.
>>
>> My thermal setup is probably different than yours.
>> You have (on hikey960) probably 1 sensor for whole SoC and one thermal
>> zone (if it is this mainline file:
>> arch/arm64/boot/dts/hisilicon/hi3660.dtsi).
>> This thermal zone has two cooling devices - two clusters with dvfs.
>> Your temperature signal read out from that sensor is probably much
>> smoother. When you have sensor inside cluster, the rising factor
>> can be even 20deg/s (for big cores).
>> In my case, there are 4 thermal zones, each cluster has it's private
>> sensor and thermal zone. There is no 'SoC sensor' or 'PCB sensor',
>> which is recommended for IPA.
>>>>
>>>> Could you tell me which thermal governor was used in your case?
>>>> Please also share the name of that benchmark, i will give it a try.
>>>> Is it single threaded compute-intensive?
>>>
>>> Step-wise governor.
>>> I use aobench which is part of phoronix-test-suite.
>>>
>>> Regards
>>> Thara
>>>
>> I have built this aobench and run it pinned to single big cpu:
>> time taskset -c 4 ./aobench
> 
> Why have you pinned the test only on CPU4 ?
> Goal of thermal pressure is to inform the scheduler of reduced compute
> capacity and help the scheduler to take better decision in tasks
> placement.
> So I would not expect perf impact on your test as the bench will stay
> pinned on the cpu4
> That being said you obviously have perf impact as shown below in your results
> And other tasks on the system are not pinned and might come and
> disturb your bench
Unpinned runs on this platform have a lot of variation. The tests take
even 16min to finish.

Regards,
Lukasz
> 
>> The results:
>> 3min-5:30min [mainline]
>> 5:15min-5:50min [+patchset]
>>
>> The idea is definitely worth to investigate further.
> 
> Yes I agree
> 
> Vincent
>>
>> Regards,
>> Lukasz
>>
>>
>>
> 
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-18  7:50               ` Ingo Molnar
@ 2018-10-18  8:14                 ` Rafael J. Wysocki
  2018-10-18  9:35                   ` [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files Daniel Lezcano
  2018-10-18  9:44                   ` [PATCH V2 " Daniel Lezcano
  0 siblings, 2 replies; 67+ messages in thread
From: Rafael J. Wysocki @ 2018-10-18  8:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rafael J. Wysocki, Thara Gopinath, Linux Kernel Mailing List,
	Ingo Molnar, Peter Zijlstra, Zhang, Rui, Greg Kroah-Hartman,
	Amit Kachhap, Viresh Kumar, Javi Merino, Eduardo Valentin,
	Daniel Lezcano, Linux PM, Quentin Perret, ionela.voinescu,
	Vincent Guittot

On Thu, Oct 18, 2018 at 9:50 AM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> > > The only long term maintainable solution is to move all high level
> > > cpufreq logic and policy handling code into kernel/sched/cpufreq*.c,
> > > which has been done to a fair degree already in the past ~2 years - but
> > > it's unclear to me to what extent this is true for thermal throttling
> > > policy currently: there might be more governor surgery and code
> > > reshuffling required?
> >
> > It doesn't cover thermal management directly ATM.
> >
> > The EAS work kind of hopes to make a connection in there by adding a
> > common energy model to underlie both the performance scaling and
> > thermal management, but it doesn't change the thermal decision making
> > part AFAICS.
> >
> > So it is fair to say that additional governor surgery and code
> > reshuffling will be required IMO.
>
> BTW., when factoring out high level thermal management code it might make
> sense to increase the prominence of the cpufreq code within the scheduler
> and organize it a bit better, by introducing its own
> kernel/sched/cpufreq/ directory and renaming things the following way:
>
>   kernel/sched/cpufreq.c                => kernel/sched/cpufreq/core.c
>   kernel/sched/cpufreq_schedutil.c      => kernel/sched/cpufreq/metrics.c
>   kernel/sched/thermal.c                => kernel/sched/cpufreq/thermal.c
>
> ... or so?
>
> With no change to functionality, this is just a re-organization and
> expansion/preparation for the bright future. =B-)

No disagreement here. :-)

Cheers,
Rafael

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files
  2018-10-18  8:14                 ` Rafael J. Wysocki
@ 2018-10-18  9:35                   ` Daniel Lezcano
  2018-10-18  9:35                     ` [PATCH 2/2] sched/cpufreq: Add the SPDX tags Daniel Lezcano
                                       ` (4 more replies)
  2018-10-18  9:44                   ` [PATCH V2 " Daniel Lezcano
  1 sibling, 5 replies; 67+ messages in thread
From: Daniel Lezcano @ 2018-10-18  9:35 UTC (permalink / raw)
  To: rafael, mingo
  Cc: thara.gopinath, linux-kernel, peterz, rui.zhang, gregkh,
	viresh.kumar, amit.kachhap, javi.merino, edubezval, linux-pm,
	quentin.perret, ionela.voinescu, vincent.guittot, Ingo Molnar,
	Rafael J. Wysocki, Juri Lelli, Patrick Bellasi

It was suggested to set the scene for the PM components in the
scheduler code organization in the recent discussion about making the
scheduler aware of the capacity capping from the thermal framework.

Move the cpufreq files into its own directory as suggested at:

https://lkml.org/lkml/2018/10/18/353
https://lkml.org/lkml/2018/10/18/408

Suggested-by: Ingo Molnar <mingo@kernel.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
---
 kernel/sched/Makefile                                   | 3 +--
 kernel/sched/{cpufreq.c => cpufreq/core.c}              | 2 +-
 kernel/sched/{cpufreq_schedutil.c => cpufreq/metrics.c} | 2 +-
 3 files changed, 3 insertions(+), 4 deletions(-)
 rename kernel/sched/{cpufreq.c => cpufreq/core.c} (99%)
 rename kernel/sched/{cpufreq_schedutil.c => cpufreq/metrics.c} (99%)

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 7fe1834..bc6bce0 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,13 +19,12 @@ endif
 obj-y += core.o loadavg.o clock.o cputime.o
 obj-y += idle.o fair.o rt.o deadline.o
 obj-y += wait.o wait_bit.o swait.o completion.o
+obj-y += cpufreq/
 
 obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
 obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
-obj-$(CONFIG_CPU_FREQ) += cpufreq.o
-obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq/core.c
similarity index 99%
rename from kernel/sched/cpufreq.c
rename to kernel/sched/cpufreq/core.c
index 5e54cbc..8c17a63 100644
--- a/kernel/sched/cpufreq.c
+++ b/kernel/sched/cpufreq/core.c
@@ -8,7 +8,7 @@
  * it under the terms of the GNU General Public License version 2 as
  * published by the Free Software Foundation.
  */
-#include "sched.h"
+#include "../sched.h"
 
 DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
 
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq/metrics.c
similarity index 99%
rename from kernel/sched/cpufreq_schedutil.c
rename to kernel/sched/cpufreq/metrics.c
index 3fffad3..597df47 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq/metrics.c
@@ -11,7 +11,7 @@
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
-#include "sched.h"
+#include "../sched.h"
 
 #include <trace/events/power.h>
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 2/2] sched/cpufreq: Add the SPDX tags
  2018-10-18  9:35                   ` [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files Daniel Lezcano
@ 2018-10-18  9:35                     ` Daniel Lezcano
  2018-10-18  9:42                     ` [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files Rafael J. Wysocki
                                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 67+ messages in thread
From: Daniel Lezcano @ 2018-10-18  9:35 UTC (permalink / raw)
  To: rafael, mingo
  Cc: thara.gopinath, linux-kernel, peterz, rui.zhang, gregkh,
	viresh.kumar, amit.kachhap, javi.merino, edubezval, linux-pm,
	quentin.perret, ionela.voinescu, vincent.guittot, Ingo Molnar,
	Rafael J. Wysocki, Juri Lelli, Patrick Bellasi

The SPDX tags are not present in core.c and metrics.c.

Replace the license description by the SPDX tags.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
---
 kernel/sched/cpufreq/core.c    | 4 +---
 kernel/sched/cpufreq/metrics.c | 4 +---
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/cpufreq/core.c b/kernel/sched/cpufreq/core.c
index 8c17a63..7a1400c 100644
--- a/kernel/sched/cpufreq/core.c
+++ b/kernel/sched/cpufreq/core.c
@@ -1,12 +1,10 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Scheduler code and data structures related to cpufreq.
  *
  * Copyright (C) 2016, Intel Corporation
  * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
  */
 #include "../sched.h"
 
diff --git a/kernel/sched/cpufreq/metrics.c b/kernel/sched/cpufreq/metrics.c
index 597df47..9083a70 100644
--- a/kernel/sched/cpufreq/metrics.c
+++ b/kernel/sched/cpufreq/metrics.c
@@ -1,12 +1,10 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * CPUFreq governor based on scheduler-provided CPU utilization data.
  *
  * Copyright (C) 2016, Intel Corporation
  * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
  */
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-- 
2.7.4


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files
  2018-10-18  9:35                   ` [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files Daniel Lezcano
  2018-10-18  9:35                     ` [PATCH 2/2] sched/cpufreq: Add the SPDX tags Daniel Lezcano
@ 2018-10-18  9:42                     ` Rafael J. Wysocki
  2018-10-18  9:54                       ` Daniel Lezcano
  2018-10-18  9:45                     ` Daniel Lezcano
                                       ` (2 subsequent siblings)
  4 siblings, 1 reply; 67+ messages in thread
From: Rafael J. Wysocki @ 2018-10-18  9:42 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Rafael J. Wysocki, Ingo Molnar, Thara Gopinath,
	Linux Kernel Mailing List, Peter Zijlstra, Zhang, Rui,
	Greg Kroah-Hartman, Viresh Kumar, Amit Kachhap, Javi Merino,
	Eduardo Valentin, Linux PM, Quentin Perret, ionela.voinescu,
	Vincent Guittot, Ingo Molnar, Rafael Wysocki, Juri Lelli,
	Patrick Bellasi

On Thu, Oct 18, 2018 at 11:36 AM Daniel Lezcano
<daniel.lezcano@linaro.org> wrote:
>
> It was suggested to set the scene for the PM components in the
> scheduler code organization in the recent discussion about making the
> scheduler aware of the capacity capping from the thermal framework.
>
> Move the cpufreq files into its own directory as suggested at:
>
> https://lkml.org/lkml/2018/10/18/353
> https://lkml.org/lkml/2018/10/18/408

Fair enough, but do we need to do that right now?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH V2 1/2] sched/cpufreq: Reorganize the cpufreq files
  2018-10-18  8:14                 ` Rafael J. Wysocki
  2018-10-18  9:35                   ` [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files Daniel Lezcano
@ 2018-10-18  9:44                   ` Daniel Lezcano
  2018-10-18  9:44                     ` [PATCH V2 2/2] sched/cpufreq: Add the SPDX tags Daniel Lezcano
  1 sibling, 1 reply; 67+ messages in thread
From: Daniel Lezcano @ 2018-10-18  9:44 UTC (permalink / raw)
  To: rafael, mingo
  Cc: thara.gopinath, linux-kernel, peterz, rui.zhang, gregkh,
	viresh.kumar, amit.kachhap, javi.merino, edubezval, linux-pm,
	quentin.perret, ionela.voinescu, vincent.guittot, Ingo Molnar,
	Rafael J. Wysocki, Juri Lelli, Patrick Bellasi

It was suggested to set the scene for the PM components in the
scheduler code organization in the recent discussion about making the
scheduler aware of the capacity capping from the thermal framework.

Move the cpufreq files into its own directory as suggested at:

https://lkml.org/lkml/2018/10/18/353
https://lkml.org/lkml/2018/10/18/408

Suggested-by: Ingo Molnar <mingo@kernel.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
---
Changelog:
 * git added the Makefile in cpufreq/Makefile (V2)

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
---
 kernel/sched/Makefile                                   | 3 +--
 kernel/sched/cpufreq/Makefile                           | 3 +++
 kernel/sched/{cpufreq.c => cpufreq/core.c}              | 2 +-
 kernel/sched/{cpufreq_schedutil.c => cpufreq/metrics.c} | 2 +-
 4 files changed, 6 insertions(+), 4 deletions(-)
 create mode 100644 kernel/sched/cpufreq/Makefile
 rename kernel/sched/{cpufreq.c => cpufreq/core.c} (99%)
 rename kernel/sched/{cpufreq_schedutil.c => cpufreq/metrics.c} (99%)

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 7fe1834..bc6bce0 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,13 +19,12 @@ endif
 obj-y += core.o loadavg.o clock.o cputime.o
 obj-y += idle.o fair.o rt.o deadline.o
 obj-y += wait.o wait_bit.o swait.o completion.o
+obj-y += cpufreq/
 
 obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
 obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
-obj-$(CONFIG_CPU_FREQ) += cpufreq.o
-obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
diff --git a/kernel/sched/cpufreq/Makefile b/kernel/sched/cpufreq/Makefile
new file mode 100644
index 0000000..4bf1087
--- /dev/null
+++ b/kernel/sched/cpufreq/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+obj-$(CONFIG_CPU_FREQ) += core.o
+obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += metrics.o
diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq/core.c
similarity index 99%
rename from kernel/sched/cpufreq.c
rename to kernel/sched/cpufreq/core.c
index 5e54cbc..8c17a63 100644
--- a/kernel/sched/cpufreq.c
+++ b/kernel/sched/cpufreq/core.c
@@ -8,7 +8,7 @@
  * it under the terms of the GNU General Public License version 2 as
  * published by the Free Software Foundation.
  */
-#include "sched.h"
+#include "../sched.h"
 
 DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
 
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq/metrics.c
similarity index 99%
rename from kernel/sched/cpufreq_schedutil.c
rename to kernel/sched/cpufreq/metrics.c
index 3fffad3..597df47 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq/metrics.c
@@ -11,7 +11,7 @@
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
-#include "sched.h"
+#include "../sched.h"
 
 #include <trace/events/power.h>
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH V2 2/2] sched/cpufreq: Add the SPDX tags
  2018-10-18  9:44                   ` [PATCH V2 " Daniel Lezcano
@ 2018-10-18  9:44                     ` Daniel Lezcano
  0 siblings, 0 replies; 67+ messages in thread
From: Daniel Lezcano @ 2018-10-18  9:44 UTC (permalink / raw)
  To: rafael, mingo
  Cc: thara.gopinath, linux-kernel, peterz, rui.zhang, gregkh,
	viresh.kumar, amit.kachhap, javi.merino, edubezval, linux-pm,
	quentin.perret, ionela.voinescu, vincent.guittot, Ingo Molnar,
	Rafael J. Wysocki, Juri Lelli, Patrick Bellasi

The SPDX tags are not present in core.c and metrics.c.

Replace the license description by the SPDX tags.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
---
 kernel/sched/cpufreq/core.c    | 4 +---
 kernel/sched/cpufreq/metrics.c | 4 +---
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/cpufreq/core.c b/kernel/sched/cpufreq/core.c
index 8c17a63..7a1400c 100644
--- a/kernel/sched/cpufreq/core.c
+++ b/kernel/sched/cpufreq/core.c
@@ -1,12 +1,10 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Scheduler code and data structures related to cpufreq.
  *
  * Copyright (C) 2016, Intel Corporation
  * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
  */
 #include "../sched.h"
 
diff --git a/kernel/sched/cpufreq/metrics.c b/kernel/sched/cpufreq/metrics.c
index 597df47..9083a70 100644
--- a/kernel/sched/cpufreq/metrics.c
+++ b/kernel/sched/cpufreq/metrics.c
@@ -1,12 +1,10 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * CPUFreq governor based on scheduler-provided CPU utilization data.
  *
  * Copyright (C) 2016, Intel Corporation
  * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
  */
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-- 
2.7.4


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files
  2018-10-18  9:35                   ` [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files Daniel Lezcano
  2018-10-18  9:35                     ` [PATCH 2/2] sched/cpufreq: Add the SPDX tags Daniel Lezcano
  2018-10-18  9:42                     ` [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files Rafael J. Wysocki
@ 2018-10-18  9:45                     ` Daniel Lezcano
  2018-10-19  5:24                     ` kbuild test robot
  2018-10-19  5:52                     ` kbuild test robot
  4 siblings, 0 replies; 67+ messages in thread
From: Daniel Lezcano @ 2018-10-18  9:45 UTC (permalink / raw)
  To: rafael, mingo
  Cc: thara.gopinath, linux-kernel, peterz, rui.zhang, gregkh,
	viresh.kumar, amit.kachhap, javi.merino, edubezval, linux-pm,
	quentin.perret, ionela.voinescu, vincent.guittot, Ingo Molnar,
	Rafael J. Wysocki, Juri Lelli, Patrick Bellasi

On 18/10/2018 11:35, Daniel Lezcano wrote:
> It was suggested to set the scene for the PM components in the
> scheduler code organization in the recent discussion about making the
> scheduler aware of the capacity capping from the thermal framework.
> 
> Move the cpufreq files into its own directory as suggested at:
> 
> https://lkml.org/lkml/2018/10/18/353
> https://lkml.org/lkml/2018/10/18/408
> 
> Suggested-by: Ingo Molnar <mingo@kernel.org>
> Cc: Rafael J. Wysocki <rafael@kernel.org>
> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> ---
>  kernel/sched/Makefile                                   | 3 +--
>  kernel/sched/{cpufreq.c => cpufreq/core.c}              | 2 +-
>  kernel/sched/{cpufreq_schedutil.c => cpufreq/metrics.c} | 2 +-


I forget to git add the cpufreq/Makefile, resending.



-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files
  2018-10-18  9:42                     ` [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files Rafael J. Wysocki
@ 2018-10-18  9:54                       ` Daniel Lezcano
  2018-10-18 10:06                         ` Rafael J. Wysocki
  0 siblings, 1 reply; 67+ messages in thread
From: Daniel Lezcano @ 2018-10-18  9:54 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Ingo Molnar, Thara Gopinath, Linux Kernel Mailing List,
	Peter Zijlstra, Zhang, Rui, Greg Kroah-Hartman, Viresh Kumar,
	Amit Kachhap, Javi Merino, Eduardo Valentin, Linux PM,
	Quentin Perret, ionela.voinescu, Vincent Guittot, Ingo Molnar,
	Rafael Wysocki, Juri Lelli, Patrick Bellasi

On 18/10/2018 11:42, Rafael J. Wysocki wrote:
> On Thu, Oct 18, 2018 at 11:36 AM Daniel Lezcano
> <daniel.lezcano@linaro.org> wrote:
>>
>> It was suggested to set the scene for the PM components in the
>> scheduler code organization in the recent discussion about making the
>> scheduler aware of the capacity capping from the thermal framework.
>>
>> Move the cpufreq files into its own directory as suggested at:
>>
>> https://lkml.org/lkml/2018/10/18/353
>> https://lkml.org/lkml/2018/10/18/408
> 
> Fair enough, but do we need to do that right now?

I'm not planning to do more code reorg than this patch right now. Up to
you to decide if you are willing to take them.


-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files
  2018-10-18  9:54                       ` Daniel Lezcano
@ 2018-10-18 10:06                         ` Rafael J. Wysocki
  2018-10-18 10:13                           ` Daniel Lezcano
  0 siblings, 1 reply; 67+ messages in thread
From: Rafael J. Wysocki @ 2018-10-18 10:06 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Rafael J. Wysocki, Ingo Molnar, Thara Gopinath,
	Linux Kernel Mailing List, Peter Zijlstra, Zhang, Rui,
	Greg Kroah-Hartman, Viresh Kumar, Amit Kachhap, Javi Merino,
	Eduardo Valentin, Linux PM, Quentin Perret, ionela.voinescu,
	Vincent Guittot, Ingo Molnar, Rafael Wysocki, Juri Lelli,
	Patrick Bellasi

On Thu, Oct 18, 2018 at 11:54 AM Daniel Lezcano
<daniel.lezcano@linaro.org> wrote:
>
> On 18/10/2018 11:42, Rafael J. Wysocki wrote:
> > On Thu, Oct 18, 2018 at 11:36 AM Daniel Lezcano
> > <daniel.lezcano@linaro.org> wrote:
> >>
> >> It was suggested to set the scene for the PM components in the
> >> scheduler code organization in the recent discussion about making the
> >> scheduler aware of the capacity capping from the thermal framework.
> >>
> >> Move the cpufreq files into its own directory as suggested at:
> >>
> >> https://lkml.org/lkml/2018/10/18/353
> >> https://lkml.org/lkml/2018/10/18/408
> >
> > Fair enough, but do we need to do that right now?
>
> I'm not planning to do more code reorg than this patch right now. Up to
> you to decide if you are willing to take them.

The SPDX one certainly is applicable, but I'm not sure about the other one TBH.

Why don't you add the SPDX IDs to those files as they are for now?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files
  2018-10-18 10:06                         ` Rafael J. Wysocki
@ 2018-10-18 10:13                           ` Daniel Lezcano
  0 siblings, 0 replies; 67+ messages in thread
From: Daniel Lezcano @ 2018-10-18 10:13 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Ingo Molnar, Thara Gopinath, Linux Kernel Mailing List,
	Peter Zijlstra, Zhang, Rui, Greg Kroah-Hartman, Viresh Kumar,
	Amit Kachhap, Javi Merino, Eduardo Valentin, Linux PM,
	Quentin Perret, ionela.voinescu, Vincent Guittot, Ingo Molnar,
	Rafael Wysocki, Juri Lelli, Patrick Bellasi

On 18/10/2018 12:06, Rafael J. Wysocki wrote:
> On Thu, Oct 18, 2018 at 11:54 AM Daniel Lezcano
> <daniel.lezcano@linaro.org> wrote:
>>
>> On 18/10/2018 11:42, Rafael J. Wysocki wrote:
>>> On Thu, Oct 18, 2018 at 11:36 AM Daniel Lezcano
>>> <daniel.lezcano@linaro.org> wrote:
>>>>
>>>> It was suggested to set the scene for the PM components in the
>>>> scheduler code organization in the recent discussion about making the
>>>> scheduler aware of the capacity capping from the thermal framework.
>>>>
>>>> Move the cpufreq files into its own directory as suggested at:
>>>>
>>>> https://lkml.org/lkml/2018/10/18/353
>>>> https://lkml.org/lkml/2018/10/18/408
>>>
>>> Fair enough, but do we need to do that right now?
>>
>> I'm not planning to do more code reorg than this patch right now. Up to
>> you to decide if you are willing to take them.
> 
> The SPDX one certainly is applicable, but I'm not sure about the other one TBH.
> 
> Why don't you add the SPDX IDs to those files as they are for now?

Yes, sure.


-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-18  6:48           ` Ingo Molnar
  2018-10-18  7:08             ` Rafael J. Wysocki
@ 2018-10-18 16:17             ` Thara Gopinath
  2018-10-19  8:02               ` Ingo Molnar
  1 sibling, 1 reply; 67+ messages in thread
From: Thara Gopinath @ 2018-10-18 16:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, mingo, peterz, rui.zhang, gregkh, rafael,
	amit.kachhap, viresh.kumar, javi.merino, edubezval,
	daniel.lezcano, linux-pm, quentin.perret, ionela.voinescu,
	vincent.guittot

On 10/18/2018 02:48 AM, Ingo Molnar wrote:
> 
> * Thara Gopinath <thara.gopinath@linaro.org> wrote:
> 
>> On 10/16/2018 03:33 AM, Ingo Molnar wrote:
>>>
>>> * Thara Gopinath <thara.gopinath@linaro.org> wrote:
>>>
>>>>>> Regarding testing, basic build, boot and sanity testing have been
>>>>>> performed on hikey960 mainline kernel with debian file system.
>>>>>> Further aobench (An occlusion renderer for benchmarking realworld
>>>>>> floating point performance) showed the following results on hikey960
>>>>>> with debain.
>>>>>>
>>>>>>                                         Result          Standard        Standard
>>>>>>                                         (Time secs)     Error           Deviation
>>>>>> Hikey 960 - no thermal pressure applied 138.67          6.52            11.52%
>>>>>> Hikey 960 -  thermal pressure applied   122.37          5.78            11.57%
>>>>>
>>>>> Wow, +13% speedup, impressive! We definitely want this outcome.
>>>>>
>>>>> I'm wondering what happens if we do not track and decay the thermal 
>>>>> load at all at the PELT level, but instantaneously decrease/increase 
>>>>> effective CPU capacity in reaction to thermal events we receive from 
>>>>> the CPU.
>>>>
>>>> The problem with instantaneous update is that sometimes thermal events 
>>>> happen at a much faster pace than cpu_capacity is updated in the 
>>>> scheduler. This means that at the moment when scheduler uses the 
>>>> value, it might not be correct anymore.
>>>
>>> Let me offer a different interpretation: if we average throttling events 
>>> then we create a 'smooth' average of 'true CPU capacity' that doesn't 
>>> fluctuate much. This allows more stable yet asymmetric task placement if 
>>> the thermal characteristics of the different cores is different 
>>> (asymmetric). This, compared to instantaneous updates, would reduce 
>>> unnecessary task migrations between cores.
>>>
>>> Is that accurate?
>>
>> Yes. I think it is accurate. I will also add that if we don't average
>> throttling events, we will miss the events that occur in between load
>> balancing(LB) period.
> 
> Yeah, so I'd definitely suggest to not integrate this averaging into 
> pelt.c in the fashion presented, because:
> 
>  - This couples your thermal throttling averaging to the PELT decay
>    half-time AFAICS, which would break the other user every time the
>    decay is changed/tuned.
Let me pose the question in this manner. Today rt utilization, dl
utilization etc is tracked via PELT. The inherent idea is that, a cpu
has some capacity stolen by let us say rt task and so subtract the
capacity utilized by the rt task from cpu when calculating the remaining
capacity for a CFS task. Now, the idea behind thermal pressure is that,
the maximum available capacity of a cpu is limited due to a thermal
event, so take it out of the remaining capacity of a cpu for a CFS task
(at-least to start with). If utilization for rt task, dl task etc is
calculated via PELT and the capacity constraint due to thermal event
calculated by another averaging algorithm, there can be some mismatch in
the "capacity stolen" calculations, right? Isnt't it better to track all
the events that can limit the capacity of a cpu via one algorithm ?

> 
>  - The boolean flag that changes behavior in pelt.c is not particularly
>    clean either and complicates the code.

I agree. Part of the idea behind this RFC patch set was to brainstorm if
there is a better approach to this.
> 
>  - Instead maybe factor out a decaying average library into
>    kernel/sched/avg.h perhaps (if this truly improves the code), and use
>    those methods both in pelt.c and any future thermal.c - and maybe
>    other places where we do decaying averages.
> 
>  - But simple decaying averages are not that complex either, so I think
>    your original solution of open coding it is probably fine as well. 
> 
> Furthermore, any logic introduced by thermal.c and the resulting change 
> to load-balancing behavior would have to be in perfect sync with cpufreq 
> governor actions - one mechanism should not work against the other.
I agree. I will go one step further and argue that the changes here make
best sense with sched util governor as it picks up the signals directly
from the scheduler to choose an OPP. Any other governor will be less
effective.


> 
> The only long term maintainable solution is to move all high level 
> cpufreq logic and policy handling code into kernel/sched/cpufreq*.c, 
> which has been done to a fair degree already in the past ~2 years - but 
> it's unclear to me to what extent this is true for thermal throttling 
> policy currently: there might be more governor surgery and code 
> reshuffling required?
> 
> The short term goal would be to at minimum have all the bugs lined up in 
> kernel/sched/* neatly, so that we have the chance to see and fix them in 
> a single place. ;-)

I see that Daniel has already sent some patches for this!
Regards
Thara
> 
> Thanks,
> 
> 	Ingo
> 


-- 
Regards
Thara

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files
  2018-10-18  9:35                   ` [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files Daniel Lezcano
                                       ` (2 preceding siblings ...)
  2018-10-18  9:45                     ` Daniel Lezcano
@ 2018-10-19  5:24                     ` kbuild test robot
  2018-10-19  5:52                     ` kbuild test robot
  4 siblings, 0 replies; 67+ messages in thread
From: kbuild test robot @ 2018-10-19  5:24 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: kbuild-all, rafael, mingo, thara.gopinath, linux-kernel, peterz,
	rui.zhang, gregkh, viresh.kumar, amit.kachhap, javi.merino,
	edubezval, linux-pm, quentin.perret, ionela.voinescu,
	vincent.guittot, Ingo Molnar, Rafael J. Wysocki, Juri Lelli,
	Patrick Bellasi

[-- Attachment #1: Type: text/plain, Size: 1758 bytes --]

Hi Daniel,

I love your patch! Yet something to improve:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on v4.19-rc8 next-20181018]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Daniel-Lezcano/sched-cpufreq-Reorganize-the-cpufreq-files/20181019-103851
config: x86_64-randconfig-s1-10191209 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

>> scripts/Makefile.build:45: kernel/sched/cpufreq/Makefile: No such file or directory
>> make[4]: *** No rule to make target 'kernel/sched/cpufreq/Makefile'.
   make[4]: Failed to remake makefile 'kernel/sched/cpufreq/Makefile'.

vim +45 scripts/Makefile.build

0c53c8e6e Sam Ravnborg   2007-10-14  41  
2a6914703 Sam Ravnborg   2005-07-25  42  # The filename Kbuild has precedence over Makefile
db8c1a7b2 Sam Ravnborg   2005-07-27  43  kbuild-dir := $(if $(filter /%,$(src)),$(src),$(srctree)/$(src))
0c53c8e6e Sam Ravnborg   2007-10-14  44  kbuild-file := $(if $(wildcard $(kbuild-dir)/Kbuild),$(kbuild-dir)/Kbuild,$(kbuild-dir)/Makefile)
0c53c8e6e Sam Ravnborg   2007-10-14 @45  include $(kbuild-file)
^1da177e4 Linus Torvalds 2005-04-16  46  

:::::: The code at line 45 was first introduced by commit
:::::: 0c53c8e6eb456cde30f2305421c605713856abc8 kbuild: check for wrong use of CFLAGS

:::::: TO: Sam Ravnborg <sam@neptun.(none)>
:::::: CC: Sam Ravnborg <sam@neptun.(none)>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 26292 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files
  2018-10-18  9:35                   ` [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files Daniel Lezcano
                                       ` (3 preceding siblings ...)
  2018-10-19  5:24                     ` kbuild test robot
@ 2018-10-19  5:52                     ` kbuild test robot
  4 siblings, 0 replies; 67+ messages in thread
From: kbuild test robot @ 2018-10-19  5:52 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: kbuild-all, rafael, mingo, thara.gopinath, linux-kernel, peterz,
	rui.zhang, gregkh, viresh.kumar, amit.kachhap, javi.merino,
	edubezval, linux-pm, quentin.perret, ionela.voinescu,
	vincent.guittot, Ingo Molnar, Rafael J. Wysocki, Juri Lelli,
	Patrick Bellasi

[-- Attachment #1: Type: text/plain, Size: 1739 bytes --]

Hi Daniel,

I love your patch! Yet something to improve:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on v4.19-rc8 next-20181018]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Daniel-Lezcano/sched-cpufreq-Reorganize-the-cpufreq-files/20181019-103851
config: x86_64-randconfig-s2-10191209 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

>> scripts/Makefile.modbuiltin:26: kernel/sched/cpufreq/Makefile: No such file or directory
   make[4]: *** No rule to make target 'kernel/sched/cpufreq/Makefile'.
   make[4]: Failed to remake makefile 'kernel/sched/cpufreq/Makefile'.

vim +26 scripts/Makefile.modbuiltin

607b30fc Michal Marek 2010-06-10  22  
bc081dd6 Michal Marek 2009-12-07  23  # The filename Kbuild has precedence over Makefile
bc081dd6 Michal Marek 2009-12-07  24  kbuild-dir := $(if $(filter /%,$(src)),$(src),$(srctree)/$(src))
bc081dd6 Michal Marek 2009-12-07  25  kbuild-file := $(if $(wildcard $(kbuild-dir)/Kbuild),$(kbuild-dir)/Kbuild,$(kbuild-dir)/Makefile)
bc081dd6 Michal Marek 2009-12-07 @26  include $(kbuild-file)
bc081dd6 Michal Marek 2009-12-07  27  

:::::: The code at line 26 was first introduced by commit
:::::: bc081dd6e9f622c73334dc465359168543ccaabf kbuild: generate modules.builtin

:::::: TO: Michal Marek <mmarek@suse.cz>
:::::: CC: Michal Marek <mmarek@suse.cz>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 30817 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-18 16:17             ` [RFC PATCH 0/7] Introduce thermal pressure Thara Gopinath
@ 2018-10-19  8:02               ` Ingo Molnar
  2018-10-19 11:29                 ` Valentin Schneider
  0 siblings, 1 reply; 67+ messages in thread
From: Ingo Molnar @ 2018-10-19  8:02 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: linux-kernel, mingo, peterz, rui.zhang, gregkh, rafael,
	amit.kachhap, viresh.kumar, javi.merino, edubezval,
	daniel.lezcano, linux-pm, quentin.perret, ionela.voinescu,
	vincent.guittot


* Thara Gopinath <thara.gopinath@linaro.org> wrote:

> > Yeah, so I'd definitely suggest to not integrate this averaging into 
> > pelt.c in the fashion presented, because:
> > 
> >  - This couples your thermal throttling averaging to the PELT decay
> >    half-time AFAICS, which would break the other user every time the
> >    decay is changed/tuned.
>
> Let me pose the question in this manner. Today rt utilization, dl 
> utilization etc is tracked via PELT. The inherent idea is that, a cpu 
> has some capacity stolen by let us say rt task and so subtract the 
> capacity utilized by the rt task from cpu when calculating the 
> remaining capacity for a CFS task. Now, the idea behind thermal 
> pressure is that, the maximum available capacity of a cpu is limited 
> due to a thermal event, so take it out of the remaining capacity of a 
> cpu for a CFS task (at-least to start with). If utilization for rt 
> task, dl task etc is calculated via PELT and the capacity constraint 
> due to thermal event calculated by another averaging algorithm, there 
> can be some mismatch in the "capacity stolen" calculations, right? 
> Isnt't it better to track all the events that can limit the capacity of 
> a cpu via one algorithm ?

So what unifies RT and DL utilization is that those are all direct task 
loads independent of external factors.

Thermal load is more of a complex physical property of the combination of 
various internal and external factors: the whole system workload running 
(not just that single task), the thermal topology of the hardware, 
external temperatures, the hardware's and the governor's policy regarding 
thermal loads, etc. etc.

So while obviously when effective capacity of a CPU is calculated then 
these will all be subtracted from the maximum capacity of the CPU, but I 
think the thermal load metric and the averaging itself is probably 
dissimilar enough to not be calculated via the PELT half-life for 
example.

For example a reasonable future property would be match the speed of 
decay in the averaging to the observed speed of decay via temperature 
sensors? Most temperature sensors do a certain amount of averaging 
themselves as well - and some platforms might not expose temperatures at 
all, only 'got thermally throttled' / 'running at full speed' kind of 
feedback.

Anyway, this doesn't really impact the concept, it's an implementational 
detail, and much of this could be resolved if the averaging code in 
pelt.c was librarized a bit - and that's really what you did there in a 
fashion, I just think it should probably be abstracted out more clearly. 
(I have no clear implementational suggestions right now, other than 'try 
and see how it works out - it might be a bad idea'.)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 0/7] Introduce thermal pressure
  2018-10-19  8:02               ` Ingo Molnar
@ 2018-10-19 11:29                 ` Valentin Schneider
  0 siblings, 0 replies; 67+ messages in thread
From: Valentin Schneider @ 2018-10-19 11:29 UTC (permalink / raw)
  To: Ingo Molnar, Thara Gopinath
  Cc: linux-kernel, mingo, peterz, rui.zhang, gregkh, rafael,
	amit.kachhap, viresh.kumar, javi.merino, edubezval,
	daniel.lezcano, linux-pm, quentin.perret, ionela.voinescu,
	vincent.guittot

Hi,

On 19/10/2018 09:02, Ingo Molnar wrote:
> 
> * Thara Gopinath <thara.gopinath@linaro.org> wrote:

[...]
 
> So what unifies RT and DL utilization is that those are all direct task 
> loads independent of external factors.
> 
> Thermal load is more of a complex physical property of the combination of 
> various internal and external factors: the whole system workload running 
> (not just that single task), the thermal topology of the hardware, 
> external temperatures, the hardware's and the governor's policy regarding 
> thermal loads, etc. etc.
> 
> So while obviously when effective capacity of a CPU is calculated then 
> these will all be subtracted from the maximum capacity of the CPU, but I 
> think the thermal load metric and the averaging itself is probably 
> dissimilar enough to not be calculated via the PELT half-life for 
> example.
> 
> For example a reasonable future property would be match the speed of 
> decay in the averaging to the observed speed of decay via temperature 
> sensors? Most temperature sensors do a certain amount of averaging 
> themselves as well - and some platforms might not expose temperatures at 
> all, only 'got thermally throttled' / 'running at full speed' kind of 
> feedback.

That would also open the door to having different decay speeds on
different domains, if we have the tsensors for it - big and LITTLE cores
are not going to heat up in the same way (although there's going to
be some heat propagation).

Another thermal decay speed hint I'd see would be the energy model - it
does tell us after all how much energy is going through those cores, and
with a rough estimate of how much they can take before overheating 
(sustainable-power entry in the devicetree) we might be able to deduce a
somewhat sane decay speed.

> 
> Anyway, this doesn't really impact the concept, it's an implementational 
> detail, and much of this could be resolved if the averaging code in 
> pelt.c was librarized a bit - and that's really what you did there in a 
> fashion, I just think it should probably be abstracted out more clearly. 
> (I have no clear implementational suggestions right now, other than 'try 
> and see how it works out - it might be a bad idea'.)
> 
> Thanks,
> 
> 	Ingo
> 
> 
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [RFC PATCH 5/7] sched/fair: Enable CFS periodic tick to update thermal pressure
  2018-10-09 16:25   ` [RFC PATCH 5/7] sched/fair: Enable CFS periodic tick to update thermal pressure Thara Gopinath
@ 2018-12-04 15:43     ` Vincent Guittot
  0 siblings, 0 replies; 67+ messages in thread
From: Vincent Guittot @ 2018-12-04 15:43 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Zhang Rui, gregkh,
	Rafael J. Wysocki, Amit Kachhap, viresh kumar, Javi Merino,
	Eduardo Valentin, Daniel Lezcano, open list:THERMAL,
	Quentin Perret, Ionela Voinescu

Hi Thara,

On Tue, 9 Oct 2018 at 18:25, Thara Gopinath <thara.gopinath@linaro.org> wrote:
>
> Introduce support in CFS periodic tick to trigger the process of
> computing average thermal pressure for a cpu.
>
> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> ---
>  kernel/sched/fair.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b39fb59..7deb1d0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -21,6 +21,7 @@
>   *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
>   */
>  #include "sched.h"
> +#include "thermal.h"
>
>  #include <trace/events/sched.h>
>
> @@ -9557,6 +9558,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>
>         if (static_branch_unlikely(&sched_numa_balancing))
>                 task_tick_numa(rq, curr);
> +
> +       update_periodic_maxcap(rq);

You have to call update_periodic_maxcap() in update_blocked_averages() too
Otherwise, the thermal pressure will not always be updated correctly
for tickless system

>  }
>
>  /*
> --
> 2.1.4
>

^ permalink raw reply	[flat|nested] 67+ messages in thread

end of thread, other threads:[~2018-12-04 15:44 UTC | newest]

Thread overview: 67+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CGME20181009162509epcas1p4fdd2e23039caa24586a4a52c6d2e7336@epcas1p4.samsung.com>
2018-10-09 16:24 ` [RFC PATCH 0/7] Introduce thermal pressure Thara Gopinath
2018-10-09 16:24   ` [RFC PATCH 1/7] sched/pelt.c: Add option to make load and util calculations frequency invariant Thara Gopinath
2018-10-09 16:24   ` [RFC PATCH 2/7] sched/pelt.c: Add support to track thermal pressure Thara Gopinath
2018-10-09 16:24   ` [RFC PATCH 3/7] sched: Add infrastructure to store and update instantaneous " Thara Gopinath
2018-10-09 16:24   ` [RFC PATCH 4/7] sched: Initialize per cpu thermal pressure structure Thara Gopinath
2018-10-09 16:25   ` [RFC PATCH 5/7] sched/fair: Enable CFS periodic tick to update thermal pressure Thara Gopinath
2018-12-04 15:43     ` Vincent Guittot
2018-10-09 16:25   ` [RFC PATCH 6/7] sched/fair: update cpu_capcity to reflect " Thara Gopinath
2018-10-10  5:57     ` Javi Merino
2018-10-10 14:22       ` Thara Gopinath
2018-10-09 16:25   ` [RFC PATCH 7/7] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping Thara Gopinath
2018-10-10  5:44   ` [RFC PATCH 0/7] Introduce thermal pressure Javi Merino
2018-10-10 14:15     ` Thara Gopinath
2018-10-10  6:17   ` Ingo Molnar
2018-10-10  8:29     ` Quentin Perret
2018-10-10  8:50       ` Vincent Guittot
2018-10-10  9:55         ` Quentin Perret
2018-10-10 10:14           ` Vincent Guittot
2018-10-10 10:36             ` Quentin Perret
2018-10-10 12:04               ` Vincent Guittot
2018-10-10 12:23                 ` Juri Lelli
2018-10-10 12:34                   ` Vincent Guittot
2018-10-10 12:50                     ` Juri Lelli
2018-10-10 13:08                       ` Vincent Guittot
2018-10-10 13:34                         ` Juri Lelli
2018-10-10 13:38                           ` Vincent Guittot
2018-10-10 17:08                           ` Thara Gopinath
2018-10-10 13:11                       ` Quentin Perret
2018-10-10 13:05                 ` Quentin Perret
2018-10-10 13:27                   ` Vincent Guittot
2018-10-10 13:47                     ` Quentin Perret
2018-10-10 15:19                       ` Vincent Guittot
2018-10-10 16:15                       ` Ionela Voinescu
2018-10-10 17:03           ` Thara Gopinath
2018-10-10 15:43     ` Thara Gopinath
2018-10-16  7:33       ` Ingo Molnar
2018-10-16  9:28         ` Lukasz Luba
2018-10-17 16:21         ` Thara Gopinath
2018-10-18  6:48           ` Ingo Molnar
2018-10-18  7:08             ` Rafael J. Wysocki
2018-10-18  7:50               ` Ingo Molnar
2018-10-18  8:14                 ` Rafael J. Wysocki
2018-10-18  9:35                   ` [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files Daniel Lezcano
2018-10-18  9:35                     ` [PATCH 2/2] sched/cpufreq: Add the SPDX tags Daniel Lezcano
2018-10-18  9:42                     ` [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files Rafael J. Wysocki
2018-10-18  9:54                       ` Daniel Lezcano
2018-10-18 10:06                         ` Rafael J. Wysocki
2018-10-18 10:13                           ` Daniel Lezcano
2018-10-18  9:45                     ` Daniel Lezcano
2018-10-19  5:24                     ` kbuild test robot
2018-10-19  5:52                     ` kbuild test robot
2018-10-18  9:44                   ` [PATCH V2 " Daniel Lezcano
2018-10-18  9:44                     ` [PATCH V2 2/2] sched/cpufreq: Add the SPDX tags Daniel Lezcano
2018-10-18 16:17             ` [RFC PATCH 0/7] Introduce thermal pressure Thara Gopinath
2018-10-19  8:02               ` Ingo Molnar
2018-10-19 11:29                 ` Valentin Schneider
2018-10-10 15:35   ` Lukasz Luba
2018-10-10 16:54     ` Daniel Lezcano
2018-10-11  7:35       ` Lukasz Luba
2018-10-11  8:23         ` Daniel Lezcano
2018-10-12  9:37           ` Lukasz Luba
2018-10-10 17:30     ` Thara Gopinath
2018-10-11 11:10       ` Lukasz Luba
2018-10-16 17:11         ` Vincent Guittot
2018-10-17 16:24           ` Thara Gopinath
2018-10-18  8:00             ` Lukasz Luba
2018-10-18  8:12           ` Lukasz Luba

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).