linux-kernel.vger.kernel.org archive mirror
* [Patch v5 0/6] Introduce Thermal Pressure
@ 2019-11-05 18:49 Thara Gopinath
  2019-11-05 18:49 ` [Patch v5 1/6] sched/pelt.c: Add support to track thermal pressure Thara Gopinath
                   ` (7 more replies)
  0 siblings, 8 replies; 40+ messages in thread
From: Thara Gopinath @ 2019-11-05 18:49 UTC (permalink / raw)
  To: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang,
	edubezval, qperret
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

Thermal governors can respond to an overheat event on a cpu by
capping the cpu's maximum possible frequency. This in turn
means that the maximum available compute capacity of the
cpu is restricted. But today in the kernel, the task scheduler
is not notified when the maximum frequency of a cpu is capped.
In other words, the scheduler is unaware of maximum capacity
restrictions placed on a cpu due to thermal activity.
This patch series attempts to address this issue.
The benefit identified is better task placement among the
available cpus in the event of overheating, which in turn leads
to better performance numbers.

The reduction in the maximum possible capacity of a cpu due to a
thermal event can be considered as thermal pressure. Instantaneous
thermal pressure is hard to record and can sometimes be erroneous
as there can be a mismatch between the actual capping of capacity
and the scheduler recording it. The solution is therefore a
weighted average per-cpu value for thermal pressure over time.
The weight reflects the amount of time the cpu has spent at a
capped maximum frequency. Since thermal pressure is recorded as
an average, it must be decayed periodically. The existing
algorithm in the kernel scheduler PELT framework is re-used to
calculate the weighted average. This patch series also defines a
command line interface to allow for a configurable decay period.
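
To give a feel for how such a weighted average behaves, below is a
minimal user-space sketch of a PELT-style decay (illustrative only;
the constants and the decay_pressure() helper are made up, and the
real kernel implementation uses the fixed-point code in
kernel/sched/pelt.c):

#include <stdio.h>

static unsigned long decay_pressure(unsigned long avg,
				    unsigned long pressure,
				    unsigned int ms)
{
	unsigned int i;

	/*
	 * Per-millisecond decay factor y with y^32 = 0.5 (the default
	 * PELT half-life of 32 ms): y ~= 0.97857, approximated here
	 * with integer math.
	 */
	for (i = 0; i < ms; i++)
		avg = (avg * 97857 + pressure * 2143) / 100000;

	return avg;
}

int main(void)
{
	unsigned long avg = 0;

	/* cpu loses 256 capacity units to capping for 64 ms ... */
	avg = decay_pressure(avg, 256, 64);
	printf("average thermal pressure after capping: %lu\n", avg);

	/* ... then the cap is lifted and the signal decays for 128 ms */
	avg = decay_pressure(avg, 0, 128);
	printf("average thermal pressure after decay:   %lu\n", avg);

	return 0;
}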

Regarding testing, basic build, boot and sanity testing have been
performed on a db845c platform with a Debian file system.
Further, dhrystone and hackbench tests have been
run with the thermal pressure algorithm. During testing, due to
constraints of the step-wise governor in dealing with big.LITTLE
systems, the trip point 0 temperature was made asymmetric between
the cpus in the little cluster and the big cluster; the idea being
that the big cores will heat up and the cpu cooling device will
throttle the frequency of the big cores faster, thereby limiting
the maximum available capacity, and the scheduler will spread out
tasks to the little cores as well.

Test Results

Hackbench: 1 group , 30000 loops, 10 runs
                                               Result         SD
                                               (Secs)     (% of mean)
 No Thermal Pressure                            14.03       2.69%
 Thermal Pressure PELT Algo. Decay : 32 ms      13.29       0.56%
 Thermal Pressure PELT Algo. Decay : 64 ms      12.57       1.56%
 Thermal Pressure PELT Algo. Decay : 128 ms     12.71       1.04%
 Thermal Pressure PELT Algo. Decay : 256 ms     12.29       1.42%
 Thermal Pressure PELT Algo. Decay : 512 ms     12.42       1.15%

Dhrystone Run Time  : 20 threads, 3000 MLOOPS
                                                 Result      SD
                                                 (Secs)    (% of mean)
 No Thermal Pressure                              9.452      4.49%
 Thermal Pressure PELT Algo. Decay : 32 ms        8.793      5.30%
 Thermal Pressure PELT Algo. Decay : 64 ms        8.981      5.29%
 Thermal Pressure PELT Algo. Decay : 128 ms       8.647      6.62%
 Thermal Pressure PELT Algo. Decay : 256 ms       8.774      6.45%
 Thermal Pressure PELT Algo. Decay : 512 ms       8.603      5.41%

A Brief History

The first version of this patch series was posted with the PELT
algorithm re-used to decay the thermal pressure signal. The
discussions that followed were around whether an instantaneous
thermal pressure solution is better and whether a stand-alone
algorithm to accumulate and decay thermal pressure is more
appropriate than re-using the PELT framework.
Tests on Hikey960 showed the stand-alone algorithm performing
slightly better than re-using the PELT algorithm, and v2 was
posted with the stand-alone algorithm. Test results were shared
as part of that series. Discussions were around re-using the PELT
algorithm and running further tests with a more granular decay
period.

For some time after this, development was impeded due to hardware
unavailability and some other unforeseen and unfortunate events.
For this version, the h/w was switched from Hikey960 to db845c.
Instantaneous thermal pressure was not tested as part of this
cycle as it is clear that a weighted average is the better
implementation. In this round of testing, the non-PELT algorithm
never gave any conclusive results to prove that it is better than
re-using the PELT algorithm. Re-using the PELT algorithm also
means thermal pressure tracks the other utilization signals in
the scheduler.

v3->v4:
        - "Patch 3/7:sched: Initialize per cpu thermal pressure structure"
           is dropped as it is no longer needed following changes in
           the other patches.
        - The rest of the change log is mentioned in the specific patches.

Thara Gopinath (6):
  sched/pelt.c: Add support to track thermal pressure
  sched/fair: Add infrastructure to store and update  instantaneous
    thermal pressure
  sched/fair: Enable periodic update of thermal pressure
  sched/fair: update cpu_capcity to reflect thermal pressure
  thermal/cpu-cooling: Update thermal pressure in case of a maximum
    frequency capping
  sched/fair: Enable tuning of decay period

 Documentation/admin-guide/kernel-parameters.txt |  5 ++
 drivers/thermal/cpu_cooling.c                   | 36 ++++++++++++-
 include/linux/sched.h                           |  9 ++++
 kernel/sched/fair.c                             | 69 +++++++++++++++++++++++++
 kernel/sched/pelt.c                             | 13 +++++
 kernel/sched/pelt.h                             |  7 +++
 kernel/sched/sched.h                            |  1 +
 7 files changed, 138 insertions(+), 2 deletions(-)

-- 
2.1.4


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [Patch v5 1/6] sched/pelt.c: Add support to track thermal pressure
  2019-11-05 18:49 [Patch v5 0/6] Introduce Thermal Pressure Thara Gopinath
@ 2019-11-05 18:49 ` Thara Gopinath
  2019-11-06  8:24   ` Vincent Guittot
                     ` (3 more replies)
  2019-11-05 18:49 ` [Patch v5 2/6] sched/fair: Add infrastructure to store and update instantaneous " Thara Gopinath
                   ` (6 subsequent siblings)
  7 siblings, 4 replies; 40+ messages in thread
From: Thara Gopinath @ 2019-11-05 18:49 UTC (permalink / raw)
  To: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang,
	edubezval, qperret
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

Extrapolating on the exisiting framework to track rt/dl utilization using
pelt signals, add a similar mechanism to track thermal pressure. The
difference here from rt/dl utilization tracking is that, instead of
tracking time spent by a cpu running a rt/dl task through util_avg,
the average thermal pressure is tracked through load_avg. This is
because thermal pressure signal is weighted "delta" capacity
and is not binary(util_avg is binary). "delta capacity" here
means delta between the actual capacity of a cpu and the decreased
capacity a cpu due to a thermal event.
In order to track average thermal pressure, a new sched_avg variable
avg_thermal is introduced. Function update_thermal_load_avg can be called
to do the periodic bookeeping (accumulate, decay and average)
of the thermal pressure.
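
As an illustration of how the new helper is meant to be driven
(sketch only; the hypothetical example_thermal_tick() below is not
part of this patch, the actual call sites are added in patches 2/6
and 3/6 of this series):

/*
 * Hypothetical caller, for illustration only.  'pressure' is the
 * instantaneous delta between the cpu's original maximum capacity
 * and its currently capped capacity.
 */
static void example_thermal_tick(struct rq *rq, unsigned long pressure)
{
	update_thermal_load_avg(rq_clock_task(rq), rq, pressure);
}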

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---
 kernel/sched/pelt.c  | 13 +++++++++++++
 kernel/sched/pelt.h  |  7 +++++++
 kernel/sched/sched.h |  1 +
 3 files changed, 21 insertions(+)

diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index a96db50..3821069 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -353,6 +353,19 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 	return 0;
 }
 
+int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
+{
+	if (___update_load_sum(now, &rq->avg_thermal,
+			       capacity,
+			       capacity,
+			       capacity)) {
+		___update_load_avg(&rq->avg_thermal, 1, 1);
+		return 1;
+	}
+
+	return 0;
+}
+
 #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
 /*
  * irq:
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index afff644..c74226d 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -6,6 +6,7 @@ int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se
 int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
 int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
+int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity);
 
 #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
 int update_irq_load_avg(struct rq *rq, u64 running);
@@ -159,6 +160,12 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 }
 
 static inline int
+update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
+{
+	return 0;
+}
+
+static inline int
 update_irq_load_avg(struct rq *rq, u64 running)
 {
 	return 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0db2c1b..d5d82c8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -944,6 +944,7 @@ struct rq {
 #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
 	struct sched_avg	avg_irq;
 #endif
+	struct sched_avg	avg_thermal;
 	u64			idle_stamp;
 	u64			avg_idle;
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [Patch v5 2/6] sched/fair: Add infrastructure to store and update  instantaneous thermal pressure
  2019-11-05 18:49 [Patch v5 0/6] Introduce Thermal Pressure Thara Gopinath
  2019-11-05 18:49 ` [Patch v5 1/6] sched/pelt.c: Add support to track thermal pressure Thara Gopinath
@ 2019-11-05 18:49 ` Thara Gopinath
  2019-11-05 20:21   ` Ionela Voinescu
                     ` (2 more replies)
  2019-11-05 18:49 ` [Patch v5 3/6] sched/fair: Enable periodic update of average " Thara Gopinath
                   ` (5 subsequent siblings)
  7 siblings, 3 replies; 40+ messages in thread
From: Thara Gopinath @ 2019-11-05 18:49 UTC (permalink / raw)
  To: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang,
	edubezval, qperret
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

Add interface APIs to initialize, update/average, track, accumulate
and decay thermal pressure on a per-cpu basis. A per-cpu variable
thermal_pressure is introduced to keep track of instantaneous per
cpu thermal pressure. Thermal pressure is the delta between maximum
capacity and capped capacity due to a thermal event.
API trigger_thermal_pressure_average is called for periodic
accumulation and decay of the thermal pressure. This API passes on
the instantaneous thermal pressure of a cpu to
update_thermal_load_avg to do the necessary accumulate, decay and
average.
API update_thermal_pressure is for the system to update the thermal
pressure by providing a capped maximum capacity.
Considering that trigger_thermal_pressure_average reads
thermal_pressure and update_thermal_pressure writes into
thermal_pressure, one can argue for some sort of locking mechanism
to avoid a stale value.
But considering that trigger_thermal_pressure_average can be called
from a system critical path like the scheduler tick function, a
locking mechanism is not ideal. This means that it is possible the
thermal_pressure value used to calculate the average thermal
pressure for a cpu can be stale for up to 1 tick period.
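
For reference, the write side added in patch 5/6 of this series
boils down to the loop below (condensed from that patch), where the
capped capacity is scaled down from the capped maximum frequency
before being handed to update_thermal_pressure():

	for_each_cpu(cpu, cpus) {
		/* capped capacity derived from the capped max frequency */
		capacity = cur_max_freq * arch_scale_cpu_capacity(cpu);
		capacity /= max_freq;
		update_thermal_pressure(cpu, capacity);
	}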

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---

v3->v4:
        - Dropped per cpu max_capacity_info struct and instead added a per
          cpu delta_capacity variable to store the delta between maximum
          capacity and capped capacity. The delta is now calculated when
          thermal pressure is updated and not every tick.
        - Dropped populate_max_capacity_info api as only per cpu delta
          capacity is stored.
        - Renamed update_periodic_maxcap to
          trigger_thermal_pressure_average and update_maxcap_capacity to
          update_thermal_pressure.
v4->v5:
	- As per Peter's review comments folded thermal.c into fair.c.
	- As per Ionela's review comments revamped update_thermal_pressure
	  to take maximum available capacity as input instead of maximum
	  capped frequency ratio.

---
 include/linux/sched.h |  9 +++++++++
 kernel/sched/fair.c   | 37 +++++++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 263cf08..3c31084 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1993,6 +1993,15 @@ static inline void rseq_syscall(struct pt_regs *regs)
 
 #endif
 
+#ifdef CONFIG_SMP
+void update_thermal_pressure(int cpu, unsigned long capped_capacity);
+#else
+static inline void
+update_thermal_pressure(int cpu, unsigned long capped_capacity)
+{
+}
+#endif
+
 const struct sched_avg *sched_trace_cfs_rq_avg(struct cfs_rq *cfs_rq);
 char *sched_trace_cfs_rq_path(struct cfs_rq *cfs_rq, char *str, int len);
 int sched_trace_cfs_rq_cpu(struct cfs_rq *cfs_rq);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 682a754..2e907cc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -86,6 +86,12 @@ static unsigned int normalized_sysctl_sched_wakeup_granularity	= 1000000UL;
 
 const_debug unsigned int sysctl_sched_migration_cost	= 500000UL;
 
+/*
+ * Per-cpu instantaneous delta between maximum capacity
+ * and maximum available capacity due to thermal events.
+ */
+static DEFINE_PER_CPU(unsigned long, thermal_pressure);
+
 #ifdef CONFIG_SMP
 /*
  * For asym packing, by default the lower numbered CPU has higher priority.
@@ -10401,6 +10407,37 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
 	return rr_interval;
 }
 
+#ifdef CONFIG_SMP
+/**
+ * update_thermal_pressure: Update thermal pressure
+ * @cpu: the cpu for which thermal pressure is to be updated for
+ * @capped_capacity: maximum capacity of the cpu after the capping
+ *		     due to thermal event.
+ *
+ * Delta between the arch_scale_cpu_capacity and capped max capacity is
+ * stored in per cpu thermal_pressure variable.
+ */
+void update_thermal_pressure(int cpu, unsigned long capped_capacity)
+{
+	unsigned long delta;
+
+	delta = arch_scale_cpu_capacity(cpu) - capped_capacity;
+	per_cpu(thermal_pressure, cpu) = delta;
+}
+#endif
+
+/**
+ * trigger_thermal_pressure_average: Trigger the thermal pressure accumulate
+ *				     and average algorithm
+ */
+static void trigger_thermal_pressure_average(struct rq *rq)
+{
+#ifdef CONFIG_SMP
+	update_thermal_load_avg(rq_clock_task(rq), rq,
+				per_cpu(thermal_pressure, cpu_of(rq)));
+#endif
+}
+
 /*
  * All the scheduling class methods:
  */
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [Patch v5 3/6] sched/fair: Enable periodic update of average thermal pressure
  2019-11-05 18:49 [Patch v5 0/6] Introduce Thermal Pressure Thara Gopinath
  2019-11-05 18:49 ` [Patch v5 1/6] sched/pelt.c: Add support to track thermal pressure Thara Gopinath
  2019-11-05 18:49 ` [Patch v5 2/6] sched/fair: Add infrastructure to store and update instantaneous " Thara Gopinath
@ 2019-11-05 18:49 ` Thara Gopinath
  2019-11-06  8:32   ` Vincent Guittot
  2019-11-05 18:49 ` [Patch v5 4/6] sched/fair: update cpu_capcity to reflect " Thara Gopinath
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 40+ messages in thread
From: Thara Gopinath @ 2019-11-05 18:49 UTC (permalink / raw)
  To: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang,
	edubezval, qperret
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

Introduce support in CFS periodic tick and other bookkeeping apis
to trigger the process of computing average thermal pressure for a
cpu. Also consider avg_thermal.load_avg in others_have_blocked
which allows for decay of pelt signals.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---

v4->v5:
	- Updated both versions of update_blocked_averages to trigger the
	  process of computing average thermal pressure.
	- Updated others_have_blocked to consider avg_thermal.load_avg.

 kernel/sched/fair.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2e907cc..9fb0494 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -92,6 +92,8 @@ const_debug unsigned int sysctl_sched_migration_cost	= 500000UL;
  */
 static DEFINE_PER_CPU(unsigned long, thermal_pressure);
 
+static void trigger_thermal_pressure_average(struct rq *rq);
+
 #ifdef CONFIG_SMP
 /*
  * For asym packing, by default the lower numbered CPU has higher priority.
@@ -7493,6 +7495,9 @@ static inline bool others_have_blocked(struct rq *rq)
 	if (READ_ONCE(rq->avg_dl.util_avg))
 		return true;
 
+	if (READ_ONCE(rq->avg_thermal.load_avg))
+		return true;
+
 #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
 	if (READ_ONCE(rq->avg_irq.util_avg))
 		return true;
@@ -7580,6 +7585,8 @@ static void update_blocked_averages(int cpu)
 		done = false;
 
 	update_blocked_load_status(rq, !done);
+
+	trigger_thermal_pressure_average(rq);
 	rq_unlock_irqrestore(rq, &rf);
 }
 
@@ -7646,6 +7653,7 @@ static inline void update_blocked_averages(int cpu)
 	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
 	update_irq_load_avg(rq, 0);
 	update_blocked_load_status(rq, cfs_rq_has_blocked(cfs_rq) || others_have_blocked(rq));
+	trigger_thermal_pressure_average(rq);
 	rq_unlock_irqrestore(rq, &rf);
 }
 
@@ -9939,6 +9947,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	update_misfit_status(curr, rq);
 	update_overutilized_status(task_rq(curr));
+
+	trigger_thermal_pressure_average(rq);
 }
 
 /*
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [Patch v5 4/6] sched/fair: update cpu_capcity to reflect thermal pressure
  2019-11-05 18:49 [Patch v5 0/6] Introduce Thermal Pressure Thara Gopinath
                   ` (2 preceding siblings ...)
  2019-11-05 18:49 ` [Patch v5 3/6] sched/fair: Enable periodic update of average " Thara Gopinath
@ 2019-11-05 18:49 ` Thara Gopinath
  2019-11-06 16:56   ` Qais Yousef
  2019-11-19 10:51   ` Amit Kucheria
  2019-11-05 18:49 ` [Patch v5 5/6] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping Thara Gopinath
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 40+ messages in thread
From: Thara Gopinath @ 2019-11-05 18:49 UTC (permalink / raw)
  To: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang,
	edubezval, qperret
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

cpu_capacity reflects the maximum available capacity of a cpu. Thermal
pressure on a cpu means this maximum available capacity is reduced. This
patch subtracts the average thermal pressure for a cpu from its maximum
available capacity so that cpu_capacity reflects the actual
available capacity.

Other signals that are deducted from cpu_capacity to reflect the actual
capacity available are rt and dl util_avg. util_avg tracks a binary signal
and uses the weights 1024 and 0. Whereas thermal pressure is tracked
using load_avg. load_avg uses the actual "delta" capacity as the weight.
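
As a rough worked example (numbers invented): on a cpu with
arch_scale_cpu_capacity() = 1024, avg_rt.util_avg = 100,
avg_dl.util_avg = 0 and avg_thermal.load_avg = 200,
scale_rt_capacity() now treats 300 units as used instead of 100,
so the capacity reported for the cpu drops from about 924 to about
724 (before any further reduction due to IRQ pressure).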

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---
 kernel/sched/fair.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9fb0494..5f6c371 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7738,6 +7738,7 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
 
 	used = READ_ONCE(rq->avg_rt.util_avg);
 	used += READ_ONCE(rq->avg_dl.util_avg);
+	used += READ_ONCE(rq->avg_thermal.load_avg);
 
 	if (unlikely(used >= max))
 		return 1;
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [Patch v5 5/6] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping
  2019-11-05 18:49 [Patch v5 0/6] Introduce Thermal Pressure Thara Gopinath
                   ` (3 preceding siblings ...)
  2019-11-05 18:49 ` [Patch v5 4/6] sched/fair: update cpu_capcity to reflect " Thara Gopinath
@ 2019-11-05 18:49 ` Thara Gopinath
  2019-11-06 12:50   ` Dietmar Eggemann
  2019-11-05 18:49 ` [Patch v5 6/6] sched/fair: Enable tuning of decay period Thara Gopinath
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 40+ messages in thread
From: Thara Gopinath @ 2019-11-05 18:49 UTC (permalink / raw)
  To: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang,
	edubezval, qperret
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

Thermal governors can request that a cpu's maximum supported
frequency be capped in case of an overheat event. This in turn
means that the maximum capacity available for tasks to run on the
particular cpu is reduced. The delta between the original maximum
capacity and the capped maximum capacity is known as thermal
pressure. Enable the cpufreq cooling device to update the thermal
pressure in the event of a capped maximum frequency.
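
For example (numbers invented), for a big cpu with
arch_scale_cpu_capacity() = 1024 and cpuinfo.max_freq = 2400 MHz
that gets capped to 1800 MHz, the capped capacity passed to the
scheduler is 1800 * 1024 / 2400 = 768 and the resulting thermal
pressure is 1024 - 768 = 256.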

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---

v4->v5:
	- fixed issues in update_sched_max_capacity comment header.
	- Updated update_sched_max_capacity to calculate maximum available
	  capacity.

 drivers/thermal/cpu_cooling.c | 36 ++++++++++++++++++++++++++++++++++--
 1 file changed, 34 insertions(+), 2 deletions(-)

diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index 391f397..c65b7c4 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -218,6 +218,27 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
 }
 
 /**
+ * update_sched_max_capacity - update scheduler about change in cpu
+ *				max frequency.
+ * @cpus: list of cpus whose max capacity needs updating in scheduler.
+ * @cur_max_freq: current maximum frequency.
+ * @max_freq: highest possible frequency.
+ */
+static void update_sched_max_capacity(struct cpumask *cpus,
+				      unsigned int cur_max_freq,
+				      unsigned int max_freq)
+{
+	int cpu;
+	unsigned long capacity;
+
+	for_each_cpu(cpu, cpus) {
+		capacity = cur_max_freq * arch_scale_cpu_capacity(cpu);
+		capacity /= max_freq;
+		update_thermal_pressure(cpu, capacity);
+	}
+}
+
+/**
  * get_load() - get load for a cpu since last updated
  * @cpufreq_cdev:	&struct cpufreq_cooling_device for this cpu
  * @cpu:	cpu number
@@ -320,6 +341,7 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
 				 unsigned long state)
 {
 	struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
+	int ret;
 
 	/* Request state should be less than max_level */
 	if (WARN_ON(state > cpufreq_cdev->max_level))
@@ -331,8 +353,18 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
 
 	cpufreq_cdev->cpufreq_state = state;
 
-	return dev_pm_qos_update_request(&cpufreq_cdev->qos_req,
-				cpufreq_cdev->freq_table[state].frequency);
+	ret = dev_pm_qos_update_request
+				(&cpufreq_cdev->qos_req,
+				 cpufreq_cdev->freq_table[state].frequency);
+
+	if (ret > 0) {
+		update_sched_max_capacity
+				(cpufreq_cdev->policy->cpus,
+				 cpufreq_cdev->freq_table[state].frequency,
+				 cpufreq_cdev->policy->cpuinfo.max_freq);
+	}
+
+	return ret;
 }
 
 /**
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [Patch v5 6/6] sched/fair: Enable tuning of decay period
  2019-11-05 18:49 [Patch v5 0/6] Introduce Thermal Pressure Thara Gopinath
                   ` (4 preceding siblings ...)
  2019-11-05 18:49 ` [Patch v5 5/6] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping Thara Gopinath
@ 2019-11-05 18:49 ` Thara Gopinath
  2019-11-07 10:49   ` Vincent Guittot
  2019-11-19 10:52   ` Amit Kucheria
  2019-11-12 11:21 ` [Patch v5 0/6] Introduce Thermal Pressure Lukasz Luba
  2019-11-19 10:54 ` Amit Kucheria
  7 siblings, 2 replies; 40+ messages in thread
From: Thara Gopinath @ 2019-11-05 18:49 UTC (permalink / raw)
  To: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang,
	edubezval, qperret
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

Thermal pressure follows PELT signals, which means the
decay period for thermal pressure is the default PELT
decay period. Depending on SoC characteristics and thermal
activity, it might be beneficial to decay thermal pressure
slower, but still in tune with the PELT signals.
One way to achieve this is to provide a command line parameter
to set a decay shift parameter to an integer between 0 and 10.
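
For example, booting with sched_thermal_decay_shift=2 makes the
time stamp handed to update_thermal_load_avg() advance four times
slower (rq_clock_task(rq) >> 2), so the effective half-life of the
thermal pressure signal becomes 32 ms << 2 = 128 ms.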

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---

v4->v5:
	- Changed _coeff to _shift as per review comments on the list.

 Documentation/admin-guide/kernel-parameters.txt |  5 +++++
 kernel/sched/fair.c                             | 25 +++++++++++++++++++++++--
 2 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index c82f87c..0b8f55e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4281,6 +4281,11 @@
 			incurs a small amount of overhead in the scheduler
 			but is useful for debugging and performance tuning.
 
+	sched_thermal_decay_shift=
+			[KNL, SMP] Set decay shift for thermal pressure signal.
+			Format: integer between 0 and 10
+			Default is 0.
+
 	skew_tick=	[KNL] Offset the periodic timer tick per cpu to mitigate
 			xtime_lock contention on larger systems, and/or RCU lock
 			contention on all systems with CONFIG_MAXSMP set.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5f6c371..61a020b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -91,6 +91,18 @@ const_debug unsigned int sysctl_sched_migration_cost	= 500000UL;
  * and maximum available capacity due to thermal events.
  */
 static DEFINE_PER_CPU(unsigned long, thermal_pressure);
+/**
+ * By default the decay is the default pelt decay period.
+ * The decay shift can change the decay period in
+ * multiples of 32.
+ *  Decay shift		Decay period(ms)
+ *	0			32
+ *	1			64
+ *	2			128
+ *	3			256
+ *	4			512
+ */
+static int sched_thermal_decay_shift;
 
 static void trigger_thermal_pressure_average(struct rq *rq);
 
@@ -10435,6 +10447,15 @@ void update_thermal_pressure(int cpu, unsigned long capped_capacity)
 	delta = arch_scale_cpu_capacity(cpu) - capped_capacity;
 	per_cpu(thermal_pressure, cpu) = delta;
 }
+
+static int __init setup_sched_thermal_decay_shift(char *str)
+{
+	if (kstrtoint(str, 0, &sched_thermal_decay_shift))
+		pr_warn("Unable to set scheduler thermal pressure decay shift parameter\n");
+
+	return 1;
+}
+__setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
 #endif
 
 /**
@@ -10444,8 +10465,8 @@ void update_thermal_pressure(int cpu, unsigned long capped_capacity)
 static void trigger_thermal_pressure_average(struct rq *rq)
 {
 #ifdef CONFIG_SMP
-	update_thermal_load_avg(rq_clock_task(rq), rq,
-				per_cpu(thermal_pressure, cpu_of(rq)));
+	update_thermal_load_avg(rq_clock_task(rq) >> sched_thermal_decay_shift,
+				rq, per_cpu(thermal_pressure, cpu_of(rq)));
 #endif
 }
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [Patch v5 2/6] sched/fair: Add infrastructure to store and update  instantaneous thermal pressure
  2019-11-05 18:49 ` [Patch v5 2/6] sched/fair: Add infrastructure to store and update instantaneous " Thara Gopinath
@ 2019-11-05 20:21   ` Ionela Voinescu
  2019-11-05 21:02     ` Thara Gopinath
  2019-11-06  8:27   ` Vincent Guittot
  2019-11-19 10:51   ` Amit Kucheria
  2 siblings, 1 reply; 40+ messages in thread
From: Ionela Voinescu @ 2019-11-05 20:21 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: mingo, peterz, vincent.guittot, rui.zhang, edubezval, qperret,
	linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

Hi Thara,

On Tuesday 05 Nov 2019 at 13:49:42 (-0500), Thara Gopinath wrote:
[...]
> +static void trigger_thermal_pressure_average(struct rq *rq)
> +{
> +#ifdef CONFIG_SMP
> +	update_thermal_load_avg(rq_clock_task(rq), rq,
> +				per_cpu(thermal_pressure, cpu_of(rq)));
> +#endif
> +}

Why did you decide to keep trigger_thermal_pressure_average and not
call update_thermal_load_avg directly?

For !CONFIG_SMP you already have an update_thermal_load_avg function
that does nothing, in kernel/sched/pelt.h, so you don't need that
ifdef. 

Thanks,
Ionela.

> +
>  /*
>   * All the scheduling class methods:
>   */
> -- 
> 2.1.4
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 2/6] sched/fair: Add infrastructure to store and update instantaneous thermal pressure
  2019-11-05 20:21   ` Ionela Voinescu
@ 2019-11-05 21:02     ` Thara Gopinath
  2019-11-05 21:15       ` Ionela Voinescu
  0 siblings, 1 reply; 40+ messages in thread
From: Thara Gopinath @ 2019-11-05 21:02 UTC (permalink / raw)
  To: Ionela Voinescu
  Cc: mingo, peterz, vincent.guittot, rui.zhang, edubezval, qperret,
	linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On 11/05/2019 03:21 PM, Ionela Voinescu wrote:
> Hi Thara,
> 
> On Tuesday 05 Nov 2019 at 13:49:42 (-0500), Thara Gopinath wrote:
> [...]
>> +static void trigger_thermal_pressure_average(struct rq *rq)
>> +{
>> +#ifdef CONFIG_SMP
>> +	update_thermal_load_avg(rq_clock_task(rq), rq,
>> +				per_cpu(thermal_pressure, cpu_of(rq)));
>> +#endif
>> +}
> 
> Why did you decide to keep trigger_thermal_pressure_average and not
> call update_thermal_load_avg directly?
> 
> For !CONFIG_SMP you already have an update_thermal_load_avg function
> that does nothing, in kernel/sched/pelt.h, so you don't need that
> ifdef. 
Hi,

Yes you are right. But later with the shift option added, I shift
rq_clock_task(rq) by the shift. I thought it is better to contain it in
a function that replicate it in three different places. I can remove the
CONFIG_SMP in the next version
> 
> Thanks,
> Ionela.
> 
>> +
>>  /*
>>   * All the scheduling class methods:
>>   */
>> -- 
>> 2.1.4
>>


-- 
Warm Regards
Thara

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 2/6] sched/fair: Add infrastructure to store and update instantaneous thermal pressure
  2019-11-05 21:02     ` Thara Gopinath
@ 2019-11-05 21:15       ` Ionela Voinescu
  2019-11-05 21:29         ` Thara Gopinath
  0 siblings, 1 reply; 40+ messages in thread
From: Ionela Voinescu @ 2019-11-05 21:15 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: mingo, peterz, vincent.guittot, rui.zhang, edubezval, qperret,
	linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On Tuesday 05 Nov 2019 at 16:02:00 (-0500), Thara Gopinath wrote:
> On 11/05/2019 03:21 PM, Ionela Voinescu wrote:
> > Hi Thara,
> > 
> > On Tuesday 05 Nov 2019 at 13:49:42 (-0500), Thara Gopinath wrote:
> > [...]
> >> +static void trigger_thermal_pressure_average(struct rq *rq)
> >> +{
> >> +#ifdef CONFIG_SMP
> >> +	update_thermal_load_avg(rq_clock_task(rq), rq,
> >> +				per_cpu(thermal_pressure, cpu_of(rq)));
> >> +#endif
> >> +}
> > 
> > Why did you decide to keep trigger_thermal_pressure_average and not
> > call update_thermal_load_avg directly?
> > 
> > For !CONFIG_SMP you already have an update_thermal_load_avg function
> > that does nothing, in kernel/sched/pelt.h, so you don't need that
> > ifdef. 
> Hi,
> 
> Yes you are right. But later with the shift option added, I shift
> rq_clock_task(rq) by the shift. I thought it is better to contain it in
> a function that replicate it in three different places. I can remove the
> CONFIG_SMP in the next version.

You could still keep that in one place if you shift the now argument of
___update_load_sum instead. 

To me that trigger_thermal_pressure_average function seems more code
than it's worth for this.

Thanks,
Ionela.

> > 
> > Thanks,
> > Ionela.
> > 
> >> +
> >>  /*
> >>   * All the scheduling class methods:
> >>   */
> >> -- 
> >> 2.1.4
> >>
> 
> 
> -- 
> Warm Regards
> Thara

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 2/6] sched/fair: Add infrastructure to store and update instantaneous thermal pressure
  2019-11-05 21:15       ` Ionela Voinescu
@ 2019-11-05 21:29         ` Thara Gopinath
  2019-11-05 21:53           ` Ionela Voinescu
  0 siblings, 1 reply; 40+ messages in thread
From: Thara Gopinath @ 2019-11-05 21:29 UTC (permalink / raw)
  To: Ionela Voinescu
  Cc: mingo, peterz, vincent.guittot, rui.zhang, edubezval, qperret,
	linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On 11/05/2019 04:15 PM, Ionela Voinescu wrote:
> On Tuesday 05 Nov 2019 at 16:02:00 (-0500), Thara Gopinath wrote:
>> On 11/05/2019 03:21 PM, Ionela Voinescu wrote:
>>> Hi Thara,
>>>
>>> On Tuesday 05 Nov 2019 at 13:49:42 (-0500), Thara Gopinath wrote:
>>> [...]
>>>> +static void trigger_thermal_pressure_average(struct rq *rq)
>>>> +{
>>>> +#ifdef CONFIG_SMP
>>>> +	update_thermal_load_avg(rq_clock_task(rq), rq,
>>>> +				per_cpu(thermal_pressure, cpu_of(rq)));
>>>> +#endif
>>>> +}
>>>
>>> Why did you decide to keep trigger_thermal_pressure_average and not
>>> call update_thermal_load_avg directly?
>>>
>>> For !CONFIG_SMP you already have an update_thermal_load_avg function
>>> that does nothing, in kernel/sched/pelt.h, so you don't need that
>>> ifdef. 
>> Hi,
>>
>> Yes you are right. But later with the shift option added, I shift
>> rq_clock_task(rq) by the shift. I thought it is better to contain it in
>> a function that replicate it in three different places. I can remove the
>> CONFIG_SMP in the next version.
> 
> You could still keep that in one place if you shift the now argument of
> ___update_load_sum instead.

No. I cannot do this. The authors of the pelt framework  prefers not to
include a shift parameter there. I had discussed this with Vincent earlier.

> 
> To me that trigger_thermal_pressure_average function seems more code
> than it's worth for this.
> 
> Thanks,
> Ionela.
> 
>>>
>>> Thanks,
>>> Ionela.
>>>
>>>> +
>>>>  /*
>>>>   * All the scheduling class methods:
>>>>   */
>>>> -- 
>>>> 2.1.4
>>>>
>>
>>
>> -- 
>> Warm Regards
>> Thara


-- 
Warm Regards
Thara

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 2/6] sched/fair: Add infrastructure to store and update instantaneous thermal pressure
  2019-11-05 21:29         ` Thara Gopinath
@ 2019-11-05 21:53           ` Ionela Voinescu
  2019-11-06 12:50             ` Dietmar Eggemann
  0 siblings, 1 reply; 40+ messages in thread
From: Ionela Voinescu @ 2019-11-05 21:53 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: mingo, peterz, vincent.guittot, rui.zhang, edubezval, qperret,
	linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On Tuesday 05 Nov 2019 at 16:29:32 (-0500), Thara Gopinath wrote:
> On 11/05/2019 04:15 PM, Ionela Voinescu wrote:
> > On Tuesday 05 Nov 2019 at 16:02:00 (-0500), Thara Gopinath wrote:
> >> On 11/05/2019 03:21 PM, Ionela Voinescu wrote:
> >>> Hi Thara,
> >>>
> >>> On Tuesday 05 Nov 2019 at 13:49:42 (-0500), Thara Gopinath wrote:
> >>> [...]
> >>>> +static void trigger_thermal_pressure_average(struct rq *rq)
> >>>> +{
> >>>> +#ifdef CONFIG_SMP
> >>>> +	update_thermal_load_avg(rq_clock_task(rq), rq,
> >>>> +				per_cpu(thermal_pressure, cpu_of(rq)));
> >>>> +#endif
> >>>> +}
> >>>
> >>> Why did you decide to keep trigger_thermal_pressure_average and not
> >>> call update_thermal_load_avg directly?
> >>>
> >>> For !CONFIG_SMP you already have an update_thermal_load_avg function
> >>> that does nothing, in kernel/sched/pelt.h, so you don't need that
> >>> ifdef. 
> >> Hi,
> >>
> >> Yes you are right. But later with the shift option added, I shift
> >> rq_clock_task(rq) by the shift. I thought it is better to contain it in
> >> a function that replicate it in three different places. I can remove the
> >> CONFIG_SMP in the next version.
> > 
> > You could still keep that in one place if you shift the now argument of
> > ___update_load_sum instead.
> 
> No. I cannot do this. The authors of the pelt framework  prefers not to
> include a shift parameter there. I had discussed this with Vincent earlier.
> 

Right! I missed Vincent's last comment on this in v4. 

I would argue that it's more of a PELT parameter than a CFS parameter
:), where it's currently being used. I would also argue that's more of a
PELT parameter than a thermal parameter. It controls the PELT time
progression for the thermal signal, but it seems more to configure the
PELT algorithm, rather than directly characterize thermals.

In any case, it only seemed to me that adding a wrapper function for
this purpose only was not worth doing.

Thank you for clarifying,
Ionela.

> > 
> > To me that trigger_thermal_pressure_average function seems more code
> > than it's worth for this.
> > 
> > Thanks,
> > Ionela.
> > 
> >>>
> >>> Thanks,
> >>> Ionela.
> >>>
> >>>> +
> >>>>  /*
> >>>>   * All the scheduling class methods:
> >>>>   */
> >>>> -- 
> >>>> 2.1.4
> >>>>
> >>
> >>
> >> -- 
> >> Warm Regards
> >> Thara
> 
> 
> -- 
> Warm Regards
> Thara

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 1/6] sched/pelt.c: Add support to track thermal pressure
  2019-11-05 18:49 ` [Patch v5 1/6] sched/pelt.c: Add support to track thermal pressure Thara Gopinath
@ 2019-11-06  8:24   ` Vincent Guittot
  2019-11-06 12:50   ` Dietmar Eggemann
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 40+ messages in thread
From: Vincent Guittot @ 2019-11-06  8:24 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: Ingo Molnar, Peter Zijlstra, Ionela Voinescu, Zhang Rui,
	Eduardo Valentin, Quentin Perret, linux-kernel, Amit Kachhap,
	Javi Merino, Daniel Lezcano

On Tue, 5 Nov 2019 at 19:49, Thara Gopinath <thara.gopinath@linaro.org> wrote:
>
> Extrapolating on the exisiting framework to track rt/dl utilization using
> pelt signals, add a similar mechanism to track thermal pressure. The
> difference here from rt/dl utilization tracking is that, instead of
> tracking time spent by a cpu running a rt/dl task through util_avg,
> the average thermal pressure is tracked through load_avg. This is
> because thermal pressure signal is weighted "delta" capacity
> and is not binary(util_avg is binary). "delta capacity" here
> means delta between the actual capacity of a cpu and the decreased
> capacity a cpu due to a thermal event.
> In order to track average thermal pressure, a new sched_avg variable
> avg_thermal is introduced. Function update_thermal_load_avg can be called
> to do the periodic bookeeping (accumulate, decay and average)
> of the thermal pressure.
>
> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---
>  kernel/sched/pelt.c  | 13 +++++++++++++
>  kernel/sched/pelt.h  |  7 +++++++
>  kernel/sched/sched.h |  1 +
>  3 files changed, 21 insertions(+)
>
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index a96db50..3821069 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -353,6 +353,19 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
>         return 0;
>  }
>
> +int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
> +{
> +       if (___update_load_sum(now, &rq->avg_thermal,
> +                              capacity,
> +                              capacity,
> +                              capacity)) {
> +               ___update_load_avg(&rq->avg_thermal, 1, 1);
> +               return 1;
> +       }
> +
> +       return 0;
> +}
> +
>  #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
>  /*
>   * irq:
> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> index afff644..c74226d 100644
> --- a/kernel/sched/pelt.h
> +++ b/kernel/sched/pelt.h
> @@ -6,6 +6,7 @@ int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se
>  int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
>  int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
>  int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
> +int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity);
>
>  #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
>  int update_irq_load_avg(struct rq *rq, u64 running);
> @@ -159,6 +160,12 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
>  }
>
>  static inline int
> +update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
> +{
> +       return 0;
> +}
> +
> +static inline int
>  update_irq_load_avg(struct rq *rq, u64 running)
>  {
>         return 0;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 0db2c1b..d5d82c8 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -944,6 +944,7 @@ struct rq {
>  #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
>         struct sched_avg        avg_irq;
>  #endif
> +       struct sched_avg        avg_thermal;
>         u64                     idle_stamp;
>         u64                     avg_idle;
>
> --
> 2.1.4
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 2/6] sched/fair: Add infrastructure to store and update instantaneous thermal pressure
  2019-11-05 18:49 ` [Patch v5 2/6] sched/fair: Add infrastructure to store and update instantaneous " Thara Gopinath
  2019-11-05 20:21   ` Ionela Voinescu
@ 2019-11-06  8:27   ` Vincent Guittot
  2019-11-06 17:00     ` Thara Gopinath
  2019-11-19 10:51   ` Amit Kucheria
  2 siblings, 1 reply; 40+ messages in thread
From: Vincent Guittot @ 2019-11-06  8:27 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: Ingo Molnar, Peter Zijlstra, Ionela Voinescu, Zhang Rui,
	Eduardo Valentin, Quentin Perret, linux-kernel, Amit Kachhap,
	Javi Merino, Daniel Lezcano

Hi Thara,


On Tue, 5 Nov 2019 at 19:49, Thara Gopinath <thara.gopinath@linaro.org> wrote:
>
> Add interface APIs to initialize, update/average, track, accumulate
> and decay thermal pressure per cpu basis. A per cpu variable
> thermal_pressure is introduced to keep track of instantaneous per
> cpu thermal pressure. Thermal pressure is the delta between maximum
> capacity and capped capacity due to a thermal event.
> API trigger_thermal_pressure_average is called for periodic accumulate
> and decay of the thermal pressure.This API passes on the instantaneous
> thermal pressure of a cpu to update_thermal_load_avg to do the necessary
> accumulate, decay and average.
> API update_thermal_pressure is for the system to update the thermal
> pressure by providing a capped maximum capacity.
> Considering, trigger_thermal_pressure_average reads thermal_pressure and
> update_thermal_pressure writes into thermal_pressure, one can argue for
> some sort of locking mechanism to avoid a stale value.
> But considering trigger_thermal_pressure_average can be called from a
> system critical path like scheduler tick function, a locking mechanism
> is not ideal. This means that it is possible the thermal_pressure value
> used to calculate average thermal pressure for a cpu can be
> stale for upto 1 tick period.
>
> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> ---
>
> v3->v4:
>         - Dropped per cpu max_capacity_info struct and instead added a per
>           delta_capacity variable to store the delta between maximum
>           capacity and capped capacity. The delta is now calculated when
>           thermal pressure is updated and not every tick.
>         - Dropped populate_max_capacity_info api as only per cpu delta
>           capacity is stored.
>         - Renamed update_periodic_maxcap to
>           trigger_thermal_pressure_average and update_maxcap_capacity to
>           update_thermal_pressure.
> v4->v5:
>         - As per Peter's review comments folded thermal.c into fair.c.
>         - As per Ionela's review comments revamped update_thermal_pressure
>           to take maximum available capacity as input instead of maximum
>           capped frequency ration.
>
> ---
>  include/linux/sched.h |  9 +++++++++
>  kernel/sched/fair.c   | 37 +++++++++++++++++++++++++++++++++++++
>  2 files changed, 46 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 263cf08..3c31084 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1993,6 +1993,15 @@ static inline void rseq_syscall(struct pt_regs *regs)
>
>  #endif
>
> +#ifdef CONFIG_SMP
> +void update_thermal_pressure(int cpu, unsigned long capped_capacity);
> +#else
> +static inline void
> +update_thermal_pressure(int cpu, unsigned long capped_capacity)
> +{
> +}
> +#endif
> +
>  const struct sched_avg *sched_trace_cfs_rq_avg(struct cfs_rq *cfs_rq);
>  char *sched_trace_cfs_rq_path(struct cfs_rq *cfs_rq, char *str, int len);
>  int sched_trace_cfs_rq_cpu(struct cfs_rq *cfs_rq);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 682a754..2e907cc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -86,6 +86,12 @@ static unsigned int normalized_sysctl_sched_wakeup_granularity       = 1000000UL;
>
>  const_debug unsigned int sysctl_sched_migration_cost   = 500000UL;
>
> +/*
> + * Per-cpu instantaneous delta between maximum capacity
> + * and maximum available capacity due to thermal events.
> + */
> +static DEFINE_PER_CPU(unsigned long, thermal_pressure);
> +
>  #ifdef CONFIG_SMP
>  /*
>   * For asym packing, by default the lower numbered CPU has higher priority.
> @@ -10401,6 +10407,37 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
>         return rr_interval;
>  }
>
> +#ifdef CONFIG_SMP
> +/**
> + * update_thermal_pressure: Update thermal pressure
> + * @cpu: the cpu for which thermal pressure is to be updated for
> + * @capped_capacity: maximum capacity of the cpu after the capping
> + *                  due to thermal event.
> + *
> + * Delta between the arch_scale_cpu_capacity and capped max capacity is
> + * stored in per cpu thermal_pressure variable.
> + */
> +void update_thermal_pressure(int cpu, unsigned long capped_capacity)
> +{
> +       unsigned long delta;
> +
> +       delta = arch_scale_cpu_capacity(cpu) - capped_capacity;
> +       per_cpu(thermal_pressure, cpu) = delta;

use WRITE_ONCE

> +}
> +#endif
> +
> +/**
> + * trigger_thermal_pressure_average: Trigger the thermal pressure accumulate
> + *                                  and average algorithm
> + */
> +static void trigger_thermal_pressure_average(struct rq *rq)
> +{
> +#ifdef CONFIG_SMP
> +       update_thermal_load_avg(rq_clock_task(rq), rq,
> +                               per_cpu(thermal_pressure, cpu_of(rq)));
> +#endif
> +}
> +
>  /*
>   * All the scheduling class methods:
>   */
> --
> 2.1.4
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 3/6] sched/fair: Enable periodic update of average thermal pressure
  2019-11-05 18:49 ` [Patch v5 3/6] sched/fair: Enable periodic update of average " Thara Gopinath
@ 2019-11-06  8:32   ` Vincent Guittot
  2019-11-06 17:01     ` Thara Gopinath
  0 siblings, 1 reply; 40+ messages in thread
From: Vincent Guittot @ 2019-11-06  8:32 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: Ingo Molnar, Peter Zijlstra, Ionela Voinescu, Zhang Rui,
	Eduardo Valentin, Quentin Perret, linux-kernel, Amit Kachhap,
	Javi Merino, Daniel Lezcano

On Tue, 5 Nov 2019 at 19:49, Thara Gopinath <thara.gopinath@linaro.org> wrote:
>
> Introduce support in CFS periodic tick and other bookkeeping apis
> to trigger the process of computing average thermal pressure for a
> cpu. Also consider avg_thermal.load_avg in others_have_blocked
> which allows for decay of pelt signals.
>
> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> ---
>
> v4->v5:
>         - Updated both versions of update_blocked_averages to trigger the
>           process of computing average thermal pressure.
>         - Updated others_have_blocked to considerd avg_thermal.load_avg.
>
>  kernel/sched/fair.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2e907cc..9fb0494 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -92,6 +92,8 @@ const_debug unsigned int sysctl_sched_migration_cost  = 500000UL;
>   */
>  static DEFINE_PER_CPU(unsigned long, thermal_pressure);
>
> +static void trigger_thermal_pressure_average(struct rq *rq);
> +
>  #ifdef CONFIG_SMP
>  /*
>   * For asym packing, by default the lower numbered CPU has higher priority.
> @@ -7493,6 +7495,9 @@ static inline bool others_have_blocked(struct rq *rq)
>         if (READ_ONCE(rq->avg_dl.util_avg))
>                 return true;
>
> +       if (READ_ONCE(rq->avg_thermal.load_avg))
> +               return true;
> +
>  #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
>         if (READ_ONCE(rq->avg_irq.util_avg))
>                 return true;
> @@ -7580,6 +7585,8 @@ static void update_blocked_averages(int cpu)
>                 done = false;
>
>         update_blocked_load_status(rq, !done);
> +
> +       trigger_thermal_pressure_average(rq);

This must be called before others_have_blocked() to take into account
the latest update

>         rq_unlock_irqrestore(rq, &rf);
>  }
>
> @@ -7646,6 +7653,7 @@ static inline void update_blocked_averages(int cpu)
>         update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
>         update_irq_load_avg(rq, 0);
>         update_blocked_load_status(rq, cfs_rq_has_blocked(cfs_rq) || others_have_blocked(rq));
> +       trigger_thermal_pressure_average(rq);

idem

>         rq_unlock_irqrestore(rq, &rf);
>  }
>
> @@ -9939,6 +9947,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>
>         update_misfit_status(curr, rq);
>         update_overutilized_status(task_rq(curr));
> +

remove blank line

> +       trigger_thermal_pressure_average(rq);
>  }
>
>  /*
> --
> 2.1.4
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 1/6] sched/pelt.c: Add support to track thermal pressure
  2019-11-05 18:49 ` [Patch v5 1/6] sched/pelt.c: Add support to track thermal pressure Thara Gopinath
  2019-11-06  8:24   ` Vincent Guittot
@ 2019-11-06 12:50   ` Dietmar Eggemann
  2019-11-06 17:00     ` Thara Gopinath
  2019-11-07 16:39   ` Qais Yousef
  2019-11-19 10:50   ` Amit Kucheria
  3 siblings, 1 reply; 40+ messages in thread
From: Dietmar Eggemann @ 2019-11-06 12:50 UTC (permalink / raw)
  To: Thara Gopinath, mingo, peterz, ionela.voinescu, vincent.guittot,
	rui.zhang, edubezval, qperret
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On 05/11/2019 19:49, Thara Gopinath wrote:
> Extrapolating on the exisiting framework to track rt/dl utilization using

s/exisiting/existing

> pelt signals, add a similar mechanism to track thermal pressure. The
> difference here from rt/dl utilization tracking is that, instead of
> tracking time spent by a cpu running a rt/dl task through util_avg,
> the average thermal pressure is tracked through load_avg. This is
> because thermal pressure signal is weighted "delta" capacity
> and is not binary(util_avg is binary). "delta capacity" here
> means delta between the actual capacity of a cpu and the decreased
> capacity a cpu due to a thermal event.
> In order to track average thermal pressure, a new sched_avg variable
> avg_thermal is introduced. Function update_thermal_load_avg can be called
> to do the periodic bookeeping (accumulate, decay and average)

s/bookeeping/bookkeeping

> of the thermal pressure.
> 
> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> ---
>  kernel/sched/pelt.c  | 13 +++++++++++++
>  kernel/sched/pelt.h  |  7 +++++++
>  kernel/sched/sched.h |  1 +
>  3 files changed, 21 insertions(+)
> 
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index a96db50..3821069 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -353,6 +353,19 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
>  	return 0;
>  }

Minor thing: There are function headers for rt_rq, dl_rq and irq. rt_rq
even explains that 'load_avg and runnable_load_avg are not supported and
meaningless.' Could you do something similar for thermal here? It's not
self-explanatory why we track load_avg, runnable_load_avg and util_avg
for thermal but only use load_avg.
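
Something along these lines, for instance (wording purely
illustrative):

/*
 * thermal:
 *
 *   load_sum = \Sum decayed (capacity delta)
 *   load_avg = time-weighted average of the capacity lost to capping
 *
 *   The signal is not binary, so util_avg and runnable_load_avg are
 *   updated by the common code but carry no meaning here; only
 *   load_avg is consumed.
 */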

> +int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
> +{
> +	if (___update_load_sum(now, &rq->avg_thermal,
> +			       capacity,
> +			       capacity,
> +			       capacity)) {
> +		___update_load_avg(&rq->avg_thermal, 1, 1);
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +

[...]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 2/6] sched/fair: Add infrastructure to store and update instantaneous thermal pressure
  2019-11-05 21:53           ` Ionela Voinescu
@ 2019-11-06 12:50             ` Dietmar Eggemann
  2019-11-06 17:53               ` Thara Gopinath
  0 siblings, 1 reply; 40+ messages in thread
From: Dietmar Eggemann @ 2019-11-06 12:50 UTC (permalink / raw)
  To: Ionela Voinescu, Thara Gopinath
  Cc: mingo, peterz, vincent.guittot, rui.zhang, edubezval, qperret,
	linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On 05/11/2019 22:53, Ionela Voinescu wrote:
> On Tuesday 05 Nov 2019 at 16:29:32 (-0500), Thara Gopinath wrote:
>> On 11/05/2019 04:15 PM, Ionela Voinescu wrote:
>>> On Tuesday 05 Nov 2019 at 16:02:00 (-0500), Thara Gopinath wrote:
>>>> On 11/05/2019 03:21 PM, Ionela Voinescu wrote:
>>>>> Hi Thara,
>>>>>
>>>>> On Tuesday 05 Nov 2019 at 13:49:42 (-0500), Thara Gopinath wrote:
>>>>> [...]
>>>>>> +static void trigger_thermal_pressure_average(struct rq *rq)
>>>>>> +{
>>>>>> +#ifdef CONFIG_SMP
>>>>>> +	update_thermal_load_avg(rq_clock_task(rq), rq,
>>>>>> +				per_cpu(thermal_pressure, cpu_of(rq)));
>>>>>> +#endif
>>>>>> +}
>>>>>
>>>>> Why did you decide to keep trigger_thermal_pressure_average and not
>>>>> call update_thermal_load_avg directly?
>>>>>
>>>>> For !CONFIG_SMP you already have an update_thermal_load_avg function
>>>>> that does nothing, in kernel/sched/pelt.h, so you don't need that
>>>>> ifdef. 
>>>> Hi,
>>>>
>>>> Yes you are right. But later with the shift option added, I shift
>>>> rq_clock_task(rq) by the shift. I thought it is better to contain it in
>>>> a function that replicate it in three different places. I can remove the
>>>> CONFIG_SMP in the next version.
>>>
>>> You could still keep that in one place if you shift the now argument of
>>> ___update_load_sum instead.
>>
>> No. I cannot do this. The authors of the pelt framework  prefers not to
>> include a shift parameter there. I had discussed this with Vincent earlier.
>>
> 
> Right! I missed Vincent's last comment on this in v4. 
> 
> I would argue that it's more of a PELT parameter than a CFS parameter
> :), where it's currently being used. I would also argue that's more of a
> PELT parameter than a thermal parameter. It controls the PELT time
> progression for the thermal signal, but it seems more to configure the
> PELT algorithm, rather than directly characterize thermals.
> 
> In any case, it only seemed to me that adding a wrapper function for
> this purpose only was not worth doing.

Coming back to the v4 discussion
https://lore.kernel.org/r/379d23e5-79a5-9d90-0fb6-125d9be85e99@arm.com

There is no API between pelt.c and other parts of the scheduler/kernel
so why should we keep an unnecessary parameter and wrapper functions?

There is also no abstraction, update_thermal_load_avg() in pelt.c even
carries '_thermal_' in its name.

So why not define this shift value '[sched_|pelt_]thermal_decay_shift'
there as well? It belongs to update_thermal_load_avg().

All call sites of update_thermal_load_avg() use 'rq_clock_task(rq) >>
sched_thermal_decay_shift' so I don't see the need to pass it in.

IMHO, preparing for eventual code changes (e.g. passing different now
values) is not a good practice in the kernel. Keeping the code small and
tidy is.
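
For illustration, the alternative would amount to roughly this
reshuffle of the patch 1/6 hunk (sketch only), with the shift
defined and applied inside pelt.c so callers simply pass
rq_clock_task(rq):

int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
{
	/* shift defined in pelt.c rather than passed in pre-shifted */
	now >>= sched_thermal_decay_shift;

	if (___update_load_sum(now, &rq->avg_thermal,
			       capacity,
			       capacity,
			       capacity)) {
		___update_load_avg(&rq->avg_thermal, 1, 1);
		return 1;
	}

	return 0;
}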

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 5/6] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping
  2019-11-05 18:49 ` [Patch v5 5/6] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping Thara Gopinath
@ 2019-11-06 12:50   ` Dietmar Eggemann
  2019-11-06 17:28     ` Thara Gopinath
  2019-11-07 13:00     ` Dietmar Eggemann
  0 siblings, 2 replies; 40+ messages in thread
From: Dietmar Eggemann @ 2019-11-06 12:50 UTC (permalink / raw)
  To: Thara Gopinath, mingo, peterz, ionela.voinescu, vincent.guittot,
	rui.zhang, edubezval, qperret
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On 05/11/2019 19:49, Thara Gopinath wrote:

[...]

> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> index 391f397..c65b7c4 100644
> --- a/drivers/thermal/cpu_cooling.c
> +++ b/drivers/thermal/cpu_cooling.c
> @@ -218,6 +218,27 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
>  }
>  
>  /**
> + * update_sched_max_capacity - update scheduler about change in cpu
> + *				max frequency.
> + * @cpus: list of cpus whose max capacity needs udating in scheduler.
> + * @cur_max_freq: current maximum frequency.
> + * @max_freq: highest possible frequency.
> + */
> +static void update_sched_max_capacity(struct cpumask *cpus,
> +				      unsigned int cur_max_freq,
> +				      unsigned int max_freq)
> +{
> +	int cpu;
> +	unsigned long capacity;
> +
> +	for_each_cpu(cpu, cpus) {
> +		capacity = cur_max_freq * arch_scale_cpu_capacity(cpu);
> +		capacity /= max_freq;
> +		update_thermal_pressure(cpu, capacity);
> +	}
> +}
> +
> +/**

Have you seen
https://lore.kernel.org/r/2b19d7da-412c-932f-7251-110eadbef3e3@arm.com ?

Also the naming 'update_thermal_pressure()' is not really suitable for
an external task scheduler interface. If I see this called in the
cooling device code, I wouldn't guess that this is a task scheduler
interface.

[...]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 4/6] sched/fair: update cpu_capcity to reflect thermal pressure
  2019-11-05 18:49 ` [Patch v5 4/6] sched/fair: update cpu_capcity to reflect " Thara Gopinath
@ 2019-11-06 16:56   ` Qais Yousef
  2019-11-06 17:31     ` Thara Gopinath
  2019-11-19 10:51   ` Amit Kucheria
  1 sibling, 1 reply; 40+ messages in thread
From: Qais Yousef @ 2019-11-06 16:56 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang,
	edubezval, qperret, linux-kernel, amit.kachhap, javi.merino,
	daniel.lezcano

On 11/05/19 13:49, Thara Gopinath wrote:
> cpu_capacity relflects the maximum available capacity of a cpu. Thermal
> pressure on a cpu means this maximum available capacity is reduced. This
> patch reduces the average thermal pressure for a cpu from its maximum
> available capacity so that cpu_capacity reflects the actual
> available capacity.
> 
> Other signals that are deducted from cpu_capacity to reflect the actual
> capacity available are rt and dl util_avg. util_avg tracks a binary signal
> and uses the weights 1024 and 0. Whereas thermal pressure is tracked
> using load_avg. load_avg uses the actual "delta" capacity as the weight.

I think you intended to put this as comment...

> 
> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> ---
>  kernel/sched/fair.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9fb0494..5f6c371 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7738,6 +7738,7 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
>  
>  	used = READ_ONCE(rq->avg_rt.util_avg);
>  	used += READ_ONCE(rq->avg_dl.util_avg);
> +	used += READ_ONCE(rq->avg_thermal.load_avg);

... here?

I find the explanation hard to parse too. Do you think you can rephrase it?
Something based on what you wrote here would be more understandable IMHO:
https://lore.kernel.org/lkml/5DBB05BC.40502@linaro.org/


Thanks!

--
Qais Yousef

>  
>  	if (unlikely(used >= max))
>  		return 1;
> -- 
> 2.1.4
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 1/6] sched/pelt.c: Add support to track thermal pressure
  2019-11-06 12:50   ` Dietmar Eggemann
@ 2019-11-06 17:00     ` Thara Gopinath
  0 siblings, 0 replies; 40+ messages in thread
From: Thara Gopinath @ 2019-11-06 17:00 UTC (permalink / raw)
  To: Dietmar Eggemann, mingo, peterz, ionela.voinescu,
	vincent.guittot, rui.zhang, edubezval, qperret
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

Hi Dietmar,
Thanks for the review.
On 11/06/2019 07:50 AM, Dietmar Eggemann wrote:
> On 05/11/2019 19:49, Thara Gopinath wrote:
>> Extrapolating on the exisiting framework to track rt/dl utilization using
> 
> s/exisiting/existing
> 
>> pelt signals, add a similar mechanism to track thermal pressure. The
>> difference here from rt/dl utilization tracking is that, instead of
>> tracking time spent by a cpu running a rt/dl task through util_avg,
>> the average thermal pressure is tracked through load_avg. This is
>> because thermal pressure signal is weighted "delta" capacity
>> and is not binary(util_avg is binary). "delta capacity" here
>> means delta between the actual capacity of a cpu and the decreased
>> capacity a cpu due to a thermal event.
>> In order to track average thermal pressure, a new sched_avg variable
>> avg_thermal is introduced. Function update_thermal_load_avg can be called
>> to do the periodic bookeeping (accumulate, decay and average)
> 
> s/bookeeping/bookkeeping
> 
>> of the thermal pressure.
>>
>> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
>> ---
>>  kernel/sched/pelt.c  | 13 +++++++++++++
>>  kernel/sched/pelt.h  |  7 +++++++
>>  kernel/sched/sched.h |  1 +
>>  3 files changed, 21 insertions(+)
>>
>> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
>> index a96db50..3821069 100644
>> --- a/kernel/sched/pelt.c
>> +++ b/kernel/sched/pelt.c
>> @@ -353,6 +353,19 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
>>  	return 0;
>>  }
> 
> Minor thing: There are function headers for rt_rq, dl_rq and irq. rt_rq
> even explains that 'load_avg and runnable_load_avg are not supported and
> meaningless.' Could you do something similar for thermal here? It's not
> self-explanatory why we track load_avg, runnable_load_avg and util_avg
> for thermal but only use load_avg.

Will put a function header and update the nits above.
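For illustration, such a header could read roughly like the below, following
the style of the rt/dl ones (the wording is only a sketch, not the final text):

/*
 * thermal:
 *
 *   load_avg (and not util_avg) tracks the thermal pressure signal,
 *   because the signal is weighted by the "delta" capacity (the capacity
 *   lost to the thermal cap) rather than being a binary running signal.
 *
 *   runnable_load_avg and util_avg are accumulated but only load_avg is used.
 */
int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)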
> 
>> +int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
>> +{
>> +	if (___update_load_sum(now, &rq->avg_thermal,
>> +			       capacity,
>> +			       capacity,
>> +			       capacity)) {
>> +		___update_load_avg(&rq->avg_thermal, 1, 1);
>> +		return 1;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
> 
> [...]
> 


-- 
Warm Regards
Thara

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 2/6] sched/fair: Add infrastructure to store and update instantaneous thermal pressure
  2019-11-06  8:27   ` Vincent Guittot
@ 2019-11-06 17:00     ` Thara Gopinath
  0 siblings, 0 replies; 40+ messages in thread
From: Thara Gopinath @ 2019-11-06 17:00 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Ionela Voinescu, Zhang Rui,
	Eduardo Valentin, Quentin Perret, linux-kernel, Amit Kachhap,
	Javi Merino, Daniel Lezcano

Hi Vincent,
Thanks for the review
On 11/06/2019 03:27 AM, Vincent Guittot wrote:
> Hi Thara,
> 
> 
> On Tue, 5 Nov 2019 at 19:49, Thara Gopinath <thara.gopinath@linaro.org> wrote:
>>
>> Add interface APIs to initialize, update/average, track, accumulate
>> and decay thermal pressure per cpu basis. A per cpu variable
>> thermal_pressure is introduced to keep track of instantaneous per
>> cpu thermal pressure. Thermal pressure is the delta between maximum
>> capacity and capped capacity due to a thermal event.
>> API trigger_thermal_pressure_average is called for periodic accumulate
>> and decay of the thermal pressure.This API passes on the instantaneous
>> thermal pressure of a cpu to update_thermal_load_avg to do the necessary
>> accumulate, decay and average.
>> API update_thermal_pressure is for the system to update the thermal
>> pressure by providing a capped maximum capacity.
>> Considering, trigger_thermal_pressure_average reads thermal_pressure and
>> update_thermal_pressure writes into thermal_pressure, one can argue for
>> some sort of locking mechanism to avoid a stale value.
>> But considering trigger_thermal_pressure_average can be called from a
>> system critical path like scheduler tick function, a locking mechanism
>> is not ideal. This means that it is possible the thermal_pressure value
>> used to calculate average thermal pressure for a cpu can be
>> stale for upto 1 tick period.
>>
>> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
>> ---
>>
>> v3->v4:
>>         - Dropped per cpu max_capacity_info struct and instead added a per
>>           delta_capacity variable to store the delta between maximum
>>           capacity and capped capacity. The delta is now calculated when
>>           thermal pressure is updated and not every tick.
>>         - Dropped populate_max_capacity_info api as only per cpu delta
>>           capacity is stored.
>>         - Renamed update_periodic_maxcap to
>>           trigger_thermal_pressure_average and update_maxcap_capacity to
>>           update_thermal_pressure.
>> v4->v5:
>>         - As per Peter's review comments folded thermal.c into fair.c.
>>         - As per Ionela's review comments revamped update_thermal_pressure
>>           to take maximum available capacity as input instead of maximum
>>           capped frequency ration.
>>
>> ---
>>  include/linux/sched.h |  9 +++++++++
>>  kernel/sched/fair.c   | 37 +++++++++++++++++++++++++++++++++++++
>>  2 files changed, 46 insertions(+)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 263cf08..3c31084 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1993,6 +1993,15 @@ static inline void rseq_syscall(struct pt_regs *regs)
>>
>>  #endif
>>
>> +#ifdef CONFIG_SMP
>> +void update_thermal_pressure(int cpu, unsigned long capped_capacity);
>> +#else
>> +static inline void
>> +update_thermal_pressure(int cpu, unsigned long capped_capacity)
>> +{
>> +}
>> +#endif
>> +
>>  const struct sched_avg *sched_trace_cfs_rq_avg(struct cfs_rq *cfs_rq);
>>  char *sched_trace_cfs_rq_path(struct cfs_rq *cfs_rq, char *str, int len);
>>  int sched_trace_cfs_rq_cpu(struct cfs_rq *cfs_rq);
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 682a754..2e907cc 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -86,6 +86,12 @@ static unsigned int normalized_sysctl_sched_wakeup_granularity       = 1000000UL;
>>
>>  const_debug unsigned int sysctl_sched_migration_cost   = 500000UL;
>>
>> +/*
>> + * Per-cpu instantaneous delta between maximum capacity
>> + * and maximum available capacity due to thermal events.
>> + */
>> +static DEFINE_PER_CPU(unsigned long, thermal_pressure);
>> +
>>  #ifdef CONFIG_SMP
>>  /*
>>   * For asym packing, by default the lower numbered CPU has higher priority.
>> @@ -10401,6 +10407,37 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
>>         return rr_interval;
>>  }
>>
>> +#ifdef CONFIG_SMP
>> +/**
>> + * update_thermal_pressure: Update thermal pressure
>> + * @cpu: the cpu for which thermal pressure is to be updated for
>> + * @capped_capacity: maximum capacity of the cpu after the capping
>> + *                  due to thermal event.
>> + *
>> + * Delta between the arch_scale_cpu_capacity and capped max capacity is
>> + * stored in per cpu thermal_pressure variable.
>> + */
>> +void update_thermal_pressure(int cpu, unsigned long capped_capacity)
>> +{
>> +       unsigned long delta;
>> +
>> +       delta = arch_scale_cpu_capacity(cpu) - capped_capacity;
>> +       per_cpu(thermal_pressure, cpu) = delta;
> 
> use WRITE_ONCE
Will do.
> 
>> +}
>> +#endif
>> +
>> +/**
>> + * trigger_thermal_pressure_average: Trigger the thermal pressure accumulate
>> + *                                  and average algorithm
>> + */
>> +static void trigger_thermal_pressure_average(struct rq *rq)
>> +{
>> +#ifdef CONFIG_SMP
>> +       update_thermal_load_avg(rq_clock_task(rq), rq,
>> +                               per_cpu(thermal_pressure, cpu_of(rq)));
>> +#endif
>> +}
>> +
>>  /*
>>   * All the scheduling class methods:
>>   */
>> --
>> 2.1.4
>>


-- 
Warm Regards
Thara

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 3/6] sched/fair: Enable periodic update of average thermal pressure
  2019-11-06  8:32   ` Vincent Guittot
@ 2019-11-06 17:01     ` Thara Gopinath
  0 siblings, 0 replies; 40+ messages in thread
From: Thara Gopinath @ 2019-11-06 17:01 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Ionela Voinescu, Zhang Rui,
	Eduardo Valentin, Quentin Perret, linux-kernel, Amit Kachhap,
	Javi Merino, Daniel Lezcano

On 11/06/2019 03:32 AM, Vincent Guittot wrote:
> On Tue, 5 Nov 2019 at 19:49, Thara Gopinath <thara.gopinath@linaro.org> wrote:
>>
>> Introduce support in CFS periodic tick and other bookkeeping apis
>> to trigger the process of computing average thermal pressure for a
>> cpu. Also consider avg_thermal.load_avg in others_have_blocked
>> which allows for decay of pelt signals.
>>
>> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
>> ---
>>
>> v4->v5:
>>         - Updated both versions of update_blocked_averages to trigger the
>>           process of computing average thermal pressure.
>>         - Updated others_have_blocked to considerd avg_thermal.load_avg.
>>
>>  kernel/sched/fair.c | 10 ++++++++++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 2e907cc..9fb0494 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -92,6 +92,8 @@ const_debug unsigned int sysctl_sched_migration_cost  = 500000UL;
>>   */
>>  static DEFINE_PER_CPU(unsigned long, thermal_pressure);
>>
>> +static void trigger_thermal_pressure_average(struct rq *rq);
>> +
>>  #ifdef CONFIG_SMP
>>  /*
>>   * For asym packing, by default the lower numbered CPU has higher priority.
>> @@ -7493,6 +7495,9 @@ static inline bool others_have_blocked(struct rq *rq)
>>         if (READ_ONCE(rq->avg_dl.util_avg))
>>                 return true;
>>
>> +       if (READ_ONCE(rq->avg_thermal.load_avg))
>> +               return true;
>> +
>>  #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
>>         if (READ_ONCE(rq->avg_irq.util_avg))
>>                 return true;
>> @@ -7580,6 +7585,8 @@ static void update_blocked_averages(int cpu)
>>                 done = false;
>>
>>         update_blocked_load_status(rq, !done);
>> +
>> +       trigger_thermal_pressure_average(rq);
> 
> This must be called before others_have_blocked() to take into account
> the latest update

It is a bug. I agree. Will fix it.
> 
>>         rq_unlock_irqrestore(rq, &rf);
>>  }
>>
>> @@ -7646,6 +7653,7 @@ static inline void update_blocked_averages(int cpu)
>>         update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
>>         update_irq_load_avg(rq, 0);
>>         update_blocked_load_status(rq, cfs_rq_has_blocked(cfs_rq) || others_have_blocked(rq));
>> +       trigger_thermal_pressure_average(rq);
> 
> idem
> 
>>         rq_unlock_irqrestore(rq, &rf);
>>  }
>>
>> @@ -9939,6 +9947,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>>
>>         update_misfit_status(curr, rq);
>>         update_overutilized_status(task_rq(curr));
>> +
> 
> remove blank line
> 
>> +       trigger_thermal_pressure_average(rq);
>>  }
>>
>>  /*
>> --
>> 2.1.4
>>


-- 
Warm Regards
Thara

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 5/6] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping
  2019-11-06 12:50   ` Dietmar Eggemann
@ 2019-11-06 17:28     ` Thara Gopinath
  2019-11-07 13:00     ` Dietmar Eggemann
  1 sibling, 0 replies; 40+ messages in thread
From: Thara Gopinath @ 2019-11-06 17:28 UTC (permalink / raw)
  To: Dietmar Eggemann, mingo, peterz, ionela.voinescu,
	vincent.guittot, rui.zhang, edubezval, qperret
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On 11/06/2019 07:50 AM, Dietmar Eggemann wrote:
> On 05/11/2019 19:49, Thara Gopinath wrote:
> 
> [...]
> 
>> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
>> index 391f397..c65b7c4 100644
>> --- a/drivers/thermal/cpu_cooling.c
>> +++ b/drivers/thermal/cpu_cooling.c
>> @@ -218,6 +218,27 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
>>  }
>>  
>>  /**
>> + * update_sched_max_capacity - update scheduler about change in cpu
>> + *				max frequency.
>> + * @cpus: list of cpus whose max capacity needs udating in scheduler.
>> + * @cur_max_freq: current maximum frequency.
>> + * @max_freq: highest possible frequency.
>> + */
>> +static void update_sched_max_capacity(struct cpumask *cpus,
>> +				      unsigned int cur_max_freq,
>> +				      unsigned int max_freq)
>> +{
>> +	int cpu;
>> +	unsigned long capacity;
>> +
>> +	for_each_cpu(cpu, cpus) {
>> +		capacity = cur_max_freq * arch_scale_cpu_capacity(cpu);
>> +		capacity /= max_freq;
>> +		update_thermal_pressure(cpu, capacity);
>> +	}
>> +}
>> +
>> +/**
> 
> Have you seen
> https://lore.kernel.org/r/2b19d7da-412c-932f-7251-110eadbef3e3@arm.com ?
Yes and have you seen this
https://lore.kernel.org/lkml/CAKfTPtCpZq61gQVpATtTdg5hDA+tP4bF6xPMsvHYUMoY+H-6FQ@mail.gmail.com/

this
https://lore.kernel.org/lkml/5DBB0FD4.8000509@linaro.org/

and this from Ionela (which is the basis for v5)
https://lore.kernel.org/lkml/20191101154612.GA4884@e108754-lin/

I stand by what I said earlier: I see no reason to move the parsing of
the cpus into fair.c for this feature (this way the solution is more
generic and can be used by any other entity capping the maximum cpu frequency).

> 
> Also the naming 'update_thermal_pressure()' is not really suitable for
> an external task scheduler interface. If I see this called in the
> cooling device code, I wouldn't guess that this is a task scheduler
> interface.

Do you have a name suggestion? When you say you don't like a name, please
suggest one so that I have an idea of what you are expecting.

> 
> [...]
> 


-- 
Warm Regards
Thara

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 4/6] sched/fair: update cpu_capcity to reflect thermal pressure
  2019-11-06 16:56   ` Qais Yousef
@ 2019-11-06 17:31     ` Thara Gopinath
  2019-11-06 17:41       ` Qais Yousef
  0 siblings, 1 reply; 40+ messages in thread
From: Thara Gopinath @ 2019-11-06 17:31 UTC (permalink / raw)
  To: Qais Yousef
  Cc: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang,
	edubezval, qperret, linux-kernel, amit.kachhap, javi.merino,
	daniel.lezcano

On 11/06/2019 11:56 AM, Qais Yousef wrote:
> On 11/05/19 13:49, Thara Gopinath wrote:
>> cpu_capacity relflects the maximum available capacity of a cpu. Thermal
>> pressure on a cpu means this maximum available capacity is reduced. This
>> patch reduces the average thermal pressure for a cpu from its maximum
>> available capacity so that cpu_capacity reflects the actual
>> available capacity.
>>
>> Other signals that are deducted from cpu_capacity to reflect the actual
>> capacity available are rt and dl util_avg. util_avg tracks a binary signal
>> and uses the weights 1024 and 0. Whereas thermal pressure is tracked
>> using load_avg. load_avg uses the actual "delta" capacity as the weight.
> 
> I think you intended to put this as comment...
> 
>>
>> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
>> ---
>>  kernel/sched/fair.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 9fb0494..5f6c371 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7738,6 +7738,7 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
>>  
>>  	used = READ_ONCE(rq->avg_rt.util_avg);
>>  	used += READ_ONCE(rq->avg_dl.util_avg);
>> +	used += READ_ONCE(rq->avg_thermal.load_avg);
> 
> ... here?

I did not!  But I can.
> 
> I find the explanation hard to parse too. Do you think you can rephrase it?
> Something based on what you wrote here would be more understandable IMHO:
> https://lore.kernel.org/lkml/5DBB05BC.40502@linaro.org/
I will try to rephrase it! I am sorry that you found it hard to parse.
Honestly, I cannot copy-paste the code snippet I pointed you to into the
comment (and I think that is the reason you found it easier to
understand), but I will try my best to put it in words.

> 
> 
> Thanks!
> 
> --
> Qais Yousef
> 
>>  
>>  	if (unlikely(used >= max))
>>  		return 1;
>> -- 
>> 2.1.4
>>


-- 
Warm Regards
Thara

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 4/6] sched/fair: update cpu_capcity to reflect thermal pressure
  2019-11-06 17:31     ` Thara Gopinath
@ 2019-11-06 17:41       ` Qais Yousef
  0 siblings, 0 replies; 40+ messages in thread
From: Qais Yousef @ 2019-11-06 17:41 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang,
	edubezval, qperret, linux-kernel, amit.kachhap, javi.merino,
	daniel.lezcano

On 11/06/19 12:31, Thara Gopinath wrote:
> On 11/06/2019 11:56 AM, Qais Yousef wrote:
> > On 11/05/19 13:49, Thara Gopinath wrote:
> >> cpu_capacity relflects the maximum available capacity of a cpu. Thermal
> >> pressure on a cpu means this maximum available capacity is reduced. This
> >> patch reduces the average thermal pressure for a cpu from its maximum
> >> available capacity so that cpu_capacity reflects the actual
> >> available capacity.
> >>
> >> Other signals that are deducted from cpu_capacity to reflect the actual
> >> capacity available are rt and dl util_avg. util_avg tracks a binary signal
> >> and uses the weights 1024 and 0. Whereas thermal pressure is tracked
> >> using load_avg. load_avg uses the actual "delta" capacity as the weight.
> > 
> > I think you intended to put this as comment...
> > 
> >>
> >> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> >> ---
> >>  kernel/sched/fair.c | 1 +
> >>  1 file changed, 1 insertion(+)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 9fb0494..5f6c371 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -7738,6 +7738,7 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
> >>  
> >>  	used = READ_ONCE(rq->avg_rt.util_avg);
> >>  	used += READ_ONCE(rq->avg_dl.util_avg);
> >> +	used += READ_ONCE(rq->avg_thermal.load_avg);
> > 
> > ... here?
> 
> I did not!  But I can.
> > 
> > I find the explanation hard to parse too. Do you think you can rephrase it?
> > Something based on what you wrote here would be more understandable IMHO:
> > https://lore.kernel.org/lkml/5DBB05BC.40502@linaro.org/
> I will try to rephrase it! I am sorry that you found it hard to parse.
> Honestly, I cannot copy paste the code snippet I pointed out to you here
> in comment.(And I think that is the reason you found it easier to
> understand) But I will try my best to put it in words.

No worries. The problem could be me :-)

But a comment in the code is very important, as summing util_avg + load_avg is
confusing without one. I wouldn't expect the two signals to be compatible, but the
thermal one is special. A comment explaining why it's special is all we need.
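
For example, a short block above the new line would be enough (the wording
below is only a sketch):

	used = READ_ONCE(rq->avg_rt.util_avg);
	used += READ_ONCE(rq->avg_dl.util_avg);
	/*
	 * Thermal pressure is tracked through load_avg rather than util_avg:
	 * the signal is weighted by the capacity "delta" lost to the thermal
	 * cap, so its load_avg can be summed with the util_avg terms above.
	 */
	used += READ_ONCE(rq->avg_thermal.load_avg);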


Thanks

--
Qais Yousef

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 2/6] sched/fair: Add infrastructure to store and update instantaneous thermal pressure
  2019-11-06 12:50             ` Dietmar Eggemann
@ 2019-11-06 17:53               ` Thara Gopinath
  2019-11-07  9:32                 ` Dietmar Eggemann
  0 siblings, 1 reply; 40+ messages in thread
From: Thara Gopinath @ 2019-11-06 17:53 UTC (permalink / raw)
  To: Dietmar Eggemann, Ionela Voinescu
  Cc: mingo, peterz, vincent.guittot, rui.zhang, edubezval, qperret,
	linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On 11/06/2019 07:50 AM, Dietmar Eggemann wrote:
> On 05/11/2019 22:53, Ionela Voinescu wrote:
>> On Tuesday 05 Nov 2019 at 16:29:32 (-0500), Thara Gopinath wrote:
>>> On 11/05/2019 04:15 PM, Ionela Voinescu wrote:
>>>> On Tuesday 05 Nov 2019 at 16:02:00 (-0500), Thara Gopinath wrote:
>>>>> On 11/05/2019 03:21 PM, Ionela Voinescu wrote:
>>>>>> Hi Thara,
>>>>>>
>>>>>> On Tuesday 05 Nov 2019 at 13:49:42 (-0500), Thara Gopinath wrote:
>>>>>> [...]
>>>>>>> +static void trigger_thermal_pressure_average(struct rq *rq)
>>>>>>> +{
>>>>>>> +#ifdef CONFIG_SMP
>>>>>>> +	update_thermal_load_avg(rq_clock_task(rq), rq,
>>>>>>> +				per_cpu(thermal_pressure, cpu_of(rq)));
>>>>>>> +#endif
>>>>>>> +}
>>>>>>
>>>>>> Why did you decide to keep trigger_thermal_pressure_average and not
>>>>>> call update_thermal_load_avg directly?
>>>>>>
>>>>>> For !CONFIG_SMP you already have an update_thermal_load_avg function
>>>>>> that does nothing, in kernel/sched/pelt.h, so you don't need that
>>>>>> ifdef. 
>>>>> Hi,
>>>>>
>>>>> Yes you are right. But later with the shift option added, I shift
>>>>> rq_clock_task(rq) by the shift. I thought it is better to contain it in
>>>>> a function that replicate it in three different places. I can remove the
>>>>> CONFIG_SMP in the next version.
>>>>
>>>> You could still keep that in one place if you shift the now argument of
>>>> ___update_load_sum instead.
>>>
>>> No. I cannot do this. The authors of the pelt framework  prefers not to
>>> include a shift parameter there. I had discussed this with Vincent earlier.
>>>
>>
>> Right! I missed Vincent's last comment on this in v4. 
>>
>> I would argue that it's more of a PELT parameter than a CFS parameter
>> :), where it's currently being used. I would also argue that's more of a
>> PELT parameter than a thermal parameter. It controls the PELT time
>> progression for the thermal signal, but it seems more to configure the
>> PELT algorithm, rather than directly characterize thermals.
>>
>> In any case, it only seemed to me that adding a wrapper function for
>> this purpose only was not worth doing.
> 
> Coming back to the v4 discussion
> https://lore.kernel.org/r/379d23e5-79a5-9d90-0fb6-125d9be85e99@arm.com
> 
> There is no API between pelt.c and other parts of the scheduler/kernel
> so why should we keep an unnecessary parameter and wrapper functions?
> 
> There is also no abstraction, update_thermal_load_avg() in pelt.c even
> carries '_thermal_' in its name.
> 
> So why not define this shift value '[sched_|pelt_]thermal_decay_shift'
> there as well? It belongs to update_thermal_load_avg().
> 
> All call sites of update_thermal_load_avg() use 'rq_clock_task(rq) >>
> sched_thermal_decay_shift' so I don't see the need to pass it in.
> 
> IMHO, preparing for eventual code changes (e.g. parsing different now
> values) is not a good practice in the kernel. Keeping the code small and
> tidy is.

I think we are going in circles on this one. I acknowledge you have an
issue. That being said, I also understand the need to keep the pelt
framework code tight. Also, Ionela pointed out that there could be a need
for a faster decay, in which case it could mean a left shift, leading to
further complications if the shift is defined in pelt.c. (I am not saying
that I will implement a faster decay in this patch set; it is more of a
future extension if needed!)

I can make trigger_thermal_pressure_average inline if that will
alleviate some of the concerns.

I would also urge you to reconsider the merits of arguing this point
back and forth.
> 


-- 
Warm Regards
Thara

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 2/6] sched/fair: Add infrastructure to store and update instantaneous thermal pressure
  2019-11-06 17:53               ` Thara Gopinath
@ 2019-11-07  9:32                 ` Dietmar Eggemann
  2019-11-07 10:48                   ` Vincent Guittot
  0 siblings, 1 reply; 40+ messages in thread
From: Dietmar Eggemann @ 2019-11-07  9:32 UTC (permalink / raw)
  To: Thara Gopinath, Ionela Voinescu
  Cc: mingo, peterz, vincent.guittot, rui.zhang, edubezval, qperret,
	linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On 06/11/2019 18:53, Thara Gopinath wrote:
> On 11/06/2019 07:50 AM, Dietmar Eggemann wrote:
>> On 05/11/2019 22:53, Ionela Voinescu wrote:
>>> On Tuesday 05 Nov 2019 at 16:29:32 (-0500), Thara Gopinath wrote:
>>>> On 11/05/2019 04:15 PM, Ionela Voinescu wrote:
>>>>> On Tuesday 05 Nov 2019 at 16:02:00 (-0500), Thara Gopinath wrote:
>>>>>> On 11/05/2019 03:21 PM, Ionela Voinescu wrote:
>>>>>>> Hi Thara,
>>>>>>>
>>>>>>> On Tuesday 05 Nov 2019 at 13:49:42 (-0500), Thara Gopinath wrote:
>>>>>>> [...]
>>>>>>>> +static void trigger_thermal_pressure_average(struct rq *rq)
>>>>>>>> +{
>>>>>>>> +#ifdef CONFIG_SMP
>>>>>>>> +	update_thermal_load_avg(rq_clock_task(rq), rq,
>>>>>>>> +				per_cpu(thermal_pressure, cpu_of(rq)));
>>>>>>>> +#endif
>>>>>>>> +}
>>>>>>>
>>>>>>> Why did you decide to keep trigger_thermal_pressure_average and not
>>>>>>> call update_thermal_load_avg directly?
>>>>>>>
>>>>>>> For !CONFIG_SMP you already have an update_thermal_load_avg function
>>>>>>> that does nothing, in kernel/sched/pelt.h, so you don't need that
>>>>>>> ifdef. 
>>>>>> Hi,
>>>>>>
>>>>>> Yes you are right. But later with the shift option added, I shift
>>>>>> rq_clock_task(rq) by the shift. I thought it is better to contain it in
>>>>>> a function that replicate it in three different places. I can remove the
>>>>>> CONFIG_SMP in the next version.
>>>>>
>>>>> You could still keep that in one place if you shift the now argument of
>>>>> ___update_load_sum instead.
>>>>
>>>> No. I cannot do this. The authors of the pelt framework  prefers not to
>>>> include a shift parameter there. I had discussed this with Vincent earlier.
>>>>
>>>
>>> Right! I missed Vincent's last comment on this in v4. 
>>>
>>> I would argue that it's more of a PELT parameter than a CFS parameter
>>> :), where it's currently being used. I would also argue that's more of a
>>> PELT parameter than a thermal parameter. It controls the PELT time
>>> progression for the thermal signal, but it seems more to configure the
>>> PELT algorithm, rather than directly characterize thermals.
>>>
>>> In any case, it only seemed to me that adding a wrapper function for
>>> this purpose only was not worth doing.
>>
>> Coming back to the v4 discussion
>> https://lore.kernel.org/r/379d23e5-79a5-9d90-0fb6-125d9be85e99@arm.com
>>
>> There is no API between pelt.c and other parts of the scheduler/kernel
>> so why should we keep an unnecessary parameter and wrapper functions?
>>
>> There is also no abstraction, update_thermal_load_avg() in pelt.c even
>> carries '_thermal_' in its name.
>>
>> So why not define this shift value '[sched_|pelt_]thermal_decay_shift'
>> there as well? It belongs to update_thermal_load_avg().
>>
>> All call sites of update_thermal_load_avg() use 'rq_clock_task(rq) >>
>> sched_thermal_decay_shift' so I don't see the need to pass it in.
>>
>> IMHO, preparing for eventual code changes (e.g. parsing different now
>> values) is not a good practice in the kernel. Keeping the code small and
>> tidy is.
> 
> I think we are going in circles on this one. I acknowledge you have an
> issue. That being said, I also understand the need to keep the pelt
> framework code tight. Also Ionela pointed out that there could be a need
> for a faster decay in which case it could mean a left shift leading to
> further complications if defined in pelt.c (I am not saying that I will
> implement a faster decay in this patch set but it is more of a future
> extension if needed!)

The issue still exists, so why not discuss it here?

Placing thermal-related time shift operations in an
update_*thermal*_load_avg() PELT function looks perfectly fine to me.

> I can make trigger_thermal_pressure_average inline if that will
> alleviate some of the concerns.

That's not the issue here. The issue is the extra shim layer which is
unnecessary in the current implementation.

update_blocked_averages()
{
    ...
    update_rt_rq_load_avg()
    update_dl_rq_load_avg()
    update_irq_load_avg()
    trigger_thermal_pressure_average() <--- ?
    ...
}

Wouldn't a direct call to update_thermal_load_avg() here make things so
much clearer? And I'm not just talking about today or about the people
involved in this review.
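
E.g. (sketch, simply inlining what the wrapper does today):

	update_thermal_load_avg(rq_clock_task(rq), rq,
				per_cpu(thermal_pressure, cpu_of(rq)));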

> I would also urge you to reconsider the merits of arguing this point
> back and forth.

I just did.

[...]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 2/6] sched/fair: Add infrastructure to store and update instantaneous thermal pressure
  2019-11-07  9:32                 ` Dietmar Eggemann
@ 2019-11-07 10:48                   ` Vincent Guittot
  2019-11-07 11:36                     ` Dietmar Eggemann
  0 siblings, 1 reply; 40+ messages in thread
From: Vincent Guittot @ 2019-11-07 10:48 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Thara Gopinath, Ionela Voinescu, Ingo Molnar, Peter Zijlstra,
	Zhang Rui, Eduardo Valentin, Quentin Perret, linux-kernel,
	Amit Kachhap, Javi Merino, Daniel Lezcano

On Thu, 7 Nov 2019 at 10:32, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 06/11/2019 18:53, Thara Gopinath wrote:
> > On 11/06/2019 07:50 AM, Dietmar Eggemann wrote:
> >> On 05/11/2019 22:53, Ionela Voinescu wrote:
> >>> On Tuesday 05 Nov 2019 at 16:29:32 (-0500), Thara Gopinath wrote:
> >>>> On 11/05/2019 04:15 PM, Ionela Voinescu wrote:
> >>>>> On Tuesday 05 Nov 2019 at 16:02:00 (-0500), Thara Gopinath wrote:
> >>>>>> On 11/05/2019 03:21 PM, Ionela Voinescu wrote:
> >>>>>>> Hi Thara,
> >>>>>>>
> >>>>>>> On Tuesday 05 Nov 2019 at 13:49:42 (-0500), Thara Gopinath wrote:
> >>>>>>> [...]
> >>>>>>>> +static void trigger_thermal_pressure_average(struct rq *rq)
> >>>>>>>> +{
> >>>>>>>> +#ifdef CONFIG_SMP
> >>>>>>>> +      update_thermal_load_avg(rq_clock_task(rq), rq,
> >>>>>>>> +                              per_cpu(thermal_pressure, cpu_of(rq)));
> >>>>>>>> +#endif
> >>>>>>>> +}
> >>>>>>>
> >>>>>>> Why did you decide to keep trigger_thermal_pressure_average and not
> >>>>>>> call update_thermal_load_avg directly?
> >>>>>>>
> >>>>>>> For !CONFIG_SMP you already have an update_thermal_load_avg function
> >>>>>>> that does nothing, in kernel/sched/pelt.h, so you don't need that
> >>>>>>> ifdef.
> >>>>>> Hi,
> >>>>>>
> >>>>>> Yes you are right. But later with the shift option added, I shift
> >>>>>> rq_clock_task(rq) by the shift. I thought it is better to contain it in
> >>>>>> a function that replicate it in three different places. I can remove the
> >>>>>> CONFIG_SMP in the next version.
> >>>>>
> >>>>> You could still keep that in one place if you shift the now argument of
> >>>>> ___update_load_sum instead.
> >>>>
> >>>> No. I cannot do this. The authors of the pelt framework  prefers not to
> >>>> include a shift parameter there. I had discussed this with Vincent earlier.
> >>>>
> >>>
> >>> Right! I missed Vincent's last comment on this in v4.
> >>>
> >>> I would argue that it's more of a PELT parameter than a CFS parameter
> >>> :), where it's currently being used. I would also argue that's more of a
> >>> PELT parameter than a thermal parameter. It controls the PELT time
> >>> progression for the thermal signal, but it seems more to configure the
> >>> PELT algorithm, rather than directly characterize thermals.
> >>>
> >>> In any case, it only seemed to me that adding a wrapper function for
> >>> this purpose only was not worth doing.
> >>
> >> Coming back to the v4 discussion
> >> https://lore.kernel.org/r/379d23e5-79a5-9d90-0fb6-125d9be85e99@arm.com
> >>
> >> There is no API between pelt.c and other parts of the scheduler/kernel
> >> so why should we keep an unnecessary parameter and wrapper functions?
> >>
> >> There is also no abstraction, update_thermal_load_avg() in pelt.c even
> >> carries '_thermal_' in its name.
> >>
> >> So why not define this shift value '[sched_|pelt_]thermal_decay_shift'
> >> there as well? It belongs to update_thermal_load_avg().
> >>
> >> All call sites of update_thermal_load_avg() use 'rq_clock_task(rq) >>
> >> sched_thermal_decay_shift' so I don't see the need to pass it in.
> >>
> >> IMHO, preparing for eventual code changes (e.g. parsing different now
> >> values) is not a good practice in the kernel. Keeping the code small and
> >> tidy is.
> >
> > I think we are going in circles on this one. I acknowledge you have an
> > issue. That being said, I also understand the need to keep the pelt
> > framework code tight. Also Ionela pointed out that there could be a need
> > for a faster decay in which case it could mean a left shift leading to
> > further complications if defined in pelt.c (I am not saying that I will
> > implement a faster decay in this patch set but it is more of a future
> > extension if needed!)
>
> The issue still exists so why not discussing it here?
>
> Placing thermal related time shift operations in a
> update_*thermal*_load_avg() PELT function look perfectly fine to me.

As already said, having a thermal-related clock shift operation in
update_thermal_load_avg() and pelt.c is a nack for me.

>
> > I can make trigger_thermal_pressure_average inline if that will
> > alleviate some of the concerns.

In fact, trigger_thermal_pressure_average is only there because of the
clock shifting, which is only introduced in patch 6.
So remove trigger_thermal_pressure_average from this patch and directly call

+       update_thermal_load_avg(rq_clock_task(rq), rq,
+                               per_cpu(thermal_pressure, cpu_of(rq)));

in patch 3

>
> That's not the issue here. The issue is the extra shim layer which is
> unnecessary in the current implementation.
>
> update_blocked_averages()
> {
>     ...
>     update_rt_rq_load_avg()
>     update_dl_rq_load_avg()
>     update_irq_load_avg()
>     trigger_thermal_pressure_average() <--- ?
>     ...
> }
>
> Wouldn't a direct call to update_thermal_load_avg() here make things so
> much more clear? And I'm not talking about today and about people
> involved in this review.
>
> > I would also urge you to reconsider the merits of arguing this point
> > back and forth.
>
> I just did.
>
> [...]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 6/6] sched/fair: Enable tuning of decay period
  2019-11-05 18:49 ` [Patch v5 6/6] sched/fair: Enable tuning of decay period Thara Gopinath
@ 2019-11-07 10:49   ` Vincent Guittot
  2019-11-08 10:53     ` Dietmar Eggemann
  2019-11-19 10:52   ` Amit Kucheria
  1 sibling, 1 reply; 40+ messages in thread
From: Vincent Guittot @ 2019-11-07 10:49 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: mingo, peterz, ionela.voinescu, rui.zhang, edubezval, qperret,
	linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

Le Tuesday 05 Nov 2019 à 13:49:46 (-0500), Thara Gopinath a écrit :
> Thermal pressure follows pelt signas which means the
> decay period for thermal pressure is the default pelt
> decay period. Depending on soc charecteristics and thermal
> activity, it might be beneficial to decay thermal pressure
> slower, but still in-tune with the pelt signals.
> One way to achieve this is to provide a command line parameter
> to set a decay shift parameter to an integer between 0 and 10.
> 
> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> ---
> 
> v4->v5:
> 	- Changed _coeff to _shift as per review comments on the list.
> 
>  Documentation/admin-guide/kernel-parameters.txt |  5 +++++
>  kernel/sched/fair.c                             | 25 +++++++++++++++++++++++--
>  2 files changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index c82f87c..0b8f55e 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4281,6 +4281,11 @@
>  			incurs a small amount of overhead in the scheduler
>  			but is useful for debugging and performance tuning.
>  
> +	sched_thermal_decay_shift=
> +			[KNL, SMP] Set decay shift for thermal pressure signal.
> +			Format: integer betweer 0 and 10
> +			Default is 0.
> +
>  	skew_tick=	[KNL] Offset the periodic timer tick per cpu to mitigate
>  			xtime_lock contention on larger systems, and/or RCU lock
>  			contention on all systems with CONFIG_MAXSMP set.
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5f6c371..61a020b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -91,6 +91,18 @@ const_debug unsigned int sysctl_sched_migration_cost	= 500000UL;
>   * and maximum available capacity due to thermal events.
>   */
>  static DEFINE_PER_CPU(unsigned long, thermal_pressure);
> +/**
> + * By default the decay is the default pelt decay period.
> + * The decay shift can change the decay period in
> + * multiples of 32.
> + *  Decay shift		Decay period(ms)
> + *	0			32
> + *	1			64
> + *	2			128
> + *	3			256
> + *	4			512
> + */
> +static int sched_thermal_decay_shift;
>  
>  static void trigger_thermal_pressure_average(struct rq *rq);
>  
> @@ -10435,6 +10447,15 @@ void update_thermal_pressure(int cpu, unsigned long capped_capacity)
>  	delta = arch_scale_cpu_capacity(cpu) - capped_capacity;
>  	per_cpu(thermal_pressure, cpu) = delta;
>  }
> +
> +static int __init setup_sched_thermal_decay_shift(char *str)
> +{
> +	if (kstrtoint(str, 0, &sched_thermal_decay_shift))
> +		pr_warn("Unable to set scheduler thermal pressure decay shift parameter\n");
> +
> +	return 1;
> +}
> +__setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
>  #endif
>  
>  /**
> @@ -10444,8 +10465,8 @@ void update_thermal_pressure(int cpu, unsigned long capped_capacity)
>  static void trigger_thermal_pressure_average(struct rq *rq)
>  {
>  #ifdef CONFIG_SMP
> -	update_thermal_load_avg(rq_clock_task(rq), rq,
> -				per_cpu(thermal_pressure, cpu_of(rq)));
> +	update_thermal_load_avg(rq_clock_task(rq) >> sched_thermal_decay_shift,
> +				rq, per_cpu(thermal_pressure, cpu_of(rq)));

Would be better to create

+static inline u64 rq_clock_thermal(struct rq *rq)
+{
+       lockdep_assert_held(&rq->lock);
+       assert_clock_updated(rq);
+
+       return rq_clock_task(rq) >> sched_thermal_decay_shift;
+}
+

and use it when calling update_thermal_load_avg(rq_clock_thermal(rq)...
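
i.e. the call sites would then read (sketch):

+	update_thermal_load_avg(rq_clock_thermal(rq), rq,
+				per_cpu(thermal_pressure, cpu_of(rq)));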


>  #endif
>  }
>  
> -- 
> 2.1.4
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 2/6] sched/fair: Add infrastructure to store and update instantaneous thermal pressure
  2019-11-07 10:48                   ` Vincent Guittot
@ 2019-11-07 11:36                     ` Dietmar Eggemann
  0 siblings, 0 replies; 40+ messages in thread
From: Dietmar Eggemann @ 2019-11-07 11:36 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Thara Gopinath, Ionela Voinescu, Ingo Molnar, Peter Zijlstra,
	Zhang Rui, Eduardo Valentin, Quentin Perret, linux-kernel,
	Amit Kachhap, Javi Merino, Daniel Lezcano

On 07/11/2019 11:48, Vincent Guittot wrote:
> On Thu, 7 Nov 2019 at 10:32, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 06/11/2019 18:53, Thara Gopinath wrote:
>>> On 11/06/2019 07:50 AM, Dietmar Eggemann wrote:
>>>> On 05/11/2019 22:53, Ionela Voinescu wrote:
>>>>> On Tuesday 05 Nov 2019 at 16:29:32 (-0500), Thara Gopinath wrote:
>>>>>> On 11/05/2019 04:15 PM, Ionela Voinescu wrote:
>>>>>>> On Tuesday 05 Nov 2019 at 16:02:00 (-0500), Thara Gopinath wrote:
>>>>>>>> On 11/05/2019 03:21 PM, Ionela Voinescu wrote:

[...]

> In fact, trigger_thermal_pressure_average is only there because of
> shifting the clock which is introduced only in patch 6.
> So remove trigger_thermal_pressure_average from this patch and call directly
> 
> +       update_thermal_load_avg(rq_clock_task(rq), rq,
> +                               per_cpu(thermal_pressure, cpu_of(rq)));
> 
> in patch3

I like the rq_clock_thermal() idea in
https://lore.kernel.org/r/20191107104901.GA472@linaro.org to get rid of
trigger_thermal_pressure_average().

>> That's not the issue here. The issue is the extra shim layer which is
>> unnecessary in the current implementation.
>>
>> update_blocked_averages()
>> {
>>     ...
>>     update_rt_rq_load_avg()
>>     update_dl_rq_load_avg()
>>     update_irq_load_avg()
>>     trigger_thermal_pressure_average() <--- ?
>>     ...
>> }

[...]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 5/6] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping
  2019-11-06 12:50   ` Dietmar Eggemann
  2019-11-06 17:28     ` Thara Gopinath
@ 2019-11-07 13:00     ` Dietmar Eggemann
  1 sibling, 0 replies; 40+ messages in thread
From: Dietmar Eggemann @ 2019-11-07 13:00 UTC (permalink / raw)
  To: Thara Gopinath, mingo, peterz, ionela.voinescu, vincent.guittot,
	rui.zhang, edubezval, qperret
  Cc: linux-kernel, amit.kachhap, javi.merino, daniel.lezcano



On 06/11/2019 13:50, Dietmar Eggemann wrote:
> On 05/11/2019 19:49, Thara Gopinath wrote:
> 
> [...]
> 
>> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
>> index 391f397..c65b7c4 100644
>> --- a/drivers/thermal/cpu_cooling.c
>> +++ b/drivers/thermal/cpu_cooling.c
>> @@ -218,6 +218,27 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
>>  }
>>  
>>  /**
>> + * update_sched_max_capacity - update scheduler about change in cpu
>> + *				max frequency.
>> + * @cpus: list of cpus whose max capacity needs udating in scheduler.
>> + * @cur_max_freq: current maximum frequency.
>> + * @max_freq: highest possible frequency.
>> + */
>> +static void update_sched_max_capacity(struct cpumask *cpus,
>> +				      unsigned int cur_max_freq,
>> +				      unsigned int max_freq)
>> +{
>> +	int cpu;
>> +	unsigned long capacity;
>> +
>> +	for_each_cpu(cpu, cpus) {
>> +		capacity = cur_max_freq * arch_scale_cpu_capacity(cpu);
>> +		capacity /= max_freq;
>> +		update_thermal_pressure(cpu, capacity);
>> +	}
>> +}
>> +
>> +/**
> 
> Have you seen
> https://lore.kernel.org/r/2b19d7da-412c-932f-7251-110eadbef3e3@arm.com ?
> 
> Also the naming 'update_thermal_pressure()' is not really suitable for
> an external task scheduler interface. If I see this called in the
> cooling device code, I wouldn't guess that this is a task scheduler
> interface.
> 
> [...]

The current interface to bring in frequency and uarch (CPU) invariance
information into the task scheduler is arch_scale_[freq|cpu]_capacity.

So why not follow this approach for the thermal->sched interface?

+#ifndef arch_scale_thermal_capacity
+static __always_inline
+unsigned long arch_scale_thermal_capacity(int cpu)
+{
+       return 0;
+}
+#endif

Leave per_cpu(thermal_pressure, cpu) within cpu cooling and let cpu
cooling (or even platform) #define arch_scale_thermal_capacity.
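
A rough sketch of that wiring, assuming cpu cooling keeps owning the per-cpu
value (placement and names below are only illustrative):

/* provided by cpu cooling (or the platform) - sketch only */
DECLARE_PER_CPU(unsigned long, thermal_pressure);

static __always_inline unsigned long arch_scale_thermal_capacity(int cpu)
{
	/* report the per-cpu capacity delta maintained by cpu cooling */
	return READ_ONCE(per_cpu(thermal_pressure, cpu));
}
#define arch_scale_thermal_capacity arch_scale_thermal_capacity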

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 1/6] sched/pelt.c: Add support to track thermal pressure
  2019-11-05 18:49 ` [Patch v5 1/6] sched/pelt.c: Add support to track thermal pressure Thara Gopinath
  2019-11-06  8:24   ` Vincent Guittot
  2019-11-06 12:50   ` Dietmar Eggemann
@ 2019-11-07 16:39   ` Qais Yousef
  2019-11-19 10:50   ` Amit Kucheria
  3 siblings, 0 replies; 40+ messages in thread
From: Qais Yousef @ 2019-11-07 16:39 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: mingo, peterz, ionela.voinescu, vincent.guittot, rui.zhang,
	edubezval, qperret, linux-kernel, amit.kachhap, javi.merino,
	daniel.lezcano

Hi Thara

On 11/05/19 13:49, Thara Gopinath wrote:
> Extrapolating on the exisiting framework to track rt/dl utilization using
> pelt signals, add a similar mechanism to track thermal pressure. The
> difference here from rt/dl utilization tracking is that, instead of
> tracking time spent by a cpu running a rt/dl task through util_avg,
> the average thermal pressure is tracked through load_avg. This is
> because thermal pressure signal is weighted "delta" capacity
> and is not binary(util_avg is binary). "delta capacity" here
> means delta between the actual capacity of a cpu and the decreased
> capacity a cpu due to a thermal event.
> In order to track average thermal pressure, a new sched_avg variable
> avg_thermal is introduced. Function update_thermal_load_avg can be called
> to do the periodic bookeeping (accumulate, decay and average)
> of the thermal pressure.
> 
> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> ---
>  kernel/sched/pelt.c  | 13 +++++++++++++
>  kernel/sched/pelt.h  |  7 +++++++
>  kernel/sched/sched.h |  1 +
>  3 files changed, 21 insertions(+)
> 
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index a96db50..3821069 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -353,6 +353,19 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
>  	return 0;
>  }
>  
> +int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
> +{
> +	if (___update_load_sum(now, &rq->avg_thermal,
> +			       capacity,
> +			       capacity,
> +			       capacity)) {
> +		___update_load_avg(&rq->avg_thermal, 1, 1);
> +		return 1;
> +	}
> +
> +	return 0;
> +}

Care to add a tracepoint to this new signal like we now have for the other
ones?
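
Something along these lines, presumably (the tracepoint name below is
hypothetical, modelled on the existing pelt_rt_tp/pelt_dl_tp/pelt_irq_tp ones):

	if (___update_load_sum(now, &rq->avg_thermal,
			       capacity, capacity, capacity)) {
		___update_load_avg(&rq->avg_thermal, 1, 1);
		trace_pelt_thermal_tp(rq);	/* hypothetical name */
		return 1;
	}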

Thanks

--
Qais Yousef

> +
>  #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
>  /*
>   * irq:
> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> index afff644..c74226d 100644
> --- a/kernel/sched/pelt.h
> +++ b/kernel/sched/pelt.h
> @@ -6,6 +6,7 @@ int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se
>  int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
>  int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
>  int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
> +int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity);
>  
>  #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
>  int update_irq_load_avg(struct rq *rq, u64 running);
> @@ -159,6 +160,12 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
>  }
>  
>  static inline int
> +update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
> +{
> +	return 0;
> +}
> +
> +static inline int
>  update_irq_load_avg(struct rq *rq, u64 running)
>  {
>  	return 0;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 0db2c1b..d5d82c8 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -944,6 +944,7 @@ struct rq {
>  #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
>  	struct sched_avg	avg_irq;
>  #endif
> +	struct sched_avg	avg_thermal;
>  	u64			idle_stamp;
>  	u64			avg_idle;
>  
> -- 
> 2.1.4
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 6/6] sched/fair: Enable tuning of decay period
  2019-11-07 10:49   ` Vincent Guittot
@ 2019-11-08 10:53     ` Dietmar Eggemann
  0 siblings, 0 replies; 40+ messages in thread
From: Dietmar Eggemann @ 2019-11-08 10:53 UTC (permalink / raw)
  To: Vincent Guittot, Thara Gopinath
  Cc: mingo, peterz, ionela.voinescu, rui.zhang, edubezval, qperret,
	linux-kernel, amit.kachhap, javi.merino, daniel.lezcano

On 07/11/2019 11:49, Vincent Guittot wrote:
> Le Tuesday 05 Nov 2019 à 13:49:46 (-0500), Thara Gopinath a écrit :

[...]

>>  /**
>> @@ -10444,8 +10465,8 @@ void update_thermal_pressure(int cpu, unsigned long capped_capacity)
>>  static void trigger_thermal_pressure_average(struct rq *rq)
>>  {
>>  #ifdef CONFIG_SMP
>> -	update_thermal_load_avg(rq_clock_task(rq), rq,
>> -				per_cpu(thermal_pressure, cpu_of(rq)));
>> +	update_thermal_load_avg(rq_clock_task(rq) >> sched_thermal_decay_shift,
>> +				rq, per_cpu(thermal_pressure, cpu_of(rq)));
> 
> Would be better to create
> 
> +static inline u64 rq_clock_thermal(struct rq *rq)
> +{
> +       lockdep_assert_held(&rq->lock);
> +       assert_clock_updated(rq);

IMHO, the asserts can be skipped here since they're already done in
rq_clock_task().

> +       return rq_clock_task(rq) >> sched_thermal_decay_shift;
> +}
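
With those dropped, the helper would reduce to (sketch):

static inline u64 rq_clock_thermal(struct rq *rq)
{
	return rq_clock_task(rq) >> sched_thermal_decay_shift;
}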

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 0/6] Introduce Thermal Pressure
  2019-11-05 18:49 [Patch v5 0/6] Introduce Thermal Pressure Thara Gopinath
                   ` (5 preceding siblings ...)
  2019-11-05 18:49 ` [Patch v5 6/6] sched/fair: Enable tuning of decay period Thara Gopinath
@ 2019-11-12 11:21 ` Lukasz Luba
  2019-11-19 15:12   ` Lukasz Luba
  2019-11-19 10:54 ` Amit Kucheria
  7 siblings, 1 reply; 40+ messages in thread
From: Lukasz Luba @ 2019-11-12 11:21 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: mingo, peterz, Ionela Voinescu, vincent.guittot, rui.zhang,
	edubezval, qperret, linux-kernel, amit.kachhap, javi.merino,
	daniel.lezcano

Hi Thara,

I am going to try your patch set on a different board.
To do that I need more information regarding your setup.
Please find my comments below. I probably need one hack
which I do not fully understand.

On 11/5/19 6:49 PM, Thara Gopinath wrote:
> Thermal governors can respond to an overheat event of a cpu by
> capping the cpu's maximum possible frequency. This in turn
> means that the maximum available compute capacity of the
> cpu is restricted. But today in the kernel, task scheduler is
> not notified of capping of maximum frequency of a cpu.
> In other words, scheduler is unaware of maximum capacity
> restrictions placed on a cpu due to thermal activity.
> This patch series attempts to address this issue.
> The benefits identified are better task placement among available
> cpus in event of overheating which in turn leads to better
> performance numbers.
>
> The reduction in the maximum possible capacity of a cpu due to a
> thermal event can be considered as thermal pressure. Instantaneous
> thermal pressure is hard to record and can sometime be erroneous
> as there can be mismatch between the actual capping of capacity
> and scheduler recording it. Thus solution is to have a weighted
> average per cpu value for thermal pressure over time.
> The weight reflects the amount of time the cpu has spent at a
> capped maximum frequency. Since thermal pressure is recorded as
> an average, it must be decayed periodically. Exisiting algorithm
> in the kernel scheduler pelt framework is re-used to calculate
> the weighted average. This patch series also defines a sysctl
> inerface to allow for a configurable decay period.
>
> Regarding testing, basic build, boot and sanity testing have been
> performed on db845c platform with debian file system.
> Further, dhrystone and hackbench tests have been
> run with the thermal pressure algorithm. During testing, due to
> constraints of step wise governor in dealing with big little systems,
I don't understand this modification. Could you explain what the
issue was, and whether this modification breaks the original
thermal solution to begin with? You are then comparing against this
modified version and treating it as the 'origin', am I right?

> trip point 0 temperature was made assymetric between cpus in little
> cluster and big cluster; the idea being that
> big core will heat up and cpu cooling device will throttle the
> frequency of the big cores faster, there by limiting the maximum available
> capacity and the scheduler will spread out tasks to little cores as well.
>
> Test Results
>
> Hackbench: 1 group , 30000 loops, 10 runs
>                                                 Result         SD
>                                                 (Secs)     (% of mean)
>   No Thermal Pressure                            14.03       2.69%
>   Thermal Pressure PELT Algo. Decay : 32 ms      13.29       0.56%
>   Thermal Pressure PELT Algo. Decay : 64 ms      12.57       1.56%
>   Thermal Pressure PELT Algo. Decay : 128 ms     12.71       1.04%
>   Thermal Pressure PELT Algo. Decay : 256 ms     12.29       1.42%
>   Thermal Pressure PELT Algo. Decay : 512 ms     12.42       1.15%
>
> Dhrystone Run Time  : 20 threads, 3000 MLOOPS
>                                                   Result      SD
>                                                   (Secs)    (% of mean)
>   No Thermal Pressure                              9.452      4.49%
>   Thermal Pressure PELT Algo. Decay : 32 ms        8.793      5.30%
>   Thermal Pressure PELT Algo. Decay : 64 ms        8.981      5.29%
>   Thermal Pressure PELT Algo. Decay : 128 ms       8.647      6.62%
>   Thermal Pressure PELT Algo. Decay : 256 ms       8.774      6.45%
>   Thermal Pressure PELT Algo. Decay : 512 ms       8.603      5.41%
>
What I would also like to see for these performance results is the
average temperature of the chip. Is it higher than in the 'origin'?

Regards,
Lukasz Luba


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 1/6] sched/pelt.c: Add support to track thermal pressure
  2019-11-05 18:49 ` [Patch v5 1/6] sched/pelt.c: Add support to track thermal pressure Thara Gopinath
                     ` (2 preceding siblings ...)
  2019-11-07 16:39   ` Qais Yousef
@ 2019-11-19 10:50   ` Amit Kucheria
  3 siblings, 0 replies; 40+ messages in thread
From: Amit Kucheria @ 2019-11-19 10:50 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: Ingo Molnar, Peter Zijlstra, ionela.voinescu, Vincent Guittot,
	Zhang Rui, Eduardo Valentin, qperret, LKML, Amit Daniel Kachhap,
	Javi Merino, Daniel Lezcano

On Wed, Nov 6, 2019 at 12:20 AM Thara Gopinath
<thara.gopinath@linaro.org> wrote:
>
> Extrapolating on the exisiting framework to track rt/dl utilization using
> pelt signals, add a similar mechanism to track thermal pressure. The
> difference here from rt/dl utilization tracking is that, instead of
> tracking time spent by a cpu running a rt/dl task through util_avg,
> the average thermal pressure is tracked through load_avg. This is
> because thermal pressure signal is weighted "delta" capacity
> and is not binary(util_avg is binary). "delta capacity" here
> means delta between the actual capacity of a cpu and the decreased
> capacity a cpu due to a thermal event.

Use a blank line here. And reflow the paragraph text.
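
For illustration (hypothetical numbers): if arch_scale_cpu_capacity()
is 1024 and a thermal cap leaves the cpu with a capacity of 768, the
value accumulated into load_avg at that instant is 1024 - 768 = 256,
whereas the rt/dl util_avg signal only ever accumulates 0 or 1024.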

> In order to track average thermal pressure, a new sched_avg variable
> avg_thermal is introduced. Function update_thermal_load_avg can be called
> to do the periodic bookeeping (accumulate, decay and average)
> of the thermal pressure.
>
> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> ---
>  kernel/sched/pelt.c  | 13 +++++++++++++
>  kernel/sched/pelt.h  |  7 +++++++
>  kernel/sched/sched.h |  1 +
>  3 files changed, 21 insertions(+)
>
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index a96db50..3821069 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -353,6 +353,19 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
>         return 0;
>  }
>
> +int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
> +{
> +       if (___update_load_sum(now, &rq->avg_thermal,
> +                              capacity,
> +                              capacity,
> +                              capacity)) {
> +               ___update_load_avg(&rq->avg_thermal, 1, 1);
> +               return 1;
> +       }
> +
> +       return 0;
> +}
> +
>  #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
>  /*
>   * irq:
> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> index afff644..c74226d 100644
> --- a/kernel/sched/pelt.h
> +++ b/kernel/sched/pelt.h
> @@ -6,6 +6,7 @@ int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se
>  int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
>  int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
>  int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
> +int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity);
>
>  #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
>  int update_irq_load_avg(struct rq *rq, u64 running);
> @@ -159,6 +160,12 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
>  }
>
>  static inline int
> +update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
> +{
> +       return 0;
> +}
> +
> +static inline int
>  update_irq_load_avg(struct rq *rq, u64 running)
>  {
>         return 0;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 0db2c1b..d5d82c8 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -944,6 +944,7 @@ struct rq {
>  #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
>         struct sched_avg        avg_irq;
>  #endif
> +       struct sched_avg        avg_thermal;

Have you considered putting this inside a #ifdef
CONFIG_HAVE_SCHED_THERMAL_PRESSURE?
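
Something like the below is what I have in mind (just a sketch; the
CONFIG_HAVE_SCHED_THERMAL_PRESSURE symbol does not exist in this
series, so the name is only a suggestion):

	#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
		struct sched_avg	avg_irq;
	#endif
	#ifdef CONFIG_HAVE_SCHED_THERMAL_PRESSURE
		/* PELT-averaged thermal pressure for this runqueue */
		struct sched_avg	avg_thermal;
	#endif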


>         u64                     idle_stamp;
>         u64                     avg_idle;
>
> --
> 2.1.4
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 2/6] sched/fair: Add infrastructure to store and update instantaneous thermal pressure
  2019-11-05 18:49 ` [Patch v5 2/6] sched/fair: Add infrastructure to store and update instantaneous " Thara Gopinath
  2019-11-05 20:21   ` Ionela Voinescu
  2019-11-06  8:27   ` Vincent Guittot
@ 2019-11-19 10:51   ` Amit Kucheria
  2 siblings, 0 replies; 40+ messages in thread
From: Amit Kucheria @ 2019-11-19 10:51 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: Ingo Molnar, Peter Zijlstra, ionela.voinescu, Vincent Guittot,
	Zhang Rui, Eduardo Valentin, qperret, LKML, Amit Daniel Kachhap,
	Javi Merino, Daniel Lezcano

On Wed, Nov 6, 2019 at 12:20 AM Thara Gopinath
<thara.gopinath@linaro.org> wrote:
>
> Add interface APIs to initialize, update/average, track, accumulate
> and decay thermal pressure per cpu basis. A per cpu variable

*on a* per cpu basis.

> thermal_pressure is introduced to keep track of instantaneous per
> cpu thermal pressure. Thermal pressure is the delta between maximum
> capacity and capped capacity due to a thermal event.
> API trigger_thermal_pressure_average is called for periodic accumulate

s/accumulate/accumulation

> and decay of the thermal pressure.This API passes on the instantaneous
> thermal pressure of a cpu to update_thermal_load_avg to do the necessary
> accumulate, decay and average.

s/accumulate/accumulation

Add blank line here.

> API update_thermal_pressure is for the system to update the thermal
> pressure by providing a capped maximum capacity.

This is redundant given the below sentence.

> Considering, trigger_thermal_pressure_average reads thermal_pressure and

Lose the comma.

> update_thermal_pressure writes into thermal_pressure, one can argue for
> some sort of locking mechanism to avoid a stale value.
> But considering trigger_thermal_pressure_average can be called from a
> system critical path like scheduler tick function, a locking mechanism
> is not ideal. This means that it is possible the thermal_pressure value
> used to calculate average thermal pressure for a cpu can be
> stale for upto 1 tick period.

Please consider reflowing all your patch descriptions to make the
paragraphs better aligned and easier to read. Leave a blank line
between paragraphs.

In vim, you can do gqap on each paragraph.


> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> ---
>
> v3->v4:
>         - Dropped per cpu max_capacity_info struct and instead added a per
>           delta_capacity variable to store the delta between maximum
>           capacity and capped capacity. The delta is now calculated when
>           thermal pressure is updated and not every tick.
>         - Dropped populate_max_capacity_info api as only per cpu delta
>           capacity is stored.
>         - Renamed update_periodic_maxcap to
>           trigger_thermal_pressure_average and update_maxcap_capacity to
>           update_thermal_pressure.
> v4->v5:
>         - As per Peter's review comments folded thermal.c into fair.c.
>         - As per Ionela's review comments revamped update_thermal_pressure
>           to take maximum available capacity as input instead of maximum
>           capped frequency ration.
>
> ---
>  include/linux/sched.h |  9 +++++++++
>  kernel/sched/fair.c   | 37 +++++++++++++++++++++++++++++++++++++
>  2 files changed, 46 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 263cf08..3c31084 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1993,6 +1993,15 @@ static inline void rseq_syscall(struct pt_regs *regs)
>
>  #endif
>
> +#ifdef CONFIG_SMP
> +void update_thermal_pressure(int cpu, unsigned long capped_capacity);
> +#else
> +static inline void
> +update_thermal_pressure(int cpu, unsigned long capped_capacity)
> +{
> +}
> +#endif
> +
>  const struct sched_avg *sched_trace_cfs_rq_avg(struct cfs_rq *cfs_rq);
>  char *sched_trace_cfs_rq_path(struct cfs_rq *cfs_rq, char *str, int len);
>  int sched_trace_cfs_rq_cpu(struct cfs_rq *cfs_rq);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 682a754..2e907cc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -86,6 +86,12 @@ static unsigned int normalized_sysctl_sched_wakeup_granularity       = 1000000UL;
>
>  const_debug unsigned int sysctl_sched_migration_cost   = 500000UL;
>
> +/*
> + * Per-cpu instantaneous delta between maximum capacity
> + * and maximum available capacity due to thermal events.
> + */
> +static DEFINE_PER_CPU(unsigned long, thermal_pressure);
> +
>  #ifdef CONFIG_SMP
>  /*
>   * For asym packing, by default the lower numbered CPU has higher priority.
> @@ -10401,6 +10407,37 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
>         return rr_interval;
>  }
>
> +#ifdef CONFIG_SMP
> +/**
> + * update_thermal_pressure: Update thermal pressure
> + * @cpu: the cpu for which thermal pressure is to be updated for
> + * @capped_capacity: maximum capacity of the cpu after the capping
> + *                  due to thermal event.
> + *
> + * Delta between the arch_scale_cpu_capacity and capped max capacity is
> + * stored in per cpu thermal_pressure variable.
> + */
> +void update_thermal_pressure(int cpu, unsigned long capped_capacity)
> +{
> +       unsigned long delta;
> +
> +       delta = arch_scale_cpu_capacity(cpu) - capped_capacity;
> +       per_cpu(thermal_pressure, cpu) = delta;

Any reason to save to delta first and then assign to the per-cpu
variable? Just make it

    per_cpu(thermal_pressure, cpu) = arch_scale_cpu_capacity(cpu) -
                                     capped_capacity;

> +}

> +#endif
> +
> +/**
> + * trigger_thermal_pressure_average: Trigger the thermal pressure accumulate
> + *                                  and average algorithm
> + */
> +static void trigger_thermal_pressure_average(struct rq *rq)
> +{
> +#ifdef CONFIG_SMP
> +       update_thermal_load_avg(rq_clock_task(rq), rq,
> +                               per_cpu(thermal_pressure, cpu_of(rq)));
> +#endif
> +}
> +
>  /*
>   * All the scheduling class methods:
>   */
> --
> 2.1.4
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 4/6] sched/fair: update cpu_capcity to reflect thermal pressure
  2019-11-05 18:49 ` [Patch v5 4/6] sched/fair: update cpu_capcity to reflect " Thara Gopinath
  2019-11-06 16:56   ` Qais Yousef
@ 2019-11-19 10:51   ` Amit Kucheria
  1 sibling, 0 replies; 40+ messages in thread
From: Amit Kucheria @ 2019-11-19 10:51 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: Ingo Molnar, Peter Zijlstra, ionela.voinescu, Vincent Guittot,
	Zhang Rui, Eduardo Valentin, qperret, LKML, Amit Daniel Kachhap,
	Javi Merino, Daniel Lezcano

On Wed, Nov 6, 2019 at 12:20 AM Thara Gopinath
<thara.gopinath@linaro.org> wrote:
>
> cpu_capacity relflects the maximum available capacity of a cpu. Thermal

s/relflects/reflects

> pressure on a cpu means this maximum available capacity is reduced. This
> patch reduces the average thermal pressure for a cpu from its maximum
> available capacity so that cpu_capacity reflects the actual
> available capacity.
>
> Other signals that are deducted from cpu_capacity to reflect the actual
> capacity available are rt and dl util_avg. util_avg tracks a binary signal
> and uses the weights 1024 and 0. Whereas thermal pressure is tracked
> using load_avg. load_avg uses the actual "delta" capacity as the weight.

Consider rephrasing as,

Currently, effective cpu capacity is calculated by deducting RT and DL
average utilization (util_avg) from cpu_capacity. For thermal
pressure, we use load_avg instead of util_avg, which is a binary
signal (0 or 1024), because <put why here>


> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> ---
>  kernel/sched/fair.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9fb0494..5f6c371 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7738,6 +7738,7 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
>
>         used = READ_ONCE(rq->avg_rt.util_avg);
>         used += READ_ONCE(rq->avg_dl.util_avg);
> +       used += READ_ONCE(rq->avg_thermal.load_avg);
>
>         if (unlikely(used >= max))
>                 return 1;
> --
> 2.1.4
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 6/6] sched/fair: Enable tuning of decay period
  2019-11-05 18:49 ` [Patch v5 6/6] sched/fair: Enable tuning of decay period Thara Gopinath
  2019-11-07 10:49   ` Vincent Guittot
@ 2019-11-19 10:52   ` Amit Kucheria
  1 sibling, 0 replies; 40+ messages in thread
From: Amit Kucheria @ 2019-11-19 10:52 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: Ingo Molnar, Peter Zijlstra, ionela.voinescu, Vincent Guittot,
	Zhang Rui, Eduardo Valentin, qperret, LKML, Amit Daniel Kachhap,
	Javi Merino, Daniel Lezcano

On Wed, Nov 6, 2019 at 12:20 AM Thara Gopinath
<thara.gopinath@linaro.org> wrote:
>
> Thermal pressure follows pelt signas which means the
> decay period for thermal pressure is the default pelt
> decay period. Depending on soc charecteristics and thermal
> activity, it might be beneficial to decay thermal pressure
> slower, but still in-tune with the pelt signals.
> One way to achieve this is to provide a command line parameter
> to set a decay shift parameter to an integer between 0 and 10.
>
> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> ---
>
> v4->v5:
>         - Changed _coeff to _shift as per review comments on the list.
>
>  Documentation/admin-guide/kernel-parameters.txt |  5 +++++
>  kernel/sched/fair.c                             | 25 +++++++++++++++++++++++--
>  2 files changed, 28 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index c82f87c..0b8f55e 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4281,6 +4281,11 @@
>                         incurs a small amount of overhead in the scheduler
>                         but is useful for debugging and performance tuning.
>
> +       sched_thermal_decay_shift=
> +                       [KNL, SMP] Set decay shift for thermal pressure signal.
> +                       Format: integer betweer 0 and 10
> +                       Default is 0.
> +
>         skew_tick=      [KNL] Offset the periodic timer tick per cpu to mitigate
>                         xtime_lock contention on larger systems, and/or RCU lock
>                         contention on all systems with CONFIG_MAXSMP set.
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5f6c371..61a020b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -91,6 +91,18 @@ const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
>   * and maximum available capacity due to thermal events.
>   */
>  static DEFINE_PER_CPU(unsigned long, thermal_pressure);
> +/**
> + * By default the decay is the default pelt decay period.
> + * The decay shift can change the decay period in
> + * multiples of 32.
> + *  Decay shift                Decay period(ms)
> + *     0                       32
> + *     1                       64
> + *     2                       128
> + *     3                       256
> + *     4                       512
> + */
> +static int sched_thermal_decay_shift;
>
>  static void trigger_thermal_pressure_average(struct rq *rq);
>
> @@ -10435,6 +10447,15 @@ void update_thermal_pressure(int cpu, unsigned long capped_capacity)
>         delta = arch_scale_cpu_capacity(cpu) - capped_capacity;
>         per_cpu(thermal_pressure, cpu) = delta;
>  }
> +
> +static int __init setup_sched_thermal_decay_shift(char *str)
> +{
> +       if (kstrtoint(str, 0, &sched_thermal_decay_shift))
> +               pr_warn("Unable to set scheduler thermal pressure decay shift parameter\n");

You're reading straight from the cmdline into a kernel variable w/o
any bounds checking. Perhaps use clamp or clamp_val to make sure it is
between 0 and 10?
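
For example (untested sketch, only to illustrate clamp(); the 0..10
bounds are taken from the kernel-parameters.txt text above):

	static int __init setup_sched_thermal_decay_shift(char *str)
	{
		int shift = 0;

		if (kstrtoint(str, 0, &shift))
			pr_warn("Unable to set scheduler thermal pressure decay shift parameter\n");

		/* Keep the shift within the documented range. */
		sched_thermal_decay_shift = clamp(shift, 0, 10);

		return 1;
	}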


> +
> +       return 1;
> +}
> +__setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
>  #endif
>
>  /**
> @@ -10444,8 +10465,8 @@ void update_thermal_pressure(int cpu, unsigned long capped_capacity)
>  static void trigger_thermal_pressure_average(struct rq *rq)
>  {
>  #ifdef CONFIG_SMP
> -       update_thermal_load_avg(rq_clock_task(rq), rq,
> -                               per_cpu(thermal_pressure, cpu_of(rq)));
> +       update_thermal_load_avg(rq_clock_task(rq) >> sched_thermal_decay_shift,
> +                               rq, per_cpu(thermal_pressure, cpu_of(rq)));
>  #endif
>  }
>
> --
> 2.1.4
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v5 0/6] Introduce Thermal Pressure
  2019-11-05 18:49 [Patch v5 0/6] Introduce Thermal Pressure Thara Gopinath
                   ` (6 preceding siblings ...)
  2019-11-12 11:21 ` [Patch v5 0/6] Introduce Thermal Pressure Lukasz Luba
@ 2019-11-19 10:54 ` Amit Kucheria
  7 siblings, 0 replies; 40+ messages in thread
From: Amit Kucheria @ 2019-11-19 10:54 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: Ingo Molnar, Peter Zijlstra, ionela.voinescu, Vincent Guittot,
	Zhang Rui, Eduardo Valentin, qperret, LKML, Amit Daniel Kachhap,
	Javi Merino, Daniel Lezcano

On Wed, Nov 6, 2019 at 12:20 AM Thara Gopinath
<thara.gopinath@linaro.org> wrote:
>
> Thermal governors can respond to an overheat event of a cpu by
> capping the cpu's maximum possible frequency. This in turn
> means that the maximum available compute capacity of the
> cpu is restricted. But today in the kernel, task scheduler is
> not notified of capping of maximum frequency of a cpu.
> In other words, scheduler is unaware of maximum capacity
> restrictions placed on a cpu due to thermal activity.
> This patch series attempts to address this issue.
> The benefits identified are better task placement among available
> cpus in event of overheating which in turn leads to better
> performance numbers.
>
> The reduction in the maximum possible capacity of a cpu due to a
> thermal event can be considered as thermal pressure. Instantaneous
> thermal pressure is hard to record and can sometime be erroneous
> as there can be mismatch between the actual capping of capacity
> and scheduler recording it. Thus solution is to have a weighted
> average per cpu value for thermal pressure over time.
> The weight reflects the amount of time the cpu has spent at a
> capped maximum frequency. Since thermal pressure is recorded as
> an average, it must be decayed periodically. Exisiting algorithm
> in the kernel scheduler pelt framework is re-used to calculate
> the weighted average. This patch series also defines a sysctl
> inerface to allow for a configurable decay period.
>
> Regarding testing, basic build, boot and sanity testing have been
> performed on db845c platform with debian file system.
> Further, dhrystone and hackbench tests have been
> run with the thermal pressure algorithm. During testing, due to
> constraints of step wise governor in dealing with big little systems,

What constraints?

> trip point 0 temperature was made assymetric between cpus in little
> cluster and big cluster; the idea being that
> big core will heat up and cpu cooling device will throttle the
> frequency of the big cores faster, there by limiting the maximum available
> capacity and the scheduler will spread out tasks to little cores as well.

Can you share the hack to get this behaviour as well so I can try to
reproduce on 845c?

> Test Results
>
> Hackbench: 1 group , 30000 loops, 10 runs
>                                                Result         SD
>                                                (Secs)     (% of mean)
>  No Thermal Pressure                            14.03       2.69%
>  Thermal Pressure PELT Algo. Decay : 32 ms      13.29       0.56%
>  Thermal Pressure PELT Algo. Decay : 64 ms      12.57       1.56%
>  Thermal Pressure PELT Algo. Decay : 128 ms     12.71       1.04%
>  Thermal Pressure PELT Algo. Decay : 256 ms     12.29       1.42%
>  Thermal Pressure PELT Algo. Decay : 512 ms     12.42       1.15%
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Re: [Patch v5 0/6] Introduce Thermal Pressure
  2019-11-12 11:21 ` [Patch v5 0/6] Introduce Thermal Pressure Lukasz Luba
@ 2019-11-19 15:12   ` Lukasz Luba
  0 siblings, 0 replies; 40+ messages in thread
From: Lukasz Luba @ 2019-11-19 15:12 UTC (permalink / raw)
  To: Thara Gopinath
  Cc: mingo, peterz, Ionela Voinescu, vincent.guittot, rui.zhang,
	edubezval, qperret, linux-kernel, amit.kachhap, javi.merino,
	daniel.lezcano, Dietmar Eggemann



On 11/12/19 11:21 AM, Lukasz Luba wrote:
> Hi Thara,
> 
> I am going to try your patch set on a different board.
> To do that I need more information regarding your setup.
> Please find my comments below. I probably need one hack
> which I do not fully understand.
> 
> On 11/5/19 6:49 PM, Thara Gopinath wrote:
>> Thermal governors can respond to an overheat event of a cpu by
>> capping the cpu's maximum possible frequency. This in turn
>> means that the maximum available compute capacity of the
>> cpu is restricted. But today in the kernel, task scheduler is
>> not notified of capping of maximum frequency of a cpu.
>> In other words, scheduler is unaware of maximum capacity
>> restrictions placed on a cpu due to thermal activity.
>> This patch series attempts to address this issue.
>> The benefits identified are better task placement among available
>> cpus in event of overheating which in turn leads to better
>> performance numbers.
>>
>> The reduction in the maximum possible capacity of a cpu due to a
>> thermal event can be considered as thermal pressure. Instantaneous
>> thermal pressure is hard to record and can sometime be erroneous
>> as there can be mismatch between the actual capping of capacity
>> and scheduler recording it. Thus solution is to have a weighted
>> average per cpu value for thermal pressure over time.
>> The weight reflects the amount of time the cpu has spent at a
>> capped maximum frequency. Since thermal pressure is recorded as
>> an average, it must be decayed periodically. Exisiting algorithm
>> in the kernel scheduler pelt framework is re-used to calculate
>> the weighted average. This patch series also defines a sysctl
>> inerface to allow for a configurable decay period.
>>
>> Regarding testing, basic build, boot and sanity testing have been
>> performed on db845c platform with debian file system.
>> Further, dhrystone and hackbench tests have been
>> run with the thermal pressure algorithm. During testing, due to
>> constraints of step wise governor in dealing with big little systems,
> I don't understand this modification. Could you explain what the
> issue was, and whether this modification broke the original
> thermal solution upfront? You are then comparing against this modified
> version and treating it as the 'origin', am I right?
With Ionela's help I understood the reason for doing this hack.

For those who follow: she created a 'capacity inversion' between
big and little cores to test if the patches really work.
How: she starts throttling the big cores at a lower temperature, so
earlier in time, thus the power is shifted towards the little cores
(which are more energy efficient and can run at a higher frequency).
The big cores run at minimum frequency and the little cores (hopefully)
at max frequency.

This 'capacity inversion' is a use case which might occur in the
real world. It is hard to trigger in normal benchmarks, though.
I don't know how often this 'capacity inversion' occurs and for how
long it stays in real workloads. Based on the tests run with the
default thermal solution, where the results are almost the same, I
would say it is not often (maybe 3% of the test period, otherwise I
would get better results because this patch set solves this issue).
I have run a few different kernels and benchmarks without this
'capacity inversion' and I don't see a regression (or benefits
from this solution), which is also a big plus for mainlining it.

In the case where the 'capacity inversion' is artificially introduced
into the system for 100% of the time, the stress tests show a huge
difference. Please refer to Ionela's test results [1] (~30% better).

Regards,
Lukasz Luba


[1] 
https://docs.google.com/spreadsheets/d/1ibxDSSSLTodLzihNAw6jM36eVZABuPMMnjvV-Xh4NEo/edit#gid=0

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2019-11-19 15:13 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-05 18:49 [Patch v5 0/6] Introduce Thermal Pressure Thara Gopinath
2019-11-05 18:49 ` [Patch v5 1/6] sched/pelt.c: Add support to track thermal pressure Thara Gopinath
2019-11-06  8:24   ` Vincent Guittot
2019-11-06 12:50   ` Dietmar Eggemann
2019-11-06 17:00     ` Thara Gopinath
2019-11-07 16:39   ` Qais Yousef
2019-11-19 10:50   ` Amit Kucheria
2019-11-05 18:49 ` [Patch v5 2/6] sched/fair: Add infrastructure to store and update instantaneous " Thara Gopinath
2019-11-05 20:21   ` Ionela Voinescu
2019-11-05 21:02     ` Thara Gopinath
2019-11-05 21:15       ` Ionela Voinescu
2019-11-05 21:29         ` Thara Gopinath
2019-11-05 21:53           ` Ionela Voinescu
2019-11-06 12:50             ` Dietmar Eggemann
2019-11-06 17:53               ` Thara Gopinath
2019-11-07  9:32                 ` Dietmar Eggemann
2019-11-07 10:48                   ` Vincent Guittot
2019-11-07 11:36                     ` Dietmar Eggemann
2019-11-06  8:27   ` Vincent Guittot
2019-11-06 17:00     ` Thara Gopinath
2019-11-19 10:51   ` Amit Kucheria
2019-11-05 18:49 ` [Patch v5 3/6] sched/fair: Enable periodic update of average " Thara Gopinath
2019-11-06  8:32   ` Vincent Guittot
2019-11-06 17:01     ` Thara Gopinath
2019-11-05 18:49 ` [Patch v5 4/6] sched/fair: update cpu_capcity to reflect " Thara Gopinath
2019-11-06 16:56   ` Qais Yousef
2019-11-06 17:31     ` Thara Gopinath
2019-11-06 17:41       ` Qais Yousef
2019-11-19 10:51   ` Amit Kucheria
2019-11-05 18:49 ` [Patch v5 5/6] thermal/cpu-cooling: Update thermal pressure in case of a maximum frequency capping Thara Gopinath
2019-11-06 12:50   ` Dietmar Eggemann
2019-11-06 17:28     ` Thara Gopinath
2019-11-07 13:00     ` Dietmar Eggemann
2019-11-05 18:49 ` [Patch v5 6/6] sched/fair: Enable tuning of decay period Thara Gopinath
2019-11-07 10:49   ` Vincent Guittot
2019-11-08 10:53     ` Dietmar Eggemann
2019-11-19 10:52   ` Amit Kucheria
2019-11-12 11:21 ` [Patch v5 0/6] Introduce Thermal Pressure Lukasz Luba
2019-11-19 15:12   ` Lukasz Luba
2019-11-19 10:54 ` Amit Kucheria

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).