All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 00/18] Introduce runtime modifiable Energy Model
@ 2023-09-25  8:11 Lukasz Luba
  2023-09-25  8:11 ` [PATCH v4 01/18] PM: EM: Add missing newline for the message log Lukasz Luba
                   ` (18 more replies)
  0 siblings, 19 replies; 58+ messages in thread
From: Lukasz Luba @ 2023-09-25  8:11 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	mhiramat, qyousef, wvw

Hi all,

This patch set adds a new feature which allows to modify Energy Model (EM)
power values at runtime. It will allow to better reflect power model of
a recent SoCs and silicon. Different characteristics of the power usage
can be leveraged and thus better decisions made during task placement in EAS.

It's part of feature set know as Dynamic Energy Model. It has been presented
and discussed recently at OSPM2023 [3]. This patch set implements the 1st
improvement for the EM.

The concepts:
1. The CPU power usage can vary due to the workload that it's running or due
to the temperature of the SoC. The same workload can use more power when the
temperature of the silicon has increased (e.g. due to hot GPU or ISP).
In such situation the EM can be adjusted and reflect the fact of increased
power usage. That power increase is due to static power
(sometimes called simply: leakage). The CPUs in recent SoCs are different.
We have heterogeneous SoCs with 3 (or even 4) different microarchitectures.
They are also built differently with High Performance (HP) cells or
Low Power (LP) cells. They are affected by the temperature increase
differently: HP cells have bigger leakage. The SW model can leverage that
knowledge.

2. It is also possible to change the EM to better reflect the currently
running workload. Usually the EM is derived from some average power values
taken from experiments with benchmark (e.g. Dhrystone). The model derived
from such scenario might not represent properly the workloads usually running
on the device. Therefore, runtime modification of the EM allows to switch to
a different model, when there is a need.

3. The EM can be adjusted after boot, when all the modules are loaded and
more information about the SoC is available e.g. chip binning. This would help
to better reflect the silicon characteristics. Thus, this EM modification
API allows it now. It wasn't possible in the past and the EM had to be
'set in stone'.

Some design details:
The internal mechanisms for the memory allocation are handled internally in the 
EM. Kernel modules can just call the new API to update the EM data and the 
new memory would be provided and owned by the EM. The EM memory is used by
EAS, which impacts those design decisions. The EM writers are protected by
a mutex. This new runtime modified EM table is protected using RCU mechanism,
which fits the current EAS hot path (which already uses RCU read lock).
The unregister API handles only non-CPU (e.g. GPU, ISP) devices and uses the
same mutex as EM modifiers to make sure the memory is safely freed.

More detailed explanation and background can be found in presentations
during LPC2022 [1][2] or in the documentation patches.

The time cost to update EM for 11 OPPs is listed below. It's roughly
1.5us per 1 OPP while doing this on Little CPU at max frequency (1.8GHz).
More detailed results:

(The 4 CPUs from top are the little (1.8MHz), than 2 Mid (2.2GHz) and
then 2 big (2.8GHz) (while cpu6 didn't run that code)
------------------------------------
  Function                               Hit    Time            Avg         s^2
  --------                               ---    ----            ---         ---
  em_dev_update_perf_domain             3104    51236.39 us     16.506 us       75.344 us
  Function                               Hit    Time            Avg         s^2
  --------                               ---    ----            ---         ---
  em_dev_update_perf_domain             1264    20768.15 us     16.430 us       62.257 us
  Function                               Hit    Time            Avg         s^2
  --------                               ---    ----            ---         ---
  em_dev_update_perf_domain             1166    18632.95 us     15.980 us       70.707 us
  Function                               Hit    Time            Avg         s^2
  --------                               ---    ----            ---         ---
  em_dev_update_perf_domain              770    12334.43 us     16.018 us       66.337 us
  Function                               Hit    Time            Avg         s^2
  --------                               ---    ----            ---         ---
  em_dev_update_perf_domain              101    920.613 us      9.114 us        21.380 us
  Function                               Hit    Time            Avg         s^2
  --------                               ---    ----            ---         ---
  em_dev_update_perf_domain               20    211.830 us      10.591 us       23.998 us
  Function                               Hit    Time            Avg         s^2
  --------                               ---    ----            ---         ---
  Function                               Hit    Time            Avg         s^2
  --------                               ---    ----            ---         ---
  em_dev_update_perf_domain               15    78.085 us       5.205 us        7.444 us


Some test results.
The EM can be updated to fit better the workload type. In the case below the EM
has been updated for the Jankbench test on Pixel6 (running v5.18 w/ mainline backports
for the scheduler bits). The Jankbench was run 10 times for those two configurations,
to get more reliable data.

1. Janky frames percentage
+--------+-----------------+---------------------+-------+-----------+
| metric |    variable     |       kernel        | value | perc_diff |
+--------+-----------------+---------------------+-------+-----------+
| gmean  | jank_percentage | EM_default          |  2.0  |   0.0%    |
| gmean  | jank_percentage | EM_modified_runtime |  1.3  |  -35.33%  |
+--------+-----------------+---------------------+-------+-----------+

2. Avg frame render time duration
+--------+---------------------+---------------------+-------+-----------+
| metric |      variable       |       kernel        | value | perc_diff |
+--------+---------------------+---------------------+-------+-----------+
| gmean  | mean_frame_duration | EM_default          | 10.5  |   0.0%    |
| gmean  | mean_frame_duration | EM_modified_runtime |  9.6  |  -8.52%   |
+--------+---------------------+---------------------+-------+-----------+

3. Max frame render time duration
+--------+--------------------+---------------------+-------+-----------+
| metric |      variable      |       kernel        | value | perc_diff |
+--------+--------------------+---------------------+-------+-----------+
| gmean  | max_frame_duration | EM_default          | 251.6 |   0.0%    |
| gmean  | max_frame_duration | EM_modified_runtime | 115.5 |  -54.09%  |
+--------+--------------------+---------------------+-------+-----------+

4. OS overutilized state percentage (when EAS is not working)
+--------------+---------------------+------+------------+------------+
|    metric    |       wa_path       | time | total_time | percentage |
+--------------+---------------------+------+------------+------------+
| overutilized | EM_default          | 1.65 |   253.38   |    0.65    |
| overutilized | EM_modified_runtime | 1.4  |   277.5    |    0.51    |
+--------------+---------------------+------+------------+------------+

5. All CPUs (Little+Mid+Big) power values in mW
+------------+--------+---------------------+-------+-----------+
|  channel   | metric |       kernel        | value | perc_diff |
+------------+--------+---------------------+-------+-----------+
|    CPU     | gmean  | EM_default          | 142.1 |   0.0%    |
|    CPU     | gmean  | EM_modified_runtime | 131.8 |  -7.27%   |
+------------+--------+---------------------+-------+-----------+



Changelog:
v4:
- refactored 2 rcu callbacks into 1 (Dietmar)
- fixed documentation (Dietmar)
- fixed 'run-time' into 'runtime' in all comments (Dietmar)
- fixed comments in patch headers (Diermar)
- fixed build issue in one patch (Dietmar)
- added cost of updating EM in the cover letter (Dietmar)
- added 'performance' field, which alled to optilize further
  the pre-calculated 'cost' field and remove division operation
  from runtime
- added update function during the boot after new EM is registered
  to modify older EMs if needed due to potential CPU capacity change
v3 [6]:
- adjusted inline comments for structure doc (Dietmar)
- extended patch header with infromation that only EAS will use the feature
  and it was driving the design (Dietmar)
- changed patch header and put shorter comment (Dietmar)
- moved the 'refactoring' patch that adds 'default_table' before the
  introduction of runtime modifiable feature as sugested by Dietmar in 
  numerous patches v2
- merged documentation patches into one
- added more explenation about the 2 tables design into the Doc (Dietmar)
- removed the CPPC+EM patch for runtime modification
- removed the trace patch, since the trace point would be added after a while
- renamed 'tmp' to 'runtime_table' variable in the unregister function,
  to better highlight the memory pointer checks (it is possible after
  moving the 'default_table' earlier)
- and added '__rcu' in the unregister function which should calm down
  the test bot warning
- renamed 'create' to 'refactor' in the patch header (Dietmar)
v2 [5]:
- solved build warning of unused variable in patch 13/17 when EM is
  not compiled in, e.g. on Intel platform for this cpufreq_cooling
- re-based on top of v6.4-rc1
v1:
- implementation can be found here [4]

Regards,
Lukasz Luba

[1] https://lpc.events/event/16/contributions/1341/attachments/955/1873/Dynamic_Energy_Model_to_handle_leakage_power.pdf
[2] https://lpc.events/event/16/contributions/1194/attachments/1114/2139/LPC2022_Energy_model_accuracy.pdf
[3] https://www.youtube.com/watch?v=2C-5uikSbtM&list=PL0fKordpLTjKsBOUcZqnzlHShri4YBL1H
[4] https://lore.kernel.org/lkml/20230314103357.26010-1-lukasz.luba@arm.com/
[5] https://lore.kernel.org/lkml/20230512095743.3393563-1-lukasz.luba@arm.com/
[6] https://lore.kernel.org/lkml/20230721155022.2339982-1-lukasz.luba@arm.com/


Lukasz Luba (18):
  PM: EM: Add missing newline for the message log
  PM: EM: Refactor em_cpufreq_update_efficiencies() arguments
  PM: EM: Find first CPU online while updating OPP efficiency
  PM: EM: Refactor em_pd_get_efficient_state() to be more flexible
  PM: EM: Refactor a new function em_compute_costs()
  PM: EM: Check if the get_cost() callback is present in
    em_compute_costs()
  PM: EM: Refactor struct em_perf_domain and add default_table
  PM: EM: Add update_power() callback for runtime modifications
  PM: EM: Introduce runtime modifiable table
  PM: EM: Add RCU mechanism which safely cleans the old data
  PM: EM: Add runtime update interface to modify EM power
  PM: EM: Use runtime modified EM for CPUs energy estimation in EAS
  Documentation: EM: Update with runtime modification design
  PM: EM: Add performance field to struct em_perf_state
  PM: EM: Adjust performance with runtime modification callback
  PM: EM: Support late CPUs booting and capacity adjustment
  PM: EM: Optimize em_cpu_energy() and remove division
  Documentation: EM: Update information about performance field

 Documentation/power/energy-model.rst | 149 +++++++++-
 drivers/powercap/dtpm_cpu.c          |  27 +-
 drivers/powercap/dtpm_devfreq.c      |  23 +-
 drivers/thermal/cpufreq_cooling.c    |  24 +-
 drivers/thermal/devfreq_cooling.c    |  23 +-
 include/linux/energy_model.h         | 155 ++++++-----
 kernel/power/energy_model.c          | 400 ++++++++++++++++++++++++---
 7 files changed, 651 insertions(+), 150 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH v4 01/18] PM: EM: Add missing newline for the message log
  2023-09-25  8:11 [PATCH v4 00/18] Introduce runtime modifiable Energy Model Lukasz Luba
@ 2023-09-25  8:11 ` Lukasz Luba
  2023-09-25  8:11 ` [PATCH v4 02/18] PM: EM: Refactor em_cpufreq_update_efficiencies() arguments Lukasz Luba
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 58+ messages in thread
From: Lukasz Luba @ 2023-09-25  8:11 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	mhiramat, qyousef, wvw

Fix missing newline for the string long in the error code path.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 kernel/power/energy_model.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 7b44f5b89fa1..8b9dd4a39f63 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -250,7 +250,7 @@ static void em_cpufreq_update_efficiencies(struct device *dev)
 
 	policy = cpufreq_cpu_get(cpumask_first(em_span_cpus(pd)));
 	if (!policy) {
-		dev_warn(dev, "EM: Access to CPUFreq policy failed");
+		dev_warn(dev, "EM: Access to CPUFreq policy failed\n");
 		return;
 	}
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 02/18] PM: EM: Refactor em_cpufreq_update_efficiencies() arguments
  2023-09-25  8:11 [PATCH v4 00/18] Introduce runtime modifiable Energy Model Lukasz Luba
  2023-09-25  8:11 ` [PATCH v4 01/18] PM: EM: Add missing newline for the message log Lukasz Luba
@ 2023-09-25  8:11 ` Lukasz Luba
  2023-09-25  8:11 ` [PATCH v4 03/18] PM: EM: Find first CPU online while updating OPP efficiency Lukasz Luba
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 58+ messages in thread
From: Lukasz Luba @ 2023-09-25  8:11 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	mhiramat, qyousef, wvw

In order to prepare the code for the modifiable EM perf_state table,
refactor existing function em_cpufreq_update_efficiencies().

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 kernel/power/energy_model.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 8b9dd4a39f63..42486674b834 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -237,10 +237,10 @@ static int em_create_pd(struct device *dev, int nr_states,
 	return 0;
 }
 
-static void em_cpufreq_update_efficiencies(struct device *dev)
+static void
+em_cpufreq_update_efficiencies(struct device *dev, struct em_perf_state *table)
 {
 	struct em_perf_domain *pd = dev->em_pd;
-	struct em_perf_state *table;
 	struct cpufreq_policy *policy;
 	int found = 0;
 	int i;
@@ -254,8 +254,6 @@ static void em_cpufreq_update_efficiencies(struct device *dev)
 		return;
 	}
 
-	table = pd->table;
-
 	for (i = 0; i < pd->nr_perf_states; i++) {
 		if (!(table[i].flags & EM_PERF_STATE_INEFFICIENT))
 			continue;
@@ -397,7 +395,7 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
 
 	dev->em_pd->flags |= flags;
 
-	em_cpufreq_update_efficiencies(dev);
+	em_cpufreq_update_efficiencies(dev, dev->em_pd->table);
 
 	em_debug_create_pd(dev);
 	dev_info(dev, "EM: created perf domain\n");
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 03/18] PM: EM: Find first CPU online while updating OPP efficiency
  2023-09-25  8:11 [PATCH v4 00/18] Introduce runtime modifiable Energy Model Lukasz Luba
  2023-09-25  8:11 ` [PATCH v4 01/18] PM: EM: Add missing newline for the message log Lukasz Luba
  2023-09-25  8:11 ` [PATCH v4 02/18] PM: EM: Refactor em_cpufreq_update_efficiencies() arguments Lukasz Luba
@ 2023-09-25  8:11 ` Lukasz Luba
  2023-09-26 18:32   ` Rafael J. Wysocki
  2023-10-23 17:06   ` Daniel Lezcano
  2023-09-25  8:11 ` [PATCH v4 04/18] PM: EM: Refactor em_pd_get_efficient_state() to be more flexible Lukasz Luba
                   ` (15 subsequent siblings)
  18 siblings, 2 replies; 58+ messages in thread
From: Lukasz Luba @ 2023-09-25  8:11 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	mhiramat, qyousef, wvw

The Energy Model might be updated at runtime and the energy efficiency
for each OPP may change. Thus, there is a need to update also the
cpufreq framework and make it aligned to the new values. In order to
do that, use a first online CPU from the Performance Domain.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 kernel/power/energy_model.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 42486674b834..3dafdd7731c4 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -243,12 +243,19 @@ em_cpufreq_update_efficiencies(struct device *dev, struct em_perf_state *table)
 	struct em_perf_domain *pd = dev->em_pd;
 	struct cpufreq_policy *policy;
 	int found = 0;
-	int i;
+	int i, cpu;
 
 	if (!_is_cpu_device(dev) || !pd)
 		return;
 
-	policy = cpufreq_cpu_get(cpumask_first(em_span_cpus(pd)));
+	/* Try to get a CPU which is online and in this PD */
+	cpu = cpumask_first_and(em_span_cpus(pd), cpu_active_mask);
+	if (cpu >= nr_cpu_ids) {
+		dev_warn(dev, "EM: No online CPU for CPUFreq policy\n");
+		return;
+	}
+
+	policy = cpufreq_cpu_get(cpu);
 	if (!policy) {
 		dev_warn(dev, "EM: Access to CPUFreq policy failed\n");
 		return;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 04/18] PM: EM: Refactor em_pd_get_efficient_state() to be more flexible
  2023-09-25  8:11 [PATCH v4 00/18] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (2 preceding siblings ...)
  2023-09-25  8:11 ` [PATCH v4 03/18] PM: EM: Find first CPU online while updating OPP efficiency Lukasz Luba
@ 2023-09-25  8:11 ` Lukasz Luba
  2023-10-23 17:39   ` Daniel Lezcano
  2023-09-25  8:11 ` [PATCH v4 05/18] PM: EM: Refactor a new function em_compute_costs() Lukasz Luba
                   ` (14 subsequent siblings)
  18 siblings, 1 reply; 58+ messages in thread
From: Lukasz Luba @ 2023-09-25  8:11 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	mhiramat, qyousef, wvw

The Energy Model (EM) is going to support runtime modification. There
are going to be 2 EM tables which store information. This patch aims
to prepare the code to be generic and use one of the tables. The function
will no longer get a pointer to 'struct em_perf_domain' (the EM) but
instead a pointer to 'struct em_perf_state' (which is one of the EM's
tables).

Prepare em_pd_get_efficient_state() for the upcoming changes and
make it possible to re-use. Return an index for the best performance
state for a given EM table. The function arguments that are introduced
should allow to work on different performance state arrays. The caller of
em_pd_get_efficient_state() should be able to use the index either
on the default or the modifiable EM table.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 include/linux/energy_model.h | 30 +++++++++++++++++-------------
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index b9caa01dfac4..8069f526c9d8 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -175,33 +175,35 @@ void em_dev_unregister_perf_domain(struct device *dev);
 
 /**
  * em_pd_get_efficient_state() - Get an efficient performance state from the EM
- * @pd   : Performance domain for which we want an efficient frequency
- * @freq : Frequency to map with the EM
+ * @state:		List of performance states, in ascending order
+ * @nr_perf_states:	Number of performance states
+ * @freq:		Frequency to map with the EM
+ * @pd_flags:		Performance Domain flags
  *
  * It is called from the scheduler code quite frequently and as a consequence
  * doesn't implement any check.
  *
- * Return: An efficient performance state, high enough to meet @freq
+ * Return: An efficient performance state id, high enough to meet @freq
  * requirement.
  */
-static inline
-struct em_perf_state *em_pd_get_efficient_state(struct em_perf_domain *pd,
-						unsigned long freq)
+static inline int
+em_pd_get_efficient_state(struct em_perf_state *table, int nr_perf_states,
+			  unsigned long freq, unsigned long pd_flags)
 {
 	struct em_perf_state *ps;
 	int i;
 
-	for (i = 0; i < pd->nr_perf_states; i++) {
-		ps = &pd->table[i];
+	for (i = 0; i < nr_perf_states; i++) {
+		ps = &table[i];
 		if (ps->frequency >= freq) {
-			if (pd->flags & EM_PERF_DOMAIN_SKIP_INEFFICIENCIES &&
+			if (pd_flags & EM_PERF_DOMAIN_SKIP_INEFFICIENCIES &&
 			    ps->flags & EM_PERF_STATE_INEFFICIENT)
 				continue;
-			break;
+			return i;
 		}
 	}
 
-	return ps;
+	return nr_perf_states - 1;
 }
 
 /**
@@ -226,7 +228,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 {
 	unsigned long freq, scale_cpu;
 	struct em_perf_state *ps;
-	int cpu;
+	int cpu, i;
 
 	if (!sum_util)
 		return 0;
@@ -251,7 +253,9 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	 * Find the lowest performance state of the Energy Model above the
 	 * requested frequency.
 	 */
-	ps = em_pd_get_efficient_state(pd, freq);
+	i = em_pd_get_efficient_state(pd->table, pd->nr_perf_states, freq,
+				      pd->flags);
+	ps = &pd->table[i];
 
 	/*
 	 * The capacity of a CPU in the domain at the performance state (ps)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 05/18] PM: EM: Refactor a new function em_compute_costs()
  2023-09-25  8:11 [PATCH v4 00/18] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (3 preceding siblings ...)
  2023-09-25  8:11 ` [PATCH v4 04/18] PM: EM: Refactor em_pd_get_efficient_state() to be more flexible Lukasz Luba
@ 2023-09-25  8:11 ` Lukasz Luba
  2023-09-26 18:39   ` Rafael J. Wysocki
  2023-09-25  8:11 ` [PATCH v4 06/18] PM: EM: Check if the get_cost() callback is present in em_compute_costs() Lukasz Luba
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 58+ messages in thread
From: Lukasz Luba @ 2023-09-25  8:11 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	mhiramat, qyousef, wvw

Refactor a dedicated function which will be easier to maintain and re-use
in future. The upcoming changes for the modifiable EM perf_state table
will use it (instead of duplicating the code).

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 kernel/power/energy_model.c | 72 ++++++++++++++++++++++---------------
 1 file changed, 43 insertions(+), 29 deletions(-)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 3dafdd7731c4..7ea882401833 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -103,14 +103,52 @@ static void em_debug_create_pd(struct device *dev) {}
 static void em_debug_remove_pd(struct device *dev) {}
 #endif
 
+static int em_compute_costs(struct device *dev, struct em_perf_state *table,
+			    struct em_data_callback *cb, int nr_states,
+			    unsigned long flags)
+{
+	unsigned long prev_cost = ULONG_MAX;
+	u64 fmax;
+	int i, ret;
+
+	/* Compute the cost of each performance state. */
+	fmax = (u64) table[nr_states - 1].frequency;
+	for (i = nr_states - 1; i >= 0; i--) {
+		unsigned long power_res, cost;
+
+		if (flags & EM_PERF_DOMAIN_ARTIFICIAL) {
+			ret = cb->get_cost(dev, table[i].frequency, &cost);
+			if (ret || !cost || cost > EM_MAX_POWER) {
+				dev_err(dev, "EM: invalid cost %lu %d\n",
+					cost, ret);
+				return -EINVAL;
+			}
+		} else {
+			power_res = table[i].power;
+			cost = div64_u64(fmax * power_res, table[i].frequency);
+		}
+
+		table[i].cost = cost;
+
+		if (table[i].cost >= prev_cost) {
+			table[i].flags = EM_PERF_STATE_INEFFICIENT;
+			dev_dbg(dev, "EM: OPP:%lu is inefficient\n",
+				table[i].frequency);
+		} else {
+			prev_cost = table[i].cost;
+		}
+	}
+
+	return 0;
+}
+
 static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
 				int nr_states, struct em_data_callback *cb,
 				unsigned long flags)
 {
-	unsigned long power, freq, prev_freq = 0, prev_cost = ULONG_MAX;
+	unsigned long power, freq, prev_freq = 0;
 	struct em_perf_state *table;
 	int i, ret;
-	u64 fmax;
 
 	table = kcalloc(nr_states, sizeof(*table), GFP_KERNEL);
 	if (!table)
@@ -154,33 +192,9 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
 		table[i].frequency = prev_freq = freq;
 	}
 
-	/* Compute the cost of each performance state. */
-	fmax = (u64) table[nr_states - 1].frequency;
-	for (i = nr_states - 1; i >= 0; i--) {
-		unsigned long power_res, cost;
-
-		if (flags & EM_PERF_DOMAIN_ARTIFICIAL) {
-			ret = cb->get_cost(dev, table[i].frequency, &cost);
-			if (ret || !cost || cost > EM_MAX_POWER) {
-				dev_err(dev, "EM: invalid cost %lu %d\n",
-					cost, ret);
-				goto free_ps_table;
-			}
-		} else {
-			power_res = table[i].power;
-			cost = div64_u64(fmax * power_res, table[i].frequency);
-		}
-
-		table[i].cost = cost;
-
-		if (table[i].cost >= prev_cost) {
-			table[i].flags = EM_PERF_STATE_INEFFICIENT;
-			dev_dbg(dev, "EM: OPP:%lu is inefficient\n",
-				table[i].frequency);
-		} else {
-			prev_cost = table[i].cost;
-		}
-	}
+	ret = em_compute_costs(dev, table, cb, nr_states, flags);
+	if (ret)
+		goto free_ps_table;
 
 	pd->table = table;
 	pd->nr_perf_states = nr_states;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 06/18] PM: EM: Check if the get_cost() callback is present in em_compute_costs()
  2023-09-25  8:11 [PATCH v4 00/18] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (4 preceding siblings ...)
  2023-09-25  8:11 ` [PATCH v4 05/18] PM: EM: Refactor a new function em_compute_costs() Lukasz Luba
@ 2023-09-25  8:11 ` Lukasz Luba
  2023-09-26 18:46   ` Rafael J. Wysocki
  2023-10-23 18:23   ` Daniel Lezcano
  2023-09-25  8:11 ` [PATCH v4 07/18] PM: EM: Refactor struct em_perf_domain and add default_table Lukasz Luba
                   ` (12 subsequent siblings)
  18 siblings, 2 replies; 58+ messages in thread
From: Lukasz Luba @ 2023-09-25  8:11 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	mhiramat, qyousef, wvw

The em_compute_cost() is going to be re-used in runtime modified EM
code path. Thus, make sure that this common code is safe and won't
try to use the NULL pointer. The former em_compute_cost() didn't have to
care about runtime modification code path. The upcoming changes introduce
such option, but with different callback. Those two paths which use
get_cost() (during first EM registration) or update_power() (during
runtime modification) need to be safely handled in em_compute_costs().

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 kernel/power/energy_model.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 7ea882401833..35e07933b34a 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -116,7 +116,7 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
 	for (i = nr_states - 1; i >= 0; i--) {
 		unsigned long power_res, cost;
 
-		if (flags & EM_PERF_DOMAIN_ARTIFICIAL) {
+		if (flags & EM_PERF_DOMAIN_ARTIFICIAL && cb->get_cost) {
 			ret = cb->get_cost(dev, table[i].frequency, &cost);
 			if (ret || !cost || cost > EM_MAX_POWER) {
 				dev_err(dev, "EM: invalid cost %lu %d\n",
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 07/18] PM: EM: Refactor struct em_perf_domain and add default_table
  2023-09-25  8:11 [PATCH v4 00/18] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (5 preceding siblings ...)
  2023-09-25  8:11 ` [PATCH v4 06/18] PM: EM: Check if the get_cost() callback is present in em_compute_costs() Lukasz Luba
@ 2023-09-25  8:11 ` Lukasz Luba
  2023-09-26 18:52   ` Rafael J. Wysocki
  2023-09-25  8:11 ` [PATCH v4 08/18] PM: EM: Add update_power() callback for runtime modifications Lukasz Luba
                   ` (11 subsequent siblings)
  18 siblings, 1 reply; 58+ messages in thread
From: Lukasz Luba @ 2023-09-25  8:11 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	mhiramat, qyousef, wvw

The Energy Model is going to support runtime modifications. Refactor old
implementation which accessed struct em_perf_state and introduce
em_perf_domain::default_table to clean up the design. This new field
will help to better distinguish 2 performance state tables.

Update all drivers or frameworks which used the old field:
em_perf_domain::table and now should use em_perf_domain::default_table.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 drivers/powercap/dtpm_cpu.c       | 27 +++++++++++++++++++--------
 drivers/powercap/dtpm_devfreq.c   | 23 ++++++++++++++++-------
 drivers/thermal/cpufreq_cooling.c | 24 ++++++++++++++++--------
 drivers/thermal/devfreq_cooling.c | 23 +++++++++++++++++------
 include/linux/energy_model.h      | 24 ++++++++++++++++++------
 kernel/power/energy_model.c       | 26 ++++++++++++++++++++++----
 6 files changed, 108 insertions(+), 39 deletions(-)

diff --git a/drivers/powercap/dtpm_cpu.c b/drivers/powercap/dtpm_cpu.c
index 2ff7717530bf..743a0ac8ecdf 100644
--- a/drivers/powercap/dtpm_cpu.c
+++ b/drivers/powercap/dtpm_cpu.c
@@ -43,6 +43,7 @@ static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit)
 {
 	struct dtpm_cpu *dtpm_cpu = to_dtpm_cpu(dtpm);
 	struct em_perf_domain *pd = em_cpu_get(dtpm_cpu->cpu);
+	struct em_perf_state *table;
 	struct cpumask cpus;
 	unsigned long freq;
 	u64 power;
@@ -51,19 +52,21 @@ static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit)
 	cpumask_and(&cpus, cpu_online_mask, to_cpumask(pd->cpus));
 	nr_cpus = cpumask_weight(&cpus);
 
+	table = pd->default_table->state;
+
 	for (i = 0; i < pd->nr_perf_states; i++) {
 
-		power = pd->table[i].power * nr_cpus;
+		power = table[i].power * nr_cpus;
 
 		if (power > power_limit)
 			break;
 	}
 
-	freq = pd->table[i - 1].frequency;
+	freq = table[i - 1].frequency;
 
 	freq_qos_update_request(&dtpm_cpu->qos_req, freq);
 
-	power_limit = pd->table[i - 1].power * nr_cpus;
+	power_limit = table[i - 1].power * nr_cpus;
 
 	return power_limit;
 }
@@ -88,12 +91,14 @@ static u64 scale_pd_power_uw(struct cpumask *pd_mask, u64 power)
 static u64 get_pd_power_uw(struct dtpm *dtpm)
 {
 	struct dtpm_cpu *dtpm_cpu = to_dtpm_cpu(dtpm);
+	struct em_perf_state *table;
 	struct em_perf_domain *pd;
 	struct cpumask *pd_mask;
 	unsigned long freq;
 	int i;
 
 	pd = em_cpu_get(dtpm_cpu->cpu);
+	table = pd->default_table->state;
 
 	pd_mask = em_span_cpus(pd);
 
@@ -101,10 +106,10 @@ static u64 get_pd_power_uw(struct dtpm *dtpm)
 
 	for (i = 0; i < pd->nr_perf_states; i++) {
 
-		if (pd->table[i].frequency < freq)
+		if (table[i].frequency < freq)
 			continue;
 
-		return scale_pd_power_uw(pd_mask, pd->table[i].power *
+		return scale_pd_power_uw(pd_mask, table[i].power *
 					 MICROWATT_PER_MILLIWATT);
 	}
 
@@ -115,17 +120,20 @@ static int update_pd_power_uw(struct dtpm *dtpm)
 {
 	struct dtpm_cpu *dtpm_cpu = to_dtpm_cpu(dtpm);
 	struct em_perf_domain *em = em_cpu_get(dtpm_cpu->cpu);
+	struct em_perf_state *table;
 	struct cpumask cpus;
 	int nr_cpus;
 
 	cpumask_and(&cpus, cpu_online_mask, to_cpumask(em->cpus));
 	nr_cpus = cpumask_weight(&cpus);
 
-	dtpm->power_min = em->table[0].power;
+	table = em->default_table->state;
+
+	dtpm->power_min = table[0].power;
 	dtpm->power_min *= MICROWATT_PER_MILLIWATT;
 	dtpm->power_min *= nr_cpus;
 
-	dtpm->power_max = em->table[em->nr_perf_states - 1].power;
+	dtpm->power_max = table[em->nr_perf_states - 1].power;
 	dtpm->power_max *= MICROWATT_PER_MILLIWATT;
 	dtpm->power_max *= nr_cpus;
 
@@ -182,6 +190,7 @@ static int __dtpm_cpu_setup(int cpu, struct dtpm *parent)
 {
 	struct dtpm_cpu *dtpm_cpu;
 	struct cpufreq_policy *policy;
+	struct em_perf_state *table;
 	struct em_perf_domain *pd;
 	char name[CPUFREQ_NAME_LEN];
 	int ret = -ENOMEM;
@@ -198,6 +207,8 @@ static int __dtpm_cpu_setup(int cpu, struct dtpm *parent)
 	if (!pd || em_is_artificial(pd))
 		return -EINVAL;
 
+	table = pd->default_table->state;
+
 	dtpm_cpu = kzalloc(sizeof(*dtpm_cpu), GFP_KERNEL);
 	if (!dtpm_cpu)
 		return -ENOMEM;
@@ -216,7 +227,7 @@ static int __dtpm_cpu_setup(int cpu, struct dtpm *parent)
 
 	ret = freq_qos_add_request(&policy->constraints,
 				   &dtpm_cpu->qos_req, FREQ_QOS_MAX,
-				   pd->table[pd->nr_perf_states - 1].frequency);
+				   table[pd->nr_perf_states - 1].frequency);
 	if (ret)
 		goto out_dtpm_unregister;
 
diff --git a/drivers/powercap/dtpm_devfreq.c b/drivers/powercap/dtpm_devfreq.c
index 91276761a31d..6ef0f2b4a683 100644
--- a/drivers/powercap/dtpm_devfreq.c
+++ b/drivers/powercap/dtpm_devfreq.c
@@ -37,11 +37,14 @@ static int update_pd_power_uw(struct dtpm *dtpm)
 	struct devfreq *devfreq = dtpm_devfreq->devfreq;
 	struct device *dev = devfreq->dev.parent;
 	struct em_perf_domain *pd = em_pd_get(dev);
+	struct em_perf_state *table;
 
-	dtpm->power_min = pd->table[0].power;
+	table = pd->default_table->state;
+
+	dtpm->power_min = table[0].power;
 	dtpm->power_min *= MICROWATT_PER_MILLIWATT;
 
-	dtpm->power_max = pd->table[pd->nr_perf_states - 1].power;
+	dtpm->power_max = table[pd->nr_perf_states - 1].power;
 	dtpm->power_max *= MICROWATT_PER_MILLIWATT;
 
 	return 0;
@@ -53,22 +56,25 @@ static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit)
 	struct devfreq *devfreq = dtpm_devfreq->devfreq;
 	struct device *dev = devfreq->dev.parent;
 	struct em_perf_domain *pd = em_pd_get(dev);
+	struct em_perf_state *table;
 	unsigned long freq;
 	u64 power;
 	int i;
 
+	table = pd->default_table->state;
+
 	for (i = 0; i < pd->nr_perf_states; i++) {
 
-		power = pd->table[i].power * MICROWATT_PER_MILLIWATT;
+		power = table[i].power * MICROWATT_PER_MILLIWATT;
 		if (power > power_limit)
 			break;
 	}
 
-	freq = pd->table[i - 1].frequency;
+	freq = table[i - 1].frequency;
 
 	dev_pm_qos_update_request(&dtpm_devfreq->qos_req, freq);
 
-	power_limit = pd->table[i - 1].power * MICROWATT_PER_MILLIWATT;
+	power_limit = table[i - 1].power * MICROWATT_PER_MILLIWATT;
 
 	return power_limit;
 }
@@ -94,6 +100,7 @@ static u64 get_pd_power_uw(struct dtpm *dtpm)
 	struct device *dev = devfreq->dev.parent;
 	struct em_perf_domain *pd = em_pd_get(dev);
 	struct devfreq_dev_status status;
+	struct em_perf_state *table;
 	unsigned long freq;
 	u64 power;
 	int i;
@@ -102,15 +109,17 @@ static u64 get_pd_power_uw(struct dtpm *dtpm)
 	status = devfreq->last_status;
 	mutex_unlock(&devfreq->lock);
 
+	table = pd->default_table->state;
+
 	freq = DIV_ROUND_UP(status.current_frequency, HZ_PER_KHZ);
 	_normalize_load(&status);
 
 	for (i = 0; i < pd->nr_perf_states; i++) {
 
-		if (pd->table[i].frequency < freq)
+		if (table[i].frequency < freq)
 			continue;
 
-		power = pd->table[i].power * MICROWATT_PER_MILLIWATT;
+		power = table[i].power * MICROWATT_PER_MILLIWATT;
 		power *= status.busy_time;
 		power >>= 10;
 
diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
index e2cc7bd30862..d468aca241e2 100644
--- a/drivers/thermal/cpufreq_cooling.c
+++ b/drivers/thermal/cpufreq_cooling.c
@@ -91,10 +91,11 @@ struct cpufreq_cooling_device {
 static unsigned long get_level(struct cpufreq_cooling_device *cpufreq_cdev,
 			       unsigned int freq)
 {
+	struct em_perf_state *table = cpufreq_cdev->em->default_table->state;
 	int i;
 
 	for (i = cpufreq_cdev->max_level - 1; i >= 0; i--) {
-		if (freq > cpufreq_cdev->em->table[i].frequency)
+		if (freq > table[i].frequency)
 			break;
 	}
 
@@ -104,15 +105,16 @@ static unsigned long get_level(struct cpufreq_cooling_device *cpufreq_cdev,
 static u32 cpu_freq_to_power(struct cpufreq_cooling_device *cpufreq_cdev,
 			     u32 freq)
 {
+	struct em_perf_state *table = cpufreq_cdev->em->default_table->state;
 	unsigned long power_mw;
 	int i;
 
 	for (i = cpufreq_cdev->max_level - 1; i >= 0; i--) {
-		if (freq > cpufreq_cdev->em->table[i].frequency)
+		if (freq > table[i].frequency)
 			break;
 	}
 
-	power_mw = cpufreq_cdev->em->table[i + 1].power;
+	power_mw = table[i + 1].power;
 	power_mw /= MICROWATT_PER_MILLIWATT;
 
 	return power_mw;
@@ -121,18 +123,19 @@ static u32 cpu_freq_to_power(struct cpufreq_cooling_device *cpufreq_cdev,
 static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
 			     u32 power)
 {
+	struct em_perf_state *table = cpufreq_cdev->em->default_table->state;
 	unsigned long em_power_mw;
 	int i;
 
 	for (i = cpufreq_cdev->max_level; i > 0; i--) {
 		/* Convert EM power to milli-Watts to make safe comparison */
-		em_power_mw = cpufreq_cdev->em->table[i].power;
+		em_power_mw = table[i].power;
 		em_power_mw /= MICROWATT_PER_MILLIWATT;
 		if (power >= em_power_mw)
 			break;
 	}
 
-	return cpufreq_cdev->em->table[i].frequency;
+	return table[i].frequency;
 }
 
 /**
@@ -262,8 +265,9 @@ static int cpufreq_get_requested_power(struct thermal_cooling_device *cdev,
 static int cpufreq_state2power(struct thermal_cooling_device *cdev,
 			       unsigned long state, u32 *power)
 {
-	unsigned int freq, num_cpus, idx;
 	struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
+	unsigned int freq, num_cpus, idx;
+	struct em_perf_state *table;
 
 	/* Request state should be less than max_level */
 	if (state > cpufreq_cdev->max_level)
@@ -271,8 +275,9 @@ static int cpufreq_state2power(struct thermal_cooling_device *cdev,
 
 	num_cpus = cpumask_weight(cpufreq_cdev->policy->cpus);
 
+	table = cpufreq_cdev->em->default_table->state;
 	idx = cpufreq_cdev->max_level - state;
-	freq = cpufreq_cdev->em->table[idx].frequency;
+	freq = table[idx].frequency;
 	*power = cpu_freq_to_power(cpufreq_cdev, freq) * num_cpus;
 
 	return 0;
@@ -378,8 +383,11 @@ static unsigned int get_state_freq(struct cpufreq_cooling_device *cpufreq_cdev,
 #ifdef CONFIG_THERMAL_GOV_POWER_ALLOCATOR
 	/* Use the Energy Model table if available */
 	if (cpufreq_cdev->em) {
+		struct em_perf_state *table;
+
+		table = cpufreq_cdev->em->default_table->state;
 		idx = cpufreq_cdev->max_level - state;
-		return cpufreq_cdev->em->table[idx].frequency;
+		return table[idx].frequency;
 	}
 #endif
 
diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
index 262e62ab6cf2..4207ef850582 100644
--- a/drivers/thermal/devfreq_cooling.c
+++ b/drivers/thermal/devfreq_cooling.c
@@ -87,6 +87,7 @@ static int devfreq_cooling_set_cur_state(struct thermal_cooling_device *cdev,
 	struct devfreq_cooling_device *dfc = cdev->devdata;
 	struct devfreq *df = dfc->devfreq;
 	struct device *dev = df->dev.parent;
+	struct em_perf_state *table;
 	unsigned long freq;
 	int perf_idx;
 
@@ -99,8 +100,9 @@ static int devfreq_cooling_set_cur_state(struct thermal_cooling_device *cdev,
 		return -EINVAL;
 
 	if (dfc->em_pd) {
+		table = dfc->em_pd->default_table->state;
 		perf_idx = dfc->max_state - state;
-		freq = dfc->em_pd->table[perf_idx].frequency * 1000;
+		freq = table[perf_idx].frequency * 1000;
 	} else {
 		freq = dfc->freq_table[state];
 	}
@@ -123,10 +125,11 @@ static int devfreq_cooling_set_cur_state(struct thermal_cooling_device *cdev,
  */
 static int get_perf_idx(struct em_perf_domain *em_pd, unsigned long freq)
 {
+	struct em_perf_state *table = em_pd->default_table->state;
 	int i;
 
 	for (i = 0; i < em_pd->nr_perf_states; i++) {
-		if (em_pd->table[i].frequency == freq)
+		if (table[i].frequency == freq)
 			return i;
 	}
 
@@ -181,6 +184,7 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
 	struct devfreq_cooling_device *dfc = cdev->devdata;
 	struct devfreq *df = dfc->devfreq;
 	struct devfreq_dev_status status;
+	struct em_perf_state *table;
 	unsigned long state;
 	unsigned long freq;
 	unsigned long voltage;
@@ -192,6 +196,8 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
 
 	freq = status.current_frequency;
 
+	table = dfc->em_pd->default_table->state;
+
 	if (dfc->power_ops && dfc->power_ops->get_real_power) {
 		voltage = get_voltage(df, freq);
 		if (voltage == 0) {
@@ -204,7 +210,7 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
 			state = dfc->capped_state;
 
 			/* Convert EM power into milli-Watts first */
-			dfc->res_util = dfc->em_pd->table[state].power;
+			dfc->res_util = table[state].power;
 			dfc->res_util /= MICROWATT_PER_MILLIWATT;
 
 			dfc->res_util *= SCALE_ERROR_MITIGATION;
@@ -225,7 +231,7 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
 		_normalize_load(&status);
 
 		/* Convert EM power into milli-Watts first */
-		*power = dfc->em_pd->table[perf_idx].power;
+		*power = table[perf_idx].power;
 		*power /= MICROWATT_PER_MILLIWATT;
 		/* Scale power for utilization */
 		*power *= status.busy_time;
@@ -245,13 +251,15 @@ static int devfreq_cooling_state2power(struct thermal_cooling_device *cdev,
 				       unsigned long state, u32 *power)
 {
 	struct devfreq_cooling_device *dfc = cdev->devdata;
+	struct em_perf_state *table;
 	int perf_idx;
 
 	if (state > dfc->max_state)
 		return -EINVAL;
 
+	table = dfc->em_pd->default_table->state;
 	perf_idx = dfc->max_state - state;
-	*power = dfc->em_pd->table[perf_idx].power;
+	*power = table[perf_idx].power;
 	*power /= MICROWATT_PER_MILLIWATT;
 
 	return 0;
@@ -264,6 +272,7 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
 	struct devfreq *df = dfc->devfreq;
 	struct devfreq_dev_status status;
 	unsigned long freq, em_power_mw;
+	struct em_perf_state *table;
 	s32 est_power;
 	int i;
 
@@ -273,6 +282,8 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
 
 	freq = status.current_frequency;
 
+	table = dfc->em_pd->default_table->state;
+
 	if (dfc->power_ops && dfc->power_ops->get_real_power) {
 		/* Scale for resource utilization */
 		est_power = power * dfc->res_util;
@@ -290,7 +301,7 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
 	 */
 	for (i = dfc->max_state; i > 0; i--) {
 		/* Convert EM power to milli-Watts to make safe comparison */
-		em_power_mw = dfc->em_pd->table[i].power;
+		em_power_mw = table[i].power;
 		em_power_mw /= MICROWATT_PER_MILLIWATT;
 		if (est_power >= em_power_mw)
 			break;
diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 8069f526c9d8..d236e08e80dc 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -36,9 +36,19 @@ struct em_perf_state {
  */
 #define EM_PERF_STATE_INEFFICIENT BIT(0)
 
+/**
+ * struct em_perf_table - Performance states table
+ * @state:	List of performance states, in ascending order
+ * @rcu:	RCU used for safe access and destruction
+ */
+struct em_perf_table {
+	struct em_perf_state *state;
+	struct rcu_head rcu;
+};
+
 /**
  * struct em_perf_domain - Performance domain
- * @table:		List of performance states, in ascending order
+ * @default_table:	Pointer to the default em_perf_table
  * @nr_perf_states:	Number of performance states
  * @flags:		See "em_perf_domain flags"
  * @cpus:		Cpumask covering the CPUs of the domain. It's here
@@ -53,7 +63,7 @@ struct em_perf_state {
  * field is unused.
  */
 struct em_perf_domain {
-	struct em_perf_state *table;
+	struct em_perf_table *default_table;
 	int nr_perf_states;
 	unsigned long flags;
 	unsigned long cpus[];
@@ -227,12 +237,14 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 				unsigned long allowed_cpu_cap)
 {
 	unsigned long freq, scale_cpu;
-	struct em_perf_state *ps;
+	struct em_perf_state *table, *ps;
 	int cpu, i;
 
 	if (!sum_util)
 		return 0;
 
+	table = pd->default_table->state;
+
 	/*
 	 * In order to predict the performance state, map the utilization of
 	 * the most utilized CPU of the performance domain to a requested
@@ -243,7 +255,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	 */
 	cpu = cpumask_first(to_cpumask(pd->cpus));
 	scale_cpu = arch_scale_cpu_capacity(cpu);
-	ps = &pd->table[pd->nr_perf_states - 1];
+	ps = &table[pd->nr_perf_states - 1];
 
 	max_util = map_util_perf(max_util);
 	max_util = min(max_util, allowed_cpu_cap);
@@ -253,9 +265,9 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	 * Find the lowest performance state of the Energy Model above the
 	 * requested frequency.
 	 */
-	i = em_pd_get_efficient_state(pd->table, pd->nr_perf_states, freq,
+	i = em_pd_get_efficient_state(table, pd->nr_perf_states, freq,
 				      pd->flags);
-	ps = &pd->table[i];
+	ps = &table[i];
 
 	/*
 	 * The capacity of a CPU in the domain at the performance state (ps)
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 35e07933b34a..797141638b29 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -66,6 +66,7 @@ DEFINE_SHOW_ATTRIBUTE(em_debug_flags);
 
 static void em_debug_create_pd(struct device *dev)
 {
+	struct em_perf_table *table = dev->em_pd->default_table;
 	struct dentry *d;
 	int i;
 
@@ -81,7 +82,7 @@ static void em_debug_create_pd(struct device *dev)
 
 	/* Create a sub-directory for each performance state */
 	for (i = 0; i < dev->em_pd->nr_perf_states; i++)
-		em_debug_create_ps(&dev->em_pd->table[i], d);
+		em_debug_create_ps(&table->state[i], d);
 
 }
 
@@ -196,7 +197,7 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
 	if (ret)
 		goto free_ps_table;
 
-	pd->table = table;
+	pd->default_table->state = table;
 	pd->nr_perf_states = nr_states;
 
 	return 0;
@@ -210,6 +211,7 @@ static int em_create_pd(struct device *dev, int nr_states,
 			struct em_data_callback *cb, cpumask_t *cpus,
 			unsigned long flags)
 {
+	struct em_perf_table *default_table;
 	struct em_perf_domain *pd;
 	struct device *cpu_dev;
 	int cpu, ret, num_cpus;
@@ -234,8 +236,17 @@ static int em_create_pd(struct device *dev, int nr_states,
 			return -ENOMEM;
 	}
 
+	default_table = kzalloc(sizeof(*default_table), GFP_KERNEL);
+	if (!default_table) {
+		kfree(pd);
+		return -ENOMEM;
+	}
+
+	pd->default_table = default_table;
+
 	ret = em_create_perf_table(dev, pd, nr_states, cb, flags);
 	if (ret) {
+		kfree(default_table);
 		kfree(pd);
 		return ret;
 	}
@@ -358,6 +369,7 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
 				bool microwatts)
 {
 	unsigned long cap, prev_cap = 0;
+	struct em_perf_state *table;
 	unsigned long flags = 0;
 	int cpu, ret;
 
@@ -416,7 +428,8 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
 
 	dev->em_pd->flags |= flags;
 
-	em_cpufreq_update_efficiencies(dev, dev->em_pd->table);
+	table = dev->em_pd->default_table->state;
+	em_cpufreq_update_efficiencies(dev, table);
 
 	em_debug_create_pd(dev);
 	dev_info(dev, "EM: created perf domain\n");
@@ -435,12 +448,16 @@ EXPORT_SYMBOL_GPL(em_dev_register_perf_domain);
  */
 void em_dev_unregister_perf_domain(struct device *dev)
 {
+	struct em_perf_domain *pd;
+
 	if (IS_ERR_OR_NULL(dev) || !dev->em_pd)
 		return;
 
 	if (_is_cpu_device(dev))
 		return;
 
+	pd = dev->em_pd;
+
 	/*
 	 * The mutex separates all register/unregister requests and protects
 	 * from potential clean-up/setup issues in the debugfs directories.
@@ -449,7 +466,8 @@ void em_dev_unregister_perf_domain(struct device *dev)
 	mutex_lock(&em_pd_mutex);
 	em_debug_remove_pd(dev);
 
-	kfree(dev->em_pd->table);
+	kfree(pd->default_table->state);
+	kfree(pd->default_table);
 	kfree(dev->em_pd);
 	dev->em_pd = NULL;
 	mutex_unlock(&em_pd_mutex);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 08/18] PM: EM: Add update_power() callback for runtime modifications
  2023-09-25  8:11 [PATCH v4 00/18] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (6 preceding siblings ...)
  2023-09-25  8:11 ` [PATCH v4 07/18] PM: EM: Refactor struct em_perf_domain and add default_table Lukasz Luba
@ 2023-09-25  8:11 ` Lukasz Luba
  2023-09-26 18:59   ` Rafael J. Wysocki
  2023-09-25  8:11 ` [PATCH v4 09/18] PM: EM: Introduce runtime modifiable table Lukasz Luba
                   ` (10 subsequent siblings)
  18 siblings, 1 reply; 58+ messages in thread
From: Lukasz Luba @ 2023-09-25  8:11 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	mhiramat, qyousef, wvw

The Energy Model (EM) is going to support runtime modifications. This
new callback would be used in the upcoming EM changes. The drivers
or frameworks which want to modify the EM have to implement the
update_power() callback.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 include/linux/energy_model.h | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index d236e08e80dc..546dee90f716 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -168,6 +168,26 @@ struct em_data_callback {
 	 */
 	int (*get_cost)(struct device *dev, unsigned long freq,
 			unsigned long *cost);
+
+	/**
+	 * update_power() - Provide new power at the given performance state of
+	 *		a device
+	 * @dev		: Device for which we do this operation (can be a CPU)
+	 * @freq	: Frequency at the performance state in kHz
+	 * @power	: New power value at the performance state
+	 *		(modified)
+	 * @priv	: Pointer to private data useful for tracking context
+	 *		during runtime modifications of EM.
+	 *
+	 * The update_power() is used by runtime modifiable EM. It aims to
+	 * provide updated power value for a given frequency, which is stored
+	 * in the performance state. The power value provided by this callback
+	 * should fit in the [0, EM_MAX_POWER] range.
+	 *
+	 * Return 0 on success, or appropriate error value in case of failure.
+	 */
+	int (*update_power)(struct device *dev, unsigned long freq,
+			    unsigned long *power, void *priv);
 };
 #define EM_SET_ACTIVE_POWER_CB(em_cb, cb) ((em_cb).active_power = cb)
 #define EM_ADV_DATA_CB(_active_power_cb, _cost_cb)	\
@@ -175,6 +195,7 @@ struct em_data_callback {
 	  .get_cost = _cost_cb }
 #define EM_DATA_CB(_active_power_cb)			\
 		EM_ADV_DATA_CB(_active_power_cb, NULL)
+#define EM_UPDATE_CB(_update_power_cb) { .update_power = &_update_power_cb }
 
 struct em_perf_domain *em_cpu_get(int cpu);
 struct em_perf_domain *em_pd_get(struct device *dev);
@@ -331,6 +352,7 @@ struct em_data_callback {};
 #define EM_ADV_DATA_CB(_active_power_cb, _cost_cb) { }
 #define EM_DATA_CB(_active_power_cb) { }
 #define EM_SET_ACTIVE_POWER_CB(em_cb, cb) do { } while (0)
+#define EM_UPDATE_CB(_update_cb) { }
 
 static inline
 int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 09/18] PM: EM: Introduce runtime modifiable table
  2023-09-25  8:11 [PATCH v4 00/18] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (7 preceding siblings ...)
  2023-09-25  8:11 ` [PATCH v4 08/18] PM: EM: Add update_power() callback for runtime modifications Lukasz Luba
@ 2023-09-25  8:11 ` Lukasz Luba
  2023-09-26 19:12   ` Rafael J. Wysocki
  2023-09-25  8:11 ` [PATCH v4 10/18] PM: EM: Add RCU mechanism which safely cleans the old data Lukasz Luba
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 58+ messages in thread
From: Lukasz Luba @ 2023-09-25  8:11 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	mhiramat, qyousef, wvw

The new runtime table would be populated with a new power data to better
reflect the actual power. The power can vary over time e.g. due to the
SoC temperature change. Higher temperature can increase power values.
For longer running scenarios, such as game or camera, when also other
devices are used (e.g. GPU, ISP) the CPU power can change. The new
EM framework is able to addresses this issue and change the data
at runtime safely.

The runtime modifiable EM data is used by the Energy Aware Scheduler (EAS)
for the task placement. All the other users (thermal, etc.) are still
using the default (basic) EM. This fact drove the design of this feature.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 include/linux/energy_model.h |  4 +++-
 kernel/power/energy_model.c  | 12 +++++++++++-
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 546dee90f716..740e7c25cfff 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -39,7 +39,7 @@ struct em_perf_state {
 /**
  * struct em_perf_table - Performance states table
  * @state:	List of performance states, in ascending order
- * @rcu:	RCU used for safe access and destruction
+ * @rcu:	RCU used only for runtime modifiable table
  */
 struct em_perf_table {
 	struct em_perf_state *state;
@@ -49,6 +49,7 @@ struct em_perf_table {
 /**
  * struct em_perf_domain - Performance domain
  * @default_table:	Pointer to the default em_perf_table
+ * @runtime_table:	Pointer to the runtime modifiable em_perf_table
  * @nr_perf_states:	Number of performance states
  * @flags:		See "em_perf_domain flags"
  * @cpus:		Cpumask covering the CPUs of the domain. It's here
@@ -64,6 +65,7 @@ struct em_perf_table {
  */
 struct em_perf_domain {
 	struct em_perf_table *default_table;
+	struct em_perf_table __rcu *runtime_table;
 	int nr_perf_states;
 	unsigned long flags;
 	unsigned long cpus[];
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 797141638b29..5b40db38b745 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -251,6 +251,9 @@ static int em_create_pd(struct device *dev, int nr_states,
 		return ret;
 	}
 
+	/* Initialize runtime table as default table. */
+	rcu_assign_pointer(pd->runtime_table, default_table);
+
 	if (_is_cpu_device(dev))
 		for_each_cpu(cpu, cpus) {
 			cpu_dev = get_cpu_device(cpu);
@@ -448,6 +451,7 @@ EXPORT_SYMBOL_GPL(em_dev_register_perf_domain);
  */
 void em_dev_unregister_perf_domain(struct device *dev)
 {
+	struct em_perf_table __rcu *runtime_table;
 	struct em_perf_domain *pd;
 
 	if (IS_ERR_OR_NULL(dev) || !dev->em_pd)
@@ -457,18 +461,24 @@ void em_dev_unregister_perf_domain(struct device *dev)
 		return;
 
 	pd = dev->em_pd;
-
 	/*
 	 * The mutex separates all register/unregister requests and protects
 	 * from potential clean-up/setup issues in the debugfs directories.
 	 * The debugfs directory name is the same as device's name.
 	 */
 	mutex_lock(&em_pd_mutex);
+
 	em_debug_remove_pd(dev);
 
+	runtime_table = pd->runtime_table;
+
+	rcu_assign_pointer(pd->runtime_table, NULL);
+	synchronize_rcu();
+
 	kfree(pd->default_table->state);
 	kfree(pd->default_table);
 	kfree(dev->em_pd);
+
 	dev->em_pd = NULL;
 	mutex_unlock(&em_pd_mutex);
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 10/18] PM: EM: Add RCU mechanism which safely cleans the old data
  2023-09-25  8:11 [PATCH v4 00/18] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (8 preceding siblings ...)
  2023-09-25  8:11 ` [PATCH v4 09/18] PM: EM: Introduce runtime modifiable table Lukasz Luba
@ 2023-09-25  8:11 ` Lukasz Luba
  2023-09-26 10:28   ` kernel test robot
  2023-09-26 19:26   ` Rafael J. Wysocki
  2023-09-25  8:11 ` [PATCH v4 11/18] PM: EM: Add runtime update interface to modify EM power Lukasz Luba
                   ` (8 subsequent siblings)
  18 siblings, 2 replies; 58+ messages in thread
From: Lukasz Luba @ 2023-09-25  8:11 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	mhiramat, qyousef, wvw

The EM is going to support runtime modifications of the power data.
Introduce RCU safe mechanism to clean up the old allocated EM data.
It also adds a mutex for the EM structure to serialize the modifiers.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 kernel/power/energy_model.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 5b40db38b745..2345837bfd2c 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -23,6 +23,9 @@
  */
 static DEFINE_MUTEX(em_pd_mutex);
 
+static void em_cpufreq_update_efficiencies(struct device *dev,
+					   struct em_perf_state *table);
+
 static bool _is_cpu_device(struct device *dev)
 {
 	return (dev->bus == &cpu_subsys);
@@ -104,6 +107,32 @@ static void em_debug_create_pd(struct device *dev) {}
 static void em_debug_remove_pd(struct device *dev) {}
 #endif
 
+static void em_destroy_rt_table_rcu(struct rcu_head *rp)
+{
+	struct em_perf_table *runtime_table;
+
+	runtime_table = container_of(rp, struct em_perf_table, rcu);
+	kfree(runtime_table->state);
+	kfree(runtime_table);
+}
+
+static void em_perf_runtime_table_set(struct device *dev,
+				      struct em_perf_table *runtime_table)
+{
+	struct em_perf_domain *pd = dev->em_pd;
+	struct em_perf_table *tmp;
+
+	tmp = pd->runtime_table;
+
+	rcu_assign_pointer(pd->runtime_table, runtime_table);
+
+	em_cpufreq_update_efficiencies(dev, runtime_table->state);
+
+	/* Don't free default table since it's used by other frameworks. */
+	if (tmp != pd->default_table)
+		call_rcu(&tmp->rcu, em_destroy_rt_table_rcu);
+}
+
 static int em_compute_costs(struct device *dev, struct em_perf_state *table,
 			    struct em_data_callback *cb, int nr_states,
 			    unsigned long flags)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 11/18] PM: EM: Add runtime update interface to modify EM power
  2023-09-25  8:11 [PATCH v4 00/18] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (9 preceding siblings ...)
  2023-09-25  8:11 ` [PATCH v4 10/18] PM: EM: Add RCU mechanism which safely cleans the old data Lukasz Luba
@ 2023-09-25  8:11 ` Lukasz Luba
  2023-09-26 17:21   ` kernel test robot
  2023-09-26 19:48   ` Rafael J. Wysocki
  2023-09-25  8:11 ` [PATCH v4 12/18] PM: EM: Use runtime modified EM for CPUs energy estimation in EAS Lukasz Luba
                   ` (7 subsequent siblings)
  18 siblings, 2 replies; 58+ messages in thread
From: Lukasz Luba @ 2023-09-25  8:11 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	mhiramat, qyousef, wvw

Add an interface which allows to modify EM power data at runtime.
The new power information is populated by the provided callback, which
is called for each performance state. The CPU frequencies' efficiency is
re-calculated since that might be affected as well. The old EM memory
is going to be freed later using RCU mechanism.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 include/linux/energy_model.h |   8 +++
 kernel/power/energy_model.c  | 111 +++++++++++++++++++++++++++++++++++
 2 files changed, 119 insertions(+)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 740e7c25cfff..8f055ab356ed 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -201,6 +201,8 @@ struct em_data_callback {
 
 struct em_perf_domain *em_cpu_get(int cpu);
 struct em_perf_domain *em_pd_get(struct device *dev);
+int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
+			      void *priv);
 int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
 				struct em_data_callback *cb, cpumask_t *span,
 				bool microwatts);
@@ -384,6 +386,12 @@ static inline int em_pd_nr_perf_states(struct em_perf_domain *pd)
 {
 	return 0;
 }
+static inline
+int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
+			      void *priv)
+{
+	return -EINVAL;
+}
 #endif
 
 #endif
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 2345837bfd2c..78e1495dc87e 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -172,6 +172,101 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
 	return 0;
 }
 
+/**
+ * em_dev_update_perf_domain() - Update runtime EM table for a device
+ * @dev		: Device for which the EM is to be updated
+ * @cb		: Callback function providing the power data for the EM
+ * @priv	: Pointer to private data useful for passing context
+ *		which might be required while calling @cb
+ *
+ * Update EM runtime modifiable table for a @dev using the callback
+ * defined in @cb. The EM new power values are then used for calculating
+ * the em_perf_state::cost for associated performance state.
+ *
+ * This function uses mutex to serialize writers, so it must not be called
+ * from non-sleeping context.
+ *
+ * Return 0 on success or a proper error in case of failure.
+ */
+int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
+			      void *priv)
+{
+	struct em_perf_table *runtime_table;
+	unsigned long power, freq;
+	struct em_perf_domain *pd;
+	int ret, i;
+
+	if (!cb || !cb->update_power)
+		return -EINVAL;
+
+	/*
+	 * The lock serializes update and unregister code paths. When the
+	 * EM has been unregistered in the meantime, we should capture that
+	 * when entering this critical section. It also makes sure that
+	 * two concurrent updates will be serialized.
+	 */
+	mutex_lock(&em_pd_mutex);
+
+	if (!dev || !dev->em_pd) {
+		ret = -EINVAL;
+		goto unlock_em;
+	}
+
+	pd = dev->em_pd;
+
+	runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL);
+	if (!runtime_table) {
+		ret = -ENOMEM;
+		goto unlock_em;
+	}
+
+	runtime_table->state = kcalloc(pd->nr_perf_states,
+				       sizeof(struct em_perf_state),
+				       GFP_KERNEL);
+	if (!runtime_table->state) {
+		ret = -ENOMEM;
+		goto free_runtime_table;
+	}
+
+	/* Populate runtime table with updated values using driver callback */
+	for (i = 0; i < pd->nr_perf_states; i++) {
+		freq = pd->default_table->state[i].frequency;
+		runtime_table->state[i].frequency = freq;
+
+		/*
+		 * Call driver callback to get a new power value for
+		 * a given frequency.
+		 */
+		ret = cb->update_power(dev, freq, &power, priv);
+		if (ret) {
+			dev_dbg(dev, "EM: runtime update error: %d\n", ret);
+			goto free_runtime_state_table;
+		}
+
+		runtime_table->state[i].power = power;
+	}
+
+	ret = em_compute_costs(dev, runtime_table->state, cb,
+			       pd->nr_perf_states, pd->flags);
+	if (ret)
+		goto free_runtime_state_table;
+
+	em_perf_runtime_table_set(dev, runtime_table);
+
+	mutex_unlock(&em_pd_mutex);
+	return 0;
+
+free_runtime_state_table:
+	kfree(runtime_table->state);
+free_runtime_table:
+	kfree(runtime_table);
+unlock_em:
+	mutex_unlock(&em_pd_mutex);
+
+	return -EINVAL;
+}
+EXPORT_SYMBOL_GPL(em_dev_update_perf_domain);
+
 static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
 				int nr_states, struct em_data_callback *cb,
 				unsigned long flags)
@@ -494,6 +589,8 @@ void em_dev_unregister_perf_domain(struct device *dev)
 	 * The mutex separates all register/unregister requests and protects
 	 * from potential clean-up/setup issues in the debugfs directories.
 	 * The debugfs directory name is the same as device's name.
+	 * The lock also protects the updater of the runtime modifiable
+	 * EM and this remover.
 	 */
 	mutex_lock(&em_pd_mutex);
 
@@ -501,9 +598,23 @@ void em_dev_unregister_perf_domain(struct device *dev)
 
 	runtime_table = pd->runtime_table;
 
+	/*
+	 * Safely destroy runtime modifiable EM. By using the call
+	 * synchronize_rcu() we make sure we don't progress till last user
+	 * finished the RCU section and our update got applied.
+	 */
 	rcu_assign_pointer(pd->runtime_table, NULL);
 	synchronize_rcu();
 
+	/*
+	 * After the sync no updates will be in-flight, so free the
+	 * memory allocated for runtime table (if there was such).
+	 */
+	if (runtime_table != pd->default_table) {
+		kfree(runtime_table->state);
+		kfree(runtime_table);
+	}
+
 	kfree(pd->default_table->state);
 	kfree(pd->default_table);
 	kfree(dev->em_pd);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 12/18] PM: EM: Use runtime modified EM for CPUs energy estimation in EAS
  2023-09-25  8:11 [PATCH v4 00/18] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (10 preceding siblings ...)
  2023-09-25  8:11 ` [PATCH v4 11/18] PM: EM: Add runtime update interface to modify EM power Lukasz Luba
@ 2023-09-25  8:11 ` Lukasz Luba
  2023-09-26 19:54   ` Rafael J. Wysocki
  2023-09-25  8:11 ` [PATCH v4 13/18] Documentation: EM: Update with runtime modification design Lukasz Luba
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 58+ messages in thread
From: Lukasz Luba @ 2023-09-25  8:11 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	mhiramat, qyousef, wvw

The new Energy Model (EM) supports runtime modification of the performance
state table to better model the power used by the SoC. Use this new
feature to improve energy estimation and therefore task placement in
Energy Aware Scheduler (EAS).

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 include/linux/energy_model.h | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 8f055ab356ed..41290ee2cdd0 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -261,15 +261,14 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 				unsigned long max_util, unsigned long sum_util,
 				unsigned long allowed_cpu_cap)
 {
+	struct em_perf_table *runtime_table;
 	unsigned long freq, scale_cpu;
-	struct em_perf_state *table, *ps;
+	struct em_perf_state *ps;
 	int cpu, i;
 
 	if (!sum_util)
 		return 0;
 
-	table = pd->default_table->state;
-
 	/*
 	 * In order to predict the performance state, map the utilization of
 	 * the most utilized CPU of the performance domain to a requested
@@ -280,7 +279,14 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	 */
 	cpu = cpumask_first(to_cpumask(pd->cpus));
 	scale_cpu = arch_scale_cpu_capacity(cpu);
-	ps = &table[pd->nr_perf_states - 1];
+
+	/*
+	 * No rcu_read_lock() since it's already called by task scheduler.
+	 * The runtime_table is always there for CPUs, so we don't check.
+	 */
+	runtime_table = rcu_dereference(pd->runtime_table);
+
+	ps = &runtime_table->state[pd->nr_perf_states - 1];
 
 	max_util = map_util_perf(max_util);
 	max_util = min(max_util, allowed_cpu_cap);
@@ -290,9 +296,9 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	 * Find the lowest performance state of the Energy Model above the
 	 * requested frequency.
 	 */
-	i = em_pd_get_efficient_state(table, pd->nr_perf_states, freq,
-				      pd->flags);
-	ps = &table[i];
+	i = em_pd_get_efficient_state(runtime_table->state, pd->nr_perf_states,
+				      freq, pd->flags);
+	ps = &runtime_table->state[i];
 
 	/*
 	 * The capacity of a CPU in the domain at the performance state (ps)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 13/18] Documentation: EM: Update with runtime modification design
  2023-09-25  8:11 [PATCH v4 00/18] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (11 preceding siblings ...)
  2023-09-25  8:11 ` [PATCH v4 12/18] PM: EM: Use runtime modified EM for CPUs energy estimation in EAS Lukasz Luba
@ 2023-09-25  8:11 ` Lukasz Luba
  2023-09-25  8:11 ` [PATCH v4 14/18] PM: EM: Add performance field to struct em_perf_state Lukasz Luba
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 58+ messages in thread
From: Lukasz Luba @ 2023-09-25  8:11 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	mhiramat, qyousef, wvw

Add a new section 'Design' which covers the information about Energy
Model. It contains the design decisions, describes models and how they
reflect the reality. Add description of the default EM. Change the
other section IDs. Add documentation bit for the new feature which
allows to modify the EM in runtime.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 Documentation/power/energy-model.rst | 144 +++++++++++++++++++++++++--
 1 file changed, 134 insertions(+), 10 deletions(-)

diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst
index ef341be2882b..3115411f9839 100644
--- a/Documentation/power/energy-model.rst
+++ b/Documentation/power/energy-model.rst
@@ -72,16 +72,70 @@ required to have the same micro-architecture. CPUs in different performance
 domains can have different micro-architectures.
 
 
-2. Core APIs
+2. Design
+-----------------
+
+2.1 Basic EM
+^^^^^^^^^^^^
+
+The basic EM is built around constant power information for each performance
+state, which is accessible in: 'dev->em_pd->default_table->state'. This model
+can be derived based on power measurements of the device e.g. CPU while
+running some benchmark. The benchmark might be integer heavy or floating point
+computation with a data set fitting into the CPU cache or registers. Bare in
+mind that this model might not cover all possible workloads running on CPUs.
+Thus, please run a few different benchmarks and verify with some real
+workloads your power model values. The power variation due to the workload
+instruction mix and data set is not modeled. The static power, which can
+change during runtime due to variation of SOC temperature, is not modeled
+either.
+
+2.2 Runtime modifiable EM
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To better reflect power variation due to static power (leakage) the EM
+supports runtime modifications of the power values. The mechanism relies on
+RCU to free the modifiable EM perf_state table memory. Its user, the task
+scheduler, also uses RCU to access this memory. The EM framework is
+responsible for allocating the new memory for the modifiable EM perf_state
+table. The old memory is freed automatically using RCU callback mechanism.
+This design decision is made based on task scheduler using that data and
+to prevent wrong usage of kernel modules if they would be responsible for the
+memory management.
+
+There are two structures with the performance state tables in the EM:
+a) dev->em_pd->default_table
+b) dev->em_pd->runtime_table
+They both point to the same memory location via:
+'em_perf_table::state' pointer, until the first modification of the values
+This should save memory on platforms which would never modify the EM. When
+the first modification is made the 'default_table' (a) contains the old
+EM which was created during the setup. The modified EM is available in the
+'runtime_table' (b).
+
+Only EAS uses the 'runtime_table' and benefits from the updates to the
+EM values. Other sub-systems (thermal, powercap) use the 'default_table' (a).
+
+The kernel code which want to modify the EM values is protected from concurrent
+access using a mutex. Therefore, the code must use sleeping context when
+they want to modify the EM.
+
+With the runtime modifiable EM we switch from a 'single and during the entire
+runtime static EM' (system property) design to a 'single EM which can be
+changed during runtime according e.g. to the workload' (system and workload
+property) design.
+
+
+3. Core APIs
 ------------
 
-2.1 Config options
+3.1 Config options
 ^^^^^^^^^^^^^^^^^^
 
 CONFIG_ENERGY_MODEL must be enabled to use the EM framework.
 
 
-2.2 Registration of performance domains
+3.2 Registration of performance domains
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Registration of 'advanced' EM
@@ -110,8 +164,8 @@ The last argument 'microwatts' is important to set with correct value. Kernel
 subsystems which use EM might rely on this flag to check if all EM devices use
 the same scale. If there are different scales, these subsystems might decide
 to return warning/error, stop working or panic.
-See Section 3. for an example of driver implementing this
-callback, or Section 2.4 for further documentation on this API
+See Section 4. for an example of driver implementing this
+callback, or Section 3.4 for further documentation on this API
 
 Registration of EM using DT
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -156,7 +210,7 @@ The EM which is registered using this method might not reflect correctly the
 physics of a real device, e.g. when static power (leakage) is important.
 
 
-2.3 Accessing performance domains
+3.3 Accessing performance domains
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 There are two API functions which provide the access to the energy model:
@@ -175,10 +229,31 @@ CPUfreq governor is in use in case of CPU device. Currently this calculation is
 not provided for other type of devices.
 
 More details about the above APIs can be found in ``<linux/energy_model.h>``
-or in Section 2.4
+or in Section 3.5
+
+
+3.4 Runtime modifications
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Drivers willing to modify the EM at runtime should use the following API::
+
 
+  int em_dev_update_perf_domain(struct device *dev,
+			struct em_data_callback *cb, void *priv);
 
-2.4 Description details of this API
+Drivers must provide a callback .update_power() returning power value for each
+performance state. The callback function provided by the driver is free
+to fetch data from any relevant location (DT, firmware, ...) or sensor.
+The .update_power() callback is called by the EM for each performance state to
+provide new power value. In the Section 4.2 there is an example driver
+which shows simple implementation of this mechanism. The callback can be
+declared with EM_UPDATE_CB() macro. The caller of that callback also passes
+a private void pointer back to the driver which tries to update EM.
+It is useful and helps to maintain the consistent context for all performance
+state calls for a given EM.
+
+
+3.5 Description details of this API
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 .. kernel-doc:: include/linux/energy_model.h
    :internal:
@@ -187,8 +262,11 @@ or in Section 2.4
    :export:
 
 
-3. Example driver
------------------
+4. Examples
+-----------
+
+4.1 Example driver with EM registration
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 The CPUFreq framework supports dedicated callback for registering
 the EM for a given CPU(s) 'policy' object: cpufreq_driver::register_em().
@@ -242,3 +320,49 @@ EM framework::
   39	static struct cpufreq_driver foo_cpufreq_driver = {
   40		.register_em = foo_cpufreq_register_em,
   41	};
+
+
+4.2 Example driver with EM modification
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This section provides a simple example of a thermal driver modifying the EM.
+The driver implements a foo_mod_power() function to be provided to the
+EM framework. The driver is woken up periodically to check the temperature
+and modify the EM data if needed::
+
+  -> drivers/thermal/foo_thermal.c
+
+  01	static int foo_mod_power(struct device *dev, unsigned long freq,
+  02			unsigned long *power, void *priv)
+  03	{
+  04		struct foo_context *ctx = priv;
+  05
+  06		/* Estimate power for the given frequency and temperature */
+  07		*power = foo_estimate_power(dev, freq, ctx->temperature);
+  08		if (*power >= EM_MAX_POWER);
+  09			return -EINVAL;
+  10
+  11		return 0;
+  12	}
+  13
+  14	/*
+  15	 * Function called periodically to check the temperature and
+  16	 * update the EM if needed
+  17	 */
+  18	static void foo_thermal_em_update(struct foo_context *ctx)
+  19	{
+  20		struct em_data_callback em_cb = EM_UPDATE_CB(mod_power);
+  21		struct cpufreq_policy *policy = ctx->policy;
+  22		struct device *cpu_dev;
+  23
+  24		cpu_dev = get_cpu_device(cpumask_first(policy->cpus));
+  25
+  26		ctx->temperature = foo_get_temp(cpu_dev, ctx);
+  27		if (ctx->temperature < FOO_EM_UPDATE_TEMP_THRESHOLD)
+  28			return;
+  29
+  30		/* Update EM for the CPUs' performance domain */
+  31		ret = em_dev_update_perf_domain(cpu_dev, &em_cb, ctx);
+  32		if (ret)
+  33			pr_warn("foo_thermal: EM update failed\n");
+  34	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 14/18] PM: EM: Add performance field to struct em_perf_state
  2023-09-25  8:11 [PATCH v4 00/18] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (12 preceding siblings ...)
  2023-09-25  8:11 ` [PATCH v4 13/18] Documentation: EM: Update with runtime modification design Lukasz Luba
@ 2023-09-25  8:11 ` Lukasz Luba
  2023-09-25  8:11 ` [PATCH v4 15/18] PM: EM: Adjust performance with runtime modification callback Lukasz Luba
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 58+ messages in thread
From: Lukasz Luba @ 2023-09-25  8:11 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	mhiramat, qyousef, wvw

The performance doesn't scale linearly with the frequency. Also, it may
be different in different workloads. Some CPUs are designed to be
particularly good at some applications e.g. images or video processing
and other CPUs in different. When those different types of CPUs are
combined in one SoC they should be properly modeled to get max of the HW
in Energy Aware Scheduler (EAS). The Energy Model (EM) provides the
power vs. performance curves to the EAS, but assumes the CPUs capacity
is fixed and scales linearly with the frequency. This patch allows to
adjust the curve on the 'performance' axis as well.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 include/linux/energy_model.h | 11 ++++++-----
 kernel/power/energy_model.c  | 27 +++++++++++++++++++++++++++
 2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 41290ee2cdd0..37fc8490709d 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -12,6 +12,7 @@
 
 /**
  * struct em_perf_state - Performance state of a performance domain
+ * @performance:	Non-linear CPU performance at a given frequency
  * @frequency:	The frequency in KHz, for consistency with CPUFreq
  * @power:	The power consumed at this level (by 1 CPU or by a registered
  *		device). It can be a total power: static and dynamic.
@@ -20,6 +21,7 @@
  * @flags:	see "em_perf_state flags" description below.
  */
 struct em_perf_state {
+	unsigned long performance;
 	unsigned long frequency;
 	unsigned long power;
 	unsigned long cost;
@@ -223,14 +225,14 @@ void em_dev_unregister_perf_domain(struct device *dev);
  */
 static inline int
 em_pd_get_efficient_state(struct em_perf_state *table, int nr_perf_states,
-			  unsigned long freq, unsigned long pd_flags)
+			  unsigned long max_util, unsigned long pd_flags)
 {
 	struct em_perf_state *ps;
 	int i;
 
 	for (i = 0; i < nr_perf_states; i++) {
 		ps = &table[i];
-		if (ps->frequency >= freq) {
+		if (ps->performance >= max_util) {
 			if (pd_flags & EM_PERF_DOMAIN_SKIP_INEFFICIENCIES &&
 			    ps->flags & EM_PERF_STATE_INEFFICIENT)
 				continue;
@@ -262,8 +264,8 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 				unsigned long allowed_cpu_cap)
 {
 	struct em_perf_table *runtime_table;
-	unsigned long freq, scale_cpu;
 	struct em_perf_state *ps;
+	unsigned long scale_cpu;
 	int cpu, i;
 
 	if (!sum_util)
@@ -290,14 +292,13 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 
 	max_util = map_util_perf(max_util);
 	max_util = min(max_util, allowed_cpu_cap);
-	freq = map_util_freq(max_util, ps->frequency, scale_cpu);
 
 	/*
 	 * Find the lowest performance state of the Energy Model above the
 	 * requested frequency.
 	 */
 	i = em_pd_get_efficient_state(runtime_table->state, pd->nr_perf_states,
-				      freq, pd->flags);
+				      max_util, pd->flags);
 	ps = &runtime_table->state[i];
 
 	/*
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 78e1495dc87e..c7ad42b42c46 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -46,6 +46,7 @@ static void em_debug_create_ps(struct em_perf_state *ps, struct dentry *pd)
 	debugfs_create_ulong("frequency", 0444, d, &ps->frequency);
 	debugfs_create_ulong("power", 0444, d, &ps->power);
 	debugfs_create_ulong("cost", 0444, d, &ps->cost);
+	debugfs_create_ulong("performance", 0444, d, &ps->performance);
 	debugfs_create_ulong("inefficient", 0444, d, &ps->flags);
 }
 
@@ -133,6 +134,30 @@ static void em_perf_runtime_table_set(struct device *dev,
 		call_rcu(&tmp->rcu, em_destroy_rt_table_rcu);
 }
 
+static void em_init_performance(struct device *dev, struct em_perf_domain *pd,
+				struct em_perf_state *table, int nr_states)
+{
+	u64 fmax, max_cap;
+	int i, cpu;
+
+	/* This is needed only for CPUs and EAS skip other devices */
+	if (!_is_cpu_device(dev))
+		return;
+
+	cpu = cpumask_first(em_span_cpus(pd));
+
+	/*
+	 * Calculate the performance value for each frequency with
+	 * linear relationship. The final CPU capacity might not be ready at
+	 * boot time, but the EM will be updated a bit later with correct one.
+	 */
+	fmax = (u64) table[nr_states - 1].frequency;
+	max_cap = (u64) arch_scale_cpu_capacity(cpu);
+	for (i = 0; i < nr_states; i++)
+		table[i].performance = div64_u64(max_cap * table[i].frequency,
+						 fmax);
+}
+
 static int em_compute_costs(struct device *dev, struct em_perf_state *table,
 			    struct em_data_callback *cb, int nr_states,
 			    unsigned long flags)
@@ -317,6 +342,8 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
 		table[i].frequency = prev_freq = freq;
 	}
 
+	em_init_performance(dev, pd, table, nr_states);
+
 	ret = em_compute_costs(dev, table, cb, nr_states, flags);
 	if (ret)
 		goto free_ps_table;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 15/18] PM: EM: Adjust performance with runtime modification callback
  2023-09-25  8:11 [PATCH v4 00/18] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (13 preceding siblings ...)
  2023-09-25  8:11 ` [PATCH v4 14/18] PM: EM: Add performance field to struct em_perf_state Lukasz Luba
@ 2023-09-25  8:11 ` Lukasz Luba
  2023-09-25  8:11 ` [PATCH v4 16/18] PM: EM: Support late CPUs booting and capacity adjustment Lukasz Luba
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 58+ messages in thread
From: Lukasz Luba @ 2023-09-25  8:11 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	mhiramat, qyousef, wvw

The performance value may be modified at runtime together with the
power value for each OPP. They both would form a different power
and performance profile in the EM. Modify the callback interface
to make this possible.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 include/linux/energy_model.h | 24 +++++++++++++++---------
 kernel/power/energy_model.c  |  7 ++++---
 2 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 37fc8490709d..65a8794d1565 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -174,24 +174,29 @@ struct em_data_callback {
 			unsigned long *cost);
 
 	/**
-	 * update_power() - Provide new power at the given performance state of
-	 *		a device
+	 * update_power_perf() - Provide new power and performance at the given
+	 *		performance state of a device
 	 * @dev		: Device for which we do this operation (can be a CPU)
 	 * @freq	: Frequency at the performance state in kHz
 	 * @power	: New power value at the performance state
 	 *		(modified)
+	 * @perf	: New performance value at the performance state
+	 *		(modified)
 	 * @priv	: Pointer to private data useful for tracking context
 	 *		during runtime modifications of EM.
 	 *
-	 * The update_power() is used by runtime modifiable EM. It aims to
-	 * provide updated power value for a given frequency, which is stored
-	 * in the performance state. The power value provided by this callback
-	 * should fit in the [0, EM_MAX_POWER] range.
+	 * The update_power_perf() is used by runtime modifiable EM. It aims to
+	 * provide updated power and performance value for a given frequency,
+	 * which is stored in the performance state. The power value provided
+	 * by this callback should fit in the [0, EM_MAX_POWER] range. The
+	 * performance value should be lower or equal to the CPU max capacity
+	 * (1024).
 	 *
 	 * Return 0 on success, or appropriate error value in case of failure.
 	 */
-	int (*update_power)(struct device *dev, unsigned long freq,
-			    unsigned long *power, void *priv);
+	int (*update_power_perf)(struct device *dev, unsigned long freq,
+				 unsigned long *power, unsigned long *perf,
+				 void *priv);
 };
 #define EM_SET_ACTIVE_POWER_CB(em_cb, cb) ((em_cb).active_power = cb)
 #define EM_ADV_DATA_CB(_active_power_cb, _cost_cb)	\
@@ -199,7 +204,8 @@ struct em_data_callback {
 	  .get_cost = _cost_cb }
 #define EM_DATA_CB(_active_power_cb)			\
 		EM_ADV_DATA_CB(_active_power_cb, NULL)
-#define EM_UPDATE_CB(_update_power_cb) { .update_power = &_update_power_cb }
+#define EM_UPDATE_CB(_update_pwr_perf_cb)		\
+	{ .update_power_perf = &_update_pwr_perf_cb }
 
 struct em_perf_domain *em_cpu_get(int cpu);
 struct em_perf_domain *em_pd_get(struct device *dev);
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index c7ad42b42c46..17a59a7717f7 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -217,11 +217,11 @@ int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
 			      void *priv)
 {
 	struct em_perf_table *runtime_table;
-	unsigned long power, freq;
+	unsigned long power, freq, perf;
 	struct em_perf_domain *pd;
 	int ret, i;
 
-	if (!cb || !cb->update_power)
+	if (!cb || !cb->update_power_perf)
 		return -EINVAL;
 
 	/*
@@ -262,13 +262,14 @@ int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
 		 * Call driver callback to get a new power value for
 		 * a given frequency.
 		 */
-		ret = cb->update_power(dev, freq, &power, priv);
+		ret = cb->update_power_perf(dev, freq, &power, &perf, priv);
 		if (ret) {
 			dev_dbg(dev, "EM: runtime update error: %d\n", ret);
 			goto free_runtime_state_table;
 		}
 
 		runtime_table->state[i].power = power;
+		runtime_table->state[i].performance = perf;
 	}
 
 	ret = em_compute_costs(dev, runtime_table->state, cb,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 16/18] PM: EM: Support late CPUs booting and capacity adjustment
  2023-09-25  8:11 [PATCH v4 00/18] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (14 preceding siblings ...)
  2023-09-25  8:11 ` [PATCH v4 15/18] PM: EM: Adjust performance with runtime modification callback Lukasz Luba
@ 2023-09-25  8:11 ` Lukasz Luba
  2023-09-25  8:11 ` [PATCH v4 17/18] PM: EM: Optimize em_cpu_energy() and remove division Lukasz Luba
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 58+ messages in thread
From: Lukasz Luba @ 2023-09-25  8:11 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	mhiramat, qyousef, wvw

The patch adds needed infrastructure to handle the late CPUs boot, which
might change the previous CPUs capacity values. With this changes the new
CPUs which try to register EM will trigger the needed re-calculations for
other CPUs EMs. Thanks to that the em_per_state::performance values will
be aligned with the CPU capacity information after all CPUs finish the
boot.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 kernel/power/energy_model.c | 108 ++++++++++++++++++++++++++++++++++++
 1 file changed, 108 insertions(+)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 17a59a7717f7..6bfd33c2e48c 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -25,6 +25,9 @@ static DEFINE_MUTEX(em_pd_mutex);
 
 static void em_cpufreq_update_efficiencies(struct device *dev,
 					   struct em_perf_state *table);
+static void em_check_capacity_update(void);
+static void em_update_workfn(struct work_struct *work);
+static DECLARE_DELAYED_WORK(em_update_work, em_update_workfn);
 
 static bool _is_cpu_device(struct device *dev)
 {
@@ -591,6 +594,10 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
 
 unlock:
 	mutex_unlock(&em_pd_mutex);
+
+	if (_is_cpu_device(dev))
+		em_check_capacity_update();
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(em_dev_register_perf_domain);
@@ -651,3 +658,104 @@ void em_dev_unregister_perf_domain(struct device *dev)
 	mutex_unlock(&em_pd_mutex);
 }
 EXPORT_SYMBOL_GPL(em_dev_unregister_perf_domain);
+
+/*
+ * Adjustment of CPU performance values after boot, when all CPUs capacites
+ * are correctly calculated.
+ */
+static int get_updated_perf(struct device *dev, unsigned long freq,
+				   unsigned long *power, unsigned long *perf,
+				   void *priv)
+{
+	struct em_perf_state *table = priv;
+	int i, cpu, nr_states;
+	u64 fmax, max_cap;
+
+	nr_states = dev->em_pd->nr_perf_states;
+
+	cpu = cpumask_first(em_span_cpus(dev->em_pd));
+
+	fmax = (u64) table[nr_states - 1].frequency;
+	max_cap = (u64) arch_scale_cpu_capacity(cpu);
+
+	for (i = 0; i < nr_states; i++) {
+		if (freq != table[i].frequency)
+			continue;
+
+		*power = table[i].power;
+		*perf = div64_u64(max_cap * freq, fmax);
+		break;
+	}
+
+	return 0;
+}
+
+static void em_check_capacity_update(void)
+{
+	struct em_data_callback em_cb = EM_UPDATE_CB(get_updated_perf);
+	struct em_perf_table *runtime_table;
+	struct em_perf_domain *em_pd;
+	cpumask_var_t cpu_done_mask;
+	unsigned long cpu_capacity;
+	struct em_perf_state *ps;
+	struct device *dev;
+	int cpu, ret;
+
+	if (!zalloc_cpumask_var(&cpu_done_mask, GFP_KERNEL)) {
+		pr_warn("EM: no free memory\n");
+		return;
+	}
+
+	/* Loop over all EMs and check if the CPU capacity has changed. */
+	for_each_possible_cpu(cpu) {
+		unsigned long em_max_performance;
+		struct cpufreq_policy *policy;
+
+		if (cpumask_test_cpu(cpu, cpu_done_mask))
+			continue;
+
+		policy = cpufreq_cpu_get(cpu);
+		if (!policy) {
+			pr_debug("EM: Accessing cpu%d policy failed\n", cpu);
+			schedule_delayed_work(&em_update_work,
+					      msecs_to_jiffies(1000));
+			break;
+		}
+
+		em_pd = em_cpu_get(cpu);
+		if (!em_pd || em_is_artificial(em_pd))
+			continue;
+
+		cpu_capacity = arch_scale_cpu_capacity(cpu);
+
+		rcu_read_lock();
+		runtime_table = rcu_dereference(em_pd->runtime_table);
+		ps = &runtime_table->state[em_pd->nr_perf_states - 1];
+		em_max_performance = ps->performance;
+		rcu_read_unlock();
+
+		/*
+		 * Check if the CPU capacity has been adjusted during boot
+		 * and trigger the update for new performance values.
+		 */
+		if (em_max_performance != cpu_capacity) {
+			dev = get_cpu_device(cpu);
+			ret = em_dev_update_perf_domain(dev, &em_cb,
+						em_pd->default_table->state);
+			if (ret)
+				dev_warn(dev, "EM: update failed %d\n", ret);
+			else
+				dev_info(dev, "EM: updated\n");
+		}
+
+		cpumask_or(cpu_done_mask, cpu_done_mask,
+			   em_span_cpus(em_pd));
+	}
+
+	free_cpumask_var(cpu_done_mask);
+}
+
+static void em_update_workfn(struct work_struct *work)
+{
+	em_check_capacity_update();
+}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 17/18] PM: EM: Optimize em_cpu_energy() and remove division
  2023-09-25  8:11 [PATCH v4 00/18] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (15 preceding siblings ...)
  2023-09-25  8:11 ` [PATCH v4 16/18] PM: EM: Support late CPUs booting and capacity adjustment Lukasz Luba
@ 2023-09-25  8:11 ` Lukasz Luba
  2023-09-25  8:11 ` [PATCH v4 18/18] Documentation: EM: Update information about performance field Lukasz Luba
  2023-09-28 21:56 ` [PATCH v4 00/18] Introduce runtime modifiable Energy Model Qais Yousef
  18 siblings, 0 replies; 58+ messages in thread
From: Lukasz Luba @ 2023-09-25  8:11 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	mhiramat, qyousef, wvw

The Energy Model (EM) can be modified at runtime which brings new
possibilities. The em_cpu_energy() is called by the Energy Aware Scheduler
(EAS) in it's hot path. The energy calculation uses power value for
a given performance state (ps) and the CPU busy time as percentage for that
given frequency, which effectively is:

pd_nrg = ps->power * busy_time_pct                        (1)

                 cpu_util
busy_time_pct = -----------------                         (2)
                 ps->performance

The 'ps->performance' is the CPU capacity (performance) at that given ps.
Thus, in a situation when the OS is not overloaded and we have EAS
working, the busy time is lower than 'ps->performance' that the CPU is
running at. Therefore, in longer scheduling period we can treat the power
value calculated above as the energy.

We can optimize the last arithmetic operation in em_cpu_energy() and
remove the division. This can be done because em_perf_state::cost, which
is a special coefficient, can now hold the pre-calculated value including
the 'ps->performance' information for a performance state (ps):

              ps->power
ps->cost = ---------------                                (3)
           ps->performance

In the past the 'ps->performance' had to be calculated at runtime every
time the em_cpu_energy() was called. Thus there was this formula involved:

                  ps->freq
ps->performance = ------------- * scale_cpu               (4)
                  cpu_max_freq

When we inject (4) into (2) than we can have this equation:

                 cpu_util * cpu_max_freq
busy_time_pct = ------------------------                  (5)
                 ps->freq * scale_cpu

Because the right 'scale_cpu' value wasn't ready during the boot time
and EM initialization, we had to perform the division by 'scale_cpu'
at runtime. There was not safe mechanism to update EM at runtime.
It has changed thanks to EM runtime modification feature.

It is possible to avoid the division by 'scale_cpu' at runtime, because
EM is updated whenever new max capacity CPU is set in the system or after
the boot has finished and proper CPU capacity is ready.

Use that feature and do the needed division during the calculation of the
coefficient 'ps->cost'. That enhanced 'ps->cost' value can be then just
multiplied simply by utilization:

pd_nrg = ps->cost * \Sum cpu_util                         (6)

to get the needed energy for whole Performance Domain (PD).

With this optimization, the em_cpu_energy() should run faster on the Big
CPU by 1.43x and on the Little CPU by 1.69x.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 include/linux/energy_model.h | 68 +++++-------------------------------
 kernel/power/energy_model.c  |  7 ++--
 2 files changed, 12 insertions(+), 63 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 65a8794d1565..31c4e3b8f7c3 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -112,27 +112,6 @@ struct em_perf_domain {
 #define EM_MAX_NUM_CPUS 16
 #endif
 
-/*
- * To avoid an overflow on 32bit machines while calculating the energy
- * use a different order in the operation. First divide by the 'cpu_scale'
- * which would reduce big value stored in the 'cost' field, then multiply by
- * the 'sum_util'. This would allow to handle existing platforms, which have
- * e.g. power ~1.3 Watt at max freq, so the 'cost' value > 1mln micro-Watts.
- * In such scenario, where there are 4 CPUs in the Perf. Domain the 'sum_util'
- * could be 4096, then multiplication: 'cost' * 'sum_util'  would overflow.
- * This reordering of operations has some limitations, we lose small
- * precision in the estimation (comparing to 64bit platform w/o reordering).
- *
- * We are safe on 64bit machine.
- */
-#ifdef CONFIG_64BIT
-#define em_estimate_energy(cost, sum_util, scale_cpu) \
-	(((cost) * (sum_util)) / (scale_cpu))
-#else
-#define em_estimate_energy(cost, sum_util, scale_cpu) \
-	(((cost) / (scale_cpu)) * (sum_util))
-#endif
-
 struct em_data_callback {
 	/**
 	 * active_power() - Provide power at the next performance state of
@@ -271,29 +250,16 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 {
 	struct em_perf_table *runtime_table;
 	struct em_perf_state *ps;
-	unsigned long scale_cpu;
-	int cpu, i;
+	int i;
 
 	if (!sum_util)
 		return 0;
 
-	/*
-	 * In order to predict the performance state, map the utilization of
-	 * the most utilized CPU of the performance domain to a requested
-	 * frequency, like schedutil. Take also into account that the real
-	 * frequency might be set lower (due to thermal capping). Thus, clamp
-	 * max utilization to the allowed CPU capacity before calculating
-	 * effective frequency.
-	 */
-	cpu = cpumask_first(to_cpumask(pd->cpus));
-	scale_cpu = arch_scale_cpu_capacity(cpu);
-
 	/*
 	 * No rcu_read_lock() since it's already called by task scheduler.
 	 * The runtime_table is always there for CPUs, so we don't check.
 	 */
 	runtime_table = rcu_dereference(pd->runtime_table);
-
 	ps = &runtime_table->state[pd->nr_perf_states - 1];
 
 	max_util = map_util_perf(max_util);
@@ -308,35 +274,21 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	ps = &runtime_table->state[i];
 
 	/*
-	 * The capacity of a CPU in the domain at the performance state (ps)
-	 * can be computed as:
-	 *
-	 *             ps->freq * scale_cpu
-	 *   ps->cap = --------------------                          (1)
-	 *                 cpu_max_freq
-	 *
-	 * So, ignoring the costs of idle states (which are not available in
-	 * the EM), the energy consumed by this CPU at that performance state
+	 * The energy consumed by the CPU at the given performance state (ps)
 	 * is estimated as:
 	 *
-	 *             ps->power * cpu_util
-	 *   cpu_nrg = --------------------                          (2)
-	 *                   ps->cap
+	 *                ps->power
+	 *   cpu_nrg = --------------- * cpu_util                    (1)
+	 *             ps->performance
 	 *
-	 * since 'cpu_util / ps->cap' represents its percentage of busy time.
+	 * The 'cpu_util / ps->performance' represents its percentage of
+	 * busy time. The idle cost is ignored (it's not available in the EM).
 	 *
 	 *   NOTE: Although the result of this computation actually is in
 	 *         units of power, it can be manipulated as an energy value
 	 *         over a scheduling period, since it is assumed to be
 	 *         constant during that interval.
 	 *
-	 * By injecting (1) in (2), 'cpu_nrg' can be re-expressed as a product
-	 * of two terms:
-	 *
-	 *             ps->power * cpu_max_freq   cpu_util
-	 *   cpu_nrg = ------------------------ * ---------          (3)
-	 *                    ps->freq            scale_cpu
-	 *
 	 * The first term is static, and is stored in the em_perf_state struct
 	 * as 'ps->cost'.
 	 *
@@ -345,11 +297,9 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	 * total energy of the domain (which is the simple sum of the energy of
 	 * all of its CPUs) can be factorized as:
 	 *
-	 *            ps->cost * \Sum cpu_util
-	 *   pd_nrg = ------------------------                       (4)
-	 *                  scale_cpu
+	 *   pd_nrg = ps->cost * \Sum cpu_util                       (2)
 	 */
-	return em_estimate_energy(ps->cost, sum_util, scale_cpu);
+	return ps->cost * sum_util;
 }
 
 /**
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 6bfd33c2e48c..cf9da7259f5e 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -166,11 +166,9 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
 			    unsigned long flags)
 {
 	unsigned long prev_cost = ULONG_MAX;
-	u64 fmax;
 	int i, ret;
 
 	/* Compute the cost of each performance state. */
-	fmax = (u64) table[nr_states - 1].frequency;
 	for (i = nr_states - 1; i >= 0; i--) {
 		unsigned long power_res, cost;
 
@@ -182,8 +180,9 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
 				return -EINVAL;
 			}
 		} else {
-			power_res = table[i].power;
-			cost = div64_u64(fmax * power_res, table[i].frequency);
+			/* increase resolution of 'cost' precision */
+			power_res = table[i].power * 10;
+			cost = power_res / table[i].performance;
 		}
 
 		table[i].cost = cost;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH v4 18/18] Documentation: EM: Update information about performance field
  2023-09-25  8:11 [PATCH v4 00/18] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (16 preceding siblings ...)
  2023-09-25  8:11 ` [PATCH v4 17/18] PM: EM: Optimize em_cpu_energy() and remove division Lukasz Luba
@ 2023-09-25  8:11 ` Lukasz Luba
  2023-09-28 21:56 ` [PATCH v4 00/18] Introduce runtime modifiable Energy Model Qais Yousef
  18 siblings, 0 replies; 58+ messages in thread
From: Lukasz Luba @ 2023-09-25  8:11 UTC (permalink / raw)
  To: linux-kernel, linux-pm, rafael
  Cc: lukasz.luba, dietmar.eggemann, rui.zhang, amit.kucheria,
	amit.kachhap, daniel.lezcano, viresh.kumar, len.brown, pavel,
	mhiramat, qyousef, wvw

The Energy Model supports the new information: performance for each
performance state. Update the needed documentation part accordingly.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 Documentation/power/energy-model.rst | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst
index 3115411f9839..da3619c0b98a 100644
--- a/Documentation/power/energy-model.rst
+++ b/Documentation/power/energy-model.rst
@@ -125,6 +125,11 @@ runtime static EM' (system property) design to a 'single EM which can be
 changed during runtime according e.g. to the workload' (system and workload
 property) design.
 
+It is possible also to modify the CPU performance values for each EM's
+performance state. Thus, the full power and performance profile (which
+is an exponential curve) can be changed according e.g. to the workload
+or system property.
+
 
 3. Core APIs
 ------------
@@ -326,18 +331,18 @@ EM framework::
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 This section provides a simple example of a thermal driver modifying the EM.
-The driver implements a foo_mod_power() function to be provided to the
+The driver implements a mod_power_perf() function to be provided to the
 EM framework. The driver is woken up periodically to check the temperature
 and modify the EM data if needed::
 
   -> drivers/thermal/foo_thermal.c
 
-  01	static int foo_mod_power(struct device *dev, unsigned long freq,
-  02			unsigned long *power, void *priv)
+  01	static int mod_power_perf(struct device *dev, unsigned long freq,
+  02			unsigned long *power, unsigned long *perf, void *priv)
   03	{
   04		struct foo_context *ctx = priv;
   05
-  06		/* Estimate power for the given frequency and temperature */
+  06		*perf = foo_estimate_performance(dev, freq);
   07		*power = foo_estimate_power(dev, freq, ctx->temperature);
   08		if (*power >= EM_MAX_POWER);
   09			return -EINVAL;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 10/18] PM: EM: Add RCU mechanism which safely cleans the old data
  2023-09-25  8:11 ` [PATCH v4 10/18] PM: EM: Add RCU mechanism which safely cleans the old data Lukasz Luba
@ 2023-09-26 10:28   ` kernel test robot
  2023-09-26 19:26   ` Rafael J. Wysocki
  1 sibling, 0 replies; 58+ messages in thread
From: kernel test robot @ 2023-09-26 10:28 UTC (permalink / raw)
  To: Lukasz Luba, linux-kernel, linux-pm, rafael
  Cc: oe-kbuild-all, lukasz.luba, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, mhiramat, qyousef, wvw

Hi Lukasz,

kernel test robot noticed the following build warnings:

[auto build test WARNING on rafael-pm/linux-next]
[also build test WARNING on rafael-pm/thermal linus/master v6.6-rc3 next-20230926]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Lukasz-Luba/PM-EM-Add-missing-newline-for-the-message-log/20230925-181243
base:   https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
patch link:    https://lore.kernel.org/r/20230925081139.1305766-11-lukasz.luba%40arm.com
patch subject: [PATCH v4 10/18] PM: EM: Add RCU mechanism which safely cleans the old data
config: i386-randconfig-063-20230926 (https://download.01.org/0day-ci/archive/20230926/202309261850.jrucSbN8-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230926/202309261850.jrucSbN8-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202309261850.jrucSbN8-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> kernel/power/energy_model.c:125:13: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct em_perf_table *tmp @@     got struct em_perf_table [noderef] __rcu *runtime_table @@
   kernel/power/energy_model.c:125:13: sparse:     expected struct em_perf_table *tmp
   kernel/power/energy_model.c:125:13: sparse:     got struct em_perf_table [noderef] __rcu *runtime_table

vim +125 kernel/power/energy_model.c

   118	
   119	static void em_perf_runtime_table_set(struct device *dev,
   120					      struct em_perf_table *runtime_table)
   121	{
   122		struct em_perf_domain *pd = dev->em_pd;
   123		struct em_perf_table *tmp;
   124	
 > 125		tmp = pd->runtime_table;
   126	
   127		rcu_assign_pointer(pd->runtime_table, runtime_table);
   128	
   129		em_cpufreq_update_efficiencies(dev, runtime_table->state);
   130	
   131		/* Don't free default table since it's used by other frameworks. */
   132		if (tmp != pd->default_table)
   133			call_rcu(&tmp->rcu, em_destroy_rt_table_rcu);
   134	}
   135	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 11/18] PM: EM: Add runtime update interface to modify EM power
  2023-09-25  8:11 ` [PATCH v4 11/18] PM: EM: Add runtime update interface to modify EM power Lukasz Luba
@ 2023-09-26 17:21   ` kernel test robot
  2023-09-26 19:48   ` Rafael J. Wysocki
  1 sibling, 0 replies; 58+ messages in thread
From: kernel test robot @ 2023-09-26 17:21 UTC (permalink / raw)
  To: Lukasz Luba, linux-kernel, linux-pm, rafael
  Cc: oe-kbuild-all, lukasz.luba, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, mhiramat, qyousef, wvw

Hi Lukasz,

kernel test robot noticed the following build warnings:

[auto build test WARNING on rafael-pm/linux-next]
[also build test WARNING on rafael-pm/thermal linus/master v6.6-rc3 next-20230926]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Lukasz-Luba/PM-EM-Add-missing-newline-for-the-message-log/20230925-181243
base:   https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
patch link:    https://lore.kernel.org/r/20230925081139.1305766-12-lukasz.luba%40arm.com
patch subject: [PATCH v4 11/18] PM: EM: Add runtime update interface to modify EM power
config: i386-randconfig-063-20230926 (https://download.01.org/0day-ci/archive/20230927/202309270106.c56Z2Tci-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230927/202309270106.c56Z2Tci-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202309270106.c56Z2Tci-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
   kernel/power/energy_model.c:125:13: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct em_perf_table *tmp @@     got struct em_perf_table [noderef] __rcu *runtime_table @@
   kernel/power/energy_model.c:125:13: sparse:     expected struct em_perf_table *tmp
   kernel/power/energy_model.c:125:13: sparse:     got struct em_perf_table [noderef] __rcu *runtime_table
>> kernel/power/energy_model.c:613:27: sparse: sparse: incompatible types in comparison expression (different address spaces):
>> kernel/power/energy_model.c:613:27: sparse:    struct em_perf_table [noderef] __rcu *
>> kernel/power/energy_model.c:613:27: sparse:    struct em_perf_table *

vim +613 kernel/power/energy_model.c

   569	
   570	/**
   571	 * em_dev_unregister_perf_domain() - Unregister Energy Model (EM) for a device
   572	 * @dev		: Device for which the EM is registered
   573	 *
   574	 * Unregister the EM for the specified @dev (but not a CPU device).
   575	 */
   576	void em_dev_unregister_perf_domain(struct device *dev)
   577	{
   578		struct em_perf_table __rcu *runtime_table;
   579		struct em_perf_domain *pd;
   580	
   581		if (IS_ERR_OR_NULL(dev) || !dev->em_pd)
   582			return;
   583	
   584		if (_is_cpu_device(dev))
   585			return;
   586	
   587		pd = dev->em_pd;
   588		/*
   589		 * The mutex separates all register/unregister requests and protects
   590		 * from potential clean-up/setup issues in the debugfs directories.
   591		 * The debugfs directory name is the same as device's name.
   592		 * The lock also protects the updater of the runtime modifiable
   593		 * EM and this remover.
   594		 */
   595		mutex_lock(&em_pd_mutex);
   596	
   597		em_debug_remove_pd(dev);
   598	
   599		runtime_table = pd->runtime_table;
   600	
   601		/*
   602		 * Safely destroy runtime modifiable EM. By using the call
   603		 * synchronize_rcu() we make sure we don't progress till last user
   604		 * finished the RCU section and our update got applied.
   605		 */
   606		rcu_assign_pointer(pd->runtime_table, NULL);
   607		synchronize_rcu();
   608	
   609		/*
   610		 * After the sync no updates will be in-flight, so free the
   611		 * memory allocated for runtime table (if there was such).
   612		 */
 > 613		if (runtime_table != pd->default_table) {

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 03/18] PM: EM: Find first CPU online while updating OPP efficiency
  2023-09-25  8:11 ` [PATCH v4 03/18] PM: EM: Find first CPU online while updating OPP efficiency Lukasz Luba
@ 2023-09-26 18:32   ` Rafael J. Wysocki
  2023-09-29  8:32     ` Lukasz Luba
  2023-10-23 17:06   ` Daniel Lezcano
  1 sibling, 1 reply; 58+ messages in thread
From: Rafael J. Wysocki @ 2023-09-26 18:32 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, rafael, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, mhiramat, qyousef, wvw

On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> The Energy Model might be updated at runtime and the energy efficiency
> for each OPP may change. Thus, there is a need to update also the
> cpufreq framework and make it aligned to the new values. In order to
> do that, use a first online CPU from the Performance Domain.
>
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  kernel/power/energy_model.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index 42486674b834..3dafdd7731c4 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -243,12 +243,19 @@ em_cpufreq_update_efficiencies(struct device *dev, struct em_perf_state *table)
>         struct em_perf_domain *pd = dev->em_pd;
>         struct cpufreq_policy *policy;
>         int found = 0;
> -       int i;
> +       int i, cpu;
>
>         if (!_is_cpu_device(dev) || !pd)
>                 return;
>
> -       policy = cpufreq_cpu_get(cpumask_first(em_span_cpus(pd)));
> +       /* Try to get a CPU which is online and in this PD */
> +       cpu = cpumask_first_and(em_span_cpus(pd), cpu_active_mask);

The comment talks about "online" and cpu_active_mask is used.  Isn't
it a bit inconsistent?

> +       if (cpu >= nr_cpu_ids) {
> +               dev_warn(dev, "EM: No online CPU for CPUFreq policy\n");
> +               return;
> +       }
> +
> +       policy = cpufreq_cpu_get(cpu);
>         if (!policy) {
>                 dev_warn(dev, "EM: Access to CPUFreq policy failed\n");
>                 return;
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 05/18] PM: EM: Refactor a new function em_compute_costs()
  2023-09-25  8:11 ` [PATCH v4 05/18] PM: EM: Refactor a new function em_compute_costs() Lukasz Luba
@ 2023-09-26 18:39   ` Rafael J. Wysocki
  2023-09-29  8:38     ` Lukasz Luba
  0 siblings, 1 reply; 58+ messages in thread
From: Rafael J. Wysocki @ 2023-09-26 18:39 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, rafael, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, mhiramat, qyousef, wvw

On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> Refactor a dedicated function which will be easier to maintain and re-use
> in future. The upcoming changes for the modifiable EM perf_state table
> will use it (instead of duplicating the code).
>
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>

If I'm not mistaken, this patch by itself is not going to change the
observable functionality in any way and it would be good to say that
in the changelog.

This also applies to some other patches in this series.

> ---
>  kernel/power/energy_model.c | 72 ++++++++++++++++++++++---------------
>  1 file changed, 43 insertions(+), 29 deletions(-)
>
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index 3dafdd7731c4..7ea882401833 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -103,14 +103,52 @@ static void em_debug_create_pd(struct device *dev) {}
>  static void em_debug_remove_pd(struct device *dev) {}
>  #endif
>
> +static int em_compute_costs(struct device *dev, struct em_perf_state *table,
> +                           struct em_data_callback *cb, int nr_states,
> +                           unsigned long flags)
> +{
> +       unsigned long prev_cost = ULONG_MAX;
> +       u64 fmax;
> +       int i, ret;
> +
> +       /* Compute the cost of each performance state. */
> +       fmax = (u64) table[nr_states - 1].frequency;
> +       for (i = nr_states - 1; i >= 0; i--) {
> +               unsigned long power_res, cost;
> +
> +               if (flags & EM_PERF_DOMAIN_ARTIFICIAL) {
> +                       ret = cb->get_cost(dev, table[i].frequency, &cost);
> +                       if (ret || !cost || cost > EM_MAX_POWER) {
> +                               dev_err(dev, "EM: invalid cost %lu %d\n",
> +                                       cost, ret);
> +                               return -EINVAL;
> +                       }
> +               } else {
> +                       power_res = table[i].power;
> +                       cost = div64_u64(fmax * power_res, table[i].frequency);
> +               }
> +
> +               table[i].cost = cost;
> +
> +               if (table[i].cost >= prev_cost) {
> +                       table[i].flags = EM_PERF_STATE_INEFFICIENT;
> +                       dev_dbg(dev, "EM: OPP:%lu is inefficient\n",
> +                               table[i].frequency);
> +               } else {
> +                       prev_cost = table[i].cost;
> +               }
> +       }
> +
> +       return 0;
> +}
> +
>  static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
>                                 int nr_states, struct em_data_callback *cb,
>                                 unsigned long flags)
>  {
> -       unsigned long power, freq, prev_freq = 0, prev_cost = ULONG_MAX;
> +       unsigned long power, freq, prev_freq = 0;
>         struct em_perf_state *table;
>         int i, ret;
> -       u64 fmax;
>
>         table = kcalloc(nr_states, sizeof(*table), GFP_KERNEL);
>         if (!table)
> @@ -154,33 +192,9 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
>                 table[i].frequency = prev_freq = freq;
>         }
>
> -       /* Compute the cost of each performance state. */
> -       fmax = (u64) table[nr_states - 1].frequency;
> -       for (i = nr_states - 1; i >= 0; i--) {
> -               unsigned long power_res, cost;
> -
> -               if (flags & EM_PERF_DOMAIN_ARTIFICIAL) {
> -                       ret = cb->get_cost(dev, table[i].frequency, &cost);
> -                       if (ret || !cost || cost > EM_MAX_POWER) {
> -                               dev_err(dev, "EM: invalid cost %lu %d\n",
> -                                       cost, ret);
> -                               goto free_ps_table;
> -                       }
> -               } else {
> -                       power_res = table[i].power;
> -                       cost = div64_u64(fmax * power_res, table[i].frequency);
> -               }
> -
> -               table[i].cost = cost;
> -
> -               if (table[i].cost >= prev_cost) {
> -                       table[i].flags = EM_PERF_STATE_INEFFICIENT;
> -                       dev_dbg(dev, "EM: OPP:%lu is inefficient\n",
> -                               table[i].frequency);
> -               } else {
> -                       prev_cost = table[i].cost;
> -               }
> -       }
> +       ret = em_compute_costs(dev, table, cb, nr_states, flags);
> +       if (ret)
> +               goto free_ps_table;
>
>         pd->table = table;
>         pd->nr_perf_states = nr_states;
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 06/18] PM: EM: Check if the get_cost() callback is present in em_compute_costs()
  2023-09-25  8:11 ` [PATCH v4 06/18] PM: EM: Check if the get_cost() callback is present in em_compute_costs() Lukasz Luba
@ 2023-09-26 18:46   ` Rafael J. Wysocki
  2023-09-29  8:42     ` Lukasz Luba
  2023-10-23 18:23   ` Daniel Lezcano
  1 sibling, 1 reply; 58+ messages in thread
From: Rafael J. Wysocki @ 2023-09-26 18:46 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, rafael, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, mhiramat, qyousef, wvw

On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> The em_compute_cost() is going to be re-used in runtime modified EM
> code path. Thus, make sure that this common code is safe and won't
> try to use the NULL pointer. The former em_compute_cost() didn't have to
> care about runtime modification code path. The upcoming changes introduce
> such option, but with different callback. Those two paths which use
> get_cost() (during first EM registration) or update_power() (during
> runtime modification) need to be safely handled in em_compute_costs().

I would just say something like this:

"Subsequent changes will introduce a case in which cb->get_cost may
not be set in em_compute_costs(), so add a check to ensure that it is
not NULL before attempting to dereference it."

The rest of the changelog is just redundant IMO.

>
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  kernel/power/energy_model.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index 7ea882401833..35e07933b34a 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -116,7 +116,7 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
>         for (i = nr_states - 1; i >= 0; i--) {
>                 unsigned long power_res, cost;
>
> -               if (flags & EM_PERF_DOMAIN_ARTIFICIAL) {
> +               if (flags & EM_PERF_DOMAIN_ARTIFICIAL && cb->get_cost) {
>                         ret = cb->get_cost(dev, table[i].frequency, &cost);
>                         if (ret || !cost || cost > EM_MAX_POWER) {
>                                 dev_err(dev, "EM: invalid cost %lu %d\n",
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 07/18] PM: EM: Refactor struct em_perf_domain and add default_table
  2023-09-25  8:11 ` [PATCH v4 07/18] PM: EM: Refactor struct em_perf_domain and add default_table Lukasz Luba
@ 2023-09-26 18:52   ` Rafael J. Wysocki
  2023-09-29  8:45     ` Lukasz Luba
  0 siblings, 1 reply; 58+ messages in thread
From: Rafael J. Wysocki @ 2023-09-26 18:52 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, rafael, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, mhiramat, qyousef, wvw

On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> The Energy Model is going to support runtime modifications. Refactor old
> implementation which accessed struct em_perf_state and introduce
> em_perf_domain::default_table to clean up the design. This new field
> will help to better distinguish 2 performance state tables.
>
> Update all drivers or frameworks which used the old field:
> em_perf_domain::table and now should use em_perf_domain::default_table.
>
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  drivers/powercap/dtpm_cpu.c       | 27 +++++++++++++++++++--------
>  drivers/powercap/dtpm_devfreq.c   | 23 ++++++++++++++++-------
>  drivers/thermal/cpufreq_cooling.c | 24 ++++++++++++++++--------
>  drivers/thermal/devfreq_cooling.c | 23 +++++++++++++++++------
>  include/linux/energy_model.h      | 24 ++++++++++++++++++------
>  kernel/power/energy_model.c       | 26 ++++++++++++++++++++++----
>  6 files changed, 108 insertions(+), 39 deletions(-)
>
> diff --git a/drivers/powercap/dtpm_cpu.c b/drivers/powercap/dtpm_cpu.c
> index 2ff7717530bf..743a0ac8ecdf 100644
> --- a/drivers/powercap/dtpm_cpu.c
> +++ b/drivers/powercap/dtpm_cpu.c
> @@ -43,6 +43,7 @@ static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit)
>  {
>         struct dtpm_cpu *dtpm_cpu = to_dtpm_cpu(dtpm);
>         struct em_perf_domain *pd = em_cpu_get(dtpm_cpu->cpu);
> +       struct em_perf_state *table;
>         struct cpumask cpus;
>         unsigned long freq;
>         u64 power;
> @@ -51,19 +52,21 @@ static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit)
>         cpumask_and(&cpus, cpu_online_mask, to_cpumask(pd->cpus));
>         nr_cpus = cpumask_weight(&cpus);
>
> +       table = pd->default_table->state;
> +
>         for (i = 0; i < pd->nr_perf_states; i++) {
>
> -               power = pd->table[i].power * nr_cpus;
> +               power = table[i].power * nr_cpus;
>
>                 if (power > power_limit)
>                         break;
>         }
>
> -       freq = pd->table[i - 1].frequency;
> +       freq = table[i - 1].frequency;
>
>         freq_qos_update_request(&dtpm_cpu->qos_req, freq);
>
> -       power_limit = pd->table[i - 1].power * nr_cpus;
> +       power_limit = table[i - 1].power * nr_cpus;
>
>         return power_limit;
>  }
> @@ -88,12 +91,14 @@ static u64 scale_pd_power_uw(struct cpumask *pd_mask, u64 power)
>  static u64 get_pd_power_uw(struct dtpm *dtpm)
>  {
>         struct dtpm_cpu *dtpm_cpu = to_dtpm_cpu(dtpm);
> +       struct em_perf_state *table;
>         struct em_perf_domain *pd;
>         struct cpumask *pd_mask;
>         unsigned long freq;
>         int i;
>
>         pd = em_cpu_get(dtpm_cpu->cpu);
> +       table = pd->default_table->state;
>
>         pd_mask = em_span_cpus(pd);
>
> @@ -101,10 +106,10 @@ static u64 get_pd_power_uw(struct dtpm *dtpm)
>
>         for (i = 0; i < pd->nr_perf_states; i++) {
>
> -               if (pd->table[i].frequency < freq)
> +               if (table[i].frequency < freq)
>                         continue;
>
> -               return scale_pd_power_uw(pd_mask, pd->table[i].power *
> +               return scale_pd_power_uw(pd_mask, table[i].power *
>                                          MICROWATT_PER_MILLIWATT);
>         }
>
> @@ -115,17 +120,20 @@ static int update_pd_power_uw(struct dtpm *dtpm)
>  {
>         struct dtpm_cpu *dtpm_cpu = to_dtpm_cpu(dtpm);
>         struct em_perf_domain *em = em_cpu_get(dtpm_cpu->cpu);
> +       struct em_perf_state *table;
>         struct cpumask cpus;
>         int nr_cpus;
>
>         cpumask_and(&cpus, cpu_online_mask, to_cpumask(em->cpus));
>         nr_cpus = cpumask_weight(&cpus);
>
> -       dtpm->power_min = em->table[0].power;
> +       table = em->default_table->state;
> +
> +       dtpm->power_min = table[0].power;
>         dtpm->power_min *= MICROWATT_PER_MILLIWATT;
>         dtpm->power_min *= nr_cpus;
>
> -       dtpm->power_max = em->table[em->nr_perf_states - 1].power;
> +       dtpm->power_max = table[em->nr_perf_states - 1].power;
>         dtpm->power_max *= MICROWATT_PER_MILLIWATT;
>         dtpm->power_max *= nr_cpus;
>
> @@ -182,6 +190,7 @@ static int __dtpm_cpu_setup(int cpu, struct dtpm *parent)
>  {
>         struct dtpm_cpu *dtpm_cpu;
>         struct cpufreq_policy *policy;
> +       struct em_perf_state *table;
>         struct em_perf_domain *pd;
>         char name[CPUFREQ_NAME_LEN];
>         int ret = -ENOMEM;
> @@ -198,6 +207,8 @@ static int __dtpm_cpu_setup(int cpu, struct dtpm *parent)
>         if (!pd || em_is_artificial(pd))
>                 return -EINVAL;
>
> +       table = pd->default_table->state;
> +
>         dtpm_cpu = kzalloc(sizeof(*dtpm_cpu), GFP_KERNEL);
>         if (!dtpm_cpu)
>                 return -ENOMEM;
> @@ -216,7 +227,7 @@ static int __dtpm_cpu_setup(int cpu, struct dtpm *parent)
>
>         ret = freq_qos_add_request(&policy->constraints,
>                                    &dtpm_cpu->qos_req, FREQ_QOS_MAX,
> -                                  pd->table[pd->nr_perf_states - 1].frequency);
> +                                  table[pd->nr_perf_states - 1].frequency);
>         if (ret)
>                 goto out_dtpm_unregister;
>
> diff --git a/drivers/powercap/dtpm_devfreq.c b/drivers/powercap/dtpm_devfreq.c
> index 91276761a31d..6ef0f2b4a683 100644
> --- a/drivers/powercap/dtpm_devfreq.c
> +++ b/drivers/powercap/dtpm_devfreq.c
> @@ -37,11 +37,14 @@ static int update_pd_power_uw(struct dtpm *dtpm)
>         struct devfreq *devfreq = dtpm_devfreq->devfreq;
>         struct device *dev = devfreq->dev.parent;
>         struct em_perf_domain *pd = em_pd_get(dev);
> +       struct em_perf_state *table;
>
> -       dtpm->power_min = pd->table[0].power;
> +       table = pd->default_table->state;
> +
> +       dtpm->power_min = table[0].power;
>         dtpm->power_min *= MICROWATT_PER_MILLIWATT;
>
> -       dtpm->power_max = pd->table[pd->nr_perf_states - 1].power;
> +       dtpm->power_max = table[pd->nr_perf_states - 1].power;
>         dtpm->power_max *= MICROWATT_PER_MILLIWATT;
>
>         return 0;
> @@ -53,22 +56,25 @@ static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit)
>         struct devfreq *devfreq = dtpm_devfreq->devfreq;
>         struct device *dev = devfreq->dev.parent;
>         struct em_perf_domain *pd = em_pd_get(dev);
> +       struct em_perf_state *table;
>         unsigned long freq;
>         u64 power;
>         int i;
>
> +       table = pd->default_table->state;
> +
>         for (i = 0; i < pd->nr_perf_states; i++) {
>
> -               power = pd->table[i].power * MICROWATT_PER_MILLIWATT;
> +               power = table[i].power * MICROWATT_PER_MILLIWATT;
>                 if (power > power_limit)
>                         break;
>         }
>
> -       freq = pd->table[i - 1].frequency;
> +       freq = table[i - 1].frequency;
>
>         dev_pm_qos_update_request(&dtpm_devfreq->qos_req, freq);
>
> -       power_limit = pd->table[i - 1].power * MICROWATT_PER_MILLIWATT;
> +       power_limit = table[i - 1].power * MICROWATT_PER_MILLIWATT;
>
>         return power_limit;
>  }
> @@ -94,6 +100,7 @@ static u64 get_pd_power_uw(struct dtpm *dtpm)
>         struct device *dev = devfreq->dev.parent;
>         struct em_perf_domain *pd = em_pd_get(dev);
>         struct devfreq_dev_status status;
> +       struct em_perf_state *table;
>         unsigned long freq;
>         u64 power;
>         int i;
> @@ -102,15 +109,17 @@ static u64 get_pd_power_uw(struct dtpm *dtpm)
>         status = devfreq->last_status;
>         mutex_unlock(&devfreq->lock);
>
> +       table = pd->default_table->state;
> +
>         freq = DIV_ROUND_UP(status.current_frequency, HZ_PER_KHZ);
>         _normalize_load(&status);
>
>         for (i = 0; i < pd->nr_perf_states; i++) {
>
> -               if (pd->table[i].frequency < freq)
> +               if (table[i].frequency < freq)
>                         continue;
>
> -               power = pd->table[i].power * MICROWATT_PER_MILLIWATT;
> +               power = table[i].power * MICROWATT_PER_MILLIWATT;
>                 power *= status.busy_time;
>                 power >>= 10;
>
> diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
> index e2cc7bd30862..d468aca241e2 100644
> --- a/drivers/thermal/cpufreq_cooling.c
> +++ b/drivers/thermal/cpufreq_cooling.c
> @@ -91,10 +91,11 @@ struct cpufreq_cooling_device {
>  static unsigned long get_level(struct cpufreq_cooling_device *cpufreq_cdev,
>                                unsigned int freq)
>  {
> +       struct em_perf_state *table = cpufreq_cdev->em->default_table->state;
>         int i;
>
>         for (i = cpufreq_cdev->max_level - 1; i >= 0; i--) {
> -               if (freq > cpufreq_cdev->em->table[i].frequency)
> +               if (freq > table[i].frequency)
>                         break;
>         }
>
> @@ -104,15 +105,16 @@ static unsigned long get_level(struct cpufreq_cooling_device *cpufreq_cdev,
>  static u32 cpu_freq_to_power(struct cpufreq_cooling_device *cpufreq_cdev,
>                              u32 freq)
>  {
> +       struct em_perf_state *table = cpufreq_cdev->em->default_table->state;
>         unsigned long power_mw;
>         int i;
>
>         for (i = cpufreq_cdev->max_level - 1; i >= 0; i--) {
> -               if (freq > cpufreq_cdev->em->table[i].frequency)
> +               if (freq > table[i].frequency)
>                         break;
>         }
>
> -       power_mw = cpufreq_cdev->em->table[i + 1].power;
> +       power_mw = table[i + 1].power;
>         power_mw /= MICROWATT_PER_MILLIWATT;
>
>         return power_mw;
> @@ -121,18 +123,19 @@ static u32 cpu_freq_to_power(struct cpufreq_cooling_device *cpufreq_cdev,
>  static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
>                              u32 power)
>  {
> +       struct em_perf_state *table = cpufreq_cdev->em->default_table->state;
>         unsigned long em_power_mw;
>         int i;
>
>         for (i = cpufreq_cdev->max_level; i > 0; i--) {
>                 /* Convert EM power to milli-Watts to make safe comparison */
> -               em_power_mw = cpufreq_cdev->em->table[i].power;
> +               em_power_mw = table[i].power;
>                 em_power_mw /= MICROWATT_PER_MILLIWATT;
>                 if (power >= em_power_mw)
>                         break;
>         }
>
> -       return cpufreq_cdev->em->table[i].frequency;
> +       return table[i].frequency;
>  }
>
>  /**
> @@ -262,8 +265,9 @@ static int cpufreq_get_requested_power(struct thermal_cooling_device *cdev,
>  static int cpufreq_state2power(struct thermal_cooling_device *cdev,
>                                unsigned long state, u32 *power)
>  {
> -       unsigned int freq, num_cpus, idx;
>         struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
> +       unsigned int freq, num_cpus, idx;
> +       struct em_perf_state *table;
>
>         /* Request state should be less than max_level */
>         if (state > cpufreq_cdev->max_level)
> @@ -271,8 +275,9 @@ static int cpufreq_state2power(struct thermal_cooling_device *cdev,
>
>         num_cpus = cpumask_weight(cpufreq_cdev->policy->cpus);
>
> +       table = cpufreq_cdev->em->default_table->state;
>         idx = cpufreq_cdev->max_level - state;
> -       freq = cpufreq_cdev->em->table[idx].frequency;
> +       freq = table[idx].frequency;
>         *power = cpu_freq_to_power(cpufreq_cdev, freq) * num_cpus;
>
>         return 0;
> @@ -378,8 +383,11 @@ static unsigned int get_state_freq(struct cpufreq_cooling_device *cpufreq_cdev,
>  #ifdef CONFIG_THERMAL_GOV_POWER_ALLOCATOR
>         /* Use the Energy Model table if available */
>         if (cpufreq_cdev->em) {
> +               struct em_perf_state *table;
> +
> +               table = cpufreq_cdev->em->default_table->state;
>                 idx = cpufreq_cdev->max_level - state;
> -               return cpufreq_cdev->em->table[idx].frequency;
> +               return table[idx].frequency;
>         }
>  #endif
>
> diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
> index 262e62ab6cf2..4207ef850582 100644
> --- a/drivers/thermal/devfreq_cooling.c
> +++ b/drivers/thermal/devfreq_cooling.c
> @@ -87,6 +87,7 @@ static int devfreq_cooling_set_cur_state(struct thermal_cooling_device *cdev,
>         struct devfreq_cooling_device *dfc = cdev->devdata;
>         struct devfreq *df = dfc->devfreq;
>         struct device *dev = df->dev.parent;
> +       struct em_perf_state *table;
>         unsigned long freq;
>         int perf_idx;
>
> @@ -99,8 +100,9 @@ static int devfreq_cooling_set_cur_state(struct thermal_cooling_device *cdev,
>                 return -EINVAL;
>
>         if (dfc->em_pd) {
> +               table = dfc->em_pd->default_table->state;
>                 perf_idx = dfc->max_state - state;
> -               freq = dfc->em_pd->table[perf_idx].frequency * 1000;
> +               freq = table[perf_idx].frequency * 1000;
>         } else {
>                 freq = dfc->freq_table[state];
>         }
> @@ -123,10 +125,11 @@ static int devfreq_cooling_set_cur_state(struct thermal_cooling_device *cdev,
>   */
>  static int get_perf_idx(struct em_perf_domain *em_pd, unsigned long freq)
>  {
> +       struct em_perf_state *table = em_pd->default_table->state;
>         int i;
>
>         for (i = 0; i < em_pd->nr_perf_states; i++) {
> -               if (em_pd->table[i].frequency == freq)
> +               if (table[i].frequency == freq)
>                         return i;
>         }
>
> @@ -181,6 +184,7 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
>         struct devfreq_cooling_device *dfc = cdev->devdata;
>         struct devfreq *df = dfc->devfreq;
>         struct devfreq_dev_status status;
> +       struct em_perf_state *table;
>         unsigned long state;
>         unsigned long freq;
>         unsigned long voltage;
> @@ -192,6 +196,8 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
>
>         freq = status.current_frequency;
>
> +       table = dfc->em_pd->default_table->state;
> +
>         if (dfc->power_ops && dfc->power_ops->get_real_power) {
>                 voltage = get_voltage(df, freq);
>                 if (voltage == 0) {
> @@ -204,7 +210,7 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
>                         state = dfc->capped_state;
>
>                         /* Convert EM power into milli-Watts first */
> -                       dfc->res_util = dfc->em_pd->table[state].power;
> +                       dfc->res_util = table[state].power;
>                         dfc->res_util /= MICROWATT_PER_MILLIWATT;
>
>                         dfc->res_util *= SCALE_ERROR_MITIGATION;
> @@ -225,7 +231,7 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
>                 _normalize_load(&status);
>
>                 /* Convert EM power into milli-Watts first */
> -               *power = dfc->em_pd->table[perf_idx].power;
> +               *power = table[perf_idx].power;
>                 *power /= MICROWATT_PER_MILLIWATT;
>                 /* Scale power for utilization */
>                 *power *= status.busy_time;
> @@ -245,13 +251,15 @@ static int devfreq_cooling_state2power(struct thermal_cooling_device *cdev,
>                                        unsigned long state, u32 *power)
>  {
>         struct devfreq_cooling_device *dfc = cdev->devdata;
> +       struct em_perf_state *table;
>         int perf_idx;
>
>         if (state > dfc->max_state)
>                 return -EINVAL;
>
> +       table = dfc->em_pd->default_table->state;
>         perf_idx = dfc->max_state - state;
> -       *power = dfc->em_pd->table[perf_idx].power;
> +       *power = table[perf_idx].power;
>         *power /= MICROWATT_PER_MILLIWATT;
>
>         return 0;
> @@ -264,6 +272,7 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
>         struct devfreq *df = dfc->devfreq;
>         struct devfreq_dev_status status;
>         unsigned long freq, em_power_mw;
> +       struct em_perf_state *table;
>         s32 est_power;
>         int i;
>
> @@ -273,6 +282,8 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
>
>         freq = status.current_frequency;
>
> +       table = dfc->em_pd->default_table->state;
> +
>         if (dfc->power_ops && dfc->power_ops->get_real_power) {
>                 /* Scale for resource utilization */
>                 est_power = power * dfc->res_util;
> @@ -290,7 +301,7 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
>          */
>         for (i = dfc->max_state; i > 0; i--) {
>                 /* Convert EM power to milli-Watts to make safe comparison */
> -               em_power_mw = dfc->em_pd->table[i].power;
> +               em_power_mw = table[i].power;
>                 em_power_mw /= MICROWATT_PER_MILLIWATT;
>                 if (est_power >= em_power_mw)
>                         break;
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index 8069f526c9d8..d236e08e80dc 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -36,9 +36,19 @@ struct em_perf_state {
>   */
>  #define EM_PERF_STATE_INEFFICIENT BIT(0)
>
> +/**
> + * struct em_perf_table - Performance states table
> + * @state:     List of performance states, in ascending order
> + * @rcu:       RCU used for safe access and destruction
> + */
> +struct em_perf_table {
> +       struct em_perf_state *state;
> +       struct rcu_head rcu;

There is no explanation of the role of this rcu_head in the changelog
or anywhere in this patch.

It is also not used as of this patch AFAICS.

It would be better to add it when it actually gets used IMV.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 08/18] PM: EM: Add update_power() callback for runtime modifications
  2023-09-25  8:11 ` [PATCH v4 08/18] PM: EM: Add update_power() callback for runtime modifications Lukasz Luba
@ 2023-09-26 18:59   ` Rafael J. Wysocki
  2023-09-29  9:00     ` Lukasz Luba
  0 siblings, 1 reply; 58+ messages in thread
From: Rafael J. Wysocki @ 2023-09-26 18:59 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, rafael, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, mhiramat, qyousef, wvw

On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> The Energy Model (EM) is going to support runtime modifications. This
> new callback would be used in the upcoming EM changes. The drivers
> or frameworks which want to modify the EM have to implement the
> update_power() callback.
>
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  include/linux/energy_model.h | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
>
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index d236e08e80dc..546dee90f716 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -168,6 +168,26 @@ struct em_data_callback {
>          */
>         int (*get_cost)(struct device *dev, unsigned long freq,
>                         unsigned long *cost);
> +
> +       /**
> +        * update_power() - Provide new power at the given performance state of
> +        *              a device

The meaning of the above is unclear to me.

> +        * @dev         : Device for which we do this operation (can be a CPU)

It is unclear what "we" means in this context.  Maybe say "Target
device (can be a CPU)"?

> +        * @freq        : Frequency at the performance state in kHz

What performance state does this refer to?  And the frequency of what?

> +        * @power       : New power value at the performance state
> +        *              (modified)

Similarly unclear as the above.

> +        * @priv        : Pointer to private data useful for tracking context
> +        *              during runtime modifications of EM.

Who's going to set this pointer and use this data?

> +        *
> +        * The update_power() is used by runtime modifiable EM. It aims to

I would drop "The" from the above.

> +        * provide updated power value for a given frequency, which is stored
> +        * in the performance state.

A given frequency of what and the performance state of what does this refer to?

> + The power value provided by this callback
> +        * should fit in the [0, EM_MAX_POWER] range.
> +        *
> +        * Return 0 on success, or appropriate error value in case of failure.
> +        */
> +       int (*update_power)(struct device *dev, unsigned long freq,
> +                           unsigned long *power, void *priv);
>  };
>  #define EM_SET_ACTIVE_POWER_CB(em_cb, cb) ((em_cb).active_power = cb)
>  #define EM_ADV_DATA_CB(_active_power_cb, _cost_cb)     \
> @@ -175,6 +195,7 @@ struct em_data_callback {
>           .get_cost = _cost_cb }
>  #define EM_DATA_CB(_active_power_cb)                   \
>                 EM_ADV_DATA_CB(_active_power_cb, NULL)
> +#define EM_UPDATE_CB(_update_power_cb) { .update_power = &_update_power_cb }
>
>  struct em_perf_domain *em_cpu_get(int cpu);
>  struct em_perf_domain *em_pd_get(struct device *dev);
> @@ -331,6 +352,7 @@ struct em_data_callback {};
>  #define EM_ADV_DATA_CB(_active_power_cb, _cost_cb) { }
>  #define EM_DATA_CB(_active_power_cb) { }
>  #define EM_SET_ACTIVE_POWER_CB(em_cb, cb) do { } while (0)
> +#define EM_UPDATE_CB(_update_cb) { }
>
>  static inline
>  int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
> --

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 09/18] PM: EM: Introduce runtime modifiable table
  2023-09-25  8:11 ` [PATCH v4 09/18] PM: EM: Introduce runtime modifiable table Lukasz Luba
@ 2023-09-26 19:12   ` Rafael J. Wysocki
  2023-09-29  9:16     ` Lukasz Luba
  0 siblings, 1 reply; 58+ messages in thread
From: Rafael J. Wysocki @ 2023-09-26 19:12 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, rafael, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, mhiramat, qyousef, wvw

On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> The new runtime table would be populated with a new power data to better
> reflect the actual power. The power can vary over time e.g. due to the
> SoC temperature change. Higher temperature can increase power values.
> For longer running scenarios, such as game or camera, when also other
> devices are used (e.g. GPU, ISP) the CPU power can change. The new
> EM framework is able to addresses this issue and change the data
> at runtime safely.
>
> The runtime modifiable EM data is used by the Energy Aware Scheduler (EAS)
> for the task placement. All the other users (thermal, etc.) are still
> using the default (basic) EM. This fact drove the design of this feature.
>
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  include/linux/energy_model.h |  4 +++-
>  kernel/power/energy_model.c  | 12 +++++++++++-
>  2 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index 546dee90f716..740e7c25cfff 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -39,7 +39,7 @@ struct em_perf_state {
>  /**
>   * struct em_perf_table - Performance states table
>   * @state:     List of performance states, in ascending order
> - * @rcu:       RCU used for safe access and destruction
> + * @rcu:       RCU used only for runtime modifiable table

This still doesn't appear to be used anywhere, so why change it here?

>   */
>  struct em_perf_table {
>         struct em_perf_state *state;
> @@ -49,6 +49,7 @@ struct em_perf_table {
>  /**
>   * struct em_perf_domain - Performance domain
>   * @default_table:     Pointer to the default em_perf_table
> + * @runtime_table:     Pointer to the runtime modifiable em_perf_table

"Pointer to em_perf_table that can be dynamically updated"

>   * @nr_perf_states:    Number of performance states
>   * @flags:             See "em_perf_domain flags"
>   * @cpus:              Cpumask covering the CPUs of the domain. It's here
> @@ -64,6 +65,7 @@ struct em_perf_table {
>   */
>  struct em_perf_domain {
>         struct em_perf_table *default_table;
> +       struct em_perf_table __rcu *runtime_table;
>         int nr_perf_states;
>         unsigned long flags;
>         unsigned long cpus[];
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index 797141638b29..5b40db38b745 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -251,6 +251,9 @@ static int em_create_pd(struct device *dev, int nr_states,
>                 return ret;
>         }
>
> +       /* Initialize runtime table as default table. */

Redundant comment.

> +       rcu_assign_pointer(pd->runtime_table, default_table);
> +
>         if (_is_cpu_device(dev))
>                 for_each_cpu(cpu, cpus) {
>                         cpu_dev = get_cpu_device(cpu);
> @@ -448,6 +451,7 @@ EXPORT_SYMBOL_GPL(em_dev_register_perf_domain);
>   */
>  void em_dev_unregister_perf_domain(struct device *dev)
>  {
> +       struct em_perf_table __rcu *runtime_table;
>         struct em_perf_domain *pd;
>
>         if (IS_ERR_OR_NULL(dev) || !dev->em_pd)
> @@ -457,18 +461,24 @@ void em_dev_unregister_perf_domain(struct device *dev)
>                 return;
>
>         pd = dev->em_pd;
> -

Unrelated change.

>         /*
>          * The mutex separates all register/unregister requests and protects
>          * from potential clean-up/setup issues in the debugfs directories.
>          * The debugfs directory name is the same as device's name.
>          */
>         mutex_lock(&em_pd_mutex);
> +

Same here.

>         em_debug_remove_pd(dev);
>
> +       runtime_table = pd->runtime_table;
> +
> +       rcu_assign_pointer(pd->runtime_table, NULL);
> +       synchronize_rcu();

Is it really a good idea to call this under a mutex?

> +
>         kfree(pd->default_table->state);
>         kfree(pd->default_table);
>         kfree(dev->em_pd);
> +

Unrelated change.

>         dev->em_pd = NULL;
>         mutex_unlock(&em_pd_mutex);
>  }
> --

So this really adds a pointer to a table that can be dynamically
updated to struct em_perf_domain without any users so far.  It is not
used anywhere as of this patch AFAICS, which is not what the changelog
is saying.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 10/18] PM: EM: Add RCU mechanism which safely cleans the old data
  2023-09-25  8:11 ` [PATCH v4 10/18] PM: EM: Add RCU mechanism which safely cleans the old data Lukasz Luba
  2023-09-26 10:28   ` kernel test robot
@ 2023-09-26 19:26   ` Rafael J. Wysocki
  2023-09-29  9:36     ` Lukasz Luba
  1 sibling, 1 reply; 58+ messages in thread
From: Rafael J. Wysocki @ 2023-09-26 19:26 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, rafael, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, mhiramat, qyousef, wvw

On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> The EM is going to support runtime modifications of the power data.
> Introduce RCU safe mechanism to clean up the old allocated EM data.

"RCU-based" probably and "to clean up the old EM data safely".

> It also adds a mutex for the EM structure to serialize the modifiers.

This part doesn't match the code changes in the patch.

> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  kernel/power/energy_model.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
>
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index 5b40db38b745..2345837bfd2c 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -23,6 +23,9 @@
>   */
>  static DEFINE_MUTEX(em_pd_mutex);
>
> +static void em_cpufreq_update_efficiencies(struct device *dev,
> +                                          struct em_perf_state *table);
> +
>  static bool _is_cpu_device(struct device *dev)
>  {
>         return (dev->bus == &cpu_subsys);
> @@ -104,6 +107,32 @@ static void em_debug_create_pd(struct device *dev) {}
>  static void em_debug_remove_pd(struct device *dev) {}
>  #endif
>
> +static void em_destroy_rt_table_rcu(struct rcu_head *rp)

Adding static functions without callers will obviously cause the
compiler to complain, which is one of the reasons to avoid doing that.
The other is that it is hard to say how these functions are going to
be used without reviewing multiple patches simultaneously, which is a
pain as far as I'm concerned.

> +{
> +       struct em_perf_table *runtime_table;
> +
> +       runtime_table = container_of(rp, struct em_perf_table, rcu);
> +       kfree(runtime_table->state);
> +       kfree(runtime_table);

If runtime_table and its state were allocated in one go, it would be
possible to free them in one go either.

For some reason, you don't seem to want to do that, but why?

> +}
> +
> +static void em_perf_runtime_table_set(struct device *dev,
> +                                     struct em_perf_table *runtime_table)
> +{
> +       struct em_perf_domain *pd = dev->em_pd;
> +       struct em_perf_table *tmp;
> +
> +       tmp = pd->runtime_table;
> +
> +       rcu_assign_pointer(pd->runtime_table, runtime_table);
> +
> +       em_cpufreq_update_efficiencies(dev, runtime_table->state);
> +
> +       /* Don't free default table since it's used by other frameworks. */

Apparently, some frameworks are only going to use the default table
while the runtime-updatable table will be used somewhere else at the
same time.

I'm not really sure if this is a good idea.

> +       if (tmp != pd->default_table)
> +               call_rcu(&tmp->rcu, em_destroy_rt_table_rcu);
> +}
> +
>  static int em_compute_costs(struct device *dev, struct em_perf_state *table,
>                             struct em_data_callback *cb, int nr_states,
>                             unsigned long flags)
> --

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 11/18] PM: EM: Add runtime update interface to modify EM power
  2023-09-25  8:11 ` [PATCH v4 11/18] PM: EM: Add runtime update interface to modify EM power Lukasz Luba
  2023-09-26 17:21   ` kernel test robot
@ 2023-09-26 19:48   ` Rafael J. Wysocki
  2023-09-29 10:00     ` Lukasz Luba
  1 sibling, 1 reply; 58+ messages in thread
From: Rafael J. Wysocki @ 2023-09-26 19:48 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, rafael, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, mhiramat, qyousef, wvw

On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:

First off, I would merge this with the previous patch, as the changes
would be much clearer then IMO.

> Add an interface which allows to modify EM power data at runtime.
> The new power information is populated by the provided callback, which
> is called for each performance state.

But it all starts with copying the frequencies from the default table.

> The CPU frequencies' efficiency is
> re-calculated since that might be affected as well. The old EM memory
> is going to be freed later using RCU mechanism.

Not all of it, but the old runtime table that is not going to be used any more.

> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  include/linux/energy_model.h |   8 +++
>  kernel/power/energy_model.c  | 111 +++++++++++++++++++++++++++++++++++
>  2 files changed, 119 insertions(+)
>
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index 740e7c25cfff..8f055ab356ed 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -201,6 +201,8 @@ struct em_data_callback {
>
>  struct em_perf_domain *em_cpu_get(int cpu);
>  struct em_perf_domain *em_pd_get(struct device *dev);
> +int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
> +                             void *priv);
>  int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
>                                 struct em_data_callback *cb, cpumask_t *span,
>                                 bool microwatts);
> @@ -384,6 +386,12 @@ static inline int em_pd_nr_perf_states(struct em_perf_domain *pd)
>  {
>         return 0;
>  }
> +static inline
> +int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
> +                             void *priv)
> +{
> +       return -EINVAL;
> +}
>  #endif
>
>  #endif
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index 2345837bfd2c..78e1495dc87e 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -172,6 +172,101 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
>         return 0;
>  }
>
> +/**
> + * em_dev_update_perf_domain() - Update runtime EM table for a device
> + * @dev                : Device for which the EM is to be updated
> + * @cb         : Callback function providing the power data for the EM
> + * @priv       : Pointer to private data useful for passing context
> + *             which might be required while calling @cb

It is still unclear to me who is going to use this priv pointer and how.

> + *
> + * Update EM runtime modifiable table for a @dev using the callback
> + * defined in @cb. The EM new power values are then used for calculating
> + * the em_perf_state::cost for associated performance state.

It actually allocates a new runtime table and populates it from
scratch, using the frequencies from the default table and the
callback.

> + *
> + * This function uses mutex to serialize writers, so it must not be called

"a mutex"

> + * from non-sleeping context.
> + *
> + * Return 0 on success or a proper error in case of failure.
> + */
> +int em_dev_update_perf_domain(struct device *dev, struct em_data_callback *cb,
> +                             void *priv)
> +{
> +       struct em_perf_table *runtime_table;
> +       unsigned long power, freq;
> +       struct em_perf_domain *pd;
> +       int ret, i;
> +
> +       if (!cb || !cb->update_power)
> +               return -EINVAL;
> +
> +       /*
> +        * The lock serializes update and unregister code paths. When the
> +        * EM has been unregistered in the meantime, we should capture that
> +        * when entering this critical section. It also makes sure that
> +        * two concurrent updates will be serialized.
> +        */
> +       mutex_lock(&em_pd_mutex);
> +
> +       if (!dev || !dev->em_pd) {

Checking dev against NULL under the mutex is pointless (either it is
NULL or it isn't, so check it earlier).

> +               ret = -EINVAL;
> +               goto unlock_em;
> +       }
> +
> +       pd = dev->em_pd;

And I would check pd against NULL here.

> +
> +       runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL);
> +       if (!runtime_table) {
> +               ret = -ENOMEM;
> +               goto unlock_em;
> +       }
> +
> +       runtime_table->state = kcalloc(pd->nr_perf_states,
> +                                      sizeof(struct em_perf_state),
> +                                      GFP_KERNEL);
> +       if (!runtime_table->state) {
> +               ret = -ENOMEM;
> +               goto free_runtime_table;
> +       }

The above allocations can be merged into one and allocating memory
under the mutex is questionable.

> +
> +       /* Populate runtime table with updated values using driver callback */
> +       for (i = 0; i < pd->nr_perf_states; i++) {
> +               freq = pd->default_table->state[i].frequency;
> +               runtime_table->state[i].frequency = freq;
> +
> +               /*
> +                * Call driver callback to get a new power value for
> +                * a given frequency.
> +                */
> +               ret = cb->update_power(dev, freq, &power, priv);
> +               if (ret) {
> +                       dev_dbg(dev, "EM: runtime update error: %d\n", ret);
> +                       goto free_runtime_state_table;
> +               }
> +
> +               runtime_table->state[i].power = power;
> +       }
> +
> +       ret = em_compute_costs(dev, runtime_table->state, cb,
> +                              pd->nr_perf_states, pd->flags);
> +       if (ret)
> +               goto free_runtime_state_table;
> +
> +       em_perf_runtime_table_set(dev, runtime_table);
> +
> +       mutex_unlock(&em_pd_mutex);
> +       return 0;
> +
> +free_runtime_state_table:
> +       kfree(runtime_table->state);
> +free_runtime_table:
> +       kfree(runtime_table);
> +unlock_em:
> +       mutex_unlock(&em_pd_mutex);
> +
> +       return -EINVAL;
> +}
> +EXPORT_SYMBOL_GPL(em_dev_update_perf_domain);
> +
>  static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
>                                 int nr_states, struct em_data_callback *cb,
>                                 unsigned long flags)
> @@ -494,6 +589,8 @@ void em_dev_unregister_perf_domain(struct device *dev)
>          * The mutex separates all register/unregister requests and protects
>          * from potential clean-up/setup issues in the debugfs directories.
>          * The debugfs directory name is the same as device's name.
> +        * The lock also protects the updater of the runtime modifiable
> +        * EM and this remover.
>          */
>         mutex_lock(&em_pd_mutex);
>
> @@ -501,9 +598,23 @@ void em_dev_unregister_perf_domain(struct device *dev)
>
>         runtime_table = pd->runtime_table;
>
> +       /*
> +        * Safely destroy runtime modifiable EM. By using the call
> +        * synchronize_rcu() we make sure we don't progress till last user
> +        * finished the RCU section and our update got applied.
> +        */
>         rcu_assign_pointer(pd->runtime_table, NULL);
>         synchronize_rcu();
>
> +       /*
> +        * After the sync no updates will be in-flight, so free the
> +        * memory allocated for runtime table (if there was such).
> +        */
> +       if (runtime_table != pd->default_table) {
> +               kfree(runtime_table->state);
> +               kfree(runtime_table);
> +       }

Can't this race with the RCU callback freeing the runtime table?

> +
>         kfree(pd->default_table->state);
>         kfree(pd->default_table);
>         kfree(dev->em_pd);
> --

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 12/18] PM: EM: Use runtime modified EM for CPUs energy estimation in EAS
  2023-09-25  8:11 ` [PATCH v4 12/18] PM: EM: Use runtime modified EM for CPUs energy estimation in EAS Lukasz Luba
@ 2023-09-26 19:54   ` Rafael J. Wysocki
  2023-09-29 10:10     ` Lukasz Luba
  0 siblings, 1 reply; 58+ messages in thread
From: Rafael J. Wysocki @ 2023-09-26 19:54 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, rafael, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, mhiramat, qyousef, wvw

On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> The new Energy Model (EM) supports runtime modification of the performance
> state table to better model the power used by the SoC. Use this new
> feature to improve energy estimation and therefore task placement in
> Energy Aware Scheduler (EAS).
>
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  include/linux/energy_model.h | 20 +++++++++++++-------
>  1 file changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index 8f055ab356ed..41290ee2cdd0 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -261,15 +261,14 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>                                 unsigned long max_util, unsigned long sum_util,
>                                 unsigned long allowed_cpu_cap)
>  {
> +       struct em_perf_table *runtime_table;

You may as well call it just "table".  The "runtime_" prefix doesn't
add much value here IMO.

>         unsigned long freq, scale_cpu;
> -       struct em_perf_state *table, *ps;
> +       struct em_perf_state *ps;
>         int cpu, i;
>
>         if (!sum_util)
>                 return 0;
>
> -       table = pd->default_table->state;
> -
>         /*
>          * In order to predict the performance state, map the utilization of
>          * the most utilized CPU of the performance domain to a requested
> @@ -280,7 +279,14 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>          */
>         cpu = cpumask_first(to_cpumask(pd->cpus));
>         scale_cpu = arch_scale_cpu_capacity(cpu);
> -       ps = &table[pd->nr_perf_states - 1];
> +
> +       /*
> +        * No rcu_read_lock() since it's already called by task scheduler.
> +        * The runtime_table is always there for CPUs, so we don't check.
> +        */
> +       runtime_table = rcu_dereference(pd->runtime_table);
> +
> +       ps = &runtime_table->state[pd->nr_perf_states - 1];
>
>         max_util = map_util_perf(max_util);
>         max_util = min(max_util, allowed_cpu_cap);
> @@ -290,9 +296,9 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>          * Find the lowest performance state of the Energy Model above the
>          * requested frequency.
>          */
> -       i = em_pd_get_efficient_state(table, pd->nr_perf_states, freq,
> -                                     pd->flags);
> -       ps = &table[i];
> +       i = em_pd_get_efficient_state(runtime_table->state, pd->nr_perf_states,
> +                                     freq, pd->flags);
> +       ps = &runtime_table->state[i];
>
>         /*
>          * The capacity of a CPU in the domain at the performance state (ps)
> --

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/18] Introduce runtime modifiable Energy Model
  2023-09-25  8:11 [PATCH v4 00/18] Introduce runtime modifiable Energy Model Lukasz Luba
                   ` (17 preceding siblings ...)
  2023-09-25  8:11 ` [PATCH v4 18/18] Documentation: EM: Update information about performance field Lukasz Luba
@ 2023-09-28 21:56 ` Qais Yousef
  2023-10-03  8:06   ` Lukasz Luba
  18 siblings, 1 reply; 58+ messages in thread
From: Qais Yousef @ 2023-09-28 21:56 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, rafael, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, mhiramat, wvw

Hi Lukasz

On 09/25/23 09:11, Lukasz Luba wrote:
> Hi all,
> 
> This patch set adds a new feature which allows to modify Energy Model (EM)
> power values at runtime. It will allow to better reflect power model of
> a recent SoCs and silicon. Different characteristics of the power usage
> can be leveraged and thus better decisions made during task placement in EAS.
> 
> It's part of feature set know as Dynamic Energy Model. It has been presented
> and discussed recently at OSPM2023 [3]. This patch set implements the 1st
> improvement for the EM.

Thanks for the series! I'm on CC this time :-) Unfortunately I have a planned
holiday starting tomorrow, so won't get a chance to do proper review till I'm
back which would be few weeks from now.

I just want to iterate that this is a real problem being seen in practice where
it's hard to provide a single average EM for all workloads now. So we certainly
need something like this.

Hopefully I'll get a chance to help with review when I'm back from holidays.


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 03/18] PM: EM: Find first CPU online while updating OPP efficiency
  2023-09-26 18:32   ` Rafael J. Wysocki
@ 2023-09-29  8:32     ` Lukasz Luba
  0 siblings, 0 replies; 58+ messages in thread
From: Lukasz Luba @ 2023-09-29  8:32 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, mhiramat, qyousef, wvw

Hi Rafael,

Thank you having reviewing those patches!

On 9/26/23 19:32, Rafael J. Wysocki wrote:
> On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> The Energy Model might be updated at runtime and the energy efficiency
>> for each OPP may change. Thus, there is a need to update also the
>> cpufreq framework and make it aligned to the new values. In order to
>> do that, use a first online CPU from the Performance Domain.
>>
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
>>   kernel/power/energy_model.c | 11 +++++++++--
>>   1 file changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>> index 42486674b834..3dafdd7731c4 100644
>> --- a/kernel/power/energy_model.c
>> +++ b/kernel/power/energy_model.c
>> @@ -243,12 +243,19 @@ em_cpufreq_update_efficiencies(struct device *dev, struct em_perf_state *table)
>>          struct em_perf_domain *pd = dev->em_pd;
>>          struct cpufreq_policy *policy;
>>          int found = 0;
>> -       int i;
>> +       int i, cpu;
>>
>>          if (!_is_cpu_device(dev) || !pd)
>>                  return;
>>
>> -       policy = cpufreq_cpu_get(cpumask_first(em_span_cpus(pd)));
>> +       /* Try to get a CPU which is online and in this PD */
>> +       cpu = cpumask_first_and(em_span_cpus(pd), cpu_active_mask);
> 
> The comment talks about "online" and cpu_active_mask is used.  Isn't
> it a bit inconsistent?

good point, I'll change the word to 'active'

> 
>> +       if (cpu >= nr_cpu_ids) {
>> +               dev_warn(dev, "EM: No online CPU for CPUFreq policy\n");
>> +               return;
>> +       }
>> +
>> +       policy = cpufreq_cpu_get(cpu);
>>          if (!policy) {
>>                  dev_warn(dev, "EM: Access to CPUFreq policy failed\n");
>>                  return;
>> --
>> 2.25.1
>>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 05/18] PM: EM: Refactor a new function em_compute_costs()
  2023-09-26 18:39   ` Rafael J. Wysocki
@ 2023-09-29  8:38     ` Lukasz Luba
  0 siblings, 0 replies; 58+ messages in thread
From: Lukasz Luba @ 2023-09-29  8:38 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, mhiramat, qyousef, wvw



On 9/26/23 19:39, Rafael J. Wysocki wrote:
> On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> Refactor a dedicated function which will be easier to maintain and re-use
>> in future. The upcoming changes for the modifiable EM perf_state table
>> will use it (instead of duplicating the code).
>>
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> 
> If I'm not mistaken, this patch by itself is not going to change the
> observable functionality in any way and it would be good to say that
> in the changelog.
> 
> This also applies to some other patches in this series.
> 

Correct, no functional changes. I will add that information to the
patch header in next version.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 06/18] PM: EM: Check if the get_cost() callback is present in em_compute_costs()
  2023-09-26 18:46   ` Rafael J. Wysocki
@ 2023-09-29  8:42     ` Lukasz Luba
  0 siblings, 0 replies; 58+ messages in thread
From: Lukasz Luba @ 2023-09-29  8:42 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, mhiramat, qyousef, wvw



On 9/26/23 19:46, Rafael J. Wysocki wrote:
> On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> The em_compute_cost() is going to be re-used in runtime modified EM
>> code path. Thus, make sure that this common code is safe and won't
>> try to use the NULL pointer. The former em_compute_cost() didn't have to
>> care about runtime modification code path. The upcoming changes introduce
>> such option, but with different callback. Those two paths which use
>> get_cost() (during first EM registration) or update_power() (during
>> runtime modification) need to be safely handled in em_compute_costs().
> 
> I would just say something like this:
> 
> "Subsequent changes will introduce a case in which cb->get_cost may
> not be set in em_compute_costs(), so add a check to ensure that it is
> not NULL before attempting to dereference it."
> 
> The rest of the changelog is just redundant IMO.
> 

Make sense, thanks, I'll change that.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 07/18] PM: EM: Refactor struct em_perf_domain and add default_table
  2023-09-26 18:52   ` Rafael J. Wysocki
@ 2023-09-29  8:45     ` Lukasz Luba
  0 siblings, 0 replies; 58+ messages in thread
From: Lukasz Luba @ 2023-09-29  8:45 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, mhiramat, qyousef, wvw



On 9/26/23 19:52, Rafael J. Wysocki wrote:
> On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> The Energy Model is going to support runtime modifications. Refactor old
>> implementation which accessed struct em_perf_state and introduce
>> em_perf_domain::default_table to clean up the design. This new field
>> will help to better distinguish 2 performance state tables.
>>
>> Update all drivers or frameworks which used the old field:
>> em_perf_domain::table and now should use em_perf_domain::default_table.
>>
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>

[snip]

>> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
>> index 8069f526c9d8..d236e08e80dc 100644
>> --- a/include/linux/energy_model.h
>> +++ b/include/linux/energy_model.h
>> @@ -36,9 +36,19 @@ struct em_perf_state {
>>    */
>>   #define EM_PERF_STATE_INEFFICIENT BIT(0)
>>
>> +/**
>> + * struct em_perf_table - Performance states table
>> + * @state:     List of performance states, in ascending order
>> + * @rcu:       RCU used for safe access and destruction
>> + */
>> +struct em_perf_table {
>> +       struct em_perf_state *state;
>> +       struct rcu_head rcu;
> 
> There is no explanation of the role of this rcu_head in the changelog
> or anywhere in this patch.
> 
> It is also not used as of this patch AFAICS.
> 
> It would be better to add it when it actually gets used IMV.

That's a good point. I will introduce this 'rcu' field later.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 08/18] PM: EM: Add update_power() callback for runtime modifications
  2023-09-26 18:59   ` Rafael J. Wysocki
@ 2023-09-29  9:00     ` Lukasz Luba
  2023-09-29 12:18       ` Rafael J. Wysocki
  0 siblings, 1 reply; 58+ messages in thread
From: Lukasz Luba @ 2023-09-29  9:00 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, mhiramat, qyousef, wvw



On 9/26/23 19:59, Rafael J. Wysocki wrote:
> On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> The Energy Model (EM) is going to support runtime modifications. This
>> new callback would be used in the upcoming EM changes. The drivers
>> or frameworks which want to modify the EM have to implement the
>> update_power() callback.
>>
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
>>   include/linux/energy_model.h | 22 ++++++++++++++++++++++
>>   1 file changed, 22 insertions(+)
>>
>> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
>> index d236e08e80dc..546dee90f716 100644
>> --- a/include/linux/energy_model.h
>> +++ b/include/linux/energy_model.h
>> @@ -168,6 +168,26 @@ struct em_data_callback {
>>           */
>>          int (*get_cost)(struct device *dev, unsigned long freq,
>>                          unsigned long *cost);
>> +
>> +       /**
>> +        * update_power() - Provide new power at the given performance state of
>> +        *              a device
> 
> The meaning of the above is unclear to me.

I can try to rephrase this a bit:
' Provide a new power value for the device at the given frequency. This
allows to reflect changed power profile in runtime.'

> 
>> +        * @dev         : Device for which we do this operation (can be a CPU)
> 
> It is unclear what "we" means in this context.  Maybe say "Target
> device (can be a CPU)"?

That sounds better indeed.

> 
>> +        * @freq        : Frequency at the performance state in kHz
> 
> What performance state does this refer to?  And the frequency of what?

We just call the entry in EM the 'performance state' (so frequency and
power). I will rephrase this as well:
'Frequency of the @dev expressed in kHz'

> 
>> +        * @power       : New power value at the performance state
>> +        *              (modified)
> 
> Similarly unclear as the above.

OK, it can be re-written as:
'Power value of the @dev at the given @freq (modified)'

> 
>> +        * @priv        : Pointer to private data useful for tracking context
>> +        *              during runtime modifications of EM.
> 
> Who's going to set this pointer and use this data?

The driver or kernel module, which is aware about thermal events. Those
events might be coming from FW to kernel, but we need to maintain
the same 'context' for all OPPs when we start the EM update.

This 'priv' field is used for passing the 'context' back to the
caller, so the caller can consistently the update. The update might
be with some math formula which multiplies the power by some factor
value (based on thermal model and current temperature).

> 
>> +        *
>> +        * The update_power() is used by runtime modifiable EM. It aims to
> 
> I would drop "The" from the above.

OK

> 
>> +        * provide updated power value for a given frequency, which is stored
>> +        * in the performance state.
> 
> A given frequency of what and the performance state of what does this refer to?

I will change that and add the '@dev' word to make this more precised.


'update_power() is used by runtime modifiable EM. It aims to update the
@dev EM power values for all registered frequencies.'

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 09/18] PM: EM: Introduce runtime modifiable table
  2023-09-26 19:12   ` Rafael J. Wysocki
@ 2023-09-29  9:16     ` Lukasz Luba
  2023-09-29 12:27       ` Rafael J. Wysocki
  0 siblings, 1 reply; 58+ messages in thread
From: Lukasz Luba @ 2023-09-29  9:16 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, mhiramat, qyousef, wvw



On 9/26/23 20:12, Rafael J. Wysocki wrote:
> On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> The new runtime table would be populated with a new power data to better
>> reflect the actual power. The power can vary over time e.g. due to the
>> SoC temperature change. Higher temperature can increase power values.
>> For longer running scenarios, such as game or camera, when also other
>> devices are used (e.g. GPU, ISP) the CPU power can change. The new
>> EM framework is able to addresses this issue and change the data
>> at runtime safely.
>>
>> The runtime modifiable EM data is used by the Energy Aware Scheduler (EAS)
>> for the task placement. All the other users (thermal, etc.) are still
>> using the default (basic) EM. This fact drove the design of this feature.
>>
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
>>   include/linux/energy_model.h |  4 +++-
>>   kernel/power/energy_model.c  | 12 +++++++++++-
>>   2 files changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
>> index 546dee90f716..740e7c25cfff 100644
>> --- a/include/linux/energy_model.h
>> +++ b/include/linux/energy_model.h
>> @@ -39,7 +39,7 @@ struct em_perf_state {
>>   /**
>>    * struct em_perf_table - Performance states table
>>    * @state:     List of performance states, in ascending order
>> - * @rcu:       RCU used for safe access and destruction
>> + * @rcu:       RCU used only for runtime modifiable table
> 
> This still doesn't appear to be used anywhere, so why change it here?

I will try to move this later in the series.

> 
>>    */
>>   struct em_perf_table {
>>          struct em_perf_state *state;
>> @@ -49,6 +49,7 @@ struct em_perf_table {
>>   /**
>>    * struct em_perf_domain - Performance domain
>>    * @default_table:     Pointer to the default em_perf_table
>> + * @runtime_table:     Pointer to the runtime modifiable em_perf_table
> 
> "Pointer to em_perf_table that can be dynamically updated"

OK

> 
>>    * @nr_perf_states:    Number of performance states
>>    * @flags:             See "em_perf_domain flags"
>>    * @cpus:              Cpumask covering the CPUs of the domain. It's here
>> @@ -64,6 +65,7 @@ struct em_perf_table {
>>    */
>>   struct em_perf_domain {
>>          struct em_perf_table *default_table;
>> +       struct em_perf_table __rcu *runtime_table;
>>          int nr_perf_states;
>>          unsigned long flags;
>>          unsigned long cpus[];
>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>> index 797141638b29..5b40db38b745 100644
>> --- a/kernel/power/energy_model.c
>> +++ b/kernel/power/energy_model.c
>> @@ -251,6 +251,9 @@ static int em_create_pd(struct device *dev, int nr_states,
>>                  return ret;
>>          }
>>
>> +       /* Initialize runtime table as default table. */
> 
> Redundant comment.

I'll drop it.

> 
>> +       rcu_assign_pointer(pd->runtime_table, default_table);
>> +
>>          if (_is_cpu_device(dev))
>>                  for_each_cpu(cpu, cpus) {
>>                          cpu_dev = get_cpu_device(cpu);
>> @@ -448,6 +451,7 @@ EXPORT_SYMBOL_GPL(em_dev_register_perf_domain);
>>    */
>>   void em_dev_unregister_perf_domain(struct device *dev)
>>   {
>> +       struct em_perf_table __rcu *runtime_table;
>>          struct em_perf_domain *pd;
>>
>>          if (IS_ERR_OR_NULL(dev) || !dev->em_pd)
>> @@ -457,18 +461,24 @@ void em_dev_unregister_perf_domain(struct device *dev)
>>                  return;
>>
>>          pd = dev->em_pd;
>> -
> 
> Unrelated change.

ACK

> 
>>          /*
>>           * The mutex separates all register/unregister requests and protects
>>           * from potential clean-up/setup issues in the debugfs directories.
>>           * The debugfs directory name is the same as device's name.
>>           */
>>          mutex_lock(&em_pd_mutex);
>> +
> 
> Same here.

ACK

> 
>>          em_debug_remove_pd(dev);
>>
>> +       runtime_table = pd->runtime_table;
>> +
>> +       rcu_assign_pointer(pd->runtime_table, NULL);
>> +       synchronize_rcu();
> 
> Is it really a good idea to call this under a mutex?

This is the unregistration of the EM code path, so a thermal
driver which gets some IRQs might not be aware that the EM
is going to vanish. That's why those two code paths: update
& unregister are protected with the same lock.

This synchronize_rcu() won't be long, but makes sure
that when we free(dev->em_pd) we don't leak runtime_table.

> 
>> +
>>          kfree(pd->default_table->state);
>>          kfree(pd->default_table);
>>          kfree(dev->em_pd);
>> +
> 
> Unrelated change.

ACK

> 
>>          dev->em_pd = NULL;
>>          mutex_unlock(&em_pd_mutex);
>>   }
>> --
> 
> So this really adds a pointer to a table that can be dynamically
> updated to struct em_perf_domain without any users so far.  It is not
> used anywhere as of this patch AFAICS, which is not what the changelog
> is saying.

Good catch. I will adjust the changlog in header and say:

'Add infrastructure and mechanisms for the new runtime table.
The runtime modifiable EM data is used by the Energy Aware Scheduler 
(EAS)for the task placement. All the other users (thermal, etc.) are
still using the default (basic) EM. This fact drove the design of this
feature.'

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 10/18] PM: EM: Add RCU mechanism which safely cleans the old data
  2023-09-26 19:26   ` Rafael J. Wysocki
@ 2023-09-29  9:36     ` Lukasz Luba
  2023-09-29 12:59       ` Rafael J. Wysocki
  0 siblings, 1 reply; 58+ messages in thread
From: Lukasz Luba @ 2023-09-29  9:36 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, mhiramat, qyousef, wvw



On 9/26/23 20:26, Rafael J. Wysocki wrote:
> On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> The EM is going to support runtime modifications of the power data.
>> Introduce RCU safe mechanism to clean up the old allocated EM data.
> 
> "RCU-based" probably and "to clean up the old EM data safely".

Yes, thanks

> 
>> It also adds a mutex for the EM structure to serialize the modifiers.
> 
> This part doesn't match the code changes in the patch.

Good catch. It left from some older version. We use the existing
em_pd_mutex.

> 
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
>>   kernel/power/energy_model.c | 29 +++++++++++++++++++++++++++++
>>   1 file changed, 29 insertions(+)
>>
>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>> index 5b40db38b745..2345837bfd2c 100644
>> --- a/kernel/power/energy_model.c
>> +++ b/kernel/power/energy_model.c
>> @@ -23,6 +23,9 @@
>>    */
>>   static DEFINE_MUTEX(em_pd_mutex);
>>
>> +static void em_cpufreq_update_efficiencies(struct device *dev,
>> +                                          struct em_perf_state *table);
>> +
>>   static bool _is_cpu_device(struct device *dev)
>>   {
>>          return (dev->bus == &cpu_subsys);
>> @@ -104,6 +107,32 @@ static void em_debug_create_pd(struct device *dev) {}
>>   static void em_debug_remove_pd(struct device *dev) {}
>>   #endif
>>
>> +static void em_destroy_rt_table_rcu(struct rcu_head *rp)
> 
> Adding static functions without callers will obviously cause the
> compiler to complain, which is one of the reasons to avoid doing that.
> The other is that it is hard to say how these functions are going to
> be used without reviewing multiple patches simultaneously, which is a
> pain as far as I'm concerned.

It is used in this patch, but inside the call_rcu() as 2nd arg.
I have marked that below. The compiler didn't complain IIRC.

> 
>> +{
>> +       struct em_perf_table *runtime_table;
>> +
>> +       runtime_table = container_of(rp, struct em_perf_table, rcu);
>> +       kfree(runtime_table->state);
>> +       kfree(runtime_table);
> 
> If runtime_table and its state were allocated in one go, it would be
> possible to free them in one go either.
> 
> For some reason, you don't seem to want to do that, but why?

We had a few internal reviews and there were voices where saying that
it's better to have 2 identical tables: 'default_table' and
'runtime_table' to make sure it's visible everywhere when it's used.
That made the need to actually have also the 'state' table inside.
I don't see it as a big problem, though.

> 
>> +}
>> +
>> +static void em_perf_runtime_table_set(struct device *dev,
>> +                                     struct em_perf_table *runtime_table)
>> +{
>> +       struct em_perf_domain *pd = dev->em_pd;
>> +       struct em_perf_table *tmp;
>> +
>> +       tmp = pd->runtime_table;
>> +
>> +       rcu_assign_pointer(pd->runtime_table, runtime_table);
>> +
>> +       em_cpufreq_update_efficiencies(dev, runtime_table->state);
>> +
>> +       /* Don't free default table since it's used by other frameworks. */
> 
> Apparently, some frameworks are only going to use the default table
> while the runtime-updatable table will be used somewhere else at the
> same time.
> 
> I'm not really sure if this is a good idea.

Runtime table is only for driving the task placement in the EAS.

The thermal gov IPA won't make better decisions because it already
has the mechanism to accumulate the error that it made.

The same applies to DTPM, which works in a more 'configurable' way,
rather that hard optimization mechanism (like EAS).

> 
>> +       if (tmp != pd->default_table)
>> +               call_rcu(&tmp->rcu, em_destroy_rt_table_rcu);

The em_destroy_rt_table_rcu() is used here ^^^^^^

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 11/18] PM: EM: Add runtime update interface to modify EM power
  2023-09-26 19:48   ` Rafael J. Wysocki
@ 2023-09-29 10:00     ` Lukasz Luba
  2023-09-29 13:18       ` Rafael J. Wysocki
  0 siblings, 1 reply; 58+ messages in thread
From: Lukasz Luba @ 2023-09-29 10:00 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, mhiramat, qyousef, wvw



On 9/26/23 20:48, Rafael J. Wysocki wrote:
> On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> 
> First off, I would merge this with the previous patch, as the changes
> would be much clearer then IMO.

I was trying to avoid a big patch ~150 lines. I will do that merge.

> 
>> Add an interface which allows to modify EM power data at runtime.
>> The new power information is populated by the provided callback, which
>> is called for each performance state.
> 
> But it all starts with copying the frequencies from the default table.

Yes, I can add that to the description.

> 
>> The CPU frequencies' efficiency is
>> re-calculated since that might be affected as well. The old EM memory
>> is going to be freed later using RCU mechanism.
> 
> Not all of it, but the old runtime table that is not going to be used any more.

True, I will rephrase that, to make it more precised.

> 
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>

[snip]

>>
>> +/**
>> + * em_dev_update_perf_domain() - Update runtime EM table for a device
>> + * @dev                : Device for which the EM is to be updated
>> + * @cb         : Callback function providing the power data for the EM
>> + * @priv       : Pointer to private data useful for passing context
>> + *             which might be required while calling @cb
> 
> It is still unclear to me who is going to use this priv pointer and how.

I have explained that in some previous patch response. A driver or
kernel module which monitors the thermal situation and has context.

> 
>> + *
>> + * Update EM runtime modifiable table for a @dev using the callback
>> + * defined in @cb. The EM new power values are then used for calculating
>> + * the em_perf_state::cost for associated performance state.
> 
> It actually allocates a new runtime table and populates it from
> scratch, using the frequencies from the default table and the
> callback.

Yes, it allocated new table and put the updated power values there.
I can add that to the comment.

> 
>> + *
>> + * This function uses mutex to serialize writers, so it must not be called
> 
> "a mutex"

ACK

> 
>> + * from non-sleeping context.

[snip]

>> +
>> +       if (!dev || !dev->em_pd) {
> 
> Checking dev against NULL under the mutex is pointless (either it is
> NULL or it isn't, so check it earlier).

ACK

> 
>> +               ret = -EINVAL;
>> +               goto unlock_em;
>> +       }
>> +
>> +       pd = dev->em_pd;
> 
> And I would check pd against NULL here.

It's done above, next to '!dev || !dev->em_pd'

> 
>> +
>> +       runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL);
>> +       if (!runtime_table) {
>> +               ret = -ENOMEM;
>> +               goto unlock_em;
>> +       }
>> +
>> +       runtime_table->state = kcalloc(pd->nr_perf_states,
>> +                                      sizeof(struct em_perf_state),
>> +                                      GFP_KERNEL);
>> +       if (!runtime_table->state) {
>> +               ret = -ENOMEM;
>> +               goto free_runtime_table;
>> +       }
> 
> The above allocations can be merged into one and allocating memory
> under the mutex is questionable.

So how to make sure that there is no 2 callers trying to update the
same EM or unregistration is not in the background?

[snip]

>>
>> @@ -501,9 +598,23 @@ void em_dev_unregister_perf_domain(struct device *dev)
>>
>>          runtime_table = pd->runtime_table;
>>
>> +       /*
>> +        * Safely destroy runtime modifiable EM. By using the call
>> +        * synchronize_rcu() we make sure we don't progress till last user
>> +        * finished the RCU section and our update got applied.
>> +        */
>>          rcu_assign_pointer(pd->runtime_table, NULL);
>>          synchronize_rcu();
>>
>> +       /*
>> +        * After the sync no updates will be in-flight, so free the
>> +        * memory allocated for runtime table (if there was such).
>> +        */
>> +       if (runtime_table != pd->default_table) {
>> +               kfree(runtime_table->state);
>> +               kfree(runtime_table);
>> +       }
> 
> Can't this race with the RCU callback freeing the runtime table?

That's why there is this 'synchronize_rcu()' above and the mutex. The
updating caller if finished the update, would unlock the mutex and this
unregister code can go. Here we call the synchronize_rcu() so we assure
the callback has finished for the update path and than we explicitly
free the saved 'runtime_table' here. So all data should be freed and
code serialized in those two paths.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 12/18] PM: EM: Use runtime modified EM for CPUs energy estimation in EAS
  2023-09-26 19:54   ` Rafael J. Wysocki
@ 2023-09-29 10:10     ` Lukasz Luba
  0 siblings, 0 replies; 58+ messages in thread
From: Lukasz Luba @ 2023-09-29 10:10 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, mhiramat, qyousef, wvw



On 9/26/23 20:54, Rafael J. Wysocki wrote:
> On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> The new Energy Model (EM) supports runtime modification of the performance
>> state table to better model the power used by the SoC. Use this new
>> feature to improve energy estimation and therefore task placement in
>> Energy Aware Scheduler (EAS).
>>
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
>>   include/linux/energy_model.h | 20 +++++++++++++-------
>>   1 file changed, 13 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
>> index 8f055ab356ed..41290ee2cdd0 100644
>> --- a/include/linux/energy_model.h
>> +++ b/include/linux/energy_model.h
>> @@ -261,15 +261,14 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>>                                  unsigned long max_util, unsigned long sum_util,
>>                                  unsigned long allowed_cpu_cap)
>>   {
>> +       struct em_perf_table *runtime_table;
> 
> You may as well call it just "table".  The "runtime_" prefix doesn't
> add much value here IMO.

Yes, I'll do that

Thank you Rafael for the review of the patch set!

Regards,
Lukasz

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 08/18] PM: EM: Add update_power() callback for runtime modifications
  2023-09-29  9:00     ` Lukasz Luba
@ 2023-09-29 12:18       ` Rafael J. Wysocki
  0 siblings, 0 replies; 58+ messages in thread
From: Rafael J. Wysocki @ 2023-09-29 12:18 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: Rafael J. Wysocki, linux-kernel, linux-pm, dietmar.eggemann,
	rui.zhang, amit.kucheria, amit.kachhap, daniel.lezcano,
	viresh.kumar, len.brown, pavel, mhiramat, qyousef, wvw

On Fri, Sep 29, 2023 at 10:59 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> On 9/26/23 19:59, Rafael J. Wysocki wrote:
> > On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> >>
> >> The Energy Model (EM) is going to support runtime modifications. This
> >> new callback would be used in the upcoming EM changes. The drivers
> >> or frameworks which want to modify the EM have to implement the
> >> update_power() callback.
> >>
> >> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> >> ---
> >>   include/linux/energy_model.h | 22 ++++++++++++++++++++++
> >>   1 file changed, 22 insertions(+)
> >>
> >> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> >> index d236e08e80dc..546dee90f716 100644
> >> --- a/include/linux/energy_model.h
> >> +++ b/include/linux/energy_model.h
> >> @@ -168,6 +168,26 @@ struct em_data_callback {
> >>           */
> >>          int (*get_cost)(struct device *dev, unsigned long freq,
> >>                          unsigned long *cost);
> >> +
> >> +       /**
> >> +        * update_power() - Provide new power at the given performance state of
> >> +        *              a device
> >
> > The meaning of the above is unclear to me.
>
> I can try to rephrase this a bit:
> ' Provide a new power value for the device at the given frequency. This
> allows to reflect changed power profile in runtime.'

Maybe "Estimate power for a given device frequency"

> >
> >> +        * @dev         : Device for which we do this operation (can be a CPU)
> >
> > It is unclear what "we" means in this context.  Maybe say "Target
> > device (can be a CPU)"?
>
> That sounds better indeed.
>
> >
> >> +        * @freq        : Frequency at the performance state in kHz
> >
> > What performance state does this refer to?  And the frequency of what?
>
> We just call the entry in EM the 'performance state' (so frequency and
> power). I will rephrase this as well:
> 'Frequency of the @dev expressed in kHz'

Or "Device frequency for which to estimate power"?

> >
> >> +        * @power       : New power value at the performance state
> >> +        *              (modified)
> >
> > Similarly unclear as the above.
>
> OK, it can be re-written as:
> 'Power value of the @dev at the given @freq (modified)'

This is not a power value, but a return pointer AFAICS.  So "location
to store the return power value".

> >
> >> +        * @priv        : Pointer to private data useful for tracking context
> >> +        *              during runtime modifications of EM.
> >
> > Who's going to set this pointer and use this data?
>
> The driver or kernel module, which is aware about thermal events. Those
> events might be coming from FW to kernel, but we need to maintain
> the same 'context' for all OPPs when we start the EM update.
>
> This 'priv' field is used for passing the 'context' back to the
> caller, so the caller can consistently the update. The update might
> be with some math formula which multiplies the power by some factor
> value (based on thermal model and current temperature).

I would say "Additional data for the callback function, provided by
the entity setting the callback pointer".

> >
> >> +        *
> >> +        * The update_power() is used by runtime modifiable EM. It aims to
> >
> > I would drop "The" from the above.
>
> OK
>
> >
> >> +        * provide updated power value for a given frequency, which is stored
> >> +        * in the performance state.
> >
> > A given frequency of what and the performance state of what does this refer to?
>
> I will change that and add the '@dev' word to make this more precised.

That would help.  Overall, I would say "is used by ... to obtain a new
power value for a given frequency of @dev".

>
> 'update_power() is used by runtime modifiable EM. It aims to update the
> @dev EM power values for all registered frequencies.'

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 09/18] PM: EM: Introduce runtime modifiable table
  2023-09-29  9:16     ` Lukasz Luba
@ 2023-09-29 12:27       ` Rafael J. Wysocki
  2023-10-06  8:03         ` Lukasz Luba
  0 siblings, 1 reply; 58+ messages in thread
From: Rafael J. Wysocki @ 2023-09-29 12:27 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: Rafael J. Wysocki, linux-kernel, linux-pm, dietmar.eggemann,
	rui.zhang, amit.kucheria, amit.kachhap, daniel.lezcano,
	viresh.kumar, len.brown, pavel, mhiramat, qyousef, wvw

On Fri, Sep 29, 2023 at 11:15 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
>
>
> On 9/26/23 20:12, Rafael J. Wysocki wrote:
> > On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> >>
> >> The new runtime table would be populated with a new power data to better
> >> reflect the actual power. The power can vary over time e.g. due to the
> >> SoC temperature change. Higher temperature can increase power values.
> >> For longer running scenarios, such as game or camera, when also other
> >> devices are used (e.g. GPU, ISP) the CPU power can change. The new
> >> EM framework is able to addresses this issue and change the data
> >> at runtime safely.
> >>
> >> The runtime modifiable EM data is used by the Energy Aware Scheduler (EAS)
> >> for the task placement. All the other users (thermal, etc.) are still
> >> using the default (basic) EM. This fact drove the design of this feature.
> >>
> >> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> >> ---
> >>   include/linux/energy_model.h |  4 +++-
> >>   kernel/power/energy_model.c  | 12 +++++++++++-
> >>   2 files changed, 14 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> >> index 546dee90f716..740e7c25cfff 100644
> >> --- a/include/linux/energy_model.h
> >> +++ b/include/linux/energy_model.h
> >> @@ -39,7 +39,7 @@ struct em_perf_state {
> >>   /**
> >>    * struct em_perf_table - Performance states table
> >>    * @state:     List of performance states, in ascending order
> >> - * @rcu:       RCU used for safe access and destruction
> >> + * @rcu:       RCU used only for runtime modifiable table
> >
> > This still doesn't appear to be used anywhere, so why change it here?
>
> I will try to move this later in the series.
>
> >
> >>    */
> >>   struct em_perf_table {
> >>          struct em_perf_state *state;
> >> @@ -49,6 +49,7 @@ struct em_perf_table {
> >>   /**
> >>    * struct em_perf_domain - Performance domain
> >>    * @default_table:     Pointer to the default em_perf_table
> >> + * @runtime_table:     Pointer to the runtime modifiable em_perf_table
> >
> > "Pointer to em_perf_table that can be dynamically updated"
>
> OK
>
> >
> >>    * @nr_perf_states:    Number of performance states
> >>    * @flags:             See "em_perf_domain flags"
> >>    * @cpus:              Cpumask covering the CPUs of the domain. It's here
> >> @@ -64,6 +65,7 @@ struct em_perf_table {
> >>    */
> >>   struct em_perf_domain {
> >>          struct em_perf_table *default_table;
> >> +       struct em_perf_table __rcu *runtime_table;
> >>          int nr_perf_states;
> >>          unsigned long flags;
> >>          unsigned long cpus[];
> >> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> >> index 797141638b29..5b40db38b745 100644
> >> --- a/kernel/power/energy_model.c
> >> +++ b/kernel/power/energy_model.c
> >> @@ -251,6 +251,9 @@ static int em_create_pd(struct device *dev, int nr_states,
> >>                  return ret;
> >>          }
> >>
> >> +       /* Initialize runtime table as default table. */
> >
> > Redundant comment.
>
> I'll drop it.
>
> >
> >> +       rcu_assign_pointer(pd->runtime_table, default_table);
> >> +
> >>          if (_is_cpu_device(dev))
> >>                  for_each_cpu(cpu, cpus) {
> >>                          cpu_dev = get_cpu_device(cpu);
> >> @@ -448,6 +451,7 @@ EXPORT_SYMBOL_GPL(em_dev_register_perf_domain);
> >>    */
> >>   void em_dev_unregister_perf_domain(struct device *dev)
> >>   {
> >> +       struct em_perf_table __rcu *runtime_table;
> >>          struct em_perf_domain *pd;
> >>
> >>          if (IS_ERR_OR_NULL(dev) || !dev->em_pd)
> >> @@ -457,18 +461,24 @@ void em_dev_unregister_perf_domain(struct device *dev)
> >>                  return;
> >>
> >>          pd = dev->em_pd;
> >> -
> >
> > Unrelated change.
>
> ACK
>
> >
> >>          /*
> >>           * The mutex separates all register/unregister requests and protects
> >>           * from potential clean-up/setup issues in the debugfs directories.
> >>           * The debugfs directory name is the same as device's name.
> >>           */
> >>          mutex_lock(&em_pd_mutex);
> >> +
> >
> > Same here.
>
> ACK
>
> >
> >>          em_debug_remove_pd(dev);
> >>
> >> +       runtime_table = pd->runtime_table;
> >> +
> >> +       rcu_assign_pointer(pd->runtime_table, NULL);
> >> +       synchronize_rcu();
> >
> > Is it really a good idea to call this under a mutex?
>
> This is the unregistration of the EM code path, so a thermal
> driver which gets some IRQs might not be aware that the EM
> is going to vanish. That's why those two code paths: update
> & unregister are protected with the same lock.
>
> This synchronize_rcu() won't be long,

Are you sure?  This potentially waits for all of the CPUs in the
system to go through a quiescent state which may take a while in
principle.

In any case, though, this effectively makes everyone waiting for the
mutex also wait for the grace period to elapse and they may not care
about it.

> but makes sure that when we free(dev->em_pd) we don't leak runtime_table.
>
> >
> >> +
> >>          kfree(pd->default_table->state);
> >>          kfree(pd->default_table);
> >>          kfree(dev->em_pd);
> >> +
> >
> > Unrelated change.
>
> ACK
>
> >
> >>          dev->em_pd = NULL;
> >>          mutex_unlock(&em_pd_mutex);
> >>   }
> >> --
> >
> > So this really adds a pointer to a table that can be dynamically
> > updated to struct em_perf_domain without any users so far.  It is not
> > used anywhere as of this patch AFAICS, which is not what the changelog
> > is saying.
>
> Good catch. I will adjust the changlog in header and say:
>
> 'Add infrastructure and mechanisms for the new runtime table.
> The runtime modifiable EM data is used by the Energy Aware Scheduler
> (EAS)for the task placement.

I would make it more clear that this is going to happen after some
other subsequent changes.

> All the other users (thermal, etc.) are
> still using the default (basic) EM. This fact drove the design of this
> feature.'

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 10/18] PM: EM: Add RCU mechanism which safely cleans the old data
  2023-09-29  9:36     ` Lukasz Luba
@ 2023-09-29 12:59       ` Rafael J. Wysocki
  2023-10-02 13:44         ` Lukasz Luba
  0 siblings, 1 reply; 58+ messages in thread
From: Rafael J. Wysocki @ 2023-09-29 12:59 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: Rafael J. Wysocki, linux-kernel, linux-pm, dietmar.eggemann,
	rui.zhang, amit.kucheria, amit.kachhap, daniel.lezcano,
	viresh.kumar, len.brown, pavel, mhiramat, qyousef, wvw

On Fri, Sep 29, 2023 at 11:36 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
>
>
> On 9/26/23 20:26, Rafael J. Wysocki wrote:
> > On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> >>
> >> The EM is going to support runtime modifications of the power data.
> >> Introduce RCU safe mechanism to clean up the old allocated EM data.
> >
> > "RCU-based" probably and "to clean up the old EM data safely".
>
> Yes, thanks
>
> >
> >> It also adds a mutex for the EM structure to serialize the modifiers.
> >
> > This part doesn't match the code changes in the patch.
>
> Good catch. It left from some older version. We use the existing
> em_pd_mutex.
>
> >
> >> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> >> ---
> >>   kernel/power/energy_model.c | 29 +++++++++++++++++++++++++++++
> >>   1 file changed, 29 insertions(+)
> >>
> >> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> >> index 5b40db38b745..2345837bfd2c 100644
> >> --- a/kernel/power/energy_model.c
> >> +++ b/kernel/power/energy_model.c
> >> @@ -23,6 +23,9 @@
> >>    */
> >>   static DEFINE_MUTEX(em_pd_mutex);
> >>
> >> +static void em_cpufreq_update_efficiencies(struct device *dev,
> >> +                                          struct em_perf_state *table);
> >> +
> >>   static bool _is_cpu_device(struct device *dev)
> >>   {
> >>          return (dev->bus == &cpu_subsys);
> >> @@ -104,6 +107,32 @@ static void em_debug_create_pd(struct device *dev) {}
> >>   static void em_debug_remove_pd(struct device *dev) {}
> >>   #endif
> >>
> >> +static void em_destroy_rt_table_rcu(struct rcu_head *rp)
> >
> > Adding static functions without callers will obviously cause the
> > compiler to complain, which is one of the reasons to avoid doing that.
> > The other is that it is hard to say how these functions are going to
> > be used without reviewing multiple patches simultaneously, which is a
> > pain as far as I'm concerned.
>
> It is used in this patch, but inside the call_rcu() as 2nd arg.

I missed that, sorry for the noise.

> I have marked that below. The compiler didn't complain IIRC.
>
> >
> >> +{
> >> +       struct em_perf_table *runtime_table;
> >> +
> >> +       runtime_table = container_of(rp, struct em_perf_table, rcu);
> >> +       kfree(runtime_table->state);
> >> +       kfree(runtime_table);
> >
> > If runtime_table and its state were allocated in one go, it would be
> > possible to free them in one go either.
> >
> > For some reason, you don't seem to want to do that, but why?
>
> We had a few internal reviews and there were voices where saying that
> it's better to have 2 identical tables: 'default_table' and
> 'runtime_table' to make sure it's visible everywhere when it's used.
> That made the need to actually have also the 'state' table inside.
> I don't see it as a big problem, though.

What I'm trying to say is that you can allocate runtime_table along
with the table pointed to by its state field in one invocation of
kzalloc() (say).

Having just one memory region to free eventually instead of two of
them would help to avoid some complexity, especially in the next
patch.

> >
> >> +}
> >> +
> >> +static void em_perf_runtime_table_set(struct device *dev,
> >> +                                     struct em_perf_table *runtime_table)
> >> +{
> >> +       struct em_perf_domain *pd = dev->em_pd;
> >> +       struct em_perf_table *tmp;
> >> +
> >> +       tmp = pd->runtime_table;
> >> +
> >> +       rcu_assign_pointer(pd->runtime_table, runtime_table);
> >> +
> >> +       em_cpufreq_update_efficiencies(dev, runtime_table->state);
> >> +
> >> +       /* Don't free default table since it's used by other frameworks. */
> >
> > Apparently, some frameworks are only going to use the default table
> > while the runtime-updatable table will be used somewhere else at the
> > same time.
> >
> > I'm not really sure if this is a good idea.
>
> Runtime table is only for driving the task placement in the EAS.
>
> The thermal gov IPA won't make better decisions because it already
> has the mechanism to accumulate the error that it made.
>
> The same applies to DTPM, which works in a more 'configurable' way,
> rather that hard optimization mechanism (like EAS).

My understanding of the above is that the other EM users don't really
care that much so they can get away with using the default table all
the time, but EAS needs more accuracy, so the table used by it needs
to be adjusted in certain situations.

Fair enough, I'm assuming that you've done some research around it.
Still, this is rather confusing.

> >
> >> +       if (tmp != pd->default_table)
> >> +               call_rcu(&tmp->rcu, em_destroy_rt_table_rcu);
>
> The em_destroy_rt_table_rcu() is used here ^^^^^^

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 11/18] PM: EM: Add runtime update interface to modify EM power
  2023-09-29 10:00     ` Lukasz Luba
@ 2023-09-29 13:18       ` Rafael J. Wysocki
  2023-10-02 14:09         ` Lukasz Luba
  0 siblings, 1 reply; 58+ messages in thread
From: Rafael J. Wysocki @ 2023-09-29 13:18 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: Rafael J. Wysocki, linux-kernel, linux-pm, dietmar.eggemann,
	rui.zhang, amit.kucheria, amit.kachhap, daniel.lezcano,
	viresh.kumar, len.brown, pavel, mhiramat, qyousef, wvw

On Fri, Sep 29, 2023 at 11:59 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
>
>
> On 9/26/23 20:48, Rafael J. Wysocki wrote:
> > On Mon, Sep 25, 2023 at 10:11 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> >
> > First off, I would merge this with the previous patch, as the changes
> > would be much clearer then IMO.
>
> I was trying to avoid a big patch ~150 lines. I will do that merge.
>
> >
> >> Add an interface which allows to modify EM power data at runtime.
> >> The new power information is populated by the provided callback, which
> >> is called for each performance state.
> >
> > But it all starts with copying the frequencies from the default table.
>
> Yes, I can add that to the description.
>
> >
> >> The CPU frequencies' efficiency is
> >> re-calculated since that might be affected as well. The old EM memory
> >> is going to be freed later using RCU mechanism.
> >
> > Not all of it, but the old runtime table that is not going to be used any more.
>
> True, I will rephrase that, to make it more precised.
>
> >
> >> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>
> [snip]
>
> >>
> >> +/**
> >> + * em_dev_update_perf_domain() - Update runtime EM table for a device
> >> + * @dev                : Device for which the EM is to be updated
> >> + * @cb         : Callback function providing the power data for the EM
> >> + * @priv       : Pointer to private data useful for passing context
> >> + *             which might be required while calling @cb
> >
> > It is still unclear to me who is going to use this priv pointer and how.
>
> I have explained that in some previous patch response. A driver or
> kernel module which monitors the thermal situation and has context.
>
> >
> >> + *
> >> + * Update EM runtime modifiable table for a @dev using the callback
> >> + * defined in @cb. The EM new power values are then used for calculating
> >> + * the em_perf_state::cost for associated performance state.
> >
> > It actually allocates a new runtime table and populates it from
> > scratch, using the frequencies from the default table and the
> > callback.
>
> Yes, it allocated new table and put the updated power values there.
> I can add that to the comment.
>
> >
> >> + *
> >> + * This function uses mutex to serialize writers, so it must not be called
> >
> > "a mutex"
>
> ACK
>
> >
> >> + * from non-sleeping context.
>
> [snip]
>
> >> +
> >> +       if (!dev || !dev->em_pd) {
> >
> > Checking dev against NULL under the mutex is pointless (either it is
> > NULL or it isn't, so check it earlier).
>
> ACK
>
> >
> >> +               ret = -EINVAL;
> >> +               goto unlock_em;
> >> +       }
> >> +
> >> +       pd = dev->em_pd;
> >
> > And I would check pd against NULL here.
>
> It's done above, next to '!dev || !dev->em_pd'

Yes, it is, I meant something like this:

    if (!cb || !cb->update_power || !dev)
        return -EINVAL;

    mutex_lock(&em_pd_mutex);

    pd = dev->em_pd;
    if (!pd) {
        ret = -EINVAL; /* or perhaps -ENODATA */
        goto unlock_em;
    }


> >
> >> +
> >> +       runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL);
> >> +       if (!runtime_table) {
> >> +               ret = -ENOMEM;
> >> +               goto unlock_em;
> >> +       }
> >> +
> >> +       runtime_table->state = kcalloc(pd->nr_perf_states,
> >> +                                      sizeof(struct em_perf_state),
> >> +                                      GFP_KERNEL);
> >> +       if (!runtime_table->state) {
> >> +               ret = -ENOMEM;
> >> +               goto free_runtime_table;
> >> +       }
> >
> > The above allocations can be merged into one and allocating memory
> > under the mutex is questionable.
>
> So how to make sure that there is no 2 callers trying to update the
> same EM or unregistration is not in the background?

You can allocate memory upfront and take the mutex before accessing
the shared data structures.  If there's an error in the code running
under the mutex, release it and then free the memory.

Allocating memory is one operation, updating the shared data
structures to use it is another one.  The former doesn't affect the
shared state in any way, so why do it under the mutex?

> [snip]
>
> >>
> >> @@ -501,9 +598,23 @@ void em_dev_unregister_perf_domain(struct device *dev)
> >>
> >>          runtime_table = pd->runtime_table;
> >>
> >> +       /*
> >> +        * Safely destroy runtime modifiable EM. By using the call
> >> +        * synchronize_rcu() we make sure we don't progress till last user
> >> +        * finished the RCU section and our update got applied.
> >> +        */
> >>          rcu_assign_pointer(pd->runtime_table, NULL);
> >>          synchronize_rcu();
> >>
> >> +       /*
> >> +        * After the sync no updates will be in-flight, so free the
> >> +        * memory allocated for runtime table (if there was such).
> >> +        */
> >> +       if (runtime_table != pd->default_table) {
> >> +               kfree(runtime_table->state);
> >> +               kfree(runtime_table);
> >> +       }
> >
> > Can't this race with the RCU callback freeing the runtime table?
>
> That's why there is this 'synchronize_rcu()' above and the mutex. The
> updating caller if finished the update, would unlock the mutex and this
> unregister code can go. Here we call the synchronize_rcu() so we assure
> the callback has finished for the update path and than we explicitly
> free the saved 'runtime_table' here. So all data should be freed and
> code serialized in those two paths.

This doesn't quite agree with my understanding of what synchronize_rcu() does.

IIUC, RCU callbacks can run as soon as the grace period has elapsed
and they need not wait for synchronize_rcu() to return.  Conversely,
synchronize_rcu() doesn't wait for all of the RCU callbacks to
complete.

Now, em_destroy_rt_table_rcu() doesn't actually use the mutex, so how
exactly is it protected against racing with this code?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 10/18] PM: EM: Add RCU mechanism which safely cleans the old data
  2023-09-29 12:59       ` Rafael J. Wysocki
@ 2023-10-02 13:44         ` Lukasz Luba
  2023-10-06  8:46           ` Lukasz Luba
  0 siblings, 1 reply; 58+ messages in thread
From: Lukasz Luba @ 2023-10-02 13:44 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, mhiramat, qyousef, wvw



On 9/29/23 13:59, Rafael J. Wysocki wrote:
> On Fri, Sep 29, 2023 at 11:36 AM Lukasz Luba <lukasz.luba@arm.com> wrote:

[snip]

>> We had a few internal reviews and there were voices where saying that
>> it's better to have 2 identical tables: 'default_table' and
>> 'runtime_table' to make sure it's visible everywhere when it's used.
>> That made the need to actually have also the 'state' table inside.
>> I don't see it as a big problem, though.
> 
> What I'm trying to say is that you can allocate runtime_table along
> with the table pointed to by its state field in one invocation of
> kzalloc() (say).
> 
> Having just one memory region to free eventually instead of two of
> them would help to avoid some complexity, especially in the next
> patch.

I think, I know what you mean, basically:

------------------------------
struct em_perf_table {
	struct rcu_head rcu;
	struct em_perf_state state[];
}

kzalloc(sizeof(struct em_perf_table) + N * sizeof(struct em_perf_state))

------

IMO that should also be OK in the rest of places.
I agree the alloc/free code would be smaller.

Let me do that than.

> 
>>>
>>>> +}
>>>> +
>>>> +static void em_perf_runtime_table_set(struct device *dev,
>>>> +                                     struct em_perf_table *runtime_table)
>>>> +{
>>>> +       struct em_perf_domain *pd = dev->em_pd;
>>>> +       struct em_perf_table *tmp;
>>>> +
>>>> +       tmp = pd->runtime_table;
>>>> +
>>>> +       rcu_assign_pointer(pd->runtime_table, runtime_table);
>>>> +
>>>> +       em_cpufreq_update_efficiencies(dev, runtime_table->state);
>>>> +
>>>> +       /* Don't free default table since it's used by other frameworks. */
>>>
>>> Apparently, some frameworks are only going to use the default table
>>> while the runtime-updatable table will be used somewhere else at the
>>> same time.
>>>
>>> I'm not really sure if this is a good idea.
>>
>> Runtime table is only for driving the task placement in the EAS.
>>
>> The thermal gov IPA won't make better decisions because it already
>> has the mechanism to accumulate the error that it made.
>>
>> The same applies to DTPM, which works in a more 'configurable' way,
>> rather that hard optimization mechanism (like EAS).
> 
> My understanding of the above is that the other EM users don't really
> care that much so they can get away with using the default table all
> the time, but EAS needs more accuracy, so the table used by it needs
> to be adjusted in certain situations.

Yes

> 
> Fair enough, I'm assuming that you've done some research around it.
> Still, this is rather confusing.

Yes, I have presented those ~2y ago in Android Gerrit world
(got feedback from a few vendors) and in a few Linux conferences.

For now we don't plan to have this feature for the thermal
governor or something similar.

> 
>>>
>>>> +       if (tmp != pd->default_table)
>>>> +               call_rcu(&tmp->rcu, em_destroy_rt_table_rcu);
>>
>> The em_destroy_rt_table_rcu() is used here ^^^^^^

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 11/18] PM: EM: Add runtime update interface to modify EM power
  2023-09-29 13:18       ` Rafael J. Wysocki
@ 2023-10-02 14:09         ` Lukasz Luba
  0 siblings, 0 replies; 58+ messages in thread
From: Lukasz Luba @ 2023-10-02 14:09 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, mhiramat, qyousef, wvw



On 9/29/23 14:18, Rafael J. Wysocki wrote:
> On Fri, Sep 29, 2023 at 11:59 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>

[snip]

>>
>> It's done above, next to '!dev || !dev->em_pd'
> 
> Yes, it is, I meant something like this:
> 
>      if (!cb || !cb->update_power || !dev)
>          return -EINVAL;
> 
>      mutex_lock(&em_pd_mutex);
> 
>      pd = dev->em_pd;
>      if (!pd) {
>          ret = -EINVAL; /* or perhaps -ENODATA */
>          goto unlock_em;
>      }
> 
> 

OK, I see what you mean. Let me change that.

>>>
>>>> +
>>>> +       runtime_table = kzalloc(sizeof(*runtime_table), GFP_KERNEL);
>>>> +       if (!runtime_table) {
>>>> +               ret = -ENOMEM;
>>>> +               goto unlock_em;
>>>> +       }
>>>> +
>>>> +       runtime_table->state = kcalloc(pd->nr_perf_states,
>>>> +                                      sizeof(struct em_perf_state),
>>>> +                                      GFP_KERNEL);
>>>> +       if (!runtime_table->state) {
>>>> +               ret = -ENOMEM;
>>>> +               goto free_runtime_table;
>>>> +       }
>>>
>>> The above allocations can be merged into one and allocating memory
>>> under the mutex is questionable.
>>
>> So how to make sure that there is no 2 callers trying to update the
>> same EM or unregistration is not in the background?
> 
> You can allocate memory upfront and take the mutex before accessing
> the shared data structures.  If there's an error in the code running
> under the mutex, release it and then free the memory.
> 
> Allocating memory is one operation, updating the shared data
> structures to use it is another one.  The former doesn't affect the
> shared state in any way, so why do it under the mutex?

Yes, make sense. I will shrink that critical section. Good catch,
thanks!

> 
>> [snip]
>>
>>>>
>>>> @@ -501,9 +598,23 @@ void em_dev_unregister_perf_domain(struct device *dev)
>>>>
>>>>           runtime_table = pd->runtime_table;
>>>>
>>>> +       /*
>>>> +        * Safely destroy runtime modifiable EM. By using the call
>>>> +        * synchronize_rcu() we make sure we don't progress till last user
>>>> +        * finished the RCU section and our update got applied.
>>>> +        */
>>>>           rcu_assign_pointer(pd->runtime_table, NULL);
>>>>           synchronize_rcu();
>>>>
>>>> +       /*
>>>> +        * After the sync no updates will be in-flight, so free the
>>>> +        * memory allocated for runtime table (if there was such).
>>>> +        */
>>>> +       if (runtime_table != pd->default_table) {
>>>> +               kfree(runtime_table->state);
>>>> +               kfree(runtime_table);
>>>> +       }
>>>
>>> Can't this race with the RCU callback freeing the runtime table?
>>
>> That's why there is this 'synchronize_rcu()' above and the mutex. The
>> updating caller if finished the update, would unlock the mutex and this
>> unregister code can go. Here we call the synchronize_rcu() so we assure
>> the callback has finished for the update path and than we explicitly
>> free the saved 'runtime_table' here. So all data should be freed and
>> code serialized in those two paths.
> 
> This doesn't quite agree with my understanding of what synchronize_rcu() does.
> 
> IIUC, RCU callbacks can run as soon as the grace period has elapsed
> and they need not wait for synchronize_rcu() to return.  Conversely,
> synchronize_rcu() doesn't wait for all of the RCU callbacks to
> complete.
> 
> Now, em_destroy_rt_table_rcu() doesn't actually use the mutex, so how
> exactly is it protected against racing with this code?


I'll try to draw in on some pictures...

(previous instance )
+---------------------+
|                     |
| runtime table    #1 |
|                     |
+---------------------+


(next instance )
+---------------------+
|                     |
| runtime table    #2 |
|                     |
+---------------------+


(not possible new instance)
+.....................+
.                     .
. runtime table    #3 .
.                     .
+.....................+



    cpu A - "updater"          |    cpu B - "remover"
                               |
------------------------------|------------------------------
    lock em_pd_mutex           |
                               |
       alloc runtime table #2  |   lock em_pd_mutex
                               |   (waiting)
       async free instance #1  |    .
                               |    .
    unlock em_pd_mutex         |    .
                               |   (enter critical section)
                               |
    lock em_pd_mutex           |   set NULL to runtime table ptr
    (waiting)                  |
    (wanted to create #3 inst) |   synchronize rcu to make it is visible
    .                          |
    .                          |   implicit free instance #2
    .                          |
    .                          |   free the rest of EM and EM
    .                          |
    .                          |   unlock em_pd_mutex
    (enter critical section)   |
    !dev->em_pd so             |
    unlock & exit              |
                               |
                               |


This should clean all involved memory and also prevent
of allocating new instance, when we unregister EM.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 00/18] Introduce runtime modifiable Energy Model
  2023-09-28 21:56 ` [PATCH v4 00/18] Introduce runtime modifiable Energy Model Qais Yousef
@ 2023-10-03  8:06   ` Lukasz Luba
  0 siblings, 0 replies; 58+ messages in thread
From: Lukasz Luba @ 2023-10-03  8:06 UTC (permalink / raw)
  To: Qais Yousef
  Cc: linux-kernel, linux-pm, rafael, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, mhiramat, wvw

Hi Qais,


On 9/28/23 22:56, Qais Yousef wrote:
> Hi Lukasz
> 
> On 09/25/23 09:11, Lukasz Luba wrote:
>> Hi all,
>>
>> This patch set adds a new feature which allows to modify Energy Model (EM)
>> power values at runtime. It will allow to better reflect power model of
>> a recent SoCs and silicon. Different characteristics of the power usage
>> can be leveraged and thus better decisions made during task placement in EAS.
>>
>> It's part of feature set know as Dynamic Energy Model. It has been presented
>> and discussed recently at OSPM2023 [3]. This patch set implements the 1st
>> improvement for the EM.
> 
> Thanks for the series! I'm on CC this time :-) Unfortunately I have a planned
> holiday starting tomorrow, so won't get a chance to do proper review till I'm
> back which would be few weeks from now.
> 
> I just want to iterate that this is a real problem being seen in practice where
> it's hard to provide a single average EM for all workloads now. So we certainly
> need something like this.
> 
> Hopefully I'll get a chance to help with review when I'm back from holidays.
> 
> 

Thanks and no worries, there will be v5. I am going to address comments
from Rafael. So probably when you're back the v5 would be there then.

Regards,
Lukasz

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 09/18] PM: EM: Introduce runtime modifiable table
  2023-09-29 12:27       ` Rafael J. Wysocki
@ 2023-10-06  8:03         ` Lukasz Luba
  0 siblings, 0 replies; 58+ messages in thread
From: Lukasz Luba @ 2023-10-06  8:03 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, mhiramat, qyousef, wvw



On 9/29/23 13:27, Rafael J. Wysocki wrote:
> On Fri, Sep 29, 2023 at 11:15 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>>

[snip]

>>>>           em_debug_remove_pd(dev);
>>>>
>>>> +       runtime_table = pd->runtime_table;
>>>> +
>>>> +       rcu_assign_pointer(pd->runtime_table, NULL);
>>>> +       synchronize_rcu();
>>>
>>> Is it really a good idea to call this under a mutex?
>>
>> This is the unregistration of the EM code path, so a thermal
>> driver which gets some IRQs might not be aware that the EM
>> is going to vanish. That's why those two code paths: update
>> & unregister are protected with the same lock.
>>
>> This synchronize_rcu() won't be long,
> 
> Are you sure?  This potentially waits for all of the CPUs in the
> system to go through a quiescent state which may take a while in
> principle.
> 
> In any case, though, this effectively makes everyone waiting for the
> mutex also wait for the grace period to elapse and they may not care
> about it.

My apologies for the delay, I had to check this. Yes, should be possible
and safe to not wait here as you described on this synchronize_rcu().

What I have drawn in other response to patch 11/18 [1] should still be
true.

Thanks, I will remove this sync call from here.

> 
>> but makes sure that when we free(dev->em_pd) we don't leak runtime_table.
>>
>>>
>>>> +
>>>>           kfree(pd->default_table->state);
>>>>           kfree(pd->default_table);
>>>>           kfree(dev->em_pd);
>>>> +
>>>
>>> Unrelated change.
>>
>> ACK
>>
>>>
>>>>           dev->em_pd = NULL;
>>>>           mutex_unlock(&em_pd_mutex);
>>>>    }
>>>> --
>>>
>>> So this really adds a pointer to a table that can be dynamically
>>> updated to struct em_perf_domain without any users so far.  It is not
>>> used anywhere as of this patch AFAICS, which is not what the changelog
>>> is saying.
>>
>> Good catch. I will adjust the changlog in header and say:
>>
>> 'Add infrastructure and mechanisms for the new runtime table.
>> The runtime modifiable EM data is used by the Energy Aware Scheduler
>> (EAS)for the task placement.
> 
> I would make it more clear that this is going to happen after some
> other subsequent changes.
> 

OK, I will add that information too.

[1] 
https://lore.kernel.org/lkml/91d6e9be-d50c-d157-55a0-79134cbd01fb@arm.com/

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 10/18] PM: EM: Add RCU mechanism which safely cleans the old data
  2023-10-02 13:44         ` Lukasz Luba
@ 2023-10-06  8:46           ` Lukasz Luba
  2023-10-11 16:02             ` Wei Wang
  0 siblings, 1 reply; 58+ messages in thread
From: Lukasz Luba @ 2023-10-06  8:46 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-kernel, linux-pm, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, mhiramat, qyousef, wvw

Hi Rafael,

A change of direction here, regarding your comment below.

On 10/2/23 14:44, Lukasz Luba wrote:
> 
> 
> On 9/29/23 13:59, Rafael J. Wysocki wrote:
>> On Fri, Sep 29, 2023 at 11:36 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> 
> [snip]
> 

[snip]

>>>> Apparently, some frameworks are only going to use the default table
>>>> while the runtime-updatable table will be used somewhere else at the
>>>> same time.
>>>>
>>>> I'm not really sure if this is a good idea.
>>>
>>> Runtime table is only for driving the task placement in the EAS.
>>>
>>> The thermal gov IPA won't make better decisions because it already
>>> has the mechanism to accumulate the error that it made.
>>>
>>> The same applies to DTPM, which works in a more 'configurable' way,
>>> rather that hard optimization mechanism (like EAS).
>>
>> My understanding of the above is that the other EM users don't really
>> care that much so they can get away with using the default table all
>> the time, but EAS needs more accuracy, so the table used by it needs
>> to be adjusted in certain situations.
> 
> Yes
> 
>>
>> Fair enough, I'm assuming that you've done some research around it.
>> Still, this is rather confusing.
> 
> Yes, I have presented those ~2y ago in Android Gerrit world
> (got feedback from a few vendors) and in a few Linux conferences.
> 
> For now we don't plan to have this feature for the thermal
> governor or something similar.
> 

I have discussed with one of our partners your comment about 2 tables.
They would like to have this runtime modified EM in other places
as well: DTPM and thermal governor. So you had good gut feeling.

In the past in our IPA (thermal gov ~2016 and kernel v4.14) we
had two callbacks:
- get_static_power() [1]
- get_dynamic_power() [2]

Later ~2017/2018 v4.16 the static power mechanism was removed
completely by this commit 84fe2cab48590e4373978e4e.
The way how it was design, implemented and used justified that
decision. We later used EM in the cpu cooling which also only
had dynamic power information.

The PID mechanism in IPA tries to compensate that
missing information (about changed static power in time or a chip
binning) and adjusts the 'error'. How good and fast that is in all
situations - it's a different story (out of this scope).
So, IPA should not be worse with the runtime table.

The static power was on the chips and probably will be still.
You might remember my slide 13 from OSPM2024 showing two power
usage plots for the same Big CPU and 1.4GHz fixed (50% of fmax):
- w/ GPU working in the background using 1-1.5W
- w/o GPU in the background

The same workload run on Big, but power bigger is ~15% higher
after ~1min.

The static power (leakage) is the issue that this patch tries
to address for EAS. Although, there is not only the leakage.
It's about the whole 'profile', which can be different than what
could be built during boot default information.

So we would want to go for one single table in EM, which
is runtime modifiable.

That is something that you might be more confident and we would
have less diversity (2 tables) in the kernel.

Regards,
Lukasz


[1] 
https://elixir.bootlin.com/linux/v4.14/source/drivers/thermal/cpu_cooling.c#L336
[2] 
https://elixir.bootlin.com/linux/v4.14/source/drivers/thermal/cpu_cooling.c#L383

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 10/18] PM: EM: Add RCU mechanism which safely cleans the old data
  2023-10-06  8:46           ` Lukasz Luba
@ 2023-10-11 16:02             ` Wei Wang
  2023-10-11 16:07               ` Rafael J. Wysocki
  0 siblings, 1 reply; 58+ messages in thread
From: Wei Wang @ 2023-10-11 16:02 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: Rafael J. Wysocki, linux-kernel, linux-pm, dietmar.eggemann,
	rui.zhang, amit.kucheria, amit.kachhap, daniel.lezcano,
	viresh.kumar, len.brown, pavel, mhiramat, qyousef

On Fri, Oct 6, 2023 at 1:45 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> Hi Rafael,
>
> A change of direction here, regarding your comment below.
>
> On 10/2/23 14:44, Lukasz Luba wrote:
> >
> >
> > On 9/29/23 13:59, Rafael J. Wysocki wrote:
> >> On Fri, Sep 29, 2023 at 11:36 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> >
> > [snip]
> >
>
> [snip]
>
> >>>> Apparently, some frameworks are only going to use the default table
> >>>> while the runtime-updatable table will be used somewhere else at the
> >>>> same time.
> >>>>
> >>>> I'm not really sure if this is a good idea.
> >>>
> >>> Runtime table is only for driving the task placement in the EAS.
> >>>
> >>> The thermal gov IPA won't make better decisions because it already
> >>> has the mechanism to accumulate the error that it made.
> >>>
> >>> The same applies to DTPM, which works in a more 'configurable' way,
> >>> rather that hard optimization mechanism (like EAS).
> >>
> >> My understanding of the above is that the other EM users don't really
> >> care that much so they can get away with using the default table all
> >> the time, but EAS needs more accuracy, so the table used by it needs
> >> to be adjusted in certain situations.
> >
> > Yes
> >
> >>
> >> Fair enough, I'm assuming that you've done some research around it.
> >> Still, this is rather confusing.
> >
> > Yes, I have presented those ~2y ago in Android Gerrit world
> > (got feedback from a few vendors) and in a few Linux conferences.
> >
> > For now we don't plan to have this feature for the thermal
> > governor or something similar.
> >
>
> I have discussed with one of our partners your comment about 2 tables.
> They would like to have this runtime modified EM in other places
> as well: DTPM and thermal governor. So you had good gut feeling.
>
> In the past in our IPA (thermal gov ~2016 and kernel v4.14) we
> had two callbacks:
> - get_static_power() [1]
> - get_dynamic_power() [2]
>
> Later ~2017/2018 v4.16 the static power mechanism was removed
> completely by this commit 84fe2cab48590e4373978e4e.
> The way how it was design, implemented and used justified that
> decision. We later used EM in the cpu cooling which also only
> had dynamic power information.
>
> The PID mechanism in IPA tries to compensate that
> missing information (about changed static power in time or a chip
> binning) and adjusts the 'error'. How good and fast that is in all
> situations - it's a different story (out of this scope).
> So, IPA should not be worse with the runtime table.
>
> The static power was on the chips and probably will be still.
> You might remember my slide 13 from OSPM2024 showing two power
> usage plots for the same Big CPU and 1.4GHz fixed (50% of fmax):
> - w/ GPU working in the background using 1-1.5W
> - w/o GPU in the background
>
> The same workload run on Big, but power bigger is ~15% higher
> after ~1min.
>
> The static power (leakage) is the issue that this patch tries
> to address for EAS. Although, there is not only the leakage.
> It's about the whole 'profile', which can be different than what
> could be built during boot default information.
>
> So we would want to go for one single table in EM, which
> is runtime modifiable.
>
> That is something that you might be more confident and we would
> have less diversity (2 tables) in the kernel.
>
> Regards,
> Lukasz
>
>

Indeed, we had a conversation about this with Lukasz recently. The key
idea is that there is no compelling reason to introduce diversity in
the mathematics involved. If we have confidence in the superior
accuracy of our model, it should be universally implemented. While the
governors are designed with some error tolerance, they can benefit
from enhanced accuracy in their operation.

Thanks!
-Wei

> [1]
> https://elixir.bootlin.com/linux/v4.14/source/drivers/thermal/cpu_cooling.c#L336
> [2]
> https://elixir.bootlin.com/linux/v4.14/source/drivers/thermal/cpu_cooling.c#L383

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 10/18] PM: EM: Add RCU mechanism which safely cleans the old data
  2023-10-11 16:02             ` Wei Wang
@ 2023-10-11 16:07               ` Rafael J. Wysocki
  2023-10-12 13:16                 ` Lukasz Luba
  0 siblings, 1 reply; 58+ messages in thread
From: Rafael J. Wysocki @ 2023-10-11 16:07 UTC (permalink / raw)
  To: Wei Wang
  Cc: Lukasz Luba, Rafael J. Wysocki, linux-kernel, linux-pm,
	dietmar.eggemann, rui.zhang, amit.kucheria, amit.kachhap,
	daniel.lezcano, viresh.kumar, len.brown, pavel, mhiramat,
	qyousef

On Wed, Oct 11, 2023 at 6:03 PM Wei Wang <wvw@google.com> wrote:
>
> On Fri, Oct 6, 2023 at 1:45 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> >
> > Hi Rafael,
> >
> > A change of direction here, regarding your comment below.
> >
> > On 10/2/23 14:44, Lukasz Luba wrote:
> > >
> > >
> > > On 9/29/23 13:59, Rafael J. Wysocki wrote:
> > >> On Fri, Sep 29, 2023 at 11:36 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
> > >
> > > [snip]
> > >
> >
> > [snip]
> >
> > >>>> Apparently, some frameworks are only going to use the default table
> > >>>> while the runtime-updatable table will be used somewhere else at the
> > >>>> same time.
> > >>>>
> > >>>> I'm not really sure if this is a good idea.
> > >>>
> > >>> Runtime table is only for driving the task placement in the EAS.
> > >>>
> > >>> The thermal gov IPA won't make better decisions because it already
> > >>> has the mechanism to accumulate the error that it made.
> > >>>
> > >>> The same applies to DTPM, which works in a more 'configurable' way,
> > >>> rather that hard optimization mechanism (like EAS).
> > >>
> > >> My understanding of the above is that the other EM users don't really
> > >> care that much so they can get away with using the default table all
> > >> the time, but EAS needs more accuracy, so the table used by it needs
> > >> to be adjusted in certain situations.
> > >
> > > Yes
> > >
> > >>
> > >> Fair enough, I'm assuming that you've done some research around it.
> > >> Still, this is rather confusing.
> > >
> > > Yes, I have presented those ~2y ago in Android Gerrit world
> > > (got feedback from a few vendors) and in a few Linux conferences.
> > >
> > > For now we don't plan to have this feature for the thermal
> > > governor or something similar.
> > >
> >
> > I have discussed with one of our partners your comment about 2 tables.
> > They would like to have this runtime modified EM in other places
> > as well: DTPM and thermal governor. So you had good gut feeling.
> >
> > In the past in our IPA (thermal gov ~2016 and kernel v4.14) we
> > had two callbacks:
> > - get_static_power() [1]
> > - get_dynamic_power() [2]
> >
> > Later ~2017/2018 v4.16 the static power mechanism was removed
> > completely by this commit 84fe2cab48590e4373978e4e.
> > The way how it was design, implemented and used justified that
> > decision. We later used EM in the cpu cooling which also only
> > had dynamic power information.
> >
> > The PID mechanism in IPA tries to compensate that
> > missing information (about changed static power in time or a chip
> > binning) and adjusts the 'error'. How good and fast that is in all
> > situations - it's a different story (out of this scope).
> > So, IPA should not be worse with the runtime table.
> >
> > The static power was on the chips and probably will be still.
> > You might remember my slide 13 from OSPM2024 showing two power
> > usage plots for the same Big CPU and 1.4GHz fixed (50% of fmax):
> > - w/ GPU working in the background using 1-1.5W
> > - w/o GPU in the background
> >
> > The same workload run on Big, but power bigger is ~15% higher
> > after ~1min.
> >
> > The static power (leakage) is the issue that this patch tries
> > to address for EAS. Although, there is not only the leakage.
> > It's about the whole 'profile', which can be different than what
> > could be built during boot default information.
> >
> > So we would want to go for one single table in EM, which
> > is runtime modifiable.
> >
> > That is something that you might be more confident and we would
> > have less diversity (2 tables) in the kernel.
> >
> > Regards,
> > Lukasz
> >
> >
>
> Indeed, we had a conversation about this with Lukasz recently. The key
> idea is that there is no compelling reason to introduce diversity in
> the mathematics involved. If we have confidence in the superior
> accuracy of our model, it should be universally implemented. While the
> governors are designed with some error tolerance, they can benefit
> from enhanced accuracy in their operation.

I agree, thanks!

> > [1]
> > https://elixir.bootlin.com/linux/v4.14/source/drivers/thermal/cpu_cooling.c#L336
> > [2]
> > https://elixir.bootlin.com/linux/v4.14/source/drivers/thermal/cpu_cooling.c#L383

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 10/18] PM: EM: Add RCU mechanism which safely cleans the old data
  2023-10-11 16:07               ` Rafael J. Wysocki
@ 2023-10-12 13:16                 ` Lukasz Luba
  0 siblings, 0 replies; 58+ messages in thread
From: Lukasz Luba @ 2023-10-12 13:16 UTC (permalink / raw)
  To: Rafael J. Wysocki, Wei Wang
  Cc: linux-kernel, linux-pm, dietmar.eggemann, rui.zhang,
	amit.kucheria, amit.kachhap, daniel.lezcano, viresh.kumar,
	len.brown, pavel, mhiramat, qyousef



On 10/11/23 17:07, Rafael J. Wysocki wrote:
> On Wed, Oct 11, 2023 at 6:03 PM Wei Wang <wvw@google.com> wrote:
>>
>> On Fri, Oct 6, 2023 at 1:45 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>>
>>> Hi Rafael,
>>>
>>> A change of direction here, regarding your comment below.
>>>
>>> On 10/2/23 14:44, Lukasz Luba wrote:
>>>>
>>>>
>>>> On 9/29/23 13:59, Rafael J. Wysocki wrote:
>>>>> On Fri, Sep 29, 2023 at 11:36 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>>>
>>>> [snip]
>>>>
>>>
>>> [snip]
>>>
>>>>>>> Apparently, some frameworks are only going to use the default table
>>>>>>> while the runtime-updatable table will be used somewhere else at the
>>>>>>> same time.
>>>>>>>
>>>>>>> I'm not really sure if this is a good idea.
>>>>>>
>>>>>> Runtime table is only for driving the task placement in the EAS.
>>>>>>
>>>>>> The thermal gov IPA won't make better decisions because it already
>>>>>> has the mechanism to accumulate the error that it made.
>>>>>>
>>>>>> The same applies to DTPM, which works in a more 'configurable' way,
>>>>>> rather that hard optimization mechanism (like EAS).
>>>>>
>>>>> My understanding of the above is that the other EM users don't really
>>>>> care that much so they can get away with using the default table all
>>>>> the time, but EAS needs more accuracy, so the table used by it needs
>>>>> to be adjusted in certain situations.
>>>>
>>>> Yes
>>>>
>>>>>
>>>>> Fair enough, I'm assuming that you've done some research around it.
>>>>> Still, this is rather confusing.
>>>>
>>>> Yes, I have presented those ~2y ago in Android Gerrit world
>>>> (got feedback from a few vendors) and in a few Linux conferences.
>>>>
>>>> For now we don't plan to have this feature for the thermal
>>>> governor or something similar.
>>>>
>>>
>>> I have discussed with one of our partners your comment about 2 tables.
>>> They would like to have this runtime modified EM in other places
>>> as well: DTPM and thermal governor. So you had good gut feeling.
>>>
>>> In the past in our IPA (thermal gov ~2016 and kernel v4.14) we
>>> had two callbacks:
>>> - get_static_power() [1]
>>> - get_dynamic_power() [2]
>>>
>>> Later ~2017/2018 v4.16 the static power mechanism was removed
>>> completely by this commit 84fe2cab48590e4373978e4e.
>>> The way how it was design, implemented and used justified that
>>> decision. We later used EM in the cpu cooling which also only
>>> had dynamic power information.
>>>
>>> The PID mechanism in IPA tries to compensate that
>>> missing information (about changed static power in time or a chip
>>> binning) and adjusts the 'error'. How good and fast that is in all
>>> situations - it's a different story (out of this scope).
>>> So, IPA should not be worse with the runtime table.
>>>
>>> The static power was on the chips and probably will be still.
>>> You might remember my slide 13 from OSPM2024 showing two power
>>> usage plots for the same Big CPU and 1.4GHz fixed (50% of fmax):
>>> - w/ GPU working in the background using 1-1.5W
>>> - w/o GPU in the background
>>>
>>> The same workload run on Big, but power bigger is ~15% higher
>>> after ~1min.
>>>
>>> The static power (leakage) is the issue that this patch tries
>>> to address for EAS. Although, there is not only the leakage.
>>> It's about the whole 'profile', which can be different than what
>>> could be built during boot default information.
>>>
>>> So we would want to go for one single table in EM, which
>>> is runtime modifiable.
>>>
>>> That is something that you might be more confident and we would
>>> have less diversity (2 tables) in the kernel.
>>>
>>> Regards,
>>> Lukasz
>>>
>>>
>>
>> Indeed, we had a conversation about this with Lukasz recently. The key
>> idea is that there is no compelling reason to introduce diversity in
>> the mathematics involved. If we have confidence in the superior
>> accuracy of our model, it should be universally implemented. While the
>> governors are designed with some error tolerance, they can benefit
>> from enhanced accuracy in their operation.
> 
> I agree, thanks!
> 

Thank you Wei and Rafael. I'm working on that implementation and
will be in v5 soon.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 03/18] PM: EM: Find first CPU online while updating OPP efficiency
  2023-09-25  8:11 ` [PATCH v4 03/18] PM: EM: Find first CPU online while updating OPP efficiency Lukasz Luba
  2023-09-26 18:32   ` Rafael J. Wysocki
@ 2023-10-23 17:06   ` Daniel Lezcano
  2023-10-24  7:50     ` Lukasz Luba
  1 sibling, 1 reply; 58+ messages in thread
From: Daniel Lezcano @ 2023-10-23 17:06 UTC (permalink / raw)
  To: Lukasz Luba, linux-kernel, linux-pm, rafael
  Cc: dietmar.eggemann, rui.zhang, amit.kucheria, amit.kachhap,
	viresh.kumar, len.brown, pavel, mhiramat, qyousef, wvw


Hi Lukasz,

On 25/09/2023 10:11, Lukasz Luba wrote:
> The Energy Model might be updated at runtime and the energy efficiency
> for each OPP may change. Thus, there is a need to update also the
> cpufreq framework and make it aligned to the new values. In order to
> do that, use a first online CPU from the Performance Domain.

I'm failing to do the connection with the description and the change.

Perhaps, the changelog shall explain why 'cpu' must be replaced with the 
first active cpu ?

> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>   kernel/power/energy_model.c | 11 +++++++++--
>   1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index 42486674b834..3dafdd7731c4 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -243,12 +243,19 @@ em_cpufreq_update_efficiencies(struct device *dev, struct em_perf_state *table)
>   	struct em_perf_domain *pd = dev->em_pd;
>   	struct cpufreq_policy *policy;
>   	int found = 0;
> -	int i;
> +	int i, cpu;
>   
>   	if (!_is_cpu_device(dev) || !pd)
>   		return;
>   
> -	policy = cpufreq_cpu_get(cpumask_first(em_span_cpus(pd)));
> +	/* Try to get a CPU which is online and in this PD */
> +	cpu = cpumask_first_and(em_span_cpus(pd), cpu_active_mask);
> +	if (cpu >= nr_cpu_ids) {
> +		dev_warn(dev, "EM: No online CPU for CPUFreq policy\n");
> +		return;
> +	}
> +
> +	policy = cpufreq_cpu_get(cpu);
>   	if (!policy) {
>   		dev_warn(dev, "EM: Access to CPUFreq policy failed\n");
>   		return;

-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 04/18] PM: EM: Refactor em_pd_get_efficient_state() to be more flexible
  2023-09-25  8:11 ` [PATCH v4 04/18] PM: EM: Refactor em_pd_get_efficient_state() to be more flexible Lukasz Luba
@ 2023-10-23 17:39   ` Daniel Lezcano
  2023-10-24  8:09     ` Lukasz Luba
  0 siblings, 1 reply; 58+ messages in thread
From: Daniel Lezcano @ 2023-10-23 17:39 UTC (permalink / raw)
  To: Lukasz Luba, linux-kernel, linux-pm, rafael
  Cc: dietmar.eggemann, rui.zhang, amit.kucheria, amit.kachhap,
	viresh.kumar, len.brown, pavel, mhiramat, qyousef, wvw

On 25/09/2023 10:11, Lukasz Luba wrote:
> The Energy Model (EM) is going to support runtime modification. There
> are going to be 2 EM tables which store information. This patch aims
> to prepare the code to be generic and use one of the tables. The function
> will no longer get a pointer to 'struct em_perf_domain' (the EM) but
> instead a pointer to 'struct em_perf_state' (which is one of the EM's
> tables).
> 
> Prepare em_pd_get_efficient_state() for the upcoming changes and
> make it possible to re-use. Return an index for the best performance
> state for a given EM table. The function arguments that are introduced
> should allow to work on different performance state arrays. The caller of
> em_pd_get_efficient_state() should be able to use the index either
> on the default or the modifiable EM table.
> 
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---

[ ... ]

> @@ -251,7 +253,9 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>   	 * Find the lowest performance state of the Energy Model above the
>   	 * requested frequency.
>   	 */
> -	ps = em_pd_get_efficient_state(pd, freq);
> +	i = em_pd_get_efficient_state(pd->table, pd->nr_perf_states, freq,
> +				      pd->flags);

nitpicking but s/i/state/

Other than that:

Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>


> +	ps = &pd->table[i];
>   
>   	/*
>   	 * The capacity of a CPU in the domain at the performance state (ps)

-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 06/18] PM: EM: Check if the get_cost() callback is present in em_compute_costs()
  2023-09-25  8:11 ` [PATCH v4 06/18] PM: EM: Check if the get_cost() callback is present in em_compute_costs() Lukasz Luba
  2023-09-26 18:46   ` Rafael J. Wysocki
@ 2023-10-23 18:23   ` Daniel Lezcano
  2023-10-24  8:14     ` Lukasz Luba
  1 sibling, 1 reply; 58+ messages in thread
From: Daniel Lezcano @ 2023-10-23 18:23 UTC (permalink / raw)
  To: Lukasz Luba, linux-kernel, linux-pm, rafael
  Cc: dietmar.eggemann, rui.zhang, amit.kucheria, amit.kachhap,
	viresh.kumar, len.brown, pavel, mhiramat, qyousef, wvw

On 25/09/2023 10:11, Lukasz Luba wrote:
> The em_compute_cost() is going to be re-used in runtime modified EM
> code path. Thus, make sure that this common code is safe and won't
> try to use the NULL pointer. The former em_compute_cost() didn't have to
> care about runtime modification code path. The upcoming changes introduce
> such option, but with different callback. Those two paths which use
> get_cost() (during first EM registration) or update_power() (during
> runtime modification) need to be safely handled in em_compute_costs().
> 
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>   kernel/power/energy_model.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index 7ea882401833..35e07933b34a 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -116,7 +116,7 @@ static int em_compute_costs(struct device *dev, struct em_perf_state *table,
>   	for (i = nr_states - 1; i >= 0; i--) {
>   		unsigned long power_res, cost;
>   
> -		if (flags & EM_PERF_DOMAIN_ARTIFICIAL) {
> +		if (flags & EM_PERF_DOMAIN_ARTIFICIAL && cb->get_cost) {
>   			ret = cb->get_cost(dev, table[i].frequency, &cost);
>   			if (ret || !cost || cost > EM_MAX_POWER) {
>   				dev_err(dev, "EM: invalid cost %lu %d\n",

I do believe & operator has lower precedence than && operator, thus the 
test is actually:

	(flags & (EM_PERF_DOMAIN_ARTIFICIAL && cb->get_cost))

but it should be

	((flags & EM_PERF_DOMAIN_ARTIFICIAL) && cb->get_cost)

Right ?

-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 03/18] PM: EM: Find first CPU online while updating OPP efficiency
  2023-10-23 17:06   ` Daniel Lezcano
@ 2023-10-24  7:50     ` Lukasz Luba
  0 siblings, 0 replies; 58+ messages in thread
From: Lukasz Luba @ 2023-10-24  7:50 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: dietmar.eggemann, rui.zhang, rafael, linux-pm, linux-kernel,
	amit.kucheria, amit.kachhap, viresh.kumar, len.brown, pavel,
	mhiramat, qyousef, wvw

Hi Daniel,

Thanks for looking at the patches!

On 10/23/23 18:06, Daniel Lezcano wrote:
> 
> Hi Lukasz,
> 
> On 25/09/2023 10:11, Lukasz Luba wrote:
>> The Energy Model might be updated at runtime and the energy efficiency
>> for each OPP may change. Thus, there is a need to update also the
>> cpufreq framework and make it aligned to the new values. In order to
>> do that, use a first online CPU from the Performance Domain.
> 
> I'm failing to do the connection with the description and the change.
> 
> Perhaps, the changelog shall explain why 'cpu' must be replaced with the 
> first active cpu ?

It's not a big problem now for EM, since during the boot the first CPU
in the 'policy' is actually registering the EM. Although, this is an
assumption and for the new runtime update of EM, we cannot assume
that first is online. That's the motivation of the change. In a corner
case all CPUs might be put offline, but the EM is still there because
we never unregister EM for CPUs (to not race with task scheduler).

I will add that description to the patch header.

Thanks,
Lukasz

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 04/18] PM: EM: Refactor em_pd_get_efficient_state() to be more flexible
  2023-10-23 17:39   ` Daniel Lezcano
@ 2023-10-24  8:09     ` Lukasz Luba
  0 siblings, 0 replies; 58+ messages in thread
From: Lukasz Luba @ 2023-10-24  8:09 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: dietmar.eggemann, rui.zhang, amit.kucheria, amit.kachhap,
	viresh.kumar, len.brown, pavel, mhiramat, qyousef, wvw,
	linux-kernel, linux-pm, rafael



On 10/23/23 18:39, Daniel Lezcano wrote:
> On 25/09/2023 10:11, Lukasz Luba wrote:
>> The Energy Model (EM) is going to support runtime modification. There
>> are going to be 2 EM tables which store information. This patch aims
>> to prepare the code to be generic and use one of the tables. The function
>> will no longer get a pointer to 'struct em_perf_domain' (the EM) but
>> instead a pointer to 'struct em_perf_state' (which is one of the EM's
>> tables).
>>
>> Prepare em_pd_get_efficient_state() for the upcoming changes and
>> make it possible to re-use. Return an index for the best performance
>> state for a given EM table. The function arguments that are introduced
>> should allow to work on different performance state arrays. The caller of
>> em_pd_get_efficient_state() should be able to use the index either
>> on the default or the modifiable EM table.
>>
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
> 
> [ ... ]
> 
>> @@ -251,7 +253,9 @@ static inline unsigned long em_cpu_energy(struct 
>> em_perf_domain *pd,
>>        * Find the lowest performance state of the Energy Model above the
>>        * requested frequency.
>>        */
>> -    ps = em_pd_get_efficient_state(pd, freq);
>> +    i = em_pd_get_efficient_state(pd->table, pd->nr_perf_states, freq,
>> +                      pd->flags);
> 
> nitpicking but s/i/state/

Here it makes sense, I'll try to use 'state', but if that could be a bit
odd in later patches code, where I have:

ps = &runtime_table->state[i];

than:

'->state[state]'

won't fly. Although, let me check, because I'm going to drop the
2 tables design so some fields might get different names.

> 
> Other than that:
> 
> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>

Thanks!

> 
> 
>> +    ps = &pd->table[i];
>>       /*
>>        * The capacity of a CPU in the domain at the performance state 
>> (ps)
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v4 06/18] PM: EM: Check if the get_cost() callback is present in em_compute_costs()
  2023-10-23 18:23   ` Daniel Lezcano
@ 2023-10-24  8:14     ` Lukasz Luba
  0 siblings, 0 replies; 58+ messages in thread
From: Lukasz Luba @ 2023-10-24  8:14 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: dietmar.eggemann, rui.zhang, linux-kernel, amit.kucheria,
	amit.kachhap, viresh.kumar, len.brown, pavel, mhiramat, qyousef,
	wvw, linux-pm, rafael



On 10/23/23 19:23, Daniel Lezcano wrote:
> On 25/09/2023 10:11, Lukasz Luba wrote:
>> The em_compute_cost() is going to be re-used in runtime modified EM
>> code path. Thus, make sure that this common code is safe and won't
>> try to use the NULL pointer. The former em_compute_cost() didn't have to
>> care about runtime modification code path. The upcoming changes introduce
>> such option, but with different callback. Those two paths which use
>> get_cost() (during first EM registration) or update_power() (during
>> runtime modification) need to be safely handled in em_compute_costs().
>>
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
>>   kernel/power/energy_model.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>> index 7ea882401833..35e07933b34a 100644
>> --- a/kernel/power/energy_model.c
>> +++ b/kernel/power/energy_model.c
>> @@ -116,7 +116,7 @@ static int em_compute_costs(struct device *dev, 
>> struct em_perf_state *table,
>>       for (i = nr_states - 1; i >= 0; i--) {
>>           unsigned long power_res, cost;
>> -        if (flags & EM_PERF_DOMAIN_ARTIFICIAL) {
>> +        if (flags & EM_PERF_DOMAIN_ARTIFICIAL && cb->get_cost) {
>>               ret = cb->get_cost(dev, table[i].frequency, &cost);
>>               if (ret || !cost || cost > EM_MAX_POWER) {
>>                   dev_err(dev, "EM: invalid cost %lu %d\n",
> 
> I do believe & operator has lower precedence than && operator, thus the 
> test is actually:
> 
>      (flags & (EM_PERF_DOMAIN_ARTIFICIAL && cb->get_cost))
> 
> but it should be
> 
>      ((flags & EM_PERF_DOMAIN_ARTIFICIAL) && cb->get_cost)
> 
> Right ?
> 

The bitwise '&' is stronger than logical '&&', so the code will
work as in your 2nd example. Although, I will change it and add
parentheses for better reading.

Thanks!

^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2023-10-24  8:13 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-25  8:11 [PATCH v4 00/18] Introduce runtime modifiable Energy Model Lukasz Luba
2023-09-25  8:11 ` [PATCH v4 01/18] PM: EM: Add missing newline for the message log Lukasz Luba
2023-09-25  8:11 ` [PATCH v4 02/18] PM: EM: Refactor em_cpufreq_update_efficiencies() arguments Lukasz Luba
2023-09-25  8:11 ` [PATCH v4 03/18] PM: EM: Find first CPU online while updating OPP efficiency Lukasz Luba
2023-09-26 18:32   ` Rafael J. Wysocki
2023-09-29  8:32     ` Lukasz Luba
2023-10-23 17:06   ` Daniel Lezcano
2023-10-24  7:50     ` Lukasz Luba
2023-09-25  8:11 ` [PATCH v4 04/18] PM: EM: Refactor em_pd_get_efficient_state() to be more flexible Lukasz Luba
2023-10-23 17:39   ` Daniel Lezcano
2023-10-24  8:09     ` Lukasz Luba
2023-09-25  8:11 ` [PATCH v4 05/18] PM: EM: Refactor a new function em_compute_costs() Lukasz Luba
2023-09-26 18:39   ` Rafael J. Wysocki
2023-09-29  8:38     ` Lukasz Luba
2023-09-25  8:11 ` [PATCH v4 06/18] PM: EM: Check if the get_cost() callback is present in em_compute_costs() Lukasz Luba
2023-09-26 18:46   ` Rafael J. Wysocki
2023-09-29  8:42     ` Lukasz Luba
2023-10-23 18:23   ` Daniel Lezcano
2023-10-24  8:14     ` Lukasz Luba
2023-09-25  8:11 ` [PATCH v4 07/18] PM: EM: Refactor struct em_perf_domain and add default_table Lukasz Luba
2023-09-26 18:52   ` Rafael J. Wysocki
2023-09-29  8:45     ` Lukasz Luba
2023-09-25  8:11 ` [PATCH v4 08/18] PM: EM: Add update_power() callback for runtime modifications Lukasz Luba
2023-09-26 18:59   ` Rafael J. Wysocki
2023-09-29  9:00     ` Lukasz Luba
2023-09-29 12:18       ` Rafael J. Wysocki
2023-09-25  8:11 ` [PATCH v4 09/18] PM: EM: Introduce runtime modifiable table Lukasz Luba
2023-09-26 19:12   ` Rafael J. Wysocki
2023-09-29  9:16     ` Lukasz Luba
2023-09-29 12:27       ` Rafael J. Wysocki
2023-10-06  8:03         ` Lukasz Luba
2023-09-25  8:11 ` [PATCH v4 10/18] PM: EM: Add RCU mechanism which safely cleans the old data Lukasz Luba
2023-09-26 10:28   ` kernel test robot
2023-09-26 19:26   ` Rafael J. Wysocki
2023-09-29  9:36     ` Lukasz Luba
2023-09-29 12:59       ` Rafael J. Wysocki
2023-10-02 13:44         ` Lukasz Luba
2023-10-06  8:46           ` Lukasz Luba
2023-10-11 16:02             ` Wei Wang
2023-10-11 16:07               ` Rafael J. Wysocki
2023-10-12 13:16                 ` Lukasz Luba
2023-09-25  8:11 ` [PATCH v4 11/18] PM: EM: Add runtime update interface to modify EM power Lukasz Luba
2023-09-26 17:21   ` kernel test robot
2023-09-26 19:48   ` Rafael J. Wysocki
2023-09-29 10:00     ` Lukasz Luba
2023-09-29 13:18       ` Rafael J. Wysocki
2023-10-02 14:09         ` Lukasz Luba
2023-09-25  8:11 ` [PATCH v4 12/18] PM: EM: Use runtime modified EM for CPUs energy estimation in EAS Lukasz Luba
2023-09-26 19:54   ` Rafael J. Wysocki
2023-09-29 10:10     ` Lukasz Luba
2023-09-25  8:11 ` [PATCH v4 13/18] Documentation: EM: Update with runtime modification design Lukasz Luba
2023-09-25  8:11 ` [PATCH v4 14/18] PM: EM: Add performance field to struct em_perf_state Lukasz Luba
2023-09-25  8:11 ` [PATCH v4 15/18] PM: EM: Adjust performance with runtime modification callback Lukasz Luba
2023-09-25  8:11 ` [PATCH v4 16/18] PM: EM: Support late CPUs booting and capacity adjustment Lukasz Luba
2023-09-25  8:11 ` [PATCH v4 17/18] PM: EM: Optimize em_cpu_energy() and remove division Lukasz Luba
2023-09-25  8:11 ` [PATCH v4 18/18] Documentation: EM: Update information about performance field Lukasz Luba
2023-09-28 21:56 ` [PATCH v4 00/18] Introduce runtime modifiable Energy Model Qais Yousef
2023-10-03  8:06   ` Lukasz Luba

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.