All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 0/9] Inefficient OPPs
@ 2021-07-08 10:08 Vincent Donnefort
  2021-07-08 10:08 ` [PATCH v4 1/9] PM / EM: Fix inefficient states detection Vincent Donnefort
                   ` (9 more replies)
  0 siblings, 10 replies; 31+ messages in thread
From: Vincent Donnefort @ 2021-07-08 10:08 UTC (permalink / raw)
  To: peterz, rjw, viresh.kumar, vincent.guittot, qperret
  Cc: linux-pm, ionela.voinescu, lukasz.luba, dietmar.eggemann, mka,
	Vincent Donnefort

Hi all,

Here's the new patch-set version that brings support for skipping
inefficiencies found by the Energy Model. This version doesn't bring
changes for all the drivers that could benefit from this work at the
moment. I'll do that in the next version or in a separated patch-set.
Also, it's been discussed that enabling RELATION_E should be a driver
flag. This sadly needs to be read in functions that do not have access to
cpufreq_driver. Hence, I created a new policy flag instead.

A bit of context:

We (Power team in Arm) are working with an experimental kernel for the
Google's Pixel4 to evaluate and improve the current mainline performance
and energy consumption on a real life device with Android.

The SD855 SoC found in this phone has several OPPs that are inefficient.
I.e. despite a lower frequency, they have a greater cost. (That cost being
fmax * OPP power / OPP freq). This issue is twofold. First of course,
running a specific workload at an inefficient OPP is counterproductive
since it wastes wasting energy. But also, inefficient OPPs make a
performance domain less appealing for task placement than it really is.

We evaluated the change presented here by running 30 iterations of Android
PCMark "Work 2.0 Performance". While we did not see any statistically
significant performance impact, this change allowed to drastically improve
the idle time residency.


                           |   Running   |  WFI [1]  |    Idle   |
   ------------------------+-------------+-----------+-----------+
   Little cluster (4 CPUs) |    -0.35%   |   +0.35%  |   +0.79%  |
   ------------------------+-------------+-----------+-----------+
   Medium cluster (3 CPUs) |    -6.3%    |    -18%   |    +12%   |
   ------------------------+-------------+-----------+-----------+
   Big cluster    (1 CPU)  |    -6.4%    |    -6.5%  |    +2.8%  |
   ------------------------+-------------+-----------+-----------+

On the SD855, the inefficient OPPs are found on the little cluster. By
removing them from the Energy Model, we make the most efficient CPUs more
appealing for task placement, helping to reduce the running time for the
medium and big CPUs. Increasing idle time is crucial for this platform due
to the substantial energy cost differences among the clusters. Also,
despite not appearing in the statistics (the idle driver used here doesn't
report it), we can speculate that we also improve the cluster idle time.

[1] WFI: Wait for interrupt.

Changelog since v3:
  - New freq-table relation CPUFREQ_RELATION_E.
  - New CPUFreq driver flag CPUFREQ_READ_ENERGY_MODEL.
  - EM flag to skip or not inefficiencies (driven by CPUFreq).
  - Fix infinite loop in set_freq_table_efficiencies().

Changelog since v2:
  - Add separated support for inefficiencies into CPUFreq.
  - Collect Reviewed-by for the first patch.

Changelog since v1:
  - Remove the Look-up table as the numbers weren't strong enough to
    justify the implementation.
  - Split the patch.

Vincent Donnefort (9):
  PM / EM: Fix inefficient states detection
  PM / EM: Mark inefficient states
  PM / EM: Extend em_perf_domain with a flag field
  PM / EM: Allow skipping inefficient states
  cpufreq: Add an interface to mark inefficient frequencies
  cpufreq: Add a new freq-table relation CPUFREQ_RELATION_E
  cpufreq: CPUFREQ_RELATION_E in schedutil ondemand and conservative
  cpufreq: Add driver flag CPUFREQ_READ_ENERGY_MODEL
  cpufreq: dt: Add CPUFREQ_READ_ENERGY_MODEL

 drivers/cpufreq/cpufreq-dt.c           |  2 +-
 drivers/cpufreq/cpufreq.c              | 67 ++++++++++++++++++++++++++++++++-
 drivers/cpufreq/cpufreq_conservative.c |  2 +-
 drivers/cpufreq/cpufreq_ondemand.c     |  4 +-
 drivers/cpufreq/freq_table.c           | 57 +++++++++++++++++++++++++++-
 include/linux/cpufreq.h                | 50 +++++++++++++++++++++++--
 include/linux/energy_model.h           | 68 +++++++++++++++++++++++++++++-----
 kernel/power/energy_model.c            | 46 ++++++++++++++---------
 kernel/sched/cpufreq_schedutil.c       |  2 +-
 9 files changed, 260 insertions(+), 38 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v4 1/9] PM / EM: Fix inefficient states detection
  2021-07-08 10:08 [PATCH v4 0/9] Inefficient OPPs Vincent Donnefort
@ 2021-07-08 10:08 ` Vincent Donnefort
  2021-07-08 10:08 ` [PATCH v4 2/9] PM / EM: Mark inefficient states Vincent Donnefort
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Vincent Donnefort @ 2021-07-08 10:08 UTC (permalink / raw)
  To: peterz, rjw, viresh.kumar, vincent.guittot, qperret
  Cc: linux-pm, ionela.voinescu, lukasz.luba, dietmar.eggemann, mka,
	Vincent Donnefort

Currently, a debug message is printed if an inefficient state is detected
in the Energy Model. Unfortunately, it won't detect if the first state is
inefficient or if two successive states are. Fix this behavior.

Fixes: 27871f7a8a34 (PM: Introduce an Energy Model management framework)
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Matthias Kaehlcke <mka@chromium.org>

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 0c620ebdc2e8..c4871a8ff977 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -107,8 +107,7 @@ static void em_debug_remove_pd(struct device *dev) {}
 static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
 				int nr_states, struct em_data_callback *cb)
 {
-	unsigned long opp_eff, prev_opp_eff = ULONG_MAX;
-	unsigned long power, freq, prev_freq = 0;
+	unsigned long power, freq, prev_freq = 0, prev_cost = ULONG_MAX;
 	struct em_perf_state *table;
 	int i, ret;
 	u64 fmax;
@@ -153,25 +152,19 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
 
 		table[i].power = power;
 		table[i].frequency = prev_freq = freq;
-
-		/*
-		 * The hertz/watts efficiency ratio should decrease as the
-		 * frequency grows on sane platforms. But this isn't always
-		 * true in practice so warn the user if a higher OPP is more
-		 * power efficient than a lower one.
-		 */
-		opp_eff = freq / power;
-		if (opp_eff >= prev_opp_eff)
-			dev_dbg(dev, "EM: hertz/watts ratio non-monotonically decreasing: em_perf_state %d >= em_perf_state%d\n",
-					i, i - 1);
-		prev_opp_eff = opp_eff;
 	}
 
 	/* Compute the cost of each performance state. */
 	fmax = (u64) table[nr_states - 1].frequency;
-	for (i = 0; i < nr_states; i++) {
+	for (i = nr_states - 1; i >= 0; i--) {
 		table[i].cost = div64_u64(fmax * table[i].power,
 					  table[i].frequency);
+		if (table[i].cost >= prev_cost) {
+			dev_dbg(dev, "EM: OPP:%lu is inefficient\n",
+				table[i].frequency);
+		} else {
+			prev_cost = table[i].cost;
+		}
 	}
 
 	pd->table = table;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v4 2/9] PM / EM: Mark inefficient states
  2021-07-08 10:08 [PATCH v4 0/9] Inefficient OPPs Vincent Donnefort
  2021-07-08 10:08 ` [PATCH v4 1/9] PM / EM: Fix inefficient states detection Vincent Donnefort
@ 2021-07-08 10:08 ` Vincent Donnefort
  2021-07-22  7:25   ` Lukasz Luba
  2021-07-08 10:09 ` [PATCH v4 3/9] PM / EM: Extend em_perf_domain with a flag field Vincent Donnefort
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Vincent Donnefort @ 2021-07-08 10:08 UTC (permalink / raw)
  To: peterz, rjw, viresh.kumar, vincent.guittot, qperret
  Cc: linux-pm, ionela.voinescu, lukasz.luba, dietmar.eggemann, mka,
	Vincent Donnefort

Some SoCs, such as the sd855 have OPPs within the same performance domain,
whose cost is higher than others with a higher frequency. Even though
those OPPs are interesting from a cooling perspective, it makes no sense
to use them when the device can run at full capacity. Those OPPs handicap
the performance domain, when choosing the most energy-efficient CPU and
are wasting energy. They are inefficient.

Hence, add support for such OPPs to the Energy Model. The table can now
be read skipping inefficient performance states (and by extension,
inefficient OPPs).

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Reviewed-by: Matthias Kaehlcke <mka@chromium.org>

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 3f221dbf5f95..7ca4f9cc8baf 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -17,13 +17,25 @@
  *		device). It can be a total power: static and dynamic.
  * @cost:	The cost coefficient associated with this level, used during
  *		energy calculation. Equal to: power * max_frequency / frequency
+ * @flags:	see "em_perf_state flags" description below.
  */
 struct em_perf_state {
 	unsigned long frequency;
 	unsigned long power;
 	unsigned long cost;
+	unsigned long flags;
 };
 
+/*
+ * em_perf_state flags:
+ *
+ * EM_PERF_STATE_INEFFICIENT: The performance state is inefficient. There is
+ * in this em_perf_domain, another performance state with a higher frequency
+ * but a lower or equal power cost. Such inefficient states are ignored when
+ * using em_pd_get_efficient_*() functions.
+ */
+#define EM_PERF_STATE_INEFFICIENT BIT(0)
+
 /**
  * em_perf_domain - Performance domain
  * @table:		List of performance states, in ascending order
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index c4871a8ff977..30ab73ab6439 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -2,7 +2,7 @@
 /*
  * Energy Model of devices
  *
- * Copyright (c) 2018-2020, Arm ltd.
+ * Copyright (c) 2018-2021, Arm ltd.
  * Written by: Quentin Perret, Arm ltd.
  * Improvements provided by: Lukasz Luba, Arm ltd.
  */
@@ -42,6 +42,7 @@ static void em_debug_create_ps(struct em_perf_state *ps, struct dentry *pd)
 	debugfs_create_ulong("frequency", 0444, d, &ps->frequency);
 	debugfs_create_ulong("power", 0444, d, &ps->power);
 	debugfs_create_ulong("cost", 0444, d, &ps->cost);
+	debugfs_create_ulong("inefficient", 0444, d, &ps->flags);
 }
 
 static int em_debug_cpus_show(struct seq_file *s, void *unused)
@@ -160,6 +161,7 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
 		table[i].cost = div64_u64(fmax * table[i].power,
 					  table[i].frequency);
 		if (table[i].cost >= prev_cost) {
+			table[i].flags = EM_PERF_STATE_INEFFICIENT;
 			dev_dbg(dev, "EM: OPP:%lu is inefficient\n",
 				table[i].frequency);
 		} else {
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v4 3/9] PM / EM: Extend em_perf_domain with a flag field
  2021-07-08 10:08 [PATCH v4 0/9] Inefficient OPPs Vincent Donnefort
  2021-07-08 10:08 ` [PATCH v4 1/9] PM / EM: Fix inefficient states detection Vincent Donnefort
  2021-07-08 10:08 ` [PATCH v4 2/9] PM / EM: Mark inefficient states Vincent Donnefort
@ 2021-07-08 10:09 ` Vincent Donnefort
  2021-07-22  7:27   ` Lukasz Luba
  2021-07-08 10:09 ` [PATCH v4 4/9] PM / EM: Allow skipping inefficient states Vincent Donnefort
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Vincent Donnefort @ 2021-07-08 10:09 UTC (permalink / raw)
  To: peterz, rjw, viresh.kumar, vincent.guittot, qperret
  Cc: linux-pm, ionela.voinescu, lukasz.luba, dietmar.eggemann, mka,
	Vincent Donnefort

Merge the current "milliwatts" option into a "flag" field. This intends to
prepare the extension of this structure for inefficient states support in
the Energy Model.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 7ca4f9cc8baf..1deb727245be 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -40,8 +40,7 @@ struct em_perf_state {
  * em_perf_domain - Performance domain
  * @table:		List of performance states, in ascending order
  * @nr_perf_states:	Number of performance states
- * @milliwatts:		Flag indicating the power values are in milli-Watts
- *			or some other scale.
+ * @flags:		See "em_perf_domain flags"
  * @cpus:		Cpumask covering the CPUs of the domain. It's here
  *			for performance reasons to avoid potential cache
  *			misses during energy calculations in the scheduler
@@ -56,10 +55,18 @@ struct em_perf_state {
 struct em_perf_domain {
 	struct em_perf_state *table;
 	int nr_perf_states;
-	int milliwatts;
+	unsigned long flags;
 	unsigned long cpus[];
 };
 
+/*
+ *  em_perf_domain flags:
+ *
+ *  EM_PERF_DOMAIN_MILLIWATTS: The power values are in milli-Watts or some
+ *  other scale.
+ */
+#define EM_PERF_DOMAIN_MILLIWATTS BIT(0)
+
 #define em_span_cpus(em) (to_cpumask((em)->cpus))
 
 #ifdef CONFIG_ENERGY_MODEL
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 30ab73ab6439..3ab0b913bcfa 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -56,7 +56,8 @@ DEFINE_SHOW_ATTRIBUTE(em_debug_cpus);
 static int em_debug_units_show(struct seq_file *s, void *unused)
 {
 	struct em_perf_domain *pd = s->private;
-	char *units = pd->milliwatts ? "milliWatts" : "bogoWatts";
+	char *units = (pd->flags & EM_PERF_DOMAIN_MILLIWATTS) ?
+		"milliWatts" : "bogoWatts";
 
 	seq_printf(s, "%s\n", units);
 
@@ -343,7 +344,8 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
 	if (ret)
 		goto unlock;
 
-	dev->em_pd->milliwatts = milliwatts;
+	if (milliwatts)
+		dev->em_pd->flags |= EM_PERF_DOMAIN_MILLIWATTS;
 
 	em_debug_create_pd(dev);
 	dev_info(dev, "EM: created perf domain\n");
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v4 4/9] PM / EM: Allow skipping inefficient states
  2021-07-08 10:08 [PATCH v4 0/9] Inefficient OPPs Vincent Donnefort
                   ` (2 preceding siblings ...)
  2021-07-08 10:09 ` [PATCH v4 3/9] PM / EM: Extend em_perf_domain with a flag field Vincent Donnefort
@ 2021-07-08 10:09 ` Vincent Donnefort
  2021-07-22 13:09   ` Lukasz Luba
  2021-07-08 10:09 ` [PATCH v4 5/9] cpufreq: Add an interface to mark inefficient frequencies Vincent Donnefort
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Vincent Donnefort @ 2021-07-08 10:09 UTC (permalink / raw)
  To: peterz, rjw, viresh.kumar, vincent.guittot, qperret
  Cc: linux-pm, ionela.voinescu, lukasz.luba, dietmar.eggemann, mka,
	Vincent Donnefort

The new performance domain flag EM_PERF_DOMAIN_SKIP_INEFFICIENCIES allows
to not take into account inefficient states when estimating energy
consumption. This intends to let the Energy Model know that CPUFreq itself
will skip inefficiencies and such states don't need to be part of the
estimation anymore.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 1deb727245be..fe9b90dd0c8c 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -64,8 +64,12 @@ struct em_perf_domain {
  *
  *  EM_PERF_DOMAIN_MILLIWATTS: The power values are in milli-Watts or some
  *  other scale.
+ *
+ *  EM_PERF_DOMAIN_SKIP_INEFFICIENCIES: Skip inefficient states when estimating
+ *  energy consumption.
  */
 #define EM_PERF_DOMAIN_MILLIWATTS BIT(0)
+#define EM_PERF_DOMAIN_SKIP_INEFFICIENCIES BIT(1)
 
 #define em_span_cpus(em) (to_cpumask((em)->cpus))
 
@@ -105,6 +109,37 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
 void em_dev_unregister_perf_domain(struct device *dev);
 
 /**
+ * em_pd_get_efficient_state() - Get an efficient performance state from the EM
+ * @pd   : Performance domain for which we want an efficient frequency
+ * @freq : Frequency to map with the EM
+ *
+ * It is called from the scheduler code quite frequently and as a consequence
+ * doesn't implement any check.
+ *
+ * Return: An efficient performance state, high enough to meet @freq
+ * requirement.
+ */
+static inline
+struct em_perf_state *em_pd_get_efficient_state(struct em_perf_domain *pd,
+						unsigned long freq)
+{
+	struct em_perf_state *ps;
+	int i;
+
+	for (i = 0; i < pd->nr_perf_states; i++) {
+		ps = &pd->table[i];
+		if (ps->frequency >= freq) {
+			if (pd->flags & EM_PERF_DOMAIN_SKIP_INEFFICIENCIES &&
+			    ps->flags & EM_PERF_STATE_INEFFICIENT)
+				continue;
+			break;
+		}
+	}
+
+	return ps;
+}
+
+/**
  * em_cpu_energy() - Estimates the energy consumed by the CPUs of a
 		performance domain
  * @pd		: performance domain for which energy has to be estimated
@@ -126,7 +161,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 {
 	unsigned long freq, scale_cpu;
 	struct em_perf_state *ps;
-	int i, cpu;
+	int cpu;
 
 	if (!sum_util)
 		return 0;
@@ -151,11 +186,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	 * Find the lowest performance state of the Energy Model above the
 	 * requested frequency.
 	 */
-	for (i = 0; i < pd->nr_perf_states; i++) {
-		ps = &pd->table[i];
-		if (ps->frequency >= freq)
-			break;
-	}
+	ps = em_pd_get_efficient_state(pd, freq);
 
 	/*
 	 * The capacity of a CPU in the domain at the performance state (ps)
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 3ab0b913bcfa..e4a05e4966d3 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -65,6 +65,17 @@ static int em_debug_units_show(struct seq_file *s, void *unused)
 }
 DEFINE_SHOW_ATTRIBUTE(em_debug_units);
 
+static int em_debug_skip_inefficiencies_show(struct seq_file *s, void *unused)
+{
+	struct em_perf_domain *pd = s->private;
+	int enabled = (pd->flags & EM_PERF_DOMAIN_SKIP_INEFFICIENCIES) ? 1 : 0;
+
+	seq_printf(s, "%d\n", enabled);
+
+	return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(em_debug_skip_inefficiencies);
+
 static void em_debug_create_pd(struct device *dev)
 {
 	struct dentry *d;
@@ -78,6 +89,8 @@ static void em_debug_create_pd(struct device *dev)
 				    &em_debug_cpus_fops);
 
 	debugfs_create_file("units", 0444, d, dev->em_pd, &em_debug_units_fops);
+	debugfs_create_file("skip-inefficiencies", 0444, d, dev->em_pd,
+			    &em_debug_skip_inefficiencies_fops);
 
 	/* Create a sub-directory for each performance state */
 	for (i = 0; i < dev->em_pd->nr_perf_states; i++)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v4 5/9] cpufreq: Add an interface to mark inefficient frequencies
  2021-07-08 10:08 [PATCH v4 0/9] Inefficient OPPs Vincent Donnefort
                   ` (3 preceding siblings ...)
  2021-07-08 10:09 ` [PATCH v4 4/9] PM / EM: Allow skipping inefficient states Vincent Donnefort
@ 2021-07-08 10:09 ` Vincent Donnefort
  2021-07-08 10:09 ` [PATCH v4 6/9] cpufreq: Add a new freq-table relation CPUFREQ_RELATION_E Vincent Donnefort
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Vincent Donnefort @ 2021-07-08 10:09 UTC (permalink / raw)
  To: peterz, rjw, viresh.kumar, vincent.guittot, qperret
  Cc: linux-pm, ionela.voinescu, lukasz.luba, dietmar.eggemann, mka,
	Vincent Donnefort

Some SoCs such as the sd855 have OPPs within the same policy whose cost is
higher than others with a higher frequency. Those OPPs are inefficients and
it might be interesting for a governor to not use them.

Adding a flag, CPUFREQ_INEFFICIENT_FREQ, to mark such OPPs into the
frequency table, as well as a new cpufreq_frequency_table member
"efficient". This new member allows CPUFreq to resolve an inefficient
frequency to an efficient one.

Efficient frequencies point to themselves. The efficiency resolution must
check it doesn't break the policy maximum.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>

diff --git a/drivers/cpufreq/freq_table.c b/drivers/cpufreq/freq_table.c
index 67e56cf638ef..d68600b84d93 100644
--- a/drivers/cpufreq/freq_table.c
+++ b/drivers/cpufreq/freq_table.c
@@ -351,6 +351,52 @@ static int set_freq_table_sorted(struct cpufreq_policy *policy)
 	return 0;
 }
 
+static void set_freq_table_efficiencies(struct cpufreq_policy *policy)
+{
+	struct cpufreq_frequency_table *pos, *table = policy->freq_table;
+	enum cpufreq_table_sorting sort = policy->freq_table_sorted;
+	int efficient, idx;
+
+	/* Not supported */
+	if (sort == CPUFREQ_TABLE_UNSORTED) {
+		cpufreq_for_each_entry_idx(pos, table, idx)
+			pos->efficient = idx;
+		return;
+	}
+
+	/* The highest frequency is always efficient */
+	cpufreq_for_each_entry_idx(pos, table, idx) {
+		if (pos->frequency == CPUFREQ_ENTRY_INVALID)
+			continue;
+
+		efficient = idx;
+
+		if (sort == CPUFREQ_TABLE_SORTED_DESCENDING)
+			break;
+	}
+
+	for (;;) {
+		pos = &table[idx];
+
+		if (pos->frequency != CPUFREQ_ENTRY_INVALID) {
+			if (pos->flags & CPUFREQ_INEFFICIENT_FREQ) {
+				pos->efficient = efficient;
+			} else {
+				pos->efficient = idx;
+				efficient = idx;
+			}
+		}
+
+		if (sort == CPUFREQ_TABLE_SORTED_ASCENDING) {
+			if (--idx < 0)
+				break;
+		} else {
+			if (table[++idx].frequency == CPUFREQ_TABLE_END)
+				break;
+		}
+	}
+}
+
 int cpufreq_table_validate_and_sort(struct cpufreq_policy *policy)
 {
 	int ret;
@@ -362,7 +408,13 @@ int cpufreq_table_validate_and_sort(struct cpufreq_policy *policy)
 	if (ret)
 		return ret;
 
-	return set_freq_table_sorted(policy);
+	ret = set_freq_table_sorted(policy);
+	if (ret)
+		return ret;
+
+	set_freq_table_efficiencies(policy);
+
+	return ret;
 }
 
 MODULE_AUTHOR("Dominik Brodowski <linux@brodo.de>");
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 353969c7acd3..d10784cf7ee4 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -666,13 +666,15 @@ struct governor_attr {
 #define CPUFREQ_ENTRY_INVALID	~0u
 #define CPUFREQ_TABLE_END	~1u
 /* Special Values of .flags field */
-#define CPUFREQ_BOOST_FREQ	(1 << 0)
+#define CPUFREQ_BOOST_FREQ	 (1 << 0)
+#define CPUFREQ_INEFFICIENT_FREQ (1 << 1)
 
 struct cpufreq_frequency_table {
 	unsigned int	flags;
 	unsigned int	driver_data; /* driver specific data, not used by core */
 	unsigned int	frequency; /* kHz - doesn't need to be in ascending
 				    * order */
+	unsigned int	efficient; /* idx of an efficient frequency */
 };
 
 #if defined(CONFIG_CPU_FREQ) && defined(CONFIG_PM_OPP)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v4 6/9] cpufreq: Add a new freq-table relation CPUFREQ_RELATION_E
  2021-07-08 10:08 [PATCH v4 0/9] Inefficient OPPs Vincent Donnefort
                   ` (4 preceding siblings ...)
  2021-07-08 10:09 ` [PATCH v4 5/9] cpufreq: Add an interface to mark inefficient frequencies Vincent Donnefort
@ 2021-07-08 10:09 ` Vincent Donnefort
  2021-07-22  8:17   ` Lukasz Luba
  2021-07-08 10:09 ` [PATCH v4 7/9] cpufreq: CPUFREQ_RELATION_E in schedutil ondemand and conservative Vincent Donnefort
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Vincent Donnefort @ 2021-07-08 10:09 UTC (permalink / raw)
  To: peterz, rjw, viresh.kumar, vincent.guittot, qperret
  Cc: linux-pm, ionela.voinescu, lukasz.luba, dietmar.eggemann, mka,
	Vincent Donnefort

This new freq-table relation allows to look for the lowest frequency above
the request that is efficient but within the limit of the maximum allowed
by the policy.

Inefficient frequencies, skipped by CPUFREQ_RELATION_E must be marked with
the flag CPUFREQ_INEFFICIENT_FREQ.

CPUFREQ_RELATION_E is only effective if the driver allows it via
the policy flag relation_efficient. If it does not, the fallback
is CPUFREQ_RELATION_L.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 802abc925b2a..74eaa23bcc7c 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -2246,8 +2246,12 @@ int __cpufreq_driver_target(struct cpufreq_policy *policy,
 	    !(cpufreq_driver->flags & CPUFREQ_NEED_UPDATE_LIMITS))
 		return 0;
 
-	if (cpufreq_driver->target)
+	if (cpufreq_driver->target) {
+		/* Verify CPUFREQ_RELATION_E support  */
+		relation = cpufreq_frequency_relation_efficient(policy,
+								relation);
 		return cpufreq_driver->target(policy, target_freq, relation);
+	}
 
 	if (!cpufreq_driver->target_index)
 		return -EINVAL;
diff --git a/drivers/cpufreq/freq_table.c b/drivers/cpufreq/freq_table.c
index d68600b84d93..0a46924305fb 100644
--- a/drivers/cpufreq/freq_table.c
+++ b/drivers/cpufreq/freq_table.c
@@ -143,6 +143,9 @@ int cpufreq_table_index_unsorted(struct cpufreq_policy *policy,
 	case CPUFREQ_RELATION_C:
 		optimal.frequency = ~0;
 		break;
+	case CPUFREQ_RELATION_E:
+		relation = CPUFREQ_RELATION_L;
+		break;
 	}
 
 	cpufreq_for_each_valid_entry_idx(pos, table, i) {
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index d10784cf7ee4..c7764ae05f84 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -115,6 +115,11 @@ struct cpufreq_policy {
 	bool			strict_target;
 
 	/*
+	 * Set if the driver supports CPUFREQ_RELATION_E.
+	 */
+	bool			relation_efficient;
+
+	/*
 	 * Preferred average time interval between consecutive invocations of
 	 * the driver to set the frequency for this policy.  To be set by the
 	 * scaling driver (0, which is the default, means no preference).
@@ -269,6 +274,7 @@ static inline void cpufreq_stats_record_transition(struct cpufreq_policy *policy
 #define CPUFREQ_RELATION_L 0  /* lowest frequency at or above target */
 #define CPUFREQ_RELATION_H 1  /* highest frequency below or at target */
 #define CPUFREQ_RELATION_C 2  /* closest frequency to target */
+#define CPUFREQ_RELATION_E 3  /* lowest efficient frequency at or above target */
 
 struct freq_attr {
 	struct attribute attr;
@@ -973,17 +979,45 @@ static inline int cpufreq_table_find_index_c(struct cpufreq_policy *policy,
 		return cpufreq_table_find_index_dc(policy, target_freq);
 }
 
+static inline unsigned int
+cpufreq_frequency_relation_efficient(struct cpufreq_policy *policy,
+				     unsigned int relation)
+{
+	if (!policy->relation_efficient && relation == CPUFREQ_RELATION_E)
+		return CPUFREQ_RELATION_L;
+
+	return relation;
+}
+
+static inline unsigned int
+cpufreq_frequency_find_efficient(struct cpufreq_policy *policy,
+				 unsigned int idx)
+{
+	struct cpufreq_frequency_table *table = policy->freq_table;
+	unsigned int efficient_idx = table[idx].efficient;
+
+	return table[efficient_idx].frequency <= policy->max ? efficient_idx :
+							       idx;
+}
+
 static inline int cpufreq_frequency_table_target(struct cpufreq_policy *policy,
 						 unsigned int target_freq,
 						 unsigned int relation)
 {
+	unsigned int idx;
+
 	if (unlikely(policy->freq_table_sorted == CPUFREQ_TABLE_UNSORTED))
 		return cpufreq_table_index_unsorted(policy, target_freq,
 						    relation);
 
+	relation = cpufreq_frequency_relation_efficient(policy, relation);
+
 	switch (relation) {
 	case CPUFREQ_RELATION_L:
-		return cpufreq_table_find_index_l(policy, target_freq);
+	case CPUFREQ_RELATION_E:
+		idx = cpufreq_table_find_index_l(policy, target_freq);
+		return relation == CPUFREQ_RELATION_L ? idx :
+			cpufreq_frequency_find_efficient(policy, idx);
 	case CPUFREQ_RELATION_H:
 		return cpufreq_table_find_index_h(policy, target_freq);
 	case CPUFREQ_RELATION_C:
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v4 7/9] cpufreq: CPUFREQ_RELATION_E in schedutil ondemand and conservative
  2021-07-08 10:08 [PATCH v4 0/9] Inefficient OPPs Vincent Donnefort
                   ` (5 preceding siblings ...)
  2021-07-08 10:09 ` [PATCH v4 6/9] cpufreq: Add a new freq-table relation CPUFREQ_RELATION_E Vincent Donnefort
@ 2021-07-08 10:09 ` Vincent Donnefort
  2021-07-08 10:09 ` [PATCH v4 8/9] cpufreq: Add driver flag CPUFREQ_READ_ENERGY_MODEL Vincent Donnefort
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Vincent Donnefort @ 2021-07-08 10:09 UTC (permalink / raw)
  To: peterz, rjw, viresh.kumar, vincent.guittot, qperret
  Cc: linux-pm, ionela.voinescu, lukasz.luba, dietmar.eggemann, mka,
	Vincent Donnefort

Avoid inefficient frequencies with the freq-table relation
CPUFREQ_RELATION_E instead of CPUFREQ_RELATION_L in governors where it is
possible. Are left aside:

  - userspace: there is no reason to not honour user requests.
  - powersave: selects the lowest frequency possible.

Caveat in ondemand: the governor only using CPUFREQ_RELATION_L when
powersavebias is set, the inefficient frequencies would be skipped only in
a such configuration.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 74eaa23bcc7c..f8d01d083df0 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -545,7 +545,7 @@ unsigned int cpufreq_driver_resolve_freq(struct cpufreq_policy *policy,
 		unsigned int idx;
 
 		idx = cpufreq_frequency_table_target(policy, target_freq,
-						     CPUFREQ_RELATION_L);
+						     CPUFREQ_RELATION_E);
 		policy->cached_resolved_idx = idx;
 		return policy->freq_table[idx].frequency;
 	}
diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index aa39ff31ec9f..084b35dbb4ef 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -134,7 +134,7 @@ static unsigned int cs_dbs_update(struct cpufreq_policy *policy)
 		else
 			requested_freq = policy->min;
 
-		__cpufreq_driver_target(policy, requested_freq, CPUFREQ_RELATION_L);
+		__cpufreq_driver_target(policy, requested_freq, CPUFREQ_RELATION_E);
 		dbs_info->requested_freq = requested_freq;
 	}
 
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index ac361a8b1d3b..19689d15d3d3 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -123,7 +123,7 @@ static void dbs_freq_increase(struct cpufreq_policy *policy, unsigned int freq)
 		return;
 
 	__cpufreq_driver_target(policy, freq, od_tuners->powersave_bias ?
-			CPUFREQ_RELATION_L : CPUFREQ_RELATION_H);
+			CPUFREQ_RELATION_E : CPUFREQ_RELATION_H);
 }
 
 /*
@@ -161,7 +161,7 @@ static void od_update(struct cpufreq_policy *policy)
 		if (od_tuners->powersave_bias)
 			freq_next = od_ops.powersave_bias_target(policy,
 								 freq_next,
-								 CPUFREQ_RELATION_L);
+								 CPUFREQ_RELATION_E);
 
 		__cpufreq_driver_target(policy, freq_next, CPUFREQ_RELATION_C);
 	}
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index c7764ae05f84..0110408fcf2d 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -638,7 +638,7 @@ static inline void cpufreq_policy_apply_limits(struct cpufreq_policy *policy)
 	if (policy->max < policy->cur)
 		__cpufreq_driver_target(policy, policy->max, CPUFREQ_RELATION_H);
 	else if (policy->min > policy->cur)
-		__cpufreq_driver_target(policy, policy->min, CPUFREQ_RELATION_L);
+		__cpufreq_driver_target(policy, policy->min, CPUFREQ_RELATION_E);
 }
 
 /* Governor attribute set */
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 57124614363d..d6c7694b75bd 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -481,7 +481,7 @@ static void sugov_work(struct kthread_work *work)
 	raw_spin_unlock_irqrestore(&sg_policy->update_lock, flags);
 
 	mutex_lock(&sg_policy->work_lock);
-	__cpufreq_driver_target(sg_policy->policy, freq, CPUFREQ_RELATION_L);
+	__cpufreq_driver_target(sg_policy->policy, freq, CPUFREQ_RELATION_E);
 	mutex_unlock(&sg_policy->work_lock);
 }
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v4 8/9] cpufreq: Add driver flag CPUFREQ_READ_ENERGY_MODEL
  2021-07-08 10:08 [PATCH v4 0/9] Inefficient OPPs Vincent Donnefort
                   ` (6 preceding siblings ...)
  2021-07-08 10:09 ` [PATCH v4 7/9] cpufreq: CPUFREQ_RELATION_E in schedutil ondemand and conservative Vincent Donnefort
@ 2021-07-08 10:09 ` Vincent Donnefort
  2021-07-08 10:09 ` [PATCH v4 9/9] cpufreq: dt: Add CPUFREQ_READ_ENERGY_MODEL Vincent Donnefort
  2021-08-04 16:21 ` [PATCH v4 0/9] Inefficient OPPs Rafael J. Wysocki
  9 siblings, 0 replies; 31+ messages in thread
From: Vincent Donnefort @ 2021-07-08 10:09 UTC (permalink / raw)
  To: peterz, rjw, viresh.kumar, vincent.guittot, qperret
  Cc: linux-pm, ionela.voinescu, lukasz.luba, dietmar.eggemann, mka,
	Vincent Donnefort

The Energy Model has a 1:1 mapping between OPPs and performance states
(em_perf_state). If a CPUFreq driver registers an Energy Model, the
inefficiencies found by the latter can apply to CPUFreq.
CPUFREQ_READ_ENERGY_MODEL allows, during registration, to read those
inefficiencies and mark the freq-table accordingly.

If a driver that set CPUFREQ_READ_ENERGY_MODEL, doesn't use a custom
->target() callback, CPUFreq can handle skipping inefficient frequencies
on its own and will automatically enable the freq-table relation
CPUFREQ_RELATION_E.

If CPUFREQ_RELATION_E is enabled, CPUFreq will let the Energy Model know
inefficiencies shouldn't be taken into account for the energy estimation.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index f8d01d083df0..cb8950da4a47 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -19,6 +19,7 @@
 #include <linux/cpu_cooling.h>
 #include <linux/delay.h>
 #include <linux/device.h>
+#include <linux/energy_model.h>
 #include <linux/init.h>
 #include <linux/kernel_stat.h>
 #include <linux/module.h>
@@ -1313,6 +1314,58 @@ static void cpufreq_policy_free(struct cpufreq_policy *policy)
 	kfree(policy);
 }
 
+static inline void
+cpufreq_read_inefficiencies_from_em(struct cpufreq_policy *policy,
+				    struct em_perf_domain *em_pd)
+{
+	struct cpufreq_frequency_table *pos, *table = policy->freq_table;
+	struct em_perf_state *em_table;
+	bool found = false;
+	int i;
+
+	if (!(cpufreq_driver->flags & CPUFREQ_READ_ENERGY_MODEL))
+		return;
+
+	if (!em_pd) {
+		pr_warn("CPUFreq driver wants to read inefficiencies from the Energy Model but none found\n");
+		return;
+	}
+
+	em_table = em_pd->table;
+
+	for (i = 0; i < em_pd->nr_perf_states; i++) {
+		if (!(em_table[i].flags & EM_PERF_STATE_INEFFICIENT))
+			continue;
+
+		cpufreq_for_each_valid_entry(pos, table) {
+			if (pos->frequency == em_table[i].frequency) {
+				pos->flags |= CPUFREQ_INEFFICIENT_FREQ;
+				found = true;
+				break;
+			}
+		}
+	}
+
+	if (!found)
+		return;
+
+	/*
+	 * If the driver does not use a custom ->target() callback, we can
+	 * automatically enable CPUFREQ_RELATION_E, supported by the default
+	 * function cpufreq_frequency_table_target(). Otherwise, the driver
+	 * must set this flag itself.
+	 */
+	if (!cpufreq_driver->target)
+		policy->relation_efficient = true;
+
+	/*
+	 * With CPUFREQ_RELATION_E enabled, inefficient frequencies will be
+	 * skipped. Let know the Energy Model it can skip them too.
+	 */
+	if (policy->relation_efficient)
+		em_pd->flags |= EM_PERF_DOMAIN_SKIP_INEFFICIENCIES;
+}
+
 static int cpufreq_online(unsigned int cpu)
 {
 	struct cpufreq_policy *policy;
@@ -1367,6 +1420,12 @@ static int cpufreq_online(unsigned int cpu)
 			goto out_free_policy;
 		}
 
+		/*
+		 * Sync potential inefficiencies with the Energy Model if the
+		 * driver requested CPUFREQ_READ_ENERGY_MODEL.
+		 */
+		cpufreq_read_inefficiencies_from_em(policy, em_cpu_get(cpu));
+
 		ret = cpufreq_table_validate_and_sort(policy);
 		if (ret)
 			goto out_exit_policy;
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 0110408fcf2d..f5e94207094c 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -440,6 +440,14 @@ struct cpufreq_driver {
  */
 #define CPUFREQ_NO_AUTO_DYNAMIC_SWITCHING	BIT(6)
 
+/*
+ * Set by drivers which register an Energy Model and want to use the latter to
+ * populate the freq-table with inefficiency information. If the same driver
+ * is not implementing the ->target() callback, setting this flag will also
+ * automatically enable CPUFREQ_RELATION_E.
+ */
+#define CPUFREQ_READ_ENERGY_MODEL		BIT(7)
+
 int cpufreq_register_driver(struct cpufreq_driver *driver_data);
 int cpufreq_unregister_driver(struct cpufreq_driver *driver_data);
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v4 9/9] cpufreq: dt: Add CPUFREQ_READ_ENERGY_MODEL
  2021-07-08 10:08 [PATCH v4 0/9] Inefficient OPPs Vincent Donnefort
                   ` (7 preceding siblings ...)
  2021-07-08 10:09 ` [PATCH v4 8/9] cpufreq: Add driver flag CPUFREQ_READ_ENERGY_MODEL Vincent Donnefort
@ 2021-07-08 10:09 ` Vincent Donnefort
  2021-08-04 16:21 ` [PATCH v4 0/9] Inefficient OPPs Rafael J. Wysocki
  9 siblings, 0 replies; 31+ messages in thread
From: Vincent Donnefort @ 2021-07-08 10:09 UTC (permalink / raw)
  To: peterz, rjw, viresh.kumar, vincent.guittot, qperret
  Cc: linux-pm, ionela.voinescu, lukasz.luba, dietmar.eggemann, mka,
	Vincent Donnefort

The cpufreq-dt driver registers an Energy Model. Let CPUfreq use it to
mark and skip inefficient frequencies.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>

diff --git a/drivers/cpufreq/cpufreq-dt.c b/drivers/cpufreq/cpufreq-dt.c
index ece52863ba62..40c6109df427 100644
--- a/drivers/cpufreq/cpufreq-dt.c
+++ b/drivers/cpufreq/cpufreq-dt.c
@@ -176,7 +176,7 @@ static int cpufreq_exit(struct cpufreq_policy *policy)
 
 static struct cpufreq_driver dt_cpufreq_driver = {
 	.flags = CPUFREQ_NEED_INITIAL_FREQ_CHECK |
-		 CPUFREQ_IS_COOLING_DEV,
+		 CPUFREQ_IS_COOLING_DEV | CPUFREQ_READ_ENERGY_MODEL,
 	.verify = cpufreq_generic_frequency_table_verify,
 	.target_index = set_target,
 	.get = cpufreq_generic_get,
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 2/9] PM / EM: Mark inefficient states
  2021-07-08 10:08 ` [PATCH v4 2/9] PM / EM: Mark inefficient states Vincent Donnefort
@ 2021-07-22  7:25   ` Lukasz Luba
  0 siblings, 0 replies; 31+ messages in thread
From: Lukasz Luba @ 2021-07-22  7:25 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: peterz, rjw, viresh.kumar, vincent.guittot, qperret, linux-pm,
	ionela.voinescu, dietmar.eggemann, mka



On 7/8/21 11:08 AM, Vincent Donnefort wrote:
> Some SoCs, such as the sd855 have OPPs within the same performance domain,
> whose cost is higher than others with a higher frequency. Even though
> those OPPs are interesting from a cooling perspective, it makes no sense
> to use them when the device can run at full capacity. Those OPPs handicap
> the performance domain, when choosing the most energy-efficient CPU and
> are wasting energy. They are inefficient.
> 
> Hence, add support for such OPPs to the Energy Model. The table can now
> be read skipping inefficient performance states (and by extension,
> inefficient OPPs).
> 
> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
> Reviewed-by: Matthias Kaehlcke <mka@chromium.org>
> 
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index 3f221dbf5f95..7ca4f9cc8baf 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -17,13 +17,25 @@
>    *		device). It can be a total power: static and dynamic.
>    * @cost:	The cost coefficient associated with this level, used during
>    *		energy calculation. Equal to: power * max_frequency / frequency
> + * @flags:	see "em_perf_state flags" description below.
>    */
>   struct em_perf_state {
>   	unsigned long frequency;
>   	unsigned long power;
>   	unsigned long cost;
> +	unsigned long flags;
>   };
>   
> +/*
> + * em_perf_state flags:
> + *
> + * EM_PERF_STATE_INEFFICIENT: The performance state is inefficient. There is
> + * in this em_perf_domain, another performance state with a higher frequency
> + * but a lower or equal power cost. Such inefficient states are ignored when
> + * using em_pd_get_efficient_*() functions.
> + */
> +#define EM_PERF_STATE_INEFFICIENT BIT(0)
> +
>   /**
>    * em_perf_domain - Performance domain
>    * @table:		List of performance states, in ascending order
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index c4871a8ff977..30ab73ab6439 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -2,7 +2,7 @@
>   /*
>    * Energy Model of devices
>    *
> - * Copyright (c) 2018-2020, Arm ltd.
> + * Copyright (c) 2018-2021, Arm ltd.
>    * Written by: Quentin Perret, Arm ltd.
>    * Improvements provided by: Lukasz Luba, Arm ltd.
>    */
> @@ -42,6 +42,7 @@ static void em_debug_create_ps(struct em_perf_state *ps, struct dentry *pd)
>   	debugfs_create_ulong("frequency", 0444, d, &ps->frequency);
>   	debugfs_create_ulong("power", 0444, d, &ps->power);
>   	debugfs_create_ulong("cost", 0444, d, &ps->cost);
> +	debugfs_create_ulong("inefficient", 0444, d, &ps->flags);
>   }
>   
>   static int em_debug_cpus_show(struct seq_file *s, void *unused)
> @@ -160,6 +161,7 @@ static int em_create_perf_table(struct device *dev, struct em_perf_domain *pd,
>   		table[i].cost = div64_u64(fmax * table[i].power,
>   					  table[i].frequency);
>   		if (table[i].cost >= prev_cost) {
> +			table[i].flags = EM_PERF_STATE_INEFFICIENT;
>   			dev_dbg(dev, "EM: OPP:%lu is inefficient\n",
>   				table[i].frequency);
>   		} else {
> 

LGTM

Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 3/9] PM / EM: Extend em_perf_domain with a flag field
  2021-07-08 10:09 ` [PATCH v4 3/9] PM / EM: Extend em_perf_domain with a flag field Vincent Donnefort
@ 2021-07-22  7:27   ` Lukasz Luba
  0 siblings, 0 replies; 31+ messages in thread
From: Lukasz Luba @ 2021-07-22  7:27 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: peterz, rjw, viresh.kumar, vincent.guittot, qperret, linux-pm,
	ionela.voinescu, dietmar.eggemann, mka



On 7/8/21 11:09 AM, Vincent Donnefort wrote:
> Merge the current "milliwatts" option into a "flag" field. This intends to
> prepare the extension of this structure for inefficient states support in
> the Energy Model.
> 
> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
> 
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index 7ca4f9cc8baf..1deb727245be 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -40,8 +40,7 @@ struct em_perf_state {
>    * em_perf_domain - Performance domain
>    * @table:		List of performance states, in ascending order
>    * @nr_perf_states:	Number of performance states
> - * @milliwatts:		Flag indicating the power values are in milli-Watts
> - *			or some other scale.
> + * @flags:		See "em_perf_domain flags"
>    * @cpus:		Cpumask covering the CPUs of the domain. It's here
>    *			for performance reasons to avoid potential cache
>    *			misses during energy calculations in the scheduler
> @@ -56,10 +55,18 @@ struct em_perf_state {
>   struct em_perf_domain {
>   	struct em_perf_state *table;
>   	int nr_perf_states;
> -	int milliwatts;
> +	unsigned long flags;
>   	unsigned long cpus[];
>   };
>   
> +/*
> + *  em_perf_domain flags:
> + *
> + *  EM_PERF_DOMAIN_MILLIWATTS: The power values are in milli-Watts or some
> + *  other scale.
> + */
> +#define EM_PERF_DOMAIN_MILLIWATTS BIT(0)
> +
>   #define em_span_cpus(em) (to_cpumask((em)->cpus))
>   
>   #ifdef CONFIG_ENERGY_MODEL
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index 30ab73ab6439..3ab0b913bcfa 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -56,7 +56,8 @@ DEFINE_SHOW_ATTRIBUTE(em_debug_cpus);
>   static int em_debug_units_show(struct seq_file *s, void *unused)
>   {
>   	struct em_perf_domain *pd = s->private;
> -	char *units = pd->milliwatts ? "milliWatts" : "bogoWatts";
> +	char *units = (pd->flags & EM_PERF_DOMAIN_MILLIWATTS) ?
> +		"milliWatts" : "bogoWatts";
>   
>   	seq_printf(s, "%s\n", units);
>   
> @@ -343,7 +344,8 @@ int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
>   	if (ret)
>   		goto unlock;
>   
> -	dev->em_pd->milliwatts = milliwatts;
> +	if (milliwatts)
> +		dev->em_pd->flags |= EM_PERF_DOMAIN_MILLIWATTS;
>   
>   	em_debug_create_pd(dev);
>   	dev_info(dev, "EM: created perf domain\n");
> 

LGTM

Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 6/9] cpufreq: Add a new freq-table relation CPUFREQ_RELATION_E
  2021-07-08 10:09 ` [PATCH v4 6/9] cpufreq: Add a new freq-table relation CPUFREQ_RELATION_E Vincent Donnefort
@ 2021-07-22  8:17   ` Lukasz Luba
  0 siblings, 0 replies; 31+ messages in thread
From: Lukasz Luba @ 2021-07-22  8:17 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: peterz, rjw, viresh.kumar, vincent.guittot, qperret, linux-pm,
	ionela.voinescu, dietmar.eggemann, mka



On 7/8/21 11:09 AM, Vincent Donnefort wrote:

[snip]

>   static inline int cpufreq_frequency_table_target(struct cpufreq_policy *policy,
>   						 unsigned int target_freq,
>   						 unsigned int relation)
>   {
> +	unsigned int idx;
> +
>   	if (unlikely(policy->freq_table_sorted == CPUFREQ_TABLE_UNSORTED))
>   		return cpufreq_table_index_unsorted(policy, target_freq,
>   						    relation);
>   
> +	relation = cpufreq_frequency_relation_efficient(policy, relation);
> +
>   	switch (relation) {
>   	case CPUFREQ_RELATION_L:
> -		return cpufreq_table_find_index_l(policy, target_freq);

Isn't a place where the 'fallthrough;' is needed? Any warnings seen
maybe in other arch compiler?

> +	case CPUFREQ_RELATION_E:
> +		idx = cpufreq_table_find_index_l(policy, target_freq);
> +		return relation == CPUFREQ_RELATION_L ? idx :
> +			cpufreq_frequency_find_efficient(policy, idx);
>   	case CPUFREQ_RELATION_H:
>   		return cpufreq_table_find_index_h(policy, target_freq);
>   	case CPUFREQ_RELATION_C:
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 4/9] PM / EM: Allow skipping inefficient states
  2021-07-08 10:09 ` [PATCH v4 4/9] PM / EM: Allow skipping inefficient states Vincent Donnefort
@ 2021-07-22 13:09   ` Lukasz Luba
  0 siblings, 0 replies; 31+ messages in thread
From: Lukasz Luba @ 2021-07-22 13:09 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: peterz, rjw, viresh.kumar, vincent.guittot, qperret, linux-pm,
	ionela.voinescu, dietmar.eggemann, mka



On 7/8/21 11:09 AM, Vincent Donnefort wrote:
> The new performance domain flag EM_PERF_DOMAIN_SKIP_INEFFICIENCIES allows
> to not take into account inefficient states when estimating energy
> consumption. This intends to let the Energy Model know that CPUFreq itself
> will skip inefficiencies and such states don't need to be part of the
> estimation anymore.
> 
> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
> 
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index 1deb727245be..fe9b90dd0c8c 100644

Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/9] Inefficient OPPs
  2021-07-08 10:08 [PATCH v4 0/9] Inefficient OPPs Vincent Donnefort
                   ` (8 preceding siblings ...)
  2021-07-08 10:09 ` [PATCH v4 9/9] cpufreq: dt: Add CPUFREQ_READ_ENERGY_MODEL Vincent Donnefort
@ 2021-08-04 16:21 ` Rafael J. Wysocki
  2021-08-10  6:13   ` Viresh Kumar
  9 siblings, 1 reply; 31+ messages in thread
From: Rafael J. Wysocki @ 2021-08-04 16:21 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: Peter Zijlstra, Rafael J. Wysocki, Viresh Kumar, Vincent Guittot,
	Quentin Perret, Linux PM, Ionela Voinescu, Lukasz Luba,
	Dietmar Eggemann, Matthias Kaehlcke

On Thu, Jul 8, 2021 at 12:09 PM Vincent Donnefort
<vincent.donnefort@arm.com> wrote:
>
> Hi all,
>
> Here's the new patch-set version that brings support for skipping
> inefficiencies found by the Energy Model. This version doesn't bring
> changes for all the drivers that could benefit from this work at the
> moment. I'll do that in the next version or in a separated patch-set.
> Also, it's been discussed that enabling RELATION_E should be a driver
> flag. This sadly needs to be read in functions that do not have access to
> cpufreq_driver. Hence, I created a new policy flag instead.
>
> A bit of context:
>
> We (Power team in Arm) are working with an experimental kernel for the
> Google's Pixel4 to evaluate and improve the current mainline performance
> and energy consumption on a real life device with Android.
>
> The SD855 SoC found in this phone has several OPPs that are inefficient.
> I.e. despite a lower frequency, they have a greater cost. (That cost being
> fmax * OPP power / OPP freq). This issue is twofold. First of course,
> running a specific workload at an inefficient OPP is counterproductive
> since it wastes wasting energy. But also, inefficient OPPs make a
> performance domain less appealing for task placement than it really is.
>
> We evaluated the change presented here by running 30 iterations of Android
> PCMark "Work 2.0 Performance". While we did not see any statistically
> significant performance impact, this change allowed to drastically improve
> the idle time residency.
>
>
>                            |   Running   |  WFI [1]  |    Idle   |
>    ------------------------+-------------+-----------+-----------+
>    Little cluster (4 CPUs) |    -0.35%   |   +0.35%  |   +0.79%  |
>    ------------------------+-------------+-----------+-----------+
>    Medium cluster (3 CPUs) |    -6.3%    |    -18%   |    +12%   |
>    ------------------------+-------------+-----------+-----------+
>    Big cluster    (1 CPU)  |    -6.4%    |    -6.5%  |    +2.8%  |
>    ------------------------+-------------+-----------+-----------+
>
> On the SD855, the inefficient OPPs are found on the little cluster. By
> removing them from the Energy Model, we make the most efficient CPUs more
> appealing for task placement, helping to reduce the running time for the
> medium and big CPUs. Increasing idle time is crucial for this platform due
> to the substantial energy cost differences among the clusters. Also,
> despite not appearing in the statistics (the idle driver used here doesn't
> report it), we can speculate that we also improve the cluster idle time.
>
> [1] WFI: Wait for interrupt.
>
> Changelog since v3:
>   - New freq-table relation CPUFREQ_RELATION_E.
>   - New CPUFreq driver flag CPUFREQ_READ_ENERGY_MODEL.
>   - EM flag to skip or not inefficiencies (driven by CPUFreq).
>   - Fix infinite loop in set_freq_table_efficiencies().
>
> Changelog since v2:
>   - Add separated support for inefficiencies into CPUFreq.
>   - Collect Reviewed-by for the first patch.
>
> Changelog since v1:
>   - Remove the Look-up table as the numbers weren't strong enough to
>     justify the implementation.
>   - Split the patch.
>
> Vincent Donnefort (9):
>   PM / EM: Fix inefficient states detection
>   PM / EM: Mark inefficient states
>   PM / EM: Extend em_perf_domain with a flag field
>   PM / EM: Allow skipping inefficient states
>   cpufreq: Add an interface to mark inefficient frequencies
>   cpufreq: Add a new freq-table relation CPUFREQ_RELATION_E
>   cpufreq: CPUFREQ_RELATION_E in schedutil ondemand and conservative
>   cpufreq: Add driver flag CPUFREQ_READ_ENERGY_MODEL
>   cpufreq: dt: Add CPUFREQ_READ_ENERGY_MODEL

The cpufreq changes are mostly fine by me now, but I would like to
hear from Viresh regarding this.

>
>  drivers/cpufreq/cpufreq-dt.c           |  2 +-
>  drivers/cpufreq/cpufreq.c              | 67 ++++++++++++++++++++++++++++++++-
>  drivers/cpufreq/cpufreq_conservative.c |  2 +-
>  drivers/cpufreq/cpufreq_ondemand.c     |  4 +-
>  drivers/cpufreq/freq_table.c           | 57 +++++++++++++++++++++++++++-
>  include/linux/cpufreq.h                | 50 +++++++++++++++++++++++--
>  include/linux/energy_model.h           | 68 +++++++++++++++++++++++++++++-----
>  kernel/power/energy_model.c            | 46 ++++++++++++++---------
>  kernel/sched/cpufreq_schedutil.c       |  2 +-
>  9 files changed, 260 insertions(+), 38 deletions(-)
>
> --

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/9] Inefficient OPPs
  2021-08-04 16:21 ` [PATCH v4 0/9] Inefficient OPPs Rafael J. Wysocki
@ 2021-08-10  6:13   ` Viresh Kumar
  2021-08-10  7:39     ` Viresh Kumar
                       ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Viresh Kumar @ 2021-08-10  6:13 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Vincent Donnefort, Peter Zijlstra, Rafael J. Wysocki,
	Vincent Guittot, Quentin Perret, Linux PM, Ionela Voinescu,
	Lukasz Luba, Dietmar Eggemann, Matthias Kaehlcke

On 04-08-21, 18:21, Rafael J. Wysocki wrote:
> The cpufreq changes are mostly fine by me now, but I would like to
> hear from Viresh regarding this.

I have few doubts/concerns here.

Just to iterate it again, the idea here is to choose a higher
frequency, which will work in an efficient manner based on energy
consumption. So this _only_ works for the case where the caller asked
for CPUFREQ_RELATION_L.

- The new flag being added here, CPUFREQ_RELATION_E, doesn't feel
  complete in this sense to me. It should rather be called as
  CPUFREQ_RELATION_LE instead as it is _always_ used with relation L.

- IMO this has made the caller site a bit confusing now, i.e.  why we
  send CPUFREQ_RELATION_E sometimes and CPUFREQ_RELATION_H at other
  times. Why shouldn't the _H freq be efficient as well ? I understand
  that this was done to not do the efficient stuff in case of
  userspace/powersave/performance governors.

  What about reusing following for finding all such cases ?

        policy->governor->flags & CPUFREQ_GOV_DYNAMIC_SWITCHING

  The driver can set a flag to tell if it wants efficient frequencies
  or not, and at runtime we apply efficient algorithm only if the
  current governor does DVFS, which will leave out
  userspace/performance/powersave.


Now the other thing I didn't like since the beginning, I still don't
like it :)

A cpufreq table is provided by the driver, which can have some
inefficient frequencies. The inefficient frequencies can be caught by
many parts of the kernel, the current way (in this series) is via EM.
But this can totally be anything else as well, like a devfreq driver.

The way we have tied this whole thing with EM, via
cpufreq_read_inefficiencies_from_em(), is what I don't like. We have
closely bound the whole thing with one of the ways this can be done
and we shouldn't be polluting the cpufreq core with this IMHO. In a
sane world, the cpufreq core should just provide an API, like
cpufreq_set_freq_invariant() and it can be called by any part of
the kernel.

I understand that the current problem is where do we make that call
from and I suggested dev_pm_opp_of_register_em() for the same earlier.
But that doesn't work as the policy isn't properly initialized by that
point.

Now that I see the current implementation, I think we can make it work
in a different way.

- Copy what's done for thermal-cooling in cpufreq core, i.e.
  CPUFREQ_IS_COOLING_DEV flag, for the EM core as well. So the cpufreq
  drivers can set a flag, CPUFREQ_REGISTER_EM, and the cpufreq core
  can call dev_pm_opp_of_register_em() on their behalf. This call will
  be made from cpufreq_online(), just before/after
  cpufreq_thermal_control_enabled() stuff. By this point the policy is
  properly initialized and is also updated in
  per_cpu(cpufreq_cpu_data).

- Now the cpufreq core can provide an API like
  cpufreq_set_freq_invariant(int cpu, unsigned long freq), which can
  be called from EM core's implementation of
  em_dev_register_perf_domain().

-- 
viresh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/9] Inefficient OPPs
  2021-08-10  6:13   ` Viresh Kumar
@ 2021-08-10  7:39     ` Viresh Kumar
  2021-08-10 12:28     ` Lukasz Luba
  2021-08-16 14:19     ` Rafael J. Wysocki
  2 siblings, 0 replies; 31+ messages in thread
From: Viresh Kumar @ 2021-08-10  7:39 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Vincent Donnefort, Peter Zijlstra, Rafael J. Wysocki,
	Vincent Guittot, Quentin Perret, Linux PM, Ionela Voinescu,
	Lukasz Luba, Dietmar Eggemann, Matthias Kaehlcke

On 10-08-21, 11:43, Viresh Kumar wrote:
> Now that I see the current implementation, I think we can make it work
> in a different way.
> 
> - Copy what's done for thermal-cooling in cpufreq core, i.e.
>   CPUFREQ_IS_COOLING_DEV flag, for the EM core as well. So the cpufreq
>   drivers can set a flag, CPUFREQ_REGISTER_EM, and the cpufreq core
>   can call dev_pm_opp_of_register_em() on their behalf. This call will
>   be made from cpufreq_online(), just before/after
>   cpufreq_thermal_control_enabled() stuff. By this point the policy is
>   properly initialized and is also updated in
>   per_cpu(cpufreq_cpu_data).

And I was able to get something out for this:

https://lore.kernel.org/lkml/cover.1628579170.git.viresh.kumar@linaro.org/

-- 
viresh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/9] Inefficient OPPs
  2021-08-10  6:13   ` Viresh Kumar
  2021-08-10  7:39     ` Viresh Kumar
@ 2021-08-10 12:28     ` Lukasz Luba
  2021-08-10 14:07       ` Quentin Perret
  2021-08-11  4:01       ` Viresh Kumar
  2021-08-16 14:19     ` Rafael J. Wysocki
  2 siblings, 2 replies; 31+ messages in thread
From: Lukasz Luba @ 2021-08-10 12:28 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Rafael J. Wysocki, Vincent Donnefort, Peter Zijlstra,
	Rafael J. Wysocki, Vincent Guittot, Quentin Perret, Linux PM,
	Ionela Voinescu, Dietmar Eggemann, Matthias Kaehlcke



On 8/10/21 7:13 AM, Viresh Kumar wrote:
> On 04-08-21, 18:21, Rafael J. Wysocki wrote:
>> The cpufreq changes are mostly fine by me now, but I would like to
>> hear from Viresh regarding this.
> 
> I have few doubts/concerns here.
> 
> Just to iterate it again, the idea here is to choose a higher
> frequency, which will work in an efficient manner based on energy
> consumption. So this _only_ works for the case where the caller asked
> for CPUFREQ_RELATION_L.
> 
> - The new flag being added here, CPUFREQ_RELATION_E, doesn't feel
>    complete in this sense to me. It should rather be called as
>    CPUFREQ_RELATION_LE instead as it is _always_ used with relation L.
> 
> - IMO this has made the caller site a bit confusing now, i.e.  why we
>    send CPUFREQ_RELATION_E sometimes and CPUFREQ_RELATION_H at other
>    times. Why shouldn't the _H freq be efficient as well ? I understand
>    that this was done to not do the efficient stuff in case of
>    userspace/powersave/performance governors.
> 
>    What about reusing following for finding all such cases ?
> 
>          policy->governor->flags & CPUFREQ_GOV_DYNAMIC_SWITCHING
> 
>    The driver can set a flag to tell if it wants efficient frequencies
>    or not, and at runtime we apply efficient algorithm only if the
>    current governor does DVFS, which will leave out
>    userspace/performance/powersave.
> 
> 
> Now the other thing I didn't like since the beginning, I still don't
> like it :)
> 
> A cpufreq table is provided by the driver, which can have some
> inefficient frequencies. The inefficient frequencies can be caught by
> many parts of the kernel, the current way (in this series) is via EM.
> But this can totally be anything else as well, like a devfreq driver.

Currently devfreq drivers and governors don't support 'inefficient'
OPPs idea. They are not even 'utilization' driven yet. I'm not sure
if that would make sense for their workloads. For now, they are far
behind the CPUFreq/scheduler development in this field.
It needs more research and experiments, to even estimate if this brings
benefits. So, I would just skip devfreq use case...

> 
> The way we have tied this whole thing with EM, via
> cpufreq_read_inefficiencies_from_em(), is what I don't like. We have
> closely bound the whole thing with one of the ways this can be done
> and we shouldn't be polluting the cpufreq core with this IMHO. In a
> sane world, the cpufreq core should just provide an API, like
> cpufreq_set_freq_invariant() and it can be called by any part of
> the kernel.
> 
> I understand that the current problem is where do we make that call
> from and I suggested dev_pm_opp_of_register_em() for the same earlier.
> But that doesn't work as the policy isn't properly initialized by that
> point.

True, the policy is not initialized yet when cpufreq_driver::init()
callback is called, which the current place for scmi-cpufreq.

What about the other callback cpufreq_driver::ready() ?
This might solve the two issues I mentioned below.

> 
> Now that I see the current implementation, I think we can make it work
> in a different way.
> 
> - Copy what's done for thermal-cooling in cpufreq core, i.e.
>    CPUFREQ_IS_COOLING_DEV flag, for the EM core as well. So the cpufreq
>    drivers can set a flag, CPUFREQ_REGISTER_EM, and the cpufreq core
>    can call dev_pm_opp_of_register_em() on their behalf. This call will
>    be made from cpufreq_online(), just before/after
>    cpufreq_thermal_control_enabled() stuff. By this point the policy is
>    properly initialized and is also updated in
>    per_cpu(cpufreq_cpu_data).
> 
> - Now the cpufreq core can provide an API like
>    cpufreq_set_freq_invariant(int cpu, unsigned long freq), which can
>    be called from EM core's implementation of
>    em_dev_register_perf_domain().
> 

I have to point out that there are two issues (which we
might solve):
1. Advanced setup, due to per-CPU performance requests,
which are not limited to real DVFS domain.
The scmi-cpufreq (and possibly some other soon) uses more
tricky EM setup. As you might recall, the construction of temporary
cpumask is a bit complex, since we want per-CPU policy, but the
cpumask pass to EM has not a single bit but is 'spanned' with a few
CPUs.

2. The scmi-cpufreq (and IIRC MTK cpufreq driver soon) requires
custom struct em_data_callback, since the power data is coming from FW.

If there is a need for complex EM registration, maybe we could
do it in the cpufreq_driver::ready(). The policy would be ready
during that call, so it might fly.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/9] Inefficient OPPs
  2021-08-10 12:28     ` Lukasz Luba
@ 2021-08-10 14:07       ` Quentin Perret
  2021-08-10 14:18         ` Lukasz Luba
  2021-08-11  4:01       ` Viresh Kumar
  1 sibling, 1 reply; 31+ messages in thread
From: Quentin Perret @ 2021-08-10 14:07 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: Viresh Kumar, Rafael J. Wysocki, Vincent Donnefort,
	Peter Zijlstra, Rafael J. Wysocki, Vincent Guittot, Linux PM,
	Ionela Voinescu, Dietmar Eggemann, Matthias Kaehlcke

On Tuesday 10 Aug 2021 at 13:28:21 (+0100), Lukasz Luba wrote:
> 
> 
> On 8/10/21 7:13 AM, Viresh Kumar wrote:
> > On 04-08-21, 18:21, Rafael J. Wysocki wrote:
> > > The cpufreq changes are mostly fine by me now, but I would like to
> > > hear from Viresh regarding this.
> > 
> > I have few doubts/concerns here.
> > 
> > Just to iterate it again, the idea here is to choose a higher
> > frequency, which will work in an efficient manner based on energy
> > consumption. So this _only_ works for the case where the caller asked
> > for CPUFREQ_RELATION_L.
> > 
> > - The new flag being added here, CPUFREQ_RELATION_E, doesn't feel
> >    complete in this sense to me. It should rather be called as
> >    CPUFREQ_RELATION_LE instead as it is _always_ used with relation L.
> > 
> > - IMO this has made the caller site a bit confusing now, i.e.  why we
> >    send CPUFREQ_RELATION_E sometimes and CPUFREQ_RELATION_H at other
> >    times. Why shouldn't the _H freq be efficient as well ? I understand
> >    that this was done to not do the efficient stuff in case of
> >    userspace/powersave/performance governors.
> > 
> >    What about reusing following for finding all such cases ?
> > 
> >          policy->governor->flags & CPUFREQ_GOV_DYNAMIC_SWITCHING
> > 
> >    The driver can set a flag to tell if it wants efficient frequencies
> >    or not, and at runtime we apply efficient algorithm only if the
> >    current governor does DVFS, which will leave out
> >    userspace/performance/powersave.
> > 
> > 
> > Now the other thing I didn't like since the beginning, I still don't
> > like it :)
> > 
> > A cpufreq table is provided by the driver, which can have some
> > inefficient frequencies. The inefficient frequencies can be caught by
> > many parts of the kernel, the current way (in this series) is via EM.
> > But this can totally be anything else as well, like a devfreq driver.
> 
> Currently devfreq drivers and governors don't support 'inefficient'
> OPPs idea. They are not even 'utilization' driven yet. I'm not sure
> if that would make sense for their workloads. For now, they are far
> behind the CPUFreq/scheduler development in this field.
> It needs more research and experiments, to even estimate if this brings
> benefits. So, I would just skip devfreq use case...
> 
> > 
> > The way we have tied this whole thing with EM, via
> > cpufreq_read_inefficiencies_from_em(), is what I don't like. We have
> > closely bound the whole thing with one of the ways this can be done
> > and we shouldn't be polluting the cpufreq core with this IMHO. In a
> > sane world, the cpufreq core should just provide an API, like
> > cpufreq_set_freq_invariant() and it can be called by any part of
> > the kernel.
> > 
> > I understand that the current problem is where do we make that call
> > from and I suggested dev_pm_opp_of_register_em() for the same earlier.
> > But that doesn't work as the policy isn't properly initialized by that
> > point.
> 
> True, the policy is not initialized yet when cpufreq_driver::init()
> callback is called, which the current place for scmi-cpufreq.
> 
> What about the other callback cpufreq_driver::ready() ?
> This might solve the two issues I mentioned below.
> 
> > 
> > Now that I see the current implementation, I think we can make it work
> > in a different way.
> > 
> > - Copy what's done for thermal-cooling in cpufreq core, i.e.
> >    CPUFREQ_IS_COOLING_DEV flag, for the EM core as well. So the cpufreq
> >    drivers can set a flag, CPUFREQ_REGISTER_EM, and the cpufreq core
> >    can call dev_pm_opp_of_register_em() on their behalf. This call will
> >    be made from cpufreq_online(), just before/after
> >    cpufreq_thermal_control_enabled() stuff. By this point the policy is
> >    properly initialized and is also updated in
> >    per_cpu(cpufreq_cpu_data).
> > 
> > - Now the cpufreq core can provide an API like
> >    cpufreq_set_freq_invariant(int cpu, unsigned long freq), which can
> >    be called from EM core's implementation of
> >    em_dev_register_perf_domain().
> > 
> 
> I have to point out that there are two issues (which we
> might solve):
> 1. Advanced setup, due to per-CPU performance requests,
> which are not limited to real DVFS domain.
> The scmi-cpufreq (and possibly some other soon) uses more
> tricky EM setup. As you might recall, the construction of temporary
> cpumask is a bit complex, since we want per-CPU policy, but the
> cpumask pass to EM has not a single bit but is 'spanned' with a few
> CPUs.
> 
> 2. The scmi-cpufreq (and IIRC MTK cpufreq driver soon) requires
> custom struct em_data_callback, since the power data is coming from FW.

+1, we really need this to work.

> If there is a need for complex EM registration, maybe we could
> do it in the cpufreq_driver::ready(). The policy would be ready
> during that call, so it might fly.

I remember trying this, but ran into issues as well FWIW. I would need
to check if this is still relevant, but at the time this was introduced
we needed to register the EM _before_ the policy notifier is called with
'CPUFREQ_CREATE_POLICY', because this will trigger a sched domain
rebuild in the arch_topology driver, which allows the scheduler to pick
up the newly introduced EM data.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/9] Inefficient OPPs
  2021-08-10 14:07       ` Quentin Perret
@ 2021-08-10 14:18         ` Lukasz Luba
  2021-08-10 15:12           ` Lukasz Luba
  0 siblings, 1 reply; 31+ messages in thread
From: Lukasz Luba @ 2021-08-10 14:18 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Viresh Kumar, Rafael J. Wysocki, Vincent Donnefort,
	Peter Zijlstra, Rafael J. Wysocki, Vincent Guittot, Linux PM,
	Ionela Voinescu, Dietmar Eggemann, Matthias Kaehlcke



On 8/10/21 3:07 PM, Quentin Perret wrote:
> On Tuesday 10 Aug 2021 at 13:28:21 (+0100), Lukasz Luba wrote:
>>
>>
>> On 8/10/21 7:13 AM, Viresh Kumar wrote:
>>> On 04-08-21, 18:21, Rafael J. Wysocki wrote:
>>>> The cpufreq changes are mostly fine by me now, but I would like to
>>>> hear from Viresh regarding this.
>>>
>>> I have few doubts/concerns here.
>>>
>>> Just to iterate it again, the idea here is to choose a higher
>>> frequency, which will work in an efficient manner based on energy
>>> consumption. So this _only_ works for the case where the caller asked
>>> for CPUFREQ_RELATION_L.
>>>
>>> - The new flag being added here, CPUFREQ_RELATION_E, doesn't feel
>>>     complete in this sense to me. It should rather be called as
>>>     CPUFREQ_RELATION_LE instead as it is _always_ used with relation L.
>>>
>>> - IMO this has made the caller site a bit confusing now, i.e.  why we
>>>     send CPUFREQ_RELATION_E sometimes and CPUFREQ_RELATION_H at other
>>>     times. Why shouldn't the _H freq be efficient as well ? I understand
>>>     that this was done to not do the efficient stuff in case of
>>>     userspace/powersave/performance governors.
>>>
>>>     What about reusing following for finding all such cases ?
>>>
>>>           policy->governor->flags & CPUFREQ_GOV_DYNAMIC_SWITCHING
>>>
>>>     The driver can set a flag to tell if it wants efficient frequencies
>>>     or not, and at runtime we apply efficient algorithm only if the
>>>     current governor does DVFS, which will leave out
>>>     userspace/performance/powersave.
>>>
>>>
>>> Now the other thing I didn't like since the beginning, I still don't
>>> like it :)
>>>
>>> A cpufreq table is provided by the driver, which can have some
>>> inefficient frequencies. The inefficient frequencies can be caught by
>>> many parts of the kernel, the current way (in this series) is via EM.
>>> But this can totally be anything else as well, like a devfreq driver.
>>
>> Currently devfreq drivers and governors don't support 'inefficient'
>> OPPs idea. They are not even 'utilization' driven yet. I'm not sure
>> if that would make sense for their workloads. For now, they are far
>> behind the CPUFreq/scheduler development in this field.
>> It needs more research and experiments, to even estimate if this brings
>> benefits. So, I would just skip devfreq use case...
>>
>>>
>>> The way we have tied this whole thing with EM, via
>>> cpufreq_read_inefficiencies_from_em(), is what I don't like. We have
>>> closely bound the whole thing with one of the ways this can be done
>>> and we shouldn't be polluting the cpufreq core with this IMHO. In a
>>> sane world, the cpufreq core should just provide an API, like
>>> cpufreq_set_freq_invariant() and it can be called by any part of
>>> the kernel.
>>>
>>> I understand that the current problem is where do we make that call
>>> from and I suggested dev_pm_opp_of_register_em() for the same earlier.
>>> But that doesn't work as the policy isn't properly initialized by that
>>> point.
>>
>> True, the policy is not initialized yet when cpufreq_driver::init()
>> callback is called, which the current place for scmi-cpufreq.
>>
>> What about the other callback cpufreq_driver::ready() ?
>> This might solve the two issues I mentioned below.
>>
>>>
>>> Now that I see the current implementation, I think we can make it work
>>> in a different way.
>>>
>>> - Copy what's done for thermal-cooling in cpufreq core, i.e.
>>>     CPUFREQ_IS_COOLING_DEV flag, for the EM core as well. So the cpufreq
>>>     drivers can set a flag, CPUFREQ_REGISTER_EM, and the cpufreq core
>>>     can call dev_pm_opp_of_register_em() on their behalf. This call will
>>>     be made from cpufreq_online(), just before/after
>>>     cpufreq_thermal_control_enabled() stuff. By this point the policy is
>>>     properly initialized and is also updated in
>>>     per_cpu(cpufreq_cpu_data).
>>>
>>> - Now the cpufreq core can provide an API like
>>>     cpufreq_set_freq_invariant(int cpu, unsigned long freq), which can
>>>     be called from EM core's implementation of
>>>     em_dev_register_perf_domain().
>>>
>>
>> I have to point out that there are two issues (which we
>> might solve):
>> 1. Advanced setup, due to per-CPU performance requests,
>> which are not limited to real DVFS domain.
>> The scmi-cpufreq (and possibly some other soon) uses more
>> tricky EM setup. As you might recall, the construction of temporary
>> cpumask is a bit complex, since we want per-CPU policy, but the
>> cpumask pass to EM has not a single bit but is 'spanned' with a few
>> CPUs.
>>
>> 2. The scmi-cpufreq (and IIRC MTK cpufreq driver soon) requires
>> custom struct em_data_callback, since the power data is coming from FW.
> 
> +1, we really need this to work.
> 
>> If there is a need for complex EM registration, maybe we could
>> do it in the cpufreq_driver::ready(). The policy would be ready
>> during that call, so it might fly.
> 
> I remember trying this, but ran into issues as well FWIW. I would need
> to check if this is still relevant, but at the time this was introduced
> we needed to register the EM _before_ the policy notifier is called with
> 'CPUFREQ_CREATE_POLICY', because this will trigger a sched domain
> rebuild in the arch_topology driver, which allows the scheduler to pick
> up the newly introduced EM data.
> 

Let me go through that code and experiment if there is an issue.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/9] Inefficient OPPs
  2021-08-10 14:18         ` Lukasz Luba
@ 2021-08-10 15:12           ` Lukasz Luba
  2021-08-10 15:47             ` Quentin Perret
  0 siblings, 1 reply; 31+ messages in thread
From: Lukasz Luba @ 2021-08-10 15:12 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Viresh Kumar, Rafael J. Wysocki, Vincent Donnefort,
	Peter Zijlstra, Rafael J. Wysocki, Vincent Guittot, Linux PM,
	Ionela Voinescu, Dietmar Eggemann, Matthias Kaehlcke



On 8/10/21 3:18 PM, Lukasz Luba wrote:
> 
> 
> On 8/10/21 3:07 PM, Quentin Perret wrote:
>> On Tuesday 10 Aug 2021 at 13:28:21 (+0100), Lukasz Luba wrote:
>>>
>>>
>>> On 8/10/21 7:13 AM, Viresh Kumar wrote:
>>>> On 04-08-21, 18:21, Rafael J. Wysocki wrote:
>>>>> The cpufreq changes are mostly fine by me now, but I would like to
>>>>> hear from Viresh regarding this.
>>>>
>>>> I have few doubts/concerns here.
>>>>
>>>> Just to iterate it again, the idea here is to choose a higher
>>>> frequency, which will work in an efficient manner based on energy
>>>> consumption. So this _only_ works for the case where the caller asked
>>>> for CPUFREQ_RELATION_L.
>>>>
>>>> - The new flag being added here, CPUFREQ_RELATION_E, doesn't feel
>>>>     complete in this sense to me. It should rather be called as
>>>>     CPUFREQ_RELATION_LE instead as it is _always_ used with relation L.
>>>>
>>>> - IMO this has made the caller site a bit confusing now, i.e.  why we
>>>>     send CPUFREQ_RELATION_E sometimes and CPUFREQ_RELATION_H at other
>>>>     times. Why shouldn't the _H freq be efficient as well ? I 
>>>> understand
>>>>     that this was done to not do the efficient stuff in case of
>>>>     userspace/powersave/performance governors.
>>>>
>>>>     What about reusing following for finding all such cases ?
>>>>
>>>>           policy->governor->flags & CPUFREQ_GOV_DYNAMIC_SWITCHING
>>>>
>>>>     The driver can set a flag to tell if it wants efficient frequencies
>>>>     or not, and at runtime we apply efficient algorithm only if the
>>>>     current governor does DVFS, which will leave out
>>>>     userspace/performance/powersave.
>>>>
>>>>
>>>> Now the other thing I didn't like since the beginning, I still don't
>>>> like it :)
>>>>
>>>> A cpufreq table is provided by the driver, which can have some
>>>> inefficient frequencies. The inefficient frequencies can be caught by
>>>> many parts of the kernel, the current way (in this series) is via EM.
>>>> But this can totally be anything else as well, like a devfreq driver.
>>>
>>> Currently devfreq drivers and governors don't support 'inefficient'
>>> OPPs idea. They are not even 'utilization' driven yet. I'm not sure
>>> if that would make sense for their workloads. For now, they are far
>>> behind the CPUFreq/scheduler development in this field.
>>> It needs more research and experiments, to even estimate if this brings
>>> benefits. So, I would just skip devfreq use case...
>>>
>>>>
>>>> The way we have tied this whole thing with EM, via
>>>> cpufreq_read_inefficiencies_from_em(), is what I don't like. We have
>>>> closely bound the whole thing with one of the ways this can be done
>>>> and we shouldn't be polluting the cpufreq core with this IMHO. In a
>>>> sane world, the cpufreq core should just provide an API, like
>>>> cpufreq_set_freq_invariant() and it can be called by any part of
>>>> the kernel.
>>>>
>>>> I understand that the current problem is where do we make that call
>>>> from and I suggested dev_pm_opp_of_register_em() for the same earlier.
>>>> But that doesn't work as the policy isn't properly initialized by that
>>>> point.
>>>
>>> True, the policy is not initialized yet when cpufreq_driver::init()
>>> callback is called, which the current place for scmi-cpufreq.
>>>
>>> What about the other callback cpufreq_driver::ready() ?
>>> This might solve the two issues I mentioned below.
>>>
>>>>
>>>> Now that I see the current implementation, I think we can make it work
>>>> in a different way.
>>>>
>>>> - Copy what's done for thermal-cooling in cpufreq core, i.e.
>>>>     CPUFREQ_IS_COOLING_DEV flag, for the EM core as well. So the 
>>>> cpufreq
>>>>     drivers can set a flag, CPUFREQ_REGISTER_EM, and the cpufreq core
>>>>     can call dev_pm_opp_of_register_em() on their behalf. This call 
>>>> will
>>>>     be made from cpufreq_online(), just before/after
>>>>     cpufreq_thermal_control_enabled() stuff. By this point the 
>>>> policy is
>>>>     properly initialized and is also updated in
>>>>     per_cpu(cpufreq_cpu_data).
>>>>
>>>> - Now the cpufreq core can provide an API like
>>>>     cpufreq_set_freq_invariant(int cpu, unsigned long freq), which can
>>>>     be called from EM core's implementation of
>>>>     em_dev_register_perf_domain().
>>>>
>>>
>>> I have to point out that there are two issues (which we
>>> might solve):
>>> 1. Advanced setup, due to per-CPU performance requests,
>>> which are not limited to real DVFS domain.
>>> The scmi-cpufreq (and possibly some other soon) uses more
>>> tricky EM setup. As you might recall, the construction of temporary
>>> cpumask is a bit complex, since we want per-CPU policy, but the
>>> cpumask pass to EM has not a single bit but is 'spanned' with a few
>>> CPUs.
>>>
>>> 2. The scmi-cpufreq (and IIRC MTK cpufreq driver soon) requires
>>> custom struct em_data_callback, since the power data is coming from FW.
>>
>> +1, we really need this to work.
>>
>>> If there is a need for complex EM registration, maybe we could
>>> do it in the cpufreq_driver::ready(). The policy would be ready
>>> during that call, so it might fly.
>>
>> I remember trying this, but ran into issues as well FWIW. I would need
>> to check if this is still relevant, but at the time this was introduced
>> we needed to register the EM _before_ the policy notifier is called with
>> 'CPUFREQ_CREATE_POLICY', because this will trigger a sched domain
>> rebuild in the arch_topology driver, which allows the scheduler to pick
>> up the newly introduced EM data.
>>
> 
> Let me go through that code and experiment if there is an issue.

I've checked that. It's not the policy notifier and arch_topology which
cause an issue, but the cpufreq governor setup code. Anyway, we cannot
wait so late with the EM registration, till e.g. ::ready() callback.
Thanks!

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/9] Inefficient OPPs
  2021-08-10 15:12           ` Lukasz Luba
@ 2021-08-10 15:47             ` Quentin Perret
  2021-08-11  5:03               ` Viresh Kumar
  0 siblings, 1 reply; 31+ messages in thread
From: Quentin Perret @ 2021-08-10 15:47 UTC (permalink / raw)
  To: Lukasz Luba, r
  Cc: Viresh Kumar, Rafael J. Wysocki, Vincent Donnefort,
	Peter Zijlstra, Rafael J. Wysocki, Vincent Guittot, Linux PM,
	Ionela Voinescu, Dietmar Eggemann, Matthias Kaehlcke

On Tuesday 10 Aug 2021 at 16:12:29 (+0100), Lukasz Luba wrote:
> I've checked that. It's not the policy notifier and arch_topology which
> cause an issue, but the cpufreq governor setup code. Anyway, we cannot
> wait so late with the EM registration, till e.g. ::ready() callback.

Aha, yes, because by the time the arch_topology driver rebuilds the
sched domains, the governor is not 'installed', which means the
scheduler is not in a position to enable EAS yet. So we need to wait
until sched_cpufreq_governor_change() is called for that. Makes sense,
thanks for checking, and +1 to your conclusion.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/9] Inefficient OPPs
  2021-08-10 12:28     ` Lukasz Luba
  2021-08-10 14:07       ` Quentin Perret
@ 2021-08-11  4:01       ` Viresh Kumar
  1 sibling, 0 replies; 31+ messages in thread
From: Viresh Kumar @ 2021-08-11  4:01 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: Rafael J. Wysocki, Vincent Donnefort, Peter Zijlstra,
	Rafael J. Wysocki, Vincent Guittot, Quentin Perret, Linux PM,
	Ionela Voinescu, Dietmar Eggemann, Matthias Kaehlcke

On 10-08-21, 13:28, Lukasz Luba wrote:
> On 8/10/21 7:13 AM, Viresh Kumar wrote:
> > A cpufreq table is provided by the driver, which can have some
> > inefficient frequencies. The inefficient frequencies can be caught by
> > many parts of the kernel, the current way (in this series) is via EM.
> > But this can totally be anything else as well, like a devfreq driver.
> 
> Currently devfreq drivers and governors don't support 'inefficient'
> OPPs idea. They are not even 'utilization' driven yet. I'm not sure
> if that would make sense for their workloads. For now, they are far
> behind the CPUFreq/scheduler development in this field.

That's not my point.

> It needs more research and experiments, to even estimate if this brings
> benefits. So, I would just skip devfreq use case...

I don't care about devfreq case at all, just like you. What I am
saying is that this API can be called from anywhere and hard-coding
this for just EM is a plain Hack. A framework should never allow it.

> > I understand that the current problem is where do we make that call
> > from and I suggested dev_pm_opp_of_register_em() for the same earlier.
> > But that doesn't work as the policy isn't properly initialized by that
> > point.
> 
> True, the policy is not initialized yet when cpufreq_driver::init()
> callback is called, which the current place for scmi-cpufreq.
> 
> What about the other callback cpufreq_driver::ready() ?
> This might solve the two issues I mentioned below.

Yeah, lets see if we can solve that problem here.

> > Now that I see the current implementation, I think we can make it work
> > in a different way.
> > 
> > - Copy what's done for thermal-cooling in cpufreq core, i.e.
> >    CPUFREQ_IS_COOLING_DEV flag, for the EM core as well. So the cpufreq
> >    drivers can set a flag, CPUFREQ_REGISTER_EM, and the cpufreq core
> >    can call dev_pm_opp_of_register_em() on their behalf. This call will
> >    be made from cpufreq_online(), just before/after
> >    cpufreq_thermal_control_enabled() stuff. By this point the policy is
> >    properly initialized and is also updated in
> >    per_cpu(cpufreq_cpu_data).
> > 
> > - Now the cpufreq core can provide an API like
> >    cpufreq_set_freq_invariant(int cpu, unsigned long freq), which can
> >    be called from EM core's implementation of
> >    em_dev_register_perf_domain().
> > 
> 
> I have to point out that there are two issues (which we
> might solve):
> 1. Advanced setup, due to per-CPU performance requests,
> which are not limited to real DVFS domain.
> The scmi-cpufreq (and possibly some other soon) uses more
> tricky EM setup. As you might recall, the construction of temporary
> cpumask is a bit complex, since we want per-CPU policy, but the
> cpumask pass to EM has not a single bit but is 'spanned' with a few
> CPUs.
> 
> 2. The scmi-cpufreq (and IIRC MTK cpufreq driver soon) requires
> custom struct em_data_callback, since the power data is coming from FW.
> 
> If there is a need for complex EM registration, maybe we could
> do it in the cpufreq_driver::ready(). The policy would be ready
> during that call, so it might fly.

-- 
viresh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/9] Inefficient OPPs
  2021-08-10 15:47             ` Quentin Perret
@ 2021-08-11  5:03               ` Viresh Kumar
  2021-08-11 11:38                 ` Lukasz Luba
  0 siblings, 1 reply; 31+ messages in thread
From: Viresh Kumar @ 2021-08-11  5:03 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Lukasz Luba, r, Rafael J. Wysocki, Vincent Donnefort,
	Peter Zijlstra, Rafael J. Wysocki, Vincent Guittot, Linux PM,
	Ionela Voinescu, Dietmar Eggemann, Matthias Kaehlcke

On 10-08-21, 16:47, Quentin Perret wrote:
> On Tuesday 10 Aug 2021 at 16:12:29 (+0100), Lukasz Luba wrote:
> > I've checked that. It's not the policy notifier and arch_topology which
> > cause an issue, but the cpufreq governor setup code. Anyway, we cannot
> > wait so late with the EM registration, till e.g. ::ready() callback.
> 
> Aha, yes, because by the time the arch_topology driver rebuilds the
> sched domains, the governor is not 'installed', which means the
> scheduler is not in a position to enable EAS yet. So we need to wait
> until sched_cpufreq_governor_change() is called for that. Makes sense,
> thanks for checking, and +1 to your conclusion.

What about this then ?

Author: Viresh Kumar <viresh.kumar@linaro.org>
Date:   Wed Aug 11 10:24:28 2021 +0530

    cpufreq: Call ->ready() before initializing governor

    The driver may want to do stuff from the ->ready() callback, like
    registering with the EM core, after the policy is initialized, but
    before the governor is setup (since governor may end up using that
    information).

    Call the ->ready() callback before setting up the governor.

    Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 drivers/cpufreq/cpufreq.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index a060dc2aa2f2..2df41b98bbb3 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -1494,6 +1494,10 @@ static int cpufreq_online(unsigned int cpu)
                write_unlock_irqrestore(&cpufreq_driver_lock, flags);
        }

+       /* Callback for handling stuff after policy is ready */
+       if (cpufreq_driver->ready)
+               cpufreq_driver->ready(policy);
+
        ret = cpufreq_init_policy(policy);
        if (ret) {
                pr_err("%s: Failed to initialize policy for cpu: %d (%d)\n",
@@ -1505,10 +1509,6 @@ static int cpufreq_online(unsigned int cpu)

        kobject_uevent(&policy->kobj, KOBJ_ADD);

-       /* Callback for handling stuff after policy is ready */
-       if (cpufreq_driver->ready)
-               cpufreq_driver->ready(policy);
-
        if (cpufreq_thermal_control_enabled(cpufreq_driver))
                policy->cdev = of_cpufreq_cooling_register(policy);

-- 
viresh

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/9] Inefficient OPPs
  2021-08-11  5:03               ` Viresh Kumar
@ 2021-08-11 11:38                 ` Lukasz Luba
  2021-08-16 14:35                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 31+ messages in thread
From: Lukasz Luba @ 2021-08-11 11:38 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Quentin Perret, r, Rafael J. Wysocki, Vincent Donnefort,
	Peter Zijlstra, Rafael J. Wysocki, Vincent Guittot, Linux PM,
	Ionela Voinescu, Dietmar Eggemann, Matthias Kaehlcke



On 8/11/21 6:03 AM, Viresh Kumar wrote:
> On 10-08-21, 16:47, Quentin Perret wrote:
>> On Tuesday 10 Aug 2021 at 16:12:29 (+0100), Lukasz Luba wrote:
>>> I've checked that. It's not the policy notifier and arch_topology which
>>> cause an issue, but the cpufreq governor setup code. Anyway, we cannot
>>> wait so late with the EM registration, till e.g. ::ready() callback.
>>
>> Aha, yes, because by the time the arch_topology driver rebuilds the
>> sched domains, the governor is not 'installed', which means the
>> scheduler is not in a position to enable EAS yet. So we need to wait
>> until sched_cpufreq_governor_change() is called for that. Makes sense,
>> thanks for checking, and +1 to your conclusion.
> 
> What about this then ?

If it doesn't break the current drivers which implement this callback,
then looks good.

> 
> Author: Viresh Kumar <viresh.kumar@linaro.org>
> Date:   Wed Aug 11 10:24:28 2021 +0530
> 
>      cpufreq: Call ->ready() before initializing governor
> 
>      The driver may want to do stuff from the ->ready() callback, like
>      registering with the EM core, after the policy is initialized, but
>      before the governor is setup (since governor may end up using that
>      information).
> 
>      Call the ->ready() callback before setting up the governor.
> 
>      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
> ---
>   drivers/cpufreq/cpufreq.c | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index a060dc2aa2f2..2df41b98bbb3 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -1494,6 +1494,10 @@ static int cpufreq_online(unsigned int cpu)
>                  write_unlock_irqrestore(&cpufreq_driver_lock, flags);
>          }
> 
> +       /* Callback for handling stuff after policy is ready */
> +       if (cpufreq_driver->ready)
> +               cpufreq_driver->ready(policy);
> +
>          ret = cpufreq_init_policy(policy);
>          if (ret) {
>                  pr_err("%s: Failed to initialize policy for cpu: %d (%d)\n",
> @@ -1505,10 +1509,6 @@ static int cpufreq_online(unsigned int cpu)
> 
>          kobject_uevent(&policy->kobj, KOBJ_ADD);
> 
> -       /* Callback for handling stuff after policy is ready */
> -       if (cpufreq_driver->ready)
> -               cpufreq_driver->ready(policy);
> -
>          if (cpufreq_thermal_control_enabled(cpufreq_driver))
>                  policy->cdev = of_cpufreq_cooling_register(policy);
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/9] Inefficient OPPs
  2021-08-10  6:13   ` Viresh Kumar
  2021-08-10  7:39     ` Viresh Kumar
  2021-08-10 12:28     ` Lukasz Luba
@ 2021-08-16 14:19     ` Rafael J. Wysocki
  2021-08-17  7:06       ` Viresh Kumar
  2 siblings, 1 reply; 31+ messages in thread
From: Rafael J. Wysocki @ 2021-08-16 14:19 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Rafael J. Wysocki, Vincent Donnefort, Peter Zijlstra,
	Rafael J. Wysocki, Vincent Guittot, Quentin Perret, Linux PM,
	Ionela Voinescu, Lukasz Luba, Dietmar Eggemann,
	Matthias Kaehlcke

On Tue, Aug 10, 2021 at 8:13 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 04-08-21, 18:21, Rafael J. Wysocki wrote:
> > The cpufreq changes are mostly fine by me now, but I would like to
> > hear from Viresh regarding this.
>
> I have few doubts/concerns here.
>
> Just to iterate it again, the idea here is to choose a higher
> frequency, which will work in an efficient manner based on energy
> consumption. So this _only_ works for the case where the caller asked
> for CPUFREQ_RELATION_L.
>
> - The new flag being added here, CPUFREQ_RELATION_E, doesn't feel
>   complete in this sense to me. It should rather be called as
>   CPUFREQ_RELATION_LE instead as it is _always_ used with relation L.

Well, this mostly is a matter of what the flag means.  If "E" implies
"L", I don't see a problem.

> - IMO this has made the caller site a bit confusing now, i.e.  why we
>   send CPUFREQ_RELATION_E sometimes and CPUFREQ_RELATION_H at other
>   times. Why shouldn't the _H freq be efficient as well ?

This is a good point, though.

>  I understand  that this was done to not do the efficient stuff in case of
>   userspace/powersave/performance governors.
>
>   What about reusing following for finding all such cases ?
>
>         policy->governor->flags & CPUFREQ_GOV_DYNAMIC_SWITCHING
>
>   The driver can set a flag to tell if it wants efficient frequencies
>   or not, and at runtime we apply efficient algorithm only if the
>   current governor does DVFS, which will leave out
>   userspace/performance/powersave.

As long as this can be done without actually accessing
policy->governor->flags on every transition, it sounds like a good
idea.

> Now the other thing I didn't like since the beginning, I still don't
> like it :)
>
> A cpufreq table is provided by the driver, which can have some
> inefficient frequencies. The inefficient frequencies can be caught by
> many parts of the kernel, the current way (in this series) is via EM.
> But this can totally be anything else as well, like a devfreq driver.
>
> The way we have tied this whole thing with EM, via
> cpufreq_read_inefficiencies_from_em(), is what I don't like. We have
> closely bound the whole thing with one of the ways this can be done
> and we shouldn't be polluting the cpufreq core with this IMHO. In a
> sane world, the cpufreq core should just provide an API, like
> cpufreq_set_freq_invariant() and it can be called by any part of
> the kernel.

Right.

> I understand that the current problem is where do we make that call
> from and I suggested dev_pm_opp_of_register_em() for the same earlier.
> But that doesn't work as the policy isn't properly initialized by that
> point.
>
> Now that I see the current implementation, I think we can make it work
> in a different way.
>
> - Copy what's done for thermal-cooling in cpufreq core, i.e.
>   CPUFREQ_IS_COOLING_DEV flag, for the EM core as well. So the cpufreq
>   drivers can set a flag, CPUFREQ_REGISTER_EM, and the cpufreq core
>   can call dev_pm_opp_of_register_em() on their behalf. This call will
>   be made from cpufreq_online(), just before/after
>   cpufreq_thermal_control_enabled() stuff. By this point the policy is
>   properly initialized and is also updated in
>   per_cpu(cpufreq_cpu_data).
>
> - Now the cpufreq core can provide an API like
>   cpufreq_set_freq_invariant(int cpu, unsigned long freq), which can
>   be called from EM core's implementation of
>   em_dev_register_perf_domain().

Sounds reasonable to me.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/9] Inefficient OPPs
  2021-08-11 11:38                 ` Lukasz Luba
@ 2021-08-16 14:35                   ` Rafael J. Wysocki
  2021-08-17  7:09                     ` Viresh Kumar
  0 siblings, 1 reply; 31+ messages in thread
From: Rafael J. Wysocki @ 2021-08-16 14:35 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: Viresh Kumar, Quentin Perret, r, Rafael J. Wysocki,
	Vincent Donnefort, Peter Zijlstra, Rafael J. Wysocki,
	Vincent Guittot, Linux PM, Ionela Voinescu, Dietmar Eggemann,
	Matthias Kaehlcke

On Wed, Aug 11, 2021 at 1:38 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
>
>
> On 8/11/21 6:03 AM, Viresh Kumar wrote:
> > On 10-08-21, 16:47, Quentin Perret wrote:
> >> On Tuesday 10 Aug 2021 at 16:12:29 (+0100), Lukasz Luba wrote:
> >>> I've checked that. It's not the policy notifier and arch_topology which
> >>> cause an issue, but the cpufreq governor setup code. Anyway, we cannot
> >>> wait so late with the EM registration, till e.g. ::ready() callback.
> >>
> >> Aha, yes, because by the time the arch_topology driver rebuilds the
> >> sched domains, the governor is not 'installed', which means the
> >> scheduler is not in a position to enable EAS yet. So we need to wait
> >> until sched_cpufreq_governor_change() is called for that. Makes sense,
> >> thanks for checking, and +1 to your conclusion.
> >
> > What about this then ?
>
> If it doesn't break the current drivers which implement this callback,
> then looks good.

It was introduced by

7c45cf31b3ab cpufreq: Introduce ->ready() callback for cpufreq drivers

and appears to be still suitable for the purpose stated in the changelog.

Anyway, the vexpress-spc driver is the only one doing anything
significant in this callback AFAICS.

> >
> > Author: Viresh Kumar <viresh.kumar@linaro.org>
> > Date:   Wed Aug 11 10:24:28 2021 +0530
> >
> >      cpufreq: Call ->ready() before initializing governor
> >
> >      The driver may want to do stuff from the ->ready() callback, like
> >      registering with the EM core, after the policy is initialized, but
> >      before the governor is setup (since governor may end up using that
> >      information).
> >
> >      Call the ->ready() callback before setting up the governor.
> >
> >      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
> > ---
> >   drivers/cpufreq/cpufreq.c | 8 ++++----
> >   1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > index a060dc2aa2f2..2df41b98bbb3 100644
> > --- a/drivers/cpufreq/cpufreq.c
> > +++ b/drivers/cpufreq/cpufreq.c
> > @@ -1494,6 +1494,10 @@ static int cpufreq_online(unsigned int cpu)
> >                  write_unlock_irqrestore(&cpufreq_driver_lock, flags);
> >          }
> >
> > +       /* Callback for handling stuff after policy is ready */
> > +       if (cpufreq_driver->ready)
> > +               cpufreq_driver->ready(policy);
> > +
> >          ret = cpufreq_init_policy(policy);
> >          if (ret) {
> >                  pr_err("%s: Failed to initialize policy for cpu: %d (%d)\n",
> > @@ -1505,10 +1509,6 @@ static int cpufreq_online(unsigned int cpu)
> >
> >          kobject_uevent(&policy->kobj, KOBJ_ADD);
> >
> > -       /* Callback for handling stuff after policy is ready */
> > -       if (cpufreq_driver->ready)
> > -               cpufreq_driver->ready(policy);
> > -
> >          if (cpufreq_thermal_control_enabled(cpufreq_driver))
> >                  policy->cdev = of_cpufreq_cooling_register(policy);
> >

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/9] Inefficient OPPs
  2021-08-16 14:19     ` Rafael J. Wysocki
@ 2021-08-17  7:06       ` Viresh Kumar
  2021-08-17  9:03         ` Lukasz Luba
  0 siblings, 1 reply; 31+ messages in thread
From: Viresh Kumar @ 2021-08-17  7:06 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Vincent Donnefort, Peter Zijlstra, Rafael J. Wysocki,
	Vincent Guittot, Quentin Perret, Linux PM, Ionela Voinescu,
	Lukasz Luba, Dietmar Eggemann, Matthias Kaehlcke

On 16-08-21, 16:19, Rafael J. Wysocki wrote:
> On Tue, Aug 10, 2021 at 8:13 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> >  I understand  that this was done to not do the efficient stuff in case of
> >   userspace/powersave/performance governors.
> >
> >   What about reusing following for finding all such cases ?
> >
> >         policy->governor->flags & CPUFREQ_GOV_DYNAMIC_SWITCHING
> >
> >   The driver can set a flag to tell if it wants efficient frequencies
> >   or not, and at runtime we apply efficient algorithm only if the
> >   current governor does DVFS, which will leave out
> >   userspace/performance/powersave.
> 
> As long as this can be done without actually accessing
> policy->governor->flags on every transition, it sounds like a good
> idea.

Great.

Vincent, I hope you will be taking this forward then ?

-- 
viresh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/9] Inefficient OPPs
  2021-08-16 14:35                   ` Rafael J. Wysocki
@ 2021-08-17  7:09                     ` Viresh Kumar
  0 siblings, 0 replies; 31+ messages in thread
From: Viresh Kumar @ 2021-08-17  7:09 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Lukasz Luba, Quentin Perret, r, Vincent Donnefort,
	Peter Zijlstra, Rafael J. Wysocki, Vincent Guittot, Linux PM,
	Ionela Voinescu, Dietmar Eggemann, Matthias Kaehlcke

On 16-08-21, 16:35, Rafael J. Wysocki wrote:
> It was introduced by
> 
> 7c45cf31b3ab cpufreq: Introduce ->ready() callback for cpufreq drivers
> 
> and appears to be still suitable for the purpose stated in the changelog.
> 
> Anyway, the vexpress-spc driver is the only one doing anything
> significant in this callback AFAICS.

And I have already sent a patch to drop that user, after which we can
remove the callback altogether.

https://lore.kernel.org/linux-pm/0efe0c7b1c07591f07a905021f455b033441784f.1628659212.git.viresh.kumar@linaro.org/

Moreover, this idea of reusing ->ready() callback is already replaced
with this:

https://lore.kernel.org/linux-arm-msm/cover.1628742634.git.viresh.kumar@linaro.org/

-- 
viresh

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/9] Inefficient OPPs
  2021-08-17  7:06       ` Viresh Kumar
@ 2021-08-17  9:03         ` Lukasz Luba
  2021-08-23 17:06           ` Vincent Donnefort
  0 siblings, 1 reply; 31+ messages in thread
From: Lukasz Luba @ 2021-08-17  9:03 UTC (permalink / raw)
  To: Viresh Kumar, Rafael J. Wysocki
  Cc: Vincent Donnefort, Peter Zijlstra, Rafael J. Wysocki,
	Vincent Guittot, Quentin Perret, Linux PM, Ionela Voinescu,
	Dietmar Eggemann, Matthias Kaehlcke



On 8/17/21 8:06 AM, Viresh Kumar wrote:
> On 16-08-21, 16:19, Rafael J. Wysocki wrote:
>> On Tue, Aug 10, 2021 at 8:13 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>>>   I understand  that this was done to not do the efficient stuff in case of
>>>    userspace/powersave/performance governors.
>>>
>>>    What about reusing following for finding all such cases ?
>>>
>>>          policy->governor->flags & CPUFREQ_GOV_DYNAMIC_SWITCHING
>>>
>>>    The driver can set a flag to tell if it wants efficient frequencies
>>>    or not, and at runtime we apply efficient algorithm only if the
>>>    current governor does DVFS, which will leave out
>>>    userspace/performance/powersave.
>>
>> As long as this can be done without actually accessing
>> policy->governor->flags on every transition, it sounds like a good
>> idea.
> 
> Great.
> 
> Vincent, I hope you will be taking this forward then ?
> 

Vincent has been on vacations for quite a while, but he will be back
next week.

Thank you guys for the comments.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v4 0/9] Inefficient OPPs
  2021-08-17  9:03         ` Lukasz Luba
@ 2021-08-23 17:06           ` Vincent Donnefort
  0 siblings, 0 replies; 31+ messages in thread
From: Vincent Donnefort @ 2021-08-23 17:06 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: Viresh Kumar, Rafael J. Wysocki, Peter Zijlstra,
	Rafael J. Wysocki, Vincent Guittot, Quentin Perret, Linux PM,
	Ionela Voinescu, Dietmar Eggemann, Matthias Kaehlcke

On Tue, Aug 17, 2021 at 10:03:53AM +0100, Lukasz Luba wrote:
> 
> 
> On 8/17/21 8:06 AM, Viresh Kumar wrote:
> >On 16-08-21, 16:19, Rafael J. Wysocki wrote:
> >>On Tue, Aug 10, 2021 at 8:13 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> >>>  I understand  that this was done to not do the efficient stuff in case of
> >>>   userspace/powersave/performance governors.
> >>>
> >>>   What about reusing following for finding all such cases ?
> >>>
> >>>         policy->governor->flags & CPUFREQ_GOV_DYNAMIC_SWITCHING
> >>>
> >>>   The driver can set a flag to tell if it wants efficient frequencies
> >>>   or not, and at runtime we apply efficient algorithm only if the
> >>>   current governor does DVFS, which will leave out
> >>>   userspace/performance/powersave.
> >>
> >>As long as this can be done without actually accessing
> >>policy->governor->flags on every transition, it sounds like a good
> >>idea.
> >
> >Great.
> >
> >Vincent, I hope you will be taking this forward then ?
> >
> 
> Vincent has been on vacations for quite a while, but he will be back
> next week.
> 
> Thank you guys for the comments.

Sorry, I was indeed off for few weeks, that's why I was silent until now.
Looks like I missed the party here. But if I sum-up what you discussed in
the thread:

  * With a newly introduced callback .register_em(), we could use the EM
    table to mark inefficiencies into the CPUFreq table from the CPUFreq
    driver.

  * Instead of having a RELATION_E, resolving inefficiencies to efficient
    frequencies whenever the driver supports it and the governor declares
    CPUFREQ_GOV_DYNAMIC_SWITCHING.

I'll have a look and I'll prepare a new version, based on:

  [PATCH V3 0/9] Add callback to register with energy model 

-- 
Vincent

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2021-08-23 17:06 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-08 10:08 [PATCH v4 0/9] Inefficient OPPs Vincent Donnefort
2021-07-08 10:08 ` [PATCH v4 1/9] PM / EM: Fix inefficient states detection Vincent Donnefort
2021-07-08 10:08 ` [PATCH v4 2/9] PM / EM: Mark inefficient states Vincent Donnefort
2021-07-22  7:25   ` Lukasz Luba
2021-07-08 10:09 ` [PATCH v4 3/9] PM / EM: Extend em_perf_domain with a flag field Vincent Donnefort
2021-07-22  7:27   ` Lukasz Luba
2021-07-08 10:09 ` [PATCH v4 4/9] PM / EM: Allow skipping inefficient states Vincent Donnefort
2021-07-22 13:09   ` Lukasz Luba
2021-07-08 10:09 ` [PATCH v4 5/9] cpufreq: Add an interface to mark inefficient frequencies Vincent Donnefort
2021-07-08 10:09 ` [PATCH v4 6/9] cpufreq: Add a new freq-table relation CPUFREQ_RELATION_E Vincent Donnefort
2021-07-22  8:17   ` Lukasz Luba
2021-07-08 10:09 ` [PATCH v4 7/9] cpufreq: CPUFREQ_RELATION_E in schedutil ondemand and conservative Vincent Donnefort
2021-07-08 10:09 ` [PATCH v4 8/9] cpufreq: Add driver flag CPUFREQ_READ_ENERGY_MODEL Vincent Donnefort
2021-07-08 10:09 ` [PATCH v4 9/9] cpufreq: dt: Add CPUFREQ_READ_ENERGY_MODEL Vincent Donnefort
2021-08-04 16:21 ` [PATCH v4 0/9] Inefficient OPPs Rafael J. Wysocki
2021-08-10  6:13   ` Viresh Kumar
2021-08-10  7:39     ` Viresh Kumar
2021-08-10 12:28     ` Lukasz Luba
2021-08-10 14:07       ` Quentin Perret
2021-08-10 14:18         ` Lukasz Luba
2021-08-10 15:12           ` Lukasz Luba
2021-08-10 15:47             ` Quentin Perret
2021-08-11  5:03               ` Viresh Kumar
2021-08-11 11:38                 ` Lukasz Luba
2021-08-16 14:35                   ` Rafael J. Wysocki
2021-08-17  7:09                     ` Viresh Kumar
2021-08-11  4:01       ` Viresh Kumar
2021-08-16 14:19     ` Rafael J. Wysocki
2021-08-17  7:06       ` Viresh Kumar
2021-08-17  9:03         ` Lukasz Luba
2021-08-23 17:06           ` Vincent Donnefort

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.