linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/5] Thermal devfreq cooling improvements with Energy Model
@ 2020-11-18 12:03 Lukasz Luba
  2020-11-18 12:03 ` [PATCH v2 1/5] thermal: devfreq_cooling: change tracing function and arguments Lukasz Luba
                   ` (5 more replies)
  0 siblings, 6 replies; 21+ messages in thread
From: Lukasz Luba @ 2020-11-18 12:03 UTC (permalink / raw)
  To: linux-kernel, linux-pm, dri-devel
  Cc: rui.zhang, amit.kucheria, daniel.lezcano, lukasz.luba,
	orjan.eide, robh, alyssa.rosenzweig, steven.price, airlied,
	daniel, ionela.voinescu

Hi all,

This patch set is a continuation of my previous work, which aimed
to add Energy Model to all devices. This series is a follow up
for the patches which got merged to v5.9-rc1. It aims to change
the thermal devfreq cooling and use the Energy Model instead of
private power table and structures. The new registration interface
in the patch 3/5 helps to register devfreq cooling and the EM in one
call. There is also another improvement, patch 2/5 is changing the
way how thermal gets the device status. Now it's taken on demand
and stored as a copy. The last patch wouldn't go through thermal tree,
but it's here for consistency.

The patch set is based on current next-20201118, which has new EM API
in the pm/linux-next tree.

changes:
v2:
- renamed freq_get_state() and related to perf_idx pattern as
  suggested by Ionela
v1 [2]

Regards,
Lukasz Luba

[1] https://lkml.org/lkml/2020/5/11/326
[2] https://lore.kernel.org/linux-pm/20200921122007.29610-1-lukasz.luba@arm.com/

Lukasz Luba (5):
  thermal: devfreq_cooling: change tracing function and arguments
  thermal: devfreq_cooling: get a copy of device status
  thermal: devfreq_cooling: add new registration functions with Energy
    Model
  thermal: devfreq_cooling: remove old power model and use EM
  drm/panfrost: Register devfreq cooling and attempt to add Energy Model

 drivers/gpu/drm/panfrost/panfrost_devfreq.c |   2 +-
 drivers/thermal/devfreq_cooling.c           | 434 ++++++++++----------
 include/linux/devfreq_cooling.h             |  39 +-
 include/trace/events/thermal.h              |  19 +-
 4 files changed, 259 insertions(+), 235 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v2 1/5] thermal: devfreq_cooling: change tracing function and arguments
  2020-11-18 12:03 [PATCH v2 0/5] Thermal devfreq cooling improvements with Energy Model Lukasz Luba
@ 2020-11-18 12:03 ` Lukasz Luba
  2020-12-02 10:23   ` Ionela Voinescu
  2020-11-18 12:03 ` [PATCH v2 2/5] thermal: devfreq_cooling: get a copy of device status Lukasz Luba
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Lukasz Luba @ 2020-11-18 12:03 UTC (permalink / raw)
  To: linux-kernel, linux-pm, dri-devel
  Cc: rui.zhang, amit.kucheria, daniel.lezcano, lukasz.luba,
	orjan.eide, robh, alyssa.rosenzweig, steven.price, airlied,
	daniel, ionela.voinescu

Prepare for deleting the static and dynamic power calculation and clean
the trace function. These two fields are going to be removed in the next
changes.

Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> # for tracing code
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 drivers/thermal/devfreq_cooling.c |  3 +--
 include/trace/events/thermal.h    | 19 +++++++++----------
 2 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
index dfab49a67252..659c0143c9f0 100644
--- a/drivers/thermal/devfreq_cooling.c
+++ b/drivers/thermal/devfreq_cooling.c
@@ -277,8 +277,7 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
 		*power = dyn_power + static_power;
 	}
 
-	trace_thermal_power_devfreq_get_power(cdev, status, freq, dyn_power,
-					      static_power, *power);
+	trace_thermal_power_devfreq_get_power(cdev, status, freq, *power);
 
 	return 0;
 fail:
diff --git a/include/trace/events/thermal.h b/include/trace/events/thermal.h
index 135e5421f003..8a5f04888abd 100644
--- a/include/trace/events/thermal.h
+++ b/include/trace/events/thermal.h
@@ -153,31 +153,30 @@ TRACE_EVENT(thermal_power_cpu_limit,
 TRACE_EVENT(thermal_power_devfreq_get_power,
 	TP_PROTO(struct thermal_cooling_device *cdev,
 		 struct devfreq_dev_status *status, unsigned long freq,
-		u32 dynamic_power, u32 static_power, u32 power),
+		u32 power),
 
-	TP_ARGS(cdev, status,  freq, dynamic_power, static_power, power),
+	TP_ARGS(cdev, status,  freq, power),
 
 	TP_STRUCT__entry(
 		__string(type,         cdev->type    )
 		__field(unsigned long, freq          )
-		__field(u32,           load          )
-		__field(u32,           dynamic_power )
-		__field(u32,           static_power  )
+		__field(u32,           busy_time)
+		__field(u32,           total_time)
 		__field(u32,           power)
 	),
 
 	TP_fast_assign(
 		__assign_str(type, cdev->type);
 		__entry->freq = freq;
-		__entry->load = (100 * status->busy_time) / status->total_time;
-		__entry->dynamic_power = dynamic_power;
-		__entry->static_power = static_power;
+		__entry->busy_time = status->busy_time;
+		__entry->total_time = status->total_time;
 		__entry->power = power;
 	),
 
-	TP_printk("type=%s freq=%lu load=%u dynamic_power=%u static_power=%u power=%u",
+	TP_printk("type=%s freq=%lu load=%u power=%u",
 		__get_str(type), __entry->freq,
-		__entry->load, __entry->dynamic_power, __entry->static_power,
+		__entry->total_time == 0 ? 0 :
+			(100 * __entry->busy_time) / __entry->total_time,
 		__entry->power)
 );
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 2/5] thermal: devfreq_cooling: get a copy of device status
  2020-11-18 12:03 [PATCH v2 0/5] Thermal devfreq cooling improvements with Energy Model Lukasz Luba
  2020-11-18 12:03 ` [PATCH v2 1/5] thermal: devfreq_cooling: change tracing function and arguments Lukasz Luba
@ 2020-11-18 12:03 ` Lukasz Luba
  2020-12-02 10:23   ` Ionela Voinescu
  2020-12-03 13:09   ` Daniel Lezcano
  2020-11-18 12:03 ` [PATCH v2 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model Lukasz Luba
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 21+ messages in thread
From: Lukasz Luba @ 2020-11-18 12:03 UTC (permalink / raw)
  To: linux-kernel, linux-pm, dri-devel
  Cc: rui.zhang, amit.kucheria, daniel.lezcano, lukasz.luba,
	orjan.eide, robh, alyssa.rosenzweig, steven.price, airlied,
	daniel, ionela.voinescu

Devfreq cooling needs to now the correct status of the device in order
to operate. Do not rely on Devfreq last_status which might be a stale data
and get more up-to-date values of the load.

Devfreq framework can change the device status in the background. To
mitigate this situation make a copy of the status structure and use it
for internal calculations.

In addition this patch adds normalization function, which also makes sure
that whatever data comes from the device, it is in a sane range.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 drivers/thermal/devfreq_cooling.c | 52 +++++++++++++++++++++++++------
 1 file changed, 43 insertions(+), 9 deletions(-)

diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
index 659c0143c9f0..925523694462 100644
--- a/drivers/thermal/devfreq_cooling.c
+++ b/drivers/thermal/devfreq_cooling.c
@@ -227,20 +227,46 @@ static inline unsigned long get_total_power(struct devfreq_cooling_device *dfc,
 							       voltage);
 }
 
+static void _normalize_load(struct devfreq_dev_status *status)
+{
+	/* Make some space if needed */
+	if (status->busy_time > 0xffff) {
+		status->busy_time >>= 10;
+		status->total_time >>= 10;
+	}
+
+	if (status->busy_time > status->total_time)
+		status->busy_time = status->total_time;
+
+	status->busy_time *= 100;
+	status->busy_time /= status->total_time ? : 1;
+
+	/* Avoid division by 0 */
+	status->busy_time = status->busy_time ? : 1;
+	status->total_time = 100;
+}
 
 static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cdev,
 					       u32 *power)
 {
 	struct devfreq_cooling_device *dfc = cdev->devdata;
 	struct devfreq *df = dfc->devfreq;
-	struct devfreq_dev_status *status = &df->last_status;
+	struct devfreq_dev_status status;
 	unsigned long state;
-	unsigned long freq = status->current_frequency;
+	unsigned long freq;
 	unsigned long voltage;
 	u32 dyn_power = 0;
 	u32 static_power = 0;
 	int res;
 
+	mutex_lock(&df->lock);
+	res = df->profile->get_dev_status(df->dev.parent, &status);
+	mutex_unlock(&df->lock);
+	if (res)
+		return res;
+
+	freq = status.current_frequency;
+
 	state = freq_get_state(dfc, freq);
 	if (state == THERMAL_CSTATE_INVALID) {
 		res = -EAGAIN;
@@ -268,16 +294,18 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
 	} else {
 		dyn_power = dfc->power_table[state];
 
+		_normalize_load(&status);
+
 		/* Scale dynamic power for utilization */
-		dyn_power *= status->busy_time;
-		dyn_power /= status->total_time;
+		dyn_power *= status.busy_time;
+		dyn_power /= status.total_time;
 		/* Get static power */
 		static_power = get_static_power(dfc, freq);
 
 		*power = dyn_power + static_power;
 	}
 
-	trace_thermal_power_devfreq_get_power(cdev, status, freq, *power);
+	trace_thermal_power_devfreq_get_power(cdev, &status, freq, *power);
 
 	return 0;
 fail:
@@ -309,14 +337,20 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
 {
 	struct devfreq_cooling_device *dfc = cdev->devdata;
 	struct devfreq *df = dfc->devfreq;
-	struct devfreq_dev_status *status = &df->last_status;
-	unsigned long freq = status->current_frequency;
+	struct devfreq_dev_status status;
 	unsigned long busy_time;
+	unsigned long freq;
 	s32 dyn_power;
 	u32 static_power;
 	s32 est_power;
 	int i;
 
+	mutex_lock(&df->lock);
+	status = df->last_status;
+	mutex_unlock(&df->lock);
+
+	freq = status.current_frequency;
+
 	if (dfc->power_ops->get_real_power) {
 		/* Scale for resource utilization */
 		est_power = power * dfc->res_util;
@@ -328,8 +362,8 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
 		dyn_power = dyn_power > 0 ? dyn_power : 0;
 
 		/* Scale dynamic power for utilization */
-		busy_time = status->busy_time ?: 1;
-		est_power = (dyn_power * status->total_time) / busy_time;
+		busy_time = status.busy_time ?: 1;
+		est_power = (dyn_power * status.total_time) / busy_time;
 	}
 
 	/*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model
  2020-11-18 12:03 [PATCH v2 0/5] Thermal devfreq cooling improvements with Energy Model Lukasz Luba
  2020-11-18 12:03 ` [PATCH v2 1/5] thermal: devfreq_cooling: change tracing function and arguments Lukasz Luba
  2020-11-18 12:03 ` [PATCH v2 2/5] thermal: devfreq_cooling: get a copy of device status Lukasz Luba
@ 2020-11-18 12:03 ` Lukasz Luba
  2020-12-02 10:24   ` Ionela Voinescu
  2020-12-03 15:40   ` Daniel Lezcano
  2020-11-18 12:03 ` [PATCH v2 4/5] thermal: devfreq_cooling: remove old power model and use EM Lukasz Luba
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 21+ messages in thread
From: Lukasz Luba @ 2020-11-18 12:03 UTC (permalink / raw)
  To: linux-kernel, linux-pm, dri-devel
  Cc: rui.zhang, amit.kucheria, daniel.lezcano, lukasz.luba,
	orjan.eide, robh, alyssa.rosenzweig, steven.price, airlied,
	daniel, ionela.voinescu

The Energy Model (EM) framework supports devices such as Devfreq. Create
new registration functions which automatically register EM for the thermal
devfreq_cooling devices. This patch prepares the code for coming changes
which are going to replace old power model with the new EM.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 drivers/thermal/devfreq_cooling.c | 99 ++++++++++++++++++++++++++++++-
 include/linux/devfreq_cooling.h   | 22 +++++++
 2 files changed, 120 insertions(+), 1 deletion(-)

diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
index 925523694462..b354271742c5 100644
--- a/drivers/thermal/devfreq_cooling.c
+++ b/drivers/thermal/devfreq_cooling.c
@@ -50,6 +50,8 @@ static DEFINE_IDA(devfreq_ida);
  * @capped_state:	index to cooling state with in dynamic power budget
  * @req_max_freq:	PM QoS request for limiting the maximum frequency
  *			of the devfreq device.
+ * @em:		Energy Model for the associated Devfreq device
+ * @em_registered:	Devfreq cooling registered the EM and should free it.
  */
 struct devfreq_cooling_device {
 	int id;
@@ -63,6 +65,8 @@ struct devfreq_cooling_device {
 	u32 res_util;
 	int capped_state;
 	struct dev_pm_qos_request req_max_freq;
+	struct em_perf_domain *em;
+	bool em_registered;
 };
 
 static int devfreq_cooling_get_max_state(struct thermal_cooling_device *cdev,
@@ -583,22 +587,115 @@ struct thermal_cooling_device *devfreq_cooling_register(struct devfreq *df)
 }
 EXPORT_SYMBOL_GPL(devfreq_cooling_register);
 
+/**
+ * devfreq_cooling_em_register_power() - Register devfreq cooling device with
+ *		power information and attempt to register Energy Model (EM)
+ * @df:		Pointer to devfreq device.
+ * @dfc_power:	Pointer to devfreq_cooling_power.
+ * @em_cb:	Callback functions providing the data of the EM
+ *
+ * Register a devfreq cooling device and attempt to register Energy Model. The
+ * available OPPs must be registered for the device.
+ *
+ * If @dfc_power is provided, the cooling device is registered with the
+ * power extensions. If @em_cb is provided it will be called for each OPP to
+ * calculate power value and cost. If @em_cb is not provided then simple Energy
+ * Model is going to be used, which requires "dynamic-power-coefficient" a
+ * devicetree property.
+ */
+struct thermal_cooling_device *
+devfreq_cooling_em_register_power(struct devfreq *df,
+				  struct devfreq_cooling_power *dfc_power,
+				  struct em_data_callback *em_cb)
+{
+	struct thermal_cooling_device *cdev;
+	struct devfreq_cooling_device *dfc;
+	struct device_node *np = NULL;
+	struct device *dev;
+	int nr_opp, ret;
+
+	if (IS_ERR_OR_NULL(df))
+		return ERR_PTR(-EINVAL);
+
+	dev = df->dev.parent;
+
+	if (em_cb) {
+		nr_opp = dev_pm_opp_get_opp_count(dev);
+		if (nr_opp <= 0) {
+			dev_err(dev, "No valid OPPs found\n");
+			return ERR_PTR(-EINVAL);
+		}
+
+		ret = em_dev_register_perf_domain(dev, nr_opp, em_cb, NULL, false);
+	} else {
+		ret = dev_pm_opp_of_register_em(dev, NULL);
+	}
+
+	if (ret)
+		dev_warn(dev, "Unable to register EM for devfreq cooling device (%d)\n",
+			 ret);
+
+	if (dev->of_node)
+		np = of_node_get(dev->of_node);
+
+	cdev = of_devfreq_cooling_register_power(np, df, dfc_power);
+
+	if (np)
+		of_node_put(np);
+
+	if (IS_ERR_OR_NULL(cdev)) {
+		if (!ret)
+			em_dev_unregister_perf_domain(dev);
+	} else {
+		dfc = cdev->devdata;
+		dfc->em_registered = !ret;
+	}
+
+	return cdev;
+}
+EXPORT_SYMBOL_GPL(devfreq_cooling_em_register_power);
+
+/**
+ * devfreq_cooling_em_register() - Register devfreq cooling device together
+ *				with Energy Model.
+ * @df:		Pointer to devfreq device.
+ * @em_cb:	Callback functions providing the data of the Energy Model
+ *
+ * This function attempts to register Energy Model for devfreq device and then
+ * register the devfreq cooling device.
+ */
+struct thermal_cooling_device *
+devfreq_cooling_em_register(struct devfreq *df, struct em_data_callback *em_cb)
+{
+	return devfreq_cooling_em_register_power(df, NULL, em_cb);
+}
+EXPORT_SYMBOL_GPL(devfreq_cooling_em_register);
+
 /**
  * devfreq_cooling_unregister() - Unregister devfreq cooling device.
  * @cdev: Pointer to devfreq cooling device to unregister.
+ *
+ * Unregisters devfreq cooling device and related Energy Model if it was
+ * present.
  */
 void devfreq_cooling_unregister(struct thermal_cooling_device *cdev)
 {
 	struct devfreq_cooling_device *dfc;
+	struct device *dev;
 
-	if (!cdev)
+	if (IS_ERR_OR_NULL(cdev))
 		return;
 
 	dfc = cdev->devdata;
+	dev = dfc->devfreq->dev.parent;
 
 	thermal_cooling_device_unregister(dfc->cdev);
 	ida_simple_remove(&devfreq_ida, dfc->id);
 	dev_pm_qos_remove_request(&dfc->req_max_freq);
+
+	if (dfc->em_registered)
+		em_dev_unregister_perf_domain(dev);
+
 	kfree(dfc->power_table);
 	kfree(dfc->freq_table);
 
diff --git a/include/linux/devfreq_cooling.h b/include/linux/devfreq_cooling.h
index 9df2dfca68dd..19868fb922f1 100644
--- a/include/linux/devfreq_cooling.h
+++ b/include/linux/devfreq_cooling.h
@@ -11,6 +11,7 @@
 #define __DEVFREQ_COOLING_H__
 
 #include <linux/devfreq.h>
+#include <linux/energy_model.h>
 #include <linux/thermal.h>
 
 
@@ -65,6 +66,13 @@ struct thermal_cooling_device *
 of_devfreq_cooling_register(struct device_node *np, struct devfreq *df);
 struct thermal_cooling_device *devfreq_cooling_register(struct devfreq *df);
 void devfreq_cooling_unregister(struct thermal_cooling_device *dfc);
+struct thermal_cooling_device *
+devfreq_cooling_em_register_power(struct devfreq *df,
+				  struct devfreq_cooling_power *dfc_power,
+				  struct em_data_callback *em_cb);
+struct thermal_cooling_device *
+devfreq_cooling_em_register(struct devfreq *df,
+			    struct em_data_callback *em_cb);
 
 #else /* !CONFIG_DEVFREQ_THERMAL */
 
@@ -87,6 +95,20 @@ devfreq_cooling_register(struct devfreq *df)
 	return ERR_PTR(-EINVAL);
 }
 
+static inline struct thermal_cooling_device *
+devfreq_cooling_em_register_power(struct devfreq *df,
+				  struct devfreq_cooling_power *dfc_power,
+				  struct em_data_callback *em_cb)
+{
+	return ERR_PTR(-EINVAL);
+}
+
+static inline struct thermal_cooling_device *
+devfreq_cooling_em_register(struct devfreq *df,	struct em_data_callback *em_cb)
+{
+	return ERR_PTR(-EINVAL);
+}
+
 static inline void
 devfreq_cooling_unregister(struct thermal_cooling_device *dfc)
 {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 4/5] thermal: devfreq_cooling: remove old power model and use EM
  2020-11-18 12:03 [PATCH v2 0/5] Thermal devfreq cooling improvements with Energy Model Lukasz Luba
                   ` (2 preceding siblings ...)
  2020-11-18 12:03 ` [PATCH v2 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model Lukasz Luba
@ 2020-11-18 12:03 ` Lukasz Luba
  2020-12-02 10:26   ` Ionela Voinescu
  2020-11-18 12:03 ` [PATCH v2 5/5] drm/panfrost: Register devfreq cooling and attempt to add Energy Model Lukasz Luba
  2020-12-02 15:01 ` [PATCH v2 0/5] Thermal devfreq cooling improvements with " Daniel Lezcano
  5 siblings, 1 reply; 21+ messages in thread
From: Lukasz Luba @ 2020-11-18 12:03 UTC (permalink / raw)
  To: linux-kernel, linux-pm, dri-devel
  Cc: rui.zhang, amit.kucheria, daniel.lezcano, lukasz.luba,
	orjan.eide, robh, alyssa.rosenzweig, steven.price, airlied,
	daniel, ionela.voinescu

Remove old power model and use new Energy Model to calculate the power
budget. It drops static + dynamic power calculations and power table
in order to use Energy Model performance domain data. This model
should be easy to use and could find more users. It is also less
complicated to setup the needed structures.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 drivers/thermal/devfreq_cooling.c | 302 ++++++++++--------------------
 include/linux/devfreq_cooling.h   |  17 --
 2 files changed, 96 insertions(+), 223 deletions(-)

diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
index b354271742c5..28754ad46b96 100644
--- a/drivers/thermal/devfreq_cooling.c
+++ b/drivers/thermal/devfreq_cooling.c
@@ -33,20 +33,17 @@ static DEFINE_IDA(devfreq_ida);
  * @cdev:	Pointer to associated thermal cooling device.
  * @devfreq:	Pointer to associated devfreq device.
  * @cooling_state:	Current cooling state.
- * @power_table:	Pointer to table with maximum power draw for each
- *			cooling state. State is the index into the table, and
- *			the power is in mW.
  * @freq_table:	Pointer to a table with the frequencies sorted in descending
  *		order.  You can index the table by cooling device state
- * @freq_table_size:	Size of the @freq_table and @power_table
- * @power_ops:	Pointer to devfreq_cooling_power, used to generate the
- *		@power_table.
+ * @max_state:	It is the last index, that is, one less than the number of the
+ *		OPPs
+ * @power_ops:	Pointer to devfreq_cooling_power, a more precised model.
  * @res_util:	Resource utilization scaling factor for the power.
  *		It is multiplied by 100 to minimize the error. It is used
  *		for estimation of the power budget instead of using
- *		'utilization' (which is	'busy_time / 'total_time').
- *		The 'res_util' range is from 100 to (power_table[state] * 100)
- *		for the corresponding 'state'.
+ *		'utilization' (which is	'busy_time' / 'total_time').
+ *		The 'res_util' range is from 100 to power * 100	for the
+ *		corresponding 'state'.
  * @capped_state:	index to cooling state with in dynamic power budget
  * @req_max_freq:	PM QoS request for limiting the maximum frequency
  *			of the devfreq device.
@@ -58,9 +55,8 @@ struct devfreq_cooling_device {
 	struct thermal_cooling_device *cdev;
 	struct devfreq *devfreq;
 	unsigned long cooling_state;
-	u32 *power_table;
 	u32 *freq_table;
-	size_t freq_table_size;
+	size_t max_state;
 	struct devfreq_cooling_power *power_ops;
 	u32 res_util;
 	int capped_state;
@@ -74,7 +70,7 @@ static int devfreq_cooling_get_max_state(struct thermal_cooling_device *cdev,
 {
 	struct devfreq_cooling_device *dfc = cdev->devdata;
 
-	*state = dfc->freq_table_size - 1;
+	*state = dfc->max_state;
 
 	return 0;
 }
@@ -96,16 +92,22 @@ static int devfreq_cooling_set_cur_state(struct thermal_cooling_device *cdev,
 	struct devfreq *df = dfc->devfreq;
 	struct device *dev = df->dev.parent;
 	unsigned long freq;
+	int perf_idx;
 
 	if (state == dfc->cooling_state)
 		return 0;
 
 	dev_dbg(dev, "Setting cooling state %lu\n", state);
 
-	if (state >= dfc->freq_table_size)
+	if (state > dfc->max_state)
 		return -EINVAL;
 
-	freq = dfc->freq_table[state];
+	if (dfc->em) {
+		perf_idx = dfc->max_state - state;
+		freq = dfc->em->table[perf_idx].frequency * 1000;
+	} else {
+		freq = dfc->freq_table[state];
+	}
 
 	dev_pm_qos_update_request(&dfc->req_max_freq,
 				  DIV_ROUND_UP(freq, HZ_PER_KHZ));
@@ -116,24 +118,24 @@ static int devfreq_cooling_set_cur_state(struct thermal_cooling_device *cdev,
 }
 
 /**
- * freq_get_state() - get the cooling state corresponding to a frequency
+ * get_perf_idx() - get the performance index corresponding to a frequency
  * @dfc:	Pointer to devfreq cooling device
- * @freq:	frequency in Hz
+ * @freq:	frequency in kHz
  *
- * Return: the cooling state associated with the @freq, or
- * THERMAL_CSTATE_INVALID if it wasn't found.
+ * Return: the performance index associated with the @freq, or
+ * -EINVAL if it wasn't found.
  */
-static unsigned long
-freq_get_state(struct devfreq_cooling_device *dfc, unsigned long freq)
+static int get_perf_idx(struct devfreq_cooling_device *dfc, unsigned long freq)
 {
+	struct em_perf_domain *em = dfc->em;
 	int i;
 
-	for (i = 0; i < dfc->freq_table_size; i++) {
-		if (dfc->freq_table[i] == freq)
+	for (i = 0; i < em->nr_perf_states; i++) {
+		if (em->table[i].frequency == freq)
 			return i;
 	}
 
-	return THERMAL_CSTATE_INVALID;
+	return -EINVAL;
 }
 
 static unsigned long get_voltage(struct devfreq *df, unsigned long freq)
@@ -164,71 +166,15 @@ static unsigned long get_voltage(struct devfreq *df, unsigned long freq)
 	return voltage;
 }
 
-/**
- * get_static_power() - calculate the static power
- * @dfc:	Pointer to devfreq cooling device
- * @freq:	Frequency in Hz
- *
- * Calculate the static power in milliwatts using the supplied
- * get_static_power().  The current voltage is calculated using the
- * OPP library.  If no get_static_power() was supplied, assume the
- * static power is negligible.
- */
-static unsigned long
-get_static_power(struct devfreq_cooling_device *dfc, unsigned long freq)
-{
-	struct devfreq *df = dfc->devfreq;
-	unsigned long voltage;
-
-	if (!dfc->power_ops->get_static_power)
-		return 0;
-
-	voltage = get_voltage(df, freq);
-
-	if (voltage == 0)
-		return 0;
-
-	return dfc->power_ops->get_static_power(df, voltage);
-}
-
-/**
- * get_dynamic_power - calculate the dynamic power
- * @dfc:	Pointer to devfreq cooling device
- * @freq:	Frequency in Hz
- * @voltage:	Voltage in millivolts
- *
- * Calculate the dynamic power in milliwatts consumed by the device at
- * frequency @freq and voltage @voltage.  If the get_dynamic_power()
- * was supplied as part of the devfreq_cooling_power struct, then that
- * function is used.  Otherwise, a simple power model (Pdyn = Coeff *
- * Voltage^2 * Frequency) is used.
- */
-static unsigned long
-get_dynamic_power(struct devfreq_cooling_device *dfc, unsigned long freq,
-		  unsigned long voltage)
+static void dfc_em_get_requested_power(struct em_perf_domain *em,
+				       struct devfreq_dev_status *status,
+				       u32 *power, int perf_idx)
 {
-	u64 power;
-	u32 freq_mhz;
-	struct devfreq_cooling_power *dfc_power = dfc->power_ops;
-
-	if (dfc_power->get_dynamic_power)
-		return dfc_power->get_dynamic_power(dfc->devfreq, freq,
-						    voltage);
+	*power = em->table[perf_idx].power;
 
-	freq_mhz = freq / 1000000;
-	power = (u64)dfc_power->dyn_power_coeff * freq_mhz * voltage * voltage;
-	do_div(power, 1000000000);
-
-	return power;
-}
-
-
-static inline unsigned long get_total_power(struct devfreq_cooling_device *dfc,
-					    unsigned long freq,
-					    unsigned long voltage)
-{
-	return get_static_power(dfc, freq) + get_dynamic_power(dfc, freq,
-							       voltage);
+	/* Scale power for utilization */
+	*power *= status->busy_time;
+	*power /= status->total_time;
 }
 
 static void _normalize_load(struct devfreq_dev_status *status)
@@ -259,9 +205,7 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
 	unsigned long state;
 	unsigned long freq;
 	unsigned long voltage;
-	u32 dyn_power = 0;
-	u32 static_power = 0;
-	int res;
+	int res, perf_idx;
 
 	mutex_lock(&df->lock);
 	res = df->profile->get_dev_status(df->dev.parent, &status);
@@ -271,13 +215,7 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
 
 	freq = status.current_frequency;
 
-	state = freq_get_state(dfc, freq);
-	if (state == THERMAL_CSTATE_INVALID) {
-		res = -EAGAIN;
-		goto fail;
-	}
-
-	if (dfc->power_ops->get_real_power) {
+	if (dfc->power_ops && dfc->power_ops->get_real_power) {
 		voltage = get_voltage(df, freq);
 		if (voltage == 0) {
 			res = -EINVAL;
@@ -287,7 +225,7 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
 		res = dfc->power_ops->get_real_power(df, power, freq, voltage);
 		if (!res) {
 			state = dfc->capped_state;
-			dfc->res_util = dfc->power_table[state];
+			dfc->res_util = dfc->em->table[state].power;
 			dfc->res_util *= SCALE_ERROR_MITIGATION;
 
 			if (*power > 1)
@@ -296,17 +234,15 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
 			goto fail;
 		}
 	} else {
-		dyn_power = dfc->power_table[state];
+		/* Energy Model frequencies are in kHz */
+		perf_idx = get_perf_idx(dfc, freq / 1000);
+		if (perf_idx < 0) {
+			res = -EAGAIN;
+			goto fail;
+		}
 
 		_normalize_load(&status);
-
-		/* Scale dynamic power for utilization */
-		dyn_power *= status.busy_time;
-		dyn_power /= status.total_time;
-		/* Get static power */
-		static_power = get_static_power(dfc, freq);
-
-		*power = dyn_power + static_power;
+		dfc_em_get_requested_power(dfc->em, &status, power, perf_idx);
 	}
 
 	trace_thermal_power_devfreq_get_power(cdev, &status, freq, *power);
@@ -319,20 +255,17 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
 }
 
 static int devfreq_cooling_state2power(struct thermal_cooling_device *cdev,
-				       unsigned long state,
-				       u32 *power)
+				       unsigned long state, u32 *power)
 {
 	struct devfreq_cooling_device *dfc = cdev->devdata;
-	unsigned long freq;
-	u32 static_power;
+	int perf_idx;
 
-	if (state >= dfc->freq_table_size)
+	if (state > dfc->max_state)
 		return -EINVAL;
 
-	freq = dfc->freq_table[state];
-	static_power = get_static_power(dfc, freq);
+	perf_idx = dfc->max_state - state;
+	*power = dfc->em->table[perf_idx].power;
 
-	*power = dfc->power_table[state] + static_power;
 	return 0;
 }
 
@@ -342,10 +275,7 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
 	struct devfreq_cooling_device *dfc = cdev->devdata;
 	struct devfreq *df = dfc->devfreq;
 	struct devfreq_dev_status status;
-	unsigned long busy_time;
 	unsigned long freq;
-	s32 dyn_power;
-	u32 static_power;
 	s32 est_power;
 	int i;
 
@@ -355,31 +285,27 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
 
 	freq = status.current_frequency;
 
-	if (dfc->power_ops->get_real_power) {
+	if (dfc->power_ops && dfc->power_ops->get_real_power) {
 		/* Scale for resource utilization */
 		est_power = power * dfc->res_util;
 		est_power /= SCALE_ERROR_MITIGATION;
 	} else {
-		static_power = get_static_power(dfc, freq);
-
-		dyn_power = power - static_power;
-		dyn_power = dyn_power > 0 ? dyn_power : 0;
-
-		/* Scale dynamic power for utilization */
-		busy_time = status.busy_time ?: 1;
-		est_power = (dyn_power * status.total_time) / busy_time;
+		_normalize_load(&status);
+		est_power = power * status.total_time;
+		est_power /= status.busy_time;
 	}
 
 	/*
 	 * Find the first cooling state that is within the power
-	 * budget for dynamic power.
+	 * budget. The EM power table is sorted ascending.
 	 */
-	for (i = 0; i < dfc->freq_table_size - 1; i++)
-		if (est_power >= dfc->power_table[i])
+	for (i = dfc->max_state; i > 0; i--)
+		if (est_power >= dfc->em->table[i].power)
 			break;
 
-	*state = i;
-	dfc->capped_state = i;
+	*state = dfc->max_state - i;
+	dfc->capped_state = *state;
+
 	trace_thermal_power_devfreq_limit(cdev, freq, *state, power);
 	return 0;
 }
@@ -391,91 +317,43 @@ static struct thermal_cooling_device_ops devfreq_cooling_ops = {
 };
 
 /**
- * devfreq_cooling_gen_tables() - Generate power and freq tables.
- * @dfc: Pointer to devfreq cooling device.
- *
- * Generate power and frequency tables: the power table hold the
- * device's maximum power usage at each cooling state (OPP).  The
- * static and dynamic power using the appropriate voltage and
- * frequency for the state, is acquired from the struct
- * devfreq_cooling_power, and summed to make the maximum power draw.
+ * devfreq_cooling_gen_tables() - Generate frequency table.
+ * @dfc:	Pointer to devfreq cooling device.
+ * @num_opps:	Number of OPPs
  *
- * The frequency table holds the frequencies in descending order.
- * That way its indexed by cooling device state.
- *
- * The tables are malloced, and pointers put in dfc.  They must be
- * freed when unregistering the devfreq cooling device.
+ * Generate frequency table which holds the frequencies in descending
+ * order. That way its indexed by cooling device state. This is for
+ * compatibility with drivers which do not register Energy Model.
  *
  * Return: 0 on success, negative error code on failure.
  */
-static int devfreq_cooling_gen_tables(struct devfreq_cooling_device *dfc)
+static int devfreq_cooling_gen_tables(struct devfreq_cooling_device *dfc,
+				      int num_opps)
 {
 	struct devfreq *df = dfc->devfreq;
 	struct device *dev = df->dev.parent;
-	int ret, num_opps;
 	unsigned long freq;
-	u32 *power_table = NULL;
-	u32 *freq_table;
 	int i;
 
-	num_opps = dev_pm_opp_get_opp_count(dev);
-
-	if (dfc->power_ops) {
-		power_table = kcalloc(num_opps, sizeof(*power_table),
-				      GFP_KERNEL);
-		if (!power_table)
-			return -ENOMEM;
-	}
-
-	freq_table = kcalloc(num_opps, sizeof(*freq_table),
+	dfc->freq_table = kcalloc(num_opps, sizeof(*dfc->freq_table),
 			     GFP_KERNEL);
-	if (!freq_table) {
-		ret = -ENOMEM;
-		goto free_power_table;
-	}
+	if (!dfc->freq_table)
+		return -ENOMEM;
 
 	for (i = 0, freq = ULONG_MAX; i < num_opps; i++, freq--) {
-		unsigned long power, voltage;
 		struct dev_pm_opp *opp;
 
 		opp = dev_pm_opp_find_freq_floor(dev, &freq);
 		if (IS_ERR(opp)) {
-			ret = PTR_ERR(opp);
-			goto free_tables;
+			kfree(dfc->freq_table);
+			return PTR_ERR(opp);
 		}
 
-		voltage = dev_pm_opp_get_voltage(opp) / 1000; /* mV */
 		dev_pm_opp_put(opp);
-
-		if (dfc->power_ops) {
-			if (dfc->power_ops->get_real_power)
-				power = get_total_power(dfc, freq, voltage);
-			else
-				power = get_dynamic_power(dfc, freq, voltage);
-
-			dev_dbg(dev, "Power table: %lu MHz @ %lu mV: %lu = %lu mW\n",
-				freq / 1000000, voltage, power, power);
-
-			power_table[i] = power;
-		}
-
-		freq_table[i] = freq;
+		dfc->freq_table[i] = freq;
 	}
 
-	if (dfc->power_ops)
-		dfc->power_table = power_table;
-
-	dfc->freq_table = freq_table;
-	dfc->freq_table_size = num_opps;
-
 	return 0;
-
-free_tables:
-	kfree(freq_table);
-free_power_table:
-	kfree(power_table);
-
-	return ret;
 }
 
 /**
@@ -500,7 +378,7 @@ of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df,
 	struct thermal_cooling_device *cdev;
 	struct devfreq_cooling_device *dfc;
 	char dev_name[THERMAL_NAME_LENGTH];
-	int err;
+	int err, num_opps;
 
 	dfc = kzalloc(sizeof(*dfc), GFP_KERNEL);
 	if (!dfc)
@@ -508,28 +386,45 @@ of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df,
 
 	dfc->devfreq = df;
 
-	if (dfc_power) {
-		dfc->power_ops = dfc_power;
-
+	dfc->em = em_pd_get(df->dev.parent);
+	if (dfc->em) {
 		devfreq_cooling_ops.get_requested_power =
 			devfreq_cooling_get_requested_power;
 		devfreq_cooling_ops.state2power = devfreq_cooling_state2power;
 		devfreq_cooling_ops.power2state = devfreq_cooling_power2state;
+
+		dfc->power_ops = dfc_power;
+
+		num_opps = em_pd_nr_perf_states(dfc->em);
+	} else {
+		/* Backward compatibility for drivers which do not use IPA */
+		dev_dbg(df->dev.parent, "missing EM for cooling device\n");
+
+		num_opps = dev_pm_opp_get_opp_count(df->dev.parent);
+
+		err = devfreq_cooling_gen_tables(dfc, num_opps);
+		if (err)
+			goto free_dfc;
 	}
 
-	err = devfreq_cooling_gen_tables(dfc);
-	if (err)
+	if (num_opps <= 0) {
+		err = -EINVAL;
 		goto free_dfc;
+	}
+
+	/* max_state is an index, not a counter */
+	dfc->max_state = num_opps - 1;
 
 	err = dev_pm_qos_add_request(df->dev.parent, &dfc->req_max_freq,
 				     DEV_PM_QOS_MAX_FREQUENCY,
 				     PM_QOS_MAX_FREQUENCY_DEFAULT_VALUE);
 	if (err < 0)
-		goto free_tables;
+		goto free_table;
 
 	err = ida_simple_get(&devfreq_ida, 0, 0, GFP_KERNEL);
 	if (err < 0)
 		goto remove_qos_req;
+
 	dfc->id = err;
 
 	snprintf(dev_name, sizeof(dev_name), "thermal-devfreq-%d", dfc->id);
@@ -550,12 +445,9 @@ of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df,
 
 release_ida:
 	ida_simple_remove(&devfreq_ida, dfc->id);
-
 remove_qos_req:
 	dev_pm_qos_remove_request(&dfc->req_max_freq);
-
-free_tables:
-	kfree(dfc->power_table);
+free_table:
 	kfree(dfc->freq_table);
 free_dfc:
 	kfree(dfc);
@@ -696,9 +588,7 @@ void devfreq_cooling_unregister(struct thermal_cooling_device *cdev)
 	if (dfc->em_registered)
 		em_dev_unregister_perf_domain(dev);
 
-	kfree(dfc->power_table);
 	kfree(dfc->freq_table);
-
 	kfree(dfc);
 }
 EXPORT_SYMBOL_GPL(devfreq_cooling_unregister);
diff --git a/include/linux/devfreq_cooling.h b/include/linux/devfreq_cooling.h
index 19868fb922f1..4890b12b54b4 100644
--- a/include/linux/devfreq_cooling.h
+++ b/include/linux/devfreq_cooling.h
@@ -17,17 +17,6 @@
 
 /**
  * struct devfreq_cooling_power - Devfreq cooling power ops
- * @get_static_power:	Take voltage, in mV, and return the static power
- *			in mW.  If NULL, the static power is assumed
- *			to be 0.
- * @get_dynamic_power:	Take voltage, in mV, and frequency, in HZ, and
- *			return the dynamic power draw in mW.  If NULL,
- *			a simple power model is used.
- * @dyn_power_coeff:	Coefficient for the simple dynamic power model in
- *			mW/(MHz mV mV).
- *			If get_dynamic_power() is NULL, then the
- *			dynamic power is calculated as
- *			@dyn_power_coeff * frequency * voltage^2
  * @get_real_power:	When this is set, the framework uses it to ask the
  *			device driver for the actual power.
  *			Some devices have more sophisticated methods
@@ -47,14 +36,8 @@
  *			max total (static + dynamic) power value for each OPP.
  */
 struct devfreq_cooling_power {
-	unsigned long (*get_static_power)(struct devfreq *devfreq,
-					  unsigned long voltage);
-	unsigned long (*get_dynamic_power)(struct devfreq *devfreq,
-					   unsigned long freq,
-					   unsigned long voltage);
 	int (*get_real_power)(struct devfreq *df, u32 *power,
 			      unsigned long freq, unsigned long voltage);
-	unsigned long dyn_power_coeff;
 };
 
 #ifdef CONFIG_DEVFREQ_THERMAL
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 5/5] drm/panfrost: Register devfreq cooling and attempt to add Energy Model
  2020-11-18 12:03 [PATCH v2 0/5] Thermal devfreq cooling improvements with Energy Model Lukasz Luba
                   ` (3 preceding siblings ...)
  2020-11-18 12:03 ` [PATCH v2 4/5] thermal: devfreq_cooling: remove old power model and use EM Lukasz Luba
@ 2020-11-18 12:03 ` Lukasz Luba
  2020-12-02 15:01 ` [PATCH v2 0/5] Thermal devfreq cooling improvements with " Daniel Lezcano
  5 siblings, 0 replies; 21+ messages in thread
From: Lukasz Luba @ 2020-11-18 12:03 UTC (permalink / raw)
  To: linux-kernel, linux-pm, dri-devel
  Cc: rui.zhang, amit.kucheria, daniel.lezcano, lukasz.luba,
	orjan.eide, robh, alyssa.rosenzweig, steven.price, airlied,
	daniel, ionela.voinescu

Register devfreq cooling device and attempt to register Energy Model. This
will add the devfreq device to the Energy Model framework. It will create
a dedicated and unified data structures used i.e. in thermal framework.
The last NULL parameter indicates that the power model is simplified and
created based on DT 'dynamic-power-coefficient', voltage and frequency.

Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 drivers/gpu/drm/panfrost/panfrost_devfreq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_devfreq.c b/drivers/gpu/drm/panfrost/panfrost_devfreq.c
index 78e9d82f7318..f44d28fad085 100644
--- a/drivers/gpu/drm/panfrost/panfrost_devfreq.c
+++ b/drivers/gpu/drm/panfrost/panfrost_devfreq.c
@@ -138,7 +138,7 @@ int panfrost_devfreq_init(struct panfrost_device *pfdev)
 	}
 	pfdevfreq->devfreq = devfreq;
 
-	cooling = of_devfreq_cooling_register(dev->of_node, devfreq);
+	cooling = devfreq_cooling_em_register(devfreq, NULL);
 	if (IS_ERR(cooling))
 		DRM_DEV_INFO(dev, "Failed to register cooling device\n");
 	else
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/5] thermal: devfreq_cooling: change tracing function and arguments
  2020-11-18 12:03 ` [PATCH v2 1/5] thermal: devfreq_cooling: change tracing function and arguments Lukasz Luba
@ 2020-12-02 10:23   ` Ionela Voinescu
  0 siblings, 0 replies; 21+ messages in thread
From: Ionela Voinescu @ 2020-12-02 10:23 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, dri-devel, rui.zhang, amit.kucheria,
	daniel.lezcano, orjan.eide, robh, alyssa.rosenzweig,
	steven.price, airlied, daniel

On Wednesday 18 Nov 2020 at 12:03:54 (+0000), Lukasz Luba wrote:
> Prepare for deleting the static and dynamic power calculation and clean
> the trace function. These two fields are going to be removed in the next
> changes.
> 
> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> # for tracing code
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  drivers/thermal/devfreq_cooling.c |  3 +--
>  include/trace/events/thermal.h    | 19 +++++++++----------
>  2 files changed, 10 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
> index dfab49a67252..659c0143c9f0 100644
> --- a/drivers/thermal/devfreq_cooling.c
> +++ b/drivers/thermal/devfreq_cooling.c
> @@ -277,8 +277,7 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
>  		*power = dyn_power + static_power;
>  	}
>  
> -	trace_thermal_power_devfreq_get_power(cdev, status, freq, dyn_power,
> -					      static_power, *power);
> +	trace_thermal_power_devfreq_get_power(cdev, status, freq, *power);
>  
>  	return 0;
>  fail:
> diff --git a/include/trace/events/thermal.h b/include/trace/events/thermal.h
> index 135e5421f003..8a5f04888abd 100644
> --- a/include/trace/events/thermal.h
> +++ b/include/trace/events/thermal.h
> @@ -153,31 +153,30 @@ TRACE_EVENT(thermal_power_cpu_limit,
>  TRACE_EVENT(thermal_power_devfreq_get_power,
>  	TP_PROTO(struct thermal_cooling_device *cdev,
>  		 struct devfreq_dev_status *status, unsigned long freq,
> -		u32 dynamic_power, u32 static_power, u32 power),
> +		u32 power),
>  
> -	TP_ARGS(cdev, status,  freq, dynamic_power, static_power, power),
> +	TP_ARGS(cdev, status,  freq, power),
>  
>  	TP_STRUCT__entry(
>  		__string(type,         cdev->type    )
>  		__field(unsigned long, freq          )
> -		__field(u32,           load          )
> -		__field(u32,           dynamic_power )
> -		__field(u32,           static_power  )
> +		__field(u32,           busy_time)
> +		__field(u32,           total_time)
>  		__field(u32,           power)
>  	),
>  
>  	TP_fast_assign(
>  		__assign_str(type, cdev->type);
>  		__entry->freq = freq;
> -		__entry->load = (100 * status->busy_time) / status->total_time;
> -		__entry->dynamic_power = dynamic_power;
> -		__entry->static_power = static_power;
> +		__entry->busy_time = status->busy_time;
> +		__entry->total_time = status->total_time;
>  		__entry->power = power;
>  	),
>  
> -	TP_printk("type=%s freq=%lu load=%u dynamic_power=%u static_power=%u power=%u",
> +	TP_printk("type=%s freq=%lu load=%u power=%u",
>  		__get_str(type), __entry->freq,
> -		__entry->load, __entry->dynamic_power, __entry->static_power,
> +		__entry->total_time == 0 ? 0 :
> +			(100 * __entry->busy_time) / __entry->total_time,
>  		__entry->power)
>  );
>  
> -- 
> 2.17.1
> 

Reviewed-by: Ionela Voinescu <ionela.voinescu@arm.com>

Regards,
Ionela.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 2/5] thermal: devfreq_cooling: get a copy of device status
  2020-11-18 12:03 ` [PATCH v2 2/5] thermal: devfreq_cooling: get a copy of device status Lukasz Luba
@ 2020-12-02 10:23   ` Ionela Voinescu
  2020-12-03 13:09   ` Daniel Lezcano
  1 sibling, 0 replies; 21+ messages in thread
From: Ionela Voinescu @ 2020-12-02 10:23 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, dri-devel, rui.zhang, amit.kucheria,
	daniel.lezcano, orjan.eide, robh, alyssa.rosenzweig,
	steven.price, airlied, daniel

On Wednesday 18 Nov 2020 at 12:03:55 (+0000), Lukasz Luba wrote:
> Devfreq cooling needs to now the correct status of the device in order
> to operate. Do not rely on Devfreq last_status which might be a stale data
> and get more up-to-date values of the load.
> 
> Devfreq framework can change the device status in the background. To
> mitigate this situation make a copy of the status structure and use it
> for internal calculations.
> 
> In addition this patch adds normalization function, which also makes sure
> that whatever data comes from the device, it is in a sane range.
> 
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  drivers/thermal/devfreq_cooling.c | 52 +++++++++++++++++++++++++------
>  1 file changed, 43 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
> index 659c0143c9f0..925523694462 100644
> --- a/drivers/thermal/devfreq_cooling.c
> +++ b/drivers/thermal/devfreq_cooling.c
> @@ -227,20 +227,46 @@ static inline unsigned long get_total_power(struct devfreq_cooling_device *dfc,
>  							       voltage);
>  }
>  
> +static void _normalize_load(struct devfreq_dev_status *status)
> +{
> +	/* Make some space if needed */
> +	if (status->busy_time > 0xffff) {
> +		status->busy_time >>= 10;
> +		status->total_time >>= 10;
> +	}
> +
> +	if (status->busy_time > status->total_time)
> +		status->busy_time = status->total_time;
> +
> +	status->busy_time *= 100;
> +	status->busy_time /= status->total_time ? : 1;
> +
> +	/* Avoid division by 0 */
> +	status->busy_time = status->busy_time ? : 1;
> +	status->total_time = 100;
> +}
>  
>  static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cdev,
>  					       u32 *power)
>  {
>  	struct devfreq_cooling_device *dfc = cdev->devdata;
>  	struct devfreq *df = dfc->devfreq;
> -	struct devfreq_dev_status *status = &df->last_status;
> +	struct devfreq_dev_status status;
>  	unsigned long state;
> -	unsigned long freq = status->current_frequency;
> +	unsigned long freq;
>  	unsigned long voltage;
>  	u32 dyn_power = 0;
>  	u32 static_power = 0;
>  	int res;
>  
> +	mutex_lock(&df->lock);
> +	res = df->profile->get_dev_status(df->dev.parent, &status);
> +	mutex_unlock(&df->lock);
> +	if (res)
> +		return res;
> +
> +	freq = status.current_frequency;
> +
>  	state = freq_get_state(dfc, freq);
>  	if (state == THERMAL_CSTATE_INVALID) {
>  		res = -EAGAIN;
> @@ -268,16 +294,18 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
>  	} else {
>  		dyn_power = dfc->power_table[state];
>  
> +		_normalize_load(&status);
> +
>  		/* Scale dynamic power for utilization */
> -		dyn_power *= status->busy_time;
> -		dyn_power /= status->total_time;
> +		dyn_power *= status.busy_time;
> +		dyn_power /= status.total_time;
>  		/* Get static power */
>  		static_power = get_static_power(dfc, freq);
>  
>  		*power = dyn_power + static_power;
>  	}
>  
> -	trace_thermal_power_devfreq_get_power(cdev, status, freq, *power);
> +	trace_thermal_power_devfreq_get_power(cdev, &status, freq, *power);
>  
>  	return 0;
>  fail:
> @@ -309,14 +337,20 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
>  {
>  	struct devfreq_cooling_device *dfc = cdev->devdata;
>  	struct devfreq *df = dfc->devfreq;
> -	struct devfreq_dev_status *status = &df->last_status;
> -	unsigned long freq = status->current_frequency;
> +	struct devfreq_dev_status status;
>  	unsigned long busy_time;
> +	unsigned long freq;
>  	s32 dyn_power;
>  	u32 static_power;
>  	s32 est_power;
>  	int i;
>  
> +	mutex_lock(&df->lock);
> +	status = df->last_status;
> +	mutex_unlock(&df->lock);
> +
> +	freq = status.current_frequency;
> +
>  	if (dfc->power_ops->get_real_power) {
>  		/* Scale for resource utilization */
>  		est_power = power * dfc->res_util;
> @@ -328,8 +362,8 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
>  		dyn_power = dyn_power > 0 ? dyn_power : 0;
>  
>  		/* Scale dynamic power for utilization */
> -		busy_time = status->busy_time ?: 1;
> -		est_power = (dyn_power * status->total_time) / busy_time;
> +		busy_time = status.busy_time ?: 1;
> +		est_power = (dyn_power * status.total_time) / busy_time;
>  	}
>  
>  	/*
> -- 
> 2.17.1
> 

Reviewed-by: Ionela Voinescu <ionela.voinescu@arm.com>

Thanks,
Ionela.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model
  2020-11-18 12:03 ` [PATCH v2 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model Lukasz Luba
@ 2020-12-02 10:24   ` Ionela Voinescu
  2020-12-02 11:14     ` Lukasz Luba
  2020-12-03 15:40   ` Daniel Lezcano
  1 sibling, 1 reply; 21+ messages in thread
From: Ionela Voinescu @ 2020-12-02 10:24 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, dri-devel, rui.zhang, amit.kucheria,
	daniel.lezcano, orjan.eide, robh, alyssa.rosenzweig,
	steven.price, airlied, daniel

Hi Lukasz,

On Wednesday 18 Nov 2020 at 12:03:56 (+0000), Lukasz Luba wrote:
> + * Register a devfreq cooling device and attempt to register Energy Model. The
> + * available OPPs must be registered for the device.
> + *
> + * If @dfc_power is provided, the cooling device is registered with the
> + * power extensions. If @em_cb is provided it will be called for each OPP to
> + * calculate power value and cost. If @em_cb is not provided then simple Energy
> + * Model is going to be used, which requires "dynamic-power-coefficient" a
> + * devicetree property.
> + */
> +struct thermal_cooling_device *
> +devfreq_cooling_em_register_power(struct devfreq *df,
> +				  struct devfreq_cooling_power *dfc_power,
> +				  struct em_data_callback *em_cb)
> +{
> +	struct thermal_cooling_device *cdev;
> +	struct devfreq_cooling_device *dfc;
> +	struct device_node *np = NULL;
> +	struct device *dev;
> +	int nr_opp, ret;
> +
> +	if (IS_ERR_OR_NULL(df))
> +		return ERR_PTR(-EINVAL);
> +
> +	dev = df->dev.parent;
> +
> +	if (em_cb) {
> +		nr_opp = dev_pm_opp_get_opp_count(dev);
> +		if (nr_opp <= 0) {
> +			dev_err(dev, "No valid OPPs found\n");
> +			return ERR_PTR(-EINVAL);
> +		}
> +
> +		ret = em_dev_register_perf_domain(dev, nr_opp, em_cb, NULL, false);
> +	} else {
> +		ret = dev_pm_opp_of_register_em(dev, NULL);
> +	}
> +
> +	if (ret)
> +		dev_warn(dev, "Unable to register EM for devfreq cooling device (%d)\n",
> +			 ret);
> +
> +	if (dev->of_node)
> +		np = of_node_get(dev->of_node);
> +

Should np be checked before use? I'm not sure if it's better to do the
assign first and then the check on np before use. It depends on the
consequences of passing a NULL node pointer later on.

> +	cdev = of_devfreq_cooling_register_power(np, df, dfc_power);
> +
> +	if (np)
> +		of_node_put(np);
> +
> +	if (IS_ERR_OR_NULL(cdev)) {
> +		if (!ret)
> +			em_dev_unregister_perf_domain(dev);
> +	} else {
> +		dfc = cdev->devdata;
> +		dfc->em_registered = !ret;
> +	}
> +
> +	return cdev;
> +}
> +EXPORT_SYMBOL_GPL(devfreq_cooling_em_register_power);
> +
> +/**
> + * devfreq_cooling_em_register() - Register devfreq cooling device together
> + *				with Energy Model.
> + * @df:		Pointer to devfreq device.
> + * @em_cb:	Callback functions providing the data of the Energy Model
> + *
> + * This function attempts to register Energy Model for devfreq device and then
> + * register the devfreq cooling device.
> + */
> +struct thermal_cooling_device *
> +devfreq_cooling_em_register(struct devfreq *df, struct em_data_callback *em_cb)
> +{
> +	return devfreq_cooling_em_register_power(df, NULL, em_cb);
> +}
> +EXPORT_SYMBOL_GPL(devfreq_cooling_em_register);
> +
>  /**
>   * devfreq_cooling_unregister() - Unregister devfreq cooling device.
>   * @cdev: Pointer to devfreq cooling device to unregister.
> + *
> + * Unregisters devfreq cooling device and related Energy Model if it was
> + * present.
>   */
>  void devfreq_cooling_unregister(struct thermal_cooling_device *cdev)
>  {
>  	struct devfreq_cooling_device *dfc;
> +	struct device *dev;
>  
> -	if (!cdev)
> +	if (IS_ERR_OR_NULL(cdev))
>  		return;
>  
>  	dfc = cdev->devdata;
> +	dev = dfc->devfreq->dev.parent;
>  
>  	thermal_cooling_device_unregister(dfc->cdev);
>  	ida_simple_remove(&devfreq_ida, dfc->id);
>  	dev_pm_qos_remove_request(&dfc->req_max_freq);
> +
> +	if (dfc->em_registered)
> +		em_dev_unregister_perf_domain(dev);
> +
>  	kfree(dfc->power_table);
>  	kfree(dfc->freq_table);
>  
> diff --git a/include/linux/devfreq_cooling.h b/include/linux/devfreq_cooling.h
> index 9df2dfca68dd..19868fb922f1 100644
> --- a/include/linux/devfreq_cooling.h
> +++ b/include/linux/devfreq_cooling.h
> @@ -11,6 +11,7 @@
>  #define __DEVFREQ_COOLING_H__
>  
>  #include <linux/devfreq.h>
> +#include <linux/energy_model.h>
>  #include <linux/thermal.h>
>  
>  
> @@ -65,6 +66,13 @@ struct thermal_cooling_device *
>  of_devfreq_cooling_register(struct device_node *np, struct devfreq *df);
>  struct thermal_cooling_device *devfreq_cooling_register(struct devfreq *df);
>  void devfreq_cooling_unregister(struct thermal_cooling_device *dfc);
> +struct thermal_cooling_device *
> +devfreq_cooling_em_register_power(struct devfreq *df,
> +				  struct devfreq_cooling_power *dfc_power,
> +				  struct em_data_callback *em_cb);
> +struct thermal_cooling_device *
> +devfreq_cooling_em_register(struct devfreq *df,
> +			    struct em_data_callback *em_cb);
>  
>  #else /* !CONFIG_DEVFREQ_THERMAL */
>  
> @@ -87,6 +95,20 @@ devfreq_cooling_register(struct devfreq *df)
>  	return ERR_PTR(-EINVAL);
>  }
>  
> +static inline struct thermal_cooling_device *
> +devfreq_cooling_em_register_power(struct devfreq *df,
> +				  struct devfreq_cooling_power *dfc_power,
> +				  struct em_data_callback *em_cb)
> +{
> +	return ERR_PTR(-EINVAL);
> +}
> +
> +static inline struct thermal_cooling_device *
> +devfreq_cooling_em_register(struct devfreq *df,	struct em_data_callback *em_cb)
> +{
> +	return ERR_PTR(-EINVAL);
> +}
> +
>  static inline void
>  devfreq_cooling_unregister(struct thermal_cooling_device *dfc)
>  {
> -- 
> 2.17.1
> 

Otherwise it looks good to me:

Reviewed-by: Ionela Voinescu <ionela.voinescu@arm.com>

Ionela.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 4/5] thermal: devfreq_cooling: remove old power model and use EM
  2020-11-18 12:03 ` [PATCH v2 4/5] thermal: devfreq_cooling: remove old power model and use EM Lukasz Luba
@ 2020-12-02 10:26   ` Ionela Voinescu
  0 siblings, 0 replies; 21+ messages in thread
From: Ionela Voinescu @ 2020-12-02 10:26 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, dri-devel, rui.zhang, amit.kucheria,
	daniel.lezcano, orjan.eide, robh, alyssa.rosenzweig,
	steven.price, airlied, daniel

On Wednesday 18 Nov 2020 at 12:03:57 (+0000), Lukasz Luba wrote:
> Remove old power model and use new Energy Model to calculate the power
> budget. It drops static + dynamic power calculations and power table
> in order to use Energy Model performance domain data. This model
> should be easy to use and could find more users. It is also less
> complicated to setup the needed structures.
> 
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  drivers/thermal/devfreq_cooling.c | 302 ++++++++++--------------------
>  include/linux/devfreq_cooling.h   |  17 --
>  2 files changed, 96 insertions(+), 223 deletions(-)
> 
> diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
> index b354271742c5..28754ad46b96 100644
> --- a/drivers/thermal/devfreq_cooling.c
> +++ b/drivers/thermal/devfreq_cooling.c
> @@ -33,20 +33,17 @@ static DEFINE_IDA(devfreq_ida);
>   * @cdev:	Pointer to associated thermal cooling device.
>   * @devfreq:	Pointer to associated devfreq device.
>   * @cooling_state:	Current cooling state.
> - * @power_table:	Pointer to table with maximum power draw for each
> - *			cooling state. State is the index into the table, and
> - *			the power is in mW.
>   * @freq_table:	Pointer to a table with the frequencies sorted in descending
>   *		order.  You can index the table by cooling device state
> - * @freq_table_size:	Size of the @freq_table and @power_table
> - * @power_ops:	Pointer to devfreq_cooling_power, used to generate the
> - *		@power_table.
> + * @max_state:	It is the last index, that is, one less than the number of the
> + *		OPPs
> + * @power_ops:	Pointer to devfreq_cooling_power, a more precised model.
>   * @res_util:	Resource utilization scaling factor for the power.
>   *		It is multiplied by 100 to minimize the error. It is used
>   *		for estimation of the power budget instead of using
> - *		'utilization' (which is	'busy_time / 'total_time').
> - *		The 'res_util' range is from 100 to (power_table[state] * 100)
> - *		for the corresponding 'state'.
> + *		'utilization' (which is	'busy_time' / 'total_time').
> + *		The 'res_util' range is from 100 to power * 100	for the
> + *		corresponding 'state'.
>   * @capped_state:	index to cooling state with in dynamic power budget
>   * @req_max_freq:	PM QoS request for limiting the maximum frequency
>   *			of the devfreq device.
> @@ -58,9 +55,8 @@ struct devfreq_cooling_device {
>  	struct thermal_cooling_device *cdev;
>  	struct devfreq *devfreq;
>  	unsigned long cooling_state;
> -	u32 *power_table;
>  	u32 *freq_table;
> -	size_t freq_table_size;
> +	size_t max_state;
>  	struct devfreq_cooling_power *power_ops;
>  	u32 res_util;
>  	int capped_state;
> @@ -74,7 +70,7 @@ static int devfreq_cooling_get_max_state(struct thermal_cooling_device *cdev,
>  {
>  	struct devfreq_cooling_device *dfc = cdev->devdata;
>  
> -	*state = dfc->freq_table_size - 1;
> +	*state = dfc->max_state;
>  
>  	return 0;
>  }
> @@ -96,16 +92,22 @@ static int devfreq_cooling_set_cur_state(struct thermal_cooling_device *cdev,
>  	struct devfreq *df = dfc->devfreq;
>  	struct device *dev = df->dev.parent;
>  	unsigned long freq;
> +	int perf_idx;
>  
>  	if (state == dfc->cooling_state)
>  		return 0;
>  
>  	dev_dbg(dev, "Setting cooling state %lu\n", state);
>  
> -	if (state >= dfc->freq_table_size)
> +	if (state > dfc->max_state)
>  		return -EINVAL;
>  
> -	freq = dfc->freq_table[state];
> +	if (dfc->em) {
> +		perf_idx = dfc->max_state - state;
> +		freq = dfc->em->table[perf_idx].frequency * 1000;
> +	} else {
> +		freq = dfc->freq_table[state];
> +	}
>  
>  	dev_pm_qos_update_request(&dfc->req_max_freq,
>  				  DIV_ROUND_UP(freq, HZ_PER_KHZ));
> @@ -116,24 +118,24 @@ static int devfreq_cooling_set_cur_state(struct thermal_cooling_device *cdev,
>  }
>  
>  /**
> - * freq_get_state() - get the cooling state corresponding to a frequency
> + * get_perf_idx() - get the performance index corresponding to a frequency
>   * @dfc:	Pointer to devfreq cooling device
> - * @freq:	frequency in Hz
> + * @freq:	frequency in kHz
>   *
> - * Return: the cooling state associated with the @freq, or
> - * THERMAL_CSTATE_INVALID if it wasn't found.
> + * Return: the performance index associated with the @freq, or
> + * -EINVAL if it wasn't found.
>   */
> -static unsigned long
> -freq_get_state(struct devfreq_cooling_device *dfc, unsigned long freq)
> +static int get_perf_idx(struct devfreq_cooling_device *dfc, unsigned long freq)
>  {
> +	struct em_perf_domain *em = dfc->em;
>  	int i;
>  
> -	for (i = 0; i < dfc->freq_table_size; i++) {
> -		if (dfc->freq_table[i] == freq)
> +	for (i = 0; i < em->nr_perf_states; i++) {
> +		if (em->table[i].frequency == freq)
>  			return i;
>  	}
>  
> -	return THERMAL_CSTATE_INVALID;
> +	return -EINVAL;
>  }
>  
>  static unsigned long get_voltage(struct devfreq *df, unsigned long freq)
> @@ -164,71 +166,15 @@ static unsigned long get_voltage(struct devfreq *df, unsigned long freq)
>  	return voltage;
>  }
>  
> -/**
> - * get_static_power() - calculate the static power
> - * @dfc:	Pointer to devfreq cooling device
> - * @freq:	Frequency in Hz
> - *
> - * Calculate the static power in milliwatts using the supplied
> - * get_static_power().  The current voltage is calculated using the
> - * OPP library.  If no get_static_power() was supplied, assume the
> - * static power is negligible.
> - */
> -static unsigned long
> -get_static_power(struct devfreq_cooling_device *dfc, unsigned long freq)
> -{
> -	struct devfreq *df = dfc->devfreq;
> -	unsigned long voltage;
> -
> -	if (!dfc->power_ops->get_static_power)
> -		return 0;
> -
> -	voltage = get_voltage(df, freq);
> -
> -	if (voltage == 0)
> -		return 0;
> -
> -	return dfc->power_ops->get_static_power(df, voltage);
> -}
> -
> -/**
> - * get_dynamic_power - calculate the dynamic power
> - * @dfc:	Pointer to devfreq cooling device
> - * @freq:	Frequency in Hz
> - * @voltage:	Voltage in millivolts
> - *
> - * Calculate the dynamic power in milliwatts consumed by the device at
> - * frequency @freq and voltage @voltage.  If the get_dynamic_power()
> - * was supplied as part of the devfreq_cooling_power struct, then that
> - * function is used.  Otherwise, a simple power model (Pdyn = Coeff *
> - * Voltage^2 * Frequency) is used.
> - */
> -static unsigned long
> -get_dynamic_power(struct devfreq_cooling_device *dfc, unsigned long freq,
> -		  unsigned long voltage)
> +static void dfc_em_get_requested_power(struct em_perf_domain *em,
> +				       struct devfreq_dev_status *status,
> +				       u32 *power, int perf_idx)
>  {
> -	u64 power;
> -	u32 freq_mhz;
> -	struct devfreq_cooling_power *dfc_power = dfc->power_ops;
> -
> -	if (dfc_power->get_dynamic_power)
> -		return dfc_power->get_dynamic_power(dfc->devfreq, freq,
> -						    voltage);
> +	*power = em->table[perf_idx].power;
>  
> -	freq_mhz = freq / 1000000;
> -	power = (u64)dfc_power->dyn_power_coeff * freq_mhz * voltage * voltage;
> -	do_div(power, 1000000000);
> -
> -	return power;
> -}
> -
> -
> -static inline unsigned long get_total_power(struct devfreq_cooling_device *dfc,
> -					    unsigned long freq,
> -					    unsigned long voltage)
> -{
> -	return get_static_power(dfc, freq) + get_dynamic_power(dfc, freq,
> -							       voltage);
> +	/* Scale power for utilization */
> +	*power *= status->busy_time;
> +	*power /= status->total_time;
>  }

Nit: Reiterating my question on whether it's worth having this
additional function or whether its contents should be collapsed into the
single caller.

>  
>  static void _normalize_load(struct devfreq_dev_status *status)
> @@ -259,9 +205,7 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
>  	unsigned long state;
>  	unsigned long freq;
>  	unsigned long voltage;
> -	u32 dyn_power = 0;
> -	u32 static_power = 0;
> -	int res;
> +	int res, perf_idx;
>  
>  	mutex_lock(&df->lock);
>  	res = df->profile->get_dev_status(df->dev.parent, &status);
> @@ -271,13 +215,7 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
>  
>  	freq = status.current_frequency;
>  
> -	state = freq_get_state(dfc, freq);
> -	if (state == THERMAL_CSTATE_INVALID) {
> -		res = -EAGAIN;
> -		goto fail;
> -	}
> -
> -	if (dfc->power_ops->get_real_power) {
> +	if (dfc->power_ops && dfc->power_ops->get_real_power) {
>  		voltage = get_voltage(df, freq);
>  		if (voltage == 0) {
>  			res = -EINVAL;
> @@ -287,7 +225,7 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
>  		res = dfc->power_ops->get_real_power(df, power, freq, voltage);
>  		if (!res) {
>  			state = dfc->capped_state;
> -			dfc->res_util = dfc->power_table[state];
> +			dfc->res_util = dfc->em->table[state].power;
>  			dfc->res_util *= SCALE_ERROR_MITIGATION;
>  
>  			if (*power > 1)
> @@ -296,17 +234,15 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
>  			goto fail;
>  		}
>  	} else {
> -		dyn_power = dfc->power_table[state];
> +		/* Energy Model frequencies are in kHz */
> +		perf_idx = get_perf_idx(dfc, freq / 1000);
> +		if (perf_idx < 0) {
> +			res = -EAGAIN;
> +			goto fail;
> +		}
>  
>  		_normalize_load(&status);
> -
> -		/* Scale dynamic power for utilization */
> -		dyn_power *= status.busy_time;
> -		dyn_power /= status.total_time;
> -		/* Get static power */
> -		static_power = get_static_power(dfc, freq);
> -
> -		*power = dyn_power + static_power;
> +		dfc_em_get_requested_power(dfc->em, &status, power, perf_idx);
>  	}
>  
>  	trace_thermal_power_devfreq_get_power(cdev, &status, freq, *power);
> @@ -319,20 +255,17 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
>  }
>  
>  static int devfreq_cooling_state2power(struct thermal_cooling_device *cdev,
> -				       unsigned long state,
> -				       u32 *power)
> +				       unsigned long state, u32 *power)
>  {
>  	struct devfreq_cooling_device *dfc = cdev->devdata;
> -	unsigned long freq;
> -	u32 static_power;
> +	int perf_idx;
>  
> -	if (state >= dfc->freq_table_size)
> +	if (state > dfc->max_state)
>  		return -EINVAL;
>  
> -	freq = dfc->freq_table[state];
> -	static_power = get_static_power(dfc, freq);
> +	perf_idx = dfc->max_state - state;
> +	*power = dfc->em->table[perf_idx].power;
>  
> -	*power = dfc->power_table[state] + static_power;
>  	return 0;
>  }
>  
> @@ -342,10 +275,7 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
>  	struct devfreq_cooling_device *dfc = cdev->devdata;
>  	struct devfreq *df = dfc->devfreq;
>  	struct devfreq_dev_status status;
> -	unsigned long busy_time;
>  	unsigned long freq;
> -	s32 dyn_power;
> -	u32 static_power;
>  	s32 est_power;
>  	int i;
>  
> @@ -355,31 +285,27 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
>  
>  	freq = status.current_frequency;
>  
> -	if (dfc->power_ops->get_real_power) {
> +	if (dfc->power_ops && dfc->power_ops->get_real_power) {
>  		/* Scale for resource utilization */
>  		est_power = power * dfc->res_util;
>  		est_power /= SCALE_ERROR_MITIGATION;
>  	} else {
> -		static_power = get_static_power(dfc, freq);
> -
> -		dyn_power = power - static_power;
> -		dyn_power = dyn_power > 0 ? dyn_power : 0;
> -
> -		/* Scale dynamic power for utilization */
> -		busy_time = status.busy_time ?: 1;
> -		est_power = (dyn_power * status.total_time) / busy_time;
> +		_normalize_load(&status);
> +		est_power = power * status.total_time;
> +		est_power /= status.busy_time;
>  	}
>  
>  	/*
>  	 * Find the first cooling state that is within the power
> -	 * budget for dynamic power.
> +	 * budget. The EM power table is sorted ascending.
>  	 */
> -	for (i = 0; i < dfc->freq_table_size - 1; i++)
> -		if (est_power >= dfc->power_table[i])
> +	for (i = dfc->max_state; i > 0; i--)
> +		if (est_power >= dfc->em->table[i].power)
>  			break;
>  
> -	*state = i;
> -	dfc->capped_state = i;
> +	*state = dfc->max_state - i;
> +	dfc->capped_state = *state;
> +
>  	trace_thermal_power_devfreq_limit(cdev, freq, *state, power);
>  	return 0;
>  }
> @@ -391,91 +317,43 @@ static struct thermal_cooling_device_ops devfreq_cooling_ops = {
>  };
>  
>  /**
> - * devfreq_cooling_gen_tables() - Generate power and freq tables.
> - * @dfc: Pointer to devfreq cooling device.
> - *
> - * Generate power and frequency tables: the power table hold the
> - * device's maximum power usage at each cooling state (OPP).  The
> - * static and dynamic power using the appropriate voltage and
> - * frequency for the state, is acquired from the struct
> - * devfreq_cooling_power, and summed to make the maximum power draw.
> + * devfreq_cooling_gen_tables() - Generate frequency table.
> + * @dfc:	Pointer to devfreq cooling device.
> + * @num_opps:	Number of OPPs
>   *
> - * The frequency table holds the frequencies in descending order.
> - * That way its indexed by cooling device state.
> - *
> - * The tables are malloced, and pointers put in dfc.  They must be
> - * freed when unregistering the devfreq cooling device.
> + * Generate frequency table which holds the frequencies in descending
> + * order. That way its indexed by cooling device state. This is for
> + * compatibility with drivers which do not register Energy Model.
>   *
>   * Return: 0 on success, negative error code on failure.
>   */
> -static int devfreq_cooling_gen_tables(struct devfreq_cooling_device *dfc)
> +static int devfreq_cooling_gen_tables(struct devfreq_cooling_device *dfc,
> +				      int num_opps)
>  {
>  	struct devfreq *df = dfc->devfreq;
>  	struct device *dev = df->dev.parent;
> -	int ret, num_opps;
>  	unsigned long freq;
> -	u32 *power_table = NULL;
> -	u32 *freq_table;
>  	int i;
>  
> -	num_opps = dev_pm_opp_get_opp_count(dev);
> -
> -	if (dfc->power_ops) {
> -		power_table = kcalloc(num_opps, sizeof(*power_table),
> -				      GFP_KERNEL);
> -		if (!power_table)
> -			return -ENOMEM;
> -	}
> -
> -	freq_table = kcalloc(num_opps, sizeof(*freq_table),
> +	dfc->freq_table = kcalloc(num_opps, sizeof(*dfc->freq_table),
>  			     GFP_KERNEL);
> -	if (!freq_table) {
> -		ret = -ENOMEM;
> -		goto free_power_table;
> -	}
> +	if (!dfc->freq_table)
> +		return -ENOMEM;
>  
>  	for (i = 0, freq = ULONG_MAX; i < num_opps; i++, freq--) {
> -		unsigned long power, voltage;
>  		struct dev_pm_opp *opp;
>  
>  		opp = dev_pm_opp_find_freq_floor(dev, &freq);
>  		if (IS_ERR(opp)) {
> -			ret = PTR_ERR(opp);
> -			goto free_tables;
> +			kfree(dfc->freq_table);
> +			return PTR_ERR(opp);
>  		}
>  
> -		voltage = dev_pm_opp_get_voltage(opp) / 1000; /* mV */
>  		dev_pm_opp_put(opp);
> -
> -		if (dfc->power_ops) {
> -			if (dfc->power_ops->get_real_power)
> -				power = get_total_power(dfc, freq, voltage);
> -			else
> -				power = get_dynamic_power(dfc, freq, voltage);
> -
> -			dev_dbg(dev, "Power table: %lu MHz @ %lu mV: %lu = %lu mW\n",
> -				freq / 1000000, voltage, power, power);
> -
> -			power_table[i] = power;
> -		}
> -
> -		freq_table[i] = freq;
> +		dfc->freq_table[i] = freq;
>  	}
>  
> -	if (dfc->power_ops)
> -		dfc->power_table = power_table;
> -
> -	dfc->freq_table = freq_table;
> -	dfc->freq_table_size = num_opps;
> -
>  	return 0;
> -
> -free_tables:
> -	kfree(freq_table);
> -free_power_table:
> -	kfree(power_table);
> -
> -	return ret;
>  }
>  
>  /**
> @@ -500,7 +378,7 @@ of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df,
>  	struct thermal_cooling_device *cdev;
>  	struct devfreq_cooling_device *dfc;
>  	char dev_name[THERMAL_NAME_LENGTH];
> -	int err;
> +	int err, num_opps;
>  
>  	dfc = kzalloc(sizeof(*dfc), GFP_KERNEL);
>  	if (!dfc)
> @@ -508,28 +386,45 @@ of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df,
>  
>  	dfc->devfreq = df;
>  
> -	if (dfc_power) {
> -		dfc->power_ops = dfc_power;
> -
> +	dfc->em = em_pd_get(df->dev.parent);
> +	if (dfc->em) {
>  		devfreq_cooling_ops.get_requested_power =
>  			devfreq_cooling_get_requested_power;
>  		devfreq_cooling_ops.state2power = devfreq_cooling_state2power;
>  		devfreq_cooling_ops.power2state = devfreq_cooling_power2state;
> +
> +		dfc->power_ops = dfc_power;
> +
> +		num_opps = em_pd_nr_perf_states(dfc->em);
> +	} else {
> +		/* Backward compatibility for drivers which do not use IPA */
> +		dev_dbg(df->dev.parent, "missing EM for cooling device\n");
> +
> +		num_opps = dev_pm_opp_get_opp_count(df->dev.parent);
> +
> +		err = devfreq_cooling_gen_tables(dfc, num_opps);
> +		if (err)
> +			goto free_dfc;
>  	}
>  
> -	err = devfreq_cooling_gen_tables(dfc);
> -	if (err)
> +	if (num_opps <= 0) {
> +		err = -EINVAL;
>  		goto free_dfc;
> +	}
> +
> +	/* max_state is an index, not a counter */
> +	dfc->max_state = num_opps - 1;
>  
>  	err = dev_pm_qos_add_request(df->dev.parent, &dfc->req_max_freq,
>  				     DEV_PM_QOS_MAX_FREQUENCY,
>  				     PM_QOS_MAX_FREQUENCY_DEFAULT_VALUE);
>  	if (err < 0)
> -		goto free_tables;
> +		goto free_table;
>  
>  	err = ida_simple_get(&devfreq_ida, 0, 0, GFP_KERNEL);
>  	if (err < 0)
>  		goto remove_qos_req;
> +
>  	dfc->id = err;
>  
>  	snprintf(dev_name, sizeof(dev_name), "thermal-devfreq-%d", dfc->id);
> @@ -550,12 +445,9 @@ of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df,
>  
>  release_ida:
>  	ida_simple_remove(&devfreq_ida, dfc->id);
> -
>  remove_qos_req:
>  	dev_pm_qos_remove_request(&dfc->req_max_freq);
> -
> -free_tables:
> -	kfree(dfc->power_table);
> +free_table:
>  	kfree(dfc->freq_table);
>  free_dfc:
>  	kfree(dfc);
> @@ -696,9 +588,7 @@ void devfreq_cooling_unregister(struct thermal_cooling_device *cdev)
>  	if (dfc->em_registered)
>  		em_dev_unregister_perf_domain(dev);
>  
> -	kfree(dfc->power_table);
>  	kfree(dfc->freq_table);
> -
>  	kfree(dfc);
>  }
>  EXPORT_SYMBOL_GPL(devfreq_cooling_unregister);
> diff --git a/include/linux/devfreq_cooling.h b/include/linux/devfreq_cooling.h
> index 19868fb922f1..4890b12b54b4 100644
> --- a/include/linux/devfreq_cooling.h
> +++ b/include/linux/devfreq_cooling.h
> @@ -17,17 +17,6 @@
>  
>  /**
>   * struct devfreq_cooling_power - Devfreq cooling power ops
> - * @get_static_power:	Take voltage, in mV, and return the static power
> - *			in mW.  If NULL, the static power is assumed
> - *			to be 0.
> - * @get_dynamic_power:	Take voltage, in mV, and frequency, in HZ, and
> - *			return the dynamic power draw in mW.  If NULL,
> - *			a simple power model is used.
> - * @dyn_power_coeff:	Coefficient for the simple dynamic power model in
> - *			mW/(MHz mV mV).
> - *			If get_dynamic_power() is NULL, then the
> - *			dynamic power is calculated as
> - *			@dyn_power_coeff * frequency * voltage^2
>   * @get_real_power:	When this is set, the framework uses it to ask the
>   *			device driver for the actual power.
>   *			Some devices have more sophisticated methods
> @@ -47,14 +36,8 @@
>   *			max total (static + dynamic) power value for each OPP.
>   */
>  struct devfreq_cooling_power {
> -	unsigned long (*get_static_power)(struct devfreq *devfreq,
> -					  unsigned long voltage);
> -	unsigned long (*get_dynamic_power)(struct devfreq *devfreq,
> -					   unsigned long freq,
> -					   unsigned long voltage);
>  	int (*get_real_power)(struct devfreq *df, u32 *power,
>  			      unsigned long freq, unsigned long voltage);
> -	unsigned long dyn_power_coeff;
>  };
>  
>  #ifdef CONFIG_DEVFREQ_THERMAL
> -- 
> 2.17.1
> 

Reviewed-by: Ionela Voinescu <ionela.voinescu@arm.com>

Thanks,
Ionela.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model
  2020-12-02 10:24   ` Ionela Voinescu
@ 2020-12-02 11:14     ` Lukasz Luba
  2020-12-02 11:49       ` Ionela Voinescu
  0 siblings, 1 reply; 21+ messages in thread
From: Lukasz Luba @ 2020-12-02 11:14 UTC (permalink / raw)
  To: Ionela Voinescu
  Cc: linux-kernel, linux-pm, dri-devel, rui.zhang, amit.kucheria,
	daniel.lezcano, orjan.eide, robh, alyssa.rosenzweig,
	steven.price, airlied, daniel

Hi Ionela,

On 12/2/20 10:24 AM, Ionela Voinescu wrote:
> Hi Lukasz,
> 
> On Wednesday 18 Nov 2020 at 12:03:56 (+0000), Lukasz Luba wrote:

[snip]

>> +	struct device_node *np = NULL;

[snip]

>> +
>> +	if (dev->of_node)
>> +		np = of_node_get(dev->of_node);
>> +
> 
> Should np be checked before use? I'm not sure if it's better to do the
> assign first and then the check on np before use. It depends on the
> consequences of passing a NULL node pointer later on.

The np is actually dev->of_node (or left NULL, as set at the begging).
The only meaning of the line above is to increment the counter and then
decrement if CONFIG_OF_DYNAMIC was used.
The devfreq_cooling_register() has np = NULL and the registration can
handle it, so we should be OK here as well.

> 
>> +	cdev = of_devfreq_cooling_register_power(np, df, dfc_power);
>> +
>> +	if (np)
>> +		of_node_put(np);
>> +

[snip]

>>
> 
> Otherwise it looks good to me:
> 
> Reviewed-by: Ionela Voinescu <ionela.voinescu@arm.com>

Thank you for the review.

Regards,
Lukasz

> 
> Ionela.
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model
  2020-12-02 11:14     ` Lukasz Luba
@ 2020-12-02 11:49       ` Ionela Voinescu
  2020-12-02 11:54         ` Lukasz Luba
  0 siblings, 1 reply; 21+ messages in thread
From: Ionela Voinescu @ 2020-12-02 11:49 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, dri-devel, rui.zhang, amit.kucheria,
	daniel.lezcano, orjan.eide, robh, alyssa.rosenzweig,
	steven.price, airlied, daniel

On Wednesday 02 Dec 2020 at 11:14:02 (+0000), Lukasz Luba wrote:
> Hi Ionela,
> 
> On 12/2/20 10:24 AM, Ionela Voinescu wrote:
> > Hi Lukasz,
> > 
> > On Wednesday 18 Nov 2020 at 12:03:56 (+0000), Lukasz Luba wrote:
> 
> [snip]
> 
> > > +	struct device_node *np = NULL;
> 
> [snip]
> 
> > > +
> > > +	if (dev->of_node)
> > > +		np = of_node_get(dev->of_node);
> > > +
> > 
> > Should np be checked before use? I'm not sure if it's better to do the
> > assign first and then the check on np before use. It depends on the
> > consequences of passing a NULL node pointer later on.
> 
> The np is actually dev->of_node (or left NULL, as set at the begging).
> The only meaning of the line above is to increment the counter and then
> decrement if CONFIG_OF_DYNAMIC was used.
> The devfreq_cooling_register() has np = NULL and the registration can
> handle it, so we should be OK here as well.
> 

Yes, I just wanted to make sure later registration can handle np = NULL,
or whether we need to bail out.

In this case, you can drop both ifs - for (dev->of_node) before get and
for np before put below, as of_node_get/of_node_put can handle NULL
pointers themselves.

Thanks,
Ionela.

> > 
> > > +	cdev = of_devfreq_cooling_register_power(np, df, dfc_power);
> > > +
> > > +	if (np)
> > > +		of_node_put(np);
> > > +
> 
> [snip]
> 
> > > 
> > 
> > Otherwise it looks good to me:
> > 
> > Reviewed-by: Ionela Voinescu <ionela.voinescu@arm.com>
> 
> Thank you for the review.
> 
> Regards,
> Lukasz
> 
> > 
> > Ionela.
> > 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model
  2020-12-02 11:49       ` Ionela Voinescu
@ 2020-12-02 11:54         ` Lukasz Luba
  0 siblings, 0 replies; 21+ messages in thread
From: Lukasz Luba @ 2020-12-02 11:54 UTC (permalink / raw)
  To: Ionela Voinescu
  Cc: linux-kernel, linux-pm, dri-devel, rui.zhang, amit.kucheria,
	daniel.lezcano, orjan.eide, robh, alyssa.rosenzweig,
	steven.price, airlied, daniel



On 12/2/20 11:49 AM, Ionela Voinescu wrote:
> On Wednesday 02 Dec 2020 at 11:14:02 (+0000), Lukasz Luba wrote:
>> Hi Ionela,
>>
>> On 12/2/20 10:24 AM, Ionela Voinescu wrote:
>>> Hi Lukasz,
>>>
>>> On Wednesday 18 Nov 2020 at 12:03:56 (+0000), Lukasz Luba wrote:
>>
>> [snip]
>>
>>>> +	struct device_node *np = NULL;
>>
>> [snip]
>>
>>>> +
>>>> +	if (dev->of_node)
>>>> +		np = of_node_get(dev->of_node);
>>>> +
>>>
>>> Should np be checked before use? I'm not sure if it's better to do the
>>> assign first and then the check on np before use. It depends on the
>>> consequences of passing a NULL node pointer later on.
>>
>> The np is actually dev->of_node (or left NULL, as set at the begging).
>> The only meaning of the line above is to increment the counter and then
>> decrement if CONFIG_OF_DYNAMIC was used.
>> The devfreq_cooling_register() has np = NULL and the registration can
>> handle it, so we should be OK here as well.
>>
> 
> Yes, I just wanted to make sure later registration can handle np = NULL,
> or whether we need to bail out.
> 
> In this case, you can drop both ifs - for (dev->of_node) before get and
> for np before put below, as of_node_get/of_node_put can handle NULL
> pointers themselves.

Right. I agree, I will resend this patch with that small change.
Thank you for having a look at it.

Lukasz

> 
> Thanks,
> Ionela.
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 0/5] Thermal devfreq cooling improvements with Energy Model
  2020-11-18 12:03 [PATCH v2 0/5] Thermal devfreq cooling improvements with Energy Model Lukasz Luba
                   ` (4 preceding siblings ...)
  2020-11-18 12:03 ` [PATCH v2 5/5] drm/panfrost: Register devfreq cooling and attempt to add Energy Model Lukasz Luba
@ 2020-12-02 15:01 ` Daniel Lezcano
  5 siblings, 0 replies; 21+ messages in thread
From: Daniel Lezcano @ 2020-12-02 15:01 UTC (permalink / raw)
  To: Lukasz Luba, linux-kernel, linux-pm, dri-devel
  Cc: rui.zhang, amit.kucheria, orjan.eide, robh, alyssa.rosenzweig,
	steven.price, airlied, daniel, ionela.voinescu

On 18/11/2020 13:03, Lukasz Luba wrote:
> Hi all,
> 
> This patch set is a continuation of my previous work, which aimed
> to add Energy Model to all devices. This series is a follow up
> for the patches which got merged to v5.9-rc1. It aims to change
> the thermal devfreq cooling and use the Energy Model instead of
> private power table and structures. The new registration interface
> in the patch 3/5 helps to register devfreq cooling and the EM in one
> call. There is also another improvement, patch 2/5 is changing the
> way how thermal gets the device status. Now it's taken on demand
> and stored as a copy. The last patch wouldn't go through thermal tree,
> but it's here for consistency.

The patch 5/5 is reviewed by the maintainers. If they agree, I can apply
the patch with this series.


-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 2/5] thermal: devfreq_cooling: get a copy of device status
  2020-11-18 12:03 ` [PATCH v2 2/5] thermal: devfreq_cooling: get a copy of device status Lukasz Luba
  2020-12-02 10:23   ` Ionela Voinescu
@ 2020-12-03 13:09   ` Daniel Lezcano
  2020-12-03 15:38     ` Lukasz Luba
  1 sibling, 1 reply; 21+ messages in thread
From: Daniel Lezcano @ 2020-12-03 13:09 UTC (permalink / raw)
  To: Lukasz Luba, linux-kernel, linux-pm, dri-devel
  Cc: rui.zhang, amit.kucheria, orjan.eide, robh, alyssa.rosenzweig,
	steven.price, airlied, daniel, ionela.voinescu

On 18/11/2020 13:03, Lukasz Luba wrote:
> Devfreq cooling needs to now the correct status of the device in order
> to operate. Do not rely on Devfreq last_status which might be a stale data
> and get more up-to-date values of the load.
> 
> Devfreq framework can change the device status in the background. To
> mitigate this situation make a copy of the status structure and use it
> for internal calculations.
> 
> In addition this patch adds normalization function, which also makes sure
> that whatever data comes from the device, it is in a sane range.
> 
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  drivers/thermal/devfreq_cooling.c | 52 +++++++++++++++++++++++++------
>  1 file changed, 43 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
> index 659c0143c9f0..925523694462 100644
> --- a/drivers/thermal/devfreq_cooling.c
> +++ b/drivers/thermal/devfreq_cooling.c
> @@ -227,20 +227,46 @@ static inline unsigned long get_total_power(struct devfreq_cooling_device *dfc,
>  							       voltage);
>  }
>  
> +static void _normalize_load(struct devfreq_dev_status *status)
> +{
> +	/* Make some space if needed */
> +	if (status->busy_time > 0xffff) {
> +		status->busy_time >>= 10;
> +		status->total_time >>= 10;
> +	}
> +
> +	if (status->busy_time > status->total_time)
> +		status->busy_time = status->total_time;

How the condition above is possible?

> +	status->busy_time *= 100;
> +	status->busy_time /= status->total_time ? : 1;
> +
> +	/* Avoid division by 0 */
> +	status->busy_time = status->busy_time ? : 1;
> +	status->total_time = 100;

Why not base the normalization on 1024? and use an intermediate u64.

For example:

static u32 _normalize_load(struct devfreq_dev_status *status)
{
	u64 load = 0;

	/* Prevent divison by zero */
	if (!status->busy_time)
		return 0;

	/*
	 * Assuming status->total_time is always greater or equal
	 * to status->busy_time, it can not be equal to zero because
	 * of the test above
	 */
	load = status->busy_time * 1024;
	load /= status->total_time;

	/*
	 * load is always [1..1024[, so it can not be truncated by a
	 * u64 -> u32 coercive cast
	 */
	return (u32)load;
}


> +}
>  
>  static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cdev,
>  					       u32 *power)
>  {
>  	struct devfreq_cooling_device *dfc = cdev->devdata;
>  	struct devfreq *df = dfc->devfreq;
> -	struct devfreq_dev_status *status = &df->last_status;
> +	struct devfreq_dev_status status;
>  	unsigned long state;
> -	unsigned long freq = status->current_frequency;
> +	unsigned long freq;
>  	unsigned long voltage;
>  	u32 dyn_power = 0;
>  	u32 static_power = 0;
>  	int res;
>  
> +	mutex_lock(&df->lock);
> +	res = df->profile->get_dev_status(df->dev.parent, &status);
> +	mutex_unlock(&df->lock);
> +	if (res)
> +		return res;
> +
> +	freq = status.current_frequency;
> +
>  	state = freq_get_state(dfc, freq);
>  	if (state == THERMAL_CSTATE_INVALID) {
>  		res = -EAGAIN;
> @@ -268,16 +294,18 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
>  	} else {
>  		dyn_power = dfc->power_table[state];
>  
> +		_normalize_load(&status);

		load = _normalize_load(&status);

> +
>  		/* Scale dynamic power for utilization */
> -		dyn_power *= status->busy_time;
> -		dyn_power /= status->total_time;
> +		dyn_power *= status.busy_time;
> +		dyn_power /= status.total_time;

		/*
		 * May be change dyn_power to a u64 to prevent overflow
		 * when multiplied by 'load'
		 */
		dyn_power = (dyn_power * load) / 1024;

>  		/* Get static power */
>  		static_power = get_static_power(dfc, freq);
>  
>  		*power = dyn_power + static_power;
>  	}
>  
> -	trace_thermal_power_devfreq_get_power(cdev, status, freq, *power);
> +	trace_thermal_power_devfreq_get_power(cdev, &status, freq, *power);
>  
>  	return 0;
>  fail:
> @@ -309,14 +337,20 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
>  {
>  	struct devfreq_cooling_device *dfc = cdev->devdata;
>  	struct devfreq *df = dfc->devfreq;
> -	struct devfreq_dev_status *status = &df->last_status;
> -	unsigned long freq = status->current_frequency;
> +	struct devfreq_dev_status status;
>  	unsigned long busy_time;
> +	unsigned long freq;
>  	s32 dyn_power;
>  	u32 static_power;
>  	s32 est_power;
>  	int i;
>  
> +	mutex_lock(&df->lock);
> +	status = df->last_status;
> +	mutex_unlock(&df->lock);
> +
> +	freq = status.current_frequency;
> +
>  	if (dfc->power_ops->get_real_power) {
>  		/* Scale for resource utilization */
>  		est_power = power * dfc->res_util;
> @@ -328,8 +362,8 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
>  		dyn_power = dyn_power > 0 ? dyn_power : 0;
>  
>  		/* Scale dynamic power for utilization */
> -		busy_time = status->busy_time ?: 1;
> -		est_power = (dyn_power * status->total_time) / busy_time;
> +		busy_time = status.busy_time ?: 1;
> +		est_power = (dyn_power * status.total_time) / busy_time;
>  	}
>  
>  	/*
> 


-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 2/5] thermal: devfreq_cooling: get a copy of device status
  2020-12-03 13:09   ` Daniel Lezcano
@ 2020-12-03 15:38     ` Lukasz Luba
  2020-12-03 16:09       ` Daniel Lezcano
  0 siblings, 1 reply; 21+ messages in thread
From: Lukasz Luba @ 2020-12-03 15:38 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: linux-kernel, linux-pm, dri-devel, rui.zhang, amit.kucheria,
	orjan.eide, robh, alyssa.rosenzweig, steven.price, airlied,
	daniel, ionela.voinescu



On 12/3/20 1:09 PM, Daniel Lezcano wrote:
> On 18/11/2020 13:03, Lukasz Luba wrote:
>> Devfreq cooling needs to now the correct status of the device in order
>> to operate. Do not rely on Devfreq last_status which might be a stale data
>> and get more up-to-date values of the load.
>>
>> Devfreq framework can change the device status in the background. To
>> mitigate this situation make a copy of the status structure and use it
>> for internal calculations.
>>
>> In addition this patch adds normalization function, which also makes sure
>> that whatever data comes from the device, it is in a sane range.
>>
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
>>   drivers/thermal/devfreq_cooling.c | 52 +++++++++++++++++++++++++------
>>   1 file changed, 43 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
>> index 659c0143c9f0..925523694462 100644
>> --- a/drivers/thermal/devfreq_cooling.c
>> +++ b/drivers/thermal/devfreq_cooling.c
>> @@ -227,20 +227,46 @@ static inline unsigned long get_total_power(struct devfreq_cooling_device *dfc,
>>   							       voltage);
>>   }
>>   
>> +static void _normalize_load(struct devfreq_dev_status *status)
>> +{
>> +	/* Make some space if needed */
>> +	if (status->busy_time > 0xffff) {
>> +		status->busy_time >>= 10;
>> +		status->total_time >>= 10;
>> +	}
>> +
>> +	if (status->busy_time > status->total_time)
>> +		status->busy_time = status->total_time;
> 
> How the condition above is possible?

They should, be checked by the driver, but I cannot trust
and have to check for all corner cases: (div by 0, overflow
one of them, etc). The busy_time and total_time are unsigned long,
which means 4B on 32bit machines.
If these values are coming from device counters, which count every
busy cycle and total cycles of a clock of a device running at e.g.
1GHz they would overflow every ~4s.

Normally IPA polling are 1s and 100ms, it's platform specific. But there
are also 'empty' periods when IPA sees temperature very low and does not
even call the .get_requested_power() callbacks for the cooling devices,
just grants max freq to all. This is problematic. I am investigating it
and will propose a solution for IPA soon.

I would avoid all of this if devfreq core would have default for all
devices a reliable polling timer... Let me check some possibilities also
for this case.

> 
>> +	status->busy_time *= 100;
>> +	status->busy_time /= status->total_time ? : 1;
>> +
>> +	/* Avoid division by 0 */
>> +	status->busy_time = status->busy_time ? : 1;
>> +	status->total_time = 100;
> 
> Why not base the normalization on 1024? and use an intermediate u64.

You are the 2nd reviewer who is asking this. I tried to keep 'load' as
in range [0, 100] since we also have 'load' in cpufreq cooling in this
range. Maybe I should switch to 1024 (Ionela was also asking for this).

> 
> For example:
> 
> static u32 _normalize_load(struct devfreq_dev_status *status)
> {
> 	u64 load = 0;
> 
> 	/* Prevent divison by zero */
> 	if (!status->busy_time)
> 		return 0;
> 
> 	/*
> 	 * Assuming status->total_time is always greater or equal
> 	 * to status->busy_time, it can not be equal to zero because
> 	 * of the test above
> 	 */
> 	load = status->busy_time * 1024;
> 	load /= status->total_time;

I wanted to avoid any divisions which involve 64bit var on 32bit
machine.

> 
> 	/*
> 	 * load is always [1..1024[, so it can not be truncated by a
> 	 * u64 -> u32 coercive cast
> 	 */
> 	return (u32)load;
> }
> 
> 
>> +}
>>   
>>   static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cdev,
>>   					       u32 *power)
>>   {
>>   	struct devfreq_cooling_device *dfc = cdev->devdata;
>>   	struct devfreq *df = dfc->devfreq;
>> -	struct devfreq_dev_status *status = &df->last_status;
>> +	struct devfreq_dev_status status;
>>   	unsigned long state;
>> -	unsigned long freq = status->current_frequency;
>> +	unsigned long freq;
>>   	unsigned long voltage;
>>   	u32 dyn_power = 0;
>>   	u32 static_power = 0;
>>   	int res;
>>   
>> +	mutex_lock(&df->lock);
>> +	res = df->profile->get_dev_status(df->dev.parent, &status);
>> +	mutex_unlock(&df->lock);
>> +	if (res)
>> +		return res;
>> +
>> +	freq = status.current_frequency;
>> +
>>   	state = freq_get_state(dfc, freq);
>>   	if (state == THERMAL_CSTATE_INVALID) {
>>   		res = -EAGAIN;
>> @@ -268,16 +294,18 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
>>   	} else {
>>   		dyn_power = dfc->power_table[state];
>>   
>> +		_normalize_load(&status);
> 
> 		load = _normalize_load(&status);
> 
>> +
>>   		/* Scale dynamic power for utilization */
>> -		dyn_power *= status->busy_time;
>> -		dyn_power /= status->total_time;
>> +		dyn_power *= status.busy_time;
>> +		dyn_power /= status.total_time;
> 
> 		/*
> 		 * May be change dyn_power to a u64 to prevent overflow
> 		 * when multiplied by 'load'
> 		 */
> 		dyn_power = (dyn_power * load) / 1024;

dyn_power value from EM should fit in 16bit [1], so we should be safe.

I will experiment with the 1024 code and check some corner cases.

Thank you Daniel for the review!

Regards,
Lukasz

[1] 
https://elixir.bootlin.com/linux/v5.10-rc5/source/kernel/power/energy_model.c#L135

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model
  2020-11-18 12:03 ` [PATCH v2 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model Lukasz Luba
  2020-12-02 10:24   ` Ionela Voinescu
@ 2020-12-03 15:40   ` Daniel Lezcano
  2020-12-07  9:46     ` Lukasz Luba
  1 sibling, 1 reply; 21+ messages in thread
From: Daniel Lezcano @ 2020-12-03 15:40 UTC (permalink / raw)
  To: Lukasz Luba, linux-kernel, linux-pm, dri-devel
  Cc: rui.zhang, amit.kucheria, orjan.eide, robh, alyssa.rosenzweig,
	steven.price, airlied, daniel, ionela.voinescu

On 18/11/2020 13:03, Lukasz Luba wrote:
> The Energy Model (EM) framework supports devices such as Devfreq. Create
> new registration functions which automatically register EM for the thermal
> devfreq_cooling devices. This patch prepares the code for coming changes
> which are going to replace old power model with the new EM.
> 
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  drivers/thermal/devfreq_cooling.c | 99 ++++++++++++++++++++++++++++++-
>  include/linux/devfreq_cooling.h   | 22 +++++++
>  2 files changed, 120 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
> index 925523694462..b354271742c5 100644
> --- a/drivers/thermal/devfreq_cooling.c
> +++ b/drivers/thermal/devfreq_cooling.c
> @@ -50,6 +50,8 @@ static DEFINE_IDA(devfreq_ida);
>   * @capped_state:	index to cooling state with in dynamic power budget
>   * @req_max_freq:	PM QoS request for limiting the maximum frequency
>   *			of the devfreq device.
> + * @em:		Energy Model for the associated Devfreq device
> + * @em_registered:	Devfreq cooling registered the EM and should free it.
>   */
>  struct devfreq_cooling_device {
>  	int id;
> @@ -63,6 +65,8 @@ struct devfreq_cooling_device {
>  	u32 res_util;
>  	int capped_state;
>  	struct dev_pm_qos_request req_max_freq;
> +	struct em_perf_domain *em;

This pointer is not needed, it is in the struct device.

> +	bool em_registered;

The boolean em_registered is not needed because of the test in the
function em_dev_unregister_perf_domain():

if (IS_ERR_OR_NULL(dev) || !dev->em_pd)
                return;

Logically if the 'em' was not initialized, it must be NULL, the
corresponding struct device was zero-allocated.


>  };
>  
>  static int devfreq_cooling_get_max_state(struct thermal_cooling_device *cdev,
> @@ -583,22 +587,115 @@ struct thermal_cooling_device *devfreq_cooling_register(struct devfreq *df)
>  }
>  EXPORT_SYMBOL_GPL(devfreq_cooling_register);
>  
> +/**
> + * devfreq_cooling_em_register_power() - Register devfreq cooling device with
> + *		power information and attempt to register Energy Model (EM)
> + * @df:		Pointer to devfreq device.
> + * @dfc_power:	Pointer to devfreq_cooling_power.
> + * @em_cb:	Callback functions providing the data of the EM
> + *
> + * Register a devfreq cooling device and attempt to register Energy Model. The
> + * available OPPs must be registered for the device.
> + *
> + * If @dfc_power is provided, the cooling device is registered with the
> + * power extensions. If @em_cb is provided it will be called for each OPP to
> + * calculate power value and cost. If @em_cb is not provided then simple Energy
> + * Model is going to be used, which requires "dynamic-power-coefficient" a
> + * devicetree property.
> + */
> +struct thermal_cooling_device *
> +devfreq_cooling_em_register_power(struct devfreq *df,
> +				  struct devfreq_cooling_power *dfc_power,
> +				  struct em_data_callback *em_cb)
> +{
> +	struct thermal_cooling_device *cdev;
> +	struct devfreq_cooling_device *dfc;
> +	struct device_node *np = NULL;
> +	struct device *dev;
> +	int nr_opp, ret;
> +
> +	if (IS_ERR_OR_NULL(df))
> +		return ERR_PTR(-EINVAL);
> +
> +	dev = df->dev.parent;

Why the parent ?

> +
> +	if (em_cb) {
> +		nr_opp = dev_pm_opp_get_opp_count(dev);
> +		if (nr_opp <= 0) {
> +			dev_err(dev, "No valid OPPs found\n");
> +			return ERR_PTR(-EINVAL);
> +		}
> +
> +		ret = em_dev_register_perf_domain(dev, nr_opp, em_cb, NULL, false);
> +	} else {
> +		ret = dev_pm_opp_of_register_em(dev, NULL);
> +	}
> +
> +	if (ret)
> +		dev_warn(dev, "Unable to register EM for devfreq cooling device (%d)\n",
> +			 ret);
> +
> +	if (dev->of_node)
> +		np = of_node_get(dev->of_node);
> +
> +	cdev = of_devfreq_cooling_register_power(np, df, dfc_power);
> +
> +	if (np)
> +		of_node_put(np);> +
> +	if (IS_ERR_OR_NULL(cdev)) {
> +		if (!ret)
> +			em_dev_unregister_perf_domain(dev);
> +	} else {
> +		dfc = cdev->devdata;
> +		dfc->em_registered = !ret;
> +	}
> +
> +	return cdev;
> +}
> +EXPORT_SYMBOL_GPL(devfreq_cooling_em_register_power);
> +
> +/**
> + * devfreq_cooling_em_register() - Register devfreq cooling device together
> + *				with Energy Model.
> + * @df:		Pointer to devfreq device.
> + * @em_cb:	Callback functions providing the data of the Energy Model
> + *
> + * This function attempts to register Energy Model for devfreq device and then
> + * register the devfreq cooling device.
> + */
> +struct thermal_cooling_device *
> +devfreq_cooling_em_register(struct devfreq *df, struct em_data_callback *em_cb)
> +{
> +	return devfreq_cooling_em_register_power(df, NULL, em_cb);
> +}
> +EXPORT_SYMBOL_GPL(devfreq_cooling_em_register);
> +
>  /**
>   * devfreq_cooling_unregister() - Unregister devfreq cooling device.
>   * @cdev: Pointer to devfreq cooling device to unregister.
> + *
> + * Unregisters devfreq cooling device and related Energy Model if it was
> + * present.
>   */
>  void devfreq_cooling_unregister(struct thermal_cooling_device *cdev)
>  {
>  	struct devfreq_cooling_device *dfc;
> +	struct device *dev;
>  
> -	if (!cdev)
> +	if (IS_ERR_OR_NULL(cdev))

Why this additional IS_ERR check ?

>  		return;
>  
>  	dfc = cdev->devdata;
> +	dev = dfc->devfreq->dev.parent;
>  
>  	thermal_cooling_device_unregister(dfc->cdev);
>  	ida_simple_remove(&devfreq_ida, dfc->id);
>  	dev_pm_qos_remove_request(&dfc->req_max_freq);
> +
> +	if (dfc->em_registered)
> +		em_dev_unregister_perf_domain(dev);
> +

As stated before it can be called unconditionally

>  	kfree(dfc->power_table);
>  	kfree(dfc->freq_table);
>  
> diff --git a/include/linux/devfreq_cooling.h b/include/linux/devfreq_cooling.h
> index 9df2dfca68dd..19868fb922f1 100644
> --- a/include/linux/devfreq_cooling.h
> +++ b/include/linux/devfreq_cooling.h
> @@ -11,6 +11,7 @@
>  #define __DEVFREQ_COOLING_H__
>  
>  #include <linux/devfreq.h>
> +#include <linux/energy_model.h>
>  #include <linux/thermal.h>
>  
>  
> @@ -65,6 +66,13 @@ struct thermal_cooling_device *
>  of_devfreq_cooling_register(struct device_node *np, struct devfreq *df);
>  struct thermal_cooling_device *devfreq_cooling_register(struct devfreq *df);
>  void devfreq_cooling_unregister(struct thermal_cooling_device *dfc);
> +struct thermal_cooling_device *
> +devfreq_cooling_em_register_power(struct devfreq *df,
> +				  struct devfreq_cooling_power *dfc_power,
> +				  struct em_data_callback *em_cb);
> +struct thermal_cooling_device *
> +devfreq_cooling_em_register(struct devfreq *df,
> +			    struct em_data_callback *em_cb);
>  
>  #else /* !CONFIG_DEVFREQ_THERMAL */
>  
> @@ -87,6 +95,20 @@ devfreq_cooling_register(struct devfreq *df)
>  	return ERR_PTR(-EINVAL);
>  }
>  
> +static inline struct thermal_cooling_device *
> +devfreq_cooling_em_register_power(struct devfreq *df,
> +				  struct devfreq_cooling_power *dfc_power,
> +				  struct em_data_callback *em_cb)
> +{
> +	return ERR_PTR(-EINVAL);
> +}
> +
> +static inline struct thermal_cooling_device *
> +devfreq_cooling_em_register(struct devfreq *df,	struct em_data_callback *em_cb)
> +{
> +	return ERR_PTR(-EINVAL);
> +}
> +
>  static inline void
>  devfreq_cooling_unregister(struct thermal_cooling_device *dfc)
>  {
> 


-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 2/5] thermal: devfreq_cooling: get a copy of device status
  2020-12-03 15:38     ` Lukasz Luba
@ 2020-12-03 16:09       ` Daniel Lezcano
  2020-12-07 12:41         ` Lukasz Luba
  0 siblings, 1 reply; 21+ messages in thread
From: Daniel Lezcano @ 2020-12-03 16:09 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, dri-devel, rui.zhang, amit.kucheria,
	orjan.eide, robh, alyssa.rosenzweig, steven.price, airlied,
	daniel, ionela.voinescu

On 03/12/2020 16:38, Lukasz Luba wrote:
> 
> 
> On 12/3/20 1:09 PM, Daniel Lezcano wrote:
>> On 18/11/2020 13:03, Lukasz Luba wrote:
>>> Devfreq cooling needs to now the correct status of the device in order
>>> to operate. Do not rely on Devfreq last_status which might be a stale
>>> data
>>> and get more up-to-date values of the load.
>>>
>>> Devfreq framework can change the device status in the background. To
>>> mitigate this situation make a copy of the status structure and use it
>>> for internal calculations.
>>>
>>> In addition this patch adds normalization function, which also makes
>>> sure
>>> that whatever data comes from the device, it is in a sane range.
>>>
>>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>>> ---
>>>   drivers/thermal/devfreq_cooling.c | 52 +++++++++++++++++++++++++------
>>>   1 file changed, 43 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/drivers/thermal/devfreq_cooling.c
>>> b/drivers/thermal/devfreq_cooling.c
>>> index 659c0143c9f0..925523694462 100644
>>> --- a/drivers/thermal/devfreq_cooling.c
>>> +++ b/drivers/thermal/devfreq_cooling.c
>>> @@ -227,20 +227,46 @@ static inline unsigned long
>>> get_total_power(struct devfreq_cooling_device *dfc,
>>>                                      voltage);
>>>   }
>>>   +static void _normalize_load(struct devfreq_dev_status *status)
>>> +{
>>> +    /* Make some space if needed */
>>> +    if (status->busy_time > 0xffff) {
>>> +        status->busy_time >>= 10;
>>> +        status->total_time >>= 10;
>>> +    }
>>> +
>>> +    if (status->busy_time > status->total_time)
>>> +        status->busy_time = status->total_time;
>>
>> How the condition above is possible?
> 
> They should, be checked by the driver, but I cannot trust
> and have to check for all corner cases: (div by 0, overflow
> one of them, etc). The busy_time and total_time are unsigned long,
> which means 4B on 32bit machines.
> If these values are coming from device counters, which count every
> busy cycle and total cycles of a clock of a device running at e.g.
> 1GHz they would overflow every ~4s.

I don't think it is up to this routine to check the driver is correctly
implemented, especially at every call to get_requested_power.

If the normalization ends up by doing this kind of thing, there is
certainly something wrong in the 'status' computation to be fixed before
submitting this series.


> Normally IPA polling are 1s and 100ms, it's platform specific. But there
> are also 'empty' periods when IPA sees temperature very low and does not
> even call the .get_requested_power() callbacks for the cooling devices,
> just grants max freq to all. This is problematic. I am investigating it
> and will propose a solution for IPA soon.
> 
> I would avoid all of this if devfreq core would have default for all
> devices a reliable polling timer... Let me check some possibilities also
> for this case.

Ok, may be create an API to compute the 'idle,busy,total times' to be
used by the different the devfreq drivers and then fix the overflow in
this common place.

>>> +    status->busy_time *= 100;
>>> +    status->busy_time /= status->total_time ? : 1;
>>> +
>>> +    /* Avoid division by 0 */
>>> +    status->busy_time = status->busy_time ? : 1;
>>> +    status->total_time = 100;
>>
>> Why not base the normalization on 1024? and use an intermediate u64.
> 
> You are the 2nd reviewer who is asking this. I tried to keep 'load' as
> in range [0, 100] since we also have 'load' in cpufreq cooling in this
> range. Maybe I should switch to 1024 (Ionela was also asking for this).

Well it is common practice to compute normalization with 1024 because
the division is a bit shift and the compiler optimize the code very well
with that value.




-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model
  2020-12-03 15:40   ` Daniel Lezcano
@ 2020-12-07  9:46     ` Lukasz Luba
  0 siblings, 0 replies; 21+ messages in thread
From: Lukasz Luba @ 2020-12-07  9:46 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: linux-kernel, linux-pm, dri-devel, rui.zhang, amit.kucheria,
	orjan.eide, robh, alyssa.rosenzweig, steven.price, airlied,
	daniel, ionela.voinescu



On 12/3/20 3:40 PM, Daniel Lezcano wrote:
> On 18/11/2020 13:03, Lukasz Luba wrote:
>> The Energy Model (EM) framework supports devices such as Devfreq. Create
>> new registration functions which automatically register EM for the thermal
>> devfreq_cooling devices. This patch prepares the code for coming changes
>> which are going to replace old power model with the new EM.
>>
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
>>   drivers/thermal/devfreq_cooling.c | 99 ++++++++++++++++++++++++++++++-
>>   include/linux/devfreq_cooling.h   | 22 +++++++
>>   2 files changed, 120 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
>> index 925523694462..b354271742c5 100644
>> --- a/drivers/thermal/devfreq_cooling.c
>> +++ b/drivers/thermal/devfreq_cooling.c
>> @@ -50,6 +50,8 @@ static DEFINE_IDA(devfreq_ida);
>>    * @capped_state:	index to cooling state with in dynamic power budget
>>    * @req_max_freq:	PM QoS request for limiting the maximum frequency
>>    *			of the devfreq device.
>> + * @em:		Energy Model for the associated Devfreq device
>> + * @em_registered:	Devfreq cooling registered the EM and should free it.
>>    */
>>   struct devfreq_cooling_device {
>>   	int id;
>> @@ -63,6 +65,8 @@ struct devfreq_cooling_device {
>>   	u32 res_util;
>>   	int capped_state;
>>   	struct dev_pm_qos_request req_max_freq;
>> +	struct em_perf_domain *em;
> 
> This pointer is not needed, it is in the struct device.

It is just a helper pointer, to make the code simpler and avoid nested
pointers:

struct device *dev = dfc->devfreq->dev.parent
and then using dev->em

The code is cleaner with dfc->em, but let me have a look if I can
remove it and still have a clean code.

> 
>> +	bool em_registered;
> 
> The boolean em_registered is not needed because of the test in the
> function em_dev_unregister_perf_domain():
> 
> if (IS_ERR_OR_NULL(dev) || !dev->em_pd)
>                  return;
> 
> Logically if the 'em' was not initialized, it must be NULL, the
> corresponding struct device was zero-allocated.

It was needed for devfreq cooling to know who registered the EM.
If there is 2 frameworks and driver and all could register EM,
this code cannot blindly unregister EM in it's code. The EM might
be used still by PowerCap DTM, so the unregister might be called
explicitly by the driver.

But I will rewrite the register function and make it way simpler,
just registration of EM (stopping when it failed) and then cooling
device. Also unregister will be simpler.

Driver will have to keep the order of unregister functions for two
frameworks and call unregister devfreq cooling device as last one,
because it will remove the EM.

> 
> 
>>   };
>>   
>>   static int devfreq_cooling_get_max_state(struct thermal_cooling_device *cdev,
>> @@ -583,22 +587,115 @@ struct thermal_cooling_device *devfreq_cooling_register(struct devfreq *df)
>>   }
>>   EXPORT_SYMBOL_GPL(devfreq_cooling_register);
>>   
>> +/**
>> + * devfreq_cooling_em_register_power() - Register devfreq cooling device with
>> + *		power information and attempt to register Energy Model (EM)
>> + * @df:		Pointer to devfreq device.
>> + * @dfc_power:	Pointer to devfreq_cooling_power.
>> + * @em_cb:	Callback functions providing the data of the EM
>> + *
>> + * Register a devfreq cooling device and attempt to register Energy Model. The
>> + * available OPPs must be registered for the device.
>> + *
>> + * If @dfc_power is provided, the cooling device is registered with the
>> + * power extensions. If @em_cb is provided it will be called for each OPP to
>> + * calculate power value and cost. If @em_cb is not provided then simple Energy
>> + * Model is going to be used, which requires "dynamic-power-coefficient" a
>> + * devicetree property.
>> + */
>> +struct thermal_cooling_device *
>> +devfreq_cooling_em_register_power(struct devfreq *df,
>> +				  struct devfreq_cooling_power *dfc_power,
>> +				  struct em_data_callback *em_cb)
>> +{
>> +	struct thermal_cooling_device *cdev;
>> +	struct devfreq_cooling_device *dfc;
>> +	struct device_node *np = NULL;
>> +	struct device *dev;
>> +	int nr_opp, ret;
>> +
>> +	if (IS_ERR_OR_NULL(df))
>> +		return ERR_PTR(-EINVAL);
>> +
>> +	dev = df->dev.parent;
> 
> Why the parent ?

The parent has OPPs and we are calling em_perf_domain_register() or
dev_pm_opp_of_register_em() (which in addition needs DT node) for that
device.

The old devfreq cooling code also had dev parent, to enable/disenable
OPPs.

> 
>> +
>> +	if (em_cb) {
>> +		nr_opp = dev_pm_opp_get_opp_count(dev);
>> +		if (nr_opp <= 0) {
>> +			dev_err(dev, "No valid OPPs found\n");
>> +			return ERR_PTR(-EINVAL);
>> +		}
>> +
>> +		ret = em_dev_register_perf_domain(dev, nr_opp, em_cb, NULL, false);
>> +	} else {
>> +		ret = dev_pm_opp_of_register_em(dev, NULL);
>> +	}
>> +
>> +	if (ret)
>> +		dev_warn(dev, "Unable to register EM for devfreq cooling device (%d)\n",
>> +			 ret);
>> +
>> +	if (dev->of_node)
>> +		np = of_node_get(dev->of_node);
>> +
>> +	cdev = of_devfreq_cooling_register_power(np, df, dfc_power);
>> +
>> +	if (np)
>> +		of_node_put(np);> +
>> +	if (IS_ERR_OR_NULL(cdev)) {
>> +		if (!ret)
>> +			em_dev_unregister_perf_domain(dev);
>> +	} else {
>> +		dfc = cdev->devdata;
>> +		dfc->em_registered = !ret;
>> +	}
>> +
>> +	return cdev;
>> +}
>> +EXPORT_SYMBOL_GPL(devfreq_cooling_em_register_power);
>> +
>> +/**
>> + * devfreq_cooling_em_register() - Register devfreq cooling device together
>> + *				with Energy Model.
>> + * @df:		Pointer to devfreq device.
>> + * @em_cb:	Callback functions providing the data of the Energy Model
>> + *
>> + * This function attempts to register Energy Model for devfreq device and then
>> + * register the devfreq cooling device.
>> + */
>> +struct thermal_cooling_device *
>> +devfreq_cooling_em_register(struct devfreq *df, struct em_data_callback *em_cb)
>> +{
>> +	return devfreq_cooling_em_register_power(df, NULL, em_cb);
>> +}
>> +EXPORT_SYMBOL_GPL(devfreq_cooling_em_register);
>> +
>>   /**
>>    * devfreq_cooling_unregister() - Unregister devfreq cooling device.
>>    * @cdev: Pointer to devfreq cooling device to unregister.
>> + *
>> + * Unregisters devfreq cooling device and related Energy Model if it was
>> + * present.
>>    */
>>   void devfreq_cooling_unregister(struct thermal_cooling_device *cdev)
>>   {
>>   	struct devfreq_cooling_device *dfc;
>> +	struct device *dev;
>>   
>> -	if (!cdev)
>> +	if (IS_ERR_OR_NULL(cdev))
> 
> Why this additional IS_ERR check ?

Not needed too much, but helps if driver doesn't check the
result of registration function and then just calls unregister
function, i.e.

	if (pfdev->devfreq.cooling)
		devfreq_cooling_unregister(pfdev->devfreq.cooling);

> 
>>   		return;
>>   
>>   	dfc = cdev->devdata;
>> +	dev = dfc->devfreq->dev.parent;
>>   
>>   	thermal_cooling_device_unregister(dfc->cdev);
>>   	ida_simple_remove(&devfreq_ida, dfc->id);
>>   	dev_pm_qos_remove_request(&dfc->req_max_freq);
>> +
>> +	if (dfc->em_registered)
>> +		em_dev_unregister_perf_domain(dev);
>> +
> 
> As stated before it can be called unconditionally

OK, I will rewrite it. The goal was to be able handle many situations
in register/unregister function, but I will make them simpler.

Thank you Daniel for review comments. I will address them in next
version.

Regards,
Lukasz

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 2/5] thermal: devfreq_cooling: get a copy of device status
  2020-12-03 16:09       ` Daniel Lezcano
@ 2020-12-07 12:41         ` Lukasz Luba
  2020-12-08 14:20           ` Lukasz Luba
  0 siblings, 1 reply; 21+ messages in thread
From: Lukasz Luba @ 2020-12-07 12:41 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: linux-kernel, linux-pm, dri-devel, rui.zhang, amit.kucheria,
	orjan.eide, robh, alyssa.rosenzweig, steven.price, airlied,
	daniel, ionela.voinescu



On 12/3/20 4:09 PM, Daniel Lezcano wrote:
> On 03/12/2020 16:38, Lukasz Luba wrote:
>>
>>
>> On 12/3/20 1:09 PM, Daniel Lezcano wrote:
>>> On 18/11/2020 13:03, Lukasz Luba wrote:
>>>> Devfreq cooling needs to now the correct status of the device in order
>>>> to operate. Do not rely on Devfreq last_status which might be a stale
>>>> data
>>>> and get more up-to-date values of the load.
>>>>
>>>> Devfreq framework can change the device status in the background. To
>>>> mitigate this situation make a copy of the status structure and use it
>>>> for internal calculations.
>>>>
>>>> In addition this patch adds normalization function, which also makes
>>>> sure
>>>> that whatever data comes from the device, it is in a sane range.
>>>>
>>>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>>>> ---
>>>>    drivers/thermal/devfreq_cooling.c | 52 +++++++++++++++++++++++++------
>>>>    1 file changed, 43 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/drivers/thermal/devfreq_cooling.c
>>>> b/drivers/thermal/devfreq_cooling.c
>>>> index 659c0143c9f0..925523694462 100644
>>>> --- a/drivers/thermal/devfreq_cooling.c
>>>> +++ b/drivers/thermal/devfreq_cooling.c
>>>> @@ -227,20 +227,46 @@ static inline unsigned long
>>>> get_total_power(struct devfreq_cooling_device *dfc,
>>>>                                       voltage);
>>>>    }
>>>>    +static void _normalize_load(struct devfreq_dev_status *status)
>>>> +{
>>>> +    /* Make some space if needed */
>>>> +    if (status->busy_time > 0xffff) {
>>>> +        status->busy_time >>= 10;
>>>> +        status->total_time >>= 10;
>>>> +    }
>>>> +
>>>> +    if (status->busy_time > status->total_time)
>>>> +        status->busy_time = status->total_time;
>>>
>>> How the condition above is possible?
>>
>> They should, be checked by the driver, but I cannot trust
>> and have to check for all corner cases: (div by 0, overflow
>> one of them, etc). The busy_time and total_time are unsigned long,
>> which means 4B on 32bit machines.
>> If these values are coming from device counters, which count every
>> busy cycle and total cycles of a clock of a device running at e.g.
>> 1GHz they would overflow every ~4s.
> 
> I don't think it is up to this routine to check the driver is correctly
> implemented, especially at every call to get_requested_power.
> 
> If the normalization ends up by doing this kind of thing, there is
> certainly something wrong in the 'status' computation to be fixed before
> submitting this series.
> 
> 
>> Normally IPA polling are 1s and 100ms, it's platform specific. But there
>> are also 'empty' periods when IPA sees temperature very low and does not
>> even call the .get_requested_power() callbacks for the cooling devices,
>> just grants max freq to all. This is problematic. I am investigating it
>> and will propose a solution for IPA soon.
>>
>> I would avoid all of this if devfreq core would have default for all
>> devices a reliable polling timer... Let me check some possibilities also
>> for this case.
> 
> Ok, may be create an API to compute the 'idle,busy,total times' to be
> used by the different the devfreq drivers and then fix the overflow in
> this common place.

Yes, I have this plan, but I have to close this patch series. To go
forward with this, I will drop the normalization function and will keep
only the code of safe copy of the 'status', so using busy_time and
total_time will be safe.

I will address this computation and normalization in different patch
series. There might be a need of a new API as you pointed out, which
is out-of-scope of this patch set.

> 
>>>> +    status->busy_time *= 100;
>>>> +    status->busy_time /= status->total_time ? : 1;
>>>> +
>>>> +    /* Avoid division by 0 */
>>>> +    status->busy_time = status->busy_time ? : 1;
>>>> +    status->total_time = 100;
>>>
>>> Why not base the normalization on 1024? and use an intermediate u64.
>>
>> You are the 2nd reviewer who is asking this. I tried to keep 'load' as
>> in range [0, 100] since we also have 'load' in cpufreq cooling in this
>> range. Maybe I should switch to 1024 (Ionela was also asking for this).
> 
> Well it is common practice to compute normalization with 1024 because
> the division is a bit shift and the compiler optimize the code very well
> with that value.
> 

I will keep this 1024 in mind for the next topic series.

Regards,
Lukasz

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 2/5] thermal: devfreq_cooling: get a copy of device status
  2020-12-07 12:41         ` Lukasz Luba
@ 2020-12-08 14:20           ` Lukasz Luba
  0 siblings, 0 replies; 21+ messages in thread
From: Lukasz Luba @ 2020-12-08 14:20 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: linux-kernel, linux-pm, dri-devel, rui.zhang, amit.kucheria,
	orjan.eide, robh, alyssa.rosenzweig, steven.price, airlied,
	daniel, ionela.voinescu

Hi Daniel,

On 12/7/20 12:41 PM, Lukasz Luba wrote:
> 
> 
> On 12/3/20 4:09 PM, Daniel Lezcano wrote:
>> On 03/12/2020 16:38, Lukasz Luba wrote:
>>>
>>>
>>> On 12/3/20 1:09 PM, Daniel Lezcano wrote:
>>>> On 18/11/2020 13:03, Lukasz Luba wrote:
>>>>> Devfreq cooling needs to now the correct status of the device in order
>>>>> to operate. Do not rely on Devfreq last_status which might be a stale
>>>>> data
>>>>> and get more up-to-date values of the load.
>>>>>
>>>>> Devfreq framework can change the device status in the background. To
>>>>> mitigate this situation make a copy of the status structure and use it
>>>>> for internal calculations.
>>>>>
>>>>> In addition this patch adds normalization function, which also makes
>>>>> sure
>>>>> that whatever data comes from the device, it is in a sane range.
>>>>>
>>>>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>>>>> ---
>>>>>    drivers/thermal/devfreq_cooling.c | 52 
>>>>> +++++++++++++++++++++++++------
>>>>>    1 file changed, 43 insertions(+), 9 deletions(-)
>>>>>
>>>>> diff --git a/drivers/thermal/devfreq_cooling.c
>>>>> b/drivers/thermal/devfreq_cooling.c
>>>>> index 659c0143c9f0..925523694462 100644
>>>>> --- a/drivers/thermal/devfreq_cooling.c
>>>>> +++ b/drivers/thermal/devfreq_cooling.c
>>>>> @@ -227,20 +227,46 @@ static inline unsigned long
>>>>> get_total_power(struct devfreq_cooling_device *dfc,
>>>>>                                       voltage);
>>>>>    }
>>>>>    +static void _normalize_load(struct devfreq_dev_status *status)
>>>>> +{
>>>>> +    /* Make some space if needed */
>>>>> +    if (status->busy_time > 0xffff) {
>>>>> +        status->busy_time >>= 10;
>>>>> +        status->total_time >>= 10;
>>>>> +    }
>>>>> +
>>>>> +    if (status->busy_time > status->total_time)
>>>>> +        status->busy_time = status->total_time;
>>>>
>>>> How the condition above is possible?
>>>
>>> They should, be checked by the driver, but I cannot trust
>>> and have to check for all corner cases: (div by 0, overflow
>>> one of them, etc). The busy_time and total_time are unsigned long,
>>> which means 4B on 32bit machines.
>>> If these values are coming from device counters, which count every
>>> busy cycle and total cycles of a clock of a device running at e.g.
>>> 1GHz they would overflow every ~4s.
>>
>> I don't think it is up to this routine to check the driver is correctly
>> implemented, especially at every call to get_requested_power.
>>
>> If the normalization ends up by doing this kind of thing, there is
>> certainly something wrong in the 'status' computation to be fixed before
>> submitting this series.
>>
>>
>>> Normally IPA polling are 1s and 100ms, it's platform specific. But there
>>> are also 'empty' periods when IPA sees temperature very low and does not
>>> even call the .get_requested_power() callbacks for the cooling devices,
>>> just grants max freq to all. This is problematic. I am investigating it
>>> and will propose a solution for IPA soon.
>>>
>>> I would avoid all of this if devfreq core would have default for all
>>> devices a reliable polling timer... Let me check some possibilities also
>>> for this case.
>>
>> Ok, may be create an API to compute the 'idle,busy,total times' to be
>> used by the different the devfreq drivers and then fix the overflow in
>> this common place.
> 
> Yes, I have this plan, but I have to close this patch series. To go
> forward with this, I will drop the normalization function and will keep
> only the code of safe copy of the 'status', so using busy_time and
> total_time will be safe.

I did experiments and actually I cannot drop this function. Drivers can
feed total_time and busy_time which are in nanoseconds, e.g. [1] 50ms =>
50.000.000ns which is then when multiplied by 1024  and exceed the u32.
I want to avoid 64bit variables and divisions, so shifting them earlier
would help. IMHO it does not harm this devfreq cooling to make that
check and handle ns values.

I am going to use the normalization into 0..1024 as you and Ionela
suggested.
I will also drop the direct device status check. That would be a
different patch series. In that patch set I will try to come with a
generic solution and some API.

Regards,
Lukasz

[1] 
https://elixir.bootlin.com/linux/v5.10-rc5/source/drivers/gpu/drm/panfrost/panfrost_devfreq.c#L66

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2020-12-08 14:21 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-18 12:03 [PATCH v2 0/5] Thermal devfreq cooling improvements with Energy Model Lukasz Luba
2020-11-18 12:03 ` [PATCH v2 1/5] thermal: devfreq_cooling: change tracing function and arguments Lukasz Luba
2020-12-02 10:23   ` Ionela Voinescu
2020-11-18 12:03 ` [PATCH v2 2/5] thermal: devfreq_cooling: get a copy of device status Lukasz Luba
2020-12-02 10:23   ` Ionela Voinescu
2020-12-03 13:09   ` Daniel Lezcano
2020-12-03 15:38     ` Lukasz Luba
2020-12-03 16:09       ` Daniel Lezcano
2020-12-07 12:41         ` Lukasz Luba
2020-12-08 14:20           ` Lukasz Luba
2020-11-18 12:03 ` [PATCH v2 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model Lukasz Luba
2020-12-02 10:24   ` Ionela Voinescu
2020-12-02 11:14     ` Lukasz Luba
2020-12-02 11:49       ` Ionela Voinescu
2020-12-02 11:54         ` Lukasz Luba
2020-12-03 15:40   ` Daniel Lezcano
2020-12-07  9:46     ` Lukasz Luba
2020-11-18 12:03 ` [PATCH v2 4/5] thermal: devfreq_cooling: remove old power model and use EM Lukasz Luba
2020-12-02 10:26   ` Ionela Voinescu
2020-11-18 12:03 ` [PATCH v2 5/5] drm/panfrost: Register devfreq cooling and attempt to add Energy Model Lukasz Luba
2020-12-02 15:01 ` [PATCH v2 0/5] Thermal devfreq cooling improvements with " Daniel Lezcano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).