linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/5] Thermal devfreq cooling improvements with Energy Model
@ 2020-09-21 12:20 Lukasz Luba
  2020-09-21 12:20 ` [PATCH 1/5] thermal: devfreq_cooling: change tracing function and arguments Lukasz Luba
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: Lukasz Luba @ 2020-09-21 12:20 UTC (permalink / raw)
  To: linux-kernel, linux-pm, dri-devel
  Cc: rui.zhang, amit.kucheria, daniel.lezcano, lukasz.luba,
	orjan.eide, robh, alyssa.rosenzweig, steven.price, airlied,
	daniel

Hi all,

This path set is a continuation of my previous work, which aimed
to add Energy Model to all devices. This series is a follow up
for the patches which got merged to v5.9-rc1. It aims to change
the thermal devfreq cooling and use the Energy Model instead of
private calculation and structures. The new registration interface
in the patch 3/5 helps to register devfreq cooling and the EM in one
call. There is also another improvement, patch 2/5 is changing the
way how thermal gets the device status. Now it's taken on demand
and stored as a copy. The last patch wouldn't go through thermal tree,
but it's here to consistency.

The patch set is based on current next (next-20200921).

Regards,
Lukasz Luba

[1] https://lkml.org/lkml/2020/5/11/326

Lukasz Luba (5):
  thermal: devfreq_cooling: change tracing function and arguments
  thermal: devfreq_cooling: get a copy of device status
  thermal: devfreq_cooling: add new registration functions with Energy
    Model
  thermal: devfreq_cooling: remove old power model and use EM
  drm/panfrost: Register devfreq cooling and attempt to add Energy Model

 drivers/gpu/drm/panfrost/panfrost_devfreq.c |   2 +-
 drivers/thermal/devfreq_cooling.c           | 414 ++++++++++----------
 include/linux/devfreq_cooling.h             |  39 +-
 include/trace/events/thermal.h              |  19 +-
 4 files changed, 249 insertions(+), 225 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/5] thermal: devfreq_cooling: change tracing function and arguments
  2020-09-21 12:20 [PATCH 0/5] Thermal devfreq cooling improvements with Energy Model Lukasz Luba
@ 2020-09-21 12:20 ` Lukasz Luba
  2020-09-21 12:20 ` [PATCH 2/5] thermal: devfreq_cooling: get a copy of device status Lukasz Luba
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 20+ messages in thread
From: Lukasz Luba @ 2020-09-21 12:20 UTC (permalink / raw)
  To: linux-kernel, linux-pm, dri-devel
  Cc: rui.zhang, amit.kucheria, daniel.lezcano, lukasz.luba,
	orjan.eide, robh, alyssa.rosenzweig, steven.price, airlied,
	daniel

Prepare for deleting the static and dynamic power calculation and clean
the trace function. These two fields are going to be removed in the next
changes.

Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> # for tracing code
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 drivers/thermal/devfreq_cooling.c |  3 +--
 include/trace/events/thermal.h    | 19 +++++++++----------
 2 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
index a12d29096229..7063ccb7b86d 100644
--- a/drivers/thermal/devfreq_cooling.c
+++ b/drivers/thermal/devfreq_cooling.c
@@ -278,8 +278,7 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
 		*power = dyn_power + static_power;
 	}
 
-	trace_thermal_power_devfreq_get_power(cdev, status, freq, dyn_power,
-					      static_power, *power);
+	trace_thermal_power_devfreq_get_power(cdev, status, freq, *power);
 
 	return 0;
 fail:
diff --git a/include/trace/events/thermal.h b/include/trace/events/thermal.h
index 135e5421f003..8a5f04888abd 100644
--- a/include/trace/events/thermal.h
+++ b/include/trace/events/thermal.h
@@ -153,31 +153,30 @@ TRACE_EVENT(thermal_power_cpu_limit,
 TRACE_EVENT(thermal_power_devfreq_get_power,
 	TP_PROTO(struct thermal_cooling_device *cdev,
 		 struct devfreq_dev_status *status, unsigned long freq,
-		u32 dynamic_power, u32 static_power, u32 power),
+		u32 power),
 
-	TP_ARGS(cdev, status,  freq, dynamic_power, static_power, power),
+	TP_ARGS(cdev, status,  freq, power),
 
 	TP_STRUCT__entry(
 		__string(type,         cdev->type    )
 		__field(unsigned long, freq          )
-		__field(u32,           load          )
-		__field(u32,           dynamic_power )
-		__field(u32,           static_power  )
+		__field(u32,           busy_time)
+		__field(u32,           total_time)
 		__field(u32,           power)
 	),
 
 	TP_fast_assign(
 		__assign_str(type, cdev->type);
 		__entry->freq = freq;
-		__entry->load = (100 * status->busy_time) / status->total_time;
-		__entry->dynamic_power = dynamic_power;
-		__entry->static_power = static_power;
+		__entry->busy_time = status->busy_time;
+		__entry->total_time = status->total_time;
 		__entry->power = power;
 	),
 
-	TP_printk("type=%s freq=%lu load=%u dynamic_power=%u static_power=%u power=%u",
+	TP_printk("type=%s freq=%lu load=%u power=%u",
 		__get_str(type), __entry->freq,
-		__entry->load, __entry->dynamic_power, __entry->static_power,
+		__entry->total_time == 0 ? 0 :
+			(100 * __entry->busy_time) / __entry->total_time,
 		__entry->power)
 );
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/5] thermal: devfreq_cooling: get a copy of device status
  2020-09-21 12:20 [PATCH 0/5] Thermal devfreq cooling improvements with Energy Model Lukasz Luba
  2020-09-21 12:20 ` [PATCH 1/5] thermal: devfreq_cooling: change tracing function and arguments Lukasz Luba
@ 2020-09-21 12:20 ` Lukasz Luba
  2020-10-07 16:11   ` Ionela Voinescu
  2020-10-14 14:34   ` Daniel Lezcano
  2020-09-21 12:20 ` [PATCH 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model Lukasz Luba
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 20+ messages in thread
From: Lukasz Luba @ 2020-09-21 12:20 UTC (permalink / raw)
  To: linux-kernel, linux-pm, dri-devel
  Cc: rui.zhang, amit.kucheria, daniel.lezcano, lukasz.luba,
	orjan.eide, robh, alyssa.rosenzweig, steven.price, airlied,
	daniel

Devfreq cooling needs to now the correct status of the device in order
to operate. Do not rely on Devfreq last_status which might be a stale data
and get more up-to-date values of the load.

Devfreq framework can change the device status in the background. To
mitigate this situation make a copy of the status structure and use it
for internal calculations.

In addition this patch adds normalization function, which also makes sure
that whatever data comes from the device, it is in a sane range.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 drivers/thermal/devfreq_cooling.c | 52 +++++++++++++++++++++++++------
 1 file changed, 43 insertions(+), 9 deletions(-)

diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
index 7063ccb7b86d..cf045bd4d16b 100644
--- a/drivers/thermal/devfreq_cooling.c
+++ b/drivers/thermal/devfreq_cooling.c
@@ -227,6 +227,24 @@ static inline unsigned long get_total_power(struct devfreq_cooling_device *dfc,
 							       voltage);
 }
 
+static void _normalize_load(struct devfreq_dev_status *status)
+{
+	/* Make some space if needed */
+	if (status->busy_time > 0xffff) {
+		status->busy_time >>= 10;
+		status->total_time >>= 10;
+	}
+
+	if (status->busy_time > status->total_time)
+		status->busy_time = status->total_time;
+
+	status->busy_time *= 100;
+	status->busy_time /= status->total_time ? : 1;
+
+	/* Avoid division by 0 */
+	status->busy_time = status->busy_time ? : 1;
+	status->total_time = 100;
+}
 
 static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cdev,
 					       struct thermal_zone_device *tz,
@@ -234,14 +252,22 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
 {
 	struct devfreq_cooling_device *dfc = cdev->devdata;
 	struct devfreq *df = dfc->devfreq;
-	struct devfreq_dev_status *status = &df->last_status;
+	struct devfreq_dev_status status;
 	unsigned long state;
-	unsigned long freq = status->current_frequency;
+	unsigned long freq;
 	unsigned long voltage;
 	u32 dyn_power = 0;
 	u32 static_power = 0;
 	int res;
 
+	mutex_lock(&df->lock);
+	res = df->profile->get_dev_status(df->dev.parent, &status);
+	mutex_unlock(&df->lock);
+	if (res)
+		return res;
+
+	freq = status.current_frequency;
+
 	state = freq_get_state(dfc, freq);
 	if (state == THERMAL_CSTATE_INVALID) {
 		res = -EAGAIN;
@@ -269,16 +295,18 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
 	} else {
 		dyn_power = dfc->power_table[state];
 
+		_normalize_load(&status);
+
 		/* Scale dynamic power for utilization */
-		dyn_power *= status->busy_time;
-		dyn_power /= status->total_time;
+		dyn_power *= status.busy_time;
+		dyn_power /= status.total_time;
 		/* Get static power */
 		static_power = get_static_power(dfc, freq);
 
 		*power = dyn_power + static_power;
 	}
 
-	trace_thermal_power_devfreq_get_power(cdev, status, freq, *power);
+	trace_thermal_power_devfreq_get_power(cdev, &status, freq, *power);
 
 	return 0;
 fail:
@@ -312,14 +340,20 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
 {
 	struct devfreq_cooling_device *dfc = cdev->devdata;
 	struct devfreq *df = dfc->devfreq;
-	struct devfreq_dev_status *status = &df->last_status;
-	unsigned long freq = status->current_frequency;
+	struct devfreq_dev_status status;
 	unsigned long busy_time;
+	unsigned long freq;
 	s32 dyn_power;
 	u32 static_power;
 	s32 est_power;
 	int i;
 
+	mutex_lock(&df->lock);
+	status = df->last_status;
+	mutex_unlock(&df->lock);
+
+	freq = status.current_frequency;
+
 	if (dfc->power_ops->get_real_power) {
 		/* Scale for resource utilization */
 		est_power = power * dfc->res_util;
@@ -331,8 +365,8 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
 		dyn_power = dyn_power > 0 ? dyn_power : 0;
 
 		/* Scale dynamic power for utilization */
-		busy_time = status->busy_time ?: 1;
-		est_power = (dyn_power * status->total_time) / busy_time;
+		busy_time = status.busy_time ?: 1;
+		est_power = (dyn_power * status.total_time) / busy_time;
 	}
 
 	/*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model
  2020-09-21 12:20 [PATCH 0/5] Thermal devfreq cooling improvements with Energy Model Lukasz Luba
  2020-09-21 12:20 ` [PATCH 1/5] thermal: devfreq_cooling: change tracing function and arguments Lukasz Luba
  2020-09-21 12:20 ` [PATCH 2/5] thermal: devfreq_cooling: get a copy of device status Lukasz Luba
@ 2020-09-21 12:20 ` Lukasz Luba
  2020-10-07 12:07   ` Ionela Voinescu
  2020-09-21 12:20 ` [PATCH 4/5] thermal: devfreq_cooling: remove old power model and use EM Lukasz Luba
  2020-09-21 12:20 ` [PATCH 5/5] drm/panfrost: Register devfreq cooling and attempt to add Energy Model Lukasz Luba
  4 siblings, 1 reply; 20+ messages in thread
From: Lukasz Luba @ 2020-09-21 12:20 UTC (permalink / raw)
  To: linux-kernel, linux-pm, dri-devel
  Cc: rui.zhang, amit.kucheria, daniel.lezcano, lukasz.luba,
	orjan.eide, robh, alyssa.rosenzweig, steven.price, airlied,
	daniel

The Energy Model (EM) framework supports devices such as Devfreq. Create
new registration functions which automatically register EM for the thermal
devfreq_cooling devices. This patch prepares the code for coming changes
which are going to replace old power model with the new EM.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 drivers/thermal/devfreq_cooling.c | 99 ++++++++++++++++++++++++++++++-
 include/linux/devfreq_cooling.h   | 22 +++++++
 2 files changed, 120 insertions(+), 1 deletion(-)

diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
index cf045bd4d16b..7e091e795284 100644
--- a/drivers/thermal/devfreq_cooling.c
+++ b/drivers/thermal/devfreq_cooling.c
@@ -50,6 +50,8 @@ static DEFINE_IDA(devfreq_ida);
  * @capped_state:	index to cooling state with in dynamic power budget
  * @req_max_freq:	PM QoS request for limiting the maximum frequency
  *			of the devfreq device.
+ * @em:		Energy Model which represents the associated Devfreq device
+ * @em_registered:	Devfreq cooling registered the EM and should free it.
  */
 struct devfreq_cooling_device {
 	int id;
@@ -63,6 +65,8 @@ struct devfreq_cooling_device {
 	u32 res_util;
 	int capped_state;
 	struct dev_pm_qos_request req_max_freq;
+	struct em_perf_domain *em;
+	bool em_registered;
 };
 
 static int devfreq_cooling_get_max_state(struct thermal_cooling_device *cdev,
@@ -586,22 +590,115 @@ struct thermal_cooling_device *devfreq_cooling_register(struct devfreq *df)
 }
 EXPORT_SYMBOL_GPL(devfreq_cooling_register);
 
+/**
+ * devfreq_cooling_em_register_power() - Register devfreq cooling device with
+ *		power information and attempt to register Energy Model (EM)
+ * @df:		Pointer to devfreq device.
+ * @dfc_power:	Pointer to devfreq_cooling_power.
+ * @em_cb:	Callback functions providing the data of the EM
+ *
+ * Register a devfreq cooling device and attempt to register Energy Model. The
+ * available OPPs must be registered for the device.
+ *
+ * If @dfc_power is provided, the cooling device is registered with the
+ * power extensions. If @em_cb is provided it will be called for each OPP to
+ * calculate power value and cost. If @em_cb is not provided then simple Energy
+ * Model is going to be used, which requires "dynamic-power-coefficient" a
+ * devicetree property.
+ */
+struct thermal_cooling_device *
+devfreq_cooling_em_register_power(struct devfreq *df,
+				  struct devfreq_cooling_power *dfc_power,
+				  struct em_data_callback *em_cb)
+{
+	struct thermal_cooling_device *cdev;
+	struct devfreq_cooling_device *dfc;
+	struct device_node *np = NULL;
+	struct device *dev;
+	int nr_opp, ret;
+
+	if (IS_ERR_OR_NULL(df))
+		return ERR_PTR(-EINVAL);
+
+	dev = df->dev.parent;
+
+	if (em_cb) {
+		nr_opp = dev_pm_opp_get_opp_count(dev);
+		if (nr_opp <= 0) {
+			dev_err(dev, "No valid OPPs found\n");
+			return ERR_PTR(-EINVAL);
+		}
+
+		ret = em_dev_register_perf_domain(dev, nr_opp, em_cb, NULL);
+	} else {
+		ret = dev_pm_opp_of_register_em(dev, NULL);
+	}
+
+	if (ret)
+		dev_warn(dev, "Unable to register EM for devfreq cooling device (%d)\n",
+			 ret);
+
+	if (dev->of_node)
+		np = of_node_get(dev->of_node);
+
+	cdev = of_devfreq_cooling_register_power(np, df, dfc_power);
+
+	if (np)
+		of_node_put(np);
+
+	if (IS_ERR_OR_NULL(cdev)) {
+		if (!ret)
+			em_dev_unregister_perf_domain(dev);
+	} else {
+		dfc = cdev->devdata;
+		dfc->em_registered = !ret;
+	}
+
+	return cdev;
+}
+EXPORT_SYMBOL_GPL(devfreq_cooling_em_register_power);
+
+/**
+ * devfreq_cooling_em_register() - Register devfreq cooling device together
+ *				with Energy Model.
+ * @df:		Pointer to devfreq device.
+ * @em_cb:	Callback functions providing the data of the Energy Model
+ *
+ * This function attempts to register Energy Model for devfreq device and then
+ * register the devfreq cooling device.
+ */
+struct thermal_cooling_device *
+devfreq_cooling_em_register(struct devfreq *df, struct em_data_callback *em_cb)
+{
+	return devfreq_cooling_em_register_power(df, NULL, em_cb);
+}
+EXPORT_SYMBOL_GPL(devfreq_cooling_em_register);
+
 /**
  * devfreq_cooling_unregister() - Unregister devfreq cooling device.
  * @cdev: Pointer to devfreq cooling device to unregister.
+ *
+ * Unregisters devfreq cooling device and related Energy Model if it was
+ * present.
  */
 void devfreq_cooling_unregister(struct thermal_cooling_device *cdev)
 {
 	struct devfreq_cooling_device *dfc;
+	struct device *dev;
 
-	if (!cdev)
+	if (IS_ERR_OR_NULL(cdev))
 		return;
 
 	dfc = cdev->devdata;
+	dev = dfc->devfreq->dev.parent;
 
 	thermal_cooling_device_unregister(dfc->cdev);
 	ida_simple_remove(&devfreq_ida, dfc->id);
 	dev_pm_qos_remove_request(&dfc->req_max_freq);
+
+	if (dfc->em_registered)
+		em_dev_unregister_perf_domain(dev);
+
 	kfree(dfc->power_table);
 	kfree(dfc->freq_table);
 
diff --git a/include/linux/devfreq_cooling.h b/include/linux/devfreq_cooling.h
index 9df2dfca68dd..19868fb922f1 100644
--- a/include/linux/devfreq_cooling.h
+++ b/include/linux/devfreq_cooling.h
@@ -11,6 +11,7 @@
 #define __DEVFREQ_COOLING_H__
 
 #include <linux/devfreq.h>
+#include <linux/energy_model.h>
 #include <linux/thermal.h>
 
 
@@ -65,6 +66,13 @@ struct thermal_cooling_device *
 of_devfreq_cooling_register(struct device_node *np, struct devfreq *df);
 struct thermal_cooling_device *devfreq_cooling_register(struct devfreq *df);
 void devfreq_cooling_unregister(struct thermal_cooling_device *dfc);
+struct thermal_cooling_device *
+devfreq_cooling_em_register_power(struct devfreq *df,
+				  struct devfreq_cooling_power *dfc_power,
+				  struct em_data_callback *em_cb);
+struct thermal_cooling_device *
+devfreq_cooling_em_register(struct devfreq *df,
+			    struct em_data_callback *em_cb);
 
 #else /* !CONFIG_DEVFREQ_THERMAL */
 
@@ -87,6 +95,20 @@ devfreq_cooling_register(struct devfreq *df)
 	return ERR_PTR(-EINVAL);
 }
 
+static inline struct thermal_cooling_device *
+devfreq_cooling_em_register_power(struct devfreq *df,
+				  struct devfreq_cooling_power *dfc_power,
+				  struct em_data_callback *em_cb)
+{
+	return ERR_PTR(-EINVAL);
+}
+
+static inline struct thermal_cooling_device *
+devfreq_cooling_em_register(struct devfreq *df,	struct em_data_callback *em_cb)
+{
+	return ERR_PTR(-EINVAL);
+}
+
 static inline void
 devfreq_cooling_unregister(struct thermal_cooling_device *dfc)
 {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 4/5] thermal: devfreq_cooling: remove old power model and use EM
  2020-09-21 12:20 [PATCH 0/5] Thermal devfreq cooling improvements with Energy Model Lukasz Luba
                   ` (2 preceding siblings ...)
  2020-09-21 12:20 ` [PATCH 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model Lukasz Luba
@ 2020-09-21 12:20 ` Lukasz Luba
  2020-10-07 15:12   ` Ionela Voinescu
  2020-09-21 12:20 ` [PATCH 5/5] drm/panfrost: Register devfreq cooling and attempt to add Energy Model Lukasz Luba
  4 siblings, 1 reply; 20+ messages in thread
From: Lukasz Luba @ 2020-09-21 12:20 UTC (permalink / raw)
  To: linux-kernel, linux-pm, dri-devel
  Cc: rui.zhang, amit.kucheria, daniel.lezcano, lukasz.luba,
	orjan.eide, robh, alyssa.rosenzweig, steven.price, airlied,
	daniel

Remove old power model and use new Energy Model to calculate the power
budget. It drops static + dynamic power calculations and power table
in order to use Energy Model performance domain data. This model
should be easy to use and could find more users. It is also less
complicated to setup the needed structures.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 drivers/thermal/devfreq_cooling.c | 282 +++++++++---------------------
 include/linux/devfreq_cooling.h   |  17 --
 2 files changed, 86 insertions(+), 213 deletions(-)

diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
index 7e091e795284..9b66bba51d09 100644
--- a/drivers/thermal/devfreq_cooling.c
+++ b/drivers/thermal/devfreq_cooling.c
@@ -33,20 +33,17 @@ static DEFINE_IDA(devfreq_ida);
  * @cdev:	Pointer to associated thermal cooling device.
  * @devfreq:	Pointer to associated devfreq device.
  * @cooling_state:	Current cooling state.
- * @power_table:	Pointer to table with maximum power draw for each
- *			cooling state. State is the index into the table, and
- *			the power is in mW.
  * @freq_table:	Pointer to a table with the frequencies sorted in descending
  *		order.  You can index the table by cooling device state
- * @freq_table_size:	Size of the @freq_table and @power_table
- * @power_ops:	Pointer to devfreq_cooling_power, used to generate the
- *		@power_table.
+ * @max_state:	It is the last index, that is, one less than the number of the
+ *		OPPs
+ * @power_ops:	Pointer to devfreq_cooling_power, a more precised model.
  * @res_util:	Resource utilization scaling factor for the power.
  *		It is multiplied by 100 to minimize the error. It is used
  *		for estimation of the power budget instead of using
- *		'utilization' (which is	'busy_time / 'total_time').
- *		The 'res_util' range is from 100 to (power_table[state] * 100)
- *		for the corresponding 'state'.
+ *		'utilization' (which is	'busy_time' / 'total_time').
+ *		The 'res_util' range is from 100 to power * 100	for the
+ *		corresponding 'state'.
  * @capped_state:	index to cooling state with in dynamic power budget
  * @req_max_freq:	PM QoS request for limiting the maximum frequency
  *			of the devfreq device.
@@ -58,9 +55,8 @@ struct devfreq_cooling_device {
 	struct thermal_cooling_device *cdev;
 	struct devfreq *devfreq;
 	unsigned long cooling_state;
-	u32 *power_table;
 	u32 *freq_table;
-	size_t freq_table_size;
+	size_t max_state;
 	struct devfreq_cooling_power *power_ops;
 	u32 res_util;
 	int capped_state;
@@ -74,7 +70,7 @@ static int devfreq_cooling_get_max_state(struct thermal_cooling_device *cdev,
 {
 	struct devfreq_cooling_device *dfc = cdev->devdata;
 
-	*state = dfc->freq_table_size - 1;
+	*state = dfc->max_state;
 
 	return 0;
 }
@@ -102,10 +98,16 @@ static int devfreq_cooling_set_cur_state(struct thermal_cooling_device *cdev,
 
 	dev_dbg(dev, "Setting cooling state %lu\n", state);
 
-	if (state >= dfc->freq_table_size)
+	if (state > dfc->max_state)
 		return -EINVAL;
 
-	freq = dfc->freq_table[state];
+	if (dfc->em) {
+		/* Energy Model frequencies are in kHz */
+		freq = dfc->em->table[dfc->max_state - state].frequency;
+		freq *= 1000;
+	} else {
+		freq = dfc->freq_table[state];
+	}
 
 	dev_pm_qos_update_request(&dfc->req_max_freq,
 				  DIV_ROUND_UP(freq, HZ_PER_KHZ));
@@ -116,11 +118,11 @@ static int devfreq_cooling_set_cur_state(struct thermal_cooling_device *cdev,
 }
 
 /**
- * freq_get_state() - get the cooling state corresponding to a frequency
+ * freq_get_state() - get the performance index corresponding to a frequency
  * @dfc:	Pointer to devfreq cooling device
- * @freq:	frequency in Hz
+ * @freq:	frequency in kHz
  *
- * Return: the cooling state associated with the @freq, or
+ * Return: the performance index associated with the @freq, or
  * THERMAL_CSTATE_INVALID if it wasn't found.
  */
 static unsigned long
@@ -128,8 +130,8 @@ freq_get_state(struct devfreq_cooling_device *dfc, unsigned long freq)
 {
 	int i;
 
-	for (i = 0; i < dfc->freq_table_size; i++) {
-		if (dfc->freq_table[i] == freq)
+	for (i = 0; i <= dfc->max_state; i++) {
+		if (dfc->em->table[i].frequency == freq)
 			return i;
 	}
 
@@ -164,71 +166,15 @@ static unsigned long get_voltage(struct devfreq *df, unsigned long freq)
 	return voltage;
 }
 
-/**
- * get_static_power() - calculate the static power
- * @dfc:	Pointer to devfreq cooling device
- * @freq:	Frequency in Hz
- *
- * Calculate the static power in milliwatts using the supplied
- * get_static_power().  The current voltage is calculated using the
- * OPP library.  If no get_static_power() was supplied, assume the
- * static power is negligible.
- */
-static unsigned long
-get_static_power(struct devfreq_cooling_device *dfc, unsigned long freq)
+static void dfc_em_get_requested_power(struct em_perf_domain *em,
+				       struct devfreq_dev_status *status,
+				       u32 *power, int em_perf_idx)
 {
-	struct devfreq *df = dfc->devfreq;
-	unsigned long voltage;
+	*power = em->table[em_perf_idx].power;
 
-	if (!dfc->power_ops->get_static_power)
-		return 0;
-
-	voltage = get_voltage(df, freq);
-
-	if (voltage == 0)
-		return 0;
-
-	return dfc->power_ops->get_static_power(df, voltage);
-}
-
-/**
- * get_dynamic_power - calculate the dynamic power
- * @dfc:	Pointer to devfreq cooling device
- * @freq:	Frequency in Hz
- * @voltage:	Voltage in millivolts
- *
- * Calculate the dynamic power in milliwatts consumed by the device at
- * frequency @freq and voltage @voltage.  If the get_dynamic_power()
- * was supplied as part of the devfreq_cooling_power struct, then that
- * function is used.  Otherwise, a simple power model (Pdyn = Coeff *
- * Voltage^2 * Frequency) is used.
- */
-static unsigned long
-get_dynamic_power(struct devfreq_cooling_device *dfc, unsigned long freq,
-		  unsigned long voltage)
-{
-	u64 power;
-	u32 freq_mhz;
-	struct devfreq_cooling_power *dfc_power = dfc->power_ops;
-
-	if (dfc_power->get_dynamic_power)
-		return dfc_power->get_dynamic_power(dfc->devfreq, freq,
-						    voltage);
-
-	freq_mhz = freq / 1000000;
-	power = (u64)dfc_power->dyn_power_coeff * freq_mhz * voltage * voltage;
-	do_div(power, 1000000000);
-
-	return power;
-}
-
-
-static inline unsigned long get_total_power(struct devfreq_cooling_device *dfc,
-					    unsigned long freq,
-					    unsigned long voltage)
-{
-	return get_static_power(dfc, freq) + get_dynamic_power(dfc, freq,
-							       voltage);
+	/* Scale power for utilization */
+	*power *= status->busy_time;
+	*power /= status->total_time;
 }
 
 static void _normalize_load(struct devfreq_dev_status *status)
@@ -260,8 +206,6 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
 	unsigned long state;
 	unsigned long freq;
 	unsigned long voltage;
-	u32 dyn_power = 0;
-	u32 static_power = 0;
 	int res;
 
 	mutex_lock(&df->lock);
@@ -272,13 +216,14 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
 
 	freq = status.current_frequency;
 
-	state = freq_get_state(dfc, freq);
+	/* Energy Model frequencies are in kHz */
+	state = freq_get_state(dfc, freq / 1000);
 	if (state == THERMAL_CSTATE_INVALID) {
 		res = -EAGAIN;
 		goto fail;
 	}
 
-	if (dfc->power_ops->get_real_power) {
+	if (dfc->power_ops && dfc->power_ops->get_real_power) {
 		voltage = get_voltage(df, freq);
 		if (voltage == 0) {
 			res = -EINVAL;
@@ -288,7 +233,7 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
 		res = dfc->power_ops->get_real_power(df, power, freq, voltage);
 		if (!res) {
 			state = dfc->capped_state;
-			dfc->res_util = dfc->power_table[state];
+			dfc->res_util = dfc->em->table[state].power;
 			dfc->res_util *= SCALE_ERROR_MITIGATION;
 
 			if (*power > 1)
@@ -297,17 +242,8 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
 			goto fail;
 		}
 	} else {
-		dyn_power = dfc->power_table[state];
-
 		_normalize_load(&status);
-
-		/* Scale dynamic power for utilization */
-		dyn_power *= status.busy_time;
-		dyn_power /= status.total_time;
-		/* Get static power */
-		static_power = get_static_power(dfc, freq);
-
-		*power = dyn_power + static_power;
+		dfc_em_get_requested_power(dfc->em, &status, power, state);
 	}
 
 	trace_thermal_power_devfreq_get_power(cdev, &status, freq, *power);
@@ -325,16 +261,14 @@ static int devfreq_cooling_state2power(struct thermal_cooling_device *cdev,
 				       u32 *power)
 {
 	struct devfreq_cooling_device *dfc = cdev->devdata;
-	unsigned long freq;
-	u32 static_power;
+	int idx;
 
-	if (state >= dfc->freq_table_size)
+	if (state > dfc->max_state)
 		return -EINVAL;
 
-	freq = dfc->freq_table[state];
-	static_power = get_static_power(dfc, freq);
+	idx = dfc->max_state - state;
+	*power = dfc->em->table[idx].power;
 
-	*power = dfc->power_table[state] + static_power;
 	return 0;
 }
 
@@ -345,11 +279,8 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
 	struct devfreq_cooling_device *dfc = cdev->devdata;
 	struct devfreq *df = dfc->devfreq;
 	struct devfreq_dev_status status;
-	unsigned long busy_time;
+	u32 est_power = power;
 	unsigned long freq;
-	s32 dyn_power;
-	u32 static_power;
-	s32 est_power;
 	int i;
 
 	mutex_lock(&df->lock);
@@ -358,31 +289,26 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
 
 	freq = status.current_frequency;
 
-	if (dfc->power_ops->get_real_power) {
+	if (dfc->power_ops && dfc->power_ops->get_real_power) {
 		/* Scale for resource utilization */
 		est_power = power * dfc->res_util;
 		est_power /= SCALE_ERROR_MITIGATION;
 	} else {
-		static_power = get_static_power(dfc, freq);
-
-		dyn_power = power - static_power;
-		dyn_power = dyn_power > 0 ? dyn_power : 0;
-
-		/* Scale dynamic power for utilization */
-		busy_time = status.busy_time ?: 1;
-		est_power = (dyn_power * status.total_time) / busy_time;
+		_normalize_load(&status);
+		est_power *= status.total_time;
+		est_power /= status.busy_time;
 	}
 
 	/*
 	 * Find the first cooling state that is within the power
-	 * budget for dynamic power.
+	 * budget. The EM power table is sorted ascending.
 	 */
-	for (i = 0; i < dfc->freq_table_size - 1; i++)
-		if (est_power >= dfc->power_table[i])
+	for (i = dfc->max_state; i > 0; i--)
+		if (est_power >= dfc->em->table[i].power)
 			break;
 
-	*state = i;
-	dfc->capped_state = i;
+	*state = dfc->max_state - i;
+	dfc->capped_state = *state;
 	trace_thermal_power_devfreq_limit(cdev, freq, *state, power);
 	return 0;
 }
@@ -394,91 +320,43 @@ static struct thermal_cooling_device_ops devfreq_cooling_ops = {
 };
 
 /**
- * devfreq_cooling_gen_tables() - Generate power and freq tables.
- * @dfc: Pointer to devfreq cooling device.
- *
- * Generate power and frequency tables: the power table hold the
- * device's maximum power usage at each cooling state (OPP).  The
- * static and dynamic power using the appropriate voltage and
- * frequency for the state, is acquired from the struct
- * devfreq_cooling_power, and summed to make the maximum power draw.
- *
- * The frequency table holds the frequencies in descending order.
- * That way its indexed by cooling device state.
+ * devfreq_cooling_gen_tables() - Generate frequency table.
+ * @dfc:	Pointer to devfreq cooling device.
+ * @num_opps:	Number of OPPs
  *
- * The tables are malloced, and pointers put in dfc.  They must be
- * freed when unregistering the devfreq cooling device.
+ * Generate frequency table which holds the frequencies in descending
+ * order. That way its indexed by cooling device state. This is for
+ * compatibility with drivers which do not register Energy Model.
  *
  * Return: 0 on success, negative error code on failure.
  */
-static int devfreq_cooling_gen_tables(struct devfreq_cooling_device *dfc)
+static int devfreq_cooling_gen_tables(struct devfreq_cooling_device *dfc,
+				      int num_opps)
 {
 	struct devfreq *df = dfc->devfreq;
 	struct device *dev = df->dev.parent;
-	int ret, num_opps;
 	unsigned long freq;
-	u32 *power_table = NULL;
-	u32 *freq_table;
 	int i;
 
-	num_opps = dev_pm_opp_get_opp_count(dev);
-
-	if (dfc->power_ops) {
-		power_table = kcalloc(num_opps, sizeof(*power_table),
-				      GFP_KERNEL);
-		if (!power_table)
-			return -ENOMEM;
-	}
-
-	freq_table = kcalloc(num_opps, sizeof(*freq_table),
+	dfc->freq_table = kcalloc(num_opps, sizeof(*dfc->freq_table),
 			     GFP_KERNEL);
-	if (!freq_table) {
-		ret = -ENOMEM;
-		goto free_power_table;
-	}
+	if (!dfc->freq_table)
+		return -ENOMEM;
 
 	for (i = 0, freq = ULONG_MAX; i < num_opps; i++, freq--) {
-		unsigned long power, voltage;
 		struct dev_pm_opp *opp;
 
 		opp = dev_pm_opp_find_freq_floor(dev, &freq);
 		if (IS_ERR(opp)) {
-			ret = PTR_ERR(opp);
-			goto free_tables;
+			kfree(dfc->freq_table);
+			return PTR_ERR(opp);
 		}
 
-		voltage = dev_pm_opp_get_voltage(opp) / 1000; /* mV */
 		dev_pm_opp_put(opp);
-
-		if (dfc->power_ops) {
-			if (dfc->power_ops->get_real_power)
-				power = get_total_power(dfc, freq, voltage);
-			else
-				power = get_dynamic_power(dfc, freq, voltage);
-
-			dev_dbg(dev, "Power table: %lu MHz @ %lu mV: %lu = %lu mW\n",
-				freq / 1000000, voltage, power, power);
-
-			power_table[i] = power;
-		}
-
-		freq_table[i] = freq;
+		dfc->freq_table[i] = freq;
 	}
 
-	if (dfc->power_ops)
-		dfc->power_table = power_table;
-
-	dfc->freq_table = freq_table;
-	dfc->freq_table_size = num_opps;
-
 	return 0;
-
-free_tables:
-	kfree(freq_table);
-free_power_table:
-	kfree(power_table);
-
-	return ret;
 }
 
 /**
@@ -503,7 +381,7 @@ of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df,
 	struct thermal_cooling_device *cdev;
 	struct devfreq_cooling_device *dfc;
 	char dev_name[THERMAL_NAME_LENGTH];
-	int err;
+	int err, num_opps;
 
 	dfc = kzalloc(sizeof(*dfc), GFP_KERNEL);
 	if (!dfc)
@@ -511,28 +389,45 @@ of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df,
 
 	dfc->devfreq = df;
 
-	if (dfc_power) {
-		dfc->power_ops = dfc_power;
-
+	dfc->em = em_pd_get(df->dev.parent);
+	if (dfc->em) {
 		devfreq_cooling_ops.get_requested_power =
 			devfreq_cooling_get_requested_power;
 		devfreq_cooling_ops.state2power = devfreq_cooling_state2power;
 		devfreq_cooling_ops.power2state = devfreq_cooling_power2state;
+
+		dfc->power_ops = dfc_power;
+
+		num_opps = em_pd_nr_perf_states(dfc->em);
+	} else {
+		/* Backward compatibility for drivers which do not use IPA */
+		dev_dbg(df->dev.parent, "missing EM for cooling device\n");
+
+		num_opps = dev_pm_opp_get_opp_count(df->dev.parent);
+
+		err = devfreq_cooling_gen_tables(dfc, num_opps);
+		if (err)
+			goto free_dfc;
 	}
 
-	err = devfreq_cooling_gen_tables(dfc);
-	if (err)
+	if (num_opps <= 0) {
+		err = -EINVAL;
 		goto free_dfc;
+	}
+
+	/* max_state is an index, not a counter */
+	dfc->max_state = num_opps - 1;
 
 	err = dev_pm_qos_add_request(df->dev.parent, &dfc->req_max_freq,
 				     DEV_PM_QOS_MAX_FREQUENCY,
 				     PM_QOS_MAX_FREQUENCY_DEFAULT_VALUE);
 	if (err < 0)
-		goto free_tables;
+		goto free_table;
 
 	err = ida_simple_get(&devfreq_ida, 0, 0, GFP_KERNEL);
 	if (err < 0)
 		goto remove_qos_req;
+
 	dfc->id = err;
 
 	snprintf(dev_name, sizeof(dev_name), "thermal-devfreq-%d", dfc->id);
@@ -553,12 +448,9 @@ of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df,
 
 release_ida:
 	ida_simple_remove(&devfreq_ida, dfc->id);
-
 remove_qos_req:
 	dev_pm_qos_remove_request(&dfc->req_max_freq);
-
-free_tables:
-	kfree(dfc->power_table);
+free_table:
 	kfree(dfc->freq_table);
 free_dfc:
 	kfree(dfc);
@@ -699,9 +591,7 @@ void devfreq_cooling_unregister(struct thermal_cooling_device *cdev)
 	if (dfc->em_registered)
 		em_dev_unregister_perf_domain(dev);
 
-	kfree(dfc->power_table);
 	kfree(dfc->freq_table);
-
 	kfree(dfc);
 }
 EXPORT_SYMBOL_GPL(devfreq_cooling_unregister);
diff --git a/include/linux/devfreq_cooling.h b/include/linux/devfreq_cooling.h
index 19868fb922f1..4890b12b54b4 100644
--- a/include/linux/devfreq_cooling.h
+++ b/include/linux/devfreq_cooling.h
@@ -17,17 +17,6 @@
 
 /**
  * struct devfreq_cooling_power - Devfreq cooling power ops
- * @get_static_power:	Take voltage, in mV, and return the static power
- *			in mW.  If NULL, the static power is assumed
- *			to be 0.
- * @get_dynamic_power:	Take voltage, in mV, and frequency, in HZ, and
- *			return the dynamic power draw in mW.  If NULL,
- *			a simple power model is used.
- * @dyn_power_coeff:	Coefficient for the simple dynamic power model in
- *			mW/(MHz mV mV).
- *			If get_dynamic_power() is NULL, then the
- *			dynamic power is calculated as
- *			@dyn_power_coeff * frequency * voltage^2
  * @get_real_power:	When this is set, the framework uses it to ask the
  *			device driver for the actual power.
  *			Some devices have more sophisticated methods
@@ -47,14 +36,8 @@
  *			max total (static + dynamic) power value for each OPP.
  */
 struct devfreq_cooling_power {
-	unsigned long (*get_static_power)(struct devfreq *devfreq,
-					  unsigned long voltage);
-	unsigned long (*get_dynamic_power)(struct devfreq *devfreq,
-					   unsigned long freq,
-					   unsigned long voltage);
 	int (*get_real_power)(struct devfreq *df, u32 *power,
 			      unsigned long freq, unsigned long voltage);
-	unsigned long dyn_power_coeff;
 };
 
 #ifdef CONFIG_DEVFREQ_THERMAL
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 5/5] drm/panfrost: Register devfreq cooling and attempt to add Energy Model
  2020-09-21 12:20 [PATCH 0/5] Thermal devfreq cooling improvements with Energy Model Lukasz Luba
                   ` (3 preceding siblings ...)
  2020-09-21 12:20 ` [PATCH 4/5] thermal: devfreq_cooling: remove old power model and use EM Lukasz Luba
@ 2020-09-21 12:20 ` Lukasz Luba
  4 siblings, 0 replies; 20+ messages in thread
From: Lukasz Luba @ 2020-09-21 12:20 UTC (permalink / raw)
  To: linux-kernel, linux-pm, dri-devel
  Cc: rui.zhang, amit.kucheria, daniel.lezcano, lukasz.luba,
	orjan.eide, robh, alyssa.rosenzweig, steven.price, airlied,
	daniel

Register devfreq cooling device and attempt to register Energy Model. This
will add the devfreq device to the Energy Model framework. It will create
a dedicated and unified data structures used i.e. in thermal framework.
The last NULL parameter indicates that the power model is simplified and
created based on DT 'dynamic-power-coefficient', voltage and frequency.

Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 drivers/gpu/drm/panfrost/panfrost_devfreq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_devfreq.c b/drivers/gpu/drm/panfrost/panfrost_devfreq.c
index 8ab025d0035f..0d97176c9fb5 100644
--- a/drivers/gpu/drm/panfrost/panfrost_devfreq.c
+++ b/drivers/gpu/drm/panfrost/panfrost_devfreq.c
@@ -143,7 +143,7 @@ int panfrost_devfreq_init(struct panfrost_device *pfdev)
 	}
 	pfdevfreq->devfreq = devfreq;
 
-	cooling = of_devfreq_cooling_register(dev->of_node, devfreq);
+	cooling = devfreq_cooling_em_register(devfreq, NULL);
 	if (IS_ERR(cooling))
 		DRM_DEV_INFO(dev, "Failed to register cooling device\n");
 	else
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model
  2020-09-21 12:20 ` [PATCH 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model Lukasz Luba
@ 2020-10-07 12:07   ` Ionela Voinescu
  2020-10-22 11:17     ` Lukasz Luba
  0 siblings, 1 reply; 20+ messages in thread
From: Ionela Voinescu @ 2020-10-07 12:07 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, dri-devel, amit.kucheria, airlied,
	daniel.lezcano, steven.price, alyssa.rosenzweig, rui.zhang,
	orjan.eide

Hi Lukasz,

On Monday 21 Sep 2020 at 13:20:05 (+0100), Lukasz Luba wrote:
> The Energy Model (EM) framework supports devices such as Devfreq. Create
> new registration functions which automatically register EM for the thermal
> devfreq_cooling devices. This patch prepares the code for coming changes
> which are going to replace old power model with the new EM.
> 
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  drivers/thermal/devfreq_cooling.c | 99 ++++++++++++++++++++++++++++++-
>  include/linux/devfreq_cooling.h   | 22 +++++++
>  2 files changed, 120 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
> index cf045bd4d16b..7e091e795284 100644
> --- a/drivers/thermal/devfreq_cooling.c
> +++ b/drivers/thermal/devfreq_cooling.c
> @@ -50,6 +50,8 @@ static DEFINE_IDA(devfreq_ida);
>   * @capped_state:	index to cooling state with in dynamic power budget
>   * @req_max_freq:	PM QoS request for limiting the maximum frequency
>   *			of the devfreq device.
> + * @em:		Energy Model which represents the associated Devfreq device
                                     ^^^^^^^^^^^^^^^^
				     for
> + * @em_registered:	Devfreq cooling registered the EM and should free it.
>   */
>  struct devfreq_cooling_device {
>  	int id;
> @@ -63,6 +65,8 @@ struct devfreq_cooling_device {
>  	u32 res_util;
>  	int capped_state;
>  	struct dev_pm_qos_request req_max_freq;
> +	struct em_perf_domain *em;
> +	bool em_registered;
>  };
>  
>  static int devfreq_cooling_get_max_state(struct thermal_cooling_device *cdev,
> @@ -586,22 +590,115 @@ struct thermal_cooling_device *devfreq_cooling_register(struct devfreq *df)
>  }
>  EXPORT_SYMBOL_GPL(devfreq_cooling_register);
>  
> +/**
> + * devfreq_cooling_em_register_power() - Register devfreq cooling device with
> + *		power information and attempt to register Energy Model (EM)

It took me a while to understand the differences between devfreq
register functions and it left me with a nagging feeling that we don't
need all of them. Also, looking over the cpufreq cooling devices, they
keep their registering interfaces quite simple.

With the functions added by this patch, the devfreq cooling devices will have:
 - old:
       of_devfreq_cooling_register_power
       of_devfreq_cooling_register
       devfreq_cooling_register
       devfreq_cooling_unregister
 - new:
       devfreq_cooling_em_register_power
       devfreq_cooling_em_register

My question is whether we actually need the two new
devfreq_cooling_em_register_power() and devfreq_cooling_em_register()?

The power_ops and the em are dependent on one another, so could we
extend the of_devfreq_cooling_register_power() to do the additional em
registration. We only need a way to pass the em_cb and I think that
could fit nicely in devfreq_cooling_power.

To be noted that I've reviewed these interfaces in the context of the
final state of devfreq_cooling.c, after the changes in 4/5.

> + * @df:		Pointer to devfreq device.
> + * @dfc_power:	Pointer to devfreq_cooling_power.
> + * @em_cb:	Callback functions providing the data of the EM
> + *
> + * Register a devfreq cooling device and attempt to register Energy Model. The
> + * available OPPs must be registered for the device.
> + *
> + * If @dfc_power is provided, the cooling device is registered with the
> + * power extensions. If @em_cb is provided it will be called for each OPP to
> + * calculate power value and cost. If @em_cb is not provided then simple Energy
> + * Model is going to be used, which requires "dynamic-power-coefficient" a
> + * devicetree property.
> + */
> +struct thermal_cooling_device *
> +devfreq_cooling_em_register_power(struct devfreq *df,
> +				  struct devfreq_cooling_power *dfc_power,
> +				  struct em_data_callback *em_cb)
> +{
> +	struct thermal_cooling_device *cdev;
> +	struct devfreq_cooling_device *dfc;
> +	struct device_node *np = NULL;
> +	struct device *dev;
> +	int nr_opp, ret;
> +
> +	if (IS_ERR_OR_NULL(df))
> +		return ERR_PTR(-EINVAL);
> +
> +	dev = df->dev.parent;
> +
> +	if (em_cb) {
> +		nr_opp = dev_pm_opp_get_opp_count(dev);
> +		if (nr_opp <= 0) {
> +			dev_err(dev, "No valid OPPs found\n");
> +			return ERR_PTR(-EINVAL);
> +		}
> +
> +		ret = em_dev_register_perf_domain(dev, nr_opp, em_cb, NULL);
> +	} else {
> +		ret = dev_pm_opp_of_register_em(dev, NULL);
> +	}
> +
> +	if (ret)
> +		dev_warn(dev, "Unable to register EM for devfreq cooling device (%d)\n",
> +			 ret);
> +
> +	if (dev->of_node)
> +		np = of_node_get(dev->of_node);
> +
> +	cdev = of_devfreq_cooling_register_power(np, df, dfc_power);
> +
> +	if (np)
> +		of_node_put(np);
> +
> +	if (IS_ERR_OR_NULL(cdev)) {
> +		if (!ret)
> +			em_dev_unregister_perf_domain(dev);
> +	} else {
> +		dfc = cdev->devdata;
> +		dfc->em_registered = !ret;
> +	}
> +
> +	return cdev;
> +}
> +EXPORT_SYMBOL_GPL(devfreq_cooling_em_register_power);
> +
> +/**
> + * devfreq_cooling_em_register() - Register devfreq cooling device together
> + *				with Energy Model.
> + * @df:		Pointer to devfreq device.
> + * @em_cb:	Callback functions providing the data of the Energy Model
> + *
> + * This function attempts to register Energy Model for devfreq device and then
> + * register the devfreq cooling device.
> + */
> +struct thermal_cooling_device *
> +devfreq_cooling_em_register(struct devfreq *df, struct em_data_callback *em_cb)
> +{
> +	return devfreq_cooling_em_register_power(df, NULL, em_cb);
> +}
> +EXPORT_SYMBOL_GPL(devfreq_cooling_em_register);
> +
>  /**
>   * devfreq_cooling_unregister() - Unregister devfreq cooling device.
>   * @cdev: Pointer to devfreq cooling device to unregister.
> + *
> + * Unregisters devfreq cooling device and related Energy Model if it was
> + * present.
>   */
>  void devfreq_cooling_unregister(struct thermal_cooling_device *cdev)
>  {
>  	struct devfreq_cooling_device *dfc;
> +	struct device *dev;
>  
> -	if (!cdev)
> +	if (IS_ERR_OR_NULL(cdev))
>  		return;
>  
>  	dfc = cdev->devdata;
> +	dev = dfc->devfreq->dev.parent;
>  
>  	thermal_cooling_device_unregister(dfc->cdev);
>  	ida_simple_remove(&devfreq_ida, dfc->id);
>  	dev_pm_qos_remove_request(&dfc->req_max_freq);
> +
> +	if (dfc->em_registered)
> +		em_dev_unregister_perf_domain(dev);

Nit: Isn't it enough to check if dev->em_pd != NULL to be able to
unregister the perf_domain? That would remove the need for
dfc->em_registered.

I suppose one could say that's using implementation details on how the
EM is built and stored and we should not rely on it, so it's up to you
if you want to change it.

Kind regards,
Ionela.

> +
>  	kfree(dfc->power_table);
>  	kfree(dfc->freq_table);
>  
> diff --git a/include/linux/devfreq_cooling.h b/include/linux/devfreq_cooling.h
> index 9df2dfca68dd..19868fb922f1 100644
> --- a/include/linux/devfreq_cooling.h
> +++ b/include/linux/devfreq_cooling.h
> @@ -11,6 +11,7 @@
>  #define __DEVFREQ_COOLING_H__
>  
>  #include <linux/devfreq.h>
> +#include <linux/energy_model.h>
>  #include <linux/thermal.h>
>  
>  
> @@ -65,6 +66,13 @@ struct thermal_cooling_device *
>  of_devfreq_cooling_register(struct device_node *np, struct devfreq *df);
>  struct thermal_cooling_device *devfreq_cooling_register(struct devfreq *df);
>  void devfreq_cooling_unregister(struct thermal_cooling_device *dfc);
> +struct thermal_cooling_device *
> +devfreq_cooling_em_register_power(struct devfreq *df,
> +				  struct devfreq_cooling_power *dfc_power,
> +				  struct em_data_callback *em_cb);
> +struct thermal_cooling_device *
> +devfreq_cooling_em_register(struct devfreq *df,
> +			    struct em_data_callback *em_cb);
>  
>  #else /* !CONFIG_DEVFREQ_THERMAL */
>  
> @@ -87,6 +95,20 @@ devfreq_cooling_register(struct devfreq *df)
>  	return ERR_PTR(-EINVAL);
>  }
>  
> +static inline struct thermal_cooling_device *
> +devfreq_cooling_em_register_power(struct devfreq *df,
> +				  struct devfreq_cooling_power *dfc_power,
> +				  struct em_data_callback *em_cb)
> +{
> +	return ERR_PTR(-EINVAL);
> +}
> +
> +static inline struct thermal_cooling_device *
> +devfreq_cooling_em_register(struct devfreq *df,	struct em_data_callback *em_cb)
> +{
> +	return ERR_PTR(-EINVAL);
> +}
> +
>  static inline void
>  devfreq_cooling_unregister(struct thermal_cooling_device *dfc)
>  {
> -- 
> 2.17.1
> 
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/5] thermal: devfreq_cooling: remove old power model and use EM
  2020-09-21 12:20 ` [PATCH 4/5] thermal: devfreq_cooling: remove old power model and use EM Lukasz Luba
@ 2020-10-07 15:12   ` Ionela Voinescu
  2020-10-22 11:26     ` Lukasz Luba
  0 siblings, 1 reply; 20+ messages in thread
From: Ionela Voinescu @ 2020-10-07 15:12 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, dri-devel, amit.kucheria, airlied,
	daniel.lezcano, steven.price, alyssa.rosenzweig, rui.zhang,
	orjan.eide

Hi Lukasz,

On Monday 21 Sep 2020 at 13:20:06 (+0100), Lukasz Luba wrote:
[..]
>  /**
> - * freq_get_state() - get the cooling state corresponding to a frequency
> + * freq_get_state() - get the performance index corresponding to a frequency

If we change the meaning of the return value, I think the function needs
a name change as well.

Also, we do treat this as a cooling state when we do validation and
compare it to THERMAL_CSTATE_INVALID,  but it's not actually a cooling
state (it's max_state - state). It does create confusion if we name
"state" both a performance index and a cooling state.

Given that the only user is devfreq_cooling_get_requested_power(),
might be good to collapse freq_get_state() in that function and rename
the "state" variable in there to "em_perf_idx".

>   * @dfc:	Pointer to devfreq cooling device
> - * @freq:	frequency in Hz
> + * @freq:	frequency in kHz
>   *
> - * Return: the cooling state associated with the @freq, or
> + * Return: the performance index associated with the @freq, or
>   * THERMAL_CSTATE_INVALID if it wasn't found.
>   */
>  static unsigned long
> @@ -128,8 +130,8 @@ freq_get_state(struct devfreq_cooling_device *dfc, unsigned long freq)
>  {
>  	int i;
>  
> -	for (i = 0; i < dfc->freq_table_size; i++) {
> -		if (dfc->freq_table[i] == freq)
> +	for (i = 0; i <= dfc->max_state; i++) {
> +		if (dfc->em->table[i].frequency == freq)
>  			return i;
>  	}
>  
> @@ -164,71 +166,15 @@ static unsigned long get_voltage(struct devfreq *df, unsigned long freq)
>  	return voltage;
>  }
>  
> -/**
> - * get_static_power() - calculate the static power
> - * @dfc:	Pointer to devfreq cooling device
> - * @freq:	Frequency in Hz
> - *
> - * Calculate the static power in milliwatts using the supplied
> - * get_static_power().  The current voltage is calculated using the
> - * OPP library.  If no get_static_power() was supplied, assume the
> - * static power is negligible.
> - */
> -static unsigned long
> -get_static_power(struct devfreq_cooling_device *dfc, unsigned long freq)
> +static void dfc_em_get_requested_power(struct em_perf_domain *em,
> +				       struct devfreq_dev_status *status,
> +				       u32 *power, int em_perf_idx)

Is there a reason for not directly returning the power value in this
function? Also, this only does a few arithmetic operations and it's only
called in one place. Is it worth to have this in a separate function?

[..]
> @@ -345,11 +279,8 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
>  	struct devfreq_cooling_device *dfc = cdev->devdata;
>  	struct devfreq *df = dfc->devfreq;
>  	struct devfreq_dev_status status;
> -	unsigned long busy_time;
> +	u32 est_power = power;

Nit: You could use power directly and remove est_power as well.

>  	unsigned long freq;
> -	s32 dyn_power;
> -	u32 static_power;
> -	s32 est_power;
>  	int i;
>  
>  	mutex_lock(&df->lock);
> @@ -358,31 +289,26 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
>  
>  	freq = status.current_frequency;
>  
> -	if (dfc->power_ops->get_real_power) {
> +	if (dfc->power_ops && dfc->power_ops->get_real_power) {
>  		/* Scale for resource utilization */
>  		est_power = power * dfc->res_util;
>  		est_power /= SCALE_ERROR_MITIGATION;
>  	} else {
> -		static_power = get_static_power(dfc, freq);
> -
> -		dyn_power = power - static_power;
> -		dyn_power = dyn_power > 0 ? dyn_power : 0;
> -
> -		/* Scale dynamic power for utilization */
> -		busy_time = status.busy_time ?: 1;
> -		est_power = (dyn_power * status.total_time) / busy_time;
> +		_normalize_load(&status);
> +		est_power *= status.total_time;
> +		est_power /= status.busy_time;
>  	}
>  
>  	/*
>  	 * Find the first cooling state that is within the power
> -	 * budget for dynamic power.
> +	 * budget. The EM power table is sorted ascending.
>  	 */
> -	for (i = 0; i < dfc->freq_table_size - 1; i++)
> -		if (est_power >= dfc->power_table[i])
> +	for (i = dfc->max_state; i > 0; i--)
> +		if (est_power >= dfc->em->table[i].power)
>  			break;
>  
> -	*state = i;
> -	dfc->capped_state = i;
> +	*state = dfc->max_state - i;
> +	dfc->capped_state = *state;
>  	trace_thermal_power_devfreq_limit(cdev, freq, *state, power);
>  	return 0;
>  }
[..]
>  /**
> @@ -503,7 +381,7 @@ of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df,
>  	struct thermal_cooling_device *cdev;
>  	struct devfreq_cooling_device *dfc;
>  	char dev_name[THERMAL_NAME_LENGTH];
> -	int err;
> +	int err, num_opps;
>  
>  	dfc = kzalloc(sizeof(*dfc), GFP_KERNEL);
>  	if (!dfc)
> @@ -511,28 +389,45 @@ of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df,
>  
>  	dfc->devfreq = df;
>  
> -	if (dfc_power) {
> -		dfc->power_ops = dfc_power;
> -
> +	dfc->em = em_pd_get(df->dev.parent);
> +	if (dfc->em) {
>  		devfreq_cooling_ops.get_requested_power =
>  			devfreq_cooling_get_requested_power;
>  		devfreq_cooling_ops.state2power = devfreq_cooling_state2power;
>  		devfreq_cooling_ops.power2state = devfreq_cooling_power2state;
> +
> +		dfc->power_ops = dfc_power;
> +
> +		num_opps = em_pd_nr_perf_states(dfc->em);
> +	} else {
> +		/* Backward compatibility for drivers which do not use IPA */
> +		dev_dbg(df->dev.parent, "missing EM for cooling device\n");
> +
> +		num_opps = dev_pm_opp_get_opp_count(df->dev.parent);
> +
> +		err = devfreq_cooling_gen_tables(dfc, num_opps);
> +		if (err)
> +			goto free_dfc;
>  	}
>  
> -	err = devfreq_cooling_gen_tables(dfc);
> -	if (err)
> +	if (num_opps <= 0) {
> +		err = -EINVAL;
>  		goto free_dfc;
> +	}
> +
> +	/* max_state is an index, not a counter */

Nit: Might be more clear to replace "index" with cooling state. Then
knowledge about cooling states would make this more clear.

Regards,
Ionela.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/5] thermal: devfreq_cooling: get a copy of device status
  2020-09-21 12:20 ` [PATCH 2/5] thermal: devfreq_cooling: get a copy of device status Lukasz Luba
@ 2020-10-07 16:11   ` Ionela Voinescu
  2020-10-22 10:55     ` Lukasz Luba
  2020-10-14 14:34   ` Daniel Lezcano
  1 sibling, 1 reply; 20+ messages in thread
From: Ionela Voinescu @ 2020-10-07 16:11 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, dri-devel, amit.kucheria, airlied,
	daniel.lezcano, steven.price, alyssa.rosenzweig, rui.zhang,
	orjan.eide

On Monday 21 Sep 2020 at 13:20:04 (+0100), Lukasz Luba wrote:
> Devfreq cooling needs to now the correct status of the device in order
> to operate. Do not rely on Devfreq last_status which might be a stale data
> and get more up-to-date values of the load.
> 
> Devfreq framework can change the device status in the background. To
> mitigate this situation make a copy of the status structure and use it
> for internal calculations.
> 
> In addition this patch adds normalization function, which also makes sure
> that whatever data comes from the device, it is in a sane range.
> 
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  drivers/thermal/devfreq_cooling.c | 52 +++++++++++++++++++++++++------
>  1 file changed, 43 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
> index 7063ccb7b86d..cf045bd4d16b 100644
> --- a/drivers/thermal/devfreq_cooling.c
> +++ b/drivers/thermal/devfreq_cooling.c
> @@ -227,6 +227,24 @@ static inline unsigned long get_total_power(struct devfreq_cooling_device *dfc,
>  							       voltage);
>  }
>  
> +static void _normalize_load(struct devfreq_dev_status *status)

Is there a reason for the leading "_" ?
AFAIK, "__name()" is meant to suggest a "worker" function for another
"name()" function, but that would not apply here.

> +{
> +	/* Make some space if needed */
> +	if (status->busy_time > 0xffff) {
> +		status->busy_time >>= 10;
> +		status->total_time >>= 10;
> +	}

How about removing the above code and adding here:

status->busy_time = status->busy_time ? : 1;

> +
> +	if (status->busy_time > status->total_time)

This check would then cover the possibility that total_time is 0.

> +		status->busy_time = status->total_time;

But a reversal is needed here:
		status->total_time = status->busy_time;

> +
> +	status->busy_time *= 100;
> +	status->busy_time /= status->total_time ? : 1;
> +
> +	/* Avoid division by 0 */
> +	status->busy_time = status->busy_time ? : 1;
> +	status->total_time = 100;

Then all of this code can be replaced by:

status->busy_time = (unsigned long)div64_u64((u64)status->busy_time << 10,
					     status->total_time);
status->total_time = 1 << 10;

This way you gain some resolution to busy_time and the divisions in the
callers would just become shifts by 10.

Hope it helps,
Ionela.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/5] thermal: devfreq_cooling: get a copy of device status
  2020-09-21 12:20 ` [PATCH 2/5] thermal: devfreq_cooling: get a copy of device status Lukasz Luba
  2020-10-07 16:11   ` Ionela Voinescu
@ 2020-10-14 14:34   ` Daniel Lezcano
  2020-10-22 11:45     ` Lukasz Luba
  1 sibling, 1 reply; 20+ messages in thread
From: Daniel Lezcano @ 2020-10-14 14:34 UTC (permalink / raw)
  To: Lukasz Luba, linux-kernel, linux-pm, dri-devel
  Cc: rui.zhang, amit.kucheria, orjan.eide, robh, alyssa.rosenzweig,
	steven.price, airlied, daniel

On 21/09/2020 14:20, Lukasz Luba wrote:
> Devfreq cooling needs to now the correct status of the device in order
> to operate. Do not rely on Devfreq last_status which might be a stale data
> and get more up-to-date values of the load.
> 
> Devfreq framework can change the device status in the background. To
> mitigate this situation make a copy of the status structure and use it
> for internal calculations.
> 
> In addition this patch adds normalization function, which also makes sure
> that whatever data comes from the device, it is in a sane range.
> 
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  drivers/thermal/devfreq_cooling.c | 52 +++++++++++++++++++++++++------
>  1 file changed, 43 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
> index 7063ccb7b86d..cf045bd4d16b 100644
> --- a/drivers/thermal/devfreq_cooling.c
> +++ b/drivers/thermal/devfreq_cooling.c
> @@ -227,6 +227,24 @@ static inline unsigned long get_total_power(struct devfreq_cooling_device *dfc,
>  							       voltage);
>  }
>  
> +static void _normalize_load(struct devfreq_dev_status *status)
> +{
> +	/* Make some space if needed */
> +	if (status->busy_time > 0xffff) {
> +		status->busy_time >>= 10;
> +		status->total_time >>= 10;
> +	}
> +
> +	if (status->busy_time > status->total_time)
> +		status->busy_time = status->total_time;
> +
> +	status->busy_time *= 100;
> +	status->busy_time /= status->total_time ? : 1;
> +
> +	/* Avoid division by 0 */
> +	status->busy_time = status->busy_time ? : 1;
> +	status->total_time = 100;
> +}

Not sure that works if the devfreq governor is not on-demand.

Is it possible to use the energy model directly in devfreq to compute
the energy consumption given the state transitions since the last reading ?

The power will be read directly from devfreq which will return:

nrj + (current_power * (jiffies - last_update)) / (jiffies - last_update)

The devfreq cooling device driver would result in a much simpler code, no?

>  static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cdev,
>  					       struct thermal_zone_device *tz,
> @@ -234,14 +252,22 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
>  {
>  	struct devfreq_cooling_device *dfc = cdev->devdata;
>  	struct devfreq *df = dfc->devfreq;
> -	struct devfreq_dev_status *status = &df->last_status;
> +	struct devfreq_dev_status status;
>  	unsigned long state;
> -	unsigned long freq = status->current_frequency;
> +	unsigned long freq;
>  	unsigned long voltage;
>  	u32 dyn_power = 0;
>  	u32 static_power = 0;
>  	int res;
>  
> +	mutex_lock(&df->lock);
> +	res = df->profile->get_dev_status(df->dev.parent, &status);
> +	mutex_unlock(&df->lock);
> +	if (res)
> +		return res;
> +
> +	freq = status.current_frequency;
> +
>  	state = freq_get_state(dfc, freq);
>  	if (state == THERMAL_CSTATE_INVALID) {
>  		res = -EAGAIN;
> @@ -269,16 +295,18 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
>  	} else {
>  		dyn_power = dfc->power_table[state];
>  
> +		_normalize_load(&status);
> +
>  		/* Scale dynamic power for utilization */
> -		dyn_power *= status->busy_time;
> -		dyn_power /= status->total_time;
> +		dyn_power *= status.busy_time;
> +		dyn_power /= status.total_time;
>  		/* Get static power */
>  		static_power = get_static_power(dfc, freq);
>  
>  		*power = dyn_power + static_power;
>  	}
>  
> -	trace_thermal_power_devfreq_get_power(cdev, status, freq, *power);
> +	trace_thermal_power_devfreq_get_power(cdev, &status, freq, *power);
>  
>  	return 0;
>  fail:
> @@ -312,14 +340,20 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
>  {
>  	struct devfreq_cooling_device *dfc = cdev->devdata;
>  	struct devfreq *df = dfc->devfreq;
> -	struct devfreq_dev_status *status = &df->last_status;
> -	unsigned long freq = status->current_frequency;
> +	struct devfreq_dev_status status;
>  	unsigned long busy_time;
> +	unsigned long freq;
>  	s32 dyn_power;
>  	u32 static_power;
>  	s32 est_power;
>  	int i;
>  
> +	mutex_lock(&df->lock);
> +	status = df->last_status;
> +	mutex_unlock(&df->lock);
> +
> +	freq = status.current_frequency;
> +
>  	if (dfc->power_ops->get_real_power) {
>  		/* Scale for resource utilization */
>  		est_power = power * dfc->res_util;
> @@ -331,8 +365,8 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
>  		dyn_power = dyn_power > 0 ? dyn_power : 0;
>  
>  		/* Scale dynamic power for utilization */
> -		busy_time = status->busy_time ?: 1;
> -		est_power = (dyn_power * status->total_time) / busy_time;
> +		busy_time = status.busy_time ?: 1;
> +		est_power = (dyn_power * status.total_time) / busy_time;
>  	}
>  
>  	/*
> 


-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/5] thermal: devfreq_cooling: get a copy of device status
  2020-10-07 16:11   ` Ionela Voinescu
@ 2020-10-22 10:55     ` Lukasz Luba
  2020-12-01 10:36       ` Ionela Voinescu
  0 siblings, 1 reply; 20+ messages in thread
From: Lukasz Luba @ 2020-10-22 10:55 UTC (permalink / raw)
  To: Ionela Voinescu
  Cc: linux-kernel, linux-pm, dri-devel, amit.kucheria, airlied,
	daniel.lezcano, steven.price, alyssa.rosenzweig, rui.zhang,
	orjan.eide

Hi Ionela,


On 10/7/20 5:11 PM, Ionela Voinescu wrote:
> On Monday 21 Sep 2020 at 13:20:04 (+0100), Lukasz Luba wrote:
>> Devfreq cooling needs to now the correct status of the device in order
>> to operate. Do not rely on Devfreq last_status which might be a stale data
>> and get more up-to-date values of the load.
>>
>> Devfreq framework can change the device status in the background. To
>> mitigate this situation make a copy of the status structure and use it
>> for internal calculations.
>>
>> In addition this patch adds normalization function, which also makes sure
>> that whatever data comes from the device, it is in a sane range.
>>
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
>>   drivers/thermal/devfreq_cooling.c | 52 +++++++++++++++++++++++++------
>>   1 file changed, 43 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
>> index 7063ccb7b86d..cf045bd4d16b 100644
>> --- a/drivers/thermal/devfreq_cooling.c
>> +++ b/drivers/thermal/devfreq_cooling.c
>> @@ -227,6 +227,24 @@ static inline unsigned long get_total_power(struct devfreq_cooling_device *dfc,
>>   							       voltage);
>>   }
>>   
>> +static void _normalize_load(struct devfreq_dev_status *status)
> 
> Is there a reason for the leading "_" ?
> AFAIK, "__name()" is meant to suggest a "worker" function for another
> "name()" function, but that would not apply here.

It is just a local name. Check e.g. ./drivers/opp/core.c there is a few:
_generic_set_opp_regulator(), _generic_set_opp_clk_only(),
_get_opp_count(), _find_opp_table(), _set_opp_bw(), etc.

It is just a shorter name for me, '_' means here locality.
Instead of calling it devfreq_cooling_normalize_load().

> 
>> +{
>> +	/* Make some space if needed */
>> +	if (status->busy_time > 0xffff) {
>> +		status->busy_time >>= 10;
>> +		status->total_time >>= 10;
>> +	}
> 
> How about removing the above code and adding here:
> 
> status->busy_time = status->busy_time ? : 1;

It's not equivalent. The code operates on raw device values, which
might be big (e.g. read from counters). If it's lager than the 0xffff,
it is going to be shifted to get smaller.

> 
>> +
>> +	if (status->busy_time > status->total_time)
> 
> This check would then cover the possibility that total_time is 0.
> 
>> +		status->busy_time = status->total_time;
> 
> But a reversal is needed here:
> 		status->total_time = status->busy_time;

No, I want to clamp the busy_time, which should not be bigger that
total time. It could happen when we deal with 'raw' values from device
counters.

> 
>> +
>> +	status->busy_time *= 100;
>> +	status->busy_time /= status->total_time ? : 1;
>> +
>> +	/* Avoid division by 0 */
>> +	status->busy_time = status->busy_time ? : 1;
>> +	status->total_time = 100;
> 
> Then all of this code can be replaced by:
> 
> status->busy_time = (unsigned long)div64_u64((u64)status->busy_time << 10,
> 					     status->total_time);
> status->total_time = 1 << 10;

No, the total_time closed to 'unsigned long' would overflow.

> 
> This way you gain some resolution to busy_time and the divisions in the
> callers would just become shifts by 10.


I don't want to gain more resolution here. I want to be prepare for raw
(not processed yet) big values coming from driver.

Regards,
Lukasz

> 
> Hope it helps,
> Ionela.
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model
  2020-10-07 12:07   ` Ionela Voinescu
@ 2020-10-22 11:17     ` Lukasz Luba
  2020-12-01 14:05       ` Ionela Voinescu
  0 siblings, 1 reply; 20+ messages in thread
From: Lukasz Luba @ 2020-10-22 11:17 UTC (permalink / raw)
  To: Ionela Voinescu
  Cc: linux-kernel, linux-pm, dri-devel, amit.kucheria, airlied,
	daniel.lezcano, steven.price, alyssa.rosenzweig, rui.zhang,
	orjan.eide



On 10/7/20 1:07 PM, Ionela Voinescu wrote:
> Hi Lukasz,
> 
> On Monday 21 Sep 2020 at 13:20:05 (+0100), Lukasz Luba wrote:
>> The Energy Model (EM) framework supports devices such as Devfreq. Create
>> new registration functions which automatically register EM for the thermal
>> devfreq_cooling devices. This patch prepares the code for coming changes
>> which are going to replace old power model with the new EM.
>>
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
>>   drivers/thermal/devfreq_cooling.c | 99 ++++++++++++++++++++++++++++++-
>>   include/linux/devfreq_cooling.h   | 22 +++++++
>>   2 files changed, 120 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
>> index cf045bd4d16b..7e091e795284 100644
>> --- a/drivers/thermal/devfreq_cooling.c
>> +++ b/drivers/thermal/devfreq_cooling.c
>> @@ -50,6 +50,8 @@ static DEFINE_IDA(devfreq_ida);
>>    * @capped_state:	index to cooling state with in dynamic power budget
>>    * @req_max_freq:	PM QoS request for limiting the maximum frequency
>>    *			of the devfreq device.
>> + * @em:		Energy Model which represents the associated Devfreq device
>                                       ^^^^^^^^^^^^^^^^
> 				     for

I will change it.

>> + * @em_registered:	Devfreq cooling registered the EM and should free it.
>>    */
>>   struct devfreq_cooling_device {
>>   	int id;
>> @@ -63,6 +65,8 @@ struct devfreq_cooling_device {
>>   	u32 res_util;
>>   	int capped_state;
>>   	struct dev_pm_qos_request req_max_freq;
>> +	struct em_perf_domain *em;
>> +	bool em_registered;
>>   };
>>   
>>   static int devfreq_cooling_get_max_state(struct thermal_cooling_device *cdev,
>> @@ -586,22 +590,115 @@ struct thermal_cooling_device *devfreq_cooling_register(struct devfreq *df)
>>   }
>>   EXPORT_SYMBOL_GPL(devfreq_cooling_register);
>>   
>> +/**
>> + * devfreq_cooling_em_register_power() - Register devfreq cooling device with
>> + *		power information and attempt to register Energy Model (EM)
> 
> It took me a while to understand the differences between devfreq
> register functions and it left me with a nagging feeling that we don't
> need all of them. Also, looking over the cpufreq cooling devices, they
> keep their registering interfaces quite simple.

This was discussed in previous series, related to EM core changes.
It was requested to have a helper registration function which would
create EM automatically.

> 
> With the functions added by this patch, the devfreq cooling devices will have:
>   - old:
>         of_devfreq_cooling_register_power
>         of_devfreq_cooling_register
>         devfreq_cooling_register
>         devfreq_cooling_unregister
>   - new:
>         devfreq_cooling_em_register_power
>         devfreq_cooling_em_register
> 
> My question is whether we actually need the two new
> devfreq_cooling_em_register_power() and devfreq_cooling_em_register()?

It is just for consistency, with older scheme. It is only a wrapper, one
line, with default NULL. This scheme is common in thermal and some other
frameworks.

> 
> The power_ops and the em are dependent on one another, so could we
> extend the of_devfreq_cooling_register_power() to do the additional em
> registration. We only need a way to pass the em_cb and I think that
> could fit nicely in devfreq_cooling_power.

No, they aren't 'dependent on one another'. The EM usage doesn't depend
on presence of power_ops. Drivers might not support power_ops, but want
the framework still use EM and do power estimation.

> 
> To be noted that I've reviewed these interfaces in the context of the
> final state of devfreq_cooling.c, after the changes in 4/5.
> 
>> + * @df:		Pointer to devfreq device.
>> + * @dfc_power:	Pointer to devfreq_cooling_power.
>> + * @em_cb:	Callback functions providing the data of the EM
>> + *
>> + * Register a devfreq cooling device and attempt to register Energy Model. The
>> + * available OPPs must be registered for the device.
>> + *
>> + * If @dfc_power is provided, the cooling device is registered with the
>> + * power extensions. If @em_cb is provided it will be called for each OPP to
>> + * calculate power value and cost. If @em_cb is not provided then simple Energy
>> + * Model is going to be used, which requires "dynamic-power-coefficient" a
>> + * devicetree property.
>> + */
>> +struct thermal_cooling_device *
>> +devfreq_cooling_em_register_power(struct devfreq *df,
>> +				  struct devfreq_cooling_power *dfc_power,
>> +				  struct em_data_callback *em_cb)
>> +{
>> +	struct thermal_cooling_device *cdev;
>> +	struct devfreq_cooling_device *dfc;
>> +	struct device_node *np = NULL;
>> +	struct device *dev;
>> +	int nr_opp, ret;
>> +
>> +	if (IS_ERR_OR_NULL(df))
>> +		return ERR_PTR(-EINVAL);
>> +
>> +	dev = df->dev.parent;
>> +
>> +	if (em_cb) {
>> +		nr_opp = dev_pm_opp_get_opp_count(dev);
>> +		if (nr_opp <= 0) {
>> +			dev_err(dev, "No valid OPPs found\n");
>> +			return ERR_PTR(-EINVAL);
>> +		}
>> +
>> +		ret = em_dev_register_perf_domain(dev, nr_opp, em_cb, NULL);
>> +	} else {
>> +		ret = dev_pm_opp_of_register_em(dev, NULL);
>> +	}
>> +
>> +	if (ret)
>> +		dev_warn(dev, "Unable to register EM for devfreq cooling device (%d)\n",
>> +			 ret);
>> +
>> +	if (dev->of_node)
>> +		np = of_node_get(dev->of_node);
>> +
>> +	cdev = of_devfreq_cooling_register_power(np, df, dfc_power);
>> +
>> +	if (np)
>> +		of_node_put(np);
>> +
>> +	if (IS_ERR_OR_NULL(cdev)) {
>> +		if (!ret)
>> +			em_dev_unregister_perf_domain(dev);
>> +	} else {
>> +		dfc = cdev->devdata;
>> +		dfc->em_registered = !ret;
>> +	}
>> +
>> +	return cdev;
>> +}
>> +EXPORT_SYMBOL_GPL(devfreq_cooling_em_register_power);
>> +
>> +/**
>> + * devfreq_cooling_em_register() - Register devfreq cooling device together
>> + *				with Energy Model.
>> + * @df:		Pointer to devfreq device.
>> + * @em_cb:	Callback functions providing the data of the Energy Model
>> + *
>> + * This function attempts to register Energy Model for devfreq device and then
>> + * register the devfreq cooling device.
>> + */
>> +struct thermal_cooling_device *
>> +devfreq_cooling_em_register(struct devfreq *df, struct em_data_callback *em_cb)
>> +{
>> +	return devfreq_cooling_em_register_power(df, NULL, em_cb);
>> +}
>> +EXPORT_SYMBOL_GPL(devfreq_cooling_em_register);
>> +
>>   /**
>>    * devfreq_cooling_unregister() - Unregister devfreq cooling device.
>>    * @cdev: Pointer to devfreq cooling device to unregister.
>> + *
>> + * Unregisters devfreq cooling device and related Energy Model if it was
>> + * present.
>>    */
>>   void devfreq_cooling_unregister(struct thermal_cooling_device *cdev)
>>   {
>>   	struct devfreq_cooling_device *dfc;
>> +	struct device *dev;
>>   
>> -	if (!cdev)
>> +	if (IS_ERR_OR_NULL(cdev))
>>   		return;
>>   
>>   	dfc = cdev->devdata;
>> +	dev = dfc->devfreq->dev.parent;
>>   
>>   	thermal_cooling_device_unregister(dfc->cdev);
>>   	ida_simple_remove(&devfreq_ida, dfc->id);
>>   	dev_pm_qos_remove_request(&dfc->req_max_freq);
>> +
>> +	if (dfc->em_registered)
>> +		em_dev_unregister_perf_domain(dev);
> 
> Nit: Isn't it enough to check if dev->em_pd != NULL to be able to
> unregister the perf_domain? That would remove the need for
> dfc->em_registered.

The devfreq cooling may only unregister the EM if it has registered it.
If any other code did the registration, it should unregister when it
finished using it.

> 
> I suppose one could say that's using implementation details on how the
> EM is built and stored and we should not rely on it, so it's up to you
> if you want to change it.
> 
> Kind regards,
> Ionela.
> 
>> +
>>   	kfree(dfc->power_table);
>>   	kfree(dfc->freq_table);
>>   
>> diff --git a/include/linux/devfreq_cooling.h b/include/linux/devfreq_cooling.h
>> index 9df2dfca68dd..19868fb922f1 100644
>> --- a/include/linux/devfreq_cooling.h
>> +++ b/include/linux/devfreq_cooling.h
>> @@ -11,6 +11,7 @@
>>   #define __DEVFREQ_COOLING_H__
>>   
>>   #include <linux/devfreq.h>
>> +#include <linux/energy_model.h>
>>   #include <linux/thermal.h>
>>   
>>   
>> @@ -65,6 +66,13 @@ struct thermal_cooling_device *
>>   of_devfreq_cooling_register(struct device_node *np, struct devfreq *df);
>>   struct thermal_cooling_device *devfreq_cooling_register(struct devfreq *df);
>>   void devfreq_cooling_unregister(struct thermal_cooling_device *dfc);
>> +struct thermal_cooling_device *
>> +devfreq_cooling_em_register_power(struct devfreq *df,
>> +				  struct devfreq_cooling_power *dfc_power,
>> +				  struct em_data_callback *em_cb);
>> +struct thermal_cooling_device *
>> +devfreq_cooling_em_register(struct devfreq *df,
>> +			    struct em_data_callback *em_cb);
>>   
>>   #else /* !CONFIG_DEVFREQ_THERMAL */
>>   
>> @@ -87,6 +95,20 @@ devfreq_cooling_register(struct devfreq *df)
>>   	return ERR_PTR(-EINVAL);
>>   }
>>   
>> +static inline struct thermal_cooling_device *
>> +devfreq_cooling_em_register_power(struct devfreq *df,
>> +				  struct devfreq_cooling_power *dfc_power,
>> +				  struct em_data_callback *em_cb)
>> +{
>> +	return ERR_PTR(-EINVAL);
>> +}
>> +
>> +static inline struct thermal_cooling_device *
>> +devfreq_cooling_em_register(struct devfreq *df,	struct em_data_callback *em_cb)
>> +{
>> +	return ERR_PTR(-EINVAL);
>> +}
>> +
>>   static inline void
>>   devfreq_cooling_unregister(struct thermal_cooling_device *dfc)
>>   {
>> -- 
>> 2.17.1
>>
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/5] thermal: devfreq_cooling: remove old power model and use EM
  2020-10-07 15:12   ` Ionela Voinescu
@ 2020-10-22 11:26     ` Lukasz Luba
  0 siblings, 0 replies; 20+ messages in thread
From: Lukasz Luba @ 2020-10-22 11:26 UTC (permalink / raw)
  To: Ionela Voinescu
  Cc: linux-kernel, linux-pm, dri-devel, amit.kucheria, airlied,
	daniel.lezcano, steven.price, alyssa.rosenzweig, rui.zhang,
	orjan.eide



On 10/7/20 4:12 PM, Ionela Voinescu wrote:
> Hi Lukasz,
> 
> On Monday 21 Sep 2020 at 13:20:06 (+0100), Lukasz Luba wrote:
> [..]
>>   /**
>> - * freq_get_state() - get the cooling state corresponding to a frequency
>> + * freq_get_state() - get the performance index corresponding to a frequency
> 
> If we change the meaning of the return value, I think the function needs
> a name change as well.
> 
> Also, we do treat this as a cooling state when we do validation and
> compare it to THERMAL_CSTATE_INVALID,  but it's not actually a cooling
> state (it's max_state - state). It does create confusion if we name
> "state" both a performance index and a cooling state.
> 
> Given that the only user is devfreq_cooling_get_requested_power(),
> might be good to collapse freq_get_state() in that function and rename
> the "state" variable in there to "em_perf_idx".

I will have a look into this.

> 
>>    * @dfc:	Pointer to devfreq cooling device
>> - * @freq:	frequency in Hz
>> + * @freq:	frequency in kHz
>>    *
>> - * Return: the cooling state associated with the @freq, or
>> + * Return: the performance index associated with the @freq, or
>>    * THERMAL_CSTATE_INVALID if it wasn't found.
>>    */
>>   static unsigned long
>> @@ -128,8 +130,8 @@ freq_get_state(struct devfreq_cooling_device *dfc, unsigned long freq)
>>   {
>>   	int i;
>>   
>> -	for (i = 0; i < dfc->freq_table_size; i++) {
>> -		if (dfc->freq_table[i] == freq)
>> +	for (i = 0; i <= dfc->max_state; i++) {
>> +		if (dfc->em->table[i].frequency == freq)
>>   			return i;
>>   	}
>>   
>> @@ -164,71 +166,15 @@ static unsigned long get_voltage(struct devfreq *df, unsigned long freq)
>>   	return voltage;
>>   }
>>   
>> -/**
>> - * get_static_power() - calculate the static power
>> - * @dfc:	Pointer to devfreq cooling device
>> - * @freq:	Frequency in Hz
>> - *
>> - * Calculate the static power in milliwatts using the supplied
>> - * get_static_power().  The current voltage is calculated using the
>> - * OPP library.  If no get_static_power() was supplied, assume the
>> - * static power is negligible.
>> - */
>> -static unsigned long
>> -get_static_power(struct devfreq_cooling_device *dfc, unsigned long freq)
>> +static void dfc_em_get_requested_power(struct em_perf_domain *em,
>> +				       struct devfreq_dev_status *status,
>> +				       u32 *power, int em_perf_idx)
> 
> Is there a reason for not directly returning the power value in this
> function? Also, this only does a few arithmetic operations and it's only
> called in one place. Is it worth to have this in a separate function?

Good question, maybe I will just put this code where it's called.

> 
> [..]
>> @@ -345,11 +279,8 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
>>   	struct devfreq_cooling_device *dfc = cdev->devdata;
>>   	struct devfreq *df = dfc->devfreq;
>>   	struct devfreq_dev_status status;
>> -	unsigned long busy_time;
>> +	u32 est_power = power;
> 
> Nit: You could use power directly and remove est_power as well.
> 
>>   	unsigned long freq;
>> -	s32 dyn_power;
>> -	u32 static_power;
>> -	s32 est_power;
>>   	int i;
>>   
>>   	mutex_lock(&df->lock);
>> @@ -358,31 +289,26 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
>>   
>>   	freq = status.current_frequency;
>>   
>> -	if (dfc->power_ops->get_real_power) {
>> +	if (dfc->power_ops && dfc->power_ops->get_real_power) {
>>   		/* Scale for resource utilization */
>>   		est_power = power * dfc->res_util;
>>   		est_power /= SCALE_ERROR_MITIGATION;
>>   	} else {
>> -		static_power = get_static_power(dfc, freq);
>> -
>> -		dyn_power = power - static_power;
>> -		dyn_power = dyn_power > 0 ? dyn_power : 0;
>> -
>> -		/* Scale dynamic power for utilization */
>> -		busy_time = status.busy_time ?: 1;
>> -		est_power = (dyn_power * status.total_time) / busy_time;
>> +		_normalize_load(&status);
>> +		est_power *= status.total_time;
>> +		est_power /= status.busy_time;
>>   	}
>>   
>>   	/*
>>   	 * Find the first cooling state that is within the power
>> -	 * budget for dynamic power.
>> +	 * budget. The EM power table is sorted ascending.
>>   	 */
>> -	for (i = 0; i < dfc->freq_table_size - 1; i++)
>> -		if (est_power >= dfc->power_table[i])
>> +	for (i = dfc->max_state; i > 0; i--)
>> +		if (est_power >= dfc->em->table[i].power)
>>   			break;
>>   
>> -	*state = i;
>> -	dfc->capped_state = i;
>> +	*state = dfc->max_state - i;
>> +	dfc->capped_state = *state;
>>   	trace_thermal_power_devfreq_limit(cdev, freq, *state, power);
>>   	return 0;
>>   }
> [..]
>>   /**
>> @@ -503,7 +381,7 @@ of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df,
>>   	struct thermal_cooling_device *cdev;
>>   	struct devfreq_cooling_device *dfc;
>>   	char dev_name[THERMAL_NAME_LENGTH];
>> -	int err;
>> +	int err, num_opps;
>>   
>>   	dfc = kzalloc(sizeof(*dfc), GFP_KERNEL);
>>   	if (!dfc)
>> @@ -511,28 +389,45 @@ of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df,
>>   
>>   	dfc->devfreq = df;
>>   
>> -	if (dfc_power) {
>> -		dfc->power_ops = dfc_power;
>> -
>> +	dfc->em = em_pd_get(df->dev.parent);
>> +	if (dfc->em) {
>>   		devfreq_cooling_ops.get_requested_power =
>>   			devfreq_cooling_get_requested_power;
>>   		devfreq_cooling_ops.state2power = devfreq_cooling_state2power;
>>   		devfreq_cooling_ops.power2state = devfreq_cooling_power2state;
>> +
>> +		dfc->power_ops = dfc_power;
>> +
>> +		num_opps = em_pd_nr_perf_states(dfc->em);
>> +	} else {
>> +		/* Backward compatibility for drivers which do not use IPA */
>> +		dev_dbg(df->dev.parent, "missing EM for cooling device\n");
>> +
>> +		num_opps = dev_pm_opp_get_opp_count(df->dev.parent);
>> +
>> +		err = devfreq_cooling_gen_tables(dfc, num_opps);
>> +		if (err)
>> +			goto free_dfc;
>>   	}
>>   
>> -	err = devfreq_cooling_gen_tables(dfc);
>> -	if (err)
>> +	if (num_opps <= 0) {
>> +		err = -EINVAL;
>>   		goto free_dfc;
>> +	}
>> +
>> +	/* max_state is an index, not a counter */
> 
> Nit: Might be more clear to replace "index" with cooling state. Then
> knowledge about cooling states would make this more clear.

Similar comment is in cpufreq_cooling.c. The 'index' here means the last
valid index in the array.

Thank you for the review comments for all patches.

Regards,
Lukasz

> 
> Regards,
> Ionela.
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/5] thermal: devfreq_cooling: get a copy of device status
  2020-10-14 14:34   ` Daniel Lezcano
@ 2020-10-22 11:45     ` Lukasz Luba
  0 siblings, 0 replies; 20+ messages in thread
From: Lukasz Luba @ 2020-10-22 11:45 UTC (permalink / raw)
  To: Daniel Lezcano, linux-kernel, linux-pm, dri-devel
  Cc: rui.zhang, amit.kucheria, orjan.eide, robh, alyssa.rosenzweig,
	steven.price, airlied, daniel

Hi Daniel,

On 10/14/20 3:34 PM, Daniel Lezcano wrote:
> On 21/09/2020 14:20, Lukasz Luba wrote:
>> Devfreq cooling needs to now the correct status of the device in order
>> to operate. Do not rely on Devfreq last_status which might be a stale data
>> and get more up-to-date values of the load.
>>
>> Devfreq framework can change the device status in the background. To
>> mitigate this situation make a copy of the status structure and use it
>> for internal calculations.
>>
>> In addition this patch adds normalization function, which also makes sure
>> that whatever data comes from the device, it is in a sane range.
>>
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
>>   drivers/thermal/devfreq_cooling.c | 52 +++++++++++++++++++++++++------
>>   1 file changed, 43 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
>> index 7063ccb7b86d..cf045bd4d16b 100644
>> --- a/drivers/thermal/devfreq_cooling.c
>> +++ b/drivers/thermal/devfreq_cooling.c
>> @@ -227,6 +227,24 @@ static inline unsigned long get_total_power(struct devfreq_cooling_device *dfc,
>>   							       voltage);
>>   }
>>   
>> +static void _normalize_load(struct devfreq_dev_status *status)
>> +{
>> +	/* Make some space if needed */
>> +	if (status->busy_time > 0xffff) {
>> +		status->busy_time >>= 10;
>> +		status->total_time >>= 10;
>> +	}
>> +
>> +	if (status->busy_time > status->total_time)
>> +		status->busy_time = status->total_time;
>> +
>> +	status->busy_time *= 100;
>> +	status->busy_time /= status->total_time ? : 1;
>> +
>> +	/* Avoid division by 0 */
>> +	status->busy_time = status->busy_time ? : 1;
>> +	status->total_time = 100;
>> +}
> 
> Not sure that works if the devfreq governor is not on-demand.
> 
> Is it possible to use the energy model directly in devfreq to compute
> the energy consumption given the state transitions since the last reading ?

This change is actually trying to create a safety net for what we do.

In the original code we take the last_state directly:
-	struct devfreq_dev_status *status = &df->last_status;

Then we simply multiply by 'busy_time' and div by 'total_time',
without checks... The values might be huge or zero, etc.
The _normalize_load() introduces this safety.

Apart from that, just simply taking a pointer to &df->last_status does
not protect us from:
- working on a struct which might be modified at the same time in
   background - not safe
- that struct might not be updated by long time, because devfreq
   didn't check it for a long (there are two polling modes in devfreq)

So taking a mutex and then a trigger the device status check and
make a copy of newest data. It is more safe.

I think this can be treated as a fix, not a feature.

> 
> The power will be read directly from devfreq which will return:
> 
> nrj + (current_power * (jiffies - last_update)) / (jiffies - last_update)
> 
> The devfreq cooling device driver would result in a much simpler code, no?

This is something that I would like to address after the EM changes are
merged. It would be the next step, how to estimate the power by taking
into consideration more information. This patch series just tries to
make it possible to use EM. The model improvements would be next.

Thank you Daniel for your review.

Regards,
Lukasz

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/5] thermal: devfreq_cooling: get a copy of device status
  2020-10-22 10:55     ` Lukasz Luba
@ 2020-12-01 10:36       ` Ionela Voinescu
  2020-12-01 12:19         ` Lukasz Luba
  0 siblings, 1 reply; 20+ messages in thread
From: Ionela Voinescu @ 2020-12-01 10:36 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, dri-devel, amit.kucheria, airlied,
	daniel.lezcano, steven.price, alyssa.rosenzweig, rui.zhang,
	orjan.eide

Hi,

Sorry for the delay and for the noise on this older version. I first
want to understand the code better.

On Thursday 22 Oct 2020 at 11:55:28 (+0100), Lukasz Luba wrote:
[..]
> 
> > 
> > > +{
> > > +	/* Make some space if needed */
> > > +	if (status->busy_time > 0xffff) {
> > > +		status->busy_time >>= 10;
> > > +		status->total_time >>= 10;
> > > +	}
> > 
> > How about removing the above code and adding here:
> > 
> > status->busy_time = status->busy_time ? : 1;
> 
> It's not equivalent. The code operates on raw device values, which
> might be big (e.g. read from counters). If it's lager than the 0xffff,
> it is going to be shifted to get smaller.
> 

Yes, the big values are handled below through the division and by making
total_time = 1024. These two initial checks are only to cover the
possibility for busy_time and total_time being 0, or busy_time >
total_time.

> > 
> > > +
> > > +	if (status->busy_time > status->total_time)
> > 
> > This check would then cover the possibility that total_time is 0.
> > 
> > > +		status->busy_time = status->total_time;
> > 
> > But a reversal is needed here:
> > 		status->total_time = status->busy_time;
> 
> No, I want to clamp the busy_time, which should not be bigger that
> total time. It could happen when we deal with 'raw' values from device
> counters.
> 

Yes, I understand. But isn't making total_time = busy_time accomplishing
the same thing?

> > 
> > > +
> > > +	status->busy_time *= 100;
> > > +	status->busy_time /= status->total_time ? : 1;
> > > +
> > > +	/* Avoid division by 0 */
> > > +	status->busy_time = status->busy_time ? : 1;
> > > +	status->total_time = 100;
> > 
> > Then all of this code can be replaced by:
> > 
> > status->busy_time = (unsigned long)div64_u64((u64)status->busy_time << 10,
> > 					     status->total_time);
> > status->total_time = 1 << 10;
> 
> No, the total_time closed to 'unsigned long' would overflow.
> 

I'm not sure I understand. total_time gets a value of 1024, it's not
itself shifted by 10.

> > 
> > This way you gain some resolution to busy_time and the divisions in the
> > callers would just become shifts by 10.
> 
> 
> I don't want to gain more resolution here. I want to be prepare for raw
> (not processed yet) big values coming from driver.
>

Agreed! The higher resolution is an extra benefit. The more important
benefit is that, through my suggestion, you'd be replacing all future
divisions by shifts.

Thanks,
Ionela.

> Regards,
> Lukasz
> 
> > 
> > Hope it helps,
> > Ionela.
> > 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/5] thermal: devfreq_cooling: get a copy of device status
  2020-12-01 10:36       ` Ionela Voinescu
@ 2020-12-01 12:19         ` Lukasz Luba
  2020-12-01 14:55           ` Ionela Voinescu
  0 siblings, 1 reply; 20+ messages in thread
From: Lukasz Luba @ 2020-12-01 12:19 UTC (permalink / raw)
  To: Ionela Voinescu
  Cc: linux-kernel, linux-pm, dri-devel, amit.kucheria, airlied,
	daniel.lezcano, steven.price, alyssa.rosenzweig, rui.zhang,
	orjan.eide



On 12/1/20 10:36 AM, Ionela Voinescu wrote:
> Hi,
> 
> Sorry for the delay and for the noise on this older version. I first
> want to understand the code better.
> 
> On Thursday 22 Oct 2020 at 11:55:28 (+0100), Lukasz Luba wrote:
> [..]
>>
>>>
>>>> +{
>>>> +	/* Make some space if needed */
>>>> +	if (status->busy_time > 0xffff) {
>>>> +		status->busy_time >>= 10;
>>>> +		status->total_time >>= 10;
>>>> +	}
>>>
>>> How about removing the above code and adding here:
>>>
>>> status->busy_time = status->busy_time ? : 1;
>>
>> It's not equivalent. The code operates on raw device values, which
>> might be big (e.g. read from counters). If it's lager than the 0xffff,
>> it is going to be shifted to get smaller.
>>
> 
> Yes, the big values are handled below through the division and by making
> total_time = 1024. These two initial checks are only to cover the
> possibility for busy_time and total_time being 0, or busy_time >
> total_time.
> 
>>>
>>>> +
>>>> +	if (status->busy_time > status->total_time)
>>>
>>> This check would then cover the possibility that total_time is 0.
>>>
>>>> +		status->busy_time = status->total_time;
>>>
>>> But a reversal is needed here:
>>> 		status->total_time = status->busy_time;
>>
>> No, I want to clamp the busy_time, which should not be bigger that
>> total time. It could happen when we deal with 'raw' values from device
>> counters.
>>
> 
> Yes, I understand. But isn't making total_time = busy_time accomplishing
> the same thing?
> 
>>>
>>>> +
>>>> +	status->busy_time *= 100;
>>>> +	status->busy_time /= status->total_time ? : 1;
>>>> +
>>>> +	/* Avoid division by 0 */
>>>> +	status->busy_time = status->busy_time ? : 1;
>>>> +	status->total_time = 100;
>>>
>>> Then all of this code can be replaced by:
>>>
>>> status->busy_time = (unsigned long)div64_u64((u64)status->busy_time << 10,
>>> 					     status->total_time);
>>> status->total_time = 1 << 10;
>>
>> No, the total_time closed to 'unsigned long' would overflow.
>>
> 
> I'm not sure I understand. total_time gets a value of 1024, it's not
> itself shifted by 10.
> 
>>>
>>> This way you gain some resolution to busy_time and the divisions in the
>>> callers would just become shifts by 10.
>>
>>
>> I don't want to gain more resolution here. I want to be prepare for raw
>> (not processed yet) big values coming from driver.
>>
> 
> Agreed! The higher resolution is an extra benefit. The more important
> benefit is that, through my suggestion, you'd be replacing all future
> divisions by shifts.

You have probably missed some bits.
I don't see benefits, you have div64_u64() which is heavy on 32bit CPUs.

Then, what is the range of these values:
busy_time [0, 1024], total_time 1024 in your case.
These values are used for estimating power in two cases:
1. in devfreq_cooling_get_requested_power()
	est_power = power * busy_time / total_time
2. in devfreq_cooling_power2state():
	est_power = power * total_time / busy_time

As you can see above, the est_power values could overflow if total_time,
busy_time are raw values (like in old implementation). So normalize them
into 'some' scale. That was the motivation ('scale' motivation below).

In your case you cannot avoid division in 2. use case, because busy_time
can be any value in range [0, 1024].
We could avoid the division in 1. use case, but load in cpufreq cooling
is also in range of [0, 100], so this devfreq cooling is aligned. I
would like to avoid situation when someone is parsing the traces
and these two devices present different load scale.

I will think about better 'devfreq utilization' (as also Daniel
suggested)in future, but first this EM must be in mainline and cpufreq
cooling changes made by Viresh also there.
But it would be more then just scale change to [0, 1024]...

Regards,
Lukasz


> 
> Thanks,
> Ionela.
> 
>> Regards,
>> Lukasz
>>
>>>
>>> Hope it helps,
>>> Ionela.
>>>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model
  2020-10-22 11:17     ` Lukasz Luba
@ 2020-12-01 14:05       ` Ionela Voinescu
  2020-12-01 14:37         ` Lukasz Luba
  0 siblings, 1 reply; 20+ messages in thread
From: Ionela Voinescu @ 2020-12-01 14:05 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, dri-devel, amit.kucheria, airlied,
	daniel.lezcano, steven.price, alyssa.rosenzweig, rui.zhang,
	orjan.eide

Hi,

On Thursday 22 Oct 2020 at 12:17:31 (+0100), Lukasz Luba wrote:
[..]

> > > +/**
> > > + * devfreq_cooling_em_register_power() - Register devfreq cooling device with
> > > + *		power information and attempt to register Energy Model (EM)
> > 
> > It took me a while to understand the differences between devfreq
> > register functions and it left me with a nagging feeling that we don't
> > need all of them. Also, looking over the cpufreq cooling devices, they
> > keep their registering interfaces quite simple.
> 
> This was discussed in previous series, related to EM core changes.
> It was requested to have a helper registration function which would
> create EM automatically.
> 
> > 
> > With the functions added by this patch, the devfreq cooling devices will have:
> >   - old:
> >         of_devfreq_cooling_register_power
> >         of_devfreq_cooling_register
> >         devfreq_cooling_register
> >         devfreq_cooling_unregister
> >   - new:
> >         devfreq_cooling_em_register_power
> >         devfreq_cooling_em_register
> > 
> > My question is whether we actually need the two new
> > devfreq_cooling_em_register_power() and devfreq_cooling_em_register()?
> 
> It is just for consistency, with older scheme. It is only a wrapper, one
> line, with default NULL. This scheme is common in thermal and some other
> frameworks.
> 
> > 
> > The power_ops and the em are dependent on one another, so could we
> > extend the of_devfreq_cooling_register_power() to do the additional em
> > registration. We only need a way to pass the em_cb and I think that
> > could fit nicely in devfreq_cooling_power.
> 
> No, they aren't 'dependent on one another'. The EM usage doesn't depend
> on presence of power_ops. Drivers might not support power_ops, but want
> the framework still use EM and do power estimation.
> 

Okay, wrong choice of words. There's only a one way dependency: you can't
use power_ops without an em, according to
of_devfreq_cooling_register_power().

Correct me if I'm wrong, but I see this as being okay as you still need
an em to give you the maximum power of a device in a certain state.

With this in mind, and taking in detail the possible calls of the
devfreq cooling register functions:

1. Register devfreq cooling device with energy model.
   (used in patch 5/5)

 -> devfreq_cooling_em_register()
    -> devfreq_cooling_em_register_power(dfc_power = NULL, em obtained
                                      through various methods)
      -> of_devfreq_cooling_register_power(same as above)

2. Register devfreq cooling device with power_ops and em:
   (not used)

 -> devfreq_cooling_em_register_power(dfc_power != NULL, em obtained
                                     through various methods)
   -> of_devfreq_cooling_register_power(same as above)

3. Register a devfreq cooling devices with power_ops but no em
   (not used)

 -> of_devfreq_cooling_register_power(dfc_power != NULL)


4. Register a devfreq cooling devices without any kind of power
   information (em or dfc_power/power_ops)

 -> devfreq_cooling_register() or of_devfreq_cooling_register()
   -> of_devfreq_cooling_register_power(dfc_power = NULL)


Given this, aren't we ending up with some possible calls to these
registration functions that don't make sense? That is case 3, as
of_devfreq_cooling_register_power() could not assign and later use
power_ops without an em. For this usecase, 2 should be used instead.

Therefore, can't the same be achieved by collapsing
devfreq_cooling_em_register_power() into
of_devfreq_cooling_register_power()? (with the user having the
possibility to provide the em callback similarly to how get_real_power()
is provided - in devfreq_cooling_power).

IMO is cleaner to unify the functionality (registration and callbacks)
of cooling devices with power capabilities (based on em alone or together
with power_ops). Otherwise we just create confusion for users registering
cooling devices not knowing which function to call.

If this has been discussed previously and I'm missing some details,
please provide some links to the discussions.

Thank you for the patience :).

Ionela.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model
  2020-12-01 14:05       ` Ionela Voinescu
@ 2020-12-01 14:37         ` Lukasz Luba
  2020-12-01 15:02           ` Ionela Voinescu
  0 siblings, 1 reply; 20+ messages in thread
From: Lukasz Luba @ 2020-12-01 14:37 UTC (permalink / raw)
  To: Ionela Voinescu
  Cc: linux-kernel, linux-pm, dri-devel, amit.kucheria, airlied,
	daniel.lezcano, steven.price, alyssa.rosenzweig, rui.zhang,
	orjan.eide



On 12/1/20 2:05 PM, Ionela Voinescu wrote:
> Hi,
> 
> On Thursday 22 Oct 2020 at 12:17:31 (+0100), Lukasz Luba wrote:
> [..]
> 
>>>> +/**
>>>> + * devfreq_cooling_em_register_power() - Register devfreq cooling device with
>>>> + *		power information and attempt to register Energy Model (EM)
>>>
>>> It took me a while to understand the differences between devfreq
>>> register functions and it left me with a nagging feeling that we don't
>>> need all of them. Also, looking over the cpufreq cooling devices, they
>>> keep their registering interfaces quite simple.
>>
>> This was discussed in previous series, related to EM core changes.
>> It was requested to have a helper registration function which would
>> create EM automatically.
>>
>>>
>>> With the functions added by this patch, the devfreq cooling devices will have:
>>>    - old:
>>>          of_devfreq_cooling_register_power
>>>          of_devfreq_cooling_register
>>>          devfreq_cooling_register
>>>          devfreq_cooling_unregister
>>>    - new:
>>>          devfreq_cooling_em_register_power
>>>          devfreq_cooling_em_register
>>>
>>> My question is whether we actually need the two new
>>> devfreq_cooling_em_register_power() and devfreq_cooling_em_register()?
>>
>> It is just for consistency, with older scheme. It is only a wrapper, one
>> line, with default NULL. This scheme is common in thermal and some other
>> frameworks.
>>
>>>
>>> The power_ops and the em are dependent on one another, so could we
>>> extend the of_devfreq_cooling_register_power() to do the additional em
>>> registration. We only need a way to pass the em_cb and I think that
>>> could fit nicely in devfreq_cooling_power.
>>
>> No, they aren't 'dependent on one another'. The EM usage doesn't depend
>> on presence of power_ops. Drivers might not support power_ops, but want
>> the framework still use EM and do power estimation.
>>
> 
> Okay, wrong choice of words. There's only a one way dependency: you can't
> use power_ops without an em, according to
> of_devfreq_cooling_register_power().
> 
> Correct me if I'm wrong, but I see this as being okay as you still need
> an em to give you the maximum power of a device in a certain state.
> 
> With this in mind, and taking in detail the possible calls of the
> devfreq cooling register functions:
> 
> 1. Register devfreq cooling device with energy model.
>     (used in patch 5/5)
> 
>   -> devfreq_cooling_em_register()
>      -> devfreq_cooling_em_register_power(dfc_power = NULL, em obtained
>                                        through various methods)
>        -> of_devfreq_cooling_register_power(same as above)
> 
> 2. Register devfreq cooling device with power_ops and em:
>     (not used)
> 
>   -> devfreq_cooling_em_register_power(dfc_power != NULL, em obtained
>                                       through various methods)
>     -> of_devfreq_cooling_register_power(same as above)
> 
> 3. Register a devfreq cooling devices with power_ops but no em
>     (not used)
> 
>   -> of_devfreq_cooling_register_power(dfc_power != NULL)
> 
> 
> 4. Register a devfreq cooling devices without any kind of power
>     information (em or dfc_power/power_ops)
> 
>   -> devfreq_cooling_register() or of_devfreq_cooling_register()
>     -> of_devfreq_cooling_register_power(dfc_power = NULL)
> 
> 
> Given this, aren't we ending up with some possible calls to these
> registration functions that don't make sense? That is case 3, as
> of_devfreq_cooling_register_power() could not assign and later use
> power_ops without an em. For this usecase, 2 should be used instead.

In use case 3. you missed that the driver could registered EM by itself.
Maybe wanted to manage the EM internally, for various reasons. Then this
registration use case 3. makes sense.

> 
> Therefore, can't the same be achieved by collapsing
> devfreq_cooling_em_register_power() into
> of_devfreq_cooling_register_power()? (with the user having the
> possibility to provide the em callback similarly to how get_real_power()
> is provided - in devfreq_cooling_power).
> 
> IMO is cleaner to unify the functionality (registration and callbacks)
> of cooling devices with power capabilities (based on em alone or together
> with power_ops). Otherwise we just create confusion for users registering
> cooling devices not knowing which function to call.

I don't want to add the code from devfreq_cooling_em_register_power()
into the of_devfreq_cooling_register_power(), these are pretty dense
functions with complicated error handling paths.
In this shape and a few wrappers, which help users to register according
to their needs, it looks OK.

There will be always a review of the coming drivers which would like to
register.

> 
> If this has been discussed previously and I'm missing some details,
> please provide some links to the discussions.
> 
> Thank you for the patience :).
> 
> Ionela.
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/5] thermal: devfreq_cooling: get a copy of device status
  2020-12-01 12:19         ` Lukasz Luba
@ 2020-12-01 14:55           ` Ionela Voinescu
  0 siblings, 0 replies; 20+ messages in thread
From: Ionela Voinescu @ 2020-12-01 14:55 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, dri-devel, amit.kucheria, airlied,
	daniel.lezcano, steven.price, alyssa.rosenzweig, rui.zhang,
	orjan.eide

On Tuesday 01 Dec 2020 at 12:19:18 (+0000), Lukasz Luba wrote:
> 
> 
> On 12/1/20 10:36 AM, Ionela Voinescu wrote:
> > Hi,
> > 
> > Sorry for the delay and for the noise on this older version. I first
> > want to understand the code better.
> > 
> > On Thursday 22 Oct 2020 at 11:55:28 (+0100), Lukasz Luba wrote:
> > [..]
> > > 
> > > > 
> > > > > +{
> > > > > +	/* Make some space if needed */
> > > > > +	if (status->busy_time > 0xffff) {
> > > > > +		status->busy_time >>= 10;
> > > > > +		status->total_time >>= 10;
> > > > > +	}
> > > > 
> > > > How about removing the above code and adding here:
> > > > 
> > > > status->busy_time = status->busy_time ? : 1;
> > > 
> > > It's not equivalent. The code operates on raw device values, which
> > > might be big (e.g. read from counters). If it's lager than the 0xffff,
> > > it is going to be shifted to get smaller.
> > > 
> > 
> > Yes, the big values are handled below through the division and by making
> > total_time = 1024. These two initial checks are only to cover the
> > possibility for busy_time and total_time being 0, or busy_time >
> > total_time.
> > 
> > > > 
> > > > > +
> > > > > +	if (status->busy_time > status->total_time)
> > > > 
> > > > This check would then cover the possibility that total_time is 0.
> > > > 
> > > > > +		status->busy_time = status->total_time;
> > > > 
> > > > But a reversal is needed here:
> > > > 		status->total_time = status->busy_time;
> > > 
> > > No, I want to clamp the busy_time, which should not be bigger that
> > > total time. It could happen when we deal with 'raw' values from device
> > > counters.
> > > 
> > 
> > Yes, I understand. But isn't making total_time = busy_time accomplishing
> > the same thing?
> > 
> > > > 
> > > > > +
> > > > > +	status->busy_time *= 100;
> > > > > +	status->busy_time /= status->total_time ? : 1;
> > > > > +
> > > > > +	/* Avoid division by 0 */
> > > > > +	status->busy_time = status->busy_time ? : 1;
> > > > > +	status->total_time = 100;
> > > > 
> > > > Then all of this code can be replaced by:
> > > > 
> > > > status->busy_time = (unsigned long)div64_u64((u64)status->busy_time << 10,
> > > > 					     status->total_time);
> > > > status->total_time = 1 << 10;
> > > 
> > > No, the total_time closed to 'unsigned long' would overflow.
> > > 
> > 
> > I'm not sure I understand. total_time gets a value of 1024, it's not
> > itself shifted by 10.
> > 
> > > > 
> > > > This way you gain some resolution to busy_time and the divisions in the
> > > > callers would just become shifts by 10.
> > > 
> > > 
> > > I don't want to gain more resolution here. I want to be prepare for raw
> > > (not processed yet) big values coming from driver.
> > > 
> > 
> > Agreed! The higher resolution is an extra benefit. The more important
> > benefit is that, through my suggestion, you'd be replacing all future
> > divisions by shifts.
> 
> You have probably missed some bits.
> I don't see benefits, you have div64_u64() which is heavy on 32bit CPUs.
> 
> Then, what is the range of these values:
> busy_time [0, 1024], total_time 1024 in your case.
> These values are used for estimating power in two cases:
> 1. in devfreq_cooling_get_requested_power()
> 	est_power = power * busy_time / total_time
> 2. in devfreq_cooling_power2state():
> 	est_power = power * total_time / busy_time
> 
> As you can see above, the est_power values could overflow if total_time,
> busy_time are raw values (like in old implementation). So normalize them
> into 'some' scale. That was the motivation ('scale' motivation below).
> 

Agreed! I do think scaling is necessary, but in my mind the [0, 1024] scale
made more sense.

> In your case you cannot avoid division in 2. use case, because busy_time
> can be any value in range [0, 1024].
> We could avoid the division in 1. use case, but load in cpufreq cooling
> is also in range of [0, 100], so this devfreq cooling is aligned. I
> would like to avoid situation when someone is parsing the traces
> and these two devices present different load scale.
> 

Got it! Looking through the code I did overlook that 2 was reversed.

> I will think about better 'devfreq utilization' (as also Daniel
> suggested)in future, but first this EM must be in mainline and cpufreq
> cooling changes made by Viresh also there.
> But it would be more then just scale change to [0, 1024]...
> 

Okay, looking forward to this. It would be nice to align all of these
utilization metrics in the future for all kinds of devices.

Thanks,
Ionela.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model
  2020-12-01 14:37         ` Lukasz Luba
@ 2020-12-01 15:02           ` Ionela Voinescu
  0 siblings, 0 replies; 20+ messages in thread
From: Ionela Voinescu @ 2020-12-01 15:02 UTC (permalink / raw)
  To: Lukasz Luba
  Cc: linux-kernel, linux-pm, dri-devel, amit.kucheria, airlied,
	daniel.lezcano, steven.price, alyssa.rosenzweig, rui.zhang,
	orjan.eide

On Tuesday 01 Dec 2020 at 14:37:58 (+0000), Lukasz Luba wrote:
> 
> 
> On 12/1/20 2:05 PM, Ionela Voinescu wrote:
> > Hi,
> > 
> > On Thursday 22 Oct 2020 at 12:17:31 (+0100), Lukasz Luba wrote:
> > [..]
> > 
> > > > > +/**
> > > > > + * devfreq_cooling_em_register_power() - Register devfreq cooling device with
> > > > > + *		power information and attempt to register Energy Model (EM)
> > > > 
> > > > It took me a while to understand the differences between devfreq
> > > > register functions and it left me with a nagging feeling that we don't
> > > > need all of them. Also, looking over the cpufreq cooling devices, they
> > > > keep their registering interfaces quite simple.
> > > 
> > > This was discussed in previous series, related to EM core changes.
> > > It was requested to have a helper registration function which would
> > > create EM automatically.
> > > 
> > > > 
> > > > With the functions added by this patch, the devfreq cooling devices will have:
> > > >    - old:
> > > >          of_devfreq_cooling_register_power
> > > >          of_devfreq_cooling_register
> > > >          devfreq_cooling_register
> > > >          devfreq_cooling_unregister
> > > >    - new:
> > > >          devfreq_cooling_em_register_power
> > > >          devfreq_cooling_em_register
> > > > 
> > > > My question is whether we actually need the two new
> > > > devfreq_cooling_em_register_power() and devfreq_cooling_em_register()?
> > > 
> > > It is just for consistency, with older scheme. It is only a wrapper, one
> > > line, with default NULL. This scheme is common in thermal and some other
> > > frameworks.
> > > 
> > > > 
> > > > The power_ops and the em are dependent on one another, so could we
> > > > extend the of_devfreq_cooling_register_power() to do the additional em
> > > > registration. We only need a way to pass the em_cb and I think that
> > > > could fit nicely in devfreq_cooling_power.
> > > 
> > > No, they aren't 'dependent on one another'. The EM usage doesn't depend
> > > on presence of power_ops. Drivers might not support power_ops, but want
> > > the framework still use EM and do power estimation.
> > > 
> > 
> > Okay, wrong choice of words. There's only a one way dependency: you can't
> > use power_ops without an em, according to
> > of_devfreq_cooling_register_power().
> > 
> > Correct me if I'm wrong, but I see this as being okay as you still need
> > an em to give you the maximum power of a device in a certain state.
> > 
> > With this in mind, and taking in detail the possible calls of the
> > devfreq cooling register functions:
> > 
> > 1. Register devfreq cooling device with energy model.
> >     (used in patch 5/5)
> > 
> >   -> devfreq_cooling_em_register()
> >      -> devfreq_cooling_em_register_power(dfc_power = NULL, em obtained
> >                                        through various methods)
> >        -> of_devfreq_cooling_register_power(same as above)
> > 
> > 2. Register devfreq cooling device with power_ops and em:
> >     (not used)
> > 
> >   -> devfreq_cooling_em_register_power(dfc_power != NULL, em obtained
> >                                       through various methods)
> >     -> of_devfreq_cooling_register_power(same as above)
> > 
> > 3. Register a devfreq cooling devices with power_ops but no em
> >     (not used)
> > 
> >   -> of_devfreq_cooling_register_power(dfc_power != NULL)
> > 
> > 
> > 4. Register a devfreq cooling devices without any kind of power
> >     information (em or dfc_power/power_ops)
> > 
> >   -> devfreq_cooling_register() or of_devfreq_cooling_register()
> >     -> of_devfreq_cooling_register_power(dfc_power = NULL)
> > 
> > 
> > Given this, aren't we ending up with some possible calls to these
> > registration functions that don't make sense? That is case 3, as
> > of_devfreq_cooling_register_power() could not assign and later use
> > power_ops without an em. For this usecase, 2 should be used instead.
> 
> In use case 3. you missed that the driver could registered EM by itself.
> Maybe wanted to manage the EM internally, for various reasons. Then this
> registration use case 3. makes sense.
> 

Yes, the code allows it but it would be unlikely.

> > 
> > Therefore, can't the same be achieved by collapsing
> > devfreq_cooling_em_register_power() into
> > of_devfreq_cooling_register_power()? (with the user having the
> > possibility to provide the em callback similarly to how get_real_power()
> > is provided - in devfreq_cooling_power).
> > 
> > IMO is cleaner to unify the functionality (registration and callbacks)
> > of cooling devices with power capabilities (based on em alone or together
> > with power_ops). Otherwise we just create confusion for users registering
> > cooling devices not knowing which function to call.
> 
> I don't want to add the code from devfreq_cooling_em_register_power()
> into the of_devfreq_cooling_register_power(), these are pretty dense
> functions with complicated error handling paths.
> In this shape and a few wrappers, which help users to register according
> to their needs, it looks OK.
> 
> There will be always a review of the coming drivers which would like to
> register.
> 

Okay, no other arguments from my part.

I'll now take a look over v2. I just wanted to get some of these design
choices out of the way first.

Thanks,
Ionela.

> > 
> > If this has been discussed previously and I'm missing some details,
> > please provide some links to the discussions.
> > 
> > Thank you for the patience :).
> > 
> > Ionela.
> > 

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2020-12-01 15:03 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-21 12:20 [PATCH 0/5] Thermal devfreq cooling improvements with Energy Model Lukasz Luba
2020-09-21 12:20 ` [PATCH 1/5] thermal: devfreq_cooling: change tracing function and arguments Lukasz Luba
2020-09-21 12:20 ` [PATCH 2/5] thermal: devfreq_cooling: get a copy of device status Lukasz Luba
2020-10-07 16:11   ` Ionela Voinescu
2020-10-22 10:55     ` Lukasz Luba
2020-12-01 10:36       ` Ionela Voinescu
2020-12-01 12:19         ` Lukasz Luba
2020-12-01 14:55           ` Ionela Voinescu
2020-10-14 14:34   ` Daniel Lezcano
2020-10-22 11:45     ` Lukasz Luba
2020-09-21 12:20 ` [PATCH 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model Lukasz Luba
2020-10-07 12:07   ` Ionela Voinescu
2020-10-22 11:17     ` Lukasz Luba
2020-12-01 14:05       ` Ionela Voinescu
2020-12-01 14:37         ` Lukasz Luba
2020-12-01 15:02           ` Ionela Voinescu
2020-09-21 12:20 ` [PATCH 4/5] thermal: devfreq_cooling: remove old power model and use EM Lukasz Luba
2020-10-07 15:12   ` Ionela Voinescu
2020-10-22 11:26     ` Lukasz Luba
2020-09-21 12:20 ` [PATCH 5/5] drm/panfrost: Register devfreq cooling and attempt to add Energy Model Lukasz Luba

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).