All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/5] The power allocator thermal governor
@ 2014-05-06 12:06 Javi Merino
  2014-05-06 12:06 ` [RFC PATCH 1/5] thermal: let governors have private data for each thermal zone Javi Merino
                   ` (5 more replies)
  0 siblings, 6 replies; 48+ messages in thread
From: Javi Merino @ 2014-05-06 12:06 UTC (permalink / raw)
  To: linux-pm; +Cc: Punit.Agrawal, Javi Merino

Hi,

This RFC is our first stab at implementing the ideas we proposed in
[1].  The power allocator governor allocates device power to control
temperature.  This requires transforming performance requests into
requested power, which we do with the aid of power models.  Patch 3
(thermal: add a basic cpu power actor) implements a simple power model
for cpus.  The division of power between the actors ensures that power
is allocated where it is needed the most, based on the current
workload.

[1] http://article.gmane.org/gmane.linux.power-management.general/43243

Patches 1 and 2 were previously sent as separate patches: [2] and
[3]. There was a similar patch to "thermal: provide a way for the
governors to have private data for each thermal zone" posted by Wei Ni
but we can't use it as it is because it doesn't allow the governor's
bind() function (start() in Wei's patch) to modify the governor_data.
Other than that, this patch incorporates Rui's suggestion of keeping
the original governor if the new governor fails to bind.

[2] http://article.gmane.org/gmane.linux.power-management.general/43624
[3] https://patchwork.kernel.org/patch/3859721/

It's an RFC because we still need to tune it to achieve the
temperature stability that we are aiming for.  This RFC is just meant
to remove major road blocks as early as possible (if any).

We'll provide integration instructions soon so that people can test it
in their platforms.

Cheers,
Javi & Punit

Javi Merino (5):
  thermal: let governors have private data for each thermal zone
  thermal: let cooling devices operate on other units other than state
  thermal: add a basic cpu power actor
  thermal: introduce the Power Allocator governor
  thermal: add trace events to the power allocator governor

 drivers/acpi/fan.c                      |   22 +-
 drivers/acpi/processor_thermal.c        |   23 +-
 drivers/acpi/video.c                    |   22 +-
 drivers/platform/x86/acerhdf.c          |   24 +-
 drivers/platform/x86/intel_menlow.c     |   26 +-
 drivers/power/power_supply_core.c       |   24 +-
 drivers/thermal/Kconfig                 |   14 +
 drivers/thermal/Makefile                |    1 +
 drivers/thermal/cpu_cooling.c           |  427 +++++++++++++++++++++++++++++--
 drivers/thermal/db8500_thermal.c        |    2 +-
 drivers/thermal/fair_share.c            |    2 +-
 drivers/thermal/intel_powerclamp.c      |   24 +-
 drivers/thermal/power_allocator.c       |  413 ++++++++++++++++++++++++++++++
 drivers/thermal/step_wise.c             |    2 +-
 drivers/thermal/thermal_core.c          |   91 +++++--
 drivers/thermal/thermal_core.h          |    8 +
 include/linux/cpu_cooling.h             |   28 ++
 include/linux/thermal.h                 |   25 +-
 include/trace/events/thermal.h          |   34 +++
 include/trace/events/thermal_governor.h |   35 +++
 20 files changed, 1153 insertions(+), 94 deletions(-)
 create mode 100644 drivers/thermal/power_allocator.c
 create mode 100644 include/trace/events/thermal.h
 create mode 100644 include/trace/events/thermal_governor.h

-- 
1.7.9.5



^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC PATCH 1/5] thermal: let governors have private data for each thermal zone
  2014-05-06 12:06 [RFC PATCH 0/5] The power allocator thermal governor Javi Merino
@ 2014-05-06 12:06 ` Javi Merino
  2014-05-13 14:31   ` edubezval
  2014-05-06 12:06 ` [RFC PATCH 2/5] thermal: let cooling devices operate on other units other than state Javi Merino
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 48+ messages in thread
From: Javi Merino @ 2014-05-06 12:06 UTC (permalink / raw)
  To: linux-pm; +Cc: Punit.Agrawal, Javi Merino, Zhang Rui, Eduardo Valentin

A governor may need to store its current state between calls to
throttle().  That state depends on the thermal zone, so store it as
private data in struct thermal_zone_device.

The governors may have two new ops: bind_to_tz() and unbind_from_tz().
When provided, these functions let governors do some initialization
and teardown when they are bound/unbound to a tz and possibly store that
information in the governor_data field of the struct
thermal_zone_device.

Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
---
 drivers/thermal/thermal_core.c |   67 +++++++++++++++++++++++++++++++++++-----
 include/linux/thermal.h        |    3 ++
 2 files changed, 62 insertions(+), 8 deletions(-)

diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 71b0ec0c370d..ac8e5990a63e 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -72,6 +72,45 @@ static struct thermal_governor *__find_governor(const char *name)
 	return NULL;
 }
 
+/**
+ * thermal_set_governor() - Switch to another governor
+ * @tz:		a valid pointer to a struct thermal_zone_device
+ * @new_gov:	pointer to the new governor
+ *
+ * Change the governor of thermal zone @tz.
+ *
+ * Returns 0 on success, an error if the new governor's bind_to_tz() failed.
+ */
+static int thermal_set_governor(struct thermal_zone_device *tz,
+				struct thermal_governor *new_gov)
+{
+	int ret = 0;
+
+	if (tz->governor && tz->governor->unbind_from_tz)
+		tz->governor->unbind_from_tz(tz);
+
+	if (new_gov && new_gov->bind_to_tz) {
+		ret = new_gov->bind_to_tz(tz);
+		if (ret) {
+			/* Try to register the old governor again */
+			if (tz->governor && tz->governor->bind_to_tz) {
+				if (tz->governor->bind_to_tz(tz))
+					/*
+					 * The new governor failed to register
+					 * and the previous one failed as well
+					 */
+					tz->governor = NULL;
+			}
+
+			return ret;
+		}
+	}
+
+	tz->governor = new_gov;
+
+	return ret;
+}
+
 int thermal_register_governor(struct thermal_governor *governor)
 {
 	int err;
@@ -104,8 +143,12 @@ int thermal_register_governor(struct thermal_governor *governor)
 
 		name = pos->tzp->governor_name;
 
-		if (!strnicmp(name, governor->name, THERMAL_NAME_LENGTH))
-			pos->governor = governor;
+		if (!strnicmp(name, governor->name, THERMAL_NAME_LENGTH)) {
+			int ret = thermal_set_governor(pos, governor);
+			if (ret)
+				pr_warn("Failed to set governor %s for zone %d: %d\n",
+					governor->name, pos->id, ret);
+		}
 	}
 
 	mutex_unlock(&thermal_list_lock);
@@ -131,7 +174,7 @@ void thermal_unregister_governor(struct thermal_governor *governor)
 	list_for_each_entry(pos, &thermal_tz_list, node) {
 		if (!strnicmp(pos->governor->name, governor->name,
 						THERMAL_NAME_LENGTH))
-			pos->governor = NULL;
+			thermal_set_governor(pos, NULL);
 	}
 
 	mutex_unlock(&thermal_list_lock);
@@ -756,8 +799,9 @@ policy_store(struct device *dev, struct device_attribute *attr,
 	if (!gov)
 		goto exit;
 
-	tz->governor = gov;
-	ret = count;
+	ret = thermal_set_governor(tz, gov);
+	if (!ret)
+		ret = count;
 
 exit:
 	mutex_unlock(&thermal_governor_lock);
@@ -1452,6 +1496,7 @@ struct thermal_zone_device *thermal_zone_device_register(const char *type,
 	int result;
 	int count;
 	int passive = 0;
+	struct thermal_governor *governor;
 
 	if (type && strlen(type) >= THERMAL_NAME_LENGTH)
 		return ERR_PTR(-EINVAL);
@@ -1542,9 +1587,15 @@ struct thermal_zone_device *thermal_zone_device_register(const char *type,
 	mutex_lock(&thermal_governor_lock);
 
 	if (tz->tzp)
-		tz->governor = __find_governor(tz->tzp->governor_name);
+		governor = __find_governor(tz->tzp->governor_name);
 	else
-		tz->governor = def_governor;
+		governor = def_governor;
+
+	result = thermal_set_governor(tz, governor);
+	if (result) {
+		mutex_unlock(&thermal_governor_lock);
+		goto unregister;
+	}
 
 	mutex_unlock(&thermal_governor_lock);
 
@@ -1634,7 +1685,7 @@ void thermal_zone_device_unregister(struct thermal_zone_device *tz)
 		device_remove_file(&tz->device, &dev_attr_mode);
 	device_remove_file(&tz->device, &dev_attr_policy);
 	remove_trip_attrs(tz);
-	tz->governor = NULL;
+	thermal_set_governor(tz, NULL);
 
 	thermal_remove_hwmon_sysfs(tz);
 	release_idr(&thermal_tz_idr, &thermal_idr_lock, tz->id);
diff --git a/include/linux/thermal.h b/include/linux/thermal.h
index f7e11c7ea7d9..baac212f815e 100644
--- a/include/linux/thermal.h
+++ b/include/linux/thermal.h
@@ -177,6 +177,7 @@ struct thermal_zone_device {
 	struct thermal_zone_device_ops *ops;
 	const struct thermal_zone_params *tzp;
 	struct thermal_governor *governor;
+	void *governor_data;
 	struct list_head thermal_instances;
 	struct idr idr;
 	struct mutex lock; /* protect thermal_instances list */
@@ -187,6 +188,8 @@ struct thermal_zone_device {
 /* Structure that holds thermal governor information */
 struct thermal_governor {
 	char name[THERMAL_NAME_LENGTH];
+	int (*bind_to_tz)(struct thermal_zone_device *tz);
+	void (*unbind_from_tz)(struct thermal_zone_device *tz);
 	int (*throttle)(struct thermal_zone_device *tz, int trip);
 	struct list_head	governor_list;
 };
-- 
1.7.9.5



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH 2/5] thermal: let cooling devices operate on other units other than state
  2014-05-06 12:06 [RFC PATCH 0/5] The power allocator thermal governor Javi Merino
  2014-05-06 12:06 ` [RFC PATCH 1/5] thermal: let governors have private data for each thermal zone Javi Merino
@ 2014-05-06 12:06 ` Javi Merino
  2014-05-13 15:22   ` edubezval
  2014-05-06 12:06 ` [RFC PATCH 3/5] thermal: add a basic cpu power actor Javi Merino
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 48+ messages in thread
From: Javi Merino @ 2014-05-06 12:06 UTC (permalink / raw)
  To: linux-pm; +Cc: Punit.Agrawal, Javi Merino, Zhang Rui

Currently, cooling devices operate on state, even though some (like
the cpufreq cooling device) have exported functions to convert
frequency to state.  Generalize this interface so that governors can
operate on other units (e.g. frequency, power,...).

Suggested-by: Eduardo Valentin <edubezval@gmail.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
---
 drivers/acpi/fan.c                  |   22 ++++++++++++++------
 drivers/acpi/processor_thermal.c    |   23 +++++++++++++++------
 drivers/acpi/video.c                |   22 ++++++++++++++------
 drivers/platform/x86/acerhdf.c      |   24 ++++++++++++++++------
 drivers/platform/x86/intel_menlow.c |   26 +++++++++++++++++-------
 drivers/power/power_supply_core.c   |   24 ++++++++++++++++------
 drivers/thermal/cpu_cooling.c       |   38 ++++++++++++++++++++++++++---------
 drivers/thermal/db8500_thermal.c    |    2 +-
 drivers/thermal/fair_share.c        |    2 +-
 drivers/thermal/intel_powerclamp.c  |   24 ++++++++++++++++------
 drivers/thermal/step_wise.c         |    2 +-
 drivers/thermal/thermal_core.c      |   13 ++++++------
 include/linux/thermal.h             |   13 +++++++++---
 13 files changed, 169 insertions(+), 66 deletions(-)

diff --git a/drivers/acpi/fan.c b/drivers/acpi/fan.c
index 8acf53e62966..55346deb8202 100644
--- a/drivers/acpi/fan.c
+++ b/drivers/acpi/fan.c
@@ -79,20 +79,26 @@ static struct acpi_driver acpi_fan_driver = {
 
 /* thermal cooling device callbacks */
 static int fan_get_max_state(struct thermal_cooling_device *cdev, unsigned long
-			     *state)
+			     *state, enum thermal_cooling_unit unit)
 {
+	if (unit != THERMAL_UNIT_STATE)
+		return -EINVAL;
+
 	/* ACPI fan device only support two states: ON/OFF */
 	*state = 1;
 	return 0;
 }
 
 static int fan_get_cur_state(struct thermal_cooling_device *cdev, unsigned long
-			     *state)
+			     *state, enum thermal_cooling_unit unit)
 {
 	struct acpi_device *device = cdev->devdata;
 	int result;
 	int acpi_state = ACPI_STATE_D0;
 
+	if (unit != THERMAL_UNIT_STATE)
+		return -EINVAL;
+
 	if (!device)
 		return -EINVAL;
 
@@ -106,11 +112,15 @@ static int fan_get_cur_state(struct thermal_cooling_device *cdev, unsigned long
 }
 
 static int
-fan_set_cur_state(struct thermal_cooling_device *cdev, unsigned long state)
+fan_set_cur_state(struct thermal_cooling_device *cdev, unsigned long state,
+		  enum thermal_cooling_unit unit)
 {
 	struct acpi_device *device = cdev->devdata;
 	int result;
 
+	if (unit != THERMAL_UNIT_STATE)
+		return -EINVAL;
+
 	if (!device || (state != 0 && state != 1))
 		return -EINVAL;
 
@@ -121,9 +131,9 @@ fan_set_cur_state(struct thermal_cooling_device *cdev, unsigned long state)
 }
 
 static const struct thermal_cooling_device_ops fan_cooling_ops = {
-	.get_max_state = fan_get_max_state,
-	.get_cur_state = fan_get_cur_state,
-	.set_cur_state = fan_set_cur_state,
+	.get_max = fan_get_max_state,
+	.get_cur = fan_get_cur_state,
+	.set_cur = fan_set_cur_state,
 };
 
 /* --------------------------------------------------------------------------
diff --git a/drivers/acpi/processor_thermal.c b/drivers/acpi/processor_thermal.c
index e003663b2f8e..48179ed9afe5 100644
--- a/drivers/acpi/processor_thermal.c
+++ b/drivers/acpi/processor_thermal.c
@@ -200,11 +200,15 @@ static int acpi_processor_max_state(struct acpi_processor *pr)
 }
 static int
 processor_get_max_state(struct thermal_cooling_device *cdev,
-			unsigned long *state)
+			unsigned long *state,
+			enum thermal_cooling_unit unit)
 {
 	struct acpi_device *device = cdev->devdata;
 	struct acpi_processor *pr;
 
+	if (unit != THERMAL_UNIT_STATE)
+		return -EINVAL;
+
 	if (!device)
 		return -EINVAL;
 
@@ -218,11 +222,15 @@ processor_get_max_state(struct thermal_cooling_device *cdev,
 
 static int
 processor_get_cur_state(struct thermal_cooling_device *cdev,
-			unsigned long *cur_state)
+			unsigned long *cur_state,
+			enum thermal_cooling_unit unit)
 {
 	struct acpi_device *device = cdev->devdata;
 	struct acpi_processor *pr;
 
+	if (unit != THERMAL_UNIT_STATE)
+		return -EINVAL;
+
 	if (!device)
 		return -EINVAL;
 
@@ -238,13 +246,16 @@ processor_get_cur_state(struct thermal_cooling_device *cdev,
 
 static int
 processor_set_cur_state(struct thermal_cooling_device *cdev,
-			unsigned long state)
+			unsigned long state, enum thermal_cooling_unit unit)
 {
 	struct acpi_device *device = cdev->devdata;
 	struct acpi_processor *pr;
 	int result = 0;
 	int max_pstate;
 
+	if (unit != THERMAL_UNIT_STATE)
+		return -EINVAL;
+
 	if (!device)
 		return -EINVAL;
 
@@ -270,7 +281,7 @@ processor_set_cur_state(struct thermal_cooling_device *cdev,
 }
 
 const struct thermal_cooling_device_ops processor_cooling_ops = {
-	.get_max_state = processor_get_max_state,
-	.get_cur_state = processor_get_cur_state,
-	.set_cur_state = processor_set_cur_state,
+	.get_max = processor_get_max_state,
+	.get_cur = processor_get_cur_state,
+	.set_cur = processor_set_cur_state,
 };
diff --git a/drivers/acpi/video.c b/drivers/acpi/video.c
index 8b6990e417ec..f0aeb4d1ba91 100644
--- a/drivers/acpi/video.c
+++ b/drivers/acpi/video.c
@@ -282,23 +282,29 @@ static const struct backlight_ops acpi_backlight_ops = {
 
 /* thermal cooling device callbacks */
 static int video_get_max_state(struct thermal_cooling_device *cooling_dev, unsigned
-			       long *state)
+			       long *state, enum thermal_cooling_unit unit)
 {
 	struct acpi_device *device = cooling_dev->devdata;
 	struct acpi_video_device *video = acpi_driver_data(device);
 
+	if (unit != THERMAL_UNIT_STATE)
+		return -EINVAL;
+
 	*state = video->brightness->count - 3;
 	return 0;
 }
 
 static int video_get_cur_state(struct thermal_cooling_device *cooling_dev, unsigned
-			       long *state)
+			       long *state, enum thermal_cooling_unit unit)
 {
 	struct acpi_device *device = cooling_dev->devdata;
 	struct acpi_video_device *video = acpi_driver_data(device);
 	unsigned long long level;
 	int offset;
 
+	if (unit != THERMAL_UNIT_STATE)
+		return -EINVAL;
+
 	if (acpi_video_device_lcd_get_level_current(video, &level, false))
 		return -EINVAL;
 	for (offset = 2; offset < video->brightness->count; offset++)
@@ -311,12 +317,16 @@ static int video_get_cur_state(struct thermal_cooling_device *cooling_dev, unsig
 }
 
 static int
-video_set_cur_state(struct thermal_cooling_device *cooling_dev, unsigned long state)
+video_set_cur_state(struct thermal_cooling_device *cooling_dev,
+		unsigned long state, enum thermal_cooling_unit unit)
 {
 	struct acpi_device *device = cooling_dev->devdata;
 	struct acpi_video_device *video = acpi_driver_data(device);
 	int level;
 
+	if (unit != THERMAL_UNIT_STATE)
+		return -EINVAL;
+
 	if (state >= video->brightness->count - 2)
 		return -EINVAL;
 
@@ -326,9 +336,9 @@ video_set_cur_state(struct thermal_cooling_device *cooling_dev, unsigned long st
 }
 
 static const struct thermal_cooling_device_ops video_cooling_ops = {
-	.get_max_state = video_get_max_state,
-	.get_cur_state = video_get_cur_state,
-	.set_cur_state = video_set_cur_state,
+	.get_max = video_get_max_state,
+	.get_cur = video_get_cur_state,
+	.set_cur = video_set_cur_state,
 };
 
 /*
diff --git a/drivers/platform/x86/acerhdf.c b/drivers/platform/x86/acerhdf.c
index f94467c05225..0bb85e240d93 100644
--- a/drivers/platform/x86/acerhdf.c
+++ b/drivers/platform/x86/acerhdf.c
@@ -439,18 +439,26 @@ static struct thermal_zone_device_ops acerhdf_dev_ops = {
  * get maximal fan cooling state
  */
 static int acerhdf_get_max_state(struct thermal_cooling_device *cdev,
-				 unsigned long *state)
+				 unsigned long *state,
+				 enum thermal_cooling_unit unit)
 {
+	if (unit != THERMAL_UNIT_STATE)
+		return -EINVAL;
+
 	*state = 1;
 
 	return 0;
 }
 
 static int acerhdf_get_cur_state(struct thermal_cooling_device *cdev,
-				 unsigned long *state)
+				 unsigned long *state,
+				 enum thermal_cooling_unit unit)
 {
 	int err = 0, tmp;
 
+	if (unit != THERMAL_UNIT_STATE)
+		return -EINVAL;
+
 	err = acerhdf_get_fanstate(&tmp);
 	if (err)
 		return err;
@@ -461,10 +469,14 @@ static int acerhdf_get_cur_state(struct thermal_cooling_device *cdev,
 
 /* change current fan state - is overwritten when running in kernel mode */
 static int acerhdf_set_cur_state(struct thermal_cooling_device *cdev,
-				 unsigned long state)
+				 unsigned long state,
+				 enum thermal_cooling_unit unit)
 {
 	int cur_temp, cur_state, err = 0;
 
+	if (unit != THERMAL_UNIT_STATE)
+		return -EINVAL;
+
 	if (!kernelmode)
 		return 0;
 
@@ -498,9 +510,9 @@ err_out:
 
 /* bind fan callbacks to fan device */
 static struct thermal_cooling_device_ops acerhdf_cooling_ops = {
-	.get_max_state = acerhdf_get_max_state,
-	.get_cur_state = acerhdf_get_cur_state,
-	.set_cur_state = acerhdf_set_cur_state,
+	.get_max = acerhdf_get_max_state,
+	.get_cur = acerhdf_get_cur_state,
+	.set_cur = acerhdf_set_cur_state,
 };
 
 /* suspend / resume functionality */
diff --git a/drivers/platform/x86/intel_menlow.c b/drivers/platform/x86/intel_menlow.c
index e8b46d2c468c..9792da407c46 100644
--- a/drivers/platform/x86/intel_menlow.c
+++ b/drivers/platform/x86/intel_menlow.c
@@ -61,7 +61,8 @@ static void intel_menlow_unregister_sensor(void);
  * GTHS returning '0' would mean that no bandwidth control states are supported
  */
 static int memory_get_max_bandwidth(struct thermal_cooling_device *cdev,
-				    unsigned long *max_state)
+				unsigned long *max_state,
+				enum thermal_cooling_unit unit)
 {
 	struct acpi_device *device = cdev->devdata;
 	acpi_handle handle = device->handle;
@@ -70,6 +71,9 @@ static int memory_get_max_bandwidth(struct thermal_cooling_device *cdev,
 	union acpi_object arg;
 	acpi_status status = AE_OK;
 
+	if (unit != THERMAL_UNIT_STATE)
+		return -EINVAL;
+
 	arg_list.count = 1;
 	arg_list.pointer = &arg;
 	arg.type = ACPI_TYPE_INTEGER;
@@ -87,7 +91,8 @@ static int memory_get_max_bandwidth(struct thermal_cooling_device *cdev,
 }
 
 static int memory_get_cur_bandwidth(struct thermal_cooling_device *cdev,
-				    unsigned long *value)
+				    unsigned long *value,
+				    enum thermal_cooling_unit unit)
 {
 	struct acpi_device *device = cdev->devdata;
 	acpi_handle handle = device->handle;
@@ -96,6 +101,9 @@ static int memory_get_cur_bandwidth(struct thermal_cooling_device *cdev,
 	union acpi_object arg;
 	acpi_status status = AE_OK;
 
+	if (unit != THERMAL_UNIT_STATE)
+		return -EINVAL;
+
 	arg_list.count = 1;
 	arg_list.pointer = &arg;
 	arg.type = ACPI_TYPE_INTEGER;
@@ -110,7 +118,8 @@ static int memory_get_cur_bandwidth(struct thermal_cooling_device *cdev,
 }
 
 static int memory_set_cur_bandwidth(struct thermal_cooling_device *cdev,
-				    unsigned long state)
+				    unsigned long state,
+				    enum thermal_cooling_unit unit)
 {
 	struct acpi_device *device = cdev->devdata;
 	acpi_handle handle = device->handle;
@@ -120,7 +129,10 @@ static int memory_set_cur_bandwidth(struct thermal_cooling_device *cdev,
 	unsigned long long temp;
 	unsigned long max_state;
 
-	if (memory_get_max_bandwidth(cdev, &max_state))
+	if (unit != THERMAL_UNIT_STATE)
+		return -EINVAL;
+
+	if (memory_get_max_bandwidth(cdev, &max_state, THERMAL_UNIT_STATE))
 		return -EFAULT;
 
 	if (state > max_state)
@@ -143,9 +155,9 @@ static int memory_set_cur_bandwidth(struct thermal_cooling_device *cdev,
 }
 
 static struct thermal_cooling_device_ops memory_cooling_ops = {
-	.get_max_state = memory_get_max_bandwidth,
-	.get_cur_state = memory_get_cur_bandwidth,
-	.set_cur_state = memory_set_cur_bandwidth,
+	.get_max = memory_get_max_bandwidth,
+	.get_cur = memory_get_cur_bandwidth,
+	.set_cur = memory_set_cur_bandwidth,
 };
 
 /*
diff --git a/drivers/power/power_supply_core.c b/drivers/power/power_supply_core.c
index 26606641fe44..69dc2901a788 100644
--- a/drivers/power/power_supply_core.c
+++ b/drivers/power/power_supply_core.c
@@ -440,12 +440,16 @@ static void psy_unregister_thermal(struct power_supply *psy)
 
 /* thermal cooling device callbacks */
 static int ps_get_max_charge_cntl_limit(struct thermal_cooling_device *tcd,
-					unsigned long *state)
+					unsigned long *state,
+					enum thermal_cooling_unit unit)
 {
 	struct power_supply *psy;
 	union power_supply_propval val;
 	int ret;
 
+	if (unit != THERMAL_UNIT_STATE)
+		return -EINVAL;
+
 	psy = tcd->devdata;
 	ret = psy->get_property(psy,
 		POWER_SUPPLY_PROP_CHARGE_CONTROL_LIMIT_MAX, &val);
@@ -456,12 +460,16 @@ static int ps_get_max_charge_cntl_limit(struct thermal_cooling_device *tcd,
 }
 
 static int ps_get_cur_chrage_cntl_limit(struct thermal_cooling_device *tcd,
-					unsigned long *state)
+					unsigned long *state,
+					enum thermal_cooling_unit unit)
 {
 	struct power_supply *psy;
 	union power_supply_propval val;
 	int ret;
 
+	if (unit != THERMAL_UNIT_STATE)
+		return -EINVAL;
+
 	psy = tcd->devdata;
 	ret = psy->get_property(psy,
 		POWER_SUPPLY_PROP_CHARGE_CONTROL_LIMIT, &val);
@@ -472,12 +480,16 @@ static int ps_get_cur_chrage_cntl_limit(struct thermal_cooling_device *tcd,
 }
 
 static int ps_set_cur_charge_cntl_limit(struct thermal_cooling_device *tcd,
-					unsigned long state)
+					unsigned long state,
+					enum thermal_cooling_unit unit)
 {
 	struct power_supply *psy;
 	union power_supply_propval val;
 	int ret;
 
+	if (unit != THERMAL_UNIT_STATE)
+		return -EINVAL;
+
 	psy = tcd->devdata;
 	val.intval = state;
 	ret = psy->set_property(psy,
@@ -487,9 +499,9 @@ static int ps_set_cur_charge_cntl_limit(struct thermal_cooling_device *tcd,
 }
 
 static struct thermal_cooling_device_ops psy_tcd_ops = {
-	.get_max_state = ps_get_max_charge_cntl_limit,
-	.get_cur_state = ps_get_cur_chrage_cntl_limit,
-	.set_cur_state = ps_set_cur_charge_cntl_limit,
+	.get_max = ps_get_max_charge_cntl_limit,
+	.get_cur = ps_get_cur_chrage_cntl_limit,
+	.set_cur = ps_set_cur_charge_cntl_limit,
 };
 
 static int psy_register_cooler(struct power_supply *psy)
diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index 4246262c4bd2..f3f4e6b5798e 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -348,14 +348,17 @@ static int cpufreq_thermal_notifier(struct notifier_block *nb,
  * cpufreq_get_max_state - callback function to get the max cooling state.
  * @cdev: thermal cooling device pointer.
  * @state: fill this variable with the max cooling state.
+ * @unit: the units in which @state should be in.
  *
- * Callback for the thermal cooling device to return the cpufreq
- * max cooling state.
+ * Callback for the thermal cooling device to return the cpufreq max
+ * cooling state.  Currently only THERMAL_UNIT_STATE is valid for
+ * @unit.
  *
  * Return: 0 on success, an error code otherwise.
  */
 static int cpufreq_get_max_state(struct thermal_cooling_device *cdev,
-				 unsigned long *state)
+				unsigned long *state,
+				enum thermal_cooling_unit unit)
 {
 	struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
 	struct cpumask *mask = &cpufreq_device->allowed_cpus;
@@ -363,6 +366,9 @@ static int cpufreq_get_max_state(struct thermal_cooling_device *cdev,
 	unsigned int count = 0;
 	int ret;
 
+	if (unit != THERMAL_UNIT_STATE)
+		return -EINVAL;
+
 	cpu = cpumask_any(mask);
 
 	ret = get_property(cpu, 0, &count, GET_MAXL);
@@ -377,17 +383,23 @@ static int cpufreq_get_max_state(struct thermal_cooling_device *cdev,
  * cpufreq_get_cur_state - callback function to get the current cooling state.
  * @cdev: thermal cooling device pointer.
  * @state: fill this variable with the current cooling state.
+ * @unit: the units in which state should be in
  *
  * Callback for the thermal cooling device to return the cpufreq
- * current cooling state.
+ * current cooling state.  Currently only THERMAL_UNIT_STATE is valid
+ * for @unit.
  *
  * Return: 0 on success, an error code otherwise.
  */
 static int cpufreq_get_cur_state(struct thermal_cooling_device *cdev,
-				 unsigned long *state)
+				 unsigned long *state,
+				 enum thermal_cooling_unit unit)
 {
 	struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
 
+	if (unit != THERMAL_UNIT_STATE)
+		return -EINVAL;
+
 	*state = cpufreq_device->cpufreq_state;
 
 	return 0;
@@ -397,25 +409,31 @@ static int cpufreq_get_cur_state(struct thermal_cooling_device *cdev,
  * cpufreq_set_cur_state - callback function to set the current cooling state.
  * @cdev: thermal cooling device pointer.
  * @state: set this variable to the current cooling state.
+ * @unit: the units in which @state should be in.
  *
  * Callback for the thermal cooling device to change the cpufreq
- * current cooling state.
+ * current cooling state.  Currently only THERMAL_UNIT_STATE is valid for
+ * @unit.
  *
  * Return: 0 on success, an error code otherwise.
  */
 static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
-				 unsigned long state)
+				 unsigned long state,
+				 enum thermal_cooling_unit unit)
 {
 	struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
 
+	if (unit != THERMAL_UNIT_STATE)
+		return -EINVAL;
+
 	return cpufreq_apply_cooling(cpufreq_device, state);
 }
 
 /* Bind cpufreq callbacks to thermal cooling device ops */
 static struct thermal_cooling_device_ops const cpufreq_cooling_ops = {
-	.get_max_state = cpufreq_get_max_state,
-	.get_cur_state = cpufreq_get_cur_state,
-	.set_cur_state = cpufreq_set_cur_state,
+	.get_max = cpufreq_get_max_state,
+	.get_cur = cpufreq_get_cur_state,
+	.set_cur = cpufreq_set_cur_state,
 };
 
 /* Notifier for cpufreq policy change */
diff --git a/drivers/thermal/db8500_thermal.c b/drivers/thermal/db8500_thermal.c
index 1e3b3bf9f993..0842a2ef88b4 100644
--- a/drivers/thermal/db8500_thermal.c
+++ b/drivers/thermal/db8500_thermal.c
@@ -67,7 +67,7 @@ static int db8500_cdev_bind(struct thermal_zone_device *thermal,
 	unsigned long max_state, upper, lower;
 	int i, ret = -EINVAL;
 
-	cdev->ops->get_max_state(cdev, &max_state);
+	cdev->ops->get_max(cdev, &max_state, THERMAL_UNIT_STATE);
 
 	for (i = 0; i < ptrips->num_trips; i++) {
 		if (db8500_thermal_match_cdev(cdev, &ptrips->trip_points[i]))
diff --git a/drivers/thermal/fair_share.c b/drivers/thermal/fair_share.c
index 944ba2f340c8..8e6eb4883bb6 100644
--- a/drivers/thermal/fair_share.c
+++ b/drivers/thermal/fair_share.c
@@ -51,7 +51,7 @@ static long get_target_state(struct thermal_zone_device *tz,
 {
 	unsigned long max_state;
 
-	cdev->ops->get_max_state(cdev, &max_state);
+	cdev->ops->get_max(cdev, &max_state, THERMAL_UNIT_STATE);
 
 	return (long)(weight * level * max_state) / (100 * tz->trips);
 }
diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
index a084325f1386..c32bee16b17a 100644
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -616,16 +616,24 @@ static struct notifier_block powerclamp_cpu_notifier = {
 };
 
 static int powerclamp_get_max_state(struct thermal_cooling_device *cdev,
-				 unsigned long *state)
+				unsigned long *state,
+				enum thermal_cooling_unit unit)
 {
+	if (unit != THERMAL_UNIT_STATE)
+		return -EINVAL;
+
 	*state = MAX_TARGET_RATIO;
 
 	return 0;
 }
 
 static int powerclamp_get_cur_state(struct thermal_cooling_device *cdev,
-				 unsigned long *state)
+				unsigned long *state,
+				enum thermal_cooling_unit unit)
 {
+	if (unit != THERMAL_UNIT_STATE)
+		return -EINVAL;
+
 	if (true == clamping)
 		*state = pkg_cstate_ratio_cur;
 	else
@@ -636,10 +644,14 @@ static int powerclamp_get_cur_state(struct thermal_cooling_device *cdev,
 }
 
 static int powerclamp_set_cur_state(struct thermal_cooling_device *cdev,
-				 unsigned long new_target_ratio)
+				unsigned long new_target_ratio,
+				enum thermal_cooling_unit unit)
 {
 	int ret = 0;
 
+	if (unit != THERMAL_UNIT_STATE)
+		return -EINVAL;
+
 	new_target_ratio = clamp(new_target_ratio, 0UL,
 				(unsigned long) (MAX_TARGET_RATIO-1));
 	if (set_target_ratio == 0 && new_target_ratio > 0) {
@@ -663,9 +675,9 @@ exit_set:
 
 /* bind to generic thermal layer as cooling device*/
 static struct thermal_cooling_device_ops powerclamp_cooling_ops = {
-	.get_max_state = powerclamp_get_max_state,
-	.get_cur_state = powerclamp_get_cur_state,
-	.set_cur_state = powerclamp_set_cur_state,
+	.get_max = powerclamp_get_max_state,
+	.get_cur = powerclamp_get_cur_state,
+	.set_cur = powerclamp_set_cur_state,
 };
 
 /* runs on Nehalem and later */
diff --git a/drivers/thermal/step_wise.c b/drivers/thermal/step_wise.c
index f251521baaa2..2a2cd4eba3c3 100644
--- a/drivers/thermal/step_wise.c
+++ b/drivers/thermal/step_wise.c
@@ -58,7 +58,7 @@ static unsigned long get_target_state(struct thermal_instance *instance,
 	 * Otherwise, we use the current state of the
 	 * cdev in use to determine the next_target.
 	 */
-	cdev->ops->get_cur_state(cdev, &cur_state);
+	cdev->ops->get_cur(cdev, &cur_state, THERMAL_UNIT_STATE);
 	next_target = instance->target;
 	dev_dbg(&cdev->device, "cur_state=%ld\n", cur_state);
 
diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index ac8e5990a63e..5b46603cc7cd 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -871,7 +871,7 @@ thermal_cooling_device_max_state_show(struct device *dev,
 	unsigned long state;
 	int ret;
 
-	ret = cdev->ops->get_max_state(cdev, &state);
+	ret = cdev->ops->get_max(cdev, &state, THERMAL_UNIT_STATE);
 	if (ret)
 		return ret;
 	return sprintf(buf, "%ld\n", state);
@@ -885,7 +885,7 @@ thermal_cooling_device_cur_state_show(struct device *dev,
 	unsigned long state;
 	int ret;
 
-	ret = cdev->ops->get_cur_state(cdev, &state);
+	ret = cdev->ops->get_cur(cdev, &state, THERMAL_UNIT_STATE);
 	if (ret)
 		return ret;
 	return sprintf(buf, "%ld\n", state);
@@ -906,7 +906,7 @@ thermal_cooling_device_cur_state_store(struct device *dev,
 	if ((long)state < 0)
 		return -EINVAL;
 
-	result = cdev->ops->set_cur_state(cdev, state);
+	result = cdev->ops->set_cur(cdev, state, THERMAL_UNIT_STATE);
 	if (result)
 		return result;
 	return count;
@@ -983,7 +983,7 @@ int thermal_zone_bind_cooling_device(struct thermal_zone_device *tz,
 	if (tz != pos1 || cdev != pos2)
 		return -EINVAL;
 
-	cdev->ops->get_max_state(cdev, &max_state);
+	cdev->ops->get_max(cdev, &max_state, THERMAL_UNIT_STATE);
 
 	/* lower default 0, upper default max_state */
 	lower = lower == THERMAL_NO_LIMIT ? 0 : lower;
@@ -1143,8 +1143,7 @@ __thermal_cooling_device_register(struct device_node *np,
 	if (type && strlen(type) >= THERMAL_NAME_LENGTH)
 		return ERR_PTR(-EINVAL);
 
-	if (!ops || !ops->get_max_state || !ops->get_cur_state ||
-	    !ops->set_cur_state)
+	if (!ops || !ops->get_max || !ops->get_cur || !ops->set_cur)
 		return ERR_PTR(-EINVAL);
 
 	cdev = kzalloc(sizeof(struct thermal_cooling_device), GFP_KERNEL);
@@ -1329,7 +1328,7 @@ void thermal_cdev_update(struct thermal_cooling_device *cdev)
 			target = instance->target;
 	}
 	mutex_unlock(&cdev->lock);
-	cdev->ops->set_cur_state(cdev, target);
+	cdev->ops->set_cur(cdev, target, THERMAL_UNIT_STATE);
 	cdev->updated = true;
 	dev_dbg(&cdev->device, "set to state %lu\n", target);
 }
diff --git a/include/linux/thermal.h b/include/linux/thermal.h
index baac212f815e..8d183b8255eb 100644
--- a/include/linux/thermal.h
+++ b/include/linux/thermal.h
@@ -106,6 +106,10 @@ enum {
 };
 #define THERMAL_GENL_CMD_MAX (__THERMAL_GENL_CMD_MAX - 1)
 
+enum thermal_cooling_unit {
+	THERMAL_UNIT_STATE,
+};
+
 struct thermal_zone_device_ops {
 	int (*bind) (struct thermal_zone_device *,
 		     struct thermal_cooling_device *);
@@ -135,9 +139,12 @@ struct thermal_zone_device_ops {
 };
 
 struct thermal_cooling_device_ops {
-	int (*get_max_state) (struct thermal_cooling_device *, unsigned long *);
-	int (*get_cur_state) (struct thermal_cooling_device *, unsigned long *);
-	int (*set_cur_state) (struct thermal_cooling_device *, unsigned long);
+	int (*get_max)(struct thermal_cooling_device *, unsigned long *,
+			enum thermal_cooling_unit);
+	int (*get_cur)(struct thermal_cooling_device *, unsigned long *,
+			enum thermal_cooling_unit);
+	int (*set_cur)(struct thermal_cooling_device *, unsigned long,
+			enum thermal_cooling_unit);
 };
 
 struct thermal_cooling_device {
-- 
1.7.9.5



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH 3/5] thermal: add a basic cpu power actor
  2014-05-06 12:06 [RFC PATCH 0/5] The power allocator thermal governor Javi Merino
  2014-05-06 12:06 ` [RFC PATCH 1/5] thermal: let governors have private data for each thermal zone Javi Merino
  2014-05-06 12:06 ` [RFC PATCH 2/5] thermal: let cooling devices operate on other units other than state Javi Merino
@ 2014-05-06 12:06 ` Javi Merino
  2014-05-13 22:57   ` Eduardo Valentin
  2014-05-06 12:06 ` [RFC PATCH 4/5] thermal: introduce the Power Allocator governor Javi Merino
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 48+ messages in thread
From: Javi Merino @ 2014-05-06 12:06 UTC (permalink / raw)
  To: linux-pm
  Cc: Punit.Agrawal, Javi Merino, Zhang Rui, Eduardo Valentin, Punit Agrawal

Introduce a new thermal_cooling_unit: THERMAL_UNIT_POWER and let
cpu_cooling device operate using it as well as "states".  This allows
governors that call cpufreq_get_max/cpufreq_get_cur/cpufreq_set_cur()
with THERMAL_UNIT_POWER to use cpufreq cooling devices.

If the cpu cooling device is registered using
{of_,}cpufreq_power_actor_register() instead of
{of_,}cpufreq_cooling_register() it uses the current frequency (as
reported by cpufreq) as well as load and OPPs for the power
calculations.  The cpus must have registered their OPPs in the OPP
library.

Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
---
 drivers/thermal/cpu_cooling.c  |  430 ++++++++++++++++++++++++++++++++++++----
 drivers/thermal/thermal_core.c |    4 +-
 include/linux/cpu_cooling.h    |   28 +++
 include/linux/thermal.h        |    4 +
 4 files changed, 422 insertions(+), 44 deletions(-)

diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index f3f4e6b5798e..8052cd1f4a39 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -24,10 +24,27 @@
 #include <linux/thermal.h>
 #include <linux/cpufreq.h>
 #include <linux/err.h>
+#include <linux/pm_opp.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
 #include <linux/cpu_cooling.h>
 
+/* Maximum number of opps we can track */
+#define MAX_NUM_OPPS 32
+
+/**
+ * struct power_table - frequency to power conversion
+ * @frequency:	frequency in KHz
+ * @power:	power in mW
+ *
+ * This structure is built when the cooling device registers and helps
+ * in translating frequency to power and viceversa.
+ */
+struct power_table {
+	u32 frequency;
+	u32 power;
+};
+
 /**
  * struct cpufreq_cooling_device - data for cooling device with cpufreq
  * @id: unique integer value corresponding to each cpufreq_cooling_device
@@ -39,6 +56,14 @@
  * @cpufreq_val: integer value representing the absolute value of the clipped
  *	frequency.
  * @allowed_cpus: all the cpus involved for this cpufreq_cooling_device.
+ * @power_table: array of struct power_table for frequency to power conversion
+ * @power_table_entries: number of entries in the @power_table array
+ * @time_in_idle: previous reading of the absolute time that this cpu was idle
+ * @time_in_idle_timestamp: wall time of the last invocation of
+ *	get_cpu_idle_time_us()
+ * @last_load: load measured by the latest call to cpufreq_get_cur()
+ * @freq: frequency in KHz of the cpus represented by the cooling device
+ * @capacitance: the dynamic power coefficient of these cpus
  *
  * This structure is required for keeping information of each
  * cpufreq_cooling_device registered. In order to prevent corruption of this a
@@ -50,11 +75,19 @@ struct cpufreq_cooling_device {
 	unsigned int cpufreq_state;
 	unsigned int cpufreq_val;
 	struct cpumask allowed_cpus;
+	struct power_table power_table[MAX_NUM_OPPS];
+	int power_table_entries;
+	u64 time_in_idle[NR_CPUS];
+	u64 time_in_idle_timestamp[NR_CPUS];
+	u32 last_load;
+	u32 freq;
+	u32 capacitance;
 };
 static DEFINE_IDR(cpufreq_idr);
 static DEFINE_MUTEX(cooling_cpufreq_lock);
 
 static unsigned int cpufreq_dev_count;
+static bool power_actor_notifier_registered;
 
 /* notify_table passes value to the CPUFREQ_ADJUST callback function. */
 #define NOTIFY_INVALID NULL
@@ -115,6 +148,59 @@ static int is_cpufreq_valid(int cpu)
 	return !cpufreq_get_policy(&policy, cpu);
 }
 
+static u32 cpu_freq_to_power(struct cpufreq_cooling_device *cpufreq_device,
+			u32 freq)
+{
+	int i;
+	struct power_table *pt = cpufreq_device->power_table;
+
+	for (i = 0; i < cpufreq_device->power_table_entries - 1; i++)
+		if (freq <= pt[i].frequency)
+			break;
+
+	return pt[i].power;
+}
+
+static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_device,
+			u32 power)
+{
+	int i;
+	struct power_table *pt = cpufreq_device->power_table;
+
+	for (i = 0; i < cpufreq_device->power_table_entries - 1; i++)
+		if (power <= pt[i].power)
+			break;
+
+	return pt[i].frequency;
+}
+
+/**
+ * get_load - get load for a cpu since last updated
+ * @cpufreq_device: struct cpufreq_cooling_device for this cooling device
+ * @cpu: cpu number
+ *
+ * Return the average load of cpu @cpu in percentage since this
+ * function was last called.
+ */
+static u32 get_load(struct cpufreq_cooling_device *cpufreq_device, int cpu)
+{
+	u32 load;
+	u64 now, now_idle, delta_time, delta_idle;
+
+	now_idle = get_cpu_idle_time(cpu, &now, 0);
+	delta_idle = now_idle - cpufreq_device->time_in_idle[cpu];
+	delta_time = now - cpufreq_device->time_in_idle_timestamp[cpu];
+
+	if (delta_time <= delta_idle)
+		load = 0;
+	else
+		load = div64_u64(100 * (delta_time - delta_idle), delta_time);
+
+	cpufreq_device->time_in_idle[cpu] = now_idle;
+	cpufreq_device->time_in_idle_timestamp[cpu] = now;
+	return load;
+}
+
 enum cpufreq_cooling_property {
 	GET_LEVEL,
 	GET_FREQ,
@@ -342,98 +428,193 @@ static int cpufreq_thermal_notifier(struct notifier_block *nb,
 	return 0;
 }
 
+/**
+ * cpufreq_frequency_change - notifier callback for cpufreq frequency changes
+ * @nb:		struct notifier_block * with callback info
+ * @event:	value showing cpufreq event for which this function invoked
+ * @data:	callback-specific data
+ *
+ * Callback to get notifications of frequency changes.  In the
+ * CPUFREQ_POSTCHANGE @event we store the new frequency so that
+ * cpufreq_get_cur() knows the current frequency and can convert it
+ * into power.
+ */
+static int cpufreq_frequency_change(struct notifier_block *nb,
+				unsigned long event, void *data)
+{
+	struct thermal_cooling_device *cdev;
+	struct cpufreq_freqs *freqs = data;
+
+	/* Only update frequency on postchange */
+	if (event != CPUFREQ_POSTCHANGE)
+		return NOTIFY_DONE;
+
+	mutex_lock(&thermal_list_lock);
+
+	list_for_each_entry(cdev, &thermal_cdev_list, node) {
+		struct cpufreq_cooling_device *cpufreq_device;
+
+		if (strncmp(cdev->type, "thermal-cpufreq", THERMAL_NAME_LENGTH))
+			continue;
+
+		cpufreq_device = cdev->devdata;
+
+		if (cpumask_test_cpu(freqs->cpu, &cpufreq_device->allowed_cpus))
+			cpufreq_device->freq = freqs->new;
+	}
+
+	mutex_unlock(&thermal_list_lock);
+
+	return NOTIFY_OK;
+}
+
 /* cpufreq cooling device callback functions are defined below */
 
 /**
- * cpufreq_get_max_state - callback function to get the max cooling state.
+ * cpufreq_get_max - callback function to get the max cooling state/power.
  * @cdev: thermal cooling device pointer.
- * @state: fill this variable with the max cooling state.
- * @unit: the units in which @state should be in.
+ * @val: fill this variable with the max cooling state/power.
+ * @unit: whether to put state or power in @val.
  *
  * Callback for the thermal cooling device to return the cpufreq max
- * cooling state.  Currently only THERMAL_UNIT_STATE is valid for
- * @unit.
+ * cooling state/power.  If @unit is THERMAL_UNIT_STATE, @val contains
+ * the maximum cooling state; if @unit is THERMAL_UNIT_POWER, @val
+ * maximum is power in milliwatts.
  *
  * Return: 0 on success, an error code otherwise.
  */
-static int cpufreq_get_max_state(struct thermal_cooling_device *cdev,
-				unsigned long *state,
-				enum thermal_cooling_unit unit)
+static int cpufreq_get_max(struct thermal_cooling_device *cdev,
+			   unsigned long *val, enum thermal_cooling_unit unit)
 {
 	struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
 	struct cpumask *mask = &cpufreq_device->allowed_cpus;
 	unsigned int cpu;
 	unsigned int count = 0;
-	int ret;
-
-	if (unit != THERMAL_UNIT_STATE)
-		return -EINVAL;
+	int ret = 0;
 
 	cpu = cpumask_any(mask);
 
-	ret = get_property(cpu, 0, &count, GET_MAXL);
+	if ((unit == THERMAL_UNIT_POWER) &&
+		(cpufreq_device->power_table_entries)) {
+		unsigned int num_cpus;
+		unsigned long max_freq;
+
+		max_freq = get_cpu_frequency(cpu, 0);
+		if (!max_freq)
+			return -EINVAL;
 
-	if (count > 0)
-		*state = count;
+		num_cpus = cpumask_weight(mask);
+		*val = cpu_freq_to_power(cdev->devdata, max_freq) * num_cpus;
+	} else if (unit == THERMAL_UNIT_STATE) {
+		ret = get_property(cpu, 0, &count, GET_MAXL);
+
+		if (count > 0)
+			*val = count;
+	} else {
+		ret = -EINVAL;
+	}
 
 	return ret;
 }
 
 /**
- * cpufreq_get_cur_state - callback function to get the current cooling state.
+ * cpufreq_get_cur - callback function to get the current cooling state/power.
  * @cdev: thermal cooling device pointer.
- * @state: fill this variable with the current cooling state.
- * @unit: the units in which state should be in
+ * @val: fill this variable with the current cooling state/power.
+ * @unit: whether to put state or power in @val.
  *
  * Callback for the thermal cooling device to return the cpufreq
- * current cooling state.  Currently only THERMAL_UNIT_STATE is valid
- * for @unit.
+ * current cooling state/power.  If @unit is THERMAL_UNIT_STATE, @val
+ * contains the current cooling state; if @unit is THERMAL_UNIT_POWER,
+ * @val is current power consumption in milliwatts.
  *
  * Return: 0 on success, an error code otherwise.
  */
-static int cpufreq_get_cur_state(struct thermal_cooling_device *cdev,
-				 unsigned long *state,
-				 enum thermal_cooling_unit unit)
+static int cpufreq_get_cur(struct thermal_cooling_device *cdev,
+			   unsigned long *val, enum thermal_cooling_unit unit)
 {
 	struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
 
-	if (unit != THERMAL_UNIT_STATE)
-		return -EINVAL;
+	if ((unit == THERMAL_UNIT_POWER) &&
+		(cpufreq_device->power_table_entries)) {
+		int cpu;
+		u32 total_load = 0, raw_cpu_power, power = 0;
+
+		raw_cpu_power = cpu_freq_to_power(cpufreq_device,
+						cpufreq_device->freq);
+
+		for_each_cpu(cpu, &cpufreq_device->allowed_cpus) {
+			u32 load;
 
-	*state = cpufreq_device->cpufreq_state;
+			if (!cpu_online(cpu))
+				continue;
+
+			load = get_load(cpufreq_device, cpu);
+			power += (raw_cpu_power * load) / 100;
+			total_load += load;
+		}
+
+		cpufreq_device->last_load = total_load;
+		*val = power;
+	} else if (unit == THERMAL_UNIT_STATE) {
+		*val = cpufreq_device->cpufreq_state;
+	} else {
+		return -EINVAL;
+	}
 
 	return 0;
 }
 
 /**
- * cpufreq_set_cur_state - callback function to set the current cooling state.
+ * cpufreq_set_cur - callback function to set the current cooling state/power.
  * @cdev: thermal cooling device pointer.
- * @state: set this variable to the current cooling state.
- * @unit: the units in which @state should be in.
+ * @val: set this variable to the current cooling state/power.
+ * @unit: whether @val is a cooling state or power.
  *
  * Callback for the thermal cooling device to change the cpufreq
- * current cooling state.  Currently only THERMAL_UNIT_STATE is valid for
- * @unit.
+ * current cooling state/power.  If @unit is THERMAL_UNIT_STATE, @val
+ * is a cooling state; if @unit is THERMAL_UNIT_POWER, @val is the target
+ * power consumption in milliwatts.
  *
  * Return: 0 on success, an error code otherwise.
  */
-static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
-				 unsigned long state,
-				 enum thermal_cooling_unit unit)
+static int cpufreq_set_cur(struct thermal_cooling_device *cdev,
+			   unsigned long val, enum thermal_cooling_unit unit)
 {
 	struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
-
-	if (unit != THERMAL_UNIT_STATE)
+	unsigned long state;
+
+	if ((unit == THERMAL_UNIT_POWER) &&
+		(cpufreq_device->power_table_entries)) {
+		unsigned int cpu, freq;
+		u32 normalised_power, last_load;
+
+		cpu = cpumask_any(&cpufreq_device->allowed_cpus);
+		last_load = cpufreq_device->last_load ?
+			cpufreq_device->last_load : 1;
+		normalised_power = (val * 100) / last_load;
+		freq = cpu_power_to_freq(cpufreq_device, normalised_power);
+
+		state = cpufreq_cooling_get_level(cpu, freq);
+		if (state == THERMAL_CSTATE_INVALID) {
+			pr_err("Failed to convert %dKHz for cpu %d into a cdev state\n",
+				freq, cpu);
+			return -EINVAL;
+		}
+	} else if (unit == THERMAL_UNIT_STATE) {
+		state = val;
+	} else {
 		return -EINVAL;
+	}
 
 	return cpufreq_apply_cooling(cpufreq_device, state);
 }
 
 /* Bind cpufreq callbacks to thermal cooling device ops */
 static struct thermal_cooling_device_ops const cpufreq_cooling_ops = {
-	.get_max = cpufreq_get_max_state,
-	.get_cur = cpufreq_get_cur_state,
-	.set_cur = cpufreq_set_cur_state,
+	.get_max = cpufreq_get_max,
+	.get_cur = cpufreq_get_cur,
+	.set_cur = cpufreq_set_cur,
 };
 
 /* Notifier for cpufreq policy change */
@@ -441,10 +622,95 @@ static struct notifier_block thermal_cpufreq_notifier_block = {
 	.notifier_call = cpufreq_thermal_notifier,
 };
 
+/* Notifier for cpufreq frequency changes */
+struct notifier_block cpufreq_transition_notifier = {
+	.notifier_call = cpufreq_frequency_change,
+};
+
+/**
+ * build_cpu_power_table - create a power to frequency table
+ * @cpufreq_dev:	the cpufreq_cooling_device in which to store the table
+ *
+ * Build a power to frequency table for this cpu and store it in
+ * @cpufreq_dev.  This table will be used in cpu_power_to_freq() and
+ * cpu_freq_to_power() to convert between power and frequency
+ * efficiently.  Power is stored in mW, frequency in KHz.  The
+ * resulting table is in ascending order.
+ *
+ * Returns 0 on success, -E* on error.
+ */
+static int build_cpu_power_table(struct cpufreq_cooling_device *cpufreq_dev)
+{
+	struct dev_pm_opp *opp;
+	struct device *dev = NULL;
+	int num_opps, cpu, i, ret = 0;
+	unsigned long freq;
+
+	num_opps = 0;
+
+	rcu_read_lock();
+
+	for_each_cpu(cpu, &cpufreq_dev->allowed_cpus) {
+		dev = get_cpu_device(cpu);
+		if (!dev)
+			continue;
+
+		num_opps = dev_pm_opp_get_opp_count(dev);
+		if (num_opps > 0) {
+			break;
+		} else if (num_opps < 0) {
+			ret = num_opps;
+			goto unlock;
+		}
+	}
+
+	if ((num_opps == 0) || (num_opps > MAX_NUM_OPPS)) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	i = 0;
+	for (freq = 0;
+	     opp = dev_pm_opp_find_freq_ceil(dev, &freq), !IS_ERR(opp);
+	     freq++) {
+		u32 freq_mhz, voltage_mv;
+		u64 power;
+
+		freq_mhz = freq / 1000000;
+		voltage_mv = dev_pm_opp_get_voltage(opp) / 1000;
+
+		/*
+		 * Do the multiplication with MHz and millivolt so as
+		 * to not overflow.
+		 */
+		power = (u64)cpufreq_dev->capacitance * freq_mhz *
+			voltage_mv * voltage_mv;
+		do_div(power, 1000000000);
+
+		/* frequency is stored in power_table in KHz */
+		cpufreq_dev->power_table[i].frequency = freq / 1000;
+		cpufreq_dev->power_table[i].power = power;
+
+		i++;
+	}
+
+	if (i == 0) {
+		ret = PTR_ERR(opp);
+		goto unlock;
+	}
+
+	cpufreq_dev->power_table_entries = i;
+
+unlock:
+	rcu_read_unlock();
+	return ret;
+}
+
 /**
  * __cpufreq_cooling_register - helper function to create cpufreq cooling device
  * @np: a valid struct device_node to the cooling device device tree node
  * @clip_cpus: cpumask of cpus where the frequency constraints will happen.
+ * @capacitance: dynamic power coefficient for these cpus
  *
  * This interface function registers the cpufreq cooling device with the name
  * "thermal-cpufreq-%x". This api can support multiple instances of cpufreq
@@ -456,7 +722,7 @@ static struct notifier_block thermal_cpufreq_notifier_block = {
  */
 static struct thermal_cooling_device *
 __cpufreq_cooling_register(struct device_node *np,
-			   const struct cpumask *clip_cpus)
+			const struct cpumask *clip_cpus, u32 capacitance)
 {
 	struct thermal_cooling_device *cool_dev;
 	struct cpufreq_cooling_device *cpufreq_dev = NULL;
@@ -485,6 +751,7 @@ __cpufreq_cooling_register(struct device_node *np,
 		return ERR_PTR(-ENOMEM);
 
 	cpumask_copy(&cpufreq_dev->allowed_cpus, clip_cpus);
+	cpufreq_dev->capacitance = capacitance;
 
 	ret = get_idr(&cpufreq_idr, &cpufreq_dev->id);
 	if (ret) {
@@ -531,7 +798,7 @@ __cpufreq_cooling_register(struct device_node *np,
 struct thermal_cooling_device *
 cpufreq_cooling_register(const struct cpumask *clip_cpus)
 {
-	return __cpufreq_cooling_register(NULL, clip_cpus);
+	return __cpufreq_cooling_register(NULL, clip_cpus, 0);
 }
 EXPORT_SYMBOL_GPL(cpufreq_cooling_register);
 
@@ -555,11 +822,90 @@ of_cpufreq_cooling_register(struct device_node *np,
 	if (!np)
 		return ERR_PTR(-EINVAL);
 
-	return __cpufreq_cooling_register(np, clip_cpus);
+	return __cpufreq_cooling_register(np, clip_cpus, 0);
 }
 EXPORT_SYMBOL_GPL(of_cpufreq_cooling_register);
 
 /**
+ * of_cpufreq_power_actor_register - function to create a cpu cooling device with power capabilities
+ * @np:		a valid struct device_node to the cooling device tree node
+ * @clip_cpus:	cpumask of cpus where the frequency constraints will happen
+ * @capacitance:	the dynamic power coefficient for these cpus
+ *
+ * Similar to of_cpufreq_cooling_register() this function registers a
+ * cooling device linked to the device tree node provided.  It has a
+ * simple power model so that get_cur/get_max/set_cur can operate with
+ * THERMAL_UNIT_POWER
+ *
+ * The cpus must have registered their OPPs in the OPP library.
+ *
+ * Return: a valid struct thermal_cooling_device pointer on success,
+ * on failure, it returns a corresponding ERR_PTR().
+ */
+struct thermal_cooling_device *
+of_cpufreq_power_actor_register(struct device_node *np,
+				const struct cpumask *clip_cpus,
+				u32 capacitance)
+{
+	struct thermal_cooling_device *cdev, *err_ret;
+	int ret;
+
+	cdev = __cpufreq_cooling_register(np, clip_cpus, capacitance);
+	if (IS_ERR(cdev))
+		return cdev;
+
+	ret = build_cpu_power_table(cdev->devdata);
+	if (ret) {
+		err_ret = ERR_PTR(ret);
+		goto unregister;
+	}
+
+	/*
+	 * You can't register multiple times the same notifier_block.
+	 * The first power actor registered is the only one that
+	 * registers the notifier.
+	 */
+	if (!power_actor_notifier_registered) {
+		ret = cpufreq_register_notifier(&cpufreq_transition_notifier,
+						CPUFREQ_TRANSITION_NOTIFIER);
+		if (ret) {
+			err_ret = ERR_PTR(ret);
+			goto unregister;
+		}
+		power_actor_notifier_registered = true;
+	}
+
+	return cdev;
+
+unregister:
+	cpufreq_cooling_unregister(cdev);
+	return err_ret;
+}
+EXPORT_SYMBOL_GPL(of_cpufreq_power_actor_register);
+
+/**
+ * cpufreq_power_actor_register - function to create a cpu cooling device with power capabilities
+ * @clip_cpus:	cpumask of cpus where the frequency constraints will happen
+ * @capacitance:	the dynamic power coefficient for these cpus
+ *
+ * Similar to cpufreq_cooling_register() this function registers a
+ * cpufreq cooling device that has a simple power model so that
+ * get_cur/get_max/set_cur can operate with THERMAL_UNIT_POWER.
+ *
+ * See of_cpufreq_power_actor_register() for notes about the
+ * requisites for this to work.
+ *
+ * Return: a valid struct thermal_cooling_device pointer on success,
+ * on failure, it returns a corresponding ERR_PTR().
+ */
+struct thermal_cooling_device *
+cpufreq_power_actor_register(const struct cpumask *clip_cpus, u32 capacitance)
+{
+	return of_cpufreq_power_actor_register(NULL, clip_cpus, capacitance);
+}
+EXPORT_SYMBOL_GPL(cpufreq_power_actor_register);
+
+/**
  * cpufreq_cooling_unregister - function to remove cpufreq cooling device.
  * @cdev: thermal cooling device pointer.
  *
diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 5b46603cc7cd..1efaadf436fa 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -50,10 +50,10 @@ static DEFINE_IDR(thermal_cdev_idr);
 static DEFINE_MUTEX(thermal_idr_lock);
 
 static LIST_HEAD(thermal_tz_list);
-static LIST_HEAD(thermal_cdev_list);
+LIST_HEAD(thermal_cdev_list);
 static LIST_HEAD(thermal_governor_list);
 
-static DEFINE_MUTEX(thermal_list_lock);
+DEFINE_MUTEX(thermal_list_lock);
 static DEFINE_MUTEX(thermal_governor_lock);
 
 static struct thermal_governor *def_governor;
diff --git a/include/linux/cpu_cooling.h b/include/linux/cpu_cooling.h
index c303d383def1..b8d92ac4266b 100644
--- a/include/linux/cpu_cooling.h
+++ b/include/linux/cpu_cooling.h
@@ -36,6 +36,9 @@
 struct thermal_cooling_device *
 cpufreq_cooling_register(const struct cpumask *clip_cpus);
 
+struct thermal_cooling_device *
+cpufreq_power_actor_register(const struct cpumask *clip_cpus, u32 capacitance);
+
 /**
  * of_cpufreq_cooling_register - create cpufreq cooling device based on DT.
  * @np: a valid struct device_node to the cooling device device tree node.
@@ -45,6 +48,11 @@ cpufreq_cooling_register(const struct cpumask *clip_cpus);
 struct thermal_cooling_device *
 of_cpufreq_cooling_register(struct device_node *np,
 			    const struct cpumask *clip_cpus);
+
+struct thermal_cooling_device *
+of_cpufreq_power_actor_register(struct device_node *np,
+				const struct cpumask *clip_cpus,
+				u32 capacitance);
 #else
 static inline struct thermal_cooling_device *
 of_cpufreq_cooling_register(struct device_node *np,
@@ -52,6 +60,14 @@ of_cpufreq_cooling_register(struct device_node *np,
 {
 	return NULL;
 }
+
+static inline struct thermal_cooling_device *
+of_cpufreq_power_actor_register(struct device_node *np,
+				const struct cpumask *clip_cpus,
+				u32 capacitance);
+{
+	return ERR_PTR(-ENOSYS);
+}
 #endif
 
 /**
@@ -73,6 +89,18 @@ of_cpufreq_cooling_register(struct device_node *np,
 {
 	return NULL;
 }
+static inline struct thermal_cooling_device *
+cpufreq_power_actor_register(const struct cpumask *clip_cpus, u32 capacitance)
+{
+	return ERR_PTR(-ENOSYS);
+}
+static inline struct thermal_cooling_device *
+of_cpufreq_power_actor_register(struct device_node *np,
+				const struct cpumask *clip_cpus,
+				u32 capacitance)
+{
+	return ERR_PTR(-ENOSYS);
+}
 static inline
 void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
 {
diff --git a/include/linux/thermal.h b/include/linux/thermal.h
index 8d183b8255eb..5986f546ca98 100644
--- a/include/linux/thermal.h
+++ b/include/linux/thermal.h
@@ -59,6 +59,9 @@
 #define DEFAULT_THERMAL_GOVERNOR       "user_space"
 #endif
 
+extern struct list_head thermal_cdev_list;
+extern struct mutex thermal_list_lock;
+
 struct thermal_zone_device;
 struct thermal_cooling_device;
 
@@ -108,6 +111,7 @@ enum {
 
 enum thermal_cooling_unit {
 	THERMAL_UNIT_STATE,
+	THERMAL_UNIT_POWER,
 };
 
 struct thermal_zone_device_ops {
-- 
1.7.9.5



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH 4/5] thermal: introduce the Power Allocator governor
  2014-05-06 12:06 [RFC PATCH 0/5] The power allocator thermal governor Javi Merino
                   ` (2 preceding siblings ...)
  2014-05-06 12:06 ` [RFC PATCH 3/5] thermal: add a basic cpu power actor Javi Merino
@ 2014-05-06 12:06 ` Javi Merino
  2014-05-13 23:20   ` Eduardo Valentin
  2014-05-06 12:06 ` [RFC PATCH 5/5] thermal: add trace events to the power allocator governor Javi Merino
  2014-05-14 10:05 ` [RFC PATCH 0/5] The power allocator thermal governor James King
  5 siblings, 1 reply; 48+ messages in thread
From: Javi Merino @ 2014-05-06 12:06 UTC (permalink / raw)
  To: linux-pm
  Cc: Punit.Agrawal, Javi Merino, Zhang Rui, Eduardo Valentin, Punit Agrawal

The power allocator governor is a thermal governor that controls system
and device power allocation to control temperature.  Conceptually, the
implementation takes a system view of heat dissipation by managing
multiple heat sources.

This governor relies on power-aware cooling devices (power actors) to
operate.  That is, cooling devices whose thermal_cooling_device_ops
accept THERMAL_UNIT_POWER.

It uses a Proportional Integral (PI) controller driven by the
temperature of the thermal zone.  This budget is then allocated to
each cooling device that can have bearing on the temperature we are
trying to control.  It decides how much power to give each cooling
device based on the performance they are requesting.  The PI
controller ensures that the total power budget does not exceed the
control temperature.

Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
---
 drivers/thermal/Kconfig           |   14 ++
 drivers/thermal/Makefile          |    1 +
 drivers/thermal/power_allocator.c |  403 +++++++++++++++++++++++++++++++++++++
 drivers/thermal/thermal_core.c    |    7 +-
 drivers/thermal/thermal_core.h    |    8 +
 include/linux/thermal.h           |    5 +
 6 files changed, 437 insertions(+), 1 deletion(-)
 create mode 100644 drivers/thermal/power_allocator.c

diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index 2d51912a6e40..f5e6b693fc19 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -71,6 +71,14 @@ config THERMAL_DEFAULT_GOV_USER_SPACE
 	  Select this if you want to let the user space manage the
 	  platform thermals.
 
+config THERMAL_DEFAULT_GOV_POWER_ALLOCATOR
+	bool "power_allocator"
+	select THERMAL_GOV_POWER_ALLOCATOR
+	help
+	  Select this if you want to control temperature based on
+	  system and device power allocation. This governor relies on
+	  power-aware cooling devices (power actors) to operate.
+
 endchoice
 
 config THERMAL_GOV_FAIR_SHARE
@@ -89,6 +97,12 @@ config THERMAL_GOV_USER_SPACE
 	help
 	  Enable this to let the user space manage the platform thermals.
 
+config THERMAL_GOV_POWER_ALLOCATOR
+	bool "Power allocator thermal governor"
+	help
+	  Enable this to manage platform thermals by dynamically
+	  allocating and limiting power to devices.
+
 config CPU_THERMAL
 	bool "generic cpu cooling support"
 	depends on CPU_FREQ
diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
index 54e4ec9eb5df..1cb7d5fec8dc 100644
--- a/drivers/thermal/Makefile
+++ b/drivers/thermal/Makefile
@@ -13,6 +13,7 @@ thermal_sys-$(CONFIG_THERMAL_OF)		+= of-thermal.o
 thermal_sys-$(CONFIG_THERMAL_GOV_FAIR_SHARE)	+= fair_share.o
 thermal_sys-$(CONFIG_THERMAL_GOV_STEP_WISE)	+= step_wise.o
 thermal_sys-$(CONFIG_THERMAL_GOV_USER_SPACE)	+= user_space.o
+thermal_sys-$(CONFIG_THERMAL_GOV_POWER_ALLOCATOR)	+= power_allocator.o
 
 # cpufreq cooling
 thermal_sys-$(CONFIG_CPU_THERMAL)	+= cpu_cooling.o
diff --git a/drivers/thermal/power_allocator.c b/drivers/thermal/power_allocator.c
new file mode 100644
index 000000000000..39c425abf10a
--- /dev/null
+++ b/drivers/thermal/power_allocator.c
@@ -0,0 +1,403 @@
+/*
+ * A power allocator to manage temperature
+ *
+ * Copyright (C) 2014 ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed "as is" WITHOUT ANY WARRANTY of any
+ * kind, whether express or implied; without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#define pr_fmt(fmt) "Power allocator: " fmt
+
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/thermal.h>
+
+#include "thermal_core.h"
+
+#define MAX_NUM_ACTORS 8
+
+#define FRAC_BITS 8
+#define int_to_frac(x) ((x) << FRAC_BITS)
+#define frac_to_int(x) ((x) >> FRAC_BITS)
+
+/**
+ * mul_frac - multiply two fixed-point numbers
+ * @x:	first multiplicand
+ * @y:	second multiplicand
+ *
+ * Returns the result of multiplying two fixed-point numbers.  The
+ * result is also a fixed-point number.
+ */
+static inline s64 mul_frac(s64 x, s64 y)
+{
+	return (x * y) >> FRAC_BITS;
+}
+
+enum power_allocator_trip_levels {
+	TRIP_SWITCH_ON = 0,	/* Switch on PI controller */
+	TRIP_MAX_DESIRED_TEMPERATURE, /* Temperature we are controlling for */
+};
+
+/**
+ * struct power_allocator_params - parameters for the power allocator governor
+ * @k_po:	P parameter of the PI controller when overshooting (i.e., when
+ *		temperature is below the target)
+ * @k_pi:	P parameter of the PI controller when undershooting
+ * @k_i:	I parameter of the PI controller
+ * @integral_cutoff:	threshold below which the error is no longer accumulated
+			in the PI controller
+ * @err_integral:	Accumulated error in the PI controller.
+ */
+struct power_allocator_params {
+	s32 k_po;
+	s32 k_pu;
+	s32 k_i;
+	s32 integral_cutoff;
+	s32 err_integral;
+};
+
+/**
+ * pi_controller() - PI controller
+ * @tz:	thermal zone we are operating in
+ * @control_temp:	The target temperature
+ * @max_allocatable_power:	maximum allocatable power for this thermal zone
+ *
+ * This PI controller increases the available power budget so that the
+ * temperature of the thermal zone gets as close as possible to
+ * @control_temp and limits the power if it exceeds it.  k_po is the
+ * proportional term when we are overshooting, k_pu is the
+ * proportional term when we are undershooting.  integral_cutoff is a
+ * threshold below which we stop accumulating the error.  The
+ * accumulated error is only valid if the requested power will make
+ * the system warmer.  If the system is mostly idle, there's no point
+ * in accumulating positive error.
+ *
+ * It returns the power budget for the next period.
+ */
+static u32 pi_controller(struct thermal_zone_device *tz,
+			unsigned long current_temp, unsigned long control_temp,
+			unsigned long max_allocatable_power)
+{
+	s64 p, i, power_range;
+	s32 err;
+	struct power_allocator_params *params = tz->governor_data;
+
+	err = ((s32)control_temp - (s32)current_temp) / 1000;
+	err = int_to_frac(err);
+
+	/* Calculate the proportional term */
+	p = mul_frac(err < 0 ? params->k_po : params->k_pu, err);
+
+	/*
+	 * Calculate the integral term
+	 *
+	 * if the error s less than cut off allow integration (but
+	 * the integral is limited to max power)
+	 */
+	i = mul_frac(params->k_i, params->err_integral);
+
+	if (err < int_to_frac(params->integral_cutoff)) {
+		s64 tmpi = mul_frac(params->k_i, err);
+		tmpi += i;
+		if (tmpi <= int_to_frac(max_allocatable_power)) {
+			i = tmpi;
+			params->err_integral += err;
+		}
+	}
+
+	power_range = p + i;
+
+	/* feed-forward the known maximum dissipatable power */
+	power_range = tz->tzp->max_dissipatable_power +
+		frac_to_int(power_range);
+
+	/* output should not exceed max power */
+	if (power_range > max_allocatable_power)
+		power_range = max_allocatable_power;
+
+	return power_range >= 0 ? power_range : 0;
+}
+
+static void divvy_up_power(struct thermal_zone_device *tz,
+			unsigned long req_power[MAX_NUM_ACTORS],
+			unsigned long max_power[MAX_NUM_ACTORS],
+			int num_actors, unsigned long total_req_power,
+			u32 power_range,
+			unsigned long granted_power[MAX_NUM_ACTORS])
+{
+	unsigned long extra_power, capped_extra_power;
+	unsigned long extra_actor_power[MAX_NUM_ACTORS];
+	int i;
+
+	if (!total_req_power) {
+		/*
+		 * Nobody requested anything, so just give everybody
+		 * the maximum power
+		 */
+		for (i = 0; i < num_actors; i++)
+			granted_power[i] = max_power[i];
+
+		return;
+	}
+
+	capped_extra_power = 0;
+	extra_power = 0;
+	for (i = 0; i < num_actors; i++) {
+		u64 req_range = req_power[i] * power_range;
+
+		granted_power[i] = div_u64(req_range, total_req_power);
+
+		if (granted_power[i] > max_power[i]) {
+			extra_power += granted_power[i] - max_power[i];
+			granted_power[i] = max_power[i];
+		}
+
+		extra_actor_power[i] = max_power[i] - granted_power[i];
+		capped_extra_power += extra_actor_power[i];
+	}
+
+	if (!extra_power)
+		return;
+
+	/*
+	 * Re-divvy the reclaimed extra among actors based on
+	 * how far they are from the max
+	 */
+	extra_power = min(extra_power, capped_extra_power);
+	if (capped_extra_power > 0)
+		for (i = 0; i < num_actors; i++)
+			granted_power[i] += (extra_actor_power[i] *
+					extra_power) / capped_extra_power;
+}
+
+static int allocate_power(struct thermal_zone_device *tz,
+			unsigned long current_temp, unsigned long control_temp)
+{
+	struct thermal_instance *instance;
+	unsigned long req_power[MAX_NUM_ACTORS], max_power[MAX_NUM_ACTORS];
+	unsigned long granted_power[MAX_NUM_ACTORS];
+	unsigned long total_req_power, max_allocatable_power;
+	u32 power_range;
+	int i, num_actors, ret = 0;
+
+	i = 0;
+	total_req_power = 0;
+	max_allocatable_power = 0;
+
+	mutex_lock(&tz->lock);
+
+	list_for_each_entry(instance, &tz->thermal_instances, tz_node) {
+		struct thermal_cooling_device *cdev = instance->cdev;
+
+		if (instance->trip != TRIP_MAX_DESIRED_TEMPERATURE)
+			continue;
+
+		if (i >= MAX_NUM_ACTORS) {
+			pr_warn("Too many actors (%d) for this governor\n", i);
+			ret = -EINVAL;
+			goto unlock;
+		}
+
+		ret = cdev->ops->get_cur(cdev, &req_power[i],
+					THERMAL_UNIT_POWER);
+		if (ret) {
+			pr_warn_once("Couldn't get current power for %s: %d\n",
+				cdev->type, ret);
+			goto unlock;
+		}
+
+		total_req_power += req_power[i];
+
+		cdev->ops->get_max(cdev, &max_power[i], THERMAL_UNIT_POWER);
+		max_allocatable_power += max_power[i];
+
+		i++;
+	}
+	num_actors = i;
+
+	power_range = pi_controller(tz, current_temp, control_temp,
+				max_allocatable_power);
+
+	divvy_up_power(tz, req_power, max_power, num_actors, total_req_power,
+		power_range, granted_power);
+
+	i = 0;
+	list_for_each_entry(instance, &tz->thermal_instances, tz_node) {
+		struct thermal_cooling_device *cdev = instance->cdev;
+
+		if (instance->trip != TRIP_MAX_DESIRED_TEMPERATURE)
+			continue;
+
+		cdev->ops->set_cur(cdev, granted_power[i], THERMAL_UNIT_POWER);
+		i++;
+	}
+
+unlock:
+	mutex_unlock(&tz->lock);
+
+	return ret;
+}
+
+static int check_trips(struct thermal_zone_device *tz)
+{
+	int ret;
+	enum thermal_trip_type type;
+
+	if (tz->trips < 2)
+		return -EINVAL;
+
+	ret = tz->ops->get_trip_type(tz, TRIP_SWITCH_ON, &type);
+	if (ret)
+		return ret;
+
+	if ((type != THERMAL_TRIP_PASSIVE) && (type != THERMAL_TRIP_ACTIVE))
+		return -EINVAL;
+
+	ret = tz->ops->get_trip_type(tz, TRIP_MAX_DESIRED_TEMPERATURE, &type);
+	if (ret)
+		return ret;
+
+	if ((type != THERMAL_TRIP_PASSIVE) && (type != THERMAL_TRIP_ACTIVE))
+		return -EINVAL;
+
+	return ret;
+}
+
+static void reset_pi_controller(struct power_allocator_params *params)
+{
+	params->err_integral = 0;
+}
+
+/**
+ * power_allocator_bind - bind the power_allocator governor to a thermal zone
+ * @tz:	thermal zone to bind it to
+ *
+ * Check that the thermal zone is valid for this governor: has two
+ * thermal trips.  If so, initialize the PI controller parameters and
+ * bind it to the thermal zone.
+ *
+ * Returns 0 on success, -EINVAL if the trips were invalid or -ENOMEM
+ * if we ran out of memory.
+ */
+static int power_allocator_bind(struct thermal_zone_device *tz)
+{
+	int ret;
+	struct power_allocator_params *params;
+	unsigned long switch_on_temp, control_temp;
+	u32 temperature_threshold;
+
+	ret = check_trips(tz);
+	if (ret) {
+		pr_err("thermal zone %s has the wrong number of trips for this governor\n",
+			tz->type);
+		return ret;
+	}
+
+	if (!tz->tzp || !tz->tzp->max_dissipatable_power) {
+		pr_err("Trying to bind the power_allocator governor to a thermal zone without the max_dissipatable_power parameter\n");
+		return -EINVAL;
+	}
+
+	params = kzalloc(sizeof(*params), GFP_KERNEL);
+	if (!params)
+		return -ENOMEM;
+
+	ret = tz->ops->get_trip_temp(tz, TRIP_SWITCH_ON, &switch_on_temp);
+	if (ret)
+		goto free;
+
+	ret = tz->ops->get_trip_temp(tz, TRIP_MAX_DESIRED_TEMPERATURE,
+				&control_temp);
+	if (ret)
+		goto free;
+
+	temperature_threshold = (control_temp - switch_on_temp) / 1000;
+
+	params->k_po = int_to_frac(tz->tzp->max_dissipatable_power) /
+		temperature_threshold;
+	params->k_pu = int_to_frac(2 * tz->tzp->max_dissipatable_power) /
+		temperature_threshold;
+	params->k_i = int_to_frac(1);
+	params->integral_cutoff = 0;
+
+	reset_pi_controller(params);
+
+	tz->governor_data = params;
+	return 0;
+
+free:
+	kfree(params);
+	return ret;
+}
+
+static void power_allocator_unbind(struct thermal_zone_device *tz)
+{
+	pr_debug("Unbinding from thermal zone %d\n", tz->id);
+	kfree(tz->governor_data);
+	tz->governor_data = NULL;
+}
+
+static int power_allocator_throttle(struct thermal_zone_device *tz, int trip)
+{
+	int ret;
+	unsigned long switch_on_temp, control_temp, current_temp;
+	struct power_allocator_params *params = tz->governor_data;
+
+	/*
+	 * We get called for every trip point but we only need to do
+	 * our calculations once
+	 */
+	if (trip != TRIP_MAX_DESIRED_TEMPERATURE)
+		return 0;
+
+	ret = thermal_zone_get_temp(tz, &current_temp);
+	if (ret) {
+		pr_warn("Failed to get temperature: %d\n", ret);
+		return ret;
+	}
+
+	ret = tz->ops->get_trip_temp(tz, TRIP_SWITCH_ON, &switch_on_temp);
+	if (ret) {
+		pr_warn("Failed to get switch on temperature: %d\n", ret);
+		return ret;
+	}
+
+	if (current_temp < switch_on_temp) {
+		reset_pi_controller(params);
+		return 0;
+	}
+
+	ret = tz->ops->get_trip_temp(tz, TRIP_MAX_DESIRED_TEMPERATURE,
+				&control_temp);
+	if (ret) {
+		pr_warn("Failed to get the maximum desired temperature: %d\n",
+			ret);
+		return ret;
+	}
+
+	return allocate_power(tz, current_temp, control_temp);
+}
+
+static struct thermal_governor thermal_gov_power_allocator = {
+	.name		= "power_allocator",
+	.bind_to_tz	= power_allocator_bind,
+	.unbind_from_tz	= power_allocator_unbind,
+	.throttle	= power_allocator_throttle,
+};
+
+int thermal_gov_power_allocator_register(void)
+{
+	return thermal_register_governor(&thermal_gov_power_allocator);
+}
+
+void thermal_gov_power_allocator_unregister(void)
+{
+	thermal_unregister_governor(&thermal_gov_power_allocator);
+}
diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 1efaadf436fa..bed3d2ef8ef3 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -1840,7 +1840,11 @@ static int __init thermal_register_governors(void)
 	if (result)
 		return result;
 
-	return thermal_gov_user_space_register();
+	result = thermal_gov_user_space_register();
+	if (result)
+		return result;
+
+	return thermal_gov_power_allocator_register();
 }
 
 static void thermal_unregister_governors(void)
@@ -1848,6 +1852,7 @@ static void thermal_unregister_governors(void)
 	thermal_gov_step_wise_unregister();
 	thermal_gov_fair_share_unregister();
 	thermal_gov_user_space_unregister();
+	thermal_gov_power_allocator_unregister();
 }
 
 static int __init thermal_init(void)
diff --git a/drivers/thermal/thermal_core.h b/drivers/thermal/thermal_core.h
index 3db339fb636f..b24cde2c71cc 100644
--- a/drivers/thermal/thermal_core.h
+++ b/drivers/thermal/thermal_core.h
@@ -77,6 +77,14 @@ static inline int thermal_gov_user_space_register(void) { return 0; }
 static inline void thermal_gov_user_space_unregister(void) {}
 #endif /* CONFIG_THERMAL_GOV_USER_SPACE */
 
+#ifdef CONFIG_THERMAL_GOV_POWER_ALLOCATOR
+int thermal_gov_power_allocator_register(void);
+void thermal_gov_power_allocator_unregister(void);
+#else
+static inline int thermal_gov_power_allocator_register(void) { return 0; }
+static inline void thermal_gov_power_allocator_unregister(void) {}
+#endif /* CONFIG_THERMAL_GOV_POWER_ALLOCATOR */
+
 /* device tree support */
 #ifdef CONFIG_THERMAL_OF
 int of_parse_thermal_zones(void);
diff --git a/include/linux/thermal.h b/include/linux/thermal.h
index 5986f546ca98..3861cb443520 100644
--- a/include/linux/thermal.h
+++ b/include/linux/thermal.h
@@ -57,6 +57,8 @@
 #define DEFAULT_THERMAL_GOVERNOR       "fair_share"
 #elif defined(CONFIG_THERMAL_DEFAULT_GOV_USER_SPACE)
 #define DEFAULT_THERMAL_GOVERNOR       "user_space"
+#elif defined(CONFIG_THERMAL_DEFAULT_GOV_POWER_ALLOCATOR)
+#define DEFAULT_THERMAL_GOVERNOR       "power_allocator"
 #endif
 
 extern struct list_head thermal_cdev_list;
@@ -250,6 +252,9 @@ struct thermal_zone_params {
 
 	int num_tbps;	/* Number of tbp entries */
 	struct thermal_bind_params *tbp;
+
+	/* Maximum power (heat) that this thermal zone can dissipate in mW */
+	u32 max_dissipatable_power;
 };
 
 struct thermal_genl_event {
-- 
1.7.9.5



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH 5/5] thermal: add trace events to the power allocator governor
  2014-05-06 12:06 [RFC PATCH 0/5] The power allocator thermal governor Javi Merino
                   ` (3 preceding siblings ...)
  2014-05-06 12:06 ` [RFC PATCH 4/5] thermal: introduce the Power Allocator governor Javi Merino
@ 2014-05-06 12:06 ` Javi Merino
  2014-05-06 12:28   ` Steven Rostedt
  2014-05-06 17:22   ` [RFC][PATCH] tracing: Add __cpumask() macro to trace events to record cpumasks Steven Rostedt
  2014-05-14 10:05 ` [RFC PATCH 0/5] The power allocator thermal governor James King
  5 siblings, 2 replies; 48+ messages in thread
From: Javi Merino @ 2014-05-06 12:06 UTC (permalink / raw)
  To: linux-pm
  Cc: Punit.Agrawal, Javi Merino, Zhang Rui, Eduardo Valentin,
	Steven Rostedt, Frederic Weisbecker, Ingo Molnar

Add trace events for the power allocator governor and the power actor
interface of the cpu cooling device.

Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>

---
Hi trace maintainers,

We would like to add cpumask to a trace point.  We currently allocate
a string and print the mask there with cpumask_scnprintf() which is a
bit heavyweight.  Will it be possible to add this capability to ftrace
so that we can add the cpumask to the trace buffer and then trace-cmd
prints the mask when reporting?

 drivers/thermal/cpu_cooling.c           |    9 ++++++++
 drivers/thermal/power_allocator.c       |   12 ++++++++++-
 include/trace/events/thermal.h          |   34 ++++++++++++++++++++++++++++++
 include/trace/events/thermal_governor.h |   35 +++++++++++++++++++++++++++++++
 4 files changed, 89 insertions(+), 1 deletion(-)
 create mode 100644 include/trace/events/thermal.h
 create mode 100644 include/trace/events/thermal_governor.h

diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
index 8052cd1f4a39..3520e7a6b5c3 100644
--- a/drivers/thermal/cpu_cooling.c
+++ b/drivers/thermal/cpu_cooling.c
@@ -29,6 +29,9 @@
 #include <linux/cpu.h>
 #include <linux/cpu_cooling.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/thermal.h>
+
 /* Maximum number of opps we can track */
 #define MAX_NUM_OPPS 32
 
@@ -565,6 +568,8 @@ static int cpufreq_get_cur(struct thermal_cooling_device *cdev,
 	return 0;
 }
 
+static char trace_cpus_str[DIV_ROUND_UP(NR_CPUS, 4) + 1];
+
 /**
  * cpufreq_set_cur - callback function to set the current cooling state/power.
  * @cdev: thermal cooling device pointer.
@@ -601,6 +606,10 @@ static int cpufreq_set_cur(struct thermal_cooling_device *cdev,
 				freq, cpu);
 			return -EINVAL;
 		}
+
+		cpumask_scnprintf(trace_cpus_str, ARRAY_SIZE(trace_cpus_str),
+				&cpufreq_device->allowed_cpus);
+		trace_thermal_power_limit(trace_cpus_str, freq, state, val);
 	} else if (unit == THERMAL_UNIT_STATE) {
 		state = val;
 	} else {
diff --git a/drivers/thermal/power_allocator.c b/drivers/thermal/power_allocator.c
index 39c425abf10a..cde1fa1bd58a 100644
--- a/drivers/thermal/power_allocator.c
+++ b/drivers/thermal/power_allocator.c
@@ -19,6 +19,9 @@
 #include <linux/slab.h>
 #include <linux/thermal.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/thermal_governor.h>
+
 #include "thermal_core.h"
 
 #define MAX_NUM_ACTORS 8
@@ -122,7 +125,14 @@ static u32 pi_controller(struct thermal_zone_device *tz,
 	if (power_range > max_allocatable_power)
 		power_range = max_allocatable_power;
 
-	return power_range >= 0 ? power_range : 0;
+	power_range = power_range >= 0 ? power_range : 0;
+
+	trace_thermal_power_allocator_pi(frac_to_int(err),
+					frac_to_int(params->err_integral),
+					frac_to_int(p), frac_to_int(i),
+					power_range);
+
+	return power_range;
 }
 
 static void divvy_up_power(struct thermal_zone_device *tz,
diff --git a/include/trace/events/thermal.h b/include/trace/events/thermal.h
new file mode 100644
index 000000000000..9e2c2c65e81b
--- /dev/null
+++ b/include/trace/events/thermal.h
@@ -0,0 +1,34 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM thermal
+
+#if !defined(_TRACE_THERMAL_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_THERMAL_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(thermal_power_limit,
+	TP_PROTO(const char *cpus, unsigned int freq, unsigned long cdev_state,
+		unsigned long power),
+	TP_ARGS(cpus, freq, cdev_state, power),
+	TP_STRUCT__entry(
+		__string(cpus, cpus)
+		__field(unsigned int,  freq      )
+		__field(unsigned long, cdev_state)
+		__field(unsigned long, power     )
+	),
+	TP_fast_assign(
+		__assign_str(cpus, cpus);
+		__entry->freq = freq;
+		__entry->cdev_state = cdev_state;
+		__entry->power = power;
+	),
+
+	TP_printk("cpus=%s freq=%u cdev_state=%lu power=%lu",
+		__get_str(cpus), __entry->freq, __entry->cdev_state,
+		__entry->power)
+);
+
+#endif /* _TRACE_THERMAL_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/include/trace/events/thermal_governor.h b/include/trace/events/thermal_governor.h
new file mode 100644
index 000000000000..1fbf5c91f659
--- /dev/null
+++ b/include/trace/events/thermal_governor.h
@@ -0,0 +1,35 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM thermal_governor
+
+#if !defined(_TRACE_THERMAL_GOVERNOR_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_THERMAL_GOVERNOR_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(thermal_power_allocator_pi,
+	TP_PROTO(s32 err, s32 err_integral, s64 p, s64 i, s32 output),
+	TP_ARGS(err, err_integral, p, i, output),
+	TP_STRUCT__entry(
+		__field(s32, err         )
+		__field(s32, err_integral)
+		__field(s64, p           )
+		__field(s64, i           )
+		__field(s32, output      )
+	),
+	TP_fast_assign(
+		__entry->err = err;
+		__entry->err_integral = err_integral;
+		__entry->p = p;
+		__entry->i = i;
+		__entry->output = output;
+	),
+
+	TP_printk("err=%d err_integral=%d p=%lld i=%lld output=%d",
+		__entry->err, __entry->err_integral,
+		__entry->p, __entry->i, __entry->output)
+);
+
+#endif /* _TRACE_THERMAL_GOVERNOR_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
-- 
1.7.9.5



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 5/5] thermal: add trace events to the power allocator governor
  2014-05-06 12:06 ` [RFC PATCH 5/5] thermal: add trace events to the power allocator governor Javi Merino
@ 2014-05-06 12:28   ` Steven Rostedt
  2014-05-06 17:22   ` [RFC][PATCH] tracing: Add __cpumask() macro to trace events to record cpumasks Steven Rostedt
  1 sibling, 0 replies; 48+ messages in thread
From: Steven Rostedt @ 2014-05-06 12:28 UTC (permalink / raw)
  To: Javi Merino
  Cc: linux-pm, Punit.Agrawal, Zhang Rui, Eduardo Valentin,
	Frederic Weisbecker, Ingo Molnar

On Tue,  6 May 2014 13:06:38 +0100
Javi Merino <javi.merino@arm.com> wrote:


> Hi trace maintainers,

Hi Javi,

> 
> We would like to add cpumask to a trace point.  We currently allocate
> a string and print the mask there with cpumask_scnprintf() which is a
> bit heavyweight.  Will it be possible to add this capability to ftrace
> so that we can add the cpumask to the trace buffer and then trace-cmd
> prints the mask when reporting?

Hmm, that shouldn't be too hard. I would suggest making it work like a
dynamic array, as we don't want to store more bits than necessary.

I'll see if I can whip something up.

-- Steve


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC][PATCH] tracing: Add __cpumask() macro to trace events to record cpumasks
  2014-05-06 12:06 ` [RFC PATCH 5/5] thermal: add trace events to the power allocator governor Javi Merino
  2014-05-06 12:28   ` Steven Rostedt
@ 2014-05-06 17:22   ` Steven Rostedt
  2014-05-06 19:16     ` Mathieu Desnoyers
  1 sibling, 1 reply; 48+ messages in thread
From: Steven Rostedt @ 2014-05-06 17:22 UTC (permalink / raw)
  To: LKML
  Cc: Javi Merino, Andrew Morton, Mathieu Desnoyers, Ingo Molnar,
	Namhyung Kim, Jiri Olsa


Being able to show a cpumask of events can be useful as some events
may affect only some CPUs. There is no standard way to record the
cpumask and converting it to a string is rather expensive during
the trace as traces happen in hotpaths. It would be better to record
the raw event mask and be able to parse it at print time.

The following macros were added for use with the TRACE_EVENT() macro:

  __cpumask()
  __assign_cpumask()
  __get_cpumask()

To test this, I added this to the sched_migrate_task event, which
looked like this:

TRACE_EVENT(sched_migrate_task,

	TP_PROTO(struct task_struct *p, int dest_cpu, const struct cpumask *cpus),

	TP_ARGS(p, dest_cpu, cpus),

	TP_STRUCT__entry(
		__array(	char,	comm,	TASK_COMM_LEN	)
		__field(	pid_t,	pid			)
		__field(	int,	prio			)
		__field(	int,	orig_cpu		)
		__field(	int,	dest_cpu		)
		__cpumask(	cpumask, cpus			)
	),

	TP_fast_assign(
		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
		__entry->pid		= p->pid;
		__entry->prio		= p->prio;
		__entry->orig_cpu	= task_cpu(p);
		__entry->dest_cpu	= dest_cpu;
		__assign_cpumask(cpumask, cpus);
	),

	TP_printk("comm=%s pid=%d prio=%d orig_cpu=%d dest_cpu=%d cpumask=%s",
		  __entry->comm, __entry->pid, __entry->prio,
		  __entry->orig_cpu, __entry->dest_cpu,
		  __get_cpumask(cpumask))
);

With the output of:

        ksmtuned-3613  [003] d..2   485.220508: sched_migrate_task: comm=ksmtuned pid=3615 prio=120 orig_cpu=3 dest_cpu=2 cpumask=00000000,0000000f
     migration/1-13    [001] d..5   485.221202: sched_migrate_task: comm=ksmtuned pid=3614 prio=120 orig_cpu=1 dest_cpu=0 cpumask=00000000,0000000f
             awk-3615  [002] d.H5   485.221747: sched_migrate_task: comm=rcu_preempt pid=7 prio=120 orig_cpu=0 dest_cpu=1 cpumask=00000000,000000ff
     migration/2-18    [002] d..5   485.222062: sched_migrate_task: comm=ksmtuned pid=3615 prio=120 orig_cpu=2 dest_cpu=3 cpumask=00000000,0000000f

Link: http://lkml.kernel.org/r/1399377998-14870-6-git-send-email-javi.merino@arm.com

Suggested-by: Javi Merino <javi.merino@arm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---

Note, the tools/lib/traceevent will need to be updated to handle the
new __get_cpumask() macro in the TP_printk().

---
 include/linux/ftrace_event.h |  4 ++++
 include/linux/trace_seq.h    | 10 ++++++++
 include/trace/ftrace.h       | 57 +++++++++++++++++++++++++++++++++++++++++++-
 kernel/trace/trace_output.c  | 43 +++++++++++++++++++++++++++++++++
 4 files changed, 113 insertions(+), 1 deletion(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index d16da3e..f68ae9b 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -7,6 +7,7 @@
 #include <linux/percpu.h>
 #include <linux/hardirq.h>
 #include <linux/perf_event.h>
+#include <linux/cpumask.h>
 #include <linux/tracepoint.h>
 
 struct trace_array;
@@ -38,6 +39,9 @@ const char *ftrace_print_symbols_seq_u64(struct trace_seq *p,
 								 *symbol_array);
 #endif
 
+const char *ftrace_print_cpumask_seq(struct trace_seq *p, void *cpumask_ptr,
+				     unsigned int cpumask_size);
+
 const char *ftrace_print_hex_seq(struct trace_seq *p,
 				 const unsigned char *buf, int len);
 
diff --git a/include/linux/trace_seq.h b/include/linux/trace_seq.h
index a32d86e..052b675 100644
--- a/include/linux/trace_seq.h
+++ b/include/linux/trace_seq.h
@@ -46,6 +46,9 @@ extern int trace_seq_putmem_hex(struct trace_seq *s, const void *mem,
 extern void *trace_seq_reserve(struct trace_seq *s, size_t len);
 extern int trace_seq_path(struct trace_seq *s, const struct path *path);
 
+extern int trace_seq_cpumask(struct trace_seq *s, const unsigned long *maskp,
+			     int nmaskbits);
+
 #else /* CONFIG_TRACING */
 static inline int trace_seq_printf(struct trace_seq *s, const char *fmt, ...)
 {
@@ -57,6 +60,13 @@ trace_seq_bprintf(struct trace_seq *s, const char *fmt, const u32 *binary)
 	return 0;
 }
 
+static inline int
+trace_seq_cpumask(struct trace_seq *s, const unsigned long *maskp,
+		  int nmaskbits)
+{
+	return 0;
+}
+
 static inline int trace_print_seq(struct seq_file *m, struct trace_seq *s)
 {
 	return 0;
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 0a1a4f7..ff3c64e 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -53,6 +53,9 @@
 #undef __string
 #define __string(item, src) __dynamic_array(char, item, -1)
 
+#undef __cpumask
+#define __cpumask(item, src) __dynamic_array(char, item, -1)
+
 #undef TP_STRUCT__entry
 #define TP_STRUCT__entry(args...) args
 
@@ -128,6 +131,9 @@
 #undef __string
 #define __string(item, src) __dynamic_array(char, item, -1)
 
+#undef __string
+#define __string(item, src) __dynamic_array(unsigned long, item, -1)
+
 #undef DECLARE_EVENT_CLASS
 #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
 	struct ftrace_data_offsets_##call {				\
@@ -200,6 +206,15 @@
 #undef __get_str
 #define __get_str(field) (char *)__get_dynamic_array(field)
 
+#undef __get_cpumask
+#define __get_cpumask(field)						\
+	({								\
+		void *__cpumask = __get_dynamic_array(field);		\
+		unsigned int __cpumask_size;				\
+		__cpumask_size = (__entry->__data_loc_##field >> 16) & 0xffff; \
+		ftrace_print_cpumask_seq(p, __cpumask, __cpumask_size);	\
+	})
+
 #undef __print_flags
 #define __print_flags(flag, delim, flag_array...)			\
 	({								\
@@ -322,6 +337,9 @@ static struct trace_event_functions ftrace_event_type_funcs_##call = {	\
 #undef __string
 #define __string(item, src) __dynamic_array(char, item, -1)
 
+#undef __cpumask
+#define __cpumask(item, src) __dynamic_array(unsigned long, item, -1)
+
 #undef DECLARE_EVENT_CLASS
 #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, func, print)	\
 static int notrace __init						\
@@ -372,6 +390,29 @@ ftrace_define_fields_##call(struct ftrace_event_call *event_call)	\
 #define __string(item, src) __dynamic_array(char, item,			\
 		    strlen((src) ? (const char *)(src) : "(null)") + 1)
 
+/*
+ * __cpumask_size_in_bytes_raw is the number of bytes needed to hold
+ * num_possible_cpus().
+ */
+#define __cpumask_size_in_bytes_raw			\
+	((num_possible_cpus() + 7) / 8)
+
+#define __cpumask_size_in_longs						\
+	((__cpumask_size_in_bytes_raw + ((BITS_PER_LONG / 8) - 1))	\
+	 / (BITS_PER_LONG / 8))
+
+/*
+ * __cpumask_size_in_bytes is the number of bytes needed to hold
+ * num_possible_cpus() padded out to the nearest long. This is what
+ * is saved in the buffer, just to be consistent.
+ */
+#define __cpumask_size_in_bytes				\
+	(__cpumask_size_in_longs * (BITS_PER_LONG / 8))
+
+#undef __cpumask
+#define __cpumask(item, src) __dynamic_array(unsigned long, item,	\
+					     __cpumask_size_in_longs)
+
 #undef DECLARE_EVENT_CLASS
 #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
 static inline notrace int ftrace_get_offsets_##call(			\
@@ -513,12 +554,22 @@ static inline notrace int ftrace_get_offsets_##call(			\
 	__entry->__data_loc_##item = __data_offsets.item;
 
 #undef __string
-#define __string(item, src) __dynamic_array(char, item, -1)       	\
+#define __string(item, src) __dynamic_array(char, item, -1)
 
 #undef __assign_str
 #define __assign_str(dst, src)						\
 	strcpy(__get_str(dst), (src) ? (const char *)(src) : "(null)");
 
+#undef __cpumask
+#define __cpumask(item, src) __dynamic_array(unsigned long, item, -1)
+
+#undef __get_cpumask
+#define __get_cpumask(field) (char *)__get_dynamic_array(field)
+
+#undef __assign_cpumask
+#define __assign_cpumask(dst, src)					\
+	memcpy(__get_cpumask(dst), (src), __cpumask_size_in_bytes)
+
 #undef TP_fast_assign
 #define TP_fast_assign(args...) args
 
@@ -586,6 +637,7 @@ static inline void ftrace_test_probe_##call(void)			\
 #undef __print_hex
 #undef __get_dynamic_array
 #undef __get_str
+#undef __get_cpumask
 
 #undef TP_printk
 #define TP_printk(fmt, args...) "\"" fmt "\", "  __stringify(args)
@@ -651,6 +703,9 @@ __attribute__((section("_ftrace_events"))) *__event_##call = &event_##call
 #undef __get_str
 #define __get_str(field) (char *)__get_dynamic_array(field)
 
+#undef __get_cpumask
+#define __get_cpumask(field) (char *)__get_dynamic_array(field)
+
 #undef __perf_addr
 #define __perf_addr(a)	(__addr = (a))
 
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index a436de1..de43631 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -126,6 +126,36 @@ trace_seq_printf(struct trace_seq *s, const char *fmt, ...)
 EXPORT_SYMBOL_GPL(trace_seq_printf);
 
 /**
+ * trace_seq_cpumask - put a list of longs as a cpumask print output
+ * @s:		trace sequence descriptor
+ * @maskp:	points to an array of unsigned longs that represent a cpumask
+ * @nmaskbits:	The number of bits that are valid in @maskp
+ *
+ * It returns 0 if the trace oversizes the buffer's free
+ * space, 1 otherwise.
+ *
+ * Although the name has cpumask, it does not use the cpumask structure,
+ * but instead the internal bitmask structure used by cpumask. This
+ * is used for printing cpumasks in tracepoints.
+ */
+int
+trace_seq_cpumask(struct trace_seq *s, const unsigned long *maskp,
+		  int nmaskbits)
+{
+	int len = (PAGE_SIZE - 1) - s->len;
+	int ret;
+
+	if (s->full || !len)
+		return 0;
+
+	ret = bitmap_scnprintf(s->buffer, len, maskp, nmaskbits);
+	s->len += ret;
+
+	return 1;
+}
+EXPORT_SYMBOL_GPL(trace_seq_cpumask);
+
+/**
  * trace_seq_vprintf - sequence printing of trace information
  * @s: trace sequence descriptor
  * @fmt: printf format string
@@ -399,6 +429,19 @@ EXPORT_SYMBOL(ftrace_print_symbols_seq_u64);
 #endif
 
 const char *
+ftrace_print_cpumask_seq(struct trace_seq *p, void *cpumask_ptr,
+			 unsigned int cpumask_size)
+{
+	const char *ret = p->buffer + p->len;
+
+	trace_seq_cpumask(p, cpumask_ptr, cpumask_size * 8);
+	trace_seq_putc(p, 0);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(ftrace_print_cpumask_seq);
+
+const char *
 ftrace_print_hex_seq(struct trace_seq *p, const unsigned char *buf, int buf_len)
 {
 	int i;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH] tracing: Add __cpumask() macro to trace events to record cpumasks
  2014-05-06 17:22   ` [RFC][PATCH] tracing: Add __cpumask() macro to trace events to record cpumasks Steven Rostedt
@ 2014-05-06 19:16     ` Mathieu Desnoyers
  2014-05-06 19:30       ` Steven Rostedt
  2014-05-07  3:12       ` [RFC][PATCH v2] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks Steven Rostedt
  0 siblings, 2 replies; 48+ messages in thread
From: Mathieu Desnoyers @ 2014-05-06 19:16 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Javi Merino, Andrew Morton, Ingo Molnar, Namhyung Kim, Jiri Olsa

----- Original Message -----
> From: "Steven Rostedt" <rostedt@goodmis.org>
> To: "LKML" <linux-kernel@vger.kernel.org>
> Cc: "Javi Merino" <javi.merino@arm.com>, "Andrew Morton" <akpm@linux-foundation.org>, "Mathieu Desnoyers"
> <mathieu.desnoyers@efficios.com>, "Ingo Molnar" <mingo@kernel.org>, "Namhyung Kim" <namhyung@kernel.org>, "Jiri
> Olsa" <jolsa@redhat.com>
> Sent: Tuesday, May 6, 2014 1:22:38 PM
> Subject: [RFC][PATCH] tracing: Add __cpumask() macro to trace events to record cpumasks
> 
> 
> Being able to show a cpumask of events can be useful as some events
> may affect only some CPUs. There is no standard way to record the
> cpumask and converting it to a string is rather expensive during
> the trace as traces happen in hotpaths. It would be better to record
> the raw event mask and be able to parse it at print time.

Why name the type cpumask especially ? The type could be a "bitmask",
and the specific field semantic in this case (field name) could
be "cpus", thus implying that this is a cpu mask. This would allow
using bitmasks for other things than cpu masks.

Moreover, I've been thinking about eventually adding a bitmask associated
with an enumeration, so each bit of the bitmask could be associated with
an entry from an enumeration (a string). So having a bitmask type would
be a good step in that direction. For instance, the block layer
instrumentation could use this for the rwbs field, instead of
printing characters with blk_fill_rwbs(). This would save space and
cpu time at trace collection.

Thoughts ?

Thanks,

Mathieu


> 
> The following macros were added for use with the TRACE_EVENT() macro:
> 
>   __cpumask()
>   __assign_cpumask()
>   __get_cpumask()
> 
> To test this, I added this to the sched_migrate_task event, which
> looked like this:
> 
> TRACE_EVENT(sched_migrate_task,
> 
> 	TP_PROTO(struct task_struct *p, int dest_cpu, const struct cpumask *cpus),
> 
> 	TP_ARGS(p, dest_cpu, cpus),
> 
> 	TP_STRUCT__entry(
> 		__array(	char,	comm,	TASK_COMM_LEN	)
> 		__field(	pid_t,	pid			)
> 		__field(	int,	prio			)
> 		__field(	int,	orig_cpu		)
> 		__field(	int,	dest_cpu		)
> 		__cpumask(	cpumask, cpus			)
> 	),
> 
> 	TP_fast_assign(
> 		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
> 		__entry->pid		= p->pid;
> 		__entry->prio		= p->prio;
> 		__entry->orig_cpu	= task_cpu(p);
> 		__entry->dest_cpu	= dest_cpu;
> 		__assign_cpumask(cpumask, cpus);
> 	),
> 
> 	TP_printk("comm=%s pid=%d prio=%d orig_cpu=%d dest_cpu=%d cpumask=%s",
> 		  __entry->comm, __entry->pid, __entry->prio,
> 		  __entry->orig_cpu, __entry->dest_cpu,
> 		  __get_cpumask(cpumask))
> );
> 
> With the output of:
> 
>         ksmtuned-3613  [003] d..2   485.220508: sched_migrate_task:
>         comm=ksmtuned pid=3615 prio=120 orig_cpu=3 dest_cpu=2
>         cpumask=00000000,0000000f
>      migration/1-13    [001] d..5   485.221202: sched_migrate_task:
>      comm=ksmtuned pid=3614 prio=120 orig_cpu=1 dest_cpu=0
>      cpumask=00000000,0000000f
>              awk-3615  [002] d.H5   485.221747: sched_migrate_task:
>              comm=rcu_preempt pid=7 prio=120 orig_cpu=0 dest_cpu=1
>              cpumask=00000000,000000ff
>      migration/2-18    [002] d..5   485.222062: sched_migrate_task:
>      comm=ksmtuned pid=3615 prio=120 orig_cpu=2 dest_cpu=3
>      cpumask=00000000,0000000f
> 
> Link:
> http://lkml.kernel.org/r/1399377998-14870-6-git-send-email-javi.merino@arm.com
> 
> Suggested-by: Javi Merino <javi.merino@arm.com>
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
> 
> Note, the tools/lib/traceevent will need to be updated to handle the
> new __get_cpumask() macro in the TP_printk().
> 
> ---
>  include/linux/ftrace_event.h |  4 ++++
>  include/linux/trace_seq.h    | 10 ++++++++
>  include/trace/ftrace.h       | 57
>  +++++++++++++++++++++++++++++++++++++++++++-
>  kernel/trace/trace_output.c  | 43 +++++++++++++++++++++++++++++++++
>  4 files changed, 113 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
> index d16da3e..f68ae9b 100644
> --- a/include/linux/ftrace_event.h
> +++ b/include/linux/ftrace_event.h
> @@ -7,6 +7,7 @@
>  #include <linux/percpu.h>
>  #include <linux/hardirq.h>
>  #include <linux/perf_event.h>
> +#include <linux/cpumask.h>
>  #include <linux/tracepoint.h>
>  
>  struct trace_array;
> @@ -38,6 +39,9 @@ const char *ftrace_print_symbols_seq_u64(struct trace_seq
> *p,
>  								 *symbol_array);
>  #endif
>  
> +const char *ftrace_print_cpumask_seq(struct trace_seq *p, void *cpumask_ptr,
> +				     unsigned int cpumask_size);
> +
>  const char *ftrace_print_hex_seq(struct trace_seq *p,
>  				 const unsigned char *buf, int len);
>  
> diff --git a/include/linux/trace_seq.h b/include/linux/trace_seq.h
> index a32d86e..052b675 100644
> --- a/include/linux/trace_seq.h
> +++ b/include/linux/trace_seq.h
> @@ -46,6 +46,9 @@ extern int trace_seq_putmem_hex(struct trace_seq *s, const
> void *mem,
>  extern void *trace_seq_reserve(struct trace_seq *s, size_t len);
>  extern int trace_seq_path(struct trace_seq *s, const struct path *path);
>  
> +extern int trace_seq_cpumask(struct trace_seq *s, const unsigned long
> *maskp,
> +			     int nmaskbits);
> +
>  #else /* CONFIG_TRACING */
>  static inline int trace_seq_printf(struct trace_seq *s, const char *fmt,
>  ...)
>  {
> @@ -57,6 +60,13 @@ trace_seq_bprintf(struct trace_seq *s, const char *fmt,
> const u32 *binary)
>  	return 0;
>  }
>  
> +static inline int
> +trace_seq_cpumask(struct trace_seq *s, const unsigned long *maskp,
> +		  int nmaskbits)
> +{
> +	return 0;
> +}
> +
>  static inline int trace_print_seq(struct seq_file *m, struct trace_seq *s)
>  {
>  	return 0;
> diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
> index 0a1a4f7..ff3c64e 100644
> --- a/include/trace/ftrace.h
> +++ b/include/trace/ftrace.h
> @@ -53,6 +53,9 @@
>  #undef __string
>  #define __string(item, src) __dynamic_array(char, item, -1)
>  
> +#undef __cpumask
> +#define __cpumask(item, src) __dynamic_array(char, item, -1)
> +
>  #undef TP_STRUCT__entry
>  #define TP_STRUCT__entry(args...) args
>  
> @@ -128,6 +131,9 @@
>  #undef __string
>  #define __string(item, src) __dynamic_array(char, item, -1)
>  
> +#undef __string
> +#define __string(item, src) __dynamic_array(unsigned long, item, -1)
> +
>  #undef DECLARE_EVENT_CLASS
>  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
>  	struct ftrace_data_offsets_##call {				\
> @@ -200,6 +206,15 @@
>  #undef __get_str
>  #define __get_str(field) (char *)__get_dynamic_array(field)
>  
> +#undef __get_cpumask
> +#define __get_cpumask(field)						\
> +	({								\
> +		void *__cpumask = __get_dynamic_array(field);		\
> +		unsigned int __cpumask_size;				\
> +		__cpumask_size = (__entry->__data_loc_##field >> 16) & 0xffff; \
> +		ftrace_print_cpumask_seq(p, __cpumask, __cpumask_size);	\
> +	})
> +
>  #undef __print_flags
>  #define __print_flags(flag, delim, flag_array...)			\
>  	({								\
> @@ -322,6 +337,9 @@ static struct trace_event_functions
> ftrace_event_type_funcs_##call = {	\
>  #undef __string
>  #define __string(item, src) __dynamic_array(char, item, -1)
>  
> +#undef __cpumask
> +#define __cpumask(item, src) __dynamic_array(unsigned long, item, -1)
> +
>  #undef DECLARE_EVENT_CLASS
>  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, func, print)	\
>  static int notrace __init						\
> @@ -372,6 +390,29 @@ ftrace_define_fields_##call(struct ftrace_event_call
> *event_call)	\
>  #define __string(item, src) __dynamic_array(char, item,			\
>  		    strlen((src) ? (const char *)(src) : "(null)") + 1)
>  
> +/*
> + * __cpumask_size_in_bytes_raw is the number of bytes needed to hold
> + * num_possible_cpus().
> + */
> +#define __cpumask_size_in_bytes_raw			\
> +	((num_possible_cpus() + 7) / 8)
> +
> +#define __cpumask_size_in_longs						\
> +	((__cpumask_size_in_bytes_raw + ((BITS_PER_LONG / 8) - 1))	\
> +	 / (BITS_PER_LONG / 8))
> +
> +/*
> + * __cpumask_size_in_bytes is the number of bytes needed to hold
> + * num_possible_cpus() padded out to the nearest long. This is what
> + * is saved in the buffer, just to be consistent.
> + */
> +#define __cpumask_size_in_bytes				\
> +	(__cpumask_size_in_longs * (BITS_PER_LONG / 8))
> +
> +#undef __cpumask
> +#define __cpumask(item, src) __dynamic_array(unsigned long, item,	\
> +					     __cpumask_size_in_longs)
> +
>  #undef DECLARE_EVENT_CLASS
>  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
>  static inline notrace int ftrace_get_offsets_##call(			\
> @@ -513,12 +554,22 @@ static inline notrace int ftrace_get_offsets_##call(
> 			\
>  	__entry->__data_loc_##item = __data_offsets.item;
>  
>  #undef __string
> -#define __string(item, src) __dynamic_array(char, item, -1)       	\
> +#define __string(item, src) __dynamic_array(char, item, -1)
>  
>  #undef __assign_str
>  #define __assign_str(dst, src)						\
>  	strcpy(__get_str(dst), (src) ? (const char *)(src) : "(null)");
>  
> +#undef __cpumask
> +#define __cpumask(item, src) __dynamic_array(unsigned long, item, -1)
> +
> +#undef __get_cpumask
> +#define __get_cpumask(field) (char *)__get_dynamic_array(field)
> +
> +#undef __assign_cpumask
> +#define __assign_cpumask(dst, src)					\
> +	memcpy(__get_cpumask(dst), (src), __cpumask_size_in_bytes)
> +
>  #undef TP_fast_assign
>  #define TP_fast_assign(args...) args
>  
> @@ -586,6 +637,7 @@ static inline void ftrace_test_probe_##call(void)			\
>  #undef __print_hex
>  #undef __get_dynamic_array
>  #undef __get_str
> +#undef __get_cpumask
>  
>  #undef TP_printk
>  #define TP_printk(fmt, args...) "\"" fmt "\", "  __stringify(args)
> @@ -651,6 +703,9 @@ __attribute__((section("_ftrace_events")))
> *__event_##call = &event_##call
>  #undef __get_str
>  #define __get_str(field) (char *)__get_dynamic_array(field)
>  
> +#undef __get_cpumask
> +#define __get_cpumask(field) (char *)__get_dynamic_array(field)
> +
>  #undef __perf_addr
>  #define __perf_addr(a)	(__addr = (a))
>  
> diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
> index a436de1..de43631 100644
> --- a/kernel/trace/trace_output.c
> +++ b/kernel/trace/trace_output.c
> @@ -126,6 +126,36 @@ trace_seq_printf(struct trace_seq *s, const char *fmt,
> ...)
>  EXPORT_SYMBOL_GPL(trace_seq_printf);
>  
>  /**
> + * trace_seq_cpumask - put a list of longs as a cpumask print output
> + * @s:		trace sequence descriptor
> + * @maskp:	points to an array of unsigned longs that represent a cpumask
> + * @nmaskbits:	The number of bits that are valid in @maskp
> + *
> + * It returns 0 if the trace oversizes the buffer's free
> + * space, 1 otherwise.
> + *
> + * Although the name has cpumask, it does not use the cpumask structure,
> + * but instead the internal bitmask structure used by cpumask. This
> + * is used for printing cpumasks in tracepoints.
> + */
> +int
> +trace_seq_cpumask(struct trace_seq *s, const unsigned long *maskp,
> +		  int nmaskbits)
> +{
> +	int len = (PAGE_SIZE - 1) - s->len;
> +	int ret;
> +
> +	if (s->full || !len)
> +		return 0;
> +
> +	ret = bitmap_scnprintf(s->buffer, len, maskp, nmaskbits);
> +	s->len += ret;
> +
> +	return 1;
> +}
> +EXPORT_SYMBOL_GPL(trace_seq_cpumask);
> +
> +/**
>   * trace_seq_vprintf - sequence printing of trace information
>   * @s: trace sequence descriptor
>   * @fmt: printf format string
> @@ -399,6 +429,19 @@ EXPORT_SYMBOL(ftrace_print_symbols_seq_u64);
>  #endif
>  
>  const char *
> +ftrace_print_cpumask_seq(struct trace_seq *p, void *cpumask_ptr,
> +			 unsigned int cpumask_size)
> +{
> +	const char *ret = p->buffer + p->len;
> +
> +	trace_seq_cpumask(p, cpumask_ptr, cpumask_size * 8);
> +	trace_seq_putc(p, 0);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(ftrace_print_cpumask_seq);
> +
> +const char *
>  ftrace_print_hex_seq(struct trace_seq *p, const unsigned char *buf, int
>  buf_len)
>  {
>  	int i;
> --
> 1.8.1.4
> 
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH] tracing: Add __cpumask() macro to trace events to record cpumasks
  2014-05-06 19:16     ` Mathieu Desnoyers
@ 2014-05-06 19:30       ` Steven Rostedt
  2014-05-07  3:12       ` [RFC][PATCH v2] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks Steven Rostedt
  1 sibling, 0 replies; 48+ messages in thread
From: Steven Rostedt @ 2014-05-06 19:30 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Javi Merino, Andrew Morton, Ingo Molnar, Namhyung Kim, Jiri Olsa

On Tue, 6 May 2014 19:16:34 +0000 (UTC)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> ----- Original Message -----
> > From: "Steven Rostedt" <rostedt@goodmis.org>
> > To: "LKML" <linux-kernel@vger.kernel.org>
> > Cc: "Javi Merino" <javi.merino@arm.com>, "Andrew Morton" <akpm@linux-foundation.org>, "Mathieu Desnoyers"
> > <mathieu.desnoyers@efficios.com>, "Ingo Molnar" <mingo@kernel.org>, "Namhyung Kim" <namhyung@kernel.org>, "Jiri
> > Olsa" <jolsa@redhat.com>
> > Sent: Tuesday, May 6, 2014 1:22:38 PM
> > Subject: [RFC][PATCH] tracing: Add __cpumask() macro to trace events to record cpumasks
> > 
> > 
> > Being able to show a cpumask of events can be useful as some events
> > may affect only some CPUs. There is no standard way to record the
> > cpumask and converting it to a string is rather expensive during
> > the trace as traces happen in hotpaths. It would be better to record
> > the raw event mask and be able to parse it at print time.
> 
> Why name the type cpumask especially ? The type could be a "bitmask",
> and the specific field semantic in this case (field name) could
> be "cpus", thus implying that this is a cpu mask. This would allow
> using bitmasks for other things than cpu masks.

Hmm, sure, perhaps it would be better to call it bitmask instead. It
basically is already.

> 
> Moreover, I've been thinking about eventually adding a bitmask associated
> with an enumeration, so each bit of the bitmask could be associated with
> an entry from an enumeration (a string). So having a bitmask type would
> be a good step in that direction. For instance, the block layer
> instrumentation could use this for the rwbs field, instead of
> printing characters with blk_fill_rwbs(). This would save space and
> cpu time at trace collection.

Perhaps.

-- Steve



^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC][PATCH v2] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks
  2014-05-06 19:16     ` Mathieu Desnoyers
  2014-05-06 19:30       ` Steven Rostedt
@ 2014-05-07  3:12       ` Steven Rostedt
  2014-05-07 11:40         ` Mathieu Desnoyers
  2014-05-14 14:23         ` Javi Merino
  1 sibling, 2 replies; 48+ messages in thread
From: Steven Rostedt @ 2014-05-07  3:12 UTC (permalink / raw)
  To: LKML
  Cc: Mathieu Desnoyers, Javi Merino, Andrew Morton, Ingo Molnar,
	Namhyung Kim, Jiri Olsa


Being able to show a cpumask of events can be useful as some events
may affect only some CPUs. There is no standard way to record the
cpumask and converting it to a string is rather expensive during
the trace as traces happen in hotpaths. It would be better to record
the raw event mask and be able to parse it at print time.

The following macros were added for use with the TRACE_EVENT() macro:

  __bitmask()
  __assign_bitmask()
  __get_bitmask()

To test this, I added this to the sched_migrate_task event, which
looked like this:

TRACE_EVENT(sched_migrate_task,

	TP_PROTO(struct task_struct *p, int dest_cpu, const struct cpumask *cpus),

	TP_ARGS(p, dest_cpu, cpus),

	TP_STRUCT__entry(
		__array(	char,	comm,	TASK_COMM_LEN	)
		__field(	pid_t,	pid			)
		__field(	int,	prio			)
		__field(	int,	orig_cpu		)
		__field(	int,	dest_cpu		)
		__bitmask(	cpumask, cpus			)
	),

	TP_fast_assign(
		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
		__entry->pid		= p->pid;
		__entry->prio		= p->prio;
		__entry->orig_cpu	= task_cpu(p);
		__entry->dest_cpu	= dest_cpu;
		__assign_bitmask(cpumask, cpumask_bits(cpus));
	),

	TP_printk("comm=%s pid=%d prio=%d orig_cpu=%d dest_cpu=%d cpumask=%s",
		  __entry->comm, __entry->pid, __entry->prio,
		  __entry->orig_cpu, __entry->dest_cpu,
		  __get_bitmask(cpumask))
);

With the output of:

        ksmtuned-3613  [003] d..2   485.220508: sched_migrate_task: comm=ksmtuned pid=3615 prio=120 orig_cpu=3 dest_cpu=2 cpumask=00000000,0000000f
     migration/1-13    [001] d..5   485.221202: sched_migrate_task: comm=ksmtuned pid=3614 prio=120 orig_cpu=1 dest_cpu=0 cpumask=00000000,0000000f
             awk-3615  [002] d.H5   485.221747: sched_migrate_task: comm=rcu_preempt pid=7 prio=120 orig_cpu=0 dest_cpu=1 cpumask=00000000,000000ff
     migration/2-18    [002] d..5   485.222062: sched_migrate_task: comm=ksmtuned pid=3615 prio=120 orig_cpu=2 dest_cpu=3 cpumask=00000000,0000000f

Link: http://lkml.kernel.org/r/1399377998-14870-6-git-send-email-javi.merino@arm.com

Suggested-by: Javi Merino <javi.merino@arm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---

Changes since v1: Use bitmask name instead of cpumask naming.

---
 include/linux/ftrace_event.h |  3 +++
 include/linux/trace_seq.h    | 10 ++++++++
 include/trace/ftrace.h       | 57 +++++++++++++++++++++++++++++++++++++++++++-
 kernel/trace/trace_output.c  | 41 +++++++++++++++++++++++++++++++
 4 files changed, 110 insertions(+), 1 deletion(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index d16da3e..cff3106 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -38,6 +38,9 @@ const char *ftrace_print_symbols_seq_u64(struct trace_seq *p,
 								 *symbol_array);
 #endif
 
+const char *ftrace_print_bitmask_seq(struct trace_seq *p, void *bitmask_ptr,
+				     unsigned int bitmask_size);
+
 const char *ftrace_print_hex_seq(struct trace_seq *p,
 				 const unsigned char *buf, int len);
 
diff --git a/include/linux/trace_seq.h b/include/linux/trace_seq.h
index a32d86e..1361169 100644
--- a/include/linux/trace_seq.h
+++ b/include/linux/trace_seq.h
@@ -46,6 +46,9 @@ extern int trace_seq_putmem_hex(struct trace_seq *s, const void *mem,
 extern void *trace_seq_reserve(struct trace_seq *s, size_t len);
 extern int trace_seq_path(struct trace_seq *s, const struct path *path);
 
+extern int trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp,
+			     int nmaskbits);
+
 #else /* CONFIG_TRACING */
 static inline int trace_seq_printf(struct trace_seq *s, const char *fmt, ...)
 {
@@ -57,6 +60,13 @@ trace_seq_bprintf(struct trace_seq *s, const char *fmt, const u32 *binary)
 	return 0;
 }
 
+static inline int
+trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp,
+		  int nmaskbits)
+{
+	return 0;
+}
+
 static inline int trace_print_seq(struct seq_file *m, struct trace_seq *s)
 {
 	return 0;
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 0a1a4f7..d9c44af 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -53,6 +53,9 @@
 #undef __string
 #define __string(item, src) __dynamic_array(char, item, -1)
 
+#undef __bitmask
+#define __bitmask(item, src) __dynamic_array(char, item, -1)
+
 #undef TP_STRUCT__entry
 #define TP_STRUCT__entry(args...) args
 
@@ -128,6 +131,9 @@
 #undef __string
 #define __string(item, src) __dynamic_array(char, item, -1)
 
+#undef __string
+#define __string(item, src) __dynamic_array(unsigned long, item, -1)
+
 #undef DECLARE_EVENT_CLASS
 #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
 	struct ftrace_data_offsets_##call {				\
@@ -200,6 +206,15 @@
 #undef __get_str
 #define __get_str(field) (char *)__get_dynamic_array(field)
 
+#undef __get_bitmask
+#define __get_bitmask(field)						\
+	({								\
+		void *__bitmask = __get_dynamic_array(field);		\
+		unsigned int __bitmask_size;				\
+		__bitmask_size = (__entry->__data_loc_##field >> 16) & 0xffff; \
+		ftrace_print_bitmask_seq(p, __bitmask, __bitmask_size);	\
+	})
+
 #undef __print_flags
 #define __print_flags(flag, delim, flag_array...)			\
 	({								\
@@ -322,6 +337,9 @@ static struct trace_event_functions ftrace_event_type_funcs_##call = {	\
 #undef __string
 #define __string(item, src) __dynamic_array(char, item, -1)
 
+#undef __bitmask
+#define __bitmask(item, src) __dynamic_array(unsigned long, item, -1)
+
 #undef DECLARE_EVENT_CLASS
 #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, func, print)	\
 static int notrace __init						\
@@ -372,6 +390,29 @@ ftrace_define_fields_##call(struct ftrace_event_call *event_call)	\
 #define __string(item, src) __dynamic_array(char, item,			\
 		    strlen((src) ? (const char *)(src) : "(null)") + 1)
 
+/*
+ * __bitmask_size_in_bytes_raw is the number of bytes needed to hold
+ * num_possible_cpus().
+ */
+#define __bitmask_size_in_bytes_raw			\
+	((num_possible_cpus() + 7) / 8)
+
+#define __bitmask_size_in_longs						\
+	((__bitmask_size_in_bytes_raw + ((BITS_PER_LONG / 8) - 1))	\
+	 / (BITS_PER_LONG / 8))
+
+/*
+ * __bitmask_size_in_bytes is the number of bytes needed to hold
+ * num_possible_cpus() padded out to the nearest long. This is what
+ * is saved in the buffer, just to be consistent.
+ */
+#define __bitmask_size_in_bytes				\
+	(__bitmask_size_in_longs * (BITS_PER_LONG / 8))
+
+#undef __bitmask
+#define __bitmask(item, src) __dynamic_array(unsigned long, item,	\
+					     __bitmask_size_in_longs)
+
 #undef DECLARE_EVENT_CLASS
 #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
 static inline notrace int ftrace_get_offsets_##call(			\
@@ -513,12 +554,22 @@ static inline notrace int ftrace_get_offsets_##call(			\
 	__entry->__data_loc_##item = __data_offsets.item;
 
 #undef __string
-#define __string(item, src) __dynamic_array(char, item, -1)       	\
+#define __string(item, src) __dynamic_array(char, item, -1)
 
 #undef __assign_str
 #define __assign_str(dst, src)						\
 	strcpy(__get_str(dst), (src) ? (const char *)(src) : "(null)");
 
+#undef __bitmask
+#define __bitmask(item, src) __dynamic_array(unsigned long, item, -1)
+
+#undef __get_bitmask
+#define __get_bitmask(field) (char *)__get_dynamic_array(field)
+
+#undef __assign_bitmask
+#define __assign_bitmask(dst, src)					\
+	memcpy(__get_bitmask(dst), (src), __bitmask_size_in_bytes)
+
 #undef TP_fast_assign
 #define TP_fast_assign(args...) args
 
@@ -586,6 +637,7 @@ static inline void ftrace_test_probe_##call(void)			\
 #undef __print_hex
 #undef __get_dynamic_array
 #undef __get_str
+#undef __get_bitmask
 
 #undef TP_printk
 #define TP_printk(fmt, args...) "\"" fmt "\", "  __stringify(args)
@@ -651,6 +703,9 @@ __attribute__((section("_ftrace_events"))) *__event_##call = &event_##call
 #undef __get_str
 #define __get_str(field) (char *)__get_dynamic_array(field)
 
+#undef __get_bitmask
+#define __get_bitmask(field) (char *)__get_dynamic_array(field)
+
 #undef __perf_addr
 #define __perf_addr(a)	(__addr = (a))
 
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index a436de1..f3dad80 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -126,6 +126,34 @@ trace_seq_printf(struct trace_seq *s, const char *fmt, ...)
 EXPORT_SYMBOL_GPL(trace_seq_printf);
 
 /**
+ * trace_seq_bitmask - put a list of longs as a bitmask print output
+ * @s:		trace sequence descriptor
+ * @maskp:	points to an array of unsigned longs that represent a bitmask
+ * @nmaskbits:	The number of bits that are valid in @maskp
+ *
+ * It returns 0 if the trace oversizes the buffer's free
+ * space, 1 otherwise.
+ *
+ * Writes a ASCII representation of a bitmask string into @s.
+ */
+int
+trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp,
+		  int nmaskbits)
+{
+	int len = (PAGE_SIZE - 1) - s->len;
+	int ret;
+
+	if (s->full || !len)
+		return 0;
+
+	ret = bitmap_scnprintf(s->buffer, len, maskp, nmaskbits);
+	s->len += ret;
+
+	return 1;
+}
+EXPORT_SYMBOL_GPL(trace_seq_bitmask);
+
+/**
  * trace_seq_vprintf - sequence printing of trace information
  * @s: trace sequence descriptor
  * @fmt: printf format string
@@ -399,6 +427,19 @@ EXPORT_SYMBOL(ftrace_print_symbols_seq_u64);
 #endif
 
 const char *
+ftrace_print_bitmask_seq(struct trace_seq *p, void *bitmask_ptr,
+			 unsigned int bitmask_size)
+{
+	const char *ret = p->buffer + p->len;
+
+	trace_seq_bitmask(p, bitmask_ptr, bitmask_size * 8);
+	trace_seq_putc(p, 0);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(ftrace_print_bitmask_seq);
+
+const char *
 ftrace_print_hex_seq(struct trace_seq *p, const unsigned char *buf, int buf_len)
 {
 	int i;
-- 
1.8.1.4




^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH v2] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks
  2014-05-07  3:12       ` [RFC][PATCH v2] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks Steven Rostedt
@ 2014-05-07 11:40         ` Mathieu Desnoyers
  2014-05-07 15:45           ` Steven Rostedt
  2014-05-14 14:23         ` Javi Merino
  1 sibling, 1 reply; 48+ messages in thread
From: Mathieu Desnoyers @ 2014-05-07 11:40 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Javi Merino, Andrew Morton, Ingo Molnar, Namhyung Kim, Jiri Olsa

----- Original Message -----
> From: "Steven Rostedt" <rostedt@goodmis.org>
> To: "LKML" <linux-kernel@vger.kernel.org>
> Cc: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>, "Javi Merino" <javi.merino@arm.com>, "Andrew Morton"
> <akpm@linux-foundation.org>, "Ingo Molnar" <mingo@kernel.org>, "Namhyung Kim" <namhyung@kernel.org>, "Jiri Olsa"
> <jolsa@redhat.com>
> Sent: Tuesday, May 6, 2014 11:12:38 PM
> Subject: [RFC][PATCH v2] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks
> 
> 
> Being able to show a cpumask of events can be useful as some events
> may affect only some CPUs. There is no standard way to record the
> cpumask and converting it to a string is rather expensive during
> the trace as traces happen in hotpaths. It would be better to record
> the raw event mask and be able to parse it at print time.

Hi Steven,

OK. Changing cpumask into bitmask helps making this more generic. Now
I'm concerned about another topic: the type backing the bitmask, as well
as its endianness, have an important impact when storing and reading
the trace on different machines. This is not a problem cpumask typically
has, because the test_bit() and set bit APIs are always used from within
the kernel.

Looking at:

/**
 * test_bit - Determine whether a bit is set
 * @nr: bit number to test
 * @addr: Address to start counting from
 */
static inline int test_bit(int nr, const volatile unsigned long *addr)
{
        return 1UL & (addr[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG-1)));
}

We notice that the kernel's bit operations use an unsigned long
as "backing type" for the bitfield, with a native kernel endianness.

Therefore, the trace reader needs to take into account those two
informations (backing type size, and endianness) in order to get the
bit order right.

I remember that you made sure Ftrace's event description files expose
the endianness of the kernel, which partially takes care of this
concern, but how do you plan exposing the bitmask backing type size ?
(unsigned long maps to 32-bit or 64-bit depending on the architecture).

Thanks,

Mathieu

> 
> The following macros were added for use with the TRACE_EVENT() macro:
> 
>   __bitmask()
>   __assign_bitmask()
>   __get_bitmask()
> 
> To test this, I added this to the sched_migrate_task event, which
> looked like this:
> 
> TRACE_EVENT(sched_migrate_task,
> 
> 	TP_PROTO(struct task_struct *p, int dest_cpu, const struct cpumask *cpus),
> 
> 	TP_ARGS(p, dest_cpu, cpus),
> 
> 	TP_STRUCT__entry(
> 		__array(	char,	comm,	TASK_COMM_LEN	)
> 		__field(	pid_t,	pid			)
> 		__field(	int,	prio			)
> 		__field(	int,	orig_cpu		)
> 		__field(	int,	dest_cpu		)
> 		__bitmask(	cpumask, cpus			)
> 	),
> 
> 	TP_fast_assign(
> 		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
> 		__entry->pid		= p->pid;
> 		__entry->prio		= p->prio;
> 		__entry->orig_cpu	= task_cpu(p);
> 		__entry->dest_cpu	= dest_cpu;
> 		__assign_bitmask(cpumask, cpumask_bits(cpus));
> 	),
> 
> 	TP_printk("comm=%s pid=%d prio=%d orig_cpu=%d dest_cpu=%d cpumask=%s",
> 		  __entry->comm, __entry->pid, __entry->prio,
> 		  __entry->orig_cpu, __entry->dest_cpu,
> 		  __get_bitmask(cpumask))
> );
> 
> With the output of:
> 
>         ksmtuned-3613  [003] d..2   485.220508: sched_migrate_task:
>         comm=ksmtuned pid=3615 prio=120 orig_cpu=3 dest_cpu=2
>         cpumask=00000000,0000000f
>      migration/1-13    [001] d..5   485.221202: sched_migrate_task:
>      comm=ksmtuned pid=3614 prio=120 orig_cpu=1 dest_cpu=0
>      cpumask=00000000,0000000f
>              awk-3615  [002] d.H5   485.221747: sched_migrate_task:
>              comm=rcu_preempt pid=7 prio=120 orig_cpu=0 dest_cpu=1
>              cpumask=00000000,000000ff
>      migration/2-18    [002] d..5   485.222062: sched_migrate_task:
>      comm=ksmtuned pid=3615 prio=120 orig_cpu=2 dest_cpu=3
>      cpumask=00000000,0000000f
> 
> Link:
> http://lkml.kernel.org/r/1399377998-14870-6-git-send-email-javi.merino@arm.com
> 
> Suggested-by: Javi Merino <javi.merino@arm.com>
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
> 
> Changes since v1: Use bitmask name instead of cpumask naming.
> 
> ---
>  include/linux/ftrace_event.h |  3 +++
>  include/linux/trace_seq.h    | 10 ++++++++
>  include/trace/ftrace.h       | 57
>  +++++++++++++++++++++++++++++++++++++++++++-
>  kernel/trace/trace_output.c  | 41 +++++++++++++++++++++++++++++++
>  4 files changed, 110 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
> index d16da3e..cff3106 100644
> --- a/include/linux/ftrace_event.h
> +++ b/include/linux/ftrace_event.h
> @@ -38,6 +38,9 @@ const char *ftrace_print_symbols_seq_u64(struct trace_seq
> *p,
>  								 *symbol_array);
>  #endif
>  
> +const char *ftrace_print_bitmask_seq(struct trace_seq *p, void *bitmask_ptr,
> +				     unsigned int bitmask_size);
> +
>  const char *ftrace_print_hex_seq(struct trace_seq *p,
>  				 const unsigned char *buf, int len);
>  
> diff --git a/include/linux/trace_seq.h b/include/linux/trace_seq.h
> index a32d86e..1361169 100644
> --- a/include/linux/trace_seq.h
> +++ b/include/linux/trace_seq.h
> @@ -46,6 +46,9 @@ extern int trace_seq_putmem_hex(struct trace_seq *s, const
> void *mem,
>  extern void *trace_seq_reserve(struct trace_seq *s, size_t len);
>  extern int trace_seq_path(struct trace_seq *s, const struct path *path);
>  
> +extern int trace_seq_bitmask(struct trace_seq *s, const unsigned long
> *maskp,
> +			     int nmaskbits);
> +
>  #else /* CONFIG_TRACING */
>  static inline int trace_seq_printf(struct trace_seq *s, const char *fmt,
>  ...)
>  {
> @@ -57,6 +60,13 @@ trace_seq_bprintf(struct trace_seq *s, const char *fmt,
> const u32 *binary)
>  	return 0;
>  }
>  
> +static inline int
> +trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp,
> +		  int nmaskbits)
> +{
> +	return 0;
> +}
> +
>  static inline int trace_print_seq(struct seq_file *m, struct trace_seq *s)
>  {
>  	return 0;
> diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
> index 0a1a4f7..d9c44af 100644
> --- a/include/trace/ftrace.h
> +++ b/include/trace/ftrace.h
> @@ -53,6 +53,9 @@
>  #undef __string
>  #define __string(item, src) __dynamic_array(char, item, -1)
>  
> +#undef __bitmask
> +#define __bitmask(item, src) __dynamic_array(char, item, -1)
> +
>  #undef TP_STRUCT__entry
>  #define TP_STRUCT__entry(args...) args
>  
> @@ -128,6 +131,9 @@
>  #undef __string
>  #define __string(item, src) __dynamic_array(char, item, -1)
>  
> +#undef __string
> +#define __string(item, src) __dynamic_array(unsigned long, item, -1)
> +
>  #undef DECLARE_EVENT_CLASS
>  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
>  	struct ftrace_data_offsets_##call {				\
> @@ -200,6 +206,15 @@
>  #undef __get_str
>  #define __get_str(field) (char *)__get_dynamic_array(field)
>  
> +#undef __get_bitmask
> +#define __get_bitmask(field)						\
> +	({								\
> +		void *__bitmask = __get_dynamic_array(field);		\
> +		unsigned int __bitmask_size;				\
> +		__bitmask_size = (__entry->__data_loc_##field >> 16) & 0xffff; \
> +		ftrace_print_bitmask_seq(p, __bitmask, __bitmask_size);	\
> +	})
> +
>  #undef __print_flags
>  #define __print_flags(flag, delim, flag_array...)			\
>  	({								\
> @@ -322,6 +337,9 @@ static struct trace_event_functions
> ftrace_event_type_funcs_##call = {	\
>  #undef __string
>  #define __string(item, src) __dynamic_array(char, item, -1)
>  
> +#undef __bitmask
> +#define __bitmask(item, src) __dynamic_array(unsigned long, item, -1)
> +
>  #undef DECLARE_EVENT_CLASS
>  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, func, print)	\
>  static int notrace __init						\
> @@ -372,6 +390,29 @@ ftrace_define_fields_##call(struct ftrace_event_call
> *event_call)	\
>  #define __string(item, src) __dynamic_array(char, item,			\
>  		    strlen((src) ? (const char *)(src) : "(null)") + 1)
>  
> +/*
> + * __bitmask_size_in_bytes_raw is the number of bytes needed to hold
> + * num_possible_cpus().
> + */
> +#define __bitmask_size_in_bytes_raw			\
> +	((num_possible_cpus() + 7) / 8)
> +
> +#define __bitmask_size_in_longs						\
> +	((__bitmask_size_in_bytes_raw + ((BITS_PER_LONG / 8) - 1))	\
> +	 / (BITS_PER_LONG / 8))
> +
> +/*
> + * __bitmask_size_in_bytes is the number of bytes needed to hold
> + * num_possible_cpus() padded out to the nearest long. This is what
> + * is saved in the buffer, just to be consistent.
> + */
> +#define __bitmask_size_in_bytes				\
> +	(__bitmask_size_in_longs * (BITS_PER_LONG / 8))
> +
> +#undef __bitmask
> +#define __bitmask(item, src) __dynamic_array(unsigned long, item,	\
> +					     __bitmask_size_in_longs)
> +
>  #undef DECLARE_EVENT_CLASS
>  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
>  static inline notrace int ftrace_get_offsets_##call(			\
> @@ -513,12 +554,22 @@ static inline notrace int ftrace_get_offsets_##call(
> 			\
>  	__entry->__data_loc_##item = __data_offsets.item;
>  
>  #undef __string
> -#define __string(item, src) __dynamic_array(char, item, -1)       	\
> +#define __string(item, src) __dynamic_array(char, item, -1)
>  
>  #undef __assign_str
>  #define __assign_str(dst, src)						\
>  	strcpy(__get_str(dst), (src) ? (const char *)(src) : "(null)");
>  
> +#undef __bitmask
> +#define __bitmask(item, src) __dynamic_array(unsigned long, item, -1)
> +
> +#undef __get_bitmask
> +#define __get_bitmask(field) (char *)__get_dynamic_array(field)
> +
> +#undef __assign_bitmask
> +#define __assign_bitmask(dst, src)					\
> +	memcpy(__get_bitmask(dst), (src), __bitmask_size_in_bytes)
> +
>  #undef TP_fast_assign
>  #define TP_fast_assign(args...) args
>  
> @@ -586,6 +637,7 @@ static inline void ftrace_test_probe_##call(void)			\
>  #undef __print_hex
>  #undef __get_dynamic_array
>  #undef __get_str
> +#undef __get_bitmask
>  
>  #undef TP_printk
>  #define TP_printk(fmt, args...) "\"" fmt "\", "  __stringify(args)
> @@ -651,6 +703,9 @@ __attribute__((section("_ftrace_events")))
> *__event_##call = &event_##call
>  #undef __get_str
>  #define __get_str(field) (char *)__get_dynamic_array(field)
>  
> +#undef __get_bitmask
> +#define __get_bitmask(field) (char *)__get_dynamic_array(field)
> +
>  #undef __perf_addr
>  #define __perf_addr(a)	(__addr = (a))
>  
> diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
> index a436de1..f3dad80 100644
> --- a/kernel/trace/trace_output.c
> +++ b/kernel/trace/trace_output.c
> @@ -126,6 +126,34 @@ trace_seq_printf(struct trace_seq *s, const char *fmt,
> ...)
>  EXPORT_SYMBOL_GPL(trace_seq_printf);
>  
>  /**
> + * trace_seq_bitmask - put a list of longs as a bitmask print output
> + * @s:		trace sequence descriptor
> + * @maskp:	points to an array of unsigned longs that represent a bitmask
> + * @nmaskbits:	The number of bits that are valid in @maskp
> + *
> + * It returns 0 if the trace oversizes the buffer's free
> + * space, 1 otherwise.
> + *
> + * Writes a ASCII representation of a bitmask string into @s.
> + */
> +int
> +trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp,
> +		  int nmaskbits)
> +{
> +	int len = (PAGE_SIZE - 1) - s->len;
> +	int ret;
> +
> +	if (s->full || !len)
> +		return 0;
> +
> +	ret = bitmap_scnprintf(s->buffer, len, maskp, nmaskbits);
> +	s->len += ret;
> +
> +	return 1;
> +}
> +EXPORT_SYMBOL_GPL(trace_seq_bitmask);
> +
> +/**
>   * trace_seq_vprintf - sequence printing of trace information
>   * @s: trace sequence descriptor
>   * @fmt: printf format string
> @@ -399,6 +427,19 @@ EXPORT_SYMBOL(ftrace_print_symbols_seq_u64);
>  #endif
>  
>  const char *
> +ftrace_print_bitmask_seq(struct trace_seq *p, void *bitmask_ptr,
> +			 unsigned int bitmask_size)
> +{
> +	const char *ret = p->buffer + p->len;
> +
> +	trace_seq_bitmask(p, bitmask_ptr, bitmask_size * 8);
> +	trace_seq_putc(p, 0);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(ftrace_print_bitmask_seq);
> +
> +const char *
>  ftrace_print_hex_seq(struct trace_seq *p, const unsigned char *buf, int
>  buf_len)
>  {
>  	int i;
> --
> 1.8.1.4
> 
> 
> 
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH v2] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks
  2014-05-07 11:40         ` Mathieu Desnoyers
@ 2014-05-07 15:45           ` Steven Rostedt
  0 siblings, 0 replies; 48+ messages in thread
From: Steven Rostedt @ 2014-05-07 15:45 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Javi Merino, Andrew Morton, Ingo Molnar, Namhyung Kim, Jiri Olsa

On Wed, 7 May 2014 11:40:21 +0000 (UTC)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> ----- Original Message -----
> > From: "Steven Rostedt" <rostedt@goodmis.org>
> > To: "LKML" <linux-kernel@vger.kernel.org>
> > Cc: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>, "Javi Merino" <javi.merino@arm.com>, "Andrew Morton"
> > <akpm@linux-foundation.org>, "Ingo Molnar" <mingo@kernel.org>, "Namhyung Kim" <namhyung@kernel.org>, "Jiri Olsa"
> > <jolsa@redhat.com>
> > Sent: Tuesday, May 6, 2014 11:12:38 PM
> > Subject: [RFC][PATCH v2] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks
> > 
> > 
> > Being able to show a cpumask of events can be useful as some events
> > may affect only some CPUs. There is no standard way to record the
> > cpumask and converting it to a string is rather expensive during
> > the trace as traces happen in hotpaths. It would be better to record
> > the raw event mask and be able to parse it at print time.
> 
> Hi Steven,
> 
> OK. Changing cpumask into bitmask helps making this more generic. Now
> I'm concerned about another topic: the type backing the bitmask, as well
> as its endianness, have an important impact when storing and reading
> the trace on different machines. This is not a problem cpumask typically
> has, because the test_bit() and set bit APIs are always used from within
> the kernel.
> 
> Looking at:
> 
> /**
>  * test_bit - Determine whether a bit is set
>  * @nr: bit number to test
>  * @addr: Address to start counting from
>  */
> static inline int test_bit(int nr, const volatile unsigned long *addr)
> {
>         return 1UL & (addr[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG-1)));
> }
> 
> We notice that the kernel's bit operations use an unsigned long
> as "backing type" for the bitfield, with a native kernel endianness.
> 
> Therefore, the trace reader needs to take into account those two
> informations (backing type size, and endianness) in order to get the
> bit order right.
> 
> I remember that you made sure Ftrace's event description files expose
> the endianness of the kernel, which partially takes care of this
> concern, but how do you plan exposing the bitmask backing type size ?
> (unsigned long maps to 32-bit or 64-bit depending on the architecture).

Well, actually, the files don't expose endianess, but that is done by
the userspace tracer running on the machine. I'm not sure of any
machine that would have userspace and kernel space endianess different.

But the 32-bit vs 64-bit is exposed. Not really directly, but the way
traceevent finds this out is with pevent->header_page_size_size. That
is, in /debug/tracing/events/header_page the field "commit" is of size
long in the kernel. It's 8 for 64bit and 4 for 32bit.

This works even if userspace is 32 and kernel is 64.

With both endian and long size, it should be trivial for the reader to
be able to parse the cpumask on any machine.

-- Steve

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 1/5] thermal: let governors have private data for each thermal zone
  2014-05-06 12:06 ` [RFC PATCH 1/5] thermal: let governors have private data for each thermal zone Javi Merino
@ 2014-05-13 14:31   ` edubezval
  2014-05-15 14:30     ` Javi Merino
  0 siblings, 1 reply; 48+ messages in thread
From: edubezval @ 2014-05-13 14:31 UTC (permalink / raw)
  To: Javi Merino; +Cc: linux-pm, Punit Agrawal, Zhang Rui

Hello Javi,

Thanks for your patch. Find some comment inline.

On Tue, May 06, 2014 at 01:06:34PM +0100, Javi Merino wrote:
> A governor may need to store its current state between calls to
> throttle().  That state depends on the thermal zone, so store it as
> private data in struct thermal_zone_device.
>
> The governors may have two new ops: bind_to_tz() and unbind_from_tz().
> When provided, these functions let governors do some initialization
> and teardown when they are bound/unbound to a tz and possibly store that
> information in the governor_data field of the struct
> thermal_zone_device.


Overall the idea is fine to me, although so far we have no case of such
need in the current governors. I will have a look on your proposed
governor to check futher on this API.

>
> Cc: Zhang Rui <rui.zhang@intel.com>
> Cc: Eduardo Valentin <edubezval@gmail.com>
> Signed-off-by: Javi Merino <javi.merino@arm.com>
> ---
>  drivers/thermal/thermal_core.c |   67 +++++++++++++++++++++++++++++++++++-----
>  include/linux/thermal.h        |    3 ++
>  2 files changed, 62 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index 71b0ec0c370d..ac8e5990a63e 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -72,6 +72,45 @@ static struct thermal_governor *__find_governor(const char *name)
>   return NULL;
>  }
>
> +/**
> + * thermal_set_governor() - Switch to another governor
> + * @tz: a valid pointer to a struct thermal_zone_device
> + * @new_gov: pointer to the new governor
> + *
> + * Change the governor of thermal zone @tz.
> + *
> + * Returns 0 on success, an error if the new governor's bind_to_tz() failed.
> + */
> +static int thermal_set_governor(struct thermal_zone_device *tz,
> + struct thermal_governor *new_gov)
> +{
> + int ret = 0;
> +
> + if (tz->governor && tz->governor->unbind_from_tz)
> + tz->governor->unbind_from_tz(tz);
> +
> + if (new_gov && new_gov->bind_to_tz) {
> + ret = new_gov->bind_to_tz(tz);
> + if (ret) {
> + /* Try to register the old governor again */

I suggest to state in the Documentation that a governor may be called
to bind and unbind to same thermal zone several times, as per above code.

> + if (tz->governor && tz->governor->bind_to_tz) {
> + if (tz->governor->bind_to_tz(tz))
> + /*
> + * The new governor failed to register
> + * and the previous one failed as well
> + */

I suggest having a dev_warn here to notify that a zone is now orphan due to
switching from one governor to another.

> + tz->governor = NULL;
> + }
> +
> + return ret;
> + }
> + }
> +
> + tz->governor = new_gov;
> +
> + return ret;
> +}
> +
>  int thermal_register_governor(struct thermal_governor *governor)
>  {
>   int err;
> @@ -104,8 +143,12 @@ int thermal_register_governor(struct thermal_governor *governor)
>
>   name = pos->tzp->governor_name;
>
> - if (!strnicmp(name, governor->name, THERMAL_NAME_LENGTH))
> - pos->governor = governor;
> + if (!strnicmp(name, governor->name, THERMAL_NAME_LENGTH)) {
> + int ret = thermal_set_governor(pos, governor);

nip: add a blank line between declarations and first statement.

> + if (ret)
> + pr_warn("Failed to set governor %s for zone %d: %d\n",
> + governor->name, pos->id, ret);

dev_* family of logging functions is preferred in this file.

> + }
>   }
>
>   mutex_unlock(&thermal_list_lock);
> @@ -131,7 +174,7 @@ void thermal_unregister_governor(struct thermal_governor *governor)
>   list_for_each_entry(pos, &thermal_tz_list, node) {
>   if (!strnicmp(pos->governor->name, governor->name,
>   THERMAL_NAME_LENGTH))
> - pos->governor = NULL;
> + thermal_set_governor(pos, NULL);
>   }
>
>   mutex_unlock(&thermal_list_lock);
> @@ -756,8 +799,9 @@ policy_store(struct device *dev, struct device_attribute *attr,
>   if (!gov)
>   goto exit;
>
> - tz->governor = gov;
> - ret = count;
> + ret = thermal_set_governor(tz, gov);
> + if (!ret)
> + ret = count;
>
>  exit:
>   mutex_unlock(&thermal_governor_lock);
> @@ -1452,6 +1496,7 @@ struct thermal_zone_device *thermal_zone_device_register(const char *type,
>   int result;
>   int count;
>   int passive = 0;
> + struct thermal_governor *governor;
>
>   if (type && strlen(type) >= THERMAL_NAME_LENGTH)
>   return ERR_PTR(-EINVAL);
> @@ -1542,9 +1587,15 @@ struct thermal_zone_device *thermal_zone_device_register(const char *type,
>   mutex_lock(&thermal_governor_lock);
>
>   if (tz->tzp)
> - tz->governor = __find_governor(tz->tzp->governor_name);
> + governor = __find_governor(tz->tzp->governor_name);
>   else
> - tz->governor = def_governor;
> + governor = def_governor;
> +
> + result = thermal_set_governor(tz, governor);
> + if (result) {
> + mutex_unlock(&thermal_governor_lock);
> + goto unregister;
> + }
>
>   mutex_unlock(&thermal_governor_lock);
>
> @@ -1634,7 +1685,7 @@ void thermal_zone_device_unregister(struct thermal_zone_device *tz)
>   device_remove_file(&tz->device, &dev_attr_mode);
>   device_remove_file(&tz->device, &dev_attr_policy);
>   remove_trip_attrs(tz);
> - tz->governor = NULL;
> + thermal_set_governor(tz, NULL);
>
>   thermal_remove_hwmon_sysfs(tz);
>   release_idr(&thermal_tz_idr, &thermal_idr_lock, tz->id);
> diff --git a/include/linux/thermal.h b/include/linux/thermal.h
> index f7e11c7ea7d9..baac212f815e 100644
> --- a/include/linux/thermal.h
> +++ b/include/linux/thermal.h
> @@ -177,6 +177,7 @@ struct thermal_zone_device {
>   struct thermal_zone_device_ops *ops;
>   const struct thermal_zone_params *tzp;
>   struct thermal_governor *governor;
> + void *governor_data;

Please document this entry.

>   struct list_head thermal_instances;
>   struct idr idr;
>   struct mutex lock; /* protect thermal_instances list */
> @@ -187,6 +188,8 @@ struct thermal_zone_device {
>  /* Structure that holds thermal governor information */
>  struct thermal_governor {
>   char name[THERMAL_NAME_LENGTH];
> + int (*bind_to_tz)(struct thermal_zone_device *tz);
> + void (*unbind_from_tz)(struct thermal_zone_device *tz);

a governor cannot prevent unbinding?

Please, document the proposed operations.

>   int (*throttle)(struct thermal_zone_device *tz, int trip);
>   struct list_head governor_list;
>  };
> --
> 1.7.9.5
>
>


I request that you update Documentation/thermal/sysfs-api.txt too.

BR,

Eduardo Valentin,

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 2/5] thermal: let cooling devices operate on other units other than state
  2014-05-06 12:06 ` [RFC PATCH 2/5] thermal: let cooling devices operate on other units other than state Javi Merino
@ 2014-05-13 15:22   ` edubezval
  0 siblings, 0 replies; 48+ messages in thread
From: edubezval @ 2014-05-13 15:22 UTC (permalink / raw)
  To: Javi Merino; +Cc: linux-pm, Punit Agrawal, Zhang Rui

Hello Javi,

On Tue, May 06, 2014 at 01:06:35PM +0100, Javi Merino wrote:
> Currently, cooling devices operate on state, even though some (like
> the cpufreq cooling device) have exported functions to convert
> frequency to state.  Generalize this interface so that governors can
> operate on other units (e.g. frequency, power,...).
>
> Suggested-by: Eduardo Valentin <edubezval@gmail.com>
> Cc: Zhang Rui <rui.zhang@intel.com>
> Signed-off-by: Javi Merino <javi.merino@arm.com>
> ---
>  drivers/acpi/fan.c                  |   22 ++++++++++++++------
>  drivers/acpi/processor_thermal.c    |   23 +++++++++++++++------
>  drivers/acpi/video.c                |   22 ++++++++++++++------
>  drivers/platform/x86/acerhdf.c      |   24 ++++++++++++++++------
>  drivers/platform/x86/intel_menlow.c |   26 +++++++++++++++++-------
>  drivers/power/power_supply_core.c   |   24 ++++++++++++++++------
>  drivers/thermal/cpu_cooling.c       |   38 ++++++++++++++++++++++++++---------
>  drivers/thermal/db8500_thermal.c    |    2 +-
>  drivers/thermal/fair_share.c        |    2 +-
>  drivers/thermal/intel_powerclamp.c  |   24 ++++++++++++++++------
>  drivers/thermal/step_wise.c         |    2 +-
>  drivers/thermal/thermal_core.c      |   13 ++++++------
>  include/linux/thermal.h             |   13 +++++++++---
>  13 files changed, 169 insertions(+), 66 deletions(-)
>

This change must be reflected under Documentation/thermal/sysfs-api.txt
too.


> diff --git a/drivers/acpi/fan.c b/drivers/acpi/fan.c
> index 8acf53e62966..55346deb8202 100644
> --- a/drivers/acpi/fan.c
> +++ b/drivers/acpi/fan.c
> @@ -79,20 +79,26 @@ static struct acpi_driver acpi_fan_driver = {
>
>  /* thermal cooling device callbacks */
>  static int fan_get_max_state(struct thermal_cooling_device *cdev, unsigned long
> -     *state)
> +     *state, enum thermal_cooling_unit unit)
>  {
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
>   /* ACPI fan device only support two states: ON/OFF */
>   *state = 1;
>   return 0;
>  }
>
>  static int fan_get_cur_state(struct thermal_cooling_device *cdev, unsigned long
> -     *state)
> +     *state, enum thermal_cooling_unit unit)
>  {
>   struct acpi_device *device = cdev->devdata;
>   int result;
>   int acpi_state = ACPI_STATE_D0;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
>   if (!device)
>   return -EINVAL;
>
> @@ -106,11 +112,15 @@ static int fan_get_cur_state(struct thermal_cooling_device *cdev, unsigned long
>  }
>
>  static int
> -fan_set_cur_state(struct thermal_cooling_device *cdev, unsigned long state)
> +fan_set_cur_state(struct thermal_cooling_device *cdev, unsigned long state,
> +  enum thermal_cooling_unit unit)
>  {
>   struct acpi_device *device = cdev->devdata;
>   int result;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
>   if (!device || (state != 0 && state != 1))
>   return -EINVAL;
>
> @@ -121,9 +131,9 @@ fan_set_cur_state(struct thermal_cooling_device *cdev, unsigned long state)
>  }
>
>  static const struct thermal_cooling_device_ops fan_cooling_ops = {
> - .get_max_state = fan_get_max_state,
> - .get_cur_state = fan_get_cur_state,
> - .set_cur_state = fan_set_cur_state,
> + .get_max = fan_get_max_state,
> + .get_cur = fan_get_cur_state,
> + .set_cur = fan_set_cur_state,
>  };
>
>  /* --------------------------------------------------------------------------
> diff --git a/drivers/acpi/processor_thermal.c b/drivers/acpi/processor_thermal.c
> index e003663b2f8e..48179ed9afe5 100644
> --- a/drivers/acpi/processor_thermal.c
> +++ b/drivers/acpi/processor_thermal.c
> @@ -200,11 +200,15 @@ static int acpi_processor_max_state(struct acpi_processor *pr)
>  }
>  static int
>  processor_get_max_state(struct thermal_cooling_device *cdev,
> - unsigned long *state)
> + unsigned long *state,
> + enum thermal_cooling_unit unit)
>  {
>   struct acpi_device *device = cdev->devdata;
>   struct acpi_processor *pr;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
>   if (!device)
>   return -EINVAL;
>
> @@ -218,11 +222,15 @@ processor_get_max_state(struct thermal_cooling_device *cdev,
>
>  static int
>  processor_get_cur_state(struct thermal_cooling_device *cdev,
> - unsigned long *cur_state)
> + unsigned long *cur_state,
> + enum thermal_cooling_unit unit)
>  {
>   struct acpi_device *device = cdev->devdata;
>   struct acpi_processor *pr;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
>   if (!device)
>   return -EINVAL;
>
> @@ -238,13 +246,16 @@ processor_get_cur_state(struct thermal_cooling_device *cdev,
>
>  static int
>  processor_set_cur_state(struct thermal_cooling_device *cdev,
> - unsigned long state)
> + unsigned long state, enum thermal_cooling_unit unit)
>  {
>   struct acpi_device *device = cdev->devdata;
>   struct acpi_processor *pr;
>   int result = 0;
>   int max_pstate;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
>   if (!device)
>   return -EINVAL;
>
> @@ -270,7 +281,7 @@ processor_set_cur_state(struct thermal_cooling_device *cdev,
>  }
>
>  const struct thermal_cooling_device_ops processor_cooling_ops = {
> - .get_max_state = processor_get_max_state,
> - .get_cur_state = processor_get_cur_state,
> - .set_cur_state = processor_set_cur_state,
> + .get_max = processor_get_max_state,
> + .get_cur = processor_get_cur_state,
> + .set_cur = processor_set_cur_state,
>  };
> diff --git a/drivers/acpi/video.c b/drivers/acpi/video.c
> index 8b6990e417ec..f0aeb4d1ba91 100644
> --- a/drivers/acpi/video.c
> +++ b/drivers/acpi/video.c
> @@ -282,23 +282,29 @@ static const struct backlight_ops acpi_backlight_ops = {
>
>  /* thermal cooling device callbacks */
>  static int video_get_max_state(struct thermal_cooling_device *cooling_dev, unsigned
> -       long *state)
> +       long *state, enum thermal_cooling_unit unit)
>  {
>   struct acpi_device *device = cooling_dev->devdata;
>   struct acpi_video_device *video = acpi_driver_data(device);
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
>   *state = video->brightness->count - 3;
>   return 0;
>  }
>
>  static int video_get_cur_state(struct thermal_cooling_device *cooling_dev, unsigned
> -       long *state)
> +       long *state, enum thermal_cooling_unit unit)
>  {
>   struct acpi_device *device = cooling_dev->devdata;
>   struct acpi_video_device *video = acpi_driver_data(device);
>   unsigned long long level;
>   int offset;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
>   if (acpi_video_device_lcd_get_level_current(video, &level, false))
>   return -EINVAL;
>   for (offset = 2; offset < video->brightness->count; offset++)
> @@ -311,12 +317,16 @@ static int video_get_cur_state(struct thermal_cooling_device *cooling_dev, unsig
>  }
>
>  static int
> -video_set_cur_state(struct thermal_cooling_device *cooling_dev, unsigned long state)
> +video_set_cur_state(struct thermal_cooling_device *cooling_dev,
> + unsigned long state, enum thermal_cooling_unit unit)
>  {
>   struct acpi_device *device = cooling_dev->devdata;
>   struct acpi_video_device *video = acpi_driver_data(device);
>   int level;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
>   if (state >= video->brightness->count - 2)
>   return -EINVAL;
>
> @@ -326,9 +336,9 @@ video_set_cur_state(struct thermal_cooling_device *cooling_dev, unsigned long st
>  }
>
>  static const struct thermal_cooling_device_ops video_cooling_ops = {
> - .get_max_state = video_get_max_state,
> - .get_cur_state = video_get_cur_state,
> - .set_cur_state = video_set_cur_state,
> + .get_max = video_get_max_state,
> + .get_cur = video_get_cur_state,
> + .set_cur = video_set_cur_state,
>  };
>
>  /*
> diff --git a/drivers/platform/x86/acerhdf.c b/drivers/platform/x86/acerhdf.c
> index f94467c05225..0bb85e240d93 100644
> --- a/drivers/platform/x86/acerhdf.c
> +++ b/drivers/platform/x86/acerhdf.c
> @@ -439,18 +439,26 @@ static struct thermal_zone_device_ops acerhdf_dev_ops = {
>   * get maximal fan cooling state
>   */
>  static int acerhdf_get_max_state(struct thermal_cooling_device *cdev,
> - unsigned long *state)
> + unsigned long *state,
> + enum thermal_cooling_unit unit)
>  {
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
>   *state = 1;
>
>   return 0;
>  }
>
>  static int acerhdf_get_cur_state(struct thermal_cooling_device *cdev,
> - unsigned long *state)
> + unsigned long *state,
> + enum thermal_cooling_unit unit)
>  {
>   int err = 0, tmp;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
>   err = acerhdf_get_fanstate(&tmp);
>   if (err)
>   return err;
> @@ -461,10 +469,14 @@ static int acerhdf_get_cur_state(struct thermal_cooling_device *cdev,
>
>  /* change current fan state - is overwritten when running in kernel mode */
>  static int acerhdf_set_cur_state(struct thermal_cooling_device *cdev,
> - unsigned long state)
> + unsigned long state,
> + enum thermal_cooling_unit unit)
>  {
>   int cur_temp, cur_state, err = 0;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
>   if (!kernelmode)
>   return 0;
>
> @@ -498,9 +510,9 @@ err_out:
>
>  /* bind fan callbacks to fan device */
>  static struct thermal_cooling_device_ops acerhdf_cooling_ops = {
> - .get_max_state = acerhdf_get_max_state,
> - .get_cur_state = acerhdf_get_cur_state,
> - .set_cur_state = acerhdf_set_cur_state,
> + .get_max = acerhdf_get_max_state,
> + .get_cur = acerhdf_get_cur_state,
> + .set_cur = acerhdf_set_cur_state,
>  };
>
>  /* suspend / resume functionality */
> diff --git a/drivers/platform/x86/intel_menlow.c b/drivers/platform/x86/intel_menlow.c
> index e8b46d2c468c..9792da407c46 100644
> --- a/drivers/platform/x86/intel_menlow.c
> +++ b/drivers/platform/x86/intel_menlow.c
> @@ -61,7 +61,8 @@ static void intel_menlow_unregister_sensor(void);
>   * GTHS returning '0' would mean that no bandwidth control states are supported
>   */
>  static int memory_get_max_bandwidth(struct thermal_cooling_device *cdev,
> -    unsigned long *max_state)
> + unsigned long *max_state,
> + enum thermal_cooling_unit unit)
>  {
>   struct acpi_device *device = cdev->devdata;
>   acpi_handle handle = device->handle;
> @@ -70,6 +71,9 @@ static int memory_get_max_bandwidth(struct thermal_cooling_device *cdev,
>   union acpi_object arg;
>   acpi_status status = AE_OK;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
>   arg_list.count = 1;
>   arg_list.pointer = &arg;
>   arg.type = ACPI_TYPE_INTEGER;
> @@ -87,7 +91,8 @@ static int memory_get_max_bandwidth(struct thermal_cooling_device *cdev,
>  }
>
>  static int memory_get_cur_bandwidth(struct thermal_cooling_device *cdev,
> -    unsigned long *value)
> +    unsigned long *value,
> +    enum thermal_cooling_unit unit)
>  {
>   struct acpi_device *device = cdev->devdata;
>   acpi_handle handle = device->handle;
> @@ -96,6 +101,9 @@ static int memory_get_cur_bandwidth(struct thermal_cooling_device *cdev,
>   union acpi_object arg;
>   acpi_status status = AE_OK;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
>   arg_list.count = 1;
>   arg_list.pointer = &arg;
>   arg.type = ACPI_TYPE_INTEGER;
> @@ -110,7 +118,8 @@ static int memory_get_cur_bandwidth(struct thermal_cooling_device *cdev,
>  }
>
>  static int memory_set_cur_bandwidth(struct thermal_cooling_device *cdev,
> -    unsigned long state)
> +    unsigned long state,
> +    enum thermal_cooling_unit unit)
>  {
>   struct acpi_device *device = cdev->devdata;
>   acpi_handle handle = device->handle;
> @@ -120,7 +129,10 @@ static int memory_set_cur_bandwidth(struct thermal_cooling_device *cdev,
>   unsigned long long temp;
>   unsigned long max_state;
>
> - if (memory_get_max_bandwidth(cdev, &max_state))
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
> + if (memory_get_max_bandwidth(cdev, &max_state, THERMAL_UNIT_STATE))
>   return -EFAULT;
>
>   if (state > max_state)
> @@ -143,9 +155,9 @@ static int memory_set_cur_bandwidth(struct thermal_cooling_device *cdev,
>  }
>
>  static struct thermal_cooling_device_ops memory_cooling_ops = {
> - .get_max_state = memory_get_max_bandwidth,
> - .get_cur_state = memory_get_cur_bandwidth,
> - .set_cur_state = memory_set_cur_bandwidth,
> + .get_max = memory_get_max_bandwidth,
> + .get_cur = memory_get_cur_bandwidth,
> + .set_cur = memory_set_cur_bandwidth,
>  };
>
>  /*
> diff --git a/drivers/power/power_supply_core.c b/drivers/power/power_supply_core.c
> index 26606641fe44..69dc2901a788 100644
> --- a/drivers/power/power_supply_core.c
> +++ b/drivers/power/power_supply_core.c
> @@ -440,12 +440,16 @@ static void psy_unregister_thermal(struct power_supply *psy)
>
>  /* thermal cooling device callbacks */
>  static int ps_get_max_charge_cntl_limit(struct thermal_cooling_device *tcd,
> - unsigned long *state)
> + unsigned long *state,
> + enum thermal_cooling_unit unit)
>  {
>   struct power_supply *psy;
>   union power_supply_propval val;
>   int ret;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
>   psy = tcd->devdata;
>   ret = psy->get_property(psy,
>   POWER_SUPPLY_PROP_CHARGE_CONTROL_LIMIT_MAX, &val);
> @@ -456,12 +460,16 @@ static int ps_get_max_charge_cntl_limit(struct thermal_cooling_device *tcd,
>  }
>
>  static int ps_get_cur_chrage_cntl_limit(struct thermal_cooling_device *tcd,
> - unsigned long *state)
> + unsigned long *state,
> + enum thermal_cooling_unit unit)
>  {
>   struct power_supply *psy;
>   union power_supply_propval val;
>   int ret;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
>   psy = tcd->devdata;
>   ret = psy->get_property(psy,
>   POWER_SUPPLY_PROP_CHARGE_CONTROL_LIMIT, &val);
> @@ -472,12 +480,16 @@ static int ps_get_cur_chrage_cntl_limit(struct thermal_cooling_device *tcd,
>  }
>
>  static int ps_set_cur_charge_cntl_limit(struct thermal_cooling_device *tcd,
> - unsigned long state)
> + unsigned long state,
> + enum thermal_cooling_unit unit)
>  {
>   struct power_supply *psy;
>   union power_supply_propval val;
>   int ret;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
>   psy = tcd->devdata;
>   val.intval = state;
>   ret = psy->set_property(psy,
> @@ -487,9 +499,9 @@ static int ps_set_cur_charge_cntl_limit(struct thermal_cooling_device *tcd,
>  }
>
>  static struct thermal_cooling_device_ops psy_tcd_ops = {
> - .get_max_state = ps_get_max_charge_cntl_limit,
> - .get_cur_state = ps_get_cur_chrage_cntl_limit,
> - .set_cur_state = ps_set_cur_charge_cntl_limit,
> + .get_max = ps_get_max_charge_cntl_limit,
> + .get_cur = ps_get_cur_chrage_cntl_limit,
> + .set_cur = ps_set_cur_charge_cntl_limit,
>  };
>
>  static int psy_register_cooler(struct power_supply *psy)
> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> index 4246262c4bd2..f3f4e6b5798e 100644
> --- a/drivers/thermal/cpu_cooling.c
> +++ b/drivers/thermal/cpu_cooling.c
> @@ -348,14 +348,17 @@ static int cpufreq_thermal_notifier(struct notifier_block *nb,
>   * cpufreq_get_max_state - callback function to get the max cooling state.
>   * @cdev: thermal cooling device pointer.
>   * @state: fill this variable with the max cooling state.
> + * @unit: the units in which @state should be in.
>   *
> - * Callback for the thermal cooling device to return the cpufreq
> - * max cooling state.
> + * Callback for the thermal cooling device to return the cpufreq max
> + * cooling state.  Currently only THERMAL_UNIT_STATE is valid for
> + * @unit.
>   *
>   * Return: 0 on success, an error code otherwise.
>   */
>  static int cpufreq_get_max_state(struct thermal_cooling_device *cdev,
> - unsigned long *state)
> + unsigned long *state,
> + enum thermal_cooling_unit unit)
>  {
>   struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
>   struct cpumask *mask = &cpufreq_device->allowed_cpus;
> @@ -363,6 +366,9 @@ static int cpufreq_get_max_state(struct thermal_cooling_device *cdev,
>   unsigned int count = 0;
>   int ret;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
>   cpu = cpumask_any(mask);
>
>   ret = get_property(cpu, 0, &count, GET_MAXL);
> @@ -377,17 +383,23 @@ static int cpufreq_get_max_state(struct thermal_cooling_device *cdev,
>   * cpufreq_get_cur_state - callback function to get the current cooling state.
>   * @cdev: thermal cooling device pointer.
>   * @state: fill this variable with the current cooling state.
> + * @unit: the units in which state should be in
>   *
>   * Callback for the thermal cooling device to return the cpufreq
> - * current cooling state.
> + * current cooling state.  Currently only THERMAL_UNIT_STATE is valid
> + * for @unit.
>   *
>   * Return: 0 on success, an error code otherwise.
>   */
>  static int cpufreq_get_cur_state(struct thermal_cooling_device *cdev,
> - unsigned long *state)
> + unsigned long *state,
> + enum thermal_cooling_unit unit)
>  {
>   struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
>   *state = cpufreq_device->cpufreq_state;
>
>   return 0;
> @@ -397,25 +409,31 @@ static int cpufreq_get_cur_state(struct thermal_cooling_device *cdev,
>   * cpufreq_set_cur_state - callback function to set the current cooling state.
>   * @cdev: thermal cooling device pointer.
>   * @state: set this variable to the current cooling state.
> + * @unit: the units in which @state should be in.
>   *
>   * Callback for the thermal cooling device to change the cpufreq
> - * current cooling state.
> + * current cooling state.  Currently only THERMAL_UNIT_STATE is valid for
> + * @unit.
>   *
>   * Return: 0 on success, an error code otherwise.
>   */
>  static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
> - unsigned long state)
> + unsigned long state,
> + enum thermal_cooling_unit unit)
>  {
>   struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
>   return cpufreq_apply_cooling(cpufreq_device, state);
>  }
>
>  /* Bind cpufreq callbacks to thermal cooling device ops */
>  static struct thermal_cooling_device_ops const cpufreq_cooling_ops = {
> - .get_max_state = cpufreq_get_max_state,
> - .get_cur_state = cpufreq_get_cur_state,
> - .set_cur_state = cpufreq_set_cur_state,
> + .get_max = cpufreq_get_max_state,
> + .get_cur = cpufreq_get_cur_state,
> + .set_cur = cpufreq_set_cur_state,
>  };
>
>  /* Notifier for cpufreq policy change */
> diff --git a/drivers/thermal/db8500_thermal.c b/drivers/thermal/db8500_thermal.c
> index 1e3b3bf9f993..0842a2ef88b4 100644
> --- a/drivers/thermal/db8500_thermal.c
> +++ b/drivers/thermal/db8500_thermal.c
> @@ -67,7 +67,7 @@ static int db8500_cdev_bind(struct thermal_zone_device *thermal,
>   unsigned long max_state, upper, lower;
>   int i, ret = -EINVAL;
>
> - cdev->ops->get_max_state(cdev, &max_state);
> + cdev->ops->get_max(cdev, &max_state, THERMAL_UNIT_STATE);
>
>   for (i = 0; i < ptrips->num_trips; i++) {
>   if (db8500_thermal_match_cdev(cdev, &ptrips->trip_points[i]))
> diff --git a/drivers/thermal/fair_share.c b/drivers/thermal/fair_share.c
> index 944ba2f340c8..8e6eb4883bb6 100644
> --- a/drivers/thermal/fair_share.c
> +++ b/drivers/thermal/fair_share.c
> @@ -51,7 +51,7 @@ static long get_target_state(struct thermal_zone_device *tz,
>  {
>   unsigned long max_state;
>
> - cdev->ops->get_max_state(cdev, &max_state);
> + cdev->ops->get_max(cdev, &max_state, THERMAL_UNIT_STATE);
>
>   return (long)(weight * level * max_state) / (100 * tz->trips);
>  }
> diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
> index a084325f1386..c32bee16b17a 100644
> --- a/drivers/thermal/intel_powerclamp.c
> +++ b/drivers/thermal/intel_powerclamp.c
> @@ -616,16 +616,24 @@ static struct notifier_block powerclamp_cpu_notifier = {
>  };
>
>  static int powerclamp_get_max_state(struct thermal_cooling_device *cdev,
> - unsigned long *state)
> + unsigned long *state,
> + enum thermal_cooling_unit unit)
>  {
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
>   *state = MAX_TARGET_RATIO;
>
>   return 0;
>  }
>
>  static int powerclamp_get_cur_state(struct thermal_cooling_device *cdev,
> - unsigned long *state)
> + unsigned long *state,
> + enum thermal_cooling_unit unit)
>  {
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
>   if (true == clamping)
>   *state = pkg_cstate_ratio_cur;
>   else
> @@ -636,10 +644,14 @@ static int powerclamp_get_cur_state(struct thermal_cooling_device *cdev,
>  }
>
>  static int powerclamp_set_cur_state(struct thermal_cooling_device *cdev,
> - unsigned long new_target_ratio)
> + unsigned long new_target_ratio,
> + enum thermal_cooling_unit unit)
>  {
>   int ret = 0;
>
> + if (unit != THERMAL_UNIT_STATE)
> + return -EINVAL;
> +
>   new_target_ratio = clamp(new_target_ratio, 0UL,
>   (unsigned long) (MAX_TARGET_RATIO-1));
>   if (set_target_ratio == 0 && new_target_ratio > 0) {
> @@ -663,9 +675,9 @@ exit_set:
>
>  /* bind to generic thermal layer as cooling device*/
>  static struct thermal_cooling_device_ops powerclamp_cooling_ops = {
> - .get_max_state = powerclamp_get_max_state,
> - .get_cur_state = powerclamp_get_cur_state,
> - .set_cur_state = powerclamp_set_cur_state,
> + .get_max = powerclamp_get_max_state,
> + .get_cur = powerclamp_get_cur_state,
> + .set_cur = powerclamp_set_cur_state,
>  };
>
>  /* runs on Nehalem and later */
> diff --git a/drivers/thermal/step_wise.c b/drivers/thermal/step_wise.c
> index f251521baaa2..2a2cd4eba3c3 100644
> --- a/drivers/thermal/step_wise.c
> +++ b/drivers/thermal/step_wise.c
> @@ -58,7 +58,7 @@ static unsigned long get_target_state(struct thermal_instance *instance,
>   * Otherwise, we use the current state of the
>   * cdev in use to determine the next_target.
>   */
> - cdev->ops->get_cur_state(cdev, &cur_state);
> + cdev->ops->get_cur(cdev, &cur_state, THERMAL_UNIT_STATE);
>   next_target = instance->target;
>   dev_dbg(&cdev->device, "cur_state=%ld\n", cur_state);
>
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index ac8e5990a63e..5b46603cc7cd 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -871,7 +871,7 @@ thermal_cooling_device_max_state_show(struct device *dev,
>   unsigned long state;
>   int ret;
>
> - ret = cdev->ops->get_max_state(cdev, &state);
> + ret = cdev->ops->get_max(cdev, &state, THERMAL_UNIT_STATE);
>   if (ret)
>   return ret;
>   return sprintf(buf, "%ld\n", state);
> @@ -885,7 +885,7 @@ thermal_cooling_device_cur_state_show(struct device *dev,
>   unsigned long state;
>   int ret;
>
> - ret = cdev->ops->get_cur_state(cdev, &state);
> + ret = cdev->ops->get_cur(cdev, &state, THERMAL_UNIT_STATE);
>   if (ret)
>   return ret;
>   return sprintf(buf, "%ld\n", state);
> @@ -906,7 +906,7 @@ thermal_cooling_device_cur_state_store(struct device *dev,
>   if ((long)state < 0)
>   return -EINVAL;
>
> - result = cdev->ops->set_cur_state(cdev, state);
> + result = cdev->ops->set_cur(cdev, state, THERMAL_UNIT_STATE);
>   if (result)
>   return result;
>   return count;
> @@ -983,7 +983,7 @@ int thermal_zone_bind_cooling_device(struct thermal_zone_device *tz,
>   if (tz != pos1 || cdev != pos2)
>   return -EINVAL;
>
> - cdev->ops->get_max_state(cdev, &max_state);
> + cdev->ops->get_max(cdev, &max_state, THERMAL_UNIT_STATE);
>
>   /* lower default 0, upper default max_state */
>   lower = lower == THERMAL_NO_LIMIT ? 0 : lower;
> @@ -1143,8 +1143,7 @@ __thermal_cooling_device_register(struct device_node *np,
>   if (type && strlen(type) >= THERMAL_NAME_LENGTH)
>   return ERR_PTR(-EINVAL);
>
> - if (!ops || !ops->get_max_state || !ops->get_cur_state ||
> -    !ops->set_cur_state)
> + if (!ops || !ops->get_max || !ops->get_cur || !ops->set_cur)
>   return ERR_PTR(-EINVAL);
>
>   cdev = kzalloc(sizeof(struct thermal_cooling_device), GFP_KERNEL);
> @@ -1329,7 +1328,7 @@ void thermal_cdev_update(struct thermal_cooling_device *cdev)
>   target = instance->target;
>   }
>   mutex_unlock(&cdev->lock);
> - cdev->ops->set_cur_state(cdev, target);
> + cdev->ops->set_cur(cdev, target, THERMAL_UNIT_STATE);
>   cdev->updated = true;
>   dev_dbg(&cdev->device, "set to state %lu\n", target);
>  }


I believe the sysfs API needs to be refactored too. Here we would need
to have a new entry 'unit' to represent what unit the cooling device
supports.

> diff --git a/include/linux/thermal.h b/include/linux/thermal.h
> index baac212f815e..8d183b8255eb 100644
> --- a/include/linux/thermal.h
> +++ b/include/linux/thermal.h
> @@ -106,6 +106,10 @@ enum {
>  };
>  #define THERMAL_GENL_CMD_MAX (__THERMAL_GENL_CMD_MAX - 1)
>
> +enum thermal_cooling_unit {
> + THERMAL_UNIT_STATE,
> +};
> +
>  struct thermal_zone_device_ops {
>   int (*bind) (struct thermal_zone_device *,
>       struct thermal_cooling_device *);
> @@ -135,9 +139,12 @@ struct thermal_zone_device_ops {
>  };
>
>  struct thermal_cooling_device_ops {
> - int (*get_max_state) (struct thermal_cooling_device *, unsigned long *);
> - int (*get_cur_state) (struct thermal_cooling_device *, unsigned long *);
> - int (*set_cur_state) (struct thermal_cooling_device *, unsigned long);
> + int (*get_max)(struct thermal_cooling_device *, unsigned long *,
> + enum thermal_cooling_unit);
> + int (*get_cur)(struct thermal_cooling_device *, unsigned long *,
> + enum thermal_cooling_unit);
> + int (*set_cur)(struct thermal_cooling_device *, unsigned long,
> + enum thermal_cooling_unit);
>  };


Now, reading this patch again, I realize that this API may be improved.
How about doing a handshake between the cooling device and the governor
instead of sending the required unit on every message

That is, the thermal core would ask the cooling device its supported
unit and the governor supported unit, then the bind would be called
only if they match units. I am assuming we are limiting to one supported
unit per cooling device / governor, right?

>
>  struct thermal_cooling_device {
> --
> 1.7.9.5
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 3/5] thermal: add a basic cpu power actor
  2014-05-06 12:06 ` [RFC PATCH 3/5] thermal: add a basic cpu power actor Javi Merino
@ 2014-05-13 22:57   ` Eduardo Valentin
  2014-05-15 17:02     ` Javi Merino
  0 siblings, 1 reply; 48+ messages in thread
From: Eduardo Valentin @ 2014-05-13 22:57 UTC (permalink / raw)
  To: Javi Merino; +Cc: linux-pm, Punit.Agrawal, Zhang Rui

Hello Javi,

On Tue, May 06, 2014 at 01:06:36PM +0100, Javi Merino wrote:
> Introduce a new thermal_cooling_unit: THERMAL_UNIT_POWER and let
> cpu_cooling device operate using it as well as "states".  This allows
> governors that call cpufreq_get_max/cpufreq_get_cur/cpufreq_set_cur()
> with THERMAL_UNIT_POWER to use cpufreq cooling devices.
> 
> If the cpu cooling device is registered using
> {of_,}cpufreq_power_actor_register() instead of
> {of_,}cpufreq_cooling_register() it uses the current frequency (as
> reported by cpufreq) as well as load and OPPs for the power
> calculations.  The cpus must have registered their OPPs in the OPP
> library.
> 
> Cc: Zhang Rui <rui.zhang@intel.com>
> Cc: Eduardo Valentin <edubezval@gmail.com>
> Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
> Signed-off-by: Javi Merino <javi.merino@arm.com>
> ---
>  drivers/thermal/cpu_cooling.c  |  430 ++++++++++++++++++++++++++++++++++++----
>  drivers/thermal/thermal_core.c |    4 +-
>  include/linux/cpu_cooling.h    |   28 +++
>  include/linux/thermal.h        |    4 +
>  4 files changed, 422 insertions(+), 44 deletions(-)

Could you please also document under Documentation/thermal/sysfs-api.txt the exposed APIs ?


> 
> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> index f3f4e6b5798e..8052cd1f4a39 100644
> --- a/drivers/thermal/cpu_cooling.c
> +++ b/drivers/thermal/cpu_cooling.c
> @@ -24,10 +24,27 @@
>  #include <linux/thermal.h>
>  #include <linux/cpufreq.h>
>  #include <linux/err.h>
> +#include <linux/pm_opp.h>
>  #include <linux/slab.h>
>  #include <linux/cpu.h>
>  #include <linux/cpu_cooling.h>
>  
> +/* Maximum number of opps we can track */
> +#define MAX_NUM_OPPS 32
> +
> +/**
> + * struct power_table - frequency to power conversion
> + * @frequency:	frequency in KHz
> + * @power:	power in mW
> + *
> + * This structure is built when the cooling device registers and helps
> + * in translating frequency to power and viceversa.
> + */
> +struct power_table {
> +	u32 frequency;
> +	u32 power;
> +};
> +
>  /**
>   * struct cpufreq_cooling_device - data for cooling device with cpufreq
>   * @id: unique integer value corresponding to each cpufreq_cooling_device
> @@ -39,6 +56,14 @@
>   * @cpufreq_val: integer value representing the absolute value of the clipped
>   *	frequency.
>   * @allowed_cpus: all the cpus involved for this cpufreq_cooling_device.
> + * @power_table: array of struct power_table for frequency to power conversion
> + * @power_table_entries: number of entries in the @power_table array
> + * @time_in_idle: previous reading of the absolute time that this cpu was idle
> + * @time_in_idle_timestamp: wall time of the last invocation of
> + *	get_cpu_idle_time_us()
> + * @last_load: load measured by the latest call to cpufreq_get_cur()
> + * @freq: frequency in KHz of the cpus represented by the cooling device
> + * @capacitance: the dynamic power coefficient of these cpus
>   *
>   * This structure is required for keeping information of each
>   * cpufreq_cooling_device registered. In order to prevent corruption of this a
> @@ -50,11 +75,19 @@ struct cpufreq_cooling_device {
>  	unsigned int cpufreq_state;
>  	unsigned int cpufreq_val;
>  	struct cpumask allowed_cpus;
> +	struct power_table power_table[MAX_NUM_OPPS];

Can we build this table dynamically?

I believe we should be able to fetch this table from opp layer. Besides,
should not be trouble of this code to estimate the max number of opps.

> +	int power_table_entries;
> +	u64 time_in_idle[NR_CPUS];
> +	u64 time_in_idle_timestamp[NR_CPUS];
> +	u32 last_load;
> +	u32 freq;
> +	u32 capacitance;
>  };
>  static DEFINE_IDR(cpufreq_idr);
>  static DEFINE_MUTEX(cooling_cpufreq_lock);
>  
>  static unsigned int cpufreq_dev_count;
> +static bool power_actor_notifier_registered;

Are you sure you need this bool? Shouldn' t you be using a counter? 

You probably need locking on this.

>  
>  /* notify_table passes value to the CPUFREQ_ADJUST callback function. */
>  #define NOTIFY_INVALID NULL
> @@ -115,6 +148,59 @@ static int is_cpufreq_valid(int cpu)
>  	return !cpufreq_get_policy(&policy, cpu);
>  }
>  
> +static u32 cpu_freq_to_power(struct cpufreq_cooling_device *cpufreq_device,
> +			u32 freq)
> +{
> +	int i;
> +	struct power_table *pt = cpufreq_device->power_table;
> +
> +	for (i = 0; i < cpufreq_device->power_table_entries - 1; i++)
> +		if (freq <= pt[i].frequency)
> +			break;
> +
> +	return pt[i].power;
> +}
> +
> +static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_device,
> +			u32 power)
> +{
> +	int i;
> +	struct power_table *pt = cpufreq_device->power_table;
> +
> +	for (i = 0; i < cpufreq_device->power_table_entries - 1; i++)
> +		if (power <= pt[i].power)
> +			break;
> +
> +	return pt[i].frequency;
> +}
> +
> +/**
> + * get_load - get load for a cpu since last updated
> + * @cpufreq_device: struct cpufreq_cooling_device for this cooling device
> + * @cpu: cpu number
> + *
> + * Return the average load of cpu @cpu in percentage since this
> + * function was last called.
> + */
> +static u32 get_load(struct cpufreq_cooling_device *cpufreq_device, int cpu)
> +{
> +	u32 load;
> +	u64 now, now_idle, delta_time, delta_idle;
> +
> +	now_idle = get_cpu_idle_time(cpu, &now, 0);
> +	delta_idle = now_idle - cpufreq_device->time_in_idle[cpu];
> +	delta_time = now - cpufreq_device->time_in_idle_timestamp[cpu];
> +
> +	if (delta_time <= delta_idle)
> +		load = 0;
> +	else
> +		load = div64_u64(100 * (delta_time - delta_idle), delta_time);
> +
> +	cpufreq_device->time_in_idle[cpu] = now_idle;
> +	cpufreq_device->time_in_idle_timestamp[cpu] = now;

nip: blank line here.

> +	return load;
> +}
> +
>  enum cpufreq_cooling_property {
>  	GET_LEVEL,
>  	GET_FREQ,
> @@ -342,98 +428,193 @@ static int cpufreq_thermal_notifier(struct notifier_block *nb,
>  	return 0;
>  }
>  
> +/**
> + * cpufreq_frequency_change - notifier callback for cpufreq frequency changes
> + * @nb:		struct notifier_block * with callback info
> + * @event:	value showing cpufreq event for which this function invoked
> + * @data:	callback-specific data
> + *
> + * Callback to get notifications of frequency changes.  In the
> + * CPUFREQ_POSTCHANGE @event we store the new frequency so that
> + * cpufreq_get_cur() knows the current frequency and can convert it
> + * into power.
> + */
> +static int cpufreq_frequency_change(struct notifier_block *nb,
> +				unsigned long event, void *data)
> +{
> +	struct thermal_cooling_device *cdev;
> +	struct cpufreq_freqs *freqs = data;
> +
> +	/* Only update frequency on postchange */
> +	if (event != CPUFREQ_POSTCHANGE)
> +		return NOTIFY_DONE;
> +
> +	mutex_lock(&thermal_list_lock);
> +
> +	list_for_each_entry(cdev, &thermal_cdev_list, node) {
> +		struct cpufreq_cooling_device *cpufreq_device;
> +
> +		if (strncmp(cdev->type, "thermal-cpufreq", THERMAL_NAME_LENGTH))
> +			continue;
> +

I am not sure I follow why you need to go through the list of cooling devices
here. Can you please elaborate a bit more?

> +		cpufreq_device = cdev->devdata;
> +
> +		if (cpumask_test_cpu(freqs->cpu, &cpufreq_device->allowed_cpus))
> +			cpufreq_device->freq = freqs->new;
> +	}
> +
> +	mutex_unlock(&thermal_list_lock);
> +
> +	return NOTIFY_OK;
> +}
> +
>  /* cpufreq cooling device callback functions are defined below */
>  
>  /**
> - * cpufreq_get_max_state - callback function to get the max cooling state.
> + * cpufreq_get_max - callback function to get the max cooling state/power.
>   * @cdev: thermal cooling device pointer.
> - * @state: fill this variable with the max cooling state.
> - * @unit: the units in which @state should be in.
> + * @val: fill this variable with the max cooling state/power.
> + * @unit: whether to put state or power in @val.
>   *
>   * Callback for the thermal cooling device to return the cpufreq max
> - * cooling state.  Currently only THERMAL_UNIT_STATE is valid for
> - * @unit.
> + * cooling state/power.  If @unit is THERMAL_UNIT_STATE, @val contains
> + * the maximum cooling state; if @unit is THERMAL_UNIT_POWER, @val
> + * maximum is power in milliwatts.
>   *
>   * Return: 0 on success, an error code otherwise.
>   */
> -static int cpufreq_get_max_state(struct thermal_cooling_device *cdev,
> -				unsigned long *state,
> -				enum thermal_cooling_unit unit)
> +static int cpufreq_get_max(struct thermal_cooling_device *cdev,
> +			   unsigned long *val, enum thermal_cooling_unit unit)

Shouldn't this change be part of previous patch?

>  {
>  	struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
>  	struct cpumask *mask = &cpufreq_device->allowed_cpus;
>  	unsigned int cpu;
>  	unsigned int count = 0;
> -	int ret;
> -
> -	if (unit != THERMAL_UNIT_STATE)
> -		return -EINVAL;

What happens to this code if tomorrow we add yet another unit?


> +	int ret = 0;
>  
>  	cpu = cpumask_any(mask);
>  
> -	ret = get_property(cpu, 0, &count, GET_MAXL);
> +	if ((unit == THERMAL_UNIT_POWER) &&
> +		(cpufreq_device->power_table_entries)) {

I think a cleaner code would use a switch case statement. You should
deal with the power_table_entries restriction separatedly.

> +		unsigned int num_cpus;
> +		unsigned long max_freq;
> +
> +		max_freq = get_cpu_frequency(cpu, 0);
> +		if (!max_freq)
> +			return -EINVAL;
>  
> -	if (count > 0)
> -		*state = count;
> +		num_cpus = cpumask_weight(mask);
> +		*val = cpu_freq_to_power(cdev->devdata, max_freq) * num_cpus;
> +	} else if (unit == THERMAL_UNIT_STATE) {
> +		ret = get_property(cpu, 0, &count, GET_MAXL);
> +
> +		if (count > 0)
> +			*val = count;
> +	} else {
> +		ret = -EINVAL;
> +	}
>  
>  	return ret;
>  }
>  
>  /**
> - * cpufreq_get_cur_state - callback function to get the current cooling state.
> + * cpufreq_get_cur - callback function to get the current cooling state/power.
>   * @cdev: thermal cooling device pointer.
> - * @state: fill this variable with the current cooling state.
> - * @unit: the units in which state should be in
> + * @val: fill this variable with the current cooling state/power.
> + * @unit: whether to put state or power in @val.
>   *
>   * Callback for the thermal cooling device to return the cpufreq
> - * current cooling state.  Currently only THERMAL_UNIT_STATE is valid
> - * for @unit.
> + * current cooling state/power.  If @unit is THERMAL_UNIT_STATE, @val
> + * contains the current cooling state; if @unit is THERMAL_UNIT_POWER,
> + * @val is current power consumption in milliwatts.
>   *
>   * Return: 0 on success, an error code otherwise.
>   */
> -static int cpufreq_get_cur_state(struct thermal_cooling_device *cdev,
> -				 unsigned long *state,
> -				 enum thermal_cooling_unit unit)
> +static int cpufreq_get_cur(struct thermal_cooling_device *cdev,
> +			   unsigned long *val, enum thermal_cooling_unit unit)
>  {
>  	struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
>  
> -	if (unit != THERMAL_UNIT_STATE)
> -		return -EINVAL;
> +	if ((unit == THERMAL_UNIT_POWER) &&
> +		(cpufreq_device->power_table_entries)) {

ditto.

> +		int cpu;
> +		u32 total_load = 0, raw_cpu_power, power = 0;
> +
> +		raw_cpu_power = cpu_freq_to_power(cpufreq_device,
> +						cpufreq_device->freq);
> +
> +		for_each_cpu(cpu, &cpufreq_device->allowed_cpus) {
> +			u32 load;
>  
> -	*state = cpufreq_device->cpufreq_state;
> +			if (!cpu_online(cpu))
> +				continue;
> +
> +			load = get_load(cpufreq_device, cpu);
> +			power += (raw_cpu_power * load) / 100;
> +			total_load += load;
> +		}
> +
> +		cpufreq_device->last_load = total_load;
> +		*val = power;
> +	} else if (unit == THERMAL_UNIT_STATE) {
> +		*val = cpufreq_device->cpufreq_state;
> +	} else {
> +		return -EINVAL;
> +	}
>  
>  	return 0;
>  }
>  
>  /**
> - * cpufreq_set_cur_state - callback function to set the current cooling state.
> + * cpufreq_set_cur - callback function to set the current cooling state/power.
>   * @cdev: thermal cooling device pointer.
> - * @state: set this variable to the current cooling state.
> - * @unit: the units in which @state should be in.
> + * @val: set this variable to the current cooling state/power.
> + * @unit: whether @val is a cooling state or power.
>   *
>   * Callback for the thermal cooling device to change the cpufreq
> - * current cooling state.  Currently only THERMAL_UNIT_STATE is valid for
> - * @unit.
> + * current cooling state/power.  If @unit is THERMAL_UNIT_STATE, @val
> + * is a cooling state; if @unit is THERMAL_UNIT_POWER, @val is the target
> + * power consumption in milliwatts.
>   *
>   * Return: 0 on success, an error code otherwise.
>   */
> -static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
> -				 unsigned long state,
> -				 enum thermal_cooling_unit unit)
> +static int cpufreq_set_cur(struct thermal_cooling_device *cdev,
> +			   unsigned long val, enum thermal_cooling_unit unit)
>  {
>  	struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
> -
> -	if (unit != THERMAL_UNIT_STATE)
> +	unsigned long state;
> +
> +	if ((unit == THERMAL_UNIT_POWER) &&
> +		(cpufreq_device->power_table_entries)) {

ditto.

> +		unsigned int cpu, freq;
> +		u32 normalised_power, last_load;
> +
> +		cpu = cpumask_any(&cpufreq_device->allowed_cpus);
> +		last_load = cpufreq_device->last_load ?
> +			cpufreq_device->last_load : 1;
> +		normalised_power = (val * 100) / last_load;
> +		freq = cpu_power_to_freq(cpufreq_device, normalised_power);
> +
> +		state = cpufreq_cooling_get_level(cpu, freq);
> +		if (state == THERMAL_CSTATE_INVALID) {
> +			pr_err("Failed to convert %dKHz for cpu %d into a cdev state\n",
> +				freq, cpu);
> +			return -EINVAL;
> +		}
> +	} else if (unit == THERMAL_UNIT_STATE) {
> +		state = val;
> +	} else {
>  		return -EINVAL;
> +	}
>  
>  	return cpufreq_apply_cooling(cpufreq_device, state);
>  }
>  
>  /* Bind cpufreq callbacks to thermal cooling device ops */
>  static struct thermal_cooling_device_ops const cpufreq_cooling_ops = {
> -	.get_max = cpufreq_get_max_state,
> -	.get_cur = cpufreq_get_cur_state,
> -	.set_cur = cpufreq_set_cur_state,
> +	.get_max = cpufreq_get_max,
> +	.get_cur = cpufreq_get_cur,
> +	.set_cur = cpufreq_set_cur,
>  };
>  
>  /* Notifier for cpufreq policy change */
> @@ -441,10 +622,95 @@ static struct notifier_block thermal_cpufreq_notifier_block = {
>  	.notifier_call = cpufreq_thermal_notifier,
>  };
>  
> +/* Notifier for cpufreq frequency changes */
> +struct notifier_block cpufreq_transition_notifier = {
> +	.notifier_call = cpufreq_frequency_change,
> +};
> +
> +/**
> + * build_cpu_power_table - create a power to frequency table
> + * @cpufreq_dev:	the cpufreq_cooling_device in which to store the table
> + *
> + * Build a power to frequency table for this cpu and store it in
> + * @cpufreq_dev.  This table will be used in cpu_power_to_freq() and
> + * cpu_freq_to_power() to convert between power and frequency
> + * efficiently.  Power is stored in mW, frequency in KHz.  The
> + * resulting table is in ascending order.
> + *
> + * Returns 0 on success, -E* on error.
> + */
> +static int build_cpu_power_table(struct cpufreq_cooling_device *cpufreq_dev)
> +{
> +	struct dev_pm_opp *opp;
> +	struct device *dev = NULL;
> +	int num_opps, cpu, i, ret = 0;
> +	unsigned long freq;
> +
> +	num_opps = 0;
> +
> +	rcu_read_lock();
> +
> +	for_each_cpu(cpu, &cpufreq_dev->allowed_cpus) {
> +		dev = get_cpu_device(cpu);
> +		if (!dev)
> +			continue;
> +
> +		num_opps = dev_pm_opp_get_opp_count(dev);
> +		if (num_opps > 0) {
> +			break;
> +		} else if (num_opps < 0) {
> +			ret = num_opps;
> +			goto unlock;
> +		}
> +	}
> +
> +	if ((num_opps == 0) || (num_opps > MAX_NUM_OPPS)) {

hmm.. Does it mean that if an existing platform has more than 32
opps, we won't support it? I would preffer we dynamically allocate
the table here and cover any number of opps.


> +		ret = -EINVAL;
> +		goto unlock;
> +	}
> +
> +	i = 0;
> +	for (freq = 0;
> +	     opp = dev_pm_opp_find_freq_ceil(dev, &freq), !IS_ERR(opp);
> +	     freq++) {
> +		u32 freq_mhz, voltage_mv;
> +		u64 power;
> +
> +		freq_mhz = freq / 1000000;
> +		voltage_mv = dev_pm_opp_get_voltage(opp) / 1000;
> +
> +		/*
> +		 * Do the multiplication with MHz and millivolt so as
> +		 * to not overflow.
> +		 */
> +		power = (u64)cpufreq_dev->capacitance * freq_mhz *
> +			voltage_mv * voltage_mv;

Essentially: P = C x F x V^2 right? Your model covers for dynamic power
only.

I believe this power model neglets leakage. Static power consumption
cannot be negleted nowadays, specially on thermal constrained CPUs.
(which happens to be the target of this subsystem).

My suggestion is that you model power as:
P = P_{dyn} + P_{static}

P_{dyn} can be what you propose above, for instance, or some other
better increasing convex function. But we cannot neglect P_{static}.

I suggest allowing platform device driver writer to input
power models here somehow. Either by using callbacks to
implement their own power model, or by feeding the API with
their own table.

> +		do_div(power, 1000000000);
> +
> +		/* frequency is stored in power_table in KHz */
> +		cpufreq_dev->power_table[i].frequency = freq / 1000;
> +		cpufreq_dev->power_table[i].power = power;
> +
> +		i++;
> +	}
> +
> +	if (i == 0) {
> +		ret = PTR_ERR(opp);
> +		goto unlock;
> +	}
> +
> +	cpufreq_dev->power_table_entries = i;
> +
> +unlock:
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
>  /**
>   * __cpufreq_cooling_register - helper function to create cpufreq cooling device
>   * @np: a valid struct device_node to the cooling device device tree node
>   * @clip_cpus: cpumask of cpus where the frequency constraints will happen.
> + * @capacitance: dynamic power coefficient for these cpus
>   *
>   * This interface function registers the cpufreq cooling device with the name
>   * "thermal-cpufreq-%x". This api can support multiple instances of cpufreq
> @@ -456,7 +722,7 @@ static struct notifier_block thermal_cpufreq_notifier_block = {
>   */
>  static struct thermal_cooling_device *
>  __cpufreq_cooling_register(struct device_node *np,
> -			   const struct cpumask *clip_cpus)
> +			const struct cpumask *clip_cpus, u32 capacitance)
>  {
>  	struct thermal_cooling_device *cool_dev;
>  	struct cpufreq_cooling_device *cpufreq_dev = NULL;
> @@ -485,6 +751,7 @@ __cpufreq_cooling_register(struct device_node *np,
>  		return ERR_PTR(-ENOMEM);
>  
>  	cpumask_copy(&cpufreq_dev->allowed_cpus, clip_cpus);
> +	cpufreq_dev->capacitance = capacitance;
>  
>  	ret = get_idr(&cpufreq_idr, &cpufreq_dev->id);
>  	if (ret) {
> @@ -531,7 +798,7 @@ __cpufreq_cooling_register(struct device_node *np,
>  struct thermal_cooling_device *
>  cpufreq_cooling_register(const struct cpumask *clip_cpus)
>  {
> -	return __cpufreq_cooling_register(NULL, clip_cpus);
> +	return __cpufreq_cooling_register(NULL, clip_cpus, 0);
>  }
>  EXPORT_SYMBOL_GPL(cpufreq_cooling_register);
>  
> @@ -555,11 +822,90 @@ of_cpufreq_cooling_register(struct device_node *np,
>  	if (!np)
>  		return ERR_PTR(-EINVAL);
>  
> -	return __cpufreq_cooling_register(np, clip_cpus);
> +	return __cpufreq_cooling_register(np, clip_cpus, 0);
>  }
>  EXPORT_SYMBOL_GPL(of_cpufreq_cooling_register);
>  
>  /**
> + * of_cpufreq_power_actor_register - function to create a cpu cooling device with power capabilities
> + * @np:		a valid struct device_node to the cooling device tree node
> + * @clip_cpus:	cpumask of cpus where the frequency constraints will happen
> + * @capacitance:	the dynamic power coefficient for these cpus
> + *
> + * Similar to of_cpufreq_cooling_register() this function registers a
> + * cooling device linked to the device tree node provided.  It has a
> + * simple power model so that get_cur/get_max/set_cur can operate with
> + * THERMAL_UNIT_POWER
> + *
> + * The cpus must have registered their OPPs in the OPP library.
> + *
> + * Return: a valid struct thermal_cooling_device pointer on success,
> + * on failure, it returns a corresponding ERR_PTR().
> + */
> +struct thermal_cooling_device *
> +of_cpufreq_power_actor_register(struct device_node *np,
> +				const struct cpumask *clip_cpus,
> +				u32 capacitance)
> +{
> +	struct thermal_cooling_device *cdev, *err_ret;
> +	int ret;
> +
> +	cdev = __cpufreq_cooling_register(np, clip_cpus, capacitance);
> +	if (IS_ERR(cdev))
> +		return cdev;
> +
> +	ret = build_cpu_power_table(cdev->devdata);
> +	if (ret) {
> +		err_ret = ERR_PTR(ret);
> +		goto unregister;
> +	}
> +
> +	/*
> +	 * You can't register multiple times the same notifier_block.
> +	 * The first power actor registered is the only one that
> +	 * registers the notifier.
> +	 */
> +	if (!power_actor_notifier_registered) {
> +		ret = cpufreq_register_notifier(&cpufreq_transition_notifier,
> +						CPUFREQ_TRANSITION_NOTIFIER);
> +		if (ret) {
> +			err_ret = ERR_PTR(ret);
> +			goto unregister;
> +		}
> +		power_actor_notifier_registered = true;

I believe you need locking to update this global variable.

> +	}
> +
> +	return cdev;
> +
> +unregister:
> +	cpufreq_cooling_unregister(cdev);
> +	return err_ret;
> +}
> +EXPORT_SYMBOL_GPL(of_cpufreq_power_actor_register);

This needs documentation under Documentation/thermal/sysfs-api.txt.

> +
> +/**
> + * cpufreq_power_actor_register - function to create a cpu cooling device with power capabilities
> + * @clip_cpus:	cpumask of cpus where the frequency constraints will happen
> + * @capacitance:	the dynamic power coefficient for these cpus
> + *
> + * Similar to cpufreq_cooling_register() this function registers a
> + * cpufreq cooling device that has a simple power model so that
> + * get_cur/get_max/set_cur can operate with THERMAL_UNIT_POWER.
> + *
> + * See of_cpufreq_power_actor_register() for notes about the
> + * requisites for this to work.
> + *
> + * Return: a valid struct thermal_cooling_device pointer on success,
> + * on failure, it returns a corresponding ERR_PTR().
> + */
> +struct thermal_cooling_device *
> +cpufreq_power_actor_register(const struct cpumask *clip_cpus, u32 capacitance)
> +{
> +	return of_cpufreq_power_actor_register(NULL, clip_cpus, capacitance);
> +}
> +EXPORT_SYMBOL_GPL(cpufreq_power_actor_register);

This needs documentation under Documentation/thermal/sysfs-api.txt.

> +
> +/**
>   * cpufreq_cooling_unregister - function to remove cpufreq cooling device.
>   * @cdev: thermal cooling device pointer.
>   *
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index 5b46603cc7cd..1efaadf436fa 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -50,10 +50,10 @@ static DEFINE_IDR(thermal_cdev_idr);
>  static DEFINE_MUTEX(thermal_idr_lock);
>  
>  static LIST_HEAD(thermal_tz_list);
> -static LIST_HEAD(thermal_cdev_list);
> +LIST_HEAD(thermal_cdev_list);
>  static LIST_HEAD(thermal_governor_list);
>  
> -static DEFINE_MUTEX(thermal_list_lock);
> +DEFINE_MUTEX(thermal_list_lock);
>  static DEFINE_MUTEX(thermal_governor_lock);
>  
>  static struct thermal_governor *def_governor;
> diff --git a/include/linux/cpu_cooling.h b/include/linux/cpu_cooling.h
> index c303d383def1..b8d92ac4266b 100644
> --- a/include/linux/cpu_cooling.h
> +++ b/include/linux/cpu_cooling.h
> @@ -36,6 +36,9 @@
>  struct thermal_cooling_device *
>  cpufreq_cooling_register(const struct cpumask *clip_cpus);
>  
> +struct thermal_cooling_device *
> +cpufreq_power_actor_register(const struct cpumask *clip_cpus, u32 capacitance);
> +
>  /**
>   * of_cpufreq_cooling_register - create cpufreq cooling device based on DT.
>   * @np: a valid struct device_node to the cooling device device tree node.
> @@ -45,6 +48,11 @@ cpufreq_cooling_register(const struct cpumask *clip_cpus);
>  struct thermal_cooling_device *
>  of_cpufreq_cooling_register(struct device_node *np,
>  			    const struct cpumask *clip_cpus);
> +
> +struct thermal_cooling_device *
> +of_cpufreq_power_actor_register(struct device_node *np,
> +				const struct cpumask *clip_cpus,
> +				u32 capacitance);
>  #else
>  static inline struct thermal_cooling_device *
>  of_cpufreq_cooling_register(struct device_node *np,
> @@ -52,6 +60,14 @@ of_cpufreq_cooling_register(struct device_node *np,
>  {
>  	return NULL;
>  }
> +
> +static inline struct thermal_cooling_device *
> +of_cpufreq_power_actor_register(struct device_node *np,
> +				const struct cpumask *clip_cpus,
> +				u32 capacitance);
> +{
> +	return ERR_PTR(-ENOSYS);
> +}
>  #endif
>  
>  /**
> @@ -73,6 +89,18 @@ of_cpufreq_cooling_register(struct device_node *np,
>  {
>  	return NULL;
>  }
> +static inline struct thermal_cooling_device *
> +cpufreq_power_actor_register(const struct cpumask *clip_cpus, u32 capacitance)
> +{
> +	return ERR_PTR(-ENOSYS);
> +}
> +static inline struct thermal_cooling_device *
> +of_cpufreq_power_actor_register(struct device_node *np,
> +				const struct cpumask *clip_cpus,
> +				u32 capacitance)
> +{
> +	return ERR_PTR(-ENOSYS);
> +}
>  static inline
>  void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
>  {
> diff --git a/include/linux/thermal.h b/include/linux/thermal.h
> index 8d183b8255eb..5986f546ca98 100644
> --- a/include/linux/thermal.h
> +++ b/include/linux/thermal.h
> @@ -59,6 +59,9 @@
>  #define DEFAULT_THERMAL_GOVERNOR       "user_space"
>  #endif
>  
> +extern struct list_head thermal_cdev_list;
> +extern struct mutex thermal_list_lock;
> +
>  struct thermal_zone_device;
>  struct thermal_cooling_device;
>  
> @@ -108,6 +111,7 @@ enum {
>  
>  enum thermal_cooling_unit {
>  	THERMAL_UNIT_STATE,
> +	THERMAL_UNIT_POWER,
>  };
>  
>  struct thermal_zone_device_ops {
> -- 
> 1.7.9.5
> 
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 4/5] thermal: introduce the Power Allocator governor
  2014-05-06 12:06 ` [RFC PATCH 4/5] thermal: introduce the Power Allocator governor Javi Merino
@ 2014-05-13 23:20   ` Eduardo Valentin
  2014-05-15 18:24     ` Javi Merino
  0 siblings, 1 reply; 48+ messages in thread
From: Eduardo Valentin @ 2014-05-13 23:20 UTC (permalink / raw)
  To: Javi Merino; +Cc: linux-pm, Punit.Agrawal, Zhang Rui

Hello Javi,

Here are a series of initial comments. I will be probably reviewing the
code several times and coming back again with more questions.


On Tue, May 06, 2014 at 01:06:37PM +0100, Javi Merino wrote:
> The power allocator governor is a thermal governor that controls system
> and device power allocation to control temperature.  Conceptually, the
> implementation takes a system view of heat dissipation by managing
> multiple heat sources.

Can it be able to control several thermal zones then?

> 
> This governor relies on power-aware cooling devices (power actors) to
> operate.  That is, cooling devices whose thermal_cooling_device_ops
> accept THERMAL_UNIT_POWER.
> 
> It uses a Proportional Integral (PI) controller driven by the
> temperature of the thermal zone.  This budget is then allocated to
> each cooling device that can have bearing on the temperature we are
> trying to control.  It decides how much power to give each cooling
> device based on the performance they are requesting.  The PI
> controller ensures that the total power budget does not exceed the
> control temperature.
> 
> Cc: Zhang Rui <rui.zhang@intel.com>
> Cc: Eduardo Valentin <edubezval@gmail.com>
> Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
> Signed-off-by: Javi Merino <javi.merino@arm.com>
> ---
>  drivers/thermal/Kconfig           |   14 ++
>  drivers/thermal/Makefile          |    1 +
>  drivers/thermal/power_allocator.c |  403 +++++++++++++++++++++++++++++++++++++
>  drivers/thermal/thermal_core.c    |    7 +-
>  drivers/thermal/thermal_core.h    |    8 +
>  include/linux/thermal.h           |    5 +

We need documentation under Documentation/thermal to understand
how to use this governor. Specially on how to accomodate
several actors and domains under its control.

>  6 files changed, 437 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/thermal/power_allocator.c
> 
> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> index 2d51912a6e40..f5e6b693fc19 100644
> --- a/drivers/thermal/Kconfig
> +++ b/drivers/thermal/Kconfig
> @@ -71,6 +71,14 @@ config THERMAL_DEFAULT_GOV_USER_SPACE
>  	  Select this if you want to let the user space manage the
>  	  platform thermals.
>  
> +config THERMAL_DEFAULT_GOV_POWER_ALLOCATOR
> +	bool "power_allocator"
> +	select THERMAL_GOV_POWER_ALLOCATOR
> +	help
> +	  Select this if you want to control temperature based on
> +	  system and device power allocation. This governor relies on
> +	  power-aware cooling devices (power actors) to operate.
> +
>  endchoice
>  
>  config THERMAL_GOV_FAIR_SHARE
> @@ -89,6 +97,12 @@ config THERMAL_GOV_USER_SPACE
>  	help
>  	  Enable this to let the user space manage the platform thermals.
>  
> +config THERMAL_GOV_POWER_ALLOCATOR
> +	bool "Power allocator thermal governor"
> +	help
> +	  Enable this to manage platform thermals by dynamically
> +	  allocating and limiting power to devices.
> +
>  config CPU_THERMAL
>  	bool "generic cpu cooling support"
>  	depends on CPU_FREQ
> diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
> index 54e4ec9eb5df..1cb7d5fec8dc 100644
> --- a/drivers/thermal/Makefile
> +++ b/drivers/thermal/Makefile
> @@ -13,6 +13,7 @@ thermal_sys-$(CONFIG_THERMAL_OF)		+= of-thermal.o
>  thermal_sys-$(CONFIG_THERMAL_GOV_FAIR_SHARE)	+= fair_share.o
>  thermal_sys-$(CONFIG_THERMAL_GOV_STEP_WISE)	+= step_wise.o
>  thermal_sys-$(CONFIG_THERMAL_GOV_USER_SPACE)	+= user_space.o
> +thermal_sys-$(CONFIG_THERMAL_GOV_POWER_ALLOCATOR)	+= power_allocator.o
>  
>  # cpufreq cooling
>  thermal_sys-$(CONFIG_CPU_THERMAL)	+= cpu_cooling.o
> diff --git a/drivers/thermal/power_allocator.c b/drivers/thermal/power_allocator.c
> new file mode 100644
> index 000000000000..39c425abf10a
> --- /dev/null
> +++ b/drivers/thermal/power_allocator.c
> @@ -0,0 +1,403 @@
> +/*
> + * A power allocator to manage temperature
> + *
> + * Copyright (C) 2014 ARM Ltd.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed "as is" WITHOUT ANY WARRANTY of any
> + * kind, whether express or implied; without even the implied warranty
> + * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + */
> +
> +#define pr_fmt(fmt) "Power allocator: " fmt
> +
> +#include <linux/list.h>
> +#include <linux/slab.h>
> +#include <linux/thermal.h>
> +
> +#include "thermal_core.h"
> +
> +#define MAX_NUM_ACTORS 8

hmm... Why do we need to limit the amount of actors?

> +
> +#define FRAC_BITS 8
> +#define int_to_frac(x) ((x) << FRAC_BITS)
> +#define frac_to_int(x) ((x) >> FRAC_BITS)
> +
> +/**
> + * mul_frac - multiply two fixed-point numbers
> + * @x:	first multiplicand
> + * @y:	second multiplicand
> + *
> + * Returns the result of multiplying two fixed-point numbers.  The
> + * result is also a fixed-point number.
> + */
> +static inline s64 mul_frac(s64 x, s64 y)
> +{
> +	return (x * y) >> FRAC_BITS;
> +}
> +
> +enum power_allocator_trip_levels {
> +	TRIP_SWITCH_ON = 0,	/* Switch on PI controller */
> +	TRIP_MAX_DESIRED_TEMPERATURE, /* Temperature we are controlling for */
> +};

Can we reuse the trip types we already have?

> +
> +/**
> + * struct power_allocator_params - parameters for the power allocator governor
> + * @k_po:	P parameter of the PI controller when overshooting (i.e., when
> + *		temperature is below the target)
> + * @k_pi:	P parameter of the PI controller when undershooting
> + * @k_i:	I parameter of the PI controller
> + * @integral_cutoff:	threshold below which the error is no longer accumulated
> +			in the PI controller
> + * @err_integral:	Accumulated error in the PI controller.
> + */
> +struct power_allocator_params {
> +	s32 k_po;
> +	s32 k_pu;
> +	s32 k_i;
> +	s32 integral_cutoff;
> +	s32 err_integral;
> +};

We would need a documentation on how to define these values. 

> +
> +/**
> + * pi_controller() - PI controller
> + * @tz:	thermal zone we are operating in
> + * @control_temp:	The target temperature
> + * @max_allocatable_power:	maximum allocatable power for this thermal zone
> + *
> + * This PI controller increases the available power budget so that the
> + * temperature of the thermal zone gets as close as possible to
> + * @control_temp and limits the power if it exceeds it.  k_po is the
> + * proportional term when we are overshooting, k_pu is the
> + * proportional term when we are undershooting.  integral_cutoff is a
> + * threshold below which we stop accumulating the error.  The
> + * accumulated error is only valid if the requested power will make
> + * the system warmer.  If the system is mostly idle, there's no point
> + * in accumulating positive error.
> + *
> + * It returns the power budget for the next period.
> + */
> +static u32 pi_controller(struct thermal_zone_device *tz,
> +			unsigned long current_temp, unsigned long control_temp,
> +			unsigned long max_allocatable_power)
> +{
> +	s64 p, i, power_range;
> +	s32 err;
> +	struct power_allocator_params *params = tz->governor_data;
> +
> +	err = ((s32)control_temp - (s32)current_temp) / 1000;
> +	err = int_to_frac(err);
> +
> +	/* Calculate the proportional term */
> +	p = mul_frac(err < 0 ? params->k_po : params->k_pu, err);
> +
> +	/*
> +	 * Calculate the integral term
> +	 *
> +	 * if the error s less than cut off allow integration (but
> +	 * the integral is limited to max power)
> +	 */
> +	i = mul_frac(params->k_i, params->err_integral);
> +
> +	if (err < int_to_frac(params->integral_cutoff)) {
> +		s64 tmpi = mul_frac(params->k_i, err);
> +		tmpi += i;
> +		if (tmpi <= int_to_frac(max_allocatable_power)) {
> +			i = tmpi;
> +			params->err_integral += err;
> +		}
> +	}
> +
> +	power_range = p + i;
> +
> +	/* feed-forward the known maximum dissipatable power */
> +	power_range = tz->tzp->max_dissipatable_power +
> +		frac_to_int(power_range);
> +
> +	/* output should not exceed max power */
> +	if (power_range > max_allocatable_power)
> +		power_range = max_allocatable_power;
> +
> +	return power_range >= 0 ? power_range : 0;

What does it mean return 0? The above if does not make completelly sense
to me, as power_range can still be 0. power_range < 0 means error?


what if the value at power_range overflows the u32?

> +}
> +
> +static void divvy_up_power(struct thermal_zone_device *tz,

Can you please document how you perform this power split?

> +			unsigned long req_power[MAX_NUM_ACTORS],
> +			unsigned long max_power[MAX_NUM_ACTORS],
> +			int num_actors, unsigned long total_req_power,
> +			u32 power_range,
> +			unsigned long granted_power[MAX_NUM_ACTORS])
> +{
> +	unsigned long extra_power, capped_extra_power;
> +	unsigned long extra_actor_power[MAX_NUM_ACTORS];
> +	int i;
> +
> +	if (!total_req_power) {
> +		/*
> +		 * Nobody requested anything, so just give everybody
> +		 * the maximum power
> +		 */
> +		for (i = 0; i < num_actors; i++)
> +			granted_power[i] = max_power[i];
> +
> +		return;
> +	}
> +
> +	capped_extra_power = 0;
> +	extra_power = 0;
> +	for (i = 0; i < num_actors; i++) {
> +		u64 req_range = req_power[i] * power_range;
> +
> +		granted_power[i] = div_u64(req_range, total_req_power);
> +
> +		if (granted_power[i] > max_power[i]) {
> +			extra_power += granted_power[i] - max_power[i];
> +			granted_power[i] = max_power[i];
> +		}
> +
> +		extra_actor_power[i] = max_power[i] - granted_power[i];
> +		capped_extra_power += extra_actor_power[i];
> +	}
> +
> +	if (!extra_power)
> +		return;
> +
> +	/*
> +	 * Re-divvy the reclaimed extra among actors based on
> +	 * how far they are from the max
> +	 */
> +	extra_power = min(extra_power, capped_extra_power);
> +	if (capped_extra_power > 0)
> +		for (i = 0; i < num_actors; i++)
> +			granted_power[i] += (extra_actor_power[i] *
> +					extra_power) / capped_extra_power;
> +}
> +
> +static int allocate_power(struct thermal_zone_device *tz,
> +			unsigned long current_temp, unsigned long control_temp)
> +{
> +	struct thermal_instance *instance;
> +	unsigned long req_power[MAX_NUM_ACTORS], max_power[MAX_NUM_ACTORS];
> +	unsigned long granted_power[MAX_NUM_ACTORS];
> +	unsigned long total_req_power, max_allocatable_power;
> +	u32 power_range;
> +	int i, num_actors, ret = 0;
> +
> +	i = 0;
> +	total_req_power = 0;
> +	max_allocatable_power = 0;
> +
> +	mutex_lock(&tz->lock);
> +
> +	list_for_each_entry(instance, &tz->thermal_instances, tz_node) {
> +		struct thermal_cooling_device *cdev = instance->cdev;
> +
> +		if (instance->trip != TRIP_MAX_DESIRED_TEMPERATURE)

How does one instance->trip gets the value of
TRIP_MAX_DESIRED_TEMPERATURE?

> +			continue;
> +
> +		if (i >= MAX_NUM_ACTORS) {
> +			pr_warn("Too many actors (%d) for this governor\n", i);

dev_* family is preferrable over pr_* family. Same applies to other
patches.

> +			ret = -EINVAL;
> +			goto unlock;
> +		}
> +
> +		ret = cdev->ops->get_cur(cdev, &req_power[i],
> +					THERMAL_UNIT_POWER);
> +		if (ret) {
> +			pr_warn_once("Couldn't get current power for %s: %d\n",
> +				cdev->type, ret);
> +			goto unlock;
> +		}
> +
> +		total_req_power += req_power[i];
> +
> +		cdev->ops->get_max(cdev, &max_power[i], THERMAL_UNIT_POWER);
> +		max_allocatable_power += max_power[i];
> +
> +		i++;
> +	}
> +	num_actors = i;
> +
> +	power_range = pi_controller(tz, current_temp, control_temp,
> +				max_allocatable_power);
> +
> +	divvy_up_power(tz, req_power, max_power, num_actors, total_req_power,
> +		power_range, granted_power);
> +
> +	i = 0;
> +	list_for_each_entry(instance, &tz->thermal_instances, tz_node) {
> +		struct thermal_cooling_device *cdev = instance->cdev;
> +
> +		if (instance->trip != TRIP_MAX_DESIRED_TEMPERATURE)
> +			continue;
> +
> +		cdev->ops->set_cur(cdev, granted_power[i], THERMAL_UNIT_POWER);
> +		i++;
> +	}
> +
> +unlock:
> +	mutex_unlock(&tz->lock);
> +
> +	return ret;
> +}
> +
> +static int check_trips(struct thermal_zone_device *tz)
> +{
> +	int ret;
> +	enum thermal_trip_type type;
> +
> +	if (tz->trips < 2)
> +		return -EINVAL;
> +
> +	ret = tz->ops->get_trip_type(tz, TRIP_SWITCH_ON, &type);
> +	if (ret)
> +		return ret;
> +
> +	if ((type != THERMAL_TRIP_PASSIVE) && (type != THERMAL_TRIP_ACTIVE))
> +		return -EINVAL;
> +
> +	ret = tz->ops->get_trip_type(tz, TRIP_MAX_DESIRED_TEMPERATURE, &type);
> +	if (ret)
> +		return ret;
> +
> +	if ((type != THERMAL_TRIP_PASSIVE) && (type != THERMAL_TRIP_ACTIVE))
> +		return -EINVAL;
> +
> +	return ret;
> +}
> +
> +static void reset_pi_controller(struct power_allocator_params *params)
> +{
> +	params->err_integral = 0;
> +}
> +
> +/**
> + * power_allocator_bind - bind the power_allocator governor to a thermal zone
> + * @tz:	thermal zone to bind it to
> + *
> + * Check that the thermal zone is valid for this governor: has two
> + * thermal trips.  If so, initialize the PI controller parameters and
> + * bind it to the thermal zone.
> + *
> + * Returns 0 on success, -EINVAL if the trips were invalid or -ENOMEM
> + * if we ran out of memory.
> + */
> +static int power_allocator_bind(struct thermal_zone_device *tz)
> +{
> +	int ret;
> +	struct power_allocator_params *params;
> +	unsigned long switch_on_temp, control_temp;
> +	u32 temperature_threshold;
> +
> +	ret = check_trips(tz);
> +	if (ret) {
> +		pr_err("thermal zone %s has the wrong number of trips for this governor\n",
> +			tz->type);
> +		return ret;
> +	}
> +
> +	if (!tz->tzp || !tz->tzp->max_dissipatable_power) {
> +		pr_err("Trying to bind the power_allocator governor to a thermal zone without the max_dissipatable_power parameter\n");

Can you please break the above line?

> +		return -EINVAL;
> +	}
> +
> +	params = kzalloc(sizeof(*params), GFP_KERNEL);

can we use devm_kzalloc?

> +	if (!params)
> +		return -ENOMEM;
> +
> +	ret = tz->ops->get_trip_temp(tz, TRIP_SWITCH_ON, &switch_on_temp);
> +	if (ret)
> +		goto free;
> +
> +	ret = tz->ops->get_trip_temp(tz, TRIP_MAX_DESIRED_TEMPERATURE,
> +				&control_temp);

thermal zone drivers are not aware of these trip types, are they?

> +	if (ret)
> +		goto free;
> +
> +	temperature_threshold = (control_temp - switch_on_temp) / 1000;
> +
> +	params->k_po = int_to_frac(tz->tzp->max_dissipatable_power) /
> +		temperature_threshold;
> +	params->k_pu = int_to_frac(2 * tz->tzp->max_dissipatable_power) /
> +		temperature_threshold;
> +	params->k_i = int_to_frac(1);

I suppose we need a way to input k_po, k_pu and k_i parameters from
platform code to this governor right? Or are they always defined as the
above code? always based on max_dissipated_power?

> +	params->integral_cutoff = 0;
> +
> +	reset_pi_controller(params);
> +
> +	tz->governor_data = params;
> +	return 0;
> +
> +free:
> +	kfree(params);
> +	return ret;
> +}
> +
> +static void power_allocator_unbind(struct thermal_zone_device *tz)
> +{
> +	pr_debug("Unbinding from thermal zone %d\n", tz->id);
> +	kfree(tz->governor_data);
> +	tz->governor_data = NULL;
> +}
> +
> +static int power_allocator_throttle(struct thermal_zone_device *tz, int trip)
> +{
> +	int ret;
> +	unsigned long switch_on_temp, control_temp, current_temp;
> +	struct power_allocator_params *params = tz->governor_data;
> +
> +	/*
> +	 * We get called for every trip point but we only need to do
> +	 * our calculations once
> +	 */
> +	if (trip != TRIP_MAX_DESIRED_TEMPERATURE)
> +		return 0;
> +
> +	ret = thermal_zone_get_temp(tz, &current_temp);
> +	if (ret) {
> +		pr_warn("Failed to get temperature: %d\n", ret);
> +		return ret;
> +	}
> +
> +	ret = tz->ops->get_trip_temp(tz, TRIP_SWITCH_ON, &switch_on_temp);
> +	if (ret) {
> +		pr_warn("Failed to get switch on temperature: %d\n", ret);
> +		return ret;
> +	}
> +
> +	if (current_temp < switch_on_temp) {
> +		reset_pi_controller(params);
> +		return 0;
> +	}
> +
> +	ret = tz->ops->get_trip_temp(tz, TRIP_MAX_DESIRED_TEMPERATURE,
> +				&control_temp);
> +	if (ret) {
> +		pr_warn("Failed to get the maximum desired temperature: %d\n",
> +			ret);
> +		return ret;
> +	}
> +
> +	return allocate_power(tz, current_temp, control_temp);
> +}
> +
> +static struct thermal_governor thermal_gov_power_allocator = {
> +	.name		= "power_allocator",
> +	.bind_to_tz	= power_allocator_bind,
> +	.unbind_from_tz	= power_allocator_unbind,
> +	.throttle	= power_allocator_throttle,
> +};
> +
> +int thermal_gov_power_allocator_register(void)
> +{
> +	return thermal_register_governor(&thermal_gov_power_allocator);
> +}
> +
> +void thermal_gov_power_allocator_unregister(void)
> +{
> +	thermal_unregister_governor(&thermal_gov_power_allocator);
> +}
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index 1efaadf436fa..bed3d2ef8ef3 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -1840,7 +1840,11 @@ static int __init thermal_register_governors(void)
>  	if (result)
>  		return result;
>  
> -	return thermal_gov_user_space_register();
> +	result = thermal_gov_user_space_register();
> +	if (result)
> +		return result;
> +
> +	return thermal_gov_power_allocator_register();
>  }
>  
>  static void thermal_unregister_governors(void)
> @@ -1848,6 +1852,7 @@ static void thermal_unregister_governors(void)
>  	thermal_gov_step_wise_unregister();
>  	thermal_gov_fair_share_unregister();
>  	thermal_gov_user_space_unregister();
> +	thermal_gov_power_allocator_unregister();
>  }
>  
>  static int __init thermal_init(void)
> diff --git a/drivers/thermal/thermal_core.h b/drivers/thermal/thermal_core.h
> index 3db339fb636f..b24cde2c71cc 100644
> --- a/drivers/thermal/thermal_core.h
> +++ b/drivers/thermal/thermal_core.h
> @@ -77,6 +77,14 @@ static inline int thermal_gov_user_space_register(void) { return 0; }
>  static inline void thermal_gov_user_space_unregister(void) {}
>  #endif /* CONFIG_THERMAL_GOV_USER_SPACE */
>  
> +#ifdef CONFIG_THERMAL_GOV_POWER_ALLOCATOR
> +int thermal_gov_power_allocator_register(void);
> +void thermal_gov_power_allocator_unregister(void);
> +#else
> +static inline int thermal_gov_power_allocator_register(void) { return 0; }
> +static inline void thermal_gov_power_allocator_unregister(void) {}
> +#endif /* CONFIG_THERMAL_GOV_POWER_ALLOCATOR */
> +
>  /* device tree support */
>  #ifdef CONFIG_THERMAL_OF
>  int of_parse_thermal_zones(void);
> diff --git a/include/linux/thermal.h b/include/linux/thermal.h
> index 5986f546ca98..3861cb443520 100644
> --- a/include/linux/thermal.h
> +++ b/include/linux/thermal.h
> @@ -57,6 +57,8 @@
>  #define DEFAULT_THERMAL_GOVERNOR       "fair_share"
>  #elif defined(CONFIG_THERMAL_DEFAULT_GOV_USER_SPACE)
>  #define DEFAULT_THERMAL_GOVERNOR       "user_space"
> +#elif defined(CONFIG_THERMAL_DEFAULT_GOV_POWER_ALLOCATOR)
> +#define DEFAULT_THERMAL_GOVERNOR       "power_allocator"
>  #endif
>  
>  extern struct list_head thermal_cdev_list;
> @@ -250,6 +252,9 @@ struct thermal_zone_params {
>  
>  	int num_tbps;	/* Number of tbp entries */
>  	struct thermal_bind_params *tbp;
> +
> +	/* Maximum power (heat) that this thermal zone can dissipate in mW */
> +	u32 max_dissipatable_power;
>  };
>  
>  struct thermal_genl_event {
> -- 
> 1.7.9.5
> 
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 0/5] The power allocator thermal governor
  2014-05-06 12:06 [RFC PATCH 0/5] The power allocator thermal governor Javi Merino
                   ` (4 preceding siblings ...)
  2014-05-06 12:06 ` [RFC PATCH 5/5] thermal: add trace events to the power allocator governor Javi Merino
@ 2014-05-14 10:05 ` James King
  5 siblings, 0 replies; 48+ messages in thread
From: James King @ 2014-05-14 10:05 UTC (permalink / raw)
  To: linux-pm

Hi Javi and Punit,

We have integrated this patchset onto an internal dual ARM cluster SoC and 
it does do a good job of holding the SoC temperature steady while running 
high performance benchmarks. Happy to be added to all the patches as "verified 
by".

Thanks, James




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH v2] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks
  2014-05-07  3:12       ` [RFC][PATCH v2] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks Steven Rostedt
  2014-05-07 11:40         ` Mathieu Desnoyers
@ 2014-05-14 14:23         ` Javi Merino
  2014-05-14 15:36           ` Steven Rostedt
  1 sibling, 1 reply; 48+ messages in thread
From: Javi Merino @ 2014-05-14 14:23 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Mathieu Desnoyers, Andrew Morton, Ingo Molnar,
	Namhyung Kim, Jiri Olsa

Hi Steven,

On Wed, May 07, 2014 at 04:12:38AM +0100, Steven Rostedt wrote:
> 
> Being able to show a cpumask of events can be useful as some events
> may affect only some CPUs. There is no standard way to record the
> cpumask and converting it to a string is rather expensive during
> the trace as traces happen in hotpaths. It would be better to record
> the raw event mask and be able to parse it at print time.
> 
> The following macros were added for use with the TRACE_EVENT() macro:
> 
>   __bitmask()
>   __assign_bitmask()
>   __get_bitmask()
> 
> To test this, I added this to the sched_migrate_task event, which
> looked like this:
> 
> TRACE_EVENT(sched_migrate_task,
> 
> 	TP_PROTO(struct task_struct *p, int dest_cpu, const struct cpumask *cpus),
> 
> 	TP_ARGS(p, dest_cpu, cpus),
> 
> 	TP_STRUCT__entry(
> 		__array(	char,	comm,	TASK_COMM_LEN	)
> 		__field(	pid_t,	pid			)
> 		__field(	int,	prio			)
> 		__field(	int,	orig_cpu		)
> 		__field(	int,	dest_cpu		)
> 		__bitmask(	cpumask, cpus			)
> 	),
> 
> 	TP_fast_assign(
> 		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
> 		__entry->pid		= p->pid;
> 		__entry->prio		= p->prio;
> 		__entry->orig_cpu	= task_cpu(p);
> 		__entry->dest_cpu	= dest_cpu;
> 		__assign_bitmask(cpumask, cpumask_bits(cpus));
> 	),
> 
> 	TP_printk("comm=%s pid=%d prio=%d orig_cpu=%d dest_cpu=%d cpumask=%s",
> 		  __entry->comm, __entry->pid, __entry->prio,
> 		  __entry->orig_cpu, __entry->dest_cpu,
> 		  __get_bitmask(cpumask))
> );
> 
> With the output of:
> 
>         ksmtuned-3613  [003] d..2   485.220508: sched_migrate_task: comm=ksmtuned pid=3615 prio=120 orig_cpu=3 dest_cpu=2 cpumask=00000000,0000000f
>      migration/1-13    [001] d..5   485.221202: sched_migrate_task: comm=ksmtuned pid=3614 prio=120 orig_cpu=1 dest_cpu=0 cpumask=00000000,0000000f
>              awk-3615  [002] d.H5   485.221747: sched_migrate_task: comm=rcu_preempt pid=7 prio=120 orig_cpu=0 dest_cpu=1 cpumask=00000000,000000ff
>      migration/2-18    [002] d..5   485.222062: sched_migrate_task: comm=ksmtuned pid=3615 prio=120 orig_cpu=2 dest_cpu=3 cpumask=00000000,0000000f
> 
> Link: http://lkml.kernel.org/r/1399377998-14870-6-git-send-email-javi.merino@arm.com
> 
> Suggested-by: Javi Merino <javi.merino@arm.com>
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
> 
> Changes since v1: Use bitmask name instead of cpumask naming.
> 
> ---
>  include/linux/ftrace_event.h |  3 +++
>  include/linux/trace_seq.h    | 10 ++++++++
>  include/trace/ftrace.h       | 57 +++++++++++++++++++++++++++++++++++++++++++-
>  kernel/trace/trace_output.c  | 41 +++++++++++++++++++++++++++++++
>  4 files changed, 110 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
> index d16da3e..cff3106 100644
> --- a/include/linux/ftrace_event.h
> +++ b/include/linux/ftrace_event.h
> @@ -38,6 +38,9 @@ const char *ftrace_print_symbols_seq_u64(struct trace_seq *p,
>  								 *symbol_array);
>  #endif
>  
> +const char *ftrace_print_bitmask_seq(struct trace_seq *p, void *bitmask_ptr,
> +				     unsigned int bitmask_size);
> +
>  const char *ftrace_print_hex_seq(struct trace_seq *p,
>  				 const unsigned char *buf, int len);
>  
> diff --git a/include/linux/trace_seq.h b/include/linux/trace_seq.h
> index a32d86e..1361169 100644
> --- a/include/linux/trace_seq.h
> +++ b/include/linux/trace_seq.h
> @@ -46,6 +46,9 @@ extern int trace_seq_putmem_hex(struct trace_seq *s, const void *mem,
>  extern void *trace_seq_reserve(struct trace_seq *s, size_t len);
>  extern int trace_seq_path(struct trace_seq *s, const struct path *path);
>  
> +extern int trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp,
> +			     int nmaskbits);
> +
>  #else /* CONFIG_TRACING */
>  static inline int trace_seq_printf(struct trace_seq *s, const char *fmt, ...)
>  {
> @@ -57,6 +60,13 @@ trace_seq_bprintf(struct trace_seq *s, const char *fmt, const u32 *binary)
>  	return 0;
>  }
>  
> +static inline int
> +trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp,
> +		  int nmaskbits)
> +{
> +	return 0;
> +}
> +
>  static inline int trace_print_seq(struct seq_file *m, struct trace_seq *s)
>  {
>  	return 0;
> diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
> index 0a1a4f7..d9c44af 100644
> --- a/include/trace/ftrace.h
> +++ b/include/trace/ftrace.h
> @@ -53,6 +53,9 @@
>  #undef __string
>  #define __string(item, src) __dynamic_array(char, item, -1)
>  
> +#undef __bitmask
> +#define __bitmask(item, src) __dynamic_array(char, item, -1)
> +
>  #undef TP_STRUCT__entry
>  #define TP_STRUCT__entry(args...) args
>  
> @@ -128,6 +131,9 @@
>  #undef __string
>  #define __string(item, src) __dynamic_array(char, item, -1)
>  
> +#undef __string
> +#define __string(item, src) __dynamic_array(unsigned long, item, -1)
> +

This overrides the previous definition of __string() and looks like it
shouldn't be here.

>  #undef DECLARE_EVENT_CLASS
>  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
>  	struct ftrace_data_offsets_##call {				\
> @@ -200,6 +206,15 @@
>  #undef __get_str
>  #define __get_str(field) (char *)__get_dynamic_array(field)
>  
> +#undef __get_bitmask
> +#define __get_bitmask(field)						\
> +	({								\
> +		void *__bitmask = __get_dynamic_array(field);		\
> +		unsigned int __bitmask_size;				\
> +		__bitmask_size = (__entry->__data_loc_##field >> 16) & 0xffff; \
> +		ftrace_print_bitmask_seq(p, __bitmask, __bitmask_size);	\
> +	})
> +
>  #undef __print_flags
>  #define __print_flags(flag, delim, flag_array...)			\
>  	({								\
> @@ -322,6 +337,9 @@ static struct trace_event_functions ftrace_event_type_funcs_##call = {	\
>  #undef __string
>  #define __string(item, src) __dynamic_array(char, item, -1)
>  
> +#undef __bitmask
> +#define __bitmask(item, src) __dynamic_array(unsigned long, item, -1)
> +
>
>  #undef DECLARE_EVENT_CLASS
>  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, func, print)	\
>  static int notrace __init						\
> @@ -372,6 +390,29 @@ ftrace_define_fields_##call(struct ftrace_event_call *event_call)	\
>  #define __string(item, src) __dynamic_array(char, item,			\
>  		    strlen((src) ? (const char *)(src) : "(null)") + 1)
>  
> +/*
> + * __bitmask_size_in_bytes_raw is the number of bytes needed to hold
> + * num_possible_cpus().
> + */
> +#define __bitmask_size_in_bytes_raw			\
> +	((num_possible_cpus() + 7) / 8)
> +
> +#define __bitmask_size_in_longs						\
> +	((__bitmask_size_in_bytes_raw + ((BITS_PER_LONG / 8) - 1))	\
> +	 / (BITS_PER_LONG / 8))
> +
> +/*
> + * __bitmask_size_in_bytes is the number of bytes needed to hold
> + * num_possible_cpus() padded out to the nearest long. This is what
> + * is saved in the buffer, just to be consistent.
> + */
> +#define __bitmask_size_in_bytes				\
> +	(__bitmask_size_in_longs * (BITS_PER_LONG / 8))
> +
> +#undef __bitmask
> +#define __bitmask(item, src) __dynamic_array(unsigned long, item,	\
> +					     __bitmask_size_in_longs)
> +
>  #undef DECLARE_EVENT_CLASS
>  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
>  static inline notrace int ftrace_get_offsets_##call(			\
> @@ -513,12 +554,22 @@ static inline notrace int ftrace_get_offsets_##call(			\
>  	__entry->__data_loc_##item = __data_offsets.item;
>  
>  #undef __string
> -#define __string(item, src) __dynamic_array(char, item, -1)       	\
> +#define __string(item, src) __dynamic_array(char, item, -1)
>  
>  #undef __assign_str
>  #define __assign_str(dst, src)						\
>  	strcpy(__get_str(dst), (src) ? (const char *)(src) : "(null)");
>  
> +#undef __bitmask
> +#define __bitmask(item, src) __dynamic_array(unsigned long, item, -1)

Why src?  It's not used in any of the definitions of the __bitmask()
macro, can we remove it?

Cheers,
Javi


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH v2] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks
  2014-05-14 14:23         ` Javi Merino
@ 2014-05-14 15:36           ` Steven Rostedt
  2014-05-14 15:50             ` Javi Merino
  0 siblings, 1 reply; 48+ messages in thread
From: Steven Rostedt @ 2014-05-14 15:36 UTC (permalink / raw)
  To: Javi Merino
  Cc: LKML, Mathieu Desnoyers, Andrew Morton, Ingo Molnar,
	Namhyung Kim, Jiri Olsa

On Wed, 14 May 2014 15:23:24 +0100
Javi Merino <javi.merino@arm.com> wrote:

> > diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
> > index 0a1a4f7..d9c44af 100644
> > --- a/include/trace/ftrace.h
> > +++ b/include/trace/ftrace.h
> > @@ -53,6 +53,9 @@
> >  #undef __string
> >  #define __string(item, src) __dynamic_array(char, item, -1)
> >  
> > +#undef __bitmask
> > +#define __bitmask(item, src) __dynamic_array(char, item, -1)
> > +
> >  #undef TP_STRUCT__entry
> >  #define TP_STRUCT__entry(args...) args
> >  
> > @@ -128,6 +131,9 @@
> >  #undef __string
> >  #define __string(item, src) __dynamic_array(char, item, -1)
> >  
> > +#undef __string
> > +#define __string(item, src) __dynamic_array(unsigned long, item, -1)
> > +
> 
> This overrides the previous definition of __string() and looks like it
> shouldn't be here.

Bah! I knew there was a reason I didn't push this out to my for-next
branch yet. I can still rebase :-)

That should have been __bitmask(). Hmm, amazing it still worked.

> 
> >  #undef DECLARE_EVENT_CLASS
> >  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
> >  	struct ftrace_data_offsets_##call {				\
> > @@ -200,6 +206,15 @@
> >  #undef __get_str
> >  #define __get_str(field) (char *)__get_dynamic_array(field)
> >  
> > +#undef __get_bitmask
> > +#define __get_bitmask(field)						\
> > +	({								\
> > +		void *__bitmask = __get_dynamic_array(field);		\
> > +		unsigned int __bitmask_size;				\
> > +		__bitmask_size = (__entry->__data_loc_##field >> 16) & 0xffff; \
> > +		ftrace_print_bitmask_seq(p, __bitmask, __bitmask_size);	\
> > +	})
> > +
> >  #undef __print_flags
> >  #define __print_flags(flag, delim, flag_array...)			\
> >  	({								\
> > @@ -322,6 +337,9 @@ static struct trace_event_functions ftrace_event_type_funcs_##call = {	\
> >  #undef __string
> >  #define __string(item, src) __dynamic_array(char, item, -1)
> >  
> > +#undef __bitmask
> > +#define __bitmask(item, src) __dynamic_array(unsigned long, item, -1)
> > +
> >
> >  #undef DECLARE_EVENT_CLASS
> >  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, func, print)	\
> >  static int notrace __init						\
> > @@ -372,6 +390,29 @@ ftrace_define_fields_##call(struct ftrace_event_call *event_call)	\
> >  #define __string(item, src) __dynamic_array(char, item,			\
> >  		    strlen((src) ? (const char *)(src) : "(null)") + 1)
> >  
> > +/*
> > + * __bitmask_size_in_bytes_raw is the number of bytes needed to hold
> > + * num_possible_cpus().
> > + */
> > +#define __bitmask_size_in_bytes_raw			\
> > +	((num_possible_cpus() + 7) / 8)
> > +
> > +#define __bitmask_size_in_longs						\
> > +	((__bitmask_size_in_bytes_raw + ((BITS_PER_LONG / 8) - 1))	\
> > +	 / (BITS_PER_LONG / 8))
> > +
> > +/*
> > + * __bitmask_size_in_bytes is the number of bytes needed to hold
> > + * num_possible_cpus() padded out to the nearest long. This is what
> > + * is saved in the buffer, just to be consistent.
> > + */
> > +#define __bitmask_size_in_bytes				\
> > +	(__bitmask_size_in_longs * (BITS_PER_LONG / 8))
> > +
> > +#undef __bitmask
> > +#define __bitmask(item, src) __dynamic_array(unsigned long, item,	\
> > +					     __bitmask_size_in_longs)
> > +
> >  #undef DECLARE_EVENT_CLASS
> >  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
> >  static inline notrace int ftrace_get_offsets_##call(			\
> > @@ -513,12 +554,22 @@ static inline notrace int ftrace_get_offsets_##call(			\
> >  	__entry->__data_loc_##item = __data_offsets.item;
> >  
> >  #undef __string
> > -#define __string(item, src) __dynamic_array(char, item, -1)       	\
> > +#define __string(item, src) __dynamic_array(char, item, -1)
> >  
> >  #undef __assign_str
> >  #define __assign_str(dst, src)						\
> >  	strcpy(__get_str(dst), (src) ? (const char *)(src) : "(null)");
> >  
> > +#undef __bitmask
> > +#define __bitmask(item, src) __dynamic_array(unsigned long, item, -1)
> 
> Why src?  It's not used in any of the definitions of the __bitmask()
> macro, can we remove it?

Hmm, I may need to refactor this. I may pull this patch for now to push
the rest to for-next.

I just noticed that we need a way to specify the length of the bitmask.
Right now it's hardcoded as "num_possible_cpus", which Mathieu was
asking for a more generic approach.

OK, let me pull this patch out of my tree (thank god I never pushed
it), and work on it a bit more.

Thanks,

-- Steve


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH v2] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks
  2014-05-14 15:36           ` Steven Rostedt
@ 2014-05-14 15:50             ` Javi Merino
  2014-05-14 16:07               ` Steven Rostedt
  2014-05-14 18:25               ` [RFC][PATCH v3] " Steven Rostedt
  0 siblings, 2 replies; 48+ messages in thread
From: Javi Merino @ 2014-05-14 15:50 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Mathieu Desnoyers, Andrew Morton, Ingo Molnar,
	Namhyung Kim, Jiri Olsa

On Wed, May 14, 2014 at 04:36:23PM +0100, Steven Rostedt wrote:
> On Wed, 14 May 2014 15:23:24 +0100
> Javi Merino <javi.merino@arm.com> wrote:
> 
> > > diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
> > > index 0a1a4f7..d9c44af 100644
> > > --- a/include/trace/ftrace.h
> > > +++ b/include/trace/ftrace.h
> > > @@ -53,6 +53,9 @@
> > >  #undef __string
> > >  #define __string(item, src) __dynamic_array(char, item, -1)
> > >  
> > > +#undef __bitmask
> > > +#define __bitmask(item, src) __dynamic_array(char, item, -1)
> > > +
> > >  #undef TP_STRUCT__entry
> > >  #define TP_STRUCT__entry(args...) args
> > >  
> > > @@ -128,6 +131,9 @@
> > >  #undef __string
> > >  #define __string(item, src) __dynamic_array(char, item, -1)
> > >  
> > > +#undef __string
> > > +#define __string(item, src) __dynamic_array(unsigned long, item, -1)
> > > +
> > 
> > This overrides the previous definition of __string() and looks like it
> > shouldn't be here.
> 
> Bah! I knew there was a reason I didn't push this out to my for-next
> branch yet. I can still rebase :-)
> 
> That should have been __bitmask(). Hmm, amazing it still worked.
> 
> > 
> > >  #undef DECLARE_EVENT_CLASS
> > >  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
> > >  	struct ftrace_data_offsets_##call {				\
> > > @@ -200,6 +206,15 @@
> > >  #undef __get_str
> > >  #define __get_str(field) (char *)__get_dynamic_array(field)
> > >  
> > > +#undef __get_bitmask
> > > +#define __get_bitmask(field)						\
> > > +	({								\
> > > +		void *__bitmask = __get_dynamic_array(field);		\
> > > +		unsigned int __bitmask_size;				\
> > > +		__bitmask_size = (__entry->__data_loc_##field >> 16) & 0xffff; \
> > > +		ftrace_print_bitmask_seq(p, __bitmask, __bitmask_size);	\
> > > +	})
> > > +
> > >  #undef __print_flags
> > >  #define __print_flags(flag, delim, flag_array...)			\
> > >  	({								\
> > > @@ -322,6 +337,9 @@ static struct trace_event_functions ftrace_event_type_funcs_##call = {	\
> > >  #undef __string
> > >  #define __string(item, src) __dynamic_array(char, item, -1)
> > >  
> > > +#undef __bitmask
> > > +#define __bitmask(item, src) __dynamic_array(unsigned long, item, -1)
> > > +
> > >
> > >  #undef DECLARE_EVENT_CLASS
> > >  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, func, print)	\
> > >  static int notrace __init						\
> > > @@ -372,6 +390,29 @@ ftrace_define_fields_##call(struct ftrace_event_call *event_call)	\
> > >  #define __string(item, src) __dynamic_array(char, item,			\
> > >  		    strlen((src) ? (const char *)(src) : "(null)") + 1)
> > >  
> > > +/*
> > > + * __bitmask_size_in_bytes_raw is the number of bytes needed to hold
> > > + * num_possible_cpus().
> > > + */
> > > +#define __bitmask_size_in_bytes_raw			\
> > > +	((num_possible_cpus() + 7) / 8)
> > > +
> > > +#define __bitmask_size_in_longs						\
> > > +	((__bitmask_size_in_bytes_raw + ((BITS_PER_LONG / 8) - 1))	\
> > > +	 / (BITS_PER_LONG / 8))
> > > +
> > > +/*
> > > + * __bitmask_size_in_bytes is the number of bytes needed to hold
> > > + * num_possible_cpus() padded out to the nearest long. This is what
> > > + * is saved in the buffer, just to be consistent.
> > > + */
> > > +#define __bitmask_size_in_bytes				\
> > > +	(__bitmask_size_in_longs * (BITS_PER_LONG / 8))
> > > +
> > > +#undef __bitmask
> > > +#define __bitmask(item, src) __dynamic_array(unsigned long, item,	\
> > > +					     __bitmask_size_in_longs)
> > > +
> > >  #undef DECLARE_EVENT_CLASS
> > >  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
> > >  static inline notrace int ftrace_get_offsets_##call(			\
> > > @@ -513,12 +554,22 @@ static inline notrace int ftrace_get_offsets_##call(			\
> > >  	__entry->__data_loc_##item = __data_offsets.item;
> > >  
> > >  #undef __string
> > > -#define __string(item, src) __dynamic_array(char, item, -1)       	\
> > > +#define __string(item, src) __dynamic_array(char, item, -1)
> > >  
> > >  #undef __assign_str
> > >  #define __assign_str(dst, src)						\
> > >  	strcpy(__get_str(dst), (src) ? (const char *)(src) : "(null)");
> > >  
> > > +#undef __bitmask
> > > +#define __bitmask(item, src) __dynamic_array(unsigned long, item, -1)
> > 
> > Why src?  It's not used in any of the definitions of the __bitmask()
> > macro, can we remove it?
> 
> Hmm, I may need to refactor this. I may pull this patch for now to push
> the rest to for-next.
> 
> I just noticed that we need a way to specify the length of the bitmask.
> Right now it's hardcoded as "num_possible_cpus", which Mathieu was
> asking for a more generic approach.
> 
> OK, let me pull this patch out of my tree (thank god I never pushed
> it), and work on it a bit more.

Ok, well keep me posted on this, I've tried to use it but trace-cmd
says "[FAILED TO PARSE]" and I'm still trying to figure out if it's me
not using it properly or some bug in the code.

Cheers,
Javi


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH v2] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks
  2014-05-14 15:50             ` Javi Merino
@ 2014-05-14 16:07               ` Steven Rostedt
  2014-05-14 18:25               ` [RFC][PATCH v3] " Steven Rostedt
  1 sibling, 0 replies; 48+ messages in thread
From: Steven Rostedt @ 2014-05-14 16:07 UTC (permalink / raw)
  To: Javi Merino
  Cc: LKML, Mathieu Desnoyers, Andrew Morton, Ingo Molnar,
	Namhyung Kim, Jiri Olsa

On Wed, 14 May 2014 16:50:53 +0100
Javi Merino <javi.merino@arm.com> wrote:


> > OK, let me pull this patch out of my tree (thank god I never pushed
> > it), and work on it a bit more.
> 
> Ok, well keep me posted on this, I've tried to use it but trace-cmd
> says "[FAILED TO PARSE]" and I'm still trying to figure out if it's me
> not using it properly or some bug in the code.

Oh, it's going to require a trace-cmd patch. But that's rather trivial.
(actually, it will require a update to libtraceevent).

I can send you a patch for trace-cmd after I get this figured out. I
think I have an idea that can still keep this as 3.16 material.

Thanks,

-- Steve

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC][PATCH v3] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks
  2014-05-14 15:50             ` Javi Merino
  2014-05-14 16:07               ` Steven Rostedt
@ 2014-05-14 18:25               ` Steven Rostedt
  2014-05-14 19:42                 ` Javi Merino
  1 sibling, 1 reply; 48+ messages in thread
From: Steven Rostedt @ 2014-05-14 18:25 UTC (permalink / raw)
  To: Javi Merino
  Cc: LKML, Mathieu Desnoyers, Andrew Morton, Ingo Molnar,
	Namhyung Kim, Jiri Olsa

[-- Attachment #1: Type: text/plain, Size: 10594 bytes --]

Updated version.

You can test it with the attached trace-cmd patch. I tested it against
big and little endian machines (each reading their own and the other
machine's trace.dat file). I did not test against 32bit, but it should
still work. I will test it before I commit it to traceevent library.

Can you review this and try it out?

Thanks!

-- Steve

From be2f29847428f5a290496d1c44ad938f2c2ad040 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>
Date: Tue, 6 May 2014 13:10:24 -0400
Subject: [PATCH] tracing: Add __bitmask() macro to trace events to cpumasks
 and other bitmasks

Being able to show a cpumask of events can be useful as some events
may affect only some CPUs. There is no standard way to record the
cpumask and converting it to a string is rather expensive during
the trace as traces happen in hotpaths. It would be better to record
the raw event mask and be able to parse it at print time.

The following macros were added for use with the TRACE_EVENT() macro:

  __bitmask()
  __assign_bitmask()
  __get_bitmask()

To test this, I added this to the sched_migrate_task event, which
looked like this:

TRACE_EVENT(sched_migrate_task,

	TP_PROTO(struct task_struct *p, int dest_cpu, const struct cpumask *cpus),

	TP_ARGS(p, dest_cpu, cpus),

	TP_STRUCT__entry(
		__array(	char,	comm,	TASK_COMM_LEN	)
		__field(	pid_t,	pid			)
		__field(	int,	prio			)
		__field(	int,	orig_cpu		)
		__field(	int,	dest_cpu		)
		__bitmask(	cpumask, num_possible_cpus()	)
	),

	TP_fast_assign(
		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
		__entry->pid		= p->pid;
		__entry->prio		= p->prio;
		__entry->orig_cpu	= task_cpu(p);
		__entry->dest_cpu	= dest_cpu;
		__assign_bitmask(cpumask, cpumask_bits(cpus), num_possible_cpus());
	),

	TP_printk("comm=%s pid=%d prio=%d orig_cpu=%d dest_cpu=%d cpumask=%s",
		  __entry->comm, __entry->pid, __entry->prio,
		  __entry->orig_cpu, __entry->dest_cpu,
		  __get_bitmask(cpumask))
);

With the output of:

        ksmtuned-3613  [003] d..2   485.220508: sched_migrate_task: comm=ksmtuned pid=3615 prio=120 orig_cpu=3 dest_cpu=2 cpumask=00000000,0000000f
     migration/1-13    [001] d..5   485.221202: sched_migrate_task: comm=ksmtuned pid=3614 prio=120 orig_cpu=1 dest_cpu=0 cpumask=00000000,0000000f
             awk-3615  [002] d.H5   485.221747: sched_migrate_task: comm=rcu_preempt pid=7 prio=120 orig_cpu=0 dest_cpu=1 cpumask=00000000,000000ff
     migration/2-18    [002] d..5   485.222062: sched_migrate_task: comm=ksmtuned pid=3615 prio=120 orig_cpu=2 dest_cpu=3 cpumask=00000000,0000000f

Link: http://lkml.kernel.org/r/1399377998-14870-6-git-send-email-javi.merino@arm.com

Suggested-by: Javi Merino <javi.merino@arm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/linux/ftrace_event.h |  3 +++
 include/linux/trace_seq.h    | 10 ++++++++
 include/trace/ftrace.h       | 57 +++++++++++++++++++++++++++++++++++++++++++-
 kernel/trace/trace_output.c  | 41 +++++++++++++++++++++++++++++++
 4 files changed, 110 insertions(+), 1 deletion(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index d16da3e..cff3106 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -38,6 +38,9 @@ const char *ftrace_print_symbols_seq_u64(struct trace_seq *p,
 								 *symbol_array);
 #endif
 
+const char *ftrace_print_bitmask_seq(struct trace_seq *p, void *bitmask_ptr,
+				     unsigned int bitmask_size);
+
 const char *ftrace_print_hex_seq(struct trace_seq *p,
 				 const unsigned char *buf, int len);
 
diff --git a/include/linux/trace_seq.h b/include/linux/trace_seq.h
index a32d86e..1361169 100644
--- a/include/linux/trace_seq.h
+++ b/include/linux/trace_seq.h
@@ -46,6 +46,9 @@ extern int trace_seq_putmem_hex(struct trace_seq *s, const void *mem,
 extern void *trace_seq_reserve(struct trace_seq *s, size_t len);
 extern int trace_seq_path(struct trace_seq *s, const struct path *path);
 
+extern int trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp,
+			     int nmaskbits);
+
 #else /* CONFIG_TRACING */
 static inline int trace_seq_printf(struct trace_seq *s, const char *fmt, ...)
 {
@@ -57,6 +60,13 @@ trace_seq_bprintf(struct trace_seq *s, const char *fmt, const u32 *binary)
 	return 0;
 }
 
+static inline int
+trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp,
+		  int nmaskbits)
+{
+	return 0;
+}
+
 static inline int trace_print_seq(struct seq_file *m, struct trace_seq *s)
 {
 	return 0;
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 0a1a4f7..9b7a989 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -53,6 +53,9 @@
 #undef __string
 #define __string(item, src) __dynamic_array(char, item, -1)
 
+#undef __bitmask
+#define __bitmask(item, nr_bits) __dynamic_array(char, item, -1)
+
 #undef TP_STRUCT__entry
 #define TP_STRUCT__entry(args...) args
 
@@ -128,6 +131,9 @@
 #undef __string
 #define __string(item, src) __dynamic_array(char, item, -1)
 
+#undef __bitmask
+#define __bitmask(item, nr_bits) __dynamic_array(unsigned long, item, -1)
+
 #undef DECLARE_EVENT_CLASS
 #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
 	struct ftrace_data_offsets_##call {				\
@@ -200,6 +206,15 @@
 #undef __get_str
 #define __get_str(field) (char *)__get_dynamic_array(field)
 
+#undef __get_bitmask
+#define __get_bitmask(field)						\
+	({								\
+		void *__bitmask = __get_dynamic_array(field);		\
+		unsigned int __bitmask_size;				\
+		__bitmask_size = (__entry->__data_loc_##field >> 16) & 0xffff; \
+		ftrace_print_bitmask_seq(p, __bitmask, __bitmask_size);	\
+	})
+
 #undef __print_flags
 #define __print_flags(flag, delim, flag_array...)			\
 	({								\
@@ -322,6 +337,9 @@ static struct trace_event_functions ftrace_event_type_funcs_##call = {	\
 #undef __string
 #define __string(item, src) __dynamic_array(char, item, -1)
 
+#undef __bitmask
+#define __bitmask(item, nr_bits) __dynamic_array(unsigned long, item, -1)
+
 #undef DECLARE_EVENT_CLASS
 #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, func, print)	\
 static int notrace __init						\
@@ -372,6 +390,29 @@ ftrace_define_fields_##call(struct ftrace_event_call *event_call)	\
 #define __string(item, src) __dynamic_array(char, item,			\
 		    strlen((src) ? (const char *)(src) : "(null)") + 1)
 
+/*
+ * __bitmask_size_in_bytes_raw is the number of bytes needed to hold
+ * num_possible_cpus().
+ */
+#define __bitmask_size_in_bytes_raw(nr_bits)	\
+	(((nr_bits) + 7) / 8)
+
+#define __bitmask_size_in_longs(nr_bits)			\
+	((__bitmask_size_in_bytes_raw(nr_bits) +		\
+	  ((BITS_PER_LONG / 8) - 1)) / (BITS_PER_LONG / 8))
+
+/*
+ * __bitmask_size_in_bytes is the number of bytes needed to hold
+ * num_possible_cpus() padded out to the nearest long. This is what
+ * is saved in the buffer, just to be consistent.
+ */
+#define __bitmask_size_in_bytes(nr_bits)				\
+	(__bitmask_size_in_longs(nr_bits) * (BITS_PER_LONG / 8))
+
+#undef __bitmask
+#define __bitmask(item, nr_bits) __dynamic_array(unsigned long, item,	\
+					 __bitmask_size_in_longs(nr_bits))
+
 #undef DECLARE_EVENT_CLASS
 #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
 static inline notrace int ftrace_get_offsets_##call(			\
@@ -513,12 +554,22 @@ static inline notrace int ftrace_get_offsets_##call(			\
 	__entry->__data_loc_##item = __data_offsets.item;
 
 #undef __string
-#define __string(item, src) __dynamic_array(char, item, -1)       	\
+#define __string(item, src) __dynamic_array(char, item, -1)
 
 #undef __assign_str
 #define __assign_str(dst, src)						\
 	strcpy(__get_str(dst), (src) ? (const char *)(src) : "(null)");
 
+#undef __bitmask
+#define __bitmask(item, nr_bits) __dynamic_array(unsigned long, item, -1)
+
+#undef __get_bitmask
+#define __get_bitmask(field) (char *)__get_dynamic_array(field)
+
+#undef __assign_bitmask
+#define __assign_bitmask(dst, src, nr_bits)					\
+	memcpy(__get_bitmask(dst), (src), __bitmask_size_in_bytes(nr_bits))
+
 #undef TP_fast_assign
 #define TP_fast_assign(args...) args
 
@@ -586,6 +637,7 @@ static inline void ftrace_test_probe_##call(void)			\
 #undef __print_hex
 #undef __get_dynamic_array
 #undef __get_str
+#undef __get_bitmask
 
 #undef TP_printk
 #define TP_printk(fmt, args...) "\"" fmt "\", "  __stringify(args)
@@ -651,6 +703,9 @@ __attribute__((section("_ftrace_events"))) *__event_##call = &event_##call
 #undef __get_str
 #define __get_str(field) (char *)__get_dynamic_array(field)
 
+#undef __get_bitmask
+#define __get_bitmask(field) (char *)__get_dynamic_array(field)
+
 #undef __perf_addr
 #define __perf_addr(a)	(__addr = (a))
 
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index a436de1..f3dad80 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -126,6 +126,34 @@ trace_seq_printf(struct trace_seq *s, const char *fmt, ...)
 EXPORT_SYMBOL_GPL(trace_seq_printf);
 
 /**
+ * trace_seq_bitmask - put a list of longs as a bitmask print output
+ * @s:		trace sequence descriptor
+ * @maskp:	points to an array of unsigned longs that represent a bitmask
+ * @nmaskbits:	The number of bits that are valid in @maskp
+ *
+ * It returns 0 if the trace oversizes the buffer's free
+ * space, 1 otherwise.
+ *
+ * Writes a ASCII representation of a bitmask string into @s.
+ */
+int
+trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp,
+		  int nmaskbits)
+{
+	int len = (PAGE_SIZE - 1) - s->len;
+	int ret;
+
+	if (s->full || !len)
+		return 0;
+
+	ret = bitmap_scnprintf(s->buffer, len, maskp, nmaskbits);
+	s->len += ret;
+
+	return 1;
+}
+EXPORT_SYMBOL_GPL(trace_seq_bitmask);
+
+/**
  * trace_seq_vprintf - sequence printing of trace information
  * @s: trace sequence descriptor
  * @fmt: printf format string
@@ -399,6 +427,19 @@ EXPORT_SYMBOL(ftrace_print_symbols_seq_u64);
 #endif
 
 const char *
+ftrace_print_bitmask_seq(struct trace_seq *p, void *bitmask_ptr,
+			 unsigned int bitmask_size)
+{
+	const char *ret = p->buffer + p->len;
+
+	trace_seq_bitmask(p, bitmask_ptr, bitmask_size * 8);
+	trace_seq_putc(p, 0);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(ftrace_print_bitmask_seq);
+
+const char *
 ftrace_print_hex_seq(struct trace_seq *p, const unsigned char *buf, int buf_len)
 {
 	int i;
-- 
1.8.1.4



[-- Attachment #2: print-cpumask.patch --]
[-- Type: text/x-patch, Size: 5552 bytes --]

diff --git a/event-parse.c b/event-parse.c
index 114b41b..ca0d1fe 100644
--- a/event-parse.c
+++ b/event-parse.c
@@ -753,6 +753,9 @@ static void free_arg(struct print_arg *arg)
 	case PRINT_BSTRING:
 		free(arg->string.string);
 		break;
+	case PRINT_BITMASK:
+		free(arg->bitmask.bitmask);
+		break;
 	case PRINT_DYNAMIC_ARRAY:
 		free(arg->dynarray.index);
 		break;
@@ -2256,6 +2259,7 @@ static int arg_num_eval(struct print_arg *arg, long long *val)
 	case PRINT_FIELD ... PRINT_SYMBOL:
 	case PRINT_STRING:
 	case PRINT_BSTRING:
+	case PRINT_BITMASK:
 	default:
 		do_warning("invalid eval type %d", arg->type);
 		ret = 0;
@@ -2284,6 +2288,7 @@ static char *arg_eval (struct print_arg *arg)
 	case PRINT_FIELD ... PRINT_SYMBOL:
 	case PRINT_STRING:
 	case PRINT_BSTRING:
+	case PRINT_BITMASK:
 	default:
 		do_warning("invalid eval type %d", arg->type);
 		break;
@@ -2670,6 +2675,35 @@ process_str(struct event_format *event __maybe_unused, struct print_arg *arg,
 	return EVENT_ERROR;
 }
 
+static enum event_type
+process_bitmask(struct event_format *event __maybe_unused, struct print_arg *arg,
+	    char **tok)
+{
+	enum event_type type;
+	char *token;
+
+	if (read_expect_type(EVENT_ITEM, &token) < 0)
+		goto out_free;
+
+	arg->type = PRINT_BITMASK;
+	arg->bitmask.bitmask = token;
+	arg->bitmask.offset = -1;
+
+	if (read_expected(EVENT_DELIM, ")") < 0)
+		goto out_err;
+
+	type = read_token(&token);
+	*tok = token;
+
+	return type;
+
+ out_free:
+	free_token(token);
+ out_err:
+	*tok = NULL;
+	return EVENT_ERROR;
+}
+
 static struct pevent_function_handler *
 find_func_handler(struct pevent *pevent, char *func_name)
 {
@@ -2781,6 +2815,10 @@ process_function(struct event_format *event, struct print_arg *arg,
 		free_token(token);
 		return process_str(event, arg, tok);
 	}
+	if (strcmp(token, "__get_bitmask") == 0) {
+		free_token(token);
+		return process_bitmask(event, arg, tok);
+	}
 	if (strcmp(token, "__get_dynamic_array") == 0) {
 		free_token(token);
 		return process_dynamic_array(event, arg, tok);
@@ -3307,6 +3345,7 @@ eval_num_arg(void *data, int size, struct event_format *event, struct print_arg
 		return eval_type(val, arg, 0);
 	case PRINT_STRING:
 	case PRINT_BSTRING:
+	case PRINT_BITMASK:
 		return 0;
 	case PRINT_FUNC: {
 		struct trace_seq s;
@@ -3538,6 +3577,60 @@ static void print_str_to_seq(struct trace_seq *s, const char *format,
 		trace_seq_printf(s, format, str);
 }
 
+static void print_bitmask_to_seq(struct pevent *pevent,
+				 struct trace_seq *s, const char *format,
+				 int len_arg, const void *data, int size)
+{
+	int nr_bits = size * 8;
+	int str_size = (nr_bits + 3) / 4;
+	int len = 0;
+	char buf[3];
+	char *str;
+	int index;
+	int i;
+
+	/*
+	 * The kernel likes to put in commas every 32 bits, we
+	 * can do the same.
+	 */
+	str_size += nr_bits / 32;
+
+	str = malloc(str_size + 1);
+	if (!str) {
+		do_warning("%s: not enough memory!", __func__);
+		return;
+	}
+	str[str_size] = 0;
+
+	/* Start out with -3, 1 for the \0 and two for the byte */
+	for (i = str_size - 3; i >= 0; i -= 2) {
+		/*
+		 * data points to a bit mask of size bytes.
+		 * In the kernel, this is an array of long words, thus
+		 * endianess is very important.
+		 */
+		if (pevent->file_bigendian)
+			index = size - (len + 1);
+		else
+			index = len;
+
+		snprintf(buf, 3, "%02x", *((unsigned char *)data + index));
+		memcpy(str + i, buf, 2);
+		len++;
+		if (!(len & 3) && i > 0) {
+			i--;
+			str[i] = ',';
+		}
+	}			
+
+	if (len_arg >= 0)
+		trace_seq_printf(s, format, len_arg, str);
+	else
+		trace_seq_printf(s, format, str);
+
+	free(str);
+}
+
 static void print_str_arg(struct trace_seq *s, void *data, int size,
 			  struct event_format *event, const char *format,
 			  int len_arg, struct print_arg *arg)
@@ -3672,6 +3765,23 @@ static void print_str_arg(struct trace_seq *s, void *data, int size,
 	case PRINT_BSTRING:
 		print_str_to_seq(s, format, len_arg, arg->string.string);
 		break;
+	case PRINT_BITMASK: {
+		int bitmask_offset;
+		int bitmask_size;
+
+		if (arg->bitmask.offset == -1) {
+			struct format_field *f;
+
+			f = pevent_find_any_field(event, arg->bitmask.bitmask);
+			arg->bitmask.offset = f->offset;
+		}
+		bitmask_offset = data2host4(pevent, data + arg->string.offset);
+		bitmask_size = bitmask_offset >> 16;
+		bitmask_offset &= 0xffff;
+		print_bitmask_to_seq(pevent, s, format, len_arg,
+				     data + bitmask_offset, bitmask_size);
+		break;
+	}
 	case PRINT_OP:
 		/*
 		 * The only op for string should be ? :
@@ -4889,6 +4999,9 @@ static void print_args(struct print_arg *args)
 	case PRINT_BSTRING:
 		printf("__get_str(%s)", args->string.string);
 		break;
+	case PRINT_BITMASK:
+		printf("__get_bitmask(%s)", args->bitmask.bitmask);
+		break;
 	case PRINT_TYPE:
 		printf("(%s)", args->typecast.type);
 		print_args(args->typecast.item);
diff --git a/event-parse.h b/event-parse.h
index 108a4ef..babd4a4 100644
--- a/event-parse.h
+++ b/event-parse.h
@@ -211,6 +211,11 @@ struct print_arg_string {
 	int			offset;
 };
 
+struct print_arg_bitmask {
+	char			*bitmask;
+	int			offset;
+};
+
 struct print_arg_field {
 	char			*name;
 	struct format_field	*field;
@@ -277,6 +282,7 @@ enum print_arg_type {
 	PRINT_DYNAMIC_ARRAY,
 	PRINT_OP,
 	PRINT_FUNC,
+	PRINT_BITMASK,
 };
 
 struct print_arg {
@@ -291,6 +297,7 @@ struct print_arg {
 		struct print_arg_hex		hex;
 		struct print_arg_func		func;
 		struct print_arg_string		string;
+		struct print_arg_bitmask	bitmask;
 		struct print_arg_op		op;
 		struct print_arg_dynarray	dynarray;
 	};

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH v3] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks
  2014-05-14 18:25               ` [RFC][PATCH v3] " Steven Rostedt
@ 2014-05-14 19:42                 ` Javi Merino
  2014-05-14 20:00                   ` Steven Rostedt
  2014-05-15 11:34                   ` Steven Rostedt
  0 siblings, 2 replies; 48+ messages in thread
From: Javi Merino @ 2014-05-14 19:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Mathieu Desnoyers, Andrew Morton, Ingo Molnar,
	Namhyung Kim, Jiri Olsa

On Wed, May 14, 2014 at 07:25:00PM +0100, Steven Rostedt wrote:
> Updated version.
> 
> You can test it with the attached trace-cmd patch. I tested it against
> big and little endian machines (each reading their own and the other
> machine's trace.dat file). I did not test against 32bit, but it should
> still work. I will test it before I commit it to traceevent library.
> 
> Can you review this and try it out?

I've tested it in a 32-bit system.  It generates the data and it's
able to read its own trace.dat fine.  However, If I read the
trace.dat on a 64-bit machine I get a random character at the
end of the mask:

$ ./trace-cmd report /tmp/trace.dat | grep thermal_power_limit
trace-cmd: No such file or directory
  bad op token {
  unknown op '{'
  unknown op '{'
  function scsi_trace_parse_cdb not defined
  function scsi_trace_parse_cdb not defined
  function scsi_trace_parse_cdb not defined
  function scsi_trace_parse_cdb not defined
  bad op token {
  bad op token {
  bad op token {
  bad op token {
  Error: expected type 4 but read 0
  bad op token {
  bad op token {
  Error: expected type 4 but read 0
  function jiffies_to_msecs not defined
  function jiffies_to_msecs not defined
  bad op token {
  bad op token {
 test_thermal.sh-2886  [000]    30.035000: thermal_power_limit: cpus=000000f0` freq=1400000 cdev_state=0 power=196
 test_thermal.sh-2886  [000]    30.035000: thermal_power_limit: cpus=0000000fP freq=1900000 cdev_state=0 power=5252
     kworker/0:1-195   [000]    34.265000: thermal_power_limit: cpus=000000f0� freq=1400000 cdev_state=0 power=196
     kworker/0:1-195   [000]    34.265000: thermal_power_limit: cpus=0000000f� freq=1900000 cdev_state=0 power=5252
 test_thermal.sh-2886  [000]    38.375000: thermal_power_limit: cpus=000000f0� freq=1400000 cdev_state=0 power=196
 test_thermal.sh-2886  [000]    38.375000: thermal_power_limit: cpus=0000000f@ freq=1900000 cdev_state=0 power=5252
     kworker/0:1-195   [000]    38.380000: thermal_power_limit: cpus=000000f0freq=500000 cdev_state=9 power=0
     kworker/0:1-195   [000]    38.380000: thermal_power_limit: cpus=0000000f0 freq=1900000 cdev_state=0 power=1745
     kworker/0:1-195   [000]    48.415000: thermal_power_limit: cpus=000000f0� freq=500000 cdev_state=9 power=0
     kworker/0:1-195   [000]    48.415000: thermal_power_limit: cpus=0000000f� freq=1900000 cdev_state=0 power=1740k

It looks like an off-by-one error.  The system that generated the
trace.dat is running a kernel configured with NR_CPUS=5, can that be
the culprit?

FWIW, this is the event I'm using:

TRACE_EVENT(thermal_power_limit,
	TP_PROTO(const struct cpumask *cpus, unsigned int freq, unsigned long cdev_state,
		unsigned long power),

	TP_ARGS(cpus, freq, cdev_state, power),

	TP_STRUCT__entry(
		__bitmask(cpumask, num_possible_cpus())
		__field(unsigned int,  freq      )
		__field(unsigned long, cdev_state)
		__field(unsigned long, power     )
	),

	TP_fast_assign(
		__assign_bitmask(cpumask, cpumask_bits(cpus), num_possible_cpus());
		__entry->freq = freq;
		__entry->cdev_state = cdev_state;
		__entry->power = power;
	),

	TP_printk("cpus=%s freq=%u cdev_state=%lu power=%lu",
		__get_bitmask(cpumask), __entry->freq, __entry->cdev_state,
		__entry->power)
);

Thanks!
Javi

> From be2f29847428f5a290496d1c44ad938f2c2ad040 Mon Sep 17 00:00:00 2001
> From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>
> Date: Tue, 6 May 2014 13:10:24 -0400
> Subject: [PATCH] tracing: Add __bitmask() macro to trace events to cpumasks
>  and other bitmasks
> 
> Being able to show a cpumask of events can be useful as some events
> may affect only some CPUs. There is no standard way to record the
> cpumask and converting it to a string is rather expensive during
> the trace as traces happen in hotpaths. It would be better to record
> the raw event mask and be able to parse it at print time.
> 
> The following macros were added for use with the TRACE_EVENT() macro:
> 
>   __bitmask()
>   __assign_bitmask()
>   __get_bitmask()
> 
> To test this, I added this to the sched_migrate_task event, which
> looked like this:
> 
> TRACE_EVENT(sched_migrate_task,
> 
> 	TP_PROTO(struct task_struct *p, int dest_cpu, const struct cpumask *cpus),
> 
> 	TP_ARGS(p, dest_cpu, cpus),
> 
> 	TP_STRUCT__entry(
> 		__array(	char,	comm,	TASK_COMM_LEN	)
> 		__field(	pid_t,	pid			)
> 		__field(	int,	prio			)
> 		__field(	int,	orig_cpu		)
> 		__field(	int,	dest_cpu		)
> 		__bitmask(	cpumask, num_possible_cpus()	)
> 	),
> 
> 	TP_fast_assign(
> 		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
> 		__entry->pid		= p->pid;
> 		__entry->prio		= p->prio;
> 		__entry->orig_cpu	= task_cpu(p);
> 		__entry->dest_cpu	= dest_cpu;
> 		__assign_bitmask(cpumask, cpumask_bits(cpus), num_possible_cpus());
> 	),
> 
> 	TP_printk("comm=%s pid=%d prio=%d orig_cpu=%d dest_cpu=%d cpumask=%s",
> 		  __entry->comm, __entry->pid, __entry->prio,
> 		  __entry->orig_cpu, __entry->dest_cpu,
> 		  __get_bitmask(cpumask))
> );
> 
> With the output of:
> 
>         ksmtuned-3613  [003] d..2   485.220508: sched_migrate_task: comm=ksmtuned pid=3615 prio=120 orig_cpu=3 dest_cpu=2 cpumask=00000000,0000000f
>      migration/1-13    [001] d..5   485.221202: sched_migrate_task: comm=ksmtuned pid=3614 prio=120 orig_cpu=1 dest_cpu=0 cpumask=00000000,0000000f
>              awk-3615  [002] d.H5   485.221747: sched_migrate_task: comm=rcu_preempt pid=7 prio=120 orig_cpu=0 dest_cpu=1 cpumask=00000000,000000ff
>      migration/2-18    [002] d..5   485.222062: sched_migrate_task: comm=ksmtuned pid=3615 prio=120 orig_cpu=2 dest_cpu=3 cpumask=00000000,0000000f
> 
> Link: http://lkml.kernel.org/r/1399377998-14870-6-git-send-email-javi.merino@arm.com
> 
> Suggested-by: Javi Merino <javi.merino@arm.com>
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
>  include/linux/ftrace_event.h |  3 +++
>  include/linux/trace_seq.h    | 10 ++++++++
>  include/trace/ftrace.h       | 57 +++++++++++++++++++++++++++++++++++++++++++-
>  kernel/trace/trace_output.c  | 41 +++++++++++++++++++++++++++++++
>  4 files changed, 110 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
> index d16da3e..cff3106 100644
> --- a/include/linux/ftrace_event.h
> +++ b/include/linux/ftrace_event.h
> @@ -38,6 +38,9 @@ const char *ftrace_print_symbols_seq_u64(struct trace_seq *p,
>  								 *symbol_array);
>  #endif
>  
> +const char *ftrace_print_bitmask_seq(struct trace_seq *p, void *bitmask_ptr,
> +				     unsigned int bitmask_size);
> +
>  const char *ftrace_print_hex_seq(struct trace_seq *p,
>  				 const unsigned char *buf, int len);
>  
> diff --git a/include/linux/trace_seq.h b/include/linux/trace_seq.h
> index a32d86e..1361169 100644
> --- a/include/linux/trace_seq.h
> +++ b/include/linux/trace_seq.h
> @@ -46,6 +46,9 @@ extern int trace_seq_putmem_hex(struct trace_seq *s, const void *mem,
>  extern void *trace_seq_reserve(struct trace_seq *s, size_t len);
>  extern int trace_seq_path(struct trace_seq *s, const struct path *path);
>  
> +extern int trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp,
> +			     int nmaskbits);
> +
>  #else /* CONFIG_TRACING */
>  static inline int trace_seq_printf(struct trace_seq *s, const char *fmt, ...)
>  {
> @@ -57,6 +60,13 @@ trace_seq_bprintf(struct trace_seq *s, const char *fmt, const u32 *binary)
>  	return 0;
>  }
>  
> +static inline int
> +trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp,
> +		  int nmaskbits)
> +{
> +	return 0;
> +}
> +
>  static inline int trace_print_seq(struct seq_file *m, struct trace_seq *s)
>  {
>  	return 0;
> diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
> index 0a1a4f7..9b7a989 100644
> --- a/include/trace/ftrace.h
> +++ b/include/trace/ftrace.h
> @@ -53,6 +53,9 @@
>  #undef __string
>  #define __string(item, src) __dynamic_array(char, item, -1)
>  
> +#undef __bitmask
> +#define __bitmask(item, nr_bits) __dynamic_array(char, item, -1)
> +
>  #undef TP_STRUCT__entry
>  #define TP_STRUCT__entry(args...) args
>  
> @@ -128,6 +131,9 @@
>  #undef __string
>  #define __string(item, src) __dynamic_array(char, item, -1)
>  
> +#undef __bitmask
> +#define __bitmask(item, nr_bits) __dynamic_array(unsigned long, item, -1)
> +
>  #undef DECLARE_EVENT_CLASS
>  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
>  	struct ftrace_data_offsets_##call {				\
> @@ -200,6 +206,15 @@
>  #undef __get_str
>  #define __get_str(field) (char *)__get_dynamic_array(field)
>  
> +#undef __get_bitmask
> +#define __get_bitmask(field)						\
> +	({								\
> +		void *__bitmask = __get_dynamic_array(field);		\
> +		unsigned int __bitmask_size;				\
> +		__bitmask_size = (__entry->__data_loc_##field >> 16) & 0xffff; \
> +		ftrace_print_bitmask_seq(p, __bitmask, __bitmask_size);	\
> +	})
> +
>  #undef __print_flags
>  #define __print_flags(flag, delim, flag_array...)			\
>  	({								\
> @@ -322,6 +337,9 @@ static struct trace_event_functions ftrace_event_type_funcs_##call = {	\
>  #undef __string
>  #define __string(item, src) __dynamic_array(char, item, -1)
>  
> +#undef __bitmask
> +#define __bitmask(item, nr_bits) __dynamic_array(unsigned long, item, -1)
> +
>  #undef DECLARE_EVENT_CLASS
>  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, func, print)	\
>  static int notrace __init						\
> @@ -372,6 +390,29 @@ ftrace_define_fields_##call(struct ftrace_event_call *event_call)	\
>  #define __string(item, src) __dynamic_array(char, item,			\
>  		    strlen((src) ? (const char *)(src) : "(null)") + 1)
>  
> +/*
> + * __bitmask_size_in_bytes_raw is the number of bytes needed to hold
> + * num_possible_cpus().
> + */
> +#define __bitmask_size_in_bytes_raw(nr_bits)	\
> +	(((nr_bits) + 7) / 8)
> +
> +#define __bitmask_size_in_longs(nr_bits)			\
> +	((__bitmask_size_in_bytes_raw(nr_bits) +		\
> +	  ((BITS_PER_LONG / 8) - 1)) / (BITS_PER_LONG / 8))
> +
> +/*
> + * __bitmask_size_in_bytes is the number of bytes needed to hold
> + * num_possible_cpus() padded out to the nearest long. This is what
> + * is saved in the buffer, just to be consistent.
> + */
> +#define __bitmask_size_in_bytes(nr_bits)				\
> +	(__bitmask_size_in_longs(nr_bits) * (BITS_PER_LONG / 8))
> +
> +#undef __bitmask
> +#define __bitmask(item, nr_bits) __dynamic_array(unsigned long, item,	\
> +					 __bitmask_size_in_longs(nr_bits))
> +
>  #undef DECLARE_EVENT_CLASS
>  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
>  static inline notrace int ftrace_get_offsets_##call(			\
> @@ -513,12 +554,22 @@ static inline notrace int ftrace_get_offsets_##call(			\
>  	__entry->__data_loc_##item = __data_offsets.item;
>  
>  #undef __string
> -#define __string(item, src) __dynamic_array(char, item, -1)       	\
> +#define __string(item, src) __dynamic_array(char, item, -1)
>  
>  #undef __assign_str
>  #define __assign_str(dst, src)						\
>  	strcpy(__get_str(dst), (src) ? (const char *)(src) : "(null)");
>  
> +#undef __bitmask
> +#define __bitmask(item, nr_bits) __dynamic_array(unsigned long, item, -1)
> +
> +#undef __get_bitmask
> +#define __get_bitmask(field) (char *)__get_dynamic_array(field)
> +
> +#undef __assign_bitmask
> +#define __assign_bitmask(dst, src, nr_bits)					\
> +	memcpy(__get_bitmask(dst), (src), __bitmask_size_in_bytes(nr_bits))
> +
>  #undef TP_fast_assign
>  #define TP_fast_assign(args...) args
>  
> @@ -586,6 +637,7 @@ static inline void ftrace_test_probe_##call(void)			\
>  #undef __print_hex
>  #undef __get_dynamic_array
>  #undef __get_str
> +#undef __get_bitmask
>  
>  #undef TP_printk
>  #define TP_printk(fmt, args...) "\"" fmt "\", "  __stringify(args)
> @@ -651,6 +703,9 @@ __attribute__((section("_ftrace_events"))) *__event_##call = &event_##call
>  #undef __get_str
>  #define __get_str(field) (char *)__get_dynamic_array(field)
>  
> +#undef __get_bitmask
> +#define __get_bitmask(field) (char *)__get_dynamic_array(field)
> +
>  #undef __perf_addr
>  #define __perf_addr(a)	(__addr = (a))
>  
> diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
> index a436de1..f3dad80 100644
> --- a/kernel/trace/trace_output.c
> +++ b/kernel/trace/trace_output.c
> @@ -126,6 +126,34 @@ trace_seq_printf(struct trace_seq *s, const char *fmt, ...)
>  EXPORT_SYMBOL_GPL(trace_seq_printf);
>  
>  /**
> + * trace_seq_bitmask - put a list of longs as a bitmask print output
> + * @s:		trace sequence descriptor
> + * @maskp:	points to an array of unsigned longs that represent a bitmask
> + * @nmaskbits:	The number of bits that are valid in @maskp
> + *
> + * It returns 0 if the trace oversizes the buffer's free
> + * space, 1 otherwise.
> + *
> + * Writes a ASCII representation of a bitmask string into @s.
> + */
> +int
> +trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp,
> +		  int nmaskbits)
> +{
> +	int len = (PAGE_SIZE - 1) - s->len;
> +	int ret;
> +
> +	if (s->full || !len)
> +		return 0;
> +
> +	ret = bitmap_scnprintf(s->buffer, len, maskp, nmaskbits);
> +	s->len += ret;
> +
> +	return 1;
> +}
> +EXPORT_SYMBOL_GPL(trace_seq_bitmask);
> +
> +/**
>   * trace_seq_vprintf - sequence printing of trace information
>   * @s: trace sequence descriptor
>   * @fmt: printf format string
> @@ -399,6 +427,19 @@ EXPORT_SYMBOL(ftrace_print_symbols_seq_u64);
>  #endif
>  
>  const char *
> +ftrace_print_bitmask_seq(struct trace_seq *p, void *bitmask_ptr,
> +			 unsigned int bitmask_size)
> +{
> +	const char *ret = p->buffer + p->len;
> +
> +	trace_seq_bitmask(p, bitmask_ptr, bitmask_size * 8);
> +	trace_seq_putc(p, 0);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(ftrace_print_bitmask_seq);
> +
> +const char *
>  ftrace_print_hex_seq(struct trace_seq *p, const unsigned char *buf, int buf_len)
>  {
>  	int i;
> -- 
> 1.8.1.4
> 

> diff --git a/event-parse.c b/event-parse.c
> index 114b41b..ca0d1fe 100644
> --- a/event-parse.c
> +++ b/event-parse.c
> @@ -753,6 +753,9 @@ static void free_arg(struct print_arg *arg)
>  	case PRINT_BSTRING:
>  		free(arg->string.string);
>  		break;
> +	case PRINT_BITMASK:
> +		free(arg->bitmask.bitmask);
> +		break;
>  	case PRINT_DYNAMIC_ARRAY:
>  		free(arg->dynarray.index);
>  		break;
> @@ -2256,6 +2259,7 @@ static int arg_num_eval(struct print_arg *arg, long long *val)
>  	case PRINT_FIELD ... PRINT_SYMBOL:
>  	case PRINT_STRING:
>  	case PRINT_BSTRING:
> +	case PRINT_BITMASK:
>  	default:
>  		do_warning("invalid eval type %d", arg->type);
>  		ret = 0;
> @@ -2284,6 +2288,7 @@ static char *arg_eval (struct print_arg *arg)
>  	case PRINT_FIELD ... PRINT_SYMBOL:
>  	case PRINT_STRING:
>  	case PRINT_BSTRING:
> +	case PRINT_BITMASK:
>  	default:
>  		do_warning("invalid eval type %d", arg->type);
>  		break;
> @@ -2670,6 +2675,35 @@ process_str(struct event_format *event __maybe_unused, struct print_arg *arg,
>  	return EVENT_ERROR;
>  }
>  
> +static enum event_type
> +process_bitmask(struct event_format *event __maybe_unused, struct print_arg *arg,
> +	    char **tok)
> +{
> +	enum event_type type;
> +	char *token;
> +
> +	if (read_expect_type(EVENT_ITEM, &token) < 0)
> +		goto out_free;
> +
> +	arg->type = PRINT_BITMASK;
> +	arg->bitmask.bitmask = token;
> +	arg->bitmask.offset = -1;
> +
> +	if (read_expected(EVENT_DELIM, ")") < 0)
> +		goto out_err;
> +
> +	type = read_token(&token);
> +	*tok = token;
> +
> +	return type;
> +
> + out_free:
> +	free_token(token);
> + out_err:
> +	*tok = NULL;
> +	return EVENT_ERROR;
> +}
> +
>  static struct pevent_function_handler *
>  find_func_handler(struct pevent *pevent, char *func_name)
>  {
> @@ -2781,6 +2815,10 @@ process_function(struct event_format *event, struct print_arg *arg,
>  		free_token(token);
>  		return process_str(event, arg, tok);
>  	}
> +	if (strcmp(token, "__get_bitmask") == 0) {
> +		free_token(token);
> +		return process_bitmask(event, arg, tok);
> +	}
>  	if (strcmp(token, "__get_dynamic_array") == 0) {
>  		free_token(token);
>  		return process_dynamic_array(event, arg, tok);
> @@ -3307,6 +3345,7 @@ eval_num_arg(void *data, int size, struct event_format *event, struct print_arg
>  		return eval_type(val, arg, 0);
>  	case PRINT_STRING:
>  	case PRINT_BSTRING:
> +	case PRINT_BITMASK:
>  		return 0;
>  	case PRINT_FUNC: {
>  		struct trace_seq s;
> @@ -3538,6 +3577,60 @@ static void print_str_to_seq(struct trace_seq *s, const char *format,
>  		trace_seq_printf(s, format, str);
>  }
>  
> +static void print_bitmask_to_seq(struct pevent *pevent,
> +				 struct trace_seq *s, const char *format,
> +				 int len_arg, const void *data, int size)
> +{
> +	int nr_bits = size * 8;
> +	int str_size = (nr_bits + 3) / 4;
> +	int len = 0;
> +	char buf[3];
> +	char *str;
> +	int index;
> +	int i;
> +
> +	/*
> +	 * The kernel likes to put in commas every 32 bits, we
> +	 * can do the same.
> +	 */
> +	str_size += nr_bits / 32;
> +
> +	str = malloc(str_size + 1);
> +	if (!str) {
> +		do_warning("%s: not enough memory!", __func__);
> +		return;
> +	}
> +	str[str_size] = 0;
> +
> +	/* Start out with -3, 1 for the \0 and two for the byte */
> +	for (i = str_size - 3; i >= 0; i -= 2) {
> +		/*
> +		 * data points to a bit mask of size bytes.
> +		 * In the kernel, this is an array of long words, thus
> +		 * endianess is very important.
> +		 */
> +		if (pevent->file_bigendian)
> +			index = size - (len + 1);
> +		else
> +			index = len;
> +
> +		snprintf(buf, 3, "%02x", *((unsigned char *)data + index));
> +		memcpy(str + i, buf, 2);
> +		len++;
> +		if (!(len & 3) && i > 0) {
> +			i--;
> +			str[i] = ',';
> +		}
> +	}			
> +
> +	if (len_arg >= 0)
> +		trace_seq_printf(s, format, len_arg, str);
> +	else
> +		trace_seq_printf(s, format, str);
> +
> +	free(str);
> +}
> +
>  static void print_str_arg(struct trace_seq *s, void *data, int size,
>  			  struct event_format *event, const char *format,
>  			  int len_arg, struct print_arg *arg)
> @@ -3672,6 +3765,23 @@ static void print_str_arg(struct trace_seq *s, void *data, int size,
>  	case PRINT_BSTRING:
>  		print_str_to_seq(s, format, len_arg, arg->string.string);
>  		break;
> +	case PRINT_BITMASK: {
> +		int bitmask_offset;
> +		int bitmask_size;
> +
> +		if (arg->bitmask.offset == -1) {
> +			struct format_field *f;
> +
> +			f = pevent_find_any_field(event, arg->bitmask.bitmask);
> +			arg->bitmask.offset = f->offset;
> +		}
> +		bitmask_offset = data2host4(pevent, data + arg->string.offset);
> +		bitmask_size = bitmask_offset >> 16;
> +		bitmask_offset &= 0xffff;
> +		print_bitmask_to_seq(pevent, s, format, len_arg,
> +				     data + bitmask_offset, bitmask_size);
> +		break;
> +	}
>  	case PRINT_OP:
>  		/*
>  		 * The only op for string should be ? :
> @@ -4889,6 +4999,9 @@ static void print_args(struct print_arg *args)
>  	case PRINT_BSTRING:
>  		printf("__get_str(%s)", args->string.string);
>  		break;
> +	case PRINT_BITMASK:
> +		printf("__get_bitmask(%s)", args->bitmask.bitmask);
> +		break;
>  	case PRINT_TYPE:
>  		printf("(%s)", args->typecast.type);
>  		print_args(args->typecast.item);
> diff --git a/event-parse.h b/event-parse.h
> index 108a4ef..babd4a4 100644
> --- a/event-parse.h
> +++ b/event-parse.h
> @@ -211,6 +211,11 @@ struct print_arg_string {
>  	int			offset;
>  };
>  
> +struct print_arg_bitmask {
> +	char			*bitmask;
> +	int			offset;
> +};
> +
>  struct print_arg_field {
>  	char			*name;
>  	struct format_field	*field;
> @@ -277,6 +282,7 @@ enum print_arg_type {
>  	PRINT_DYNAMIC_ARRAY,
>  	PRINT_OP,
>  	PRINT_FUNC,
> +	PRINT_BITMASK,
>  };
>  
>  struct print_arg {
> @@ -291,6 +297,7 @@ struct print_arg {
>  		struct print_arg_hex		hex;
>  		struct print_arg_func		func;
>  		struct print_arg_string		string;
> +		struct print_arg_bitmask	bitmask;
>  		struct print_arg_op		op;
>  		struct print_arg_dynarray	dynarray;
>  	};


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH v3] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks
  2014-05-14 19:42                 ` Javi Merino
@ 2014-05-14 20:00                   ` Steven Rostedt
  2014-05-15 11:34                   ` Steven Rostedt
  1 sibling, 0 replies; 48+ messages in thread
From: Steven Rostedt @ 2014-05-14 20:00 UTC (permalink / raw)
  To: Javi Merino
  Cc: LKML, Mathieu Desnoyers, Andrew Morton, Ingo Molnar,
	Namhyung Kim, Jiri Olsa

On Wed, 14 May 2014 20:42:41 +0100
Javi Merino <javi.merino@arm.com> wrote:

> On Wed, May 14, 2014 at 07:25:00PM +0100, Steven Rostedt wrote:
> > Updated version.
> > 
> > You can test it with the attached trace-cmd patch. I tested it against
> > big and little endian machines (each reading their own and the other
> > machine's trace.dat file). I did not test against 32bit, but it should
> > still work. I will test it before I commit it to traceevent library.
> > 
> > Can you review this and try it out?
> 
> I've tested it in a 32-bit system.  It generates the data and it's
> able to read its own trace.dat fine.  However, If I read the
> trace.dat on a 64-bit machine I get a random character at the
> end of the mask:
> 

Can you send me your trace.dat file privately. Also, please compress it
with either gzip, bzip or xz.

Thanks!

-- Steve


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH v3] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks
  2014-05-14 19:42                 ` Javi Merino
  2014-05-14 20:00                   ` Steven Rostedt
@ 2014-05-15 11:34                   ` Steven Rostedt
  2014-05-15 11:36                     ` Steven Rostedt
  1 sibling, 1 reply; 48+ messages in thread
From: Steven Rostedt @ 2014-05-15 11:34 UTC (permalink / raw)
  To: Javi Merino
  Cc: LKML, Mathieu Desnoyers, Andrew Morton, Ingo Molnar,
	Namhyung Kim, Jiri Olsa

[-- Attachment #1: Type: text/plain, Size: 688 bytes --]

On Wed, 14 May 2014 20:42:41 +0100
Javi Merino <javi.merino@arm.com> wrote:


> It looks like an off-by-one error.  The system that generated the
> trace.dat is running a kernel configured with NR_CPUS=5, can that be
> the culprit?

You were correct! Actually, I had two off by one errors :-/

One was with calculating the number of added commas

-	str_size += nr_bits / 32;
+	str_size += (nr_bits - 1) / 32

No comma if the we have exactly 32 bits (only one if we have 64).

The other was with the start:

-	for (i = str_size - 3; i >= 0; i -= 2) {
+	for (i = str_size - 2; i >= 0; i -= 2) {

I included the '\0' when I didn't have to :-p

Should work now. Thanks for testing.

-- Steve

[-- Attachment #2: print-cpumask.patch --]
[-- Type: text/x-patch, Size: 5325 bytes --]

---
 event-parse.c |  113 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 event-parse.h |    7 +++
 2 files changed, 120 insertions(+)

Index: trace-cmd.git/event-parse.c
===================================================================
--- trace-cmd.git.orig/event-parse.c	2014-05-15 07:29:18.552527943 -0400
+++ trace-cmd.git/event-parse.c	2014-05-15 07:29:54.483122795 -0400
@@ -753,6 +753,9 @@
 	case PRINT_BSTRING:
 		free(arg->string.string);
 		break;
+	case PRINT_BITMASK:
+		free(arg->bitmask.bitmask);
+		break;
 	case PRINT_DYNAMIC_ARRAY:
 		free(arg->dynarray.index);
 		break;
@@ -2256,6 +2259,7 @@
 	case PRINT_FIELD ... PRINT_SYMBOL:
 	case PRINT_STRING:
 	case PRINT_BSTRING:
+	case PRINT_BITMASK:
 	default:
 		do_warning("invalid eval type %d", arg->type);
 		ret = 0;
@@ -2284,6 +2288,7 @@
 	case PRINT_FIELD ... PRINT_SYMBOL:
 	case PRINT_STRING:
 	case PRINT_BSTRING:
+	case PRINT_BITMASK:
 	default:
 		do_warning("invalid eval type %d", arg->type);
 		break;
@@ -2670,6 +2675,35 @@
 	return EVENT_ERROR;
 }
 
+static enum event_type
+process_bitmask(struct event_format *event __maybe_unused, struct print_arg *arg,
+	    char **tok)
+{
+	enum event_type type;
+	char *token;
+
+	if (read_expect_type(EVENT_ITEM, &token) < 0)
+		goto out_free;
+
+	arg->type = PRINT_BITMASK;
+	arg->bitmask.bitmask = token;
+	arg->bitmask.offset = -1;
+
+	if (read_expected(EVENT_DELIM, ")") < 0)
+		goto out_err;
+
+	type = read_token(&token);
+	*tok = token;
+
+	return type;
+
+ out_free:
+	free_token(token);
+ out_err:
+	*tok = NULL;
+	return EVENT_ERROR;
+}
+
 static struct pevent_function_handler *
 find_func_handler(struct pevent *pevent, char *func_name)
 {
@@ -2781,6 +2815,10 @@
 		free_token(token);
 		return process_str(event, arg, tok);
 	}
+	if (strcmp(token, "__get_bitmask") == 0) {
+		free_token(token);
+		return process_bitmask(event, arg, tok);
+	}
 	if (strcmp(token, "__get_dynamic_array") == 0) {
 		free_token(token);
 		return process_dynamic_array(event, arg, tok);
@@ -3307,6 +3345,7 @@
 		return eval_type(val, arg, 0);
 	case PRINT_STRING:
 	case PRINT_BSTRING:
+	case PRINT_BITMASK:
 		return 0;
 	case PRINT_FUNC: {
 		struct trace_seq s;
@@ -3538,6 +3577,60 @@
 		trace_seq_printf(s, format, str);
 }
 
+static void print_bitmask_to_seq(struct pevent *pevent,
+				 struct trace_seq *s, const char *format,
+				 int len_arg, const void *data, int size)
+{
+	int nr_bits = size * 8;
+	int str_size = (nr_bits + 3) / 4;
+	int len = 0;
+	char buf[3];
+	char *str;
+	int index;
+	int i;
+
+	/*
+	 * The kernel likes to put in commas every 32 bits, we
+	 * can do the same.
+	 */
+	str_size += (nr_bits - 1) / 32;
+
+	str = malloc(str_size + 1);
+	if (!str) {
+		do_warning("%s: not enough memory!", __func__);
+		return;
+	}
+	str[str_size] = 0;
+
+	/* Start out with -2 for the two chars per byte */
+	for (i = str_size - 2; i >= 0; i -= 2) {
+		/*
+		 * data points to a bit mask of size bytes.
+		 * In the kernel, this is an array of long words, thus
+		 * endianess is very important.
+		 */
+		if (pevent->file_bigendian)
+			index = size - (len + 1);
+		else
+			index = len;
+
+		snprintf(buf, 3, "%02x", *((unsigned char *)data + index));
+		memcpy(str + i, buf, 2);
+		len++;
+		if (!(len & 3) && i > 0) {
+			i--;
+			str[i] = ',';
+		}
+	}
+
+	if (len_arg >= 0)
+		trace_seq_printf(s, format, len_arg, str);
+	else
+		trace_seq_printf(s, format, str);
+
+	free(str);
+}
+
 static void print_str_arg(struct trace_seq *s, void *data, int size,
 			  struct event_format *event, const char *format,
 			  int len_arg, struct print_arg *arg)
@@ -3672,6 +3765,23 @@
 	case PRINT_BSTRING:
 		print_str_to_seq(s, format, len_arg, arg->string.string);
 		break;
+	case PRINT_BITMASK: {
+		int bitmask_offset;
+		int bitmask_size;
+
+		if (arg->bitmask.offset == -1) {
+			struct format_field *f;
+
+			f = pevent_find_any_field(event, arg->bitmask.bitmask);
+			arg->bitmask.offset = f->offset;
+		}
+		bitmask_offset = data2host4(pevent, data + arg->string.offset);
+		bitmask_size = bitmask_offset >> 16;
+		bitmask_offset &= 0xffff;
+		print_bitmask_to_seq(pevent, s, format, len_arg,
+				     data + bitmask_offset, bitmask_size);
+		break;
+	}
 	case PRINT_OP:
 		/*
 		 * The only op for string should be ? :
@@ -4889,6 +4999,9 @@
 	case PRINT_BSTRING:
 		printf("__get_str(%s)", args->string.string);
 		break;
+	case PRINT_BITMASK:
+		printf("__get_bitmask(%s)", args->bitmask.bitmask);
+		break;
 	case PRINT_TYPE:
 		printf("(%s)", args->typecast.type);
 		print_args(args->typecast.item);
Index: trace-cmd.git/event-parse.h
===================================================================
--- trace-cmd.git.orig/event-parse.h	2014-05-15 07:29:18.552527943 -0400
+++ trace-cmd.git/event-parse.h	2014-05-15 07:29:19.772514187 -0400
@@ -211,6 +211,11 @@
 	int			offset;
 };
 
+struct print_arg_bitmask {
+	char			*bitmask;
+	int			offset;
+};
+
 struct print_arg_field {
 	char			*name;
 	struct format_field	*field;
@@ -277,6 +282,7 @@
 	PRINT_DYNAMIC_ARRAY,
 	PRINT_OP,
 	PRINT_FUNC,
+	PRINT_BITMASK,
 };
 
 struct print_arg {
@@ -291,6 +297,7 @@
 		struct print_arg_hex		hex;
 		struct print_arg_func		func;
 		struct print_arg_string		string;
+		struct print_arg_bitmask	bitmask;
 		struct print_arg_op		op;
 		struct print_arg_dynarray	dynarray;
 	};

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH v3] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks
  2014-05-15 11:34                   ` Steven Rostedt
@ 2014-05-15 11:36                     ` Steven Rostedt
  2014-05-15 15:20                       ` Javi Merino
  0 siblings, 1 reply; 48+ messages in thread
From: Steven Rostedt @ 2014-05-15 11:36 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Javi Merino, LKML, Mathieu Desnoyers, Andrew Morton, Ingo Molnar,
	Namhyung Kim, Jiri Olsa

On Thu, 15 May 2014 07:34:05 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:


> Should work now. Thanks for testing.

If all goes well, can you please give me your "Tested-by" for both
patches. I've already ran my updated bitmask kernel patch through my
tests (although, there's still no event that uses it).

I'll push it to my for-next repo. I'll also work on updating
libtraceevent with the trace-cmd patch such that perf gets the benefit
of this new macro as well.

-- Steve

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 1/5] thermal: let governors have private data for each thermal zone
  2014-05-13 14:31   ` edubezval
@ 2014-05-15 14:30     ` Javi Merino
  2014-05-15 15:10       ` Eduardo Valentin
  0 siblings, 1 reply; 48+ messages in thread
From: Javi Merino @ 2014-05-15 14:30 UTC (permalink / raw)
  To: edubezval; +Cc: linux-pm, Punit Agrawal, Zhang Rui

On Tue, May 13, 2014 at 03:31:08PM +0100, edubezval@gmail.com wrote:
> Hello Javi,
> 
> Thanks for your patch. Find some comment inline.
> 
> On Tue, May 06, 2014 at 01:06:34PM +0100, Javi Merino wrote:
> > A governor may need to store its current state between calls to
> > throttle().  That state depends on the thermal zone, so store it as
> > private data in struct thermal_zone_device.
> >
> > The governors may have two new ops: bind_to_tz() and unbind_from_tz().
> > When provided, these functions let governors do some initialization
> > and teardown when they are bound/unbound to a tz and possibly store that
> > information in the governor_data field of the struct
> > thermal_zone_device.
> 
> 
> Overall the idea is fine to me, although so far we have no case of such
> need in the current governors. I will have a look on your proposed
> governor to check futher on this API.
> 
> >
> > Cc: Zhang Rui <rui.zhang@intel.com>
> > Cc: Eduardo Valentin <edubezval@gmail.com>
> > Signed-off-by: Javi Merino <javi.merino@arm.com>
> > ---
> >  drivers/thermal/thermal_core.c |   67 +++++++++++++++++++++++++++++++++++-----
> >  include/linux/thermal.h        |    3 ++
> >  2 files changed, 62 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> > index 71b0ec0c370d..ac8e5990a63e 100644
> > --- a/drivers/thermal/thermal_core.c
> > +++ b/drivers/thermal/thermal_core.c
> > @@ -72,6 +72,45 @@ static struct thermal_governor *__find_governor(const char *name)
> >   return NULL;
> >  }
> >
> > +/**
> > + * thermal_set_governor() - Switch to another governor
> > + * @tz: a valid pointer to a struct thermal_zone_device
> > + * @new_gov: pointer to the new governor
> > + *
> > + * Change the governor of thermal zone @tz.
> > + *
> > + * Returns 0 on success, an error if the new governor's bind_to_tz() failed.
> > + */
> > +static int thermal_set_governor(struct thermal_zone_device *tz,
> > + struct thermal_governor *new_gov)
> > +{
> > + int ret = 0;
> > +
> > + if (tz->governor && tz->governor->unbind_from_tz)
> > + tz->governor->unbind_from_tz(tz);
> > +
> > + if (new_gov && new_gov->bind_to_tz) {
> > + ret = new_gov->bind_to_tz(tz);
> > + if (ret) {
> > + /* Try to register the old governor again */
> 
> I suggest to state in the Documentation that a governor may be called
> to bind and unbind to same thermal zone several times, as per above code.

Will do, but I find that the other governor op (throttle) is not
documented.  Where do you suggest we should put it?

> > + if (tz->governor && tz->governor->bind_to_tz) {
> > + if (tz->governor->bind_to_tz(tz))
> > + /*
> > + * The new governor failed to register
> > + * and the previous one failed as well
> > + */
> 
> I suggest having a dev_warn here to notify that a zone is now orphan due to
> switching from one governor to another.

Done

> > + tz->governor = NULL;
> > + }
> > +
> > + return ret;
> > + }
> > + }
> > +
> > + tz->governor = new_gov;
> > +
> > + return ret;
> > +}
> > +
> >  int thermal_register_governor(struct thermal_governor *governor)
> >  {
> >   int err;
> > @@ -104,8 +143,12 @@ int thermal_register_governor(struct thermal_governor *governor)
> >
> >   name = pos->tzp->governor_name;
> >
> > - if (!strnicmp(name, governor->name, THERMAL_NAME_LENGTH))
> > - pos->governor = governor;
> > + if (!strnicmp(name, governor->name, THERMAL_NAME_LENGTH)) {
> > + int ret = thermal_set_governor(pos, governor);
> 
> nip: add a blank line between declarations and first statement.

Ok

> > + if (ret)
> > + pr_warn("Failed to set governor %s for zone %d: %d\n",
> > + governor->name, pos->id, ret);
> 
> dev_* family of logging functions is preferred in this file.

Done

> > + }
> >   }
> >
> >   mutex_unlock(&thermal_list_lock);
> > @@ -131,7 +174,7 @@ void thermal_unregister_governor(struct thermal_governor *governor)
> >   list_for_each_entry(pos, &thermal_tz_list, node) {
> >   if (!strnicmp(pos->governor->name, governor->name,
> >   THERMAL_NAME_LENGTH))
> > - pos->governor = NULL;
> > + thermal_set_governor(pos, NULL);
> >   }
> >
> >   mutex_unlock(&thermal_list_lock);
> > @@ -756,8 +799,9 @@ policy_store(struct device *dev, struct device_attribute *attr,
> >   if (!gov)
> >   goto exit;
> >
> > - tz->governor = gov;
> > - ret = count;
> > + ret = thermal_set_governor(tz, gov);
> > + if (!ret)
> > + ret = count;
> >
> >  exit:
> >   mutex_unlock(&thermal_governor_lock);
> > @@ -1452,6 +1496,7 @@ struct thermal_zone_device *thermal_zone_device_register(const char *type,
> >   int result;
> >   int count;
> >   int passive = 0;
> > + struct thermal_governor *governor;
> >
> >   if (type && strlen(type) >= THERMAL_NAME_LENGTH)
> >   return ERR_PTR(-EINVAL);
> > @@ -1542,9 +1587,15 @@ struct thermal_zone_device *thermal_zone_device_register(const char *type,
> >   mutex_lock(&thermal_governor_lock);
> >
> >   if (tz->tzp)
> > - tz->governor = __find_governor(tz->tzp->governor_name);
> > + governor = __find_governor(tz->tzp->governor_name);
> >   else
> > - tz->governor = def_governor;
> > + governor = def_governor;
> > +
> > + result = thermal_set_governor(tz, governor);
> > + if (result) {
> > + mutex_unlock(&thermal_governor_lock);
> > + goto unregister;
> > + }
> >
> >   mutex_unlock(&thermal_governor_lock);
> >
> > @@ -1634,7 +1685,7 @@ void thermal_zone_device_unregister(struct thermal_zone_device *tz)
> >   device_remove_file(&tz->device, &dev_attr_mode);
> >   device_remove_file(&tz->device, &dev_attr_policy);
> >   remove_trip_attrs(tz);
> > - tz->governor = NULL;
> > + thermal_set_governor(tz, NULL);
> >
> >   thermal_remove_hwmon_sysfs(tz);
> >   release_idr(&thermal_tz_idr, &thermal_idr_lock, tz->id);
> > diff --git a/include/linux/thermal.h b/include/linux/thermal.h
> > index f7e11c7ea7d9..baac212f815e 100644
> > --- a/include/linux/thermal.h
> > +++ b/include/linux/thermal.h
> > @@ -177,6 +177,7 @@ struct thermal_zone_device {
> >   struct thermal_zone_device_ops *ops;
> >   const struct thermal_zone_params *tzp;
> >   struct thermal_governor *governor;
> > + void *governor_data;
> 
> Please document this entry.

Where?  None of these entries are documented.  I guess I'll have to
add a patch that describes each one of them and then add ours.

> >   struct list_head thermal_instances;
> >   struct idr idr;
> >   struct mutex lock; /* protect thermal_instances list */
> > @@ -187,6 +188,8 @@ struct thermal_zone_device {
> >  /* Structure that holds thermal governor information */
> >  struct thermal_governor {
> >   char name[THERMAL_NAME_LENGTH];
> > + int (*bind_to_tz)(struct thermal_zone_device *tz);
> > + void (*unbind_from_tz)(struct thermal_zone_device *tz);
> 
> a governor cannot prevent unbinding?

To me it's like open()/close() or malloc()/free().  You generally can
fail to bind, but not fail to unbind.  I can change it to an int if
you think it is something that we may need.

> Please, document the proposed operations.

Same as before, none of the entries in this struct are documented so
I'll submit a patch that describes the current ones and then add our
new ones in our patch.

> >   int (*throttle)(struct thermal_zone_device *tz, int trip);
> >   struct list_head governor_list;
> >  };
> > --
> > 1.7.9.5
> 
> I request that you update Documentation/thermal/sysfs-api.txt too.

Will do.

Cheers,
Javi


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 1/5] thermal: let governors have private data for each thermal zone
  2014-05-15 14:30     ` Javi Merino
@ 2014-05-15 15:10       ` Eduardo Valentin
  0 siblings, 0 replies; 48+ messages in thread
From: Eduardo Valentin @ 2014-05-15 15:10 UTC (permalink / raw)
  To: Javi Merino; +Cc: linux-pm, Punit Agrawal, Zhang Rui

Javi,

On Thu, May 15, 2014 at 03:30:53PM +0100, Javi Merino wrote:
> On Tue, May 13, 2014 at 03:31:08PM +0100, edubezval@gmail.com wrote:
> > Hello Javi,
> > 
> > Thanks for your patch. Find some comment inline.
> > 

[cut]

> 
> Will do, but I find that the other governor op (throttle) is not
> documented.  Where do you suggest we should put it?
> 

The main entry for thermal Documentation is Documentation/thermal/. The
sysfs-api.txt file has major info about the thermal framework. But for
this case, I agree with you, we deserve a better explanation. I think
the fact that the governor API is becoming more interesting is a hint
that we need better documentation. I suggest having another file
describing the governor API and how it deals with other entities such
as sensor inputs and cooling actions.

Rui, any suggestion here?


> > > + if (tz->governor && tz->governor->bind_to_tz) {
> > > + if (tz->governor->bind_to_tz(tz))
> > > + /*
> > > + * The new governor failed to register
> > > + * and the previous one failed as well
> > > + */
> > 
> > I suggest having a dev_warn here to notify that a zone is now orphan due to
> > switching from one governor to another.
> 
> Done
> 
> > > + tz->governor = NULL;
> > > + }
> > > +
> > > + return ret;
> > > + }
> > > + }
> > > +
> > > + tz->governor = new_gov;
> > > +
> > > + return ret;
> > > +}
> > > +
> > >  int thermal_register_governor(struct thermal_governor *governor)
> > >  {
> > >   int err;
> > > @@ -104,8 +143,12 @@ int thermal_register_governor(struct thermal_governor *governor)
> > >
> > >   name = pos->tzp->governor_name;
> > >
> > > - if (!strnicmp(name, governor->name, THERMAL_NAME_LENGTH))
> > > - pos->governor = governor;
> > > + if (!strnicmp(name, governor->name, THERMAL_NAME_LENGTH)) {
> > > + int ret = thermal_set_governor(pos, governor);
> > 
> > nip: add a blank line between declarations and first statement.
> 
> Ok
> 
> > > + if (ret)
> > > + pr_warn("Failed to set governor %s for zone %d: %d\n",
> > > + governor->name, pos->id, ret);
> > 
> > dev_* family of logging functions is preferred in this file.
> 
> Done
> 
> > > + }
> > >   }
> > >
> > >   mutex_unlock(&thermal_list_lock);
> > > @@ -131,7 +174,7 @@ void thermal_unregister_governor(struct thermal_governor *governor)
> > >   list_for_each_entry(pos, &thermal_tz_list, node) {
> > >   if (!strnicmp(pos->governor->name, governor->name,
> > >   THERMAL_NAME_LENGTH))
> > > - pos->governor = NULL;
> > > + thermal_set_governor(pos, NULL);
> > >   }
> > >
> > >   mutex_unlock(&thermal_list_lock);
> > > @@ -756,8 +799,9 @@ policy_store(struct device *dev, struct device_attribute *attr,
> > >   if (!gov)
> > >   goto exit;
> > >
> > > - tz->governor = gov;
> > > - ret = count;
> > > + ret = thermal_set_governor(tz, gov);
> > > + if (!ret)
> > > + ret = count;
> > >
> > >  exit:
> > >   mutex_unlock(&thermal_governor_lock);
> > > @@ -1452,6 +1496,7 @@ struct thermal_zone_device *thermal_zone_device_register(const char *type,
> > >   int result;
> > >   int count;
> > >   int passive = 0;
> > > + struct thermal_governor *governor;
> > >
> > >   if (type && strlen(type) >= THERMAL_NAME_LENGTH)
> > >   return ERR_PTR(-EINVAL);
> > > @@ -1542,9 +1587,15 @@ struct thermal_zone_device *thermal_zone_device_register(const char *type,
> > >   mutex_lock(&thermal_governor_lock);
> > >
> > >   if (tz->tzp)
> > > - tz->governor = __find_governor(tz->tzp->governor_name);
> > > + governor = __find_governor(tz->tzp->governor_name);
> > >   else
> > > - tz->governor = def_governor;
> > > + governor = def_governor;
> > > +
> > > + result = thermal_set_governor(tz, governor);
> > > + if (result) {
> > > + mutex_unlock(&thermal_governor_lock);
> > > + goto unregister;
> > > + }
> > >
> > >   mutex_unlock(&thermal_governor_lock);
> > >
> > > @@ -1634,7 +1685,7 @@ void thermal_zone_device_unregister(struct thermal_zone_device *tz)
> > >   device_remove_file(&tz->device, &dev_attr_mode);
> > >   device_remove_file(&tz->device, &dev_attr_policy);
> > >   remove_trip_attrs(tz);
> > > - tz->governor = NULL;
> > > + thermal_set_governor(tz, NULL);
> > >
> > >   thermal_remove_hwmon_sysfs(tz);
> > >   release_idr(&thermal_tz_idr, &thermal_idr_lock, tz->id);
> > > diff --git a/include/linux/thermal.h b/include/linux/thermal.h
> > > index f7e11c7ea7d9..baac212f815e 100644
> > > --- a/include/linux/thermal.h
> > > +++ b/include/linux/thermal.h
> > > @@ -177,6 +177,7 @@ struct thermal_zone_device {
> > >   struct thermal_zone_device_ops *ops;
> > >   const struct thermal_zone_params *tzp;
> > >   struct thermal_governor *governor;
> > > + void *governor_data;
> > 
> > Please document this entry.
> 
> Where?  None of these entries are documented.  I guess I'll have to
> add a patch that describes each one of them and then add ours.
> 
> > >   struct list_head thermal_instances;
> > >   struct idr idr;
> > >   struct mutex lock; /* protect thermal_instances list */
> > > @@ -187,6 +188,8 @@ struct thermal_zone_device {
> > >  /* Structure that holds thermal governor information */
> > >  struct thermal_governor {
> > >   char name[THERMAL_NAME_LENGTH];
> > > + int (*bind_to_tz)(struct thermal_zone_device *tz);
> > > + void (*unbind_from_tz)(struct thermal_zone_device *tz);
> > 
> > a governor cannot prevent unbinding?
> 
> To me it's like open()/close() or malloc()/free().  You generally can
> fail to bind, but not fail to unbind.  I can change it to an int if
> you think it is something that we may need.
> 
> > Please, document the proposed operations.
> 
> Same as before, none of the entries in this struct are documented so
> I'll submit a patch that describes the current ones and then add our
> new ones in our patch.
> 
> > >   int (*throttle)(struct thermal_zone_device *tz, int trip);
> > >   struct list_head governor_list;
> > >  };
> > > --
> > > 1.7.9.5
> > 
> > I request that you update Documentation/thermal/sysfs-api.txt too.
> 
> Will do.
> 
> Cheers,
> Javi
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH v3] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks
  2014-05-15 11:36                     ` Steven Rostedt
@ 2014-05-15 15:20                       ` Javi Merino
  0 siblings, 0 replies; 48+ messages in thread
From: Javi Merino @ 2014-05-15 15:20 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Mathieu Desnoyers, Andrew Morton, Ingo Molnar,
	Namhyung Kim, Jiri Olsa

On Thu, May 15, 2014 at 12:36:09PM +0100, Steven Rostedt wrote:
> On Thu, 15 May 2014 07:34:05 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> 
> > Should work now. Thanks for testing.
> 
> If all goes well, can you please give me your "Tested-by" for both
> patches. I've already ran my updated bitmask kernel patch through my
> tests (although, there's still no event that uses it).

I've tested it on my setup and it works beautifully now.  You can add
my Tested-by to both the kernel and the trace-cmd patches.  I'll
include the kernel one together with a patch that uses it as part of
our next series, thanks!

> I'll push it to my for-next repo. I'll also work on updating
> libtraceevent with the trace-cmd patch such that perf gets the benefit
> of this new macro as well.

Yup, I guess we'll need a new version of trace-cmd to support this
when it gets into the kernel.

Thanks for the prompt response,
Javi


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 3/5] thermal: add a basic cpu power actor
  2014-05-13 22:57   ` Eduardo Valentin
@ 2014-05-15 17:02     ` Javi Merino
  0 siblings, 0 replies; 48+ messages in thread
From: Javi Merino @ 2014-05-15 17:02 UTC (permalink / raw)
  To: Eduardo Valentin; +Cc: linux-pm, Punit Agrawal, Zhang Rui

Hi Eduardo,

On Tue, May 13, 2014 at 11:57:52PM +0100, Eduardo Valentin wrote:
> On Tue, May 06, 2014 at 01:06:36PM +0100, Javi Merino wrote:
> > Introduce a new thermal_cooling_unit: THERMAL_UNIT_POWER and let
> > cpu_cooling device operate using it as well as "states".  This allows
> > governors that call cpufreq_get_max/cpufreq_get_cur/cpufreq_set_cur()
> > with THERMAL_UNIT_POWER to use cpufreq cooling devices.
> >
> > If the cpu cooling device is registered using
> > {of_,}cpufreq_power_actor_register() instead of
> > {of_,}cpufreq_cooling_register() it uses the current frequency (as
> > reported by cpufreq) as well as load and OPPs for the power
> > calculations.  The cpus must have registered their OPPs in the OPP
> > library.
> >
> > Cc: Zhang Rui <rui.zhang@intel.com>
> > Cc: Eduardo Valentin <edubezval@gmail.com>
> > Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
> > Signed-off-by: Javi Merino <javi.merino@arm.com>
> > ---
> >  drivers/thermal/cpu_cooling.c  |  430 ++++++++++++++++++++++++++++++++++++----
> >  drivers/thermal/thermal_core.c |    4 +-
> >  include/linux/cpu_cooling.h    |   28 +++
> >  include/linux/thermal.h        |    4 +
> >  4 files changed, 422 insertions(+), 44 deletions(-)
> 
> Could you please also document under Documentation/thermal/sysfs-api.txt the exposed APIs ?
> 
> 
> >
> > diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
> > index f3f4e6b5798e..8052cd1f4a39 100644
> > --- a/drivers/thermal/cpu_cooling.c
> > +++ b/drivers/thermal/cpu_cooling.c
> > @@ -24,10 +24,27 @@
> >  #include <linux/thermal.h>
> >  #include <linux/cpufreq.h>
> >  #include <linux/err.h>
> > +#include <linux/pm_opp.h>
> >  #include <linux/slab.h>
> >  #include <linux/cpu.h>
> >  #include <linux/cpu_cooling.h>
> >
> > +/* Maximum number of opps we can track */
> > +#define MAX_NUM_OPPS 32
> > +
> > +/**
> > + * struct power_table - frequency to power conversion
> > + * @frequency:       frequency in KHz
> > + * @power:   power in mW
> > + *
> > + * This structure is built when the cooling device registers and helps
> > + * in translating frequency to power and viceversa.
> > + */
> > +struct power_table {
> > +     u32 frequency;
> > +     u32 power;
> > +};
> > +
> >  /**
> >   * struct cpufreq_cooling_device - data for cooling device with cpufreq
> >   * @id: unique integer value corresponding to each cpufreq_cooling_device
> > @@ -39,6 +56,14 @@
> >   * @cpufreq_val: integer value representing the absolute value of the clipped
> >   *   frequency.
> >   * @allowed_cpus: all the cpus involved for this cpufreq_cooling_device.
> > + * @power_table: array of struct power_table for frequency to power conversion
> > + * @power_table_entries: number of entries in the @power_table array
> > + * @time_in_idle: previous reading of the absolute time that this cpu was idle
> > + * @time_in_idle_timestamp: wall time of the last invocation of
> > + *   get_cpu_idle_time_us()
> > + * @last_load: load measured by the latest call to cpufreq_get_cur()
> > + * @freq: frequency in KHz of the cpus represented by the cooling device
> > + * @capacitance: the dynamic power coefficient of these cpus
> >   *
> >   * This structure is required for keeping information of each
> >   * cpufreq_cooling_device registered. In order to prevent corruption of this a
> > @@ -50,11 +75,19 @@ struct cpufreq_cooling_device {
> >       unsigned int cpufreq_state;
> >       unsigned int cpufreq_val;
> >       struct cpumask allowed_cpus;
> > +     struct power_table power_table[MAX_NUM_OPPS];
> 
> Can we build this table dynamically?
> 
> I believe we should be able to fetch this table from opp layer. Besides,
> should not be trouble of this code to estimate the max number of opps.

Ok, I'll change that as you suggested below and remove MAX_NUM_OPPS.

> > +     int power_table_entries;
> > +     u64 time_in_idle[NR_CPUS];
> > +     u64 time_in_idle_timestamp[NR_CPUS];
> > +     u32 last_load;
> > +     u32 freq;
> > +     u32 capacitance;
> >  };
> >  static DEFINE_IDR(cpufreq_idr);
> >  static DEFINE_MUTEX(cooling_cpufreq_lock);
> >
> >  static unsigned int cpufreq_dev_count;
> > +static bool power_actor_notifier_registered;
> 
> Are you sure you need this bool? Shouldn' t you be using a counter?

Yes, that should be a counter, like cpufreq_dev_count.  It's used
exactly like cpufreq_dev_count (to prevent registering the same
notifier_block more than once) so it should have the same semantics.

> You probably need locking on this.

True, added.

> >  /* notify_table passes value to the CPUFREQ_ADJUST callback function. */
> >  #define NOTIFY_INVALID NULL
> > @@ -115,6 +148,59 @@ static int is_cpufreq_valid(int cpu)
> >       return !cpufreq_get_policy(&policy, cpu);
> >  }
> >
> > +static u32 cpu_freq_to_power(struct cpufreq_cooling_device *cpufreq_device,
> > +                     u32 freq)
> > +{
> > +     int i;
> > +     struct power_table *pt = cpufreq_device->power_table;
> > +
> > +     for (i = 0; i < cpufreq_device->power_table_entries - 1; i++)
> > +             if (freq <= pt[i].frequency)
> > +                     break;
> > +
> > +     return pt[i].power;
> > +}
> > +
> > +static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_device,
> > +                     u32 power)
> > +{
> > +     int i;
> > +     struct power_table *pt = cpufreq_device->power_table;
> > +
> > +     for (i = 0; i < cpufreq_device->power_table_entries - 1; i++)
> > +             if (power <= pt[i].power)
> > +                     break;
> > +
> > +     return pt[i].frequency;
> > +}
> > +
> > +/**
> > + * get_load - get load for a cpu since last updated
> > + * @cpufreq_device: struct cpufreq_cooling_device for this cooling device
> > + * @cpu: cpu number
> > + *
> > + * Return the average load of cpu @cpu in percentage since this
> > + * function was last called.
> > + */
> > +static u32 get_load(struct cpufreq_cooling_device *cpufreq_device, int cpu)
> > +{
> > +     u32 load;
> > +     u64 now, now_idle, delta_time, delta_idle;
> > +
> > +     now_idle = get_cpu_idle_time(cpu, &now, 0);
> > +     delta_idle = now_idle - cpufreq_device->time_in_idle[cpu];
> > +     delta_time = now - cpufreq_device->time_in_idle_timestamp[cpu];
> > +
> > +     if (delta_time <= delta_idle)
> > +             load = 0;
> > +     else
> > +             load = div64_u64(100 * (delta_time - delta_idle), delta_time);
> > +
> > +     cpufreq_device->time_in_idle[cpu] = now_idle;
> > +     cpufreq_device->time_in_idle_timestamp[cpu] = now;
> 
> nip: blank line here.

Ok

> > +     return load;
> > +}
> > +
> >  enum cpufreq_cooling_property {
> >       GET_LEVEL,
> >       GET_FREQ,
> > @@ -342,98 +428,193 @@ static int cpufreq_thermal_notifier(struct notifier_block *nb,
> >       return 0;
> >  }
> >
> > +/**
> > + * cpufreq_frequency_change - notifier callback for cpufreq frequency changes
> > + * @nb:              struct notifier_block * with callback info
> > + * @event:   value showing cpufreq event for which this function invoked
> > + * @data:    callback-specific data
> > + *
> > + * Callback to get notifications of frequency changes.  In the
> > + * CPUFREQ_POSTCHANGE @event we store the new frequency so that
> > + * cpufreq_get_cur() knows the current frequency and can convert it
> > + * into power.
> > + */
> > +static int cpufreq_frequency_change(struct notifier_block *nb,
> > +                             unsigned long event, void *data)
> > +{
> > +     struct thermal_cooling_device *cdev;
> > +     struct cpufreq_freqs *freqs = data;
> > +
> > +     /* Only update frequency on postchange */
> > +     if (event != CPUFREQ_POSTCHANGE)
> > +             return NOTIFY_DONE;
> > +
> > +     mutex_lock(&thermal_list_lock);
> > +
> > +     list_for_each_entry(cdev, &thermal_cdev_list, node) {
> > +             struct cpufreq_cooling_device *cpufreq_device;
> > +
> > +             if (strncmp(cdev->type, "thermal-cpufreq", THERMAL_NAME_LENGTH))
> > +                     continue;
> > +
> 
> I am not sure I follow why you need to go through the list of cooling devices
> here. Can you please elaborate a bit more?

Because of the code below, you need to find the cpufreq_device for
this cpu to update its frequency.  I've reworked most of this patch
and I think this loop is now clearer (and simpler) in the new version.

> > +             cpufreq_device = cdev->devdata;
> > +
> > +             if (cpumask_test_cpu(freqs->cpu, &cpufreq_device->allowed_cpus))
> > +                     cpufreq_device->freq = freqs->new;
> > +     }
> > +
> > +     mutex_unlock(&thermal_list_lock);
> > +
> > +     return NOTIFY_OK;
> > +}
> > +
> >  /* cpufreq cooling device callback functions are defined below */
> >
> >  /**
> > - * cpufreq_get_max_state - callback function to get the max cooling state.
> > + * cpufreq_get_max - callback function to get the max cooling state/power.
> >   * @cdev: thermal cooling device pointer.
> > - * @state: fill this variable with the max cooling state.
> > - * @unit: the units in which @state should be in.
> > + * @val: fill this variable with the max cooling state/power.
> > + * @unit: whether to put state or power in @val.
> >   *
> >   * Callback for the thermal cooling device to return the cpufreq max
> > - * cooling state.  Currently only THERMAL_UNIT_STATE is valid for
> > - * @unit.
> > + * cooling state/power.  If @unit is THERMAL_UNIT_STATE, @val contains
> > + * the maximum cooling state; if @unit is THERMAL_UNIT_POWER, @val
> > + * maximum is power in milliwatts.
> >   *
> >   * Return: 0 on success, an error code otherwise.
> >   */
> > -static int cpufreq_get_max_state(struct thermal_cooling_device *cdev,
> > -                             unsigned long *state,
> > -                             enum thermal_cooling_unit unit)
> > +static int cpufreq_get_max(struct thermal_cooling_device *cdev,
> > +                        unsigned long *val, enum thermal_cooling_unit unit)
> 
> Shouldn't this change be part of previous patch?

I've nuked most of this and moved it to a specific file.  Bottom line
is, extending the cooling device interface is a bit of a stretch and
doesn't work as well as I thought.  Take for example this function:
cpufreq_get_max(cdev, &val, THERMAL_UNIT_STATE) gives you the maximum
cooling state, i.e. minimum frequency.  OTOH,
cpufreq_get_max(cdev, &val, THERMAL_UNIT_POWER) gives you the maximum
power, i.e. the power consumption at maximum frequency.

Not only that, the interface is to rigid to allow for
platform-specific power models.  More on that below, as you've pointed
out some of the issues.

> >  {
> >       struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
> >       struct cpumask *mask = &cpufreq_device->allowed_cpus;
> >       unsigned int cpu;
> >       unsigned int count = 0;
> > -     int ret;
> > -
> > -     if (unit != THERMAL_UNIT_STATE)
> > -             return -EINVAL;
> 
> What happens to this code if tomorrow we add yet another unit?
> 
> 
> > +     int ret = 0;
> >
> >       cpu = cpumask_any(mask);
> >
> > -     ret = get_property(cpu, 0, &count, GET_MAXL);
> > +     if ((unit == THERMAL_UNIT_POWER) &&
> > +             (cpufreq_device->power_table_entries)) {
> 
> I think a cleaner code would use a switch case statement. You should
> deal with the power_table_entries restriction separatedly.
> 
> > +             unsigned int num_cpus;
> > +             unsigned long max_freq;
> > +
> > +             max_freq = get_cpu_frequency(cpu, 0);
> > +             if (!max_freq)
> > +                     return -EINVAL;
> >
> > -     if (count > 0)
> > -             *state = count;
> > +             num_cpus = cpumask_weight(mask);
> > +             *val = cpu_freq_to_power(cdev->devdata, max_freq) * num_cpus;
> > +     } else if (unit == THERMAL_UNIT_STATE) {
> > +             ret = get_property(cpu, 0, &count, GET_MAXL);
> > +
> > +             if (count > 0)
> > +                     *val = count;
> > +     } else {
> > +             ret = -EINVAL;
> > +     }
> >
> >       return ret;
> >  }
> >
> >  /**
> > - * cpufreq_get_cur_state - callback function to get the current cooling state.
> > + * cpufreq_get_cur - callback function to get the current cooling state/power.
> >   * @cdev: thermal cooling device pointer.
> > - * @state: fill this variable with the current cooling state.
> > - * @unit: the units in which state should be in
> > + * @val: fill this variable with the current cooling state/power.
> > + * @unit: whether to put state or power in @val.
> >   *
> >   * Callback for the thermal cooling device to return the cpufreq
> > - * current cooling state.  Currently only THERMAL_UNIT_STATE is valid
> > - * for @unit.
> > + * current cooling state/power.  If @unit is THERMAL_UNIT_STATE, @val
> > + * contains the current cooling state; if @unit is THERMAL_UNIT_POWER,
> > + * @val is current power consumption in milliwatts.
> >   *
> >   * Return: 0 on success, an error code otherwise.
> >   */
> > -static int cpufreq_get_cur_state(struct thermal_cooling_device *cdev,
> > -                              unsigned long *state,
> > -                              enum thermal_cooling_unit unit)
> > +static int cpufreq_get_cur(struct thermal_cooling_device *cdev,
> > +                        unsigned long *val, enum thermal_cooling_unit unit)
> >  {
> >       struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
> >
> > -     if (unit != THERMAL_UNIT_STATE)
> > -             return -EINVAL;
> > +     if ((unit == THERMAL_UNIT_POWER) &&
> > +             (cpufreq_device->power_table_entries)) {
> 
> ditto.
> 
> > +             int cpu;
> > +             u32 total_load = 0, raw_cpu_power, power = 0;
> > +
> > +             raw_cpu_power = cpu_freq_to_power(cpufreq_device,
> > +                                             cpufreq_device->freq);
> > +
> > +             for_each_cpu(cpu, &cpufreq_device->allowed_cpus) {
> > +                     u32 load;
> >
> > -     *state = cpufreq_device->cpufreq_state;
> > +                     if (!cpu_online(cpu))
> > +                             continue;
> > +
> > +                     load = get_load(cpufreq_device, cpu);
> > +                     power += (raw_cpu_power * load) / 100;
> > +                     total_load += load;
> > +             }
> > +
> > +             cpufreq_device->last_load = total_load;
> > +             *val = power;
> > +     } else if (unit == THERMAL_UNIT_STATE) {
> > +             *val = cpufreq_device->cpufreq_state;
> > +     } else {
> > +             return -EINVAL;
> > +     }
> >
> >       return 0;
> >  }
> >
> >  /**
> > - * cpufreq_set_cur_state - callback function to set the current cooling state.
> > + * cpufreq_set_cur - callback function to set the current cooling state/power.
> >   * @cdev: thermal cooling device pointer.
> > - * @state: set this variable to the current cooling state.
> > - * @unit: the units in which @state should be in.
> > + * @val: set this variable to the current cooling state/power.
> > + * @unit: whether @val is a cooling state or power.
> >   *
> >   * Callback for the thermal cooling device to change the cpufreq
> > - * current cooling state.  Currently only THERMAL_UNIT_STATE is valid for
> > - * @unit.
> > + * current cooling state/power.  If @unit is THERMAL_UNIT_STATE, @val
> > + * is a cooling state; if @unit is THERMAL_UNIT_POWER, @val is the target
> > + * power consumption in milliwatts.
> >   *
> >   * Return: 0 on success, an error code otherwise.
> >   */
> > -static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
> > -                              unsigned long state,
> > -                              enum thermal_cooling_unit unit)
> > +static int cpufreq_set_cur(struct thermal_cooling_device *cdev,
> > +                        unsigned long val, enum thermal_cooling_unit unit)
> >  {
> >       struct cpufreq_cooling_device *cpufreq_device = cdev->devdata;
> > -
> > -     if (unit != THERMAL_UNIT_STATE)
> > +     unsigned long state;
> > +
> > +     if ((unit == THERMAL_UNIT_POWER) &&
> > +             (cpufreq_device->power_table_entries)) {
> 
> ditto.
> 
> > +             unsigned int cpu, freq;
> > +             u32 normalised_power, last_load;
> > +
> > +             cpu = cpumask_any(&cpufreq_device->allowed_cpus);
> > +             last_load = cpufreq_device->last_load ?
> > +                     cpufreq_device->last_load : 1;
> > +             normalised_power = (val * 100) / last_load;
> > +             freq = cpu_power_to_freq(cpufreq_device, normalised_power);
> > +
> > +             state = cpufreq_cooling_get_level(cpu, freq);
> > +             if (state == THERMAL_CSTATE_INVALID) {
> > +                     pr_err("Failed to convert %dKHz for cpu %d into a cdev state\n",
> > +                             freq, cpu);
> > +                     return -EINVAL;
> > +             }
> > +     } else if (unit == THERMAL_UNIT_STATE) {
> > +             state = val;
> > +     } else {
> >               return -EINVAL;
> > +     }
> >
> >       return cpufreq_apply_cooling(cpufreq_device, state);
> >  }
> >
> >  /* Bind cpufreq callbacks to thermal cooling device ops */
> >  static struct thermal_cooling_device_ops const cpufreq_cooling_ops = {
> > -     .get_max = cpufreq_get_max_state,
> > -     .get_cur = cpufreq_get_cur_state,
> > -     .set_cur = cpufreq_set_cur_state,
> > +     .get_max = cpufreq_get_max,
> > +     .get_cur = cpufreq_get_cur,
> > +     .set_cur = cpufreq_set_cur,
> >  };
> >
> >  /* Notifier for cpufreq policy change */
> > @@ -441,10 +622,95 @@ static struct notifier_block thermal_cpufreq_notifier_block = {
> >       .notifier_call = cpufreq_thermal_notifier,
> >  };
> >
> > +/* Notifier for cpufreq frequency changes */
> > +struct notifier_block cpufreq_transition_notifier = {
> > +     .notifier_call = cpufreq_frequency_change,
> > +};
> > +
> > +/**
> > + * build_cpu_power_table - create a power to frequency table
> > + * @cpufreq_dev:     the cpufreq_cooling_device in which to store the table
> > + *
> > + * Build a power to frequency table for this cpu and store it in
> > + * @cpufreq_dev.  This table will be used in cpu_power_to_freq() and
> > + * cpu_freq_to_power() to convert between power and frequency
> > + * efficiently.  Power is stored in mW, frequency in KHz.  The
> > + * resulting table is in ascending order.
> > + *
> > + * Returns 0 on success, -E* on error.
> > + */
> > +static int build_cpu_power_table(struct cpufreq_cooling_device *cpufreq_dev)
> > +{
> > +     struct dev_pm_opp *opp;
> > +     struct device *dev = NULL;
> > +     int num_opps, cpu, i, ret = 0;
> > +     unsigned long freq;
> > +
> > +     num_opps = 0;
> > +
> > +     rcu_read_lock();
> > +
> > +     for_each_cpu(cpu, &cpufreq_dev->allowed_cpus) {
> > +             dev = get_cpu_device(cpu);
> > +             if (!dev)
> > +                     continue;
> > +
> > +             num_opps = dev_pm_opp_get_opp_count(dev);
> > +             if (num_opps > 0) {
> > +                     break;
> > +             } else if (num_opps < 0) {
> > +                     ret = num_opps;
> > +                     goto unlock;
> > +             }
> > +     }
> > +
> > +     if ((num_opps == 0) || (num_opps > MAX_NUM_OPPS)) {
> 
> hmm.. Does it mean that if an existing platform has more than 32
> opps, we won't support it? I would preffer we dynamically allocate
> the table here and cover any number of opps.

Will do.

> > +             ret = -EINVAL;
> > +             goto unlock;
> > +     }
> > +
> > +     i = 0;
> > +     for (freq = 0;
> > +          opp = dev_pm_opp_find_freq_ceil(dev, &freq), !IS_ERR(opp);
> > +          freq++) {
> > +             u32 freq_mhz, voltage_mv;
> > +             u64 power;
> > +
> > +             freq_mhz = freq / 1000000;
> > +             voltage_mv = dev_pm_opp_get_voltage(opp) / 1000;
> > +
> > +             /*
> > +              * Do the multiplication with MHz and millivolt so as
> > +              * to not overflow.
> > +              */
> > +             power = (u64)cpufreq_dev->capacitance * freq_mhz *
> > +                     voltage_mv * voltage_mv;
> 
> Essentially: P = C x F x V^2 right? Your model covers for dynamic power
> only.
> 
> I believe this power model neglets leakage. Static power consumption
> cannot be negleted nowadays, specially on thermal constrained CPUs.
> (which happens to be the target of this subsystem).
> 
> My suggestion is that you model power as:
> P = P_{dyn} + P_{static}
> 
> P_{dyn} can be what you propose above, for instance, or some other
> better increasing convex function. But we cannot neglect P_{static}.

For P_{static} we need the current temperature, which is a pain to get
from here without passing more parameters to the cooling device.
That's why I've removed this extension to the cooling device API and
just created another API to deal with getting and setting power.

> I suggest allowing platform device driver writer to input
> power models here somehow. Either by using callbacks to
> implement their own power model, or by feeding the API with
> their own table.

True, the power model needs to be "pluggable" and platforms should be
able to add their own.

> > +             do_div(power, 1000000000);
> > +
> > +             /* frequency is stored in power_table in KHz */
> > +             cpufreq_dev->power_table[i].frequency = freq / 1000;
> > +             cpufreq_dev->power_table[i].power = power;
> > +
> > +             i++;
> > +     }
> > +
> > +     if (i == 0) {
> > +             ret = PTR_ERR(opp);
> > +             goto unlock;
> > +     }
> > +
> > +     cpufreq_dev->power_table_entries = i;
> > +
> > +unlock:
> > +     rcu_read_unlock();
> > +     return ret;
> > +}
> > +
> >  /**
> >   * __cpufreq_cooling_register - helper function to create cpufreq cooling device
> >   * @np: a valid struct device_node to the cooling device device tree node
> >   * @clip_cpus: cpumask of cpus where the frequency constraints will happen.
> > + * @capacitance: dynamic power coefficient for these cpus
> >   *
> >   * This interface function registers the cpufreq cooling device with the name
> >   * "thermal-cpufreq-%x". This api can support multiple instances of cpufreq
> > @@ -456,7 +722,7 @@ static struct notifier_block thermal_cpufreq_notifier_block = {
> >   */
> >  static struct thermal_cooling_device *
> >  __cpufreq_cooling_register(struct device_node *np,
> > -                        const struct cpumask *clip_cpus)
> > +                     const struct cpumask *clip_cpus, u32 capacitance)
> >  {
> >       struct thermal_cooling_device *cool_dev;
> >       struct cpufreq_cooling_device *cpufreq_dev = NULL;
> > @@ -485,6 +751,7 @@ __cpufreq_cooling_register(struct device_node *np,
> >               return ERR_PTR(-ENOMEM);
> >
> >       cpumask_copy(&cpufreq_dev->allowed_cpus, clip_cpus);
> > +     cpufreq_dev->capacitance = capacitance;
> >
> >       ret = get_idr(&cpufreq_idr, &cpufreq_dev->id);
> >       if (ret) {
> > @@ -531,7 +798,7 @@ __cpufreq_cooling_register(struct device_node *np,
> >  struct thermal_cooling_device *
> >  cpufreq_cooling_register(const struct cpumask *clip_cpus)
> >  {
> > -     return __cpufreq_cooling_register(NULL, clip_cpus);
> > +     return __cpufreq_cooling_register(NULL, clip_cpus, 0);
> >  }
> >  EXPORT_SYMBOL_GPL(cpufreq_cooling_register);
> >
> > @@ -555,11 +822,90 @@ of_cpufreq_cooling_register(struct device_node *np,
> >       if (!np)
> >               return ERR_PTR(-EINVAL);
> >
> > -     return __cpufreq_cooling_register(np, clip_cpus);
> > +     return __cpufreq_cooling_register(np, clip_cpus, 0);
> >  }
> >  EXPORT_SYMBOL_GPL(of_cpufreq_cooling_register);
> >
> >  /**
> > + * of_cpufreq_power_actor_register - function to create a cpu cooling device with power capabilities
> > + * @np:              a valid struct device_node to the cooling device tree node
> > + * @clip_cpus:       cpumask of cpus where the frequency constraints will happen
> > + * @capacitance:     the dynamic power coefficient for these cpus
> > + *
> > + * Similar to of_cpufreq_cooling_register() this function registers a
> > + * cooling device linked to the device tree node provided.  It has a
> > + * simple power model so that get_cur/get_max/set_cur can operate with
> > + * THERMAL_UNIT_POWER
> > + *
> > + * The cpus must have registered their OPPs in the OPP library.
> > + *
> > + * Return: a valid struct thermal_cooling_device pointer on success,
> > + * on failure, it returns a corresponding ERR_PTR().
> > + */
> > +struct thermal_cooling_device *
> > +of_cpufreq_power_actor_register(struct device_node *np,
> > +                             const struct cpumask *clip_cpus,
> > +                             u32 capacitance)
> > +{
> > +     struct thermal_cooling_device *cdev, *err_ret;
> > +     int ret;
> > +
> > +     cdev = __cpufreq_cooling_register(np, clip_cpus, capacitance);
> > +     if (IS_ERR(cdev))
> > +             return cdev;
> > +
> > +     ret = build_cpu_power_table(cdev->devdata);
> > +     if (ret) {
> > +             err_ret = ERR_PTR(ret);
> > +             goto unregister;
> > +     }
> > +
> > +     /*
> > +      * You can't register multiple times the same notifier_block.
> > +      * The first power actor registered is the only one that
> > +      * registers the notifier.
> > +      */
> > +     if (!power_actor_notifier_registered) {
> > +             ret = cpufreq_register_notifier(&cpufreq_transition_notifier,
> > +                                             CPUFREQ_TRANSITION_NOTIFIER);
> > +             if (ret) {
> > +                     err_ret = ERR_PTR(ret);
> > +                     goto unregister;
> > +             }
> > +             power_actor_notifier_registered = true;
> 
> I believe you need locking to update this global variable.

Yes.

> > +     }
> > +
> > +     return cdev;
> > +
> > +unregister:
> > +     cpufreq_cooling_unregister(cdev);
> > +     return err_ret;
> > +}
> > +EXPORT_SYMBOL_GPL(of_cpufreq_power_actor_register);
> 
> This needs documentation under Documentation/thermal/sysfs-api.txt.

It's gone, but the new functions will need to be documented, yes.

Thanks,
Javi


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 4/5] thermal: introduce the Power Allocator governor
  2014-05-13 23:20   ` Eduardo Valentin
@ 2014-05-15 18:24     ` Javi Merino
  2014-05-22 14:26       ` Eduardo Valentin
  0 siblings, 1 reply; 48+ messages in thread
From: Javi Merino @ 2014-05-15 18:24 UTC (permalink / raw)
  To: Eduardo Valentin; +Cc: linux-pm, Punit Agrawal, Zhang Rui

Hi Eduardo,

On Wed, May 14, 2014 at 12:20:41AM +0100, Eduardo Valentin wrote:
> Hello Javi,
> 
> Here are a series of initial comments. I will be probably reviewing the
> code several times and coming back again with more questions.

Ok, we'll send updates regularly as we incorporate feedback and tune
it.

> On Tue, May 06, 2014 at 01:06:37PM +0100, Javi Merino wrote:
> > The power allocator governor is a thermal governor that controls system
> > and device power allocation to control temperature.  Conceptually, the
> > implementation takes a system view of heat dissipation by managing
> > multiple heat sources.
> 
> Can it be able to control several thermal zones then?

Yes, that's why we added bind_to_tz() and unbind_to_tz() to put the
governor's private data in the struct thermal_zone_device.

> > This governor relies on power-aware cooling devices (power actors) to
> > operate.  That is, cooling devices whose thermal_cooling_device_ops
> > accept THERMAL_UNIT_POWER.
> >
> > It uses a Proportional Integral (PI) controller driven by the
> > temperature of the thermal zone.  This budget is then allocated to
> > each cooling device that can have bearing on the temperature we are
> > trying to control.  It decides how much power to give each cooling
> > device based on the performance they are requesting.  The PI
> > controller ensures that the total power budget does not exceed the
> > control temperature.
> >
> > Cc: Zhang Rui <rui.zhang@intel.com>
> > Cc: Eduardo Valentin <edubezval@gmail.com>
> > Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
> > Signed-off-by: Javi Merino <javi.merino@arm.com>
> > ---
> >  drivers/thermal/Kconfig           |   14 ++
> >  drivers/thermal/Makefile          |    1 +
> >  drivers/thermal/power_allocator.c |  403 +++++++++++++++++++++++++++++++++++++
> >  drivers/thermal/thermal_core.c    |    7 +-
> >  drivers/thermal/thermal_core.h    |    8 +
> >  include/linux/thermal.h           |    5 +
> 
> We need documentation under Documentation/thermal to understand
> how to use this governor. Specially on how to accomodate
> several actors and domains under its control.

Sure, the next series will have some documentation on how to integrate
it in a platform and how all the components interact.

> >  6 files changed, 437 insertions(+), 1 deletion(-)
> >  create mode 100644 drivers/thermal/power_allocator.c
> >
> > diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> > index 2d51912a6e40..f5e6b693fc19 100644
> > --- a/drivers/thermal/Kconfig
> > +++ b/drivers/thermal/Kconfig
> > @@ -71,6 +71,14 @@ config THERMAL_DEFAULT_GOV_USER_SPACE
> >         Select this if you want to let the user space manage the
> >         platform thermals.
> >
> > +config THERMAL_DEFAULT_GOV_POWER_ALLOCATOR
> > +     bool "power_allocator"
> > +     select THERMAL_GOV_POWER_ALLOCATOR
> > +     help
> > +       Select this if you want to control temperature based on
> > +       system and device power allocation. This governor relies on
> > +       power-aware cooling devices (power actors) to operate.
> > +
> >  endchoice
> >
> >  config THERMAL_GOV_FAIR_SHARE
> > @@ -89,6 +97,12 @@ config THERMAL_GOV_USER_SPACE
> >       help
> >         Enable this to let the user space manage the platform thermals.
> >
> > +config THERMAL_GOV_POWER_ALLOCATOR
> > +     bool "Power allocator thermal governor"
> > +     help
> > +       Enable this to manage platform thermals by dynamically
> > +       allocating and limiting power to devices.
> > +
> >  config CPU_THERMAL
> >       bool "generic cpu cooling support"
> >       depends on CPU_FREQ
> > diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
> > index 54e4ec9eb5df..1cb7d5fec8dc 100644
> > --- a/drivers/thermal/Makefile
> > +++ b/drivers/thermal/Makefile
> > @@ -13,6 +13,7 @@ thermal_sys-$(CONFIG_THERMAL_OF)            += of-thermal.o
> >  thermal_sys-$(CONFIG_THERMAL_GOV_FAIR_SHARE) += fair_share.o
> >  thermal_sys-$(CONFIG_THERMAL_GOV_STEP_WISE)  += step_wise.o
> >  thermal_sys-$(CONFIG_THERMAL_GOV_USER_SPACE) += user_space.o
> > +thermal_sys-$(CONFIG_THERMAL_GOV_POWER_ALLOCATOR)    += power_allocator.o
> >
> >  # cpufreq cooling
> >  thermal_sys-$(CONFIG_CPU_THERMAL)    += cpu_cooling.o
> > diff --git a/drivers/thermal/power_allocator.c b/drivers/thermal/power_allocator.c
> > new file mode 100644
> > index 000000000000..39c425abf10a
> > --- /dev/null
> > +++ b/drivers/thermal/power_allocator.c
> > @@ -0,0 +1,403 @@
> > +/*
> > + * A power allocator to manage temperature
> > + *
> > + * Copyright (C) 2014 ARM Ltd.
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + *
> > + * This program is distributed "as is" WITHOUT ANY WARRANTY of any
> > + * kind, whether express or implied; without even the implied warranty
> > + * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> > + * GNU General Public License for more details.
> > + */
> > +
> > +#define pr_fmt(fmt) "Power allocator: " fmt
> > +
> > +#include <linux/list.h>
> > +#include <linux/slab.h>
> > +#include <linux/thermal.h>
> > +
> > +#include "thermal_core.h"
> > +
> > +#define MAX_NUM_ACTORS 8
> 
> hmm... Why do we need to limit the amount of actors?

Right, I could allocate the arrays dynamically.

> > +
> > +#define FRAC_BITS 8
> > +#define int_to_frac(x) ((x) << FRAC_BITS)
> > +#define frac_to_int(x) ((x) >> FRAC_BITS)
> > +
> > +/**
> > + * mul_frac - multiply two fixed-point numbers
> > + * @x:       first multiplicand
> > + * @y:       second multiplicand
> > + *
> > + * Returns the result of multiplying two fixed-point numbers.  The
> > + * result is also a fixed-point number.
> > + */
> > +static inline s64 mul_frac(s64 x, s64 y)
> > +{
> > +     return (x * y) >> FRAC_BITS;
> > +}
> > +
> > +enum power_allocator_trip_levels {
> > +     TRIP_SWITCH_ON = 0,     /* Switch on PI controller */
> > +     TRIP_MAX_DESIRED_TEMPERATURE, /* Temperature we are controlling for */
> > +};
> 
> Can we reuse the trip types we already have?

You mean using "passive" as a "switch on" and "active" as
MAX_DESIRED_TEMPERATURE?  If you prefer it that way, I can do that.

> > +
> > +/**
> > + * struct power_allocator_params - parameters for the power allocator governor
> > + * @k_po:    P parameter of the PI controller when overshooting (i.e., when
> > + *           temperature is below the target)
> > + * @k_pi:    P parameter of the PI controller when undershooting
> > + * @k_i:     I parameter of the PI controller
> > + * @integral_cutoff: threshold below which the error is no longer accumulated
> > +                     in the PI controller
> > + * @err_integral:    Accumulated error in the PI controller.
> > + */
> > +struct power_allocator_params {
> > +     s32 k_po;
> > +     s32 k_pu;
> > +     s32 k_i;
> > +     s32 integral_cutoff;
> > +     s32 err_integral;
> > +};
> 
> We would need a documentation on how to define these values.

Originally I thought that this should not need to be exposed to the
outside world and could have defaults that worked everywhere.  I guess
people will like to fine tune them, so apart from the sane defaults,
there should be a way for platforms to override this, and that would
need documentation.

> > +
> > +/**
> > + * pi_controller() - PI controller
> > + * @tz:      thermal zone we are operating in
> > + * @control_temp:    The target temperature
> > + * @max_allocatable_power:   maximum allocatable power for this thermal zone
> > + *
> > + * This PI controller increases the available power budget so that the
> > + * temperature of the thermal zone gets as close as possible to
> > + * @control_temp and limits the power if it exceeds it.  k_po is the
> > + * proportional term when we are overshooting, k_pu is the
> > + * proportional term when we are undershooting.  integral_cutoff is a
> > + * threshold below which we stop accumulating the error.  The
> > + * accumulated error is only valid if the requested power will make
> > + * the system warmer.  If the system is mostly idle, there's no point
> > + * in accumulating positive error.
> > + *
> > + * It returns the power budget for the next period.
> > + */
> > +static u32 pi_controller(struct thermal_zone_device *tz,
> > +                     unsigned long current_temp, unsigned long control_temp,
> > +                     unsigned long max_allocatable_power)
> > +{
> > +     s64 p, i, power_range;
> > +     s32 err;
> > +     struct power_allocator_params *params = tz->governor_data;
> > +
> > +     err = ((s32)control_temp - (s32)current_temp) / 1000;
> > +     err = int_to_frac(err);
> > +
> > +     /* Calculate the proportional term */
> > +     p = mul_frac(err < 0 ? params->k_po : params->k_pu, err);
> > +
> > +     /*
> > +      * Calculate the integral term
> > +      *
> > +      * if the error s less than cut off allow integration (but
> > +      * the integral is limited to max power)
> > +      */
> > +     i = mul_frac(params->k_i, params->err_integral);
> > +
> > +     if (err < int_to_frac(params->integral_cutoff)) {
> > +             s64 tmpi = mul_frac(params->k_i, err);
> > +             tmpi += i;
> > +             if (tmpi <= int_to_frac(max_allocatable_power)) {
> > +                     i = tmpi;
> > +                     params->err_integral += err;
> > +             }
> > +     }
> > +
> > +     power_range = p + i;
> > +
> > +     /* feed-forward the known maximum dissipatable power */
> > +     power_range = tz->tzp->max_dissipatable_power +
> > +             frac_to_int(power_range);
> > +
> > +     /* output should not exceed max power */
> > +     if (power_range > max_allocatable_power)
> > +             power_range = max_allocatable_power;
> > +
> > +     return power_range >= 0 ? power_range : 0;
> 
> What does it mean return 0? The above if does not make completelly sense
> to me, as power_range can still be 0. power_range < 0 means error?

You return 0 if the calculated power range was negative.  I've
rewritten the above code to be:

  return clamp(power_range, (s64)0, (s64)max_allocatable_power);

which I think it's clear.

> what if the value at power_range overflows the u32?

Its maximum value is max_allocatable_power which in sane systems
shouldn't be greater than u32.  I could put a BUG() somewhere if
max_allocatable_power > U32_MAX, do you think it's worth it?

> > +}
> > +
> > +static void divvy_up_power(struct thermal_zone_device *tz,
> 
> Can you please document how you perform this power split?

Will do.

> > +                     unsigned long req_power[MAX_NUM_ACTORS],
> > +                     unsigned long max_power[MAX_NUM_ACTORS],
> > +                     int num_actors, unsigned long total_req_power,
> > +                     u32 power_range,
> > +                     unsigned long granted_power[MAX_NUM_ACTORS])
> > +{
> > +     unsigned long extra_power, capped_extra_power;
> > +     unsigned long extra_actor_power[MAX_NUM_ACTORS];
> > +     int i;
> > +
> > +     if (!total_req_power) {
> > +             /*
> > +              * Nobody requested anything, so just give everybody
> > +              * the maximum power
> > +              */
> > +             for (i = 0; i < num_actors; i++)
> > +                     granted_power[i] = max_power[i];
> > +
> > +             return;
> > +     }
> > +
> > +     capped_extra_power = 0;
> > +     extra_power = 0;
> > +     for (i = 0; i < num_actors; i++) {
> > +             u64 req_range = req_power[i] * power_range;
> > +
> > +             granted_power[i] = div_u64(req_range, total_req_power);
> > +
> > +             if (granted_power[i] > max_power[i]) {
> > +                     extra_power += granted_power[i] - max_power[i];
> > +                     granted_power[i] = max_power[i];
> > +             }
> > +
> > +             extra_actor_power[i] = max_power[i] - granted_power[i];
> > +             capped_extra_power += extra_actor_power[i];
> > +     }
> > +
> > +     if (!extra_power)
> > +             return;
> > +
> > +     /*
> > +      * Re-divvy the reclaimed extra among actors based on
> > +      * how far they are from the max
> > +      */
> > +     extra_power = min(extra_power, capped_extra_power);
> > +     if (capped_extra_power > 0)
> > +             for (i = 0; i < num_actors; i++)
> > +                     granted_power[i] += (extra_actor_power[i] *
> > +                                     extra_power) / capped_extra_power;
> > +}
> > +
> > +static int allocate_power(struct thermal_zone_device *tz,
> > +                     unsigned long current_temp, unsigned long control_temp)
> > +{
> > +     struct thermal_instance *instance;
> > +     unsigned long req_power[MAX_NUM_ACTORS], max_power[MAX_NUM_ACTORS];
> > +     unsigned long granted_power[MAX_NUM_ACTORS];
> > +     unsigned long total_req_power, max_allocatable_power;
> > +     u32 power_range;
> > +     int i, num_actors, ret = 0;
> > +
> > +     i = 0;
> > +     total_req_power = 0;
> > +     max_allocatable_power = 0;
> > +
> > +     mutex_lock(&tz->lock);
> > +
> > +     list_for_each_entry(instance, &tz->thermal_instances, tz_node) {
> > +             struct thermal_cooling_device *cdev = instance->cdev;
> > +
> > +             if (instance->trip != TRIP_MAX_DESIRED_TEMPERATURE)
> 
> How does one instance->trip gets the value of
> TRIP_MAX_DESIRED_TEMPERATURE?

In power_allocator_bind() you fail to bind if the trip point doesn't
exist.  This may be removed if we switch passive/active trip points.

> > +                     continue;
> > +
> > +             if (i >= MAX_NUM_ACTORS) {
> > +                     pr_warn("Too many actors (%d) for this governor\n", i);
> 
> dev_* family is preferrable over pr_* family. Same applies to other
> patches.

With the &tz->device as the device?  Sure.

> > +                     ret = -EINVAL;
> > +                     goto unlock;
> > +             }
> > +
> > +             ret = cdev->ops->get_cur(cdev, &req_power[i],
> > +                                     THERMAL_UNIT_POWER);
> > +             if (ret) {
> > +                     pr_warn_once("Couldn't get current power for %s: %d\n",
> > +                             cdev->type, ret);
> > +                     goto unlock;
> > +             }
> > +
> > +             total_req_power += req_power[i];
> > +
> > +             cdev->ops->get_max(cdev, &max_power[i], THERMAL_UNIT_POWER);
> > +             max_allocatable_power += max_power[i];
> > +
> > +             i++;
> > +     }
> > +     num_actors = i;
> > +
> > +     power_range = pi_controller(tz, current_temp, control_temp,
> > +                             max_allocatable_power);
> > +
> > +     divvy_up_power(tz, req_power, max_power, num_actors, total_req_power,
> > +             power_range, granted_power);
> > +
> > +     i = 0;
> > +     list_for_each_entry(instance, &tz->thermal_instances, tz_node) {
> > +             struct thermal_cooling_device *cdev = instance->cdev;
> > +
> > +             if (instance->trip != TRIP_MAX_DESIRED_TEMPERATURE)
> > +                     continue;
> > +
> > +             cdev->ops->set_cur(cdev, granted_power[i], THERMAL_UNIT_POWER);
> > +             i++;
> > +     }
> > +
> > +unlock:
> > +     mutex_unlock(&tz->lock);
> > +
> > +     return ret;
> > +}
> > +
> > +static int check_trips(struct thermal_zone_device *tz)
> > +{
> > +     int ret;
> > +     enum thermal_trip_type type;
> > +
> > +     if (tz->trips < 2)
> > +             return -EINVAL;
> > +
> > +     ret = tz->ops->get_trip_type(tz, TRIP_SWITCH_ON, &type);
> > +     if (ret)
> > +             return ret;
> > +
> > +     if ((type != THERMAL_TRIP_PASSIVE) && (type != THERMAL_TRIP_ACTIVE))
> > +             return -EINVAL;
> > +
> > +     ret = tz->ops->get_trip_type(tz, TRIP_MAX_DESIRED_TEMPERATURE, &type);
> > +     if (ret)
> > +             return ret;
> > +
> > +     if ((type != THERMAL_TRIP_PASSIVE) && (type != THERMAL_TRIP_ACTIVE))
> > +             return -EINVAL;
> > +
> > +     return ret;
> > +}
> > +
> > +static void reset_pi_controller(struct power_allocator_params *params)
> > +{
> > +     params->err_integral = 0;
> > +}
> > +
> > +/**
> > + * power_allocator_bind - bind the power_allocator governor to a thermal zone
> > + * @tz:      thermal zone to bind it to
> > + *
> > + * Check that the thermal zone is valid for this governor: has two
> > + * thermal trips.  If so, initialize the PI controller parameters and
> > + * bind it to the thermal zone.
> > + *
> > + * Returns 0 on success, -EINVAL if the trips were invalid or -ENOMEM
> > + * if we ran out of memory.
> > + */
> > +static int power_allocator_bind(struct thermal_zone_device *tz)
> > +{
> > +     int ret;
> > +     struct power_allocator_params *params;
> > +     unsigned long switch_on_temp, control_temp;
> > +     u32 temperature_threshold;
> > +
> > +     ret = check_trips(tz);
> > +     if (ret) {
> > +             pr_err("thermal zone %s has the wrong number of trips for this governor\n",
> > +                     tz->type);
> > +             return ret;
> > +     }
> > +
> > +     if (!tz->tzp || !tz->tzp->max_dissipatable_power) {
> > +             pr_err("Trying to bind the power_allocator governor to a thermal zone without the max_dissipatable_power parameter\n");
> 
> Can you please break the above line?

I prefer to keep it as it is.  From Documentation/CodingStyle:

  never break user-visible strings such as printk messages, because
  that breaks the ability to grep for them.

The string could be improved though.

> > +             return -EINVAL;
> > +     }
> > +
> > +     params = kzalloc(sizeof(*params), GFP_KERNEL);
> 
> can we use devm_kzalloc?

Sure.

> > +     if (!params)
> > +             return -ENOMEM;
> > +
> > +     ret = tz->ops->get_trip_temp(tz, TRIP_SWITCH_ON, &switch_on_temp);
> > +     if (ret)
> > +             goto free;
> > +
> > +     ret = tz->ops->get_trip_temp(tz, TRIP_MAX_DESIRED_TEMPERATURE,
> > +                             &control_temp);
> 
> thermal zone drivers are not aware of these trip types, are they?

I'll do what you suggested before and turn them into passive/active.
 
> > +     if (ret)
> > +             goto free;
> > +
> > +     temperature_threshold = (control_temp - switch_on_temp) / 1000;
> > +
> > +     params->k_po = int_to_frac(tz->tzp->max_dissipatable_power) /
> > +             temperature_threshold;
> > +     params->k_pu = int_to_frac(2 * tz->tzp->max_dissipatable_power) /
> > +             temperature_threshold;
> > +     params->k_i = int_to_frac(1);
> 
> I suppose we need a way to input k_po, k_pu and k_i parameters from
> platform code to this governor right? Or are they always defined as the
> above code? always based on max_dissipated_power?

As I said above, I thought that we could just provide defaults and not
make life harder for platform code, but I guess we need to provide a
way for platform to override them.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 4/5] thermal: introduce the Power Allocator governor
  2014-05-15 18:24     ` Javi Merino
@ 2014-05-22 14:26       ` Eduardo Valentin
  0 siblings, 0 replies; 48+ messages in thread
From: Eduardo Valentin @ 2014-05-22 14:26 UTC (permalink / raw)
  To: Javi Merino; +Cc: linux-pm, Punit Agrawal, Zhang Rui

Hello again Javi,

I will answer couple of your questions here. But I will also review your
V2, no worries.

On Thu, May 15, 2014 at 07:24:44PM +0100, Javi Merino wrote:
> Hi Eduardo,
> 
> On Wed, May 14, 2014 at 12:20:41AM +0100, Eduardo Valentin wrote:
> > Hello Javi,
> > 
> > Here are a series of initial comments. I will be probably reviewing the
> > code several times and coming back again with more questions.
> 
> Ok, we'll send updates regularly as we incorporate feedback and tune
> it.


great!

> 
> > On Tue, May 06, 2014 at 01:06:37PM +0100, Javi Merino wrote:
> > > The power allocator governor is a thermal governor that controls system
> > > and device power allocation to control temperature.  Conceptually, the
> > > implementation takes a system view of heat dissipation by managing
> > > multiple heat sources.
> > 
> > Can it be able to control several thermal zones then?
> 
> Yes, that's why we added bind_to_tz() and unbind_to_tz() to put the
> governor's private data in the struct thermal_zone_device.


I see. But I was more interested if you meant that this governor is
capable of controling several thermal zones at the same time. That is,
builds decision based on information taken from several thermal zones.
Was that what you meant?
> 
> > > This governor relies on power-aware cooling devices (power actors) to
> > > operate.  That is, cooling devices whose thermal_cooling_device_ops
> > > accept THERMAL_UNIT_POWER.
> > >
> > > It uses a Proportional Integral (PI) controller driven by the
> > > temperature of the thermal zone.  This budget is then allocated to
> > > each cooling device that can have bearing on the temperature we are
> > > trying to control.  It decides how much power to give each cooling
> > > device based on the performance they are requesting.  The PI
> > > controller ensures that the total power budget does not exceed the
> > > control temperature.
> > >
> > > Cc: Zhang Rui <rui.zhang@intel.com>
> > > Cc: Eduardo Valentin <edubezval@gmail.com>
> > > Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
> > > Signed-off-by: Javi Merino <javi.merino@arm.com>
> > > ---
> > >  drivers/thermal/Kconfig           |   14 ++
> > >  drivers/thermal/Makefile          |    1 +
> > >  drivers/thermal/power_allocator.c |  403 +++++++++++++++++++++++++++++++++++++
> > >  drivers/thermal/thermal_core.c    |    7 +-
> > >  drivers/thermal/thermal_core.h    |    8 +
> > >  include/linux/thermal.h           |    5 +
> > 
> > We need documentation under Documentation/thermal to understand
> > how to use this governor. Specially on how to accomodate
> > several actors and domains under its control.
> 
> Sure, the next series will have some documentation on how to integrate
> it in a platform and how all the components interact.
> 

ok.

> > >  6 files changed, 437 insertions(+), 1 deletion(-)
> > >  create mode 100644 drivers/thermal/power_allocator.c
> > >
> > > diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> > > index 2d51912a6e40..f5e6b693fc19 100644
> > > --- a/drivers/thermal/Kconfig
> > > +++ b/drivers/thermal/Kconfig
> > > @@ -71,6 +71,14 @@ config THERMAL_DEFAULT_GOV_USER_SPACE
> > >         Select this if you want to let the user space manage the
> > >         platform thermals.
> > >
> > > +config THERMAL_DEFAULT_GOV_POWER_ALLOCATOR
> > > +     bool "power_allocator"
> > > +     select THERMAL_GOV_POWER_ALLOCATOR
> > > +     help
> > > +       Select this if you want to control temperature based on
> > > +       system and device power allocation. This governor relies on
> > > +       power-aware cooling devices (power actors) to operate.
> > > +
> > >  endchoice
> > >
> > >  config THERMAL_GOV_FAIR_SHARE
> > > @@ -89,6 +97,12 @@ config THERMAL_GOV_USER_SPACE
> > >       help
> > >         Enable this to let the user space manage the platform thermals.
> > >
> > > +config THERMAL_GOV_POWER_ALLOCATOR
> > > +     bool "Power allocator thermal governor"
> > > +     help
> > > +       Enable this to manage platform thermals by dynamically
> > > +       allocating and limiting power to devices.
> > > +
> > >  config CPU_THERMAL
> > >       bool "generic cpu cooling support"
> > >       depends on CPU_FREQ
> > > diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
> > > index 54e4ec9eb5df..1cb7d5fec8dc 100644
> > > --- a/drivers/thermal/Makefile
> > > +++ b/drivers/thermal/Makefile
> > > @@ -13,6 +13,7 @@ thermal_sys-$(CONFIG_THERMAL_OF)            += of-thermal.o
> > >  thermal_sys-$(CONFIG_THERMAL_GOV_FAIR_SHARE) += fair_share.o
> > >  thermal_sys-$(CONFIG_THERMAL_GOV_STEP_WISE)  += step_wise.o
> > >  thermal_sys-$(CONFIG_THERMAL_GOV_USER_SPACE) += user_space.o
> > > +thermal_sys-$(CONFIG_THERMAL_GOV_POWER_ALLOCATOR)    += power_allocator.o
> > >
> > >  # cpufreq cooling
> > >  thermal_sys-$(CONFIG_CPU_THERMAL)    += cpu_cooling.o
> > > diff --git a/drivers/thermal/power_allocator.c b/drivers/thermal/power_allocator.c
> > > new file mode 100644
> > > index 000000000000..39c425abf10a
> > > --- /dev/null
> > > +++ b/drivers/thermal/power_allocator.c
> > > @@ -0,0 +1,403 @@
> > > +/*
> > > + * A power allocator to manage temperature
> > > + *
> > > + * Copyright (C) 2014 ARM Ltd.
> > > + *
> > > + * This program is free software; you can redistribute it and/or modify
> > > + * it under the terms of the GNU General Public License version 2 as
> > > + * published by the Free Software Foundation.
> > > + *
> > > + * This program is distributed "as is" WITHOUT ANY WARRANTY of any
> > > + * kind, whether express or implied; without even the implied warranty
> > > + * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> > > + * GNU General Public License for more details.
> > > + */
> > > +
> > > +#define pr_fmt(fmt) "Power allocator: " fmt
> > > +
> > > +#include <linux/list.h>
> > > +#include <linux/slab.h>
> > > +#include <linux/thermal.h>
> > > +
> > > +#include "thermal_core.h"
> > > +
> > > +#define MAX_NUM_ACTORS 8
> > 
> > hmm... Why do we need to limit the amount of actors?
> 
> Right, I could allocate the arrays dynamically.

good.

> 
> > > +
> > > +#define FRAC_BITS 8
> > > +#define int_to_frac(x) ((x) << FRAC_BITS)
> > > +#define frac_to_int(x) ((x) >> FRAC_BITS)
> > > +
> > > +/**
> > > + * mul_frac - multiply two fixed-point numbers
> > > + * @x:       first multiplicand
> > > + * @y:       second multiplicand
> > > + *
> > > + * Returns the result of multiplying two fixed-point numbers.  The
> > > + * result is also a fixed-point number.
> > > + */
> > > +static inline s64 mul_frac(s64 x, s64 y)
> > > +{
> > > +     return (x * y) >> FRAC_BITS;
> > > +}
> > > +
> > > +enum power_allocator_trip_levels {
> > > +     TRIP_SWITCH_ON = 0,     /* Switch on PI controller */
> > > +     TRIP_MAX_DESIRED_TEMPERATURE, /* Temperature we are controlling for */
> > > +};
> > 
> > Can we reuse the trip types we already have?
> 
> You mean using "passive" as a "switch on" and "active" as
> MAX_DESIRED_TEMPERATURE?  If you prefer it that way, I can do that.

hmmm.. well, "active", as per its original definition, targets points in
which the system performs (cooling) actions that burns more power to
remove heat. On the other hand, "passive" removes heat by reducing/modulating system's power consumption.

I believe you would need two passive trip points.

Another option is to use a passive trip point to map your switch on
event and a hot trip point to map your target temperature.


> 
> > > +
> > > +/**
> > > + * struct power_allocator_params - parameters for the power allocator governor
> > > + * @k_po:    P parameter of the PI controller when overshooting (i.e., when
> > > + *           temperature is below the target)
> > > + * @k_pi:    P parameter of the PI controller when undershooting
> > > + * @k_i:     I parameter of the PI controller
> > > + * @integral_cutoff: threshold below which the error is no longer accumulated
> > > +                     in the PI controller
> > > + * @err_integral:    Accumulated error in the PI controller.
> > > + */
> > > +struct power_allocator_params {
> > > +     s32 k_po;
> > > +     s32 k_pu;
> > > +     s32 k_i;
> > > +     s32 integral_cutoff;
> > > +     s32 err_integral;
> > > +};
> > 
> > We would need a documentation on how to define these values.
> 
> Originally I thought that this should not need to be exposed to the
> outside world and could have defaults that worked everywhere.  I guess
> people will like to fine tune them, so apart from the sane defaults,
> there should be a way for platforms to override this, and that would
> need documentation.

Can we achieve safe defaults applicable to every platforms?


> 
> > > +
> > > +/**
> > > + * pi_controller() - PI controller
> > > + * @tz:      thermal zone we are operating in
> > > + * @control_temp:    The target temperature
> > > + * @max_allocatable_power:   maximum allocatable power for this thermal zone
> > > + *
> > > + * This PI controller increases the available power budget so that the
> > > + * temperature of the thermal zone gets as close as possible to
> > > + * @control_temp and limits the power if it exceeds it.  k_po is the
> > > + * proportional term when we are overshooting, k_pu is the
> > > + * proportional term when we are undershooting.  integral_cutoff is a
> > > + * threshold below which we stop accumulating the error.  The
> > > + * accumulated error is only valid if the requested power will make
> > > + * the system warmer.  If the system is mostly idle, there's no point
> > > + * in accumulating positive error.
> > > + *
> > > + * It returns the power budget for the next period.
> > > + */
> > > +static u32 pi_controller(struct thermal_zone_device *tz,
> > > +                     unsigned long current_temp, unsigned long control_temp,
> > > +                     unsigned long max_allocatable_power)
> > > +{
> > > +     s64 p, i, power_range;
> > > +     s32 err;
> > > +     struct power_allocator_params *params = tz->governor_data;
> > > +
> > > +     err = ((s32)control_temp - (s32)current_temp) / 1000;
> > > +     err = int_to_frac(err);
> > > +
> > > +     /* Calculate the proportional term */
> > > +     p = mul_frac(err < 0 ? params->k_po : params->k_pu, err);
> > > +
> > > +     /*
> > > +      * Calculate the integral term
> > > +      *
> > > +      * if the error s less than cut off allow integration (but
> > > +      * the integral is limited to max power)
> > > +      */
> > > +     i = mul_frac(params->k_i, params->err_integral);
> > > +
> > > +     if (err < int_to_frac(params->integral_cutoff)) {
> > > +             s64 tmpi = mul_frac(params->k_i, err);
> > > +             tmpi += i;
> > > +             if (tmpi <= int_to_frac(max_allocatable_power)) {
> > > +                     i = tmpi;
> > > +                     params->err_integral += err;
> > > +             }
> > > +     }
> > > +
> > > +     power_range = p + i;
> > > +
> > > +     /* feed-forward the known maximum dissipatable power */
> > > +     power_range = tz->tzp->max_dissipatable_power +
> > > +             frac_to_int(power_range);
> > > +
> > > +     /* output should not exceed max power */
> > > +     if (power_range > max_allocatable_power)
> > > +             power_range = max_allocatable_power;
> > > +
> > > +     return power_range >= 0 ? power_range : 0;
> > 
> > What does it mean return 0? The above if does not make completelly sense
> > to me, as power_range can still be 0. power_range < 0 means error?
> 
> You return 0 if the calculated power range was negative.  I've
> rewritten the above code to be:
> 
>   return clamp(power_range, (s64)0, (s64)max_allocatable_power);
> 
> which I think it's clear.

sounds good to me.

> 
> > what if the value at power_range overflows the u32?
> 
> Its maximum value is max_allocatable_power which in sane systems
> shouldn't be greater than u32.  I could put a BUG() somewhere if
> max_allocatable_power > U32_MAX, do you think it's worth it?
> 

I agree to have a sane check at least in initialization phase to avoid
obscure overflow errors.

> > > +}
> > > +
> > > +static void divvy_up_power(struct thermal_zone_device *tz,
> > 
> > Can you please document how you perform this power split?
> 
> Will do.

good.

> 
> > > +                     unsigned long req_power[MAX_NUM_ACTORS],
> > > +                     unsigned long max_power[MAX_NUM_ACTORS],
> > > +                     int num_actors, unsigned long total_req_power,
> > > +                     u32 power_range,
> > > +                     unsigned long granted_power[MAX_NUM_ACTORS])
> > > +{
> > > +     unsigned long extra_power, capped_extra_power;
> > > +     unsigned long extra_actor_power[MAX_NUM_ACTORS];
> > > +     int i;
> > > +
> > > +     if (!total_req_power) {
> > > +             /*
> > > +              * Nobody requested anything, so just give everybody
> > > +              * the maximum power
> > > +              */
> > > +             for (i = 0; i < num_actors; i++)
> > > +                     granted_power[i] = max_power[i];
> > > +
> > > +             return;
> > > +     }
> > > +
> > > +     capped_extra_power = 0;
> > > +     extra_power = 0;
> > > +     for (i = 0; i < num_actors; i++) {
> > > +             u64 req_range = req_power[i] * power_range;
> > > +
> > > +             granted_power[i] = div_u64(req_range, total_req_power);
> > > +
> > > +             if (granted_power[i] > max_power[i]) {
> > > +                     extra_power += granted_power[i] - max_power[i];
> > > +                     granted_power[i] = max_power[i];
> > > +             }
> > > +
> > > +             extra_actor_power[i] = max_power[i] - granted_power[i];
> > > +             capped_extra_power += extra_actor_power[i];
> > > +     }
> > > +
> > > +     if (!extra_power)
> > > +             return;
> > > +
> > > +     /*
> > > +      * Re-divvy the reclaimed extra among actors based on
> > > +      * how far they are from the max
> > > +      */
> > > +     extra_power = min(extra_power, capped_extra_power);
> > > +     if (capped_extra_power > 0)
> > > +             for (i = 0; i < num_actors; i++)
> > > +                     granted_power[i] += (extra_actor_power[i] *
> > > +                                     extra_power) / capped_extra_power;
> > > +}
> > > +
> > > +static int allocate_power(struct thermal_zone_device *tz,
> > > +                     unsigned long current_temp, unsigned long control_temp)
> > > +{
> > > +     struct thermal_instance *instance;
> > > +     unsigned long req_power[MAX_NUM_ACTORS], max_power[MAX_NUM_ACTORS];
> > > +     unsigned long granted_power[MAX_NUM_ACTORS];
> > > +     unsigned long total_req_power, max_allocatable_power;
> > > +     u32 power_range;
> > > +     int i, num_actors, ret = 0;
> > > +
> > > +     i = 0;
> > > +     total_req_power = 0;
> > > +     max_allocatable_power = 0;
> > > +
> > > +     mutex_lock(&tz->lock);
> > > +
> > > +     list_for_each_entry(instance, &tz->thermal_instances, tz_node) {
> > > +             struct thermal_cooling_device *cdev = instance->cdev;
> > > +
> > > +             if (instance->trip != TRIP_MAX_DESIRED_TEMPERATURE)
> > 
> > How does one instance->trip gets the value of
> > TRIP_MAX_DESIRED_TEMPERATURE?
> 
> In power_allocator_bind() you fail to bind if the trip point doesn't
> exist.  This may be removed if we switch passive/active trip points.

yes, but you need a check while binding to a correct trip setup, I
agree.

> 
> > > +                     continue;
> > > +
> > > +             if (i >= MAX_NUM_ACTORS) {
> > > +                     pr_warn("Too many actors (%d) for this governor\n", i);
> > 
> > dev_* family is preferrable over pr_* family. Same applies to other
> > patches.
> 
> With the &tz->device as the device?  Sure.

yes, use the thermal zone device when possible. If thermal zone device
is unreliable, then use pr_* family.

> 
> > > +                     ret = -EINVAL;
> > > +                     goto unlock;
> > > +             }
> > > +
> > > +             ret = cdev->ops->get_cur(cdev, &req_power[i],
> > > +                                     THERMAL_UNIT_POWER);
> > > +             if (ret) {
> > > +                     pr_warn_once("Couldn't get current power for %s: %d\n",
> > > +                             cdev->type, ret);
> > > +                     goto unlock;
> > > +             }
> > > +
> > > +             total_req_power += req_power[i];
> > > +
> > > +             cdev->ops->get_max(cdev, &max_power[i], THERMAL_UNIT_POWER);
> > > +             max_allocatable_power += max_power[i];
> > > +
> > > +             i++;
> > > +     }
> > > +     num_actors = i;
> > > +
> > > +     power_range = pi_controller(tz, current_temp, control_temp,
> > > +                             max_allocatable_power);
> > > +
> > > +     divvy_up_power(tz, req_power, max_power, num_actors, total_req_power,
> > > +             power_range, granted_power);
> > > +
> > > +     i = 0;
> > > +     list_for_each_entry(instance, &tz->thermal_instances, tz_node) {
> > > +             struct thermal_cooling_device *cdev = instance->cdev;
> > > +
> > > +             if (instance->trip != TRIP_MAX_DESIRED_TEMPERATURE)
> > > +                     continue;
> > > +
> > > +             cdev->ops->set_cur(cdev, granted_power[i], THERMAL_UNIT_POWER);
> > > +             i++;
> > > +     }
> > > +
> > > +unlock:
> > > +     mutex_unlock(&tz->lock);
> > > +
> > > +     return ret;
> > > +}
> > > +
> > > +static int check_trips(struct thermal_zone_device *tz)
> > > +{
> > > +     int ret;
> > > +     enum thermal_trip_type type;
> > > +
> > > +     if (tz->trips < 2)
> > > +             return -EINVAL;
> > > +
> > > +     ret = tz->ops->get_trip_type(tz, TRIP_SWITCH_ON, &type);
> > > +     if (ret)
> > > +             return ret;
> > > +
> > > +     if ((type != THERMAL_TRIP_PASSIVE) && (type != THERMAL_TRIP_ACTIVE))
> > > +             return -EINVAL;
> > > +
> > > +     ret = tz->ops->get_trip_type(tz, TRIP_MAX_DESIRED_TEMPERATURE, &type);
> > > +     if (ret)
> > > +             return ret;
> > > +
> > > +     if ((type != THERMAL_TRIP_PASSIVE) && (type != THERMAL_TRIP_ACTIVE))
> > > +             return -EINVAL;
> > > +
> > > +     return ret;
> > > +}
> > > +
> > > +static void reset_pi_controller(struct power_allocator_params *params)
> > > +{
> > > +     params->err_integral = 0;
> > > +}
> > > +
> > > +/**
> > > + * power_allocator_bind - bind the power_allocator governor to a thermal zone
> > > + * @tz:      thermal zone to bind it to
> > > + *
> > > + * Check that the thermal zone is valid for this governor: has two
> > > + * thermal trips.  If so, initialize the PI controller parameters and
> > > + * bind it to the thermal zone.
> > > + *
> > > + * Returns 0 on success, -EINVAL if the trips were invalid or -ENOMEM
> > > + * if we ran out of memory.
> > > + */
> > > +static int power_allocator_bind(struct thermal_zone_device *tz)
> > > +{
> > > +     int ret;
> > > +     struct power_allocator_params *params;
> > > +     unsigned long switch_on_temp, control_temp;
> > > +     u32 temperature_threshold;
> > > +
> > > +     ret = check_trips(tz);
> > > +     if (ret) {
> > > +             pr_err("thermal zone %s has the wrong number of trips for this governor\n",
> > > +                     tz->type);
> > > +             return ret;
> > > +     }
> > > +
> > > +     if (!tz->tzp || !tz->tzp->max_dissipatable_power) {
> > > +             pr_err("Trying to bind the power_allocator governor to a thermal zone without the max_dissipatable_power parameter\n");
> > 
> > Can you please break the above line?
> 
> I prefer to keep it as it is.  From Documentation/CodingStyle:
> 
>   never break user-visible strings such as printk messages, because
>   that breaks the ability to grep for them.
> 
> The string could be improved though.

Yeah, but the above message is not too descriptive for the system
administrator. Think of it, it does mention an error situation, but does
not mention what is the thermal zone.

Maybe a smaller message like this would be more helpfull:

+	if (!tz->tzp || !tz->tzp->max_dissipatable_power) {
+		dev_err(tz->device,
+			"power_allocator: missing max_dissipatable_power\n");


The dev_err includes the name of the thermal zone device.

> 
> > > +             return -EINVAL;
> > > +     }
> > > +
> > > +     params = kzalloc(sizeof(*params), GFP_KERNEL);
> > 
> > can we use devm_kzalloc?
> 
> Sure.
> 

good.


> > > +     if (!params)
> > > +             return -ENOMEM;
> > > +
> > > +     ret = tz->ops->get_trip_temp(tz, TRIP_SWITCH_ON, &switch_on_temp);
> > > +     if (ret)
> > > +             goto free;
> > > +
> > > +     ret = tz->ops->get_trip_temp(tz, TRIP_MAX_DESIRED_TEMPERATURE,
> > > +                             &control_temp);
> > 
> > thermal zone drivers are not aware of these trip types, are they?
> 
> I'll do what you suggested before and turn them into passive/active.

good.

>  
> > > +     if (ret)
> > > +             goto free;
> > > +
> > > +     temperature_threshold = (control_temp - switch_on_temp) / 1000;
> > > +
> > > +     params->k_po = int_to_frac(tz->tzp->max_dissipatable_power) /
> > > +             temperature_threshold;
> > > +     params->k_pu = int_to_frac(2 * tz->tzp->max_dissipatable_power) /
> > > +             temperature_threshold;
> > > +     params->k_i = int_to_frac(1);
> > 
> > I suppose we need a way to input k_po, k_pu and k_i parameters from
> > platform code to this governor right? Or are they always defined as the
> > above code? always based on max_dissipated_power?
> 
> As I said above, I thought that we could just provide defaults and not
> make life harder for platform code, but I guess we need to provide a
> way for platform to override them.
> 

I see. If we can achieve a single set of constants, then we can add
debugfs entries for k_{po,pu,i} constant values. That would allow for
debugging and fine tunning.




^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v2 0/4] tools lib traceevent: bitmask handling and plugin updates
@ 2014-06-03  3:20 Steven Rostedt
  2014-06-03  3:20 ` [PATCH v2 1/4] tools lib traceevent: Add flag to not load event plugins Steven Rostedt
                   ` (3 more replies)
  0 siblings, 4 replies; 48+ messages in thread
From: Steven Rostedt @ 2014-06-03  3:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jiri Olsa, Namhyung Kim, Ingo Molnar, Andrew Morton

This is v2 of the patch series that added __get_bitmask() as well as
added some plugin code. This version addresses what Namhyung suggested.
The diff from v1 is posted below.

Steven Rostedt (Red Hat) (4):
      tools lib traceevent: Add flag to not load event plugins
      tools lib traceevent: Add options to plugins
      tools lib traceevent: Add options to function plugin
      tools lib traceevent: Added support for __get_bitmask() macro

----
 tools/lib/traceevent/event-parse.c                 | 113 ++++++++++++
 tools/lib/traceevent/event-parse.h                 |  25 ++-
 tools/lib/traceevent/event-plugin.c                | 204 ++++++++++++++++++++-
 tools/lib/traceevent/plugin_function.c             |  43 ++++-
 .../perf/util/scripting-engines/trace-event-perl.c |   1 +
 .../util/scripting-engines/trace-event-python.c    |   1 +
 6 files changed, 377 insertions(+), 10 deletions(-)


diff --git a/tools/lib/traceevent/event-parse.c b/tools/lib/traceevent/event-parse.c
index 2d6aa92..93825a1 100644
--- a/tools/lib/traceevent/event-parse.c
+++ b/tools/lib/traceevent/event-parse.c
@@ -3794,7 +3794,7 @@ static void print_str_arg(struct trace_seq *s, void *data, int size,
 			f = pevent_find_any_field(event, arg->bitmask.bitmask);
 			arg->bitmask.offset = f->offset;
 		}
-		bitmask_offset = data2host4(pevent, data + arg->string.offset);
+		bitmask_offset = data2host4(pevent, data + arg->bitmask.offset);
 		bitmask_size = bitmask_offset >> 16;
 		bitmask_offset &= 0xffff;
 		print_bitmask_to_seq(pevent, s, format, len_arg,
diff --git a/tools/lib/traceevent/event-parse.h b/tools/lib/traceevent/event-parse.h
index 025627f..7a3873f 100644
--- a/tools/lib/traceevent/event-parse.h
+++ b/tools/lib/traceevent/event-parse.h
@@ -362,7 +362,7 @@ enum pevent_func_arg_type {
 enum pevent_flag {
 	PEVENT_NSEC_OUTPUT		= 1,	/* output in NSECS */
 	PEVENT_DISABLE_SYS_PLUGINS	= 1 << 1,
-	PEVENT_DISABLE_PLUGINS		= 1 << 2
+	PEVENT_DISABLE_PLUGINS		= 1 << 2,
 };
 
 #define PEVENT_ERRORS 							      \
@@ -419,6 +419,8 @@ enum pevent_errno {
 
 struct plugin_list;
 
+#define INVALID_PLUGIN_LIST_OPTION	((char **)((unsigned long)-1))
+
 struct plugin_list *traceevent_load_plugins(struct pevent *pevent);
 void traceevent_unload_plugins(struct plugin_list *plugin_list,
 			       struct pevent *pevent);
diff --git a/tools/lib/traceevent/event-plugin.c b/tools/lib/traceevent/event-plugin.c
index a244794..648ef84 100644
--- a/tools/lib/traceevent/event-plugin.c
+++ b/tools/lib/traceevent/event-plugin.c
@@ -57,7 +57,7 @@ struct plugin_list {
  * used by toggling the option.
  *
  * Returns NULL if there's no options registered. On error it returns
- * an (char **)-1 (must check for that)
+ * INVALID_PLUGIN_LIST_OPTION
  *
  * Must be freed with traceevent_plugin_free_options_list().
  */
@@ -72,6 +72,7 @@ char **traceevent_plugin_list_options(void)
 	for (reg = registered_options; reg; reg = reg->next) {
 		for (op = reg->options; op->name; op++) {
 			char *alias = op->plugin_alias ? op->plugin_alias : op->file;
+			char **temp = list;
 
 			name = malloc(strlen(op->name) + strlen(alias) + 2);
 			if (!name)
@@ -80,6 +81,7 @@ char **traceevent_plugin_list_options(void)
 			sprintf(name, "%s:%s", alias, op->name);
 			list = realloc(list, count + 2);
 			if (!list) {
+				list = temp;
 				free(name);
 				goto err;
 			}
@@ -87,16 +89,14 @@ char **traceevent_plugin_list_options(void)
 			list[count] = NULL;
 		}
 	}
-	if (!count)
-		return NULL;
 	return list;
 
  err:
-	while (--count > 0)
+	while (--count >= 0)
 		free(list[count]);
 	free(list);
 
-	return (char **)((unsigned long)-1);
+	return INVALID_PLUGIN_LIST_OPTION;
 }
 
 void traceevent_plugin_free_options_list(char **list)

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 1/4] tools lib traceevent: Add flag to not load event plugins
  2014-06-03  3:20 [PATCH v2 0/4] tools lib traceevent: bitmask handling and plugin updates Steven Rostedt
@ 2014-06-03  3:20 ` Steven Rostedt
  2014-06-12 12:00   ` [tip:perf/core] " tip-bot for Steven Rostedt (Red Hat)
  2014-06-03  3:20 ` [PATCH v2 2/4] tools lib traceevent: Add options to plugins Steven Rostedt
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 48+ messages in thread
From: Steven Rostedt @ 2014-06-03  3:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jiri Olsa, Namhyung Kim, Ingo Molnar, Andrew Morton

[-- Attachment #1: 0001-tools-lib-traceevent-Add-flag-to-not-load-event-plug.patch --]
[-- Type: text/plain, Size: 1657 bytes --]

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

Add a flag to pevent that will let the callers be able to set it and
keep the system, and perhaps even normal plugins from being loaded.

This is useful when plugins might hide certain information and seeing
the raw events shows what may be going on.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 tools/lib/traceevent/event-parse.h  | 2 ++
 tools/lib/traceevent/event-plugin.c | 7 ++++++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/tools/lib/traceevent/event-parse.h b/tools/lib/traceevent/event-parse.h
index feab94281634..a68ec3d8289f 100644
--- a/tools/lib/traceevent/event-parse.h
+++ b/tools/lib/traceevent/event-parse.h
@@ -354,6 +354,8 @@ enum pevent_func_arg_type {
 
 enum pevent_flag {
 	PEVENT_NSEC_OUTPUT		= 1,	/* output in NSECS */
+	PEVENT_DISABLE_SYS_PLUGINS	= 1 << 1,
+	PEVENT_DISABLE_PLUGINS		= 1 << 2,
 };
 
 #define PEVENT_ERRORS 							      \
diff --git a/tools/lib/traceevent/event-plugin.c b/tools/lib/traceevent/event-plugin.c
index 0c8bf6780e4d..317466bd1a37 100644
--- a/tools/lib/traceevent/event-plugin.c
+++ b/tools/lib/traceevent/event-plugin.c
@@ -148,12 +148,17 @@ load_plugins(struct pevent *pevent, const char *suffix,
 	char *path;
 	char *envdir;
 
+	if (pevent->flags & PEVENT_DISABLE_PLUGINS)
+		return;
+
 	/*
 	 * If a system plugin directory was defined,
 	 * check that first.
 	 */
 #ifdef PLUGIN_DIR
-	load_plugins_dir(pevent, suffix, PLUGIN_DIR, load_plugin, data);
+	if (!(pevent->flags & PEVENT_DISABLE_SYS_PLUGINS))
+		load_plugins_dir(pevent, suffix, PLUGIN_DIR,
+				 load_plugin, data);
 #endif
 
 	/*
-- 
2.0.0.rc2



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 2/4] tools lib traceevent: Add options to plugins
  2014-06-03  3:20 [PATCH v2 0/4] tools lib traceevent: bitmask handling and plugin updates Steven Rostedt
  2014-06-03  3:20 ` [PATCH v2 1/4] tools lib traceevent: Add flag to not load event plugins Steven Rostedt
@ 2014-06-03  3:20 ` Steven Rostedt
  2014-06-03  6:51   ` Namhyung Kim
  2014-06-03  3:20 ` [PATCH v2 3/4] tools lib traceevent: Add options to function plugin Steven Rostedt
  2014-06-03  3:20 ` [PATCH v2 4/4] tools lib traceevent: Added support for __get_bitmask() macro Steven Rostedt
  3 siblings, 1 reply; 48+ messages in thread
From: Steven Rostedt @ 2014-06-03  3:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jiri Olsa, Namhyung Kim, Ingo Molnar, Andrew Morton

[-- Attachment #1: 0002-tools-lib-traceevent-Add-options-to-plugins.patch --]
[-- Type: text/plain, Size: 8227 bytes --]

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

The traceevent plugins allows developers to have their events print out
information that is more advanced than what can be achieved by the
trace event format files.

As these plugins are used on the userspace side of the tracing tools, it
is only logical that the tools should be able to produce different types
of output for the events. The types of events still need to be defined by
the plugins thus we need a way to pass information from the tool to the
plugin to specify what type of information to be shown.

Not only does the information need to be passed by the tool to plugin, but
the plugin also requires a way to notify the tool of what options it can
provide.

This builds the plugin option infrastructure that is taken from trace-cmd
that is used to allow plugins to produce different output based on the
options specified by the tool.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 tools/lib/traceevent/event-parse.h  |  16 ++-
 tools/lib/traceevent/event-plugin.c | 197 ++++++++++++++++++++++++++++++++++++
 2 files changed, 210 insertions(+), 3 deletions(-)

diff --git a/tools/lib/traceevent/event-parse.h b/tools/lib/traceevent/event-parse.h
index a68ec3d8289f..56e0e6c12411 100644
--- a/tools/lib/traceevent/event-parse.h
+++ b/tools/lib/traceevent/event-parse.h
@@ -107,8 +107,8 @@ typedef int (*pevent_event_handler_func)(struct trace_seq *s,
 typedef int (*pevent_plugin_load_func)(struct pevent *pevent);
 typedef int (*pevent_plugin_unload_func)(struct pevent *pevent);
 
-struct plugin_option {
-	struct plugin_option		*next;
+struct pevent_plugin_option {
+	struct pevent_plugin_option	*next;
 	void				*handle;
 	char				*file;
 	char				*name;
@@ -135,7 +135,7 @@ struct plugin_option {
  * PEVENT_PLUGIN_OPTIONS:  (optional)
  *   Plugin options that can be set before loading
  *
- *   struct plugin_option PEVENT_PLUGIN_OPTIONS[] = {
+ *   struct pevent_plugin_option PEVENT_PLUGIN_OPTIONS[] = {
  *	{
  *		.name = "option-name",
  *		.plugin_alias = "overide-file-name", (optional)
@@ -412,9 +412,19 @@ enum pevent_errno {
 
 struct plugin_list;
 
+#define INVALID_PLUGIN_LIST_OPTION	((char **)((unsigned long)-1))
+
 struct plugin_list *traceevent_load_plugins(struct pevent *pevent);
 void traceevent_unload_plugins(struct plugin_list *plugin_list,
 			       struct pevent *pevent);
+char **traceevent_plugin_list_options(void);
+void traceevent_plugin_free_options_list(char **list);
+int traceevent_plugin_add_options(const char *name,
+				  struct pevent_plugin_option *options);
+void traceevent_plugin_remove_options(struct pevent_plugin_option *options);
+void traceevent_print_plugins(struct trace_seq *s,
+			      const char *prefix, const char *suffix,
+			      const struct plugin_list *list);
 
 struct cmdline;
 struct cmdline_list;
diff --git a/tools/lib/traceevent/event-plugin.c b/tools/lib/traceevent/event-plugin.c
index 317466bd1a37..648ef84dc37e 100644
--- a/tools/lib/traceevent/event-plugin.c
+++ b/tools/lib/traceevent/event-plugin.c
@@ -18,6 +18,7 @@
  * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  */
 
+#include <stdio.h>
 #include <string.h>
 #include <dlfcn.h>
 #include <stdlib.h>
@@ -30,12 +31,208 @@
 
 #define LOCAL_PLUGIN_DIR ".traceevent/plugins"
 
+static struct registered_plugin_options {
+	struct registered_plugin_options	*next;
+	struct pevent_plugin_option		*options;
+} *registered_options;
+
+static struct trace_plugin_options {
+	struct trace_plugin_options	*next;
+	char				*plugin;
+	char				*option;
+	char				*value;
+} *trace_plugin_options;
+
 struct plugin_list {
 	struct plugin_list	*next;
 	char			*name;
 	void			*handle;
 };
 
+/**
+ * traceevent_plugin_list_options - get list of plugin options
+ *
+ * Returns an array of char strings that list the currently registered
+ * plugin options in the format of <plugin>:<option>. This list can be
+ * used by toggling the option.
+ *
+ * Returns NULL if there's no options registered. On error it returns
+ * INVALID_PLUGIN_LIST_OPTION
+ *
+ * Must be freed with traceevent_plugin_free_options_list().
+ */
+char **traceevent_plugin_list_options(void)
+{
+	struct registered_plugin_options *reg;
+	struct pevent_plugin_option *op;
+	char **list = NULL;
+	char *name;
+	int count = 0;
+
+	for (reg = registered_options; reg; reg = reg->next) {
+		for (op = reg->options; op->name; op++) {
+			char *alias = op->plugin_alias ? op->plugin_alias : op->file;
+			char **temp = list;
+
+			name = malloc(strlen(op->name) + strlen(alias) + 2);
+			if (!name)
+				goto err;
+
+			sprintf(name, "%s:%s", alias, op->name);
+			list = realloc(list, count + 2);
+			if (!list) {
+				list = temp;
+				free(name);
+				goto err;
+			}
+			list[count++] = name;
+			list[count] = NULL;
+		}
+	}
+	return list;
+
+ err:
+	while (--count >= 0)
+		free(list[count]);
+	free(list);
+
+	return INVALID_PLUGIN_LIST_OPTION;
+}
+
+void traceevent_plugin_free_options_list(char **list)
+{
+	int i;
+
+	if (!list)
+		return;
+
+	if (list == (char **)((unsigned long)-1))
+		return;
+
+	for (i = 0; list[i]; i++)
+		free(list[i]);
+
+	free(list);
+}
+
+static int
+update_option(const char *file, struct pevent_plugin_option *option)
+{
+	struct trace_plugin_options *op;
+	char *plugin;
+
+	if (option->plugin_alias) {
+		plugin = strdup(option->plugin_alias);
+		if (!plugin)
+			return -1;
+	} else {
+		char *p;
+		plugin = strdup(file);
+		if (!plugin)
+			return -1;
+		p = strstr(plugin, ".");
+		if (p)
+			*p = '\0';
+	}
+
+	/* first look for named options */
+	for (op = trace_plugin_options; op; op = op->next) {
+		if (!op->plugin)
+			continue;
+		if (strcmp(op->plugin, plugin) != 0)
+			continue;
+		if (strcmp(op->option, option->name) != 0)
+			continue;
+
+		option->value = op->value;
+		option->set ^= 1;
+		goto out;
+	}
+
+	/* first look for unnamed options */
+	for (op = trace_plugin_options; op; op = op->next) {
+		if (op->plugin)
+			continue;
+		if (strcmp(op->option, option->name) != 0)
+			continue;
+
+		option->value = op->value;
+		option->set ^= 1;
+		break;
+	}
+
+ out:
+	free(plugin);
+	return 0;
+}
+
+/**
+ * traceevent_plugin_add_options - Add a set of options by a plugin
+ * @name: The name of the plugin adding the options
+ * @options: The set of options being loaded
+ *
+ * Sets the options with the values that have been added by user.
+ */
+int traceevent_plugin_add_options(const char *name,
+				  struct pevent_plugin_option *options)
+{
+	struct registered_plugin_options *reg;
+
+	reg = malloc(sizeof(*reg));
+	if (!reg)
+		return -1;
+	reg->next = registered_options;
+	reg->options = options;
+	registered_options = reg;
+
+	while (options->name) {
+		update_option(name, options);
+		options++;
+	}
+	return 0;
+}
+
+/**
+ * traceevent_plugin_remove_options - remove plugin options that were registered
+ * @options: Options to removed that were registered with traceevent_plugin_add_options
+ */
+void traceevent_plugin_remove_options(struct pevent_plugin_option *options)
+{
+	struct registered_plugin_options **last;
+	struct registered_plugin_options *reg;
+
+	for (last = &registered_options; *last; last = &(*last)->next) {
+		if ((*last)->options == options) {
+			reg = *last;
+			*last = reg->next;
+			free(reg);
+			return;
+		}
+	}
+			
+}
+
+/**
+ * traceevent_print_plugins - print out the list of plugins loaded
+ * @s: the trace_seq descripter to write to
+ * @prefix: The prefix string to add before listing the option name
+ * @suffix: The suffix string ot append after the option name
+ * @list: The list of plugins (usually returned by traceevent_load_plugins()
+ *
+ * Writes to the trace_seq @s the list of plugins (files) that is
+ * returned by traceevent_load_plugins(). Use @prefix and @suffix for formating:
+ * @prefix = "  ", @suffix = "\n".
+ */
+void traceevent_print_plugins(struct trace_seq *s,
+			      const char *prefix, const char *suffix,
+			      const struct plugin_list *list)
+{
+	while (list) {
+		trace_seq_printf(s, "%s%s%s", prefix, list->name, suffix);
+		list = list->next;
+	}
+}
+
 static void
 load_plugin(struct pevent *pevent, const char *path,
 	    const char *file, void *data)
-- 
2.0.0.rc2



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 3/4] tools lib traceevent: Add options to function plugin
  2014-06-03  3:20 [PATCH v2 0/4] tools lib traceevent: bitmask handling and plugin updates Steven Rostedt
  2014-06-03  3:20 ` [PATCH v2 1/4] tools lib traceevent: Add flag to not load event plugins Steven Rostedt
  2014-06-03  3:20 ` [PATCH v2 2/4] tools lib traceevent: Add options to plugins Steven Rostedt
@ 2014-06-03  3:20 ` Steven Rostedt
  2014-06-12 12:00   ` [tip:perf/core] " tip-bot for Steven Rostedt (Red Hat)
  2014-06-03  3:20 ` [PATCH v2 4/4] tools lib traceevent: Added support for __get_bitmask() macro Steven Rostedt
  3 siblings, 1 reply; 48+ messages in thread
From: Steven Rostedt @ 2014-06-03  3:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jiri Olsa, Namhyung Kim, Ingo Molnar, Andrew Morton

[-- Attachment #1: 0003-tools-lib-traceevent-Add-options-to-function-plugin.patch --]
[-- Type: text/plain, Size: 4194 bytes --]

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

Add the options "parent" and "indent" to the function plugin.

When parent is set, the output looks like this:

function:             fsnotify_modify <-- vfs_write
function:             zone_statistics <-- get_page_from_freelist
function:                __inc_zone_state <-- zone_statistics
function:                inotify_inode_queue_event <-- fsnotify_modify
function:                fsnotify_parent <-- fsnotify_modify
function:                __inc_zone_state <-- zone_statistics
function:                   __fsnotify_parent <-- fsnotify_parent
function:                   inotify_dentry_parent_queue_event <-- fsnotify_parent
function:             add_to_page_cache_lru <-- do_read_cache_page

When it's not set, it looks like:

function:             fsnotify_modify
function:             zone_statistics
function:                __inc_zone_state
function:                inotify_inode_queue_event
function:                fsnotify_parent
function:                __inc_zone_state
function:                   __fsnotify_parent
function:                   inotify_dentry_parent_queue_event
function:             add_to_page_cache_lru

When the otpion "indent" is not set, it looks like this:

function:             fsnotify_modify <-- vfs_write
function:             zone_statistics <-- get_page_from_freelist
function:             __inc_zone_state <-- zone_statistics
function:             inotify_inode_queue_event <-- fsnotify_modify
function:             fsnotify_parent <-- fsnotify_modify
function:             __inc_zone_state <-- zone_statistics
function:             __fsnotify_parent <-- fsnotify_parent
function:             inotify_dentry_parent_queue_event <-- fsnotify_parent
function:             add_to_page_cache_lru <-- do_read_cache_page

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 tools/lib/traceevent/plugin_function.c | 43 +++++++++++++++++++++++++++++-----
 1 file changed, 37 insertions(+), 6 deletions(-)

diff --git a/tools/lib/traceevent/plugin_function.c b/tools/lib/traceevent/plugin_function.c
index 80ba4ff1fe84..a00ec190821a 100644
--- a/tools/lib/traceevent/plugin_function.c
+++ b/tools/lib/traceevent/plugin_function.c
@@ -33,6 +33,29 @@ static int cpus = -1;
 
 #define STK_BLK 10
 
+struct pevent_plugin_option plugin_options[] =
+{
+	{
+		.name = "parent",
+		.plugin_alias = "ftrace",
+		.description =
+		"Print parent of functions for function events",
+	},
+	{
+		.name = "indent",
+		.plugin_alias = "ftrace",
+		.description =
+		"Try to show function call indents, based on parents",
+		.set = 1,
+	},
+	{
+		.name = NULL,
+	}
+};
+
+static struct pevent_plugin_option *ftrace_parent = &plugin_options[0];
+static struct pevent_plugin_option *ftrace_indent = &plugin_options[1];
+
 static void add_child(struct func_stack *stack, const char *child, int pos)
 {
 	int i;
@@ -119,7 +142,8 @@ static int function_handler(struct trace_seq *s, struct pevent_record *record,
 
 	parent = pevent_find_function(pevent, pfunction);
 
-	index = add_and_get_index(parent, func, record->cpu);
+	if (parent && ftrace_indent->set)
+		index = add_and_get_index(parent, func, record->cpu);
 
 	trace_seq_printf(s, "%*s", index*3, "");
 
@@ -128,11 +152,13 @@ static int function_handler(struct trace_seq *s, struct pevent_record *record,
 	else
 		trace_seq_printf(s, "0x%llx", function);
 
-	trace_seq_printf(s, " <-- ");
-	if (parent)
-		trace_seq_printf(s, "%s", parent);
-	else
-		trace_seq_printf(s, "0x%llx", pfunction);
+	if (ftrace_parent->set) {
+		trace_seq_printf(s, " <-- ");
+		if (parent)
+			trace_seq_printf(s, "%s", parent);
+		else
+			trace_seq_printf(s, "0x%llx", pfunction);
+	}
 
 	return 0;
 }
@@ -141,6 +167,9 @@ int PEVENT_PLUGIN_LOADER(struct pevent *pevent)
 {
 	pevent_register_event_handler(pevent, -1, "ftrace", "function",
 				      function_handler, NULL);
+
+	traceevent_plugin_add_options("ftrace", plugin_options);
+
 	return 0;
 }
 
@@ -157,6 +186,8 @@ void PEVENT_PLUGIN_UNLOADER(struct pevent *pevent)
 		free(fstack[i].stack);
 	}
 
+	traceevent_plugin_remove_options(plugin_options);
+
 	free(fstack);
 	fstack = NULL;
 	cpus = -1;
-- 
2.0.0.rc2



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 4/4] tools lib traceevent: Added support for __get_bitmask() macro
  2014-06-03  3:20 [PATCH v2 0/4] tools lib traceevent: bitmask handling and plugin updates Steven Rostedt
                   ` (2 preceding siblings ...)
  2014-06-03  3:20 ` [PATCH v2 3/4] tools lib traceevent: Add options to function plugin Steven Rostedt
@ 2014-06-03  3:20 ` Steven Rostedt
  2014-06-12 12:01   ` [tip:perf/core] " tip-bot for Steven Rostedt (Red Hat)
  3 siblings, 1 reply; 48+ messages in thread
From: Steven Rostedt @ 2014-06-03  3:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jiri Olsa, Namhyung Kim, Ingo Molnar, Andrew Morton, Javi Merino

[-- Attachment #1: 0004-tools-lib-traceevent-Added-support-for-__get_bitmask.patch --]
[-- Type: text/plain, Size: 8035 bytes --]

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

Coming in v3.16, trace events will be able to save bitmasks in raw
format in the ring buffer and output it with the __get_bitmask() macro.

In order for userspace tools to parse this, it must be able to handle
the __get_bitmask() call and be able to convert the data that's in
the ring buffer into a nice bitmask format. The output is similar to
what the kernel uses to print bitmasks, with a comma separator every
4 bytes (8 characters).

This allows for cpumasks to also be saved efficiently.

The first user is the thermal:thermal_power_limit event which has the
following output:

 thermal_power_limit:  cpus=0000000f freq=1900000 cdev_state=0 power=5252

Link: http://lkml.kernel.org/r/20140506132238.22e136d1@gandalf.local.home

Suggested-by: Javi Merino <javi.merino@arm.com>
Tested-by: Javi Merino <javi.merino@arm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 tools/lib/traceevent/event-parse.c                 | 113 +++++++++++++++++++++
 tools/lib/traceevent/event-parse.h                 |   7 ++
 .../perf/util/scripting-engines/trace-event-perl.c |   1 +
 .../util/scripting-engines/trace-event-python.c    |   1 +
 4 files changed, 122 insertions(+)

diff --git a/tools/lib/traceevent/event-parse.c b/tools/lib/traceevent/event-parse.c
index b83184f2d484..93825a17dcce 100644
--- a/tools/lib/traceevent/event-parse.c
+++ b/tools/lib/traceevent/event-parse.c
@@ -765,6 +765,9 @@ static void free_arg(struct print_arg *arg)
 	case PRINT_BSTRING:
 		free(arg->string.string);
 		break;
+	case PRINT_BITMASK:
+		free(arg->bitmask.bitmask);
+		break;
 	case PRINT_DYNAMIC_ARRAY:
 		free(arg->dynarray.index);
 		break;
@@ -2268,6 +2271,7 @@ static int arg_num_eval(struct print_arg *arg, long long *val)
 	case PRINT_FIELD ... PRINT_SYMBOL:
 	case PRINT_STRING:
 	case PRINT_BSTRING:
+	case PRINT_BITMASK:
 	default:
 		do_warning("invalid eval type %d", arg->type);
 		ret = 0;
@@ -2296,6 +2300,7 @@ static char *arg_eval (struct print_arg *arg)
 	case PRINT_FIELD ... PRINT_SYMBOL:
 	case PRINT_STRING:
 	case PRINT_BSTRING:
+	case PRINT_BITMASK:
 	default:
 		do_warning("invalid eval type %d", arg->type);
 		break;
@@ -2683,6 +2688,35 @@ process_str(struct event_format *event __maybe_unused, struct print_arg *arg,
 	return EVENT_ERROR;
 }
 
+static enum event_type
+process_bitmask(struct event_format *event __maybe_unused, struct print_arg *arg,
+	    char **tok)
+{
+	enum event_type type;
+	char *token;
+
+	if (read_expect_type(EVENT_ITEM, &token) < 0)
+		goto out_free;
+
+	arg->type = PRINT_BITMASK;
+	arg->bitmask.bitmask = token;
+	arg->bitmask.offset = -1;
+
+	if (read_expected(EVENT_DELIM, ")") < 0)
+		goto out_err;
+
+	type = read_token(&token);
+	*tok = token;
+
+	return type;
+
+ out_free:
+	free_token(token);
+ out_err:
+	*tok = NULL;
+	return EVENT_ERROR;
+}
+
 static struct pevent_function_handler *
 find_func_handler(struct pevent *pevent, char *func_name)
 {
@@ -2797,6 +2831,10 @@ process_function(struct event_format *event, struct print_arg *arg,
 		free_token(token);
 		return process_str(event, arg, tok);
 	}
+	if (strcmp(token, "__get_bitmask") == 0) {
+		free_token(token);
+		return process_bitmask(event, arg, tok);
+	}
 	if (strcmp(token, "__get_dynamic_array") == 0) {
 		free_token(token);
 		return process_dynamic_array(event, arg, tok);
@@ -3324,6 +3362,7 @@ eval_num_arg(void *data, int size, struct event_format *event, struct print_arg
 		return eval_type(val, arg, 0);
 	case PRINT_STRING:
 	case PRINT_BSTRING:
+	case PRINT_BITMASK:
 		return 0;
 	case PRINT_FUNC: {
 		struct trace_seq s;
@@ -3556,6 +3595,60 @@ static void print_str_to_seq(struct trace_seq *s, const char *format,
 		trace_seq_printf(s, format, str);
 }
 
+static void print_bitmask_to_seq(struct pevent *pevent,
+				 struct trace_seq *s, const char *format,
+				 int len_arg, const void *data, int size)
+{
+	int nr_bits = size * 8;
+	int str_size = (nr_bits + 3) / 4;
+	int len = 0;
+	char buf[3];
+	char *str;
+	int index;
+	int i;
+
+	/*
+	 * The kernel likes to put in commas every 32 bits, we
+	 * can do the same.
+	 */
+	str_size += (nr_bits - 1) / 32;
+
+	str = malloc(str_size + 1);
+	if (!str) {
+		do_warning("%s: not enough memory!", __func__);
+		return;
+	}
+	str[str_size] = 0;
+
+	/* Start out with -2 for the two chars per byte */
+	for (i = str_size - 2; i >= 0; i -= 2) {
+		/*
+		 * data points to a bit mask of size bytes.
+		 * In the kernel, this is an array of long words, thus
+		 * endianess is very important.
+		 */
+		if (pevent->file_bigendian)
+			index = size - (len + 1);
+		else
+			index = len;
+
+		snprintf(buf, 3, "%02x", *((unsigned char *)data + index));
+		memcpy(str + i, buf, 2);
+		len++;
+		if (!(len & 3) && i > 0) {
+			i--;
+			str[i] = ',';
+		}
+	}
+
+	if (len_arg >= 0)
+		trace_seq_printf(s, format, len_arg, str);
+	else
+		trace_seq_printf(s, format, str);
+
+	free(str);
+}
+
 static void print_str_arg(struct trace_seq *s, void *data, int size,
 			  struct event_format *event, const char *format,
 			  int len_arg, struct print_arg *arg)
@@ -3691,6 +3784,23 @@ static void print_str_arg(struct trace_seq *s, void *data, int size,
 	case PRINT_BSTRING:
 		print_str_to_seq(s, format, len_arg, arg->string.string);
 		break;
+	case PRINT_BITMASK: {
+		int bitmask_offset;
+		int bitmask_size;
+
+		if (arg->bitmask.offset == -1) {
+			struct format_field *f;
+
+			f = pevent_find_any_field(event, arg->bitmask.bitmask);
+			arg->bitmask.offset = f->offset;
+		}
+		bitmask_offset = data2host4(pevent, data + arg->bitmask.offset);
+		bitmask_size = bitmask_offset >> 16;
+		bitmask_offset &= 0xffff;
+		print_bitmask_to_seq(pevent, s, format, len_arg,
+				     data + bitmask_offset, bitmask_size);
+		break;
+	}
 	case PRINT_OP:
 		/*
 		 * The only op for string should be ? :
@@ -4822,6 +4932,9 @@ static void print_args(struct print_arg *args)
 	case PRINT_BSTRING:
 		printf("__get_str(%s)", args->string.string);
 		break;
+	case PRINT_BITMASK:
+		printf("__get_bitmask(%s)", args->bitmask.bitmask);
+		break;
 	case PRINT_TYPE:
 		printf("(%s)", args->typecast.type);
 		print_args(args->typecast.item);
diff --git a/tools/lib/traceevent/event-parse.h b/tools/lib/traceevent/event-parse.h
index 56e0e6c12411..7a3873ff9a4f 100644
--- a/tools/lib/traceevent/event-parse.h
+++ b/tools/lib/traceevent/event-parse.h
@@ -208,6 +208,11 @@ struct print_arg_string {
 	int			offset;
 };
 
+struct print_arg_bitmask {
+	char			*bitmask;
+	int			offset;
+};
+
 struct print_arg_field {
 	char			*name;
 	struct format_field	*field;
@@ -274,6 +279,7 @@ enum print_arg_type {
 	PRINT_DYNAMIC_ARRAY,
 	PRINT_OP,
 	PRINT_FUNC,
+	PRINT_BITMASK,
 };
 
 struct print_arg {
@@ -288,6 +294,7 @@ struct print_arg {
 		struct print_arg_hex		hex;
 		struct print_arg_func		func;
 		struct print_arg_string		string;
+		struct print_arg_bitmask	bitmask;
 		struct print_arg_op		op;
 		struct print_arg_dynarray	dynarray;
 	};
diff --git a/tools/perf/util/scripting-engines/trace-event-perl.c b/tools/perf/util/scripting-engines/trace-event-perl.c
index e108207c5de0..af7da565a750 100644
--- a/tools/perf/util/scripting-engines/trace-event-perl.c
+++ b/tools/perf/util/scripting-engines/trace-event-perl.c
@@ -215,6 +215,7 @@ static void define_event_symbols(struct event_format *event,
 	case PRINT_BSTRING:
 	case PRINT_DYNAMIC_ARRAY:
 	case PRINT_STRING:
+	case PRINT_BITMASK:
 		break;
 	case PRINT_TYPE:
 		define_event_symbols(event, ev_name, args->typecast.item);
diff --git a/tools/perf/util/scripting-engines/trace-event-python.c b/tools/perf/util/scripting-engines/trace-event-python.c
index cd9774df3750..c3de0971bfdd 100644
--- a/tools/perf/util/scripting-engines/trace-event-python.c
+++ b/tools/perf/util/scripting-engines/trace-event-python.c
@@ -197,6 +197,7 @@ static void define_event_symbols(struct event_format *event,
 	case PRINT_BSTRING:
 	case PRINT_DYNAMIC_ARRAY:
 	case PRINT_FUNC:
+	case PRINT_BITMASK:
 		/* we should warn... */
 		return;
 	}
-- 
2.0.0.rc2



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 2/4] tools lib traceevent: Add options to plugins
  2014-06-03  3:20 ` [PATCH v2 2/4] tools lib traceevent: Add options to plugins Steven Rostedt
@ 2014-06-03  6:51   ` Namhyung Kim
  2014-06-03 22:41     ` [PATCH v3 " Steven Rostedt
  2014-06-04 11:42     ` [PATCH v2 2/4] " Jiri Olsa
  0 siblings, 2 replies; 48+ messages in thread
From: Namhyung Kim @ 2014-06-03  6:51 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: linux-kernel, Jiri Olsa, Ingo Molnar, Andrew Morton

On Mon, 02 Jun 2014 23:20:14 -0400, Steven Rostedt wrote:
> +void traceevent_plugin_free_options_list(char **list)
> +{
> +	int i;
> +
> +	if (!list)
> +		return;
> +
> +	if (list == (char **)((unsigned long)-1))

It also should be:

	if (list == INVALID_PLUGIN_LIST_OPTION)


Other than that, the series looks good to me now!

Thanks,
Namhyung


> +		return;
> +
> +	for (i = 0; list[i]; i++)
> +		free(list[i]);
> +
> +	free(list);
> +}

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v3 2/4] tools lib traceevent: Add options to plugins
  2014-06-03  6:51   ` Namhyung Kim
@ 2014-06-03 22:41     ` Steven Rostedt
  2014-06-03 22:43       ` Steven Rostedt
  2014-06-12 12:00       ` [tip:perf/core] " tip-bot for Steven Rostedt
  2014-06-04 11:42     ` [PATCH v2 2/4] " Jiri Olsa
  1 sibling, 2 replies; 48+ messages in thread
From: Steven Rostedt @ 2014-06-03 22:41 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: Namhyung Kim, linux-kernel, Ingo Molnar, Andrew Morton


The traceevent plugins allows developers to have their events print out
information that is more advanced than what can be achieved by the
trace event format files.

As these plugins are used on the userspace side of the tracing tools, it
is only logical that the tools should be able to produce different types
of output for the events. The types of events still need to be defined by
the plugins thus we need a way to pass information from the tool to the
plugin to specify what type of information to be shown.

Not only does the information need to be passed by the tool to plugin, but
the plugin also requires a way to notify the tool of what options it can
provide.

This builds the plugin option infrastructure that is taken from trace-cmd
that is used to allow plugins to produce different output based on the
options specified by the tool.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 tools/lib/traceevent/event-parse.h  |  16 ++-
 tools/lib/traceevent/event-plugin.c | 197 ++++++++++++++++++++++++++++++++++++
 2 files changed, 210 insertions(+), 3 deletions(-)

diff --git a/tools/lib/traceevent/event-parse.h b/tools/lib/traceevent/event-parse.h
index a68ec3d8289f..56e0e6c12411 100644
--- a/tools/lib/traceevent/event-parse.h
+++ b/tools/lib/traceevent/event-parse.h
@@ -107,8 +107,8 @@ typedef int (*pevent_event_handler_func)(struct trace_seq *s,
 typedef int (*pevent_plugin_load_func)(struct pevent *pevent);
 typedef int (*pevent_plugin_unload_func)(struct pevent *pevent);
 
-struct plugin_option {
-	struct plugin_option		*next;
+struct pevent_plugin_option {
+	struct pevent_plugin_option	*next;
 	void				*handle;
 	char				*file;
 	char				*name;
@@ -135,7 +135,7 @@ struct plugin_option {
  * PEVENT_PLUGIN_OPTIONS:  (optional)
  *   Plugin options that can be set before loading
  *
- *   struct plugin_option PEVENT_PLUGIN_OPTIONS[] = {
+ *   struct pevent_plugin_option PEVENT_PLUGIN_OPTIONS[] = {
  *	{
  *		.name = "option-name",
  *		.plugin_alias = "overide-file-name", (optional)
@@ -412,9 +412,19 @@ enum pevent_errno {
 
 struct plugin_list;
 
+#define INVALID_PLUGIN_LIST_OPTION	((char **)((unsigned long)-1))
+
 struct plugin_list *traceevent_load_plugins(struct pevent *pevent);
 void traceevent_unload_plugins(struct plugin_list *plugin_list,
 			       struct pevent *pevent);
+char **traceevent_plugin_list_options(void);
+void traceevent_plugin_free_options_list(char **list);
+int traceevent_plugin_add_options(const char *name,
+				  struct pevent_plugin_option *options);
+void traceevent_plugin_remove_options(struct pevent_plugin_option *options);
+void traceevent_print_plugins(struct trace_seq *s,
+			      const char *prefix, const char *suffix,
+			      const struct plugin_list *list);
 
 struct cmdline;
 struct cmdline_list;
diff --git a/tools/lib/traceevent/event-plugin.c b/tools/lib/traceevent/event-plugin.c
index 317466bd1a37..db7765be1bb6 100644
--- a/tools/lib/traceevent/event-plugin.c
+++ b/tools/lib/traceevent/event-plugin.c
@@ -18,6 +18,7 @@
  * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  */
 
+#include <stdio.h>
 #include <string.h>
 #include <dlfcn.h>
 #include <stdlib.h>
@@ -30,12 +31,208 @@
 
 #define LOCAL_PLUGIN_DIR ".traceevent/plugins"
 
+static struct registered_plugin_options {
+	struct registered_plugin_options	*next;
+	struct pevent_plugin_option		*options;
+} *registered_options;
+
+static struct trace_plugin_options {
+	struct trace_plugin_options	*next;
+	char				*plugin;
+	char				*option;
+	char				*value;
+} *trace_plugin_options;
+
 struct plugin_list {
 	struct plugin_list	*next;
 	char			*name;
 	void			*handle;
 };
 
+/**
+ * traceevent_plugin_list_options - get list of plugin options
+ *
+ * Returns an array of char strings that list the currently registered
+ * plugin options in the format of <plugin>:<option>. This list can be
+ * used by toggling the option.
+ *
+ * Returns NULL if there's no options registered. On error it returns
+ * INVALID_PLUGIN_LIST_OPTION
+ *
+ * Must be freed with traceevent_plugin_free_options_list().
+ */
+char **traceevent_plugin_list_options(void)
+{
+	struct registered_plugin_options *reg;
+	struct pevent_plugin_option *op;
+	char **list = NULL;
+	char *name;
+	int count = 0;
+
+	for (reg = registered_options; reg; reg = reg->next) {
+		for (op = reg->options; op->name; op++) {
+			char *alias = op->plugin_alias ? op->plugin_alias : op->file;
+			char **temp = list;
+
+			name = malloc(strlen(op->name) + strlen(alias) + 2);
+			if (!name)
+				goto err;
+
+			sprintf(name, "%s:%s", alias, op->name);
+			list = realloc(list, count + 2);
+			if (!list) {
+				list = temp;
+				free(name);
+				goto err;
+			}
+			list[count++] = name;
+			list[count] = NULL;
+		}
+	}
+	return list;
+
+ err:
+	while (--count >= 0)
+		free(list[count]);
+	free(list);
+
+	return INVALID_PLUGIN_LIST_OPTION;
+}
+
+void traceevent_plugin_free_options_list(char **list)
+{
+	int i;
+
+	if (!list)
+		return;
+
+	if (list == INVALID_PLUGIN_LIST_OPTION)
+		return;
+
+	for (i = 0; list[i]; i++)
+		free(list[i]);
+
+	free(list);
+}
+
+static int
+update_option(const char *file, struct pevent_plugin_option *option)
+{
+	struct trace_plugin_options *op;
+	char *plugin;
+
+	if (option->plugin_alias) {
+		plugin = strdup(option->plugin_alias);
+		if (!plugin)
+			return -1;
+	} else {
+		char *p;
+		plugin = strdup(file);
+		if (!plugin)
+			return -1;
+		p = strstr(plugin, ".");
+		if (p)
+			*p = '\0';
+	}
+
+	/* first look for named options */
+	for (op = trace_plugin_options; op; op = op->next) {
+		if (!op->plugin)
+			continue;
+		if (strcmp(op->plugin, plugin) != 0)
+			continue;
+		if (strcmp(op->option, option->name) != 0)
+			continue;
+
+		option->value = op->value;
+		option->set ^= 1;
+		goto out;
+	}
+
+	/* first look for unnamed options */
+	for (op = trace_plugin_options; op; op = op->next) {
+		if (op->plugin)
+			continue;
+		if (strcmp(op->option, option->name) != 0)
+			continue;
+
+		option->value = op->value;
+		option->set ^= 1;
+		break;
+	}
+
+ out:
+	free(plugin);
+	return 0;
+}
+
+/**
+ * traceevent_plugin_add_options - Add a set of options by a plugin
+ * @name: The name of the plugin adding the options
+ * @options: The set of options being loaded
+ *
+ * Sets the options with the values that have been added by user.
+ */
+int traceevent_plugin_add_options(const char *name,
+				  struct pevent_plugin_option *options)
+{
+	struct registered_plugin_options *reg;
+
+	reg = malloc(sizeof(*reg));
+	if (!reg)
+		return -1;
+	reg->next = registered_options;
+	reg->options = options;
+	registered_options = reg;
+
+	while (options->name) {
+		update_option(name, options);
+		options++;
+	}
+	return 0;
+}
+
+/**
+ * traceevent_plugin_remove_options - remove plugin options that were registered
+ * @options: Options to removed that were registered with traceevent_plugin_add_options
+ */
+void traceevent_plugin_remove_options(struct pevent_plugin_option *options)
+{
+	struct registered_plugin_options **last;
+	struct registered_plugin_options *reg;
+
+	for (last = &registered_options; *last; last = &(*last)->next) {
+		if ((*last)->options == options) {
+			reg = *last;
+			*last = reg->next;
+			free(reg);
+			return;
+		}
+	}
+			
+}
+
+/**
+ * traceevent_print_plugins - print out the list of plugins loaded
+ * @s: the trace_seq descripter to write to
+ * @prefix: The prefix string to add before listing the option name
+ * @suffix: The suffix string ot append after the option name
+ * @list: The list of plugins (usually returned by traceevent_load_plugins()
+ *
+ * Writes to the trace_seq @s the list of plugins (files) that is
+ * returned by traceevent_load_plugins(). Use @prefix and @suffix for formating:
+ * @prefix = "  ", @suffix = "\n".
+ */
+void traceevent_print_plugins(struct trace_seq *s,
+			      const char *prefix, const char *suffix,
+			      const struct plugin_list *list)
+{
+	while (list) {
+		trace_seq_printf(s, "%s%s%s", prefix, list->name, suffix);
+		list = list->next;
+	}
+}
+
 static void
 load_plugin(struct pevent *pevent, const char *path,
 	    const char *file, void *data)
-- 
2.0.0.rc2


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 2/4] tools lib traceevent: Add options to plugins
  2014-06-03 22:41     ` [PATCH v3 " Steven Rostedt
@ 2014-06-03 22:43       ` Steven Rostedt
  2014-06-04  9:33         ` Jiri Olsa
  2014-06-12 12:00       ` [tip:perf/core] " tip-bot for Steven Rostedt
  1 sibling, 1 reply; 48+ messages in thread
From: Steven Rostedt @ 2014-06-03 22:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jiri Olsa, Namhyung Kim, linux-kernel, Ingo Molnar, Andrew Morton

On Tue, 3 Jun 2014 18:41:54 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> 
> The traceevent plugins allows developers to have their events print out
> information that is more advanced than what can be achieved by the
> trace event format files.
> 
> As these plugins are used on the userspace side of the tracing tools, it
> is only logical that the tools should be able to produce different types
> of output for the events. The types of events still need to be defined by
> the plugins thus we need a way to pass information from the tool to the
> plugin to specify what type of information to be shown.
> 
> Not only does the information need to be passed by the tool to plugin, but
> the plugin also requires a way to notify the tool of what options it can
> provide.
> 
> This builds the plugin option infrastructure that is taken from trace-cmd
> that is used to allow plugins to produce different output based on the
> options specified by the tool.
> 
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Jiri,

I only sent this patch for v3, as the 1,3,4 are fine. Do you want me to
send a full series? Or can you just pull in v2 of patches 1,3 and 4 and
then v3 of this one?

-- Steve

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 2/4] tools lib traceevent: Add options to plugins
  2014-06-03 22:43       ` Steven Rostedt
@ 2014-06-04  9:33         ` Jiri Olsa
  0 siblings, 0 replies; 48+ messages in thread
From: Jiri Olsa @ 2014-06-04  9:33 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Namhyung Kim, linux-kernel, Ingo Molnar, Andrew Morton

On Tue, Jun 03, 2014 at 06:43:10PM -0400, Steven Rostedt wrote:
> On Tue, 3 Jun 2014 18:41:54 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > 
> > The traceevent plugins allows developers to have their events print out
> > information that is more advanced than what can be achieved by the
> > trace event format files.
> > 
> > As these plugins are used on the userspace side of the tracing tools, it
> > is only logical that the tools should be able to produce different types
> > of output for the events. The types of events still need to be defined by
> > the plugins thus we need a way to pass information from the tool to the
> > plugin to specify what type of information to be shown.
> > 
> > Not only does the information need to be passed by the tool to plugin, but
> > the plugin also requires a way to notify the tool of what options it can
> > provide.
> > 
> > This builds the plugin option infrastructure that is taken from trace-cmd
> > that is used to allow plugins to produce different output based on the
> > options specified by the tool.
> > 
> > Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> 
> Jiri,
> 
> I only sent this patch for v3, as the 1,3,4 are fine. Do you want me to
> send a full series? Or can you just pull in v2 of patches 1,3 and 4 and
> then v3 of this one?

no need to resend, I'll take it from here

jirka

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 2/4] tools lib traceevent: Add options to plugins
  2014-06-03  6:51   ` Namhyung Kim
  2014-06-03 22:41     ` [PATCH v3 " Steven Rostedt
@ 2014-06-04 11:42     ` Jiri Olsa
  2014-06-04 14:00       ` Namhyung Kim
  1 sibling, 1 reply; 48+ messages in thread
From: Jiri Olsa @ 2014-06-04 11:42 UTC (permalink / raw)
  To: Namhyung Kim; +Cc: Steven Rostedt, linux-kernel, Ingo Molnar, Andrew Morton

On Tue, Jun 03, 2014 at 03:51:22PM +0900, Namhyung Kim wrote:
> On Mon, 02 Jun 2014 23:20:14 -0400, Steven Rostedt wrote:
> > +void traceevent_plugin_free_options_list(char **list)
> > +{
> > +	int i;
> > +
> > +	if (!list)
> > +		return;
> > +
> > +	if (list == (char **)((unsigned long)-1))
> 
> It also should be:
> 
> 	if (list == INVALID_PLUGIN_LIST_OPTION)
> 
> 
> Other than that, the series looks good to me now!

can I use your Acked-by?

thanks,
jirka

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 2/4] tools lib traceevent: Add options to plugins
  2014-06-04 11:42     ` [PATCH v2 2/4] " Jiri Olsa
@ 2014-06-04 14:00       ` Namhyung Kim
  0 siblings, 0 replies; 48+ messages in thread
From: Namhyung Kim @ 2014-06-04 14:00 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: Steven Rostedt, linux-kernel, Ingo Molnar, Andrew Morton

Hi Jiri,

2014-06-04 (수), 13:42 +0200, Jiri Olsa:
> On Tue, Jun 03, 2014 at 03:51:22PM +0900, Namhyung Kim wrote:
> > On Mon, 02 Jun 2014 23:20:14 -0400, Steven Rostedt wrote:
> > > +void traceevent_plugin_free_options_list(char **list)
> > > +{
> > > +	int i;
> > > +
> > > +	if (!list)
> > > +		return;
> > > +
> > > +	if (list == (char **)((unsigned long)-1))
> > 
> > It also should be:
> > 
> > 	if (list == INVALID_PLUGIN_LIST_OPTION)
> > 
> > 
> > Other than that, the series looks good to me now!
> 
> can I use your Acked-by?

Sure thing!

Acked-by: Namhyung Kim <namhyung@kernel.org>

Thanks,
Namhyung



^ permalink raw reply	[flat|nested] 48+ messages in thread

* [tip:perf/core] tools lib traceevent: Add flag to not load event plugins
  2014-06-03  3:20 ` [PATCH v2 1/4] tools lib traceevent: Add flag to not load event plugins Steven Rostedt
@ 2014-06-12 12:00   ` tip-bot for Steven Rostedt (Red Hat)
  0 siblings, 0 replies; 48+ messages in thread
From: tip-bot for Steven Rostedt (Red Hat) @ 2014-06-12 12:00 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, jolsa, rostedt, tglx, namhyung

Commit-ID:  a7c3196c79051f9e1a498f5be8fe6870bde5e55d
Gitweb:     http://git.kernel.org/tip/a7c3196c79051f9e1a498f5be8fe6870bde5e55d
Author:     Steven Rostedt (Red Hat) <rostedt@goodmis.org>
AuthorDate: Mon, 2 Jun 2014 23:20:13 -0400
Committer:  Jiri Olsa <jolsa@kernel.org>
CommitDate: Sat, 7 Jun 2014 23:33:36 +0200

tools lib traceevent: Add flag to not load event plugins

Add a flag to pevent that will let the callers be able to set it and
keep the system, and perhaps even normal plugins from being loaded.

This is useful when plugins might hide certain information and seeing
the raw events shows what may be going on.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/20140603032223.678098063@goodmis.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 tools/lib/traceevent/event-parse.h  | 2 ++
 tools/lib/traceevent/event-plugin.c | 7 ++++++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/tools/lib/traceevent/event-parse.h b/tools/lib/traceevent/event-parse.h
index feab942..a68ec3d 100644
--- a/tools/lib/traceevent/event-parse.h
+++ b/tools/lib/traceevent/event-parse.h
@@ -354,6 +354,8 @@ enum pevent_func_arg_type {
 
 enum pevent_flag {
 	PEVENT_NSEC_OUTPUT		= 1,	/* output in NSECS */
+	PEVENT_DISABLE_SYS_PLUGINS	= 1 << 1,
+	PEVENT_DISABLE_PLUGINS		= 1 << 2,
 };
 
 #define PEVENT_ERRORS 							      \
diff --git a/tools/lib/traceevent/event-plugin.c b/tools/lib/traceevent/event-plugin.c
index 0c8bf67..317466b 100644
--- a/tools/lib/traceevent/event-plugin.c
+++ b/tools/lib/traceevent/event-plugin.c
@@ -148,12 +148,17 @@ load_plugins(struct pevent *pevent, const char *suffix,
 	char *path;
 	char *envdir;
 
+	if (pevent->flags & PEVENT_DISABLE_PLUGINS)
+		return;
+
 	/*
 	 * If a system plugin directory was defined,
 	 * check that first.
 	 */
 #ifdef PLUGIN_DIR
-	load_plugins_dir(pevent, suffix, PLUGIN_DIR, load_plugin, data);
+	if (!(pevent->flags & PEVENT_DISABLE_SYS_PLUGINS))
+		load_plugins_dir(pevent, suffix, PLUGIN_DIR,
+				 load_plugin, data);
 #endif
 
 	/*

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [tip:perf/core] tools lib traceevent: Add options to plugins
  2014-06-03 22:41     ` [PATCH v3 " Steven Rostedt
  2014-06-03 22:43       ` Steven Rostedt
@ 2014-06-12 12:00       ` tip-bot for Steven Rostedt
  1 sibling, 0 replies; 48+ messages in thread
From: tip-bot for Steven Rostedt @ 2014-06-12 12:00 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, jolsa, rostedt, tglx, namhyung

Commit-ID:  5827f2faabe40cc285cc67b697277547a19b6c9a
Gitweb:     http://git.kernel.org/tip/5827f2faabe40cc285cc67b697277547a19b6c9a
Author:     Steven Rostedt <rostedt@goodmis.org>
AuthorDate: Tue, 3 Jun 2014 18:41:54 -0400
Committer:  Jiri Olsa <jolsa@kernel.org>
CommitDate: Sat, 7 Jun 2014 23:33:36 +0200

tools lib traceevent: Add options to plugins

The traceevent plugins allows developers to have their events print out
information that is more advanced than what can be achieved by the
trace event format files.

As these plugins are used on the userspace side of the tracing tools, it
is only logical that the tools should be able to produce different types
of output for the events. The types of events still need to be defined by
the plugins thus we need a way to pass information from the tool to the
plugin to specify what type of information to be shown.

Not only does the information need to be passed by the tool to plugin, but
the plugin also requires a way to notify the tool of what options it can
provide.

This builds the plugin option infrastructure that is taken from trace-cmd
that is used to allow plugins to produce different output based on the
options specified by the tool.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/20140603184154.0a4c031c@gandalf.local.home
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 tools/lib/traceevent/event-parse.h  |  16 ++-
 tools/lib/traceevent/event-plugin.c | 196 ++++++++++++++++++++++++++++++++++++
 2 files changed, 209 insertions(+), 3 deletions(-)

diff --git a/tools/lib/traceevent/event-parse.h b/tools/lib/traceevent/event-parse.h
index a68ec3d..56e0e6c 100644
--- a/tools/lib/traceevent/event-parse.h
+++ b/tools/lib/traceevent/event-parse.h
@@ -107,8 +107,8 @@ typedef int (*pevent_event_handler_func)(struct trace_seq *s,
 typedef int (*pevent_plugin_load_func)(struct pevent *pevent);
 typedef int (*pevent_plugin_unload_func)(struct pevent *pevent);
 
-struct plugin_option {
-	struct plugin_option		*next;
+struct pevent_plugin_option {
+	struct pevent_plugin_option	*next;
 	void				*handle;
 	char				*file;
 	char				*name;
@@ -135,7 +135,7 @@ struct plugin_option {
  * PEVENT_PLUGIN_OPTIONS:  (optional)
  *   Plugin options that can be set before loading
  *
- *   struct plugin_option PEVENT_PLUGIN_OPTIONS[] = {
+ *   struct pevent_plugin_option PEVENT_PLUGIN_OPTIONS[] = {
  *	{
  *		.name = "option-name",
  *		.plugin_alias = "overide-file-name", (optional)
@@ -412,9 +412,19 @@ enum pevent_errno {
 
 struct plugin_list;
 
+#define INVALID_PLUGIN_LIST_OPTION	((char **)((unsigned long)-1))
+
 struct plugin_list *traceevent_load_plugins(struct pevent *pevent);
 void traceevent_unload_plugins(struct plugin_list *plugin_list,
 			       struct pevent *pevent);
+char **traceevent_plugin_list_options(void);
+void traceevent_plugin_free_options_list(char **list);
+int traceevent_plugin_add_options(const char *name,
+				  struct pevent_plugin_option *options);
+void traceevent_plugin_remove_options(struct pevent_plugin_option *options);
+void traceevent_print_plugins(struct trace_seq *s,
+			      const char *prefix, const char *suffix,
+			      const struct plugin_list *list);
 
 struct cmdline;
 struct cmdline_list;
diff --git a/tools/lib/traceevent/event-plugin.c b/tools/lib/traceevent/event-plugin.c
index 317466b..136162c 100644
--- a/tools/lib/traceevent/event-plugin.c
+++ b/tools/lib/traceevent/event-plugin.c
@@ -18,6 +18,7 @@
  * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  */
 
+#include <stdio.h>
 #include <string.h>
 #include <dlfcn.h>
 #include <stdlib.h>
@@ -30,12 +31,207 @@
 
 #define LOCAL_PLUGIN_DIR ".traceevent/plugins"
 
+static struct registered_plugin_options {
+	struct registered_plugin_options	*next;
+	struct pevent_plugin_option		*options;
+} *registered_options;
+
+static struct trace_plugin_options {
+	struct trace_plugin_options	*next;
+	char				*plugin;
+	char				*option;
+	char				*value;
+} *trace_plugin_options;
+
 struct plugin_list {
 	struct plugin_list	*next;
 	char			*name;
 	void			*handle;
 };
 
+/**
+ * traceevent_plugin_list_options - get list of plugin options
+ *
+ * Returns an array of char strings that list the currently registered
+ * plugin options in the format of <plugin>:<option>. This list can be
+ * used by toggling the option.
+ *
+ * Returns NULL if there's no options registered. On error it returns
+ * INVALID_PLUGIN_LIST_OPTION
+ *
+ * Must be freed with traceevent_plugin_free_options_list().
+ */
+char **traceevent_plugin_list_options(void)
+{
+	struct registered_plugin_options *reg;
+	struct pevent_plugin_option *op;
+	char **list = NULL;
+	char *name;
+	int count = 0;
+
+	for (reg = registered_options; reg; reg = reg->next) {
+		for (op = reg->options; op->name; op++) {
+			char *alias = op->plugin_alias ? op->plugin_alias : op->file;
+			char **temp = list;
+
+			name = malloc(strlen(op->name) + strlen(alias) + 2);
+			if (!name)
+				goto err;
+
+			sprintf(name, "%s:%s", alias, op->name);
+			list = realloc(list, count + 2);
+			if (!list) {
+				list = temp;
+				free(name);
+				goto err;
+			}
+			list[count++] = name;
+			list[count] = NULL;
+		}
+	}
+	return list;
+
+ err:
+	while (--count >= 0)
+		free(list[count]);
+	free(list);
+
+	return INVALID_PLUGIN_LIST_OPTION;
+}
+
+void traceevent_plugin_free_options_list(char **list)
+{
+	int i;
+
+	if (!list)
+		return;
+
+	if (list == INVALID_PLUGIN_LIST_OPTION)
+		return;
+
+	for (i = 0; list[i]; i++)
+		free(list[i]);
+
+	free(list);
+}
+
+static int
+update_option(const char *file, struct pevent_plugin_option *option)
+{
+	struct trace_plugin_options *op;
+	char *plugin;
+
+	if (option->plugin_alias) {
+		plugin = strdup(option->plugin_alias);
+		if (!plugin)
+			return -1;
+	} else {
+		char *p;
+		plugin = strdup(file);
+		if (!plugin)
+			return -1;
+		p = strstr(plugin, ".");
+		if (p)
+			*p = '\0';
+	}
+
+	/* first look for named options */
+	for (op = trace_plugin_options; op; op = op->next) {
+		if (!op->plugin)
+			continue;
+		if (strcmp(op->plugin, plugin) != 0)
+			continue;
+		if (strcmp(op->option, option->name) != 0)
+			continue;
+
+		option->value = op->value;
+		option->set ^= 1;
+		goto out;
+	}
+
+	/* first look for unnamed options */
+	for (op = trace_plugin_options; op; op = op->next) {
+		if (op->plugin)
+			continue;
+		if (strcmp(op->option, option->name) != 0)
+			continue;
+
+		option->value = op->value;
+		option->set ^= 1;
+		break;
+	}
+
+ out:
+	free(plugin);
+	return 0;
+}
+
+/**
+ * traceevent_plugin_add_options - Add a set of options by a plugin
+ * @name: The name of the plugin adding the options
+ * @options: The set of options being loaded
+ *
+ * Sets the options with the values that have been added by user.
+ */
+int traceevent_plugin_add_options(const char *name,
+				  struct pevent_plugin_option *options)
+{
+	struct registered_plugin_options *reg;
+
+	reg = malloc(sizeof(*reg));
+	if (!reg)
+		return -1;
+	reg->next = registered_options;
+	reg->options = options;
+	registered_options = reg;
+
+	while (options->name) {
+		update_option(name, options);
+		options++;
+	}
+	return 0;
+}
+
+/**
+ * traceevent_plugin_remove_options - remove plugin options that were registered
+ * @options: Options to removed that were registered with traceevent_plugin_add_options
+ */
+void traceevent_plugin_remove_options(struct pevent_plugin_option *options)
+{
+	struct registered_plugin_options **last;
+	struct registered_plugin_options *reg;
+
+	for (last = &registered_options; *last; last = &(*last)->next) {
+		if ((*last)->options == options) {
+			reg = *last;
+			*last = reg->next;
+			free(reg);
+			return;
+		}
+	}
+}
+
+/**
+ * traceevent_print_plugins - print out the list of plugins loaded
+ * @s: the trace_seq descripter to write to
+ * @prefix: The prefix string to add before listing the option name
+ * @suffix: The suffix string ot append after the option name
+ * @list: The list of plugins (usually returned by traceevent_load_plugins()
+ *
+ * Writes to the trace_seq @s the list of plugins (files) that is
+ * returned by traceevent_load_plugins(). Use @prefix and @suffix for formating:
+ * @prefix = "  ", @suffix = "\n".
+ */
+void traceevent_print_plugins(struct trace_seq *s,
+			      const char *prefix, const char *suffix,
+			      const struct plugin_list *list)
+{
+	while (list) {
+		trace_seq_printf(s, "%s%s%s", prefix, list->name, suffix);
+		list = list->next;
+	}
+}
+
 static void
 load_plugin(struct pevent *pevent, const char *path,
 	    const char *file, void *data)

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [tip:perf/core] tools lib traceevent: Add options to function plugin
  2014-06-03  3:20 ` [PATCH v2 3/4] tools lib traceevent: Add options to function plugin Steven Rostedt
@ 2014-06-12 12:00   ` tip-bot for Steven Rostedt (Red Hat)
  0 siblings, 0 replies; 48+ messages in thread
From: tip-bot for Steven Rostedt (Red Hat) @ 2014-06-12 12:00 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, jolsa, rostedt, tglx, namhyung

Commit-ID:  49440828ad7b809e9d31f6108875e3b1e974690c
Gitweb:     http://git.kernel.org/tip/49440828ad7b809e9d31f6108875e3b1e974690c
Author:     Steven Rostedt (Red Hat) <rostedt@goodmis.org>
AuthorDate: Mon, 2 Jun 2014 23:20:15 -0400
Committer:  Jiri Olsa <jolsa@kernel.org>
CommitDate: Sat, 7 Jun 2014 23:33:37 +0200

tools lib traceevent: Add options to function plugin

Add the options "parent" and "indent" to the function plugin.

When parent is set, the output looks like this:

function:             fsnotify_modify <-- vfs_write
function:             zone_statistics <-- get_page_from_freelist
function:                __inc_zone_state <-- zone_statistics
function:                inotify_inode_queue_event <-- fsnotify_modify
function:                fsnotify_parent <-- fsnotify_modify
function:                __inc_zone_state <-- zone_statistics
function:                   __fsnotify_parent <-- fsnotify_parent
function:                   inotify_dentry_parent_queue_event <-- fsnotify_parent
function:             add_to_page_cache_lru <-- do_read_cache_page

When it's not set, it looks like:

function:             fsnotify_modify
function:             zone_statistics
function:                __inc_zone_state
function:                inotify_inode_queue_event
function:                fsnotify_parent
function:                __inc_zone_state
function:                   __fsnotify_parent
function:                   inotify_dentry_parent_queue_event
function:             add_to_page_cache_lru

When the otpion "indent" is not set, it looks like this:

function:             fsnotify_modify <-- vfs_write
function:             zone_statistics <-- get_page_from_freelist
function:             __inc_zone_state <-- zone_statistics
function:             inotify_inode_queue_event <-- fsnotify_modify
function:             fsnotify_parent <-- fsnotify_modify
function:             __inc_zone_state <-- zone_statistics
function:             __fsnotify_parent <-- fsnotify_parent
function:             inotify_dentry_parent_queue_event <-- fsnotify_parent
function:             add_to_page_cache_lru <-- do_read_cache_page

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/20140603032224.056940410@goodmis.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 tools/lib/traceevent/plugin_function.c | 43 +++++++++++++++++++++++++++++-----
 1 file changed, 37 insertions(+), 6 deletions(-)

diff --git a/tools/lib/traceevent/plugin_function.c b/tools/lib/traceevent/plugin_function.c
index 80ba4ff..a00ec19 100644
--- a/tools/lib/traceevent/plugin_function.c
+++ b/tools/lib/traceevent/plugin_function.c
@@ -33,6 +33,29 @@ static int cpus = -1;
 
 #define STK_BLK 10
 
+struct pevent_plugin_option plugin_options[] =
+{
+	{
+		.name = "parent",
+		.plugin_alias = "ftrace",
+		.description =
+		"Print parent of functions for function events",
+	},
+	{
+		.name = "indent",
+		.plugin_alias = "ftrace",
+		.description =
+		"Try to show function call indents, based on parents",
+		.set = 1,
+	},
+	{
+		.name = NULL,
+	}
+};
+
+static struct pevent_plugin_option *ftrace_parent = &plugin_options[0];
+static struct pevent_plugin_option *ftrace_indent = &plugin_options[1];
+
 static void add_child(struct func_stack *stack, const char *child, int pos)
 {
 	int i;
@@ -119,7 +142,8 @@ static int function_handler(struct trace_seq *s, struct pevent_record *record,
 
 	parent = pevent_find_function(pevent, pfunction);
 
-	index = add_and_get_index(parent, func, record->cpu);
+	if (parent && ftrace_indent->set)
+		index = add_and_get_index(parent, func, record->cpu);
 
 	trace_seq_printf(s, "%*s", index*3, "");
 
@@ -128,11 +152,13 @@ static int function_handler(struct trace_seq *s, struct pevent_record *record,
 	else
 		trace_seq_printf(s, "0x%llx", function);
 
-	trace_seq_printf(s, " <-- ");
-	if (parent)
-		trace_seq_printf(s, "%s", parent);
-	else
-		trace_seq_printf(s, "0x%llx", pfunction);
+	if (ftrace_parent->set) {
+		trace_seq_printf(s, " <-- ");
+		if (parent)
+			trace_seq_printf(s, "%s", parent);
+		else
+			trace_seq_printf(s, "0x%llx", pfunction);
+	}
 
 	return 0;
 }
@@ -141,6 +167,9 @@ int PEVENT_PLUGIN_LOADER(struct pevent *pevent)
 {
 	pevent_register_event_handler(pevent, -1, "ftrace", "function",
 				      function_handler, NULL);
+
+	traceevent_plugin_add_options("ftrace", plugin_options);
+
 	return 0;
 }
 
@@ -157,6 +186,8 @@ void PEVENT_PLUGIN_UNLOADER(struct pevent *pevent)
 		free(fstack[i].stack);
 	}
 
+	traceevent_plugin_remove_options(plugin_options);
+
 	free(fstack);
 	fstack = NULL;
 	cpus = -1;

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [tip:perf/core] tools lib traceevent: Added support for __get_bitmask() macro
  2014-06-03  3:20 ` [PATCH v2 4/4] tools lib traceevent: Added support for __get_bitmask() macro Steven Rostedt
@ 2014-06-12 12:01   ` tip-bot for Steven Rostedt (Red Hat)
  0 siblings, 0 replies; 48+ messages in thread
From: tip-bot for Steven Rostedt (Red Hat) @ 2014-06-12 12:01 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, jolsa, namhyung, javi.merino, rostedt, tglx

Commit-ID:  473a778a2f2949972b52ad7fc61577f381f2d05e
Gitweb:     http://git.kernel.org/tip/473a778a2f2949972b52ad7fc61577f381f2d05e
Author:     Steven Rostedt (Red Hat) <rostedt@goodmis.org>
AuthorDate: Mon, 2 Jun 2014 23:20:16 -0400
Committer:  Jiri Olsa <jolsa@kernel.org>
CommitDate: Sat, 7 Jun 2014 23:33:37 +0200

tools lib traceevent: Added support for __get_bitmask() macro

Coming in v3.16, trace events will be able to save bitmasks in raw
format in the ring buffer and output it with the __get_bitmask() macro.

In order for userspace tools to parse this, it must be able to handle
the __get_bitmask() call and be able to convert the data that's in
the ring buffer into a nice bitmask format. The output is similar to
what the kernel uses to print bitmasks, with a comma separator every
4 bytes (8 characters).

This allows for cpumasks to also be saved efficiently.

The first user is the thermal:thermal_power_limit event which has the
following output:

 thermal_power_limit:  cpus=0000000f freq=1900000 cdev_state=0 power=5252

Link: http://lkml.kernel.org/r/20140506132238.22e136d1@gandalf.local.home

Suggested-by: Javi Merino <javi.merino@arm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Javi Merino <javi.merino@arm.com>
Link: http://lkml.kernel.org/r/20140603032224.229186537@goodmis.org
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 tools/lib/traceevent/event-parse.c                 | 113 +++++++++++++++++++++
 tools/lib/traceevent/event-parse.h                 |   7 ++
 .../perf/util/scripting-engines/trace-event-perl.c |   1 +
 .../util/scripting-engines/trace-event-python.c    |   1 +
 4 files changed, 122 insertions(+)

diff --git a/tools/lib/traceevent/event-parse.c b/tools/lib/traceevent/event-parse.c
index b83184f..93825a1 100644
--- a/tools/lib/traceevent/event-parse.c
+++ b/tools/lib/traceevent/event-parse.c
@@ -765,6 +765,9 @@ static void free_arg(struct print_arg *arg)
 	case PRINT_BSTRING:
 		free(arg->string.string);
 		break;
+	case PRINT_BITMASK:
+		free(arg->bitmask.bitmask);
+		break;
 	case PRINT_DYNAMIC_ARRAY:
 		free(arg->dynarray.index);
 		break;
@@ -2268,6 +2271,7 @@ static int arg_num_eval(struct print_arg *arg, long long *val)
 	case PRINT_FIELD ... PRINT_SYMBOL:
 	case PRINT_STRING:
 	case PRINT_BSTRING:
+	case PRINT_BITMASK:
 	default:
 		do_warning("invalid eval type %d", arg->type);
 		ret = 0;
@@ -2296,6 +2300,7 @@ static char *arg_eval (struct print_arg *arg)
 	case PRINT_FIELD ... PRINT_SYMBOL:
 	case PRINT_STRING:
 	case PRINT_BSTRING:
+	case PRINT_BITMASK:
 	default:
 		do_warning("invalid eval type %d", arg->type);
 		break;
@@ -2683,6 +2688,35 @@ process_str(struct event_format *event __maybe_unused, struct print_arg *arg,
 	return EVENT_ERROR;
 }
 
+static enum event_type
+process_bitmask(struct event_format *event __maybe_unused, struct print_arg *arg,
+	    char **tok)
+{
+	enum event_type type;
+	char *token;
+
+	if (read_expect_type(EVENT_ITEM, &token) < 0)
+		goto out_free;
+
+	arg->type = PRINT_BITMASK;
+	arg->bitmask.bitmask = token;
+	arg->bitmask.offset = -1;
+
+	if (read_expected(EVENT_DELIM, ")") < 0)
+		goto out_err;
+
+	type = read_token(&token);
+	*tok = token;
+
+	return type;
+
+ out_free:
+	free_token(token);
+ out_err:
+	*tok = NULL;
+	return EVENT_ERROR;
+}
+
 static struct pevent_function_handler *
 find_func_handler(struct pevent *pevent, char *func_name)
 {
@@ -2797,6 +2831,10 @@ process_function(struct event_format *event, struct print_arg *arg,
 		free_token(token);
 		return process_str(event, arg, tok);
 	}
+	if (strcmp(token, "__get_bitmask") == 0) {
+		free_token(token);
+		return process_bitmask(event, arg, tok);
+	}
 	if (strcmp(token, "__get_dynamic_array") == 0) {
 		free_token(token);
 		return process_dynamic_array(event, arg, tok);
@@ -3324,6 +3362,7 @@ eval_num_arg(void *data, int size, struct event_format *event, struct print_arg
 		return eval_type(val, arg, 0);
 	case PRINT_STRING:
 	case PRINT_BSTRING:
+	case PRINT_BITMASK:
 		return 0;
 	case PRINT_FUNC: {
 		struct trace_seq s;
@@ -3556,6 +3595,60 @@ static void print_str_to_seq(struct trace_seq *s, const char *format,
 		trace_seq_printf(s, format, str);
 }
 
+static void print_bitmask_to_seq(struct pevent *pevent,
+				 struct trace_seq *s, const char *format,
+				 int len_arg, const void *data, int size)
+{
+	int nr_bits = size * 8;
+	int str_size = (nr_bits + 3) / 4;
+	int len = 0;
+	char buf[3];
+	char *str;
+	int index;
+	int i;
+
+	/*
+	 * The kernel likes to put in commas every 32 bits, we
+	 * can do the same.
+	 */
+	str_size += (nr_bits - 1) / 32;
+
+	str = malloc(str_size + 1);
+	if (!str) {
+		do_warning("%s: not enough memory!", __func__);
+		return;
+	}
+	str[str_size] = 0;
+
+	/* Start out with -2 for the two chars per byte */
+	for (i = str_size - 2; i >= 0; i -= 2) {
+		/*
+		 * data points to a bit mask of size bytes.
+		 * In the kernel, this is an array of long words, thus
+		 * endianess is very important.
+		 */
+		if (pevent->file_bigendian)
+			index = size - (len + 1);
+		else
+			index = len;
+
+		snprintf(buf, 3, "%02x", *((unsigned char *)data + index));
+		memcpy(str + i, buf, 2);
+		len++;
+		if (!(len & 3) && i > 0) {
+			i--;
+			str[i] = ',';
+		}
+	}
+
+	if (len_arg >= 0)
+		trace_seq_printf(s, format, len_arg, str);
+	else
+		trace_seq_printf(s, format, str);
+
+	free(str);
+}
+
 static void print_str_arg(struct trace_seq *s, void *data, int size,
 			  struct event_format *event, const char *format,
 			  int len_arg, struct print_arg *arg)
@@ -3691,6 +3784,23 @@ static void print_str_arg(struct trace_seq *s, void *data, int size,
 	case PRINT_BSTRING:
 		print_str_to_seq(s, format, len_arg, arg->string.string);
 		break;
+	case PRINT_BITMASK: {
+		int bitmask_offset;
+		int bitmask_size;
+
+		if (arg->bitmask.offset == -1) {
+			struct format_field *f;
+
+			f = pevent_find_any_field(event, arg->bitmask.bitmask);
+			arg->bitmask.offset = f->offset;
+		}
+		bitmask_offset = data2host4(pevent, data + arg->bitmask.offset);
+		bitmask_size = bitmask_offset >> 16;
+		bitmask_offset &= 0xffff;
+		print_bitmask_to_seq(pevent, s, format, len_arg,
+				     data + bitmask_offset, bitmask_size);
+		break;
+	}
 	case PRINT_OP:
 		/*
 		 * The only op for string should be ? :
@@ -4822,6 +4932,9 @@ static void print_args(struct print_arg *args)
 	case PRINT_BSTRING:
 		printf("__get_str(%s)", args->string.string);
 		break;
+	case PRINT_BITMASK:
+		printf("__get_bitmask(%s)", args->bitmask.bitmask);
+		break;
 	case PRINT_TYPE:
 		printf("(%s)", args->typecast.type);
 		print_args(args->typecast.item);
diff --git a/tools/lib/traceevent/event-parse.h b/tools/lib/traceevent/event-parse.h
index 56e0e6c..7a3873f 100644
--- a/tools/lib/traceevent/event-parse.h
+++ b/tools/lib/traceevent/event-parse.h
@@ -208,6 +208,11 @@ struct print_arg_string {
 	int			offset;
 };
 
+struct print_arg_bitmask {
+	char			*bitmask;
+	int			offset;
+};
+
 struct print_arg_field {
 	char			*name;
 	struct format_field	*field;
@@ -274,6 +279,7 @@ enum print_arg_type {
 	PRINT_DYNAMIC_ARRAY,
 	PRINT_OP,
 	PRINT_FUNC,
+	PRINT_BITMASK,
 };
 
 struct print_arg {
@@ -288,6 +294,7 @@ struct print_arg {
 		struct print_arg_hex		hex;
 		struct print_arg_func		func;
 		struct print_arg_string		string;
+		struct print_arg_bitmask	bitmask;
 		struct print_arg_op		op;
 		struct print_arg_dynarray	dynarray;
 	};
diff --git a/tools/perf/util/scripting-engines/trace-event-perl.c b/tools/perf/util/scripting-engines/trace-event-perl.c
index e108207..af7da56 100644
--- a/tools/perf/util/scripting-engines/trace-event-perl.c
+++ b/tools/perf/util/scripting-engines/trace-event-perl.c
@@ -215,6 +215,7 @@ static void define_event_symbols(struct event_format *event,
 	case PRINT_BSTRING:
 	case PRINT_DYNAMIC_ARRAY:
 	case PRINT_STRING:
+	case PRINT_BITMASK:
 		break;
 	case PRINT_TYPE:
 		define_event_symbols(event, ev_name, args->typecast.item);
diff --git a/tools/perf/util/scripting-engines/trace-event-python.c b/tools/perf/util/scripting-engines/trace-event-python.c
index cd9774d..c3de097 100644
--- a/tools/perf/util/scripting-engines/trace-event-python.c
+++ b/tools/perf/util/scripting-engines/trace-event-python.c
@@ -197,6 +197,7 @@ static void define_event_symbols(struct event_format *event,
 	case PRINT_BSTRING:
 	case PRINT_DYNAMIC_ARRAY:
 	case PRINT_FUNC:
+	case PRINT_BITMASK:
 		/* we should warn... */
 		return;
 	}

^ permalink raw reply related	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2014-06-12 12:06 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-06-03  3:20 [PATCH v2 0/4] tools lib traceevent: bitmask handling and plugin updates Steven Rostedt
2014-06-03  3:20 ` [PATCH v2 1/4] tools lib traceevent: Add flag to not load event plugins Steven Rostedt
2014-06-12 12:00   ` [tip:perf/core] " tip-bot for Steven Rostedt (Red Hat)
2014-06-03  3:20 ` [PATCH v2 2/4] tools lib traceevent: Add options to plugins Steven Rostedt
2014-06-03  6:51   ` Namhyung Kim
2014-06-03 22:41     ` [PATCH v3 " Steven Rostedt
2014-06-03 22:43       ` Steven Rostedt
2014-06-04  9:33         ` Jiri Olsa
2014-06-12 12:00       ` [tip:perf/core] " tip-bot for Steven Rostedt
2014-06-04 11:42     ` [PATCH v2 2/4] " Jiri Olsa
2014-06-04 14:00       ` Namhyung Kim
2014-06-03  3:20 ` [PATCH v2 3/4] tools lib traceevent: Add options to function plugin Steven Rostedt
2014-06-12 12:00   ` [tip:perf/core] " tip-bot for Steven Rostedt (Red Hat)
2014-06-03  3:20 ` [PATCH v2 4/4] tools lib traceevent: Added support for __get_bitmask() macro Steven Rostedt
2014-06-12 12:01   ` [tip:perf/core] " tip-bot for Steven Rostedt (Red Hat)
  -- strict thread matches above, loose matches on Subject: below --
2014-05-06 12:06 [RFC PATCH 0/5] The power allocator thermal governor Javi Merino
2014-05-06 12:06 ` [RFC PATCH 1/5] thermal: let governors have private data for each thermal zone Javi Merino
2014-05-13 14:31   ` edubezval
2014-05-15 14:30     ` Javi Merino
2014-05-15 15:10       ` Eduardo Valentin
2014-05-06 12:06 ` [RFC PATCH 2/5] thermal: let cooling devices operate on other units other than state Javi Merino
2014-05-13 15:22   ` edubezval
2014-05-06 12:06 ` [RFC PATCH 3/5] thermal: add a basic cpu power actor Javi Merino
2014-05-13 22:57   ` Eduardo Valentin
2014-05-15 17:02     ` Javi Merino
2014-05-06 12:06 ` [RFC PATCH 4/5] thermal: introduce the Power Allocator governor Javi Merino
2014-05-13 23:20   ` Eduardo Valentin
2014-05-15 18:24     ` Javi Merino
2014-05-22 14:26       ` Eduardo Valentin
2014-05-06 12:06 ` [RFC PATCH 5/5] thermal: add trace events to the power allocator governor Javi Merino
2014-05-06 12:28   ` Steven Rostedt
2014-05-06 17:22   ` [RFC][PATCH] tracing: Add __cpumask() macro to trace events to record cpumasks Steven Rostedt
2014-05-06 19:16     ` Mathieu Desnoyers
2014-05-06 19:30       ` Steven Rostedt
2014-05-07  3:12       ` [RFC][PATCH v2] tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks Steven Rostedt
2014-05-07 11:40         ` Mathieu Desnoyers
2014-05-07 15:45           ` Steven Rostedt
2014-05-14 14:23         ` Javi Merino
2014-05-14 15:36           ` Steven Rostedt
2014-05-14 15:50             ` Javi Merino
2014-05-14 16:07               ` Steven Rostedt
2014-05-14 18:25               ` [RFC][PATCH v3] " Steven Rostedt
2014-05-14 19:42                 ` Javi Merino
2014-05-14 20:00                   ` Steven Rostedt
2014-05-15 11:34                   ` Steven Rostedt
2014-05-15 11:36                     ` Steven Rostedt
2014-05-15 15:20                       ` Javi Merino
2014-05-14 10:05 ` [RFC PATCH 0/5] The power allocator thermal governor James King

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.