All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH v5 00/10] The power allocator thermal governor
@ 2014-07-10 14:18 Javi Merino
  2014-07-10 14:18 ` [RFC PATCH v5 01/10] tracing: Add array printing helpers Javi Merino
                   ` (9 more replies)
  0 siblings, 10 replies; 22+ messages in thread
From: Javi Merino @ 2014-07-10 14:18 UTC (permalink / raw)
  To: linux-pm, linux-kernel; +Cc: punit.agrawal, broonie, Javi Merino

Hi linux-pm,

The power allocator governor allocates device power to control
temperature.  This requires transforming performance requests into
requested power, which we do with the aid of power models.  Patch 7
(thermal: add a basic cpu power actor) implements a simple power model
for cpus.  The division of power between the actors ensures that power
is allocated where it is needed the most, based on the current
workload.

Patches 1-3 adds array printing helpers to ftrace, which we then use
in patch 9.  Patch 4 is a generic documentation of the current thermal
framework and can be merged separately.

Changes since v4:
  - Add more tracing
  - Document some of the limitations of the power allocator governor
  - Export the power_actor API and move power_actor.h to include/linux

Changes since v3:
  - Use tz->passive to poll faster when the first trip point is hit.
  - Don't make a special directory for power_actors
  - Add a DT property for sustainable-power
  - Simplify the static power interface and pass the current thermal
    zone in every power_actor_ops to remove the controversial
    enum power_actor_types
  - Use locks with the actor_list list
  - Use cpufreq_get() to get the frequency of the cpu instead of
    using the notifiers.
  - Remove the prompt for THERMAL_POWER_ACTOR_CPU when configuring
    the kernel

Changes since v2:
  - Changed the PI controller into a PID controller
  - Added static power to the cpu power model
  - tz parameter max_dissipatable_power renamed to sustainable_power
  - Register the cpufreq cooling device as part of the
    power_cpu_actor registration.

Changes since v1:
  - Fixed finding cpufreq cooling devices in cpufreq_frequency_change()
  - Replaced the cooling device interface with a separate power actor
    API
  - Addressed most of Eduardo's comments
  - Incorporated ftrace support for bitmask to trace cpumasks

Todo:
  - Rethink the use of trip points and make it less intrusive
  - Let platforms override the power allocator governor parameters
  - Provide scripts to evaluate the proposal
  - Let the governor operate on cooling devices as well

Cheers,
Javi & Punit

Dave Martin (1):
  tracing: Add array printing helpers

Javi Merino (8):
  tools lib traceevent: Generalize numeric argument
  tools lib traceevent: Add support for __print_u{8,16,32,64}_array()
  thermal: document struct thermal_zone_device and thermal_governor
  thermal: let governors have private data for each thermal zone
  thermal: introduce the Power Actor API
  thermal: add a basic cpu power actor
  thermal: introduce the Power Allocator governor
  thermal: add trace events to the power allocator governor

Punit Agrawal (1):
  of: thermal: Introduce sustainable power for a thermal zone

 .../devicetree/bindings/thermal/thermal.txt        |   4 +
 Documentation/thermal/power_actor.txt              | 181 ++++++++
 Documentation/thermal/power_allocator.txt          |  61 +++
 drivers/thermal/Kconfig                            |  23 +
 drivers/thermal/Makefile                           |   5 +
 drivers/thermal/cpu_actor.c                        | 501 +++++++++++++++++++++
 drivers/thermal/of-thermal.c                       |   4 +
 drivers/thermal/power_actor.c                      |  70 +++
 drivers/thermal/power_allocator.c                  | 485 ++++++++++++++++++++
 drivers/thermal/thermal_core.c                     |  90 +++-
 drivers/thermal/thermal_core.h                     |   8 +
 include/linux/ftrace_event.h                       |   9 +
 include/linux/power_actor.h                        |  86 ++++
 include/linux/thermal.h                            |  63 ++-
 include/trace/events/thermal_power_allocator.h     | 138 ++++++
 include/trace/ftrace.h                             |  17 +
 kernel/trace/trace_output.c                        |  55 +++
 tools/lib/traceevent/event-parse.c                 |  88 +++-
 tools/lib/traceevent/event-parse.h                 |   8 +-
 19 files changed, 1865 insertions(+), 31 deletions(-)
 create mode 100644 Documentation/thermal/power_actor.txt
 create mode 100644 Documentation/thermal/power_allocator.txt
 create mode 100644 drivers/thermal/cpu_actor.c
 create mode 100644 drivers/thermal/power_actor.c
 create mode 100644 drivers/thermal/power_allocator.c
 create mode 100644 include/linux/power_actor.h
 create mode 100644 include/trace/events/thermal_power_allocator.h

-- 
1.9.1



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC PATCH v5 01/10] tracing: Add array printing helpers
  2014-07-10 14:18 [RFC PATCH v5 00/10] The power allocator thermal governor Javi Merino
@ 2014-07-10 14:18 ` Javi Merino
  2014-07-10 15:40   ` Steven Rostedt
  2014-07-10 14:18 ` [RFC PATCH v5 02/10] tools lib traceevent: Generalize numeric argument Javi Merino
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Javi Merino @ 2014-07-10 14:18 UTC (permalink / raw)
  To: linux-pm, linux-kernel
  Cc: punit.agrawal, broonie, Dave Martin, Steven Rostedt, Ingo Molnar

From: Dave Martin <Dave.Martin@arm.com>

If a trace event contains an array, there is currently no standard
way to format this for text output.  Drivers are currently hacking
around this by a) local hacks that use the trace_seq functionailty
directly, or b) just not printing that information.  For fixed size
arrays, formatting of the elements can be open-coded, but this gets
cumbersome for arrays of non-trivial size.

These approaches result in non-standard content of the event format
description delivered to userspace, so userland tools needs to be
taught to understand and parse each array printing method
individually.

This patch implements common __print_<type>_array() helpers that
tracepoint implementations can use instead of reinventing them.  A
simple C-style syntax is used to delimit the array and its elements
{like,this}.

So that the helpers can be used with large static arrays as well as
dynamic arrays, they take a pointer and element count: they can be
used with __get_dynamic_array() for use with dynamic arrays.

Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
---
 include/linux/ftrace_event.h |  9 ++++++++
 include/trace/ftrace.h       | 17 ++++++++++++++
 kernel/trace/trace_output.c  | 55 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 81 insertions(+)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index cff3106ffe2c..919f21a3420b 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -44,6 +44,15 @@ const char *ftrace_print_bitmask_seq(struct trace_seq *p, void *bitmask_ptr,
 const char *ftrace_print_hex_seq(struct trace_seq *p,
 				 const unsigned char *buf, int len);
 
+const char *ftrace_print_u8_array_seq(struct trace_seq *p,
+				      const u8 *buf, int count);
+const char *ftrace_print_u16_array_seq(struct trace_seq *p,
+				       const u16 *buf, int count);
+const char *ftrace_print_u32_array_seq(struct trace_seq *p,
+				       const u32 *buf, int count);
+const char *ftrace_print_u64_array_seq(struct trace_seq *p,
+				       const u64 *buf, int count);
+
 struct trace_iterator;
 struct trace_event;
 
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 26b4f2e13275..15bc5d417aea 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -263,6 +263,19 @@
 #undef __print_hex
 #define __print_hex(buf, buf_len) ftrace_print_hex_seq(p, buf, buf_len)
 
+#undef __print_u8_array
+#define __print_u8_array(array, count)			\
+	ftrace_print_u8_array_seq(p, array, count)
+#undef __print_u16_array
+#define __print_u16_array(array, count)			\
+	ftrace_print_u16_array_seq(p, array, count)
+#undef __print_u32_array
+#define __print_u32_array(array, count)			\
+	ftrace_print_u32_array_seq(p, array, count)
+#undef __print_u64_array
+#define __print_u64_array(array, count)			\
+	ftrace_print_u64_array_seq(p, array, count)
+
 #undef DECLARE_EVENT_CLASS
 #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
 static notrace enum print_line_t					\
@@ -676,6 +689,10 @@ static inline void ftrace_test_probe_##call(void)			\
 #undef __get_dynamic_array_len
 #undef __get_str
 #undef __get_bitmask
+#undef __print_u8_array
+#undef __print_u16_array
+#undef __print_u32_array
+#undef __print_u64_array
 
 #undef TP_printk
 #define TP_printk(fmt, args...) "\"" fmt "\", "  __stringify(args)
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index f3dad80c20b2..b46238e75523 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -454,6 +454,61 @@ ftrace_print_hex_seq(struct trace_seq *p, const unsigned char *buf, int buf_len)
 }
 EXPORT_SYMBOL(ftrace_print_hex_seq);
 
+static const char *
+ftrace_print_array_seq(struct trace_seq *p, const void *buf, int buf_len,
+		       bool (*iterator)(struct trace_seq *p, const char *prefix,
+					const void **buf, int *buf_len))
+{
+	const char *ret = p->buffer + p->len;
+	const char *prefix = "";
+
+	trace_seq_putc(p, '{');
+
+	if (iterator(p, prefix, &buf, &buf_len)) {
+		prefix = ",";
+
+		while (iterator(p, prefix, &buf, &buf_len))
+			;
+	}
+
+	trace_seq_putc(p, '}');
+	trace_seq_putc(p, 0);
+
+	return ret;
+}
+
+#define DEFINE_PRINT_ARRAY(type, printk_type, format)			\
+static bool								\
+ftrace_print_array_iterator_##type(struct trace_seq *p, const char *prefix, \
+				   const void **buf, int *buf_len)	\
+{									\
+	const type *__src = *buf;					\
+									\
+	if (*buf_len < sizeof(*__src))					\
+		return false;						\
+									\
+	trace_seq_printf(p, "%s" format, prefix, (printk_type)*__src++); \
+									\
+	*buf = __src;							\
+	*buf_len -= sizeof(*__src);					\
+									\
+	return true;							\
+}									\
+									\
+const char *ftrace_print_##type##_array_seq(				\
+	struct trace_seq *p, const type *buf, int count)		\
+{									\
+	return ftrace_print_array_seq(p, buf, (count) * sizeof(type),	\
+				      ftrace_print_array_iterator_##type); \
+}									\
+									\
+EXPORT_SYMBOL(ftrace_print_##type##_array_seq);
+
+DEFINE_PRINT_ARRAY(u8, unsigned int, "0x%x")
+DEFINE_PRINT_ARRAY(u16, unsigned int, "0x%x")
+DEFINE_PRINT_ARRAY(u32, unsigned int, "0x%x")
+DEFINE_PRINT_ARRAY(u64, unsigned long long, "0x%llx")
+
 int ftrace_raw_output_prep(struct trace_iterator *iter,
 			   struct trace_event *trace_event)
 {
-- 
1.9.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH v5 02/10] tools lib traceevent: Generalize numeric argument
  2014-07-10 14:18 [RFC PATCH v5 00/10] The power allocator thermal governor Javi Merino
  2014-07-10 14:18 ` [RFC PATCH v5 01/10] tracing: Add array printing helpers Javi Merino
@ 2014-07-10 14:18 ` Javi Merino
  2014-07-10 14:18 ` [RFC PATCH v5 03/10] tools lib traceevent: Add support for __print_u{8,16,32,64}_array() Javi Merino
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Javi Merino @ 2014-07-10 14:18 UTC (permalink / raw)
  To: linux-pm, linux-kernel
  Cc: punit.agrawal, broonie, Javi Merino, Steven Rostedt,
	Arnaldo Carvalho de Melo, Jiri Olsa

Numeric arguments can be in different bases, so rename it to num so
that they can be used for formats other than PRINT_HEX

Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
---
 tools/lib/traceevent/event-parse.c | 26 +++++++++++++-------------
 tools/lib/traceevent/event-parse.h |  4 ++--
 2 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/tools/lib/traceevent/event-parse.c b/tools/lib/traceevent/event-parse.c
index 93825a17dcce..8a0a8749df4c 100644
--- a/tools/lib/traceevent/event-parse.c
+++ b/tools/lib/traceevent/event-parse.c
@@ -754,8 +754,8 @@ static void free_arg(struct print_arg *arg)
 		free_flag_sym(arg->symbol.symbols);
 		break;
 	case PRINT_HEX:
-		free_arg(arg->hex.field);
-		free_arg(arg->hex.size);
+		free_arg(arg->num.field);
+		free_arg(arg->num.size);
 		break;
 	case PRINT_TYPE:
 		free(arg->typecast.type);
@@ -2503,7 +2503,7 @@ process_hex(struct event_format *event, struct print_arg *arg, char **tok)
 	if (test_type_token(type, token, EVENT_DELIM, ","))
 		goto out_free;
 
-	arg->hex.field = field;
+	arg->num.field = field;
 
 	free_token(token);
 
@@ -2519,7 +2519,7 @@ process_hex(struct event_format *event, struct print_arg *arg, char **tok)
 	if (test_type_token(type, token, EVENT_DELIM, ")"))
 		goto out_free;
 
-	arg->hex.size = field;
+	arg->num.size = field;
 
 	free_token(token);
 	type = read_token_item(tok);
@@ -3740,24 +3740,24 @@ static void print_str_arg(struct trace_seq *s, void *data, int size,
 		}
 		break;
 	case PRINT_HEX:
-		if (arg->hex.field->type == PRINT_DYNAMIC_ARRAY) {
+		if (arg->num.field->type == PRINT_DYNAMIC_ARRAY) {
 			unsigned long offset;
 			offset = pevent_read_number(pevent,
-				data + arg->hex.field->dynarray.field->offset,
-				arg->hex.field->dynarray.field->size);
+				data + arg->num.field->dynarray.field->offset,
+				arg->num.field->dynarray.field->size);
 			hex = data + (offset & 0xffff);
 		} else {
-			field = arg->hex.field->field.field;
+			field = arg->num.field->field.field;
 			if (!field) {
-				str = arg->hex.field->field.name;
+				str = arg->num.field->field.name;
 				field = pevent_find_any_field(event, str);
 				if (!field)
 					goto out_warning_field;
-				arg->hex.field->field.field = field;
+				arg->num.field->field.field = field;
 			}
 			hex = data + field->offset;
 		}
-		len = eval_num_arg(data, size, event, arg->hex.size);
+		len = eval_num_arg(data, size, event, arg->num.size);
 		for (i = 0; i < len; i++) {
 			if (i)
 				trace_seq_putc(s, ' ');
@@ -4923,9 +4923,9 @@ static void print_args(struct print_arg *args)
 		break;
 	case PRINT_HEX:
 		printf("__print_hex(");
-		print_args(args->hex.field);
+		print_args(args->num.field);
 		printf(", ");
-		print_args(args->hex.size);
+		print_args(args->num.size);
 		printf(")");
 		break;
 	case PRINT_STRING:
diff --git a/tools/lib/traceevent/event-parse.h b/tools/lib/traceevent/event-parse.h
index 7a3873ff9a4f..2bf72e908a74 100644
--- a/tools/lib/traceevent/event-parse.h
+++ b/tools/lib/traceevent/event-parse.h
@@ -240,7 +240,7 @@ struct print_arg_symbol {
 	struct print_flag_sym	*symbols;
 };
 
-struct print_arg_hex {
+struct print_arg_num {
 	struct print_arg	*field;
 	struct print_arg	*size;
 };
@@ -291,7 +291,7 @@ struct print_arg {
 		struct print_arg_typecast	typecast;
 		struct print_arg_flags		flags;
 		struct print_arg_symbol		symbol;
-		struct print_arg_hex		hex;
+		struct print_arg_num		num;
 		struct print_arg_func		func;
 		struct print_arg_string		string;
 		struct print_arg_bitmask	bitmask;
-- 
1.9.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH v5 03/10] tools lib traceevent: Add support for __print_u{8,16,32,64}_array()
  2014-07-10 14:18 [RFC PATCH v5 00/10] The power allocator thermal governor Javi Merino
  2014-07-10 14:18 ` [RFC PATCH v5 01/10] tracing: Add array printing helpers Javi Merino
  2014-07-10 14:18 ` [RFC PATCH v5 02/10] tools lib traceevent: Generalize numeric argument Javi Merino
@ 2014-07-10 14:18 ` Javi Merino
  2014-07-10 14:18 ` [RFC PATCH v5 04/10] thermal: document struct thermal_zone_device and thermal_governor Javi Merino
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Javi Merino @ 2014-07-10 14:18 UTC (permalink / raw)
  To: linux-pm, linux-kernel
  Cc: punit.agrawal, broonie, Javi Merino, Arnaldo Carvalho de Melo,
	Steven Rostedt, Jiri Olsa

Trace can now generate traces with u8, u16, u32 and u64 dynamic
arrays.  Add support to parse them.

Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
---
 tools/lib/traceevent/event-parse.c | 62 +++++++++++++++++++++++++++++++++++---
 tools/lib/traceevent/event-parse.h |  4 +++
 2 files changed, 61 insertions(+), 5 deletions(-)

diff --git a/tools/lib/traceevent/event-parse.c b/tools/lib/traceevent/event-parse.c
index 8a0a8749df4c..8f25903d6e72 100644
--- a/tools/lib/traceevent/event-parse.c
+++ b/tools/lib/traceevent/event-parse.c
@@ -753,6 +753,10 @@ static void free_arg(struct print_arg *arg)
 		free_arg(arg->symbol.field);
 		free_flag_sym(arg->symbol.symbols);
 		break;
+	case PRINT_U8:
+	case PRINT_U16:
+	case PRINT_U32:
+	case PRINT_U64:
 	case PRINT_HEX:
 		free_arg(arg->num.field);
 		free_arg(arg->num.size);
@@ -2827,6 +2831,22 @@ process_function(struct event_format *event, struct print_arg *arg,
 		free_token(token);
 		return process_hex(event, arg, tok);
 	}
+	if (strcmp(token, "__print_u8_array") == 0) {
+		free_token(token);
+		return process_num(event, arg, tok, PRINT_U8);
+	}
+	if (strcmp(token, "__print_u16_array") == 0) {
+		free_token(token);
+		return process_num(event, arg, tok, PRINT_U16);
+	}
+	if (strcmp(token, "__print_u32_array") == 0) {
+		free_token(token);
+		return process_num(event, arg, tok, PRINT_U32);
+	}
+	if (strcmp(token, "__print_u64_array") == 0) {
+		free_token(token);
+		return process_num(event, arg, tok, PRINT_U64);
+	}
 	if (strcmp(token, "__get_str") == 0) {
 		free_token(token);
 		return process_str(event, arg, tok);
@@ -3355,6 +3375,10 @@ eval_num_arg(void *data, int size, struct event_format *event, struct print_arg
 		break;
 	case PRINT_FLAGS:
 	case PRINT_SYMBOL:
+	case PRINT_U8:
+	case PRINT_U16:
+	case PRINT_U32:
+	case PRINT_U64:
 	case PRINT_HEX:
 		break;
 	case PRINT_TYPE:
@@ -3660,7 +3684,7 @@ static void print_str_arg(struct trace_seq *s, void *data, int size,
 	unsigned long long val, fval;
 	unsigned long addr;
 	char *str;
-	unsigned char *hex;
+	void *num;
 	int print;
 	int i, len;
 
@@ -3739,13 +3763,17 @@ static void print_str_arg(struct trace_seq *s, void *data, int size,
 			}
 		}
 		break;
+	case PRINT_U8:
+	case PRINT_U16:
+	case PRINT_U32:
+	case PRINT_U64:
 	case PRINT_HEX:
 		if (arg->num.field->type == PRINT_DYNAMIC_ARRAY) {
 			unsigned long offset;
 			offset = pevent_read_number(pevent,
 				data + arg->num.field->dynarray.field->offset,
 				arg->num.field->dynarray.field->size);
-			hex = data + (offset & 0xffff);
+			num = data + (offset & 0xffff);
 		} else {
 			field = arg->num.field->field.field;
 			if (!field) {
@@ -3755,13 +3783,24 @@ static void print_str_arg(struct trace_seq *s, void *data, int size,
 					goto out_warning_field;
 				arg->num.field->field.field = field;
 			}
-			hex = data + field->offset;
+			num = data + field->offset;
 		}
 		len = eval_num_arg(data, size, event, arg->num.size);
 		for (i = 0; i < len; i++) {
 			if (i)
 				trace_seq_putc(s, ' ');
-			trace_seq_printf(s, "%02x", hex[i]);
+			if (arg->type == PRINT_HEX)
+				trace_seq_printf(s, "%02x",
+						((uint8_t *)num)[i]);
+			else if (arg->type == PRINT_U8)
+				trace_seq_printf(s, "%u", ((uint8_t *)num)[i]);
+			else if (arg->type == PRINT_U16)
+				trace_seq_printf(s, "%u", ((uint16_t *)num)[i]);
+			else if (arg->type == PRINT_U32)
+				trace_seq_printf(s, "%u", ((uint32_t *)num)[i]);
+			else    /* PRINT_U64 */
+				trace_seq_printf(s, "%lu",
+						((uint64_t *)num)[i]);
 		}
 		break;
 
@@ -4922,7 +4961,20 @@ static void print_args(struct print_arg *args)
 		printf(")");
 		break;
 	case PRINT_HEX:
-		printf("__print_hex(");
+	case PRINT_U8:
+	case PRINT_U16:
+	case PRINT_U32:
+	case PRINT_U64:
+		if (args->type == PRINT_HEX)
+			printf("__print_hex(");
+		else if (args->type == PRINT_U8)
+			printf("__print_u8_array(");
+		else if (args->type == PRINT_U16)
+			printf("__print_u16_array(");
+		else if (args->type == PRINT_U32)
+			printf("__print_u32_array(");
+		else /* PRINT_U64 */
+			printf("__print_u64_array(");
 		print_args(args->num.field);
 		printf(", ");
 		print_args(args->num.size);
diff --git a/tools/lib/traceevent/event-parse.h b/tools/lib/traceevent/event-parse.h
index 2bf72e908a74..51f1f0f0a3b5 100644
--- a/tools/lib/traceevent/event-parse.h
+++ b/tools/lib/traceevent/event-parse.h
@@ -272,6 +272,10 @@ enum print_arg_type {
 	PRINT_FIELD,
 	PRINT_FLAGS,
 	PRINT_SYMBOL,
+	PRINT_U8,
+	PRINT_U16,
+	PRINT_U32,
+	PRINT_U64,
 	PRINT_HEX,
 	PRINT_TYPE,
 	PRINT_STRING,
-- 
1.9.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH v5 04/10] thermal: document struct thermal_zone_device and thermal_governor
  2014-07-10 14:18 [RFC PATCH v5 00/10] The power allocator thermal governor Javi Merino
                   ` (2 preceding siblings ...)
  2014-07-10 14:18 ` [RFC PATCH v5 03/10] tools lib traceevent: Add support for __print_u{8,16,32,64}_array() Javi Merino
@ 2014-07-10 14:18 ` Javi Merino
  2014-08-19 13:03   ` Eduardo Valentin
  2014-07-10 14:18 ` [RFC PATCH v5 05/10] thermal: let governors have private data for each thermal zone Javi Merino
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Javi Merino @ 2014-07-10 14:18 UTC (permalink / raw)
  To: linux-pm, linux-kernel
  Cc: punit.agrawal, broonie, Javi Merino, Zhang Rui, Eduardo Valentin

Document struct thermal_zone_device and struct thermal_governor fields
and their use by the thermal framework code.

Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
---
 include/linux/thermal.h | 46 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 44 insertions(+), 2 deletions(-)

diff --git a/include/linux/thermal.h b/include/linux/thermal.h
index f7e11c7ea7d9..0305cde21a74 100644
--- a/include/linux/thermal.h
+++ b/include/linux/thermal.h
@@ -158,6 +158,42 @@ struct thermal_attr {
 	char name[THERMAL_NAME_LENGTH];
 };
 
+/**
+ * struct thermal_zone_device - structure for a thermal zone
+ * @id:		unique id number for each thermal zone
+ * @type:	the thermal zone device type
+ * @device:	&struct device for this thermal zone
+ * @trip_temp_attrs:	attributes for trip points for sysfs: trip temperature
+ * @trip_type_attrs:	attributes for trip points for sysfs: trip type
+ * @trip_hyst_attrs:	attributes for trip points for sysfs: trip hysteresis
+ * @devdata:	private pointer for device private data
+ * @trips:	number of trip points the thermal zone supports
+ * @passive_delay:	number of milliseconds to wait between polls when
+ *			performing passive cooling.  Currenty only used by the
+ *			step-wise governor
+ * @polling_delay:	number of milliseconds to wait between polls when
+ *			checking whether trip points have been crossed (0 for
+ *			interrupt driven systems)
+ * @temperature:	current temperature.  This is only for core code,
+ *			drivers should use thermal_zone_get_temp() to get the
+ *			current temperature
+ * @last_temperature:	previous temperature read
+ * @emul_temperature:	emulated temperature when using CONFIG_THERMAL_EMULATION
+ * @passive:		1 if you've crossed a passive trip point, 0 otherwise.
+ *			Currenty only used by the step-wise governor.
+ * @forced_passive:	If > 0, temperature at which to switch on all ACPI
+ *			processor cooling devices.  Currently only used by the
+ *			step-wise governor.
+ * @ops:	operations this &thermal_zone_device supports
+ * @tzp:	thermal zone parameters
+ * @governor:	pointer to the governor for this thermal zone
+ * @thermal_instances:	list of &struct thermal_instance of this thermal zone
+ * @idr:	&struct idr to generate unique id for this zone's cooling
+ *		devices
+ * @lock:	lock to protect thermal_instances list
+ * @node:	node in thermal_tz_list (in thermal_core.c)
+ * @poll_queue:	delayed work for polling
+ */
 struct thermal_zone_device {
 	int id;
 	char type[THERMAL_NAME_LENGTH];
@@ -179,12 +215,18 @@ struct thermal_zone_device {
 	struct thermal_governor *governor;
 	struct list_head thermal_instances;
 	struct idr idr;
-	struct mutex lock; /* protect thermal_instances list */
+	struct mutex lock;
 	struct list_head node;
 	struct delayed_work poll_queue;
 };
 
-/* Structure that holds thermal governor information */
+/**
+ * struct thermal_governor - structure that holds thermal governor information
+ * @name:	name of the governor
+ * @throttle:	callback called for every trip point even if temperature is
+ *		below the trip point temperature
+ * @governor_list:	node in thermal_governor_list (in thermal_core.c)
+ */
 struct thermal_governor {
 	char name[THERMAL_NAME_LENGTH];
 	int (*throttle)(struct thermal_zone_device *tz, int trip);
-- 
1.9.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH v5 05/10] thermal: let governors have private data for each thermal zone
  2014-07-10 14:18 [RFC PATCH v5 00/10] The power allocator thermal governor Javi Merino
                   ` (3 preceding siblings ...)
  2014-07-10 14:18 ` [RFC PATCH v5 04/10] thermal: document struct thermal_zone_device and thermal_governor Javi Merino
@ 2014-07-10 14:18 ` Javi Merino
  2014-08-19 12:49   ` edubezval
  2014-07-10 14:18 ` [RFC PATCH v5 06/10] thermal: introduce the Power Actor API Javi Merino
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Javi Merino @ 2014-07-10 14:18 UTC (permalink / raw)
  To: linux-pm, linux-kernel
  Cc: punit.agrawal, broonie, Javi Merino, Zhang Rui, Eduardo Valentin

A governor may need to store its current state between calls to
throttle().  That state depends on the thermal zone, so store it as
private data in struct thermal_zone_device.

The governors may have two new ops: bind_to_tz() and unbind_from_tz().
When provided, these functions let governors do some initialization
and teardown when they are bound/unbound to a tz and possibly store that
information in the governor_data field of the struct
thermal_zone_device.

Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
---
 drivers/thermal/thermal_core.c | 83 ++++++++++++++++++++++++++++++++++++++----
 include/linux/thermal.h        |  9 +++++
 2 files changed, 84 insertions(+), 8 deletions(-)

diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 71b0ec0c370d..3da99dd80ad5 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -72,6 +72,58 @@ static struct thermal_governor *__find_governor(const char *name)
 	return NULL;
 }
 
+/**
+ * bind_previous_governor() - bind the previous governor of the thermal zone
+ * @tz:		a valid pointer to a struct thermal_zone_device
+ * @failed_gov_name:	the name of the governor that failed to register
+ *
+ * Register the previous governor of the thermal zone after a new
+ * governor has failed to be bound.
+ */
+static void bind_previous_governor(struct thermal_zone_device *tz,
+				const char *failed_gov_name)
+{
+	if (tz->governor && tz->governor->bind_to_tz) {
+		if (tz->governor->bind_to_tz(tz)) {
+			dev_warn(&tz->device,
+				"governor %s failed to bind and the previous one (%s) failed to register again, thermal zone %s has no governor\n",
+				failed_gov_name, tz->governor->name, tz->type);
+			tz->governor = NULL;
+		}
+	}
+}
+
+/**
+ * thermal_set_governor() - Switch to another governor
+ * @tz:		a valid pointer to a struct thermal_zone_device
+ * @new_gov:	pointer to the new governor
+ *
+ * Change the governor of thermal zone @tz.
+ *
+ * Return: 0 on success, an error if the new governor's bind_to_tz() failed.
+ */
+static int thermal_set_governor(struct thermal_zone_device *tz,
+				struct thermal_governor *new_gov)
+{
+	int ret = 0;
+
+	if (tz->governor && tz->governor->unbind_from_tz)
+		tz->governor->unbind_from_tz(tz);
+
+	if (new_gov && new_gov->bind_to_tz) {
+		ret = new_gov->bind_to_tz(tz);
+		if (ret) {
+			bind_previous_governor(tz, new_gov->name);
+
+			return ret;
+		}
+	}
+
+	tz->governor = new_gov;
+
+	return ret;
+}
+
 int thermal_register_governor(struct thermal_governor *governor)
 {
 	int err;
@@ -104,8 +156,15 @@ int thermal_register_governor(struct thermal_governor *governor)
 
 		name = pos->tzp->governor_name;
 
-		if (!strnicmp(name, governor->name, THERMAL_NAME_LENGTH))
-			pos->governor = governor;
+		if (!strnicmp(name, governor->name, THERMAL_NAME_LENGTH)) {
+			int ret;
+
+			ret = thermal_set_governor(pos, governor);
+			if (ret)
+				dev_warn(&pos->device,
+					"Failed to set governor %s for thermal zone %s: %d\n",
+					governor->name, pos->type, ret);
+		}
 	}
 
 	mutex_unlock(&thermal_list_lock);
@@ -131,7 +190,7 @@ void thermal_unregister_governor(struct thermal_governor *governor)
 	list_for_each_entry(pos, &thermal_tz_list, node) {
 		if (!strnicmp(pos->governor->name, governor->name,
 						THERMAL_NAME_LENGTH))
-			pos->governor = NULL;
+			thermal_set_governor(pos, NULL);
 	}
 
 	mutex_unlock(&thermal_list_lock);
@@ -756,8 +815,9 @@ policy_store(struct device *dev, struct device_attribute *attr,
 	if (!gov)
 		goto exit;
 
-	tz->governor = gov;
-	ret = count;
+	ret = thermal_set_governor(tz, gov);
+	if (!ret)
+		ret = count;
 
 exit:
 	mutex_unlock(&thermal_governor_lock);
@@ -1452,6 +1512,7 @@ struct thermal_zone_device *thermal_zone_device_register(const char *type,
 	int result;
 	int count;
 	int passive = 0;
+	struct thermal_governor *governor;
 
 	if (type && strlen(type) >= THERMAL_NAME_LENGTH)
 		return ERR_PTR(-EINVAL);
@@ -1542,9 +1603,15 @@ struct thermal_zone_device *thermal_zone_device_register(const char *type,
 	mutex_lock(&thermal_governor_lock);
 
 	if (tz->tzp)
-		tz->governor = __find_governor(tz->tzp->governor_name);
+		governor = __find_governor(tz->tzp->governor_name);
 	else
-		tz->governor = def_governor;
+		governor = def_governor;
+
+	result = thermal_set_governor(tz, governor);
+	if (result) {
+		mutex_unlock(&thermal_governor_lock);
+		goto unregister;
+	}
 
 	mutex_unlock(&thermal_governor_lock);
 
@@ -1634,7 +1701,7 @@ void thermal_zone_device_unregister(struct thermal_zone_device *tz)
 		device_remove_file(&tz->device, &dev_attr_mode);
 	device_remove_file(&tz->device, &dev_attr_policy);
 	remove_trip_attrs(tz);
-	tz->governor = NULL;
+	thermal_set_governor(tz, NULL);
 
 	thermal_remove_hwmon_sysfs(tz);
 	release_idr(&thermal_tz_idr, &thermal_idr_lock, tz->id);
diff --git a/include/linux/thermal.h b/include/linux/thermal.h
index 0305cde21a74..1124b7a9358a 100644
--- a/include/linux/thermal.h
+++ b/include/linux/thermal.h
@@ -187,6 +187,7 @@ struct thermal_attr {
  * @ops:	operations this &thermal_zone_device supports
  * @tzp:	thermal zone parameters
  * @governor:	pointer to the governor for this thermal zone
+ * @governor_data:	private pointer for governor data
  * @thermal_instances:	list of &struct thermal_instance of this thermal zone
  * @idr:	&struct idr to generate unique id for this zone's cooling
  *		devices
@@ -213,6 +214,7 @@ struct thermal_zone_device {
 	struct thermal_zone_device_ops *ops;
 	const struct thermal_zone_params *tzp;
 	struct thermal_governor *governor;
+	void *governor_data;
 	struct list_head thermal_instances;
 	struct idr idr;
 	struct mutex lock;
@@ -223,12 +225,19 @@ struct thermal_zone_device {
 /**
  * struct thermal_governor - structure that holds thermal governor information
  * @name:	name of the governor
+ * @bind_to_tz: callback called when binding to a thermal zone.  If it
+ *		returns 0, the governor is bound to the thermal zone,
+ *		otherwise it fails.
+ * @unbind_from_tz:	callback called when a governor is unbound from a
+ *			thermal zone.
  * @throttle:	callback called for every trip point even if temperature is
  *		below the trip point temperature
  * @governor_list:	node in thermal_governor_list (in thermal_core.c)
  */
 struct thermal_governor {
 	char name[THERMAL_NAME_LENGTH];
+	int (*bind_to_tz)(struct thermal_zone_device *tz);
+	void (*unbind_from_tz)(struct thermal_zone_device *tz);
 	int (*throttle)(struct thermal_zone_device *tz, int trip);
 	struct list_head	governor_list;
 };
-- 
1.9.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH v5 06/10] thermal: introduce the Power Actor API
  2014-07-10 14:18 [RFC PATCH v5 00/10] The power allocator thermal governor Javi Merino
                   ` (4 preceding siblings ...)
  2014-07-10 14:18 ` [RFC PATCH v5 05/10] thermal: let governors have private data for each thermal zone Javi Merino
@ 2014-07-10 14:18 ` Javi Merino
  2014-07-10 14:18 ` [RFC PATCH v5 07/10] thermal: add a basic cpu power actor Javi Merino
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Javi Merino @ 2014-07-10 14:18 UTC (permalink / raw)
  To: linux-pm, linux-kernel
  Cc: punit.agrawal, broonie, Javi Merino, Zhang Rui, Eduardo Valentin

This patch introduces the Power Actor API in the thermal framework.
With it, devices that can report their power consumption and control
it can be registered.  This base interface is meant to be used to
derive specific power actors, such as a cpu power actor.

Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
---
 Documentation/thermal/power_actor.txt | 56 ++++++++++++++++++++++++++++
 drivers/thermal/Kconfig               |  3 ++
 drivers/thermal/Makefile              |  3 ++
 drivers/thermal/power_actor.c         | 70 +++++++++++++++++++++++++++++++++++
 include/linux/power_actor.h           | 60 ++++++++++++++++++++++++++++++
 5 files changed, 192 insertions(+)
 create mode 100644 Documentation/thermal/power_actor.txt
 create mode 100644 drivers/thermal/power_actor.c
 create mode 100644 include/linux/power_actor.h

diff --git a/Documentation/thermal/power_actor.txt b/Documentation/thermal/power_actor.txt
new file mode 100644
index 000000000000..11ca2d0bf0bd
--- /dev/null
+++ b/Documentation/thermal/power_actor.txt
@@ -0,0 +1,56 @@
+Power Actor API
+===============
+
+The base power actor API is meant to be used to derive specific power
+actors, such as a cpu power actor.  Power actors can be registered by
+calling `power_actor_register()` and should be unregistered by calling
+`power_actor_unregister()` with the `struct power_actor *` received in
+the call to `power_actor_register()`.
+
+This can't be implemented using the cooling device API because:
+
+1.  get_max_state() gives you the maximum cooling state which, for
+    passive devices, is the minimum performance (frequency in case of
+    cpufreq cdev).  get_max_power() gives you the maximum power, which
+    gives you the maximum performance (frequency in the case of CPUs,
+    GPUs and buses)
+
+2.  You need to pass the thermal_zone_device to all the callbacks,
+    something that the current cooling device API doesn't do.
+
+Callbacks
+---------
+
+1. u32 get_req_power(struct power_actor *actor,
+					struct thermal_zone_device *tz)
+@actor: a valid `struct power_actor *` registered with
+        `power_actor_register()`
+@tz:	the thermal zone closest to the actor (typically, the thermal
+		zone the caller is operating on)
+
+`get_req_power()` returns the current requested power in milliwatts.
+
+2. u32 get_max_power(struct power_actor *actor,
+					struct thermal_zone_device *tz)
+@actor: a valid `struct power_actor *` registered with
+        `power_actor_register()`
+@tz:	the thermal zone closest to the actor (typically, the thermal
+		zone the caller is operating on)
+
+`get_max_power()` returns the maximum power that the device could
+consume if it was fully utilized.  It's a function as some devices'
+maximum power consumption can change due to external factors such as
+temperature.
+
+3. int set_power(struct power_actor *actor,
+				struct thermal_zone_device *tz, u32 power)
+@actor: a valid `struct power_actor *` registered with
+        `power_actor_register()`
+@tz:	the thermal zone closest to the actor (typically, the thermal
+		zone the caller is operating on)
+@power: power in milliwatts
+
+`set_power()` should configure the device to consume @power
+milliwatts.
+
+Returns 0 on success, -E* on error.
diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index f9a13867cb70..ce4ebe17252c 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -89,6 +89,9 @@ config THERMAL_GOV_USER_SPACE
 	help
 	  Enable this to let the user space manage the platform thermals.
 
+config THERMAL_POWER_ACTOR
+	bool
+
 config CPU_THERMAL
 	bool "generic cpu cooling support"
 	depends on CPU_FREQ
diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
index de0636a57a64..d83aa42ab573 100644
--- a/drivers/thermal/Makefile
+++ b/drivers/thermal/Makefile
@@ -14,6 +14,9 @@ thermal_sys-$(CONFIG_THERMAL_GOV_FAIR_SHARE)	+= fair_share.o
 thermal_sys-$(CONFIG_THERMAL_GOV_STEP_WISE)	+= step_wise.o
 thermal_sys-$(CONFIG_THERMAL_GOV_USER_SPACE)	+= user_space.o
 
+# power actors
+obj-$(CONFIG_THERMAL_POWER_ACTOR) += power_actor.o
+
 # cpufreq cooling
 thermal_sys-$(CONFIG_CPU_THERMAL)	+= cpu_cooling.o
 
diff --git a/drivers/thermal/power_actor.c b/drivers/thermal/power_actor.c
new file mode 100644
index 000000000000..0b123f9850ea
--- /dev/null
+++ b/drivers/thermal/power_actor.c
@@ -0,0 +1,70 @@
+/*
+ * Basic interface for power actors
+ *
+ * Copyright (C) 2014 ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed "as is" WITHOUT ANY WARRANTY of any
+ * kind, whether express or implied; without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#define pr_fmt(fmt) "Power actor: " fmt
+
+#include <linux/err.h>
+#include <linux/mutex.h>
+#include <linux/power_actor.h>
+#include <linux/rculist.h>
+#include <linux/slab.h>
+
+LIST_HEAD(actor_list);
+static DEFINE_MUTEX(actor_list_lock);
+
+/**
+ * power_actor_register() - Register an actor in the power actor API
+ * @ops:	&struct power_actor_ops for this actor
+ * @privdata:	pointer to private data related to the actor
+ *
+ * Return: The &struct power_actor * on success, ERR_PTR() on failure
+ */
+struct power_actor *power_actor_register(struct power_actor_ops *ops,
+					void *privdata)
+{
+	struct power_actor *actor;
+
+	if (!ops->get_req_power || !ops->get_max_power || !ops->set_power)
+		return ERR_PTR(-EINVAL);
+
+	actor = kzalloc(sizeof(*actor), GFP_KERNEL);
+	if (!actor)
+		return ERR_PTR(-ENOMEM);
+
+	actor->ops = ops;
+	actor->data = privdata;
+
+	mutex_lock(&actor_list_lock);
+	list_add_rcu(&actor->actor_node, &actor_list);
+	mutex_unlock(&actor_list_lock);
+
+	return actor;
+}
+EXPORT_SYMBOL(power_actor_register);
+
+/**
+ * power_actor_unregister() - Unregister an actor
+ * @actor:	the actor to unregister
+ */
+void power_actor_unregister(struct power_actor *actor)
+{
+	mutex_lock(&actor_list_lock);
+	list_del_rcu(&actor->actor_node);
+	mutex_unlock(&actor_list_lock);
+	synchronize_rcu();
+
+	kfree(actor);
+}
+EXPORT_SYMBOL(power_actor_unregister);
diff --git a/include/linux/power_actor.h b/include/linux/power_actor.h
new file mode 100644
index 000000000000..c2357fe7aad5
--- /dev/null
+++ b/include/linux/power_actor.h
@@ -0,0 +1,60 @@
+/*
+ * Copyright (C) 2014 ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef __POWER_ACTOR_H__
+#define __POWER_ACTOR_H__
+
+#include <linux/list.h>
+#include <linux/mutex.h>
+
+struct power_actor;
+
+/**
+ * struct power_actor_ops - callbacks for power actors
+ * @get_req_power:	return the current requested power in milliwatts
+ * @get_max_power:	return the max power that the device can currently
+ *			consume in milliwatts
+ * @set_power:		configure the device to consume a certain power in
+ *			milliwatts
+ */
+struct power_actor_ops {
+	u32 (*get_req_power)(struct power_actor *,
+			struct thermal_zone_device *);
+	u32 (*get_max_power)(struct power_actor *,
+			struct thermal_zone_device *);
+	int (*set_power)(struct power_actor *, struct thermal_zone_device *,
+			u32);
+};
+
+/**
+ * struct power_actor - structure for a power actor
+ * @ops:	callbacks for the power actor
+ * @data:	a private pointer for type-specific data
+ * @actor_node:	node in actor_list
+ */
+struct power_actor {
+	struct power_actor_ops *ops;
+	void *data;
+	struct list_head actor_node;
+};
+
+struct power_actor *power_actor_register(struct power_actor_ops *ops,
+					void *privdata);
+void power_actor_unregister(struct power_actor *actor);
+
+extern struct list_head actor_list;
+
+#endif /* __POWER_ACTOR_H__ */
-- 
1.9.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH v5 07/10] thermal: add a basic cpu power actor
  2014-07-10 14:18 [RFC PATCH v5 00/10] The power allocator thermal governor Javi Merino
                   ` (5 preceding siblings ...)
  2014-07-10 14:18 ` [RFC PATCH v5 06/10] thermal: introduce the Power Actor API Javi Merino
@ 2014-07-10 14:18 ` Javi Merino
  2014-07-10 14:18 ` [RFC PATCH v5 08/10] thermal: introduce the Power Allocator governor Javi Merino
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Javi Merino @ 2014-07-10 14:18 UTC (permalink / raw)
  To: linux-pm, linux-kernel
  Cc: punit.agrawal, broonie, Javi Merino, Zhang Rui, Eduardo Valentin

Introduce a power actor for cpus.  It has a basic power model to get
the current power utilization and uses cpufreq cooling devices to set
the desired power.  It uses the current frequency (as reported by
cpufreq) as well as load and OPPs for the power calculations.  The
cpus must have registered their OPPs in the OPP library.

Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
---
 Documentation/thermal/power_actor.txt | 125 +++++++++
 drivers/thermal/Kconfig               |   5 +
 drivers/thermal/Makefile              |   1 +
 drivers/thermal/cpu_actor.c           | 488 ++++++++++++++++++++++++++++++++++
 include/linux/power_actor.h           |  26 ++
 5 files changed, 645 insertions(+)
 create mode 100644 drivers/thermal/cpu_actor.c

diff --git a/Documentation/thermal/power_actor.txt b/Documentation/thermal/power_actor.txt
index 11ca2d0bf0bd..c96344f12599 100644
--- a/Documentation/thermal/power_actor.txt
+++ b/Documentation/thermal/power_actor.txt
@@ -54,3 +54,128 @@ temperature.
 milliwatts.
 
 Returns 0 on success, -E* on error.
+
+CPU Power Actor API
+===================
+
+A simple power model for CPUs.  The current power is calculated as
+dynamic + (optionally) static power.  This power model requires that
+the operating-points of the CPUs are registered using the kernel's opp
+library and the `cpufreq_frequency_table` is assigned to the `struct
+device` of the cpu.  If you are using the `cpufreq-cpu0.c` driver then
+the `cpufreq_frequency_table` should already be assigned to the cpu
+device.
+
+The `plat_static_func` parameter of `power_cpu_actor_register()` is
+optional.  If you don't provide it, only dynamic power will be
+considered.
+
+Dynamic power
+-------------
+
+The dynamic power consumption of a processor depends on many factors.
+For a given processor implementation the primary factors are:
+
+- The time the processor spends running, consuming dynamic power, as
+  compared to the time in idle states where dynamic consumption is
+  negligible.  Herein we refer to this as 'utilisation'.
+- The voltage and frequency levels as a result of DVFS.  The DVFS
+  level is a dominant factor governing power consumption.
+- In running time the 'execution' behaviour (instruction types, memory
+  access patterns and so forth) causes, in most cases, a second order
+  variation.  In pathological cases this variation can be significant,
+  but typically it is of a much lesser impact than the factors above.
+
+A high level dynamic power consumption model may then be represented as:
+
+Pdyn = f(run) * Voltage^2 * Frequency * Utilisation
+
+f(run) here represents the described execution behaviour and its
+result has a units of Watts/Hz/Volt^2 (this often expressed in
+mW/MHz/uVolt^2)
+
+The detailed behaviour for f(run) could be modelled on-line.  However,
+in practice, such an on-line model has dependencies on a number of
+implementation specific processor support and characterisation
+factors.  Therefore, in initial implementation that contribution is
+represented as a constant coefficient.  This is a simplification
+consistent with the relative contribution to overall power variation.
+
+In this simplified representation our model becomes:
+
+Pdyn = Kd * Voltage^2 * Frequency * Utilisation
+
+Where Kd (capacitance) represents an indicative running time dynamic
+power coefficient in fundamental units of mW/MHz/uVolt^2
+
+Static Power
+------------
+
+Static leakage power consumption depends on a number of factors.  For a
+given circuit implementation the primary factors are:
+
+- Time the circuit spends in each 'power state'
+- Temperature
+- Operating voltage
+- Process grade
+
+The time the circuit spends in each 'power state' for a given
+evaluation period at first order means OFF or ON.  However,
+'retention' states can also be supported that reduce power during
+inactive periods without loss of context.
+
+Note: The visibility of state entries to the OS can vary, according to
+platform specifics, and this can then impact the accuracy of a model
+based on OS state information alone.  It might be possible in some
+cases to extract more accurate information from system resources.
+
+The temperature, operating voltage and process 'grade' (slow to fast)
+of the circuit are all significant factors in static leakage power
+consumption.  All of these have complex relationships to static power.
+
+Circuit implementation specific factors include the chosen silicon
+process as well as the type, number and size of transistors in both
+the logic gates and any RAM elements included.
+
+The static power consumption modelling must take into account the
+power managed regions that are implemented.  Taking the example of an
+ARM processor cluster, the modelling would take into account whether
+each CPU can be powered OFF separately or if only a single power
+region is implemented for the complete cluster.
+
+In one view, there are others, a static power consumption model can
+then start from a set of reference values for each power managed
+region (e.g. CPU, Cluster/L2) in each state (e.g. ON, OFF) at an
+arbitrary process grade, voltage and temperature point.  These values
+are then scaled for all of the following: the time in each state, the
+process grade, the current temperature and the operating
+voltage.  However, since both implementation specific and complex
+relationships dominate the estimate, the appropriate interface to the
+model from the cpu power actor is to provide a function callback that
+calculates the static power in this platform.  When registering the
+power cpu actor, pass the thermal zone closest to the cpu (to get the
+temperature) and a function pointer that follows the `get_static_t`
+prototype:
+
+    u32 plat_get_static(cpumask_t *cpumask, unsigned long voltage,
+				unsigned long temperature);
+
+with `cpumask` a cpumask of the cpus involved in the calculation,
+`voltage` the voltage at which they are opperating and `temperature`
+their current temperature.
+
+If `plat_static_func` is NULL when registering the power cpu actor,
+static power is considered to be negligible for this platform and only
+dynamic power is considered.
+
+The platform specific callback can then use any combination of tables
+and/or equations to permute the estimated value.  Process grade
+information is not passed to the model since access to such data, from
+on-chip measurement capability or manufacture time data, is platform
+specific.
+
+Note: the significance of static power for CPUs in comparison to
+dynamic power is highly dependent on implementation.  Given the
+potential complexity in implementation, the importance and accuracy of
+its inclusion when using cpu power actors should be assessed on a case by
+cases basis.
diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index ce4ebe17252c..249b196deffd 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -92,6 +92,11 @@ config THERMAL_GOV_USER_SPACE
 config THERMAL_POWER_ACTOR
 	bool
 
+config THERMAL_POWER_ACTOR_CPU
+	depends on CPU_THERMAL
+	select THERMAL_POWER_ACTOR
+	bool
+
 config CPU_THERMAL
 	bool "generic cpu cooling support"
 	depends on CPU_FREQ
diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
index d83aa42ab573..74f97c90a46c 100644
--- a/drivers/thermal/Makefile
+++ b/drivers/thermal/Makefile
@@ -16,6 +16,7 @@ thermal_sys-$(CONFIG_THERMAL_GOV_USER_SPACE)	+= user_space.o
 
 # power actors
 obj-$(CONFIG_THERMAL_POWER_ACTOR) += power_actor.o
+obj-$(CONFIG_THERMAL_POWER_ACTOR_CPU) += cpu_actor.o
 
 # cpufreq cooling
 thermal_sys-$(CONFIG_CPU_THERMAL)	+= cpu_cooling.o
diff --git a/drivers/thermal/cpu_actor.c b/drivers/thermal/cpu_actor.c
new file mode 100644
index 000000000000..45ea4fa92ea0
--- /dev/null
+++ b/drivers/thermal/cpu_actor.c
@@ -0,0 +1,488 @@
+/*
+ * A basic cpu power_actor
+ *
+ * Copyright (C) 2014 ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed "as is" WITHOUT ANY WARRANTY of any
+ * kind, whether express or implied; without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#define pr_fmt(fmt) "CPU actor: " fmt
+
+#include <linux/cpu.h>
+#include <linux/cpufreq.h>
+#include <linux/cpumask.h>
+#include <linux/cpu_cooling.h>
+#include <linux/device.h>
+#include <linux/err.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/pm_opp.h>
+#include <linux/power_actor.h>
+#include <linux/printk.h>
+#include <linux/slab.h>
+
+/**
+ * struct power_table - frequency to power conversion
+ * @frequency:	frequency in KHz
+ * @power:	power in mW
+ *
+ * This structure is built when the cooling device registers and helps
+ * in translating frequency to power and viceversa.
+ */
+struct power_table {
+	u32 frequency;
+	u32 power;
+};
+
+/**
+ * struct cpu_actor - information for each cpu actor
+ * @cpumask:	cpus covered by this actor
+ * @last_load:	load measured by the latest call to cpu_get_req_power()
+ * @time_in_idle:	previous reading of the absolute time that this cpu was
+ *			idle
+ * @time_in_idle_timestamp: wall time of the last invocation of
+ *			    get_cpu_idle_time_us()
+ * @dyn_power_table:	array of struct power_table for frequency to power
+ *			conversion
+ * @dyn_power_table_entries: number of entries in the @dyn_power_table array
+ * @cdev:	cpufreq cooling device associated with this actor
+ * @plat_get_static_power:	callback to calculate the static power
+ */
+struct cpu_actor {
+	cpumask_t cpumask;
+	u32 last_load;
+	u64 time_in_idle[NR_CPUS];
+	u64 time_in_idle_timestamp[NR_CPUS];
+	struct power_table *dyn_power_table;
+	int dyn_power_table_entries;
+	struct thermal_cooling_device *cdev;
+	get_static_t plat_get_static_power;
+};
+
+static u32 cpu_freq_to_power(struct cpu_actor *cpu_actor, u32 freq)
+{
+	int i;
+	struct power_table *pt = cpu_actor->dyn_power_table;
+
+	for (i = 0; i < cpu_actor->dyn_power_table_entries - 1; i++)
+		if (freq <= pt[i].frequency)
+			break;
+
+	return pt[i].power;
+}
+
+static u32 cpu_power_to_freq(struct cpu_actor *cpu_actor, u32 power)
+{
+	int i;
+	struct power_table *pt = cpu_actor->dyn_power_table;
+
+	for (i = 0; i < cpu_actor->dyn_power_table_entries - 1; i++)
+		if (power <= pt[i].power)
+			break;
+
+	return pt[i].frequency;
+}
+
+/**
+ * get_load() - get load for a cpu since last updated
+ * @cpu_actor:	&struct cpu_actor for this actor
+ * @cpu:	cpu number
+ *
+ * Return: The average load of cpu @cpu in percentage since this
+ * function was last called.
+ */
+static u32 get_load(struct cpu_actor *cpu_actor, int cpu)
+{
+	u32 load;
+	u64 now, now_idle, delta_time, delta_idle;
+
+	now_idle = get_cpu_idle_time(cpu, &now, 0);
+	delta_idle = now_idle - cpu_actor->time_in_idle[cpu];
+	delta_time = now - cpu_actor->time_in_idle_timestamp[cpu];
+
+	if (delta_time <= delta_idle)
+		load = 0;
+	else
+		load = div64_u64(100 * (delta_time - delta_idle), delta_time);
+
+	cpu_actor->time_in_idle[cpu] = now_idle;
+	cpu_actor->time_in_idle_timestamp[cpu] = now;
+
+	return load;
+}
+
+/**
+ * get_static_power() - calculate the static power consumed by the cpus
+ * @cpu_actor:	&struct cpu_actor for this cpu
+ * @tz:		&struct thermal_zone_device closest to the cpu
+ * @freq:	frequency in KHz
+ *
+ * Calculate the static power consumed by the cpus described by
+ * @cpu_actor running at frequency @freq.  This function relies on a
+ * platform specific function that should have been provided when the
+ * actor was registered.  If it wasn't, the static power is assumed to
+ * be negligible.
+ *
+ * Return: The static power consumed by the cpus.  It returns 0 on
+ * error or if there is no plat_get_static_power().
+ */
+static u32 get_static_power(struct cpu_actor *cpu_actor,
+			struct thermal_zone_device *tz, unsigned long freq)
+{
+	int err;
+	struct device *cpu_dev;
+	struct dev_pm_opp *opp;
+	unsigned long voltage, temperature;
+	cpumask_t *cpumask = &cpu_actor->cpumask;
+	unsigned long freq_hz = freq * 1000;
+
+	if (!cpu_actor->plat_get_static_power)
+		return 0;
+
+	if (freq == 0)
+		return 0;
+
+	cpu_dev = get_cpu_device(cpumask_any(cpumask));
+
+	rcu_read_lock();
+
+	opp = dev_pm_opp_find_freq_exact(cpu_dev, freq_hz, true);
+	voltage = dev_pm_opp_get_voltage(opp);
+
+	rcu_read_unlock();
+
+	if (voltage == 0) {
+		dev_warn_ratelimited(cpu_dev,
+				"Failed to get voltage for frequency %lu: %ld\n",
+				freq_hz, IS_ERR(opp) ? PTR_ERR(opp) : 0);
+		return 0;
+	}
+
+	err = thermal_zone_get_temp(tz, &temperature);
+	if (err) {
+		dev_warn(&tz->device, "Unable to read temperature: %d\n", err);
+		return 0;
+	}
+
+	return cpu_actor->plat_get_static_power(cpumask, voltage, temperature);
+}
+
+/**
+ * get_dynamic_power() - calculate the dynamic power
+ * @cpu_actor:	cpu_actor pointer
+ * @freq:	current frequency
+ *
+ * Return: the dynamic power consumed by the cpus described by
+ * @cpu_actor.
+ */
+static u32 get_dynamic_power(struct cpu_actor *cpu_actor, unsigned long freq)
+{
+	int cpu;
+	u32 power = 0, raw_cpu_power, total_load = 0;
+
+	raw_cpu_power = cpu_freq_to_power(cpu_actor, freq);
+
+	for_each_cpu(cpu, &cpu_actor->cpumask) {
+		u32 load;
+
+		if (!cpu_online(cpu))
+			continue;
+
+		load = get_load(cpu_actor, cpu);
+		power += (raw_cpu_power * load) / 100;
+		total_load += load;
+	}
+
+	cpu_actor->last_load = total_load;
+
+	return power;
+}
+
+/**
+ * cpu_get_req_power() - get the current power
+ * @actor:	power actor pointer
+ * @tz:		&thermal_zone_device closest to the CPU
+ *
+ * Callback for the power actor to return the current power
+ * consumption in milliwatts.
+ */
+static u32 cpu_get_req_power(struct power_actor *actor,
+			struct thermal_zone_device *tz)
+{
+	u32 static_power, dynamic_power;
+	unsigned long freq;
+	struct cpu_actor *cpu_actor = actor->data;
+
+	freq = cpufreq_quick_get(cpumask_any(&cpu_actor->cpumask));
+
+	static_power = get_static_power(cpu_actor, tz, freq);
+	dynamic_power = get_dynamic_power(cpu_actor, freq);
+
+	return static_power + dynamic_power;
+}
+
+/**
+ * cpu_get_max_power() - get the maximum power that the cpu could currently consume
+ * @actor:	power actor pointer
+ * @tz:		&thermal_zone_device closest to the CPU
+ *
+ * Callback for the power actor to return the maximum power
+ * consumption in milliwatts that the cpu could currently consume.
+ * The static power depends on temperature so the maximum power will
+ * vary over time.
+ */
+static u32 cpu_get_max_power(struct power_actor *actor,
+			struct thermal_zone_device *tz)
+{
+	u32 max_static_power, max_dyn_power;
+	cpumask_t *cpumask;
+	unsigned int max_freq, last_entry, num_cpus;
+	struct cpu_actor *cpu_actor = actor->data;
+
+	cpumask = &cpu_actor->cpumask;
+	max_freq = cpufreq_quick_get_max(cpumask_any(cpumask));
+	max_static_power = get_static_power(cpu_actor, tz, max_freq);
+
+	last_entry = cpu_actor->dyn_power_table_entries - 1;
+	num_cpus = cpumask_weight(cpumask);
+	max_dyn_power = cpu_actor->dyn_power_table[last_entry].power * num_cpus;
+
+	return max_static_power + max_dyn_power;
+}
+
+/**
+ * cpu_set_power() - set cpufreq cooling device to consume a certain power
+ * @actor: power actor pointer
+ * @tz:		&thermal_zone_device closest to the CPU
+ * @power: the power in milliwatts that should be set
+ *
+ * Callback for the power actor to configure the power consumption of
+ * the CPU to be @power milliwatts at most.  This function assumes
+ * that the load will remain constant.  The power is translated into a
+ * cooling state that the cpu cooling device then sets.
+ *
+ * Return: 0 on success, -EINVAL if it couldn't convert the frequency
+ * to a cpufreq cooling device state.
+ */
+static int cpu_set_power(struct power_actor *actor,
+			struct thermal_zone_device *tz, u32 power)
+{
+	unsigned int cpu, cur_freq, target_freq;
+	unsigned long cdev_state;
+	u32 dyn_power, normalised_power, last_load;
+	struct thermal_cooling_device *cdev;
+	struct cpu_actor *cpu_actor = actor->data;
+
+	cdev = cpu_actor->cdev;
+	cpu = cpumask_any(&cpu_actor->cpumask);
+	cur_freq = cpufreq_quick_get(cpu);
+
+	dyn_power = power - get_static_power(cpu_actor, tz, cur_freq);
+	last_load = cpu_actor->last_load ? cpu_actor->last_load : 1;
+	normalised_power = (dyn_power * 100) / last_load;
+	target_freq = cpu_power_to_freq(cpu_actor, normalised_power);
+
+	cdev_state = cpufreq_cooling_get_level(cpu, target_freq);
+	if (cdev_state == THERMAL_CSTATE_INVALID) {
+		pr_err("Failed to convert %dKHz for cpu %d into a cdev state\n",
+			target_freq, cpu);
+		return -EINVAL;
+	}
+
+	return cdev->ops->set_cur_state(cdev, cdev_state);
+}
+
+static struct power_actor_ops cpu_actor_ops = {
+	.get_req_power = cpu_get_req_power,
+	.get_max_power = cpu_get_max_power,
+	.set_power = cpu_set_power,
+};
+
+/**
+ * build_dyn_power_table() - create a dynamic power to frequency table
+ * @cpu_actor:	the cpu_actor in which to store the table
+ * @capacitance: dynamic power coefficient for these cpus
+ *
+ * Build a dynamic power to frequency table for this cpu and store it
+ * in @cpu_actor.  This table will be used in cpu_power_to_freq() and
+ * cpu_freq_to_power() to convert between power and frequency
+ * efficiently.  Power is stored in mW, frequency in KHz.  The
+ * resulting table is in ascending order.
+ *
+ * Return: 0 on success, -E* on error.
+ */
+static int build_dyn_power_table(struct cpu_actor *cpu_actor, u32 capacitance)
+{
+	struct power_table *power_table;
+	struct dev_pm_opp *opp;
+	struct device *dev = NULL;
+	int num_opps, cpu, i, ret = 0;
+	unsigned long freq;
+
+	num_opps = 0;
+
+	rcu_read_lock();
+
+	for_each_cpu(cpu, &cpu_actor->cpumask) {
+		dev = get_cpu_device(cpu);
+		if (!dev)
+			continue;
+
+		num_opps = dev_pm_opp_get_opp_count(dev);
+		if (num_opps > 0) {
+			break;
+		} else if (num_opps < 0) {
+			ret = num_opps;
+			goto unlock;
+		}
+	}
+
+	if (num_opps == 0) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	power_table = kcalloc(num_opps, sizeof(*power_table), GFP_KERNEL);
+
+	i = 0;
+	for (freq = 0;
+	     opp = dev_pm_opp_find_freq_ceil(dev, &freq), !IS_ERR(opp);
+	     freq++) {
+		u32 freq_mhz, voltage_mv;
+		u64 power;
+
+		freq_mhz = freq / 1000000;
+		voltage_mv = dev_pm_opp_get_voltage(opp) / 1000;
+
+		/*
+		 * Do the multiplication with MHz and millivolt so as
+		 * to not overflow.
+		 */
+		power = (u64)capacitance * freq_mhz * voltage_mv * voltage_mv;
+		do_div(power, 1000000000);
+
+		/* frequency is stored in power_table in KHz */
+		power_table[i].frequency = freq / 1000;
+		power_table[i].power = power;
+
+		i++;
+	}
+
+	if (i == 0) {
+		ret = PTR_ERR(opp);
+		goto unlock;
+	}
+
+	cpu_actor->dyn_power_table = power_table;
+	cpu_actor->dyn_power_table_entries = i;
+
+unlock:
+	rcu_read_unlock();
+	return ret;
+}
+
+/**
+ * power_cpu_actor_register() - register a cpu_actor within the power actor API
+ * @np:		DT node for the cpus.
+ * @cpu:	one of the cpus covered by this power_actor
+ * @capacitance:	dynamic power coefficient for these cpus
+ * @plat_static_func:	function to calculate the static power consumed by these
+ *			cpus (optional)
+ *
+ * Create a cpufreq cooling device for @cpu (and its related cpus) and
+ * register it with the power actor API using a simple cpu power
+ * model.  If @np is not NULL, the cpufreq cooling device is
+ * registered with of_cpufreq_cooling_register(), otherwise
+ * cpufreq_cooling_register() is used.  The cpus must have registered
+ * their OPPs in the OPP library.
+ *
+ * An optional @plat_static_func may be provided to calculate the
+ * static power consumed by these cpus.  If the platform's static
+ * power consumption is unknown or negligible, make it NULL.
+ *
+ * The actor registered should be freed using
+ * power_cpu_actor_unregister() when it's no longer needed.
+ *
+ * Return: The power_actor created on success or the corresponding
+ * ERR_PTR() on failure.
+ */
+struct power_actor *
+power_cpu_actor_register(struct device_node *np, unsigned int cpu,
+			u32 capacitance, get_static_t plat_static_func)
+{
+	int ret;
+	struct thermal_cooling_device *cdev;
+	struct power_actor *actor, *err_ret;
+	struct cpu_actor *cpu_actor;
+	struct cpufreq_policy *cpufreq_pol;
+
+	cpu_actor = kzalloc(sizeof(*cpu_actor), GFP_KERNEL);
+	if (!cpu_actor)
+		return ERR_PTR(-ENOMEM);
+
+	cpufreq_pol = cpufreq_cpu_get(cpu);
+	if (!cpufreq_pol) {
+		err_ret = ERR_PTR(-EINVAL);
+		goto kfree_cpu_actor;
+	}
+
+	cpumask_copy(&cpu_actor->cpumask, cpufreq_pol->related_cpus);
+
+	cpufreq_cpu_put(cpufreq_pol);
+
+	if (!np)
+		cdev = cpufreq_cooling_register(&cpu_actor->cpumask);
+	else
+		cdev = of_cpufreq_cooling_register(np, &cpu_actor->cpumask);
+
+	if (!cdev) {
+		err_ret = ERR_PTR(PTR_ERR(cdev));
+		goto kfree_cpu_actor;
+	}
+
+	cpu_actor->cdev = cdev;
+	cpu_actor->plat_get_static_power = plat_static_func;
+
+	ret = build_dyn_power_table(cpu_actor, capacitance);
+	if (ret) {
+		err_ret = ERR_PTR(ret);
+		goto cdev_unregister;
+	}
+
+	actor = power_actor_register(&cpu_actor_ops, cpu_actor);
+	if (IS_ERR(actor)) {
+		err_ret = actor;
+		goto kfree_dyn_power_table;
+	}
+
+	return actor;
+
+kfree_dyn_power_table:
+	kfree(cpu_actor->dyn_power_table);
+cdev_unregister:
+	cpufreq_cooling_unregister(cdev);
+kfree_cpu_actor:
+	kfree(cpu_actor);
+
+	return err_ret;
+}
+
+/**
+ * power_cpu_actor_unregister() - Unregister a power cpu actor
+ * @actor:	the actor to unregister
+ */
+void power_cpu_actor_unregister(struct power_actor *actor)
+{
+	struct cpu_actor *cpu_actor = actor->data;
+
+	kfree(cpu_actor->dyn_power_table);
+	kfree(cpu_actor);
+	power_actor_unregister(actor);
+}
diff --git a/include/linux/power_actor.h b/include/linux/power_actor.h
index c2357fe7aad5..90ff910493b5 100644
--- a/include/linux/power_actor.h
+++ b/include/linux/power_actor.h
@@ -17,8 +17,12 @@
 #ifndef __POWER_ACTOR_H__
 #define __POWER_ACTOR_H__
 
+#include <linux/cpumask.h>
+#include <linux/device.h>
+#include <linux/err.h>
 #include <linux/list.h>
 #include <linux/mutex.h>
+#include <linux/thermal.h>
 
 struct power_actor;
 
@@ -55,6 +59,28 @@ struct power_actor *power_actor_register(struct power_actor_ops *ops,
 					void *privdata);
 void power_actor_unregister(struct power_actor *actor);
 
+typedef u32 (*get_static_t)(cpumask_t *cpumask,
+				unsigned long voltage,
+				unsigned long temperature);
+
+#ifdef CONFIG_THERMAL_POWER_ACTOR_CPU
+struct power_actor *
+power_cpu_actor_register(struct device_node *np, unsigned int cpu,
+			u32 capacitance, get_static_t plat_static_func);
+void power_cpu_actor_unregister(struct power_actor *actor);
+#else
+static inline
+struct power_actor *
+power_cpu_actor_register(struct device_node *np, unsigned int cpu,
+			u32 capacitance, get_static_t plat_static_func)
+{
+	return ERR_PTR(-ENOSYS);
+}
+static inline void power_cpu_actor_unregister(struct power_actor *actor)
+{
+}
+#endif
+
 extern struct list_head actor_list;
 
 #endif /* __POWER_ACTOR_H__ */
-- 
1.9.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH v5 08/10] thermal: introduce the Power Allocator governor
  2014-07-10 14:18 [RFC PATCH v5 00/10] The power allocator thermal governor Javi Merino
                   ` (6 preceding siblings ...)
  2014-07-10 14:18 ` [RFC PATCH v5 07/10] thermal: add a basic cpu power actor Javi Merino
@ 2014-07-10 14:18 ` Javi Merino
  2014-08-19 12:56   ` Eduardo Valentin
  2014-08-19 13:45   ` Eduardo Valentin
  2014-07-10 14:18 ` [RFC PATCH v5 09/10] thermal: add trace events to the power allocator governor Javi Merino
  2014-07-10 14:18 ` [RFC PATCH v5 10/10] of: thermal: Introduce sustainable power for a thermal zone Javi Merino
  9 siblings, 2 replies; 22+ messages in thread
From: Javi Merino @ 2014-07-10 14:18 UTC (permalink / raw)
  To: linux-pm, linux-kernel
  Cc: punit.agrawal, broonie, Javi Merino, Zhang Rui, Eduardo Valentin

The power allocator governor is a thermal governor that controls system
and device power allocation to control temperature.  Conceptually, the
implementation divides the sustainable power of a thermal zone among
all the heat sources in that zone.

This governor relies on "power actors", entities that represent heat
sources.  They can report current and maximum power consumption and
can set a given maximum power consumption, usually via a cooling
device.

The governor uses a Proportional Integral Derivative (PID) controller
driven by the temperature of the thermal zone.  The output of the
controller is a power budget that is then allocated to each power
actor that can have bearing on the temperature we are trying to
control.  It decides how much power to give each cooling device based
on the performance they are requesting.  The PID controller ensures
that the total power budget does not exceed the control temperature.

Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
---
 Documentation/thermal/power_allocator.txt |  61 ++++
 drivers/thermal/Kconfig                   |  15 +
 drivers/thermal/Makefile                  |   1 +
 drivers/thermal/power_allocator.c         | 467 ++++++++++++++++++++++++++++++
 drivers/thermal/thermal_core.c            |   7 +-
 drivers/thermal/thermal_core.h            |   8 +
 include/linux/thermal.h                   |   8 +
 7 files changed, 566 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/thermal/power_allocator.txt
 create mode 100644 drivers/thermal/power_allocator.c

diff --git a/Documentation/thermal/power_allocator.txt b/Documentation/thermal/power_allocator.txt
new file mode 100644
index 000000000000..1859074dadcb
--- /dev/null
+++ b/Documentation/thermal/power_allocator.txt
@@ -0,0 +1,61 @@
+Integration of the power_allocator governor in a platform
+=========================================================
+
+Registering thermal_zone_device
+-------------------------------
+
+An estimate of the sustainable dissipatable power (in mW) should be
+provided while registering the thermal zone.  This is the maximum
+sustained power for allocation at the desired maximum temperature.
+This number can vary for different conditions, but the closed-loop of
+the controller should take care of those variations, the
+`sustainable_power` should be an estimation of it.  Register your
+thermal zone with `thermal_zone_params` that have a
+`sustainable_power`.  If you weren't passing any
+`thermal_zone_params`, then something like this will do:
+
+	static const struct thermal_zone_params tz_params = {
+		.sustainable_power = 3500,
+	};
+
+and then pass `tz_params` as the 5th parameter to
+`thermal_zone_device_register()`
+
+Trip points
+-----------
+
+The governor requires the following two trip points:
+
+1.  "switch on" trip point: temperature above which the governor
+    control loop starts operating
+2.  "desired temperature" trip point: it should be higher than the
+    "switch on" trip point. It is the target temperature the governor
+    is controlling for.
+
+The trip points can be either active or passive.
+
+Power actors
+------------
+
+Devices controlled by this governor must be registered with the power
+actor API.  Read `power_actor.txt` for more information about them.
+
+Limitations of the power allocator governor
+===========================================
+
+The power allocator governor can't work with cooling devices directly.
+A power actor can be created to interface between the governor and the
+cooling device (see cpu_actor.c for an example).  Otherwise, if you
+have power actors and cooling devices that are next to the same
+thermal sensor create two thermal zones, one for each type.  Use the
+power allocator governor for the power actor thermal zone with the
+power actors and any other governor for the one with cooling devices.
+
+The power allocator governor's PID controller is highly dependent on a
+periodic tick.  If you have a driver that calls
+`thermal_zone_device_update()` (or anything that ends up calling the
+governor's `throttle()` function) repetitively, the governor response
+won't be very good.  Note that this is not particular to this
+governor, step-wise will also misbehave if you call its throttle()
+faster than the normal thermal framework tick (due to interrupts for
+example) as it will overreact.
diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index 249b196deffd..0e76c0dab5f3 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -71,6 +71,14 @@ config THERMAL_DEFAULT_GOV_USER_SPACE
 	  Select this if you want to let the user space manage the
 	  platform thermals.
 
+config THERMAL_DEFAULT_GOV_POWER_ALLOCATOR
+	bool "power_allocator"
+	select THERMAL_GOV_POWER_ALLOCATOR
+	help
+	  Select this if you want to control temperature based on
+	  system and device power allocation. This governor relies on
+	  power actors to operate.
+
 endchoice
 
 config THERMAL_GOV_FAIR_SHARE
@@ -89,6 +97,13 @@ config THERMAL_GOV_USER_SPACE
 	help
 	  Enable this to let the user space manage the platform thermals.
 
+config THERMAL_GOV_POWER_ALLOCATOR
+	bool "Power allocator thermal governor"
+	select THERMAL_POWER_ACTOR
+	help
+	  Enable this to manage platform thermals by dynamically
+	  allocating and limiting power to devices.
+
 config THERMAL_POWER_ACTOR
 	bool
 
diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
index 74f97c90a46c..e74d57d0fe61 100644
--- a/drivers/thermal/Makefile
+++ b/drivers/thermal/Makefile
@@ -13,6 +13,7 @@ thermal_sys-$(CONFIG_THERMAL_OF)		+= of-thermal.o
 thermal_sys-$(CONFIG_THERMAL_GOV_FAIR_SHARE)	+= fair_share.o
 thermal_sys-$(CONFIG_THERMAL_GOV_STEP_WISE)	+= step_wise.o
 thermal_sys-$(CONFIG_THERMAL_GOV_USER_SPACE)	+= user_space.o
+thermal_sys-$(CONFIG_THERMAL_GOV_POWER_ALLOCATOR)	+= power_allocator.o
 
 # power actors
 obj-$(CONFIG_THERMAL_POWER_ACTOR) += power_actor.o
diff --git a/drivers/thermal/power_allocator.c b/drivers/thermal/power_allocator.c
new file mode 100644
index 000000000000..eb1797cd859b
--- /dev/null
+++ b/drivers/thermal/power_allocator.c
@@ -0,0 +1,467 @@
+/*
+ * A power allocator to manage temperature
+ *
+ * Copyright (C) 2014 ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed "as is" WITHOUT ANY WARRANTY of any
+ * kind, whether express or implied; without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#define pr_fmt(fmt) "Power allocator: " fmt
+
+#include <linux/power_actor.h>
+#include <linux/rculist.h>
+#include <linux/slab.h>
+#include <linux/thermal.h>
+
+#include "thermal_core.h"
+
+#define FRAC_BITS 8
+#define int_to_frac(x) ((x) << FRAC_BITS)
+#define frac_to_int(x) ((x) >> FRAC_BITS)
+
+/**
+ * mul_frac() - multiply two fixed-point numbers
+ * @x:	first multiplicand
+ * @y:	second multiplicand
+ *
+ * Return: the result of multiplying two fixed-point numbers.  The
+ * result is also a fixed-point number.
+ */
+static inline s64 mul_frac(s64 x, s64 y)
+{
+	return (x * y) >> FRAC_BITS;
+}
+
+enum power_allocator_trip_levels {
+	TRIP_SWITCH_ON = 0,	/* Switch on PID controller */
+	TRIP_MAX_DESIRED_TEMPERATURE, /* Temperature we are controlling for */
+};
+
+/**
+ * struct power_allocator_params - parameters for the power allocator governor
+ * @k_po:	Proportional parameter of the PID controller when overshooting
+ *		(i.e., when temperature is below the target)
+ * @k_pi:	Proportional parameter of the PID controller when undershooting
+ * @k_i:	Integral parameter of the PID controller
+ * @k_d:	Derivative parameter of the PID controller
+ * @integral_cutoff:	threshold below which the error is no longer accumulated
+			in the PID controller
+ * @err_integral:	accumulated error in the PID controller.
+ * @prev_err:	error in the previous iteration of the PID controller.
+ *		Used to calculate the derivative term.
+ */
+struct power_allocator_params {
+	s32 k_po;
+	s32 k_pu;
+	s32 k_i;
+	s32 k_d;
+	s32 integral_cutoff;
+	s32 err_integral;
+	s32 prev_err;
+};
+
+/**
+ * pid_controller() - PID controller
+ * @tz:	thermal zone we are operating in
+ * @current_temp:	the current temperature
+ * @control_temp:	the target temperature
+ * @max_allocatable_power:	maximum allocatable power for this thermal zone
+ *
+ * This PID controller increases the available power budget so that the
+ * temperature of the thermal zone gets as close as possible to
+ * @control_temp and limits the power if it exceeds it.  k_po is the
+ * proportional term when we are overshooting, k_pu is the
+ * proportional term when we are undershooting.  integral_cutoff is a
+ * threshold below which we stop accumulating the error.  The
+ * accumulated error is only valid if the requested power will make
+ * the system warmer.  If the system is mostly idle, there's no point
+ * in accumulating positive error.
+ *
+ * Return: The power budget for the next period.
+ */
+static u32 pid_controller(struct thermal_zone_device *tz,
+			unsigned long current_temp, unsigned long control_temp,
+			u32 max_allocatable_power)
+{
+	s64 p, i, d, power_range;
+	s32 err;
+	struct power_allocator_params *params = tz->governor_data;
+
+	err = ((s32)control_temp - (s32)current_temp) / 1000;
+	err = int_to_frac(err);
+
+	/* Calculate the proportional term */
+	p = mul_frac(err < 0 ? params->k_po : params->k_pu, err);
+
+	/*
+	 * Calculate the integral term
+	 *
+	 * if the error s less than cut off allow integration (but
+	 * the integral is limited to max power)
+	 */
+	i = mul_frac(params->k_i, params->err_integral);
+
+	if (err < int_to_frac(params->integral_cutoff)) {
+		s64 tmpi = mul_frac(params->k_i, err);
+
+		tmpi += i;
+		if (tmpi <= int_to_frac(max_allocatable_power)) {
+			i = tmpi;
+			params->err_integral += err;
+		}
+	}
+
+	/*
+	 * Calculate the derivative term
+	 *
+	 * We do err - prev_err, so with a positive k_d, a decreasing
+	 * error (i.e. driving closer to the line) results in less
+	 * power being applied, slowing down the controller)
+	 */
+	d = mul_frac(params->k_d, err - params->prev_err);
+	params->prev_err = err;
+
+	power_range = p + i + d;
+
+	/* feed-forward the known sustainable dissipatable power */
+	power_range = tz->tzp->sustainable_power + frac_to_int(power_range);
+
+	return clamp(power_range, (s64)0, (s64)max_allocatable_power);
+}
+
+/**
+ * divvy_up_power() - divvy the allocated power between the actors
+ * @req_power:	each actor's requested power
+ * @max_power:	each actor's maximum available power
+ * @num_actors:	size of the @req_power, @max_power and @granted_power's array
+ * @total_req_power: sum of @req_power
+ * @power_range:	total allocated power
+ * @granted_power:	ouput array: each actor's granted power
+ *
+ * This function divides the total allocated power (@power_range)
+ * fairly between the actors.  It first tries to give each actor a
+ * share of the @power_range according to how much power it requested
+ * compared to the rest of the actors.  For example, if only one actor
+ * requests power, then it receives all the @power_range.  If
+ * three actors each requests 1mW, each receives a third of the
+ * @power_range.
+ *
+ * If any actor received more than their maximum power, then that
+ * surplus is re-divvied among the actors based on how far they are
+ * from their respective maximums.
+ *
+ * Granted power for each actor is written to @granted_power, which
+ * should've been allocated by the calling function.
+ */
+static void divvy_up_power(u32 *req_power, u32 *max_power, int num_actors,
+			u32 total_req_power, u32 power_range,
+			u32 *granted_power)
+{
+	u32 extra_power, capped_extra_power, extra_actor_power[num_actors];
+	int i;
+
+	if (!total_req_power) {
+		/*
+		 * Nobody requested anything, so just give everybody
+		 * the maximum power
+		 */
+		for (i = 0; i < num_actors; i++)
+			granted_power[i] = max_power[i];
+
+		return;
+	}
+
+	capped_extra_power = 0;
+	extra_power = 0;
+	for (i = 0; i < num_actors; i++) {
+		u64 req_range = req_power[i] * power_range;
+
+		granted_power[i] = div_u64(req_range, total_req_power);
+
+		if (granted_power[i] > max_power[i]) {
+			extra_power += granted_power[i] - max_power[i];
+			granted_power[i] = max_power[i];
+		}
+
+		extra_actor_power[i] = max_power[i] - granted_power[i];
+		capped_extra_power += extra_actor_power[i];
+	}
+
+	if (!extra_power)
+		return;
+
+	/*
+	 * Re-divvy the reclaimed extra among actors based on
+	 * how far they are from the max
+	 */
+	extra_power = min(extra_power, capped_extra_power);
+	if (capped_extra_power > 0)
+		for (i = 0; i < num_actors; i++)
+			granted_power[i] += (extra_actor_power[i] *
+					extra_power) / capped_extra_power;
+}
+
+static int allocate_power(struct thermal_zone_device *tz,
+			unsigned long current_temp, unsigned long control_temp)
+{
+	struct power_actor *actor;
+	u32 *req_power, *max_power, *granted_power;
+	u32 total_req_power, max_allocatable_power;
+	u32 power_range;
+	int i, num_actors, ret = 0;
+
+	mutex_lock(&tz->lock);
+	rcu_read_lock();
+
+	num_actors = 0;
+	list_for_each_entry_rcu(actor, &actor_list, actor_node)
+		num_actors++;
+
+	req_power = devm_kcalloc(&tz->device, num_actors, sizeof(*req_power),
+				GFP_KERNEL);
+	if (!req_power) {
+		ret = -ENOMEM;
+		goto unlock;
+	}
+
+	max_power = devm_kcalloc(&tz->device, num_actors, sizeof(*max_power),
+				GFP_KERNEL);
+	if (!max_power) {
+		ret = -ENOMEM;
+		goto free_req_power;
+	}
+
+	granted_power = devm_kcalloc(&tz->device, num_actors,
+				sizeof(*granted_power), GFP_KERNEL);
+	if (!granted_power) {
+		ret = -ENOMEM;
+		goto free_max_power;
+	}
+
+	i = 0;
+	total_req_power = 0;
+	max_allocatable_power = 0;
+
+	list_for_each_entry_rcu(actor, &actor_list, actor_node) {
+		req_power[i] = actor->ops->get_req_power(actor, tz);
+		total_req_power += req_power[i];
+
+		max_power[i] = actor->ops->get_max_power(actor, tz);
+		max_allocatable_power += max_power[i];
+
+		i++;
+	}
+
+	power_range = pid_controller(tz, current_temp, control_temp,
+				max_allocatable_power);
+
+	divvy_up_power(req_power, max_power, num_actors, total_req_power,
+		power_range, granted_power);
+
+	i = 0;
+	list_for_each_entry_rcu(actor, &actor_list, actor_node) {
+		actor->ops->set_power(actor, tz, granted_power[i]);
+		i++;
+	}
+
+	devm_kfree(&tz->device, granted_power);
+free_max_power:
+	devm_kfree(&tz->device, max_power);
+free_req_power:
+	devm_kfree(&tz->device, req_power);
+unlock:
+	rcu_read_unlock();
+	mutex_unlock(&tz->lock);
+
+	return ret;
+}
+
+static int check_trips(struct thermal_zone_device *tz)
+{
+	int ret;
+	enum thermal_trip_type type;
+
+	if (tz->trips < 2)
+		return -EINVAL;
+
+	ret = tz->ops->get_trip_type(tz, TRIP_SWITCH_ON, &type);
+	if (ret)
+		return ret;
+
+	if ((type != THERMAL_TRIP_PASSIVE) && (type != THERMAL_TRIP_ACTIVE))
+		return -EINVAL;
+
+	ret = tz->ops->get_trip_type(tz, TRIP_MAX_DESIRED_TEMPERATURE, &type);
+	if (ret)
+		return ret;
+
+	if ((type != THERMAL_TRIP_PASSIVE) && (type != THERMAL_TRIP_ACTIVE))
+		return -EINVAL;
+
+	return ret;
+}
+
+static void reset_pid_controller(struct power_allocator_params *params)
+{
+	params->err_integral = 0;
+	params->prev_err = 0;
+}
+
+static void allow_maximum_power(struct thermal_zone_device *tz)
+{
+	struct power_actor *actor;
+
+	rcu_read_lock();
+
+	list_for_each_entry_rcu(actor, &actor_list, actor_node) {
+		u32 max_power = actor->ops->get_max_power(actor, tz);
+
+		actor->ops->set_power(actor, tz, max_power);
+	}
+
+	rcu_read_unlock();
+}
+
+/**
+ * power_allocator_bind() - bind the power_allocator governor to a thermal zone
+ * @tz:	thermal zone to bind it to
+ *
+ * Check that the thermal zone is valid for this governor, that is, it
+ * has two thermal trips.  If so, initialize the PID controller
+ * parameters and bind it to the thermal zone.
+ *
+ * Return: 0 on success, -EINVAL if the trips were invalid or -ENOMEM
+ * if we ran out of memory.
+ */
+static int power_allocator_bind(struct thermal_zone_device *tz)
+{
+	int ret;
+	struct power_allocator_params *params;
+	unsigned long switch_on_temp, control_temp;
+	u32 temperature_threshold;
+
+	ret = check_trips(tz);
+	if (ret) {
+		dev_err(&tz->device,
+			"thermal zone %s has the wrong number of trips for this governor\n",
+			tz->type);
+		return ret;
+	}
+
+	if (!tz->tzp || !tz->tzp->sustainable_power) {
+		dev_err(&tz->device,
+			"power_allocator: missing sustainable_power\n");
+		return -EINVAL;
+	}
+
+	params = devm_kzalloc(&tz->device, sizeof(*params), GFP_KERNEL);
+	if (!params)
+		return -ENOMEM;
+
+	ret = tz->ops->get_trip_temp(tz, TRIP_SWITCH_ON, &switch_on_temp);
+	if (ret)
+		goto free;
+
+	ret = tz->ops->get_trip_temp(tz, TRIP_MAX_DESIRED_TEMPERATURE,
+				&control_temp);
+	if (ret)
+		goto free;
+
+	temperature_threshold = (control_temp - switch_on_temp) / 1000;
+
+	params->k_po = int_to_frac(tz->tzp->sustainable_power) /
+		temperature_threshold;
+	params->k_pu = int_to_frac(2 * tz->tzp->sustainable_power) /
+		temperature_threshold;
+	params->k_i = int_to_frac(1);
+	params->k_d = int_to_frac(0);
+	params->integral_cutoff = 0;
+
+	reset_pid_controller(params);
+
+	tz->governor_data = params;
+
+	return 0;
+
+free:
+	devm_kfree(&tz->device, params);
+	return ret;
+}
+
+static void power_allocator_unbind(struct thermal_zone_device *tz)
+{
+	dev_dbg(&tz->device, "Unbinding from thermal zone %d\n", tz->id);
+	devm_kfree(&tz->device, tz->governor_data);
+	tz->governor_data = NULL;
+}
+
+static int power_allocator_throttle(struct thermal_zone_device *tz, int trip)
+{
+	int ret;
+	unsigned long switch_on_temp, control_temp, current_temp;
+	struct power_allocator_params *params = tz->governor_data;
+
+	/*
+	 * We get called for every trip point but we only need to do
+	 * our calculations once
+	 */
+	if (trip != TRIP_MAX_DESIRED_TEMPERATURE)
+		return 0;
+
+	ret = thermal_zone_get_temp(tz, &current_temp);
+	if (ret) {
+		dev_warn(&tz->device, "Failed to get temperature: %d\n", ret);
+		return ret;
+	}
+
+	ret = tz->ops->get_trip_temp(tz, TRIP_SWITCH_ON, &switch_on_temp);
+	if (ret) {
+		dev_warn(&tz->device,
+			"Failed to get switch on temperature: %d\n", ret);
+		return ret;
+	}
+
+	if (current_temp < switch_on_temp) {
+		tz->passive = 0;
+		reset_pid_controller(params);
+		allow_maximum_power(tz);
+		return 0;
+	}
+
+	tz->passive = 1;
+
+	ret = tz->ops->get_trip_temp(tz, TRIP_MAX_DESIRED_TEMPERATURE,
+				&control_temp);
+	if (ret) {
+		dev_warn(&tz->device,
+			"Failed to get the maximum desired temperature: %d\n",
+			ret);
+		return ret;
+	}
+
+	return allocate_power(tz, current_temp, control_temp);
+}
+
+static struct thermal_governor thermal_gov_power_allocator = {
+	.name		= "power_allocator",
+	.bind_to_tz	= power_allocator_bind,
+	.unbind_from_tz	= power_allocator_unbind,
+	.throttle	= power_allocator_throttle,
+};
+
+int thermal_gov_power_allocator_register(void)
+{
+	return thermal_register_governor(&thermal_gov_power_allocator);
+}
+
+void thermal_gov_power_allocator_unregister(void)
+{
+	thermal_unregister_governor(&thermal_gov_power_allocator);
+}
diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 3da99dd80ad5..1415d3d8a9eb 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -1857,7 +1857,11 @@ static int __init thermal_register_governors(void)
 	if (result)
 		return result;
 
-	return thermal_gov_user_space_register();
+	result = thermal_gov_user_space_register();
+	if (result)
+		return result;
+
+	return thermal_gov_power_allocator_register();
 }
 
 static void thermal_unregister_governors(void)
@@ -1865,6 +1869,7 @@ static void thermal_unregister_governors(void)
 	thermal_gov_step_wise_unregister();
 	thermal_gov_fair_share_unregister();
 	thermal_gov_user_space_unregister();
+	thermal_gov_power_allocator_unregister();
 }
 
 static int __init thermal_init(void)
diff --git a/drivers/thermal/thermal_core.h b/drivers/thermal/thermal_core.h
index 3db339fb636f..b24cde2c71cc 100644
--- a/drivers/thermal/thermal_core.h
+++ b/drivers/thermal/thermal_core.h
@@ -77,6 +77,14 @@ static inline int thermal_gov_user_space_register(void) { return 0; }
 static inline void thermal_gov_user_space_unregister(void) {}
 #endif /* CONFIG_THERMAL_GOV_USER_SPACE */
 
+#ifdef CONFIG_THERMAL_GOV_POWER_ALLOCATOR
+int thermal_gov_power_allocator_register(void);
+void thermal_gov_power_allocator_unregister(void);
+#else
+static inline int thermal_gov_power_allocator_register(void) { return 0; }
+static inline void thermal_gov_power_allocator_unregister(void) {}
+#endif /* CONFIG_THERMAL_GOV_POWER_ALLOCATOR */
+
 /* device tree support */
 #ifdef CONFIG_THERMAL_OF
 int of_parse_thermal_zones(void);
diff --git a/include/linux/thermal.h b/include/linux/thermal.h
index 1124b7a9358a..e01141261756 100644
--- a/include/linux/thermal.h
+++ b/include/linux/thermal.h
@@ -57,6 +57,8 @@
 #define DEFAULT_THERMAL_GOVERNOR       "fair_share"
 #elif defined(CONFIG_THERMAL_DEFAULT_GOV_USER_SPACE)
 #define DEFAULT_THERMAL_GOVERNOR       "user_space"
+#elif defined(CONFIG_THERMAL_DEFAULT_GOV_POWER_ALLOCATOR)
+#define DEFAULT_THERMAL_GOVERNOR       "power_allocator"
 #endif
 
 struct thermal_zone_device;
@@ -287,6 +289,12 @@ struct thermal_zone_params {
 
 	int num_tbps;	/* Number of tbp entries */
 	struct thermal_bind_params *tbp;
+
+	/*
+	 * Sustainable power (heat) that this thermal zone can dissipate in
+	 * mW
+	 */
+	u32 sustainable_power;
 };
 
 struct thermal_genl_event {
-- 
1.9.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH v5 09/10] thermal: add trace events to the power allocator governor
  2014-07-10 14:18 [RFC PATCH v5 00/10] The power allocator thermal governor Javi Merino
                   ` (7 preceding siblings ...)
  2014-07-10 14:18 ` [RFC PATCH v5 08/10] thermal: introduce the Power Allocator governor Javi Merino
@ 2014-07-10 14:18 ` Javi Merino
  2014-07-10 15:44   ` Steven Rostedt
  2014-07-10 14:18 ` [RFC PATCH v5 10/10] of: thermal: Introduce sustainable power for a thermal zone Javi Merino
  9 siblings, 1 reply; 22+ messages in thread
From: Javi Merino @ 2014-07-10 14:18 UTC (permalink / raw)
  To: linux-pm, linux-kernel
  Cc: punit.agrawal, broonie, Javi Merino, Zhang Rui, Eduardo Valentin,
	Steven Rostedt, Frederic Weisbecker, Ingo Molnar

Add trace events for the power allocator governor and the power actor
interface of the cpu cooling device.

Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
---
 drivers/thermal/cpu_actor.c                    |  17 ++-
 drivers/thermal/power_allocator.c              |  22 +++-
 include/trace/events/thermal_power_allocator.h | 138 +++++++++++++++++++++++++
 3 files changed, 173 insertions(+), 4 deletions(-)
 create mode 100644 include/trace/events/thermal_power_allocator.h

diff --git a/drivers/thermal/cpu_actor.c b/drivers/thermal/cpu_actor.c
index 45ea4fa92ea0..b5ed2e80e288 100644
--- a/drivers/thermal/cpu_actor.c
+++ b/drivers/thermal/cpu_actor.c
@@ -28,6 +28,8 @@
 #include <linux/printk.h>
 #include <linux/slab.h>
 
+#include <trace/events/thermal_power_allocator.h>
+
 /**
  * struct power_table - frequency to power conversion
  * @frequency:	frequency in KHz
@@ -184,11 +186,12 @@ static u32 get_static_power(struct cpu_actor *cpu_actor,
  */
 static u32 get_dynamic_power(struct cpu_actor *cpu_actor, unsigned long freq)
 {
-	int cpu;
-	u32 power = 0, raw_cpu_power, total_load = 0;
+	int i, cpu;
+	u32 power = 0, raw_cpu_power, total_load = 0, load_cpu[NR_CPUS];
 
 	raw_cpu_power = cpu_freq_to_power(cpu_actor, freq);
 
+	i = 0;
 	for_each_cpu(cpu, &cpu_actor->cpumask) {
 		u32 load;
 
@@ -198,8 +201,15 @@ static u32 get_dynamic_power(struct cpu_actor *cpu_actor, unsigned long freq)
 		load = get_load(cpu_actor, cpu);
 		power += (raw_cpu_power * load) / 100;
 		total_load += load;
+		load_cpu[i] = load;
+
+		i++;
 	}
 
+	trace_thermal_power_actor_cpu_get_dyn_power(&cpu_actor->cpumask, freq,
+						raw_cpu_power, load_cpu, i,
+						power);
+
 	cpu_actor->last_load = total_load;
 
 	return power;
@@ -296,6 +306,9 @@ static int cpu_set_power(struct power_actor *actor,
 		return -EINVAL;
 	}
 
+	trace_thermal_power_actor_cpu_limit(&cpu_actor->cpumask, target_freq,
+					cdev_state, power);
+
 	return cdev->ops->set_cur_state(cdev, cdev_state);
 }
 
diff --git a/drivers/thermal/power_allocator.c b/drivers/thermal/power_allocator.c
index eb1797cd859b..e6793d6d1288 100644
--- a/drivers/thermal/power_allocator.c
+++ b/drivers/thermal/power_allocator.c
@@ -20,6 +20,9 @@
 #include <linux/slab.h>
 #include <linux/thermal.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/thermal_power_allocator.h>
+
 #include "thermal_core.h"
 
 #define FRAC_BITS 8
@@ -133,7 +136,14 @@ static u32 pid_controller(struct thermal_zone_device *tz,
 	/* feed-forward the known sustainable dissipatable power */
 	power_range = tz->tzp->sustainable_power + frac_to_int(power_range);
 
-	return clamp(power_range, (s64)0, (s64)max_allocatable_power);
+	power_range = clamp(power_range, (s64)0, (s64)max_allocatable_power);
+
+	trace_thermal_power_allocator_pid(frac_to_int(err),
+					frac_to_int(params->err_integral),
+					frac_to_int(p), frac_to_int(i),
+					frac_to_int(d), power_range);
+
+	return power_range;
 }
 
 /**
@@ -214,7 +224,7 @@ static int allocate_power(struct thermal_zone_device *tz,
 	struct power_actor *actor;
 	u32 *req_power, *max_power, *granted_power;
 	u32 total_req_power, max_allocatable_power;
-	u32 power_range;
+	u32 total_granted_power, power_range;
 	int i, num_actors, ret = 0;
 
 	mutex_lock(&tz->lock);
@@ -265,12 +275,20 @@ static int allocate_power(struct thermal_zone_device *tz,
 	divvy_up_power(req_power, max_power, num_actors, total_req_power,
 		power_range, granted_power);
 
+	total_granted_power = 0;
 	i = 0;
 	list_for_each_entry_rcu(actor, &actor_list, actor_node) {
 		actor->ops->set_power(actor, tz, granted_power[i]);
+		total_granted_power += granted_power[i];
+
 		i++;
 	}
 
+	trace_thermal_power_allocator(req_power, total_req_power, granted_power,
+				total_granted_power, num_actors, power_range,
+				max_allocatable_power, current_temp,
+				(s32)control_temp - (s32)current_temp);
+
 	devm_kfree(&tz->device, granted_power);
 free_max_power:
 	devm_kfree(&tz->device, max_power);
diff --git a/include/trace/events/thermal_power_allocator.h b/include/trace/events/thermal_power_allocator.h
new file mode 100644
index 000000000000..9140cec55c63
--- /dev/null
+++ b/include/trace/events/thermal_power_allocator.h
@@ -0,0 +1,138 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM thermal_power_allocator
+
+#if !defined(_TRACE_THERMAL_POWER_ALLOCATOR_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_THERMAL_POWER_ALLOCATOR_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(thermal_power_allocator,
+	TP_PROTO(u32 *req_power, u32 total_req_power, u32 *granted_power,
+		u32 total_granted_power, size_t num_actors, u32 power_range,
+		u32 max_allocatable_power, unsigned long current_temp,
+		s32 delta_temp),
+	TP_ARGS(req_power, total_req_power, granted_power, total_granted_power,
+		num_actors, power_range, max_allocatable_power, current_temp,
+		delta_temp),
+	TP_STRUCT__entry(
+		__dynamic_array(u32,   req_power, num_actors    )
+		__field(u32,           total_req_power          )
+		__dynamic_array(u32,   granted_power, num_actors)
+		__field(u32,           total_granted_power      )
+		__field(size_t,        num_actors               )
+		__field(u32,           power_range              )
+		__field(u32,           max_allocatable_power    )
+		__field(unsigned long, current_temp             )
+		__field(s32,           delta_temp               )
+	),
+	TP_fast_assign(
+		memcpy(__get_dynamic_array(req_power), req_power,
+			num_actors * sizeof(*req_power));
+		__entry->total_req_power = total_req_power;
+		memcpy(__get_dynamic_array(granted_power), granted_power,
+			num_actors * sizeof(*granted_power));
+		__entry->total_granted_power = total_granted_power;
+		__entry->num_actors = num_actors;
+		__entry->power_range = power_range;
+		__entry->max_allocatable_power = max_allocatable_power;
+		__entry->current_temp = current_temp;
+		__entry->delta_temp = delta_temp;
+	),
+
+	TP_printk("req_power={%s} total_req_power=%u granted_power={%s} total_granted_power=%u power_range=%u max_allocatable_power=%u current_temperature=%lu delta_temperature=%d",
+		__print_u32_array(__get_dynamic_array(req_power),
+				__entry->num_actors),
+		__entry->total_req_power,
+		__print_u32_array(__get_dynamic_array(granted_power),
+				__entry->num_actors),
+		__entry->total_granted_power, __entry->power_range,
+		__entry->max_allocatable_power, __entry->current_temp,
+		__entry->delta_temp)
+);
+
+TRACE_EVENT(thermal_power_allocator_pid,
+	TP_PROTO(s32 err, s32 err_integral, s64 p, s64 i, s64 d, s32 output),
+	TP_ARGS(err, err_integral, p, i, d, output),
+	TP_STRUCT__entry(
+		__field(s32, err         )
+		__field(s32, err_integral)
+		__field(s64, p           )
+		__field(s64, i           )
+		__field(s64, d           )
+		__field(s32, output      )
+	),
+	TP_fast_assign(
+		__entry->err = err;
+		__entry->err_integral = err_integral;
+		__entry->p = p;
+		__entry->i = i;
+		__entry->d = d;
+		__entry->output = output;
+	),
+
+	TP_printk("err=%d err_integral=%d p=%lld i=%lld d=%lld output=%d",
+		__entry->err, __entry->err_integral,
+		__entry->p, __entry->i, __entry->d, __entry->output)
+);
+
+TRACE_EVENT(thermal_power_actor_cpu_get_dyn_power,
+	TP_PROTO(const struct cpumask *cpus, unsigned long freq,
+		u32 raw_cpu_power, u32 *load, size_t load_len, u32 power),
+
+	TP_ARGS(cpus, freq, raw_cpu_power, load, load_len, power),
+
+	TP_STRUCT__entry(
+		__bitmask(cpumask, num_possible_cpus())
+		__field(unsigned long, freq          )
+		__field(u32,           raw_cpu_power )
+		__dynamic_array(u32,   load, load_len)
+		__field(size_t,        load_len      )
+		__field(u32,           power         )
+	),
+
+	TP_fast_assign(
+		__assign_bitmask(cpumask, cpumask_bits(cpus),
+				num_possible_cpus());
+		__entry->freq = freq;
+		__entry->raw_cpu_power = raw_cpu_power;
+		memcpy(__get_dynamic_array(load), load,
+			load_len * sizeof(*load));
+		__entry->load_len = load_len;
+		__entry->power = power;
+	),
+
+	TP_printk("cpus=%s freq=%lu raw_cpu_power=%d load={%s} power=%d",
+		__get_bitmask(cpumask), __entry->freq, __entry->raw_cpu_power,
+		__print_u32_array(__get_dynamic_array(load), __entry->load_len),
+		__entry->power)
+);
+
+TRACE_EVENT(thermal_power_actor_cpu_limit,
+	TP_PROTO(const struct cpumask *cpus, unsigned int freq,
+		unsigned long cdev_state, u32 power),
+
+	TP_ARGS(cpus, freq, cdev_state, power),
+
+	TP_STRUCT__entry(
+		__bitmask(cpumask, num_possible_cpus())
+		__field(unsigned int,  freq      )
+		__field(unsigned long, cdev_state)
+		__field(u32,           power     )
+	),
+
+	TP_fast_assign(
+		__assign_bitmask(cpumask, cpumask_bits(cpus),
+				num_possible_cpus());
+		__entry->freq = freq;
+		__entry->cdev_state = cdev_state;
+		__entry->power = power;
+	),
+
+	TP_printk("cpus=%s freq=%u cdev_state=%lu power=%u",
+		__get_bitmask(cpumask), __entry->freq, __entry->cdev_state,
+		__entry->power)
+);
+#endif /* _TRACE_THERMAL_POWER_ALLOCATOR_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
-- 
1.9.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH v5 10/10] of: thermal: Introduce sustainable power for a thermal zone
  2014-07-10 14:18 [RFC PATCH v5 00/10] The power allocator thermal governor Javi Merino
                   ` (8 preceding siblings ...)
  2014-07-10 14:18 ` [RFC PATCH v5 09/10] thermal: add trace events to the power allocator governor Javi Merino
@ 2014-07-10 14:18 ` Javi Merino
  9 siblings, 0 replies; 22+ messages in thread
From: Javi Merino @ 2014-07-10 14:18 UTC (permalink / raw)
  To: linux-pm, linux-kernel
  Cc: punit.agrawal, broonie, Zhang Rui, Eduardo Valentin

From: Punit Agrawal <punit.agrawal@arm.com>

Introduce an optional property called, sustainable-power, which
represents the power (in mW) which the thermal zone can safely
dissipate.

If provided the property is parsed and associated with the thermal
zone via the thermal zone parameters.

Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
---
 Documentation/devicetree/bindings/thermal/thermal.txt | 4 ++++
 drivers/thermal/of-thermal.c                          | 4 ++++
 2 files changed, 8 insertions(+)

diff --git a/Documentation/devicetree/bindings/thermal/thermal.txt b/Documentation/devicetree/bindings/thermal/thermal.txt
index f5db6b72a36f..c6eb9a8d2aed 100644
--- a/Documentation/devicetree/bindings/thermal/thermal.txt
+++ b/Documentation/devicetree/bindings/thermal/thermal.txt
@@ -167,6 +167,10 @@ Optional property:
 			by means of sensor ID. Additional coefficients are
 			interpreted as constant offset.
 
+- sustainable-power:	An estimate of the sustainable power (in mW) that the
+  Type: unsigned	thermal zone can dissipate.
+  Size: one cell
+
 Note: The delay properties are bound to the maximum dT/dt (temperature
 derivative over time) in two situations for a thermal zone:
 (i)  - when passive cooling is activated (polling-delay-passive); and
diff --git a/drivers/thermal/of-thermal.c b/drivers/thermal/of-thermal.c
index 04b1be7fa018..eaf81ea654b9 100644
--- a/drivers/thermal/of-thermal.c
+++ b/drivers/thermal/of-thermal.c
@@ -769,6 +769,7 @@ int __init of_parse_thermal_zones(void)
 	for_each_child_of_node(np, child) {
 		struct thermal_zone_device *zone;
 		struct thermal_zone_params *tzp;
+		u32 prop;
 
 		tz = thermal_of_build_thermal_zone(child);
 		if (IS_ERR(tz)) {
@@ -791,6 +792,9 @@ int __init of_parse_thermal_zones(void)
 		/* No hwmon because there might be hwmon drivers registering */
 		tzp->no_hwmon = true;
 
+		if (!of_property_read_u32(child, "sustainable-power", &prop))
+			tzp->sustainable_power = prop;
+
 		zone = thermal_zone_device_register(child->name, tz->ntrips,
 						    0, tz,
 						    ops, tzp,
-- 
1.9.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v5 01/10] tracing: Add array printing helpers
  2014-07-10 14:18 ` [RFC PATCH v5 01/10] tracing: Add array printing helpers Javi Merino
@ 2014-07-10 15:40   ` Steven Rostedt
  0 siblings, 0 replies; 22+ messages in thread
From: Steven Rostedt @ 2014-07-10 15:40 UTC (permalink / raw)
  To: Javi Merino
  Cc: linux-pm, linux-kernel, punit.agrawal, broonie, Dave Martin, Ingo Molnar

On Thu, 10 Jul 2014 15:18:39 +0100
"Javi Merino" <javi.merino@arm.com> wrote:

> From: Dave Martin <Dave.Martin@arm.com>
> 
> If a trace event contains an array, there is currently no standard
> way to format this for text output.  Drivers are currently hacking
> around this by a) local hacks that use the trace_seq functionailty
> directly, or b) just not printing that information.  For fixed size
> arrays, formatting of the elements can be open-coded, but this gets
> cumbersome for arrays of non-trivial size.
> 
> These approaches result in non-standard content of the event format
> description delivered to userspace, so userland tools needs to be
> taught to understand and parse each array printing method
> individually.
> 
> This patch implements common __print_<type>_array() helpers that
> tracepoint implementations can use instead of reinventing them.  A
> simple C-style syntax is used to delimit the array and its elements
> {like,this}.
> 
> So that the helpers can be used with large static arrays as well as
> dynamic arrays, they take a pointer and element count: they can be
> used with __get_dynamic_array() for use with dynamic arrays.

Note, please base any patches like this on my for-next branch, because
I have moved the trace_seq code out of trace_print.c.  Although, I need
to analyze this a little bit more to see if these still belong in
trace_print.c or if they should go into trace_seq.c.

My repo is at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git



> 
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Signed-off-by: Dave Martin <Dave.Martin@arm.com>
> ---
>  include/linux/ftrace_event.h |  9 ++++++++
>  include/trace/ftrace.h       | 17 ++++++++++++++
>  kernel/trace/trace_output.c  | 55 ++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 81 insertions(+)
> 
> diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
> index cff3106ffe2c..919f21a3420b 100644
> --- a/include/linux/ftrace_event.h
> +++ b/include/linux/ftrace_event.h
> @@ -44,6 +44,15 @@ const char *ftrace_print_bitmask_seq(struct trace_seq *p, void *bitmask_ptr,
>  const char *ftrace_print_hex_seq(struct trace_seq *p,
>  				 const unsigned char *buf, int len);
>  
> +const char *ftrace_print_u8_array_seq(struct trace_seq *p,
> +				      const u8 *buf, int count);
> +const char *ftrace_print_u16_array_seq(struct trace_seq *p,
> +				       const u16 *buf, int count);
> +const char *ftrace_print_u32_array_seq(struct trace_seq *p,
> +				       const u32 *buf, int count);
> +const char *ftrace_print_u64_array_seq(struct trace_seq *p,
> +				       const u64 *buf, int count);
> +
>  struct trace_iterator;
>  struct trace_event;
>  
> diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
> index 26b4f2e13275..15bc5d417aea 100644
> --- a/include/trace/ftrace.h
> +++ b/include/trace/ftrace.h
> @@ -263,6 +263,19 @@
>  #undef __print_hex
>  #define __print_hex(buf, buf_len) ftrace_print_hex_seq(p, buf, buf_len)
>  
> +#undef __print_u8_array
> +#define __print_u8_array(array, count)			\
> +	ftrace_print_u8_array_seq(p, array, count)
> +#undef __print_u16_array
> +#define __print_u16_array(array, count)			\
> +	ftrace_print_u16_array_seq(p, array, count)
> +#undef __print_u32_array
> +#define __print_u32_array(array, count)			\
> +	ftrace_print_u32_array_seq(p, array, count)
> +#undef __print_u64_array
> +#define __print_u64_array(array, count)			\
> +	ftrace_print_u64_array_seq(p, array, count)

print_hex_seq() is still in trace_output.c, so perhaps it is fine to
keep these there too.

> +
>  #undef DECLARE_EVENT_CLASS
>  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
>  static notrace enum print_line_t					\
> @@ -676,6 +689,10 @@ static inline void ftrace_test_probe_##call(void)			\
>  #undef __get_dynamic_array_len
>  #undef __get_str
>  #undef __get_bitmask
> +#undef __print_u8_array
> +#undef __print_u16_array
> +#undef __print_u32_array
> +#undef __print_u64_array
>  
>  #undef TP_printk
>  #define TP_printk(fmt, args...) "\"" fmt "\", "  __stringify(args)
> diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
> index f3dad80c20b2..b46238e75523 100644
> --- a/kernel/trace/trace_output.c
> +++ b/kernel/trace/trace_output.c
> @@ -454,6 +454,61 @@ ftrace_print_hex_seq(struct trace_seq *p, const unsigned char *buf, int buf_len)
>  }
>  EXPORT_SYMBOL(ftrace_print_hex_seq);
>  
> +static const char *
> +ftrace_print_array_seq(struct trace_seq *p, const void *buf, int buf_len,
> +		       bool (*iterator)(struct trace_seq *p, const char *prefix,
> +					const void **buf, int *buf_len))
> +{
> +	const char *ret = p->buffer + p->len;

There is now a helper in my for-next branch that removes the above open
coded code:

	const char *ret = trace_seq_buffer_ptr(p);

> +	const char *prefix = "";
> +
> +	trace_seq_putc(p, '{');
> +
> +	if (iterator(p, prefix, &buf, &buf_len)) {
> +		prefix = ",";
> +
> +		while (iterator(p, prefix, &buf, &buf_len))

The above doesn't even need the prefix variable. But I would keep it
and remove the if and have this:

	while (iterator(p, prefix, &buf, &buf_len))
		prefix = ",";

and hope gcc just optimizes it. This is not a critical path, don't
micro optimize it and make the code more complex than need be,
especially when gcc should be smart enough to optimize it for you.

> +			;
> +	}
> +
> +	trace_seq_putc(p, '}');
> +	trace_seq_putc(p, 0);

trace_seq is only one page in size, which should be enough. I'm
wondering if we should check the return status of all the trace_seq()
calls and return an error.

> +
> +	return ret;
> +}
> +
> +#define DEFINE_PRINT_ARRAY(type, printk_type, format)			\
> +static bool								\
> +ftrace_print_array_iterator_##type(struct trace_seq *p, const char *prefix, \
> +				   const void **buf, int *buf_len)	\
> +{									\
> +	const type *__src = *buf;					\
> +									\
> +	if (*buf_len < sizeof(*__src))					\
> +		return false;						\
> +									\
> +	trace_seq_printf(p, "%s" format, prefix, (printk_type)*__src++); \
> +									\
> +	*buf = __src;							\
> +	*buf_len -= sizeof(*__src);					\
> +									\
> +	return true;							\
> +}									\
> +									\
> +const char *ftrace_print_##type##_array_seq(				\
> +	struct trace_seq *p, const type *buf, int count)		\
> +{									\
> +	return ftrace_print_array_seq(p, buf, (count) * sizeof(type),	\
> +				      ftrace_print_array_iterator_##type); \
> +}									\
> +									\
> +EXPORT_SYMBOL(ftrace_print_##type##_array_seq);
> +
> +DEFINE_PRINT_ARRAY(u8, unsigned int, "0x%x")

Why not "unsigned char"?

> +DEFINE_PRINT_ARRAY(u16, unsigned int, "0x%x")

Why not "unsigned short"?

> +DEFINE_PRINT_ARRAY(u32, unsigned int, "0x%x")
> +DEFINE_PRINT_ARRAY(u64, unsigned long long, "0x%llx")
> +
>  int ftrace_raw_output_prep(struct trace_iterator *iter,
>  			   struct trace_event *trace_event)
>  {

-- Steve

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v5 09/10] thermal: add trace events to the power allocator governor
  2014-07-10 14:18 ` [RFC PATCH v5 09/10] thermal: add trace events to the power allocator governor Javi Merino
@ 2014-07-10 15:44   ` Steven Rostedt
  2014-07-10 16:20     ` Javi Merino
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2014-07-10 15:44 UTC (permalink / raw)
  To: Javi Merino
  Cc: linux-pm, linux-kernel, punit.agrawal, broonie, Zhang Rui,
	Eduardo Valentin, Frederic Weisbecker, Ingo Molnar

On Thu, 10 Jul 2014 15:18:47 +0100
"Javi Merino" <javi.merino@arm.com> wrote:

> Add trace events for the power allocator governor and the power actor
> interface of the cpu cooling device.
> 
> Cc: Zhang Rui <rui.zhang@intel.com>
> Cc: Eduardo Valentin <edubezval@gmail.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Signed-off-by: Javi Merino <javi.merino@arm.com>
> ---
>  drivers/thermal/cpu_actor.c                    |  17 ++-
>  drivers/thermal/power_allocator.c              |  22 +++-
>  include/trace/events/thermal_power_allocator.h | 138 +++++++++++++++++++++++++
>  3 files changed, 173 insertions(+), 4 deletions(-)
>  create mode 100644 include/trace/events/thermal_power_allocator.h
> 
> diff --git a/drivers/thermal/cpu_actor.c b/drivers/thermal/cpu_actor.c
> index 45ea4fa92ea0..b5ed2e80e288 100644
> --- a/drivers/thermal/cpu_actor.c
> +++ b/drivers/thermal/cpu_actor.c
> @@ -28,6 +28,8 @@
>  #include <linux/printk.h>
>  #include <linux/slab.h>
>  
> +#include <trace/events/thermal_power_allocator.h>
> +
>  /**
>   * struct power_table - frequency to power conversion
>   * @frequency:	frequency in KHz
> @@ -184,11 +186,12 @@ static u32 get_static_power(struct cpu_actor *cpu_actor,
>   */
>  static u32 get_dynamic_power(struct cpu_actor *cpu_actor, unsigned long freq)
>  {
> -	int cpu;
> -	u32 power = 0, raw_cpu_power, total_load = 0;
> +	int i, cpu;
> +	u32 power = 0, raw_cpu_power, total_load = 0, load_cpu[NR_CPUS];

When NR_CPUS == 1024, you just killed the stack, as you added 4K to it.
We upped the stack recently to 16k, but still.

>  
>  	raw_cpu_power = cpu_freq_to_power(cpu_actor, freq);
>  
> +	i = 0;
>  	for_each_cpu(cpu, &cpu_actor->cpumask) {
>  		u32 load;
>  
> @@ -198,8 +201,15 @@ static u32 get_dynamic_power(struct cpu_actor *cpu_actor, unsigned long freq)
>  		load = get_load(cpu_actor, cpu);
>  		power += (raw_cpu_power * load) / 100;
>  		total_load += load;
> +		load_cpu[i] = load;
> +
> +		i++;
>  	}
>  
> +	trace_thermal_power_actor_cpu_get_dyn_power(&cpu_actor->cpumask, freq,
> +						raw_cpu_power, load_cpu, i,
> +						power);

How many CPUs are you saving load_cpu on? A trace event can't be bigger
than a page. And the data is actually a little less than that with the
required headers.

-- Steve

> +
>  	cpu_actor->last_load = total_load;
>  
>  	return power;
> @@ -296,6 +306,9 @@ static int cpu_set_power(struct power_actor *actor,
>  		return -EINVAL;
>  	}
>  
> +	trace_thermal_power_actor_cpu_limit(&cpu_actor->cpumask, target_freq,
> +					cdev_state, power);
> +
>  	return cdev->ops->set_cur_state(cdev, cdev_state);
>  }
>  
> diff --git a/drivers/thermal/power_allocator.c b/drivers/thermal/power_allocator.c
> index eb1797cd859b..e6793d6d1288 100644
> --- a/drivers/thermal/power_allocator.c
> +++ b/drivers/thermal/power_allocator.c
> @@ -20,6 +20,9 @@
>  #include <linux/slab.h>
>  #include <linux/thermal.h>
>  
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/thermal_power_allocator.h>
> +
>  #include "thermal_core.h"
>  
>  #define FRAC_BITS 8
> @@ -133,7 +136,14 @@ static u32 pid_controller(struct thermal_zone_device *tz,
>  	/* feed-forward the known sustainable dissipatable power */
>  	power_range = tz->tzp->sustainable_power + frac_to_int(power_range);
>  
> -	return clamp(power_range, (s64)0, (s64)max_allocatable_power);
> +	power_range = clamp(power_range, (s64)0, (s64)max_allocatable_power);
> +
> +	trace_thermal_power_allocator_pid(frac_to_int(err),
> +					frac_to_int(params->err_integral),
> +					frac_to_int(p), frac_to_int(i),
> +					frac_to_int(d), power_range);
> +
> +	return power_range;
>  }
>  
>  /**
> @@ -214,7 +224,7 @@ static int allocate_power(struct thermal_zone_device *tz,
>  	struct power_actor *actor;
>  	u32 *req_power, *max_power, *granted_power;
>  	u32 total_req_power, max_allocatable_power;
> -	u32 power_range;
> +	u32 total_granted_power, power_range;
>  	int i, num_actors, ret = 0;
>  
>  	mutex_lock(&tz->lock);
> @@ -265,12 +275,20 @@ static int allocate_power(struct thermal_zone_device *tz,
>  	divvy_up_power(req_power, max_power, num_actors, total_req_power,
>  		power_range, granted_power);
>  
> +	total_granted_power = 0;
>  	i = 0;
>  	list_for_each_entry_rcu(actor, &actor_list, actor_node) {
>  		actor->ops->set_power(actor, tz, granted_power[i]);
> +		total_granted_power += granted_power[i];
> +
>  		i++;
>  	}
>  
> +	trace_thermal_power_allocator(req_power, total_req_power, granted_power,
> +				total_granted_power, num_actors, power_range,
> +				max_allocatable_power, current_temp,
> +				(s32)control_temp - (s32)current_temp);
> +
>  	devm_kfree(&tz->device, granted_power);
>  free_max_power:
>  	devm_kfree(&tz->device, max_power);
> diff --git a/include/trace/events/thermal_power_allocator.h b/include/trace/events/thermal_power_allocator.h
> new file mode 100644
> index 000000000000..9140cec55c63
> --- /dev/null
> +++ b/include/trace/events/thermal_power_allocator.h
> @@ -0,0 +1,138 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM thermal_power_allocator
> +
> +#if !defined(_TRACE_THERMAL_POWER_ALLOCATOR_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_THERMAL_POWER_ALLOCATOR_H
> +
> +#include <linux/tracepoint.h>
> +
> +TRACE_EVENT(thermal_power_allocator,
> +	TP_PROTO(u32 *req_power, u32 total_req_power, u32 *granted_power,
> +		u32 total_granted_power, size_t num_actors, u32 power_range,
> +		u32 max_allocatable_power, unsigned long current_temp,
> +		s32 delta_temp),
> +	TP_ARGS(req_power, total_req_power, granted_power, total_granted_power,
> +		num_actors, power_range, max_allocatable_power, current_temp,
> +		delta_temp),
> +	TP_STRUCT__entry(
> +		__dynamic_array(u32,   req_power, num_actors    )
> +		__field(u32,           total_req_power          )
> +		__dynamic_array(u32,   granted_power, num_actors)
> +		__field(u32,           total_granted_power      )
> +		__field(size_t,        num_actors               )
> +		__field(u32,           power_range              )
> +		__field(u32,           max_allocatable_power    )
> +		__field(unsigned long, current_temp             )
> +		__field(s32,           delta_temp               )
> +	),
> +	TP_fast_assign(
> +		memcpy(__get_dynamic_array(req_power), req_power,
> +			num_actors * sizeof(*req_power));
> +		__entry->total_req_power = total_req_power;
> +		memcpy(__get_dynamic_array(granted_power), granted_power,
> +			num_actors * sizeof(*granted_power));
> +		__entry->total_granted_power = total_granted_power;
> +		__entry->num_actors = num_actors;
> +		__entry->power_range = power_range;
> +		__entry->max_allocatable_power = max_allocatable_power;
> +		__entry->current_temp = current_temp;
> +		__entry->delta_temp = delta_temp;
> +	),
> +
> +	TP_printk("req_power={%s} total_req_power=%u granted_power={%s} total_granted_power=%u power_range=%u max_allocatable_power=%u current_temperature=%lu delta_temperature=%d",
> +		__print_u32_array(__get_dynamic_array(req_power),
> +				__entry->num_actors),
> +		__entry->total_req_power,
> +		__print_u32_array(__get_dynamic_array(granted_power),
> +				__entry->num_actors),
> +		__entry->total_granted_power, __entry->power_range,
> +		__entry->max_allocatable_power, __entry->current_temp,
> +		__entry->delta_temp)
> +);
> +
> +TRACE_EVENT(thermal_power_allocator_pid,
> +	TP_PROTO(s32 err, s32 err_integral, s64 p, s64 i, s64 d, s32 output),
> +	TP_ARGS(err, err_integral, p, i, d, output),
> +	TP_STRUCT__entry(
> +		__field(s32, err         )
> +		__field(s32, err_integral)
> +		__field(s64, p           )
> +		__field(s64, i           )
> +		__field(s64, d           )
> +		__field(s32, output      )
> +	),
> +	TP_fast_assign(
> +		__entry->err = err;
> +		__entry->err_integral = err_integral;
> +		__entry->p = p;
> +		__entry->i = i;
> +		__entry->d = d;
> +		__entry->output = output;
> +	),
> +
> +	TP_printk("err=%d err_integral=%d p=%lld i=%lld d=%lld output=%d",
> +		__entry->err, __entry->err_integral,
> +		__entry->p, __entry->i, __entry->d, __entry->output)
> +);
> +
> +TRACE_EVENT(thermal_power_actor_cpu_get_dyn_power,
> +	TP_PROTO(const struct cpumask *cpus, unsigned long freq,
> +		u32 raw_cpu_power, u32 *load, size_t load_len, u32 power),
> +
> +	TP_ARGS(cpus, freq, raw_cpu_power, load, load_len, power),
> +
> +	TP_STRUCT__entry(
> +		__bitmask(cpumask, num_possible_cpus())
> +		__field(unsigned long, freq          )
> +		__field(u32,           raw_cpu_power )
> +		__dynamic_array(u32,   load, load_len)
> +		__field(size_t,        load_len      )
> +		__field(u32,           power         )
> +	),
> +
> +	TP_fast_assign(
> +		__assign_bitmask(cpumask, cpumask_bits(cpus),
> +				num_possible_cpus());
> +		__entry->freq = freq;
> +		__entry->raw_cpu_power = raw_cpu_power;
> +		memcpy(__get_dynamic_array(load), load,
> +			load_len * sizeof(*load));
> +		__entry->load_len = load_len;
> +		__entry->power = power;
> +	),
> +
> +	TP_printk("cpus=%s freq=%lu raw_cpu_power=%d load={%s} power=%d",
> +		__get_bitmask(cpumask), __entry->freq, __entry->raw_cpu_power,
> +		__print_u32_array(__get_dynamic_array(load), __entry->load_len),
> +		__entry->power)
> +);
> +
> +TRACE_EVENT(thermal_power_actor_cpu_limit,
> +	TP_PROTO(const struct cpumask *cpus, unsigned int freq,
> +		unsigned long cdev_state, u32 power),
> +
> +	TP_ARGS(cpus, freq, cdev_state, power),
> +
> +	TP_STRUCT__entry(
> +		__bitmask(cpumask, num_possible_cpus())
> +		__field(unsigned int,  freq      )
> +		__field(unsigned long, cdev_state)
> +		__field(u32,           power     )
> +	),
> +
> +	TP_fast_assign(
> +		__assign_bitmask(cpumask, cpumask_bits(cpus),
> +				num_possible_cpus());
> +		__entry->freq = freq;
> +		__entry->cdev_state = cdev_state;
> +		__entry->power = power;
> +	),
> +
> +	TP_printk("cpus=%s freq=%u cdev_state=%lu power=%u",
> +		__get_bitmask(cpumask), __entry->freq, __entry->cdev_state,
> +		__entry->power)
> +);
> +#endif /* _TRACE_THERMAL_POWER_ALLOCATOR_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v5 09/10] thermal: add trace events to the power allocator governor
  2014-07-10 15:44   ` Steven Rostedt
@ 2014-07-10 16:20     ` Javi Merino
  2014-07-10 18:03       ` Steven Rostedt
  0 siblings, 1 reply; 22+ messages in thread
From: Javi Merino @ 2014-07-10 16:20 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-pm, linux-kernel, Punit Agrawal, broonie, Zhang Rui,
	Eduardo Valentin, Frederic Weisbecker, Ingo Molnar

On Thu, Jul 10, 2014 at 04:44:51PM +0100, Steven Rostedt wrote:
> On Thu, 10 Jul 2014 15:18:47 +0100
> "Javi Merino" <javi.merino@arm.com> wrote:
> 
> > Add trace events for the power allocator governor and the power actor
> > interface of the cpu cooling device.
> > 
> > Cc: Zhang Rui <rui.zhang@intel.com>
> > Cc: Eduardo Valentin <edubezval@gmail.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Frederic Weisbecker <fweisbec@gmail.com>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Signed-off-by: Javi Merino <javi.merino@arm.com>
> > ---
> >  drivers/thermal/cpu_actor.c                    |  17 ++-
> >  drivers/thermal/power_allocator.c              |  22 +++-
> >  include/trace/events/thermal_power_allocator.h | 138 +++++++++++++++++++++++++
> >  3 files changed, 173 insertions(+), 4 deletions(-)
> >  create mode 100644 include/trace/events/thermal_power_allocator.h
> > 
> > diff --git a/drivers/thermal/cpu_actor.c b/drivers/thermal/cpu_actor.c
> > index 45ea4fa92ea0..b5ed2e80e288 100644
> > --- a/drivers/thermal/cpu_actor.c
> > +++ b/drivers/thermal/cpu_actor.c
> > @@ -28,6 +28,8 @@
> >  #include <linux/printk.h>
> >  #include <linux/slab.h>
> >  
> > +#include <trace/events/thermal_power_allocator.h>
> > +
> >  /**
> >   * struct power_table - frequency to power conversion
> >   * @frequency:	frequency in KHz
> > @@ -184,11 +186,12 @@ static u32 get_static_power(struct cpu_actor *cpu_actor,
> >   */
> >  static u32 get_dynamic_power(struct cpu_actor *cpu_actor, unsigned long freq)
> >  {
> > -	int cpu;
> > -	u32 power = 0, raw_cpu_power, total_load = 0;
> > +	int i, cpu;
> > +	u32 power = 0, raw_cpu_power, total_load = 0, load_cpu[NR_CPUS];
> 
> When NR_CPUS == 1024, you just killed the stack, as you added 4K to it.
> We upped the stack recently to 16k, but still.

True, this array should be static.

> >  
> >  	raw_cpu_power = cpu_freq_to_power(cpu_actor, freq);
> >  
> > +	i = 0;
> >  	for_each_cpu(cpu, &cpu_actor->cpumask) {
> >  		u32 load;
> >  
> > @@ -198,8 +201,15 @@ static u32 get_dynamic_power(struct cpu_actor *cpu_actor, unsigned long freq)
> >  		load = get_load(cpu_actor, cpu);
> >  		power += (raw_cpu_power * load) / 100;
> >  		total_load += load;
> > +		load_cpu[i] = load;
> > +
> > +		i++;
> >  	}
> >  
> > +	trace_thermal_power_actor_cpu_get_dyn_power(&cpu_actor->cpumask, freq,
> > +						raw_cpu_power, load_cpu, i,
> > +						power);
> 
> How many CPUs are you saving load_cpu on? A trace event can't be bigger
> than a page. And the data is actually a little less than that with the
> required headers.

The biggest system I've tested it on is an 8 cpu system (with
NR_CPUS==8).  So yes, small and we haven't seen any issues.

Are you saying that we are siphoning too much data through ftrace?  He
find it really valuable to collect information during run and process
it afterwards but I can see how this may not be feasible for systems
with thousands of cpus.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v5 09/10] thermal: add trace events to the power allocator governor
  2014-07-10 16:20     ` Javi Merino
@ 2014-07-10 18:03       ` Steven Rostedt
  2014-07-11  8:27         ` Javi Merino
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2014-07-10 18:03 UTC (permalink / raw)
  To: Javi Merino
  Cc: linux-pm, linux-kernel, Punit Agrawal, broonie, Zhang Rui,
	Eduardo Valentin, Frederic Weisbecker, Ingo Molnar

On Thu, 10 Jul 2014 17:20:14 +0100
"Javi Merino" <javi.merino@arm.com> wrote:


> > 
> > How many CPUs are you saving load_cpu on? A trace event can't be bigger
> > than a page. And the data is actually a little less than that with the
> > required headers.
> 
> The biggest system I've tested it on is an 8 cpu system (with
> NR_CPUS==8).  So yes, small and we haven't seen any issues.
> 
> Are you saying that we are siphoning too much data through ftrace?  He
> find it really valuable to collect information during run and process
> it afterwards but I can see how this may not be feasible for systems
> with thousands of cpus.

Only too much for a single event. Perhaps have the tracepoint post per
CPU? Then you wouldn't need that array.

-- Steve

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v5 09/10] thermal: add trace events to the power allocator governor
  2014-07-10 18:03       ` Steven Rostedt
@ 2014-07-11  8:27         ` Javi Merino
  0 siblings, 0 replies; 22+ messages in thread
From: Javi Merino @ 2014-07-11  8:27 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-pm, linux-kernel, Punit Agrawal, broonie, Zhang Rui,
	Eduardo Valentin, Frederic Weisbecker, Ingo Molnar

On Thu, Jul 10, 2014 at 07:03:50PM +0100, Steven Rostedt wrote:
> On Thu, 10 Jul 2014 17:20:14 +0100
> "Javi Merino" <javi.merino@arm.com> wrote:
> 
> 
> > > 
> > > How many CPUs are you saving load_cpu on? A trace event can't be bigger
> > > than a page. And the data is actually a little less than that with the
> > > required headers.
> > 
> > The biggest system I've tested it on is an 8 cpu system (with
> > NR_CPUS==8).  So yes, small and we haven't seen any issues.
> > 
> > Are you saying that we are siphoning too much data through ftrace?  He
> > find it really valuable to collect information during run and process
> > it afterwards but I can see how this may not be feasible for systems
> > with thousands of cpus.
> 
> Only too much for a single event. Perhaps have the tracepoint post per
> CPU? Then you wouldn't need that array.

Sounds good, I'll do that.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v5 05/10] thermal: let governors have private data for each thermal zone
  2014-07-10 14:18 ` [RFC PATCH v5 05/10] thermal: let governors have private data for each thermal zone Javi Merino
@ 2014-08-19 12:49   ` edubezval
  2014-08-19 15:40     ` Javi Merino
  0 siblings, 1 reply; 22+ messages in thread
From: edubezval @ 2014-08-19 12:49 UTC (permalink / raw)
  To: Javi Merino; +Cc: linux-pm, LKML, Punit Agrawal, broonie, Zhang Rui

Javi,

On Thu, Jul 10, 2014 at 10:18 AM, Javi Merino <javi.merino@arm.com> wrote:
> A governor may need to store its current state between calls to
> throttle().  That state depends on the thermal zone, so store it as
> private data in struct thermal_zone_device.
>
> The governors may have two new ops: bind_to_tz() and unbind_from_tz().
> When provided, these functions let governors do some initialization
> and teardown when they are bound/unbound to a tz and possibly store that
> information in the governor_data field of the struct
> thermal_zone_device.
>
> Cc: Zhang Rui <rui.zhang@intel.com>
> Cc: Eduardo Valentin <edubezval@gmail.com>
> Signed-off-by: Javi Merino <javi.merino@arm.com>
> ---
>  drivers/thermal/thermal_core.c | 83 ++++++++++++++++++++++++++++++++++++++----
>  include/linux/thermal.h        |  9 +++++
>  2 files changed, 84 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index 71b0ec0c370d..3da99dd80ad5 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -72,6 +72,58 @@ static struct thermal_governor *__find_governor(const char *name)
>         return NULL;
>  }
>
> +/**
> + * bind_previous_governor() - bind the previous governor of the thermal zone
> + * @tz:                a valid pointer to a struct thermal_zone_device
> + * @failed_gov_name:   the name of the governor that failed to register
> + *
> + * Register the previous governor of the thermal zone after a new
> + * governor has failed to be bound.
> + */
> +static void bind_previous_governor(struct thermal_zone_device *tz,
> +                               const char *failed_gov_name)
> +{
> +       if (tz->governor && tz->governor->bind_to_tz) {
> +               if (tz->governor->bind_to_tz(tz)) {
> +                       dev_warn(&tz->device,
> +                               "governor %s failed to bind and the previous one (%s) failed to register again, thermal zone %s has no governor\n",
> +                               failed_gov_name, tz->governor->name, tz->type);

The above must be a dev_err(), not warn. Besides, I would prefer if
you could improve the message. What is the difference between register
and bind?

> +                       tz->governor = NULL;
> +               }
> +       }
> +}
> +
> +/**
> + * thermal_set_governor() - Switch to another governor
> + * @tz:                a valid pointer to a struct thermal_zone_device
> + * @new_gov:   pointer to the new governor
> + *
> + * Change the governor of thermal zone @tz.
> + *
> + * Return: 0 on success, an error if the new governor's bind_to_tz() failed.
> + */
> +static int thermal_set_governor(struct thermal_zone_device *tz,
> +                               struct thermal_governor *new_gov)
> +{
> +       int ret = 0;
> +
> +       if (tz->governor && tz->governor->unbind_from_tz)
> +               tz->governor->unbind_from_tz(tz);
> +
> +       if (new_gov && new_gov->bind_to_tz) {
> +               ret = new_gov->bind_to_tz(tz);
> +               if (ret) {
> +                       bind_previous_governor(tz, new_gov->name);
> +
> +                       return ret;
> +               }
> +       }
> +
> +       tz->governor = new_gov;
> +
> +       return ret;
> +}
> +
>  int thermal_register_governor(struct thermal_governor *governor)
>  {
>         int err;
> @@ -104,8 +156,15 @@ int thermal_register_governor(struct thermal_governor *governor)
>
>                 name = pos->tzp->governor_name;
>
> -               if (!strnicmp(name, governor->name, THERMAL_NAME_LENGTH))
> -                       pos->governor = governor;
> +               if (!strnicmp(name, governor->name, THERMAL_NAME_LENGTH)) {
> +                       int ret;
> +
> +                       ret = thermal_set_governor(pos, governor);
> +                       if (ret)
> +                               dev_warn(&pos->device,
> +                                       "Failed to set governor %s for thermal zone %s: %d\n",
> +                                       governor->name, pos->type, ret);

dev_err here too.

> +               }
>         }
>
>         mutex_unlock(&thermal_list_lock);
> @@ -131,7 +190,7 @@ void thermal_unregister_governor(struct thermal_governor *governor)
>         list_for_each_entry(pos, &thermal_tz_list, node) {
>                 if (!strnicmp(pos->governor->name, governor->name,
>                                                 THERMAL_NAME_LENGTH))
> -                       pos->governor = NULL;
> +                       thermal_set_governor(pos, NULL);
>         }
>
>         mutex_unlock(&thermal_list_lock);
> @@ -756,8 +815,9 @@ policy_store(struct device *dev, struct device_attribute *attr,
>         if (!gov)
>                 goto exit;
>
> -       tz->governor = gov;
> -       ret = count;
> +       ret = thermal_set_governor(tz, gov);
> +       if (!ret)
> +               ret = count;
>
>  exit:
>         mutex_unlock(&thermal_governor_lock);
> @@ -1452,6 +1512,7 @@ struct thermal_zone_device *thermal_zone_device_register(const char *type,
>         int result;
>         int count;
>         int passive = 0;
> +       struct thermal_governor *governor;
>
>         if (type && strlen(type) >= THERMAL_NAME_LENGTH)
>                 return ERR_PTR(-EINVAL);
> @@ -1542,9 +1603,15 @@ struct thermal_zone_device *thermal_zone_device_register(const char *type,
>         mutex_lock(&thermal_governor_lock);
>
>         if (tz->tzp)
> -               tz->governor = __find_governor(tz->tzp->governor_name);
> +               governor = __find_governor(tz->tzp->governor_name);
>         else
> -               tz->governor = def_governor;
> +               governor = def_governor;
> +
> +       result = thermal_set_governor(tz, governor);
> +       if (result) {
> +               mutex_unlock(&thermal_governor_lock);
> +               goto unregister;
> +       }
>
>         mutex_unlock(&thermal_governor_lock);
>
> @@ -1634,7 +1701,7 @@ void thermal_zone_device_unregister(struct thermal_zone_device *tz)
>                 device_remove_file(&tz->device, &dev_attr_mode);
>         device_remove_file(&tz->device, &dev_attr_policy);
>         remove_trip_attrs(tz);
> -       tz->governor = NULL;
> +       thermal_set_governor(tz, NULL);
>
>         thermal_remove_hwmon_sysfs(tz);
>         release_idr(&thermal_tz_idr, &thermal_idr_lock, tz->id);
> diff --git a/include/linux/thermal.h b/include/linux/thermal.h
> index 0305cde21a74..1124b7a9358a 100644
> --- a/include/linux/thermal.h
> +++ b/include/linux/thermal.h
> @@ -187,6 +187,7 @@ struct thermal_attr {
>   * @ops:       operations this &thermal_zone_device supports
>   * @tzp:       thermal zone parameters
>   * @governor:  pointer to the governor for this thermal zone
> + * @governor_data:     private pointer for governor data
>   * @thermal_instances: list of &struct thermal_instance of this thermal zone
>   * @idr:       &struct idr to generate unique id for this zone's cooling
>   *             devices
> @@ -213,6 +214,7 @@ struct thermal_zone_device {
>         struct thermal_zone_device_ops *ops;
>         const struct thermal_zone_params *tzp;
>         struct thermal_governor *governor;
> +       void *governor_data;
>         struct list_head thermal_instances;
>         struct idr idr;
>         struct mutex lock;
> @@ -223,12 +225,19 @@ struct thermal_zone_device {
>  /**
>   * struct thermal_governor - structure that holds thermal governor information
>   * @name:      name of the governor
> + * @bind_to_tz: callback called when binding to a thermal zone.  If it
> + *             returns 0, the governor is bound to the thermal zone,
> + *             otherwise it fails.
> + * @unbind_from_tz:    callback called when a governor is unbound from a
> + *                     thermal zone.
>   * @throttle:  callback called for every trip point even if temperature is
>   *             below the trip point temperature
>   * @governor_list:     node in thermal_governor_list (in thermal_core.c)
>   */
>  struct thermal_governor {
>         char name[THERMAL_NAME_LENGTH];
> +       int (*bind_to_tz)(struct thermal_zone_device *tz);
> +       void (*unbind_from_tz)(struct thermal_zone_device *tz);
>         int (*throttle)(struct thermal_zone_device *tz, int trip);
>         struct list_head        governor_list;
>  };
> --
> 1.9.1
>
>



-- 
Eduardo Bezerra Valentin

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v5 08/10] thermal: introduce the Power Allocator governor
  2014-07-10 14:18 ` [RFC PATCH v5 08/10] thermal: introduce the Power Allocator governor Javi Merino
@ 2014-08-19 12:56   ` Eduardo Valentin
  2014-08-19 13:45   ` Eduardo Valentin
  1 sibling, 0 replies; 22+ messages in thread
From: Eduardo Valentin @ 2014-08-19 12:56 UTC (permalink / raw)
  To: Javi Merino; +Cc: linux-pm, linux-kernel, punit.agrawal, broonie, Zhang Rui

On Thu, Jul 10, 2014 at 03:18:46PM +0100, Javi Merino wrote:
> The power allocator governor is a thermal governor that controls system
> and device power allocation to control temperature.  Conceptually, the
> implementation divides the sustainable power of a thermal zone among
> all the heat sources in that zone.
> 
> This governor relies on "power actors", entities that represent heat
> sources.  They can report current and maximum power consumption and
> can set a given maximum power consumption, usually via a cooling
> device.
> 
> The governor uses a Proportional Integral Derivative (PID) controller
> driven by the temperature of the thermal zone.  The output of the
> controller is a power budget that is then allocated to each power
> actor that can have bearing on the temperature we are trying to
> control.  It decides how much power to give each cooling device based
> on the performance they are requesting.  The PID controller ensures
> that the total power budget does not exceed the control temperature.
> 
> Cc: Zhang Rui <rui.zhang@intel.com>
> Cc: Eduardo Valentin <edubezval@gmail.com>
> Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
> Signed-off-by: Javi Merino <javi.merino@arm.com>
> ---
>  Documentation/thermal/power_allocator.txt |  61 ++++
>  drivers/thermal/Kconfig                   |  15 +
>  drivers/thermal/Makefile                  |   1 +
>  drivers/thermal/power_allocator.c         | 467 ++++++++++++++++++++++++++++++
>  drivers/thermal/thermal_core.c            |   7 +-
>  drivers/thermal/thermal_core.h            |   8 +
>  include/linux/thermal.h                   |   8 +
>  7 files changed, 566 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/thermal/power_allocator.txt
>  create mode 100644 drivers/thermal/power_allocator.c
> 
> diff --git a/Documentation/thermal/power_allocator.txt b/Documentation/thermal/power_allocator.txt
> new file mode 100644
> index 000000000000..1859074dadcb
> --- /dev/null
> +++ b/Documentation/thermal/power_allocator.txt
> @@ -0,0 +1,61 @@
> +Integration of the power_allocator governor in a platform
> +=========================================================
> +
> +Registering thermal_zone_device
> +-------------------------------
> +
> +An estimate of the sustainable dissipatable power (in mW) should be
> +provided while registering the thermal zone.  This is the maximum
> +sustained power for allocation at the desired maximum temperature.
> +This number can vary for different conditions, but the closed-loop of
> +the controller should take care of those variations, the
> +`sustainable_power` should be an estimation of it.  Register your
> +thermal zone with `thermal_zone_params` that have a
> +`sustainable_power`.  If you weren't passing any
> +`thermal_zone_params`, then something like this will do:
> +
> +	static const struct thermal_zone_params tz_params = {
> +		.sustainable_power = 3500,
> +	};
> +
> +and then pass `tz_params` as the 5th parameter to
> +`thermal_zone_device_register()`
> +
> +Trip points
> +-----------
> +
> +The governor requires the following two trip points:
> +
> +1.  "switch on" trip point: temperature above which the governor
> +    control loop starts operating
> +2.  "desired temperature" trip point: it should be higher than the
> +    "switch on" trip point. It is the target temperature the governor
> +    is controlling for.
> +
> +The trip points can be either active or passive.
> +
> +Power actors
> +------------
> +
> +Devices controlled by this governor must be registered with the power
> +actor API.  Read `power_actor.txt` for more information about them.
> +
> +Limitations of the power allocator governor
> +===========================================
> +
> +The power allocator governor can't work with cooling devices directly.
> +A power actor can be created to interface between the governor and the
> +cooling device (see cpu_actor.c for an example).  Otherwise, if you
> +have power actors and cooling devices that are next to the same
> +thermal sensor create two thermal zones, one for each type.  Use the
> +power allocator governor for the power actor thermal zone with the
> +power actors and any other governor for the one with cooling devices.
> +
> +The power allocator governor's PID controller is highly dependent on a
> +periodic tick.  If you have a driver that calls
> +`thermal_zone_device_update()` (or anything that ends up calling the
> +governor's `throttle()` function) repetitively, the governor response
> +won't be very good.  Note that this is not particular to this
> +governor, step-wise will also misbehave if you call its throttle()
> +faster than the normal thermal framework tick (due to interrupts for
> +example) as it will overreact.

Javi,

Can you please describe better the above overreaction situation? I would
say that is a bug, not a feature to be documented.

As such, needs fixing.

> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> index 249b196deffd..0e76c0dab5f3 100644
> --- a/drivers/thermal/Kconfig
> +++ b/drivers/thermal/Kconfig
> @@ -71,6 +71,14 @@ config THERMAL_DEFAULT_GOV_USER_SPACE
>  	  Select this if you want to let the user space manage the
>  	  platform thermals.
>  
> +config THERMAL_DEFAULT_GOV_POWER_ALLOCATOR
> +	bool "power_allocator"
> +	select THERMAL_GOV_POWER_ALLOCATOR
> +	help
> +	  Select this if you want to control temperature based on
> +	  system and device power allocation. This governor relies on
> +	  power actors to operate.
> +
>  endchoice
>  
>  config THERMAL_GOV_FAIR_SHARE
> @@ -89,6 +97,13 @@ config THERMAL_GOV_USER_SPACE
>  	help
>  	  Enable this to let the user space manage the platform thermals.
>  
> +config THERMAL_GOV_POWER_ALLOCATOR
> +	bool "Power allocator thermal governor"
> +	select THERMAL_POWER_ACTOR
> +	help
> +	  Enable this to manage platform thermals by dynamically
> +	  allocating and limiting power to devices.
> +
>  config THERMAL_POWER_ACTOR
>  	bool
>  
> diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
> index 74f97c90a46c..e74d57d0fe61 100644
> --- a/drivers/thermal/Makefile
> +++ b/drivers/thermal/Makefile
> @@ -13,6 +13,7 @@ thermal_sys-$(CONFIG_THERMAL_OF)		+= of-thermal.o
>  thermal_sys-$(CONFIG_THERMAL_GOV_FAIR_SHARE)	+= fair_share.o
>  thermal_sys-$(CONFIG_THERMAL_GOV_STEP_WISE)	+= step_wise.o
>  thermal_sys-$(CONFIG_THERMAL_GOV_USER_SPACE)	+= user_space.o
> +thermal_sys-$(CONFIG_THERMAL_GOV_POWER_ALLOCATOR)	+= power_allocator.o
>  
>  # power actors
>  obj-$(CONFIG_THERMAL_POWER_ACTOR) += power_actor.o
> diff --git a/drivers/thermal/power_allocator.c b/drivers/thermal/power_allocator.c
> new file mode 100644
> index 000000000000..eb1797cd859b
> --- /dev/null
> +++ b/drivers/thermal/power_allocator.c
> @@ -0,0 +1,467 @@
> +/*
> + * A power allocator to manage temperature
> + *
> + * Copyright (C) 2014 ARM Ltd.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed "as is" WITHOUT ANY WARRANTY of any
> + * kind, whether express or implied; without even the implied warranty
> + * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + */
> +
> +#define pr_fmt(fmt) "Power allocator: " fmt
> +
> +#include <linux/power_actor.h>
> +#include <linux/rculist.h>
> +#include <linux/slab.h>
> +#include <linux/thermal.h>
> +
> +#include "thermal_core.h"
> +
> +#define FRAC_BITS 8
> +#define int_to_frac(x) ((x) << FRAC_BITS)
> +#define frac_to_int(x) ((x) >> FRAC_BITS)
> +
> +/**
> + * mul_frac() - multiply two fixed-point numbers
> + * @x:	first multiplicand
> + * @y:	second multiplicand
> + *
> + * Return: the result of multiplying two fixed-point numbers.  The
> + * result is also a fixed-point number.
> + */
> +static inline s64 mul_frac(s64 x, s64 y)
> +{
> +	return (x * y) >> FRAC_BITS;
> +}
> +
> +enum power_allocator_trip_levels {
> +	TRIP_SWITCH_ON = 0,	/* Switch on PID controller */
> +	TRIP_MAX_DESIRED_TEMPERATURE, /* Temperature we are controlling for */
> +};
> +
> +/**
> + * struct power_allocator_params - parameters for the power allocator governor
> + * @k_po:	Proportional parameter of the PID controller when overshooting
> + *		(i.e., when temperature is below the target)
> + * @k_pi:	Proportional parameter of the PID controller when undershooting
> + * @k_i:	Integral parameter of the PID controller
> + * @k_d:	Derivative parameter of the PID controller
> + * @integral_cutoff:	threshold below which the error is no longer accumulated
> +			in the PID controller
> + * @err_integral:	accumulated error in the PID controller.
> + * @prev_err:	error in the previous iteration of the PID controller.
> + *		Used to calculate the derivative term.
> + */
> +struct power_allocator_params {
> +	s32 k_po;
> +	s32 k_pu;
> +	s32 k_i;
> +	s32 k_d;
> +	s32 integral_cutoff;
> +	s32 err_integral;
> +	s32 prev_err;
> +};
> +
> +/**
> + * pid_controller() - PID controller
> + * @tz:	thermal zone we are operating in
> + * @current_temp:	the current temperature
> + * @control_temp:	the target temperature
> + * @max_allocatable_power:	maximum allocatable power for this thermal zone
> + *
> + * This PID controller increases the available power budget so that the
> + * temperature of the thermal zone gets as close as possible to
> + * @control_temp and limits the power if it exceeds it.  k_po is the
> + * proportional term when we are overshooting, k_pu is the
> + * proportional term when we are undershooting.  integral_cutoff is a
> + * threshold below which we stop accumulating the error.  The
> + * accumulated error is only valid if the requested power will make
> + * the system warmer.  If the system is mostly idle, there's no point
> + * in accumulating positive error.
> + *
> + * Return: The power budget for the next period.
> + */
> +static u32 pid_controller(struct thermal_zone_device *tz,
> +			unsigned long current_temp, unsigned long control_temp,
> +			u32 max_allocatable_power)
> +{
> +	s64 p, i, d, power_range;
> +	s32 err;
> +	struct power_allocator_params *params = tz->governor_data;
> +
> +	err = ((s32)control_temp - (s32)current_temp) / 1000;
> +	err = int_to_frac(err);
> +
> +	/* Calculate the proportional term */
> +	p = mul_frac(err < 0 ? params->k_po : params->k_pu, err);
> +
> +	/*
> +	 * Calculate the integral term
> +	 *
> +	 * if the error s less than cut off allow integration (but
> +	 * the integral is limited to max power)
> +	 */
> +	i = mul_frac(params->k_i, params->err_integral);
> +
> +	if (err < int_to_frac(params->integral_cutoff)) {
> +		s64 tmpi = mul_frac(params->k_i, err);
> +
> +		tmpi += i;
> +		if (tmpi <= int_to_frac(max_allocatable_power)) {
> +			i = tmpi;
> +			params->err_integral += err;
> +		}
> +	}
> +
> +	/*
> +	 * Calculate the derivative term
> +	 *
> +	 * We do err - prev_err, so with a positive k_d, a decreasing
> +	 * error (i.e. driving closer to the line) results in less
> +	 * power being applied, slowing down the controller)
> +	 */
> +	d = mul_frac(params->k_d, err - params->prev_err);
> +	params->prev_err = err;
> +
> +	power_range = p + i + d;
> +
> +	/* feed-forward the known sustainable dissipatable power */
> +	power_range = tz->tzp->sustainable_power + frac_to_int(power_range);
> +
> +	return clamp(power_range, (s64)0, (s64)max_allocatable_power);
> +}
> +
> +/**
> + * divvy_up_power() - divvy the allocated power between the actors
> + * @req_power:	each actor's requested power
> + * @max_power:	each actor's maximum available power
> + * @num_actors:	size of the @req_power, @max_power and @granted_power's array
> + * @total_req_power: sum of @req_power
> + * @power_range:	total allocated power
> + * @granted_power:	ouput array: each actor's granted power
> + *
> + * This function divides the total allocated power (@power_range)
> + * fairly between the actors.  It first tries to give each actor a
> + * share of the @power_range according to how much power it requested
> + * compared to the rest of the actors.  For example, if only one actor
> + * requests power, then it receives all the @power_range.  If
> + * three actors each requests 1mW, each receives a third of the
> + * @power_range.
> + *
> + * If any actor received more than their maximum power, then that
> + * surplus is re-divvied among the actors based on how far they are
> + * from their respective maximums.
> + *
> + * Granted power for each actor is written to @granted_power, which
> + * should've been allocated by the calling function.
> + */
> +static void divvy_up_power(u32 *req_power, u32 *max_power, int num_actors,
> +			u32 total_req_power, u32 power_range,
> +			u32 *granted_power)
> +{
> +	u32 extra_power, capped_extra_power, extra_actor_power[num_actors];
> +	int i;
> +
> +	if (!total_req_power) {
> +		/*
> +		 * Nobody requested anything, so just give everybody
> +		 * the maximum power
> +		 */
> +		for (i = 0; i < num_actors; i++)
> +			granted_power[i] = max_power[i];
> +
> +		return;
> +	}
> +
> +	capped_extra_power = 0;
> +	extra_power = 0;
> +	for (i = 0; i < num_actors; i++) {
> +		u64 req_range = req_power[i] * power_range;
> +
> +		granted_power[i] = div_u64(req_range, total_req_power);
> +
> +		if (granted_power[i] > max_power[i]) {
> +			extra_power += granted_power[i] - max_power[i];
> +			granted_power[i] = max_power[i];
> +		}
> +
> +		extra_actor_power[i] = max_power[i] - granted_power[i];
> +		capped_extra_power += extra_actor_power[i];
> +	}
> +
> +	if (!extra_power)
> +		return;
> +
> +	/*
> +	 * Re-divvy the reclaimed extra among actors based on
> +	 * how far they are from the max
> +	 */
> +	extra_power = min(extra_power, capped_extra_power);
> +	if (capped_extra_power > 0)
> +		for (i = 0; i < num_actors; i++)
> +			granted_power[i] += (extra_actor_power[i] *
> +					extra_power) / capped_extra_power;
> +}
> +
> +static int allocate_power(struct thermal_zone_device *tz,
> +			unsigned long current_temp, unsigned long control_temp)
> +{
> +	struct power_actor *actor;
> +	u32 *req_power, *max_power, *granted_power;
> +	u32 total_req_power, max_allocatable_power;
> +	u32 power_range;
> +	int i, num_actors, ret = 0;
> +
> +	mutex_lock(&tz->lock);
> +	rcu_read_lock();
> +
> +	num_actors = 0;
> +	list_for_each_entry_rcu(actor, &actor_list, actor_node)
> +		num_actors++;
> +
> +	req_power = devm_kcalloc(&tz->device, num_actors, sizeof(*req_power),
> +				GFP_KERNEL);
> +	if (!req_power) {
> +		ret = -ENOMEM;
> +		goto unlock;
> +	}
> +
> +	max_power = devm_kcalloc(&tz->device, num_actors, sizeof(*max_power),
> +				GFP_KERNEL);
> +	if (!max_power) {
> +		ret = -ENOMEM;
> +		goto free_req_power;
> +	}
> +
> +	granted_power = devm_kcalloc(&tz->device, num_actors,
> +				sizeof(*granted_power), GFP_KERNEL);
> +	if (!granted_power) {
> +		ret = -ENOMEM;
> +		goto free_max_power;
> +	}
> +
> +	i = 0;
> +	total_req_power = 0;
> +	max_allocatable_power = 0;
> +
> +	list_for_each_entry_rcu(actor, &actor_list, actor_node) {
> +		req_power[i] = actor->ops->get_req_power(actor, tz);
> +		total_req_power += req_power[i];
> +
> +		max_power[i] = actor->ops->get_max_power(actor, tz);
> +		max_allocatable_power += max_power[i];
> +
> +		i++;
> +	}
> +
> +	power_range = pid_controller(tz, current_temp, control_temp,
> +				max_allocatable_power);
> +
> +	divvy_up_power(req_power, max_power, num_actors, total_req_power,
> +		power_range, granted_power);
> +
> +	i = 0;
> +	list_for_each_entry_rcu(actor, &actor_list, actor_node) {
> +		actor->ops->set_power(actor, tz, granted_power[i]);
> +		i++;
> +	}
> +
> +	devm_kfree(&tz->device, granted_power);
> +free_max_power:
> +	devm_kfree(&tz->device, max_power);
> +free_req_power:
> +	devm_kfree(&tz->device, req_power);
> +unlock:
> +	rcu_read_unlock();
> +	mutex_unlock(&tz->lock);
> +
> +	return ret;
> +}
> +
> +static int check_trips(struct thermal_zone_device *tz)
> +{
> +	int ret;
> +	enum thermal_trip_type type;
> +
> +	if (tz->trips < 2)
> +		return -EINVAL;
> +
> +	ret = tz->ops->get_trip_type(tz, TRIP_SWITCH_ON, &type);
> +	if (ret)
> +		return ret;
> +
> +	if ((type != THERMAL_TRIP_PASSIVE) && (type != THERMAL_TRIP_ACTIVE))
> +		return -EINVAL;
> +
> +	ret = tz->ops->get_trip_type(tz, TRIP_MAX_DESIRED_TEMPERATURE, &type);
> +	if (ret)
> +		return ret;
> +
> +	if ((type != THERMAL_TRIP_PASSIVE) && (type != THERMAL_TRIP_ACTIVE))
> +		return -EINVAL;
> +
> +	return ret;
> +}
> +
> +static void reset_pid_controller(struct power_allocator_params *params)
> +{
> +	params->err_integral = 0;
> +	params->prev_err = 0;
> +}
> +
> +static void allow_maximum_power(struct thermal_zone_device *tz)
> +{
> +	struct power_actor *actor;
> +
> +	rcu_read_lock();
> +
> +	list_for_each_entry_rcu(actor, &actor_list, actor_node) {
> +		u32 max_power = actor->ops->get_max_power(actor, tz);
> +
> +		actor->ops->set_power(actor, tz, max_power);
> +	}
> +
> +	rcu_read_unlock();
> +}
> +
> +/**
> + * power_allocator_bind() - bind the power_allocator governor to a thermal zone
> + * @tz:	thermal zone to bind it to
> + *
> + * Check that the thermal zone is valid for this governor, that is, it
> + * has two thermal trips.  If so, initialize the PID controller
> + * parameters and bind it to the thermal zone.
> + *
> + * Return: 0 on success, -EINVAL if the trips were invalid or -ENOMEM
> + * if we ran out of memory.
> + */
> +static int power_allocator_bind(struct thermal_zone_device *tz)
> +{
> +	int ret;
> +	struct power_allocator_params *params;
> +	unsigned long switch_on_temp, control_temp;
> +	u32 temperature_threshold;
> +
> +	ret = check_trips(tz);
> +	if (ret) {
> +		dev_err(&tz->device,
> +			"thermal zone %s has the wrong number of trips for this governor\n",
> +			tz->type);
> +		return ret;
> +	}
> +
> +	if (!tz->tzp || !tz->tzp->sustainable_power) {
> +		dev_err(&tz->device,
> +			"power_allocator: missing sustainable_power\n");
> +		return -EINVAL;
> +	}
> +
> +	params = devm_kzalloc(&tz->device, sizeof(*params), GFP_KERNEL);
> +	if (!params)
> +		return -ENOMEM;
> +
> +	ret = tz->ops->get_trip_temp(tz, TRIP_SWITCH_ON, &switch_on_temp);
> +	if (ret)
> +		goto free;
> +
> +	ret = tz->ops->get_trip_temp(tz, TRIP_MAX_DESIRED_TEMPERATURE,
> +				&control_temp);
> +	if (ret)
> +		goto free;
> +
> +	temperature_threshold = (control_temp - switch_on_temp) / 1000;
> +
> +	params->k_po = int_to_frac(tz->tzp->sustainable_power) /
> +		temperature_threshold;
> +	params->k_pu = int_to_frac(2 * tz->tzp->sustainable_power) /
> +		temperature_threshold;
> +	params->k_i = int_to_frac(1);
> +	params->k_d = int_to_frac(0);
> +	params->integral_cutoff = 0;
> +
> +	reset_pid_controller(params);
> +
> +	tz->governor_data = params;
> +
> +	return 0;
> +
> +free:
> +	devm_kfree(&tz->device, params);
> +	return ret;
> +}
> +
> +static void power_allocator_unbind(struct thermal_zone_device *tz)
> +{
> +	dev_dbg(&tz->device, "Unbinding from thermal zone %d\n", tz->id);
> +	devm_kfree(&tz->device, tz->governor_data);
> +	tz->governor_data = NULL;
> +}
> +
> +static int power_allocator_throttle(struct thermal_zone_device *tz, int trip)
> +{
> +	int ret;
> +	unsigned long switch_on_temp, control_temp, current_temp;
> +	struct power_allocator_params *params = tz->governor_data;
> +
> +	/*
> +	 * We get called for every trip point but we only need to do
> +	 * our calculations once
> +	 */
> +	if (trip != TRIP_MAX_DESIRED_TEMPERATURE)
> +		return 0;
> +
> +	ret = thermal_zone_get_temp(tz, &current_temp);
> +	if (ret) {
> +		dev_warn(&tz->device, "Failed to get temperature: %d\n", ret);
> +		return ret;
> +	}
> +
> +	ret = tz->ops->get_trip_temp(tz, TRIP_SWITCH_ON, &switch_on_temp);
> +	if (ret) {
> +		dev_warn(&tz->device,
> +			"Failed to get switch on temperature: %d\n", ret);
> +		return ret;
> +	}
> +
> +	if (current_temp < switch_on_temp) {
> +		tz->passive = 0;
> +		reset_pid_controller(params);
> +		allow_maximum_power(tz);
> +		return 0;
> +	}
> +
> +	tz->passive = 1;
> +
> +	ret = tz->ops->get_trip_temp(tz, TRIP_MAX_DESIRED_TEMPERATURE,
> +				&control_temp);
> +	if (ret) {
> +		dev_warn(&tz->device,
> +			"Failed to get the maximum desired temperature: %d\n",
> +			ret);
> +		return ret;
> +	}
> +
> +	return allocate_power(tz, current_temp, control_temp);
> +}
> +
> +static struct thermal_governor thermal_gov_power_allocator = {
> +	.name		= "power_allocator",
> +	.bind_to_tz	= power_allocator_bind,
> +	.unbind_from_tz	= power_allocator_unbind,
> +	.throttle	= power_allocator_throttle,
> +};
> +
> +int thermal_gov_power_allocator_register(void)
> +{
> +	return thermal_register_governor(&thermal_gov_power_allocator);
> +}
> +
> +void thermal_gov_power_allocator_unregister(void)
> +{
> +	thermal_unregister_governor(&thermal_gov_power_allocator);
> +}
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index 3da99dd80ad5..1415d3d8a9eb 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -1857,7 +1857,11 @@ static int __init thermal_register_governors(void)
>  	if (result)
>  		return result;
>  
> -	return thermal_gov_user_space_register();
> +	result = thermal_gov_user_space_register();
> +	if (result)
> +		return result;
> +
> +	return thermal_gov_power_allocator_register();
>  }
>  
>  static void thermal_unregister_governors(void)
> @@ -1865,6 +1869,7 @@ static void thermal_unregister_governors(void)
>  	thermal_gov_step_wise_unregister();
>  	thermal_gov_fair_share_unregister();
>  	thermal_gov_user_space_unregister();
> +	thermal_gov_power_allocator_unregister();
>  }
>  
>  static int __init thermal_init(void)
> diff --git a/drivers/thermal/thermal_core.h b/drivers/thermal/thermal_core.h
> index 3db339fb636f..b24cde2c71cc 100644
> --- a/drivers/thermal/thermal_core.h
> +++ b/drivers/thermal/thermal_core.h
> @@ -77,6 +77,14 @@ static inline int thermal_gov_user_space_register(void) { return 0; }
>  static inline void thermal_gov_user_space_unregister(void) {}
>  #endif /* CONFIG_THERMAL_GOV_USER_SPACE */
>  
> +#ifdef CONFIG_THERMAL_GOV_POWER_ALLOCATOR
> +int thermal_gov_power_allocator_register(void);
> +void thermal_gov_power_allocator_unregister(void);
> +#else
> +static inline int thermal_gov_power_allocator_register(void) { return 0; }
> +static inline void thermal_gov_power_allocator_unregister(void) {}
> +#endif /* CONFIG_THERMAL_GOV_POWER_ALLOCATOR */
> +
>  /* device tree support */
>  #ifdef CONFIG_THERMAL_OF
>  int of_parse_thermal_zones(void);
> diff --git a/include/linux/thermal.h b/include/linux/thermal.h
> index 1124b7a9358a..e01141261756 100644
> --- a/include/linux/thermal.h
> +++ b/include/linux/thermal.h
> @@ -57,6 +57,8 @@
>  #define DEFAULT_THERMAL_GOVERNOR       "fair_share"
>  #elif defined(CONFIG_THERMAL_DEFAULT_GOV_USER_SPACE)
>  #define DEFAULT_THERMAL_GOVERNOR       "user_space"
> +#elif defined(CONFIG_THERMAL_DEFAULT_GOV_POWER_ALLOCATOR)
> +#define DEFAULT_THERMAL_GOVERNOR       "power_allocator"
>  #endif
>  
>  struct thermal_zone_device;
> @@ -287,6 +289,12 @@ struct thermal_zone_params {
>  
>  	int num_tbps;	/* Number of tbp entries */
>  	struct thermal_bind_params *tbp;
> +
> +	/*
> +	 * Sustainable power (heat) that this thermal zone can dissipate in
> +	 * mW
> +	 */
> +	u32 sustainable_power;
>  };
>  
>  struct thermal_genl_event {
> -- 
> 1.9.1
> 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v5 04/10] thermal: document struct thermal_zone_device and thermal_governor
  2014-07-10 14:18 ` [RFC PATCH v5 04/10] thermal: document struct thermal_zone_device and thermal_governor Javi Merino
@ 2014-08-19 13:03   ` Eduardo Valentin
  0 siblings, 0 replies; 22+ messages in thread
From: Eduardo Valentin @ 2014-08-19 13:03 UTC (permalink / raw)
  To: Javi Merino; +Cc: linux-pm, linux-kernel, punit.agrawal, broonie, Zhang Rui

On Thu, Jul 10, 2014 at 03:18:42PM +0100, Javi Merino wrote:
> Document struct thermal_zone_device and struct thermal_governor fields
> and their use by the thermal framework code.

I am getting mangled text in this email. For instance, checkpatch.pl
complains:
ERROR: patch seems to be corrupt (line wrapped?)
#61: FILE: include/linux/thermal.h:159:
=20

ERROR: spaces required around that '=' (ctx:ExV)
#110: FILE: include/linux/thermal.h:218:
+=09struct mutex lock;
 ^



> 
> Cc: Zhang Rui <rui.zhang@intel.com>
> Cc: Eduardo Valentin <edubezval@gmail.com>
> Signed-off-by: Javi Merino <javi.merino@arm.com>
> ---
>  include/linux/thermal.h | 46 ++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 44 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/thermal.h b/include/linux/thermal.h
> index f7e11c7ea7d9..0305cde21a74 100644
> --- a/include/linux/thermal.h
> +++ b/include/linux/thermal.h
> @@ -158,6 +158,42 @@ struct thermal_attr {
>  	char name[THERMAL_NAME_LENGTH];
>  };
>  
> +/**
> + * struct thermal_zone_device - structure for a thermal zone
> + * @id:		unique id number for each thermal zone
> + * @type:	the thermal zone device type
> + * @device:	&struct device for this thermal zone
> + * @trip_temp_attrs:	attributes for trip points for sysfs: trip temperature
> + * @trip_type_attrs:	attributes for trip points for sysfs: trip type
> + * @trip_hyst_attrs:	attributes for trip points for sysfs: trip hysteresis
> + * @devdata:	private pointer for device private data
> + * @trips:	number of trip points the thermal zone supports
> + * @passive_delay:	number of milliseconds to wait between polls when
> + *			performing passive cooling.  Currenty only used by the
> + *			step-wise governor
> + * @polling_delay:	number of milliseconds to wait between polls when
> + *			checking whether trip points have been crossed (0 for
> + *			interrupt driven systems)
> + * @temperature:	current temperature.  This is only for core code,
> + *			drivers should use thermal_zone_get_temp() to get the
> + *			current temperature
> + * @last_temperature:	previous temperature read
> + * @emul_temperature:	emulated temperature when using CONFIG_THERMAL_EMULATION
> + * @passive:		1 if you've crossed a passive trip point, 0 otherwise.
> + *			Currenty only used by the step-wise governor.
> + * @forced_passive:	If > 0, temperature at which to switch on all ACPI
> + *			processor cooling devices.  Currently only used by the
> + *			step-wise governor.
> + * @ops:	operations this &thermal_zone_device supports
> + * @tzp:	thermal zone parameters
> + * @governor:	pointer to the governor for this thermal zone
> + * @thermal_instances:	list of &struct thermal_instance of this thermal zone
> + * @idr:	&struct idr to generate unique id for this zone's cooling
> + *		devices
> + * @lock:	lock to protect thermal_instances list
> + * @node:	node in thermal_tz_list (in thermal_core.c)
> + * @poll_queue:	delayed work for polling
> + */
>  struct thermal_zone_device {
>  	int id;
>  	char type[THERMAL_NAME_LENGTH];
> @@ -179,12 +215,18 @@ struct thermal_zone_device {
>  	struct thermal_governor *governor;
>  	struct list_head thermal_instances;
>  	struct idr idr;
> -	struct mutex lock; /* protect thermal_instances list */
> +	struct mutex lock;
>  	struct list_head node;
>  	struct delayed_work poll_queue;
>  };
>  
> -/* Structure that holds thermal governor information */
> +/**
> + * struct thermal_governor - structure that holds thermal governor information
> + * @name:	name of the governor
> + * @throttle:	callback called for every trip point even if temperature is
> + *		below the trip point temperature
> + * @governor_list:	node in thermal_governor_list (in thermal_core.c)
> + */
>  struct thermal_governor {
>  	char name[THERMAL_NAME_LENGTH];
>  	int (*throttle)(struct thermal_zone_device *tz, int trip);
> -- 
> 1.9.1
> 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v5 08/10] thermal: introduce the Power Allocator governor
  2014-07-10 14:18 ` [RFC PATCH v5 08/10] thermal: introduce the Power Allocator governor Javi Merino
  2014-08-19 12:56   ` Eduardo Valentin
@ 2014-08-19 13:45   ` Eduardo Valentin
  2014-08-19 16:02     ` Javi Merino
  1 sibling, 1 reply; 22+ messages in thread
From: Eduardo Valentin @ 2014-08-19 13:45 UTC (permalink / raw)
  To: Javi Merino; +Cc: linux-pm, linux-kernel, punit.agrawal, broonie, Zhang Rui

Javi,

On Thu, Jul 10, 2014 at 03:18:46PM +0100, Javi Merino wrote:
> The power allocator governor is a thermal governor that controls system
> and device power allocation to control temperature.  Conceptually, the
> implementation divides the sustainable power of a thermal zone among
> all the heat sources in that zone.
> 
> This governor relies on "power actors", entities that represent heat
> sources.  They can report current and maximum power consumption and
> can set a given maximum power consumption, usually via a cooling
> device.
> 
> The governor uses a Proportional Integral Derivative (PID) controller
> driven by the temperature of the thermal zone.  The output of the
> controller is a power budget that is then allocated to each power
> actor that can have bearing on the temperature we are trying to
> control.  It decides how much power to give each cooling device based
> on the performance they are requesting.  The PID controller ensures
> that the total power budget does not exceed the control temperature.
> 
> Cc: Zhang Rui <rui.zhang@intel.com>
> Cc: Eduardo Valentin <edubezval@gmail.com>
> Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
> Signed-off-by: Javi Merino <javi.merino@arm.com>
> ---
>  Documentation/thermal/power_allocator.txt |  61 ++++
>  drivers/thermal/Kconfig                   |  15 +
>  drivers/thermal/Makefile                  |   1 +
>  drivers/thermal/power_allocator.c         | 467 ++++++++++++++++++++++++++++++
>  drivers/thermal/thermal_core.c            |   7 +-
>  drivers/thermal/thermal_core.h            |   8 +
>  include/linux/thermal.h                   |   8 +
>  7 files changed, 566 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/thermal/power_allocator.txt
>  create mode 100644 drivers/thermal/power_allocator.c
> 
> diff --git a/Documentation/thermal/power_allocator.txt b/Documentation/thermal/power_allocator.txt
> new file mode 100644
> index 000000000000..1859074dadcb
> --- /dev/null
> +++ b/Documentation/thermal/power_allocator.txt
> @@ -0,0 +1,61 @@
> +Integration of the power_allocator governor in a platform
> +=========================================================
> +
> +Registering thermal_zone_device
> +-------------------------------
> +
> +An estimate of the sustainable dissipatable power (in mW) should be
> +provided while registering the thermal zone.  This is the maximum
> +sustained power for allocation at the desired maximum temperature.
> +This number can vary for different conditions, but the closed-loop of
> +the controller should take care of those variations, the
> +`sustainable_power` should be an estimation of it.  Register your
> +thermal zone with `thermal_zone_params` that have a
> +`sustainable_power`.  If you weren't passing any
> +`thermal_zone_params`, then something like this will do:
> +
> +	static const struct thermal_zone_params tz_params = {
> +		.sustainable_power = 3500,
> +	};
> +
> +and then pass `tz_params` as the 5th parameter to
> +`thermal_zone_device_register()`
> +
> +Trip points
> +-----------
> +
> +The governor requires the following two trip points:
> +
> +1.  "switch on" trip point: temperature above which the governor
> +    control loop starts operating
> +2.  "desired temperature" trip point: it should be higher than the
> +    "switch on" trip point. It is the target temperature the governor
> +    is controlling for.
> +
> +The trip points can be either active or passive.
> +

I would prefer, for the sake of keeping the right concept in use, you
state here that they are passive trip points.

> +Power actors
> +------------
> +
> +Devices controlled by this governor must be registered with the power
> +actor API.  Read `power_actor.txt` for more information about them.
> +
> +Limitations of the power allocator governor
> +===========================================
> +
> +The power allocator governor can't work with cooling devices directly.
> +A power actor can be created to interface between the governor and the
> +cooling device (see cpu_actor.c for an example).  Otherwise, if you
> +have power actors and cooling devices that are next to the same
> +thermal sensor create two thermal zones, one for each type.  Use the
> +power allocator governor for the power actor thermal zone with the
> +power actors and any other governor for the one with cooling devices.
> +
> +The power allocator governor's PID controller is highly dependent on a
> +periodic tick.  If you have a driver that calls
> +`thermal_zone_device_update()` (or anything that ends up calling the
> +governor's `throttle()` function) repetitively, the governor response
> +won't be very good.  Note that this is not particular to this
> +governor, step-wise will also misbehave if you call its throttle()
> +faster than the normal thermal framework tick (due to interrupts for
> +example) as it will overreact.

I would recommend rephrasing the paragraph above, as I mentioned in
other email, looks like a documented bug. Mentioning that the governor
is still dependent on polling_delay active_delay is still worth it.
Also, the coexistance of interrupt driven thermal zones with polling
driven framework requires the interrupt firing to be configure to
something meaneanful, such as, an event outside the temperature range
the current trip is. 

> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> index 249b196deffd..0e76c0dab5f3 100644
> --- a/drivers/thermal/Kconfig
> +++ b/drivers/thermal/Kconfig
> @@ -71,6 +71,14 @@ config THERMAL_DEFAULT_GOV_USER_SPACE
>  	  Select this if you want to let the user space manage the
>  	  platform thermals.
>  
> +config THERMAL_DEFAULT_GOV_POWER_ALLOCATOR
> +	bool "power_allocator"
> +	select THERMAL_GOV_POWER_ALLOCATOR
> +	help
> +	  Select this if you want to control temperature based on
> +	  system and device power allocation. This governor relies on
> +	  power actors to operate.
> +
>  endchoice
>  
>  config THERMAL_GOV_FAIR_SHARE
> @@ -89,6 +97,13 @@ config THERMAL_GOV_USER_SPACE
>  	help
>  	  Enable this to let the user space manage the platform thermals.
>  
> +config THERMAL_GOV_POWER_ALLOCATOR
> +	bool "Power allocator thermal governor"
> +	select THERMAL_POWER_ACTOR
> +	help
> +	  Enable this to manage platform thermals by dynamically
> +	  allocating and limiting power to devices.
> +
>  config THERMAL_POWER_ACTOR
>  	bool
>  
> diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
> index 74f97c90a46c..e74d57d0fe61 100644
> --- a/drivers/thermal/Makefile
> +++ b/drivers/thermal/Makefile
> @@ -13,6 +13,7 @@ thermal_sys-$(CONFIG_THERMAL_OF)		+= of-thermal.o
>  thermal_sys-$(CONFIG_THERMAL_GOV_FAIR_SHARE)	+= fair_share.o
>  thermal_sys-$(CONFIG_THERMAL_GOV_STEP_WISE)	+= step_wise.o
>  thermal_sys-$(CONFIG_THERMAL_GOV_USER_SPACE)	+= user_space.o
> +thermal_sys-$(CONFIG_THERMAL_GOV_POWER_ALLOCATOR)	+= power_allocator.o
>  
>  # power actors
>  obj-$(CONFIG_THERMAL_POWER_ACTOR) += power_actor.o
> diff --git a/drivers/thermal/power_allocator.c b/drivers/thermal/power_allocator.c
> new file mode 100644
> index 000000000000..eb1797cd859b
> --- /dev/null
> +++ b/drivers/thermal/power_allocator.c
> @@ -0,0 +1,467 @@
> +/*
> + * A power allocator to manage temperature
> + *
> + * Copyright (C) 2014 ARM Ltd.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed "as is" WITHOUT ANY WARRANTY of any
> + * kind, whether express or implied; without even the implied warranty
> + * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + */
> +
> +#define pr_fmt(fmt) "Power allocator: " fmt
> +
> +#include <linux/power_actor.h>
> +#include <linux/rculist.h>
> +#include <linux/slab.h>
> +#include <linux/thermal.h>
> +
> +#include "thermal_core.h"
> +
> +#define FRAC_BITS 8
> +#define int_to_frac(x) ((x) << FRAC_BITS)
> +#define frac_to_int(x) ((x) >> FRAC_BITS)
> +
> +/**
> + * mul_frac() - multiply two fixed-point numbers
> + * @x:	first multiplicand
> + * @y:	second multiplicand
> + *
> + * Return: the result of multiplying two fixed-point numbers.  The
> + * result is also a fixed-point number.
> + */
> +static inline s64 mul_frac(s64 x, s64 y)
> +{
> +	return (x * y) >> FRAC_BITS;
> +}
> +
> +enum power_allocator_trip_levels {
> +	TRIP_SWITCH_ON = 0,	/* Switch on PID controller */
> +	TRIP_MAX_DESIRED_TEMPERATURE, /* Temperature we are controlling for */
> +};
> +
> +/**
> + * struct power_allocator_params - parameters for the power allocator governor
> + * @k_po:	Proportional parameter of the PID controller when overshooting
> + *		(i.e., when temperature is below the target)
> + * @k_pi:	Proportional parameter of the PID controller when undershooting
> + * @k_i:	Integral parameter of the PID controller
> + * @k_d:	Derivative parameter of the PID controller

Po, Pi, i, and d were not documented in your document above. Did you
miss them?

Given that they are core part of the governor behavior, I'd suggest
having a section in its document about how to fine tune the constants.
Explain, for instance, if it is enough to fine tune per platform, and
if having constants defined at compilation time or at runtime are good
point to be documented.

> + * @integral_cutoff:	threshold below which the error is no longer accumulated
> +			in the PID controller
> + * @err_integral:	accumulated error in the PID controller.
> + * @prev_err:	error in the previous iteration of the PID controller.
> + *		Used to calculate the derivative term.
> + */
> +struct power_allocator_params {
> +	s32 k_po;
> +	s32 k_pu;
> +	s32 k_i;
> +	s32 k_d;
> +	s32 integral_cutoff;
> +	s32 err_integral;
> +	s32 prev_err;
> +};
> +
> +/**
> + * pid_controller() - PID controller
> + * @tz:	thermal zone we are operating in
> + * @current_temp:	the current temperature
> + * @control_temp:	the target temperature
> + * @max_allocatable_power:	maximum allocatable power for this thermal zone
> + *
> + * This PID controller increases the available power budget so that the
> + * temperature of the thermal zone gets as close as possible to
> + * @control_temp and limits the power if it exceeds it.  k_po is the
> + * proportional term when we are overshooting, k_pu is the
> + * proportional term when we are undershooting.  integral_cutoff is a
> + * threshold below which we stop accumulating the error.  The
> + * accumulated error is only valid if the requested power will make
> + * the system warmer.  If the system is mostly idle, there's no point
> + * in accumulating positive error.
> + *
> + * Return: The power budget for the next period.
> + */
> +static u32 pid_controller(struct thermal_zone_device *tz,
> +			unsigned long current_temp, unsigned long control_temp,
> +			u32 max_allocatable_power)
> +{
> +	s64 p, i, d, power_range;
> +	s32 err;
> +	struct power_allocator_params *params = tz->governor_data;
> +
> +	err = ((s32)control_temp - (s32)current_temp) / 1000;
> +	err = int_to_frac(err);
> +
> +	/* Calculate the proportional term */
> +	p = mul_frac(err < 0 ? params->k_po : params->k_pu, err);
> +
> +	/*
> +	 * Calculate the integral term
> +	 *
> +	 * if the error s less than cut off allow integration (but
> +	 * the integral is limited to max power)
> +	 *




> +	i = mul_frac(params->k_i, params->err_integral);
> +
> +	if (err < int_to_frac(params->integral_cutoff)) {
> +		s64 tmpi = mul_frac(params->k_i, err);
> +
> +		tmpi += i;
> +		if (tmpi <= int_to_frac(max_allocatable_power)) {
> +			i = tmpi;
> +			params->err_integral += err;
> +		}
> +	}
> +
> +	/*
> +	 * Calculate the derivative term
> +	 *
> +	 * We do err - prev_err, so with a positive k_d, a decreasing
> +	 * error (i.e. driving closer to the line) results in less
> +	 * power being applied, slowing down the controller)
> +	 */
> +	d = mul_frac(params->k_d, err - params->prev_err);
> +	params->prev_err = err;
> +
> +	power_range = p + i + d;
> +
> +	/* feed-forward the known sustainable dissipatable power */
> +	power_range = tz->tzp->sustainable_power + frac_to_int(power_range);
> +
> +	return clamp(power_range, (s64)0, (s64)max_allocatable_power);
> +}
> +
> +/**
> + * divvy_up_power() - divvy the allocated power between the actors
> + * @req_power:	each actor's requested power
> + * @max_power:	each actor's maximum available power
> + * @num_actors:	size of the @req_power, @max_power and @granted_power's array
> + * @total_req_power: sum of @req_power
> + * @power_range:	total allocated power
> + * @granted_power:	ouput array: each actor's granted power
> + *
> + * This function divides the total allocated power (@power_range)
> + * fairly between the actors.  It first tries to give each actor a
> + * share of the @power_range according to how much power it requested
> + * compared to the rest of the actors.  For example, if only one actor
> + * requests power, then it receives all the @power_range.  If
> + * three actors each requests 1mW, each receives a third of the
> + * @power_range.
> + *
> + * If any actor received more than their maximum power, then that
> + * surplus is re-divvied among the actors based on how far they are
> + * from their respective maximums.
> + *
> + * Granted power for each actor is written to @granted_power, which
> + * should've been allocated by the calling function.
> + */
> +static void divvy_up_power(u32 *req_power, u32 *max_power, int num_actors,
> +			u32 total_req_power, u32 power_range,
> +			u32 *granted_power)
> +{
> +	u32 extra_power, capped_extra_power, extra_actor_power[num_actors];
> +	int i;
> +
> +	if (!total_req_power) {
> +		/*
> +		 * Nobody requested anything, so just give everybody
> +		 * the maximum power
> +		 */
> +		for (i = 0; i < num_actors; i++)
> +			granted_power[i] = max_power[i];
> +
> +		return;
> +	}
> +
> +	capped_extra_power = 0;
> +	extra_power = 0;
> +	for (i = 0; i < num_actors; i++) {
> +		u64 req_range = req_power[i] * power_range;
> +
> +		granted_power[i] = div_u64(req_range, total_req_power);
> +
> +		if (granted_power[i] > max_power[i]) {
> +			extra_power += granted_power[i] - max_power[i];
> +			granted_power[i] = max_power[i];
> +		}
> +
> +		extra_actor_power[i] = max_power[i] - granted_power[i];
> +		capped_extra_power += extra_actor_power[i];
> +	}
> +
> +	if (!extra_power)
> +		return;
> +
> +	/*
> +	 * Re-divvy the reclaimed extra among actors based on
> +	 * how far they are from the max
> +	 */
> +	extra_power = min(extra_power, capped_extra_power);
> +	if (capped_extra_power > 0)
> +		for (i = 0; i < num_actors; i++)
> +			granted_power[i] += (extra_actor_power[i] *
> +					extra_power) / capped_extra_power;
> +}
> +
> +static int allocate_power(struct thermal_zone_device *tz,
> +			unsigned long current_temp, unsigned long control_temp)
> +{
> +	struct power_actor *actor;
> +	u32 *req_power, *max_power, *granted_power;
> +	u32 total_req_power, max_allocatable_power;
> +	u32 power_range;
> +	int i, num_actors, ret = 0;
> +
> +	mutex_lock(&tz->lock);
> +	rcu_read_lock();
> +
> +	num_actors = 0;
> +	list_for_each_entry_rcu(actor, &actor_list, actor_node)
> +		num_actors++;
> +
> +	req_power = devm_kcalloc(&tz->device, num_actors, sizeof(*req_power),
> +				GFP_KERNEL);
> +	if (!req_power) {
> +		ret = -ENOMEM;
> +		goto unlock;
> +	}
> +
> +	max_power = devm_kcalloc(&tz->device, num_actors, sizeof(*max_power),
> +				GFP_KERNEL);
> +	if (!max_power) {
> +		ret = -ENOMEM;
> +		goto free_req_power;
> +	}
> +
> +	granted_power = devm_kcalloc(&tz->device, num_actors,
> +				sizeof(*granted_power), GFP_KERNEL);
> +	if (!granted_power) {
> +		ret = -ENOMEM;
> +		goto free_max_power;
> +	}
> +
> +	i = 0;
> +	total_req_power = 0;
> +	max_allocatable_power = 0;
> +
> +	list_for_each_entry_rcu(actor, &actor_list, actor_node) {
> +		req_power[i] = actor->ops->get_req_power(actor, tz);
> +		total_req_power += req_power[i];
> +
> +		max_power[i] = actor->ops->get_max_power(actor, tz);
> +		max_allocatable_power += max_power[i];
> +
> +		i++;
> +	}
> +
> +	power_range = pid_controller(tz, current_temp, control_temp,
> +				max_allocatable_power);
> +
> +	divvy_up_power(req_power, max_power, num_actors, total_req_power,
> +		power_range, granted_power);
> +
> +	i = 0;
> +	list_for_each_entry_rcu(actor, &actor_list, actor_node) {
> +		actor->ops->set_power(actor, tz, granted_power[i]);
> +		i++;
> +	}
> +
> +	devm_kfree(&tz->device, granted_power);
> +free_max_power:
> +	devm_kfree(&tz->device, max_power);
> +free_req_power:
> +	devm_kfree(&tz->device, req_power);
> +unlock:
> +	rcu_read_unlock();
> +	mutex_unlock(&tz->lock);
> +
> +	return ret;
> +}
> +
> +static int check_trips(struct thermal_zone_device *tz)
> +{
> +	int ret;
> +	enum thermal_trip_type type;
> +
> +	if (tz->trips < 2)
> +		return -EINVAL;
> +
> +	ret = tz->ops->get_trip_type(tz, TRIP_SWITCH_ON, &type);
> +	if (ret)
> +		return ret;
> +
> +	if ((type != THERMAL_TRIP_PASSIVE) && (type != THERMAL_TRIP_ACTIVE))
> +		return -EINVAL;
> +
> +	ret = tz->ops->get_trip_type(tz, TRIP_MAX_DESIRED_TEMPERATURE, &type);
> +	if (ret)
> +		return ret;
> +
> +	if ((type != THERMAL_TRIP_PASSIVE) && (type != THERMAL_TRIP_ACTIVE))
> +		return -EINVAL;
> +
> +	return ret;
> +}
> +
> +static void reset_pid_controller(struct power_allocator_params *params)
> +{
> +	params->err_integral = 0;
> +	params->prev_err = 0;
> +}
> +
> +static void allow_maximum_power(struct thermal_zone_device *tz)
> +{
> +	struct power_actor *actor;
> +
> +	rcu_read_lock();
> +
> +	list_for_each_entry_rcu(actor, &actor_list, actor_node) {
> +		u32 max_power = actor->ops->get_max_power(actor, tz);
> +
> +		actor->ops->set_power(actor, tz, max_power);
> +	}
> +
> +	rcu_read_unlock();
> +}
> +
> +/**
> + * power_allocator_bind() - bind the power_allocator governor to a thermal zone
> + * @tz:	thermal zone to bind it to
> + *
> + * Check that the thermal zone is valid for this governor, that is, it
> + * has two thermal trips.  If so, initialize the PID controller
> + * parameters and bind it to the thermal zone.
> + *
> + * Return: 0 on success, -EINVAL if the trips were invalid or -ENOMEM
> + * if we ran out of memory.
> + */
> +static int power_allocator_bind(struct thermal_zone_device *tz)
> +{
> +	int ret;
> +	struct power_allocator_params *params;
> +	unsigned long switch_on_temp, control_temp;
> +	u32 temperature_threshold;
> +
> +	ret = check_trips(tz);
> +	if (ret) {
> +		dev_err(&tz->device,
> +			"thermal zone %s has the wrong number of trips for this governor\n",
> +			tz->type);
> +		return ret;
> +	}
> +
> +	if (!tz->tzp || !tz->tzp->sustainable_power) {
> +		dev_err(&tz->device,
> +			"power_allocator: missing sustainable_power\n");
> +		return -EINVAL;
> +	}
> +
> +	params = devm_kzalloc(&tz->device, sizeof(*params), GFP_KERNEL);
> +	if (!params)
> +		return -ENOMEM;
> +
> +	ret = tz->ops->get_trip_temp(tz, TRIP_SWITCH_ON, &switch_on_temp);
> +	if (ret)
> +		goto free;
> +
> +	ret = tz->ops->get_trip_temp(tz, TRIP_MAX_DESIRED_TEMPERATURE,
> +				&control_temp);
> +	if (ret)
> +		goto free;
> +
> +	temperature_threshold = (control_temp - switch_on_temp) / 1000;
> +
> +	params->k_po = int_to_frac(tz->tzp->sustainable_power) /
> +		temperature_threshold;
> +	params->k_pu = int_to_frac(2 * tz->tzp->sustainable_power) /
> +		temperature_threshold;
> +	params->k_i = int_to_frac(1);
> +	params->k_d = int_to_frac(0);
> +	params->integral_cutoff = 0;
> +
> +	reset_pid_controller(params);
> +
> +	tz->governor_data = params;
> +
> +	return 0;
> +
> +free:
> +	devm_kfree(&tz->device, params);
> +	return ret;
> +}
> +
> +static void power_allocator_unbind(struct thermal_zone_device *tz)
> +{
> +	dev_dbg(&tz->device, "Unbinding from thermal zone %d\n", tz->id);
> +	devm_kfree(&tz->device, tz->governor_data);
> +	tz->governor_data = NULL;
> +}
> +
> +static int power_allocator_throttle(struct thermal_zone_device *tz, int trip)
> +{
> +	int ret;
> +	unsigned long switch_on_temp, control_temp, current_temp;
> +	struct power_allocator_params *params = tz->governor_data;
> +
> +	/*
> +	 * We get called for every trip point but we only need to do
> +	 * our calculations once
> +	 */
> +	if (trip != TRIP_MAX_DESIRED_TEMPERATURE)
> +		return 0;
> +
> +	ret = thermal_zone_get_temp(tz, &current_temp);
> +	if (ret) {
> +		dev_warn(&tz->device, "Failed to get temperature: %d\n", ret);
> +		return ret;
> +	}
> +
> +	ret = tz->ops->get_trip_temp(tz, TRIP_SWITCH_ON, &switch_on_temp);
> +	if (ret) {
> +		dev_warn(&tz->device,
> +			"Failed to get switch on temperature: %d\n", ret);
> +		return ret;
> +	}
> +
> +	if (current_temp < switch_on_temp) {
> +		tz->passive = 0;
> +		reset_pid_controller(params);
> +		allow_maximum_power(tz);
> +		return 0;
> +	}
> +
> +	tz->passive = 1;
> +
> +	ret = tz->ops->get_trip_temp(tz, TRIP_MAX_DESIRED_TEMPERATURE,
> +				&control_temp);
> +	if (ret) {
> +		dev_warn(&tz->device,
> +			"Failed to get the maximum desired temperature: %d\n",
> +			ret);
> +		return ret;
> +	}
> +
> +	return allocate_power(tz, current_temp, control_temp);
> +}
> +
> +static struct thermal_governor thermal_gov_power_allocator = {
> +	.name		= "power_allocator",
> +	.bind_to_tz	= power_allocator_bind,
> +	.unbind_from_tz	= power_allocator_unbind,
> +	.throttle	= power_allocator_throttle,
> +};
> +
> +int thermal_gov_power_allocator_register(void)
> +{
> +	return thermal_register_governor(&thermal_gov_power_allocator);
> +}
> +
> +void thermal_gov_power_allocator_unregister(void)
> +{
> +	thermal_unregister_governor(&thermal_gov_power_allocator);
> +}
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index 3da99dd80ad5..1415d3d8a9eb 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -1857,7 +1857,11 @@ static int __init thermal_register_governors(void)
>  	if (result)
>  		return result;
>  
> -	return thermal_gov_user_space_register();
> +	result = thermal_gov_user_space_register();
> +	if (result)
> +		return result;
> +
> +	return thermal_gov_power_allocator_register();
>  }
>  
>  static void thermal_unregister_governors(void)
> @@ -1865,6 +1869,7 @@ static void thermal_unregister_governors(void)
>  	thermal_gov_step_wise_unregister();
>  	thermal_gov_fair_share_unregister();
>  	thermal_gov_user_space_unregister();
> +	thermal_gov_power_allocator_unregister();
>  }
>  
>  static int __init thermal_init(void)
> diff --git a/drivers/thermal/thermal_core.h b/drivers/thermal/thermal_core.h
> index 3db339fb636f..b24cde2c71cc 100644
> --- a/drivers/thermal/thermal_core.h
> +++ b/drivers/thermal/thermal_core.h
> @@ -77,6 +77,14 @@ static inline int thermal_gov_user_space_register(void) { return 0; }
>  static inline void thermal_gov_user_space_unregister(void) {}
>  #endif /* CONFIG_THERMAL_GOV_USER_SPACE */
>  
> +#ifdef CONFIG_THERMAL_GOV_POWER_ALLOCATOR
> +int thermal_gov_power_allocator_register(void);
> +void thermal_gov_power_allocator_unregister(void);
> +#else
> +static inline int thermal_gov_power_allocator_register(void) { return 0; }
> +static inline void thermal_gov_power_allocator_unregister(void) {}
> +#endif /* CONFIG_THERMAL_GOV_POWER_ALLOCATOR */
> +
>  /* device tree support */
>  #ifdef CONFIG_THERMAL_OF
>  int of_parse_thermal_zones(void);
> diff --git a/include/linux/thermal.h b/include/linux/thermal.h
> index 1124b7a9358a..e01141261756 100644
> --- a/include/linux/thermal.h
> +++ b/include/linux/thermal.h
> @@ -57,6 +57,8 @@
>  #define DEFAULT_THERMAL_GOVERNOR       "fair_share"
>  #elif defined(CONFIG_THERMAL_DEFAULT_GOV_USER_SPACE)
>  #define DEFAULT_THERMAL_GOVERNOR       "user_space"
> +#elif defined(CONFIG_THERMAL_DEFAULT_GOV_POWER_ALLOCATOR)
> +#define DEFAULT_THERMAL_GOVERNOR       "power_allocator"
>  #endif
>  
>  struct thermal_zone_device;
> @@ -287,6 +289,12 @@ struct thermal_zone_params {
>  
>  	int num_tbps;	/* Number of tbp entries */
>  	struct thermal_bind_params *tbp;
> +
> +	/*
> +	 * Sustainable power (heat) that this thermal zone can dissipate in
> +	 * mW
> +	 */
> +	u32 sustainable_power;
>  };
>  
>  struct thermal_genl_event {
> -- 
> 1.9.1
> 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v5 05/10] thermal: let governors have private data for each thermal zone
  2014-08-19 12:49   ` edubezval
@ 2014-08-19 15:40     ` Javi Merino
  0 siblings, 0 replies; 22+ messages in thread
From: Javi Merino @ 2014-08-19 15:40 UTC (permalink / raw)
  To: edubezval; +Cc: linux-pm, LKML, Punit Agrawal, broonie, Zhang Rui

On Tue, Aug 19, 2014 at 01:49:32PM +0100, edubezval@gmail.com wrote:
> Javi,

Hi Eduardo,

> On Thu, Jul 10, 2014 at 10:18 AM, Javi Merino <javi.merino@arm.com> wrote:
> > A governor may need to store its current state between calls to
> > throttle().  That state depends on the thermal zone, so store it as
> > private data in struct thermal_zone_device.
> >
> > The governors may have two new ops: bind_to_tz() and unbind_from_tz().
> > When provided, these functions let governors do some initialization
> > and teardown when they are bound/unbound to a tz and possibly store that
> > information in the governor_data field of the struct
> > thermal_zone_device.
> >
> > Cc: Zhang Rui <rui.zhang@intel.com>
> > Cc: Eduardo Valentin <edubezval@gmail.com>
> > Signed-off-by: Javi Merino <javi.merino@arm.com>
> > ---
> >  drivers/thermal/thermal_core.c | 83 ++++++++++++++++++++++++++++++++++++++----
> >  include/linux/thermal.h        |  9 +++++
> >  2 files changed, 84 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> > index 71b0ec0c370d..3da99dd80ad5 100644
> > --- a/drivers/thermal/thermal_core.c
> > +++ b/drivers/thermal/thermal_core.c
> > @@ -72,6 +72,58 @@ static struct thermal_governor *__find_governor(const char *name)
> >         return NULL;
> >  }
> >
> > +/**
> > + * bind_previous_governor() - bind the previous governor of the thermal zone
> > + * @tz:                a valid pointer to a struct thermal_zone_device
> > + * @failed_gov_name:   the name of the governor that failed to register
> > + *
> > + * Register the previous governor of the thermal zone after a new
> > + * governor has failed to be bound.
> > + */
> > +static void bind_previous_governor(struct thermal_zone_device *tz,
> > +                               const char *failed_gov_name)
> > +{
> > +       if (tz->governor && tz->governor->bind_to_tz) {
> > +               if (tz->governor->bind_to_tz(tz)) {
> > +                       dev_warn(&tz->device,
> > +                               "governor %s failed to bind and the previous one (%s) failed to register again, thermal zone %s has no governor\n",
> > +                               failed_gov_name, tz->governor->name, tz->type);
> 
> The above must be a dev_err(), not warn.

Sure

>                                          Besides, I would prefer if
> you could improve the message. What is the difference between register
> and bind?

None.  It should say bind in both, it'll make it clearer.  I'll fix
it.

> > +                       tz->governor = NULL;
> > +               }
> > +       }
> > +}
> > +
> > +/**
> > + * thermal_set_governor() - Switch to another governor
> > + * @tz:                a valid pointer to a struct thermal_zone_device
> > + * @new_gov:   pointer to the new governor
> > + *
> > + * Change the governor of thermal zone @tz.
> > + *
> > + * Return: 0 on success, an error if the new governor's bind_to_tz() failed.
> > + */
> > +static int thermal_set_governor(struct thermal_zone_device *tz,
> > +                               struct thermal_governor *new_gov)
> > +{
> > +       int ret = 0;
> > +
> > +       if (tz->governor && tz->governor->unbind_from_tz)
> > +               tz->governor->unbind_from_tz(tz);
> > +
> > +       if (new_gov && new_gov->bind_to_tz) {
> > +               ret = new_gov->bind_to_tz(tz);
> > +               if (ret) {
> > +                       bind_previous_governor(tz, new_gov->name);
> > +
> > +                       return ret;
> > +               }
> > +       }
> > +
> > +       tz->governor = new_gov;
> > +
> > +       return ret;
> > +}
> > +
> >  int thermal_register_governor(struct thermal_governor *governor)
> >  {
> >         int err;
> > @@ -104,8 +156,15 @@ int thermal_register_governor(struct thermal_governor *governor)
> >
> >                 name = pos->tzp->governor_name;
> >
> > -               if (!strnicmp(name, governor->name, THERMAL_NAME_LENGTH))
> > -                       pos->governor = governor;
> > +               if (!strnicmp(name, governor->name, THERMAL_NAME_LENGTH)) {
> > +                       int ret;
> > +
> > +                       ret = thermal_set_governor(pos, governor);
> > +                       if (ret)
> > +                               dev_warn(&pos->device,
> > +                                       "Failed to set governor %s for thermal zone %s: %d\n",
> > +                                       governor->name, pos->type, ret);
> 
> dev_err here too.

Ok.  Thanks!
Javi 

> > +               }
> >         }
> >
> >         mutex_unlock(&thermal_list_lock);
> > @@ -131,7 +190,7 @@ void thermal_unregister_governor(struct thermal_governor *governor)
> >         list_for_each_entry(pos, &thermal_tz_list, node) {
> >                 if (!strnicmp(pos->governor->name, governor->name,
> >                                                 THERMAL_NAME_LENGTH))
> > -                       pos->governor = NULL;
> > +                       thermal_set_governor(pos, NULL);
> >         }
> >
> >         mutex_unlock(&thermal_list_lock);
> > @@ -756,8 +815,9 @@ policy_store(struct device *dev, struct device_attribute *attr,
> >         if (!gov)
> >                 goto exit;
> >
> > -       tz->governor = gov;
> > -       ret = count;
> > +       ret = thermal_set_governor(tz, gov);
> > +       if (!ret)
> > +               ret = count;
> >
> >  exit:
> >         mutex_unlock(&thermal_governor_lock);
> > @@ -1452,6 +1512,7 @@ struct thermal_zone_device *thermal_zone_device_register(const char *type,
> >         int result;
> >         int count;
> >         int passive = 0;
> > +       struct thermal_governor *governor;
> >
> >         if (type && strlen(type) >= THERMAL_NAME_LENGTH)
> >                 return ERR_PTR(-EINVAL);
> > @@ -1542,9 +1603,15 @@ struct thermal_zone_device *thermal_zone_device_register(const char *type,
> >         mutex_lock(&thermal_governor_lock);
> >
> >         if (tz->tzp)
> > -               tz->governor = __find_governor(tz->tzp->governor_name);
> > +               governor = __find_governor(tz->tzp->governor_name);
> >         else
> > -               tz->governor = def_governor;
> > +               governor = def_governor;
> > +
> > +       result = thermal_set_governor(tz, governor);
> > +       if (result) {
> > +               mutex_unlock(&thermal_governor_lock);
> > +               goto unregister;
> > +       }
> >
> >         mutex_unlock(&thermal_governor_lock);
> >
> > @@ -1634,7 +1701,7 @@ void thermal_zone_device_unregister(struct thermal_zone_device *tz)
> >                 device_remove_file(&tz->device, &dev_attr_mode);
> >         device_remove_file(&tz->device, &dev_attr_policy);
> >         remove_trip_attrs(tz);
> > -       tz->governor = NULL;
> > +       thermal_set_governor(tz, NULL);
> >
> >         thermal_remove_hwmon_sysfs(tz);
> >         release_idr(&thermal_tz_idr, &thermal_idr_lock, tz->id);
> > diff --git a/include/linux/thermal.h b/include/linux/thermal.h
> > index 0305cde21a74..1124b7a9358a 100644
> > --- a/include/linux/thermal.h
> > +++ b/include/linux/thermal.h
> > @@ -187,6 +187,7 @@ struct thermal_attr {
> >   * @ops:       operations this &thermal_zone_device supports
> >   * @tzp:       thermal zone parameters
> >   * @governor:  pointer to the governor for this thermal zone
> > + * @governor_data:     private pointer for governor data
> >   * @thermal_instances: list of &struct thermal_instance of this thermal zone
> >   * @idr:       &struct idr to generate unique id for this zone's cooling
> >   *             devices
> > @@ -213,6 +214,7 @@ struct thermal_zone_device {
> >         struct thermal_zone_device_ops *ops;
> >         const struct thermal_zone_params *tzp;
> >         struct thermal_governor *governor;
> > +       void *governor_data;
> >         struct list_head thermal_instances;
> >         struct idr idr;
> >         struct mutex lock;
> > @@ -223,12 +225,19 @@ struct thermal_zone_device {
> >  /**
> >   * struct thermal_governor - structure that holds thermal governor information
> >   * @name:      name of the governor
> > + * @bind_to_tz: callback called when binding to a thermal zone.  If it
> > + *             returns 0, the governor is bound to the thermal zone,
> > + *             otherwise it fails.
> > + * @unbind_from_tz:    callback called when a governor is unbound from a
> > + *                     thermal zone.
> >   * @throttle:  callback called for every trip point even if temperature is
> >   *             below the trip point temperature
> >   * @governor_list:     node in thermal_governor_list (in thermal_core.c)
> >   */
> >  struct thermal_governor {
> >         char name[THERMAL_NAME_LENGTH];
> > +       int (*bind_to_tz)(struct thermal_zone_device *tz);
> > +       void (*unbind_from_tz)(struct thermal_zone_device *tz);
> >         int (*throttle)(struct thermal_zone_device *tz, int trip);
> >         struct list_head        governor_list;
> >  };
> > --
> > 1.9.1
> >
> >
> 
> 
> 
> -- 
> Eduardo Bezerra Valentin
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH v5 08/10] thermal: introduce the Power Allocator governor
  2014-08-19 13:45   ` Eduardo Valentin
@ 2014-08-19 16:02     ` Javi Merino
  0 siblings, 0 replies; 22+ messages in thread
From: Javi Merino @ 2014-08-19 16:02 UTC (permalink / raw)
  To: Eduardo Valentin
  Cc: linux-pm, linux-kernel, Punit Agrawal, broonie, Zhang Rui

On Tue, Aug 19, 2014 at 02:45:52PM +0100, Eduardo Valentin wrote:
> Javi,

Hi Eduardo,

> On Thu, Jul 10, 2014 at 03:18:46PM +0100, Javi Merino wrote:
> > The power allocator governor is a thermal governor that controls system
> > and device power allocation to control temperature.  Conceptually, the
> > implementation divides the sustainable power of a thermal zone among
> > all the heat sources in that zone.
> >
> > This governor relies on "power actors", entities that represent heat
> > sources.  They can report current and maximum power consumption and
> > can set a given maximum power consumption, usually via a cooling
> > device.
> >
> > The governor uses a Proportional Integral Derivative (PID) controller
> > driven by the temperature of the thermal zone.  The output of the
> > controller is a power budget that is then allocated to each power
> > actor that can have bearing on the temperature we are trying to
> > control.  It decides how much power to give each cooling device based
> > on the performance they are requesting.  The PID controller ensures
> > that the total power budget does not exceed the control temperature.
> >
> > Cc: Zhang Rui <rui.zhang@intel.com>
> > Cc: Eduardo Valentin <edubezval@gmail.com>
> > Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
> > Signed-off-by: Javi Merino <javi.merino@arm.com>
> > ---
> >  Documentation/thermal/power_allocator.txt |  61 ++++
> >  drivers/thermal/Kconfig                   |  15 +
> >  drivers/thermal/Makefile                  |   1 +
> >  drivers/thermal/power_allocator.c         | 467 ++++++++++++++++++++++++++++++
> >  drivers/thermal/thermal_core.c            |   7 +-
> >  drivers/thermal/thermal_core.h            |   8 +
> >  include/linux/thermal.h                   |   8 +
> >  7 files changed, 566 insertions(+), 1 deletion(-)
> >  create mode 100644 Documentation/thermal/power_allocator.txt
> >  create mode 100644 drivers/thermal/power_allocator.c
> >
> > diff --git a/Documentation/thermal/power_allocator.txt b/Documentation/thermal/power_allocator.txt
> > new file mode 100644
> > index 000000000000..1859074dadcb
> > --- /dev/null
> > +++ b/Documentation/thermal/power_allocator.txt
> > @@ -0,0 +1,61 @@
> > +Integration of the power_allocator governor in a platform
> > +=========================================================
> > +
> > +Registering thermal_zone_device
> > +-------------------------------
> > +
> > +An estimate of the sustainable dissipatable power (in mW) should be
> > +provided while registering the thermal zone.  This is the maximum
> > +sustained power for allocation at the desired maximum temperature.
> > +This number can vary for different conditions, but the closed-loop of
> > +the controller should take care of those variations, the
> > +`sustainable_power` should be an estimation of it.  Register your
> > +thermal zone with `thermal_zone_params` that have a
> > +`sustainable_power`.  If you weren't passing any
> > +`thermal_zone_params`, then something like this will do:
> > +
> > +     static const struct thermal_zone_params tz_params = {
> > +             .sustainable_power = 3500,
> > +     };
> > +
> > +and then pass `tz_params` as the 5th parameter to
> > +`thermal_zone_device_register()`
> > +
> > +Trip points
> > +-----------
> > +
> > +The governor requires the following two trip points:
> > +
> > +1.  "switch on" trip point: temperature above which the governor
> > +    control loop starts operating
> > +2.  "desired temperature" trip point: it should be higher than the
> > +    "switch on" trip point. It is the target temperature the governor
> > +    is controlling for.
> > +
> > +The trip points can be either active or passive.
> > +
> 
> I would prefer, for the sake of keeping the right concept in use, you
> state here that they are passive trip points.

Sure, I've also changed the function that checks this condition
accordingly.

> > +Power actors
> > +------------
> > +
> > +Devices controlled by this governor must be registered with the power
> > +actor API.  Read `power_actor.txt` for more information about them.
> > +
> > +Limitations of the power allocator governor
> > +===========================================
> > +
> > +The power allocator governor can't work with cooling devices directly.
> > +A power actor can be created to interface between the governor and the
> > +cooling device (see cpu_actor.c for an example).  Otherwise, if you
> > +have power actors and cooling devices that are next to the same
> > +thermal sensor create two thermal zones, one for each type.  Use the
> > +power allocator governor for the power actor thermal zone with the
> > +power actors and any other governor for the one with cooling devices.
> > +
> > +The power allocator governor's PID controller is highly dependent on a
> > +periodic tick.  If you have a driver that calls
> > +`thermal_zone_device_update()` (or anything that ends up calling the
> > +governor's `throttle()` function) repetitively, the governor response
> > +won't be very good.  Note that this is not particular to this
> > +governor, step-wise will also misbehave if you call its throttle()
> > +faster than the normal thermal framework tick (due to interrupts for
> > +example) as it will overreact.
> 
> I would recommend rephrasing the paragraph above, as I mentioned in
> other email, looks like a documented bug. Mentioning that the governor
> is still dependent on polling_delay active_delay is still worth it.
> Also, the coexistance of interrupt driven thermal zones with polling
> driven framework requires the interrupt firing to be configure to
> something meaneanful, such as, an event outside the temperature range
> the current trip is.

Ok, I'll rephrase it to clarify the point.

> > diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> > index 249b196deffd..0e76c0dab5f3 100644
> > --- a/drivers/thermal/Kconfig
> > +++ b/drivers/thermal/Kconfig
> > @@ -71,6 +71,14 @@ config THERMAL_DEFAULT_GOV_USER_SPACE
> >         Select this if you want to let the user space manage the
> >         platform thermals.
> >
> > +config THERMAL_DEFAULT_GOV_POWER_ALLOCATOR
> > +     bool "power_allocator"
> > +     select THERMAL_GOV_POWER_ALLOCATOR
> > +     help
> > +       Select this if you want to control temperature based on
> > +       system and device power allocation. This governor relies on
> > +       power actors to operate.
> > +
> >  endchoice
> >
> >  config THERMAL_GOV_FAIR_SHARE
> > @@ -89,6 +97,13 @@ config THERMAL_GOV_USER_SPACE
> >       help
> >         Enable this to let the user space manage the platform thermals.
> >
> > +config THERMAL_GOV_POWER_ALLOCATOR
> > +     bool "Power allocator thermal governor"
> > +     select THERMAL_POWER_ACTOR
> > +     help
> > +       Enable this to manage platform thermals by dynamically
> > +       allocating and limiting power to devices.
> > +
> >  config THERMAL_POWER_ACTOR
> >       bool
> >
> > diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
> > index 74f97c90a46c..e74d57d0fe61 100644
> > --- a/drivers/thermal/Makefile
> > +++ b/drivers/thermal/Makefile
> > @@ -13,6 +13,7 @@ thermal_sys-$(CONFIG_THERMAL_OF)            += of-thermal.o
> >  thermal_sys-$(CONFIG_THERMAL_GOV_FAIR_SHARE) += fair_share.o
> >  thermal_sys-$(CONFIG_THERMAL_GOV_STEP_WISE)  += step_wise.o
> >  thermal_sys-$(CONFIG_THERMAL_GOV_USER_SPACE) += user_space.o
> > +thermal_sys-$(CONFIG_THERMAL_GOV_POWER_ALLOCATOR)    += power_allocator.o
> >
> >  # power actors
> >  obj-$(CONFIG_THERMAL_POWER_ACTOR) += power_actor.o
> > diff --git a/drivers/thermal/power_allocator.c b/drivers/thermal/power_allocator.c
> > new file mode 100644
> > index 000000000000..eb1797cd859b
> > --- /dev/null
> > +++ b/drivers/thermal/power_allocator.c
> > @@ -0,0 +1,467 @@
> > +/*
> > + * A power allocator to manage temperature
> > + *
> > + * Copyright (C) 2014 ARM Ltd.
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + *
> > + * This program is distributed "as is" WITHOUT ANY WARRANTY of any
> > + * kind, whether express or implied; without even the implied warranty
> > + * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> > + * GNU General Public License for more details.
> > + */
> > +
> > +#define pr_fmt(fmt) "Power allocator: " fmt
> > +
> > +#include <linux/power_actor.h>
> > +#include <linux/rculist.h>
> > +#include <linux/slab.h>
> > +#include <linux/thermal.h>
> > +
> > +#include "thermal_core.h"
> > +
> > +#define FRAC_BITS 8
> > +#define int_to_frac(x) ((x) << FRAC_BITS)
> > +#define frac_to_int(x) ((x) >> FRAC_BITS)
> > +
> > +/**
> > + * mul_frac() - multiply two fixed-point numbers
> > + * @x:       first multiplicand
> > + * @y:       second multiplicand
> > + *
> > + * Return: the result of multiplying two fixed-point numbers.  The
> > + * result is also a fixed-point number.
> > + */
> > +static inline s64 mul_frac(s64 x, s64 y)
> > +{
> > +     return (x * y) >> FRAC_BITS;
> > +}
> > +
> > +enum power_allocator_trip_levels {
> > +     TRIP_SWITCH_ON = 0,     /* Switch on PID controller */
> > +     TRIP_MAX_DESIRED_TEMPERATURE, /* Temperature we are controlling for */
> > +};
> > +
> > +/**
> > + * struct power_allocator_params - parameters for the power allocator governor
> > + * @k_po:    Proportional parameter of the PID controller when overshooting
> > + *           (i.e., when temperature is below the target)
> > + * @k_pi:    Proportional parameter of the PID controller when undershooting
> > + * @k_i:     Integral parameter of the PID controller
> > + * @k_d:     Derivative parameter of the PID controller
> 
> Po, Pi, i, and d were not documented in your document above. Did you
> miss them?
> 
> Given that they are core part of the governor behavior, I'd suggest
> having a section in its document about how to fine tune the constants.
> Explain, for instance, if it is enough to fine tune per platform, and
> if having constants defined at compilation time or at runtime are good
> point to be documented.

We need to provide a way for platforms to specify these constants.
Currently they are fixed because I thought that some heuristic would
be good enough to cover all the platforms.  It turns out that it's not
realistic so we will provide a mechanism for platforms to change them,
we will export them in debugfs and document them in the next series.

Cheers,
Javi


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2014-08-19 16:03 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-10 14:18 [RFC PATCH v5 00/10] The power allocator thermal governor Javi Merino
2014-07-10 14:18 ` [RFC PATCH v5 01/10] tracing: Add array printing helpers Javi Merino
2014-07-10 15:40   ` Steven Rostedt
2014-07-10 14:18 ` [RFC PATCH v5 02/10] tools lib traceevent: Generalize numeric argument Javi Merino
2014-07-10 14:18 ` [RFC PATCH v5 03/10] tools lib traceevent: Add support for __print_u{8,16,32,64}_array() Javi Merino
2014-07-10 14:18 ` [RFC PATCH v5 04/10] thermal: document struct thermal_zone_device and thermal_governor Javi Merino
2014-08-19 13:03   ` Eduardo Valentin
2014-07-10 14:18 ` [RFC PATCH v5 05/10] thermal: let governors have private data for each thermal zone Javi Merino
2014-08-19 12:49   ` edubezval
2014-08-19 15:40     ` Javi Merino
2014-07-10 14:18 ` [RFC PATCH v5 06/10] thermal: introduce the Power Actor API Javi Merino
2014-07-10 14:18 ` [RFC PATCH v5 07/10] thermal: add a basic cpu power actor Javi Merino
2014-07-10 14:18 ` [RFC PATCH v5 08/10] thermal: introduce the Power Allocator governor Javi Merino
2014-08-19 12:56   ` Eduardo Valentin
2014-08-19 13:45   ` Eduardo Valentin
2014-08-19 16:02     ` Javi Merino
2014-07-10 14:18 ` [RFC PATCH v5 09/10] thermal: add trace events to the power allocator governor Javi Merino
2014-07-10 15:44   ` Steven Rostedt
2014-07-10 16:20     ` Javi Merino
2014-07-10 18:03       ` Steven Rostedt
2014-07-11  8:27         ` Javi Merino
2014-07-10 14:18 ` [RFC PATCH v5 10/10] of: thermal: Introduce sustainable power for a thermal zone Javi Merino

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.