linux-kernel.vger.kernel.org archive mirror
* [RFC] Energy/power monitoring within the kernel
@ 2012-10-23 17:30 Pawel Moll
  2012-10-23 17:43 ` Steven Rostedt
                   ` (4 more replies)
  0 siblings, 5 replies; 11+ messages in thread
From: Pawel Moll @ 2012-10-23 17:30 UTC (permalink / raw)
  To: Amit Daniel Kachhap, Zhang Rui, Viresh Kumar, Daniel Lezcano,
	Jean Delvare, Guenter Roeck, Steven Rostedt, Frederic Weisbecker,
	Ingo Molnar, Jesper Juhl, Thomas Renninger, Jean Pihet
  Cc: linux-kernel, linux-arm-kernel, lm-sensors, linaro-dev

Greetings All,

More and more people are getting interested in the subject of power
(energy) consumption monitoring. We have some external tools like
"battery simulators", energy probes etc., but some targets can measure
their power usage on their own.

Traditionally such data is exposed to the user via the hwmon sysfs
interface, and that's exactly what I did for "my" platform - I have
a /sys/class/hwmon/hwmon*/device/energy*_input and this was good
enough to draw pretty graphs in userspace. Everyone was happy...

Now I am getting new requests to do more with this data. In particular
I'm asked how to add such information to ftrace/perf output. The second
most frequent request is about providing it to an "energy aware"
cpufreq governor.

I've come up with three (non-mutually exclusive) options. I will
appreciate any other ideas and comments (including "it makes no sense
whatsoever" ones, with justification). Of course I am more than willing
to spend time on prototyping anything that seems reasonable and propose
patches.



=== Option 1: Trace event ===

This seems to be the "cheapest" option. Simply defining a trace event
that can be generated by a hwmon (or any other) driver makes the
interesting data immediately available to any ftrace/perf user. Of
course it doesn't really help with the cpufreq case, but seems to be
a good place to start with.

The question is how to define it... I've come up with two prototypes:

= Generic hwmon trace event =

This one allows any driver to generate a trace event whenever any
"hwmon attribute" (measured value) gets updated. The rate at which the
updates happen can be controlled by already existing "update_interval"
attribute.

8<-------------------------------------------
TRACE_EVENT(hwmon_attr_update,
	TP_PROTO(struct device *dev, struct attribute *attr, long long input),
	TP_ARGS(dev, attr, input),

	TP_STRUCT__entry(
		__string(       dev,		dev_name(dev))
		__string(	attr,		attr->name)
		__field(	long long,	input)
	),

	TP_fast_assign(
		__assign_str(dev, dev_name(dev));
		__assign_str(attr, attr->name);
		__entry->input = input;
	),

	TP_printk("%s %s %lld", __get_str(dev), __get_str(attr), __entry->input)
);
8<-------------------------------------------

It generates an ftrace message like this:

<...>212.673126: hwmon_attr_update: hwmon4 temp1_input 34361

One issue with this is that some external knowledge is required to
relate a number to a processor core. Or maybe it's not an issue at all
because it should be left for the user(space)?
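For example, a userspace tool could carry that knowledge itself. A sketch
(in plain C, with a completely made-up mapping table - the real one would
of course be platform specific):

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace helper: parse one hwmon_attr_update line as it
 * appears in the ftrace output and look up, using externally provided
 * knowledge, which CPU(s) the value belongs to. The mapping table is
 * made up for illustration only. */
struct hwmon_cpu_map {
	const char *dev;	/* hwmon device name */
	const char *attr;	/* hwmon attribute name */
	const char *cpus;	/* CPUs this sensor covers */
};

static const struct hwmon_cpu_map map[] = {
	{ "hwmon4", "temp1_input", "0-3" },	/* assumed example mapping */
};

/* Returns the CPU list for a parsed trace line, or NULL if unknown;
 * stores the measured value in *value. */
static const char *cpus_for_line(const char *line, long long *value)
{
	char dev[32], attr[32];
	size_t i;

	if (sscanf(line, "%31s %31s %lld", dev, attr, value) != 3)
		return NULL;
	for (i = 0; i < sizeof(map) / sizeof(map[0]); i++)
		if (!strcmp(map[i].dev, dev) && !strcmp(map[i].attr, attr))
			return map[i].cpus;
	return NULL;
}
```

That way the kernel side could stay completely generic.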

= CPU power/energy/temperature trace event =

This one is designed to emphasize the relation between the measured
value (whether it is energy, temperature or any other physical
phenomenon, really) and CPUs, so it is quite specific (too specific?)

8<-------------------------------------------
TRACE_EVENT(cpus_environment,
	TP_PROTO(const struct cpumask *cpus, long long value, char unit),
	TP_ARGS(cpus, value, unit),

	TP_STRUCT__entry(
		__array(	unsigned char,	cpus,	sizeof(struct cpumask))
		__field(	long long,	value)
		__field(	char,		unit)
	),

	TP_fast_assign(
		memcpy(__entry->cpus, cpus, sizeof(struct cpumask));
		__entry->value = value;
		__entry->unit = unit;
	),

	TP_printk("cpus %s %lld[%c]",
		__print_cpumask((struct cpumask *)__entry->cpus),
		__entry->value, __entry->unit)
);
8<-------------------------------------------

And the equivalent ftrace message is:

<...>127.063107: cpus_environment: cpus 0,1,2,3 34361[C]

It's a cpumask, not just single cpu id, because the sensor may measure
the value per set of CPUs, eg. a temperature of the whole silicon die
(so all the cores) or an energy consumed by a subset of cores (this
is my particular use case - two meters monitor a cluster of two
processors and a cluster of three processors, all working as an SMP
system).

Of course the cpus __array could actually be a special __cpumask field
type (I've just hacked the __print_cpumask so far). And I've just
realised that the unit field should actually be a string to allow unit
prefixes to be specified (the above should obviously be "34361[mC]"
not "[C]"). Also - excuse the "cpus_environment" name - this was the
best I was able to come up with at the time and I'm eager to accept
any alternative suggestions :-)



=== Option 2: hwmon perf PMU ===

Although the trace event makes it possible to obtain interesting
information using perf, the user wouldn't be able to treat the
energy meter as a normal data source. In particular there would
be no way of creating a group of events consisting eg. of a
"normal" leader (eg. cache miss event) triggering energy meter
read. The only way to get this done is to implement a perf PMU
backend providing "environmental data" to the user.

= High-level hwmon API and PMU =

The current hwmon subsystem does not provide any abstraction for the
measured values and requires particular drivers to create specific
sysfs attributes that are then used by userspace libsensors. This makes
the framework ultimately flexible and ultimately hard to access
from within the kernel...

What could be done here is some (simple) API to register the
measured values with the hwmon core which would result in creating
equivalent sysfs attributes automagically, but also allow an
in-kernel API for values enumeration and access. That way the core
could also register a "hwmon PMU" with the perf framework providing
data from all "compliant" drivers.
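To illustrate the kind of shape I have in mind - a userspace mock of the
registration/enumeration idea; none of these names exist in the kernel
today, of course:

```c
#include <string.h>

/* Hypothetical API sketch: drivers register measured values with the
 * hwmon core, which can then enumerate and read them on behalf of
 * sysfs, a perf PMU or a cpufreq governor. All names are made up. */
struct hwmon_value {
	const char *name;		/* e.g. "energy1_input" */
	long long (*read)(void);	/* driver-supplied getter */
};

#define MAX_VALUES 16
static struct hwmon_value registry[MAX_VALUES];
static int nr_values;

/* What a driver would call instead of creating sysfs attributes. */
static int hwmon_register_value(const char *name, long long (*read)(void))
{
	if (nr_values >= MAX_VALUES)
		return -1;
	registry[nr_values].name = name;
	registry[nr_values].read = read;
	nr_values++;
	return 0;
}

/* What the in-kernel consumers (perf PMU, cpufreq) would use. */
static long long hwmon_read_value(const char *name)
{
	int i;

	for (i = 0; i < nr_values; i++)
		if (!strcmp(registry[i].name, name))
			return registry[i].read();
	return -1;
}
```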

= A driver-specific PMU =

Of course a particular driver could register its own perf PMU on its
own. It's certainly an option, just very suboptimal in my opinion.
Or maybe not? Maybe the task is so specialized that it makes sense?



=== Option 3: CPU power(energy) monitoring framework ===

And last but not least, maybe the problem deserves some dedicated
API? Something that would take providers and feed their data into
interested parties, in particular a perf PMU implementation and
cpufreq governors?

Maybe it could be an extension to the thermal framework? It already
gives some meaning to a physical phenomenon. Adding other, related ones
like energy, and relating it to cpu cores could make some sense.



I've tried to gather all the potentially interested parties in the To:
list, but if I missed anyone - please do let them (and/or me) know.

Best regards and thanks for participating in the discussion!

Pawel





* Re: [RFC] Energy/power monitoring within the kernel
  2012-10-23 17:30 [RFC] Energy/power monitoring within the kernel Pawel Moll
@ 2012-10-23 17:43 ` Steven Rostedt
  2012-10-24 16:00   ` Pawel Moll
  2012-10-23 18:49 ` Andy Green
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 11+ messages in thread
From: Steven Rostedt @ 2012-10-23 17:43 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Amit Daniel Kachhap, Zhang Rui, Viresh Kumar, Daniel Lezcano,
	Jean Delvare, Guenter Roeck, Frederic Weisbecker, Ingo Molnar,
	Jesper Juhl, Thomas Renninger, Jean Pihet, linux-kernel,
	linux-arm-kernel, lm-sensors, linaro-dev

On Tue, 2012-10-23 at 18:30 +0100, Pawel Moll wrote:

> 
> === Option 1: Trace event ===
> 
> This seems to be the "cheapest" option. Simply defining a trace event
> that can be generated by a hwmon (or any other) driver makes the
> interesting data immediately available to any ftrace/perf user. Of
> course it doesn't really help with the cpufreq case, but seems to be
> a good place to start with.
> 
> The question is how to define it... I've come up with two prototypes:
> 
> = Generic hwmon trace event =
> 
> This one allows any driver to generate a trace event whenever any
> "hwmon attribute" (measured value) gets updated. The rate at which the
> updates happen can be controlled by already existing "update_interval"
> attribute.
> 
> 8<-------------------------------------------
> TRACE_EVENT(hwmon_attr_update,
> 	TP_PROTO(struct device *dev, struct attribute *attr, long long input),
> 	TP_ARGS(dev, attr, input),
> 
> 	TP_STRUCT__entry(
> 		__string(       dev,		dev_name(dev))
> 		__string(	attr,		attr->name)
> 		__field(	long long,	input)
> 	),
> 
> 	TP_fast_assign(
> 		__assign_str(dev, dev_name(dev));
> 		__assign_str(attr, attr->name);
> 		__entry->input = input;
> 	),
> 
> 	TP_printk("%s %s %lld", __get_str(dev), __get_str(attr), __entry->input)
> );
> 8<-------------------------------------------
> 
> It generates an ftrace message like this:
> 
> <...>212.673126: hwmon_attr_update: hwmon4 temp1_input 34361
> 
> One issue with this is that some external knowledge is required to
> relate a number to a processor core. Or maybe it's not an issue at all
> because it should be left for the user(space)?

If the external knowledge can be characterized in a userspace tool with
the given data here, I see no issues with this.

> 
> = CPU power/energy/temperature trace event =
> 
> This one is designed to emphasize the relation between the measured
> value (whether it is energy, temperature or any other physical
> phenomenon, really) and CPUs, so it is quite specific (too specific?)
> 
> 8<-------------------------------------------
> TRACE_EVENT(cpus_environment,
> 	TP_PROTO(const struct cpumask *cpus, long long value, char unit),
> 	TP_ARGS(cpus, value, unit),
> 
> 	TP_STRUCT__entry(
> 		__array(	unsigned char,	cpus,	sizeof(struct cpumask))
> 		__field(	long long,	value)
> 		__field(	char,		unit)
> 	),
> 
> 	TP_fast_assign(
> 		memcpy(__entry->cpus, cpus, sizeof(struct cpumask));

Copying the entire cpumask seems like overkill. Especially when you have
4096 CPU machines.

> 		__entry->value = value;
> 		__entry->unit = unit;
> 	),
> 
> 	TP_printk("cpus %s %lld[%c]",
> 		__print_cpumask((struct cpumask *)__entry->cpus),
> 		__entry->value, __entry->unit)
> );
> 8<-------------------------------------------
> 
> And the equivalent ftrace message is:
> 
> <...>127.063107: cpus_environment: cpus 0,1,2,3 34361[C]
> 
> It's a cpumask, not just single cpu id, because the sensor may measure
> the value per set of CPUs, eg. a temperature of the whole silicon die
> (so all the cores) or an energy consumed by a subset of cores (this
> is my particular use case - two meters monitor a cluster of two
> processors and a cluster of three processors, all working as an SMP
> system).
> 
> Of course the cpus __array could actually be a special __cpumask field
> type (I've just hacked the __print_cpumask so far). And I've just
> realised that the unit field should actually be a string to allow unit
> prefixes to be specified (the above should obviously be "34361[mC]"
> not "[C]"). Also - excuse the "cpus_environment" name - this was the
> best I was able to come up with at the time and I'm eager to accept
> any alternative suggestions :-)

Perhaps making a field that can be a subset of cpus may be better. That
way we don't waste the ring buffer with lots of zeros. I'm guessing that
it will only be a group of cpus, and not a scattered list? Of course,
I've seen boxes where the cpu numbers went from core to core. That is,
cpu 0 was on core 1, cpu 1 was on core 2, and then it would repeat. 
cpu 8 was on core 1, cpu 9 was on core 2, etc.

But still, this could be compressed somehow.
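Something like encoding only the ranges of set bits, I mean. A userspace
sketch of the idea (a real implementation would use the kernel cpumask
API rather than a plain unsigned long):

```c
#include <stdio.h>

/* Compress a small cpu bitmask into the "0-3,8-9" list form instead of
 * recording the whole mask. 'mask' stands in for a cpumask here; this
 * obviously only covers as many CPUs as an unsigned long has bits. */
static int format_cpu_ranges(unsigned long mask, char *buf, int len)
{
	int cpu = 0, first, n = 0;

	buf[0] = '\0';
	while (mask) {
		/* skip over clear bits */
		while (mask && !(mask & 1)) {
			mask >>= 1;
			cpu++;
		}
		if (!mask)
			break;
		/* consume the run of set bits */
		first = cpu;
		while (mask & 1) {
			mask >>= 1;
			cpu++;
		}
		if (cpu - 1 > first)
			n += snprintf(buf + n, len - n, "%s%d-%d",
				      n ? "," : "", first, cpu - 1);
		else
			n += snprintf(buf + n, len - n, "%s%d",
				      n ? "," : "", first);
	}
	return n;
}
```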

I'll let others comment on the rest.

-- Steve




* Re: [RFC] Energy/power monitoring within the kernel
  2012-10-23 17:30 [RFC] Energy/power monitoring within the kernel Pawel Moll
  2012-10-23 17:43 ` Steven Rostedt
@ 2012-10-23 18:49 ` Andy Green
  2012-10-24 16:05   ` Pawel Moll
  2012-10-23 22:02 ` Guenter Roeck
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 11+ messages in thread
From: Andy Green @ 2012-10-23 18:49 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Amit Daniel Kachhap, Zhang Rui, Viresh Kumar, Daniel Lezcano,
	Jean Delvare, Guenter Roeck, Steven Rostedt, Frederic Weisbecker,
	Ingo Molnar, Jesper Juhl, Thomas Renninger, Jean Pihet,
	linaro-dev, linux-kernel, linux-arm-kernel, lm-sensors

On 10/23/12 19:30, the mail apparently from Pawel Moll included:
> Greetings All,
>
> More and more people are getting interested in the subject of power
> (energy) consumption monitoring. We have some external tools like
> "battery simulators", energy probes etc., but some targets can measure
> their power usage on their own.
>
> Traditionally such data is exposed to the user via the hwmon sysfs
> interface, and that's exactly what I did for "my" platform - I have
> a /sys/class/hwmon/hwmon*/device/energy*_input and this was good
> enough to draw pretty graphs in userspace. Everyone was happy...
>
> Now I am getting new requests to do more with this data. In particular
> I'm asked how to add such information to ftrace/perf output. The second
> most frequent request is about providing it to an "energy aware"
> cpufreq governor.
>
> I've come up with three (non-mutually exclusive) options. I will
> appreciate any other ideas and comments (including "it makes no sense
> whatsoever" ones, with justification). Of course I am more than willing
> to spend time on prototyping anything that seems reasonable and propose
> patches.
>
>
>
> === Option 1: Trace event ===
>
> This seems to be the "cheapest" option. Simply defining a trace event
> that can be generated by a hwmon (or any other) driver makes the
> interesting data immediately available to any ftrace/perf user. Of
> course it doesn't really help with the cpufreq case, but seems to be
> a good place to start with.
>
> The question is how to define it... I've come up with two prototypes:
>
> = Generic hwmon trace event =
>
> This one allows any driver to generate a trace event whenever any
> "hwmon attribute" (measured value) gets updated. The rate at which the
> updates happen can be controlled by already existing "update_interval"
> attribute.
>
> 8<-------------------------------------------
> TRACE_EVENT(hwmon_attr_update,
> 	TP_PROTO(struct device *dev, struct attribute *attr, long long input),
> 	TP_ARGS(dev, attr, input),
>
> 	TP_STRUCT__entry(
> 		__string(       dev,		dev_name(dev))
> 		__string(	attr,		attr->name)
> 		__field(	long long,	input)
> 	),
>
> 	TP_fast_assign(
> 		__assign_str(dev, dev_name(dev));
> 		__assign_str(attr, attr->name);
> 		__entry->input = input;
> 	),
>
> 	TP_printk("%s %s %lld", __get_str(dev), __get_str(attr), __entry->input)
> );
> 8<-------------------------------------------
>
> It generates an ftrace message like this:
>
> <...>212.673126: hwmon_attr_update: hwmon4 temp1_input 34361
>
> One issue with this is that some external knowledge is required to
> relate a number to a processor core. Or maybe it's not an issue at all
> because it should be left for the user(space)?
>
> = CPU power/energy/temperature trace event =
>
> This one is designed to emphasize the relation between the measured
> value (whether it is energy, temperature or any other physical
> phenomenon, really) and CPUs, so it is quite specific (too specific?)
>
> 8<-------------------------------------------
> TRACE_EVENT(cpus_environment,
> 	TP_PROTO(const struct cpumask *cpus, long long value, char unit),
> 	TP_ARGS(cpus, value, unit),
>
> 	TP_STRUCT__entry(
> 		__array(	unsigned char,	cpus,	sizeof(struct cpumask))
> 		__field(	long long,	value)
> 		__field(	char,		unit)
> 	),
>
> 	TP_fast_assign(
> 		memcpy(__entry->cpus, cpus, sizeof(struct cpumask));
> 		__entry->value = value;
> 		__entry->unit = unit;
> 	),
>
> 	TP_printk("cpus %s %lld[%c]",
> 		__print_cpumask((struct cpumask *)__entry->cpus),
> 		__entry->value, __entry->unit)
> );
> 8<-------------------------------------------
>
> And the equivalent ftrace message is:
>
> <...>127.063107: cpus_environment: cpus 0,1,2,3 34361[C]
>
> It's a cpumask, not just single cpu id, because the sensor may measure
> the value per set of CPUs, eg. a temperature of the whole silicon die
> (so all the cores) or an energy consumed by a subset of cores (this
> is my particular use case - two meters monitor a cluster of two
> processors and a cluster of three processors, all working as an SMP
> system).
>
> Of course the cpus __array could actually be a special __cpumask field
> type (I've just hacked the __print_cpumask so far). And I've just
> realised that the unit field should actually be a string to allow unit
> prefixes to be specified (the above should obviously be "34361[mC]"
> not "[C]"). Also - excuse the "cpus_environment" name - this was the
> best I was able to come up with at the time and I'm eager to accept
> any alternative suggestions :-)

A thought on that... from an SoC perspective there are other interesting
power rails than those going just to the CPU cores. For example DDR power,
and rails involved with other IP units on the SoC such as the 3D graphics
unit. So tying one number specifically to a CPU core does not sound like
it's enough.

> === Option 2: hwmon perf PMU ===
>
> Although the trace event makes it possible to obtain interesting
> information using perf, the user wouldn't be able to treat the
> energy meter as a normal data source. In particular there would
> be no way of creating a group of events consisting eg. of a
> "normal" leader (eg. cache miss event) triggering energy meter
> read. The only way to get this done is to implement a perf PMU
> backend providing "environmental data" to the user.

In terms of something like perf top, I don't think it'll be possible to know
when to sample the acquisition hardware to tie the result to a particular line
of code, even if it had the bandwidth to do that. Power readings are likely to
lag activities on the cpu somewhat, considering sub-ns core clocks, especially
if it's actually measuring the input side of a regulator.

> = High-level hwmon API and PMU =
>
> The current hwmon subsystem does not provide any abstraction for the
> measured values and requires particular drivers to create specific
> sysfs attributes that are then used by userspace libsensors. This makes
> the framework ultimately flexible and ultimately hard to access
> from within the kernel...
>
> What could be done here is some (simple) API to register the
> measured values with the hwmon core which would result in creating
> equivalent sysfs attributes automagically, but also allow an
> in-kernel API for values enumeration and access. That way the core
> could also register a "hwmon PMU" with the perf framework providing
> data from all "compliant" drivers.
>
> = A driver-specific PMU =
>
> Of course a particular driver could register its own perf PMU on its
> own. It's certainly an option, just very suboptimal in my opinion.
> Or maybe not? Maybe the task is so specialized that it makes sense?
>
>
>
> === Option 3: CPU power(energy) monitoring framework ===
>
> And last but not least, maybe the problem deserves some dedicated
> API? Something that would take providers and feed their data into
> interested parties, in particular a perf PMU implementation and
> cpufreq governors?
>
> Maybe it could be an extension to the thermal framework? It already
> gives some meaning to a physical phenomenon. Adding other, related ones
> like energy, and relating it to cpu cores could make some sense.

If you turn the problem upside down to solve the representation question 
first, maybe there's a way forward defining the "power tree" in terms of 
regulators, and then adding something in struct regulator that spams 
readers with timestamped results if the regulator has a power monitoring 
capability.

Then you can map the regulators in the power tree to real devices by the 
names or the supply stuff.  Just a thought.

-Andy

-- 
Andy Green | TI Landing Team Leader
Linaro.org │ Open source software for ARM SoCs | Follow Linaro
http://facebook.com/pages/Linaro/155974581091106  - 
http://twitter.com/#!/linaroorg - http://linaro.org/linaro-blog


* Re: [RFC] Energy/power monitoring within the kernel
  2012-10-23 17:30 [RFC] Energy/power monitoring within the kernel Pawel Moll
  2012-10-23 17:43 ` Steven Rostedt
  2012-10-23 18:49 ` Andy Green
@ 2012-10-23 22:02 ` Guenter Roeck
  2012-10-24 16:37   ` Pawel Moll
  2012-10-24  0:40 ` Thomas Renninger
  2012-10-24  0:41 ` Thomas Renninger
  4 siblings, 1 reply; 11+ messages in thread
From: Guenter Roeck @ 2012-10-23 22:02 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Amit Daniel Kachhap, Zhang Rui, Viresh Kumar, Daniel Lezcano,
	Jean Delvare, Steven Rostedt, Frederic Weisbecker, Ingo Molnar,
	Jesper Juhl, Thomas Renninger, Jean Pihet, linux-kernel,
	linux-arm-kernel, lm-sensors, linaro-dev

On Tue, Oct 23, 2012 at 06:30:49PM +0100, Pawel Moll wrote:
> Greetings All,
> 
> More and more people are getting interested in the subject of power
> (energy) consumption monitoring. We have some external tools like
> "battery simulators", energy probes etc., but some targets can measure
> their power usage on their own.
> 
> Traditionally such data is exposed to the user via the hwmon sysfs
> interface, and that's exactly what I did for "my" platform - I have
> a /sys/class/hwmon/hwmon*/device/energy*_input and this was good
> enough to draw pretty graphs in userspace. Everyone was happy...
> 
The only driver supporting "energy" output so far is ibmaem, and the reported
energy is supposed to be cumulative, as in energy = power * time. Do you mean
power, possibly?

> Now I am getting new requests to do more with this data. In particular
> I'm asked how to add such information to ftrace/perf output. The second
> most frequent request is about providing it to an "energy aware"
> cpufreq governor.
> 

Anything energy related would have to be along the lines of "do something after
a certain amount of work has been performed", which at least on the surface
does not make much sense to me, unless you mean something along the lines of a
process scheduler which schedules a process not based on time slices but based
on energy consumed, i.e. if you want to define a time slice not in milliseconds
but in Joules.
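Just to spell out the arithmetic of such a Joule-based slice (a sketch
with made-up numbers): since mJ / mW = seconds, a fixed energy budget
converts into a duration at the currently measured power draw:

```c
/* Convert an energy budget into an equivalent time slice, given the
 * current power draw. Purely illustrative; ignores overflow for very
 * large budgets. mJ / mW = s, scaled here to microseconds. */
static long long slice_us(long long energy_mj, long long power_mw)
{
	if (power_mw <= 0)
		return -1;
	return energy_mj * 1000000 / power_mw;
}
```

So a 5 mJ slice at 1 W would last 5 ms, and half as long at 2 W.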

If so, I would argue that a similar behavior could be achieved by varying the
duration of time slices with the current CPU speed, or simply by using cycle
count instead of time as the time slice parameter. Not that I am sure whether
such an approach would really be of interest to anyone.

Or do you really mean power, not energy, such as in "reduce CPU speed if its
power consumption is above X Watts"?

> I've come up with three (non-mutually exclusive) options. I will
> appreciate any other ideas and comments (including "it makes no sense
> whatsoever" ones, with justification). Of course I am more than willing
> to spend time on prototyping anything that seems reasonable and propose
> patches.
> 
> 
> 
> === Option 1: Trace event ===
> 
> This seems to be the "cheapest" option. Simply defining a trace event
> that can be generated by a hwmon (or any other) driver makes the
> interesting data immediately available to any ftrace/perf user. Of
> course it doesn't really help with the cpufreq case, but seems to be
> a good place to start with.
> 
> The question is how to define it... I've come up with two prototypes:
> 
> = Generic hwmon trace event =
> 
> This one allows any driver to generate a trace event whenever any
> "hwmon attribute" (measured value) gets updated. The rate at which the
> updates happen can be controlled by already existing "update_interval"
> attribute.
> 
> 8<-------------------------------------------
> TRACE_EVENT(hwmon_attr_update,
> 	TP_PROTO(struct device *dev, struct attribute *attr, long long input),
> 	TP_ARGS(dev, attr, input),
> 
> 	TP_STRUCT__entry(
> 		__string(       dev,		dev_name(dev))
> 		__string(	attr,		attr->name)
> 		__field(	long long,	input)
> 	),
> 
> 	TP_fast_assign(
> 		__assign_str(dev, dev_name(dev));
> 		__assign_str(attr, attr->name);
> 		__entry->input = input;
> 	),
> 
> 	TP_printk("%s %s %lld", __get_str(dev), __get_str(attr), __entry->input)
> );
> 8<-------------------------------------------
> 
> It generates an ftrace message like this:
> 
> <...>212.673126: hwmon_attr_update: hwmon4 temp1_input 34361
> 
> One issue with this is that some external knowledge is required to
> relate a number to a processor core. Or maybe it's not an issue at all
> because it should be left for the user(space)?
> 
> = CPU power/energy/temperature trace event =
> 
> This one is designed to emphasize the relation between the measured
> value (whether it is energy, temperature or any other physical
> phenomenon, really) and CPUs, so it is quite specific (too specific?)
> 
> 8<-------------------------------------------
> TRACE_EVENT(cpus_environment,
> 	TP_PROTO(const struct cpumask *cpus, long long value, char unit),
> 	TP_ARGS(cpus, value, unit),
> 
> 	TP_STRUCT__entry(
> 		__array(	unsigned char,	cpus,	sizeof(struct cpumask))
> 		__field(	long long,	value)
> 		__field(	char,		unit)
> 	),
> 
> 	TP_fast_assign(
> 		memcpy(__entry->cpus, cpus, sizeof(struct cpumask));
> 		__entry->value = value;
> 		__entry->unit = unit;
> 	),
> 
> 	TP_printk("cpus %s %lld[%c]",
> 		__print_cpumask((struct cpumask *)__entry->cpus),
> 		__entry->value, __entry->unit)
> );
> 8<-------------------------------------------
> 
> And the equivalent ftrace message is:
> 
> <...>127.063107: cpus_environment: cpus 0,1,2,3 34361[C]
> 
> It's a cpumask, not just single cpu id, because the sensor may measure
> the value per set of CPUs, eg. a temperature of the whole silicon die
> (so all the cores) or an energy consumed by a subset of cores (this
> is my particular use case - two meters monitor a cluster of two
> processors and a cluster of three processors, all working as an SMP
> system).
> 
> Of course the cpus __array could actually be a special __cpumask field
> type (I've just hacked the __print_cpumask so far). And I've just
> realised that the unit field should actually be a string to allow unit
> prefixes to be specified (the above should obviously be "34361[mC]"
> not "[C]"). Also - excuse the "cpus_environment" name - this was the
> best I was able to come up with at the time and I'm eager to accept
> any alternative suggestions :-)
> 
I am not sure how this would be expected to work. hwmon is, by its very nature,
a passive subsystem: It doesn't do anything unless data is explicitly requested
from it. It does not update an attribute unless that attribute is read.
That does not seem to fit well with the idea of tracing - which assumes
that some activity is happening, ultimately, all by itself, presumably
periodically. The idea of having a user space application read hwmon data
only for it to trigger trace events does not seem very compelling to me.

An exception is if a monitoring device supports interrupts, and if its driver
actually implements those interrupts. This is, however, not the case for most of
the current drivers (if any), mostly because interrupt support for hardware
monitoring devices is very platform dependent and thus difficult to implement.

> 
> === Option 2: hwmon perf PMU ===
> 
> Although the trace event makes it possible to obtain interesting
> information using perf, the user wouldn't be able to treat the
> energy meter as a normal data source. In particular there would
> be no way of creating a group of events consisting eg. of a
> "normal" leader (eg. cache miss event) triggering energy meter
> read. The only way to get this done is to implement a perf PMU
> backend providing "environmental data" to the user.
> 
> = High-level hwmon API and PMU =
> 
> The current hwmon subsystem does not provide any abstraction for the
> measured values and requires particular drivers to create specific
> sysfs attributes that are then used by userspace libsensors. This makes
> the framework ultimately flexible and ultimately hard to access
> from within the kernel...
> 
> What could be done here is some (simple) API to register the
> measured values with the hwmon core which would result in creating
> equivalent sysfs attributes automagically, but also allow an
> in-kernel API for values enumeration and access. That way the core
> could also register a "hwmon PMU" with the perf framework providing
> data from all "compliant" drivers.
> 
> = A driver-specific PMU =
> 
> Of course a particular driver could register its own perf PMU on its
> own. It's certainly an option, just very suboptimal in my opinion.
> Or maybe not? Maybe the task is so specialized that it makes sense?
> 
We had a couple of attempts to provide an in-kernel API. Unfortunately,
the result was, at least so far, more complexity on the driver side.
So the difficulty is really to define an API which is really simple, and does
not just complicate driver development for a (presumably) rare use case.

Guenter

> 
> 
> === Option 3: CPU power(energy) monitoring framework ===
> 
> And last but not least, maybe the problem deserves some dedicated
> API? Something that would take providers and feed their data into
> interested parties, in particular a perf PMU implementation and
> cpufreq governors?
> 
> Maybe it could be an extension to the thermal framework? It already
> gives some meaning to a physical phenomenon. Adding other, related ones
> like energy, and relating it to cpu cores could make some sense.
> 
> 
> 
> I've tried to gather all the potentially interested parties in the To:
> list, but if I missed anyone - please do let them (and/or me) know.
> 
> Best regards and thanks for participating in the discussion!
> 
> Pawel
> 
> 
> 
> 


* Re: [RFC] Energy/power monitoring within the kernel
  2012-10-23 17:30 [RFC] Energy/power monitoring within the kernel Pawel Moll
                   ` (2 preceding siblings ...)
  2012-10-23 22:02 ` Guenter Roeck
@ 2012-10-24  0:40 ` Thomas Renninger
  2012-10-24 16:51   ` Pawel Moll
  2012-10-24  0:41 ` Thomas Renninger
  4 siblings, 1 reply; 11+ messages in thread
From: Thomas Renninger @ 2012-10-24  0:40 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Amit Daniel Kachhap, Zhang Rui, Viresh Kumar, Daniel Lezcano,
	Jean Delvare, Guenter Roeck, Steven Rostedt, Frederic Weisbecker,
	Ingo Molnar, Jesper Juhl, Jean Pihet, linux-kernel,
	linux-arm-kernel, lm-sensors, linaro-dev

Hi,

On Tuesday, October 23, 2012 06:30:49 PM Pawel Moll wrote:
> Greetings All,
> 
> More and more people are getting interested in the subject of power
> (energy) consumption monitoring. We have some external tools like
> "battery simulators", energy probes etc., but some targets can measure
> their power usage on their own.
> 
> Traditionally such data is exposed to the user via the hwmon sysfs
> interface, and that's exactly what I did for "my" platform - I have
> a /sys/class/hwmon/hwmon*/device/energy*_input and this was good
> enough to draw pretty graphs in userspace. Everyone was happy...
> 
> Now I am getting new requests to do more with this data. In particular
> I'm asked how to add such information to ftrace/perf output.
Why? What is the gain?

Perf events can be triggered at any point in the kernel.
A cpufreq event is triggered when the frequency gets changed.
CPU idle events are triggered when the kernel requests to enter an idle state
or exits one.

When would you trigger a thermal or a power event?
There is the possibility of (critical) thermal limits.
But if I understand this correctly, you want this for debugging, and
I guess you already have everything interesting one can do with
temperature values:
  - read the temperature
  - draw some nice graphs from the results

Hm, I guess I know what you want to do:
In your temperature/energy graph, you want to have some dots
when relevant HW states (frequency, sleep states,  DDR power,...)
changed. Then you are able to see the effects over a timeline.

So you have to bring the existing frequency/idle perf events together
with temperature readings.

Cleanest solution could be to enhance the existing userspace apps
(pytimechart/perf timechart) and let them add another line
(temperature/energy), but the data would not come from perf, but
from sysfs/hwmon.
Not sure whether this works out with the timechart tools.
Anyway, this sounds like a userspace only problem.

   Thomas

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Energy/power monitoring within the kernel
  2012-10-23 17:43 ` Steven Rostedt
@ 2012-10-24 16:00   ` Pawel Moll
  0 siblings, 0 replies; 11+ messages in thread
From: Pawel Moll @ 2012-10-24 16:00 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Amit Daniel Kachhap, Zhang Rui, Viresh Kumar, Daniel Lezcano,
	Jean Delvare, Guenter Roeck, Frederic Weisbecker, Ingo Molnar,
	Jesper Juhl, Thomas Renninger, Jean Pihet, linux-kernel,
	linux-arm-kernel, lm-sensors, linaro-dev

On Tue, 2012-10-23 at 18:43 +0100, Steven Rostedt wrote:
> > <...>212.673126: hwmon_attr_update: hwmon4 temp1_input 34361
> > 
> > One issue with this is that some external knowledge is required to
> > relate a number to a processor core. Or maybe it's not an issue at all
> > because it should be left for the user(space)?
> 
> If the external knowledge can be characterized in a userspace tool with
> the given data here, I see no issues with this.

Ok, fine.

> > 	TP_fast_assign(
> > 		memcpy(__entry->cpus, cpus, sizeof(struct cpumask));
> 
> Copying the entire cpumask seems like overkill. Especially when you have
> 4096 CPU machines.

Uh, right. I didn't consider such a use case...

> Perhaps making a field that can be a subset of cpus may be better. That
> way we don't waste the ring buffer with lots of zeros. I'm guessing that
> it will only be a group of cpus, and not a scattered list? Of course,
> I've seen boxes where the cpu numbers went from core to core. That is,
> cpu 0 was on core 1, cpu 1 was on core 2, and then it would repeat. 
> cpu 8 was on core 1, cpu 9 was on core 2, etc.
> 
> But still, this could be compressed somehow.

Sure thing. Or I could simply use cpumask_scnprintf() at the assign
stage and keep an already-formatted string. Or, as the cpumask per
sensor would be de facto constant, I could simply keep a pointer to
it. I will keep this in mind if the event is to happen.

Thanks!

Paweł




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Energy/power monitoring within the kernel
  2012-10-23 18:49 ` Andy Green
@ 2012-10-24 16:05   ` Pawel Moll
  0 siblings, 0 replies; 11+ messages in thread
From: Pawel Moll @ 2012-10-24 16:05 UTC (permalink / raw)
  To: Andy Green
  Cc: Amit Daniel Kachhap, Zhang Rui, Viresh Kumar, Daniel Lezcano,
	Jean Delvare, Guenter Roeck, Steven Rostedt, Frederic Weisbecker,
	Ingo Molnar, Jesper Juhl, Thomas Renninger, Jean Pihet,
	linaro-dev, linux-kernel, linux-arm-kernel, lm-sensors

On Tue, 2012-10-23 at 19:49 +0100, Andy Green wrote:
> A thought on that... from an SoC perspective there are other
> interesting power rails than just those going to the CPU core.  For
> example DDR power, and rails involved with other IP units on the SoC
> such as the 3D graphics unit.  So tying one number specifically to a
> CPU core does not sound like it's enough.

I do realize this. I just didn't want to try to cover too much ground,
and a cpufreq governor would be interested in cpu-related data anyway...

> If you turn the problem upside down to solve the representation question 
> first, maybe there's a way forward defining the "power tree" in terms of 
> regulators, and then adding something in struct regulator that spams 
> readers with timestamped results if the regulator has a power monitoring 
> capability.
> 
> Then you can map the regulators in the power tree to real devices by the 
> names or the supply stuff.  Just a thought.

Hm. Interesting idea indeed - if a regulator device was able to report
the energy being produced by it (instead of looking at cumulative energy
consumed by more than one device), defining "power domains" (by adding
selected cpus as consumers) would be straightforward and cpufreq
could request the information that way.

I'll look into it, thanks!

Paweł



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Energy/power monitoring within the kernel
  2012-10-23 22:02 ` Guenter Roeck
@ 2012-10-24 16:37   ` Pawel Moll
  2012-10-24 20:01     ` Guenter Roeck
  0 siblings, 1 reply; 11+ messages in thread
From: Pawel Moll @ 2012-10-24 16:37 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Amit Daniel Kachhap, Zhang Rui, Viresh Kumar, Daniel Lezcano,
	Jean Delvare, Steven Rostedt, Frederic Weisbecker, Ingo Molnar,
	Jesper Juhl, Thomas Renninger, Jean Pihet, linux-kernel,
	linux-arm-kernel, lm-sensors, linaro-dev

On Tue, 2012-10-23 at 23:02 +0100, Guenter Roeck wrote:
> > Traditionally such data should be exposed to the user via hwmon sysfs
> > interface, and that's exactly what I did for "my" platform - I have
> > a /sys/class/hwmon/hwmon*/device/energy*_input and this was good
> > enough to draw pretty graphs in userspace. Everyone was happy...
> > 
> The only driver supporting "energy" output so far is ibmaem, and the
> reported energy is supposed to be cumulative, as in energy = power *
> time. Do you mean power, possibly?

So the vexpress would be the second one, then :-) as the energy
"monitor" on the latest tiles actually reports a 64-bit value of
microjoules consumed (or produced) since power-up.

Some of the older boards were able to report instant power, but this
metric is less useful in our case.

> > Now I am getting new requests to do more with this data. In particular
> > I'm asked how to add such information to ftrace/perf output. The second
> > most frequent request is about providing it to a "energy aware"
> > cpufreq governor.
> 
> Anything energy related would have to be along the line of "do something after a
> certain amount of work has been performed", which at least at the surface does
> not make much sense to me, unless you mean something along the line of a
> process scheduler which schedules a process not based on time slices but based
> on energy consumed, i.e. if you want to define a time slice not in milliseconds
> but in Joule.

Actually there is some research being done in this direction, but it's
way too early to draw any conclusions...

> If so, I would argue that a similar behavior could be achieved by varying the
> duration of time slices with the current CPU speed, or simply by using cycle
> count instead of time as time slice parameter. Not that I am sure if such an
> approach would really be of interest for anyone. 
> 
> Or do you really mean power, not energy, such as in "reduce CPU speed if its
> power consumption is above X Watt" ?

Uh. To be completely honest I must answer: I'm not sure how the "energy
aware" cpufreq governor is supposed to work. I have been simply asked to
provide the data in some standard way, if possible.

> I am not sure how this would be expected to work. hwmon is, by its very nature,
> a passive subsystem: It doesn't do anything unless data is explicitly requested
> from it. It does not update an attribute unless that attribute is read.
> That does not seem to fit well with the idea of tracing - which assumes
> that some activity is happening, ultimately, all by itself, presumably
> periodically. The idea to have a user space application read hwmon data only
> for it to trigger trace events does not seem to be very compelling to me.

What I had in mind was similar to what the adt7470 driver does. The
driver would automatically access the device every now and then to
update its internal state and generate the trace event on the way.
This auto-refresh "feature" is particularly appealing to me, as on some
of "my" platforms it can take up to 500 microseconds to actually get
the data. So doing this in the background (and providing users with
the last known value in the meantime) seems attractive.

> An exception is if a monitoring device supports interrupts, and if its driver
> actually implements those interrupts. This is, however, not the case for most of
> the current drivers (if any), mostly because interrupt support for hardware
> monitoring devices is very platform dependent and thus difficult to implement.

Interestingly enough, the newest version of our platform control micro
(doing the energy monitoring as well) can generate an interrupt when a
transaction is finished, so I was planning to periodically update all
sorts of values. And again, generating a trace event on this occasion
would be trivial.

> > Of course a particular driver could register its own perf PMU on its
> > own. It's certainly an option, just very suboptimal in my opinion.
> > Or maybe not? Maybe the task is so specialized that it makes sense?
> > 
> We had a couple of attempts to provide an in-kernel API. Unfortunately,
> the result was, at least so far, more complexity on the driver side.
> So the difficulty is really to define an API which is really simple, and does
> not just complicate driver development for a (presumably) rare use case.

Yes, I appreciate this. That's why this option is actually my least
favourite. Anyway, what I was thinking about was just a thin shim that
*can* be used by a driver to register some particular value with the
core (so it can be enumerated and accessed by in-kernel clients), and
the core could (or not) create a sysfs attribute for this value on
behalf of the driver. Seems lightweight enough, unless previous
experience suggests otherwise?

Cheers!

Paweł



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Energy/power monitoring within the kernel
  2012-10-24  0:40 ` Thomas Renninger
@ 2012-10-24 16:51   ` Pawel Moll
  0 siblings, 0 replies; 11+ messages in thread
From: Pawel Moll @ 2012-10-24 16:51 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Amit Daniel Kachhap, Zhang Rui, Viresh Kumar, Daniel Lezcano,
	Jean Delvare, Guenter Roeck, Steven Rostedt, Frederic Weisbecker,
	Ingo Molnar, Jesper Juhl, Jean Pihet, linux-kernel,
	linux-arm-kernel, lm-sensors, linaro-dev

On Wed, 2012-10-24 at 01:40 +0100, Thomas Renninger wrote:
> > More and more people are getting interested in the subject of power
> > (energy) consumption monitoring. We have some external tools like
> > "battery simulators", energy probes etc., but some targets can measure
> > their power usage on their own.
> > 
> > Traditionally such data should be exposed to the user via hwmon sysfs
> > interface, and that's exactly what I did for "my" platform - I have
> > a /sys/class/hwmon/hwmon*/device/energy*_input and this was good
> > enough to draw pretty graphs in userspace. Everyone was happy...
> > 
> > Now I am getting new requests to do more with this data. In particular
> > I'm asked how to add such information to ftrace/perf output.
> Why? What is the gain?
> 
> Perf events can be triggered at any point in the kernel.
> A cpufreq event is triggered when the frequency gets changed.
> CPU idle events are triggered when the kernel requests to enter an idle state
> or exits one.
> 
> When would you trigger a thermal or a power event?
> There is the possibility of (critical) thermal limits.
> But if I understand this correctly, you want this for debugging, and
> I guess you already have everything interesting one can do with
> temperature values:
>   - read the temperature
>   - draw some nice graphs from the results
> 
> Hm, I guess I know what you want to do:
> In your temperature/energy graph, you want to have some dots
> when relevant HW states (frequency, sleep states,  DDR power,...)
> changed. Then you are able to see the effects over a timeline.
> 
> So you have to bring the existing frequency/idle perf events together
> with temperature readings.
> 
> Cleanest solution could be to enhance the existing userspace apps
> (pytimechart/perf timechart) and let them add another line
> (temperature/energy), but the data would not come from perf, but
> from sysfs/hwmon.
> Not sure whether this works out with the timechart tools.
> Anyway, this sounds like a userspace only problem.

Ok, so this is actually what I'm working on right now. Not with the
standard perf tool (there are other users of that API ;-) but indeed I'm
trying to "enrich" the data stream coming from the kernel with
userspace-originated values. I am a little bit concerned about the
effect of the extra syscalls (accessing the value and gettimeofday to
generate a timestamp) at higher sampling rates, but most likely it
won't be a problem. I can report back once I know more, if this is of
interest to anyone.

Anyway, there are at least two debug/trace-related use cases that
cannot be satisfied that way (of course one could argue about their
usefulness):

1. ftrace-over-network (https://lwn.net/Articles/410200/) which is
particularly appealing for "embedded users", where there's virtually no
useful userspace available (think Android). Here a (functional) trace
event is embedded into a normal trace and available "for free" at the
host side.

2. perf groups - the general idea is that one event (let it be a cycle
counter interrupt or even a timer) triggers reads of other values (e.g.
a cache counter or - in this case - an energy counter). The aim is to
have regular "snapshots" of the system state. I'm not sure if the
standard perf tool can do this, but I do :-)

And last but not least, there are the non-debug/trace clients for
energy data, as discussed in other mails in this thread. Of course the
trace event won't really satisfy their needs either.

Thanks for your feedback!

Paweł



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] Energy/power monitoring within the kernel
  2012-10-24 16:37   ` Pawel Moll
@ 2012-10-24 20:01     ` Guenter Roeck
  0 siblings, 0 replies; 11+ messages in thread
From: Guenter Roeck @ 2012-10-24 20:01 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Amit Daniel Kachhap, Zhang Rui, Viresh Kumar, Daniel Lezcano,
	Jean Delvare, Steven Rostedt, Frederic Weisbecker, Ingo Molnar,
	Jesper Juhl, Thomas Renninger, Jean Pihet, linux-kernel,
	linux-arm-kernel, lm-sensors, linaro-dev

On Wed, Oct 24, 2012 at 05:37:27PM +0100, Pawel Moll wrote:
> On Tue, 2012-10-23 at 23:02 +0100, Guenter Roeck wrote:
> > > Traditionally such data should be exposed to the user via hwmon sysfs
> > > interface, and that's exactly what I did for "my" platform - I have
> > > a /sys/class/hwmon/hwmon*/device/energy*_input and this was good
> > > enough to draw pretty graphs in userspace. Everyone was happy...
> > > 
> > The only driver supporting "energy" output so far is ibmaem, and
> > the reported energy is supposed to be cumulative, as in energy =
> > power * time. Do you mean power, possibly?
> 
> So the vexpress would be the second one, then :-) as the energy
> "monitor" on the latest tiles actually reports a 64-bit value of
> microjoules consumed (or produced) since power-up.
> 
> Some of the older boards were able to report instant power, but this
> metric is less useful in our case.
> 
> > > Now I am getting new requests to do more with this data. In particular
> > > I'm asked how to add such information to ftrace/perf output. The second
> > > most frequent request is about providing it to a "energy aware"
> > > cpufreq governor.
> > 
> > Anything energy related would have to be along the line of "do something after a
> > certain amount of work has been performed", which at least at the surface does
> > not make much sense to me, unless you mean something along the line of a
> > process scheduler which schedules a process not based on time slices but based
> > on energy consumed, i.e. if you want to define a time slice not in milliseconds
> > but in Joule.
> 
> Actually there is some research being done in this direction, but it's
> way too early to draw any conclusions...
> 
> > If so, I would argue that a similar behavior could be achieved by varying the
> > duration of time slices with the current CPU speed, or simply by using cycle
> > count instead of time as time slice parameter. Not that I am sure if such an
> > approach would really be of interest for anyone. 
> > 
> > Or do you really mean power, not energy, such as in "reduce CPU speed if its
> > power consumption is above X Watt" ?
> 
> Uh. To be completely honest I must answer: I'm not sure how the "energy
> aware" cpufreq governor is supposed to work. I have been simply asked to
> provide the data in some standard way, if possible.
> 
> > I am not sure how this would be expected to work. hwmon is, by its very nature,
> > a passive subsystem: It doesn't do anything unless data is explicitly requested
> > from it. It does not update an attribute unless that attribute is read.
> > That does not seem to fit well with the idea of tracing - which assumes
> > that some activity is happening, ultimately, all by itself, presumably
> > periodically. The idea to have a user space application read hwmon data only
> > for it to trigger trace events does not seem to be very compelling to me.
> 
> What I had in mind was similar to what the adt7470 driver does. The
> driver would automatically access the device every now and then to
> update its internal state and generate the trace event on the way.
> This auto-refresh "feature" is particularly appealing to me, as on
> some of "my" platforms it can take up to 500 microseconds to actually
> get the data. So doing this in the background (and providing users
> with the last known value in the meantime) seems attractive.
> 
A bad example doesn't mean it should be used elsewhere.

adt7470 needs up to two seconds for a temperature measurement cycle, and it
cannot perform automatic cycles all by itself. In this context, executing
temperature measurement cycles in the background makes a lot of sense,
especially since one does not want to wait for two seconds when reading
a sysfs attribute.

But that only means that the chip is most likely not a good choice when selecting
a temperature sensor, not that the code necessary to get it working should be used
as an example for other drivers. 

Guenter

> > An exception is if a monitoring device supports interrupts, and if its driver
> > actually implements those interrupts. This is, however, not the case for most of
> > the current drivers (if any), mostly because interrupt support for hardware
> > monitoring devices is very platform dependent and thus difficult to implement.
> 
> Interestingly enough, the newest version of our platform control
> micro (doing the energy monitoring as well) can generate an interrupt
> when a transaction is finished, so I was planning to periodically
> update all sorts of values. And again, generating a trace event on
> this occasion would be trivial.
> 
> > > Of course a particular driver could register its own perf PMU on its
> > > own. It's certainly an option, just very suboptimal in my opinion.
> > > Or maybe not? Maybe the task is so specialized that it makes sense?
> > > 
> > We had a couple of attempts to provide an in-kernel API. Unfortunately,
> > the result was, at least so far, more complexity on the driver side.
> > So the difficulty is really to define an API which is really simple, and does
> > not just complicate driver development for a (presumably) rare use case.
> 
> Yes, I appreciate this. That's why this option is actually my least
> favourite. Anyway, what I was thinking about was just a thin shim that
> *can* be used by a driver to register some particular value with the
> core (so it can be enumerated and accessed by in-kernel clients), and
> the core could (or not) create a sysfs attribute for this value on
> behalf of the driver. Seems lightweight enough, unless previous
> experience suggests otherwise?
> 
> Cheers!
> 
> Paweł
> 
> 
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2012-10-24 20:00 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-23 17:30 [RFC] Energy/power monitoring within the kernel Pawel Moll
2012-10-23 17:43 ` Steven Rostedt
2012-10-24 16:00   ` Pawel Moll
2012-10-23 18:49 ` Andy Green
2012-10-24 16:05   ` Pawel Moll
2012-10-23 22:02 ` Guenter Roeck
2012-10-24 16:37   ` Pawel Moll
2012-10-24 20:01     ` Guenter Roeck
2012-10-24  0:40 ` Thomas Renninger
2012-10-24 16:51   ` Pawel Moll
2012-10-24  0:41 ` Thomas Renninger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).