Subject: Re: [RFC] Energy/power monitoring within the kernel
From: Steven Rostedt
To: Pawel Moll
Cc: Amit Daniel Kachhap, Zhang Rui, Viresh Kumar, Daniel Lezcano,
 Jean Delvare, Guenter Roeck, Frederic Weisbecker, Ingo Molnar,
 Jesper Juhl, Thomas Renninger, Jean Pihet,
 linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
 lm-sensors@lm-sensors.org, linaro-dev@lists.linaro.org
Date: Tue, 23 Oct 2012 13:43:07 -0400
Message-ID: <1351014187.8467.24.camel@gandalf.local.home>
In-Reply-To: <1351013449.9070.5.camel@hornet>

On Tue, 2012-10-23 at 18:30 +0100, Pawel Moll wrote:
>
> === Option 1: Trace event ===
>
> This seems to be the "cheapest" option. Simply defining a trace event
> that can be generated by a hwmon (or any other) driver makes the
> interesting data immediately available to any ftrace/perf user. Of
> course it doesn't really help with the cpufreq case, but it seems to
> be a good place to start.
>
> The question is how to define it... I've come up with two prototypes:
>
> = Generic hwmon trace event =
>
> This one allows any driver to generate a trace event whenever any
> "hwmon attribute" (measured value) gets updated. The rate at which
> the updates happen can be controlled by the already existing
> "update_interval" attribute.
>
> 8<-------------------------------------------
> TRACE_EVENT(hwmon_attr_update,
> 	TP_PROTO(struct device *dev, struct attribute *attr, long long input),
> 	TP_ARGS(dev, attr, input),
>
> 	TP_STRUCT__entry(
> 		__string(	dev,		dev_name(dev))
> 		__string(	attr,		attr->name)
> 		__field(	long long,	input)
> 	),
>
> 	TP_fast_assign(
> 		__assign_str(dev, dev_name(dev));
> 		__assign_str(attr, attr->name);
> 		__entry->input = input;
> 	),
>
> 	TP_printk("%s %s %lld", __get_str(dev), __get_str(attr), __entry->input)
> );
> 8<-------------------------------------------
>
> It generates an ftrace message like this:
>
> <...>212.673126: hwmon_attr_update: hwmon4 temp1_input 34361
>
> One issue with this is that some external knowledge is required to
> relate a number to a processor core. Or maybe it's not an issue at
> all, because it should be left to the user(space)?

If the external knowledge can be characterized in a userspace tool
with the given data here, I see no issues with this.
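
For what it's worth, the driver side of this stays trivial - the event
becomes a single function call in the sensor update path. A rough
sketch (hypothetical driver and names, and assuming the event
declaration ends up in include/trace/events/hwmon.h):

8<-------------------------------------------
#include <linux/device.h>

#define CREATE_TRACE_POINTS	/* in exactly one .c file */
#include <trace/events/hwmon.h>

/*
 * Hypothetical driver update path: fire the event for the sysfs
 * attribute that was just refreshed, with the value in hwmon's
 * canonical units (e.g. millidegrees C for temp*_input).
 */
static void foo_update_temp(struct device *dev,
			    struct device_attribute *dattr,
			    long long millicelsius)
{
	trace_hwmon_attr_update(dev, &dattr->attr, millicelsius);
}
8<-------------------------------------------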

> = CPU power/energy/temperature trace event =
>
> This one is designed to emphasize the relation between the measured
> value (whether it is energy, temperature or any other physical
> phenomenon, really) and CPUs, so it is quite specific (too specific?).
>
> 8<-------------------------------------------
> TRACE_EVENT(cpus_environment,
> 	TP_PROTO(const struct cpumask *cpus, long long value, char unit),
> 	TP_ARGS(cpus, value, unit),
>
> 	TP_STRUCT__entry(
> 		__array(	unsigned char,	cpus,	sizeof(struct cpumask))
> 		__field(	long long,	value)
> 		__field(	char,		unit)
> 	),
>
> 	TP_fast_assign(
> 		memcpy(__entry->cpus, cpus, sizeof(struct cpumask));

Copying the entire cpumask seems like overkill, especially when you
have 4096-CPU machines.

> 		__entry->value = value;
> 		__entry->unit = unit;
> 	),
>
> 	TP_printk("cpus %s %lld[%c]",
> 		__print_cpumask((struct cpumask *)__entry->cpus),
> 		__entry->value, __entry->unit)
> );
> 8<-------------------------------------------
>
> And the equivalent ftrace message is:
>
> <...>127.063107: cpus_environment: cpus 0,1,2,3 34361[C]
>
> It's a cpumask, not just a single cpu id, because the sensor may
> measure the value per set of CPUs, e.g. the temperature of the whole
> silicon die (so all the cores) or the energy consumed by a subset of
> cores (this is my particular use case - two meters monitor a cluster
> of two processors and a cluster of three processors, all working as
> an SMP system).
>
> Of course the cpus __array could actually be a special __cpumask
> field type (I've just hacked __print_cpumask so far). And I've just
> realised that the unit field should actually be a string, to allow
> unit prefixes to be specified (the above should obviously be
> "34361[mC]", not "[C]"). Also - excuse the "cpus_environment" name -
> this was the best I was able to come up with at the time, and I'm
> eager to accept any alternative suggestions :-)

Perhaps making a field that can hold just a subset of cpus would be
better. That way we don't waste the ring buffer with lots of zeros.
I'm guessing that it will only be a group of cpus, and not a
scattered list?

Of course, I've seen boxes where the cpu numbers went from core to
core. That is, cpu 0 was on core 1, cpu 1 was on core 2, and then it
would repeat: cpu 8 was on core 1, cpu 9 was on core 2, etc. But
still, this could be compressed somehow - see the sketch below.
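
For example, a dynamic array sized by nr_cpu_ids (instead of the full
NR_CPUS-sized struct cpumask) would only record the words that can
actually contain set bits. A rough, untested sketch, still using your
__print_cpumask hack:

8<-------------------------------------------
TRACE_EVENT(cpus_environment,
	TP_PROTO(const struct cpumask *cpus, long long value, char unit),
	TP_ARGS(cpus, value, unit),

	TP_STRUCT__entry(
		/* Untested sketch: store only the longs covering the
		 * possible cpus, not all NR_CPUS bits. */
		__dynamic_array(unsigned long, cpus,
				BITS_TO_LONGS(nr_cpu_ids))
		__field(	long long,	value)
		__field(	char,		unit)
	),

	TP_fast_assign(
		memcpy(__get_dynamic_array(cpus), cpumask_bits(cpus),
		       BITS_TO_LONGS(nr_cpu_ids) * sizeof(unsigned long));
		__entry->value = value;
		__entry->unit = unit;
	),

	TP_printk("cpus %s %lld[%c]",
		__print_cpumask((struct cpumask *)__get_dynamic_array(cpus)),
		__entry->value, __entry->unit)
);
8<-------------------------------------------

On a NR_CPUS=4096 kernel booted on a four-cpu board, that's a single
long in the ring buffer instead of 512 bytes per event. It doesn't
help the scattered-numbering case, but it gets rid of the constant
per-event overhead.

I'll let others comment on the rest.

-- Steve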