* [PATCH v7 0/4] perf/x86: add Intel RAPL PMU support
@ 2013-11-12 16:58 Stephane Eranian
  2013-11-12 16:58 ` [PATCH v7 1/4] perf: add active_entry list head to struct perf_event Stephane Eranian
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Stephane Eranian @ 2013-11-12 16:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, ak, acme, jolsa, zheng.z.yan, bp, maria.n.dimakopoulou

This patch adds a new uncore PMU to expose the Intel
RAPL (Running Average Power Limit) energy consumption counters.
Up to 3 counters, each counting a single RAPL event, are exposed.

The RAPL counters are available on Intel SandyBridge, IvyBridge,
and Haswell. The server processors add a 3rd counter to measure
DRAM power consumption.

The following events are available and exposed in sysfs:
- power/energy-cores: power consumption of all cores on the socket
- power/energy-pkg  : power consumption of all cores + LLC cache
- power/energy-dram : power consumption of the DRAM (servers only)

The RAPL PMU is uncore by nature and is implemented such
that it only works in system-wide mode. Measuring only
one CPU per socket is sufficient. 

The counters all count in the same unit. The perf_events API
exposes all RAPL counters as 64-bit integers counting in units
of 1/2^32 Joules (about 0.23 nJ). User-level tools must convert
the counts by multiplying them by the scaling factor exposed
in the corresponding event's .scale file in sysfs to obtain a
value expressed in Joules. The reason for this approach is that
the kernel avoids floating-point math whenever possible because
it is expensive (user floating-point state must be saved). The
method used avoids kernel floating-point and incurs no precision
loss. Thanks to PeterZ for suggesting this approach.

To convert a raw count C to Watts: W = C * 2^-32 / time, i.e.,
about C * 2.33e-10 / time, with time in seconds.
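
A minimal user-space sketch of this conversion (hypothetical helper,
not part of this series; it assumes the sysfs layout shown below):

#include <math.h>
#include <stdio.h>

/* convert a raw RAPL count to Joules using the event's .scale file */
static double rapl_count_to_joules(unsigned long long count)
{
	double scale;
	FILE *f = fopen("/sys/devices/power/events/energy-cores.scale", "r");

	if (f && fscanf(f, "%lf", &scale) == 1) {
		fclose(f);
		return count * scale;		/* scale is 2^-32 */
	}
	if (f)
		fclose(f);
	return ldexp((double)count, -32);	/* same value, direct */
}

Average power in Watts is then the Joules value divided by the
elapsed time in seconds.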

The kernel exposes both the scaling factor and the unit (Joules)
in sysfs:
$ ls -1 /sys/devices/power/events/energy-*
/sys/devices/power/events/energy-cores
/sys/devices/power/events/energy-cores.scale
/sys/devices/power/events/energy-cores.unit
/sys/devices/power/events/energy-pkg
/sys/devices/power/events/energy-pkg.scale
/sys/devices/power/events/energy-pkg.unit

$ cat /sys/devices/power/events/energy-cores.scale
2.3283064365386962890625e-10

$ cat /sys/devices/power/events/energy-cores.unit
Joules

RAPL PMU is a new standalone PMU which registers with the
perf_event core subsystem. The PMU type (attr->type) is
dynamically allocated and is available from /sys/devices/power/type.
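
A consumer would read that file and pass the value in attr->type to
perf_event_open(). A minimal sketch (hypothetical, error handling
trimmed; the event encoding comes from the sysfs events/format files):

#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* open power/energy-cores (event=0x01) on one CPU, system-wide */
static int open_rapl_cores(int cpu)
{
	struct perf_event_attr attr;
	FILE *f = fopen("/sys/devices/power/type", "r");
	int type;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &type) != 1) {
		fclose(f);
		return -1;
	}
	fclose(f);

	memset(&attr, 0, sizeof(attr));
	attr.type = type;
	attr.size = sizeof(attr);
	attr.config = 0x1;	/* energy-cores pseudo-encoding */

	/* pid = -1: RAPL only supports system-wide (per-cpu) counting */
	return syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
}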

Sampling is not supported by the RAPL PMU. There is no
privilege level filtering either.

The PMU exports a cpumask in /sys/devices/power/cpumask. It
is used by perf to ensure only one instance of each RAPL event
is measured per processor socket. CPU hotplug is also supported.
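
On a hypothetical 2-socket system, the mask lists one CPU per socket,
for example:

$ cat /sys/devices/power/cpumask
0,8

(the actual CPU numbers depend on the machine's topology).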

The perf stat infrastructure is enhanced to show event
units. It also applies the scaling factor. As such, perf stat prints
RAPL events in Joules (and not in increments of 0.23 nJ):

 # perf stat -a -e power/energy-pkg/,power/energy-cores/,cycles -I 1000 sleep 1000
 #          time             counts   unit events
     1.000282860               2.51 Joules power/energy-pkg/
     1.000282860               0.31 Joules power/energy-cores/
     1.000282860           37765378        cycles

The patch adds a hrtimer to poll the counters, given that
they do not interrupt on overflow. The hardware counters are
32 bits wide.

In v2, we add the locking necessary to protect the rapl_pmu
struct. We also add a description at the top of the file.
We now check that the processor is an Intel processor. We improved
the data layout of the rapl_pmu struct. We also lifted the
restriction on the number of instances of RAPL counters that can
be active at the same time. The RAPL counters are free-running, so
events can be measured as many times as necessary in parallel
by multiple tools. There is never multiplexing among RAPL events.

In v3, we renamed the events to the more generic power/* instead
of rapl/*. We modified perf stat to print the events with their
units and scaling factors.

In v4, we integrate the feedback from Jiri and rebase to 3.12-rc7+
from tip.git.

In v5, we export the full scaling factor to increase precision.
In the perf tool, we changed the way the .unit and .scale sysfs
entries are parsed. Thanks to Jiri for his contribution on this.
We also fixed a couple of printf() issues with perf stat and units.
Now, no unit symbol is printed when the event has no unit (it was
? before). The patch is relative to 3.12 from tip.git.

In v6, we fixed a few issues in the perf tool having to do with
printing of the unit; it now works for uncore events. There is a
major restructuring of the kernel code because the hrtimer needs to
be per-cpu rather than shared per socket, both because there can be
multiple sessions in parallel and because of CPU hotplug. The CPU
hotplug code was also updated.

In v7, we rebased to 3.12+ and dropped the CPU hotplug lock because
it was not needed. CPU hotplug is still serialized.

Thanks to all contributors to this patch series: PeterZ, Jiri, Maria,
Arnaldo, Andi, Ingo.

Supported CPUs: SandyBridge, IvyBridge, Haswell.

Signed-off-by: Stephane Eranian <eranian@google.com>

Stephane Eranian (4):
  perf: add active_entry list head to struct perf_event
  perf stat: add event unit and scale support
  perf,x86: add Intel RAPL PMU support
  perf,x86: add RAPL hrtimer support

 arch/x86/kernel/cpu/Makefile                |    2 +-
 arch/x86/kernel/cpu/perf_event_intel_rapl.c |  661 +++++++++++++++++++++++++++
 include/linux/perf_event.h                  |    5 +-
 kernel/events/core.c                        |    1 +
 tools/perf/builtin-stat.c                   |  114 +++--
 tools/perf/util/evsel.c                     |    2 +
 tools/perf/util/evsel.h                     |    3 +
 tools/perf/util/parse-events.c              |   28 +-
 tools/perf/util/pmu.c                       |  138 +++++-
 tools/perf/util/pmu.h                       |    3 +-
 10 files changed, 911 insertions(+), 46 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/perf_event_intel_rapl.c

-- 
1.7.9.5



* [PATCH v7 1/4] perf: add active_entry list head to struct perf_event
  2013-11-12 16:58 [PATCH v7 0/4] perf/x86: add Intel RAPL PMU support Stephane Eranian
@ 2013-11-12 16:58 ` Stephane Eranian
  2013-11-27 14:07   ` [tip:perf/core] perf: Add " tip-bot for Stephane Eranian
  2013-11-12 16:58 ` [PATCH v7 2/4] perf stat: add event unit and scale support Stephane Eranian
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Stephane Eranian @ 2013-11-12 16:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, ak, acme, jolsa, zheng.z.yan, bp, maria.n.dimakopoulou

This patch adds a new field to struct perf_event.
It is intended to be used to chain events which are
active (enabled). It helps the hardware layer for
PMUs which do not have actual counter restrictions, i.e.,
free-running read-only counters. Active events are chained
as opposed to being tracked via the counter they use.

To save space we use a union with hlist_entry as both
are mutually exclusive (suggested by Jiri Olsa).

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 include/linux/perf_event.h |    5 ++++-
 kernel/events/core.c       |    1 +
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2e069d1..8f4a70f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -319,7 +319,10 @@ struct perf_event {
 	 */
 	struct list_head		migrate_entry;
 
-	struct hlist_node		hlist_entry;
+	union {
+		struct hlist_node	hlist_entry;
+		struct list_head	active_entry;
+	};
 	int				nr_siblings;
 	int				group_flags;
 	struct perf_event		*group_leader;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4dc078d..90dca5c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6663,6 +6663,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	INIT_LIST_HEAD(&event->event_entry);
 	INIT_LIST_HEAD(&event->sibling_list);
 	INIT_LIST_HEAD(&event->rb_entry);
+	INIT_LIST_HEAD(&event->active_entry);
 
 	init_waitqueue_head(&event->waitq);
 	init_irq_work(&event->pending, perf_pending_event);
-- 
1.7.9.5



* [PATCH v7 2/4] perf stat: add event unit and scale support
  2013-11-12 16:58 [PATCH v7 0/4] perf/x86: add Intel RAPL PMU support Stephane Eranian
  2013-11-12 16:58 ` [PATCH v7 1/4] perf: add active_entry list head to struct perf_event Stephane Eranian
@ 2013-11-12 16:58 ` Stephane Eranian
  2013-11-27 14:08   ` [tip:perf/core] tools/perf/stat: Add " tip-bot for Stephane Eranian
  2013-11-12 16:58 ` [PATCH v7 3/4] perf,x86: add Intel RAPL PMU support Stephane Eranian
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Stephane Eranian @ 2013-11-12 16:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, ak, acme, jolsa, zheng.z.yan, bp, maria.n.dimakopoulou

This patch adds perf stat support for handling event units and
scales as exported by the kernel.

The kernel can export a PMU event's actual unit and scaling factor
via sysfs:
$ ls -1 /sys/devices/power/events/energy-*
/sys/devices/power/events/energy-cores
/sys/devices/power/events/energy-cores.scale
/sys/devices/power/events/energy-cores.unit
/sys/devices/power/events/energy-pkg
/sys/devices/power/events/energy-pkg.scale
/sys/devices/power/events/energy-pkg.unit
$ cat /sys/devices/power/events/energy-cores.scale
2.3283064365386962890625e-10
$ cat /sys/devices/power/events/energy-cores.unit
Joules

This patch modifies the pmu event alias code to check
for the presence of the .unit and .scale files to load
the corresponding values. They are then used by perf stat
transparently:

 # perf stat -a -e power/energy-pkg/,power/energy-cores/,cycles -I 1000 sleep 1000
 #          time             counts   unit events
     1.000214717               3.07 Joules power/energy-pkg/         [100.00%]
     1.000214717               0.53 Joules power/energy-cores/
     1.000214717           12965028        cycles                    [100.00%]
     2.000749289               3.01 Joules power/energy-pkg/
     2.000749289               0.52 Joules power/energy-cores/
     2.000749289           15817043        cycles

When the event does not have an explicit unit exported by
the kernel, nothing is printed. In csv output mode, there
will be an empty field.
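
For example, a CSV line for a unit-bearing event would look like
(hypothetical values):

 1.000214717,3.07,Joules,power/energy-pkg/

while a unit-less event such as cycles leaves the unit field empty:

 1.000214717,12965028,,cycles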

Special thanks to Jiri for providing the supporting code
in the parser to trigger reading of the scale and unit files.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
---
 tools/perf/builtin-stat.c      |  114 ++++++++++++++++++++++++---------
 tools/perf/util/evsel.c        |    2 +
 tools/perf/util/evsel.h        |    3 +
 tools/perf/util/parse-events.c |   28 +++++---
 tools/perf/util/pmu.c          |  138 ++++++++++++++++++++++++++++++++++++++--
 tools/perf/util/pmu.h          |    3 +-
 6 files changed, 244 insertions(+), 44 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 0fc1c94..66d20c9 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -138,6 +138,7 @@ static const char		*post_cmd			= NULL;
 static bool			sync_run			= false;
 static unsigned int		interval			= 0;
 static unsigned int		initial_delay			= 0;
+static unsigned int		unit_width			= 4; /* strlen("unit") */
 static bool			forever				= false;
 static struct timespec		ref_time;
 static struct cpu_map		*aggr_map;
@@ -462,17 +463,17 @@ static void print_interval(void)
 	if (num_print_interval == 0 && !csv_output) {
 		switch (aggr_mode) {
 		case AGGR_SOCKET:
-			fprintf(output, "#           time socket cpus             counts events\n");
+			fprintf(output, "#           time socket cpus             counts %*s events\n", unit_width, "unit");
 			break;
 		case AGGR_CORE:
-			fprintf(output, "#           time core         cpus             counts events\n");
+			fprintf(output, "#           time core         cpus             counts %*s events\n", unit_width, "unit");
 			break;
 		case AGGR_NONE:
-			fprintf(output, "#           time CPU                 counts events\n");
+			fprintf(output, "#           time CPU                counts %*s events\n", unit_width, "unit");
 			break;
 		case AGGR_GLOBAL:
 		default:
-			fprintf(output, "#           time             counts events\n");
+			fprintf(output, "#           time             counts %*s events\n", unit_width, "unit");
 		}
 	}
 
@@ -517,6 +518,7 @@ static int __run_perf_stat(int argc, const char **argv)
 	unsigned long long t0, t1;
 	struct perf_evsel *counter;
 	struct timespec ts;
+	size_t l;
 	int status = 0;
 	const bool forks = (argc > 0);
 
@@ -566,6 +568,10 @@ static int __run_perf_stat(int argc, const char **argv)
 			return -1;
 		}
 		counter->supported = true;
+
+		l = strlen(counter->unit);
+		if (l > unit_width)
+			unit_width = l;
 	}
 
 	if (perf_evlist__apply_filters(evsel_list)) {
@@ -705,14 +711,25 @@ static void aggr_printout(struct perf_evsel *evsel, int id, int nr)
 static void nsec_printout(int cpu, int nr, struct perf_evsel *evsel, double avg)
 {
 	double msecs = avg / 1e6;
-	const char *fmt = csv_output ? "%.6f%s%s" : "%18.6f%s%-25s";
+	const char *fmt_v, *fmt_n;
 	char name[25];
 
+	fmt_v = csv_output ? "%.6f%s" : "%18.6f%s";
+	fmt_n = csv_output ? "%s" : "%-25s";
+
 	aggr_printout(evsel, cpu, nr);
 
 	scnprintf(name, sizeof(name), "%s%s",
 		  perf_evsel__name(evsel), csv_output ? "" : " (msec)");
-	fprintf(output, fmt, msecs, csv_sep, name);
+
+	fprintf(output, fmt_v, msecs, csv_sep);
+
+	if (csv_output)
+		fprintf(output, "%s%s", evsel->unit, csv_sep);
+	else
+		fprintf(output, "%-*s%s", unit_width, evsel->unit, csv_sep);
+
+	fprintf(output, fmt_n, name);
 
 	if (evsel->cgrp)
 		fprintf(output, "%s%s", csv_sep, evsel->cgrp->name);
@@ -909,21 +926,31 @@ static void print_ll_cache_misses(int cpu,
 static void abs_printout(int cpu, int nr, struct perf_evsel *evsel, double avg)
 {
 	double total, ratio = 0.0, total2;
+	double sc =  evsel->scale;
 	const char *fmt;
 
-	if (csv_output)
-		fmt = "%.0f%s%s";
-	else if (big_num)
-		fmt = "%'18.0f%s%-25s";
-	else
-		fmt = "%18.0f%s%-25s";
+	if (csv_output) {
+		fmt = sc != 1.0 ?  "%.2f%s" : "%.0f%s";
+	} else {
+		if (big_num)
+			fmt = sc != 1.0 ? "%'18.2f%s" : "%'18.0f%s";
+		else
+			fmt = sc != 1.0 ? "%18.2f%s" : "%18.0f%s";
+	}
 
 	aggr_printout(evsel, cpu, nr);
 
 	if (aggr_mode == AGGR_GLOBAL)
 		cpu = 0;
 
-	fprintf(output, fmt, avg, csv_sep, perf_evsel__name(evsel));
+	fprintf(output, fmt, avg, csv_sep);
+
+	if (evsel->unit)
+		fprintf(output, "%-*s%s",
+			csv_output ? 0 : unit_width,
+			evsel->unit, csv_sep);
+
+	fprintf(output, "%-*s", csv_output ? 0 : 25, perf_evsel__name(evsel));
 
 	if (evsel->cgrp)
 		fprintf(output, "%s%s", csv_sep, evsel->cgrp->name);
@@ -942,7 +969,10 @@ static void abs_printout(int cpu, int nr, struct perf_evsel *evsel, double avg)
 
 		if (total && avg) {
 			ratio = total / avg;
-			fprintf(output, "\n                                             #   %5.2f  stalled cycles per insn", ratio);
+			fprintf(output, "\n");
+			if (aggr_mode == AGGR_NONE)
+				fprintf(output, "        ");
+			fprintf(output, "                                                  #   %5.2f  stalled cycles per insn", ratio);
 		}
 
 	} else if (perf_evsel__match(evsel, HARDWARE, HW_BRANCH_MISSES) &&
@@ -1062,6 +1092,7 @@ static void print_aggr(char *prefix)
 {
 	struct perf_evsel *counter;
 	int cpu, cpu2, s, s2, id, nr;
+	double uval;
 	u64 ena, run, val;
 
 	if (!(aggr_map || aggr_get_id))
@@ -1088,11 +1119,17 @@ static void print_aggr(char *prefix)
 			if (run == 0 || ena == 0) {
 				aggr_printout(counter, id, nr);
 
-				fprintf(output, "%*s%s%*s",
+				fprintf(output, "%*s%s",
 					csv_output ? 0 : 18,
 					counter->supported ? CNTR_NOT_COUNTED : CNTR_NOT_SUPPORTED,
-					csv_sep,
-					csv_output ? 0 : -24,
+					csv_sep);
+
+				fprintf(output, "%-*s%s",
+					csv_output ? 0 : unit_width,
+					counter->unit, csv_sep);
+
+				fprintf(output, "%*s",
+					csv_output ? 0 : -25,
 					perf_evsel__name(counter));
 
 				if (counter->cgrp)
@@ -1102,11 +1139,12 @@ static void print_aggr(char *prefix)
 				fputc('\n', output);
 				continue;
 			}
+			uval = val * counter->scale;
 
 			if (nsec_counter(counter))
-				nsec_printout(id, nr, counter, val);
+				nsec_printout(id, nr, counter, uval);
 			else
-				abs_printout(id, nr, counter, val);
+				abs_printout(id, nr, counter, uval);
 
 			if (!csv_output) {
 				print_noise(counter, 1.0);
@@ -1129,16 +1167,21 @@ static void print_counter_aggr(struct perf_evsel *counter, char *prefix)
 	struct perf_stat *ps = counter->priv;
 	double avg = avg_stats(&ps->res_stats[0]);
 	int scaled = counter->counts->scaled;
+	double uval;
 
 	if (prefix)
 		fprintf(output, "%s", prefix);
 
 	if (scaled == -1) {
-		fprintf(output, "%*s%s%*s",
+		fprintf(output, "%*s%s",
 			csv_output ? 0 : 18,
 			counter->supported ? CNTR_NOT_COUNTED : CNTR_NOT_SUPPORTED,
-			csv_sep,
-			csv_output ? 0 : -24,
+			csv_sep);
+		fprintf(output, "%-*s%s",
+			csv_output ? 0 : unit_width,
+			counter->unit, csv_sep);
+		fprintf(output, "%*s",
+			csv_output ? 0 : -25,
 			perf_evsel__name(counter));
 
 		if (counter->cgrp)
@@ -1148,10 +1191,12 @@ static void print_counter_aggr(struct perf_evsel *counter, char *prefix)
 		return;
 	}
 
+	uval = avg * counter->scale;
+
 	if (nsec_counter(counter))
-		nsec_printout(-1, 0, counter, avg);
+		nsec_printout(-1, 0, counter, uval);
 	else
-		abs_printout(-1, 0, counter, avg);
+		abs_printout(-1, 0, counter, uval);
 
 	print_noise(counter, avg);
 
@@ -1178,6 +1223,7 @@ static void print_counter_aggr(struct perf_evsel *counter, char *prefix)
 static void print_counter(struct perf_evsel *counter, char *prefix)
 {
 	u64 ena, run, val;
+	double uval;
 	int cpu;
 
 	for (cpu = 0; cpu < perf_evsel__nr_cpus(counter); cpu++) {
@@ -1189,14 +1235,20 @@ static void print_counter(struct perf_evsel *counter, char *prefix)
 			fprintf(output, "%s", prefix);
 
 		if (run == 0 || ena == 0) {
-			fprintf(output, "CPU%*d%s%*s%s%*s",
+			fprintf(output, "CPU%*d%s%*s%s",
 				csv_output ? 0 : -4,
 				perf_evsel__cpus(counter)->map[cpu], csv_sep,
 				csv_output ? 0 : 18,
 				counter->supported ? CNTR_NOT_COUNTED : CNTR_NOT_SUPPORTED,
-				csv_sep,
-				csv_output ? 0 : -24,
-				perf_evsel__name(counter));
+				csv_sep);
+
+			fprintf(output, "%-*s%s",
+				csv_output ? 0 : unit_width,
+				counter->unit, csv_sep);
+
+			fprintf(output, "%*s",
+				csv_output ? 0 : -25,
+				perf_evsel__name(counter));
 
 			if (counter->cgrp)
 				fprintf(output, "%s%s",
@@ -1206,10 +1258,12 @@ static void print_counter(struct perf_evsel *counter, char *prefix)
 			continue;
 		}
 
+		uval = val * counter->scale;
+
 		if (nsec_counter(counter))
-			nsec_printout(cpu, 0, counter, val);
+			nsec_printout(cpu, 0, counter, uval);
 		else
-			abs_printout(cpu, 0, counter, val);
+			abs_printout(cpu, 0, counter, uval);
 
 		if (!csv_output) {
 			print_noise(counter, 1.0);
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 5280820..f1e3d02 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -162,6 +162,8 @@ void perf_evsel__init(struct perf_evsel *evsel,
 	evsel->idx	   = idx;
 	evsel->attr	   = *attr;
 	evsel->leader	   = evsel;
+	evsel->unit	   = "";
+	evsel->scale	   = 1.0;
 	INIT_LIST_HEAD(&evsel->node);
 	hists__init(&evsel->hists);
 	evsel->sample_size = __perf_evsel__sample_size(attr->sample_type);
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 64ec8e1..df2ea2b 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -68,6 +68,8 @@ struct perf_evsel {
 	u32			ids;
 	struct hists		hists;
 	char			*name;
+	double			scale;
+	const char		*unit;
 	struct event_format	*tp_format;
 	union {
 		void		*priv;
@@ -127,6 +129,7 @@ extern const char *perf_evsel__sw_names[PERF_COUNT_SW_MAX];
 int __perf_evsel__hw_cache_type_op_res_name(u8 type, u8 op, u8 result,
 					    char *bf, size_t size);
 const char *perf_evsel__name(struct perf_evsel *evsel);
+
 const char *perf_evsel__group_name(struct perf_evsel *evsel);
 int perf_evsel__group_desc(struct perf_evsel *evsel, char *buf, size_t size);
 
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index c90e55c..90ae328 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -269,9 +269,10 @@ const char *event_type(int type)
 
 
 
-static int __add_event(struct list_head *list, int *idx,
-		       struct perf_event_attr *attr,
-		       char *name, struct cpu_map *cpus)
+static struct perf_evsel *
+__add_event(struct list_head *list, int *idx,
+	    struct perf_event_attr *attr,
+	    char *name, struct cpu_map *cpus)
 {
 	struct perf_evsel *evsel;
 
@@ -279,19 +280,19 @@ static int __add_event(struct list_head *list, int *idx,
 
 	evsel = perf_evsel__new(attr, (*idx)++);
 	if (!evsel)
-		return -ENOMEM;
+		return NULL;
 
 	evsel->cpus = cpus;
 	if (name)
 		evsel->name = strdup(name);
 	list_add_tail(&evsel->node, list);
-	return 0;
+	return evsel;
 }
 
 static int add_event(struct list_head *list, int *idx,
 		     struct perf_event_attr *attr, char *name)
 {
-	return __add_event(list, idx, attr, name, NULL);
+	return __add_event(list, idx, attr, name, NULL) ? 0 : -ENOMEM;
 }
 
 static int parse_aliases(char *str, const char *names[][PERF_EVSEL__MAX_ALIASES], int size)
@@ -633,6 +634,9 @@ int parse_events_add_pmu(struct list_head *list, int *idx,
 {
 	struct perf_event_attr attr;
 	struct perf_pmu *pmu;
+	struct perf_evsel *evsel;
+	char *unit;
+	double scale;
 
 	pmu = perf_pmu__find(name);
 	if (!pmu)
@@ -640,7 +644,7 @@ int parse_events_add_pmu(struct list_head *list, int *idx,
 
 	memset(&attr, 0, sizeof(attr));
 
-	if (perf_pmu__check_alias(pmu, head_config))
+	if (perf_pmu__check_alias(pmu, head_config, &unit, &scale))
 		return -EINVAL;
 
 	/*
@@ -652,8 +656,14 @@ int parse_events_add_pmu(struct list_head *list, int *idx,
 	if (perf_pmu__config(pmu, &attr, head_config))
 		return -EINVAL;
 
-	return __add_event(list, idx, &attr, pmu_event_name(head_config),
-			   pmu->cpus);
+	evsel = __add_event(list, idx, &attr, pmu_event_name(head_config),
+			    pmu->cpus);
+	if (evsel) {
+		evsel->unit = unit;
+		evsel->scale = scale;
+	}
+
+	return evsel ? 0 : -ENOMEM;
 }
 
 int parse_events__modifier_group(struct list_head *list,
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index c232d8d..56fc10a 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -1,19 +1,23 @@
 #include <linux/list.h>
 #include <sys/types.h>
-#include <sys/stat.h>
 #include <unistd.h>
 #include <stdio.h>
 #include <dirent.h>
 #include "fs.h"
+#include <locale.h>
 #include "util.h"
 #include "pmu.h"
 #include "parse-events.h"
 #include "cpumap.h"
 
+#define UNIT_MAX_LEN	31 /* max length for event unit name */
+
 struct perf_pmu_alias {
 	char *name;
 	struct list_head terms;
 	struct list_head list;
+	char unit[UNIT_MAX_LEN+1];
+	double scale;
 };
 
 struct perf_pmu_format {
@@ -94,7 +98,80 @@ static int pmu_format(const char *name, struct list_head *format)
 	return 0;
 }
 
-static int perf_pmu__new_alias(struct list_head *list, char *name, FILE *file)
+static int perf_pmu__parse_scale(struct perf_pmu_alias *alias, char *dir, char *name)
+{
+	struct stat st;
+	ssize_t sret;
+	char scale[128];
+	int fd, ret = -1;
+	char path[PATH_MAX];
+	char *lc;
+
+	snprintf(path, PATH_MAX, "%s/%s.scale", dir, name);
+
+	fd = open(path, O_RDONLY);
+	if (fd == -1)
+		return -1;
+
+	if (fstat(fd, &st) < 0)
+		goto error;
+
+	sret = read(fd, scale, sizeof(scale)-1);
+	if (sret < 0)
+		goto error;
+
+	scale[sret] = '\0';
+	/*
+	 * save current locale
+	 */
+	lc = setlocale(LC_NUMERIC, NULL);
+
+	/*
+	 * force to C locale to ensure kernel
+	 * scale string is converted correctly.
+	 * kernel uses default C locale.
+	 */
+	setlocale(LC_NUMERIC, "C");
+
+	alias->scale = strtod(scale, NULL);
+
+	/* restore locale */
+	setlocale(LC_NUMERIC, lc);
+
+	ret = 0;
+error:
+	close(fd);
+	return ret;
+}
+
+static int perf_pmu__parse_unit(struct perf_pmu_alias *alias, char *dir, char *name)
+{
+	char path[PATH_MAX];
+	ssize_t sret;
+	int fd;
+
+	snprintf(path, PATH_MAX, "%s/%s.unit", dir, name);
+
+	fd = open(path, O_RDONLY);
+	if (fd == -1)
+		return -1;
+
+	sret = read(fd, alias->unit, UNIT_MAX_LEN);
+	if (sret < 0)
+		goto error;
+
+	close(fd);
+
+	alias->unit[sret] = '\0';
+
+	return 0;
+error:
+	close(fd);
+	alias->unit[0] = '\0';
+	return -1;
+}
+
+static int perf_pmu__new_alias(struct list_head *list, char *dir, char *name, FILE *file)
 {
 	struct perf_pmu_alias *alias;
 	char buf[256];
@@ -110,6 +187,9 @@ static int perf_pmu__new_alias(struct list_head *list, char *name, FILE *file)
 		return -ENOMEM;
 
 	INIT_LIST_HEAD(&alias->terms);
+	alias->scale = 1.0;
+	alias->unit[0] = '\0';
+
 	ret = parse_events_terms(&alias->terms, buf);
 	if (ret) {
 		free(alias);
@@ -117,7 +197,14 @@ static int perf_pmu__new_alias(struct list_head *list, char *name, FILE *file)
 	}
 
 	alias->name = strdup(name);
+	/*
+	 * load unit name and scale if available
+	 */
+	perf_pmu__parse_unit(alias, dir, name);
+	perf_pmu__parse_scale(alias, dir, name);
+
 	list_add_tail(&alias->list, list);
+
 	return 0;
 }
 
@@ -129,6 +216,7 @@ static int pmu_aliases_parse(char *dir, struct list_head *head)
 {
 	struct dirent *evt_ent;
 	DIR *event_dir;
+	size_t len;
 	int ret = 0;
 
 	event_dir = opendir(dir);
@@ -143,13 +231,24 @@ static int pmu_aliases_parse(char *dir, struct list_head *head)
 		if (!strcmp(name, ".") || !strcmp(name, ".."))
 			continue;
 
+		/*
+		 * skip .unit and .scale info files
+		 * parsed in perf_pmu__new_alias()
+		 */
+		len = strlen(name);
+		if (len > 5 && !strcmp(name + len - 5, ".unit"))
+			continue;
+		if (len > 6 && !strcmp(name + len - 6, ".scale"))
+			continue;
+
 		snprintf(path, PATH_MAX, "%s/%s", dir, name);
 
 		ret = -EINVAL;
 		file = fopen(path, "r");
 		if (!file)
 			break;
-		ret = perf_pmu__new_alias(head, name, file);
+
+		ret = perf_pmu__new_alias(head, dir, name, file);
 		fclose(file);
 	}
 
@@ -508,16 +607,42 @@ static struct perf_pmu_alias *pmu_find_alias(struct perf_pmu *pmu,
 	return NULL;
 }
 
+
+static int check_unit_scale(struct perf_pmu_alias *alias,
+			    char **unit, double *scale)
+{
+	/*
+	 * Only one term in event definition can
+	 * define unit and scale, fail if there's
+	 * more than one.
+	 */
+	if ((*unit && alias->unit) ||
+	    (*scale && alias->scale))
+		return -EINVAL;
+
+	if (alias->unit)
+		*unit = alias->unit;
+
+	if (alias->scale)
+		*scale = alias->scale;
+
+	return 0;
+}
+
 /*
  * Find alias in the terms list and replace it with the terms
  * defined for the alias
  */
-int perf_pmu__check_alias(struct perf_pmu *pmu, struct list_head *head_terms)
+int perf_pmu__check_alias(struct perf_pmu *pmu, struct list_head *head_terms,
+			  char **unit, double *scale)
 {
 	struct parse_events_term *term, *h;
 	struct perf_pmu_alias *alias;
 	int ret;
 
+	*unit   = NULL;
+	*scale  = 0;
+
 	list_for_each_entry_safe(term, h, head_terms, list) {
 		alias = pmu_find_alias(pmu, term);
 		if (!alias)
@@ -525,6 +650,11 @@ int perf_pmu__check_alias(struct perf_pmu *pmu, struct list_head *head_terms)
 		ret = pmu_alias_terms(alias, &term->list);
 		if (ret)
 			return ret;
+
+		ret = check_unit_scale(alias, unit, scale);
+		if (ret)
+			return ret;
+
 		list_del(&term->list);
 		free(term);
 	}
diff --git a/tools/perf/util/pmu.h b/tools/perf/util/pmu.h
index 1179b26..9183380 100644
--- a/tools/perf/util/pmu.h
+++ b/tools/perf/util/pmu.h
@@ -28,7 +28,8 @@ int perf_pmu__config(struct perf_pmu *pmu, struct perf_event_attr *attr,
 int perf_pmu__config_terms(struct list_head *formats,
 			   struct perf_event_attr *attr,
 			   struct list_head *head_terms);
-int perf_pmu__check_alias(struct perf_pmu *pmu, struct list_head *head_terms);
+int perf_pmu__check_alias(struct perf_pmu *pmu, struct list_head *head_terms,
+			  char **unit, double *scale);
 struct list_head *perf_pmu__alias(struct perf_pmu *pmu,
 				  struct list_head *head_terms);
 int perf_pmu_wrap(void);
-- 
1.7.9.5



* [PATCH v7 3/4] perf,x86: add Intel RAPL PMU support
  2013-11-12 16:58 [PATCH v7 0/4] perf/x86: add Intel RAPL PMU support Stephane Eranian
  2013-11-12 16:58 ` [PATCH v7 1/4] perf: add active_entry list head to struct perf_event Stephane Eranian
  2013-11-12 16:58 ` [PATCH v7 2/4] perf stat: add event unit and scale support Stephane Eranian
@ 2013-11-12 16:58 ` Stephane Eranian
  2013-11-27 10:23   ` Peter Zijlstra
                     ` (2 more replies)
  2013-11-12 16:58 ` [PATCH v7 4/4] perf,x86: add RAPL hrtimer support Stephane Eranian
  2013-11-12 18:10 ` [PATCH v7 0/4] perf/x86: add Intel RAPL PMU support Andi Kleen
  4 siblings, 3 replies; 22+ messages in thread
From: Stephane Eranian @ 2013-11-12 16:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, ak, acme, jolsa, zheng.z.yan, bp, maria.n.dimakopoulou

This patch adds a new uncore PMU to expose the Intel
RAPL energy consumption counters. Up to 3 counters,
each counting a particular RAPL event, are exposed.

The RAPL counters are available on Intel SandyBridge,
IvyBridge, and Haswell. The server SKUs add a 3rd counter
to measure DRAM power consumption.

The following events are available and exposed in sysfs:
- power/energy-cores: power consumption of all cores on socket
- power/energy-pkg: power consumption of all cores + LLC cache
- power/energy-dram: power consumption of DRAM (servers only)

For each event, both the unit (Joules) and scale (2^-32 J)
are exposed in sysfs for use by perf stat and other tools.
The files are:
	/sys/devices/power/events/energy-*.unit
	/sys/devices/power/events/energy-*.scale

The RAPL PMU is uncore by nature and is implemented such
that it only works in system-wide mode. Measuring only
one CPU per socket is sufficient. The /sys/devices/power/cpumask
file can be used by tools to figure out which CPUs to monitor
by default. For instance, on a 2-socket system, 2 CPUs
(one on each socket) will be shown.

All the counters measure in the same unit (exposed via sysfs).
The perf_events API exposes all RAPL counters as 64-bit integers
counting in units of 1/2^32 Joules (about 0.23 nJ). User-level
tools must convert the counts by multiplying them by 2^-32 to
obtain Joules. The reason for this is that the kernel avoids
doing floating-point math whenever possible because it is
expensive (user floating-point state must be saved). The method
used avoids kernel floating-point usage. There is no loss of
precision. Thanks to PeterZ for suggesting this approach.

To convert a raw count C to Joules: J = ldexp(C, -32).
To convert to Watts: W = ldexp(C, -32) / time, i.e., about
C * 2.33e-10 / time, with time in seconds.
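
As a worked example (numbers made up): a raw count C = 3 * 2^32 =
12884901888 corresponds to ldexp(C, -32) = 3.0 Joules; accumulated
over a 1.5 s interval, that is 3.0 / 1.5 = 2 W of average power.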

RAPL PMU is a new standalone PMU which registers with the
perf_event core subsystem. The PMU type (attr->type) is
dynamically allocated and is available from /sys/devices/power/type.

Sampling is not supported by the RAPL PMU. There is no
privilege level filtering either.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
---
 arch/x86/kernel/cpu/Makefile                |    2 +-
 arch/x86/kernel/cpu/perf_event_intel_rapl.c |  591 +++++++++++++++++++++++++++
 2 files changed, 592 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/cpu/perf_event_intel_rapl.c

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 47b56a7..6359506 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -36,7 +36,7 @@ obj-$(CONFIG_CPU_SUP_AMD)		+= perf_event_amd_iommu.o
 endif
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_p6.o perf_event_knc.o perf_event_p4.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_lbr.o perf_event_intel_ds.o perf_event_intel.o
-obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_uncore.o
+obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_uncore.o perf_event_intel_rapl.o
 endif
 
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_rapl.c b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
new file mode 100644
index 0000000..cfcd386
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
@@ -0,0 +1,591 @@
+/*
+ * perf_event_intel_rapl.c: support Intel RAPL energy consumption counters
+ * Copyright (C) 2013 Google, Inc., Stephane Eranian
+ *
+ * Intel RAPL interface is specified in the IA-32 Manual Vol3b
+ * section 14.7.1 (September 2013)
+ *
+ * RAPL provides more controls than just reporting energy consumption
+ * however here we only expose the 3 energy consumption free running
+ * counters (pp0, pkg, dram).
+ *
+ * Each of those counters increments in a power unit defined by the
+ * RAPL_POWER_UNIT MSR. On SandyBridge, this unit is 1/(2^16) Joules
+ * but it can vary.
+ *
+ * Counter to rapl events mappings:
+ *
+ *  pp0 counter: consumption of all physical cores (power plane 0)
+ * 	  event: rapl_energy_cores
+ *    perf code: 0x1
+ *
+ *  pkg counter: consumption of the whole processor package
+ *	  event: rapl_energy_pkg
+ *    perf code: 0x2
+ *
+ * dram counter: consumption of the dram domain (servers only)
+ *	  event: rapl_energy_dram
+ *    perf code: 0x3
+ *
+ * We manage those counters as free running (read-only). They may be
+ * used simultaneously by other tools, such as turbostat.
+ *
+ * The events only support system-wide mode counting. There is no
+ * sampling support because it does not make sense and is not
+ * supported by the RAPL hardware.
+ *
+ * Because we want to avoid floating-point operations in the kernel,
+ * the events are all reported in fixed point arithmetic (32.32).
+ * Tools must convert the counts to Joules, e.g., with
+ * ldexp(raw_count, -32), and divide by the measurement
+ * duration to obtain Watts.
+ */
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/perf_event.h>
+#include <asm/cpu_device_id.h>
+#include "perf_event.h"
+
+/*
+ * RAPL energy status counters
+ */
+#define RAPL_IDX_PP0_NRG_STAT	0	/* all cores */
+#define INTEL_RAPL_PP0		0x1	/* pseudo-encoding */
+#define RAPL_IDX_PKG_NRG_STAT	1	/* entire package */
+#define INTEL_RAPL_PKG		0x2	/* pseudo-encoding */
+#define RAPL_IDX_RAM_NRG_STAT	2	/* DRAM */
+#define INTEL_RAPL_RAM		0x3	/* pseudo-encoding */
+
+/* Clients have PP0, PKG */
+#define RAPL_IDX_CLN	(1<<RAPL_IDX_PP0_NRG_STAT|\
+			 1<<RAPL_IDX_PKG_NRG_STAT)
+
+/* Servers have PP0, PKG, RAM */
+#define RAPL_IDX_SRV	(1<<RAPL_IDX_PP0_NRG_STAT|\
+			 1<<RAPL_IDX_PKG_NRG_STAT|\
+			 1<<RAPL_IDX_RAM_NRG_STAT)
+
+/*
+ * event code: LSB 8 bits, passed in attr->config
+ * any other bit is reserved
+ */
+#define RAPL_EVENT_MASK	0xFFULL
+
+#define DEFINE_RAPL_FORMAT_ATTR(_var, _name, _format)		\
+static ssize_t __rapl_##_var##_show(struct kobject *kobj,	\
+				struct kobj_attribute *attr,	\
+				char *page)			\
+{								\
+	BUILD_BUG_ON(sizeof(_format) >= PAGE_SIZE);		\
+	return sprintf(page, _format "\n");			\
+}								\
+static struct kobj_attribute format_attr_##_var =		\
+	__ATTR(_name, 0444, __rapl_##_var##_show, NULL)
+
+#define RAPL_EVENT_DESC(_name, _config)				\
+{								\
+	.attr	= __ATTR(_name, 0444, rapl_event_show, NULL),	\
+	.config	= _config,					\
+}
+
+#define RAPL_CNTR_WIDTH 32 /* 32-bit rapl counters */
+
+struct rapl_pmu {
+	spinlock_t	 lock;
+	int		 hw_unit;  /* 1/2^hw_unit Joule */
+	int		 n_active; /* number of active events */
+	struct list_head active_list;
+	struct pmu	 *pmu; /* pointer to rapl_pmu_class */
+};
+
+static struct pmu rapl_pmu_class;
+static cpumask_t rapl_cpu_mask;
+static int rapl_cntr_mask;
+
+static DEFINE_PER_CPU(struct rapl_pmu *, rapl_pmu);
+static DEFINE_PER_CPU(struct rapl_pmu *, rapl_pmu_to_free);
+
+static inline u64 rapl_read_counter(struct perf_event *event)
+{
+	u64 raw;
+	rdmsrl(event->hw.event_base, raw);
+	return raw;
+}
+
+static inline u64 rapl_scale(u64 v)
+{
+	/*
+	 * scale delta to smallest unit (1/2^32)
+	 * users must then scale back: count * 1/2^32 to get Joules
+	 * or use ldexp(count, -32).
+	 * Watts = Joules/Time delta
+	 */
+	return v << (32 - __get_cpu_var(rapl_pmu)->hw_unit);
+}
+
+static u64 rapl_event_update(struct perf_event *event)
+{
+	struct hw_perf_event *hwc = &event->hw;
+	u64 prev_raw_count, new_raw_count;
+	s64 delta, sdelta;
+	int shift = RAPL_CNTR_WIDTH;
+
+again:
+	prev_raw_count = local64_read(&hwc->prev_count);
+	rdmsrl(event->hw.event_base, new_raw_count);
+
+	if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
+			    new_raw_count) != prev_raw_count) {
+		cpu_relax();
+		goto again;
+	}
+
+	/*
+	 * Now we have the new raw value and have updated the prev
+	 * timestamp already. We can now calculate the elapsed delta
+	 * (event-)time and add that to the generic event.
+	 *
+	 * Careful, not all hw sign-extends above the physical width
+	 * of the count.
+	 */
+	delta = (new_raw_count << shift) - (prev_raw_count << shift);
+	delta >>= shift;
+
+	sdelta = rapl_scale(delta);
+
+	local64_add(sdelta, &event->count);
+
+	return new_raw_count;
+}
+
+static void __rapl_pmu_event_start(struct rapl_pmu *pmu,
+				   struct perf_event *event)
+{
+	if (WARN_ON_ONCE(!(event->hw.state & PERF_HES_STOPPED)))
+		return;
+
+	event->hw.state = 0;
+
+	list_add_tail(&event->active_entry, &pmu->active_list);
+
+	local64_set(&event->hw.prev_count, rapl_read_counter(event));
+
+	pmu->n_active++;
+}
+
+static void rapl_pmu_event_start(struct perf_event *event, int mode)
+{
+	struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
+	unsigned long flags;
+
+	spin_lock_irqsave(&pmu->lock, flags);
+	__rapl_pmu_event_start(pmu, event);
+	spin_unlock_irqrestore(&pmu->lock, flags);
+}
+
+static void rapl_pmu_event_stop(struct perf_event *event, int mode)
+{
+	struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
+	struct hw_perf_event *hwc = &event->hw;
+	unsigned long flags;
+
+	spin_lock_irqsave(&pmu->lock, flags);
+
+	/* mark event as deactivated and stopped */
+	if (!(hwc->state & PERF_HES_STOPPED)) {
+		WARN_ON_ONCE(pmu->n_active <= 0);
+		pmu->n_active--;
+
+		list_del(&event->active_entry);
+
+		WARN_ON_ONCE(hwc->state & PERF_HES_STOPPED);
+		hwc->state |= PERF_HES_STOPPED;
+	}
+
+	/* check if update of sw counter is necessary */
+	if ((mode & PERF_EF_UPDATE) && !(hwc->state & PERF_HES_UPTODATE)) {
+		/*
+		 * Drain the remaining delta count out of a event
+		 * that we are disabling:
+		 */
+		rapl_event_update(event);
+		hwc->state |= PERF_HES_UPTODATE;
+	}
+
+	spin_unlock_irqrestore(&pmu->lock, flags);
+}
+
+static int rapl_pmu_event_add(struct perf_event *event, int mode)
+{
+	struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
+	struct hw_perf_event *hwc = &event->hw;
+	unsigned long flags;
+
+	spin_lock_irqsave(&pmu->lock, flags);
+
+	hwc->state = PERF_HES_UPTODATE | PERF_HES_STOPPED;
+
+	if (mode & PERF_EF_START)
+		__rapl_pmu_event_start(pmu, event);
+
+	spin_unlock_irqrestore(&pmu->lock, flags);
+
+	return 0;
+}
+
+static void rapl_pmu_event_del(struct perf_event *event, int flags)
+{
+	rapl_pmu_event_stop(event, PERF_EF_UPDATE);
+}
+
+static int rapl_pmu_event_init(struct perf_event *event)
+{
+	u64 cfg = event->attr.config & RAPL_EVENT_MASK;
+	int bit, msr, ret = 0;
+
+	/* only look at RAPL events */
+	if (event->attr.type != rapl_pmu_class.type)
+		return -ENOENT;
+
+	/* check only supported bits are set */
+	if (event->attr.config & ~RAPL_EVENT_MASK)
+		return -EINVAL;
+
+	/*
+	 * check event is known (determines counter)
+	 */
+	switch (cfg) {
+	case INTEL_RAPL_PP0:
+		bit = RAPL_IDX_PP0_NRG_STAT;
+		msr = MSR_PP0_ENERGY_STATUS;
+		break;
+	case INTEL_RAPL_PKG:
+		bit = RAPL_IDX_PKG_NRG_STAT;
+		msr = MSR_PKG_ENERGY_STATUS;
+		break;
+	case INTEL_RAPL_RAM:
+		bit = RAPL_IDX_RAM_NRG_STAT;
+		msr = MSR_DRAM_ENERGY_STATUS;
+		break;
+	default:
+		return -EINVAL;
+	}
+	/* check event supported */
+	if (!(rapl_cntr_mask & (1 << bit)))
+		return -EINVAL;
+
+	/* unsupported modes and filters */
+	if (event->attr.exclude_user   ||
+	    event->attr.exclude_kernel ||
+	    event->attr.exclude_hv     ||
+	    event->attr.exclude_idle   ||
+	    event->attr.exclude_host   ||
+	    event->attr.exclude_guest  ||
+	    event->attr.sample_period) /* no sampling */
+		return -EINVAL;
+
+	/* must be done before validate_group */
+	event->hw.event_base = msr;
+	event->hw.config = cfg;
+	event->hw.idx = bit;
+
+	return ret;
+}
+
+static void rapl_pmu_event_read(struct perf_event *event)
+{
+	rapl_event_update(event);
+}
+
+static ssize_t rapl_get_attr_cpumask(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	int n = cpulist_scnprintf(buf, PAGE_SIZE - 2, &rapl_cpu_mask);
+
+	buf[n++] = '\n';
+	buf[n] = '\0';
+	return n;
+}
+
+static DEVICE_ATTR(cpumask, S_IRUGO, rapl_get_attr_cpumask, NULL);
+
+static struct attribute *rapl_pmu_attrs[] = {
+	&dev_attr_cpumask.attr,
+	NULL,
+};
+
+static struct attribute_group rapl_pmu_attr_group = {
+	.attrs = rapl_pmu_attrs,
+};
+
+EVENT_ATTR_STR(energy-cores, rapl_cores, "event=0x01");
+EVENT_ATTR_STR(energy-pkg  , rapl_pkg, "event=0x02");
+EVENT_ATTR_STR(energy-ram  , rapl_ram, "event=0x03");
+
+EVENT_ATTR_STR(energy-cores.unit, rapl_cores_unit, "Joules");
+EVENT_ATTR_STR(energy-pkg.unit  , rapl_pkg_unit, "Joules");
+EVENT_ATTR_STR(energy-ram.unit  , rapl_ram_unit, "Joules");
+
+/*
+ * we compute in 0.23 nJ increments regardless of MSR
+ */
+EVENT_ATTR_STR(energy-cores.scale, rapl_cores_scale, "2.3283064365386962890625e-10");
+EVENT_ATTR_STR(energy-pkg.scale, rapl_pkg_scale, "2.3283064365386962890625e-10");
+EVENT_ATTR_STR(energy-ram.scale, rapl_ram_scale, "2.3283064365386962890625e-10");
+
+static struct attribute *rapl_events_srv_attr[] = {
+	EVENT_PTR(rapl_cores),
+	EVENT_PTR(rapl_pkg),
+	EVENT_PTR(rapl_ram),
+
+	EVENT_PTR(rapl_cores_unit),
+	EVENT_PTR(rapl_pkg_unit),
+	EVENT_PTR(rapl_ram_unit),
+
+	EVENT_PTR(rapl_cores_scale),
+	EVENT_PTR(rapl_pkg_scale),
+	EVENT_PTR(rapl_ram_scale),
+	NULL,
+};
+
+static struct attribute *rapl_events_cln_attr[] = {
+	EVENT_PTR(rapl_cores),
+	EVENT_PTR(rapl_pkg),
+
+	EVENT_PTR(rapl_cores_unit),
+	EVENT_PTR(rapl_pkg_unit),
+
+	EVENT_PTR(rapl_cores_scale),
+	EVENT_PTR(rapl_pkg_scale),
+	NULL,
+};
+
+static struct attribute_group rapl_pmu_events_group = {
+	.name = "events",
+	.attrs = NULL, /* patched at runtime */
+};
+
+DEFINE_RAPL_FORMAT_ATTR(event, event, "config:0-7");
+static struct attribute *rapl_formats_attr[] = {
+	&format_attr_event.attr,
+	NULL,
+};
+
+static struct attribute_group rapl_pmu_format_group = {
+	.name = "format",
+	.attrs = rapl_formats_attr,
+};
+
+const struct attribute_group *rapl_attr_groups[] = {
+	&rapl_pmu_attr_group,
+	&rapl_pmu_format_group,
+	&rapl_pmu_events_group,
+	NULL,
+};
+
+static struct pmu rapl_pmu_class = {
+	.attr_groups	= rapl_attr_groups,
+	.task_ctx_nr	= perf_invalid_context, /* system-wide only */
+	.event_init	= rapl_pmu_event_init,
+	.add		= rapl_pmu_event_add, /* must have */
+	.del		= rapl_pmu_event_del, /* must have */
+	.start		= rapl_pmu_event_start,
+	.stop		= rapl_pmu_event_stop,
+	.read		= rapl_pmu_event_read,
+};
+
+static void rapl_cpu_exit(int cpu)
+{
+	struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu);
+	int i, phys_id = topology_physical_package_id(cpu);
+	int target = -1;
+
+	/* find a new cpu on same package */
+	for_each_online_cpu(i) {
+		if (i == cpu)
+			continue;
+		if (phys_id == topology_physical_package_id(i)) {
+			target = i;
+			break;
+		}
+	}
+	/*
+	 * clear cpu from cpumask
+	 * if was set in cpumask and still some cpu on package,
+	 * then move to new cpu
+	 */
+	if (cpumask_test_and_clear_cpu(cpu, &rapl_cpu_mask) && target >= 0)
+		cpumask_set_cpu(target, &rapl_cpu_mask);
+
+	WARN_ON(cpumask_empty(&rapl_cpu_mask));
+	/*
+	 * migrate events and context to new cpu
+	 */
+	if (target >= 0)
+		perf_pmu_migrate_context(pmu->pmu, cpu, target);
+}
+
+static void rapl_cpu_init(int cpu)
+{
+	int i, phys_id = topology_physical_package_id(cpu);
+
+	/* check if phys_id is already covered */
+	for_each_cpu(i, &rapl_cpu_mask) {
+		if (phys_id == topology_physical_package_id(i))
+			return;
+	}
+	/* was not found, so add it */
+	cpumask_set_cpu(cpu, &rapl_cpu_mask);
+}
+
+static int rapl_cpu_prepare(int cpu)
+{
+	struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu);
+	int phys_id = topology_physical_package_id(cpu);
+
+	if (pmu)
+		return 0;
+
+	if (phys_id < 0)
+		return -1;
+
+	pmu = kzalloc_node(sizeof(*pmu), GFP_KERNEL, cpu_to_node(cpu));
+	if (!pmu)
+		return -1;
+
+	spin_lock_init(&pmu->lock);
+
+	INIT_LIST_HEAD(&pmu->active_list);
+
+	/*
+	 * grab power unit as: 1/2^unit Joules
+	 *
+	 * we cache in local PMU instance
+	 */
+	rdmsrl(MSR_RAPL_POWER_UNIT, pmu->hw_unit);
+	pmu->hw_unit = (pmu->hw_unit >> 8) & 0x1FULL;
+	pmu->pmu = &rapl_pmu_class;
+
+	/* set RAPL pmu for this cpu for now */
+	per_cpu(rapl_pmu, cpu) = pmu;
+	per_cpu(rapl_pmu_to_free, cpu) = NULL;
+
+	return 0;
+}
+
+static void rapl_cpu_kfree(int cpu)
+{
+	struct rapl_pmu *pmu = per_cpu(rapl_pmu_to_free, cpu);
+
+	kfree(pmu);
+
+	per_cpu(rapl_pmu_to_free, cpu) = NULL;
+}
+
+static int rapl_cpu_dying(int cpu)
+{
+	struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu);
+
+	if (!pmu)
+		return 0;
+
+	per_cpu(rapl_pmu, cpu) = NULL;
+
+	per_cpu(rapl_pmu_to_free, cpu) = pmu;
+
+	return 0;
+}
+
+static int rapl_cpu_notifier(struct notifier_block *self,
+			     unsigned long action, void *hcpu)
+{
+	unsigned int cpu = (long)hcpu;
+
+	switch (action & ~CPU_TASKS_FROZEN) {
+	case CPU_UP_PREPARE:
+		rapl_cpu_prepare(cpu);
+		break;
+	case CPU_STARTING:
+		rapl_cpu_init(cpu);
+		break;
+	case CPU_UP_CANCELED:
+	case CPU_DYING:
+		rapl_cpu_dying(cpu);
+		break;
+	case CPU_ONLINE:
+	case CPU_DEAD:
+		rapl_cpu_kfree(cpu);
+		break;
+	case CPU_DOWN_PREPARE:
+		rapl_cpu_exit(cpu);
+		break;
+	default:
+		break;
+	}
+
+	return NOTIFY_OK;
+}
+
+static const struct x86_cpu_id rapl_cpu_match[] = {
+	[0] = { .vendor = X86_VENDOR_INTEL, .family = 6 },
+	[1] = {},
+};
+
+static int __init rapl_pmu_init(void)
+{
+	struct rapl_pmu *pmu;
+	int cpu, ret;
+
+	/*
+	 * check for Intel processor family 6
+	 */
+	if (!x86_match_cpu(rapl_cpu_match))
+		return 0;
+
+	/* check supported CPU */
+	switch (boot_cpu_data.x86_model) {
+	case 42: /* Sandy Bridge */
+	case 58: /* Ivy Bridge */
+	case 60: /* Haswell */
+		rapl_cntr_mask = RAPL_IDX_CLN;
+		rapl_pmu_events_group.attrs = rapl_events_cln_attr;
+		break;
+	case 45: /* Sandy Bridge-EP */
+	case 62: /* IvyTown */
+		rapl_cntr_mask = RAPL_IDX_SRV;
+		rapl_pmu_events_group.attrs = rapl_events_srv_attr;
+		break;
+
+	default:
+		/* unsupported */
+		return 0;
+	}
+	get_online_cpus();
+
+	for_each_online_cpu(cpu) {
+		rapl_cpu_prepare(cpu);
+		rapl_cpu_init(cpu);
+	}
+
+	perf_cpu_notifier(rapl_cpu_notifier);
+
+	ret = perf_pmu_register(&rapl_pmu_class, "power", -1);
+	if (WARN_ON(ret)) {
+		pr_info("RAPL PMU detected, registration failed (%d), RAPL PMU disabled\n", ret);
+		put_online_cpus();
+		return -1;
+	}
+
+	pmu = __get_cpu_var(rapl_pmu);
+
+	pr_info("RAPL PMU detected, hw unit 2^-%d Joules,"
+		" API unit is 2^-32 Joules,"
+		" %d fixed counters\n",
+		pmu->hw_unit,
+		hweight32(rapl_cntr_mask));
+
+	put_online_cpus();
+
+	return 0;
+}
+device_initcall(rapl_pmu_init);
-- 
1.7.9.5



* [PATCH v7 4/4] perf,x86: add RAPL hrtimer support
  2013-11-12 16:58 [PATCH v7 0/4] perf/x86: add Intel RAPL PMU support Stephane Eranian
                   ` (2 preceding siblings ...)
  2013-11-12 16:58 ` [PATCH v7 3/4] perf,x86: add Intel RAPL PMU support Stephane Eranian
@ 2013-11-12 16:58 ` Stephane Eranian
  2013-11-27 14:10   ` [tip:perf/core] perf/x86: Add " tip-bot for Stephane Eranian
  2013-11-27 14:33   ` tip-bot for Stephane Eranian
  2013-11-12 18:10 ` [PATCH v7 0/4] perf/x86: add Intel RAPL PMU support Andi Kleen
  4 siblings, 2 replies; 22+ messages in thread
From: Stephane Eranian @ 2013-11-12 16:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, ak, acme, jolsa, zheng.z.yan, bp, maria.n.dimakopoulou

The RAPL PMU counters do not interrupt on overflow.
Therefore, the kernel needs to poll the counters
to avoid missing an overflow. This patch adds
the hrtimer code to do this.

The timer interval is calculated at boot time
based on the power unit used by the HW.

There is one hrtimer per CPU to handle the case of
multiple simultaneous sessions across cores on the
same package, as well as CPU hotplug.
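
As a worked example of the interval computation in rapl_cpu_prepare()
below: with hw_unit = 16, the 32-bit counter overflows after
2^32 * 2^-16 = 65536 Joules, i.e. after about 65536 / 200 = 328 s
at the 200 W reference; halving that gives
ms = 1000 * 2^(32 - 16 - 1) / (2 * 100) = 163840 ms (~164 s).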

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event_intel_rapl.c |   74 ++++++++++++++++++++++++++-
 1 file changed, 72 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_rapl.c b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
index cfcd386..a6f16ca 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_rapl.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
@@ -96,6 +96,8 @@ struct rapl_pmu {
 	int		 n_active; /* number of active events */
 	struct list_head active_list;
 	struct pmu	 *pmu; /* pointer to rapl_pmu_class */
+	ktime_t		 timer_interval; /* in ktime_t unit */
+	struct hrtimer   hrtimer;
 };
 
 static struct pmu rapl_pmu_class;
@@ -158,6 +160,48 @@ static u64 rapl_event_update(struct perf_event *event)
 	return new_raw_count;
 }
 
+static void rapl_start_hrtimer(struct rapl_pmu *pmu)
+{
+	__hrtimer_start_range_ns(&pmu->hrtimer,
+			pmu->timer_interval, 0,
+			HRTIMER_MODE_REL_PINNED, 0);
+}
+
+static void rapl_stop_hrtimer(struct rapl_pmu *pmu)
+{
+	hrtimer_cancel(&pmu->hrtimer);
+}
+
+static enum hrtimer_restart rapl_hrtimer_handle(struct hrtimer *hrtimer)
+{
+	struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
+	struct perf_event *event;
+	unsigned long flags;
+
+	if (!pmu->n_active)
+		return HRTIMER_NORESTART;
+
+	spin_lock_irqsave(&pmu->lock, flags);
+
+	list_for_each_entry(event, &pmu->active_list, active_entry) {
+		rapl_event_update(event);
+	}
+
+	spin_unlock_irqrestore(&pmu->lock, flags);
+
+	hrtimer_forward_now(hrtimer, pmu->timer_interval);
+
+	return HRTIMER_RESTART;
+}
+
+static void rapl_hrtimer_init(struct rapl_pmu *pmu)
+{
+	struct hrtimer *hr = &pmu->hrtimer;
+
+	hrtimer_init(hr, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hr->function = rapl_hrtimer_handle;
+}
+
 static void __rapl_pmu_event_start(struct rapl_pmu *pmu,
 				   struct perf_event *event)
 {
@@ -171,6 +215,8 @@ static void __rapl_pmu_event_start(struct rapl_pmu *pmu,
 	local64_set(&event->hw.prev_count, rapl_read_counter(event));
 
 	pmu->n_active++;
+	if (pmu->n_active == 1)
+		rapl_start_hrtimer(pmu);
 }
 
 static void rapl_pmu_event_start(struct perf_event *event, int mode)
@@ -195,6 +241,8 @@ static void rapl_pmu_event_stop(struct perf_event *event, int mode)
 	if (!(hwc->state & PERF_HES_STOPPED)) {
 		WARN_ON_ONCE(pmu->n_active <= 0);
 		pmu->n_active--;
+		if (pmu->n_active == 0)
+			rapl_stop_hrtimer(pmu);
 
 		list_del(&event->active_entry);
 
@@ -423,6 +471,9 @@ static void rapl_cpu_exit(int cpu)
 	 */
 	if (target >= 0)
 		perf_pmu_migrate_context(pmu->pmu, cpu, target);
+
+	/* cancel overflow polling timer for CPU */
+	rapl_stop_hrtimer(pmu);
 }
 
 static void rapl_cpu_init(int cpu)
@@ -442,6 +493,7 @@ static int rapl_cpu_prepare(int cpu)
 {
 	struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu);
 	int phys_id = topology_physical_package_id(cpu);
+	u64 ms;
 
 	if (pmu)
 		return 0;
@@ -466,6 +518,22 @@ static int rapl_cpu_prepare(int cpu)
 	pmu->hw_unit = (pmu->hw_unit >> 8) & 0x1FULL;
 	pmu->pmu = &rapl_pmu_class;
 
+	/*
+	 * use reference of 200W for scaling the timeout
+	 * to avoid missing counter overflows.
+	 * 200W = 200 Joules/sec
+	 * divide the interval by 2 to avoid lockstep (hence 2 * 100)
+	 * if the hw unit is 32, we use 2 ms (~1/200/2 s)
+	 */
+	if (pmu->hw_unit < 32)
+		ms = 1000 * (1ULL << (32 - pmu->hw_unit - 1)) / (2 * 100);
+	else
+		ms = 2;
+
+	pmu->timer_interval = ms_to_ktime(ms);
+
+	rapl_hrtimer_init(pmu);
+
 	/* set RAPL pmu for this cpu for now */
 	per_cpu(rapl_pmu, cpu) = pmu;
 	per_cpu(rapl_pmu_to_free, cpu) = NULL;
@@ -580,9 +648,11 @@ static int __init rapl_pmu_init(void)
 
 	pr_info("RAPL PMU detected, hw unit 2^-%d Joules,"
 		" API unit is 2^-32 Joules,"
-		" %d fixed counters\n",
+		" %d fixed counters"
+		" %llu ms ovfl timer\n",
 		pmu->hw_unit,
-		hweight32(rapl_cntr_mask));
+		hweight32(rapl_cntr_mask),
+		ktime_to_ms(pmu->timer_interval));
 
 	put_online_cpus();
 
-- 
1.7.9.5



* Re: [PATCH v7 0/4] perf/x86: add Intel RAPL PMU support
  2013-11-12 16:58 [PATCH v7 0/4] perf/x86: add Intel RAPL PMU support Stephane Eranian
                   ` (3 preceding siblings ...)
  2013-11-12 16:58 ` [PATCH v7 4/4] perf,x86: add RAPL hrtimer support Stephane Eranian
@ 2013-11-12 18:10 ` Andi Kleen
  4 siblings, 0 replies; 22+ messages in thread
From: Andi Kleen @ 2013-11-12 18:10 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, peterz, mingo, acme, jolsa, zheng.z.yan, bp,
	maria.n.dimakopoulou

> In v7, we rebased to 3.12+ and dropped the hotplug cpu lock because it
> was not needed. Hotplug CPU is still serialized.

Looks all good to me. Thanks for working on this.

Reviewed-by: Andi Kleen <ak@linux.intel.com>

-Andi


* Re: [PATCH v7 3/4] perf,x86: add Intel RAPL PMU support
  2013-11-12 16:58 ` [PATCH v7 3/4] perf,x86: add Intel RAPL PMU support Stephane Eranian
@ 2013-11-27 10:23   ` Peter Zijlstra
  2013-11-27 14:08   ` [tip:perf/core] perf/x86: Add " tip-bot for Stephane Eranian
  2013-11-27 18:35   ` [PATCH v7 3/4] perf,x86: add " Vince Weaver
  2 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2013-11-27 10:23 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, ak, acme, jolsa, zheng.z.yan, bp,
	maria.n.dimakopoulou

On Tue, Nov 12, 2013 at 05:58:50PM +0100, Stephane Eranian wrote:
> Sampling is not supported by the RAPL PMU. There is no
> privilege level filtering either.
> 
> Signed-off-by: Stephane Eranian <eranian@google.com>
> Signed-off-by: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>

So the merge window closed and we're starting to merge new bits for the
next cycle.

Ingo fell over the double SoB here; Maria didn't actually send me this
patch, so the SoB seems awkward at best.

An SoB only comes from being the primary author or from collecting the
patch in an aggregate work and passing it on.

I know people often use this double SoB for co-authorship (even on
single-line patches), but it's not supported by the rules as outlined in
Documentation/SubmittingPatches.

So Ingo changed it to a Reviewed-by tag if that's fine with you -
further credits should be expressed either as split-out patches that
track authorship, or additional credits in the source code.


* [tip:perf/core] perf: Add active_entry list head to struct perf_event
  2013-11-12 16:58 ` [PATCH v7 1/4] perf: add active_entry list head to struct perf_event Stephane Eranian
@ 2013-11-27 14:07   ` tip-bot for Stephane Eranian
  0 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Stephane Eranian @ 2013-11-27 14:07 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, eranian, hpa, mingo, peterz, ak, tglx

Commit-ID:  71ad88efebbcde374bddf904b96f3a7fc82d45d4
Gitweb:     http://git.kernel.org/tip/71ad88efebbcde374bddf904b96f3a7fc82d45d4
Author:     Stephane Eranian <eranian@google.com>
AuthorDate: Tue, 12 Nov 2013 17:58:48 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 27 Nov 2013 11:16:38 +0100

perf: Add active_entry list head to struct perf_event

This patch adds a new field to the struct perf_event.
It is intended to be used to chain events which are
active (enabled). It helps in the hardware layer
for PMUs which do not have actual counter restrictions, i.e.,
free running read-only counters. Active events are chained
as opposed to being tracked via the counter they use.

To save space we use a union with hlist_entry as both
are mutually exclusive (suggested by Jiri Olsa).

Signed-off-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: acme@redhat.com
Cc: jolsa@redhat.com
Cc: zheng.z.yan@intel.com
Cc: bp@alien8.de
Cc: maria.n.dimakopoulou@gmail.com
Link: http://lkml.kernel.org/r/1384275531-10892-2-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/perf_event.h | 5 ++++-
 kernel/events/core.c       | 1 +
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2e069d1..8f4a70f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -319,7 +319,10 @@ struct perf_event {
 	 */
 	struct list_head		migrate_entry;
 
-	struct hlist_node		hlist_entry;
+	union {
+		struct hlist_node	hlist_entry;
+		struct list_head	active_entry;
+	};
 	int				nr_siblings;
 	int				group_flags;
 	struct perf_event		*group_leader;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 72348dc..403b781 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6655,6 +6655,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	INIT_LIST_HEAD(&event->event_entry);
 	INIT_LIST_HEAD(&event->sibling_list);
 	INIT_LIST_HEAD(&event->rb_entry);
+	INIT_LIST_HEAD(&event->active_entry);
 
 	init_waitqueue_head(&event->waitq);
 	init_irq_work(&event->pending, perf_pending_event);

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [tip:perf/core] tools/perf/stat: Add event unit and scale support
  2013-11-12 16:58 ` [PATCH v7 2/4] perf stat: add event unit and scale support Stephane Eranian
@ 2013-11-27 14:08   ` tip-bot for Stephane Eranian
  0 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Stephane Eranian @ 2013-11-27 14:08 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, eranian, hpa, mingo, peterz, jolsa, ak, tglx

Commit-ID:  410136f5dd96b6013fe6d1011b523b1c247e1ccb
Gitweb:     http://git.kernel.org/tip/410136f5dd96b6013fe6d1011b523b1c247e1ccb
Author:     Stephane Eranian <eranian@google.com>
AuthorDate: Tue, 12 Nov 2013 17:58:49 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 27 Nov 2013 11:16:39 +0100

tools/perf/stat: Add event unit and scale support

This patch adds perf stat support for handling event units and
scales as exported by the kernel.

The kernel can export a PMU event's actual unit and scaling factor
via sysfs:

  $ ls -1 /sys/devices/power/events/energy-*
  /sys/devices/power/events/energy-cores
  /sys/devices/power/events/energy-cores.scale
  /sys/devices/power/events/energy-cores.unit
  /sys/devices/power/events/energy-pkg
  /sys/devices/power/events/energy-pkg.scale
  /sys/devices/power/events/energy-pkg.unit
  $ cat /sys/devices/power/events/energy-cores.scale
  2.3283064365386962890625e-10
  $ cat /sys/devices/power/events/energy-cores.unit
  Joules

This patch modifies the pmu event alias code to check
for the presence of the .unit and .scale files to load
the corresponding values. They are then used by perf stat
transparently:

   # perf stat -a -e power/energy-pkg/,power/energy-cores/,cycles -I 1000 sleep 1000
   #          time             counts   unit events
       1.000214717               3.07 Joules power/energy-pkg/         [100.00%]
       1.000214717               0.53 Joules power/energy-cores/
       1.000214717           12965028        cycles                    [100.00%]
       2.000749289               3.01 Joules power/energy-pkg/
       2.000749289               0.52 Joules power/energy-cores/
       2.000749289           15817043        cycles

When the event does not have an explicit unit exported by
the kernel, nothing is printed. In CSV output mode, there
will be an empty field.
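
For instance (illustrative output, not a captured run), the unit column
simply stays empty for an event like cycles:

   1.000214717,3.07,Joules,power/energy-pkg/
   1.000214717,12965028,,cycles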

Special thanks to Jiri for providing the supporting code
in the parser to trigger reading of the scale and unit files.
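
As an aside, the .scale string must be parsed in the C locale, because
the kernel always prints it that way. A minimal stand-alone sketch of
the idea (hypothetical, not part of the patch):

  #include <locale.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  int main(void)
  {
          /* value as read from energy-cores.scale */
          const char *sysfs_scale = "2.3283064365386962890625e-10";
          char *saved = strdup(setlocale(LC_NUMERIC, NULL));
          double scale;

          setlocale(LC_NUMERIC, "C");   /* match the kernel's output */
          scale = strtod(sysfs_scale, NULL);
          setlocale(LC_NUMERIC, saved); /* restore caller's locale */
          free(saved);

          /* 2^32 raw units should print as exactly 1.000000 Joules */
          printf("%f Joules\n", 4294967296.0 * scale);
          return 0;
  }

Without the switch, strtod() under a locale that uses ',' as the
decimal separator would stop at the '.' and silently return 2.0.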

Signed-off-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Jiri Olsa <jolsa@redhat.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: zheng.z.yan@intel.com
Cc: bp@alien8.de
Cc: maria.n.dimakopoulou@gmail.com
Cc: acme@redhat.com
Link: http://lkml.kernel.org/r/1384275531-10892-3-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 tools/perf/builtin-stat.c      | 114 +++++++++++++++++++++++++---------
 tools/perf/util/evsel.c        |   2 +
 tools/perf/util/evsel.h        |   3 +
 tools/perf/util/parse-events.c |  28 ++++++---
 tools/perf/util/pmu.c          | 138 +++++++++++++++++++++++++++++++++++++++--
 tools/perf/util/pmu.h          |   3 +-
 6 files changed, 244 insertions(+), 44 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index ee0d565..dab98b5 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -138,6 +138,7 @@ static const char		*post_cmd			= NULL;
 static bool			sync_run			= false;
 static unsigned int		interval			= 0;
 static unsigned int		initial_delay			= 0;
+static unsigned int		unit_width			= 4; /* strlen("unit") */
 static bool			forever				= false;
 static struct timespec		ref_time;
 static struct cpu_map		*aggr_map;
@@ -461,17 +462,17 @@ static void print_interval(void)
 	if (num_print_interval == 0 && !csv_output) {
 		switch (aggr_mode) {
 		case AGGR_SOCKET:
-			fprintf(output, "#           time socket cpus             counts events\n");
+			fprintf(output, "#           time socket cpus             counts %*s events\n", unit_width, "unit");
 			break;
 		case AGGR_CORE:
-			fprintf(output, "#           time core         cpus             counts events\n");
+			fprintf(output, "#           time core         cpus             counts %*s events\n", unit_width, "unit");
 			break;
 		case AGGR_NONE:
-			fprintf(output, "#           time CPU                 counts events\n");
+			fprintf(output, "#           time CPU                counts %*s events\n", unit_width, "unit");
 			break;
 		case AGGR_GLOBAL:
 		default:
-			fprintf(output, "#           time             counts events\n");
+			fprintf(output, "#           time             counts %*s events\n", unit_width, "unit");
 		}
 	}
 
@@ -516,6 +517,7 @@ static int __run_perf_stat(int argc, const char **argv)
 	unsigned long long t0, t1;
 	struct perf_evsel *counter;
 	struct timespec ts;
+	size_t l;
 	int status = 0;
 	const bool forks = (argc > 0);
 
@@ -565,6 +567,10 @@ static int __run_perf_stat(int argc, const char **argv)
 			return -1;
 		}
 		counter->supported = true;
+
+		l = strlen(counter->unit);
+		if (l > unit_width)
+			unit_width = l;
 	}
 
 	if (perf_evlist__apply_filters(evsel_list)) {
@@ -704,14 +710,25 @@ static void aggr_printout(struct perf_evsel *evsel, int id, int nr)
 static void nsec_printout(int cpu, int nr, struct perf_evsel *evsel, double avg)
 {
 	double msecs = avg / 1e6;
-	const char *fmt = csv_output ? "%.6f%s%s" : "%18.6f%s%-25s";
+	const char *fmt_v, *fmt_n;
 	char name[25];
 
+	fmt_v = csv_output ? "%.6f%s" : "%18.6f%s";
+	fmt_n = csv_output ? "%s" : "%-25s";
+
 	aggr_printout(evsel, cpu, nr);
 
 	scnprintf(name, sizeof(name), "%s%s",
 		  perf_evsel__name(evsel), csv_output ? "" : " (msec)");
-	fprintf(output, fmt, msecs, csv_sep, name);
+
+	fprintf(output, fmt_v, msecs, csv_sep);
+
+	if (csv_output)
+		fprintf(output, "%s%s", evsel->unit, csv_sep);
+	else
+		fprintf(output, "%-*s%s", unit_width, evsel->unit, csv_sep);
+
+	fprintf(output, fmt_n, name);
 
 	if (evsel->cgrp)
 		fprintf(output, "%s%s", csv_sep, evsel->cgrp->name);
@@ -908,21 +925,31 @@ static void print_ll_cache_misses(int cpu,
 static void abs_printout(int cpu, int nr, struct perf_evsel *evsel, double avg)
 {
 	double total, ratio = 0.0, total2;
+	double sc =  evsel->scale;
 	const char *fmt;
 
-	if (csv_output)
-		fmt = "%.0f%s%s";
-	else if (big_num)
-		fmt = "%'18.0f%s%-25s";
-	else
-		fmt = "%18.0f%s%-25s";
+	if (csv_output) {
+		fmt = sc != 1.0 ?  "%.2f%s" : "%.0f%s";
+	} else {
+		if (big_num)
+			fmt = sc != 1.0 ? "%'18.2f%s" : "%'18.0f%s";
+		else
+			fmt = sc != 1.0 ? "%18.2f%s" : "%18.0f%s";
+	}
 
 	aggr_printout(evsel, cpu, nr);
 
 	if (aggr_mode == AGGR_GLOBAL)
 		cpu = 0;
 
-	fprintf(output, fmt, avg, csv_sep, perf_evsel__name(evsel));
+	fprintf(output, fmt, avg, csv_sep);
+
+	if (evsel->unit)
+		fprintf(output, "%-*s%s",
+			csv_output ? 0 : unit_width,
+			evsel->unit, csv_sep);
+
+	fprintf(output, "%-*s", csv_output ? 0 : 25, perf_evsel__name(evsel));
 
 	if (evsel->cgrp)
 		fprintf(output, "%s%s", csv_sep, evsel->cgrp->name);
@@ -941,7 +968,10 @@ static void abs_printout(int cpu, int nr, struct perf_evsel *evsel, double avg)
 
 		if (total && avg) {
 			ratio = total / avg;
-			fprintf(output, "\n                                             #   %5.2f  stalled cycles per insn", ratio);
+			fprintf(output, "\n");
+			if (aggr_mode == AGGR_NONE)
+				fprintf(output, "        ");
+			fprintf(output, "                                                  #   %5.2f  stalled cycles per insn", ratio);
 		}
 
 	} else if (perf_evsel__match(evsel, HARDWARE, HW_BRANCH_MISSES) &&
@@ -1061,6 +1091,7 @@ static void print_aggr(char *prefix)
 {
 	struct perf_evsel *counter;
 	int cpu, cpu2, s, s2, id, nr;
+	double uval;
 	u64 ena, run, val;
 
 	if (!(aggr_map || aggr_get_id))
@@ -1087,11 +1118,17 @@ static void print_aggr(char *prefix)
 			if (run == 0 || ena == 0) {
 				aggr_printout(counter, id, nr);
 
-				fprintf(output, "%*s%s%*s",
+				fprintf(output, "%*s%s",
 					csv_output ? 0 : 18,
 					counter->supported ? CNTR_NOT_COUNTED : CNTR_NOT_SUPPORTED,
-					csv_sep,
-					csv_output ? 0 : -24,
+					csv_sep);
+
+				fprintf(output, "%-*s%s",
+					csv_output ? 0 : unit_width,
+					counter->unit, csv_sep);
+
+				fprintf(output, "%*s",
+					csv_output ? 0 : -25,
 					perf_evsel__name(counter));
 
 				if (counter->cgrp)
@@ -1101,11 +1138,12 @@ static void print_aggr(char *prefix)
 				fputc('\n', output);
 				continue;
 			}
+			uval = val * counter->scale;
 
 			if (nsec_counter(counter))
-				nsec_printout(id, nr, counter, val);
+				nsec_printout(id, nr, counter, uval);
 			else
-				abs_printout(id, nr, counter, val);
+				abs_printout(id, nr, counter, uval);
 
 			if (!csv_output) {
 				print_noise(counter, 1.0);
@@ -1128,16 +1166,21 @@ static void print_counter_aggr(struct perf_evsel *counter, char *prefix)
 	struct perf_stat *ps = counter->priv;
 	double avg = avg_stats(&ps->res_stats[0]);
 	int scaled = counter->counts->scaled;
+	double uval;
 
 	if (prefix)
 		fprintf(output, "%s", prefix);
 
 	if (scaled == -1) {
-		fprintf(output, "%*s%s%*s",
+		fprintf(output, "%*s%s",
 			csv_output ? 0 : 18,
 			counter->supported ? CNTR_NOT_COUNTED : CNTR_NOT_SUPPORTED,
-			csv_sep,
-			csv_output ? 0 : -24,
+			csv_sep);
+		fprintf(output, "%-*s%s",
+			csv_output ? 0 : unit_width,
+			counter->unit, csv_sep);
+		fprintf(output, "%*s",
+			csv_output ? 0 : -25,
 			perf_evsel__name(counter));
 
 		if (counter->cgrp)
@@ -1147,10 +1190,12 @@ static void print_counter_aggr(struct perf_evsel *counter, char *prefix)
 		return;
 	}
 
+	uval = avg * counter->scale;
+
 	if (nsec_counter(counter))
-		nsec_printout(-1, 0, counter, avg);
+		nsec_printout(-1, 0, counter, uval);
 	else
-		abs_printout(-1, 0, counter, avg);
+		abs_printout(-1, 0, counter, uval);
 
 	print_noise(counter, avg);
 
@@ -1177,6 +1222,7 @@ static void print_counter_aggr(struct perf_evsel *counter, char *prefix)
 static void print_counter(struct perf_evsel *counter, char *prefix)
 {
 	u64 ena, run, val;
+	double uval;
 	int cpu;
 
 	for (cpu = 0; cpu < perf_evsel__nr_cpus(counter); cpu++) {
@@ -1188,14 +1234,20 @@ static void print_counter(struct perf_evsel *counter, char *prefix)
 			fprintf(output, "%s", prefix);
 
 		if (run == 0 || ena == 0) {
-			fprintf(output, "CPU%*d%s%*s%s%*s",
+			fprintf(output, "CPU%*d%s%*s%s",
 				csv_output ? 0 : -4,
 				perf_evsel__cpus(counter)->map[cpu], csv_sep,
 				csv_output ? 0 : 18,
 				counter->supported ? CNTR_NOT_COUNTED : CNTR_NOT_SUPPORTED,
-				csv_sep,
-				csv_output ? 0 : -24,
-				perf_evsel__name(counter));
+				csv_sep);
+
+			fprintf(output, "%-*s%s",
+				csv_output ? 0 : unit_width,
+				counter->unit, csv_sep);
+
+			fprintf(output, "%*s",
+				csv_output ? 0 : -25,
+				perf_evsel__name(counter));
 
 			if (counter->cgrp)
 				fprintf(output, "%s%s",
@@ -1205,10 +1257,12 @@ static void print_counter(struct perf_evsel *counter, char *prefix)
 			continue;
 		}
 
+		uval = val * counter->scale;
+
 		if (nsec_counter(counter))
-			nsec_printout(cpu, 0, counter, val);
+			nsec_printout(cpu, 0, counter, uval);
 		else
-			abs_printout(cpu, 0, counter, val);
+			abs_printout(cpu, 0, counter, uval);
 
 		if (!csv_output) {
 			print_noise(counter, 1.0);
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 46dd4c2..dad6492 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -162,6 +162,8 @@ void perf_evsel__init(struct perf_evsel *evsel,
 	evsel->idx	   = idx;
 	evsel->attr	   = *attr;
 	evsel->leader	   = evsel;
+	evsel->unit	   = "";
+	evsel->scale	   = 1.0;
 	INIT_LIST_HEAD(&evsel->node);
 	hists__init(&evsel->hists);
 	evsel->sample_size = __perf_evsel__sample_size(attr->sample_type);
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 1ea7c92..8120eeb 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -68,6 +68,8 @@ struct perf_evsel {
 	u32			ids;
 	struct hists		hists;
 	char			*name;
+	double			scale;
+	const char		*unit;
 	struct event_format	*tp_format;
 	union {
 		void		*priv;
@@ -138,6 +140,7 @@ extern const char *perf_evsel__sw_names[PERF_COUNT_SW_MAX];
 int __perf_evsel__hw_cache_type_op_res_name(u8 type, u8 op, u8 result,
 					    char *bf, size_t size);
 const char *perf_evsel__name(struct perf_evsel *evsel);
+
 const char *perf_evsel__group_name(struct perf_evsel *evsel);
 int perf_evsel__group_desc(struct perf_evsel *evsel, char *buf, size_t size);
 
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 6de6f89..969cb8f 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -269,9 +269,10 @@ const char *event_type(int type)
 
 
 
-static int __add_event(struct list_head *list, int *idx,
-		       struct perf_event_attr *attr,
-		       char *name, struct cpu_map *cpus)
+static struct perf_evsel *
+__add_event(struct list_head *list, int *idx,
+	    struct perf_event_attr *attr,
+	    char *name, struct cpu_map *cpus)
 {
 	struct perf_evsel *evsel;
 
@@ -279,19 +280,19 @@ static int __add_event(struct list_head *list, int *idx,
 
 	evsel = perf_evsel__new_idx(attr, (*idx)++);
 	if (!evsel)
-		return -ENOMEM;
+		return NULL;
 
 	evsel->cpus = cpus;
 	if (name)
 		evsel->name = strdup(name);
 	list_add_tail(&evsel->node, list);
-	return 0;
+	return evsel;
 }
 
 static int add_event(struct list_head *list, int *idx,
 		     struct perf_event_attr *attr, char *name)
 {
-	return __add_event(list, idx, attr, name, NULL);
+	return __add_event(list, idx, attr, name, NULL) ? 0 : -ENOMEM;
 }
 
 static int parse_aliases(char *str, const char *names[][PERF_EVSEL__MAX_ALIASES], int size)
@@ -633,6 +634,9 @@ int parse_events_add_pmu(struct list_head *list, int *idx,
 {
 	struct perf_event_attr attr;
 	struct perf_pmu *pmu;
+	struct perf_evsel *evsel;
+	char *unit;
+	double scale;
 
 	pmu = perf_pmu__find(name);
 	if (!pmu)
@@ -640,7 +644,7 @@ int parse_events_add_pmu(struct list_head *list, int *idx,
 
 	memset(&attr, 0, sizeof(attr));
 
-	if (perf_pmu__check_alias(pmu, head_config))
+	if (perf_pmu__check_alias(pmu, head_config, &unit, &scale))
 		return -EINVAL;
 
 	/*
@@ -652,8 +656,14 @@ int parse_events_add_pmu(struct list_head *list, int *idx,
 	if (perf_pmu__config(pmu, &attr, head_config))
 		return -EINVAL;
 
-	return __add_event(list, idx, &attr, pmu_event_name(head_config),
-			   pmu->cpus);
+	evsel = __add_event(list, idx, &attr, pmu_event_name(head_config),
+			    pmu->cpus);
+	if (evsel) {
+		evsel->unit = unit;
+		evsel->scale = scale;
+	}
+
+	return evsel ? 0 : -ENOMEM;
 }
 
 int parse_events__modifier_group(struct list_head *list,
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index c232d8d..56fc10a 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -1,19 +1,23 @@
 #include <linux/list.h>
 #include <sys/types.h>
-#include <sys/stat.h>
 #include <unistd.h>
 #include <stdio.h>
 #include <dirent.h>
 #include "fs.h"
+#include <locale.h>
 #include "util.h"
 #include "pmu.h"
 #include "parse-events.h"
 #include "cpumap.h"
 
+#define UNIT_MAX_LEN	31 /* max length for event unit name */
+
 struct perf_pmu_alias {
 	char *name;
 	struct list_head terms;
 	struct list_head list;
+	char unit[UNIT_MAX_LEN+1];
+	double scale;
 };
 
 struct perf_pmu_format {
@@ -94,7 +98,80 @@ static int pmu_format(const char *name, struct list_head *format)
 	return 0;
 }
 
-static int perf_pmu__new_alias(struct list_head *list, char *name, FILE *file)
+static int perf_pmu__parse_scale(struct perf_pmu_alias *alias, char *dir, char *name)
+{
+	struct stat st;
+	ssize_t sret;
+	char scale[128];
+	int fd, ret = -1;
+	char path[PATH_MAX];
+	char *lc;
+
+	snprintf(path, PATH_MAX, "%s/%s.scale", dir, name);
+
+	fd = open(path, O_RDONLY);
+	if (fd == -1)
+		return -1;
+
+	if (fstat(fd, &st) < 0)
+		goto error;
+
+	sret = read(fd, scale, sizeof(scale)-1);
+	if (sret < 0)
+		goto error;
+
+	scale[sret] = '\0';
+	/*
+	 * save current locale
+	 */
+	lc = setlocale(LC_NUMERIC, NULL);
+
+	/*
+	 * force to C locale to ensure kernel
+	 * scale string is converted correctly.
+	 * kernel uses default C locale.
+	 */
+	setlocale(LC_NUMERIC, "C");
+
+	alias->scale = strtod(scale, NULL);
+
+	/* restore locale */
+	setlocale(LC_NUMERIC, lc);
+
+	ret = 0;
+error:
+	close(fd);
+	return ret;
+}
+
+static int perf_pmu__parse_unit(struct perf_pmu_alias *alias, char *dir, char *name)
+{
+	char path[PATH_MAX];
+	ssize_t sret;
+	int fd;
+
+	snprintf(path, PATH_MAX, "%s/%s.unit", dir, name);
+
+	fd = open(path, O_RDONLY);
+	if (fd == -1)
+		return -1;
+
+	sret = read(fd, alias->unit, UNIT_MAX_LEN);
+	if (sret < 0)
+		goto error;
+
+	close(fd);
+
+	alias->unit[sret] = '\0';
+
+	return 0;
+error:
+	close(fd);
+	alias->unit[0] = '\0';
+	return -1;
+}
+
+static int perf_pmu__new_alias(struct list_head *list, char *dir, char *name, FILE *file)
 {
 	struct perf_pmu_alias *alias;
 	char buf[256];
@@ -110,6 +187,9 @@ static int perf_pmu__new_alias(struct list_head *list, char *name, FILE *file)
 		return -ENOMEM;
 
 	INIT_LIST_HEAD(&alias->terms);
+	alias->scale = 1.0;
+	alias->unit[0] = '\0';
+
 	ret = parse_events_terms(&alias->terms, buf);
 	if (ret) {
 		free(alias);
@@ -117,7 +197,14 @@ static int perf_pmu__new_alias(struct list_head *list, char *name, FILE *file)
 	}
 
 	alias->name = strdup(name);
+	/*
+	 * load unit name and scale if available
+	 */
+	perf_pmu__parse_unit(alias, dir, name);
+	perf_pmu__parse_scale(alias, dir, name);
+
 	list_add_tail(&alias->list, list);
+
 	return 0;
 }
 
@@ -129,6 +216,7 @@ static int pmu_aliases_parse(char *dir, struct list_head *head)
 {
 	struct dirent *evt_ent;
 	DIR *event_dir;
+	size_t len;
 	int ret = 0;
 
 	event_dir = opendir(dir);
@@ -143,13 +231,24 @@ static int pmu_aliases_parse(char *dir, struct list_head *head)
 		if (!strcmp(name, ".") || !strcmp(name, ".."))
 			continue;
 
+		/*
+		 * skip .unit and .scale info files
+		 * parsed in perf_pmu__new_alias()
+		 */
+		len = strlen(name);
+		if (len > 5 && !strcmp(name + len - 5, ".unit"))
+			continue;
+		if (len > 6 && !strcmp(name + len - 6, ".scale"))
+			continue;
+
 		snprintf(path, PATH_MAX, "%s/%s", dir, name);
 
 		ret = -EINVAL;
 		file = fopen(path, "r");
 		if (!file)
 			break;
-		ret = perf_pmu__new_alias(head, name, file);
+
+		ret = perf_pmu__new_alias(head, dir, name, file);
 		fclose(file);
 	}
 
@@ -508,16 +607,42 @@ static struct perf_pmu_alias *pmu_find_alias(struct perf_pmu *pmu,
 	return NULL;
 }
 
+
+static int check_unit_scale(struct perf_pmu_alias *alias,
+			    char **unit, double *scale)
+{
+	/*
+	 * Only one term in event definition can
+	 * define unit and scale, fail if there's
+	 * more than one.
+	 */
+	if ((*unit && alias->unit[0]) ||
+	    (*scale && alias->scale))
+		return -EINVAL;
+
+	if (alias->unit[0])
+		*unit = alias->unit;
+
+	if (alias->scale)
+		*scale = alias->scale;
+
+	return 0;
+}
+
 /*
  * Find alias in the terms list and replace it with the terms
  * defined for the alias
  */
-int perf_pmu__check_alias(struct perf_pmu *pmu, struct list_head *head_terms)
+int perf_pmu__check_alias(struct perf_pmu *pmu, struct list_head *head_terms,
+			  char **unit, double *scale)
 {
 	struct parse_events_term *term, *h;
 	struct perf_pmu_alias *alias;
 	int ret;
 
+	*unit   = NULL;
+	*scale  = 0;
+
 	list_for_each_entry_safe(term, h, head_terms, list) {
 		alias = pmu_find_alias(pmu, term);
 		if (!alias)
@@ -525,6 +650,11 @@ int perf_pmu__check_alias(struct perf_pmu *pmu, struct list_head *head_terms)
 		ret = pmu_alias_terms(alias, &term->list);
 		if (ret)
 			return ret;
+
+		ret = check_unit_scale(alias, unit, scale);
+		if (ret)
+			return ret;
+
 		list_del(&term->list);
 		free(term);
 	}
diff --git a/tools/perf/util/pmu.h b/tools/perf/util/pmu.h
index 1179b26..9183380 100644
--- a/tools/perf/util/pmu.h
+++ b/tools/perf/util/pmu.h
@@ -28,7 +28,8 @@ int perf_pmu__config(struct perf_pmu *pmu, struct perf_event_attr *attr,
 int perf_pmu__config_terms(struct list_head *formats,
 			   struct perf_event_attr *attr,
 			   struct list_head *head_terms);
-int perf_pmu__check_alias(struct perf_pmu *pmu, struct list_head *head_terms);
+int perf_pmu__check_alias(struct perf_pmu *pmu, struct list_head *head_terms,
+			  char **unit, double *scale);
 struct list_head *perf_pmu__alias(struct perf_pmu *pmu,
 				  struct list_head *head_terms);
 int perf_pmu_wrap(void);

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [tip:perf/core] perf/x86: Add Intel RAPL PMU support
  2013-11-12 16:58 ` [PATCH v7 3/4] perf,x86: add Intel RAPL PMU support Stephane Eranian
  2013-11-27 10:23   ` Peter Zijlstra
@ 2013-11-27 14:08   ` tip-bot for Stephane Eranian
  2013-11-27 18:35   ` [PATCH v7 3/4] perf,x86: add " Vince Weaver
  2 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Stephane Eranian @ 2013-11-27 14:08 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, eranian, hpa, mingo, peterz, maria.n.dimakopoulou,
	ak, tglx

Commit-ID:  4788e5b4b2338f85fa42a712a182d8afd65d7c58
Gitweb:     http://git.kernel.org/tip/4788e5b4b2338f85fa42a712a182d8afd65d7c58
Author:     Stephane Eranian <eranian@google.com>
AuthorDate: Tue, 12 Nov 2013 17:58:50 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 27 Nov 2013 11:16:40 +0100

perf/x86: Add Intel RAPL PMU support

This patch adds a new uncore PMU to expose the Intel
RAPL energy consumption counters. Up to 3 counters,
each counting a particular RAPL event, are exposed.

The RAPL counters are available on Intel SandyBridge,
IvyBridge and Haswell. The server SKUs add a 3rd counter.

The following events are available and exposed in sysfs:

  - power/energy-cores: power consumption of all cores on socket
  - power/energy-pkg: power consumption of all cores + LLC cache
  - power/energy-dram: power consumption of DRAM (servers only)

For each event, both the unit (Joules) and scale (2^-32 J)
are exposed in sysfs for use by perf stat and other tools.
The files are:

	/sys/devices/power/events/energy-*.unit
	/sys/devices/power/events/energy-*.scale

The RAPL PMU is uncore by nature and is implemented such
that it only works in system-wide mode. Measuring only
one CPU per socket is sufficient. The /sys/devices/power/cpumask
file can be used by tools to figure out which CPUs to monitor
by default. For instance, on a 2-socket system, 2 CPUs
(one on each socket) will be shown.

All the counters measure in the same unit (exposed via sysfs).
The perf_events API exposes all RAPL counters as 64-bit integers
counting in unit of 1/2^32 Joules (about 0.23 nJ). User level tools
must convert the counts by multiplying them by 2^-32 to obtain
Joules. The reason for this is that the kernel avoids
doing floating point math whenever possible because it is
expensive (user floating-point state must be saved). The method
used avoids kernel floating-point usage. There is no loss of
precision. Thanks to PeterZ for suggesting this approach.

To convert a raw count to Watts:
   W = C * 2.3 / (1e10 * time)
The energy in Joules is ldexp(C, -32).
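
For illustration, a hypothetical user-space conversion (not part of
the patch; build with -lm):

  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
          unsigned long long raw = 10737418240ULL; /* sample raw count */
          double seconds = 2.0;                    /* sample interval  */
          double joules = ldexp((double)raw, -32); /* raw * 2^-32      */
          double watts = joules / seconds;

          printf("%.3f J over %.1f s = %.3f W\n", joules, seconds, watts);
          return 0;
  }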

RAPL PMU is a new standalone PMU which registers with the
perf_event core subsystem. The PMU type (attr->type) is
dynamically allocated and is available from /sys/devices/power/type.

Sampling is not supported by the RAPL PMU. There is no
privilege level filtering either.

Signed-off-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: acme@redhat.com
Cc: jolsa@redhat.com
Cc: zheng.z.yan@intel.com
Cc: bp@alien8.de
Link: http://lkml.kernel.org/r/1384275531-10892-4-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/cpu/Makefile                |   2 +-
 arch/x86/kernel/cpu/perf_event_intel_rapl.c | 591 ++++++++++++++++++++++++++++
 2 files changed, 592 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 47b56a7..6359506 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -36,7 +36,7 @@ obj-$(CONFIG_CPU_SUP_AMD)		+= perf_event_amd_iommu.o
 endif
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_p6.o perf_event_knc.o perf_event_p4.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_lbr.o perf_event_intel_ds.o perf_event_intel.o
-obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_uncore.o
+obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_uncore.o perf_event_intel_rapl.o
 endif
 
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_rapl.c b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
new file mode 100644
index 0000000..cfcd386
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
@@ -0,0 +1,591 @@
+/*
+ * perf_event_intel_rapl.c: support Intel RAPL energy consumption counters
+ * Copyright (C) 2013 Google, Inc., Stephane Eranian
+ *
+ * Intel RAPL interface is specified in the IA-32 Manual Vol3b
+ * section 14.7.1 (September 2013)
+ *
+ * RAPL provides more controls than just reporting energy consumption;
+ * however, here we only expose the 3 energy consumption free running
+ * counters (pp0, pkg, dram).
+ *
+ * Each of those counters increments in a power unit defined by the
+ * RAPL_POWER_UNIT MSR. On SandyBridge, this unit is 1/(2^16) Joules
+ * but it can vary.
+ *
+ * Counter to rapl events mappings:
+ *
+ *  pp0 counter: consumption of all physical cores (power plane 0)
+ * 	  event: rapl_energy_cores
+ *    perf code: 0x1
+ *
+ *  pkg counter: consumption of the whole processor package
+ *	  event: rapl_energy_pkg
+ *    perf code: 0x2
+ *
+ * dram counter: consumption of the dram domain (servers only)
+ *	  event: rapl_energy_dram
+ *    perf code: 0x3
+ *
+ * We manage those counters as free running (read-only). They may be
+ * used simultaneously by other tools, such as turbostat.
+ *
+ * The events only support system-wide mode counting. There is no
+ * sampling support because it does not make sense and is not
+ * supported by the RAPL hardware.
+ *
+ * Because we want to avoid floating-point operations in the kernel,
+ * the events are all reported in fixed point arithmetic (32.32).
+ * Tools must adjust the counts to convert them to Watts using
+ * the duration of the measurement. Tools may use a function such as
+ * ldexp(raw_count, -32);
+ */
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/perf_event.h>
+#include <asm/cpu_device_id.h>
+#include "perf_event.h"
+
+/*
+ * RAPL energy status counters
+ */
+#define RAPL_IDX_PP0_NRG_STAT	0	/* all cores */
+#define INTEL_RAPL_PP0		0x1	/* pseudo-encoding */
+#define RAPL_IDX_PKG_NRG_STAT	1	/* entire package */
+#define INTEL_RAPL_PKG		0x2	/* pseudo-encoding */
+#define RAPL_IDX_RAM_NRG_STAT	2	/* DRAM */
+#define INTEL_RAPL_RAM		0x3	/* pseudo-encoding */
+
+/* Clients have PP0, PKG */
+#define RAPL_IDX_CLN	(1<<RAPL_IDX_PP0_NRG_STAT|\
+			 1<<RAPL_IDX_PKG_NRG_STAT)
+
+/* Servers have PP0, PKG, RAM */
+#define RAPL_IDX_SRV	(1<<RAPL_IDX_PP0_NRG_STAT|\
+			 1<<RAPL_IDX_PKG_NRG_STAT|\
+			 1<<RAPL_IDX_RAM_NRG_STAT)
+
+/*
+ * event code: LSB 8 bits, passed in attr->config
+ * any other bit is reserved
+ */
+#define RAPL_EVENT_MASK	0xFFULL
+
+#define DEFINE_RAPL_FORMAT_ATTR(_var, _name, _format)		\
+static ssize_t __rapl_##_var##_show(struct kobject *kobj,	\
+				struct kobj_attribute *attr,	\
+				char *page)			\
+{								\
+	BUILD_BUG_ON(sizeof(_format) >= PAGE_SIZE);		\
+	return sprintf(page, _format "\n");			\
+}								\
+static struct kobj_attribute format_attr_##_var =		\
+	__ATTR(_name, 0444, __rapl_##_var##_show, NULL)
+
+#define RAPL_EVENT_DESC(_name, _config)				\
+{								\
+	.attr	= __ATTR(_name, 0444, rapl_event_show, NULL),	\
+	.config	= _config,					\
+}
+
+#define RAPL_CNTR_WIDTH 32 /* 32-bit rapl counters */
+
+struct rapl_pmu {
+	spinlock_t	 lock;
+	int		 hw_unit;  /* 1/2^hw_unit Joule */
+	int		 n_active; /* number of active events */
+	struct list_head active_list;
+	struct pmu	 *pmu; /* pointer to rapl_pmu_class */
+};
+
+static struct pmu rapl_pmu_class;
+static cpumask_t rapl_cpu_mask;
+static int rapl_cntr_mask;
+
+static DEFINE_PER_CPU(struct rapl_pmu *, rapl_pmu);
+static DEFINE_PER_CPU(struct rapl_pmu *, rapl_pmu_to_free);
+
+static inline u64 rapl_read_counter(struct perf_event *event)
+{
+	u64 raw;
+	rdmsrl(event->hw.event_base, raw);
+	return raw;
+}
+
+static inline u64 rapl_scale(u64 v)
+{
+	/*
+	 * scale delta to smallest unit (1/2^32)
+	 * users must then scale back: count * 1/2^32 to get Joules
+	 * or use ldexp(count, -32).
+	 * Watts = Joules/Time delta
+	 */
+	return v << (32 - __get_cpu_var(rapl_pmu)->hw_unit);
+}
+
+static u64 rapl_event_update(struct perf_event *event)
+{
+	struct hw_perf_event *hwc = &event->hw;
+	u64 prev_raw_count, new_raw_count;
+	s64 delta, sdelta;
+	int shift = RAPL_CNTR_WIDTH;
+
+again:
+	prev_raw_count = local64_read(&hwc->prev_count);
+	rdmsrl(event->hw.event_base, new_raw_count);
+
+	if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
+			    new_raw_count) != prev_raw_count) {
+		cpu_relax();
+		goto again;
+	}
+
+	/*
+	 * Now we have the new raw value and have updated the prev
+	 * timestamp already. We can now calculate the elapsed delta
+	 * (event-)time and add that to the generic event.
+	 *
+	 * Careful, not all hw sign-extends above the physical width
+	 * of the count.
+	 */
+	delta = (new_raw_count << shift) - (prev_raw_count << shift);
+	delta >>= shift;
+
+	sdelta = rapl_scale(delta);
+
+	local64_add(sdelta, &event->count);
+
+	return new_raw_count;
+}
+
+static void __rapl_pmu_event_start(struct rapl_pmu *pmu,
+				   struct perf_event *event)
+{
+	if (WARN_ON_ONCE(!(event->hw.state & PERF_HES_STOPPED)))
+		return;
+
+	event->hw.state = 0;
+
+	list_add_tail(&event->active_entry, &pmu->active_list);
+
+	local64_set(&event->hw.prev_count, rapl_read_counter(event));
+
+	pmu->n_active++;
+}
+
+static void rapl_pmu_event_start(struct perf_event *event, int mode)
+{
+	struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
+	unsigned long flags;
+
+	spin_lock_irqsave(&pmu->lock, flags);
+	__rapl_pmu_event_start(pmu, event);
+	spin_unlock_irqrestore(&pmu->lock, flags);
+}
+
+static void rapl_pmu_event_stop(struct perf_event *event, int mode)
+{
+	struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
+	struct hw_perf_event *hwc = &event->hw;
+	unsigned long flags;
+
+	spin_lock_irqsave(&pmu->lock, flags);
+
+	/* mark event as deactivated and stopped */
+	if (!(hwc->state & PERF_HES_STOPPED)) {
+		WARN_ON_ONCE(pmu->n_active <= 0);
+		pmu->n_active--;
+
+		list_del(&event->active_entry);
+
+		WARN_ON_ONCE(hwc->state & PERF_HES_STOPPED);
+		hwc->state |= PERF_HES_STOPPED;
+	}
+
+	/* check if update of sw counter is necessary */
+	if ((mode & PERF_EF_UPDATE) && !(hwc->state & PERF_HES_UPTODATE)) {
+		/*
+		 * Drain the remaining delta count out of a event
+		 * that we are disabling:
+		 */
+		rapl_event_update(event);
+		hwc->state |= PERF_HES_UPTODATE;
+	}
+
+	spin_unlock_irqrestore(&pmu->lock, flags);
+}
+
+static int rapl_pmu_event_add(struct perf_event *event, int mode)
+{
+	struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
+	struct hw_perf_event *hwc = &event->hw;
+	unsigned long flags;
+
+	spin_lock_irqsave(&pmu->lock, flags);
+
+	hwc->state = PERF_HES_UPTODATE | PERF_HES_STOPPED;
+
+	if (mode & PERF_EF_START)
+		__rapl_pmu_event_start(pmu, event);
+
+	spin_unlock_irqrestore(&pmu->lock, flags);
+
+	return 0;
+}
+
+static void rapl_pmu_event_del(struct perf_event *event, int flags)
+{
+	rapl_pmu_event_stop(event, PERF_EF_UPDATE);
+}
+
+static int rapl_pmu_event_init(struct perf_event *event)
+{
+	u64 cfg = event->attr.config & RAPL_EVENT_MASK;
+	int bit, msr, ret = 0;
+
+	/* only look at RAPL events */
+	if (event->attr.type != rapl_pmu_class.type)
+		return -ENOENT;
+
+	/* check only supported bits are set */
+	if (event->attr.config & ~RAPL_EVENT_MASK)
+		return -EINVAL;
+
+	/*
+	 * check event is known (determines counter)
+	 */
+	switch (cfg) {
+	case INTEL_RAPL_PP0:
+		bit = RAPL_IDX_PP0_NRG_STAT;
+		msr = MSR_PP0_ENERGY_STATUS;
+		break;
+	case INTEL_RAPL_PKG:
+		bit = RAPL_IDX_PKG_NRG_STAT;
+		msr = MSR_PKG_ENERGY_STATUS;
+		break;
+	case INTEL_RAPL_RAM:
+		bit = RAPL_IDX_RAM_NRG_STAT;
+		msr = MSR_DRAM_ENERGY_STATUS;
+		break;
+	default:
+		return -EINVAL;
+	}
+	/* check event supported */
+	if (!(rapl_cntr_mask & (1 << bit)))
+		return -EINVAL;
+
+	/* unsupported modes and filters */
+	if (event->attr.exclude_user   ||
+	    event->attr.exclude_kernel ||
+	    event->attr.exclude_hv     ||
+	    event->attr.exclude_idle   ||
+	    event->attr.exclude_host   ||
+	    event->attr.exclude_guest  ||
+	    event->attr.sample_period) /* no sampling */
+		return -EINVAL;
+
+	/* must be done before validate_group */
+	event->hw.event_base = msr;
+	event->hw.config = cfg;
+	event->hw.idx = bit;
+
+	return ret;
+}
+
+static void rapl_pmu_event_read(struct perf_event *event)
+{
+	rapl_event_update(event);
+}
+
+static ssize_t rapl_get_attr_cpumask(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	int n = cpulist_scnprintf(buf, PAGE_SIZE - 2, &rapl_cpu_mask);
+
+	buf[n++] = '\n';
+	buf[n] = '\0';
+	return n;
+}
+
+static DEVICE_ATTR(cpumask, S_IRUGO, rapl_get_attr_cpumask, NULL);
+
+static struct attribute *rapl_pmu_attrs[] = {
+	&dev_attr_cpumask.attr,
+	NULL,
+};
+
+static struct attribute_group rapl_pmu_attr_group = {
+	.attrs = rapl_pmu_attrs,
+};
+
+EVENT_ATTR_STR(energy-cores, rapl_cores, "event=0x01");
+EVENT_ATTR_STR(energy-pkg  , rapl_pkg, "event=0x02");
+EVENT_ATTR_STR(energy-ram  , rapl_ram, "event=0x03");
+
+EVENT_ATTR_STR(energy-cores.unit, rapl_cores_unit, "Joules");
+EVENT_ATTR_STR(energy-pkg.unit  , rapl_pkg_unit, "Joules");
+EVENT_ATTR_STR(energy-ram.unit  , rapl_ram_unit, "Joules");
+
+/*
+ * we compute in 0.23 nJ increments regardless of MSR
+ */
+EVENT_ATTR_STR(energy-cores.scale, rapl_cores_scale, "2.3283064365386962890625e-10");
+EVENT_ATTR_STR(energy-pkg.scale, rapl_pkg_scale, "2.3283064365386962890625e-10");
+EVENT_ATTR_STR(energy-ram.scale, rapl_ram_scale, "2.3283064365386962890625e-10");
+
+static struct attribute *rapl_events_srv_attr[] = {
+	EVENT_PTR(rapl_cores),
+	EVENT_PTR(rapl_pkg),
+	EVENT_PTR(rapl_ram),
+
+	EVENT_PTR(rapl_cores_unit),
+	EVENT_PTR(rapl_pkg_unit),
+	EVENT_PTR(rapl_ram_unit),
+
+	EVENT_PTR(rapl_cores_scale),
+	EVENT_PTR(rapl_pkg_scale),
+	EVENT_PTR(rapl_ram_scale),
+	NULL,
+};
+
+static struct attribute *rapl_events_cln_attr[] = {
+	EVENT_PTR(rapl_cores),
+	EVENT_PTR(rapl_pkg),
+
+	EVENT_PTR(rapl_cores_unit),
+	EVENT_PTR(rapl_pkg_unit),
+
+	EVENT_PTR(rapl_cores_scale),
+	EVENT_PTR(rapl_pkg_scale),
+	NULL,
+};
+
+static struct attribute_group rapl_pmu_events_group = {
+	.name = "events",
+	.attrs = NULL, /* patched at runtime */
+};
+
+DEFINE_RAPL_FORMAT_ATTR(event, event, "config:0-7");
+static struct attribute *rapl_formats_attr[] = {
+	&format_attr_event.attr,
+	NULL,
+};
+
+static struct attribute_group rapl_pmu_format_group = {
+	.name = "format",
+	.attrs = rapl_formats_attr,
+};
+
+const struct attribute_group *rapl_attr_groups[] = {
+	&rapl_pmu_attr_group,
+	&rapl_pmu_format_group,
+	&rapl_pmu_events_group,
+	NULL,
+};
+
+static struct pmu rapl_pmu_class = {
+	.attr_groups	= rapl_attr_groups,
+	.task_ctx_nr	= perf_invalid_context, /* system-wide only */
+	.event_init	= rapl_pmu_event_init,
+	.add		= rapl_pmu_event_add, /* must have */
+	.del		= rapl_pmu_event_del, /* must have */
+	.start		= rapl_pmu_event_start,
+	.stop		= rapl_pmu_event_stop,
+	.read		= rapl_pmu_event_read,
+};
+
+static void rapl_cpu_exit(int cpu)
+{
+	struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu);
+	int i, phys_id = topology_physical_package_id(cpu);
+	int target = -1;
+
+	/* find a new cpu on same package */
+	for_each_online_cpu(i) {
+		if (i == cpu)
+			continue;
+		if (phys_id == topology_physical_package_id(i)) {
+			target = i;
+			break;
+		}
+	}
+	/*
+	 * clear cpu from cpumask
+	 * if was set in cpumask and still some cpu on package,
+	 * then move to new cpu
+	 */
+	if (cpumask_test_and_clear_cpu(cpu, &rapl_cpu_mask) && target >= 0)
+		cpumask_set_cpu(target, &rapl_cpu_mask);
+
+	WARN_ON(cpumask_empty(&rapl_cpu_mask));
+	/*
+	 * migrate events and context to new cpu
+	 */
+	if (target >= 0)
+		perf_pmu_migrate_context(pmu->pmu, cpu, target);
+}
+
+static void rapl_cpu_init(int cpu)
+{
+	int i, phys_id = topology_physical_package_id(cpu);
+
+	/* check if phys_id is already covered */
+	for_each_cpu(i, &rapl_cpu_mask) {
+		if (phys_id == topology_physical_package_id(i))
+			return;
+	}
+	/* was not found, so add it */
+	cpumask_set_cpu(cpu, &rapl_cpu_mask);
+}
+
+static int rapl_cpu_prepare(int cpu)
+{
+	struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu);
+	int phys_id = topology_physical_package_id(cpu);
+
+	if (pmu)
+		return 0;
+
+	if (phys_id < 0)
+		return -1;
+
+	pmu = kzalloc_node(sizeof(*pmu), GFP_KERNEL, cpu_to_node(cpu));
+	if (!pmu)
+		return -1;
+
+	spin_lock_init(&pmu->lock);
+
+	INIT_LIST_HEAD(&pmu->active_list);
+
+	/*
+	 * grab power unit as: 1/2^unit Joules
+	 *
+	 * we cache in local PMU instance
+	 */
+	rdmsrl(MSR_RAPL_POWER_UNIT, pmu->hw_unit);
+	pmu->hw_unit = (pmu->hw_unit >> 8) & 0x1FULL;
+	pmu->pmu = &rapl_pmu_class;
+
+	/* set RAPL pmu for this cpu for now */
+	per_cpu(rapl_pmu, cpu) = pmu;
+	per_cpu(rapl_pmu_to_free, cpu) = NULL;
+
+	return 0;
+}
+
+static void rapl_cpu_kfree(int cpu)
+{
+	struct rapl_pmu *pmu = per_cpu(rapl_pmu_to_free, cpu);
+
+	kfree(pmu);
+
+	per_cpu(rapl_pmu_to_free, cpu) = NULL;
+}
+
+static int rapl_cpu_dying(int cpu)
+{
+	struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu);
+
+	if (!pmu)
+		return 0;
+
+	per_cpu(rapl_pmu, cpu) = NULL;
+
+	per_cpu(rapl_pmu_to_free, cpu) = pmu;
+
+	return 0;
+}
+
+static int rapl_cpu_notifier(struct notifier_block *self,
+			     unsigned long action, void *hcpu)
+{
+	unsigned int cpu = (long)hcpu;
+
+	switch (action & ~CPU_TASKS_FROZEN) {
+	case CPU_UP_PREPARE:
+		rapl_cpu_prepare(cpu);
+		break;
+	case CPU_STARTING:
+		rapl_cpu_init(cpu);
+		break;
+	case CPU_UP_CANCELED:
+	case CPU_DYING:
+		rapl_cpu_dying(cpu);
+		break;
+	case CPU_ONLINE:
+	case CPU_DEAD:
+		rapl_cpu_kfree(cpu);
+		break;
+	case CPU_DOWN_PREPARE:
+		rapl_cpu_exit(cpu);
+		break;
+	default:
+		break;
+	}
+
+	return NOTIFY_OK;
+}
+
+static const struct x86_cpu_id rapl_cpu_match[] = {
+	[0] = { .vendor = X86_VENDOR_INTEL, .family = 6 },
+	[1] = {},
+};
+
+static int __init rapl_pmu_init(void)
+{
+	struct rapl_pmu *pmu;
+	int cpu, ret;
+
+	/*
+	 * check for Intel processor family 6
+	 */
+	if (!x86_match_cpu(rapl_cpu_match))
+		return 0;
+
+	/* check supported CPU */
+	switch (boot_cpu_data.x86_model) {
+	case 42: /* Sandy Bridge */
+	case 58: /* Ivy Bridge */
+	case 60: /* Haswell */
+		rapl_cntr_mask = RAPL_IDX_CLN;
+		rapl_pmu_events_group.attrs = rapl_events_cln_attr;
+		break;
+	case 45: /* Sandy Bridge-EP */
+	case 62: /* IvyTown */
+		rapl_cntr_mask = RAPL_IDX_SRV;
+		rapl_pmu_events_group.attrs = rapl_events_srv_attr;
+		break;
+
+	default:
+		/* unsupported */
+		return 0;
+	}
+	get_online_cpus();
+
+	for_each_online_cpu(cpu) {
+		rapl_cpu_prepare(cpu);
+		rapl_cpu_init(cpu);
+	}
+
+	perf_cpu_notifier(rapl_cpu_notifier);
+
+	ret = perf_pmu_register(&rapl_pmu_class, "power", -1);
+	if (WARN_ON(ret)) {
+		pr_info("RAPL PMU detected, registration failed (%d), RAPL PMU disabled\n", ret);
+		put_online_cpus();
+		return -1;
+	}
+
+	pmu = __get_cpu_var(rapl_pmu);
+
+	pr_info("RAPL PMU detected, hw unit 2^-%d Joules,"
+		" API unit is 2^-32 Joules,"
+		" %d fixed counters\n",
+		pmu->hw_unit,
+		hweight32(rapl_cntr_mask));
+
+	put_online_cpus();
+
+	return 0;
+}
+device_initcall(rapl_pmu_init);

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [tip:perf/core] perf/x86: Add RAPL hrtimer support
  2013-11-12 16:58 ` [PATCH v7 4/4] perf,x86: add RAPL hrtimer support Stephane Eranian
@ 2013-11-27 14:10   ` tip-bot for Stephane Eranian
  2013-11-27 14:33   ` tip-bot for Stephane Eranian
  1 sibling, 0 replies; 22+ messages in thread
From: tip-bot for Stephane Eranian @ 2013-11-27 14:10 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, eranian, hpa, mingo, peterz, ak, tglx

Commit-ID:  eafade90e67a9b627aee331a9d0cc0c4328fb00e
Gitweb:     http://git.kernel.org/tip/eafade90e67a9b627aee331a9d0cc0c4328fb00e
Author:     Stephane Eranian <eranian@google.com>
AuthorDate: Tue, 12 Nov 2013 17:58:51 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 27 Nov 2013 13:45:04 +0100

perf/x86: Add RAPL hrtimer support

The RAPL PMU counters do not interrupt on overflow.
Therefore, the kernel needs to poll the counters
to avoid missing an overflow. This patch adds
the hrtimer code to do this.

The timer interval is calculated at boot time
based on the power unit used by the HW.
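
As a worked example, take the patch's 200 W worst-case reference and a
hardware unit of 1/2^16 Joules (typical of SandyBridge): the 32-bit
counter then wraps after 2^(32-16) = 65536 Joules, which at 200 W takes
about 328 seconds. The timer is armed at half that interval,
5 * 2^15 ms (about 164 s), so polling never runs in lockstep with the
overflow.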

There is one hrtimer per CPU to handle the case
of multiple simultaneous use across cores on
the same package, plus CPU hotplug.

Signed-off-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
[ Applied 32-bit build fix. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: acme@redhat.com
Cc: jolsa@redhat.com
Cc: zheng.z.yan@intel.com
Cc: bp@alien8.de
Cc: maria.n.dimakopoulou@gmail.com
Link: http://lkml.kernel.org/r/1384275531-10892-5-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/cpu/perf_event_intel_rapl.c | 74 ++++++++++++++++++++++++++++-
 1 file changed, 72 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_rapl.c b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
index cfcd386..bf8e4a7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_rapl.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
@@ -96,6 +96,8 @@ struct rapl_pmu {
 	int		 n_active; /* number of active events */
 	struct list_head active_list;
 	struct pmu	 *pmu; /* pointer to rapl_pmu_class */
+	ktime_t		 timer_interval; /* in ktime_t unit */
+	struct hrtimer   hrtimer;
 };
 
 static struct pmu rapl_pmu_class;
@@ -158,6 +160,48 @@ again:
 	return new_raw_count;
 }
 
+static void rapl_start_hrtimer(struct rapl_pmu *pmu)
+{
+	__hrtimer_start_range_ns(&pmu->hrtimer,
+			pmu->timer_interval, 0,
+			HRTIMER_MODE_REL_PINNED, 0);
+}
+
+static void rapl_stop_hrtimer(struct rapl_pmu *pmu)
+{
+	hrtimer_cancel(&pmu->hrtimer);
+}
+
+static enum hrtimer_restart rapl_hrtimer_handle(struct hrtimer *hrtimer)
+{
+	struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
+	struct perf_event *event;
+	unsigned long flags;
+
+	if (!pmu->n_active)
+		return HRTIMER_NORESTART;
+
+	spin_lock_irqsave(&pmu->lock, flags);
+
+	list_for_each_entry(event, &pmu->active_list, active_entry) {
+		rapl_event_update(event);
+	}
+
+	spin_unlock_irqrestore(&pmu->lock, flags);
+
+	hrtimer_forward_now(hrtimer, pmu->timer_interval);
+
+	return HRTIMER_RESTART;
+}
+
+static void rapl_hrtimer_init(struct rapl_pmu *pmu)
+{
+	struct hrtimer *hr = &pmu->hrtimer;
+
+	hrtimer_init(hr, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hr->function = rapl_hrtimer_handle;
+}
+
 static void __rapl_pmu_event_start(struct rapl_pmu *pmu,
 				   struct perf_event *event)
 {
@@ -171,6 +215,8 @@ static void __rapl_pmu_event_start(struct rapl_pmu *pmu,
 	local64_set(&event->hw.prev_count, rapl_read_counter(event));
 
 	pmu->n_active++;
+	if (pmu->n_active == 1)
+		rapl_start_hrtimer(pmu);
 }
 
 static void rapl_pmu_event_start(struct perf_event *event, int mode)
@@ -195,6 +241,8 @@ static void rapl_pmu_event_stop(struct perf_event *event, int mode)
 	if (!(hwc->state & PERF_HES_STOPPED)) {
 		WARN_ON_ONCE(pmu->n_active <= 0);
 		pmu->n_active--;
+		if (pmu->n_active == 0)
+			rapl_stop_hrtimer(pmu);
 
 		list_del(&event->active_entry);
 
@@ -423,6 +471,9 @@ static void rapl_cpu_exit(int cpu)
 	 */
 	if (target >= 0)
 		perf_pmu_migrate_context(pmu->pmu, cpu, target);
+
+	/* cancel overflow polling timer for CPU */
+	rapl_stop_hrtimer(pmu);
 }
 
 static void rapl_cpu_init(int cpu)
@@ -442,6 +493,7 @@ static int rapl_cpu_prepare(int cpu)
 {
 	struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu);
 	int phys_id = topology_physical_package_id(cpu);
+	u64 ms;
 
 	if (pmu)
 		return 0;
@@ -466,6 +518,22 @@ static int rapl_cpu_prepare(int cpu)
 	pmu->hw_unit = (pmu->hw_unit >> 8) & 0x1FULL;
 	pmu->pmu = &rapl_pmu_class;
 
+	/*
+	 * use reference of 200W for scaling the timeout
+	 * to avoid missing counter overflows.
+	 * 200W = 200 Joules/sec
+	 * divide interval by 2 to avoid lockstep (2 * 100)
+	 * if hw unit is 32, then we use 2 ms 1/200/2
+	 */
+	if (pmu->hw_unit < 32)
+		ms = (1000 / (2 * 100)) * (1ULL << (32 - pmu->hw_unit - 1));
+	else
+		ms = 2;
+
+	pmu->timer_interval = ms_to_ktime(ms);
+
+	rapl_hrtimer_init(pmu);
+
 	/* set RAPL pmu for this cpu for now */
 	per_cpu(rapl_pmu, cpu) = pmu;
 	per_cpu(rapl_pmu_to_free, cpu) = NULL;
@@ -580,9 +648,11 @@ static int __init rapl_pmu_init(void)
 
 	pr_info("RAPL PMU detected, hw unit 2^-%d Joules,"
 		" API unit is 2^-32 Joules,"
-		" %d fixed counters\n",
+		" %d fixed counters"
+		" %llu ms ovfl timer\n",
 		pmu->hw_unit,
-		hweight32(rapl_cntr_mask));
+		hweight32(rapl_cntr_mask),
+		ktime_to_ms(pmu->timer_interval));
 
 	put_online_cpus();
 

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [tip:perf/core] perf/x86: Add RAPL hrtimer support
  2013-11-12 16:58 ` [PATCH v7 4/4] perf,x86: add RAPL hrtimer support Stephane Eranian
  2013-11-27 14:10   ` [tip:perf/core] perf/x86: Add " tip-bot for Stephane Eranian
@ 2013-11-27 14:33   ` tip-bot for Stephane Eranian
  1 sibling, 0 replies; 22+ messages in thread
From: tip-bot for Stephane Eranian @ 2013-11-27 14:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, eranian, hpa, mingo, peterz, maria.n.dimakopoulou,
	ak, tglx

Commit-ID:  65661f96d3b32f4b28fef26d21be81d7e173b965
Gitweb:     http://git.kernel.org/tip/65661f96d3b32f4b28fef26d21be81d7e173b965
Author:     Stephane Eranian <eranian@google.com>
AuthorDate: Tue, 12 Nov 2013 17:58:51 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 27 Nov 2013 15:31:23 +0100

perf/x86: Add RAPL hrtimer support

The RAPL PMU counters do not interrupt on overflow.
Therefore, the kernel needs to poll the counters
to avoid missing an overflow. This patch adds
the hrtimer code to do this.

The timer interval is calculated at boot time
based on the power unit used by the HW.

There is one hrtimer per CPU to handle the case
of multiple simultaneous use across cores on
the same package, plus CPU hotplug.

Thanks to Maria Dimakopoulou for her contributions
to this patch, especially on the math aspects.

Signed-off-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
[ Applied 32-bit build fix. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: acme@redhat.com
Cc: jolsa@redhat.com
Cc: zheng.z.yan@intel.com
Cc: bp@alien8.de
Cc: maria.n.dimakopoulou@gmail.com
Link: http://lkml.kernel.org/r/1384275531-10892-5-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/cpu/perf_event_intel_rapl.c | 74 ++++++++++++++++++++++++++++-
 1 file changed, 72 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_rapl.c b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
index cfcd386..bf8e4a7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_rapl.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
@@ -96,6 +96,8 @@ struct rapl_pmu {
 	int		 n_active; /* number of active events */
 	struct list_head active_list;
 	struct pmu	 *pmu; /* pointer to rapl_pmu_class */
+	ktime_t		 timer_interval; /* in ktime_t unit */
+	struct hrtimer   hrtimer;
 };
 
 static struct pmu rapl_pmu_class;
@@ -158,6 +160,48 @@ again:
 	return new_raw_count;
 }
 
+static void rapl_start_hrtimer(struct rapl_pmu *pmu)
+{
+	__hrtimer_start_range_ns(&pmu->hrtimer,
+			pmu->timer_interval, 0,
+			HRTIMER_MODE_REL_PINNED, 0);
+}
+
+static void rapl_stop_hrtimer(struct rapl_pmu *pmu)
+{
+	hrtimer_cancel(&pmu->hrtimer);
+}
+
+static enum hrtimer_restart rapl_hrtimer_handle(struct hrtimer *hrtimer)
+{
+	struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
+	struct perf_event *event;
+	unsigned long flags;
+
+	if (!pmu->n_active)
+		return HRTIMER_NORESTART;
+
+	spin_lock_irqsave(&pmu->lock, flags);
+
+	list_for_each_entry(event, &pmu->active_list, active_entry) {
+		rapl_event_update(event);
+	}
+
+	spin_unlock_irqrestore(&pmu->lock, flags);
+
+	hrtimer_forward_now(hrtimer, pmu->timer_interval);
+
+	return HRTIMER_RESTART;
+}
+
+static void rapl_hrtimer_init(struct rapl_pmu *pmu)
+{
+	struct hrtimer *hr = &pmu->hrtimer;
+
+	hrtimer_init(hr, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hr->function = rapl_hrtimer_handle;
+}
+
 static void __rapl_pmu_event_start(struct rapl_pmu *pmu,
 				   struct perf_event *event)
 {
@@ -171,6 +215,8 @@ static void __rapl_pmu_event_start(struct rapl_pmu *pmu,
 	local64_set(&event->hw.prev_count, rapl_read_counter(event));
 
 	pmu->n_active++;
+	if (pmu->n_active == 1)
+		rapl_start_hrtimer(pmu);
 }
 
 static void rapl_pmu_event_start(struct perf_event *event, int mode)
@@ -195,6 +241,8 @@ static void rapl_pmu_event_stop(struct perf_event *event, int mode)
 	if (!(hwc->state & PERF_HES_STOPPED)) {
 		WARN_ON_ONCE(pmu->n_active <= 0);
 		pmu->n_active--;
+		if (pmu->n_active == 0)
+			rapl_stop_hrtimer(pmu);
 
 		list_del(&event->active_entry);
 
@@ -423,6 +471,9 @@ static void rapl_cpu_exit(int cpu)
 	 */
 	if (target >= 0)
 		perf_pmu_migrate_context(pmu->pmu, cpu, target);
+
+	/* cancel overflow polling timer for CPU */
+	rapl_stop_hrtimer(pmu);
 }
 
 static void rapl_cpu_init(int cpu)
@@ -442,6 +493,7 @@ static int rapl_cpu_prepare(int cpu)
 {
 	struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu);
 	int phys_id = topology_physical_package_id(cpu);
+	u64 ms;
 
 	if (pmu)
 		return 0;
@@ -466,6 +518,22 @@ static int rapl_cpu_prepare(int cpu)
 	pmu->hw_unit = (pmu->hw_unit >> 8) & 0x1FULL;
 	pmu->pmu = &rapl_pmu_class;
 
+	/*
+	 * use reference of 200W for scaling the timeout
+	 * to avoid missing counter overflows.
+	 * 200W = 200 Joules/sec
+	 * divide interval by 2 to avoid lockstep (2 * 100)
+	 * if hw unit is 32, then we use 2 ms 1/200/2
+	 */
+	if (pmu->hw_unit < 32)
+		ms = (1000 / (2 * 100)) * (1ULL << (32 - pmu->hw_unit - 1));
+	else
+		ms = 2;
+
+	pmu->timer_interval = ms_to_ktime(ms);
+
+	rapl_hrtimer_init(pmu);
+
 	/* set RAPL pmu for this cpu for now */
 	per_cpu(rapl_pmu, cpu) = pmu;
 	per_cpu(rapl_pmu_to_free, cpu) = NULL;
@@ -580,9 +648,11 @@ static int __init rapl_pmu_init(void)
 
 	pr_info("RAPL PMU detected, hw unit 2^-%d Joules,"
 		" API unit is 2^-32 Joules,"
-		" %d fixed counters\n",
+		" %d fixed counters"
+		" %llu ms ovfl timer\n",
 		pmu->hw_unit,
-		hweight32(rapl_cntr_mask));
+		hweight32(rapl_cntr_mask),
+		ktime_to_ms(pmu->timer_interval));
 
 	put_online_cpus();
 

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 3/4] perf,x86: add Intel RAPL PMU support
  2013-11-12 16:58 ` [PATCH v7 3/4] perf,x86: add Intel RAPL PMU support Stephane Eranian
  2013-11-27 10:23   ` Peter Zijlstra
  2013-11-27 14:08   ` [tip:perf/core] perf/x86: Add " tip-bot for Stephane Eranian
@ 2013-11-27 18:35   ` Vince Weaver
  2013-11-27 18:38     ` Stephane Eranian
  2 siblings, 1 reply; 22+ messages in thread
From: Vince Weaver @ 2013-11-27 18:35 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, peterz, mingo, ak, acme, jolsa, zheng.z.yan, bp,
	maria.n.dimakopoulou

On Tue, 12 Nov 2013, Stephane Eranian wrote:

> This patch adds a new uncore PMU to expose the Intel
> RAPL energy consumption counters. Up to 3 counters,
> each counting a particular RAPL event are exposed.
> 
> The RAPL counters are available on Intel SandyBridge,
> IvyBridge, Haswell. The server skus add a 3rd counter.

So I notice PP1 (which is the GPU power on non-server chips) 
is not supported.

Is that just for simplicity?

> The following events are available and exposed in sysfs:
> - power/energy-cores: power consumption of all cores on socket
> - power/energy-pkg: power consumption of all cores + LLc cache
> - power/energy-dram: power consumption of DRAM (servers only)

This "power" naming seems a bit generic.  If other hardware has power
measurements can they be put in the same directory?
power/gpu? power/usb?

Also, can support for reading the power from other vendors be put here?
Like AMDs (unfortunately named) APM (active power management) power
readings?

> Files are:
> 	/sys/devices/power/events/energy-*.unit
> 	/sys/devices/power/events/energy-*.scale

Are all of these sys files having documentation added under 
Documentation/ABI?

> The RAPL PMU is uncore by nature and is implemented such
> that it only works in system-wide mode. Measuring only
> one CPU per socket is sufficient. The /sys/devices/power/cpumask
> file can be used by tools to figure out which CPUs to monitor
> by default. For instance, on a 2-socket system, 2 CPUs
> (one on each socket) will be shown.

do the measurements require CAP_SYS_ADMIN like other system-wide uncore 
measurements?  It didn't look like it in the patch but I might have missed 
it.

I'm sure the security people will start making claims that you can make 
guesses about password encryption algorithms based on the global power 
consumption numbers.


Sorry if these are annoying questions, I am glad to see this driver make 
progress, as I've had the misfortune of maintaining various user-space-MSR 
hacks designed to get this info because of lack of kernel support.

Vince

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 3/4] perf,x86: add Intel RAPL PMU support
  2013-11-27 18:35   ` [PATCH v7 3/4] perf,x86: add " Vince Weaver
@ 2013-11-27 18:38     ` Stephane Eranian
  2013-11-27 22:03       ` Vince Weaver
  0 siblings, 1 reply; 22+ messages in thread
From: Stephane Eranian @ 2013-11-27 18:38 UTC (permalink / raw)
  To: Vince Weaver
  Cc: LKML, Peter Zijlstra, mingo, ak, Arnaldo Carvalho de Melo,
	Jiri Olsa, Yan, Zheng, Borislav Petkov, Maria Dimakopoulou

On Wed, Nov 27, 2013 at 7:35 PM, Vince Weaver <vince@deater.net> wrote:
> On Tue, 12 Nov 2013, Stephane Eranian wrote:
>
>> This patch adds a new uncore PMU to expose the Intel
>> RAPL energy consumption counters. Up to 3 counters,
>> each counting a particular RAPL event are exposed.
>>
>> The RAPL counters are available on Intel SandyBridge,
>> IvyBridge, Haswell. The server skus add a 3rd counter.
>
> So I notice PP1 (which is the GPU power on non-server chips)
> is not supported.
>
> Is that just for simplicity?
>
Does it work on specific models only? I bet so. How do we detect those?

>> The following events are available and exposed in sysfs:
>> - power/energy-cores: power consumption of all cores on socket
>> - power/energy-pkg: power consumption of all cores + LLc cache
>> - power/energy-dram: power consumption of DRAM (servers only)
>
> This "power" naming seems a bit generic.  If other hardware has power
> measurements can they be put in the same directory?
> power/gpu? power/usb?
>
I think so.

> Also, can support for reading the power from other vendors be put here?
> Like AMDs (unfortunately named) APM (active power management) power
> readings?
>
>> Files are:
>>       /sys/devices/power/events/energy-*.unit
>>       /sys/devices/power/events/energy-*.scale
>
> Are all of these sys files having documentation added under
> Documentation/ABI?
>
Not yet.

>> The RAPL PMU is uncore by nature and is implemented such
>> that it only works in system-wide mode. Measuring only
>> one CPU per socket is sufficient. The /sys/devices/power/cpumask
>> file can be used by tools to figure out which CPUs to monitor
>> by default. For instance, on a 2-socket system, 2 CPUs
>> (one on each socket) will be shown.
>
> do the measurements require CAP_SYS_ADMIN like other system-wide uncore
> measurements?  It didn't look like it in the patch but I might have missed
> it.
>
Measurements are limited to system-wide counting. Thus, you need
CAP_SYS_ADMIN.
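
For the record, a minimal sketch of what such a system-wide read looks
like from user level. The config value 0x02 for energy-pkg is an
assumption here; the authoritative encoding is whatever the sysfs
events/energy-pkg file reports on a given kernel:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void)
{
        struct perf_event_attr attr;
        uint64_t count;
        int type, fd;
        FILE *f = fopen("/sys/devices/power/type", "r");

        if (!f || fscanf(f, "%d", &type) != 1)
                return 1;
        fclose(f);

        memset(&attr, 0, sizeof(attr));
        attr.type = type;       /* PMU type is dynamically allocated */
        attr.size = sizeof(attr);
        attr.config = 0x02;     /* energy-pkg (assumed encoding) */

        /* pid = -1, cpu = 0: counting system-wide on one CPU of the
         * socket, which is where CAP_SYS_ADMIN comes in */
        fd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
        if (fd < 0)
                return 1;

        sleep(1);
        if (read(fd, &count, sizeof(count)) != sizeof(count))
                return 1;
        printf("raw count: %llu (multiply by the .scale value for Joules)\n",
               (unsigned long long)count);
        close(fd);
        return 0;
}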

> I'm sure the security people will start making claims that you can make
> guesses about password encryption algorithms based on the global power
> consumption numbers.
>
>
> Sorry if these are annoying questions, I am glad to see this driver make
> progress, as I've had the misfortune of maintaining various user-space-MSR
> hacks designed to get this info because of lack of kernel support.
>
You don't really need a driver: you can just use modprobe msr plus the
turbostat utility, now part of the kernel source tree. It taps into the
same counters.
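
For instance, something along these lines reads the same package energy
counter. A rough sketch: per the SDM, MSR 0x606 is MSR_RAPL_POWER_UNIT
and 0x611 is MSR_PKG_ENERGY_STATUS; needs modprobe msr and root:

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        uint64_t unit, energy;
        int fd = open("/dev/cpu/0/msr", O_RDONLY);

        if (fd < 0)
                return 1;
        /* the msr driver maps the MSR address to the file offset */
        pread(fd, &unit,   sizeof(unit),   0x606);
        pread(fd, &energy, sizeof(energy), 0x611);
        close(fd);

        /* energy unit is 1/2^ESU Joules, ESU in bits 12:8 of 0x606 */
        printf("pkg energy: %.6f J\n",
               (double)(uint32_t)energy /
               (double)(1ULL << ((unit >> 8) & 0x1f)));
        return 0;
}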

* Re: [PATCH v7 3/4] perf,x86: add Intel RAPL PMU support
  2013-11-27 18:38     ` Stephane Eranian
@ 2013-11-27 22:03       ` Vince Weaver
  2013-11-28 12:26         ` Ingo Molnar
  0 siblings, 1 reply; 22+ messages in thread
From: Vince Weaver @ 2013-11-27 22:03 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: LKML, Peter Zijlstra, mingo, ak, Arnaldo Carvalho de Melo,
	Jiri Olsa, Yan, Zheng, Borislav Petkov, Maria Dimakopoulou

On Wed, 27 Nov 2013, Stephane Eranian wrote:

> On Wed, Nov 27, 2013 at 7:35 PM, Vince Weaver <vince@deater.net> wrote:

> > So I notice PP1 (which is the GPU power on non-server chips)
> > is not supported.
> >
> > Is that just for simplicity?
> >
> Does it work on specific models only? I bet so. How to detect those?

In general it is on the machines that don't support the DRAM measurements 
(so the non-EP machines) but I don't know if there's a nice list anywhere.

Intel manuals say:
   For a client platform, PP1 domain refers to the power plane of a 
   specific device in the uncore. For server platforms, PP1 domain is not
   supported,

usually PP1 I think maps to the embedded GPU.

> > Sorry if these are annoying questions, I am glad to see this driver make
> > progress, as I've had the misfortune of maintaining various user-space-MSR
> > hacks designed to get this info because of lack of kernel support.
> >
> You don't really need a driver, you can also just use modprobe msr + the
> turbostat utility now part of the kernel source tree. It taps into the same
> counters.

yes, the code I maintain doesn't have a custom driver; it just uses the 
raw msr interface.  That's how PAPI currently works.  Though I guess I'll 
have to maintain both that and the new interface for a while, as not 
everyone's going to run out and install 3.14.

It's a shame that it took so long for proper RAPL support to get merged.  
Because of the delay there are lots of people poking the MSRs themselves 
and various vendors who are shipping custom out-of-tree RAPL drivers, etc.

Vince

* Re: [PATCH v7 3/4] perf,x86: add Intel RAPL PMU support
  2013-11-27 22:03       ` Vince Weaver
@ 2013-11-28 12:26         ` Ingo Molnar
  2013-11-28 12:35           ` Stephane Eranian
  0 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2013-11-28 12:26 UTC (permalink / raw)
  To: Vince Weaver
  Cc: Stephane Eranian, LKML, Peter Zijlstra, mingo, ak,
	Arnaldo Carvalho de Melo, Jiri Olsa, Yan, Zheng, Borislav Petkov,
	Maria Dimakopoulou


* Vince Weaver <vince@deater.net> wrote:

> On Wed, 27 Nov 2013, Stephane Eranian wrote:
> 
> > On Wed, Nov 27, 2013 at 7:35 PM, Vince Weaver <vince@deater.net> wrote:
> 
> > > So I notice PP1 (which is the GPU power on non-server chips)
> > > is not supported.
> > >
> > > Is that just for simplicity?
> > >
> > Does it work on specific models only? I bet so. How to detect those?
> 
> In general it is on the machines that don't support the DRAM measurements 
> (so the non-EP machines) but I don't know if there's a nice list anywhere.
> 
> Intel manuals say:
>    For a client platform, PP1 domain refers to the power plane of a 
>    specific device in the uncore. For server platforms, PP1 domain is not
>    supported,
> 
> usually PP1 I think maps to the embedded GPU.

It would indeed be nice to expose PP1 too via the same facility - 
Haswell and later spends some 40% of the CPU die on the integrated GPU 
and people end up using it.

Thanks,

	Ingo

* Re: [PATCH v7 3/4] perf,x86: add Intel RAPL PMU support
  2013-11-28 12:26         ` Ingo Molnar
@ 2013-11-28 12:35           ` Stephane Eranian
  2013-11-29 19:16             ` Vince Weaver
  0 siblings, 1 reply; 22+ messages in thread
From: Stephane Eranian @ 2013-11-28 12:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Vince Weaver, LKML, Peter Zijlstra, mingo, ak,
	Arnaldo Carvalho de Melo, Jiri Olsa, Yan, Zheng, Borislav Petkov,
	Maria Dimakopoulou

On Thu, Nov 28, 2013 at 1:26 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Vince Weaver <vince@deater.net> wrote:
>
>> On Wed, 27 Nov 2013, Stephane Eranian wrote:
>>
>> > On Wed, Nov 27, 2013 at 7:35 PM, Vince Weaver <vince@deater.net> wrote:
>>
>> > > So I notice PP1 (which is the GPU power on non-server chips)
>> > > is not supported.
>> > >
>> > > Is that just for simplicity?
>> > >
>> > Does it work on specific models only? I bet so. How to detect those?
>>
>> In general it is on the machines that don't support the DRAM measurements
>> (so the non-EP machines) but I don't know if there's a nice list anywhere.
>>
>> Intel manuals say:
>>    For a client platform, PP1 domain refers to the power plane of a
>>    specific device in the uncore. For server platforms, PP1 domain is not
>>    supported,
>>
>> usually PP1 I think maps to the embedded GPU.
>
> It would indeed be nice to expose PP1 too via the same facility -
> Haswell and later spends some 40% of the CPU die on the integrated GPU
> and people end up using it.
>
My worry is to determine if the GPU is actually enabled or even present.
Using the x86_model may not be enough for that.

* Re: [PATCH v7 3/4] perf,x86: add Intel RAPL PMU support
  2013-11-28 12:35           ` Stephane Eranian
@ 2013-11-29 19:16             ` Vince Weaver
  2013-11-29 20:09               ` Stephane Eranian
  0 siblings, 1 reply; 22+ messages in thread
From: Vince Weaver @ 2013-11-29 19:16 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Ingo Molnar, LKML, Peter Zijlstra, mingo, ak,
	Arnaldo Carvalho de Melo, Jiri Olsa, Yan, Zheng, Borislav Petkov,
	Maria Dimakopoulou

On Thu, 28 Nov 2013, Stephane Eranian wrote:

> On Thu, Nov 28, 2013 at 1:26 PM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > * Vince Weaver <vince@deater.net> wrote:
> >
> >> On Wed, 27 Nov 2013, Stephane Eranian wrote:
> >>
> >> > On Wed, Nov 27, 2013 at 7:35 PM, Vince Weaver <vince@deater.net> wrote:
> >>
> >> > > So I notice PP1 (which is the GPU power on non-server chips)
> >> > > is not supported.
> >> > >
> >> > > Is that just for simplicity?
> >> > >
> >> > Does it work on specific models only? I bet so. How to detect those?
> >>
> >> In general it is on the machines that don't support the DRAM measurements
> >> (so the non-EP machines) but I don't know if there's a nice list anywhere.
> >>
> >> Intel manuals say:
> >>    For a client platform, PP1 domain refers to the power plane of a
> >>    specific device in the uncore. For server platforms, PP1 domain is not
> >>    supported,
> >>
> >> usually PP1 I think maps to the embedded GPU.
> >
> > It would indeed be nice to expose PP1 too via the same facility -
> > Haswell and later spends some 40% of the CPU die on the integrated GPU
> > and people end up using it.
> >
> My worry is to determine if the GPU is actually enabled or even present.
> Using the x86_model may not be enough for that.

In my experience if the device is not there you just get 0s as results
from RAPL (I've also seen this on some Sandybridge-EP machines we have 
that for whatever reason don't support the DRAM RAPL results).

So in theory it would be harmless to export the values even if not 
supported.  What is the worst failure mode?  That somehow a recent CPU 
doesn't support the MSR and we get a GPF when trying to access it?

Vince

* Re: [PATCH v7 3/4] perf,x86: add Intel RAPL PMU support
  2013-11-29 19:16             ` Vince Weaver
@ 2013-11-29 20:09               ` Stephane Eranian
  2013-11-29 20:31                 ` Borislav Petkov
  0 siblings, 1 reply; 22+ messages in thread
From: Stephane Eranian @ 2013-11-29 20:09 UTC (permalink / raw)
  To: Vince Weaver
  Cc: Ingo Molnar, LKML, Peter Zijlstra, mingo, ak,
	Arnaldo Carvalho de Melo, Jiri Olsa, Yan, Zheng, Borislav Petkov,
	Maria Dimakopoulou

On Fri, Nov 29, 2013 at 8:16 PM, Vince Weaver <vince@deater.net> wrote:
> On Thu, 28 Nov 2013, Stephane Eranian wrote:
>
>> On Thu, Nov 28, 2013 at 1:26 PM, Ingo Molnar <mingo@kernel.org> wrote:
>> >
>> > * Vince Weaver <vince@deater.net> wrote:
>> >
>> >> On Wed, 27 Nov 2013, Stephane Eranian wrote:
>> >>
>> >> > On Wed, Nov 27, 2013 at 7:35 PM, Vince Weaver <vince@deater.net> wrote:
>> >>
>> >> > > So I notice PP1 (which is the GPU power on non-server chips)
>> >> > > is not supported.
>> >> > >
>> >> > > Is that just for simplicity?
>> >> > >
>> >> > Does it work on specific models only? I bet so. How to detect those?
>> >>
>> >> In general it is on the machines that don't support the DRAM measurements
>> >> (so the non-EP machines) but I don't know if there's a nice list anywhere.
>> >>
>> >> Intel manuals say:
>> >>    For a client platform, PP1 domain refers to the power plane of a
>> >>    specific device in the uncore. For server platforms, PP1 domain is not
>> >>    supported,
>> >>
>> >> usually PP1 I think maps to the embedded GPU.
>> >
>> > It would indeed be nice to expose PP1 too via the same facility -
>> > Haswell and later spends some 40% of the CPU die on the integrated GPU
>> > and people end up using it.
>> >
>> My worry is to determine if the GPU is actually enabled or even present.
>> Using the x86_model may not be enough for that.
>
> In my experience if the device is not there you just get 0s as results
> from RAPL (I've also seen this on some Sandybridge-EP machines we have
> that for whatever reason don't support the DRAM RAPL results).
>
> So in theory it would be harmless to export the values even if not
> supported.  What is the worst failure mode?  That somehow a recent CPU
> doesn't support the MSR and we get a GPF when trying to access it?
>
Yes. I think that's what you get with DRAM on client for instance.

> Vince

* Re: [PATCH v7 3/4] perf,x86: add Intel RAPL PMU support
  2013-11-29 20:09               ` Stephane Eranian
@ 2013-11-29 20:31                 ` Borislav Petkov
  2013-11-29 23:25                   ` Stephane Eranian
  0 siblings, 1 reply; 22+ messages in thread
From: Borislav Petkov @ 2013-11-29 20:31 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Vince Weaver, Ingo Molnar, LKML, Peter Zijlstra, mingo, ak,
	Arnaldo Carvalho de Melo, Jiri Olsa, Yan, Zheng,
	Maria Dimakopoulou

On Fri, Nov 29, 2013 at 09:09:46PM +0100, Stephane Eranian wrote:
> > So in theory it would be harmless to export the values even if not
> > supported.  What is the worst failure mode?  That somehow a recent CPU
> > doesn't support the MSR and we get a GPF when trying to access it?
> >
> Yes. I think that's what you get with DRAM on client for instance.

We have rdmsr_safe for non-existent MSRs, in case touching such is the
issue.
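
i.e., roughly something like this at driver init (a sketch, assuming an
MSR_PP1_ENERGY_STATUS define for the 0x641 counter and using the 64-bit
rdmsrl_safe() variant):

        u64 val;

        /* if the MSR would #GP, rdmsrl_safe() returns non-zero and we
         * can simply leave the event out of the exported attributes */
        if (rdmsrl_safe(MSR_PP1_ENERGY_STATUS, &val))
                pr_info("RAPL: PP1 counter not available\n");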

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

* Re: [PATCH v7 3/4] perf,x86: add Intel RAPL PMU support
  2013-11-29 20:31                 ` Borislav Petkov
@ 2013-11-29 23:25                   ` Stephane Eranian
  2013-11-30 16:39                     ` Vince Weaver
  0 siblings, 1 reply; 22+ messages in thread
From: Stephane Eranian @ 2013-11-29 23:25 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Vince Weaver, Ingo Molnar, LKML, Peter Zijlstra, mingo, ak,
	Arnaldo Carvalho de Melo, Jiri Olsa, Yan, Zheng,
	Maria Dimakopoulou

Vince,

Ok, I have added PP1 now and it works fine on my HSW desktop.
What I want to know now is: what does this map to on servers?
And on clients, is it always mapped to the GFX? If so, then
we can use a more specific name: power/energy-gfx/

I will send a patch soon.

Thanks.

On Fri, Nov 29, 2013 at 9:31 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Nov 29, 2013 at 09:09:46PM +0100, Stephane Eranian wrote:
>> > So in theory it would be harmless to export the values even if not
>> > supported.  What is the worst failure mode?  That somehow a recent CPU
>> > doesn't support the MSR and we get a GPF when trying to access it?
>> >
>> Yes. I think that's what you get with DRAM on client for instance.
>
> We have rdmsr_safe for non-existent MSRs, in case touching such is the
> issue.
>
> --
> Regards/Gruss,
>     Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --

* Re: [PATCH v7 3/4] perf,x86: add Intel RAPL PMU support
  2013-11-29 23:25                   ` Stephane Eranian
@ 2013-11-30 16:39                     ` Vince Weaver
  0 siblings, 0 replies; 22+ messages in thread
From: Vince Weaver @ 2013-11-30 16:39 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Borislav Petkov, Ingo Molnar, LKML, Peter Zijlstra, mingo, ak,
	Arnaldo Carvalho de Melo, Jiri Olsa, Yan, Zheng,
	Maria Dimakopoulou

On Sat, 30 Nov 2013, Stephane Eranian wrote:

> Ok, I have added PP1 now and it works fine on my HSW desktop.
> What I want to know now is: what does this map to on servers?
> And on clients, is it always mapped to the GFX? If so, then
> we can use a more specific name: power/energy-gfx/

I don't know if it is documented well, but I get the impression that PP1 
is always graphics.

The turbostat code included with the kernel certainly thinks it is always 
graphics; it maps PP1 to "GFX".

The turbostat code also has a map of which chips have which features, 
although it seems to use internal Intel abbreviations that I don't 
necessarily follow (BYT? AVN?):


        switch (model) {
        case 0x2A:      /* SNB */
        case 0x3A:      /* IVB */
        case 0x3C:      /* HSW */
        case 0x3F:      /* HSW */
        case 0x45:      /* HSW */
        case 0x46:      /* HSW */
                do_rapl = RAPL_PKG | RAPL_CORES | RAPL_CORE_POLICY |
                          RAPL_GFX | RAPL_PKG_POWER_INFO;
                break;
        case 0x2D:      /* SNB-EP */
        case 0x3E:      /* IVB-EP */
                do_rapl = RAPL_PKG | RAPL_CORES | RAPL_CORE_POLICY |
                          RAPL_DRAM | RAPL_PKG_PERF_STATUS |
                          RAPL_DRAM_PERF_STATUS | RAPL_PKG_POWER_INFO;
                break;
        case 0x37:      /* BYT (Bay Trail) */
        case 0x4D:      /* AVN (Avoton) */
                do_rapl = RAPL_PKG | RAPL_CORES;
                break;
        default:
                return;
        }


Thread overview: 22+ messages
2013-11-12 16:58 [PATCH v7 0/4] perf/x86: add Intel RAPL PMU support Stephane Eranian
2013-11-12 16:58 ` [PATCH v7 1/4] perf: add active_entry list head to struct perf_event Stephane Eranian
2013-11-27 14:07   ` [tip:perf/core] perf: Add " tip-bot for Stephane Eranian
2013-11-12 16:58 ` [PATCH v7 2/4] perf stat: add event unit and scale support Stephane Eranian
2013-11-27 14:08   ` [tip:perf/core] tools/perf/stat: Add " tip-bot for Stephane Eranian
2013-11-12 16:58 ` [PATCH v7 3/4] perf,x86: add Intel RAPL PMU support Stephane Eranian
2013-11-27 10:23   ` Peter Zijlstra
2013-11-27 14:08   ` [tip:perf/core] perf/x86: Add " tip-bot for Stephane Eranian
2013-11-27 18:35   ` [PATCH v7 3/4] perf,x86: add " Vince Weaver
2013-11-27 18:38     ` Stephane Eranian
2013-11-27 22:03       ` Vince Weaver
2013-11-28 12:26         ` Ingo Molnar
2013-11-28 12:35           ` Stephane Eranian
2013-11-29 19:16             ` Vince Weaver
2013-11-29 20:09               ` Stephane Eranian
2013-11-29 20:31                 ` Borislav Petkov
2013-11-29 23:25                   ` Stephane Eranian
2013-11-30 16:39                     ` Vince Weaver
2013-11-12 16:58 ` [PATCH v7 4/4] perf,x86: add RAPL hrtimer support Stephane Eranian
2013-11-27 14:10   ` [tip:perf/core] perf/x86: Add " tip-bot for Stephane Eranian
2013-11-27 14:33   ` tip-bot for Stephane Eranian
2013-11-12 18:10 ` [PATCH v7 0/4] perf/x86: add Intel RAPL PMU support Andi Kleen
