* [PATCH v1 0/2] perf,x86: add Intel RAPL PMU support
@ 2013-10-07 16:09 Stephane Eranian
  2013-10-07 16:09 ` [PATCH v1 1/2] " Stephane Eranian
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Stephane Eranian @ 2013-10-07 16:09 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, ak, acme, jolsa, zheng.z.yan

This patch adds a new uncore PMU to expose the Intel
RAPL energy consumption counters. Up to 3 counters,
each counting a particular RAPL event are exposed.

The RAPL counters are available on Intel SandyBridge,
IvyBridge, and Haswell. The server SKUs add a 3rd counter.

The following events are available and exposed in sysfs:
- rapl-energy-cores: power consumption of all cores on the socket
- rapl-energy-pkg: power consumption of all cores + LLC cache
- rapl-energy-dram: power consumption of DRAM

The RAPL PMU is uncore by nature and is implemented such
that it only works in system-wide mode. Measuring only
one CPU per socket is sufficient. The /sys/devices/rapl/cpumask
is exported and can be used by tools to figure out which CPU
to monitor by default. For instance, on a 2-socket system, 2 CPUs
(one on each socket) will be shown.

The counters all count in the same unit. The perf_events API
exposes all RAPL counters as 64-bit integers counting in units
of 1/2^32 Joules (about 0.23 nJ). User-level tools must convert
the counts by multiplying them by 0.23 and dividing by 10^9 to
obtain Joules.  The reason for this is that the kernel avoids
doing floating-point math whenever possible because it is
expensive (user floating-point state must be saved). The method
used avoids kernel floating-point and minimizes the loss of
precision (bits). Thanks to PeterZ for suggesting this approach.

To convert the raw count to Watts: W = C * 0.23 / (1e9 * time)

The RAPL PMU is a new standalone PMU which registers with the
perf_event core subsystem. The PMU type (attr->type) is
dynamically allocated and is available from /sys/devices/rapl/type.

Sampling is not supported by the RAPL PMU. There is no
privilege level filtering either.

The PMU exports a cpumask in /sys/devices/rapl/cpumask. It
is used by perf to ensure only one instance of each RAPL event
is measured per processor socket. CPU hotplug is also supported.

We artificially limit the number of simultaneous RAPL events
to a max of 1 instance of each (so up to 3). That helps track
events and is sufficient given that RAPL events do not support
any filters, i.e., there is no gain in measuring the same event
twice in an event group.

The second patch adds a hrtimer to poll the counters given that
they do not interrupt on overflow. Hardware counters are 32-bit
wide.

Supported CPUs: SandyBridge, IvyBridge, Haswell.

$ perf stat -a -e rapl/rapl-energy-cores/,rapl/rapl-energy-pkg/ -I 1000 sleep 10
            time             counts events
     1.000345931        772 278 493 rapl/rapl-energy-cores/
     1.000345931     55 539 138 560 rapl/rapl-energy-pkg/    
     2.000836387        771 751 936 rapl/rapl-energy-cores/  
     2.000836387     55 326 015 488 rapl/rapl-energy-pkg/

Stephane Eranian (2):
  perf,x86: add Intel RAPL PMU support
  perf,x86: add RAPL hrtimer support

 arch/x86/kernel/cpu/Makefile                |    2 +-
 arch/x86/kernel/cpu/perf_event_intel_rapl.c |  649 +++++++++++++++++++++++++++
 tools/perf/util/evsel.c                     |    1 -
 3 files changed, 650 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/perf_event_intel_rapl.c

-- 
1.7.9.5


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v1 1/2] perf,x86: add Intel RAPL PMU support
  2013-10-07 16:09 [PATCH v1 0/2] perf,x86: add Intel RAPL PMU support Stephane Eranian
@ 2013-10-07 16:09 ` Stephane Eranian
  2013-10-07 17:55   ` Andi Kleen
  2013-10-07 16:09 ` [PATCH v1 2/2] perf,x86: add RAPL hrtimer support Stephane Eranian
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: Stephane Eranian @ 2013-10-07 16:09 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, ak, acme, jolsa, zheng.z.yan

This patch adds a new uncore PMU to expose the Intel
RAPL energy consumption counters. Up to 3 counters,
each counting a particular RAPL event are exposed.

The RAPL counters are available on Intel SandyBridge,
IvyBridge, and Haswell. The server SKUs add a 3rd counter.

The following events are available and exposed in sysfs:
- rapl-energy-cores: power consumption of all cores on the socket
- rapl-energy-pkg: power consumption of all cores + LLC cache
- rapl-energy-dram: power consumption of DRAM

The RAPL PMU is uncore by nature and is implemented such
that it only works in system-wide mode. Measuring only
one CPU per socket is sufficient. The /sys/devices/rapl/cpumask
is exported and can be used by tools to figure out which CPU
to monitor by default. For instance, on a 2-socket system, 2 CPUs
(one on each socket) will be shown.

The counters all count in the same unit. The perf_events API
exposes all RAPL counters as 64-bit integers counting in units
of 1/2^32 Joules (about 0.23 nJ). User-level tools must convert
the counts by multiplying them by 0.23 and dividing by 10^9 to
obtain Joules.  The reason for this is that the kernel avoids
doing floating-point math whenever possible because it is
expensive (user floating-point state must be saved). The method
used avoids kernel floating-point and minimizes the loss of
precision (bits). Thanks to PeterZ for suggesting this approach.

To convert the raw count to Watts: W = C * 0.23 / (1e9 * time)

The RAPL PMU is a new standalone PMU which registers with the
perf_event core subsystem. The PMU type (attr->type) is
dynamically allocated and is available from /sys/devices/rapl/type.

Sampling is not supported by the RAPL PMU. There is no
privilege level filtering either.

We artificially limit the number of simultaneous RAPL events
to a max of 1 instance of each (so up to 3). That helps track
events and is sufficient given that RAPL events do not support
any filters, i.e., there is no gain in measuring the same event
twice in an event group.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/Makefile                |    2 +-
 arch/x86/kernel/cpu/perf_event_intel_rapl.c |  580 +++++++++++++++++++++++++++
 tools/perf/util/evsel.c                     |    1 -
 3 files changed, 581 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/perf_event_intel_rapl.c

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 47b56a7..6359506 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -36,7 +36,7 @@ obj-$(CONFIG_CPU_SUP_AMD)		+= perf_event_amd_iommu.o
 endif
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_p6.o perf_event_knc.o perf_event_p4.o
 obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_lbr.o perf_event_intel_ds.o perf_event_intel.o
-obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_uncore.o
+obj-$(CONFIG_CPU_SUP_INTEL)		+= perf_event_intel_uncore.o perf_event_intel_rapl.o
 endif
 
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_rapl.c b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
new file mode 100644
index 0000000..f59dbd4
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
@@ -0,0 +1,580 @@
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/pci.h>
+#include <linux/perf_event.h>
+#include "perf_event.h"
+/*
+ * RAPL energy status counters
+ */
+#define RAPL_IDX_PP0_NRG_STAT	0	/* all cores */
+#define INTEL_RAPL_PP0		0x1	/* pseudo-encoding */
+#define RAPL_IDX_PKG_NRG_STAT	1	/* entire package */
+#define INTEL_RAPL_PKG		0x2	/* pseudo-encoding */
+#define RAPL_IDX_RAM_NRG_STAT	2	/* DRAM */
+#define INTEL_RAPL_RAM		0x3	/* pseudo-encoding */
+
+#define RAPL_IDX_MAX		4	/* Max number of RAPL counters */
+
+/* Desktops have PP0, PKG */
+#define RAPL_IDX_CLN	(1<<RAPL_IDX_PP0_NRG_STAT|\
+			 1<<RAPL_IDX_PKG_NRG_STAT)
+
+/* Servers have PP0, PKG, RAM */
+#define RAPL_IDX_SRV	(1<<RAPL_IDX_PP0_NRG_STAT|\
+			 1<<RAPL_IDX_PKG_NRG_STAT|\
+			 1<<RAPL_IDX_RAM_NRG_STAT)
+
+/*
+ * event  code: LSB 8 bits, to pass in attr->config
+ * any other bit is reserved
+ */
+#define RAPL_EVENT_MASK	0xFFULL
+
+#define DEFINE_RAPL_FORMAT_ATTR(_var, _name, _format)			\
+static ssize_t __rapl_##_var##_show(struct kobject *kobj,		\
+				struct kobj_attribute *attr,		\
+				char *page)				\
+{									\
+	BUILD_BUG_ON(sizeof(_format) >= PAGE_SIZE);			\
+	return sprintf(page, _format "\n");				\
+}									\
+static struct kobj_attribute format_attr_##_var =			\
+	__ATTR(_name, 0444, __rapl_##_var##_show, NULL)
+
+#define RAPL_EVENT_DESC(_name, _config)			\
+{								\
+	.attr	= __ATTR(_name, 0444, rapl_event_show, NULL),	\
+	.config	= _config,					\
+}
+
+#define RAPL_CNTR_WIDTH 32 /* 32-bit rapl counters */
+
+struct rapl_pmu {
+	atomic_t	refcnt;
+	int		hw_unit;  /* 1/2^hw_unit Joule */
+	int		phys_id;
+	int		n_active; /* number of active events */
+	unsigned long active_mask[BITS_TO_LONGS(RAPL_IDX_MAX)];
+	struct perf_event *events[RAPL_IDX_MAX];
+};
+
+static struct pmu rapl_pmu_class;
+static cpumask_t rapl_cpu_mask;
+static int rapl_cntr_mask;
+
+static DEFINE_PER_CPU(struct rapl_pmu *, rapl_pmu);
+static DEFINE_PER_CPU(struct rapl_pmu *, rapl_pmu_kfree);
+
+static inline u64 rapl_read_counter(struct perf_event *event)
+{
+	u64 raw;
+	rdmsrl(event->hw.event_base, raw);
+	return raw;
+}
+
+static inline u64 rapl_scale(u64 v)
+{
+	/*
+	 * scale delta to smallest unit (1/2^32)
+	 * users must then scale back: count * 1/(1e9*2^32) to get Joules
+	 * Watts = Joules/Time delta
+	 */
+	return v << (32 - __get_cpu_var(rapl_pmu)->hw_unit);
+}
+
+static u64 rapl_event_update(struct perf_event *event)
+{
+	struct hw_perf_event *hwc = &event->hw;
+	u64 prev_raw_count, new_raw_count;
+	s64 delta, sdelta;
+	int shift = RAPL_CNTR_WIDTH;
+
+again:
+	prev_raw_count = local64_read(&hwc->prev_count);
+	rdmsrl(event->hw.event_base, new_raw_count);
+
+	if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
+			    new_raw_count) != prev_raw_count)
+		goto again;
+
+	/*
+	 * Now we have the new raw value and have updated the prev
+	 * timestamp already. We can now calculate the elapsed delta
+	 * (event-)time and add that to the generic event.
+	 *
+	 * Careful, not all hw sign-extends above the physical width
+	 * of the count.
+	 */
+	delta = (new_raw_count << shift) - (prev_raw_count << shift);
+	delta >>= shift;
+
+	sdelta = rapl_scale(delta);
+
+	local64_add(sdelta, &event->count);
+
+	return new_raw_count;
+}
+
+static void rapl_pmu_event_start(struct perf_event *event, int flags)
+{
+	struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
+
+	if (WARN_ON_ONCE(!(event->hw.state & PERF_HES_STOPPED)))
+		return;
+
+	event->hw.state = 0;
+
+	local64_set(&event->hw.prev_count, rapl_read_counter(event));
+
+	pmu->n_active++;
+}
+
+static int rapl_pmu_event_add(struct perf_event *event, int flags)
+{
+	struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
+	struct hw_perf_event *hwc = &event->hw;
+	int idx = event->hw.idx;
+
+	/* counter already in use */
+	if (__test_and_set_bit(idx, pmu->active_mask))
+		return -EAGAIN;
+
+	pmu->events[idx] = event;
+
+	hwc->state = PERF_HES_UPTODATE | PERF_HES_STOPPED;
+
+	if (flags & PERF_EF_START)
+		rapl_pmu_event_start(event, 0);
+
+	return 0;
+}
+
+static void rapl_pmu_event_stop(struct perf_event *event, int flags)
+{
+	struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
+	struct hw_perf_event *hwc = &event->hw;
+
+	/* mark event as deactivated and stopped */
+	if (__test_and_clear_bit(hwc->idx, pmu->active_mask)) {
+		WARN_ON_ONCE(pmu->n_active <= 0);
+		pmu->n_active--;
+		pmu->events[hwc->idx] = NULL;
+		WARN_ON_ONCE(hwc->state & PERF_HES_STOPPED);
+		hwc->state |= PERF_HES_STOPPED;
+	}
+
+	/* check if update of sw counter is necessary */
+	if ((flags & PERF_EF_UPDATE) && !(hwc->state & PERF_HES_UPTODATE)) {
+		/*
+		 * Drain the remaining delta count out of an event
+		 * that we are disabling:
+		 */
+		rapl_event_update(event);
+		hwc->state |= PERF_HES_UPTODATE;
+	}
+}
+
+static void rapl_pmu_event_del(struct perf_event *event, int flags)
+{
+	rapl_pmu_event_stop(event, PERF_EF_UPDATE);
+}
+
+static int rapl_validate_group(struct perf_event *event)
+{
+	struct perf_event *leader = event->group_leader;
+	unsigned long active_mask[BITS_TO_LONGS(RAPL_IDX_MAX)];
+	struct perf_event *e;
+
+	/*
+	 * group can only have RAPL PMU events
+	 * just need to verify they don't use the same
+	 * counter
+	 * Although the RAPL counters are read-only and
+	 * we could have as many RAPL events as we would
+	 * like, we artificially limit to a maximum of
+	 * one event of each kind. That helps us track
+	 * events better especially when we want to avoid
+	 * missing counter overflows. Furthermore, the
+	 * counters have no filters, thus adding 2 instances
+	 * of an event does not buy anything.
+	 */
+	bitmap_zero(active_mask, RAPL_IDX_MAX);
+
+	/*
+	 * event is not yet connected with siblings
+	 */
+	__set_bit(event->hw.idx, active_mask);
+
+	/*
+	 * add leader too
+	 */
+	if (__test_and_set_bit(leader->hw.idx, active_mask))
+			return -EINVAL;
+
+	/*
+	 * now check existing siblings
+	 */
+	list_for_each_entry(e, &leader->sibling_list, group_entry) {
+		if (__test_and_set_bit(e->hw.idx, active_mask))
+			return -EINVAL;
+	}
+	return 0;
+}
+
+static int rapl_pmu_event_init(struct perf_event *event)
+{
+	u64 cfg = event->attr.config & RAPL_EVENT_MASK;
+	int bit, msr, ret = 0;
+
+	/* only look at RAPL events */
+	if (event->attr.type != rapl_pmu_class.type)
+		return -ENOENT;
+
+	/* check only supported bits are set */
+	if (event->attr.config & ~RAPL_EVENT_MASK)
+		return -EINVAL;
+
+	/*
+	 * check event is known (determines counter)
+	 */
+	switch (cfg) {
+	case INTEL_RAPL_PP0:
+		bit = RAPL_IDX_PP0_NRG_STAT;
+		msr = MSR_PP0_ENERGY_STATUS;
+		break;
+	case INTEL_RAPL_PKG:
+		bit = RAPL_IDX_PKG_NRG_STAT;
+		msr = MSR_PKG_ENERGY_STATUS;
+		break;
+	case INTEL_RAPL_RAM:
+		bit = RAPL_IDX_RAM_NRG_STAT;
+		msr = MSR_DRAM_ENERGY_STATUS;
+		break;
+	default:
+		return -EINVAL;
+	}
+	/* check event supported */
+	if (!(rapl_cntr_mask & (1 << bit)))
+		return -EINVAL;
+
+	/* unsupported modes and filters */
+	if (event->attr.exclude_user   ||
+	    event->attr.exclude_kernel ||
+	    event->attr.exclude_hv     ||
+	    event->attr.exclude_idle   ||
+	    event->attr.exclude_host   ||
+	    event->attr.exclude_guest  ||
+	    event->attr.sample_period) /* no sampling */
+		return -EINVAL;
+
+	/* must be done before validate_group */
+	event->hw.event_base = msr;
+	event->hw.idx = bit;
+
+	if (event->group_leader != event)
+		ret = rapl_validate_group(event);
+
+	return ret;
+}
+
+static void rapl_pmu_event_read(struct perf_event *event)
+{
+	rapl_event_update(event);
+}
+
+static ssize_t rapl_get_attr_cpumask(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	int n = cpulist_scnprintf(buf, PAGE_SIZE - 2, &rapl_cpu_mask);
+
+	buf[n++] = '\n';
+	buf[n] = '\0';
+	return n;
+}
+
+static DEVICE_ATTR(cpumask, S_IRUGO, rapl_get_attr_cpumask, NULL);
+
+static struct attribute *rapl_pmu_attrs[] = {
+	&dev_attr_cpumask.attr,
+	NULL,
+};
+
+static struct attribute_group rapl_pmu_attr_group = {
+	.attrs = rapl_pmu_attrs,
+};
+
+EVENT_ATTR_STR(rapl-energy-cores, rapl_pp0, "event=0x01");
+EVENT_ATTR_STR(rapl-energy-pkg  , rapl_pkg, "event=0x02");
+EVENT_ATTR_STR(rapl-energy-ram  , rapl_ram, "event=0x03");
+
+static struct attribute *rapl_events_srv_attr[] = {
+	EVENT_PTR(rapl_pp0),
+	EVENT_PTR(rapl_pkg),
+	EVENT_PTR(rapl_ram),
+	NULL,
+};
+
+static struct attribute *rapl_events_cln_attr[] = {
+	EVENT_PTR(rapl_pp0),
+	EVENT_PTR(rapl_pkg),
+	NULL,
+};
+
+static struct attribute_group rapl_pmu_events_group = {
+	.name = "events",
+	.attrs = NULL, /* patched at runtime */
+};
+
+DEFINE_RAPL_FORMAT_ATTR(event, event, "config:0-7");
+static struct attribute *rapl_formats_attr[] = {
+	&format_attr_event.attr,
+	NULL,
+};
+
+static struct attribute_group rapl_pmu_format_group = {
+	.name = "format",
+	.attrs = rapl_formats_attr,
+};
+
+const struct attribute_group *rapl_attr_groups[] = {
+	&rapl_pmu_attr_group,
+	&rapl_pmu_format_group,
+	&rapl_pmu_events_group,
+	NULL,
+};
+
+static struct pmu rapl_pmu_class = {
+	.attr_groups	= rapl_attr_groups,
+	.task_ctx_nr	= perf_invalid_context, /* system-wide only */
+	.event_init	= rapl_pmu_event_init,
+	.add		= rapl_pmu_event_add, /* must have */
+	.del		= rapl_pmu_event_del, /* must have */
+	.start		= rapl_pmu_event_start,
+	.stop		= rapl_pmu_event_stop,
+	.read		= rapl_pmu_event_read,
+};
+
+static void rapl_exit_cpu(int cpu)
+{
+	int i, phys_id = topology_physical_package_id(cpu);
+
+	/* if CPU not in RAPL mask, nothing to do */
+	if (!cpumask_test_and_clear_cpu(cpu, &rapl_cpu_mask))
+		return;
+
+	/* find a new cpu on same package */
+	for_each_online_cpu(i) {
+		if (i == cpu || i == 0)
+			continue;
+		if (phys_id == topology_physical_package_id(i)) {
+			cpumask_set_cpu(i, &rapl_cpu_mask);
+			break;
+		}
+	}
+
+	WARN_ON(cpumask_empty(&rapl_cpu_mask));
+}
+
+static void rapl_init_cpu(int cpu)
+{
+	int i, phys_id = topology_physical_package_id(cpu);
+
+	/* check if phys_id is already covered */
+	for_each_cpu(i, &rapl_cpu_mask) {
+		if (i == 0)
+			continue;
+		if (phys_id == topology_physical_package_id(i))
+			return;
+	}
+	/* was not found, so add it */
+	cpumask_set_cpu(cpu, &rapl_cpu_mask);
+}
+
+static int rapl_cpu_prepare(int cpu)
+{
+	struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu);
+	int phys_id = topology_physical_package_id(cpu);
+
+	if (pmu)
+		return 0;
+
+	if (phys_id < 0)
+		return -1;
+
+	pmu = kzalloc_node(sizeof(*pmu), GFP_KERNEL, cpu_to_node(cpu));
+	if (!pmu)
+		return -1;
+
+	atomic_set(&pmu->refcnt, 1);
+	pmu->phys_id = phys_id;
+	/*
+	 * grab power unit as: 1/2^unit Joules
+	 *
+	 * we cache in local PMU instance
+	 */
+	rdmsrl(MSR_RAPL_POWER_UNIT, pmu->hw_unit);
+	pmu->hw_unit = (pmu->hw_unit >> 8) & 0x1FULL;
+
+	/* set RAPL pmu for this cpu for now */
+	per_cpu(rapl_pmu_kfree, cpu) = NULL;
+	per_cpu(rapl_pmu, cpu) = pmu;
+
+	return 0;
+}
+
+static int rapl_cpu_starting(int cpu)
+{
+	struct rapl_pmu *pmu2;
+	struct rapl_pmu *pmu1 = per_cpu(rapl_pmu, cpu);
+	int i, phys_id = topology_physical_package_id(cpu);
+
+	if (pmu1)
+		return 0;
+
+	for_each_online_cpu(i) {
+		pmu2 = per_cpu(rapl_pmu, i);
+
+		if (!pmu2 || i == cpu)
+			continue;
+
+		if (pmu2->phys_id == phys_id) {
+			per_cpu(rapl_pmu, cpu) = pmu2;
+			per_cpu(rapl_pmu_kfree, cpu) = pmu1;
+			atomic_inc(&pmu2->refcnt);
+			break;
+		}
+	}
+	return 0;
+}
+
+static int rapl_cpu_dying(int cpu)
+{
+	struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu);
+	int i;
+
+	if (!pmu)
+		return 0;
+	/*
+	 * stop all syswide RAPL events on that CPU
+	 * as a consequence also stops the hrtimer
+	 */
+	for (i = 0; i < RAPL_IDX_MAX; i++) {
+		if (pmu->events[i])
+			rapl_pmu_event_stop(pmu->events[i], PERF_EF_UPDATE);
+	}
+	per_cpu(rapl_pmu, cpu) = NULL;
+
+	if (atomic_dec_and_test(&pmu->refcnt))
+		kfree(pmu);
+
+	return 0;
+}
+
+static int rapl_cpu_notifier(struct notifier_block *self,
+			     unsigned long action, void *hcpu)
+{
+	unsigned int cpu = (long)hcpu;
+
+	/* allocate/free data structure for uncore box */
+	switch (action & ~CPU_TASKS_FROZEN) {
+	case CPU_UP_PREPARE:
+		rapl_cpu_prepare(cpu);
+		break;
+	case CPU_STARTING:
+		rapl_cpu_starting(cpu);
+		break;
+	case CPU_UP_CANCELED:
+	case CPU_DYING:
+		rapl_cpu_dying(cpu);
+		break;
+	case CPU_ONLINE:
+		kfree(per_cpu(rapl_pmu_kfree, cpu));
+		per_cpu(rapl_pmu_kfree, cpu) = NULL;
+		break;
+	case CPU_DEAD:
+		per_cpu(rapl_pmu, cpu) = NULL;
+		break;
+	default:
+		break;
+	}
+
+	/* select the cpu that collects uncore events */
+	switch (action & ~CPU_TASKS_FROZEN) {
+	case CPU_DOWN_FAILED:
+	case CPU_STARTING:
+		rapl_init_cpu(cpu);
+		break;
+	case CPU_DOWN_PREPARE:
+		rapl_exit_cpu(cpu);
+		break;
+	default:
+		break;
+	}
+
+	return NOTIFY_OK;
+}
+
+static int __init rapl_pmu_init(void)
+{
+	struct rapl_pmu *pmu;
+	int i, cpu, ret;
+
+	/* check supported CPU */
+	switch (boot_cpu_data.x86_model) {
+	case 42: /* Sandy Bridge */
+	case 58: /* Ivy Bridge */
+	case 60: /* Haswell */
+		rapl_cntr_mask = RAPL_IDX_CLN;
+		rapl_pmu_events_group.attrs = rapl_events_cln_attr;
+		break;
+	case 45: /* Sandy Bridge-EP */
+	case 62: /* IvyTown */
+		rapl_cntr_mask = RAPL_IDX_SRV;
+		rapl_pmu_events_group.attrs = rapl_events_srv_attr;
+		break;
+
+	default:
+		/* unsupported */
+		return 0;
+	}
+	get_online_cpus();
+
+	for_each_online_cpu(cpu) {
+		int phys_id = topology_physical_package_id(cpu);
+
+		/* save on prepare by only calling prepare for new phys_id */
+		for_each_cpu(i, &rapl_cpu_mask) {
+			if (phys_id == topology_physical_package_id(i)) {
+				phys_id = -1;
+				break;
+			}
+		}
+		if (phys_id < 0) {
+			pmu = per_cpu(rapl_pmu, i);
+			if (pmu) {
+				per_cpu(rapl_pmu, cpu) = pmu;
+				atomic_inc(&pmu->refcnt);
+			}
+			continue;
+		}
+		rapl_cpu_prepare(cpu);
+		cpumask_set_cpu(cpu, &rapl_cpu_mask);
+	}
+
+	perf_cpu_notifier(rapl_cpu_notifier);
+
+	ret = perf_pmu_register(&rapl_pmu_class, "rapl", -1);
+	WARN_ON(ret);
+
+	pmu = __get_cpu_var(rapl_pmu);
+	pr_info("RAPL PMU detected, hw unit 2^-%d Joules, "
+		" API unit is 2^-32 Joules,"
+		" %d fixed counters\n",
+		pmu->hw_unit,
+		hweight32(rapl_cntr_mask));
+
+	put_online_cpus();
+
+	return 0;
+}
+device_initcall(rapl_pmu_init);
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index abe69af..12bfd7d 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -895,7 +895,6 @@ int __perf_evsel__read(struct perf_evsel *evsel,
 			if (readn(FD(evsel, cpu, thread),
 				  &count, nv * sizeof(u64)) < 0)
 				return -errno;
-
 			aggr->val += count.val;
 			if (scale) {
 				aggr->ena += count.ena;
-- 
1.7.9.5



* [PATCH v1 2/2] perf,x86: add RAPL hrtimer support
  2013-10-07 16:09 [PATCH v1 0/2] perf,x86: add Intel RAPL PMU support Stephane Eranian
  2013-10-07 16:09 ` [PATCH v1 1/2] " Stephane Eranian
@ 2013-10-07 16:09 ` Stephane Eranian
  2013-10-07 16:19 ` [PATCH v1 0/2] perf,x86: add Intel RAPL PMU support Borislav Petkov
  2013-10-07 16:29 ` Peter Zijlstra
  3 siblings, 0 replies; 16+ messages in thread
From: Stephane Eranian @ 2013-10-07 16:09 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, ak, acme, jolsa, zheng.z.yan

The RAPL PMU counters do not interrupt on overflow.
Therefore, the kernel needs to poll the counters
to avoid missing an overflow. This patch adds
the hrtimer code to do this.

The timer interval is calculated at boot time
based on the power unit used by the HW.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event_intel_rapl.c |   75 +++++++++++++++++++++++++--
 1 file changed, 72 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_rapl.c b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
index f59dbd4..6294d62 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_rapl.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
@@ -54,6 +54,8 @@ struct rapl_pmu {
 	int		hw_unit;  /* 1/2^hw_unit Joule */
 	int		phys_id;
 	int		n_active; /* number of active events */
+	ktime_t		timer_interval; /* in ktime_t unit */
+	struct hrtimer	hrtimer;
 	unsigned long active_mask[BITS_TO_LONGS(RAPL_IDX_MAX)];
 	struct perf_event *events[RAPL_IDX_MAX];
 };
@@ -82,6 +84,18 @@ static inline u64 rapl_scale(u64 v)
 	return v << (32 - __get_cpu_var(rapl_pmu)->hw_unit);
 }
 
+static void rapl_start_hrtimer(struct rapl_pmu *pmu)
+{
+	__hrtimer_start_range_ns(&pmu->hrtimer,
+			pmu->timer_interval, 0,
+			HRTIMER_MODE_REL_PINNED, 0);
+}
+
+static void rapl_stop_hrtimer(struct rapl_pmu *pmu)
+{
+	hrtimer_cancel(&pmu->hrtimer);
+}
+
 static u64 rapl_event_update(struct perf_event *event)
 {
 	struct hw_perf_event *hwc = &event->hw;
@@ -115,6 +129,38 @@ static u64 rapl_event_update(struct perf_event *event)
 	return new_raw_count;
 }
 
+static enum hrtimer_restart rapl_hrtimer_handle(struct hrtimer *hrtimer)
+{
+	struct rapl_pmu *pmu = container_of(hrtimer, struct rapl_pmu, hrtimer);
+	unsigned long flags;
+	int i;
+
+	if (!pmu->n_active)
+		return HRTIMER_NORESTART;
+
+	local_irq_save(flags);
+
+	for_each_set_bit(i, pmu->active_mask, RAPL_IDX_MAX) {
+		rapl_event_update(pmu->events[i]);
+	}
+
+	local_irq_restore(flags);
+
+	hrtimer_forward_now(&pmu->hrtimer, pmu->timer_interval);
+
+
+	return HRTIMER_RESTART;
+}
+
+static void rapl_hrtimer_init(struct rapl_pmu *pmu)
+{
+	hrtimer_init(&pmu->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	pmu->hrtimer.function = rapl_hrtimer_handle;
+}
+
+
+
+
 static void rapl_pmu_event_start(struct perf_event *event, int flags)
 {
 	struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
@@ -127,6 +173,8 @@ static void rapl_pmu_event_start(struct perf_event *event, int flags)
 	local64_set(&event->hw.prev_count, rapl_read_counter(event));
 
 	pmu->n_active++;
+	if (pmu->n_active == 1)
+		rapl_start_hrtimer(pmu);
 }
 
 static int rapl_pmu_event_add(struct perf_event *event, int flags)
@@ -158,6 +206,9 @@ static void rapl_pmu_event_stop(struct perf_event *event, int flags)
 	if (__test_and_clear_bit(hwc->idx, pmu->active_mask)) {
 		WARN_ON_ONCE(pmu->n_active <= 0);
 		pmu->n_active--;
+		if (pmu->n_active == 0)
+			rapl_stop_hrtimer(pmu);
+
 		pmu->events[hwc->idx] = NULL;
 		WARN_ON_ONCE(hwc->state & PERF_HES_STOPPED);
 		hwc->state |= PERF_HES_STOPPED;
@@ -394,6 +445,7 @@ static int rapl_cpu_prepare(int cpu)
 {
 	struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu);
 	int phys_id = topology_physical_package_id(cpu);
+	u64 ms;
 
 	if (pmu)
 		return 0;
@@ -415,6 +467,20 @@ static int rapl_cpu_prepare(int cpu)
 	rdmsrl(MSR_RAPL_POWER_UNIT, pmu->hw_unit);
 	pmu->hw_unit = (pmu->hw_unit >> 8) & 0x1FULL;
 
+	/*
+	 * use reference of 200W for scaling the timeout
+	 * to avoid missing counter overflows.
+	 * 200W = 200 Joules/sec
+	 * divide interval by 2 to avoid lockstep (2 * 100)
+	 * if hw unit is 32, then we use 2 ms 1/200/2
+	 */
+	if (pmu->hw_unit < 32)
+		ms = 1000 * (1ULL << (32 - pmu->hw_unit - 1)) / (2 * 100);
+	else
+		ms = 2;
+
+	pmu->timer_interval = ms_to_ktime(ms);
+
 	/* set RAPL pmu for this cpu for now */
 	per_cpu(rapl_pmu_kfree, cpu) = NULL;
 	per_cpu(rapl_pmu, cpu) = pmu;
@@ -559,6 +625,7 @@ static int __init rapl_pmu_init(void)
 		}
 		rapl_cpu_prepare(cpu);
 		cpumask_set_cpu(cpu, &rapl_cpu_mask);
+		rapl_hrtimer_init(per_cpu(rapl_pmu, cpu));
 	}
 
 	perf_cpu_notifier(rapl_cpu_notifier);
@@ -567,11 +634,13 @@ static int __init rapl_pmu_init(void)
 	WARN_ON(ret);
 
 	pmu = __get_cpu_var(rapl_pmu);
-	pr_info("RAPL PMU detected, hw unit 2^-%d Joules, "
+	pr_info("RAPL PMU detected, hw unit 2^-%d Joules,"
 		" API unit is 2^-32 Joules,"
-		" %d fixed counters\n",
+		" %d fixed counters,"
+		" %Lu ms ovfl timer\n",
 		pmu->hw_unit,
-		hweight32(rapl_cntr_mask));
+		hweight32(rapl_cntr_mask),
+		ktime_to_ms(pmu->timer_interval));
 
 	put_online_cpus();
 
-- 
1.7.9.5



* Re: [PATCH v1 0/2] perf,x86: add Intel RAPL PMU support
  2013-10-07 16:09 [PATCH v1 0/2] perf,x86: add Intel RAPL PMU support Stephane Eranian
  2013-10-07 16:09 ` [PATCH v1 1/2] " Stephane Eranian
  2013-10-07 16:09 ` [PATCH v1 2/2] perf,x86: add RAPL hrtimer support Stephane Eranian
@ 2013-10-07 16:19 ` Borislav Petkov
  2013-10-07 16:24   ` Stephane Eranian
  2013-10-07 16:29 ` Peter Zijlstra
  3 siblings, 1 reply; 16+ messages in thread
From: Borislav Petkov @ 2013-10-07 16:19 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, peterz, mingo, ak, acme, jolsa, zheng.z.yan

On Mon, Oct 07, 2013 at 06:09:15PM +0200, Stephane Eranian wrote:
> The counters all count in the same unit. The perf_events API
> exposes all RAPL counters as 64-bit integers counting in unit
> of 1/2^32 Joules (or 0.23 nJ). User level tools must convert
> the counts by multiplying them by 0.23 and divide 10^9 to
> obtain Joules.  The reason for this is that the kernel avoids
> doing floating point math whenever possible because it is
> expensive (user floating-point state must be saved). The method
> used avoids kernel floating-point and minimizes the loss of
> precision (bits). Thanks to PeterZ for suggesting this approach.
> 
> To convert the raw count in Watt: W = C * 0.23 / (1e9 * time)

...

> $ perf stat -a -e rapl/rapl-energy-cores/,rapl/rapl-energy-pkg/ -I 1000 sleep 10
>             time             counts events
>      1.000345931        772 278 493 rapl/rapl-energy-cores/
>      1.000345931     55 539 138 560 rapl/rapl-energy-pkg/    
>      2.000836387        771 751 936 rapl/rapl-energy-cores/  
>      2.000836387     55 326 015 488 rapl/rapl-energy-pkg/

So can we do the Watt conversion in perf tool and make that "counts"
output more human-friendly like what those numbers are, to which
core/LLC they belong, etc, etc?

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH v1 0/2] perf,x86: add Intel RAPL PMU support
  2013-10-07 16:19 ` [PATCH v1 0/2] perf,x86: add Intel RAPL PMU support Borislav Petkov
@ 2013-10-07 16:24   ` Stephane Eranian
  2013-10-07 16:42     ` Borislav Petkov
  0 siblings, 1 reply; 16+ messages in thread
From: Stephane Eranian @ 2013-10-07 16:24 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: LKML, Peter Zijlstra, mingo, ak, Arnaldo Carvalho de Melo,
	Jiri Olsa, Yan, Zheng

On Mon, Oct 7, 2013 at 6:19 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Mon, Oct 07, 2013 at 06:09:15PM +0200, Stephane Eranian wrote:
>> The counters all count in the same unit. The perf_events API
>> exposes all RAPL counters as 64-bit integers counting in unit
>> of 1/2^32 Joules (or 0.23 nJ). User level tools must convert
>> the counts by multiplying them by 0.23 and divide 10^9 to
>> obtain Joules.  The reason for this is that the kernel avoids
>> doing floating point math whenever possible because it is
>> expensive (user floating-point state must be saved). The method
>> used avoids kernel floating-point and minimizes the loss of
>> precision (bits). Thanks to PeterZ for suggesting this approach.
>>
>> To convert the raw count in Watt: W = C * 0.23 / (1e9 * time)
>
> ...
>
>> $ perf stat -a -e rapl/rapl-energy-cores/,rapl/rapl-energy-pkg/ -I 1000 sleep 10
>>             time             counts events
>>      1.000345931        772 278 493 rapl/rapl-energy-cores/
>>      1.000345931     55 539 138 560 rapl/rapl-energy-pkg/
>>      2.000836387        771 751 936 rapl/rapl-energy-cores/
>>      2.000836387     55 326 015 488 rapl/rapl-energy-pkg/
>
> So can we do the Watt conversion in perf tool and make that "counts"
> output more human-friendly like what those numbers are, to which
> core/LLC they belong, etc, etc?
>
We could but that means we would need to special case the events in perf.
I was trying to avoid that.

> Thanks.
>
> --
> Regards/Gruss,
>     Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --


* Re: [PATCH v1 0/2] perf,x86: add Intel RAPL PMU support
  2013-10-07 16:09 [PATCH v1 0/2] perf,x86: add Intel RAPL PMU support Stephane Eranian
                   ` (2 preceding siblings ...)
  2013-10-07 16:19 ` [PATCH v1 0/2] perf,x86: add Intel RAPL PMU support Borislav Petkov
@ 2013-10-07 16:29 ` Peter Zijlstra
  3 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2013-10-07 16:29 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: linux-kernel, mingo, ak, acme, jolsa, zheng.z.yan

On Mon, Oct 07, 2013 at 06:09:15PM +0200, Stephane Eranian wrote:
> The counters all count in the same unit. The perf_events API
> exposes all RAPL counters as 64-bit integers counting in unit
> of 1/2^32 Joules (or 0.23 nJ). User level tools must convert
> the counts by multiplying them by 0.23 and divide 10^9 to
> obtain Joules.  The reason for this is that the kernel avoids
> doing floating point math whenever possible because it is
> expensive (user floating-point state must be saved). The method
> used avoids kernel floating-point and minimizes the loss of
> precision (bits). Thanks to PeterZ for suggesting this approach.
> 
> To convert the raw count in Watt: W = C * 0.23 / (1e9 * time)

Right, so the output is in 32.32 fixed point. So if you want to convert
to double you'd do something like: 

  double watt = ldexp(counter, -32);




* Re: [PATCH v1 0/2] perf,x86: add Intel RAPL PMU support
  2013-10-07 16:24   ` Stephane Eranian
@ 2013-10-07 16:42     ` Borislav Petkov
  0 siblings, 0 replies; 16+ messages in thread
From: Borislav Petkov @ 2013-10-07 16:42 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: LKML, Peter Zijlstra, mingo, ak, Arnaldo Carvalho de Melo,
	Jiri Olsa, Yan, Zheng

On Mon, Oct 07, 2013 at 06:24:18PM +0200, Stephane Eranian wrote:
> We could but that means we would need to special case the events in perf.
> I was trying to avoid that.

Maybe we could use some sort of post-processing hook on those events'
output which is defined only for that type of event...

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH v1 1/2] perf,x86: add Intel RAPL PMU support
  2013-10-07 16:09 ` [PATCH v1 1/2] " Stephane Eranian
@ 2013-10-07 17:55   ` Andi Kleen
  2013-10-07 18:08     ` Peter Zijlstra
  2013-10-07 20:58     ` Stephane Eranian
  0 siblings, 2 replies; 16+ messages in thread
From: Andi Kleen @ 2013-10-07 17:55 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: linux-kernel, peterz, mingo, acme, jolsa, zheng.z.yan

Quick review. Thanks for working on this. It should work
nicely with ucevent -- people already asked for reporting 
power numbers there.

> diff --git a/arch/x86/kernel/cpu/perf_event_intel_rapl.c b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
> new file mode 100644
> index 0000000..f59dbd4
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
> @@ -0,0 +1,580 @@

Having a comment at the beginning of each file, with two sentences on
what the file roughly does and what "RAPL" actually is, would be useful.

A pointer to the relevant SDM chapters would also be useful.

> +static u64 rapl_event_update(struct perf_event *event)
> +{
> +	struct hw_perf_event *hwc = &event->hw;
> +	u64 prev_raw_count, new_raw_count;
> +	s64 delta, sdelta;
> +	int shift = RAPL_CNTR_WIDTH;
> +
> +again:
> +	prev_raw_count = local64_read(&hwc->prev_count);
> +	rdmsrl(event->hw.event_base, new_raw_count);
> +
> +	if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
> +			    new_raw_count) != prev_raw_count)

Add a cpu_relax()

> +		goto again;
> +
> +	struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
> +
> +	if (WARN_ON_ONCE(!(event->hw.state & PERF_HES_STOPPED)))
> +		return;
> +
> +	event->hw.state = 0;
> +
> +	local64_set(&event->hw.prev_count, rapl_read_counter(event));
> +
> +	pmu->n_active++;

What lock protects this add? 

> +}
> +
> +static ssize_t rapl_get_attr_cpumask(struct device *dev,
> +				struct device_attribute *attr, char *buf)
> +{
> +	int n = cpulist_scnprintf(buf, PAGE_SIZE - 2, &rapl_cpu_mask);

Check n here in case it overflowed

> +
> +	buf[n++] = '\n';
> +	buf[n] = '\0';
> +	return n;



> +	for_each_online_cpu(i) {
> +		pmu2 = per_cpu(rapl_pmu, i);
> +
> +		if (!pmu2 || i == cpu)
> +			continue;
> +
> +		if (pmu2->phys_id == phys_id) {
> +			per_cpu(rapl_pmu, cpu) = pmu2;
> +			per_cpu(rapl_pmu_kfree, cpu) = pmu1;
> +			atomic_inc(&pmu2->refcnt);
> +			break;
> +		}
> +	}

Doesn't this need a lock of some form? AFAIK we can do parallel
CPU startup now.

Similar to the other code walking the CPUs.

> +static int __init rapl_pmu_init(void)
> +{
> +	struct rapl_pmu *pmu;
> +	int i, cpu, ret;

You need to check for an Intel CPU here, as this is called unconditionally.

A more modern way to do this is to use x86_cpu_id. 
This would in principle allow making it a module later (if perf ever
supports that)
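
A non-runnable kernel-side sketch of what that suggestion could look like (model numbers as in the patch; the `struct x86_cpu_id` field layout is per v3.12-era kernels, so treat this as illustrative rather than the final implementation):

```c
static const struct x86_cpu_id rapl_cpu_match[] = {
	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = 42 }, /* Sandy Bridge */
	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = 58 }, /* Ivy Bridge */
	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = 60 }, /* Haswell */
	{},
};
MODULE_DEVICE_TABLE(x86cpu, rapl_cpu_match);

static int __init rapl_pmu_init(void)
{
	if (!x86_match_cpu(rapl_cpu_match))
		return -ENODEV;
	/* ... per-model event setup as before ... */
}
```

The device table is what would let udev autoload this as a module later, which is the point Andi is making.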

> +
> +	/* check supported CPU */
> +	switch (boot_cpu_data.x86_model) {
> +	case 42: /* Sandy Bridge */
> +	case 58: /* Ivy Bridge */
> +	case 60: /* Haswell */

Need more model numbers for Haswell (see the main perf driver) 

> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index abe69af..12bfd7d 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -895,7 +895,6 @@ int __perf_evsel__read(struct perf_evsel *evsel,
>  			if (readn(FD(evsel, cpu, thread),
>  				  &count, nv * sizeof(u64)) < 0)
>  				return -errno;
> -
>  			aggr->val += count.val;
>  			if (scale) {
>  				aggr->ena += count.ena;

Bogus hunk

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH v1 1/2] perf,x86: add Intel RAPL PMU support
  2013-10-07 17:55   ` Andi Kleen
@ 2013-10-07 18:08     ` Peter Zijlstra
  2013-10-07 19:22       ` Andi Kleen
  2013-10-07 20:58     ` Stephane Eranian
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2013-10-07 18:08 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Stephane Eranian, linux-kernel, mingo, acme, jolsa, zheng.z.yan

On Mon, Oct 07, 2013 at 10:55:42AM -0700, Andi Kleen wrote:
> This would in principle allow making it a module later (if perf ever
> supports that)

IIRC its a few EXPORTs away from being able to do that.


* Re: [PATCH v1 1/2] perf,x86: add Intel RAPL PMU support
  2013-10-07 18:08     ` Peter Zijlstra
@ 2013-10-07 19:22       ` Andi Kleen
  2013-10-07 20:33         ` Peter Zijlstra
  2013-10-08  7:36         ` Ingo Molnar
  0 siblings, 2 replies; 16+ messages in thread
From: Andi Kleen @ 2013-10-07 19:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stephane Eranian, linux-kernel, mingo, acme, jolsa, zheng.z.yan

On Mon, Oct 07, 2013 at 08:08:10PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 07, 2013 at 10:55:42AM -0700, Andi Kleen wrote:
> > This would in principle allow making it a module later (if perf ever
> > supports that)
> 
> IIRC its a few EXPORTs away from being able to do that.

Great. With ~700k text that would be a good thing.
After all, most users don't develop.

Hopefully we can get there soon.

Is anyone actively working on it?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH v1 1/2] perf,x86: add Intel RAPL PMU support
  2013-10-07 19:22       ` Andi Kleen
@ 2013-10-07 20:33         ` Peter Zijlstra
  2013-10-08  7:36         ` Ingo Molnar
  1 sibling, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2013-10-07 20:33 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Stephane Eranian, linux-kernel, mingo, acme, jolsa, zheng.z.yan

On Mon, Oct 07, 2013 at 12:22:58PM -0700, Andi Kleen wrote:
> On Mon, Oct 07, 2013 at 08:08:10PM +0200, Peter Zijlstra wrote:
> > On Mon, Oct 07, 2013 at 10:55:42AM -0700, Andi Kleen wrote:
> > > This would in principle allow making it a module later (if perf ever
> > > supports that)
> > 
> > IIRC its a few EXPORTs away from being able to do that.
> 
> Great. With ~700k text that would be a good thing.
> After all most users don't develop.
> 
> Hopefully we can get there soon.
> 
> Is anyone actively working on it?

All of perf being a module; no, and that's not actually going to happen.
PMU driver modules should be fairly simple though.

Dunno if anybody is working on that; I typically consider my .config
broken if it's got =m in it.


* Re: [PATCH v1 1/2] perf,x86: add Intel RAPL PMU support
  2013-10-07 17:55   ` Andi Kleen
  2013-10-07 18:08     ` Peter Zijlstra
@ 2013-10-07 20:58     ` Stephane Eranian
  2013-10-07 21:45       ` Andi Kleen
  1 sibling, 1 reply; 16+ messages in thread
From: Stephane Eranian @ 2013-10-07 20:58 UTC (permalink / raw)
  To: Andi Kleen
  Cc: LKML, Peter Zijlstra, mingo, Arnaldo Carvalho de Melo, Jiri Olsa,
	Yan, Zheng

On Mon, Oct 7, 2013 at 7:55 PM, Andi Kleen <ak@linux.intel.com> wrote:
> Quick review. Thanks for working on this. It should work
> nicely with ucevent -- people already asked for reporting
> power numbers there.
>
Yes, got some requests myself too. So I implemented this.

>> diff --git a/arch/x86/kernel/cpu/perf_event_intel_rapl.c b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
>> new file mode 100644
>> index 0000000..f59dbd4
>> --- /dev/null
>> +++ b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
>> @@ -0,0 +1,580 @@
>
> Having a comment at the beginning of each file with two sentences
> what the file roughly does and what "RAPL" actually is would be useful.
>
> Also a pointer to the SDM chapters is also useful.
>
Forgot to add that. Will do in V2.

>> +static u64 rapl_event_update(struct perf_event *event)
>> +{
>> +     struct hw_perf_event *hwc = &event->hw;
>> +     u64 prev_raw_count, new_raw_count;
>> +     s64 delta, sdelta;
>> +     int shift = RAPL_CNTR_WIDTH;
>> +
>> +again:
>> +     prev_raw_count = local64_read(&hwc->prev_count);
>> +     rdmsrl(event->hw.event_base, new_raw_count);
>> +
>> +     if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
>> +                         new_raw_count) != prev_raw_count)
>
> Add a cpu_relax()
>
But then it should be in perf_event_*.c as well; this is a verbatim copy
of the existing code. Now, given that RAPL does not interrupt, the only
risk here would be preemption. I did not verify whether this function is
always called with interrupts disabled, so I left the retry loop.


>> +             goto again;
>> +
>> +     struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
>> +
>> +     if (WARN_ON_ONCE(!(event->hw.state & PERF_HES_STOPPED)))
>> +             return;
>> +
>> +     event->hw.state = 0;
>> +
>> +     local64_set(&event->hw.prev_count, rapl_read_counter(event));
>> +
>> +     pmu->n_active++;
>
> What lock protects this add?
>
None. I will add one. But then I am wondering whether it is really
necessary, given that RAPL events are system-wide and thus pinned to a CPU.
If the call came from another CPU, then we IPI over there, and that means
that CPU is executing this code. Any other CPU would need an IPI too, and
that interrupt would be kept pending. Am I missing a test case here? Are
IPIs reentrant?

>> +}
>> +
>> +static ssize_t rapl_get_attr_cpumask(struct device *dev,
>> +                             struct device_attribute *attr, char *buf)
>> +{
>> +     int n = cpulist_scnprintf(buf, PAGE_SIZE - 2, &rapl_cpu_mask);
>
> Check n here in case it overflowed
>
But isn't that what the -2 and the below \n\0 are for?

>> +
>> +     buf[n++] = '\n';
>> +     buf[n] = '\0';
>> +     return n;
>
>
>
>> +     for_each_online_cpu(i) {
>> +             pmu2 = per_cpu(rapl_pmu, i);
>> +
>> +             if (!pmu2 || i == cpu)
>> +                     continue;
>> +
>> +             if (pmu2->phys_id == phys_id) {
>> +                     per_cpu(rapl_pmu, cpu) = pmu2;
>> +                     per_cpu(rapl_pmu_kfree, cpu) = pmu1;
>> +                     atomic_inc(&pmu2->refcnt);
>> +                     break;
>> +             }
>> +     }
>
> Doesn't this need a lock of some form? AFAIK we can do parallel
> CPU startup now.
>
I did not know about that change. But then that means all the other
perf_event *_starting() and maybe even *_prepare() routines must also
use locks. I can add that to RAPL.

> Similar to the other code walking the CPUs.
>
>> +static int __init rapl_pmu_init(void)
>> +{
>> +     struct rapl_pmu *pmu;
>> +     int i, cpu, ret;
>
> You need to check for Intel CPU here, as this is called unconditionally.
>
> A more modern way to do this is to use x86_cpu_id.
> This would in principle allow making it a module later (if perf ever
> supports that)
>
Forgot that, will fix it.

>> +
>> +     /* check supported CPU */
>> +     switch (boot_cpu_data.x86_model) {
>> +     case 42: /* Sandy Bridge */
>> +     case 58: /* Ivy Bridge */
>> +     case 60: /* Haswell */
>
> Need more model numbers for Haswell (see the main perf driver)
>
Don't have all the models to test...

>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>> index abe69af..12bfd7d 100644
>> --- a/tools/perf/util/evsel.c
>> +++ b/tools/perf/util/evsel.c
>> @@ -895,7 +895,6 @@ int __perf_evsel__read(struct perf_evsel *evsel,
>>                       if (readn(FD(evsel, cpu, thread),
>>                                 &count, nv * sizeof(u64)) < 0)
>>                               return -errno;
>> -
>>                       aggr->val += count.val;
>>                       if (scale) {
>>                               aggr->ena += count.ena;
>
> Bogus hunk
>
Arg, yes. It should not be here.


Thanks for the review.


* Re: [PATCH v1 1/2] perf,x86: add Intel RAPL PMU support
  2013-10-07 20:58     ` Stephane Eranian
@ 2013-10-07 21:45       ` Andi Kleen
  2013-10-07 22:38         ` Stephane Eranian
  2013-10-08 15:10         ` Stephane Eranian
  0 siblings, 2 replies; 16+ messages in thread
From: Andi Kleen @ 2013-10-07 21:45 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: LKML, Peter Zijlstra, mingo, Arnaldo Carvalho de Melo, Jiri Olsa,
	Yan, Zheng

Stephane Eranian <eranian@google.com> writes:
>
>>> +             goto again;
>>> +
>>> +     struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
>>> +
>>> +     if (WARN_ON_ONCE(!(event->hw.state & PERF_HES_STOPPED)))
>>> +             return;
>>> +
>>> +     event->hw.state = 0;
>>> +
>>> +     local64_set(&event->hw.prev_count, rapl_read_counter(event));
>>> +
>>> +     pmu->n_active++;
>>
>> What lock protects this add?
>>
> None. I will add one. Bu then I am wondering about if it is really
> necessary given
> that RAPL event are system-wide and this pinned to a CPU. If the call came
> from another CPU, then it IPI there, and that means that CPU is executing that
> code. Any other CPU will need IPI too, and that interrupt will be kept pending.
> Am I missing a test case here? Are IPI reentrant?

They can be, if interrupts are enabled (which is likely here).

>
>>> +}
>>> +
>>> +static ssize_t rapl_get_attr_cpumask(struct device *dev,
>>> +                             struct device_attribute *attr, char *buf)
>>> +{
>>> +     int n = cpulist_scnprintf(buf, PAGE_SIZE - 2, &rapl_cpu_mask);
>>
>> Check n here in case it overflowed
>>
> But isn't that what the -2 and the below \n\0 are for?

I know it's very unlikely, and other stuff would break too, but:

Assume you have a system with so many CPUs that they don't fit
into a page. Then the scnprintf would fail, and you would corrupt
random data because you write before the buffer.

>> Doesn't this need a lock of some form? AFAIK we can do parallel
>> CPU startup now.
>>
> Did not know about this change? But then that means all the other
> perf_event *_starting() and maybe even _*prepare() routines must also
> use locks. I can add that to RAPL.

Yes, it may be broken everywhere.

>>> +     /* check supported CPU */
>>> +     switch (boot_cpu_data.x86_model) {
>>> +     case 42: /* Sandy Bridge */
>>> +     case 58: /* Ivy Bridge */
>>> +     case 60: /* Haswell */
>>
>> Need more model numbers for Haswell (see the main perf driver)
>>
> Don't have all the models to test...

It should be all the same.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH v1 1/2] perf,x86: add Intel RAPL PMU support
  2013-10-07 21:45       ` Andi Kleen
@ 2013-10-07 22:38         ` Stephane Eranian
  2013-10-08 15:10         ` Stephane Eranian
  1 sibling, 0 replies; 16+ messages in thread
From: Stephane Eranian @ 2013-10-07 22:38 UTC (permalink / raw)
  To: Andi Kleen
  Cc: LKML, Peter Zijlstra, mingo, Arnaldo Carvalho de Melo, Jiri Olsa,
	Yan, Zheng

On Mon, Oct 7, 2013 at 11:45 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Stephane Eranian <eranian@google.com> writes:
>>
>>>> +             goto again;
>>>> +
>>>> +     struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
>>>> +
>>>> +     if (WARN_ON_ONCE(!(event->hw.state & PERF_HES_STOPPED)))
>>>> +             return;
>>>> +
>>>> +     event->hw.state = 0;
>>>> +
>>>> +     local64_set(&event->hw.prev_count, rapl_read_counter(event));
>>>> +
>>>> +     pmu->n_active++;
>>>
>>> What lock protects this add?
>>>
>> None. I will add one. Bu then I am wondering about if it is really
>> necessary given
>> that RAPL event are system-wide and this pinned to a CPU. If the call came
>> from another CPU, then it IPI there, and that means that CPU is executing that
>> code. Any other CPU will need IPI too, and that interrupt will be kept pending.
>> Am I missing a test case here? Are IPI reentrant?
>
> they can be if interrupts are enabled (likely here)
>
I will check on that.

>>
>>>> +}
>>>> +
>>>> +static ssize_t rapl_get_attr_cpumask(struct device *dev,
>>>> +                             struct device_attribute *attr, char *buf)
>>>> +{
>>>> +     int n = cpulist_scnprintf(buf, PAGE_SIZE - 2, &rapl_cpu_mask);
>>>
>>> Check n here in case it overflowed
>>>
>> But isn't that what the -2 and the below \n\0 are for?
>
> I know it's very unlikely and other stuff would break, but
>
> Assuming you have a system with some many CPUs that they don't fit
> into a page. Then the scnprintf would fail, but you would corrupt
> random data because you write before the buffer.
>
My understanding is that cpulist_scnprintf() behaves like snprintf(). It
generates up to PAGE_SIZE-2 characters in the buffer. So if you have a
very large number of CPUs, the generation of the output string in buf will
stop, i.e., the string is truncated. The return value is the length of the
string, so n cannot be negative. So how could you write outside the
buffer (buf)?

The part I don't like about the API of rapl_get_attr_cpumask() here is that
it assumes that buf is PAGE_SIZE bytes; its size is not passed as an argument.
But maybe this is what you are pointing out to me.

>>> Doesn't this need a lock of some form? AFAIK we can do parallel
>>> CPU startup now.
>>>
>> Did not know about this change? But then that means all the other
>> perf_event *_starting() and maybe even _*prepare() routines must also
>> use locks. I can add that to RAPL.
>
> Yes may be broken everywhere.
>
>>>> +     /* check supported CPU */
>>>> +     switch (boot_cpu_data.x86_model) {
>>>> +     case 42: /* Sandy Bridge */
>>>> +     case 58: /* Ivy Bridge */
>>>> +     case 60: /* Haswell */
>>>
>>> Need more model numbers for Haswell (see the main perf driver)
>>>
>> Don't have all the models to test...
>
> It should be all the same.
>
I need to know which ones are client vs. server parts; they do not have
the same number of RAPL events.

Thanks.


* Re: [PATCH v1 1/2] perf,x86: add Intel RAPL PMU support
  2013-10-07 19:22       ` Andi Kleen
  2013-10-07 20:33         ` Peter Zijlstra
@ 2013-10-08  7:36         ` Ingo Molnar
  1 sibling, 0 replies; 16+ messages in thread
From: Ingo Molnar @ 2013-10-08  7:36 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Stephane Eranian, linux-kernel, mingo, acme,
	jolsa, zheng.z.yan


* Andi Kleen <ak@linux.intel.com> wrote:

> On Mon, Oct 07, 2013 at 08:08:10PM +0200, Peter Zijlstra wrote:
> > On Mon, Oct 07, 2013 at 10:55:42AM -0700, Andi Kleen wrote:
> > > This would in principle allow making it a module later (if perf ever
> > > supports that)
> > 
> > IIRC its a few EXPORTs away from being able to do that.
> 
> Great. With ~700k text that would be a good thing.

Nonsense, the real overhead of core perf + PMU drivers on x86-64 is around 
150k.

The 700k overhead you claimed is not reproducible, at all, and I tested it 
with your own config:

  https://lkml.org/lkml/2013/10/8/62

700k in perf is nonsensical - it's an obvious lie really, the code sizes 
of the relevant perf .o objects are nowhere even _close_ to that amount:

 hubble:~/tip> size $(find . -name '*perf*.o')
   text    data     bss     dec     hex filename
    887       0      32     919     397 ./arch/x86/kernel/cpu/perfctr-watchdog.o
   2932     680      96    3708     e7c ./arch/x86/kernel/cpu/perf_event_amd_uncore.o
  14117    6361       1   20479    4fff ./arch/x86/kernel/cpu/perf_event_intel.o
  23787   19541     264   43592    aa48 ./arch/x86/kernel/cpu/perf_event_intel_uncore.o
   4531    1572       0    6103    17d7 ./arch/x86/kernel/cpu/perf_event_p4.o
   1728    1121       0    2849     b21 ./arch/x86/kernel/cpu/perf_event_knc.o
   4591     811      32    5434    153a ./arch/x86/kernel/cpu/perf_event_amd_ibs.o
  13394    6797     116   20307    4f53 ./arch/x86/kernel/cpu/perf_event.o
   3156    1483      32    4671    123f ./arch/x86/kernel/cpu/perf_event_amd_iommu.o
   5322    4396       0    9718    25f6 ./arch/x86/kernel/cpu/perf_event_intel_ds.o
  11383       1       0   11384    2c78 ./arch/x86/kernel/cpu/perf_event_intel_lbr.o
   3636    1125       0    4761    1299 ./arch/x86/kernel/cpu/perf_event_amd.o
   1282     544       0    1826     722 ./arch/x86/kernel/cpu/perf_event_p6.o
    278       1       0     279     117 ./arch/x86/kernel/perf_regs.o
   4844      88      12    4944    1350 ./drivers/acpi/processor_perflib.o
     77      96       0     173      ad ./drivers/cpufreq/cpufreq_performance.o
    232      64       0     296     128 ./drivers/devfreq/governor_performance.o
   1972       4      64    2040     7f8 ./kernel/trace/trace_event_perf.o

So stop making that ridiculous claim without posting exact .configs
publicly.

Thanks,

	Ingo


* Re: [PATCH v1 1/2] perf,x86: add Intel RAPL PMU support
  2013-10-07 21:45       ` Andi Kleen
  2013-10-07 22:38         ` Stephane Eranian
@ 2013-10-08 15:10         ` Stephane Eranian
  1 sibling, 0 replies; 16+ messages in thread
From: Stephane Eranian @ 2013-10-08 15:10 UTC (permalink / raw)
  To: Andi Kleen
  Cc: LKML, Peter Zijlstra, mingo, Arnaldo Carvalho de Melo, Jiri Olsa,
	Yan, Zheng

Andi,

On Mon, Oct 7, 2013 at 11:45 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Stephane Eranian <eranian@google.com> writes:
>>
>>>> +             goto again;
>>>> +
>>>> +     struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
>>>> +
>>>> +     if (WARN_ON_ONCE(!(event->hw.state & PERF_HES_STOPPED)))
>>>> +             return;
>>>> +
>>>> +     event->hw.state = 0;
>>>> +
>>>> +     local64_set(&event->hw.prev_count, rapl_read_counter(event));
>>>> +
>>>> +     pmu->n_active++;
>>>
>>> What lock protects this add?
>>>
>> None. I will add one. Bu then I am wondering about if it is really
>> necessary given
>> that RAPL event are system-wide and this pinned to a CPU. If the call came
>> from another CPU, then it IPI there, and that means that CPU is executing that
>> code. Any other CPU will need IPI too, and that interrupt will be kept pending.
>> Am I missing a test case here? Are IPI reentrant?
>
> they can be if interrupts are enabled (likely here)
>
So, I spent some time trying to figure this out via instrumentation, and it
seems that neither this function nor, in fact, __perf_event_enable() for a
syswide event is ever called with interrupts enabled. Why?

Well, it has to do with cpu_function_call() which is ALWAYS called for a syswide
event on the perf_event_enable() code path.

If you are calling for an event on the same CPU, you end up executing:
smp_call_function_single()
       if (cpu == this_cpu) {
                local_irq_save(flags);
                func(info);
                local_irq_restore(flags);

If you are calling a remote CPU, then you end up in the APIC code to send
an IPI. On the receiving side, I could not find the local_irq_save() call, but
I verified that upon entry, __perf_event_enable() has interrupts disabled.
That is either because I missed the interrupt masking call OR because
the HW does it automatically for us. I have not yet figured this out.

In any case, it looks like both the start() and stop() routines are
protected from interrupts, and thus preemption, so we may not need a lock
to protect n_active.


end of thread, other threads:[~2013-10-08 15:10 UTC | newest]

Thread overview: 16+ messages
2013-10-07 16:09 [PATCH v1 0/2] perf,x86: add Intel RAPL PMU support Stephane Eranian
2013-10-07 16:09 ` [PATCH v1 1/2] " Stephane Eranian
2013-10-07 17:55   ` Andi Kleen
2013-10-07 18:08     ` Peter Zijlstra
2013-10-07 19:22       ` Andi Kleen
2013-10-07 20:33         ` Peter Zijlstra
2013-10-08  7:36         ` Ingo Molnar
2013-10-07 20:58     ` Stephane Eranian
2013-10-07 21:45       ` Andi Kleen
2013-10-07 22:38         ` Stephane Eranian
2013-10-08 15:10         ` Stephane Eranian
2013-10-07 16:09 ` [PATCH v1 2/2] perf,x86: add RAPL hrtimer support Stephane Eranian
2013-10-07 16:19 ` [PATCH v1 0/2] perf,x86: add Intel RAPL PMU support Borislav Petkov
2013-10-07 16:24   ` Stephane Eranian
2013-10-07 16:42     ` Borislav Petkov
2013-10-07 16:29 ` Peter Zijlstra
