linux-kernel.vger.kernel.org archive mirror
* [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support
@ 2011-11-21 14:51 Peter Zijlstra
  2011-11-21 14:51 ` [RFC][PATCH 1/6] perf: Update the mmap control page on mmap() Peter Zijlstra
                   ` (8 more replies)
  0 siblings, 9 replies; 48+ messages in thread
From: Peter Zijlstra @ 2011-11-21 14:51 UTC (permalink / raw)
  To: mingo
  Cc: William Cohen, Stephane Eranian, Arun Sharma, Vince Weaver, linux-kernel

These few patches implement x86 RDPMC support and add an extension to the
self-monitoring data to also allow additional time updates using userspace
TSC reads.

There are a few loose ends, but it mostly seems to work.



* [RFC][PATCH 1/6] perf: Update the mmap control page on mmap()
  2011-11-21 14:51 [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support Peter Zijlstra
@ 2011-11-21 14:51 ` Peter Zijlstra
  2011-11-21 14:51 ` [RFC][PATCH 2/6] perf, arch: Rework perf_event_index() Peter Zijlstra
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2011-11-21 14:51 UTC (permalink / raw)
  To: mingo
  Cc: William Cohen, linux-kernel, Stephane Eranian, Arun Sharma,
	Vince Weaver, Peter Zijlstra

[-- Attachment #1: perf-fix-userpage.patch --]
[-- Type: text/plain, Size: 758 bytes --]

Apparently we didn't update the mmap control page right after mmap(),
which leads to surprises when userspace wants to use it.
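
To illustrate the surprise (a sketch, not from the patch; assumes an
open perf_event fd and the struct layout from <linux/perf_event.h>):

	struct perf_event_mmap_page *pc;

	pc = mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ, MAP_SHARED, fd, 0);

	/*
	 * Before this fix the control page was only written on the next
	 * event update, so right after mmap() userspace could find
	 * pc->index == 0 (event looks unscheduled) and stale time
	 * fields, even though the event is counting away.
	 */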

Cc: Stephane Eranian <eranian@google.com>
Cc: Arun Sharma <asharma@fb.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/events/core.c |    2 ++
 1 file changed, 2 insertions(+)
Index: linux-2.6/kernel/events/core.c
===================================================================
--- linux-2.6.orig/kernel/events/core.c
+++ linux-2.6/kernel/events/core.c
@@ -3463,6 +3463,8 @@ static int perf_mmap(struct file *file, 
 	event->mmap_user = get_current_user();
 	vma->vm_mm->pinned_vm += event->mmap_locked;
 
+	perf_event_update_userpage(event);
+
 unlock:
 	if (!ret)
 		atomic_inc(&event->mmap_count);




* [RFC][PATCH 2/6] perf, arch: Rework perf_event_index()
  2011-11-21 14:51 [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support Peter Zijlstra
  2011-11-21 14:51 ` [RFC][PATCH 1/6] perf: Update the mmap control page on mmap() Peter Zijlstra
@ 2011-11-21 14:51 ` Peter Zijlstra
  2011-11-21 16:22   ` Eric B Munson
  2011-11-21 17:23   ` Will Deacon
  2011-11-21 14:51 ` [RFC][PATCH 3/6] perf, x86: Implement userspace RDPMC Peter Zijlstra
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 48+ messages in thread
From: Peter Zijlstra @ 2011-11-21 14:51 UTC (permalink / raw)
  To: mingo
  Cc: William Cohen, linux-kernel, Michael Cree, Will Deacon,
	Deng-Cheng Zhu, Anton Blanchard, Eric B Munson, Heiko Carstens,
	Paul Mundt, David S. Miller, Richard Kuo, Stephane Eranian,
	Arun Sharma, Vince Weaver, Peter Zijlstra

[-- Attachment #1: perf-rdpmc.patch --]
[-- Type: text/plain, Size: 7802 bytes --]

Put the logic to compute the event index into a per pmu method. This
is required because the x86 rules are weird and wonderful and don't
match the capabilities of the current scheme.

AFAIK only powerpc actually has a usable userspace read of the PMCs
but I'm not at all sure anybody actually used that.

ARM looks like it cared, but I really wouldn't know, Will?

For the rest we preserve the status quo with a default function.

We probably want to provide an explicit method for sw counters and the
like so they don't try to dereference event->hw.idx, which is a random
field in the middle of their hrtimer structure.
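
For reference, the consumer-side contract this creates, as a minimal
sketch (the real sequence also needs the ->lock seqcount, see the
tools patch later in this series; rdpmc() is the x86 helper from that
patch):

	static int read_self_count(struct perf_event_mmap_page *pc, u64 *val)
	{
		if (!pc->index)
			return -1;	/* not on the PMU; fall back to read() */

		*val = rdpmc(pc->index - 1) + pc->offset;
		return 0;
	}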

Cc: Michael Cree <mcree@orcon.net.nz>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Arun Sharma <asharma@fb.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/arm/include/asm/perf_event.h            |    4 ----
 arch/arm/kernel/perf_event.c                 |   10 ++++++++++
 arch/frv/include/asm/perf_event.h            |    2 --
 arch/hexagon/include/asm/perf_event.h        |    2 --
 arch/powerpc/include/asm/perf_event_server.h |    2 --
 arch/powerpc/kernel/perf_event.c             |    6 ++++++
 arch/s390/include/asm/perf_event.h           |    1 -
 arch/x86/include/asm/perf_event.h            |    2 --
 include/linux/perf_event.h                   |    6 ++++++
 kernel/events/core.c                         |   14 +++++++++-----
 10 files changed, 31 insertions(+), 18 deletions(-)
Index: linux-2.6/include/linux/perf_event.h
===================================================================
--- linux-2.6.orig/include/linux/perf_event.h
+++ linux-2.6/include/linux/perf_event.h
@@ -679,6 +679,12 @@ struct pmu {
 	 * for each successful ->add() during the transaction.
 	 */
 	void (*cancel_txn)		(struct pmu *pmu); /* optional */
+
+	/*
+	 * Will return the value for perf_event_mmap_page::index for this event,
+	 * if no implementation is provided it will default to: event->hw.idx + 1.
+	 */
+	int (*event_idx)		(struct perf_event *event); /*optional */
 };
 
 /**
Index: linux-2.6/kernel/events/core.c
===================================================================
--- linux-2.6.orig/kernel/events/core.c
+++ linux-2.6/kernel/events/core.c
@@ -3191,10 +3191,6 @@ int perf_event_task_disable(void)
 	return 0;
 }
 
-#ifndef PERF_EVENT_INDEX_OFFSET
-# define PERF_EVENT_INDEX_OFFSET 0
-#endif
-
 static int perf_event_index(struct perf_event *event)
 {
 	if (event->hw.state & PERF_HES_STOPPED)
@@ -3203,7 +3199,7 @@ static int perf_event_index(struct perf_
 	if (event->state != PERF_EVENT_STATE_ACTIVE)
 		return 0;
 
-	return event->hw.idx + 1 - PERF_EVENT_INDEX_OFFSET;
+	return event->pmu->event_idx(event);
 }
 
 static void calc_timer_values(struct perf_event *event,
@@ -5334,6 +5330,11 @@ static void perf_pmu_cancel_txn(struct p
 	perf_pmu_enable(pmu);
 }
 
+static int perf_event_idx_default(struct perf_event *event)
+{
+	return event->hw.idx + 1;
+}
+
 /*
  * Ensures all contexts with the same task_ctx_nr have the same
  * pmu_cpu_context too.
@@ -5523,6 +5524,9 @@ int perf_pmu_register(struct pmu *pmu, c
 		pmu->pmu_disable = perf_pmu_nop_void;
 	}
 
+	if (!pmu->event_idx)
+		pmu->event_idx = perf_event_idx_default;
+
 	list_add_rcu(&pmu->entry, &pmus);
 	ret = 0;
 unlock:
Index: linux-2.6/arch/arm/include/asm/perf_event.h
===================================================================
--- linux-2.6.orig/arch/arm/include/asm/perf_event.h
+++ linux-2.6/arch/arm/include/asm/perf_event.h
@@ -12,10 +12,6 @@
 #ifndef __ARM_PERF_EVENT_H__
 #define __ARM_PERF_EVENT_H__
 
-/* ARM performance counters start from 1 (in the cp15 accesses) so use the
- * same indexes here for consistency. */
-#define PERF_EVENT_INDEX_OFFSET 1
-
 /* ARM perf PMU IDs for use by internal perf clients. */
 enum arm_perf_pmu_ids {
 	ARM_PERF_PMU_ID_XSCALE1	= 0,
Index: linux-2.6/arch/arm/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/arch/arm/kernel/perf_event.c
+++ linux-2.6/arch/arm/kernel/perf_event.c
@@ -572,6 +572,15 @@ static void armpmu_disable(struct pmu *p
 	armpmu->stop();
 }
 
+/*
+ * ARM performance counters start from 1 (in the cp15 accesses) so use the
+ * same indexes here for consistency.
+ */
+static int armpmu_event_idx(struct perf_event *event)
+{
+	return event->hw.idx;
+}
+
 static void __init armpmu_init(struct arm_pmu *armpmu)
 {
 	atomic_set(&armpmu->active_events, 0);
@@ -586,6 +595,7 @@ static void __init armpmu_init(struct ar
 		.start		= armpmu_start,
 		.stop		= armpmu_stop,
 		.read		= armpmu_read,
+		.event_idx	= armpmu_event_idx,
 	};
 }
 
Index: linux-2.6/arch/frv/include/asm/perf_event.h
===================================================================
--- linux-2.6.orig/arch/frv/include/asm/perf_event.h
+++ linux-2.6/arch/frv/include/asm/perf_event.h
@@ -12,6 +12,4 @@
 #ifndef _ASM_PERF_EVENT_H
 #define _ASM_PERF_EVENT_H
 
-#define PERF_EVENT_INDEX_OFFSET	0
-
 #endif /* _ASM_PERF_EVENT_H */
Index: linux-2.6/arch/hexagon/include/asm/perf_event.h
===================================================================
--- linux-2.6.orig/arch/hexagon/include/asm/perf_event.h
+++ linux-2.6/arch/hexagon/include/asm/perf_event.h
@@ -19,6 +19,4 @@
 #ifndef _ASM_PERF_EVENT_H
 #define _ASM_PERF_EVENT_H
 
-#define PERF_EVENT_INDEX_OFFSET	0
-
 #endif /* _ASM_PERF_EVENT_H */
Index: linux-2.6/arch/powerpc/include/asm/perf_event_server.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/perf_event_server.h
+++ linux-2.6/arch/powerpc/include/asm/perf_event_server.h
@@ -61,8 +61,6 @@ struct pt_regs;
 extern unsigned long perf_misc_flags(struct pt_regs *regs);
 extern unsigned long perf_instruction_pointer(struct pt_regs *regs);
 
-#define PERF_EVENT_INDEX_OFFSET	1
-
 /*
  * Only override the default definitions in include/linux/perf_event.h
  * if we have hardware PMU support.
Index: linux-2.6/arch/powerpc/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/arch/powerpc/kernel/perf_event.c
+++ linux-2.6/arch/powerpc/kernel/perf_event.c
@@ -1187,6 +1187,11 @@ static int power_pmu_event_init(struct p
 	return err;
 }
 
+static int power_pmu_event_idx(struct perf_event *event)
+{
+	return event->hw.idx;
+}
+
 struct pmu power_pmu = {
 	.pmu_enable	= power_pmu_enable,
 	.pmu_disable	= power_pmu_disable,
@@ -1199,6 +1204,7 @@ struct pmu power_pmu = {
 	.start_txn	= power_pmu_start_txn,
 	.cancel_txn	= power_pmu_cancel_txn,
 	.commit_txn	= power_pmu_commit_txn,
+	.event_idx	= power_pmu_event_idx,
 };
 
 /*
Index: linux-2.6/arch/s390/include/asm/perf_event.h
===================================================================
--- linux-2.6.orig/arch/s390/include/asm/perf_event.h
+++ linux-2.6/arch/s390/include/asm/perf_event.h
@@ -6,4 +6,3 @@
 
 /* Empty, just to avoid compiling error */
 
-#define PERF_EVENT_INDEX_OFFSET 0
Index: linux-2.6/arch/x86/include/asm/perf_event.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/perf_event.h
+++ linux-2.6/arch/x86/include/asm/perf_event.h
@@ -187,8 +187,6 @@ extern u32 get_ibs_caps(void);
 #ifdef CONFIG_PERF_EVENTS
 extern void perf_events_lapic_init(void);
 
-#define PERF_EVENT_INDEX_OFFSET			0
-
 /*
  * Abuse bit 3 of the cpu eflags register to indicate proper PEBS IP fixups.
  * This flag is otherwise unused and ABI specified to be 0, so nobody should




* [RFC][PATCH 3/6] perf, x86: Implement userspace RDPMC
  2011-11-21 14:51 [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support Peter Zijlstra
  2011-11-21 14:51 ` [RFC][PATCH 1/6] perf: Update the mmap control page on mmap() Peter Zijlstra
  2011-11-21 14:51 ` [RFC][PATCH 2/6] perf, arch: Rework perf_event_index() Peter Zijlstra
@ 2011-11-21 14:51 ` Peter Zijlstra
  2011-11-21 14:51 ` [RFC][PATCH 4/6] perf, x86: Provide means of disabling " Peter Zijlstra
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2011-11-21 14:51 UTC (permalink / raw)
  To: mingo
  Cc: William Cohen, linux-kernel, Stephane Eranian, Arun Sharma,
	Vince Weaver, Peter Zijlstra

[-- Attachment #1: perf-rdpmc-2.patch --]
[-- Type: text/plain, Size: 1344 bytes --]

Implement a correct pmu::event_idx for the x86 counter index rules and
set CR4.PCE on CPU_STARTING.
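
Note on the encoding (see x86_pmu_event_idx() below): RDPMC selects
fixed-function counters by setting bit 30 of ECX, so folding that bit
into the reported index means userspace can feed (index - 1) straight
to RDPMC for both counter types; a sketch:

	u32 idx = pc->index;		/* from the mmap control page */
	u64 count = 0;

	if (idx)			/* 0 means not currently scheduled */
		count = rdpmc(idx - 1);	/* GP and fixed counters alike */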

Cc: Stephane Eranian <eranian@google.com>
Cc: Arun Sharma <asharma@fb.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/kernel/cpu/perf_event.c |   15 +++++++++++++++
 1 file changed, 15 insertions(+)
Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
@@ -1211,6 +1211,7 @@ x86_pmu_notifier(struct notifier_block *
 		break;
 
 	case CPU_STARTING:
+		set_in_cr4(X86_CR4_PCE);
 		if (x86_pmu.cpu_starting)
 			x86_pmu.cpu_starting(cpu);
 		break;
@@ -1536,6 +1537,18 @@ static int x86_pmu_event_init(struct per
 	return err;
 }
 
+static int x86_pmu_event_idx(struct perf_event *event)
+{
+	int idx = event->hw.idx;
+
+	if (x86_pmu.num_counters_fixed && idx >= X86_PMC_IDX_FIXED) {
+		idx -= X86_PMC_IDX_FIXED;
+		idx |= 1 << 30;
+	}
+
+	return idx + 1;
+}
+
 static struct pmu pmu = {
 	.pmu_enable	= x86_pmu_enable,
 	.pmu_disable	= x86_pmu_disable,
@@ -1551,6 +1564,8 @@ static struct pmu pmu = {
 	.start_txn	= x86_pmu_start_txn,
 	.cancel_txn	= x86_pmu_cancel_txn,
 	.commit_txn	= x86_pmu_commit_txn,
+
+	.event_idx	= x86_pmu_event_idx,
 };
 
 /*




* [RFC][PATCH 4/6] perf, x86: Provide means of disabling userspace RDPMC
  2011-11-21 14:51 [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support Peter Zijlstra
                   ` (2 preceding siblings ...)
  2011-11-21 14:51 ` [RFC][PATCH 3/6] perf, x86: Implement userspace RDPMC Peter Zijlstra
@ 2011-11-21 14:51 ` Peter Zijlstra
  2011-11-21 14:51 ` [RFC][PATCH 5/6] perf: Extend the mmap control page with time (TSC) fields Peter Zijlstra
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2011-11-21 14:51 UTC (permalink / raw)
  To: mingo
  Cc: William Cohen, linux-kernel, Stephane Eranian, Arun Sharma,
	Vince Weaver, Peter Zijlstra

[-- Attachment #1: perf-rdpmc-3.patch --]
[-- Type: text/plain, Size: 4246 bytes --]

Userspace RDPMC is a data leak, since people can poke at random
counters that are not their own; therefore provide a pmu-specific
attribute to turn it off.

XXX: we probably want a perf_pmu_add_option() function, but all this
sysfs stuff gives me a headache.
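
With this, userspace RDPMC can be toggled at runtime; assuming the x86
PMU shows up as "cpu" on the event_source bus, something like
"echo 0 > /sys/bus/event_source/devices/cpu/rdpmc" (root only, given
the 0600 mode) clears CR4.PCE on all CPUs.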

Cc: Stephane Eranian <eranian@google.com>
Cc: Arun Sharma <asharma@fb.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/kernel/cpu/perf_event.c |   55 ++++++++++++++++++++++++++++++++++++++-
 arch/x86/kernel/cpu/perf_event.h |    8 +++++
 include/linux/perf_event.h       |    1 
 kernel/events/core.c             |    1 
 4 files changed, 64 insertions(+), 1 deletion(-)
Index: linux-2.6/include/linux/perf_event.h
===================================================================
--- linux-2.6.orig/include/linux/perf_event.h
+++ linux-2.6/include/linux/perf_event.h
@@ -614,6 +614,7 @@ struct pmu {
 	struct list_head		entry;
 
 	struct device			*dev;
+	const struct attribute_group	**attr_groups;
 	char				*name;
 	int				type;
 
Index: linux-2.6/kernel/events/core.c
===================================================================
--- linux-2.6.orig/kernel/events/core.c
+++ linux-2.6/kernel/events/core.c
@@ -5421,6 +5421,7 @@ static int pmu_dev_alloc(struct pmu *pmu
 	if (!pmu->dev)
 		goto out;
 
+	pmu->dev->groups = pmu->attr_groups;
 	device_initialize(pmu->dev);
 	ret = dev_set_name(pmu->dev, "%s", pmu->name);
 	if (ret)
Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
@@ -24,6 +24,7 @@
 #include <linux/slab.h>
 #include <linux/cpu.h>
 #include <linux/bitops.h>
+#include <linux/device.h>
 
 #include <asm/apic.h>
 #include <asm/stacktrace.h>
@@ -1211,7 +1212,8 @@ x86_pmu_notifier(struct notifier_block *
 		break;
 
 	case CPU_STARTING:
-		set_in_cr4(X86_CR4_PCE);
+		if (x86_pmu.attr_rdpmc)
+			set_in_cr4(X86_CR4_PCE);
 		if (x86_pmu.cpu_starting)
 			x86_pmu.cpu_starting(cpu);
 		break;
@@ -1314,6 +1316,8 @@ static int __init init_hw_perf_events(vo
 		}
 	}
 
+	x86_pmu.attr_rdpmc = 1; /* enable userspace RDPMC usage by default */
+
 	pr_info("... version:                %d\n",     x86_pmu.version);
 	pr_info("... bit width:              %d\n",     x86_pmu.cntval_bits);
 	pr_info("... generic registers:      %d\n",     x86_pmu.num_counters);
@@ -1549,10 +1553,59 @@ static int x86_pmu_event_idx(struct perf
 	return idx + 1;
 }
 
+static ssize_t get_attr_rdpmc(struct device *cdev,
+			      struct device_attribute *attr,
+			      char *buf)
+{
+	return snprintf(buf, 40, "%d\n", x86_pmu.attr_rdpmc);
+}
+
+static void change_rdpmc(void *info)
+{
+	bool enable = !!(unsigned long)info;
+
+	if (enable)
+		set_in_cr4(X86_CR4_PCE);
+	else
+		clear_in_cr4(X86_CR4_PCE);
+}
+
+static ssize_t set_attr_rdpmc(struct device *cdev,
+			      struct device_attribute *attr,
+			      const char *buf, size_t count)
+{
+	unsigned long val = simple_strtoul(buf, NULL, 0);
+
+	if (!!val != !!x86_pmu.attr_rdpmc) {
+		x86_pmu.attr_rdpmc = !!val;
+		smp_call_function(change_rdpmc, (void *)val, 1);
+	}
+
+	return count;
+}
+
+static DEVICE_ATTR(rdpmc, S_IRUSR | S_IWUSR, get_attr_rdpmc, set_attr_rdpmc);
+
+static struct attribute *x86_pmu_attrs[] = {
+	&dev_attr_rdpmc.attr,
+	NULL,
+};
+
+static struct attribute_group x86_pmu_attr_group = {
+	.attrs = x86_pmu_attrs,
+};
+
+static const struct attribute_group *x86_pmu_attr_groups[] = {
+	&x86_pmu_attr_group,
+	NULL,
+};
+
 static struct pmu pmu = {
 	.pmu_enable	= x86_pmu_enable,
 	.pmu_disable	= x86_pmu_disable,
 
+	.attr_groups	= x86_pmu_attr_groups,
+
 	.event_init	= x86_pmu_event_init,
 
 	.add		= x86_pmu_add,
Index: linux-2.6/arch/x86/kernel/cpu/perf_event.h
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.h
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.h
@@ -302,6 +302,14 @@ struct x86_pmu {
 	void		(*quirks)(void);
 	int		perfctr_second_write;
 
+	/*
+	 * sysfs attrs
+	 */
+	int		attr_rdpmc;
+
+	/*
+	 * CPU Hotplug hooks
+	 */
 	int		(*cpu_prepare)(int cpu);
 	void		(*cpu_starting)(int cpu);
 	void		(*cpu_dying)(int cpu);




* [RFC][PATCH 5/6] perf: Extend the mmap control page with time (TSC) fields
  2011-11-21 14:51 [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support Peter Zijlstra
                   ` (3 preceding siblings ...)
  2011-11-21 14:51 ` [RFC][PATCH 4/6] perf, x86: Provide means of disabling " Peter Zijlstra
@ 2011-11-21 14:51 ` Peter Zijlstra
  2011-12-28 17:55   ` Stephane Eranian
  2011-11-21 14:51 ` [RFC][PATCH 6/6] perf, tools: X86 RDPMC, RDTSC test Peter Zijlstra
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2011-11-21 14:51 UTC (permalink / raw)
  To: mingo
  Cc: William Cohen, linux-kernel, Stephane Eranian, Arun Sharma,
	Vince Weaver, Peter Zijlstra

[-- Attachment #1: perf-mmap-tsc.patch --]
[-- Type: text/plain, Size: 4315 bytes --]

Extend the mmap control page with fields so that userspace can compute
time deltas relative to the provided time fields.

Currently only implemented for x86 with constant and nonstop TSC.
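
The intended userspace computation is then, roughly (a minimal sketch,
the helper name is made up here; the quot/rem split matters because
tsc * time_mult can overflow 64 bits):

	static u64 tsc_to_time_delta(struct perf_event_mmap_page *pc, u64 tsc)
	{
		u64 quot = tsc >> pc->time_shift;
		u64 rem  = tsc & (((u64)1 << pc->time_shift) - 1);

		return pc->time_offset + quot * pc->time_mult +
			((rem * pc->time_mult) >> pc->time_shift);
	}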

Cc: Stephane Eranian <eranian@google.com>
Cc: Arun Sharma <asharma@fb.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/kernel/cpu/perf_event.c |   14 ++++++++++++++
 include/linux/perf_event.h       |    4 +++-
 kernel/events/core.c             |   21 ++++++++++++++-------
 3 files changed, 31 insertions(+), 8 deletions(-)
Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
@@ -32,6 +32,7 @@
 #include <asm/compat.h>
 #include <asm/smp.h>
 #include <asm/alternative.h>
+#include <asm/timer.h>
 
 #include "perf_event.h"
 
@@ -1621,6 +1622,19 @@ static struct pmu pmu = {
 	.event_idx	= x86_pmu_event_idx,
 };
 
+void perf_update_user_clock(struct perf_event_mmap_page *userpg, u64 now)
+{
+	if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
+		return;
+
+	if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
+		return;
+
+	userpg->time_mult = this_cpu_read(cyc2ns);
+	userpg->time_shift = CYC2NS_SCALE_FACTOR;
+	userpg->time_offset = this_cpu_read(cyc2ns_offset) - now;
+}
+
 /*
  * callchain support
  */
Index: linux-2.6/include/linux/perf_event.h
===================================================================
--- linux-2.6.orig/include/linux/perf_event.h
+++ linux-2.6/include/linux/perf_event.h
@@ -290,12 +290,14 @@ struct perf_event_mmap_page {
 	__s64	offset;			/* add to hardware event value */
 	__u64	time_enabled;		/* time event active */
 	__u64	time_running;		/* time event on cpu */
+	__u32	time_mult, time_shift;
+	__u64	time_offset;
 
 		/*
 		 * Hole for extension of the self monitor capabilities
 		 */
 
-	__u64	__reserved[123];	/* align to 1k */
+	__u64	__reserved[121];	/* align to 1k */
 
 	/*
 	 * Control data for the mmap() data buffer.
Index: linux-2.6/kernel/events/core.c
===================================================================
--- linux-2.6.orig/kernel/events/core.c
+++ linux-2.6/kernel/events/core.c
@@ -3203,17 +3203,22 @@ static int perf_event_index(struct perf_
 }
 
 static void calc_timer_values(struct perf_event *event,
+				u64 *now,
 				u64 *enabled,
 				u64 *running)
 {
-	u64 now, ctx_time;
+	u64 ctx_time;
 
-	now = perf_clock();
-	ctx_time = event->shadow_ctx_time + now;
+	*now = perf_clock();
+	ctx_time = event->shadow_ctx_time + *now;
 	*enabled = ctx_time - event->tstamp_enabled;
 	*running = ctx_time - event->tstamp_running;
 }
 
+void __weak perf_update_user_clock(struct perf_event_mmap_page *userpg, u64 now)
+{
+}
+
 /*
  * Callers need to ensure there can be no nesting of this function, otherwise
  * the seqlock logic goes bad. We can not serialize this because the arch
@@ -3223,7 +3228,7 @@ void perf_event_update_userpage(struct p
 {
 	struct perf_event_mmap_page *userpg;
 	struct ring_buffer *rb;
-	u64 enabled, running;
+	u64 enabled, running, now;
 
 	rcu_read_lock();
 	/*
@@ -3235,7 +3240,7 @@ void perf_event_update_userpage(struct p
 	 * because of locking issue as we can be called in
 	 * NMI context
 	 */
-	calc_timer_values(event, &enabled, &running);
+	calc_timer_values(event, &now, &enabled, &running);
 	rb = rcu_dereference(event->rb);
 	if (!rb)
 		goto unlock;
@@ -3260,6 +3265,8 @@ void perf_event_update_userpage(struct p
 	userpg->time_running = running +
 			atomic64_read(&event->child_total_time_running);
 
+	perf_update_user_clock(userpg, now);
+
 	barrier();
 	++userpg->lock;
 	preempt_enable();
@@ -3692,7 +3699,7 @@ static void perf_output_read_group(struc
 static void perf_output_read(struct perf_output_handle *handle,
 			     struct perf_event *event)
 {
-	u64 enabled = 0, running = 0;
+	u64 enabled = 0, running = 0, now;
 	u64 read_format = event->attr.read_format;
 
 	/*
@@ -3705,7 +3712,7 @@ static void perf_output_read(struct perf
 	 * NMI context
 	 */
 	if (read_format & PERF_FORMAT_TOTAL_TIMES)
-		calc_timer_values(event, &enabled, &running);
+		calc_timer_values(event, &now, &enabled, &running);
 
 	if (event->attr.read_format & PERF_FORMAT_GROUP)
 		perf_output_read_group(handle, event, enabled, running);




* [RFC][PATCH 6/6] perf, tools: X86 RDPMC, RDTSC test
  2011-11-21 14:51 [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support Peter Zijlstra
                   ` (4 preceding siblings ...)
  2011-11-21 14:51 ` [RFC][PATCH 5/6] perf: Extend the mmap control page with time (TSC) fields Peter Zijlstra
@ 2011-11-21 14:51 ` Peter Zijlstra
  2011-11-21 15:29   ` Stephane Eranian
  2011-11-21 15:02 ` [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support Vince Weaver
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2011-11-21 14:51 UTC (permalink / raw)
  To: mingo
  Cc: William Cohen, linux-kernel, Stephane Eranian, Arun Sharma,
	Vince Weaver, Peter Zijlstra

[-- Attachment #1: perf-tools-test-rdpmc.patch --]
[-- Type: text/plain, Size: 3633 bytes --]

Implement a simple test for the self-monitoring data from the
mmap-control page.

 6: x86 rdpmc test: Ok
0: 6042 2481
1: 60042 23669
2: 605410 245588
3: 6060818 2396231
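
(Columns: run number, instruction count delta, time_running delta in
ns; the inner loop count starts at 1000 and is multiplied by 10 each
run.)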

XXX: come up with logic to automagically turn these numbers into OK/FAIL.

Cc: Stephane Eranian <eranian@google.com>
Cc: Arun Sharma <asharma@fb.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 tools/perf/builtin-test.c |  129 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 129 insertions(+)
Index: linux-2.6/tools/perf/builtin-test.c
===================================================================
--- linux-2.6.orig/tools/perf/builtin-test.c
+++ linux-2.6/tools/perf/builtin-test.c
@@ -14,6 +14,8 @@
 #include "util/thread_map.h"
 #include "../../include/linux/hw_breakpoint.h"
 
+#include <sys/mman.h>
+
 static long page_size;
 
 static int vmlinux_matches_kallsyms_filter(struct map *map __used, struct symbol *sym)
@@ -841,6 +843,127 @@ static int test__parse_events(void)
 
 	return ret;
 }
+
+#if defined(__x86_64__) || defined(__i386__)
+
+#define barrier() asm volatile("" ::: "memory")
+
+static u64 rdpmc(unsigned int counter)
+{
+	unsigned int low, high;
+
+	asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (counter));
+
+	return low | ((u64)high) << 32;
+}
+
+static u64 rdtsc(void)
+{
+	unsigned int low, high;
+
+	asm volatile("rdtsc" : "=a" (low), "=d" (high));
+
+	return low | ((u64)high) << 32;
+}
+
+static u64 mmap_read_self(void *addr, u64 *enabled, u64 *running)
+{
+	struct perf_event_mmap_page *pc = addr;
+	u32 seq;
+	u64 count, delta;
+
+	do {
+again:
+		seq = pc->lock;
+		barrier();
+		if (seq & 1)
+			goto again;
+
+		if ((enabled || running) && pc->time_mult) {
+			u64 cyc = rdtsc();
+			u64 rem, quot;
+
+			quot = (cyc >> pc->time_shift);
+			rem = cyc & ((1 << pc->time_shift) - 1);
+			delta = pc->time_offset +
+				quot * pc->time_mult +
+				((rem * pc->time_mult) >> pc->time_shift);
+		}
+
+		if (enabled)
+			*enabled = pc->time_enabled + delta;
+
+		if (running)
+			*running = pc->time_running + delta;
+
+		if (pc->index) {
+			count = rdpmc(pc->index - 1);
+			count += pc->offset;
+		} else
+			goto fail;
+
+		barrier();
+	} while (pc->lock != seq);
+
+	return count;
+
+fail:
+	/*
+	 * XXX do read() here
+	 */
+	printf("FAIL FAIL FAIL\n");
+	return 0;
+}
+
+static int test__rdpmc(void)
+{
+	volatile int tmp = 0;
+	int loops = 1000;
+	int n, i;
+	int fd;
+	void *addr;
+	struct perf_event_attr attr = {
+		.type = PERF_TYPE_HARDWARE,
+		.config = PERF_COUNT_HW_INSTRUCTIONS,
+	};
+
+
+	fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
+	if (fd < 0) {
+		die("Error: sys_perf_event_open() syscall returned "
+		    "with %d (%s)\n", fd, strerror(errno));
+	}
+
+	addr = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
+	if (addr == (void *)(-1)) {
+		die("Error: mmap() syscall returned "
+		    "with (%s)\n", strerror(errno));
+	}
+
+	for (n = 0; n < 4; n++) {
+
+		u64 stamp, now;
+		u64 stamp2, now2;
+
+		stamp = mmap_read_self(addr, NULL, &stamp2);
+
+		for (i = 0; i < loops; i++)
+			tmp++;
+
+		now = mmap_read_self(addr, NULL, &now2);
+		loops *= 10;
+
+		printf("%d: %Lu %Lu\n", n, (unsigned long long)(now - stamp),
+					   (unsigned long long)(now2 - stamp2));
+	}
+
+	munmap(addr, page_size);
+	close(fd);
+
+	return 0;
+}
+#endif
+
 static struct test {
 	const char *desc;
 	int (*func)(void);
@@ -865,6 +988,12 @@ static struct test {
 		.desc = "parse events tests",
 		.func = test__parse_events,
 	},
+#if defined(__x86_64__) || defined(__i386__)
+	{
+		.desc = "x86 rdpmc test",
+		.func = test__rdpmc,
+	},
+#endif
 	{
 		.func = NULL,
 	},




* Re: [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support
  2011-11-21 14:51 [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support Peter Zijlstra
                   ` (5 preceding siblings ...)
  2011-11-21 14:51 ` [RFC][PATCH 6/6] perf, tools: X86 RDPMC, RDTSC test Peter Zijlstra
@ 2011-11-21 15:02 ` Vince Weaver
  2011-11-21 16:05   ` William Cohen
  2011-11-21 16:08   ` William Cohen
  2011-12-02 19:26 ` Arun Sharma
  2011-12-16 22:36 ` Vince Weaver
  8 siblings, 2 replies; 48+ messages in thread
From: Vince Weaver @ 2011-11-21 15:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, William Cohen, Stephane Eranian, Arun Sharma, linux-kernel

On Mon, 21 Nov 2011, Peter Zijlstra wrote:

> These few patches implement x86 RDPMC support and add an extension to the
> self-monitoring data to also allow additional time updates using userspace
> TSC reads.
> 
> There are a few loose ends, but it mostly seems to work.

I'll have to test these out.  

I have some low-level benchmarks I've been working on; you can see some 
preliminary results here:
   http://web.eecs.utk.edu/~vweaver1/projects/perf-events/benchmarks/

perf_events was lagging behind perfmon2 and perfctr in self-monitoring 
overhead, but maybe these changes will help that.

perf_event overhead got a lot worse between 2.6.32 and 3.0; I'm mid-bisect
on that, trying to isolate the cause.

Vince



* Re: [RFC][PATCH 6/6] perf, tools: X86 RDPMC, RDTSC test
  2011-11-21 14:51 ` [RFC][PATCH 6/6] perf, tools: X86 RDPMC, RDTSC test Peter Zijlstra
@ 2011-11-21 15:29   ` Stephane Eranian
  2011-11-21 15:37     ` Peter Zijlstra
  0 siblings, 1 reply; 48+ messages in thread
From: Stephane Eranian @ 2011-11-21 15:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, William Cohen, linux-kernel, Arun Sharma, Vince Weaver

[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 6424 bytes --]

Peter,

I don't see how this test and infrastructure handles the case where the event
is multiplexed. I know there is time_enabled and time_running. But those are
not sync'd to the moment of the rdpmc(). I think there needs to be some other
timestamp in the mmap struct so the user can compute a delta to then add to
time_enabled and time_running.

Unless we assume the two time metrics are there ONLY to compute a scaling
ratio. In which case, I think, we don't need the delta, because if we can
do rdpmc() it means the event is currently scheduled, and thus time_enabled
and time_running are both ticking, which means the scaling ratio does not
change from the moment the event was scheduled in.

Am I understanding the reasoning right?


On Mon, Nov 21, 2011 at 3:51 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> Implement a simple test for the self-monitoring data from the
> mmap-control page.
>
>  6: x86 rdpmc test: Ok
> 0: 6042 2481
> 1: 60042 23669
> 2: 605410 245588
> 3: 6060818 2396231
>
> XXX: come up with logic to automagically turn these numbers into OK/FAIL.
>
> Cc: Stephane Eranian <eranian@google.com>
> Cc: Arun Sharma <asharma@fb.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  tools/perf/builtin-test.c |  129 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 129 insertions(+)
> Index: linux-2.6/tools/perf/builtin-test.c
> ===================================================================
> --- linux-2.6.orig/tools/perf/builtin-test.c
> +++ linux-2.6/tools/perf/builtin-test.c
> @@ -14,6 +14,8 @@
>  #include "util/thread_map.h"
>  #include "../../include/linux/hw_breakpoint.h"
>
> +#include <sys/mman.h>
> +
>  static long page_size;
>
>  static int vmlinux_matches_kallsyms_filter(struct map *map __used, struct symbol *sym)
> @@ -841,6 +843,127 @@ static int test__parse_events(void)
>
>        return ret;
>  }
> +
> +#if defined(__x86_64__) || defined(__i386__)
> +
> +#define barrier() asm volatile("" ::: "memory")
> +
> +static u64 rdpmc(unsigned int counter)
> +{
> +       unsigned int low, high;
> +
> +       asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (counter));
> +
> +       return low | ((u64)high) << 32;
> +}
> +
> +static u64 rdtsc(void)
> +{
> +       unsigned int low, high;
> +
> +       asm volatile("rdtsc" : "=a" (low), "=d" (high));
> +
> +       return low | ((u64)high) << 32;
> +}
> +
> +static u64 mmap_read_self(void *addr, u64 *enabled, u64 *running)
> +{
> +       struct perf_event_mmap_page *pc = addr;
> +       u32 seq;
> +       u64 count, delta = 0;
> +
> +       do {
> +again:
> +               seq = pc->lock;
> +               barrier();
> +               if (seq & 1)
> +                       goto again;
> +
> +               if ((enabled || running) && pc->time_mult) {
> +                       u64 cyc = rdtsc();
> +                       u64 rem, quot;
> +
> +                       quot = (cyc >> pc->time_shift);
> +                       rem = cyc & ((1 << pc->time_shift) - 1);
> +                       delta = pc->time_offset +
> +                               quot * pc->time_mult +
> +                               ((rem * pc->time_mult) >> pc->time_shift);
> +               }
> +
> +               if (enabled)
> +                       *enabled = pc->time_enabled + delta;
> +
> +               if (running)
> +                       *running = pc->time_running + delta;
> +
> +               if (pc->index) {
> +                       count = rdpmc(pc->index - 1);
> +                       count += pc->offset;
> +               } else
> +                       goto fail;
> +
> +               barrier();
> +       } while (pc->lock != seq);
> +
> +       return count;
> +
> +fail:
> +       /*
> +        * XXX do read() here
> +        */
> +       printf("FAIL FAIL FAIL\n");
> +       return 0;
> +}
> +
> +static int test__rdpmc(void)
> +{
> +       volatile int tmp = 0;
> +       int loops = 1000;
> +       int n, i;
> +       int fd;
> +       void *addr;
> +       struct perf_event_attr attr = {
> +               .type = PERF_TYPE_HARDWARE,
> +               .config = PERF_COUNT_HW_INSTRUCTIONS,
> +       };
> +
> +
> +       fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> +       if (fd < 0) {
> +               die("Error: sys_perf_event_open() syscall returned "
> +                   "with %d (%s)\n", fd, strerror(errno));
> +       }
> +
> +       addr = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
> +       if (addr == (void *)(-1)) {
> +               die("Error: mmap() syscall returned "
> +                   "with (%s)\n", strerror(errno));
> +       }
> +
> +       for (n = 0; n < 4; n++) {
> +
> +               u64 stamp, now;
> +               u64 stamp2, now2;
> +
> +               stamp = mmap_read_self(addr, NULL, &stamp2);
> +
> +               for (i = 0; i < loops; i++)
> +                       tmp++;
> +
> +               now = mmap_read_self(addr, NULL, &now2);
> +               loops *= 10;
> +
> +               printf("%d: %Lu %Lu\n", n, (unsigned long long)(now - stamp),
> +                                          (unsigned long long)(now2 - stamp2));
> +       }
> +
> +       munmap(addr, page_size);
> +       close(fd);
> +
> +       return 0;
> +}
> +#endif
> +
>  static struct test {
>        const char *desc;
>        int (*func)(void);
> @@ -865,6 +988,12 @@ static struct test {
>                .desc = "parse events tests",
>                .func = test__parse_events,
>        },
> +#if defined(__x86_64__) || defined(__i386__)
> +       {
> +               .desc = "x86 rdpmc test",
> +               .func = test__rdpmc,
> +       },
> +#endif
>        {
>                .func = NULL,
>        },
>
>
>


* Re: [RFC][PATCH 6/6] perf, tools: X86 RDPMC, RDTSC test
  2011-11-21 15:29   ` Stephane Eranian
@ 2011-11-21 15:37     ` Peter Zijlstra
  2011-11-21 16:59       ` Peter Zijlstra
  0 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2011-11-21 15:37 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: mingo, William Cohen, linux-kernel, Arun Sharma, Vince Weaver

On Mon, 2011-11-21 at 16:29 +0100, Stephane Eranian wrote:
> Peter,
> 
> I don't see how this test and infrastructure handles the case where the event
> is multiplexed. I know there is time_enabled and time_running. But those are
> not sync'd to the moment of the rdpmc(). I think there needs to be some other
> timestamp in the mmap struct so the user can compute a delta to then add to
> time_enabled and time_running.

When the counter isn't actually on the PMU, ->index will be 0 and rdpmc
should not be attempted.

> Unless, we assume the two time metrics are there ONLY to compute a scaling
> ratio. In which case, I think, we don't need the delta because if we
> can do rdpmc()
> it means the event is currently scheduled and thus time_enabled and time_running
> are both ticking which means the scaling ratio does not change since the moment
> the event was scheduled in.

Right, you don't need the delta to compute the scale, but it's useful for
user-space time-based measurements; Arun wanted to do something like
that.



* Re: [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support
  2011-11-21 15:02 ` [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support Vince Weaver
@ 2011-11-21 16:05   ` William Cohen
  2011-11-21 16:08   ` William Cohen
  1 sibling, 0 replies; 48+ messages in thread
From: William Cohen @ 2011-11-21 16:05 UTC (permalink / raw)
  To: Vince Weaver
  Cc: Peter Zijlstra, mingo, Stephane Eranian, Arun Sharma, linux-kernel

On 11/21/2011 10:02 AM, Vince Weaver wrote:
> On Mon, 21 Nov 2011, Peter Zijlstra wrote:
> 
>> These few patches implement x86 RDPMC support and add an extension to the
>> self-monitoring data to also allow additional time updates using userspace
>> TSC reads.
>>
>> There are a few loose ends, but it mostly seems to work.
> 
> I'll have to test these out.  
> 
> I have some low-level benchmarks I've been working on; you can see some 
> preliminary results here:
>    http://web.eecs.utk.edu/~vweaver1/projects/perf-events/benchmarks/
> 
> perf_events was lagging behind perfmon2 and perfctr in self-monitoring 
> overhead, but maybe these changes will help that.
> 
> perf_event overhead got a lot worse between 2.6.32 and 3.0; I'm mid-bisect 
> on that, trying to isolate the cause.
> 
> Vince
> 

Hi Vince,

Any improvements in the performance of self-monitoring would be welcome. The International Symposium on Computer Architecture (ISCA 2011) has a paper about a low-overhead mechanism for reading out the performance counters, LiMiT (http://castl.cs.columbia.edu/limit/index.php/LiMiT), that avoids using system calls to read the performance counter data. That paper shows a fair amount of overhead for PAPI.

-Will


* Re: [RFC][PATCH 2/6] perf, arch: Rework perf_event_index()
  2011-11-21 14:51 ` [RFC][PATCH 2/6] perf, arch: Rework perf_event_index() Peter Zijlstra
@ 2011-11-21 16:22   ` Eric B Munson
  2011-11-21 17:23   ` Will Deacon
  1 sibling, 0 replies; 48+ messages in thread
From: Eric B Munson @ 2011-11-21 16:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, William Cohen, linux-kernel, Michael Cree, Will Deacon,
	Deng-Cheng Zhu, Anton Blanchard, Heiko Carstens, Paul Mundt,
	David S. Miller, Richard Kuo, Stephane Eranian, Arun Sharma,
	Vince Weaver

[-- Attachment #1: Type: text/plain, Size: 8845 bytes --]

On Mon, 21 Nov 2011, Peter Zijlstra wrote:

> Put the logic to compute the event index into a per pmu method. This
> is required because the x86 rules are weird and wonderful and don't
> match the capabilities of the current scheme.
> 
> AFAIK only powerpc actually has a usable userspace read of the PMCs
> but I'm not at all sure anybody actually used that.
> 
> ARM looks like it cared, but I really wouldn't know, Will?
> 
> For the rest we preserve the status quo with a default function.
> 
> We probably want to provide an explicit method for sw counters and the
> like so they don't try to dereference event->hw.idx, which is a
> random field in the middle of their hrtimer structure.
> 
> Cc: Michael Cree <mcree@orcon.net.nz>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com>
> Cc: Anton Blanchard <anton@samba.org>
> Cc: Eric B Munson <emunson@mgebm.net>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: Paul Mundt <lethal@linux-sh.org>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Richard Kuo <rkuo@codeaurora.org>
> Cc: Stephane Eranian <eranian@google.com>
> Cc: Arun Sharma <asharma@fb.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Acked-by: Eric B Munson <emunson@mgebm.net>

> ---
>  arch/arm/include/asm/perf_event.h            |    4 ----
>  arch/arm/kernel/perf_event.c                 |   10 ++++++++++
>  arch/frv/include/asm/perf_event.h            |    2 --
>  arch/hexagon/include/asm/perf_event.h        |    2 --
>  arch/powerpc/include/asm/perf_event_server.h |    2 --
>  arch/powerpc/kernel/perf_event.c             |    6 ++++++
>  arch/s390/include/asm/perf_event.h           |    1 -
>  arch/x86/include/asm/perf_event.h            |    2 --
>  include/linux/perf_event.h                   |    6 ++++++
>  kernel/events/core.c                         |   14 +++++++++-----
>  10 files changed, 31 insertions(+), 18 deletions(-)
> Index: linux-2.6/include/linux/perf_event.h
> ===================================================================
> --- linux-2.6.orig/include/linux/perf_event.h
> +++ linux-2.6/include/linux/perf_event.h
> @@ -679,6 +679,12 @@ struct pmu {
>  	 * for each successful ->add() during the transaction.
>  	 */
>  	void (*cancel_txn)		(struct pmu *pmu); /* optional */
> +
> +	/*
> +	 * Will return the value for perf_event_mmap_page::index for this event,
> +	 * if no implementation is provided it will default to: event->hw.idx + 1.
> +	 */
> +	int (*event_idx)		(struct perf_event *event); /*optional */
>  };
>  
>  /**
> Index: linux-2.6/kernel/events/core.c
> ===================================================================
> --- linux-2.6.orig/kernel/events/core.c
> +++ linux-2.6/kernel/events/core.c
> @@ -3191,10 +3191,6 @@ int perf_event_task_disable(void)
>  	return 0;
>  }
>  
> -#ifndef PERF_EVENT_INDEX_OFFSET
> -# define PERF_EVENT_INDEX_OFFSET 0
> -#endif
> -
>  static int perf_event_index(struct perf_event *event)
>  {
>  	if (event->hw.state & PERF_HES_STOPPED)
> @@ -3203,7 +3199,7 @@ static int perf_event_index(struct perf_
>  	if (event->state != PERF_EVENT_STATE_ACTIVE)
>  		return 0;
>  
> -	return event->hw.idx + 1 - PERF_EVENT_INDEX_OFFSET;
> +	return event->pmu->event_idx(event);
>  }
>  
>  static void calc_timer_values(struct perf_event *event,
> @@ -5334,6 +5330,11 @@ static void perf_pmu_cancel_txn(struct p
>  	perf_pmu_enable(pmu);
>  }
>  
> +static int perf_event_idx_default(struct perf_event *event)
> +{
> +	return event->hw.idx + 1;
> +}
> +
>  /*
>   * Ensures all contexts with the same task_ctx_nr have the same
>   * pmu_cpu_context too.
> @@ -5523,6 +5524,9 @@ int perf_pmu_register(struct pmu *pmu, c
>  		pmu->pmu_disable = perf_pmu_nop_void;
>  	}
>  
> +	if (!pmu->event_idx)
> +		pmu->event_idx = perf_event_idx_default;
> +
>  	list_add_rcu(&pmu->entry, &pmus);
>  	ret = 0;
>  unlock:
> Index: linux-2.6/arch/arm/include/asm/perf_event.h
> ===================================================================
> --- linux-2.6.orig/arch/arm/include/asm/perf_event.h
> +++ linux-2.6/arch/arm/include/asm/perf_event.h
> @@ -12,10 +12,6 @@
>  #ifndef __ARM_PERF_EVENT_H__
>  #define __ARM_PERF_EVENT_H__
>  
> -/* ARM performance counters start from 1 (in the cp15 accesses) so use the
> - * same indexes here for consistency. */
> -#define PERF_EVENT_INDEX_OFFSET 1
> -
>  /* ARM perf PMU IDs for use by internal perf clients. */
>  enum arm_perf_pmu_ids {
>  	ARM_PERF_PMU_ID_XSCALE1	= 0,
> Index: linux-2.6/arch/arm/kernel/perf_event.c
> ===================================================================
> --- linux-2.6.orig/arch/arm/kernel/perf_event.c
> +++ linux-2.6/arch/arm/kernel/perf_event.c
> @@ -572,6 +572,15 @@ static void armpmu_disable(struct pmu *p
>  	armpmu->stop();
>  }
>  
> +/*
> + * ARM performance counters start from 1 (in the cp15 accesses) so use the
> + * same indexes here for consistency.
> + */
> +static int armpmu_event_idx(struct perf_event *event)
> +{
> +	return event->hw.idx;
> +}
> +
>  static void __init armpmu_init(struct arm_pmu *armpmu)
>  {
>  	atomic_set(&armpmu->active_events, 0);
> @@ -586,6 +595,7 @@ static void __init armpmu_init(struct ar
>  		.start		= armpmu_start,
>  		.stop		= armpmu_stop,
>  		.read		= armpmu_read,
> +		.event_idx	= armpmu_event_idx,
>  	};
>  }
>  
> Index: linux-2.6/arch/frv/include/asm/perf_event.h
> ===================================================================
> --- linux-2.6.orig/arch/frv/include/asm/perf_event.h
> +++ linux-2.6/arch/frv/include/asm/perf_event.h
> @@ -12,6 +12,4 @@
>  #ifndef _ASM_PERF_EVENT_H
>  #define _ASM_PERF_EVENT_H
>  
> -#define PERF_EVENT_INDEX_OFFSET	0
> -
>  #endif /* _ASM_PERF_EVENT_H */
> Index: linux-2.6/arch/hexagon/include/asm/perf_event.h
> ===================================================================
> --- linux-2.6.orig/arch/hexagon/include/asm/perf_event.h
> +++ linux-2.6/arch/hexagon/include/asm/perf_event.h
> @@ -19,6 +19,4 @@
>  #ifndef _ASM_PERF_EVENT_H
>  #define _ASM_PERF_EVENT_H
>  
> -#define PERF_EVENT_INDEX_OFFSET	0
> -
>  #endif /* _ASM_PERF_EVENT_H */
> Index: linux-2.6/arch/powerpc/include/asm/perf_event_server.h
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/include/asm/perf_event_server.h
> +++ linux-2.6/arch/powerpc/include/asm/perf_event_server.h
> @@ -61,8 +61,6 @@ struct pt_regs;
>  extern unsigned long perf_misc_flags(struct pt_regs *regs);
>  extern unsigned long perf_instruction_pointer(struct pt_regs *regs);
>  
> -#define PERF_EVENT_INDEX_OFFSET	1
> -
>  /*
>   * Only override the default definitions in include/linux/perf_event.h
>   * if we have hardware PMU support.
> Index: linux-2.6/arch/powerpc/kernel/perf_event.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/kernel/perf_event.c
> +++ linux-2.6/arch/powerpc/kernel/perf_event.c
> @@ -1187,6 +1187,11 @@ static int power_pmu_event_init(struct p
>  	return err;
>  }
>  
> +static int power_pmu_event_idx(struct perf_event *event)
> +{
> +	return event->hw.idx;
> +}
> +
>  struct pmu power_pmu = {
>  	.pmu_enable	= power_pmu_enable,
>  	.pmu_disable	= power_pmu_disable,
> @@ -1199,6 +1204,7 @@ struct pmu power_pmu = {
>  	.start_txn	= power_pmu_start_txn,
>  	.cancel_txn	= power_pmu_cancel_txn,
>  	.commit_txn	= power_pmu_commit_txn,
> +	.event_idx	= power_pmu_event_idx,
>  };
>  
>  /*
> Index: linux-2.6/arch/s390/include/asm/perf_event.h
> ===================================================================
> --- linux-2.6.orig/arch/s390/include/asm/perf_event.h
> +++ linux-2.6/arch/s390/include/asm/perf_event.h
> @@ -6,4 +6,3 @@
>  
>  /* Empty, just to avoid compiling error */
>  
> -#define PERF_EVENT_INDEX_OFFSET 0
> Index: linux-2.6/arch/x86/include/asm/perf_event.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/perf_event.h
> +++ linux-2.6/arch/x86/include/asm/perf_event.h
> @@ -187,8 +187,6 @@ extern u32 get_ibs_caps(void);
>  #ifdef CONFIG_PERF_EVENTS
>  extern void perf_events_lapic_init(void);
>  
> -#define PERF_EVENT_INDEX_OFFSET			0
> -
>  /*
>   * Abuse bit 3 of the cpu eflags register to indicate proper PEBS IP fixups.
>   * This flag is otherwise unused and ABI specified to be 0, so nobody should
> 
> 


* Re: [RFC][PATCH 6/6] perf, tools: X86 RDPMC, RDTSC test
  2011-11-21 15:37     ` Peter Zijlstra
@ 2011-11-21 16:59       ` Peter Zijlstra
  2011-11-21 17:42         ` Stephane Eranian
  0 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2011-11-21 16:59 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: mingo, William Cohen, linux-kernel, Arun Sharma, Vince Weaver

On Mon, 2011-11-21 at 16:37 +0100, Peter Zijlstra wrote:
> On Mon, 2011-11-21 at 16:29 +0100, Stephane Eranian wrote:
> > Peter,
> > 
> > I don't see how this test and infrastructure handles the case where the event
> > is multiplexed. I know there is time_enabled and time_running. But those are
> > not sync'd to the moment of the rdpmc(). I think there needs to be some other
> > timestamp in the mmap struct so the user can compute a delta to then add to
> > time_enabled and time_running.
> 
> When the counter isn't actually on the PMU, ->index will be 0 and rdpmc
> should not be attempted.
> 
> > Unless, we assume the two time metrics are there ONLY to compute a scaling
> > ratio. In which case, I think, we don't need the delta because if we
> > can do rdpmc()
> > it means the event is currently scheduled and thus time_enabled and time_running
> > are both ticking which means the scaling ratio does not change since the moment
> > the event was scheduled in.
> 
> Right, you don't need the delta to compute the scale, but it's useful for
> user-space time-based measurements; Arun wanted to do something like
> that.

I'm full of crap, of course that makes a difference :-)

Even when both running and enabled are incremented, the scaling does
still change: 3/2 != 4/3, etc.
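
(Concretely: enabled=3, running=2 scales a raw count by 3/2; one more
unit of both makes it 4/3, so the ratio keeps moving even while the
event stays scheduled.)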

Using that we can actually deal with the whole multiplexing thing
without ever having to fall back to read(), something like:


static u64 mmap_read_self(void *addr)
{
        struct perf_event_mmap_page *pc = addr;
        u32 seq, idx, time_mult, time_shift;
        u64 count, cyc, time_offset, enabled, running, delta;

        do {
                seq = pc->lock;
                barrier();

                enabled = pc->time_enabled;
                running = pc->time_running;

                if (enabled != running) {
                        cyc = rdtsc();
                        time_mult = pc->time_mult;
                        time_shift = pc->time_shift;
                        time_offset = pc->time_offset;
                } 

                idx = pc->index;
                count = pc->offset;
                if (idx)
                        count += rdpmc(idx - 1);

                barrier();
        } while (pc->lock != seq);

        if (enabled != running) {
                u64 quot, rem;

                quot = (cyc >> time_shift);
                rem = cyc & ((1 << time_shift) - 1);
                delta = time_offset + quot * time_mult + 
                        ((rem * time_mult) >> time_shift);
                
                enabled += delta;
                if (idx)
                        running += delta;
                
                quot = count / running;
                rem = count % running;
                count = quot * enabled + (rem * enabled) / running;
        }

        return count;
}

Now all I need to do is make sure pc->offset actually makes sense,
because currently it looks like we're off by event->hw.prev_count
when idx is set.





* Re: [RFC][PATCH 2/6] perf, arch: Rework perf_event_index()
  2011-11-21 14:51 ` [RFC][PATCH 2/6] perf, arch: Rework perf_event_index() Peter Zijlstra
  2011-11-21 16:22   ` Eric B Munson
@ 2011-11-21 17:23   ` Will Deacon
  2011-11-21 19:18     ` Peter Zijlstra
  1 sibling, 1 reply; 48+ messages in thread
From: Will Deacon @ 2011-11-21 17:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, William Cohen, linux-kernel, Michael Cree, Deng-Cheng Zhu,
	Anton Blanchard, Eric B Munson, Heiko Carstens, Paul Mundt,
	David S. Miller, Richard Kuo, Stephane Eranian, Arun Sharma,
	Vince Weaver

Hi Peter,

On Mon, Nov 21, 2011 at 02:51:16PM +0000, Peter Zijlstra wrote:
> Put the logic to compute the event index into a per pmu method. This
> is required because the x86 rules are weird and wonderful and don't
> match the capabilities of the current scheme.
> 
> AFAIK only powerpc actually has a usable userspace read of the PMCs
> but I'm not at all sure anybody actually used that.
> 
> ARM looks like it cared, but I really wouldn't know, Will?

It used to care, but it doesn't anymore. Feel free to make the offset 0 and
use the (now) generic codepath.

With that:

Acked-by: Will Deacon <will.deacon@arm.com>

Will


* Re: [RFC][PATCH 6/6] perf, tools: X86 RDPMC, RDTSC test
  2011-11-21 16:59       ` Peter Zijlstra
@ 2011-11-21 17:42         ` Stephane Eranian
  0 siblings, 0 replies; 48+ messages in thread
From: Stephane Eranian @ 2011-11-21 17:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, William Cohen, linux-kernel, Arun Sharma, Vince Weaver

On Mon, Nov 21, 2011 at 5:59 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2011-11-21 at 16:37 +0100, Peter Zijlstra wrote:
>> On Mon, 2011-11-21 at 16:29 +0100, Stephane Eranian wrote:
>> > Peter,
>> >
>> > I don't see how this test and infrastructure handles the case where the event
>> > is multiplexed. I know there is time_enabled and time_running. But those are
>> > not sync'd to the moment of the rdpmc(). I think there needs to be some other
>> > timestamp in the mmap struct so the user can compute a delta to then add to
>> > time_enabled and time_running.
>>
>> When the counter isn't actually on the PMU, ->index will be 0 and rdpmc
>> should not be attempted.
>>
>> > Unless, we assume the two time metrics are there ONLY to compute a scaling
>> > ratio. In which case, I think, we don't need the delta because if we
>> > can do rdpmc()
>> > it means the event is currently scheduled and thus time_enabled and time_running
>> > are both ticking which means the scaling ratio does not change since the moment
>> > the event was scheduled in.
>>
>> Right, you don't need the delta to compute the scale, but it's useful for
>> user-space time-based measurements; Arun wanted to do something like
>> that.
>
> I'm full of crap, of course that makes a difference :-)
>
> Even when both running and enabled are incremented, the scaling does
> still change: 3/2 != 4/3 etc..
>
You're right!

count = raw_count * time_enabled / time_running

If we add 1s to time_enabled and time_running, it's not the
same scaling. We're not multiplying, we're adding.

To do correct scaling, we need to figure out how many ns have elapsed
since time_enabled and time_running were last updated, i.e., when the
counter was last scheduled.

count = raw_count * (time_enabled + x) / (time_running + x)

That's what I was suggesting initially.

If you use rdtsc() + a timestamp in the mmapped page, you can
determine x.
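
In C, using the time_{mult,shift,offset} fields proposed in patch 5
(the same arithmetic as in the code quoted below), x could be computed
roughly like this:

/* Sketch: turn a TSC read into ns elapsed since the counter was last
 * scheduled in, using the patch 5 fields. */
static u64 elapsed_ns(struct perf_event_mmap_page *pc, u64 tsc)
{
        u64 quot = tsc >> pc->time_shift;
        u64 rem  = tsc & ((1ULL << pc->time_shift) - 1);

        return pc->time_offset + quot * pc->time_mult +
               ((rem * pc->time_mult) >> pc->time_shift);
}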

Personally, I never understood how this mmapped count could even work
on PPC in the case of multiplexing. The problem above is not specific
to X86.

> Using that we can actually deal with the whole multiplexing thing
> without ever having to fall back to read(), something like:
>
>
> static u64 mmap_read_self(void *addr)
> {
>        struct perf_event_mmap_page *pc = addr;
>        u32 seq, idx, time_mult, time_shift;
>        u64 count, cyc, time_offset, enabled, running, delta;
>
>        do {
>                seq = pc->lock;
>                barrier();
>
>                enabled = pc->time_enabled;
>                running = pc->time_running;
>
>                if (enabled != running) {
>                        cyc = rdtsc();
>                        time_mult = pc->time_mult;
>                        time_shift = pc->time_shift;
>                        time_offset = pc->time_offset;
>                }
>
>                idx = pc->index;
>                count = pc->offset;
>                if (idx)
>                        count += rdpmc(idx - 1);
>
>                barrier();
>        } while (pc->lock != seq);
>
>        if (enabled != running) {
>                u64 quot, rem;
>
>                quot = (cyc >> time_shift);
>                rem = cyc & ((1 << time_shift) - 1);
>                delta = time_offset + quot * time_mult +
>                        ((rem * time_mult) >> time_shift);
>
>                enabled += delta;
>                if (idx)
>                        running += delta;
>
>                quot = count / running;
>                rem = count % running;
>                count = quot * enabled + (rem * enabled) / running;
>        }
>
>        return count;
> }
>
> Now all I need to do is make sure pc->offset actually makes sense,
> because currently it looks like we're off by a factor
> event->hw.prev_count when idx is set.
>
>
>
>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 2/6] perf, arch: Rework perf_event_index()
  2011-11-21 17:23   ` Will Deacon
@ 2011-11-21 19:18     ` Peter Zijlstra
  2011-11-21 20:31       ` Will Deacon
  0 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2011-11-21 19:18 UTC (permalink / raw)
  To: Will Deacon
  Cc: mingo, William Cohen, linux-kernel, Michael Cree, Deng-Cheng Zhu,
	Anton Blanchard, Eric B Munson, Heiko Carstens, Paul Mundt,
	David S. Miller, Richard Kuo, Stephane Eranian, Arun Sharma,
	Vince Weaver

On Mon, 2011-11-21 at 17:23 +0000, Will Deacon wrote:
> Hi Peter,
> 
> On Mon, Nov 21, 2011 at 02:51:16PM +0000, Peter Zijlstra wrote:
> > Put the logic to compute the event index into a per pmu method. This
> > is required because the x86 rules are weird and wonderful and don't
> > match the capabilities of the current scheme.
> > 
> > AFAIK only powerpc actually has a usable userspace read of the PMCs
> > but I'm not at all sure anybody actually used that.
> > 
> > ARM looks like it cared, but I really wouldn't know, Will?
> 
> It used to care, but it doesn't anymore. Feel free to make the offset 0 and
> use the (now) generic codepath.
> 
> With that:
> 
> Acked-by: Will Deacon <will.deacon@arm.com>


But does ARM have a read PMU counter from userspace
instruction/capability?

Lacking that, it's all moot of course. If it does, it would be nice to
have an ARM version of patch 6.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 2/6] perf, arch: Rework perf_event_index()
  2011-11-21 19:18     ` Peter Zijlstra
@ 2011-11-21 20:31       ` Will Deacon
  2011-11-21 20:35         ` Peter Zijlstra
  0 siblings, 1 reply; 48+ messages in thread
From: Will Deacon @ 2011-11-21 20:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, William Cohen, linux-kernel, Michael Cree, Deng-Cheng Zhu,
	Anton Blanchard, Eric B Munson, Heiko Carstens, Paul Mundt,
	David S. Miller, Richard Kuo, Stephane Eranian, Arun Sharma,
	Vince Weaver

On Mon, Nov 21, 2011 at 07:18:10PM +0000, Peter Zijlstra wrote:
> But does ARM have a read PMU counter from userspace
> instruction/capability?

In ARMv7, you can enable user access to the performance counters but the
access is R/W so I don't think it's something we want to do (could interfere
with another task doing system-wide profiling).

> Lacking that its all moot of course. If it does, it would be nice to
> have an ARM version of patch 6.

Well we'd need a way to get around the all-or-nothing user access to the PMU
and also the fact that we don't always have a user-readable clocksource.

Either way, the event counters are indexed from 0 on ARMv7 so we should
use perf_event_idx_default.
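
Given the convention earlier in this thread that index 0 means "not
currently scheduled on the PMU" (so rdpmc should not be attempted), the
default presumably just reports the hardware index biased by one; a
sketch of that assumption:

/* Sketch, assuming the 0-means-unscheduled convention. */
static int perf_event_idx_default(struct perf_event *event)
{
        return event->hw.idx + 1;
}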

Will

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 2/6] perf, arch: Rework perf_event_index()
  2011-11-21 20:31       ` Will Deacon
@ 2011-11-21 20:35         ` Peter Zijlstra
  2011-11-21 22:43           ` Will Deacon
  0 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2011-11-21 20:35 UTC (permalink / raw)
  To: Will Deacon
  Cc: mingo, William Cohen, linux-kernel, Michael Cree, Deng-Cheng Zhu,
	Anton Blanchard, Eric B Munson, Heiko Carstens, Paul Mundt,
	David S. Miller, Richard Kuo, Stephane Eranian, Arun Sharma,
	Vince Weaver

On Mon, 2011-11-21 at 20:31 +0000, Will Deacon wrote:
> On Mon, Nov 21, 2011 at 07:18:10PM +0000, Peter Zijlstra wrote:
> > But does ARM have a read PMU counter from userspace
> > instruction/capability?
> 
> In ARMv7, you can enable user access to the performance counters but the
> access is R/W so I don't think it's something we want to do (could interfere
> with another task doing system-wide profiling).

Yeah, write access is pushing it a bit.. 

> > Lacking that its all moot of course. If it does, it would be nice to
> > have an ARM version of patch 6.
> 
> Well we'd need a way to get around the all-or-nothing user access to the PMU
> and also the fact that we don't always have a user-readable clocksource.
> 
> Either way, the event counters are indexed from 0 on ARMv7 so we should
> use perf_event_idx_default.

Ok, done. Thanks!

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 2/6] perf, arch: Rework perf_event_index()
  2011-11-21 20:35         ` Peter Zijlstra
@ 2011-11-21 22:43           ` Will Deacon
  2011-11-22 11:26             ` Peter Zijlstra
  0 siblings, 1 reply; 48+ messages in thread
From: Will Deacon @ 2011-11-21 22:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, William Cohen, linux-kernel, Michael Cree, Deng-Cheng Zhu,
	Anton Blanchard, Eric B Munson, Heiko Carstens, Paul Mundt,
	David S. Miller, Richard Kuo, Stephane Eranian, Arun Sharma,
	Vince Weaver

On Mon, Nov 21, 2011 at 08:35:55PM +0000, Peter Zijlstra wrote:
> On Mon, 2011-11-21 at 20:31 +0000, Will Deacon wrote:
> > On Mon, Nov 21, 2011 at 07:18:10PM +0000, Peter Zijlstra wrote:
> > > But does ARM have a read PMU counter from userspace
> > > instruction/capability?
> > 
> > In ARMv7, you can enable user access to the performance counters but the
> > access is R/W so I don't think it's something we want to do (could interfere
> > with another task doing system-wide profiling).
> 
> Yeah, write access is pushing it a bit.. 

Perhaps we could disable it while per-cpu events are running, although I
think this will probably just lead to SIGILL central for anybody trying to
use the counters in userspace.

> > > Lacking that its all moot of course. If it does, it would be nice to
> > > have an ARM version of patch 6.
> > 
> > Well we'd need a way to get around the all-or-nothing user access to the PMU
> > and also the fact that we don't always have a user-readable clocksource.
> > 
> > Either way, the event counters are indexed from 0 on ARMv7 so we should
> > use perf_event_idx_default.
> 
> Ok, done. Thanks!

Brill.

Will

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 2/6] perf, arch: Rework perf_event_index()
  2011-11-21 22:43           ` Will Deacon
@ 2011-11-22 11:26             ` Peter Zijlstra
  2011-11-22 11:47               ` Will Deacon
  2011-11-22 11:48               ` Oleg Strikov
  0 siblings, 2 replies; 48+ messages in thread
From: Peter Zijlstra @ 2011-11-22 11:26 UTC (permalink / raw)
  To: Will Deacon
  Cc: mingo, William Cohen, linux-kernel, Michael Cree, Deng-Cheng Zhu,
	Anton Blanchard, Eric B Munson, Heiko Carstens, Paul Mundt,
	David S. Miller, Richard Kuo, Stephane Eranian, Arun Sharma,
	Vince Weaver, ostrikov

On Mon, 2011-11-21 at 22:43 +0000, Will Deacon wrote:
> On Mon, Nov 21, 2011 at 08:35:55PM +0000, Peter Zijlstra wrote:
> > On Mon, 2011-11-21 at 20:31 +0000, Will Deacon wrote:
> > > On Mon, Nov 21, 2011 at 07:18:10PM +0000, Peter Zijlstra wrote:
> > > > But does ARM have a read PMU counter from userspace
> > > > instruction/capability?
> > > 
> > > In ARMv7, you can enable user access to the performance counters but the
> > > access is R/W so I don't think it's something we want to do (could interfere
> > > with another task doing system-wide profiling).
> > 
> > Yeah, write access is pushing it a bit.. 
> 
> Perhaps we could disable it while per-cpu events are running, although I
> think this will probably just lead to SIGILL central for anybody trying to
> use the counters in userspace.

One possibility would be to do as I did in patch 4, except ARM has it
disabled by default and the folks who think they know WTF they're doing
can enable it or so.

Ostrikov mentioned on #kernelnewbies he wanted to have this enabled
because apparently games use it. He mentioned toggling the user access
on/off depending on whether the kernel was using perf or not, but that
would make for a very unreliable service.

Best would be to use perf to program the thing and use the userspace
read and simply agree not to write to these registers (and pray people
don't)..

Also, for those ARMs that do have a user readable clock, you could
support the new time_{mult,shift,offset} from patch 5.



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 2/6] perf, arch: Rework perf_event_index()
  2011-11-22 11:26             ` Peter Zijlstra
@ 2011-11-22 11:47               ` Will Deacon
  2011-11-22 11:49                 ` Oleg Strikov
  2011-11-22 11:51                 ` Peter Zijlstra
  2011-11-22 11:48               ` Oleg Strikov
  1 sibling, 2 replies; 48+ messages in thread
From: Will Deacon @ 2011-11-22 11:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, William Cohen, linux-kernel, Michael Cree, Deng-Cheng Zhu,
	Anton Blanchard, Eric B Munson, Heiko Carstens, Paul Mundt,
	David S. Miller, Richard Kuo, Stephane Eranian, Arun Sharma,
	Vince Weaver, ostrikov

On Tue, Nov 22, 2011 at 11:26:20AM +0000, Peter Zijlstra wrote:
> On Mon, 2011-11-21 at 22:43 +0000, Will Deacon wrote:
> > Perhaps we could disable it while per-cpu events are running, although I
> > think this will probably just lead to SIGILL central for anybody trying to
> > use the counters in userspace.
> 
> One possibility would be to do as I did in patch 4, except ARM has it
> disabled by default and the folks who think they know WTF they're doing
> can enable it or so.

The problem is that everybody thinks they know WTF they're doing!

> Ostrikov mentioned on #kernelnewbies he wanted to have this enabled
> because apparently games use it. He mentioned toggling the user access
> on/off depending on if the kernel was using perf or not, but that would
> create very unreliable service.

Well we already have a reserve/release PMU thing which perf honours so we
could conceivably do this. I still reckon this will just lead to SIGILLs in
userspace though because we can't sanely notify tasks that they should leave
the PMU alone for a bit.

> Best would be to use perf to program the thing and use the userspace
> read and simply agree not to write to these registers (and pray people
> don't)..

But you know that the first thing people will do is zero the registers.

> Also, for those ARMs that do have a user readable clock, you could
> support the new time_{mult,shift,offset} from patch 5.

The user-readable clock will first appear in Cortex-A15, so the code for
that still needs to hit mainline before I can look at doing this in perf.

Will

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [RFC][PATCH 2/6] perf, arch: Rework perf_event_index()
  2011-11-22 11:26             ` Peter Zijlstra
  2011-11-22 11:47               ` Will Deacon
@ 2011-11-22 11:48               ` Oleg Strikov
  1 sibling, 0 replies; 48+ messages in thread
From: Oleg Strikov @ 2011-11-22 11:48 UTC (permalink / raw)
  To: Peter Zijlstra, Will Deacon
  Cc: mingo, William Cohen, linux-kernel, Michael Cree, Deng-Cheng Zhu,
	Anton Blanchard, Eric B Munson, Heiko Carstens, Paul Mundt,
	David S. Miller, Richard Kuo, Stephane Eranian, Arun Sharma,
	Vince Weaver

[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 2283 bytes --]

I was hoping that with userspace access to the counters disabled I would get zeroes/garbage from the counters.
But unfortunately I got SIGILL, which is bad. At least I now have a test for it.

-----Original Message-----
From: Peter Zijlstra [mailto:a.p.zijlstra@chello.nl] 
Sent: Tuesday, November 22, 2011 3:26 PM
To: Will Deacon
Cc: mingo@elte.hu; William Cohen; linux-kernel@vger.kernel.org; Michael Cree; Deng-Cheng Zhu; Anton Blanchard; Eric B Munson; Heiko Carstens; Paul Mundt; David S. Miller; Richard Kuo; Stephane Eranian; Arun Sharma; Vince Weaver; Oleg Strikov
Subject: Re: [RFC][PATCH 2/6] perf, arch: Rework perf_event_index()

On Mon, 2011-11-21 at 22:43 +0000, Will Deacon wrote:
> On Mon, Nov 21, 2011 at 08:35:55PM +0000, Peter Zijlstra wrote:
> > On Mon, 2011-11-21 at 20:31 +0000, Will Deacon wrote:
> > > On Mon, Nov 21, 2011 at 07:18:10PM +0000, Peter Zijlstra wrote:
> > > > But does ARM have a read PMU counter from userspace 
> > > > instruction/capability?
> > > 
> > > In ARMv7, you can enable user access to the performance counters 
> > > but the access is R/W so I don't think it's something we want to 
> > > do (could interfere with another task doing system-wide profiling).
> > 
> > Yeah, write access is pushing it a bit.. 
> 
> Perhaps we could disable it while per-cpu events are running, although 
> I think this will probably just lead to SIGILL central for anybody 
> trying to use the counters in userspace.

One possibility would be to do as I did in patch 4, except ARM has it disabled by default and the folks who think they know WTF they're doing can enable it or so.

Ostrikov mentioned on #kernelnewbies he wanted to have this enabled because apparently games use it. He mentioned toggling the user access on/off depending on if the kernel was using perf or not, but that would create very unreliable service.

Best would be to use perf to program the thing and use the userspace read and simply agree not to write to these registers (and pray people don't)..

Also, for those ARMs that do have a user readable clock, you could support the new time_{mult,shift,offset} from patch 5.



^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [RFC][PATCH 2/6] perf, arch: Rework perf_event_index()
  2011-11-22 11:47               ` Will Deacon
@ 2011-11-22 11:49                 ` Oleg Strikov
  2011-11-22 11:52                   ` Will Deacon
  2011-11-22 11:51                 ` Peter Zijlstra
  1 sibling, 1 reply; 48+ messages in thread
From: Oleg Strikov @ 2011-11-22 11:49 UTC (permalink / raw)
  To: Will Deacon, Peter Zijlstra
  Cc: mingo, William Cohen, linux-kernel, Michael Cree, Deng-Cheng Zhu,
	Anton Blanchard, Eric B Munson, Heiko Carstens, Paul Mundt,
	David S. Miller, Richard Kuo, Stephane Eranian, Arun Sharma,
	Vince Weaver

> The user-readable clock will first appear in Cortex-A15, so the code for that still needs to hit mainline before I can look at doing this in perf.

This could be very useful. I think the cycle counter covers 90% of all in-app profiling needs.

-----Original Message-----
From: Will Deacon [mailto:will.deacon@arm.com] 
Sent: Tuesday, November 22, 2011 3:47 PM
To: Peter Zijlstra
Cc: mingo@elte.hu; William Cohen; linux-kernel@vger.kernel.org; Michael Cree; Deng-Cheng Zhu; Anton Blanchard; Eric B Munson; Heiko Carstens; Paul Mundt; David S. Miller; Richard Kuo; Stephane Eranian; Arun Sharma; Vince Weaver; Oleg Strikov
Subject: Re: [RFC][PATCH 2/6] perf, arch: Rework perf_event_index()

On Tue, Nov 22, 2011 at 11:26:20AM +0000, Peter Zijlstra wrote:
> On Mon, 2011-11-21 at 22:43 +0000, Will Deacon wrote:
> > Perhaps we could disable it while per-cpu events are running, 
> > although I think this will probably just lead to SIGILL central for 
> > anybody trying to use the counters in userspace.
> 
> One possibility would be to do as I did in patch 4, except ARM has it 
> disabled by default and the folks who think they know WTF they're 
> doing can enable it or so.

The problem is that everybody thinks they know WTF they're doing!

> Ostrikov mentioned on #kernelnewbies he wanted to have this enabled 
> because apparently games use it. He mentioned toggling the user access 
> on/off depending on if the kernel was using perf or not, but that 
> would create very unreliable service.

Well we already have a reserve/release PMU thing which perf honours so we could conceivably do this. I still reckon this will just lead to SIGILLs in userspace though because we can't sanely notify tasks that they should leave the PMU alone for a bit.

> Best would be to use perf to program the thing and use the userspace 
> read and simply agree not to write to these registers (and pray people 
> don't)..

But you know that the first thing people will do is zero the registers.

> Also, for those ARMs that do have a user readable clock, you could 
> support the new time_{mult,shift,offset} from patch 5.

The user-readable clock will first appear in Cortex-A15, so the code for that still needs to hit mainline before I can look at doing this in perf.

Will

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 2/6] perf, arch: Rework perf_event_index()
  2011-11-22 11:47               ` Will Deacon
  2011-11-22 11:49                 ` Oleg Strikov
@ 2011-11-22 11:51                 ` Peter Zijlstra
  2011-11-22 11:54                   ` Will Deacon
  1 sibling, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2011-11-22 11:51 UTC (permalink / raw)
  To: Will Deacon
  Cc: mingo, William Cohen, linux-kernel, Michael Cree, Deng-Cheng Zhu,
	Anton Blanchard, Eric B Munson, Heiko Carstens, Paul Mundt,
	David S. Miller, Richard Kuo, Stephane Eranian, Arun Sharma,
	Vince Weaver, ostrikov

On Tue, 2011-11-22 at 11:47 +0000, Will Deacon wrote:
> On Tue, Nov 22, 2011 at 11:26:20AM +0000, Peter Zijlstra wrote:
> > On Mon, 2011-11-21 at 22:43 +0000, Will Deacon wrote:
> > > Perhaps we could disable it while per-cpu events are running, although I
> > > think this will probably just lead to SIGILL central for anybody trying to
> > > use the counters in userspace.
> > 
> > One possibility would be to do as I did in patch 4, except ARM has it
> > disabled by default and the folks who think they know WTF they're doing
> > can enable it or so.
> 
> The problem is that everybody thinks they know WTF they're doing!

> But you know that the first thing people will do is zero the registers.

*groan*, fair enough ;-)

> > Also, for those ARMs that do have a user readable clock, you could
> > support the new time_{mult,shift,offset} from patch 5.
> 
> The user-readable clock will first appear in Cortex-A15, so the code for
> that still needs to hit mainline before I can look at doing this in perf.

OK, I had interpreted your "we don't always have" to mean the clock was
slightly more common than just the A15.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 2/6] perf, arch: Rework perf_event_index()
  2011-11-22 11:49                 ` Oleg Strikov
@ 2011-11-22 11:52                   ` Will Deacon
  2011-11-22 11:56                     ` Oleg Strikov
  2011-11-22 12:00                     ` Oleg Strikov
  0 siblings, 2 replies; 48+ messages in thread
From: Will Deacon @ 2011-11-22 11:52 UTC (permalink / raw)
  To: Oleg Strikov
  Cc: Peter Zijlstra, mingo, William Cohen, linux-kernel, Michael Cree,
	Deng-Cheng Zhu, Anton Blanchard, Eric B Munson, Heiko Carstens,
	Paul Mundt, David S. Miller, Richard Kuo, Stephane Eranian,
	Arun Sharma, Vince Weaver

On Tue, Nov 22, 2011 at 11:49:41AM +0000, Oleg Strikov wrote:
> > The user-readable clock will first appear in Cortex-A15, so the code for that still needs to hit mainline before I can look at doing this in perf.
> 
> This could be very useful. I think that the cycles counter covers the 90% of all the in-app profiler needs.

Well, it's slightly different to the cycle counter in that it will continue
to count when the core is idle. The PMU cycle counter may stop when you
issue a WFI. The frequency of the clock is also unlikely to be the same as
the CPU clock.

Will

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 2/6] perf, arch: Rework perf_event_index()
  2011-11-22 11:51                 ` Peter Zijlstra
@ 2011-11-22 11:54                   ` Will Deacon
  0 siblings, 0 replies; 48+ messages in thread
From: Will Deacon @ 2011-11-22 11:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, William Cohen, linux-kernel, Michael Cree, Deng-Cheng Zhu,
	Anton Blanchard, Eric B Munson, Heiko Carstens, Paul Mundt,
	David S. Miller, Richard Kuo, Stephane Eranian, Arun Sharma,
	Vince Weaver, ostrikov

On Tue, Nov 22, 2011 at 11:51:50AM +0000, Peter Zijlstra wrote:
> On Tue, 2011-11-22 at 11:47 +0000, Will Deacon wrote:
> 
> OK, I had interpreted your "we don't always have" to be slightly more
> common than just A15.

It was introduced for v7.1, of which the A15 is the first implementation. A7
and future cores should also have it (it's part of the architected timers
which are really needed for doing virtualisation properly).

Will

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [RFC][PATCH 2/6] perf, arch: Rework perf_event_index()
  2011-11-22 11:52                   ` Will Deacon
@ 2011-11-22 11:56                     ` Oleg Strikov
  2011-11-22 12:00                     ` Oleg Strikov
  1 sibling, 0 replies; 48+ messages in thread
From: Oleg Strikov @ 2011-11-22 11:56 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, mingo, William Cohen, linux-kernel, Michael Cree,
	Deng-Cheng Zhu, Anton Blanchard, Eric B Munson, Heiko Carstens,
	Paul Mundt, David S. Miller, Richard Kuo, Stephane Eranian,
	Arun Sharma, Vince Weaver

Currently clock_gettime() is what's usually used for in-app profiling.
It creates a lot of redundant work and may even change the profiling results (if you _are_ CPU-bound already).
So this would be a significant leap forward.
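
For reference, the pattern a user-readable counter would replace looks
roughly like this (do_work() is a stand-in for the region of interest):

#include <time.h>

struct timespec t0, t1;

clock_gettime(CLOCK_MONOTONIC, &t0);
do_work();
clock_gettime(CLOCK_MONOTONIC, &t1);
/* elapsed = t1 - t0, with all the measurement overhead in the middle */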

-----Original Message-----
From: Will Deacon [mailto:will.deacon@arm.com] 
Sent: Tuesday, November 22, 2011 3:52 PM
To: Oleg Strikov
Cc: Peter Zijlstra; mingo@elte.hu; William Cohen; linux-kernel@vger.kernel.org; Michael Cree; Deng-Cheng Zhu; Anton Blanchard; Eric B Munson; Heiko Carstens; Paul Mundt; David S. Miller; Richard Kuo; Stephane Eranian; Arun Sharma; Vince Weaver
Subject: Re: [RFC][PATCH 2/6] perf, arch: Rework perf_event_index()

On Tue, Nov 22, 2011 at 11:49:41AM +0000, Oleg Strikov wrote:
> > The user-readable clock will first appear in Cortex-A15, so the code for that still needs to hit mainline before I can look at doing this in perf.
> 
> This could be very useful. I think that the cycles counter covers the 90% of all the in-app profiler needs.

Well, it's slightly different to the cycle counter in that it will continue to count when the core is idle. The PMU cycle counter may stop when you issue a WFI. The frequency of the clock is also unlikely to be the same as the CPU clock.

Will

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [RFC][PATCH 2/6] perf, arch: Rework perf_event_index()
  2011-11-22 11:52                   ` Will Deacon
  2011-11-22 11:56                     ` Oleg Strikov
@ 2011-11-22 12:00                     ` Oleg Strikov
  2011-11-22 12:14                       ` Will Deacon
  1 sibling, 1 reply; 48+ messages in thread
From: Oleg Strikov @ 2011-11-22 12:00 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, mingo, William Cohen, linux-kernel, Michael Cree,
	Deng-Cheng Zhu, Anton Blanchard, Eric B Munson, Heiko Carstens,
	Paul Mundt, David S. Miller, Richard Kuo, Stephane Eranian,
	Arun Sharma, Vince Weaver

Also, please do not forget that enabling userspace access to the counters may not only create
problems for profilers like perf/OProfile but could also flood the kernel with interrupts.

The perf_events subsystem enables interrupts for counter overflows.
If a userspace application can change the counter value in the right way, it could generate a lot of interrupts per second.

Oleg

-----Original Message-----
From: Will Deacon [mailto:will.deacon@arm.com] 
Sent: Tuesday, November 22, 2011 3:52 PM
To: Oleg Strikov
Cc: Peter Zijlstra; mingo@elte.hu; William Cohen; linux-kernel@vger.kernel.org; Michael Cree; Deng-Cheng Zhu; Anton Blanchard; Eric B Munson; Heiko Carstens; Paul Mundt; David S. Miller; Richard Kuo; Stephane Eranian; Arun Sharma; Vince Weaver
Subject: Re: [RFC][PATCH 2/6] perf, arch: Rework perf_event_index()

On Tue, Nov 22, 2011 at 11:49:41AM +0000, Oleg Strikov wrote:
> > The user-readable clock will first appear in Cortex-A15, so the code for that still needs to hit mainline before I can look at doing this in perf.
> 
> This could be very useful. I think that the cycles counter covers the 90% of all the in-app profiler needs.

Well, it's slightly different to the cycle counter in that it will continue to count when the core is idle. The PMU cycle counter may stop when you issue a WFI. The frequency of the clock is also unlikely to be the same as the CPU clock.

Will

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 2/6] perf, arch: Rework perf_event_index()
  2011-11-22 12:00                     ` Oleg Strikov
@ 2011-11-22 12:14                       ` Will Deacon
  2011-11-22 12:25                         ` Oleg Strikov
  0 siblings, 1 reply; 48+ messages in thread
From: Will Deacon @ 2011-11-22 12:14 UTC (permalink / raw)
  To: Oleg Strikov
  Cc: Peter Zijlstra, mingo, William Cohen, linux-kernel, Michael Cree,
	Deng-Cheng Zhu, Anton Blanchard, Eric B Munson, Heiko Carstens,
	Paul Mundt, David S. Miller, Richard Kuo, Stephane Eranian,
	Arun Sharma, Vince Weaver

On Tue, Nov 22, 2011 at 12:00:33PM +0000, Oleg Strikov wrote:
> Also, please do not forget that enabling the userspace access to the counters may not only create some problems
> to the profilers like Perf/OProfile but could also trash the kernel with the interrupts.

Actually, the registers for controlling interrupts aren't visible to
userspace with ARMv7. On ARMv6, this is a different story, which is why I
didn't mention doing this for older platforms earlier :)

Anyway, this thread has digressed now so you can always jump onto the ARM
list (linux-arm-kernel) if you want to discuss this more.

Cheers,

Will

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [RFC][PATCH 2/6] perf, arch: Rework perf_event_index()
  2011-11-22 12:14                       ` Will Deacon
@ 2011-11-22 12:25                         ` Oleg Strikov
  0 siblings, 0 replies; 48+ messages in thread
From: Oleg Strikov @ 2011-11-22 12:25 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, mingo, William Cohen, linux-kernel, Michael Cree,
	Deng-Cheng Zhu, Anton Blanchard, Eric B Munson, Heiko Carstens,
	Paul Mundt, David S. Miller, Richard Kuo, Stephane Eranian,
	Arun Sharma, Vince Weaver

> Actually, the registers for controlling interrupts aren't visible to userspace with ARMv7. On ARMv6, this is a different story, which is why I didn't mention doing this > for older platforms earlier :)

Yes, but the profiling tool enables the interrupts.
I mean that with a profiler running, a bad application can break the system.

Oleg

-----Original Message-----
From: Will Deacon [mailto:will.deacon@arm.com] 
Sent: Tuesday, November 22, 2011 4:14 PM
To: Oleg Strikov
Cc: Peter Zijlstra; mingo@elte.hu; William Cohen; linux-kernel@vger.kernel.org; Michael Cree; Deng-Cheng Zhu; Anton Blanchard; Eric B Munson; Heiko Carstens; Paul Mundt; David S. Miller; Richard Kuo; Stephane Eranian; Arun Sharma; Vince Weaver
Subject: Re: [RFC][PATCH 2/6] perf, arch: Rework perf_event_index()

On Tue, Nov 22, 2011 at 12:00:33PM +0000, Oleg Strikov wrote:
> Also, please do not forget that enabling the userspace access to the 
> counters may not only create some problems to the profilers like Perf/OProfile but could also trash the kernel with the interrupts.

Actually, the registers for controlling interrupts aren't visible to userspace with ARMv7. On ARMv6, this is a different story, which is why I didn't mention doing this for older platforms earlier :)

Anyway, this thread has digressed now so you can always jump onto the ARM list (linux-arm-kernel) if you want to discuss this more.

Cheers,

Will

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support
  2011-11-21 14:51 [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support Peter Zijlstra
                   ` (6 preceding siblings ...)
  2011-11-21 15:02 ` [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support Vince Weaver
@ 2011-12-02 19:26 ` Arun Sharma
  2011-12-02 22:22   ` Stephane Eranian
  2011-12-16 22:36 ` Vince Weaver
  8 siblings, 1 reply; 48+ messages in thread
From: Arun Sharma @ 2011-12-02 19:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, William Cohen, Stephane Eranian, Vince Weaver, linux-kernel

On 11/21/11 6:51 AM, Peter Zijlstra wrote:
> These few patches implement x86 RDPMC support and add an extention to the self
> monitoring data to also allow additional time updates using userspace TSC reads.
>
> There's a few loose ends, but it mostly seems to work.

I haven't had a chance to test this out yet, but low-overhead, always-on
perf counters are something we're very interested in. Thanks for
implementing it.

However, I suspect the major cost of leaving the perf counters always on
is the hit on context switches, rather than the cost of reading the perf
counters themselves. For example:

   Baseline:

    (for i in `seq 1 10`; do numactl --cpunodebind 1 ./lat_ctx -P1 -s32k 
4; done) 2>&1 | tee lmbench1.log

   1 event:

   (for i in `seq 1 10`; do numactl --cpunodebind 1 perf stat -e 
instructions ./lat_ctx -P1 -s32k 4; done) 2>&1 | tee lmbench2.log

   2 events:

   (for i in `seq 1 10`; do numactl --cpunodebind 1 perf stat -e 
cycles,instructions ./lat_ctx -P1 -s32k 4; done) 2>&1 | tee lmbench3.log

   Baseline: 2.2us
   One event: 6.8us
   Two events: 7.2us

The cost seems to be roughly 5us (I measured 2.6.38 and 3.2-rc2).
I'll dig a bit more into what may be going on here.

  -Arun

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support
  2011-12-02 19:26 ` Arun Sharma
@ 2011-12-02 22:22   ` Stephane Eranian
  2011-12-05 20:16     ` Arun Sharma
  0 siblings, 1 reply; 48+ messages in thread
From: Stephane Eranian @ 2011-12-02 22:22 UTC (permalink / raw)
  To: Arun Sharma
  Cc: Peter Zijlstra, mingo, William Cohen, Vince Weaver, linux-kernel

Have you tried with inheritance disabled?
I don't remember if perf stat has it enabled by default.

On Fri, Dec 2, 2011 at 11:26 AM, Arun Sharma <asharma@fb.com> wrote:
> On 11/21/11 6:51 AM, Peter Zijlstra wrote:
>>
>> These few patches implement x86 RDPMC support and add an extention to the
>> self
>> monitoring data to also allow additional time updates using userspace TSC
>> reads.
>>
>> There's a few loose ends, but it mostly seems to work.
>
>
> I haven't had a chance to test this out yet. But low overhead, always on
> perf counters is something we're very interested in. Thanks for implementing
> it.
>
> However, I suspect the major cost of leaving the perf counters always on
> seems to be in the hit on context switches, rather than the cost of reading
> the perf counters themselves. For eg:
>
>  Baseline:
>
>   (for i in `seq 1 10`; do numactl --cpunodebind 1 ./lat_ctx -P1 -s32k 4;
> done) 2>&1 | tee lmbench1.log
>
>  1 event:
>
>  (for i in `seq 1 10`; do numactl --cpunodebind 1 perf stat -e instructions
> ./lat_ctx -P1 -s32k 4; done) 2>&1 | tee lmbench2.log
>
>  2 events:
>
>  (for i in `seq 1 10`; do numactl --cpunodebind 1 perf stat -e
> cycles,instructions ./lat_ctx -P1 -s32k 4; done) 2>&1 | tee lmbench3.log
>
>  Baseline: 2.2us
>  One event: 6.8us
>  Two events: 7.2us
>
> The cost seems to be at roughly 5us (I measured 2.6.38 and 3.2-rc2). I'll
> dig a bit more on what may be going on here.
>
>  -Arun

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support
  2011-12-02 22:22   ` Stephane Eranian
@ 2011-12-05 20:16     ` Arun Sharma
  2011-12-05 23:17       ` Arun Sharma
  0 siblings, 1 reply; 48+ messages in thread
From: Arun Sharma @ 2011-12-05 20:16 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Peter Zijlstra, mingo, William Cohen, Vince Weaver, linux-kernel

On 12/2/11 2:22 PM, Stephane Eranian wrote:
> Have you tried with inheritance disabled?
> I don't remember if perf stat has it enabled buy default.

Disabling inheritance using the -i switch as in:

(for i in `seq 1 10`; do numactl --cpunodebind 1 perf stat -i -e 
instructions ./lat_ctx -P1 -s32k 4; done)

shows numbers closer to baseline, since the threads used for 
benchmarking are not counting events.

This suggests that:

jump_label_dec(&perf_task_events);

itself doesn't hurt much.

  -Arun

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support
  2011-12-05 20:16     ` Arun Sharma
@ 2011-12-05 23:17       ` Arun Sharma
  2011-12-06  1:38         ` Stephane Eranian
  2011-12-06  9:42         ` Peter Zijlstra
  0 siblings, 2 replies; 48+ messages in thread
From: Arun Sharma @ 2011-12-05 23:17 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Peter Zijlstra, mingo, William Cohen, Vince Weaver, linux-kernel

On 12/5/11 12:16 PM, Arun Sharma wrote:
> On 12/2/11 2:22 PM, Stephane Eranian wrote:
>> Have you tried with inheritance disabled?
>> I don't remember if perf stat has it enabled buy default.
>
> Disabling inheritance using the -i switch as in:
>
> (for i in `seq 1 10`; do numactl --cpunodebind 1 perf stat -i -e
> instructions ./lat_ctx -P1 -s32k 4; done)
>
> shows numbers closer to baseline, since the threads used for
> benchmarking are not counting events.

I spent some more time looking into this. Running:

perf stat -e instructions  ./lat_ctx -P1 -s32k 4 -N 1000000 &
perf record -ag -- sleep 3

didn't show high cycle counts in any perf_events-related functions in
the context switch path. The only PMU-related thing that showed up had
to do with x86_pmu_enable called from an IPI.

So just as an experiment I disabled perf_rotate_context() via:

--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4252,8 +4252,6 @@ void scheduler_tick(void)
         curr->sched_class->task_tick(rq, curr, 0);
         raw_spin_unlock(&rq->lock);

-       perf_event_task_tick();
-

That seems to bring the lat_ctx numbers very close to baseline.

I suspect that some optimizations are possible in perf_rotate_context
that don't involve enabling/disabling the PMU via an IPI for simple
cases that involve one or two hardware events (e.g. fixed counters).

  -Arun

Sample stack trace:

lat_ctx 29874 [012] 350729.824173: cycles:
         ffffffff8101149c intel_pmu_enable_all ([kernel.kallsyms])
         ffffffff810115f7 intel_pmu_nhm_enable_all ([kernel.kallsyms])
         ffffffff8100e877 x86_pmu_enable ([kernel.kallsyms])
         ffffffff810a99de perf_pmu_enable ([kernel.kallsyms])
         ffffffff8100d682 x86_pmu_commit_txn ([kernel.kallsyms])
         ffffffff810aa4f9 group_sched_in ([kernel.kallsyms])
         ffffffff810ab06d __perf_event_enable ([kernel.kallsyms])
         ffffffff810a8c93 remote_function ([kernel.kallsyms])
         ffffffff81065eec generic_smp_call_function_single_interrupt 
([kernel.kallsyms])
         ffffffff81018064 smp_call_function_single_interrupt 
([kernel.kallsyms])
         ffffffff81461fcb call_function_single_interrupt ([kernel.kallsyms])
                   401ec4 bread (/root/lat_ctx)


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support
  2011-12-05 23:17       ` Arun Sharma
@ 2011-12-06  1:38         ` Stephane Eranian
  2011-12-06  9:42         ` Peter Zijlstra
  1 sibling, 0 replies; 48+ messages in thread
From: Stephane Eranian @ 2011-12-06  1:38 UTC (permalink / raw)
  To: Arun Sharma
  Cc: Peter Zijlstra, mingo, William Cohen, Vince Weaver, linux-kernel

I believe the multiplexing rate is just unnecessarily high, too. I think
Peter added some code to make it tunable, but the user interface is not
there yet AFAICT. There is quite some work needed to stop, reschedule
and restart the counters.

On Mon, Dec 5, 2011 at 3:17 PM, Arun Sharma <asharma@fb.com> wrote:
> On 12/5/11 12:16 PM, Arun Sharma wrote:
>>
>> On 12/2/11 2:22 PM, Stephane Eranian wrote:
>>>
>>> Have you tried with inheritance disabled?
>>> I don't remember if perf stat has it enabled buy default.
>>
>>
>> Disabling inheritance using the -i switch as in:
>>
>> (for i in `seq 1 10`; do numactl --cpunodebind 1 perf stat -i -e
>> instructions ./lat_ctx -P1 -s32k 4; done)
>>
>> shows numbers closer to baseline, since the threads used for
>> benchmarking are not counting events.
>
>
> I spent some more time looking into this. Running:
>
> perf stat -e instructions  ./lat_ctx -P1 -s32k 4 -N 1000000 &
> perf record -ag -- sleep 3
>
> didn't show high cycles counts on any perf_events related functions in the
> context switch path. The only PMU related stuff that showed up had to do
> with x86_pmu_enable called from an IPI.
>
> So just as an experiment I disabled perf_rotate_context() via:
>
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -4252,8 +4252,6 @@ void scheduler_tick(void)
>        curr->sched_class->task_tick(rq, curr, 0);
>        raw_spin_unlock(&rq->lock);
>
> -       perf_event_task_tick();
> -
>
> That seems to bring the lat_ctx numbers very close to baseline.
>
> I suspect that some optimizations are possible in perf_rotate_context that
> don't involve enabling/disabling the PMU via an IPI for simple cases that
> involve one or two hardware events (eg: fixed counters).
>
>  -Arun
>
> Sample stack trace:
>
> lat_ctx 29874 [012] 350729.824173: cycles:
>        ffffffff8101149c intel_pmu_enable_all ([kernel.kallsyms])
>        ffffffff810115f7 intel_pmu_nhm_enable_all ([kernel.kallsyms])
>        ffffffff8100e877 x86_pmu_enable ([kernel.kallsyms])
>        ffffffff810a99de perf_pmu_enable ([kernel.kallsyms])
>        ffffffff8100d682 x86_pmu_commit_txn ([kernel.kallsyms])
>        ffffffff810aa4f9 group_sched_in ([kernel.kallsyms])
>        ffffffff810ab06d __perf_event_enable ([kernel.kallsyms])
>        ffffffff810a8c93 remote_function ([kernel.kallsyms])
>        ffffffff81065eec generic_smp_call_function_single_interrupt
> ([kernel.kallsyms])
>        ffffffff81018064 smp_call_function_single_interrupt
> ([kernel.kallsyms])
>        ffffffff81461fcb call_function_single_interrupt ([kernel.kallsyms])
>                  401ec4 bread (/root/lat_ctx)
>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support
  2011-12-05 23:17       ` Arun Sharma
  2011-12-06  1:38         ` Stephane Eranian
@ 2011-12-06  9:42         ` Peter Zijlstra
  2011-12-06 21:53           ` Arun Sharma
  1 sibling, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2011-12-06  9:42 UTC (permalink / raw)
  To: Arun Sharma
  Cc: Stephane Eranian, mingo, William Cohen, Vince Weaver, linux-kernel

On Mon, 2011-12-05 at 15:17 -0800, Arun Sharma wrote:
> 
> I suspect that some optimizations are possible in perf_rotate_context 
> that don't involve enabling/disabling the PMU via an IPI for simple 
> cases that involve one or two hardware events (eg: fixed counters). 

I've got a patch queued that does exactly that.

---
Subject: perf: Avoid a useless pmu_disable() in the perf-tick
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Wed Nov 16 14:38:16 CET 2011

Gleb writes:
> Currently pmu is disabled and re-enabled on each timer interrupt even
> when no rotation or frequency adjustment is needed. On Intel CPU this
> results in two writes into PERF_GLOBAL_CTRL MSR per tick. On bare metal
> it does not cause significant slowdown, but when running perf in a virtual
> machine it leads to 20% slowdown on my machine.

Cure this by keeping a perf_event_context::nr_freq counter that counts the
number of active events that require frequency adjustments and use this in a
similar fashion to the already existing nr_events != nr_active test in
perf_rotate_context().

By being able to exclude both rotation and frequency adjustments a priori for
the common case, we can avoid the otherwise superfluous PMU disable.

Suggested-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/perf_event.h |    1 
 kernel/events/core.c       |   50 +++++++++++++++++++++++++++++----------------
 2 files changed, 34 insertions(+), 17 deletions(-)

Index: linux-2.6/include/linux/perf_event.h
===================================================================
--- linux-2.6.orig/include/linux/perf_event.h
+++ linux-2.6/include/linux/perf_event.h
@@ -889,6 +889,7 @@ struct perf_event_context {
 	int				nr_active;
 	int				is_active;
 	int				nr_stat;
+	int				nr_freq;
 	int				rotate_disable;
 	atomic_t			refcount;
 	struct task_struct		*task;
Index: linux-2.6/kernel/events/core.c
===================================================================
--- linux-2.6.orig/kernel/events/core.c
+++ linux-2.6/kernel/events/core.c
@@ -1127,6 +1127,8 @@ event_sched_out(struct perf_event *event
 	if (!is_software_event(event))
 		cpuctx->active_oncpu--;
 	ctx->nr_active--;
+	if (event->attr.freq && event->attr.sample_freq)
+		ctx->nr_freq--;
 	if (event->attr.exclusive || !cpuctx->active_oncpu)
 		cpuctx->exclusive = 0;
 }
@@ -1404,6 +1406,8 @@ event_sched_in(struct perf_event *event,
 	if (!is_software_event(event))
 		cpuctx->active_oncpu++;
 	ctx->nr_active++;
+	if (event->attr.freq && event->attr.sample_freq)
+		ctx->nr_freq++;
 
 	if (event->attr.exclusive)
 		cpuctx->exclusive = 1;
@@ -2326,6 +2330,9 @@ static void perf_ctx_adjust_freq(struct
 	u64 interrupts, now;
 	s64 delta;
 
+	if (!ctx->nr_freq)
+		return;
+
 	list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
 		if (event->state != PERF_EVENT_STATE_ACTIVE)
 			continue;
@@ -2381,12 +2388,14 @@ static void perf_rotate_context(struct p
 {
 	u64 interval = (u64)cpuctx->jiffies_interval * TICK_NSEC;
 	struct perf_event_context *ctx = NULL;
-	int rotate = 0, remove = 1;
+	int rotate = 0, remove = 1, freq = 0;
 
 	if (cpuctx->ctx.nr_events) {
 		remove = 0;
 		if (cpuctx->ctx.nr_events != cpuctx->ctx.nr_active)
 			rotate = 1;
+		if (cpuctx->ctx.nr_freq)
+			freq = 1;
 	}
 
 	ctx = cpuctx->task_ctx;
@@ -2394,33 +2403,40 @@ static void perf_rotate_context(struct p
 		remove = 0;
 		if (ctx->nr_events != ctx->nr_active)
 			rotate = 1;
+		if (ctx->nr_freq)
+			freq = 1;
 	}
 
+	if (!rotate && !freq)
+		goto done;
+
 	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
 	perf_pmu_disable(cpuctx->ctx.pmu);
-	perf_ctx_adjust_freq(&cpuctx->ctx, interval);
-	if (ctx)
-		perf_ctx_adjust_freq(ctx, interval);
-
-	if (!rotate)
-		goto done;
 
-	cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
-	if (ctx)
-		ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
+	if (freq) {
+		perf_ctx_adjust_freq(&cpuctx->ctx, interval);
+		if (ctx)
+			perf_ctx_adjust_freq(ctx, interval);
+	}
+
+	if (rotate) {
+		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+		if (ctx)
+			ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
+
+		rotate_ctx(&cpuctx->ctx);
+		if (ctx)
+			rotate_ctx(ctx);
 
-	rotate_ctx(&cpuctx->ctx);
-	if (ctx)
-		rotate_ctx(ctx);
+		perf_event_sched_in(cpuctx, ctx, current);
+	}
 
-	perf_event_sched_in(cpuctx, ctx, current);
+	perf_pmu_enable(cpuctx->ctx.pmu);
+	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
 
 done:
 	if (remove)
 		list_del_init(&cpuctx->rotation_list);
-
-	perf_pmu_enable(cpuctx->ctx.pmu);
-	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
 }
 
 void perf_event_task_tick(void)


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support
  2011-12-06  9:42         ` Peter Zijlstra
@ 2011-12-06 21:53           ` Arun Sharma
  0 siblings, 0 replies; 48+ messages in thread
From: Arun Sharma @ 2011-12-06 21:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stephane Eranian, mingo, William Cohen, Vince Weaver, linux-kernel

On 12/6/11 1:42 AM, Peter Zijlstra wrote:
> On Mon, 2011-12-05 at 15:17 -0800, Arun Sharma wrote:
>>
>> I suspect that some optimizations are possible in perf_rotate_context
>> that don't involve enabling/disabling the PMU via an IPI for simple
>> cases that involve one or two hardware events (eg: fixed counters).
>
> I've got a patch queued that does exactly that.

Verified that this fixes the lat_ctx regression. Thanks.

  -Arun


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support
  2011-11-21 14:51 [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support Peter Zijlstra
                   ` (7 preceding siblings ...)
  2011-12-02 19:26 ` Arun Sharma
@ 2011-12-16 22:36 ` Vince Weaver
  2011-12-21 12:58   ` Peter Zijlstra
  8 siblings, 1 reply; 48+ messages in thread
From: Vince Weaver @ 2011-12-16 22:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, William Cohen, Stephane Eranian, Arun Sharma,
	Vince Weaver, linux-kernel

On Mon, 21 Nov 2011, Peter Zijlstra wrote:

> These few patches implement x86 RDPMC support and add an extention to the self
> monitoring data to also allow additional time updates using userspace TSC reads.
> 
> There's a few loose ends, but it mostly seems to work.

I've finally had time to apply and test these patches.

Firstly, the documentation was a bit sparse (though I guess that's OK for 
an RFC patch).  Is it possible to read multiple counters at once with this 
interface, i.e. using FORMAT_GROUP?

Also, am I correct that the counters are set up to count continuously, so you
always have to do a rdpmc() before and a rdpmc() after?

I ran some benchmarks; the results can be found here:
   http://web.eecs.utk.edu/~vweaver1/projects/perf-events/benchmarks/rdtsc_overhead/

I used the mmap_read_self() routine from your example as the "read" 
performance that I measured.

If start/stop are truly unnecessary when run with your patch then the 
"total" results shift in your favor.  Otherwise perfmon2 and perfctr still 
win the self-monitoring overhead race by a lot.

Vince


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support
  2011-12-16 22:36 ` Vince Weaver
@ 2011-12-21 12:58   ` Peter Zijlstra
  2011-12-21 13:15     ` Ingo Molnar
  2011-12-21 15:04     ` Vince Weaver
  0 siblings, 2 replies; 48+ messages in thread
From: Peter Zijlstra @ 2011-12-21 12:58 UTC (permalink / raw)
  To: Vince Weaver
  Cc: mingo, William Cohen, Stephane Eranian, Arun Sharma,
	Vince Weaver, linux-kernel

On Fri, 2011-12-16 at 17:36 -0500, Vince Weaver wrote:
> Is it possible to read multiple counters at once with this 
> interface, i.e. using FORMAT_GROUP?

No, that's not something I had considered; I'll see if I can make
that happen by putting an array inside the control page instead of a
single version.

For now you'll have to mmap() the control page for each counter
individually.

> Also, am I correct that the counters are set to always counting, so you 
> always have to do a rdpmc() before and a rdpmc() after?

Depending on how you use it, but yes, that's how I'd use it; it avoids
having to do ioctl()s.
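
For instance, a minimal sketch of that usage, with mmap_read_self() as
posted earlier in this thread and do_work() standing in for the
measured region:

        u64 before, after;

        before = mmap_read_self(addr);  /* counter keeps running */
        do_work();
        after  = mmap_read_self(addr);
        /* after - before is the event count for the region */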

> I used the mmap_read_self() routine from your example as the "read" 
> performance that I measured.

Yeah, that's about it. If you want to discard the overload scenario,
e.g. if you use pinned counters or so, you can optimize it further by
stripping out the TSC and scaling muck.

> If start/stop are truly unnecessary when run with your patch then the 
> "total" results shift in your favor.  Otherwise perfmon2 and perfctr still 
> win the self-monitoring overhead race by a lot.

You can of course also do the start/stop/reset in userspace by keeping
an offset to the counter, avoiding the ioctl()s and simply keeping the
counter running.
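
A minimal sketch of that idea (the names are illustrative, not part of
the patch set):

static u64 counter_base;        /* userspace-maintained "zero" point */

static void my_reset(void *addr)        /* replaces the reset/start ioctl()s */
{
        counter_base = mmap_read_self(addr);
}

static u64 my_read(void *addr)
{
        return mmap_read_self(addr) - counter_base;
}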


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support
  2011-12-21 12:58   ` Peter Zijlstra
@ 2011-12-21 13:15     ` Ingo Molnar
  2011-12-23 20:12       ` Vince Weaver
  2011-12-21 15:04     ` Vince Weaver
  1 sibling, 1 reply; 48+ messages in thread
From: Ingo Molnar @ 2011-12-21 13:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vince Weaver, William Cohen, Stephane Eranian, Arun Sharma,
	Vince Weaver, linux-kernel


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> > I used the mmap_read_self() routine from your example as the 
> > "read" performance that I measured.
> 
> Yeah that's about it, if you want to discard the overload 
> scenario, eg you use pinned counters or so, you can optimize 
> it further by stripping out the tsc and scaling muck.

With the TSC scaling muck it's a bit above 50 cycles here.

Without that, by optimizing it further for pinned counters, the 
overhead of mmap_read_self() gets down to 36 cycles on a Nehalem 
box.

A PEBS profile run of it shows that 90% of the overhead is in 
the RDPMC instruction:

    2.92 :          42a919:       lea    -0x1(%rax),%ecx                                          ▒
         :                                                                                        ▒
         :        static u64 rdpmc(unsigned int counter)                                          ▒
         :        {                                                                               ▒
         :                unsigned int low, high;                                                 ▒
         :                                                                                        ▒
         :                asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (counter));        ▒
   89.20 :          42a91c:       rdpmc                                                           ▒
         :                        count = pc->offset;                                             ▒
         :                        if (idx)                                                        ▒
         :                                count += rdpmc(idx - 1);                                ▒
         :                                                                                        ▒

So the RDPMC instruction alone is 32 cycles. The perf way of 
reading it adds another 4 cycles to it.

So the measured 'perf overhead' is so ridiculously low that it's 
virtually non-existent.

Here's "pinned events" variant i've measured:

static u64 mmap_read_self(void *addr)
{
        struct perf_event_mmap_page *pc = addr;
        u32 seq, idx;
        u64 count;

        do {
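                /* seqlock-style retry: the kernel bumps pc->lock around
                 * updates, so re-read everything if it changed under us */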
                seq = pc->lock;
                barrier();

                idx = pc->index;
                count = pc->offset;
                if (idx)
                        count += rdpmc(idx - 1);

                barrier();
        } while (pc->lock != seq);

        return count;
}

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support
  2011-12-21 12:58   ` Peter Zijlstra
  2011-12-21 13:15     ` Ingo Molnar
@ 2011-12-21 15:04     ` Vince Weaver
  2011-12-21 21:32       ` Vince Weaver
  1 sibling, 1 reply; 48+ messages in thread
From: Vince Weaver @ 2011-12-21 15:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vince Weaver, mingo, William Cohen, Stephane Eranian,
	Arun Sharma, linux-kernel

On Wed, 21 Dec 2011, Peter Zijlstra wrote:

> On Fri, 2011-12-16 at 17:36 -0500, Vince Weaver wrote:
> > Is it possible to read multiple counters at once with this 
> > interface, i.e. using FORMAT_GROUP?
> 
> No, that's not something that I had considered, I'll see if I can make
> that happen by making an array inside the control page instead of a
> single version.

Well, it's not strictly necessary; I was just checking.  The perfctr
code just loops through all available counters too.

At some point PAPI should just abandon using FORMAT_GROUP, as it seems
to be more trouble than it's worth.  Though if we have fast rdpmc-based
counter reads it becomes a little less necessary.

> > Also, am I correct that the counters are set to always counting, so you 
> > always have to do a rdpmc() before and a rdpmc() after?
> 
> Depending on how you use it, but yes that's how I'd use it, avoids
> having to do ioctl()s

Currently, when I tried doing a start ioctl and then an immediate read,
the counter values returned were very high.  So unless I did two
rdpmc()s and subtracted, the results made no sense.

Was I doing something wrong?  I'll have to investigate more.

> > If start/stop are truly unnecessary when run with your patch then the 
> > "total" results shift in your favor.  Otherwise perfmon2 and perfctr still 
> > win the self-monitoring overhead race by a lot.
> 
> You can of course also do the start/stop/reset in userspace by keeping
> an offset to the counter, avoiding the ioctl()s and simply keeping the
> counter running.

Is this guaranteed to work in the face of context switches and/or
multiplexing?

I need to go back and look at the perfctr code and make sure I understand
the games it plays to ensure that the values it reads out are valid.

Thanks,

Vince


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support
  2011-12-21 15:04     ` Vince Weaver
@ 2011-12-21 21:32       ` Vince Weaver
  2011-12-21 21:41         ` Peter Zijlstra
  0 siblings, 1 reply; 48+ messages in thread
From: Vince Weaver @ 2011-12-21 21:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vince Weaver, mingo, William Cohen, Stephane Eranian,
	Arun Sharma, linux-kernel

On Wed, 21 Dec 2011, Vince Weaver wrote:
> 
> currently when I tried doing a start ioctl then an immediate read, the 
> counter values returned were a very high value.  So unless I did two 
> rdpmcs and subtracted, the results made no sense.

I've added two rdpmc() tests to my perf_event_test suite.  A few comments.

* One, is it possible to detect at runtime that rdpmc() is supported?  
  Maybe a field in the mmap'd buffer that can be read?

  Otherwise if you try to do a rdpmc() on an unpatched kernel you get a 
  general protection error and a segfault.  This makes it tricky to have 
  nice code that tries rdpmc() and falls back to read() if it is 
  unavailable.

* Second, if I try a start / run 100M instructions / read
  and measure retired instructions I get:
      281474977711543 instructions
  This is 0x10000000f45b7 which is 1<<48 + 100M.  I guess an artifact
  of how the counters are set up?  Is it possible to have this start
  at 0?  

  Otherwise you need to always do two reads and a subtract  to get good 
  values.

Thanks,

Vince


(below are the results from the rdpmc_validation test)


This test checks if userspace rdpmc() style reads work.

total start/work/read/stop latency: 646001 cycles
	Event 0 -- count: 281474977711543 running: 320324
	Event 1 -- count: 281474977712320 running: 320939
total start/read/work/read/stop latency: 653451 cycles
	Event 0 -- count: 1000828 running: 321985
	Event 1 -- count: 1000108 running: 321195

   Expected: 1000000
   High: 1000828   Low:  1000828   Average:  1000828
   ( note, a small value above 1000000 may be expected due
     to overhead and interrupt noise, among other reasons)
   Average Error = 0.08%
Testing if userspace rdpmc reads give expected results...    PASSED


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support
  2011-12-21 21:32       ` Vince Weaver
@ 2011-12-21 21:41         ` Peter Zijlstra
  2011-12-21 22:19           ` Vince Weaver
  0 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2011-12-21 21:41 UTC (permalink / raw)
  To: Vince Weaver
  Cc: Vince Weaver, mingo, William Cohen, Stephane Eranian,
	Arun Sharma, linux-kernel

On Wed, 2011-12-21 at 16:32 -0500, Vince Weaver wrote:

> * First, is it possible to detect at runtime that rdpmc() is supported?
>   Maybe a field in the mmap'd buffer that can be read?

Ingo suggested a similar thing, will implement (although Christmas might
interfere with delivery timelines, etc.).

> * Second, if I try a start / run 1M instructions / read
>   and measure retired instructions I get:
>       281474977711543 instructions
>   This is 0x10000000f45b7, which is 1<<48 + ~1M.  I guess that's an
>   artifact of how the counters are set up?  Is it possible to have this
>   start at 0?

It's supposed to be zero-based; clearly I messed something up and need to
go figure out what. Thanks for reporting.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support
  2011-12-21 21:41         ` Peter Zijlstra
@ 2011-12-21 22:19           ` Vince Weaver
  2011-12-21 22:32             ` Peter Zijlstra
  0 siblings, 1 reply; 48+ messages in thread
From: Vince Weaver @ 2011-12-21 22:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vince Weaver, mingo, William Cohen, Stephane Eranian,
	Arun Sharma, linux-kernel

On Wed, 21 Dec 2011, Peter Zijlstra wrote:

> On Wed, 2011-12-21 at 16:32 -0500, Vince Weaver wrote:
> 
> > * First, is it possible to detect at runtime that rdpmc() is supported?
> >   Maybe a field in the mmap'd buffer that can be read?
> 
> Ingo suggested a similar thing, will implement (although Christmas might
> interfere with delivery timelines, etc.).

no rush.

> > * Second, if I try a start / run 1M instructions / read
> >   and measure retired instructions I get:
> >       281474977711543 instructions
> >   This is 0x10000000f45b7, which is 1<<48 + ~1M.  I guess that's an
> >   artifact of how the counters are set up?  Is it possible to have this
> >   start at 0?
> 
> It's supposed to be zero-based; clearly I messed something up and need to
> go figure out what. Thanks for reporting.

That's on a Nehalem machine, if it matters.

Speaking of bit 48, does your implementation do anything useful about
the 32/40/48-bit width limitation that some processors have with rdpmc?
Or is the answer that if you might get an overflow, you have to fall
back to the slow syscall-based read()?

perfctr has some pretty clever code that manages to use rdpmc() while
avoiding the overflow issue, but it was written at a time when older
processors were in use and this was more of an issue.

Vince

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support
  2011-12-21 22:19           ` Vince Weaver
@ 2011-12-21 22:32             ` Peter Zijlstra
  0 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2011-12-21 22:32 UTC (permalink / raw)
  To: Vince Weaver
  Cc: Vince Weaver, mingo, William Cohen, Stephane Eranian,
	Arun Sharma, linux-kernel

On Wed, 2011-12-21 at 17:19 -0500, Vince Weaver wrote:
> 
> Speaking of bit 48, does your implementation do anything useful about
> the 32/40/48-bit width limitation that some processors have with rdpmc?
> Or is the answer that if you might get an overflow, you have to fall
> back to the slow syscall-based read()?

Yeah, we program an overflow interrupt around there (at 47 bits, iirc),
consume the total into the u64 count value, and update the mmap control
page's offset.
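
Roughly, pseudo-code of the idea (a sketch, not the exact x86 code):

	/*
	 * program the period so the PMI fires well before even the
	 * narrowest hardware counter width is exhausted:
	 */
	hwc->sample_period = (1ULL << 47) - 1;

	/* on overflow/update, fold the hardware delta into 64 bits: */
	delta = (new_raw - prev_raw) & cntval_mask;
	local64_add(delta, &event->count);
	perf_event_update_userpage(event);	/* refreshes userpg->offset */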



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support
  2011-12-21 13:15     ` Ingo Molnar
@ 2011-12-23 20:12       ` Vince Weaver
  0 siblings, 0 replies; 48+ messages in thread
From: Vince Weaver @ 2011-12-23 20:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Vince Weaver, William Cohen, Stephane Eranian,
	Arun Sharma, linux-kernel

On Wed, 21 Dec 2011, Ingo Molnar wrote:

> Here's "pinned events" variant i've measured:
> 
> static u64 mmap_read_self(void *addr)
> {
>         struct perf_event_mmap_page *pc = addr;
>         u32 seq, idx;
>         u64 count;
> 
>         do {
>                 seq = pc->lock;
>                 barrier();
> 
>                 idx = pc->index;
>                 count = pc->offset;
>                 if (idx)
>                         count += rdpmc(idx - 1);
> 
>                 barrier();
>         } while (pc->lock != seq);
> 
>         return count;
> }

Currently you need to do at least two rdpmc() calls when doing a
start/read/stop sequence (I use this as a benchmark since it's what PAPI
code commonly does).

This is because the pc->offset value isn't initialized to 0 on start,
but to max_period & cntrval_mask.

I'm not sure what perf_event can do about this short of having a separate
field in the mmap structure that doesn't have the overflow offset 
considerations.
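
So for now PAPI-style code has to bracket the region with two reads,
e.g.:

	start = mmap_read_self(addr);
	/* ... workload ... */
	stop  = mmap_read_self(addr);
	count = stop - start;	/* the max_period bias cancels out */

rather than trusting a single read after the start ioctl.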


As an aside, I notice that the internal perf_event read() routine on x86
seems to use rdmsrl() instead of the equivalent rdpmc().  From what I
understand, at least through Core 2 (and maybe later) rdpmc() is faster
than the equivalent rdmsr() call.  I'm not sure whether it would be worth
replacing the calls, though.

Vince





^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 5/6] perf: Extend the mmap control page with time (TSC) fields
  2011-11-21 14:51 ` [RFC][PATCH 5/6] perf: Extend the mmap control page with time (TSC) fields Peter Zijlstra
@ 2011-12-28 17:55   ` Stephane Eranian
  0 siblings, 0 replies; 48+ messages in thread
From: Stephane Eranian @ 2011-12-28 17:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, William Cohen, linux-kernel, Arun Sharma, Vince Weaver

On Mon, Nov 21, 2011 at 2:51 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> Extend the mmap control page with fields so that userspace can compute
> time deltas relative to the provided time fields.
>
> Currently only implemented for x86 with constant and nonstop TSC.
>

It is not obvious to me how one would use time_mult, time_shift, time_offset
+ TSC to complement time_enabled/time_running to compute the correct
scaling factor. The patch does not include any example. Would you mind
describing the method?
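
My best guess, inverting the patch's cyc2ns math (a sketch only, so
please correct me), would be something like:

	u64 tsc = rdtsc();
	u64 delta = ((tsc * pc->time_mult) >> pc->time_shift)
			+ pc->time_offset;

	/* delta = ns elapsed since the kernel last updated the page */
	enabled = pc->time_enabled + delta;
	running = pc->time_running + delta;
	scaled  = count * enabled / running;

all sampled under the pc->lock sequence count, and presumably with the
multiply split by time_shift to avoid 64-bit overflow for large TSC
values.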

Thanks.


> Cc: Stephane Eranian <eranian@google.com>
> Cc: Arun Sharma <asharma@fb.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  arch/x86/kernel/cpu/perf_event.c |   14 ++++++++++++++
>  include/linux/perf_event.h       |    4 +++-
>  kernel/events/core.c             |   21 ++++++++++++++-------
>  3 files changed, 31 insertions(+), 8 deletions(-)
> Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
> +++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
> @@ -32,6 +32,7 @@
>  #include <asm/compat.h>
>  #include <asm/smp.h>
>  #include <asm/alternative.h>
> +#include <asm/timer.h>
>
>  #include "perf_event.h"
>
> @@ -1621,6 +1622,19 @@ static struct pmu pmu = {
>        .event_idx      = x86_pmu_event_idx,
>  };
>
> +void perf_update_user_clock(struct perf_event_mmap_page *userpg, u64 now)
> +{
> +       if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
> +               return;
> +
> +       if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
> +               return;
> +
> +       userpg->time_mult = this_cpu_read(cyc2ns);
> +       userpg->time_shift = CYC2NS_SCALE_FACTOR;
> +       userpg->time_offset = this_cpu_read(cyc2ns_offset) - now;
> +}
> +
>  /*
>  * callchain support
>  */
> Index: linux-2.6/include/linux/perf_event.h
> ===================================================================
> --- linux-2.6.orig/include/linux/perf_event.h
> +++ linux-2.6/include/linux/perf_event.h
> @@ -290,12 +290,14 @@ struct perf_event_mmap_page {
>        __s64   offset;                 /* add to hardware event value */
>        __u64   time_enabled;           /* time event active */
>        __u64   time_running;           /* time event on cpu */
> +       __u32   time_mult, time_shift;
> +       __u64   time_offset;
>
>                /*
>                 * Hole for extension of the self monitor capabilities
>                 */
>
> -       __u64   __reserved[123];        /* align to 1k */
> +       __u64   __reserved[121];        /* align to 1k */
>
>        /*
>         * Control data for the mmap() data buffer.
> Index: linux-2.6/kernel/events/core.c
> ===================================================================
> --- linux-2.6.orig/kernel/events/core.c
> +++ linux-2.6/kernel/events/core.c
> @@ -3203,17 +3203,22 @@ static int perf_event_index(struct perf_
>  }
>
>  static void calc_timer_values(struct perf_event *event,
> +                               u64 *now,
>                                u64 *enabled,
>                                u64 *running)
>  {
> -       u64 now, ctx_time;
> +       u64 ctx_time;
>
> -       now = perf_clock();
> -       ctx_time = event->shadow_ctx_time + now;
> +       *now = perf_clock();
> +       ctx_time = event->shadow_ctx_time + *now;
>        *enabled = ctx_time - event->tstamp_enabled;
>        *running = ctx_time - event->tstamp_running;
>  }
>
> +void __weak perf_update_user_clock(struct perf_event_mmap_page *userpg, u64 now)
> +{
> +}
> +
>  /*
>  * Callers need to ensure there can be no nesting of this function, otherwise
>  * the seqlock logic goes bad. We can not serialize this because the arch
> @@ -3223,7 +3228,7 @@ void perf_event_update_userpage(struct p
>  {
>        struct perf_event_mmap_page *userpg;
>        struct ring_buffer *rb;
> -       u64 enabled, running;
> +       u64 enabled, running, now;
>
>        rcu_read_lock();
>        /*
> @@ -3235,7 +3240,7 @@ void perf_event_update_userpage(struct p
>         * because of locking issue as we can be called in
>         * NMI context
>         */
> -       calc_timer_values(event, &enabled, &running);
> +       calc_timer_values(event, &now, &enabled, &running);
>        rb = rcu_dereference(event->rb);
>        if (!rb)
>                goto unlock;
> @@ -3260,6 +3265,8 @@ void perf_event_update_userpage(struct p
>        userpg->time_running = running +
>                        atomic64_read(&event->child_total_time_running);
>
> +       perf_update_user_clock(userpg, now);
> +
>        barrier();
>        ++userpg->lock;
>        preempt_enable();
> @@ -3692,7 +3699,7 @@ static void perf_output_read_group(struc
>  static void perf_output_read(struct perf_output_handle *handle,
>                             struct perf_event *event)
>  {
> -       u64 enabled = 0, running = 0;
> +       u64 enabled = 0, running = 0, now;
>        u64 read_format = event->attr.read_format;
>
>        /*
> @@ -3705,7 +3712,7 @@ static void perf_output_read(struct perf
>         * NMI context
>         */
>        if (read_format & PERF_FORMAT_TOTAL_TIMES)
> -               calc_timer_values(event, &enabled, &running);
> +               calc_timer_values(event, &now, &enabled, &running);
>
>        if (event->attr.read_format & PERF_FORMAT_GROUP)
>                perf_output_read_group(handle, event, enabled, running);
>
>

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2011-12-28 17:55 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-11-21 14:51 [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support Peter Zijlstra
2011-11-21 14:51 ` [RFC][PATCH 1/6] perf: Update the mmap control page on mmap() Peter Zijlstra
2011-11-21 14:51 ` [RFC][PATCH 2/6] perf, arch: Rework perf_event_index() Peter Zijlstra
2011-11-21 16:22   ` Eric B Munson
2011-11-21 17:23   ` Will Deacon
2011-11-21 19:18     ` Peter Zijlstra
2011-11-21 20:31       ` Will Deacon
2011-11-21 20:35         ` Peter Zijlstra
2011-11-21 22:43           ` Will Deacon
2011-11-22 11:26             ` Peter Zijlstra
2011-11-22 11:47               ` Will Deacon
2011-11-22 11:49                 ` Oleg Strikov
2011-11-22 11:52                   ` Will Deacon
2011-11-22 11:56                     ` Oleg Strikov
2011-11-22 12:00                     ` Oleg Strikov
2011-11-22 12:14                       ` Will Deacon
2011-11-22 12:25                         ` Oleg Strikov
2011-11-22 11:51                 ` Peter Zijlstra
2011-11-22 11:54                   ` Will Deacon
2011-11-22 11:48               ` Oleg Strikov
2011-11-21 14:51 ` [RFC][PATCH 3/6] perf, x86: Implement userspace RDPMC Peter Zijlstra
2011-11-21 14:51 ` [RFC][PATCH 4/6] perf, x86: Provide means of disabling " Peter Zijlstra
2011-11-21 14:51 ` [RFC][PATCH 5/6] perf: Extend the mmap control page with time (TSC) fields Peter Zijlstra
2011-12-28 17:55   ` Stephane Eranian
2011-11-21 14:51 ` [RFC][PATCH 6/6] perf, tools: X86 RDPMC, RDTSC test Peter Zijlstra
2011-11-21 15:29   ` Stephane Eranian
2011-11-21 15:37     ` Peter Zijlstra
2011-11-21 16:59       ` Peter Zijlstra
2011-11-21 17:42         ` Stephane Eranian
2011-11-21 15:02 ` [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support Vince Weaver
2011-11-21 16:05   ` William Cohen
2011-11-21 16:08   ` William Cohen
2011-12-02 19:26 ` Arun Sharma
2011-12-02 22:22   ` Stephane Eranian
2011-12-05 20:16     ` Arun Sharma
2011-12-05 23:17       ` Arun Sharma
2011-12-06  1:38         ` Stephane Eranian
2011-12-06  9:42         ` Peter Zijlstra
2011-12-06 21:53           ` Arun Sharma
2011-12-16 22:36 ` Vince Weaver
2011-12-21 12:58   ` Peter Zijlstra
2011-12-21 13:15     ` Ingo Molnar
2011-12-23 20:12       ` Vince Weaver
2011-12-21 15:04     ` Vince Weaver
2011-12-21 21:32       ` Vince Weaver
2011-12-21 21:41         ` Peter Zijlstra
2011-12-21 22:19           ` Vince Weaver
2011-12-21 22:32             ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).