* [RFC][PATCH 00/11] Another stab at PEBS and LBR support
From: Peter Zijlstra @ 2010-03-03 16:39 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: paulus, eranian, robert.richter, fweisbec, Peter Zijlstra


Sorta works; the PEBS-LBR fixup stuff makes my machine unhappy, but I could
have made a silly mistake there.

Can be tested using the below patchlet and something like: perf top -e r00c0p

---
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 05d0c5c..f8314e6 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -656,6 +656,11 @@ parse_raw_event(const char **strp, struct perf_event_attr *attr)
 		return EVT_FAILED;
 	n = hex2u64(str + 1, &config);
 	if (n > 0) {
+		if (str[n+1] == 'p') {
+			attr->precise = 1;
+			printf("precise\n");
+			n++;
+		}
 		*strp = str + n + 1;
 		attr->type = PERF_TYPE_RAW;
 		attr->config = config;
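
(For reference, and only my reading of the patchlet above: the trailing
'p' on a raw event sets attr.precise.  On Intel, 0x00c0 is
INSTR_RETIRED.ANY, so the two invocations below sample the same event
with and without PEBS.)

	perf top -e r00c0p	# raw event 0x00c0, attr.precise = 1 (PEBS)
	perf top -e r00c0	# same event, normal interrupt sampling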




* [RFC][PATCH 01/11] perf, x86: Remove superfluous arguments to x86_perf_event_set_period()
From: Peter Zijlstra @ 2010-03-03 16:39 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: paulus, eranian, robert.richter, fweisbec, Peter Zijlstra

[-- Attachment #1: perf-x86-cleanup-args.patch --]
[-- Type: text/plain, Size: 2698 bytes --]

The second and third arguments to x86_perf_event_set_period() are
superfluous, since they are simple expressions of the first argument.
Hence remove them.
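
As a minimal illustration (restating what the diff below does, not new
functionality), both removed arguments can be recomputed inside the
function from the event alone:

	struct hw_perf_event *hwc = &event->hw;	/* was the 2nd argument */
	int idx = hwc->idx;			/* was the 3rd argument */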

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/kernel/cpu/perf_event.c       |   15 +++++++--------
 arch/x86/kernel/cpu/perf_event_intel.c |    2 +-
 2 files changed, 8 insertions(+), 9 deletions(-)

Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
@@ -165,8 +165,7 @@ static DEFINE_PER_CPU(struct cpu_hw_even
 	.enabled = 1,
 };
 
-static int x86_perf_event_set_period(struct perf_event *event,
-			     struct hw_perf_event *hwc, int idx);
+static int x86_perf_event_set_period(struct perf_event *event);
 
 /*
  * Generalized hw caching related hw_event table, filled
@@ -830,7 +829,7 @@ void hw_perf_enable(void)
 
 			if (hwc->idx == -1) {
 				x86_assign_hw_event(event, cpuc, i);
-				x86_perf_event_set_period(event, hwc, hwc->idx);
+				x86_perf_event_set_period(event);
 			}
 			/*
 			 * need to mark as active because x86_pmu_disable()
@@ -871,12 +870,12 @@ static DEFINE_PER_CPU(u64 [X86_PMC_IDX_M
  * To be called with the event disabled in hw:
  */
 static int
-x86_perf_event_set_period(struct perf_event *event,
-			     struct hw_perf_event *hwc, int idx)
+x86_perf_event_set_period(struct perf_event *event)
 {
+	struct hw_perf_event *hwc = &event->hw;
 	s64 left = atomic64_read(&hwc->period_left);
 	s64 period = hwc->sample_period;
-	int err, ret = 0;
+	int err, ret = 0, idx = hwc->idx;
 
 	if (idx == X86_PMC_IDX_FIXED_BTS)
 		return 0;
@@ -974,7 +973,7 @@ static int x86_pmu_start(struct perf_eve
 	if (hwc->idx == -1)
 		return -EAGAIN;
 
-	x86_perf_event_set_period(event, hwc, hwc->idx);
+	x86_perf_event_set_period(event);
 	x86_pmu.enable(hwc, hwc->idx);
 
 	return 0;
@@ -1119,7 +1118,7 @@ static int x86_pmu_handle_irq(struct pt_
 		handled		= 1;
 		data.period	= event->hw.last_period;
 
-		if (!x86_perf_event_set_period(event, hwc, idx))
+		if (!x86_perf_event_set_period(event))
 			continue;
 
 		if (perf_event_overflow(event, 1, &data, regs))
Index: linux-2.6/arch/x86/kernel/cpu/perf_event_intel.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event_intel.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event_intel.c
@@ -700,7 +700,7 @@ static int intel_pmu_save_and_restart(st
 	int ret;
 
 	x86_perf_event_update(event, hwc, idx);
-	ret = x86_perf_event_set_period(event, hwc, idx);
+	ret = x86_perf_event_set_period(event);
 
 	return ret;
 }

-- 



* [RFC][PATCH 02/11] perf, x86: Remove superfluous arguments to x86_perf_event_update()
From: Peter Zijlstra @ 2010-03-03 16:39 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: paulus, eranian, robert.richter, fweisbec, Peter Zijlstra

[-- Attachment #1: perf-x86-cleanup-args1.patch --]
[-- Type: text/plain, Size: 2493 bytes --]

The second and third arguments to x86_perf_event_update() are
superfluous, since they are simple expressions of the first argument.
Hence remove them.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/kernel/cpu/perf_event.c       |   11 ++++++-----
 arch/x86/kernel/cpu/perf_event_intel.c |   10 ++--------
 2 files changed, 8 insertions(+), 13 deletions(-)

Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
@@ -188,11 +188,12 @@ static u64 __read_mostly hw_cache_event_
  * Returns the delta events processed.
  */
 static u64
-x86_perf_event_update(struct perf_event *event,
-			struct hw_perf_event *hwc, int idx)
+x86_perf_event_update(struct perf_event *event)
 {
+	struct hw_perf_event *hwc = &event->hw;
 	int shift = 64 - x86_pmu.event_bits;
 	u64 prev_raw_count, new_raw_count;
+	int idx = hwc->idx;
 	s64 delta;
 
 	if (idx == X86_PMC_IDX_FIXED_BTS)
@@ -1059,7 +1060,7 @@ static void x86_pmu_stop(struct perf_eve
 	 * Drain the remaining delta count out of a event
 	 * that we are disabling:
 	 */
-	x86_perf_event_update(event, hwc, idx);
+	x86_perf_event_update(event);
 
 	cpuc->events[idx] = NULL;
 }
@@ -1108,7 +1109,7 @@ static int x86_pmu_handle_irq(struct pt_
 		event = cpuc->events[idx];
 		hwc = &event->hw;
 
-		val = x86_perf_event_update(event, hwc, idx);
+		val = x86_perf_event_update(event);
 		if (val & (1ULL << (x86_pmu.event_bits - 1)))
 			continue;
 
@@ -1419,7 +1420,7 @@ void __init init_hw_perf_events(void)
 
 static inline void x86_pmu_read(struct perf_event *event)
 {
-	x86_perf_event_update(event, &event->hw, event->hw.idx);
+	x86_perf_event_update(event);
 }
 
 static const struct pmu pmu = {
Index: linux-2.6/arch/x86/kernel/cpu/perf_event_intel.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event_intel.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event_intel.c
@@ -695,14 +695,8 @@ static void intel_pmu_enable_event(struc
  */
 static int intel_pmu_save_and_restart(struct perf_event *event)
 {
-	struct hw_perf_event *hwc = &event->hw;
-	int idx = hwc->idx;
-	int ret;
-
-	x86_perf_event_update(event, hwc, idx);
-	ret = x86_perf_event_set_period(event);
-
-	return ret;
+	x86_perf_event_update(event);
+	return x86_perf_event_set_period(event);
 }
 
 static void intel_pmu_reset(void)

-- 



* [RFC][PATCH 03/11] perf, x86: Change x86_pmu.{enable,disable} calling convention
From: Peter Zijlstra @ 2010-03-03 16:39 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: paulus, eranian, robert.richter, fweisbec, Peter Zijlstra

[-- Attachment #1: perf-x86-cleanup-args2.patch --]
[-- Type: text/plain, Size: 7246 bytes --]

Pass the full perf_event into the x86_pmu functions so that they can
make use of more than just the hw_perf_event; while at it, remove the
now-superfluous second argument.
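
One illustrative consequence (simplified from patch 06/11 in this
series, not part of this patch): with the full event available, the
Intel enable path can key off per-event attributes such as
attr.precise instead of seeing only the hw_perf_event:

	static void intel_pmu_enable_event(struct perf_event *event)
	{
		struct hw_perf_event *hwc = &event->hw;

		if (unlikely(event->attr.precise))
			intel_pmu_pebs_enable(hwc);

		__x86_pmu_enable_event(hwc);
	}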

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/kernel/cpu/perf_event.c       |   31 +++++++++++++++----------------
 arch/x86/kernel/cpu/perf_event_intel.c |   30 +++++++++++++++++-------------
 arch/x86/kernel/cpu/perf_event_p6.c    |   10 ++++++----
 3 files changed, 38 insertions(+), 33 deletions(-)

Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
@@ -133,8 +133,8 @@ struct x86_pmu {
 	int		(*handle_irq)(struct pt_regs *);
 	void		(*disable_all)(void);
 	void		(*enable_all)(void);
-	void		(*enable)(struct hw_perf_event *, int);
-	void		(*disable)(struct hw_perf_event *, int);
+	void		(*enable)(struct perf_event *);
+	void		(*disable)(struct perf_event *);
 	unsigned	eventsel;
 	unsigned	perfctr;
 	u64		(*event_map)(int);
@@ -840,7 +840,7 @@ void hw_perf_enable(void)
 			set_bit(hwc->idx, cpuc->active_mask);
 			cpuc->events[hwc->idx] = event;
 
-			x86_pmu.enable(hwc, hwc->idx);
+			x86_pmu.enable(event);
 			perf_event_update_userpage(event);
 		}
 		cpuc->n_added = 0;
@@ -853,15 +853,16 @@ void hw_perf_enable(void)
 	x86_pmu.enable_all();
 }
 
-static inline void __x86_pmu_enable_event(struct hw_perf_event *hwc, int idx)
+static inline void __x86_pmu_enable_event(struct hw_perf_event *hwc)
 {
-	(void)checking_wrmsrl(hwc->config_base + idx,
+	(void)checking_wrmsrl(hwc->config_base + hwc->idx,
 			      hwc->config | ARCH_PERFMON_EVENTSEL_ENABLE);
 }
 
-static inline void x86_pmu_disable_event(struct hw_perf_event *hwc, int idx)
+static inline void x86_pmu_disable_event(struct perf_event *event)
 {
-	(void)checking_wrmsrl(hwc->config_base + idx, hwc->config);
+	struct hw_perf_event *hwc = &event->hw;
+	(void)checking_wrmsrl(hwc->config_base + hwc->idx, hwc->config);
 }
 
 static DEFINE_PER_CPU(u64 [X86_PMC_IDX_MAX], pmc_prev_left);
@@ -922,11 +923,11 @@ x86_perf_event_set_period(struct perf_ev
 	return ret;
 }
 
-static void x86_pmu_enable_event(struct hw_perf_event *hwc, int idx)
+static void x86_pmu_enable_event(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 	if (cpuc->enabled)
-		__x86_pmu_enable_event(hwc, idx);
+		__x86_pmu_enable_event(&event->hw);
 }
 
 /*
@@ -969,13 +970,11 @@ static int x86_pmu_enable(struct perf_ev
 
 static int x86_pmu_start(struct perf_event *event)
 {
-	struct hw_perf_event *hwc = &event->hw;
-
-	if (hwc->idx == -1)
+	if (event->hw.idx == -1)
 		return -EAGAIN;
 
 	x86_perf_event_set_period(event);
-	x86_pmu.enable(hwc, hwc->idx);
+	x86_pmu.enable(event);
 
 	return 0;
 }
@@ -989,7 +988,7 @@ static void x86_pmu_unthrottle(struct pe
 				cpuc->events[hwc->idx] != event))
 		return;
 
-	x86_pmu.enable(hwc, hwc->idx);
+	x86_pmu.enable(event);
 }
 
 void perf_event_print_debug(void)
@@ -1054,7 +1053,7 @@ static void x86_pmu_stop(struct perf_eve
 	 * could reenable again:
 	 */
 	clear_bit(idx, cpuc->active_mask);
-	x86_pmu.disable(hwc, idx);
+	x86_pmu.disable(event);
 
 	/*
 	 * Drain the remaining delta count out of a event
@@ -1123,7 +1122,7 @@ static int x86_pmu_handle_irq(struct pt_
 			continue;
 
 		if (perf_event_overflow(event, 1, &data, regs))
-			x86_pmu.disable(hwc, idx);
+			x86_pmu.disable(event);
 	}
 
 	if (handled)
Index: linux-2.6/arch/x86/kernel/cpu/perf_event_intel.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event_intel.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event_intel.c
@@ -548,9 +548,9 @@ static inline void intel_pmu_ack_status(
 }
 
 static inline void
-intel_pmu_disable_fixed(struct hw_perf_event *hwc, int __idx)
+intel_pmu_disable_fixed(struct hw_perf_event *hwc)
 {
-	int idx = __idx - X86_PMC_IDX_FIXED;
+	int idx = hwc->idx - X86_PMC_IDX_FIXED;
 	u64 ctrl_val, mask;
 
 	mask = 0xfULL << (idx * 4);
@@ -622,26 +622,28 @@ static void intel_pmu_drain_bts_buffer(v
 }
 
 static inline void
-intel_pmu_disable_event(struct hw_perf_event *hwc, int idx)
+intel_pmu_disable_event(struct perf_event *event)
 {
-	if (unlikely(idx == X86_PMC_IDX_FIXED_BTS)) {
+	struct hw_perf_event *hwc = &event->hw;
+
+	if (unlikely(hwc->idx == X86_PMC_IDX_FIXED_BTS)) {
 		intel_pmu_disable_bts();
 		intel_pmu_drain_bts_buffer();
 		return;
 	}
 
 	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
-		intel_pmu_disable_fixed(hwc, idx);
+		intel_pmu_disable_fixed(hwc);
 		return;
 	}
 
-	x86_pmu_disable_event(hwc, idx);
+	x86_pmu_disable_event(event);
 }
 
 static inline void
-intel_pmu_enable_fixed(struct hw_perf_event *hwc, int __idx)
+intel_pmu_enable_fixed(struct hw_perf_event *hwc)
 {
-	int idx = __idx - X86_PMC_IDX_FIXED;
+	int idx = hwc->idx - X86_PMC_IDX_FIXED;
 	u64 ctrl_val, bits, mask;
 	int err;
 
@@ -671,9 +673,11 @@ intel_pmu_enable_fixed(struct hw_perf_ev
 	err = checking_wrmsrl(hwc->config_base, ctrl_val);
 }
 
-static void intel_pmu_enable_event(struct hw_perf_event *hwc, int idx)
+static void intel_pmu_enable_event(struct perf_event *event)
 {
-	if (unlikely(idx == X86_PMC_IDX_FIXED_BTS)) {
+	struct hw_perf_event *hwc = &event->hw;
+
+	if (unlikely(hwc->idx == X86_PMC_IDX_FIXED_BTS)) {
 		if (!__get_cpu_var(cpu_hw_events).enabled)
 			return;
 
@@ -682,11 +686,11 @@ static void intel_pmu_enable_event(struc
 	}
 
 	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
-		intel_pmu_enable_fixed(hwc, idx);
+		intel_pmu_enable_fixed(hwc);
 		return;
 	}
 
-	__x86_pmu_enable_event(hwc, idx);
+	__x86_pmu_enable_event(hwc);
 }
 
 /*
@@ -774,7 +778,7 @@ again:
 		data.period = event->hw.last_period;
 
 		if (perf_event_overflow(event, 1, &data, regs))
-			intel_pmu_disable_event(&event->hw, bit);
+			intel_pmu_disable_event(event);
 	}
 
 	intel_pmu_ack_status(ack);
Index: linux-2.6/arch/x86/kernel/cpu/perf_event_p6.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event_p6.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event_p6.c
@@ -77,27 +77,29 @@ static void p6_pmu_enable_all(void)
 }
 
 static inline void
-p6_pmu_disable_event(struct hw_perf_event *hwc, int idx)
+p6_pmu_disable_event(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	struct hw_perf_event *hwc = &event->hw;
 	u64 val = P6_NOP_EVENT;
 
 	if (cpuc->enabled)
 		val |= ARCH_PERFMON_EVENTSEL_ENABLE;
 
-	(void)checking_wrmsrl(hwc->config_base + idx, val);
+	(void)checking_wrmsrl(hwc->config_base + hwc->idx, val);
 }
 
-static void p6_pmu_enable_event(struct hw_perf_event *hwc, int idx)
+static void p6_pmu_enable_event(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	struct hw_perf_event *hwc = &event->hw;
 	u64 val;
 
 	val = hwc->config;
 	if (cpuc->enabled)
 		val |= ARCH_PERFMON_EVENTSEL_ENABLE;
 
-	(void)checking_wrmsrl(hwc->config_base + idx, val);
+	(void)checking_wrmsrl(hwc->config_base + hwc->idx, val);
 }
 
 static __initconst struct x86_pmu p6_pmu = {

-- 



* [RFC][PATCH 04/11] perf, x86: Use unlocked bitops
From: Peter Zijlstra @ 2010-03-03 16:39 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: paulus, eranian, robert.richter, fweisbec, Peter Zijlstra

[-- Attachment #1: perf-x86-unlocked-bitops.patch --]
[-- Type: text/plain, Size: 2593 bytes --]

There is no concurrency on these variables, so don't use LOCK'ed ops.
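
For context (not part of the patch): on x86 the plain bitops are
LOCK-prefixed read-modify-write instructions, while the
double-underscore variants compile to ordinary loads and stores.
Since these masks are only touched from paths serialized on the
owning CPU, the cheaper non-atomic form suffices, e.g.:

	__set_bit(hwc->idx, cpuc->active_mask);	/* was: set_bit() */
	__clear_bit(idx, cpuc->active_mask);	/* was: clear_bit() */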

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/kernel/cpu/perf_event.c       |    8 ++++----
 arch/x86/kernel/cpu/perf_event_amd.c   |    2 +-
 arch/x86/kernel/cpu/perf_event_intel.c |    2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
@@ -638,7 +638,7 @@ static int x86_schedule_events(struct cp
 		if (test_bit(hwc->idx, used_mask))
 			break;
 
-		set_bit(hwc->idx, used_mask);
+		__set_bit(hwc->idx, used_mask);
 		if (assign)
 			assign[i] = hwc->idx;
 	}
@@ -687,7 +687,7 @@ static int x86_schedule_events(struct cp
 			if (j == X86_PMC_IDX_MAX)
 				break;
 
-			set_bit(j, used_mask);
+			__set_bit(j, used_mask);
 
 			if (assign)
 				assign[i] = j;
@@ -837,7 +837,7 @@ void hw_perf_enable(void)
 			 * clear active_mask and events[] yet it preserves
 			 * idx
 			 */
-			set_bit(hwc->idx, cpuc->active_mask);
+			__set_bit(hwc->idx, cpuc->active_mask);
 			cpuc->events[hwc->idx] = event;
 
 			x86_pmu.enable(event);
@@ -1052,7 +1052,7 @@ static void x86_pmu_stop(struct perf_eve
 	 * Must be done before we disable, otherwise the nmi handler
 	 * could reenable again:
 	 */
-	clear_bit(idx, cpuc->active_mask);
+	__clear_bit(idx, cpuc->active_mask);
 	x86_pmu.disable(event);
 
 	/*
Index: linux-2.6/arch/x86/kernel/cpu/perf_event_amd.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event_amd.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event_amd.c
@@ -309,7 +309,7 @@ static struct amd_nb *amd_alloc_nb(int c
 	 * initialize all possible NB constraints
 	 */
 	for (i = 0; i < x86_pmu.num_events; i++) {
-		set_bit(i, nb->event_constraints[i].idxmsk);
+		__set_bit(i, nb->event_constraints[i].idxmsk);
 		nb->event_constraints[i].weight = 1;
 	}
 	return nb;
Index: linux-2.6/arch/x86/kernel/cpu/perf_event_intel.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event_intel.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event_intel.c
@@ -768,7 +768,7 @@ again:
 	for_each_bit(bit, (unsigned long *)&status, X86_PMC_IDX_MAX) {
 		struct perf_event *event = cpuc->events[bit];
 
-		clear_bit(bit, (unsigned long *) &status);
+		__clear_bit(bit, (unsigned long *) &status);
 		if (!test_bit(bit, cpuc->active_mask))
 			continue;
 

-- 



* [RFC][PATCH 05/11] perf: Generic perf_sample_data initialization
From: Peter Zijlstra @ 2010-03-03 16:39 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: paulus, eranian, robert.richter, fweisbec, Jamie Iles,
	Jean Pihet, David S. Miller, stable, Peter Zijlstra

[-- Attachment #1: perf-fixup-data.patch --]
[-- Type: text/plain, Size: 6207 bytes --]

This makes it easier to extend perf_sample_data and fixes a bug on
arm and sparc, which failed to set ->raw to NULL; that omission can
cause crashes when combined with PERF_SAMPLE_RAW.

It also optimizes the PowerPC and tracepoint paths, because the
designated-initializer form of struct initialization forces the
compiler to zero out the whole structure.
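
In practice (taken from the hunks below) this turns the repeated
open-coded pattern into a single helper call, with any extra fields
filled in afterwards:

	/* before: every caller had to remember to clear ->raw as well */
	data.addr = 0;
	data.raw  = NULL;

	/* after */
	perf_sample_data_init(&data, 0);
	data.period = event->hw.last_period;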

CC: Jamie Iles <jamie.iles@picochip.com>
CC: Jean Pihet <jpihet@mvista.com>
CC: Paul Mackerras <paulus@samba.org>
CC: Ingo Molnar <mingo@elte.hu>
CC: David S. Miller <davem@davemloft.net>
CC: Stephane Eranian <eranian@google.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: stable@kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/arm/kernel/perf_event.c           |    4 ++--
 arch/powerpc/kernel/perf_event.c       |    8 ++++----
 arch/sparc/kernel/perf_event.c         |    2 +-
 arch/x86/kernel/cpu/perf_event.c       |    3 +--
 arch/x86/kernel/cpu/perf_event_intel.c |    6 ++----
 include/linux/perf_event.h             |    7 +++++++
 kernel/perf_event.c                    |   21 ++++++++-------------
 7 files changed, 25 insertions(+), 26 deletions(-)

Index: linux-2.6/arch/arm/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/arch/arm/kernel/perf_event.c
+++ linux-2.6/arch/arm/kernel/perf_event.c
@@ -965,7 +965,7 @@ armv6pmu_handle_irq(int irq_num,
 	 */
 	armv6_pmcr_write(pmcr);
 
-	data.addr = 0;
+	perf_sample_data_init(&data, 0);
 
 	cpuc = &__get_cpu_var(cpu_hw_events);
 	for (idx = 0; idx <= armpmu->num_events; ++idx) {
@@ -1945,7 +1945,7 @@ static irqreturn_t armv7pmu_handle_irq(i
 	 */
 	regs = get_irq_regs();
 
-	data.addr = 0;
+	perf_sample_data_init(&data, 0);
 
 	cpuc = &__get_cpu_var(cpu_hw_events);
 	for (idx = 0; idx <= armpmu->num_events; ++idx) {
Index: linux-2.6/arch/powerpc/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/arch/powerpc/kernel/perf_event.c
+++ linux-2.6/arch/powerpc/kernel/perf_event.c
@@ -1164,10 +1164,10 @@ static void record_and_restart(struct pe
 	 * Finally record data if requested.
 	 */
 	if (record) {
-		struct perf_sample_data data = {
-			.addr	= ~0ULL,
-			.period	= event->hw.last_period,
-		};
+		struct perf_sample_data data;
+
+		perf_sample_data_init(&data, ~0ULL);
+		data.period = event->hw.last_period;
 
 		if (event->attr.sample_type & PERF_SAMPLE_ADDR)
 			perf_get_data_addr(regs, &data.addr);
Index: linux-2.6/arch/sparc/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/arch/sparc/kernel/perf_event.c
+++ linux-2.6/arch/sparc/kernel/perf_event.c
@@ -1189,7 +1189,7 @@ static int __kprobes perf_event_nmi_hand
 
 	regs = args->regs;
 
-	data.addr = 0;
+	perf_sample_data_init(&data, 0);
 
 	cpuc = &__get_cpu_var(cpu_hw_events);
 
Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
@@ -1096,8 +1096,7 @@ static int x86_pmu_handle_irq(struct pt_
 	int idx, handled = 0;
 	u64 val;
 
-	data.addr = 0;
-	data.raw = NULL;
+	perf_sample_data_init(&data, 0);
 
 	cpuc = &__get_cpu_var(cpu_hw_events);
 
Index: linux-2.6/arch/x86/kernel/cpu/perf_event_intel.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event_intel.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event_intel.c
@@ -590,10 +590,9 @@ static void intel_pmu_drain_bts_buffer(v
 
 	ds->bts_index = ds->bts_buffer_base;
 
+	perf_sample_data_init(&data, 0);
 
 	data.period	= event->hw.last_period;
-	data.addr	= 0;
-	data.raw	= NULL;
 	regs.ip		= 0;
 
 	/*
@@ -740,8 +739,7 @@ static int intel_pmu_handle_irq(struct p
 	int bit, loops;
 	u64 ack, status;
 
-	data.addr = 0;
-	data.raw = NULL;
+	perf_sample_data_init(&data, 0);
 
 	cpuc = &__get_cpu_var(cpu_hw_events);
 
Index: linux-2.6/include/linux/perf_event.h
===================================================================
--- linux-2.6.orig/include/linux/perf_event.h
+++ linux-2.6/include/linux/perf_event.h
@@ -801,6 +801,13 @@ struct perf_sample_data {
 	struct perf_raw_record		*raw;
 };
 
+static inline
+void perf_sample_data_init(struct perf_sample_data *data, u64 addr)
+{
+	data->addr = addr;
+	data->raw  = NULL;
+}
+
 extern void perf_output_sample(struct perf_output_handle *handle,
 			       struct perf_event_header *header,
 			       struct perf_sample_data *data,
Index: linux-2.6/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/kernel/perf_event.c
+++ linux-2.6/kernel/perf_event.c
@@ -4108,8 +4108,7 @@ void __perf_sw_event(u32 event_id, u64 n
 	if (rctx < 0)
 		return;
 
-	data.addr = addr;
-	data.raw  = NULL;
+	perf_sample_data_init(&data, addr);
 
 	do_perf_sw_event(PERF_TYPE_SOFTWARE, event_id, nr, nmi, &data, regs);
 
@@ -4154,11 +4153,10 @@ static enum hrtimer_restart perf_swevent
 	struct perf_event *event;
 	u64 period;
 
-	event	= container_of(hrtimer, struct perf_event, hw.hrtimer);
+	event = container_of(hrtimer, struct perf_event, hw.hrtimer);
 	event->pmu->read(event);
 
-	data.addr = 0;
-	data.raw = NULL;
+	perf_sample_data_init(&data, 0);
 	data.period = event->hw.last_period;
 	regs = get_irq_regs();
 	/*
@@ -4322,17 +4320,15 @@ static const struct pmu perf_ops_task_cl
 void perf_tp_event(int event_id, u64 addr, u64 count, void *record,
 			  int entry_size)
 {
+	struct pt_regs *regs = get_irq_regs();
+	struct perf_sample_data data;
 	struct perf_raw_record raw = {
 		.size = entry_size,
 		.data = record,
 	};
 
-	struct perf_sample_data data = {
-		.addr = addr,
-		.raw = &raw,
-	};
-
-	struct pt_regs *regs = get_irq_regs();
+	perf_sample_data_init(&data, addr);
+	data.raw = &raw;
 
 	if (!regs)
 		regs = task_pt_regs(current);
@@ -4448,8 +4444,7 @@ void perf_bp_event(struct perf_event *bp
 	struct perf_sample_data sample;
 	struct pt_regs *regs = data;
 
-	sample.raw = NULL;
-	sample.addr = bp->attr.bp_addr;
+	perf_sample_data_init(&sample, bp->attr.bp_addr);
 
 	if (!perf_exclude_event(bp, regs))
 		perf_swevent_add(bp, 1, 1, &sample, regs);

-- 



* [RFC][PATCH 06/11] perf, x86: PEBS infrastructure
From: Peter Zijlstra @ 2010-03-03 16:39 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: paulus, eranian, robert.richter, fweisbec, Peter Zijlstra

[-- Attachment #1: pebs.patch --]
[-- Type: text/plain, Size: 30155 bytes --]

Implement a simple PEBS model that always takes a single PEBS event
at a time. This is done so that the interaction with the rest of the
system is as expected (frequency adjustment, period randomization, LBR).
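
Concretely (see reserve_ds_buffers() in the new perf_event_intel_ds.c
below), the single-record behaviour comes from programming the PEBS
interrupt threshold one record past the buffer base, so a PMI is
raised after every PEBS write and the generic overflow path stays in
charge of the sampling period:

	/*
	 * Always use single record PEBS
	 */
	ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
		x86_pmu.pebs_record_size;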

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/kernel/cpu/perf_event.c          |  223 +++--------
 arch/x86/kernel/cpu/perf_event_intel.c    |  152 +------
 arch/x86/kernel/cpu/perf_event_intel_ds.c |  594 ++++++++++++++++++++++++++++++
 include/linux/perf_event.h                |    3 
 4 files changed, 709 insertions(+), 263 deletions(-)

Index: linux-2.6/arch/x86/kernel/cpu/perf_event_intel_ds.c
===================================================================
--- /dev/null
+++ linux-2.6/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -0,0 +1,594 @@
+#ifdef CONFIG_CPU_SUP_INTEL
+
+/* The maximal number of PEBS events: */
+#define MAX_PEBS_EVENTS		4
+
+/* The size of a BTS record in bytes: */
+#define BTS_RECORD_SIZE		24
+
+#define BTS_BUFFER_SIZE		(PAGE_SIZE << 4)
+#define PEBS_BUFFER_SIZE	PAGE_SIZE
+
+/*
+ * pebs_record_32 for p4 and core not supported
+
+struct pebs_record_32 {
+	u32 flags, ip;
+	u32 ax, bc, cx, dx;
+	u32 si, di, bp, sp;
+};
+
+ */
+
+struct pebs_record_core {
+	u64 flags, ip;
+	u64 ax, bx, cx, dx;
+	u64 si, di, bp, sp;
+	u64 r8,  r9,  r10, r11;
+	u64 r12, r13, r14, r15;
+};
+
+struct pebs_record_nhm {
+	u64 flags, ip;
+	u64 ax, bx, cx, dx;
+	u64 si, di, bp, sp;
+	u64 r8,  r9,  r10, r11;
+	u64 r12, r13, r14, r15;
+	u64 status, dla, dse, lat;
+};
+
+/*
+ * Bits in the debugctlmsr controlling branch tracing.
+ */
+#define X86_DEBUGCTL_TR			(1 << 6)
+#define X86_DEBUGCTL_BTS		(1 << 7)
+#define X86_DEBUGCTL_BTINT		(1 << 8)
+#define X86_DEBUGCTL_BTS_OFF_OS		(1 << 9)
+#define X86_DEBUGCTL_BTS_OFF_USR	(1 << 10)
+
+/*
+ * A debug store configuration.
+ *
+ * We only support architectures that use 64bit fields.
+ */
+struct debug_store {
+	u64	bts_buffer_base;
+	u64	bts_index;
+	u64	bts_absolute_maximum;
+	u64	bts_interrupt_threshold;
+	u64	pebs_buffer_base;
+	u64	pebs_index;
+	u64	pebs_absolute_maximum;
+	u64	pebs_interrupt_threshold;
+	u64	pebs_event_reset[MAX_PEBS_EVENTS];
+};
+
+static inline void init_debug_store_on_cpu(int cpu)
+{
+	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
+
+	if (!ds)
+		return;
+
+	wrmsr_on_cpu(cpu, MSR_IA32_DS_AREA,
+		     (u32)((u64)(unsigned long)ds),
+		     (u32)((u64)(unsigned long)ds >> 32));
+}
+
+static inline void fini_debug_store_on_cpu(int cpu)
+{
+	if (!per_cpu(cpu_hw_events, cpu).ds)
+		return;
+
+	wrmsr_on_cpu(cpu, MSR_IA32_DS_AREA, 0, 0);
+}
+
+static void release_ds_buffers(void)
+{
+	int cpu;
+
+	if (!x86_pmu.bts && !x86_pmu.pebs)
+		return;
+
+	get_online_cpus();
+
+	for_each_online_cpu(cpu)
+		fini_debug_store_on_cpu(cpu);
+
+	for_each_possible_cpu(cpu) {
+		struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
+
+		if (!ds)
+			continue;
+
+		per_cpu(cpu_hw_events, cpu).ds = NULL;
+
+		kfree((void *)(unsigned long)ds->pebs_buffer_base);
+		kfree((void *)(unsigned long)ds->bts_buffer_base);
+		kfree(ds);
+	}
+
+	put_online_cpus();
+}
+
+static int reserve_ds_buffers(void)
+{
+	int cpu, err = 0;
+
+	if (!x86_pmu.bts && !x86_pmu.pebs)
+		return 0;
+
+	get_online_cpus();
+
+	for_each_possible_cpu(cpu) {
+		struct debug_store *ds;
+		void *buffer = NULL;
+		int max, thresh;
+
+		err = -ENOMEM;
+		ds = kzalloc(sizeof(*ds), GFP_KERNEL);
+		if (unlikely(!ds)) {
+			kfree(buffer);
+			break;
+		}
+		per_cpu(cpu_hw_events, cpu).ds = ds;
+
+		if (x86_pmu.bts) {
+			buffer = kzalloc(BTS_BUFFER_SIZE, GFP_KERNEL);
+			if (unlikely(!buffer))
+				break;
+
+			max = BTS_BUFFER_SIZE / BTS_RECORD_SIZE;
+			thresh = max / 16;
+
+			ds->bts_buffer_base = (u64)(unsigned long)buffer;
+			ds->bts_index = ds->bts_buffer_base;
+			ds->bts_absolute_maximum = ds->bts_buffer_base +
+				max * BTS_RECORD_SIZE;
+			ds->bts_interrupt_threshold = ds->bts_absolute_maximum -
+				thresh * BTS_RECORD_SIZE;
+		}
+
+		if (x86_pmu.pebs) {
+			buffer = kzalloc(PEBS_BUFFER_SIZE, GFP_KERNEL);
+			if (unlikely(!buffer))
+				break;
+
+			max = PEBS_BUFFER_SIZE / x86_pmu.pebs_record_size;
+
+			ds->pebs_buffer_base = (u64)(unsigned long)buffer;
+			ds->pebs_index = ds->pebs_buffer_base;
+			ds->pebs_absolute_maximum = ds->pebs_buffer_base +
+				max * x86_pmu.pebs_record_size;
+			/*
+			 * Always use single record PEBS
+			 */
+			ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
+				x86_pmu.pebs_record_size;
+		}
+
+		err = 0;
+	}
+
+	if (err)
+		release_ds_buffers();
+	else {
+		for_each_online_cpu(cpu)
+			init_debug_store_on_cpu(cpu);
+	}
+
+	put_online_cpus();
+
+	return err;
+}
+
+/*
+ * BTS
+ */
+
+static struct event_constraint bts_constraint =
+	EVENT_CONSTRAINT(0, 1ULL << X86_PMC_IDX_FIXED_BTS, 0);
+
+static void intel_pmu_enable_bts(u64 config)
+{
+	unsigned long debugctlmsr;
+
+	debugctlmsr = get_debugctlmsr();
+
+	debugctlmsr |= X86_DEBUGCTL_TR;
+	debugctlmsr |= X86_DEBUGCTL_BTS;
+	debugctlmsr |= X86_DEBUGCTL_BTINT;
+
+	if (!(config & ARCH_PERFMON_EVENTSEL_OS))
+		debugctlmsr |= X86_DEBUGCTL_BTS_OFF_OS;
+
+	if (!(config & ARCH_PERFMON_EVENTSEL_USR))
+		debugctlmsr |= X86_DEBUGCTL_BTS_OFF_USR;
+
+	update_debugctlmsr(debugctlmsr);
+}
+
+static void intel_pmu_disable_bts(void)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	unsigned long debugctlmsr;
+
+	if (!cpuc->ds)
+		return;
+
+	debugctlmsr = get_debugctlmsr();
+
+	debugctlmsr &=
+		~(X86_DEBUGCTL_TR | X86_DEBUGCTL_BTS | X86_DEBUGCTL_BTINT |
+		  X86_DEBUGCTL_BTS_OFF_OS | X86_DEBUGCTL_BTS_OFF_USR);
+
+	update_debugctlmsr(debugctlmsr);
+}
+
+static void intel_pmu_drain_bts_buffer(void)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	struct debug_store *ds = cpuc->ds;
+	struct bts_record {
+		u64	from;
+		u64	to;
+		u64	flags;
+	};
+	struct perf_event *event = cpuc->events[X86_PMC_IDX_FIXED_BTS];
+	struct bts_record *at, *top;
+	struct perf_output_handle handle;
+	struct perf_event_header header;
+	struct perf_sample_data data;
+	struct pt_regs regs;
+
+	if (!event)
+		return;
+
+	if (!ds)
+		return;
+
+	at  = (struct bts_record *)(unsigned long)ds->bts_buffer_base;
+	top = (struct bts_record *)(unsigned long)ds->bts_index;
+
+	if (top <= at)
+		return;
+
+	ds->bts_index = ds->bts_buffer_base;
+
+	perf_sample_data_init(&data, 0);
+	data.period = event->hw.last_period;
+	regs.ip     = 0;
+
+	/*
+	 * Prepare a generic sample, i.e. fill in the invariant fields.
+	 * We will overwrite the from and to address before we output
+	 * the sample.
+	 */
+	perf_prepare_sample(&header, &data, event, &regs);
+
+	if (perf_output_begin(&handle, event, header.size * (top - at), 1, 1))
+		return;
+
+	for (; at < top; at++) {
+		data.ip		= at->from;
+		data.addr	= at->to;
+
+		perf_output_sample(&handle, &header, &data, event);
+	}
+
+	perf_output_end(&handle);
+
+	/* There's new data available. */
+	event->hw.interrupts++;
+	event->pending_kill = POLL_IN;
+}
+
+/*
+ * PEBS
+ */
+
+static struct event_constraint intel_core_pebs_events[] = {
+	PEBS_EVENT_CONSTRAINT(0x00c0, 0x1), /* INSTR_RETIRED.ANY */
+	PEBS_EVENT_CONSTRAINT(0xfec1, 0x1), /* X87_OPS_RETIRED.ANY */
+	PEBS_EVENT_CONSTRAINT(0x00c5, 0x1), /* BR_INST_RETIRED.MISPRED */
+	PEBS_EVENT_CONSTRAINT(0x1fc7, 0x1), /* SIMD_INST_RETIRED.ANY */
+	PEBS_EVENT_CONSTRAINT(0x01cb, 0x1), /* MEM_LOAD_RETIRED.L1D_MISS */
+	PEBS_EVENT_CONSTRAINT(0x02cb, 0x1), /* MEM_LOAD_RETIRED.L1D_LINE_MISS */
+	PEBS_EVENT_CONSTRAINT(0x04cb, 0x1), /* MEM_LOAD_RETIRED.L2_MISS */
+	PEBS_EVENT_CONSTRAINT(0x08cb, 0x1), /* MEM_LOAD_RETIRED.L2_LINE_MISS */
+	PEBS_EVENT_CONSTRAINT(0x10cb, 0x1), /* MEM_LOAD_RETIRED.DTLB_MISS */
+	EVENT_CONSTRAINT_END
+};
+
+static struct event_constraint intel_nehalem_pebs_events[] = {
+	PEBS_EVENT_CONSTRAINT(0x00c0, 0xf), /* INSTR_RETIRED.ANY */
+	PEBS_EVENT_CONSTRAINT(0xfec1, 0xf), /* X87_OPS_RETIRED.ANY */
+	PEBS_EVENT_CONSTRAINT(0x00c5, 0xf), /* BR_INST_RETIRED.MISPRED */
+	PEBS_EVENT_CONSTRAINT(0x1fc7, 0xf), /* SIMD_INST_RETIRED.ANY */
+	PEBS_EVENT_CONSTRAINT(0x01cb, 0xf), /* MEM_LOAD_RETIRED.L1D_MISS */
+	PEBS_EVENT_CONSTRAINT(0x02cb, 0xf), /* MEM_LOAD_RETIRED.L1D_LINE_MISS */
+	PEBS_EVENT_CONSTRAINT(0x04cb, 0xf), /* MEM_LOAD_RETIRED.L2_MISS */
+	PEBS_EVENT_CONSTRAINT(0x08cb, 0xf), /* MEM_LOAD_RETIRED.L2_LINE_MISS */
+	PEBS_EVENT_CONSTRAINT(0x10cb, 0xf), /* MEM_LOAD_RETIRED.DTLB_MISS */
+	EVENT_CONSTRAINT_END
+};
+
+static struct event_constraint *
+intel_pebs_constraints(struct perf_event *event)
+{
+	struct event_constraint *c;
+
+	if (!event->attr.precise)
+		return NULL;
+
+	if (x86_pmu.pebs_constraints) {
+		for_each_event_constraint(c, x86_pmu.pebs_constraints) {
+			if ((event->hw.config & c->cmask) == c->code)
+				return c;
+		}
+	}
+
+	return &emptyconstraint;
+}
+
+static void intel_pmu_pebs_enable(struct hw_perf_event *hwc)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	u64 val = cpuc->pebs_enabled;
+
+	hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
+
+	val |= 1ULL << hwc->idx;
+	wrmsrl(MSR_IA32_PEBS_ENABLE, val);
+}
+
+static void intel_pmu_pebs_disable(struct hw_perf_event *hwc)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	u64 val = cpuc->pebs_enabled;
+
+	val &= ~(1ULL << hwc->idx);
+	wrmsrl(MSR_IA32_PEBS_ENABLE, val);
+
+	hwc->config |= ARCH_PERFMON_EVENTSEL_INT;
+}
+
+static void intel_pmu_pebs_enable_all(void)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+	if (cpuc->pebs_enabled)
+		wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
+}
+
+static void intel_pmu_pebs_disable_all(void)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+	if (cpuc->pebs_enabled)
+		wrmsrl(MSR_IA32_PEBS_ENABLE, 0);
+}
+
+#define CC(pebs, regs, reg) (regs)->reg = (pebs)->reg
+
+#ifdef CONFIG_X86_32
+
+#define PEBS_TO_REGS(pebs, regs)		\
+do {						\
+	memset((regs), 0, sizeof(*regs));	\
+	CC((pebs), (regs), ax);			\
+	CC((pebs), (regs), bx);			\
+	CC((pebs), (regs), cx);			\
+	CC((pebs), (regs), dx);			\
+	CC((pebs), (regs), si);			\
+	CC((pebs), (regs), di);			\
+	CC((pebs), (regs), bp);			\
+	CC((pebs), (regs), sp);			\
+	CC((pebs), (regs), flags);		\
+	CC((pebs), (regs), ip);			\
+} while (0)
+
+#else /* CONFIG_X86_64 */
+
+#define PEBS_TO_REGS(pebs, regs)		\
+do {						\
+	memset((regs), 0, sizeof(*regs));	\
+	CC((pebs), (regs), ax);			\
+	CC((pebs), (regs), bx);			\
+	CC((pebs), (regs), cx);			\
+	CC((pebs), (regs), dx);			\
+	CC((pebs), (regs), si);			\
+	CC((pebs), (regs), di);			\
+	CC((pebs), (regs), bp);			\
+	CC((pebs), (regs), sp);			\
+	CC((pebs), (regs), r8);			\
+	CC((pebs), (regs), r9);			\
+	CC((pebs), (regs), r10);		\
+	CC((pebs), (regs), r11);		\
+	CC((pebs), (regs), r12);		\
+	CC((pebs), (regs), r13);		\
+	CC((pebs), (regs), r14);		\
+	CC((pebs), (regs), r15);		\
+	CC((pebs), (regs), flags);		\
+	CC((pebs), (regs), ip);			\
+} while (0)
+
+#endif
+
+static int intel_pmu_save_and_restart(struct perf_event *event);
+static void intel_pmu_disable_event(struct perf_event *event);
+
+static void intel_pmu_drain_pebs_core(void)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	struct debug_store *ds = cpuc->ds;
+	struct perf_event *event = cpuc->events[0]; /* PMC0 only */
+	struct pebs_record_core *at, *top;
+	struct perf_sample_data data;
+	struct pt_regs regs;
+	int n;
+
+	if (!event || !ds || !x86_pmu.pebs)
+		return;
+
+	intel_pmu_pebs_disable_all();
+
+	at  = (struct pebs_record_core *)(unsigned long)ds->pebs_buffer_base;
+	top = (struct pebs_record_core *)(unsigned long)ds->pebs_index;
+
+	if (top <= at)
+		goto out;
+
+	ds->pebs_index = ds->pebs_buffer_base;
+
+	if (!intel_pmu_save_and_restart(event))
+		goto out;
+
+	perf_sample_data_init(&data, 0);
+	data.period = event->hw.last_period;
+
+	n = top - at;
+
+	/*
+	 * Should not happen, we program the threshold at 1 and do not
+	 * set a reset value.
+	 */
+	if (unlikely(n > 1)) {
+		trace_printk("PEBS: too many events: %d\n", n);
+		at += n-1;
+	}
+
+	PEBS_TO_REGS(at, &regs);
+
+	if (perf_event_overflow(event, 1, &data, &regs))
+		intel_pmu_disable_event(event);
+
+out:
+	intel_pmu_pebs_enable_all();
+}
+
+static void intel_pmu_drain_pebs_nhm(void)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	struct debug_store *ds = cpuc->ds;
+	struct pebs_record_nhm *at, *top;
+	struct perf_sample_data data;
+	struct perf_event *event = NULL;
+	struct pt_regs regs;
+	int bit, n;
+
+	if (!ds || !x86_pmu.pebs)
+		return;
+
+	intel_pmu_pebs_disable_all();
+
+	at  = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
+	top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
+
+	if (top <= at)
+		goto out;
+
+	ds->pebs_index = ds->pebs_buffer_base;
+
+	n = top - at;
+
+	/*
+	 * Should not happen, we program the threshold at 1 and do not
+	 * set a reset value.
+	 */
+	if (unlikely(n > MAX_PEBS_EVENTS))
+		trace_printk("PEBS: too many events: %d\n", n);
+
+	for ( ; at < top; at++) {
+		for_each_bit(bit, (unsigned long *)&at->status, MAX_PEBS_EVENTS) {
+			if (!cpuc->events[bit]->attr.precise)
+				continue;
+
+			if (event)
+				trace_printk("PEBS: status: %Lx\n", at->status);
+
+			event = cpuc->events[bit];
+		}
+
+		if (!event) {
+			trace_printk("PEBS: interrupt, status: %Lx\n",
+					at->status);
+			continue;
+		}
+
+		if (!intel_pmu_save_and_restart(event))
+			continue;
+
+		perf_sample_data_init(&data, 0);
+		data.period = event->hw.last_period;
+
+		PEBS_TO_REGS(at, &regs);
+
+		if (perf_event_overflow(event, 1, &data, &regs))
+			intel_pmu_disable_event(event);
+	}
+out:
+	intel_pmu_pebs_enable_all();
+}
+
+/*
+ * BTS, PEBS probe and setup
+ */
+
+static void intel_ds_init(void)
+{
+	/*
+	 * No support for 32bit formats
+	 */
+	if (!boot_cpu_has(X86_FEATURE_DTES64))
+		return;
+
+	x86_pmu.bts  = boot_cpu_has(X86_FEATURE_BTS);
+	x86_pmu.pebs = boot_cpu_has(X86_FEATURE_PEBS);
+	if (x86_pmu.pebs) {
+		int format = 0;
+
+		if (x86_pmu.version > 1) {
+			u64 capabilities;
+			/*
+			 * v2+ has a PEBS format field
+			 */
+			rdmsrl(MSR_IA32_PERF_CAPABILITIES, capabilities);
+			format = (capabilities >> 8) & 0xf;
+		}
+
+		switch (format) {
+		case 0:
+			printk(KERN_CONT "PEBS v0, ");
+			x86_pmu.pebs_record_size = sizeof(struct pebs_record_core);
+			x86_pmu.drain_pebs = intel_pmu_drain_pebs_core;
+			x86_pmu.pebs_constraints = intel_core_pebs_events;
+			break;
+
+		case 1:
+			printk(KERN_CONT "PEBS v1, ");
+			x86_pmu.pebs_record_size = sizeof(struct pebs_record_nhm);
+			x86_pmu.drain_pebs = intel_pmu_drain_pebs_nhm;
+			x86_pmu.pebs_constraints = intel_nehalem_pebs_events;
+			break;
+
+		default:
+			printk(KERN_CONT "PEBS unknown format: %d, ", format);
+			x86_pmu.pebs = 0;
+			break;
+		}
+	}
+}
+
+#else /* CONFIG_CPU_SUP_INTEL */
+
+static int reserve_ds_buffers(void)
+{
+	return 0;
+}
+
+static void release_ds_buffers(void)
+{
+}
+
+#endif /* CONFIG_CPU_SUP_INTEL */
Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
@@ -31,45 +31,6 @@
 
 static u64 perf_event_mask __read_mostly;
 
-/* The maximal number of PEBS events: */
-#define MAX_PEBS_EVENTS	4
-
-/* The size of a BTS record in bytes: */
-#define BTS_RECORD_SIZE		24
-
-/* The size of a per-cpu BTS buffer in bytes: */
-#define BTS_BUFFER_SIZE		(BTS_RECORD_SIZE * 2048)
-
-/* The BTS overflow threshold in bytes from the end of the buffer: */
-#define BTS_OVFL_TH		(BTS_RECORD_SIZE * 128)
-
-
-/*
- * Bits in the debugctlmsr controlling branch tracing.
- */
-#define X86_DEBUGCTL_TR			(1 << 6)
-#define X86_DEBUGCTL_BTS		(1 << 7)
-#define X86_DEBUGCTL_BTINT		(1 << 8)
-#define X86_DEBUGCTL_BTS_OFF_OS		(1 << 9)
-#define X86_DEBUGCTL_BTS_OFF_USR	(1 << 10)
-
-/*
- * A debug store configuration.
- *
- * We only support architectures that use 64bit fields.
- */
-struct debug_store {
-	u64	bts_buffer_base;
-	u64	bts_index;
-	u64	bts_absolute_maximum;
-	u64	bts_interrupt_threshold;
-	u64	pebs_buffer_base;
-	u64	pebs_index;
-	u64	pebs_absolute_maximum;
-	u64	pebs_interrupt_threshold;
-	u64	pebs_event_reset[MAX_PEBS_EVENTS];
-};
-
 struct event_constraint {
 	union {
 		unsigned long	idxmsk[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
@@ -88,17 +49,29 @@ struct amd_nb {
 };
 
 struct cpu_hw_events {
+	/*
+	 * Generic x86 PMC bits
+	 */
 	struct perf_event	*events[X86_PMC_IDX_MAX]; /* in counter order */
 	unsigned long		active_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
 	unsigned long		interrupts;
 	int			enabled;
-	struct debug_store	*ds;
 
 	int			n_events;
 	int			n_added;
 	int			assign[X86_PMC_IDX_MAX]; /* event to counter assignment */
 	u64			tags[X86_PMC_IDX_MAX];
 	struct perf_event	*event_list[X86_PMC_IDX_MAX]; /* in enabled order */
+
+	/*
+	 * Intel DebugStore bits
+	 */
+	struct debug_store	*ds;
+	u64			pebs_enabled;
+
+	/*
+	 * AMD specific bits
+	 */
 	struct amd_nb		*amd_nb;
 };
 
@@ -112,12 +85,24 @@ struct cpu_hw_events {
 #define EVENT_CONSTRAINT(c, n, m)	\
 	__EVENT_CONSTRAINT(c, n, m, HWEIGHT(n))
 
+/*
+ * Constraint on the Event code.
+ */
 #define INTEL_EVENT_CONSTRAINT(c, n)	\
 	EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVTSEL_MASK)
 
+/*
+ * Constraint on the Event code + UMask + fixed-mask
+ */
 #define FIXED_EVENT_CONSTRAINT(c, n)	\
 	EVENT_CONSTRAINT(c, (1ULL << (32+n)), INTEL_ARCH_FIXED_MASK)
 
+/*
+ * Constraint on the Event code + UMask
+ */
+#define PEBS_EVENT_CONSTRAINT(c, n)	\
+	EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK)
+
 #define EVENT_CONSTRAINT_END		\
 	EVENT_CONSTRAINT(0, 0, 0)
 
@@ -128,6 +113,9 @@ struct cpu_hw_events {
  * struct x86_pmu - generic x86 pmu
  */
 struct x86_pmu {
+	/*
+	 * Generic x86 PMC bits
+	 */
 	const char	*name;
 	int		version;
 	int		(*handle_irq)(struct pt_regs *);
@@ -146,10 +134,6 @@ struct x86_pmu {
 	u64		event_mask;
 	int		apic;
 	u64		max_period;
-	u64		intel_ctrl;
-	void		(*enable_bts)(u64 config);
-	void		(*disable_bts)(void);
-
 	struct event_constraint *
 			(*get_event_constraints)(struct cpu_hw_events *cpuc,
 						 struct perf_event *event);
@@ -157,6 +141,19 @@ struct x86_pmu {
 	void		(*put_event_constraints)(struct cpu_hw_events *cpuc,
 						 struct perf_event *event);
 	struct event_constraint *event_constraints;
+
+	/*
+	 * Intel Arch Perfmon v2+
+	 */
+	u64		intel_ctrl;
+
+	/*
+	 * Intel DebugStore bits
+	 */
+	int		bts, pebs;
+	int		pebs_record_size;
+	void		(*drain_pebs)(void);
+	struct event_constraint *pebs_constraints;
 };
 
 static struct x86_pmu x86_pmu __read_mostly;
@@ -288,110 +285,14 @@ static void release_pmc_hardware(void)
 #endif
 }
 
-static inline bool bts_available(void)
-{
-	return x86_pmu.enable_bts != NULL;
-}
-
-static inline void init_debug_store_on_cpu(int cpu)
-{
-	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
-
-	if (!ds)
-		return;
-
-	wrmsr_on_cpu(cpu, MSR_IA32_DS_AREA,
-		     (u32)((u64)(unsigned long)ds),
-		     (u32)((u64)(unsigned long)ds >> 32));
-}
-
-static inline void fini_debug_store_on_cpu(int cpu)
-{
-	if (!per_cpu(cpu_hw_events, cpu).ds)
-		return;
-
-	wrmsr_on_cpu(cpu, MSR_IA32_DS_AREA, 0, 0);
-}
-
-static void release_bts_hardware(void)
-{
-	int cpu;
-
-	if (!bts_available())
-		return;
-
-	get_online_cpus();
-
-	for_each_online_cpu(cpu)
-		fini_debug_store_on_cpu(cpu);
-
-	for_each_possible_cpu(cpu) {
-		struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
-
-		if (!ds)
-			continue;
-
-		per_cpu(cpu_hw_events, cpu).ds = NULL;
-
-		kfree((void *)(unsigned long)ds->bts_buffer_base);
-		kfree(ds);
-	}
-
-	put_online_cpus();
-}
-
-static int reserve_bts_hardware(void)
-{
-	int cpu, err = 0;
-
-	if (!bts_available())
-		return 0;
-
-	get_online_cpus();
-
-	for_each_possible_cpu(cpu) {
-		struct debug_store *ds;
-		void *buffer;
-
-		err = -ENOMEM;
-		buffer = kzalloc(BTS_BUFFER_SIZE, GFP_KERNEL);
-		if (unlikely(!buffer))
-			break;
-
-		ds = kzalloc(sizeof(*ds), GFP_KERNEL);
-		if (unlikely(!ds)) {
-			kfree(buffer);
-			break;
-		}
-
-		ds->bts_buffer_base = (u64)(unsigned long)buffer;
-		ds->bts_index = ds->bts_buffer_base;
-		ds->bts_absolute_maximum =
-			ds->bts_buffer_base + BTS_BUFFER_SIZE;
-		ds->bts_interrupt_threshold =
-			ds->bts_absolute_maximum - BTS_OVFL_TH;
-
-		per_cpu(cpu_hw_events, cpu).ds = ds;
-		err = 0;
-	}
-
-	if (err)
-		release_bts_hardware();
-	else {
-		for_each_online_cpu(cpu)
-			init_debug_store_on_cpu(cpu);
-	}
-
-	put_online_cpus();
-
-	return err;
-}
+static int reserve_ds_buffers(void);
+static void release_ds_buffers(void);
 
 static void hw_perf_event_destroy(struct perf_event *event)
 {
 	if (atomic_dec_and_mutex_lock(&active_events, &pmc_reserve_mutex)) {
 		release_pmc_hardware();
-		release_bts_hardware();
+		release_ds_buffers();
 		mutex_unlock(&pmc_reserve_mutex);
 	}
 }
@@ -454,7 +355,7 @@ static int __hw_perf_event_init(struct p
 			if (!reserve_pmc_hardware())
 				err = -EBUSY;
 			else
-				err = reserve_bts_hardware();
+				err = reserve_ds_buffers();
 		}
 		if (!err)
 			atomic_inc(&active_events);
@@ -532,7 +433,7 @@ static int __hw_perf_event_init(struct p
 	if ((attr->config == PERF_COUNT_HW_BRANCH_INSTRUCTIONS) &&
 	    (hwc->sample_period == 1)) {
 		/* BTS is not supported by this architecture. */
-		if (!bts_available())
+		if (!x86_pmu.bts)
 			return -EOPNOTSUPP;
 
 		/* BTS is currently only allowed for user-mode. */
@@ -994,6 +895,7 @@ static void x86_pmu_unthrottle(struct pe
 void perf_event_print_debug(void)
 {
 	u64 ctrl, status, overflow, pmc_ctrl, pmc_count, prev_left, fixed;
+	u64 pebs;
 	struct cpu_hw_events *cpuc;
 	unsigned long flags;
 	int cpu, idx;
@@ -1011,12 +913,14 @@ void perf_event_print_debug(void)
 		rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
 		rdmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, overflow);
 		rdmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, fixed);
+		rdmsrl(MSR_IA32_PEBS_ENABLE, pebs);
 
 		pr_info("\n");
 		pr_info("CPU#%d: ctrl:       %016llx\n", cpu, ctrl);
 		pr_info("CPU#%d: status:     %016llx\n", cpu, status);
 		pr_info("CPU#%d: overflow:   %016llx\n", cpu, overflow);
 		pr_info("CPU#%d: fixed:      %016llx\n", cpu, fixed);
+		pr_info("CPU#%d: pebs:       %016llx\n", cpu, pebs);
 	}
 	pr_info("CPU#%d: active:       %016llx\n", cpu, *(u64 *)cpuc->active_mask);
 
@@ -1334,6 +1238,7 @@ undo:
 
 #include "perf_event_amd.c"
 #include "perf_event_p6.c"
+#include "perf_event_intel_ds.c"
 #include "perf_event_intel.c"
 
 static void __init pmu_check_apic(void)
@@ -1431,6 +1336,32 @@ static const struct pmu pmu = {
 };
 
 /*
+ * validate that we can schedule this event
+ */
+static int validate_event(struct perf_event *event)
+{
+	struct cpu_hw_events *fake_cpuc;
+	struct event_constraint *c;
+	int ret = 0;
+
+	fake_cpuc = kmalloc(sizeof(*fake_cpuc), GFP_KERNEL | __GFP_ZERO);
+	if (!fake_cpuc)
+		return -ENOMEM;
+
+	c = x86_pmu.get_event_constraints(fake_cpuc, event);
+
+	if (!c || !c->weight)
+		ret = -ENOSPC;
+
+	if (x86_pmu.put_event_constraints)
+		x86_pmu.put_event_constraints(fake_cpuc, event);
+
+	kfree(fake_cpuc);
+
+	return ret;
+}
+
+/*
  * validate a single event group
  *
  * validation include:
@@ -1495,6 +1426,8 @@ const struct pmu *hw_perf_event_init(str
 
 		if (event->group_leader != event)
 			err = validate_group(event);
+		else
+			err = validate_event(event);
 
 		event->pmu = tmp;
 	}
Index: linux-2.6/arch/x86/kernel/cpu/perf_event_intel.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event_intel.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event_intel.c
@@ -470,42 +470,6 @@ static u64 intel_pmu_raw_event(u64 hw_ev
 	return hw_event & CORE_EVNTSEL_MASK;
 }
 
-static void intel_pmu_enable_bts(u64 config)
-{
-	unsigned long debugctlmsr;
-
-	debugctlmsr = get_debugctlmsr();
-
-	debugctlmsr |= X86_DEBUGCTL_TR;
-	debugctlmsr |= X86_DEBUGCTL_BTS;
-	debugctlmsr |= X86_DEBUGCTL_BTINT;
-
-	if (!(config & ARCH_PERFMON_EVENTSEL_OS))
-		debugctlmsr |= X86_DEBUGCTL_BTS_OFF_OS;
-
-	if (!(config & ARCH_PERFMON_EVENTSEL_USR))
-		debugctlmsr |= X86_DEBUGCTL_BTS_OFF_USR;
-
-	update_debugctlmsr(debugctlmsr);
-}
-
-static void intel_pmu_disable_bts(void)
-{
-	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-	unsigned long debugctlmsr;
-
-	if (!cpuc->ds)
-		return;
-
-	debugctlmsr = get_debugctlmsr();
-
-	debugctlmsr &=
-		~(X86_DEBUGCTL_TR | X86_DEBUGCTL_BTS | X86_DEBUGCTL_BTINT |
-		  X86_DEBUGCTL_BTS_OFF_OS | X86_DEBUGCTL_BTS_OFF_USR);
-
-	update_debugctlmsr(debugctlmsr);
-}
-
 static void intel_pmu_disable_all(void)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
@@ -514,6 +478,8 @@ static void intel_pmu_disable_all(void)
 
 	if (test_bit(X86_PMC_IDX_FIXED_BTS, cpuc->active_mask))
 		intel_pmu_disable_bts();
+
+	intel_pmu_pebs_disable_all();
 }
 
 static void intel_pmu_enable_all(void)
@@ -531,6 +497,8 @@ static void intel_pmu_enable_all(void)
 
 		intel_pmu_enable_bts(event->hw.config);
 	}
+
+	intel_pmu_pebs_enable_all();
 }
 
 static inline u64 intel_pmu_get_status(void)
@@ -547,8 +515,7 @@ static inline void intel_pmu_ack_status(
 	wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, ack);
 }
 
-static inline void
-intel_pmu_disable_fixed(struct hw_perf_event *hwc)
+static void intel_pmu_disable_fixed(struct hw_perf_event *hwc)
 {
 	int idx = hwc->idx - X86_PMC_IDX_FIXED;
 	u64 ctrl_val, mask;
@@ -560,68 +527,7 @@ intel_pmu_disable_fixed(struct hw_perf_e
 	(void)checking_wrmsrl(hwc->config_base, ctrl_val);
 }
 
-static void intel_pmu_drain_bts_buffer(void)
-{
-	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-	struct debug_store *ds = cpuc->ds;
-	struct bts_record {
-		u64	from;
-		u64	to;
-		u64	flags;
-	};
-	struct perf_event *event = cpuc->events[X86_PMC_IDX_FIXED_BTS];
-	struct bts_record *at, *top;
-	struct perf_output_handle handle;
-	struct perf_event_header header;
-	struct perf_sample_data data;
-	struct pt_regs regs;
-
-	if (!event)
-		return;
-
-	if (!ds)
-		return;
-
-	at  = (struct bts_record *)(unsigned long)ds->bts_buffer_base;
-	top = (struct bts_record *)(unsigned long)ds->bts_index;
-
-	if (top <= at)
-		return;
-
-	ds->bts_index = ds->bts_buffer_base;
-
-	perf_sample_data_init(&data, 0);
-
-	data.period	= event->hw.last_period;
-	regs.ip		= 0;
-
-	/*
-	 * Prepare a generic sample, i.e. fill in the invariant fields.
-	 * We will overwrite the from and to address before we output
-	 * the sample.
-	 */
-	perf_prepare_sample(&header, &data, event, &regs);
-
-	if (perf_output_begin(&handle, event,
-			      header.size * (top - at), 1, 1))
-		return;
-
-	for (; at < top; at++) {
-		data.ip		= at->from;
-		data.addr	= at->to;
-
-		perf_output_sample(&handle, &header, &data, event);
-	}
-
-	perf_output_end(&handle);
-
-	/* There's new data available. */
-	event->hw.interrupts++;
-	event->pending_kill = POLL_IN;
-}
-
-static inline void
-intel_pmu_disable_event(struct perf_event *event)
+static void intel_pmu_disable_event(struct perf_event *event)
 {
 	struct hw_perf_event *hwc = &event->hw;
 
@@ -637,10 +543,12 @@ intel_pmu_disable_event(struct perf_even
 	}
 
 	x86_pmu_disable_event(event);
+
+	if (unlikely(event->attr.precise))
+		intel_pmu_pebs_disable(hwc);
 }
 
-static inline void
-intel_pmu_enable_fixed(struct hw_perf_event *hwc)
+static void intel_pmu_enable_fixed(struct hw_perf_event *hwc)
 {
 	int idx = hwc->idx - X86_PMC_IDX_FIXED;
 	u64 ctrl_val, bits, mask;
@@ -689,6 +597,9 @@ static void intel_pmu_enable_event(struc
 		return;
 	}
 
+	if (unlikely(event->attr.precise))
+		intel_pmu_pebs_enable(hwc);
+
 	__x86_pmu_enable_event(hwc);
 }
 
@@ -763,10 +674,17 @@ again:
 
 	inc_irq_stat(apic_perf_irqs);
 	ack = status;
+
+	/*
+	 * PEBS overflow sets bit 62 in the global status register
+	 */
+	if (__test_and_clear_bit(62, (unsigned long *)&status))
+		x86_pmu.drain_pebs();
+
 	for_each_bit(bit, (unsigned long *)&status, X86_PMC_IDX_MAX) {
 		struct perf_event *event = cpuc->events[bit];
 
-		__clear_bit(bit, (unsigned long *) &status);
+		__clear_bit(bit, (unsigned long *)&status);
 		if (!test_bit(bit, cpuc->active_mask))
 			continue;
 
@@ -793,22 +711,18 @@ again:
 	return 1;
 }
 
-static struct event_constraint bts_constraint =
-	EVENT_CONSTRAINT(0, 1ULL << X86_PMC_IDX_FIXED_BTS, 0);
-
 static struct event_constraint *
-intel_special_constraints(struct perf_event *event)
+intel_bts_constraints(struct perf_event *event)
 {
-	unsigned int hw_event;
-
-	hw_event = event->hw.config & INTEL_ARCH_EVENT_MASK;
+	struct hw_perf_event *hwc = &event->hw;
+	unsigned int hw_event, bts_event;
 
-	if (unlikely((hw_event ==
-		      x86_pmu.event_map(PERF_COUNT_HW_BRANCH_INSTRUCTIONS)) &&
-		     (event->hw.sample_period == 1))) {
+	hw_event = hwc->config & INTEL_ARCH_EVENT_MASK;
+	bts_event = x86_pmu.event_map(PERF_COUNT_HW_BRANCH_INSTRUCTIONS);
 
+	if (unlikely(hw_event == bts_event && hwc->sample_period == 1))
 		return &bts_constraint;
-	}
+
 	return NULL;
 }
 
@@ -817,7 +731,11 @@ intel_get_event_constraints(struct cpu_h
 {
 	struct event_constraint *c;
 
-	c = intel_special_constraints(event);
+	c = intel_bts_constraints(event);
+	if (c)
+		return c;
+
+	c = intel_pebs_constraints(event);
 	if (c)
 		return c;
 
@@ -866,8 +784,6 @@ static __initconst struct x86_pmu intel_
 	 * the generic event period:
 	 */
 	.max_period		= (1ULL << 31) - 1,
-	.enable_bts		= intel_pmu_enable_bts,
-	.disable_bts		= intel_pmu_disable_bts,
 	.get_event_constraints	= intel_get_event_constraints
 };
 
@@ -914,6 +830,8 @@ static __init int intel_pmu_init(void)
 	if (version > 1)
 		x86_pmu.num_events_fixed = max((int)edx.split.num_events_fixed, 3);
 
+	intel_ds_init();
+
 	/*
 	 * Install the hw-cache-events table:
 	 */
Index: linux-2.6/include/linux/perf_event.h
===================================================================
--- linux-2.6.orig/include/linux/perf_event.h
+++ linux-2.6/include/linux/perf_event.h
@@ -203,8 +203,9 @@ struct perf_event_attr {
 				enable_on_exec :  1, /* next exec enables     */
 				task           :  1, /* trace fork/exit       */
 				watermark      :  1, /* wakeup_watermark      */
+				precise        :  1, /* OoO invariant counter */
 
-				__reserved_1   : 49;
+				__reserved_1   : 48;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */

-- 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [RFC][PATCH 07/11] perf: Provide PERF_SAMPLE_REGS
  2010-03-03 16:39 [RFC][PATCH 00/11] Another stab at PEBS and LBR support Peter Zijlstra
                   ` (5 preceding siblings ...)
  2010-03-03 16:39 ` [RFC][PATCH 06/11] perf, x86: PEBS infrastructure Peter Zijlstra
@ 2010-03-03 16:39 ` Peter Zijlstra
  2010-03-03 17:30   ` Stephane Eranian
  2010-03-03 22:02   ` Frederic Weisbecker
  2010-03-03 16:39 ` [RFC][PATCH 08/11] perf, x86: Implement simple LBR support Peter Zijlstra
                   ` (3 subsequent siblings)
  10 siblings, 2 replies; 44+ messages in thread
From: Peter Zijlstra @ 2010-03-03 16:39 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: paulus, eranian, robert.richter, fweisbec, Peter Zijlstra

[-- Attachment #1: perf-sample-regs.patch --]
[-- Type: text/plain, Size: 2365 bytes --]

Simply copy out the provided pt_regs in a u64 aligned fashion.

XXX: do task_pt_regs() and get_irq_regs() always clear everything or
     are we now leaking data?

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/perf_event.h |    5 ++++-
 kernel/perf_event.c        |   17 +++++++++++++++++
 2 files changed, 21 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/perf_event.h
===================================================================
--- linux-2.6.orig/include/linux/perf_event.h
+++ linux-2.6/include/linux/perf_event.h
@@ -125,8 +125,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_PERIOD			= 1U << 8,
 	PERF_SAMPLE_STREAM_ID			= 1U << 9,
 	PERF_SAMPLE_RAW				= 1U << 10,
+	PERF_SAMPLE_REGS			= 1U << 11,
 
-	PERF_SAMPLE_MAX = 1U << 11,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 12,		/* non-ABI */
 };
 
 /*
@@ -392,6 +393,7 @@ enum perf_event_type {
 	 *	{ u64			period;   } && PERF_SAMPLE_PERIOD
 	 *
 	 *	{ struct read_format	values;	  } && PERF_SAMPLE_READ
+	 * 	{ struct pt_regs	regs;	  } && PERF_SAMPLE_REGS
 	 *
 	 *	{ u64			nr,
 	 *	  u64			ips[nr];  } && PERF_SAMPLE_CALLCHAIN
@@ -800,6 +802,7 @@ struct perf_sample_data {
 	u64				period;
 	struct perf_callchain_entry	*callchain;
 	struct perf_raw_record		*raw;
+	struct pt_regs			*regs;
 };
 
 static inline
Index: linux-2.6/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/kernel/perf_event.c
+++ linux-2.6/kernel/perf_event.c
@@ -3176,6 +3176,17 @@ void perf_output_sample(struct perf_outp
 	if (sample_type & PERF_SAMPLE_READ)
 		perf_output_read(handle, event);
 
+	if (sample_type & PERF_SAMPLE_REGS) {
+		int size = ALIGN(sizeof(struct pt_regs), sizeof(u64)) -
+			   sizeof(struct pt_regs);
+
+		perf_output_put(handle, *data->regs);
+		if (size) {
+			u64 zero = 0;
+			perf_output_copy(handle, &zero, size);
+		}
+	}
+
 	if (sample_type & PERF_SAMPLE_CALLCHAIN) {
 		if (data->callchain) {
 			int size = 1;
@@ -3273,6 +3284,12 @@ void perf_prepare_sample(struct perf_eve
 	if (sample_type & PERF_SAMPLE_READ)
 		header->size += perf_event_read_size(event);
 
+	if (sample_type & PERF_SAMPLE_REGS) {
+		data->regs = regs;
+		header->size += ALIGN(sizeof(struct pt_regs),
+				      sizeof(u64));
+	}
+
 	if (sample_type & PERF_SAMPLE_CALLCHAIN) {
 		int size = 1;
 

-- 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [RFC][PATCH 08/11] perf, x86: Implement simple LBR support
  2010-03-03 16:39 [RFC][PATCH 00/11] Another stab at PEBS and LBR support Peter Zijlstra
                   ` (6 preceding siblings ...)
  2010-03-03 16:39 ` [RFC][PATCH 07/11] perf: Provide PERF_SAMPLE_REGS Peter Zijlstra
@ 2010-03-03 16:39 ` Peter Zijlstra
  2010-03-03 21:52   ` Stephane Eranian
  2010-03-03 21:57   ` Stephane Eranian
  2010-03-03 16:39 ` [RFC][PATCH 09/11] perf, x86: Implement PERF_SAMPLE_BRANCH_STACK Peter Zijlstra
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 44+ messages in thread
From: Peter Zijlstra @ 2010-03-03 16:39 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: paulus, eranian, robert.richter, fweisbec, Peter Zijlstra

[-- Attachment #1: perf-lbr.patch --]
[-- Type: text/plain, Size: 8982 bytes --]

Implement support for Intel LBR stacks that support
FREEZE_LBRS_ON_PMI. We do not (yet?) support the LBR config register
because that is SMT wide and would also put undue restraints on the
PEBS users.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/kernel/cpu/perf_event.c           |   22 ++
 arch/x86/kernel/cpu/perf_event_intel.c     |   13 +
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |  228 +++++++++++++++++++++++++++++
 3 files changed, 263 insertions(+)

Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
@@ -48,6 +48,12 @@ struct amd_nb {
 	struct event_constraint event_constraints[X86_PMC_IDX_MAX];
 };
 
+#define MAX_LBR_ENTRIES		16
+
+struct lbr_entry {
+	u64	from, to, flags;
+};
+
 struct cpu_hw_events {
 	/*
 	 * Generic x86 PMC bits
@@ -70,6 +76,14 @@ struct cpu_hw_events {
 	u64			pebs_enabled;
 
 	/*
+	 * Intel LBR bits
+	 */
+	int			lbr_users;
+	int			lbr_entries;
+	struct lbr_entry	lbr_stack[MAX_LBR_ENTRIES];
+	void			*lbr_context;
+
+	/*
 	 * AMD specific bits
 	 */
 	struct amd_nb		*amd_nb;
@@ -154,6 +168,13 @@ struct x86_pmu {
 	int		pebs_record_size;
 	void		(*drain_pebs)(void);
 	struct event_constraint *pebs_constraints;
+
+	/*
+	 * Intel LBR
+	 */
+	unsigned long	lbr_tos, lbr_from, lbr_to; /* MSR base regs       */
+	int		lbr_nr;			   /* hardware stack size */
+	int		lbr_format;		   /* hardware format     */
 };
 
 static struct x86_pmu x86_pmu __read_mostly;
@@ -1238,6 +1259,7 @@ undo:
 
 #include "perf_event_amd.c"
 #include "perf_event_p6.c"
+#include "perf_event_intel_lbr.c"
 #include "perf_event_intel_ds.c"
 #include "perf_event_intel.c"
 
Index: linux-2.6/arch/x86/kernel/cpu/perf_event_intel.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event_intel.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event_intel.c
@@ -480,6 +480,7 @@ static void intel_pmu_disable_all(void)
 		intel_pmu_disable_bts();
 
 	intel_pmu_pebs_disable_all();
+	intel_pmu_lbr_disable_all();
 }
 
 static void intel_pmu_enable_all(void)
@@ -499,6 +500,7 @@ static void intel_pmu_enable_all(void)
 	}
 
 	intel_pmu_pebs_enable_all();
+	intel_pmu_lbr_enable_all();
 }
 
 static inline u64 intel_pmu_get_status(void)
@@ -675,6 +677,8 @@ again:
 	inc_irq_stat(apic_perf_irqs);
 	ack = status;
 
+	intel_pmu_lbr_read();
+
 	/*
 	 * PEBS overflow sets bit 62 in the global status register
 	 */
@@ -847,6 +851,8 @@ static __init int intel_pmu_init(void)
 		memcpy(hw_cache_event_ids, core2_hw_cache_event_ids,
 		       sizeof(hw_cache_event_ids));
 
+		intel_pmu_lbr_init_core();
+
 		x86_pmu.event_constraints = intel_core2_event_constraints;
 		pr_cont("Core2 events, ");
 		break;
@@ -856,13 +862,18 @@ static __init int intel_pmu_init(void)
 		memcpy(hw_cache_event_ids, nehalem_hw_cache_event_ids,
 		       sizeof(hw_cache_event_ids));
 
+		intel_pmu_lbr_init_nhm();
+
 		x86_pmu.event_constraints = intel_nehalem_event_constraints;
 		pr_cont("Nehalem/Corei7 events, ");
 		break;
+
 	case 28: /* Atom */
 		memcpy(hw_cache_event_ids, atom_hw_cache_event_ids,
 		       sizeof(hw_cache_event_ids));
 
+		intel_pmu_lbr_init_atom();
+
 		x86_pmu.event_constraints = intel_gen_event_constraints;
 		pr_cont("Atom events, ");
 		break;
@@ -872,6 +883,8 @@ static __init int intel_pmu_init(void)
 		memcpy(hw_cache_event_ids, westmere_hw_cache_event_ids,
 		       sizeof(hw_cache_event_ids));
 
+		intel_pmu_lbr_init_nhm();
+
 		x86_pmu.event_constraints = intel_westmere_event_constraints;
 		pr_cont("Westmere events, ");
 		break;
Index: linux-2.6/arch/x86/kernel/cpu/perf_event_intel_lbr.c
===================================================================
--- /dev/null
+++ linux-2.6/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -0,0 +1,228 @@
+#ifdef CONFIG_CPU_SUP_INTEL
+
+enum {
+	LBR_FORMAT_32		= 0x00,
+	LBR_FORMAT_LIP		= 0x01,
+	LBR_FORMAT_EIP		= 0x02,
+	LBR_FORMAT_EIP_FLAGS	= 0x03,
+};
+
+/*
+ * We only support LBR implementations that have FREEZE_LBRS_ON_PMI
+ * otherwise it becomes near impossible to get a reliable stack.
+ */
+
+#define X86_DEBUGCTL_LBR               		(1 << 0)
+#define X86_DEBUGCTL_FREEZE_LBRS_ON_PMI		(1 << 11)
+
+static void __intel_pmu_lbr_enable(void)
+{
+	u64 debugctl;
+
+	rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
+	debugctl |= (X86_DEBUGCTL_LBR | X86_DEBUGCTL_FREEZE_LBRS_ON_PMI);
+	wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
+}
+
+static void __intel_pmu_lbr_disable(void)
+{
+	u64 debugctl;
+
+	rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
+	debugctl &= ~(X86_DEBUGCTL_LBR | X86_DEBUGCTL_FREEZE_LBRS_ON_PMI);
+	wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
+}
+
+static void intel_pmu_lbr_reset_32(void)
+{
+	int i;
+
+	for (i = 0; i < x86_pmu.lbr_nr; i++)
+		wrmsrl(x86_pmu.lbr_from + i, 0);
+}
+
+static void intel_pmu_lbr_reset_64(void)
+{
+	int i;
+
+	for (i = 0; i < x86_pmu.lbr_nr; i++) {
+		wrmsrl(x86_pmu.lbr_from + i, 0);
+		wrmsrl(x86_pmu.lbr_to   + i, 0);
+	}
+}
+
+static void intel_pmu_lbr_reset(void)
+{
+	if (x86_pmu.lbr_format == LBR_FORMAT_32)
+		intel_pmu_lbr_reset_32();
+	else
+		intel_pmu_lbr_reset_64();
+}
+
+static void intel_pmu_lbr_enable(struct perf_event *event)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+	if (!x86_pmu.lbr_nr)
+		return;
+
+	WARN_ON(cpuc->enabled);
+
+	/*
+	 * Reset the LBR stack if this is the first LBR user or
+	 * we changed task context so as to avoid data leaks.
+	 */
+
+	if (!cpuc->lbr_users ||
+	    (event->ctx->task && cpuc->lbr_context != event->ctx)) {
+		intel_pmu_lbr_reset();
+		cpuc->lbr_context = event->ctx;
+	}
+
+	cpuc->lbr_users++;
+}
+
+static void intel_pmu_lbr_disable(struct perf_event *event)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+	if (!x86_pmu.lbr_nr)
+		return;
+
+	cpuc->lbr_users--;
+
+	BUG_ON(cpuc->lbr_users < 0);
+	WARN_ON(cpuc->enabled);
+}
+
+static void intel_pmu_lbr_enable_all(void)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+	if (cpuc->lbr_users)
+		__intel_pmu_lbr_enable();
+}
+
+static void intel_pmu_lbr_disable_all(void)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+	if (cpuc->lbr_users)
+		__intel_pmu_lbr_disable();
+}
+
+static inline u64 intel_pmu_lbr_tos(void)
+{
+	u64 tos;
+
+	rdmsrl(x86_pmu.lbr_tos, tos);
+
+	return tos;
+}
+
+static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
+{
+	unsigned long mask = x86_pmu.lbr_nr - 1;
+	u64 tos = intel_pmu_lbr_tos();
+	int i;
+
+	for (i = 0; i < x86_pmu.lbr_nr; i++) {
+		unsigned long lbr_idx = (tos - i) & mask;
+		union {
+			struct {
+				u32 from;
+				u32 to;
+			};
+			u64     lbr;
+		} msr_lastbranch;
+
+		rdmsrl(x86_pmu.lbr_from + lbr_idx, msr_lastbranch.lbr);
+
+		cpuc->lbr_stack[i].from  = msr_lastbranch.from;
+		cpuc->lbr_stack[i].to    = msr_lastbranch.to;
+		cpuc->lbr_stack[i].flags = 0;
+	}
+	cpuc->lbr_entries = i;
+}
+
+#define LBR_FROM_FLAG_MISPRED  (1ULL << 63)
+
+/*
+ * Due to lack of segmentation in Linux the effective address (offset)
+ * is the same as the linear address, allowing us to merge the LIP and EIP
+ * LBR formats.
+ */
+static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
+{
+	unsigned long mask = x86_pmu.lbr_nr - 1;
+	u64 tos = intel_pmu_lbr_tos();
+	int i;
+
+	for (i = 0; i < x86_pmu.lbr_nr; i++) {
+		unsigned long lbr_idx = (tos - i) & mask;
+		u64 from, to, flags = 0;
+
+		rdmsrl(x86_pmu.lbr_from + lbr_idx, from);
+		rdmsrl(x86_pmu.lbr_to   + lbr_idx, to);
+
+		if (x86_pmu.lbr_format == LBR_FORMAT_EIP_FLAGS) {
+			flags = !!(from & LBR_FROM_FLAG_MISPRED);
+			from = (u64)((((s64)from) << 1) >> 1);
+		}
+
+		cpuc->lbr_stack[i].from  = from;
+		cpuc->lbr_stack[i].to    = to;
+		cpuc->lbr_stack[i].flags = flags;
+	}
+	cpuc->lbr_entries = i;
+}
+
+static void intel_pmu_lbr_read(void)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+	if (!cpuc->lbr_users)
+		return;
+
+	if (x86_pmu.lbr_format == LBR_FORMAT_32)
+		intel_pmu_lbr_read_32(cpuc);
+	else
+		intel_pmu_lbr_read_64(cpuc);
+}
+
+static int intel_pmu_lbr_format(void)
+{
+	u64 capabilities;
+
+	rdmsrl(MSR_IA32_PERF_CAPABILITIES, capabilities);
+	return capabilities & 0x1f;
+}
+
+static void intel_pmu_lbr_init_core(void)
+{
+	x86_pmu.lbr_format = intel_pmu_lbr_format();
+	x86_pmu.lbr_nr     = 4;
+	x86_pmu.lbr_tos    = 0x01c9;
+	x86_pmu.lbr_from   = 0x40;
+	x86_pmu.lbr_to     = 0x60;
+}
+
+static void intel_pmu_lbr_init_nhm(void)
+{
+	x86_pmu.lbr_format = intel_pmu_lbr_format();
+	x86_pmu.lbr_nr     = 16;
+	x86_pmu.lbr_tos    = 0x01c9;
+	x86_pmu.lbr_from   = 0x680;
+	x86_pmu.lbr_to     = 0x6c0;
+}
+
+static void intel_pmu_lbr_init_atom(void)
+{
+	x86_pmu.lbr_format = intel_pmu_lbr_format();
+	x86_pmu.lbr_nr	   = 8;
+	x86_pmu.lbr_tos    = 0x01c9;
+	x86_pmu.lbr_from   = 0x40;
+	x86_pmu.lbr_to     = 0x60;
+}
+
+#endif /* CONFIG_CPU_SUP_INTEL */

-- 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [RFC][PATCH 09/11] perf, x86: Implement PERF_SAMPLE_BRANCH_STACK
  2010-03-03 16:39 [RFC][PATCH 00/11] Another stab at PEBS and LBR support Peter Zijlstra
                   ` (7 preceding siblings ...)
  2010-03-03 16:39 ` [RFC][PATCH 08/11] perf, x86: Implement simple LBR support Peter Zijlstra
@ 2010-03-03 16:39 ` Peter Zijlstra
  2010-03-03 21:08   ` Frederic Weisbecker
  2010-03-03 16:39 ` [RFC][PATCH 10/11] perf, x86: use LBR for PEBS IP+1 fixup Peter Zijlstra
  2010-03-03 16:39 ` [RFC][PATCH 11/11] perf, x86: Clean up IA32_PERF_CAPABILITIES usage Peter Zijlstra
  10 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2010-03-03 16:39 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: paulus, eranian, robert.richter, fweisbec, Peter Zijlstra

[-- Attachment #1: perf-sample-lbr.patch --]
[-- Type: text/plain, Size: 9664 bytes --]


Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/kernel/cpu/perf_event.c           |   14 +++-------
 arch/x86/kernel/cpu/perf_event_intel.c     |   10 ++++++-
 arch/x86/kernel/cpu/perf_event_intel_ds.c  |   16 ++++--------
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |   20 ++++++++-------
 include/linux/perf_event.h                 |   27 +++++++++++++++++---
 kernel/perf_event.c                        |   38 ++++++++++++++++++++++-------
 6 files changed, 83 insertions(+), 42 deletions(-)

Index: linux-2.6/include/linux/perf_event.h
===================================================================
--- linux-2.6.orig/include/linux/perf_event.h
+++ linux-2.6/include/linux/perf_event.h
@@ -126,8 +126,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_STREAM_ID			= 1U << 9,
 	PERF_SAMPLE_RAW				= 1U << 10,
 	PERF_SAMPLE_REGS			= 1U << 11,
+	PERF_SAMPLE_BRANCH_STACK		= 1U << 12,
 
-	PERF_SAMPLE_MAX = 1U << 12,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 13,		/* non-ABI */
 };
 
 /*
@@ -395,9 +396,14 @@ enum perf_event_type {
 	 *	{ struct read_format	values;	  } && PERF_SAMPLE_READ
 	 * 	{ struct pt_regs	regs;	  } && PERF_SAMPLE_REGS
 	 *
-	 *	{ u64			nr,
+	 *	{ u64			nr;
 	 *	  u64			ips[nr];  } && PERF_SAMPLE_CALLCHAIN
 	 *
+	 * 	{ u64			nr;
+	 * 	  { u64 from, to, flags;
+	 * 	  }			lbr[nr];  } && PERF_SAMPLE_BRANCH_STACK
+	 *
+	 *
 	 *	#
 	 *	# The RAW record below is opaque data wrt the ABI
 	 *	#
@@ -469,6 +475,17 @@ struct perf_raw_record {
 	void				*data;
 };
 
+struct perf_branch_entry {
+	__u64				from;
+	__u64				to;
+	__u64				flags;
+};
+
+struct perf_branch_stack {
+	__u64				nr;
+	struct perf_branch_entry	entries[0];
+};
+
 struct task_struct;
 
 /**
@@ -803,13 +820,15 @@ struct perf_sample_data {
 	struct perf_callchain_entry	*callchain;
 	struct perf_raw_record		*raw;
 	struct pt_regs			*regs;
+	struct perf_branch_stack	*branches;
 };
 
 static inline
 void perf_sample_data_init(struct perf_sample_data *data, u64 addr)
 {
-	data->addr = addr;
-	data->raw  = NULL;
+	data->addr     = addr;
+	data->raw      = NULL;
+	data->branches = NULL;
 }
 
 extern void perf_output_sample(struct perf_output_handle *handle,
Index: linux-2.6/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/kernel/perf_event.c
+++ linux-2.6/kernel/perf_event.c
@@ -3189,12 +3189,9 @@ void perf_output_sample(struct perf_outp
 
 	if (sample_type & PERF_SAMPLE_CALLCHAIN) {
 		if (data->callchain) {
-			int size = 1;
+			int size = sizeof(u64);
 
-			if (data->callchain)
-				size += data->callchain->nr;
-
-			size *= sizeof(u64);
+			size += data->callchain->nr * sizeof(u64);
 
 			perf_output_copy(handle, data->callchain, size);
 		} else {
@@ -3203,6 +3200,20 @@ void perf_output_sample(struct perf_outp
 		}
 	}
 
+	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
+		if (data->branches) {
+			int size = sizeof(u64);
+
+			size += data->branches->nr *
+				sizeof(struct perf_branch_entry);
+
+			perf_output_copy(handle, data->branches, size);
+		} else {
+			u64 nr = 0;
+			perf_output_put(handle, nr);
+		}
+	}
+
 	if (sample_type & PERF_SAMPLE_RAW) {
 		if (data->raw) {
 			perf_output_put(handle, data->raw->size);
@@ -3291,14 +3302,25 @@ void perf_prepare_sample(struct perf_eve
 	}
 
 	if (sample_type & PERF_SAMPLE_CALLCHAIN) {
-		int size = 1;
+		int size = sizeof(u64);
 
 		data->callchain = perf_callchain(regs);
 
 		if (data->callchain)
-			size += data->callchain->nr;
+			size += data->callchain->nr * sizeof(u64);
+
+		header->size += size;
+	}
 
-		header->size += size * sizeof(u64);
+	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
+		int size = sizeof(u64);
+
+		if (data->branches) {
+			size += data->branches->nr *
+				sizeof(struct perf_branch_entry);
+		}
+
+		header->size += size;
 	}
 
 	if (sample_type & PERF_SAMPLE_RAW) {
Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
@@ -50,10 +50,6 @@ struct amd_nb {
 
 #define MAX_LBR_ENTRIES		16
 
-struct lbr_entry {
-	u64	from, to, flags;
-};
-
 struct cpu_hw_events {
 	/*
 	 * Generic x86 PMC bits
@@ -78,10 +74,10 @@ struct cpu_hw_events {
 	/*
 	 * Intel LBR bits
 	 */
-	int			lbr_users;
-	int			lbr_entries;
-	struct lbr_entry	lbr_stack[MAX_LBR_ENTRIES];
-	void			*lbr_context;
+	int				lbr_users;
+	void				*lbr_context;
+	struct perf_branch_stack	lbr_stack;
+	struct perf_branch_entry	lbr_entries[MAX_LBR_ENTRIES];
 
 	/*
 	 * AMD specific bits
@@ -166,7 +162,7 @@ struct x86_pmu {
 	 */
 	int		bts, pebs;
 	int		pebs_record_size;
-	void		(*drain_pebs)(void);
+	void		(*drain_pebs)(struct perf_sample_data *data);
 	struct event_constraint *pebs_constraints;
 
 	/*
Index: linux-2.6/arch/x86/kernel/cpu/perf_event_intel_lbr.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -138,11 +138,11 @@ static void intel_pmu_lbr_read_32(struct
 
 		rdmsrl(x86_pmu.lbr_from + lbr_idx, msr_lastbranch.lbr);
 
-		cpuc->lbr_stack[i].from  = msr_lastbranch.from;
-		cpuc->lbr_stack[i].to    = msr_lastbranch.to;
-		cpuc->lbr_stack[i].flags = 0;
+		cpuc->lbr_entries[i].from  = msr_lastbranch.from;
+		cpuc->lbr_entries[i].to    = msr_lastbranch.to;
+		cpuc->lbr_entries[i].flags = 0;
 	}
-	cpuc->lbr_entries = i;
+	cpuc->lbr_stack.nr = i;
 }
 
 #define LBR_FROM_FLAG_MISPRED  (1ULL << 63)
@@ -170,14 +170,14 @@ static void intel_pmu_lbr_read_64(struct
 			from = (u64)((((s64)from) << 1) >> 1);
 		}
 
-		cpuc->lbr_stack[i].from  = from;
-		cpuc->lbr_stack[i].to    = to;
-		cpuc->lbr_stack[i].flags = flags;
+		cpuc->lbr_entries[i].from  = from;
+		cpuc->lbr_entries[i].to    = to;
+		cpuc->lbr_entries[i].flags = flags;
 	}
-	cpuc->lbr_entries = i;
+	cpuc->lbr_stack.nr = i;
 }
 
-static void intel_pmu_lbr_read(void)
+static void intel_pmu_lbr_read(struct perf_sample_data *data)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 
@@ -188,6 +188,8 @@ static void intel_pmu_lbr_read(void)
 		intel_pmu_lbr_read_32(cpuc);
 	else
 		intel_pmu_lbr_read_64(cpuc);
+
+	data->branches = &cpuc->lbr_stack;
 }
 
 static int intel_pmu_lbr_format(void)
Index: linux-2.6/arch/x86/kernel/cpu/perf_event_intel.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event_intel.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event_intel.c
@@ -548,6 +548,9 @@ static void intel_pmu_disable_event(stru
 
 	if (unlikely(event->attr.precise))
 		intel_pmu_pebs_disable(hwc);
+
+	if (event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK)
+		intel_pmu_lbr_disable(event);
 }
 
 static void intel_pmu_enable_fixed(struct hw_perf_event *hwc)
@@ -602,6 +605,9 @@ static void intel_pmu_enable_event(struc
 	if (unlikely(event->attr.precise))
 		intel_pmu_pebs_enable(hwc);
 
+	if (event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK)
+		intel_pmu_lbr_enable(event);
+
 	__x86_pmu_enable_event(hwc);
 }
 
@@ -677,13 +683,13 @@ again:
 	inc_irq_stat(apic_perf_irqs);
 	ack = status;
 
-	intel_pmu_lbr_read();
+	intel_pmu_lbr_read(&data);
 
 	/*
 	 * PEBS overflow sets bit 62 in the global status register
 	 */
 	if (__test_and_clear_bit(62, (unsigned long *)&status))
-		x86_pmu.drain_pebs();
+		x86_pmu.drain_pebs(&data);
 
 	for_each_bit(bit, (unsigned long *)&status, X86_PMC_IDX_MAX) {
 		struct perf_event *event = cpuc->events[bit];
Index: linux-2.6/arch/x86/kernel/cpu/perf_event_intel_ds.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -418,13 +418,12 @@ do {						\
 static int intel_pmu_save_and_restart(struct perf_event *event);
 static void intel_pmu_disable_event(struct perf_event *event);
 
-static void intel_pmu_drain_pebs_core(void)
+static void intel_pmu_drain_pebs_core(struct perf_sample_data *data)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 	struct debug_store *ds = cpuc->ds;
 	struct perf_event *event = cpuc->events[0]; /* PMC0 only */
 	struct pebs_record_core *at, *top;
-	struct perf_sample_data data;
 	struct pt_regs regs;
 	int n;
 
@@ -444,8 +443,7 @@ static void intel_pmu_drain_pebs_core(vo
 	if (!intel_pmu_save_and_restart(event))
 		goto out;
 
-	perf_sample_data_init(&data, 0);
-	data.period = event->hw.last_period;
+	data->period = event->hw.last_period;
 
 	n = top - at;
 
@@ -460,19 +458,18 @@ static void intel_pmu_drain_pebs_core(vo
 
 	PEBS_TO_REGS(at, &regs);
 
-	if (perf_event_overflow(event, 1, &data, &regs))
+	if (perf_event_overflow(event, 1, data, &regs))
 		intel_pmu_disable_event(event);
 
 out:
 	intel_pmu_pebs_enable_all();
 }
 
-static void intel_pmu_drain_pebs_nhm(void)
+static void intel_pmu_drain_pebs_nhm(struct perf_sample_data *data)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 	struct debug_store *ds = cpuc->ds;
 	struct pebs_record_nhm *at, *top;
-	struct perf_sample_data data;
 	struct perf_event *event = NULL;
 	struct pt_regs regs;
 	int bit, n;
@@ -519,12 +516,11 @@ static void intel_pmu_drain_pebs_nhm(voi
 		if (!intel_pmu_save_and_restart(event))
 			continue;
 
-		perf_sample_data_init(&data, 0);
-		data.period = event->hw.last_period;
+		data->period = event->hw.last_period;
 
 		PEBS_TO_REGS(at, &regs);
 
-		if (perf_event_overflow(event, 1, &data, &regs))
+		if (perf_event_overflow(event, 1, data, &regs))
 			intel_pmu_disable_event(event);
 	}
 out:

-- 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [RFC][PATCH 10/11] perf, x86: use LBR for PEBS IP+1 fixup
  2010-03-03 16:39 [RFC][PATCH 00/11] Another stab at PEBS and LBR support Peter Zijlstra
                   ` (8 preceding siblings ...)
  2010-03-03 16:39 ` [RFC][PATCH 09/11] perf, x86: Implement PERF_SAMPLE_BRANCH_STACK Peter Zijlstra
@ 2010-03-03 16:39 ` Peter Zijlstra
  2010-03-03 18:05   ` Masami Hiramatsu
  2010-03-03 16:39 ` [RFC][PATCH 11/11] perf, x86: Clean up IA32_PERF_CAPABILITIES usage Peter Zijlstra
  10 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2010-03-03 16:39 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: paulus, eranian, robert.richter, fweisbec, Masami Hiramatsu,
	Peter Zijlstra

[-- Attachment #1: perf-pebs-lbr.patch --]
[-- Type: text/plain, Size: 6648 bytes --]

PEBS always reports the IP+1, that is the instruction after the one
that got sampled, cure this by using the LBR to reliably rewind the
instruction stream.

CC: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/kernel/cpu/perf_event.c          |   70 ++++++++++++-------------
 arch/x86/kernel/cpu/perf_event_intel.c    |    4 -
 arch/x86/kernel/cpu/perf_event_intel_ds.c |   81 +++++++++++++++++++++++++++++-
 3 files changed, 116 insertions(+), 39 deletions(-)

Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
@@ -29,6 +29,41 @@
 #include <asm/stacktrace.h>
 #include <asm/nmi.h>
 
+/*
+ * best effort, GUP based copy_from_user() that assumes IRQ or NMI context
+ */
+static unsigned long
+copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
+{
+	unsigned long offset, addr = (unsigned long)from;
+	int type = in_nmi() ? KM_NMI : KM_IRQ0;
+	unsigned long size, len = 0;
+	struct page *page;
+	void *map;
+	int ret;
+
+	do {
+		ret = __get_user_pages_fast(addr, 1, 0, &page);
+		if (!ret)
+			break;
+
+		offset = addr & (PAGE_SIZE - 1);
+		size = min(PAGE_SIZE - offset, n - len);
+
+		map = kmap_atomic(page, type);
+		memcpy(to, map+offset, size);
+		kunmap_atomic(map, type);
+		put_page(page);
+
+		len  += size;
+		to   += size;
+		addr += size;
+
+	} while (len < n);
+
+	return len;
+}
+
 static u64 perf_event_mask __read_mostly;
 
 struct event_constraint {
@@ -1516,41 +1551,6 @@ perf_callchain_kernel(struct pt_regs *re
 	dump_trace(NULL, regs, NULL, regs->bp, &backtrace_ops, entry);
 }
 
-/*
- * best effort, GUP based copy_from_user() that assumes IRQ or NMI context
- */
-static unsigned long
-copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
-{
-	unsigned long offset, addr = (unsigned long)from;
-	int type = in_nmi() ? KM_NMI : KM_IRQ0;
-	unsigned long size, len = 0;
-	struct page *page;
-	void *map;
-	int ret;
-
-	do {
-		ret = __get_user_pages_fast(addr, 1, 0, &page);
-		if (!ret)
-			break;
-
-		offset = addr & (PAGE_SIZE - 1);
-		size = min(PAGE_SIZE - offset, n - len);
-
-		map = kmap_atomic(page, type);
-		memcpy(to, map+offset, size);
-		kunmap_atomic(map, type);
-		put_page(page);
-
-		len  += size;
-		to   += size;
-		addr += size;
-
-	} while (len < n);
-
-	return len;
-}
-
 static int copy_stack_frame(const void __user *fp, struct stack_frame *frame)
 {
 	unsigned long bytes;
Index: linux-2.6/arch/x86/kernel/cpu/perf_event_intel.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event_intel.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event_intel.c
@@ -547,7 +547,7 @@ static void intel_pmu_disable_event(stru
 	x86_pmu_disable_event(event);
 
 	if (unlikely(event->attr.precise))
-		intel_pmu_pebs_disable(hwc);
+		intel_pmu_pebs_disable(event);
 
 	if (event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK)
 		intel_pmu_lbr_disable(event);
@@ -603,7 +603,7 @@ static void intel_pmu_enable_event(struc
 	}
 
 	if (unlikely(event->attr.precise))
-		intel_pmu_pebs_enable(hwc);
+		intel_pmu_pebs_enable(event);
 
 	if (event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK)
 		intel_pmu_lbr_enable(event);
Index: linux-2.6/arch/x86/kernel/cpu/perf_event_intel_ds.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -331,26 +331,32 @@ intel_pebs_constraints(struct perf_event
 	return &emptyconstraint;
 }
 
-static void intel_pmu_pebs_enable(struct hw_perf_event *hwc)
+static void intel_pmu_pebs_enable(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	struct hw_perf_event *hwc = &event->hw;
 	u64 val = cpuc->pebs_enabled;
 
 	hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
 
 	val |= 1ULL << hwc->idx;
 	wrmsrl(MSR_IA32_PEBS_ENABLE, val);
+
+	intel_pmu_lbr_enable(event);
 }
 
-static void intel_pmu_pebs_disable(struct hw_perf_event *hwc)
+static void intel_pmu_pebs_disable(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	struct hw_perf_event *hwc = &event->hw;
 	u64 val = cpuc->pebs_enabled;
 
 	val &= ~(1ULL << hwc->idx);
 	wrmsrl(MSR_IA32_PEBS_ENABLE, val);
 
 	hwc->config |= ARCH_PERFMON_EVENTSEL_INT;
+
+	intel_pmu_lbr_disable(event);
 }
 
 static void intel_pmu_pebs_enable_all(void)
@@ -415,6 +421,74 @@ do {						\
 
 #endif
 
+#include <asm/insn.h>
+
+#define MAX_INSN_SIZE	16
+
+static void intel_pmu_pebs_fixup_ip(struct pt_regs *regs)
+{
+#if 0
+	/*
+	 * Borken, makes the machine explode at times trying to
+	 * dereference funny userspace addresses.
+	 *
+	 * Should we always fwd decode from @to, instead of trying
+	 * to rewind as implemented?
+	 */
+
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	unsigned long from = cpuc->lbr_entries[0].from;
+	unsigned long to = cpuc->lbr_entries[0].to;
+	unsigned long ip = regs->ip;
+	u8 buf[2*MAX_INSN_SIZE];
+	u8 *kaddr;
+	int i;
+
+	if (from && to) {
+		/*
+		 * We sampled a branch insn, rewind using the LBR stack
+		 */
+		if (ip == to) {
+			regs->ip = from;
+			return;
+		}
+	}
+
+	if (user_mode(regs)) {
+		int bytes = copy_from_user_nmi(buf,
+				(void __user *)(ip - MAX_INSN_SIZE),
+				2*MAX_INSN_SIZE);
+
+		/*
+		 * If we fail to copy the insn stream, give up
+		 */
+		if (bytes != 2*MAX_INSN_SIZE)
+			return;
+
+		kaddr = buf;
+	} else
+		kaddr = (void *)(ip - MAX_INSN_SIZE);
+
+	/*
+	 * Try to find the longest insn ending up at the given IP
+	 */
+	for (i = MAX_INSN_SIZE; i > 0; i--) {
+		struct insn insn;
+
+		kernel_insn_init(&insn, kaddr + MAX_INSN_SIZE - i);
+		insn_get_length(&insn);
+		if (insn.length == i) {
+			regs->ip -= i;
+			return;
+		}
+	}
+
+	/*
+	 * We failed to find a match for the previous insn.. give up
+	 */
+#endif
+}
+
 static int intel_pmu_save_and_restart(struct perf_event *event);
 static void intel_pmu_disable_event(struct perf_event *event);
 
@@ -458,6 +532,8 @@ static void intel_pmu_drain_pebs_core(st
 
 	PEBS_TO_REGS(at, &regs);
 
+	intel_pmu_pebs_fixup_ip(&regs);
+
 	if (perf_event_overflow(event, 1, data, &regs))
 		intel_pmu_disable_event(event);
 
@@ -519,6 +595,7 @@ static void intel_pmu_drain_pebs_nhm(str
 		data->period = event->hw.last_period;
 
 		PEBS_TO_REGS(at, &regs);
+		intel_pmu_pebs_fixup_ip(&regs);
 
 		if (perf_event_overflow(event, 1, data, &regs))
 			intel_pmu_disable_event(event);

-- 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [RFC][PATCH 11/11] perf, x86: Clean up IA32_PERF_CAPABILITIES usage
  2010-03-03 16:39 [RFC][PATCH 00/11] Another stab at PEBS and LBR support Peter Zijlstra
                   ` (9 preceding siblings ...)
  2010-03-03 16:39 ` [RFC][PATCH 10/11] perf, x86: use LBR for PEBS IP+1 fixup Peter Zijlstra
@ 2010-03-03 16:39 ` Peter Zijlstra
  10 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2010-03-03 16:39 UTC (permalink / raw)
  To: mingo, linux-kernel
  Cc: paulus, eranian, robert.richter, fweisbec, Peter Zijlstra

[-- Attachment #1: perf-capabilities.patch --]
[-- Type: text/plain, Size: 6302 bytes --]

Saner PERF_CAPABILITIES support

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/kernel/cpu/perf_event.c           |   15 +++++++++++++--
 arch/x86/kernel/cpu/perf_event_intel.c     |   10 ++++++++++
 arch/x86/kernel/cpu/perf_event_intel_ds.c  |   26 +++++++++++++-------------
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |   18 ++++--------------
 4 files changed, 40 insertions(+), 29 deletions(-)

Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
@@ -154,6 +154,17 @@ struct cpu_hw_events {
 #define for_each_event_constraint(e, c)	\
 	for ((e) = (c); (e)->cmask; (e)++)
 
+union perf_capabilities {
+	struct {
+		u64	lbr_format    : 6;
+		u64	pebs_trap     : 1;
+		u64	pebs_arch_reg : 1;
+		u64	pebs_format   : 4;
+		u64	smm_freeze    : 1;
+	};
+	u64	capabilities;
+};
+
 /*
  * struct x86_pmu - generic x86 pmu
  */
@@ -190,7 +201,8 @@ struct x86_pmu {
 	/*
 	 * Intel Arch Perfmon v2+
 	 */
-	u64		intel_ctrl;
+	u64			intel_ctrl;
+	union perf_capabilities intel_perf_capabilities;
 
 	/*
 	 * Intel DebugStore bits
@@ -205,7 +217,6 @@ struct x86_pmu {
 	 */
 	unsigned long	lbr_tos, lbr_from, lbr_to; /* MSR base regs       */
 	int		lbr_nr;			   /* hardware stack size */
-	int		lbr_format;		   /* hardware format     */
 };
 
 static struct x86_pmu x86_pmu __read_mostly;
Index: linux-2.6/arch/x86/kernel/cpu/perf_event_intel.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event_intel.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event_intel.c
@@ -840,6 +840,16 @@ static __init int intel_pmu_init(void)
 	if (version > 1)
 		x86_pmu.num_events_fixed = max((int)edx.split.num_events_fixed, 3);
 
+	/*
+	 * v2 and above have a perf capabilities MSR
+	 */
+	if (version > 1) {
+		u64 capabilities;
+
+		rdmsrl(MSR_IA32_PERF_CAPABILITIES, capabilities);
+		x86_pmu.intel_perf_capabilities.capabilities = capabilities;
+	}
+
 	intel_ds_init();
 
 	/*
Index: linux-2.6/arch/x86/kernel/cpu/perf_event_intel_ds.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -444,6 +444,12 @@ static void intel_pmu_pebs_fixup_ip(stru
 	u8 *kaddr;
 	int i;
 
+	/*
+	 * We don't need to fixup if the PEBS assist is fault like
+	 */
+	if (!x86_pmu.intel_perf_capabilities.pebs_trap)
+		return;
+
 	if (from && to) {
 		/*
 		 * We sampled a branch insn, rewind using the LBR stack
@@ -619,34 +625,28 @@ static void intel_ds_init(void)
 	x86_pmu.bts  = boot_cpu_has(X86_FEATURE_BTS);
 	x86_pmu.pebs = boot_cpu_has(X86_FEATURE_PEBS);
 	if (x86_pmu.pebs) {
-		int format = 0;
-
-		if (x86_pmu.version > 1) {
-			u64 capabilities;
-			/*
-			 * v2+ has a PEBS format field
-			 */
-			rdmsrl(MSR_IA32_PERF_CAPABILITIES, capabilities);
-			format = (capabilities >> 8) & 0xf;
-		}
+		int format = x86_pmu.intel_perf_capabilities.pebs_format;
+		char pebs_type =
+			x86_pmu.intel_perf_capabilities.pebs_trap ?  '+' : '-';
 
 		switch (format) {
 		case 0:
-			printk(KERN_CONT "PEBS v0, ");
+			printk(KERN_CONT "PEBS fmt0%c, ", pebs_type);
 			x86_pmu.pebs_record_size = sizeof(struct pebs_record_core);
 			x86_pmu.drain_pebs = intel_pmu_drain_pebs_core;
 			x86_pmu.pebs_constraints = intel_core_pebs_events;
 			break;
 
 		case 1:
-			printk(KERN_CONT "PEBS v1, ");
+			printk(KERN_CONT "PEBS fmt1%c, ", pebs_type);
 			x86_pmu.pebs_record_size = sizeof(struct pebs_record_nhm);
 			x86_pmu.drain_pebs = intel_pmu_drain_pebs_nhm;
 			x86_pmu.pebs_constraints = intel_nehalem_pebs_events;
 			break;
 
 		default:
-			printk(KERN_CONT "PEBS unknown format: %d, ", format);
+			printk(KERN_CONT "no PEBS fmt%d%c, ",
+					format, pebs_type);
 			x86_pmu.pebs = 0;
 			break;
 		}
Index: linux-2.6/arch/x86/kernel/cpu/perf_event_intel_lbr.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -53,7 +53,7 @@ static void intel_pmu_lbr_reset_64(void)
 
 static void intel_pmu_lbr_reset(void)
 {
-	if (x86_pmu.lbr_format == LBR_FORMAT_32)
+	if (x86_pmu.intel_perf_capabilities.lbr_format == LBR_FORMAT_32)
 		intel_pmu_lbr_reset_32();
 	else
 		intel_pmu_lbr_reset_64();
@@ -155,6 +155,7 @@ static void intel_pmu_lbr_read_32(struct
 static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
 {
 	unsigned long mask = x86_pmu.lbr_nr - 1;
+	int lbr_format = x86_pmu.intel_perf_capabilities.lbr_format;
 	u64 tos = intel_pmu_lbr_tos();
 	int i;
 
@@ -165,7 +166,7 @@ static void intel_pmu_lbr_read_64(struct
 		rdmsrl(x86_pmu.lbr_from + lbr_idx, from);
 		rdmsrl(x86_pmu.lbr_to   + lbr_idx, to);
 
-		if (x86_pmu.lbr_format == LBR_FORMAT_EIP_FLAGS) {
+		if (lbr_format == LBR_FORMAT_EIP_FLAGS) {
 			flags = !!(from & LBR_FROM_FLAG_MISPRED);
 			from = (u64)((((s64)from) << 1) >> 1);
 		}
@@ -184,7 +185,7 @@ static void intel_pmu_lbr_read(struct pe
 	if (!cpuc->lbr_users)
 		return;
 
-	if (x86_pmu.lbr_format == LBR_FORMAT_32)
+	if (x86_pmu.intel_perf_capabilities.lbr_format == LBR_FORMAT_32)
 		intel_pmu_lbr_read_32(cpuc);
 	else
 		intel_pmu_lbr_read_64(cpuc);
@@ -192,17 +193,8 @@ static void intel_pmu_lbr_read(struct pe
 	data->branches = &cpuc->lbr_stack;
 }
 
-static int intel_pmu_lbr_format(void)
-{
-	u64 capabilities;
-
-	rdmsrl(MSR_IA32_PERF_CAPABILITIES, capabilities);
-	return capabilities & 0x1f;
-}
-
 static void intel_pmu_lbr_init_core(void)
 {
-	x86_pmu.lbr_format = intel_pmu_lbr_format();
 	x86_pmu.lbr_nr     = 4;
 	x86_pmu.lbr_tos    = 0x01c9;
 	x86_pmu.lbr_from   = 0x40;
@@ -211,7 +203,6 @@ static void intel_pmu_lbr_init_core(void
 
 static void intel_pmu_lbr_init_nhm(void)
 {
-	x86_pmu.lbr_format = intel_pmu_lbr_format();
 	x86_pmu.lbr_nr     = 16;
 	x86_pmu.lbr_tos    = 0x01c9;
 	x86_pmu.lbr_from   = 0x680;
@@ -220,7 +211,6 @@ static void intel_pmu_lbr_init_nhm(void)
 
 static void intel_pmu_lbr_init_atom(void)
 {
-	x86_pmu.lbr_format = intel_pmu_lbr_format();
 	x86_pmu.lbr_nr	   = 8;
 	x86_pmu.lbr_tos    = 0x01c9;
 	x86_pmu.lbr_from   = 0x40;

-- 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 05/11] perf: Generic perf_sample_data initialization
  2010-03-03 16:39 ` [RFC][PATCH 05/11] perf: Generic perf_sample_data initialization Peter Zijlstra
@ 2010-03-03 16:49   ` David Miller
  2010-03-03 21:14   ` Frederic Weisbecker
  2010-03-05  8:44   ` Jean Pihet
  2 siblings, 0 replies; 44+ messages in thread
From: David Miller @ 2010-03-03 16:49 UTC (permalink / raw)
  To: a.p.zijlstra
  Cc: mingo, linux-kernel, paulus, eranian, robert.richter, fweisbec,
	jamie.iles, jpihet, stable

From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Wed, 03 Mar 2010 17:39:41 +0100

> This makes it easier to extend perf_sample_data and fixes a bug on
> arm and sparc, which failed to set ->raw to NULL, which can cause
> crashes when combined with PERF_SAMPLE_RAW.
> 
> It also optimizes PowerPC and tracepoint, because the struct
> initialization is forced to zero out the whole structure.
> 
> CC: Jamie Iles <jamie.iles@picochip.com>
> CC: Jean Pihet <jpihet@mvista.com>
> CC: Paul Mackerras <paulus@samba.org>
> CC: Ingo Molnar <mingo@elte.hu>
> CC: David S. Miller <davem@davemloft.net>
> CC: Stephane Eranian <eranian@google.com>
> CC: Frederic Weisbecker <fweisbec@gmail.com>
> CC: stable@kernel.org
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 07/11] perf: Provide PERF_SAMPLE_REGS
  2010-03-03 16:39 ` [RFC][PATCH 07/11] perf: Provide PERF_SAMPLE_REGS Peter Zijlstra
@ 2010-03-03 17:30   ` Stephane Eranian
  2010-03-03 17:39     ` Peter Zijlstra
  2010-03-03 22:02   ` Frederic Weisbecker
  1 sibling, 1 reply; 44+ messages in thread
From: Stephane Eranian @ 2010-03-03 17:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, paulus, robert.richter, fweisbec, David S. Miller

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 4218 bytes --]

This assumes struct pt_regs is somehow exported to userland.
Is that the case?

I would clearly spell out that the REGS are the interrupted REGS,
not the overflow REGS. Maybe PERF_SAMPLE_IREGS.

On Wed, Mar 3, 2010 at 8:39 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> Simply copy out the provided pt_regs in a u64 aligned fashion.
>
> XXX: do task_pt_regs() and get_irq_regs() always clear everything or
>      are we now leaking data?
>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
[...]


-- 
Stephane Eranian  | EMEA Software Engineering
Google France | 38 avenue de l'Opéra | 75002 Paris
Tel : +33 (0) 1 42 68 53 00
This email may be confidential or privileged. If you received this
communication by mistake, please
don't forward it to anyone else, please erase all copies and
attachments, and please let me know that
it went to the wrong person. Thanks

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 06/11] perf, x86: PEBS infrastructure
  2010-03-03 16:39 ` [RFC][PATCH 06/11] perf, x86: PEBS infrastructure Peter Zijlstra
@ 2010-03-03 17:38   ` Robert Richter
  2010-03-03 17:42     ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Robert Richter @ 2010-03-03 17:38 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, paulus, eranian, fweisbec

On 03.03.10 17:39:42, Peter Zijlstra wrote:
> Implement a simple PEBS model that always takes a single PEBS event at
> a time. This is done so that the interaction with the rest of the
> system is as expected (freq adjust, period randomization, lbr).
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---

[...]

> +static int validate_event(struct perf_event *event)
> +{
> +	struct cpu_hw_events *fake_cpuc;
> +	struct event_constraint *c;
> +	int ret = 0;
> +
> +	fake_cpuc = kmalloc(sizeof(*fake_cpuc), GFP_KERNEL | __GFP_ZERO);
> +	if (!fake_cpuc)
> +		return -ENOMEM;
> +
> +	c = x86_pmu.get_event_constraints(fake_cpuc, event);
> +
> +	if (!c || !c->weight)
> +		ret = -ENOSPC;
> +
> +	if (x86_pmu.put_event_constraints)
> +		x86_pmu.put_event_constraints(fake_cpuc, event);

A fake cpu with the struct filled with zeros will cause a null pointer
exception in amd_get_event_constraints():

	struct amd_nb *nb = cpuc->amd_nb;

Shouldn't x86_schedule_events() be sufficient to decide if a single
counter is available? I did not yet look at group events; this might
happen there too.

-Robert

> +
> +	kfree(fake_cpuc);
> +
> +	return ret;
> +}
> +
> +/*
>   * validate a single event group
>   *
>   * validation include:
> @@ -1495,6 +1426,8 @@ const struct pmu *hw_perf_event_init(str
>  
>  		if (event->group_leader != event)
>  			err = validate_group(event);
> +		else
> +			err = validate_event(event);
>  
>  		event->pmu = tmp;
>  	}

-- 
Advanced Micro Devices, Inc.
Operating System Research Center
email: robert.richter@amd.com


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 07/11] perf: Provide PERF_SAMPLE_REGS
  2010-03-03 17:30   ` Stephane Eranian
@ 2010-03-03 17:39     ` Peter Zijlstra
  2010-03-03 17:49       ` Stephane Eranian
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2010-03-03 17:39 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: mingo, linux-kernel, paulus, robert.richter, fweisbec, David S. Miller

On Wed, 2010-03-03 at 09:30 -0800, Stephane Eranian wrote:
> This assumes struct pt_regs is somehow exported to userland.
> Is that the case?

I seem to have understood they were, and asm/ptrace.h seems to agree
with that, it has !__KERNEL__ definitions for struct pt_regs.

> I would clearly spell out that the REGS are the interrupted REGS,
> not the overflow REGS. Maybe PERF_SAMPLE_IREGS. 

They can be both, for PEBS they are the overflow trap (until PEBS does
fault) regs.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 06/11] perf, x86: PEBS infrastructure
  2010-03-03 17:38   ` Robert Richter
@ 2010-03-03 17:42     ` Peter Zijlstra
  2010-03-04  8:50       ` Robert Richter
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2010-03-03 17:42 UTC (permalink / raw)
  To: Robert Richter; +Cc: mingo, linux-kernel, paulus, eranian, fweisbec

On Wed, 2010-03-03 at 18:38 +0100, Robert Richter wrote:
> > +     fake_cpuc = kmalloc(sizeof(*fake_cpuc), GFP_KERNEL | __GFP_ZERO);
> > +     if (!fake_cpuc)
> > +             return -ENOMEM;
> > +
> > +     c = x86_pmu.get_event_constraints(fake_cpuc, event);
> > +
> > +     if (!c || !c->weight)
> > +             ret = -ENOSPC;
> > +
> > +     if (x86_pmu.put_event_constraints)
> > +             x86_pmu.put_event_constraints(fake_cpuc, event);
> 
> A fake cpu with the struct filled with zeros will cause a null pointer
> exception in amd_get_event_constraints():
> 
>         struct amd_nb *nb = cpuc->amd_nb;

That should result in nb == NULL, right? That case is checked slightly
further down in the function.

> Shouldn't x86_schedule_events() sufficient to decide if a single
> counter is available? I did not yet look at group events, this might
> happen there too.

Sure, but we will only attempt scheduling them at enable time; this is a
creation-time check, and failing to create an unschedulable event seems
prudent.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 07/11] perf: Provide PERF_SAMPLE_REGS
  2010-03-03 17:39     ` Peter Zijlstra
@ 2010-03-03 17:49       ` Stephane Eranian
  2010-03-03 17:55         ` David Miller
  0 siblings, 1 reply; 44+ messages in thread
From: Stephane Eranian @ 2010-03-03 17:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, paulus, robert.richter, fweisbec, David S. Miller

On Wed, Mar 3, 2010 at 9:39 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2010-03-03 at 09:30 -0800, Stephane Eranian wrote:
>> This assumes struct pt_regs is somehow exported to userland.
>> Is that the case?
>
> I seem to have understood they were, and asm/ptrace.h seems to agree
> with that, it has !__KERNEL__ definitions for struct pt_regs.
>
Seems to be the case, indeed.

>> I would clearly spell out that the REGS are the interrupted REGS,
>> not the overflow REGS. Maybe PERF_SAMPLE_IREGS.
>
> They can be both, for PEBS they are the overflow trap (until PEBS does
> fault) regs.

You're saying: without PEBS = interrupted state, with PEBS = overflow state.
That precludes requesting both the interrupted and the overflow state when PEBS
is enabled. It may be interesting to look at the differences, i.e. the distances (in the IP).

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 07/11] perf: Provide PERF_SAMPLE_REGS
  2010-03-03 17:49       ` Stephane Eranian
@ 2010-03-03 17:55         ` David Miller
  2010-03-03 18:18           ` Stephane Eranian
                             ` (2 more replies)
  0 siblings, 3 replies; 44+ messages in thread
From: David Miller @ 2010-03-03 17:55 UTC (permalink / raw)
  To: eranian; +Cc: peterz, mingo, linux-kernel, paulus, robert.richter, fweisbec

From: Stephane Eranian <eranian@google.com>
Date: Wed, 3 Mar 2010 09:49:33 -0800

> On Wed, Mar 3, 2010 at 9:39 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Wed, 2010-03-03 at 09:30 -0800, Stephane Eranian wrote:
>>> This assumes struct pt_regs is somehow exported to userland.
>>> Is that the case?
>>
>> I seem to have understood they were, and asm/ptrace.h seems to agree
>> with that, it has !__KERNEL__ definitions for struct pt_regs.
>>
> Seems to be the case, indeed.

BTW, how are you going to cope with compat systems?

If I build 'perf' on a sparc64 kernel build, it's going to get the
64-bit pt_regs.  So I can't then use that binary on a sparc box
running a 32-bit kernel.

And vice versa.

And more generally, aren't we supposed to be able to eventually analyze
perf dumps on any platform, not just the one 'perf' was built under?

We'll need to do something about the encoding of pt_regs, therefore.
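
One possible direction (a rough sketch only -- the attr field, the enum
and the helper below are made up for illustration, this is not part of
the posted series): let userspace select registers with a bitmask and
dump them as a flat u64 array, so the wire format no longer depends on
the host kernel's struct pt_regs:

/*
 * Sketch, not part of this series: ABI-stable register dump.
 * Userspace sets a (hypothetical) attr.sample_regs bitmask; the kernel
 * emits the selected registers as u64 values in bit order.
 */
enum perf_sample_reg_x86 {
	PERF_REG_X86_IP,
	PERF_REG_X86_SP,
	PERF_REG_X86_BP,
	PERF_REG_X86_AX,
	/* ... one value per architectural register ... */
	PERF_REG_X86_MAX,
};

/* in perf_output_sample(): */
if (sample_type & PERF_SAMPLE_REGS) {
	int bit;

	for_each_bit(bit, (unsigned long *)&event->attr.sample_regs,
		     PERF_REG_X86_MAX) {
		/* perf_reg_value() would be a per-arch helper (hypothetical) */
		u64 val = perf_reg_value(data->regs, bit);

		perf_output_put(handle, val);
	}
}

That keeps the record layout identical across 32-bit and 64-bit kernels,
at the cost of a per-arch register enum.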

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 10/11] perf, x86: use LBR for PEBS IP+1 fixup
  2010-03-03 16:39 ` [RFC][PATCH 10/11] perf, x86: use LBR for PEBS IP+1 fixup Peter Zijlstra
@ 2010-03-03 18:05   ` Masami Hiramatsu
  2010-03-03 19:37     ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Masami Hiramatsu @ 2010-03-03 18:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, paulus, eranian, robert.richter, fweisbec

Peter Zijlstra wrote:
> PEBS always reports the IP+1, that is the instruction after the one
> that got sampled, cure this by using the LBR to reliably rewind the
> instruction stream.

Hmm, does PEBS always report one byte after the end address of the
sampled instruction? Or the address of the instruction that will be
executed next?

[...]
> +#include <asm/insn.h>
> +
> +#define MAX_INSN_SIZE	16

Hmm, we'd better integrate these kinds of definitions into
asm/insn.h... (several features define it)

> +
> +static void intel_pmu_pebs_fixup_ip(struct pt_regs *regs)
> +{
> +#if 0
> +	/*
> +	 * Borken, makes the machine explode at times trying to
> +	 * dereference funny userspace addresses.
> +	 *
> +	 * Should we always fwd decode from @to, instead of trying
> +	 * to rewind as implemented?
> +	 */
> +
> +	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> +	unsigned long from = cpuc->lbr_entries[0].from;
> +	unsigned long to = cpuc->lbr_entries[0].to;

Ah, I see. For the branch instruction case, we can use the LBR to
find the previous IP...

> +	unsigned long ip = regs->ip;
> +	u8 buf[2*MAX_INSN_SIZE];
> +	u8 *kaddr;
> +	int i;
> +
> +	if (from && to) {
> +		/*
> +		 * We sampled a branch insn, rewind using the LBR stack
> +		 */
> +		if (ip == to) {
> +			regs->ip = from;
> +			return;
> +		}
> +	}
> +
> +	if (user_mode(regs)) {
> +		int bytes = copy_from_user_nmi(buf,
> +				(void __user *)(ip - MAX_INSN_SIZE),
> +				2*MAX_INSN_SIZE);
> +

Maybe you'd better check that the source address range is within
the user address range, e.g. the case where ip < MAX_INSN_SIZE.

> +		/*
> +		 * If we fail to copy the insn stream, give up
> +		 */
> +		if (bytes != 2*MAX_INSN_SIZE)
> +			return;
> +
> +		kaddr = buf;
> +	} else
> +		kaddr = (void *)(ip - MAX_INSN_SIZE);

It also needs to be checked that this address is within kernel text.

> +
> +	/*
> +	 * Try to find the longest insn ending up at the given IP
> +	 */
> +	for (i = MAX_INSN_SIZE; i > 0; i--) {
> +		struct insn insn;
> +
> +		kernel_insn_init(&insn, kaddr + MAX_INSN_SIZE - i);
> +		insn_get_length(&insn);
> +		if (insn.length == i) {
> +			regs->ip -= i;
> +			return;
> +		}
> +	}

Hmm, this will not work correctly on x86, since the decoder can
mis-decode the tail bytes of the previous instruction as prefix bytes. :(

Thus, if you want to rewind the instruction stream, you need to decode
a function (or basic block) entirely.
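
Something along these lines (an illustrative sketch only; kernel-address
case, no user-space copy, and the helper name is made up) is what I mean
-- decode forward from a known-good boundary, e.g. the LBR branch target,
until the decoder reaches the reported IP:

static unsigned long rewind_ip(unsigned long start, unsigned long ip)
{
	unsigned long old = start, cur = start;

	/* 'start' must be a real instruction boundary, e.g. the LBR 'to' */
	while (cur < ip) {
		struct insn insn;

		old = cur;
		kernel_insn_init(&insn, (void *)cur);
		insn_get_length(&insn);
		if (!insn.length)
			break;		/* undecodable, give up */
		cur += insn.length;
	}

	/* if we landed exactly on ip, 'old' is the sampled instruction */
	return (cur == ip) ? old : ip;
}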

Thank you,

> +
> +	/*
> +	 * We failed to find a match for the previous insn.. give up
> +	 */
> +#endif
> +}
> +
>  static int intel_pmu_save_and_restart(struct perf_event *event);
>  static void intel_pmu_disable_event(struct perf_event *event);
>  
> @@ -458,6 +532,8 @@ static void intel_pmu_drain_pebs_core(st
>  
>  	PEBS_TO_REGS(at, &regs);
>  
> +	intel_pmu_pebs_fixup_ip(&regs);
> +
>  	if (perf_event_overflow(event, 1, data, &regs))
>  		intel_pmu_disable_event(event);
>  
> @@ -519,6 +595,7 @@ static void intel_pmu_drain_pebs_nhm(str
>  		data->period = event->hw.last_period;
>  
>  		PEBS_TO_REGS(at, &regs);
> +		intel_pmu_pebs_fixup_ip(&regs);
>  
>  		if (perf_event_overflow(event, 1, data, &regs))
>  			intel_pmu_disable_event(event);
> 
> -- 

-- 
Masami Hiramatsu
e-mail: mhiramat@redhat.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 07/11] perf: Provide PERF_SAMPLE_REGS
  2010-03-03 17:55         ` David Miller
@ 2010-03-03 18:18           ` Stephane Eranian
  2010-03-03 19:18           ` Peter Zijlstra
  2010-03-04  2:59           ` Ingo Molnar
  2 siblings, 0 replies; 44+ messages in thread
From: Stephane Eranian @ 2010-03-03 18:18 UTC (permalink / raw)
  To: David Miller
  Cc: peterz, mingo, linux-kernel, paulus, robert.richter, fweisbec

On Wed, Mar 3, 2010 at 9:55 AM, David Miller <davem@davemloft.net> wrote:
> From: Stephane Eranian <eranian@google.com>
> Date: Wed, 3 Mar 2010 09:49:33 -0800
>
>> On Wed, Mar 3, 2010 at 9:39 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> On Wed, 2010-03-03 at 09:30 -0800, Stephane Eranian wrote:
>>>> This assumes struct pt_regs is somehow exported to userland.
>>>> Is that the case?
>>>
>>> I seem to have understood they were, and asm/ptrace.h seems to agree
>>> with that, it has !__KERNEL__ definitions for struct pt_regs.
>>>
>> Seems to be the case, indeed.
>
> BTW, how are you going to cope with compat systems?
>
> If I build 'perf' on a sparc64 kernel build, it's going to get the
> 64-bit pt_regs.  So I can't then use that binary on a sparc box
> running a 32-bit kernel.
>
> And vice versa.
>
That was going to be my next question. The pt_regs you return
depends on the binary you are monitoring (32-bit vs. 64-bit) if the interrupt
occurred in userland. But what if it happens in kernel mode?


> And more generally aren't we supposed to be able to eventually analyze
> perf dumps on any platform not just the one 'perf' was built under?
>
> We'll need to do something about the encoding of pt_regs, therefore.
>



-- 
Stephane Eranian  | EMEA Software Engineering
Google France | 38 avenue de l'Opéra | 75002 Paris
Tel : +33 (0) 1 42 68 53 00
This email may be confidential or privileged. If you received this
communication by mistake, please
don't forward it to anyone else, please erase all copies and
attachments, and please let me know that
it went to the wrong person. Thanks

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 07/11] perf: Provide PERF_SAMPLE_REGS
  2010-03-03 17:55         ` David Miller
  2010-03-03 18:18           ` Stephane Eranian
@ 2010-03-03 19:18           ` Peter Zijlstra
  2010-03-04  2:59           ` Ingo Molnar
  2 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2010-03-03 19:18 UTC (permalink / raw)
  To: David Miller
  Cc: eranian, mingo, linux-kernel, paulus, robert.richter, fweisbec

On Wed, 2010-03-03 at 09:55 -0800, David Miller wrote:
> From: Stephane Eranian <eranian@google.com>
> Date: Wed, 3 Mar 2010 09:49:33 -0800
> 
> > On Wed, Mar 3, 2010 at 9:39 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >> On Wed, 2010-03-03 at 09:30 -0800, Stephane Eranian wrote:
> >>> This assumes struct pt_regs is somehow exported to userland.
> >>> Is that the case?
> >>
> >> I seem to have understood they were, and asm/ptrace.h seems to agree
> >> with that, it has !__KERNEL__ definitions for struct pt_regs.
> >>
> > Seems to be the case, indeed.
> 
> BTW, how are you going to cope with compat systems?
> 
> If I build 'perf' on a sparc64 kernel build, it's going to get the
> 64-bit pt_regs.  So I can't then use that binary on a sparc box
> running a 32-bit kernel.
> 
> And vice versa.
> 
> And more generally aren't we supposed to be able to eventually analyze
> perf dumps on any platform not just the one 'perf' was built under?
> 
> We'll need to do something about the encoding of pt_regs, therefore.

Hrm, yes... what I can do for the moment is 'cheat' and make the raw
PEBS record available through PERF_SAMPLE_RAW (that also has CAP_ADMIN,
which I guess is a good idea for full reg sets), and then we can work
out how to expose pt_regs later.

If someone has a better suggestion than this, which is basically to blurp
out the host-native pt_regs and cope, please tell ;-)
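
Roughly, in the drain path, something like this (sketch only; assumes
"at" points at the PEBS record currently being drained and "data" is the
perf_sample_data for this sample):

	struct perf_raw_record raw = {
		.size	= x86_pmu.pebs_record_size,
		.data	= at,
	};

	data->raw = &raw;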


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 10/11] perf, x86: use LBR for PEBS IP+1 fixup
  2010-03-03 18:05   ` Masami Hiramatsu
@ 2010-03-03 19:37     ` Peter Zijlstra
  2010-03-03 21:11       ` Masami Hiramatsu
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2010-03-03 19:37 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: mingo, linux-kernel, paulus, eranian, robert.richter, fweisbec

On Wed, 2010-03-03 at 13:05 -0500, Masami Hiramatsu wrote:
> Peter Zijlstra wrote:
> > PEBS always reports the IP+1, that is the instruction after the one
> > that got sampled, cure this by using the LBR to reliably rewind the
> > instruction stream.
> 
> Hmm, does PEBS always report one byte after the end address of the
> sampled instruction? Or the instruction which will be executed next
> step?

The next instruction, it's trap-like.

> [...]
> > +#include <asm/insn.h>
> > +
> > +#define MAX_INSN_SIZE	16
> 
> Hmm, we'd better integrate these kinds of definitions into
> asm/insn.h... (several features define it)

Agreed, I'll look at doing a patch to collect them all into asm/insn.h
if nobody beats me to it :-)

> > +
> > +static void intel_pmu_pebs_fixup_ip(struct pt_regs *regs)
> > +{
> > +#if 0
> > +	/*
> > +	 * Broken, makes the machine explode at times trying to
> > +	 * dereference funny userspace addresses.
> > +	 *
> > +	 * Should we always fwd decode from @to, instead of trying
> > +	 * to rewind as implemented?
> > +	 */
> > +
> > +	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> > +	unsigned long from = cpuc->lbr_entries[0].from;
> > +	unsigned long to = cpuc->lbr_entries[0].to;
> 
> Ah, I see. For branch instruction case, we can use LBR to
> find previous IP...

Right, we use the LBR to find the basic block.

> > +	unsigned long ip = regs->ip;
> > +	u8 buf[2*MAX_INSN_SIZE];
> > +	u8 *kaddr;
> > +	int i;
> > +
> > +	if (from && to) {
> > +		/*
> > +		 * We sampled a branch insn, rewind using the LBR stack
> > +		 */
> > +		if (ip == to) {
> > +			regs->ip = from;
> > +			return;
> > +		}
> > +	}
> > +
> > +	if (user_mode(regs)) {
> > +		int bytes = copy_from_user_nmi(buf,
> > +				(void __user *)(ip - MAX_INSN_SIZE),
> > +				2*MAX_INSN_SIZE);
> > +
> 
> maybe, you'd better check the source address range is within
> the user address range. e.g. ip < MAX_INSN_SIZE. 

Not only that, I realized user_mode() checks regs->cs, which is not set
by the PEBS code, so I added some helpers.

> > +
> > +	/*
> > +	 * Try to find the longest insn ending up at the given IP
> > +	 */
> > +	for (i = MAX_INSN_SIZE; i > 0; i--) {
> > +		struct insn insn;
> > +
> > +		kernel_insn_init(&insn, kaddr + MAX_INSN_SIZE - i);
> > +		insn_get_length(&insn);
> > +		if (insn.length == i) {
> > +			regs->ip -= i;
> > +			return;
> > +		}
> > +	}
> 
> Hmm, this will not work correctly on x86, since the decoder can
> miss-decode the tail bytes of previous instruction as prefix bytes. :(
> 
> Thus, if you want to rewind instruction stream, you need to decode
> a function (or basic block) entirely.

Something like the below?

#ifdef CONFIG_X86_32
static bool kernel_ip(unsigned long ip)
{
        return ip > TASK_SIZE;
}
#else
static bool kernel_ip(unsigned long ip)
{
        return (long)ip < 0;
}
#endif

static int intel_pmu_pebs_fixup_ip(unsigned long *ipp)
{
        struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
        unsigned long from = cpuc->lbr_entries[0].from;
        unsigned long old_to, to = cpuc->lbr_entries[0].to;
        unsigned long ip = *ipp;
        int i;

        /*
         * We don't need to fixup if the PEBS assist is fault like
         */
        if (!x86_pmu.intel_perf_capabilities.pebs_trap)
                return 0;

        if (!cpuc->lbr_stack.nr || !from || !to)
                return 0;

        if (ip < to)
                return 0;

        /*
         * We sampled a branch insn, rewind using the LBR stack
         */
        if (ip == to) {
                *ipp = from;
                return 1;
        }

        do {
                struct insn insn;
                u8 buf[MAX_INSN_SIZE];
                void *kaddr;

                old_to = to;
                if (!kernel_ip(ip)) {
                        int bytes = copy_from_user_nmi(buf, (void __user *)to,
                                        MAX_INSN_SIZE);

                        if (bytes != MAX_INSN_SIZE)
                                return 0;

                        kaddr = buf;
                } else kaddr = (void *)to;

                kernel_insn_init(&insn, kaddr);
                insn_get_length(&insn);
                to += insn.length;
        } while (to < ip);

        if (to == ip) {
                *ipp = old_to;
                return 1;
        }

        return 0;
}

I thought about exposing the success of this fixup as a PERF_RECORD_MISC
bit.
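
For illustration only -- neither the name nor the bit value below exists
in these patches:

#define PERF_RECORD_MISC_FIXED_IP	(1 << 14)	/* made-up name/value */

static void pebs_mark_fixed_ip(struct perf_event_header *header, int fixed_up)
{
	if (fixed_up)
		header->misc |= PERF_RECORD_MISC_FIXED_IP;
}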


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 09/11] perf, x86: Implement PERF_SAMPLE_BRANCH_STACK
  2010-03-03 16:39 ` [RFC][PATCH 09/11] perf, x86: Implement PERF_SAMPLE_BRANCH_STACK Peter Zijlstra
@ 2010-03-03 21:08   ` Frederic Weisbecker
  0 siblings, 0 replies; 44+ messages in thread
From: Frederic Weisbecker @ 2010-03-03 21:08 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, paulus, eranian, robert.richter

On Wed, Mar 03, 2010 at 05:39:45PM +0100, Peter Zijlstra wrote:
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  arch/x86/kernel/cpu/perf_event.c           |   14 +++-------
>  arch/x86/kernel/cpu/perf_event_intel.c     |   10 ++++++-
>  arch/x86/kernel/cpu/perf_event_intel_ds.c  |   16 ++++--------
>  arch/x86/kernel/cpu/perf_event_intel_lbr.c |   20 ++++++++-------
>  include/linux/perf_event.h                 |   27 +++++++++++++++++---
>  kernel/perf_event.c                        |   38 ++++++++++++++++++++++-------
>  6 files changed, 83 insertions(+), 42 deletions(-)
> 
> Index: linux-2.6/include/linux/perf_event.h
> ===================================================================
> --- linux-2.6.orig/include/linux/perf_event.h
> +++ linux-2.6/include/linux/perf_event.h
> @@ -126,8 +126,9 @@ enum perf_event_sample_format {
>  	PERF_SAMPLE_STREAM_ID			= 1U << 9,
>  	PERF_SAMPLE_RAW				= 1U << 10,
>  	PERF_SAMPLE_REGS			= 1U << 11,
> +	PERF_SAMPLE_BRANCH_STACK		= 1U << 12,
>  
> -	PERF_SAMPLE_MAX = 1U << 12,		/* non-ABI */
> +	PERF_SAMPLE_MAX = 1U << 13,		/* non-ABI */
>  };
>  
>  /*
> @@ -395,9 +396,14 @@ enum perf_event_type {
>  	 *	{ struct read_format	values;	  } && PERF_SAMPLE_READ
>  	 * 	{ struct pt_regs	regs;	  } && PERF_SAMPLE_REGS
>  	 *
> -	 *	{ u64			nr,
> +	 *	{ u64			nr;
>  	 *	  u64			ips[nr];  } && PERF_SAMPLE_CALLCHAIN
>  	 *
> +	 * 	{ u64			nr;
> +	 * 	  { u64 from, to, flags;
> +	 * 	  }			lbr[nr];  } && PERF_SAMPLE_BRANCH_STACK
> +	 *
> +	 *
>  	 *	#
>  	 *	# The RAW record below is opaque data wrt the ABI
>  	 *	#
> @@ -469,6 +475,17 @@ struct perf_raw_record {
>  	void				*data;
>  };
>  
> +struct perf_branch_entry {
> +	__u64				from;
> +	__u64				to;
> +	__u64				flags;
> +};
> +
> +struct perf_branch_stack {
> +	__u64				nr;
> +	struct perf_branch_entry	entries[0];
> +};
> +
>  struct task_struct;
>  
>  /**
> @@ -803,13 +820,15 @@ struct perf_sample_data {
>  	struct perf_callchain_entry	*callchain;
>  	struct perf_raw_record		*raw;
>  	struct pt_regs			*regs;
> +	struct perf_branch_stack	*branches;
>  };
>  
>  static inline
>  void perf_sample_data_init(struct perf_sample_data *data, u64 addr)
>  {
> -	data->addr = addr;
> -	data->raw  = NULL;
> +	data->addr     = addr;
> +	data->raw      = NULL;
> +	data->branches = NULL;
>  }
>  
>  extern void perf_output_sample(struct perf_output_handle *handle,
> Index: linux-2.6/kernel/perf_event.c
> ===================================================================
> --- linux-2.6.orig/kernel/perf_event.c
> +++ linux-2.6/kernel/perf_event.c
> @@ -3189,12 +3189,9 @@ void perf_output_sample(struct perf_outp
>  
>  	if (sample_type & PERF_SAMPLE_CALLCHAIN) {
>  		if (data->callchain) {
> -			int size = 1;
> +			int size = sizeof(u64);
>  
> -			if (data->callchain)
> -				size += data->callchain->nr;
> -
> -			size *= sizeof(u64);
> +			size += data->callchain->nr * sizeof(u64);
>  
>  			perf_output_copy(handle, data->callchain, size);
>  		} else {
> @@ -3203,6 +3200,20 @@ void perf_output_sample(struct perf_outp
>  		}
>  	}
>  
> +	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
> +		if (data->branches) {
> +			int size = sizeof(u64);
> +
> +			size += data->branches->nr *
> +				sizeof(struct perf_branch_entry);
> +
> +			perf_output_copy(handle, data->branches, size);
> +		} else {
> +			u64 nr = 0;
> +			perf_output_put(handle, nr);
> +		}
> +	}
> +
>  	if (sample_type & PERF_SAMPLE_RAW) {
>  		if (data->raw) {
>  			perf_output_put(handle, data->raw->size);
> @@ -3291,14 +3302,25 @@ void perf_prepare_sample(struct perf_eve
>  	}
>  
>  	if (sample_type & PERF_SAMPLE_CALLCHAIN) {
> -		int size = 1;
> +		int size = sizeof(u64);
>  
>  		data->callchain = perf_callchain(regs);
>  
>  		if (data->callchain)
> -			size += data->callchain->nr;
> +			size += data->callchain->nr * sizeof(u64);
> +
> +		header->size += size;
> +	}
>  
> -		header->size += size * sizeof(u64);
> +	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
> +		int size = sizeof(u64);
> +
> +		if (data->branches) {
> +			size += data->branches->nr *
> +				sizeof(struct perf_branch_entry);
> +		}
> +
> +		header->size += size;
>  	}



That looks good to me (at least the generic part, as I don't
know the x86 part well enough to tell).


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 10/11] perf, x86: use LBR for PEBS IP+1 fixup
  2010-03-03 19:37     ` Peter Zijlstra
@ 2010-03-03 21:11       ` Masami Hiramatsu
  2010-03-03 21:50         ` Stephane Eranian
  0 siblings, 1 reply; 44+ messages in thread
From: Masami Hiramatsu @ 2010-03-03 21:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, paulus, eranian, robert.richter, fweisbec

Peter Zijlstra wrote:
> On Wed, 2010-03-03 at 13:05 -0500, Masami Hiramatsu wrote:
>> Peter Zijlstra wrote:
>>> PEBS always reports the IP+1, that is the instruction after the one
>>> that got sampled, cure this by using the LBR to reliably rewind the
>>> instruction stream.
>>
>> Hmm, does PEBS always report one byte after the end address of the
>> sampled instruction? Or the instruction which will be executed next
>> step?
> 
> The next instruction, its trap like.
> 
>> [...]
>>> +#include <asm/insn.h>
>>> +
>>> +#define MAX_INSN_SIZE	16
>>
>> Hmm, we'd better integrate these kinds of definitions into
>> asm/insn.h... (several features define it)
> 
> Agreed, I'll look at doing a patch to collect them all into asm/insn.h
> if nobody beats me to it :-)

At least kprobes doesn't :)

>>> +
>>> +static void intel_pmu_pebs_fixup_ip(struct pt_regs *regs)
>>> +{
>>> +#if 0
>>> +	/*
>>> +	 * Broken, makes the machine explode at times trying to
>>> +	 * dereference funny userspace addresses.
>>> +	 *
>>> +	 * Should we always fwd decode from @to, instead of trying
>>> +	 * to rewind as implemented?
>>> +	 */
>>> +
>>> +	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
>>> +	unsigned long from = cpuc->lbr_entries[0].from;
>>> +	unsigned long to = cpuc->lbr_entries[0].to;
>>
>> Ah, I see. For branch instruction case, we can use LBR to
>> find previous IP...
> 
> Right, we use the LBR to find the basic block.

Hm, that's a good idea :)

>>> +	unsigned long ip = regs->ip;
>>> +	u8 buf[2*MAX_INSN_SIZE];
>>> +	u8 *kaddr;
>>> +	int i;
>>> +
>>> +	if (from && to) {
>>> +		/*
>>> +		 * We sampled a branch insn, rewind using the LBR stack
>>> +		 */
>>> +		if (ip == to) {
>>> +			regs->ip = from;
>>> +			return;
>>> +		}
>>> +	}
>>> +
>>> +	if (user_mode(regs)) {
>>> +		int bytes = copy_from_user_nmi(buf,
>>> +				(void __user *)(ip - MAX_INSN_SIZE),
>>> +				2*MAX_INSN_SIZE);
>>> +
>>
>> maybe, you'd better check the source address range is within
>> the user address range. e.g. ip < MAX_INSN_SIZE. 
> 
> Not only that, I realized user_mode() checks regs->cs, which is not set
> by the PEBS code, so I added some helpers.
> 
>>> +
>>> +	/*
>>> +	 * Try to find the longest insn ending up at the given IP
>>> +	 */
>>> +	for (i = MAX_INSN_SIZE; i > 0; i--) {
>>> +		struct insn insn;
>>> +
>>> +		kernel_insn_init(&insn, kaddr + MAX_INSN_SIZE - i);
>>> +		insn_get_length(&insn);
>>> +		if (insn.length == i) {
>>> +			regs->ip -= i;
>>> +			return;
>>> +		}
>>> +	}
>>
>> Hmm, this will not work correctly on x86, since the decoder can
>> miss-decode the tail bytes of previous instruction as prefix bytes. :(
>>
>> Thus, if you want to rewind instruction stream, you need to decode
>> a function (or basic block) entirely.
> 
> Something like the below?

Great! It looks good to me.
Yeah, LBR.to should always be smaller than the current ip (if no one disabled the LBR).

Thank you,

> 
> #ifdef CONFIG_X86_32
> static bool kernel_ip(unsigned long ip)
> {
>         return ip > TASK_SIZE;
> }
> #else
> static bool kernel_ip(unsigned long ip)
> {
>         return (long)ip < 0;
> }
> #endif
> 
> static int intel_pmu_pebs_fixup_ip(unsigned long *ipp)
> {
>         struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
>         unsigned long from = cpuc->lbr_entries[0].from;
>         unsigned long old_to, to = cpuc->lbr_entries[0].to;
>         unsigned long ip = *ipp;
>         int i;
> 
>         /*
>          * We don't need to fixup if the PEBS assist is fault like
>          */
>         if (!x86_pmu.intel_perf_capabilities.pebs_trap)
>                 return 0;
> 
>         if (!cpuc->lbr_stack.nr || !from || !to)
>                 return 0;
> 
>         if (ip < to)
>                 return 0;
> 
>         /*
>          * We sampled a branch insn, rewind using the LBR stack
>          */
>         if (ip == to) {
>                 *ipp = from;
>                 return 1;
>         }
> 
>         do {
>                 struct insn insn;
>                 u8 buf[MAX_INSN_SIZE];
>                 void *kaddr;
> 
>                 old_to = to;
>                 if (!kernel_ip(ip)) {
>                         int bytes = copy_from_user_nmi(buf, (void __user *)to,
>                                         MAX_INSN_SIZE);
> 
>                         if (bytes != MAX_INSN_SIZE)
>                                 return 0;
> 
>                         kaddr = buf;
>                 } else kaddr = (void *)to;
> 
>                 kernel_insn_init(&insn, kaddr);
>                 insn_get_length(&insn);
>                 to += insn.length;
>         } while (to < ip);
> 
>         if (to == ip) {
>                 *ipp = old_to;
>                 return 1;
>         }
> 
>         return 0;
> }
> 
> I thought about exposing the success of this fixup as a PERF_RECORD_MISC
> bit.
> 

-- 
Masami Hiramatsu
e-mail: mhiramat@redhat.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 05/11] perf: Generic perf_sample_data initialization
  2010-03-03 16:39 ` [RFC][PATCH 05/11] perf: Generic perf_sample_data initialization Peter Zijlstra
  2010-03-03 16:49   ` David Miller
@ 2010-03-03 21:14   ` Frederic Weisbecker
  2010-03-05  8:44   ` Jean Pihet
  2 siblings, 0 replies; 44+ messages in thread
From: Frederic Weisbecker @ 2010-03-03 21:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, paulus, eranian, robert.richter, Jamie Iles,
	Jean Pihet, David S. Miller, stable

On Wed, Mar 03, 2010 at 05:39:41PM +0100, Peter Zijlstra wrote:
> This makes it easier to extend perf_sample_data and fixes a bug on
> arm and sparc, which failed to set ->raw to NULL, which can cause
> crashes when combined with PERF_SAMPLE_RAW.
> 
> It also optimizes PowerPC and tracepoint, because the struct
> initialization is forced to zero out the whole structure.
> 
> CC: Jamie Iles <jamie.iles@picochip.com>
> CC: Jean Pihet <jpihet@mvista.com>
> CC: Paul Mackerras <paulus@samba.org>
> CC: Ingo Molnar <mingo@elte.hu>
> CC: David S. Miller <davem@davemloft.net>
> CC: Stephane Eranian <eranian@google.com>

Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 10/11] perf, x86: use LBR for PEBS IP+1 fixup
  2010-03-03 21:11       ` Masami Hiramatsu
@ 2010-03-03 21:50         ` Stephane Eranian
  2010-03-04  8:57           ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Stephane Eranian @ 2010-03-03 21:50 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Peter Zijlstra, mingo, linux-kernel, paulus, robert.richter, fweisbec

I think systematically and transparently using the LBR to correct the PEBS
off-by-one problem is not such a good idea. You've basically hijacked the LBR
and the user cannot use it in a different way.

There are PEBS+LBR measurements where you care about extracting the LBR data.
There are PEBS measurements where you don't care about getting the correct IP.
I don't necessarily want to pay the price, especially when this could be done
offline in the tool.
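
For example, the drain path could gate the rewind on the event having
asked for a precise IP -- an illustration only, the posted patches do the
fixup unconditionally ("attr.precise" is the bit the cover letter's tool
hack sets):

	if (event->attr.precise)
		intel_pmu_pebs_fixup_ip(&regs);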


On Wed, Mar 3, 2010 at 10:11 PM, Masami Hiramatsu <mhiramat@redhat.com> wrote:
> Peter Zijlstra wrote:
>> On Wed, 2010-03-03 at 13:05 -0500, Masami Hiramatsu wrote:
>>> Peter Zijlstra wrote:
>>>> PEBS always reports the IP+1, that is the instruction after the one
>>>> that got sampled, cure this by using the LBR to reliably rewind the
>>>> instruction stream.
>>>
>>> Hmm, does PEBS always report one byte after the end address of the
>>> sampled instruction? Or the instruction which will be executed next
>>> step?
>>
>> The next instruction, its trap like.
>>
>>> [...]
>>>> +#include <asm/insn.h>
>>>> +
>>>> +#define MAX_INSN_SIZE      16
>>>
>>> Hmm, we'd better integrate these kinds of definitions into
>>> asm/insn.h... (several features define it)
>>
>> Agreed, I'll look at doing a patch to collect them all into asm/insn.h
>> if nobody beats me to it :-)
>
> At least kprobes doesn't :)
>
>>>> +
>>>> +static void intel_pmu_pebs_fixup_ip(struct pt_regs *regs)
>>>> +{
>>>> +#if 0
>>>> +   /*
>>>> +    * Broken, makes the machine explode at times trying to
>>>> +    * dereference funny userspace addresses.
>>>> +    *
>>>> +    * Should we always fwd decode from @to, instead of trying
>>>> +    * to rewind as implemented?
>>>> +    */
>>>> +
>>>> +   struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
>>>> +   unsigned long from = cpuc->lbr_entries[0].from;
>>>> +   unsigned long to = cpuc->lbr_entries[0].to;
>>>
>>> Ah, I see. For branch instruction case, we can use LBR to
>>> find previous IP...
>>
>> Right, we use the LBR to find the basic block.
>
> Hm, that's a good idea :)
>
>>>> +   unsigned long ip = regs->ip;
>>>> +   u8 buf[2*MAX_INSN_SIZE];
>>>> +   u8 *kaddr;
>>>> +   int i;
>>>> +
>>>> +   if (from && to) {
>>>> +           /*
>>>> +            * We sampled a branch insn, rewind using the LBR stack
>>>> +            */
>>>> +           if (ip == to) {
>>>> +                   regs->ip = from;
>>>> +                   return;
>>>> +           }
>>>> +   }
>>>> +
>>>> +   if (user_mode(regs)) {
>>>> +           int bytes = copy_from_user_nmi(buf,
>>>> +                           (void __user *)(ip - MAX_INSN_SIZE),
>>>> +                           2*MAX_INSN_SIZE);
>>>> +
>>>
>>> maybe, you'd better check the source address range is within
>>> the user address range. e.g. ip < MAX_INSN_SIZE.
>>
>> Not only that, I realized user_mode() checks regs->cs, which is not set
>> by the PEBS code, so I added some helpers.
>>
>>>> +
>>>> +   /*
>>>> +    * Try to find the longest insn ending up at the given IP
>>>> +    */
>>>> +   for (i = MAX_INSN_SIZE; i > 0; i--) {
>>>> +           struct insn insn;
>>>> +
>>>> +           kernel_insn_init(&insn, kaddr + MAX_INSN_SIZE - i);
>>>> +           insn_get_length(&insn);
>>>> +           if (insn.length == i) {
>>>> +                   regs->ip -= i;
>>>> +                   return;
>>>> +           }
>>>> +   }
>>>
>>> Hmm, this will not work correctly on x86, since the decoder can
>>> miss-decode the tail bytes of previous instruction as prefix bytes. :(
>>>
>>> Thus, if you want to rewind instruction stream, you need to decode
>>> a function (or basic block) entirely.
>>
>> Something like the below?
>
> Great! it looks good to me.
> Yeah, LBR.to may always smaller than current ip (if no one disabled LBR).
>
> Thank you,
>
>>
>> #ifdef CONFIG_X86_32
>> static bool kernel_ip(unsigned long ip)
>> {
>>         return ip > TASK_SIZE;
>> }
>> #else
>> static bool kernel_ip(unsigned long ip)
>> {
>>         return (long)ip < 0;
>> }
>> #endif
>>
>> static int intel_pmu_pebs_fixup_ip(unsigned long *ipp)
>> {
>>         struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
>>         unsigned long from = cpuc->lbr_entries[0].from;
>>         unsigned long old_to, to = cpuc->lbr_entries[0].to;
>>         unsigned long ip = *ipp;
>>         int i;
>>
>>         /*
>>          * We don't need to fixup if the PEBS assist is fault like
>>          */
>>         if (!x86_pmu.intel_perf_capabilities.pebs_trap)
>>                 return 0;
>>
>>         if (!cpuc->lbr_stack.nr || !from || !to)
>>                 return 0;
>>
>>         if (ip < to)
>>                 return 0;
>>
>>         /*
>>          * We sampled a branch insn, rewind using the LBR stack
>>          */
>>         if (ip == to) {
>>                 *ipp = from;
>>                 return 1;
>>         }
>>
>>         do {
>>                 struct insn insn;
>>                 u8 buf[MAX_INSN_SIZE];
>>                 void *kaddr;
>>
>>                 old_to = to;
>>                 if (!kernel_ip(ip)) {
>>                         int bytes = copy_from_user_nmi(buf, (void __user *)to,
>>                                         MAX_INSN_SIZE);
>>
>>                         if (bytes != MAX_INSN_SIZE)
>>                                 return 0;
>>
>>                         kaddr = buf;
>>                 } else kaddr = (void *)to;
>>
>>                 kernel_insn_init(&insn, kaddr);
>>                 insn_get_length(&insn);
>>                 to += insn.length;
>>         } while (to < ip);
>>
>>         if (to == ip) {
>>                 *ipp = old_to;
>>                 return 1;
>>         }
>>
>>         return 0;
>> }
>>
>> I thought about exposing the success of this fixup as a PERF_RECORD_MISC
>> bit.
>>
>
> --
> Masami Hiramatsu
> e-mail: mhiramat@redhat.com
>



-- 
Stephane Eranian  | EMEA Software Engineering
Google France | 38 avenue de l'Opéra | 75002 Paris
Tel : +33 (0) 1 42 68 53 00
This email may be confidential or privileged. If you received this
communication by mistake, please
don't forward it to anyone else, please erase all copies and
attachments, and please let me know that
it went to the wrong person. Thanks

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 08/11] perf, x86: Implement simple LBR support
  2010-03-03 16:39 ` [RFC][PATCH 08/11] perf, x86: Implement simple LBR support Peter Zijlstra
@ 2010-03-03 21:52   ` Stephane Eranian
  2010-03-04  8:58     ` Peter Zijlstra
  2010-03-03 21:57   ` Stephane Eranian
  1 sibling, 1 reply; 44+ messages in thread
From: Stephane Eranian @ 2010-03-03 21:52 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, paulus, robert.richter, fweisbec

On Wed, Mar 3, 2010 at 5:39 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> Implement support for Intel LBR stacks that support
> FREEZE_LBRS_ON_PMI. We do not (yet?) support the LBR config register
> because that is SMT wide and would also put undue restraints on the
> PEBS users.
>
You're saying PEBS users have priority over pure LBR users?
Why is that?

Without coding this, how would you expose LBR configuration to userland
given you're using the PERF_SAMPLE_BRANCH_STACK approach?


> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  arch/x86/kernel/cpu/perf_event.c           |   22 ++
>  arch/x86/kernel/cpu/perf_event_intel.c     |   13 +
>  arch/x86/kernel/cpu/perf_event_intel_lbr.c |  228 +++++++++++++++++++++++++++++
>  3 files changed, 263 insertions(+)
>
> Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
> +++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
> @@ -48,6 +48,12 @@ struct amd_nb {
>        struct event_constraint event_constraints[X86_PMC_IDX_MAX];
>  };
>
> +#define MAX_LBR_ENTRIES                16
> +
> +struct lbr_entry {
> +       u64     from, to, flags;
> +};
> +
>  struct cpu_hw_events {
>        /*
>         * Generic x86 PMC bits
> @@ -70,6 +76,14 @@ struct cpu_hw_events {
>        u64                     pebs_enabled;
>
>        /*
> +        * Intel LBR bits
> +        */
> +       int                     lbr_users;
> +       int                     lbr_entries;
> +       struct lbr_entry        lbr_stack[MAX_LBR_ENTRIES];
> +       void                    *lbr_context;
> +
> +       /*
>         * AMD specific bits
>         */
>        struct amd_nb           *amd_nb;
> @@ -154,6 +168,13 @@ struct x86_pmu {
>        int             pebs_record_size;
>        void            (*drain_pebs)(void);
>        struct event_constraint *pebs_constraints;
> +
> +       /*
> +        * Intel LBR
> +        */
> +       unsigned long   lbr_tos, lbr_from, lbr_to; /* MSR base regs       */
> +       int             lbr_nr;                    /* hardware stack size */
> +       int             lbr_format;                /* hardware format     */
>  };
>
>  static struct x86_pmu x86_pmu __read_mostly;
> @@ -1238,6 +1259,7 @@ undo:
>
>  #include "perf_event_amd.c"
>  #include "perf_event_p6.c"
> +#include "perf_event_intel_lbr.c"
>  #include "perf_event_intel_ds.c"
>  #include "perf_event_intel.c"
>
> Index: linux-2.6/arch/x86/kernel/cpu/perf_event_intel.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/cpu/perf_event_intel.c
> +++ linux-2.6/arch/x86/kernel/cpu/perf_event_intel.c
> @@ -480,6 +480,7 @@ static void intel_pmu_disable_all(void)
>                intel_pmu_disable_bts();
>
>        intel_pmu_pebs_disable_all();
> +       intel_pmu_lbr_disable_all();
>  }
>
>  static void intel_pmu_enable_all(void)
> @@ -499,6 +500,7 @@ static void intel_pmu_enable_all(void)
>        }
>
>        intel_pmu_pebs_enable_all();
> +       intel_pmu_lbr_enable_all();
>  }
>
>  static inline u64 intel_pmu_get_status(void)
> @@ -675,6 +677,8 @@ again:
>        inc_irq_stat(apic_perf_irqs);
>        ack = status;
>
> +       intel_pmu_lbr_read();
> +
>        /*
>         * PEBS overflow sets bit 62 in the global status register
>         */
> @@ -847,6 +851,8 @@ static __init int intel_pmu_init(void)
>                memcpy(hw_cache_event_ids, core2_hw_cache_event_ids,
>                       sizeof(hw_cache_event_ids));
>
> +               intel_pmu_lbr_init_core();
> +
>                x86_pmu.event_constraints = intel_core2_event_constraints;
>                pr_cont("Core2 events, ");
>                break;
> @@ -856,13 +862,18 @@ static __init int intel_pmu_init(void)
>                memcpy(hw_cache_event_ids, nehalem_hw_cache_event_ids,
>                       sizeof(hw_cache_event_ids));
>
> +               intel_pmu_lbr_init_nhm();
> +
>                x86_pmu.event_constraints = intel_nehalem_event_constraints;
>                pr_cont("Nehalem/Corei7 events, ");
>                break;
> +
>        case 28: /* Atom */
>                memcpy(hw_cache_event_ids, atom_hw_cache_event_ids,
>                       sizeof(hw_cache_event_ids));
>
> +               intel_pmu_lbr_init_atom();
> +
>                x86_pmu.event_constraints = intel_gen_event_constraints;
>                pr_cont("Atom events, ");
>                break;
> @@ -872,6 +883,8 @@ static __init int intel_pmu_init(void)
>                memcpy(hw_cache_event_ids, westmere_hw_cache_event_ids,
>                       sizeof(hw_cache_event_ids));
>
> +               intel_pmu_lbr_init_nhm();
> +
>                x86_pmu.event_constraints = intel_westmere_event_constraints;
>                pr_cont("Westmere events, ");
>                break;
> Index: linux-2.6/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> @@ -0,0 +1,228 @@
> +#ifdef CONFIG_CPU_SUP_INTEL
> +
> +enum {
> +       LBR_FORMAT_32           = 0x00,
> +       LBR_FORMAT_LIP          = 0x01,
> +       LBR_FORMAT_EIP          = 0x02,
> +       LBR_FORMAT_EIP_FLAGS    = 0x03,
> +};
> +
> +/*
> + * We only support LBR implementations that have FREEZE_LBRS_ON_PMI
> + * otherwise it becomes near impossible to get a reliable stack.
> + */
> +
> +#define X86_DEBUGCTL_LBR                               (1 << 0)
> +#define X86_DEBUGCTL_FREEZE_LBRS_ON_PMI                (1 << 11)
> +
> +static void __intel_pmu_lbr_enable(void)
> +{
> +       u64 debugctl;
> +
> +       rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
> +       debugctl |= (X86_DEBUGCTL_LBR | X86_DEBUGCTL_FREEZE_LBRS_ON_PMI);
> +       wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
> +}
> +
> +static void __intel_pmu_lbr_disable(void)
> +{
> +       u64 debugctl;
> +
> +       rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
> +       debugctl &= ~(X86_DEBUGCTL_LBR | X86_DEBUGCTL_FREEZE_LBRS_ON_PMI);
> +       wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
> +}
> +
> +static void intel_pmu_lbr_reset_32(void)
> +{
> +       int i;
> +
> +       for (i = 0; i < x86_pmu.lbr_nr; i++)
> +               wrmsrl(x86_pmu.lbr_from + i, 0);
> +}
> +
> +static void intel_pmu_lbr_reset_64(void)
> +{
> +       int i;
> +
> +       for (i = 0; i < x86_pmu.lbr_nr; i++) {
> +               wrmsrl(x86_pmu.lbr_from + i, 0);
> +               wrmsrl(x86_pmu.lbr_to   + i, 0);
> +       }
> +}
> +
> +static void intel_pmu_lbr_reset(void)
> +{
> +       if (x86_pmu.lbr_format == LBR_FORMAT_32)
> +               intel_pmu_lbr_reset_32();
> +       else
> +               intel_pmu_lbr_reset_64();
> +}
> +
> +static void intel_pmu_lbr_enable(struct perf_event *event)
> +{
> +       struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> +
> +       if (!x86_pmu.lbr_nr)
> +               return;
> +
> +       WARN_ON(cpuc->enabled);
> +
> +       /*
> +        * Reset the LBR stack if this is the first LBR user or
> +        * we changed task context so as to avoid data leaks.
> +        */
> +
> +       if (!cpuc->lbr_users ||
> +           (event->ctx->task && cpuc->lbr_context != event->ctx)) {
> +               intel_pmu_lbr_reset();
> +               cpuc->lbr_context = event->ctx;
> +       }
> +
> +       cpuc->lbr_users++;
> +}
> +
> +static void intel_pmu_lbr_disable(struct perf_event *event)
> +{
> +       struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> +
> +       if (!x86_pmu.lbr_nr)
> +               return;
> +
> +       cpuc->lbr_users--;
> +
> +       BUG_ON(cpuc->lbr_users < 0);
> +       WARN_ON(cpuc->enabled);
> +}
> +
> +static void intel_pmu_lbr_enable_all(void)
> +{
> +       struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> +
> +       if (cpuc->lbr_users)
> +               __intel_pmu_lbr_enable();
> +}
> +
> +static void intel_pmu_lbr_disable_all(void)
> +{
> +       struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> +
> +       if (cpuc->lbr_users)
> +               __intel_pmu_lbr_disable();
> +}
> +
> +static inline u64 intel_pmu_lbr_tos(void)
> +{
> +       u64 tos;
> +
> +       rdmsrl(x86_pmu.lbr_tos, tos);
> +
> +       return tos;
> +}
> +
> +static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
> +{
> +       unsigned long mask = x86_pmu.lbr_nr - 1;
> +       u64 tos = intel_pmu_lbr_tos();
> +       int i;
> +
> +       for (i = 0; i < x86_pmu.lbr_nr; i++, tos--) {
> +               unsigned long lbr_idx = (tos - i) & mask;
> +               union {
> +                       struct {
> +                               u32 from;
> +                               u32 to;
> +                       };
> +                       u64     lbr;
> +               } msr_lastbranch;
> +
> +               rdmsrl(x86_pmu.lbr_from + lbr_idx, msr_lastbranch.lbr);
> +
> +               cpuc->lbr_stack[i].from  = msr_lastbranch.from;
> +               cpuc->lbr_stack[i].to    = msr_lastbranch.to;
> +               cpuc->lbr_stack[i].flags = 0;
> +       }
> +       cpuc->lbr_entries = i;
> +}
> +
> +#define LBR_FROM_FLAG_MISPRED  (1ULL << 63)
> +
> +/*
> + * Due to lack of segmentation in Linux the effective address (offset)
> + * is the same as the linear address, allowing us to merge the LIP and EIP
> + * LBR formats.
> + */
> +static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
> +{
> +       unsigned long mask = x86_pmu.lbr_nr - 1;
> +       u64 tos = intel_pmu_lbr_tos();
> +       int i;
> +
> +       for (i = 0; i < x86_pmu.lbr_nr; i++, tos--) {
> +               unsigned long lbr_idx = (tos - i) & mask;
> +               u64 from, to, flags = 0;
> +
> +               rdmsrl(x86_pmu.lbr_from + lbr_idx, from);
> +               rdmsrl(x86_pmu.lbr_to   + lbr_idx, to);
> +
> +               if (x86_pmu.lbr_format == LBR_FORMAT_EIP_FLAGS) {
> +                       flags = !!(from & LBR_FROM_FLAG_MISPRED);
> +                       from = (u64)((((s64)from) << 1) >> 1);
> +               }
> +
> +               cpuc->lbr_stack[i].from  = from;
> +               cpuc->lbr_stack[i].to    = to;
> +               cpuc->lbr_stack[i].flags = flags;
> +       }
> +       cpuc->lbr_entries = i;
> +}
> +
> +static void intel_pmu_lbr_read(void)
> +{
> +       struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> +
> +       if (!cpuc->lbr_users)
> +               return;
> +
> +       if (x86_pmu.lbr_format == LBR_FORMAT_32)
> +               intel_pmu_lbr_read_32(cpuc);
> +       else
> +               intel_pmu_lbr_read_64(cpuc);
> +}
> +
> +static int intel_pmu_lbr_format(void)
> +{
> +       u64 capabilities;
> +
> +       rdmsrl(MSR_IA32_PERF_CAPABILITIES, capabilities);
> +       return capabilities & 0x1f;
> +}
> +
> +static void intel_pmu_lbr_init_core(void)
> +{
> +       x86_pmu.lbr_format = intel_pmu_lbr_format();
> +       x86_pmu.lbr_nr     = 4;
> +       x86_pmu.lbr_tos    = 0x01c9;
> +       x86_pmu.lbr_from   = 0x40;
> +       x86_pmu.lbr_to     = 0x60;
> +}
> +
> +static void intel_pmu_lbr_init_nhm(void)
> +{
> +       x86_pmu.lbr_format = intel_pmu_lbr_format();
> +       x86_pmu.lbr_nr     = 16;
> +       x86_pmu.lbr_tos    = 0x01c9;
> +       x86_pmu.lbr_from   = 0x680;
> +       x86_pmu.lbr_to     = 0x6c0;
> +}
> +
> +static void intel_pmu_lbr_init_atom(void)
> +{
> +       x86_pmu.lbr_format = intel_pmu_lbr_format();
> +       x86_pmu.lbr_nr     = 8;
> +       x86_pmu.lbr_tos    = 0x01c9;
> +       x86_pmu.lbr_from   = 0x40;
> +       x86_pmu.lbr_to     = 0x60;
> +}
> +
> +#endif /* CONFIG_CPU_SUP_INTEL */
>
> --
>
>



-- 
Stephane Eranian  | EMEA Software Engineering
Google France | 38 avenue de l'Opéra | 75002 Paris
Tel : +33 (0) 1 42 68 53 00
This email may be confidential or privileged. If you received this
communication by mistake, please
don't forward it to anyone else, please erase all copies and
attachments, and please let me know that
it went to the wrong person. Thanks

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 08/11] perf, x86: Implement simple LBR support
  2010-03-03 16:39 ` [RFC][PATCH 08/11] perf, x86: Implement simple LBR support Peter Zijlstra
  2010-03-03 21:52   ` Stephane Eranian
@ 2010-03-03 21:57   ` Stephane Eranian
  2010-03-04  8:58     ` Peter Zijlstra
  1 sibling, 1 reply; 44+ messages in thread
From: Stephane Eranian @ 2010-03-03 21:57 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, paulus, robert.richter, fweisbec

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 12897 bytes --]

I don't understand how LBR state is migrated when a per-thread event is moved
from one CPU to another. It seems the LBR is managed per-cpu.
Can you explain this to me?

On Wed, Mar 3, 2010 at 5:39 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> Implement support for Intel LBR stacks that support
> FREEZE_LBRS_ON_PMI. We do not (yet?) support the LBR config register
> because that is SMT wide and would also put undue restraints on the
> PEBS users.
[quoted patch snipped; it is quoted in full earlier in the thread]


-- 
Stephane Eranian  | EMEA Software Engineering
Google France | 38 avenue de l'Opéra | 75002 Paris
Tel : +33 (0) 1 42 68 53 00
This email may be confidential or privileged. If you received this
communication by mistake, please
don't forward it to anyone else, please erase all copies and
attachments, and please let me know that
it went to the wrong person. Thanks

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 07/11] perf: Provide PERF_SAMPLE_REGS
  2010-03-03 16:39 ` [RFC][PATCH 07/11] perf: Provide PERF_SAMPLE_REGS Peter Zijlstra
  2010-03-03 17:30   ` Stephane Eranian
@ 2010-03-03 22:02   ` Frederic Weisbecker
  2010-03-04  8:58     ` Peter Zijlstra
  1 sibling, 1 reply; 44+ messages in thread
From: Frederic Weisbecker @ 2010-03-03 22:02 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, paulus, eranian, robert.richter

On Wed, Mar 03, 2010 at 05:39:43PM +0100, Peter Zijlstra wrote:
> Simply copy out the provided pt_regs in a u64 aligned fashion.
> 
> XXX: do task_pt_regs() and get_irq_regs() always clear everything or
>      are we now leaking data?


It looks like there is a leak in the case of non-traced syscalls,
where we don't appear to save r12-r15.

Then task_pt_regs() may leak the top of a process stack...?
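
One way to avoid handing out stale words would be to copy into a zeroed
pt_regs and emit only fields known to be valid -- a sketch, where the
"known valid" set is exactly the open question:

	struct pt_regs out = {};

	out.ip    = regs->ip;
	out.sp    = regs->sp;
	out.flags = regs->flags;

	perf_output_put(handle, out);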


> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  include/linux/perf_event.h |    5 ++++-
>  kernel/perf_event.c        |   17 +++++++++++++++++
>  2 files changed, 21 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6/include/linux/perf_event.h
> ===================================================================
> --- linux-2.6.orig/include/linux/perf_event.h
> +++ linux-2.6/include/linux/perf_event.h
> @@ -125,8 +125,9 @@ enum perf_event_sample_format {
>  	PERF_SAMPLE_PERIOD			= 1U << 8,
>  	PERF_SAMPLE_STREAM_ID			= 1U << 9,
>  	PERF_SAMPLE_RAW				= 1U << 10,
> +	PERF_SAMPLE_REGS			= 1U << 11,
>  
> -	PERF_SAMPLE_MAX = 1U << 11,		/* non-ABI */
> +	PERF_SAMPLE_MAX = 1U << 12,		/* non-ABI */
>  };
>  
>  /*
> @@ -392,6 +393,7 @@ enum perf_event_type {
>  	 *	{ u64			period;   } && PERF_SAMPLE_PERIOD
>  	 *
>  	 *	{ struct read_format	values;	  } && PERF_SAMPLE_READ
> +	 * 	{ struct pt_regs	regs;	  } && PERF_SAMPLE_REGS
>  	 *
>  	 *	{ u64			nr,
>  	 *	  u64			ips[nr];  } && PERF_SAMPLE_CALLCHAIN
> @@ -800,6 +802,7 @@ struct perf_sample_data {
>  	u64				period;
>  	struct perf_callchain_entry	*callchain;
>  	struct perf_raw_record		*raw;
> +	struct pt_regs			*regs;
>  };
>  
>  static inline
> Index: linux-2.6/kernel/perf_event.c
> ===================================================================
> --- linux-2.6.orig/kernel/perf_event.c
> +++ linux-2.6/kernel/perf_event.c
> @@ -3176,6 +3176,17 @@ void perf_output_sample(struct perf_outp
>  	if (sample_type & PERF_SAMPLE_READ)
>  		perf_output_read(handle, event);
>  
> +	if (sample_type & PERF_SAMPLE_REGS) {
> +		int size = DIV_ROUND_UP(sizeof(struct pt_regs), sizeof(u64)) -
> +			   sizeof(struct pt_regs);
> +
> +		perf_output_put(handle, *data->regs);
> +		if (size) {
> +			u64 zero = 0;
> +			perf_output_copy(handle, &zero, size);
> +		}
> +	}
> +
>  	if (sample_type & PERF_SAMPLE_CALLCHAIN) {
>  		if (data->callchain) {
>  			int size = 1;
> @@ -3273,6 +3284,12 @@ void perf_prepare_sample(struct perf_eve
>  	if (sample_type & PERF_SAMPLE_READ)
>  		header->size += perf_event_read_size(event);
>  
> +	if (sample_type & PERF_SAMPLE_REGS) {
> +		data->regs = regs;
> +		header->size += DIV_ROUND_UP(sizeof(struct pt_regs),
> +					     sizeof(u64));
> +	}
> +
>  	if (sample_type & PERF_SAMPLE_CALLCHAIN) {
>  		int size = 1;
>  
> 
> -- 
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 07/11] perf: Provide PERF_SAMPLE_REGS
  2010-03-03 17:55         ` David Miller
  2010-03-03 18:18           ` Stephane Eranian
  2010-03-03 19:18           ` Peter Zijlstra
@ 2010-03-04  2:59           ` Ingo Molnar
  2010-03-04 12:58             ` Arnaldo Carvalho de Melo
  2 siblings, 1 reply; 44+ messages in thread
From: Ingo Molnar @ 2010-03-04  2:59 UTC (permalink / raw)
  To: David Miller, Arnaldo Carvalho de Melo
  Cc: eranian, peterz, linux-kernel, paulus, robert.richter, fweisbec


* David Miller <davem@davemloft.net> wrote:

> And more generally aren't we supposed to be able to eventually analyze perf 
> dumps on any platform not just the one 'perf' was built under?

A sidenote: in this cycle Arnaldo improved this aspect of perf (and those 
changes are now upstream). In theory you should be able to do a 'perf record' 
+ 'perf archive' on your Sparc box and then analyze it via 'perf report' on an 
x86 box - and vice versa.

( Note, it was not tested in that specific combination - another combination
  was tested by Arnaldo: 32-bit PA-RISC profile interpreted on 64-bit x86. )

So yes, i agree that at minimum perf should be able to tell apart the nature 
of any recording and flag combinations it cannot handle (yet).

Btw, i think the most popular use of PEBS is its precise nature, not the 
register dumping aspect per se. If the kernel can provide that transparently 
then that's a usecase that does not need a register dump (in user-space that 
is). It's borderline doable on x86 ...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 06/11] perf, x86: PEBS infrastructure
  2010-03-03 17:42     ` Peter Zijlstra
@ 2010-03-04  8:50       ` Robert Richter
  0 siblings, 0 replies; 44+ messages in thread
From: Robert Richter @ 2010-03-04  8:50 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, paulus, eranian, fweisbec

On 03.03.10 18:42:48, Peter Zijlstra wrote:
> On Wed, 2010-03-03 at 18:38 +0100, Robert Richter wrote:
> > > +     fake_cpuc = kmalloc(sizeof(*fake_cpuc), GFP_KERNEL | __GFP_ZERO);
> > > +     if (!fake_cpuc)
> > > +             return -ENOMEM;
> > > +
> > > +     c = x86_pmu.get_event_constraints(fake_cpuc, event);
> > > +
> > > +     if (!c || !c->weight)
> > > +             ret = -ENOSPC;
> > > +
> > > +     if (x86_pmu.put_event_constraints)
> > > +             x86_pmu.put_event_constraints(fake_cpuc, event);
> > 
> > A fake cpu with the struct filled with zeros will cause a null pointer
> > exception in amd_get_event_constraints():
> > 
> >         struct amd_nb *nb = cpuc->amd_nb;
> 
> That should result in nb == NULL, right? which is checked slightly
> further in the function.

Yes, right. The problem was in your earlier version of this code where
fake_cpuc was a null pointer. The check in amd_get_event_constraints()
for nb should work.
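
For the record, here is a toy model of that situation (user-space C with
made-up names, not the kernel code): the fake cpu context is zero-filled,
so its northbridge pointer is NULL and the constraint lookup has to bail
out before touching it.

#include <stdio.h>
#include <stdlib.h>

struct amd_nb { int refcnt; };
struct cpu_hw_events { struct amd_nb *amd_nb; };

static const char *get_event_constraints(struct cpu_hw_events *cpuc)
{
	struct amd_nb *nb = cpuc->amd_nb;	/* NULL for the fake cpuc */

	if (!nb)			/* bail out before dereferencing */
		return "unconstrained";

	return nb->refcnt ? "nb-constrained" : "unconstrained";
}

int main(void)
{
	/* zero-filled fake cpu context, as in the quoted patch */
	struct cpu_hw_events *fake_cpuc = calloc(1, sizeof(*fake_cpuc));

	if (!fake_cpuc)
		return 1;
	printf("%s\n", get_event_constraints(fake_cpuc));
	free(fake_cpuc);
	return 0;
}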

-Robert

> 
> > Shouldn't x86_schedule_events() be sufficient to decide if a single
> > counter is available? I have not yet looked at group events; this might
> > happen there too.
> 
> Sure, but we will only attempt scheduling them at enable time; this is a
> creation-time check, and failing to create an unschedulable event seems
> prudent.
> 
> 

-- 
Advanced Micro Devices, Inc.
Operating System Research Center
email: robert.richter@amd.com


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 10/11] perf, x86: use LBR for PEBS IP+1 fixup
  2010-03-03 21:50         ` Stephane Eranian
@ 2010-03-04  8:57           ` Peter Zijlstra
  2010-03-09  1:41             ` Stephane Eranian
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2010-03-04  8:57 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Masami Hiramatsu, mingo, linux-kernel, paulus, robert.richter, fweisbec

On Wed, 2010-03-03 at 22:50 +0100, Stephane Eranian wrote:

> I think systematically and transparently using the LBR to correct the PEBS
> off-by-one problem is not such a good idea. You've basically hijacked the LBR
> and the user cannot use it in a different way.

Well, they could, it just makes scheduling the stuff more interesting.

> There are PEBS+LBR measurements where you care about extracting the LBR data.
> There are PEBS measurements where you don't care about getting the correct IP.
> I don't necessarily want to pay the price, especially when this could
> be done offline in the tool.

There are some people who argue that fixing up that +1 insn issue is
critical; sadly they don't appear to want to argue their case in public.
What we can do is make it optional, I guess.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 08/11] perf, x86: Implement simple LBR support
  2010-03-03 21:52   ` Stephane Eranian
@ 2010-03-04  8:58     ` Peter Zijlstra
  0 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2010-03-04  8:58 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: mingo, linux-kernel, paulus, robert.richter, fweisbec

On Wed, 2010-03-03 at 22:52 +0100, Stephane Eranian wrote:
> On Wed, Mar 3, 2010 at 5:39 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > Implement support for Intel LBR stacks that support
> > FREEZE_LBRS_ON_PMI. We do not (yet?) support the LBR config register
> > because that is SMT wide and would also put undue restraints on the
> > PEBS users.
> >
> You're saying PEBS users have priority over pure LBR users?
> Why is that?

I say no such thing; I only say it would make scheduling the PEBS things
more interesting.

> Without coding this, how would you expose LBR configuration to userland
> given you're using the PERF_SAMPLE_BRANCH_STACK approach?

Possibly using a second config word in the attr, but given how sucky the
hardware currently is (sharing the config between SMT) I'd be inclined
to pretend it doesn't exist for the moment.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 07/11] perf: Provide PERF_SAMPLE_REGS
  2010-03-03 22:02   ` Frederic Weisbecker
@ 2010-03-04  8:58     ` Peter Zijlstra
  2010-03-04 11:04       ` Ingo Molnar
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2010-03-04  8:58 UTC (permalink / raw)
  To: Frederic Weisbecker; +Cc: mingo, linux-kernel, paulus, eranian, robert.richter

On Wed, 2010-03-03 at 23:02 +0100, Frederic Weisbecker wrote:
> On Wed, Mar 03, 2010 at 05:39:43PM +0100, Peter Zijlstra wrote:
> > Simply copy out the provided pt_regs in a u64 aligned fashion.
> > 
> > XXX: do task_pt_regs() and get_irq_regs() always clear everything or
> >      are we now leaking data?
> 
> 
> It looks like there is a leak in the case of non-traced syscalls,
> where we don't appear to save r12-r15.
> 
> Then task_pt_regs() may leak the top of a process stack...?


Right, I was afraid of that. I've put this PERF_SAMPLE_REGS thing in the
freezer for now as people seem unsure how to deal with it.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 08/11] perf, x86: Implement simple LBR support
  2010-03-03 21:57   ` Stephane Eranian
@ 2010-03-04  8:58     ` Peter Zijlstra
  2010-03-04 17:54       ` Stephane Eranian
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2010-03-04  8:58 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: mingo, linux-kernel, paulus, robert.richter, fweisbec

On Wed, 2010-03-03 at 22:57 +0100, Stephane Eranian wrote:
> I don't understand how LBR state is migrated when a per-thread event is moved
> from one CPU to another. It seems LBR is managed per-cpu.
> 
> Can you explain this to me?

It is not; it's basically impossible to do given that the TOS doesn't
count more bits than are strictly needed.

Or we should stop supporting cpu and task users at the same time.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 07/11] perf: Provide PERF_SAMPLE_REGS
  2010-03-04  8:58     ` Peter Zijlstra
@ 2010-03-04 11:04       ` Ingo Molnar
  0 siblings, 0 replies; 44+ messages in thread
From: Ingo Molnar @ 2010-03-04 11:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, linux-kernel, paulus, eranian, robert.richter


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, 2010-03-03 at 23:02 +0100, Frederic Weisbecker wrote:
> > On Wed, Mar 03, 2010 at 05:39:43PM +0100, Peter Zijlstra wrote:
> > > Simply copy out the provided pt_regs in a u64 aligned fashion.
> > > 
> > > XXX: do task_pt_regs() and get_irq_regs() always clear everything or
> > >      are we now leaking data?
> > 
> > 
> > It looks like there is a leak in the case of non-traced syscalls,
> > where we don't appear to save r12-r15.
> > 
> > Then task_pt_regs() may leak the top of a process stack...?
> 
> Right, I was afraid of that. I've put this PERF_SAMPLE_REGS thing in the 
> freezer for now as people seem unsure how to deal with it.

Also, we don't want to expose PEBS or LBR at the ABI level without there being 
a user-space component making good use of it.

For example, tools/perf/ support would qualify. Raw libraries alone don't really 
count, as they generally lag, plus there's no guarantee of a full feedback loop 
either.

Adding ABI details is always a tricky business and we only want to do it if 
there's direct, immediate, close involvement with the user-space side, and 
real, immediate benefits to users.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 07/11] perf: Provide PERF_SAMPLE_REGS
  2010-03-04  2:59           ` Ingo Molnar
@ 2010-03-04 12:58             ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 44+ messages in thread
From: Arnaldo Carvalho de Melo @ 2010-03-04 12:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Miller, eranian, peterz, linux-kernel, paulus,
	robert.richter, fweisbec

On Thu, Mar 04, 2010 at 03:59:08AM +0100, Ingo Molnar wrote:
> 
> * David Miller <davem@davemloft.net> wrote:
> 
> > And more generally aren't we supposed to be able to eventually analyze perf 
> > dumps on any platform not just the one 'perf' was built under?
> 
> A sidenote: in this cycle Arnaldo improved this aspect of perf (and those 
> changes are now upstream). In theory you should be able to do a 'perf record' 
> + 'perf archive' on your Sparc box and then analyze it via 'perf report' on an 
> x86 box - and vice versa.
> 
> ( Note, it was not tested in that specific combination - another combination
>   was tested by Arnaldo: 32-bit PA-RISC profile interpreted on 64-bit x86. )

It was the other way around, 64-bit x86 interpreted on 64-bit PARISC.
Should work in any direction.

Caveats:

perf archive requires build-ids; the kernel has them in distros that
have this support in their toolchain, and they have been enabled
unconditionally since about 2.6.24.

If vmlinux is available, it will be used; if not, a copy of
/proc/kallsyms is made, and it is likewise keyed by build-id.

I have plans to cope with build-id-less systems, but no code yet.
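
For the cross-box case above, the flow I have in mind is roughly (file
names as produced by the current tools, assuming build-ids are present
on the recording machine):

  # on the recording machine, e.g. the Sparc box
  perf record -a sleep 10
  perf archive                 # writes perf.data.tar.bz2, keyed by build-ids

  # copy perf.data and perf.data.tar.bz2 to the analysis box, then
  tar xf perf.data.tar.bz2 -C ~/.debug
  perf report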
 
- Arnaldo

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 08/11] perf, x86: Implement simple LBR support
  2010-03-04  8:58     ` Peter Zijlstra
@ 2010-03-04 17:54       ` Stephane Eranian
  2010-03-04 18:18         ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Stephane Eranian @ 2010-03-04 17:54 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, paulus, robert.richter, fweisbec

On Thu, Mar 4, 2010 at 12:58 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2010-03-03 at 22:57 +0100, Stephane Eranian wrote:
>> I don't understand how LBR state is migrated when a per-thread event is moved
>> from one CPU to another. It seems LBR is managed per-cpu.
>>
>> Can you explain this to me?
>
> > It is not; it's basically impossible to do given that the TOS doesn't
> > count more bits than are strictly needed.
>
I don't get that about the TOS.

So you are saying that on context switch out, you drop the current
content of the LBR. When you are scheduled back in on another CPU,
you grab whatever is there?

> Or we should stop supporting cpu and task users at the same time.
>
Or you should consider LBR as an event which has a constraint that
it can only run on one pseudo counter (similar to what you do with
BTS). Scheduling would take care of the mutual exclusion. Multiplexing
would provide the work-around.
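
For concreteness, the BTS precedent above is modelled in the existing x86
code as a fixed pseudo counter with its own constraint; an LBR variant
might look roughly like this (the LBR index and constraint below are
hypothetical, nothing like them is in the posted series):

/* hypothetical pseudo-counter slot for the LBR, next to the BTS one */
#define X86_PMC_IDX_FIXED_LBR	(X86_PMC_IDX_FIXED_BTS + 1)

static struct event_constraint lbr_constraint =
	EVENT_CONSTRAINT(0, 1ULL << X86_PMC_IDX_FIXED_LBR, 0);

The constraint code would then hand lbr_constraint to any event that asks
for branch-stack sampling, so the generic scheduler enforces a single LBR
user at a time and multiplexing provides the work-around.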

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 08/11] perf, x86: Implement simple LBR support
  2010-03-04 17:54       ` Stephane Eranian
@ 2010-03-04 18:18         ` Peter Zijlstra
  2010-03-04 20:23           ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2010-03-04 18:18 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: mingo, linux-kernel, paulus, robert.richter, fweisbec

On Thu, 2010-03-04 at 09:54 -0800, Stephane Eranian wrote:
> On Thu, Mar 4, 2010 at 12:58 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Wed, 2010-03-03 at 22:57 +0100, Stephane Eranian wrote:
> >> I don't understand how LBR state is migrated when a per-thread event is moved
> >> from one CPU to another. It seems LBR is managed per-cpu.
> >>
> >> Can you explain this to me?
> >
> > > It is not; it's basically impossible to do given that the TOS doesn't
> > > count more bits than are strictly needed.
> >
> I don't get that about the TOS.
> 
> So you are saying that on context switch out, you drop the current
> content of the LBR. When you are scheduled back in on another CPU,
> you grab whatever is there?

What is currently implemented is that we lose history at the point a
new task schedules in an LBR-using event.

If we had a wider TOS we could try and stitch partial stacks together
because we could detect overflow.

We could also preserve the LBR, because we would be able to know where a
task got scheduled in and avoid leaking information from the previous task,
while still allowing a cpu-wide user to see everything.

> > Or we should stop supporting cpu and task users at the same time.
> >
> Or you should consider LBR as an event which has a constraint that
> it can only run on one pseudo counter (similar to what you do with
> BTS). Scheduling would take care of the mutual exclusion. Multiplexing
> would provide the work-around.

Yes, that's an even more limited case than not sharing it between task and
cpu context, which is basically the strongest restriction you need.

If you do that you can store the LBR stack on unschedule and put it back
on schedule (on whichever cpu that may be).

But since we do not support LBR-config, that'll be of very limited use,
since there are enough branches between the point where we schedule the
counter and hitting userspace to cycle the LBR several times.
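
For reference, the save/restore variant is mechanically simple; a rough
sketch for a task-only LBR user (the MSR addresses and stack depth are
model specific, so they are passed in here rather than taken from the
patches):

struct lbr_snapshot {
	u64 tos;
	u64 from[32];
	u64 to[32];
};

/* called when the task (and its LBR-using event) is unscheduled */
static void lbr_snapshot_save(struct lbr_snapshot *s, unsigned msr_tos,
			      unsigned msr_from, unsigned msr_to, int nr)
{
	int i;

	rdmsrl(msr_tos, s->tos);
	for (i = 0; i < nr; i++) {
		rdmsrl(msr_from + i, s->from[i]);
		rdmsrl(msr_to + i, s->to[i]);
	}
}

/* called when the task is scheduled back in, on whichever CPU */
static void lbr_snapshot_restore(struct lbr_snapshot *s, unsigned msr_tos,
				 unsigned msr_from, unsigned msr_to, int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		wrmsrl(msr_from + i, s->from[i]);
		wrmsrl(msr_to + i, s->to[i]);
	}
	wrmsrl(msr_tos, s->tos);
}

That only makes sense once the LBR is not shared between task and cpu
contexts, which is the restriction discussed above.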




^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 08/11] perf, x86: Implement simple LBR support
  2010-03-04 18:18         ` Peter Zijlstra
@ 2010-03-04 20:23           ` Peter Zijlstra
  2010-03-04 20:57             ` Stephane Eranian
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2010-03-04 20:23 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: mingo, linux-kernel, paulus, robert.richter, fweisbec

On Thu, 2010-03-04 at 19:18 +0100, Peter Zijlstra wrote:
> What is currently implemented is that we lose history at the point a
> new task schedules in an LBR-using event.
> 
This also matches CPU errata AX14, AJ52 and AAK109, which state that a
task switch may produce faulty LBR state, so clearing history after a
task switch seems the best thing to do.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 08/11] perf, x86: Implement simple LBR support
  2010-03-04 20:23           ` Peter Zijlstra
@ 2010-03-04 20:57             ` Stephane Eranian
  0 siblings, 0 replies; 44+ messages in thread
From: Stephane Eranian @ 2010-03-04 20:57 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, paulus, robert.richter, fweisbec

On Thu, Mar 4, 2010 at 12:23 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2010-03-04 at 19:18 +0100, Peter Zijlstra wrote:
>> What is currently implemented is that we lose history at the point a
>> new task schedules in an LBR-using event.
>>
> This also matches CPU errata AX14, AJ52 and AAK109, which state that a
> task switch may produce faulty LBR state, so clearing history after a
> task switch seems the best thing to do.
>
>
You would save the LBR before the task switch and restore after the
task switch, so I don't see how you would be impacted by this. You
would not pick up the bogus LBR content.

Given that you seem to be interested only in LBR at the user level,
I think what you have right now should work. But I don't like a design
that precludes supporting LBR config, regardless of whether the MSR
is shared or not, because that prevents some interesting
measurements.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 05/11] perf: Generic perf_sample_data initialization
  2010-03-03 16:39 ` [RFC][PATCH 05/11] perf: Generic perf_sample_data initialization Peter Zijlstra
  2010-03-03 16:49   ` David Miller
  2010-03-03 21:14   ` Frederic Weisbecker
@ 2010-03-05  8:44   ` Jean Pihet
  2 siblings, 0 replies; 44+ messages in thread
From: Jean Pihet @ 2010-03-05  8:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, paulus, eranian, robert.richter, fweisbec,
	Jamie Iles, David S. Miller, stable


On Wednesday 03 March 2010 17:39:41 Peter Zijlstra wrote:
> This makes it easier to extend perf_sample_data and fixes a bug on
> arm and sparc, which failed to set ->raw to NULL, which can cause
> crashes when combined with PERF_SAMPLE_RAW.
>
> It also optimizes PowerPC and tracepoint, because the struct
> initialization is forced to zero out the whole structure.
>
> CC: Jamie Iles <jamie.iles@picochip.com>
> CC: Jean Pihet <jpihet@mvista.com>
> CC: Paul Mackerras <paulus@samba.org>
> CC: Ingo Molnar <mingo@elte.hu>
> CC: David S. Miller <davem@davemloft.net>
> CC: Stephane Eranian <eranian@google.com>
> CC: Frederic Weisbecker <fweisbec@gmail.com>
> CC: stable@kernel.org
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Acked-by: Jean Pihet <jpihet@mvista.com>

Thanks!




^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC][PATCH 10/11] perf, x86: use LBR for PEBS IP+1 fixup
  2010-03-04  8:57           ` Peter Zijlstra
@ 2010-03-09  1:41             ` Stephane Eranian
  0 siblings, 0 replies; 44+ messages in thread
From: Stephane Eranian @ 2010-03-09  1:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Masami Hiramatsu, mingo, linux-kernel, paulus, robert.richter, fweisbec

On Thu, Mar 4, 2010 at 9:57 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2010-03-03 at 22:50 +0100, Stephane Eranian wrote:
>
>> I think systematically and transparently using the LBR to correct the PEBS
>> off-by-one problem is not such a good idea. You've basically hijacked the LBR
>> and the user cannot use it in a different way.
>
> Well, they could, it just makes scheduling the stuff more interesting.
>
>> There are PEBS+LBR measurements where you care about extracting the LBR data.
>> There are PEBS measurements where you don't care about getting the correct IP.
>> I don't necessarily want to pay the price, especially when this could
>> be done offline in the tool.
>
> There are some people who argue that fixing up that +1 insn issue is
> critical; sadly they don't appear to want to argue their case in public.
> What we can do is make it optional, I guess.

I can see why they would want the IP instead of IP+1. But what I am saying
is that there are certain measurements where you need to use the LBR in
another way. For instance, you may want to combine PEBS + LBR to capture the
path that leads to a cache miss. For that you would need to configure the LBR
to record only call branches. Then you would do the correction of the IP offline
in the tool. In this case, the path is more important than the IP+1 error.

This is why I think you need to provide a config field to disable the IP+1
correction, and thus free the LBR for other usage. I understand this also means
you cannot share the LBR with other competing events (on the same or distinct
CPUs), but that's what event scheduling is good for.
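
Concretely, this could be as small as one opt-out bit next to the existing
precise field; purely as a sketch (the struct and the no_ip_fixup name are
made up, nothing like them exists in the posted series):

#include <linux/types.h>

/* stand-in for the relevant perf_event_attr bits, shown as its own
 * struct so it stands alone; only 'precise' exists today */
struct lbr_fixup_opt {
	__u64	precise     :  1,	/* existing: request precise (PEBS) samples */
		no_ip_fixup :  1,	/* made up: keep the PEBS IP as-is, leave the LBR free */
		__reserved  : 62;
};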

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2010-03-09  1:41 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-03-03 16:39 [RFC][PATCH 00/11] Another stab at PEBS and LBR support Peter Zijlstra
2010-03-03 16:39 ` [RFC][PATCH 01/11] perf, x86: Remove superfluous arguments to x86_perf_event_set_period() Peter Zijlstra
2010-03-03 16:39 ` [RFC][PATCH 02/11] perf, x86: Remove superfluous arguments to x86_perf_event_update() Peter Zijlstra
2010-03-03 16:39 ` [RFC][PATCH 03/11] perf, x86: Change x86_pmu.{enable,disable} calling convention Peter Zijlstra
2010-03-03 16:39 ` [RFC][PATCH 04/11] perf, x86: Use unlocked bitops Peter Zijlstra
2010-03-03 16:39 ` [RFC][PATCH 05/11] perf: Generic perf_sample_data initialization Peter Zijlstra
2010-03-03 16:49   ` David Miller
2010-03-03 21:14   ` Frederic Weisbecker
2010-03-05  8:44   ` Jean Pihet
2010-03-03 16:39 ` [RFC][PATCH 06/11] perf, x86: PEBS infrastructure Peter Zijlstra
2010-03-03 17:38   ` Robert Richter
2010-03-03 17:42     ` Peter Zijlstra
2010-03-04  8:50       ` Robert Richter
2010-03-03 16:39 ` [RFC][PATCH 07/11] perf: Provide PERF_SAMPLE_REGS Peter Zijlstra
2010-03-03 17:30   ` Stephane Eranian
2010-03-03 17:39     ` Peter Zijlstra
2010-03-03 17:49       ` Stephane Eranian
2010-03-03 17:55         ` David Miller
2010-03-03 18:18           ` Stephane Eranian
2010-03-03 19:18           ` Peter Zijlstra
2010-03-04  2:59           ` Ingo Molnar
2010-03-04 12:58             ` Arnaldo Carvalho de Melo
2010-03-03 22:02   ` Frederic Weisbecker
2010-03-04  8:58     ` Peter Zijlstra
2010-03-04 11:04       ` Ingo Molnar
2010-03-03 16:39 ` [RFC][PATCH 08/11] perf, x86: Implement simple LBR support Peter Zijlstra
2010-03-03 21:52   ` Stephane Eranian
2010-03-04  8:58     ` Peter Zijlstra
2010-03-03 21:57   ` Stephane Eranian
2010-03-04  8:58     ` Peter Zijlstra
2010-03-04 17:54       ` Stephane Eranian
2010-03-04 18:18         ` Peter Zijlstra
2010-03-04 20:23           ` Peter Zijlstra
2010-03-04 20:57             ` Stephane Eranian
2010-03-03 16:39 ` [RFC][PATCH 09/11] perf, x86: Implement PERF_SAMPLE_BRANCH_STACK Peter Zijlstra
2010-03-03 21:08   ` Frederic Weisbecker
2010-03-03 16:39 ` [RFC][PATCH 10/11] perf, x86: use LBR for PEBS IP+1 fixup Peter Zijlstra
2010-03-03 18:05   ` Masami Hiramatsu
2010-03-03 19:37     ` Peter Zijlstra
2010-03-03 21:11       ` Masami Hiramatsu
2010-03-03 21:50         ` Stephane Eranian
2010-03-04  8:57           ` Peter Zijlstra
2010-03-09  1:41             ` Stephane Eranian
2010-03-03 16:39 ` [RFC][PATCH 11/11] perf, x86: Clean up IA32_PERF_CAPABILITIES usage Peter Zijlstra
