* [PATCH 00/15] pending perf_counter bits
@ 2009-03-30 17:07 Peter Zijlstra
  2009-03-30 17:07 ` [PATCH 01/15] perf_counter: unify and fix delayed counter wakeup Peter Zijlstra
                   ` (15 more replies)
  0 siblings, 16 replies; 55+ messages in thread
From: Peter Zijlstra @ 2009-03-30 17:07 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel; +Cc: Paul Mackerras, Peter Zijlstra

This is most of the previously posted set, except that I kept the powerpc
set_perf_counter_pending() bits.

This set also adds callchain support for x86 (no userspace yet -- I am looking
at porting sysprof). Furthermore, it has a first attempt at making
perf_counters, oprofile and the nmi_watchdog not stomp on each other.

New bits only compile tested on x86_64 and ppc64.
-- 



* [PATCH 01/15] perf_counter: unify and fix delayed counter wakeup
  2009-03-30 17:07 [PATCH 00/15] pending perf_counter bits Peter Zijlstra
@ 2009-03-30 17:07 ` Peter Zijlstra
  2009-03-31  5:45   ` Paul Mackerras
  2009-04-01 10:12   ` [tip:perfcounters/core] " Peter Zijlstra
  2009-03-30 17:07 ` [PATCH 02/15] perf_counter: fix update_userpage() Peter Zijlstra
                   ` (14 subsequent siblings)
  15 siblings, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2009-03-30 17:07 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel; +Cc: Paul Mackerras, Peter Zijlstra

[-- Attachment #1: perf_counter-fix-delayed-wakeup.patch --]
[-- Type: text/plain, Size: 14536 bytes --]

While going over the wakeup code I noticed that delayed wakeups only work
for hardware counters, even though basically all software counters rely
on them.

This patch unifies and generalizes the delayed wakeup to fix this
issue.

Since we're dealing with NMI context here, use a cmpxchg() based
singly linked list implementation to track counters that have pending
wakeups.

[ This should really be generic code for delayed wakeups, but since we
  cannot use cmpxchg()/xchg() in generic code, I've let it live in the
  perf_counter code. -- Eric Dumazet could use it to aggregate the
  network wakeups. ]
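
As a stand-alone illustration of the idea (not the patch itself -- the real
code is in kernel/perf_counter.c below), here is a user-space sketch of the
same lockless push/drain, using the GCC atomic builtins in place of the
kernel's cmpxchg()/xchg(); the entry type and the demo main() are made up:

#include <stdio.h>

struct entry { struct entry *next; };

/* sentinel that marks both "end of list" and "empty list" */
#define TAIL	((struct entry *)-1UL)

static struct entry *head = TAIL;

static void push(struct entry *e)
{
	struct entry *prev;

	/* claim the entry; a non-NULL ->next means it is already queued */
	if (__sync_val_compare_and_swap(&e->next, NULL, TAIL) != NULL)
		return;

	/* lockless push onto the singly linked list */
	do {
		prev = e->next = head;
	} while (__sync_val_compare_and_swap(&head, prev, e) != prev);
}

static void drain(void)
{
	/* detach the whole list with one atomic exchange */
	struct entry *list = __atomic_exchange_n(&head, TAIL, __ATOMIC_SEQ_CST);

	while (list != TAIL) {
		struct entry *e = list;

		list = list->next;
		e->next = NULL;		/* free for re-queueing */
		printf("wakeup for %p\n", (void *)e);
	}
}

int main(void)
{
	struct entry a = { NULL }, b = { NULL };

	push(&a);
	push(&b);
	push(&a);	/* already queued: no-op */
	drain();
	return 0;
}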

Furthermore, the x86 method of using TIF flags was flawed in that it is
quite possible to end up setting the bit on the idle task, losing the
wakeup.

The powerpc method uses per-cpu storage and does appear to be
sufficient.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/powerpc/include/asm/hw_irq.h   |    4 -
 arch/powerpc/kernel/irq.c           |    2 
 arch/powerpc/kernel/perf_counter.c  |   22 ------
 arch/x86/include/asm/perf_counter.h |    5 -
 arch/x86/include/asm/thread_info.h  |    4 -
 arch/x86/kernel/cpu/perf_counter.c  |   29 --------
 arch/x86/kernel/signal.c            |    6 -
 include/linux/perf_counter.h        |   15 ++--
 kernel/perf_counter.c               |  128 +++++++++++++++++++++++++++++++++---
 kernel/timer.c                      |    3 
 10 files changed, 142 insertions(+), 76 deletions(-)

Index: linux-2.6/arch/powerpc/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/arch/powerpc/kernel/perf_counter.c
+++ linux-2.6/arch/powerpc/kernel/perf_counter.c
@@ -650,24 +650,6 @@ hw_perf_counter_init(struct perf_counter
 }
 
 /*
- * Handle wakeups.
- */
-void perf_counter_do_pending(void)
-{
-	int i;
-	struct cpu_hw_counters *cpuhw = &__get_cpu_var(cpu_hw_counters);
-	struct perf_counter *counter;
-
-	for (i = 0; i < cpuhw->n_counters; ++i) {
-		counter = cpuhw->counter[i];
-		if (counter && counter->wakeup_pending) {
-			counter->wakeup_pending = 0;
-			wake_up(&counter->waitq);
-		}
-	}
-}
-
-/*
  * A counter has overflowed; update its count and record
  * things if requested.  Note that interrupts are hard-disabled
  * here so there is no possibility of being interrupted.
@@ -720,7 +702,7 @@ static void perf_counter_interrupt(struc
 	struct cpu_hw_counters *cpuhw = &__get_cpu_var(cpu_hw_counters);
 	struct perf_counter *counter;
 	long val;
-	int need_wakeup = 0, found = 0;
+	int found = 0;
 
 	for (i = 0; i < cpuhw->n_counters; ++i) {
 		counter = cpuhw->counter[i];
@@ -761,7 +743,7 @@ static void perf_counter_interrupt(struc
 	 * immediately; otherwise we'll have do the wakeup when interrupts
 	 * get soft-enabled.
 	 */
-	if (get_perf_counter_pending() && regs->softe) {
+	if (test_perf_counter_pending() && regs->softe) {
 		irq_enter();
 		clear_perf_counter_pending();
 		perf_counter_do_pending();
Index: linux-2.6/arch/x86/kernel/cpu/perf_counter.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_counter.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_counter.c
@@ -227,7 +227,6 @@ static int __hw_perf_counter_init(struct
 		 */
 		hwc->config |= pmc_ops->event_map(perf_event_id(hw_event));
 	}
-	counter->wakeup_pending = 0;
 
 	return 0;
 }
@@ -773,34 +772,6 @@ void smp_perf_counter_interrupt(struct p
 	irq_exit();
 }
 
-/*
- * This handler is triggered by NMI contexts:
- */
-void perf_counter_notify(struct pt_regs *regs)
-{
-	struct cpu_hw_counters *cpuc;
-	unsigned long flags;
-	int bit, cpu;
-
-	local_irq_save(flags);
-	cpu = smp_processor_id();
-	cpuc = &per_cpu(cpu_hw_counters, cpu);
-
-	for_each_bit(bit, cpuc->used, X86_PMC_IDX_MAX) {
-		struct perf_counter *counter = cpuc->counters[bit];
-
-		if (!counter)
-			continue;
-
-		if (counter->wakeup_pending) {
-			counter->wakeup_pending = 0;
-			wake_up(&counter->waitq);
-		}
-	}
-
-	local_irq_restore(flags);
-}
-
 void perf_counters_lapic_init(int nmi)
 {
 	u32 apic_val;
Index: linux-2.6/include/linux/perf_counter.h
===================================================================
--- linux-2.6.orig/include/linux/perf_counter.h
+++ linux-2.6/include/linux/perf_counter.h
@@ -275,6 +275,10 @@ struct perf_mmap_data {
 	void 				*data_pages[0];
 };
 
+struct perf_wakeup_entry {
+	struct perf_wakeup_entry *next;
+};
+
 /**
  * struct perf_counter - performance counter kernel representation:
  */
@@ -350,7 +354,7 @@ struct perf_counter {
 	/* poll related */
 	wait_queue_head_t		waitq;
 	/* optional: for NMIs */
-	int				wakeup_pending;
+	struct perf_wakeup_entry	wakeup;
 
 	void (*destroy)(struct perf_counter *);
 	struct rcu_head			rcu_head;
@@ -427,7 +431,7 @@ extern void perf_counter_task_sched_out(
 extern void perf_counter_task_tick(struct task_struct *task, int cpu);
 extern void perf_counter_init_task(struct task_struct *child);
 extern void perf_counter_exit_task(struct task_struct *child);
-extern void perf_counter_notify(struct pt_regs *regs);
+extern void perf_counter_do_pending(void);
 extern void perf_counter_print_debug(void);
 extern void perf_counter_unthrottle(void);
 extern u64 hw_perf_save_disable(void);
@@ -461,7 +465,7 @@ static inline void
 perf_counter_task_tick(struct task_struct *task, int cpu)		{ }
 static inline void perf_counter_init_task(struct task_struct *child)	{ }
 static inline void perf_counter_exit_task(struct task_struct *child)	{ }
-static inline void perf_counter_notify(struct pt_regs *regs)		{ }
+static inline void perf_counter_do_pending(void)			{ }
 static inline void perf_counter_print_debug(void)			{ }
 static inline void perf_counter_unthrottle(void)			{ }
 static inline void hw_perf_restore(u64 ctrl)				{ }
@@ -469,8 +473,9 @@ static inline u64 hw_perf_save_disable(v
 static inline int perf_counter_task_disable(void)	{ return -EINVAL; }
 static inline int perf_counter_task_enable(void)	{ return -EINVAL; }
 
-static inline void perf_swcounter_event(u32 event, u64 nr,
-					int nmi, struct pt_regs *regs)	{ }
+static inline void
+perf_swcounter_event(u32 event, u64 nr, int nmi, struct pt_regs *regs)	{ }
+
 #endif
 
 #endif /* __KERNEL__ */
Index: linux-2.6/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/kernel/perf_counter.c
+++ linux-2.6/kernel/perf_counter.c
@@ -1197,8 +1197,12 @@ static void free_counter_rcu(struct rcu_
 	kfree(counter);
 }
 
+static void perf_pending_sync(struct perf_counter *counter);
+
 static void free_counter(struct perf_counter *counter)
 {
+	perf_pending_sync(counter);
+
 	if (counter->destroy)
 		counter->destroy(counter);
 
@@ -1529,6 +1533,118 @@ static const struct file_operations perf
 };
 
 /*
+ * Perf counter wakeup
+ *
+ * If there's data, ensure we set the poll() state and publish everything
+ * to user-space before waking everybody up.
+ */
+
+void perf_counter_wakeup(struct perf_counter *counter)
+{
+	struct perf_mmap_data *data;
+
+	rcu_read_lock();
+	data = rcu_dereference(counter->data);
+	if (data) {
+		(void)atomic_xchg(&data->wakeup, POLL_IN);
+		__perf_counter_update_userpage(counter, data);
+	}
+	rcu_read_unlock();
+
+	wake_up_all(&counter->waitq);
+}
+
+/*
+ * Pending wakeups
+ *
+ * Handle the case where we need to wakeup up from NMI (or rq->lock) context.
+ *
+ * The NMI bit means we cannot possibly take locks. Therefore, maintain a
+ * single linked list and use cmpxchg() to add entries lockless.
+ */
+
+#define PENDING_TAIL ((struct perf_wakeup_entry *)-1UL)
+
+static DEFINE_PER_CPU(struct perf_wakeup_entry *, perf_wakeup_head) = {
+	PENDING_TAIL,
+};
+
+static void perf_pending_queue(struct perf_counter *counter)
+{
+	struct perf_wakeup_entry **head;
+	struct perf_wakeup_entry *prev, *next;
+
+	if (cmpxchg(&counter->wakeup.next, NULL, PENDING_TAIL) != NULL)
+		return;
+
+	head = &get_cpu_var(perf_wakeup_head);
+
+	do {
+		prev = counter->wakeup.next = *head;
+		next = &counter->wakeup;
+	} while (cmpxchg(head, prev, next) != prev);
+
+	set_perf_counter_pending();
+
+	put_cpu_var(perf_wakeup_head);
+}
+
+static int __perf_pending_run(void)
+{
+	struct perf_wakeup_entry *list;
+	int nr = 0;
+
+	list = xchg(&__get_cpu_var(perf_wakeup_head), PENDING_TAIL);
+	while (list != PENDING_TAIL) {
+		struct perf_counter *counter = container_of(list,
+				struct perf_counter, wakeup);
+
+		list = list->next;
+
+		counter->wakeup.next = NULL;
+		/*
+		 * Ensure we observe the unqueue before we issue the wakeup,
+		 * so that we won't be waiting forever.
+		 * -- see perf_not_pending().
+		 */
+		smp_wmb();
+
+		perf_counter_wakeup(counter);
+		nr++;
+	}
+
+	return nr;
+}
+
+static inline int perf_not_pending(struct perf_counter *counter)
+{
+	/*
+	 * If we flush on whatever cpu we run, there is a chance we don't
+	 * need to wait.
+	 */
+	get_cpu();
+	__perf_pending_run();
+	put_cpu();
+
+	/*
+	 * Ensure we see the proper queue state before going to sleep
+	 * so that we do not miss the wakeup. -- see perf_pending_handle()
+	 */
+	smp_rmb();
+	return counter->wakeup.next == NULL;
+}
+
+static void perf_pending_sync(struct perf_counter *counter)
+{
+	wait_event(counter->waitq, perf_not_pending(counter));
+}
+
+void perf_counter_do_pending(void)
+{
+	__perf_pending_run();
+}
+
+/*
  * Output
  */
 
@@ -1611,13 +1727,10 @@ static void perf_output_copy(struct perf
 static void perf_output_end(struct perf_output_handle *handle, int nmi)
 {
 	if (handle->wakeup) {
-		(void)atomic_xchg(&handle->data->wakeup, POLL_IN);
-		__perf_counter_update_userpage(handle->counter, handle->data);
-		if (nmi) {
-			handle->counter->wakeup_pending = 1;
-			set_perf_counter_pending();
-		} else
-			wake_up(&handle->counter->waitq);
+		if (nmi)
+			perf_pending_queue(handle->counter);
+		else
+			perf_counter_wakeup(handle->counter);
 	}
 	rcu_read_unlock();
 }
@@ -2211,7 +2324,6 @@ perf_counter_alloc(struct perf_counter_h
 
 	counter->cpu			= cpu;
 	counter->hw_event		= *hw_event;
-	counter->wakeup_pending		= 0;
 	counter->group_leader		= group_leader;
 	counter->hw_ops			= NULL;
 	counter->ctx			= ctx;
Index: linux-2.6/arch/x86/kernel/signal.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/signal.c
+++ linux-2.6/arch/x86/kernel/signal.c
@@ -6,7 +6,6 @@
  *  2000-06-20  Pentium III FXSR, SSE support by Gareth Hughes
  *  2000-2002   x86-64 support by Andi Kleen
  */
-#include <linux/perf_counter.h>
 #include <linux/sched.h>
 #include <linux/mm.h>
 #include <linux/smp.h>
@@ -872,11 +871,6 @@ do_notify_resume(struct pt_regs *regs, v
 		tracehook_notify_resume(regs);
 	}
 
-	if (thread_info_flags & _TIF_PERF_COUNTERS) {
-		clear_thread_flag(TIF_PERF_COUNTERS);
-		perf_counter_notify(regs);
-	}
-
 #ifdef CONFIG_X86_32
 	clear_thread_flag(TIF_IRET);
 #endif /* CONFIG_X86_32 */
Index: linux-2.6/arch/x86/include/asm/thread_info.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/thread_info.h
+++ linux-2.6/arch/x86/include/asm/thread_info.h
@@ -83,7 +83,6 @@ struct thread_info {
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
 #define TIF_SECCOMP		8	/* secure computing */
 #define TIF_MCE_NOTIFY		10	/* notify userspace of an MCE */
-#define TIF_PERF_COUNTERS	11	/* notify perf counter work */
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* 32bit process */
 #define TIF_FORK		18	/* ret_from_fork */
@@ -107,7 +106,6 @@ struct thread_info {
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_MCE_NOTIFY		(1 << TIF_MCE_NOTIFY)
-#define _TIF_PERF_COUNTERS	(1 << TIF_PERF_COUNTERS)
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
 #define _TIF_FORK		(1 << TIF_FORK)
@@ -141,7 +139,7 @@ struct thread_info {
 
 /* Only used for 64 bit */
 #define _TIF_DO_NOTIFY_MASK						\
-	(_TIF_SIGPENDING|_TIF_MCE_NOTIFY|_TIF_PERF_COUNTERS|_TIF_NOTIFY_RESUME)
+	(_TIF_SIGPENDING|_TIF_MCE_NOTIFY|_TIF_NOTIFY_RESUME)
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW							\
Index: linux-2.6/kernel/timer.c
===================================================================
--- linux-2.6.orig/kernel/timer.c
+++ linux-2.6/kernel/timer.c
@@ -37,6 +37,7 @@
 #include <linux/delay.h>
 #include <linux/tick.h>
 #include <linux/kallsyms.h>
+#include <linux/perf_counter.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -1167,6 +1168,8 @@ static void run_timer_softirq(struct sof
 {
 	struct tvec_base *base = __get_cpu_var(tvec_bases);
 
+	perf_counter_do_pending();
+
 	hrtimer_run_pending();
 
 	if (time_after_eq(jiffies, base->timer_jiffies))
Index: linux-2.6/arch/powerpc/include/asm/hw_irq.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/hw_irq.h
+++ linux-2.6/arch/powerpc/include/asm/hw_irq.h
@@ -132,7 +132,7 @@ static inline int irqs_disabled_flags(un
 struct hw_interrupt_type;
 
 #ifdef CONFIG_PERF_COUNTERS
-static inline unsigned long get_perf_counter_pending(void)
+static inline unsigned long test_perf_counter_pending(void)
 {
 	unsigned long x;
 
@@ -160,7 +160,7 @@ extern void perf_counter_do_pending(void
 
 #else
 
-static inline unsigned long get_perf_counter_pending(void)
+static inline unsigned long test_perf_counter_pending(void)
 {
 	return 0;
 }
Index: linux-2.6/arch/powerpc/kernel/irq.c
===================================================================
--- linux-2.6.orig/arch/powerpc/kernel/irq.c
+++ linux-2.6/arch/powerpc/kernel/irq.c
@@ -135,7 +135,7 @@ notrace void raw_local_irq_restore(unsig
 			iseries_handle_interrupts();
 	}
 
-	if (get_perf_counter_pending()) {
+	if (test_perf_counter_pending()) {
 		clear_perf_counter_pending();
 		perf_counter_do_pending();
 	}
Index: linux-2.6/arch/x86/include/asm/perf_counter.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/perf_counter.h
+++ linux-2.6/arch/x86/include/asm/perf_counter.h
@@ -84,8 +84,9 @@ union cpuid10_edx {
 #define MSR_ARCH_PERFMON_FIXED_CTR2			0x30b
 #define X86_PMC_IDX_FIXED_BUS_CYCLES			(X86_PMC_IDX_FIXED + 2)
 
-#define set_perf_counter_pending()	\
-		set_tsk_thread_flag(current, TIF_PERF_COUNTERS);
+#define set_perf_counter_pending()	do { } while (0)
+#define clear_perf_counter_pending()	do { } while (0)
+#define test_perf_counter_pending()	(0)
 
 #ifdef CONFIG_PERF_COUNTERS
 extern void init_hw_perf_counters(void);

-- 



* [PATCH 02/15] perf_counter: fix update_userpage()
  2009-03-30 17:07 [PATCH 00/15] pending perf_counter bits Peter Zijlstra
  2009-03-30 17:07 ` [PATCH 01/15] perf_counter: unify and fix delayed counter wakeup Peter Zijlstra
@ 2009-03-30 17:07 ` Peter Zijlstra
  2009-04-01 10:12   ` [tip:perfcounters/core] " Peter Zijlstra
  2009-03-30 17:07 ` [PATCH 03/15] perf_counter: kerneltop: simplify data_head read Peter Zijlstra
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2009-03-30 17:07 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel; +Cc: Paul Mackerras, Peter Zijlstra

[-- Attachment #1: perf_counter-fix-update_userpage.patch --]
[-- Type: text/plain, Size: 4349 bytes --]

It just occurred to me that it is possible to have multiple contending
updates of the userpage (mmap information vs overflow vs counter).
This would break the seqlock logic.

It appears the arch code uses this from NMI context, so we cannot
possibly serialize its use; therefore separate the data_head update
from it and let it return to its original use.

The arch code needs to make sure there are no contending callers by
disabling the counter before using it -- powerpc appears to do this
nicely.
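
As a rough consumer-side sketch of the ordering this relies on (x86 shown;
the barrier definition and the helper are illustrative, not part of the
patch):

#include <linux/perf_counter.h>		/* struct perf_counter_mmap_page */

/* x86 read barrier; other architectures need their own */
#define rmb()	asm volatile("lfence" ::: "memory")

static unsigned int read_data_head(struct perf_counter_mmap_page *pc)
{
	unsigned int head = pc->data_head;

	/*
	 * Pairs with the kernel's smp_wmb() in perf_counter_wakeup():
	 * all record data up to 'head' is visible once we get here.
	 */
	rmb();
	return head;
}

This is essentially what the kerneltop change in the next patch does.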

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
---
 include/linux/perf_counter.h |   35 +++++++++++++++++++++++++++++++++++
 kernel/perf_counter.c        |   38 +++++++++++++++++++++++---------------
 2 files changed, 58 insertions(+), 15 deletions(-)

Index: linux-2.6/include/linux/perf_counter.h
===================================================================
--- linux-2.6.orig/include/linux/perf_counter.h
+++ linux-2.6/include/linux/perf_counter.h
@@ -160,10 +160,45 @@ struct perf_counter_hw_event {
 struct perf_counter_mmap_page {
 	__u32	version;		/* version number of this structure */
 	__u32	compat_version;		/* lowest version this is compat with */
+
+	/*
+	 * Bits needed to read the hw counters in user-space.
+	 *
+	 * The index and offset should be read atomically using the seqlock:
+	 *
+	 *   __u32 seq, index;
+	 *   __s64 offset;
+	 *
+	 * again:
+	 *   rmb();
+	 *   seq = pc->lock;
+	 *
+	 *   if (unlikely(seq & 1)) {
+	 *     cpu_relax();
+	 *     goto again;
+	 *   }
+	 *
+	 *   index = pc->index;
+	 *   offset = pc->offset;
+	 *
+	 *   rmb();
+	 *   if (pc->lock != seq)
+	 *     goto again;
+	 *
+	 * After this, index contains architecture specific counter index + 1,
+	 * so that 0 means unavailable, offset contains the value to be added
+	 * to the result of the raw timer read to obtain this counter's value.
+	 */
 	__u32	lock;			/* seqlock for synchronization */
 	__u32	index;			/* hardware counter identifier */
 	__s64	offset;			/* add to hardware counter value */
 
+	/*
+	 * Control data for the mmap() data buffer.
+	 *
+	 * User-space reading this value should issue an rmb(), on SMP capable
+	 * platforms, after reading this value -- see perf_counter_wakeup().
+	 */
 	__u32   data_head;		/* head in the data section */
 };
 
Index: linux-2.6/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/kernel/perf_counter.c
+++ linux-2.6/kernel/perf_counter.c
@@ -1316,10 +1316,22 @@ static long perf_ioctl(struct file *file
 	return err;
 }
 
-static void __perf_counter_update_userpage(struct perf_counter *counter,
-					   struct perf_mmap_data *data)
+/*
+ * Callers need to ensure there can be no nesting of this function, otherwise
+ * the seqlock logic goes bad. We can not serialize this because the arch
+ * code calls this from NMI context.
+ */
+void perf_counter_update_userpage(struct perf_counter *counter)
 {
-	struct perf_counter_mmap_page *userpg = data->user_page;
+	struct perf_mmap_data *data;
+	struct perf_counter_mmap_page *userpg;
+
+	rcu_read_lock();
+	data = rcu_dereference(counter->data);
+	if (!data)
+		goto unlock;
+
+	userpg = data->user_page;
 
 	/*
 	 * Disable preemption so as to not let the corresponding user-space
@@ -1333,20 +1345,10 @@ static void __perf_counter_update_userpa
 	if (counter->state == PERF_COUNTER_STATE_ACTIVE)
 		userpg->offset -= atomic64_read(&counter->hw.prev_count);
 
-	userpg->data_head = atomic_read(&data->head);
 	smp_wmb();
 	++userpg->lock;
 	preempt_enable();
-}
-
-void perf_counter_update_userpage(struct perf_counter *counter)
-{
-	struct perf_mmap_data *data;
-
-	rcu_read_lock();
-	data = rcu_dereference(counter->data);
-	if (data)
-		__perf_counter_update_userpage(counter, data);
+unlock:
 	rcu_read_unlock();
 }
 
@@ -1547,7 +1549,13 @@ void perf_counter_wakeup(struct perf_cou
 	data = rcu_dereference(counter->data);
 	if (data) {
 		(void)atomic_xchg(&data->wakeup, POLL_IN);
-		__perf_counter_update_userpage(counter, data);
+		/*
+		 * Ensure all data writes are issued before updating the
+		 * user-space data head information. The matching rmb()
+		 * will be in userspace after reading this value.
+		 */
+		smp_wmb();
+		data->user_page->data_head = atomic_read(&data->head);
 	}
 	rcu_read_unlock();
 

-- 



* [PATCH 03/15] perf_counter: kerneltop: simplify data_head read
  2009-03-30 17:07 [PATCH 00/15] pending perf_counter bits Peter Zijlstra
  2009-03-30 17:07 ` [PATCH 01/15] perf_counter: unify and fix delayed counter wakeup Peter Zijlstra
  2009-03-30 17:07 ` [PATCH 02/15] perf_counter: fix update_userpage() Peter Zijlstra
@ 2009-03-30 17:07 ` Peter Zijlstra
  2009-04-01 10:13   ` [tip:perfcounters/core] " Peter Zijlstra
  2009-03-30 17:07 ` [PATCH 04/15] perf_counter: executable mmap() information Peter Zijlstra
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2009-03-30 17:07 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel; +Cc: Paul Mackerras, Peter Zijlstra

[-- Attachment #1: kerneltop-new-mmap.patch --]
[-- Type: text/plain, Size: 852 bytes --]

Now that the kernel side has changed, match the user-space side up again.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 Documentation/perf_counter/kerneltop.c |   14 +-------------
 1 file changed, 1 insertion(+), 13 deletions(-)

Index: linux-2.6/Documentation/perf_counter/kerneltop.c
===================================================================
--- linux-2.6.orig/Documentation/perf_counter/kerneltop.c
+++ linux-2.6/Documentation/perf_counter/kerneltop.c
@@ -1125,22 +1125,10 @@ struct mmap_data {
 static unsigned int mmap_read_head(struct mmap_data *md)
 {
 	struct perf_counter_mmap_page *pc = md->base;
-	unsigned int seq, head;
-
-repeat:
-	rmb();
-	seq = pc->lock;
-
-	if (unlikely(seq & 1)) {
-		cpu_relax();
-		goto repeat;
-	}
+	int head;
 
 	head = pc->data_head;
-
 	rmb();
-	if (pc->lock != seq)
-		goto repeat;
 
 	return head;
 }

-- 



* [PATCH 04/15] perf_counter: executable mmap() information
  2009-03-30 17:07 [PATCH 00/15] pending perf_counter bits Peter Zijlstra
                   ` (2 preceding siblings ...)
  2009-03-30 17:07 ` [PATCH 03/15] perf_counter: kerneltop: simplify data_head read Peter Zijlstra
@ 2009-03-30 17:07 ` Peter Zijlstra
  2009-04-01 10:13   ` [tip:perfcounters/core] " Peter Zijlstra
  2009-03-30 17:07 ` [PATCH 05/15] perf_counter: kerneltop: parse the mmap data stream Peter Zijlstra
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2009-03-30 17:07 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel; +Cc: Paul Mackerras, Peter Zijlstra, Andrew Morton

[-- Attachment #1: perf_counter-mmap-tracking.patch --]
[-- Type: text/plain, Size: 7409 bytes --]

Currently the profiling information returns userspace IPs but provides no
way to correlate them with userspace code. Userspace could look into
/proc/$pid/maps, but that might not be current, or even present anymore,
by the time the IPs are analyzed.

Therefore, provide a means to track the mmap information and emit it
in the output stream.

XXX: only covers mmap()/munmap(), mremap() and mprotect() are missing.
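
On the request side a profiler just sets the two new bits in its hw_event
before opening the counter; roughly (a sketch only -- event_id, pid, cpu and
the sys_perf_counter_open() wrapper are assumed to exist, as kerneltop
provides them in a later patch):

	struct perf_counter_hw_event hw_event;
	int fd;

	memset(&hw_event, 0, sizeof(hw_event));
	hw_event.config		= event_id;	/* event being profiled */
	hw_event.record_type	= PERF_RECORD_IRQ;
	hw_event.include_tid	= 1;
	hw_event.mmap		= 1;		/* request PERF_EVENT_MMAP records   */
	hw_event.munmap		= 1;		/* request PERF_EVENT_MUNMAP records */

	fd = sys_perf_counter_open(&hw_event, pid, cpu, -1 /* group_fd */, 0);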

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/perf_counter.h |   24 ++++++-
 kernel/perf_counter.c        |  145 +++++++++++++++++++++++++++++++++++++++++++
 mm/mmap.c                    |   10 ++
 3 files changed, 177 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/perf_counter.h
===================================================================
--- linux-2.6.orig/include/linux/perf_counter.h
+++ linux-2.6/include/linux/perf_counter.h
@@ -137,9 +137,11 @@ struct perf_counter_hw_event {
 				exclude_kernel :  1, /* ditto kernel          */
 				exclude_hv     :  1, /* ditto hypervisor      */
 				exclude_idle   :  1, /* don't count when idle */
-				include_tid    :  1, /* include the tid */
+				include_tid    :  1, /* include the tid       */
+				mmap           :  1, /* include mmap data     */
+				munmap         :  1, /* include munmap data   */
 
-				__reserved_1   : 54;
+				__reserved_1   : 52;
 
 	__u32			extra_config_len;
 	__u32			__reserved_4;
@@ -211,6 +213,9 @@ enum perf_event_type {
 	PERF_EVENT_IP		= 0,
 	PERF_EVENT_GROUP	= 1,
 
+	PERF_EVENT_MMAP		= 2,
+	PERF_EVENT_MUNMAP	= 3,
+
 	__PERF_EVENT_TID	= 0x100,
 };
 
@@ -491,6 +496,12 @@ static inline int is_software_counter(st
 
 extern void perf_swcounter_event(u32, u64, int, struct pt_regs *);
 
+extern void perf_counter_mmap(unsigned long addr, unsigned long len,
+			      unsigned long pgoff, struct file *file);
+
+extern void perf_counter_munmap(unsigned long addr, unsigned long len,
+				unsigned long pgoff, struct file *file);
+
 #else
 static inline void
 perf_counter_task_sched_in(struct task_struct *task, int cpu)		{ }
@@ -511,6 +522,15 @@ static inline int perf_counter_task_enab
 static inline void
 perf_swcounter_event(u32 event, u64 nr, int nmi, struct pt_regs *regs)	{ }
 
+
+static inline void
+perf_counter_mmap(unsigned long addr, unsigned long len,
+		  unsigned long pgoff, struct file *file)		{ }
+
+static inline void
+perf_counter_munmap(unsigned long addr, unsigned long len,
+		    unsigned long pgoff, struct file *file) 		{ }
+
 #endif
 
 #endif /* __KERNEL__ */
Index: linux-2.6/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/kernel/perf_counter.c
+++ linux-2.6/kernel/perf_counter.c
@@ -25,6 +25,7 @@
 #include <linux/anon_inodes.h>
 #include <linux/kernel_stat.h>
 #include <linux/perf_counter.h>
+#include <linux/dcache.h>
 
 #include <asm/irq_regs.h>
 
@@ -1842,6 +1843,150 @@ void perf_counter_output(struct perf_cou
 }
 
 /*
+ * mmap tracking
+ */
+
+struct perf_mmap_event {
+	struct file	*file;
+	char		*file_name;
+	int		file_size;
+
+	struct {
+		struct perf_event_header	header;
+
+		u32				pid;
+		u32				tid;
+		u64				start;
+		u64				len;
+		u64				pgoff;
+	} event;
+};
+
+static void perf_counter_mmap_output(struct perf_counter *counter,
+				     struct perf_mmap_event *mmap_event)
+{
+	struct perf_output_handle handle;
+	int size = mmap_event->event.header.size;
+	int ret = perf_output_begin(&handle, counter, size);
+
+	if (ret)
+		return;
+
+	perf_output_put(&handle, mmap_event->event);
+	perf_output_copy(&handle, mmap_event->file_name,
+				   mmap_event->file_size);
+	perf_output_end(&handle, 0);
+}
+
+static int perf_counter_mmap_match(struct perf_counter *counter,
+				   struct perf_mmap_event *mmap_event)
+{
+	if (counter->hw_event.mmap &&
+	    mmap_event->event.header.type == PERF_EVENT_MMAP)
+		return 1;
+
+	if (counter->hw_event.munmap &&
+	    mmap_event->event.header.type == PERF_EVENT_MUNMAP)
+		return 1;
+
+	return 0;
+}
+
+static void perf_counter_mmap_ctx(struct perf_counter_context *ctx,
+				  struct perf_mmap_event *mmap_event)
+{
+	struct perf_counter *counter;
+
+	if (system_state != SYSTEM_RUNNING || list_empty(&ctx->event_list))
+		return;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(counter, &ctx->event_list, event_entry) {
+		if (perf_counter_mmap_match(counter, mmap_event))
+			perf_counter_mmap_output(counter, mmap_event);
+	}
+	rcu_read_unlock();
+}
+
+static void perf_counter_mmap_event(struct perf_mmap_event *mmap_event)
+{
+	struct perf_cpu_context *cpuctx;
+	struct file *file = mmap_event->file;
+	unsigned int size;
+	char tmp[16];
+	char *buf = NULL;
+	char *name;
+
+	if (file) {
+		buf = kzalloc(PATH_MAX, GFP_KERNEL);
+		if (!buf) {
+			name = strncpy(tmp, "//enomem", sizeof(tmp));
+			goto got_name;
+		}
+		name = dentry_path(file->f_dentry, buf, PATH_MAX);
+		if (IS_ERR(name)) {
+			name = strncpy(tmp, "//toolong", sizeof(tmp));
+			goto got_name;
+		}
+	} else {
+		name = strncpy(tmp, "//anon", sizeof(tmp));
+		goto got_name;
+	}
+
+got_name:
+	size = ALIGN(strlen(name), sizeof(u64));
+
+	mmap_event->file_name = name;
+	mmap_event->file_size = size;
+
+	mmap_event->event.header.size = sizeof(mmap_event->event) + size;
+
+	cpuctx = &get_cpu_var(perf_cpu_context);
+	perf_counter_mmap_ctx(&cpuctx->ctx, mmap_event);
+	put_cpu_var(perf_cpu_context);
+
+	perf_counter_mmap_ctx(&current->perf_counter_ctx, mmap_event);
+
+	kfree(buf);
+}
+
+void perf_counter_mmap(unsigned long addr, unsigned long len,
+		       unsigned long pgoff, struct file *file)
+{
+	struct perf_mmap_event mmap_event = {
+		.file   = file,
+		.event  = {
+			.header = { .type = PERF_EVENT_MMAP, },
+			.pid	= current->group_leader->pid,
+			.tid	= current->pid,
+			.start  = addr,
+			.len    = len,
+			.pgoff  = pgoff,
+		},
+	};
+
+	perf_counter_mmap_event(&mmap_event);
+}
+
+void perf_counter_munmap(unsigned long addr, unsigned long len,
+			 unsigned long pgoff, struct file *file)
+{
+	struct perf_mmap_event mmap_event = {
+		.file   = file,
+		.event  = {
+			.header = { .type = PERF_EVENT_MUNMAP, },
+			.pid	= current->group_leader->pid,
+			.tid	= current->pid,
+			.start  = addr,
+			.len    = len,
+			.pgoff  = pgoff,
+		},
+	};
+
+	perf_counter_mmap_event(&mmap_event);
+}
+
+/*
  * Generic software counter infrastructure
  */
 
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -27,6 +27,7 @@
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
 #include <linux/mmu_notifier.h>
+#include <linux/perf_counter.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -1219,6 +1220,9 @@ munmap_back:
 	if (correct_wcount)
 		atomic_inc(&inode->i_writecount);
 out:
+	if (vm_flags & VM_EXEC)
+		perf_counter_mmap(addr, len, pgoff, file);
+
 	mm->total_vm += len >> PAGE_SHIFT;
 	vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT);
 	if (vm_flags & VM_LOCKED) {
@@ -1752,6 +1756,12 @@ static void remove_vma_list(struct mm_st
 	do {
 		long nrpages = vma_pages(vma);
 
+		if (vma->vm_flags & VM_EXEC) {
+			perf_counter_munmap(vma->vm_start,
+					nrpages << PAGE_SHIFT,
+					vma->vm_pgoff, vma->vm_file);
+		}
+
 		mm->total_vm -= nrpages;
 		vm_stat_account(mm, vma->vm_flags, vma->vm_file, -nrpages);
 		vma = remove_vma(vma);

-- 



* [PATCH 05/15] perf_counter: kerneltop: parse the mmap data stream
  2009-03-30 17:07 [PATCH 00/15] pending perf_counter bits Peter Zijlstra
                   ` (3 preceding siblings ...)
  2009-03-30 17:07 ` [PATCH 04/15] perf_counter: executable mmap() information Peter Zijlstra
@ 2009-03-30 17:07 ` Peter Zijlstra
  2009-04-01 10:13   ` [tip:perfcounters/core] " Peter Zijlstra
  2009-03-30 17:07 ` [PATCH 06/15] perf_counter: powerpc: only reserve PMU hardware when we need it Peter Zijlstra
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2009-03-30 17:07 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel; +Cc: Paul Mackerras, Peter Zijlstra

[-- Attachment #1: kerneltop-mmap-events.patch --]
[-- Type: text/plain, Size: 4049 bytes --]

Frob the kerneltop code to print the mmap data present in the stream.

A better use would be collecting the IPs per PID and mapping them onto
the provided userspace code -- TODO.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 Documentation/perf_counter/kerneltop.c |   50 +++++++++++++++++++++++++++++----
 1 file changed, 44 insertions(+), 6 deletions(-)

Index: linux-2.6/Documentation/perf_counter/kerneltop.c
===================================================================
--- linux-2.6.orig/Documentation/perf_counter/kerneltop.c
+++ linux-2.6/Documentation/perf_counter/kerneltop.c
@@ -184,6 +184,8 @@ static int			nmi				=  1;
 static int			group				=  0;
 static unsigned int		page_size;
 static unsigned int		mmap_pages			=  16;
+static int			use_mmap			= 0;
+static int			use_munmap			= 0;
 
 static char			*vmlinux;
 
@@ -333,6 +335,8 @@ static void display_help(void)
 	" -z        --zero             # zero counts after display\n"
 	" -D        --dump_symtab      # dump symbol table to stderr on startup\n"
 	" -m pages  --mmap_pages=<pages> # number of mmap data pages\n"
+	" -M        --mmap_info        # print mmap info stream\n"
+	" -U        --munmap_info      # print munmap info stream\n"
 	);
 
 	exit(0);
@@ -1052,9 +1056,11 @@ static void process_options(int argc, ch
 			{"stat",	no_argument,		NULL, 'S'},
 			{"zero",	no_argument,		NULL, 'z'},
 			{"mmap_pages",	required_argument,	NULL, 'm'},
+			{"mmap_info",	no_argument,		NULL, 'M'},
+			{"munmap_info",	no_argument,		NULL, 'U'},
 			{NULL,		0,			NULL,  0 }
 		};
-		int c = getopt_long(argc, argv, "+:ac:C:d:De:f:g:hn:m:p:s:Sx:z",
+		int c = getopt_long(argc, argv, "+:ac:C:d:De:f:g:hn:m:p:s:Sx:zMU",
 				    long_options, &option_index);
 		if (c == -1)
 			break;
@@ -1092,6 +1098,8 @@ static void process_options(int argc, ch
 		case 'x': vmlinux			= strdup(optarg); break;
 		case 'z': zero				=              1; break;
 		case 'm': mmap_pages			=   atoi(optarg); break;
+		case 'M': use_mmap			=              1; break;
+		case 'U': use_munmap			=              1; break;
 		default: error = 1; break;
 		}
 	}
@@ -1172,12 +1180,29 @@ static void mmap_read(struct mmap_data *
 	last_read = this_read;
 
 	for (; old != head;) {
-		struct event_struct {
+		struct ip_event {
 			struct perf_event_header header;
 			__u64 ip;
 			__u32 pid, tid;
-		} *event = (struct event_struct *)&data[old & md->mask];
-		struct event_struct event_copy;
+		};
+		struct mmap_event {
+			struct perf_event_header header;
+			__u32 pid, tid;
+			__u64 start;
+			__u64 len;
+			__u64 pgoff;
+			char filename[PATH_MAX];
+		};
+
+		typedef union event_union {
+			struct perf_event_header header;
+			struct ip_event ip;
+			struct mmap_event mmap;
+		} event_t;
+
+		event_t *event = (event_t *)&data[old & md->mask];
+
+		event_t event_copy;
 
 		unsigned int size = event->header.size;
 
@@ -1187,7 +1212,7 @@ static void mmap_read(struct mmap_data *
 		 */
 		if ((old & md->mask) + size != ((old + size) & md->mask)) {
 			unsigned int offset = old;
-			unsigned int len = sizeof(*event), cpy;
+			unsigned int len = min(sizeof(*event), size), cpy;
 			void *dst = &event_copy;
 
 			do {
@@ -1206,7 +1231,18 @@ static void mmap_read(struct mmap_data *
 		switch (event->header.type) {
 		case PERF_EVENT_IP:
 		case PERF_EVENT_IP | __PERF_EVENT_TID:
-			process_event(event->ip, md->counter);
+			process_event(event->ip.ip, md->counter);
+			break;
+
+		case PERF_EVENT_MMAP:
+		case PERF_EVENT_MUNMAP:
+			printf("%s: %Lu %Lu %Lu %s\n",
+					event->header.type == PERF_EVENT_MMAP
+					  ? "mmap" : "munmap",
+					event->mmap.start,
+					event->mmap.len,
+					event->mmap.pgoff,
+					event->mmap.filename);
 			break;
 		}
 	}
@@ -1255,6 +1291,8 @@ int main(int argc, char *argv[])
 			hw_event.record_type	= PERF_RECORD_IRQ;
 			hw_event.nmi		= nmi;
 			hw_event.include_tid	= 1;
+			hw_event.mmap		= use_mmap;
+			hw_event.munmap		= use_munmap;
 
 			fd[i][counter] = sys_perf_counter_open(&hw_event, tid, cpu, group_fd, 0);
 			if (fd[i][counter] < 0) {

-- 



* [PATCH 06/15] perf_counter: powerpc: only reserve PMU hardware when we need it
  2009-03-30 17:07 [PATCH 00/15] pending perf_counter bits Peter Zijlstra
                   ` (4 preceding siblings ...)
  2009-03-30 17:07 ` [PATCH 05/15] perf_counter: kerneltop: parse the mmap data stream Peter Zijlstra
@ 2009-03-30 17:07 ` Peter Zijlstra
  2009-04-01 10:13   ` [tip:perfcounters/core] " Paul Mackerras
  2009-03-30 17:07 ` [PATCH 07/15] perf_counter: make it possible for hw_perf_counter_init to return error codes Peter Zijlstra
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2009-03-30 17:07 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel; +Cc: Paul Mackerras, Peter Zijlstra

[-- Attachment #1: paulus-perf_counter-powerpc-only_reserve_PMU_hardware_when_we_need_it.patch --]
[-- Type: text/plain, Size: 3836 bytes --]

From: Paul Mackerras <paulus@samba.org>

Impact: cooperate with oprofile

At present, on PowerPC, if you have perf_counters compiled in, oprofile
doesn't work.  There is code to allow the PMU to be shared between
competing subsystems, such as perf_counters and oprofile, but currently
the perf_counter subsystem reserves the PMU for itself at boot time,
and never releases it.

This makes perf_counter play nicely with oprofile.  Now we keep a count
of how many perf_counter instances are counting hardware events, and
reserve the PMU when that count becomes non-zero, and release the PMU
when that count becomes zero.  This means that it is possible to have
perf_counters compiled in and still use oprofile, as long as there are
no hardware perf_counters active.  This also means that if oprofile is
active, sys_perf_counter_open will fail if the hw_event specifies a
hardware event.

To avoid races with other tasks creating and destroying perf_counters,
we use a mutex.  We use atomic_inc_not_zero and atomic_add_unless to
avoid having to take the mutex unless there is a possibility of the
count going between 0 and 1.
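
Schematically, the reserve/release pattern looks like this (a sketch of the
pattern only; reserve_hw()/release_hw() stand in for
reserve_pmc_hardware()/release_pmc_hardware(), with reserve_hw() returning
nonzero on failure -- the real code is in the patch below):

static atomic_t num_counters;
static DEFINE_MUTEX(pmc_reserve_mutex);

static int get_hw(void)
{
	int err = 0;

	/* fast path: hardware already reserved, just take a reference */
	if (atomic_inc_not_zero(&num_counters))
		return 0;

	/* slow path: maybe the first user, serialize against release */
	mutex_lock(&pmc_reserve_mutex);
	if (atomic_read(&num_counters) == 0 && reserve_hw())
		err = -EBUSY;			/* e.g. oprofile owns the PMU */
	else
		atomic_inc(&num_counters);
	mutex_unlock(&pmc_reserve_mutex);

	return err;
}

static void put_hw(void)
{
	/* fast path: not the last user, just drop our reference */
	if (atomic_add_unless(&num_counters, -1, 1))
		return;

	/* slow path: maybe the last user, serialize against reserve */
	mutex_lock(&pmc_reserve_mutex);
	if (atomic_dec_return(&num_counters) == 0)
		release_hw();
	mutex_unlock(&pmc_reserve_mutex);
}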

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---

 arch/powerpc/kernel/perf_counter.c |   47 +++++++++++++++++++++++++++++++++----
 1 file changed, 42 insertions(+), 5 deletions(-)

Index: linux-2.6/arch/powerpc/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/arch/powerpc/kernel/perf_counter.c
+++ linux-2.6/arch/powerpc/kernel/perf_counter.c
@@ -41,6 +41,8 @@ struct power_pmu *ppmu;
  */
 static unsigned int freeze_counters_kernel = MMCR0_FCS;
 
+static void perf_counter_interrupt(struct pt_regs *regs);
+
 void perf_counter_print_debug(void)
 {
 }
@@ -594,6 +596,24 @@ struct hw_perf_counter_ops power_perf_op
 	.read = power_perf_read
 };
 
+/* Number of perf_counters counting hardware events */
+static atomic_t num_counters;
+/* Used to avoid races in calling reserve/release_pmc_hardware */
+static DEFINE_MUTEX(pmc_reserve_mutex);
+
+/*
+ * Release the PMU if this is the last perf_counter.
+ */
+static void hw_perf_counter_destroy(struct perf_counter *counter)
+{
+	if (!atomic_add_unless(&num_counters, -1, 1)) {
+		mutex_lock(&pmc_reserve_mutex);
+		if (atomic_dec_return(&num_counters) == 0)
+			release_pmc_hardware();
+		mutex_unlock(&pmc_reserve_mutex);
+	}
+}
+
 const struct hw_perf_counter_ops *
 hw_perf_counter_init(struct perf_counter *counter)
 {
@@ -601,6 +621,7 @@ hw_perf_counter_init(struct perf_counter
 	struct perf_counter *ctrs[MAX_HWCOUNTERS];
 	unsigned int events[MAX_HWCOUNTERS];
 	int n;
+	int err;
 
 	if (!ppmu)
 		return NULL;
@@ -646,6 +667,27 @@ hw_perf_counter_init(struct perf_counter
 
 	counter->hw.config = events[n];
 	atomic64_set(&counter->hw.period_left, counter->hw_event.irq_period);
+
+	/*
+	 * See if we need to reserve the PMU.
+	 * If no counters are currently in use, then we have to take a
+	 * mutex to ensure that we don't race with another task doing
+	 * reserve_pmc_hardware or release_pmc_hardware.
+	 */
+	err = 0;
+	if (!atomic_inc_not_zero(&num_counters)) {
+		mutex_lock(&pmc_reserve_mutex);
+		if (atomic_read(&num_counters) == 0 &&
+		    reserve_pmc_hardware(perf_counter_interrupt))
+			err = -EBUSY;
+		else
+			atomic_inc(&num_counters);
+		mutex_unlock(&pmc_reserve_mutex);
+	}
+	counter->destroy = hw_perf_counter_destroy;
+
+	if (err)
+		return NULL;
 	return &power_perf_ops;
 }
 
@@ -756,11 +798,6 @@ static int init_perf_counters(void)
 {
 	unsigned long pvr;
 
-	if (reserve_pmc_hardware(perf_counter_interrupt)) {
-		printk(KERN_ERR "Couldn't init performance monitor subsystem\n");
-		return -EBUSY;
-	}
-
 	/* XXX should get this from cputable */
 	pvr = mfspr(SPRN_PVR);
 	switch (PVR_VER(pvr)) {

-- 



* [PATCH 07/15] perf_counter: make it possible for hw_perf_counter_init to return error codes
  2009-03-30 17:07 [PATCH 00/15] pending perf_counter bits Peter Zijlstra
                   ` (5 preceding siblings ...)
  2009-03-30 17:07 ` [PATCH 06/15] perf_counter: powerpc: only reserve PMU hardware when we need it Peter Zijlstra
@ 2009-03-30 17:07 ` Peter Zijlstra
  2009-04-01 10:13   ` [tip:perfcounters/core] " Paul Mackerras
  2009-03-30 17:07 ` [PATCH 08/15] perf_counter: x86: proper error propagation for the x86 hw_perf_counter_init() Peter Zijlstra
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2009-03-30 17:07 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel; +Cc: Paul Mackerras, Peter Zijlstra

[-- Attachment #1: paulus-perf_counter-make_it_possible_for_hw_perf_counter_init_to_return_error_codes-v2.patch --]
[-- Type: text/plain, Size: 5418 bytes --]

From: Paul Mackerras <paulus@samba.org>

Impact: better error reporting

At present, if hw_perf_counter_init encounters an error, all it can do
is return NULL, which causes sys_perf_counter_open to return an EINVAL
error to userspace.  This isn't very informative for userspace; it means
that userspace can't tell the difference between "sorry, oprofile is
already using the PMU" and "we don't support this CPU" and "this CPU
doesn't support the requested generic hardware event".

This commit uses the PTR_ERR/ERR_PTR/IS_ERR set of macros to let
hw_perf_counter_init return an error code on error rather than just NULL
if it wishes.  If it does so, that error code will be returned from
sys_perf_counter_open to userspace.  If it returns NULL, an EINVAL
error will be returned to userspace, as before.
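
For reference, the caller-side convention looks like this (a generic
illustration; struct foo and alloc_foo() are made up for the example):

#include <linux/err.h>
#include <linux/slab.h>

struct foo { int val; };

static struct foo *alloc_foo(int val)
{
	struct foo *f;

	if (val < 0)
		return ERR_PTR(-EINVAL);	/* encode an errno in the pointer */

	f = kzalloc(sizeof(*f), GFP_KERNEL);
	if (!f)
		return ERR_PTR(-ENOMEM);

	f->val = val;
	return f;
}

static int use_foo(int val)
{
	struct foo *f = alloc_foo(val);

	if (IS_ERR(f))			/* true for pointers in the errno range */
		return PTR_ERR(f);	/* decode back to -EINVAL/-ENOMEM/... */

	kfree(f);
	return 0;
}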

This also adapts the powerpc hw_perf_counter_init to make use of this
to return ENXIO, EINVAL, EBUSY, or EOPNOTSUPP as appropriate.  It would
be good to add extra error numbers in future to allow userspace to
distinguish the various errors that are currently reported as EINVAL,
i.e. irq_period < 0, too many events in a group, conflict between
exclude_* settings in a group, and PMU resource conflict in a group.

[v2: fix a bug pointed out by Corey Ashford where error returns from
hw_perf_counter_init were not handled correctly in the case of raw
hardware events.]

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/powerpc/kernel/perf_counter.c |   14 +++++++-------
 kernel/perf_counter.c              |   35 ++++++++++++++++++++++-------------
 2 files changed, 29 insertions(+), 20 deletions(-)

Index: linux-2.6/arch/powerpc/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/arch/powerpc/kernel/perf_counter.c
+++ linux-2.6/arch/powerpc/kernel/perf_counter.c
@@ -624,13 +624,13 @@ hw_perf_counter_init(struct perf_counter
 	int err;
 
 	if (!ppmu)
-		return NULL;
+		return ERR_PTR(-ENXIO);
 	if ((s64)counter->hw_event.irq_period < 0)
-		return NULL;
+		return ERR_PTR(-EINVAL);
 	if (!perf_event_raw(&counter->hw_event)) {
 		ev = perf_event_id(&counter->hw_event);
 		if (ev >= ppmu->n_generic || ppmu->generic_events[ev] == 0)
-			return NULL;
+			return ERR_PTR(-EOPNOTSUPP);
 		ev = ppmu->generic_events[ev];
 	} else {
 		ev = perf_event_config(&counter->hw_event);
@@ -656,14 +656,14 @@ hw_perf_counter_init(struct perf_counter
 		n = collect_events(counter->group_leader, ppmu->n_counter - 1,
 				   ctrs, events);
 		if (n < 0)
-			return NULL;
+			return ERR_PTR(-EINVAL);
 	}
 	events[n] = ev;
 	ctrs[n] = counter;
 	if (check_excludes(ctrs, n, 1))
-		return NULL;
+		return ERR_PTR(-EINVAL);
 	if (power_check_constraints(events, n + 1))
-		return NULL;
+		return ERR_PTR(-EINVAL);
 
 	counter->hw.config = events[n];
 	atomic64_set(&counter->hw.period_left, counter->hw_event.irq_period);
@@ -687,7 +687,7 @@ hw_perf_counter_init(struct perf_counter
 	counter->destroy = hw_perf_counter_destroy;
 
 	if (err)
-		return NULL;
+		return ERR_PTR(err);
 	return &power_perf_ops;
 }
 
Index: linux-2.6/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/kernel/perf_counter.c
+++ linux-2.6/kernel/perf_counter.c
@@ -2453,10 +2453,11 @@ perf_counter_alloc(struct perf_counter_h
 {
 	const struct hw_perf_counter_ops *hw_ops;
 	struct perf_counter *counter;
+	long err;
 
 	counter = kzalloc(sizeof(*counter), gfpflags);
 	if (!counter)
-		return NULL;
+		return ERR_PTR(-ENOMEM);
 
 	/*
 	 * Single counters are their own group leaders, with an
@@ -2505,12 +2506,18 @@ perf_counter_alloc(struct perf_counter_h
 		hw_ops = tp_perf_counter_init(counter);
 		break;
 	}
+done:
+	err = 0;
+	if (!hw_ops)
+		err = -EINVAL;
+	else if (IS_ERR(hw_ops))
+		err = PTR_ERR(hw_ops);
 
-	if (!hw_ops) {
+	if (err) {
 		kfree(counter);
-		return NULL;
+		return ERR_PTR(err);
 	}
-done:
+
 	counter->hw_ops = hw_ops;
 
 	return counter;
@@ -2583,10 +2590,10 @@ SYSCALL_DEFINE5(perf_counter_open,
 			goto err_put_context;
 	}
 
-	ret = -EINVAL;
 	counter = perf_counter_alloc(&hw_event, cpu, ctx, group_leader,
 				     GFP_KERNEL);
-	if (!counter)
+	ret = PTR_ERR(counter);
+	if (IS_ERR(counter))
 		goto err_put_context;
 
 	ret = anon_inode_getfd("[perf_counter]", &perf_fops, counter, 0);
@@ -2658,8 +2665,8 @@ inherit_counter(struct perf_counter *par
 	child_counter = perf_counter_alloc(&parent_counter->hw_event,
 					   parent_counter->cpu, child_ctx,
 					   group_leader, GFP_KERNEL);
-	if (!child_counter)
-		return NULL;
+	if (IS_ERR(child_counter))
+		return child_counter;
 
 	/*
 	 * Link it up in the child's context:
@@ -2710,15 +2717,17 @@ static int inherit_group(struct perf_cou
 {
 	struct perf_counter *leader;
 	struct perf_counter *sub;
+	struct perf_counter *child_ctr;
 
 	leader = inherit_counter(parent_counter, parent, parent_ctx,
 				 child, NULL, child_ctx);
-	if (!leader)
-		return -ENOMEM;
+	if (IS_ERR(leader))
+		return PTR_ERR(leader);
 	list_for_each_entry(sub, &parent_counter->sibling_list, list_entry) {
-		if (!inherit_counter(sub, parent, parent_ctx,
-				     child, leader, child_ctx))
-			return -ENOMEM;
+		child_ctr = inherit_counter(sub, parent, parent_ctx,
+					    child, leader, child_ctx);
+		if (IS_ERR(child_ctr))
+			return PTR_ERR(child_ctr);
 	}
 	return 0;
 }

-- 



* [PATCH 08/15] perf_counter: x86: proper error propagation for the x86 hw_perf_counter_init()
  2009-03-30 17:07 [PATCH 00/15] pending perf_counter bits Peter Zijlstra
                   ` (6 preceding siblings ...)
  2009-03-30 17:07 ` [PATCH 07/15] perf_counter: make it possible for hw_perf_counter_init to return error codes Peter Zijlstra
@ 2009-03-30 17:07 ` Peter Zijlstra
  2009-04-01 10:13   ` [tip:perfcounters/core] " Peter Zijlstra
  2009-03-30 17:07 ` [PATCH 09/15] perf_counter tools: optionally scale counter values in perfstat mode Peter Zijlstra
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2009-03-30 17:07 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel; +Cc: Paul Mackerras, Peter Zijlstra

[-- Attachment #1: perf_counter-x86-error.patch --]
[-- Type: text/plain, Size: 662 bytes --]

Now that Paul cleaned up the error propagation paths, pass down the
x86 error as well.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/kernel/cpu/perf_counter.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/arch/x86/kernel/cpu/perf_counter.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_counter.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_counter.c
@@ -954,7 +954,7 @@ hw_perf_counter_init(struct perf_counter
 
 	err = __hw_perf_counter_init(counter);
 	if (err)
-		return NULL;
+		return ERR_PTR(err);
 
 	return &x86_perf_counter_ops;
 }

-- 



* [PATCH 09/15] perf_counter tools: optionally scale counter values in perfstat mode
  2009-03-30 17:07 [PATCH 00/15] pending perf_counter bits Peter Zijlstra
                   ` (7 preceding siblings ...)
  2009-03-30 17:07 ` [PATCH 08/15] perf_counter: x86: proper error propagation for the x86 hw_perf_counter_init() Peter Zijlstra
@ 2009-03-30 17:07 ` Peter Zijlstra
  2009-04-01  9:39   ` Ingo Molnar
  2009-04-01 10:14   ` [tip:perfcounters/core] " Paul Mackerras
  2009-03-30 17:07 ` [PATCH 10/15] perf_counter: small cleanup of the output routines Peter Zijlstra
                   ` (6 subsequent siblings)
  15 siblings, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2009-03-30 17:07 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel; +Cc: Paul Mackerras, Peter Zijlstra

[-- Attachment #1: paulus-perf_counter_tools-optionally_scale_counter_values_in_perfstat_mode.patch --]
[-- Type: text/plain, Size: 5885 bytes --]

From: Paul Mackerras <paulus@samba.org>

Impact: new functionality

This adds an option to the perfstat mode of kerneltop to scale the
reported counter values according to the fraction of time that each
counter gets to count.  This is invoked with the -l option (I used 'l'
because s, c, a and e were all taken already.)  This uses the new
PERF_FORMAT_TOTAL_TIME_{ENABLED,RUNNING} read format options.

With this, we get output like this:

$ ./perfstat -l -e 0:0,0:1,0:2,0:3,0:4,0:5 ./spin

 Performance counter stats for './spin':

     4016072055  CPU cycles           (events)  (scaled from 66.53%)
     2005887318  instructions         (events)  (scaled from 66.53%)
        1762849  cache references     (events)  (scaled from 66.69%)
         165229  cache misses         (events)  (scaled from 66.85%)
     1001298009  branches             (events)  (scaled from 66.78%)
          41566  branch misses        (events)  (scaled from 66.61%)

 Wall-clock time elapsed:  2438.227446 msecs

This also lets us detect when a counter is zero because the counter
never got to go on the CPU at all.  In that case we print <not counted>
rather than 0.
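
The scaling itself is just a ratio of the two new time fields; a rough
sketch of what perfstat does with a counter fd opened with the read format
bits above (names here are illustrative):

#include <stdio.h>
#include <unistd.h>
#include <linux/types.h>

static void print_scaled(int fd, const char *name)
{
	__u64 values[3];	/* count, time_enabled, time_running */
	__u64 count;

	if (read(fd, values, sizeof(values)) != sizeof(values))
		return;

	if (values[2] == 0) {			/* never got onto the PMU */
		printf(" %14s  %-20s\n", "<not counted>", name);
		return;
	}

	count = values[0];
	if (values[2] < values[1])		/* counter was multiplexed */
		count = (__u64)((double)count * values[1] / values[2] + 0.5);

	printf(" %14llu  %-20s (events)", (unsigned long long)count, name);
	if (values[2] < values[1])
		printf("  (scaled from %.2f%%)",
		       (double)values[2] / values[1] * 100.0);
	printf("\n");
}

In the cycles line above, for example, a raw count of roughly 2.67e9 that
was only running for 66.53% of the enabled time scales up to the
4016072055 that gets printed.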

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 Documentation/perf_counter/kerneltop.c |   56 ++++++++++++++++++++++++++-------
 1 file changed, 45 insertions(+), 11 deletions(-)

Index: linux-2.6/Documentation/perf_counter/kerneltop.c
===================================================================
--- linux-2.6.orig/Documentation/perf_counter/kerneltop.c
+++ linux-2.6/Documentation/perf_counter/kerneltop.c
@@ -197,6 +197,8 @@ static int			delay_secs			=  2;
 static int			zero;
 static int			dump_symtab;
 
+static int			scale;
+
 struct source_line {
 	uint64_t		EIP;
 	unsigned long		count;
@@ -305,6 +307,7 @@ static void display_perfstat_help(void)
 	display_events_help();
 
 	printf(
+	" -l                           # scale counter values\n"
 	" -a                           # system-wide collection\n");
 	exit(0);
 }
@@ -328,6 +331,7 @@ static void display_help(void)
 	" -c CNT    --count=CNT        # event period to sample\n\n"
 	" -C CPU    --cpu=CPU          # CPU (-1 for all)                 [default: -1]\n"
 	" -p PID    --pid=PID          # PID of sampled task (-1 for all) [default: -1]\n\n"
+	" -l                           # show scale factor for RR events\n"
 	" -d delay  --delay=<seconds>  # sampling/display delay           [default:  2]\n"
 	" -f CNT    --filter=CNT       # min-event-count filter          [default: 100]\n\n"
 	" -s symbol --symbol=<symbol>  # function to be showed annotated one-shot\n"
@@ -436,6 +440,9 @@ static void create_perfstat_counter(int 
 	hw_event.config		= event_id[counter];
 	hw_event.record_type	= PERF_RECORD_SIMPLE;
 	hw_event.nmi		= 0;
+	if (scale)
+		hw_event.read_format	= PERF_FORMAT_TOTAL_TIME_ENABLED |
+					  PERF_FORMAT_TOTAL_TIME_RUNNING;
 
 	if (system_wide) {
 		int cpu;
@@ -507,28 +514,53 @@ int do_perfstat(int argc, char *argv[])
 	fprintf(stderr, "\n");
 
 	for (counter = 0; counter < nr_counters; counter++) {
-		int cpu;
-		__u64 count, single_count;
+		int cpu, nv;
+		__u64 count[3], single_count[3];
+		int scaled;
 
-		count = 0;
+		count[0] = count[1] = count[2] = 0;
+		nv = scale ? 3 : 1;
 		for (cpu = 0; cpu < nr_cpus; cpu ++) {
 			res = read(fd[cpu][counter],
-					(char *) &single_count, sizeof(single_count));
-			assert(res == sizeof(single_count));
-			count += single_count;
+				   single_count, nv * sizeof(__u64));
+			assert(res == nv * sizeof(__u64));
+
+			count[0] += single_count[0];
+			if (scale) {
+				count[1] += single_count[1];
+				count[2] += single_count[2];
+			}
+		}
+
+		scaled = 0;
+		if (scale) {
+			if (count[2] == 0) {
+				fprintf(stderr, " %14s  %-20s\n",
+					"<not counted>", event_name(counter));
+				continue;
+			}
+			if (count[2] < count[1]) {
+				scaled = 1;
+				count[0] = (unsigned long long)
+					((double)count[0] * count[1] / count[2] + 0.5);
+			}
 		}
 
 		if (event_id[counter] == EID(PERF_TYPE_SOFTWARE, PERF_COUNT_CPU_CLOCK) ||
 		    event_id[counter] == EID(PERF_TYPE_SOFTWARE, PERF_COUNT_TASK_CLOCK)) {
 
-			double msecs = (double)count / 1000000;
+			double msecs = (double)count[0] / 1000000;
 
-			fprintf(stderr, " %14.6f  %-20s (msecs)\n",
+			fprintf(stderr, " %14.6f  %-20s (msecs)",
 				msecs, event_name(counter));
 		} else {
-			fprintf(stderr, " %14Ld  %-20s (events)\n",
-				count, event_name(counter));
+			fprintf(stderr, " %14Ld  %-20s (events)",
+				count[0], event_name(counter));
 		}
+		if (scaled)
+			fprintf(stderr, "  (scaled from %.2f%%)",
+				(double) count[2] / count[1] * 100);
+		fprintf(stderr, "\n");
 	}
 	fprintf(stderr, "\n");
 	fprintf(stderr, " Wall-clock time elapsed: %12.6f msecs\n",
@@ -1049,6 +1081,7 @@ static void process_options(int argc, ch
 			{"filter",	required_argument,	NULL, 'f'},
 			{"group",	required_argument,	NULL, 'g'},
 			{"help",	no_argument,		NULL, 'h'},
+			{"scale",	no_argument,		NULL, 'l'},
 			{"nmi",		required_argument,	NULL, 'n'},
 			{"pid",		required_argument,	NULL, 'p'},
 			{"vmlinux",	required_argument,	NULL, 'x'},
@@ -1060,7 +1093,7 @@ static void process_options(int argc, ch
 			{"munmap_info",	no_argument,		NULL, 'U'},
 			{NULL,		0,			NULL,  0 }
 		};
-		int c = getopt_long(argc, argv, "+:ac:C:d:De:f:g:hn:m:p:s:Sx:zMU",
+		int c = getopt_long(argc, argv, "+:ac:C:d:De:f:g:hln:m:p:s:Sx:zMU",
 				    long_options, &option_index);
 		if (c == -1)
 			break;
@@ -1084,6 +1117,7 @@ static void process_options(int argc, ch
 		case 'f': count_filter			=   atoi(optarg); break;
 		case 'g': group				=   atoi(optarg); break;
 		case 'h':      				  display_help(); break;
+		case 'l': scale				=	       1; break;
 		case 'n': nmi				=   atoi(optarg); break;
 		case 'p':
 			/* CPU and PID are mutually exclusive */

-- 



* [PATCH 10/15] perf_counter: small cleanup of the output routines
  2009-03-30 17:07 [PATCH 00/15] pending perf_counter bits Peter Zijlstra
                   ` (8 preceding siblings ...)
  2009-03-30 17:07 ` [PATCH 09/15] perf_counter tools: optionally scale counter values in perfstat mode Peter Zijlstra
@ 2009-03-30 17:07 ` Peter Zijlstra
  2009-04-01 10:14   ` [tip:perfcounters/core] " Peter Zijlstra
  2009-03-30 17:07 ` [PATCH 11/15] perf_counter: re-arrange the perf_event_type Peter Zijlstra
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2009-03-30 17:07 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel; +Cc: Paul Mackerras, Peter Zijlstra

[-- Attachment #1: perf_counter-cleanup-output.patch --]
[-- Type: text/plain, Size: 3667 bytes --]

Move the nmi argument to the _begin() function, so that _end() only needs the
handle. This allows the _begin() function to generate a wakeup on event loss.
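
The resulting call pattern, as a rough sketch (names as in the patch
below, error handling simplified):

	struct perf_output_handle handle;

	if (perf_output_begin(&handle, counter, size, nmi))
		return;			/* _begin() already did the loss wakeup */

	perf_output_put(&handle, header);
	perf_output_end(&handle);	/* no nmi argument needed anymore */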

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/perf_counter.c |   42 ++++++++++++++++++++++++++----------------
 1 file changed, 26 insertions(+), 16 deletions(-)

Index: linux-2.6/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/kernel/perf_counter.c
+++ linux-2.6/kernel/perf_counter.c
@@ -1663,10 +1663,20 @@ struct perf_output_handle {
 	unsigned int		offset;
 	unsigned int		head;
 	int			wakeup;
+	int			nmi;
 };
 
+static inline void __perf_output_wakeup(struct perf_output_handle *handle)
+{
+	if (handle->nmi)
+		perf_pending_queue(handle->counter);
+	else
+		perf_counter_wakeup(handle->counter);
+}
+
 static int perf_output_begin(struct perf_output_handle *handle,
-			     struct perf_counter *counter, unsigned int size)
+			     struct perf_counter *counter, unsigned int size,
+			     int nmi)
 {
 	struct perf_mmap_data *data;
 	unsigned int offset, head;
@@ -1676,15 +1686,17 @@ static int perf_output_begin(struct perf
 	if (!data)
 		goto out;
 
+	handle->counter	= counter;
+	handle->nmi	= nmi;
+
 	if (!data->nr_pages)
-		goto out;
+		goto fail;
 
 	do {
 		offset = head = atomic_read(&data->head);
 		head += size;
 	} while (atomic_cmpxchg(&data->head, offset, head) != offset);
 
-	handle->counter	= counter;
 	handle->data	= data;
 	handle->offset	= offset;
 	handle->head	= head;
@@ -1692,6 +1704,8 @@ static int perf_output_begin(struct perf
 
 	return 0;
 
+fail:
+	__perf_output_wakeup(handle);
 out:
 	rcu_read_unlock();
 
@@ -1733,14 +1747,10 @@ static void perf_output_copy(struct perf
 #define perf_output_put(handle, x) \
 	perf_output_copy((handle), &(x), sizeof(x))
 
-static void perf_output_end(struct perf_output_handle *handle, int nmi)
+static void perf_output_end(struct perf_output_handle *handle)
 {
-	if (handle->wakeup) {
-		if (nmi)
-			perf_pending_queue(handle->counter);
-		else
-			perf_counter_wakeup(handle->counter);
-	}
+	if (handle->wakeup)
+		__perf_output_wakeup(handle);
 	rcu_read_unlock();
 }
 
@@ -1750,12 +1760,12 @@ static int perf_output_write(struct perf
 	struct perf_output_handle handle;
 	int ret;
 
-	ret = perf_output_begin(&handle, counter, size);
+	ret = perf_output_begin(&handle, counter, size, nmi);
 	if (ret)
 		goto out;
 
 	perf_output_copy(&handle, buf, size);
-	perf_output_end(&handle, nmi);
+	perf_output_end(&handle);
 
 out:
 	return ret;
@@ -1804,7 +1814,7 @@ static void perf_output_group(struct per
 
 	size = sizeof(header) + counter->nr_siblings * sizeof(entry);
 
-	ret = perf_output_begin(&handle, counter, size);
+	ret = perf_output_begin(&handle, counter, size, nmi);
 	if (ret)
 		return;
 
@@ -1824,7 +1834,7 @@ static void perf_output_group(struct per
 		perf_output_put(&handle, entry);
 	}
 
-	perf_output_end(&handle, nmi);
+	perf_output_end(&handle);
 }
 
 void perf_counter_output(struct perf_counter *counter,
@@ -1869,7 +1879,7 @@ static void perf_counter_mmap_output(str
 {
 	struct perf_output_handle handle;
 	int size = mmap_event->event.header.size;
-	int ret = perf_output_begin(&handle, counter, size);
+	int ret = perf_output_begin(&handle, counter, size, 0);
 
 	if (ret)
 		return;
@@ -1877,7 +1887,7 @@ static void perf_counter_mmap_output(str
 	perf_output_put(&handle, mmap_event->event);
 	perf_output_copy(&handle, mmap_event->file_name,
 				   mmap_event->file_size);
-	perf_output_end(&handle, 0);
+	perf_output_end(&handle);
 }
 
 static int perf_counter_mmap_match(struct perf_counter *counter,

-- 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 11/15] perf_counter: re-arrange the perf_event_type
  2009-03-30 17:07 [PATCH 00/15] pending perf_counter bits Peter Zijlstra
                   ` (9 preceding siblings ...)
  2009-03-30 17:07 ` [PATCH 10/15] perf_counter: small cleanup of the output routines Peter Zijlstra
@ 2009-03-30 17:07 ` Peter Zijlstra
  2009-04-01 10:14   ` [tip:perfcounters/core] " Peter Zijlstra
  2009-03-30 17:07 ` [PATCH 12/15] perf_counter: kerneltop: update event_types Peter Zijlstra
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2009-03-30 17:07 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel; +Cc: Paul Mackerras, Peter Zijlstra

[-- Attachment #1: perf_counter-rearrange-event-types.patch --]
[-- Type: text/plain, Size: 2918 bytes --]

Breaks ABI yet again :-)

Change the event type so that [0, 2^31-1] are regular event types, but
[2^31, 2^32-1] forms a bitmask for overflow events.
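
For a reader of the event stream this means something like the following
(sketch only, constant names as introduced in the hunk below):

	if (event->header.type & PERF_EVENT_OVERFLOW) {
		/* the remaining high bits are a mask of optional fields */
		int has_ip  = !!(event->header.type & __PERF_EVENT_IP);
		int has_tid = !!(event->header.type & __PERF_EVENT_TID);

		/* parse the record body according to has_ip/has_tid */
	} else {
		/* plain enumerated event: PERF_EVENT_MMAP, PERF_EVENT_MUNMAP */
	}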

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/perf_counter.h |    6 +++-
 kernel/perf_counter.c        |   56 +++++++++++++++++++------------------------
 2 files changed, 29 insertions(+), 33 deletions(-)

Index: linux-2.6/include/linux/perf_counter.h
===================================================================
--- linux-2.6.orig/include/linux/perf_counter.h
+++ linux-2.6/include/linux/perf_counter.h
@@ -210,13 +210,15 @@ struct perf_event_header {
 };
 
 enum perf_event_type {
-	PERF_EVENT_IP		= 0,
+
 	PERF_EVENT_GROUP	= 1,
 
 	PERF_EVENT_MMAP		= 2,
 	PERF_EVENT_MUNMAP	= 3,
 
-	__PERF_EVENT_TID	= 0x100,
+	PERF_EVENT_OVERFLOW	= 1UL << 31,
+	__PERF_EVENT_IP		= 1UL << 30,
+	__PERF_EVENT_TID	= 1UL << 29,
 };
 
 #ifdef __KERNEL__
Index: linux-2.6/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/kernel/perf_counter.c
+++ linux-2.6/kernel/perf_counter.c
@@ -1754,50 +1754,44 @@ static void perf_output_end(struct perf_
 	rcu_read_unlock();
 }
 
-static int perf_output_write(struct perf_counter *counter, int nmi,
-			     void *buf, ssize_t size)
-{
-	struct perf_output_handle handle;
-	int ret;
-
-	ret = perf_output_begin(&handle, counter, size, nmi);
-	if (ret)
-		goto out;
-
-	perf_output_copy(&handle, buf, size);
-	perf_output_end(&handle);
-
-out:
-	return ret;
-}
-
 static void perf_output_simple(struct perf_counter *counter,
 			       int nmi, struct pt_regs *regs)
 {
-	unsigned int size;
+	int ret;
+	struct perf_output_handle handle;
+	struct perf_event_header header;
+	u64 ip;
 	struct {
-		struct perf_event_header header;
-		u64 ip;
 		u32 pid, tid;
-	} event;
+	} tid_entry;
 
-	event.header.type = PERF_EVENT_IP;
-	event.ip = instruction_pointer(regs);
+	header.type = PERF_EVENT_OVERFLOW;
+	header.size = sizeof(header);
 
-	size = sizeof(event);
+	ip = instruction_pointer(regs);
+	header.type |= __PERF_EVENT_IP;
+	header.size += sizeof(ip);
 
 	if (counter->hw_event.include_tid) {
 		/* namespace issues */
-		event.pid = current->group_leader->pid;
-		event.tid = current->pid;
+		tid_entry.pid = current->group_leader->pid;
+		tid_entry.tid = current->pid;
 
-		event.header.type |= __PERF_EVENT_TID;
-	} else
-		size -= sizeof(u64);
+		header.type |= __PERF_EVENT_TID;
+		header.size += sizeof(tid_entry);
+	}
 
-	event.header.size = size;
+	ret = perf_output_begin(&handle, counter, header.size, nmi);
+	if (ret)
+		return;
+
+	perf_output_put(&handle, header);
+	perf_output_put(&handle, ip);
+
+	if (counter->hw_event.include_tid)
+		perf_output_put(&handle, tid_entry);
 
-	perf_output_write(counter, nmi, &event, size);
+	perf_output_end(&handle);
 }
 
 static void perf_output_group(struct perf_counter *counter, int nmi)

-- 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 12/15] perf_counter: kerneltop: update event_types
  2009-03-30 17:07 [PATCH 00/15] pending perf_counter bits Peter Zijlstra
                   ` (10 preceding siblings ...)
  2009-03-30 17:07 ` [PATCH 11/15] perf_counter: re-arrange the perf_event_type Peter Zijlstra
@ 2009-03-30 17:07 ` Peter Zijlstra
  2009-04-01 10:14   ` [tip:perfcounters/core] perf_counter tools: " Peter Zijlstra
  2009-03-30 17:07 ` [PATCH 13/15] perf_counter: provide generic callchain bits Peter Zijlstra
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2009-03-30 17:07 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel; +Cc: Paul Mackerras, Peter Zijlstra

[-- Attachment #1: kerneltop-rearrange-event-types.patch --]
[-- Type: text/plain, Size: 792 bytes --]

Update kerneltop to go along with the new perf_event_type.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 Documentation/perf_counter/kerneltop.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6/Documentation/perf_counter/kerneltop.c
===================================================================
--- linux-2.6.orig/Documentation/perf_counter/kerneltop.c
+++ linux-2.6/Documentation/perf_counter/kerneltop.c
@@ -1263,8 +1263,8 @@ static void mmap_read(struct mmap_data *
 		old += size;
 
 		switch (event->header.type) {
-		case PERF_EVENT_IP:
-		case PERF_EVENT_IP | __PERF_EVENT_TID:
+		case PERF_EVENT_OVERFLOW | __PERF_EVENT_IP:
+		case PERF_EVENT_OVERFLOW | __PERF_EVENT_IP | __PERF_EVENT_TID:
 			process_event(event->ip.ip, md->counter);
 			break;
 

-- 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 13/15] perf_counter: provide generic callchain bits
  2009-03-30 17:07 [PATCH 00/15] pending perf_counter bits Peter Zijlstra
                   ` (11 preceding siblings ...)
  2009-03-30 17:07 ` [PATCH 12/15] perf_counter: kerneltop: update event_types Peter Zijlstra
@ 2009-03-30 17:07 ` Peter Zijlstra
  2009-03-31  6:12   ` Paul Mackerras
  2009-04-01 10:14   ` [tip:perfcounters/core] " Peter Zijlstra
  2009-03-30 17:07 ` [PATCH 14/15] perf_counter: x86: callchain support Peter Zijlstra
                   ` (2 subsequent siblings)
  15 siblings, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2009-03-30 17:07 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel; +Cc: Paul Mackerras, Peter Zijlstra

[-- Attachment #1: perf_counter-callchain.patch --]
[-- Type: text/plain, Size: 3121 bytes --]

Provide the generic callchain support bits. If hw_event->callchain is
set, the arch-specific perf_callchain() function is called upon to
provide a perf_callchain_entry structure filled with the current
callchain.

If it does so, it is added to the overflow output event.
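
The overflow record as emitted after this patch then looks roughly like
this (pseudo-struct; the optional fields are only present when the
corresponding header.type bit is set):

	struct {
		struct perf_event_header header;  /* type |= __PERF_EVENT_CALLCHAIN */
		u64	ip;			  /* __PERF_EVENT_IP */
		u32	pid, tid;		  /* __PERF_EVENT_TID */
		u64	nr;			  /* callchain length */
		u64	ips[nr];		  /* the callchain itself */
	};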

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/perf_counter.h |   13 ++++++++++++-
 kernel/perf_counter.c        |   27 +++++++++++++++++++++++++++
 2 files changed, 39 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/perf_counter.h
===================================================================
--- linux-2.6.orig/include/linux/perf_counter.h
+++ linux-2.6/include/linux/perf_counter.h
@@ -140,8 +140,9 @@ struct perf_counter_hw_event {
 				include_tid    :  1, /* include the tid       */
 				mmap           :  1, /* include mmap data     */
 				munmap         :  1, /* include munmap data   */
+				callchain      :  1, /* add callchain data    */
 
-				__reserved_1   : 52;
+				__reserved_1   : 51;
 
 	__u32			extra_config_len;
 	__u32			__reserved_4;
@@ -219,6 +220,7 @@ enum perf_event_type {
 	PERF_EVENT_OVERFLOW	= 1UL << 31,
 	__PERF_EVENT_IP		= 1UL << 30,
 	__PERF_EVENT_TID	= 1UL << 29,
+	__PERF_EVENT_CALLCHAIN  = 1UL << 28,
 };
 
 #ifdef __KERNEL__
@@ -504,6 +506,15 @@ extern void perf_counter_mmap(unsigned l
 extern void perf_counter_munmap(unsigned long addr, unsigned long len,
 				unsigned long pgoff, struct file *file);
 
+#define MAX_STACK_DEPTH		255
+
+struct perf_callchain_entry {
+	u64	nr;
+	u64	ip[MAX_STACK_DEPTH];
+};
+
+extern struct perf_callchain_entry *perf_callchain(struct pt_regs *regs);
+
 #else
 static inline void
 perf_counter_task_sched_in(struct task_struct *task, int cpu)		{ }
Index: linux-2.6/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/kernel/perf_counter.c
+++ linux-2.6/kernel/perf_counter.c
@@ -1654,6 +1654,17 @@ void perf_counter_do_pending(void)
 }
 
 /*
+ * Callchain support -- arch specific
+ */
+
+struct perf_callchain_entry *
+__attribute__((weak))
+perf_callchain(struct pt_regs *regs)
+{
+	return NULL;
+}
+
+/*
  * Output
  */
 
@@ -1764,6 +1775,8 @@ static void perf_output_simple(struct pe
 	struct {
 		u32 pid, tid;
 	} tid_entry;
+	struct perf_callchain_entry *callchain = NULL;
+	int callchain_size = 0;
 
 	header.type = PERF_EVENT_OVERFLOW;
 	header.size = sizeof(header);
@@ -1781,6 +1794,17 @@ static void perf_output_simple(struct pe
 		header.size += sizeof(tid_entry);
 	}
 
+	if (counter->hw_event.callchain) {
+		callchain = perf_callchain(regs);
+
+		if (callchain) {
+			callchain_size = (1 + callchain->nr) * sizeof(u64);
+
+			header.type |= __PERF_EVENT_CALLCHAIN;
+			header.size += callchain_size;
+		}
+	}
+
 	ret = perf_output_begin(&handle, counter, header.size, nmi);
 	if (ret)
 		return;
@@ -1791,6 +1815,9 @@ static void perf_output_simple(struct pe
 	if (counter->hw_event.include_tid)
 		perf_output_put(&handle, tid_entry);
 
+	if (callchain)
+		perf_output_copy(&handle, callchain, callchain_size);
+
 	perf_output_end(&handle);
 }
 

-- 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 14/15] perf_counter: x86: callchain support
  2009-03-30 17:07 [PATCH 00/15] pending perf_counter bits Peter Zijlstra
                   ` (12 preceding siblings ...)
  2009-03-30 17:07 ` [PATCH 13/15] perf_counter: provide generic callchain bits Peter Zijlstra
@ 2009-03-30 17:07 ` Peter Zijlstra
  2009-04-01 10:14   ` [tip:perfcounters/core] " Peter Zijlstra
  2009-03-30 17:07 ` [PATCH 15/15] perf_counter: pmc arbitration Peter Zijlstra
  2009-03-31  5:43 ` [PATCH 00/15] pending perf_counter bits Paul Mackerras
  15 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2009-03-30 17:07 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel; +Cc: Paul Mackerras, Peter Zijlstra

[-- Attachment #1: perf_counter-x86-callchain.patch --]
[-- Type: text/plain, Size: 3921 bytes --]

Provide the x86 perf_callchain() implementation.

Code based on the sysprof code.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/kernel/cpu/perf_counter.c |  154 +++++++++++++++++++++++++++++++++++++
 1 file changed, 154 insertions(+)

Index: linux-2.6/arch/x86/kernel/cpu/perf_counter.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_counter.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_counter.c
@@ -16,8 +16,10 @@
 #include <linux/module.h>
 #include <linux/kdebug.h>
 #include <linux/sched.h>
+#include <linux/uaccess.h>
 
 #include <asm/apic.h>
+#include <asm/stacktrace.h>
 
 static bool perf_counters_initialized __read_mostly;
 
@@ -958,3 +960,155 @@ hw_perf_counter_init(struct perf_counter
 
 	return &x86_perf_counter_ops;
 }
+
+/*
+ * callchain support
+ */
+
+static inline
+void callchain_store(struct perf_callchain_entry *entry, unsigned long ip)
+{
+	if (entry->nr < MAX_STACK_DEPTH)
+		entry->ip[entry->nr++] = ip;
+}
+
+static DEFINE_PER_CPU(struct perf_callchain_entry, irq_entry);
+static DEFINE_PER_CPU(struct perf_callchain_entry, nmi_entry);
+
+
+static void
+backtrace_warning_symbol(void *data, char *msg, unsigned long symbol)
+{
+	/* Ignore warnings */
+}
+
+static void backtrace_warning(void *data, char *msg)
+{
+	/* Ignore warnings */
+}
+
+static int backtrace_stack(void *data, char *name)
+{
+	/* Don't bother with IRQ stacks for now */
+	return -1;
+}
+
+static void backtrace_address(void *data, unsigned long addr, int reliable)
+{
+	struct perf_callchain_entry *entry = data;
+
+	if (reliable)
+		callchain_store(entry, addr);
+}
+
+static const struct stacktrace_ops backtrace_ops = {
+	.warning		= backtrace_warning,
+	.warning_symbol		= backtrace_warning_symbol,
+	.stack			= backtrace_stack,
+	.address		= backtrace_address,
+};
+
+static void
+perf_callchain_kernel(struct pt_regs *regs, struct perf_callchain_entry *entry)
+{
+	unsigned long bp;
+	char *stack;
+
+	callchain_store(entry, instruction_pointer(regs));
+
+	stack = ((char *)regs + sizeof(struct pt_regs));
+#ifdef CONFIG_FRAME_POINTER
+	bp = frame_pointer(regs);
+#else
+	bp = 0;
+#endif
+
+	dump_trace(NULL, regs, (void *)stack, bp, &backtrace_ops, entry);
+}
+
+
+struct stack_frame {
+	const void __user	*next_fp;
+	unsigned long		return_address;
+};
+
+static int copy_stack_frame(const void __user *fp, struct stack_frame *frame)
+{
+	int ret;
+
+	if (!access_ok(VERIFY_READ, fp, sizeof(*frame)))
+		return 0;
+
+	ret = 1;
+	pagefault_disable();
+	if (__copy_from_user_inatomic(frame, fp, sizeof(*frame)))
+		ret = 0;
+	pagefault_enable();
+
+	return ret;
+}
+
+static void
+perf_callchain_user(struct pt_regs *regs, struct perf_callchain_entry *entry)
+{
+	struct stack_frame frame;
+	const void __user *fp;
+
+	regs = (struct pt_regs *)current->thread.sp0 - 1;
+	fp   = (void __user *)regs->bp;
+
+	callchain_store(entry, regs->ip);
+
+	while (entry->nr < MAX_STACK_DEPTH) {
+		frame.next_fp	     = NULL;
+		frame.return_address = 0;
+
+		if (!copy_stack_frame(fp, &frame))
+			break;
+
+		if ((unsigned long)fp < user_stack_pointer(regs))
+			break;
+
+		callchain_store(entry, frame.return_address);
+		fp = frame.next_fp;
+	}
+}
+
+static void
+perf_do_callchain(struct pt_regs *regs, struct perf_callchain_entry *entry)
+{
+	int is_user;
+
+	if (!regs)
+		return;
+
+	is_user = user_mode(regs);
+
+	if (!current || current->pid == 0)
+		return;
+
+	if (is_user && current->state != TASK_RUNNING)
+		return;
+
+	if (!is_user)
+		perf_callchain_kernel(regs, entry);
+
+	if (current->mm)
+		perf_callchain_user(regs, entry);
+}
+
+struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
+{
+	struct perf_callchain_entry *entry;
+
+	if (in_nmi())
+		entry = &__get_cpu_var(nmi_entry);
+	else
+		entry = &__get_cpu_var(irq_entry);
+
+	entry->nr = 0;
+
+	perf_do_callchain(regs, entry);
+
+	return entry;
+}

-- 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 15/15] perf_counter: pmc arbitration
  2009-03-30 17:07 [PATCH 00/15] pending perf_counter bits Peter Zijlstra
                   ` (13 preceding siblings ...)
  2009-03-30 17:07 ` [PATCH 14/15] perf_counter: x86: callchain support Peter Zijlstra
@ 2009-03-30 17:07 ` Peter Zijlstra
  2009-04-01 10:15   ` [tip:perfcounters/core] " Peter Zijlstra
  2009-03-31  5:43 ` [PATCH 00/15] pending perf_counter bits Paul Mackerras
  15 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2009-03-30 17:07 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel; +Cc: Paul Mackerras, Peter Zijlstra

[-- Attachment #1: perf_counter-reserve.patch --]
[-- Type: text/plain, Size: 2854 bytes --]

Follow the example set by powerpc and try to play nice with oprofile
and the nmi watchdog.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/kernel/cpu/perf_counter.c |   75 +++++++++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)

Index: linux-2.6/arch/x86/kernel/cpu/perf_counter.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_counter.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_counter.c
@@ -20,6 +20,7 @@
 
 #include <asm/apic.h>
 #include <asm/stacktrace.h>
+#include <asm/nmi.h>
 
 static bool perf_counters_initialized __read_mostly;
 
@@ -172,6 +173,65 @@ again:
 	atomic64_sub(delta, &hwc->period_left);
 }
 
+static atomic_t num_counters;
+static DEFINE_MUTEX(pmc_reserve_mutex);
+
+static bool reserve_pmc_hardware(void)
+{
+	int i;
+
+	if (nmi_watchdog == NMI_LOCAL_APIC)
+		disable_lapic_nmi_watchdog();
+
+	for (i = 0; i < nr_counters_generic; i++) {
+		if (!reserve_perfctr_nmi(pmc_ops->perfctr + i))
+			goto perfctr_fail;
+	}
+
+	for (i = 0; i < nr_counters_generic; i++) {
+		if (!reserve_evntsel_nmi(pmc_ops->eventsel + i))
+			goto eventsel_fail;
+	}
+
+	return true;
+
+eventsel_fail:
+	for (i--; i >= 0; i--)
+		release_evntsel_nmi(pmc_ops->eventsel + i);
+
+	i = nr_counters_generic;
+
+perfctr_fail:
+	for (i--; i >= 0; i--)
+		release_perfctr_nmi(pmc_ops->perfctr + i);
+
+	if (nmi_watchdog == NMI_LOCAL_APIC)
+		enable_lapic_nmi_watchdog();
+
+	return false;
+}
+
+static void release_pmc_hardware(void)
+{
+	int i;
+
+	for (i = 0; i < nr_counters_generic; i++) {
+		release_perfctr_nmi(pmc_ops->perfctr + i);
+		release_evntsel_nmi(pmc_ops->eventsel + i);
+	}
+
+	if (nmi_watchdog == NMI_LOCAL_APIC)
+		enable_lapic_nmi_watchdog();
+}
+
+static void hw_perf_counter_destroy(struct perf_counter *counter)
+{
+	if (atomic_dec_and_mutex_lock(&num_counters, &pmc_reserve_mutex)) {
+		release_pmc_hardware();
+		mutex_unlock(&pmc_reserve_mutex);
+	}
+}
+
 /*
  * Setup the hardware configuration for a given hw_event_type
  */
@@ -179,10 +239,23 @@ static int __hw_perf_counter_init(struct
 {
 	struct perf_counter_hw_event *hw_event = &counter->hw_event;
 	struct hw_perf_counter *hwc = &counter->hw;
+	int err;
 
 	if (unlikely(!perf_counters_initialized))
 		return -EINVAL;
 
+	err = 0;
+	if (!atomic_inc_not_zero(&num_counters)) {
+		mutex_lock(&pmc_reserve_mutex);
+		if (atomic_read(&num_counters) == 0 && !reserve_pmc_hardware())
+			err = -EBUSY;
+		else
+			atomic_inc(&num_counters);
+		mutex_unlock(&pmc_reserve_mutex);
+	}
+	if (err)
+		return err;
+
 	/*
 	 * Generate PMC IRQs:
 	 * (keep 'enabled' bit clear for now)
@@ -230,6 +303,8 @@ static int __hw_perf_counter_init(struct
 		hwc->config |= pmc_ops->event_map(perf_event_id(hw_event));
 	}
 
+	counter->destroy = hw_perf_counter_destroy;
+
 	return 0;
 }
 

-- 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/15] pending perf_counter bits
  2009-03-30 17:07 [PATCH 00/15] pending perf_counter bits Peter Zijlstra
                   ` (14 preceding siblings ...)
  2009-03-30 17:07 ` [PATCH 15/15] perf_counter: pmc arbitration Peter Zijlstra
@ 2009-03-31  5:43 ` Paul Mackerras
  15 siblings, 0 replies; 55+ messages in thread
From: Paul Mackerras @ 2009-03-31  5:43 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel

Peter Zijlstra writes:

> Most of the previous set, except I kept the powerpc set_perf_counter_pending()
> bits.
> 
> This set also adds callchain support for x86 (no userspace yet -- am looking
> at porting sysprof). Furthermore, it has a first attempt at making
> perf_counters and oprofile and nmi_watchdog not stomp on each other.
> 
> New bits only compile tested on x86_64 and ppc64.

Seems to run ok on ppc64, except that the kerneltop output seems to
show fewer functions than before, for some reason.

Paul.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/15] perf_counter: unify and fix delayed counter wakeup
  2009-03-30 17:07 ` [PATCH 01/15] perf_counter: unify and fix delayed counter wakeup Peter Zijlstra
@ 2009-03-31  5:45   ` Paul Mackerras
  2009-03-31  6:37     ` Peter Zijlstra
  2009-04-01 10:12   ` [tip:perfcounters/core] " Peter Zijlstra
  1 sibling, 1 reply; 55+ messages in thread
From: Paul Mackerras @ 2009-03-31  5:45 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel

Peter Zijlstra writes:

> +void perf_counter_wakeup(struct perf_counter *counter)
> +{
> +	struct perf_mmap_data *data;
> +
> +	rcu_read_lock();
> +	data = rcu_dereference(counter->data);
> +	if (data) {
> +		(void)atomic_xchg(&data->wakeup, POLL_IN);

Really just a nit, but how is this atomic_xchg any different from
atomic_set(&data->wakeup, POLL_IN) aside from being slower?

Paul.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 13/15] perf_counter: provide generic callchain bits
  2009-03-30 17:07 ` [PATCH 13/15] perf_counter: provide generic callchain bits Peter Zijlstra
@ 2009-03-31  6:12   ` Paul Mackerras
  2009-03-31  6:39     ` Peter Zijlstra
  2009-04-01 10:14   ` [tip:perfcounters/core] " Peter Zijlstra
  1 sibling, 1 reply; 55+ messages in thread
From: Paul Mackerras @ 2009-03-31  6:12 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel

Peter Zijlstra writes:

>  				include_tid    :  1, /* include the tid       */
>  				mmap           :  1, /* include mmap data     */
>  				munmap         :  1, /* include munmap data   */
> +				callchain      :  1, /* add callchain data    */

Interesting, I would have put callchain (and include_tid, also) in
hw_event.record_type rather than as individual 1-bit fields.  The
present arrangement where some selection of what goes into the ring
buffer is in record_type and some is in individual bits seems a bit
awkward.  Plus, with the current arrangement I can't get both the IP
and the values of the other group members, which I might reasonably
want.

I think either we make record_type bit-significant, or we define
individual bits in hw_event for recording the IP and other group
members.

There are a couple of other things I want to be able to record on an
event - we have registers on powerpc that give information about the
event that caused the interrupt, and it would be nice to be able to
record them.  (These registers include instruction and data addresses
associated with the event; the instruction address can be further on
from where the interrupt was taken because of out-of-order instruction
execution and because interrupts might be hard-disabled at the point
where the interrupt becomes pending.)

Those registers would need bits in record_type or in the hw_event to
indicate that we want them recorded.

Paul.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/15] perf_counter: unify and fix delayed counter wakeup
  2009-03-31  5:45   ` Paul Mackerras
@ 2009-03-31  6:37     ` Peter Zijlstra
  2009-04-01  9:04       ` Ingo Molnar
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2009-03-31  6:37 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Ingo Molnar, linux-kernel

On Tue, 2009-03-31 at 16:45 +1100, Paul Mackerras wrote:
> Peter Zijlstra writes:
> 
> > +void perf_counter_wakeup(struct perf_counter *counter)
> > +{
> > +	struct perf_mmap_data *data;
> > +
> > +	rcu_read_lock();
> > +	data = rcu_dereference(counter->data);
> > +	if (data) {
> > +		(void)atomic_xchg(&data->wakeup, POLL_IN);
> 
> Really just a nit, but how is this atomic_xchg any different from
> atomic_set(&data->wakeup, POLL_IN) aside from being slower?

Probably. I got my head in a twist: atomic_set() is simply an unlocked
assignment (although volatile), and I read the value using a locked
xchg().

I wasn't sure how these two would interact and so I chickened out :-)


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 13/15] perf_counter: provide generic callchain bits
  2009-03-31  6:12   ` Paul Mackerras
@ 2009-03-31  6:39     ` Peter Zijlstra
  2009-03-31  7:24       ` Corey Ashford
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2009-03-31  6:39 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Ingo Molnar, linux-kernel

On Tue, 2009-03-31 at 17:12 +1100, Paul Mackerras wrote:
> Peter Zijlstra writes:
> 
> >  				include_tid    :  1, /* include the tid       */
> >  				mmap           :  1, /* include mmap data     */
> >  				munmap         :  1, /* include munmap data   */
> > +				callchain      :  1, /* add callchain data    */
> 
> Interesting, I would have put callchain (and include_tid, also) in
> hw_event.record_type rather than as individual 1-bit fields.  The
> present arrangement where some selection of what goes into the ring
> buffer is in record_type and some is in individual bits seems a bit
> awkward.  Plus, with the current arrangement I can't get both the IP
> and the values of the other group members, which I might reasonably
> want.
> 
> I think either we make record_type bit-significant, or we define
> individual bits in hw_event for recording the IP and other group
> members.
> 
> There are a couple of other things I want to be able to record on an
> event - we have registers on powerpc that give information about the
> event that caused the interrupt, and it would be nice to be able to
> record them.  (These registers include instruction and data addresses
> associated with the event; the instruction address can be further on
> from where the interrupt was taken because of out-of-order instruction
> execution and because interrupts might be hard-disabled at the point
> where the interrupt becomes pending.)
> 
> Those registers would need bits in record_type or in the hw_event to
> indicate that we want them recorded.

Sure, I'm fine with moving them to record_type and making that a bit
field. I've also considered merging the group and 'simple' output to
enable what you say.




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 13/15] perf_counter: provide generic callchain bits
  2009-03-31  6:39     ` Peter Zijlstra
@ 2009-03-31  7:24       ` Corey Ashford
  2009-03-31  8:43         ` Peter Zijlstra
  2009-03-31  9:12         ` Paul Mackerras
  0 siblings, 2 replies; 55+ messages in thread
From: Corey Ashford @ 2009-03-31  7:24 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Paul Mackerras, Ingo Molnar, linux-kernel

Peter Zijlstra wrote:
> On Tue, 2009-03-31 at 17:12 +1100, Paul Mackerras wrote:
>> Peter Zijlstra writes:
>>
>>>  				include_tid    :  1, /* include the tid       */
>>>  				mmap           :  1, /* include mmap data     */
>>>  				munmap         :  1, /* include munmap data   */
>>> +				callchain      :  1, /* add callchain data    */
>> Interesting, I would have put callchain (and include_tid, also) in
>> hw_event.record_type rather than as individual 1-bit fields.  The
>> present arrangement where some selection of what goes into the ring
>> buffer is in record_type and some is in individual bits seems a bit
>> awkward.  Plus, with the current arrangement I can't get both the IP
>> and the values of the other group members, which I might reasonably
>> want.
>>
>> I think either we make record_type bit-significant, or we define
>> individual bits in hw_event for recording the IP and other group
>> members.
>>
>> There are a couple of other things I want to be able to record on an
>> event - we have registers on powerpc that give information about the
>> event that caused the interrupt, and it would be nice to be able to
>> record them.  (These registers include instruction and data addresses
>> associated with the event; the instruction address can be further on
>> from where the interrupt was taken because of out-of-order instruction
>> execution and because interrupts might be hard-disabled at the point
>> where the interrupt becomes pending.)
>>
>> Those registers would need bits in record_type or in the hw_event to
>> indicate that we want them recorded.
> 
> Sure, I'm fine with moving them to record_type and making that a bit
> field. I've also considered merging the group and 'simple' output to
> enable what you say.

Originally, the record_type field was used to determine what would be 
read if you read from the fd.  However, now it seems to be what you get 
from the mmap buffer only, and there is no longer a way to read the 
record stream from the fd. Is this true?  If so, I like this 
idea because it simplifies the ABI: there is one way to read the current 
value of counters and another way to read sample records (via mmap); 
both of these operations use a single fd.

If this is the case, what is the exact meaning of PERF_COUNTER_SIMPLE 
now? and PERF_COUNTER_GROUP?  One simplification would be that reading 
the fd of a group leader would always read up all of the counters in the 
group (along with their enabled and running times if those bits are 
set), and that reading a single counter's fd would yield only that 
counter's values and times (if enabled).  In effect, these two values, 
PERF_COUNTER_GROUP and PERF_COUNTER_SIMPLE would no longer be necessary 
at all.  Other bits would be used to determine what's in the mmap'd samples.

- Corey



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 13/15] perf_counter: provide generic callchain bits
  2009-03-31  7:24       ` Corey Ashford
@ 2009-03-31  8:43         ` Peter Zijlstra
  2009-03-31  9:09           ` Paul Mackerras
  2009-03-31  9:12         ` Paul Mackerras
  1 sibling, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2009-03-31  8:43 UTC (permalink / raw)
  To: Corey Ashford; +Cc: Paul Mackerras, Ingo Molnar, linux-kernel

On Tue, 2009-03-31 at 00:24 -0700, Corey Ashford wrote:
> Peter Zijlstra wrote:
> > On Tue, 2009-03-31 at 17:12 +1100, Paul Mackerras wrote:
> >> Peter Zijlstra writes:
> >>
> >>>  				include_tid    :  1, /* include the tid       */
> >>>  				mmap           :  1, /* include mmap data     */
> >>>  				munmap         :  1, /* include munmap data   */
> >>> +				callchain      :  1, /* add callchain data    */
> >> Interesting, I would have put callchain (and include_tid, also) in
> >> hw_event.record_type rather than as individual 1-bit fields.  The
> >> present arrangement where some selection of what goes into the ring
> >> buffer is in record_type and some is in individual bits seems a bit
> >> awkward.  Plus, with the current arrangement I can't get both the IP
> >> and the values of the other group members, which I might reasonably
> >> want.
> >>
> >> I think either we make record_type bit-significant, or we define
> >> individual bits in hw_event for recording the IP and other group
> >> members.
> >>
> >> There are a couple of other things I want to be able to record on an
> >> event - we have registers on powerpc that give information about the
> >> event that caused the interrupt, and it would be nice to be able to
> >> record them.  (These registers include instruction and data addresses
> >> associated with the event; the instruction address can be further on
> >> from where the interrupt was taken because of out-of-order instruction
> >> execution and because interrupts might be hard-disabled at the point
> >> where the interrupt becomes pending.)
> >>
> >> Those registers would need bits in record_type or in the hw_event to
> >> indicate that we want them recorded.
> > 
> > Sure, I'm fine with moving them to record_type and making that a bit
> > field. I've also considered merging the group and 'simple' output to
> > enable what you say.
> 
> Originally, the record_type field was used to determine what would be 
> read if you read from the fd.  However, now it seems to be what you get 
> from the mmap buffer (only), that there is no longer a way to read the 
> record stream from the fd anymore. Is this true?  

Yes, read() will return the current value of the counter.
mmap() is used for all async data streams.
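
Roughly, from userspace (minimal sketch, error handling omitted and the
sys_perf_counter_open() wrapper from kerneltop.c assumed):

	int fd = sys_perf_counter_open(&hw_event, pid, cpu, -1, 0);
	__u64 value;

	read(fd, &value, sizeof(value));	/* current counter value */

	/* overflow/mmap/munmap records arrive via the ring-buffer */
	void *buf = mmap(NULL, (nr_pages + 1) * page_size,
			 PROT_READ, MAP_SHARED, fd, 0);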

> If so, I like this 
> idea because it simplifies the ABI: there is one way to read the current 
> value of counters and another way to read sample records (via mmap), 
> both of these operations use a single fd.

Correct, glad you like it ;-)

> If this is the case, what is the exact meaning of PERF_COUNTER_SIMPLE 
> now? and PERF_COUNTER_GROUP? 

Legacy mainly. I've been working towards unifying those. As Paul also
suggested, group can be yet another output option.

>  One simplification would be that reading 
> the fd of a group leader would always read up all of the counters in the 
> group (along with their enabled and running times if those bits are 
> set), and that reading a single counter's fd would yield only that 
> counter's values and times (if enabled).  In effect, these two values, 
> PERF_COUNTER_GROUP and PERF_COUNTER_SIMPLE would no longer be necessary 
> at all.  Other bits would be used to determine what's in the mmap'd samples.

Quite so, that sounds sensible, Paul?

Would you be overly bothered by the read() output also having that
{type,size} header we use for the mmap() data?


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 13/15] perf_counter: provide generic callchain bits
  2009-03-31  8:43         ` Peter Zijlstra
@ 2009-03-31  9:09           ` Paul Mackerras
  0 siblings, 0 replies; 55+ messages in thread
From: Paul Mackerras @ 2009-03-31  9:09 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Corey Ashford, Ingo Molnar, linux-kernel

Peter Zijlstra writes:

> >  One simplification would be that reading 
> > the fd of a group leader would always read up all of the counters in the 
> > group (along with their enabled and running times if those bits are 
> > set), and that reading a single counter's fd would yield only that 
> > counter's values and times (if enabled).  In effect, these two values, 
> > PERF_COUNTER_GROUP and PERF_COUNTER_SIMPLE would no longer be necessary 
> > at all.  Other bits would be used to determine what's in the mmap'd samples.
> 
> Quite so, that sounds sensible, Paul?

I think the readout of the other group members (for a group leader)
should be enabled by a bit in hw_event.read_format, say
PERF_FORMAT_GROUP.

There is a slight complication - we would want to read all the
counters in the group atomically, or as close to atomically as we
could get, and we don't have any way to do that at the moment.

> Would you be overly bothered by the read() output also having that
> {type,size} header we use for the mmap() data?

How about a PERF_FORMAT_HEADERS bit in hw_event.read_format to say that
we want to get the headers in the read() output?

With those two extra bits we can fulfill these desires without adding
overhead for programs that don't want the extra stuff, and without
breaking compatibility.
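
I.e. something like this (sketch; the last two bits are only proposals,
the existing values stay whatever they are today):

	enum perf_counter_read_format {
		PERF_FORMAT_TOTAL_TIME_ENABLED	= 1,
		PERF_FORMAT_TOTAL_TIME_RUNNING	= 2,
		PERF_FORMAT_GROUP		= 4,	/* leader read() returns the whole group */
		PERF_FORMAT_HEADERS		= 8,	/* prefix read() data with {type,size} headers */
	};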

Paul.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 13/15] perf_counter: provide generic callchain bits
  2009-03-31  7:24       ` Corey Ashford
  2009-03-31  8:43         ` Peter Zijlstra
@ 2009-03-31  9:12         ` Paul Mackerras
  2009-03-31 14:00           ` Peter Zijlstra
  1 sibling, 1 reply; 55+ messages in thread
From: Paul Mackerras @ 2009-03-31  9:12 UTC (permalink / raw)
  To: Corey Ashford; +Cc: Peter Zijlstra, Ingo Molnar, linux-kernel

Corey Ashford writes:

> If this is the case, what is the exact meaning of PERF_COUNTER_SIMPLE 
> now? and PERF_COUNTER_GROUP?  One simplification would be that reading 
> the fd of a group leader would always read up all of the counters in the 
> group (along with their enabled and running times if those bits are 
> set), and that reading a single counter's fd would yield only that 
> counter's values and times (if enabled).  In effect, these two values, 
> PERF_COUNTER_GROUP and PERF_COUNTER_SIMPLE would no longer be necessary 
> at all.  Other bits would be used to determine what's in the mmap'd samples.

Now that events are only read through mmap, it's quite simple -
hw_event.read_format controls what read() gives you, and
hw_event.record_type controls what gets put into the pages that you
get with mmap.
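
I.e., roughly (illustrative values only):

	hw_event.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
			       PERF_FORMAT_TOTAL_TIME_RUNNING;	/* what read() returns */
	hw_event.record_type = PERF_RECORD_IRQ;		/* what goes into the mmap pages */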

Currently read_format and record_type don't use the same set of bit
definitions.  Maybe they should?  I don't have a strong feeling about
it, but that might be a nice simplification.

Paul.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 13/15] perf_counter: provide generic callchain bits
  2009-03-31  9:12         ` Paul Mackerras
@ 2009-03-31 14:00           ` Peter Zijlstra
  2009-03-31 17:11             ` Corey Ashford
  2009-04-01  3:48             ` Paul Mackerras
  0 siblings, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2009-03-31 14:00 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Corey Ashford, Ingo Molnar, linux-kernel

On Tue, 2009-03-31 at 20:12 +1100, Paul Mackerras wrote:
> Corey Ashford writes:
> 
> > If this is the case, what is the exact meaning of PERF_COUNTER_SIMPLE 
> > now? and PERF_COUNTER_GROUP?  One simplification would be that reading 
> > the fd of a group leader would always read up all of the counters in the 
> > group (along with their enabled and running times if those bits are 
> > set), and that reading a single counter's fd would yield only that 
> > counter's values and times (if enabled).  In effect, these two values, 
> > PERF_COUNTER_GROUP and PERF_COUNTER_SIMPLE would no longer be necessary 
> > at all.  Other bits would be used to determine what's in the mmap'd samples.
> 
> Now that events are only read through mmap, it's quite simple -
> hw_event.read_format controls what read() gives you, and
> hw_event.record_type controls what gets put into the pages that you
> get with mmap.
> 
> Currently read_format and record_type don't use the same set of bit
> definitions.  Maybe they should?  I don't have a strong feeling about
> it, but that might be a nice simplification.

Something like the below? 

That still has the record and read things separate, but as one unified
overflow output.

I reduced the record and read fields to 32 bits, as the output type only
has 32 bits.

diff did a wonderful job making the patch unreadable :/

---
Subject: 
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Tue Mar 31 15:13:36 CEST 2009

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/perf_counter.h |   37 +++++++---------
 kernel/perf_counter.c        |   99 ++++++++++++++++---------------------------
 2 files changed, 57 insertions(+), 79 deletions(-)

Index: linux-2.6/include/linux/perf_counter.h
===================================================================
--- linux-2.6.orig/include/linux/perf_counter.h
+++ linux-2.6/include/linux/perf_counter.h
@@ -73,15 +73,6 @@ enum sw_event_ids {
 	PERF_SW_EVENTS_MAX		= 7,
 };
 
-/*
- * IRQ-notification data record type:
- */
-enum perf_counter_record_type {
-	PERF_RECORD_SIMPLE		= 0,
-	PERF_RECORD_IRQ			= 1,
-	PERF_RECORD_GROUP		= 2,
-};
-
 #define __PERF_COUNTER_MASK(name) 			\
 	(((1ULL << PERF_COUNTER_##name##_BITS) - 1) <<	\
 	 PERF_COUNTER_##name##_SHIFT)
@@ -103,6 +94,17 @@ enum perf_counter_record_type {
 #define PERF_COUNTER_EVENT_MASK		__PERF_COUNTER_MASK(EVENT)
 
 /*
+ * Bits that can be set in hw_event.record_type to request information
+ * in the overflow packets.
+ */
+enum perf_counter_record_format {
+	PERF_RECORD_IP		= 1U << 0,
+	PERF_RECORD_TID		= 1U << 1,
+	PERF_RECORD_GROUP	= 1U << 2,
+	PERF_RECORD_CALLCHAIN	= 1U << 3,
+};
+
+/*
  * Bits that can be set in hw_event.read_format to request that
  * reads on the counter should return the indicated quantities,
  * in increasing order of bit value, after the counter value.
@@ -125,8 +127,8 @@ struct perf_counter_hw_event {
 	__u64			config;
 
 	__u64			irq_period;
-	__u64			record_type;
-	__u64			read_format;
+	__u32			record_type;
+	__u32			read_format;
 
 	__u64			disabled       :  1, /* off by default        */
 				nmi	       :  1, /* NMI sampling          */
@@ -137,12 +139,10 @@ struct perf_counter_hw_event {
 				exclude_kernel :  1, /* ditto kernel          */
 				exclude_hv     :  1, /* ditto hypervisor      */
 				exclude_idle   :  1, /* don't count when idle */
-				include_tid    :  1, /* include the tid       */
 				mmap           :  1, /* include mmap data     */
 				munmap         :  1, /* include munmap data   */
-				callchain      :  1, /* add callchain data    */
 
-				__reserved_1   : 51;
+				__reserved_1   : 53;
 
 	__u32			extra_config_len;
 	__u32			__reserved_4;
@@ -212,15 +212,14 @@ struct perf_event_header {
 
 enum perf_event_type {
 
-	PERF_EVENT_GROUP	= 1,
-
-	PERF_EVENT_MMAP		= 2,
-	PERF_EVENT_MUNMAP	= 3,
+	PERF_EVENT_MMAP		= 1,
+	PERF_EVENT_MUNMAP	= 2,
 
 	PERF_EVENT_OVERFLOW	= 1UL << 31,
 	__PERF_EVENT_IP		= 1UL << 30,
 	__PERF_EVENT_TID	= 1UL << 29,
-	__PERF_EVENT_CALLCHAIN  = 1UL << 28,
+	__PERF_EVENT_GROUP	= 1UL << 28,
+	__PERF_EVENT_CALLCHAIN  = 1UL << 27,
 };
 
 #ifdef __KERNEL__
Index: linux-2.6/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/kernel/perf_counter.c
+++ linux-2.6/kernel/perf_counter.c
@@ -1765,27 +1765,34 @@ static void perf_output_end(struct perf_
 	rcu_read_unlock();
 }
 
-static void perf_output_simple(struct perf_counter *counter,
-			       int nmi, struct pt_regs *regs)
+void perf_counter_output(struct perf_counter *counter,
+			 int nmi, struct pt_regs *regs)
 {
 	int ret;
+	u64 record_type = counter->hw_event.record_type;
 	struct perf_output_handle handle;
 	struct perf_event_header header;
 	u64 ip;
 	struct {
 		u32 pid, tid;
 	} tid_entry;
+	struct {
+		u64 event;
+		u64 counter;
+	} group_entry;
 	struct perf_callchain_entry *callchain = NULL;
 	int callchain_size = 0;
 
 	header.type = PERF_EVENT_OVERFLOW;
 	header.size = sizeof(header);
 
-	ip = instruction_pointer(regs);
-	header.type |= __PERF_EVENT_IP;
-	header.size += sizeof(ip);
+	if (record_type & PERF_RECORD_IP) {
+		ip = instruction_pointer(regs);
+		header.type |= __PERF_EVENT_IP;
+		header.size += sizeof(ip);
+	}
 
-	if (counter->hw_event.include_tid) {
+	if (record_type & PERF_RECORD_TID) {
 		/* namespace issues */
 		tid_entry.pid = current->group_leader->pid;
 		tid_entry.tid = current->pid;
@@ -1794,7 +1801,13 @@ static void perf_output_simple(struct pe
 		header.size += sizeof(tid_entry);
 	}
 
-	if (counter->hw_event.callchain) {
+	if (record_type & PERF_RECORD_GROUP) {
+		header.type |= __PERF_EVENT_GROUP;
+		header.size += sizeof(u64) +
+			counter->nr_siblings * sizeof(group_entry);
+	}
+
+	if (record_type & PERF_RECORD_CALLCHAIN) {
 		callchain = perf_callchain(regs);
 
 		if (callchain) {
@@ -1810,69 +1823,35 @@ static void perf_output_simple(struct pe
 		return;
 
 	perf_output_put(&handle, header);
-	perf_output_put(&handle, ip);
 
-	if (counter->hw_event.include_tid)
-		perf_output_put(&handle, tid_entry);
+	if (record_type & PERF_RECORD_IP)
+		perf_output_put(&handle, ip);
 
-	if (callchain)
-		perf_output_copy(&handle, callchain, callchain_size);
-
-	perf_output_end(&handle);
-}
-
-static void perf_output_group(struct perf_counter *counter, int nmi)
-{
-	struct perf_output_handle handle;
-	struct perf_event_header header;
-	struct perf_counter *leader, *sub;
-	unsigned int size;
-	struct {
-		u64 event;
-		u64 counter;
-	} entry;
-	int ret;
-
-	size = sizeof(header) + counter->nr_siblings * sizeof(entry);
+	if (record_type & PERF_RECORD_TID)
+		perf_output_put(&handle, tid_entry);
 
-	ret = perf_output_begin(&handle, counter, size, nmi);
-	if (ret)
-		return;
+	if (record_type & PERF_RECORD_GROUP) {
+		struct perf_counter *leader, *sub;
+		u64 nr = counter->nr_siblings;
 
-	header.type = PERF_EVENT_GROUP;
-	header.size = size;
+		perf_output_put(&handle, nr);
 
-	perf_output_put(&handle, header);
+		leader = counter->group_leader;
+		list_for_each_entry(sub, &leader->sibling_list, list_entry) {
+			if (sub != counter)
+				sub->hw_ops->read(sub);
 
-	leader = counter->group_leader;
-	list_for_each_entry(sub, &leader->sibling_list, list_entry) {
-		if (sub != counter)
-			sub->hw_ops->read(sub);
+			group_entry.event = sub->hw_event.config;
+			group_entry.counter = atomic64_read(&sub->count);
 
-		entry.event = sub->hw_event.config;
-		entry.counter = atomic64_read(&sub->count);
-
-		perf_output_put(&handle, entry);
+			perf_output_put(&handle, group_entry);
+		}
 	}
 
-	perf_output_end(&handle);
-}
-
-void perf_counter_output(struct perf_counter *counter,
-			 int nmi, struct pt_regs *regs)
-{
-	switch (counter->hw_event.record_type) {
-	case PERF_RECORD_SIMPLE:
-		return;
-
-	case PERF_RECORD_IRQ:
-		perf_output_simple(counter, nmi, regs);
-		break;
+	if (callchain)
+		perf_output_copy(&handle, callchain, callchain_size);
 
-	case PERF_RECORD_GROUP:
-		perf_output_group(counter, nmi);
-		break;
-	}
+	perf_output_end(&handle);
 }
 
 /*



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 13/15] perf_counter: provide generic callchain bits
  2009-03-31 14:00           ` Peter Zijlstra
@ 2009-03-31 17:11             ` Corey Ashford
  2009-04-01  3:48             ` Paul Mackerras
  1 sibling, 0 replies; 55+ messages in thread
From: Corey Ashford @ 2009-03-31 17:11 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Paul Mackerras, Ingo Molnar, linux-kernel

Peter Zijlstra wrote:
> On Tue, 2009-03-31 at 20:12 +1100, Paul Mackerras wrote:
>> Corey Ashford writes:
>>
>>> If this is the case, what is the exact meaning of PERF_COUNTER_SIMPLE 
>>> now? and PERF_COUNTER_GROUP?  One simplification would be that reading 
>>> the fd of a group leader would always read up all of the counters in the 
>>> group (along with their enabled and running times if those bits are 
>>> set), and that reading a single counter's fd would yield only that 
>>> counter's values and times (if enabled).  In effect, these two values, 
>>> PERF_COUNTER_GROUP and PERF_COUNTER_SIMPLE would no longer be necessary 
>>> at all.  Other bits would be used to determine what's in the mmap'd samples.
>> Now that events are only read through mmap, it's quite simple -
>> hw_event.read_format controls what read() gives you, and
>> hw_event.record_type controls what gets put into the pages that you
>> get with mmap.
>>
>> Currently read_format and record_type don't use the same set of bit
>> definitions.  Maybe they should?  I don't have a strong feeling about
>> it, but that might be a nice simplification.
> 
> Something like the below? 
> 
> That still has the record and read things separate, but as one unified
> overflow output.
> 
> I reduced record and read fields to 32bits, as the output type only has
> 32bits.
> 
> diff did a wonderful job making the patch unreadable :/
>

This looks great!  I like the idea of two bit vectors to describe what 
you get when you read and the format of the mmap'd data.  And I like the 
ability not to include the header on each record if the user space 
program knows the exact format of the sample records.

Paul is right, too, that having a read_format bit to determine whether 
reading the leader gives all of the counters or just the leader is a 
nice addition.  I can't think of when I'd use it, but it's a good 
capability.

- Corey


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 13/15] perf_counter: provide generic callchain bits
  2009-03-31 14:00           ` Peter Zijlstra
  2009-03-31 17:11             ` Corey Ashford
@ 2009-04-01  3:48             ` Paul Mackerras
  2009-04-01  7:59               ` Peter Zijlstra
  1 sibling, 1 reply; 55+ messages in thread
From: Paul Mackerras @ 2009-04-01  3:48 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Corey Ashford, Ingo Molnar, linux-kernel

Peter Zijlstra writes:

> That still has the record and read things separate, but as one unified
> overflow output.

I take it PERF_EVENT_OVERFLOW refers to counter overflow, not ring
buffer overflow?  That had me confused for a bit, so more explicit
naming, or at least some comments, would be good.

>  /*
> + * Bits that can be set in hw_event.record_type to request information
> + * in the overflow packets.
> + */
> +enum perf_counter_record_format {
> +	PERF_RECORD_IP		= 1U << 0,
> +	PERF_RECORD_TID		= 1U << 1,
> +	PERF_RECORD_GROUP	= 1U << 2,
> +	PERF_RECORD_CALLCHAIN	= 1U << 3,
> +};
[snip]
>  enum perf_event_type {
>  
> -	PERF_EVENT_GROUP	= 1,
> -
> -	PERF_EVENT_MMAP		= 2,
> -	PERF_EVENT_MUNMAP	= 3,
> +	PERF_EVENT_MMAP		= 1,
> +	PERF_EVENT_MUNMAP	= 2,
>  
>  	PERF_EVENT_OVERFLOW	= 1UL << 31,
>  	__PERF_EVENT_IP		= 1UL << 30,
>  	__PERF_EVENT_TID	= 1UL << 29,
> -	__PERF_EVENT_CALLCHAIN  = 1UL << 28,
> +	__PERF_EVENT_GROUP	= 1UL << 28,
> +	__PERF_EVENT_CALLCHAIN  = 1UL << 27,

Could we use the same value (and even the same name) for
PERF_RECORD_IP/__PERF_EVENT_IP, PERF_RECORD_TID/__PERF_EVENT_TID,
etc.?

Also, I haven't looked at the callchain stuff much, but does the
callchain info contain a recognizable end delimiter?  At present the
callchain comes last, but if we add more output elements they'll
presumably go after it, so working out where the callchain ends may
become tricky if we're not careful.

Also, let's add PERF_RECORD/PERF_EVENT bits for:

* EVENT_INSTR_ADDR
* EVENT_DATA_ADDR
* EVENT_INSTR_FLAGS
* EVENT_CPU_FLAGS	(so we can distinguish hypervisor/kernel/user)

We would have to call into arch code to get the values for these.

Paul.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 13/15] perf_counter: provide generic callchain bits
  2009-04-01  3:48             ` Paul Mackerras
@ 2009-04-01  7:59               ` Peter Zijlstra
  2009-04-01  8:45                 ` Paul Mackerras
  2009-04-01 23:25                 ` Corey Ashford
  0 siblings, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2009-04-01  7:59 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Corey Ashford, Ingo Molnar, linux-kernel

On Wed, 2009-04-01 at 14:48 +1100, Paul Mackerras wrote:
> Peter Zijlstra writes:
> 
> > That still has the record and read things separate, but as one unified
> > overflow output.
> 
> I take it PERF_EVENT_OVERFLOW refers to counter overflow, not ring
> buffer overflow?  That had me confused for a bit, so more explicit
> naming, or at least some comments, would be good.

Ah, yes, I see how that can confuse. PERF_EVENT_COUNTER_OVERFLOW then?

I was thinking about doing splice() support and that could also generate
actual event overflow events ;-)

> >  /*
> > + * Bits that can be set in hw_event.record_type to request information
> > + * in the overflow packets.
> > + */
> > +enum perf_counter_record_format {
> > +	PERF_RECORD_IP		= 1U << 0,
> > +	PERF_RECORD_TID		= 1U << 1,
> > +	PERF_RECORD_GROUP	= 1U << 2,
> > +	PERF_RECORD_CALLCHAIN	= 1U << 3,
> > +};
> [snip]
> >  enum perf_event_type {
> >  
> > -	PERF_EVENT_GROUP	= 1,
> > -
> > -	PERF_EVENT_MMAP		= 2,
> > -	PERF_EVENT_MUNMAP	= 3,
> > +	PERF_EVENT_MMAP		= 1,
> > +	PERF_EVENT_MUNMAP	= 2,
> >  
> >  	PERF_EVENT_OVERFLOW	= 1UL << 31,
> >  	__PERF_EVENT_IP		= 1UL << 30,
> >  	__PERF_EVENT_TID	= 1UL << 29,
> > -	__PERF_EVENT_CALLCHAIN  = 1UL << 28,
> > +	__PERF_EVENT_GROUP	= 1UL << 28,
> > +	__PERF_EVENT_CALLCHAIN  = 1UL << 27,
> 
> Could we use the same value (and even the same name) for
> PERF_RECORD_IP/__PERF_EVENT_IP, PERF_RECORD_TID/__PERF_EVENT_TID,
> etc.?

I suppose we could.

> Also, I haven't looked at the callchain stuff much, but does the
> callchain info contain a recognizable end delimiter?  At present the
> callchain comes last, but if we add more output elements they'll
> presumably go after it, so working out where the callchain ends may
> become tricky if we're not careful.

It writes:

struct callchain_event {
	u64 nr
	u64 ips[nr]
};

So the first number tells how many ips are in the callchain, which
should be good enough to figure out where it ends.
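
I.e. a parser can simply do (sketch):

	u64 nr	 = *(u64 *)ptr;
	u64 *ips = (u64 *)ptr + 1;

	ptr += (1 + nr) * sizeof(u64);	/* whatever comes next starts here */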

> Also, let's add PERF_RECORD/PERF_EVENT bits for:
> 
> * EVENT_INSTR_ADDR

I'm failing to come up with what this could be..

> * EVENT_DATA_ADDR

This would be the data address operated upon? Like what address caused
the fault/cache-miss, etc?

> * EVENT_INSTR_FLAGS

Again not quite sure what this would be.

> * EVENT_CPU_FLAGS	(so we can distinguish hypervisor/kernel/user)

Currently we can, based on the address: an IP < 0 is kernel and > 0 is
userspace. But yeah, I see how this makes life easier.

> We would have to call into arch code to get the values for these.

I suppose all these things can be gleaned from pt_regs, so that should
be doable.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 13/15] perf_counter: provide generic callchain bits
  2009-04-01  7:59               ` Peter Zijlstra
@ 2009-04-01  8:45                 ` Paul Mackerras
  2009-04-01 10:00                   ` Ingo Molnar
  2009-04-01 23:25                 ` Corey Ashford
  1 sibling, 1 reply; 55+ messages in thread
From: Paul Mackerras @ 2009-04-01  8:45 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Corey Ashford, Ingo Molnar, linux-kernel

Peter Zijlstra writes:

> Ah, yes, I see how that can confuse. PERF_EVENT_COUNTER_OVERFLOW then?

Sounds reasonable.

> > Also, let's add PERF_RECORD/PERF_EVENT bits for:
> > 
> > * EVENT_INSTR_ADDR
> 
> I'm failing to come up with what this could be..

So, you have lots of instructions in flight in the processor, and one
of them causes an event that increments a counter and causes it to
overflow, so an interrupt request is generated.  Even if the interrupt
is taken "immediately", it can still happen that the set of
instructions the processor decides to complete before taking the
interrupt includes some instructions after the instruction that caused
the counter to overflow, and of course if interrupts are (hard-)
disabled at the time of the overflow, the interrupt will happen
later.  That means that the IP from the pt_regs is not generally a
reliable indication of which instruction made the counter overflow.

On POWER processors we have a register which gives us a much more
reliable indication of which instruction caused the counter overflow,
at least in those cases where the event can be attributed to a
specific instruction.  This EVENT_INSTR_ADDR bit would ask for that
register to be sampled and recorded.

> > * EVENT_DATA_ADDR
> 
> This would be the data address operated upon? Like what address caused
> the fault/cache-miss, etc?

That's right.  POWER processors have a register that records that
where possible.

> > * EVENT_INSTR_FLAGS
> 
> Again not quite sure what this would be.

POWER processors have a register that records information about the
instruction that caused the counter overflow, such as did it have a
data address associated with it, did it cause a dcache miss, etc.

> > * EVENT_CPU_FLAGS	(so we can distinguish hypervisor/kernel/user)
> 
> Currently we can distinguish those based on address: an IP < 0 is kernel
> and > 0 is userspace, but yeah, I see how this makes life easier.

We can't distinguish hypervisor addresses that way, and on some
architectures (including x86_32 with a 4G/4G split) we can't
distinguish kernel/user just by the address.  I was thinking the cpu
flags would also include things like interrupt enable state, FPU
enable state, etc.

> > We would have to call into arch code to get the values for these.
> 
> I suppose all these things can be gleaned from pt_regs, so that should
> be doable.

Hmmm, 3 of the 4 would require SPR (= what x86 calls MSR) reads.
Maybe it's best to have a block of arch-specific bits that can be
defined per-arch and implemented in arch code.
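
For illustration only, an arch hook sampling those SPRs on POWER could
look roughly like this (SPRN_SIAR/SPRN_SDAR as in the powerpc reg
definitions; validity checking and the output record layout are left
out and would be arch-specific):

/* sketch: sample the precise instruction and data addresses on POWER */
static void power_sample_precise_addrs(u64 *instr_addr, u64 *data_addr)
{
	*instr_addr = mfspr(SPRN_SIAR);	/* sampled instruction address */
	*data_addr  = mfspr(SPRN_SDAR);	/* sampled data address, when valid */
}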

Paul.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/15] perf_counter: unify and fix delayed counter wakeup
  2009-03-31  6:37     ` Peter Zijlstra
@ 2009-04-01  9:04       ` Ingo Molnar
  0 siblings, 0 replies; 55+ messages in thread
From: Ingo Molnar @ 2009-04-01  9:04 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Paul Mackerras, linux-kernel


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Tue, 2009-03-31 at 16:45 +1100, Paul Mackerras wrote:
> > Peter Zijlstra writes:
> > 
> > > +void perf_counter_wakeup(struct perf_counter *counter)
> > > +{
> > > +	struct perf_mmap_data *data;
> > > +
> > > +	rcu_read_lock();
> > > +	data = rcu_dereference(counter->data);
> > > +	if (data) {
> > > +		(void)atomic_xchg(&data->wakeup, POLL_IN);
> > 
> > Really just a nit, but how is this atomic_xchg any different from
> > atomic_set(&data->wakeup, POLL_IN) aside from being slower?
> 
> Probably, I got my head in a twist, atomic_set() is simply an 
> unlocked assignment (although volatile), and I read the value 
> using a locked xchg().
> 
> I wasn't sure how these two would interact and so I chickened out 
> :-)

The fact that you needed to use an ugly cast to silence the compiler 
should have told you that we kernel developers never chicken out! :)

(And if we do, it's called an orderly tactical retreat.)

	Ingo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 09/15] perf_counter tools: optionally scale counter values in perfstat mode
  2009-03-30 17:07 ` [PATCH 09/15] perf_counter tools: optionally scale counter values in perfstat mode Peter Zijlstra
@ 2009-04-01  9:39   ` Ingo Molnar
  2009-04-01 10:50     ` Paul Mackerras
  2009-04-01 10:14   ` [tip:perfcounters/core] " Paul Mackerras
  1 sibling, 1 reply; 55+ messages in thread
From: Ingo Molnar @ 2009-04-01  9:39 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Paul Mackerras


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> From: Paul Mackerras <paulus@samba.org>
> 
> Impact: new functionality
> 
> This adds an option to the perfstat mode of kerneltop to scale 
> the reported counter values according to the fraction of time that 
> each counter gets to count.  This is invoked with the -l option (I 
> used 'l' because s, c, a and e were all taken already.)  This uses 
> the new PERF_RECORD_TOTAL_TIME_{ENABLED,RUNNING} read format 
> options.
> 
> With this, we get output like this:
> 
> $ ./perfstat -l -e 0:0,0:1,0:2,0:3,0:4,0:5 ./spin
> 
>  Performance counter stats for './spin':
> 
>      4016072055  CPU cycles           (events)  (scaled from 66.53%)
>      2005887318  instructions         (events)  (scaled from 66.53%)
>         1762849  cache references     (events)  (scaled from 66.69%)
>          165229  cache misses         (events)  (scaled from 66.85%)
>      1001298009  branches             (events)  (scaled from 66.78%)
>           41566  branch misses        (events)  (scaled from 66.61%)
> 
>  Wall-clock time elapsed:  2438.227446 msecs
> 
> This also lets us detect when a counter is zero because the 
> counter never got to go on the CPU at all.  In that case we print 
> <not counted> rather than 0.

Nice! Shouldn't this be the default mode of output?

I think we want to make users aware of the fact that output is 
sampled due to over-commit, rather than precise, agreed?

	Ingo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 13/15] perf_counter: provide generic callchain bits
  2009-04-01  8:45                 ` Paul Mackerras
@ 2009-04-01 10:00                   ` Ingo Molnar
  2009-04-01 11:53                     ` Paul Mackerras
  0 siblings, 1 reply; 55+ messages in thread
From: Ingo Molnar @ 2009-04-01 10:00 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Peter Zijlstra, Corey Ashford, linux-kernel


* Paul Mackerras <paulus@samba.org> wrote:

> Peter Zijlstra writes:
> 
> > Ah, yes, I see how that can confuse. PERF_EVENT_COUNTER_OVERFLOW then?
> 
> Sounds reasonable.
> 
> > > Also, let's add PERF_RECORD/PERF_EVENT bits for:
> > > 
> > > * EVENT_INSTR_ADDR
> > 
> > I'm failing to come up with what this could be..
> 
> So, you have lots of instructions in flight in the processor, and 
> one of them causes an event that increments a counter and causes 
> it to overflow, so an interrupt request is generated.  Even if the 
> interrupt is taken "immediately", it can still happen that the set 
> of instructions the processor decides to complete before taking 
> the interrupt includes some instructions after the instruction 
> that caused the counter to overflow, and of course if interrupts 
> are (hard-) disabled at the time of the overflow, the interrupt 
> will happen later.  That means that the IP from the pt_regs is not 
> generally a reliable indication of which instruction made the 
> counter overflow.
> 
> On POWER processors we have a register which gives us a much more 
> reliable indication of which instruction caused the counter 
> overflow, at least in those cases where the event can be 
> attributed to a specific instruction.  This EVENT_INSTR_ADDR bit 
> would ask for that register to be sampled and recorded.

So it's a bit like PEBS and IBS on the x86, right?

In theory one could simply override the sampled ptregs->ip with this 
more precise register value. The instruction where the IRQ hit is 
probably meaningless, if more precise information is available. But 
we can have both too, I guess.

The data address extension definitely makes sense - it can be used 
for a profile view along the data symbol dimension, instead of 
the usual function symbol dimension.

CPU flags makes sense too - irqs-off can help the annotation of 
source code sections where the profiler sees that irqs were 
disabled.

It seems here we gradually descend into arch-specific CPU state 
technicalities and it's not immediately obvious where to draw the 
line.

Call-chain and data address abstractions are clear. CPU flags is 
less clear: we could perhaps split off the irq state and the 
privilege level information - that is present on all CPUs.

The rest should probably be opaque and not generalized.

_Perhaps_, to stem the inevitable list of such small details, it 
might make sense to have a record type with signal frame qualities - 
which would include most of this info. That would mix well with the 
planned feature of signal generation anyway, right?

I.e. we could extend the lowlevel sigcontext signal frame generation 
code in arch/x86/kernel/signal.c (and its powerpc equivalent) to 
generate a signal frame but output it into the mmap buffer, not into 
the userspace stack - and we would not actually execute a signal in 
that context.

 [ of course, when the counter is configured to generate a signal 
   that is done too. The code would be dual purpose. ]

So user-space would get a fully signal-frame-compatible record - and 
we'd not have to create a per-arch ABI for this because we'd piggyback 
on the signal frame format.

We could add SA_NOFPU support for fast-track integer-registers-only 
frames, etc.

Hm?

	Ingo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [tip:perfcounters/core] perf_counter: unify and fix delayed counter wakeup
  2009-03-30 17:07 ` [PATCH 01/15] perf_counter: unify and fix delayed counter wakeup Peter Zijlstra
  2009-03-31  5:45   ` Paul Mackerras
@ 2009-04-01 10:12   ` Peter Zijlstra
  1 sibling, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2009-04-01 10:12 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, paulus, hpa, mingo, a.p.zijlstra, tglx, mingo

Commit-ID:  ca68d0c0088fcb93bd47b5e8e3b94fd86ac0ca07
Gitweb:     http://git.kernel.org/tip/ca68d0c0088fcb93bd47b5e8e3b94fd86ac0ca07
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Mon, 30 Mar 2009 19:07:02 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 1 Apr 2009 11:33:32 +0200

perf_counter: unify and fix delayed counter wakeup

While going over the wakeup code I noticed delayed wakeups only work
for hardware counters but basically all software counters rely on
them.

This patch unifies and generalizes the delayed wakeup to fix this
issue.

Since we're dealing with NMI context bits here, use a cmpxchg() based
single link list implementation to track counters that have pending
wakeups.

[ This should really be generic code for delayed wakeups, but since we
  cannot use cmpxchg()/xchg() in generic code, I've let it live in the
  perf_counter code. -- Eric Dumazet could use it to aggregate the
  network wakeups. ]

Furthermore, the x86 method of using TIF flags was flawed in that it's
quite possible to end up setting the bit on the idle task, losing the
wakeup.

The powerpc method uses per-cpu storage and does appear to be
sufficient.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20090330171023.153932974@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 arch/powerpc/include/asm/hw_irq.h   |    4 +-
 arch/powerpc/kernel/irq.c           |    2 +-
 arch/powerpc/kernel/perf_counter.c  |   22 +------
 arch/x86/include/asm/perf_counter.h |    5 +-
 arch/x86/include/asm/thread_info.h  |    4 +-
 arch/x86/kernel/cpu/perf_counter.c  |   29 --------
 arch/x86/kernel/signal.c            |    6 --
 include/linux/perf_counter.h        |   15 +++--
 kernel/perf_counter.c               |  128 ++++++++++++++++++++++++++++++++--
 kernel/timer.c                      |    3 +
 10 files changed, 142 insertions(+), 76 deletions(-)

diff --git a/arch/powerpc/include/asm/hw_irq.h b/arch/powerpc/include/asm/hw_irq.h
index 94361c0..fcd643d 100644
--- a/arch/powerpc/include/asm/hw_irq.h
+++ b/arch/powerpc/include/asm/hw_irq.h
@@ -132,7 +132,7 @@ static inline int irqs_disabled_flags(unsigned long flags)
 struct hw_interrupt_type;
 
 #ifdef CONFIG_PERF_COUNTERS
-static inline unsigned long get_perf_counter_pending(void)
+static inline unsigned long test_perf_counter_pending(void)
 {
 	unsigned long x;
 
@@ -160,7 +160,7 @@ extern void perf_counter_do_pending(void);
 
 #else
 
-static inline unsigned long get_perf_counter_pending(void)
+static inline unsigned long test_perf_counter_pending(void)
 {
 	return 0;
 }
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 2269976..32e9fe8 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -135,7 +135,7 @@ notrace void raw_local_irq_restore(unsigned long en)
 			iseries_handle_interrupts();
 	}
 
-	if (get_perf_counter_pending()) {
+	if (test_perf_counter_pending()) {
 		clear_perf_counter_pending();
 		perf_counter_do_pending();
 	}
diff --git a/arch/powerpc/kernel/perf_counter.c b/arch/powerpc/kernel/perf_counter.c
index df007fe..cde720f 100644
--- a/arch/powerpc/kernel/perf_counter.c
+++ b/arch/powerpc/kernel/perf_counter.c
@@ -650,24 +650,6 @@ hw_perf_counter_init(struct perf_counter *counter)
 }
 
 /*
- * Handle wakeups.
- */
-void perf_counter_do_pending(void)
-{
-	int i;
-	struct cpu_hw_counters *cpuhw = &__get_cpu_var(cpu_hw_counters);
-	struct perf_counter *counter;
-
-	for (i = 0; i < cpuhw->n_counters; ++i) {
-		counter = cpuhw->counter[i];
-		if (counter && counter->wakeup_pending) {
-			counter->wakeup_pending = 0;
-			wake_up(&counter->waitq);
-		}
-	}
-}
-
-/*
  * A counter has overflowed; update its count and record
  * things if requested.  Note that interrupts are hard-disabled
  * here so there is no possibility of being interrupted.
@@ -720,7 +702,7 @@ static void perf_counter_interrupt(struct pt_regs *regs)
 	struct cpu_hw_counters *cpuhw = &__get_cpu_var(cpu_hw_counters);
 	struct perf_counter *counter;
 	long val;
-	int need_wakeup = 0, found = 0;
+	int found = 0;
 
 	for (i = 0; i < cpuhw->n_counters; ++i) {
 		counter = cpuhw->counter[i];
@@ -761,7 +743,7 @@ static void perf_counter_interrupt(struct pt_regs *regs)
 	 * immediately; otherwise we'll have do the wakeup when interrupts
 	 * get soft-enabled.
 	 */
-	if (get_perf_counter_pending() && regs->softe) {
+	if (test_perf_counter_pending() && regs->softe) {
 		irq_enter();
 		clear_perf_counter_pending();
 		perf_counter_do_pending();
diff --git a/arch/x86/include/asm/perf_counter.h b/arch/x86/include/asm/perf_counter.h
index 1662043..e2b0e66 100644
--- a/arch/x86/include/asm/perf_counter.h
+++ b/arch/x86/include/asm/perf_counter.h
@@ -84,8 +84,9 @@ union cpuid10_edx {
 #define MSR_ARCH_PERFMON_FIXED_CTR2			0x30b
 #define X86_PMC_IDX_FIXED_BUS_CYCLES			(X86_PMC_IDX_FIXED + 2)
 
-#define set_perf_counter_pending()	\
-		set_tsk_thread_flag(current, TIF_PERF_COUNTERS);
+#define set_perf_counter_pending()	do { } while (0)
+#define clear_perf_counter_pending()	do { } while (0)
+#define test_perf_counter_pending()	(0)
 
 #ifdef CONFIG_PERF_COUNTERS
 extern void init_hw_perf_counters(void);
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 3ffd5d2..8820a73 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -83,7 +83,6 @@ struct thread_info {
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
 #define TIF_SECCOMP		8	/* secure computing */
 #define TIF_MCE_NOTIFY		10	/* notify userspace of an MCE */
-#define TIF_PERF_COUNTERS	11	/* notify perf counter work */
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* 32bit process */
 #define TIF_FORK		18	/* ret_from_fork */
@@ -107,7 +106,6 @@ struct thread_info {
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_MCE_NOTIFY		(1 << TIF_MCE_NOTIFY)
-#define _TIF_PERF_COUNTERS	(1 << TIF_PERF_COUNTERS)
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
 #define _TIF_FORK		(1 << TIF_FORK)
@@ -141,7 +139,7 @@ struct thread_info {
 
 /* Only used for 64 bit */
 #define _TIF_DO_NOTIFY_MASK						\
-	(_TIF_SIGPENDING|_TIF_MCE_NOTIFY|_TIF_PERF_COUNTERS|_TIF_NOTIFY_RESUME)
+	(_TIF_SIGPENDING|_TIF_MCE_NOTIFY|_TIF_NOTIFY_RESUME)
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW							\
diff --git a/arch/x86/kernel/cpu/perf_counter.c b/arch/x86/kernel/cpu/perf_counter.c
index 3f95b0c..7aab177 100644
--- a/arch/x86/kernel/cpu/perf_counter.c
+++ b/arch/x86/kernel/cpu/perf_counter.c
@@ -227,7 +227,6 @@ static int __hw_perf_counter_init(struct perf_counter *counter)
 		 */
 		hwc->config |= pmc_ops->event_map(perf_event_id(hw_event));
 	}
-	counter->wakeup_pending = 0;
 
 	return 0;
 }
@@ -773,34 +772,6 @@ void smp_perf_counter_interrupt(struct pt_regs *regs)
 	irq_exit();
 }
 
-/*
- * This handler is triggered by NMI contexts:
- */
-void perf_counter_notify(struct pt_regs *regs)
-{
-	struct cpu_hw_counters *cpuc;
-	unsigned long flags;
-	int bit, cpu;
-
-	local_irq_save(flags);
-	cpu = smp_processor_id();
-	cpuc = &per_cpu(cpu_hw_counters, cpu);
-
-	for_each_bit(bit, cpuc->used, X86_PMC_IDX_MAX) {
-		struct perf_counter *counter = cpuc->counters[bit];
-
-		if (!counter)
-			continue;
-
-		if (counter->wakeup_pending) {
-			counter->wakeup_pending = 0;
-			wake_up(&counter->waitq);
-		}
-	}
-
-	local_irq_restore(flags);
-}
-
 void perf_counters_lapic_init(int nmi)
 {
 	u32 apic_val;
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 4da412a..e074fe5 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -6,7 +6,6 @@
  *  2000-06-20  Pentium III FXSR, SSE support by Gareth Hughes
  *  2000-2002   x86-64 support by Andi Kleen
  */
-#include <linux/perf_counter.h>
 #include <linux/sched.h>
 #include <linux/mm.h>
 #include <linux/smp.h>
@@ -866,11 +865,6 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 		tracehook_notify_resume(regs);
 	}
 
-	if (thread_info_flags & _TIF_PERF_COUNTERS) {
-		clear_thread_flag(TIF_PERF_COUNTERS);
-		perf_counter_notify(regs);
-	}
-
 #ifdef CONFIG_X86_32
 	clear_thread_flag(TIF_IRET);
 #endif /* CONFIG_X86_32 */
diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
index 6bf67ce..0d83322 100644
--- a/include/linux/perf_counter.h
+++ b/include/linux/perf_counter.h
@@ -275,6 +275,10 @@ struct perf_mmap_data {
 	void 				*data_pages[0];
 };
 
+struct perf_wakeup_entry {
+	struct perf_wakeup_entry *next;
+};
+
 /**
  * struct perf_counter - performance counter kernel representation:
  */
@@ -350,7 +354,7 @@ struct perf_counter {
 	/* poll related */
 	wait_queue_head_t		waitq;
 	/* optional: for NMIs */
-	int				wakeup_pending;
+	struct perf_wakeup_entry	wakeup;
 
 	void (*destroy)(struct perf_counter *);
 	struct rcu_head			rcu_head;
@@ -427,7 +431,7 @@ extern void perf_counter_task_sched_out(struct task_struct *task, int cpu);
 extern void perf_counter_task_tick(struct task_struct *task, int cpu);
 extern void perf_counter_init_task(struct task_struct *child);
 extern void perf_counter_exit_task(struct task_struct *child);
-extern void perf_counter_notify(struct pt_regs *regs);
+extern void perf_counter_do_pending(void);
 extern void perf_counter_print_debug(void);
 extern void perf_counter_unthrottle(void);
 extern u64 hw_perf_save_disable(void);
@@ -461,7 +465,7 @@ static inline void
 perf_counter_task_tick(struct task_struct *task, int cpu)		{ }
 static inline void perf_counter_init_task(struct task_struct *child)	{ }
 static inline void perf_counter_exit_task(struct task_struct *child)	{ }
-static inline void perf_counter_notify(struct pt_regs *regs)		{ }
+static inline void perf_counter_do_pending(void)			{ }
 static inline void perf_counter_print_debug(void)			{ }
 static inline void perf_counter_unthrottle(void)			{ }
 static inline void hw_perf_restore(u64 ctrl)				{ }
@@ -469,8 +473,9 @@ static inline u64 hw_perf_save_disable(void)		      { return 0; }
 static inline int perf_counter_task_disable(void)	{ return -EINVAL; }
 static inline int perf_counter_task_enable(void)	{ return -EINVAL; }
 
-static inline void perf_swcounter_event(u32 event, u64 nr,
-					int nmi, struct pt_regs *regs)	{ }
+static inline void
+perf_swcounter_event(u32 event, u64 nr, int nmi, struct pt_regs *regs)	{ }
+
 #endif
 
 #endif /* __KERNEL__ */
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index 3b862a7..f70ff80 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -1197,8 +1197,12 @@ static void free_counter_rcu(struct rcu_head *head)
 	kfree(counter);
 }
 
+static void perf_pending_sync(struct perf_counter *counter);
+
 static void free_counter(struct perf_counter *counter)
 {
+	perf_pending_sync(counter);
+
 	if (counter->destroy)
 		counter->destroy(counter);
 
@@ -1529,6 +1533,118 @@ static const struct file_operations perf_fops = {
 };
 
 /*
+ * Perf counter wakeup
+ *
+ * If there's data, ensure we set the poll() state and publish everything
+ * to user-space before waking everybody up.
+ */
+
+void perf_counter_wakeup(struct perf_counter *counter)
+{
+	struct perf_mmap_data *data;
+
+	rcu_read_lock();
+	data = rcu_dereference(counter->data);
+	if (data) {
+		(void)atomic_xchg(&data->wakeup, POLL_IN);
+		__perf_counter_update_userpage(counter, data);
+	}
+	rcu_read_unlock();
+
+	wake_up_all(&counter->waitq);
+}
+
+/*
+ * Pending wakeups
+ *
+ * Handle the case where we need to wakeup up from NMI (or rq->lock) context.
+ *
+ * The NMI bit means we cannot possibly take locks. Therefore, maintain a
+ * single linked list and use cmpxchg() to add entries lockless.
+ */
+
+#define PENDING_TAIL ((struct perf_wakeup_entry *)-1UL)
+
+static DEFINE_PER_CPU(struct perf_wakeup_entry *, perf_wakeup_head) = {
+	PENDING_TAIL,
+};
+
+static void perf_pending_queue(struct perf_counter *counter)
+{
+	struct perf_wakeup_entry **head;
+	struct perf_wakeup_entry *prev, *next;
+
+	if (cmpxchg(&counter->wakeup.next, NULL, PENDING_TAIL) != NULL)
+		return;
+
+	head = &get_cpu_var(perf_wakeup_head);
+
+	do {
+		prev = counter->wakeup.next = *head;
+		next = &counter->wakeup;
+	} while (cmpxchg(head, prev, next) != prev);
+
+	set_perf_counter_pending();
+
+	put_cpu_var(perf_wakeup_head);
+}
+
+static int __perf_pending_run(void)
+{
+	struct perf_wakeup_entry *list;
+	int nr = 0;
+
+	list = xchg(&__get_cpu_var(perf_wakeup_head), PENDING_TAIL);
+	while (list != PENDING_TAIL) {
+		struct perf_counter *counter = container_of(list,
+				struct perf_counter, wakeup);
+
+		list = list->next;
+
+		counter->wakeup.next = NULL;
+		/*
+		 * Ensure we observe the unqueue before we issue the wakeup,
+		 * so that we won't be waiting forever.
+		 * -- see perf_not_pending().
+		 */
+		smp_wmb();
+
+		perf_counter_wakeup(counter);
+		nr++;
+	}
+
+	return nr;
+}
+
+static inline int perf_not_pending(struct perf_counter *counter)
+{
+	/*
+	 * If we flush on whatever cpu we run, there is a chance we don't
+	 * need to wait.
+	 */
+	get_cpu();
+	__perf_pending_run();
+	put_cpu();
+
+	/*
+	 * Ensure we see the proper queue state before going to sleep
+	 * so that we do not miss the wakeup. -- see perf_pending_handle()
+	 */
+	smp_rmb();
+	return counter->wakeup.next == NULL;
+}
+
+static void perf_pending_sync(struct perf_counter *counter)
+{
+	wait_event(counter->waitq, perf_not_pending(counter));
+}
+
+void perf_counter_do_pending(void)
+{
+	__perf_pending_run();
+}
+
+/*
  * Output
  */
 
@@ -1611,13 +1727,10 @@ static void perf_output_copy(struct perf_output_handle *handle,
 static void perf_output_end(struct perf_output_handle *handle, int nmi)
 {
 	if (handle->wakeup) {
-		(void)atomic_xchg(&handle->data->wakeup, POLL_IN);
-		__perf_counter_update_userpage(handle->counter, handle->data);
-		if (nmi) {
-			handle->counter->wakeup_pending = 1;
-			set_perf_counter_pending();
-		} else
-			wake_up(&handle->counter->waitq);
+		if (nmi)
+			perf_pending_queue(handle->counter);
+		else
+			perf_counter_wakeup(handle->counter);
 	}
 	rcu_read_unlock();
 }
@@ -2211,7 +2324,6 @@ perf_counter_alloc(struct perf_counter_hw_event *hw_event,
 
 	counter->cpu			= cpu;
 	counter->hw_event		= *hw_event;
-	counter->wakeup_pending		= 0;
 	counter->group_leader		= group_leader;
 	counter->hw_ops			= NULL;
 	counter->ctx			= ctx;
diff --git a/kernel/timer.c b/kernel/timer.c
index ef1c385..8556804 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -37,6 +37,7 @@
 #include <linux/delay.h>
 #include <linux/tick.h>
 #include <linux/kallsyms.h>
+#include <linux/perf_counter.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -1131,6 +1132,8 @@ static void run_timer_softirq(struct softirq_action *h)
 {
 	struct tvec_base *base = __get_cpu_var(tvec_bases);
 
+	perf_counter_do_pending();
+
 	hrtimer_run_pending();
 
 	if (time_after_eq(jiffies, base->timer_jiffies))

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [tip:perfcounters/core] perf_counter: fix update_userpage()
  2009-03-30 17:07 ` [PATCH 02/15] perf_counter: fix update_userpage() Peter Zijlstra
@ 2009-04-01 10:12   ` Peter Zijlstra
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2009-04-01 10:12 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, paulus, hpa, mingo, a.p.zijlstra, tglx, mingo

Commit-ID:  c48b1e4e8565e663bf085048afc61bf845a4801f
Gitweb:     http://git.kernel.org/tip/c48b1e4e8565e663bf085048afc61bf845a4801f
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Mon, 30 Mar 2009 19:07:03 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 1 Apr 2009 11:33:32 +0200

perf_counter: fix update_userpage()

It just occurred to me that it is possible to have multiple contending
updates of the userpage (mmap information vs overflow vs counter).
This would break the seqlock logic.

It appears the arch code uses this from NMI context, so we cannot
possibly serialize its use; therefore separate the data_head update
from it and let it return to its original use.

The arch code needs to make sure there are no contending callers by
disabling the counter before using it -- powerpc appears to do this
nicely.
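
For illustration, a user-space read of a self-monitored counter through
the mmap()ed page, following the seqlock scheme documented in the
comment below, could look like this sketch (x86 only; rdpmc access from
user space is assumed to be permitted):

#define barrier()	__asm__ __volatile__("" ::: "memory")

static unsigned long long rdpmc(unsigned int idx)
{
	unsigned int lo, hi;

	__asm__ __volatile__("rdpmc" : "=a" (lo), "=d" (hi) : "c" (idx));
	return (unsigned long long)hi << 32 | lo;
}

static long long mmap_read_count(volatile struct perf_counter_mmap_page *pc)
{
	unsigned int seq, idx;
	long long count;

again:
	barrier();
	seq = pc->lock;
	if (seq & 1)			/* an update is in flight */
		goto again;

	idx   = pc->index;
	count = pc->offset;
	if (idx)			/* index 0 means no hardware counter */
		count += rdpmc(idx - 1);

	barrier();			/* compiler barrier; loads are ordered on x86 */
	if (pc->lock != seq)
		goto again;

	return count;
}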

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20090330171023.241410660@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 include/linux/perf_counter.h |   35 +++++++++++++++++++++++++++++++++++
 kernel/perf_counter.c        |   38 +++++++++++++++++++++++---------------
 2 files changed, 58 insertions(+), 15 deletions(-)

diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
index 0d83322..8ac1885 100644
--- a/include/linux/perf_counter.h
+++ b/include/linux/perf_counter.h
@@ -160,10 +160,45 @@ struct perf_counter_hw_event {
 struct perf_counter_mmap_page {
 	__u32	version;		/* version number of this structure */
 	__u32	compat_version;		/* lowest version this is compat with */
+
+	/*
+	 * Bits needed to read the hw counters in user-space.
+	 *
+	 * The index and offset should be read atomically using the seqlock:
+	 *
+	 *   __u32 seq, index;
+	 *   __s64 offset;
+	 *
+	 * again:
+	 *   rmb();
+	 *   seq = pc->lock;
+	 *
+	 *   if (unlikely(seq & 1)) {
+	 *     cpu_relax();
+	 *     goto again;
+	 *   }
+	 *
+	 *   index = pc->index;
+	 *   offset = pc->offset;
+	 *
+	 *   rmb();
+	 *   if (pc->lock != seq)
+	 *     goto again;
+	 *
+	 * After this, index contains architecture specific counter index + 1,
+	 * so that 0 means unavailable, offset contains the value to be added
+	 * to the result of the raw timer read to obtain this counter's value.
+	 */
 	__u32	lock;			/* seqlock for synchronization */
 	__u32	index;			/* hardware counter identifier */
 	__s64	offset;			/* add to hardware counter value */
 
+	/*
+	 * Control data for the mmap() data buffer.
+	 *
+	 * User-space reading this value should issue an rmb(), on SMP capable
+	 * platforms, after reading this value -- see perf_counter_wakeup().
+	 */
 	__u32   data_head;		/* head in the data section */
 };
 
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index f70ff80..c95e923 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -1316,10 +1316,22 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	return err;
 }
 
-static void __perf_counter_update_userpage(struct perf_counter *counter,
-					   struct perf_mmap_data *data)
+/*
+ * Callers need to ensure there can be no nesting of this function, otherwise
+ * the seqlock logic goes bad. We can not serialize this because the arch
+ * code calls this from NMI context.
+ */
+void perf_counter_update_userpage(struct perf_counter *counter)
 {
-	struct perf_counter_mmap_page *userpg = data->user_page;
+	struct perf_mmap_data *data;
+	struct perf_counter_mmap_page *userpg;
+
+	rcu_read_lock();
+	data = rcu_dereference(counter->data);
+	if (!data)
+		goto unlock;
+
+	userpg = data->user_page;
 
 	/*
 	 * Disable preemption so as to not let the corresponding user-space
@@ -1333,20 +1345,10 @@ static void __perf_counter_update_userpage(struct perf_counter *counter,
 	if (counter->state == PERF_COUNTER_STATE_ACTIVE)
 		userpg->offset -= atomic64_read(&counter->hw.prev_count);
 
-	userpg->data_head = atomic_read(&data->head);
 	smp_wmb();
 	++userpg->lock;
 	preempt_enable();
-}
-
-void perf_counter_update_userpage(struct perf_counter *counter)
-{
-	struct perf_mmap_data *data;
-
-	rcu_read_lock();
-	data = rcu_dereference(counter->data);
-	if (data)
-		__perf_counter_update_userpage(counter, data);
+unlock:
 	rcu_read_unlock();
 }
 
@@ -1547,7 +1549,13 @@ void perf_counter_wakeup(struct perf_counter *counter)
 	data = rcu_dereference(counter->data);
 	if (data) {
 		(void)atomic_xchg(&data->wakeup, POLL_IN);
-		__perf_counter_update_userpage(counter, data);
+		/*
+		 * Ensure all data writes are issued before updating the
+		 * user-space data head information. The matching rmb()
+		 * will be in userspace after reading this value.
+		 */
+		smp_wmb();
+		data->user_page->data_head = atomic_read(&data->head);
 	}
 	rcu_read_unlock();
 

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [tip:perfcounters/core] perf_counter: kerneltop: simplify data_head read
  2009-03-30 17:07 ` [PATCH 03/15] perf_counter: kerneltop: simplify data_head read Peter Zijlstra
@ 2009-04-01 10:13   ` Peter Zijlstra
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2009-04-01 10:13 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, paulus, hpa, mingo, a.p.zijlstra, tglx, mingo

Commit-ID:  b5b7e234c3d14eb614998ce5a96719baf146c71c
Gitweb:     http://git.kernel.org/tip/b5b7e234c3d14eb614998ce5a96719baf146c71c
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Mon, 30 Mar 2009 19:07:04 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 1 Apr 2009 11:33:33 +0200

perf_counter: kerneltop: simplify data_head read

Now that the kernel side changed, match up again.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20090330171023.327144324@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 Documentation/perf_counter/kerneltop.c |   14 +-------------
 1 files changed, 1 insertions(+), 13 deletions(-)

diff --git a/Documentation/perf_counter/kerneltop.c b/Documentation/perf_counter/kerneltop.c
index fda1438..2779c57 100644
--- a/Documentation/perf_counter/kerneltop.c
+++ b/Documentation/perf_counter/kerneltop.c
@@ -1125,22 +1125,10 @@ struct mmap_data {
 static unsigned int mmap_read_head(struct mmap_data *md)
 {
 	struct perf_counter_mmap_page *pc = md->base;
-	unsigned int seq, head;
-
-repeat:
-	rmb();
-	seq = pc->lock;
-
-	if (unlikely(seq & 1)) {
-		cpu_relax();
-		goto repeat;
-	}
+	int head;
 
 	head = pc->data_head;
-
 	rmb();
-	if (pc->lock != seq)
-		goto repeat;
 
 	return head;
 }

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [tip:perfcounters/core] perf_counter: executable mmap() information
  2009-03-30 17:07 ` [PATCH 04/15] perf_counter: executable mmap() information Peter Zijlstra
@ 2009-04-01 10:13   ` Peter Zijlstra
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2009-04-01 10:13 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, paulus, hpa, mingo, akpm, a.p.zijlstra, tglx, mingo

Commit-ID:  b2f5c7cf9c5359702e09e7619959fe9cfa84e8f7
Gitweb:     http://git.kernel.org/tip/b2f5c7cf9c5359702e09e7619959fe9cfa84e8f7
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Mon, 30 Mar 2009 19:07:05 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 1 Apr 2009 11:33:33 +0200

perf_counter: executable mmap() information

Currently the profiling information returns userspace IPs but provides
no way to correlate them to userspace code. Userspace could look into
/proc/$pid/maps but that might not be current or even present anymore
at the time of analyzing the IPs.

Therefore provide means to track the mmap information and provide it
in the output stream.

XXX: only covers mmap()/munmap(), mremap() and mprotect() are missing.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <20090330171023.417259499@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 include/linux/perf_counter.h |   24 ++++++-
 kernel/perf_counter.c        |  145 ++++++++++++++++++++++++++++++++++++++++++
 mm/mmap.c                    |   10 +++
 3 files changed, 177 insertions(+), 2 deletions(-)

diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
index 8ac1885..037a811 100644
--- a/include/linux/perf_counter.h
+++ b/include/linux/perf_counter.h
@@ -137,9 +137,11 @@ struct perf_counter_hw_event {
 				exclude_kernel :  1, /* ditto kernel          */
 				exclude_hv     :  1, /* ditto hypervisor      */
 				exclude_idle   :  1, /* don't count when idle */
-				include_tid    :  1, /* include the tid */
+				include_tid    :  1, /* include the tid       */
+				mmap           :  1, /* include mmap data     */
+				munmap         :  1, /* include munmap data   */
 
-				__reserved_1   : 54;
+				__reserved_1   : 52;
 
 	__u32			extra_config_len;
 	__u32			__reserved_4;
@@ -211,6 +213,9 @@ enum perf_event_type {
 	PERF_EVENT_IP		= 0,
 	PERF_EVENT_GROUP	= 1,
 
+	PERF_EVENT_MMAP		= 2,
+	PERF_EVENT_MUNMAP	= 3,
+
 	__PERF_EVENT_TID	= 0x100,
 };
 
@@ -491,6 +496,12 @@ static inline int is_software_counter(struct perf_counter *counter)
 
 extern void perf_swcounter_event(u32, u64, int, struct pt_regs *);
 
+extern void perf_counter_mmap(unsigned long addr, unsigned long len,
+			      unsigned long pgoff, struct file *file);
+
+extern void perf_counter_munmap(unsigned long addr, unsigned long len,
+				unsigned long pgoff, struct file *file);
+
 #else
 static inline void
 perf_counter_task_sched_in(struct task_struct *task, int cpu)		{ }
@@ -511,6 +522,15 @@ static inline int perf_counter_task_enable(void)	{ return -EINVAL; }
 static inline void
 perf_swcounter_event(u32 event, u64 nr, int nmi, struct pt_regs *regs)	{ }
 
+
+static inline void
+perf_counter_mmap(unsigned long addr, unsigned long len,
+		  unsigned long pgoff, struct file *file)		{ }
+
+static inline void
+perf_counter_munmap(unsigned long addr, unsigned long len,
+		    unsigned long pgoff, struct file *file) 		{ }
+
 #endif
 
 #endif /* __KERNEL__ */
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index c95e923..f35e89e 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -25,6 +25,7 @@
 #include <linux/anon_inodes.h>
 #include <linux/kernel_stat.h>
 #include <linux/perf_counter.h>
+#include <linux/dcache.h>
 
 #include <asm/irq_regs.h>
 
@@ -1844,6 +1845,150 @@ void perf_counter_output(struct perf_counter *counter,
 }
 
 /*
+ * mmap tracking
+ */
+
+struct perf_mmap_event {
+	struct file	*file;
+	char		*file_name;
+	int		file_size;
+
+	struct {
+		struct perf_event_header	header;
+
+		u32				pid;
+		u32				tid;
+		u64				start;
+		u64				len;
+		u64				pgoff;
+	} event;
+};
+
+static void perf_counter_mmap_output(struct perf_counter *counter,
+				     struct perf_mmap_event *mmap_event)
+{
+	struct perf_output_handle handle;
+	int size = mmap_event->event.header.size;
+	int ret = perf_output_begin(&handle, counter, size);
+
+	if (ret)
+		return;
+
+	perf_output_put(&handle, mmap_event->event);
+	perf_output_copy(&handle, mmap_event->file_name,
+				   mmap_event->file_size);
+	perf_output_end(&handle, 0);
+}
+
+static int perf_counter_mmap_match(struct perf_counter *counter,
+				   struct perf_mmap_event *mmap_event)
+{
+	if (counter->hw_event.mmap &&
+	    mmap_event->event.header.type == PERF_EVENT_MMAP)
+		return 1;
+
+	if (counter->hw_event.munmap &&
+	    mmap_event->event.header.type == PERF_EVENT_MUNMAP)
+		return 1;
+
+	return 0;
+}
+
+static void perf_counter_mmap_ctx(struct perf_counter_context *ctx,
+				  struct perf_mmap_event *mmap_event)
+{
+	struct perf_counter *counter;
+
+	if (system_state != SYSTEM_RUNNING || list_empty(&ctx->event_list))
+		return;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(counter, &ctx->event_list, event_entry) {
+		if (perf_counter_mmap_match(counter, mmap_event))
+			perf_counter_mmap_output(counter, mmap_event);
+	}
+	rcu_read_unlock();
+}
+
+static void perf_counter_mmap_event(struct perf_mmap_event *mmap_event)
+{
+	struct perf_cpu_context *cpuctx;
+	struct file *file = mmap_event->file;
+	unsigned int size;
+	char tmp[16];
+	char *buf = NULL;
+	char *name;
+
+	if (file) {
+		buf = kzalloc(PATH_MAX, GFP_KERNEL);
+		if (!buf) {
+			name = strncpy(tmp, "//enomem", sizeof(tmp));
+			goto got_name;
+		}
+		name = dentry_path(file->f_dentry, buf, PATH_MAX);
+		if (IS_ERR(name)) {
+			name = strncpy(tmp, "//toolong", sizeof(tmp));
+			goto got_name;
+		}
+	} else {
+		name = strncpy(tmp, "//anon", sizeof(tmp));
+		goto got_name;
+	}
+
+got_name:
+	size = ALIGN(strlen(name), sizeof(u64));
+
+	mmap_event->file_name = name;
+	mmap_event->file_size = size;
+
+	mmap_event->event.header.size = sizeof(mmap_event->event) + size;
+
+	cpuctx = &get_cpu_var(perf_cpu_context);
+	perf_counter_mmap_ctx(&cpuctx->ctx, mmap_event);
+	put_cpu_var(perf_cpu_context);
+
+	perf_counter_mmap_ctx(&current->perf_counter_ctx, mmap_event);
+
+	kfree(buf);
+}
+
+void perf_counter_mmap(unsigned long addr, unsigned long len,
+		       unsigned long pgoff, struct file *file)
+{
+	struct perf_mmap_event mmap_event = {
+		.file   = file,
+		.event  = {
+			.header = { .type = PERF_EVENT_MMAP, },
+			.pid	= current->group_leader->pid,
+			.tid	= current->pid,
+			.start  = addr,
+			.len    = len,
+			.pgoff  = pgoff,
+		},
+	};
+
+	perf_counter_mmap_event(&mmap_event);
+}
+
+void perf_counter_munmap(unsigned long addr, unsigned long len,
+			 unsigned long pgoff, struct file *file)
+{
+	struct perf_mmap_event mmap_event = {
+		.file   = file,
+		.event  = {
+			.header = { .type = PERF_EVENT_MUNMAP, },
+			.pid	= current->group_leader->pid,
+			.tid	= current->pid,
+			.start  = addr,
+			.len    = len,
+			.pgoff  = pgoff,
+		},
+	};
+
+	perf_counter_mmap_event(&mmap_event);
+}
+
+/*
  * Generic software counter infrastructure
  */
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 00ced3e..c14fa8c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -27,6 +27,7 @@
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
 #include <linux/mmu_notifier.h>
+#include <linux/perf_counter.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -1219,6 +1220,9 @@ munmap_back:
 	if (correct_wcount)
 		atomic_inc(&inode->i_writecount);
 out:
+	if (vm_flags & VM_EXEC)
+		perf_counter_mmap(addr, len, pgoff, file);
+
 	mm->total_vm += len >> PAGE_SHIFT;
 	vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT);
 	if (vm_flags & VM_LOCKED) {
@@ -1752,6 +1756,12 @@ static void remove_vma_list(struct mm_struct *mm, struct vm_area_struct *vma)
 	do {
 		long nrpages = vma_pages(vma);
 
+		if (vma->vm_flags & VM_EXEC) {
+			perf_counter_munmap(vma->vm_start,
+					nrpages << PAGE_SHIFT,
+					vma->vm_pgoff, vma->vm_file);
+		}
+
 		mm->total_vm -= nrpages;
 		vm_stat_account(mm, vma->vm_flags, vma->vm_file, -nrpages);
 		vma = remove_vma(vma);

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [tip:perfcounters/core] perf_counter: kerneltop: parse the mmap data stream
  2009-03-30 17:07 ` [PATCH 05/15] perf_counter: kerneltop: parse the mmap data stream Peter Zijlstra
@ 2009-04-01 10:13   ` Peter Zijlstra
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2009-04-01 10:13 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, paulus, hpa, mingo, a.p.zijlstra, tglx, mingo

Commit-ID:  984cf8720099284c815057d783b68349615cb66e
Gitweb:     http://git.kernel.org/tip/984cf8720099284c815057d783b68349615cb66e
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Mon, 30 Mar 2009 19:07:06 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 1 Apr 2009 11:33:34 +0200

perf_counter: kerneltop: parse the mmap data stream

Frob the kerneltop code to print the mmap data in the stream.

A better use would be collecting the IPs per PID and mapping them onto
the provided userspace code. TODO

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20090330171023.501902515@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 Documentation/perf_counter/kerneltop.c |   50 ++++++++++++++++++++++++++++----
 1 files changed, 44 insertions(+), 6 deletions(-)

diff --git a/Documentation/perf_counter/kerneltop.c b/Documentation/perf_counter/kerneltop.c
index 2779c57..995111d 100644
--- a/Documentation/perf_counter/kerneltop.c
+++ b/Documentation/perf_counter/kerneltop.c
@@ -184,6 +184,8 @@ static int			nmi				=  1;
 static int			group				=  0;
 static unsigned int		page_size;
 static unsigned int		mmap_pages			=  16;
+static int			use_mmap			= 0;
+static int			use_munmap			= 0;
 
 static char			*vmlinux;
 
@@ -333,6 +335,8 @@ static void display_help(void)
 	" -z        --zero             # zero counts after display\n"
 	" -D        --dump_symtab      # dump symbol table to stderr on startup\n"
 	" -m pages  --mmap_pages=<pages> # number of mmap data pages\n"
+	" -M        --mmap_info        # print mmap info stream\n"
+	" -U        --munmap_info      # print munmap info stream\n"
 	);
 
 	exit(0);
@@ -1052,9 +1056,11 @@ static void process_options(int argc, char *argv[])
 			{"stat",	no_argument,		NULL, 'S'},
 			{"zero",	no_argument,		NULL, 'z'},
 			{"mmap_pages",	required_argument,	NULL, 'm'},
+			{"mmap_info",	no_argument,		NULL, 'M'},
+			{"munmap_info",	no_argument,		NULL, 'U'},
 			{NULL,		0,			NULL,  0 }
 		};
-		int c = getopt_long(argc, argv, "+:ac:C:d:De:f:g:hn:m:p:s:Sx:z",
+		int c = getopt_long(argc, argv, "+:ac:C:d:De:f:g:hn:m:p:s:Sx:zMU",
 				    long_options, &option_index);
 		if (c == -1)
 			break;
@@ -1092,6 +1098,8 @@ static void process_options(int argc, char *argv[])
 		case 'x': vmlinux			= strdup(optarg); break;
 		case 'z': zero				=              1; break;
 		case 'm': mmap_pages			=   atoi(optarg); break;
+		case 'M': use_mmap			=              1; break;
+		case 'U': use_munmap			=              1; break;
 		default: error = 1; break;
 		}
 	}
@@ -1172,12 +1180,29 @@ static void mmap_read(struct mmap_data *md)
 	last_read = this_read;
 
 	for (; old != head;) {
-		struct event_struct {
+		struct ip_event {
 			struct perf_event_header header;
 			__u64 ip;
 			__u32 pid, tid;
-		} *event = (struct event_struct *)&data[old & md->mask];
-		struct event_struct event_copy;
+		};
+		struct mmap_event {
+			struct perf_event_header header;
+			__u32 pid, tid;
+			__u64 start;
+			__u64 len;
+			__u64 pgoff;
+			char filename[PATH_MAX];
+		};
+
+		typedef union event_union {
+			struct perf_event_header header;
+			struct ip_event ip;
+			struct mmap_event mmap;
+		} event_t;
+
+		event_t *event = (event_t *)&data[old & md->mask];
+
+		event_t event_copy;
 
 		unsigned int size = event->header.size;
 
@@ -1187,7 +1212,7 @@ static void mmap_read(struct mmap_data *md)
 		 */
 		if ((old & md->mask) + size != ((old + size) & md->mask)) {
 			unsigned int offset = old;
-			unsigned int len = sizeof(*event), cpy;
+			unsigned int len = min(sizeof(*event), size), cpy;
 			void *dst = &event_copy;
 
 			do {
@@ -1206,7 +1231,18 @@ static void mmap_read(struct mmap_data *md)
 		switch (event->header.type) {
 		case PERF_EVENT_IP:
 		case PERF_EVENT_IP | __PERF_EVENT_TID:
-			process_event(event->ip, md->counter);
+			process_event(event->ip.ip, md->counter);
+			break;
+
+		case PERF_EVENT_MMAP:
+		case PERF_EVENT_MUNMAP:
+			printf("%s: %Lu %Lu %Lu %s\n",
+					event->header.type == PERF_EVENT_MMAP
+					  ? "mmap" : "munmap",
+					event->mmap.start,
+					event->mmap.len,
+					event->mmap.pgoff,
+					event->mmap.filename);
 			break;
 		}
 	}
@@ -1255,6 +1291,8 @@ int main(int argc, char *argv[])
 			hw_event.record_type	= PERF_RECORD_IRQ;
 			hw_event.nmi		= nmi;
 			hw_event.include_tid	= 1;
+			hw_event.mmap		= use_mmap;
+			hw_event.munmap		= use_munmap;
 
 			fd[i][counter] = sys_perf_counter_open(&hw_event, tid, cpu, group_fd, 0);
 			if (fd[i][counter] < 0) {

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [tip:perfcounters/core] perf_counter: powerpc: only reserve PMU hardware when we need it
  2009-03-30 17:07 ` [PATCH 06/15] perf_counter: powerpc: only reserve PMU hardware when we need it Peter Zijlstra
@ 2009-04-01 10:13   ` Paul Mackerras
  0 siblings, 0 replies; 55+ messages in thread
From: Paul Mackerras @ 2009-04-01 10:13 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, paulus, hpa, mingo, a.p.zijlstra, tglx, mingo

Commit-ID:  59d98a0eb7662a468f99d37cae0b1e3c7b4ce80b
Gitweb:     http://git.kernel.org/tip/59d98a0eb7662a468f99d37cae0b1e3c7b4ce80b
Author:     Paul Mackerras <paulus@samba.org>
AuthorDate: Mon, 30 Mar 2009 19:07:07 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 1 Apr 2009 11:33:34 +0200

perf_counter: powerpc: only reserve PMU hardware when we need it

Impact: cooperate with oprofile

At present, on PowerPC, if you have perf_counters compiled in, oprofile
doesn't work.  There is code to allow the PMU to be shared between
competing subsystems, such as perf_counters and oprofile, but currently
the perf_counter subsystem reserves the PMU for itself at boot time,
and never releases it.

This makes perf_counter play nicely with oprofile.  Now we keep a count
of how many perf_counter instances are counting hardware events, and
reserve the PMU when that count becomes non-zero, and release the PMU
when that count becomes zero.  This means that it is possible to have
perf_counters compiled in and still use oprofile, as long as there are
no hardware perf_counters active.  This also means that if oprofile is
active, sys_perf_counter_open will fail if the hw_event specifies a
hardware event.

To avoid races with other tasks creating and destroying perf_counters,
we use a mutex.  We use atomic_inc_not_zero and atomic_add_unless to
avoid having to take the mutex unless there is a possibility of the
count going between 0 and 1.
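
The same idiom in isolation, on a made-up shared resource (the real code
is in the diff below; reserve_hardware()/release_hardware() are
placeholders):

static atomic_t users;
static DEFINE_MUTEX(reserve_mutex);

static int get_resource(void)
{
	int err = 0;

	if (atomic_inc_not_zero(&users))	/* fast path: already reserved */
		return 0;

	mutex_lock(&reserve_mutex);
	if (atomic_read(&users) == 0 && reserve_hardware())
		err = -EBUSY;
	else
		atomic_inc(&users);
	mutex_unlock(&reserve_mutex);

	return err;
}

static void put_resource(void)
{
	if (atomic_add_unless(&users, -1, 1))	/* fast path: not the last user */
		return;

	mutex_lock(&reserve_mutex);
	if (atomic_dec_return(&users) == 0)
		release_hardware();
	mutex_unlock(&reserve_mutex);
}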

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090330171023.627912475@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 arch/powerpc/kernel/perf_counter.c |   47 ++++++++++++++++++++++++++++++++----
 1 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/perf_counter.c b/arch/powerpc/kernel/perf_counter.c
index cde720f..560dd1e 100644
--- a/arch/powerpc/kernel/perf_counter.c
+++ b/arch/powerpc/kernel/perf_counter.c
@@ -41,6 +41,8 @@ struct power_pmu *ppmu;
  */
 static unsigned int freeze_counters_kernel = MMCR0_FCS;
 
+static void perf_counter_interrupt(struct pt_regs *regs);
+
 void perf_counter_print_debug(void)
 {
 }
@@ -594,6 +596,24 @@ struct hw_perf_counter_ops power_perf_ops = {
 	.read = power_perf_read
 };
 
+/* Number of perf_counters counting hardware events */
+static atomic_t num_counters;
+/* Used to avoid races in calling reserve/release_pmc_hardware */
+static DEFINE_MUTEX(pmc_reserve_mutex);
+
+/*
+ * Release the PMU if this is the last perf_counter.
+ */
+static void hw_perf_counter_destroy(struct perf_counter *counter)
+{
+	if (!atomic_add_unless(&num_counters, -1, 1)) {
+		mutex_lock(&pmc_reserve_mutex);
+		if (atomic_dec_return(&num_counters) == 0)
+			release_pmc_hardware();
+		mutex_unlock(&pmc_reserve_mutex);
+	}
+}
+
 const struct hw_perf_counter_ops *
 hw_perf_counter_init(struct perf_counter *counter)
 {
@@ -601,6 +621,7 @@ hw_perf_counter_init(struct perf_counter *counter)
 	struct perf_counter *ctrs[MAX_HWCOUNTERS];
 	unsigned int events[MAX_HWCOUNTERS];
 	int n;
+	int err;
 
 	if (!ppmu)
 		return NULL;
@@ -646,6 +667,27 @@ hw_perf_counter_init(struct perf_counter *counter)
 
 	counter->hw.config = events[n];
 	atomic64_set(&counter->hw.period_left, counter->hw_event.irq_period);
+
+	/*
+	 * See if we need to reserve the PMU.
+	 * If no counters are currently in use, then we have to take a
+	 * mutex to ensure that we don't race with another task doing
+	 * reserve_pmc_hardware or release_pmc_hardware.
+	 */
+	err = 0;
+	if (!atomic_inc_not_zero(&num_counters)) {
+		mutex_lock(&pmc_reserve_mutex);
+		if (atomic_read(&num_counters) == 0 &&
+		    reserve_pmc_hardware(perf_counter_interrupt))
+			err = -EBUSY;
+		else
+			atomic_inc(&num_counters);
+		mutex_unlock(&pmc_reserve_mutex);
+	}
+	counter->destroy = hw_perf_counter_destroy;
+
+	if (err)
+		return NULL;
 	return &power_perf_ops;
 }
 
@@ -769,11 +811,6 @@ static int init_perf_counters(void)
 {
 	unsigned long pvr;
 
-	if (reserve_pmc_hardware(perf_counter_interrupt)) {
-		printk(KERN_ERR "Couldn't init performance monitor subsystem\n");
-		return -EBUSY;
-	}
-
 	/* XXX should get this from cputable */
 	pvr = mfspr(SPRN_PVR);
 	switch (PVR_VER(pvr)) {

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [tip:perfcounters/core] perf_counter: make it possible for hw_perf_counter_init to return error codes
  2009-03-30 17:07 ` [PATCH 07/15] perf_counter: make it possible for hw_perf_counter_init to return error codes Peter Zijlstra
@ 2009-04-01 10:13   ` Paul Mackerras
  0 siblings, 0 replies; 55+ messages in thread
From: Paul Mackerras @ 2009-04-01 10:13 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, paulus, hpa, mingo, a.p.zijlstra, tglx, mingo

Commit-ID:  a9ac421ff0a0893671eb4fe2f6a6a1d7dc847dd3
Gitweb:     http://git.kernel.org/tip/a9ac421ff0a0893671eb4fe2f6a6a1d7dc847dd3
Author:     Paul Mackerras <paulus@samba.org>
AuthorDate: Mon, 30 Mar 2009 19:07:08 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 1 Apr 2009 11:33:35 +0200

perf_counter: make it possible for hw_perf_counter_init to return error codes

Impact: better error reporting

At present, if hw_perf_counter_init encounters an error, all it can do
is return NULL, which causes sys_perf_counter_open to return an EINVAL
error to userspace.  This isn't very informative for userspace; it means
that userspace can't tell the difference between "sorry, oprofile is
already using the PMU" and "we don't support this CPU" and "this CPU
doesn't support the requested generic hardware event".

This commit uses the PTR_ERR/ERR_PTR/IS_ERR set of macros to let
hw_perf_counter_init return an error code on error rather than just NULL
if it wishes.  If it does so, that error code will be returned from
sys_perf_counter_open to userspace.  If it returns NULL, an EINVAL
error will be returned to userspace, as before.

This also adapts the powerpc hw_perf_counter_init to make use of this
to return ENXIO, EINVAL, EBUSY, or EOPNOTSUPP as appropriate.  It would
be good to add extra error numbers in future to allow userspace to
distinguish the various errors that are currently reported as EINVAL,
i.e. irq_period < 0, too many events in a group, conflict between
exclude_* settings in a group, and PMU resource conflict in a group.

[ v2: fix a bug pointed out by Corey Ashford where error returns from
      hw_perf_counter_init were not handled correctly in the case of
      raw hardware events.]
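
For readers unfamiliar with the idiom, a minimal sketch of the
convention on a made-up allocator (not the perf_counter code itself):

struct thing { int arg; };

static struct thing *thing_create(int arg)
{
	struct thing *t;

	if (arg < 0)
		return ERR_PTR(-EINVAL);	/* encode the errno in the pointer */

	t = kzalloc(sizeof(*t), GFP_KERNEL);
	if (!t)
		return ERR_PTR(-ENOMEM);

	return t;				/* a real pointer on success */
}

static int thing_open(int arg)
{
	struct thing *t = thing_create(arg);

	if (IS_ERR(t))
		return PTR_ERR(t);		/* recover and propagate the errno */

	return 0;
}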

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090330171023.682428180@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 arch/powerpc/kernel/perf_counter.c |   14 +++++++-------
 kernel/perf_counter.c              |   35 ++++++++++++++++++++++-------------
 2 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kernel/perf_counter.c b/arch/powerpc/kernel/perf_counter.c
index 560dd1e..0a4d14f 100644
--- a/arch/powerpc/kernel/perf_counter.c
+++ b/arch/powerpc/kernel/perf_counter.c
@@ -624,13 +624,13 @@ hw_perf_counter_init(struct perf_counter *counter)
 	int err;
 
 	if (!ppmu)
-		return NULL;
+		return ERR_PTR(-ENXIO);
 	if ((s64)counter->hw_event.irq_period < 0)
-		return NULL;
+		return ERR_PTR(-EINVAL);
 	if (!perf_event_raw(&counter->hw_event)) {
 		ev = perf_event_id(&counter->hw_event);
 		if (ev >= ppmu->n_generic || ppmu->generic_events[ev] == 0)
-			return NULL;
+			return ERR_PTR(-EOPNOTSUPP);
 		ev = ppmu->generic_events[ev];
 	} else {
 		ev = perf_event_config(&counter->hw_event);
@@ -656,14 +656,14 @@ hw_perf_counter_init(struct perf_counter *counter)
 		n = collect_events(counter->group_leader, ppmu->n_counter - 1,
 				   ctrs, events);
 		if (n < 0)
-			return NULL;
+			return ERR_PTR(-EINVAL);
 	}
 	events[n] = ev;
 	ctrs[n] = counter;
 	if (check_excludes(ctrs, n, 1))
-		return NULL;
+		return ERR_PTR(-EINVAL);
 	if (power_check_constraints(events, n + 1))
-		return NULL;
+		return ERR_PTR(-EINVAL);
 
 	counter->hw.config = events[n];
 	atomic64_set(&counter->hw.period_left, counter->hw_event.irq_period);
@@ -687,7 +687,7 @@ hw_perf_counter_init(struct perf_counter *counter)
 	counter->destroy = hw_perf_counter_destroy;
 
 	if (err)
-		return NULL;
+		return ERR_PTR(err);
 	return &power_perf_ops;
 }
 
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index f35e89e..d07b452 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -2453,10 +2453,11 @@ perf_counter_alloc(struct perf_counter_hw_event *hw_event,
 {
 	const struct hw_perf_counter_ops *hw_ops;
 	struct perf_counter *counter;
+	long err;
 
 	counter = kzalloc(sizeof(*counter), gfpflags);
 	if (!counter)
-		return NULL;
+		return ERR_PTR(-ENOMEM);
 
 	/*
 	 * Single counters are their own group leaders, with an
@@ -2505,12 +2506,18 @@ perf_counter_alloc(struct perf_counter_hw_event *hw_event,
 		hw_ops = tp_perf_counter_init(counter);
 		break;
 	}
+done:
+	err = 0;
+	if (!hw_ops)
+		err = -EINVAL;
+	else if (IS_ERR(hw_ops))
+		err = PTR_ERR(hw_ops);
 
-	if (!hw_ops) {
+	if (err) {
 		kfree(counter);
-		return NULL;
+		return ERR_PTR(err);
 	}
-done:
+
 	counter->hw_ops = hw_ops;
 
 	return counter;
@@ -2583,10 +2590,10 @@ SYSCALL_DEFINE5(perf_counter_open,
 			goto err_put_context;
 	}
 
-	ret = -EINVAL;
 	counter = perf_counter_alloc(&hw_event, cpu, ctx, group_leader,
 				     GFP_KERNEL);
-	if (!counter)
+	ret = PTR_ERR(counter);
+	if (IS_ERR(counter))
 		goto err_put_context;
 
 	ret = anon_inode_getfd("[perf_counter]", &perf_fops, counter, 0);
@@ -2658,8 +2665,8 @@ inherit_counter(struct perf_counter *parent_counter,
 	child_counter = perf_counter_alloc(&parent_counter->hw_event,
 					   parent_counter->cpu, child_ctx,
 					   group_leader, GFP_KERNEL);
-	if (!child_counter)
-		return NULL;
+	if (IS_ERR(child_counter))
+		return child_counter;
 
 	/*
 	 * Link it up in the child's context:
@@ -2710,15 +2717,17 @@ static int inherit_group(struct perf_counter *parent_counter,
 {
 	struct perf_counter *leader;
 	struct perf_counter *sub;
+	struct perf_counter *child_ctr;
 
 	leader = inherit_counter(parent_counter, parent, parent_ctx,
 				 child, NULL, child_ctx);
-	if (!leader)
-		return -ENOMEM;
+	if (IS_ERR(leader))
+		return PTR_ERR(leader);
 	list_for_each_entry(sub, &parent_counter->sibling_list, list_entry) {
-		if (!inherit_counter(sub, parent, parent_ctx,
-				     child, leader, child_ctx))
-			return -ENOMEM;
+		child_ctr = inherit_counter(sub, parent, parent_ctx,
+					    child, leader, child_ctx);
+		if (IS_ERR(child_ctr))
+			return PTR_ERR(child_ctr);
 	}
 	return 0;
 }

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [tip:perfcounters/core] perf_counter: x86: proper error propagation for the x86 hw_perf_counter_init()
  2009-03-30 17:07 ` [PATCH 08/15] perf_counter: x86: proper error propagation for the x86 hw_perf_counter_init() Peter Zijlstra
@ 2009-04-01 10:13   ` Peter Zijlstra
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2009-04-01 10:13 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, paulus, hpa, mingo, a.p.zijlstra, tglx, mingo

Commit-ID:  1763e5959b2302761b10e84df38165fac86bd1a5
Gitweb:     http://git.kernel.org/tip/1763e5959b2302761b10e84df38165fac86bd1a5
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Mon, 30 Mar 2009 19:07:09 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 1 Apr 2009 11:33:35 +0200

perf_counter: x86: proper error propagation for the x86 hw_perf_counter_init()

Now that Paul cleaned up the error propagation paths, pass down the
x86 error as well.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20090330171023.792822360@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 arch/x86/kernel/cpu/perf_counter.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_counter.c b/arch/x86/kernel/cpu/perf_counter.c
index 7aab177..b8885cc 100644
--- a/arch/x86/kernel/cpu/perf_counter.c
+++ b/arch/x86/kernel/cpu/perf_counter.c
@@ -954,7 +954,7 @@ hw_perf_counter_init(struct perf_counter *counter)
 
 	err = __hw_perf_counter_init(counter);
 	if (err)
-		return NULL;
+		return ERR_PTR(err);
 
 	return &x86_perf_counter_ops;
 }

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [tip:perfcounters/core] perf_counter tools: optionally scale counter values in perfstat mode
  2009-03-30 17:07 ` [PATCH 09/15] perf_counter tools: optionally scale counter values in perfstat mode Peter Zijlstra
  2009-04-01  9:39   ` Ingo Molnar
@ 2009-04-01 10:14   ` Paul Mackerras
  1 sibling, 0 replies; 55+ messages in thread
From: Paul Mackerras @ 2009-04-01 10:14 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, paulus, hpa, mingo, a.p.zijlstra, tglx, mingo

Commit-ID:  654365a6d9ce51634d6091776af74c7848b0e40e
Gitweb:     http://git.kernel.org/tip/654365a6d9ce51634d6091776af74c7848b0e40e
Author:     Paul Mackerras <paulus@samba.org>
AuthorDate: Mon, 30 Mar 2009 19:07:10 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 1 Apr 2009 11:33:36 +0200

perf_counter tools: optionally scale counter values in perfstat mode

Impact: new functionality

This adds an option to the perfstat mode of kerneltop to scale the
reported counter values according to the fraction of time that each
counter gets to count.  This is invoked with the -l option (I used 'l'
because s, c, a and e were all taken already.)  This uses the new
PERF_FORMAT_TOTAL_TIME_{ENABLED,RUNNING} read format options.

With this, we get output like this:

$ ./perfstat -l -e 0:0,0:1,0:2,0:3,0:4,0:5 ./spin

 Performance counter stats for './spin':

     4016072055  CPU cycles           (events)  (scaled from 66.53%)
     2005887318  instructions         (events)  (scaled from 66.53%)
        1762849  cache references     (events)  (scaled from 66.69%)
         165229  cache misses         (events)  (scaled from 66.85%)
     1001298009  branches             (events)  (scaled from 66.78%)
          41566  branch misses        (events)  (scaled from 66.61%)

 Wall-clock time elapsed:  2438.227446 msecs

This also lets us detect when a counter is zero because the counter
never got to go on the CPU at all.  In that case we print <not counted>
rather than 0.
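
A rough sketch of the consumer side, assuming the three-u64 read() layout
the patch below uses (count, time_enabled, time_running); the variable
names here are only illustrative:

	__u64 v[3];	/* count, time_enabled, time_running */

	read(fd, v, sizeof(v));

	if (v[2] == 0)
		printf("<not counted>\n");	/* never got onto the PMU */
	else if (v[2] < v[1])
		printf("%llu (scaled from %.2f%%)\n",
		       (unsigned long long)((double)v[0] * v[1] / v[2] + 0.5),
		       (double)v[2] / v[1] * 100.0);
	else
		printf("%llu\n", (unsigned long long)v[0]);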

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090330171023.871484899@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 Documentation/perf_counter/kerneltop.c |   56 +++++++++++++++++++++++++------
 1 files changed, 45 insertions(+), 11 deletions(-)

diff --git a/Documentation/perf_counter/kerneltop.c b/Documentation/perf_counter/kerneltop.c
index 995111d..c0ca015 100644
--- a/Documentation/perf_counter/kerneltop.c
+++ b/Documentation/perf_counter/kerneltop.c
@@ -197,6 +197,8 @@ static int			delay_secs			=  2;
 static int			zero;
 static int			dump_symtab;
 
+static int			scale;
+
 struct source_line {
 	uint64_t		EIP;
 	unsigned long		count;
@@ -305,6 +307,7 @@ static void display_perfstat_help(void)
 	display_events_help();
 
 	printf(
+	" -l                           # scale counter values\n"
 	" -a                           # system-wide collection\n");
 	exit(0);
 }
@@ -328,6 +331,7 @@ static void display_help(void)
 	" -c CNT    --count=CNT        # event period to sample\n\n"
 	" -C CPU    --cpu=CPU          # CPU (-1 for all)                 [default: -1]\n"
 	" -p PID    --pid=PID          # PID of sampled task (-1 for all) [default: -1]\n\n"
+	" -l                           # show scale factor for RR events\n"
 	" -d delay  --delay=<seconds>  # sampling/display delay           [default:  2]\n"
 	" -f CNT    --filter=CNT       # min-event-count filter          [default: 100]\n\n"
 	" -s symbol --symbol=<symbol>  # function to be showed annotated one-shot\n"
@@ -436,6 +440,9 @@ static void create_perfstat_counter(int counter)
 	hw_event.config		= event_id[counter];
 	hw_event.record_type	= PERF_RECORD_SIMPLE;
 	hw_event.nmi		= 0;
+	if (scale)
+		hw_event.read_format	= PERF_FORMAT_TOTAL_TIME_ENABLED |
+					  PERF_FORMAT_TOTAL_TIME_RUNNING;
 
 	if (system_wide) {
 		int cpu;
@@ -507,28 +514,53 @@ int do_perfstat(int argc, char *argv[])
 	fprintf(stderr, "\n");
 
 	for (counter = 0; counter < nr_counters; counter++) {
-		int cpu;
-		__u64 count, single_count;
+		int cpu, nv;
+		__u64 count[3], single_count[3];
+		int scaled;
 
-		count = 0;
+		count[0] = count[1] = count[2] = 0;
+		nv = scale ? 3 : 1;
 		for (cpu = 0; cpu < nr_cpus; cpu ++) {
 			res = read(fd[cpu][counter],
-					(char *) &single_count, sizeof(single_count));
-			assert(res == sizeof(single_count));
-			count += single_count;
+				   single_count, nv * sizeof(__u64));
+			assert(res == nv * sizeof(__u64));
+
+			count[0] += single_count[0];
+			if (scale) {
+				count[1] += single_count[1];
+				count[2] += single_count[2];
+			}
+		}
+
+		scaled = 0;
+		if (scale) {
+			if (count[2] == 0) {
+				fprintf(stderr, " %14s  %-20s\n",
+					"<not counted>", event_name(counter));
+				continue;
+			}
+			if (count[2] < count[1]) {
+				scaled = 1;
+				count[0] = (unsigned long long)
+					((double)count[0] * count[1] / count[2] + 0.5);
+			}
 		}
 
 		if (event_id[counter] == EID(PERF_TYPE_SOFTWARE, PERF_COUNT_CPU_CLOCK) ||
 		    event_id[counter] == EID(PERF_TYPE_SOFTWARE, PERF_COUNT_TASK_CLOCK)) {
 
-			double msecs = (double)count / 1000000;
+			double msecs = (double)count[0] / 1000000;
 
-			fprintf(stderr, " %14.6f  %-20s (msecs)\n",
+			fprintf(stderr, " %14.6f  %-20s (msecs)",
 				msecs, event_name(counter));
 		} else {
-			fprintf(stderr, " %14Ld  %-20s (events)\n",
-				count, event_name(counter));
+			fprintf(stderr, " %14Ld  %-20s (events)",
+				count[0], event_name(counter));
 		}
+		if (scaled)
+			fprintf(stderr, "  (scaled from %.2f%%)",
+				(double) count[2] / count[1] * 100);
+		fprintf(stderr, "\n");
 	}
 	fprintf(stderr, "\n");
 	fprintf(stderr, " Wall-clock time elapsed: %12.6f msecs\n",
@@ -1049,6 +1081,7 @@ static void process_options(int argc, char *argv[])
 			{"filter",	required_argument,	NULL, 'f'},
 			{"group",	required_argument,	NULL, 'g'},
 			{"help",	no_argument,		NULL, 'h'},
+			{"scale",	no_argument,		NULL, 'l'},
 			{"nmi",		required_argument,	NULL, 'n'},
 			{"pid",		required_argument,	NULL, 'p'},
 			{"vmlinux",	required_argument,	NULL, 'x'},
@@ -1060,7 +1093,7 @@ static void process_options(int argc, char *argv[])
 			{"munmap_info",	no_argument,		NULL, 'U'},
 			{NULL,		0,			NULL,  0 }
 		};
-		int c = getopt_long(argc, argv, "+:ac:C:d:De:f:g:hn:m:p:s:Sx:zMU",
+		int c = getopt_long(argc, argv, "+:ac:C:d:De:f:g:hln:m:p:s:Sx:zMU",
 				    long_options, &option_index);
 		if (c == -1)
 			break;
@@ -1084,6 +1117,7 @@ static void process_options(int argc, char *argv[])
 		case 'f': count_filter			=   atoi(optarg); break;
 		case 'g': group				=   atoi(optarg); break;
 		case 'h':      				  display_help(); break;
+		case 'l': scale				=	       1; break;
 		case 'n': nmi				=   atoi(optarg); break;
 		case 'p':
 			/* CPU and PID are mutually exclusive */

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [tip:perfcounters/core] perf_counter: small cleanup of the output routines
  2009-03-30 17:07 ` [PATCH 10/15] perf_counter: small cleanup of the output routines Peter Zijlstra
@ 2009-04-01 10:14   ` Peter Zijlstra
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2009-04-01 10:14 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, paulus, hpa, mingo, a.p.zijlstra, tglx, mingo

Commit-ID:  8f2d296c34f94cc84bbc21e9f5bf4fde01df6c41
Gitweb:     http://git.kernel.org/tip/8f2d296c34f94cc84bbc21e9f5bf4fde01df6c41
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Mon, 30 Mar 2009 19:07:11 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 1 Apr 2009 11:33:36 +0200

perf_counter: small cleanup of the output routines

Move the nmi argument to the _begin() function, so that _end() only needs the
handle. This allows the _begin() function to generate a wakeup on event loss.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20090330171023.959404268@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 kernel/perf_counter.c |   42 ++++++++++++++++++++++++++----------------
 1 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index d07b452..4471e7e 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -1663,10 +1663,20 @@ struct perf_output_handle {
 	unsigned int		offset;
 	unsigned int		head;
 	int			wakeup;
+	int			nmi;
 };
 
+static inline void __perf_output_wakeup(struct perf_output_handle *handle)
+{
+	if (handle->nmi)
+		perf_pending_queue(handle->counter);
+	else
+		perf_counter_wakeup(handle->counter);
+}
+
 static int perf_output_begin(struct perf_output_handle *handle,
-			     struct perf_counter *counter, unsigned int size)
+			     struct perf_counter *counter, unsigned int size,
+			     int nmi)
 {
 	struct perf_mmap_data *data;
 	unsigned int offset, head;
@@ -1676,15 +1686,17 @@ static int perf_output_begin(struct perf_output_handle *handle,
 	if (!data)
 		goto out;
 
+	handle->counter	= counter;
+	handle->nmi	= nmi;
+
 	if (!data->nr_pages)
-		goto out;
+		goto fail;
 
 	do {
 		offset = head = atomic_read(&data->head);
 		head += size;
 	} while (atomic_cmpxchg(&data->head, offset, head) != offset);
 
-	handle->counter	= counter;
 	handle->data	= data;
 	handle->offset	= offset;
 	handle->head	= head;
@@ -1692,6 +1704,8 @@ static int perf_output_begin(struct perf_output_handle *handle,
 
 	return 0;
 
+fail:
+	__perf_output_wakeup(handle);
 out:
 	rcu_read_unlock();
 
@@ -1733,14 +1747,10 @@ static void perf_output_copy(struct perf_output_handle *handle,
 #define perf_output_put(handle, x) \
 	perf_output_copy((handle), &(x), sizeof(x))
 
-static void perf_output_end(struct perf_output_handle *handle, int nmi)
+static void perf_output_end(struct perf_output_handle *handle)
 {
-	if (handle->wakeup) {
-		if (nmi)
-			perf_pending_queue(handle->counter);
-		else
-			perf_counter_wakeup(handle->counter);
-	}
+	if (handle->wakeup)
+		__perf_output_wakeup(handle);
 	rcu_read_unlock();
 }
 
@@ -1750,12 +1760,12 @@ static int perf_output_write(struct perf_counter *counter, int nmi,
 	struct perf_output_handle handle;
 	int ret;
 
-	ret = perf_output_begin(&handle, counter, size);
+	ret = perf_output_begin(&handle, counter, size, nmi);
 	if (ret)
 		goto out;
 
 	perf_output_copy(&handle, buf, size);
-	perf_output_end(&handle, nmi);
+	perf_output_end(&handle);
 
 out:
 	return ret;
@@ -1804,7 +1814,7 @@ static void perf_output_group(struct perf_counter *counter, int nmi)
 
 	size = sizeof(header) + counter->nr_siblings * sizeof(entry);
 
-	ret = perf_output_begin(&handle, counter, size);
+	ret = perf_output_begin(&handle, counter, size, nmi);
 	if (ret)
 		return;
 
@@ -1824,7 +1834,7 @@ static void perf_output_group(struct perf_counter *counter, int nmi)
 		perf_output_put(&handle, entry);
 	}
 
-	perf_output_end(&handle, nmi);
+	perf_output_end(&handle);
 }
 
 void perf_counter_output(struct perf_counter *counter,
@@ -1869,7 +1879,7 @@ static void perf_counter_mmap_output(struct perf_counter *counter,
 {
 	struct perf_output_handle handle;
 	int size = mmap_event->event.header.size;
-	int ret = perf_output_begin(&handle, counter, size);
+	int ret = perf_output_begin(&handle, counter, size, 0);
 
 	if (ret)
 		return;
@@ -1877,7 +1887,7 @@ static void perf_counter_mmap_output(struct perf_counter *counter,
 	perf_output_put(&handle, mmap_event->event);
 	perf_output_copy(&handle, mmap_event->file_name,
 				   mmap_event->file_size);
-	perf_output_end(&handle, 0);
+	perf_output_end(&handle);
 }
 
 static int perf_counter_mmap_match(struct perf_counter *counter,

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [tip:perfcounters/core] perf_counter: re-arrange the perf_event_type
  2009-03-30 17:07 ` [PATCH 11/15] perf_counter: re-arrange the perf_event_type Peter Zijlstra
@ 2009-04-01 10:14   ` Peter Zijlstra
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2009-04-01 10:14 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, paulus, hpa, mingo, a.p.zijlstra, tglx, mingo

Commit-ID:  06cb4a0996bf4d2b47058aff0e0192819efe502f
Gitweb:     http://git.kernel.org/tip/06cb4a0996bf4d2b47058aff0e0192819efe502f
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Mon, 30 Mar 2009 19:07:12 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 1 Apr 2009 11:33:37 +0200

perf_counter: re-arrange the perf_event_type

Breaks ABI yet again :-)

Change the event type so that [0, 2^31-1] are regular event types, but
[2^31, 2^32-1] forms a bitmask for overflow events.
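
For a reader of the mmap ring buffer that roughly means the following
(sketch only; hdr is assumed to point at the next record in the buffer,
and the parsing order matches the output code below -- ip first, then
the pid/tid pair):

	unsigned char *p = (unsigned char *)(hdr + 1);	/* payload follows header */

	if (hdr->type & PERF_EVENT_OVERFLOW) {
		/* the low bits describe which fields are present */
		if (hdr->type & __PERF_EVENT_IP) {
			__u64 ip;

			memcpy(&ip, p, sizeof(ip));
			p += sizeof(ip);
		}
		if (hdr->type & __PERF_EVENT_TID) {
			__u32 tid[2];		/* pid, tid */

			memcpy(tid, p, sizeof(tid));
			p += sizeof(tid);
		}
	} else {
		/* plain enumerated event: GROUP, MMAP, MUNMAP, ... */
	}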

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20090330171024.047961770@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 include/linux/perf_counter.h |    6 +++-
 kernel/perf_counter.c        |   56 ++++++++++++++++++-----------------------
 2 files changed, 29 insertions(+), 33 deletions(-)

diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
index 037a811..edf5bfb 100644
--- a/include/linux/perf_counter.h
+++ b/include/linux/perf_counter.h
@@ -210,13 +210,15 @@ struct perf_event_header {
 };
 
 enum perf_event_type {
-	PERF_EVENT_IP		= 0,
+
 	PERF_EVENT_GROUP	= 1,
 
 	PERF_EVENT_MMAP		= 2,
 	PERF_EVENT_MUNMAP	= 3,
 
-	__PERF_EVENT_TID	= 0x100,
+	PERF_EVENT_OVERFLOW	= 1UL << 31,
+	__PERF_EVENT_IP		= 1UL << 30,
+	__PERF_EVENT_TID	= 1UL << 29,
 };
 
 #ifdef __KERNEL__
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index 4471e7e..d93e9dd 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -1754,50 +1754,44 @@ static void perf_output_end(struct perf_output_handle *handle)
 	rcu_read_unlock();
 }
 
-static int perf_output_write(struct perf_counter *counter, int nmi,
-			     void *buf, ssize_t size)
-{
-	struct perf_output_handle handle;
-	int ret;
-
-	ret = perf_output_begin(&handle, counter, size, nmi);
-	if (ret)
-		goto out;
-
-	perf_output_copy(&handle, buf, size);
-	perf_output_end(&handle);
-
-out:
-	return ret;
-}
-
 static void perf_output_simple(struct perf_counter *counter,
 			       int nmi, struct pt_regs *regs)
 {
-	unsigned int size;
+	int ret;
+	struct perf_output_handle handle;
+	struct perf_event_header header;
+	u64 ip;
 	struct {
-		struct perf_event_header header;
-		u64 ip;
 		u32 pid, tid;
-	} event;
+	} tid_entry;
 
-	event.header.type = PERF_EVENT_IP;
-	event.ip = instruction_pointer(regs);
+	header.type = PERF_EVENT_OVERFLOW;
+	header.size = sizeof(header);
 
-	size = sizeof(event);
+	ip = instruction_pointer(regs);
+	header.type |= __PERF_EVENT_IP;
+	header.size += sizeof(ip);
 
 	if (counter->hw_event.include_tid) {
 		/* namespace issues */
-		event.pid = current->group_leader->pid;
-		event.tid = current->pid;
+		tid_entry.pid = current->group_leader->pid;
+		tid_entry.tid = current->pid;
 
-		event.header.type |= __PERF_EVENT_TID;
-	} else
-		size -= sizeof(u64);
+		header.type |= __PERF_EVENT_TID;
+		header.size += sizeof(tid_entry);
+	}
 
-	event.header.size = size;
+	ret = perf_output_begin(&handle, counter, header.size, nmi);
+	if (ret)
+		return;
+
+	perf_output_put(&handle, header);
+	perf_output_put(&handle, ip);
+
+	if (counter->hw_event.include_tid)
+		perf_output_put(&handle, tid_entry);
 
-	perf_output_write(counter, nmi, &event, size);
+	perf_output_end(&handle);
 }
 
 static void perf_output_group(struct perf_counter *counter, int nmi)

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [tip:perfcounters/core] perf_counter tools: kerneltop: update event_types
  2009-03-30 17:07 ` [PATCH 12/15] pref_counter: kerneltop: update event_types Peter Zijlstra
@ 2009-04-01 10:14   ` Peter Zijlstra
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2009-04-01 10:14 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, paulus, hpa, mingo, a.p.zijlstra, tglx, mingo

Commit-ID:  1fbb3f14bf98d66df9cec6b8321b6c8d2f87655a
Gitweb:     http://git.kernel.org/tip/1fbb3f14bf98d66df9cec6b8321b6c8d2f87655a
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Mon, 30 Mar 2009 19:07:13 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 1 Apr 2009 11:33:37 +0200

perf_counter tools: kerneltop: update event_types

Go along with the new perf_event_type ABI.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20090330171024.133985461@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 Documentation/perf_counter/kerneltop.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/perf_counter/kerneltop.c b/Documentation/perf_counter/kerneltop.c
index c0ca015..430810d 100644
--- a/Documentation/perf_counter/kerneltop.c
+++ b/Documentation/perf_counter/kerneltop.c
@@ -1263,8 +1263,8 @@ static void mmap_read(struct mmap_data *md)
 		old += size;
 
 		switch (event->header.type) {
-		case PERF_EVENT_IP:
-		case PERF_EVENT_IP | __PERF_EVENT_TID:
+		case PERF_EVENT_OVERFLOW | __PERF_EVENT_IP:
+		case PERF_EVENT_OVERFLOW | __PERF_EVENT_IP | __PERF_EVENT_TID:
 			process_event(event->ip.ip, md->counter);
 			break;
 

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [tip:perfcounters/core] perf_counter: provide generic callchain bits
  2009-03-30 17:07 ` [PATCH 13/15] perf_counter: provide generic callchain bits Peter Zijlstra
  2009-03-31  6:12   ` Paul Mackerras
@ 2009-04-01 10:14   ` Peter Zijlstra
  1 sibling, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2009-04-01 10:14 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, paulus, hpa, mingo, a.p.zijlstra, tglx, mingo

Commit-ID:  1254256f65ce3973462dc153e02b4f3a7e314e20
Gitweb:     http://git.kernel.org/tip/1254256f65ce3973462dc153e02b4f3a7e314e20
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Mon, 30 Mar 2009 19:07:14 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 1 Apr 2009 11:33:38 +0200

perf_counter: provide generic callchain bits

Provide the generic callchain support bits. If hw_event->callchain is
set, the arch-specific perf_callchain() function is called upon to
provide a perf_callchain_entry structure filled with the current
callchain.

If it does so, it is added to the overflow output event.
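
In the output stream that means an overflow record whose header has
__PERF_EVENT_CALLCHAIN set ends with the perf_callchain_entry payload: a
u64 entry count followed by that many u64 instruction pointers.  A rough
consumer-side sketch, where p is an unsigned char * cursor assumed to sit
just past the ip/tid fields:

	__u64 nr, i;

	memcpy(&nr, p, sizeof(nr));
	p += sizeof(nr);

	for (i = 0; i < nr; i++) {
		__u64 ip;

		memcpy(&ip, p, sizeof(ip));
		p += sizeof(ip);
		printf("  %016llx\n", (unsigned long long)ip);
	}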

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20090330171024.254266860@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 include/linux/perf_counter.h |   13 ++++++++++++-
 kernel/perf_counter.c        |   27 +++++++++++++++++++++++++++
 2 files changed, 39 insertions(+), 1 deletions(-)

diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
index edf5bfb..43083af 100644
--- a/include/linux/perf_counter.h
+++ b/include/linux/perf_counter.h
@@ -140,8 +140,9 @@ struct perf_counter_hw_event {
 				include_tid    :  1, /* include the tid       */
 				mmap           :  1, /* include mmap data     */
 				munmap         :  1, /* include munmap data   */
+				callchain      :  1, /* add callchain data    */
 
-				__reserved_1   : 52;
+				__reserved_1   : 51;
 
 	__u32			extra_config_len;
 	__u32			__reserved_4;
@@ -219,6 +220,7 @@ enum perf_event_type {
 	PERF_EVENT_OVERFLOW	= 1UL << 31,
 	__PERF_EVENT_IP		= 1UL << 30,
 	__PERF_EVENT_TID	= 1UL << 29,
+	__PERF_EVENT_CALLCHAIN  = 1UL << 28,
 };
 
 #ifdef __KERNEL__
@@ -504,6 +506,15 @@ extern void perf_counter_mmap(unsigned long addr, unsigned long len,
 extern void perf_counter_munmap(unsigned long addr, unsigned long len,
 				unsigned long pgoff, struct file *file);
 
+#define MAX_STACK_DEPTH		255
+
+struct perf_callchain_entry {
+	u64	nr;
+	u64	ip[MAX_STACK_DEPTH];
+};
+
+extern struct perf_callchain_entry *perf_callchain(struct pt_regs *regs);
+
 #else
 static inline void
 perf_counter_task_sched_in(struct task_struct *task, int cpu)		{ }
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index d93e9dd..860cdc2 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -1654,6 +1654,17 @@ void perf_counter_do_pending(void)
 }
 
 /*
+ * Callchain support -- arch specific
+ */
+
+struct perf_callchain_entry *
+__attribute__((weak))
+perf_callchain(struct pt_regs *regs)
+{
+	return NULL;
+}
+
+/*
  * Output
  */
 
@@ -1764,6 +1775,8 @@ static void perf_output_simple(struct perf_counter *counter,
 	struct {
 		u32 pid, tid;
 	} tid_entry;
+	struct perf_callchain_entry *callchain = NULL;
+	int callchain_size = 0;
 
 	header.type = PERF_EVENT_OVERFLOW;
 	header.size = sizeof(header);
@@ -1781,6 +1794,17 @@ static void perf_output_simple(struct perf_counter *counter,
 		header.size += sizeof(tid_entry);
 	}
 
+	if (counter->hw_event.callchain) {
+		callchain = perf_callchain(regs);
+
+		if (callchain) {
+			callchain_size = (1 + callchain->nr) * sizeof(u64);
+
+			header.type |= __PERF_EVENT_CALLCHAIN;
+			header.size += callchain_size;
+		}
+	}
+
 	ret = perf_output_begin(&handle, counter, header.size, nmi);
 	if (ret)
 		return;
@@ -1791,6 +1815,9 @@ static void perf_output_simple(struct perf_counter *counter,
 	if (counter->hw_event.include_tid)
 		perf_output_put(&handle, tid_entry);
 
+	if (callchain)
+		perf_output_copy(&handle, callchain, callchain_size);
+
 	perf_output_end(&handle);
 }
 

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [tip:perfcounters/core] perf_counter: x86: callchain support
  2009-03-30 17:07 ` [PATCH 14/15] perf_counter: x86: callchain support Peter Zijlstra
@ 2009-04-01 10:14   ` Peter Zijlstra
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2009-04-01 10:14 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, paulus, hpa, mingo, a.p.zijlstra, fweisbec,
	sandmann, srostedt, tglx, mingo

Commit-ID:  94f40f6a09ff7d0a2232a094ea11f953f3ae813e
Gitweb:     http://git.kernel.org/tip/94f40f6a09ff7d0a2232a094ea11f953f3ae813e
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Mon, 30 Mar 2009 19:07:15 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 1 Apr 2009 11:33:38 +0200

perf_counter: x86: callchain support

Provide the x86 perf_callchain() implementation.

Code based on the ftrace/sysprof code from Soeren Sandmann Pedersen.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Cc: Soeren Sandmann Pedersen <sandmann@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <srostedt@redhat.com>
LKML-Reference: <20090330171024.341993293@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 arch/x86/kernel/cpu/perf_counter.c |  154 ++++++++++++++++++++++++++++++++++++
 1 files changed, 154 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_counter.c b/arch/x86/kernel/cpu/perf_counter.c
index b8885cc..e16dfaf 100644
--- a/arch/x86/kernel/cpu/perf_counter.c
+++ b/arch/x86/kernel/cpu/perf_counter.c
@@ -16,8 +16,10 @@
 #include <linux/module.h>
 #include <linux/kdebug.h>
 #include <linux/sched.h>
+#include <linux/uaccess.h>
 
 #include <asm/apic.h>
+#include <asm/stacktrace.h>
 
 static bool perf_counters_initialized __read_mostly;
 
@@ -958,3 +960,155 @@ hw_perf_counter_init(struct perf_counter *counter)
 
 	return &x86_perf_counter_ops;
 }
+
+/*
+ * callchain support
+ */
+
+static inline
+void callchain_store(struct perf_callchain_entry *entry, unsigned long ip)
+{
+	if (entry->nr < MAX_STACK_DEPTH)
+		entry->ip[entry->nr++] = ip;
+}
+
+static DEFINE_PER_CPU(struct perf_callchain_entry, irq_entry);
+static DEFINE_PER_CPU(struct perf_callchain_entry, nmi_entry);
+
+
+static void
+backtrace_warning_symbol(void *data, char *msg, unsigned long symbol)
+{
+	/* Ignore warnings */
+}
+
+static void backtrace_warning(void *data, char *msg)
+{
+	/* Ignore warnings */
+}
+
+static int backtrace_stack(void *data, char *name)
+{
+	/* Don't bother with IRQ stacks for now */
+	return -1;
+}
+
+static void backtrace_address(void *data, unsigned long addr, int reliable)
+{
+	struct perf_callchain_entry *entry = data;
+
+	if (reliable)
+		callchain_store(entry, addr);
+}
+
+static const struct stacktrace_ops backtrace_ops = {
+	.warning		= backtrace_warning,
+	.warning_symbol		= backtrace_warning_symbol,
+	.stack			= backtrace_stack,
+	.address		= backtrace_address,
+};
+
+static void
+perf_callchain_kernel(struct pt_regs *regs, struct perf_callchain_entry *entry)
+{
+	unsigned long bp;
+	char *stack;
+
+	callchain_store(entry, instruction_pointer(regs));
+
+	stack = ((char *)regs + sizeof(struct pt_regs));
+#ifdef CONFIG_FRAME_POINTER
+	bp = frame_pointer(regs);
+#else
+	bp = 0;
+#endif
+
+	dump_trace(NULL, regs, (void *)stack, bp, &backtrace_ops, entry);
+}
+
+
+struct stack_frame {
+	const void __user	*next_fp;
+	unsigned long		return_address;
+};
+
+static int copy_stack_frame(const void __user *fp, struct stack_frame *frame)
+{
+	int ret;
+
+	if (!access_ok(VERIFY_READ, fp, sizeof(*frame)))
+		return 0;
+
+	ret = 1;
+	pagefault_disable();
+	if (__copy_from_user_inatomic(frame, fp, sizeof(*frame)))
+		ret = 0;
+	pagefault_enable();
+
+	return ret;
+}
+
+static void
+perf_callchain_user(struct pt_regs *regs, struct perf_callchain_entry *entry)
+{
+	struct stack_frame frame;
+	const void __user *fp;
+
+	regs = (struct pt_regs *)current->thread.sp0 - 1;
+	fp   = (void __user *)regs->bp;
+
+	callchain_store(entry, regs->ip);
+
+	while (entry->nr < MAX_STACK_DEPTH) {
+		frame.next_fp	     = NULL;
+		frame.return_address = 0;
+
+		if (!copy_stack_frame(fp, &frame))
+			break;
+
+		if ((unsigned long)fp < user_stack_pointer(regs))
+			break;
+
+		callchain_store(entry, frame.return_address);
+		fp = frame.next_fp;
+	}
+}
+
+static void
+perf_do_callchain(struct pt_regs *regs, struct perf_callchain_entry *entry)
+{
+	int is_user;
+
+	if (!regs)
+		return;
+
+	is_user = user_mode(regs);
+
+	if (!current || current->pid == 0)
+		return;
+
+	if (is_user && current->state != TASK_RUNNING)
+		return;
+
+	if (!is_user)
+		perf_callchain_kernel(regs, entry);
+
+	if (current->mm)
+		perf_callchain_user(regs, entry);
+}
+
+struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
+{
+	struct perf_callchain_entry *entry;
+
+	if (in_nmi())
+		entry = &__get_cpu_var(nmi_entry);
+	else
+		entry = &__get_cpu_var(irq_entry);
+
+	entry->nr = 0;
+
+	perf_do_callchain(regs, entry);
+
+	return entry;
+}

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [tip:perfcounters/core] perf_counter: pmc arbitration
  2009-03-30 17:07 ` [PATCH 15/15] perf_counter: pmc arbitration Peter Zijlstra
@ 2009-04-01 10:15   ` Peter Zijlstra
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2009-04-01 10:15 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, paulus, hpa, mingo, a.p.zijlstra, tglx, mingo

Commit-ID:  b4110aa4b80378855c92a090dc079a3c5c69b69c
Gitweb:     http://git.kernel.org/tip/b4110aa4b80378855c92a090dc079a3c5c69b69c
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Mon, 30 Mar 2009 19:07:16 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 1 Apr 2009 11:33:39 +0200

perf_counter: pmc arbitration

Follow the example set by powerpc and try to play nice with oprofile
and the nmi watchdog.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20090330171024.459968444@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 arch/x86/kernel/cpu/perf_counter.c |   75 ++++++++++++++++++++++++++++++++++++
 1 files changed, 75 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_counter.c b/arch/x86/kernel/cpu/perf_counter.c
index e16dfaf..2a946a1 100644
--- a/arch/x86/kernel/cpu/perf_counter.c
+++ b/arch/x86/kernel/cpu/perf_counter.c
@@ -20,6 +20,7 @@
 
 #include <asm/apic.h>
 #include <asm/stacktrace.h>
+#include <asm/nmi.h>
 
 static bool perf_counters_initialized __read_mostly;
 
@@ -172,6 +173,65 @@ again:
 	atomic64_sub(delta, &hwc->period_left);
 }
 
+static atomic_t num_counters;
+static DEFINE_MUTEX(pmc_reserve_mutex);
+
+static bool reserve_pmc_hardware(void)
+{
+	int i;
+
+	if (nmi_watchdog == NMI_LOCAL_APIC)
+		disable_lapic_nmi_watchdog();
+
+	for (i = 0; i < nr_counters_generic; i++) {
+		if (!reserve_perfctr_nmi(pmc_ops->perfctr + i))
+			goto perfctr_fail;
+	}
+
+	for (i = 0; i < nr_counters_generic; i++) {
+		if (!reserve_evntsel_nmi(pmc_ops->eventsel + i))
+			goto eventsel_fail;
+	}
+
+	return true;
+
+eventsel_fail:
+	for (i--; i >= 0; i--)
+		release_evntsel_nmi(pmc_ops->eventsel + i);
+
+	i = nr_counters_generic;
+
+perfctr_fail:
+	for (i--; i >= 0; i--)
+		release_perfctr_nmi(pmc_ops->perfctr + i);
+
+	if (nmi_watchdog == NMI_LOCAL_APIC)
+		enable_lapic_nmi_watchdog();
+
+	return false;
+}
+
+static void release_pmc_hardware(void)
+{
+	int i;
+
+	for (i = 0; i < nr_counters_generic; i++) {
+		release_perfctr_nmi(pmc_ops->perfctr + i);
+		release_evntsel_nmi(pmc_ops->eventsel + i);
+	}
+
+	if (nmi_watchdog == NMI_LOCAL_APIC)
+		enable_lapic_nmi_watchdog();
+}
+
+static void hw_perf_counter_destroy(struct perf_counter *counter)
+{
+	if (atomic_dec_and_mutex_lock(&num_counters, &pmc_reserve_mutex)) {
+		release_pmc_hardware();
+		mutex_unlock(&pmc_reserve_mutex);
+	}
+}
+
 /*
  * Setup the hardware configuration for a given hw_event_type
  */
@@ -179,10 +239,23 @@ static int __hw_perf_counter_init(struct perf_counter *counter)
 {
 	struct perf_counter_hw_event *hw_event = &counter->hw_event;
 	struct hw_perf_counter *hwc = &counter->hw;
+	int err;
 
 	if (unlikely(!perf_counters_initialized))
 		return -EINVAL;
 
+	err = 0;
+	if (atomic_inc_not_zero(&num_counters)) {
+		mutex_lock(&pmc_reserve_mutex);
+		if (atomic_read(&num_counters) == 0 && !reserve_pmc_hardware())
+			err = -EBUSY;
+		else
+			atomic_inc(&num_counters);
+		mutex_unlock(&pmc_reserve_mutex);
+	}
+	if (err)
+		return err;
+
 	/*
 	 * Generate PMC IRQs:
 	 * (keep 'enabled' bit clear for now)
@@ -230,6 +303,8 @@ static int __hw_perf_counter_init(struct perf_counter *counter)
 		hwc->config |= pmc_ops->event_map(perf_event_id(hw_event));
 	}
 
+	counter->destroy = hw_perf_counter_destroy;
+
 	return 0;
 }
 

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 09/15] perf_counter tools: optionally scale counter values in perfstat mode
  2009-04-01  9:39   ` Ingo Molnar
@ 2009-04-01 10:50     ` Paul Mackerras
  0 siblings, 0 replies; 55+ messages in thread
From: Paul Mackerras @ 2009-04-01 10:50 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zijlstra, linux-kernel

Ingo Molnar writes:

> Nice! Shouldnt this be the default mode of output?

I agree, it should.

> I think we want to make users aware of the fact when output is 
> sampled due to over-commit, rather than precise, agreed?

Definitely.

Paul.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 13/15] perf_counter: provide generic callchain bits
  2009-04-01 10:00                   ` Ingo Molnar
@ 2009-04-01 11:53                     ` Paul Mackerras
  0 siblings, 0 replies; 55+ messages in thread
From: Paul Mackerras @ 2009-04-01 11:53 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zijlstra, Corey Ashford, linux-kernel

Ingo Molnar writes:

> So it's a bit like PEBS and IBS on the x86, right?

Well, my hazy impression was that one or both of those wrote samples
into a ring buffer in hardware, which would be different...

> In theory one could simply override the sampled ptregs->ip with this 
> more precise register value. The instruction where the IRQ hit is 
> probably meaningless, if more precise information is available. But 
> we can have both too i guess.
> 
> The data address extension definitely makes sense - it can be used 
> to for a profile view along the data symbol dimension, instead of 
> the usual function symbol dimension.
> 
> CPU flags makes sense too - irqs-off can help the annotation of 
> source code sections where the profiler sees that irqs were 
> disabled.
> 
> It seems here we gradually descend into arch-specific CPU state 
> technicalities and it's not immediately obvious where to draw the 
> line.

I hoped that event instruction address, event data address and cpu
flags, at least, would be sufficiently abstract to consider having as
generic things, though of course how you get to them is
arch-specific.  The main use of event flags is to know whether or not
the event instruction/data address values are valid.  Instead of
recording the event flags we could just not put the event instr/data
address records in the ring buffer if the values aren't valid
(e.g. the event data address won't be valid if the instruction doesn't
access memory).

> Call-chain and data address abstractions are clear. CPU flags is 
> less clear: we could perhaps split off the irq state and the 
> privilege level information - that is present on all CPUs.

In a machine-independent format, you mean?  That would be a good idea.

> The rest should probably be opaque and not generalized.

And can be provided on a per-arch basis using special raw event code
values.

> _Perhaps_, to stem the inevitable list of such small details, it 
> might make sense to have a record type with signal frame qualities - 
> which would include most of this info. That would mix well with the 
> planned feature of signal generation anyway, right?

Hmmm, so record the entire architected state of the machine, and
effectively put an entire ucontext_t in the ring buffer?  That's
certainly possible - what would we use it for?

> I.e. we could extend the lowlevel sigcontext signal frame generation 
> code in arch/x86/kernel/signal.c (and its powerpc equivalent) to 
> generate a signal frame but output it into the mmap buffer, not into 
> the userspace stack - and we would not actually execute a signal in 
> that context.
> 
>  [ of course, when the counter is configured to generate a signal 
>    that is done too. The code would be dual purpose. ]
> 
> So user-space would get a fully signal frame compatible record - and 
> we'd not have to create a per arch ABI for this because we'd piggy 
> back to the signal frame format.
> 
> We could add SA_NOFPU support for fast-track integer-registers-only 
> frames, etc.
> 
> Hm?

I think the motivation for having the stop-and-signal behaviour is not
so much to get at the entire register set, as to have a way to profile
application-dependent things, i.e. you'd record or histogram something
derived from memory contents at the time the signal was delivered.
Or you might want to do a stack trace but use unwind information to
get the parameters to each function (which on powerpc could now be in
any register or saved onto the stack, once you start going up the call
chain any distance), which is more than we want to do in the kernel.

Paul.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 13/15] perf_counter: provide generic callchain bits
  2009-04-01  7:59               ` Peter Zijlstra
  2009-04-01  8:45                 ` Paul Mackerras
@ 2009-04-01 23:25                 ` Corey Ashford
  2009-04-02  6:43                   ` Peter Zijlstra
  1 sibling, 1 reply; 55+ messages in thread
From: Corey Ashford @ 2009-04-01 23:25 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Paul Mackerras, Ingo Molnar, linux-kernel

Peter Zijlstra wrote:
> On Wed, 2009-04-01 at 14:48 +1100, Paul Mackerras wrote:
>> Peter Zijlstra writes:
>>
>>> That still has the record and read things separate, but as one unified
>>> overflow output.
>> I take it PERF_EVENT_OVERFLOW refers to counter overflow, not ring
>> buffer overflow?  That had me confused for a bit, so more explicit
>> naming, or at least some comments, would be good.
> 
> Ah, yes, I see how that can confuse. PERF_EVENT_COUNTER_OVERFLOW then?
> 
> I was thinking about doing splice() support and that could also generate
> actual event overflow events ;-)
> 
>>>  /*
>>> + * Bits that can be set in hw_event.record_type to request information
>>> + * in the overflow packets.
>>> + */
>>> +enum perf_counter_record_format {
>>> +	PERF_RECORD_IP		= 1U << 0,
>>> +	PERF_RECORD_TID		= 1U << 1,
>>> +	PERF_RECORD_GROUP	= 1U << 2,
>>> +	PERF_RECORD_CALLCHAIN	= 1U << 3,
>>> +};
>> [snip]
>>>  enum perf_event_type {
>>>  
>>> -	PERF_EVENT_GROUP	= 1,
>>> -
>>> -	PERF_EVENT_MMAP		= 2,
>>> -	PERF_EVENT_MUNMAP	= 3,
>>> +	PERF_EVENT_MMAP		= 1,
>>> +	PERF_EVENT_MUNMAP	= 2,
>>>  
>>>  	PERF_EVENT_OVERFLOW	= 1UL << 31,
>>>  	__PERF_EVENT_IP		= 1UL << 30,
>>>  	__PERF_EVENT_TID	= 1UL << 29,
>>> -	__PERF_EVENT_CALLCHAIN  = 1UL << 28,
>>> +	__PERF_EVENT_GROUP	= 1UL << 28,
>>> +	__PERF_EVENT_CALLCHAIN  = 1UL << 27,
>> Could we use the same value (and even the same name) for
>> PERF_RECORD_IP/__PERF_EVENT_IP, PERF_RECORD_TID/__PERF_EVENT_TID,
>> etc.?
> 
> I suppose we could.
> 
>> Also, I haven't looked at the callchain stuff much, but does the
>> callchain info contain a recognizable end delimiter?  At present the
>> callchain comes last, but if we add more output elements they'll
>> presumably go after it, so working out where the callchain ends may
>> become tricky if we're not careful.
> 
> It writes:
> 
> struct callchain_event {
> 	u64 nr
> 	u64 ips[nr]
> };

Looking at the current version of perf_counter.h in the -tip tree, this 
struct definition is not visible to userspace (it's in the #ifdef KERNEL 
section).

- Corey


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 13/15] perf_counter: provide generic callchain bits
  2009-04-01 23:25                 ` Corey Ashford
@ 2009-04-02  6:43                   ` Peter Zijlstra
  2009-04-02  7:41                     ` Peter Zijlstra
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2009-04-02  6:43 UTC (permalink / raw)
  To: Corey Ashford; +Cc: Paul Mackerras, Ingo Molnar, linux-kernel

On Wed, 2009-04-01 at 16:25 -0700, Corey Ashford wrote:

> > struct callchain_event {
> > 	u64 nr
> > 	u64 ips[nr]
> > };
> 
> Looking at the current version of perf_counter.h in the -tip tree, this 
> struct definition is not visible to userspace (it's in the #ifdef KERNEL 
> section).

Correct. It's been on my todo list to write a meta-struct definition
along with the perf_event_type enum that details the data layout of the
various types, but not include any actual structure other than the
header in the userspace bits.

Also, related to a comment Paul made on callchain markers, I wanted to
change it so that we have ip markers in the chain that separate the
kernel and user bits.

I was thinking of ending the kernel trace with -1ULL and ending the user
chain with 0ULL. Presumably we need one more to end hypervisor chains in
case anybody manages to generate those ;-).

This chain however will not change the struct, it will merely add 2
extra entries.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 13/15] perf_counter: provide generic callchain bits
  2009-04-02  6:43                   ` Peter Zijlstra
@ 2009-04-02  7:41                     ` Peter Zijlstra
  2009-04-02  9:10                       ` Paul Mackerras
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2009-04-02  7:41 UTC (permalink / raw)
  To: Corey Ashford; +Cc: Paul Mackerras, Ingo Molnar, linux-kernel

On Thu, 2009-04-02 at 08:43 +0200, Peter Zijlstra wrote:
> On Wed, 2009-04-01 at 16:25 -0700, Corey Ashford wrote:
> 
> > > struct callchain_event {
> > > 	u64 nr
> > > 	u64 ips[nr]
> > > };
> > 
> > Looking at the current version of perf_counter.h in the -tip tree, this 
> > struct definition is not visible to userspace (it's in the #ifdef KERNEL 
> > section).
> 
> Correct. Its been on my todo list to write a meta-struct definition
> along with the perf_event_type enum that details the data layout of the
> various types, but not include any actual structure other than the
> header in the userspace bits.
> 
> Also, related to a comment Paul made on callchain markers, I wanted to
> change it so that we have ip markers in the chain that separate the
> kernel and user bits.
> 
> I was thinking of ending the kernel trace with -1ULL and ending the user
> chain with 0ULL. Presumably we need one more to end hypervisor chains in
> case anybody manages to generate those ;-).
> 
> This chain however will not change the struct, it will merely add 2
> extra entries.

Hmm, another idea:

struct perf_callchain_entry {
  __u32 total, hv, kernel, user;
  __u64 ips[total];
};
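
A consumer could then split the chain without any sentinel entries, e.g.
(sketch only; the handle_*_ip() helpers are made up, and this assumes the
hv, kernel and user parts are stored in that order):

	__u64 *ip = entry->ips;
	__u32 i;

	for (i = 0; i < entry->hv; i++)
		handle_hv_ip(*ip++);
	for (i = 0; i < entry->kernel; i++)
		handle_kernel_ip(*ip++);
	for (i = 0; i < entry->user; i++)
		handle_user_ip(*ip++);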



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 13/15] perf_counter: provide generic callchain bits
  2009-04-02  7:41                     ` Peter Zijlstra
@ 2009-04-02  9:10                       ` Paul Mackerras
  2009-04-02  9:14                         ` Peter Zijlstra
  0 siblings, 1 reply; 55+ messages in thread
From: Paul Mackerras @ 2009-04-02  9:10 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Corey Ashford, Ingo Molnar, linux-kernel

Peter Zijlstra writes:

> Hmm, another idea:
> 
> struct perf_callchain_entry {
>   __u32 total, hv, kernel, user;

That should work - and they could be u16's in fact.

>   __u64 ips[total];
> };

Paul.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 13/15] perf_counter: provide generic callchain bits
  2009-04-02  9:10                       ` Paul Mackerras
@ 2009-04-02  9:14                         ` Peter Zijlstra
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2009-04-02  9:14 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Corey Ashford, Ingo Molnar, linux-kernel

On Thu, 2009-04-02 at 20:10 +1100, Paul Mackerras wrote:
> Peter Zijlstra writes:
> 
> > Hmm, another idea:
> > 
> > struct perf_callchain_entry {
> >   __u32 total, hv, kernel, user;
> 
> That should work - and they could be u16's in fact.

Gah, I just sent it out with u32. But yes, will change.

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2009-04-02  9:16 UTC | newest]

Thread overview: 55+ messages
2009-03-30 17:07 [PATCH 00/15] pending perf_counter bits Peter Zijlstra
2009-03-30 17:07 ` [PATCH 01/15] perf_counter: unify and fix delayed counter wakeup Peter Zijlstra
2009-03-31  5:45   ` Paul Mackerras
2009-03-31  6:37     ` Peter Zijlstra
2009-04-01  9:04       ` Ingo Molnar
2009-04-01 10:12   ` [tip:perfcounters/core] " Peter Zijlstra
2009-03-30 17:07 ` [PATCH 02/15] perf_counter: fix update_userpage() Peter Zijlstra
2009-04-01 10:12   ` [tip:perfcounters/core] " Peter Zijlstra
2009-03-30 17:07 ` [PATCH 03/15] perf_counter: kerneltop: simplify data_head read Peter Zijlstra
2009-04-01 10:13   ` [tip:perfcounters/core] " Peter Zijlstra
2009-03-30 17:07 ` [PATCH 04/15] perf_counter: executable mmap() information Peter Zijlstra
2009-04-01 10:13   ` [tip:perfcounters/core] " Peter Zijlstra
2009-03-30 17:07 ` [PATCH 05/15] perf_counter: kerneltop: parse the mmap data stream Peter Zijlstra
2009-04-01 10:13   ` [tip:perfcounters/core] " Peter Zijlstra
2009-03-30 17:07 ` [PATCH 06/15] perf_counter: powerpc: only reserve PMU hardware when we need it Peter Zijlstra
2009-04-01 10:13   ` [tip:perfcounters/core] " Paul Mackerras
2009-03-30 17:07 ` [PATCH 07/15] perf_counter: make it possible for hw_perf_counter_init to return error codes Peter Zijlstra
2009-04-01 10:13   ` [tip:perfcounters/core] " Paul Mackerras
2009-03-30 17:07 ` [PATCH 08/15] perf_counter: x86: proper error propagation for the x86 hw_perf_counter_init() Peter Zijlstra
2009-04-01 10:13   ` [tip:perfcounters/core] " Peter Zijlstra
2009-03-30 17:07 ` [PATCH 09/15] perf_counter tools: optionally scale counter values in perfstat mode Peter Zijlstra
2009-04-01  9:39   ` Ingo Molnar
2009-04-01 10:50     ` Paul Mackerras
2009-04-01 10:14   ` [tip:perfcounters/core] " Paul Mackerras
2009-03-30 17:07 ` [PATCH 10/15] perf_counter: small cleanup of the output routines Peter Zijlstra
2009-04-01 10:14   ` [tip:perfcounters/core] " Peter Zijlstra
2009-03-30 17:07 ` [PATCH 11/15] perf_counter: re-arrange the perf_event_type Peter Zijlstra
2009-04-01 10:14   ` [tip:perfcounters/core] " Peter Zijlstra
2009-03-30 17:07 ` [PATCH 12/15] pref_counter: kerneltop: update event_types Peter Zijlstra
2009-04-01 10:14   ` [tip:perfcounters/core] perf_counter tools: " Peter Zijlstra
2009-03-30 17:07 ` [PATCH 13/15] perf_counter: provide generic callchain bits Peter Zijlstra
2009-03-31  6:12   ` Paul Mackerras
2009-03-31  6:39     ` Peter Zijlstra
2009-03-31  7:24       ` Corey Ashford
2009-03-31  8:43         ` Peter Zijlstra
2009-03-31  9:09           ` Paul Mackerras
2009-03-31  9:12         ` Paul Mackerras
2009-03-31 14:00           ` Peter Zijlstra
2009-03-31 17:11             ` Corey Ashford
2009-04-01  3:48             ` Paul Mackerras
2009-04-01  7:59               ` Peter Zijlstra
2009-04-01  8:45                 ` Paul Mackerras
2009-04-01 10:00                   ` Ingo Molnar
2009-04-01 11:53                     ` Paul Mackerras
2009-04-01 23:25                 ` Corey Ashford
2009-04-02  6:43                   ` Peter Zijlstra
2009-04-02  7:41                     ` Peter Zijlstra
2009-04-02  9:10                       ` Paul Mackerras
2009-04-02  9:14                         ` Peter Zijlstra
2009-04-01 10:14   ` [tip:perfcounters/core] " Peter Zijlstra
2009-03-30 17:07 ` [PATCH 14/15] perf_counter: x86: callchain support Peter Zijlstra
2009-04-01 10:14   ` [tip:perfcounters/core] " Peter Zijlstra
2009-03-30 17:07 ` [PATCH 15/15] perf_counter: pmc arbitration Peter Zijlstra
2009-04-01 10:15   ` [tip:perfcounters/core] " Peter Zijlstra
2009-03-31  5:43 ` [PATCH 00/15] pending perf_counter bits Paul Mackerras
