* [PATCH 0/5] Optimize perf ring-buffer
@ 2010-05-18 13:32 Peter Zijlstra
  2010-05-18 13:32 ` [PATCH 1/5] perf: Disallow mmap() on per-task inherited events Peter Zijlstra
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Peter Zijlstra @ 2010-05-18 13:32 UTC (permalink / raw)
  To: Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo
  Cc: Frederic Weisbecker, Steven Rostedt, Thomas Gleixner,
	linux-kernel, Peter Zijlstra

This patch-set optimizes the perf ring-buffer by removing IRQ disable and 
all LOCK ops from the fast paths.

There's also an RFC patch that adds perf_output_addr(), which will allow
you to write directly to the buffer. However, I'm not sure how to go
about using it for the trace events, since we need to multiplex events --
there is no single copy of the actual event.

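To sketch the core trick (a condensed stand-alone rendition with made-up
names; the real thing is in patches 2-4, and the matching get side just
increments ->nest under preemption-disable): with a single writer per
buffer, publishing the head needs only a nesting count and a re-check
loop, no lock and no IRQ disabling:

struct rb {
	unsigned long head;		/* write position */
	unsigned long nest;		/* writer nesting depth */
	unsigned long user_head;	/* position published to the reader */
};

static void publish(struct rb *rb)
{
	unsigned long head;
again:
	head = rb->head;

	/* An IRQ/NMI writer can nest here and move rb->head. */

	if (--rb->nest)		/* nested writer: the outermost
				 * invocation publishes for us */
		return;

	rb->user_head = head;	/* publish the known-good head */

	if (head != rb->head) {	/* a nested writer snuck in */
		rb->nest++;
		goto again;
	}
}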


* [PATCH 1/5] perf: Disallow mmap() on per-task inherited events
  2010-05-18 13:32 [PATCH 0/5] Optimize perf ring-buffer Peter Zijlstra
@ 2010-05-18 13:32 ` Peter Zijlstra
  2010-05-19  7:19   ` Frederic Weisbecker
  2010-05-25  0:55   ` Paul Mackerras
  2010-05-18 13:33 ` [PATCH 2/5] perf: Remove IRQ-disable from the perf_output path Peter Zijlstra
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2010-05-18 13:32 UTC (permalink / raw)
  To: Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo
  Cc: Frederic Weisbecker, Steven Rostedt, Thomas Gleixner,
	linux-kernel, Peter Zijlstra

[-- Attachment #1: perf-disallow-inherited-mmap.patch --]
[-- Type: text/plain, Size: 971 bytes --]

Since we have had working per-task-per-cpu events for a while now,
disallow mmap() on per-task inherited events. Those things were
a performance problem anyway, and doing away with them allows
us to optimize the buffer somewhat by assuming there is only a single
writer.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/perf_event.c |    8 ++++++++
 1 file changed, 8 insertions(+)

Index: linux-2.6/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/kernel/perf_event.c
+++ linux-2.6/kernel/perf_event.c
@@ -2580,6 +2580,14 @@ static int perf_mmap(struct file *file, 
 	long user_extra, extra;
 	int ret = 0;
 
+	/*
+	 * Don't allow mmap() of inherited per-task counters. This would
+	 * create a performance issue due to all children writing to the
+	 * same buffer.
+	 */
+	if (event->cpu == -1 && event->attr.inherit)
+		return -EINVAL;
+
 	if (!(vma->vm_flags & VM_SHARED))
 		return -EINVAL;
 

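For illustration, a minimal user-space sketch (hypothetical demo code,
not part of the patch) of exactly the case this now rejects:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void)
{
	struct perf_event_attr attr;
	void *buf;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.inherit = 1;		/* inherited by children */

	/* pid == 0, cpu == -1: a per-task event */
	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	/* 1 control page + 1 data page; fails with EINVAL after this patch */
	buf = mmap(NULL, 2 * 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED)
		perror("mmap");

	close(fd);
	return 0;
}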



* [PATCH 2/5] perf: Remove IRQ-disable from the perf_output path
  2010-05-18 13:32 [PATCH 0/5] Optimize perf ring-buffer Peter Zijlstra
  2010-05-18 13:32 ` [PATCH 1/5] perf: Disallow mmap() on per-task inherited events Peter Zijlstra
@ 2010-05-18 13:33 ` Peter Zijlstra
  2010-05-18 13:33 ` [PATCH 3/5] perf: Convert the perf output buffer to local_t Peter Zijlstra
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2010-05-18 13:33 UTC (permalink / raw)
  To: Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo
  Cc: Frederic Weisbecker, Steven Rostedt, Thomas Gleixner,
	linux-kernel, Peter Zijlstra

[-- Attachment #1: perf-buffer-no-irq.patch --]
[-- Type: text/plain, Size: 6408 bytes --]

Since we can now assume there is only a single writer to each buffer,
we can remove the per-cpu lock thingy and use a simple nest-count to the
same effect.

This removes the need to disable IRQs.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/perf_event.h |    5 --
 kernel/perf_event.c        |   94 +++++++++++++--------------------------------
 2 files changed, 30 insertions(+), 69 deletions(-)

Index: linux-2.6/include/linux/perf_event.h
===================================================================
--- linux-2.6.orig/include/linux/perf_event.h
+++ linux-2.6/include/linux/perf_event.h
@@ -597,12 +597,12 @@ struct perf_mmap_data {
 	atomic_t			events;		/* event_id limit       */
 
 	atomic_long_t			head;		/* write position    */
-	atomic_long_t			done_head;	/* completed head    */
 
-	atomic_t			lock;		/* concurrent writes */
 	atomic_t			wakeup;		/* needs a wakeup    */
 	atomic_t			lost;		/* nr records lost   */
 
+	atomic_t			nest;		/* nested writers    */
+
 	long				watermark;	/* wakeup watermark  */
 
 	struct perf_event_mmap_page	*user_page;
@@ -807,7 +807,6 @@ struct perf_output_handle {
 	unsigned long			offset;
 	int				nmi;
 	int				sample;
-	int				locked;
 };
 
 #ifdef CONFIG_PERF_EVENTS
Index: linux-2.6/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/kernel/perf_event.c
+++ linux-2.6/kernel/perf_event.c
@@ -2519,8 +2519,6 @@ perf_mmap_data_init(struct perf_event *e
 {
 	long max_size = perf_data_size(data);
 
-	atomic_set(&data->lock, -1);
-
 	if (event->attr.watermark) {
 		data->watermark = min_t(long, max_size,
 					event->attr.wakeup_watermark);
@@ -2906,82 +2904,56 @@ static void perf_output_wakeup(struct pe
 }
 
 /*
- * Curious locking construct.
- *
  * We need to ensure a later event_id doesn't publish a head when a former
- * event_id isn't done writing. However since we need to deal with NMIs we
+ * event isn't done writing. However since we need to deal with NMIs we
  * cannot fully serialize things.
  *
- * What we do is serialize between CPUs so we only have to deal with NMI
- * nesting on a single CPU.
- *
  * We only publish the head (and generate a wakeup) when the outer-most
- * event_id completes.
+ * event completes.
  */
-static void perf_output_lock(struct perf_output_handle *handle)
+static void perf_output_get_handle(struct perf_output_handle *handle)
 {
 	struct perf_mmap_data *data = handle->data;
-	int cur, cpu = get_cpu();
-
-	handle->locked = 0;
-
-	for (;;) {
-		cur = atomic_cmpxchg(&data->lock, -1, cpu);
-		if (cur == -1) {
-			handle->locked = 1;
-			break;
-		}
-		if (cur == cpu)
-			break;
 
-		cpu_relax();
-	}
+	preempt_disable();
+	atomic_inc(&data->nest);
 }
 
-static void perf_output_unlock(struct perf_output_handle *handle)
+static void perf_output_put_handle(struct perf_output_handle *handle)
 {
 	struct perf_mmap_data *data = handle->data;
 	unsigned long head;
-	int cpu;
-
-	data->done_head = data->head;
-
-	if (!handle->locked)
-		goto out;
 
 again:
-	/*
-	 * The xchg implies a full barrier that ensures all writes are done
-	 * before we publish the new head, matched by a rmb() in userspace when
-	 * reading this position.
-	 */
-	while ((head = atomic_long_xchg(&data->done_head, 0)))
-		data->user_page->data_head = head;
+	head = atomic_long_read(&data->head);
 
 	/*
-	 * NMI can happen here, which means we can miss a done_head update.
+	 * IRQ/NMI can happen here, which means we can miss a head update.
 	 */
 
-	cpu = atomic_xchg(&data->lock, -1);
-	WARN_ON_ONCE(cpu != smp_processor_id());
+	if (!atomic_dec_and_test(&data->nest))
+		return;
 
 	/*
-	 * Therefore we have to validate we did not indeed do so.
+	 * Publish the known good head. Rely on the full barrier implied
+	 * by atomic_dec_and_test() to order the data->head read and this
+	 * write.
 	 */
-	if (unlikely(atomic_long_read(&data->done_head))) {
-		/*
-		 * Since we had it locked, we can lock it again.
-		 */
-		while (atomic_cmpxchg(&data->lock, -1, cpu) != -1)
-			cpu_relax();
+	data->user_page->data_head = head;
 
+	/*
+	 * Now check if we missed an update, rely on the (compiler)
+	 * barrier in atomic_dec_and_test() to re-read data->head.
+	 */
+	if (unlikely(head != atomic_long_read(&data->head))) {
+		atomic_inc(&data->nest);
 		goto again;
 	}
 
 	if (atomic_xchg(&data->wakeup, 0))
 		perf_output_wakeup(handle);
-out:
-	put_cpu();
+
+	preempt_enable();
 }
 
 void perf_output_copy(struct perf_output_handle *handle,
@@ -3063,7 +3035,7 @@ int perf_output_begin(struct perf_output
 	if (have_lost)
 		size += sizeof(lost_event);
 
-	perf_output_lock(handle);
+	perf_output_get_handle(handle);
 
 	do {
 		/*
@@ -3083,7 +3055,7 @@ int perf_output_begin(struct perf_output
 	handle->head	= head;
 
 	if (head - tail > data->watermark)
-		atomic_set(&data->wakeup, 1);
+		atomic_inc(&data->wakeup);
 
 	if (have_lost) {
 		lost_event.header.type = PERF_RECORD_LOST;
@@ -3099,7 +3071,7 @@ int perf_output_begin(struct perf_output
 
 fail:
 	atomic_inc(&data->lost);
-	perf_output_unlock(handle);
+	perf_output_put_handle(handle);
 out:
 	rcu_read_unlock();
 
@@ -3117,11 +3089,11 @@ void perf_output_end(struct perf_output_
 		int events = atomic_inc_return(&data->events);
 		if (events >= wakeup_events) {
 			atomic_sub(wakeup_events, &data->events);
-			atomic_set(&data->wakeup, 1);
+			atomic_inc(&data->wakeup);
 		}
 	}
 
-	perf_output_unlock(handle);
+	perf_output_put_handle(handle);
 	rcu_read_unlock();
 }
 
@@ -3457,22 +3429,13 @@ static void perf_event_task_output(struc
 {
 	struct perf_output_handle handle;
 	struct task_struct *task = task_event->task;
-	unsigned long flags;
 	int size, ret;
 
-	/*
-	 * If this CPU attempts to acquire an rq lock held by a CPU spinning
-	 * in perf_output_lock() from interrupt context, it's game over.
-	 */
-	local_irq_save(flags);
-
 	size  = task_event->event_id.header.size;
 	ret = perf_output_begin(&handle, event, size, 0, 0);
 
-	if (ret) {
-		local_irq_restore(flags);
+	if (ret)
 		return;
-	}
 
 	task_event->event_id.pid = perf_event_pid(event, task);
 	task_event->event_id.ppid = perf_event_pid(event, current);
@@ -3483,7 +3446,6 @@ static void perf_event_task_output(struc
 	perf_output_put(&handle, task_event->event_id);
 
 	perf_output_end(&handle);
-	local_irq_restore(flags);
 }
 
 static int perf_event_task_match(struct perf_event *event)


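For completeness, a sketch of the matching user-space reader side
(hypothetical helper, single consumer assumed): its rmb() pairs with the
barrier implied by atomic_dec_and_test() above, and its data_tail store
is what perf_output_begin() reads back to compute the available space:

#include <stdint.h>
#include <linux/perf_event.h>

#define rmb()	__sync_synchronize()	/* stand-ins for this sketch */
#define mb()	__sync_synchronize()

/* Drain everything between data_tail and data_head; base points at the
 * start of the data pages, size is their total length (power of two). */
static void drain(volatile struct perf_event_mmap_page *pc,
		  unsigned char *base, uint64_t size,
		  void (*consume)(unsigned char *buf, uint64_t len))
{
	uint64_t head = pc->data_head;
	uint64_t tail = pc->data_tail;

	rmb();			/* read no data before the head */

	while (tail != head) {
		uint64_t off = tail % size;
		uint64_t len = head - tail;

		if (off + len > size)	/* range wraps the buffer */
			len = size - off;

		consume(base + off, len);
		tail += len;
	}

	mb();			/* finish reading before freeing the space */
	pc->data_tail = tail;	/* perf_output_begin() reads this back */
}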


* [PATCH 3/5] perf: Convert the perf output buffer to local_t
  2010-05-18 13:32 [PATCH 0/5] Optimize perf ring-buffer Peter Zijlstra
  2010-05-18 13:32 ` [PATCH 1/5] perf: Disallow mmap() on per-task inherited events Peter Zijlstra
  2010-05-18 13:33 ` [PATCH 2/5] perf: Remove IRQ-disable from the perf_output path Peter Zijlstra
@ 2010-05-18 13:33 ` Peter Zijlstra
  2010-05-18 13:33 ` [PATCH 4/5] perf: Avoid local_xchg Peter Zijlstra
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2010-05-18 13:33 UTC (permalink / raw)
  To: Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo
  Cc: Frederic Weisbecker, Steven Rostedt, Thomas Gleixner,
	linux-kernel, Peter Zijlstra

[-- Attachment #1: perf-buffer-local_t.patch --]
[-- Type: text/plain, Size: 4695 bytes --]

Since there is now only a single writer, we can use local_t instead
and avoid all those pesky LOCK insns.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/perf_event.h |   15 +++++++--------
 kernel/perf_event.c        |   30 +++++++++++++++---------------
 2 files changed, 22 insertions(+), 23 deletions(-)

Index: linux-2.6/include/linux/perf_event.h
===================================================================
--- linux-2.6.orig/include/linux/perf_event.h
+++ linux-2.6/include/linux/perf_event.h
@@ -485,6 +485,7 @@ struct perf_guest_info_callbacks {
 #include <linux/ftrace.h>
 #include <linux/cpu.h>
 #include <asm/atomic.h>
+#include <asm/local.h>
 
 #define PERF_MAX_STACK_DEPTH		255
 
@@ -588,20 +589,18 @@ struct perf_mmap_data {
 #ifdef CONFIG_PERF_USE_VMALLOC
 	struct work_struct		work;
 #endif
-	int				data_order;
+	int				data_order;	/* allocation order  */
 	int				nr_pages;	/* nr of data pages  */
 	int				writable;	/* are we writable   */
 	int				nr_locked;	/* nr pages mlocked  */
 
 	atomic_t			poll;		/* POLL_ for wakeups */
-	atomic_t			events;		/* event_id limit       */
 
-	atomic_long_t			head;		/* write position    */
-
-	atomic_t			wakeup;		/* needs a wakeup    */
-	atomic_t			lost;		/* nr records lost   */
-
-	atomic_t			nest;		/* nested writers    */
+	local_t				head;		/* write position    */
+	local_t				nest;		/* nested writers    */
+	local_t				events;		/* event limit       */
+	local_t				wakeup;		/* needs a wakeup    */
+	local_t				lost;		/* nr records lost   */
 
 	long				watermark;	/* wakeup watermark  */
 
Index: linux-2.6/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/kernel/perf_event.c
+++ linux-2.6/kernel/perf_event.c
@@ -2916,7 +2916,7 @@ static void perf_output_get_handle(struc
 	struct perf_mmap_data *data = handle->data;
 
 	preempt_disable();
-	atomic_inc(&data->nest);
+	local_inc(&data->nest);
 }
 
 static void perf_output_put_handle(struct perf_output_handle *handle)
@@ -2925,13 +2925,13 @@ static void perf_output_put_handle(struc
 	unsigned long head;
 
 again:
-	head = atomic_long_read(&data->head);
+	head = local_read(&data->head);
 
 	/*
 	 * IRQ/NMI can happen here, which means we can miss a head update.
 	 */
 
-	if (!atomic_dec_and_test(&data->nest))
+	if (!local_dec_and_test(&data->nest))
 		return;
 
 	/*
@@ -2945,12 +2945,12 @@ again:
 	 * Now check if we missed an update, rely on the (compiler)
 	 * barrier in atomic_dec_and_test() to re-read data->head.
 	 */
-	if (unlikely(head != atomic_long_read(&data->head))) {
-		atomic_inc(&data->nest);
+	if (unlikely(head != local_read(&data->head))) {
+		local_inc(&data->nest);
 		goto again;
 	}
 
-	if (atomic_xchg(&data->wakeup, 0))
+	if (local_xchg(&data->wakeup, 0))
 		perf_output_wakeup(handle);
 
 	preempt_enable();
@@ -3031,7 +3031,7 @@ int perf_output_begin(struct perf_output
 	if (!data->nr_pages)
 		goto out;
 
-	have_lost = atomic_read(&data->lost);
+	have_lost = local_read(&data->lost);
 	if (have_lost)
 		size += sizeof(lost_event);
 
@@ -3045,24 +3045,24 @@ int perf_output_begin(struct perf_output
 		 */
 		tail = ACCESS_ONCE(data->user_page->data_tail);
 		smp_rmb();
-		offset = head = atomic_long_read(&data->head);
+		offset = head = local_read(&data->head);
 		head += size;
 		if (unlikely(!perf_output_space(data, tail, offset, head)))
 			goto fail;
-	} while (atomic_long_cmpxchg(&data->head, offset, head) != offset);
+	} while (local_cmpxchg(&data->head, offset, head) != offset);
 
 	handle->offset	= offset;
 	handle->head	= head;
 
 	if (head - tail > data->watermark)
-		atomic_inc(&data->wakeup);
+		local_inc(&data->wakeup);
 
 	if (have_lost) {
 		lost_event.header.type = PERF_RECORD_LOST;
 		lost_event.header.misc = 0;
 		lost_event.header.size = sizeof(lost_event);
 		lost_event.id          = event->id;
-		lost_event.lost        = atomic_xchg(&data->lost, 0);
+		lost_event.lost        = local_xchg(&data->lost, 0);
 
 		perf_output_put(handle, lost_event);
 	}
@@ -3070,7 +3070,7 @@ int perf_output_begin(struct perf_output
 	return 0;
 
 fail:
-	atomic_inc(&data->lost);
+	local_inc(&data->lost);
 	perf_output_put_handle(handle);
 out:
 	rcu_read_unlock();
@@ -3086,10 +3086,10 @@ void perf_output_end(struct perf_output_
 	int wakeup_events = event->attr.wakeup_events;
 
 	if (handle->sample && wakeup_events) {
-		int events = atomic_inc_return(&data->events);
+		int events = local_inc_return(&data->events);
 		if (events >= wakeup_events) {
-			atomic_sub(wakeup_events, &data->events);
-			atomic_inc(&data->wakeup);
+			local_sub(wakeup_events, &data->events);
+			local_inc(&data->wakeup);
 		}
 	}
 

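What this buys on x86, in isolation (a simplified sketch of the two
increments, using int rather than long; not the actual kernel macros):
both are single read-modify-write instructions and thus cannot be torn
by an IRQ/NMI on the same CPU -- which is all a single-writer buffer
needs -- but only the atomic_t flavour pays for the LOCK prefix:

static inline void atomic_inc_sketch(int *v)
{
	/* LOCK makes the RMW atomic across CPUs; serializing and slow */
	asm volatile("lock incl %0" : "+m" (*v));
}

static inline void local_inc_sketch(int *v)
{
	/* plain incl: atomic only against interrupts on this CPU, which
	 * is sufficient now that each buffer has a single writer */
	asm volatile("incl %0" : "+m" (*v));
}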



* [PATCH 4/5] perf: Avoid local_xchg
  2010-05-18 13:32 [PATCH 0/5] Optimize perf ring-buffer Peter Zijlstra
                   ` (2 preceding siblings ...)
  2010-05-18 13:33 ` [PATCH 3/5] perf: Convert the perf output buffer to local_t Peter Zijlstra
@ 2010-05-18 13:33 ` Peter Zijlstra
  2010-05-18 13:33 ` [RFC PATCH 5/5] perf: Implement perf_output_addr() Peter Zijlstra
  2010-05-19  7:14 ` [PATCH 0/5] Optimize perf ring-buffer Frederic Weisbecker
  5 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2010-05-18 13:33 UTC (permalink / raw)
  To: Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo
  Cc: Frederic Weisbecker, Steven Rostedt, Thomas Gleixner,
	linux-kernel, Peter Zijlstra

[-- Attachment #1: perf-buffer-no-xchg-wakeup.patch --]
[-- Type: text/plain, Size: 1266 bytes --]

Since the x86 XCHG instruction implies LOCK, avoid its use by using a
sequence count instead.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/perf_event.h |    1 +
 kernel/perf_event.c        |    3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/perf_event.h
===================================================================
--- linux-2.6.orig/include/linux/perf_event.h
+++ linux-2.6/include/linux/perf_event.h
@@ -804,6 +804,7 @@ struct perf_output_handle {
 	struct perf_mmap_data		*data;
 	unsigned long			head;
 	unsigned long			offset;
+	unsigned long			wakeup;
 	int				nmi;
 	int				sample;
 };
Index: linux-2.6/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/kernel/perf_event.c
+++ linux-2.6/kernel/perf_event.c
@@ -2917,6 +2917,7 @@ static void perf_output_get_handle(struc
 
 	preempt_disable();
 	local_inc(&data->nest);
+	handle->wakeup = local_read(&data->wakeup);
 }
 
 static void perf_output_put_handle(struct perf_output_handle *handle)
@@ -2950,7 +2951,7 @@ again:
 		goto again;
 	}
 
-	if (local_xchg(&data->wakeup, 0))
+	if (handle->wakeup != local_read(&data->wakeup))
 		perf_output_wakeup(handle);
 
 	preempt_enable();


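The trick in isolation (a condensed sketch with made-up names, mirroring
the hunks above): instead of atomically clearing a wakeup flag with XCHG,
snapshot a monotonically increasing wakeup count when the handle is taken
and compare on release:

struct buf {
	unsigned long wakeup;		/* local_t in the real code */
};

struct handle {
	struct buf	*b;
	unsigned long	wakeup;		/* snapshot from get_handle() */
};

static void do_wakeup(struct handle *h);	/* hypothetical */

static void get_handle(struct handle *h, struct buf *b)
{
	h->b = b;
	h->wakeup = b->wakeup;		/* local_read() */
}

static void request_wakeup(struct buf *b)
{
	b->wakeup++;			/* local_inc(): no LOCK prefix */
}

static void put_handle(struct handle *h)
{
	/* any increment since get_handle() means a wakeup is due */
	if (h->wakeup != h->b->wakeup)
		do_wakeup(h);
}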


* [RFC PATCH 5/5] perf: Implement perf_output_addr()
  2010-05-18 13:32 [PATCH 0/5] Optimize perf ring-buffer Peter Zijlstra
                   ` (3 preceding siblings ...)
  2010-05-18 13:33 ` [PATCH 4/5] perf: Avoid local_xchg Peter Zijlstra
@ 2010-05-18 13:33 ` Peter Zijlstra
  2010-05-18 14:09   ` Peter Zijlstra
  2010-05-19  7:21   ` Frederic Weisbecker
  2010-05-19  7:14 ` [PATCH 0/5] Optimize perf ring-buffer Frederic Weisbecker
  5 siblings, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2010-05-18 13:33 UTC (permalink / raw)
  To: Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo
  Cc: Frederic Weisbecker, Steven Rostedt, Thomas Gleixner,
	linux-kernel, Peter Zijlstra

[-- Attachment #1: perf-buffer-begin_ptr.patch --]
[-- Type: text/plain, Size: 8052 bytes --]

perf_output_addr() will, for space allocated using PO_LINEAR, allow
one to get a linear address to write its data to.

Tracepoints tend to want to do this, although when the events need to be
multiplexed it is of course possible that each event will get different
data due to having to construct the event multiple times.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/kernel/cpu/perf_event_intel_ds.c |    3 -
 include/linux/perf_event.h                |   18 ++++++-
 kernel/perf_event.c                       |   73 +++++++++++++++++++++++-------
 3 files changed, 74 insertions(+), 20 deletions(-)

Index: linux-2.6/include/linux/perf_event.h
===================================================================
--- linux-2.6.orig/include/linux/perf_event.h
+++ linux-2.6/include/linux/perf_event.h
@@ -436,6 +436,14 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_SAMPLE			= 9,
 
+	/*
+	 * struct {
+	 * 	struct perf_event_header	header;
+	 * 	u64				__null[];
+	 * };
+	 */
+	PERF_RECORD_NOP				= 10,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
@@ -805,8 +813,7 @@ struct perf_output_handle {
 	unsigned long			head;
 	unsigned long			offset;
 	unsigned long			wakeup;
-	int				nmi;
-	int				sample;
+	unsigned int			flags;
 };
 
 #ifdef CONFIG_PERF_EVENTS
@@ -1002,12 +1009,17 @@ extern void perf_bp_event(struct perf_ev
 #define perf_instruction_pointer(regs)	instruction_pointer(regs)
 #endif
 
+#define PO_NOWAKE		0x01	/* can't do wakeups        */
+#define PO_SAMPLE		0x02	/* is a PERF_RECORD_SAMPLE */
+#define PO_LINEAR		0x03	/* linear addressable      */
+
 extern int perf_output_begin(struct perf_output_handle *handle,
 			     struct perf_event *event, unsigned int size,
-			     int nmi, int sample);
+			     unsigned int flags);
 extern void perf_output_end(struct perf_output_handle *handle);
 extern void perf_output_copy(struct perf_output_handle *handle,
 			     const void *buf, unsigned int len);
+extern void *perf_output_addr(struct perf_output_handle *handle);
 extern int perf_swevent_get_recursion_context(void);
 extern void perf_swevent_put_recursion_context(int rctx);
 extern void perf_event_enable(struct perf_event *event);
Index: linux-2.6/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/kernel/perf_event.c
+++ linux-2.6/kernel/perf_event.c
@@ -2895,7 +2895,7 @@ static void perf_output_wakeup(struct pe
 {
 	atomic_set(&handle->data->poll, POLL_IN);
 
-	if (handle->nmi) {
+	if (handle->flags & PO_NOWAKE) {
 		handle->event->pending_wakeup = 1;
 		perf_pending_queue(&handle->event->pending,
 				   perf_pending_event);
@@ -2997,12 +2997,12 @@ void perf_output_copy(struct perf_output
 
 int perf_output_begin(struct perf_output_handle *handle,
 		      struct perf_event *event, unsigned int size,
-		      int nmi, int sample)
+		      unsigned int flags)
 {
 	struct perf_event *output_event;
 	struct perf_mmap_data *data;
 	unsigned long tail, offset, head;
-	int have_lost;
+	int have_lost, nop_size = 0;
 	struct {
 		struct perf_event_header header;
 		u64			 id;
@@ -3026,18 +3026,20 @@ int perf_output_begin(struct perf_output
 
 	handle->data	= data;
 	handle->event	= event;
-	handle->nmi	= nmi;
-	handle->sample	= sample;
+	handle->flags	= flags;
 
 	if (!data->nr_pages)
 		goto out;
 
+	perf_output_get_handle(handle);
+
+	if ((flags & PO_LINEAR) && size > (PAGE_SIZE << data->data_order))
+		goto fail;
+
 	have_lost = local_read(&data->lost);
 	if (have_lost)
 		size += sizeof(lost_event);
 
-	perf_output_get_handle(handle);
-
 	do {
 		/*
 		 * Userspace could choose to issue a mb() before updating the
@@ -3047,9 +3049,25 @@ int perf_output_begin(struct perf_output
 		tail = ACCESS_ONCE(data->user_page->data_tail);
 		smp_rmb();
 		offset = head = local_read(&data->head);
-		head += size;
+		head += size + nop_size;
 		if (unlikely(!perf_output_space(data, tail, offset, head)))
 			goto fail;
+
+		if ((flags & PO_LINEAR)) {
+			unsigned long mask = (PAGE_SIZE << data->data_order) - 1;
+			unsigned long start = offset + nop_size;
+
+			if (have_lost)
+				start += sizeof(lost_event);
+
+			if ((start & ~mask) != (head & ~mask)) {
+				nop_size = (head & ~mask) - offset;
+				if (have_lost)
+					nop_size -= sizeof(lost_event);
+				continue;
+			}
+		}
+
 	} while (local_cmpxchg(&data->head, offset, head) != offset);
 
 	handle->offset	= offset;
@@ -3068,6 +3086,15 @@ int perf_output_begin(struct perf_output
 		perf_output_put(handle, lost_event);
 	}
 
+	if (nop_size) {
+		lost_event.header.type = PERF_RECORD_NOP;
+		lost_event.header.misc = 0;
+		lost_event.header.size = nop_size;
+
+		perf_output_put(handle, lost_event.header);
+		handle->offset += nop_size - sizeof(lost_event.header);
+	}
+
 	return 0;
 
 fail:
@@ -3079,6 +3106,20 @@ out:
 	return -ENOSPC;
 }
 
+void *perf_output_addr(struct perf_output_handle *handle)
+{
+	unsigned long pages_mask = handle->data->nr_pages - 1;
+	unsigned long page_order = handle->data->data_order;
+	void **pages = handle->data->data_pages;
+	int nr;
+
+	if (!(handle->flags & PO_LINEAR))
+		return NULL;
+
+	nr = (handle->offset >> (PAGE_SHIFT + page_order)) & pages_mask;
+	return pages[nr] + (handle->offset & ((PAGE_SIZE << page_order) - 1));
+}
+
 void perf_output_end(struct perf_output_handle *handle)
 {
 	struct perf_event *event = handle->event;
@@ -3086,7 +3127,7 @@ void perf_output_end(struct perf_output_
 
 	int wakeup_events = event->attr.wakeup_events;
 
-	if (handle->sample && wakeup_events) {
+	if ((handle->flags & PO_SAMPLE) && wakeup_events) {
 		int events = local_inc_return(&data->events);
 		if (events >= wakeup_events) {
 			local_sub(wakeup_events, &data->events);
@@ -3359,11 +3400,11 @@ static void perf_event_output(struct per
 
 	perf_prepare_sample(&header, data, event, regs);
 
-	if (perf_output_begin(&handle, event, header.size, nmi, 1))
+	if (perf_output_begin(&handle, event, header.size,
+				(nmi ? PO_NOWAKE : 0) | PO_SAMPLE))
 		return;
 
 	perf_output_sample(&handle, &header, data, event);
-
 	perf_output_end(&handle);
 }
 
@@ -3394,7 +3435,7 @@ perf_event_read_event(struct perf_event 
 	};
 	int ret;
 
-	ret = perf_output_begin(&handle, event, read_event.header.size, 0, 0);
+	ret = perf_output_begin(&handle, event, read_event.header.size, 0);
 	if (ret)
 		return;
 
@@ -3433,7 +3474,7 @@ static void perf_event_task_output(struc
 	int size, ret;
 
 	size  = task_event->event_id.header.size;
-	ret = perf_output_begin(&handle, event, size, 0, 0);
+	ret = perf_output_begin(&handle, event, size, 0);
 
 	if (ret)
 		return;
@@ -3548,7 +3589,7 @@ static void perf_event_comm_output(struc
 {
 	struct perf_output_handle handle;
 	int size = comm_event->event_id.header.size;
-	int ret = perf_output_begin(&handle, event, size, 0, 0);
+	int ret = perf_output_begin(&handle, event, size, 0);
 
 	if (ret)
 		return;
@@ -3667,7 +3708,7 @@ static void perf_event_mmap_output(struc
 {
 	struct perf_output_handle handle;
 	int size = mmap_event->event_id.header.size;
-	int ret = perf_output_begin(&handle, event, size, 0, 0);
+	int ret = perf_output_begin(&handle, event, size, 0);
 
 	if (ret)
 		return;
@@ -3828,7 +3869,7 @@ static void perf_log_throttle(struct per
 	if (enable)
 		throttle_event.header.type = PERF_RECORD_UNTHROTTLE;
 
-	ret = perf_output_begin(&handle, event, sizeof(throttle_event), 1, 0);
+	ret = perf_output_begin(&handle, event, sizeof(throttle_event), PO_NOWAKE);
 	if (ret)
 		return;
 
Index: linux-2.6/arch/x86/kernel/cpu/perf_event_intel_ds.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -255,7 +255,8 @@ static void intel_pmu_drain_bts_buffer(v
 	 */
 	perf_prepare_sample(&header, &data, event, &regs);
 
-	if (perf_output_begin(&handle, event, header.size * (top - at), 1, 1))
+	if (perf_output_begin(&handle, event, header.size * (top - at),
+				PO_NOWAKE|PO_SAMPLE))
 		return;
 
 	for (; at < top; at++) {


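A hypothetical in-kernel usage sketch of the proposed API (the record
layout is made up, and the flag names are used as intended -- note the
0x03 overlap pointed out in the follow-up): reserve linearly addressable
space and fill the record in place instead of perf_output_copy()ing it:

struct my_entry {				/* hypothetical payload */
	struct perf_event_header	header;
	u64				value;
};

static void emit_value(struct perf_event *event, u64 value)
{
	struct perf_output_handle handle;
	struct my_entry *entry;

	if (perf_output_begin(&handle, event, sizeof(*entry),
			      PO_SAMPLE | PO_LINEAR))
		return;

	entry = perf_output_addr(&handle);	/* linear, no page crossing */
	entry->header.type = PERF_RECORD_SAMPLE;
	entry->header.misc = 0;
	entry->header.size = sizeof(*entry);
	entry->value	   = value;

	perf_output_end(&handle);
}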


* Re: [RFC PATCH 5/5] perf: Implement perf_output_addr()
  2010-05-18 13:33 ` [RFC PATCH 5/5] perf: Implement perf_output_addr() Peter Zijlstra
@ 2010-05-18 14:09   ` Peter Zijlstra
  2010-05-19  7:21   ` Frederic Weisbecker
  1 sibling, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2010-05-18 14:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul Mackerras, Arnaldo Carvalho de Melo, Frederic Weisbecker,
	Steven Rostedt, Thomas Gleixner, linux-kernel

On Tue, 2010-05-18 at 15:33 +0200, Peter Zijlstra wrote:

> +#define PO_NOWAKE		0x01	/* can't do wakeups        */
> +#define PO_SAMPLE		0x02	/* is a PERF_RECORD_SAMPLE */
> +#define PO_LINEAR		0x03	/* linear addressable      */

D'0h, PO_LINEAR should be 0x04 -- as defined, 0x03 is PO_NOWAKE|PO_SAMPLE,
so the PO_LINEAR tests also fire for plain nmi/sample output. /me gives
himself the moron-award...


* Re: [PATCH 0/5] Optimize perf ring-buffer
  2010-05-18 13:32 [PATCH 0/5] Optimize perf ring-buffer Peter Zijlstra
                   ` (4 preceding siblings ...)
  2010-05-18 13:33 ` [RFC PATCH 5/5] perf: Implement perf_output_addr() Peter Zijlstra
@ 2010-05-19  7:14 ` Frederic Weisbecker
  5 siblings, 0 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2010-05-19  7:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo,
	Steven Rostedt, Thomas Gleixner, linux-kernel

On Tue, May 18, 2010 at 03:32:58PM +0200, Peter Zijlstra wrote:
> This patch-set optimizes the perf ring-buffer by removing IRQ disable and
> all LOCK ops from the fast paths.
> 
> There's also an RFC patch that adds perf_output_addr(), which will allow
> you to write directly to the buffer. However, I'm not sure how to go
> about using it for the trace events, since we need to multiplex events --
> there is no single copy of the actual event.


What do you mean by the fact that we need to multiplex events?



* Re: [PATCH 1/5] perf: Disallow mmap() on per-task inherited events
  2010-05-18 13:32 ` [PATCH 1/5] perf: Disallow mmap() on per-task inherited events Peter Zijlstra
@ 2010-05-19  7:19   ` Frederic Weisbecker
  2010-05-25  0:55   ` Paul Mackerras
  1 sibling, 0 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2010-05-19  7:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo,
	Steven Rostedt, Thomas Gleixner, linux-kernel

On Tue, May 18, 2010 at 03:32:59PM +0200, Peter Zijlstra wrote:
> Since we have had working per-task-per-cpu events for a while now,
> disallow mmap() on per-task inherited events. Those things were
> a performance problem anyway, and doing away with them allows
> us to optimize the buffer somewhat by assuming there is only a single
> writer.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---



That is very good news.



* Re: [RFC PATCH 5/5] perf: Implement perf_output_addr()
  2010-05-18 13:33 ` [RFC PATCH 5/5] perf: Implement perf_output_addr() Peter Zijlstra
  2010-05-18 14:09   ` Peter Zijlstra
@ 2010-05-19  7:21   ` Frederic Weisbecker
  2010-05-19  7:58     ` Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Frederic Weisbecker @ 2010-05-19  7:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo,
	Steven Rostedt, Thomas Gleixner, linux-kernel

On Tue, May 18, 2010 at 03:33:03PM +0200, Peter Zijlstra wrote:
> perf_output_addr() will, for space allocated using PO_LINEAR, allow
> one to get a linear address to write its data to.
> 
> Tracepoints tend to want to do this, although when the events need to be
> multiplexed it is of course possible that each event will get different
> data due to having to construct the event multiple times.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---


I'm still not sure what you mean here by this multiplexing. Is
this about per cpu multiplexing?

There is another problem. We need something like
perf_output_discard() in case the filter rejects the event (which
must be filled in for this check to happen).



* Re: [RFC PATCH 5/5] perf: Implement perf_output_addr()
  2010-05-19  7:21   ` Frederic Weisbecker
@ 2010-05-19  7:58     ` Peter Zijlstra
  2010-05-19  9:03       ` Frederic Weisbecker
  2010-05-19 14:47       ` Steven Rostedt
  0 siblings, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2010-05-19  7:58 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo,
	Steven Rostedt, Thomas Gleixner, linux-kernel

On Wed, 2010-05-19 at 09:21 +0200, Frederic Weisbecker wrote:

> I'm still not sure what you mean here by this multiplexing. Is
> this about per cpu multiplexing?

Suppose there's two events attached to the same tracepoint. Will you
write the tracepoint twice and risk different data in each, or will you
do it once and copy it into each buffer?

> There is another problem. We need something like
> perf_output_discard() in case the filter rejects the event (which
> must be filled in for this check to happen).

Yeah, I utterly hate that, I opted to let anything with a filter take
the slow path. Not only would I have to add a discard, but I'd have to
decrement the counter as well, which is a big no-no.





* Re: [RFC PATCH 5/5] perf: Implement perf_output_addr()
  2010-05-19  7:58     ` Peter Zijlstra
@ 2010-05-19  9:03       ` Frederic Weisbecker
  2010-05-19 14:47       ` Steven Rostedt
  1 sibling, 0 replies; 22+ messages in thread
From: Frederic Weisbecker @ 2010-05-19  9:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo,
	Steven Rostedt, Thomas Gleixner, linux-kernel

On Wed, May 19, 2010 at 09:58:02AM +0200, Peter Zijlstra wrote:
> On Wed, 2010-05-19 at 09:21 +0200, Frederic Weisbecker wrote:
> 
> > I'm still not sure what you mean here by this multiplexing. Is
> > this about per cpu multiplexing?
> 
> Suppose there's two events attached to the same tracepoint. Will you
> write the tracepoint twice and risk different data in each, or will you
> do it once and copy it into each buffer?



The only different data we risk in each is the timestamp, which is
something we can probably fetch once before actually filling the
buffers.


 
> > There is another problem. We need something like
> > perf_output_discard() in case the filter rejects the event (which
> > must be filled in for this check to happen).
> 
> Yeah, I utterly hate that, I opted to let anything with a filter take
> the slow path. Not only would I have to add a discard, but I'd have to
> decrement the counter as well, which is a big no-no.


That makes it complicated. But for now we don't have many other solutions.



* Re: [RFC PATCH 5/5] perf: Implement perf_output_addr()
  2010-05-19  7:58     ` Peter Zijlstra
  2010-05-19  9:03       ` Frederic Weisbecker
@ 2010-05-19 14:47       ` Steven Rostedt
  2010-05-19 15:05         ` Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2010-05-19 14:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Thomas Gleixner, linux-kernel

On Wed, 2010-05-19 at 09:58 +0200, Peter Zijlstra wrote:
> On Wed, 2010-05-19 at 09:21 +0200, Frederic Weisbecker wrote:
> 
> > I'm still not sure what you mean here by this multiplexing. Is
> > this about per cpu multiplexing?
> 
> Suppose there's two events attached to the same tracepoint. Will you
> write the tracepoint twice and risk different data in each, or will you
> do it once and copy it into each buffer?

Is this because the same function deals with the same tracepoint, and
has difficulty in knowing which event it is dealing with?

Note, the shrinking of the TRACE_EVENT() code that I pushed (and I'm
hoping makes it to 35, since it lays the groundwork for lots of features
on top of TRACE_EVENT()) allows you to pass private data to each probe
registered to the tracepoint, letting the same function handle two
different activities, or different tracepoints.

> 
> > There is another problem. We need something like
> > perf_output_discard() in case the filter reject the event (which
> > must be filled for this check to happen).
> 
> Yeah, I utterly hate that, I opted to let anything with a filter take
> the slow path. Not only would I have to add a discard, but I'd have to
> decrement the counter as well, which is a big no-no.

Hmm, this would impact performance on system-wide recording of events
that are filtered. One would think adding a filter would speed things
up, not slow them down.

-- Steve




* Re: [RFC PATCH 5/5] perf: Implement perf_output_addr()
  2010-05-19 14:47       ` Steven Rostedt
@ 2010-05-19 15:05         ` Peter Zijlstra
  2010-05-19 15:38           ` Steven Rostedt
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2010-05-19 15:05 UTC (permalink / raw)
  To: rostedt
  Cc: Frederic Weisbecker, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Thomas Gleixner, linux-kernel

On Wed, 2010-05-19 at 10:47 -0400, Steven Rostedt wrote:
> On Wed, 2010-05-19 at 09:58 +0200, Peter Zijlstra wrote:
> > On Wed, 2010-05-19 at 09:21 +0200, Frederic Weisbecker wrote:
> > 
> > > I'm still not sure what you mean here by this multiplexing. Is
> > > this about per cpu multiplexing?
> > 
> > Suppose there's two events attached to the same tracepoint. Will you
> > write the tracepoint twice and risk different data in each, or will you
> > do it once and copy it into each buffer?
> 
> Is this because the same function deals with the same tracepoint, and
> has difficulty in knowing which event it is dealing with?

No, but suppose the tracepoint has a racy expression in it. Having to
evaluate { assign; } multiple times could yield different results, which
in turn means you have to run the filter multiple times too, etc.

Although I suppose you could delay the commit of the first event and copy
from there into the next events, but that might give rather messy code.

> Note, the shrinking of the TRACE_EVENT() code that I pushed (and I'm
> hoping makes it to 35, since it lays the groundwork for lots of features
> on top of TRACE_EVENT()) allows you to pass private data to each probe
> registered to the tracepoint, letting the same function handle two
> different activities, or different tracepoints.

tracepoint_probe_register() is useless, it requires scheduling. I
currently register a probe on perf_event creation and then maintain a
per-cpu hlist of active events.

> > > There is another problem. We need something like
> > > perf_output_discard() in case the filter rejects the event (which
> > > must be filled in for this check to happen).
> > 
> > Yeah, I utterly hate that, I opted to let anything with a filter take
> > the slow path. Not only would I have to add a discard, but I'd have to
> > decrement the counter as well, which is a big no-no.
> 
> Hmm, this would impact performance on system-wide recording of events
> that are filtered. One would think adding a filter would speed things
> up, not slow them down.

Depends, actually running the filter and backing out might take more
time than simply logging it, esp if you've already done all of the work
and only lack a commit.


* Re: [RFC PATCH 5/5] perf: Implement perf_output_addr()
  2010-05-19 15:05         ` Peter Zijlstra
@ 2010-05-19 15:38           ` Steven Rostedt
  2010-05-19 15:50             ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2010-05-19 15:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Thomas Gleixner, linux-kernel

On Wed, 2010-05-19 at 17:05 +0200, Peter Zijlstra wrote:
> On Wed, 2010-05-19 at 10:47 -0400, Steven Rostedt wrote:
> > On Wed, 2010-05-19 at 09:58 +0200, Peter Zijlstra wrote:
> > > On Wed, 2010-05-19 at 09:21 +0200, Frederic Weisbecker wrote:
> > > 
> > > > I'm still not sure what you mean here by this multiplexing. Is
> > > > this about per cpu multiplexing?
> > > 
> > > Suppose there's two events attached to the same tracepoint. Will you
> > > write the tracepoint twice and risk different data in each, or will you
> > > do it once and copy it into each buffer?
> > 
> > Is this because the same function deals with the same tracepoint, and
> > has difficulty in knowing which event it is dealing with?
> 
> No, but suppose the tracepoint has a racy expression in it. Having to
> evaluate { assign; } multiple times could yield different results, which
> in turn means you have to run the filter multiple times too, etc.

I'm still a bit confused by what you mean here. Could you show an
example?

> 
> Although I suppose you could delay the commit of the first event and copy
> from there into the next events, but that might give rather messy code.
> 
> > Note, the shrinking of the TRACE_EVENT() code that I pushed (and I'm
> > hoping makes it to 35, since it lays the groundwork for lots of features
> > on top of TRACE_EVENT()) allows you to pass private data to each probe
> > registered to the tracepoint, letting the same function handle two
> > different activities, or different tracepoints.
> 
> tracepoint_probe_register() is useless, it requires scheduling. I
> currently register a probe on perf_event creation and then maintain a
> per-cpu hlist of active events.

When is perf_event creation? When the user runs the code or at boot up?

> 
> > > > There is another problem. We need something like
> > > > perf_output_discard() in case the filter rejects the event (which
> > > > must be filled in for this check to happen).
> > > 
> > > Yeah, I utterly hate that, I opted to let anything with a filter take
> > > the slow path. Not only would I have to add a discard, but I'd have to
> > > decrement the counter as well, which is a big no-no.
> > 
> > Hmm, this would impact performance on system-wide recording of events
> > that are filtered. One would think adding a filter would speed things
> > up, not slow them down.
> 
> Depends, actually running the filter and backing out might take more
> time than simply logging it, esp if you've already done all of the work
> and only lack a commit.

Hmm, could be, don't know for sure. I just want to keep the macro magic
to a minimum ;-)

-- Steve





* Re: [RFC PATCH 5/5] perf: Implement perf_output_addr()
  2010-05-19 15:38           ` Steven Rostedt
@ 2010-05-19 15:50             ` Peter Zijlstra
  2010-05-19 16:08               ` Steven Rostedt
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2010-05-19 15:50 UTC (permalink / raw)
  To: rostedt
  Cc: Frederic Weisbecker, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Thomas Gleixner, linux-kernel

On Wed, 2010-05-19 at 11:38 -0400, Steven Rostedt wrote:

> > No, but suppose the tracepoint has a racy expression in it. Having to
> > evaluate { assign; } multiple times could yield different results, which
> > in turn means you have to run the filter multiple times too, etc.
> 
> I'm still a bit confused by what you mean here. Could you show an
> example?

Well, suppose { assign; } contains:

 entry->foo = atomic_read(&bar);

Now suppose you have multiple active consumers of the tracepoint, either
you do the evaluation once and copy that around, or you do it multiple
times and end up with different results.
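
Concretely (a hypothetical TRACE_EVENT() sketch, assuming some global
atomic_t bar; not a tracepoint from the tree): if { assign; } runs once
per attached event, each buffer can record a different bar:

TRACE_EVENT(foo_bar,
	TP_PROTO(int foo),
	TP_ARGS(foo),
	TP_STRUCT__entry(
		__field(int, foo)
		__field(int, bar)
	),
	TP_fast_assign(
		__entry->foo = foo;
		__entry->bar = atomic_read(&bar);	/* racy source */
	),
	TP_printk("foo=%d bar=%d", __entry->foo, __entry->bar)
);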

> > Although I suppose you could delay the commit of the first event and copy
> > from there into the next events, but that might give rather messy code.
> > 
> > > Note, the shrinking of the TRACE_EVENT() code that I pushed (and I'm
> > > hoping makes it to 35, since it lays the groundwork for lots of features
> > > on top of TRACE_EVENT()) allows you to pass private data to each probe
> > > registered to the tracepoint, letting the same function handle two
> > > different activities, or different tracepoints.
> > 
> > tracepoint_probe_register() is useless, it requires scheduling. I
> > > currently register a probe on perf_event creation and then maintain a
> > per-cpu hlist of active events.
> 
> When is perf_event creation? When the user runs the code or at boot up?

sys_perf_counter_open()

And an event could be per task, so it needs to be scheduled along with
the task context; try doing that with probes ;-)

> Hmm, could be, don't know for sure. I just want to keep the macro magic
> to a minimum ;-)

Right, but filters evaluated at the point where you've basically already
done all the hard work simply don't make much sense in my book.


* Re: [RFC PATCH 5/5] perf: Implement perf_output_addr()
  2010-05-19 15:50             ` Peter Zijlstra
@ 2010-05-19 16:08               ` Steven Rostedt
  2010-05-19 16:15                 ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2010-05-19 16:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Thomas Gleixner, linux-kernel

On Wed, 2010-05-19 at 17:50 +0200, Peter Zijlstra wrote:
> On Wed, 2010-05-19 at 11:38 -0400, Steven Rostedt wrote:
> 
> > > No, but suppose the tracepoint has a racy expression in it. Having to
> > > evaluate { assign; } multiple times could yield different results, which
> > > in turn means you have to run the filter multiple times too, etc.
> > 
> > I'm still a bit confused by what you mean here. Could you show an
> > example?
> 
> Well, suppose { assign; } contains:
> 
>  entry->foo = atomic_read(&bar);
> 
> Now suppose you have multiple active consumers of the tracepoint, either
> you do the evaluation once and copy that around, or you do it multiple
> times and end up with different results.

OK, this is where I'm getting a bit lost. The "multiple active
consumers". Is this multiple instances of perf? Or perf doing multiple
things with that event using different buffers?

> 
> > > Although I suppose you could delay the commit of the first event and copy
> > > from there into the next events, but that might give rather messy code.
> > > 
> > > > Note, the shrinking of the TRACE_EVENT() code that I pushed (and I'm
> > > > hoping makes it to 35, since it lays the groundwork for lots of features
> > > > on top of TRACE_EVENT()) allows you to pass private data to each probe
> > > > registered to the tracepoint, letting the same function handle two
> > > > different activities, or different tracepoints.
> > > 
> > > tracepoint_probe_register() is useless, it requires scheduling. I
> > > > currently register a probe on perf_event creation and then maintain a
> > > per-cpu hlist of active events.
> > 
> > When is perf_event creation? When the user runs the code or at boot up?
> 
> sys_perf_counter_open()
> 
> And an event could be per task, so it needs to be scheduled along with
> the task context; try doing that with probes ;-)

Ah, this is basically the same thing that ftrace does too. It only
enables the tracepoint (or function tracer) at initiation of the trace,
and uses things like a hash table to determine if the event (or
function) should be traced or not.

> 
> > Hmm, could be, don't know for sure. I just want to keep the macro magic
> > to a minimum ;-)
> 
> Right, but filters evaluated at the point where you've basically already
> done all the hard work simply don't make much sense in my book.

Well, the hard work was just to reserve the buffer, which is under 100ns
to do. But we still need the assign, because the filters compare the
result of those assigns.

I guess you are saying that if we have a filter, we need to do the
assign to a temporary buffer, evaluate, and then decide if we should
record it (via copy) or not.

-- Steve




* Re: [RFC PATCH 5/5] perf: Implement perf_output_addr()
  2010-05-19 16:08               ` Steven Rostedt
@ 2010-05-19 16:15                 ` Peter Zijlstra
  2010-05-19 16:27                   ` Steven Rostedt
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2010-05-19 16:15 UTC (permalink / raw)
  To: rostedt
  Cc: Frederic Weisbecker, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Thomas Gleixner, linux-kernel

On Wed, 2010-05-19 at 12:08 -0400, Steven Rostedt wrote:
> > Now suppose you have multiple active consumers of the tracepoint, either
> > you do the evaluation once and copy that around, or you do it multiple
> > times and end up with different results.
> 
> OK, this is where I'm getting a bit lost. The "multiple active
> consumers". Is this multiple instances of perf? Or perf doing multiple
> things with that event using different buffers? 

Multiple perf events of the same tracepoint, basically what you would end
up with if you were to allow multiple buffers.

Say task A and B both sample C's sched:sched_wakeup events. Then the
tracepoint will have two active perf_events hanging from it and we need
to fill two buffers.




* Re: [RFC PATCH 5/5] perf: Implement perf_output_addr()
  2010-05-19 16:15                 ` Peter Zijlstra
@ 2010-05-19 16:27                   ` Steven Rostedt
  2010-05-19 16:34                     ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2010-05-19 16:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Thomas Gleixner, linux-kernel

On Wed, 2010-05-19 at 18:15 +0200, Peter Zijlstra wrote:
> On Wed, 2010-05-19 at 12:08 -0400, Steven Rostedt wrote:
> > > Now suppose you have multiple active consumers of the tracepoint, either
> > > you do the evaluation once and copy that around, or you do it multiple
> > > times and end up with different results.
> > 
> > OK, this is where I'm getting a bit lost. The "multiple active
> > consumers". Is this multiple instances of perf? Or perf doing multiple
> > things with that event using different buffers? 
> 
> Multiple perf events of the same tracepoint, basically what you would end
> up with if you were to allow multiple buffers.
> 
> Say task A and B both sample C's sched:sched_wakeup events. Then the
> tracepoint will have two active perf_events hanging from it and we need
> to fill two buffers.

OK, so I would let them evaluate separately. If they do have two
different results, then that's fine, because the view of an event could
possibly be different. The &bar may change in the two instances, but how
much does that matter? Which version of &bar is correct anyway?

How do you handle the multiple readers then? The call to record the
event copies to each buffer that is registered for that event?

If more than one buffer is attached to an event, you could also work to
directly write to one, and then copy directly from that buffer to the
others.

-- Steve





* Re: [RFC PATCH 5/5] perf: Implement perf_output_addr()
  2010-05-19 16:27                   ` Steven Rostedt
@ 2010-05-19 16:34                     ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2010-05-19 16:34 UTC (permalink / raw)
  To: rostedt
  Cc: Frederic Weisbecker, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Thomas Gleixner, linux-kernel

On Wed, 2010-05-19 at 12:27 -0400, Steven Rostedt wrote:

> OK, so I would let them evaluate separately. If they do have two
> different results, then that's fine, because the view of an event could
> possibly be different. The &bar may change in the two instances, but how
> much does that matter? Which version of &bar is correct anyway?

Right, can do, but again, that sucks for filters -- I'm starting to
think we never should have accepted that stuff.

> How do you handle the multiple readers then? The call to record the
> event copies to each buffer that is registered for that event?

Yeah, each perf_event has its own buffer (usually), so we simply generate
multiple events, one for each buffer. The readers, task A and B, will get
wakeups once in a while to empty the buffer.

> If more than one buffer is attached to an event, you could also work to
> directly write to one, and then copy directly from that buffer to the
> others.

Yeah, I mentioned this a few emails back, but since you need to keep
the commit open until you've copied it, the code can get quite ugly.


* Re: [PATCH 1/5] perf: Disallow mmap() on per-task inherited events
  2010-05-18 13:32 ` [PATCH 1/5] perf: Disallow mmap() on per-task inherited events Peter Zijlstra
  2010-05-19  7:19   ` Frederic Weisbecker
@ 2010-05-25  0:55   ` Paul Mackerras
  2010-05-25  8:19     ` Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Paul Mackerras @ 2010-05-25  0:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Frederic Weisbecker,
	Steven Rostedt, Thomas Gleixner, linux-kernel

On Tue, May 18, 2010 at 03:32:59PM +0200, Peter Zijlstra wrote:

> Since we have had working per-task-per-cpu events for a while now,
> disallow mmap() on per-task inherited events. Those things were
> a performance problem anyway, and doing away with them allows
> us to optimize the buffer somewhat by assuming there is only a single
> writer.

This also disallows user-space access to the hardware counter for this
event -- which is arguably OK, since doing the userspace read would
give a different answer to read() on the event fd, as the read() sums
up all the child counters for us.  Nevertheless, I think this
side-effect is worth mentioning.

Paul.


* Re: [PATCH 1/5] perf: Disallow mmap() on per-task inherited events
  2010-05-25  0:55   ` Paul Mackerras
@ 2010-05-25  8:19     ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2010-05-25  8:19 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Frederic Weisbecker,
	Steven Rostedt, Thomas Gleixner, linux-kernel

On Tue, 2010-05-25 at 10:55 +1000, Paul Mackerras wrote:
> On Tue, May 18, 2010 at 03:32:59PM +0200, Peter Zijlstra wrote:
> 
> > Since we have had working per-task-per-cpu events for a while now,
> > disallow mmap() on per-task inherited events. Those things were
> > a performance problem anyway, and doing away with them allows
> > us to optimize the buffer somewhat by assuming there is only a single
> > writer.
> 
> This also disallows user-space access to the hardware counter for this
> event -- which is arguably OK, since doing the userspace read would
> give a different answer to read() on the event fd, as the read() sums
> up all the child counters for us.  Nevertheless, I think this
> side-effect is worth mentioning.

Right, using rdpmc (and similar) should be strictly limited to
self-monitoring.


Thread overview: 22+ messages
2010-05-18 13:32 [PATCH 0/5] Optimize perf ring-buffer Peter Zijlstra
2010-05-18 13:32 ` [PATCH 1/5] perf: Disallow mmap() on per-task inherited events Peter Zijlstra
2010-05-19  7:19   ` Frederic Weisbecker
2010-05-25  0:55   ` Paul Mackerras
2010-05-25  8:19     ` Peter Zijlstra
2010-05-18 13:33 ` [PATCH 2/5] perf: Remove IRQ-disable from the perf_output path Peter Zijlstra
2010-05-18 13:33 ` [PATCH 3/5] perf: Convert the perf output buffer to local_t Peter Zijlstra
2010-05-18 13:33 ` [PATCH 4/5] perf: Avoid local_xchg Peter Zijlstra
2010-05-18 13:33 ` [RFC PATCH 5/5] perf: Implement perf_output_addr() Peter Zijlstra
2010-05-18 14:09   ` Peter Zijlstra
2010-05-19  7:21   ` Frederic Weisbecker
2010-05-19  7:58     ` Peter Zijlstra
2010-05-19  9:03       ` Frederic Weisbecker
2010-05-19 14:47       ` Steven Rostedt
2010-05-19 15:05         ` Peter Zijlstra
2010-05-19 15:38           ` Steven Rostedt
2010-05-19 15:50             ` Peter Zijlstra
2010-05-19 16:08               ` Steven Rostedt
2010-05-19 16:15                 ` Peter Zijlstra
2010-05-19 16:27                   ` Steven Rostedt
2010-05-19 16:34                     ` Peter Zijlstra
2010-05-19  7:14 ` [PATCH 0/5] Optimize perf ring-buffer Frederic Weisbecker
