* [PATCH] perf: Update event buffer tail when overwriting old events
@ 2013-06-06 5:58 Yan, Zheng
2013-06-18 9:13 ` Peter Zijlstra
2013-07-08 12:15 ` Peter Zijlstra
0 siblings, 2 replies; 12+ messages in thread
From: Yan, Zheng @ 2013-06-06 5:58 UTC (permalink / raw)
To: linux-kernel; +Cc: peterz, mingo, eranian, ak, Yan, Zheng
From: "Yan, Zheng" <zheng.z.yan@intel.com>
If the perf event buffer is in overwrite mode, the kernel only updates
the data head when it overwrites old samples. The program that owns
the buffer needs to periodically check the buffer and update a variable
that tracks the data tail. If the program fails to do this in time,
the data tail can be overwritten by new samples. The program then has to
rewind the buffer because it does not know where the first valid
sample is.
This patch makes the kernel update the data tail when it overwrites
old events, so the program that owns the event buffer can always
read the latest samples. This is convenient for programs that use
perf to do branch tracing. One use case is GDB branch tracing:
(http://sourceware.org/ml/gdb-patches/2012-06/msg00172.html)
It uses the perf interface to read BTS, but only cares about the branches
before the ptrace event.
I added code to perf_output_{begin/end} to count how many cycles
are spent on sample output, then ran "perf record" to profile a kernel
compilation 10 times on IvyBridge-EP. (perf record -a make -j 60)
The first number is scaled to 1000; the other numbers are scaled by
the same factor.
        before   overwrite mode   after   overwrite mode
AVG       1000         999         1046        1044
STDEV     19.4        19.5         17.1        17.9
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
kernel/events/internal.h | 2 ++
kernel/events/ring_buffer.c | 74 ++++++++++++++++++++++++---------------------
2 files changed, 42 insertions(+), 34 deletions(-)
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index eb675c4..c6d7539 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -20,6 +20,8 @@ struct ring_buffer {
atomic_t poll; /* POLL_ for wakeups */
+ local_t tail; /* read position */
+ local_t next_tail; /* next read position */
local_t head; /* write position */
local_t nest; /* nested writers */
local_t events; /* event limit */
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index cd55144..2d5b15e 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -15,28 +15,9 @@
#include "internal.h"
-static bool perf_output_space(struct ring_buffer *rb, unsigned long tail,
- unsigned long offset, unsigned long head)
+static bool perf_output_space(unsigned long tail, unsigned long offset,
+ unsigned long head, unsigned long mask)
{
- unsigned long sz = perf_data_size(rb);
- unsigned long mask = sz - 1;
-
- /*
- * check if user-writable
- * overwrite : over-write its own tail
- * !overwrite: buffer possibly drops events.
- */
- if (rb->overwrite)
- return true;
-
- /*
- * verify that payload is not bigger than buffer
- * otherwise masking logic may fail to detect
- * the "not enough space" condition
- */
- if ((head - offset) > sz)
- return false;
-
offset = (offset - tail) & mask;
head = (head - tail) & mask;
@@ -113,7 +94,7 @@ int perf_output_begin(struct perf_output_handle *handle,
struct perf_event *event, unsigned int size)
{
struct ring_buffer *rb;
- unsigned long tail, offset, head;
+ unsigned long tail, offset, head, max_size;
int have_lost;
struct perf_sample_data sample_data;
struct {
@@ -136,7 +117,8 @@ int perf_output_begin(struct perf_output_handle *handle,
handle->rb = rb;
handle->event = event;
- if (!rb->nr_pages)
+ max_size = perf_data_size(rb);
+ if (size > max_size)
goto out;
have_lost = local_read(&rb->lost);
@@ -149,19 +131,43 @@ int perf_output_begin(struct perf_output_handle *handle,
perf_output_get_handle(handle);
- do {
+ if (rb->overwrite) {
+ do {
+ tail = local_read(&rb->tail);
+ offset = local_read(&rb->head);
+ head = offset + size;
+ if (unlikely(!perf_output_space(tail, offset, head,
+ max_size - 1))) {
+ tail = local_read(&rb->next_tail);
+ local_set(&rb->tail, tail);
+ rb->user_page->data_tail = tail;
+ }
+ } while (local_cmpxchg(&rb->head, offset, head) != offset);
+
/*
- * Userspace could choose to issue a mb() before updating the
- * tail pointer. So that all reads will be completed before the
- * write is issued.
+ * Save the start of next event when half of the buffer
+ * has been filled. Later when the event buffer overflows,
+ * update the tail pointer to point to it.
*/
- tail = ACCESS_ONCE(rb->user_page->data_tail);
- smp_rmb();
- offset = head = local_read(&rb->head);
- head += size;
- if (unlikely(!perf_output_space(rb, tail, offset, head)))
- goto fail;
- } while (local_cmpxchg(&rb->head, offset, head) != offset);
+ if (tail == local_read(&rb->next_tail) &&
+ head - tail >= (max_size >> 1))
+ local_cmpxchg(&rb->next_tail, tail, head);
+ } else {
+ do {
+ /*
+ * Userspace could choose to issue a mb() before
+ * updating the tail pointer. So that all reads will
+ * be completed before the write is issued.
+ */
+ tail = ACCESS_ONCE(rb->user_page->data_tail);
+ smp_rmb();
+ offset = local_read(&rb->head);
+ head = offset + size;
+ if (unlikely(!perf_output_space(tail, offset, head,
+ max_size - 1)))
+ goto fail;
+ } while (local_cmpxchg(&rb->head, offset, head) != offset);
+ }
if (head - local_read(&rb->wakeup) > rb->watermark)
local_add(rb->watermark, &rb->wakeup);
--
1.8.1.4
* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-06-06 5:58 [PATCH] perf: Update event buffer tail when overwriting old events Yan, Zheng
@ 2013-06-18 9:13 ` Peter Zijlstra
2013-07-08 12:15 ` Peter Zijlstra
1 sibling, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2013-06-18 9:13 UTC (permalink / raw)
To: Yan, Zheng; +Cc: linux-kernel, mingo, eranian, ak
On Thu, Jun 06, 2013 at 01:58:06PM +0800, Yan, Zheng wrote:
> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>
> If the perf event buffer is in overwrite mode, the kernel only updates
> the data head when it overwrites old samples. The program that owns
> the buffer needs to periodically check the buffer and update a variable
> that tracks the data tail. If the program fails to do this in time,
> the data tail can be overwritten by new samples. The program then has to
> rewind the buffer because it does not know where the first valid
> sample is.
>
> This patch makes the kernel update the data tail when it overwrites
> old events, so the program that owns the event buffer can always
> read the latest samples. This is convenient for programs that use
> perf to do branch tracing. One use case is GDB branch tracing:
> (http://sourceware.org/ml/gdb-patches/2012-06/msg00172.html)
> It uses the perf interface to read BTS, but only cares about the branches
> before the ptrace event.
>
> I added code to perf_output_{begin/end} to count how many cycles
> are spent on sample output, then ran "perf record" to profile a kernel
> compilation 10 times on IvyBridge-EP. (perf record -a make -j 60)
> The first number is scaled to 1000; the other numbers are scaled by
> the same factor.
>
> before overwrite mode after overwrite mode
> AVG 1000 999 1046 1044
> STDEV 19.4 19.5 17.1 17.9
>
Right, so it all gets about 5% more expensive. I don't suppose there's
anything smart we can do to avoid this for the !overwrite mode?
* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-06-06 5:58 [PATCH] perf: Update event buffer tail when overwriting old events Yan, Zheng
2013-06-18 9:13 ` Peter Zijlstra
@ 2013-07-08 12:15 ` Peter Zijlstra
2013-07-09 6:18 ` Namhyung Kim
` (2 more replies)
1 sibling, 3 replies; 12+ messages in thread
From: Peter Zijlstra @ 2013-07-08 12:15 UTC (permalink / raw)
To: Yan, Zheng; +Cc: linux-kernel, mingo, eranian, ak
On Thu, Jun 06, 2013 at 01:58:06PM +0800, Yan, Zheng wrote:
> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>
> If the perf event buffer is in overwrite mode, the kernel only updates
> the data head when it overwrites old samples. The program that owns
> the buffer needs to periodically check the buffer and update a variable
> that tracks the data tail. If the program fails to do this in time,
> the data tail can be overwritten by new samples. The program then has to
> rewind the buffer because it does not know where the first valid
> sample is.
>
> This patch makes the kernel update the data tail when it overwrites
> old events, so the program that owns the event buffer can always
> read the latest samples. This is convenient for programs that use
> perf to do branch tracing. One use case is GDB branch tracing:
> (http://sourceware.org/ml/gdb-patches/2012-06/msg00172.html)
> It uses the perf interface to read BTS, but only cares about the branches
> before the ptrace event.
>
> I added code to perf_output_{begin/end} to count how many cycles
> are spent on sample output, then ran "perf record" to profile a kernel
> compilation 10 times on IvyBridge-EP. (perf record -a make -j 60)
> The first number is scaled to 1000; the other numbers are scaled by
> the same factor.
>
> before overwrite mode after overwrite mode
> AVG 1000 999 1046 1044
> STDEV 19.4 19.5 17.1 17.9
OK, so I was sure I replied to this email; but apparently I didn't :/
So it's still adding about 5% overhead to the regular case; this is sad.
What does something like the below do?
---
include/linux/perf_event.h | 2 +
kernel/events/core.c | 60 ++++++++++++++++++++++++------
kernel/events/internal.h | 2 +
kernel/events/ring_buffer.c | 90 +++++++++++++++++++++++++++------------------
4 files changed, 106 insertions(+), 48 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 8873f82..bcce98a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -753,6 +753,8 @@ static inline bool has_branch_stack(struct perf_event *event)
extern int perf_output_begin(struct perf_output_handle *handle,
struct perf_event *event, unsigned int size);
+extern int perf_output_begin_overwrite(struct perf_output_handle *handle,
+ struct perf_event *event, unsigned int size);
extern void perf_output_end(struct perf_output_handle *handle);
extern unsigned int perf_output_copy(struct perf_output_handle *handle,
const void *buf, unsigned int len);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1833bc5..4d674e9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3878,6 +3878,8 @@ static const struct vm_operations_struct perf_mmap_vmops = {
.page_mkwrite = perf_mmap_fault,
};
+static void perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb);
+
static int perf_mmap(struct file *file, struct vm_area_struct *vma)
{
struct perf_event *event = file->private_data;
@@ -3985,6 +3987,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
vma->vm_mm->pinned_vm += extra;
ring_buffer_attach(event, rb);
+ perf_event_set_overflow(event, rb);
rcu_assign_pointer(event->rb, rb);
perf_event_update_userpage(event);
@@ -4595,9 +4598,12 @@ void perf_prepare_sample(struct perf_event_header *header,
}
}
-static void perf_event_output(struct perf_event *event,
- struct perf_sample_data *data,
- struct pt_regs *regs)
+static __always_inline void
+__perf_event_output(struct perf_event *event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs,
+ int (*output_begin)(struct perf_output_handle *,
+ struct perf_event *, unsigned int))
{
struct perf_output_handle handle;
struct perf_event_header header;
@@ -4607,7 +4613,7 @@ static void perf_event_output(struct perf_event *event,
perf_prepare_sample(&header, data, event, regs);
- if (perf_output_begin(&handle, event, header.size))
+ if (output_begin(&handle, event, header.size))
goto exit;
perf_output_sample(&handle, &header, data, event);
@@ -4618,6 +4624,33 @@ static void perf_event_output(struct perf_event *event,
rcu_read_unlock();
}
+static void perf_event_output(struct perf_event *event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ __perf_event_output(event, data, regs, perf_output_begin);
+}
+
+static void perf_event_output_overwrite(struct perf_event *event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ __perf_event_output(event, data, regs, perf_output_begin_overwrite);
+}
+
+static void
+perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb)
+{
+ if (event->overflow_handler != perf_event_output ||
+ event->overflow_handler != perf_event_output_overwrite)
+ return;
+
+ if (rb->overwrite)
+ event->overflow_handler = perf_event_output_overwrite;
+ else
+ event->overflow_handler = perf_event_output;
+}
+
/*
* read event_id
*/
@@ -5183,10 +5216,7 @@ static int __perf_event_overflow(struct perf_event *event,
irq_work_queue(&event->pending);
}
- if (event->overflow_handler)
- event->overflow_handler(event, data, regs);
- else
- perf_event_output(event, data, regs);
+ event->overflow_handler(event, data, regs);
if (event->fasync && event->pending_kill) {
event->pending_wakeup = 1;
@@ -6501,8 +6531,13 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
context = parent_event->overflow_handler_context;
}
- event->overflow_handler = overflow_handler;
- event->overflow_handler_context = context;
+ if (overflow_handler) {
+ event->overflow_handler = overflow_handler;
+ event->overflow_handler_context = context;
+ } else {
+ event->overflow_handler = perf_event_output;
+ event->overflow_handler_context = NULL;
+ }
perf_event__state_init(event);
@@ -6736,9 +6771,10 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
if (old_rb)
ring_buffer_detach(event, old_rb);
- if (rb)
+ if (rb) {
ring_buffer_attach(event, rb);
-
+ perf_event_set_overflow(event, rb);
+ }
rcu_assign_pointer(event->rb, rb);
if (old_rb) {
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index ca65997..c4e4610 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -20,6 +20,8 @@ struct ring_buffer {
atomic_t poll; /* POLL_ for wakeups */
+ local_t tail; /* read position */
+ local_t next_tail; /* next read position */
local_t head; /* write position */
local_t nest; /* nested writers */
local_t events; /* event limit */
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index cd55144..5887044 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -15,28 +15,9 @@
#include "internal.h"
-static bool perf_output_space(struct ring_buffer *rb, unsigned long tail,
- unsigned long offset, unsigned long head)
+static bool perf_output_space(unsigned long tail, unsigned long offset,
+ unsigned long head, unsigned long mask)
{
- unsigned long sz = perf_data_size(rb);
- unsigned long mask = sz - 1;
-
- /*
- * check if user-writable
- * overwrite : over-write its own tail
- * !overwrite: buffer possibly drops events.
- */
- if (rb->overwrite)
- return true;
-
- /*
- * verify that payload is not bigger than buffer
- * otherwise masking logic may fail to detect
- * the "not enough space" condition
- */
- if ((head - offset) > sz)
- return false;
-
offset = (offset - tail) & mask;
head = (head - tail) & mask;
@@ -109,11 +90,11 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
preempt_enable();
}
-int perf_output_begin(struct perf_output_handle *handle,
- struct perf_event *event, unsigned int size)
+static __always_inline int __perf_output_begin(struct perf_output_handle *handle,
+ struct perf_event *event, unsigned int size, bool overwrite)
{
struct ring_buffer *rb;
- unsigned long tail, offset, head;
+ unsigned long tail, offset, head, max_size;
int have_lost;
struct perf_sample_data sample_data;
struct {
@@ -136,7 +117,8 @@ int perf_output_begin(struct perf_output_handle *handle,
handle->rb = rb;
handle->event = event;
- if (!rb->nr_pages)
+ max_size = perf_data_size(rb);
+ if (size > max_size)
goto out;
have_lost = local_read(&rb->lost);
@@ -149,19 +131,43 @@ int perf_output_begin(struct perf_output_handle *handle,
perf_output_get_handle(handle);
- do {
+ if (overwrite) {
+ do {
+ tail = local_read(&rb->tail);
+ offset = local_read(&rb->head);
+ head = offset + size;
+ if (unlikely(!perf_output_space(tail, offset, head,
+ max_size - 1))) {
+ tail = local_read(&rb->next_tail);
+ local_set(&rb->tail, tail);
+ rb->user_page->data_tail = tail;
+ }
+ } while (local_cmpxchg(&rb->head, offset, head) != offset);
+
/*
- * Userspace could choose to issue a mb() before updating the
- * tail pointer. So that all reads will be completed before the
- * write is issued.
+ * Save the start of next event when half of the buffer
+ * has been filled. Later when the event buffer overflows,
+ * update the tail pointer to point to it.
*/
- tail = ACCESS_ONCE(rb->user_page->data_tail);
- smp_rmb();
- offset = head = local_read(&rb->head);
- head += size;
- if (unlikely(!perf_output_space(rb, tail, offset, head)))
- goto fail;
- } while (local_cmpxchg(&rb->head, offset, head) != offset);
+ if (tail == local_read(&rb->next_tail) &&
+ head - tail >= (max_size >> 1))
+ local_cmpxchg(&rb->next_tail, tail, head);
+ } else {
+ do {
+ /*
+ * Userspace could choose to issue a mb() before
+ * updating the tail pointer. So that all reads will
+ * be completed before the write is issued.
+ */
+ tail = ACCESS_ONCE(rb->user_page->data_tail);
+ smp_rmb();
+ offset = local_read(&rb->head);
+ head = offset + size;
+ if (unlikely(!perf_output_space(tail, offset, head,
+ max_size - 1)))
+ goto fail;
+ } while (local_cmpxchg(&rb->head, offset, head) != offset);
+ }
if (head - local_read(&rb->wakeup) > rb->watermark)
local_add(rb->watermark, &rb->wakeup);
@@ -194,6 +200,18 @@ int perf_output_begin(struct perf_output_handle *handle,
return -ENOSPC;
}
+int perf_output_begin(struct perf_output_handle *handle,
+ struct perf_event *event, unsigned int size)
+{
+ return __perf_output_begin(handle, event, size, false);
+}
+
+int perf_output_begin_overwrite(struct perf_output_handle *handle,
+ struct perf_event *event, unsigned int size)
+{
+ return __perf_output_begin(handle, event, size, true);
+}
+
unsigned int perf_output_copy(struct perf_output_handle *handle,
const void *buf, unsigned int len)
{
* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-07-08 12:15 ` Peter Zijlstra
@ 2013-07-09 6:18 ` Namhyung Kim
2013-07-09 7:40 ` Peter Zijlstra
2013-07-09 7:05 ` Yan, Zheng
2013-07-10 11:37 ` Yan, Zheng
2 siblings, 1 reply; 12+ messages in thread
From: Namhyung Kim @ 2013-07-09 6:18 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Yan, Zheng, linux-kernel, mingo, eranian, ak
Hi Peter,
On Mon, 8 Jul 2013 14:15:57 +0200, Peter Zijlstra wrote:
[SNIP]
> +static void
> +perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb)
> +{
> + if (event->overflow_handler != perf_event_output ||
I guess you wanted "&&" instead of "||" here.
Thanks,
Namhyung
> + event->overflow_handler != perf_event_output_overwrite)
> + return;
> +
> + if (rb->overwrite)
> + event->overflow_handler = perf_event_output_overwrite;
> + else
> + event->overflow_handler = perf_event_output;
> +}
* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-07-08 12:15 ` Peter Zijlstra
2013-07-09 6:18 ` Namhyung Kim
@ 2013-07-09 7:05 ` Yan, Zheng
2013-07-09 8:05 ` Peter Zijlstra
2013-07-10 11:37 ` Yan, Zheng
2 siblings, 1 reply; 12+ messages in thread
From: Yan, Zheng @ 2013-07-09 7:05 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, mingo, eranian, ak
On 07/08/2013 08:15 PM, Peter Zijlstra wrote:
> On Thu, Jun 06, 2013 at 01:58:06PM +0800, Yan, Zheng wrote:
>> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>>
>> If the perf event buffer is in overwrite mode, the kernel only updates
>> the data head when it overwrites old samples. The program that owns
>> the buffer needs to periodically check the buffer and update a variable
>> that tracks the data tail. If the program fails to do this in time,
>> the data tail can be overwritten by new samples. The program then has to
>> rewind the buffer because it does not know where the first valid
>> sample is.
>>
>> This patch makes the kernel update the data tail when it overwrites
>> old events, so the program that owns the event buffer can always
>> read the latest samples. This is convenient for programs that use
>> perf to do branch tracing. One use case is GDB branch tracing:
>> (http://sourceware.org/ml/gdb-patches/2012-06/msg00172.html)
>> It uses the perf interface to read BTS, but only cares about the branches
>> before the ptrace event.
>>
>> I added code to perf_output_{begin/end} to count how many cycles
>> are spent on sample output, then ran "perf record" to profile a kernel
>> compilation 10 times on IvyBridge-EP. (perf record -a make -j 60)
>> The first number is scaled to 1000; the other numbers are scaled by
>> the same factor.
>>
>> before overwrite mode after overwrite mode
>> AVG 1000 999 1046 1044
>> STDEV 19.4 19.5 17.1 17.9
>
> OK, so I was sure I replied to this email; but apparently I didn't :/
>
> So its still adding about 5% overhead to the regular case; this is sad.
>
> What does something like the below do?
Thank you for your help. I ran the same test; the results for the regular case
are much better. But it still has about 1% overhead, probably because we
enlarged the ring_buffer structure, making it less cache friendly.
        origin   with the patch
AVG       1000        1013
STDEV     13.4        15.0
Regards
Yan, Zheng
>
> ---
> include/linux/perf_event.h | 2 +
> kernel/events/core.c | 60 ++++++++++++++++++++++++------
> kernel/events/internal.h | 2 +
> kernel/events/ring_buffer.c | 90 +++++++++++++++++++++++++++------------------
> 4 files changed, 106 insertions(+), 48 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 8873f82..bcce98a 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -753,6 +753,8 @@ static inline bool has_branch_stack(struct perf_event *event)
>
> extern int perf_output_begin(struct perf_output_handle *handle,
> struct perf_event *event, unsigned int size);
> +extern int perf_output_begin_overwrite(struct perf_output_handle *handle,
> + struct perf_event *event, unsigned int size);
> extern void perf_output_end(struct perf_output_handle *handle);
> extern unsigned int perf_output_copy(struct perf_output_handle *handle,
> const void *buf, unsigned int len);
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 1833bc5..4d674e9 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3878,6 +3878,8 @@ static const struct vm_operations_struct perf_mmap_vmops = {
> .page_mkwrite = perf_mmap_fault,
> };
>
> +static void perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb);
> +
> static int perf_mmap(struct file *file, struct vm_area_struct *vma)
> {
> struct perf_event *event = file->private_data;
> @@ -3985,6 +3987,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
> vma->vm_mm->pinned_vm += extra;
>
> ring_buffer_attach(event, rb);
> + perf_event_set_overflow(event, rb);
> rcu_assign_pointer(event->rb, rb);
>
> perf_event_update_userpage(event);
> @@ -4595,9 +4598,12 @@ void perf_prepare_sample(struct perf_event_header *header,
> }
> }
>
> -static void perf_event_output(struct perf_event *event,
> - struct perf_sample_data *data,
> - struct pt_regs *regs)
> +static __always_inline void
> +__perf_event_output(struct perf_event *event,
> + struct perf_sample_data *data,
> + struct pt_regs *regs,
> + int (*output_begin)(struct perf_output_handle *,
> + struct perf_event *, unsigned int))
> {
> struct perf_output_handle handle;
> struct perf_event_header header;
> @@ -4607,7 +4613,7 @@ static void perf_event_output(struct perf_event *event,
>
> perf_prepare_sample(&header, data, event, regs);
>
> - if (perf_output_begin(&handle, event, header.size))
> + if (output_begin(&handle, event, header.size))
> goto exit;
>
> perf_output_sample(&handle, &header, data, event);
> @@ -4618,6 +4624,33 @@ static void perf_event_output(struct perf_event *event,
> rcu_read_unlock();
> }
>
> +static void perf_event_output(struct perf_event *event,
> + struct perf_sample_data *data,
> + struct pt_regs *regs)
> +{
> + __perf_event_output(event, data, regs, perf_output_begin);
> +}
> +
> +static void perf_event_output_overwrite(struct perf_event *event,
> + struct perf_sample_data *data,
> + struct pt_regs *regs)
> +{
> + __perf_event_output(event, data, regs, perf_output_begin_overwrite);
> +}
> +
> +static void
> +perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb)
> +{
> + if (event->overflow_handler != perf_event_output ||
> + event->overflow_handler != perf_event_output_overwrite)
> + return;
> +
> + if (rb->overwrite)
> + event->overflow_handler = perf_event_output_overwrite;
> + else
> + event->overflow_handler = perf_event_output;
> +}
> +
> /*
> * read event_id
> */
> @@ -5183,10 +5216,7 @@ static int __perf_event_overflow(struct perf_event *event,
> irq_work_queue(&event->pending);
> }
>
> - if (event->overflow_handler)
> - event->overflow_handler(event, data, regs);
> - else
> - perf_event_output(event, data, regs);
> + event->overflow_handler(event, data, regs);
>
> if (event->fasync && event->pending_kill) {
> event->pending_wakeup = 1;
> @@ -6501,8 +6531,13 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
> context = parent_event->overflow_handler_context;
> }
>
> - event->overflow_handler = overflow_handler;
> - event->overflow_handler_context = context;
> + if (overflow_handler) {
> + event->overflow_handler = overflow_handler;
> + event->overflow_handler_context = context;
> + } else {
> + event->overflow_handler = perf_event_output;
> + event->overflow_handler_context = NULL;
> + }
>
> perf_event__state_init(event);
>
> @@ -6736,9 +6771,10 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
> if (old_rb)
> ring_buffer_detach(event, old_rb);
>
> - if (rb)
> + if (rb) {
> ring_buffer_attach(event, rb);
> -
> + perf_event_set_overflow(event, rb);
> + }
> rcu_assign_pointer(event->rb, rb);
>
> if (old_rb) {
> diff --git a/kernel/events/internal.h b/kernel/events/internal.h
> index ca65997..c4e4610 100644
> --- a/kernel/events/internal.h
> +++ b/kernel/events/internal.h
> @@ -20,6 +20,8 @@ struct ring_buffer {
>
> atomic_t poll; /* POLL_ for wakeups */
>
> + local_t tail; /* read position */
> + local_t next_tail; /* next read position */
> local_t head; /* write position */
> local_t nest; /* nested writers */
> local_t events; /* event limit */
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index cd55144..5887044 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -15,28 +15,9 @@
>
> #include "internal.h"
>
> -static bool perf_output_space(struct ring_buffer *rb, unsigned long tail,
> - unsigned long offset, unsigned long head)
> +static bool perf_output_space(unsigned long tail, unsigned long offset,
> + unsigned long head, unsigned long mask)
> {
> - unsigned long sz = perf_data_size(rb);
> - unsigned long mask = sz - 1;
> -
> - /*
> - * check if user-writable
> - * overwrite : over-write its own tail
> - * !overwrite: buffer possibly drops events.
> - */
> - if (rb->overwrite)
> - return true;
> -
> - /*
> - * verify that payload is not bigger than buffer
> - * otherwise masking logic may fail to detect
> - * the "not enough space" condition
> - */
> - if ((head - offset) > sz)
> - return false;
> -
> offset = (offset - tail) & mask;
> head = (head - tail) & mask;
>
> @@ -109,11 +90,11 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
> preempt_enable();
> }
>
> -int perf_output_begin(struct perf_output_handle *handle,
> - struct perf_event *event, unsigned int size)
> +static __always_inline int __perf_output_begin(struct perf_output_handle *handle,
> + struct perf_event *event, unsigned int size, bool overwrite)
> {
> struct ring_buffer *rb;
> - unsigned long tail, offset, head;
> + unsigned long tail, offset, head, max_size;
> int have_lost;
> struct perf_sample_data sample_data;
> struct {
> @@ -136,7 +117,8 @@ int perf_output_begin(struct perf_output_handle *handle,
> handle->rb = rb;
> handle->event = event;
>
> - if (!rb->nr_pages)
> + max_size = perf_data_size(rb);
> + if (size > max_size)
> goto out;
>
> have_lost = local_read(&rb->lost);
> @@ -149,19 +131,43 @@ int perf_output_begin(struct perf_output_handle *handle,
>
> perf_output_get_handle(handle);
>
> - do {
> + if (overwrite) {
> + do {
> + tail = local_read(&rb->tail);
> + offset = local_read(&rb->head);
> + head = offset + size;
> + if (unlikely(!perf_output_space(tail, offset, head,
> + max_size - 1))) {
> + tail = local_read(&rb->next_tail);
> + local_set(&rb->tail, tail);
> + rb->user_page->data_tail = tail;
> + }
> + } while (local_cmpxchg(&rb->head, offset, head) != offset);
> +
> /*
> - * Userspace could choose to issue a mb() before updating the
> - * tail pointer. So that all reads will be completed before the
> - * write is issued.
> + * Save the start of next event when half of the buffer
> + * has been filled. Later when the event buffer overflows,
> + * update the tail pointer to point to it.
> */
> - tail = ACCESS_ONCE(rb->user_page->data_tail);
> - smp_rmb();
> - offset = head = local_read(&rb->head);
> - head += size;
> - if (unlikely(!perf_output_space(rb, tail, offset, head)))
> - goto fail;
> - } while (local_cmpxchg(&rb->head, offset, head) != offset);
> + if (tail == local_read(&rb->next_tail) &&
> + head - tail >= (max_size >> 1))
> + local_cmpxchg(&rb->next_tail, tail, head);
> + } else {
> + do {
> + /*
> + * Userspace could choose to issue a mb() before
> + * updating the tail pointer. So that all reads will
> + * be completed before the write is issued.
> + */
> + tail = ACCESS_ONCE(rb->user_page->data_tail);
> + smp_rmb();
> + offset = local_read(&rb->head);
> + head = offset + size;
> + if (unlikely(!perf_output_space(tail, offset, head,
> + max_size - 1)))
> + goto fail;
> + } while (local_cmpxchg(&rb->head, offset, head) != offset);
> + }
>
> if (head - local_read(&rb->wakeup) > rb->watermark)
> local_add(rb->watermark, &rb->wakeup);
> @@ -194,6 +200,18 @@ int perf_output_begin(struct perf_output_handle *handle,
> return -ENOSPC;
> }
>
> +int perf_output_begin(struct perf_output_handle *handle,
> + struct perf_event *event, unsigned int size)
> +{
> + return __perf_output_begin(handle, event, size, false);
> +}
> +
> +int perf_output_begin_overwrite(struct perf_output_handle *handle,
> + struct perf_event *event, unsigned int size)
> +{
> + return __perf_output_begin(handle, event, size, true);
> +}
> +
> unsigned int perf_output_copy(struct perf_output_handle *handle,
> const void *buf, unsigned int len)
> {
>
* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-07-09 6:18 ` Namhyung Kim
@ 2013-07-09 7:40 ` Peter Zijlstra
0 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2013-07-09 7:40 UTC (permalink / raw)
To: Namhyung Kim; +Cc: Yan, Zheng, linux-kernel, mingo, eranian, ak
On Tue, Jul 09, 2013 at 03:18:20PM +0900, Namhyung Kim wrote:
> Hi Peter,
>
> On Mon, 8 Jul 2013 14:15:57 +0200, Peter Zijlstra wrote:
> [SNIP]
> > +static void
> > +perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb)
> > +{
> > + if (event->overflow_handler != perf_event_output ||
>
> I guess you wanted "&&" instead of "||" here.
Uhm.. yeah. /me dons the brown paper bag.
* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-07-09 7:05 ` Yan, Zheng
@ 2013-07-09 8:05 ` Peter Zijlstra
2013-07-09 13:52 ` Yan, Zheng
0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2013-07-09 8:05 UTC (permalink / raw)
To: Yan, Zheng; +Cc: linux-kernel, mingo, eranian, ak
On Tue, Jul 09, 2013 at 03:05:41PM +0800, Yan, Zheng wrote:
> Thank you for your help. I ran the same test; the results for the regular case
> are much better. But it still has about 1% overhead, probably because we
> enlarged the ring_buffer structure, making it less cache friendly.
>
> origin with the patch
> AVG 1000 1013
> STDEV 13.4 15.0
And this is the !overwrite case, right? I don't suppose you cured the logic
Namhyung Kim pointed out? That should affect the overwrite case I suppose since
it won't switch to perf_event_output_overwrite().
tip/master:
struct ring_buffer {
atomic_t refcount; /* 0 4 */
/* XXX 4 bytes hole, try to pack */
struct callback_head callback_head; /* 8 16 */
int nr_pages; /* 24 4 */
int overwrite; /* 28 4 */
atomic_t poll; /* 32 4 */
/* XXX 4 bytes hole, try to pack */
local_t head; /* 40 8 */
local_t nest; /* 48 8 */
local_t events; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
local_t wakeup; /* 64 8 */
local_t lost; /* 72 8 */
long int watermark; /* 80 8 */
spinlock_t event_lock; /* 88 56 */
/* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */
struct list_head event_list; /* 144 16 */
atomic_t mmap_count; /* 160 4 */
/* XXX 4 bytes hole, try to pack */
long unsigned int mmap_locked; /* 168 8 */
struct user_struct * mmap_user; /* 176 8 */
struct perf_event_mmap_page * user_page; /* 184 8 */
/* --- cacheline 3 boundary (192 bytes) --- */
void * data_pages[0]; /* 192 0 */
/* size: 192, cachelines: 3, members: 18 */
/* sum members: 180, holes: 3, sum holes: 12 */
};
tip/master + patch:
struct ring_buffer {
atomic_t refcount; /* 0 4 */
/* XXX 4 bytes hole, try to pack */
struct callback_head callback_head; /* 8 16 */
int nr_pages; /* 24 4 */
int overwrite; /* 28 4 */
atomic_t poll; /* 32 4 */
/* XXX 4 bytes hole, try to pack */
local_t tail; /* 40 8 */
local_t next_tail; /* 48 8 */
local_t head; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
local_t nest; /* 64 8 */
local_t events; /* 72 8 */
local_t wakeup; /* 80 8 */
local_t lost; /* 88 8 */
long int watermark; /* 96 8 */
spinlock_t event_lock; /* 104 56 */
/* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
struct list_head event_list; /* 160 16 */
atomic_t mmap_count; /* 176 4 */
/* XXX 4 bytes hole, try to pack */
long unsigned int mmap_locked; /* 184 8 */
/* --- cacheline 3 boundary (192 bytes) --- */
struct user_struct * mmap_user; /* 192 8 */
struct perf_event_mmap_page * user_page; /* 200 8 */
void * data_pages[0]; /* 208 0 */
/* size: 208, cachelines: 4, members: 20 */
/* sum members: 196, holes: 3, sum holes: 12 */
/* last cacheline: 16 bytes */
};
tip/master + patch^2:
struct ring_buffer {
atomic_t refcount; /* 0 4 */
atomic_t mmap_count; /* 4 4 */
union {
int overwrite; /* 4 */
struct callback_head callback_head; /* 16 */
}; /* 8 16 */
int nr_pages; /* 24 4 */
atomic_t poll; /* 28 4 */
local_t tail; /* 32 8 */
local_t next_tail; /* 40 8 */
local_t head; /* 48 8 */
local_t nest; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
local_t events; /* 64 8 */
local_t wakeup; /* 72 8 */
local_t lost; /* 80 8 */
long int watermark; /* 88 8 */
spinlock_t event_lock; /* 96 56 */
/* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
struct list_head event_list; /* 152 16 */
long unsigned int mmap_locked; /* 168 8 */
struct user_struct * mmap_user; /* 176 8 */
struct perf_event_mmap_page * user_page; /* 184 8 */
/* --- cacheline 3 boundary (192 bytes) --- */
void * data_pages[0]; /* 192 0 */
/* size: 192, cachelines: 3, members: 19 */
};
---
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4641,7 +4641,7 @@ static void perf_event_output_overwrite(
static void
perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb)
{
- if (event->overflow_handler != perf_event_output ||
+ if (event->overflow_handler != perf_event_output &&
event->overflow_handler != perf_event_output_overwrite)
return;
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -10,13 +10,16 @@
struct ring_buffer {
atomic_t refcount;
- struct rcu_head rcu_head;
+ atomic_t mmap_count;
+ union {
+ int overwrite; /* can overwrite itself */
+ struct rcu_head rcu_head;
+ };
#ifdef CONFIG_PERF_USE_VMALLOC
struct work_struct work;
int page_order; /* allocation order */
#endif
int nr_pages; /* nr of data pages */
- int overwrite; /* can overwrite itself */
atomic_t poll; /* POLL_ for wakeups */
@@ -33,7 +36,6 @@ struct ring_buffer {
spinlock_t event_lock;
struct list_head event_list;
- atomic_t mmap_count;
unsigned long mmap_locked;
struct user_struct *mmap_user;
* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-07-09 8:05 ` Peter Zijlstra
@ 2013-07-09 13:52 ` Yan, Zheng
2013-07-09 14:31 ` Peter Zijlstra
0 siblings, 1 reply; 12+ messages in thread
From: Yan, Zheng @ 2013-07-09 13:52 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, mingo, eranian, ak
On 07/09/2013 04:05 PM, Peter Zijlstra wrote:
> On Tue, Jul 09, 2013 at 03:05:41PM +0800, Yan, Zheng wrote:
>
>> Thank you for your help. I ran the same test; the results for the regular case
>> are much better. But it still has about 1% overhead, probably because we
>> enlarged the ring_buffer structure, making it less cache friendly.
>>
>> origin with the patch
>> AVG 1000 1013
>> STDEV 13.4 15.0
>
> And this is the !overwrite case, right? I don't suppose you cured the logic
> Namhyung Kim pointed out? That should affect the overwrite case I suppose since
> it won't switch to perf_event_output_overwrite().
yes, it's the overwrite case.
>
> tip/master:
>
> struct ring_buffer {
> atomic_t refcount; /* 0 4 */
>
> /* XXX 4 bytes hole, try to pack */
>
> struct callback_head callback_head; /* 8 16 */
> int nr_pages; /* 24 4 */
> int overwrite; /* 28 4 */
> atomic_t poll; /* 32 4 */
>
> /* XXX 4 bytes hole, try to pack */
>
> local_t head; /* 40 8 */
> local_t nest; /* 48 8 */
> local_t events; /* 56 8 */
> /* --- cacheline 1 boundary (64 bytes) --- */
> local_t wakeup; /* 64 8 */
> local_t lost; /* 72 8 */
> long int watermark; /* 80 8 */
> spinlock_t event_lock; /* 88 56 */
> /* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */
> struct list_head event_list; /* 144 16 */
> atomic_t mmap_count; /* 160 4 */
>
> /* XXX 4 bytes hole, try to pack */
>
> long unsigned int mmap_locked; /* 168 8 */
> struct user_struct * mmap_user; /* 176 8 */
> struct perf_event_mmap_page * user_page; /* 184 8 */
> /* --- cacheline 3 boundary (192 bytes) --- */
> void * data_pages[0]; /* 192 0 */
>
> /* size: 192, cachelines: 3, members: 18 */
> /* sum members: 180, holes: 3, sum holes: 12 */
> };
>
> tip/master + patch:
>
> struct ring_buffer {
> atomic_t refcount; /* 0 4 */
>
> /* XXX 4 bytes hole, try to pack */
>
> struct callback_head callback_head; /* 8 16 */
> int nr_pages; /* 24 4 */
> int overwrite; /* 28 4 */
> atomic_t poll; /* 32 4 */
>
> /* XXX 4 bytes hole, try to pack */
>
> local_t tail; /* 40 8 */
> local_t next_tail; /* 48 8 */
> local_t head; /* 56 8 */
> /* --- cacheline 1 boundary (64 bytes) --- */
> local_t nest; /* 64 8 */
> local_t events; /* 72 8 */
> local_t wakeup; /* 80 8 */
> local_t lost; /* 88 8 */
> long int watermark; /* 96 8 */
> spinlock_t event_lock; /* 104 56 */
> /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
> struct list_head event_list; /* 160 16 */
> atomic_t mmap_count; /* 176 4 */
>
> /* XXX 4 bytes hole, try to pack */
>
> long unsigned int mmap_locked; /* 184 8 */
> /* --- cacheline 3 boundary (192 bytes) --- */
> struct user_struct * mmap_user; /* 192 8 */
> struct perf_event_mmap_page * user_page; /* 200 8 */
> void * data_pages[0]; /* 208 0 */
>
> /* size: 208, cachelines: 4, members: 20 */
> /* sum members: 196, holes: 3, sum holes: 12 */
> /* last cacheline: 16 bytes */
> };
>
> tip/master + patch^2:
>
> struct ring_buffer {
> atomic_t refcount; /* 0 4 */
> atomic_t mmap_count; /* 4 4 */
> union {
> int overwrite; /* 4 */
> struct callback_head callback_head; /* 16 */
> }; /* 8 16 */
> int nr_pages; /* 24 4 */
> atomic_t poll; /* 28 4 */
> local_t tail; /* 32 8 */
> local_t next_tail; /* 40 8 */
> local_t head; /* 48 8 */
> local_t nest; /* 56 8 */
> /* --- cacheline 1 boundary (64 bytes) --- */
> local_t events; /* 64 8 */
> local_t wakeup; /* 72 8 */
> local_t lost; /* 80 8 */
> long int watermark; /* 88 8 */
> spinlock_t event_lock; /* 96 56 */
> /* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
> struct list_head event_list; /* 152 16 */
> long unsigned int mmap_locked; /* 168 8 */
> struct user_struct * mmap_user; /* 176 8 */
> struct perf_event_mmap_page * user_page; /* 184 8 */
> /* --- cacheline 3 boundary (192 bytes) --- */
> void * data_pages[0]; /* 192 0 */
>
> /* size: 192, cachelines: 3, members: 19 */
> };
>
>
> ---
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -4641,7 +4641,7 @@ static void perf_event_output_overwrite(
> static void
> perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb)
> {
> - if (event->overflow_handler != perf_event_output ||
> + if (event->overflow_handler != perf_event_output &&
> event->overflow_handler != perf_event_output_overwrite)
> return;
>
> --- a/kernel/events/internal.h
> +++ b/kernel/events/internal.h
> @@ -10,13 +10,16 @@
>
> struct ring_buffer {
> atomic_t refcount;
> - struct rcu_head rcu_head;
> + atomic_t mmap_count;
> + union {
> + int overwrite; /* can overwrite itself */
> + struct rcu_head rcu_head;
> + };
> #ifdef CONFIG_PERF_USE_VMALLOC
> struct work_struct work;
> int page_order; /* allocation order */
> #endif
> int nr_pages; /* nr of data pages */
> - int overwrite; /* can overwrite itself */
>
> atomic_t poll; /* POLL_ for wakeups */
>
> @@ -33,7 +36,6 @@ struct ring_buffer {
> spinlock_t event_lock;
> struct list_head event_list;
>
> - atomic_t mmap_count;
> unsigned long mmap_locked;
> struct user_struct *mmap_user;
>
>
origin patch patch^2
AVG 1000 1013 1028
STDEV 13.4 15.0 14.6
patch^2 doesn't help. I will try moving the new fields down and re-test tomorrow.
Regards
Yan, Zheng
* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-07-09 13:52 ` Yan, Zheng
@ 2013-07-09 14:31 ` Peter Zijlstra
0 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2013-07-09 14:31 UTC (permalink / raw)
To: Yan, Zheng; +Cc: linux-kernel, mingo, eranian, ak
On Tue, Jul 09, 2013 at 09:52:17PM +0800, Yan, Zheng wrote:
> On 07/09/2013 04:05 PM, Peter Zijlstra wrote:
> > On Tue, Jul 09, 2013 at 03:05:41PM +0800, Yan, Zheng wrote:
> >
> >> Thank you for your help. I ran the same test; the results for the regular case
> >> are much better. But it still has about 1% overhead, probably because we
> >> enlarged the ring_buffer structure, making it less cache friendly.
> >>
> >> origin with the patch
> >> AVG 1000 1013
> >> STDEV 13.4 15.0
> >
> > And this is the !overwrite case, right? I don't suppose you cured the logic
> > Namhyung Kim pointed out? That should affect the overwrite case I suppose since
> > it won't switch to perf_event_output_overwrite().
>
> yes, it's the overwrite case.
So the most common case is the !overwrite one; we should ensure no significant
regression there. The overwrite case isn't used that much because, as you've
found, it's really hard to use without a valid tail pointer. So I'm not too
bothered about making the overwrite case a _little_ more expensive if that
makes it far more usable.
* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-07-08 12:15 ` Peter Zijlstra
2013-07-09 6:18 ` Namhyung Kim
2013-07-09 7:05 ` Yan, Zheng
@ 2013-07-10 11:37 ` Yan, Zheng
2013-07-10 11:44 ` Peter Zijlstra
2 siblings, 1 reply; 12+ messages in thread
From: Yan, Zheng @ 2013-07-10 11:37 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, mingo, eranian, ak
On 07/08/2013 08:15 PM, Peter Zijlstra wrote:
> On Thu, Jun 06, 2013 at 01:58:06PM +0800, Yan, Zheng wrote:
>> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>>
>> If the perf event buffer is in overwrite mode, the kernel only updates
>> the data head when it overwrites old samples. The program that owns
>> the buffer needs to periodically check the buffer and update a variable
>> that tracks the data tail. If the program fails to do this in time,
>> the data tail can be overwritten by new samples. The program then has
>> to rewind the buffer because it does not know where the first valid
>> sample is.
>>
>> This patch makes the kernel update the data tail when it overwrites
>> old events, so the program that owns the event buffer can always
>> read the latest samples. This is convenient for programs that use
>> perf to do branch tracing. One use case is GDB branch tracing:
>> (http://sourceware.org/ml/gdb-patches/2012-06/msg00172.html)
>> It uses the perf interface to read BTS, but only cares about the
>> branches before the ptrace event.
>>
>> I added code to perf_output_{begin/end} to count how many cycles
>> are spent on sample output, then ran "perf record" to profile kernel
>> compilation 10 times on IvyBridge-EP. (perf record -a make -j 60)
>> The first number is scaled to 1000, the remaining numbers are scaled
>> by the same factor.
>>
>> before overwrite mode after overwrite mode
>> AVG 1000 999 1046 1044
>> STDEV 19.4 19.5 17.1 17.9
>
> OK, so I was sure I replied to this email; but apparently I didn't :/
>
> So its still adding about 5% overhead to the regular case; this is sad.
>
> What does something like the below do?
>
I re-tested the patch on a different 32-core Sandybridge-EP machine. The results are quite good.
origin origin overwrite modified modified overwrite
AVG 1000 1044 960 1006
STDEV 39.0 26.0 28.1 14.4
Regards
Yan, Zheng
> ---
> include/linux/perf_event.h | 2 +
> kernel/events/core.c | 60 ++++++++++++++++++++++++------
> kernel/events/internal.h | 2 +
> kernel/events/ring_buffer.c | 90 +++++++++++++++++++++++++++------------------
> 4 files changed, 106 insertions(+), 48 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 8873f82..bcce98a 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -753,6 +753,8 @@ static inline bool has_branch_stack(struct perf_event *event)
>
> extern int perf_output_begin(struct perf_output_handle *handle,
> struct perf_event *event, unsigned int size);
> +extern int perf_output_begin_overwrite(struct perf_output_handle *handle,
> + struct perf_event *event, unsigned int size);
> extern void perf_output_end(struct perf_output_handle *handle);
> extern unsigned int perf_output_copy(struct perf_output_handle *handle,
> const void *buf, unsigned int len);
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 1833bc5..4d674e9 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3878,6 +3878,8 @@ static const struct vm_operations_struct perf_mmap_vmops = {
> .page_mkwrite = perf_mmap_fault,
> };
>
> +static void perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb);
> +
> static int perf_mmap(struct file *file, struct vm_area_struct *vma)
> {
> struct perf_event *event = file->private_data;
> @@ -3985,6 +3987,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
> vma->vm_mm->pinned_vm += extra;
>
> ring_buffer_attach(event, rb);
> + perf_event_set_overflow(event, rb);
> rcu_assign_pointer(event->rb, rb);
>
> perf_event_update_userpage(event);
> @@ -4595,9 +4598,12 @@ void perf_prepare_sample(struct perf_event_header *header,
> }
> }
>
> -static void perf_event_output(struct perf_event *event,
> - struct perf_sample_data *data,
> - struct pt_regs *regs)
> +static __always_inline void
> +__perf_event_output(struct perf_event *event,
> + struct perf_sample_data *data,
> + struct pt_regs *regs,
> + int (*output_begin)(struct perf_output_handle *,
> + struct perf_event *, unsigned int))
> {
> struct perf_output_handle handle;
> struct perf_event_header header;
> @@ -4607,7 +4613,7 @@ static void perf_event_output(struct perf_event *event,
>
> perf_prepare_sample(&header, data, event, regs);
>
> - if (perf_output_begin(&handle, event, header.size))
> + if (output_begin(&handle, event, header.size))
> goto exit;
>
> perf_output_sample(&handle, &header, data, event);
> @@ -4618,6 +4624,33 @@ static void perf_event_output(struct perf_event *event,
> rcu_read_unlock();
> }
>
> +static void perf_event_output(struct perf_event *event,
> + struct perf_sample_data *data,
> + struct pt_regs *regs)
> +{
> + __perf_event_output(event, data, regs, perf_output_begin);
> +}
> +
> +static void perf_event_output_overwrite(struct perf_event *event,
> + struct perf_sample_data *data,
> + struct pt_regs *regs)
> +{
> + __perf_event_output(event, data, regs, perf_output_begin_overwrite);
> +}
> +
> +static void
> +perf_event_set_overflow(struct perf_event *event, struct ring_buffer *rb)
> +{
> + if (event->overflow_handler != perf_event_output ||
> + event->overflow_handler != perf_event_output_overwrite)
> + return;
> +
> + if (rb->overwrite)
> + event->overflow_handler = perf_event_output_overwrite;
> + else
> + event->overflow_handler = perf_event_output;
> +}
> +
> /*
> * read event_id
> */
> @@ -5183,10 +5216,7 @@ static int __perf_event_overflow(struct perf_event *event,
> irq_work_queue(&event->pending);
> }
>
> - if (event->overflow_handler)
> - event->overflow_handler(event, data, regs);
> - else
> - perf_event_output(event, data, regs);
> + event->overflow_handler(event, data, regs);
>
> if (event->fasync && event->pending_kill) {
> event->pending_wakeup = 1;
> @@ -6501,8 +6531,13 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
> context = parent_event->overflow_handler_context;
> }
>
> - event->overflow_handler = overflow_handler;
> - event->overflow_handler_context = context;
> + if (overflow_handler) {
> + event->overflow_handler = overflow_handler;
> + event->overflow_handler_context = context;
> + } else {
> + event->overflow_handler = perf_event_output;
> + event->overflow_handler_context = NULL;
> + }
>
> perf_event__state_init(event);
>
> @@ -6736,9 +6771,10 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
> if (old_rb)
> ring_buffer_detach(event, old_rb);
>
> - if (rb)
> + if (rb) {
> ring_buffer_attach(event, rb);
> -
> + perf_event_set_overflow(event, rb);
> + }
> rcu_assign_pointer(event->rb, rb);
>
> if (old_rb) {
> diff --git a/kernel/events/internal.h b/kernel/events/internal.h
> index ca65997..c4e4610 100644
> --- a/kernel/events/internal.h
> +++ b/kernel/events/internal.h
> @@ -20,6 +20,8 @@ struct ring_buffer {
>
> atomic_t poll; /* POLL_ for wakeups */
>
> + local_t tail; /* read position */
> + local_t next_tail; /* next read position */
> local_t head; /* write position */
> local_t nest; /* nested writers */
> local_t events; /* event limit */
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index cd55144..5887044 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -15,28 +15,9 @@
>
> #include "internal.h"
>
> -static bool perf_output_space(struct ring_buffer *rb, unsigned long tail,
> - unsigned long offset, unsigned long head)
> +static bool perf_output_space(unsigned long tail, unsigned long offset,
> + unsigned long head, unsigned long mask)
> {
> - unsigned long sz = perf_data_size(rb);
> - unsigned long mask = sz - 1;
> -
> - /*
> - * check if user-writable
> - * overwrite : over-write its own tail
> - * !overwrite: buffer possibly drops events.
> - */
> - if (rb->overwrite)
> - return true;
> -
> - /*
> - * verify that payload is not bigger than buffer
> - * otherwise masking logic may fail to detect
> - * the "not enough space" condition
> - */
> - if ((head - offset) > sz)
> - return false;
> -
> offset = (offset - tail) & mask;
> head = (head - tail) & mask;
>
> @@ -109,11 +90,11 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
> preempt_enable();
> }
>
> -int perf_output_begin(struct perf_output_handle *handle,
> - struct perf_event *event, unsigned int size)
> +static __always_inline int __perf_output_begin(struct perf_output_handle *handle,
> + struct perf_event *event, unsigned int size, bool overwrite)
> {
> struct ring_buffer *rb;
> - unsigned long tail, offset, head;
> + unsigned long tail, offset, head, max_size;
> int have_lost;
> struct perf_sample_data sample_data;
> struct {
> @@ -136,7 +117,8 @@ int perf_output_begin(struct perf_output_handle *handle,
> handle->rb = rb;
> handle->event = event;
>
> - if (!rb->nr_pages)
> + max_size = perf_data_size(rb);
> + if (size > max_size)
> goto out;
>
> have_lost = local_read(&rb->lost);
> @@ -149,19 +131,43 @@ int perf_output_begin(struct perf_output_handle *handle,
>
> perf_output_get_handle(handle);
>
> - do {
> + if (overwrite) {
> + do {
> + tail = local_read(&rb->tail);
> + offset = local_read(&rb->head);
> + head = offset + size;
> + if (unlikely(!perf_output_space(tail, offset, head,
> + max_size - 1))) {
> + tail = local_read(&rb->next_tail);
> + local_set(&rb->tail, tail);
> + rb->user_page->data_tail = tail;
> + }
> + } while (local_cmpxchg(&rb->head, offset, head) != offset);
> +
> /*
> - * Userspace could choose to issue a mb() before updating the
> - * tail pointer. So that all reads will be completed before the
> - * write is issued.
> + * Save the start of next event when half of the buffer
> + * has been filled. Later when the event buffer overflows,
> + * update the tail pointer to point to it.
> */
> - tail = ACCESS_ONCE(rb->user_page->data_tail);
> - smp_rmb();
> - offset = head = local_read(&rb->head);
> - head += size;
> - if (unlikely(!perf_output_space(rb, tail, offset, head)))
> - goto fail;
> - } while (local_cmpxchg(&rb->head, offset, head) != offset);
> + if (tail == local_read(&rb->next_tail) &&
> + head - tail >= (max_size >> 1))
> + local_cmpxchg(&rb->next_tail, tail, head);
> + } else {
> + do {
> + /*
> + * Userspace could choose to issue a mb() before
> + * updating the tail pointer. So that all reads will
> + * be completed before the write is issued.
> + */
> + tail = ACCESS_ONCE(rb->user_page->data_tail);
> + smp_rmb();
> + offset = local_read(&rb->head);
> + head = offset + size;
> + if (unlikely(!perf_output_space(tail, offset, head,
> + max_size - 1)))
> + goto fail;
> + } while (local_cmpxchg(&rb->head, offset, head) != offset);
> + }
>
> if (head - local_read(&rb->wakeup) > rb->watermark)
> local_add(rb->watermark, &rb->wakeup);
> @@ -194,6 +200,18 @@ int perf_output_begin(struct perf_output_handle *handle,
> return -ENOSPC;
> }
>
> +int perf_output_begin(struct perf_output_handle *handle,
> + struct perf_event *event, unsigned int size)
> +{
> + return __perf_output_begin(handle, event, size, false);
> +}
> +
> +int perf_output_begin_overwrite(struct perf_output_handle *handle,
> + struct perf_event *event, unsigned int size)
> +{
> + return __perf_output_begin(handle, event, size, true);
> +}
> +
> unsigned int perf_output_copy(struct perf_output_handle *handle,
> const void *buf, unsigned int len)
> {
>
* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-07-10 11:37 ` Yan, Zheng
@ 2013-07-10 11:44 ` Peter Zijlstra
2013-07-11 0:46 ` Yan, Zheng
0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2013-07-10 11:44 UTC (permalink / raw)
To: Yan, Zheng; +Cc: linux-kernel, mingo, eranian, ak
On Wed, Jul 10, 2013 at 07:37:43PM +0800, Yan, Zheng wrote:
> On 07/08/2013 08:15 PM, Peter Zijlstra wrote:
> > On Thu, Jun 06, 2013 at 01:58:06PM +0800, Yan, Zheng wrote:
> >> From: "Yan, Zheng" <zheng.z.yan@intel.com>
> >> before overwrite mode after overwrite mode
> >> AVG 1000 999 1046 1044
> >> STDEV 19.4 19.5 17.1 17.9
> >
> > OK, so I was sure I replied to this email; but apparently I didn't :/
> >
> > So its still adding about 5% overhead to the regular case; this is sad.
> >
> > What does something like the below do?
> >
>
> I re-tested the patch on a different 32-core Sandybridge-EP machine. The results are quite good.
>
> origin origin overwrite modified modified overwrite
> AVG 1000 1044 960 1006
> STDEV 39.0 26.0 28.1 14.4
Nice! -- you did fix the snafu for the overwrite mode before testing, right?
* Re: [PATCH] perf: Update event buffer tail when overwriting old events
2013-07-10 11:44 ` Peter Zijlstra
@ 2013-07-11 0:46 ` Yan, Zheng
0 siblings, 0 replies; 12+ messages in thread
From: Yan, Zheng @ 2013-07-11 0:46 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, mingo, eranian, ak
On 07/10/2013 07:44 PM, Peter Zijlstra wrote:
> On Wed, Jul 10, 2013 at 07:37:43PM +0800, Yan, Zheng wrote:
>> On 07/08/2013 08:15 PM, Peter Zijlstra wrote:
>>> On Thu, Jun 06, 2013 at 01:58:06PM +0800, Yan, Zheng wrote:
>>>> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>
>>>> before overwrite mode after overwrite mode
>>>> AVG 1000 999 1046 1044
>>>> STDEV 19.4 19.5 17.1 17.9
>>>
>>> OK, so I was sure I replied to this email; but apparently I didn't :/
>>>
>>> So its still adding about 5% overhead to the regular case; this is sad.
>>>
>>> What does something like the below do?
>>>
>>
>> I re-tested the patch on a different 32-core Sandybridge-EP machine. The results are quite good.
>>
>> origin origin overwrite modified modified overwrite
>> AVG 1000 1044 960 1006
>> STDEV 39.0 26.0 28.1 14.4
>
> Nice! -- you did fix the snafu for the overwrite mode before testing, right?
>
yes, of course.
Regards
Yan, Zheng
end of thread, other threads:[~2013-07-11 0:46 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-06 5:58 [PATCH] perf: Update event buffer tail when overwriting old events Yan, Zheng
2013-06-18 9:13 ` Peter Zijlstra
2013-07-08 12:15 ` Peter Zijlstra
2013-07-09 6:18 ` Namhyung Kim
2013-07-09 7:40 ` Peter Zijlstra
2013-07-09 7:05 ` Yan, Zheng
2013-07-09 8:05 ` Peter Zijlstra
2013-07-09 13:52 ` Yan, Zheng
2013-07-09 14:31 ` Peter Zijlstra
2013-07-10 11:37 ` Yan, Zheng
2013-07-10 11:44 ` Peter Zijlstra
2013-07-11 0:46 ` Yan, Zheng