All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/6] perf core: Read from overwrite ring buffer
@ 2016-01-22 12:13 Wang Nan
  2016-01-22 12:13 ` [PATCH 1/6] perf core: Introduce new ioctl options to pause and resume " Wang Nan
                   ` (6 more replies)
  0 siblings, 7 replies; 10+ messages in thread
From: Wang Nan @ 2016-01-22 12:13 UTC (permalink / raw)
  To: peterz, alexei.starovoitov, acme
  Cc: linux-kernel, Wang Nan, He Kuang, Alexei Starovoitov,
	Arnaldo Carvalho de Melo, Brendan Gregg, Jiri Olsa,
	Masami Hiramatsu, Namhyung Kim, Zefan Li, pi3orama

This is v2 of this series.

Compare with v1:

Fixes several bugs in v1.

Corresponsing perf has finished and can be found from:

 https://git.kernel.org/cgit/linux/kernel/git/pi3orama/linux.git/ 
 branch: perf/overwrite-benchmark

Some benchmarking results can be found from [1].

Summary:

On a PC with Intel E5-2640 0 @ 2.50GHz CPU, execute close(-1) 3000000
times, capture raw_syscalls:* with perf into overwrite ring buffer,
check total time (in us):

             MEAN           STDVAR
BASE     :  879870.81      11913.13
RAWPERF  : 2603854.7      706658.4
WRTBKWRD : 2313301.220      6727.957
TAILSIZE : 2383051.860      5248.061
RAWOVWRT : 2315273.180      5221.025
RAWOVWRT*: 2323970.45       5103.39 

Where:
BASE: don't use perf at all.
RAWPERF: use non-overwrite ring buffer, perf collects all data,
	 write to /dev/null
WRTBKWRD: Use backward writing ring buffer, write from tail to head,
          never wakeup perf, collect data when exiting.
TAILSIZE: Use tailsize ring buffer, pad 8 bytes for the size of the
          record for each event, never wakeup perf, collect data when
	  exiting.
RAWOVWRT: Use raw overwrite ring buffer, never wakeup perf, don't
          collect data at all.
RAWOVWRT*: Same as RAWOVWRT, without this patchset.

The benchmarking results shows WRTBKWRD is good enough. I suggest not
to implement TAILSIZE and tail-header ring buffer.

I will post the result on a smartphone next week.

[1] http://lkml.kernel.org/g/56A07FF3.2090009@huawei.com

Wang Nan (6):
  perf core: Introduce new ioctl options to pause and resume ring buffer
  perf core: Set event's default overflow_handler
  perf core: Prepare writing into ring buffer from end
  perf core: Add backward attribute to perf event
  perf core: Reduce perf event output overhead by setting overwrite
    handler
  perf core: Put size of a sample at the end of it by
    PERF_SAMPLE_TAILSIZE

 include/linux/perf_event.h      |  39 +++++++---
 include/uapi/linux/perf_event.h |   7 +-
 kernel/events/core.c            | 155 +++++++++++++++++++++++++++++++---------
 kernel/events/internal.h        |  11 +++
 kernel/events/ring_buffer.c     |  70 +++++++++++++++---
 5 files changed, 228 insertions(+), 54 deletions(-)

Signed-off-by: Wang Nan <wangnan0@huawei.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
-- 
1.8.3.4

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/6] perf core: Introduce new ioctl options to pause and resume ring buffer
  2016-01-22 12:13 [PATCH 0/6] perf core: Read from overwrite ring buffer Wang Nan
@ 2016-01-22 12:13 ` Wang Nan
  2016-01-22 19:36   ` Alexei Starovoitov
  2016-01-22 12:13 ` [PATCH 2/6] perf core: Set event's default overflow_handler Wang Nan
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 10+ messages in thread
From: Wang Nan @ 2016-01-22 12:13 UTC (permalink / raw)
  To: peterz, alexei.starovoitov, acme
  Cc: linux-kernel, Wang Nan, He Kuang, Alexei Starovoitov,
	Arnaldo Carvalho de Melo, Brendan Gregg, Jiri Olsa,
	Masami Hiramatsu, Namhyung Kim, Zefan Li, pi3orama

Add new ioctl() to pause/resume ring-buffer output.

In some situations we want to read from ring buffer only when we
ensure nothing can write to the ring buffer during reading. Without
this patch we have to turn off all events attached to this ring buffer.
This patch is for supporting overwritable ring buffer with TAILSIZE
selected.

Signed-off-by: Wang Nan <wangnan0@huawei.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
---
 include/uapi/linux/perf_event.h |  1 +
 kernel/events/core.c            | 13 +++++++++++++
 kernel/events/internal.h        | 11 +++++++++++
 kernel/events/ring_buffer.c     |  7 ++++++-
 4 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 1afe962..2c7f00c 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -401,6 +401,7 @@ struct perf_event_attr {
 #define PERF_EVENT_IOC_SET_FILTER	_IOW('$', 6, char *)
 #define PERF_EVENT_IOC_ID		_IOR('$', 7, __u64 *)
 #define PERF_EVENT_IOC_SET_BPF		_IOW('$', 8, __u32)
+#define PERF_EVENT_IOC_PAUSE_OUTPUT	_IO ('$', 9)
 
 enum perf_event_ioc_flags {
 	PERF_IOC_FLAG_GROUP		= 1U << 0,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index bf82441..9e9c84da 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4241,6 +4241,19 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
 	case PERF_EVENT_IOC_SET_BPF:
 		return perf_event_set_bpf_prog(event, arg);
 
+	case PERF_EVENT_IOC_PAUSE_OUTPUT: {
+		struct ring_buffer *rb;
+
+		rcu_read_lock();
+		rb = rcu_dereference(event->rb);
+		if (!event->rb) {
+			rcu_read_unlock();
+			return -EINVAL;
+		}
+		rb_toggle_paused(rb, !!arg);
+		rcu_read_unlock();
+		return 0;
+	}
 	default:
 		return -ENOTTY;
 	}
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 2bbad9c..6a93d1b 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -18,6 +18,7 @@ struct ring_buffer {
 #endif
 	int				nr_pages;	/* nr of data pages  */
 	int				overwrite;	/* can overwrite itself */
+	int				paused;		/* can write into ring buffer */
 
 	atomic_t			poll;		/* POLL_ for wakeups */
 
@@ -65,6 +66,16 @@ static inline void rb_free_rcu(struct rcu_head *rcu_head)
 	rb_free(rb);
 }
 
+static inline void
+rb_toggle_paused(struct ring_buffer *rb,
+		 bool pause)
+{
+	if (!pause && rb->nr_pages)
+		rb->paused = 0;
+	else
+		rb->paused = 1;
+}
+
 extern struct ring_buffer *
 rb_alloc(int nr_pages, long watermark, int cpu, int flags);
 extern void perf_event_wakeup(struct perf_event *event);
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index adfdc05..9f1a93f 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -125,8 +125,11 @@ int perf_output_begin(struct perf_output_handle *handle,
 	if (unlikely(!rb))
 		goto out;
 
-	if (unlikely(!rb->nr_pages))
+	if (unlikely(rb->paused)) {
+		if (rb->nr_pages)
+			local_inc(&rb->lost);
 		goto out;
+	}
 
 	handle->rb    = rb;
 	handle->event = event;
@@ -244,6 +247,8 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
 	INIT_LIST_HEAD(&rb->event_list);
 	spin_lock_init(&rb->event_lock);
 	init_irq_work(&rb->irq_work, rb_irq_work);
+
+	rb->paused = rb->nr_pages ? 0 : 1;
 }
 
 static void ring_buffer_put_async(struct ring_buffer *rb)
-- 
1.8.3.4

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/6] perf core: Set event's default overflow_handler
  2016-01-22 12:13 [PATCH 0/6] perf core: Read from overwrite ring buffer Wang Nan
  2016-01-22 12:13 ` [PATCH 1/6] perf core: Introduce new ioctl options to pause and resume " Wang Nan
@ 2016-01-22 12:13 ` Wang Nan
  2016-01-22 12:13 ` [PATCH 3/6] perf core: Prepare writing into ring buffer from end Wang Nan
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Wang Nan @ 2016-01-22 12:13 UTC (permalink / raw)
  To: peterz, alexei.starovoitov, acme
  Cc: linux-kernel, Wang Nan, He Kuang, Alexei Starovoitov,
	Arnaldo Carvalho de Melo, Brendan Gregg, Jiri Olsa,
	Masami Hiramatsu, Namhyung Kim, Zefan Li, pi3orama

Set a default event->overflow_handler in perf_event_alloc() so don't
need checking event->overflow_handler in __perf_event_overflow().
Following commits can give a different default overflow_handler.

No extra performance introduced into hot path because in the original
code we still need reading this handler from memory. A conditional branch
is avoided so actually we remove some instructions.

Initial idea comes from Peter at [1].

[1] http://lkml.kernel.org/r/20130708121557.GA17211@twins.programming.kicks-ass.net

Signed-off-by: Wang Nan <wangnan0@huawei.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
---
 kernel/events/core.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9e9c84da..f79c4be 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6402,10 +6402,7 @@ static int __perf_event_overflow(struct perf_event *event,
 		irq_work_queue(&event->pending);
 	}
 
-	if (event->overflow_handler)
-		event->overflow_handler(event, data, regs);
-	else
-		perf_event_output(event, data, regs);
+	event->overflow_handler(event, data, regs);
 
 	if (*perf_event_fasync(event) && event->pending_kill) {
 		event->pending_wakeup = 1;
@@ -7874,8 +7871,13 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 		context = parent_event->overflow_handler_context;
 	}
 
-	event->overflow_handler	= overflow_handler;
-	event->overflow_handler_context = context;
+	if (overflow_handler) {
+		event->overflow_handler	= overflow_handler;
+		event->overflow_handler_context = context;
+	} else {
+		event->overflow_handler = perf_event_output;
+		event->overflow_handler_context = NULL;
+	}
 
 	perf_event__state_init(event);
 
-- 
1.8.3.4

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 3/6] perf core: Prepare writing into ring buffer from end
  2016-01-22 12:13 [PATCH 0/6] perf core: Read from overwrite ring buffer Wang Nan
  2016-01-22 12:13 ` [PATCH 1/6] perf core: Introduce new ioctl options to pause and resume " Wang Nan
  2016-01-22 12:13 ` [PATCH 2/6] perf core: Set event's default overflow_handler Wang Nan
@ 2016-01-22 12:13 ` Wang Nan
  2016-01-22 12:13 ` [PATCH 4/6] perf core: Add backward attribute to perf event Wang Nan
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Wang Nan @ 2016-01-22 12:13 UTC (permalink / raw)
  To: peterz, alexei.starovoitov, acme
  Cc: linux-kernel, Wang Nan, He Kuang, Alexei Starovoitov,
	Arnaldo Carvalho de Melo, Brendan Gregg, Jiri Olsa,
	Masami Hiramatsu, Namhyung Kim, Zefan Li, pi3orama

Convert perf_output_begin to __perf_output_begin and make the later
function able to write records from the end of the ring buffer.
Following commits will utilize the 'backward' flag.

This patch doesn't introduce any extra performance overhead since we
use always_inline.

Signed-off-by: Wang Nan <wangnan0@huawei.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
---
 kernel/events/ring_buffer.c | 42 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 36 insertions(+), 6 deletions(-)

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 9f1a93f..0684e880 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -102,8 +102,21 @@ out:
 	preempt_enable();
 }
 
-int perf_output_begin(struct perf_output_handle *handle,
-		      struct perf_event *event, unsigned int size)
+static bool __always_inline
+ring_buffer_has_space(unsigned long head, unsigned long tail,
+		      unsigned long data_size, unsigned int size,
+		      bool backward)
+{
+	if (!backward)
+		return CIRC_SPACE(head, tail, data_size) >= size;
+	else
+		return CIRC_SPACE(tail, head, data_size) >= size;
+}
+
+static int __always_inline
+__perf_output_begin(struct perf_output_handle *handle,
+		    struct perf_event *event, unsigned int size,
+		    bool backward)
 {
 	struct ring_buffer *rb;
 	unsigned long tail, offset, head;
@@ -146,9 +159,12 @@ int perf_output_begin(struct perf_output_handle *handle,
 	do {
 		tail = READ_ONCE(rb->user_page->data_tail);
 		offset = head = local_read(&rb->head);
-		if (!rb->overwrite &&
-		    unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size))
-			goto fail;
+		if (!rb->overwrite) {
+			if (unlikely(!ring_buffer_has_space(head, tail,
+							    perf_data_size(rb),
+							    size, backward)))
+				goto fail;
+		}
 
 		/*
 		 * The above forms a control dependency barrier separating the
@@ -162,9 +178,17 @@ int perf_output_begin(struct perf_output_handle *handle,
 		 * See perf_output_put_handle().
 		 */
 
-		head += size;
+		if (!backward)
+			head += size;
+		else
+			head -= size;
 	} while (local_cmpxchg(&rb->head, offset, head) != offset);
 
+	if (backward) {
+		offset = head;
+		head = (u64)(-head);
+	}
+
 	/*
 	 * We rely on the implied barrier() by local_cmpxchg() to ensure
 	 * none of the data stores below can be lifted up by the compiler.
@@ -206,6 +230,12 @@ out:
 	return -ENOSPC;
 }
 
+int perf_output_begin(struct perf_output_handle *handle,
+		      struct perf_event *event, unsigned int size)
+{
+	return __perf_output_begin(handle, event, size, false);
+}
+
 unsigned int perf_output_copy(struct perf_output_handle *handle,
 		      const void *buf, unsigned int len)
 {
-- 
1.8.3.4

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 4/6] perf core: Add backward attribute to perf event
  2016-01-22 12:13 [PATCH 0/6] perf core: Read from overwrite ring buffer Wang Nan
                   ` (2 preceding siblings ...)
  2016-01-22 12:13 ` [PATCH 3/6] perf core: Prepare writing into ring buffer from end Wang Nan
@ 2016-01-22 12:13 ` Wang Nan
  2016-01-22 12:13 ` [PATCH 5/6] perf core: Reduce perf event output overhead by setting overwrite handler Wang Nan
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Wang Nan @ 2016-01-22 12:13 UTC (permalink / raw)
  To: peterz, alexei.starovoitov, acme
  Cc: linux-kernel, Wang Nan, He Kuang, Alexei Starovoitov,
	Arnaldo Carvalho de Melo, Brendan Gregg, Jiri Olsa,
	Masami Hiramatsu, Namhyung Kim, Zefan Li, pi3orama

In perf_event_attr a new bit 'write_backward' is appended to indicate
this event should write ring buffer from its end to beginning.

In perf_output_begin(), prepare ring buffer according this bit.

This patch introduces small overhead into perf_output_begin():
an extra memory read and a conditional branch. Further patch can remove
this overhead using custom output handler.

Signed-off-by: Wang Nan <wangnan0@huawei.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
---
 include/linux/perf_event.h      | 5 +++++
 include/uapi/linux/perf_event.h | 3 ++-
 kernel/events/core.c            | 7 +++++++
 kernel/events/ring_buffer.c     | 2 ++
 4 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f9828a4..54c3fb2 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1032,6 +1032,11 @@ static inline bool has_aux(struct perf_event *event)
 	return event->pmu->setup_aux;
 }
 
+static inline bool is_write_backward(struct perf_event *event)
+{
+	return !!event->attr.write_backward;
+}
+
 extern int perf_output_begin(struct perf_output_handle *handle,
 			     struct perf_event *event, unsigned int size);
 extern void perf_output_end(struct perf_output_handle *handle);
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 2c7f00c..598b9b0 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -340,7 +340,8 @@ struct perf_event_attr {
 				comm_exec      :  1, /* flag comm events that are due to an exec */
 				use_clockid    :  1, /* use @clockid for time fields */
 				context_switch :  1, /* context switch data */
-				__reserved_1   : 37;
+				write_backward :  1, /* Write ring buffer from end to beginning */
+				__reserved_1   : 36;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f79c4be..8ad22a5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8107,6 +8107,13 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
 		goto out;
 
 	/*
+	 * Either writing ring buffer from beginning or from end.
+	 * Mixing is not allowed.
+	 */
+	if (is_write_backward(output_event) != is_write_backward(event))
+		goto out;
+
+	/*
 	 * If both events generate aux data, they must be on the same PMU
 	 */
 	if (has_aux(event) && has_aux(output_event) &&
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 0684e880..28543e1 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -233,6 +233,8 @@ out:
 int perf_output_begin(struct perf_output_handle *handle,
 		      struct perf_event *event, unsigned int size)
 {
+	if (unlikely(is_write_backward(event)))
+		return __perf_output_begin(handle, event, size, true);
 	return __perf_output_begin(handle, event, size, false);
 }
 
-- 
1.8.3.4

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 5/6] perf core: Reduce perf event output overhead by setting overwrite handler
  2016-01-22 12:13 [PATCH 0/6] perf core: Read from overwrite ring buffer Wang Nan
                   ` (3 preceding siblings ...)
  2016-01-22 12:13 ` [PATCH 4/6] perf core: Add backward attribute to perf event Wang Nan
@ 2016-01-22 12:13 ` Wang Nan
  2016-01-22 12:13 ` [PATCH 6/6] perf core: Put size of a sample at the end of it by PERF_SAMPLE_TAILSIZE Wang Nan
  2016-01-22 19:37 ` [PATCH 0/6] perf core: Read from overwrite ring buffer Alexei Starovoitov
  6 siblings, 0 replies; 10+ messages in thread
From: Wang Nan @ 2016-01-22 12:13 UTC (permalink / raw)
  To: peterz, alexei.starovoitov, acme
  Cc: linux-kernel, Wang Nan, He Kuang, Alexei Starovoitov,
	Arnaldo Carvalho de Melo, Brendan Gregg, Jiri Olsa,
	Masami Hiramatsu, Namhyung Kim, Zefan Li, pi3orama

By creating onward and backward specific overflow handler and setting
them according to event's backward setting, normal sampling events
don't need to check backward setting of an event any more.

This is the last patch of backward writing patchset. After this patch,
there's no extra overhead introduced to the fast path of sampling
output.

Signed-off-by: Wang Nan <wangnan0@huawei.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
---
 include/linux/perf_event.h  | 17 +++++++++++++++--
 kernel/events/core.c        | 41 ++++++++++++++++++++++++++++++++++++-----
 kernel/events/ring_buffer.c | 12 ++++++++++++
 3 files changed, 63 insertions(+), 7 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 54c3fb2..c0335b9 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -830,9 +830,15 @@ extern int perf_event_overflow(struct perf_event *event,
 				 struct perf_sample_data *data,
 				 struct pt_regs *regs);
 
+extern void perf_event_output_onward(struct perf_event *event,
+				     struct perf_sample_data *data,
+				     struct pt_regs *regs);
+extern void perf_event_output_backward(struct perf_event *event,
+				       struct perf_sample_data *data,
+				       struct pt_regs *regs);
 extern void perf_event_output(struct perf_event *event,
-				struct perf_sample_data *data,
-				struct pt_regs *regs);
+			      struct perf_sample_data *data,
+			      struct pt_regs *regs);
 
 extern void
 perf_event_header__init_id(struct perf_event_header *header,
@@ -1039,6 +1045,13 @@ static inline bool is_write_backward(struct perf_event *event)
 
 extern int perf_output_begin(struct perf_output_handle *handle,
 			     struct perf_event *event, unsigned int size);
+extern int perf_output_begin_onward(struct perf_output_handle *handle,
+				    struct perf_event *event,
+				    unsigned int size);
+extern int perf_output_begin_backward(struct perf_output_handle *handle,
+				      struct perf_event *event,
+				      unsigned int size);
+
 extern void perf_output_end(struct perf_output_handle *handle);
 extern unsigned int perf_output_copy(struct perf_output_handle *handle,
 			     const void *buf, unsigned int len);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8ad22a5..8a25e46 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5541,9 +5541,13 @@ void perf_prepare_sample(struct perf_event_header *header,
 	}
 }
 
-void perf_event_output(struct perf_event *event,
-			struct perf_sample_data *data,
-			struct pt_regs *regs)
+static void __always_inline
+__perf_event_output(struct perf_event *event,
+		    struct perf_sample_data *data,
+		    struct pt_regs *regs,
+		    int (*output_begin)(struct perf_output_handle *,
+					struct perf_event *,
+					unsigned int))
 {
 	struct perf_output_handle handle;
 	struct perf_event_header header;
@@ -5553,7 +5557,7 @@ void perf_event_output(struct perf_event *event,
 
 	perf_prepare_sample(&header, data, event, regs);
 
-	if (perf_output_begin(&handle, event, header.size))
+	if (output_begin(&handle, event, header.size))
 		goto exit;
 
 	perf_output_sample(&handle, &header, data, event);
@@ -5564,6 +5568,30 @@ exit:
 	rcu_read_unlock();
 }
 
+void
+perf_event_output_onward(struct perf_event *event,
+			 struct perf_sample_data *data,
+			 struct pt_regs *regs)
+{
+	__perf_event_output(event, data, regs, perf_output_begin_onward);
+}
+
+void
+perf_event_output_backward(struct perf_event *event,
+			   struct perf_sample_data *data,
+			   struct pt_regs *regs)
+{
+	__perf_event_output(event, data, regs, perf_output_begin_backward);
+}
+
+void
+perf_event_output(struct perf_event *event,
+		  struct perf_sample_data *data,
+		  struct pt_regs *regs)
+{
+	__perf_event_output(event, data, regs, perf_output_begin);
+}
+
 /*
  * read event_id
  */
@@ -7874,8 +7902,11 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	if (overflow_handler) {
 		event->overflow_handler	= overflow_handler;
 		event->overflow_handler_context = context;
+	} else if (is_write_backward(event)){
+		event->overflow_handler = perf_event_output_backward;
+		event->overflow_handler_context = NULL;
 	} else {
-		event->overflow_handler = perf_event_output;
+		event->overflow_handler = perf_event_output_onward;
 		event->overflow_handler_context = NULL;
 	}
 
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 28543e1..ca11809 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -230,6 +230,18 @@ out:
 	return -ENOSPC;
 }
 
+int perf_output_begin_onward(struct perf_output_handle *handle,
+			     struct perf_event *event, unsigned int size)
+{
+	return __perf_output_begin(handle, event, size, false);
+}
+
+int perf_output_begin_backward(struct perf_output_handle *handle,
+			       struct perf_event *event, unsigned int size)
+{
+	return __perf_output_begin(handle, event, size, true);
+}
+
 int perf_output_begin(struct perf_output_handle *handle,
 		      struct perf_event *event, unsigned int size)
 {
-- 
1.8.3.4

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 6/6] perf core: Put size of a sample at the end of it by PERF_SAMPLE_TAILSIZE
  2016-01-22 12:13 [PATCH 0/6] perf core: Read from overwrite ring buffer Wang Nan
                   ` (4 preceding siblings ...)
  2016-01-22 12:13 ` [PATCH 5/6] perf core: Reduce perf event output overhead by setting overwrite handler Wang Nan
@ 2016-01-22 12:13 ` Wang Nan
  2016-01-22 19:37 ` [PATCH 0/6] perf core: Read from overwrite ring buffer Alexei Starovoitov
  6 siblings, 0 replies; 10+ messages in thread
From: Wang Nan @ 2016-01-22 12:13 UTC (permalink / raw)
  To: peterz, alexei.starovoitov, acme
  Cc: linux-kernel, Wang Nan, He Kuang, Yunlong Song,
	Alexei Starovoitov, Arnaldo Carvalho de Melo, Brendan Gregg,
	Jiri Olsa, Masami Hiramatsu, Namhyung Kim, Zefan Li, pi3orama

This patch introduces a PERF_SAMPLE_TAILSIZE flag which allows a size
field attached at the end of a sample. The idea comes from [1] that,
with size at tail of an event, it is possible for user program who
read from the ring buffer parse events backward.

For example:

   head
    |
    V
 +--+---+-------+----------+------+---+
 |E6|...|   B  8|   C    11|  D  7|E..|
 +--+---+-------+----------+------+---+

In this case, from the 'head' pointer provided by kernel, user program
can first see '6' by (*(head - sizeof(u64))), then it can get the start
pointer of record 'E', then it can read size and find start position
of record D, C, B in similar way.

The implementation is easy: adding a PERF_SAMPLE_TAILSIZE flag, makes
perf_output_sample() output size at the end of a sample.

Following things are done for ensure the ring buffer is safe for
backward parsing:

 - Don't allow two events with different PERF_SAMPLE_TAILSIZE setting
   set their output to each other;

 - For non-sample events, also output tailsize if required.

This patch has a limitation for perf:

Before reading such ring buffer, perf must ensure all events which may
output to it is already stopped, so the 'head' pointer it get is the
end of the last record.

[1] http://lkml.kernel.org/g/1449063499-236703-1-git-send-email-wangnan0@huawei.com

Signed-off-by: Wang Nan <wangnan0@huawei.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Yunlong Song <yunlong.song@huawei.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
---
 include/linux/perf_event.h      | 17 ++++++---
 include/uapi/linux/perf_event.h |  3 +-
 kernel/events/core.c            | 82 +++++++++++++++++++++++++++++------------
 kernel/events/ring_buffer.c     |  7 ++--
 4 files changed, 75 insertions(+), 34 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index c0335b9..7c70d4b 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -841,13 +841,13 @@ extern void perf_event_output(struct perf_event *event,
 			      struct pt_regs *regs);
 
 extern void
-perf_event_header__init_id(struct perf_event_header *header,
-			   struct perf_sample_data *data,
-			   struct perf_event *event);
+perf_event_header__init_extra(struct perf_event_header *header,
+			      struct perf_sample_data *data,
+			      struct perf_event *event);
 extern void
-perf_event__output_id_sample(struct perf_event *event,
-			     struct perf_output_handle *handle,
-			     struct perf_sample_data *sample);
+perf_event__output_extra(struct perf_event *event, u64 evt_size,
+			 struct perf_output_handle *handle,
+			 struct perf_sample_data *sample);
 
 extern void
 perf_log_lost_samples(struct perf_event *event, u64 lost);
@@ -1043,6 +1043,11 @@ static inline bool is_write_backward(struct perf_event *event)
 	return !!event->attr.write_backward;
 }
 
+static inline bool has_tailsize(struct perf_event *event)
+{
+	return !!(event->attr.sample_type & PERF_SAMPLE_TAILSIZE);
+}
+
 extern int perf_output_begin(struct perf_output_handle *handle,
 			     struct perf_event *event, unsigned int size);
 extern int perf_output_begin_onward(struct perf_output_handle *handle,
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 598b9b0..f0cad26 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -139,8 +139,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_IDENTIFIER			= 1U << 16,
 	PERF_SAMPLE_TRANSACTION			= 1U << 17,
 	PERF_SAMPLE_REGS_INTR			= 1U << 18,
+	PERF_SAMPLE_TAILSIZE			= 1U << 19,
 
-	PERF_SAMPLE_MAX = 1U << 19,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 20,		/* non-ABI */
 };
 
 /*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8a25e46..3014aa3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5141,12 +5141,14 @@ static void __perf_event_header__init_id(struct perf_event_header *header,
 	}
 }
 
-void perf_event_header__init_id(struct perf_event_header *header,
-				struct perf_sample_data *data,
-				struct perf_event *event)
+void perf_event_header__init_extra(struct perf_event_header *header,
+				   struct perf_sample_data *data,
+				   struct perf_event *event)
 {
 	if (event->attr.sample_id_all)
 		__perf_event_header__init_id(header, data, event);
+	if (has_tailsize(event))
+		header->size += sizeof(u64);
 }
 
 static void __perf_event__output_id_sample(struct perf_output_handle *handle,
@@ -5173,12 +5175,14 @@ static void __perf_event__output_id_sample(struct perf_output_handle *handle,
 		perf_output_put(handle, data->id);
 }
 
-void perf_event__output_id_sample(struct perf_event *event,
-				  struct perf_output_handle *handle,
-				  struct perf_sample_data *sample)
+void perf_event__output_extra(struct perf_event *event, u64 evt_size,
+			      struct perf_output_handle *handle,
+			      struct perf_sample_data *sample)
 {
 	if (event->attr.sample_id_all)
 		__perf_event__output_id_sample(handle, sample);
+	if (has_tailsize(event))
+		perf_output_put(handle, evt_size);
 }
 
 static void perf_output_read_one(struct perf_output_handle *handle,
@@ -5420,6 +5424,13 @@ void perf_output_sample(struct perf_output_handle *handle,
 		}
 	}
 
+	/* Should be the last one */
+	if (sample_type & PERF_SAMPLE_TAILSIZE) {
+		u64 evt_size = header->size;
+
+		perf_output_put(handle, evt_size);
+	}
+
 	if (!event->attr.watermark) {
 		int wakeup_events = event->attr.wakeup_events;
 
@@ -5539,6 +5550,9 @@ void perf_prepare_sample(struct perf_event_header *header,
 
 		header->size += size;
 	}
+
+	if (sample_type & PERF_SAMPLE_TAILSIZE)
+		header->size += sizeof(u64);
 }
 
 static void __always_inline
@@ -5620,14 +5634,15 @@ perf_event_read_event(struct perf_event *event,
 	};
 	int ret;
 
-	perf_event_header__init_id(&read_event.header, &sample, event);
+	perf_event_header__init_extra(&read_event.header, &sample, event);
 	ret = perf_output_begin(&handle, event, read_event.header.size);
 	if (ret)
 		return;
 
 	perf_output_put(&handle, read_event);
 	perf_output_read(&handle, event);
-	perf_event__output_id_sample(event, &handle, &sample);
+	perf_event__output_extra(event, read_event.header.size,
+				 &handle, &sample);
 
 	perf_output_end(&handle);
 }
@@ -5739,7 +5754,7 @@ static void perf_event_task_output(struct perf_event *event,
 	if (!perf_event_task_match(event))
 		return;
 
-	perf_event_header__init_id(&task_event->event_id.header, &sample, event);
+	perf_event_header__init_extra(&task_event->event_id.header, &sample, event);
 
 	ret = perf_output_begin(&handle, event,
 				task_event->event_id.header.size);
@@ -5756,7 +5771,9 @@ static void perf_event_task_output(struct perf_event *event,
 
 	perf_output_put(&handle, task_event->event_id);
 
-	perf_event__output_id_sample(event, &handle, &sample);
+	perf_event__output_extra(event,
+				 task_event->event_id.header.size,
+				 &handle, &sample);
 
 	perf_output_end(&handle);
 out:
@@ -5835,7 +5852,7 @@ static void perf_event_comm_output(struct perf_event *event,
 	if (!perf_event_comm_match(event))
 		return;
 
-	perf_event_header__init_id(&comm_event->event_id.header, &sample, event);
+	perf_event_header__init_extra(&comm_event->event_id.header, &sample, event);
 	ret = perf_output_begin(&handle, event,
 				comm_event->event_id.header.size);
 
@@ -5849,7 +5866,8 @@ static void perf_event_comm_output(struct perf_event *event,
 	__output_copy(&handle, comm_event->comm,
 				   comm_event->comm_size);
 
-	perf_event__output_id_sample(event, &handle, &sample);
+	perf_event__output_extra(event, comm_event->event_id.header.size,
+				 &handle, &sample);
 
 	perf_output_end(&handle);
 out:
@@ -5958,7 +5976,7 @@ static void perf_event_mmap_output(struct perf_event *event,
 		mmap_event->event_id.header.size += sizeof(mmap_event->flags);
 	}
 
-	perf_event_header__init_id(&mmap_event->event_id.header, &sample, event);
+	perf_event_header__init_extra(&mmap_event->event_id.header, &sample, event);
 	ret = perf_output_begin(&handle, event,
 				mmap_event->event_id.header.size);
 	if (ret)
@@ -5981,7 +5999,8 @@ static void perf_event_mmap_output(struct perf_event *event,
 	__output_copy(&handle, mmap_event->file_name,
 				   mmap_event->file_size);
 
-	perf_event__output_id_sample(event, &handle, &sample);
+	perf_event__output_extra(event, mmap_event->event_id.header.size,
+				 &handle, &sample);
 
 	perf_output_end(&handle);
 out:
@@ -6164,14 +6183,15 @@ void perf_event_aux_event(struct perf_event *event, unsigned long head,
 	};
 	int ret;
 
-	perf_event_header__init_id(&rec.header, &sample, event);
+	perf_event_header__init_extra(&rec.header, &sample, event);
 	ret = perf_output_begin(&handle, event, rec.header.size);
 
 	if (ret)
 		return;
 
 	perf_output_put(&handle, rec);
-	perf_event__output_id_sample(event, &handle, &sample);
+	perf_event__output_extra(event, rec.header.size,
+				 &handle, &sample);
 
 	perf_output_end(&handle);
 }
@@ -6197,7 +6217,7 @@ void perf_log_lost_samples(struct perf_event *event, u64 lost)
 		.lost		= lost,
 	};
 
-	perf_event_header__init_id(&lost_samples_event.header, &sample, event);
+	perf_event_header__init_extra(&lost_samples_event.header, &sample, event);
 
 	ret = perf_output_begin(&handle, event,
 				lost_samples_event.header.size);
@@ -6205,7 +6225,8 @@ void perf_log_lost_samples(struct perf_event *event, u64 lost)
 		return;
 
 	perf_output_put(&handle, lost_samples_event);
-	perf_event__output_id_sample(event, &handle, &sample);
+	perf_event__output_extra(event, lost_samples_event.header.size,
+				 &handle, &sample);
 	perf_output_end(&handle);
 }
 
@@ -6252,7 +6273,7 @@ static void perf_event_switch_output(struct perf_event *event, void *data)
 					perf_event_tid(event, se->next_prev);
 	}
 
-	perf_event_header__init_id(&se->event_id.header, &sample, event);
+	perf_event_header__init_extra(&se->event_id.header, &sample, event);
 
 	ret = perf_output_begin(&handle, event, se->event_id.header.size);
 	if (ret)
@@ -6263,7 +6284,8 @@ static void perf_event_switch_output(struct perf_event *event, void *data)
 	else
 		perf_output_put(&handle, se->event_id);
 
-	perf_event__output_id_sample(event, &handle, &sample);
+	perf_event__output_extra(event, se->event_id.header.size,
+				 &handle, &sample);
 
 	perf_output_end(&handle);
 }
@@ -6323,7 +6345,7 @@ static void perf_log_throttle(struct perf_event *event, int enable)
 	if (enable)
 		throttle_event.header.type = PERF_RECORD_UNTHROTTLE;
 
-	perf_event_header__init_id(&throttle_event.header, &sample, event);
+	perf_event_header__init_extra(&throttle_event.header, &sample, event);
 
 	ret = perf_output_begin(&handle, event,
 				throttle_event.header.size);
@@ -6331,7 +6353,8 @@ static void perf_log_throttle(struct perf_event *event, int enable)
 		return;
 
 	perf_output_put(&handle, throttle_event);
-	perf_event__output_id_sample(event, &handle, &sample);
+	perf_event__output_extra(event, throttle_event.header.size,
+				 &handle, &sample);
 	perf_output_end(&handle);
 }
 
@@ -6359,14 +6382,15 @@ static void perf_log_itrace_start(struct perf_event *event)
 	rec.pid	= perf_event_pid(event, current);
 	rec.tid	= perf_event_tid(event, current);
 
-	perf_event_header__init_id(&rec.header, &sample, event);
+	perf_event_header__init_extra(&rec.header, &sample, event);
 	ret = perf_output_begin(&handle, event, rec.header.size);
 
 	if (ret)
 		return;
 
 	perf_output_put(&handle, rec);
-	perf_event__output_id_sample(event, &handle, &sample);
+	perf_event__output_extra(event, rec.header.size,
+				 &handle, &sample);
 
 	perf_output_end(&handle);
 }
@@ -8151,6 +8175,16 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
 	    event->pmu != output_event->pmu)
 		goto out;
 
+	/*
+	 * Don't allow mixed tailsize setting since the resuling
+	 * ringbuffer would unable to be parsed backward.
+	 *
+	 * '!=' is safe because has_tailsize() returns bool, two differnt
+	 * non-zero values would be treated as equal (both true).
+	 */
+	if (has_tailsize(event) != has_tailsize(output_event))
+		goto out;
+
 set:
 	mutex_lock(&event->mmap_mutex);
 	/* Can't redirect output if we've got an active mmap() */
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index ca11809..c392a50 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -213,10 +213,11 @@ __perf_output_begin(struct perf_output_handle *handle,
 		lost_event.id          = event->id;
 		lost_event.lost        = local_xchg(&rb->lost, 0);
 
-		perf_event_header__init_id(&lost_event.header,
-					   &sample_data, event);
+		perf_event_header__init_extra(&lost_event.header,
+					      &sample_data, event);
 		perf_output_put(handle, lost_event);
-		perf_event__output_id_sample(event, handle, &sample_data);
+		perf_event__output_extra(event, lost_event.header.type,
+					 handle, &sample_data);
 	}
 
 	return 0;
-- 
1.8.3.4

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/6] perf core: Introduce new ioctl options to pause and resume ring buffer
  2016-01-22 12:13 ` [PATCH 1/6] perf core: Introduce new ioctl options to pause and resume " Wang Nan
@ 2016-01-22 19:36   ` Alexei Starovoitov
  0 siblings, 0 replies; 10+ messages in thread
From: Alexei Starovoitov @ 2016-01-22 19:36 UTC (permalink / raw)
  To: Wang Nan
  Cc: peterz, acme, linux-kernel, He Kuang, Alexei Starovoitov,
	Arnaldo Carvalho de Melo, Brendan Gregg, Jiri Olsa,
	Masami Hiramatsu, Namhyung Kim, Zefan Li, pi3orama

On Fri, Jan 22, 2016 at 12:13:49PM +0000, Wang Nan wrote:
> Add new ioctl() to pause/resume ring-buffer output.
> 
> In some situations we want to read from ring buffer only when we
> ensure nothing can write to the ring buffer during reading. Without
> this patch we have to turn off all events attached to this ring buffer.
> This patch is for supporting overwritable ring buffer with TAILSIZE
> selected.

TAILSIZE is dropped. pls adjust commit log.

> Signed-off-by: Wang Nan <wangnan0@huawei.com>
> Cc: He Kuang <hekuang@huawei.com>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
> Cc: Jiri Olsa <jolsa@kernel.org>
> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Zefan Li <lizefan@huawei.com>
> Cc: pi3orama@163.com
> ---
>  include/uapi/linux/perf_event.h |  1 +
>  kernel/events/core.c            | 13 +++++++++++++
>  kernel/events/internal.h        | 11 +++++++++++
>  kernel/events/ring_buffer.c     |  7 ++++++-
>  4 files changed, 31 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 1afe962..2c7f00c 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -401,6 +401,7 @@ struct perf_event_attr {
>  #define PERF_EVENT_IOC_SET_FILTER	_IOW('$', 6, char *)
>  #define PERF_EVENT_IOC_ID		_IOR('$', 7, __u64 *)
>  #define PERF_EVENT_IOC_SET_BPF		_IOW('$', 8, __u32)
> +#define PERF_EVENT_IOC_PAUSE_OUTPUT	_IO ('$', 9)

do you need to add '__u32' here, since boolean flag is passed to it?

> +	case PERF_EVENT_IOC_PAUSE_OUTPUT: {
> +		struct ring_buffer *rb;
> +
> +		rcu_read_lock();
> +		rb = rcu_dereference(event->rb);
> +		if (!event->rb) {
> +			rcu_read_unlock();
> +			return -EINVAL;
> +		}
> +		rb_toggle_paused(rb, !!arg);
> +		rcu_read_unlock();
> +		return 0;
> +	}

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/6] perf core: Read from overwrite ring buffer
  2016-01-22 12:13 [PATCH 0/6] perf core: Read from overwrite ring buffer Wang Nan
                   ` (5 preceding siblings ...)
  2016-01-22 12:13 ` [PATCH 6/6] perf core: Put size of a sample at the end of it by PERF_SAMPLE_TAILSIZE Wang Nan
@ 2016-01-22 19:37 ` Alexei Starovoitov
  6 siblings, 0 replies; 10+ messages in thread
From: Alexei Starovoitov @ 2016-01-22 19:37 UTC (permalink / raw)
  To: Wang Nan
  Cc: peterz, acme, linux-kernel, He Kuang, Alexei Starovoitov,
	Arnaldo Carvalho de Melo, Brendan Gregg, Jiri Olsa,
	Masami Hiramatsu, Namhyung Kim, Zefan Li, pi3orama

On Fri, Jan 22, 2016 at 12:13:48PM +0000, Wang Nan wrote:
> This is v2 of this series.
> 
> Compare with v1:
> 
> Fixes several bugs in v1.
> 
> Corresponsing perf has finished and can be found from:
> 
>  https://git.kernel.org/cgit/linux/kernel/git/pi3orama/linux.git/ 
>  branch: perf/overwrite-benchmark
> 
> Some benchmarking results can be found from [1].
> 
> Summary:
> 
> On a PC with Intel E5-2640 0 @ 2.50GHz CPU, execute close(-1) 3000000
> times, capture raw_syscalls:* with perf into overwrite ring buffer,
> check total time (in us):
> 
>              MEAN           STDVAR
> BASE     :  879870.81      11913.13
> RAWPERF  : 2603854.7      706658.4
> WRTBKWRD : 2313301.220      6727.957
> TAILSIZE : 2383051.860      5248.061
> RAWOVWRT : 2315273.180      5221.025
> RAWOVWRT*: 2323970.45       5103.39 
> 
> Where:
> BASE: don't use perf at all.
> RAWPERF: use non-overwrite ring buffer, perf collects all data,
> 	 write to /dev/null
> WRTBKWRD: Use backward writing ring buffer, write from tail to head,
>           never wakeup perf, collect data when exiting.
> TAILSIZE: Use tailsize ring buffer, pad 8 bytes for the size of the
>           record for each event, never wakeup perf, collect data when
> 	  exiting.
> RAWOVWRT: Use raw overwrite ring buffer, never wakeup perf, don't
>           collect data at all.
> RAWOVWRT*: Same as RAWOVWRT, without this patchset.
> 
> The benchmarking results shows WRTBKWRD is good enough. I suggest not
> to implement TAILSIZE and tail-header ring buffer.

yes. just drop patch 6.

> I will post the result on a smartphone next week.

great. would be interesting comparison.
overall I think it looks good.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 3/6] perf core: Prepare writing into ring buffer from end
  2016-01-19 11:16 ` [PATCH 0/6] perf core: Read from overwrite " Wang Nan
@ 2016-01-19 11:16   ` Wang Nan
  0 siblings, 0 replies; 10+ messages in thread
From: Wang Nan @ 2016-01-19 11:16 UTC (permalink / raw)
  To: peterz, ast
  Cc: linux-kernel, Wang Nan, He Kuang, Arnaldo Carvalho de Melo,
	Brendan Gregg, Jiri Olsa, Masami Hiramatsu, Namhyung Kim,
	Zefan Li, pi3orama

Convert perf_output_begin to __perf_output_begin and make the later
function able to write records from the end of the ring buffer.
Following commits will utilize the 'backward' flag.

This patch doesn't introduce any extra performance overhead since we
use always_inline.

Signed-off-by: Wang Nan <wangnan0@huawei.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
---
 kernel/events/ring_buffer.c | 37 +++++++++++++++++++++++++++++++------
 1 file changed, 31 insertions(+), 6 deletions(-)

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 9f1a93f..bbc3bc6 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -102,8 +102,21 @@ out:
 	preempt_enable();
 }
 
-int perf_output_begin(struct perf_output_handle *handle,
-		      struct perf_event *event, unsigned int size)
+static bool __always_inline
+ring_buffer_has_space(unsigned long head, unsigned long tail,
+		      unsigned long data_size, unsigned int size,
+		      bool backward)
+{
+	if (!backward)
+		return CIRC_SPACE(head, tail, data_size) < size;
+	else
+		return CIRC_SPACE(tail, head, data_size) < size;
+}
+
+static int __always_inline
+__perf_output_begin(struct perf_output_handle *handle,
+		    struct perf_event *event, unsigned int size,
+		    bool backward)
 {
 	struct ring_buffer *rb;
 	unsigned long tail, offset, head;
@@ -146,9 +159,12 @@ int perf_output_begin(struct perf_output_handle *handle,
 	do {
 		tail = READ_ONCE(rb->user_page->data_tail);
 		offset = head = local_read(&rb->head);
-		if (!rb->overwrite &&
-		    unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size))
-			goto fail;
+		if (!rb->overwrite) {
+			if (unlikely(!ring_buffer_has_space(head, tail,
+							    perf_data_size(rb),
+							    size, backward)))
+				goto fail;
+		}
 
 		/*
 		 * The above forms a control dependency barrier separating the
@@ -162,7 +178,10 @@ int perf_output_begin(struct perf_output_handle *handle,
 		 * See perf_output_put_handle().
 		 */
 
-		head += size;
+		if (!backward)
+			head += size;
+		else
+			head -= size;
 	} while (local_cmpxchg(&rb->head, offset, head) != offset);
 
 	/*
@@ -206,6 +225,12 @@ out:
 	return -ENOSPC;
 }
 
+int perf_output_begin(struct perf_output_handle *handle,
+		      struct perf_event *event, unsigned int size)
+{
+	return __perf_output_begin(handle, event, size, false);
+}
+
 unsigned int perf_output_copy(struct perf_output_handle *handle,
 		      const void *buf, unsigned int len)
 {
-- 
1.8.3.4

^ permalink raw reply related	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2016-01-22 19:37 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-22 12:13 [PATCH 0/6] perf core: Read from overwrite ring buffer Wang Nan
2016-01-22 12:13 ` [PATCH 1/6] perf core: Introduce new ioctl options to pause and resume " Wang Nan
2016-01-22 19:36   ` Alexei Starovoitov
2016-01-22 12:13 ` [PATCH 2/6] perf core: Set event's default overflow_handler Wang Nan
2016-01-22 12:13 ` [PATCH 3/6] perf core: Prepare writing into ring buffer from end Wang Nan
2016-01-22 12:13 ` [PATCH 4/6] perf core: Add backward attribute to perf event Wang Nan
2016-01-22 12:13 ` [PATCH 5/6] perf core: Reduce perf event output overhead by setting overwrite handler Wang Nan
2016-01-22 12:13 ` [PATCH 6/6] perf core: Put size of a sample at the end of it by PERF_SAMPLE_TAILSIZE Wang Nan
2016-01-22 19:37 ` [PATCH 0/6] perf core: Read from overwrite ring buffer Alexei Starovoitov
  -- strict thread matches above, loose matches on Subject: below --
2016-01-18 12:02 [PATCH] perf core: Introduce new ioctl options to pause and resume " Peter Zijlstra
2016-01-19 11:16 ` [PATCH 0/6] perf core: Read from overwrite " Wang Nan
2016-01-19 11:16   ` [PATCH 3/6] perf core: Prepare writing into ring buffer from end Wang Nan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.