All of lore.kernel.org
 help / color / mirror / Atom feed
From: Wang Nan <wangnan0@huawei.com>
To: <acme@kernel.org>
Cc: <linux-kernel@vger.kernel.org>, <pi3orama@163.com>,
	Wang Nan <wangnan0@huawei.com>, He Kuang <hekuang@huawei.com>,
	"Arnaldo Carvalho de Melo" <acme@redhat.com>,
	Brendan Gregg <brendan.d.gregg@gmail.com>,
	Jiri Olsa <jolsa@kernel.org>,
	Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>,
	Namhyung Kim <namhyung@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Zefan Li <lizefan@huawei.com>
Subject: [PATCH 10/10] perf core: Add backward attribute to perf event
Date: Wed, 13 Apr 2016 08:21:13 +0000	[thread overview]
Message-ID: <1460535673-159866-11-git-send-email-wangnan0@huawei.com> (raw)
In-Reply-To: <1460535673-159866-1-git-send-email-wangnan0@huawei.com>

This patch introduces 'write_backward' bit to perf_event_attr, which
controls the direction of a ring buffer. After set, the corresponding
ring buffer is written from end to beginning. This feature is design to
support reading from overwritable ring buffer.

Ring buffer can be created by mapping a perf event fd. Kernel puts event
records into ring buffer, user programs like perf fetch them from
address returned by mmap(). To prevent racing between kernel and perf,
they communicate to each other through 'head' and 'tail' pointers.
Kernel maintains 'head' pointer, points it to the next free area (tail
of the last record). Perf maintains 'tail' pointer, points it to the
tail of last consumed record (record has already been fetched). Kernel
determines the available space in a ring buffer using these two
pointers to avoid overwrite unfetched records.

By mapping without 'PROT_WRITE', an overwritable ring buffer is created.
Different from normal ring buffer, perf is unable to maintain 'tail'
pointer because writing is forbidden. Therefore, for this type of ring
buffers, kernel overwrite old records unconditionally, works like flight
recorder. This feature would be useful if reading from overwritable ring
buffer were as easy as reading from normal ring buffer. However,
there's an obscure problem.

The following figure demonstrates a full overwritable ring buffer. In
this figure, the 'head' pointer points to the end of last record, and a
long record 'E' is pending. For a normal ring buffer, a 'tail' pointer
would have pointed to position (X), so kernel knows there's no more
space in the ring buffer. However, for an overwritable ring buffer,
kernel ignore the 'tail' pointer.

   (X)                              head
    .                                |
    .                                V
    +------+-------+----------+------+---+
    |A....A|B.....B|C........C|D....D|   |
    +------+-------+----------+------+---+

Record 'A' is overwritten by event 'E':

      head
       |
       V
    +--+---+-------+----------+------+---+
    |.E|..A|B.....B|C........C|D....D|E..|
    +--+---+-------+----------+------+---+

Now perf decides to read from this ring buffer. However, none of these
two natural positions, 'head' and the start of this ring buffer, are
pointing to the head of a record. Even the full ring buffer can be
accessed by perf, it is unable to find a position to start decoding.

The first attempt tries to solve this problem AFAIK can be found from
[1]. It makes kernel to maintain 'tail' pointer: updates it when ring
buffer is half full. However, this approach introduces overhead to
fast path. Test result shows a 1% overhead [2]. In addition, this method
utilizes no more tham 50% records.

Another attempt can be found from [3], which allows putting the size of
an event at the end of each record. This approach allows perf to find
records in a backword manner from 'head' pointer by reading size of a
record from its tail. However, because of alignment requirement, it
needs 8 bytes to record the size of a record, which is a huge waste. Its
performance is also not good, because more data need to be written.
This approach also introduces some extra branch instructions to fast
path.

'write_backward' is a better solution to this problem.

Following figure demonstrates the state of the overwritable ring buffer
when 'write_backward' is set before overwriting:

       head
        |
        V
    +---+------+----------+-------+------+
    |   |D....D|C........C|B.....B|A....A|
    +---+------+----------+-------+------+

and after overwriting:
                                     head
                                      |
                                      V
    +---+------+----------+-------+---+--+
    |..E|D....D|C........C|B.....B|A..|E.|
    +---+------+----------+-------+---+--+

In each situation, 'head' points to the beginning of the newest record.
>From this record, perf can iterate over the full ring buffer and fetch
records one by one.

The only limitation needs to consider is back-to-back reading. Due to
the non-deterministic of user program, it is impossible to ensure the
ring buffer keeps stable during reading. Consider an extreme situation:
perf is scheduled out after reading record 'D', then a burst of events
come, eat up the whole ring buffer (one or multiple rounds). When perf
comes back, reading after 'D' is incorrect now.

To prevent this problem, we need to find a way to ensure the ring buffer
is stable during reading. ioctl(PERF_EVENT_IOC_PAUSE_OUTPUT) is
suggested because its overhead is lower than
ioctl(PERF_EVENT_IOC_ENABLE).

By carefully verifying 'header' pointer, reader can avoid pausing the
ring-buffer. For example:

    /* A union of all possible events */
    union perf_event event;

    p = head = perf_mmap__read_head();
    while (true) {
        /* copy header of next event */
        fetch(&event.header, p, sizeof(event.header));

        /* read 'head' pointer */
        head = perf_mmap__read_head();

        /* check overwritten: is the header good? */
        if (!verify(sizeof(event.header), p, head))
            break;

        /* copy the whole event */
        fetch(&event, p, event.header.size);

        /* read 'head' pointer again */
        head = perf_mmap__read_head();

        /* is the whole event good? */
        if (!verify(event.header.size, p, head))
            break;
        p += event.header.size;
    }

However, the overhead is high because:

 a) In-place decoding is not safe.
    Copying-verifying-decoding is required.
 b) Fetching 'head' pointer requires additional synchronization.

(From Alexei Starovoitov:

Even this trick work, pause is needed for more than stability of
reading. When we collect the events into overwrite buffer we're waiting
for some other trigger (like all cpu utilization spike or just one cpu
running and all others are idle) and when it happens the buffer has
valuable info from the past. At this point new events are no longer
interesting and buffer should be paused, events read and unpaused until
next trigger comes.)

This patch utilizes event's default overflow_handler introduced
previously. perf_event_output_backward() is created as the default
overflow handler for backward ring buffers. To avoid extra overhead to
fast path, original perf_event_output() becomes __perf_event_output()
and marked '__always_inline'. In theory, there's no extra overhead
introduced to fast path.

Performance result:

Calling 3000000 times of 'close(-1)', use gettimeofday() to check
duration.  Use 'perf record -o /dev/null -e raw_syscalls:*' to capture
system calls. In ns.

Testing environment:

 CPU    : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
 Kernel : v4.5.0
                   MEAN         STDVAR
BASE            800214.950    2853.083
PRE1           2253846.700    9997.014
PRE2           2257495.540    8516.293
POST           2250896.100    8933.921

Where 'BASE' is pure performance without capturing. 'PRE1' is test
result of pure 'v4.5.0' kernel. 'PRE2' is test result before this
patch. 'POST' is test result after this patch. See [4] for detail
experimental setup.

Considering the stdvar, this patch doesn't introduce performance
overhead to fast path.

[1] http://lkml.iu.edu/hypermail/linux/kernel/1304.1/04584.html
[2] http://lkml.iu.edu/hypermail/linux/kernel/1307.1/00535.html
[3] http://lkml.iu.edu/hypermail/linux/kernel/1512.0/01265.html
[4] http://lkml.kernel.org/g/56F89DCD.1040202@huawei.com

Signed-off-by: Wang Nan <wangnan0@huawei.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: He Kuang <hekuang@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
---
 include/linux/perf_event.h      | 28 +++++++++++++++++++++---
 include/uapi/linux/perf_event.h |  3 ++-
 kernel/events/core.c            | 48 ++++++++++++++++++++++++++++++++++++-----
 kernel/events/ring_buffer.c     | 16 +++++++++++++-
 4 files changed, 85 insertions(+), 10 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 4065ca2..0cc36ad 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -834,14 +834,24 @@ extern int perf_event_overflow(struct perf_event *event,
 				 struct perf_sample_data *data,
 				 struct pt_regs *regs);
 
+extern void perf_event_output_forward(struct perf_event *event,
+				     struct perf_sample_data *data,
+				     struct pt_regs *regs);
+extern void perf_event_output_backward(struct perf_event *event,
+				       struct perf_sample_data *data,
+				       struct pt_regs *regs);
 extern void perf_event_output(struct perf_event *event,
-				struct perf_sample_data *data,
-				struct pt_regs *regs);
+			      struct perf_sample_data *data,
+			      struct pt_regs *regs);
 
 static inline bool
 is_default_overflow_handler(struct perf_event *event)
 {
-	return (event->overflow_handler == perf_event_output);
+	if (likely(event->overflow_handler == perf_event_output_forward))
+		return true;
+	if (unlikely(event->overflow_handler == perf_event_output_backward))
+		return true;
+	return false;
 }
 
 extern void
@@ -1042,8 +1052,20 @@ static inline bool has_aux(struct perf_event *event)
 	return event->pmu->setup_aux;
 }
 
+static inline bool is_write_backward(struct perf_event *event)
+{
+	return !!event->attr.write_backward;
+}
+
 extern int perf_output_begin(struct perf_output_handle *handle,
 			     struct perf_event *event, unsigned int size);
+extern int perf_output_begin_forward(struct perf_output_handle *handle,
+				    struct perf_event *event,
+				    unsigned int size);
+extern int perf_output_begin_backward(struct perf_output_handle *handle,
+				      struct perf_event *event,
+				      unsigned int size);
+
 extern void perf_output_end(struct perf_output_handle *handle);
 extern unsigned int perf_output_copy(struct perf_output_handle *handle,
 			     const void *buf, unsigned int len);
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index a3c1903..43fc8d2 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -340,7 +340,8 @@ struct perf_event_attr {
 				comm_exec      :  1, /* flag comm events that are due to an exec */
 				use_clockid    :  1, /* use @clockid for time fields */
 				context_switch :  1, /* context switch data */
-				__reserved_1   : 37;
+				write_backward :  1, /* Write ring buffer from end to beginning */
+				__reserved_1   : 36;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8c3b35f..263a9d8 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5693,9 +5693,13 @@ void perf_prepare_sample(struct perf_event_header *header,
 	}
 }
 
-void perf_event_output(struct perf_event *event,
-			struct perf_sample_data *data,
-			struct pt_regs *regs)
+static void __always_inline
+__perf_event_output(struct perf_event *event,
+		    struct perf_sample_data *data,
+		    struct pt_regs *regs,
+		    int (*output_begin)(struct perf_output_handle *,
+					struct perf_event *,
+					unsigned int))
 {
 	struct perf_output_handle handle;
 	struct perf_event_header header;
@@ -5705,7 +5709,7 @@ void perf_event_output(struct perf_event *event,
 
 	perf_prepare_sample(&header, data, event, regs);
 
-	if (perf_output_begin(&handle, event, header.size))
+	if (output_begin(&handle, event, header.size))
 		goto exit;
 
 	perf_output_sample(&handle, &header, data, event);
@@ -5716,6 +5720,30 @@ exit:
 	rcu_read_unlock();
 }
 
+void
+perf_event_output_forward(struct perf_event *event,
+			 struct perf_sample_data *data,
+			 struct pt_regs *regs)
+{
+	__perf_event_output(event, data, regs, perf_output_begin_forward);
+}
+
+void
+perf_event_output_backward(struct perf_event *event,
+			   struct perf_sample_data *data,
+			   struct pt_regs *regs)
+{
+	__perf_event_output(event, data, regs, perf_output_begin_backward);
+}
+
+void
+perf_event_output(struct perf_event *event,
+		  struct perf_sample_data *data,
+		  struct pt_regs *regs)
+{
+	__perf_event_output(event, data, regs, perf_output_begin);
+}
+
 /*
  * read event_id
  */
@@ -8152,8 +8180,11 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	if (overflow_handler) {
 		event->overflow_handler	= overflow_handler;
 		event->overflow_handler_context = context;
+	} else if (is_write_backward(event)){
+		event->overflow_handler = perf_event_output_backward;
+		event->overflow_handler_context = NULL;
 	} else {
-		event->overflow_handler = perf_event_output;
+		event->overflow_handler = perf_event_output_forward;
 		event->overflow_handler_context = NULL;
 	}
 
@@ -8388,6 +8419,13 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
 		goto out;
 
 	/*
+	 * Either writing ring buffer from beginning or from end.
+	 * Mixing is not allowed.
+	 */
+	if (is_write_backward(output_event) != is_write_backward(event))
+		goto out;
+
+	/*
 	 * If both events generate aux data, they must be on the same PMU
 	 */
 	if (has_aux(event) && has_aux(output_event) &&
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 60be55a..c49bab4 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -230,10 +230,24 @@ out:
 	return -ENOSPC;
 }
 
+int perf_output_begin_forward(struct perf_output_handle *handle,
+			     struct perf_event *event, unsigned int size)
+{
+	return __perf_output_begin(handle, event, size, false);
+}
+
+int perf_output_begin_backward(struct perf_output_handle *handle,
+			       struct perf_event *event, unsigned int size)
+{
+	return __perf_output_begin(handle, event, size, true);
+}
+
 int perf_output_begin(struct perf_output_handle *handle,
 		      struct perf_event *event, unsigned int size)
 {
-	return __perf_output_begin(handle, event, size, false);
+
+	return __perf_output_begin(handle, event, size,
+				   unlikely(is_write_backward(event)));
 }
 
 unsigned int perf_output_copy(struct perf_output_handle *handle,
-- 
1.8.3.4

  parent reply	other threads:[~2016-04-13  8:23 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-04-13  8:21 [PATCH 00/10] perf tools: Use SIGUSR2 control data dumpping Wang Nan
2016-04-13  8:21 ` [PATCH 01/10] perf tools: Make ordered_events reusable Wang Nan
2016-04-13 15:24   ` Arnaldo Carvalho de Melo
2016-04-14 13:35   ` [tip:perf/core] perf ordered_events: Introduce reinit() tip-bot for Wang Nan
2016-04-14 13:35   ` [tip:perf/core] perf session: Make ordered_events reusable tip-bot for Wang Nan
2016-04-13  8:21 ` [PATCH 02/10] perf tools: Add perf_data_file__switch() helper Wang Nan
2016-04-14 13:36   ` [tip:perf/core] perf data: " tip-bot for Wang Nan
2016-04-15 10:41   ` [PATCH 02/10] perf tools: " Jiri Olsa
2016-04-15 16:00     ` Arnaldo Carvalho de Melo
2016-04-13  8:21 ` [PATCH 03/10] perf record: Turns auxtrace_snapshot_enable into 3 states Wang Nan
2016-04-13 15:55   ` Arnaldo Carvalho de Melo
2016-04-14  7:15     ` Adrian Hunter
2016-04-14  7:50       ` Wangnan (F)
2016-04-14  8:30         ` Adrian Hunter
2016-04-14  9:07           ` Wangnan (F)
2016-04-14 10:21             ` Adrian Hunter
2016-04-14 13:36   ` [tip:perf/core] " tip-bot for Wang Nan
2016-04-13  8:21 ` [PATCH 04/10] perf record: Add '--timestamp-filename' option to append timestamp to output filename Wang Nan
2016-04-14 13:36   ` [tip:perf/core] perf record: Add '--timestamp-filename' option to append timestamp to output file name tip-bot for Wang Nan
2016-04-13  8:21 ` [PATCH 05/10] perf record: Split output into multiple files via '--switch-output' Wang Nan
2016-04-13  8:21 ` [PATCH 06/10] perf record: Force enable --timestamp-filename when --switch-output is provided Wang Nan
2016-04-13  8:21 ` [PATCH 07/10] perf record: Disable buildid cache options by default in switch output mode Wang Nan
2016-04-13  8:21 ` [PATCH 08/10] perf record: Re-synthesize tracking events after output switching Wang Nan
2016-04-13  8:21 ` [PATCH 09/10] perf record: Generate tracking events for process forked by perf Wang Nan
2016-04-13  8:21 ` Wang Nan [this message]
2016-04-13 17:15 ` [PATCH 00/10] perf tools: Use SIGUSR2 control data dumpping Arnaldo Carvalho de Melo
2016-04-15 10:40 ` Jiri Olsa
2016-04-15 10:45   ` Wangnan (F)
2016-04-15 11:40     ` Wangnan (F)
2016-04-15 13:09       ` Arnaldo Carvalho de Melo
2016-04-15 16:26         ` Arnaldo Carvalho de Melo
2016-04-15 16:48           ` Wangnan (F)
2016-04-15 17:56             ` Wangnan (F)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1460535673-159866-11-git-send-email-wangnan0@huawei.com \
    --to=wangnan0@huawei.com \
    --cc=acme@kernel.org \
    --cc=acme@redhat.com \
    --cc=brendan.d.gregg@gmail.com \
    --cc=hekuang@huawei.com \
    --cc=jolsa@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lizefan@huawei.com \
    --cc=masami.hiramatsu.pt@hitachi.com \
    --cc=namhyung@kernel.org \
    --cc=peterz@infradead.org \
    --cc=pi3orama@163.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.