All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/4] perf ring-buffer fixes
@ 2019-05-17 11:52 Peter Zijlstra
  2019-05-17 11:52 ` [PATCH 1/4] perf/ring_buffer: Fix exposing a temporarily decreased data_head Peter Zijlstra
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Peter Zijlstra @ 2019-05-17 11:52 UTC (permalink / raw)
  To: peterz, mingo
  Cc: yabinc, acme, alexander.shishkin, jolsa, namhyung, mark.rutland,
	linux-kernel

This is basically a split of Yabin's last patch.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/4] perf/ring_buffer: Fix exposing a temporarily decreased data_head.
  2019-05-17 11:52 [PATCH 0/4] perf ring-buffer fixes Peter Zijlstra
@ 2019-05-17 11:52 ` Peter Zijlstra
  2019-05-17 13:05   ` Ingo Molnar
  2019-05-24  8:08   ` [tip:perf/urgent] " tip-bot for Yabin Cui
  2019-05-17 11:52 ` [PATCH 2/4] perf/ring_buffer: Add ordering to rb->nest increment Peter Zijlstra
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 12+ messages in thread
From: Peter Zijlstra @ 2019-05-17 11:52 UTC (permalink / raw)
  To: peterz, mingo
  Cc: yabinc, acme, alexander.shishkin, jolsa, namhyung, mark.rutland,
	linux-kernel, Ingo Molnar

In perf_output_put_handle(), an IRQ/NMI can happen in below location and
write records to the same ring buffer:
	...
	local_dec_and_test(&rb->nest)
	...                          <-- an IRQ/NMI can happen here
	rb->user_page->data_head = head;
	...

In this case, a value A is written to data_head in the IRQ, then a value
B is written to data_head after the IRQ. And A > B. As a result,
data_head is temporarily decreased from A to B. And a reader may see
data_head < data_tail if it read the buffer frequently enough, which
creates unexpected behaviors.

This can be fixed by moving dec(&rb->nest) to after updating data_head,
which prevents the IRQ/NMI above from updating data_head.

Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Fixes: ef60777c9abd ("perf: Optimize the perf_output() path by removing IRQ-disables")
Signed-off-by: Yabin Cui <yabinc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190516184010.167903-1-yabinc@google.com
---
 kernel/events/ring_buffer.c |   24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -51,11 +51,18 @@ static void perf_output_put_handle(struc
 	head = local_read(&rb->head);
 
 	/*
-	 * IRQ/NMI can happen here, which means we can miss a head update.
+	 * IRQ/NMI can happen here and advance @rb->head, causing our
+	 * load above to be stale.
 	 */
 
-	if (!local_dec_and_test(&rb->nest))
+	/*
+	 * If this isn't the outermost nesting, we don't have to update
+	 * @rb->user_page->data_head.
+	 */
+	if (local_read(&rb->nest) > 1) {
+		local_dec(&rb->nest);
 		goto out;
+	}
 
 	/*
 	 * Since the mmap() consumer (userspace) can run on a different CPU:
@@ -87,9 +94,18 @@ static void perf_output_put_handle(struc
 	rb->user_page->data_head = head;
 
 	/*
-	 * Now check if we missed an update -- rely on previous implied
-	 * compiler barriers to force a re-read.
+	 * We must publish the head before decrementing the nest count,
+	 * otherwise an IRQ/NMI can publish a more recent head value and our
+	 * write will (temporarily) publish a stale value.
+	 */
+	barrier();
+	local_set(&rb->nest, 0);
+
+	/*
+	 * Ensure we decrement @rb->nest before we validate the @rb->head.
+	 * Otherwise we cannot be sure we caught the 'last' nested update.
 	 */
+	barrier();
 	if (unlikely(head != local_read(&rb->head))) {
 		local_inc(&rb->nest);
 		goto again;



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 2/4] perf/ring_buffer: Add ordering to rb->nest increment
  2019-05-17 11:52 [PATCH 0/4] perf ring-buffer fixes Peter Zijlstra
  2019-05-17 11:52 ` [PATCH 1/4] perf/ring_buffer: Fix exposing a temporarily decreased data_head Peter Zijlstra
@ 2019-05-17 11:52 ` Peter Zijlstra
  2019-05-24  8:08   ` [tip:perf/urgent] " tip-bot for Peter Zijlstra
  2019-05-17 11:52 ` [PATCH 3/4] perf/ring-buffer: Always use {READ,WRITE}_ONCE() for rb->user_page data Peter Zijlstra
  2019-05-17 11:52 ` [PATCH 4/4] perf/ring-buffer: Use regular variables for nesting Peter Zijlstra
  3 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2019-05-17 11:52 UTC (permalink / raw)
  To: peterz, mingo
  Cc: yabinc, acme, alexander.shishkin, jolsa, namhyung, mark.rutland,
	linux-kernel

Similar to how decrementing rb->next too early can cause data_head to
(temporarily) be observed to go backward, so too can this happen when
we increment too late.

This barrier() ensures the rb->head load happens after the increment,
both the one in the 'goto again' path, as the one from
perf_output_get_handle() -- albeit very unlikely to matter for the
latter.

Fixes: ef60777c9abd ("perf: Optimize the perf_output() path by removing IRQ-disables")
Suggested-by: Yabin Cui <yabinc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/ring_buffer.c |    9 +++++++++
 1 file changed, 9 insertions(+)

--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -48,6 +48,15 @@ static void perf_output_put_handle(struc
 	unsigned long head;
 
 again:
+	/*
+	 * In order to avoid publishing a head value that goes backwards,
+	 * we must ensure the load of @rb->head happens after we've
+	 * incremented @rb->nest.
+	 *
+	 * Otherwise we can observe a @rb->head value before one published
+	 * by an IRQ/NMI happening between the load and the increment.
+	 */
+	barrier();
 	head = local_read(&rb->head);
 
 	/*



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 3/4] perf/ring-buffer: Always use {READ,WRITE}_ONCE() for rb->user_page data
  2019-05-17 11:52 [PATCH 0/4] perf ring-buffer fixes Peter Zijlstra
  2019-05-17 11:52 ` [PATCH 1/4] perf/ring_buffer: Fix exposing a temporarily decreased data_head Peter Zijlstra
  2019-05-17 11:52 ` [PATCH 2/4] perf/ring_buffer: Add ordering to rb->nest increment Peter Zijlstra
@ 2019-05-17 11:52 ` Peter Zijlstra
  2019-05-24  8:09   ` [tip:perf/urgent] " tip-bot for Peter Zijlstra
  2019-05-17 11:52 ` [PATCH 4/4] perf/ring-buffer: Use regular variables for nesting Peter Zijlstra
  3 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2019-05-17 11:52 UTC (permalink / raw)
  To: peterz, mingo
  Cc: yabinc, acme, alexander.shishkin, jolsa, namhyung, mark.rutland,
	linux-kernel

We must use {READ,WRITE}_ONCE() on rb->user_page data such that
concurrent usage will see whole values. A few key sites were missing
this.

Fixes:7b732a750477 ("perf_counter: new output ABI - part 1")
Suggested-by: Yabin Cui <yabinc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/ring_buffer.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -100,7 +100,7 @@ static void perf_output_put_handle(struc
 	 * See perf_output_begin().
 	 */
 	smp_wmb(); /* B, matches C */
-	rb->user_page->data_head = head;
+	WRITE_ONCE(rb->user_page->data_head, head);
 
 	/*
 	 * We must publish the head before decrementing the nest count,
@@ -496,7 +496,7 @@ void perf_aux_output_end(struct perf_out
 		perf_event_aux_event(handle->event, aux_head, size,
 				     handle->aux_flags);
 
-	rb->user_page->aux_head = rb->aux_head;
+	WRITE_ONCE(rb->user_page->aux_head, rb->aux_head);
 	if (rb_need_aux_wakeup(rb))
 		wakeup = true;
 
@@ -528,7 +528,7 @@ int perf_aux_output_skip(struct perf_out
 
 	rb->aux_head += size;
 
-	rb->user_page->aux_head = rb->aux_head;
+	WRITE_ONCE(rb->user_page->aux_head, rb->aux_head);
 	if (rb_need_aux_wakeup(rb)) {
 		perf_output_wakeup(handle);
 		handle->wakeup = rb->aux_wakeup + rb->aux_watermark;



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 4/4] perf/ring-buffer: Use regular variables for nesting
  2019-05-17 11:52 [PATCH 0/4] perf ring-buffer fixes Peter Zijlstra
                   ` (2 preceding siblings ...)
  2019-05-17 11:52 ` [PATCH 3/4] perf/ring-buffer: Always use {READ,WRITE}_ONCE() for rb->user_page data Peter Zijlstra
@ 2019-05-17 11:52 ` Peter Zijlstra
  2019-05-24  8:10   ` [tip:perf/urgent] " tip-bot for Peter Zijlstra
  3 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2019-05-17 11:52 UTC (permalink / raw)
  To: peterz, mingo
  Cc: yabinc, acme, alexander.shishkin, jolsa, namhyung, mark.rutland,
	linux-kernel

While the IRQ/NMI will nest, the nest-count will be invariant over the
actual exception, since it will decrement equal to increment.

This means we can -- carefully -- use a regular variable since the
typical LOAD-STORE race doesn't exist (similar to preempt_count).

This optimizes the ring-buffer for all LOAD-STORE architectures, since
they need to use atomic ops to implement local_t.

Suggested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/internal.h    |    4 ++--
 kernel/events/ring_buffer.c |   41 ++++++++++++++++++++++++++---------------
 2 files changed, 28 insertions(+), 17 deletions(-)

--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -24,7 +24,7 @@ struct ring_buffer {
 	atomic_t			poll;		/* POLL_ for wakeups */
 
 	local_t				head;		/* write position    */
-	local_t				nest;		/* nested writers    */
+	unsigned int			nest;		/* nested writers    */
 	local_t				events;		/* event limit       */
 	local_t				wakeup;		/* wakeup stamp      */
 	local_t				lost;		/* nr records lost   */
@@ -41,7 +41,7 @@ struct ring_buffer {
 
 	/* AUX area */
 	long				aux_head;
-	local_t				aux_nest;
+	unsigned int			aux_nest;
 	long				aux_wakeup;	/* last aux_watermark boundary crossed by aux_head */
 	unsigned long			aux_pgoff;
 	int				aux_nr_pages;
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -38,7 +38,12 @@ static void perf_output_get_handle(struc
 	struct ring_buffer *rb = handle->rb;
 
 	preempt_disable();
-	local_inc(&rb->nest);
+
+	/*
+	 * Avoid an explicit LOAD/STORE such that architectures with memops
+	 * can use them.
+	 */
+	(*(volatile unsigned int *)&rb->nest)++;
 	handle->wakeup = local_read(&rb->wakeup);
 }
 
@@ -46,6 +51,17 @@ static void perf_output_put_handle(struc
 {
 	struct ring_buffer *rb = handle->rb;
 	unsigned long head;
+	unsigned int nest;
+
+	/*
+	 * If this isn't the outermost nesting, we don't have to update
+	 * @rb->user_page->data_head.
+	 */
+	nest = READ_ONCE(rb->nest);
+	if (nest > 1) {
+		WRITE_ONCE(rb->nest, nest - 1);
+		goto out;
+	}
 
 again:
 	/*
@@ -65,15 +81,6 @@ static void perf_output_put_handle(struc
 	 */
 
 	/*
-	 * If this isn't the outermost nesting, we don't have to update
-	 * @rb->user_page->data_head.
-	 */
-	if (local_read(&rb->nest) > 1) {
-		local_dec(&rb->nest);
-		goto out;
-	}
-
-	/*
 	 * Since the mmap() consumer (userspace) can run on a different CPU:
 	 *
 	 *   kernel				user
@@ -108,7 +115,7 @@ static void perf_output_put_handle(struc
 	 * write will (temporarily) publish a stale value.
 	 */
 	barrier();
-	local_set(&rb->nest, 0);
+	WRITE_ONCE(rb->nest, 0);
 
 	/*
 	 * Ensure we decrement @rb->nest before we validate the @rb->head.
@@ -116,7 +123,7 @@ static void perf_output_put_handle(struc
 	 */
 	barrier();
 	if (unlikely(head != local_read(&rb->head))) {
-		local_inc(&rb->nest);
+		WRITE_ONCE(rb->nest, 1);
 		goto again;
 	}
 
@@ -355,6 +362,7 @@ void *perf_aux_output_begin(struct perf_
 	struct perf_event *output_event = event;
 	unsigned long aux_head, aux_tail;
 	struct ring_buffer *rb;
+	unsigned int nest;
 
 	if (output_event->parent)
 		output_event = output_event->parent;
@@ -385,13 +393,16 @@ void *perf_aux_output_begin(struct perf_
 	if (!refcount_inc_not_zero(&rb->aux_refcount))
 		goto err;
 
+	nest = READ_ONCE(rb->aux_nest);
 	/*
 	 * Nesting is not supported for AUX area, make sure nested
 	 * writers are caught early
 	 */
-	if (WARN_ON_ONCE(local_xchg(&rb->aux_nest, 1)))
+	if (WARN_ON_ONCE(nest))
 		goto err_put;
 
+	WRITE_ONCE(rb->aux_nest, nest + 1);
+
 	aux_head = rb->aux_head;
 
 	handle->rb = rb;
@@ -419,7 +430,7 @@ void *perf_aux_output_begin(struct perf_
 		if (!handle->size) { /* A, matches D */
 			event->pending_disable = smp_processor_id();
 			perf_output_wakeup(handle);
-			local_set(&rb->aux_nest, 0);
+			WRITE_ONCE(rb->aux_nest, 0);
 			goto err_put;
 		}
 	}
@@ -508,7 +519,7 @@ void perf_aux_output_end(struct perf_out
 
 	handle->event = NULL;
 
-	local_set(&rb->aux_nest, 0);
+	WRITE_ONCE(rb->aux_nest, 0);
 	/* can't be last */
 	rb_free_aux(rb);
 	ring_buffer_put(rb);



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/4] perf/ring_buffer: Fix exposing a temporarily decreased data_head.
  2019-05-17 11:52 ` [PATCH 1/4] perf/ring_buffer: Fix exposing a temporarily decreased data_head Peter Zijlstra
@ 2019-05-17 13:05   ` Ingo Molnar
  2019-05-17 14:26     ` Peter Zijlstra
  2019-05-24  8:08   ` [tip:perf/urgent] " tip-bot for Yabin Cui
  1 sibling, 1 reply; 12+ messages in thread
From: Ingo Molnar @ 2019-05-17 13:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: yabinc, acme, alexander.shishkin, jolsa, namhyung, mark.rutland,
	linux-kernel, Ingo Molnar


* Peter Zijlstra <peterz@infradead.org> wrote:

> In perf_output_put_handle(), an IRQ/NMI can happen in below location and
> write records to the same ring buffer:
> 	...
> 	local_dec_and_test(&rb->nest)
> 	...                          <-- an IRQ/NMI can happen here
> 	rb->user_page->data_head = head;
> 	...
> 
> In this case, a value A is written to data_head in the IRQ, then a value
> B is written to data_head after the IRQ. And A > B. As a result,
> data_head is temporarily decreased from A to B. And a reader may see
> data_head < data_tail if it read the buffer frequently enough, which
> creates unexpected behaviors.
> 
> This can be fixed by moving dec(&rb->nest) to after updating data_head,
> which prevents the IRQ/NMI above from updating data_head.
> 
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Jiri Olsa <jolsa@redhat.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Fixes: ef60777c9abd ("perf: Optimize the perf_output() path by removing IRQ-disables")
> Signed-off-by: Yabin Cui <yabinc@google.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/20190516184010.167903-1-yabinc@google.com

So these are missing a bunch of:

  From: Yabin Cui <yabinc@google.com>

lines, right?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/4] perf/ring_buffer: Fix exposing a temporarily decreased data_head.
  2019-05-17 13:05   ` Ingo Molnar
@ 2019-05-17 14:26     ` Peter Zijlstra
  2019-05-24  7:01       ` Ingo Molnar
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2019-05-17 14:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: yabinc, acme, alexander.shishkin, jolsa, namhyung, mark.rutland,
	linux-kernel, Ingo Molnar

On Fri, May 17, 2019 at 03:05:09PM +0200, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > In perf_output_put_handle(), an IRQ/NMI can happen in below location and
> > write records to the same ring buffer:
> > 	...
> > 	local_dec_and_test(&rb->nest)
> > 	...                          <-- an IRQ/NMI can happen here
> > 	rb->user_page->data_head = head;
> > 	...
> > 
> > In this case, a value A is written to data_head in the IRQ, then a value
> > B is written to data_head after the IRQ. And A > B. As a result,
> > data_head is temporarily decreased from A to B. And a reader may see
> > data_head < data_tail if it read the buffer frequently enough, which
> > creates unexpected behaviors.
> > 
> > This can be fixed by moving dec(&rb->nest) to after updating data_head,
> > which prevents the IRQ/NMI above from updating data_head.
> > 
> > Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Jiri Olsa <jolsa@redhat.com>
> > Cc: Namhyung Kim <namhyung@kernel.org>
> > Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> > Fixes: ef60777c9abd ("perf: Optimize the perf_output() path by removing IRQ-disables")
> > Signed-off-by: Yabin Cui <yabinc@google.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Link: https://lkml.kernel.org/r/20190516184010.167903-1-yabinc@google.com
> 
> So these are missing a bunch of:
> 
>   From: Yabin Cui <yabinc@google.com>
> 
> lines, right?
> 

The first. certainly, the rest, while inspired by his patch, is more
complete than what he did.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/4] perf/ring_buffer: Fix exposing a temporarily decreased data_head.
  2019-05-17 14:26     ` Peter Zijlstra
@ 2019-05-24  7:01       ` Ingo Molnar
  0 siblings, 0 replies; 12+ messages in thread
From: Ingo Molnar @ 2019-05-24  7:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: yabinc, acme, alexander.shishkin, jolsa, namhyung, mark.rutland,
	linux-kernel, Ingo Molnar


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, May 17, 2019 at 03:05:09PM +0200, Ingo Molnar wrote:
> > 
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > In perf_output_put_handle(), an IRQ/NMI can happen in below location and
> > > write records to the same ring buffer:
> > > 	...
> > > 	local_dec_and_test(&rb->nest)
> > > 	...                          <-- an IRQ/NMI can happen here
> > > 	rb->user_page->data_head = head;
> > > 	...
> > > 
> > > In this case, a value A is written to data_head in the IRQ, then a value
> > > B is written to data_head after the IRQ. And A > B. As a result,
> > > data_head is temporarily decreased from A to B. And a reader may see
> > > data_head < data_tail if it read the buffer frequently enough, which
> > > creates unexpected behaviors.
> > > 
> > > This can be fixed by moving dec(&rb->nest) to after updating data_head,
> > > which prevents the IRQ/NMI above from updating data_head.
> > > 
> > > Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> > > Cc: Ingo Molnar <mingo@redhat.com>
> > > Cc: Jiri Olsa <jolsa@redhat.com>
> > > Cc: Namhyung Kim <namhyung@kernel.org>
> > > Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> > > Fixes: ef60777c9abd ("perf: Optimize the perf_output() path by removing IRQ-disables")
> > > Signed-off-by: Yabin Cui <yabinc@google.com>
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > Link: https://lkml.kernel.org/r/20190516184010.167903-1-yabinc@google.com
> > 
> > So these are missing a bunch of:
> > 
> >   From: Yabin Cui <yabinc@google.com>
> > 
> > lines, right?
> > 
> 
> The first. certainly, the rest, while inspired by his patch, is more
> complete than what he did.

Oh, indeed - I adjusted the first patch only.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [tip:perf/urgent] perf/ring_buffer: Fix exposing a temporarily decreased data_head
  2019-05-17 11:52 ` [PATCH 1/4] perf/ring_buffer: Fix exposing a temporarily decreased data_head Peter Zijlstra
  2019-05-17 13:05   ` Ingo Molnar
@ 2019-05-24  8:08   ` tip-bot for Yabin Cui
  1 sibling, 0 replies; 12+ messages in thread
From: tip-bot for Yabin Cui @ 2019-05-24  8:08 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: jolsa, hpa, torvalds, peterz, linux-kernel, namhyung, tglx, acme,
	eranian, mingo, yabinc, alexander.shishkin, acme, vincent.weaver

Commit-ID:  1b038c6e05ff70a1e66e3e571c2e6106bdb75f53
Gitweb:     https://git.kernel.org/tip/1b038c6e05ff70a1e66e3e571c2e6106bdb75f53
Author:     Yabin Cui <yabinc@google.com>
AuthorDate: Fri, 17 May 2019 13:52:31 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 24 May 2019 09:00:10 +0200

perf/ring_buffer: Fix exposing a temporarily decreased data_head

In perf_output_put_handle(), an IRQ/NMI can happen in below location and
write records to the same ring buffer:

	...
	local_dec_and_test(&rb->nest)
	...                          <-- an IRQ/NMI can happen here
	rb->user_page->data_head = head;
	...

In this case, a value A is written to data_head in the IRQ, then a value
B is written to data_head after the IRQ. And A > B. As a result,
data_head is temporarily decreased from A to B. And a reader may see
data_head < data_tail if it read the buffer frequently enough, which
creates unexpected behaviors.

This can be fixed by moving dec(&rb->nest) to after updating data_head,
which prevents the IRQ/NMI above from updating data_head.

[ Split up by peterz. ]

Signed-off-by: Yabin Cui <yabinc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: mark.rutland@arm.com
Fixes: ef60777c9abd ("perf: Optimize the perf_output() path by removing IRQ-disables")
Link: http://lkml.kernel.org/r/20190517115418.224478157@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/events/ring_buffer.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 674b35383491..009467a60578 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -51,11 +51,18 @@ again:
 	head = local_read(&rb->head);
 
 	/*
-	 * IRQ/NMI can happen here, which means we can miss a head update.
+	 * IRQ/NMI can happen here and advance @rb->head, causing our
+	 * load above to be stale.
 	 */
 
-	if (!local_dec_and_test(&rb->nest))
+	/*
+	 * If this isn't the outermost nesting, we don't have to update
+	 * @rb->user_page->data_head.
+	 */
+	if (local_read(&rb->nest) > 1) {
+		local_dec(&rb->nest);
 		goto out;
+	}
 
 	/*
 	 * Since the mmap() consumer (userspace) can run on a different CPU:
@@ -87,9 +94,18 @@ again:
 	rb->user_page->data_head = head;
 
 	/*
-	 * Now check if we missed an update -- rely on previous implied
-	 * compiler barriers to force a re-read.
+	 * We must publish the head before decrementing the nest count,
+	 * otherwise an IRQ/NMI can publish a more recent head value and our
+	 * write will (temporarily) publish a stale value.
+	 */
+	barrier();
+	local_set(&rb->nest, 0);
+
+	/*
+	 * Ensure we decrement @rb->nest before we validate the @rb->head.
+	 * Otherwise we cannot be sure we caught the 'last' nested update.
 	 */
+	barrier();
 	if (unlikely(head != local_read(&rb->head))) {
 		local_inc(&rb->nest);
 		goto again;

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [tip:perf/urgent] perf/ring_buffer: Add ordering to rb->nest increment
  2019-05-17 11:52 ` [PATCH 2/4] perf/ring_buffer: Add ordering to rb->nest increment Peter Zijlstra
@ 2019-05-24  8:08   ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 12+ messages in thread
From: tip-bot for Peter Zijlstra @ 2019-05-24  8:08 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: yabinc, eranian, mingo, jolsa, linux-kernel, acme, tglx,
	vincent.weaver, peterz, torvalds, hpa, alexander.shishkin

Commit-ID:  3f9fbe9bd86c534eba2faf5d840fd44c6049f50e
Gitweb:     https://git.kernel.org/tip/3f9fbe9bd86c534eba2faf5d840fd44c6049f50e
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Fri, 17 May 2019 13:52:32 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 24 May 2019 09:00:10 +0200

perf/ring_buffer: Add ordering to rb->nest increment

Similar to how decrementing rb->next too early can cause data_head to
(temporarily) be observed to go backward, so too can this happen when
we increment too late.

This barrier() ensures the rb->head load happens after the increment,
both the one in the 'goto again' path, as the one from
perf_output_get_handle() -- albeit very unlikely to matter for the
latter.

Suggested-by: Yabin Cui <yabinc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Cc: mark.rutland@arm.com
Cc: namhyung@kernel.org
Fixes: ef60777c9abd ("perf: Optimize the perf_output() path by removing IRQ-disables")
Link: http://lkml.kernel.org/r/20190517115418.309516009@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/events/ring_buffer.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 009467a60578..4b5f8d932400 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -48,6 +48,15 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
 	unsigned long head;
 
 again:
+	/*
+	 * In order to avoid publishing a head value that goes backwards,
+	 * we must ensure the load of @rb->head happens after we've
+	 * incremented @rb->nest.
+	 *
+	 * Otherwise we can observe a @rb->head value before one published
+	 * by an IRQ/NMI happening between the load and the increment.
+	 */
+	barrier();
 	head = local_read(&rb->head);
 
 	/*

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [tip:perf/urgent] perf/ring-buffer: Always use {READ,WRITE}_ONCE() for rb->user_page data
  2019-05-17 11:52 ` [PATCH 3/4] perf/ring-buffer: Always use {READ,WRITE}_ONCE() for rb->user_page data Peter Zijlstra
@ 2019-05-24  8:09   ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 12+ messages in thread
From: tip-bot for Peter Zijlstra @ 2019-05-24  8:09 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: alexander.shishkin, eranian, tglx, torvalds, linux-kernel, jolsa,
	vincent.weaver, yabinc, acme, mingo, hpa, peterz

Commit-ID:  4d839dd9e4356bbacf3eb0ab13a549b83b008c21
Gitweb:     https://git.kernel.org/tip/4d839dd9e4356bbacf3eb0ab13a549b83b008c21
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Fri, 17 May 2019 13:52:33 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 24 May 2019 09:00:11 +0200

perf/ring-buffer: Always use {READ,WRITE}_ONCE() for rb->user_page data

We must use {READ,WRITE}_ONCE() on rb->user_page data such that
concurrent usage will see whole values. A few key sites were missing
this.

Suggested-by: Yabin Cui <yabinc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Cc: mark.rutland@arm.com
Cc: namhyung@kernel.org
Fixes: 7b732a750477 ("perf_counter: new output ABI - part 1")
Link: http://lkml.kernel.org/r/20190517115418.394192145@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/events/ring_buffer.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 4b5f8d932400..7a0c73e4b3eb 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -100,7 +100,7 @@ again:
 	 * See perf_output_begin().
 	 */
 	smp_wmb(); /* B, matches C */
-	rb->user_page->data_head = head;
+	WRITE_ONCE(rb->user_page->data_head, head);
 
 	/*
 	 * We must publish the head before decrementing the nest count,
@@ -496,7 +496,7 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size)
 		perf_event_aux_event(handle->event, aux_head, size,
 				     handle->aux_flags);
 
-	rb->user_page->aux_head = rb->aux_head;
+	WRITE_ONCE(rb->user_page->aux_head, rb->aux_head);
 	if (rb_need_aux_wakeup(rb))
 		wakeup = true;
 
@@ -528,7 +528,7 @@ int perf_aux_output_skip(struct perf_output_handle *handle, unsigned long size)
 
 	rb->aux_head += size;
 
-	rb->user_page->aux_head = rb->aux_head;
+	WRITE_ONCE(rb->user_page->aux_head, rb->aux_head);
 	if (rb_need_aux_wakeup(rb)) {
 		perf_output_wakeup(handle);
 		handle->wakeup = rb->aux_wakeup + rb->aux_watermark;

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [tip:perf/urgent] perf/ring-buffer: Use regular variables for nesting
  2019-05-17 11:52 ` [PATCH 4/4] perf/ring-buffer: Use regular variables for nesting Peter Zijlstra
@ 2019-05-24  8:10   ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 12+ messages in thread
From: tip-bot for Peter Zijlstra @ 2019-05-24  8:10 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, vincent.weaver, alexander.shishkin, eranian, linux-kernel,
	mingo, torvalds, peterz, jolsa, tglx, acme

Commit-ID:  5322ea58a06da2e69c5ef36a9b4d4b9255edd423
Gitweb:     https://git.kernel.org/tip/5322ea58a06da2e69c5ef36a9b4d4b9255edd423
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Fri, 17 May 2019 13:52:34 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 24 May 2019 09:00:11 +0200

perf/ring-buffer: Use regular variables for nesting

While the IRQ/NMI will nest, the nest-count will be invariant over the
actual exception, since it will decrement equal to increment.

This means we can -- carefully -- use a regular variable since the
typical LOAD-STORE race doesn't exist (similar to preempt_count).

This optimizes the ring-buffer for all LOAD-STORE architectures, since
they need to use atomic ops to implement local_t.

Suggested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Cc: mark.rutland@arm.com
Cc: namhyung@kernel.org
Cc: yabinc@google.com
Link: http://lkml.kernel.org/r/20190517115418.481392777@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/events/internal.h    |  4 ++--
 kernel/events/ring_buffer.c | 41 ++++++++++++++++++++++++++---------------
 2 files changed, 28 insertions(+), 17 deletions(-)

diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 79c47076700a..3aef4191798c 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -24,7 +24,7 @@ struct ring_buffer {
 	atomic_t			poll;		/* POLL_ for wakeups */
 
 	local_t				head;		/* write position    */
-	local_t				nest;		/* nested writers    */
+	unsigned int			nest;		/* nested writers    */
 	local_t				events;		/* event limit       */
 	local_t				wakeup;		/* wakeup stamp      */
 	local_t				lost;		/* nr records lost   */
@@ -41,7 +41,7 @@ struct ring_buffer {
 
 	/* AUX area */
 	long				aux_head;
-	local_t				aux_nest;
+	unsigned int			aux_nest;
 	long				aux_wakeup;	/* last aux_watermark boundary crossed by aux_head */
 	unsigned long			aux_pgoff;
 	int				aux_nr_pages;
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 7a0c73e4b3eb..ffb59a4ef4ff 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -38,7 +38,12 @@ static void perf_output_get_handle(struct perf_output_handle *handle)
 	struct ring_buffer *rb = handle->rb;
 
 	preempt_disable();
-	local_inc(&rb->nest);
+
+	/*
+	 * Avoid an explicit LOAD/STORE such that architectures with memops
+	 * can use them.
+	 */
+	(*(volatile unsigned int *)&rb->nest)++;
 	handle->wakeup = local_read(&rb->wakeup);
 }
 
@@ -46,6 +51,17 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
 {
 	struct ring_buffer *rb = handle->rb;
 	unsigned long head;
+	unsigned int nest;
+
+	/*
+	 * If this isn't the outermost nesting, we don't have to update
+	 * @rb->user_page->data_head.
+	 */
+	nest = READ_ONCE(rb->nest);
+	if (nest > 1) {
+		WRITE_ONCE(rb->nest, nest - 1);
+		goto out;
+	}
 
 again:
 	/*
@@ -64,15 +80,6 @@ again:
 	 * load above to be stale.
 	 */
 
-	/*
-	 * If this isn't the outermost nesting, we don't have to update
-	 * @rb->user_page->data_head.
-	 */
-	if (local_read(&rb->nest) > 1) {
-		local_dec(&rb->nest);
-		goto out;
-	}
-
 	/*
 	 * Since the mmap() consumer (userspace) can run on a different CPU:
 	 *
@@ -108,7 +115,7 @@ again:
 	 * write will (temporarily) publish a stale value.
 	 */
 	barrier();
-	local_set(&rb->nest, 0);
+	WRITE_ONCE(rb->nest, 0);
 
 	/*
 	 * Ensure we decrement @rb->nest before we validate the @rb->head.
@@ -116,7 +123,7 @@ again:
 	 */
 	barrier();
 	if (unlikely(head != local_read(&rb->head))) {
-		local_inc(&rb->nest);
+		WRITE_ONCE(rb->nest, 1);
 		goto again;
 	}
 
@@ -355,6 +362,7 @@ void *perf_aux_output_begin(struct perf_output_handle *handle,
 	struct perf_event *output_event = event;
 	unsigned long aux_head, aux_tail;
 	struct ring_buffer *rb;
+	unsigned int nest;
 
 	if (output_event->parent)
 		output_event = output_event->parent;
@@ -385,13 +393,16 @@ void *perf_aux_output_begin(struct perf_output_handle *handle,
 	if (!refcount_inc_not_zero(&rb->aux_refcount))
 		goto err;
 
+	nest = READ_ONCE(rb->aux_nest);
 	/*
 	 * Nesting is not supported for AUX area, make sure nested
 	 * writers are caught early
 	 */
-	if (WARN_ON_ONCE(local_xchg(&rb->aux_nest, 1)))
+	if (WARN_ON_ONCE(nest))
 		goto err_put;
 
+	WRITE_ONCE(rb->aux_nest, nest + 1);
+
 	aux_head = rb->aux_head;
 
 	handle->rb = rb;
@@ -419,7 +430,7 @@ void *perf_aux_output_begin(struct perf_output_handle *handle,
 		if (!handle->size) { /* A, matches D */
 			event->pending_disable = smp_processor_id();
 			perf_output_wakeup(handle);
-			local_set(&rb->aux_nest, 0);
+			WRITE_ONCE(rb->aux_nest, 0);
 			goto err_put;
 		}
 	}
@@ -508,7 +519,7 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size)
 
 	handle->event = NULL;
 
-	local_set(&rb->aux_nest, 0);
+	WRITE_ONCE(rb->aux_nest, 0);
 	/* can't be last */
 	rb_free_aux(rb);
 	ring_buffer_put(rb);

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2019-05-24  8:11 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-17 11:52 [PATCH 0/4] perf ring-buffer fixes Peter Zijlstra
2019-05-17 11:52 ` [PATCH 1/4] perf/ring_buffer: Fix exposing a temporarily decreased data_head Peter Zijlstra
2019-05-17 13:05   ` Ingo Molnar
2019-05-17 14:26     ` Peter Zijlstra
2019-05-24  7:01       ` Ingo Molnar
2019-05-24  8:08   ` [tip:perf/urgent] " tip-bot for Yabin Cui
2019-05-17 11:52 ` [PATCH 2/4] perf/ring_buffer: Add ordering to rb->nest increment Peter Zijlstra
2019-05-24  8:08   ` [tip:perf/urgent] " tip-bot for Peter Zijlstra
2019-05-17 11:52 ` [PATCH 3/4] perf/ring-buffer: Always use {READ,WRITE}_ONCE() for rb->user_page data Peter Zijlstra
2019-05-24  8:09   ` [tip:perf/urgent] " tip-bot for Peter Zijlstra
2019-05-17 11:52 ` [PATCH 4/4] perf/ring-buffer: Use regular variables for nesting Peter Zijlstra
2019-05-24  8:10   ` [tip:perf/urgent] " tip-bot for Peter Zijlstra

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.