LKML Archive on lore.kernel.org
* [PATCH] perf: Prevent recursion in ring buffer
@ 2018-09-12 19:33 Jiri Olsa
  2018-09-13  7:07 ` Peter Zijlstra
  2018-09-13  7:40 ` [PATCH] perf: Prevent recursion in ring buffer Peter Zijlstra
  0 siblings, 2 replies; 9+ messages in thread
From: Jiri Olsa @ 2018-09-12 19:33 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: lkml, Ingo Molnar, Namhyung Kim, Alexander Shishkin,
	Peter Zijlstra, Andi Kleen

Some of the scheduling tracepoints allow the perf_tp_event
code to write to the ring buffer from a different CPU than
the one the code is running on.

This results in corrupted ring buffer data, demonstrated by
the following perf commands:

  # perf record -e 'sched:sched_switch,sched:sched_wakeup' perf bench sched messaging
  # Running 'sched/messaging' benchmark:
  # 20 sender and receiver processes per group
  # 10 groups == 400 processes run

       Total time: 0.383 [sec]
  [ perf record: Woken up 8 times to write data ]
  0x42b890 [0]: failed to process type: -1765585640
  [ perf record: Captured and wrote 4.825 MB perf.data (29669 samples) ]

  # perf report --stdio
  0x42b890 [0]: failed to process type: -1765585640

The reason for the corruption is that some of the scheduling
tracepoints have __perf_task defined and thus allow storing
data to another CPU's ring buffer:

  sched_waking
  sched_wakeup
  sched_wakeup_new
  sched_stat_wait
  sched_stat_sleep
  sched_stat_iowait
  sched_stat_blocked

The perf_tp_event function first stores samples for the
current CPU's events defined for the tracepoint:

    hlist_for_each_entry_rcu(event, head, hlist_entry)
      perf_swevent_event(event, count, &data, regs);

It then iterates the 'task' events and stores the sample
for any of the task's events that pass the tracepoint checks:

  ctx = rcu_dereference(task->perf_event_ctxp[perf_sw_context]);

  list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
    if (event->attr.type != PERF_TYPE_TRACEPOINT)
      continue;
    if (event->attr.config != entry->type)
      continue;

    perf_swevent_event(event, count, &data, regs);
  }

The above code can race with the same code running on another
CPU, ending up with two CPUs trying to store to the same ring
buffer, which is not handled at the moment.

This patch adds an atomic 'recursion' flag (any CPU could touch
the ring buffer at any moment) that guards entry to the storage
code. It is set in the __perf_output_begin function and released
in perf_output_put_handle. If the flag is already set, the code
increases the lost count and bails out.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 kernel/events/internal.h    | 1 +
 kernel/events/ring_buffer.c | 8 ++++++++
 2 files changed, 9 insertions(+)

diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 6dc725a7e7bc..82599da9723f 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -11,6 +11,7 @@
 
 struct ring_buffer {
 	atomic_t			refcount;
+	atomic_t			recursion;
 	struct rcu_head			rcu_head;
 #ifdef CONFIG_PERF_USE_VMALLOC
 	struct work_struct		work;
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 4a9937076331..0c976ac414c5 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -101,6 +101,7 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
 
 out:
 	preempt_enable();
+	atomic_set(&rb->recursion, 0);
 }
 
 static __always_inline bool
@@ -145,6 +146,12 @@ __perf_output_begin(struct perf_output_handle *handle,
 		goto out;
 	}
 
+	if (atomic_cmpxchg(&rb->recursion, 0, 1) != 0) {
+		if (rb->nr_pages)
+			local_inc(&rb->lost);
+		goto out;
+	}
+
 	handle->rb    = rb;
 	handle->event = event;
 
@@ -286,6 +293,7 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
 		rb->overwrite = 1;
 
 	atomic_set(&rb->refcount, 1);
+	atomic_set(&rb->recursion, 0);
 
 	INIT_LIST_HEAD(&rb->event_list);
 	spin_lock_init(&rb->event_lock);
-- 
2.17.1



* Re: [PATCH] perf: Prevent recursion in ring buffer
  2018-09-12 19:33 [PATCH] perf: Prevent recursion in ring buffer Jiri Olsa
@ 2018-09-13  7:07 ` Peter Zijlstra
  2018-09-13  7:41   ` Jiri Olsa
  2018-09-13  7:46   ` Jiri Olsa
  2018-09-13  7:40 ` [PATCH] perf: Prevent recursion in ring buffer Peter Zijlstra
  1 sibling, 2 replies; 9+ messages in thread
From: Peter Zijlstra @ 2018-09-13  7:07 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, lkml, Ingo Molnar, Namhyung Kim,
	Alexander Shishkin, Andi Kleen

On Wed, Sep 12, 2018 at 09:33:17PM +0200, Jiri Olsa wrote:
> Some of the scheduling tracepoints allow the perf_tp_event
> code to write to ring buffer under different cpu than the
> code is running on.

ARGH.. that is indeed borken.

> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index 4a9937076331..0c976ac414c5 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -101,6 +101,7 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
>  
>  out:
>  	preempt_enable();
> +	atomic_set(&rb->recursion, 0);
>  }
>  
>  static __always_inline bool
> @@ -145,6 +146,12 @@ __perf_output_begin(struct perf_output_handle *handle,
>  		goto out;
>  	}
>  
> +	if (atomic_cmpxchg(&rb->recursion, 0, 1) != 0) {
> +		if (rb->nr_pages)
> +			local_inc(&rb->lost);
> +		goto out;
> +	}
> +
>  	handle->rb    = rb;
>  	handle->event = event;
>  
> @@ -286,6 +293,7 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
>  		rb->overwrite = 1;
>  
>  	atomic_set(&rb->refcount, 1);
> +	atomic_set(&rb->recursion, 0);
>  
>  	INIT_LIST_HEAD(&rb->event_list);
>  	spin_lock_init(&rb->event_lock);

That's not a recursion count, that's a test-and-set spinlock, and you
got the memory ordering wrong for that.

Also, we tried very hard to avoid atomic ops in the ring-buffer and you
just wrecked that. Worse, you wrecked previously working interrupt
nesting output.
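For reference, the acquire/release pairing Peter alludes to can be sketched in userspace C11 (a model of the ordering requirement only, not a proposed kernel patch; the kernel spelling would differ):

```c
/* An ordering-correct test-and-set flag: the acquire on lock and the
 * release on unlock are what keep the holder's stores confined to the
 * critical section.  A plain store on unlock, as in the patch, gives
 * no such guarantee. */
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int flag;

static bool rb_trylock(void)
{
	int expected = 0;

	/* acquire on success: critical-section accesses cannot float above */
	return atomic_compare_exchange_strong_explicit(&flag, &expected, 1,
						       memory_order_acquire,
						       memory_order_relaxed);
}

static void rb_unlock(void)
{
	/* release: earlier stores become visible to the next lock holder */
	atomic_store_explicit(&flag, 0, memory_order_release);
}
```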

Let me have a look at this.



* Re: [PATCH] perf: Prevent recursion in ring buffer
  2018-09-12 19:33 [PATCH] perf: Prevent recursion in ring buffer Jiri Olsa
  2018-09-13  7:07 ` Peter Zijlstra
@ 2018-09-13  7:40 ` Peter Zijlstra
  2018-09-13  7:53   ` Jiri Olsa
  1 sibling, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2018-09-13  7:40 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, lkml, Ingo Molnar, Namhyung Kim,
	Alexander Shishkin, Andi Kleen

On Wed, Sep 12, 2018 at 09:33:17PM +0200, Jiri Olsa wrote:

>   # perf record -e 'sched:sched_switch,sched:sched_wakeup' perf bench sched messaging

> The reason for the corruptions are some of the scheduling tracepoints,
> that have __perf_task defined and thus allow to store data to another
> cpu ring buffer:
> 
>   sched_waking
>   sched_wakeup
>   sched_wakeup_new
>   sched_stat_wait
>   sched_stat_sleep
>   sched_stat_iowait
>   sched_stat_blocked

> And then iterates events of the 'task' and store the sample
> for any task's event that passes tracepoint checks:
> 
>   ctx = rcu_dereference(task->perf_event_ctxp[perf_sw_context]);
> 
>   list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
>     if (event->attr.type != PERF_TYPE_TRACEPOINT)
>       continue;
>     if (event->attr.config != entry->type)
>       continue;
> 
>     perf_swevent_event(event, count, &data, regs);
>   }
> 
> Above code can race with same code running on another cpu,
> ending up with 2 cpus trying to store under the same ring
> buffer, which is not handled at the moment.

It can yes, however the only way I can see this breaking is if we use
!inherited events with a strict per-task buffer, but your record command
doesn't use that.

Now, your test-case uses inherited events, which would all share the
buffer, however IIRC inherited events require per-task-per-cpu buffers,
because there is already no guarantee the various tasks run on the same
CPU in the first place.

This means we _should_ write to the @task's local CPU buffer, and that
would work again.

Let me try and figure out where this is going wrong.


* Re: [PATCH] perf: Prevent recursion in ring buffer
  2018-09-13  7:07 ` Peter Zijlstra
@ 2018-09-13  7:41   ` Jiri Olsa
  2018-09-13  7:46   ` Jiri Olsa
  1 sibling, 0 replies; 9+ messages in thread
From: Jiri Olsa @ 2018-09-13  7:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, lkml, Ingo Molnar,
	Namhyung Kim, Alexander Shishkin, Andi Kleen

On Thu, Sep 13, 2018 at 09:07:40AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 12, 2018 at 09:33:17PM +0200, Jiri Olsa wrote:
> > Some of the scheduling tracepoints allow the perf_tp_event
> > code to write to ring buffer under different cpu than the
> > code is running on.
> 
> ARGH.. that is indeed borken.
> 
> > diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> > index 4a9937076331..0c976ac414c5 100644
> > --- a/kernel/events/ring_buffer.c
> > +++ b/kernel/events/ring_buffer.c
> > @@ -101,6 +101,7 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
> >  
> >  out:
> >  	preempt_enable();
> > +	atomic_set(&rb->recursion, 0);
> >  }
> >  
> >  static __always_inline bool
> > @@ -145,6 +146,12 @@ __perf_output_begin(struct perf_output_handle *handle,
> >  		goto out;
> >  	}
> >  
> > +	if (atomic_cmpxchg(&rb->recursion, 0, 1) != 0) {
> > +		if (rb->nr_pages)
> > +			local_inc(&rb->lost);
> > +		goto out;
> > +	}
> > +
> >  	handle->rb    = rb;
> >  	handle->event = event;
> >  
> > @@ -286,6 +293,7 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
> >  		rb->overwrite = 1;
> >  
> >  	atomic_set(&rb->refcount, 1);
> > +	atomic_set(&rb->recursion, 0);
> >  
> >  	INIT_LIST_HEAD(&rb->event_list);
> >  	spin_lock_init(&rb->event_lock);
> 
> That's not a recursion count, that's a test-and-set spinlock, and you
> got the memory ordering wrong for that.
> 
> Also, we tried very hard to avoid atomic ops in the ring-buffer and you
> just wrecked that. Worse, you wrecked previously working interrupt
> nesting output.

ah.. the interrupt will also bail out now.. right :-\

> 
> Let me have a look at this.
> 


* Re: [PATCH] perf: Prevent recursion in ring buffer
  2018-09-13  7:07 ` Peter Zijlstra
  2018-09-13  7:41   ` Jiri Olsa
@ 2018-09-13  7:46   ` Jiri Olsa
  2018-09-13  9:37     ` Peter Zijlstra
  1 sibling, 1 reply; 9+ messages in thread
From: Jiri Olsa @ 2018-09-13  7:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, lkml, Ingo Molnar,
	Namhyung Kim, Alexander Shishkin, Andi Kleen

On Thu, Sep 13, 2018 at 09:07:40AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 12, 2018 at 09:33:17PM +0200, Jiri Olsa wrote:
> > Some of the scheduling tracepoints allow the perf_tp_event
> > code to write to ring buffer under different cpu than the
> > code is running on.
> 
> ARGH.. that is indeed borken.
> 
> > diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> > index 4a9937076331..0c976ac414c5 100644
> > --- a/kernel/events/ring_buffer.c
> > +++ b/kernel/events/ring_buffer.c
> > @@ -101,6 +101,7 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
> >  
> >  out:
> >  	preempt_enable();
> > +	atomic_set(&rb->recursion, 0);
> >  }
> >  
> >  static __always_inline bool
> > @@ -145,6 +146,12 @@ __perf_output_begin(struct perf_output_handle *handle,
> >  		goto out;
> >  	}
> >  
> > +	if (atomic_cmpxchg(&rb->recursion, 0, 1) != 0) {
> > +		if (rb->nr_pages)
> > +			local_inc(&rb->lost);
> > +		goto out;
> > +	}
> > +
> >  	handle->rb    = rb;
> >  	handle->event = event;
> >  
> > @@ -286,6 +293,7 @@ ring_buffer_init(struct ring_buffer *rb, long watermark, int flags)
> >  		rb->overwrite = 1;
> >  
> >  	atomic_set(&rb->refcount, 1);
> > +	atomic_set(&rb->recursion, 0);
> >  
> >  	INIT_LIST_HEAD(&rb->event_list);
> >  	spin_lock_init(&rb->event_lock);
> 
> That's not a recursion count, that's a test-and-set spinlock, and you
> got the memory ordering wrong for that.
> 
> Also, we tried very hard to avoid atomic ops in the ring-buffer and you
> just wrecked that. Worse, you wrecked previously working interrupt
> nesting output.
> 
> Let me have a look at this.

I was first thinking to just leave it on the current cpu,
but not sure current users would be ok with that ;-)

jirka


---
diff --git a/kernel/events/core.c b/kernel/events/core.c
index abaed4f8bb7f..9b534a2ecf17 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8308,6 +8308,8 @@ void perf_tp_event(u16 event_type, u64 count, void *record, int entry_size,
 				continue;
 			if (event->attr.config != entry->type)
 				continue;
+			if (event->cpu != smp_processor_id())
+				continue;
 			if (perf_tp_event_match(event, &data, regs))
 				perf_swevent_event(event, count, &data, regs);
 		}


* Re: [PATCH] perf: Prevent recursion in ring buffer
  2018-09-13  7:40 ` [PATCH] perf: Prevent recursion in ring buffer Peter Zijlstra
@ 2018-09-13  7:53   ` Jiri Olsa
  0 siblings, 0 replies; 9+ messages in thread
From: Jiri Olsa @ 2018-09-13  7:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, lkml, Ingo Molnar,
	Namhyung Kim, Alexander Shishkin, Andi Kleen

On Thu, Sep 13, 2018 at 09:40:42AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 12, 2018 at 09:33:17PM +0200, Jiri Olsa wrote:
> 
> >   # perf record -e 'sched:sched_switch,sched:sched_wakeup' perf bench sched messaging
> 
> > The reason for the corruptions are some of the scheduling tracepoints,
> > that have __perf_task defined and thus allow to store data to another
> > cpu ring buffer:
> > 
> >   sched_waking
> >   sched_wakeup
> >   sched_wakeup_new
> >   sched_stat_wait
> >   sched_stat_sleep
> >   sched_stat_iowait
> >   sched_stat_blocked
> 
> > And then iterates events of the 'task' and store the sample
> > for any task's event that passes tracepoint checks:
> > 
> >   ctx = rcu_dereference(task->perf_event_ctxp[perf_sw_context]);
> > 
> >   list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
> >     if (event->attr.type != PERF_TYPE_TRACEPOINT)
> >       continue;
> >     if (event->attr.config != entry->type)
> >       continue;
> > 
> >     perf_swevent_event(event, count, &data, regs);
> >   }
> > 
> > Above code can race with same code running on another cpu,
> > ending up with 2 cpus trying to store under the same ring
> > buffer, which is not handled at the moment.
> 
> It can yes, however the only way I can see this breaking is if we use
> !inherited events with a strict per-task buffer, but your record command
> doesn't use that.
> 
> Now, your test-case uses inherited events, which would all share the
> buffer, however IIRC inherited events require per-task-per-cpu buffers,

that's what perf record always does when monitoring a task.. there's
an event/rb for each cpu for the given task

and all the task's events (sched:*) on a given cpu share that single
cpu ring buffer via PERF_EVENT_IOC_SET_OUTPUT

> because there is already no guarantee the various tasks run on the same
> CPU in the first place.
> 
> This means we _should_ write to the @task's local CPU buffer, and that
> would work again.
> 
> Let me try and figure out where this is going wrong.


* Re: [PATCH] perf: Prevent recursion in ring buffer
  2018-09-13  7:46   ` Jiri Olsa
@ 2018-09-13  9:37     ` Peter Zijlstra
  2018-09-23 16:13       ` [PATCHv2] perf: Prevent concurent ring buffer access Jiri Olsa
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2018-09-13  9:37 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, lkml, Ingo Molnar,
	Namhyung Kim, Alexander Shishkin, Andi Kleen

On Thu, Sep 13, 2018 at 09:46:07AM +0200, Jiri Olsa wrote:
> On Thu, Sep 13, 2018 at 09:07:40AM +0200, Peter Zijlstra wrote:
> > On Wed, Sep 12, 2018 at 09:33:17PM +0200, Jiri Olsa wrote:
> > > Some of the scheduling tracepoints allow the perf_tp_event
> > > code to write to ring buffer under different cpu than the
> > > code is running on.
> > 
> > ARGH.. that is indeed borken.

> I was first thinking to just leave it on the current cpu,
> but not sure current users would be ok with that ;-)

> ---
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index abaed4f8bb7f..9b534a2ecf17 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -8308,6 +8308,8 @@ void perf_tp_event(u16 event_type, u64 count, void *record, int entry_size,
>  				continue;
>  			if (event->attr.config != entry->type)
>  				continue;
> +			if (event->cpu != smp_processor_id())
> +				continue;
>  			if (perf_tp_event_match(event, &data, regs))
>  				perf_swevent_event(event, count, &data, regs);
>  		}

That might indeed be the best we can do.

So the whole TP muck would be responsible for placing only matching
events on the hlist, which is where our normal CPU filter is I think.

The above then does the same for @task. Which without this would also be
getting nr_cpus copies of the event I think.

It does mean not getting any events if the @task only has a per-task
buffer, but there's nothing to be done about that. And I'm not even sure
we can create a useful warning for that :/


* [PATCHv2] perf: Prevent concurent ring buffer access
  2018-09-13  9:37     ` Peter Zijlstra
@ 2018-09-23 16:13       ` Jiri Olsa
  2018-10-02 10:01         ` [tip:perf/core] perf/ring_buffer: " tip-bot for Jiri Olsa
  0 siblings, 1 reply; 9+ messages in thread
From: Jiri Olsa @ 2018-09-23 16:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, lkml, Ingo Molnar,
	Namhyung Kim, Alexander Shishkin, Andi Kleen, Andrew Vagin

On Thu, Sep 13, 2018 at 11:37:54AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 13, 2018 at 09:46:07AM +0200, Jiri Olsa wrote:
> > On Thu, Sep 13, 2018 at 09:07:40AM +0200, Peter Zijlstra wrote:
> > > On Wed, Sep 12, 2018 at 09:33:17PM +0200, Jiri Olsa wrote:
> > > > Some of the scheduling tracepoints allow the perf_tp_event
> > > > code to write to ring buffer under different cpu than the
> > > > code is running on.
> > > 
> > > ARGH.. that is indeed borken.
> 
> > I was first thinking to just leave it on the current cpu,
> > but not sure current users would be ok with that ;-)
> 
> > ---
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index abaed4f8bb7f..9b534a2ecf17 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -8308,6 +8308,8 @@ void perf_tp_event(u16 event_type, u64 count, void *record, int entry_size,
> >  				continue;
> >  			if (event->attr.config != entry->type)
> >  				continue;
> > +			if (event->cpu != smp_processor_id())
> > +				continue;
> >  			if (perf_tp_event_match(event, &data, regs))
> >  				perf_swevent_event(event, count, &data, regs);
> >  		}
> 
> That might indeed be the best we can do.
> 
> So the whole TP muck would be responsible for placing only matching
> events on the hlist, which is where our normal CPU filter is I think.
> 
> The above then does the same for @task. Which without this would also be
> getting nr_cpus copies of the event I think.
> 
> It does mean not getting any events if the @task only has a per-task
> buffer, but there's nothing to be done about that. And I'm not even sure
> we can create a useful warning for that :/

ok, sending full patch (v2) with above change

cc-ing Andrew Vagin who added this feature,
because this patch changes the way it works

thanks,
jirka


---
Some of the scheduling tracepoints allow the perf_tp_event
code to write to the ring buffer from a different CPU than
the one the code is running on.

This results in corrupted ring buffer data, demonstrated by
the following perf commands:

  # perf record -e 'sched:sched_switch,sched:sched_wakeup' perf bench sched messaging
  # Running 'sched/messaging' benchmark:
  # 20 sender and receiver processes per group
  # 10 groups == 400 processes run

       Total time: 0.383 [sec]
  [ perf record: Woken up 8 times to write data ]
  0x42b890 [0]: failed to process type: -1765585640
  [ perf record: Captured and wrote 4.825 MB perf.data (29669 samples) ]

  # perf report --stdio
  0x42b890 [0]: failed to process type: -1765585640

The reason for the corruption is that some of the scheduling
tracepoints have __perf_task defined and thus allow storing
data to another CPU's ring buffer:

  sched_waking
  sched_wakeup
  sched_wakeup_new
  sched_stat_wait
  sched_stat_sleep
  sched_stat_iowait
  sched_stat_blocked

The perf_tp_event function first stores samples for the
current CPU's events defined for the tracepoint:

    hlist_for_each_entry_rcu(event, head, hlist_entry)
      perf_swevent_event(event, count, &data, regs);

It then iterates the 'task' events and stores the sample
for any of the task's events that pass the tracepoint checks:

  ctx = rcu_dereference(task->perf_event_ctxp[perf_sw_context]);

  list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
    if (event->attr.type != PERF_TYPE_TRACEPOINT)
      continue;
    if (event->attr.config != entry->type)
      continue;

    perf_swevent_event(event, count, &data, regs);
  }

The above code can race with the same code running on another
CPU, ending up with two CPUs trying to store to the same ring
buffer, which is not handled at the moment.

This patch prevents the race by allowing only events on the
current CPU to receive the sample.

Fixes: e6dab5ffab59 ("perf/trace: Add ability to set a target task for events")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 kernel/events/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index c80549bf82c6..f269f666510c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8308,6 +8308,8 @@ void perf_tp_event(u16 event_type, u64 count, void *record, int entry_size,
 			goto unlock;
 
 		list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
+			if (event->cpu != smp_processor_id())
+				continue;
 			if (event->attr.type != PERF_TYPE_TRACEPOINT)
 				continue;
 			if (event->attr.config != entry->type)
-- 
2.17.1



* [tip:perf/core] perf/ring_buffer: Prevent concurent ring buffer access
  2018-09-23 16:13       ` [PATCHv2] perf: Prevent concurent ring buffer access Jiri Olsa
@ 2018-10-02 10:01         ` " tip-bot for Jiri Olsa
  0 siblings, 0 replies; 9+ messages in thread
From: tip-bot for Jiri Olsa @ 2018-10-02 10:01 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: alexander.shishkin, torvalds, peterz, jolsa, hpa, tglx, namhyung,
	linux-kernel, acme, avagin, jolsa, vincent.weaver, acme, mingo,
	eranian

Commit-ID:  cd6fb677ce7e460c25bdd66f689734102ec7d642
Gitweb:     https://git.kernel.org/tip/cd6fb677ce7e460c25bdd66f689734102ec7d642
Author:     Jiri Olsa <jolsa@redhat.com>
AuthorDate: Sun, 23 Sep 2018 18:13:43 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 2 Oct 2018 09:37:59 +0200

perf/ring_buffer: Prevent concurent ring buffer access

Some of the scheduling tracepoints allow the perf_tp_event
code to write to the ring buffer from a different CPU than
the one the code is running on.

This results in corrupted ring buffer data, demonstrated by
the following perf commands:

  # perf record -e 'sched:sched_switch,sched:sched_wakeup' perf bench sched messaging
  # Running 'sched/messaging' benchmark:
  # 20 sender and receiver processes per group
  # 10 groups == 400 processes run

       Total time: 0.383 [sec]
  [ perf record: Woken up 8 times to write data ]
  0x42b890 [0]: failed to process type: -1765585640
  [ perf record: Captured and wrote 4.825 MB perf.data (29669 samples) ]

  # perf report --stdio
  0x42b890 [0]: failed to process type: -1765585640

The reason for the corruption is that some of the scheduling
tracepoints have __perf_task defined and thus allow storing
data to another CPU's ring buffer:

  sched_waking
  sched_wakeup
  sched_wakeup_new
  sched_stat_wait
  sched_stat_sleep
  sched_stat_iowait
  sched_stat_blocked

The perf_tp_event function first stores samples for the
current CPU's events defined for the tracepoint:

    hlist_for_each_entry_rcu(event, head, hlist_entry)
      perf_swevent_event(event, count, &data, regs);

It then iterates the 'task' events and stores the sample
for any of the task's events that pass the tracepoint checks:

  ctx = rcu_dereference(task->perf_event_ctxp[perf_sw_context]);

  list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
    if (event->attr.type != PERF_TYPE_TRACEPOINT)
      continue;
    if (event->attr.config != entry->type)
      continue;

    perf_swevent_event(event, count, &data, regs);
  }

The above code can race with the same code running on another
CPU, ending up with two CPUs trying to store to the same ring
buffer, which is specifically not allowed.

This patch prevents the problem by allowing only events on the
current CPU to receive the sample.

NOTE: this requires the use of (per-task-)per-cpu buffers for this
feature to work; perf-record does this.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
[peterz: small edits to Changelog]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: e6dab5ffab59 ("perf/trace: Add ability to set a target task for events")
Link: http://lkml.kernel.org/r/20180923161343.GB15054@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/events/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index dfb1d951789e..5a97f34bc14c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8314,6 +8314,8 @@ void perf_tp_event(u16 event_type, u64 count, void *record, int entry_size,
 			goto unlock;
 
 		list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
+			if (event->cpu != smp_processor_id())
+				continue;
 			if (event->attr.type != PERF_TYPE_TRACEPOINT)
 				continue;
 			if (event->attr.config != entry->type)

