* [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
@ 2014-03-07 13:38 Alexander Shishkin
  2014-03-13 19:58 ` Paul E. McKenney
From: Alexander Shishkin @ 2014-03-07 13:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Stephane Eranian, Andi Kleen, Alexander Shishkin,
	Paul McKenney

This is more of a problem description than an actual bugfix, but currently
ring_buffer_detach() can kick in while ring_buffer_wakeup() is traversing
the ring buffer's event list, leading to cpu stalls.

What this patch does is crude, but fixes the problem, which is: one rcu
grace period has to elapse between ring_buffer_detach() and subsequent
ring_buffer_attach(), otherwise either the attach will fail or the wakeup
will misbehave. Also, making it a call_rcu() callback will make it race
with attach().

Another solution that I see is to check for list_empty(&event->rb_entry)
before wake_up_all() in ring_buffer_wakeup() and restart the list
traversal if it is indeed empty, but that is ugly too as there will be
extra wakeups on some events.
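
Roughly something like this (a sketch only; the separate iterator
variable and the restart label are just for illustration, not actual
code):

	static void ring_buffer_wakeup(struct perf_event *event)
	{
		struct perf_event *iter;
		struct ring_buffer *rb;

	restart:
		rcu_read_lock();
		rb = rcu_dereference(event->rb);
		if (rb) {
			list_for_each_entry_rcu(iter, &rb->event_list, rb_entry) {
				/*
				 * A concurrent ring_buffer_detach() may have
				 * re-initialized iter->rb_entry, in which case
				 * we are walking a stale element; start over.
				 */
				if (list_empty(&iter->rb_entry)) {
					rcu_read_unlock();
					goto restart;
				}
				wake_up_all(&iter->waitq);
			}
		}
		rcu_read_unlock();
	}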

Anything that I'm missing here? Any better ideas?

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/events/core.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 661951a..bce41e0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3861,7 +3861,7 @@ static void ring_buffer_attach(struct perf_event *event,
 
 	spin_lock_irqsave(&rb->event_lock, flags);
 	if (list_empty(&event->rb_entry))
-		list_add(&event->rb_entry, &rb->event_list);
+		list_add_rcu(&event->rb_entry, &rb->event_list);
 	spin_unlock_irqrestore(&rb->event_lock, flags);
 }
 
@@ -3873,9 +3873,11 @@ static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb)
 		return;
 
 	spin_lock_irqsave(&rb->event_lock, flags);
-	list_del_init(&event->rb_entry);
+	list_del_rcu(&event->rb_entry);
 	wake_up_all(&event->waitq);
 	spin_unlock_irqrestore(&rb->event_lock, flags);
+	synchronize_rcu();
+	INIT_LIST_HEAD(&event->rb_entry);
 }
 
 static void ring_buffer_wakeup(struct perf_event *event)
-- 
1.9.0



* Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
  2014-03-07 13:38 [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup() Alexander Shishkin
@ 2014-03-13 19:58 ` Paul E. McKenney
  2014-03-14  9:50   ` Peter Zijlstra
From: Paul E. McKenney @ 2014-03-13 19:58 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Stephane Eranian, Andi Kleen

On Fri, Mar 07, 2014 at 03:38:46PM +0200, Alexander Shishkin wrote:
> This is more of a problem description than an actual bugfix, but currently
> ring_buffer_detach() can kick in while ring_buffer_wakeup() is traversing
> the ring buffer's event list, leading to cpu stalls.
> 
> What this patch does is crude, but fixes the problem, which is: one rcu
> grace period has to elapse between ring_buffer_detach() and subsequent
> ring_buffer_attach(), otherwise either the attach will fail or the wakeup
> will misbehave. Also, making it a call_rcu() callback will make it race
> with attach().
> 
> Another solution that I see is to check for list_empty(&event->rb_entry)
> before wake_up_all() in ring_buffer_wakeup() and restart the list
> traversal if it is indeed empty, but that is ugly too as there will be
> extra wakeups on some events.
> 
> Anything that I'm missing here? Any better ideas?

Not sure it qualifies as "better", but the caller of ring_buffer_detach()
is going to free the event anyway, so the synchronize_rcu() and the
INIT_LIST_HEAD() should not be needed in that case.  I am guessing that
the same is true for perf_mmap_close().

So that leaves the call in perf_event_set_output(), which detaches from an
old rb before attaching that same event to a new one.  So maybe have the
synchronize_rcu() and INIT_LIST_HEAD() instead be in the "if (old_rb)",
which might be a reasonably uncommon case?
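
Something like the following placement, perhaps (just a sketch, with
the surrounding perf_event_set_output() details elided):

	if (old_rb) {
		ring_buffer_detach(event, old_rb);
		/*
		 * Wait out any wakeup still walking old_rb's event_list
		 * before this event's rb_entry can be reused.
		 */
		synchronize_rcu();
		INIT_LIST_HEAD(&event->rb_entry);
	}

	if (rb)
		ring_buffer_attach(event, rb);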

							Thanx, Paul

> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
> ---
>  kernel/events/core.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 661951a..bce41e0 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3861,7 +3861,7 @@ static void ring_buffer_attach(struct perf_event *event,
> 
>  	spin_lock_irqsave(&rb->event_lock, flags);
>  	if (list_empty(&event->rb_entry))
> -		list_add(&event->rb_entry, &rb->event_list);
> +		list_add_rcu(&event->rb_entry, &rb->event_list);
>  	spin_unlock_irqrestore(&rb->event_lock, flags);
>  }
> 
> @@ -3873,9 +3873,11 @@ static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb)
>  		return;
> 
>  	spin_lock_irqsave(&rb->event_lock, flags);
> -	list_del_init(&event->rb_entry);
> +	list_del_rcu(&event->rb_entry);
>  	wake_up_all(&event->waitq);
>  	spin_unlock_irqrestore(&rb->event_lock, flags);
> +	synchronize_rcu();
> +	INIT_LIST_HEAD(&event->rb_entry);
>  }
> 
>  static void ring_buffer_wakeup(struct perf_event *event)
> -- 
> 1.9.0
> 



* Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
  2014-03-13 19:58 ` Paul E. McKenney
@ 2014-03-14  9:50   ` Peter Zijlstra
  2014-03-14 20:47     ` Paul E. McKenney
  2014-05-07 12:35     ` Peter Zijlstra
From: Peter Zijlstra @ 2014-03-14  9:50 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen

On Thu, Mar 13, 2014 at 12:58:16PM -0700, Paul E. McKenney wrote:
> On Fri, Mar 07, 2014 at 03:38:46PM +0200, Alexander Shishkin wrote:
> > This is more of a problem description than an actual bugfix, but currently
> > ring_buffer_detach() can kick in while ring_buffer_wakeup() is traversing
> > the ring buffer's event list, leading to cpu stalls.
> > 
> > What this patch does is crude, but fixes the problem, which is: one rcu
> > grace period has to elapse between ring_buffer_detach() and subsequent
> > ring_buffer_attach(), otherwise either the attach will fail or the wakeup
> > will misbehave. Also, making it a call_rcu() callback will make it race
> > with attach().
> > 
> > Another solution that I see is to check for list_empty(&event->rb_entry)
> > before wake_up_all() in ring_buffer_wakeup() and restart the list
> > traversal if it is indeed empty, but that is ugly too as there will be
> > extra wakeups on some events.
> > 
> > Anything that I'm missing here? Any better ideas?
> 
> Not sure it qualifies as "better", but the caller of ring_buffer_detach()
> is going to free the event anyway, so the synchronize_rcu() and the
> INIT_LIST_HEAD() should not be needed in that case.  I am guessing that
> the same is true for perf_mmap_close().
> 
> So that leaves the call in perf_event_set_output(), which detaches from an
> old rb before attaching that same event to a new one.  So maybe have the
> synchronize_rcu() and INIT_LIST_HEAD() instead be in the "if (old_rb)",
> which might be a reasonably uncommon case?

How about something like so that only does the sync_rcu() if really
needed.


---
 kernel/events/core.c     | 11 +++++++++--
 kernel/events/internal.h |  1 +
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 661951ab8ae7..88c8c810e081 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3856,12 +3856,17 @@ static void ring_buffer_attach(struct perf_event *event,
 {
 	unsigned long flags;
 
+	if (rb->rcu_batches == rcu_batches_completed()) {
+		synchronize_rcu();
+		INIT_LIST_HEAD(&event->rb_entry);
+	}
+
 	if (!list_empty(&event->rb_entry))
 		return;
 
 	spin_lock_irqsave(&rb->event_lock, flags);
 	if (list_empty(&event->rb_entry))
-		list_add(&event->rb_entry, &rb->event_list);
+		list_add_rcu(&event->rb_entry, &rb->event_list);
 	spin_unlock_irqrestore(&rb->event_lock, flags);
 }
 
@@ -3873,9 +3878,11 @@ static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb)
 		return;
 
 	spin_lock_irqsave(&rb->event_lock, flags);
-	list_del_init(&event->rb_entry);
+	list_del_rcu(&event->rb_entry);
 	wake_up_all(&event->waitq);
 	spin_unlock_irqrestore(&rb->event_lock, flags);
+
+	rb->rcu_batches = rcu_batches_completed();
 }
 
 static void ring_buffer_wakeup(struct perf_event *event)
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 569b218782ad..698b5881b2a4 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -30,6 +30,7 @@ struct ring_buffer {
 	/* poll crap */
 	spinlock_t			event_lock;
 	struct list_head		event_list;
+	unsigned long			rcu_batches;
 
 	atomic_t			mmap_count;
 	unsigned long			mmap_locked;


* Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
  2014-03-14  9:50   ` Peter Zijlstra
@ 2014-03-14 20:47     ` Paul E. McKenney
  2014-03-14 22:43       ` Peter Zijlstra
  2014-05-07 12:35     ` Peter Zijlstra
From: Paul E. McKenney @ 2014-03-14 20:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen

On Fri, Mar 14, 2014 at 10:50:33AM +0100, Peter Zijlstra wrote:
> On Thu, Mar 13, 2014 at 12:58:16PM -0700, Paul E. McKenney wrote:
> > On Fri, Mar 07, 2014 at 03:38:46PM +0200, Alexander Shishkin wrote:
> > > This is more of a problem description than an actual bugfix, but currently
> > > ring_buffer_detach() can kick in while ring_buffer_wakeup() is traversing
> > > the ring buffer's event list, leading to cpu stalls.
> > > 
> > > What this patch does is crude, but fixes the problem, which is: one rcu
> > > grace period has to elapse between ring_buffer_detach() and subsequent
> > > ring_buffer_attach(), otherwise either the attach will fail or the wakeup
> > > will misbehave. Also, making it a call_rcu() callback will make it race
> > > with attach().
> > > 
> > > Another solution that I see is to check for list_empty(&event->rb_entry)
> > > before wake_up_all() in ring_buffer_wakeup() and restart the list
> > > traversal if it is indeed empty, but that is ugly too as there will be
> > > extra wakeups on some events.
> > > 
> > > Anything that I'm missing here? Any better ideas?
> > 
> > Not sure it qualifies as "better", but the caller of ring_buffer_detach()
> > is going to free the event anyway, so the synchronize_rcu() and the
> > INIT_LIST_HEAD() should not be needed in that case.  I am guessing that
> > the same is true for perf_mmap_close().
> > 
> > So that leaves the call in perf_event_set_output(), which detaches from an
> > old rb before attaching that same event to a new one.  So maybe have the
> > synchronize_rcu() and INIT_LIST_HEAD() instead be in the "if (old_rb)",
> > which might be a reasonably uncommon case?
> 
> How about something like so that only does the sync_rcu() if really
> needed.

This general idea can be made to work, but it will need some
internal-to-RCU help.  One vulnerability of the patch below is the
following sequence of steps:

1.	RCU has just finished a grace period, and is doing the
	end-of-grace-period accounting.

2.	The code below invokes rcu_batches_completed().  Let's assume
	the result returned is 42.

3.	RCU completes the end-of-grace-period accounting, and increments
	rcu_sched_state.completed.

4.	The code below checks ->rcu_batches against the result from
	another invocation of rcu_batches_completed() and sees that
	the 43 is not equal to 42, so skips the synchronize_rcu().

Except that a grace period has not actually completed.  Boom!!!

The problem is that rcu_batches_completed() is only intended to give
progress information on RCU.

What I can do is give you a pair of functions, one to take a snapshot of
the current grace-period state (returning an unsigned long) and another
to evaluate a previous snapshot, invoking synchronize_rcu() if there has
not been a full grace period in the meantime.

The most straightforward approach would involve acquiring the global
rcu_state ->lock on each call, which I am guessing just might be
considered to be excessive overhead.  ;-)  I should be able to decrease
the overhead to a memory barrier on each call, and perhaps even down
to an smp_load_acquire().  Accessing the RCU state probably costs you
a cache miss both times.

Thoughts?

							Thanx, Paul

> ---
>  kernel/events/core.c     | 11 +++++++++--
>  kernel/events/internal.h |  1 +
>  2 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 661951ab8ae7..88c8c810e081 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3856,12 +3856,17 @@ static void ring_buffer_attach(struct perf_event *event,
>  {
>  	unsigned long flags;
> 
> +	if (rb->rcu_batches == rcu_batches_completed()) {
> +		synchronize_rcu();
> +		INIT_LIST_HEAD(&event->rb_entry);
> +	}
> +
>  	if (!list_empty(&event->rb_entry))
>  		return;
> 
>  	spin_lock_irqsave(&rb->event_lock, flags);
>  	if (list_empty(&event->rb_entry))
> -		list_add(&event->rb_entry, &rb->event_list);
> +		list_add_rcu(&event->rb_entry, &rb->event_list);
>  	spin_unlock_irqrestore(&rb->event_lock, flags);
>  }
> 
> @@ -3873,9 +3878,11 @@ static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb)
>  		return;
> 
>  	spin_lock_irqsave(&rb->event_lock, flags);
> -	list_del_init(&event->rb_entry);
> +	list_del_rcu(&event->rb_entry);
>  	wake_up_all(&event->waitq);
>  	spin_unlock_irqrestore(&rb->event_lock, flags);
> +
> +	rb->rcu_batches = rcu_batches_completed();
>  }
> 
>  static void ring_buffer_wakeup(struct perf_event *event)
> diff --git a/kernel/events/internal.h b/kernel/events/internal.h
> index 569b218782ad..698b5881b2a4 100644
> --- a/kernel/events/internal.h
> +++ b/kernel/events/internal.h
> @@ -30,6 +30,7 @@ struct ring_buffer {
>  	/* poll crap */
>  	spinlock_t			event_lock;
>  	struct list_head		event_list;
> +	unsigned long			rcu_batches;
> 
>  	atomic_t			mmap_count;
>  	unsigned long			mmap_locked;
> 



* Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
  2014-03-14 20:47     ` Paul E. McKenney
@ 2014-03-14 22:43       ` Peter Zijlstra
  2014-03-14 23:02         ` Paul E. McKenney
From: Peter Zijlstra @ 2014-03-14 22:43 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen

On Fri, Mar 14, 2014 at 01:47:37PM -0700, Paul E. McKenney wrote:
> This general idea can be made to work, but it will need some
> internal-to-RCU help.  One vulnerability of the patch below is the
> following sequence of steps:
> 
> 1.	RCU has just finished a grace period, and is doing the
> 	end-of-grace-period accounting.
> 
> 2.	The code below invokes rcu_batches_completed().  Let's assume
> 	the result returned is 42.
> 
> 3.	RCU completes the end-of-grace-period accounting, and increments
> 	rcu_sched_state.completed.
> 
> 4.	The code below checks ->rcu_batches against the result from
> 	another invocation of rcu_batches_completed() and sees that
> 	the 43 is not equal to 42, so skips the synchronize_rcu().
> 
> Except that a grace period has not actually completed.  Boom!!!
> 
> The problem is that rcu_batches_completed() is only intended to give
> progress information on RCU.

Ah, I thought I was missing something when I was looking through the rcu
code in a hurry :-)

I knew there'd be some subtlety between completed and gpnum and such :-)

> What I can do is give you a pair of functions, one to take a snapshot of
> the current grace-period state (returning an unsigned long) and another
> to evaluate a previous snapshot, invoking synchronize_rcu() if there has
> not been a full grace period in the meantime.
> 
> The most straightforward approach would involve acquiring the global
> rcu_state ->lock on each call, which I am guessing just might be
> considered to be excessive overhead.  ;-)  I should be able to decrease
> the overhead to a memory barrier on each call, and perhaps even down
> to an smp_load_acquire().  Accessing the RCU state probably costs you
> a cache miss both times.
> 
> Thoughts?

Sounds fine, the attach isn't a hotpath, so even the locked version
should be fine, but I won't keep you from making it all fancy and such
:-)


* Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
  2014-03-14 22:43       ` Peter Zijlstra
@ 2014-03-14 23:02         ` Paul E. McKenney
  2014-03-15  0:00           ` Paul E. McKenney
From: Paul E. McKenney @ 2014-03-14 23:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen

On Fri, Mar 14, 2014 at 11:43:17PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 14, 2014 at 01:47:37PM -0700, Paul E. McKenney wrote:
> > This general idea can be made to work, but it will need some
> > internal-to-RCU help.  One vulnerability of the patch below is the
> > following sequence of steps:
> > 
> > 1.	RCU has just finished a grace period, and is doing the
> > 	end-of-grace-period accounting.
> > 
> > 2.	The code below invokes rcu_batches_completed().  Let's assume
> > 	the result returned is 42.
> > 
> > 3.	RCU completes the end-of-grace-period accounting, and increments
> > 	rcu_sched_state.completed.
> > 
> > 4.	The code below checks ->rcu_batches against the result from
> > 	another invocation of rcu_batches_completed() and sees that
> > 	the 43 is not equal to 42, so skips the synchronize_rcu().
> > 
> > Except that a grace period has not actually completed.  Boom!!!
> > 
> > The problem is that rcu_batches_completed() is only intended to give
> > progress information on RCU.
> 
> Ah, I thought I was missing something when I was looking through the rcu
> code in a hurry :-)

Well, given that I sometimes miss things when looking through RCU code
carefully, I guess I cannot give you too much trouble about it.

> I knew there'd be some subtlety between completed and gpnum and such :-)

Some of which I have learned about one RCU bug at a time.  ;-)

> > What I can do is give you a pair of functions, one to take a snapshot of
> > the current grace-period state (returning an unsigned long) and another
> > to evaluate a previous snapshot, invoking synchronize_rcu() if there has
> > not been a full grace period in the meantime.
> > 
> > The most straightforward approach would involve acquiring the global
> > rcu_state ->lock on each call, which I am guessing just might be
> > considered to be excessive overhead.  ;-)  I should be able to decrease
> > the overhead to a memory barrier on each call, and perhaps even down
> > to an smp_load_acquire().  Accessing the RCU state probably costs you
> > a cache miss both times.
> > 
> > Thoughts?
> 
> Sounds fine, the attach isn't a hotpath, so even the locked version
> should be fine, but I won't keep you from making it all fancy and such
> :-)

Fair enough, let me see what I can come up with.

							Thanx, Paul



* Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
  2014-03-14 23:02         ` Paul E. McKenney
@ 2014-03-15  0:00           ` Paul E. McKenney
  2014-03-17 11:18             ` Peter Zijlstra
From: Paul E. McKenney @ 2014-03-15  0:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen

On Fri, Mar 14, 2014 at 04:02:31PM -0700, Paul E. McKenney wrote:
> On Fri, Mar 14, 2014 at 11:43:17PM +0100, Peter Zijlstra wrote:
> > On Fri, Mar 14, 2014 at 01:47:37PM -0700, Paul E. McKenney wrote:
> > > This general idea can be made to work, but it will need some
> > > internal-to-RCU help.  One vulnerability of the patch below is the
> > > following sequence of steps:
> > > 
> > > 1.	RCU has just finished a grace period, and is doing the
> > > 	end-of-grace-period accounting.
> > > 
> > > 2.	The code below invokes rcu_batches_completed().  Let's assume
> > > 	the result returned is 42.
> > > 
> > > 3.	RCU completes the end-of-grace-period accounting, and increments
> > > 	rcu_sched_state.completed.
> > > 
> > > 4.	The code below checks ->rcu_batches against the result from
> > > 	another invocation of rcu_batches_completed() and sees that
> > > 	the 43 is not equal to 42, so skips the synchronize_rcu().
> > > 
> > > Except that a grace period has not actually completed.  Boom!!!
> > > 
> > > The problem is that rcu_batches_completed() is only intended to give
> > > progress information on RCU.
> > 
> > Ah, I thought I was missing something when I was looking through the rcu
> > code in a hurry :-)
> 
> Well, given that I sometimes miss things when looking through RCU code
> carefully, I guess I cannot give you too much trouble about it.
> 
> > I knew there'd be some subtlety between completed and gpnum and such :-)
> 
> Some of which I have learned about one RCU bug at a time.  ;-)
> 
> > > What I can do is give you a pair of functions, one to take a snapshot of
> > > the current grace-period state (returning an unsigned long) and another
> > > to evaluate a previous snapshot, invoking synchronize_rcu() if there has
> > > not been a full grace period in the meantime.
> > > 
> > > The most straightforward approach would involve acquiring the global
> > > rcu_state ->lock on each call, which I am guessing just might be
> > > considered to be excessive overhead.  ;-)  I should be able to decrease
> > > the overhead to a memory barrier on each call, and perhaps even down
> > > to an smp_load_acquire().  Accessing the RCU state probably costs you
> > > a cache miss both times.
> > > 
> > > Thoughts?
> > 
> > Sounds fine, the attach isn't a hotpath, so even the locked version
> > should be fine, but I won't keep you from making it all fancy and such
> > :-)
> 
> Fair enough, let me see what I can come up with.

And here is an untested patch.  Thoughts?

(And yes, I need to update documentation and torture tests accordingly.)

							Thanx, Paul

------------------------------------------------------------------------

rcu: Provide grace-period piggybacking API
    
The following pattern is currently not well supported by RCU:
    
1.	Make data element inaccessible to RCU readers.
    
2.	Do work that probably lasts for more than one grace period.
    
3.	Do something to make sure RCU readers in flight before #1 above
    	have completed.
    
Here are some things that could currently be done:
    
a.	Do a synchronize_rcu() unconditionally at either #1 or #3 above.
    	This works, but imposes needless work and latency.
    
b.	Post an RCU callback at #1 above that does a wakeup, then
    	wait for the wakeup at #3.  This works well, but likely results
    	in an extra unneeded grace period.  Open-coding this is also
    	a bit more semi-tricky code than would be good.

This commit therefore adds get_state_synchronize_rcu() and
cond_synchronize_rcu() APIs.  Call get_state_synchronize_rcu() at #1
above and pass its return value to cond_synchronize_rcu() at #3 above.
This results in a call to synchronize_rcu() if no grace period has
elapsed between #1 and #3, but requires only a load, comparison, and
memory barrier if a full grace period did elapse.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c9be2235712c..dbf0f225bca0 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1503,13 +1503,14 @@ static int rcu_gp_init(struct rcu_state *rsp)
 
 	/* Advance to a new grace period and initialize state. */
 	record_gp_stall_check_time(rsp);
-	smp_wmb(); /* Record GP times before starting GP. */
-	rsp->gpnum++;
+	/* Record GP times before starting GP, hence smp_store_release(). */
+	smp_store_release(&rsp->gpnum, rsp->gpnum + 1);
 	trace_rcu_grace_period(rsp->name, rsp->gpnum, TPS("start"));
 	raw_spin_unlock_irq(&rnp->lock);
 
 	/* Exclude any concurrent CPU-hotplug operations. */
 	mutex_lock(&rsp->onoff_mutex);
+	smp_mb__after_unlock_lock(); /* ->gpnum increment before GP! */
 
 	/*
 	 * Set the quiescent-state-needed bits in all the rcu_node
@@ -1638,10 +1639,11 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
 	}
 	rnp = rcu_get_root(rsp);
 	raw_spin_lock_irq(&rnp->lock);
-	smp_mb__after_unlock_lock();
+	smp_mb__after_unlock_lock(); /* Order GP before ->completed update. */
 	rcu_nocb_gp_set(rnp, nocb);
 
-	rsp->completed = rsp->gpnum; /* Declare grace period done. */
+	/* Declare grace period done. */
+	ACCESS_ONCE(rsp->completed) = rsp->gpnum;
 	trace_rcu_grace_period(rsp->name, rsp->completed, TPS("end"));
 	rsp->fqs_state = RCU_GP_IDLE;
 	rdp = this_cpu_ptr(rsp->rda);
@@ -2728,6 +2730,37 @@ void synchronize_rcu_bh(void)
 }
 EXPORT_SYMBOL_GPL(synchronize_rcu_bh);
 
+/**
+ * get_state_synchronize_rcu - Snapshot current RCU state
+ *
+ * Returns a cookie that is used by a later call to cond_synchronize_rcu()
+ * to determine whether or not a full grace period has elapsed in the
+ * meantime.
+ */
+unsigned long get_state_synchronize_rcu(void)
+{
+	return smp_load_acquire(&rcu_state->gpnum);
+}
+EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);
+
+/**
+ * cond_synchronize_rcu - Conditionally wait for an RCU grace period
+ *
+ * @oldstate: return value from earlier call to get_state_synchronize_rcu()
+ *
+ * If a full RCU grace period has elapsed since the earlier call to
+ * get_state_synchronize_rcu(), just return.  Otherwise, invoke
+ * synchronize_rcu() to wait for a full grace period.
+ */
+void cond_synchronize_rcu(unsigned long oldstate)
+{
+	unsigned long newstate = smp_load_acquire(&rcu_state->completed);
+
+	if (ULONG_CMP_GE(oldstate, newstate))
+		synchronize_rcu();
+}
+EXPORT_SYMBOL_GPL(cond_synchronize_rcu);
+
 static int synchronize_sched_expedited_cpu_stop(void *data)
 {
 	/*



* Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
  2014-03-15  0:00           ` Paul E. McKenney
@ 2014-03-17 11:18             ` Peter Zijlstra
  2014-03-17 16:48               ` Paul E. McKenney
From: Peter Zijlstra @ 2014-03-17 11:18 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen

> rcu: Provide grace-period piggybacking API
>     
> The following pattern is currently not well supported by RCU:
>     
> 1.	Make data element inaccessible to RCU readers.
>     
> 2.	Do work that probably lasts for more than one grace period.
>     
> 3.	Do something to make sure RCU readers in flight before #1 above
>     	have completed.
>     
> Here are some things that could currently be done:
>     
> a.	Do a synchronize_rcu() unconditionally at either #1 or #3 above.
>     	This works, but imposes needless work and latency.
>     
> b.	Post an RCU callback at #1 above that does a wakeup, then
>     	wait for the wakeup at #3.  This works well, but likely results
>     	in an extra unneeded grace period.  Open-coding this is also
>     	a bit more semi-tricky code than would be good.
> 
> This commit therefore adds get_state_synchronize_rcu() and
> cond_synchronize_rcu() APIs.  Call get_state_synchronize_rcu() at #1
> above and pass its return value to cond_synchronize_rcu() at #3 above.
> This results in a call to synchronize_rcu() if no grace period has
> elapsed between #1 and #3, but requires only a load, comparison, and
> memory barrier if a full grace period did elapse.
> 
> Reported-by: Peter Zijlstra <peterz@infradead.org>

More a requested-by, I'd say.

> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> +/**
> + * get_state_synchronize_rcu - Snapshot current RCU state
> + *
> + * Returns a cookie that is used by a later call to cond_synchronize_rcu()
> + * to determine whether or not a full grace period has elapsed in the
> + * meantime.
> + */
> +unsigned long get_state_synchronize_rcu(void)
> +{

/*
 * Make sure this load happens before anything following it; such that
 * ... ?
 */

The way I imagined using this is taking this snapshot after the RCU
operation, such that we err towards seeing a later grace period and
synchronizing too much instead of seeing an earlier and sync'ing too
little.

Such use would suggest the barrier the other way around.
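
I.e. something like this on the detach side (sketch only; where
rcu_stamp gets stored is left open):

	spin_lock_irqsave(&rb->event_lock, flags);
	list_del_rcu(&event->rb_entry);
	wake_up_all(&event->waitq);
	spin_unlock_irqrestore(&rb->event_lock, flags);

	/* Snapshot only after the removal is done. */
	rcu_stamp = get_state_synchronize_rcu();

and later, before the entry can be reused:

	cond_synchronize_rcu(rcu_stamp);
	INIT_LIST_HEAD(&event->rb_entry);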

> +	return smp_load_acquire(&rcu_state->gpnum);
> +}
> +EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);

I can't say I'm excited about that function name; but I'm also
completely lacking in alternatives.

> +
> +/**
> + * cond_synchronize_rcu - Conditionally wait for an RCU grace period
> + *
> + * @oldstate: return value from earlier call to get_state_synchronize_rcu()
> + *
> + * If a full RCU grace period has elapsed since the earlier call to
> + * get_state_synchronize_rcu(), just return.  Otherwise, invoke
> + * synchronize_rcu() to wait for a full grace period.
> + */
> +void cond_synchronize_rcu(unsigned long oldstate)
> +{
> +	unsigned long newstate = smp_load_acquire(&rcu_state->completed);

Again, uncommented barriers; the load_acquire seems to suggest you want
to make sure the sync_rcu() bits happen after this read. But seeing how
sync_rcu() will invariably read the same data again and you get an
address dep this seems somewhat superfluous.

Then again, can't hurt I suppose :-)

> +	if (ULONG_CMP_GE(oldstate, newstate))

So if the observed active gp (oldstate) is ahead or equal to the last
completed gp, then we wait?

I thought the double grace period thing was for preemptible rcu; is it
done here because you don't want to special case the preemptible rcu and
make do with a single implementation?

And I suppose that on wrap around; we do extra sync_rcu() calls, which
can never be wrong.

> +		synchronize_rcu();
> +}
> +EXPORT_SYMBOL_GPL(cond_synchronize_rcu);



* Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
  2014-03-17 11:18             ` Peter Zijlstra
@ 2014-03-17 16:48               ` Paul E. McKenney
  2014-03-17 17:30                 ` Peter Zijlstra
From: Paul E. McKenney @ 2014-03-17 16:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen

On Mon, Mar 17, 2014 at 12:18:07PM +0100, Peter Zijlstra wrote:
> > rcu: Provide grace-period piggybacking API
> >     
> > The following pattern is currently not well supported by RCU:
> >     
> > 1.	Make data element inaccessible to RCU readers.
> >     
> > 2.	Do work that probably lasts for more than one grace period.
> >     
> > 3.	Do something to make sure RCU readers in flight before #1 above
> >     	have completed.
> >     
> > Here are some things that could currently be done:
> >     
> > a.	Do a synchronize_rcu() unconditionally at either #1 or #3 above.
> >     	This works, but imposes needless work and latency.
> >     
> > b.	Post an RCU callback at #1 above that does a wakeup, then
> >     	wait for the wakeup at #3.  This works well, but likely results
> >     	in an extra unneeded grace period.  Open-coding this is also
> >     	a bit more semi-tricky code than would be good.
> > 
> > This commit therefore adds get_state_synchronize_rcu() and
> > cond_synchronize_rcu() APIs.  Call get_state_synchronize_rcu() at #1
> > above and pass its return value to cond_synchronize_rcu() at #3 above.
> > This results in a call to synchronize_rcu() if no grace period has
> > elapsed between #1 and #3, but requires only a load, comparison, and
> > memory barrier if a full grace period did elapse.
> > 
> > Reported-by: Peter Zijlstra <peterz@infradead.org>
> 
> More a requested-by, I'd say.

Fair enough...

> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > +/**
> > + * get_state_synchronize_rcu - Snapshot current RCU state
> > + *
> > + * Returns a cookie that is used by a later call to cond_synchronize_rcu()
> > + * to determine whether or not a full grace period has elapsed in the
> > + * meantime.
> > + */
> > +unsigned long get_state_synchronize_rcu(void)
> > +{
> 
> /*
>  * Make sure this load happens before anything following it; such that
>  * ... ?
>  */

Good point, this barrier is a bit obscure.  Here you go:

	/*
	 * Make sure this load happens before the purportedly
	 * time-consuming work between get_state_synchronize_rcu()
	 * and cond_synchronize_rcu().
	 */

So all you lose by leaving this barrier out is a slightly higher
probability of invoking synchronize_rcu() at cond_synchronize_rcu() time.
And without the barrier there is some chance of the CPU doing the load in
cond_synchronize_rcu() before this load in get_state_synchronize_rcu().
Which should be OK, but simpler to exclude this than have to worry about
counters out of sync.  Besides, you are taking a grace-period delay in
here somewhere, so the memory barrier is way down in the noise.

> The way I imagined using this is taking this snapshot after the RCU
> operation, such that we err towards seeing a later grace period and
> synchronizing too much instead of seeing an earlier and sync'ing too
> little.
> 
> Such use would suggest the barrier the other way around.

Ah, but that ordering happens at the updater's end.  And there
are in fact memory barriers before and after the increment of
->gpnum in rcu_gp_init():

	/* Advance to a new grace period and initialize state. */
	record_gp_stall_check_time(rsp);
	/* Record GP times before starting GP, hence smp_store_release(). */
	smp_store_release(&rsp->gpnum, rsp->gpnum + 1);
	trace_rcu_grace_period(rsp->name, rsp->gpnum, TPS("start"));
	raw_spin_unlock_irq(&rnp->lock);

	/* Exclude any concurrent CPU-hotplug operations. */
	mutex_lock(&rsp->onoff_mutex);
	smp_mb__after_unlock_lock(); /* ->gpnum increment before GP! */

The smp_store_release() provides the memory barrier before, and the
smp_mb__after_unlock_lock() provides it after.

> > +	return smp_load_acquire(&rcu_state->gpnum);
> > +}
> > +EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);
> 
> I can't say I'm excited about that function name; but I'm also
> completely lacking in alternatives.

;-)

> > +
> > +/**
> > + * cond_synchronize_rcu - Conditionally wait for an RCU grace period
> > + *
> > + * @oldstate: return value from earlier call to get_state_synchronize_rcu()
> > + *
> > + * If a full RCU grace period has elapsed since the earlier call to
> > + * get_state_synchronize_rcu(), just return.  Otherwise, invoke
> > + * synchronize_rcu() to wait for a full grace period.
> > + */
> > +void cond_synchronize_rcu(unsigned long oldstate)
> > +{
> > +	unsigned long newstate = smp_load_acquire(&rcu_state->completed);
> 
> Again, uncommented barriers; the load_acquire seems to suggest you want
> to make sure the sync_rcu() bits happen after this read. But seeing how
> sync_rcu() will invariably read the same data again and you get an
> address dep this seems somewhat superfluous.

Not quite!  The get_state_synchronize_rcu() function reads ->gpnum,
but cond_synchronize_rcu() reads ->completed.  The

> Then again, can't hurt I suppose :-)
> 
> > +	if (ULONG_CMP_GE(oldstate, newstate))
> 
> So if the observed active gp (oldstate) is ahead or equal to the last
> completed gp, then we wait?

Yep!  I will risk an ASCII diagram:


3:	                                                  +----gpnum----+-- ...
	                                                  |             |
2:	                            +----gpnum----+-------+--completed--+
	                            |             |
1:	      +----gpnum----+-------+--completed--+
	      |             |
0:	+-----+--completed--+


A full RCU grace period happens between a pair of "|"s on the same line.
By inspection, if your snapshot of ->gpnum is greater than the current
value of ->completed, a grace period has passed.

> I thought the double grace period thing was for preemptible rcu; is it
> done here because you don't want to special case the preemptible rcu and
> make do with a single implementation?

No, we are waiting for only one grace period.  We -would- need a double
grace period if we were using the ->gpnum and ->completed values from the
rcu_node structures because these values can each be one unit out of sync
with the global values.  And we might someday need to take this step,
for example if large machines and heavy use of these primitives someday
results in excessive memory contention on the ->gpnum and ->completed
values in the global rcu_state structure.  And then the ASCII diagram
is as follows:

3:	                                                  +-+--gpnum----+--+...
	                                                  | ^           |  ^
2:	                            +-+--gpnum----+-------+-+completed--+>>+
	                            | ^           |  ^
1:	      +-+--gpnum----+--+----+-+completed--+>>+
	      | ^           |  ^
0:	+-----+-+completed--+>>+

Here the "^"s represent the later transitions of the ->gpnum and ->completed
values in the rcu_node structures.  But we still have to wait for the
interval between a pair of "|" characters: The interval between a pair
of "^" characters does -not- suffice, because there might not be a full
set of quiescent states between the "^"s!  So we have to wait for two, so
that the conditional would become:

	if (ULONG_CMP_GE(oldstate + 1, newstate))

Referring to the second ASCII diagram above, if oldstate is zero, we might
have missed the beginning of the first grace period.  We must therefore
wait until the end of the second grace period, at which point newstate
will be 2.  Which works out, as 1+1>=2.

> And I suppose that on wrap around; we do extra sync_rcu() calls, which
> can never be wrong.

Yep!  Ah, I suppose I should comment that, shouldn't I?  How about the
following in the header comment?

 * Yes, this function does not take counter wrap into account.  But
 * counter wrap is harmless.  If the counter wraps, you have waited
 * at least 4 billion grace periods already (more on a 64-bit system!),
 * so waiting for one extra grace period should not be a problem.

> > +		synchronize_rcu();
> > +}
> > +EXPORT_SYMBOL_GPL(cond_synchronize_rcu);

Ugh.  And I also forgot rcutiny.  It is pretty straightforward.  ;-)

	static inline unsigned long get_state_synchronize_rcu(void)
	{
		return 0;
	}

	static inline void cond_synchronize_rcu(unsigned long oldstate)
	{
		might_sleep();
	}

This works because on a !SMP system, might_sleep() is a grace period.
These are in include/linux/rcutiny.h.

I also added the needed declarations in include/linux/rcutree.h.

Updated patch below.

							Thanx, Paul

------------------------------------------------------------------------

rcu: Provide grace-period piggybacking API

The following pattern is currently not well supported by RCU:

1.	Make data element inaccessible to RCU readers.

2.	Do work that probably lasts for more than one grace period.

3.	Do something to make sure RCU readers in flight before #1 above
    	have completed.
    
Here are some things that could currently be done:

a.	Do a synchronize_rcu() unconditionally at either #1 or #3 above.
    	This works, but imposes needless work and latency.
    
b.	Post an RCU callback at #1 above that does a wakeup, then
    	wait for the wakeup at #3.  This works well, but likely results
    	in an extra unneeded grace period.  Open-coding this is also
    	a bit more semi-tricky code than would be good.

This commit therefore adds get_state_synchronize_rcu() and
cond_synchronize_rcu() APIs.  Call get_state_synchronize_rcu() at #1
above and pass its return value to cond_synchronize_rcu() at #3 above.
This results in a call to synchronize_rcu() if no grace period has
elapsed between #1 and #3, but requires only a load, comparison, and
memory barrier if a full grace period did elapse.

Requested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 7b62ecdd75e8..d40a6a451330 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -27,6 +27,16 @@

 #include <linux/cache.h>
 
+static inline unsigned long get_state_synchronize_rcu(void)
+{
+	return 0;
+}
+
+static inline void cond_synchronize_rcu(unsigned long oldstate)
+{
+	might_sleep();
+}
+
 static inline void rcu_barrier_bh(void)
 {
 	wait_rcu_gp(call_rcu_bh);
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 44211ff34b9a..3e2f5d432743 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -76,6 +76,8 @@ static inline void synchronize_rcu_bh_expedited(void)
 void rcu_barrier(void);
 void rcu_barrier_bh(void);
 void rcu_barrier_sched(void);
+unsigned long get_state_synchronize_rcu(void);
+void cond_synchronize_rcu(unsigned long oldstate);
 
 extern unsigned long rcutorture_testseq;
 extern unsigned long rcutorture_vernum;
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c9be2235712c..65e99cf080e0 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1503,13 +1503,14 @@ static int rcu_gp_init(struct rcu_state *rsp)

 	/* Advance to a new grace period and initialize state. */
 	record_gp_stall_check_time(rsp);
-	smp_wmb(); /* Record GP times before starting GP. */
-	rsp->gpnum++;
+	/* Record GP times before starting GP, hence smp_store_release(). */
+	smp_store_release(&rsp->gpnum, rsp->gpnum + 1);
 	trace_rcu_grace_period(rsp->name, rsp->gpnum, TPS("start"));
 	raw_spin_unlock_irq(&rnp->lock);
 
 	/* Exclude any concurrent CPU-hotplug operations. */
 	mutex_lock(&rsp->onoff_mutex);
+	smp_mb__after_unlock_lock(); /* ->gpnum increment before GP! */
 
 	/*
 	 * Set the quiescent-state-needed bits in all the rcu_node
@@ -1638,10 +1639,11 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
 	}
 	rnp = rcu_get_root(rsp);
 	raw_spin_lock_irq(&rnp->lock);
-	smp_mb__after_unlock_lock();
+	smp_mb__after_unlock_lock(); /* Order GP before ->completed update. */
 	rcu_nocb_gp_set(rnp, nocb);
 
-	rsp->completed = rsp->gpnum; /* Declare grace period done. */
+	/* Declare grace period done. */
+	ACCESS_ONCE(rsp->completed) = rsp->gpnum;
 	trace_rcu_grace_period(rsp->name, rsp->completed, TPS("end"));
 	rsp->fqs_state = RCU_GP_IDLE;
 	rdp = this_cpu_ptr(rsp->rda);
@@ -2728,6 +2730,47 @@ void synchronize_rcu_bh(void)
 }
 EXPORT_SYMBOL_GPL(synchronize_rcu_bh);

+/**
+ * get_state_synchronize_rcu - Snapshot current RCU state
+ *
+ * Returns a cookie that is used by a later call to cond_synchronize_rcu()
+ * to determine whether or not a full grace period has elapsed in the
+ * meantime.
+ */
+unsigned long get_state_synchronize_rcu(void)
+{
+	/*
+	 * Make sure this load happens before the purportedly
+	 * time-consuming work between get_state_synchronize_rcu()
+	 * and cond_synchronize_rcu().
+	 */
+	return smp_load_acquire(&rcu_state->gpnum);
+}
+EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);
+
+/**
+ * cond_synchronize_rcu - Conditionally wait for an RCU grace period
+ *
+ * @oldstate: return value from earlier call to get_state_synchronize_rcu()
+ *
+ * If a full RCU grace period has elapsed since the earlier call to
+ * get_state_synchronize_rcu(), just return.  Otherwise, invoke
+ * synchronize_rcu() to wait for a full grace period.
+ *
+ * Yes, this function does not take counter wrap into account.  But
+ * counter wrap is harmless.  If the counter wraps, we have waited for
+ * more than 2 billion grace periods (and way more on a 64-bit system!),
+ * so waiting for one additional grace period should be just fine.
+ */
+void cond_synchronize_rcu(unsigned long oldstate)
+{
+	unsigned long newstate = smp_load_acquire(&rcu_state->completed);
+
+	if (ULONG_CMP_GE(oldstate, newstate))
+		synchronize_rcu();
+}
+EXPORT_SYMBOL_GPL(cond_synchronize_rcu);
+
 static int synchronize_sched_expedited_cpu_stop(void *data)
 {
 	/*



* Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
  2014-03-17 16:48               ` Paul E. McKenney
@ 2014-03-17 17:30                 ` Peter Zijlstra
  2014-03-18  2:45                   ` Paul E. McKenney
From: Peter Zijlstra @ 2014-03-17 17:30 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen

On Mon, Mar 17, 2014 at 09:48:15AM -0700, Paul E. McKenney wrote:
> Good point, this barrier is a bit obscure.  Here you go:
> 
> 	/*
> 	 * Make sure this load happens before the purportedly
> 	 * time-consuming work between get_state_synchronize_rcu()
> 	 * and cond_synchronize_rcu().
> 	 */
> 
> So all you lose by leaving this barrier out is a slightly higher
> probability of invoking synchronize_rcu() at cond_synchronize_rcu() time.
> And without the barrier there is some chance of the CPU doing the load in
> cond_synchronize_rcu() before this load in get_state_synchronize_rcu().
> Which should be OK, but simpler to exclude this than have to worry about
> counters out of sync.  Besides, you are taking a grace-period delay in
> here somewhere, so the memory barrier is way down in the noise.

Oh sure; expense wasn't my concern. But I always question barriers
without comments, and I couldn't come up with a reason for this one.

> > The way I imagined using this is taking this snapshot after the RCU
> > operation, such that we err towards seeing a later grace period and
> > synchronizing too much instead of seeing an earlier and sync'ing too
> > little.
> > 
> > Such use would suggest the barrier the other way around.
> 
> Ah, but that ordering happens at the updater's end.  And there
> are in fact memory barriers before and after the increment of
> ->gpnum in rcu_gp_init():
> 
> 	/* Advance to a new grace period and initialize state. */
> 	record_gp_stall_check_time(rsp);
> 	/* Record GP times before starting GP, hence smp_store_release(). */
> 	smp_store_release(&rsp->gpnum, rsp->gpnum + 1);
> 	trace_rcu_grace_period(rsp->name, rsp->gpnum, TPS("start"));
> 	raw_spin_unlock_irq(&rnp->lock);
> 
> 	/* Exclude any concurrent CPU-hotplug operations. */
> 	mutex_lock(&rsp->onoff_mutex);
> 	smp_mb__after_unlock_lock(); /* ->gpnum increment before GP! */
> 
> The smp_store_release() provides the memory barrier before, and the
> smp_mb__after_unlock_lock() provides it after.

That wasn't exactly what I was talking about.

So suppose:

  spin_lock(&lock)
  list_del_rcu(&entry->foo);
  spin_unlock(&lock);

  rcu_stamp = get_state_sync_rcu();

Now we very much want to ensure we get the gpnum _after_ completing that
list_del_rcu(), because if we were to observe the gpnum before it, that
grace period could complete before the removal and we'd fail to wait
long enough.

Now in the above, and with your proposed implementation, there is in
fact no guarantee the load doesn't slip up past the list_del_rcu().

The RELEASE per the spin_unlock() only guarantees the list_del_rcu()
happens before it, but the gpnum load can slip in, and in fact pass over
the list_del_rcu().

> > > +void cond_synchronize_rcu(unsigned long oldstate)
> > > +{
> > > +	unsigned long newstate = smp_load_acquire(&rcu_state->completed);
> > 
> > Again, uncommented barriers; the load_acquire seems to suggest you want
> > to make sure the sync_rcu() bits happen after this read. But seeing how
> > sync_rcu() will invariably read the same data again and you get an
> > address dep this seems somewhat superfluous.
> 
> Not quite!  The get_state_synchronize_rcu() function reads ->gpnum,
> but cond_synchronize_rcu() reads ->completed.  The

I was talking about the synchronize_rcu() call later in
cond_synchronize_rcu().

But looking at its implementation; they don't in fact load either gpnum
or completed.

> > > +	if (ULONG_CMP_GE(oldstate, newstate))
> > 
> > So if the observed active gp (oldstate) is ahead or equal to the last
> > completed gp, then we wait?
> 
> Yep!  I will risk an ASCII diagram:
> 
> 
> 3:	                                                  +----gpnum----+-- ...
> 	                                                  |             |
> 2:	                            +----gpnum----+-------+--completed--+
> 	                            |             |
> 1:	      +----gpnum----+-------+--completed--+
> 	      |             |
> 0:	+-----+--completed--+
> 
> 
> A full RCU grace period happens between a pair of "|"s on the same line.
> By inspection, if your snapshot of ->gpnum is greater than the current
> value of ->completed, a grace period has passed.

OK, so I get the > part, but I'm not sure I get the = part of the above.
The way I read the diagram, when completed matches gpnum the grace
period is done and we don't have to wait anymore.


* Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
  2014-03-17 17:30                 ` Peter Zijlstra
@ 2014-03-18  2:45                   ` Paul E. McKenney
  2014-03-18  8:26                     ` Peter Zijlstra
From: Paul E. McKenney @ 2014-03-18  2:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen

On Mon, Mar 17, 2014 at 06:30:38PM +0100, Peter Zijlstra wrote:
> On Mon, Mar 17, 2014 at 09:48:15AM -0700, Paul E. McKenney wrote:
> > Good point, this barrier is a bit obscure.  Here you go:
> > 
> > 	/*
> > 	 * Make sure this load happens before the purportedly
> > 	 * time-consuming work between get_state_synchronize_rcu()
> > 	 * and cond_synchronize_rcu().
> > 	 */
> > 
> > So all you lose by leaving this barrier out is a slightly higher
> > probability of invoking synchronize_rcu() at cond_synchronize_rcu() time.
> > And without the barrier there is some chance of the CPU doing the load in
> > cond_synchronize_rcu() before this load in get_state_synchronize_rcu().
> > Which should be OK, but simpler to exclude this than have to worry about
> > counters out of sync.  Besides, you are taking a grace-period delay in
> > here somewhere, so the memory barrier is way down in the noise.
> 
> Oh sure; expense wasn't my concern. But I always question barriers
> without comments, and I couldn't come up with a reason for this one.
> 
> > > The way I imagined using this is taking this snapshot after the RCU
> > > operation, such that we err towards seeing a later grace period and
> > > synchronizing too much instead of seeing an earlier and sync'ing too
> > > little.
> > > 
> > > Such use would suggest the barrier the other way around.
> > 
> > Ah, but that ordering happens at the updater's end.  And there
> > are in fact memory barriers before and after the increment of
> > ->gpnum in rcu_gp_init():
> > 
> > 	/* Advance to a new grace period and initialize state. */
> > 	record_gp_stall_check_time(rsp);
> > 	/* Record GP times before starting GP, hence smp_store_release(). */
> > 	smp_store_release(&rsp->gpnum, rsp->gpnum + 1);
> > 	trace_rcu_grace_period(rsp->name, rsp->gpnum, TPS("start"));
> > 	raw_spin_unlock_irq(&rnp->lock);
> > 
> > 	/* Exclude any concurrent CPU-hotplug operations. */
> > 	mutex_lock(&rsp->onoff_mutex);
> > 	smp_mb__after_unlock_lock(); /* ->gpnum increment before GP! */
> > 
> > The smp_store_release() provides the memory barrier before, and the
> > smp_mb__after_unlock_lock() provides it after.
> 
> That wasn't exactly what I was talking about.
> 
> So suppose:
> 
>   spin_lock(&lock)
>   list_del_rcu(&entry->foo);
>   spin_unlock(&lock);
> 
>   rcu_stamp = get_state_sync_rcu();
> 
> Now we very much want to ensure we get the gpnum _after_ completing that
> list_del_rcu(), because if we were to observe the gpnum before it, that
> grace period could complete before the removal and we'd fail to wait
> long enough.
> 
> Now in the above, and with your proposed implementation, there is in
> fact no guarantee the load doesn't slip up past the list_del_rcu().
> 
> The RELEASE per the spin_unlock() only guarantees the list_del_rcu()
> happens before it, but the gpnum load can slip in, and in fact pass over
> the list_del_rcu().

Right you are!  Fixed patch below.

> > > > +void cond_synchronize_rcu(unsigned long oldstate)
> > > > +{
> > > > +	unsigned long newstate = smp_load_acquire(&rcu_state->completed);
> > > 
> > > Again, uncommented barriers; the load_acquire seems to suggest you want
> > > to make sure the sync_rcu() bits happen after this read. But seeing how
> > > sync_rcu() will invariably read the same data again and you get an
> > > address dep this seems somewhat superfluous.
> > 
> > Not quite!  The get_state_synchronize_rcu() function reads ->gpnum,
> > but cond_synchronize_rcu() reads ->completed.  The
> 
> I was talking about the synchronize_rcu() call later in
> cond_synchronize_rcu().
> 
> But looking at its implementation; they don't in fact load either gpnum
> or completed.

They don't directly, but they register callbacks, which causes the
RCU core code to access them.  But if we invoke synchronize_rcu(),
it really doesn't matter what either get_state_synchronize_rcu() or
cond_synchronize_rcu() read.  The memory barrier from cond_synchronize_rcu()'s
smp_load_acquire() is instead needed for the case where we -don't-
invoke synchronize_rcu().  In that case, whatever was supposed to follow
the grace period had better really follow the grace period.

> > > > +	if (ULONG_CMP_GE(oldstate, newstate))
> > > 
> > > So if the observed active gp (oldstate) is ahead or equal to the last
> > > completed gp, then we wait?
> > 
> > Yep!  I will risk an ASCII diagram:
> > 
> > 
> > 3:	                                                  +----gpnum----+-- ...
> > 	                                                  |             |
> > 2:	                            +----gpnum----+-------+--completed--+
> > 	                            |             |
> > 1:	      +----gpnum----+-------+--completed--+
> > 	      |             |
> > 0:	+-----+--completed--+
> > 
> > 
> > A full RCU grace period happens between a pair of "|"s on the same line.
> > By inspection, if your snapshot of ->gpnum is greater than the current
> > value of ->completed, a grace period has passed.
> 
> OK, so I get the > part, but I'm not sure I get the = part of the above.
> The way I read the diagram, when completed matches gpnum the grace
> period is done and we don't have to wait anymore.

Absolutely not!  Let's try laying out the scenario:

1.	Someone calls get_state_synchronize_rcu() when ->gpnum==->completed==0.
	It returns zero.

2.	That someone immediately calls cond_synchronize_rcu().  Nothing
	has changed, so oldstate==newstate==0.

	We had better call synchronize_rcu() in this case!!!

Another way of looking at it...  At any point in time, either gpnum and
completed have the same value, or gpnum is one greater than completed.

A.	They are equal, say to zero.  In this case, we need one grace
	period.  For this to happen, ->gpnum first increases by one
	(but our snapshot remains zero), then after the grace period
	completes, ->completed catches up.  So as long as ->completed
	is less than or equal to our snapshot, we need to call
	synchronize_rcu().

B.	They differ, say ->gpnum is equal to one and ->completed is
	equal to zero.  This means that there is a grace period in
	progress.  We must first wait for this grace period to
	complete, then wait for the next grace period to complete.

	After the first grace period completes, we will have
	both ->gpnum and ->completed equal to one, that is to
	say, equal to our snapshot.  We still haven't waited for
	a full grace period, so if cond_synchronize_rcu() gets called
	at this point, it needs to invoke synchronize_rcu().

	After the second grace period starts, we will have ->gpnum
	equal to two and ->completed still equal to one.  We still
	have not waited a full grace period, so any call to
	cond_synchronize_rcu() must invoke synchronize_rcu().
	Which it will, because ->completed is still equal to the
	snapshot returned by get_state_synchronize_rcu().

	After the second grace period completes, we will have waited
	for a full grace period, and therefore need not wait any
	longer.  But at this point, ->completed will be equal to
	two, and will thus be greater than our snapshot.  Now,
	cond_synchronize_rcu() need not invoke synchronize_rcu(),
	and because ->completed is now greater than our snapshot,
	it won't.

Make sense?
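
If it helps, here is a throwaway userspace sketch of just that
comparison, walking the two cases above.  ULONG_CMP_GE() is open-coded
locally and assumed to match the kernel's wraparound-tolerant
definition:

#include <stdio.h>
#include <limits.h>

/* Local stand-in for the kernel's wraparound-tolerant comparison. */
#define ULONG_CMP_GE(a, b)	(ULONG_MAX / 2 >= (a) - (b))

/* Mirrors the check in cond_synchronize_rcu() in the patch below. */
static int must_wait(unsigned long oldstate, unsigned long completed)
{
	return ULONG_CMP_GE(oldstate, completed);
}

int main(void)
{
	/* Case A: snapshot taken at ->gpnum == ->completed == 0. */
	printf("%d\n", must_wait(0, 0));	/* 1: must call synchronize_rcu() */
	printf("%d\n", must_wait(0, 1));	/* 0: ->completed passed the snapshot */

	/* Case B: snapshot taken at ->gpnum == 1, ->completed == 0. */
	printf("%d\n", must_wait(1, 1));	/* 1: first GP done, not yet a full GP for us */
	printf("%d\n", must_wait(1, 2));	/* 0: second GP done, safe to skip the wait */
	return 0;
}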

Updated patch below.  Added the memory barrier to get_state_synchronize_rcu()
as you suggested, and also added more comments.

							Thanx, Paul

------------------------------------------------------------------------

rcu: Provide grace-period piggybacking API

The following pattern is currently not well supported by RCU:

1.	Make data element inaccessible to RCU readers.

2.	Do work that probably lasts for more than one grace period.

3.	Do something to make sure RCU readers in flight before #1 above
    	have completed.
    
Here are some things that could currently be done:

a.	Do a synchronize_rcu() unconditionally at either #1 or #3 above.
    	This works, but imposes needless work and latency.
    
b.	Post an RCU callback at #1 above that does a wakeup, then
    	wait for the wakeup at #3.  This works well, but likely results
    	in an extra unneeded grace period.  Open-coding this is also
    	a bit trickier than would be good.
    
This commit therefore adds get_state_synchronize_rcu() and
cond_synchronize_rcu() APIs.  Call get_state_synchronize_rcu() at #1
above and pass its return value to cond_synchronize_rcu() at #3 above.
This results in a call to synchronize_rcu() if no grace period has
elapsed between #1 and #3, but requires only a load, comparison, and
memory barrier if a full grace period did elapse.
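
In other words, the intended usage is roughly as follows (illustrative
sketch only; struct foo, its ->node linkage, and do_lengthy_work() are
made-up names):

	static void foo_recycle(struct foo *p)
	{
		unsigned long cookie;

		list_del_rcu(&p->node);			/* #1: readers can no longer find it */
		cookie = get_state_synchronize_rcu();

		do_lengthy_work(p);			/* #2: hopefully spans a grace period */

		cond_synchronize_rcu(cookie);		/* #3: waits only if no GP elapsed */
		kfree(p);				/* now safe to free or reuse */
	}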

Requested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 7b62ecdd75e8..d40a6a451330 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -27,6 +27,16 @@
 
 #include <linux/cache.h>
 
+static inline unsigned long get_state_synchronize_rcu(void)
+{
+	return 0;
+}
+
+static inline void cond_synchronize_rcu(unsigned long oldstate)
+{
+	might_sleep();
+}
+
 static inline void rcu_barrier_bh(void)
 {
 	wait_rcu_gp(call_rcu_bh);
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 44211ff34b9a..3e2f5d432743 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -76,6 +76,8 @@ static inline void synchronize_rcu_bh_expedited(void)
 void rcu_barrier(void);
 void rcu_barrier_bh(void);
 void rcu_barrier_sched(void);
+unsigned long get_state_synchronize_rcu(void);
+void cond_synchronize_rcu(unsigned long oldstate);
 
 extern unsigned long rcutorture_testseq;
 extern unsigned long rcutorture_vernum;
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c9be2235712c..2631399e73de 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1503,13 +1503,14 @@ static int rcu_gp_init(struct rcu_state *rsp)
 
 	/* Advance to a new grace period and initialize state. */
 	record_gp_stall_check_time(rsp);
-	smp_wmb(); /* Record GP times before starting GP. */
-	rsp->gpnum++;
+	/* Record GP times before starting GP, hence smp_store_release(). */
+	smp_store_release(&rsp->gpnum, rsp->gpnum + 1);
 	trace_rcu_grace_period(rsp->name, rsp->gpnum, TPS("start"));
 	raw_spin_unlock_irq(&rnp->lock);
 
 	/* Exclude any concurrent CPU-hotplug operations. */
 	mutex_lock(&rsp->onoff_mutex);
+	smp_mb__after_unlock_lock(); /* ->gpnum increment before GP! */
 
 	/*
 	 * Set the quiescent-state-needed bits in all the rcu_node
@@ -1638,10 +1639,11 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
 	}
 	rnp = rcu_get_root(rsp);
 	raw_spin_lock_irq(&rnp->lock);
-	smp_mb__after_unlock_lock();
+	smp_mb__after_unlock_lock(); /* Order GP before ->completed update. */
 	rcu_nocb_gp_set(rnp, nocb);
 
-	rsp->completed = rsp->gpnum; /* Declare grace period done. */
+	/* Declare grace period done. */
+	ACCESS_ONCE(rsp->completed) = rsp->gpnum;
 	trace_rcu_grace_period(rsp->name, rsp->completed, TPS("end"));
 	rsp->fqs_state = RCU_GP_IDLE;
 	rdp = this_cpu_ptr(rsp->rda);
@@ -2728,6 +2730,58 @@ void synchronize_rcu_bh(void)
 }
 EXPORT_SYMBOL_GPL(synchronize_rcu_bh);
 
+/**
+ * get_state_synchronize_rcu - Snapshot current RCU state
+ *
+ * Returns a cookie that is used by a later call to cond_synchronize_rcu()
+ * to determine whether or not a full grace period has elapsed in the
+ * meantime.
+ */
+unsigned long get_state_synchronize_rcu(void)
+{
+	/*
+	 * Any prior manipulation of RCU-protected data must happen
+	 * before the load from ->gpnum.
+	 */
+	smp_mb();  /* ^^^ */
+
+	/*
+	 * Make sure this load happens before the purportedly
+	 * time-consuming work between get_state_synchronize_rcu()
+	 * and cond_synchronize_rcu().
+	 */
+	return smp_load_acquire(&rcu_state->gpnum);
+}
+EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);
+
+/**
+ * cond_synchronize_rcu - Conditionally wait for an RCU grace period
+ *
+ * @oldstate: return value from earlier call to get_state_synchronize_rcu()
+ *
+ * If a full RCU grace period has elapsed since the earlier call to
+ * get_state_synchronize_rcu(), just return.  Otherwise, invoke
+ * synchronize_rcu() to wait for a full grace period.
+ *
+ * Yes, this function does not take counter wrap into account.  But
+ * counter wrap is harmless.  If the counter wraps, we have waited for
+ * more than 2 billion grace periods (and way more on a 64-bit system!),
+ * so waiting for one additional grace period should be just fine.
+ */
+void cond_synchronize_rcu(unsigned long oldstate)
+{
+	unsigned long newstate;
+
+	/*
+	 * Ensure that this load happens before any RCU-destructive
+	 * actions the caller might carry out after we return.
+	 */
+	newstate = smp_load_acquire(&rcu_state->completed);
+	if (ULONG_CMP_GE(oldstate, newstate))
+		synchronize_rcu();
+}
+EXPORT_SYMBOL_GPL(cond_synchronize_rcu);
+
 static int synchronize_sched_expedited_cpu_stop(void *data)
 {
 	/*


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
  2014-03-18  2:45                   ` Paul E. McKenney
@ 2014-03-18  8:26                     ` Peter Zijlstra
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2014-03-18  8:26 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen

On Mon, Mar 17, 2014 at 07:45:22PM -0700, Paul E. McKenney wrote:
> > > Yep!  I will risk an ASCII diagram:
> > > 
> > > 
> > > 3:	                                                  +----gpnum----+-- ...
> > > 	                                                  |             |
> > > 2:	                            +----gpnum----+-------+--completed--+
> > > 	                            |             |
> > > 1:	      +----gpnum----+-------+--completed--+
> > > 	      |             |
> > > 0:	+-----+--completed--+
> > > 
> > > 
> > > A full RCU grace period happens between a pair of "|"s on the same line.
> > > By inspection, if your snapshot of ->gpnum is greater than the current
> > > value of ->completed, a grace period has passed.
> > 
> > OK, so I get the > part, but I'm not sure I get the = part of the above.
> > The way I read the diagram, when completed matches gpnum the grace
> > period is done and we don't have to wait anymore.
> 
> Absolutely not!  Let's try laying out the scenario:
> 
> 1.	Someone calls get_state_synchronize_rcu() when ->gpnum==->completed==0.
> 	It returns zero.
> 
> 2.	That someone immediately calls cond_synchronize_rcu().  Nothing
> 	has changed, so oldstate==newstate==0.
> 
> 	We had better call synchronize_rcu() in this case!!!

> Make sense?

Yes, should have seen that! Thanks for bearing with me on this.

> ------------------------------------------------------------------------
> 
> rcu: Provide grace-period piggybacking API
> 
> The following pattern is currently not well supported by RCU:
> 
> 1.	Make data element inaccessible to RCU readers.
> 
> 2.	Do work that probably lasts for more than one grace period.
> 
> 3.	Do something to make sure RCU readers in flight before #1 above
>     	have completed.
>     
> Here are some things that could currently be done:
> 
> a.	Do a synchronize_rcu() unconditionally at either #1 or #3 above.
>     	This works, but imposes needless work and latency.
>     
> b.	Post an RCU callback at #1 above that does a wakeup, then
>     	wait for the wakeup at #3.  This works well, but likely results
>     	in an extra unneeded grace period.  Open-coding this is also
>     	a bit trickier than would be good.
>     
> This commit therefore adds get_state_synchronize_rcu() and
> cond_synchronize_rcu() APIs.  Call get_state_synchronize_rcu() at #1
> above and pass its return value to cond_synchronize_rcu() at #3 above.
> This results in a call to synchronize_rcu() if no grace period has
> elapsed between #1 and #3, but requires only a load, comparison, and
> memory barrier if a full grace period did elapse.
> 
> Requested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Thanks!

Acked-by: Peter Zijlstra <peterz@infradead.org>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
  2014-03-14  9:50   ` Peter Zijlstra
  2014-03-14 20:47     ` Paul E. McKenney
@ 2014-05-07 12:35     ` Peter Zijlstra
  2014-05-07 18:06       ` Peter Zijlstra
  2014-05-19 12:48       ` [tip:perf/urgent] perf: Fix a race between ring_buffer_detach() and ring_buffer_attach() tip-bot for Peter Zijlstra
  1 sibling, 2 replies; 17+ messages in thread
From: Peter Zijlstra @ 2014-05-07 12:35 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen


On Fri, Mar 14, 2014 at 10:50:33AM +0100, Peter Zijlstra wrote:
> How about something like so that only does the sync_rcu() if really
> needed.

Paul reminded me we still have this issue open; I had another look and
the patch I proposed was completely broken, so I had to write me a new
one :-)

---
Subject: perf: Fix a race between ring_buffer_detach() and ring_buffer_attach()
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri, 14 Mar 2014 10:50:33 +0100

Alexander noticed that we use RCU iteration on rb->event_list but do
not use list_{add,del}_rcu() to add,remove entries to that list, nor
do we observe proper grace periods when re-using the entries.

Merge ring_buffer_detach() into ring_buffer_attach() such that
attaching to the NULL buffer is detaching.

Furthermore, ensure that between any 'detach' and 'attach' of the same
event we observe the required grace period, but only when strictly
required. In effect this means that only ioctl(.request =
PERF_EVENT_IOC_SET_OUTPUT) will wait for a grace period, while the
normal initial attach and final detach will not be delayed.

This patch should, I think, do the right thing under all
circumstances; the 'normal' cases should never see the extra grace
period, but the two cases:

 1) PERF_EVENT_IOC_SET_OUTPUT on an event which already has a
    ring_buffer set, will now observe the required grace period between
    removing itself from the old and attaching itself to the new buffer.

    This case is 'simple' in that both buffers are present in
    perf_event_set_output(), so one could think an unconditional
    synchronize_rcu() would be sufficient; however...

 2) an event that has a buffer attached, the buffer is destroyed
    (munmap) and then the event is attached to a new/different buffer
    using PERF_EVENT_IOC_SET_OUTPUT.

    This case is more complex because the buffer destruction does:
      ring_buffer_attach(.rb = NULL)
    followed by the ioctl() doing:
      ring_buffer_attach(.rb = foo);

    and we still need to observe the grace period between these two
    calls due to us reusing the event->rb_entry list_head.

In order to make 2 happen we use Paul's latest cond_synchronize_rcu()
call.
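
Concretely, for case 2 the flow ends up being roughly this (a sketch of
the relevant steps only, not the actual diff below):

	/* munmap() -> perf_mmap_close() -> ring_buffer_attach(event, NULL): */
	event->rcu_batches = get_state_synchronize_rcu();
	event->rcu_pending = 1;
	list_del_rcu(&event->rb_entry);		/* under the old buffer's event_lock */

	/* ... arbitrarily long gap ... */

	/* ioctl(PERF_EVENT_IOC_SET_OUTPUT) -> ring_buffer_attach(event, rb): */
	if (event->rcu_pending) {
		cond_synchronize_rcu(event->rcu_batches);  /* waits only if needed */
		event->rcu_pending = 0;
	}
	list_add_rcu(&event->rb_entry, &rb->event_list);   /* under rb->event_lock */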

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Reported-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/perf_event.h |    2 
 kernel/events/core.c       |  109 ++++++++++++++++++++-------------------------
 2 files changed, 51 insertions(+), 60 deletions(-)

--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -403,6 +403,8 @@ struct perf_event {
 
 	struct ring_buffer		*rb;
 	struct list_head		rb_entry;
+	unsigned long			rcu_batches;
+	int				rcu_pending;
 
 	/* poll related */
 	wait_queue_head_t		waitq;
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3179,7 +3179,8 @@ static void free_event_rcu(struct rcu_he
 }
 
 static void ring_buffer_put(struct ring_buffer *rb);
-static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb);
+static void ring_buffer_attach(struct perf_event *event,
+			       struct ring_buffer *rb);
 
 static void unaccount_event_cpu(struct perf_event *event, int cpu)
 {
@@ -3242,8 +3243,6 @@ static void free_event(struct perf_event
 	unaccount_event(event);
 
 	if (event->rb) {
-		struct ring_buffer *rb;
-
 		/*
 		 * Can happen when we close an event with re-directed output.
 		 *
@@ -3251,12 +3250,7 @@ static void free_event(struct perf_event
 		 * over us; possibly making our ring_buffer_put() the last.
 		 */
 		mutex_lock(&event->mmap_mutex);
-		rb = event->rb;
-		if (rb) {
-			rcu_assign_pointer(event->rb, NULL);
-			ring_buffer_detach(event, rb);
-			ring_buffer_put(rb); /* could be last */
-		}
+		ring_buffer_attach(event, NULL);
 		mutex_unlock(&event->mmap_mutex);
 	}
 
@@ -3843,28 +3837,47 @@ static int perf_mmap_fault(struct vm_are
 static void ring_buffer_attach(struct perf_event *event,
 			       struct ring_buffer *rb)
 {
+	struct ring_buffer *old_rb = NULL;
 	unsigned long flags;
 
-	if (!list_empty(&event->rb_entry))
-		return;
+	if (event->rb) {
+		/*
+		 * Should be impossible, we set this when removing
+		 * event->rb_entry and wait/clear when adding event->rb_entry.
+		 */
+		WARN_ON_ONCE(event->rcu_pending);
 
-	spin_lock_irqsave(&rb->event_lock, flags);
-	if (list_empty(&event->rb_entry))
-		list_add(&event->rb_entry, &rb->event_list);
-	spin_unlock_irqrestore(&rb->event_lock, flags);
-}
+		old_rb = event->rb;
+		event->rcu_batches = get_state_synchronize_rcu();
+		event->rcu_pending = 1;
 
-static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb)
-{
-	unsigned long flags;
+		spin_lock_irqsave(&rb->event_lock, flags);
+		list_del_rcu(&event->rb_entry);
+		spin_unlock_irqrestore(&rb->event_lock, flags);
+	}
 
-	if (list_empty(&event->rb_entry))
-		return;
+	if (event->rcu_pending && rb) {
+		cond_synchronize_rcu(event->rcu_batches);
+		event->rcu_pending = 0;
+	}
 
-	spin_lock_irqsave(&rb->event_lock, flags);
-	list_del_init(&event->rb_entry);
-	wake_up_all(&event->waitq);
-	spin_unlock_irqrestore(&rb->event_lock, flags);
+	if (rb) {
+		spin_lock_irqsave(&rb->event_lock, flags);
+		list_add_rcu(&event->rb_entry, &rb->event_list);
+		spin_unlock_irqrestore(&rb->event_lock, flags);
+	}
+
+	rcu_assign_pointer(event->rb, rb);
+
+	if (old_rb) {
+		ring_buffer_put(old_rb);
+		/*
+		 * Since we detached before setting the new rb, so that we
+		 * could attach the new rb, we could have missed a wakeup.
+		 * Provide it now.
+		 */
+		wake_up_all(&event->waitq);
+	}
 }
 
 static void ring_buffer_wakeup(struct perf_event *event)
@@ -3933,7 +3946,7 @@ static void perf_mmap_close(struct vm_ar
 {
 	struct perf_event *event = vma->vm_file->private_data;
 
-	struct ring_buffer *rb = event->rb;
+	struct ring_buffer *rb = ring_buffer_get(event);
 	struct user_struct *mmap_user = rb->mmap_user;
 	int mmap_locked = rb->mmap_locked;
 	unsigned long size = perf_data_size(rb);
@@ -3941,18 +3954,14 @@ static void perf_mmap_close(struct vm_ar
 	atomic_dec(&rb->mmap_count);
 
 	if (!atomic_dec_and_mutex_lock(&event->mmap_count, &event->mmap_mutex))
-		return;
+		goto out_put;
 
-	/* Detach current event from the buffer. */
-	rcu_assign_pointer(event->rb, NULL);
-	ring_buffer_detach(event, rb);
+	ring_buffer_attach(event, NULL);
 	mutex_unlock(&event->mmap_mutex);
 
 	/* If there's still other mmap()s of this buffer, we're done. */
-	if (atomic_read(&rb->mmap_count)) {
-		ring_buffer_put(rb); /* can't be last */
-		return;
-	}
+	if (atomic_read(&rb->mmap_count))
+		goto out_put;
 
 	/*
 	 * No other mmap()s, detach from all other events that might redirect
@@ -3982,11 +3991,9 @@ static void perf_mmap_close(struct vm_ar
 		 * still restart the iteration to make sure we're not now
 		 * iterating the wrong list.
 		 */
-		if (event->rb == rb) {
-			rcu_assign_pointer(event->rb, NULL);
-			ring_buffer_detach(event, rb);
-			ring_buffer_put(rb); /* can't be last, we still have one */
-		}
+		if (event->rb == rb)
+			ring_buffer_attach(event, NULL);
+
 		mutex_unlock(&event->mmap_mutex);
 		put_event(event);
 
@@ -4011,6 +4018,7 @@ static void perf_mmap_close(struct vm_ar
 	vma->vm_mm->pinned_vm -= mmap_locked;
 	free_uid(mmap_user);
 
+out_put:
 	ring_buffer_put(rb); /* could be last */
 }
 
@@ -4128,7 +4136,6 @@ static int perf_mmap(struct file *file,
 	vma->vm_mm->pinned_vm += extra;
 
 	ring_buffer_attach(event, rb);
-	rcu_assign_pointer(event->rb, rb);
 
 	perf_event_init_userpage(event);
 	perf_event_update_userpage(event);
@@ -6929,7 +6936,7 @@ static int perf_copy_attr(struct perf_ev
 static int
 perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
 {
-	struct ring_buffer *rb = NULL, *old_rb = NULL;
+	struct ring_buffer *rb = NULL;
 	int ret = -EINVAL;
 
 	if (!output_event)
@@ -6957,8 +6964,6 @@ perf_event_set_output(struct perf_event
 	if (atomic_read(&event->mmap_count))
 		goto unlock;
 
-	old_rb = event->rb;
-
 	if (output_event) {
 		/* get the rb we want to redirect to */
 		rb = ring_buffer_get(output_event);
@@ -6966,23 +6971,7 @@ perf_event_set_output(struct perf_event
 			goto unlock;
 	}
 
-	if (old_rb)
-		ring_buffer_detach(event, old_rb);
-
-	if (rb)
-		ring_buffer_attach(event, rb);
-
-	rcu_assign_pointer(event->rb, rb);
-
-	if (old_rb) {
-		ring_buffer_put(old_rb);
-		/*
-		 * Since we detached before setting the new rb, so that we
-		 * could attach the new rb, we could have missed a wakeup.
-		 * Provide it now.
-		 */
-		wake_up_all(&event->waitq);
-	}
+	ring_buffer_attach(event, rb);
 
 	ret = 0;
 unlock:


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
  2014-05-07 12:35     ` Peter Zijlstra
@ 2014-05-07 18:06       ` Peter Zijlstra
  2014-05-08 15:37         ` Paul E. McKenney
  2014-05-19 12:48       ` [tip:perf/urgent] perf: Fix a race between ring_buffer_detach() and ring_buffer_attach() tip-bot for Peter Zijlstra
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2014-05-07 18:06 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen


On Wed, May 07, 2014 at 02:35:26PM +0200, Peter Zijlstra wrote:
>  static void ring_buffer_attach(struct perf_event *event,
>  			       struct ring_buffer *rb)
>  {
> +	struct ring_buffer *old_rb = NULL;
>  	unsigned long flags;
>  
> +	if (event->rb) {
> +		/*
> +		 * Should be impossible, we set this when removing
> +		 * event->rb_entry and wait/clear when adding event->rb_entry.
> +		 */
> +		WARN_ON_ONCE(event->rcu_pending);
>  
> +		old_rb = event->rb;
> +		event->rcu_batches = get_state_synchronize_rcu();
> +		event->rcu_pending = 1;
>  
> +		spin_lock_irqsave(&rb->event_lock, flags);
> +		list_del_rcu(&event->rb_entry);
> +		spin_unlock_irqrestore(&rb->event_lock, flags);

This all works a whole lot better if you make that old_rb->event_lock.

> +	}
>  
> +	if (event->rcu_pending && rb) {
> +		cond_synchronize_rcu(event->rcu_batches);
> +		event->rcu_pending = 0;
> +	}
>  
> +	if (rb) {
> +		spin_lock_irqsave(&rb->event_lock, flags);
> +		list_add_rcu(&event->rb_entry, &rb->event_list);
> +		spin_unlock_irqrestore(&rb->event_lock, flags);
> +	}
> +
> +	rcu_assign_pointer(event->rb, rb);
> +
> +	if (old_rb) {
> +		ring_buffer_put(old_rb);
> +		/*
> +		 * Since we detached before setting the new rb, so that we
> +		 * could attach the new rb, we could have missed a wakeup.
> +		 * Provide it now.
> +		 */
> +		wake_up_all(&event->waitq);
> +	}
>  }


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
  2014-05-07 18:06       ` Peter Zijlstra
@ 2014-05-08 15:37         ` Paul E. McKenney
  2014-05-08 16:09           ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Paul E. McKenney @ 2014-05-08 15:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen

On Wed, May 07, 2014 at 08:06:21PM +0200, Peter Zijlstra wrote:
> On Wed, May 07, 2014 at 02:35:26PM +0200, Peter Zijlstra wrote:
> >  static void ring_buffer_attach(struct perf_event *event,
> >  			       struct ring_buffer *rb)
> >  {
> > +	struct ring_buffer *old_rb = NULL;
> >  	unsigned long flags;
> >  
> > +	if (event->rb) {
> > +		/*
> > +		 * Should be impossible, we set this when removing
> > +		 * event->rb_entry and wait/clear when adding event->rb_entry.
> > +		 */
> > +		WARN_ON_ONCE(event->rcu_pending);
> >  
> > +		old_rb = event->rb;
> > +		event->rcu_batches = get_state_synchronize_rcu();
> > +		event->rcu_pending = 1;
> >  
> > +		spin_lock_irqsave(&rb->event_lock, flags);
> > +		list_del_rcu(&event->rb_entry);
> > +		spin_unlock_irqrestore(&rb->event_lock, flags);
> 
> This all works a whole lot better if you make that old_rb->event_lock.
> 
> > +	}
> >  
> > +	if (event->rcu_pending && rb) {
> > +		cond_synchronize_rcu(event->rcu_batches);

There is not a whole lot of code between the get_state_synchronize_rcu()
and the cond_synchronize_rcu(), so I would expect this to do a
synchronize_rcu() almost all the time.  Or am I missing something here?

							Thanx, Paul

> > +		event->rcu_pending = 0;
> > +	}
> >  
> > +	if (rb) {
> > +		spin_lock_irqsave(&rb->event_lock, flags);
> > +		list_add_rcu(&event->rb_entry, &rb->event_list);
> > +		spin_unlock_irqrestore(&rb->event_lock, flags);
> > +	}
> > +
> > +	rcu_assign_pointer(event->rb, rb);
> > +
> > +	if (old_rb) {
> > +		ring_buffer_put(old_rb);
> > +		/*
> > +		 * Since we detached before setting the new rb, so that we
> > +		 * could attach the new rb, we could have missed a wakeup.
> > +		 * Provide it now.
> > +		 */
> > +		wake_up_all(&event->waitq);
> > +	}
> >  }



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup()
  2014-05-08 15:37         ` Paul E. McKenney
@ 2014-05-08 16:09           ` Peter Zijlstra
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2014-05-08 16:09 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Alexander Shishkin, Ingo Molnar, linux-kernel,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Stephane Eranian, Andi Kleen


On Thu, May 08, 2014 at 08:37:27AM -0700, Paul E. McKenney wrote:
> On Wed, May 07, 2014 at 08:06:21PM +0200, Peter Zijlstra wrote:
> > On Wed, May 07, 2014 at 02:35:26PM +0200, Peter Zijlstra wrote:
> > >  static void ring_buffer_attach(struct perf_event *event,
> > >  			       struct ring_buffer *rb)
> > >  {
> > > +	struct ring_buffer *old_rb = NULL;
> > >  	unsigned long flags;
> > >  
> > > +	if (event->rb) {
> > > +		/*
> > > +		 * Should be impossible, we set this when removing
> > > +		 * event->rb_entry and wait/clear when adding event->rb_entry.
> > > +		 */
> > > +		WARN_ON_ONCE(event->rcu_pending);
> > >  
> > > +		old_rb = event->rb;
> > > +		event->rcu_batches = get_state_synchronize_rcu();
> > > +		event->rcu_pending = 1;
> > >  
> > > +		spin_lock_irqsave(&rb->event_lock, flags);
> > > +		list_del_rcu(&event->rb_entry);
> > > +		spin_unlock_irqrestore(&rb->event_lock, flags);
> > 
> > This all works a whole lot better if you make that old_rb->event_lock.
> > 
> > > +	}
> > >  
> > > +	if (event->rcu_pending && rb) {
> > > +		cond_synchronize_rcu(event->rcu_batches);
> 
> There is not a whole lot of code between the get_state_synchronize_rcu()
> and the cond_synchronize_rcu(), so I would expect this to do a
> synchronize_rcu() almost all the time.  Or am I missing something here?

From the Changelog:

 2) an event that has a buffer attached, the buffer is destroyed
    (munmap) and then the event is attached to a new/different buffer
    using PERF_EVENT_IOC_SET_OUTPUT.

    This case is more complex because the buffer destruction does:
      ring_buffer_attach(.rb = NULL)
    followed by the ioctl() doing:
      ring_buffer_attach(.rb = foo);

    and we still need to observe the grace period between these two
    calls due to us reusing the event->rb_entry list_head.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [tip:perf/urgent] perf: Fix a race between ring_buffer_detach() and ring_buffer_attach()
  2014-05-07 12:35     ` Peter Zijlstra
  2014-05-07 18:06       ` Peter Zijlstra
@ 2014-05-19 12:48       ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 17+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-05-19 12:48 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, mingo, eranian, paulus, hpa, mingo, efault, peterz,
	alexander.shishkin, paulmck, fweisbec, ak, tglx

Commit-ID:  b69cf53640da2b86439596118cfa95233154ee76
Gitweb:     http://git.kernel.org/tip/b69cf53640da2b86439596118cfa95233154ee76
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Fri, 14 Mar 2014 10:50:33 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Mon, 19 May 2014 21:44:56 +0900

perf: Fix a race between ring_buffer_detach() and ring_buffer_attach()

Alexander noticed that we use RCU iteration on rb->event_list but do
not use list_{add,del}_rcu() to add,remove entries to that list, nor
do we observe proper grace periods when re-using the entries.

Merge ring_buffer_detach() into ring_buffer_attach() such that
attaching to the NULL buffer is detaching.

Furthermore, ensure that between any 'detach' and 'attach' of the same
event we observe the required grace period, but only when strictly
required. In effect this means that only ioctl(.request =
PERF_EVENT_IOC_SET_OUTPUT) will wait for a grace period, while the
normal initial attach and final detach will not be delayed.

This patch should, I think, do the right thing under all
circumstances; the 'normal' cases should never see the extra grace
period, but the two cases:

 1) PERF_EVENT_IOC_SET_OUTPUT on an event which already has a
    ring_buffer set, will now observe the required grace period between
    removing itself from the old and attaching itself to the new buffer.

    This case is 'simple' in that both buffers are present in
    perf_event_set_output(), so one could think an unconditional
    synchronize_rcu() would be sufficient; however...

 2) an event that has a buffer attached, the buffer is destroyed
    (munmap) and then the event is attached to a new/different buffer
    using PERF_EVENT_IOC_SET_OUTPUT.

    This case is more complex because the buffer destruction does:
      ring_buffer_attach(.rb = NULL)
    followed by the ioctl() doing:
      ring_buffer_attach(.rb = foo);

    and we still need to observe the grace period between these two
    calls due to us reusing the event->rb_entry list_head.

In order to make 2 happen we use Paul's latest cond_synchronize_rcu()
call.

Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Reported-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140507123526.GD13658@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/perf_event.h |   2 +
 kernel/events/core.c       | 109 ++++++++++++++++++++-------------------------
 2 files changed, 51 insertions(+), 60 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3356abc..3ef6ea1 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -402,6 +402,8 @@ struct perf_event {
 
 	struct ring_buffer		*rb;
 	struct list_head		rb_entry;
+	unsigned long			rcu_batches;
+	int				rcu_pending;
 
 	/* poll related */
 	wait_queue_head_t		waitq;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index feb1329..440eefc 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3192,7 +3192,8 @@ static void free_event_rcu(struct rcu_head *head)
 }
 
 static void ring_buffer_put(struct ring_buffer *rb);
-static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb);
+static void ring_buffer_attach(struct perf_event *event,
+			       struct ring_buffer *rb);
 
 static void unaccount_event_cpu(struct perf_event *event, int cpu)
 {
@@ -3252,8 +3253,6 @@ static void free_event(struct perf_event *event)
 	unaccount_event(event);
 
 	if (event->rb) {
-		struct ring_buffer *rb;
-
 		/*
 		 * Can happen when we close an event with re-directed output.
 		 *
@@ -3261,12 +3260,7 @@ static void free_event(struct perf_event *event)
 		 * over us; possibly making our ring_buffer_put() the last.
 		 */
 		mutex_lock(&event->mmap_mutex);
-		rb = event->rb;
-		if (rb) {
-			rcu_assign_pointer(event->rb, NULL);
-			ring_buffer_detach(event, rb);
-			ring_buffer_put(rb); /* could be last */
-		}
+		ring_buffer_attach(event, NULL);
 		mutex_unlock(&event->mmap_mutex);
 	}
 
@@ -3850,28 +3844,47 @@ unlock:
 static void ring_buffer_attach(struct perf_event *event,
 			       struct ring_buffer *rb)
 {
+	struct ring_buffer *old_rb = NULL;
 	unsigned long flags;
 
-	if (!list_empty(&event->rb_entry))
-		return;
+	if (event->rb) {
+		/*
+		 * Should be impossible, we set this when removing
+		 * event->rb_entry and wait/clear when adding event->rb_entry.
+		 */
+		WARN_ON_ONCE(event->rcu_pending);
 
-	spin_lock_irqsave(&rb->event_lock, flags);
-	if (list_empty(&event->rb_entry))
-		list_add(&event->rb_entry, &rb->event_list);
-	spin_unlock_irqrestore(&rb->event_lock, flags);
-}
+		old_rb = event->rb;
+		event->rcu_batches = get_state_synchronize_rcu();
+		event->rcu_pending = 1;
 
-static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb)
-{
-	unsigned long flags;
+		spin_lock_irqsave(&old_rb->event_lock, flags);
+		list_del_rcu(&event->rb_entry);
+		spin_unlock_irqrestore(&old_rb->event_lock, flags);
+	}
 
-	if (list_empty(&event->rb_entry))
-		return;
+	if (event->rcu_pending && rb) {
+		cond_synchronize_rcu(event->rcu_batches);
+		event->rcu_pending = 0;
+	}
 
-	spin_lock_irqsave(&rb->event_lock, flags);
-	list_del_init(&event->rb_entry);
-	wake_up_all(&event->waitq);
-	spin_unlock_irqrestore(&rb->event_lock, flags);
+	if (rb) {
+		spin_lock_irqsave(&rb->event_lock, flags);
+		list_add_rcu(&event->rb_entry, &rb->event_list);
+		spin_unlock_irqrestore(&rb->event_lock, flags);
+	}
+
+	rcu_assign_pointer(event->rb, rb);
+
+	if (old_rb) {
+		ring_buffer_put(old_rb);
+		/*
+		 * Since we detached before setting the new rb, so that we
+		 * could attach the new rb, we could have missed a wakeup.
+		 * Provide it now.
+		 */
+		wake_up_all(&event->waitq);
+	}
 }
 
 static void ring_buffer_wakeup(struct perf_event *event)
@@ -3940,7 +3953,7 @@ static void perf_mmap_close(struct vm_area_struct *vma)
 {
 	struct perf_event *event = vma->vm_file->private_data;
 
-	struct ring_buffer *rb = event->rb;
+	struct ring_buffer *rb = ring_buffer_get(event);
 	struct user_struct *mmap_user = rb->mmap_user;
 	int mmap_locked = rb->mmap_locked;
 	unsigned long size = perf_data_size(rb);
@@ -3948,18 +3961,14 @@ static void perf_mmap_close(struct vm_area_struct *vma)
 	atomic_dec(&rb->mmap_count);
 
 	if (!atomic_dec_and_mutex_lock(&event->mmap_count, &event->mmap_mutex))
-		return;
+		goto out_put;
 
-	/* Detach current event from the buffer. */
-	rcu_assign_pointer(event->rb, NULL);
-	ring_buffer_detach(event, rb);
+	ring_buffer_attach(event, NULL);
 	mutex_unlock(&event->mmap_mutex);
 
 	/* If there's still other mmap()s of this buffer, we're done. */
-	if (atomic_read(&rb->mmap_count)) {
-		ring_buffer_put(rb); /* can't be last */
-		return;
-	}
+	if (atomic_read(&rb->mmap_count))
+		goto out_put;
 
 	/*
 	 * No other mmap()s, detach from all other events that might redirect
@@ -3989,11 +3998,9 @@ again:
 		 * still restart the iteration to make sure we're not now
 		 * iterating the wrong list.
 		 */
-		if (event->rb == rb) {
-			rcu_assign_pointer(event->rb, NULL);
-			ring_buffer_detach(event, rb);
-			ring_buffer_put(rb); /* can't be last, we still have one */
-		}
+		if (event->rb == rb)
+			ring_buffer_attach(event, NULL);
+
 		mutex_unlock(&event->mmap_mutex);
 		put_event(event);
 
@@ -4018,6 +4025,7 @@ again:
 	vma->vm_mm->pinned_vm -= mmap_locked;
 	free_uid(mmap_user);
 
+out_put:
 	ring_buffer_put(rb); /* could be last */
 }
 
@@ -4135,7 +4143,6 @@ again:
 	vma->vm_mm->pinned_vm += extra;
 
 	ring_buffer_attach(event, rb);
-	rcu_assign_pointer(event->rb, rb);
 
 	perf_event_init_userpage(event);
 	perf_event_update_userpage(event);
@@ -6934,7 +6941,7 @@ err_size:
 static int
 perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
 {
-	struct ring_buffer *rb = NULL, *old_rb = NULL;
+	struct ring_buffer *rb = NULL;
 	int ret = -EINVAL;
 
 	if (!output_event)
@@ -6962,8 +6969,6 @@ set:
 	if (atomic_read(&event->mmap_count))
 		goto unlock;
 
-	old_rb = event->rb;
-
 	if (output_event) {
 		/* get the rb we want to redirect to */
 		rb = ring_buffer_get(output_event);
@@ -6971,23 +6976,7 @@ set:
 			goto unlock;
 	}
 
-	if (old_rb)
-		ring_buffer_detach(event, old_rb);
-
-	if (rb)
-		ring_buffer_attach(event, rb);
-
-	rcu_assign_pointer(event->rb, rb);
-
-	if (old_rb) {
-		ring_buffer_put(old_rb);
-		/*
-		 * Since we detached before setting the new rb, so that we
-		 * could attach the new rb, we could have missed a wakeup.
-		 * Provide it now.
-		 */
-		wake_up_all(&event->waitq);
-	}
+	ring_buffer_attach(event, rb);
 
 	ret = 0;
 unlock:

^ permalink raw reply related	[flat|nested] 17+ messages in thread

end of thread

Thread overview: 17+ messages
2014-03-07 13:38 [PATCH] [RFC] perf: Fix a race between ring_buffer_detach() and ring_buffer_wakeup() Alexander Shishkin
2014-03-13 19:58 ` Paul E. McKenney
2014-03-14  9:50   ` Peter Zijlstra
2014-03-14 20:47     ` Paul E. McKenney
2014-03-14 22:43       ` Peter Zijlstra
2014-03-14 23:02         ` Paul E. McKenney
2014-03-15  0:00           ` Paul E. McKenney
2014-03-17 11:18             ` Peter Zijlstra
2014-03-17 16:48               ` Paul E. McKenney
2014-03-17 17:30                 ` Peter Zijlstra
2014-03-18  2:45                   ` Paul E. McKenney
2014-03-18  8:26                     ` Peter Zijlstra
2014-05-07 12:35     ` Peter Zijlstra
2014-05-07 18:06       ` Peter Zijlstra
2014-05-08 15:37         ` Paul E. McKenney
2014-05-08 16:09           ` Peter Zijlstra
2014-05-19 12:48       ` [tip:perf/urgent] perf: Fix a race between ring_buffer_detach() and ring_buffer_attach() tip-bot for Peter Zijlstra
