linux-kernel.vger.kernel.org archive mirror
* [PATCH tip/sched/core] Add comments to aid in safer usage of swake_up.
@ 2017-06-09  3:25 Krister Johansen
  2017-06-09  7:19 ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Krister Johansen @ 2017-06-09  3:25 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Paul E. McKenney
  Cc: linux-kernel, Steven Rostedt, Paul Gortmaker, Thomas Gleixner

The behavior of swake_up() differs from that of wake_up(), and from the
swake_up() that came from RT linux. A memory barrier, or some other
synchronization, is needed prior to a swake_up so that the waiter sees
the condition set by the waker, and so that the waker does not see an
empty wait list.

Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
---
 include/linux/swait.h | 27 +++++++++++++++++++++++++++
 kernel/sched/swait.c  | 18 +++++++++++++++++-
 2 files changed, 44 insertions(+), 1 deletion(-)

 This came out of a discussion that Paul McKenney and I had about
 whether other callers of swake_up() knew that they needed to issue
 some kind of barrier prior to invoking swake_up().  In the case of
 wake_up(), the caller will always wake any waiters.  With the simple
 wait queues, a swait_active() check occurs locklessly prior to the
 wakeup, so additional synchronization may be needed on behalf of the
 caller.
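
 For example, a caller whose wakeup condition is set with a void atomic
 RMW would need roughly the following (the structure and field names
 here are made up purely for illustration):

	/* illustrative only; 'struct foo' and its fields are invented */
	static void kick_waiter(struct foo *f)
	{
		atomic_inc(&f->pending);	/* atomic_inc() implies no barrier */
		smp_mb__after_atomic();		/* order the increment before the
						 * lockless swait_active() in swake_up() */
		swake_up(&f->wait_head);
	}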

diff --git a/include/linux/swait.h b/include/linux/swait.h
index c1f9c62..fede974 100644
--- a/include/linux/swait.h
+++ b/include/linux/swait.h
@@ -79,6 +79,33 @@ extern void __init_swait_queue_head(struct swait_queue_head *q, const char *name
 	DECLARE_SWAIT_QUEUE_HEAD(name)
 #endif
 
+/**
+ * swait_active -- locklessly test for waiters on the queue
+ * @q: the swait_queue to test for waiters
+ *
+ * returns true if the wait list is not empty
+ *
+ * NOTE: this function is lockless and requires care, incorrect usage _will_
+ * lead to sporadic and non-obvious failure.
+ *
+ * Use either while holding swait_queue_head_t::lock or when used for wakeups
+ * with an extra smp_mb() like:
+ *
+ *      CPU0 - waker                    CPU1 - waiter
+ *
+ *      @cond = true;
+ *      smp_mb();
+ *      if (swait_active(wq))           swait_event(wq, cond);
+ *        swake_up(wq);
+ *
+ *
+ * Because without the explicit smp_mb() it's possible for the
+ * swait_active() load to get hoisted over the @cond store such that we'll
+ * observe an empty wait list while the waiter might not observe @cond.
+ *
+ * Also note that this 'optimization' trades a spin_lock() for an smp_mb(),
+ * which (when the lock is uncontended) are of roughly equal cost.
+ */
 static inline int swait_active(struct swait_queue_head *q)
 {
 	return !list_empty(&q->task_list);
diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c
index 3d5610d..6e949a8 100644
--- a/kernel/sched/swait.c
+++ b/kernel/sched/swait.c
@@ -29,6 +29,16 @@ void swake_up_locked(struct swait_queue_head *q)
 }
 EXPORT_SYMBOL(swake_up_locked);
 
+/**
+ * swake_up - wake up a waiter on a simple waitqueue
+ * @q: the simple wait queue head
+ *
+ * In order for this to function properly, since it uses swait_active()
+ * internally, some kind of memory barrier must occur prior to calling this.
+ * Typically, this will be smp_mb__after_atomic(), but if the value is
+ * manipulated non-atomically, one may need to use a less regular barrier, such
+ * as smp_mb().  spin_unlock() does not guarantee a memory barrier.
+ */
 void swake_up(struct swait_queue_head *q)
 {
 	unsigned long flags;
@@ -45,7 +55,13 @@ EXPORT_SYMBOL(swake_up);
 /*
  * Does not allow usage from IRQ disabled, since we must be able to
  * release IRQs to guarantee bounded hold time.
- */
+ *
+ * In order for this to function properly, since it uses swait_active()
+ * internally, some kind of memory barrier must occur prior to calling this.
+ * Typically, this will be smp_mb__after_atomic(), but if the value is
+ * manipulated non-atomically, one may need to use a less regular barrier, such
+ * as smp_mb().  spin_unlock() does not guarantee a memory barrier.
+ */
 void swake_up_all(struct swait_queue_head *q)
 {
 	struct swait_queue *curr;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH tip/sched/core] Add comments to aid in safer usage of swake_up.
  2017-06-09  3:25 [PATCH tip/sched/core] Add comments to aid in safer usage of swake_up Krister Johansen
@ 2017-06-09  7:19 ` Peter Zijlstra
  2017-06-09 12:45   ` Paul E. McKenney
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2017-06-09  7:19 UTC (permalink / raw)
  To: Krister Johansen
  Cc: Ingo Molnar, Paul E. McKenney, linux-kernel, Steven Rostedt,
	Paul Gortmaker, Thomas Gleixner

On Thu, Jun 08, 2017 at 08:25:46PM -0700, Krister Johansen wrote:
> The behavior of swake_up() differs from that of wake_up(), and from the
> swake_up() that came from RT linux. A memory barrier, or some other
> synchronization, is needed prior to a swake_up so that the waiter sees
> the condition set by the waker, and so that the waker does not see an
> empty wait list.

Urgh.. let me stare at that. But it sounds like the wrong solution since
we wanted to keep the wait and swait APIs as close as possible.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH tip/sched/core] Add comments to aid in safer usage of swake_up.
  2017-06-09  7:19 ` Peter Zijlstra
@ 2017-06-09 12:45   ` Paul E. McKenney
  2017-06-13 23:23     ` Steven Rostedt
  0 siblings, 1 reply; 16+ messages in thread
From: Paul E. McKenney @ 2017-06-09 12:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Krister Johansen, Ingo Molnar, linux-kernel, Steven Rostedt,
	Paul Gortmaker, Thomas Gleixner

On Fri, Jun 09, 2017 at 09:19:57AM +0200, Peter Zijlstra wrote:
> On Thu, Jun 08, 2017 at 08:25:46PM -0700, Krister Johansen wrote:
> > The behavior of swake_up() differs from that of wake_up(), and from the
> > swake_up() that came from RT linux. A memory barrier, or some other
> > synchronization, is needed prior to a swake_up so that the waiter sees
> > the condition set by the waker, and so that the waker does not see an
> > empty wait list.
> 
> Urgh.. let me stare at that. But it sounds like the wrong solution since
> we wanted to keep the wait and swait APIs as close as possible.

But don't they both need some sort of ordering, be it memory barriers or
locking, to handle the case where the wait/swait doesn't actually sleep?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH tip/sched/core] Add comments to aid in safer usage of swake_up.
  2017-06-09 12:45   ` Paul E. McKenney
@ 2017-06-13 23:23     ` Steven Rostedt
  2017-06-13 23:42       ` Paul E. McKenney
  0 siblings, 1 reply; 16+ messages in thread
From: Steven Rostedt @ 2017-06-13 23:23 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Krister Johansen, Ingo Molnar, linux-kernel,
	Paul Gortmaker, Thomas Gleixner

On Fri, 9 Jun 2017 05:45:54 -0700
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Fri, Jun 09, 2017 at 09:19:57AM +0200, Peter Zijlstra wrote:
> > On Thu, Jun 08, 2017 at 08:25:46PM -0700, Krister Johansen wrote:  
> > > The behavior of swake_up() differs from that of wake_up(), and from the
> > > swake_up() that came from RT linux. A memory barrier, or some other
> > > synchronization, is needed prior to a swake_up so that the waiter sees
> > > the condition set by the waker, and so that the waker does not see an
> > > empty wait list.  
> > 
> > Urgh.. let me stare at that. But it sounds like the wrong solution since
> > we wanted to keep the wait and swait APIs as close as possible.  
> 
> But don't they both need some sort of ordering, be it memory barriers or
> locking, to handle the case where the wait/swait doesn't actually sleep?
> 

Looking at an RCU example, and assuming that ordering can move around
within a spin lock region, and that changes can leak into a spin lock
region from both before and after, could we have:

(looking at __call_rcu_core() and rcu_gp_kthread())

	CPU0				CPU1
	----				----
				__call_rcu_core() {

				 spin_lock(rnp_root)
				 need_wake = __rcu_start_gp() {
				  rcu_start_gp_advanced() {
				   gp_flags = FLAG_INIT
				  }
				 }

 rcu_gp_kthread() {
   swait_event_interruptible(wq,
	gp_flags & FLAG_INIT) {
   spin_lock(q->lock)

				*fetch wq->task_list here! *

   list_add(wq->task_list, q->task_list)
   spin_unlock(q->lock);

   *fetch old value of gp_flags here *


				 spin_unlock(rnp_root)

				 rcu_gp_kthread_wake() {
				  swake_up(wq) {
				   swait_active(wq) {
				    list_empty(wq->task_list)

				   } * return false *

  if (condition) * false *
    schedule();

Looks like a memory barrier is missing. Perhaps we should slap one into
swait_active()? I don't think it is wise to let users add their own, as
I think we currently have bugs now.
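
Something like the following untested sketch against include/linux/swait.h,
for illustration only:

	static inline int swait_active(struct swait_queue_head *q)
	{
		/* order the waker's condition store before this
		 * lockless check of the wait list */
		smp_mb();
		return !list_empty(&q->task_list);
	}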

-- Steve

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH tip/sched/core] Add comments to aid in safer usage of swake_up.
  2017-06-13 23:23     ` Steven Rostedt
@ 2017-06-13 23:42       ` Paul E. McKenney
  2017-06-14  1:15         ` Steven Rostedt
  0 siblings, 1 reply; 16+ messages in thread
From: Paul E. McKenney @ 2017-06-13 23:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Krister Johansen, Ingo Molnar, linux-kernel,
	Paul Gortmaker, Thomas Gleixner

On Tue, Jun 13, 2017 at 07:23:08PM -0400, Steven Rostedt wrote:
> On Fri, 9 Jun 2017 05:45:54 -0700
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Fri, Jun 09, 2017 at 09:19:57AM +0200, Peter Zijlstra wrote:
> > > On Thu, Jun 08, 2017 at 08:25:46PM -0700, Krister Johansen wrote:  
> > > > The behavior of swake_up() differs from that of wake_up(), and from the
> > > > swake_up() that came from RT linux. A memory barrier, or some other
> > > > synchronization, is needed prior to a swake_up so that the waiter sees
> > > > the condition set by the waker, and so that the waker does not see an
> > > > empty wait list.  
> > > 
> > > Urgh.. let me stare at that. But it sounds like the wrong solution since
> > > we wanted to keep the wait and swait APIs as close as possible.  
> > 
> > But don't they both need some sort of ordering, be it memory barriers or
> > locking, to handle the case where the wait/swait doesn't actually sleep?
> > 
> 
> Looking at an RCU example, and assuming that ordering can move around
> within a spin lock, and that changes can leak into a spin lock region
> from both before and after. Could we have:
> 
> (looking at __call_rcu_core() and rcu_gp_kthread()
> 
> 	CPU0				CPU1
> 	----				----
> 				__call_rcu_core() {
> 
> 				 spin_lock(rnp_root)
> 				 need_wake = __rcu_start_gp() {
> 				  rcu_start_gp_advanced() {
> 				   gp_flags = FLAG_INIT
> 				  }
> 				 }
> 
>  rcu_gp_kthread() {
>    swait_event_interruptible(wq,
> 	gp_flags & FLAG_INIT) {
>    spin_lock(q->lock)
> 
> 				*fetch wq->task_list here! *
> 
>    list_add(wq->task_list, q->task_list)
>    spin_unlock(q->lock);
> 
>    *fetch old value of gp_flags here *

Both reads of ->gp_flags are READ_ONCE(), so having seen the new value
in swait_event_interruptible(), this task/CPU cannot see the old value
from some later access.  You have to have accesses to two different
variables to require a memory barrier (at least assuming consistent use
of READ_ONCE(), WRITE_ONCE(), or equivalent).

> 				 spin_unlock(rnp_root)
> 
> 				 rcu_gp_kthread_wake() {
> 				  swake_up(wq) {
> 				   swait_active(wq) {
> 				    list_empty(wq->task_list)
> 
> 				   } * return false *
> 
>   if (condition) * false *
>     schedule();
> 
> Looks like a memory barrier is missing. Perhaps we should slap on into
> swait_active()? I don't think it is wise to let users add there own, as
> I think we currently have bugs now.

I -know- I have bugs now.  ;-)

But I don't believe this is one of them.  Or am I getting confused?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH tip/sched/core] Add comments to aid in safer usage of swake_up.
  2017-06-13 23:42       ` Paul E. McKenney
@ 2017-06-14  1:15         ` Steven Rostedt
  2017-06-14  3:58           ` Paul E. McKenney
  0 siblings, 1 reply; 16+ messages in thread
From: Steven Rostedt @ 2017-06-14  1:15 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Krister Johansen, Ingo Molnar, linux-kernel,
	Paul Gortmaker, Thomas Gleixner

On Tue, 13 Jun 2017 16:42:05 -0700
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Tue, Jun 13, 2017 at 07:23:08PM -0400, Steven Rostedt wrote:
> > On Fri, 9 Jun 2017 05:45:54 -0700
> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> >   
> > > On Fri, Jun 09, 2017 at 09:19:57AM +0200, Peter Zijlstra wrote:  
> > > > On Thu, Jun 08, 2017 at 08:25:46PM -0700, Krister Johansen wrote:    
> > > > > The behavior of swake_up() differs from that of wake_up(), and from the
> > > > > swake_up() that came from RT linux. A memory barrier, or some other
> > > > > synchronization, is needed prior to a swake_up so that the waiter sees
> > > > > the condition set by the waker, and so that the waker does not see an
> > > > > empty wait list.    
> > > > 
> > > > Urgh.. let me stare at that. But it sounds like the wrong solution since
> > > > we wanted to keep the wait and swait APIs as close as possible.    
> > > 
> > > But don't they both need some sort of ordering, be it memory barriers or
> > > locking, to handle the case where the wait/swait doesn't actually sleep?
> > >   
> > 
> > Looking at an RCU example, and assuming that ordering can move around
> > within a spin lock, and that changes can leak into a spin lock region
> > from both before and after. Could we have:
> > 
> > (looking at __call_rcu_core() and rcu_gp_kthread()
> > 
> > 	CPU0				CPU1
> > 	----				----
> > 				__call_rcu_core() {
> > 
> > 				 spin_lock(rnp_root)
> > 				 need_wake = __rcu_start_gp() {
> > 				  rcu_start_gp_advanced() {
> > 				   gp_flags = FLAG_INIT
> > 				  }
> > 				 }
> > 
> >  rcu_gp_kthread() {
> >    swait_event_interruptible(wq,
> > 	gp_flags & FLAG_INIT) {
> >    spin_lock(q->lock)
> > 
> > 				*fetch wq->task_list here! *
> > 
> >    list_add(wq->task_list, q->task_list)
> >    spin_unlock(q->lock);
> > 
> >    *fetch old value of gp_flags here *  
> 
> Both reads of ->gp_flags are READ_ONCE(), so having seen the new value
> in swait_event_interruptible(), this task/CPU cannot see the old value
> from some later access.  You have to have accesses to two different
> variables to require a memory barrier (at least assuming consistent use
> of READ_ONCE(), WRITE_ONCE(), or equivalent).

If I'm not mistaken, READ_ONCE() and WRITE_ONCE() are just volatile
accesses. The compiler may not leak or move the fetches, but what
about the hardware?

A spin_lock() only needs to make sure what is after it does not leak
before it.

A spin_unlock() only needs to make sure what is before it must not leak
after it.

From my understanding of memory-barriers.txt, there's no guarantee
that the hardware doesn't let reads or writes that happen before a
spin_lock() happen after it. Nor does it guarantee that reads or
writes that happen after a spin_unlock() don't happen before it.

The spin_locks only need to protect the inside of the critical section,
not the outside of it leaking in.

I'm looking at this in particular:

====
  (1) ACQUIRE operation implication:

     Memory operations issued after the ACQUIRE will be completed after the
     ACQUIRE operation has completed.

     Memory operations issued before the ACQUIRE may be completed after
     the ACQUIRE operation has completed.  An smp_mb__before_spinlock(),
     combined with a following ACQUIRE, orders prior stores against
     subsequent loads and stores.  Note that this is weaker than smp_mb()!
     The smp_mb__before_spinlock() primitive is free on many architectures.

 (2) RELEASE operation implication:

     Memory operations issued before the RELEASE will be completed before the
     RELEASE operation has completed.

     Memory operations issued after the RELEASE may be completed before the
     RELEASE operation has completed.
====

-- Steve


> 
> > 				 spin_unlock(rnp_root)
> > 
> > 				 rcu_gp_kthread_wake() {
> > 				  swake_up(wq) {
> > 				   swait_active(wq) {
> > 				    list_empty(wq->task_list)
> > 
> > 				   } * return false *
> > 
> >   if (condition) * false *
> >     schedule();
> > 
> > Looks like a memory barrier is missing. Perhaps we should slap on into
> > swait_active()? I don't think it is wise to let users add there own, as
> > I think we currently have bugs now.  
> 
> I -know- I have bugs now.  ;-)
> 
> But I don't believe this is one of them.  Or am I getting confused?
> 
> 							Thanx, Paul

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH tip/sched/core] Add comments to aid in safer usage of swake_up.
  2017-06-14  1:15         ` Steven Rostedt
@ 2017-06-14  3:58           ` Paul E. McKenney
  2017-06-14 13:10             ` Steven Rostedt
  0 siblings, 1 reply; 16+ messages in thread
From: Paul E. McKenney @ 2017-06-14  3:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Krister Johansen, Ingo Molnar, linux-kernel,
	Paul Gortmaker, Thomas Gleixner

On Tue, Jun 13, 2017 at 09:15:47PM -0400, Steven Rostedt wrote:
> On Tue, 13 Jun 2017 16:42:05 -0700
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Tue, Jun 13, 2017 at 07:23:08PM -0400, Steven Rostedt wrote:
> > > On Fri, 9 Jun 2017 05:45:54 -0700
> > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > >   
> > > > On Fri, Jun 09, 2017 at 09:19:57AM +0200, Peter Zijlstra wrote:  
> > > > > On Thu, Jun 08, 2017 at 08:25:46PM -0700, Krister Johansen wrote:    
> > > > > > The behavior of swake_up() differs from that of wake_up(), and from the
> > > > > > swake_up() that came from RT linux. A memory barrier, or some other
> > > > > > synchronization, is needed prior to a swake_up so that the waiter sees
> > > > > > the condition set by the waker, and so that the waker does not see an
> > > > > > empty wait list.    
> > > > > 
> > > > > Urgh.. let me stare at that. But it sounds like the wrong solution since
> > > > > we wanted to keep the wait and swait APIs as close as possible.    
> > > > 
> > > > But don't they both need some sort of ordering, be it memory barriers or
> > > > locking, to handle the case where the wait/swait doesn't actually sleep?
> > > >   
> > > 
> > > Looking at an RCU example, and assuming that ordering can move around
> > > within a spin lock, and that changes can leak into a spin lock region
> > > from both before and after. Could we have:
> > > 
> > > (looking at __call_rcu_core() and rcu_gp_kthread()
> > > 
> > > 	CPU0				CPU1
> > > 	----				----
> > > 				__call_rcu_core() {
> > > 
> > > 				 spin_lock(rnp_root)
> > > 				 need_wake = __rcu_start_gp() {
> > > 				  rcu_start_gp_advanced() {
> > > 				   gp_flags = FLAG_INIT
> > > 				  }
> > > 				 }
> > > 
> > >  rcu_gp_kthread() {
> > >    swait_event_interruptible(wq,
> > > 	gp_flags & FLAG_INIT) {
> > >    spin_lock(q->lock)
> > > 
> > > 				*fetch wq->task_list here! *
> > > 
> > >    list_add(wq->task_list, q->task_list)
> > >    spin_unlock(q->lock);
> > > 
> > >    *fetch old value of gp_flags here *  
> > 
> > Both reads of ->gp_flags are READ_ONCE(), so having seen the new value
> > in swait_event_interruptible(), this task/CPU cannot see the old value
> > from some later access.  You have to have accesses to two different
> > variables to require a memory barrier (at least assuming consistent use
> > of READ_ONCE(), WRITE_ONCE(), or equivalent).
> 
> If I'm not mistaken, READ_ONCE() and WRITE_ONCE() is just volatiles
> added. The compiler may not leak or move the the fetches, but what
> about the hardware?

The hardware cannot move the references if both references are in the
same thread and to the same variable, which is the case with ->gp_flags.

> A spin_lock() only needs to make sure what is after it does not leak
> before it.
> 
> A spin_unlock() only needs to make sure what is before it must not leak
> after it.

Both true, with the exception of a spin_is_locked() to that same
lock variable, which cannot be reordered with either spin_lock() or
spin_unlock() in either direction.

> From my understandings of reading memory-barrier.txt, there's no
> guarantees that the hardware doesn't let reads or writes that happen
> before a spin_lock() happen after it. Nor does it guarantee that reads
> or writes that happen after a spin_unlock() doesn't happen before it.
> 
> The spin_locks only need to protect the inside of the critical section,
> not the outside of it leaking in.

Again, quite true.

> I'm looking at this in particular:
> 
> ====
>   (1) ACQUIRE operation implication:
> 
>      Memory operations issued after the ACQUIRE will be completed after the
>      ACQUIRE operation has completed.
> 
>      Memory operations issued before the ACQUIRE may be completed after
>      the ACQUIRE operation has completed.  An smp_mb__before_spinlock(),
>      combined with a following ACQUIRE, orders prior stores against
>      subsequent loads and stores.  Note that this is weaker than smp_mb()!
>      The smp_mb__before_spinlock() primitive is free on many architectures.
> 
>  (2) RELEASE operation implication:
> 
>      Memory operations issued before the RELEASE will be completed before the
>      RELEASE operation has completed.
> 
>      Memory operations issued after the RELEASE may be completed before the
>      RELEASE operation has completed.
> ====

And here is the part you also need to look at:

====

 (*) Overlapping loads and stores within a particular CPU will appear to be
     ordered within that CPU.  This means that for:

	a = READ_ONCE(*X); WRITE_ONCE(*X, b);

     the CPU will only issue the following sequence of memory operations:

	a = LOAD *X, STORE *X = b

     And for:

	WRITE_ONCE(*X, c); d = READ_ONCE(*X);

     the CPU will only issue:

	STORE *X = c, d = LOAD *X

     (Loads and stores overlap if they are targeted at overlapping pieces of
     memory).

====

This section needs some help -- the actual guarantee is stronger, that
all CPUs will agree on the order of volatile same-sized aligned accesses
to a given single location.  So if a previous READ_ONCE() sees the new
value, any subsequent READ_ONCE() from that same variable is guaranteed
to also see the new value (or some later value).
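
For example (a sketch, with X initially zero and some other CPU doing
WRITE_ONCE(X, 1) at some point):

	r1 = READ_ONCE(X);	/* suppose this returns 1 (the new value) */
	r2 = READ_ONCE(X);	/* then this cannot return 0; it returns 1
				 * or some later value of X */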

Does that help, or am I missing something here?

							Thanx, Paul

> -- Steve
> 
> 
> > 
> > > 				 spin_unlock(rnp_root)
> > > 
> > > 				 rcu_gp_kthread_wake() {
> > > 				  swake_up(wq) {
> > > 				   swait_active(wq) {
> > > 				    list_empty(wq->task_list)
> > > 
> > > 				   } * return false *
> > > 
> > >   if (condition) * false *
> > >     schedule();
> > > 
> > > Looks like a memory barrier is missing. Perhaps we should slap on into
> > > swait_active()? I don't think it is wise to let users add there own, as
> > > I think we currently have bugs now.  
> > 
> > I -know- I have bugs now.  ;-)
> > 
> > But I don't believe this is one of them.  Or am I getting confused?
> > 
> > 							Thanx, Paul
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH tip/sched/core] Add comments to aid in safer usage of swake_up.
  2017-06-14  3:58           ` Paul E. McKenney
@ 2017-06-14 13:10             ` Steven Rostedt
  2017-06-14 15:02               ` Steven Rostedt
  2017-06-14 15:55               ` [PATCH tip/sched/core] Add comments to aid in safer usage of swake_up Paul E. McKenney
  0 siblings, 2 replies; 16+ messages in thread
From: Steven Rostedt @ 2017-06-14 13:10 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Krister Johansen, Ingo Molnar, linux-kernel,
	Paul Gortmaker, Thomas Gleixner

On Tue, 13 Jun 2017 20:58:43 -0700
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> And here is the part you also need to look at:

Why? We are talking about two different, unrelated variables modified
on two different CPUs. I don't see where the overlap is.

> 
> ====
> 
>  (*) Overlapping loads and stores within a particular CPU will appear to be
>      ordered within that CPU.  This means that for:
> 
> 	a = READ_ONCE(*X); WRITE_ONCE(*X, b);
> 
>      the CPU will only issue the following sequence of memory operations:
> 
> 	a = LOAD *X, STORE *X = b
> 
>      And for:
> 
> 	WRITE_ONCE(*X, c); d = READ_ONCE(*X);
> 
>      the CPU will only issue:
> 
> 	STORE *X = c, d = LOAD *X
> 
>      (Loads and stores overlap if they are targeted at overlapping pieces of
>      memory).
> 
> ====
> 
> This section needs some help -- the actual guarantee is stronger, that
> all CPUs will agree on the order of volatile same-sized aligned accesses
> to a given single location.  So if a previous READ_ONCE() sees the new
> value, any subsequent READ_ONCE() from that same variable is guaranteed
> to also see the new value (or some later value).
> 
> Does that help, or am I missing something here?

Maybe I'm missing something. Let me rewrite what I first wrote, and
then abstract it into a simpler version:

Here's what I first wrote:

(looking at __call_rcu_core() and rcu_gp_kthread())

	CPU0				CPU1
	----				----
				__call_rcu_core() {

				 spin_lock(rnp_root)
				 need_wake = __rcu_start_gp() {
				  rcu_start_gp_advanced() {
				   gp_flags = FLAG_INIT
				  }
				 }

 rcu_gp_kthread() {
   swait_event_interruptible(wq,
	gp_flags & FLAG_INIT) {
   spin_lock(q->lock)

				*fetch wq->task_list here! *

   list_add(wq->task_list, q->task_list)
   spin_unlock(q->lock);

   *fetch old value of gp_flags here *


				 spin_unlock(rnp_root)

				 rcu_gp_kthread_wake() {
				  swake_up(wq) {
				   swait_active(wq) {
				    list_empty(wq->task_list)

				   } * return false *

  if (condition) * false *
    schedule();


Now let's make it simpler. I'll even add the READ_ONCE and WRITE_ONCE
where applicable.


	CPU0				CPU1
	----				----
				LOCK(A)

 LOCK(B)
				 WRITE_ONCE(X, INIT)

				 (the cpu may postpone writing X)

				 (the cpu can fetch wq list here)
  list_add(wq, q)

 UNLOCK(B)

 (the cpu may fetch old value of X)

				 (write of X happens here)

 if (READ_ONCE(X) != init)
   schedule();

				UNLOCK(A)

				 if (list_empty(wq))
				   return;

Tell me again how the READ_ONCE() and WRITE_ONCE() help in this
scenario?

Because we are using spinlocks, this won't be an issue for most
architectures. The bug happens if the fetch for the list_empty() check
leaks to before the UNLOCK(A).

If the reading/writing of the list and the reading/writing of gp_flags
get reversed in either direction by the CPU, then we have a problem.

-- Steve

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH tip/sched/core] Add comments to aid in safer usage of swake_up.
  2017-06-14 13:10             ` Steven Rostedt
@ 2017-06-14 15:02               ` Steven Rostedt
  2017-06-14 16:25                 ` Krister Johansen
  2017-06-14 15:55               ` [PATCH tip/sched/core] Add comments to aid in safer usage of swake_up Paul E. McKenney
  1 sibling, 1 reply; 16+ messages in thread
From: Steven Rostedt @ 2017-06-14 15:02 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Krister Johansen, Ingo Molnar, linux-kernel,
	Paul Gortmaker, Thomas Gleixner

On Wed, 14 Jun 2017 09:10:15 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> Now let's make it simpler. I'll even add the READ_ONCE and WRITE_ONCE
> where applicable.
> 
> 
> 	CPU0				CPU1
> 	----				----
> 				LOCK(A)
> 
>  LOCK(B)
> 				 WRITE_ONCE(X, INIT)
> 
> 				 (the cpu may postpone writing X)
> 
> 				 (the cpu can fetch wq list here)
>   list_add(wq, q)
> 
>  UNLOCK(B)
> 
>  (the cpu may fetch old value of X)
> 
> 				 (write of X happens here)
> 
>  if (READ_ONCE(X) != init)
>    schedule();
> 
> 				UNLOCK(A)
> 
> 				 if (list_empty(wq))
> 				   return;
> 
> Tell me again how the READ_ONCE() and WRITE_ONCE() helps in this
> scenario?
> 
> Because we are using spinlocks, this wont be an issue for most
> architectures. The bug happens if the fetching of the list_empty()
> leaks into before the UNLOCK(A).
> 
> If the reading/writing of the list and the reading/writing of gp_flags
> gets reversed in either direction by the CPU, then we have a problem.

FYI..

Both sides need a memory barrier. Otherwise, even with a memory barrier
on CPU1 we can still have:


	CPU0				CPU1
	----				----

				LOCK(A)
 LOCK(B)

 list_add(wq, q)

 (cpu waits to write wq list)

 (cpu fetches X)

				 WRITE_ONCE(X, INIT)

				UNLOCK(A)

				smp_mb();

				if (list_empty(wq))
				   return;

 (cpu writes wq list)

 UNLOCK(B)

 if (READ_ONCE(X) != INIT)
   schedule()


Luckily for us, there is a memory barrier on CPU0. In
prepare_to_swait() we have:

	raw_spin_lock_irqsave(&q->lock, flags);
	__prepare_to_swait(q, wait);
	set_current_state(state);
	raw_spin_unlock_irqrestore(&q->lock, flags);

And that set_current_state() call includes a memory barrier, which will
prevent the above from happening, as the addition to the wq list must
be flushed before fetching X.
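
For reference, set_current_state() at this point boils down to roughly
the following (ignoring the lockdep debug variant):

	#define set_current_state(state_value)				\
		smp_store_mb(current->state, (state_value))

	/* i.e. a WRITE_ONCE() of ->state followed by a full smp_mb() */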

I still strongly believe that the swait_active() requires a memory
barrier.

-- Steve

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH tip/sched/core] Add comments to aid in safer usage of swake_up.
  2017-06-14 13:10             ` Steven Rostedt
  2017-06-14 15:02               ` Steven Rostedt
@ 2017-06-14 15:55               ` Paul E. McKenney
  1 sibling, 0 replies; 16+ messages in thread
From: Paul E. McKenney @ 2017-06-14 15:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Krister Johansen, Ingo Molnar, linux-kernel,
	Paul Gortmaker, Thomas Gleixner

On Wed, Jun 14, 2017 at 09:10:15AM -0400, Steven Rostedt wrote:
> On Tue, 13 Jun 2017 20:58:43 -0700
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > And here is the part you also need to look at:
> 
> Why? We are talking about two different, unrelated variables modified
> on two different CPUs. I don't see where the overlap is.

It does sound like we are talking past each other.

Please see below for how I was interpreting your sequence of events.

> > ====
> > 
> >  (*) Overlapping loads and stores within a particular CPU will appear to be
> >      ordered within that CPU.  This means that for:
> > 
> > 	a = READ_ONCE(*X); WRITE_ONCE(*X, b);
> > 
> >      the CPU will only issue the following sequence of memory operations:
> > 
> > 	a = LOAD *X, STORE *X = b
> > 
> >      And for:
> > 
> > 	WRITE_ONCE(*X, c); d = READ_ONCE(*X);
> > 
> >      the CPU will only issue:
> > 
> > 	STORE *X = c, d = LOAD *X
> > 
> >      (Loads and stores overlap if they are targeted at overlapping pieces of
> >      memory).
> > 
> > ====
> > 
> > This section needs some help -- the actual guarantee is stronger, that
> > all CPUs will agree on the order of volatile same-sized aligned accesses
> > to a given single location.  So if a previous READ_ONCE() sees the new
> > value, any subsequent READ_ONCE() from that same variable is guaranteed
> > to also see the new value (or some later value).
> > 
> > Does that help, or am I missing something here?
> 
> Maybe I'm missing something. Let me rewrite what I first wrote, and
> then abstract it into a simpler version:
> 
> Here's what I first wrote:
> 
> (looking at __call_rcu_core() and rcu_gp_kthread()
> 
> 	CPU0				CPU1
> 	----				----
> 				__call_rcu_core() {
> 
> 				 spin_lock(rnp_root)
> 				 need_wake = __rcu_start_gp() {
> 				  rcu_start_gp_advanced() {
> 				   gp_flags = FLAG_INIT
> 				  }
> 				 }
> 
>  rcu_gp_kthread() {
>    swait_event_interruptible(wq,
> 	gp_flags & FLAG_INIT) {

This is the first access to ->gp_flags from rcu_gp_kthread().

>    spin_lock(q->lock)
> 
> 				*fetch wq->task_list here! *
> 
>    list_add(wq->task_list, q->task_list)
>    spin_unlock(q->lock);
> 
>    *fetch old value of gp_flags here *

This is the second access to ->gp_flags.

Since you are saying that ->gp_flags is only accessed once, perhaps
this code from spin_lock() down is intended to be an expansion of
swait_event_interruptible()?

#define swait_event_interruptible(wq, condition)			\
({									\
	int __ret = 0;							\
	if (!(condition))						\
		__ret = __swait_event_interruptible(wq, condition);	\
	__ret;								\
})

But no, in this case, we have the macro argument named "condition"
accessing ->gp_flags, and a control dependency forcing that access to
precede the spin_lock() in __prepare_to_swait().  We cannot acquire the
spinlock unless the condition is false, that is, the old value is fetched.
So there is a first fetch of ->gp_flags that is constrained to happen
before the spin_lock().  Any fetch of ->gp_flags after the spin_unlock()
must therefore be a second fetch.  Which of course might still get the
old value because the update to ->gp_flags might not have propagated yet.

But it appears that you are worried about something else.

> 				 spin_unlock(rnp_root)
> 
> 				 rcu_gp_kthread_wake() {
> 				  swake_up(wq) {
> 				   swait_active(wq) {
> 				    list_empty(wq->task_list)

We don't hold q->lock here, so I am guessing that your concern is that
we aren't guaranteed to see the above list_add().

Is that the case?

If so, your suggested fix is to place an smp_mb() between
swait_event_interruptible()'s access to "condition" and
__prepare_to_swait()'s list_add(), correct?  And also an
smp_mb() before swake_up()'s call to swait_active(), correct?

The second smp_mb() could be placed by the user, but the first
one cannot, at least not reasonably.
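
That is, the second barrier would sit in swake_up() itself, something
like this sketch (mirroring the current body of swake_up()):

	void swake_up(struct swait_queue_head *q)
	{
		unsigned long flags;

		smp_mb();	/* order the caller's condition store before
				 * the lockless swait_active() check below */
		if (!swait_active(q))
			return;

		raw_spin_lock_irqsave(&q->lock, flags);
		swake_up_locked(q);
		raw_spin_unlock_irqrestore(&q->lock, flags);
	}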

So did I get the point eventually?  ;-)

							Thanx, Paul

> 				   } * return false *
> 
>   if (condition) * false *
>     schedule();
> 
> 
> Now let's make it simpler. I'll even add the READ_ONCE and WRITE_ONCE
> where applicable.
> 
> 
> 	CPU0				CPU1
> 	----				----
> 				LOCK(A)
> 
>  LOCK(B)
> 				 WRITE_ONCE(X, INIT)
> 
> 				 (the cpu may postpone writing X)
> 
> 				 (the cpu can fetch wq list here)
>   list_add(wq, q)
> 
>  UNLOCK(B)
> 
>  (the cpu may fetch old value of X)
> 
> 				 (write of X happens here)
> 
>  if (READ_ONCE(X) != init)
>    schedule();
> 
> 				UNLOCK(A)
> 
> 				 if (list_empty(wq))
> 				   return;
> 
> Tell me again how the READ_ONCE() and WRITE_ONCE() helps in this
> scenario?
> 
> Because we are using spinlocks, this wont be an issue for most
> architectures. The bug happens if the fetching of the list_empty()
> leaks into before the UNLOCK(A).
> 
> If the reading/writing of the list and the reading/writing of gp_flags
> gets reversed in either direction by the CPU, then we have a problem.
> 
> -- Steve
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH tip/sched/core] Add comments to aid in safer usage of swake_up.
  2017-06-14 15:02               ` Steven Rostedt
@ 2017-06-14 16:25                 ` Krister Johansen
  2017-06-15  4:18                   ` Boqun Feng
  0 siblings, 1 reply; 16+ messages in thread
From: Krister Johansen @ 2017-06-14 16:25 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, Peter Zijlstra, Krister Johansen, Ingo Molnar,
	linux-kernel, Paul Gortmaker, Thomas Gleixner

On Wed, Jun 14, 2017 at 11:02:40AM -0400, Steven Rostedt wrote:
> On Wed, 14 Jun 2017 09:10:15 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > Now let's make it simpler. I'll even add the READ_ONCE and WRITE_ONCE
> > where applicable.
> > 
> > 
> > 	CPU0				CPU1
> > 	----				----
> > 				LOCK(A)
> > 
> >  LOCK(B)
> > 				 WRITE_ONCE(X, INIT)
> > 
> > 				 (the cpu may postpone writing X)
> > 
> > 				 (the cpu can fetch wq list here)
> >   list_add(wq, q)
> > 
> >  UNLOCK(B)
> > 
> >  (the cpu may fetch old value of X)
> > 
> > 				 (write of X happens here)
> > 
> >  if (READ_ONCE(X) != init)
> >    schedule();
> > 
> > 				UNLOCK(A)
> > 
> > 				 if (list_empty(wq))
> > 				   return;
> > 
> > Tell me again how the READ_ONCE() and WRITE_ONCE() helps in this
> > scenario?
> > 
> > Because we are using spinlocks, this wont be an issue for most
> > architectures. The bug happens if the fetching of the list_empty()
> > leaks into before the UNLOCK(A).
> > 
> > If the reading/writing of the list and the reading/writing of gp_flags
> > gets reversed in either direction by the CPU, then we have a problem.
> 
> FYI..
> 
> Both sides need a memory barrier. Otherwise, even with a memory barrier
> on CPU1 we can still have:
> 
> 
> 	CPU0				CPU1
> 	----				----
> 
> 				LOCK(A)
>  LOCK(B)
> 
>  list_add(wq, q)
> 
>  (cpu waits to write wq list)
> 
>  (cpu fetches X)
> 
> 				 WRITE_ONCE(X, INIT)
> 
> 				UNLOCK(A)
> 
> 				smp_mb();
> 
> 				if (list_empty(wq))
> 				   return;
> 
>  (cpu writes wq list)
> 
>  UNLOCK(B)
> 
>  if (READ_ONCE(X) != INIT)
>    schedule()
> 
> 
> Luckily for us, there is a memory barrier on CPU0. In
> prepare_to_swait() we have:
> 
> 	raw_spin_lock_irqsave(&q->lock, flags);
> 	__prepare_to_swait(q, wait);
> 	set_current_state(state);
> 	raw_spin_unlock_irqrestore(&q->lock, flags);
> 
> And that set_current_state() call includes a memory barrier, which will
> prevent the above from happening, as the addition to the wq list must
> be flushed before fetching X.
> 
> I still strongly believe that the swait_active() requires a memory
> barrier.

FWLIW, I agree.  There was an smp_mb() in RT-linux's equivalent of
swait_activate().

https://www.spinics.net/lists/linux-rt-users/msg10340.html

If the barrier goes in swait_active() then we don't have to require all
of the callers of swait_active and swake_up to issue the barrier
instead.  Handling this in swait_active is likely to be less error
prone.  Though, we could also do something like wq_has_sleeper() and use
that preferentially in swake_up and its variants.
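
A sketch of that last idea, modelled on wq_has_sleeper() (the name and
placement here are illustrative only):

	static inline bool swq_has_sleeper(struct swait_queue_head *wq)
	{
		/*
		 * Pairs with the barrier implied by set_current_state()
		 * in prepare_to_swait(): order the waker's condition
		 * store before the lockless check of the wait list.
		 */
		smp_mb();
		return swait_active(wq);
	}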

-K

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH tip/sched/core] Add comments to aid in safer usage of swake_up.
  2017-06-14 16:25                 ` Krister Johansen
@ 2017-06-15  4:18                   ` Boqun Feng
  2017-06-15 17:56                     ` Paul E. McKenney
  2017-08-10 12:10                     ` [tip:locking/core] sched/wait: Remove the lockless swait_active() check in swake_up*() tip-bot for Boqun Feng
  0 siblings, 2 replies; 16+ messages in thread
From: Boqun Feng @ 2017-06-15  4:18 UTC (permalink / raw)
  To: Krister Johansen
  Cc: Steven Rostedt, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	linux-kernel, Paul Gortmaker, Thomas Gleixner

On Wed, Jun 14, 2017 at 09:25:58AM -0700, Krister Johansen wrote:
> On Wed, Jun 14, 2017 at 11:02:40AM -0400, Steven Rostedt wrote:
> > On Wed, 14 Jun 2017 09:10:15 -0400
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> > > Now let's make it simpler. I'll even add the READ_ONCE and WRITE_ONCE
> > > where applicable.
> > > 
> > > 
> > > 	CPU0				CPU1
> > > 	----				----
> > > 				LOCK(A)
> > > 
> > >  LOCK(B)
> > > 				 WRITE_ONCE(X, INIT)
> > > 
> > > 				 (the cpu may postpone writing X)
> > > 
> > > 				 (the cpu can fetch wq list here)
> > >   list_add(wq, q)
> > > 
> > >  UNLOCK(B)
> > > 
> > >  (the cpu may fetch old value of X)
> > > 
> > > 				 (write of X happens here)
> > > 
> > >  if (READ_ONCE(X) != init)
> > >    schedule();
> > > 
> > > 				UNLOCK(A)
> > > 
> > > 				 if (list_empty(wq))
> > > 				   return;
> > > 
> > > Tell me again how the READ_ONCE() and WRITE_ONCE() helps in this
> > > scenario?
> > > 
> > > Because we are using spinlocks, this wont be an issue for most
> > > architectures. The bug happens if the fetching of the list_empty()
> > > leaks into before the UNLOCK(A).
> > > 
> > > If the reading/writing of the list and the reading/writing of gp_flags
> > > gets reversed in either direction by the CPU, then we have a problem.
> > 
> > FYI..
> > 
> > Both sides need a memory barrier. Otherwise, even with a memory barrier
> > on CPU1 we can still have:
> > 
> > 
> > 	CPU0				CPU1
> > 	----				----
> > 
> > 				LOCK(A)
> >  LOCK(B)
> > 
> >  list_add(wq, q)
> > 
> >  (cpu waits to write wq list)
> > 
> >  (cpu fetches X)
> > 
> > 				 WRITE_ONCE(X, INIT)
> > 
> > 				UNLOCK(A)
> > 
> > 				smp_mb();
> > 
> > 				if (list_empty(wq))
> > 				   return;
> > 
> >  (cpu writes wq list)
> > 
> >  UNLOCK(B)
> > 
> >  if (READ_ONCE(X) != INIT)
> >    schedule()
> > 
> > 
> > Luckily for us, there is a memory barrier on CPU0. In
> > prepare_to_swait() we have:
> > 
> > 	raw_spin_lock_irqsave(&q->lock, flags);
> > 	__prepare_to_swait(q, wait);
> > 	set_current_state(state);
> > 	raw_spin_unlock_irqrestore(&q->lock, flags);
> > 
> > And that set_current_state() call includes a memory barrier, which will
> > prevent the above from happening, as the addition to the wq list must
> > be flushed before fetching X.
> > 
> > I still strongly believe that the swait_active() requires a memory
> > barrier.
> 
> FWLIW, I agree.  There was a smb_mb() in RT-linux's equivalent of
> swait_activate().
> 
> https://www.spinics.net/lists/linux-rt-users/msg10340.html
> 
> If the barrier goes in swait_active() then we don't have to require all
> of the callers of swait_active and swake_up to issue the barrier
> instead.  Handling this in swait_active is likely to be less error
> prone.  Though, we could also do something like wq_has_sleeper() and use
> that preferentially in swake_up and its variants.
> 

I think it makes more sense to delete the swait_active() check in
swake_up(). We seem to encourage users to do the quick check on the
wait queue on their own, so why do the check again in swake_up()?
Besides, wake_up() doesn't call waitqueue_active() outside the lock
critical section either.

So how about the patch below (testing is in progress)? Peter?

Regards,
Boqun

--------------------->8
Subject: [PATCH] swait: Remove the lockless swait_active() check in
 swake_up*()

Steven Rostedt reported a potential race in RCU core because of
swake_up():

        CPU0                            CPU1
        ----                            ----
                                __call_rcu_core() {

                                 spin_lock(rnp_root)
                                 need_wake = __rcu_start_gp() {
                                  rcu_start_gp_advanced() {
                                   gp_flags = FLAG_INIT
                                  }
                                 }

 rcu_gp_kthread() {
   swait_event_interruptible(wq,
        gp_flags & FLAG_INIT) {
   spin_lock(q->lock)

                                *fetch wq->task_list here! *

   list_add(wq->task_list, q->task_list)
   spin_unlock(q->lock);

   *fetch old value of gp_flags here *

                                 spin_unlock(rnp_root)

                                 rcu_gp_kthread_wake() {
                                  swake_up(wq) {
                                   swait_active(wq) {
                                    list_empty(wq->task_list)

                                   } * return false *

  if (condition) * false *
    schedule();

In this case, a wakeup is missed, which could cause the rcu_gp_kthread
to wait for a long time.

The reason for this is that we do a lockless swait_active() check in
swake_up(). To fix this, we can either 1) add an smp_mb() in swake_up()
before swait_active() to provide the proper ordering, or 2) simply
remove the swait_active() check in swake_up().

Solution 2 not only fixes this problem but also keeps the swait and
wait APIs as close as possible, as wake_up() doesn't provide a full
barrier and doesn't do a lockless check of the wait queue either.
Moreover, there are users already using swait_active() to do their own
quick checks of the wait queues, so it makes less sense for swake_up()
and swake_up_all() to do this on their own.

This patch then removes the lockless swait_active() check in swake_up()
and swake_up_all().

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 kernel/sched/swait.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c
index 3d5610dcce11..2227e183e202 100644
--- a/kernel/sched/swait.c
+++ b/kernel/sched/swait.c
@@ -33,9 +33,6 @@ void swake_up(struct swait_queue_head *q)
 {
 	unsigned long flags;
 
-	if (!swait_active(q))
-		return;
-
 	raw_spin_lock_irqsave(&q->lock, flags);
 	swake_up_locked(q);
 	raw_spin_unlock_irqrestore(&q->lock, flags);
@@ -51,9 +48,6 @@ void swake_up_all(struct swait_queue_head *q)
 	struct swait_queue *curr;
 	LIST_HEAD(tmp);
 
-	if (!swait_active(q))
-		return;
-
 	raw_spin_lock_irq(&q->lock);
 	list_splice_init(&q->task_list, &tmp);
 	while (!list_empty(&tmp)) {
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH tip/sched/core] Add comments to aid in safer usage of swake_up.
  2017-06-15  4:18                   ` Boqun Feng
@ 2017-06-15 17:56                     ` Paul E. McKenney
  2017-06-16  1:07                       ` Boqun Feng
  2017-08-10 12:10                     ` [tip:locking/core] sched/wait: Remove the lockless swait_active() check in swake_up*() tip-bot for Boqun Feng
  1 sibling, 1 reply; 16+ messages in thread
From: Paul E. McKenney @ 2017-06-15 17:56 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Krister Johansen, Steven Rostedt, Peter Zijlstra, Ingo Molnar,
	linux-kernel, Paul Gortmaker, Thomas Gleixner

On Thu, Jun 15, 2017 at 12:18:28PM +0800, Boqun Feng wrote:
> On Wed, Jun 14, 2017 at 09:25:58AM -0700, Krister Johansen wrote:
> > On Wed, Jun 14, 2017 at 11:02:40AM -0400, Steven Rostedt wrote:
> > > On Wed, 14 Jun 2017 09:10:15 -0400
> > > Steven Rostedt <rostedt@goodmis.org> wrote:
> > > 
> > > > Now let's make it simpler. I'll even add the READ_ONCE and WRITE_ONCE
> > > > where applicable.
> > > > 
> > > > 
> > > > 	CPU0				CPU1
> > > > 	----				----
> > > > 				LOCK(A)
> > > > 
> > > >  LOCK(B)
> > > > 				 WRITE_ONCE(X, INIT)
> > > > 
> > > > 				 (the cpu may postpone writing X)
> > > > 
> > > > 				 (the cpu can fetch wq list here)
> > > >   list_add(wq, q)
> > > > 
> > > >  UNLOCK(B)
> > > > 
> > > >  (the cpu may fetch old value of X)
> > > > 
> > > > 				 (write of X happens here)
> > > > 
> > > >  if (READ_ONCE(X) != init)
> > > >    schedule();
> > > > 
> > > > 				UNLOCK(A)
> > > > 
> > > > 				 if (list_empty(wq))
> > > > 				   return;
> > > > 
> > > > Tell me again how the READ_ONCE() and WRITE_ONCE() helps in this
> > > > scenario?
> > > > 
> > > > Because we are using spinlocks, this wont be an issue for most
> > > > architectures. The bug happens if the fetching of the list_empty()
> > > > leaks into before the UNLOCK(A).
> > > > 
> > > > If the reading/writing of the list and the reading/writing of gp_flags
> > > > gets reversed in either direction by the CPU, then we have a problem.
> > > 
> > > FYI..
> > > 
> > > Both sides need a memory barrier. Otherwise, even with a memory barrier
> > > on CPU1 we can still have:
> > > 
> > > 
> > > 	CPU0				CPU1
> > > 	----				----
> > > 
> > > 				LOCK(A)
> > >  LOCK(B)
> > > 
> > >  list_add(wq, q)
> > > 
> > >  (cpu waits to write wq list)
> > > 
> > >  (cpu fetches X)
> > > 
> > > 				 WRITE_ONCE(X, INIT)
> > > 
> > > 				UNLOCK(A)
> > > 
> > > 				smp_mb();
> > > 
> > > 				if (list_empty(wq))
> > > 				   return;
> > > 
> > >  (cpu writes wq list)
> > > 
> > >  UNLOCK(B)
> > > 
> > >  if (READ_ONCE(X) != INIT)
> > >    schedule()
> > > 
> > > 
> > > Luckily for us, there is a memory barrier on CPU0. In
> > > prepare_to_swait() we have:
> > > 
> > > 	raw_spin_lock_irqsave(&q->lock, flags);
> > > 	__prepare_to_swait(q, wait);
> > > 	set_current_state(state);
> > > 	raw_spin_unlock_irqrestore(&q->lock, flags);
> > > 
> > > And that set_current_state() call includes a memory barrier, which will
> > > prevent the above from happening, as the addition to the wq list must
> > > be flushed before fetching X.
> > > 
> > > I still strongly believe that the swait_active() requires a memory
> > > barrier.
> > 
> > FWLIW, I agree.  There was a smb_mb() in RT-linux's equivalent of
> > swait_activate().
> > 
> > https://www.spinics.net/lists/linux-rt-users/msg10340.html
> > 
> > If the barrier goes in swait_active() then we don't have to require all
> > of the callers of swait_active and swake_up to issue the barrier
> > instead.  Handling this in swait_active is likely to be less error
> > prone.  Though, we could also do something like wq_has_sleeper() and use
> > that preferentially in swake_up and its variants.
> > 
> 
> I think it makes more sense that we delete the swait_active() in
> swake_up()? Because we seems to encourage users to do the quick check on
> wait queue on their own, so why do the check again in swake_up()?
> Besides, wake_up() doesn't call waitqueue_activie() outside the lock
> critical section either.
> 
> So how about the patch below(Testing is in progress)? Peter?

It is quite possible that a problem I am seeing is caused by this, but
there are reasons to believe otherwise.  And in any case, the problem is
quite rare, taking tens or perhaps even hundreds of hours of rcutorture
to reproduce.

So, would you be willing to create a dedicated swait torture test to check
this out?  The usual approach would be to create a circle of kthreads,
with each waiting on the previous kthread and waking up the next one.
Each kthread, after being awakened, checks a variable that its waker
sets just before the wakeup.  Have another kthread check for hangs.

Possibly introduce timeouts and random delays to stir things up a bit.
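
A rough, untested sketch of one kthread in such a circle (all of the
names here are invented):

	#include <linux/kthread.h>
	#include <linux/swait.h>

	struct swaketorture_node {
		struct swait_queue_head wq;
		int token;		/* set by our waker just before swake_up() */
		struct swaketorture_node *next;
	};

	static int swaketorture_thread(void *arg)
	{
		struct swaketorture_node *me = arg;

		while (!kthread_should_stop()) {
			/* wait until the previous kthread hands us the token */
			if (swait_event_interruptible(me->wq,
					READ_ONCE(me->token) ||
					kthread_should_stop()))
				continue;	/* interrupted; recheck stop condition */
			if (kthread_should_stop())
				break;
			WRITE_ONCE(me->token, 0);

			/* set the next kthread's variable, then wake it; the
			 * barrier orders the store before the lockless
			 * swait_active() check in swake_up() */
			WRITE_ONCE(me->next->token, 1);
			smp_mb();
			swake_up(&me->next->wq);
		}
		return 0;
	}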

But maybe such a test already exists.  Does anyone know of one?  I don't
see anything obvious.

Interested?

							Thanx, Paul

> Regards,
> Boqun
> 
> --------------------->8
> Subject: [PATCH] swait: Remove the lockless swait_active() check in
>  swake_up*()
> 
> Steven Rostedt reported a potential race in RCU core because of
> swake_up():
> 
>         CPU0                            CPU1
>         ----                            ----
>                                 __call_rcu_core() {
> 
>                                  spin_lock(rnp_root)
>                                  need_wake = __rcu_start_gp() {
>                                   rcu_start_gp_advanced() {
>                                    gp_flags = FLAG_INIT
>                                   }
>                                  }
> 
>  rcu_gp_kthread() {
>    swait_event_interruptible(wq,
>         gp_flags & FLAG_INIT) {
>    spin_lock(q->lock)
> 
>                                 *fetch wq->task_list here! *
> 
>    list_add(wq->task_list, q->task_list)
>    spin_unlock(q->lock);
> 
>    *fetch old value of gp_flags here *
> 
>                                  spin_unlock(rnp_root)
> 
>                                  rcu_gp_kthread_wake() {
>                                   swake_up(wq) {
>                                    swait_active(wq) {
>                                     list_empty(wq->task_list)
> 
>                                    } * return false *
> 
>   if (condition) * false *
>     schedule();
> 
> In this case, a wakeup is missed, which could cause the rcu_gp_kthread
> waits for a long time.
> 
> The reason of this is that we do a lockless swait_active() check in
> swake_up(). To fix this, we can either 1) add a smp_mb() in swake_up()
> before swait_active() to provide the proper order or 2) simply remove
> the swait_active() in swake_up().
> 
> The solution 2 not only fixes this problem but also keeps the swait and
> wait API as close as possible, as wake_up() doesn't provide a full
> barrier and doesn't do a lockless check of the wait queue either.
> Moreover, there are users already using swait_active() to do their quick
> checks for the wait queues, so it make less sense that swake_up() and
> swake_up_all() do this on their own.
> 
> This patch then removes the lockless swait_active() check in swake_up()
> and swake_up_all().
> 
> Reported-by: Steven Rostedt <rostedt@goodmis.org>
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> ---
>  kernel/sched/swait.c | 6 ------
>  1 file changed, 6 deletions(-)
> 
> diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c
> index 3d5610dcce11..2227e183e202 100644
> --- a/kernel/sched/swait.c
> +++ b/kernel/sched/swait.c
> @@ -33,9 +33,6 @@ void swake_up(struct swait_queue_head *q)
>  {
>  	unsigned long flags;
> 
> -	if (!swait_active(q))
> -		return;
> -
>  	raw_spin_lock_irqsave(&q->lock, flags);
>  	swake_up_locked(q);
>  	raw_spin_unlock_irqrestore(&q->lock, flags);
> @@ -51,9 +48,6 @@ void swake_up_all(struct swait_queue_head *q)
>  	struct swait_queue *curr;
>  	LIST_HEAD(tmp);
> 
> -	if (!swait_active(q))
> -		return;
> -
>  	raw_spin_lock_irq(&q->lock);
>  	list_splice_init(&q->task_list, &tmp);
>  	while (!list_empty(&tmp)) {
> -- 
> 2.13.0
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH tip/sched/core] Add comments to aid in safer usage of swake_up.
  2017-06-15 17:56                     ` Paul E. McKenney
@ 2017-06-16  1:07                       ` Boqun Feng
  2017-06-16  3:09                         ` Paul E. McKenney
  0 siblings, 1 reply; 16+ messages in thread
From: Boqun Feng @ 2017-06-16  1:07 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Krister Johansen, Steven Rostedt, Peter Zijlstra, Ingo Molnar,
	linux-kernel, Paul Gortmaker, Thomas Gleixner

On Thu, Jun 15, 2017 at 10:56:29AM -0700, Paul E. McKenney wrote:
[...]
> > > 
> > > FWLIW, I agree.  There was a smb_mb() in RT-linux's equivalent of
> > > swait_activate().
> > > 
> > > https://www.spinics.net/lists/linux-rt-users/msg10340.html
> > > 
> > > If the barrier goes in swait_active() then we don't have to require all
> > > of the callers of swait_active and swake_up to issue the barrier
> > > instead.  Handling this in swait_active is likely to be less error
> > > prone.  Though, we could also do something like wq_has_sleeper() and use
> > > that preferentially in swake_up and its variants.
> > > 
> > 
> > I think it makes more sense that we delete the swait_active() in
> > swake_up()? Because we seems to encourage users to do the quick check on
> > wait queue on their own, so why do the check again in swake_up()?
> > Besides, wake_up() doesn't call waitqueue_activie() outside the lock
> > critical section either.
> > 
> > So how about the patch below(Testing is in progress)? Peter?
> 
> It is quite possible that a problem I am seeing is caused by this, but
> there are reasons to believe otherwise.  And in any case, the problem is
> quite rare, taking tens or perhaps even hundreds of hours of rcutorture
> to reproduce.
> 
> So, would you be willing to create a dedicated swait torture test to check
> this out?  The usual approach would be to create a circle of kthreads,
> with each waiting on the previous kthread and waking up the next one.
> Each kthread, after being awakened, checks a variable that its waker
> sets just before the wakeup.  Have another kthread check for hangs.
> 
> Possibly introduce timeouts and random delays to stir things up a bit.
> 
> But maybe such a test already exists.  Does anyone know of one?  I don't
> see anything obvious.
> 

Your waketorture patchset[1] seems to be something similar, at least a
good start ;-)

As we don't know which kind of scenario will trigger the problem easily,
I will play around with different ones, and hopefully we can find a way.

Regards,
Boqun

[1]: https://marc.info/?l=linux-kernel&m=146602969518960

> Interested?
> 
> 							Thanx, Paul
> 
[...]


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH tip/sched/core] Add comments to aid in safer usage of swake_up.
  2017-06-16  1:07                       ` Boqun Feng
@ 2017-06-16  3:09                         ` Paul E. McKenney
  0 siblings, 0 replies; 16+ messages in thread
From: Paul E. McKenney @ 2017-06-16  3:09 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Krister Johansen, Steven Rostedt, Peter Zijlstra, Ingo Molnar,
	linux-kernel, Paul Gortmaker, Thomas Gleixner

On Fri, Jun 16, 2017 at 09:07:57AM +0800, Boqun Feng wrote:
> On Thu, Jun 15, 2017 at 10:56:29AM -0700, Paul E. McKenney wrote:
> [...]
> > > > 
> > > > FWLIW, I agree.  There was a smb_mb() in RT-linux's equivalent of
> > > > swait_activate().
> > > > 
> > > > https://www.spinics.net/lists/linux-rt-users/msg10340.html
> > > > 
> > > > If the barrier goes in swait_active(), then we don't have to require
> > > > all of the callers of swait_active() and swake_up() to issue the
> > > > barrier themselves.  Handling this in swait_active() is likely to be
> > > > less error prone.  Though we could also do something like
> > > > wq_has_sleeper() and use that preferentially in swake_up() and its
> > > > variants.
> > > > 
> > > 
> > > I think it makes more sense to delete the swait_active() in
> > > swake_up().  We seem to encourage users to do the quick check on the
> > > wait queue on their own, so why do the check again in swake_up()?
> > > Besides, wake_up() doesn't call waitqueue_active() outside the lock
> > > critical section either.
> > > 
> > > So how about the patch below (testing is in progress)?  Peter?
> > 
> > It is quite possible that a problem I am seeing is caused by this, but
> > there are reasons to believe otherwise.  And in any case, the problem is
> > quite rare, taking tens or perhaps even hundreds of hours of rcutorture
> > to reproduce.
> > 
> > So, would you be willing to create a dedicated swait torture test to check
> > this out?  The usual approach would be to create a circle of kthreads,
> > with each waiting on the previous kthread and waking up the next one.
> > Each kthread, after being awakened, checks a variable that its waker
> > sets just before the wakeup.  Have another kthread check for hangs.
> > 
> > Possibly introduce timeouts and random delays to stir things up a bit.
> > 
> > But maybe such a test already exists.  Does anyone know of one?  I don't
> > see anything obvious.
> > 
> 
> Your waketorture patchset[1] seems to be something similar, at least a
> good start ;-)

Glad I could help!  ;-)

> Since we don't know which kind of scenario will trigger the problem most
> easily, I will play around with different ones, and hopefully we can find
> a way to trigger it.

Makes sense, please let me know how it goes!

							Thanx, Paul

> Regards,
> Boqun
> 
> [1]: https://marc.info/?l=linux-kernel&m=146602969518960
> 
> > Interested?
> > 
> > 							Thanx, Paul
> > 
> [...]

^ permalink raw reply	[flat|nested] 16+ messages in thread
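
A rough sketch of the circle-of-kthreads test Paul describes above is shown
below; the names, the ring size, and the use of swait_event_interruptible()
are illustrative assumptions, not the actual waketorture code, and error and
teardown handling are omitted.

#include <linux/kthread.h>
#include <linux/swait.h>
#include <linux/atomic.h>

#define NR_RING 4

static struct swait_queue_head ring_wq[NR_RING];
static bool ring_flag[NR_RING];
static atomic_long_t ring_progress;	/* watched by a separate hang checker */

static int ring_kthread(void *arg)
{
	int me = (long)arg;
	int next = (me + 1) % NR_RING;

	while (!kthread_should_stop()) {
		/* Sleep until our waker has set our flag. */
		swait_event_interruptible(ring_wq[me], READ_ONCE(ring_flag[me]));
		WRITE_ONCE(ring_flag[me], false);
		atomic_long_inc(&ring_progress);

		/* Set the next kthread's condition, then wake it up. */
		WRITE_ONCE(ring_flag[next], true);
		/*
		 * With the lockless check still in swake_up(), an smp_mb()
		 * would be needed here; once the check is removed, q->lock
		 * orders the store above against the waiter's re-check.
		 */
		swake_up(&ring_wq[next]);
	}
	return 0;
}

static void ring_start(void)
{
	int i;

	for (i = 0; i < NR_RING; i++)
		init_swait_queue_head(&ring_wq[i]);
	for (i = 0; i < NR_RING; i++)
		kthread_run(ring_kthread, (void *)(long)i, "swait_ring/%d", i);

	WRITE_ONCE(ring_flag[0], true);	/* kick the ring off */
	swake_up(&ring_wq[0]);
}

A separate checker kthread would watch ring_progress and report a hang if it
stops advancing; the timeouts and random delays Paul suggests could be layered
on with swait_event_timeout() and schedule_timeout_interruptible().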

* [tip:locking/core] sched/wait: Remove the lockless swait_active() check in swake_up*()
  2017-06-15  4:18                   ` Boqun Feng
  2017-06-15 17:56                     ` Paul E. McKenney
@ 2017-08-10 12:10                     ` tip-bot for Boqun Feng
  1 sibling, 0 replies; 16+ messages in thread
From: tip-bot for Boqun Feng @ 2017-08-10 12:10 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, paul.gortmaker, rostedt, mingo, peterz, tglx, kjlx,
	boqun.feng, hpa, paulmck, torvalds

Commit-ID:  35a2897c2a306cca344ca5c0b43416707018f434
Gitweb:     http://git.kernel.org/tip/35a2897c2a306cca344ca5c0b43416707018f434
Author:     Boqun Feng <boqun.feng@gmail.com>
AuthorDate: Thu, 15 Jun 2017 12:18:28 +0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 10 Aug 2017 12:28:53 +0200

sched/wait: Remove the lockless swait_active() check in swake_up*()

Steven Rostedt reported a potential race in the RCU core because of
swake_up():

        CPU0                            CPU1
        ----                            ----
                                __call_rcu_core() {

                                 spin_lock(rnp_root)
                                 need_wake = __rcu_start_gp() {
                                  rcu_start_gp_advanced() {
                                   gp_flags = FLAG_INIT
                                  }
                                 }

 rcu_gp_kthread() {
   swait_event_interruptible(wq,
        gp_flags & FLAG_INIT) {
   spin_lock(q->lock)

                                *fetch wq->task_list here! *

   list_add(wq->task_list, q->task_list)
   spin_unlock(q->lock);

   *fetch old value of gp_flags here *

                                 spin_unlock(rnp_root)

                                 rcu_gp_kthread_wake() {
                                  swake_up(wq) {
                                   swait_active(wq) {
                                    list_empty(wq->task_list)

                                   } * return false *

  if (condition) * false *
    schedule();

In this case, a wakeup is missed, which could cause rcu_gp_kthread to
wait for a long time.

The reason for this is that we do a lockless swait_active() check in
swake_up().  To fix this, we can either 1) add an smp_mb() in swake_up()
before swait_active() to provide the proper ordering, or 2) simply remove
the swait_active() check in swake_up().

Solution 2 not only fixes this problem but also keeps the swait and
wait APIs as close as possible, as wake_up() doesn't provide a full
barrier and doesn't do a lockless check of the wait queue either.
Moreover, there are already users calling swait_active() to do their own
quick checks of the wait queues, so it makes less sense for swake_up()
and swake_up_all() to do the check on their own.

This patch then removes the lockless swait_active() check in swake_up()
and swake_up_all().

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Krister Johansen <kjlx@templeofstupid.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170615041828.zk3a3sfyudm5p6nl@tardis
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/swait.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c
index 3d5610d..2227e18 100644
--- a/kernel/sched/swait.c
+++ b/kernel/sched/swait.c
@@ -33,9 +33,6 @@ void swake_up(struct swait_queue_head *q)
 {
 	unsigned long flags;
 
-	if (!swait_active(q))
-		return;
-
 	raw_spin_lock_irqsave(&q->lock, flags);
 	swake_up_locked(q);
 	raw_spin_unlock_irqrestore(&q->lock, flags);
@@ -51,9 +48,6 @@ void swake_up_all(struct swait_queue_head *q)
 	struct swait_queue *curr;
 	LIST_HEAD(tmp);
 
-	if (!swait_active(q))
-		return;
-
 	raw_spin_lock_irq(&q->lock);
 	list_splice_init(&q->task_list, &tmp);
 	while (!list_empty(&tmp)) {

^ permalink raw reply related	[flat|nested] 16+ messages in thread
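
With the lockless check gone from swake_up() and swake_up_all(), a caller
that still wants to skip the queue lock when nobody is waiting has to supply
the ordering itself, roughly along the lines of the sketch below (my_cond,
my_wq, and the two functions are illustrative names only).

#include <linux/swait.h>

static bool my_cond;
static DECLARE_SWAIT_QUEUE_HEAD(my_wq);

static void my_waker(void)
{
	WRITE_ONCE(my_cond, true);
	/*
	 * Full barrier so the lockless swait_active() load below cannot be
	 * reordered before the store to my_cond; otherwise the waker may see
	 * an empty wait list while the waiter has not yet seen my_cond set.
	 */
	smp_mb();
	if (swait_active(&my_wq))
		swake_up(&my_wq);
}

static void my_waiter(void)
{
	swait_event(my_wq, READ_ONCE(my_cond));
}

Callers that do not need the fast path can simply call swake_up()
unconditionally and let q->lock provide the ordering.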

end of thread, other threads:[~2017-08-10 12:15 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-09  3:25 [PATCH tip/sched/core] Add comments to aid in safer usage of swake_up Krister Johansen
2017-06-09  7:19 ` Peter Zijlstra
2017-06-09 12:45   ` Paul E. McKenney
2017-06-13 23:23     ` Steven Rostedt
2017-06-13 23:42       ` Paul E. McKenney
2017-06-14  1:15         ` Steven Rostedt
2017-06-14  3:58           ` Paul E. McKenney
2017-06-14 13:10             ` Steven Rostedt
2017-06-14 15:02               ` Steven Rostedt
2017-06-14 16:25                 ` Krister Johansen
2017-06-15  4:18                   ` Boqun Feng
2017-06-15 17:56                     ` Paul E. McKenney
2017-06-16  1:07                       ` Boqun Feng
2017-06-16  3:09                         ` Paul E. McKenney
2017-08-10 12:10                     ` [tip:locking/core] sched/wait: Remove the lockless swait_active() check in swake_up*() tip-bot for Boqun Feng
2017-06-14 15:55               ` [PATCH tip/sched/core] Add comments to aid in safer usage of swake_up Paul E. McKenney
