* [PATCH -tip 1/3] locking/qrwlock: Rename ->lock to ->wait_lock
@ 2015-09-14  7:37 Davidlohr Bueso
  2015-09-14  7:37 ` [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics Davidlohr Bueso
                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Davidlohr Bueso @ 2015-09-14  7:37 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner
  Cc: Paul E. McKenney, Davidlohr Bueso, linux-kernel, Davidlohr Bueso

... trivial, but reads a little nicer when 'lock' names our actual
primitive rather than the internal wait-queue spinlock (no more 'lock->lock').

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 include/asm-generic/qrwlock_types.h | 4 ++--
 kernel/locking/qrwlock.c            | 8 ++++----
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/asm-generic/qrwlock_types.h b/include/asm-generic/qrwlock_types.h
index 4d76f24..0abc6b6 100644
--- a/include/asm-generic/qrwlock_types.h
+++ b/include/asm-generic/qrwlock_types.h
@@ -10,12 +10,12 @@
 
 typedef struct qrwlock {
 	atomic_t		cnts;
-	arch_spinlock_t		lock;
+	arch_spinlock_t		wait_lock;
 } arch_rwlock_t;
 
 #define	__ARCH_RW_LOCK_UNLOCKED {		\
 	.cnts = ATOMIC_INIT(0),			\
-	.lock = __ARCH_SPIN_LOCK_UNLOCKED,	\
+	.wait_lock = __ARCH_SPIN_LOCK_UNLOCKED,	\
 }
 
 #endif /* __ASM_GENERIC_QRWLOCK_TYPES_H */
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index f17a3e3..fec0823 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -86,7 +86,7 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
 	/*
 	 * Put the reader into the wait queue
 	 */
-	arch_spin_lock(&lock->lock);
+	arch_spin_lock(&lock->wait_lock);
 
 	/*
 	 * The ACQUIRE semantics of the following spinning code ensure
@@ -99,7 +99,7 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
 	/*
 	 * Signal the next one in queue to become queue head
 	 */
-	arch_spin_unlock(&lock->lock);
+	arch_spin_unlock(&lock->wait_lock);
 }
 EXPORT_SYMBOL(queued_read_lock_slowpath);
 
@@ -112,7 +112,7 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
 	u32 cnts;
 
 	/* Put the writer into the wait queue */
-	arch_spin_lock(&lock->lock);
+	arch_spin_lock(&lock->wait_lock);
 
 	/* Try to acquire the lock directly if no reader is present */
 	if (!atomic_read(&lock->cnts) &&
@@ -144,6 +144,6 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
 		cpu_relax_lowlatency();
 	}
 unlock:
-	arch_spin_unlock(&lock->lock);
+	arch_spin_unlock(&lock->wait_lock);
 }
 EXPORT_SYMBOL(queued_write_lock_slowpath);
-- 
2.1.4



* [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics
  2015-09-14  7:37 [PATCH -tip 1/3] locking/qrwlock: Rename ->lock to ->wait_lock Davidlohr Bueso
@ 2015-09-14  7:37 ` Davidlohr Bueso
  2015-09-14 12:32   ` Peter Zijlstra
  2015-09-14  7:37 ` [PATCH -tip 3/3] locking/osq: Relax atomic semantics Davidlohr Bueso
  2015-09-18  8:50 ` [tip:locking/core] locking/qrwlock: Rename ->lock to ->wait_lock tip-bot for Davidlohr Bueso
  2 siblings, 1 reply; 28+ messages in thread
From: Davidlohr Bueso @ 2015-09-14  7:37 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner
  Cc: Paul E. McKenney, Davidlohr Bueso, linux-kernel, Davidlohr Bueso

The barrier pairings for wake-queues are very straightforward, and thus
we can ease the barrier requirements for wake_q_add(), on archs that
support it, by relying on acquire semantics. As such, (i) we keep the
pairing structure/logic and (ii) users, such as mqueues, can continue to
rely on a full barrier after the successful RMW.

[Another alternative would be to not try at all and fully downgrade
to cmpxchg_relaxed(), relying on users to provide their own synchronization.
But controlling this ourselves makes me sleep better at night.]

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 kernel/sched/core.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6ab415a..7567603 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -523,14 +523,14 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
 	struct wake_q_node *node = &task->wake_q;
 
 	/*
-	 * Atomically grab the task, if ->wake_q is !nil already it means
-	 * its already queued (either by us or someone else) and will get the
-	 * wakeup due to that.
+	 * Atomically grab the task. If ->wake_q is non-nil (failed cmpxchg)
+	 * then the task is already queued (by us or someone else) and will
+	 * get the wakeup due to that.
 	 *
-	 * This cmpxchg() implies a full barrier, which pairs with the write
-	 * barrier implied by the wakeup in wake_up_list().
+	 * Use acquire semantics to add the next pointer, which pairs with the
+	 * write barrier implied by the wakeup in wake_up_list().
 	 */
-	if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
+	if (cmpxchg_acquire(&node->next, NULL, WAKE_Q_TAIL))
 		return;
 
 	get_task_struct(task);
-- 
2.1.4



* [PATCH -tip 3/3] locking/osq: Relax atomic semantics
  2015-09-14  7:37 [PATCH -tip 1/3] locking/qrwlock: Rename ->lock to ->wait_lock Davidlohr Bueso
  2015-09-14  7:37 ` [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics Davidlohr Bueso
@ 2015-09-14  7:37 ` Davidlohr Bueso
  2015-09-18  8:50   ` [tip:locking/core] " tip-bot for Davidlohr Bueso
  2015-09-18  8:50 ` [tip:locking/core] locking/qrwlock: Rename ->lock to ->wait_lock tip-bot for Davidlohr Bueso
  2 siblings, 1 reply; 28+ messages in thread
From: Davidlohr Bueso @ 2015-09-14  7:37 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner
  Cc: Paul E. McKenney, Davidlohr Bueso, linux-kernel, Davidlohr Bueso

... by using acquire/release for ops around the lock->tail. As such,
weakly ordered archs can benefit from more relaxed use of barriers
when issuing atomics.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 kernel/locking/osq_lock.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index dc85ee2..d092a0c 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -50,7 +50,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
 
 	for (;;) {
 		if (atomic_read(&lock->tail) == curr &&
-		    atomic_cmpxchg(&lock->tail, curr, old) == curr) {
+		    atomic_cmpxchg_acquire(&lock->tail, curr, old) == curr) {
 			/*
 			 * We were the last queued, we moved @lock back. @prev
 			 * will now observe @lock and will complete its
@@ -92,7 +92,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	node->next = NULL;
 	node->cpu = curr;
 
-	old = atomic_xchg(&lock->tail, curr);
+	/*
+	 * ACQUIRE semantics, pairs with corresponding RELEASE
+	 * in unlock() uncontended, or fastpath.
+	 */
+	old = atomic_xchg_acquire(&lock->tail, curr);
 	if (old == OSQ_UNLOCKED_VAL)
 		return true;
 
@@ -184,7 +188,8 @@ void osq_unlock(struct optimistic_spin_queue *lock)
 	/*
 	 * Fast path for the uncontended case.
 	 */
-	if (likely(atomic_cmpxchg(&lock->tail, curr, OSQ_UNLOCKED_VAL) == curr))
+	if (likely(atomic_cmpxchg_release(&lock->tail, curr,
+					  OSQ_UNLOCKED_VAL) == curr))
 		return;
 
 	/*
-- 
2.1.4



* Re: [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics
  2015-09-14  7:37 ` [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics Davidlohr Bueso
@ 2015-09-14 12:32   ` Peter Zijlstra
  2015-09-14 21:08     ` Davidlohr Bueso
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2015-09-14 12:32 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Ingo Molnar, Thomas Gleixner, Paul E. McKenney, linux-kernel,
	Davidlohr Bueso

On Mon, Sep 14, 2015 at 12:37:23AM -0700, Davidlohr Bueso wrote:
> The barrier pairings for wake-queues are very straightforward, and thus
> we can ease the barrier requirements for wake_q_add(), on archs that
> support it, by relying on acquire semantics. As such, (i) we keep the
> pairing structure/logic and (ii) users, such as mqueues, can continue to
> rely on a full barrier after the successful RMW.

> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> ---
>  kernel/sched/core.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6ab415a..7567603 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -523,14 +523,14 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
>  	struct wake_q_node *node = &task->wake_q;
>  
>  	/*
> +	 * Atomically grab the task. If ->wake_q is non-nil (failed cmpxchg)
> +	 * then the task is already queued (by us or someone else) and will
> +	 * get the wakeup due to that.
>  	 *
> +	 * Use acquire semantics to add the next pointer, which pairs with the
> +	 * write barrier implied by the wakeup in wake_up_list().
>  	 */
> +	if (cmpxchg_acquire(&node->next, NULL, WAKE_Q_TAIL))
>  		return;
>  
>  	get_task_struct(task);

I'm not seeing a _why_ on the acquire semantics. Not saying the patch is
wrong, just saying I want words on why acquire is correct.


* Re: [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics
  2015-09-14 12:32   ` Peter Zijlstra
@ 2015-09-14 21:08     ` Davidlohr Bueso
  2015-09-15  9:49       ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Davidlohr Bueso @ 2015-09-14 21:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Paul E. McKenney, linux-kernel,
	Davidlohr Bueso

On Mon, 14 Sep 2015, Peter Zijlstra wrote:

>On Mon, Sep 14, 2015 at 12:37:23AM -0700, Davidlohr Bueso wrote:
>>	/*
>> +	 * Atomically grab the task. If ->wake_q is non-nil (failed cmpxchg)
>> +	 * then the task is already queued (by us or someone else) and will
>> +	 * get the wakeup due to that.
>>	 *
>> +	 * Use acquire semantics to add the next pointer, which pairs with the
>> +	 * write barrier implied by the wakeup in wake_up_list().
>>	 */
>> +	if (cmpxchg_acquire(&node->next, NULL, WAKE_Q_TAIL))
>>		return;
>>
>>	get_task_struct(task);
>
>I'm not seeing a _why_ on the acquire semantics. Not saying the patch is
>wrong, just saying I want words on why acquire is correct.

Well, I was just taking advantage of removing the upper barrier. Considering
the formal semantics, you are right that we don't actually need acquire per se
(ie for node->next) but merely need to ensure a barrier in wake_q_add(). This is
kind of why I had hinted at going full _relaxed(). We could also rephrase the
comment, something like:

      * Use ACQUIRE semantics to add the next pointer, such that
      * wake_q_add() implies a full barrier. This pairs with the
      * write barrier implied by the wakeup in wake_up_list().
      */

What do you think?


* Re: [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics
  2015-09-14 21:08     ` Davidlohr Bueso
@ 2015-09-15  9:49       ` Peter Zijlstra
  2015-09-15  9:55         ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2015-09-15  9:49 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Ingo Molnar, Thomas Gleixner, Paul E. McKenney, linux-kernel,
	Davidlohr Bueso

On Mon, Sep 14, 2015 at 02:08:06PM -0700, Davidlohr Bueso wrote:
> On Mon, 14 Sep 2015, Peter Zijlstra wrote:
> 
> >On Mon, Sep 14, 2015 at 12:37:23AM -0700, Davidlohr Bueso wrote:
> >>	/*
> >>+	 * Atomically grab the task. If ->wake_q is non-nil (failed cmpxchg)
> >>+	 * then the task is already queued (by us or someone else) and will
> >>+	 * get the wakeup due to that.
> >>	 *
> >>+	 * Use acquire semantics to add the next pointer, which pairs with the
> >>+	 * write barrier implied by the wakeup in wake_up_list().
> >>	 */
> >>+	if (cmpxchg_acquire(&node->next, NULL, WAKE_Q_TAIL))
> >>		return;
> >>
> >>	get_task_struct(task);
> >
> >I'm not seeing a _why_ on the acquire semantics. Not saying the patch is
> >wrong, just saying I want words on why acquire is correct.
>
> Well, I was just taking advantage of removing the upper barrier. Considering
> the formal semantics, you are right that we don't actually need acquire per se
> (ie for node->next) but merely need to ensure a barrier in wake_q_add(). This is
> kind of why I had hinted at going full _relaxed(). We could also rephrase the
> comment, something like:
>
>      * Use ACQUIRE semantics to add the next pointer, such that
>      * wake_q_add() implies a full barrier. This pairs with the
>      * write barrier implied by the wakeup in wake_up_list().
>      */
>
> What do you think?

Still befuddled. I'm thinking that if you want to remove a barrier,
you'd remove the second and keep the first. That is RELEASE.

That way, you know the stores prior to the wake queue are done by the
time you observe the queued entry, and therefore (transitively) know
those stores are done by the time you do the actual wakeup.

Two issues with that though; firstly RELEASE is not actually guaranteed
to be transitive -- now the only arch that does not implement it with a
full barrier is ARGH64, so we could just ask Will, but I'm not sure it's
'good' to start relying on this.

Secondly, the wake queues are not concurrent, they're in context, so I
don't see ordering matter at all. The only reason it's a cmpxchg() is
because there is the (small) possibility of two contexts wanting to wake
the same task, and we use task_struct storage for the queue.
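
To illustrate (a sketch; qa and qb are hypothetical per-context queues):

	/* contexts A and B race to wake the same task */
	wake_q_add(&qa, task);	/* cmpxchg(&task->wake_q.next, ...) succeeds */
	wake_q_add(&qb, task);	/* cmpxchg() fails: already queued, so B
				 * returns early and A delivers the wakeup */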

Or am I mistaken and do we have concurrent users of wake queues?


* Re: [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics
  2015-09-15  9:49       ` Peter Zijlstra
@ 2015-09-15  9:55         ` Peter Zijlstra
  2015-09-15 12:41           ` Paul E. McKenney
  2015-09-15 19:49           ` Davidlohr Bueso
  0 siblings, 2 replies; 28+ messages in thread
From: Peter Zijlstra @ 2015-09-15  9:55 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Ingo Molnar, Thomas Gleixner, Paul E. McKenney, linux-kernel,
	Davidlohr Bueso

On Tue, Sep 15, 2015 at 11:49:49AM +0200, Peter Zijlstra wrote:
> On Mon, Sep 14, 2015 at 02:08:06PM -0700, Davidlohr Bueso wrote:
> > On Mon, 14 Sep 2015, Peter Zijlstra wrote:
> > 
> > >On Mon, Sep 14, 2015 at 12:37:23AM -0700, Davidlohr Bueso wrote:
> > >>	/*
> > >>+	 * Atomically grab the task. If ->wake_q is non-nil (failed cmpxchg)
> > >>+	 * then the task is already queued (by us or someone else) and will
> > >>+	 * get the wakeup due to that.
> > >>	 *
> > >>+	 * Use acquire semantics to add the next pointer, which pairs with the
> > >>+	 * write barrier implied by the wakeup in wake_up_list().
> > >>	 */
> > >>+	if (cmpxchg_acquire(&node->next, NULL, WAKE_Q_TAIL))
> > >>		return;
> > >>
> > >>	get_task_struct(task);
> > >
> > >I'm not seeing a _why_ on the acquire semantics. Not saying the patch is
> > >wrong, just saying I want words on why acquire is correct.
> >
> > Well, I was just taking advantage of removing the upper barrier. Considering
> > the formal semantics, you are right that we don't actually need acquire per se
> > (ie for node->next) but merely need to ensure a barrier in wake_q_add(). This is
> > kind of why I had hinted at going full _relaxed(). We could also rephrase the
> > comment, something like:
> >
> >      * Use ACQUIRE semantics to add the next pointer, such that
> >      * wake_q_add() implies a full barrier. This pairs with the
> >      * write barrier implied by the wakeup in wake_up_list().
> >      */
> >
> > What do you think?
> 
> Still befuddled. I'm thinking that if you want to remove a barrier,
> you'd remove the second and keep the first. That is RELEASE.
> 
> That way, you know the stores prior to the wake queue are done by the
> time you observe the queued entry, and therefore (transitively) know
> those stores are done by the time you do the actual wakeup.
> 
> Two issues with that though; firstly RELEASE is not actually guaranteed
> to be transitive -- now the only arch that does not implement it with a
> full barrier is ARGH64, so we could just ask Will, but I'm not sure it's
> 'good' to start relying on this.

Never mind, the PPC people will implement this with lwsync and that is
very much not transitive IIRC.

That said, you could do:

	smp_mb__before_atomic();
	cmpxchg_relaxed();
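
In wake_q_add() that would read (a sketch):

	smp_mb__before_atomic();	/* upgrade the relaxed op to a full barrier */
	if (cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL))
		return;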

Which would still be a full barrier and therefore transitive. However
this point still stands:

> Secondly, the wake queues are not concurrent, they're in context, so I
> don't see ordering matter at all. The only reason it's a cmpxchg() is
> because there is the (small) possibility of two contexts wanting to wake
> the same task, and we use task_struct storage for the queue.

I don't think we need _any_ barriers here, unless we have concurrent
users of the wake queues (or want to allow any, do we?).


* Re: [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics
  2015-09-15  9:55         ` Peter Zijlstra
@ 2015-09-15 12:41           ` Paul E. McKenney
  2015-09-15 12:48             ` Peter Zijlstra
  2015-09-15 19:49           ` Davidlohr Bueso
  1 sibling, 1 reply; 28+ messages in thread
From: Paul E. McKenney @ 2015-09-15 12:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Davidlohr Bueso, Ingo Molnar, Thomas Gleixner, linux-kernel,
	Davidlohr Bueso

On Tue, Sep 15, 2015 at 11:55:12AM +0200, Peter Zijlstra wrote:
> On Tue, Sep 15, 2015 at 11:49:49AM +0200, Peter Zijlstra wrote:
> > On Mon, Sep 14, 2015 at 02:08:06PM -0700, Davidlohr Bueso wrote:
> > > On Mon, 14 Sep 2015, Peter Zijlstra wrote:
> > > 
> > > >On Mon, Sep 14, 2015 at 12:37:23AM -0700, Davidlohr Bueso wrote:
> > > >>	/*
> > > >>+	 * Atomically grab the task. If ->wake_q is non-nil (failed cmpxchg)
> > > >>+	 * then the task is already queued (by us or someone else) and will
> > > >>+	 * get the wakeup due to that.
> > > >>	 *
> > > >>+	 * Use acquire semantics to add the next pointer, which pairs with the
> > > >>+	 * write barrier implied by the wakeup in wake_up_list().
> > > >>	 */
> > > >>+	if (cmpxchg_acquire(&node->next, NULL, WAKE_Q_TAIL))
> > > >>		return;
> > > >>
> > > >>	get_task_struct(task);
> > > >
> > > >I'm not seeing a _why_ on the acquire semantics. Not saying the patch is
> > > >wrong, just saying I want words on why acquire is correct.
> > >
> > > Well, I was just taking advantage of removing the upper barrier. Considering
> > > the formal semantics, you are right that we don't actually need acquire per se
> > > (ie for node->next) but merely need to ensure a barrier in wake_q_add(). This is
> > > kind of why I had hinted at going full _relaxed(). We could also rephrase the
> > > comment, something like:
> > >
> > >      * Use ACQUIRE semantics to add the next pointer, such that
> > >      * wake_q_add() implies a full barrier. This pairs with the
> > >      * write barrier implied by the wakeup in wake_up_list().
> > >      */
> > >
> > > What do you think?
> > 
> > Still befuddled. I'm thinking that if you want to remove a barrier,
> > you'd remove the second and keep the first. That is RELEASE.
> > 
> > That way, you know the stores prior to the wake queue are done by the
> > time you observe the queued entry, and therefore (transitively) know
> > those stores are done by the time you do the actual wakeup.
> > 
> > Two issues with that though; firstly RELEASE is not actually guaranteed
> > to be transitive -- now the only arch that does not implement it with a
> > full barrier is ARGH64, so we could just ask Will, but I'm not sure it's
> > 'good' to start relying on this.
> 
> Never mind, the PPC people will implement this with lwsync and that is
> very much not transitive IIRC.

I am probably lost on context, but...

It turns out that lwsync is transitive in special cases.  One of them
is a series of release-acquire pairs, which can extend indefinitely.
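
For example (a sketch; X, Y and Z initially zero):

	CPU 0
		X = 1;
		smp_store_release(&Y, 1);

	CPU 1
		r1 = smp_load_acquire(&Y);
		smp_store_release(&Z, 1);

	CPU 2
		r2 = smp_load_acquire(&Z);
		r3 = X;

With a transitive chain, r1 == 1 && r2 == 1 forbids r3 == 0.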

Does that help in this case?

							Thanx, Paul

> That said, you could do:
> 
> 	smp_mb__before_atomic();
> 	cmpxchg_relaxed();
> 
> Which would still be a full barrier and therefore transitive. However
> this point still stands:
> 
> > Secondly, the wake queues are not concurrent, they're in context, so I
> > don't see ordering matter at all. The only reason it's a cmpxchg() is
> > because there is the (small) possibility of two contexts wanting to wake
> > the same task, and we use task_struct storage for the queue.
> 
> I don't think we need _any_ barriers here, unless we have concurrent
> users of the wake queues (or want to allow any, do we?).
> 



* Re: [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics
  2015-09-15 12:41           ` Paul E. McKenney
@ 2015-09-15 12:48             ` Peter Zijlstra
  2015-09-15 14:09               ` Paul E. McKenney
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2015-09-15 12:48 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Davidlohr Bueso, Ingo Molnar, Thomas Gleixner, linux-kernel,
	Davidlohr Bueso

On Tue, Sep 15, 2015 at 05:41:42AM -0700, Paul E. McKenney wrote:
> > Never mind, the PPC people will implement this with lwsync and that is
> > very much not transitive IIRC.
> 
> I am probably lost on context, but...
> 
> It turns out that lwsync is transitive in special cases.  One of them
> is a series of release-acquire pairs, which can extend indefinitely.
> 
> Does that help in this case?

Probably not, but good to know. I still don't think we want to rely on
ACQUIRE/RELEASE being transitive in general though.


* Re: [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics
  2015-09-15 12:48             ` Peter Zijlstra
@ 2015-09-15 14:09               ` Paul E. McKenney
  2015-09-15 14:14                 ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Paul E. McKenney @ 2015-09-15 14:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Davidlohr Bueso, Ingo Molnar, Thomas Gleixner, linux-kernel,
	Davidlohr Bueso

On Tue, Sep 15, 2015 at 02:48:00PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 15, 2015 at 05:41:42AM -0700, Paul E. McKenney wrote:
> > > Never mind, the PPC people will implement this with lwsync and that is
> > > very much not transitive IIRC.
> > 
> > I am probably lost on context, but...
> > 
> > It turns out that lwsync is transitive in special cases.  One of them
> > is a series of release-acquire pairs, which can extend indefinitely.
> > 
> > Does that help in this case?
> 
> Probably not, but good to know. I still don't think we want to rely on
> ACQUIRE/RELEASE being transitive in general though.

OK, I will bite...  Why not?

							Thanx, Paul



* Re: [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics
  2015-09-15 14:09               ` Paul E. McKenney
@ 2015-09-15 14:14                 ` Peter Zijlstra
  2015-09-15 15:34                   ` Paul E. McKenney
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2015-09-15 14:14 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Davidlohr Bueso, Ingo Molnar, Thomas Gleixner, linux-kernel,
	Davidlohr Bueso

On Tue, Sep 15, 2015 at 07:09:22AM -0700, Paul E. McKenney wrote:
> On Tue, Sep 15, 2015 at 02:48:00PM +0200, Peter Zijlstra wrote:
> > On Tue, Sep 15, 2015 at 05:41:42AM -0700, Paul E. McKenney wrote:
> > > > Never mind, the PPC people will implement this with lwsync and that is
> > > > very much not transitive IIRC.
> > > 
> > > I am probably lost on context, but...
> > > 
> > > It turns out that lwsync is transitive in special cases.  One of them
> > > is a series of release-acquire pairs, which can extend indefinitely.
> > > 
> > > Does that help in this case?
> > 
> > Probably not, but good to know. I still don't think we want to rely on
> > ACQUIRE/RELEASE being transitive in general though.
> 
> OK, I will bite...  Why not?

It would mean us reviewing all archs (again) and documenting it I
suppose. Which is of course entirely possible.

That said, I don't think the case at hand requires it, so let's postpone
this for now ;-)


* Re: [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics
  2015-09-15 14:14                 ` Peter Zijlstra
@ 2015-09-15 15:34                   ` Paul E. McKenney
  2015-09-15 16:30                     ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Paul E. McKenney @ 2015-09-15 15:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Davidlohr Bueso, Ingo Molnar, Thomas Gleixner, linux-kernel,
	Davidlohr Bueso

On Tue, Sep 15, 2015 at 04:14:39PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 15, 2015 at 07:09:22AM -0700, Paul E. McKenney wrote:
> > On Tue, Sep 15, 2015 at 02:48:00PM +0200, Peter Zijlstra wrote:
> > > On Tue, Sep 15, 2015 at 05:41:42AM -0700, Paul E. McKenney wrote:
> > > > > Never mind, the PPC people will implement this with lwsync and that is
> > > > > very much not transitive IIRC.
> > > > 
> > > > I am probably lost on context, but...
> > > > 
> > > > It turns out that lwsync is transitive in special cases.  One of them
> > > > is a series of release-acquire pairs, which can extend indefinitely.
> > > > 
> > > > Does that help in this case?
> > > 
> > > Probably not, but good to know. I still don't think we want to rely on
> > > ACQUIRE/RELEASE being transitive in general though.
> > 
> > OK, I will bite...  Why not?
> 
> It would mean us reviewing all archs (again) and documenting it I
> suppose. Which is of course entirely possible.
> 
> That said, I don't think the case at hand requires it, so let's postpone
> this for now ;-)

True enough, but in my experience smp_store_release() and
smp_load_acquire() are a -lot- easier to use than other barriers,
and transitivity will help promote their use.  So...

All the TSO architectures (x86, s390, SPARC, HPPA, ...) support transitive
smp_store_release()/smp_load_acquire() via their native ordering in
combination with barrier() macros.  x86 with CONFIG_X86_PPRO_FENCE=y,
which is not TSO, uses an mfence instruction.  Power supports this via
lwsync's partial cumulativity.  ARM64 supports it in SMP via the new ldar
and stlr instructions (in non-SMP, it uses barrier(), which suffices
in that case).  IA64 supports this via total ordering of all release
instructions in theory and by the actual full-barrier implementation
in practice (and the fact that gcc emits st.rel and ld.acq instructions
for volatile stores and loads).  All other architectures use smp_mb(),
which is transitive.

Did I miss anything?

							Thanx, Paul



* Re: [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics
  2015-09-15 15:34                   ` Paul E. McKenney
@ 2015-09-15 16:30                     ` Peter Zijlstra
  2015-09-15 17:09                       ` Paul E. McKenney
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2015-09-15 16:30 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Davidlohr Bueso, Ingo Molnar, Thomas Gleixner, linux-kernel,
	Davidlohr Bueso

On Tue, Sep 15, 2015 at 08:34:48AM -0700, Paul E. McKenney wrote:
> On Tue, Sep 15, 2015 at 04:14:39PM +0200, Peter Zijlstra wrote:
> > On Tue, Sep 15, 2015 at 07:09:22AM -0700, Paul E. McKenney wrote:
> > > On Tue, Sep 15, 2015 at 02:48:00PM +0200, Peter Zijlstra wrote:
> > > > On Tue, Sep 15, 2015 at 05:41:42AM -0700, Paul E. McKenney wrote:
> > > > > > Never mind, the PPC people will implement this with lwsync and that is
> > > > > > very much not transitive IIRC.
> > > > > 
> > > > > I am probably lost on context, but...
> > > > > 
> > > > > It turns out that lwsync is transitive in special cases.  One of them
> > > > > is a series of release-acquire pairs, which can extend indefinitely.
> > > > > 
> > > > > Does that help in this case?
> > > > 
> > > > Probably not, but good to know. I still don't think we want to rely on
> > > > ACQUIRE/RELEASE being transitive in general though.
> > > 
> > > OK, I will bite...  Why not?
> > 
> > It would mean us reviewing all archs (again) and documenting it I
> > suppose. Which is of course entirely possible.
> > 
> > That said, I don't think the case at hand requires it, so let's postpone
> > this for now ;-)
> 
> True enough, but in my experience smp_store_release() and
> smp_load_acquire() are a -lot- easier to use than other barriers,
> and transitivity will help promote their use.  So...
> 
> All the TSO architectures (x86, s390, SPARC, HPPA, ...) support transitive
> smp_store_release()/smp_load_acquire() via their native ordering in
> combination with barrier() macros.  x86 with CONFIG_X86_PPRO_FENCE=y,
> which is not TSO, uses an mfence instruction.  Power supports this via
> lwsync's partial cumulativity.  ARM64 supports it in SMP via the new ldar
> and stlr instructions (in non-SMP, it uses barrier(), which suffices
> in that case).  IA64 supports this via total ordering of all release
> instructions in theory and by the actual full-barrier implementation
> in practice (and the fact that gcc emits st.rel and ld.acq instructions
> for volatile stores and loads).  All other architectures use smp_mb(),
> which is transitive.
> 
> Did I miss anything?

I think that about covers it.. the only odd duckling might be s390 which
is documented as TSO but recently grew smp_mb__{before,after}_atomic(),
which seems to confuse matters.


* Re: [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics
  2015-09-15 16:30                     ` Peter Zijlstra
@ 2015-09-15 17:09                       ` Paul E. McKenney
  2015-09-18 21:41                         ` Paul E. McKenney
  0 siblings, 1 reply; 28+ messages in thread
From: Paul E. McKenney @ 2015-09-15 17:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Davidlohr Bueso, Ingo Molnar, Thomas Gleixner, linux-kernel,
	Davidlohr Bueso

On Tue, Sep 15, 2015 at 06:30:28PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 15, 2015 at 08:34:48AM -0700, Paul E. McKenney wrote:
> > On Tue, Sep 15, 2015 at 04:14:39PM +0200, Peter Zijlstra wrote:
> > > On Tue, Sep 15, 2015 at 07:09:22AM -0700, Paul E. McKenney wrote:
> > > > On Tue, Sep 15, 2015 at 02:48:00PM +0200, Peter Zijlstra wrote:
> > > > > On Tue, Sep 15, 2015 at 05:41:42AM -0700, Paul E. McKenney wrote:
> > > > > > > Never mind, the PPC people will implement this with lwsync and that is
> > > > > > > very much not transitive IIRC.
> > > > > > 
> > > > > > I am probably lost on context, but...
> > > > > > 
> > > > > > It turns out that lwsync is transitive in special cases.  One of them
> > > > > > is a series of release-acquire pairs, which can extend indefinitely.
> > > > > > 
> > > > > > Does that help in this case?
> > > > > 
> > > > > Probably not, but good to know. I still don't think we want to rely on
> > > > > ACQUIRE/RELEASE being transitive in general though.
> > > > 
> > > > OK, I will bite...  Why not?
> > > 
> > > It would mean us reviewing all archs (again) and documenting it I
> > > suppose. Which is of course entirely possible.
> > > 
> > > That said, I don't think the case at hand requires it, so let's postpone
> > > this for now ;-)
> > 
> > True enough, but in my experience smp_store_release() and
> > smp_load_acquire() are a -lot- easier to use than other barriers,
> > and transitivity will help promote their use.  So...
> > 
> > All the TSO architectures (x86, s390, SPARC, HPPA, ...) support transitive
> > smp_store_release()/smp_load_acquire() via their native ordering in
> > combination with barrier() macros.  x86 with CONFIG_X86_PPRO_FENCE=y,
> > which is not TSO, uses an mfence instruction.  Power supports this via
> > lwsync's partial cumulativity.  ARM64 supports it in SMP via the new ldar
> > and stlr instructions (in non-SMP, it uses barrier(), which suffices
> > in that case).  IA64 supports this via total ordering of all release
> > instructions in theory and by the actual full-barrier implementation
> > in practice (and the fact that gcc emits st.rel and ld.acq instructions
> > for volatile stores and loads).  All other architectures use smp_mb(),
> > which is transitive.
> > 
> > Did I miss anything?
> 
> I think that about covers it.. the only odd duckling might be s390 which
> is documented as TSO but recently grew smp_mb__{before,after}_atomic(),
> which seems to confuse matters.

Fair point, adding Martin and Heiko on CC for their thoughts.

It looks like this applies to recent mainframes that have new atomic
instructions, which, yes, might need something to make them work with
fully transitive smp_load_acquire() and smp_store_release().

Martin, Heiko, the question is whether or not the current s390
smp_store_release() and smp_load_acquire() can be transitive.
For example, if all the Xi variables below are initially zero,
is it possible for all the r0, r1, r2, ... rN variables to
have the value 1 at the end of the test.

CPU 0
	r0 = smp_load_acquire(&X0);
	smp_store_release(&X1, 1);

CPU 1
	r1 = smp_load_acquire(&X1);
	smp_store_release(&X2, 1);

CPU 2
	r2 = smp_load_acquire(&X2);
	smp_store_release(&X3, 1);

...

CPU N
	rN = smp_load_acquire(&XN);
	smp_store_release(&X0, 1);

If smp_store_release() and smp_load_acquire() are transitive, the
answer would be "no".

A similar litmus test involving atomics would be as follows, again
with all Xi initially zero:

CPU 0
	atomic_inc(&X0);
	smp_store_release(&X1, 1);

CPU 1
	r1 = smp_load_acquire(&X1);
	smp_store_release(&X2, 1);

CPU 2
	r2 = smp_load_acquire(&X2);
	smp_store_release(&X3, 1);

...

CPU N
	rN = smp_load_acquire(&XN);
	r0 = atomic_read(&X0);

Here, the question is whether r0 can be zero while r1, r2, ... rN are
all 1 at the end of the test.

Thoughts?

							Thanx, Paul



* Re: [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics
  2015-09-15  9:55         ` Peter Zijlstra
  2015-09-15 12:41           ` Paul E. McKenney
@ 2015-09-15 19:49           ` Davidlohr Bueso
  2015-09-16  9:01             ` Peter Zijlstra
  1 sibling, 1 reply; 28+ messages in thread
From: Davidlohr Bueso @ 2015-09-15 19:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Paul E. McKenney, linux-kernel,
	Davidlohr Bueso

On Tue, 15 Sep 2015, Peter Zijlstra wrote:

>> Secondly, the wake queues are not concurrent, they're in context, so I
>> don't see ordering matter at all. The only reason it's a cmpxchg() is
>> because there is the (small) possibility of two contexts wanting to wake
>> the same task, and we use task_struct storage for the queue.
>
>I don't think we need _any_ barriers here, unless we have concurrent
>users of the wake queues (or want to allow any, do we?).

Exactly, the queues are not concurrent and do not need barriers, but some of
our callers do expect them.


* Re: [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics
  2015-09-15 19:49           ` Davidlohr Bueso
@ 2015-09-16  9:01             ` Peter Zijlstra
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2015-09-16  9:01 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Ingo Molnar, Thomas Gleixner, Paul E. McKenney, linux-kernel,
	Davidlohr Bueso

On Tue, Sep 15, 2015 at 12:49:46PM -0700, Davidlohr Bueso wrote:
> On Tue, 15 Sep 2015, Peter Zijlstra wrote:
> 
> >>Secondly, the wake queues are not concurrent, they're in context, so I
> >>don't see ordering matter at all. The only reason it's a cmpxchg() is
> >>because there is the (small) possibility of two contexts wanting to wake
> >>the same task, and we use task_struct storage for the queue.
> >
> >I don't think we need _any_ barriers here, unless we have concurrent
> >users of the wake queues (or want to allow any, do we?).
> 
> Exactly, the queues are not concurrent and do not need barriers, but some of
> our callers do expect them.

Ah, that is what you were saying. In that case, I think we should remove
all our barriers and make them explicit in the callers where needed.
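
Something like this, for a (hypothetical) caller that does rely on the
ordering, e.g. mqueues (a sketch):

	wake_q_add(&wake_q, task);	/* would no longer imply any barrier */
	smp_mb();			/* explicit, supplied by the caller  */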


* [tip:locking/core] locking/qrwlock: Rename ->lock to ->wait_lock
  2015-09-14  7:37 [PATCH -tip 1/3] locking/qrwlock: Rename ->lock to ->wait_lock Davidlohr Bueso
  2015-09-14  7:37 ` [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics Davidlohr Bueso
  2015-09-14  7:37 ` [PATCH -tip 3/3] locking/osq: Relax atomic semantics Davidlohr Bueso
@ 2015-09-18  8:50 ` tip-bot for Davidlohr Bueso
  2 siblings, 0 replies; 28+ messages in thread
From: tip-bot for Davidlohr Bueso @ 2015-09-18  8:50 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, hpa, dave, torvalds, peterz, Waiman.Long, mingo, dbueso,
	linux-kernel, paulmck, akpm

Commit-ID:  6e1e5196975fb7ecc501b3fe1075b77aea2b7839
Gitweb:     http://git.kernel.org/tip/6e1e5196975fb7ecc501b3fe1075b77aea2b7839
Author:     Davidlohr Bueso <dave@stgolabs.net>
AuthorDate: Mon, 14 Sep 2015 00:37:22 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 18 Sep 2015 09:27:29 +0200

locking/qrwlock: Rename ->lock to ->wait_lock

... trivial, but reads a little nicer when 'lock' names our actual
primitive rather than the internal wait-queue spinlock (no more 'lock->lock').

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hpe.com>
Link: http://lkml.kernel.org/r/1442216244-4409-1-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/asm-generic/qrwlock_types.h | 4 ++--
 kernel/locking/qrwlock.c            | 8 ++++----
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/asm-generic/qrwlock_types.h b/include/asm-generic/qrwlock_types.h
index 4d76f24..0abc6b6 100644
--- a/include/asm-generic/qrwlock_types.h
+++ b/include/asm-generic/qrwlock_types.h
@@ -10,12 +10,12 @@
 
 typedef struct qrwlock {
 	atomic_t		cnts;
-	arch_spinlock_t		lock;
+	arch_spinlock_t		wait_lock;
 } arch_rwlock_t;
 
 #define	__ARCH_RW_LOCK_UNLOCKED {		\
 	.cnts = ATOMIC_INIT(0),			\
-	.lock = __ARCH_SPIN_LOCK_UNLOCKED,	\
+	.wait_lock = __ARCH_SPIN_LOCK_UNLOCKED,	\
 }
 
 #endif /* __ASM_GENERIC_QRWLOCK_TYPES_H */
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index f17a3e3..fec0823 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -86,7 +86,7 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
 	/*
 	 * Put the reader into the wait queue
 	 */
-	arch_spin_lock(&lock->lock);
+	arch_spin_lock(&lock->wait_lock);
 
 	/*
 	 * The ACQUIRE semantics of the following spinning code ensure
@@ -99,7 +99,7 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
 	/*
 	 * Signal the next one in queue to become queue head
 	 */
-	arch_spin_unlock(&lock->lock);
+	arch_spin_unlock(&lock->wait_lock);
 }
 EXPORT_SYMBOL(queued_read_lock_slowpath);
 
@@ -112,7 +112,7 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
 	u32 cnts;
 
 	/* Put the writer into the wait queue */
-	arch_spin_lock(&lock->lock);
+	arch_spin_lock(&lock->wait_lock);
 
 	/* Try to acquire the lock directly if no reader is present */
 	if (!atomic_read(&lock->cnts) &&
@@ -144,6 +144,6 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
 		cpu_relax_lowlatency();
 	}
 unlock:
-	arch_spin_unlock(&lock->lock);
+	arch_spin_unlock(&lock->wait_lock);
 }
 EXPORT_SYMBOL(queued_write_lock_slowpath);


* [tip:locking/core] locking/osq: Relax atomic semantics
  2015-09-14  7:37 ` [PATCH -tip 3/3] locking/osq: Relax atomic semantics Davidlohr Bueso
@ 2015-09-18  8:50   ` tip-bot for Davidlohr Bueso
  0 siblings, 0 replies; 28+ messages in thread
From: tip-bot for Davidlohr Bueso @ 2015-09-18  8:50 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, linux-kernel, Waiman.Long, hpa, dave, mingo, akpm,
	dbueso, paulmck, tglx, peterz

Commit-ID:  c55a6ffa6285e29f874ed403979472631ec70bff
Gitweb:     http://git.kernel.org/tip/c55a6ffa6285e29f874ed403979472631ec70bff
Author:     Davidlohr Bueso <dave@stgolabs.net>
AuthorDate: Mon, 14 Sep 2015 00:37:24 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 18 Sep 2015 09:27:29 +0200

locking/osq: Relax atomic semantics

... by using acquire/release for ops around the lock->tail. As such,
weakly ordered archs can benefit from more relaxed use of barriers
when issuing atomics.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hpe.com>
Link: http://lkml.kernel.org/r/1442216244-4409-3-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/osq_lock.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index dc85ee2..d092a0c 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -50,7 +50,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
 
 	for (;;) {
 		if (atomic_read(&lock->tail) == curr &&
-		    atomic_cmpxchg(&lock->tail, curr, old) == curr) {
+		    atomic_cmpxchg_acquire(&lock->tail, curr, old) == curr) {
 			/*
 			 * We were the last queued, we moved @lock back. @prev
 			 * will now observe @lock and will complete its
@@ -92,7 +92,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	node->next = NULL;
 	node->cpu = curr;
 
-	old = atomic_xchg(&lock->tail, curr);
+	/*
+	 * ACQUIRE semantics, pairs with corresponding RELEASE
+	 * in unlock() uncontended, or fastpath.
+	 */
+	old = atomic_xchg_acquire(&lock->tail, curr);
 	if (old == OSQ_UNLOCKED_VAL)
 		return true;
 
@@ -184,7 +188,8 @@ void osq_unlock(struct optimistic_spin_queue *lock)
 	/*
 	 * Fast path for the uncontended case.
 	 */
-	if (likely(atomic_cmpxchg(&lock->tail, curr, OSQ_UNLOCKED_VAL) == curr))
+	if (likely(atomic_cmpxchg_release(&lock->tail, curr,
+					  OSQ_UNLOCKED_VAL) == curr))
 		return;
 
 	/*


* Re: [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics
  2015-09-15 17:09                       ` Paul E. McKenney
@ 2015-09-18 21:41                         ` Paul E. McKenney
  2015-09-21  9:22                           ` Martin Schwidefsky
  0 siblings, 1 reply; 28+ messages in thread
From: Paul E. McKenney @ 2015-09-18 21:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Davidlohr Bueso, Ingo Molnar, Thomas Gleixner, linux-kernel,
	Davidlohr Bueso, heiko.carstens, schwidefsky

On Tue, Sep 15, 2015 at 10:09:41AM -0700, Paul E. McKenney wrote:
> On Tue, Sep 15, 2015 at 06:30:28PM +0200, Peter Zijlstra wrote:
> > On Tue, Sep 15, 2015 at 08:34:48AM -0700, Paul E. McKenney wrote:
> > > On Tue, Sep 15, 2015 at 04:14:39PM +0200, Peter Zijlstra wrote:
> > > > On Tue, Sep 15, 2015 at 07:09:22AM -0700, Paul E. McKenney wrote:
> > > > > On Tue, Sep 15, 2015 at 02:48:00PM +0200, Peter Zijlstra wrote:
> > > > > > On Tue, Sep 15, 2015 at 05:41:42AM -0700, Paul E. McKenney wrote:
> > > > > > > > Never mind, the PPC people will implement this with lwsync and that is
> > > > > > > > very much not transitive IIRC.
> > > > > > > 
> > > > > > > I am probably lost on context, but...
> > > > > > > 
> > > > > > > It turns out that lwsync is transitive in special cases.  One of them
> > > > > > > is a series of release-acquire pairs, which can extend indefinitely.
> > > > > > > 
> > > > > > > Does that help in this case?
> > > > > > 
> > > > > > Probably not, but good to know. I still don't think we want to rely on
> > > > > > ACQUIRE/RELEASE being transitive in general though.
> > > > > 
> > > > > OK, I will bite...  Why not?
> > > > 
> > > > It would mean us reviewing all archs (again) and documenting it I
> > > > suppose. Which is of course entirely possible.
> > > > 
> > > > > That said, I don't think the case at hand requires it, so let's postpone
> > > > this for now ;-)
> > > 
> > > True enough, but in my experience smp_store_release() and
> > > smp_load_acquire() are a -lot- easier to use than other barriers,
> > > and transitivity will help promote their use.  So...
> > > 
> > > All the TSO architectures (x86, s390, SPARC, HPPA, ...) support transitive
> > > smp_store_release()/smp_load_acquire() via their native ordering in
> > > combination with barrier() macros.  x86 with CONFIG_X86_PPRO_FENCE=y,
> > > which is not TSO, uses an mfence instruction.  Power supports this via
> > > lwsync's partial cumulativity.  ARM64 supports it in SMP via the new ldar
> > > and stlr instructions (in non-SMP, it uses barrier(), which suffices
> > > in that case).  IA64 supports this via total ordering of all release
> > > instructions in theory and by the actual full-barrier implementation
> > > in practice (and the fact that gcc emits st.rel and ld.acq instructions
> > > for volatile stores and loads).  All other architectures use smp_mb(),
> > > which is transitive.
> > > 
> > > Did I miss anything?
> > 
> > I think that about covers it.. the only odd duckling might be s390 which
> > is documented as TSO but recently grew smp_mb__{before,after}_atomic(),
> > which seems to confuse matters.
> 
> Fair point, adding Martin and Heiko on CC for their thoughts.
> 
> It looks like this applies to recent mainframes that have new atomic
> instructions, which, yes, might need something to make them work with
> fully transitive smp_load_acquire() and smp_store_release().
> 
> Martin, Heiko, the question is whether or not the current s390
> smp_store_release() and smp_load_acquire() can be transitive.
> For example, if all the Xi variables below are initially zero,
> is it possible for all the r0, r1, r2, ... rN variables to
> have the value 1 at the end of the test.

Right...  This time actually adding Martin and Heiko on CC...

							Thanx, Paul

> CPU 0
> 	r0 = smp_load_acquire(&X0);
> 	smp_store_release(&X1, 1);
> 
> CPU 1
> 	r1 = smp_load_acquire(&X1);
> 	smp_store_release(&X2, 1);
> 
> CPU 2
> 	r2 = smp_load_acquire(&X2);
> 	smp_store_release(&X3, 1);
> 
> ...
> 
> CPU N
> 	rN = smp_load_acquire(&XN);
> 	smp_store_release(&X0, 1);
> 
> If smp_store_release() and smp_load_acquire() are transitive, the
> answer would be "no".
> 
> A similar litmus test involving atomics would be as follows, again
> with all Xi initially zero:
> 
> CPU 0
> 	atomic_inc(&X0);
> 	smp_store_release(&X1, 1);
> 
> CPU 1
> 	r1 = smp_load_acquire(&X1);
> 	smp_store_release(&X2, 1);
> 
> CPU 2
> 	r2 = smp_load_acquire(&X2);
> 	smp_store_release(&X3, 1);
> 
> ...
> 
> CPU N
> 	rN = smp_load_acquire(&XN);
> 	r0 = atomic_read(&X0);
> 
> Here, the question is whether r0 can be zero while r1, r2, ... rN are
> all 1 at the end of the test.
> 
> Thoughts?
> 
> 							Thanx, Paul



* Re: [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics
  2015-09-18 21:41                         ` Paul E. McKenney
@ 2015-09-21  9:22                           ` Martin Schwidefsky
  2015-09-22 10:27                             ` Martin Schwidefsky
  0 siblings, 1 reply; 28+ messages in thread
From: Martin Schwidefsky @ 2015-09-21  9:22 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, Davidlohr Bueso, Ingo Molnar, Thomas Gleixner,
	linux-kernel, Davidlohr Bueso, heiko.carstens

On Fri, 18 Sep 2015 14:41:20 -0700
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Tue, Sep 15, 2015 at 10:09:41AM -0700, Paul E. McKenney wrote:
> > On Tue, Sep 15, 2015 at 06:30:28PM +0200, Peter Zijlstra wrote:
> > > On Tue, Sep 15, 2015 at 08:34:48AM -0700, Paul E. McKenney wrote:
> > > > On Tue, Sep 15, 2015 at 04:14:39PM +0200, Peter Zijlstra wrote:
> > > > > On Tue, Sep 15, 2015 at 07:09:22AM -0700, Paul E. McKenney wrote:
> > > > > > On Tue, Sep 15, 2015 at 02:48:00PM +0200, Peter Zijlstra wrote:
> > > > > > > On Tue, Sep 15, 2015 at 05:41:42AM -0700, Paul E. McKenney wrote:
> > > > > > > > > Never mind, the PPC people will implement this with lwsync and that is
> > > > > > > > > very much not transitive IIRC.
> > > > > > > > 
> > > > > > > > I am probably lost on context, but...
> > > > > > > > 
> > > > > > > > It turns out that lwsync is transitive in special cases.  One of them
> > > > > > > > is a series of release-acquire pairs, which can extend indefinitely.
> > > > > > > > 
> > > > > > > > Does that help in this case?
> > > > > > > 
> > > > > > > Probably not, but good to know. I still don't think we want to rely on
> > > > > > > ACQUIRE/RELEASE being transitive in general though.
> > > > > > 
> > > > > > OK, I will bite...  Why not?
> > > > > 
> > > > > It would mean us reviewing all archs (again) and documenting it I
> > > > > suppose. Which is of course entirely possible.
> > > > > 
> > > > > That said, I don't think the case at hand requires it, so let's postpone
> > > > > this for now ;-)
> > > > 
> > > > True enough, but in my experience smp_store_release() and
> > > > smp_load_acquire() are a -lot- easier to use than other barriers,
> > > > and transitivity will help promote their use.  So...
> > > > 
> > > > All the TSO architectures (x86, s390, SPARC, HPPA, ...) support transitive
> > > > smp_store_release()/smp_load_acquire() via their native ordering in
> > > > combination with barrier() macros.  x86 with CONFIG_X86_PPRO_FENCE=y,
> > > > which is not TSO, uses an mfence instruction.  Power supports this via
> > > > lwsync's partial cumulativity.  ARM64 supports it in SMP via the new ldar
> > > > and stlr instructions (in non-SMP, it uses barrier(), which suffices
> > > > in that case).  IA64 supports this via total ordering of all release
> > > > instructions in theory and by the actual full-barrier implementation
> > > > in practice (and the fact that gcc emits st.rel and ld.acq instructions
> > > > for volatile stores and loads).  All other architectures use smp_mb(),
> > > > which is transitive.
> > > > 
> > > > Did I miss anything?
> > > 
> > > I think that about covers it.. the only odd duckling might be s390 which
> > > is documented as TSO but recently grew smp_mb__{before,after}_atomic(),
> > > which seems to confuse matters.
> > 
> > Fair point, adding Martin and Heiko on CC for their thoughts.

Well, we have always had the full memory barrier for the various versions of
smp_mb__xxx; they have just moved around and been renamed several times.

After discussing this with Heiko we came to the conclusion that we can use
a simple barrier() for smp_mb__before_atomic() and smp_mb__after_atomic().

> > It looks like this applies to recent mainframes that have new atomic
> > instructions, which, yes, might need something to make them work with
> > fully transitive smp_load_acquire() and smp_store_release().
> > 
> > Martin, Heiko, the question is whether or not the current s390
> > smp_store_release() and smp_load_acquire() can be transitive.
> > For example, if all the Xi variables below are initially zero,
> > is it possible for all the r0, r1, r2, ... rN variables to
> > have the value 1 at the end of the test.
> 
> Right...  This time actually adding Martin and Heiko on CC...
> 
> 							Thanx, Paul
> 
> > CPU 0
> > 	r0 = smp_load_acquire(&X0);
> > 	smp_store_release(&X1, 1);
> > 
> > CPU 1
> > 	r1 = smp_load_acquire(&X1);
> > 	smp_store_release(&X2, 1);
> > 
> > CPU 2
> > 	r2 = smp_load_acquire(&X2);
> > 	smp_store_release(&X3, 1);
> > 
> > ...
> > 
> > CPU N
> > 	rN = smp_load_acquire(&XN);
> > 	smp_store_release(&X0, 1);
> > 
> > If smp_store_release() and smp_load_acquire() are transitive, the
> > answer would be "no".

The answer is "no". Christian recently summarized what the principles of
operation has to say about the CPU read / write behavior. If you consider
the sequential order of instructions then

1) reads are in order
2) writes are in order
3) reads can happen earlier
4) writes can happen later

> > A similar litmus test involving atomics would be as follows, again
> > with all Xi initially zero:
> > 
> > CPU 0
> > 	atomic_inc(&X0);
> > 	smp_store_release(&X1, 1);
> > 
> > CPU 1
> > 	r1 = smp_load_acquire(&X1);
> > 	smp_store_release(&X2, 1);
> > 
> > CPU 2
> > 	r2 = smp_load_acquire(&X2);
> > 	smp_store_release(&X3, 1);
> > 
> > ...
> > 
> > CPU N
> > 	rN = smp_load_acquire(&XN);
> > 	r0 = atomic_read(&X0);
> > 
> > > Here, the question is whether r0 can be zero while r1, r2, ... rN are
> > > all 1 at the end of the test.

r0 = 0 and all r1, r2, ... rN = 1 cannot happen on s390.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



* Re: [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics
  2015-09-21  9:22                           ` Martin Schwidefsky
@ 2015-09-22 10:27                             ` Martin Schwidefsky
  2015-09-22 12:23                               ` Boqun Feng
  0 siblings, 1 reply; 28+ messages in thread
From: Martin Schwidefsky @ 2015-09-22 10:27 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, Davidlohr Bueso, Ingo Molnar, Thomas Gleixner,
	linux-kernel, Davidlohr Bueso, heiko.carstens

On Mon, 21 Sep 2015 11:22:52 +0200
Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:

> On Fri, 18 Sep 2015 14:41:20 -0700
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Tue, Sep 15, 2015 at 10:09:41AM -0700, Paul E. McKenney wrote:
> > > On Tue, Sep 15, 2015 at 06:30:28PM +0200, Peter Zijlstra wrote:
> > > > On Tue, Sep 15, 2015 at 08:34:48AM -0700, Paul E. McKenney wrote:
> > > > > On Tue, Sep 15, 2015 at 04:14:39PM +0200, Peter Zijlstra wrote:
> > > > > > On Tue, Sep 15, 2015 at 07:09:22AM -0700, Paul E. McKenney wrote:
> > > > > > > On Tue, Sep 15, 2015 at 02:48:00PM +0200, Peter Zijlstra wrote:
> > > > > > > > On Tue, Sep 15, 2015 at 05:41:42AM -0700, Paul E. McKenney wrote:
> > > > > > > > > > Never mind, the PPC people will implement this with lwsync and that is
> > > > > > > > > > very much not transitive IIRC.
> > > > > > > > > 
> > > > > > > > > I am probably lost on context, but...
> > > > > > > > > 
> > > > > > > > > It turns out that lwsync is transitive in special cases.  One of them
> > > > > > > > > is a series of release-acquire pairs, which can extend indefinitely.
> > > > > > > > > 
> > > > > > > > > Does that help in this case?
> > > > > > > > 
> > > > > > > > Probably not, but good to know. I still don't think we want to rely on
> > > > > > > > ACQUIRE/RELEASE being transitive in general though.
> > > > > > > 
> > > > > > > OK, I will bite...  Why not?
> > > > > > 
> > > > > > It would mean us reviewing all archs (again) and documenting it I
> > > > > > suppose. Which is of course entirely possible.
> > > > > > 
> > > > > > That said, I don't think the case at hand requires it, so let's postpone
> > > > > > this for now ;-)
> > > > > 
> > > > > True enough, but in my experience smp_store_release() and
> > > > > smp_load_acquire() are a -lot- easier to use than other barriers,
> > > > > and transitivity will help promote their use.  So...
> > > > > 
> > > > > All the TSO architectures (x86, s390, SPARC, HPPA, ...) support transitive
> > > > > smp_store_release()/smp_load_acquire() via their native ordering in
> > > > > combination with barrier() macros.  x86 with CONFIG_X86_PPRO_FENCE=y,
> > > > > which is not TSO, uses an mfence instruction.  Power supports this via
> > > > > lwsync's partial cumulativity.  ARM64 supports it in SMP via the new ldar
> > > > > and stlr instructions (in non-SMP, it uses barrier(), which suffices
> > > > > in that case).  IA64 supports this via total ordering of all release
> > > > > instructions in theory and by the actual full-barrier implementation
> > > > > in practice (and the fact that gcc emits st.rel and ld.acq instructions
> > > > > for volatile stores and loads).  All other architectures use smp_mb(),
> > > > > which is transitive.
> > > > > 
> > > > > Did I miss anything?
> > > > 
> > > > I think that about covers it.. the only odd duckling might be s390 which
> > > > is documented as TSO but recently grew smp_mb__{before,after}_atomic(),
> > > > which seems to confuse matters.
> > > 
> > > Fair point, adding Martin and Heiko on CC for their thoughts.
> 
> Well we always had the full memory barrier for the various versions of
> smp_mb__xxx, they just have moved around and renamed several times.
> 
> After discussing this with Heiko we came to the conclusion that we can use
> a simple barrier() for smp_mb__before_atomic() and smp_mb__after_atomic().
> 
> > > It looks like this applies to recent mainframes that have new atomic
> > > instructions, which, yes, might need something to make them work with
> > > fully transitive smp_load_acquire() and smp_store_release().
> > > 
> > > Martin, Heiko, the question is whether or not the current s390
> > > smp_store_release() and smp_load_acquire() can be transitive.
> > > For example, if all the Xi variables below are initially zero,
> > > is it possible for all the r0, r1, r2, ... rN variables to
> > > have the value 1 at the end of the test.
> > 
> > Right...  This time actually adding Martin and Heiko on CC...
> > 
> > 							Thanx, Paul
> > 
> > > CPU 0
> > > 	r0 = smp_load_acquire(&X0);
> > > 	smp_store_release(&X1, 1);
> > > 
> > > CPU 1
> > > 	r1 = smp_load_acquire(&X1);
> > > 	smp_store_release(&X2, 1);
> > > 
> > > CPU 2
> > > 	r2 = smp_load_acquire(&X2);
> > > 	smp_store_release(&X3, 1);
> > > 
> > > ...
> > > 
> > > CPU N
> > > 	rN = smp_load_acquire(&XN);
> > > 	smp_store_release(&X0, 1);
> > > 
> > > If smp_store_release() and smp_load_acquire() are transitive, the
> > > answer would be "no".
> 
> The answer is "no". Christian recently summarized what the principles of
> operation has to say about the CPU read / write behavior. If you consider
> the sequential order of instructions then
> 
> 1) reads are in order
> 2) writes are in order
> 3) reads can happen earlier
> 4) writes can happen later

Correction. The principles of operation states this:

"A storage-operand store specified by one instruction appears to precede
all storage-operand stores specified by conceptually subsequent instructions,
but it does not necessarily precede storage-operand fetches specified by
conceptually subsequent instructions. However, a storage-operand store
appears to precede a conceptually subsequent storage-operand fetch from the
same main-storage location."

As observed by other CPUs a write to one memory location can "overtake" a
read of another memory location if there is no explicit memory-barrier
between the load and the store instruction.

In the above example X0, X1, ... XN are different memory locations, so
architecturally the answer is "yes": all r0, r1, ... rN variables can have
the value 1 after the test. I doubt that any existing machine will
show this behavior though.
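
For reference, the s390 smp_store_release()/smp_load_acquire() under
discussion boil down to a plain access plus a compiler barrier, roughly
as in the sketch below (simplified from arch/s390/include/asm/barrier.h,
with the compile-time type assertion omitted). Since barrier() emits no
instruction, any reordering question here is purely about the hardware:

#define smp_store_release(p, v)					\
do {								\
	barrier();	/* compiler-only, no instruction */	\
	WRITE_ONCE(*p, v);					\
} while (0)

#define smp_load_acquire(p)					\
({								\
	typeof(*p) ___p1 = READ_ONCE(*p);			\
	barrier();	/* compiler-only, no instruction */	\
	___p1;							\
})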

> > > A similar litmus test involving atomics would be as follows, again
> > > with all Xi initially zero:
> > > 
> > > CPU 0
> > > 	atomic_inc(&X0);
> > > 	smp_store_release(&X1, 1);
> > > 
> > > CPU 1
> > > 	r1 = smp_load_acquire(&X1);
> > > 	smp_store_release(&X2, 1);
> > > 
> > > CPU 2
> > > 	r2 = smp_load_acquire(&X2);
> > > 	smp_store_release(&X3, 1);
> > > 
> > > ...
> > > 
> > > CPU N
> > > 	rN = smp_load_acquire(&XN);
> > > 	r0 = atomic_read(&X0);
> > > 
> > > Here, the question is whether r0 can be zero, but r1, r2, ... rN all
> > > being 1 at the end of the test.
> 
> r0 = 0 and all r1, r2, ... rN = 1 can not happen on s390.
 
This indeed cannot happen, as the atomic_inc is a store-type reference
which precedes the smp_store_release store-type reference.
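
As a sanity check, the two-CPU core of that litmus test can be written
as a userspace analogue with C11 atomics standing in for the kernel
primitives (a sketch only, not the kernel code). There the outcome
r1 == 1 && r0 == 0 is likewise forbidden:

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int X0, X1;

static void *cpu0(void *arg)
{
	/* atomic_inc(&X0); the store half is what must propagate */
	atomic_fetch_add_explicit(&X0, 1, memory_order_relaxed);
	/* smp_store_release(&X1, 1) */
	atomic_store_explicit(&X1, 1, memory_order_release);
	return NULL;
}

static void *cpu1(void *arg)
{
	/* smp_load_acquire(&X1) */
	int r1 = atomic_load_explicit(&X1, memory_order_acquire);
	/* atomic_read(&X0) */
	int r0 = atomic_load_explicit(&X0, memory_order_relaxed);

	/* Seeing the release store guarantees seeing the increment. */
	if (r1 == 1)
		assert(r0 == 1);
	return NULL;
}

int main(void)
{
	pthread_t t0, t1;

	pthread_create(&t0, NULL, cpu0, NULL);
	pthread_create(&t1, NULL, cpu1, NULL);
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);
	return 0;
}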

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics
  2015-09-22 10:27                             ` Martin Schwidefsky
@ 2015-09-22 12:23                               ` Boqun Feng
  2015-09-22 12:51                                 ` Martin Schwidefsky
  0 siblings, 1 reply; 28+ messages in thread
From: Boqun Feng @ 2015-09-22 12:23 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: paulmck, Peter Zijlstra, Davidlohr Bueso, Ingo Molnar,
	Thomas Gleixner, linux-kernel, Davidlohr Bueso, heiko.carstens

Hi Martin,

On Tue, Sep 22, 2015 at 12:27:35PM +0200, Martin Schwidefsky wrote:
> On Mon, 21 Sep 2015 11:22:52 +0200
> Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> 
> > On Fri, 18 Sep 2015 14:41:20 -0700
> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > 
> > > On Tue, Sep 15, 2015 at 10:09:41AM -0700, Paul E. McKenney wrote:
> > > > On Tue, Sep 15, 2015 at 06:30:28PM +0200, Peter Zijlstra wrote:
> > > > > On Tue, Sep 15, 2015 at 08:34:48AM -0700, Paul E. McKenney wrote:
> > > > > > On Tue, Sep 15, 2015 at 04:14:39PM +0200, Peter Zijlstra wrote:
> > > > > > > On Tue, Sep 15, 2015 at 07:09:22AM -0700, Paul E. McKenney wrote:
> > > > > > > > On Tue, Sep 15, 2015 at 02:48:00PM +0200, Peter Zijlstra wrote:
> > > > > > > > > On Tue, Sep 15, 2015 at 05:41:42AM -0700, Paul E. McKenney wrote:
> > > > > > > > > > > Never mind, the PPC people will implement this with lwsync and that is
> > > > > > > > > > > very much not transitive IIRC.
> > > > > > > > > > 
> > > > > > > > > > I am probably lost on context, but...
> > > > > > > > > > 
> > > > > > > > > > It turns out that lwsync is transitive in special cases.  One of them
> > > > > > > > > > is a series of release-acquire pairs, which can extend indefinitely.
> > > > > > > > > > 
> > > > > > > > > > Does that help in this case?
> > > > > > > > > 
> > > > > > > > > Probably not, but good to know. I still don't think we want to rely on
> > > > > > > > > ACQUIRE/RELEASE being transitive in general though.
> > > > > > > > 
> > > > > > > > OK, I will bite...  Why not?
> > > > > > > 
> > > > > > > It would mean us reviewing all archs (again) and documenting it I
> > > > > > > suppose. Which is of course entirely possible.
> > > > > > > 
> > > > > > > That said, I don't think the case at hand requires it, so lets postpone
> > > > > > > this for now ;-)
> > > > > > 
> > > > > > True enough, but in my experience smp_store_release() and
> > > > > > smp_load_acquire() are a -lot- easier to use than other barriers,
> > > > > > and transitivity will help promote their use.  So...
> > > > > > 
> > > > > > All the TSO architectures (x86, s390, SPARC, HPPA, ...) support transitive
> > > > > > smp_store_release()/smp_load_acquire() via their native ordering in
> > > > > > combination with barrier() macros.  x86 with CONFIG_X86_PPRO_FENCE=y,
> > > > > > which is not TSO, uses an mfence instruction.  Power supports this via
> > > > > > lwsync's partial cumulativity.  ARM64 supports it in SMP via the new ldar
> > > > > > and stlr instructions (in non-SMP, it uses barrier(), which suffices
> > > > > > in that case).  IA64 supports this via total ordering of all release
> > > > > > instructions in theory and by the actual full-barrier implementation
> > > > > > in practice (and the fact that gcc emits st.rel and ld.acq instructions
> > > > > > for volatile stores and loads).  All other architectures use smp_mb(),
> > > > > > which is transitive.
> > > > > > 
> > > > > > Did I miss anything?
> > > > > 
> > > > > I think that about covers it.. the only odd duckling might be s390 which
> > > > > is documented as TSO but recently grew smp_mb__{before,after}_atomic(),
> > > > > which seems to confuse matters.
> > > > 
> > > > Fair point, adding Martin and Heiko on CC for their thoughts.
> > 
> > Well we always had the full memory barrier for the various versions of
> > smp_mb__xxx, they just have moved around and renamed several times.
> > 
> > After discussing this with Heiko we came to the conclusion that we can use
> > a simple barrier() for smp_mb__before_atomic() and smp_mb__after_atomic().
> > 
> > > > It looks like this applies to recent mainframes that have new atomic
> > > > instructions, which, yes, might need something to make them work with
> > > > fully transitive smp_load_acquire() and smp_store_release().
> > > > 
> > > > Martin, Heiko, the question is whether or not the current s390
> > > > smp_store_release() and smp_load_acquire() can be transitive.
> > > > For example, if all the Xi variables below are initially zero,
> > > > is it possible for all the r0, r1, r2, ... rN variables to
> > > > have the value 1 at the end of the test.
> > > 
> > > Right...  This time actually adding Martin and Heiko on CC...
> > > 
> > > 							Thanx, Paul
> > > 
> > > > CPU 0
> > > > 	r0 = smp_load_acquire(&X0);
> > > > 	smp_store_release(&X1, 1);
> > > > 
> > > > CPU 1
> > > > 	r1 = smp_load_acquire(&X1);
> > > > 	smp_store_release(&X2, 1);
> > > > 
> > > > CPU 2
> > > > 	r2 = smp_load_acquire(&X2);
> > > > 	smp_store_release(&X3, 1);
> > > > 
> > > > ...
> > > > 
> > > > CPU N
> > > > 	rN = smp_load_acquire(&XN);
> > > > 	smp_store_release(&X0, 1);
> > > > 
> > > > If smp_store_release() and smp_load_acquire() are transitive, the
> > > > answer would be "no".
> > 
> > The answer is "no". Christian recently summarized what the principles of
> > operation has to say about the CPU read / write behavior. If you consider
> > the sequential order of instructions then
> > 
> > 1) reads are in order
> > 2) writes are in order
> > 3) reads can happen earlier
> > 4) writes can happen later
> 
> Correction. The principles of operation states this:
> 
> "A storage-operand store specified by one instruction appears to precede
> all storage-operand stores specified by conceptually subsequent instructions,
> but it does not necessarily precede storage-operand fetches specified by
> conceptually subsequent instructions. However, a storage-operand store
> appears to precede a conceptually subsequent storage-operand fetch from the
> same main-storage location."
> 
> As observed by other CPUs a write to one memory location can "overtake" a
> read of another memory location if there is no explicit memory-barrier
> between the load and the store instruction.
> 
> In the above example X0, X1, ... XN are different memory locations, so
> architecturally the answer is "yes", all r0, r1, ... rN variables can have
> the value of 1 after the test. I doubt that any existing machine will
> show this behavior though.
> 

Just being curious, how about when N == 1? The test then becomes:

CPU 0
	r0 = smp_load_acquire(&X0);
	smp_store_release(&X1,1);

CPU 1
	r1 = smp_load_acquire(&X1);
	smp_store_release(&X0,1);

Is it possible that r0 == 1 and r1 == 1 at the end, for the same
reason?

Regards,
Boqun

> > > > A similar litmus test involving atomics would be as follows, again
> > > > with all Xi initially zero:
> > > > 
> > > > CPU 0
> > > > 	atomic_inc(&X0);
> > > > 	smp_store_release(&X1, 1);
> > > > 
> > > > CPU 1
> > > > 	r1 = smp_load_acquire(&X1);
> > > > 	smp_store_release(&X2, 1);
> > > > 
> > > > CPU 2
> > > > 	r2 = smp_load_acquire(&X2);
> > > > 	smp_store_release(&X3, 1);
> > > > 
> > > > ...
> > > > 
> > > > CPU N
> > > > 	rN = smp_load_acquire(&XN);
> > > > 	r0 = atomic_read(&X0);
> > > > 
> > > > Here, the question is whether r0 can be zero, but r1, r2, ... rN all
> > > > being 1 at the end of the test.
> > 
> > r0 = 0 and all r1, r2, ... rN = 1 can not happen on s390.
>  
> This indeed can not happen as the atomic_inc is a store type reference
> which precedes the smp_store_release store type reference.
> 
> -- 
> blue skies,
>    Martin.
> 
> "Reality continues to ruin my life." - Calvin.
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics
  2015-09-22 12:23                               ` Boqun Feng
@ 2015-09-22 12:51                                 ` Martin Schwidefsky
  2015-09-22 13:29                                   ` Boqun Feng
  0 siblings, 1 reply; 28+ messages in thread
From: Martin Schwidefsky @ 2015-09-22 12:51 UTC (permalink / raw)
  To: Boqun Feng
  Cc: paulmck, Peter Zijlstra, Davidlohr Bueso, Ingo Molnar,
	Thomas Gleixner, linux-kernel, Davidlohr Bueso, heiko.carstens

On Tue, 22 Sep 2015 20:23:26 +0800
Boqun Feng <boqun.feng@gmail.com> wrote:

> Hi Martin,
> 
> On Tue, Sep 22, 2015 at 12:27:35PM +0200, Martin Schwidefsky wrote:
> > On Mon, 21 Sep 2015 11:22:52 +0200
> > Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> > 
> > > On Fri, 18 Sep 2015 14:41:20 -0700
> > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > > 
> > > > On Tue, Sep 15, 2015 at 10:09:41AM -0700, Paul E. McKenney wrote:
> > > > > On Tue, Sep 15, 2015 at 06:30:28PM +0200, Peter Zijlstra wrote:
> > > > > > On Tue, Sep 15, 2015 at 08:34:48AM -0700, Paul E. McKenney wrote:
> > > > > > > On Tue, Sep 15, 2015 at 04:14:39PM +0200, Peter Zijlstra wrote:
> > > > > > > > On Tue, Sep 15, 2015 at 07:09:22AM -0700, Paul E. McKenney wrote:
> > > > > > > > > On Tue, Sep 15, 2015 at 02:48:00PM +0200, Peter Zijlstra wrote:
> > > > > > > > > > On Tue, Sep 15, 2015 at 05:41:42AM -0700, Paul E. McKenney wrote:
> > > > > > > > > > > > Never mind, the PPC people will implement this with lwsync and that is
> > > > > > > > > > > > very much not transitive IIRC.
> > > > > > > > > > > 
> > > > > > > > > > > I am probably lost on context, but...
> > > > > > > > > > > 
> > > > > > > > > > > It turns out that lwsync is transitive in special cases.  One of them
> > > > > > > > > > > is a series of release-acquire pairs, which can extend indefinitely.
> > > > > > > > > > > 
> > > > > > > > > > > Does that help in this case?
> > > > > > > > > > 
> > > > > > > > > > Probably not, but good to know. I still don't think we want to rely on
> > > > > > > > > > ACQUIRE/RELEASE being transitive in general though.
> > > > > > > > > 
> > > > > > > > > OK, I will bite...  Why not?
> > > > > > > > 
> > > > > > > > It would mean us reviewing all archs (again) and documenting it I
> > > > > > > > suppose. Which is of course entirely possible.
> > > > > > > > 
> > > > > > > > That said, I don't think the case at hand requires it, so lets postpone
> > > > > > > > this for now ;-)
> > > > > > > 
> > > > > > > True enough, but in my experience smp_store_release() and
> > > > > > > smp_load_acquire() are a -lot- easier to use than other barriers,
> > > > > > > and transitivity will help promote their use.  So...
> > > > > > > 
> > > > > > > All the TSO architectures (x86, s390, SPARC, HPPA, ...) support transitive
> > > > > > > smp_store_release()/smp_load_acquire() via their native ordering in
> > > > > > > combination with barrier() macros.  x86 with CONFIG_X86_PPRO_FENCE=y,
> > > > > > > which is not TSO, uses an mfence instruction.  Power supports this via
> > > > > > > lwsync's partial cumulativity.  ARM64 supports it in SMP via the new ldar
> > > > > > > and stlr instructions (in non-SMP, it uses barrier(), which suffices
> > > > > > > in that case).  IA64 supports this via total ordering of all release
> > > > > > > instructions in theory and by the actual full-barrier implementation
> > > > > > > in practice (and the fact that gcc emits st.rel and ld.acq instructions
> > > > > > > for volatile stores and loads).  All other architectures use smp_mb(),
> > > > > > > which is transitive.
> > > > > > > 
> > > > > > > Did I miss anything?
> > > > > > 
> > > > > > I think that about covers it.. the only odd duckling might be s390 which
> > > > > > is documented as TSO but recently grew smp_mb__{before,after}_atomic(),
> > > > > > which seems to confuse matters.
> > > > > 
> > > > > Fair point, adding Martin and Heiko on CC for their thoughts.
> > > 
> > > Well we always had the full memory barrier for the various versions of
> > > smp_mb__xxx, they just have moved around and renamed several times.
> > > 
> > > After discussing this with Heiko we came to the conclusion that we can use
> > > a simple barrier() for smp_mb__before_atomic() and smp_mb__after_atomic().
> > > 
> > > > > It looks like this applies to recent mainframes that have new atomic
> > > > > instructions, which, yes, might need something to make them work with
> > > > > fully transitive smp_load_acquire() and smp_store_release().
> > > > > 
> > > > > Martin, Heiko, the question is whether or not the current s390
> > > > > smp_store_release() and smp_load_acquire() can be transitive.
> > > > > For example, if all the Xi variables below are initially zero,
> > > > > is it possible for all the r0, r1, r2, ... rN variables to
> > > > > have the value 1 at the end of the test.
> > > > 
> > > > Right...  This time actually adding Martin and Heiko on CC...
> > > > 
> > > > 							Thanx, Paul
> > > > 
> > > > > CPU 0
> > > > > 	r0 = smp_load_acquire(&X0);
> > > > > 	smp_store_release(&X1, 1);
> > > > > 
> > > > > CPU 1
> > > > > 	r1 = smp_load_acquire(&X1);
> > > > > 	smp_store_release(&X2, 1);
> > > > > 
> > > > > CPU 2
> > > > > 	r2 = smp_load_acquire(&X2);
> > > > > 	smp_store_release(&X3, 1);
> > > > > 
> > > > > ...
> > > > > 
> > > > > CPU N
> > > > > 	rN = smp_load_acquire(&XN);
> > > > > 	smp_store_release(&X0, 1);
> > > > > 
> > > > > If smp_store_release() and smp_load_acquire() are transitive, the
> > > > > answer would be "no".
> > > 
> > > The answer is "no". Christian recently summarized what the principles of
> > > operation has to say about the CPU read / write behavior. If you consider
> > > the sequential order of instructions then
> > > 
> > > 1) reads are in order
> > > 2) writes are in order
> > > 3) reads can happen earlier
> > > 4) writes can happen later
> > 
> > Correction. The principles of operation states this:
> > 
> > "A storage-operand store specified by one instruction appears to precede
> > all storage-operand stores specified by conceptually subsequent instructions,
> > but it does not necessarily precede storage-operand fetches specified by
> > conceptually subsequent instructions. However, a storage-operand store
> > appears to precede a conceptually subsequent storage-operand fetch from the
> > same main-storage location."
> > 
> > As observed by other CPUs a write to one memory location can "overtake" a
> > read of another memory location if there is no explicit memory-barrier
> > between the load and the store instruction.
> > 
> > In the above example X0, X1, ... XN are different memory locations, so
> > architecturally the answer is "yes", all r0, r1, ... rN variables can have
> > the value of 1 after the test. I doubt that any existing machine will
> > show this behavior though.
> > 
> 
> Just be curious, how about when N == 1? The test then becomes:
> 
> CPU 0
> 	r0 = smp_load_acquire(&X0);
> 	smp_store_release(&X1,1);
> 
> CPU 1
> 	r1 = smp_load_acquire(&X1);
> 	smp_store_release(&X0,1);
> 
> Is it possible that r0 == 1 and r1 == 1 at the end, due to the same
> reason?

Yes, that is possible for the same reason. To change that we would have
to replace the barrier() in smp_load_acquire/smp_store_release with
smp_mb().
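
That is, something along these lines (a sketch only, with the analogous
change after the load in smp_load_acquire()):

#define smp_store_release(p, v)					\
do {								\
	smp_mb();	/* full barrier instead of barrier() */	\
	WRITE_ONCE(*p, v);					\
} while (0)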

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics
  2015-09-22 12:51                                 ` Martin Schwidefsky
@ 2015-09-22 13:29                                   ` Boqun Feng
  2015-09-22 14:33                                     ` Martin Schwidefsky
  0 siblings, 1 reply; 28+ messages in thread
From: Boqun Feng @ 2015-09-22 13:29 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: paulmck, Peter Zijlstra, Davidlohr Bueso, Ingo Molnar,
	Thomas Gleixner, linux-kernel, Davidlohr Bueso, heiko.carstens

On Tue, Sep 22, 2015 at 02:51:36PM +0200, Martin Schwidefsky wrote:
> On Tue, 22 Sep 2015 20:23:26 +0800
> Boqun Feng <boqun.feng@gmail.com> wrote:
> 
> > Hi Martin,
> > 
> > On Tue, Sep 22, 2015 at 12:27:35PM +0200, Martin Schwidefsky wrote:
> > > On Mon, 21 Sep 2015 11:22:52 +0200
> > > Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> > > 
> > > > On Fri, 18 Sep 2015 14:41:20 -0700
> > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > > > 
> > > > > On Tue, Sep 15, 2015 at 10:09:41AM -0700, Paul E. McKenney wrote:
> > > > > > On Tue, Sep 15, 2015 at 06:30:28PM +0200, Peter Zijlstra wrote:
> > > > > > > On Tue, Sep 15, 2015 at 08:34:48AM -0700, Paul E. McKenney wrote:
> > > > > > > > On Tue, Sep 15, 2015 at 04:14:39PM +0200, Peter Zijlstra wrote:
> > > > > > > > > On Tue, Sep 15, 2015 at 07:09:22AM -0700, Paul E. McKenney wrote:
> > > > > > > > > > On Tue, Sep 15, 2015 at 02:48:00PM +0200, Peter Zijlstra wrote:
> > > > > > > > > > > On Tue, Sep 15, 2015 at 05:41:42AM -0700, Paul E. McKenney wrote:
> > > > > > > > > > > > > Never mind, the PPC people will implement this with lwsync and that is
> > > > > > > > > > > > > very much not transitive IIRC.
> > > > > > > > > > > > 
> > > > > > > > > > > > I am probably lost on context, but...
> > > > > > > > > > > > 
> > > > > > > > > > > > It turns out that lwsync is transitive in special cases.  One of them
> > > > > > > > > > > > is a series of release-acquire pairs, which can extend indefinitely.
> > > > > > > > > > > > 
> > > > > > > > > > > > Does that help in this case?
> > > > > > > > > > > 
> > > > > > > > > > > Probably not, but good to know. I still don't think we want to rely on
> > > > > > > > > > > ACQUIRE/RELEASE being transitive in general though.
> > > > > > > > > > 
> > > > > > > > > > OK, I will bite...  Why not?
> > > > > > > > > 
> > > > > > > > > It would mean us reviewing all archs (again) and documenting it I
> > > > > > > > > suppose. Which is of course entirely possible.
> > > > > > > > > 
> > > > > > > > > That said, I don't think the case at hand requires it, so lets postpone
> > > > > > > > > this for now ;-)
> > > > > > > > 
> > > > > > > > True enough, but in my experience smp_store_release() and
> > > > > > > > smp_load_acquire() are a -lot- easier to use than other barriers,
> > > > > > > > and transitivity will help promote their use.  So...
> > > > > > > > 
> > > > > > > > All the TSO architectures (x86, s390, SPARC, HPPA, ...) support transitive
> > > > > > > > smp_store_release()/smp_load_acquire() via their native ordering in
> > > > > > > > combination with barrier() macros.  x86 with CONFIG_X86_PPRO_FENCE=y,
> > > > > > > > which is not TSO, uses an mfence instruction.  Power supports this via
> > > > > > > > lwsync's partial cumulativity.  ARM64 supports it in SMP via the new ldar
> > > > > > > > and stlr instructions (in non-SMP, it uses barrier(), which suffices
> > > > > > > > in that case).  IA64 supports this via total ordering of all release
> > > > > > > > instructions in theory and by the actual full-barrier implementation
> > > > > > > > in practice (and the fact that gcc emits st.rel and ld.acq instructions
> > > > > > > > for volatile stores and loads).  All other architectures use smp_mb(),
> > > > > > > > which is transitive.
> > > > > > > > 
> > > > > > > > Did I miss anything?
> > > > > > > 
> > > > > > > I think that about covers it.. the only odd duckling might be s390 which
> > > > > > > is documented as TSO but recently grew smp_mb__{before,after}_atomic(),
> > > > > > > which seems to confuse matters.
> > > > > > 
> > > > > > Fair point, adding Martin and Heiko on CC for their thoughts.
> > > > 
> > > > Well we always had the full memory barrier for the various versions of
> > > > smp_mb__xxx, they just have moved around and renamed several times.
> > > > 
> > > > After discussing this with Heiko we came to the conclusion that we can use
> > > > a simple barrier() for smp_mb__before_atomic() and smp_mb__after_atomic().
> > > > 
> > > > > > It looks like this applies to recent mainframes that have new atomic
> > > > > > instructions, which, yes, might need something to make them work with
> > > > > > fully transitive smp_load_acquire() and smp_store_release().
> > > > > > 
> > > > > > Martin, Heiko, the question is whether or not the current s390
> > > > > > smp_store_release() and smp_load_acquire() can be transitive.
> > > > > > For example, if all the Xi variables below are initially zero,
> > > > > > is it possible for all the r0, r1, r2, ... rN variables to
> > > > > > have the value 1 at the end of the test.
> > > > > 
> > > > > Right...  This time actually adding Martin and Heiko on CC...
> > > > > 
> > > > > 							Thanx, Paul
> > > > > 
> > > > > > CPU 0
> > > > > > 	r0 = smp_load_acquire(&X0);
> > > > > > 	smp_store_release(&X1, 1);
> > > > > > 
> > > > > > CPU 1
> > > > > > 	r1 = smp_load_acquire(&X1);
> > > > > > 	smp_store_release(&X2, 1);
> > > > > > 
> > > > > > CPU 2
> > > > > > 	r2 = smp_load_acquire(&X2);
> > > > > > 	smp_store_release(&X3, 1);
> > > > > > 
> > > > > > ...
> > > > > > 
> > > > > > CPU N
> > > > > > 	rN = smp_load_acquire(&XN);
> > > > > > 	smp_store_release(&X0, 1);
> > > > > > 
> > > > > > If smp_store_release() and smp_load_acquire() are transitive, the
> > > > > > answer would be "no".
> > > > 
> > > > The answer is "no". Christian recently summarized what the principles of
> > > > operation has to say about the CPU read / write behavior. If you consider
> > > > the sequential order of instructions then
> > > > 
> > > > 1) reads are in order
> > > > 2) writes are in order
> > > > 3) reads can happen earlier
> > > > 4) writes can happen later
> > > 
> > > Correction. The principles of operation states this:
> > > 
> > > "A storage-operand store specified by one instruction appears to precede
> > > all storage-operand stores specified by conceptually subsequent instructions,
> > > but it does not necessarily precede storage-operand fetches specified by
> > > conceptually subsequent instructions. However, a storage-operand store
> > > appears to precede a conceptually subsequent storage-operand fetch from the
> > > same main-storage location."
> > > 

Confused...

IIUC, the previous paragraph actually means that a STORE-LOAD pair can be
reordered, but the reasoning below says that a LOAD-STORE pair can be
reordered. Am I missing something here?
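
In other words (with WRITE_ONCE()/READ_ONCE() merely as stand-ins for
the accesses inside the release/acquire macros), the two patterns are:

	/* store-load reordering: what the quoted paragraph permits.
	 * The store to X may become visible after the later load of Y.
	 */
	WRITE_ONCE(X, 1);
	r = READ_ONCE(Y);

	/* load-store reordering: what the all-ones outcome would need.
	 * The store to X would have to become visible before the
	 * earlier load of Y is satisfied.
	 */
	r = READ_ONCE(Y);
	WRITE_ONCE(X, 1);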

> > > As observed by other CPUs a write to one memory location can "overtake" a
> > > read of another memory location if there is no explicit memory-barrier
> > > between the load and the store instruction.
> > > 
> > > In the above example X0, X1, ... XN are different memory locations, so
> > > architecturally the answer is "yes", all r0, r1, ... rN variables can have
> > > the value of 1 after the test. I doubt that any existing machine will
> > > show this behavior though.
> > > 
> > 
> > Just be curious, how about when N == 1? The test then becomes:
> > 
> > CPU 0
> > 	r0 = smp_load_acquire(&X0);
> > 	smp_store_release(&X1,1);
> > 
> > CPU 1
> > 	r1 = smp_load_acquire(&X1);
> > 	smp_store_release(&X0,1);
> > 
> > Is it possible that r0 == 1 and r1 == 1 at the end, due to the same
> > reason?
> 
> Yes, that is possible for the same reason. To change that we would have
> to replace the barrier() in smp_load_acquire/smp_store_release with
> smp_mb().
> 

I thought that s390 is TSO, so this is prohibited. If that is possible,
I think that means the current implementation of smp_load_acquire and
smp_store_release on s390 is incorrect...

Regards,
Boqun


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics
  2015-09-22 13:29                                   ` Boqun Feng
@ 2015-09-22 14:33                                     ` Martin Schwidefsky
  2015-09-22 15:28                                       ` Paul E. McKenney
  0 siblings, 1 reply; 28+ messages in thread
From: Martin Schwidefsky @ 2015-09-22 14:33 UTC (permalink / raw)
  To: Boqun Feng
  Cc: paulmck, Peter Zijlstra, Davidlohr Bueso, Ingo Molnar,
	Thomas Gleixner, linux-kernel, Davidlohr Bueso, heiko.carstens

On Tue, 22 Sep 2015 21:29:14 +0800
Boqun Feng <boqun.feng@gmail.com> wrote:

> On Tue, Sep 22, 2015 at 02:51:36PM +0200, Martin Schwidefsky wrote:
> > On Tue, 22 Sep 2015 20:23:26 +0800
> > Boqun Feng <boqun.feng@gmail.com> wrote:
> > 
> > > Hi Martin,
> > > 
> > > On Tue, Sep 22, 2015 at 12:27:35PM +0200, Martin Schwidefsky wrote:
> > > > On Mon, 21 Sep 2015 11:22:52 +0200
> > > > Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> > > > 
> > > > > On Fri, 18 Sep 2015 14:41:20 -0700
> > > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > > > > 
> > > > > > On Tue, Sep 15, 2015 at 10:09:41AM -0700, Paul E. McKenney wrote:
> > > > > > > On Tue, Sep 15, 2015 at 06:30:28PM +0200, Peter Zijlstra wrote:
> > > > > > > > On Tue, Sep 15, 2015 at 08:34:48AM -0700, Paul E. McKenney wrote:
> > > > > > > > > On Tue, Sep 15, 2015 at 04:14:39PM +0200, Peter Zijlstra wrote:
> > > > > > > > > > On Tue, Sep 15, 2015 at 07:09:22AM -0700, Paul E. McKenney wrote:
> > > > > > > > > > > On Tue, Sep 15, 2015 at 02:48:00PM +0200, Peter Zijlstra wrote:
> > > > > > > > > > > > On Tue, Sep 15, 2015 at 05:41:42AM -0700, Paul E. McKenney wrote:
> > > > > > > > > > > > > > Never mind, the PPC people will implement this with lwsync and that is
> > > > > > > > > > > > > > very much not transitive IIRC.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I am probably lost on context, but...
> > > > > > > > > > > > > 
> > > > > > > > > > > > > It turns out that lwsync is transitive in special cases.  One of them
> > > > > > > > > > > > > is a series of release-acquire pairs, which can extend indefinitely.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Does that help in this case?
> > > > > > > > > > > > 
> > > > > > > > > > > > Probably not, but good to know. I still don't think we want to rely on
> > > > > > > > > > > > ACQUIRE/RELEASE being transitive in general though.
> > > > > > > > > > > 
> > > > > > > > > > > OK, I will bite...  Why not?
> > > > > > > > > > 
> > > > > > > > > > It would mean us reviewing all archs (again) and documenting it I
> > > > > > > > > > suppose. Which is of course entirely possible.
> > > > > > > > > > 
> > > > > > > > > > That said, I don't think the case at hand requires it, so lets postpone
> > > > > > > > > > this for now ;-)
> > > > > > > > > 
> > > > > > > > > True enough, but in my experience smp_store_release() and
> > > > > > > > > smp_load_acquire() are a -lot- easier to use than other barriers,
> > > > > > > > > and transitivity will help promote their use.  So...
> > > > > > > > > 
> > > > > > > > > All the TSO architectures (x86, s390, SPARC, HPPA, ...) support transitive
> > > > > > > > > smp_store_release()/smp_load_acquire() via their native ordering in
> > > > > > > > > combination with barrier() macros.  x86 with CONFIG_X86_PPRO_FENCE=y,
> > > > > > > > > which is not TSO, uses an mfence instruction.  Power supports this via
> > > > > > > > > lwsync's partial cumulativity.  ARM64 supports it in SMP via the new ldar
> > > > > > > > > and stlr instructions (in non-SMP, it uses barrier(), which suffices
> > > > > > > > > in that case).  IA64 supports this via total ordering of all release
> > > > > > > > > instructions in theory and by the actual full-barrier implementation
> > > > > > > > > in practice (and the fact that gcc emits st.rel and ld.acq instructions
> > > > > > > > > for volatile stores and loads).  All other architectures use smp_mb(),
> > > > > > > > > which is transitive.
> > > > > > > > > 
> > > > > > > > > Did I miss anything?
> > > > > > > > 
> > > > > > > > I think that about covers it.. the only odd duckling might be s390 which
> > > > > > > > is documented as TSO but recently grew smp_mb__{before,after}_atomic(),
> > > > > > > > which seems to confuse matters.
> > > > > > > 
> > > > > > > Fair point, adding Martin and Heiko on CC for their thoughts.
> > > > > 
> > > > > Well we always had the full memory barrier for the various versions of
> > > > > smp_mb__xxx, they just have moved around and renamed several times.
> > > > > 
> > > > > After discussing this with Heiko we came to the conclusion that we can use
> > > > > a simple barrier() for smp_mb__before_atomic() and smp_mb__after_atomic().
> > > > > 
> > > > > > > It looks like this applies to recent mainframes that have new atomic
> > > > > > > instructions, which, yes, might need something to make them work with
> > > > > > > fully transitive smp_load_acquire() and smp_store_release().
> > > > > > > 
> > > > > > > Martin, Heiko, the question is whether or not the current s390
> > > > > > > smp_store_release() and smp_load_acquire() can be transitive.
> > > > > > > For example, if all the Xi variables below are initially zero,
> > > > > > > is it possible for all the r0, r1, r2, ... rN variables to
> > > > > > > have the value 1 at the end of the test.
> > > > > > 
> > > > > > Right...  This time actually adding Martin and Heiko on CC...
> > > > > > 
> > > > > > 							Thanx, Paul
> > > > > > 
> > > > > > > CPU 0
> > > > > > > 	r0 = smp_load_acquire(&X0);
> > > > > > > 	smp_store_release(&X1, 1);
> > > > > > > 
> > > > > > > CPU 1
> > > > > > > 	r1 = smp_load_acquire(&X1);
> > > > > > > 	smp_store_release(&X2, 1);
> > > > > > > 
> > > > > > > CPU 2
> > > > > > > 	r2 = smp_load_acquire(&X2);
> > > > > > > 	smp_store_release(&X3, 1);
> > > > > > > 
> > > > > > > ...
> > > > > > > 
> > > > > > > CPU N
> > > > > > > 	rN = smp_load_acquire(&XN);
> > > > > > > 	smp_store_release(&X0, 1);
> > > > > > > 
> > > > > > > If smp_store_release() and smp_load_acquire() are transitive, the
> > > > > > > answer would be "no".
> > > > > 
> > > > > The answer is "no". Christian recently summarized what the principles of
> > > > > operation has to say about the CPU read / write behavior. If you consider
> > > > > the sequential order of instructions then
> > > > > 
> > > > > 1) reads are in order
> > > > > 2) writes are in order
> > > > > 3) reads can happen earlier
> > > > > 4) writes can happen later
> > > > 
> > > > Correction. The principles of operation states this:
> > > > 
> > > > "A storage-operand store specified by one instruction appears to precede
> > > > all storage-operand stores specified by conceptually subsequent instructions,
> > > > but it does not necessarily precede storage-operand fetches specified by
> > > > conceptually subsequent instructions. However, a storage-operand store
> > > > appears to precede a conceptually subsequent storage-operand fetch from the
> > > > same main-storage location."
> > > > 
> 
> Confused...

Yeah, it seems like I'm confused as well. This stuff always makes my head hurt...
 
> IIUC, the previous paragraph actually means that a STORE-LOAD pair can be
> reordered. But the below reasoning is saying that a LOAD-STORE pair can
> be reordered. Do I miss something here?

True, the above paragraph allows an earlier store to move past a later
load, but not a later store to move ahead of an earlier load.

> > > > As observed by other CPUs a write to one memory location can "overtake" a
> > > > read of another memory location if there is no explicit memory-barrier
> > > > between the load and the store instruction.
> > > > 
> > > > In the above example X0, X1, ... XN are different memory locations, so
> > > > architecturally the answer is "yes", all r0, r1, ... rN variables can have
> > > > the value of 1 after the test. I doubt that any existing machine will
> > > > show this behavior though.
> > > > 
> > > 
> > > Just be curious, how about when N == 1? The test then becomes:
> > > 
> > > CPU 0
> > > 	r0 = smp_load_acquire(&X0);
> > > 	smp_store_release(&X1,1);
> > > 
> > > CPU 1
> > > 	r1 = smp_load_acquire(&X1);
> > > 	smp_store_release(&X0,1);
> > > 
> > > Is it possible that r0 == 1 and r1 == 1 at the end, due to the same
> > > reason?
> > 
> > Yes, that is possible for the same reason. To change that we would have
> > to replace the barrier() in smp_load_acquire/smp_store_release with
> > smp_mb().
> > 
> 
> > I thought that s390 is TSO, so this is prohibited. If that is possible,
> I think, that means the current implementation of smp_load_acquire and
> smp_store_release on s390 is incorrect...

Ok, further reading of chapter 5 of the principles revealed this:

"As observed by other CPUs and by channel programs, storage-operand fetches
associated with one instruction execution appear to precede all storage
operand references for conceptually subsequent instructions."

So no writes before reads. Correction to the correction: all r0, r1, ... rN
equal to one cannot happen after all. Got me worried there ;-)
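
For what it's worth, a userspace C11 analogue of the N == 1 test (C11
atomics rather than the kernel primitives; a sketch only) forbids the
all-ones outcome as well, since each load reading the other thread's
store would close a happens-before cycle:

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int X0, X1;
static int r0, r1;

static void *cpu0(void *arg)
{
	r0 = atomic_load_explicit(&X0, memory_order_acquire);
	atomic_store_explicit(&X1, 1, memory_order_release);
	return NULL;
}

static void *cpu1(void *arg)
{
	r1 = atomic_load_explicit(&X1, memory_order_acquire);
	atomic_store_explicit(&X0, 1, memory_order_release);
	return NULL;
}

int main(void)
{
	pthread_t t0, t1;

	pthread_create(&t0, NULL, cpu0, NULL);
	pthread_create(&t1, NULL, cpu1, NULL);
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);

	/* Both synchronizes-with edges holding at once would make each
	 * release store happen before itself, so this must pass.
	 */
	assert(!(r0 == 1 && r1 == 1));
	return 0;
}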

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics
  2015-09-22 14:33                                     ` Martin Schwidefsky
@ 2015-09-22 15:28                                       ` Paul E. McKenney
  2015-09-23  6:43                                         ` Martin Schwidefsky
  0 siblings, 1 reply; 28+ messages in thread
From: Paul E. McKenney @ 2015-09-22 15:28 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Boqun Feng, Peter Zijlstra, Davidlohr Bueso, Ingo Molnar,
	Thomas Gleixner, linux-kernel, Davidlohr Bueso, heiko.carstens

On Tue, Sep 22, 2015 at 04:33:07PM +0200, Martin Schwidefsky wrote:
> On Tue, 22 Sep 2015 21:29:14 +0800
> Boqun Feng <boqun.feng@gmail.com> wrote:
> 
> > On Tue, Sep 22, 2015 at 02:51:36PM +0200, Martin Schwidefsky wrote:
> > > On Tue, 22 Sep 2015 20:23:26 +0800
> > > Boqun Feng <boqun.feng@gmail.com> wrote:
> > > 
> > > > Hi Martin,
> > > > 
> > > > On Tue, Sep 22, 2015 at 12:27:35PM +0200, Martin Schwidefsky wrote:
> > > > > On Mon, 21 Sep 2015 11:22:52 +0200
> > > > > Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> > > > > 
> > > > > > On Fri, 18 Sep 2015 14:41:20 -0700
> > > > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > > > > > 
> > > > > > > On Tue, Sep 15, 2015 at 10:09:41AM -0700, Paul E. McKenney wrote:
> > > > > > > > On Tue, Sep 15, 2015 at 06:30:28PM +0200, Peter Zijlstra wrote:
> > > > > > > > > On Tue, Sep 15, 2015 at 08:34:48AM -0700, Paul E. McKenney wrote:
> > > > > > > > > > On Tue, Sep 15, 2015 at 04:14:39PM +0200, Peter Zijlstra wrote:
> > > > > > > > > > > On Tue, Sep 15, 2015 at 07:09:22AM -0700, Paul E. McKenney wrote:
> > > > > > > > > > > > On Tue, Sep 15, 2015 at 02:48:00PM +0200, Peter Zijlstra wrote:
> > > > > > > > > > > > > On Tue, Sep 15, 2015 at 05:41:42AM -0700, Paul E. McKenney wrote:
> > > > > > > > > > > > > > > Never mind, the PPC people will implement this with lwsync and that is
> > > > > > > > > > > > > > > very much not transitive IIRC.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > I am probably lost on context, but...
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > It turns out that lwsync is transitive in special cases.  One of them
> > > > > > > > > > > > > > is a series of release-acquire pairs, which can extend indefinitely.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Does that help in this case?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Probably not, but good to know. I still don't think we want to rely on
> > > > > > > > > > > > > ACQUIRE/RELEASE being transitive in general though.
> > > > > > > > > > > > 
> > > > > > > > > > > > OK, I will bite...  Why not?
> > > > > > > > > > > 
> > > > > > > > > > > It would mean us reviewing all archs (again) and documenting it I
> > > > > > > > > > > suppose. Which is of course entirely possible.
> > > > > > > > > > > 
> > > > > > > > > > > That said, I don't think the case at hand requires it, so lets postpone
> > > > > > > > > > > this for now ;-)
> > > > > > > > > > 
> > > > > > > > > > True enough, but in my experience smp_store_release() and
> > > > > > > > > > smp_load_acquire() are a -lot- easier to use than other barriers,
> > > > > > > > > > and transitivity will help promote their use.  So...
> > > > > > > > > > 
> > > > > > > > > > All the TSO architectures (x86, s390, SPARC, HPPA, ...) support transitive
> > > > > > > > > > smp_store_release()/smp_load_acquire() via their native ordering in
> > > > > > > > > > combination with barrier() macros.  x86 with CONFIG_X86_PPRO_FENCE=y,
> > > > > > > > > > which is not TSO, uses an mfence instruction.  Power supports this via
> > > > > > > > > > lwsync's partial cumulativity.  ARM64 supports it in SMP via the new ldar
> > > > > > > > > > and stlr instructions (in non-SMP, it uses barrier(), which suffices
> > > > > > > > > > in that case).  IA64 supports this via total ordering of all release
> > > > > > > > > > instructions in theory and by the actual full-barrier implementation
> > > > > > > > > > in practice (and the fact that gcc emits st.rel and ld.acq instructions
> > > > > > > > > > for volatile stores and loads).  All other architectures use smp_mb(),
> > > > > > > > > > which is transitive.
> > > > > > > > > > 
> > > > > > > > > > Did I miss anything?
> > > > > > > > > 
> > > > > > > > > I think that about covers it.. the only odd duckling might be s390 which
> > > > > > > > > is documented as TSO but recently grew smp_mb__{before,after}_atomic(),
> > > > > > > > > which seems to confuse matters.
> > > > > > > > 
> > > > > > > > Fair point, adding Martin and Heiko on CC for their thoughts.
> > > > > > 
> > > > > > Well we always had the full memory barrier for the various versions of
> > > > > > smp_mb__xxx, they just have moved around and renamed several times.
> > > > > > 
> > > > > > After discussing this with Heiko we came to the conclusion that we can use
> > > > > > a simple barrier() for smp_mb__before_atomic() and smp_mb__after_atomic().
> > > > > > 
> > > > > > > > It looks like this applies to recent mainframes that have new atomic
> > > > > > > > instructions, which, yes, might need something to make them work with
> > > > > > > > fully transitive smp_load_acquire() and smp_store_release().
> > > > > > > > 
> > > > > > > > Martin, Heiko, the question is whether or not the current s390
> > > > > > > > smp_store_release() and smp_load_acquire() can be transitive.
> > > > > > > > For example, if all the Xi variables below are initially zero,
> > > > > > > > is it possible for all the r0, r1, r2, ... rN variables to
> > > > > > > > have the value 1 at the end of the test.
> > > > > > > 
> > > > > > > Right...  This time actually adding Martin and Heiko on CC...
> > > > > > > 
> > > > > > > 							Thanx, Paul
> > > > > > > 
> > > > > > > > CPU 0
> > > > > > > > 	r0 = smp_load_acquire(&X0);
> > > > > > > > 	smp_store_release(&X1, 1);
> > > > > > > > 
> > > > > > > > CPU 1
> > > > > > > > 	r1 = smp_load_acquire(&X1);
> > > > > > > > 	smp_store_release(&X2, 1);
> > > > > > > > 
> > > > > > > > CPU 2
> > > > > > > > 	r2 = smp_load_acquire(&X2);
> > > > > > > > 	smp_store_release(&X3, 1);
> > > > > > > > 
> > > > > > > > ...
> > > > > > > > 
> > > > > > > > CPU N
> > > > > > > > 	rN = smp_load_acquire(&XN);
> > > > > > > > 	smp_store_release(&X0, 1);
> > > > > > > > 
> > > > > > > > If smp_store_release() and smp_load_acquire() are transitive, the
> > > > > > > > answer would be "no".
> > > > > > 
> > > > > > The answer is "no". Christian recently summarized what the principles of
> > > > > > operation has to say about the CPU read / write behavior. If you consider
> > > > > > the sequential order of instructions then
> > > > > > 
> > > > > > 1) reads are in order
> > > > > > 2) writes are in order
> > > > > > 3) reads can happen earlier
> > > > > > 4) writes can happen later
> > > > > 
> > > > > Correction. The principles of operation states this:
> > > > > 
> > > > > "A storage-operand store specified by one instruction appears to precede
> > > > > all storage-operand stores specified by conceptually subsequent instructions,
> > > > > but it does not necessarily precede storage-operand fetches specified by
> > > > > conceptually subsequent instructions. However, a storage-operand store
> > > > > appears to precede a conceptually subsequent storage-operand fetch from the
> > > > > same main-storage location."
> > > > > 
> > 
> > Confused...
> 
> Yeah, seems like I'm confused as well. This stuff always make my head hurt..
>  
> > IIUC, the previous paragraph actually means that a STORE-LOAD pair can be
> > reordered. But the below reasoning is saying that a LOAD-STORE pair can
> > be reordered. Do I miss something here?
> 
> True, the above paragraph allows a store to move past a load and not the other
> way around.
> 
> > > > > As observed by other CPUs a write to one memory location can "overtake" a
> > > > > read of another memory location if there is no explicit memory-barrier
> > > > > between the load and the store instruction.
> > > > > 
> > > > > In the above example X0, X1, ... XN are different memory locations, so
> > > > > architecturally the answer is "yes", all r0, r1, ... rN variables can have
> > > > > the value of 1 after the test. I doubt that any existing machine will
> > > > > show this behavior though.
> > > > > 
> > > > 
> > > > Just be curious, how about when N == 1? The test then becomes:
> > > > 
> > > > CPU 0
> > > > 	r0 = smp_load_acquire(&X0);
> > > > 	smp_store_release(&X1,1);
> > > > 
> > > > CPU 1
> > > > 	r1 = smp_load_acquire(&X1);
> > > > 	smp_store_release(&X0,1);
> > > > 
> > > > Is it possible that r0 == 1 and r1 == 1 at the end, due to the same
> > > > reason?
> > > 
> > > Yes, that is possible for the same reason. To change that we would have
> > > to replace the barrier() in smp_load_acquire/smp_store_release with
> > > smp_mb().
> > > 
> > 
> > I thought that s390 is TSO, so this is prohibited. If that is possible,
> > I think, that means the current implementation of smp_load_acquire and
> > smp_store_release on s390 is incorrect...
> 
> Ok, further reading of chapter 5 of the principles revealed this:
> 
> "As observed by other CPUs and by channel programs, storage-operand fetches
> associated with one instruction execution appear to precede all storage
> operand references for conceptually subsequent instructions."
> 
> So no writes before reads. Correction to the correction: all r0, r1, ...rN
> equal to one can not happen after all. Got me worried there ;-)

Whew!!!

So s390's current smp_store_release() and smp_load_acquire() provide
ordering as needed, then, right?  For example, suppose that there
was a long chain of smp_load_acquire()/smp_store_release() pairs
involving many CPUs.  Would the all-ones case still be impossible?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics
  2015-09-22 15:28                                       ` Paul E. McKenney
@ 2015-09-23  6:43                                         ` Martin Schwidefsky
  2015-09-25 21:30                                           ` Paul E. McKenney
  0 siblings, 1 reply; 28+ messages in thread
From: Martin Schwidefsky @ 2015-09-23  6:43 UTC (permalink / raw)
  To: paulmck
  Cc: Boqun Feng, Peter Zijlstra, Davidlohr Bueso, Ingo Molnar,
	Thomas Gleixner, linux-kernel, Davidlohr Bueso, heiko.carstens

On Tue, 22 Sep 2015 08:28:22 -0700
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Tue, Sep 22, 2015 at 04:33:07PM +0200, Martin Schwidefsky wrote:
> > On Tue, 22 Sep 2015 21:29:14 +0800
> > Boqun Feng <boqun.feng@gmail.com> wrote:
> > 
> > > On Tue, Sep 22, 2015 at 02:51:36PM +0200, Martin Schwidefsky wrote:
> > > > On Tue, 22 Sep 2015 20:23:26 +0800
> > > > Boqun Feng <boqun.feng@gmail.com> wrote:
> > > > 
> > > > > Hi Martin,
> > > > > 
> > > > > On Tue, Sep 22, 2015 at 12:27:35PM +0200, Martin Schwidefsky wrote:
> > > > > > On Mon, 21 Sep 2015 11:22:52 +0200
> > > > > > Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> > > > > > 
> > > > > > > On Fri, 18 Sep 2015 14:41:20 -0700
> > > > > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > > > > > > 
> > > > > > > > On Tue, Sep 15, 2015 at 10:09:41AM -0700, Paul E. McKenney wrote:
> > > > > > > > > On Tue, Sep 15, 2015 at 06:30:28PM +0200, Peter Zijlstra wrote:
> > > > > > > > > > On Tue, Sep 15, 2015 at 08:34:48AM -0700, Paul E. McKenney wrote:
> > > > > > > > > > > On Tue, Sep 15, 2015 at 04:14:39PM +0200, Peter Zijlstra wrote:
> > > > > > > > > > > > On Tue, Sep 15, 2015 at 07:09:22AM -0700, Paul E. McKenney wrote:
> > > > > > > > > > > > > On Tue, Sep 15, 2015 at 02:48:00PM +0200, Peter Zijlstra wrote:
> > > > > > > > > > > > > > On Tue, Sep 15, 2015 at 05:41:42AM -0700, Paul E. McKenney wrote:
> > > > > > > > > > > > > > > > Never mind, the PPC people will implement this with lwsync and that is
> > > > > > > > > > > > > > > > very much not transitive IIRC.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > I am probably lost on context, but...
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > It turns out that lwsync is transitive in special cases.  One of them
> > > > > > > > > > > > > > > is a series of release-acquire pairs, which can extend indefinitely.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Does that help in this case?
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Probably not, but good to know. I still don't think we want to rely on
> > > > > > > > > > > > > > ACQUIRE/RELEASE being transitive in general though.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > OK, I will bite...  Why not?
> > > > > > > > > > > > 
> > > > > > > > > > > > It would mean us reviewing all archs (again) and documenting it I
> > > > > > > > > > > > suppose. Which is of course entirely possible.
> > > > > > > > > > > > 
> > > > > > > > > > > > That said, I don't think the case at hand requires it, so lets postpone
> > > > > > > > > > > > this for now ;-)
> > > > > > > > > > > 
> > > > > > > > > > > True enough, but in my experience smp_store_release() and
> > > > > > > > > > > smp_load_acquire() are a -lot- easier to use than other barriers,
> > > > > > > > > > > and transitivity will help promote their use.  So...
> > > > > > > > > > > 
> > > > > > > > > > > All the TSO architectures (x86, s390, SPARC, HPPA, ...) support transitive
> > > > > > > > > > > smp_store_release()/smp_load_acquire() via their native ordering in
> > > > > > > > > > > combination with barrier() macros.  x86 with CONFIG_X86_PPRO_FENCE=y,
> > > > > > > > > > > which is not TSO, uses an mfence instruction.  Power supports this via
> > > > > > > > > > > lwsync's partial cumulativity.  ARM64 supports it in SMP via the new ldar
> > > > > > > > > > > and stlr instructions (in non-SMP, it uses barrier(), which suffices
> > > > > > > > > > > in that case).  IA64 supports this via total ordering of all release
> > > > > > > > > > > instructions in theory and by the actual full-barrier implementation
> > > > > > > > > > > in practice (and the fact that gcc emits st.rel and ld.acq instructions
> > > > > > > > > > > for volatile stores and loads).  All other architectures use smp_mb(),
> > > > > > > > > > > which is transitive.
> > > > > > > > > > > 
> > > > > > > > > > > Did I miss anything?
> > > > > > > > > > 
> > > > > > > > > > I think that about covers it.. the only odd duckling might be s390 which
> > > > > > > > > > is documented as TSO but recently grew smp_mb__{before,after}_atomic(),
> > > > > > > > > > which seems to confuse matters.
> > > > > > > > > 
> > > > > > > > > Fair point, adding Martin and Heiko on CC for their thoughts.
> > > > > > > 
> > > > > > > Well we always had the full memory barrier for the various versions of
> > > > > > > smp_mb__xxx, they just have moved around and renamed several times.
> > > > > > > 
> > > > > > > After discussing this with Heiko we came to the conclusion that we can use
> > > > > > > a simple barrier() for smp_mb__before_atomic() and smp_mb__after_atomic().
> > > > > > > 
> > > > > > > > > It looks like this applies to recent mainframes that have new atomic
> > > > > > > > > instructions, which, yes, might need something to make them work with
> > > > > > > > > fully transitive smp_load_acquire() and smp_store_release().
> > > > > > > > > 
> > > > > > > > > Martin, Heiko, the question is whether or not the current s390
> > > > > > > > > smp_store_release() and smp_load_acquire() can be transitive.
> > > > > > > > > For example, if all the Xi variables below are initially zero,
> > > > > > > > > is it possible for all the r0, r1, r2, ... rN variables to
> > > > > > > > > have the value 1 at the end of the test.
> > > > > > > > 
> > > > > > > > Right...  This time actually adding Martin and Heiko on CC...
> > > > > > > > 
> > > > > > > > 							Thanx, Paul
> > > > > > > > 
> > > > > > > > > CPU 0
> > > > > > > > > 	r0 = smp_load_acquire(&X0);
> > > > > > > > > 	smp_store_release(&X1, 1);
> > > > > > > > > 
> > > > > > > > > CPU 1
> > > > > > > > > 	r1 = smp_load_acquire(&X1);
> > > > > > > > > 	smp_store_release(&X2, 1);
> > > > > > > > > 
> > > > > > > > > CPU 2
> > > > > > > > > 	r2 = smp_load_acquire(&X2);
> > > > > > > > > 	smp_store_release(&X3, 1);
> > > > > > > > > 
> > > > > > > > > ...
> > > > > > > > > 
> > > > > > > > > CPU N
> > > > > > > > > 	rN = smp_load_acquire(&XN);
> > > > > > > > > 	smp_store_release(&X0, 1);
> > > > > > > > > 
> > > > > > > > > If smp_store_release() and smp_load_acquire() are transitive, the
> > > > > > > > > answer would be "no".
> > > > > > > 
> > > > > > > The answer is "no". Christian recently summarized what the principles of
> > > > > > > operation has to say about the CPU read / write behavior. If you consider
> > > > > > > the sequential order of instructions then
> > > > > > > 
> > > > > > > 1) reads are in order
> > > > > > > 2) writes are in order
> > > > > > > 3) reads can happen earlier
> > > > > > > 4) writes can happen later
> > > > > > 
> > > > > > Correction. The principles of operation states this:
> > > > > > 
> > > > > > "A storage-operand store specified by one instruction appears to precede
> > > > > > all storage-operand stores specified by conceptually subsequent instructions,
> > > > > > but it does not necessarily precede storage-operand fetches specified by
> > > > > > conceptually subsequent instructions. However, a storage-operand store
> > > > > > appears to precede a conceptually subsequent storage-operand fetch from the
> > > > > > same main-storage location."
> > > > > > 
> > > 
> > > Confused...
> > 
> > Yeah, seems like I'm confused as well. This stuff always make my head hurt..
> >  
> > > IIUC, the previous paragraph actually means that a STORE-LOAD pair can be
> > > reordered. But the below reasoning is saying that a LOAD-STORE pair can
> > > be reordered. Do I miss something here?
> > 
> > True, the above paragraph allows a store to move past a load and not the other
> > way around.
> > 
> > > > > > As observed by other CPUs a write to one memory location can "overtake" a
> > > > > > read of another memory location if there is no explicit memory-barrier
> > > > > > between the load and the store instruction.
> > > > > > 
> > > > > > In the above example X0, X1, ... XN are different memory locations, so
> > > > > > architecturally the answer is "yes", all r0, r1, ... rN variables can have
> > > > > > the value of 1 after the test. I doubt that any existing machine will
> > > > > > show this behavior though.
> > > > > > 
> > > > > 
> > > > > Just be curious, how about when N == 1? The test then becomes:
> > > > > 
> > > > > CPU 0
> > > > > 	r0 = smp_load_acquire(&X0);
> > > > > 	smp_store_release(&X1,1);
> > > > > 
> > > > > CPU 1
> > > > > 	r1 = smp_load_acquire(&X1);
> > > > > 	smp_store_release(&X0,1);
> > > > > 
> > > > > Is it possible that r0 == 1 and r1 == 1 at the end, due to the same
> > > > > reason?
> > > > 
> > > > Yes, that is possible for the same reason. To change that we would have
> > > > to replace the barrier() in smp_load_acquire/smp_store_release with
> > > > smp_mb().
> > > > 
> > > 
> > > I thought that s390 is TSO, so this is prohibited. If that is possible,
> > > I think, that means the current implementation of smp_load_acquire and
> > > smp_store_release on s390 is incorrect...
> > 
> > Ok, further reading of chapter 5 of the principles revealed this:
> > 
> > "As observed by other CPUs and by channel programs, storage-operand fetches
> > associated with one instruction execution appear to precede all storage
> > operand references for conceptually subsequent instructions."
> > 
> > So no writes before reads. Correction to the correction: all r0, r1, ..., rN
> > equal to one cannot happen after all. Got me worried there ;-)
> 
> Whew!!!
> 
> So s390's current smp_store_release() and smp_load_acquire() provide
> ordering as needed, then, right?  For example, suppose that there
> was a long chain of smp_load_acquire()/smp_store_release() pairs
> involving many CPUs.  Would the all-ones case still be impossible?

Yes, that is impossible.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics
  2015-09-23  6:43                                         ` Martin Schwidefsky
@ 2015-09-25 21:30                                           ` Paul E. McKenney
  0 siblings, 0 replies; 28+ messages in thread
From: Paul E. McKenney @ 2015-09-25 21:30 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Boqun Feng, Peter Zijlstra, Davidlohr Bueso, Ingo Molnar,
	Thomas Gleixner, linux-kernel, Davidlohr Bueso, heiko.carstens

On Wed, Sep 23, 2015 at 08:43:21AM +0200, Martin Schwidefsky wrote:
> On Tue, 22 Sep 2015 08:28:22 -0700
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Tue, Sep 22, 2015 at 04:33:07PM +0200, Martin Schwidefsky wrote:
> > > On Tue, 22 Sep 2015 21:29:14 +0800
> > > Boqun Feng <boqun.feng@gmail.com> wrote:
> > > 
> > > > On Tue, Sep 22, 2015 at 02:51:36PM +0200, Martin Schwidefsky wrote:
> > > > > On Tue, 22 Sep 2015 20:23:26 +0800
> > > > > Boqun Feng <boqun.feng@gmail.com> wrote:
> > > > > 
> > > > > > Hi Martin,
> > > > > > 
> > > > > > On Tue, Sep 22, 2015 at 12:27:35PM +0200, Martin Schwidefsky wrote:
> > > > > > > On Mon, 21 Sep 2015 11:22:52 +0200
> > > > > > > Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> > > > > > > 
> > > > > > > > On Fri, 18 Sep 2015 14:41:20 -0700
> > > > > > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > > > > > > > 
> > > > > > > > > On Tue, Sep 15, 2015 at 10:09:41AM -0700, Paul E. McKenney wrote:
> > > > > > > > > > On Tue, Sep 15, 2015 at 06:30:28PM +0200, Peter Zijlstra wrote:
> > > > > > > > > > > On Tue, Sep 15, 2015 at 08:34:48AM -0700, Paul E. McKenney wrote:
> > > > > > > > > > > > On Tue, Sep 15, 2015 at 04:14:39PM +0200, Peter Zijlstra wrote:
> > > > > > > > > > > > > On Tue, Sep 15, 2015 at 07:09:22AM -0700, Paul E. McKenney wrote:
> > > > > > > > > > > > > > On Tue, Sep 15, 2015 at 02:48:00PM +0200, Peter Zijlstra wrote:
> > > > > > > > > > > > > > > On Tue, Sep 15, 2015 at 05:41:42AM -0700, Paul E. McKenney wrote:
> > > > > > > > > > > > > > > > > Never mind, the PPC people will implement this with lwsync and that is
> > > > > > > > > > > > > > > > > very much not transitive IIRC.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > I am probably lost on context, but...
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > It turns out that lwsync is transitive in special cases.  One of them
> > > > > > > > > > > > > > > > is a series of release-acquire pairs, which can extend indefinitely.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Does that help in this case?
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Probably not, but good to know. I still don't think we want to rely on
> > > > > > > > > > > > > > > ACQUIRE/RELEASE being transitive in general though.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > OK, I will bite...  Why not?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > It would mean us reviewing all archs (again) and documenting it I
> > > > > > > > > > > > > suppose. Which is of course entirely possible.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > That said, I don't think the case at hand requires it, so lets postpone
> > > > > > > > > > > > > this for now ;-)
> > > > > > > > > > > > 
> > > > > > > > > > > > True enough, but in my experience smp_store_release() and
> > > > > > > > > > > > smp_load_acquire() are a -lot- easier to use than other barriers,
> > > > > > > > > > > > and transitivity will help promote their use.  So...
> > > > > > > > > > > > 
> > > > > > > > > > > > All the TSO architectures (x86, s390, SPARC, HPPA, ...) support transitive
> > > > > > > > > > > > smp_store_release()/smp_load_acquire() via their native ordering in
> > > > > > > > > > > > combination with barrier() macros.  x86 with CONFIG_X86_PPRO_FENCE=y,
> > > > > > > > > > > > which is not TSO, uses an mfence instruction.  Power supports this via
> > > > > > > > > > > > lwsync's partial cumulativity.  ARM64 supports it in SMP via the new ldar
> > > > > > > > > > > > and stlr instructions (in non-SMP, it uses barrier(), which suffices
> > > > > > > > > > > > in that case).  IA64 supports this via total ordering of all release
> > > > > > > > > > > > instructions in theory and by the actual full-barrier implementation
> > > > > > > > > > > > in practice (and the fact that gcc emits st.rel and ld.acq instructions
> > > > > > > > > > > > for volatile stores and loads).  All other architectures use smp_mb(),
> > > > > > > > > > > > which is transitive.
> > > > > > > > > > > > 
> > > > > > > > > > > > Did I miss anything?
> > > > > > > > > > > 
> > > > > > > > > > > I think that about covers it... the only odd duckling might be s390, which
> > > > > > > > > > > is documented as TSO but recently grew smp_mb__{before,after}_atomic(),
> > > > > > > > > > > which seems to confuse matters.
> > > > > > > > > > 
> > > > > > > > > > Fair point, adding Martin and Heiko on CC for their thoughts.
> > > > > > > > 
> > > > > > > > Well, we have always had the full memory barrier for the various versions
> > > > > > > > of smp_mb__xxx; they have just moved around and been renamed several times.
> > > > > > > > 
> > > > > > > > After discussing this with Heiko we came to the conclusion that we can use
> > > > > > > > a simple barrier() for smp_mb__before_atomic() and smp_mb__after_atomic().
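> > > > > > > > 
> > > > > > > > (I.e., roughly this, a sketch that assumes the atomic instructions
> > > > > > > > themselves are fully ordered:
> > > > > > > > 
> > > > > > > > 	#define smp_mb__before_atomic()	barrier()
> > > > > > > > 	#define smp_mb__after_atomic()	barrier()
> > > > > > > > 
> > > > > > > > in place of the full smp_mb() they expand to today.)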
> > > > > > > > 
> > > > > > > > > > It looks like this applies to recent mainframes that have new atomic
> > > > > > > > > > instructions, which, yes, might need something to make them work with
> > > > > > > > > > fully transitive smp_load_acquire() and smp_store_release().
> > > > > > > > > > 
> > > > > > > > > > Martin, Heiko, the question is whether or not the current s390
> > > > > > > > > > smp_store_release() and smp_load_acquire() can be transitive.
> > > > > > > > > > For example, if all the Xi variables below are initially zero,
> > > > > > > > > > is it possible for all the r0, r1, r2, ... rN variables to
> > > > > > > > > > have the value 1 at the end of the test?
> > > > > > > > > 
> > > > > > > > > Right...  This time actually adding Martin and Heiko on CC...
> > > > > > > > > 
> > > > > > > > > 							Thanx, Paul
> > > > > > > > > 
> > > > > > > > > > CPU 0
> > > > > > > > > > 	r0 = smp_load_acquire(&X0);
> > > > > > > > > > 	smp_store_release(&X1, 1);
> > > > > > > > > > 
> > > > > > > > > > CPU 1
> > > > > > > > > > 	r1 = smp_load_acquire(&X1);
> > > > > > > > > > 	smp_store_release(&X2, 1);
> > > > > > > > > > 
> > > > > > > > > > CPU 2
> > > > > > > > > > 	r2 = smp_load_acquire(&X2);
> > > > > > > > > > 	smp_store_release(&X3, 1);
> > > > > > > > > > 
> > > > > > > > > > ...
> > > > > > > > > > 
> > > > > > > > > > CPU N
> > > > > > > > > > 	rN = smp_load_acquire(&XN);
> > > > > > > > > > 	smp_store_release(&X0, 1);
> > > > > > > > > > 
> > > > > > > > > > If smp_store_release() and smp_load_acquire() are transitive, the
> > > > > > > > > > answer would be "no".
> > > > > > > > 
> > > > > > > > The answer is "no". Christian recently summarized what the Principles of
> > > > > > > > Operation has to say about CPU read/write behavior. If you consider
> > > > > > > > the sequential order of instructions, then
> > > > > > > > 
> > > > > > > > 1) reads are in order
> > > > > > > > 2) writes are in order
> > > > > > > > 3) reads can happen earlier
> > > > > > > > 4) writes can happen later
> > > > > > > 
> > > > > > > Correction. The Principles of Operation states this:
> > > > > > > 
> > > > > > > "A storage-operand store specified by one instruction appears to precede
> > > > > > > all storage-operand stores specified by conceptually subsequent instructions,
> > > > > > > but it does not necessarily precede storage-operand fetches specified by
> > > > > > > conceptually subsequent instructions. However, a storage-operand store
> > > > > > > appears to precede a conceptually subsequent storage-operand fetch from the
> > > > > > > same main-storage location."
> > > > > > > 
> > > > 
> > > > Confused...
> > > 
> > > Yeah, seems like I'm confused as well. This stuff always makes my head hurt...
> > >  
> > > > IIUC, the previous paragraph actually means that a STORE-LOAD pair can be
> > > > reordered. But the below reasoning is saying that a LOAD-STORE pair can
> > > > be reordered. Am I missing something here?
> > > 
> > > True, the above paragraph allows a store to move past a load and not the other
> > > way around.
> > > 
> > > > > > > As observed by other CPUs a write to one memory location can "overtake" a
> > > > > > > read of another memory location if there is no explicit memory-barrier
> > > > > > > between the load and the store instruction.
> > > > > > > 
> > > > > > > In the above example X0, X1, ... XN are different memory locations, so
> > > > > > > architecturally the answer is "yes", all r0, r1, ... rN variables can have
> > > > > > > the value of 1 after the test. I doubt that any existing machine will
> > > > > > > show this behavior though.
> > > > > > > 
> > > > > > 
> > > > > > Just out of curiosity, how about when N == 1? The test then becomes:
> > > > > > 
> > > > > > CPU 0
> > > > > > 	r0 = smp_load_acquire(&X0);
> > > > > > 	smp_store_release(&X1,1);
> > > > > > 
> > > > > > CPU 1
> > > > > > 	r1 = smp_load_acquire(&X1);
> > > > > > 	smp_store_release(&X0,1);
> > > > > > 
> > > > > > Is it possible that r0 == 1 and r1 == 1 at the end, for the same
> > > > > > reason?
> > > > > 
> > > > > Yes, that is possible for the same reason. To change that we would have
> > > > > to replace the barrier() in smp_load_acquire/smp_store_release with
> > > > > smp_mb().
> > > > > 
> > > > 
> > > > I thought that s390 is TSO, so this is prohibited. If that were possible,
> > > > I think it would mean that the current implementation of smp_load_acquire and
> > > > smp_store_release on s390 is incorrect...
> > > 
> > > Ok, further reading of chapter 5 of the principles revealed this:
> > > 
> > > "As observed by other CPUs and by channel programs, storage-operand fetches
> > > associated with one instruction execution appear to precede all storage
> > > operand references for conceptually subsequent instructions."
> > > 
> > > So no writes before reads. Correction to the correction: all r0, r1, ..., rN
> > > equal to one cannot happen after all. Got me worried there ;-)
> > 
> > Whew!!!
> > 
> > So s390's current smp_store_release() and smp_load_acquire() provide
> > ordering as needed, then, right?  For example, suppose that there
> > was a long chain of smp_load_acquire()/smp_store_release() pairs
> > involving many CPUs.  Would the all-ones case still be impossible?
> 
> Yes, that is impossible.

Very good!  Release-acquire chains can be transitive, then.  ;-)
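
As a worked example of what that buys, here is a sketch (hypothetical
variables, not code from the tree) of a three-CPU message-passing
chain built only from release/acquire:

	int data, flag1, flag2;		/* all initially zero */

	/* CPU 0: publish data */
	data = 42;
	smp_store_release(&flag1, 1);

	/* CPU 1: forward the flag */
	while (!smp_load_acquire(&flag1))
		cpu_relax();
	smp_store_release(&flag2, 1);

	/* CPU 2: consume */
	while (!smp_load_acquire(&flag2))
		cpu_relax();
	BUG_ON(data != 42);	/* transitivity forbids this firing */

CPU 2 is guaranteed to observe CPU 0's store to data even though the
two never share a lock or a full barrier.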

							Thanx, Paul


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2015-09-25 21:39 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-14  7:37 [PATCH -tip 1/3] locking/qrwlock: Rename ->lock to ->wait_lock Davidlohr Bueso
2015-09-14  7:37 ` [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics Davidlohr Bueso
2015-09-14 12:32   ` Peter Zijlstra
2015-09-14 21:08     ` Davidlohr Bueso
2015-09-15  9:49       ` Peter Zijlstra
2015-09-15  9:55         ` Peter Zijlstra
2015-09-15 12:41           ` Paul E. McKenney
2015-09-15 12:48             ` Peter Zijlstra
2015-09-15 14:09               ` Paul E. McKenney
2015-09-15 14:14                 ` Peter Zijlstra
2015-09-15 15:34                   ` Paul E. McKenney
2015-09-15 16:30                     ` Peter Zijlstra
2015-09-15 17:09                       ` Paul E. McKenney
2015-09-18 21:41                         ` Paul E. McKenney
2015-09-21  9:22                           ` Martin Schwidefsky
2015-09-22 10:27                             ` Martin Schwidefsky
2015-09-22 12:23                               ` Boqun Feng
2015-09-22 12:51                                 ` Martin Schwidefsky
2015-09-22 13:29                                   ` Boqun Feng
2015-09-22 14:33                                     ` Martin Schwidefsky
2015-09-22 15:28                                       ` Paul E. McKenney
2015-09-23  6:43                                         ` Martin Schwidefsky
2015-09-25 21:30                                           ` Paul E. McKenney
2015-09-15 19:49           ` Davidlohr Bueso
2015-09-16  9:01             ` Peter Zijlstra
2015-09-14  7:37 ` [PATCH -tip 3/3] locking/osq: Relax atomic semantics Davidlohr Bueso
2015-09-18  8:50   ` [tip:locking/core] " tip-bot for Davidlohr Bueso
2015-09-18  8:50 ` [tip:locking/core] locking/qrwlock: Rename ->lock to ->wait_lock tip-bot for Davidlohr Bueso

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).