All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/4] scheduler ordering bits
@ 2015-11-02 13:29 Peter Zijlstra
  2015-11-02 13:29 ` [PATCH 1/4] sched: Better document the try_to_wake_up() barriers Peter Zijlstra
                   ` (3 more replies)
  0 siblings, 4 replies; 78+ messages in thread
From: Peter Zijlstra @ 2015-11-02 13:29 UTC (permalink / raw)
  To: mingo, oleg
  Cc: linux-kernel, peterz, paulmck, boqun.feng, corbet, mhocko,
	dhowells, torvalds, will.deacon

Hai,

These patches are the result of recent discussions in a thread on barrier
documentation.

Please have a careful look, as always, this stuff is tricky.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 1/4] sched: Better document the try_to_wake_up() barriers
  2015-11-02 13:29 [PATCH 0/4] scheduler ordering bits Peter Zijlstra
@ 2015-11-02 13:29 ` Peter Zijlstra
  2015-12-04  0:09   ` Byungchul Park
  2015-12-04  0:58   ` Byungchul Park
  2015-11-02 13:29 ` [PATCH 2/4] sched: Document Program-Order guarantees Peter Zijlstra
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 78+ messages in thread
From: Peter Zijlstra @ 2015-11-02 13:29 UTC (permalink / raw)
  To: mingo, oleg
  Cc: linux-kernel, peterz, paulmck, boqun.feng, corbet, mhocko,
	dhowells, torvalds, will.deacon

[-- Attachment #1: peterz-sched-ttwu-comment.patch --]
[-- Type: text/plain, Size: 1408 bytes --]

Explain how the control dependency and smp_rmb() end up providing
ACQUIRE semantics and pair with smp_store_release() in
finish_lock_switch().

Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  |    8 +++++++-
 kernel/sched/sched.h |    3 +++
 2 files changed, 10 insertions(+), 1 deletion(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1947,7 +1947,13 @@ try_to_wake_up(struct task_struct *p, un
 	while (p->on_cpu)
 		cpu_relax();
 	/*
-	 * Pairs with the smp_wmb() in finish_lock_switch().
+	 * Combined with the control dependency above, we have an effective
+	 * smp_load_acquire() without the need for full barriers.
+	 *
+	 * Pairs with the smp_store_release() in finish_lock_switch().
+	 *
+	 * This ensures that tasks getting woken will be fully ordered against
+	 * their previous state and preserve Program Order.
 	 */
 	smp_rmb();
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1073,6 +1073,9 @@ static inline void finish_lock_switch(st
 	 * We must ensure this doesn't happen until the switch is completely
 	 * finished.
 	 *
+	 * In particular, the load of prev->state in finish_task_switch() must
+	 * happen before this.
+	 *
 	 * Pairs with the control dependency and rmb in try_to_wake_up().
 	 */
 	smp_store_release(&prev->on_cpu, 0);



^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 2/4] sched: Document Program-Order guarantees
  2015-11-02 13:29 [PATCH 0/4] scheduler ordering bits Peter Zijlstra
  2015-11-02 13:29 ` [PATCH 1/4] sched: Better document the try_to_wake_up() barriers Peter Zijlstra
@ 2015-11-02 13:29 ` Peter Zijlstra
  2015-11-02 20:27   ` Paul Turner
  2015-11-02 13:29 ` [PATCH 3/4] sched: Fix a race in try_to_wake_up() vs schedule() Peter Zijlstra
  2015-11-02 13:29 ` [PATCH 4/4] locking: Introduce smp_cond_acquire() Peter Zijlstra
  3 siblings, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2015-11-02 13:29 UTC (permalink / raw)
  To: mingo, oleg
  Cc: linux-kernel, peterz, paulmck, boqun.feng, corbet, mhocko,
	dhowells, torvalds, will.deacon

[-- Attachment #1: peter_zijlstra-documentation-remove_misleading_examples_of_the_barriers_in_wake__.patch --]
[-- Type: text/plain, Size: 6001 bytes --]

These are some notes on the scheduler locking and how it provides
program order guarantees on SMP systems.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |  142 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 142 insertions(+)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1905,6 +1905,148 @@ static void ttwu_queue(struct task_struc
 	raw_spin_unlock(&rq->lock);
 }
 
+/*
+ * Notes on Program-Order guarantees on SMP systems.
+ *
+ *
+ *   PREEMPTION/MIGRATION
+ *
+ * Regular preemption/migration is safe because as long as the task is runnable
+ * migrations involve both rq locks, albeit not (necessarily) at the same time.
+ *
+ * So we get (we allow 3 CPU migrations):
+ *
+ *   CPU0            CPU1            CPU2
+ *
+ *   LOCK rq(0)->lock
+ *   sched-out X
+ *   sched-in Y
+ *   UNLOCK rq(0)->lock
+ *
+ *                                   LOCK rq(0)->lock // MB against CPU0
+ *                                   dequeue X
+ *                                   UNLOCK rq(0)->lock
+ *
+ *                                   LOCK rq(1)->lock
+ *                                   enqueue X
+ *                                   UNLOCK rq(1)->lock
+ *
+ *                   LOCK rq(1)->lock // MB against CPU2
+ *                   sched-out Z
+ *                   sched-in X
+ *                   UNLOCK rq(1)->lock
+ *
+ * and the first LOCK rq(0) on CPU2 gives a full order against the UNLOCK rq(0)
+ * on CPU0. Similarly the LOCK rq(1) on CPU1 provides full order against the
+ * UNLOCK rq(1) on CPU2, therefore by the time task X runs on CPU1 it must
+ * observe the state it left behind on CPU0.
+ *
+ *
+ *   BLOCKING -- aka. SLEEP + WAKEUP
+ *
+ * For blocking things are a little more interesting, because when we dequeue
+ * the task, we don't need to acquire the old rq lock in order to migrate it.
+ *
+ * Say CPU0 does a wait_event() and CPU1 does the wake() and migrates the task
+ * to CPU2 (the most complex example):
+ *
+ *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (sched_ttwu_pending)
+ *
+ *   X->state = UNINTERRUPTIBLE
+ *   MB
+ *   if (cond)
+ *     break
+ *                    cond = true
+ *
+ *   LOCK rq(0)->lock LOCK X->pi_lock
+ *
+ *   dequeue X
+ *                    while (X->on_cpu)
+ *                      cpu_relax()
+ *   sched-out X
+ *   RELEASE
+ *   X->on_cpu = 0
+ *                    RMB
+ *                    X->state = WAKING
+ *                    set_task_cpu(X,2)
+ *                      WMB
+ *                      ti(X)->cpu = 2
+ *
+ *                    llist_add(X, rq(2)) // MB
+ *                                          llist_del_all() // MB
+ *
+ *                                          LOCK rq(2)->lock
+ *                                          enqueue X
+ *                                          X->state = RUNNING
+ *                                          UNLOCK rq(2)->lock
+ *
+ *                                          LOCK rq(2)->lock
+ *                                          sched-out Z
+ *                                          sched-in X
+ *                                          UNLOCK rq(1)->lock
+ *
+ *                                          if (cond) // _TRUE_
+ *                    UNLOCK X->pi_lock
+ *   UNLOCK rq(0)->lock
+ *
+ * So in this case the scheduler does not provide an obvious full barrier; but
+ * the smp_store_release() in finish_lock_switch(), paired with the control-dep
+ * and smp_rmb() in try_to_wake_up() form a release-acquire pair and fully
+ * order things between CPU0 and CPU1.
+ *
+ * The llist primitives order things between CPU1 and CPU2 -- the alternative
+ * is CPU1 doing the remote enqueue (the alternative path in ttwu_queue()) in
+ * which case the rq(2)->lock release/acquire will order things between them.
+ *
+ * Which again leads to the guarantee that by the time X gets to run on CPU2
+ * it must observe the state it left behind on CPU0.
+ *
+ * However; for blocking there is a second guarantee we must provide, namely we
+ * must observe the state that lead to our wakeup. That is, not only must X
+ * observe its own prior state, it must also observe the @cond store.
+ *
+ * This too is achieved in the above, since CPU1 does the waking, we only need
+ * the ordering between CPU1 and CPU2, which is the same as the above.
+ *
+ *
+ * There is however a much more interesting case for this guarantee, where X
+ * never makes it off CPU0:
+ *
+ *   CPU0 (schedule)  CPU1 (try_to_wake_up)
+ *
+ *   X->state = UNINTERRUPTIBLE
+ *   MB
+ *   if (cond)
+ *     break
+ *                    cond = true
+ *
+ *                    WMB (aka smp_mb__before_spinlock)
+ *                    LOCK X->pi_lock
+ *
+ *                    if (X->on_rq)
+ *                      LOCK rq(0)->lock
+ *                      X->state = RUNNING
+ *                      UNLOCK rq(0)->lock
+ *
+ *   LOCK rq(0)->lock // MB against CPU1
+ *   UNLOCK rq(0)->lock
+ *
+ *   if (cond) // _TRUE_
+ *
+ *                    UNLOCK X->pi_lock
+ *
+ * Here our task X never quite leaves CPU0, the wakeup happens before we can
+ * dequeue and schedule someone else. In this case we must still observe cond
+ * after our call to schedule() completes.
+ *
+ * This is achieved by the smp_mb__before_spinlock() WMB which ensures the store
+ * cannot leak inside the LOCK, and LOCK rq(0)->lock on CPU0 provides full order
+ * against the UNLOCK rq(0)->lock from CPU1. Furthermore our load of cond cannot
+ * happen before this same LOCK.
+ *
+ * Therefore, again, we're good.
+ */
+
 /**
  * try_to_wake_up - wake up a thread
  * @p: the thread to be awakened



^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 3/4] sched: Fix a race in try_to_wake_up() vs schedule()
  2015-11-02 13:29 [PATCH 0/4] scheduler ordering bits Peter Zijlstra
  2015-11-02 13:29 ` [PATCH 1/4] sched: Better document the try_to_wake_up() barriers Peter Zijlstra
  2015-11-02 13:29 ` [PATCH 2/4] sched: Document Program-Order guarantees Peter Zijlstra
@ 2015-11-02 13:29 ` Peter Zijlstra
  2015-11-02 13:29 ` [PATCH 4/4] locking: Introduce smp_cond_acquire() Peter Zijlstra
  3 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2015-11-02 13:29 UTC (permalink / raw)
  To: mingo, oleg
  Cc: linux-kernel, peterz, paulmck, boqun.feng, corbet, mhocko,
	dhowells, torvalds, will.deacon

[-- Attachment #1: peterz-sched-fix-ttwu-race.patch --]
[-- Type: text/plain, Size: 2503 bytes --]

Oleg noticed that its possible to falsely observe p->on_cpu == 0 such
that we'll prematurely continue with the wakeup and effectively run p on
two CPUs at the same time.

Even though the overlap is very limited; the task is in the middle of
being scheduled out; it could still result in corruption of the
scheduler data structures.


        CPU0                            CPU1

        set_current_state(...)

        <preempt_schedule>
          context_switch(X, Y)
            prepare_lock_switch(Y)
              Y->on_cpu = 1;
            finish_lock_switch(X)
              store_release(X->on_cpu, 0);

                                        try_to_wake_up(X)
                                          LOCK(p->pi_lock);

                                          t = X->on_cpu; // 0

          context_switch(Y, X)
            prepare_lock_switch(X)
              X->on_cpu = 1;
            finish_lock_switch(Y)
              store_release(Y->on_cpu, 0);
        </preempt_schedule>

        schedule();
          deactivate_task(X);
          X->on_rq = 0;

                                          if (X->on_rq) // false

                                          if (t) while (X->on_cpu)
                                            cpu_relax();

          context_switch(X, ..)
            finish_lock_switch(X)
              store_release(X->on_cpu, 0);


Avoid the load of X->on_cpu being hoisted over the X->on_rq load.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2084,6 +2084,25 @@ try_to_wake_up(struct task_struct *p, un
 
 #ifdef CONFIG_SMP
 	/*
+	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
+	 * possible to, falsely, observe p->on_cpu == 0.
+	 *
+	 * One must be running (->on_cpu == 1) in order to remove oneself
+	 * from the runqueue.
+	 *
+	 *  [S] ->on_cpu = 1;	[L] ->on_rq
+	 *      UNLOCK rq->lock
+	 *			RMB
+	 *      LOCK   rq->lock
+	 *  [S] ->on_rq = 0;    [L] ->on_cpu
+	 *
+	 * Pairs with the full barrier implied in the UNLOCK+LOCK on rq->lock
+	 * from the consecutive calls to schedule(); the first switching to our
+	 * task, the second putting it to sleep.
+	 */
+	smp_rmb();
+
+	/*
 	 * If the owning (remote) cpu is still in the middle of schedule() with
 	 * this task as prev, wait until its done referencing the task.
 	 */



^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-02 13:29 [PATCH 0/4] scheduler ordering bits Peter Zijlstra
                   ` (2 preceding siblings ...)
  2015-11-02 13:29 ` [PATCH 3/4] sched: Fix a race in try_to_wake_up() vs schedule() Peter Zijlstra
@ 2015-11-02 13:29 ` Peter Zijlstra
  2015-11-02 13:57   ` Peter Zijlstra
                     ` (4 more replies)
  3 siblings, 5 replies; 78+ messages in thread
From: Peter Zijlstra @ 2015-11-02 13:29 UTC (permalink / raw)
  To: mingo, oleg
  Cc: linux-kernel, peterz, paulmck, boqun.feng, corbet, mhocko,
	dhowells, torvalds, will.deacon

[-- Attachment #1: peterz-smp_cond_acquire.patch --]
[-- Type: text/plain, Size: 2892 bytes --]

Introduce smp_cond_acquire() which combines a control dependency and a
read barrier to form acquire semantics.

This primitive has two benefits:
 - it documents control dependencies,
 - its typically cheaper than using smp_load_acquire() in a loop.

Note that while smp_cond_acquire() has an explicit
smp_read_barrier_depends() for Alpha, neither sites it gets used in
were actually buggy on Alpha for their lack of it. The first uses
smp_rmb(), which on Alpha is a full barrier too and therefore serves
its purpose. The second had an explicit full barrier.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/compiler.h |   18 ++++++++++++++++++
 kernel/sched/core.c      |    8 +-------
 kernel/task_work.c       |    4 ++--
 3 files changed, 21 insertions(+), 9 deletions(-)

--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -275,6 +275,24 @@ static __always_inline void __write_once
 	__val; \
 })
 
+/**
+ * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
+ * @cond: boolean expression to wait for
+ *
+ * Equivalent to using smp_load_acquire() on the condition variable but employs
+ * the control dependency of the wait to reduce the barrier on many platforms.
+ *
+ * The control dependency provides a LOAD->STORE order, the additional RMB
+ * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
+ * aka. ACQUIRE.
+ */
+#define smp_cond_acquire(cond)	do {		\
+	while (!(cond))				\
+		cpu_relax();			\
+	smp_read_barrier_depends(); /* ctrl */	\
+	smp_rmb(); /* ctrl + rmb := acquire */	\
+} while (0)
+
 #endif /* __KERNEL__ */
 
 #endif /* __ASSEMBLY__ */
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2111,19 +2111,13 @@ try_to_wake_up(struct task_struct *p, un
 	/*
 	 * If the owning (remote) cpu is still in the middle of schedule() with
 	 * this task as prev, wait until its done referencing the task.
-	 */
-	while (p->on_cpu)
-		cpu_relax();
-	/*
-	 * Combined with the control dependency above, we have an effective
-	 * smp_load_acquire() without the need for full barriers.
 	 *
 	 * Pairs with the smp_store_release() in finish_lock_switch().
 	 *
 	 * This ensures that tasks getting woken will be fully ordered against
 	 * their previous state and preserve Program Order.
 	 */
-	smp_rmb();
+	smp_cond_acquire(!p->on_cpu);
 
 	p->sched_contributes_to_load = !!task_contributes_to_load(p);
 	p->state = TASK_WAKING;
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -102,13 +102,13 @@ void task_work_run(void)
 
 		if (!work)
 			break;
+
 		/*
 		 * Synchronize with task_work_cancel(). It can't remove
 		 * the first entry == work, cmpxchg(task_works) should
 		 * fail, but it can play with *work and other entries.
 		 */
-		raw_spin_unlock_wait(&task->pi_lock);
-		smp_mb();
+		smp_cond_acquire(!raw_spin_is_locked(&task->pi_lock));
 
 		do {
 			next = work->next;



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-02 13:29 ` [PATCH 4/4] locking: Introduce smp_cond_acquire() Peter Zijlstra
@ 2015-11-02 13:57   ` Peter Zijlstra
  2015-11-02 17:43     ` Will Deacon
  2015-11-02 17:42   ` Will Deacon
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2015-11-02 13:57 UTC (permalink / raw)
  To: mingo, oleg
  Cc: linux-kernel, paulmck, boqun.feng, corbet, mhocko, dhowells,
	torvalds, will.deacon

On Mon, Nov 02, 2015 at 02:29:05PM +0100, Peter Zijlstra wrote:

> Note that while smp_cond_acquire() has an explicit
> smp_read_barrier_depends() for Alpha, neither sites it gets used in
> were actually buggy on Alpha for their lack of it. The first uses
> smp_rmb(), which on Alpha is a full barrier too and therefore serves
> its purpose. The second had an explicit full barrier.

> +/**
> + * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
> + * @cond: boolean expression to wait for
> + *
> + * Equivalent to using smp_load_acquire() on the condition variable but employs
> + * the control dependency of the wait to reduce the barrier on many platforms.
> + *
> + * The control dependency provides a LOAD->STORE order, the additional RMB
> + * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
> + * aka. ACQUIRE.
> + */
> +#define smp_cond_acquire(cond)	do {		\
> +	while (!(cond))				\
> +		cpu_relax();			\
> +	smp_read_barrier_depends(); /* ctrl */	\
> +	smp_rmb(); /* ctrl + rmb := acquire */	\
> +} while (0)

So per the above argument we could leave out the
smp_read_barrier_depends() for Alpha, although that would break
consistency with all the other control dependency primitives we have. It
would avoid issuing a double barrier.

Thoughts?

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-02 13:29 ` [PATCH 4/4] locking: Introduce smp_cond_acquire() Peter Zijlstra
  2015-11-02 13:57   ` Peter Zijlstra
@ 2015-11-02 17:42   ` Will Deacon
  2015-11-02 18:08   ` Linus Torvalds
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 78+ messages in thread
From: Will Deacon @ 2015-11-02 17:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, linux-kernel, paulmck, boqun.feng, corbet, mhocko,
	dhowells, torvalds

Hi Peter,

On Mon, Nov 02, 2015 at 02:29:05PM +0100, Peter Zijlstra wrote:
> Introduce smp_cond_acquire() which combines a control dependency and a
> read barrier to form acquire semantics.
> 
> This primitive has two benefits:
>  - it documents control dependencies,
>  - its typically cheaper than using smp_load_acquire() in a loop.

I'm not sure that's necessarily true on arm64, where we have a native
load-acquire instruction, but not a READ -> READ barrier (smp_rmb()
orders prior loads against subsequent loads and stores for us).

Perhaps we could allow architectures to provide their own definition of
smp_cond_acquire in case they can implement it more efficiently?

> Note that while smp_cond_acquire() has an explicit
> smp_read_barrier_depends() for Alpha, neither sites it gets used in
> were actually buggy on Alpha for their lack of it. The first uses
> smp_rmb(), which on Alpha is a full barrier too and therefore serves
> its purpose. The second had an explicit full barrier.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/compiler.h |   18 ++++++++++++++++++
>  kernel/sched/core.c      |    8 +-------
>  kernel/task_work.c       |    4 ++--
>  3 files changed, 21 insertions(+), 9 deletions(-)
> 
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -275,6 +275,24 @@ static __always_inline void __write_once
>  	__val; \
>  })
>  
> +/**
> + * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
> + * @cond: boolean expression to wait for
> + *
> + * Equivalent to using smp_load_acquire() on the condition variable but employs
> + * the control dependency of the wait to reduce the barrier on many platforms.
> + *
> + * The control dependency provides a LOAD->STORE order, the additional RMB
> + * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
> + * aka. ACQUIRE.
> + */
> +#define smp_cond_acquire(cond)	do {		\

I think the previous version that you posted/discussed had the actual
address of the variable being loaded passed in here too? That would be
useful for arm64, where we can wait-until-memory-location-has-changed
to save us re-evaluating cond prematurely.

> +	while (!(cond))				\
> +		cpu_relax();			\
> +	smp_read_barrier_depends(); /* ctrl */	\
> +	smp_rmb(); /* ctrl + rmb := acquire */	\

It's actually stronger than acquire, I think, because accesses before the
smp_cond_acquire cannot be moved across it.

> +} while (0)
> +
>  #endif /* __KERNEL__ */
>  
>  #endif /* __ASSEMBLY__ */
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2111,19 +2111,13 @@ try_to_wake_up(struct task_struct *p, un
>  	/*
>  	 * If the owning (remote) cpu is still in the middle of schedule() with
>  	 * this task as prev, wait until its done referencing the task.
> -	 */
> -	while (p->on_cpu)
> -		cpu_relax();
> -	/*
> -	 * Combined with the control dependency above, we have an effective
> -	 * smp_load_acquire() without the need for full barriers.
>  	 *
>  	 * Pairs with the smp_store_release() in finish_lock_switch().
>  	 *
>  	 * This ensures that tasks getting woken will be fully ordered against
>  	 * their previous state and preserve Program Order.
>  	 */
> -	smp_rmb();
> +	smp_cond_acquire(!p->on_cpu);
>  
>  	p->sched_contributes_to_load = !!task_contributes_to_load(p);
>  	p->state = TASK_WAKING;
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -102,13 +102,13 @@ void task_work_run(void)
>  
>  		if (!work)
>  			break;
> +
>  		/*
>  		 * Synchronize with task_work_cancel(). It can't remove
>  		 * the first entry == work, cmpxchg(task_works) should
>  		 * fail, but it can play with *work and other entries.
>  		 */
> -		raw_spin_unlock_wait(&task->pi_lock);
> -		smp_mb();
> +		smp_cond_acquire(!raw_spin_is_locked(&task->pi_lock));

Hmm, there's some sort of release equivalent in kernel/exit.c, but I
couldn't easily figure out whether we could do anything there. If we
could, we could kill raw_spin_unlock_wait :)

Will

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-02 13:57   ` Peter Zijlstra
@ 2015-11-02 17:43     ` Will Deacon
  2015-11-03  1:14       ` Paul E. McKenney
  0 siblings, 1 reply; 78+ messages in thread
From: Will Deacon @ 2015-11-02 17:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, linux-kernel, paulmck, boqun.feng, corbet, mhocko,
	dhowells, torvalds

On Mon, Nov 02, 2015 at 02:57:26PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 02, 2015 at 02:29:05PM +0100, Peter Zijlstra wrote:
> 
> > Note that while smp_cond_acquire() has an explicit
> > smp_read_barrier_depends() for Alpha, neither sites it gets used in
> > were actually buggy on Alpha for their lack of it. The first uses
> > smp_rmb(), which on Alpha is a full barrier too and therefore serves
> > its purpose. The second had an explicit full barrier.
> 
> > +/**
> > + * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
> > + * @cond: boolean expression to wait for
> > + *
> > + * Equivalent to using smp_load_acquire() on the condition variable but employs
> > + * the control dependency of the wait to reduce the barrier on many platforms.
> > + *
> > + * The control dependency provides a LOAD->STORE order, the additional RMB
> > + * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
> > + * aka. ACQUIRE.
> > + */
> > +#define smp_cond_acquire(cond)	do {		\
> > +	while (!(cond))				\
> > +		cpu_relax();			\
> > +	smp_read_barrier_depends(); /* ctrl */	\
> > +	smp_rmb(); /* ctrl + rmb := acquire */	\
> > +} while (0)
> 
> So per the above argument we could leave out the
> smp_read_barrier_depends() for Alpha, although that would break
> consistency with all the other control dependency primitives we have. It
> would avoid issuing a double barrier.
> 
> Thoughts?

Do we even know that Alpha needs a barrier for control-dependencies in
the first place?

Will

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-02 13:29 ` [PATCH 4/4] locking: Introduce smp_cond_acquire() Peter Zijlstra
  2015-11-02 13:57   ` Peter Zijlstra
  2015-11-02 17:42   ` Will Deacon
@ 2015-11-02 18:08   ` Linus Torvalds
  2015-11-02 18:37     ` Will Deacon
  2015-11-02 20:36   ` David Howells
  2015-11-03 17:59   ` Oleg Nesterov
  4 siblings, 1 reply; 78+ messages in thread
From: Linus Torvalds @ 2015-11-02 18:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Oleg Nesterov, Linux Kernel Mailing List,
	Paul McKenney, boqun.feng, Jonathan Corbet, Michal Hocko,
	David Howells, Will Deacon

On Mon, Nov 2, 2015 at 5:29 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> +#define smp_cond_acquire(cond) do {            \
> +       while (!(cond))                         \
> +               cpu_relax();                    \
> +       smp_read_barrier_depends(); /* ctrl */  \
> +       smp_rmb(); /* ctrl + rmb := acquire */  \
> +} while (0)

This code makes absolutely no sense.

smp_read_barrier_depends() is about a memory barrier where there is a
data dependency between two accesses. The "depends" is very much about
the data dependency, and very much about *nothing* else.

Your comment talks about control dependencies, but
smp_read_barrier_depends() has absolutely nothing to do with a control
dependency. In fact, it is explicitly a no-op on architectures like
ARM and PowerPC that violate control dependencies.

So the code may end up *working*, but the comments in it are
misleading, insane, and nonsensical. And I think the code is actively
wrong (although in the sense that it has a barrier too *many*, not
lacking one).

Because smp_read_barrier_depends() definitely has *nothing* to do with
control barriers in any way, shape or form. The comment is actively
and entirely wrong.

As far as I know, even on alpha that 'smp_read_barrier_depends()' is
unnecessary. Not because of any barrier semantics (whether
smp_read_barrier depends or any other barrier), but simply because a
store cannot finalize before the conditional has been resolved, which
means that the read must have been done.

So for alpha, you end up depending on the exact same logic that you
depend on for every other architecture, since there is no
"architectural" memory ordering guarantee of it anywhere else either
(ok, so x86 has the explicit "memory ordering is causal" guarantee, so
x86 does kind of make that conditional ordering explicit).

Adding that "smp_read_barrier_depends()" doesn't add anything to the
whole ctrl argument.

So the code looks insane to me. On everything but alpha,
"smp_read_barrier_depends()" is a no-op. And on alpha, a combination
of "smp_read_barrier_depends()" and "smp_rmb()" is nonsensical. So in
no case can that code make sense, as far as I can tell.

                       Linus

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-02 18:08   ` Linus Torvalds
@ 2015-11-02 18:37     ` Will Deacon
  2015-11-02 19:17       ` Linus Torvalds
  0 siblings, 1 reply; 78+ messages in thread
From: Will Deacon @ 2015-11-02 18:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Ingo Molnar, Oleg Nesterov,
	Linux Kernel Mailing List, Paul McKenney, boqun.feng,
	Jonathan Corbet, Michal Hocko, David Howells

On Mon, Nov 02, 2015 at 10:08:24AM -0800, Linus Torvalds wrote:
> On Mon, Nov 2, 2015 at 5:29 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > +#define smp_cond_acquire(cond) do {            \
> > +       while (!(cond))                         \
> > +               cpu_relax();                    \
> > +       smp_read_barrier_depends(); /* ctrl */  \
> > +       smp_rmb(); /* ctrl + rmb := acquire */  \
> > +} while (0)
> 
> This code makes absolutely no sense.
> 
> smp_read_barrier_depends() is about a memory barrier where there is a
> data dependency between two accesses. The "depends" is very much about
> the data dependency, and very much about *nothing* else.

Paul wasn't so sure, which I think is why smp_read_barrier_depends()
is already used in, for example, READ_ONCE_CTRL:

  http://lkml.kernel.org/r/20151007154003.GJ3910@linux.vnet.ibm.com

although I agree that this would pave the way for speculative stores on
Alpha and that seems like a heavy accusation to make.

> Your comment talks about control dependencies, but
> smp_read_barrier_depends() has absolutely nothing to do with a control
> dependency. In fact, it is explicitly a no-op on architectures like
> ARM and PowerPC that violate control dependencies.

In this case, control dependencies are only referring to READ -> WRITE
ordering, so they are honoured by ARM and PowerPC.

Will

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-02 18:37     ` Will Deacon
@ 2015-11-02 19:17       ` Linus Torvalds
  2015-11-02 19:57         ` Will Deacon
                           ` (2 more replies)
  0 siblings, 3 replies; 78+ messages in thread
From: Linus Torvalds @ 2015-11-02 19:17 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Ingo Molnar, Oleg Nesterov,
	Linux Kernel Mailing List, Paul McKenney, boqun.feng,
	Jonathan Corbet, Michal Hocko, David Howells

On Mon, Nov 2, 2015 at 10:37 AM, Will Deacon <will.deacon@arm.com> wrote:
> On Mon, Nov 02, 2015 at 10:08:24AM -0800, Linus Torvalds wrote:
>> On Mon, Nov 2, 2015 at 5:29 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > +#define smp_cond_acquire(cond) do {            \
>> > +       while (!(cond))                         \
>> > +               cpu_relax();                    \
>> > +       smp_read_barrier_depends(); /* ctrl */  \
>> > +       smp_rmb(); /* ctrl + rmb := acquire */  \
>> > +} while (0)
>>
>> This code makes absolutely no sense.
>>
>> smp_read_barrier_depends() is about a memory barrier where there is a
>> data dependency between two accesses. The "depends" is very much about
>> the data dependency, and very much about *nothing* else.
>
> Paul wasn't so sure, which I think is why smp_read_barrier_depends()
> is already used in, for example, READ_ONCE_CTRL:
>
>   http://lkml.kernel.org/r/20151007154003.GJ3910@linux.vnet.ibm.com

Quoting the alpha architecture manual is kind of pointless, when NO
OTHER ARCHITECTURE OUT THERE guararantees that whole "read +
conditional orders wrt subsequent writes" _either_.

(Again, with the exception of x86, which has the sane "we honor causality")

Alpha isn't special. And smp_read_barrier_depends() hasn't magically
become something new.

If people think that control dependency needs a memory barrier on
alpha, then it damn well needs on on all other weakly ordered
architectuers too, afaik.

Either that "you cannot finalize a write unless all conditionals it
depends on are finalized" is true or it is not. That argument has
*never* been about some architecture-specific memory ordering model,
as far as I know.

As to READ_ONCE_CTRL - two wrongs don't make a right.

That smp_read_barrier_depends() there doesn't make any sense either.

And finally, the alpha architecture manual actually does have the
notion of "Dependence Constraint" (5.6.1.7) that talks about writes
that depend on previous reads (where "depends" is explicitly spelled
out to be about conditionals, write data _or_ write address). They are
actually constrained on alpha too.

Note that a "Dependence Constraint" is not a memory barrier, because
it only affects that particular chain of dependencies. So it doesn't
order other things in *general*, but it does order a particular read
with a particular sef of subsequent write. Which is all we guarantee
on anything else too wrt the whole control dependencies.

The smp_read_barrier_depends() is a *READ BARRIER*. It's about a
*read* that depends on a previous read. Now, it so happens that alpha
doesn't have a "read-to-read" barrier, and it's implemented as a full
barrier, but it really should be read as a "smp_rmb().

Yes, yes, alpha has the worst memory ordering ever, and the worst
barriers for that insane memory ordering, but that still does not make
alpha "magically able to reorder more than physically or logically
possible". We don't add barriers that don't make sense just because
the alpha memory orderings are insane.

Paul? I really disagree with how you have totally tried to re-purpose
smp_read_barrier_depends() in ways that are insane and make no sense.
That is not a control dependency. If it was, then PPC and ARM would
need to make it a real barrier too. I really don't see why you have
singled out alpha as the victim of your "let's just randomly change
the rules".

> In this case, control dependencies are only referring to READ -> WRITE
> ordering, so they are honoured by ARM and PowerPC.

Do ARM and PPC actually guarantee the generic "previous reads always
order before subsequent writes"?

Because if that's the issue, then we should perhaps add a new ordering
primitive that says that (ie "smp_read_to_write_barrier()"). That
could be a useful thing in general, where we currently use "smp_mb()"
to order an earlier read with a later write.

                   Linus

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-02 19:17       ` Linus Torvalds
@ 2015-11-02 19:57         ` Will Deacon
  2015-11-02 20:23           ` Peter Zijlstra
  2015-11-02 21:56         ` Peter Zijlstra
  2015-11-03  1:57         ` Paul E. McKenney
  2 siblings, 1 reply; 78+ messages in thread
From: Will Deacon @ 2015-11-02 19:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Ingo Molnar, Oleg Nesterov,
	Linux Kernel Mailing List, Paul McKenney, boqun.feng,
	Jonathan Corbet, Michal Hocko, David Howells

On Mon, Nov 02, 2015 at 11:17:17AM -0800, Linus Torvalds wrote:
> On Mon, Nov 2, 2015 at 10:37 AM, Will Deacon <will.deacon@arm.com> wrote:
> > On Mon, Nov 02, 2015 at 10:08:24AM -0800, Linus Torvalds wrote:
> >> On Mon, Nov 2, 2015 at 5:29 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > +#define smp_cond_acquire(cond) do {            \
> >> > +       while (!(cond))                         \
> >> > +               cpu_relax();                    \
> >> > +       smp_read_barrier_depends(); /* ctrl */  \
> >> > +       smp_rmb(); /* ctrl + rmb := acquire */  \
> >> > +} while (0)
> >>
> >> This code makes absolutely no sense.
> >>
> >> smp_read_barrier_depends() is about a memory barrier where there is a
> >> data dependency between two accesses. The "depends" is very much about
> >> the data dependency, and very much about *nothing* else.
> >
> > Paul wasn't so sure, which I think is why smp_read_barrier_depends()
> > is already used in, for example, READ_ONCE_CTRL:
> >
> >   http://lkml.kernel.org/r/20151007154003.GJ3910@linux.vnet.ibm.com
> 
> Quoting the alpha architecture manual is kind of pointless, when NO
> OTHER ARCHITECTURE OUT THERE guararantees that whole "read +
> conditional orders wrt subsequent writes" _either_.
> 
> (Again, with the exception of x86, which has the sane "we honor causality")
> 
> Alpha isn't special. And smp_read_barrier_depends() hasn't magically
> become something new.
> 
> If people think that control dependency needs a memory barrier on
> alpha, then it damn well needs on on all other weakly ordered
> architectuers too, afaik.
> 
> Either that "you cannot finalize a write unless all conditionals it
> depends on are finalized" is true or it is not. That argument has
> *never* been about some architecture-specific memory ordering model,
> as far as I know.

You can imagine a (horribly broken) value-speculating architecture that
would permit this re-ordering, but then you'd have speculative stores
and thin-air values, which Alpha doesn't have.

> As to READ_ONCE_CTRL - two wrongs don't make a right.

Sure, I just wanted to point out the precedence and related discussion.

> That smp_read_barrier_depends() there doesn't make any sense either.
> 
> And finally, the alpha architecture manual actually does have the
> notion of "Dependence Constraint" (5.6.1.7) that talks about writes
> that depend on previous reads (where "depends" is explicitly spelled
> out to be about conditionals, write data _or_ write address). They are
> actually constrained on alpha too.

In which case, it looks like we can remove the smp_read_barrier_depends
instances from all of the control-dependency macros, but I'll defer to
Paul in case he has some further insight. I assume you're ok with this
patch if the smp_read_barrier_depends() is removed?

> > In this case, control dependencies are only referring to READ -> WRITE
> > ordering, so they are honoured by ARM and PowerPC.
> 
> Do ARM and PPC actually guarantee the generic "previous reads always
> order before subsequent writes"?

Only to *dependent* subsequent writes, but Peter's patch makes all
subsequent writes dependent on cond, so I think that's the right way to
achieve the ordering we want.

Will

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-02 19:57         ` Will Deacon
@ 2015-11-02 20:23           ` Peter Zijlstra
  0 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2015-11-02 20:23 UTC (permalink / raw)
  To: Will Deacon
  Cc: Linus Torvalds, Ingo Molnar, Oleg Nesterov,
	Linux Kernel Mailing List, Paul McKenney, boqun.feng,
	Jonathan Corbet, Michal Hocko, David Howells

On Mon, Nov 02, 2015 at 07:57:09PM +0000, Will Deacon wrote:
> > >> smp_read_barrier_depends() is about a memory barrier where there is a
> > >> data dependency between two accesses. The "depends" is very much about
> > >> the data dependency, and very much about *nothing* else.
> > >
> > > Paul wasn't so sure, which I think is why smp_read_barrier_depends()
> > > is already used in, for example, READ_ONCE_CTRL:
> > >
> > >   http://lkml.kernel.org/r/20151007154003.GJ3910@linux.vnet.ibm.com
> > 
> > Quoting the alpha architecture manual is kind of pointless, when NO
> > OTHER ARCHITECTURE OUT THERE guararantees that whole "read +
> > conditional orders wrt subsequent writes" _either_.
> > 
> > (Again, with the exception of x86, which has the sane "we honor causality")
> > 
> > Alpha isn't special. And smp_read_barrier_depends() hasn't magically
> > become something new.
> > 
> > If people think that control dependency needs a memory barrier on
> > alpha, then it damn well needs on on all other weakly ordered
> > architectuers too, afaik.
> > 
> > Either that "you cannot finalize a write unless all conditionals it
> > depends on are finalized" is true or it is not. That argument has
> > *never* been about some architecture-specific memory ordering model,
> > as far as I know.
> 
> You can imagine a (horribly broken) value-speculating architecture that
> would permit this re-ordering, but then you'd have speculative stores
> and thin-air values, which Alpha doesn't have.

I've tried arguing this point with Paul on a number of occasions, and I
think he at one point constructed some elaborate scheme that depended on
the split cache where this might require the full barrier.

> > As to READ_ONCE_CTRL - two wrongs don't make a right.
> 
> Sure, I just wanted to point out the precedence and related discussion.

I purely followed 'convention' here.

> > That smp_read_barrier_depends() there doesn't make any sense either.
> > 
> > And finally, the alpha architecture manual actually does have the
> > notion of "Dependence Constraint" (5.6.1.7) that talks about writes
> > that depend on previous reads (where "depends" is explicitly spelled
> > out to be about conditionals, write data _or_ write address). They are
> > actually constrained on alpha too.
> 
> In which case, it looks like we can remove the smp_read_barrier_depends
> instances from all of the control-dependency macros, but I'll defer to
> Paul in case he has some further insight. I assume you're ok with this
> patch if the smp_read_barrier_depends() is removed?

That would be good indeed.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 2/4] sched: Document Program-Order guarantees
  2015-11-02 13:29 ` [PATCH 2/4] sched: Document Program-Order guarantees Peter Zijlstra
@ 2015-11-02 20:27   ` Paul Turner
  2015-11-02 20:34     ` Peter Zijlstra
  2015-11-20 10:02     ` Peter Zijlstra
  0 siblings, 2 replies; 78+ messages in thread
From: Paul Turner @ 2015-11-02 20:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Oleg Nesterov, LKML, Paul McKenney, boqun.feng,
	Jonathan Corbet, mhocko, dhowells, Linus Torvalds, will.deacon

On Mon, Nov 2, 2015 at 5:29 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> These are some notes on the scheduler locking and how it provides
> program order guarantees on SMP systems.
>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: David Howells <dhowells@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c |  142 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 142 insertions(+)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1905,6 +1905,148 @@ static void ttwu_queue(struct task_struc
>         raw_spin_unlock(&rq->lock);
>  }
>
> +/*
> + * Notes on Program-Order guarantees on SMP systems.
> + *
> + *
> + *   PREEMPTION/MIGRATION
> + *
> + * Regular preemption/migration is safe because as long as the task is runnable
> + * migrations involve both rq locks, albeit not (necessarily) at the same time.
> + *
> + * So we get (we allow 3 CPU migrations):
> + *
> + *   CPU0            CPU1            CPU2
> + *
> + *   LOCK rq(0)->lock
> + *   sched-out X
> + *   sched-in Y
> + *   UNLOCK rq(0)->lock
> + *
> + *                                   LOCK rq(0)->lock // MB against CPU0
> + *                                   dequeue X
> + *                                   UNLOCK rq(0)->lock
> + *
> + *                                   LOCK rq(1)->lock
> + *                                   enqueue X
> + *                                   UNLOCK rq(1)->lock
> + *
> + *                   LOCK rq(1)->lock // MB against CPU2
> + *                   sched-out Z
> + *                   sched-in X
> + *                   UNLOCK rq(1)->lock
> + *
> + * and the first LOCK rq(0) on CPU2 gives a full order against the UNLOCK rq(0)
> + * on CPU0. Similarly the LOCK rq(1) on CPU1 provides full order against the
> + * UNLOCK rq(1) on CPU2, therefore by the time task X runs on CPU1 it must
> + * observe the state it left behind on CPU0.
> + *

I suspect this part might be more explicitly expressed by specifying
the requirements that migration satisfies; then providing an example.
This makes it easier for others to reason about the locks and saves
worrying about whether the examples hit our 3 million sub-cases.

I'd also propose just dropping preemption from this part, we only need
memory order to be correct on migration, whether it's scheduled or not
[it also invites confusion with the wake-up case].

Something like:
When any task 't' migrates, all activity on its prior cpu [c1] is
guaranteed to be happens-before any subsequent execution on its new
cpu [c2].  There are 3 components to enforcing this.

[c1]          1) Sched-out of t requires rq(c1)->lock
[any cpu] 2) Any migration of t, by any cpu is required to synchronize
on *both* rq(c1)->lock and rq(c2)->lock
[c2]          3) Sched-in of t requires cq(c2)->lock

Transitivity guarantees that (2) orders after (1) and (3) after (2).
Note that in some cases (e.g. active, or idle cpu) the balancing cpu
in (2) may be c1 or c2.

[Follow example]


> + *
> + *   BLOCKING -- aka. SLEEP + WAKEUP
> + *
> + * For blocking things are a little more interesting, because when we dequeue
> + * the task, we don't need to acquire the old rq lock in order to migrate it.
> + *
> + * Say CPU0 does a wait_event() and CPU1 does the wake() and migrates the task
> + * to CPU2 (the most complex example):
> + *
> + *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (sched_ttwu_pending)
> + *
> + *   X->state = UNINTERRUPTIBLE
> + *   MB
> + *   if (cond)
> + *     break
> + *                    cond = true
> + *
> + *   LOCK rq(0)->lock LOCK X->pi_lock
> + *
> + *   dequeue X
> + *                    while (X->on_cpu)
> + *                      cpu_relax()
> + *   sched-out X
> + *   RELEASE
> + *   X->on_cpu = 0
> + *                    RMB
> + *                    X->state = WAKING
> + *                    set_task_cpu(X,2)
> + *                      WMB
> + *                      ti(X)->cpu = 2
> + *
> + *                    llist_add(X, rq(2)) // MB
> + *                                          llist_del_all() // MB
> + *
> + *                                          LOCK rq(2)->lock
> + *                                          enqueue X
> + *                                          X->state = RUNNING
> + *                                          UNLOCK rq(2)->lock
> + *
> + *                                          LOCK rq(2)->lock
> + *                                          sched-out Z
> + *                                          sched-in X
> + *                                          UNLOCK rq(1)->lock
> + *
> + *                                          if (cond) // _TRUE_
> + *                    UNLOCK X->pi_lock
> + *   UNLOCK rq(0)->lock
> + *
> + * So in this case the scheduler does not provide an obvious full barrier; but
> + * the smp_store_release() in finish_lock_switch(), paired with the control-dep
> + * and smp_rmb() in try_to_wake_up() form a release-acquire pair and fully
> + * order things between CPU0 and CPU1.
> + *
> + * The llist primitives order things between CPU1 and CPU2 -- the alternative
> + * is CPU1 doing the remote enqueue (the alternative path in ttwu_queue()) in
> + * which case the rq(2)->lock release/acquire will order things between them.

I'd vote for similar treatment here.  We can make the dependency on
on_cpu much more obvious, something that is a reasonably subtle detail
today.

> + *
> + * Which again leads to the guarantee that by the time X gets to run on CPU2
> + * it must observe the state it left behind on CPU0.
> + *
> + * However; for blocking there is a second guarantee we must provide, namely we
> + * must observe the state that lead to our wakeup. That is, not only must X
> + * observe its own prior state, it must also observe the @cond store.
> + *
> + * This too is achieved in the above, since CPU1 does the waking, we only need
> + * the ordering between CPU1 and CPU2, which is the same as the above.
> + *
> + *
> + * There is however a much more interesting case for this guarantee, where X
> + * never makes it off CPU0:
> + *
> + *   CPU0 (schedule)  CPU1 (try_to_wake_up)
> + *
> + *   X->state = UNINTERRUPTIBLE
> + *   MB
> + *   if (cond)
> + *     break
> + *                    cond = true
> + *
> + *                    WMB (aka smp_mb__before_spinlock)
> + *                    LOCK X->pi_lock
> + *
> + *                    if (X->on_rq)
> + *                      LOCK rq(0)->lock
> + *                      X->state = RUNNING
> + *                      UNLOCK rq(0)->lock
> + *
> + *   LOCK rq(0)->lock // MB against CPU1
> + *   UNLOCK rq(0)->lock
> + *
> + *   if (cond) // _TRUE_
> + *
> + *                    UNLOCK X->pi_lock
> + *
> + * Here our task X never quite leaves CPU0, the wakeup happens before we can
> + * dequeue and schedule someone else. In this case we must still observe cond
> + * after our call to schedule() completes.
> + *
> + * This is achieved by the smp_mb__before_spinlock() WMB which ensures the store
> + * cannot leak inside the LOCK, and LOCK rq(0)->lock on CPU0 provides full order
> + * against the UNLOCK rq(0)->lock from CPU1. Furthermore our load of cond cannot
> + * happen before this same LOCK.
> + *
> + * Therefore, again, we're good.
> + */
> +
>  /**
>   * try_to_wake_up - wake up a thread
>   * @p: the thread to be awakened
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 2/4] sched: Document Program-Order guarantees
  2015-11-02 20:27   ` Paul Turner
@ 2015-11-02 20:34     ` Peter Zijlstra
  2015-11-02 22:09       ` Paul Turner
  2015-11-20 10:02     ` Peter Zijlstra
  1 sibling, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2015-11-02 20:34 UTC (permalink / raw)
  To: Paul Turner
  Cc: Ingo Molnar, Oleg Nesterov, LKML, Paul McKenney, boqun.feng,
	Jonathan Corbet, mhocko, dhowells, Linus Torvalds, will.deacon

On Mon, Nov 02, 2015 at 12:27:05PM -0800, Paul Turner wrote:
> I suspect this part might be more explicitly expressed by specifying
> the requirements that migration satisfies; then providing an example.
> This makes it easier for others to reason about the locks and saves
> worrying about whether the examples hit our 3 million sub-cases.
> 
> I'd also propose just dropping preemption from this part, we only need
> memory order to be correct on migration, whether it's scheduled or not
> [it also invites confusion with the wake-up case].
> 
> Something like:
> When any task 't' migrates, all activity on its prior cpu [c1] is
> guaranteed to be happens-before any subsequent execution on its new
> cpu [c2].  There are 3 components to enforcing this.
> 
> [c1]          1) Sched-out of t requires rq(c1)->lock
> [any cpu] 2) Any migration of t, by any cpu is required to synchronize
> on *both* rq(c1)->lock and rq(c2)->lock
> [c2]          3) Sched-in of t requires cq(c2)->lock
> 
> Transitivity guarantees that (2) orders after (1) and (3) after (2).
> Note that in some cases (e.g. active, or idle cpu) the balancing cpu
> in (2) may be c1 or c2.
> 
> [Follow example]

Make sense, I'll try and reword things like that.

Note that in don't actually need the strong transitivity here (RCsc),
weak transitivity (RCpc) is in fact sufficient.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-02 13:29 ` [PATCH 4/4] locking: Introduce smp_cond_acquire() Peter Zijlstra
                     ` (2 preceding siblings ...)
  2015-11-02 18:08   ` Linus Torvalds
@ 2015-11-02 20:36   ` David Howells
  2015-11-02 20:40     ` Peter Zijlstra
  2015-11-02 21:11     ` Linus Torvalds
  2015-11-03 17:59   ` Oleg Nesterov
  4 siblings, 2 replies; 78+ messages in thread
From: David Howells @ 2015-11-02 20:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhowells, Peter Zijlstra, Ingo Molnar, Oleg Nesterov,
	Linux Kernel Mailing List, Paul McKenney, boqun.feng,
	Jonathan Corbet, Michal Hocko, Will Deacon

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > +       smp_read_barrier_depends(); /* ctrl */  \
> > +       smp_rmb(); /* ctrl + rmb := acquire */  \

Doesn't smp_rmb() imply an smp_read_barrier_depends() anyway?  In
memory-barriers.txt, it says:

     Read memory barriers imply data dependency barriers, and so can
     substitute for them.

David

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-02 20:36   ` David Howells
@ 2015-11-02 20:40     ` Peter Zijlstra
  2015-11-02 21:11     ` Linus Torvalds
  1 sibling, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2015-11-02 20:40 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Ingo Molnar, Oleg Nesterov,
	Linux Kernel Mailing List, Paul McKenney, boqun.feng,
	Jonathan Corbet, Michal Hocko, Will Deacon

On Mon, Nov 02, 2015 at 08:36:49PM +0000, David Howells wrote:
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > > +       smp_read_barrier_depends(); /* ctrl */  \
> > > +       smp_rmb(); /* ctrl + rmb := acquire */  \
> 
> Doesn't smp_rmb() imply an smp_read_barrier_depends() anyway?  In
> memory-barriers.txt, it says:
> 
>      Read memory barriers imply data dependency barriers, and so can
>      substitute for them.

Yes, I noted that in a follow up email, Alpha implements both as MB. So
just the smp_rmb() is sufficient.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-02 20:36   ` David Howells
  2015-11-02 20:40     ` Peter Zijlstra
@ 2015-11-02 21:11     ` Linus Torvalds
  1 sibling, 0 replies; 78+ messages in thread
From: Linus Torvalds @ 2015-11-02 21:11 UTC (permalink / raw)
  To: David Howells
  Cc: Peter Zijlstra, Ingo Molnar, Oleg Nesterov,
	Linux Kernel Mailing List, Paul McKenney, boqun.feng,
	Jonathan Corbet, Michal Hocko, Will Deacon

On Mon, Nov 2, 2015 at 12:36 PM, David Howells <dhowells@redhat.com> wrote:
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>> > +       smp_read_barrier_depends(); /* ctrl */  \
>> > +       smp_rmb(); /* ctrl + rmb := acquire */  \
>
> Doesn't smp_rmb() imply an smp_read_barrier_depends() anyway?

Yes, it does. But that "smp_read_barrier_depends()" is actually
mis-used as a "barrier against subsequent dependent writes, thanks to
the control flow". It's not protecting against subsequent reads -
which is what the smp_rmb() is about.

Which is completely bogus, but that's what the comment implies.

Of course, on alpha (which is where smp_read_barrier_depends() makes a
difference), both that and smp_rmb() are just full memory barriers,
because alpha is some crazy sh*t. So yes, a "smp_rmb()" is sufficient
everywhere, but that is actually not where the confusion comes from in
the first place.

                   Linus

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-02 19:17       ` Linus Torvalds
  2015-11-02 19:57         ` Will Deacon
@ 2015-11-02 21:56         ` Peter Zijlstra
  2015-11-03  1:57         ` Paul E. McKenney
  2 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2015-11-02 21:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Deacon, Ingo Molnar, Oleg Nesterov,
	Linux Kernel Mailing List, Paul McKenney, boqun.feng,
	Jonathan Corbet, Michal Hocko, David Howells

On Mon, Nov 02, 2015 at 11:17:17AM -0800, Linus Torvalds wrote:
> As to READ_ONCE_CTRL - two wrongs don't make a right.
> 
> That smp_read_barrier_depends() there doesn't make any sense either.
> 
> And finally, the alpha architecture manual actually does have the
> notion of "Dependence Constraint" (5.6.1.7) that talks about writes
> that depend on previous reads (where "depends" is explicitly spelled
> out to be about conditionals, write data _or_ write address). They are
> actually constrained on alpha too.
> 
> Note that a "Dependence Constraint" is not a memory barrier, because
> it only affects that particular chain of dependencies. So it doesn't
> order other things in *general*, but it does order a particular read
> with a particular sef of subsequent write. Which is all we guarantee
> on anything else too wrt the whole control dependencies.

Something like so then, Paul?

---
Subject: locking: Alpha honours control dependencies too

The alpha architecture manual actually does have the notion of
"Dependence Constraint" (5.6.1.7) that talks about writes that depend on
previous reads (where "depends" is explicitly spelled out to be about
conditionals, write data _or_ write address). They are actually
constrained on alpha too.

Which means we can remove the smp_read_barrier_depends() abuse from the
various control dependency primitives.

Retain the primitives, as they are a useful documentation aid.

Maybe-Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/atomic.h   | 2 --
 include/linux/compiler.h | 1 -
 2 files changed, 3 deletions(-)

diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index 27e580d232ca..f16b1dedd909 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -8,7 +8,6 @@
 static inline int atomic_read_ctrl(const atomic_t *v)
 {
 	int val = atomic_read(v);
-	smp_read_barrier_depends(); /* Enforce control dependency. */
 	return val;
 }
 #endif
@@ -565,7 +564,6 @@ static inline int atomic_dec_if_positive(atomic_t *v)
 static inline long long atomic64_read_ctrl(const atomic64_t *v)
 {
 	long long val = atomic64_read(v);
-	smp_read_barrier_depends(); /* Enforce control dependency. */
 	return val;
 }
 #endif
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 3d7810341b57..1d1f5902189d 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -311,7 +311,6 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
 #define READ_ONCE_CTRL(x) \
 ({ \
 	typeof(x) __val = READ_ONCE(x); \
-	smp_read_barrier_depends(); /* Enforce control dependency. */ \
 	__val; \
 })
 

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 2/4] sched: Document Program-Order guarantees
  2015-11-02 20:34     ` Peter Zijlstra
@ 2015-11-02 22:09       ` Paul Turner
  2015-11-02 22:12         ` Peter Zijlstra
  0 siblings, 1 reply; 78+ messages in thread
From: Paul Turner @ 2015-11-02 22:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Oleg Nesterov, LKML, Paul McKenney, boqun.feng,
	Jonathan Corbet, mhocko, dhowells, Linus Torvalds, will.deacon

On Mon, Nov 2, 2015 at 12:34 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Nov 02, 2015 at 12:27:05PM -0800, Paul Turner wrote:
>> I suspect this part might be more explicitly expressed by specifying
>> the requirements that migration satisfies; then providing an example.
>> This makes it easier for others to reason about the locks and saves
>> worrying about whether the examples hit our 3 million sub-cases.
>>
>> I'd also propose just dropping preemption from this part, we only need
>> memory order to be correct on migration, whether it's scheduled or not
>> [it also invites confusion with the wake-up case].
>>
>> Something like:
>> When any task 't' migrates, all activity on its prior cpu [c1] is
>> guaranteed to be happens-before any subsequent execution on its new
>> cpu [c2].  There are 3 components to enforcing this.
>>
>> [c1]          1) Sched-out of t requires rq(c1)->lock
>> [any cpu] 2) Any migration of t, by any cpu is required to synchronize
>> on *both* rq(c1)->lock and rq(c2)->lock
>> [c2]          3) Sched-in of t requires cq(c2)->lock
>>
>> Transitivity guarantees that (2) orders after (1) and (3) after (2).
>> Note that in some cases (e.g. active, or idle cpu) the balancing cpu
>> in (2) may be c1 or c2.
>>
>> [Follow example]
>
> Make sense, I'll try and reword things like that.
>
> Note that in don't actually need the strong transitivity here (RCsc),
> weak transitivity (RCpc) is in fact sufficient.

Yeah, I thought about just using acquire/release to talk about the
interplay, in particular with the release in (1) in release and
acquire from (3)  which would make some of this much more explicit and
highlight that we only need RCpc.  We have not been very consistent at
using this terminology, although this could be a good starting point.

If we went this route, we could do something like:

+ * So in this case the scheduler does not provide an obvious full barrier; but
+ * the smp_store_release() in finish_lock_switch(), paired with the control-dep
+ * and smp_rmb() in try_to_wake_up() form a release-acquire pair and fully
+ * order things between CPU0 and CPU1.

Instead of having this, which is complete, but hard to synchronize
with the points at which it actually matters.  Just use acquire and
release above, then at the actual site, e.g. in try_to_wake_up()
document how we deliver the acquire required by the higher level
documentation/requirements.

This makes it easier to maintain the stupidly racy documentation
consistency property in the future.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 2/4] sched: Document Program-Order guarantees
  2015-11-02 22:09       ` Paul Turner
@ 2015-11-02 22:12         ` Peter Zijlstra
  0 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2015-11-02 22:12 UTC (permalink / raw)
  To: Paul Turner
  Cc: Ingo Molnar, Oleg Nesterov, LKML, Paul McKenney, boqun.feng,
	Jonathan Corbet, mhocko, dhowells, Linus Torvalds, will.deacon

On Mon, Nov 02, 2015 at 02:09:20PM -0800, Paul Turner wrote:
> If we went this route, we could do something like:
> 
> + * So in this case the scheduler does not provide an obvious full barrier; but
> + * the smp_store_release() in finish_lock_switch(), paired with the control-dep
> + * and smp_rmb() in try_to_wake_up() form a release-acquire pair and fully
> + * order things between CPU0 and CPU1.
> 
> Instead of having this, which is complete, but hard to synchronize
> with the points at which it actually matters.  Just use acquire and
> release above, then at the actual site, e.g. in try_to_wake_up()
> document how we deliver the acquire required by the higher level
> documentation/requirements.

Right, which was most of the point of trying to introduce
smp_cond_acquire(), abstract out the tricky cond-dep and rmb trickery
so we can indeed talk about release+acquire like normal people ;-)

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-02 17:43     ` Will Deacon
@ 2015-11-03  1:14       ` Paul E. McKenney
  2015-11-03  1:25         ` Linus Torvalds
  0 siblings, 1 reply; 78+ messages in thread
From: Paul E. McKenney @ 2015-11-03  1:14 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, mingo, oleg, linux-kernel, boqun.feng, corbet,
	mhocko, dhowells, torvalds

On Mon, Nov 02, 2015 at 05:43:48PM +0000, Will Deacon wrote:
> On Mon, Nov 02, 2015 at 02:57:26PM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 02, 2015 at 02:29:05PM +0100, Peter Zijlstra wrote:
> > 
> > > Note that while smp_cond_acquire() has an explicit
> > > smp_read_barrier_depends() for Alpha, neither sites it gets used in
> > > were actually buggy on Alpha for their lack of it. The first uses
> > > smp_rmb(), which on Alpha is a full barrier too and therefore serves
> > > its purpose. The second had an explicit full barrier.
> > 
> > > +/**
> > > + * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
> > > + * @cond: boolean expression to wait for
> > > + *
> > > + * Equivalent to using smp_load_acquire() on the condition variable but employs
> > > + * the control dependency of the wait to reduce the barrier on many platforms.
> > > + *
> > > + * The control dependency provides a LOAD->STORE order, the additional RMB
> > > + * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
> > > + * aka. ACQUIRE.
> > > + */
> > > +#define smp_cond_acquire(cond)	do {		\
> > > +	while (!(cond))				\
> > > +		cpu_relax();			\
> > > +	smp_read_barrier_depends(); /* ctrl */	\
> > > +	smp_rmb(); /* ctrl + rmb := acquire */	\
> > > +} while (0)
> > 
> > So per the above argument we could leave out the
> > smp_read_barrier_depends() for Alpha, although that would break
> > consistency with all the other control dependency primitives we have. It
> > would avoid issuing a double barrier.
> > 
> > Thoughts?
> 
> Do we even know that Alpha needs a barrier for control-dependencies in
> the first place?

You would ask that question when I am many thousands of miles from my
copy of the Alpha reference manual!  ;-)

There is explicit wording in that manual that says that no multi-variable
ordering is implied without explicit memory-barrier instructions.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-03  1:14       ` Paul E. McKenney
@ 2015-11-03  1:25         ` Linus Torvalds
  0 siblings, 0 replies; 78+ messages in thread
From: Linus Torvalds @ 2015-11-03  1:25 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Will Deacon, Peter Zijlstra, Ingo Molnar, Oleg Nesterov,
	Linux Kernel Mailing List, boqun.feng, Jonathan Corbet,
	Michal Hocko, David Howells

On Mon, Nov 2, 2015 at 5:14 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> You would ask that question when I am many thousands of miles from my
> copy of the Alpha reference manual!  ;-)

I don't think I've touched a paper manual in years. Too m uch effort
to index and search. Look here

   http://download.majix.org/dec/alpha_arch_ref.pdf

> There is explicit wording in that manual that says that no multi-variable
> ordering is implied without explicit memory-barrier instructions.

Right. But the whole "read -> conditional -> write" ends up being a
dependency chain to *every* single write that happens after the
conditional.

So there may be no "multi-variable ordering" implied in any individual
access, but despite that a simple "read + conditional" orders every
single store that comes after it. Even if it orders them
"individually".

                Linus

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-02 19:17       ` Linus Torvalds
  2015-11-02 19:57         ` Will Deacon
  2015-11-02 21:56         ` Peter Zijlstra
@ 2015-11-03  1:57         ` Paul E. McKenney
  2015-11-03 19:40           ` Linus Torvalds
  2 siblings, 1 reply; 78+ messages in thread
From: Paul E. McKenney @ 2015-11-03  1:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Deacon, Peter Zijlstra, Ingo Molnar, Oleg Nesterov,
	Linux Kernel Mailing List, boqun.feng, Jonathan Corbet,
	Michal Hocko, David Howells

On Mon, Nov 02, 2015 at 11:17:17AM -0800, Linus Torvalds wrote:
> On Mon, Nov 2, 2015 at 10:37 AM, Will Deacon <will.deacon@arm.com> wrote:
> > On Mon, Nov 02, 2015 at 10:08:24AM -0800, Linus Torvalds wrote:
> >> On Mon, Nov 2, 2015 at 5:29 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > +#define smp_cond_acquire(cond) do {            \
> >> > +       while (!(cond))                         \
> >> > +               cpu_relax();                    \
> >> > +       smp_read_barrier_depends(); /* ctrl */  \
> >> > +       smp_rmb(); /* ctrl + rmb := acquire */  \
> >> > +} while (0)
> >>
> >> This code makes absolutely no sense.
> >>
> >> smp_read_barrier_depends() is about a memory barrier where there is a
> >> data dependency between two accesses. The "depends" is very much about
> >> the data dependency, and very much about *nothing* else.
> >
> > Paul wasn't so sure, which I think is why smp_read_barrier_depends()
> > is already used in, for example, READ_ONCE_CTRL:
> >
> >   http://lkml.kernel.org/r/20151007154003.GJ3910@linux.vnet.ibm.com
> 
> Quoting the alpha architecture manual is kind of pointless, when NO
> OTHER ARCHITECTURE OUT THERE guararantees that whole "read +
> conditional orders wrt subsequent writes" _either_.
> 
> (Again, with the exception of x86, which has the sane "we honor causality")
> 
> Alpha isn't special. And smp_read_barrier_depends() hasn't magically
> become something new.

The existing control dependencies (READ_ONCE_CTRL() and friends) only
guarantee ordering against later stores, and not against later loads.
Of the weakly ordered architectures, only Alpha fails to respect
load-to-store control dependencies.

> If people think that control dependency needs a memory barrier on
> alpha, then it damn well needs on on all other weakly ordered
> architectuers too, afaik.

For load-to-load control dependencies, agreed, from what I can see,
all the weakly ordered architectures do need an explicit barrier.

> Either that "you cannot finalize a write unless all conditionals it
> depends on are finalized" is true or it is not. That argument has
> *never* been about some architecture-specific memory ordering model,
> as far as I know.
> 
> As to READ_ONCE_CTRL - two wrongs don't make a right.
> 
> That smp_read_barrier_depends() there doesn't make any sense either.
> 
> And finally, the alpha architecture manual actually does have the
> notion of "Dependence Constraint" (5.6.1.7) that talks about writes
> that depend on previous reads (where "depends" is explicitly spelled
> out to be about conditionals, write data _or_ write address). They are
> actually constrained on alpha too.

I am in India and my Alpha Architecture Manual is in the USA.  Google
Books has a PDF, but it conveniently omits that particular section.
I do remember quoting that section at the Alpha architects back in the
late 1990s and being told that it didn't mean what I thought it meant.

And they did post a clarification on the web:

	http://h71000.www7.hp.com/wizard/wiz_2637.html

	For instance, your producer must issue a "memory barrier"
	instruction after writing the data to shared memory and before
	inserting it on the queue; likewise, your consumer must issue a
	memory barrier instruction after removing an item from the queue
	and before reading from its memory.  Otherwise, you risk seeing
	stale data, since, while the Alpha processor does provide coherent
	memory, it does not provide implicit ordering of reads and writes.
	(That is, the write of the producer's data might reach memory
	after the write of the queue, such that the consumer might read
	the new item from the queue but get the previous values from
	the item's memory.

So 5.6.1.7 apparently does not sanction data dependency ordering.

> Note that a "Dependence Constraint" is not a memory barrier, because
> it only affects that particular chain of dependencies. So it doesn't
> order other things in *general*, but it does order a particular read
> with a particular sef of subsequent write. Which is all we guarantee
> on anything else too wrt the whole control dependencies.
> 
> The smp_read_barrier_depends() is a *READ BARRIER*. It's about a
> *read* that depends on a previous read. Now, it so happens that alpha
> doesn't have a "read-to-read" barrier, and it's implemented as a full
> barrier, but it really should be read as a "smp_rmb().
> 
> Yes, yes, alpha has the worst memory ordering ever, and the worst
> barriers for that insane memory ordering, but that still does not make
> alpha "magically able to reorder more than physically or logically
> possible". We don't add barriers that don't make sense just because
> the alpha memory orderings are insane.
> 
> Paul? I really disagree with how you have totally tried to re-purpose
> smp_read_barrier_depends() in ways that are insane and make no sense.

I did consider adding a new name, but apparently was lazy that day.
I would of course be happy to add a separate name.

> That is not a control dependency. If it was, then PPC and ARM would
> need to make it a real barrier too. I really don't see why you have
> singled out alpha as the victim of your "let's just randomly change
> the rules".

Just trying to make this all work on Alpha as well as the other
architectures...  But if the actual Alpha hardware that is currently
running recent Linux kernels is more strict than the architecture, I of
course agree that we could code to the actual hardware rather than to
the architecture.  They aren't making new Alphas, so we could argue
thta the usual forward-compatibility issues do not apply.

And I have not been able to come up with an actual scenario that
allows real Alpha hardware to reorder a store with a prior load that it
control-depends on.  Then again, I wasn't able to come up with the
split-cache scenario that breaks data dependencies before my lengthy
argument^Wdiscussion with the Alpha architects.

> > In this case, control dependencies are only referring to READ -> WRITE
> > ordering, so they are honoured by ARM and PowerPC.
> 
> Do ARM and PPC actually guarantee the generic "previous reads always
> order before subsequent writes"?

Almost.  PPC instead guarantees ordering between a previous read and a
later store, but only if they are separated (at runtime) by a conditional
branch that depends on the read.

> Because if that's the issue, then we should perhaps add a new ordering
> primitive that says that (ie "smp_read_to_write_barrier()"). That
> could be a useful thing in general, where we currently use "smp_mb()"
> to order an earlier read with a later write.

I agree that smp_read_to_write_barrier() would be useful, and on PPC
it could use the lighter-weight lwsync instruction.  But that is a
different case than the load-to-store control dependencies, which PPC
(and I believe also ARM) enforce without a memory-barrier instruction.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-02 13:29 ` [PATCH 4/4] locking: Introduce smp_cond_acquire() Peter Zijlstra
                     ` (3 preceding siblings ...)
  2015-11-02 20:36   ` David Howells
@ 2015-11-03 17:59   ` Oleg Nesterov
  2015-11-03 18:23     ` Peter Zijlstra
  2015-11-11  9:39     ` Boqun Feng
  4 siblings, 2 replies; 78+ messages in thread
From: Oleg Nesterov @ 2015-11-03 17:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, paulmck, boqun.feng, corbet, mhocko,
	dhowells, torvalds, will.deacon

On 11/02, Peter Zijlstra wrote:
>
> +#define smp_cond_acquire(cond)	do {		\
> +	while (!(cond))				\
> +		cpu_relax();			\
> +	smp_read_barrier_depends(); /* ctrl */	\
> +	smp_rmb(); /* ctrl + rmb := acquire */	\
> +} while (0)

...

> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -102,13 +102,13 @@ void task_work_run(void)
>
>  		if (!work)
>  			break;
> +
>  		/*
>  		 * Synchronize with task_work_cancel(). It can't remove
>  		 * the first entry == work, cmpxchg(task_works) should
>  		 * fail, but it can play with *work and other entries.
>  		 */
> -		raw_spin_unlock_wait(&task->pi_lock);
> -		smp_mb();
> +		smp_cond_acquire(!raw_spin_is_locked(&task->pi_lock));

Unfortunately this doesn't look exactly right...

spin_unlock_wait() is not equal to "while (locked) relax", the latter
is live-lockable or at least sub-optimal: we do not really need to spin
until we observe !spin_is_locked(), we only need to synchronize with the
current owner of this lock. Once it drops the lock we can proceed, we
do not care if another thread takes the same lock right after that.

Oleg.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-03 17:59   ` Oleg Nesterov
@ 2015-11-03 18:23     ` Peter Zijlstra
  2015-11-11  9:39     ` Boqun Feng
  1 sibling, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2015-11-03 18:23 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: mingo, linux-kernel, paulmck, boqun.feng, corbet, mhocko,
	dhowells, torvalds, will.deacon

On Tue, Nov 03, 2015 at 06:59:58PM +0100, Oleg Nesterov wrote:

> > -		raw_spin_unlock_wait(&task->pi_lock);
> > -		smp_mb();
> > +		smp_cond_acquire(!raw_spin_is_locked(&task->pi_lock));
> 
> Unfortunately this doesn't look exactly right...
> 
> spin_unlock_wait() is not equal to "while (locked) relax", the latter
> is live-lockable or at least sub-optimal: we do not really need to spin
> until we observe !spin_is_locked(), we only need to synchronize with the
> current owner of this lock. Once it drops the lock we can proceed, we
> do not care if another thread takes the same lock right after that.

Ah indeed. And while every use of spin_unlock_wait() has 'interesting'
barriers associated, they all seem different.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-03  1:57         ` Paul E. McKenney
@ 2015-11-03 19:40           ` Linus Torvalds
  2015-11-04  3:57             ` Paul E. McKenney
  0 siblings, 1 reply; 78+ messages in thread
From: Linus Torvalds @ 2015-11-03 19:40 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Will Deacon, Peter Zijlstra, Ingo Molnar, Oleg Nesterov,
	Linux Kernel Mailing List, boqun.feng, Jonathan Corbet,
	Michal Hocko, David Howells

On Mon, Nov 2, 2015 at 5:57 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>>
>> Alpha isn't special. And smp_read_barrier_depends() hasn't magically
>> become something new.
>
> The existing control dependencies (READ_ONCE_CTRL() and friends) only
> guarantee ordering against later stores, and not against later loads.

Right. And using "smp_read_barrier_depends()" for them is WRONG.

That's my argument.

Your arguments make no sense.

> Of the weakly ordered architectures, only Alpha fails to respect
> load-to-store control dependencies.

NO IT DOES NOT.

Christ, Paul. I even sent you the alpha architecture manual
information where it explicitly says that there's a dependency
ordering for that case.

There's a reason that "smp_read_barrier_depends()" is called
smp_READ_barrier_depends().

It's a rmb. Really

You have turned it into something else in your mind. But your mind is WRONG.

> I am in India and my Alpha Architecture Manual is in the USA.

I sent you a link to something that should work, and that has the section.

> And they did post a clarification on the web:

So for alpha, you trust a random web posting by a unknown person that
talks about some random problem in an application that we don't even
know what it is.

But you don't trust the architecture manual, and you don't trust the
fact that it si *physically impossible* to not have the
load-to-control-to-store ordering without some kind of magical
nullifying stores that we know that alpha didn't have?

But then magically, you trust the architecture manuals for other
architectures, or just take the "you cannot have smp-visible
speculative stores" on faith.

But alpha is different, and lives in a universe where causality
suddenly doesn't exist.

I really don't understand your logic.

> So 5.6.1.7 apparently does not sanction data dependency ordering.

Exactly why are you arguing against he architecture manual?

>> Paul? I really disagree with how you have totally tried to re-purpose
>> smp_read_barrier_depends() in ways that are insane and make no sense.
>
> I did consider adding a new name, but apparently was lazy that day.
> I would of course be happy to add a separate name.

That is NOT WHAT I WANT AT ALL.

Get rid of the smp_read_barrier_depends(). It doesn't do control
barriers against stores, it has never done that, and they aren't
needed in the first place.

There is no need for a new name. The only thing that there is a need
for is to just realize that alpha isn't magical.

Alpha does have a stupid segmented cache which means that even when
the core is constrained by load-to-load data address dependencies
(because no alpha actually did value predication), the cachelines that
core loads may not be ordered without the memory barrier.

But that "second load may get a stale value from the past" is purely
about loads. You cannot have the same issue happen for stores, because
there's no way a store buffer somehow magically goes backwards in time
and exposes the store before the load that it depended on.

And the architecture manual even *EXPLICITLY* says so. Alpha does
actually have a dependency chain from loads to dependent stores -
through addresses, through data, _and_ through control. It's
documented, but equally importantly, it's just basically impossible to
not have an architecture that does that.

Even if you do some really fancy things like actual value prediction
(which the alpha architecture _allows_, although no alpha ever did,
afaik), that cannot break the load->store dependency. Because even if
the load value was predicted (or any control flow after the load was
predicted), the store cannot be committed and made visible to other
CPU's until after that prediction has been verified.

So there may be bogus values in a store buffer, but those values will
have to be killed with the store instructions that caused them, if any
prediction failed. They won't be visible to other CPU's.

And I say that not because the architecture manual says so (although
it does), but because because such a CPU wouldn't *work*. It wouldn't
be a CPU, it would at most be a fuzzy neural network that generates
cool results that may be interesting, but they won't be dependable or
reliable in the sense that the Linux kernel depends on.

>> That is not a control dependency. If it was, then PPC and ARM would
>> need to make it a real barrier too. I really don't see why you have
>> singled out alpha as the victim of your "let's just randomly change
>> the rules".
>
> Just trying to make this all work on Alpha as well as the other
> architectures...  But if the actual Alpha hardware that is currently
> running recent Linux kernels is more strict than the architecture

.. really. This is specified in the architecture manual. The fact that
you have found some random posting by some random person that says
"you need barriers everywhere" is immaterial. That support blog may
well have been simply a "I don't know what I am talking about, and I
don't know what problem you have, but I do know that the memory model
is really nasty, and adding random barriers will make otherwise
correct code work".

But we don't add random read barriers to make a control-to-store
barrier. Not when the architecture manual explicitly says there is a
dependency chain there, and not when I don't see how you could
possibly even make a valid CPU that doesn't have that dependency.

                Linus

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-03 19:40           ` Linus Torvalds
@ 2015-11-04  3:57             ` Paul E. McKenney
  2015-11-04  4:43               ` Linus Torvalds
  0 siblings, 1 reply; 78+ messages in thread
From: Paul E. McKenney @ 2015-11-04  3:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Deacon, Peter Zijlstra, Ingo Molnar, Oleg Nesterov,
	Linux Kernel Mailing List, boqun.feng, Jonathan Corbet,
	Michal Hocko, David Howells

On Tue, Nov 03, 2015 at 11:40:24AM -0800, Linus Torvalds wrote:
> On Mon, Nov 2, 2015 at 5:57 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:

[ . . . ]

> > I am in India and my Alpha Architecture Manual is in the USA.
> 
> I sent you a link to something that should work, and that has the section.

Thank you, and yes, it clearly states that read-to-write dependencies
are ordered.  Color me slow and stupid.

> > And they did post a clarification on the web:
> 
> So for alpha, you trust a random web posting by a unknown person that
> talks about some random problem in an application that we don't even
> know what it is.

In my defense, the Alpha architects pointed me at that web posting, but
yes, it appears to be pushing for overly conservative safety rather than
accuracy.  Please accept my apologies for my confusion.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-04  3:57             ` Paul E. McKenney
@ 2015-11-04  4:43               ` Linus Torvalds
  2015-11-04 12:54                 ` Paul E. McKenney
  0 siblings, 1 reply; 78+ messages in thread
From: Linus Torvalds @ 2015-11-04  4:43 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Will Deacon, Peter Zijlstra, Ingo Molnar, Oleg Nesterov,
	Linux Kernel Mailing List, boqun.feng, Jonathan Corbet,
	Michal Hocko, David Howells

On Tue, Nov 3, 2015 at 7:57 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> Thank you, and yes, it clearly states that read-to-write dependencies
> are ordered.

Well, I wouldn't say that it's exactly "clear".

The fact that they explicitly say "Note that the DP relation does not
directly impose a BEFORE (⇐) ordering between accesses u and v" makes
it all clear as mud.

They *then* go on to talk about how the DP relationship *together*
with the odd "is source of" ordering (which in turn is defined in
terms of BEFORE ordering) cannot have cycles.

I have no idea why they do it that way, but the reason seems to be
that they wanted to make "BEFORE" be purely about barriers and
accesses, and make the other orderings be described separately. So the
"BEFORE" ordering is used to define how memory must act, which is then
used as a basis for that storage definition and the "is source of"
thing.

But none of that seems to make much sense to a *user*.

The fact that they seem to equate "BEFORE" with "Processor Issue
Constraints" also makes me think that the whole logic was written by a
CPU designer, and part of why they document it that way is that the
CPU designer literally thought of "can I issue this access" as being
very different from "is there some inherent ordering that just results
from issues outside of my design".

I really don't know. That whole series of memory ordering rules makes
my head hurt.

But I do think the only thing that matters in the end is that they do
have that DP relationship between reads and subsequently dependent
writes, but  basically not for *any* other relationship.

So on alpha read-vs-read, write-vs-later-read, and write-vs-write all
have to have memory barriers, unless the accesses physically overlap.

                 Linus

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-04  4:43               ` Linus Torvalds
@ 2015-11-04 12:54                 ` Paul E. McKenney
  0 siblings, 0 replies; 78+ messages in thread
From: Paul E. McKenney @ 2015-11-04 12:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Deacon, Peter Zijlstra, Ingo Molnar, Oleg Nesterov,
	Linux Kernel Mailing List, boqun.feng, Jonathan Corbet,
	Michal Hocko, David Howells, rth, ink, mattst88

On Tue, Nov 03, 2015 at 08:43:22PM -0800, Linus Torvalds wrote:
> On Tue, Nov 3, 2015 at 7:57 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > Thank you, and yes, it clearly states that read-to-write dependencies
> > are ordered.
> 
> Well, I wouldn't say that it's exactly "clear".
> 
> The fact that they explicitly say "Note that the DP relation does not
> directly impose a BEFORE (⇐) ordering between accesses u and v" makes
> it all clear as mud.
> 
> They *then* go on to talk about how the DP relationship *together*
> with the odd "is source of" ordering (which in turn is defined in
> terms of BEFORE ordering) cannot have cycles.
> 
> I have no idea why they do it that way, but the reason seems to be
> that they wanted to make "BEFORE" be purely about barriers and
> accesses, and make the other orderings be described separately. So the
> "BEFORE" ordering is used to define how memory must act, which is then
> used as a basis for that storage definition and the "is source of"
> thing.
> 
> But none of that seems to make much sense to a *user*.
> 
> The fact that they seem to equate "BEFORE" with "Processor Issue
> Constraints" also makes me think that the whole logic was written by a
> CPU designer, and part of why they document it that way is that the
> CPU designer literally thought of "can I issue this access" as being
> very different from "is there some inherent ordering that just results
> from issues outside of my design".
> 
> I really don't know. That whole series of memory ordering rules makes
> my head hurt.

The honest answer is that for Alpha, I don't know either.  But if your
head isn't hurting enough yet, feel free to read on...



My guess, based loosely on ARM and PowerPC, is that memory barriers
provide global ordering but that the load-to-store dependency ordering is
strictly local.  Which as you say does not necessarily make much sense
to a user.  One guess is that the following would never trigger the
BUG_ON() given x, y, and z initially zero (and ignoring the possibility
of compiler mischief):

	CPU 0		CPU 1		CPU 2
	r1 = x;		r2 = y;		r3 = z;
	if (r1)		if (r2)		if (r3)
		y = 1;		z = 1;		x = 1;
	BUG_ON(r1 == 1 && r2 == 1 && r3 == 1); /* after the dust settles */

The write-to-read relationships prevent misordering.  The copy of the
Alpha manual I downloaded hints that this BUG_ON() could not fire.

However, the following might well trigger:

	CPU 0		CPU 1		CPU 2
	x = 1;		r1 = x;		r2 = y;
			if (r1)		if (r2)
				y = 1;		x = 2;
	BUG_ON(r1 == 1 && r2 == 1 && x == 1); /* after the dust settles */

The dependency ordering orders each CPU individually, but might not force
CPU 0's write to reach CPU 1 and CPU 2 at the same time.  So the BUG_ON()
case would happen if CPU 0's write to x reach CPU 1 before it reached
CPU 2, in which case the x==2 value might not be seen outside of CPU 2,
so that everyone agrees on the order of values taken on by x.

And this could be prevented by enforcing global (rather than local)
ordering by placing a memory barrier in CPU 1:

	CPU 0		CPU 1		CPU 2
	x = 1;		r1 = x;		r2 = y;
			smp_mb();	if (r2)
			if (r1)			x = 2;
				y = 1;
	BUG_ON(r1 == 1 && r2 == 1 && x == 1); /* after the dust settles */

But the reference manual doesn't have this sort of litmus test, so who
knows?  CCing the Alpha maintainers in case they know.

> But I do think the only thing that matters in the end is that they do
> have that DP relationship between reads and subsequently dependent
> writes, but  basically not for *any* other relationship.
> 
> So on alpha read-vs-read, write-vs-later-read, and write-vs-write all
> have to have memory barriers, unless the accesses physically overlap.

Agreed.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-03 17:59   ` Oleg Nesterov
  2015-11-03 18:23     ` Peter Zijlstra
@ 2015-11-11  9:39     ` Boqun Feng
  2015-11-11 10:34       ` Boqun Feng
  2015-11-11 12:12       ` Peter Zijlstra
  1 sibling, 2 replies; 78+ messages in thread
From: Boqun Feng @ 2015-11-11  9:39 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, mingo, linux-kernel, paulmck, corbet, mhocko,
	dhowells, torvalds, will.deacon

[-- Attachment #1: Type: text/plain, Size: 1420 bytes --]

Hi Oleg,

On Tue, Nov 03, 2015 at 06:59:58PM +0100, Oleg Nesterov wrote:
[snip]
> 
> Unfortunately this doesn't look exactly right...
> 
> spin_unlock_wait() is not equal to "while (locked) relax", the latter
> is live-lockable or at least sub-optimal: we do not really need to spin

Just be curious, should spin_unlock_wait() semantically be an ACQUIRE?
Because spin_unlock_wait() is used for us to wait for a certain lock to
RELEASE so that we can do something *after* we observe the RELEASE.
Considering the follow example:

	CPU 0 				CPU 1
	============================	===========================
	{ X = 0 }
					WRITE_ONCE(X, 1);
					spin_unlock(&lock);
	spin_unlock_wait(&lock)
	r1 = READ_ONCE(X);

If spin_unlock_wait() is not an ACQUIRE, r1 can be 0 in this case,
right? Am I missing something subtle here? Or spin_unlock_wait() itself
doesn't have the ACQUIRE semantics, but it should always come with a
smp_mb() *following* it to achieve the ACQUIRE semantics? However in
do_exit(), an smp_mb() is preceding raw_spin_unlock_wait() rather than
following, which makes me confused... could you explain that? Thank you
;-)

Regards,
Boqun

> until we observe !spin_is_locked(), we only need to synchronize with the
> current owner of this lock. Once it drops the lock we can proceed, we
> do not care if another thread takes the same lock right after that.
> 
> Oleg.
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-11  9:39     ` Boqun Feng
@ 2015-11-11 10:34       ` Boqun Feng
  2015-11-11 19:53         ` Oleg Nesterov
  2015-11-12 13:50         ` Paul E. McKenney
  2015-11-11 12:12       ` Peter Zijlstra
  1 sibling, 2 replies; 78+ messages in thread
From: Boqun Feng @ 2015-11-11 10:34 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, mingo, linux-kernel, paulmck, corbet, mhocko,
	dhowells, torvalds, will.deacon

[-- Attachment #1: Type: text/plain, Size: 2824 bytes --]

On Wed, Nov 11, 2015 at 05:39:40PM +0800, Boqun Feng wrote:
> Hi Oleg,
> 
> On Tue, Nov 03, 2015 at 06:59:58PM +0100, Oleg Nesterov wrote:
> [snip]
> > 
> > Unfortunately this doesn't look exactly right...
> > 
> > spin_unlock_wait() is not equal to "while (locked) relax", the latter
> > is live-lockable or at least sub-optimal: we do not really need to spin
> 
> Just be curious, should spin_unlock_wait() semantically be an ACQUIRE?

Hmm.. I guess I was wrong, it doesn't need to be an ACQUIRE, it needs
only to use the control dependency to order the load of lock state and
stores following it.

> Because spin_unlock_wait() is used for us to wait for a certain lock to
> RELEASE so that we can do something *after* we observe the RELEASE.
> Considering the follow example:
> 
> 	CPU 0 				CPU 1
> 	============================	===========================
> 	{ X = 0 }
> 					WRITE_ONCE(X, 1);
> 					spin_unlock(&lock);
> 	spin_unlock_wait(&lock)
> 	r1 = READ_ONCE(X);
> 
> If spin_unlock_wait() is not an ACQUIRE, r1 can be 0 in this case,
> right? Am I missing something subtle here? Or spin_unlock_wait() itself
> doesn't have the ACQUIRE semantics, but it should always come with a
> smp_mb() *following* it to achieve the ACQUIRE semantics? However in
> do_exit(), an smp_mb() is preceding raw_spin_unlock_wait() rather than
> following, which makes me confused... could you explain that? Thank you
> ;-)
> 

But still, there is one suspicious use of smp_mb() in do_exit():

	/*
	 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
	 * when the following two conditions become true.
	 *   - There is race condition of mmap_sem (It is acquired by
	 *     exit_mm()), and
	 *   - SMI occurs before setting TASK_RUNINNG.
	 *     (or hypervisor of virtual machine switches to other guest)
	 *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
	 *
	 * To avoid it, we have to wait for releasing tsk->pi_lock which
	 * is held by try_to_wake_up()
	 */
	smp_mb();
	raw_spin_unlock_wait(&tsk->pi_lock);

	/* causes final put_task_struct in finish_task_switch(). */
	tsk->state = TASK_DEAD;
	tsk->flags |= PF_NOFREEZE;	/* tell freezer to ignore us */
	schedule();

Seems like smp_mb() doesn't need here? Because the control dependency
already orders load of tsk->pi_lock and store of tsk->state, and this
control dependency order guarantee pairs with the spin_unlock(->pi_lock)
in try_to_wake_up() to avoid data race on ->state.

Regards,
Boqun

> Regards,
> Boqun
> 
> > until we observe !spin_is_locked(), we only need to synchronize with the
> > current owner of this lock. Once it drops the lock we can proceed, we
> > do not care if another thread takes the same lock right after that.
> > 
> > Oleg.
> > 



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-11  9:39     ` Boqun Feng
  2015-11-11 10:34       ` Boqun Feng
@ 2015-11-11 12:12       ` Peter Zijlstra
  2015-11-11 19:39         ` Oleg Nesterov
  1 sibling, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2015-11-11 12:12 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Oleg Nesterov, mingo, linux-kernel, paulmck, corbet, mhocko,
	dhowells, torvalds, will.deacon

On Wed, Nov 11, 2015 at 05:39:40PM +0800, Boqun Feng wrote:

> Just be curious, should spin_unlock_wait() semantically be an ACQUIRE?

I did wonder the same thing, it would simplify a number of things if
this were so.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-11 12:12       ` Peter Zijlstra
@ 2015-11-11 19:39         ` Oleg Nesterov
  2015-11-11 21:23           ` Linus Torvalds
  2015-11-12  7:14           ` Boqun Feng
  0 siblings, 2 replies; 78+ messages in thread
From: Oleg Nesterov @ 2015-11-11 19:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Boqun Feng, mingo, linux-kernel, paulmck, corbet, mhocko,
	dhowells, torvalds, will.deacon

On 11/11, Peter Zijlstra wrote:
>
> On Wed, Nov 11, 2015 at 05:39:40PM +0800, Boqun Feng wrote:
>
> > Just be curious, should spin_unlock_wait() semantically be an ACQUIRE?
>
> I did wonder the same thing, it would simplify a number of things if
> this were so.

Yes, me too.

Sometimes I even think it should have both ACQUIRE + RELEASE semantics.
IOW, it should be "equivalent" to spin_lock() + spin_unlock().

Consider this code:

	object_t *object;
	spinlock_t lock;

	void update(void)
	{
		object_t *o;

		spin_lock(&lock);
		o = READ_ONCE(object);
		if (o) {
			BUG_ON(o->dead);
			do_something(o);
		}
		spin_unlock(&lock);
	}

	void destroy(void) // can be called only once, can't race with itself
	{
		object_t *o;

		o = object;
		object = NULL;

		/*
		 * pairs with lock/ACQUIRE. The next update() must see
		 * object == NULL after spin_lock();
		 */
		smp_mb();

		spin_unlock_wait(&lock);

		/*
		 * pairs with unlock/RELEASE. The previous update() has
		 * already passed BUG_ON(o->dead).
		 *
		 * (Yes, yes, in this particular case it is not needed,
		 *  we can rely on the control dependency).
		 */
		smp_mb();

		o->dead = true;
	}

I believe the code above is correct and it needs the barriers on both sides.

If it is wrong, then task_work_run() is buggy: it relies on mb() implied by
cmpxchg() before spin_unlock_wait() the same way: the next task_work_cancel()
should see the result of our cmpxchg(), it must not try to read work->next or
work->func.

Hmm. Not sure I really understand what I am trying to say... Perhaps in fact
I mean that unlock_wait() should be removed because it is too subtle for me ;)

Oleg.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-11 10:34       ` Boqun Feng
@ 2015-11-11 19:53         ` Oleg Nesterov
  2015-11-12 13:50         ` Paul E. McKenney
  1 sibling, 0 replies; 78+ messages in thread
From: Oleg Nesterov @ 2015-11-11 19:53 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, mingo, linux-kernel, paulmck, corbet, mhocko,
	dhowells, torvalds, will.deacon

He Boqun,

Let me first state that I can't answer authoritatively when it comes to
barriers. That said,

On 11/11, Boqun Feng wrote:
>
> But still, there is one suspicious use of smp_mb() in do_exit():
>
> 	/*
> 	 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
> 	 * when the following two conditions become true.
> 	 *   - There is race condition of mmap_sem (It is acquired by
> 	 *     exit_mm()), and
> 	 *   - SMI occurs before setting TASK_RUNINNG.
> 	 *     (or hypervisor of virtual machine switches to other guest)
> 	 *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
> 	 *
> 	 * To avoid it, we have to wait for releasing tsk->pi_lock which
> 	 * is held by try_to_wake_up()
> 	 */
> 	smp_mb();
> 	raw_spin_unlock_wait(&tsk->pi_lock);
>
> 	/* causes final put_task_struct in finish_task_switch(). */
> 	tsk->state = TASK_DEAD;
> 	tsk->flags |= PF_NOFREEZE;	/* tell freezer to ignore us */
> 	schedule();
>
> Seems like smp_mb() doesn't need here?

Please see my reply to peterz's email.

AFAICS, we need the barries on both sides. But, since we only need to
STORE into tsk->state after unlock_wait(), we can rely on the control
dependency and avoid the 2nd mb().

Oleg.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-11 19:39         ` Oleg Nesterov
@ 2015-11-11 21:23           ` Linus Torvalds
  2015-11-12  7:14           ` Boqun Feng
  1 sibling, 0 replies; 78+ messages in thread
From: Linus Torvalds @ 2015-11-11 21:23 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Boqun Feng, Ingo Molnar,
	Linux Kernel Mailing List, Paul McKenney, Jonathan Corbet,
	Michal Hocko, David Howells, Will Deacon

On Wed, Nov 11, 2015 at 11:39 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 11/11, Peter Zijlstra wrote:
>
> Sometimes I even think it should have both ACQUIRE + RELEASE semantics.
> IOW, it should be "equivalent" to spin_lock() + spin_unlock().

That's insane.

"Release" semantics are - by definition - about stuff that *predeces* it.

It is simply not sensible to have a "wait_for_unlock()" that then
synchronizes loads or stores that happened *before* the wait. That's
some crazy voodoo programming.

And if you do need to have model where you do "store something, then
make sure that the unlock we wait for happens after the store", then
you need to just add a "smp_mb()" in between that store and the
waiting for unlock.

Or just take the damn lock, and don't play any games at all.

                   Linus

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-11 19:39         ` Oleg Nesterov
  2015-11-11 21:23           ` Linus Torvalds
@ 2015-11-12  7:14           ` Boqun Feng
  2015-11-12 10:28             ` Peter Zijlstra
                               ` (2 more replies)
  1 sibling, 3 replies; 78+ messages in thread
From: Boqun Feng @ 2015-11-12  7:14 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, mingo, linux-kernel, paulmck, corbet, mhocko,
	dhowells, torvalds, will.deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

[-- Attachment #1: Type: text/plain, Size: 6173 bytes --]

On Wed, Nov 11, 2015 at 08:39:53PM +0100, Oleg Nesterov wrote:
> On 11/11, Peter Zijlstra wrote:
> >
> > On Wed, Nov 11, 2015 at 05:39:40PM +0800, Boqun Feng wrote:
> >
> > > Just be curious, should spin_unlock_wait() semantically be an ACQUIRE?
> >
> > I did wonder the same thing, it would simplify a number of things if
> > this were so.
> 
> Yes, me too.
> 
> Sometimes I even think it should have both ACQUIRE + RELEASE semantics.
> IOW, it should be "equivalent" to spin_lock() + spin_unlock().
> 
> Consider this code:
> 
> 	object_t *object;
> 	spinlock_t lock;
> 
> 	void update(void)
> 	{
> 		object_t *o;
> 
> 		spin_lock(&lock);
> 		o = READ_ONCE(object);
> 		if (o) {
> 			BUG_ON(o->dead);
> 			do_something(o);
> 		}
> 		spin_unlock(&lock);
> 	}
> 
> 	void destroy(void) // can be called only once, can't race with itself
> 	{
> 		object_t *o;
> 
> 		o = object;
> 		object = NULL;
> 
> 		/*
> 		 * pairs with lock/ACQUIRE. The next update() must see
> 		 * object == NULL after spin_lock();
> 		 */
> 		smp_mb();
> 
> 		spin_unlock_wait(&lock);
> 
> 		/*
> 		 * pairs with unlock/RELEASE. The previous update() has
> 		 * already passed BUG_ON(o->dead).
> 		 *
> 		 * (Yes, yes, in this particular case it is not needed,
> 		 *  we can rely on the control dependency).
> 		 */
> 		smp_mb();
> 
> 		o->dead = true;
> 	}
> 
> I believe the code above is correct and it needs the barriers on both sides.
> 

Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock()
only guarantees that the memory operations following spin_lock() can't
be reorder before the *LOAD* part of spin_lock() not the *STORE* part,
i.e. the case below can happen(assuming the spin_lock() is implemented
as ll/sc loop)

	spin_lock(&lock):
	  r1 = *lock; // LL, r1 == 0
	o = READ_ONCE(object); // could be reordered here.
	  *lock = 1; // SC

This could happen because of the ACQUIRE semantics of spin_lock(), and 
the current implementation of spin_lock() on PPC allows this happen.

(Cc PPC maintainers for their opinions on this one)

Therefore the case below can happen:

	CPU 1			CPU 2			CPU 3
	==================	====================	==============
							spin_unlock(&lock);
				spin_lock(&lock):
				  r1 = *lock; // r1 == 0;
				o = READ_ONCE(object); // reordered here
	object = NULL;
	smp_mb();
	spin_unlock_wait(&lock);
				  *lock = 1;
	smp_mb();		
	o->dead = true;
				if (o) // true
				  BUG_ON(o->dead); // true!!


To show this, I also translate this situation into a PPC litmus for
herd[1]:

	PPC spin-lock-wait
	"
	r1: local variable of lock
	r2: constant 1
	r3: constant 0 or NULL
	r4: local variable of object, i.e. o
	r5: local variable of *o (simulate ->dead as I didn't know 
	    how to write fields of structure in herd ;-()
	r13: the address of lock, i.e. &lock
	r14: the address of object, i.e. &object
	"
	{
	0:r1=0;0:r2=1;0:r3=0;0:r13=lock;0:r14=object;
	1:r1=0;1:r2=1;1:r3=0;1:r4=0;1:r5=0;1:r13=lock;1:r14=object;
	2:r1=0;2:r13=lock;
	lock=1; object=old; old=0;
	}

	P0              | P1                 | P2 ;
	ld r4,0(r14)    | Lock:              | stw r1,0(r13);
	std r3,0(r14)   | lwarx r1,r3,r13    | ;
	                | cmpwi r1,0         | ;
	sync            | bne Lock           | ;
	                | stwcx. r2,r3,r13   | ;
	Wait:           | bne Lock           | ;
	lwz r1,0(r13)   | lwsync             | ;
	cmpwi r1,0      | ld r4,0(r14)       | ;
	bne Wait        | cmpwi r4,0         | ;
	                | beq Fail           | ;
	sync            | lwz r5, 0(r4)      | ;
	stw r2,0(r4)    | Fail:              | ;
	                | lwsync             | ;
	                | stw r3, 0(r13)     | ;

	exists
	(1:r4=old /\ 1:r5=1)

,whose result says that (1:r4=old /\ 1:r5=1) can happen:

	Test spin-lock-wait Allowed
	States 3
	1:r4=0; 1:r5=0;
	1:r4=old; 1:r5=0;
	1:r4=old; 1:r5=1;
	Loop Ok
	Witnesses
	Positive: 18 Negative: 108
	Condition exists (1:r4=old /\ 1:r5=1)
	Observation spin-lock-wait Sometimes 18 108
	Hash=244f8c0f91df5a5ed985500ed7230272

Please note that I use backwards jump in this litmus, which is only
supported by herd(not by ppcmem[2]), and it will take a while to get the
result. And I'm not that confident that I'm familiar with this tool,
maybe Paul and Will can help check my translate and usage here ;-)


IIUC, the problem here is that spin_lock_wait() can be implemented with
only LOAD operations, and to have a RELEASE semantics, one primivite
must have a *STORE* part, therefore spin_lock_wait() can not be a
RELEASE.

So.. we probably need to take the lock here.

> If it is wrong, then task_work_run() is buggy: it relies on mb() implied by
> cmpxchg() before spin_unlock_wait() the same way: the next task_work_cancel()
> should see the result of our cmpxchg(), it must not try to read work->next or
> work->func.
> 
> Hmm. Not sure I really understand what I am trying to say... Perhaps in fact
> I mean that unlock_wait() should be removed because it is too subtle for me ;)
> 

;-)

I think it's OK for it as an ACQUIRE(with a proper barrier) or even just
a control dependency to pair with spin_unlock(), for example, the
following snippet in do_exit() is OK, except the smp_mb() is redundant,
unless I'm missing something subtle:


	/*
	 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
	 * when the following two conditions become true.
	 *   - There is race condition of mmap_sem (It is acquired by
	 *     exit_mm()), and
	 *   - SMI occurs before setting TASK_RUNINNG.
	 *     (or hypervisor of virtual machine switches to other guest)
	 *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
	 *
	 * To avoid it, we have to wait for releasing tsk->pi_lock which
	 * is held by try_to_wake_up()
	 */
	smp_mb();
	raw_spin_unlock_wait(&tsk->pi_lock);

	/* causes final put_task_struct in finish_task_switch(). */
	tsk->state = TASK_DEAD;

Ref:
[1]: https://github.com/herd/herdtools
[2]: http://www.cl.cam.ac.uk/~pes20/ppcmem/index.html

Regards,
Boqun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-12  7:14           ` Boqun Feng
@ 2015-11-12 10:28             ` Peter Zijlstra
  2015-11-12 15:00             ` Oleg Nesterov
  2015-11-12 18:21             ` Linus Torvalds
  2 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2015-11-12 10:28 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Oleg Nesterov, mingo, linux-kernel, paulmck, corbet, mhocko,
	dhowells, torvalds, will.deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On Thu, Nov 12, 2015 at 03:14:51PM +0800, Boqun Feng wrote:
> Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock()
> only guarantees that the memory operations following spin_lock() can't
> be reorder before the *LOAD* part of spin_lock() not the *STORE* part,
> i.e. the case below can happen(assuming the spin_lock() is implemented
> as ll/sc loop)
> 
> 	spin_lock(&lock):
> 	  r1 = *lock; // LL, r1 == 0
> 	o = READ_ONCE(object); // could be reordered here.
> 	  *lock = 1; // SC
> 
> This could happen because of the ACQUIRE semantics of spin_lock(), and 
> the current implementation of spin_lock() on PPC allows this happen.

Urgh, you just _had_ to send an email like this, didn't you ;-)

I think AARGH64 does the same thing. They either use LDAXR/STXR, which
also places the ACQUIRE on the load, or use LDADDA (v8.1) which I
suspect does the same, but I'm not entirely sure how to decode these 8.1
atomics yet.

Let me think a little more on this..



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-11 10:34       ` Boqun Feng
  2015-11-11 19:53         ` Oleg Nesterov
@ 2015-11-12 13:50         ` Paul E. McKenney
  1 sibling, 0 replies; 78+ messages in thread
From: Paul E. McKenney @ 2015-11-12 13:50 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Oleg Nesterov, Peter Zijlstra, mingo, linux-kernel, corbet,
	mhocko, dhowells, torvalds, will.deacon

On Wed, Nov 11, 2015 at 06:34:56PM +0800, Boqun Feng wrote:
> On Wed, Nov 11, 2015 at 05:39:40PM +0800, Boqun Feng wrote:
> > Hi Oleg,
> > 
> > On Tue, Nov 03, 2015 at 06:59:58PM +0100, Oleg Nesterov wrote:
> > [snip]
> > > 
> > > Unfortunately this doesn't look exactly right...
> > > 
> > > spin_unlock_wait() is not equal to "while (locked) relax", the latter
> > > is live-lockable or at least sub-optimal: we do not really need to spin
> > 
> > Just be curious, should spin_unlock_wait() semantically be an ACQUIRE?
> 
> Hmm.. I guess I was wrong, it doesn't need to be an ACQUIRE, it needs
> only to use the control dependency to order the load of lock state and
> stores following it.

I must say that spin_unlock_wait() and friends are a bit tricky.
There are only a few uses in drivers/ata/libata-eh.c, ipc/sem.c,
kernel/exit.c, kernel/sched/completion.c, and kernel/task_work.c.
Almost all of these seem to assume that spin_unlock_wait() provides
no ordering.  We might have just barely enough uses to produce useful
abstractions, but my guess is that it would not hurt to wait.

> > Because spin_unlock_wait() is used for us to wait for a certain lock to
> > RELEASE so that we can do something *after* we observe the RELEASE.
> > Considering the follow example:
> > 
> > 	CPU 0 				CPU 1
> > 	============================	===========================
> > 	{ X = 0 }
> > 					WRITE_ONCE(X, 1);
> > 					spin_unlock(&lock);
> > 	spin_unlock_wait(&lock)
> > 	r1 = READ_ONCE(X);
> > 
> > If spin_unlock_wait() is not an ACQUIRE, r1 can be 0 in this case,
> > right? Am I missing something subtle here? Or spin_unlock_wait() itself
> > doesn't have the ACQUIRE semantics, but it should always come with a
> > smp_mb() *following* it to achieve the ACQUIRE semantics? However in
> > do_exit(), an smp_mb() is preceding raw_spin_unlock_wait() rather than
> > following, which makes me confused... could you explain that? Thank you
> > ;-)
> > 
> 
> But still, there is one suspicious use of smp_mb() in do_exit():
> 
> 	/*
> 	 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
> 	 * when the following two conditions become true.
> 	 *   - There is race condition of mmap_sem (It is acquired by
> 	 *     exit_mm()), and
> 	 *   - SMI occurs before setting TASK_RUNINNG.
> 	 *     (or hypervisor of virtual machine switches to other guest)
> 	 *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
> 	 *
> 	 * To avoid it, we have to wait for releasing tsk->pi_lock which
> 	 * is held by try_to_wake_up()
> 	 */
> 	smp_mb();
> 	raw_spin_unlock_wait(&tsk->pi_lock);
> 
> 	/* causes final put_task_struct in finish_task_switch(). */
> 	tsk->state = TASK_DEAD;
> 	tsk->flags |= PF_NOFREEZE;	/* tell freezer to ignore us */
> 	schedule();
> 
> Seems like smp_mb() doesn't need here? Because the control dependency
> already orders load of tsk->pi_lock and store of tsk->state, and this
> control dependency order guarantee pairs with the spin_unlock(->pi_lock)
> in try_to_wake_up() to avoid data race on ->state.

The exit() path is pretty heavyweight, so I suspect that an extra smp_mb()
is down in the noise.  Or are you saying that this is somehow unsafe?

							Thanx, Paul

> Regards,
> Boqun
> 
> > Regards,
> > Boqun
> > 
> > > until we observe !spin_is_locked(), we only need to synchronize with the
> > > current owner of this lock. Once it drops the lock we can proceed, we
> > > do not care if another thread takes the same lock right after that.
> > > 
> > > Oleg.
> > > 
> 
> 



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-12 15:00             ` Oleg Nesterov
@ 2015-11-12 14:40               ` Paul E. McKenney
  2015-11-12 14:49                 ` Boqun Feng
  2015-11-12 14:50                 ` Peter Zijlstra
  2015-11-12 15:18               ` Boqun Feng
  1 sibling, 2 replies; 78+ messages in thread
From: Paul E. McKenney @ 2015-11-12 14:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Boqun Feng, Peter Zijlstra, mingo, linux-kernel, corbet, mhocko,
	dhowells, torvalds, will.deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On Thu, Nov 12, 2015 at 04:00:58PM +0100, Oleg Nesterov wrote:
> On 11/12, Boqun Feng wrote:
> >
> > On Wed, Nov 11, 2015 at 08:39:53PM +0100, Oleg Nesterov wrote:
> > >
> > > 	object_t *object;
> > > 	spinlock_t lock;
> > >
> > > 	void update(void)
> > > 	{
> > > 		object_t *o;
> > >
> > > 		spin_lock(&lock);
> > > 		o = READ_ONCE(object);
> > > 		if (o) {
> > > 			BUG_ON(o->dead);
> > > 			do_something(o);
> > > 		}
> > > 		spin_unlock(&lock);
> > > 	}
> > >
> > > 	void destroy(void) // can be called only once, can't race with itself
> > > 	{
> > > 		object_t *o;
> > >
> > > 		o = object;
> > > 		object = NULL;
> > >
> > > 		/*
> > > 		 * pairs with lock/ACQUIRE. The next update() must see
> > > 		 * object == NULL after spin_lock();
> > > 		 */
> > > 		smp_mb();
> > >
> > > 		spin_unlock_wait(&lock);
> > >
> > > 		/*
> > > 		 * pairs with unlock/RELEASE. The previous update() has
> > > 		 * already passed BUG_ON(o->dead).
> > > 		 *
> > > 		 * (Yes, yes, in this particular case it is not needed,
> > > 		 *  we can rely on the control dependency).
> > > 		 */
> > > 		smp_mb();
> > >
> > > 		o->dead = true;
> > > 	}
> > >
> > > I believe the code above is correct and it needs the barriers on both sides.
> > >
> >
> > Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock()
> > only guarantees that the memory operations following spin_lock() can't
> > be reorder before the *LOAD* part of spin_lock() not the *STORE* part,
> > i.e. the case below can happen(assuming the spin_lock() is implemented
> > as ll/sc loop)
> >
> > 	spin_lock(&lock):
> > 	  r1 = *lock; // LL, r1 == 0
> > 	o = READ_ONCE(object); // could be reordered here.
> > 	  *lock = 1; // SC
> >
> > This could happen because of the ACQUIRE semantics of spin_lock(), and
> > the current implementation of spin_lock() on PPC allows this happen.
> >
> > (Cc PPC maintainers for their opinions on this one)
> 
> In this case the code above is obviously wrong. And I do not understand
> how we can rely on spin_unlock_wait() then.
> 
> And afaics do_exit() is buggy too then, see below.
> 
> > I think it's OK for it as an ACQUIRE(with a proper barrier) or even just
> > a control dependency to pair with spin_unlock(), for example, the
> > following snippet in do_exit() is OK, except the smp_mb() is redundant,
> > unless I'm missing something subtle:
> >
> > 	/*
> > 	 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
> > 	 * when the following two conditions become true.
> > 	 *   - There is race condition of mmap_sem (It is acquired by
> > 	 *     exit_mm()), and
> > 	 *   - SMI occurs before setting TASK_RUNINNG.
> > 	 *     (or hypervisor of virtual machine switches to other guest)
> > 	 *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
> > 	 *
> > 	 * To avoid it, we have to wait for releasing tsk->pi_lock which
> > 	 * is held by try_to_wake_up()
> > 	 */
> > 	smp_mb();
> > 	raw_spin_unlock_wait(&tsk->pi_lock);
> 
> Perhaps it is me who missed something. But I don't think we can remove
> this mb(). And at the same time it can't help on PPC if I understand
> your explanation above correctly.

I cannot resist suggesting that any lock that interacts with
spin_unlock_wait() must have all relevant acquisitions followed by
smp_mb__after_unlock_lock().

							Thanx, Paul


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-12 14:40               ` Paul E. McKenney
@ 2015-11-12 14:49                 ` Boqun Feng
  2015-11-12 15:02                   ` Paul E. McKenney
  2015-11-12 14:50                 ` Peter Zijlstra
  1 sibling, 1 reply; 78+ messages in thread
From: Boqun Feng @ 2015-11-12 14:49 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Oleg Nesterov, Peter Zijlstra, mingo, linux-kernel, corbet,
	mhocko, dhowells, torvalds, will.deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

[-- Attachment #1: Type: text/plain, Size: 644 bytes --]

On Thu, Nov 12, 2015 at 06:40:04AM -0800, Paul E. McKenney wrote:
[snip]
> 
> I cannot resist suggesting that any lock that interacts with
> spin_unlock_wait() must have all relevant acquisitions followed by
> smp_mb__after_unlock_lock().
> 

But

1.	This would expand the purpose of smp_mb__after_unlock_lock(),
	right? smp_mb__after_unlock_lock() is for making UNLOCK-LOCK
	pair global transitive rather than guaranteeing no operations
	can be reorder before the STORE part of LOCK/ACQUIRE.

2.	If ARM64 has the same problem as PPC now,
	smp_mb__after_unlock_lock() can't help, as it's a no-op on
	ARM64.

Regards,
Boqun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-12 14:40               ` Paul E. McKenney
  2015-11-12 14:49                 ` Boqun Feng
@ 2015-11-12 14:50                 ` Peter Zijlstra
  2015-11-12 15:01                   ` Paul E. McKenney
  1 sibling, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2015-11-12 14:50 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Oleg Nesterov, Boqun Feng, mingo, linux-kernel, corbet, mhocko,
	dhowells, torvalds, will.deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On Thu, Nov 12, 2015 at 06:40:04AM -0800, Paul E. McKenney wrote:
> I cannot resist suggesting that any lock that interacts with
> spin_unlock_wait() must have all relevant acquisitions followed by
> smp_mb__after_unlock_lock().

Ha! that would certainly help here. But it would mean that argh64v8 also
needs to define that, even though that is already RCsc.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-12  7:14           ` Boqun Feng
  2015-11-12 10:28             ` Peter Zijlstra
@ 2015-11-12 15:00             ` Oleg Nesterov
  2015-11-12 14:40               ` Paul E. McKenney
  2015-11-12 15:18               ` Boqun Feng
  2015-11-12 18:21             ` Linus Torvalds
  2 siblings, 2 replies; 78+ messages in thread
From: Oleg Nesterov @ 2015-11-12 15:00 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, mingo, linux-kernel, paulmck, corbet, mhocko,
	dhowells, torvalds, will.deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On 11/12, Boqun Feng wrote:
>
> On Wed, Nov 11, 2015 at 08:39:53PM +0100, Oleg Nesterov wrote:
> >
> > 	object_t *object;
> > 	spinlock_t lock;
> >
> > 	void update(void)
> > 	{
> > 		object_t *o;
> >
> > 		spin_lock(&lock);
> > 		o = READ_ONCE(object);
> > 		if (o) {
> > 			BUG_ON(o->dead);
> > 			do_something(o);
> > 		}
> > 		spin_unlock(&lock);
> > 	}
> >
> > 	void destroy(void) // can be called only once, can't race with itself
> > 	{
> > 		object_t *o;
> >
> > 		o = object;
> > 		object = NULL;
> >
> > 		/*
> > 		 * pairs with lock/ACQUIRE. The next update() must see
> > 		 * object == NULL after spin_lock();
> > 		 */
> > 		smp_mb();
> >
> > 		spin_unlock_wait(&lock);
> >
> > 		/*
> > 		 * pairs with unlock/RELEASE. The previous update() has
> > 		 * already passed BUG_ON(o->dead).
> > 		 *
> > 		 * (Yes, yes, in this particular case it is not needed,
> > 		 *  we can rely on the control dependency).
> > 		 */
> > 		smp_mb();
> >
> > 		o->dead = true;
> > 	}
> >
> > I believe the code above is correct and it needs the barriers on both sides.
> >
>
> Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock()
> only guarantees that the memory operations following spin_lock() can't
> be reorder before the *LOAD* part of spin_lock() not the *STORE* part,
> i.e. the case below can happen(assuming the spin_lock() is implemented
> as ll/sc loop)
>
> 	spin_lock(&lock):
> 	  r1 = *lock; // LL, r1 == 0
> 	o = READ_ONCE(object); // could be reordered here.
> 	  *lock = 1; // SC
>
> This could happen because of the ACQUIRE semantics of spin_lock(), and
> the current implementation of spin_lock() on PPC allows this happen.
>
> (Cc PPC maintainers for their opinions on this one)

In this case the code above is obviously wrong. And I do not understand
how we can rely on spin_unlock_wait() then.

And afaics do_exit() is buggy too then, see below.

> I think it's OK for it as an ACQUIRE(with a proper barrier) or even just
> a control dependency to pair with spin_unlock(), for example, the
> following snippet in do_exit() is OK, except the smp_mb() is redundant,
> unless I'm missing something subtle:
>
> 	/*
> 	 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
> 	 * when the following two conditions become true.
> 	 *   - There is race condition of mmap_sem (It is acquired by
> 	 *     exit_mm()), and
> 	 *   - SMI occurs before setting TASK_RUNINNG.
> 	 *     (or hypervisor of virtual machine switches to other guest)
> 	 *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
> 	 *
> 	 * To avoid it, we have to wait for releasing tsk->pi_lock which
> 	 * is held by try_to_wake_up()
> 	 */
> 	smp_mb();
> 	raw_spin_unlock_wait(&tsk->pi_lock);

Perhaps it is me who missed something. But I don't think we can remove
this mb(). And at the same time it can't help on PPC if I understand
your explanation above correctly.

To simplify, lets ignore exit_mm/down_read/etc. The exiting task does


	current->state = TASK_UNINTERRUPTIBLE;
	// without schedule() in between
	current->state = TASK_RUNNING;

	smp_mb();
	spin_unlock_wait(pi_lock);

	current->state = TASK_DEAD;
	schedule();

and we need to ensure that if we race with try_to_wake_up(TASK_UNINTERRUPTIBLE)
it can't change TASK_DEAD back to RUNNING.

Without smp_mb() this can be reordered, spin_unlock_wait(pi_locked) can
read the old "unlocked" state of pi_lock before we set UNINTERRUPTIBLE,
so in fact we could have

	current->state = TASK_UNINTERRUPTIBLE;
	
	spin_unlock_wait(pi_lock);

	current->state = TASK_RUNNING;

	current->state = TASK_DEAD;

and this can obviously race with ttwu() which can take pi_lock and see
state == TASK_UNINTERRUPTIBLE after spin_unlock_wait().

And, if I understand you correctly, this smp_mb() can't help on PPC.
try_to_wake_up() can read task->state before it writes to *pi_lock.
To me this doesn't really differ from the code above,

	CPU 1 (do_exit)				CPU_2 (ttwu)

						spin_lock(pi_lock):
						  r1 = *pi_lock; // r1 == 0;
	p->state = TASK_UNINTERRUPTIBLE;
						state = p->state;
	p->state = TASK_RUNNING;
	mb();
	spin_unlock_wait();
						*pi_lock = 1;

	p->state = TASK_DEAD;
						if (state & TASK_UNINTERRUPTIBLE) // true
							p->state = RUNNING;

No?

And smp_mb__before_spinlock() looks wrong too then.

Oleg.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-12 14:50                 ` Peter Zijlstra
@ 2015-11-12 15:01                   ` Paul E. McKenney
  2015-11-12 15:08                     ` Peter Zijlstra
  0 siblings, 1 reply; 78+ messages in thread
From: Paul E. McKenney @ 2015-11-12 15:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Boqun Feng, mingo, linux-kernel, corbet, mhocko,
	dhowells, torvalds, will.deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On Thu, Nov 12, 2015 at 03:50:13PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 12, 2015 at 06:40:04AM -0800, Paul E. McKenney wrote:
> > I cannot resist suggesting that any lock that interacts with
> > spin_unlock_wait() must have all relevant acquisitions followed by
> > smp_mb__after_unlock_lock().
> 
> Ha! that would certainly help here. But it would mean that argh64v8 also
> needs to define that, even though that is already RCsc.

Maybe.  It could also be that arm64 avoids the need somehow, for example
via their RCsc behavior.  Their memory model is similar to PPC, but not
exactly the same.

Will?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-12 14:49                 ` Boqun Feng
@ 2015-11-12 15:02                   ` Paul E. McKenney
  2015-11-12 21:53                     ` Will Deacon
  0 siblings, 1 reply; 78+ messages in thread
From: Paul E. McKenney @ 2015-11-12 15:02 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Oleg Nesterov, Peter Zijlstra, mingo, linux-kernel, corbet,
	mhocko, dhowells, torvalds, will.deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On Thu, Nov 12, 2015 at 10:49:02PM +0800, Boqun Feng wrote:
> On Thu, Nov 12, 2015 at 06:40:04AM -0800, Paul E. McKenney wrote:
> [snip]
> > 
> > I cannot resist suggesting that any lock that interacts with
> > spin_unlock_wait() must have all relevant acquisitions followed by
> > smp_mb__after_unlock_lock().
> > 
> 
> But
> 
> 1.	This would expand the purpose of smp_mb__after_unlock_lock(),
> 	right? smp_mb__after_unlock_lock() is for making UNLOCK-LOCK
> 	pair global transitive rather than guaranteeing no operations
> 	can be reorder before the STORE part of LOCK/ACQUIRE.

Indeed it would.  Which might be OK.

> 2.	If ARM64 has the same problem as PPC now,
> 	smp_mb__after_unlock_lock() can't help, as it's a no-op on
> 	ARM64.

Agreed, and that is why we need Will to weigh in.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-12 15:01                   ` Paul E. McKenney
@ 2015-11-12 15:08                     ` Peter Zijlstra
  2015-11-12 15:20                       ` Paul E. McKenney
  0 siblings, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2015-11-12 15:08 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Oleg Nesterov, Boqun Feng, mingo, linux-kernel, corbet, mhocko,
	dhowells, torvalds, will.deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On Thu, Nov 12, 2015 at 07:01:49AM -0800, Paul E. McKenney wrote:
> On Thu, Nov 12, 2015 at 03:50:13PM +0100, Peter Zijlstra wrote:
> > On Thu, Nov 12, 2015 at 06:40:04AM -0800, Paul E. McKenney wrote:
> > > I cannot resist suggesting that any lock that interacts with
> > > spin_unlock_wait() must have all relevant acquisitions followed by
> > > smp_mb__after_unlock_lock().
> > 
> > Ha! that would certainly help here. But it would mean that argh64v8 also
> > needs to define that, even though that is already RCsc.
> 
> Maybe.  It could also be that arm64 avoids the need somehow, for example
> via their RCsc behavior.  Their memory model is similar to PPC, but not
> exactly the same.
> 
> Will?

So when I spoke to Will earlier today, we agreed that LDAXR+STXR is
susceptible to the same problem. The STXR will allow loads to pass up
over that store.

On v8.1, which is using LDADDA, this is not an issue, for as the ACQUIRE
is part of the LOAD, the Read-Modify-Write is indivisible as a whole,
and therefore a subsequent load has nothing to pass over.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-12 15:00             ` Oleg Nesterov
  2015-11-12 14:40               ` Paul E. McKenney
@ 2015-11-12 15:18               ` Boqun Feng
  2015-11-12 18:38                 ` Oleg Nesterov
  1 sibling, 1 reply; 78+ messages in thread
From: Boqun Feng @ 2015-11-12 15:18 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, mingo, linux-kernel, paulmck, corbet, mhocko,
	dhowells, torvalds, will.deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

[-- Attachment #1: Type: text/plain, Size: 4139 bytes --]

On Thu, Nov 12, 2015 at 04:00:58PM +0100, Oleg Nesterov wrote:
> On 11/12, Boqun Feng wrote:
[snip]
> >
> > Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock()
> > only guarantees that the memory operations following spin_lock() can't
> > be reorder before the *LOAD* part of spin_lock() not the *STORE* part,
> > i.e. the case below can happen(assuming the spin_lock() is implemented
> > as ll/sc loop)
> >
> > 	spin_lock(&lock):
> > 	  r1 = *lock; // LL, r1 == 0
> > 	o = READ_ONCE(object); // could be reordered here.
> > 	  *lock = 1; // SC
> >
> > This could happen because of the ACQUIRE semantics of spin_lock(), and
> > the current implementation of spin_lock() on PPC allows this happen.
> >
> > (Cc PPC maintainers for their opinions on this one)
> 
> In this case the code above is obviously wrong. And I do not understand
> how we can rely on spin_unlock_wait() then.
> 
> And afaics do_exit() is buggy too then, see below.
> 
> > I think it's OK for it as an ACQUIRE(with a proper barrier) or even just
> > a control dependency to pair with spin_unlock(), for example, the
> > following snippet in do_exit() is OK, except the smp_mb() is redundant,
> > unless I'm missing something subtle:
> >
> > 	/*
> > 	 * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
> > 	 * when the following two conditions become true.
> > 	 *   - There is race condition of mmap_sem (It is acquired by
> > 	 *     exit_mm()), and
> > 	 *   - SMI occurs before setting TASK_RUNINNG.
> > 	 *     (or hypervisor of virtual machine switches to other guest)
> > 	 *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
> > 	 *
> > 	 * To avoid it, we have to wait for releasing tsk->pi_lock which
> > 	 * is held by try_to_wake_up()
> > 	 */
> > 	smp_mb();
> > 	raw_spin_unlock_wait(&tsk->pi_lock);
> 
> Perhaps it is me who missed something. But I don't think we can remove
> this mb(). And at the same time it can't help on PPC if I understand

You are right, we need this smp_mb() to order the previous STORE of
->state with the LOAD of ->pi_lock. I missed that part because I saw all
the explicit STOREs of ->state in do_exit() are set_current_state()
which has a smp_mb() following the STOREs.

> your explanation above correctly.
> 
> To simplify, lets ignore exit_mm/down_read/etc. The exiting task does
> 
> 
> 	current->state = TASK_UNINTERRUPTIBLE;
> 	// without schedule() in between
> 	current->state = TASK_RUNNING;
> 
> 	smp_mb();
> 	spin_unlock_wait(pi_lock);
> 
> 	current->state = TASK_DEAD;
> 	schedule();
> 
> and we need to ensure that if we race with try_to_wake_up(TASK_UNINTERRUPTIBLE)
> it can't change TASK_DEAD back to RUNNING.
> 
> Without smp_mb() this can be reordered, spin_unlock_wait(pi_locked) can
> read the old "unlocked" state of pi_lock before we set UNINTERRUPTIBLE,
> so in fact we could have
> 
> 	current->state = TASK_UNINTERRUPTIBLE;
> 	
> 	spin_unlock_wait(pi_lock);
> 
> 	current->state = TASK_RUNNING;
> 
> 	current->state = TASK_DEAD;
> 
> and this can obviously race with ttwu() which can take pi_lock and see
> state == TASK_UNINTERRUPTIBLE after spin_unlock_wait().
> 

Yep, my mistake ;-)

> And, if I understand you correctly, this smp_mb() can't help on PPC.
> try_to_wake_up() can read task->state before it writes to *pi_lock.
> To me this doesn't really differ from the code above,
> 
> 	CPU 1 (do_exit)				CPU_2 (ttwu)
> 
> 						spin_lock(pi_lock):
> 						  r1 = *pi_lock; // r1 == 0;
> 	p->state = TASK_UNINTERRUPTIBLE;
> 						state = p->state;
> 	p->state = TASK_RUNNING;
> 	mb();
> 	spin_unlock_wait();
> 						*pi_lock = 1;
> 
> 	p->state = TASK_DEAD;
> 						if (state & TASK_UNINTERRUPTIBLE) // true
> 							p->state = RUNNING;
> 
> No?
> 

do_exit() is surely buggy if spin_lock() could work in this way.

> And smp_mb__before_spinlock() looks wrong too then.
> 

Maybe not? As smp_mb__before_spinlock() is used before a LOCK operation,
which has both LOAD part and STORE part unlike spin_unlock_wait()?

> Oleg.
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-12 15:08                     ` Peter Zijlstra
@ 2015-11-12 15:20                       ` Paul E. McKenney
  2015-11-12 21:25                         ` Will Deacon
  0 siblings, 1 reply; 78+ messages in thread
From: Paul E. McKenney @ 2015-11-12 15:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Boqun Feng, mingo, linux-kernel, corbet, mhocko,
	dhowells, torvalds, will.deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On Thu, Nov 12, 2015 at 04:08:22PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 12, 2015 at 07:01:49AM -0800, Paul E. McKenney wrote:
> > On Thu, Nov 12, 2015 at 03:50:13PM +0100, Peter Zijlstra wrote:
> > > On Thu, Nov 12, 2015 at 06:40:04AM -0800, Paul E. McKenney wrote:
> > > > I cannot resist suggesting that any lock that interacts with
> > > > spin_unlock_wait() must have all relevant acquisitions followed by
> > > > smp_mb__after_unlock_lock().
> > > 
> > > Ha! that would certainly help here. But it would mean that argh64v8 also
> > > needs to define that, even though that is already RCsc.
> > 
> > Maybe.  It could also be that arm64 avoids the need somehow, for example
> > via their RCsc behavior.  Their memory model is similar to PPC, but not
> > exactly the same.
> > 
> > Will?
> 
> So when I spoke to Will earlier today, we agreed that LDAXR+STXR is
> susceptible to the same problem. The STXR will allow loads to pass up
> over that store.
> 
> On v8.1, which is using LDADDA, this is not an issue, for as the ACQUIRE
> is part of the LOAD, the Read-Modify-Write is indivisible as a whole,
> and therefore a subsequent load has nothing to pass over.

So one approach required for one level of hardware and another for the
next level.  I can relate to that all too well...  :-/

							Thanx, Paul


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-12 18:38                 ` Oleg Nesterov
@ 2015-11-12 18:02                   ` Peter Zijlstra
  2015-11-12 19:33                     ` Oleg Nesterov
  0 siblings, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2015-11-12 18:02 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Boqun Feng, mingo, linux-kernel, paulmck, corbet, mhocko,
	dhowells, torvalds, will.deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On Thu, Nov 12, 2015 at 07:38:07PM +0100, Oleg Nesterov wrote:
> It seems that PPC needs to define smp_mb__before_spinlock() as full mb(),
> no?

It does:

arch/powerpc/include/asm/barrier.h:#define smp_mb__before_spinlock() smp_mb()


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-12  7:14           ` Boqun Feng
  2015-11-12 10:28             ` Peter Zijlstra
  2015-11-12 15:00             ` Oleg Nesterov
@ 2015-11-12 18:21             ` Linus Torvalds
  2015-11-12 22:09               ` Will Deacon
  2015-11-16 15:56               ` Peter Zijlstra
  2 siblings, 2 replies; 78+ messages in thread
From: Linus Torvalds @ 2015-11-12 18:21 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Oleg Nesterov, Peter Zijlstra, Ingo Molnar,
	Linux Kernel Mailing List, Paul McKenney, Jonathan Corbet,
	Michal Hocko, David Howells, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On Wed, Nov 11, 2015 at 11:14 PM, Boqun Feng <boqun.feng@gmail.com> wrote:
>
> Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock()
> only guarantees that the memory operations following spin_lock() can't
> be reorder before the *LOAD* part of spin_lock() not the *STORE* part,
> i.e. the case below can happen(assuming the spin_lock() is implemented
> as ll/sc loop)
>
>         spin_lock(&lock):
>           r1 = *lock; // LL, r1 == 0
>         o = READ_ONCE(object); // could be reordered here.
>           *lock = 1; // SC

It may be worth noting that at least in theory, not only reads may
pass the store. If the spin-lock is done as

        r1 = *lock   // read-acquire, r1 == 0
        *lock = 1    // SC

then even *writes* inside the locked region might pass up through the
"*lock = 1".

So other CPU's - that haven't taken the spinlock - could see the
modifications inside the critical region before they actually see the
lock itself change.

Now, the point of spin_unlock_wait() (and "spin_is_locked()") should
generally be that you have some external ordering guarantee that
guarantees that the lock has been taken. For example, for the IPC
semaphores, we do either one of:

 (a) get large lock, then - once you hold that lock - wait for each small lock

or

 (b) get small lock, then - once you hold that lock - check that the
largo lock is unlocked

and that's the case we should really worry about.  The other uses of
spin_unlock_wait() should have similar "I have other reasons to know
I've seen that the lock was taken, or will never be taken after this
because XYZ".

This is why powerpc has a memory barrier in "arch_spin_is_locked()".
Exactly so that the "check that the other lock is unlocked" is
guaranteed to be ordered wrt the store that gets the first lock.

It looks like ARM64 gets this wrong and is fundamentally buggy wrt
"spin_is_locked()" (and, as a result, "spin_unlock_wait()").

BUT! And this is a bug BUT:

It should be noted that that is purely an ARM64 bug. Not a bug in our
users. If you have a spinlock where the "get lock write" part of the
lock can be delayed, then you have to have a "arch_spin_is_locked()"
that has the proper memory barriers.

Of course, ARM still hides their architecture manuals in odd places,
so I can't double-check. But afaik, ARM64 store-conditional is "store
exclusive with release", and it has only release semantics, and ARM64
really does have the above bug.

On that note: can anybody point me to the latest ARM64 8.1
architecture manual in pdf form, without the "you have to register"
crap? I thought ARM released it, but all my googling just points to
the idiotic ARM service center that wants me to sign away something
just to see the docs. Which I don't do.

                                   Linus

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-12 15:18               ` Boqun Feng
@ 2015-11-12 18:38                 ` Oleg Nesterov
  2015-11-12 18:02                   ` Peter Zijlstra
  0 siblings, 1 reply; 78+ messages in thread
From: Oleg Nesterov @ 2015-11-12 18:38 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, mingo, linux-kernel, paulmck, corbet, mhocko,
	dhowells, torvalds, will.deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On 11/12, Boqun Feng wrote:
>
> On Thu, Nov 12, 2015 at 04:00:58PM +0100, Oleg Nesterov wrote:
> >
> > No?
> >
>
> do_exit() is surely buggy if spin_lock() could work in this way.

OK, good ;) so we need to fix spin_lock() on PPC ? Or add
mb__after_unlock_lock() but this leads to other questions... Or
change do_exit() to do lock() + unlock().

>
> > And smp_mb__before_spinlock() looks wrong too then.
> >
>
> Maybe not? As smp_mb__before_spinlock() is used before a LOCK operation,
> which has both LOAD part and STORE part unlike spin_unlock_wait()?

Maybe not.

But let me remind that the original purpose of this mb__before_spinlock()
was to ensure that "CONDITION = true" before ttwu() can not be reordered
with

	if (!(p->state & state))
		goto out;		// do not wakeup


inside try_to_wake_up(). Otherwise

	CONDITION = true;
	try_to_wake_up(p);

can race with "p" doing

	set_current_state(...);	// implies mb();
	if (CONDITION)
		return;
	schedule();

because try_to_wake_up() can read p->state before it sets CONDITION = 1
and then it won't wakeup "p" which has already checked this CONDITION.


Now. If try_to_wake_up() can read p->state before it writes to *pi_lock,
then how smp_mb__before_spinlock() == wmb() can help to serialize STORE
and LOAD?

It seems that PPC needs to define smp_mb__before_spinlock() as full mb(),
no?

Oleg.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-12 19:33                     ` Oleg Nesterov
@ 2015-11-12 18:59                       ` Paul E. McKenney
  2015-11-12 21:33                         ` Will Deacon
  0 siblings, 1 reply; 78+ messages in thread
From: Paul E. McKenney @ 2015-11-12 18:59 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Boqun Feng, mingo, linux-kernel, corbet, mhocko,
	dhowells, torvalds, will.deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On Thu, Nov 12, 2015 at 08:33:02PM +0100, Oleg Nesterov wrote:
> On 11/12, Peter Zijlstra wrote:
> >
> > On Thu, Nov 12, 2015 at 07:38:07PM +0100, Oleg Nesterov wrote:
> > > It seems that PPC needs to define smp_mb__before_spinlock() as full mb(),
> > > no?
> >
> > It does:
> >
> > arch/powerpc/include/asm/barrier.h:#define smp_mb__before_spinlock() smp_mb()
> 
> Ah, indeed, thanks.
> 
> And given that it also defines smp_mb__after_unlock_lock() as smp_mb(),
> I am starting to understand how it can help to avoid the races with
> spin_unlock_wait() in (for example) do_exit().
> 
> But as Boqun has already mentioned, this means that mb__after_unlock_lock()
> has the new meaning which should be documented.
> 
> Hmm. And 12d560f4 "Privatize smp_mb__after_unlock_lock()" should be reverted
> then ;)

Surprisingly, this reverts cleanly against today's mainline, please see
the patch below.  Against my -rcu stack, not so much, but so it goes.  ;-)

							Thanx, Paul

------------------------------------------------------------------------

commit eff0632b4181f91f2596d56f7c73194e1a869aff
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Thu Nov 12 10:54:23 2015 -0800

    Revert "rcu,locking: Privatize smp_mb__after_unlock_lock()"
    
    This reverts commit 12d560f4ea87030667438a169912380be00cea4b.
    
    The reason for this revert is that smp_mb__after_unlock_lock() might
    prove useful outside of RCU after all for interactions between
    the locking primitives and spin_unlock_wait().

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index aef9487303d0..d4501664d49f 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1855,10 +1855,16 @@ RELEASE are to the same lock variable, but only from the perspective of
 another CPU not holding that lock.  In short, a ACQUIRE followed by an
 RELEASE may -not- be assumed to be a full memory barrier.
 
-Similarly, the reverse case of a RELEASE followed by an ACQUIRE does
-not imply a full memory barrier.  Therefore, the CPU's execution of the
-critical sections corresponding to the RELEASE and the ACQUIRE can cross,
-so that:
+Similarly, the reverse case of a RELEASE followed by an ACQUIRE does not
+imply a full memory barrier.  If it is necessary for a RELEASE-ACQUIRE
+pair to produce a full barrier, the ACQUIRE can be followed by an
+smp_mb__after_unlock_lock() invocation.  This will produce a full barrier
+(including transitivity) if either (a) the RELEASE and the ACQUIRE are
+executed by the same CPU or task, or (b) the RELEASE and ACQUIRE act on
+the same variable.  The smp_mb__after_unlock_lock() primitive is free
+on many architectures.  Without smp_mb__after_unlock_lock(), the CPU's
+execution of the critical sections corresponding to the RELEASE and the
+ACQUIRE can cross, so that:
 
 	*A = a;
 	RELEASE M
@@ -1896,6 +1902,29 @@ the RELEASE would simply complete, thereby avoiding the deadlock.
 	a sleep-unlock race, but the locking primitive needs to resolve
 	such races properly in any case.
 
+With smp_mb__after_unlock_lock(), the two critical sections cannot overlap.
+For example, with the following code, the store to *A will always be
+seen by other CPUs before the store to *B:
+
+	*A = a;
+	RELEASE M
+	ACQUIRE N
+	smp_mb__after_unlock_lock();
+	*B = b;
+
+The operations will always occur in one of the following orders:
+
+	STORE *A, RELEASE, ACQUIRE, smp_mb__after_unlock_lock(), STORE *B
+	STORE *A, ACQUIRE, RELEASE, smp_mb__after_unlock_lock(), STORE *B
+	ACQUIRE, STORE *A, RELEASE, smp_mb__after_unlock_lock(), STORE *B
+
+If the RELEASE and ACQUIRE were instead both operating on the same lock
+variable, only the first of these alternatives can occur.  In addition,
+the more strongly ordered systems may rule out some of the above orders.
+But in any case, as noted earlier, the smp_mb__after_unlock_lock()
+ensures that the store to *A will always be seen as happening before
+the store to *B.
+
 Locks and semaphores may not provide any guarantee of ordering on UP compiled
 systems, and so cannot be counted on in such a situation to actually achieve
 anything at all - especially with respect to I/O accesses - unless combined
@@ -2126,6 +2155,40 @@ But it won't see any of:
 	*E, *F or *G following RELEASE Q
 
 
+However, if the following occurs:
+
+	CPU 1				CPU 2
+	===============================	===============================
+	WRITE_ONCE(*A, a);
+	ACQUIRE M		     [1]
+	WRITE_ONCE(*B, b);
+	WRITE_ONCE(*C, c);
+	RELEASE M	     [1]
+	WRITE_ONCE(*D, d);		WRITE_ONCE(*E, e);
+					ACQUIRE M		     [2]
+					smp_mb__after_unlock_lock();
+					WRITE_ONCE(*F, f);
+					WRITE_ONCE(*G, g);
+					RELEASE M	     [2]
+					WRITE_ONCE(*H, h);
+
+CPU 3 might see:
+
+	*E, ACQUIRE M [1], *C, *B, *A, RELEASE M [1],
+		ACQUIRE M [2], *H, *F, *G, RELEASE M [2], *D
+
+But assuming CPU 1 gets the lock first, CPU 3 won't see any of:
+
+	*B, *C, *D, *F, *G or *H preceding ACQUIRE M [1]
+	*A, *B or *C following RELEASE M [1]
+	*F, *G or *H preceding ACQUIRE M [2]
+	*A, *B, *C, *E, *F or *G following RELEASE M [2]
+
+Note that the smp_mb__after_unlock_lock() is critically important
+here: Without it CPU 3 might see some of the above orderings.
+Without smp_mb__after_unlock_lock(), the accesses are not guaranteed
+to be seen in order unless CPU 3 holds lock M.
+
 
 ACQUIRES VS I/O ACCESSES
 ------------------------
diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index 523673d7583c..4dbe072eecbe 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -28,6 +28,8 @@
 #include <asm/synch.h>
 #include <asm/ppc-opcode.h>
 
+#define smp_mb__after_unlock_lock()	smp_mb()  /* Full ordering for lock. */
+
 #ifdef CONFIG_PPC64
 /* use 0x800000yy when locked, where yy == CPU number */
 #ifdef __BIG_ENDIAN__
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 47dd0cebd204..ffcd053ca89a 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -130,6 +130,16 @@ do {								\
 #define smp_mb__before_spinlock()	smp_wmb()
 #endif
 
+/*
+ * Place this after a lock-acquisition primitive to guarantee that
+ * an UNLOCK+LOCK pair act as a full barrier.  This guarantee applies
+ * if the UNLOCK and LOCK are executed by the same CPU or if the
+ * UNLOCK and LOCK operate on the same lock variable.
+ */
+#ifndef smp_mb__after_unlock_lock
+#define smp_mb__after_unlock_lock()	do { } while (0)
+#endif
+
 /**
  * raw_spin_unlock_wait - wait until the spinlock gets unlocked
  * @lock: the spinlock in question.
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 9fb4e238d4dc..8c6753d903ec 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -652,15 +652,3 @@ static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long *qll)
 #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
 }
 #endif /* #ifdef CONFIG_RCU_TRACE */
-
-/*
- * Place this after a lock-acquisition primitive to guarantee that
- * an UNLOCK+LOCK pair act as a full barrier.  This guarantee applies
- * if the UNLOCK and LOCK are executed by the same CPU or if the
- * UNLOCK and LOCK operate on the same lock variable.
- */
-#ifdef CONFIG_PPC
-#define smp_mb__after_unlock_lock()	smp_mb()  /* Full ordering for lock. */
-#else /* #ifdef CONFIG_PPC */
-#define smp_mb__after_unlock_lock()	do { } while (0)
-#endif /* #else #ifdef CONFIG_PPC */


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-12 18:02                   ` Peter Zijlstra
@ 2015-11-12 19:33                     ` Oleg Nesterov
  2015-11-12 18:59                       ` Paul E. McKenney
  0 siblings, 1 reply; 78+ messages in thread
From: Oleg Nesterov @ 2015-11-12 19:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Boqun Feng, mingo, linux-kernel, paulmck, corbet, mhocko,
	dhowells, torvalds, will.deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On 11/12, Peter Zijlstra wrote:
>
> On Thu, Nov 12, 2015 at 07:38:07PM +0100, Oleg Nesterov wrote:
> > It seems that PPC needs to define smp_mb__before_spinlock() as full mb(),
> > no?
>
> It does:
>
> arch/powerpc/include/asm/barrier.h:#define smp_mb__before_spinlock() smp_mb()

Ah, indeed, thanks.

And given that it also defines smp_mb__after_unlock_lock() as smp_mb(),
I am starting to understand how it can help to avoid the races with
spin_unlock_wait() in (for example) do_exit().

But as Boqun has already mentioned, this means that mb__after_unlock_lock()
has the new meaning which should be documented.

Hmm. And 12d560f4 "Privatize smp_mb__after_unlock_lock()" should be reverted
then ;)

Oleg.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-12 15:20                       ` Paul E. McKenney
@ 2015-11-12 21:25                         ` Will Deacon
  0 siblings, 0 replies; 78+ messages in thread
From: Will Deacon @ 2015-11-12 21:25 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Oleg Nesterov, Boqun Feng, mingo, linux-kernel,
	corbet, mhocko, dhowells, torvalds, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

[sorry for the late reply, I'm away from my desk until Monday since I'm
 busy with family issues]

On Thu, Nov 12, 2015 at 07:20:42AM -0800, Paul E. McKenney wrote:
> On Thu, Nov 12, 2015 at 04:08:22PM +0100, Peter Zijlstra wrote:
> > On Thu, Nov 12, 2015 at 07:01:49AM -0800, Paul E. McKenney wrote:
> > > On Thu, Nov 12, 2015 at 03:50:13PM +0100, Peter Zijlstra wrote:
> > > > On Thu, Nov 12, 2015 at 06:40:04AM -0800, Paul E. McKenney wrote:
> > > > > I cannot resist suggesting that any lock that interacts with
> > > > > spin_unlock_wait() must have all relevant acquisitions followed by
> > > > > smp_mb__after_unlock_lock().
> > > > 
> > > > Ha! that would certainly help here. But it would mean that argh64v8 also
> > > > needs to define that, even though that is already RCsc.
> > > 
> > > Maybe.  It could also be that arm64 avoids the need somehow, for example
> > > via their RCsc behavior.  Their memory model is similar to PPC, but not
> > > exactly the same.
> > > 
> > > Will?
> > 
> > So when I spoke to Will earlier today, we agreed that LDAXR+STXR is
> > susceptible to the same problem. The STXR will allow loads to pass up
> > over that store.
> > 
> > On v8.1, which is using LDADDA, this is not an issue, for as the ACQUIRE
> > is part of the LOAD, the Read-Modify-Write is indivisible as a whole,
> > and therefore a subsequent load has nothing to pass over.
> 
> So one approach required for one level of hardware and another for the
> next level.  I can relate to that all too well...  :-/

Just to confirm, Peter's correct in that Boqun's litmus test is permitted
by the arm64 architecture when the ll/sc spinlock definitions are in use.

However, I don't think that strengthening smp_mb__after_unlock_lock is
the right way to solve this. I'll reply to the other part of the thread...

Will

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-12 18:59                       ` Paul E. McKenney
@ 2015-11-12 21:33                         ` Will Deacon
  2015-11-12 23:43                           ` Paul E. McKenney
  0 siblings, 1 reply; 78+ messages in thread
From: Will Deacon @ 2015-11-12 21:33 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Oleg Nesterov, Peter Zijlstra, Boqun Feng, mingo, linux-kernel,
	corbet, mhocko, dhowells, torvalds, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On Thu, Nov 12, 2015 at 10:59:06AM -0800, Paul E. McKenney wrote:
> On Thu, Nov 12, 2015 at 08:33:02PM +0100, Oleg Nesterov wrote:
> > On 11/12, Peter Zijlstra wrote:
> > >
> > > On Thu, Nov 12, 2015 at 07:38:07PM +0100, Oleg Nesterov wrote:
> > > > It seems that PPC needs to define smp_mb__before_spinlock() as full mb(),
> > > > no?
> > >
> > > It does:
> > >
> > > arch/powerpc/include/asm/barrier.h:#define smp_mb__before_spinlock() smp_mb()
> > 
> > Ah, indeed, thanks.
> > 
> > And given that it also defines smp_mb__after_unlock_lock() as smp_mb(),
> > I am starting to understand how it can help to avoid the races with
> > spin_unlock_wait() in (for example) do_exit().
> > 
> > But as Boqun has already mentioned, this means that mb__after_unlock_lock()
> > has the new meaning which should be documented.
> > 
> > Hmm. And 12d560f4 "Privatize smp_mb__after_unlock_lock()" should be reverted
> > then ;)
> 
> Surprisingly, this reverts cleanly against today's mainline, please see
> the patch below.  Against my -rcu stack, not so much, but so it goes.  ;-)

I think we ended up concluding that smp_mb__after_unlock_lock is indeed
required, but I don't think we should just resurrect the old definition,
which doesn't keep UNLOCK -> LOCK distinct from RELEASE -> ACQUIRE. I'm
still working on documenting the different types of transitivity that we
identified in that thread, but it's slow going.

Also, as far as spin_unlock_wait is concerned, it is neither a LOCK or
an UNLOCK and this barrier doesn't offer us anything. Sure, it might
work because PPC defines it as smp_mb(), but it doesn't help on arm64
and defining the macro is overkill for us in most places (i.e. RCU).

If we decide that the current usage of spin_unlock_wait is valid, then I
would much rather implement a version of it in the arm64 backend that
does something like:

 1:  ldrex r1, [&lock]
     if r1 indicates that lock is taken, branch back to 1b
     strex r1, [&lock]
     if store failed, branch back to 1b

i.e. we don't just test the lock, but we also write it back atomically
if we discover that it's free. That would then clear the exclusive monitor
on any cores in the process of taking the lock and restore the ordering
that we need.

Will

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-12 15:02                   ` Paul E. McKenney
@ 2015-11-12 21:53                     ` Will Deacon
  0 siblings, 0 replies; 78+ messages in thread
From: Will Deacon @ 2015-11-12 21:53 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Boqun Feng, Oleg Nesterov, Peter Zijlstra, mingo, linux-kernel,
	corbet, mhocko, dhowells, torvalds, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On Thu, Nov 12, 2015 at 07:02:51AM -0800, Paul E. McKenney wrote:
> On Thu, Nov 12, 2015 at 10:49:02PM +0800, Boqun Feng wrote:
> > On Thu, Nov 12, 2015 at 06:40:04AM -0800, Paul E. McKenney wrote:
> > [snip]
> > > 
> > > I cannot resist suggesting that any lock that interacts with
> > > spin_unlock_wait() must have all relevant acquisitions followed by
> > > smp_mb__after_unlock_lock().
> > > 
> > 
> > But
> > 
> > 1.	This would expand the purpose of smp_mb__after_unlock_lock(),
> > 	right? smp_mb__after_unlock_lock() is for making UNLOCK-LOCK
> > 	pair global transitive rather than guaranteeing no operations
> > 	can be reorder before the STORE part of LOCK/ACQUIRE.
> 
> Indeed it would.  Which might be OK.
> 
> > 2.	If ARM64 has the same problem as PPC now,
> > 	smp_mb__after_unlock_lock() can't help, as it's a no-op on
> > 	ARM64.
> 
> Agreed, and that is why we need Will to weigh in.

I really don't want to implement smp_mb__after_unlock_lock, because we
don't need it based on its current definition and I think there's a better
way to fix spin_unlock_wait (see my other post).

Will

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-12 18:21             ` Linus Torvalds
@ 2015-11-12 22:09               ` Will Deacon
  2015-11-16 15:56               ` Peter Zijlstra
  1 sibling, 0 replies; 78+ messages in thread
From: Will Deacon @ 2015-11-12 22:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Boqun Feng, Oleg Nesterov, Peter Zijlstra, Ingo Molnar,
	Linux Kernel Mailing List, Paul McKenney, Jonathan Corbet,
	Michal Hocko, David Howells, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On Thu, Nov 12, 2015 at 10:21:39AM -0800, Linus Torvalds wrote:
> On Wed, Nov 11, 2015 at 11:14 PM, Boqun Feng <boqun.feng@gmail.com> wrote:
> >
> > Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock()
> > only guarantees that the memory operations following spin_lock() can't
> > be reorder before the *LOAD* part of spin_lock() not the *STORE* part,
> > i.e. the case below can happen(assuming the spin_lock() is implemented
> > as ll/sc loop)
> >
> >         spin_lock(&lock):
> >           r1 = *lock; // LL, r1 == 0
> >         o = READ_ONCE(object); // could be reordered here.
> >           *lock = 1; // SC
> 
> It may be worth noting that at least in theory, not only reads may
> pass the store. If the spin-lock is done as
> 
>         r1 = *lock   // read-acquire, r1 == 0
>         *lock = 1    // SC
> 
> then even *writes* inside the locked region might pass up through the
> "*lock = 1".

Right, but only if we didn't have the control dependency to branch back
in the case that the SC failed. In that case, the lock would be broken
because you'd dive into the critical section even if you failed to take
the lock.

> So other CPU's - that haven't taken the spinlock - could see the
> modifications inside the critical region before they actually see the
> lock itself change.
> 
> Now, the point of spin_unlock_wait() (and "spin_is_locked()") should
> generally be that you have some external ordering guarantee that
> guarantees that the lock has been taken. For example, for the IPC
> semaphores, we do either one of:
> 
>  (a) get large lock, then - once you hold that lock - wait for each small lock
> 
> or
> 
>  (b) get small lock, then - once you hold that lock - check that the
> largo lock is unlocked
> 
> and that's the case we should really worry about.  The other uses of
> spin_unlock_wait() should have similar "I have other reasons to know
> I've seen that the lock was taken, or will never be taken after this
> because XYZ".
> 
> This is why powerpc has a memory barrier in "arch_spin_is_locked()".
> Exactly so that the "check that the other lock is unlocked" is
> guaranteed to be ordered wrt the store that gets the first lock.
> 
> It looks like ARM64 gets this wrong and is fundamentally buggy wrt
> "spin_is_locked()" (and, as a result, "spin_unlock_wait()").

I don't see how a memory barrier would help us here, and Boqun's example
had smp_mb() either sides of the spin_unlock_wait(). What we actually need
is to make spin_unlock_wait more like a LOCK operation, so that it forces
parallel lockers to replay their LL/SC sequences.

I'll write a patch once I've heard more back from Paul about
smp_mb__after_unlock_lock().

> BUT! And this is a bug BUT:
> 
> It should be noted that that is purely an ARM64 bug. Not a bug in our
> users. If you have a spinlock where the "get lock write" part of the
> lock can be delayed, then you have to have a "arch_spin_is_locked()"
> that has the proper memory barriers.
> 
> Of course, ARM still hides their architecture manuals in odd places,
> so I can't double-check. But afaik, ARM64 store-conditional is "store
> exclusive with release", and it has only release semantics, and ARM64
> really does have the above bug.

The store-conditional in our spin_lock routine has no barrier semantics;
they are enforced by the load-exclusive-acquire that pairs with it.

> On that note: can anybody point me to the latest ARM64 8.1
> architecture manual in pdf form, without the "you have to register"
> crap? I thought ARM released it, but all my googling just points to
> the idiotic ARM service center that wants me to sign away something
> just to see the docs. Which I don't do.

We haven't yet released a document for the 8.1 instructions, but I can
certainly send you a copy when it's made available.

Will

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-12 21:33                         ` Will Deacon
@ 2015-11-12 23:43                           ` Paul E. McKenney
  2015-11-16 13:58                             ` Will Deacon
  0 siblings, 1 reply; 78+ messages in thread
From: Paul E. McKenney @ 2015-11-12 23:43 UTC (permalink / raw)
  To: Will Deacon
  Cc: Oleg Nesterov, Peter Zijlstra, Boqun Feng, mingo, linux-kernel,
	corbet, mhocko, dhowells, torvalds, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On Thu, Nov 12, 2015 at 09:33:39PM +0000, Will Deacon wrote:
> On Thu, Nov 12, 2015 at 10:59:06AM -0800, Paul E. McKenney wrote:
> > On Thu, Nov 12, 2015 at 08:33:02PM +0100, Oleg Nesterov wrote:
> > > On 11/12, Peter Zijlstra wrote:
> > > >
> > > > On Thu, Nov 12, 2015 at 07:38:07PM +0100, Oleg Nesterov wrote:
> > > > > It seems that PPC needs to define smp_mb__before_spinlock() as full mb(),
> > > > > no?
> > > >
> > > > It does:
> > > >
> > > > arch/powerpc/include/asm/barrier.h:#define smp_mb__before_spinlock() smp_mb()
> > > 
> > > Ah, indeed, thanks.
> > > 
> > > And given that it also defines smp_mb__after_unlock_lock() as smp_mb(),
> > > I am starting to understand how it can help to avoid the races with
> > > spin_unlock_wait() in (for example) do_exit().
> > > 
> > > But as Boqun has already mentioned, this means that mb__after_unlock_lock()
> > > has the new meaning which should be documented.
> > > 
> > > Hmm. And 12d560f4 "Privatize smp_mb__after_unlock_lock()" should be reverted
> > > then ;)
> > 
> > Surprisingly, this reverts cleanly against today's mainline, please see
> > the patch below.  Against my -rcu stack, not so much, but so it goes.  ;-)
> 
> I think we ended up concluding that smp_mb__after_unlock_lock is indeed
> required, but I don't think we should just resurrect the old definition,
> which doesn't keep UNLOCK -> LOCK distinct from RELEASE -> ACQUIRE. I'm
> still working on documenting the different types of transitivity that we
> identified in that thread, but it's slow going.
> 
> Also, as far as spin_unlock_wait is concerned, it is neither a LOCK or
> an UNLOCK and this barrier doesn't offer us anything. Sure, it might
> work because PPC defines it as smp_mb(), but it doesn't help on arm64
> and defining the macro is overkill for us in most places (i.e. RCU).
> 
> If we decide that the current usage of spin_unlock_wait is valid, then I
> would much rather implement a version of it in the arm64 backend that
> does something like:
> 
>  1:  ldrex r1, [&lock]
>      if r1 indicates that lock is taken, branch back to 1b
>      strex r1, [&lock]
>      if store failed, branch back to 1b
> 
> i.e. we don't just test the lock, but we also write it back atomically
> if we discover that it's free. That would then clear the exclusive monitor
> on any cores in the process of taking the lock and restore the ordering
> that we need.

We could clearly do something similar in PowerPC, but I suspect that this
would hurt really badly on large systems, given that there are PowerPC
systems with more than a thousand hardware threads.  So one approach
is ARM makes spin_unlock_wait() do the write, similar to spin_lock();
spin_lock(), but PowerPC relies on smp_mb__after_unlock_lock().

Or does someone have a better proposal?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-12 23:43                           ` Paul E. McKenney
@ 2015-11-16 13:58                             ` Will Deacon
  0 siblings, 0 replies; 78+ messages in thread
From: Will Deacon @ 2015-11-16 13:58 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Oleg Nesterov, Peter Zijlstra, Boqun Feng, mingo, linux-kernel,
	corbet, mhocko, dhowells, torvalds, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

Hi Paul,

On Thu, Nov 12, 2015 at 03:43:51PM -0800, Paul E. McKenney wrote:
> On Thu, Nov 12, 2015 at 09:33:39PM +0000, Will Deacon wrote:
> > I think we ended up concluding that smp_mb__after_unlock_lock is indeed
> > required, but I don't think we should just resurrect the old definition,
> > which doesn't keep UNLOCK -> LOCK distinct from RELEASE -> ACQUIRE. I'm
> > still working on documenting the different types of transitivity that we
> > identified in that thread, but it's slow going.
> > 
> > Also, as far as spin_unlock_wait is concerned, it is neither a LOCK or
> > an UNLOCK and this barrier doesn't offer us anything. Sure, it might
> > work because PPC defines it as smp_mb(), but it doesn't help on arm64
> > and defining the macro is overkill for us in most places (i.e. RCU).
> > 
> > If we decide that the current usage of spin_unlock_wait is valid, then I
> > would much rather implement a version of it in the arm64 backend that
> > does something like:
> > 
> >  1:  ldrex r1, [&lock]
> >      if r1 indicates that lock is taken, branch back to 1b
> >      strex r1, [&lock]
> >      if store failed, branch back to 1b
> > 
> > i.e. we don't just test the lock, but we also write it back atomically
> > if we discover that it's free. That would then clear the exclusive monitor
> > on any cores in the process of taking the lock and restore the ordering
> > that we need.
> 
> We could clearly do something similar in PowerPC, but I suspect that this
> would hurt really badly on large systems, given that there are PowerPC
> systems with more than a thousand hardware threads.  So one approach
> is ARM makes spin_unlock_wait() do the write, similar to spin_lock();
> spin_lock(), but PowerPC relies on smp_mb__after_unlock_lock().

Sure, I'm certainly not trying to tell you how to do this for PPC, but
the above would be better for arm64 (any huge system should be using the
8.1 atomics anyway).

> Or does someone have a better proposal?

I don't think I'm completely clear on the current proposal wrt
smp_mb__after_unlock_lock. To summarise my understanding (in the context
of Boqun's original example, which I've duplicated at the end of this
mail):

  * Putting smp_mb__after_unlock_lock() after the LOCK on CPU2 creates
    global order, by upgrading the UNLOCK -> LOCK to a full barrier. If
    we extend this to include the accesses made by the UNLOCK and LOCK
    as happening *before* the notional full barrier, then a from-read
    edge from CPU2 to CPU1 on `object' implies that the LOCK operation
    is observed by CPU1 before it writes object = NULL. So we can add
    this barrier and fix the test for PPC.

  * Upgrading spin_unlock_wait to a LOCK operation (which is basically
    what I'm proposing for arm64) means that we now rely on LOCK -> LOCK
    being part of a single, total order (i.e. all CPUs agree on the
    order in which a lock was taken).

Assuming PPC has this global LOCK -> LOCK ordering, then we end up in a
sticky situation defining the kernel's memory model because we don't
have an architecture-agnostic semantic. The only way I can see to fix
this is by adding something like smp_mb__between_lock_unlock_wait, but
that's grim.

Do you see a way forward?

Will

--->8

	CPU 1			CPU 2			CPU 3
	==================	====================	==============
							spin_unlock(&lock);
				spin_lock(&lock):
				  r1 = *lock; // r1 == 0;
				o = READ_ONCE(object); // reordered here
	object = NULL;
	smp_mb();
	spin_unlock_wait(&lock);
				  *lock = 1;
	smp_mb();
	o->dead = true;
				if (o) // true
				  BUG_ON(o->dead); // true!!

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-12 18:21             ` Linus Torvalds
  2015-11-12 22:09               ` Will Deacon
@ 2015-11-16 15:56               ` Peter Zijlstra
  2015-11-16 16:04                 ` Peter Zijlstra
  1 sibling, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2015-11-16 15:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Boqun Feng, Oleg Nesterov, Ingo Molnar,
	Linux Kernel Mailing List, Paul McKenney, Jonathan Corbet,
	Michal Hocko, David Howells, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On Thu, Nov 12, 2015 at 10:21:39AM -0800, Linus Torvalds wrote:
> Now, the point of spin_unlock_wait() (and "spin_is_locked()") should
> generally be that you have some external ordering guarantee that
> guarantees that the lock has been taken. For example, for the IPC
> semaphores, we do either one of:
> 
>  (a) get large lock, then - once you hold that lock - wait for each small lock
> 
> or
> 
>  (b) get small lock, then - once you hold that lock - check that the
> largo lock is unlocked
> 
> and that's the case we should really worry about.  The other uses of
> spin_unlock_wait() should have similar "I have other reasons to know
> I've seen that the lock was taken, or will never be taken after this
> because XYZ".

I don't think this is true for the usage in do_exit(), we have no
knowledge on if pi_lock is taken or not. We just want to make sure that
_if_ it were taken, we wait until it is released.

But I'm not sure where task_work_run() sits, at first reading it appears
to also not be true -- there doesn't appear to be a reason we know a
lock to be held.

It does however appear true for the usage in completion_done(), where by
having tested x->done, we know a pi_lock _was_ held.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-16 15:56               ` Peter Zijlstra
@ 2015-11-16 16:04                 ` Peter Zijlstra
  2015-11-16 16:24                   ` Will Deacon
  0 siblings, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2015-11-16 16:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Boqun Feng, Oleg Nesterov, Ingo Molnar,
	Linux Kernel Mailing List, Paul McKenney, Jonathan Corbet,
	Michal Hocko, David Howells, Will Deacon, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On Mon, Nov 16, 2015 at 04:56:58PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 12, 2015 at 10:21:39AM -0800, Linus Torvalds wrote:
> > Now, the point of spin_unlock_wait() (and "spin_is_locked()") should
> > generally be that you have some external ordering guarantee that
> > guarantees that the lock has been taken. For example, for the IPC
> > semaphores, we do either one of:
> > 
> >  (a) get large lock, then - once you hold that lock - wait for each small lock
> > 
> > or
> > 
> >  (b) get small lock, then - once you hold that lock - check that the
> > largo lock is unlocked
> > 
> > and that's the case we should really worry about.  The other uses of
> > spin_unlock_wait() should have similar "I have other reasons to know
> > I've seen that the lock was taken, or will never be taken after this
> > because XYZ".
> 
> I don't think this is true for the usage in do_exit(), we have no
> knowledge on if pi_lock is taken or not. We just want to make sure that
> _if_ it were taken, we wait until it is released.

And unless PPC would move to using RCsc locks with a SYNC in
spin_lock(), I don't think it makes sense to add
smp_mb__after_unlock_lock() to all tsk->pi_lock instances to fix this.
As that is far more expensive than flipping the exit path to do
spin_lock()+spin_unlock().

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-16 16:04                 ` Peter Zijlstra
@ 2015-11-16 16:24                   ` Will Deacon
  2015-11-16 16:44                     ` Paul E. McKenney
  2015-11-16 21:58                     ` Linus Torvalds
  0 siblings, 2 replies; 78+ messages in thread
From: Will Deacon @ 2015-11-16 16:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Boqun Feng, Oleg Nesterov, Ingo Molnar,
	Linux Kernel Mailing List, Paul McKenney, Jonathan Corbet,
	Michal Hocko, David Howells, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On Mon, Nov 16, 2015 at 05:04:45PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 16, 2015 at 04:56:58PM +0100, Peter Zijlstra wrote:
> > On Thu, Nov 12, 2015 at 10:21:39AM -0800, Linus Torvalds wrote:
> > > Now, the point of spin_unlock_wait() (and "spin_is_locked()") should
> > > generally be that you have some external ordering guarantee that
> > > guarantees that the lock has been taken. For example, for the IPC
> > > semaphores, we do either one of:
> > > 
> > >  (a) get large lock, then - once you hold that lock - wait for each small lock
> > > 
> > > or
> > > 
> > >  (b) get small lock, then - once you hold that lock - check that the
> > > largo lock is unlocked
> > > 
> > > and that's the case we should really worry about.  The other uses of
> > > spin_unlock_wait() should have similar "I have other reasons to know
> > > I've seen that the lock was taken, or will never be taken after this
> > > because XYZ".
> > 
> > I don't think this is true for the usage in do_exit(), we have no
> > knowledge on if pi_lock is taken or not. We just want to make sure that
> > _if_ it were taken, we wait until it is released.
> 
> And unless PPC would move to using RCsc locks with a SYNC in
> spin_lock(), I don't think it makes sense to add
> smp_mb__after_unlock_lock() to all tsk->pi_lock instances to fix this.
> As that is far more expensive than flipping the exit path to do
> spin_lock()+spin_unlock().

... or we upgrade spin_unlock_wait to a LOCK operation, which might be
slightly cheaper than spin_lock()+spin_unlock().

Will

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-16 16:24                   ` Will Deacon
@ 2015-11-16 16:44                     ` Paul E. McKenney
  2015-11-16 16:46                       ` Will Deacon
  2015-11-16 21:58                     ` Linus Torvalds
  1 sibling, 1 reply; 78+ messages in thread
From: Paul E. McKenney @ 2015-11-16 16:44 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Linus Torvalds, Boqun Feng, Oleg Nesterov,
	Ingo Molnar, Linux Kernel Mailing List, Jonathan Corbet,
	Michal Hocko, David Howells, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On Mon, Nov 16, 2015 at 04:24:53PM +0000, Will Deacon wrote:
> On Mon, Nov 16, 2015 at 05:04:45PM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 16, 2015 at 04:56:58PM +0100, Peter Zijlstra wrote:
> > > On Thu, Nov 12, 2015 at 10:21:39AM -0800, Linus Torvalds wrote:
> > > > Now, the point of spin_unlock_wait() (and "spin_is_locked()") should
> > > > generally be that you have some external ordering guarantee that
> > > > guarantees that the lock has been taken. For example, for the IPC
> > > > semaphores, we do either one of:
> > > > 
> > > >  (a) get large lock, then - once you hold that lock - wait for each small lock
> > > > 
> > > > or
> > > > 
> > > >  (b) get small lock, then - once you hold that lock - check that the
> > > > largo lock is unlocked
> > > > 
> > > > and that's the case we should really worry about.  The other uses of
> > > > spin_unlock_wait() should have similar "I have other reasons to know
> > > > I've seen that the lock was taken, or will never be taken after this
> > > > because XYZ".
> > > 
> > > I don't think this is true for the usage in do_exit(), we have no
> > > knowledge on if pi_lock is taken or not. We just want to make sure that
> > > _if_ it were taken, we wait until it is released.
> > 
> > And unless PPC would move to using RCsc locks with a SYNC in
> > spin_lock(), I don't think it makes sense to add
> > smp_mb__after_unlock_lock() to all tsk->pi_lock instances to fix this.
> > As that is far more expensive than flipping the exit path to do
> > spin_lock()+spin_unlock().
> 
> ... or we upgrade spin_unlock_wait to a LOCK operation, which might be
> slightly cheaper than spin_lock()+spin_unlock().

Or we supply a heavyweight version of spin_unlock_wait() that forces
the cache miss.  But I bet that the difference in overhead between
spin_lock()+spin_unlock() and the heavyweight version would be down in
the noise.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-16 16:44                     ` Paul E. McKenney
@ 2015-11-16 16:46                       ` Will Deacon
  2015-11-16 17:15                         ` Paul E. McKenney
  0 siblings, 1 reply; 78+ messages in thread
From: Will Deacon @ 2015-11-16 16:46 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Linus Torvalds, Boqun Feng, Oleg Nesterov,
	Ingo Molnar, Linux Kernel Mailing List, Jonathan Corbet,
	Michal Hocko, David Howells, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On Mon, Nov 16, 2015 at 08:44:43AM -0800, Paul E. McKenney wrote:
> On Mon, Nov 16, 2015 at 04:24:53PM +0000, Will Deacon wrote:
> > On Mon, Nov 16, 2015 at 05:04:45PM +0100, Peter Zijlstra wrote:
> > > On Mon, Nov 16, 2015 at 04:56:58PM +0100, Peter Zijlstra wrote:
> > > > On Thu, Nov 12, 2015 at 10:21:39AM -0800, Linus Torvalds wrote:
> > > > > Now, the point of spin_unlock_wait() (and "spin_is_locked()") should
> > > > > generally be that you have some external ordering guarantee that
> > > > > guarantees that the lock has been taken. For example, for the IPC
> > > > > semaphores, we do either one of:
> > > > > 
> > > > >  (a) get large lock, then - once you hold that lock - wait for each small lock
> > > > > 
> > > > > or
> > > > > 
> > > > >  (b) get small lock, then - once you hold that lock - check that the
> > > > > largo lock is unlocked
> > > > > 
> > > > > and that's the case we should really worry about.  The other uses of
> > > > > spin_unlock_wait() should have similar "I have other reasons to know
> > > > > I've seen that the lock was taken, or will never be taken after this
> > > > > because XYZ".
> > > > 
> > > > I don't think this is true for the usage in do_exit(), we have no
> > > > knowledge on if pi_lock is taken or not. We just want to make sure that
> > > > _if_ it were taken, we wait until it is released.
> > > 
> > > And unless PPC would move to using RCsc locks with a SYNC in
> > > spin_lock(), I don't think it makes sense to add
> > > smp_mb__after_unlock_lock() to all tsk->pi_lock instances to fix this.
> > > As that is far more expensive than flipping the exit path to do
> > > spin_lock()+spin_unlock().
> > 
> > ... or we upgrade spin_unlock_wait to a LOCK operation, which might be
> > slightly cheaper than spin_lock()+spin_unlock().
> 
> Or we supply a heavyweight version of spin_unlock_wait() that forces
> the cache miss.  But I bet that the difference in overhead between
> spin_lock()+spin_unlock() and the heavyweight version would be down in
> the noise.

I'm not so sure. If the lock is ticket-based, then spin_lock() has to
queue for its turn, whereas spin_unlock_wait could just wait for the
next unlock.

Will

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-16 16:46                       ` Will Deacon
@ 2015-11-16 17:15                         ` Paul E. McKenney
  0 siblings, 0 replies; 78+ messages in thread
From: Paul E. McKenney @ 2015-11-16 17:15 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Linus Torvalds, Boqun Feng, Oleg Nesterov,
	Ingo Molnar, Linux Kernel Mailing List, Jonathan Corbet,
	Michal Hocko, David Howells, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On Mon, Nov 16, 2015 at 04:46:36PM +0000, Will Deacon wrote:
> On Mon, Nov 16, 2015 at 08:44:43AM -0800, Paul E. McKenney wrote:
> > On Mon, Nov 16, 2015 at 04:24:53PM +0000, Will Deacon wrote:
> > > On Mon, Nov 16, 2015 at 05:04:45PM +0100, Peter Zijlstra wrote:
> > > > On Mon, Nov 16, 2015 at 04:56:58PM +0100, Peter Zijlstra wrote:
> > > > > On Thu, Nov 12, 2015 at 10:21:39AM -0800, Linus Torvalds wrote:
> > > > > > Now, the point of spin_unlock_wait() (and "spin_is_locked()") should
> > > > > > generally be that you have some external ordering guarantee that
> > > > > > guarantees that the lock has been taken. For example, for the IPC
> > > > > > semaphores, we do either one of:
> > > > > > 
> > > > > >  (a) get large lock, then - once you hold that lock - wait for each small lock
> > > > > > 
> > > > > > or
> > > > > > 
> > > > > >  (b) get small lock, then - once you hold that lock - check that the
> > > > > > largo lock is unlocked
> > > > > > 
> > > > > > and that's the case we should really worry about.  The other uses of
> > > > > > spin_unlock_wait() should have similar "I have other reasons to know
> > > > > > I've seen that the lock was taken, or will never be taken after this
> > > > > > because XYZ".
> > > > > 
> > > > > I don't think this is true for the usage in do_exit(), we have no
> > > > > knowledge on if pi_lock is taken or not. We just want to make sure that
> > > > > _if_ it were taken, we wait until it is released.
> > > > 
> > > > And unless PPC would move to using RCsc locks with a SYNC in
> > > > spin_lock(), I don't think it makes sense to add
> > > > smp_mb__after_unlock_lock() to all tsk->pi_lock instances to fix this.
> > > > As that is far more expensive than flipping the exit path to do
> > > > spin_lock()+spin_unlock().
> > > 
> > > ... or we upgrade spin_unlock_wait to a LOCK operation, which might be
> > > slightly cheaper than spin_lock()+spin_unlock().
> > 
> > Or we supply a heavyweight version of spin_unlock_wait() that forces
> > the cache miss.  But I bet that the difference in overhead between
> > spin_lock()+spin_unlock() and the heavyweight version would be down in
> > the noise.
> 
> I'm not so sure. If the lock is ticket-based, then spin_lock() has to
> queue for its turn, whereas spin_unlock_wait could just wait for the
> next unlock.

Fair point, and it actually applies to high-contention spinlocks as well,
just a bit less deterministically.  OK, given that I believe that we do
see high contention on the lock in question, I withdraw any objections
to a heavy-weight form of spin_unlock_wait().

							Thanx, Paul


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-16 16:24                   ` Will Deacon
  2015-11-16 16:44                     ` Paul E. McKenney
@ 2015-11-16 21:58                     ` Linus Torvalds
  2015-11-17 11:51                       ` Will Deacon
  1 sibling, 1 reply; 78+ messages in thread
From: Linus Torvalds @ 2015-11-16 21:58 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Boqun Feng, Oleg Nesterov, Ingo Molnar,
	Linux Kernel Mailing List, Paul McKenney, Jonathan Corbet,
	Michal Hocko, David Howells, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On Mon, Nov 16, 2015 at 8:24 AM, Will Deacon <will.deacon@arm.com> wrote:
>
> ... or we upgrade spin_unlock_wait to a LOCK operation, which might be
> slightly cheaper than spin_lock()+spin_unlock().

So traditionally the real concern has been the cacheline ping-pong
part of spin_unlock_wait(). I think adding a memory barrier (that
doesn't force any exclusive states, just ordering) to it is fine, but
I don't think we want to necessarily have it have to get the cacheline
into exclusive state.

Because if spin_unlock_wait() ends up having to get the spinlock
cacheline (for example, by writing the same value back with a SC), I
don't think spin_unlock_wait() will really be all that much cheaper
than just getting the spinlock, and in that case we shouldn't play
complicated ordering games.

On another issue:

I'm also looking at the ARM documentation for strx, and the
_documentation_ says that it has no stronger ordering than a "store
release", but I'm starting to wonder if that is actually true.

Because I do end up thinking that it does have the same "control
dependency" to all subsequent writes (but not reads). So reads after
the SC can percolate up, but I think writes are restricted.

Why? In order for the SC to be able to return success, the write
itself may not have been actually done yet, but the cacheline for the
write must have successfully be turned into exclusive ownership.
Agreed?

That means that by the time a SC returns success, no other CPU can see
the old value of the spinlock any more. So by the time any subsequent
stores in the locked region can be visible to any other CPU's, the
locked value of the lock itself has to be visible too.

Agreed?

So I think that in effect, when a spinlock is implemnted with LL/SC,
the loads inside the locked region are only ordered wrt the acquire on
the LL, but the stores can be considered ordered wrt the SC.

No?

So I think a _successful_ SC - is still more ordered than just any
random store with release consistency.

Of course, I'm not sure that actually *helps* us, because I think the
problem tends to be loads in the locked region moving up earlier than
the actual store that sets the lock, but maybe it makes some
difference.

                Linus

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-16 21:58                     ` Linus Torvalds
@ 2015-11-17 11:51                       ` Will Deacon
  2015-11-17 21:01                         ` Paul E. McKenney
  0 siblings, 1 reply; 78+ messages in thread
From: Will Deacon @ 2015-11-17 11:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Boqun Feng, Oleg Nesterov, Ingo Molnar,
	Linux Kernel Mailing List, Paul McKenney, Jonathan Corbet,
	Michal Hocko, David Howells, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

Hi Linus,

On Mon, Nov 16, 2015 at 01:58:49PM -0800, Linus Torvalds wrote:
> On Mon, Nov 16, 2015 at 8:24 AM, Will Deacon <will.deacon@arm.com> wrote:
> >
> > ... or we upgrade spin_unlock_wait to a LOCK operation, which might be
> > slightly cheaper than spin_lock()+spin_unlock().
> 
> So traditionally the real concern has been the cacheline ping-pong
> part of spin_unlock_wait(). I think adding a memory barrier (that
> doesn't force any exclusive states, just ordering) to it is fine, but
> I don't think we want to necessarily have it have to get the cacheline
> into exclusive state.

The problem is, I don't think the memory-barrier buys you anything in
the context of Boqun's example. In fact, he already had smp_mb() either
side of the spin_unlock_wait() and its still broken on arm64 and ppc.

Paul is proposing adding a memory barrier after spin_lock() in the racing
thread, but I personally think people will forget to add that.

> Because if spin_unlock_wait() ends up having to get the spinlock
> cacheline (for example, by writing the same value back with a SC), I
> don't think spin_unlock_wait() will really be all that much cheaper
> than just getting the spinlock, and in that case we shouldn't play
> complicated ordering games.

It was the lock-fairness guarantees that I was concerned about. A
spin_lock could place you into a queue, so you're no longer waiting for
a single spin_unlock(), you're now waiting for *all* the spin_unlocks
by the CPUs preceding you in the queue.

> On another issue:
> 
> I'm also looking at the ARM documentation for strx, and the
> _documentation_ says that it has no stronger ordering than a "store
> release", but I'm starting to wonder if that is actually true.
> 
> Because I do end up thinking that it does have the same "control
> dependency" to all subsequent writes (but not reads). So reads after
> the SC can percolate up, but I think writes are restricted.
> 
> Why? In order for the SC to be able to return success, the write
> itself may not have been actually done yet, but the cacheline for the
> write must have successfully be turned into exclusive ownership.
> Agreed?

If the LL/SC logic hangs off the coherency logic, then yes, but there
are other ways to build this (using a seperate "exclusive monitor" block)
and the architecture caters for this, too. See below.

> That means that by the time a SC returns success, no other CPU can see
> the old value of the spinlock any more. So by the time any subsequent
> stores in the locked region can be visible to any other CPU's, the
> locked value of the lock itself has to be visible too.
> 
> Agreed?

No. A successful SC is *not* multi-copy atomic, but you're right to
point out that it provides more guarantees than a plain store. In
particular, a successful SC cannot return its success value until its
corresponding write has fixed its place in the coherence order for the
location that it is updating.

To be more concrete (simplified AArch64 asm, X1 and X2 hold addresses of
zero-initialised locations):


P0
LDXR	X0, [X1]
ADD	X0, X0, #1
STXR	X0, [X1]	// Succeeds

P1
LDR	X0, [X1]	// Reads 1
<dependency>
STR	#1, [X2]

P2
LDR	X0, [X2]	// Reads 1
<dependency>
LDR	X0, [X1]	// **Not required to read 1**


However:

P0
LDXR	X0, [X1]
ADD	X0, X0, #1
STXR	X0, [X1]	// Succeeds

P1
LDR	X0, [X1]	// Reads 1
<dependency>
STR	#1, [X2]

P2
LDR	X0, [X2]	// Reads 1
<dependency>
STR	#2, [X1]	// Location at [X1] must be ordered {0->1->2}


We can also extend this example so that P2 instead does:

P2
LDR	X0, [X2]	// Reads 1
<dependency>
LDXR	X0, [X1]
ADD	X0, X0, #1
STXR	X0, [X1]	// Succeeds; location at [x1] must be ordered {0->1->2}

i.e. the STXR cannot succeed until the coherence order has been resolved,
which also requires the LDXR to return the up-to-date value.

> So I think that in effect, when a spinlock is implemnted with LL/SC,
> the loads inside the locked region are only ordered wrt the acquire on
> the LL, but the stores can be considered ordered wrt the SC.
> 
> No?

I initially fell into the same trap because a control dependency between
a load and a store is sufficient to create order (i.e. we don't speculate
writes). An SC is different, though, because the control dependency can
be resolved without the write being multi-copy atomic, whereas a read is
required to return its data (i.e. complete) before the control hazard can
be resolved.

I think that all of this means I should either:

  (1) Update the arm64 spin_unlock_wait to use LDXR/STXR (perhaps with
      acquire semantics?)

- or -

  (2) Replace spin_unlock_wait with spin_lock; spin_unlock, my worries
      about queuing notwithstanding.

The cacheline ping-pong in (1) can be mitigated somewhat by the use of
wfe, so it won't be any worse than a spin_lock().

Will

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-17 11:51                       ` Will Deacon
@ 2015-11-17 21:01                         ` Paul E. McKenney
  2015-11-18 11:25                           ` Will Deacon
  0 siblings, 1 reply; 78+ messages in thread
From: Paul E. McKenney @ 2015-11-17 21:01 UTC (permalink / raw)
  To: Will Deacon
  Cc: Linus Torvalds, Peter Zijlstra, Boqun Feng, Oleg Nesterov,
	Ingo Molnar, Linux Kernel Mailing List, Jonathan Corbet,
	Michal Hocko, David Howells, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On Tue, Nov 17, 2015 at 11:51:10AM +0000, Will Deacon wrote:
> Hi Linus,
> 
> On Mon, Nov 16, 2015 at 01:58:49PM -0800, Linus Torvalds wrote:
> > On Mon, Nov 16, 2015 at 8:24 AM, Will Deacon <will.deacon@arm.com> wrote:
> > >
> > > ... or we upgrade spin_unlock_wait to a LOCK operation, which might be
> > > slightly cheaper than spin_lock()+spin_unlock().
> > 
> > So traditionally the real concern has been the cacheline ping-pong
> > part of spin_unlock_wait(). I think adding a memory barrier (that
> > doesn't force any exclusive states, just ordering) to it is fine, but
> > I don't think we want to necessarily have it have to get the cacheline
> > into exclusive state.
> 
> The problem is, I don't think the memory-barrier buys you anything in
> the context of Boqun's example. In fact, he already had smp_mb() either
> side of the spin_unlock_wait() and its still broken on arm64 and ppc.
> 
> Paul is proposing adding a memory barrier after spin_lock() in the racing
> thread, but I personally think people will forget to add that.

A mechanical check would certainly make me feel better about it, so that
any lock that was passed to spin_unlock_wait() was required to have all
acquisitions followed by smp_mb__after_unlock_lock() or some such.
But I haven't yet given up on finding a better solution.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-17 21:01                         ` Paul E. McKenney
@ 2015-11-18 11:25                           ` Will Deacon
  2015-11-19 18:01                             ` Will Deacon
  0 siblings, 1 reply; 78+ messages in thread
From: Will Deacon @ 2015-11-18 11:25 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Peter Zijlstra, Boqun Feng, Oleg Nesterov,
	Ingo Molnar, Linux Kernel Mailing List, Jonathan Corbet,
	Michal Hocko, David Howells, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On Tue, Nov 17, 2015 at 01:01:09PM -0800, Paul E. McKenney wrote:
> On Tue, Nov 17, 2015 at 11:51:10AM +0000, Will Deacon wrote:
> > On Mon, Nov 16, 2015 at 01:58:49PM -0800, Linus Torvalds wrote:
> > > On Mon, Nov 16, 2015 at 8:24 AM, Will Deacon <will.deacon@arm.com> wrote:
> > > >
> > > > ... or we upgrade spin_unlock_wait to a LOCK operation, which might be
> > > > slightly cheaper than spin_lock()+spin_unlock().
> > > 
> > > So traditionally the real concern has been the cacheline ping-pong
> > > part of spin_unlock_wait(). I think adding a memory barrier (that
> > > doesn't force any exclusive states, just ordering) to it is fine, but
> > > I don't think we want to necessarily have it have to get the cacheline
> > > into exclusive state.
> > 
> > The problem is, I don't think the memory-barrier buys you anything in
> > the context of Boqun's example. In fact, he already had smp_mb() either
> > side of the spin_unlock_wait() and its still broken on arm64 and ppc.
> > 
> > Paul is proposing adding a memory barrier after spin_lock() in the racing
> > thread, but I personally think people will forget to add that.
> 
> A mechanical check would certainly make me feel better about it, so that
> any lock that was passed to spin_unlock_wait() was required to have all
> acquisitions followed by smp_mb__after_unlock_lock() or some such.
> But I haven't yet given up on finding a better solution.

Right-o. I'll hack together the arm64 spin_unlock_wait fix, but hold off
merging it for a few weeks in case we get struck by a sudden flash of
inspiration.

Will

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-18 11:25                           ` Will Deacon
@ 2015-11-19 18:01                             ` Will Deacon
  2015-11-20 10:09                               ` Peter Zijlstra
  0 siblings, 1 reply; 78+ messages in thread
From: Will Deacon @ 2015-11-19 18:01 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Peter Zijlstra, Boqun Feng, Oleg Nesterov,
	Ingo Molnar, Linux Kernel Mailing List, Jonathan Corbet,
	Michal Hocko, David Howells, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On Wed, Nov 18, 2015 at 11:25:14AM +0000, Will Deacon wrote:
> On Tue, Nov 17, 2015 at 01:01:09PM -0800, Paul E. McKenney wrote:
> > On Tue, Nov 17, 2015 at 11:51:10AM +0000, Will Deacon wrote:
> > > On Mon, Nov 16, 2015 at 01:58:49PM -0800, Linus Torvalds wrote:
> > > > On Mon, Nov 16, 2015 at 8:24 AM, Will Deacon <will.deacon@arm.com> wrote:
> > > > >
> > > > > ... or we upgrade spin_unlock_wait to a LOCK operation, which might be
> > > > > slightly cheaper than spin_lock()+spin_unlock().
> > > > 
> > > > So traditionally the real concern has been the cacheline ping-pong
> > > > part of spin_unlock_wait(). I think adding a memory barrier (that
> > > > doesn't force any exclusive states, just ordering) to it is fine, but
> > > > I don't think we want to necessarily have it have to get the cacheline
> > > > into exclusive state.
> > > 
> > > The problem is, I don't think the memory-barrier buys you anything in
> > > the context of Boqun's example. In fact, he already had smp_mb() either
> > > side of the spin_unlock_wait() and its still broken on arm64 and ppc.
> > > 
> > > Paul is proposing adding a memory barrier after spin_lock() in the racing
> > > thread, but I personally think people will forget to add that.
> > 
> > A mechanical check would certainly make me feel better about it, so that
> > any lock that was passed to spin_unlock_wait() was required to have all
> > acquisitions followed by smp_mb__after_unlock_lock() or some such.
> > But I haven't yet given up on finding a better solution.
> 
> Right-o. I'll hack together the arm64 spin_unlock_wait fix, but hold off
> merging it for a few weeks in case we get struck by a sudden flash of
> inspiration.

For completeness, here's what I've currently got. I've failed to measure
any performance impact on my 8-core systems, but that's not surprising.

Will

--->8

>From da14adc1aef2f12b7a7def4d6b7dde254a91ebf1 Mon Sep 17 00:00:00 2001
From: Will Deacon <will.deacon@arm.com>
Date: Thu, 19 Nov 2015 17:48:31 +0000
Subject: [PATCH] arm64: spinlock: serialise spin_unlock_wait against
 concurrent lockers

Boqun Feng reported a rather nasty ordering issue with spin_unlock_wait
on architectures implementing spin_lock with LL/SC sequences and acquire
semantics:

 | CPU 1                   CPU 2                     CPU 3
 | ==================      ====================      ==============
 |                                                   spin_unlock(&lock);
 |                         spin_lock(&lock):
 |                           r1 = *lock; // r1 == 0;
 |                         o = READ_ONCE(object); // reordered here
 | object = NULL;
 | smp_mb();
 | spin_unlock_wait(&lock);
 |                           *lock = 1;
 | smp_mb();
 | o->dead = true;
 |                         if (o) // true
 |                           BUG_ON(o->dead); // true!!

The crux of the problem is that spin_unlock_wait(&lock) can return on
CPU 1 whilst CPU 2 is in the process of taking the lock. This can be
resolved by upgrading spin_unlock_wait to a LOCK operation, forcing it
to serialise against a concurrent locker and giving it acquire semantics
in the process (although it is not at all clear whether this is needed -
different callers seem to assume different things about the barrier
semantics and architectures are similarly disjoint in their
implementations of the macro).

This patch implements spin_unlock_wait using an LL/SC sequence with
acquire semantics on arm64. For v8.1 systems with the LSE atomics, the
exclusive writeback is omitted, since the spin_lock operation is
indivisible and no intermediate state can be observed.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/include/asm/spinlock.h | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h
index c85e96d174a5..b531791a75ff 100644
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -26,9 +26,29 @@
  * The memory barriers are implicit with the load-acquire and store-release
  * instructions.
  */
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	unsigned int tmp;
+	arch_spinlock_t lockval;
 
-#define arch_spin_unlock_wait(lock) \
-	do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
+	asm volatile(
+"	sevl\n"
+"1:	wfe\n"
+"2:	ldaxr	%w0, %2\n"
+"	eor	%w1, %w0, %w0, ror #16\n"
+"	cbnz	%w1, 1b\n"
+	ARM64_LSE_ATOMIC_INSN(
+	/* LL/SC */
+"	stxr	%w1, %w0, %2\n"
+	/* Serialise against any concurrent lockers */
+"	cbnz	%w1, 2b\n",
+	/* LSE atomics */
+"	nop\n"
+"	nop\n")
+	: "=&r" (lockval), "=&r" (tmp), "+Q" (*lock)
+	:
+	: "memory");
+}
 
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 2/4] sched: Document Program-Order guarantees
  2015-11-02 20:27   ` Paul Turner
  2015-11-02 20:34     ` Peter Zijlstra
@ 2015-11-20 10:02     ` Peter Zijlstra
  2015-11-20 14:08       ` Boqun Feng
  1 sibling, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2015-11-20 10:02 UTC (permalink / raw)
  To: Paul Turner
  Cc: Ingo Molnar, Oleg Nesterov, LKML, Paul McKenney, boqun.feng,
	Jonathan Corbet, mhocko, dhowells, Linus Torvalds, will.deacon

On Mon, Nov 02, 2015 at 12:27:05PM -0800, Paul Turner wrote:
> I suspect this part might be more explicitly expressed by specifying
> the requirements that migration satisfies; then providing an example.
> This makes it easier for others to reason about the locks and saves
> worrying about whether the examples hit our 3 million sub-cases.

Something like so then? Note that this patch (obviously) comes after
introducing smp_cond_acquire(), simplifying the RELEASE/ACQUIRE on
p->on_cpu.

---
Subject: sched: Document Program-Order guarantees
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue Nov 17 19:01:11 CET 2015

These are some notes on the scheduler locking and how it provides
program order guarantees on SMP systems.

Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   91 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 91 insertions(+)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1916,6 +1916,97 @@ static void ttwu_queue(struct task_struc
 	raw_spin_unlock(&rq->lock);
 }
 
+/*
+ * Notes on Program-Order guarantees on SMP systems.
+ *
+ *  MIGRATION
+ *
+ * The basic program-order guarantee on SMP systems is that when a task [t]
+ * migrates, all its activity on its old cpu [c0] happens-before any subsequent
+ * execution on its new cpu [c1].
+ *
+ * For migration (of runnable tasks) this is provided by the following means:
+ *
+ *  A) UNLOCK of the rq(c0)->lock scheduling out task t
+ *  B) migration for t is required to synchronize *both* rq(c0)->lock and
+ *     rq(c1)->lock (if not at the same time, then in that order).
+ *  C) LOCK of the rq(c1)->lock scheduling in task
+ *
+ * Transitivity guarantees that B happens after A and C after B.
+ * Note: we only require RCpc transitivity.
+ * Note: the cpu doing B need not be c0 or c1
+ *
+ * Example:
+ *
+ *   CPU0            CPU1            CPU2
+ *
+ *   LOCK rq(0)->lock
+ *   sched-out X
+ *   sched-in Y
+ *   UNLOCK rq(0)->lock
+ *
+ *                                   LOCK rq(0)->lock // orders against CPU0
+ *                                   dequeue X
+ *                                   UNLOCK rq(0)->lock
+ *
+ *                                   LOCK rq(1)->lock
+ *                                   enqueue X
+ *                                   UNLOCK rq(1)->lock
+ *
+ *                   LOCK rq(1)->lock // orders against CPU2
+ *                   sched-out Z
+ *                   sched-in X
+ *                   UNLOCK rq(1)->lock
+ *
+ *
+ *  BLOCKING -- aka. SLEEP + WAKEUP
+ *
+ * For blocking we (obviously) need to provide the same guarantee as for
+ * migration. However the means are completely different as there is no lock
+ * chain to provide order. Instead we do:
+ *
+ *   1) smp_store_release(X->on_cpu, 0)
+ *   2) smp_cond_acquire(!X->on_cpu)
+ *
+ * Example:
+ *
+ *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (schedule)
+ *
+ *   LOCK rq(0)->lock LOCK X->pi_lock
+ *   dequeue X
+ *   sched-out X
+ *   smp_store_release(X->on_cpu, 0);
+ *
+ *                    smp_cond_acquire(!X->on_cpu);
+ *                    X->state = WAKING
+ *                    set_task_cpu(X,2)
+ *
+ *                    LOCK rq(2)->lock
+ *                    enqueue X
+ *                    X->state = RUNNING
+ *                    UNLOCK rq(2)->lock
+ *
+ *                                          LOCK rq(2)->lock // orders against CPU1
+ *                                          sched-out Z
+ *                                          sched-in X
+ *                                          UNLOCK rq(1)->lock
+ *
+ *                    UNLOCK X->pi_lock
+ *   UNLOCK rq(0)->lock
+ *
+ *
+ * However; for wakeups there is a second guarantee we must provide, namely we
+ * must observe the state that lead to our wakeup. That is, not only must our
+ * task observe its own prior state, it must also observe the stores prior to
+ * its wakeup.
+ *
+ * This means that any means of doing remote wakeups must order the CPU doing
+ * the wakeup against the CPU the task is going to end up running on. This,
+ * however, is already required for the regular Program-Order guarantee above,
+ * since the waking CPU is the one issueing the ACQUIRE (2).
+ *
+ */
+
 /**
  * try_to_wake_up - wake up a thread
  * @p: the thread to be awakened

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
  2015-11-19 18:01                             ` Will Deacon
@ 2015-11-20 10:09                               ` Peter Zijlstra
  0 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2015-11-20 10:09 UTC (permalink / raw)
  To: Will Deacon
  Cc: Paul E. McKenney, Linus Torvalds, Boqun Feng, Oleg Nesterov,
	Ingo Molnar, Linux Kernel Mailing List, Jonathan Corbet,
	Michal Hocko, David Howells, Michael Ellerman,
	Benjamin Herrenschmidt, Paul Mackerras

On Thu, Nov 19, 2015 at 06:01:52PM +0000, Will Deacon wrote:
> For completeness, here's what I've currently got. I've failed to measure
> any performance impact on my 8-core systems, but that's not surprising.

> +static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
> +{
> +	unsigned int tmp;
> +	arch_spinlock_t lockval;
>
> +	asm volatile(
> +"	sevl\n"
> +"1:	wfe\n"

Using WFE here would lower the cacheline bouncing pressure a bit I
imagine. Sure we still pull it over into S(hared) after every store
but we don't keep banging on it making the initial e(X)clusive grab
hard.

> +"2:	ldaxr	%w0, %2\n"
> +"	eor	%w1, %w0, %w0, ror #16\n"
> +"	cbnz	%w1, 1b\n"
> +	ARM64_LSE_ATOMIC_INSN(
> +	/* LL/SC */
> +"	stxr	%w1, %w0, %2\n"
> +	/* Serialise against any concurrent lockers */
> +"	cbnz	%w1, 2b\n",
> +	/* LSE atomics */
> +"	nop\n"
> +"	nop\n")

I find these ARM64_LSE macro thingies aren't always easy to read, its
fairly easy to overlook the ',' separating the v8 and v8.1 parts, esp.
if you have further interleaving comments like in the above.

> +	: "=&r" (lockval), "=&r" (tmp), "+Q" (*lock)
> +	:
> +	: "memory");
> +}

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 2/4] sched: Document Program-Order guarantees
  2015-11-20 10:02     ` Peter Zijlstra
@ 2015-11-20 14:08       ` Boqun Feng
  2015-11-20 14:18         ` Peter Zijlstra
  0 siblings, 1 reply; 78+ messages in thread
From: Boqun Feng @ 2015-11-20 14:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, Ingo Molnar, Oleg Nesterov, LKML, Paul McKenney,
	Jonathan Corbet, mhocko, dhowells, Linus Torvalds, will.deacon

[-- Attachment #1: Type: text/plain, Size: 2205 bytes --]

Hi Peter,

On Fri, Nov 20, 2015 at 11:02:30AM +0100, Peter Zijlstra wrote:
[snip]
> + *  BLOCKING -- aka. SLEEP + WAKEUP
> + *
> + * For blocking we (obviously) need to provide the same guarantee as for
> + * migration. However the means are completely different as there is no lock
> + * chain to provide order. Instead we do:
> + *
> + *   1) smp_store_release(X->on_cpu, 0)
> + *   2) smp_cond_acquire(!X->on_cpu)
> + *
> + * Example:
> + *
> + *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (schedule)
> + *
> + *   LOCK rq(0)->lock LOCK X->pi_lock
> + *   dequeue X
> + *   sched-out X
> + *   smp_store_release(X->on_cpu, 0);
> + *
> + *                    smp_cond_acquire(!X->on_cpu);
> + *                    X->state = WAKING
> + *                    set_task_cpu(X,2)
> + *
> + *                    LOCK rq(2)->lock
> + *                    enqueue X
> + *                    X->state = RUNNING
> + *                    UNLOCK rq(2)->lock
> + *
> + *                                          LOCK rq(2)->lock // orders against CPU1
> + *                                          sched-out Z
> + *                                          sched-in X
> + *                                          UNLOCK rq(1)->lock
> + *
> + *                    UNLOCK X->pi_lock
> + *   UNLOCK rq(0)->lock
> + *
> + *
> + * However; for wakeups there is a second guarantee we must provide, namely we
> + * must observe the state that lead to our wakeup. That is, not only must our
> + * task observe its own prior state, it must also observe the stores prior to
> + * its wakeup.
> + *
> + * This means that any means of doing remote wakeups must order the CPU doing
> + * the wakeup against the CPU the task is going to end up running on. This,
> + * however, is already required for the regular Program-Order guarantee above,
> + * since the waking CPU is the one issueing the ACQUIRE (2).
> + *

Hope I'm the only one who got confused about the "2" in "ACQUIRE (2)",
what does that refer? "2) smp_cond_acquire(!X->on_cpu)"?

The comments are great, just try to understand your meaning here ;-)

Regards,
Boqun

> + */
> +
>  /**
>   * try_to_wake_up - wake up a thread
>   * @p: the thread to be awakened

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 2/4] sched: Document Program-Order guarantees
  2015-11-20 14:08       ` Boqun Feng
@ 2015-11-20 14:18         ` Peter Zijlstra
  2015-11-20 14:21           ` Boqun Feng
  0 siblings, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2015-11-20 14:18 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Paul Turner, Ingo Molnar, Oleg Nesterov, LKML, Paul McKenney,
	Jonathan Corbet, mhocko, dhowells, Linus Torvalds, will.deacon

On Fri, Nov 20, 2015 at 10:08:50PM +0800, Boqun Feng wrote:
> Hi Peter,
> 
> On Fri, Nov 20, 2015 at 11:02:30AM +0100, Peter Zijlstra wrote:
> [snip]
> > + *  BLOCKING -- aka. SLEEP + WAKEUP
> > + *
> > + * For blocking we (obviously) need to provide the same guarantee as for
> > + * migration. However the means are completely different as there is no lock
> > + * chain to provide order. Instead we do:
> > + *
> > + *   1) smp_store_release(X->on_cpu, 0)
> > + *   2) smp_cond_acquire(!X->on_cpu)
> > + *
> > + * Example:
> > + *
> > + *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (schedule)
> > + *
> > + *   LOCK rq(0)->lock LOCK X->pi_lock
> > + *   dequeue X
> > + *   sched-out X
> > + *   smp_store_release(X->on_cpu, 0);
> > + *
> > + *                    smp_cond_acquire(!X->on_cpu);
> > + *                    X->state = WAKING
> > + *                    set_task_cpu(X,2)
> > + *
> > + *                    LOCK rq(2)->lock
> > + *                    enqueue X
> > + *                    X->state = RUNNING
> > + *                    UNLOCK rq(2)->lock
> > + *
> > + *                                          LOCK rq(2)->lock // orders against CPU1
> > + *                                          sched-out Z
> > + *                                          sched-in X
> > + *                                          UNLOCK rq(1)->lock
> > + *
> > + *                    UNLOCK X->pi_lock
> > + *   UNLOCK rq(0)->lock
> > + *
> > + *
> > + * However; for wakeups there is a second guarantee we must provide, namely we
> > + * must observe the state that lead to our wakeup. That is, not only must our
> > + * task observe its own prior state, it must also observe the stores prior to
> > + * its wakeup.
> > + *
> > + * This means that any means of doing remote wakeups must order the CPU doing
> > + * the wakeup against the CPU the task is going to end up running on. This,
> > + * however, is already required for the regular Program-Order guarantee above,
> > + * since the waking CPU is the one issueing the ACQUIRE (2).
> > + *
> 
> Hope I'm the only one who got confused about the "2" in "ACQUIRE (2)",
> what does that refer? "2) smp_cond_acquire(!X->on_cpu)"?

Yes, exactly that. Would an unadorned 2 be clearer?

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 2/4] sched: Document Program-Order guarantees
  2015-11-20 14:18         ` Peter Zijlstra
@ 2015-11-20 14:21           ` Boqun Feng
  2015-11-20 19:41             ` Peter Zijlstra
  0 siblings, 1 reply; 78+ messages in thread
From: Boqun Feng @ 2015-11-20 14:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, Ingo Molnar, Oleg Nesterov, LKML, Paul McKenney,
	Jonathan Corbet, mhocko, dhowells, Linus Torvalds, will.deacon

[-- Attachment #1: Type: text/plain, Size: 2540 bytes --]

On Fri, Nov 20, 2015 at 03:18:12PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 20, 2015 at 10:08:50PM +0800, Boqun Feng wrote:
> > Hi Peter,
> > 
> > On Fri, Nov 20, 2015 at 11:02:30AM +0100, Peter Zijlstra wrote:
> > [snip]
> > > + *  BLOCKING -- aka. SLEEP + WAKEUP
> > > + *
> > > + * For blocking we (obviously) need to provide the same guarantee as for
> > > + * migration. However the means are completely different as there is no lock
> > > + * chain to provide order. Instead we do:
> > > + *
> > > + *   1) smp_store_release(X->on_cpu, 0)
> > > + *   2) smp_cond_acquire(!X->on_cpu)
> > > + *
> > > + * Example:
> > > + *
> > > + *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (schedule)
> > > + *
> > > + *   LOCK rq(0)->lock LOCK X->pi_lock
> > > + *   dequeue X
> > > + *   sched-out X
> > > + *   smp_store_release(X->on_cpu, 0);
> > > + *
> > > + *                    smp_cond_acquire(!X->on_cpu);
> > > + *                    X->state = WAKING
> > > + *                    set_task_cpu(X,2)
> > > + *
> > > + *                    LOCK rq(2)->lock
> > > + *                    enqueue X
> > > + *                    X->state = RUNNING
> > > + *                    UNLOCK rq(2)->lock
> > > + *
> > > + *                                          LOCK rq(2)->lock // orders against CPU1
> > > + *                                          sched-out Z
> > > + *                                          sched-in X
> > > + *                                          UNLOCK rq(1)->lock
> > > + *
> > > + *                    UNLOCK X->pi_lock
> > > + *   UNLOCK rq(0)->lock
> > > + *
> > > + *
> > > + * However; for wakeups there is a second guarantee we must provide, namely we
> > > + * must observe the state that lead to our wakeup. That is, not only must our
> > > + * task observe its own prior state, it must also observe the stores prior to
> > > + * its wakeup.
> > > + *
> > > + * This means that any means of doing remote wakeups must order the CPU doing
> > > + * the wakeup against the CPU the task is going to end up running on. This,
> > > + * however, is already required for the regular Program-Order guarantee above,
> > > + * since the waking CPU is the one issueing the ACQUIRE (2).
> > > + *
> > 
> > Hope I'm the only one who got confused about the "2" in "ACQUIRE (2)",
> > what does that refer? "2) smp_cond_acquire(!X->on_cpu)"?
> 
> Yes, exactly that. Would an unadorned 2 be clearer?

How about "the one issueing the ACQUIRE (smp_cond_acquire)"?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 2/4] sched: Document Program-Order guarantees
  2015-11-20 14:21           ` Boqun Feng
@ 2015-11-20 19:41             ` Peter Zijlstra
  0 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2015-11-20 19:41 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Paul Turner, Ingo Molnar, Oleg Nesterov, LKML, Paul McKenney,
	Jonathan Corbet, mhocko, dhowells, Linus Torvalds, will.deacon

On Fri, Nov 20, 2015 at 10:21:11PM +0800, Boqun Feng wrote:
> > Yes, exactly that. Would an unadorned 2 be clearer?
> 
> How about "the one issueing the ACQUIRE (smp_cond_acquire)"?

Sure, that works. Thanks


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/4] sched: Better document the try_to_wake_up() barriers
  2015-11-02 13:29 ` [PATCH 1/4] sched: Better document the try_to_wake_up() barriers Peter Zijlstra
@ 2015-12-04  0:09   ` Byungchul Park
  2015-12-04  0:58   ` Byungchul Park
  1 sibling, 0 replies; 78+ messages in thread
From: Byungchul Park @ 2015-12-04  0:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, linux-kernel, paulmck, boqun.feng, corbet, mhocko,
	dhowells, torvalds, will.deacon

On Mon, Nov 02, 2015 at 02:29:02PM +0100, Peter Zijlstra wrote:
> Explain how the control dependency and smp_rmb() end up providing
> ACQUIRE semantics and pair with smp_store_release() in
> finish_lock_switch().
> 
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c  |    8 +++++++-
>  kernel/sched/sched.h |    3 +++
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1947,7 +1947,13 @@ try_to_wake_up(struct task_struct *p, un
>  	while (p->on_cpu)
>  		cpu_relax();
>  	/*
> -	 * Pairs with the smp_wmb() in finish_lock_switch().
> +	 * Combined with the control dependency above, we have an effective
> +	 * smp_load_acquire() without the need for full barriers.
> +	 *
> +	 * Pairs with the smp_store_release() in finish_lock_switch().
> +	 *
> +	 * This ensures that tasks getting woken will be fully ordered against
> +	 * their previous state and preserve Program Order.
>  	 */
>  	smp_rmb();

Hello,

I am not sure, but just curious..

I thought it's good enough to use a kind of dependancy barrier such as
smp_read_barrier_depends() instead of smp_rmb() here unless something
e.g. compiler breaks the control dependacy. Right?

>  
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1073,6 +1073,9 @@ static inline void finish_lock_switch(st
>  	 * We must ensure this doesn't happen until the switch is completely
>  	 * finished.
>  	 *
> +	 * In particular, the load of prev->state in finish_task_switch() must
> +	 * happen before this.
> +	 *
>  	 * Pairs with the control dependency and rmb in try_to_wake_up().
>  	 */
>  	smp_store_release(&prev->on_cpu, 0);
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/4] sched: Better document the try_to_wake_up() barriers
  2015-11-02 13:29 ` [PATCH 1/4] sched: Better document the try_to_wake_up() barriers Peter Zijlstra
  2015-12-04  0:09   ` Byungchul Park
@ 2015-12-04  0:58   ` Byungchul Park
  1 sibling, 0 replies; 78+ messages in thread
From: Byungchul Park @ 2015-12-04  0:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, linux-kernel, paulmck, boqun.feng, corbet, mhocko,
	dhowells, torvalds, will.deacon

On Mon, Nov 02, 2015 at 02:29:02PM +0100, Peter Zijlstra wrote:
> Explain how the control dependency and smp_rmb() end up providing
> ACQUIRE semantics and pair with smp_store_release() in
> finish_lock_switch().
> 
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c  |    8 +++++++-
>  kernel/sched/sched.h |    3 +++
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1947,7 +1947,13 @@ try_to_wake_up(struct task_struct *p, un
>  	while (p->on_cpu)
>  		cpu_relax();
>  	/*
> -	 * Pairs with the smp_wmb() in finish_lock_switch().
> +	 * Combined with the control dependency above, we have an effective
> +	 * smp_load_acquire() without the need for full barriers.
> +	 *
> +	 * Pairs with the smp_store_release() in finish_lock_switch().
> +	 *
> +	 * This ensures that tasks getting woken will be fully ordered against
> +	 * their previous state and preserve Program Order.
>  	 */
>  	smp_rmb();

I am very sorry.

Please ignore my previous question :-(

>  
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1073,6 +1073,9 @@ static inline void finish_lock_switch(st
>  	 * We must ensure this doesn't happen until the switch is completely
>  	 * finished.
>  	 *
> +	 * In particular, the load of prev->state in finish_task_switch() must
> +	 * happen before this.
> +	 *
>  	 * Pairs with the control dependency and rmb in try_to_wake_up().
>  	 */
>  	smp_store_release(&prev->on_cpu, 0);
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 78+ messages in thread

end of thread, other threads:[~2015-12-04  0:58 UTC | newest]

Thread overview: 78+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-02 13:29 [PATCH 0/4] scheduler ordering bits Peter Zijlstra
2015-11-02 13:29 ` [PATCH 1/4] sched: Better document the try_to_wake_up() barriers Peter Zijlstra
2015-12-04  0:09   ` Byungchul Park
2015-12-04  0:58   ` Byungchul Park
2015-11-02 13:29 ` [PATCH 2/4] sched: Document Program-Order guarantees Peter Zijlstra
2015-11-02 20:27   ` Paul Turner
2015-11-02 20:34     ` Peter Zijlstra
2015-11-02 22:09       ` Paul Turner
2015-11-02 22:12         ` Peter Zijlstra
2015-11-20 10:02     ` Peter Zijlstra
2015-11-20 14:08       ` Boqun Feng
2015-11-20 14:18         ` Peter Zijlstra
2015-11-20 14:21           ` Boqun Feng
2015-11-20 19:41             ` Peter Zijlstra
2015-11-02 13:29 ` [PATCH 3/4] sched: Fix a race in try_to_wake_up() vs schedule() Peter Zijlstra
2015-11-02 13:29 ` [PATCH 4/4] locking: Introduce smp_cond_acquire() Peter Zijlstra
2015-11-02 13:57   ` Peter Zijlstra
2015-11-02 17:43     ` Will Deacon
2015-11-03  1:14       ` Paul E. McKenney
2015-11-03  1:25         ` Linus Torvalds
2015-11-02 17:42   ` Will Deacon
2015-11-02 18:08   ` Linus Torvalds
2015-11-02 18:37     ` Will Deacon
2015-11-02 19:17       ` Linus Torvalds
2015-11-02 19:57         ` Will Deacon
2015-11-02 20:23           ` Peter Zijlstra
2015-11-02 21:56         ` Peter Zijlstra
2015-11-03  1:57         ` Paul E. McKenney
2015-11-03 19:40           ` Linus Torvalds
2015-11-04  3:57             ` Paul E. McKenney
2015-11-04  4:43               ` Linus Torvalds
2015-11-04 12:54                 ` Paul E. McKenney
2015-11-02 20:36   ` David Howells
2015-11-02 20:40     ` Peter Zijlstra
2015-11-02 21:11     ` Linus Torvalds
2015-11-03 17:59   ` Oleg Nesterov
2015-11-03 18:23     ` Peter Zijlstra
2015-11-11  9:39     ` Boqun Feng
2015-11-11 10:34       ` Boqun Feng
2015-11-11 19:53         ` Oleg Nesterov
2015-11-12 13:50         ` Paul E. McKenney
2015-11-11 12:12       ` Peter Zijlstra
2015-11-11 19:39         ` Oleg Nesterov
2015-11-11 21:23           ` Linus Torvalds
2015-11-12  7:14           ` Boqun Feng
2015-11-12 10:28             ` Peter Zijlstra
2015-11-12 15:00             ` Oleg Nesterov
2015-11-12 14:40               ` Paul E. McKenney
2015-11-12 14:49                 ` Boqun Feng
2015-11-12 15:02                   ` Paul E. McKenney
2015-11-12 21:53                     ` Will Deacon
2015-11-12 14:50                 ` Peter Zijlstra
2015-11-12 15:01                   ` Paul E. McKenney
2015-11-12 15:08                     ` Peter Zijlstra
2015-11-12 15:20                       ` Paul E. McKenney
2015-11-12 21:25                         ` Will Deacon
2015-11-12 15:18               ` Boqun Feng
2015-11-12 18:38                 ` Oleg Nesterov
2015-11-12 18:02                   ` Peter Zijlstra
2015-11-12 19:33                     ` Oleg Nesterov
2015-11-12 18:59                       ` Paul E. McKenney
2015-11-12 21:33                         ` Will Deacon
2015-11-12 23:43                           ` Paul E. McKenney
2015-11-16 13:58                             ` Will Deacon
2015-11-12 18:21             ` Linus Torvalds
2015-11-12 22:09               ` Will Deacon
2015-11-16 15:56               ` Peter Zijlstra
2015-11-16 16:04                 ` Peter Zijlstra
2015-11-16 16:24                   ` Will Deacon
2015-11-16 16:44                     ` Paul E. McKenney
2015-11-16 16:46                       ` Will Deacon
2015-11-16 17:15                         ` Paul E. McKenney
2015-11-16 21:58                     ` Linus Torvalds
2015-11-17 11:51                       ` Will Deacon
2015-11-17 21:01                         ` Paul E. McKenney
2015-11-18 11:25                           ` Will Deacon
2015-11-19 18:01                             ` Will Deacon
2015-11-20 10:09                               ` Peter Zijlstra

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.