* [PATCH 0/4] scheduler ordering bits -v2
@ 2015-12-03 12:40 Peter Zijlstra
  2015-12-03 12:40 ` [PATCH 1/4] sched: Better document the try_to_wake_up() barriers Peter Zijlstra
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Peter Zijlstra @ 2015-12-03 12:40 UTC (permalink / raw)
  To: mingo, oleg
  Cc: linux-kernel, peterz, paulmck, boqun.feng, corbet, mhocko,
	dhowells, torvalds, will.deacon, waiman.long, pjt

Hi,

These are updates of the pending scheduler ordering patches, taking (hopefully)
all prior comments into consideration.

I've queued them for merging, but please have a careful last look at them.

Thanks!


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/4] sched: Better document the try_to_wake_up() barriers
  2015-12-03 12:40 [PATCH 0/4] scheduler ordering bits -v2 Peter Zijlstra
@ 2015-12-03 12:40 ` Peter Zijlstra
  2015-12-03 12:40 ` [PATCH 2/4] sched: Fix a race in try_to_wake_up() vs schedule() Peter Zijlstra
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2015-12-03 12:40 UTC (permalink / raw)
  To: mingo, oleg
  Cc: linux-kernel, peterz, paulmck, boqun.feng, corbet, mhocko,
	dhowells, torvalds, will.deacon, waiman.long, pjt

[-- Attachment #1: peterz-sched-ttwu-comment.patch --]
[-- Type: text/plain, Size: 1408 bytes --]

Explain how the control dependency and smp_rmb() end up providing
ACQUIRE semantics and pair with smp_store_release() in
finish_lock_switch().
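
As a minimal sketch of the pattern (hypothetical 'flag' and 'data' variables
standing in for ->on_cpu and the task's prior state; not the actual scheduler
code):

static int data, flag;

static void release_side(void)		/* cf. finish_lock_switch() */
{
	data = 1;			/* stores before the release ... */
	smp_store_release(&flag, 1);	/* ... are ordered before this store */
}

static void acquire_side(void)		/* cf. try_to_wake_up() */
{
	while (!READ_ONCE(flag))	/* control dependency on the loaded value */
		cpu_relax();
	smp_rmb();			/* ctrl + rmb := acquire */
	BUG_ON(data != 1);		/* the store to data is guaranteed to be visible */
}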

Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  |    8 +++++++-
 kernel/sched/sched.h |    3 +++
 2 files changed, 10 insertions(+), 1 deletion(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1947,7 +1947,13 @@ try_to_wake_up(struct task_struct *p, un
 	while (p->on_cpu)
 		cpu_relax();
 	/*
-	 * Pairs with the smp_wmb() in finish_lock_switch().
+	 * Combined with the control dependency above, we have an effective
+	 * smp_load_acquire() without the need for full barriers.
+	 *
+	 * Pairs with the smp_store_release() in finish_lock_switch().
+	 *
+	 * This ensures that tasks getting woken will be fully ordered against
+	 * their previous state and preserve Program Order.
 	 */
 	smp_rmb();
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1073,6 +1073,9 @@ static inline void finish_lock_switch(st
 	 * We must ensure this doesn't happen until the switch is completely
 	 * finished.
 	 *
+	 * In particular, the load of prev->state in finish_task_switch() must
+	 * happen before this.
+	 *
 	 * Pairs with the control dependency and rmb in try_to_wake_up().
 	 */
 	smp_store_release(&prev->on_cpu, 0);



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 2/4] sched: Fix a race in try_to_wake_up() vs schedule()
  2015-12-03 12:40 [PATCH 0/4] scheduler ordering bits -v2 Peter Zijlstra
  2015-12-03 12:40 ` [PATCH 1/4] sched: Better document the try_to_wake_up() barriers Peter Zijlstra
@ 2015-12-03 12:40 ` Peter Zijlstra
  2015-12-03 12:40 ` [PATCH 3/4] locking: Introduce smp_cond_acquire() Peter Zijlstra
  2015-12-03 12:40 ` [PATCH 4/4] sched: Document Program-Order guarantees Peter Zijlstra
  3 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2015-12-03 12:40 UTC (permalink / raw)
  To: mingo, oleg
  Cc: linux-kernel, peterz, paulmck, boqun.feng, corbet, mhocko,
	dhowells, torvalds, will.deacon, waiman.long, pjt

[-- Attachment #1: peterz-sched-fix-ttwu-race.patch --]
[-- Type: text/plain, Size: 2503 bytes --]

Oleg noticed that it's possible to falsely observe p->on_cpu == 0, such
that we'll prematurely continue with the wakeup and effectively run p on
two CPUs at the same time.

Even though the overlap is very limited (the task is in the middle of
being scheduled out), it could still result in corruption of the
scheduler data structures.


        CPU0                            CPU1

        set_current_state(...)

        <preempt_schedule>
          context_switch(X, Y)
            prepare_lock_switch(Y)
              Y->on_cpu = 1;
            finish_lock_switch(X)
              store_release(X->on_cpu, 0);

                                        try_to_wake_up(X)
                                          LOCK(p->pi_lock);

                                          t = X->on_cpu; // 0

          context_switch(Y, X)
            prepare_lock_switch(X)
              X->on_cpu = 1;
            finish_lock_switch(Y)
              store_release(Y->on_cpu, 0);
        </preempt_schedule>

        schedule();
          deactivate_task(X);
          X->on_rq = 0;

                                          if (X->on_rq) // false

                                          if (t) while (X->on_cpu)
                                            cpu_relax();

          context_switch(X, ..)
            finish_lock_switch(X)
              store_release(X->on_cpu, 0);


Avoid the load of X->on_cpu being hoisted over the X->on_rq load.
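
Expressed as the two loads on the waker side (a simplified sketch, not the
literal try_to_wake_up() code):

	r1 = READ_ONCE(X->on_rq);	/* [L] ->on_rq */
	smp_rmb();			/* the fix: keep the next load after this one */
	r2 = READ_ONCE(X->on_cpu);	/* [L] ->on_cpu; without the rmb this load could
					 * be hoisted and return a stale 0 while X is
					 * still being scheduled out on its old CPU */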

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2084,6 +2084,25 @@ try_to_wake_up(struct task_struct *p, un
 
 #ifdef CONFIG_SMP
 	/*
+	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
+	 * possible to, falsely, observe p->on_cpu == 0.
+	 *
+	 * One must be running (->on_cpu == 1) in order to remove oneself
+	 * from the runqueue.
+	 *
+	 *  [S] ->on_cpu = 1;	[L] ->on_rq
+	 *      UNLOCK rq->lock
+	 *			RMB
+	 *      LOCK   rq->lock
+	 *  [S] ->on_rq = 0;    [L] ->on_cpu
+	 *
+	 * Pairs with the full barrier implied in the UNLOCK+LOCK on rq->lock
+	 * from the consecutive calls to schedule(); the first switching to our
+	 * task, the second putting it to sleep.
+	 */
+	smp_rmb();
+
+	/*
 	 * If the owning (remote) cpu is still in the middle of schedule() with
 	 * this task as prev, wait until its done referencing the task.
 	 */



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 3/4] locking: Introduce smp_cond_acquire()
  2015-12-03 12:40 [PATCH 0/4] scheduler ordering bits -v2 Peter Zijlstra
  2015-12-03 12:40 ` [PATCH 1/4] sched: Better document the try_to_wake_up() barriers Peter Zijlstra
  2015-12-03 12:40 ` [PATCH 2/4] sched: Fix a race in try_to_wake_up() vs schedule() Peter Zijlstra
@ 2015-12-03 12:40 ` Peter Zijlstra
  2015-12-03 16:37   ` Will Deacon
  2015-12-03 19:41   ` Davidlohr Bueso
  2015-12-03 12:40 ` [PATCH 4/4] sched: Document Program-Order guarantees Peter Zijlstra
  3 siblings, 2 replies; 21+ messages in thread
From: Peter Zijlstra @ 2015-12-03 12:40 UTC (permalink / raw)
  To: mingo, oleg
  Cc: linux-kernel, peterz, paulmck, boqun.feng, corbet, mhocko,
	dhowells, torvalds, will.deacon, waiman.long, pjt

[-- Attachment #1: peterz-smp_cond_acquire.patch --]
[-- Type: text/plain, Size: 2919 bytes --]

Introduce smp_cond_acquire() which combines a control dependency and a
read barrier to form acquire semantics.

This primitive has two benefits:
 - it documents control dependencies,
 - it's typically cheaper than using smp_load_acquire() in a loop.
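
A rough sketch of the second point ('flag' being a stand-in variable, not
real kernel code):

	/* before: a full ACQUIRE on every iteration of the spin loop */
	while (!smp_load_acquire(&flag))
		cpu_relax();

	/* after: plain loads while spinning, a single rmb once the condition holds */
	smp_cond_acquire(READ_ONCE(flag));
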

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/compiler.h   |   17 +++++++++++++++++
 kernel/locking/qspinlock.c |    3 +--
 kernel/sched/core.c        |    8 +-------
 kernel/sched/sched.h       |    2 +-
 4 files changed, 20 insertions(+), 10 deletions(-)

--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -299,6 +299,23 @@ static __always_inline void __write_once
 	__u.__val;					\
 })
 
+/**
+ * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
+ * @cond: boolean expression to wait for
+ *
+ * Equivalent to using smp_load_acquire() on the condition variable but employs
+ * the control dependency of the wait to reduce the barrier on many platforms.
+ *
+ * The control dependency provides a LOAD->STORE order, the additional RMB
+ * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
+ * aka. ACQUIRE.
+ */
+#define smp_cond_acquire(cond)	do {		\
+	while (!(cond))				\
+		cpu_relax();			\
+	smp_rmb(); /* ctrl + rmb := acquire */	\
+} while (0)
+
 #endif /* __KERNEL__ */
 
 #endif /* __ASSEMBLY__ */
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -446,8 +446,7 @@ void queued_spin_lock_slowpath(struct qs
 	if ((val = pv_wait_head_or_lock(lock, node)))
 		goto locked;
 
-	while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_PENDING_MASK)
-		cpu_relax();
+	smp_cond_acquire(!((val = atomic_read(&lock->val)) & _Q_LOCKED_PENDING_MASK));
 
 locked:
 	/*
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1979,19 +1979,13 @@ try_to_wake_up(struct task_struct *p, un
 	/*
 	 * If the owning (remote) cpu is still in the middle of schedule() with
 	 * this task as prev, wait until its done referencing the task.
-	 */
-	while (p->on_cpu)
-		cpu_relax();
-	/*
-	 * Combined with the control dependency above, we have an effective
-	 * smp_load_acquire() without the need for full barriers.
 	 *
 	 * Pairs with the smp_store_release() in finish_lock_switch().
 	 *
 	 * This ensures that tasks getting woken will be fully ordered against
 	 * their previous state and preserve Program Order.
 	 */
-	smp_rmb();
+	smp_cond_acquire(!p->on_cpu);
 
 	p->sched_contributes_to_load = !!task_contributes_to_load(p);
 	p->state = TASK_WAKING;
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1079,7 +1079,7 @@ static inline void finish_lock_switch(st
 	 * In particular, the load of prev->state in finish_task_switch() must
 	 * happen before this.
 	 *
-	 * Pairs with the control dependency and rmb in try_to_wake_up().
+	 * Pairs with the smp_cond_acquire() in try_to_wake_up().
 	 */
 	smp_store_release(&prev->on_cpu, 0);
 #endif



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 4/4] sched: Document Program-Order guarantees
  2015-12-03 12:40 [PATCH 0/4] scheduler ordering bits -v2 Peter Zijlstra
                   ` (2 preceding siblings ...)
  2015-12-03 12:40 ` [PATCH 3/4] locking: Introduce smp_cond_acquire() Peter Zijlstra
@ 2015-12-03 12:40 ` Peter Zijlstra
  2015-12-03 13:16   ` Boqun Feng
  3 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2015-12-03 12:40 UTC (permalink / raw)
  To: mingo, oleg
  Cc: linux-kernel, peterz, paulmck, boqun.feng, corbet, mhocko,
	dhowells, torvalds, will.deacon, waiman.long, pjt

[-- Attachment #1: peterz-sched-doc-program_order-v2.patch --]
[-- Type: text/plain, Size: 4050 bytes --]

These are some notes on the scheduler locking and how it provides
program order guarantees on SMP systems.

Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   91 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 91 insertions(+)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1916,6 +1916,97 @@ static void ttwu_queue(struct task_struc
 	raw_spin_unlock(&rq->lock);
 }
 
+/*
+ * Notes on Program-Order guarantees on SMP systems.
+ *
+ *  MIGRATION
+ *
+ * The basic program-order guarantee on SMP systems is that when a task [t]
+ * migrates, all its activity on its old cpu [c0] happens-before any subsequent
+ * execution on its new cpu [c1].
+ *
+ * For migration (of runnable tasks) this is provided by the following means:
+ *
+ *  A) UNLOCK of the rq(c0)->lock scheduling out task t
+ *  B) migration for t is required to synchronize *both* rq(c0)->lock and
+ *     rq(c1)->lock (if not at the same time, then in that order).
+ *  C) LOCK of the rq(c1)->lock scheduling in task
+ *
+ * Transitivity guarantees that B happens after A and C after B.
+ * Note: we only require RCpc transitivity.
+ * Note: the cpu doing B need not be c0 or c1
+ *
+ * Example:
+ *
+ *   CPU0            CPU1            CPU2
+ *
+ *   LOCK rq(0)->lock
+ *   sched-out X
+ *   sched-in Y
+ *   UNLOCK rq(0)->lock
+ *
+ *                                   LOCK rq(0)->lock // orders against CPU0
+ *                                   dequeue X
+ *                                   UNLOCK rq(0)->lock
+ *
+ *                                   LOCK rq(1)->lock
+ *                                   enqueue X
+ *                                   UNLOCK rq(1)->lock
+ *
+ *                   LOCK rq(1)->lock // orders against CPU2
+ *                   sched-out Z
+ *                   sched-in X
+ *                   UNLOCK rq(1)->lock
+ *
+ *
+ *  BLOCKING -- aka. SLEEP + WAKEUP
+ *
+ * For blocking we (obviously) need to provide the same guarantee as for
+ * migration. However the means are completely different as there is no lock
+ * chain to provide order. Instead we do:
+ *
+ *   1) smp_store_release(X->on_cpu, 0)
+ *   2) smp_cond_acquire(!X->on_cpu)
+ *
+ * Example:
+ *
+ *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (schedule)
+ *
+ *   LOCK rq(0)->lock LOCK X->pi_lock
+ *   dequeue X
+ *   sched-out X
+ *   smp_store_release(X->on_cpu, 0);
+ *
+ *                    smp_cond_acquire(!X->on_cpu);
+ *                    X->state = WAKING
+ *                    set_task_cpu(X,2)
+ *
+ *                    LOCK rq(2)->lock
+ *                    enqueue X
+ *                    X->state = RUNNING
+ *                    UNLOCK rq(2)->lock
+ *
+ *                                          LOCK rq(2)->lock // orders against CPU1
+ *                                          sched-out Z
+ *                                          sched-in X
+ *                                          UNLOCK rq(1)->lock
+ *
+ *                    UNLOCK X->pi_lock
+ *   UNLOCK rq(0)->lock
+ *
+ *
+ * However, for wakeups there is a second guarantee we must provide, namely we
+ * must observe the state that led to our wakeup. That is, not only must our
+ * task observe its own prior state, it must also observe the stores prior to
+ * its wakeup.
+ *
+ * This means that any means of doing remote wakeups must order the CPU doing
+ * the wakeup against the CPU the task is going to end up running on. This,
+ * however, is already required for the regular Program-Order guarantee above,
+ * since the waking CPU is the one issuing the ACQUIRE (smp_cond_acquire).
+ *
+ */
+
 /**
  * try_to_wake_up - wake up a thread
  * @p: the thread to be awakened
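
Condensed, the BLOCKING case above boils down to the following pairing (a
sketch only; locking and most details omitted):

	/* CPU0: schedule() putting X to sleep (simplified) */
	/* ... dequeue X, context-switch away from X ... */
	smp_store_release(&X->on_cpu, 0);	/* publishes all of X's activity on CPU0 */

	/* CPU1: try_to_wake_up(X) (simplified) */
	smp_cond_acquire(!X->on_cpu);		/* pairs with the release above */
	X->state = TASK_WAKING;			/* thus ordered after X's CPU0 activity */
	set_task_cpu(X, 2);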



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/4] sched: Document Program-Order guarantees
  2015-12-03 12:40 ` [PATCH 4/4] sched: Document Program-Order guarantees Peter Zijlstra
@ 2015-12-03 13:16   ` Boqun Feng
  2015-12-03 13:29     ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Boqun Feng @ 2015-12-03 13:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, linux-kernel, paulmck, corbet, mhocko, dhowells,
	torvalds, will.deacon, waiman.long, pjt

[-- Attachment #1: Type: text/plain, Size: 1043 bytes --]

On Thu, Dec 03, 2015 at 01:40:14PM +0100, Peter Zijlstra wrote:
[snip]
> + *
> + *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (schedule)
> + *
> + *   LOCK rq(0)->lock LOCK X->pi_lock
> + *   dequeue X
> + *   sched-out X
> + *   smp_store_release(X->on_cpu, 0);
> + *
> + *                    smp_cond_acquire(!X->on_cpu);
> + *                    X->state = WAKING
> + *                    set_task_cpu(X,2)
> + *
> + *                    LOCK rq(2)->lock
> + *                    enqueue X
> + *                    X->state = RUNNING
> + *                    UNLOCK rq(2)->lock
> + *
> + *                                          LOCK rq(2)->lock // orders against CPU1
> + *                                          sched-out Z
> + *                                          sched-in X
> + *                                          UNLOCK rq(1)->lock
						^^^^^^^^^^^^^^^

This is a typo, right? Should be "UNLOCK rq(2)->lock".

Regards,
Boqun

> + *
> + *                    UNLOCK X->pi_lock
> + *   UNLOCK rq(0)->lock
> + *
> + *

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/4] sched: Document Program-Order guarantees
  2015-12-03 13:16   ` Boqun Feng
@ 2015-12-03 13:29     ` Peter Zijlstra
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2015-12-03 13:29 UTC (permalink / raw)
  To: Boqun Feng
  Cc: mingo, oleg, linux-kernel, paulmck, corbet, mhocko, dhowells,
	torvalds, will.deacon, waiman.long, pjt

On Thu, Dec 03, 2015 at 09:16:48PM +0800, Boqun Feng wrote:
> On Thu, Dec 03, 2015 at 01:40:14PM +0100, Peter Zijlstra wrote:
> [snip]
> > + *
> > + *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (schedule)
> > + *
> > + *   LOCK rq(0)->lock LOCK X->pi_lock
> > + *   dequeue X
> > + *   sched-out X
> > + *   smp_store_release(X->on_cpu, 0);
> > + *
> > + *                    smp_cond_acquire(!X->on_cpu);
> > + *                    X->state = WAKING
> > + *                    set_task_cpu(X,2)
> > + *
> > + *                    LOCK rq(2)->lock
> > + *                    enqueue X
> > + *                    X->state = RUNNING
> > + *                    UNLOCK rq(2)->lock
> > + *
> > + *                                          LOCK rq(2)->lock // orders against CPU1
> > + *                                          sched-out Z
> > + *                                          sched-in X
> > + *                                          UNLOCK rq(1)->lock
> 						^^^^^^^^^^^^^^^
> 
> This is a typo, right? Should be "UNLOCK rq(2)->lock".

Duh, yes. Fixed.

Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] locking: Introduce smp_cond_acquire()
  2015-12-03 12:40 ` [PATCH 3/4] locking: Introduce smp_cond_acquire() Peter Zijlstra
@ 2015-12-03 16:37   ` Will Deacon
  2015-12-03 20:26     ` Peter Zijlstra
  2015-12-03 19:41   ` Davidlohr Bueso
  1 sibling, 1 reply; 21+ messages in thread
From: Will Deacon @ 2015-12-03 16:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, linux-kernel, paulmck, boqun.feng, corbet, mhocko,
	dhowells, torvalds, waiman.long, pjt

Hi Peter,

On Thu, Dec 03, 2015 at 01:40:13PM +0100, Peter Zijlstra wrote:
> Introduce smp_cond_acquire() which combines a control dependency and a
> read barrier to form acquire semantics.
> 
> This primitive has two benefits:
>  - it documents control dependencies,
>  - it's typically cheaper than using smp_load_acquire() in a loop.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/compiler.h   |   17 +++++++++++++++++
>  kernel/locking/qspinlock.c |    3 +--
>  kernel/sched/core.c        |    8 +-------
>  kernel/sched/sched.h       |    2 +-
>  4 files changed, 20 insertions(+), 10 deletions(-)
> 
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -299,6 +299,23 @@ static __always_inline void __write_once
>  	__u.__val;					\
>  })
>  
> +/**
> + * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
> + * @cond: boolean expression to wait for
> + *
> + * Equivalent to using smp_load_acquire() on the condition variable but employs
> + * the control dependency of the wait to reduce the barrier on many platforms.
> + *
> + * The control dependency provides a LOAD->STORE order, the additional RMB
> + * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
> + * aka. ACQUIRE.
> + */
> +#define smp_cond_acquire(cond)	do {		\
> +	while (!(cond))				\
> +		cpu_relax();			\
> +	smp_rmb(); /* ctrl + rmb := acquire */	\
> +} while (0)
> +
>  #endif /* __KERNEL__ */
>  
>  #endif /* __ASSEMBLY__ */
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -446,8 +446,7 @@ void queued_spin_lock_slowpath(struct qs
>  	if ((val = pv_wait_head_or_lock(lock, node)))
>  		goto locked;
>  
> -	while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_PENDING_MASK)
> -		cpu_relax();
> +	smp_cond_acquire(!((val = atomic_read(&lock->val)) & _Q_LOCKED_PENDING_MASK));

I think we spoke about this before, but what would work really well for
arm64 here is if we could override smp_cond_acquire in such a way that
the atomic_read could be performed explicitly in the macro. That would
allow us to use an LDXR to set the exclusive monitor, which in turn
means we can issue a WFE and get a cheap wakeup when lock->val is
actually modified.

With the current scheme, there's not enough information expressed in the
"cond" parameter to perform this optimisation.

Cheers,

Will

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] locking: Introduce smp_cond_acquire()
  2015-12-03 12:40 ` [PATCH 3/4] locking: Introduce smp_cond_acquire() Peter Zijlstra
  2015-12-03 16:37   ` Will Deacon
@ 2015-12-03 19:41   ` Davidlohr Bueso
  2015-12-03 20:31     ` Peter Zijlstra
  1 sibling, 1 reply; 21+ messages in thread
From: Davidlohr Bueso @ 2015-12-03 19:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, linux-kernel, paulmck, boqun.feng, corbet, mhocko,
	dhowells, torvalds, will.deacon, waiman.long, pjt

On Thu, 03 Dec 2015, Peter Zijlstra wrote:

>+/**
>+ * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
>+ * @cond: boolean expression to wait for
>+ *
>+ * Equivalent to using smp_load_acquire() on the condition variable but employs
>+ * the control dependency of the wait to reduce the barrier on many platforms.
>+ *
>+ * The control dependency provides a LOAD->STORE order, the additional RMB
>+ * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
>+ * aka. ACQUIRE.
>+ */
>+#define smp_cond_acquire(cond)	do {		\
>+	while (!(cond))				\
>+		cpu_relax();			\
>+	smp_rmb(); /* ctrl + rmb := acquire */	\
>+} while (0)

So this hides the fact that we actually are waiting on the cond, as opposed
to conditionally acquiring. Could it be renamed to something like smp_waitcond_acquire()?

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] locking: Introduce smp_cond_acquire()
  2015-12-03 16:37   ` Will Deacon
@ 2015-12-03 20:26     ` Peter Zijlstra
  2015-12-03 21:16       ` Peter Zijlstra
                         ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Peter Zijlstra @ 2015-12-03 20:26 UTC (permalink / raw)
  To: Will Deacon
  Cc: mingo, oleg, linux-kernel, paulmck, boqun.feng, corbet, mhocko,
	dhowells, torvalds, waiman.long, pjt

On Thu, Dec 03, 2015 at 04:37:26PM +0000, Will Deacon wrote:
> > +#define smp_cond_acquire(cond)	do {		\
> > +	while (!(cond))				\
> > +		cpu_relax();			\
> > +	smp_rmb(); /* ctrl + rmb := acquire */	\
> > +} while (0)

> > +	smp_cond_acquire(!((val = atomic_read(&lock->val)) & _Q_LOCKED_PENDING_MASK));
> 
> I think we spoke about this before, but what would work really well for
> arm64 here is if we could override smp_cond_acquire in such a way that
> the atomic_read could be performed explicitly in the macro. That would
> allow us to use an LDXR to set the exclusive monitor, which in turn
> means we can issue a WFE and get a cheap wakeup when lock->val is
> actually modified.
> 
> With the current scheme, there's not enough information expressed in the
> "cond" parameter to perform this optimisation.

Right, but I'm having a hard time constructing something pretty that can
do that. Lambda functions would be lovely, but we don't have those :/

While we can easily pass a pointer to an arbitrary type, we need
an expression to evaluate the result of the pointer load to act as our
condition.

  smp_cond_acquire(&lock->val.counter,
		   [](int val){ return !(val & _Q_LOCKED_PENDING_MASK); });

Would be nice, but alas.

The best we can do is hardcode a variable name; maybe something like:

#define smp_cond_acquire(ptr, expr) do {			\
	typeof(*ptr) val;					\
	while ((val = READ_ONCE(*ptr)), expr)			\
		cpu_relax();					\
	smp_rmb(); /* ctrl + rmb := acquire */			\
} while (0)

Which would let us write:

  smp_cond_acquire(&lock->val.counter, !(val & _Q_LOCKED_PENDING_MASK));


Thoughts?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] locking: Introduce smp_cond_acquire()
  2015-12-03 19:41   ` Davidlohr Bueso
@ 2015-12-03 20:31     ` Peter Zijlstra
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2015-12-03 20:31 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, oleg, linux-kernel, paulmck, boqun.feng, corbet, mhocko,
	dhowells, torvalds, will.deacon, waiman.long, pjt

On Thu, Dec 03, 2015 at 11:41:39AM -0800, Davidlohr Bueso wrote:

> >+#define smp_cond_acquire(cond)	do {		\
> >+	while (!(cond))				\
> >+		cpu_relax();			\
> >+	smp_rmb(); /* ctrl + rmb := acquire */	\
> >+} while (0)
> 
> So this hides the fact that we actually are waiting on the cond, as opposed
> to conditionally acquiring. Could it be renamed to something like smp_waitcond_acquire()?

Right, I'm conflicted about that. On the one hand you're right; on the
other hand we spin-wait, so the next person will want it called
smp_spin_wait_cond_acquire(), and it gets terribly long either way :/

bike-shed away I imagine.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] locking: Introduce smp_cond_acquire()
  2015-12-03 20:26     ` Peter Zijlstra
@ 2015-12-03 21:16       ` Peter Zijlstra
  2015-12-04 14:57       ` Will Deacon
  2015-12-04 20:51       ` Waiman Long
  2 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2015-12-03 21:16 UTC (permalink / raw)
  To: Will Deacon
  Cc: mingo, oleg, linux-kernel, paulmck, boqun.feng, corbet, mhocko,
	dhowells, torvalds, waiman.long, pjt

On Thu, Dec 03, 2015 at 09:26:27PM +0100, Peter Zijlstra wrote:
> The best we can do is hardcode a variable name; maybe something like:
> 
> #define smp_cond_acquire(ptr, expr) do {			\
> 	typeof(*ptr) val;					\
> 	while ((val = READ_ONCE(*ptr)), expr)			\
> 		cpu_relax();					\
> 	smp_rmb(); /* ctrl + rmb := acquire */			\
> } while (0)
> 
> Which would let us write:
> 
>   smp_cond_acquire(&lock->val.counter, !(val & _Q_LOCKED_PENDING_MASK));

So I'm thinking that might still be hard to use for you if the LDXR+WFE
has the regular LL/SC constraint of not allowing other loads in.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] locking: Introduce smp_cond_acquire()
  2015-12-03 20:26     ` Peter Zijlstra
  2015-12-03 21:16       ` Peter Zijlstra
@ 2015-12-04 14:57       ` Will Deacon
  2015-12-04 20:51       ` Waiman Long
  2 siblings, 0 replies; 21+ messages in thread
From: Will Deacon @ 2015-12-04 14:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, linux-kernel, paulmck, boqun.feng, corbet, mhocko,
	dhowells, torvalds, waiman.long, pjt

On Thu, Dec 03, 2015 at 09:26:27PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 03, 2015 at 04:37:26PM +0000, Will Deacon wrote:
> > > +#define smp_cond_acquire(cond)	do {		\
> > > +	while (!(cond))				\
> > > +		cpu_relax();			\
> > > +	smp_rmb(); /* ctrl + rmb := acquire */	\
> > > +} while (0)
> 
> > > +	smp_cond_acquire(!((val = atomic_read(&lock->val)) & _Q_LOCKED_PENDING_MASK));
> > 
> > I think we spoke about this before, but what would work really well for
> > arm64 here is if we could override smp_cond_acquire in such a way that
> > the atomic_read could be performed explicitly in the macro. That would
> > allow us to use an LDXR to set the exclusive monitor, which in turn
> > means we can issue a WFE and get a cheap wakeup when lock->val is
> > actually modified.
> > 
> > With the current scheme, there's not enough information expressed in the
> > "cond" parameter to perform this optimisation.
> 
> Right, but I'm having a hard time constructing something pretty that can
> do that. Lambda functions would be lovely, but we don't have those :/
> 
> While we can easily pass a pointer to an arbitrary type, we need
> an expression to evaluate the result of the pointer load to act as our
> condition.
> 
>   smp_cond_acquire(&lock->val.counter,
> 		   [](int val){ return !(val & _Q_LOCKED_PENDING_MASK); });
> 
> Would be nice, but alas.
> 
> The best we can do is hardcode a variable name; maybe something like:
> 
> #define smp_cond_acquire(ptr, expr) do {			\
> 	typeof(*ptr) val;					\
> 	while ((val = READ_ONCE(*ptr)), expr)			\
> 		cpu_relax();					\
> 	smp_rmb(); /* ctrl + rmb := acquire */			\
> } while (0)
> 
> Which would let us write:
> 
>   smp_cond_acquire(&lock->val.counter, !(val & _Q_LOCKED_PENDING_MASK));
> 
> 
> Thoughts?

That would certainly work for me, but I appreciate it's not pretty. We
could have an extra macro parameter for the name of the temporary
variable, if you wanted to make it explicit.

Will

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] locking: Introduce smp_cond_acquire()
  2015-12-03 20:26     ` Peter Zijlstra
  2015-12-03 21:16       ` Peter Zijlstra
  2015-12-04 14:57       ` Will Deacon
@ 2015-12-04 20:51       ` Waiman Long
  2015-12-04 22:05         ` Linus Torvalds
  2 siblings, 1 reply; 21+ messages in thread
From: Waiman Long @ 2015-12-04 20:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, mingo, oleg, linux-kernel, paulmck, boqun.feng,
	corbet, mhocko, dhowells, torvalds, pjt

On 12/03/2015 03:26 PM, Peter Zijlstra wrote:
> On Thu, Dec 03, 2015 at 04:37:26PM +0000, Will Deacon wrote:
>>> +#define smp_cond_acquire(cond)	do {		\
>>> +	while (!(cond))				\
>>> +		cpu_relax();			\
>>> +	smp_rmb(); /* ctrl + rmb := acquire */	\
>>> +} while (0)
>>> +	smp_cond_acquire(!((val = atomic_read(&lock->val)) & _Q_LOCKED_PENDING_MASK));
>> I think we spoke about this before, but what would work really well for
>> arm64 here is if we could override smp_cond_acquire in such a way that
>> the atomic_read could be performed explicitly in the macro. That would
>> allow us to use an LDXR to set the exclusive monitor, which in turn
>> means we can issue a WFE and get a cheap wakeup when lock->val is
>> actually modified.
>>
>> With the current scheme, there's not enough information expressed in the
>> "cond" parameter to perform this optimisation.
> Right, but I'm having a hard time constructing something pretty that can
> do that. Lambda functions would be lovely, but we don't have those :/
>
> While we can easily pass a pointer to an arbitrary type, we need
> an expression to evaluate the result of the pointer load to act as our
> condition.
>
>    smp_cond_acquire(&lock->val.counter,
> 		   [](int val){ return !(val & _Q_LOCKED_PENDING_MASK); });
>
> Would be nice, but alas.
>
> The best we can do is hardcode a variable name; maybe something like:
>
> #define smp_cond_acquire(ptr, expr) do {			\
> 	typeof(*ptr) val;					\
> 	while ((val = READ_ONCE(*ptr)), expr)			\
> 		cpu_relax();					\
> 	smp_rmb(); /* ctrl + rmb := acquire */			\
> } while (0)
>
> Which would let us write:
>
>    smp_cond_acquire(&lock->val.counter, !(val & _Q_LOCKED_PENDING_MASK));
>
>
> Thoughts?
Will the following work?

#define smp_cond_load_acquire(ptr, cond_expr, neg) ({   \
         typeof(*ptr) _val;                              \
         while (neg ((_val = READ_ONCE(*ptr)) cond_expr))\
                 cpu_relax();                            \
         smp_rmb();                                      \
         _val;                                           \
})

         val = smp_cond_load_acquire(&lock->val.counter, & _Q_LOCKED_PENDING_MASK, !);

Cheers,
Longman

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] locking: Introduce smp_cond_acquire()
  2015-12-04 20:51       ` Waiman Long
@ 2015-12-04 22:05         ` Linus Torvalds
  2015-12-04 22:48           ` Waiman Long
  2015-12-04 23:43           ` Peter Zijlstra
  0 siblings, 2 replies; 21+ messages in thread
From: Linus Torvalds @ 2015-12-04 22:05 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Will Deacon, Ingo Molnar, Oleg Nesterov,
	Linux Kernel Mailing List, Paul McKenney, Boqun Feng,
	Jonathan Corbet, Michal Hocko, David Howells, Paul Turner

On Fri, Dec 4, 2015 at 12:51 PM, Waiman Long <waiman.long@hpe.com> wrote:
>
> Will the following work?

Are we trying to win some obfuscated C contest here?

Just make it do something like (skipping backslashes to make it easier
to type and read)

 #define smp_cond_load_acquire(ptr, cond_expr) ({
     typeof(*ptr) VAL;
     for (;;) {
          VAL = READ_ONCE(*ptr);
          if (cond_expr) break;
          cpu_relax();
     }
     smp_rmb();
     VAL;
  })

and then you'd have it be

    val = smp_cond_load_acquire(&lock->val.counter,
               !(VAL & _Q_LOCKED_PENDING_MASK));

which is at least halfway legible. Not some odd "fragments of
expressions" interfaces unless absolutely required, please.

Of course, I suspect we should not use READ_ONCE(), but some
architecture-overridable version that just defaults to READ_ONCE().
Same goes for that "smp_rmb()". Because maybe some architectures will
just prefer an explicit acquire, and I suspect we do *not* want
architectures having to recreate and override that crazy loop.

How much does this all actually end up mattering, btw?

               Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] locking: Introduce smp_cond_acquire()
  2015-12-04 22:05         ` Linus Torvalds
@ 2015-12-04 22:48           ` Waiman Long
  2015-12-04 23:43           ` Peter Zijlstra
  1 sibling, 0 replies; 21+ messages in thread
From: Waiman Long @ 2015-12-04 22:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Will Deacon, Ingo Molnar, Oleg Nesterov,
	Linux Kernel Mailing List, Paul McKenney, Boqun Feng,
	Jonathan Corbet, Michal Hocko, David Howells, Paul Turner

On 12/04/2015 05:05 PM, Linus Torvalds wrote:
> On Fri, Dec 4, 2015 at 12:51 PM, Waiman Long <waiman.long@hpe.com> wrote:
>> Will the following work?
> Are we trying to win some obfuscated C contest here?
>
> Just make it do something like (skipping backslashes to make it easier
> to type and read)
>
>   #define smp_cond_load_acquire(ptr, cond_expr) ({
>       typeof(*ptr) VAL;
>       for (;;) {
>            VAL = READ_ONCE(*ptr);
>            if (cond_expr) break;
>            cpu_relax();
>       }
>       smp_rmb();
>       VAL;
>    })
>
> and then you'd have it be
>
>      val = smp_cond_load_acquire(&lock->val.counter,
>                 !(VAL&  _Q_LOCKED_PENDING_MASK));
>
> which is at least halfway legible. Not some odd "fragments of
> expressions" interfaces unless absolutely required, please.

It is just some random thought that I have. I am not saying that it is 
the right way to go.

> Of course, I suspect we should not use READ_ONCE(), but some
> architecture-overridable version that just defaults to READ_ONCE().
> Same goes for that "smp_rmb()". Because maybe some architectures will
> just prefer an explicit acquire, and I suspect we do *not* want
> architectures having to recreate and override that crazy loop.
>
> How much does this all actually end up mattering, btw?
>
>                 Linus

I think what Will wants to do is to provide an architecture-specific 
replacement for the whole macro, not just part of it. So using READ_ONCE 
should be fine.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] locking: Introduce smp_cond_acquire()
  2015-12-04 22:05         ` Linus Torvalds
  2015-12-04 22:48           ` Waiman Long
@ 2015-12-04 23:43           ` Peter Zijlstra
  2015-12-07 15:18             ` Will Deacon
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2015-12-04 23:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Waiman Long, Will Deacon, Ingo Molnar, Oleg Nesterov,
	Linux Kernel Mailing List, Paul McKenney, Boqun Feng,
	Jonathan Corbet, Michal Hocko, David Howells, Paul Turner

On Fri, Dec 04, 2015 at 02:05:49PM -0800, Linus Torvalds wrote:
> Of course, I suspect we should not use READ_ONCE(), but some
> architecture-overridable version that just defaults to READ_ONCE().
> Same goes for that "smp_rmb()". Because maybe some architectures will
> just prefer an explicit acquire, and I suspect we do *not* want
> architectures having to recreate and override that crazy loop.
> 
> How much does this all actually end up mattering, btw?

Not sure, I'll have to let Will quantify that. But the whole reason
we're having this discussion is that ARM64 has a MONITOR+MWAIT-like
construct that they'd like to use to avoid the spinning.

Of course, in order to use that, they _have_ to override the crazy loop.

Now, Will and I spoke earlier today, and the version proposed by me (and
you, since that is roughly similar) will indeed work for them in that it
would allow them to rewrite the thing something like:


	typeof(*ptr) VAL;
	for (;;) {
		VAL = READ_ONCE(*ptr);
		if (expr)
			break;
		cmp_and_wait(ptr, VAL);
	}


Where their cmp_and_wait(ptr, val) looks a little like:

	asm volatile(
		"	ldxr	%w0, %1		\n"
		"	sub	%w0, %w0, %2	\n"
		"	cbnz	1f		\n"
		"	wfe			\n"
		"1:"

		: "=&r" (tmp)
		: "Q" (*ptr), "r" (val)
	);

(excuse my poor ARM asm foo)

Which sets up a load-exclusive monitor, compares if the value loaded
matches what we previously saw, and if so, wait-for-event.

WFE will wake on any event that would've also invalidated a subsequent
stxr or store-exclusive.

ARM64 also of course can choose to use load-acquire instead of the
READ_ONCE(), or still issue the smp_rmb(), dunno what is best for them.
The load-acquire would (potentially) be issued multiple times, vs the
rmb only once.  I'll let Will sort that.


In any case, WFE both reduces power consumption and lowers the
cacheline pressure, i.e. nobody keeps trying to pull the line into shared
state all the time while you're trying to get a store done.
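
Put together, the generic form plus an architecture hook could look something
like this (a sketch only; smp_cond_load_acquire() follows Linus' proposal
earlier in the thread, and the __arch_cmpwait() hook name is purely
hypothetical):

/* generic fallback: just spin; an architecture can override only this hook */
#ifndef __arch_cmpwait
#define __arch_cmpwait(ptr, val)	cpu_relax()
#endif

#define smp_cond_load_acquire(ptr, cond_expr) ({		\
	typeof(ptr) __p = (ptr);				\
	typeof(*__p) VAL;					\
	for (;;) {						\
		VAL = READ_ONCE(*__p);				\
		if (cond_expr)					\
			break;					\
		__arch_cmpwait(__p, VAL); /* e.g. LDXR + WFE */	\
	}							\
	smp_rmb(); /* ctrl + rmb := acquire */			\
	VAL;							\
})

An architecture like arm64 would then only need to provide __arch_cmpwait()
(or override the whole macro) rather than duplicate the loop.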


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/4] locking: Introduce smp_cond_acquire()
  2015-12-04 23:43           ` Peter Zijlstra
@ 2015-12-07 15:18             ` Will Deacon
  0 siblings, 0 replies; 21+ messages in thread
From: Will Deacon @ 2015-12-07 15:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Waiman Long, Ingo Molnar, Oleg Nesterov,
	Linux Kernel Mailing List, Paul McKenney, Boqun Feng,
	Jonathan Corbet, Michal Hocko, David Howells, Paul Turner

On Sat, Dec 05, 2015 at 12:43:37AM +0100, Peter Zijlstra wrote:
> On Fri, Dec 04, 2015 at 02:05:49PM -0800, Linus Torvalds wrote:
> > Of course, I suspect we should not use READ_ONCE(), but some
> > architecture-overridable version that just defaults to READ_ONCE().
> > Same goes for that "smp_rmb()". Because maybe some architectures will
> > just prefer an explicit acquire, and I suspect we do *not* want
> > architectures having to recreate and override that crazy loop.
> > 
> > How much does this all actually end up mattering, btw?
> 
> Not sure, I'll have to let Will quantify that. But the whole reason
> we're having this discussion is that ARM64 has a MONITOR+MWAIT like
> construct that they'd like to use to avoid the spinning.
> 
> Of course, in order to use that, they _have_ to override the crazy loop.

Right. This also removes one of the few hurdles standing between us (arm64)
and generic locking routines such as qspinlock, where we really don't
want busy-wait loops (since cpu_relax doesn't give us the opportunity
to use wfe safely).

> Now, Will and I spoke earlier today, and the version proposed by me (and
> you, since that is roughly similar) will indeed work for them in that it
> would allow them to rewrite the thing something like:
> 
> 
> 	typeof(*ptr) VAL;
> 	for (;;) {
> 		VAL = READ_ONCE(*ptr);
> 		if (expr)
> 			break;
> 		cmp_and_wait(ptr, VAL);
> 	}
> 
> 
> Where their cmp_and_wait(ptr, val) looks a little like:
> 
> 	asm volatile(
> 		"	ldxr	%w0, %1		\n"
> 		"	sub	%w0, %w0, %2	\n"
> 		"	cbnz	1f		\n"
> 		"	wfe			\n"
> 		"1:"
> 
> 		: "=&r" (tmp)
> 		: "Q" (*ptr), "r" (val)
> 	);
> 
> (excuse my poor ARM asm foo)
> 
> Which sets up a load-exclusive monitor, compares if the value loaded
> matches what we previously saw, and if so, wait-for-event.
> 
> WFE will wake on any event that would've also invalidated a subsequent
> stxr or store-exclusive.
> 
> ARM64 also of course can choose to use load-acquire instead of the
> READ_ONCE(), or still issue the smp_rmb(), dunno what is best for them.
> The load-acquire would (potentially) be issued multiple times, vs the
> rmb only once.  I'll let Will sort that.

Yup. I'll just override the whole thing using something much like what
you're suggesting.

Cheers,

Will

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/4] sched: Better document the try_to_wake_up() barriers
  2015-11-02 13:29 ` [PATCH 1/4] sched: Better document the try_to_wake_up() barriers Peter Zijlstra
  2015-12-04  0:09   ` Byungchul Park
@ 2015-12-04  0:58   ` Byungchul Park
  1 sibling, 0 replies; 21+ messages in thread
From: Byungchul Park @ 2015-12-04  0:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, linux-kernel, paulmck, boqun.feng, corbet, mhocko,
	dhowells, torvalds, will.deacon

On Mon, Nov 02, 2015 at 02:29:02PM +0100, Peter Zijlstra wrote:
> Explain how the control dependency and smp_rmb() end up providing
> ACQUIRE semantics and pair with smp_store_release() in
> finish_lock_switch().
> 
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c  |    8 +++++++-
>  kernel/sched/sched.h |    3 +++
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1947,7 +1947,13 @@ try_to_wake_up(struct task_struct *p, un
>  	while (p->on_cpu)
>  		cpu_relax();
>  	/*
> -	 * Pairs with the smp_wmb() in finish_lock_switch().
> +	 * Combined with the control dependency above, we have an effective
> +	 * smp_load_acquire() without the need for full barriers.
> +	 *
> +	 * Pairs with the smp_store_release() in finish_lock_switch().
> +	 *
> +	 * This ensures that tasks getting woken will be fully ordered against
> +	 * their previous state and preserve Program Order.
>  	 */
>  	smp_rmb();

I am very sorry.

Please ignore my previous question :-(

>  
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1073,6 +1073,9 @@ static inline void finish_lock_switch(st
>  	 * We must ensure this doesn't happen until the switch is completely
>  	 * finished.
>  	 *
> +	 * In particular, the load of prev->state in finish_task_switch() must
> +	 * happen before this.
> +	 *
>  	 * Pairs with the control dependency and rmb in try_to_wake_up().
>  	 */
>  	smp_store_release(&prev->on_cpu, 0);
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/4] sched: Better document the try_to_wake_up() barriers
  2015-11-02 13:29 ` [PATCH 1/4] sched: Better document the try_to_wake_up() barriers Peter Zijlstra
@ 2015-12-04  0:09   ` Byungchul Park
  2015-12-04  0:58   ` Byungchul Park
  1 sibling, 0 replies; 21+ messages in thread
From: Byungchul Park @ 2015-12-04  0:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, oleg, linux-kernel, paulmck, boqun.feng, corbet, mhocko,
	dhowells, torvalds, will.deacon

On Mon, Nov 02, 2015 at 02:29:02PM +0100, Peter Zijlstra wrote:
> Explain how the control dependency and smp_rmb() end up providing
> ACQUIRE semantics and pair with smp_store_release() in
> finish_lock_switch().
> 
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c  |    8 +++++++-
>  kernel/sched/sched.h |    3 +++
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1947,7 +1947,13 @@ try_to_wake_up(struct task_struct *p, un
>  	while (p->on_cpu)
>  		cpu_relax();
>  	/*
> -	 * Pairs with the smp_wmb() in finish_lock_switch().
> +	 * Combined with the control dependency above, we have an effective
> +	 * smp_load_acquire() without the need for full barriers.
> +	 *
> +	 * Pairs with the smp_store_release() in finish_lock_switch().
> +	 *
> +	 * This ensures that tasks getting woken will be fully ordered against
> +	 * their previous state and preserve Program Order.
>  	 */
>  	smp_rmb();

Hello,

I am not sure, but just curious..

I thought it would be good enough to use a kind of dependency barrier such
as smp_read_barrier_depends() instead of smp_rmb() here, unless something,
e.g. the compiler, breaks the control dependency. Right?

>  
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1073,6 +1073,9 @@ static inline void finish_lock_switch(st
>  	 * We must ensure this doesn't happen until the switch is completely
>  	 * finished.
>  	 *
> +	 * In particular, the load of prev->state in finish_task_switch() must
> +	 * happen before this.
> +	 *
>  	 * Pairs with the control dependency and rmb in try_to_wake_up().
>  	 */
>  	smp_store_release(&prev->on_cpu, 0);
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/4] sched: Better document the try_to_wake_up() barriers
  2015-11-02 13:29 [PATCH 0/4] scheduler ordering bits Peter Zijlstra
@ 2015-11-02 13:29 ` Peter Zijlstra
  2015-12-04  0:09   ` Byungchul Park
  2015-12-04  0:58   ` Byungchul Park
  0 siblings, 2 replies; 21+ messages in thread
From: Peter Zijlstra @ 2015-11-02 13:29 UTC (permalink / raw)
  To: mingo, oleg
  Cc: linux-kernel, peterz, paulmck, boqun.feng, corbet, mhocko,
	dhowells, torvalds, will.deacon

[-- Attachment #1: peterz-sched-ttwu-comment.patch --]
[-- Type: text/plain, Size: 1408 bytes --]

Explain how the control dependency and smp_rmb() end up providing
ACQUIRE semantics and pair with smp_store_release() in
finish_lock_switch().

Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  |    8 +++++++-
 kernel/sched/sched.h |    3 +++
 2 files changed, 10 insertions(+), 1 deletion(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1947,7 +1947,13 @@ try_to_wake_up(struct task_struct *p, un
 	while (p->on_cpu)
 		cpu_relax();
 	/*
-	 * Pairs with the smp_wmb() in finish_lock_switch().
+	 * Combined with the control dependency above, we have an effective
+	 * smp_load_acquire() without the need for full barriers.
+	 *
+	 * Pairs with the smp_store_release() in finish_lock_switch().
+	 *
+	 * This ensures that tasks getting woken will be fully ordered against
+	 * their previous state and preserve Program Order.
 	 */
 	smp_rmb();
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1073,6 +1073,9 @@ static inline void finish_lock_switch(st
 	 * We must ensure this doesn't happen until the switch is completely
 	 * finished.
 	 *
+	 * In particular, the load of prev->state in finish_task_switch() must
+	 * happen before this.
+	 *
 	 * Pairs with the control dependency and rmb in try_to_wake_up().
 	 */
 	smp_store_release(&prev->on_cpu, 0);



^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2015-12-07 15:18 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-03 12:40 [PATCH 0/4] scheduler ordering bits -v2 Peter Zijlstra
2015-12-03 12:40 ` [PATCH 1/4] sched: Better document the try_to_wake_up() barriers Peter Zijlstra
2015-12-03 12:40 ` [PATCH 2/4] sched: Fix a race in try_to_wake_up() vs schedule() Peter Zijlstra
2015-12-03 12:40 ` [PATCH 3/4] locking: Introduce smp_cond_acquire() Peter Zijlstra
2015-12-03 16:37   ` Will Deacon
2015-12-03 20:26     ` Peter Zijlstra
2015-12-03 21:16       ` Peter Zijlstra
2015-12-04 14:57       ` Will Deacon
2015-12-04 20:51       ` Waiman Long
2015-12-04 22:05         ` Linus Torvalds
2015-12-04 22:48           ` Waiman Long
2015-12-04 23:43           ` Peter Zijlstra
2015-12-07 15:18             ` Will Deacon
2015-12-03 19:41   ` Davidlohr Bueso
2015-12-03 20:31     ` Peter Zijlstra
2015-12-03 12:40 ` [PATCH 4/4] sched: Document Program-Order guarantees Peter Zijlstra
2015-12-03 13:16   ` Boqun Feng
2015-12-03 13:29     ` Peter Zijlstra
  -- strict thread matches above, loose matches on Subject: below --
2015-11-02 13:29 [PATCH 0/4] scheduler ordering bits Peter Zijlstra
2015-11-02 13:29 ` [PATCH 1/4] sched: Better document the try_to_wake_up() barriers Peter Zijlstra
2015-12-04  0:09   ` Byungchul Park
2015-12-04  0:58   ` Byungchul Park
