From: Peter Zijlstra <peterz@infradead.org>
To: mingo@kernel.org, oleg@redhat.com
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
	paulmck@linux.vnet.ibm.com, boqun.feng@gmail.com, corbet@lwn.net,
	mhocko@kernel.org, dhowells@redhat.com,
	torvalds@linux-foundation.org, will.deacon@arm.com
Subject: [PATCH 2/4] sched: Document Program-Order guarantees
Date: Mon, 02 Nov 2015 14:29:03 +0100
Message-ID: <20151102134940.883198067@infradead.org>
In-Reply-To: <20151102132901.157178466@infradead.org>

These are some notes on the scheduler locking and how it provides
program order guarantees on SMP systems.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |  202 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 202 insertions(+)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1905,6 +1905,208 @@ static void ttwu_queue(struct task_struc
 	raw_spin_unlock(&rq->lock);
 }
 
+/*
+ * Notes on Program-Order guarantees on SMP systems.
+ *
+ *
+ *   PREEMPTION/MIGRATION
+ *
+ * Regular preemption/migration is safe because, as long as the task is
+ * runnable, migrations involve both rq locks, albeit not (necessarily) at
+ * the same time.
+ *
+ * So we get (the example migrates X from CPU0 to CPU1, via CPU2):
+ *
+ *   CPU0            CPU1            CPU2
+ *
+ *   LOCK rq(0)->lock
+ *   sched-out X
+ *   sched-in Y
+ *   UNLOCK rq(0)->lock
+ *
+ *                                   LOCK rq(0)->lock // MB against CPU0
+ *                                   dequeue X
+ *                                   UNLOCK rq(0)->lock
+ *
+ *                                   LOCK rq(1)->lock
+ *                                   enqueue X
+ *                                   UNLOCK rq(1)->lock
+ *
+ *                   LOCK rq(1)->lock // MB against CPU2
+ *                   sched-out Z
+ *                   sched-in X
+ *                   UNLOCK rq(1)->lock
+ *
+ * and the first LOCK rq(0) on CPU2 gives a full order against the UNLOCK rq(0)
+ * on CPU0. Similarly the LOCK rq(1) on CPU1 provides full order against the
+ * UNLOCK rq(1) on CPU2, therefore by the time task X runs on CPU1 it must
+ * observe the state it left behind on CPU0.
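+ *
+ * In code, the migration CPU2 performs above is roughly the following
+ * (a simplified sketch; the real kernel does this in move_queued_task()
+ * and friends, with more state to maintain):
+ *
+ *   raw_spin_lock(&rq(0)->lock);   // MB against CPU0's UNLOCK rq(0)->lock
+ *   deactivate_task(rq(0), X, 0);  // dequeue X
+ *   set_task_cpu(X, 1);
+ *   raw_spin_unlock(&rq(0)->lock);
+ *
+ *   raw_spin_lock(&rq(1)->lock);
+ *   activate_task(rq(1), X, 0);    // enqueue X
+ *   raw_spin_unlock(&rq(1)->lock); // CPU1's later LOCK rq(1) MBs against this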
+ *
+ *
+ *   BLOCKING -- aka. SLEEP + WAKEUP
+ *
+ * For blocking, things are a little more interesting, because when we dequeue
+ * the task, we don't need to acquire the old rq lock in order to migrate it.
+ *
+ * Say CPU0 does a wait_event() and CPU1 does the wake() and migrates the task
+ * to CPU2 (the most complex example):
+ *
+ *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (sched_ttwu_pending)
+ *
+ *   X->state = UNINTERRUPTIBLE
+ *   MB
+ *   if (cond)
+ *     break
+ *                    cond = true
+ *
+ *   LOCK rq(0)->lock LOCK X->pi_lock
+ *
+ *   dequeue X
+ *                    while (X->on_cpu)
+ *                      cpu_relax()
+ *   sched-out X
+ *   RELEASE
+ *   X->on_cpu = 0
+ *                    RMB
+ *                    X->state = WAKING
+ *                    set_task_cpu(X,2)
+ *                      WMB
+ *                      ti(X)->cpu = 2
+ *
+ *                    llist_add(X, rq(2)) // MB
+ *                                          llist_del_all() // MB
+ *
+ *                                          LOCK rq(2)->lock
+ *                                          enqueue X
+ *                                          X->state = RUNNING
+ *                                          UNLOCK rq(2)->lock
+ *
+ *                                          LOCK rq(2)->lock
+ *                                          sched-out Z
+ *                                          sched-in X
+ *                                          UNLOCK rq(2)->lock
+ *
+ *                                          if (cond) // _TRUE_
+ *                    UNLOCK X->pi_lock
+ *   UNLOCK rq(0)->lock
+ *
+ * So in this case the scheduler does not provide an obvious full barrier; but
+ * the smp_store_release() in finish_lock_switch(), paired with the control-dep
+ * and smp_rmb() in try_to_wake_up(), forms a release-acquire pair that fully
+ * orders things between CPU0 and CPU1.
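+ *
+ * In code, that release-acquire pair is roughly (a sketch; see
+ * finish_lock_switch() and try_to_wake_up() for the real thing):
+ *
+ *   // CPU0, finish_lock_switch():
+ *   smp_store_release(&X->on_cpu, 0); // RELEASE
+ *
+ *   // CPU1, try_to_wake_up():
+ *   while (X->on_cpu)                 // control dependency, plus...
+ *     cpu_relax();
+ *   smp_rmb();                        // ...this RMB, forms the acquire side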
+ *
+ * The llist primitives order things between CPU1 and CPU2 -- the alternative
+ * is CPU1 doing the remote enqueue (the alternative path in ttwu_queue()) in
+ * which case the rq(2)->lock release/acquire will order things between them.
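+ *
+ * That queued wakeup is roughly (a sketch of ttwu_queue_remote() and
+ * sched_ttwu_pending(); details omitted):
+ *
+ *   // CPU1:
+ *   if (llist_add(&X->wake_entry, &rq(2)->wake_list)) // MB
+ *     smp_send_reschedule(2);
+ *
+ *   // CPU2, from the scheduler IPI:
+ *   llist = llist_del_all(&rq(2)->wake_list);         // MB
+ *   // ...then enqueue each task on the list under rq(2)->lock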
+ *
+ * Which again leads to the guarantee that by the time X gets to run on CPU2
+ * it must observe the state it left behind on CPU0.
+ *
+ * However; for blocking there is a second guarantee we must provide, namely we
+ * must observe the state that led to our wakeup. That is, not only must X
+ * observe its own prior state, it must also observe the @cond store.
+ *
+ * This too is achieved in the above: since CPU1 does the waking, we only need
+ * the ordering between CPU1 and CPU2, which is the same as the above.
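+ *
+ * For reference, the sleep/wakeup pattern the diagrams above assume is the
+ * classic one (a sketch; wait_event()/wake_up() implement this for real):
+ *
+ *   // sleeper (X):
+ *   for (;;) {
+ *     set_current_state(TASK_UNINTERRUPTIBLE); // the MB above
+ *     if (cond)
+ *       break;
+ *     schedule();
+ *   }
+ *   __set_current_state(TASK_RUNNING);
+ *
+ *   // waker:
+ *   cond = true;
+ *   wake_up_process(X); // ends up in try_to_wake_up() above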
+ *
+ *
+ * There is however a much more interesting case for this guarantee, where X
+ * never makes it off CPU0:
+ *
+ *   CPU0 (schedule)  CPU1 (try_to_wake_up)
+ *
+ *   X->state = UNINTERRUPTIBLE
+ *   MB
+ *   if (cond)
+ *     break
+ *                    cond = true
+ *
+ *                    WMB (aka smp_mb__before_spinlock)
+ *                    LOCK X->pi_lock
+ *
+ *                    if (X->on_rq)
+ *                      LOCK rq(0)->lock
+ *                      X->state = RUNNING
+ *                      UNLOCK rq(0)->lock
+ *
+ *   LOCK rq(0)->lock // MB against CPU1
+ *   UNLOCK rq(0)->lock
+ *
+ *   if (cond) // _TRUE_
+ *
+ *                    UNLOCK X->pi_lock
+ *
+ * Here our task X never quite leaves CPU0; the wakeup happens before we can
+ * dequeue and schedule someone else. In this case we must still observe cond
+ * after our call to schedule() completes.
+ *
+ * This is achieved by the smp_mb__before_spinlock() WMB, which ensures the
+ * store cannot leak inside the LOCK, and by the LOCK rq(0)->lock on CPU0
+ * providing full order against the UNLOCK rq(0)->lock from CPU1. Furthermore,
+ * our load of cond cannot happen before this same LOCK.
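+ *
+ * In code, CPU1's side is roughly (a sketch; the barrier sits at the top of
+ * try_to_wake_up(), the cond store in its caller):
+ *
+ *   cond = true;
+ *   smp_mb__before_spinlock();                 // the WMB above
+ *   raw_spin_lock_irqsave(&X->pi_lock, flags); // the LOCK the store
+ *                                              // cannot leak inside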
+ *
+ * Therefore, again, we're good.
+ */
+
 /**
  * try_to_wake_up - wake up a thread
  * @p: the thread to be awakened


