From: Peter Zijlstra <peterz@infradead.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org,
Ingo Molnar <mingo@redhat.com>
Subject: Re: [Bug, sched, 5.8-rc2]: PREEMPT kernels crashing in check_preempt_wakeup() running fsx on XFS
Date: Wed, 1 Jul 2020 10:02:13 +0200 [thread overview]
Message-ID: <20200701080213.GF4817@hirez.programming.kicks-ass.net> (raw)
In-Reply-To: <20200701022646.GO2005@dread.disaster.area>
On Wed, Jul 01, 2020 at 12:26:46PM +1000, Dave Chinner wrote:
> There's nothing like this in the scheduler code that I can find that
> explains the expected overall ordering/serialisation mechanisms and
> relationships used in the subsystem. Hence when I comes across
> something that doesn't appear to make sense, there's nothign I can
> refer to that would explain whether it is a bug or whether the
> comment is wrong or whether I've just completely misunderstood what
> the comment is referring to.
>
> Put simply: the little details are great, but they aren't sufficient
> by themselves to understand the relationships being maintained
> between the various objects.
You're absolutely right, we lack that. As a very first draft / brain
dump, I wrote the below, I'll try and polish it up and add to it over
the next few days.
---
kernel/sched/core.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 75 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1d3d2d67f398..568f7ade9a09 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -77,6 +77,74 @@ __read_mostly int scheduler_running;
*/
int sysctl_sched_rt_runtime = 950000;
+
+/*
+ * Serialization rules:
+ *
+ * Lock order:
+ *
+ * p->pi_lock
+ * rq->lock
+ * hrtimer_clock_base::cpu_base->lock
+ *
+ * rq1->lock
+ * rq2->lock where: &rq1 < &rq2
+ *
+ * Regular state:
+ *
+ * Normal scheduling state is serialized by rq->lock. __schedule() takes the
+ * local CPU's rq->lock, it optionally removes the task from the runqueue and
+ * always looks at the local rq data structures to find the most elegible task
+ * to run next.
+ *
+ * Task enqueue is also under rq->lock, possibly taken from another CPU.
+ * Wakeups from another LLC domain might use an IPI to transfer the enqueue
+ * to the local CPU to avoid bouncing the runqueue state around.
+ *
+ * Task wakeup, specifically wakeups that involve migration, are horribly
+ * complicated to avoid having to take two rq->locks.
+ *
+ * Special state:
+ *
+ * p->state <- TASK_*
+ * p->on_cpu <- { 0, 1 }
+ * p->on_rq <- { 0, 1 = TASK_ON_RQ_QUEUED, 2 = TASK_ON_RQ_MIGRATING }
+ * task_cpu(p) <- set_task_cpu()
+ *
+ * System-calls and anything external will use task_rq_lock() which acquires
+ * both p->lock and rq->lock. As a consequence things like p->cpus_ptr are
+ * stable while holding either lock.
+ *
+ * p->state is changed locklessly using set_current_state(),
+ * __set_current_state() or set_special_state(), see their respective comments.
+ *
+ * p->on_cpu is set by prepare_task() and cleared by finish_task() such that it
+ * will be set before p is scheduled-in and cleared after p is scheduled-out,
+ * both under rq->lock. Non-zero indicates the task is 'current' on it's CPU.
+ *
+ * p->on_rq is set by activate_task() and cleared by deactivate_task(), under
+ * rq->lock. Non-zero indicates the task is runnable, the special
+ * ON_RQ_MIGRATING state is used for migration without holding both rq->locks.
+ * It indicates task_cpu() is not stable, see *task_rq_lock().
+ *
+ * task_cpu(p) is changed by set_task_cpu(), the rules are intricate but basically:
+ *
+ * - don't call set_task_cpu() on a blocked task
+ *
+ * - for try_to_wake_up(), called under p->pi_lock
+ *
+ * - for migration called under rq->lock:
+ * o move_queued_task()
+ * o __migrate_swap_task()
+ * o detach_task()
+ *
+ * - for migration called under double_rq_lock():
+ * o push_rt_task() / pull_rt_task()
+ * o push_dl_task() / pull_dl_task()
+ * o dl_task_offline_migration()
+ *
+ */
+
/*
* __task_rq_lock - lock the rq @p resides on.
*/
@@ -1466,8 +1534,7 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
{
lockdep_assert_held(&rq->lock);
- WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
- dequeue_task(rq, p, DEQUEUE_NOCLOCK);
+ deactivate_task(rq, p, DEQUEUE_NOCLOCK);
set_task_cpu(p, new_cpu);
rq_unlock(rq, rf);
@@ -1475,8 +1542,7 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
rq_lock(rq, rf);
BUG_ON(task_cpu(p) != new_cpu);
- enqueue_task(rq, p, 0);
- p->on_rq = TASK_ON_RQ_QUEUED;
+ activate_task(rq, p, 0);
check_preempt_curr(rq, p, 0);
return rq;
@@ -3134,8 +3200,12 @@ static inline void prepare_task(struct task_struct *next)
/*
* Claim the task as running, we do this before switching to it
* such that any running task will have this set.
+ *
+ * __schedule()'s rq->lock and smp_mb__after_spin_lock() orders this
+ * store against prior state change of p, also see try_to_wake_up(),
+ * specifically smp_load_acquire(&p->on_cpu).
*/
- next->on_cpu = 1;
+ WRITE_ONCE(next->on_cpu, 1);
#endif
}
next prev parent reply other threads:[~2020-07-01 8:02 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-06-26 0:47 [Bug, sched, 5.8-rc2]: PREEMPT kernels crashing in check_preempt_wakeup() running fsx on XFS Dave Chinner
2020-06-26 5:57 ` Dave Chinner
2020-06-26 7:33 ` Peter Zijlstra
2020-06-26 22:32 ` Dave Chinner
2020-06-27 18:30 ` Peter Zijlstra
2020-06-29 23:55 ` Dave Chinner
2020-06-30 8:57 ` Peter Zijlstra
2020-07-01 2:26 ` Dave Chinner
2020-07-01 8:02 ` Peter Zijlstra [this message]
2020-07-01 11:06 ` Dave Chinner
2020-07-01 13:42 ` Peter Zijlstra
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200701080213.GF4817@hirez.programming.kicks-ass.net \
--to=peterz@infradead.org \
--cc=david@fromorbit.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
--cc=mingo@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).