linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC][PATCH 0/8] PI and assorted failings
@ 2016-06-07 19:56 Peter Zijlstra
  2016-06-07 19:56 ` [RFC][PATCH 1/8] rtmutex: Deboost before waking up the top waiter Peter Zijlstra
                   ` (7 more replies)
  0 siblings, 8 replies; 40+ messages in thread
From: Peter Zijlstra @ 2016-06-07 19:56 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz

Since Mathieu tried adding more !DEADLINE aware tracing and I'd been neglecting
Xunlei's patches, I figured I'd have a go at all that.

Here a stack of patches I ended up with looking at the assorted 'features'
in our PI code.

The code is compile tested only, contains some notes and in general needs more
love.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [RFC][PATCH 1/8] rtmutex: Deboost before waking up the top waiter
  2016-06-07 19:56 [RFC][PATCH 0/8] PI and assorted failings Peter Zijlstra
@ 2016-06-07 19:56 ` Peter Zijlstra
  2016-06-14  9:09   ` Juri Lelli
  2016-06-14 18:22   ` Steven Rostedt
  2016-06-07 19:56 ` [RFC][PATCH 2/8] sched/rtmutex/deadline: Fix a PI crash for deadline tasks Peter Zijlstra
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 40+ messages in thread
From: Peter Zijlstra @ 2016-06-07 19:56 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz,
	Ingo Molnar

[-- Attachment #1: xunlei_pang-rtmutex-deboost_before_waking_up_the_top_waiter.patch --]
[-- Type: text/plain, Size: 3510 bytes --]

From: Xunlei Pang <xlpang@redhat.com>

We should deboost before waking the high-prio task, such that
we don't run two tasks with the same "state"(priority, deadline,
sched_class, etc) during the period between the end of wake_up_q()
and the end of rt_mutex_adjust_prio().

As "Peter Zijlstra" said:
Its semantically icky to have the two tasks running off the same
state and practically icky when you consider bandwidth inheritance --
where the boosted task wants to explicitly modify the state of the
booster. In that latter case you really want to unboost before you
let the booster run again.

But this however can lead to prio-inversion if current would get
preempted after the deboost but before waking our high-prio task,
hence we disable preemption before doing deboost, and enabling it
after the wake up is over.

The patch fixed the logic, and introduced rt_mutex_postunlock()
to do some code refactor.

Most importantly however; this change ensures pointer stability for
the next patch, where we have rt_mutex_setprio() cache a pointer to
the top-most waiter task. If we, as before this change, do the wakeup
first and then deboost, this pointer might point into thin air.

Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
[peterz: Changelog]
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1461659449-19497-1-git-send-email-xlpang@redhat.com
---

 kernel/futex.c                  |    5 ++---
 kernel/locking/rtmutex.c        |   28 ++++++++++++++++++++++++----
 kernel/locking/rtmutex_common.h |    1 +
 3 files changed, 27 insertions(+), 7 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1336,9 +1336,8 @@ static int wake_futex_pi(u32 __user *uad
 	 * scheduled away before the wake up can take place.
 	 */
 	spin_unlock(&hb->lock);
-	wake_up_q(&wake_q);
-	if (deboost)
-		rt_mutex_adjust_prio(current);
+
+	rt_mutex_postunlock(&wake_q, deboost);
 
 	return 0;
 }
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1390,12 +1390,32 @@ rt_mutex_fastunlock(struct rt_mutex *loc
 	} else {
 		bool deboost = slowfn(lock, &wake_q);
 
-		wake_up_q(&wake_q);
+		rt_mutex_postunlock(&wake_q, deboost);
+	}
+}
+
 
-		/* Undo pi boosting if necessary: */
-		if (deboost)
-			rt_mutex_adjust_prio(current);
+/*
+ * Undo pi boosting (if necessary) and wake top waiter.
+ */
+void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost)
+{
+	/*
+	 * We should deboost before waking the top waiter task such that
+	 * we don't run two tasks with the 'same' priority. This however
+	 * can lead to prio-inversion if we would get preempted after
+	 * the deboost but before waking our high-prio task, hence the
+	 * preempt_disable.
+	 */
+	if (deboost) {
+		preempt_disable();
+		rt_mutex_adjust_prio(current);
 	}
+
+	wake_up_q(wake_q);
+
+	if (deboost)
+		preempt_enable();
 }
 
 /**
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -111,6 +111,7 @@ extern int rt_mutex_finish_proxy_lock(st
 extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
 extern bool rt_mutex_futex_unlock(struct rt_mutex *lock,
 				  struct wake_q_head *wqh);
+extern void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost);
 extern void rt_mutex_adjust_prio(struct task_struct *task);
 
 #ifdef CONFIG_DEBUG_RT_MUTEXES

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [RFC][PATCH 2/8] sched/rtmutex/deadline: Fix a PI crash for deadline tasks
  2016-06-07 19:56 [RFC][PATCH 0/8] PI and assorted failings Peter Zijlstra
  2016-06-07 19:56 ` [RFC][PATCH 1/8] rtmutex: Deboost before waking up the top waiter Peter Zijlstra
@ 2016-06-07 19:56 ` Peter Zijlstra
  2016-06-14 10:21   ` Juri Lelli
  2016-06-14 18:42   ` Steven Rostedt
  2016-06-07 19:56 ` [RFC][PATCH 3/8] sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update Peter Zijlstra
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 40+ messages in thread
From: Peter Zijlstra @ 2016-06-07 19:56 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz,
	Ingo Molnar

[-- Attachment #1: xunlei_pang-sched_rtmutex_deadline-fix_a_pi_crash_for_deadline_tasks.patch --]
[-- Type: text/plain, Size: 8031 bytes --]

From: Xunlei Pang <xlpang@redhat.com>

A crash happened while I was playing with deadline PI rtmutex.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: [<ffffffff810eeb8f>] rt_mutex_get_top_task+0x1f/0x30
    PGD 232a75067 PUD 230947067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 1 PID: 10994 Comm: a.out Not tainted

    Call Trace:
    [<ffffffff810b658c>] enqueue_task+0x2c/0x80
    [<ffffffff810ba763>] activate_task+0x23/0x30
    [<ffffffff810d0ab5>] pull_dl_task+0x1d5/0x260
    [<ffffffff810d0be6>] pre_schedule_dl+0x16/0x20
    [<ffffffff8164e783>] __schedule+0xd3/0x900
    [<ffffffff8164efd9>] schedule+0x29/0x70
    [<ffffffff8165035b>] __rt_mutex_slowlock+0x4b/0xc0
    [<ffffffff81650501>] rt_mutex_slowlock+0xd1/0x190
    [<ffffffff810eeb33>] rt_mutex_timed_lock+0x53/0x60
    [<ffffffff810ecbfc>] futex_lock_pi.isra.18+0x28c/0x390
    [<ffffffff810ed8b0>] do_futex+0x190/0x5b0
    [<ffffffff810edd50>] SyS_futex+0x80/0x180

This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
are only protected by pi_lock when operating pi waiters, while
rt_mutex_get_top_task(), will access them with rq lock held but
not holding pi_lock.

In order to tackle it, we introduce new "pi_top_task" pointer
cached in task_struct, and add new rt_mutex_update_top_task()
to update its value, it can be called by rt_mutex_setprio()
which held both owner's pi_lock and rq lock. Thus "pi_top_task"
can be safely accessed by enqueue_task_dl() under rq lock.

[XXX this next section is unparsable]
One problem is when rt_mutex_adjust_prio()->...->rt_mutex_setprio(),
at that time rtmutex lock was released and owner was marked off,
this can cause "pi_top_task" dereferenced to be a running one(as it
can be falsely woken up by others before rt_mutex_setprio() is
made to update "pi_top_task"). We solve this by directly calling
__rt_mutex_adjust_prio() in mark_wakeup_next_waiter() which held
pi_lock and rtmutex lock, and remove rt_mutex_adjust_prio(). Since
now we moved the deboost point, in order to avoid current to be
preempted due to deboost earlier before wake_up_q(), we also moved
preempt_disable() before unlocking rtmutex.

Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Originally-From: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1461659449-19497-2-git-send-email-xlpang@redhat.com
---

 include/linux/init_task.h |    1 
 include/linux/sched.h     |    2 +
 include/linux/sched/rt.h  |    1 
 kernel/fork.c             |    1 
 kernel/locking/rtmutex.c  |   65 +++++++++++++++++++---------------------------
 kernel/sched/core.c       |    2 +
 6 files changed, 34 insertions(+), 38 deletions(-)

--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -162,6 +162,7 @@ extern struct task_group root_task_group
 #ifdef CONFIG_RT_MUTEXES
 # define INIT_RT_MUTEXES(tsk)						\
 	.pi_waiters = RB_ROOT,						\
+	.pi_top_task = NULL,						\
 	.pi_waiters_leftmost = NULL,
 #else
 # define INIT_RT_MUTEXES(tsk)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1681,6 +1681,8 @@ struct task_struct {
 	/* PI waiters blocked on a rt_mutex held by this task */
 	struct rb_root pi_waiters;
 	struct rb_node *pi_waiters_leftmost;
+	/* Updated under owner's pi_lock and rq lock */
+	struct task_struct *pi_top_task;
 	/* Deadlock detection and priority inheritance handling */
 	struct rt_mutex_waiter *pi_blocked_on;
 #endif
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -19,6 +19,7 @@ static inline int rt_task(struct task_st
 extern int rt_mutex_getprio(struct task_struct *p);
 extern void rt_mutex_setprio(struct task_struct *p, int prio);
 extern int rt_mutex_get_effective_prio(struct task_struct *task, int newprio);
+extern void rt_mutex_update_top_task(struct task_struct *p);
 extern struct task_struct *rt_mutex_get_top_task(struct task_struct *task);
 extern void rt_mutex_adjust_pi(struct task_struct *p);
 static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1219,6 +1219,7 @@ static void rt_mutex_init_task(struct ta
 #ifdef CONFIG_RT_MUTEXES
 	p->pi_waiters = RB_ROOT;
 	p->pi_waiters_leftmost = NULL;
+	p->pi_top_task = NULL;
 	p->pi_blocked_on = NULL;
 #endif
 }
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -256,6 +256,16 @@ rt_mutex_dequeue_pi(struct task_struct *
 	RB_CLEAR_NODE(&waiter->pi_tree_entry);
 }
 
+void rt_mutex_update_top_task(struct task_struct *p)
+{
+	if (!task_has_pi_waiters(p)) {
+		p->pi_top_task = NULL;
+		return;
+	}
+
+	p->pi_top_task = task_top_pi_waiter(p)->task;
+}
+
 /*
  * Calculate task priority from the waiter tree priority
  *
@@ -273,10 +283,7 @@ int rt_mutex_getprio(struct task_struct
 
 struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
 {
-	if (likely(!task_has_pi_waiters(task)))
-		return NULL;
-
-	return task_top_pi_waiter(task)->task;
+	return task->pi_top_task;
 }
 
 /*
@@ -285,12 +292,12 @@ struct task_struct *rt_mutex_get_top_tas
  */
 int rt_mutex_get_effective_prio(struct task_struct *task, int newprio)
 {
-	if (!task_has_pi_waiters(task))
+	struct task_struct *top_task = rt_mutex_get_top_task(task);
+
+	if (!top_task)
 		return newprio;
 
-	if (task_top_pi_waiter(task)->task->prio <= newprio)
-		return task_top_pi_waiter(task)->task->prio;
-	return newprio;
+	return min(top_task->prio, newprio);
 }
 
 /*
@@ -307,24 +314,6 @@ static void __rt_mutex_adjust_prio(struc
 }
 
 /*
- * Adjust task priority (undo boosting). Called from the exit path of
- * rt_mutex_slowunlock() and rt_mutex_slowlock().
- *
- * (Note: We do this outside of the protection of lock->wait_lock to
- * allow the lock to be taken while or before we readjust the priority
- * of task. We do not use the spin_xx_mutex() variants here as we are
- * outside of the debug path.)
- */
-void rt_mutex_adjust_prio(struct task_struct *task)
-{
-	unsigned long flags;
-
-	raw_spin_lock_irqsave(&task->pi_lock, flags);
-	__rt_mutex_adjust_prio(task);
-	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
-}
-
-/*
  * Deadlock detection is conditional:
  *
  * If CONFIG_DEBUG_RT_MUTEXES=n, deadlock detection is only conducted
@@ -987,6 +976,7 @@ static void mark_wakeup_next_waiter(stru
 	 * lock->wait_lock.
 	 */
 	rt_mutex_dequeue_pi(current, waiter);
+	__rt_mutex_adjust_prio(current);
 
 	/*
 	 * As we are waking up the top waiter, and the waiter stays
@@ -1325,6 +1315,16 @@ static bool __sched rt_mutex_slowunlock(
 	 */
 	mark_wakeup_next_waiter(wake_q, lock);
 
+	/*
+	 * We should deboost before waking the top waiter task such that
+	 * we don't run two tasks with the 'same' priority. This however
+	 * can lead to prio-inversion if we would get preempted after
+	 * the deboost but before waking our high-prio task, hence the
+	 * preempt_disable before unlock. Pairs with preempt_enable() in
+	 * rt_mutex_postunlock();
+	 */
+	preempt_disable();
+
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 	/* check PI boosting */
@@ -1400,20 +1400,9 @@ rt_mutex_fastunlock(struct rt_mutex *loc
  */
 void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost)
 {
-	/*
-	 * We should deboost before waking the top waiter task such that
-	 * we don't run two tasks with the 'same' priority. This however
-	 * can lead to prio-inversion if we would get preempted after
-	 * the deboost but before waking our high-prio task, hence the
-	 * preempt_disable.
-	 */
-	if (deboost) {
-		preempt_disable();
-		rt_mutex_adjust_prio(current);
-	}
-
 	wake_up_q(wake_q);
 
+	/* Pairs with preempt_disable() in rt_mutex_slowunlock() */
 	if (deboost)
 		preempt_enable();
 }
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3568,6 +3568,8 @@ void rt_mutex_setprio(struct task_struct
 		goto out_unlock;
 	}
 
+	rt_mutex_update_top_task(p);
+
 	trace_sched_pi_setprio(p, prio);
 	oldprio = p->prio;
 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [RFC][PATCH 3/8] sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update
  2016-06-07 19:56 [RFC][PATCH 0/8] PI and assorted failings Peter Zijlstra
  2016-06-07 19:56 ` [RFC][PATCH 1/8] rtmutex: Deboost before waking up the top waiter Peter Zijlstra
  2016-06-07 19:56 ` [RFC][PATCH 2/8] sched/rtmutex/deadline: Fix a PI crash for deadline tasks Peter Zijlstra
@ 2016-06-07 19:56 ` Peter Zijlstra
  2016-06-14 10:43   ` Juri Lelli
  2016-06-15 16:30   ` Steven Rostedt
  2016-06-07 19:56 ` [RFC][PATCH 4/8] rtmutex: Remove rt_mutex_fastunlock() Peter Zijlstra
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 40+ messages in thread
From: Peter Zijlstra @ 2016-06-07 19:56 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz,
	Ingo Molnar

[-- Attachment #1: xunlei_pang-sched_deadline_rtmutex-don_t_miss_the_dl_runtime_dl_period_update.patch --]
[-- Type: text/plain, Size: 1690 bytes --]

From: Xunlei Pang <xlpang@redhat.com>

Currently dl tasks will actually return at the very beginning
of rt_mutex_adjust_prio_chain() in !detect_deadlock cases:

    if (waiter->prio == task->prio) {
        if (!detect_deadlock)
            goto out_unlock_pi; // out here
        else
            requeue = false;
    }

As the deadline value of blocked deadline tasks(waiters) without
changing their sched_class(thus prio doesn't change) never changes,
this seems reasonable, but it actually misses the chance of updating
rt_mutex_waiter's "dl_runtime(period)_copy" if a waiter updates its
deadline parameters(dl_runtime, dl_period) or boosted waiter changes
to !deadline class.

Thus, force deadline task not out by adding the !dl_prio() condition.

[peterz: I should introduce more task state comparators like
rt_mutex_waiter_less, all PI prio comparisons already have this DL
exception, except this one]

Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1460633827-345-7-git-send-email-xlpang@redhat.com
---
 kernel/locking/rtmutex.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -488,7 +488,7 @@ static int rt_mutex_adjust_prio_chain(st
 	 * enabled we continue, but stop the requeueing in the chain
 	 * walk.
 	 */
-	if (waiter->prio == task->prio) {
+	if (waiter->prio == task->prio && !dl_task(task)) {
 		if (!detect_deadlock)
 			goto out_unlock_pi;
 		else

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [RFC][PATCH 4/8] rtmutex: Remove rt_mutex_fastunlock()
  2016-06-07 19:56 [RFC][PATCH 0/8] PI and assorted failings Peter Zijlstra
                   ` (2 preceding siblings ...)
  2016-06-07 19:56 ` [RFC][PATCH 3/8] sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update Peter Zijlstra
@ 2016-06-07 19:56 ` Peter Zijlstra
  2016-06-15 16:43   ` Steven Rostedt
  2016-06-07 19:56 ` [RFC][PATCH 5/8] rtmutex: Clean up Peter Zijlstra
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2016-06-07 19:56 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz

[-- Attachment #1: peterz-cleanup-rt-mutex-1.patch --]
[-- Type: text/plain, Size: 1386 bytes --]

There is but a single user and the previous patch mandates slowfn must
be rt_mutex_slowunlock(), so fold the lot together and save a few
lines.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/locking/rtmutex.c |   29 ++++++++++-------------------
 1 file changed, 10 insertions(+), 19 deletions(-)

--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1377,24 +1377,6 @@ rt_mutex_fasttrylock(struct rt_mutex *lo
 	return slowfn(lock);
 }
 
-static inline void
-rt_mutex_fastunlock(struct rt_mutex *lock,
-		    bool (*slowfn)(struct rt_mutex *lock,
-				   struct wake_q_head *wqh))
-{
-	WAKE_Q(wake_q);
-
-	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL))) {
-		rt_mutex_deadlock_account_unlock(current);
-
-	} else {
-		bool deboost = slowfn(lock, &wake_q);
-
-		rt_mutex_postunlock(&wake_q, deboost);
-	}
-}
-
-
 /*
  * Undo pi boosting (if necessary) and wake top waiter.
  */
@@ -1501,7 +1483,16 @@ EXPORT_SYMBOL_GPL(rt_mutex_trylock);
  */
 void __sched rt_mutex_unlock(struct rt_mutex *lock)
 {
-	rt_mutex_fastunlock(lock, rt_mutex_slowunlock);
+	WAKE_Q(wake_q);
+
+	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL))) {
+		rt_mutex_deadlock_account_unlock(current);
+
+	} else {
+		bool deboost = rt_mutex_slowunlock(lock, &wake_q);
+
+		rt_mutex_postunlock(&wake_q, deboost);
+	}
 }
 EXPORT_SYMBOL_GPL(rt_mutex_unlock);
 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [RFC][PATCH 5/8] rtmutex: Clean up
  2016-06-07 19:56 [RFC][PATCH 0/8] PI and assorted failings Peter Zijlstra
                   ` (3 preceding siblings ...)
  2016-06-07 19:56 ` [RFC][PATCH 4/8] rtmutex: Remove rt_mutex_fastunlock() Peter Zijlstra
@ 2016-06-07 19:56 ` Peter Zijlstra
  2016-06-14 12:08   ` Juri Lelli
  2016-06-07 19:56 ` [RFC][PATCH 6/8] sched/rtmutex: Refactor rt_mutex_setprio() Peter Zijlstra
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2016-06-07 19:56 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz

[-- Attachment #1: peterz-cleanup-rt-mutex-2.patch --]
[-- Type: text/plain, Size: 3949 bytes --]

Previous patches changed the meaning of the return value of
rt_mutex_slowunlock(); update comments and code to reflect this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c                  |   12 ++++++------
 kernel/locking/rtmutex.c        |   20 +++++++++-----------
 kernel/locking/rtmutex_common.h |    2 +-
 3 files changed, 16 insertions(+), 18 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1261,7 +1261,7 @@ static int wake_futex_pi(u32 __user *uad
 	struct futex_pi_state *pi_state = this->pi_state;
 	u32 uninitialized_var(curval), newval;
 	WAKE_Q(wake_q);
-	bool deboost;
+	bool postunlock;
 	int ret = 0;
 
 	if (!pi_state)
@@ -1327,17 +1327,17 @@ static int wake_futex_pi(u32 __user *uad
 
 	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 
-	deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
+	postunlock = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
 
 	/*
 	 * First unlock HB so the waiter does not spin on it once he got woken
-	 * up. Second wake up the waiter before the priority is adjusted. If we
-	 * deboost first (and lose our higher priority), then the task might get
-	 * scheduled away before the wake up can take place.
+	 * up. Then wakeup the waiter by calling rt_mutex_postunlock(). Priority
+	 * is already adjusted and preemption is disabled to avoid inversion.
 	 */
 	spin_unlock(&hb->lock);
 
-	rt_mutex_postunlock(&wake_q, deboost);
+	if (postunlock)
+		rt_mutex_postunlock(&wake_q);
 
 	return 0;
 }
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1254,7 +1254,8 @@ static inline int rt_mutex_slowtrylock(s
 
 /*
  * Slow path to release a rt-mutex.
- * Return whether the current task needs to undo a potential priority boosting.
+ *
+ * Return whether the current task needs to call rt_mutex_postunlock().
  */
 static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock,
 					struct wake_q_head *wake_q)
@@ -1327,7 +1328,7 @@ static bool __sched rt_mutex_slowunlock(
 
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
-	/* check PI boosting */
+	/* call rt_mutex_postunlock() */
 	return true;
 }
 
@@ -1378,15 +1379,14 @@ rt_mutex_fasttrylock(struct rt_mutex *lo
 }
 
 /*
- * Undo pi boosting (if necessary) and wake top waiter.
+ * Performs the wakeup of the the top-waiter and re-enables preemption.
  */
-void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost)
+void rt_mutex_postunlock(struct wake_q_head *wake_q)
 {
 	wake_up_q(wake_q);
 
 	/* Pairs with preempt_disable() in rt_mutex_slowunlock() */
-	if (deboost)
-		preempt_enable();
+	preempt_enable();
 }
 
 /**
@@ -1489,9 +1489,8 @@ void __sched rt_mutex_unlock(struct rt_m
 		rt_mutex_deadlock_account_unlock(current);
 
 	} else {
-		bool deboost = rt_mutex_slowunlock(lock, &wake_q);
-
-		rt_mutex_postunlock(&wake_q, deboost);
+		if (rt_mutex_slowunlock(lock, &wake_q))
+			rt_mutex_postunlock(&wake_q);
 	}
 }
 EXPORT_SYMBOL_GPL(rt_mutex_unlock);
@@ -1500,8 +1499,7 @@ EXPORT_SYMBOL_GPL(rt_mutex_unlock);
  * rt_mutex_futex_unlock - Futex variant of rt_mutex_unlock
  * @lock: the rt_mutex to be unlocked
  *
- * Returns: true/false indicating whether priority adjustment is
- * required or not.
+ * Returns: true/false indicating whether we should call rt_mutex_postunlock().
  */
 bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock,
 				   struct wake_q_head *wqh)
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -111,7 +111,7 @@ extern int rt_mutex_finish_proxy_lock(st
 extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
 extern bool rt_mutex_futex_unlock(struct rt_mutex *lock,
 				  struct wake_q_head *wqh);
-extern void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost);
+extern void rt_mutex_postunlock(struct wake_q_head *wake_q);
 extern void rt_mutex_adjust_prio(struct task_struct *task);
 
 #ifdef CONFIG_DEBUG_RT_MUTEXES

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [RFC][PATCH 6/8] sched/rtmutex: Refactor rt_mutex_setprio()
  2016-06-07 19:56 [RFC][PATCH 0/8] PI and assorted failings Peter Zijlstra
                   ` (4 preceding siblings ...)
  2016-06-07 19:56 ` [RFC][PATCH 5/8] rtmutex: Clean up Peter Zijlstra
@ 2016-06-07 19:56 ` Peter Zijlstra
  2016-06-14 13:14   ` Juri Lelli
  2016-06-07 19:56 ` [RFC][PATCH 7/8] sched,tracing: Update trace_sched_pi_setprio() Peter Zijlstra
  2016-06-07 19:56 ` [RFC][PATCH 8/8] rtmutex: Fix PI chain order integrity Peter Zijlstra
  7 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2016-06-07 19:56 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz

[-- Attachment #1: peterz-cleanup-rt-mutex-3.patch --]
[-- Type: text/plain, Size: 12018 bytes --]

With the introduction of SCHED_DEADLINE the whole notion that priority
is a single number is gone, therefore the @prio argument to
rt_mutex_setprio() doesn't make sense anymore.

So rework the code to pass a pi_task instead.

Note this also fixes a problem with pi_top_task caching; previously we
would not set the pointer (call rt_mutex_update_top_task) if the
priority didn't change, this could lead to a stale pointer.

As for the XXX, I think its fine to use pi_task->prio, because if it
differs from waiter->prio, a PI chain update is immenent.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched/rt.h        |   21 +-------
 kernel/locking/rtmutex.c        |  105 +++++++++++-----------------------------
 kernel/locking/rtmutex_common.h |    1 
 kernel/sched/core.c             |   66 ++++++++++++++++++++-----
 4 files changed, 88 insertions(+), 105 deletions(-)

--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -16,28 +16,17 @@ static inline int rt_task(struct task_st
 }
 
 #ifdef CONFIG_RT_MUTEXES
-extern int rt_mutex_getprio(struct task_struct *p);
-extern void rt_mutex_setprio(struct task_struct *p, int prio);
-extern int rt_mutex_get_effective_prio(struct task_struct *task, int newprio);
-extern void rt_mutex_update_top_task(struct task_struct *p);
-extern struct task_struct *rt_mutex_get_top_task(struct task_struct *task);
+static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *p)
+{
+	return p->pi_top_task;
+}
+extern void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task);
 extern void rt_mutex_adjust_pi(struct task_struct *p);
 static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
 {
 	return tsk->pi_blocked_on != NULL;
 }
 #else
-static inline int rt_mutex_getprio(struct task_struct *p)
-{
-	return p->normal_prio;
-}
-
-static inline int rt_mutex_get_effective_prio(struct task_struct *task,
-					      int newprio)
-{
-	return newprio;
-}
-
 static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
 {
 	return NULL;
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -256,61 +256,16 @@ rt_mutex_dequeue_pi(struct task_struct *
 	RB_CLEAR_NODE(&waiter->pi_tree_entry);
 }
 
-void rt_mutex_update_top_task(struct task_struct *p)
+static void rt_mutex_adjust_prio(struct task_struct *p)
 {
-	if (!task_has_pi_waiters(p)) {
-		p->pi_top_task = NULL;
-		return;
-	}
+	struct task_struct *pi_task = NULL;
 
-	p->pi_top_task = task_top_pi_waiter(p)->task;
-}
-
-/*
- * Calculate task priority from the waiter tree priority
- *
- * Return task->normal_prio when the waiter tree is empty or when
- * the waiter is not allowed to do priority boosting
- */
-int rt_mutex_getprio(struct task_struct *task)
-{
-	if (likely(!task_has_pi_waiters(task)))
-		return task->normal_prio;
+	lockdep_assert_held(&p->pi_lock);
 
-	return min(task_top_pi_waiter(task)->prio,
-		   task->normal_prio);
-}
+	if (!task_has_pi_waiters(p))
+		pi_task = task_top_pi_waiter(p)->task;
 
-struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
-{
-	return task->pi_top_task;
-}
-
-/*
- * Called by sched_setscheduler() to get the priority which will be
- * effective after the change.
- */
-int rt_mutex_get_effective_prio(struct task_struct *task, int newprio)
-{
-	struct task_struct *top_task = rt_mutex_get_top_task(task);
-
-	if (!top_task)
-		return newprio;
-
-	return min(top_task->prio, newprio);
-}
-
-/*
- * Adjust the priority of a task, after its pi_waiters got modified.
- *
- * This can be both boosting and unboosting. task->pi_lock must be held.
- */
-static void __rt_mutex_adjust_prio(struct task_struct *task)
-{
-	int prio = rt_mutex_getprio(task);
-
-	if (task->prio != prio || dl_prio(prio))
-		rt_mutex_setprio(task, prio);
+	rt_mutex_setprio(p, pi_task);
 }
 
 /*
@@ -670,7 +625,7 @@ static int rt_mutex_adjust_prio_chain(st
 		 */
 		rt_mutex_dequeue_pi(task, prerequeue_top_waiter);
 		rt_mutex_enqueue_pi(task, waiter);
-		__rt_mutex_adjust_prio(task);
+		rt_mutex_adjust_prio(task);
 
 	} else if (prerequeue_top_waiter == waiter) {
 		/*
@@ -686,7 +641,7 @@ static int rt_mutex_adjust_prio_chain(st
 		rt_mutex_dequeue_pi(task, waiter);
 		waiter = rt_mutex_top_waiter(lock);
 		rt_mutex_enqueue_pi(task, waiter);
-		__rt_mutex_adjust_prio(task);
+		rt_mutex_adjust_prio(task);
 	} else {
 		/*
 		 * Nothing changed. No need to do any priority
@@ -896,7 +851,7 @@ static int task_blocks_on_rt_mutex(struc
 		return -EDEADLK;
 
 	raw_spin_lock(&task->pi_lock);
-	__rt_mutex_adjust_prio(task);
+	rt_mutex_adjust_prio(task);
 	waiter->task = task;
 	waiter->lock = lock;
 	waiter->prio = task->prio;
@@ -918,7 +873,7 @@ static int task_blocks_on_rt_mutex(struc
 		rt_mutex_dequeue_pi(owner, top_waiter);
 		rt_mutex_enqueue_pi(owner, waiter);
 
-		__rt_mutex_adjust_prio(owner);
+		rt_mutex_adjust_prio(owner);
 		if (owner->pi_blocked_on)
 			chain_walk = 1;
 	} else if (rt_mutex_cond_detect_deadlock(waiter, chwalk)) {
@@ -970,13 +925,14 @@ static void mark_wakeup_next_waiter(stru
 	waiter = rt_mutex_top_waiter(lock);
 
 	/*
-	 * Remove it from current->pi_waiters. We do not adjust a
-	 * possible priority boost right now. We execute wakeup in the
-	 * boosted mode and go back to normal after releasing
-	 * lock->wait_lock.
+	 * Remove it from current->pi_waiters and deboost.
+	 *
+	 * We must in fact deboost here in order to ensure we call
+	 * rt_mutex_setprio() to update p->pi_top_task before the
+	 * task unblocks.
 	 */
 	rt_mutex_dequeue_pi(current, waiter);
-	__rt_mutex_adjust_prio(current);
+	rt_mutex_adjust_prio(current);
 
 	/*
 	 * As we are waking up the top waiter, and the waiter stays
@@ -988,9 +944,19 @@ static void mark_wakeup_next_waiter(stru
 	 */
 	lock->owner = (void *) RT_MUTEX_HAS_WAITERS;
 
-	raw_spin_unlock(&current->pi_lock);
-
+	/*
+	 * We deboosted before waking the top waiter task such that we don't
+	 * run two tasks with the 'same' priority (and ensure the
+	 * p->pi_top_task pointer points to a blocked task). This however can
+	 * lead to priority inversion if we would get preempted after the
+	 * deboost but before waking our donor task, hence the preempt_disable()
+	 * before unlock.
+	 *
+	 * Pairs with preempt_enable() in rt_mutex_postunlock();
+	 */
+	preempt_disable();
 	wake_q_add(wake_q, waiter->task);
+	raw_spin_unlock(&current->pi_lock);
 }
 
 /*
@@ -1025,7 +991,7 @@ static void remove_waiter(struct rt_mute
 	if (rt_mutex_has_waiters(lock))
 		rt_mutex_enqueue_pi(owner, rt_mutex_top_waiter(lock));
 
-	__rt_mutex_adjust_prio(owner);
+	rt_mutex_adjust_prio(owner);
 
 	/* Store the lock on which owner is blocked or NULL */
 	next_lock = task_blocked_on_lock(owner);
@@ -1064,8 +1030,7 @@ void rt_mutex_adjust_pi(struct task_stru
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
 
 	waiter = task->pi_blocked_on;
-	if (!waiter || (waiter->prio == task->prio &&
-			!dl_prio(task->prio))) {
+	if (!waiter || (waiter->prio == task->prio && !dl_prio(task->prio))) {
 		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 		return;
 	}
@@ -1316,16 +1281,6 @@ static bool __sched rt_mutex_slowunlock(
 	 */
 	mark_wakeup_next_waiter(wake_q, lock);
 
-	/*
-	 * We should deboost before waking the top waiter task such that
-	 * we don't run two tasks with the 'same' priority. This however
-	 * can lead to prio-inversion if we would get preempted after
-	 * the deboost but before waking our high-prio task, hence the
-	 * preempt_disable before unlock. Pairs with preempt_enable() in
-	 * rt_mutex_postunlock();
-	 */
-	preempt_disable();
-
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 	/* call rt_mutex_postunlock() */
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -112,7 +112,6 @@ extern int rt_mutex_timed_futex_lock(str
 extern bool rt_mutex_futex_unlock(struct rt_mutex *lock,
 				  struct wake_q_head *wqh);
 extern void rt_mutex_postunlock(struct wake_q_head *wake_q);
-extern void rt_mutex_adjust_prio(struct task_struct *task);
 
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 # include "rtmutex-debug.h"
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3532,10 +3532,25 @@ EXPORT_SYMBOL(default_wake_function);
 
 #ifdef CONFIG_RT_MUTEXES
 
+static inline int __rt_effective_prio(struct task_struct *pi_task, int prio)
+{
+	if (pi_task)
+		prio = min(prio, pi_task->prio);
+
+	return prio;
+}
+
+static inline int rt_effective_prio(struct task_struct *p, int prio)
+{
+	struct task_struct *pi_task = rt_mutex_get_top_task(p);
+
+	return __rt_effective_prio(pi_task, prio);
+}
+
 /*
  * rt_mutex_setprio - set the current priority of a task
- * @p: task
- * @prio: prio value (kernel-internal form)
+ * @p: task to boost
+ * @pi_task: donor task
  *
  * This function changes the 'effective' priority of a task. It does
  * not touch ->normal_prio like __setscheduler().
@@ -3543,16 +3558,40 @@ EXPORT_SYMBOL(default_wake_function);
  * Used by the rt_mutex code to implement priority inheritance
  * logic. Call site only calls if the priority of the task changed.
  */
-void rt_mutex_setprio(struct task_struct *p, int prio)
+void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
 {
-	int oldprio, queued, running, queue_flag = DEQUEUE_SAVE | DEQUEUE_MOVE;
+	int prio, oldprio, queued, running, queue_flag = DEQUEUE_SAVE | DEQUEUE_MOVE;
 	const struct sched_class *prev_class;
 	struct rq_flags rf;
 	struct rq *rq;
 
-	BUG_ON(prio > MAX_PRIO);
+	/* XXX used to be waiter->prio, not waiter->task->prio */
+	prio = __rt_effective_prio(pi_task, p->normal_prio);
+
+	/*
+	 * If nothing changed; bail early.
+	 */
+	if (p->pi_top_task == pi_task && prio == p->prio && !dl_prio(prio))
+		return;
 
 	rq = __task_rq_lock(p, &rf);
+	/*
+	 * Set under pi_lock && rq->lock, such that the value can be used under
+	 * either lock.
+	 *
+	 * Note that there is loads of tricky to make this pointer cache work
+	 * right. rt_mutex_slowunlock()+rt_mutex_postunlock() work together to
+	 * ensure a task is de-boosted (pi_task is set to NULL) before the
+	 * task is allowed to run again (and can exit). This ensures the pointer
+	 * points to a blocked task -- which guaratees the task is present.
+	 */
+	p->pi_top_task = pi_task;
+
+	/*
+	 * For FIFO/RR we only need to set prio, if that matches we're done.
+	 */
+	if (prio == p->prio && !dl_prio(prio))
+		goto out_unlock;
 
 	/*
 	 * Idle task boosting is a nono in general. There is one
@@ -3572,9 +3611,7 @@ void rt_mutex_setprio(struct task_struct
 		goto out_unlock;
 	}
 
-	rt_mutex_update_top_task(p);
-
-	trace_sched_pi_setprio(p, prio);
+	trace_sched_pi_setprio(p, prio); /* broken */
 	oldprio = p->prio;
 
 	if (oldprio == prio)
@@ -3598,7 +3635,6 @@ void rt_mutex_setprio(struct task_struct
 	 *          running task
 	 */
 	if (dl_prio(prio)) {
-		struct task_struct *pi_task = rt_mutex_get_top_task(p);
 		if (!dl_prio(p->normal_prio) ||
 		    (pi_task && dl_entity_preempt(&pi_task->dl, &p->dl))) {
 			p->dl.dl_boosted = 1;
@@ -3635,6 +3671,11 @@ void rt_mutex_setprio(struct task_struct
 	balance_callback(rq);
 	preempt_enable();
 }
+#else
+static inline int rt_effective_prio(struct task_struct p, int prio)
+{
+	return prio;
+}
 #endif
 
 void set_user_nice(struct task_struct *p, long nice)
@@ -3873,10 +3914,9 @@ static void __setscheduler(struct rq *rq
 	 * Keep a potential priority boosting if called from
 	 * sched_setscheduler().
 	 */
+	p->prio = normal_prio(p);
 	if (keep_boost)
-		p->prio = rt_mutex_get_effective_prio(p, normal_prio(p));
-	else
-		p->prio = normal_prio(p);
+		p->prio = rt_effective_prio(p, p->prio);
 
 	if (dl_prio(p->prio))
 		p->sched_class = &dl_sched_class;
@@ -4163,7 +4203,7 @@ static int __sched_setscheduler(struct t
 		 * the runqueue. This will be done when the task deboost
 		 * itself.
 		 */
-		new_effective_prio = rt_mutex_get_effective_prio(p, newprio);
+		new_effective_prio = rt_effective_prio(p, newprio);
 		if (new_effective_prio == oldprio)
 			queue_flags &= ~DEQUEUE_MOVE;
 	}

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [RFC][PATCH 7/8] sched,tracing: Update trace_sched_pi_setprio()
  2016-06-07 19:56 [RFC][PATCH 0/8] PI and assorted failings Peter Zijlstra
                   ` (5 preceding siblings ...)
  2016-06-07 19:56 ` [RFC][PATCH 6/8] sched/rtmutex: Refactor rt_mutex_setprio() Peter Zijlstra
@ 2016-06-07 19:56 ` Peter Zijlstra
  2016-06-07 19:56 ` [RFC][PATCH 8/8] rtmutex: Fix PI chain order integrity Peter Zijlstra
  7 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2016-06-07 19:56 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz

[-- Attachment #1: peterz-cleanup-rt-mutex-4.patch --]
[-- Type: text/plain, Size: 3005 bytes --]

Pass the PI donor task, instead of a numerical priority.

Numerical priorities are not sufficient to describe state ever since
SCHED_DEADLINE.

Annotate all sched tracepoints that are currently broken; fixing them
will bork userspace. *hate*.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/trace/events/sched.h |   16 +++++++++-------
 kernel/sched/core.c          |    2 +-
 2 files changed, 10 insertions(+), 8 deletions(-)

--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -70,7 +70,7 @@ DECLARE_EVENT_CLASS(sched_wakeup_templat
 	TP_fast_assign(
 		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
 		__entry->pid		= p->pid;
-		__entry->prio		= p->prio;
+		__entry->prio		= p->prio; /* XXX SCHED_DEADLINE */
 		__entry->success	= 1; /* rudiment, kill when possible */
 		__entry->target_cpu	= task_cpu(p);
 	),
@@ -147,6 +147,7 @@ TRACE_EVENT(sched_switch,
 		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
 		__entry->next_pid	= next->pid;
 		__entry->next_prio	= next->prio;
+		/* XXX SCHED_DEADLINE */
 	),
 
 	TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
@@ -181,7 +182,7 @@ TRACE_EVENT(sched_migrate_task,
 	TP_fast_assign(
 		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
 		__entry->pid		= p->pid;
-		__entry->prio		= p->prio;
+		__entry->prio		= p->prio; /* XXX SCHED_DEADLINE */
 		__entry->orig_cpu	= task_cpu(p);
 		__entry->dest_cpu	= dest_cpu;
 	),
@@ -206,7 +207,7 @@ DECLARE_EVENT_CLASS(sched_process_templa
 	TP_fast_assign(
 		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
 		__entry->pid		= p->pid;
-		__entry->prio		= p->prio;
+		__entry->prio		= p->prio; /* XXX SCHED_DEADLINE */
 	),
 
 	TP_printk("comm=%s pid=%d prio=%d",
@@ -253,7 +254,7 @@ TRACE_EVENT(sched_process_wait,
 	TP_fast_assign(
 		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
 		__entry->pid		= pid_nr(pid);
-		__entry->prio		= current->prio;
+		__entry->prio		= current->prio; /* XXX SCHED_DEADLINE */
 	),
 
 	TP_printk("comm=%s pid=%d prio=%d",
@@ -413,9 +414,9 @@ DEFINE_EVENT(sched_stat_runtime, sched_s
  */
 TRACE_EVENT(sched_pi_setprio,
 
-	TP_PROTO(struct task_struct *tsk, int newprio),
+	TP_PROTO(struct task_struct *tsk, struct task_struct *pi_task),
 
-	TP_ARGS(tsk, newprio),
+	TP_ARGS(tsk, pi_task),
 
 	TP_STRUCT__entry(
 		__array( char,	comm,	TASK_COMM_LEN	)
@@ -428,7 +429,8 @@ TRACE_EVENT(sched_pi_setprio,
 		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
 		__entry->pid		= tsk->pid;
 		__entry->oldprio	= tsk->prio;
-		__entry->newprio	= newprio;
+		__entry->newprio	= pi_task->prio;
+		/* XXX SCHED_DEADLINE bits missing */
 	),
 
 	TP_printk("comm=%s pid=%d oldprio=%d newprio=%d",
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3611,7 +3611,7 @@ void rt_mutex_setprio(struct task_struct
 		goto out_unlock;
 	}
 
-	trace_sched_pi_setprio(p, prio); /* broken */
+	trace_sched_pi_setprio(p, pi_task);
 	oldprio = p->prio;
 
 	if (oldprio == prio)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [RFC][PATCH 8/8] rtmutex: Fix PI chain order integrity
  2016-06-07 19:56 [RFC][PATCH 0/8] PI and assorted failings Peter Zijlstra
                   ` (6 preceding siblings ...)
  2016-06-07 19:56 ` [RFC][PATCH 7/8] sched,tracing: Update trace_sched_pi_setprio() Peter Zijlstra
@ 2016-06-07 19:56 ` Peter Zijlstra
  2016-06-14 17:39   ` Juri Lelli
  7 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2016-06-07 19:56 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz

[-- Attachment #1: peterz-cleanup-rt-mutex-5.patch --]
[-- Type: text/plain, Size: 1986 bytes --]

rt_mutex_waiter::prio is a copy of task_struct::prio which is updated
during the PI chain walk, such that the PI chain order isn't messed up
by (asynchronous) task state updates.

Currently rt_mutex_waiter_less() uses task state for deadline tasks;
this is broken, since the task state can, as said above, change
asynchronously, causing the RB tree order to change without actual
tree update -> FAIL.

Fix this by also copying the deadline into the rt_mutex_waiter state
and updating it along with its prio field.

Ideally we would also force PI chain updates whenever DL tasks update
their deadline parameter, but for first approximation this is less
broken than it was.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/locking/rtmutex.c        |    5 +++--
 kernel/locking/rtmutex_common.h |    1 +
 2 files changed, 4 insertions(+), 2 deletions(-)

--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -172,8 +172,7 @@ rt_mutex_waiter_less(struct rt_mutex_wai
 	 * then right waiter has a dl_prio() too.
 	 */
 	if (dl_prio(left->prio))
-		return dl_time_before(left->task->dl.deadline,
-				      right->task->dl.deadline);
+		return dl_time_before(left->deadline, right->deadline);
 
 	return 0;
 }
@@ -585,6 +584,7 @@ static int rt_mutex_adjust_prio_chain(st
 	/* [7] Requeue the waiter in the lock waiter tree. */
 	rt_mutex_dequeue(lock, waiter);
 	waiter->prio = task->prio;
+	waiter->deadline = task->dl.deadline;
 	rt_mutex_enqueue(lock, waiter);
 
 	/* [8] Release the task */
@@ -855,6 +855,7 @@ static int task_blocks_on_rt_mutex(struc
 	waiter->task = task;
 	waiter->lock = lock;
 	waiter->prio = task->prio;
+	waiter->deadline = task->dl.deadline;
 
 	/* Get the top priority waiter on the lock */
 	if (rt_mutex_has_waiters(lock))
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -33,6 +33,7 @@ struct rt_mutex_waiter {
 	struct rt_mutex		*deadlock_lock;
 #endif
 	int prio;
+	u64 deadline;
 };
 
 /*

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/8] rtmutex: Deboost before waking up the top waiter
  2016-06-07 19:56 ` [RFC][PATCH 1/8] rtmutex: Deboost before waking up the top waiter Peter Zijlstra
@ 2016-06-14  9:09   ` Juri Lelli
  2016-06-14 12:54     ` Peter Zijlstra
  2016-06-14 16:36     ` Davidlohr Bueso
  2016-06-14 18:22   ` Steven Rostedt
  1 sibling, 2 replies; 40+ messages in thread
From: Juri Lelli @ 2016-06-14  9:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, rostedt, xlpang, linux-kernel, mathieu.desnoyers,
	jdesfossez, bristot, Ingo Molnar

Hi,

I've got only nitpicks for the changelog. Otherwise the patch looks good
to me (and yes, without it bw inheritance would be a problem).

On 07/06/16 21:56, Peter Zijlstra wrote:
> From: Xunlei Pang <xlpang@redhat.com>
> 
> We should deboost before waking the high-prio task, such that
> we don't run two tasks with the same "state"(priority, deadline,
                                              ^
                                            space

> sched_class, etc) during the period between the end of wake_up_q()
> and the end of rt_mutex_adjust_prio().
> 
> As "Peter Zijlstra" said:
> Its semantically icky to have the two tasks running off the same

s/Its/It's/

> state and practically icky when you consider bandwidth inheritance --
> where the boosted task wants to explicitly modify the state of the
> booster. In that latter case you really want to unboost before you
> let the booster run again.
> 
> But this however can lead to prio-inversion if current would get
> preempted after the deboost but before waking our high-prio task,
> hence we disable preemption before doing deboost, and enabling it

s/enabling/re-enable/

> after the wake up is over.
> 
> The patch fixed the logic, and introduced rt_mutex_postunlock()

s/The/This/
s/fixed/fixes/
s/introduced/introduces/

> to do some code refactor.
> 
> Most importantly however; this change ensures pointer stability for
> the next patch, where we have rt_mutex_setprio() cache a pointer to
> the top-most waiter task. If we, as before this change, do the wakeup
> first and then deboost, this pointer might point into thin air.
> 
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Juri Lelli <juri.lelli@arm.com>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> [peterz: Changelog]
> Signed-off-by: Xunlei Pang <xlpang@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/1461659449-19497-1-git-send-email-xlpang@redhat.com

Do we have any specific tests for this set? I'm running mine.

Best,

- Juri

> ---
> 
>  kernel/futex.c                  |    5 ++---
>  kernel/locking/rtmutex.c        |   28 ++++++++++++++++++++++++----
>  kernel/locking/rtmutex_common.h |    1 +
>  3 files changed, 27 insertions(+), 7 deletions(-)
> 
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1336,9 +1336,8 @@ static int wake_futex_pi(u32 __user *uad
>  	 * scheduled away before the wake up can take place.
>  	 */
>  	spin_unlock(&hb->lock);
> -	wake_up_q(&wake_q);
> -	if (deboost)
> -		rt_mutex_adjust_prio(current);
> +
> +	rt_mutex_postunlock(&wake_q, deboost);
>  
>  	return 0;
>  }
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -1390,12 +1390,32 @@ rt_mutex_fastunlock(struct rt_mutex *loc
>  	} else {
>  		bool deboost = slowfn(lock, &wake_q);
>  
> -		wake_up_q(&wake_q);
> +		rt_mutex_postunlock(&wake_q, deboost);
> +	}
> +}
> +
>  
> -		/* Undo pi boosting if necessary: */
> -		if (deboost)
> -			rt_mutex_adjust_prio(current);
> +/*
> + * Undo pi boosting (if necessary) and wake top waiter.
> + */
> +void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost)
> +{
> +	/*
> +	 * We should deboost before waking the top waiter task such that
> +	 * we don't run two tasks with the 'same' priority. This however
> +	 * can lead to prio-inversion if we would get preempted after
> +	 * the deboost but before waking our high-prio task, hence the
> +	 * preempt_disable.
> +	 */
> +	if (deboost) {
> +		preempt_disable();
> +		rt_mutex_adjust_prio(current);
>  	}
> +
> +	wake_up_q(wake_q);
> +
> +	if (deboost)
> +		preempt_enable();
>  }
>  
>  /**
> --- a/kernel/locking/rtmutex_common.h
> +++ b/kernel/locking/rtmutex_common.h
> @@ -111,6 +111,7 @@ extern int rt_mutex_finish_proxy_lock(st
>  extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
>  extern bool rt_mutex_futex_unlock(struct rt_mutex *lock,
>  				  struct wake_q_head *wqh);
> +extern void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost);
>  extern void rt_mutex_adjust_prio(struct task_struct *task);
>  
>  #ifdef CONFIG_DEBUG_RT_MUTEXES
> 
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 2/8] sched/rtmutex/deadline: Fix a PI crash for deadline tasks
  2016-06-07 19:56 ` [RFC][PATCH 2/8] sched/rtmutex/deadline: Fix a PI crash for deadline tasks Peter Zijlstra
@ 2016-06-14 10:21   ` Juri Lelli
  2016-06-14 12:30     ` Peter Zijlstra
  2016-06-14 12:53     ` Xunlei Pang
  2016-06-14 18:42   ` Steven Rostedt
  1 sibling, 2 replies; 40+ messages in thread
From: Juri Lelli @ 2016-06-14 10:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, rostedt, xlpang, linux-kernel, mathieu.desnoyers,
	jdesfossez, bristot, Ingo Molnar

Hi,

On 07/06/16 21:56, Peter Zijlstra wrote:
> From: Xunlei Pang <xlpang@redhat.com>
> 
> A crash happened while I was playing with deadline PI rtmutex.
> 
>     BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
>     IP: [<ffffffff810eeb8f>] rt_mutex_get_top_task+0x1f/0x30
>     PGD 232a75067 PUD 230947067 PMD 0
>     Oops: 0000 [#1] SMP
>     CPU: 1 PID: 10994 Comm: a.out Not tainted
> 
>     Call Trace:
>     [<ffffffff810b658c>] enqueue_task+0x2c/0x80
>     [<ffffffff810ba763>] activate_task+0x23/0x30
>     [<ffffffff810d0ab5>] pull_dl_task+0x1d5/0x260
>     [<ffffffff810d0be6>] pre_schedule_dl+0x16/0x20
>     [<ffffffff8164e783>] __schedule+0xd3/0x900
>     [<ffffffff8164efd9>] schedule+0x29/0x70
>     [<ffffffff8165035b>] __rt_mutex_slowlock+0x4b/0xc0
>     [<ffffffff81650501>] rt_mutex_slowlock+0xd1/0x190
>     [<ffffffff810eeb33>] rt_mutex_timed_lock+0x53/0x60
>     [<ffffffff810ecbfc>] futex_lock_pi.isra.18+0x28c/0x390
>     [<ffffffff810ed8b0>] do_futex+0x190/0x5b0
>     [<ffffffff810edd50>] SyS_futex+0x80/0x180
> 

This seems to be caused by the race condition you detail below between
load balancing and PI code. I tried to reproduce the BUG on my box, but
it looks hard to get. Do you have a reproducer I can give a try?

> This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
> are only protected by pi_lock when operating pi waiters, while
> rt_mutex_get_top_task(), will access them with rq lock held but
> not holding pi_lock.
> 
> In order to tackle it, we introduce new "pi_top_task" pointer
> cached in task_struct, and add new rt_mutex_update_top_task()
> to update its value, it can be called by rt_mutex_setprio()
> which held both owner's pi_lock and rq lock. Thus "pi_top_task"
> can be safely accessed by enqueue_task_dl() under rq lock.
> 
> [XXX this next section is unparsable]

Yes, a bit hard to understand. However, am I correct in assuming this
patch and the previous one should fix this problem? Or are there still
other races causing issues?

> One problem is when rt_mutex_adjust_prio()->...->rt_mutex_setprio(),
> at that time rtmutex lock was released and owner was marked off,
> this can cause "pi_top_task" dereferenced to be a running one(as it
> can be falsely woken up by others before rt_mutex_setprio() is
> made to update "pi_top_task"). We solve this by directly calling
> __rt_mutex_adjust_prio() in mark_wakeup_next_waiter() which held
> pi_lock and rtmutex lock, and remove rt_mutex_adjust_prio(). Since
> now we moved the deboost point, in order to avoid current to be
> preempted due to deboost earlier before wake_up_q(), we also moved
> preempt_disable() before unlocking rtmutex.
> 
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Juri Lelli <juri.lelli@arm.com>
> Originally-From: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Xunlei Pang <xlpang@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/1461659449-19497-2-git-send-email-xlpang@redhat.com

The idea of this fix makes sense to me. But, I would like to be able to
see the BUG and test the fix. What I have is a test in which I create N
DEADLINE workers that share a PI mutex. They get migrated around and
seem to stress PI code. But I couldn't hit the BUG yet. Maybe I let it
run for some more time.

Best,

- Juri

> ---
> 
>  include/linux/init_task.h |    1 
>  include/linux/sched.h     |    2 +
>  include/linux/sched/rt.h  |    1 
>  kernel/fork.c             |    1 
>  kernel/locking/rtmutex.c  |   65 +++++++++++++++++++---------------------------
>  kernel/sched/core.c       |    2 +
>  6 files changed, 34 insertions(+), 38 deletions(-)
> 
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -162,6 +162,7 @@ extern struct task_group root_task_group
>  #ifdef CONFIG_RT_MUTEXES
>  # define INIT_RT_MUTEXES(tsk)						\
>  	.pi_waiters = RB_ROOT,						\
> +	.pi_top_task = NULL,						\
>  	.pi_waiters_leftmost = NULL,
>  #else
>  # define INIT_RT_MUTEXES(tsk)
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1681,6 +1681,8 @@ struct task_struct {
>  	/* PI waiters blocked on a rt_mutex held by this task */
>  	struct rb_root pi_waiters;
>  	struct rb_node *pi_waiters_leftmost;
> +	/* Updated under owner's pi_lock and rq lock */
> +	struct task_struct *pi_top_task;
>  	/* Deadlock detection and priority inheritance handling */
>  	struct rt_mutex_waiter *pi_blocked_on;
>  #endif
> --- a/include/linux/sched/rt.h
> +++ b/include/linux/sched/rt.h
> @@ -19,6 +19,7 @@ static inline int rt_task(struct task_st
>  extern int rt_mutex_getprio(struct task_struct *p);
>  extern void rt_mutex_setprio(struct task_struct *p, int prio);
>  extern int rt_mutex_get_effective_prio(struct task_struct *task, int newprio);
> +extern void rt_mutex_update_top_task(struct task_struct *p);
>  extern struct task_struct *rt_mutex_get_top_task(struct task_struct *task);
>  extern void rt_mutex_adjust_pi(struct task_struct *p);
>  static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1219,6 +1219,7 @@ static void rt_mutex_init_task(struct ta
>  #ifdef CONFIG_RT_MUTEXES
>  	p->pi_waiters = RB_ROOT;
>  	p->pi_waiters_leftmost = NULL;
> +	p->pi_top_task = NULL;
>  	p->pi_blocked_on = NULL;
>  #endif
>  }
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -256,6 +256,16 @@ rt_mutex_dequeue_pi(struct task_struct *
>  	RB_CLEAR_NODE(&waiter->pi_tree_entry);
>  }
>  
> +void rt_mutex_update_top_task(struct task_struct *p)
> +{
> +	if (!task_has_pi_waiters(p)) {
> +		p->pi_top_task = NULL;
> +		return;
> +	}
> +
> +	p->pi_top_task = task_top_pi_waiter(p)->task;
> +}
> +
>  /*
>   * Calculate task priority from the waiter tree priority
>   *
> @@ -273,10 +283,7 @@ int rt_mutex_getprio(struct task_struct
>  
>  struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
>  {
> -	if (likely(!task_has_pi_waiters(task)))
> -		return NULL;
> -
> -	return task_top_pi_waiter(task)->task;
> +	return task->pi_top_task;
>  }
>  
>  /*
> @@ -285,12 +292,12 @@ struct task_struct *rt_mutex_get_top_tas
>   */
>  int rt_mutex_get_effective_prio(struct task_struct *task, int newprio)
>  {
> -	if (!task_has_pi_waiters(task))
> +	struct task_struct *top_task = rt_mutex_get_top_task(task);
> +
> +	if (!top_task)
>  		return newprio;
>  
> -	if (task_top_pi_waiter(task)->task->prio <= newprio)
> -		return task_top_pi_waiter(task)->task->prio;
> -	return newprio;
> +	return min(top_task->prio, newprio);
>  }
>  
>  /*
> @@ -307,24 +314,6 @@ static void __rt_mutex_adjust_prio(struc
>  }
>  
>  /*
> - * Adjust task priority (undo boosting). Called from the exit path of
> - * rt_mutex_slowunlock() and rt_mutex_slowlock().
> - *
> - * (Note: We do this outside of the protection of lock->wait_lock to
> - * allow the lock to be taken while or before we readjust the priority
> - * of task. We do not use the spin_xx_mutex() variants here as we are
> - * outside of the debug path.)
> - */
> -void rt_mutex_adjust_prio(struct task_struct *task)
> -{
> -	unsigned long flags;
> -
> -	raw_spin_lock_irqsave(&task->pi_lock, flags);
> -	__rt_mutex_adjust_prio(task);
> -	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
> -}
> -
> -/*
>   * Deadlock detection is conditional:
>   *
>   * If CONFIG_DEBUG_RT_MUTEXES=n, deadlock detection is only conducted
> @@ -987,6 +976,7 @@ static void mark_wakeup_next_waiter(stru
>  	 * lock->wait_lock.
>  	 */
>  	rt_mutex_dequeue_pi(current, waiter);
> +	__rt_mutex_adjust_prio(current);
>  
>  	/*
>  	 * As we are waking up the top waiter, and the waiter stays
> @@ -1325,6 +1315,16 @@ static bool __sched rt_mutex_slowunlock(
>  	 */
>  	mark_wakeup_next_waiter(wake_q, lock);
>  
> +	/*
> +	 * We should deboost before waking the top waiter task such that
> +	 * we don't run two tasks with the 'same' priority. This however
> +	 * can lead to prio-inversion if we would get preempted after
> +	 * the deboost but before waking our high-prio task, hence the
> +	 * preempt_disable before unlock. Pairs with preempt_enable() in
> +	 * rt_mutex_postunlock();
> +	 */
> +	preempt_disable();
> +
>  	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
>  
>  	/* check PI boosting */
> @@ -1400,20 +1400,9 @@ rt_mutex_fastunlock(struct rt_mutex *loc
>   */
>  void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost)
>  {
> -	/*
> -	 * We should deboost before waking the top waiter task such that
> -	 * we don't run two tasks with the 'same' priority. This however
> -	 * can lead to prio-inversion if we would get preempted after
> -	 * the deboost but before waking our high-prio task, hence the
> -	 * preempt_disable.
> -	 */
> -	if (deboost) {
> -		preempt_disable();
> -		rt_mutex_adjust_prio(current);
> -	}
> -
>  	wake_up_q(wake_q);
>  
> +	/* Pairs with preempt_disable() in rt_mutex_slowunlock() */
>  	if (deboost)
>  		preempt_enable();
>  }
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3568,6 +3568,8 @@ void rt_mutex_setprio(struct task_struct
>  		goto out_unlock;
>  	}
>  
> +	rt_mutex_update_top_task(p);
> +
>  	trace_sched_pi_setprio(p, prio);
>  	oldprio = p->prio;
>  
> 
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 3/8] sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update
  2016-06-07 19:56 ` [RFC][PATCH 3/8] sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update Peter Zijlstra
@ 2016-06-14 10:43   ` Juri Lelli
  2016-06-14 12:09     ` Peter Zijlstra
  2016-06-15 16:30   ` Steven Rostedt
  1 sibling, 1 reply; 40+ messages in thread
From: Juri Lelli @ 2016-06-14 10:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, rostedt, xlpang, linux-kernel, mathieu.desnoyers,
	jdesfossez, bristot, Ingo Molnar

Hi,

On 07/06/16 21:56, Peter Zijlstra wrote:
> From: Xunlei Pang <xlpang@redhat.com>
> 
> Currently dl tasks will actually return at the very beginning
> of rt_mutex_adjust_prio_chain() in !detect_deadlock cases:
> 
>     if (waiter->prio == task->prio) {
>         if (!detect_deadlock)
>             goto out_unlock_pi; // out here
>         else
>             requeue = false;
>     }
> 
> As the deadline value of blocked deadline tasks(waiters) without
> changing their sched_class(thus prio doesn't change) never changes,
> this seems reasonable, but it actually misses the chance of updating
> rt_mutex_waiter's "dl_runtime(period)_copy" if a waiter updates its
> deadline parameters(dl_runtime, dl_period) or boosted waiter changes
> to !deadline class.
> 
> Thus, force deadline task not out by adding the !dl_prio() condition.
> 
> [peterz: I should introduce more task state comparators like
> rt_mutex_waiter_less, all PI prio comparisons already have this DL
> exception, except this one]
> 
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Juri Lelli <juri.lelli@arm.com>
> Signed-off-by: Xunlei Pang <xlpang@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/1460633827-345-7-git-send-email-xlpang@redhat.com
> ---
>  kernel/locking/rtmutex.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -488,7 +488,7 @@ static int rt_mutex_adjust_prio_chain(st
>  	 * enabled we continue, but stop the requeueing in the chain
>  	 * walk.
>  	 */
> -	if (waiter->prio == task->prio) {
> +	if (waiter->prio == task->prio && !dl_task(task)) {

Right. Do we want a rt_mutex_waiter_equal() helper? As I think the
comment in the changelog was also saying?

Best,

- Juri

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 5/8] rtmutex: Clean up
  2016-06-07 19:56 ` [RFC][PATCH 5/8] rtmutex: Clean up Peter Zijlstra
@ 2016-06-14 12:08   ` Juri Lelli
  2016-06-14 12:32     ` Peter Zijlstra
  0 siblings, 1 reply; 40+ messages in thread
From: Juri Lelli @ 2016-06-14 12:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, rostedt, xlpang, linux-kernel, mathieu.desnoyers,
	jdesfossez, bristot

Hi,

On 07/06/16 21:56, Peter Zijlstra wrote:
> Previous patches changed the meaning of the return value of
> rt_mutex_slowunlock(); update comments and code to reflect this.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/futex.c                  |   12 ++++++------
>  kernel/locking/rtmutex.c        |   20 +++++++++-----------
>  kernel/locking/rtmutex_common.h |    2 +-
>  3 files changed, 16 insertions(+), 18 deletions(-)
> 
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1261,7 +1261,7 @@ static int wake_futex_pi(u32 __user *uad
>  	struct futex_pi_state *pi_state = this->pi_state;
>  	u32 uninitialized_var(curval), newval;
>  	WAKE_Q(wake_q);
> -	bool deboost;
> +	bool postunlock;
>  	int ret = 0;
>  
>  	if (!pi_state)
> @@ -1327,17 +1327,17 @@ static int wake_futex_pi(u32 __user *uad
>  
>  	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
>  
> -	deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
> +	postunlock = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
>  
>  	/*
>  	 * First unlock HB so the waiter does not spin on it once he got woken
> -	 * up. Second wake up the waiter before the priority is adjusted. If we
> -	 * deboost first (and lose our higher priority), then the task might get
> -	 * scheduled away before the wake up can take place.
> +	 * up. Then wakeup the waiter by calling rt_mutex_postunlock(). Priority
> +	 * is already adjusted and preemption is disabled to avoid inversion.
>  	 */
>  	spin_unlock(&hb->lock);
>  
> -	rt_mutex_postunlock(&wake_q, deboost);
> +	if (postunlock)
> +		rt_mutex_postunlock(&wake_q);

I'm most probably missing something, but don't we still need to call
wake_up_q() even when postunlock is false? IIUC, we were always doing
that, rt_mutex_postunlock(), before this change (only calling
preempt_enable() was conditional).

Best,

- Juri

>  
>  	return 0;
>  }
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -1254,7 +1254,8 @@ static inline int rt_mutex_slowtrylock(s
>  
>  /*
>   * Slow path to release a rt-mutex.
> - * Return whether the current task needs to undo a potential priority boosting.
> + *
> + * Return whether the current task needs to call rt_mutex_postunlock().
>   */
>  static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock,
>  					struct wake_q_head *wake_q)
> @@ -1327,7 +1328,7 @@ static bool __sched rt_mutex_slowunlock(
>  
>  	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
>  
> -	/* check PI boosting */
> +	/* call rt_mutex_postunlock() */
>  	return true;
>  }
>  
> @@ -1378,15 +1379,14 @@ rt_mutex_fasttrylock(struct rt_mutex *lo
>  }
>  
>  /*
> - * Undo pi boosting (if necessary) and wake top waiter.
> + * Performs the wakeup of the the top-waiter and re-enables preemption.
>   */
> -void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost)
> +void rt_mutex_postunlock(struct wake_q_head *wake_q)
>  {
>  	wake_up_q(wake_q);
>  
>  	/* Pairs with preempt_disable() in rt_mutex_slowunlock() */
> -	if (deboost)
> -		preempt_enable();
> +	preempt_enable();
>  }
>  
>  /**
> @@ -1489,9 +1489,8 @@ void __sched rt_mutex_unlock(struct rt_m
>  		rt_mutex_deadlock_account_unlock(current);
>  
>  	} else {
> -		bool deboost = rt_mutex_slowunlock(lock, &wake_q);
> -
> -		rt_mutex_postunlock(&wake_q, deboost);
> +		if (rt_mutex_slowunlock(lock, &wake_q))
> +			rt_mutex_postunlock(&wake_q);
>  	}
>  }
>  EXPORT_SYMBOL_GPL(rt_mutex_unlock);
> @@ -1500,8 +1499,7 @@ EXPORT_SYMBOL_GPL(rt_mutex_unlock);
>   * rt_mutex_futex_unlock - Futex variant of rt_mutex_unlock
>   * @lock: the rt_mutex to be unlocked
>   *
> - * Returns: true/false indicating whether priority adjustment is
> - * required or not.
> + * Returns: true/false indicating whether we should call rt_mutex_postunlock().
>   */
>  bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock,
>  				   struct wake_q_head *wqh)
> --- a/kernel/locking/rtmutex_common.h
> +++ b/kernel/locking/rtmutex_common.h
> @@ -111,7 +111,7 @@ extern int rt_mutex_finish_proxy_lock(st
>  extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
>  extern bool rt_mutex_futex_unlock(struct rt_mutex *lock,
>  				  struct wake_q_head *wqh);
> -extern void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost);
> +extern void rt_mutex_postunlock(struct wake_q_head *wake_q);
>  extern void rt_mutex_adjust_prio(struct task_struct *task);
>  
>  #ifdef CONFIG_DEBUG_RT_MUTEXES
> 
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 3/8] sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update
  2016-06-14 10:43   ` Juri Lelli
@ 2016-06-14 12:09     ` Peter Zijlstra
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2016-06-14 12:09 UTC (permalink / raw)
  To: Juri Lelli
  Cc: mingo, tglx, rostedt, xlpang, linux-kernel, mathieu.desnoyers,
	jdesfossez, bristot, Ingo Molnar

On Tue, Jun 14, 2016 at 11:43:39AM +0100, Juri Lelli wrote:
> > [peterz: I should introduce more task state comparators like
> > rt_mutex_waiter_less, all PI prio comparisons already have this DL
> > exception, except this one]

> > +++ b/kernel/locking/rtmutex.c
> > @@ -488,7 +488,7 @@ static int rt_mutex_adjust_prio_chain(st
> >  	 * enabled we continue, but stop the requeueing in the chain
> >  	 * walk.
> >  	 */
> > -	if (waiter->prio == task->prio) {
> > +	if (waiter->prio == task->prio && !dl_task(task)) {
> 
> Right. Do we want a rt_mutex_waiter_equal() helper? As I think the
> comment in the changelog was also saying?

rt_mutex_waiter_equal() isn't going to help; it will not work on code
that doesn't have access to rt_mutex_waiter thingies, like
rt_mutex_setprio() for example. 

Then again, I already wondered about making rt_mutex_waiter available to
sched code, but tglx didn't much like that iirc.

Esp. given the last patch in this series, I wondered if perhaps the
second argument to rt_mutex_setprio() should be a struct
rt_mutex_waiter, instead of struct task_struct.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 2/8] sched/rtmutex/deadline: Fix a PI crash for deadline tasks
  2016-06-14 10:21   ` Juri Lelli
@ 2016-06-14 12:30     ` Peter Zijlstra
  2016-06-14 12:53     ` Xunlei Pang
  1 sibling, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2016-06-14 12:30 UTC (permalink / raw)
  To: Juri Lelli
  Cc: mingo, tglx, rostedt, xlpang, linux-kernel, mathieu.desnoyers,
	jdesfossez, bristot, Ingo Molnar

On Tue, Jun 14, 2016 at 11:21:09AM +0100, Juri Lelli wrote:
> > [XXX this next section is unparsable]
> 
> Yes, a bit hard to understand. However, am I correct in assuming this
> patch and the previous one should fix this problem? Or are there still
> other races causing issues?

I think so; so there were two related problems,

 1) top_waiter was used outside its serialization
 2) a race against the top waiter task and sched_setscheduler() changing
 its state

Now, I could not understand a word of that marked paragraph, but from my
understanding of the code both are solved.

 1) by keeping the top_pi_task cache updated under pi_lock and rq->lock,
 thereby ensuring that holding either is sufficient to stabilize it.

 2) sched_setscheduler() can change the parameters of the top_pi_task,
 but since it too holds both pi_lock and rq->lock, it cannot happen at
 the same time that we're looking at the cached top pi waiter pointer
 thingy.

It can however happen that top_pi_waiter is not in fact the top waiter
in a narrow window between sched_setscheduler() changing its parameters
and rt_mutex_adjust_pi() re-ordering the PI chain - ending in updating
the cached top task pointer thingy.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 5/8] rtmutex: Clean up
  2016-06-14 12:08   ` Juri Lelli
@ 2016-06-14 12:32     ` Peter Zijlstra
  2016-06-14 12:41       ` Juri Lelli
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2016-06-14 12:32 UTC (permalink / raw)
  To: Juri Lelli
  Cc: mingo, tglx, rostedt, xlpang, linux-kernel, mathieu.desnoyers,
	jdesfossez, bristot

On Tue, Jun 14, 2016 at 01:08:13PM +0100, Juri Lelli wrote:
> > +	postunlock = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
> >  
> >  	/*
> >  	 * First unlock HB so the waiter does not spin on it once he got woken
> > +	 * up. Then wakeup the waiter by calling rt_mutex_postunlock(). Priority
> > +	 * is already adjusted and preemption is disabled to avoid inversion.
> >  	 */
> >  	spin_unlock(&hb->lock);
> >  
> > +	if (postunlock)
> > +		rt_mutex_postunlock(&wake_q);
> 
> I'm most probably missing something, but don't we still need to call
> wake_up_q() even when postunlock is false? IIUC, we were always doing
> that, rt_mutex_postunlock(), before this change (only calling
> preempt_enable() was conditional).

Note that rt_mutex_slowunlock() only uses wake_q on the true path. When
it returns false, it will not have placed a task to wake up.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 5/8] rtmutex: Clean up
  2016-06-14 12:32     ` Peter Zijlstra
@ 2016-06-14 12:41       ` Juri Lelli
  0 siblings, 0 replies; 40+ messages in thread
From: Juri Lelli @ 2016-06-14 12:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, rostedt, xlpang, linux-kernel, mathieu.desnoyers,
	jdesfossez, bristot

On 14/06/16 14:32, Peter Zijlstra wrote:
> On Tue, Jun 14, 2016 at 01:08:13PM +0100, Juri Lelli wrote:
> > > +	postunlock = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
> > >  
> > >  	/*
> > >  	 * First unlock HB so the waiter does not spin on it once he got woken
> > > +	 * up. Then wakeup the waiter by calling rt_mutex_postunlock(). Priority
> > > +	 * is already adjusted and preemption is disabled to avoid inversion.
> > >  	 */
> > >  	spin_unlock(&hb->lock);
> > >  
> > > +	if (postunlock)
> > > +		rt_mutex_postunlock(&wake_q);
> > 
> > I'm most probably missing something, but don't we still need to call
> > wake_up_q() even when postunlock is false? IIUC, we were always doing
> > that, rt_mutex_postunlock(), before this change (only calling
> > preempt_enable() was conditional).
> 
> Note that rt_mutex_slowunlock() only uses wake_q on the true path. When
> it returns false, it will not have placed a task to wake up.
> 

Right. But, I thought we were still ending up calling wake_up_q before
this change. Which however looked fine, as it won't do anything if no
task is queued I guess. So, no problem before and no problem now I'd
say. :-)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 2/8] sched/rtmutex/deadline: Fix a PI crash for deadline tasks
  2016-06-14 10:21   ` Juri Lelli
  2016-06-14 12:30     ` Peter Zijlstra
@ 2016-06-14 12:53     ` Xunlei Pang
  2016-06-14 13:07       ` Juri Lelli
  1 sibling, 1 reply; 40+ messages in thread
From: Xunlei Pang @ 2016-06-14 12:53 UTC (permalink / raw)
  To: Juri Lelli, Peter Zijlstra
  Cc: mingo, tglx, rostedt, xlpang, linux-kernel, mathieu.desnoyers,
	jdesfossez, bristot, Ingo Molnar

[-- Attachment #1: Type: text/plain, Size: 9770 bytes --]

On 2016/06/14 at 18:21, Juri Lelli wrote:
> Hi,
>
> On 07/06/16 21:56, Peter Zijlstra wrote:
>> From: Xunlei Pang <xlpang@redhat.com>
>>
>> A crash happened while I was playing with deadline PI rtmutex.
>>
>>     BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
>>     IP: [<ffffffff810eeb8f>] rt_mutex_get_top_task+0x1f/0x30
>>     PGD 232a75067 PUD 230947067 PMD 0
>>     Oops: 0000 [#1] SMP
>>     CPU: 1 PID: 10994 Comm: a.out Not tainted
>>
>>     Call Trace:
>>     [<ffffffff810b658c>] enqueue_task+0x2c/0x80
>>     [<ffffffff810ba763>] activate_task+0x23/0x30
>>     [<ffffffff810d0ab5>] pull_dl_task+0x1d5/0x260
>>     [<ffffffff810d0be6>] pre_schedule_dl+0x16/0x20
>>     [<ffffffff8164e783>] __schedule+0xd3/0x900
>>     [<ffffffff8164efd9>] schedule+0x29/0x70
>>     [<ffffffff8165035b>] __rt_mutex_slowlock+0x4b/0xc0
>>     [<ffffffff81650501>] rt_mutex_slowlock+0xd1/0x190
>>     [<ffffffff810eeb33>] rt_mutex_timed_lock+0x53/0x60
>>     [<ffffffff810ecbfc>] futex_lock_pi.isra.18+0x28c/0x390
>>     [<ffffffff810ed8b0>] do_futex+0x190/0x5b0
>>     [<ffffffff810edd50>] SyS_futex+0x80/0x180
>>
> This seems to be caused by the race condition you detail below between
> load balancing and PI code. I tried to reproduce the BUG on my box, but
> it looks hard to get. Do you have a reproducer I can give a try?
>
>> This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
>> are only protected by pi_lock when operating pi waiters, while
>> rt_mutex_get_top_task(), will access them with rq lock held but
>> not holding pi_lock.
>>
>> In order to tackle it, we introduce new "pi_top_task" pointer
>> cached in task_struct, and add new rt_mutex_update_top_task()
>> to update its value, it can be called by rt_mutex_setprio()
>> which held both owner's pi_lock and rq lock. Thus "pi_top_task"
>> can be safely accessed by enqueue_task_dl() under rq lock.
>>
>> [XXX this next section is unparsable]
> Yes, a bit hard to understand. However, am I correct in assuming this
> patch and the previous one should fix this problem? Or are there still
> other races causing issues?

Yes, these two patches can fix the problem.

>
>> One problem is when rt_mutex_adjust_prio()->...->rt_mutex_setprio(),
>> at that time rtmutex lock was released and owner was marked off,
>> this can cause "pi_top_task" dereferenced to be a running one(as it
>> can be falsely woken up by others before rt_mutex_setprio() is
>> made to update "pi_top_task"). We solve this by directly calling
>> __rt_mutex_adjust_prio() in mark_wakeup_next_waiter() which held
>> pi_lock and rtmutex lock, and remove rt_mutex_adjust_prio(). Since
>> now we moved the deboost point, in order to avoid current to be
>> preempted due to deboost earlier before wake_up_q(), we also moved
>> preempt_disable() before unlocking rtmutex.
>>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Juri Lelli <juri.lelli@arm.com>
>> Originally-From: Peter Zijlstra <peterz@infradead.org>
>> Signed-off-by: Xunlei Pang <xlpang@redhat.com>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Link: http://lkml.kernel.org/r/1461659449-19497-2-git-send-email-xlpang@redhat.com
> The idea of this fix makes sense to me. But, I would like to be able to
> see the BUG and test the fix. What I have is a test in which I create N
> DEADLINE workers that share a PI mutex. They get migrated around and
> seem to stress PI code. But I couldn't hit the BUG yet. Maybe I let it
> run for some more time.

You can use this reproducer attached(gcc crash_deadline_pi.c -lpthread -lrt ).
Start multiple instances, then it will hit the bug very soon.

Regards,
Xunlei

>
> Best,
>
> - Juri
>
>> ---
>>
>>  include/linux/init_task.h |    1 
>>  include/linux/sched.h     |    2 +
>>  include/linux/sched/rt.h  |    1 
>>  kernel/fork.c             |    1 
>>  kernel/locking/rtmutex.c  |   65 +++++++++++++++++++---------------------------
>>  kernel/sched/core.c       |    2 +
>>  6 files changed, 34 insertions(+), 38 deletions(-)
>>
>> --- a/include/linux/init_task.h
>> +++ b/include/linux/init_task.h
>> @@ -162,6 +162,7 @@ extern struct task_group root_task_group
>>  #ifdef CONFIG_RT_MUTEXES
>>  # define INIT_RT_MUTEXES(tsk)						\
>>  	.pi_waiters = RB_ROOT,						\
>> +	.pi_top_task = NULL,						\
>>  	.pi_waiters_leftmost = NULL,
>>  #else
>>  # define INIT_RT_MUTEXES(tsk)
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1681,6 +1681,8 @@ struct task_struct {
>>  	/* PI waiters blocked on a rt_mutex held by this task */
>>  	struct rb_root pi_waiters;
>>  	struct rb_node *pi_waiters_leftmost;
>> +	/* Updated under owner's pi_lock and rq lock */
>> +	struct task_struct *pi_top_task;
>>  	/* Deadlock detection and priority inheritance handling */
>>  	struct rt_mutex_waiter *pi_blocked_on;
>>  #endif
>> --- a/include/linux/sched/rt.h
>> +++ b/include/linux/sched/rt.h
>> @@ -19,6 +19,7 @@ static inline int rt_task(struct task_st
>>  extern int rt_mutex_getprio(struct task_struct *p);
>>  extern void rt_mutex_setprio(struct task_struct *p, int prio);
>>  extern int rt_mutex_get_effective_prio(struct task_struct *task, int newprio);
>> +extern void rt_mutex_update_top_task(struct task_struct *p);
>>  extern struct task_struct *rt_mutex_get_top_task(struct task_struct *task);
>>  extern void rt_mutex_adjust_pi(struct task_struct *p);
>>  static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1219,6 +1219,7 @@ static void rt_mutex_init_task(struct ta
>>  #ifdef CONFIG_RT_MUTEXES
>>  	p->pi_waiters = RB_ROOT;
>>  	p->pi_waiters_leftmost = NULL;
>> +	p->pi_top_task = NULL;
>>  	p->pi_blocked_on = NULL;
>>  #endif
>>  }
>> --- a/kernel/locking/rtmutex.c
>> +++ b/kernel/locking/rtmutex.c
>> @@ -256,6 +256,16 @@ rt_mutex_dequeue_pi(struct task_struct *
>>  	RB_CLEAR_NODE(&waiter->pi_tree_entry);
>>  }
>>  
>> +void rt_mutex_update_top_task(struct task_struct *p)
>> +{
>> +	if (!task_has_pi_waiters(p)) {
>> +		p->pi_top_task = NULL;
>> +		return;
>> +	}
>> +
>> +	p->pi_top_task = task_top_pi_waiter(p)->task;
>> +}
>> +
>>  /*
>>   * Calculate task priority from the waiter tree priority
>>   *
>> @@ -273,10 +283,7 @@ int rt_mutex_getprio(struct task_struct
>>  
>>  struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
>>  {
>> -	if (likely(!task_has_pi_waiters(task)))
>> -		return NULL;
>> -
>> -	return task_top_pi_waiter(task)->task;
>> +	return task->pi_top_task;
>>  }
>>  
>>  /*
>> @@ -285,12 +292,12 @@ struct task_struct *rt_mutex_get_top_tas
>>   */
>>  int rt_mutex_get_effective_prio(struct task_struct *task, int newprio)
>>  {
>> -	if (!task_has_pi_waiters(task))
>> +	struct task_struct *top_task = rt_mutex_get_top_task(task);
>> +
>> +	if (!top_task)
>>  		return newprio;
>>  
>> -	if (task_top_pi_waiter(task)->task->prio <= newprio)
>> -		return task_top_pi_waiter(task)->task->prio;
>> -	return newprio;
>> +	return min(top_task->prio, newprio);
>>  }
>>  
>>  /*
>> @@ -307,24 +314,6 @@ static void __rt_mutex_adjust_prio(struc
>>  }
>>  
>>  /*
>> - * Adjust task priority (undo boosting). Called from the exit path of
>> - * rt_mutex_slowunlock() and rt_mutex_slowlock().
>> - *
>> - * (Note: We do this outside of the protection of lock->wait_lock to
>> - * allow the lock to be taken while or before we readjust the priority
>> - * of task. We do not use the spin_xx_mutex() variants here as we are
>> - * outside of the debug path.)
>> - */
>> -void rt_mutex_adjust_prio(struct task_struct *task)
>> -{
>> -	unsigned long flags;
>> -
>> -	raw_spin_lock_irqsave(&task->pi_lock, flags);
>> -	__rt_mutex_adjust_prio(task);
>> -	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
>> -}
>> -
>> -/*
>>   * Deadlock detection is conditional:
>>   *
>>   * If CONFIG_DEBUG_RT_MUTEXES=n, deadlock detection is only conducted
>> @@ -987,6 +976,7 @@ static void mark_wakeup_next_waiter(stru
>>  	 * lock->wait_lock.
>>  	 */
>>  	rt_mutex_dequeue_pi(current, waiter);
>> +	__rt_mutex_adjust_prio(current);
>>  
>>  	/*
>>  	 * As we are waking up the top waiter, and the waiter stays
>> @@ -1325,6 +1315,16 @@ static bool __sched rt_mutex_slowunlock(
>>  	 */
>>  	mark_wakeup_next_waiter(wake_q, lock);
>>  
>> +	/*
>> +	 * We should deboost before waking the top waiter task such that
>> +	 * we don't run two tasks with the 'same' priority. This however
>> +	 * can lead to prio-inversion if we would get preempted after
>> +	 * the deboost but before waking our high-prio task, hence the
>> +	 * preempt_disable before unlock. Pairs with preempt_enable() in
>> +	 * rt_mutex_postunlock();
>> +	 */
>> +	preempt_disable();
>> +
>>  	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
>>  
>>  	/* check PI boosting */
>> @@ -1400,20 +1400,9 @@ rt_mutex_fastunlock(struct rt_mutex *loc
>>   */
>>  void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost)
>>  {
>> -	/*
>> -	 * We should deboost before waking the top waiter task such that
>> -	 * we don't run two tasks with the 'same' priority. This however
>> -	 * can lead to prio-inversion if we would get preempted after
>> -	 * the deboost but before waking our high-prio task, hence the
>> -	 * preempt_disable.
>> -	 */
>> -	if (deboost) {
>> -		preempt_disable();
>> -		rt_mutex_adjust_prio(current);
>> -	}
>> -
>>  	wake_up_q(wake_q);
>>  
>> +	/* Pairs with preempt_disable() in rt_mutex_slowunlock() */
>>  	if (deboost)
>>  		preempt_enable();
>>  }
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3568,6 +3568,8 @@ void rt_mutex_setprio(struct task_struct
>>  		goto out_unlock;
>>  	}
>>  
>> +	rt_mutex_update_top_task(p);
>> +
>>  	trace_sched_pi_setprio(p, prio);
>>  	oldprio = p->prio;
>>  
>>
>>


[-- Attachment #2: crash_deadline_pi.c --]
[-- Type: text/x-csrc, Size: 4149 bytes --]

#define _GNU_SOURCE
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <linux/unistd.h>
#include <linux/kernel.h>
#include <linux/types.h>
#include <sys/syscall.h>
#include <pthread.h>
#include <sched.h>
#include <time.h>

 #define gettid() syscall(__NR_gettid)

 #define SCHED_DEADLINE	6

 /* XXX use the proper syscall numbers */
 #ifdef __x86_64__
 #define __NR_sched_setattr		314
 #define __NR_sched_getattr		315
 #endif

 #ifdef __i386__
 #define __NR_sched_setattr		351
 #define __NR_sched_getattr		352
 #endif

 #ifdef __ppc64__
 #define __NR_sched_setattr		355
 #define __NR_sched_getattr		356
 #endif

#ifdef __s390x__
#define __NR_sched_setattr     345
#define __NR_sched_getattr     346
#endif


static volatile int done;

 struct sched_attr {
	__u32 size;

	__u32 sched_policy;
	__u64 sched_flags;

	/* SCHED_NORMAL, SCHED_BATCH */
	__s32 sched_nice;

	/* SCHED_FIFO, SCHED_RR */
	__u32 sched_priority;

	/* SCHED_DEADLINE (nsec) */
	__u64 sched_runtime;
	__u64 sched_deadline;
	__u64 sched_period;
 };

 int sched_setattr(pid_t pid,
		  const struct sched_attr *attr,
		  unsigned int flags)
 {
	return syscall(__NR_sched_setattr, pid, attr, flags);
 }

 int sched_getattr(pid_t pid,
		  struct sched_attr *attr,
		  unsigned int size,
		  unsigned int flags)
 {
	return syscall(__NR_sched_getattr, pid, attr, size, flags);
 }

pthread_mutex_t mutex_obj;
pthread_mutexattr_t mutex_attr;
int x = 0;

static int decide(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);

	if (ts.tv_nsec & 1)
		return 1;
	else
		return 0;
}

static void mutex_init(void)
{
	pthread_mutexattr_init(&mutex_attr);
	pthread_mutexattr_setprotocol(&mutex_attr, PTHREAD_PRIO_INHERIT);
	pthread_mutex_init(&mutex_obj, &mutex_attr);
}

static deadline_ndelay(unsigned int cnt)
{
	unsigned int i;

	for (i = 0; i < 10000 * cnt; i++);
}

 void *run_deadline_special(void *data)
 {
	struct sched_attr attr;
	int ret, take;
	unsigned int flags = 0;

	attr.size = sizeof(attr);
	attr.sched_flags = 0;
	attr.sched_nice = 0;
	attr.sched_priority = 0;

	/* This creates a 10ms/30ms reservation */
	attr.sched_policy = SCHED_DEADLINE;
	attr.sched_runtime = 100 * 1000 * 1000;
	attr.sched_deadline = 200 * 1000 * 1000;
	attr.sched_period = 300 * 1000 * 1000;

	ret = sched_setattr(0, &attr, flags);
	if (ret < 0) {
		done = 0;
		perror("sched_setattr");
		exit(-1);
	}

	printf("special deadline thread started [%ld]\n", gettid());
	while (!done) {
		take = decide();
		if (take)
			pthread_mutex_lock(&mutex_obj);

		x++;
		deadline_ndelay((unsigned long) attr.sched_runtime % 7);

		if (take)
			pthread_mutex_unlock(&mutex_obj);
	}

	printf("special deadline thread dies [%ld]\n", gettid());
	return NULL;
 }

 void *run_deadline(void *data)
 {
	struct sched_attr attr;
	int ret, take;
	unsigned int flags = 0;
	static unsigned int delta = 0;


	attr.size = sizeof(attr);
	attr.sched_flags = 0;
	attr.sched_nice = 0;
	attr.sched_priority = 0;

	/* This creates a 10ms/30ms reservation */
	delta += 1000 * 1000 * 2;
	attr.sched_policy = SCHED_DEADLINE;
	attr.sched_runtime = 20 * 1000 * 1000 + delta;
	attr.sched_deadline = 400 * 1000 * 1000;
	attr.sched_period = 400 * 1000 * 1000;

	ret = sched_setattr(0, &attr, flags);
	if (ret < 0) {
		done = 0;
		perror("sched_setattr");
		exit(-1);
	}

	printf("deadline thread started [%ld]\n", gettid());
	while (!done) {
		take = decide();
		if (take)
			pthread_mutex_lock(&mutex_obj);

		x++;
		deadline_ndelay((unsigned long) attr.sched_runtime % 7);

		if (take)
			pthread_mutex_unlock(&mutex_obj);
	}

	printf("deadline thread dies [%ld]\n", gettid());
	return NULL;
 }

#define THREAD_NUM 10

 int main (int argc, char **argv)
 {
	pthread_t thread[THREAD_NUM];
	int i;

	mutex_init();

	printf("main thread [%ld]\n", gettid());

	for (i = 0; i < THREAD_NUM-1; i++)
		pthread_create(&thread[i], NULL, run_deadline, NULL);

	pthread_create(&thread[THREAD_NUM-1], NULL, run_deadline_special, NULL);

	sleep(3600*300);

	done = 1;
	for (i = 0; i < THREAD_NUM; i++)
		pthread_join(thread[i], NULL);

	printf("main dies [%ld]\n", gettid());
	return 0;
 }

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/8] rtmutex: Deboost before waking up the top waiter
  2016-06-14  9:09   ` Juri Lelli
@ 2016-06-14 12:54     ` Peter Zijlstra
  2016-06-14 13:20       ` Juri Lelli
  2016-06-14 16:36     ` Davidlohr Bueso
  1 sibling, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2016-06-14 12:54 UTC (permalink / raw)
  To: Juri Lelli
  Cc: mingo, tglx, rostedt, xlpang, linux-kernel, mathieu.desnoyers,
	jdesfossez, bristot, Ingo Molnar

On Tue, Jun 14, 2016 at 10:09:34AM +0100, Juri Lelli wrote:
> I've got only nitpicks for the changelog. Otherwise the patch looks good
> to me (and yes, without it bw inheritance would be a problem).

So for bw inheritance I'm still not sure how to dead with the faxt that
the top_pi_waiter, while blocked, can still be running, spin waiting.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 2/8] sched/rtmutex/deadline: Fix a PI crash for deadline tasks
  2016-06-14 12:53     ` Xunlei Pang
@ 2016-06-14 13:07       ` Juri Lelli
  2016-06-14 16:39         ` Juri Lelli
  0 siblings, 1 reply; 40+ messages in thread
From: Juri Lelli @ 2016-06-14 13:07 UTC (permalink / raw)
  To: xlpang
  Cc: Peter Zijlstra, mingo, tglx, rostedt, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, Ingo Molnar

On 14/06/16 20:53, Xunlei Pang wrote:
> On 2016/06/14 at 18:21, Juri Lelli wrote:
> > Hi,
> >
> > On 07/06/16 21:56, Peter Zijlstra wrote:
> >> From: Xunlei Pang <xlpang@redhat.com>
> >>
> >> A crash happened while I was playing with deadline PI rtmutex.
> >>
> >>     BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
> >>     IP: [<ffffffff810eeb8f>] rt_mutex_get_top_task+0x1f/0x30
> >>     PGD 232a75067 PUD 230947067 PMD 0
> >>     Oops: 0000 [#1] SMP
> >>     CPU: 1 PID: 10994 Comm: a.out Not tainted
> >>
> >>     Call Trace:
> >>     [<ffffffff810b658c>] enqueue_task+0x2c/0x80
> >>     [<ffffffff810ba763>] activate_task+0x23/0x30
> >>     [<ffffffff810d0ab5>] pull_dl_task+0x1d5/0x260
> >>     [<ffffffff810d0be6>] pre_schedule_dl+0x16/0x20
> >>     [<ffffffff8164e783>] __schedule+0xd3/0x900
> >>     [<ffffffff8164efd9>] schedule+0x29/0x70
> >>     [<ffffffff8165035b>] __rt_mutex_slowlock+0x4b/0xc0
> >>     [<ffffffff81650501>] rt_mutex_slowlock+0xd1/0x190
> >>     [<ffffffff810eeb33>] rt_mutex_timed_lock+0x53/0x60
> >>     [<ffffffff810ecbfc>] futex_lock_pi.isra.18+0x28c/0x390
> >>     [<ffffffff810ed8b0>] do_futex+0x190/0x5b0
> >>     [<ffffffff810edd50>] SyS_futex+0x80/0x180
> >>
> > This seems to be caused by the race condition you detail below between
> > load balancing and PI code. I tried to reproduce the BUG on my box, but
> > it looks hard to get. Do you have a reproducer I can give a try?
> >
> >> This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
> >> are only protected by pi_lock when operating pi waiters, while
> >> rt_mutex_get_top_task(), will access them with rq lock held but
> >> not holding pi_lock.
> >>
> >> In order to tackle it, we introduce new "pi_top_task" pointer
> >> cached in task_struct, and add new rt_mutex_update_top_task()
> >> to update its value, it can be called by rt_mutex_setprio()
> >> which held both owner's pi_lock and rq lock. Thus "pi_top_task"
> >> can be safely accessed by enqueue_task_dl() under rq lock.
> >>
> >> [XXX this next section is unparsable]
> > Yes, a bit hard to understand. However, am I correct in assuming this
> > patch and the previous one should fix this problem? Or are there still
> > other races causing issues?
> 
> Yes, these two patches can fix the problem.
> 
> >
> >> One problem is when rt_mutex_adjust_prio()->...->rt_mutex_setprio(),
> >> at that time rtmutex lock was released and owner was marked off,
> >> this can cause "pi_top_task" dereferenced to be a running one(as it
> >> can be falsely woken up by others before rt_mutex_setprio() is
> >> made to update "pi_top_task"). We solve this by directly calling
> >> __rt_mutex_adjust_prio() in mark_wakeup_next_waiter() which held
> >> pi_lock and rtmutex lock, and remove rt_mutex_adjust_prio(). Since
> >> now we moved the deboost point, in order to avoid current to be
> >> preempted due to deboost earlier before wake_up_q(), we also moved
> >> preempt_disable() before unlocking rtmutex.
> >>
> >> Cc: Steven Rostedt <rostedt@goodmis.org>
> >> Cc: Ingo Molnar <mingo@redhat.com>
> >> Cc: Juri Lelli <juri.lelli@arm.com>
> >> Originally-From: Peter Zijlstra <peterz@infradead.org>
> >> Signed-off-by: Xunlei Pang <xlpang@redhat.com>
> >> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> >> Link: http://lkml.kernel.org/r/1461659449-19497-2-git-send-email-xlpang@redhat.com
> > The idea of this fix makes sense to me. But, I would like to be able to
> > see the BUG and test the fix. What I have is a test in which I create N
> > DEADLINE workers that share a PI mutex. They get migrated around and
> > seem to stress PI code. But I couldn't hit the BUG yet. Maybe I let it
> > run for some more time.
> 
> You can use this reproducer attached(gcc crash_deadline_pi.c -lpthread -lrt ).
> Start multiple instances, then it will hit the bug very soon.
> 

Great, thanks! I'll use it.

Best,

- Juri

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 6/8] sched/rtmutex: Refactor rt_mutex_setprio()
  2016-06-07 19:56 ` [RFC][PATCH 6/8] sched/rtmutex: Refactor rt_mutex_setprio() Peter Zijlstra
@ 2016-06-14 13:14   ` Juri Lelli
  2016-06-14 14:08     ` Peter Zijlstra
  0 siblings, 1 reply; 40+ messages in thread
From: Juri Lelli @ 2016-06-14 13:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, rostedt, xlpang, linux-kernel, mathieu.desnoyers,
	jdesfossez, bristot

Hi,

still digesting this change, but I'll point out below why I think you
are hitting a NULL ptr dereference (discussed on IRC).

On 07/06/16 21:56, Peter Zijlstra wrote:
> With the introduction of SCHED_DEADLINE the whole notion that priority
> is a single number is gone, therefore the @prio argument to
> rt_mutex_setprio() doesn't make sense anymore.
> 
> So rework the code to pass a pi_task instead.
> 
> Note this also fixes a problem with pi_top_task caching; previously we
> would not set the pointer (call rt_mutex_update_top_task) if the
> priority didn't change, this could lead to a stale pointer.
> 
> As for the XXX, I think its fine to use pi_task->prio, because if it
> differs from waiter->prio, a PI chain update is immenent.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/sched/rt.h        |   21 +-------
>  kernel/locking/rtmutex.c        |  105 +++++++++++-----------------------------
>  kernel/locking/rtmutex_common.h |    1 
>  kernel/sched/core.c             |   66 ++++++++++++++++++++-----
>  4 files changed, 88 insertions(+), 105 deletions(-)
> 

[...]

> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -256,61 +256,16 @@ rt_mutex_dequeue_pi(struct task_struct *
>  	RB_CLEAR_NODE(&waiter->pi_tree_entry);
>  }
>  
> -void rt_mutex_update_top_task(struct task_struct *p)
> +static void rt_mutex_adjust_prio(struct task_struct *p)
>  {
> -	if (!task_has_pi_waiters(p)) {
> -		p->pi_top_task = NULL;
> -		return;
> -	}
> +	struct task_struct *pi_task = NULL;
>  
> -	p->pi_top_task = task_top_pi_waiter(p)->task;
> -}
> -
> -/*
> - * Calculate task priority from the waiter tree priority
> - *
> - * Return task->normal_prio when the waiter tree is empty or when
> - * the waiter is not allowed to do priority boosting
> - */
> -int rt_mutex_getprio(struct task_struct *task)
> -{
> -	if (likely(!task_has_pi_waiters(task)))
> -		return task->normal_prio;
> +	lockdep_assert_held(&p->pi_lock);
>  
> -	return min(task_top_pi_waiter(task)->prio,
> -		   task->normal_prio);
> -}
> +	if (!task_has_pi_waiters(p))

Shouldn't this be the other way around?

 if (task_has_pi_waiters(p))
 	pi_task = ...

Best,

- Juri

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/8] rtmutex: Deboost before waking up the top waiter
  2016-06-14 12:54     ` Peter Zijlstra
@ 2016-06-14 13:20       ` Juri Lelli
  2016-06-14 13:59         ` Peter Zijlstra
  0 siblings, 1 reply; 40+ messages in thread
From: Juri Lelli @ 2016-06-14 13:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, rostedt, xlpang, linux-kernel, mathieu.desnoyers,
	jdesfossez, bristot, Ingo Molnar

On 14/06/16 14:54, Peter Zijlstra wrote:
> On Tue, Jun 14, 2016 at 10:09:34AM +0100, Juri Lelli wrote:
> > I've got only nitpicks for the changelog. Otherwise the patch looks good
> > to me (and yes, without it bw inheritance would be a problem).
> 
> So for bw inheritance I'm still not sure how to dead with the faxt that
> the top_pi_waiter, while blocked, can still be running, spin waiting.
> 

You mean for M-BWI (multiprocessor), right? If that's the case, we were
actually discussing this thing with Pisa/Trento folks yesterday. I'm not
sure yet as well, but plan seems to be to get first things right with
current DI code (Luca was saying that there is a BUG somewhere); then
move to implement BWI; and then tackle the M- case (and see what we can
do to work around the theoretical need for spin waiting). We actually
got some ideas a while back, but I need to go there and refresh my mind.

If the plan sounds reasonable to you, it seems that we can start this
discussion as soon as Luca has his DI fixes ready. What you think?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/8] rtmutex: Deboost before waking up the top waiter
  2016-06-14 13:20       ` Juri Lelli
@ 2016-06-14 13:59         ` Peter Zijlstra
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2016-06-14 13:59 UTC (permalink / raw)
  To: Juri Lelli
  Cc: mingo, tglx, rostedt, xlpang, linux-kernel, mathieu.desnoyers,
	jdesfossez, bristot, Ingo Molnar

On Tue, Jun 14, 2016 at 02:20:31PM +0100, Juri Lelli wrote:
> On 14/06/16 14:54, Peter Zijlstra wrote:
> > On Tue, Jun 14, 2016 at 10:09:34AM +0100, Juri Lelli wrote:
> > > I've got only nitpicks for the changelog. Otherwise the patch looks good
> > > to me (and yes, without it bw inheritance would be a problem).
> > 
> > So for bw inheritance I'm still not sure how to dead with the faxt that
> > the top_pi_waiter, while blocked, can still be running, spin waiting.
> > 
> 
> You mean for M-BWI (multiprocessor), right? If that's the case, we were
> actually discussing this thing with Pisa/Trento folks yesterday. I'm not
> sure yet as well, but plan seems to be to get first things right with
> current DI code (Luca was saying that there is a BUG somewhere); then
> move to implement BWI; and then tackle the M- case (and see what we can
> do to work around the theoretical need for spin waiting). We actually
> got some ideas a while back, but I need to go there and refresh my mind.
> 
> If the plan sounds reasonable to you, it seems that we can start this
> discussion as soon as Luca has his DI fixes ready. What you think?

No objections.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 6/8] sched/rtmutex: Refactor rt_mutex_setprio()
  2016-06-14 13:14   ` Juri Lelli
@ 2016-06-14 14:08     ` Peter Zijlstra
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2016-06-14 14:08 UTC (permalink / raw)
  To: Juri Lelli
  Cc: mingo, tglx, rostedt, xlpang, linux-kernel, mathieu.desnoyers,
	jdesfossez, bristot

On Tue, Jun 14, 2016 at 02:14:24PM +0100, Juri Lelli wrote:
> Hi,
> 
> still digesting this change, but I'll point out below why I think you
> are hitting a NULL ptr dereference (discussed on IRC).
> 
> On 07/06/16 21:56, Peter Zijlstra wrote:
> 
> > --- a/kernel/locking/rtmutex.c
> > +++ b/kernel/locking/rtmutex.c
> > @@ -256,61 +256,16 @@ rt_mutex_dequeue_pi(struct task_struct *
> >  	RB_CLEAR_NODE(&waiter->pi_tree_entry);
> >  }
> >  
> > +static void rt_mutex_adjust_prio(struct task_struct *p)
> >  {
> > +	struct task_struct *pi_task = NULL;
> >  
> > +	lockdep_assert_held(&p->pi_lock);
> >  
> > +	if (!task_has_pi_waiters(p))
> 
> Shouldn't this be the other way around?
> 
>  if (task_has_pi_waiters(p))
>  	pi_task = ...

Yeah, that would make more sense :-)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/8] rtmutex: Deboost before waking up the top waiter
  2016-06-14  9:09   ` Juri Lelli
  2016-06-14 12:54     ` Peter Zijlstra
@ 2016-06-14 16:36     ` Davidlohr Bueso
  2016-06-14 17:01       ` Juri Lelli
  1 sibling, 1 reply; 40+ messages in thread
From: Davidlohr Bueso @ 2016-06-14 16:36 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Peter Zijlstra, mingo, tglx, rostedt, xlpang, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, Ingo Molnar

On Tue, 14 Jun 2016, Juri Lelli wrote:

>Do we have any specific tests for this set? I'm running mine.

pi_stress from rt-tests is a good workload to run for such changes.

(https://git.kernel.org/cgit/utils/rt-tests/rt-tests.git/)

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 2/8] sched/rtmutex/deadline: Fix a PI crash for deadline tasks
  2016-06-14 13:07       ` Juri Lelli
@ 2016-06-14 16:39         ` Juri Lelli
  0 siblings, 0 replies; 40+ messages in thread
From: Juri Lelli @ 2016-06-14 16:39 UTC (permalink / raw)
  To: xlpang
  Cc: Peter Zijlstra, mingo, tglx, rostedt, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, Ingo Molnar

On 14/06/16 14:07, Juri Lelli wrote:
> On 14/06/16 20:53, Xunlei Pang wrote:
> > On 2016/06/14 at 18:21, Juri Lelli wrote:
> > > Hi,
> > >
> > > On 07/06/16 21:56, Peter Zijlstra wrote:
> > >> From: Xunlei Pang <xlpang@redhat.com>
> > >>

[...]

> > > The idea of this fix makes sense to me. But, I would like to be able to
> > > see the BUG and test the fix. What I have is a test in which I create N
> > > DEADLINE workers that share a PI mutex. They get migrated around and
> > > seem to stress PI code. But I couldn't hit the BUG yet. Maybe I let it
> > > run for some more time.
> > 
> > You can use this reproducer attached(gcc crash_deadline_pi.c -lpthread -lrt ).
> > Start multiple instances, then it will hit the bug very soon.
> > 
> 
> Great, thanks! I'll use it.
> 

OK. I could reproduce and also test that I can't see the BUG anymore
with the fix.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/8] rtmutex: Deboost before waking up the top waiter
  2016-06-14 16:36     ` Davidlohr Bueso
@ 2016-06-14 17:01       ` Juri Lelli
  0 siblings, 0 replies; 40+ messages in thread
From: Juri Lelli @ 2016-06-14 17:01 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Peter Zijlstra, mingo, tglx, rostedt, xlpang, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, Ingo Molnar

On 14/06/16 09:36, Davidlohr Bueso wrote:
> On Tue, 14 Jun 2016, Juri Lelli wrote:
> 
> >Do we have any specific tests for this set? I'm running mine.
> 
> pi_stress from rt-tests is a good workload to run for such changes.
> 
> (https://git.kernel.org/cgit/utils/rt-tests/rt-tests.git/)
> 

Oh, right! I keep forgetting that DEADLINE support has been added there.
Apologies. I'll run those as well for sure. Thanks for pointing them
out.

Best,

- Juri

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 8/8] rtmutex: Fix PI chain order integrity
  2016-06-07 19:56 ` [RFC][PATCH 8/8] rtmutex: Fix PI chain order integrity Peter Zijlstra
@ 2016-06-14 17:39   ` Juri Lelli
  2016-06-14 19:44     ` Peter Zijlstra
  0 siblings, 1 reply; 40+ messages in thread
From: Juri Lelli @ 2016-06-14 17:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, rostedt, xlpang, linux-kernel, mathieu.desnoyers,
	jdesfossez, bristot

On 07/06/16 21:56, Peter Zijlstra wrote:
> rt_mutex_waiter::prio is a copy of task_struct::prio which is updated
> during the PI chain walk, such that the PI chain order isn't messed up
> by (asynchronous) task state updates.
> 
> Currently rt_mutex_waiter_less() uses task state for deadline tasks;
> this is broken, since the task state can, as said above, change
> asynchronously, causing the RB tree order to change without actual
> tree update -> FAIL.
> 
> Fix this by also copying the deadline into the rt_mutex_waiter state
> and updating it along with its prio field.
> 
> Ideally we would also force PI chain updates whenever DL tasks update
> their deadline parameter, but for first approximation this is less
> broken than it was.
> 

The patch looks OK to me. However, I'm failing to see when we can update
dl.deadline of a waiter asynchronously. Since a waiter is blocked, we
can't really change his dl.deadline by calling setscheduler on him, as
the update would operate on dl.dl_deadline. The new values will start to
be used as soon as it gets unblocked. The situation seems different for
RT tasks, for which priority change takes effect immediately.

What am I missing? :-)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 1/8] rtmutex: Deboost before waking up the top waiter
  2016-06-07 19:56 ` [RFC][PATCH 1/8] rtmutex: Deboost before waking up the top waiter Peter Zijlstra
  2016-06-14  9:09   ` Juri Lelli
@ 2016-06-14 18:22   ` Steven Rostedt
  1 sibling, 0 replies; 40+ messages in thread
From: Steven Rostedt @ 2016-06-14 18:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, xlpang, linux-kernel, mathieu.desnoyers,
	jdesfossez, bristot, Ingo Molnar

On Tue, 07 Jun 2016 21:56:36 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> From: Xunlei Pang <xlpang@redhat.com>
> 
> We should deboost before waking the high-prio task, such that
> we don't run two tasks with the same "state"(priority, deadline,
> sched_class, etc) during the period between the end of wake_up_q()
> and the end of rt_mutex_adjust_prio().
> 
> As "Peter Zijlstra" said:
> Its semantically icky to have the two tasks running off the same
> state and practically icky when you consider bandwidth inheritance --
> where the boosted task wants to explicitly modify the state of the
> booster. In that latter case you really want to unboost before you
> let the booster run again.
> 
> But this however can lead to prio-inversion if current would get
> preempted after the deboost but before waking our high-prio task,
> hence we disable preemption before doing deboost, and enabling it
> after the wake up is over.
> 
> The patch fixed the logic, and introduced rt_mutex_postunlock()
> to do some code refactor.
> 
> Most importantly however; this change ensures pointer stability for
> the next patch, where we have rt_mutex_setprio() cache a pointer to
> the top-most waiter task. If we, as before this change, do the wakeup
> first and then deboost, this pointer might point into thin air.
> 
> Cc: Steven Rostedt <rostedt@goodmis.org>

Acked-by: Steven Rostedt <rostedt@goodmis.org>

-- Steve

> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Juri Lelli <juri.lelli@arm.com>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> [peterz: Changelog]
> Signed-off-by: Xunlei Pang <xlpang@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/1461659449-19497-1-git-send-email-xlpang@redhat.com
> ---

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 2/8] sched/rtmutex/deadline: Fix a PI crash for deadline tasks
  2016-06-07 19:56 ` [RFC][PATCH 2/8] sched/rtmutex/deadline: Fix a PI crash for deadline tasks Peter Zijlstra
  2016-06-14 10:21   ` Juri Lelli
@ 2016-06-14 18:42   ` Steven Rostedt
  2016-06-14 20:28     ` Peter Zijlstra
  1 sibling, 1 reply; 40+ messages in thread
From: Steven Rostedt @ 2016-06-14 18:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, xlpang, linux-kernel, mathieu.desnoyers,
	jdesfossez, bristot, Ingo Molnar

On Tue, 07 Jun 2016 21:56:37 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
\
> --- a/include/linux/sched/rt.h
> +++ b/include/linux/sched/rt.h
> @@ -19,6 +19,7 @@ static inline int rt_task(struct task_st
>  extern int rt_mutex_getprio(struct task_struct *p);
>  extern void rt_mutex_setprio(struct task_struct *p, int prio);
>  extern int rt_mutex_get_effective_prio(struct task_struct *task, int newprio);
> +extern void rt_mutex_update_top_task(struct task_struct *p);
>  extern struct task_struct *rt_mutex_get_top_task(struct task_struct *task);
>  extern void rt_mutex_adjust_pi(struct task_struct *p);
>  static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1219,6 +1219,7 @@ static void rt_mutex_init_task(struct ta
>  #ifdef CONFIG_RT_MUTEXES
>  	p->pi_waiters = RB_ROOT;
>  	p->pi_waiters_leftmost = NULL;
> +	p->pi_top_task = NULL;
>  	p->pi_blocked_on = NULL;
>  #endif
>  }
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -256,6 +256,16 @@ rt_mutex_dequeue_pi(struct task_struct *
>  	RB_CLEAR_NODE(&waiter->pi_tree_entry);
>  }
>  
> +void rt_mutex_update_top_task(struct task_struct *p)
> +{
> +	if (!task_has_pi_waiters(p)) {
> +		p->pi_top_task = NULL;
> +		return;
> +	}
> +
> +	p->pi_top_task = task_top_pi_waiter(p)->task;
> +}
> +
>  /*
>   * Calculate task priority from the waiter tree priority
>   *
> @@ -273,10 +283,7 @@ int rt_mutex_getprio(struct task_struct
>  
>  struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
>  {
> -	if (likely(!task_has_pi_waiters(task)))
> -		return NULL;
> -
> -	return task_top_pi_waiter(task)->task;
> +	return task->pi_top_task;
>  }
>  
>  /*
> @@ -285,12 +292,12 @@ struct task_struct *rt_mutex_get_top_tas
>   */
>  int rt_mutex_get_effective_prio(struct task_struct *task, int newprio)
>  {
> -	if (!task_has_pi_waiters(task))
> +	struct task_struct *top_task = rt_mutex_get_top_task(task);
> +
> +	if (!top_task)
>  		return newprio;
>  
> -	if (task_top_pi_waiter(task)->task->prio <= newprio)
> -		return task_top_pi_waiter(task)->task->prio;
> -	return newprio;
> +	return min(top_task->prio, newprio);
>  }
>  
>  /*
> @@ -307,24 +314,6 @@ static void __rt_mutex_adjust_prio(struc
>  }
>  
>  /*
> - * Adjust task priority (undo boosting). Called from the exit path of
> - * rt_mutex_slowunlock() and rt_mutex_slowlock().
> - *
> - * (Note: We do this outside of the protection of lock->wait_lock to
> - * allow the lock to be taken while or before we readjust the priority
> - * of task. We do not use the spin_xx_mutex() variants here as we are
> - * outside of the debug path.)
> - */
> -void rt_mutex_adjust_prio(struct task_struct *task)
> -{
> -	unsigned long flags;
> -
> -	raw_spin_lock_irqsave(&task->pi_lock, flags);
> -	__rt_mutex_adjust_prio(task);
> -	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
> -}
> -
> -/*
>   * Deadlock detection is conditional:
>   *
>   * If CONFIG_DEBUG_RT_MUTEXES=n, deadlock detection is only conducted
> @@ -987,6 +976,7 @@ static void mark_wakeup_next_waiter(stru
>  	 * lock->wait_lock.
>  	 */
>  	rt_mutex_dequeue_pi(current, waiter);
> +	__rt_mutex_adjust_prio(current);
>  
>  	/*
>  	 * As we are waking up the top waiter, and the waiter stays
> @@ -1325,6 +1315,16 @@ static bool __sched rt_mutex_slowunlock(
>  	 */
>  	mark_wakeup_next_waiter(wake_q, lock);
>  
> +	/*
> +	 * We should deboost before waking the top waiter task such that
> +	 * we don't run two tasks with the 'same' priority. This however
> +	 * can lead to prio-inversion if we would get preempted after
> +	 * the deboost but before waking our high-prio task, hence the
> +	 * preempt_disable before unlock. Pairs with preempt_enable() in
> +	 * rt_mutex_postunlock();
> +	 */
> +	preempt_disable();
> +

This looks like a possible maintenance nightmare. Can we add some more
comments at the start of the functions that state that
rt_mutex_slowunlock() calls must be paired with rt_mutex_postunlock()?

Other than that...

Acked-by: Steven Rostedt <rostedt@goodmis.org>

-- Steve


>  	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
>  
>  	/* check PI boosting */
> @@ -1400,20 +1400,9 @@ rt_mutex_fastunlock(struct rt_mutex *loc
>   */
>  void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost)
>  {
> -	/*
> -	 * We should deboost before waking the top waiter task such that
> -	 * we don't run two tasks with the 'same' priority. This however
> -	 * can lead to prio-inversion if we would get preempted after
> -	 * the deboost but before waking our high-prio task, hence the
> -	 * preempt_disable.
> -	 */
> -	if (deboost) {
> -		preempt_disable();
> -		rt_mutex_adjust_prio(current);
> -	}
> -
>  	wake_up_q(wake_q);
>  
> +	/* Pairs with preempt_disable() in rt_mutex_slowunlock() */
>  	if (deboost)
>  		preempt_enable();
>  }
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3568,6 +3568,8 @@ void rt_mutex_setprio(struct task_struct
>  		goto out_unlock;
>  	}
>  
> +	rt_mutex_update_top_task(p);
> +
>  	trace_sched_pi_setprio(p, prio);
>  	oldprio = p->prio;
>  
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 8/8] rtmutex: Fix PI chain order integrity
  2016-06-14 17:39   ` Juri Lelli
@ 2016-06-14 19:44     ` Peter Zijlstra
  2016-06-15  7:25       ` Juri Lelli
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2016-06-14 19:44 UTC (permalink / raw)
  To: Juri Lelli
  Cc: mingo, tglx, rostedt, xlpang, linux-kernel, mathieu.desnoyers,
	jdesfossez, bristot

On Tue, Jun 14, 2016 at 06:39:08PM +0100, Juri Lelli wrote:
> On 07/06/16 21:56, Peter Zijlstra wrote:
> > rt_mutex_waiter::prio is a copy of task_struct::prio which is updated
> > during the PI chain walk, such that the PI chain order isn't messed up
> > by (asynchronous) task state updates.
> > 
> > Currently rt_mutex_waiter_less() uses task state for deadline tasks;
> > this is broken, since the task state can, as said above, change
> > asynchronously, causing the RB tree order to change without actual
> > tree update -> FAIL.
> > 
> > Fix this by also copying the deadline into the rt_mutex_waiter state
> > and updating it along with its prio field.
> > 
> > Ideally we would also force PI chain updates whenever DL tasks update
> > their deadline parameter, but for first approximation this is less
> > broken than it was.
> > 
> 
> The patch looks OK to me. However, I'm failing to see when we can update
> dl.deadline of a waiter asynchronously. Since a waiter is blocked, we
> can't really change his dl.deadline by calling setscheduler on him, as
> the update would operate on dl.dl_deadline. The new values will start to
> be used as soon as it gets unblocked. The situation seems different for
> RT tasks, for which priority change takes effect immediately.
> 
> What am I missing? :-)

Ah, I missed the dl_deadline vs deadline thing. Still, with optimistic
spinning the waiter could hit its throttle/refresh path, right? And then
that would update deadline.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 2/8] sched/rtmutex/deadline: Fix a PI crash for deadline tasks
  2016-06-14 18:42   ` Steven Rostedt
@ 2016-06-14 20:28     ` Peter Zijlstra
  2016-06-15 16:14       ` Steven Rostedt
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2016-06-14 20:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: mingo, tglx, juri.lelli, xlpang, linux-kernel, mathieu.desnoyers,
	jdesfossez, bristot, Ingo Molnar

On Tue, Jun 14, 2016 at 02:42:17PM -0400, Steven Rostedt wrote:
> On Tue, 07 Jun 2016 21:56:37 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> > +	/*
> > +	 * We should deboost before waking the top waiter task such that
> > +	 * we don't run two tasks with the 'same' priority. This however
> > +	 * can lead to prio-inversion if we would get preempted after
> > +	 * the deboost but before waking our high-prio task, hence the
> > +	 * preempt_disable before unlock. Pairs with preempt_enable() in
> > +	 * rt_mutex_postunlock();
> > +	 */
> > +	preempt_disable();
> > +
> 
> This looks like a possible maintenance nightmare. Can we add some more
> comments at the start of the functions that state that
> rt_mutex_slowunlock() calls must be paired with rt_mutex_postunlock()?

Please look at patches 4 and 5 that clean this up.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 8/8] rtmutex: Fix PI chain order integrity
  2016-06-14 19:44     ` Peter Zijlstra
@ 2016-06-15  7:25       ` Juri Lelli
  2016-06-27 12:23         ` Peter Zijlstra
  0 siblings, 1 reply; 40+ messages in thread
From: Juri Lelli @ 2016-06-15  7:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, rostedt, xlpang, linux-kernel, mathieu.desnoyers,
	jdesfossez, bristot

On 14/06/16 21:44, Peter Zijlstra wrote:
> On Tue, Jun 14, 2016 at 06:39:08PM +0100, Juri Lelli wrote:
> > On 07/06/16 21:56, Peter Zijlstra wrote:
> > > rt_mutex_waiter::prio is a copy of task_struct::prio which is updated
> > > during the PI chain walk, such that the PI chain order isn't messed up
> > > by (asynchronous) task state updates.
> > > 
> > > Currently rt_mutex_waiter_less() uses task state for deadline tasks;
> > > this is broken, since the task state can, as said above, change
> > > asynchronously, causing the RB tree order to change without actual
> > > tree update -> FAIL.
> > > 
> > > Fix this by also copying the deadline into the rt_mutex_waiter state
> > > and updating it along with its prio field.
> > > 
> > > Ideally we would also force PI chain updates whenever DL tasks update
> > > their deadline parameter, but for first approximation this is less
> > > broken than it was.
> > > 
> > 
> > The patch looks OK to me. However, I'm failing to see when we can update
> > dl.deadline of a waiter asynchronously. Since a waiter is blocked, we
> > can't really change his dl.deadline by calling setscheduler on him, as
> > the update would operate on dl.dl_deadline. The new values will start to
> > be used as soon as it gets unblocked. The situation seems different for
> > RT tasks, for which priority change takes effect immediately.
> > 
> > What am I missing? :-)
> 
> Ah, I missed the dl_deadline vs deadline thing. Still, with optimistic
> spinning the waiter could hit its throttle/refresh path, right? And then
> that would update deadline.
> 

I guess it's not that likely, but yes it could potentially happen that a
waiter is optimistically spinning, depletes its runtime, gets throttled
and then replenished when still spinning. Maybe it doesn't really make
sense continuing spinning in this situation, but I guess things get
really complicated. :-/

Anyway, as said, I think this patch is OK. Maybe we want to add a
comment just to remember what situation can cause an issue if we don't
do this? Patch changelog would be OK as well for such a comment IMHO.

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 2/8] sched/rtmutex/deadline: Fix a PI crash for deadline tasks
  2016-06-14 20:28     ` Peter Zijlstra
@ 2016-06-15 16:14       ` Steven Rostedt
  0 siblings, 0 replies; 40+ messages in thread
From: Steven Rostedt @ 2016-06-15 16:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, xlpang, linux-kernel, mathieu.desnoyers,
	jdesfossez, bristot, Ingo Molnar

On Tue, 14 Jun 2016 22:28:34 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> Please look at patches 4 and 5 that clean this up.
> 

OK I will. I'm currently doing this while traveling so I can only look
at a couple of patches at a time.

Thanks,

-- Steve

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 3/8] sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update
  2016-06-07 19:56 ` [RFC][PATCH 3/8] sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update Peter Zijlstra
  2016-06-14 10:43   ` Juri Lelli
@ 2016-06-15 16:30   ` Steven Rostedt
  2016-06-15 17:55     ` Peter Zijlstra
  1 sibling, 1 reply; 40+ messages in thread
From: Steven Rostedt @ 2016-06-15 16:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, xlpang, linux-kernel, mathieu.desnoyers,
	jdesfossez, bristot, Ingo Molnar

On Tue, 07 Jun 2016 21:56:38 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> From: Xunlei Pang <xlpang@redhat.com>
> 
> Currently dl tasks will actually return at the very beginning
> of rt_mutex_adjust_prio_chain() in !detect_deadlock cases:
> 
>     if (waiter->prio == task->prio) {
>         if (!detect_deadlock)
>             goto out_unlock_pi; // out here
>         else
>             requeue = false;
>     }
> 
> As the deadline value of blocked deadline tasks(waiters) without
> changing their sched_class(thus prio doesn't change) never changes,
> this seems reasonable, but it actually misses the chance of updating
> rt_mutex_waiter's "dl_runtime(period)_copy" if a waiter updates its
> deadline parameters(dl_runtime, dl_period) or boosted waiter changes
> to !deadline class.
> 
> Thus, force deadline task not out by adding the !dl_prio() condition.
> 
> [peterz: I should introduce more task state comparators like
> rt_mutex_waiter_less, all PI prio comparisons already have this DL
> exception, except this one]
> 
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Juri Lelli <juri.lelli@arm.com>
> Signed-off-by: Xunlei Pang <xlpang@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/1460633827-345-7-git-send-email-xlpang@redhat.com
> ---
>  kernel/locking/rtmutex.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -488,7 +488,7 @@ static int rt_mutex_adjust_prio_chain(st
>  	 * enabled we continue, but stop the requeueing in the chain
>  	 * walk.
>  	 */
> -	if (waiter->prio == task->prio) {
> +	if (waiter->prio == task->prio && !dl_task(task)) {

Isn't task the owner of the lock? What happens if the waiter is a
deadline task?

-- Steve

>  		if (!detect_deadlock)
>  			goto out_unlock_pi;
>  		else
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 4/8] rtmutex: Remove rt_mutex_fastunlock()
  2016-06-07 19:56 ` [RFC][PATCH 4/8] rtmutex: Remove rt_mutex_fastunlock() Peter Zijlstra
@ 2016-06-15 16:43   ` Steven Rostedt
  0 siblings, 0 replies; 40+ messages in thread
From: Steven Rostedt @ 2016-06-15 16:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, xlpang, linux-kernel, mathieu.desnoyers,
	jdesfossez, bristot

[ Now sending with Reply-All ]

On Tue, 07 Jun 2016 21:56:39 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> There is but a single user and the previous patch mandates slowfn must
> be rt_mutex_slowunlock(), so fold the lot together and save a few
> lines.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/locking/rtmutex.c |   29 ++++++++++-------------------
>  1 file changed, 10 insertions(+), 19 deletions(-)

Reviewed-by: Steven Rostedt <rostedt@goodmis.org>

-- Steve

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 3/8] sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update
  2016-06-15 16:30   ` Steven Rostedt
@ 2016-06-15 17:55     ` Peter Zijlstra
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2016-06-15 17:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: mingo, tglx, juri.lelli, xlpang, linux-kernel, mathieu.desnoyers,
	jdesfossez, bristot, Ingo Molnar

On Wed, Jun 15, 2016 at 12:30:07PM -0400, Steven Rostedt wrote:

> > +++ b/kernel/locking/rtmutex.c
> > @@ -488,7 +488,7 @@ static int rt_mutex_adjust_prio_chain(st
> >  	 * enabled we continue, but stop the requeueing in the chain
> >  	 * walk.
> >  	 */
> > -	if (waiter->prio == task->prio) {
> > +	if (waiter->prio == task->prio && !dl_task(task)) {
> 
> Isn't task the owner of the lock? 

No, task is blocked on something.

> What happens if the waiter is a
> deadline task?

So the test here is a shortcut to terminate the Pi chain adjust, it says
that if the waiter and task have the same priority, we're done. Further
adjustments will not make a difference.

The problem is that for deadline tasks, prio is a useless number, so
even if they match (all deadline tasks have prio -1) they might still
not actually match.

After the last patch I suppose we could do something like:

	waiter->prio == task->prio && waiter->deadline == task->dl.deadline

In any case, any condition that compares just two 'prio' values is per
definition broken if DL tasks are involved.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 8/8] rtmutex: Fix PI chain order integrity
  2016-06-15  7:25       ` Juri Lelli
@ 2016-06-27 12:23         ` Peter Zijlstra
  2016-06-27 12:40           ` Thomas Gleixner
  2016-06-28  9:05           ` Juri Lelli
  0 siblings, 2 replies; 40+ messages in thread
From: Peter Zijlstra @ 2016-06-27 12:23 UTC (permalink / raw)
  To: Juri Lelli
  Cc: mingo, tglx, rostedt, xlpang, linux-kernel, mathieu.desnoyers,
	jdesfossez, bristot

On Wed, Jun 15, 2016 at 08:25:07AM +0100, Juri Lelli wrote:
> I guess it's not that likely, but yes it could potentially happen that a
> waiter is optimistically spinning, depletes its runtime, gets throttled
> and then replenished when still spinning. Maybe it doesn't really make
> sense continuing spinning in this situation, but I guess things get
> really complicated. :-/
> 
> Anyway, as said, I think this patch is OK. Maybe we want to add a
> comment just to remember what situation can cause an issue if we don't
> do this? Patch changelog would be OK as well for such a comment IMHO.


OK, so I went to write a simple comment and ended up with the below :/

While writing the comment I noticed two issues:

 - we update the waiter order fields while the entry is still enqueued
   on the pi_waiters tree, which is also sorted by these exact fields.

 - another one of these pure ->prio comparisons

Please double check, there be dragons here.

---

--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -158,6 +158,9 @@ static inline bool unlock_rt_mutex_safe(
 }
 #endif
 
+#define cmp_task(p)	\
+	&(struct rt_mutex_waiter){ .prio = (p)->prio, .deadline = (p)->dl.deadline }
+
 static inline int
 rt_mutex_waiter_less(struct rt_mutex_waiter *left,
 		     struct rt_mutex_waiter *right)
@@ -583,8 +586,26 @@ static int rt_mutex_adjust_prio_chain(st
 
 	/* [7] Requeue the waiter in the lock waiter tree. */
 	rt_mutex_dequeue(lock, waiter);
+
+	/*
+	 * Update the waiter prio fields now that we're dequeued.
+	 *
+	 * These values can have changed through either:
+	 *
+	 *   sys_sched_set_scheduler() / sys_sched_setattr()
+	 *
+	 * or
+	 *
+	 *   DL CBS enforcement advancing the effective deadline.
+	 *
+	 * Even though pi_waiters also uses these fields, and that tree is only
+	 * updated in [11], we can do this here, since we hold [L], which
+	 * serializes all pi_waiters access and rb_erase() does not care about
+	 * the values of the node being removed.
+	 */
 	waiter->prio = task->prio;
 	waiter->deadline = task->dl.deadline;
+
 	rt_mutex_enqueue(lock, waiter);
 
 	/* [8] Release the task */
@@ -711,6 +732,8 @@ static int rt_mutex_adjust_prio_chain(st
 static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 				struct rt_mutex_waiter *waiter)
 {
+	lockdep_assert_held(&lock->wait_lock);
+
 	/*
 	 * Before testing whether we can acquire @lock, we set the
 	 * RT_MUTEX_HAS_WAITERS bit in @lock->owner. This forces all
@@ -770,7 +793,8 @@ static int try_to_take_rt_mutex(struct r
 			 * the top waiter priority (kernel view),
 			 * @task lost.
 			 */
-			if (task->prio >= rt_mutex_top_waiter(lock)->prio)
+			if (!rt_mutex_waiter_less(cmp_task(task),
+						  rt_mutex_top_waiter(lock)))
 				return 0;
 
 			/*
@@ -838,6 +862,8 @@ static int task_blocks_on_rt_mutex(struc
 	struct rt_mutex *next_lock;
 	int chain_walk = 0, res;
 
+	lockdep_assert_held(&lock->wait_lock);
+
 	/*
 	 * Early deadlock detection. We really don't want the task to
 	 * enqueue on itself just to untangle the mess later. It's not
@@ -973,6 +999,8 @@ static void remove_waiter(struct rt_mute
 	struct task_struct *owner = rt_mutex_owner(lock);
 	struct rt_mutex *next_lock;
 
+	lockdep_assert_held(&lock->wait_lock);
+
 	raw_spin_lock(&current->pi_lock);
 	rt_mutex_dequeue(lock, waiter);
 	current->pi_blocked_on = NULL;

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 8/8] rtmutex: Fix PI chain order integrity
  2016-06-27 12:23         ` Peter Zijlstra
@ 2016-06-27 12:40           ` Thomas Gleixner
  2016-06-28  9:05           ` Juri Lelli
  1 sibling, 0 replies; 40+ messages in thread
From: Thomas Gleixner @ 2016-06-27 12:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, mingo, rostedt, xlpang, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Mon, 27 Jun 2016, Peter Zijlstra wrote:

> On Wed, Jun 15, 2016 at 08:25:07AM +0100, Juri Lelli wrote:
> > I guess it's not that likely, but yes it could potentially happen that a
> > waiter is optimistically spinning, depletes its runtime, gets throttled
> > and then replenished when still spinning. Maybe it doesn't really make
> > sense continuing spinning in this situation, but I guess things get
> > really complicated. :-/
> > 
> > Anyway, as said, I think this patch is OK. Maybe we want to add a
> > comment just to remember what situation can cause an issue if we don't
> > do this? Patch changelog would be OK as well for such a comment IMHO.
> 
> 
> OK, so I went to write a simple comment and ended up with the below :/
> 
> While writing the comment I noticed two issues:
> 
>  - we update the waiter order fields while the entry is still enqueued
>    on the pi_waiters tree, which is also sorted by these exact fields.
> 
>  - another one of these pure ->prio comparisons
> 
> Please double check, there be dragons here.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC][PATCH 8/8] rtmutex: Fix PI chain order integrity
  2016-06-27 12:23         ` Peter Zijlstra
  2016-06-27 12:40           ` Thomas Gleixner
@ 2016-06-28  9:05           ` Juri Lelli
  1 sibling, 0 replies; 40+ messages in thread
From: Juri Lelli @ 2016-06-28  9:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, rostedt, xlpang, linux-kernel, mathieu.desnoyers,
	jdesfossez, bristot

On 27/06/16 14:23, Peter Zijlstra wrote:
> On Wed, Jun 15, 2016 at 08:25:07AM +0100, Juri Lelli wrote:
> > I guess it's not that likely, but yes it could potentially happen that a
> > waiter is optimistically spinning, depletes its runtime, gets throttled
> > and then replenished when still spinning. Maybe it doesn't really make
> > sense continuing spinning in this situation, but I guess things get
> > really complicated. :-/
> > 
> > Anyway, as said, I think this patch is OK. Maybe we want to add a
> > comment just to remember what situation can cause an issue if we don't
> > do this? Patch changelog would be OK as well for such a comment IMHO.
> 
> 
> OK, so I went to write a simple comment and ended up with the below :/
> 
> While writing the comment I noticed two issues:
> 
>  - we update the waiter order fields while the entry is still enqueued
>    on the pi_waiters tree, which is also sorted by these exact fields.
> 
>  - another one of these pure ->prio comparisons
> 
> Please double check, there be dragons here.
> 

Reviewed-and-tested-by: Juri Lelli <juri.lelli@arm.com>

And, FWIW, you can obviously keep this if you are going to squash this
into 8/8.

Best,

- Juri

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2016-06-28  9:05 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-07 19:56 [RFC][PATCH 0/8] PI and assorted failings Peter Zijlstra
2016-06-07 19:56 ` [RFC][PATCH 1/8] rtmutex: Deboost before waking up the top waiter Peter Zijlstra
2016-06-14  9:09   ` Juri Lelli
2016-06-14 12:54     ` Peter Zijlstra
2016-06-14 13:20       ` Juri Lelli
2016-06-14 13:59         ` Peter Zijlstra
2016-06-14 16:36     ` Davidlohr Bueso
2016-06-14 17:01       ` Juri Lelli
2016-06-14 18:22   ` Steven Rostedt
2016-06-07 19:56 ` [RFC][PATCH 2/8] sched/rtmutex/deadline: Fix a PI crash for deadline tasks Peter Zijlstra
2016-06-14 10:21   ` Juri Lelli
2016-06-14 12:30     ` Peter Zijlstra
2016-06-14 12:53     ` Xunlei Pang
2016-06-14 13:07       ` Juri Lelli
2016-06-14 16:39         ` Juri Lelli
2016-06-14 18:42   ` Steven Rostedt
2016-06-14 20:28     ` Peter Zijlstra
2016-06-15 16:14       ` Steven Rostedt
2016-06-07 19:56 ` [RFC][PATCH 3/8] sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update Peter Zijlstra
2016-06-14 10:43   ` Juri Lelli
2016-06-14 12:09     ` Peter Zijlstra
2016-06-15 16:30   ` Steven Rostedt
2016-06-15 17:55     ` Peter Zijlstra
2016-06-07 19:56 ` [RFC][PATCH 4/8] rtmutex: Remove rt_mutex_fastunlock() Peter Zijlstra
2016-06-15 16:43   ` Steven Rostedt
2016-06-07 19:56 ` [RFC][PATCH 5/8] rtmutex: Clean up Peter Zijlstra
2016-06-14 12:08   ` Juri Lelli
2016-06-14 12:32     ` Peter Zijlstra
2016-06-14 12:41       ` Juri Lelli
2016-06-07 19:56 ` [RFC][PATCH 6/8] sched/rtmutex: Refactor rt_mutex_setprio() Peter Zijlstra
2016-06-14 13:14   ` Juri Lelli
2016-06-14 14:08     ` Peter Zijlstra
2016-06-07 19:56 ` [RFC][PATCH 7/8] sched,tracing: Update trace_sched_pi_setprio() Peter Zijlstra
2016-06-07 19:56 ` [RFC][PATCH 8/8] rtmutex: Fix PI chain order integrity Peter Zijlstra
2016-06-14 17:39   ` Juri Lelli
2016-06-14 19:44     ` Peter Zijlstra
2016-06-15  7:25       ` Juri Lelli
2016-06-27 12:23         ` Peter Zijlstra
2016-06-27 12:40           ` Thomas Gleixner
2016-06-28  9:05           ` Juri Lelli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).