* [PATCH -v2 0/9] PI and assorted failings
@ 2016-09-26 12:32 Peter Zijlstra
  2016-09-26 12:32 ` [PATCH -v2 1/9] rtmutex: Deboost before waking up the top waiter Peter Zijlstra
                   ` (8 more replies)
  0 siblings, 9 replies; 31+ messages in thread
From: Peter Zijlstra @ 2016-09-26 12:32 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz

OK, long time since I posted, but I finally found these patches again.
Hopefully I addressed all feedback from last time (sorry if I missed anything).

Aside from the stuff I posted during review last time, I restructured the first
two patches to get rid of that unparsable changelog.

I folded part of the second patch into the first patch and rewrote the
Changelog for the first patch.

Now the first patch fully restructures the unlock+deboost path and the second
patch only adds the pi_top_task pointer cache; previously, the first patch did
half the unlock+deboost restructure and the second patch did the rest and
added that cache.

If there are no objections, I'll queue the stuff to finally get rid of these
patches (and not 'lose' them again).

* [PATCH -v2 1/9] rtmutex: Deboost before waking up the top waiter
  2016-09-26 12:32 [PATCH -v2 0/9] PI and assorted failings Peter Zijlstra
@ 2016-09-26 12:32 ` Peter Zijlstra
  2016-09-26 15:15   ` Steven Rostedt
  2016-09-28  9:07   ` Sebastian Andrzej Siewior
  2016-09-26 12:32 ` [PATCH -v2 2/9] sched/rtmutex/deadline: Fix a PI crash for deadline tasks Peter Zijlstra
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 31+ messages in thread
From: Peter Zijlstra @ 2016-09-26 12:32 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz,
	Ingo Molnar

[-- Attachment #1: xunlei_pang-rtmutex-deboost_before_waking_up_the_top_waiter.patch --]
[-- Type: text/plain, Size: 4552 bytes --]

We should deboost before waking the high-priority task, such that we
don't run two tasks with the same "state" (priority, deadline,
sched_class, etc).

In order to make sure the boosting task doesn't start running between
unlock and deboost (due to a 'spurious' wakeup), we move the deboost
under the wait_lock; that way it's serialized against the wait loop in
__rt_mutex_slowlock().

Doing the deboost early can however lead to priority inversion if current
gets preempted after the deboost but before waking our high-prio task,
hence we disable preemption before doing the deboost and enable it again
after the wake up is over.

This gets us the right semantic order, but most importantly, this change
ensures pointer stability for the next patch, where we have
rt_mutex_setprio() cache a pointer to the top-most waiter task. If we, as
before this change, do the wakeup first and then deboost, this pointer
might point into thin air.

[peterz: Changelog + patch munging]
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---

 kernel/futex.c                  |    5 +---
 kernel/locking/rtmutex.c        |   48 ++++++++++++++++++++--------------------
 kernel/locking/rtmutex_common.h |    1 
 3 files changed, 28 insertions(+), 26 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1374,9 +1374,8 @@ static int wake_futex_pi(u32 __user *uad
 	 * scheduled away before the wake up can take place.
 	 */
 	spin_unlock(&hb->lock);
-	wake_up_q(&wake_q);
-	if (deboost)
-		rt_mutex_adjust_prio(current);
+
+	rt_mutex_postunlock(&wake_q, deboost);
 
 	return 0;
 }
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -307,24 +307,6 @@ static void __rt_mutex_adjust_prio(struc
 }
 
 /*
- * Adjust task priority (undo boosting). Called from the exit path of
- * rt_mutex_slowunlock() and rt_mutex_slowlock().
- *
- * (Note: We do this outside of the protection of lock->wait_lock to
- * allow the lock to be taken while or before we readjust the priority
- * of task. We do not use the spin_xx_mutex() variants here as we are
- * outside of the debug path.)
- */
-void rt_mutex_adjust_prio(struct task_struct *task)
-{
-	unsigned long flags;
-
-	raw_spin_lock_irqsave(&task->pi_lock, flags);
-	__rt_mutex_adjust_prio(task);
-	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
-}
-
-/*
  * Deadlock detection is conditional:
  *
  * If CONFIG_DEBUG_RT_MUTEXES=n, deadlock detection is only conducted
@@ -987,6 +969,7 @@ static void mark_wakeup_next_waiter(stru
 	 * lock->wait_lock.
 	 */
 	rt_mutex_dequeue_pi(current, waiter);
+	__rt_mutex_adjust_prio(current);
 
 	/*
 	 * As we are waking up the top waiter, and the waiter stays
@@ -1325,6 +1308,16 @@ static bool __sched rt_mutex_slowunlock(
 	 */
 	mark_wakeup_next_waiter(wake_q, lock);
 
+	/*
+	 * We should deboost before waking the top waiter task such that
+	 * we don't run two tasks with the 'same' priority. This however
+	 * can lead to prio-inversion if we would get preempted after
+	 * the deboost but before waking our high-prio task, hence the
+	 * preempt_disable before unlock. Pairs with preempt_enable() in
+	 * rt_mutex_postunlock();
+	 */
+	preempt_disable();
+
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 	/* check PI boosting */
@@ -1390,14 +1383,23 @@ rt_mutex_fastunlock(struct rt_mutex *loc
 	} else {
 		bool deboost = slowfn(lock, &wake_q);
 
-		wake_up_q(&wake_q);
-
-		/* Undo pi boosting if necessary: */
-		if (deboost)
-			rt_mutex_adjust_prio(current);
+		rt_mutex_postunlock(&wake_q, deboost);
 	}
 }
 
+
+/*
+ * Undo pi boosting (if necessary) and wake top waiter.
+ */
+void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost)
+{
+	wake_up_q(wake_q);
+
+	/* Pairs with preempt_disable() in rt_mutex_slowunlock() */
+	if (deboost)
+		preempt_enable();
+}
+
 /**
  * rt_mutex_lock - lock a rt_mutex
  *
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -111,6 +111,7 @@ extern int rt_mutex_finish_proxy_lock(st
 extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
 extern bool rt_mutex_futex_unlock(struct rt_mutex *lock,
 				  struct wake_q_head *wqh);
+extern void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost);
 extern void rt_mutex_adjust_prio(struct task_struct *task);
 
 #ifdef CONFIG_DEBUG_RT_MUTEXES

* [PATCH -v2 2/9] sched/rtmutex/deadline: Fix a PI crash for deadline tasks
  2016-09-26 12:32 [PATCH -v2 0/9] PI and assorted failings Peter Zijlstra
  2016-09-26 12:32 ` [PATCH -v2 1/9] rtmutex: Deboost before waking up the top waiter Peter Zijlstra
@ 2016-09-26 12:32 ` Peter Zijlstra
  2016-09-26 15:20   ` Steven Rostedt
  2016-09-29 14:49   ` Thomas Gleixner
  2016-09-26 12:32 ` [PATCH -v2 3/9] sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update Peter Zijlstra
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 31+ messages in thread
From: Peter Zijlstra @ 2016-09-26 12:32 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz,
	Ingo Molnar

[-- Attachment #1: xunlei_pang-sched_rtmutex_deadline-fix_a_pi_crash_for_deadline_tasks.patch --]
[-- Type: text/plain, Size: 4976 bytes --]

A crash happened while I was playing with the deadline PI rtmutex.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: [<ffffffff810eeb8f>] rt_mutex_get_top_task+0x1f/0x30
    PGD 232a75067 PUD 230947067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 1 PID: 10994 Comm: a.out Not tainted

    Call Trace:
    [<ffffffff810b658c>] enqueue_task+0x2c/0x80
    [<ffffffff810ba763>] activate_task+0x23/0x30
    [<ffffffff810d0ab5>] pull_dl_task+0x1d5/0x260
    [<ffffffff810d0be6>] pre_schedule_dl+0x16/0x20
    [<ffffffff8164e783>] __schedule+0xd3/0x900
    [<ffffffff8164efd9>] schedule+0x29/0x70
    [<ffffffff8165035b>] __rt_mutex_slowlock+0x4b/0xc0
    [<ffffffff81650501>] rt_mutex_slowlock+0xd1/0x190
    [<ffffffff810eeb33>] rt_mutex_timed_lock+0x53/0x60
    [<ffffffff810ecbfc>] futex_lock_pi.isra.18+0x28c/0x390
    [<ffffffff810ed8b0>] do_futex+0x190/0x5b0
    [<ffffffff810edd50>] SyS_futex+0x80/0x180

This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi() are
only protected by pi_lock when operating on pi waiters, while
rt_mutex_get_top_task() will access them with the rq lock held but
without holding pi_lock.

In order to tackle it, we introduce a new "pi_top_task" pointer cached
in task_struct, and add a new rt_mutex_update_top_task() to update its
value; it is called from rt_mutex_setprio(), which holds both the
owner's pi_lock and the rq lock. Thus "pi_top_task" can be safely
accessed by enqueue_task_dl() under the rq lock.
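
A minimal sketch of the intended usage (simplified from the hunks below;
the writer holds both the owner's pi_lock and the rq lock, the reader
only needs the rq lock):

    /* writer: rt_mutex_setprio(), with p->pi_lock and rq->lock held */
    rt_mutex_update_top_task(p);    /* p->pi_top_task = top pi waiter, or NULL */

    /* reader: e.g. enqueue_task_dl(), with only rq->lock held */
    struct task_struct *pi_task = rt_mutex_get_top_task(p); /* plain load of p->pi_top_task */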

Originally-From: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---

 include/linux/init_task.h |    1 +
 include/linux/sched.h     |    2 ++
 include/linux/sched/rt.h  |    1 +
 kernel/fork.c             |    1 +
 kernel/locking/rtmutex.c  |   23 +++++++++++++++--------
 kernel/sched/core.c       |    2 ++
 6 files changed, 22 insertions(+), 8 deletions(-)

--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -164,6 +164,7 @@ extern struct task_group root_task_group
 #ifdef CONFIG_RT_MUTEXES
 # define INIT_RT_MUTEXES(tsk)						\
 	.pi_waiters = RB_ROOT,						\
+	.pi_top_task = NULL,						\
 	.pi_waiters_leftmost = NULL,
 #else
 # define INIT_RT_MUTEXES(tsk)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1716,6 +1716,8 @@ struct task_struct {
 	/* PI waiters blocked on a rt_mutex held by this task */
 	struct rb_root pi_waiters;
 	struct rb_node *pi_waiters_leftmost;
+	/* Updated under owner's pi_lock and rq lock */
+	struct task_struct *pi_top_task;
 	/* Deadlock detection and priority inheritance handling */
 	struct rt_mutex_waiter *pi_blocked_on;
 #endif
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -19,6 +19,7 @@ static inline int rt_task(struct task_st
 extern int rt_mutex_getprio(struct task_struct *p);
 extern void rt_mutex_setprio(struct task_struct *p, int prio);
 extern int rt_mutex_get_effective_prio(struct task_struct *task, int newprio);
+extern void rt_mutex_update_top_task(struct task_struct *p);
 extern struct task_struct *rt_mutex_get_top_task(struct task_struct *task);
 extern void rt_mutex_adjust_pi(struct task_struct *p);
 static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1397,6 +1397,7 @@ static void rt_mutex_init_task(struct ta
 #ifdef CONFIG_RT_MUTEXES
 	p->pi_waiters = RB_ROOT;
 	p->pi_waiters_leftmost = NULL;
+	p->pi_top_task = NULL;
 	p->pi_blocked_on = NULL;
 #endif
 }
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -256,6 +256,16 @@ rt_mutex_dequeue_pi(struct task_struct *
 	RB_CLEAR_NODE(&waiter->pi_tree_entry);
 }
 
+void rt_mutex_update_top_task(struct task_struct *p)
+{
+	if (!task_has_pi_waiters(p)) {
+		p->pi_top_task = NULL;
+		return;
+	}
+
+	p->pi_top_task = task_top_pi_waiter(p)->task;
+}
+
 /*
  * Calculate task priority from the waiter tree priority
  *
@@ -273,10 +283,7 @@ int rt_mutex_getprio(struct task_struct
 
 struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
 {
-	if (likely(!task_has_pi_waiters(task)))
-		return NULL;
-
-	return task_top_pi_waiter(task)->task;
+	return task->pi_top_task;
 }
 
 /*
@@ -285,12 +292,12 @@ struct task_struct *rt_mutex_get_top_tas
  */
 int rt_mutex_get_effective_prio(struct task_struct *task, int newprio)
 {
-	if (!task_has_pi_waiters(task))
+	struct task_struct *top_task = rt_mutex_get_top_task(task);
+
+	if (!top_task)
 		return newprio;
 
-	if (task_top_pi_waiter(task)->task->prio <= newprio)
-		return task_top_pi_waiter(task)->task->prio;
-	return newprio;
+	return min(top_task->prio, newprio);
 }
 
 /*
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3669,6 +3669,8 @@ void rt_mutex_setprio(struct task_struct
 		goto out_unlock;
 	}
 
+	rt_mutex_update_top_task(p);
+
 	trace_sched_pi_setprio(p, prio);
 	oldprio = p->prio;
 

* [PATCH -v2 3/9] sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update
  2016-09-26 12:32 [PATCH -v2 0/9] PI and assorted failings Peter Zijlstra
  2016-09-26 12:32 ` [PATCH -v2 1/9] rtmutex: Deboost before waking up the top waiter Peter Zijlstra
  2016-09-26 12:32 ` [PATCH -v2 2/9] sched/rtmutex/deadline: Fix a PI crash for deadline tasks Peter Zijlstra
@ 2016-09-26 12:32 ` Peter Zijlstra
  2016-09-26 16:03   ` Steven Rostedt
  2016-09-29 14:48   ` Thomas Gleixner
  2016-09-26 12:32 ` [PATCH -v2 4/9] rtmutex: Remove rt_mutex_fastunlock() Peter Zijlstra
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 31+ messages in thread
From: Peter Zijlstra @ 2016-09-26 12:32 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz,
	Ingo Molnar

[-- Attachment #1: xunlei_pang-sched_deadline_rtmutex-don_t_miss_the_dl_runtime_dl_period_update.patch --]
[-- Type: text/plain, Size: 1494 bytes --]

Currently dl tasks will actually return at the very beginning
of rt_mutex_adjust_prio_chain() in !detect_deadlock cases:

    if (waiter->prio == task->prio) {
        if (!detect_deadlock)
            goto out_unlock_pi; // out here
        else
            requeue = false;
    }

As the deadline value of blocked deadline tasks (waiters) never changes
as long as they don't change their sched_class (and thus their prio
doesn't change), this seems reasonable. But it actually misses the
chance to update the rt_mutex_waiter's "dl_runtime(period)_copy" if a
waiter updates its deadline parameters (dl_runtime, dl_period) or a
boosted waiter changes to a !deadline class.

Thus, keep deadline tasks from bailing out early by adding the !dl_task() condition.
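
Illustration of the problem (all SCHED_DEADLINE tasks share the same
numerical prio, so the old check bailed out even when the deadline
parameters had changed):

    if (waiter->prio == task->prio) {               /* -1 == -1 for DL tasks */
            if (!detect_deadlock)
                    goto out_unlock_pi;             /* old: requeue/copy update skipped */
            ...
    }

    /* new: DL tasks keep walking the chain so the copies get refreshed */
    if (waiter->prio == task->prio && !dl_task(task)) {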

Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1460633827-345-7-git-send-email-xlpang@redhat.com
---
 kernel/locking/rtmutex.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -488,7 +488,7 @@ static int rt_mutex_adjust_prio_chain(st
 	 * enabled we continue, but stop the requeueing in the chain
 	 * walk.
 	 */
-	if (waiter->prio == task->prio) {
+	if (waiter->prio == task->prio && !dl_task(task)) {
 		if (!detect_deadlock)
 			goto out_unlock_pi;
 		else

* [PATCH -v2 4/9] rtmutex: Remove rt_mutex_fastunlock()
  2016-09-26 12:32 [PATCH -v2 0/9] PI and assorted failings Peter Zijlstra
                   ` (2 preceding siblings ...)
  2016-09-26 12:32 ` [PATCH -v2 3/9] sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update Peter Zijlstra
@ 2016-09-26 12:32 ` Peter Zijlstra
  2016-09-29 14:47   ` Thomas Gleixner
  2016-09-26 12:32 ` [PATCH -v2 5/9] rtmutex: Clean up Peter Zijlstra
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2016-09-26 12:32 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz

[-- Attachment #1: peterz-cleanup-rt-mutex-1.patch --]
[-- Type: text/plain, Size: 1436 bytes --]

There is but a single user and the previous patch mandates slowfn must
be rt_mutex_slowunlock(), so fold the lot together and save a few
lines.

Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/locking/rtmutex.c |   29 ++++++++++-------------------
 1 file changed, 10 insertions(+), 19 deletions(-)

--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1377,24 +1377,6 @@ rt_mutex_fasttrylock(struct rt_mutex *lo
 	return slowfn(lock);
 }
 
-static inline void
-rt_mutex_fastunlock(struct rt_mutex *lock,
-		    bool (*slowfn)(struct rt_mutex *lock,
-				   struct wake_q_head *wqh))
-{
-	WAKE_Q(wake_q);
-
-	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL))) {
-		rt_mutex_deadlock_account_unlock(current);
-
-	} else {
-		bool deboost = slowfn(lock, &wake_q);
-
-		rt_mutex_postunlock(&wake_q, deboost);
-	}
-}
-
-
 /*
  * Undo pi boosting (if necessary) and wake top waiter.
  */
@@ -1501,7 +1483,16 @@ EXPORT_SYMBOL_GPL(rt_mutex_trylock);
  */
 void __sched rt_mutex_unlock(struct rt_mutex *lock)
 {
-	rt_mutex_fastunlock(lock, rt_mutex_slowunlock);
+	WAKE_Q(wake_q);
+
+	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL))) {
+		rt_mutex_deadlock_account_unlock(current);
+
+	} else {
+		bool deboost = rt_mutex_slowunlock(lock, &wake_q);
+
+		rt_mutex_postunlock(&wake_q, deboost);
+	}
 }
 EXPORT_SYMBOL_GPL(rt_mutex_unlock);
 

* [PATCH -v2 5/9] rtmutex: Clean up
  2016-09-26 12:32 [PATCH -v2 0/9] PI and assorted failings Peter Zijlstra
                   ` (3 preceding siblings ...)
  2016-09-26 12:32 ` [PATCH -v2 4/9] rtmutex: Remove rt_mutex_fastunlock() Peter Zijlstra
@ 2016-09-26 12:32 ` Peter Zijlstra
  2016-09-26 16:09   ` Steven Rostedt
  2016-09-26 12:32 ` [PATCH -v2 6/9] sched/rtmutex: Refactor rt_mutex_setprio() Peter Zijlstra
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2016-09-26 12:32 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz

[-- Attachment #1: peterz-cleanup-rt-mutex-2.patch --]
[-- Type: text/plain, Size: 3949 bytes --]

Previous patches changed the meaning of the return value of
rt_mutex_slowunlock(); update comments and code to reflect this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c                  |   12 ++++++------
 kernel/locking/rtmutex.c        |   20 +++++++++-----------
 kernel/locking/rtmutex_common.h |    2 +-
 3 files changed, 16 insertions(+), 18 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1261,7 +1261,7 @@ static int wake_futex_pi(u32 __user *uad
 	struct futex_pi_state *pi_state = this->pi_state;
 	u32 uninitialized_var(curval), newval;
 	WAKE_Q(wake_q);
-	bool deboost;
+	bool postunlock;
 	int ret = 0;
 
 	if (!pi_state)
@@ -1327,17 +1327,17 @@ static int wake_futex_pi(u32 __user *uad
 
 	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 
-	deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
+	postunlock = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
 
 	/*
 	 * First unlock HB so the waiter does not spin on it once he got woken
-	 * up. Second wake up the waiter before the priority is adjusted. If we
-	 * deboost first (and lose our higher priority), then the task might get
-	 * scheduled away before the wake up can take place.
+	 * up. Then wakeup the waiter by calling rt_mutex_postunlock(). Priority
+	 * is already adjusted and preemption is disabled to avoid inversion.
 	 */
 	spin_unlock(&hb->lock);
 
-	rt_mutex_postunlock(&wake_q, deboost);
+	if (postunlock)
+		rt_mutex_postunlock(&wake_q);
 
 	return 0;
 }
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1254,7 +1254,8 @@ static inline int rt_mutex_slowtrylock(s
 
 /*
  * Slow path to release a rt-mutex.
- * Return whether the current task needs to undo a potential priority boosting.
+ *
+ * Return whether the current task needs to call rt_mutex_postunlock().
  */
 static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock,
 					struct wake_q_head *wake_q)
@@ -1327,7 +1328,7 @@ static bool __sched rt_mutex_slowunlock(
 
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
-	/* check PI boosting */
+	/* call rt_mutex_postunlock() */
 	return true;
 }
 
@@ -1378,15 +1379,14 @@ rt_mutex_fasttrylock(struct rt_mutex *lo
 }
 
 /*
- * Undo pi boosting (if necessary) and wake top waiter.
+ * Performs the wakeup of the top-waiter and re-enables preemption.
  */
-void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost)
+void rt_mutex_postunlock(struct wake_q_head *wake_q)
 {
 	wake_up_q(wake_q);
 
 	/* Pairs with preempt_disable() in rt_mutex_slowunlock() */
-	if (deboost)
-		preempt_enable();
+	preempt_enable();
 }
 
 /**
@@ -1489,9 +1489,8 @@ void __sched rt_mutex_unlock(struct rt_m
 		rt_mutex_deadlock_account_unlock(current);
 
 	} else {
-		bool deboost = rt_mutex_slowunlock(lock, &wake_q);
-
-		rt_mutex_postunlock(&wake_q, deboost);
+		if (rt_mutex_slowunlock(lock, &wake_q))
+			rt_mutex_postunlock(&wake_q);
 	}
 }
 EXPORT_SYMBOL_GPL(rt_mutex_unlock);
@@ -1500,8 +1499,7 @@ EXPORT_SYMBOL_GPL(rt_mutex_unlock);
  * rt_mutex_futex_unlock - Futex variant of rt_mutex_unlock
  * @lock: the rt_mutex to be unlocked
  *
- * Returns: true/false indicating whether priority adjustment is
- * required or not.
+ * Returns: true/false indicating whether we should call rt_mutex_postunlock().
  */
 bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock,
 				   struct wake_q_head *wqh)
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -111,7 +111,7 @@ extern int rt_mutex_finish_proxy_lock(st
 extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
 extern bool rt_mutex_futex_unlock(struct rt_mutex *lock,
 				  struct wake_q_head *wqh);
-extern void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost);
+extern void rt_mutex_postunlock(struct wake_q_head *wake_q);
 extern void rt_mutex_adjust_prio(struct task_struct *task);
 
 #ifdef CONFIG_DEBUG_RT_MUTEXES

* [PATCH -v2 6/9] sched/rtmutex: Refactor rt_mutex_setprio()
  2016-09-26 12:32 [PATCH -v2 0/9] PI and assorted failings Peter Zijlstra
                   ` (4 preceding siblings ...)
  2016-09-26 12:32 ` [PATCH -v2 5/9] rtmutex: Clean up Peter Zijlstra
@ 2016-09-26 12:32 ` Peter Zijlstra
  2016-09-26 16:57   ` Steven Rostedt
  2016-09-26 12:32 ` [PATCH -v2 7/9] sched,tracing: Update trace_sched_pi_setprio() Peter Zijlstra
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2016-09-26 12:32 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz

[-- Attachment #1: peterz-cleanup-rt-mutex-3.patch --]
[-- Type: text/plain, Size: 12018 bytes --]

With the introduction of SCHED_DEADLINE the whole notion that priority
is a single number is gone; therefore the @prio argument to
rt_mutex_setprio() doesn't make sense anymore.

So rework the code to pass a pi_task instead.

Note this also fixes a problem with pi_top_task caching; previously we
would not set the pointer (call rt_mutex_update_top_task()) if the
priority didn't change, which could lead to a stale pointer.

As for the XXX, I think it's fine to use pi_task->prio, because if it
differs from waiter->prio, a PI chain update is imminent.
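
Illustration (values are made up): two SCHED_DEADLINE donors are
indistinguishable by a numerical prio alone; only the donor task itself
carries enough state:

    A->prio = -1;  A->dl.deadline = 100;    /* earlier deadline, should win PI */
    B->prio = -1;  B->dl.deadline = 500;

    rt_mutex_setprio(p, prio);      /* old: cannot tell A from B */
    rt_mutex_setprio(p, pi_task);   /* new: the scheduler can look at pi_task->dl too */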

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched/rt.h        |   21 +-------
 kernel/locking/rtmutex.c        |  105 +++++++++++-----------------------------
 kernel/locking/rtmutex_common.h |    1 
 kernel/sched/core.c             |   66 ++++++++++++++++++++-----
 4 files changed, 88 insertions(+), 105 deletions(-)

--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -16,28 +16,17 @@ static inline int rt_task(struct task_st
 }
 
 #ifdef CONFIG_RT_MUTEXES
-extern int rt_mutex_getprio(struct task_struct *p);
-extern void rt_mutex_setprio(struct task_struct *p, int prio);
-extern int rt_mutex_get_effective_prio(struct task_struct *task, int newprio);
-extern void rt_mutex_update_top_task(struct task_struct *p);
-extern struct task_struct *rt_mutex_get_top_task(struct task_struct *task);
+static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *p)
+{
+	return p->pi_top_task;
+}
+extern void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task);
 extern void rt_mutex_adjust_pi(struct task_struct *p);
 static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
 {
 	return tsk->pi_blocked_on != NULL;
 }
 #else
-static inline int rt_mutex_getprio(struct task_struct *p)
-{
-	return p->normal_prio;
-}
-
-static inline int rt_mutex_get_effective_prio(struct task_struct *task,
-					      int newprio)
-{
-	return newprio;
-}
-
 static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
 {
 	return NULL;
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -256,61 +256,16 @@ rt_mutex_dequeue_pi(struct task_struct *
 	RB_CLEAR_NODE(&waiter->pi_tree_entry);
 }
 
-void rt_mutex_update_top_task(struct task_struct *p)
+static void rt_mutex_adjust_prio(struct task_struct *p)
 {
-	if (!task_has_pi_waiters(p)) {
-		p->pi_top_task = NULL;
-		return;
-	}
+	struct task_struct *pi_task = NULL;
 
-	p->pi_top_task = task_top_pi_waiter(p)->task;
-}
-
-/*
- * Calculate task priority from the waiter tree priority
- *
- * Return task->normal_prio when the waiter tree is empty or when
- * the waiter is not allowed to do priority boosting
- */
-int rt_mutex_getprio(struct task_struct *task)
-{
-	if (likely(!task_has_pi_waiters(task)))
-		return task->normal_prio;
+	lockdep_assert_held(&p->pi_lock);
 
-	return min(task_top_pi_waiter(task)->prio,
-		   task->normal_prio);
-}
+	if (task_has_pi_waiters(p))
+		pi_task = task_top_pi_waiter(p)->task;
 
-struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
-{
-	return task->pi_top_task;
-}
-
-/*
- * Called by sched_setscheduler() to get the priority which will be
- * effective after the change.
- */
-int rt_mutex_get_effective_prio(struct task_struct *task, int newprio)
-{
-	struct task_struct *top_task = rt_mutex_get_top_task(task);
-
-	if (!top_task)
-		return newprio;
-
-	return min(top_task->prio, newprio);
-}
-
-/*
- * Adjust the priority of a task, after its pi_waiters got modified.
- *
- * This can be both boosting and unboosting. task->pi_lock must be held.
- */
-static void __rt_mutex_adjust_prio(struct task_struct *task)
-{
-	int prio = rt_mutex_getprio(task);
-
-	if (task->prio != prio || dl_prio(prio))
-		rt_mutex_setprio(task, prio);
+	rt_mutex_setprio(p, pi_task);
 }
 
 /*
@@ -670,7 +625,7 @@ static int rt_mutex_adjust_prio_chain(st
 		 */
 		rt_mutex_dequeue_pi(task, prerequeue_top_waiter);
 		rt_mutex_enqueue_pi(task, waiter);
-		__rt_mutex_adjust_prio(task);
+		rt_mutex_adjust_prio(task);
 
 	} else if (prerequeue_top_waiter == waiter) {
 		/*
@@ -686,7 +641,7 @@ static int rt_mutex_adjust_prio_chain(st
 		rt_mutex_dequeue_pi(task, waiter);
 		waiter = rt_mutex_top_waiter(lock);
 		rt_mutex_enqueue_pi(task, waiter);
-		__rt_mutex_adjust_prio(task);
+		rt_mutex_adjust_prio(task);
 	} else {
 		/*
 		 * Nothing changed. No need to do any priority
@@ -896,7 +851,7 @@ static int task_blocks_on_rt_mutex(struc
 		return -EDEADLK;
 
 	raw_spin_lock(&task->pi_lock);
-	__rt_mutex_adjust_prio(task);
+	rt_mutex_adjust_prio(task);
 	waiter->task = task;
 	waiter->lock = lock;
 	waiter->prio = task->prio;
@@ -918,7 +873,7 @@ static int task_blocks_on_rt_mutex(struc
 		rt_mutex_dequeue_pi(owner, top_waiter);
 		rt_mutex_enqueue_pi(owner, waiter);
 
-		__rt_mutex_adjust_prio(owner);
+		rt_mutex_adjust_prio(owner);
 		if (owner->pi_blocked_on)
 			chain_walk = 1;
 	} else if (rt_mutex_cond_detect_deadlock(waiter, chwalk)) {
@@ -970,13 +925,14 @@ static void mark_wakeup_next_waiter(stru
 	waiter = rt_mutex_top_waiter(lock);
 
 	/*
-	 * Remove it from current->pi_waiters. We do not adjust a
-	 * possible priority boost right now. We execute wakeup in the
-	 * boosted mode and go back to normal after releasing
-	 * lock->wait_lock.
+	 * Remove it from current->pi_waiters and deboost.
+	 *
+	 * We must in fact deboost here in order to ensure we call
+	 * rt_mutex_setprio() to update p->pi_top_task before the
+	 * task unblocks.
 	 */
 	rt_mutex_dequeue_pi(current, waiter);
-	__rt_mutex_adjust_prio(current);
+	rt_mutex_adjust_prio(current);
 
 	/*
 	 * As we are waking up the top waiter, and the waiter stays
@@ -988,9 +944,19 @@ static void mark_wakeup_next_waiter(stru
 	 */
 	lock->owner = (void *) RT_MUTEX_HAS_WAITERS;
 
-	raw_spin_unlock(&current->pi_lock);
-
+	/*
+	 * We deboosted before waking the top waiter task such that we don't
+	 * run two tasks with the 'same' priority (and ensure the
+	 * p->pi_top_task pointer points to a blocked task). This however can
+	 * lead to priority inversion if we would get preempted after the
+	 * deboost but before waking our donor task, hence the preempt_disable()
+	 * before unlock.
+	 *
+	 * Pairs with preempt_enable() in rt_mutex_postunlock();
+	 */
+	preempt_disable();
 	wake_q_add(wake_q, waiter->task);
+	raw_spin_unlock(&current->pi_lock);
 }
 
 /*
@@ -1025,7 +991,7 @@ static void remove_waiter(struct rt_mute
 	if (rt_mutex_has_waiters(lock))
 		rt_mutex_enqueue_pi(owner, rt_mutex_top_waiter(lock));
 
-	__rt_mutex_adjust_prio(owner);
+	rt_mutex_adjust_prio(owner);
 
 	/* Store the lock on which owner is blocked or NULL */
 	next_lock = task_blocked_on_lock(owner);
@@ -1064,8 +1030,7 @@ void rt_mutex_adjust_pi(struct task_stru
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
 
 	waiter = task->pi_blocked_on;
-	if (!waiter || (waiter->prio == task->prio &&
-			!dl_prio(task->prio))) {
+	if (!waiter || (waiter->prio == task->prio && !dl_prio(task->prio))) {
 		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 		return;
 	}
@@ -1316,16 +1281,6 @@ static bool __sched rt_mutex_slowunlock(
 	 */
 	mark_wakeup_next_waiter(wake_q, lock);
 
-	/*
-	 * We should deboost before waking the top waiter task such that
-	 * we don't run two tasks with the 'same' priority. This however
-	 * can lead to prio-inversion if we would get preempted after
-	 * the deboost but before waking our high-prio task, hence the
-	 * preempt_disable before unlock. Pairs with preempt_enable() in
-	 * rt_mutex_postunlock();
-	 */
-	preempt_disable();
-
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 	/* call rt_mutex_postunlock() */
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -112,7 +112,6 @@ extern int rt_mutex_timed_futex_lock(str
 extern bool rt_mutex_futex_unlock(struct rt_mutex *lock,
 				  struct wake_q_head *wqh);
 extern void rt_mutex_postunlock(struct wake_q_head *wake_q);
-extern void rt_mutex_adjust_prio(struct task_struct *task);
 
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 # include "rtmutex-debug.h"
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3546,10 +3546,25 @@ EXPORT_SYMBOL(default_wake_function);
 
 #ifdef CONFIG_RT_MUTEXES
 
+static inline int __rt_effective_prio(struct task_struct *pi_task, int prio)
+{
+	if (pi_task)
+		prio = min(prio, pi_task->prio);
+
+	return prio;
+}
+
+static inline int rt_effective_prio(struct task_struct *p, int prio)
+{
+	struct task_struct *pi_task = rt_mutex_get_top_task(p);
+
+	return __rt_effective_prio(pi_task, prio);
+}
+
 /*
  * rt_mutex_setprio - set the current priority of a task
- * @p: task
- * @prio: prio value (kernel-internal form)
+ * @p: task to boost
+ * @pi_task: donor task
  *
  * This function changes the 'effective' priority of a task. It does
  * not touch ->normal_prio like __setscheduler().
@@ -3557,16 +3572,40 @@ EXPORT_SYMBOL(default_wake_function);
  * Used by the rt_mutex code to implement priority inheritance
  * logic. Call site only calls if the priority of the task changed.
  */
-void rt_mutex_setprio(struct task_struct *p, int prio)
+void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
 {
-	int oldprio, queued, running, queue_flag = DEQUEUE_SAVE | DEQUEUE_MOVE;
+	int prio, oldprio, queued, running, queue_flag = DEQUEUE_SAVE | DEQUEUE_MOVE;
 	const struct sched_class *prev_class;
 	struct rq_flags rf;
 	struct rq *rq;
 
-	BUG_ON(prio > MAX_PRIO);
+	/* XXX used to be waiter->prio, not waiter->task->prio */
+	prio = __rt_effective_prio(pi_task, p->normal_prio);
+
+	/*
+	 * If nothing changed; bail early.
+	 */
+	if (p->pi_top_task == pi_task && prio == p->prio && !dl_prio(prio))
+		return;
 
 	rq = __task_rq_lock(p, &rf);
+	/*
+	 * Set under pi_lock && rq->lock, such that the value can be used under
+	 * either lock.
+	 *
+	 * Note that there is loads of trickiness to make this pointer cache work
+	 * right. rt_mutex_slowunlock()+rt_mutex_postunlock() work together to
+	 * ensure a task is de-boosted (pi_task is set to NULL) before the
+	 * task is allowed to run again (and can exit). This ensures the pointer
+	 * points to a blocked task -- which guarantees the task is present.
+	 */
+	p->pi_top_task = pi_task;
+
+	/*
+	 * For FIFO/RR we only need to set prio, if that matches we're done.
+	 */
+	if (prio == p->prio && !dl_prio(prio))
+		goto out_unlock;
 
 	/*
 	 * Idle task boosting is a nono in general. There is one
@@ -3586,9 +3625,7 @@ void rt_mutex_setprio(struct task_struct
 		goto out_unlock;
 	}
 
-	rt_mutex_update_top_task(p);
-
-	trace_sched_pi_setprio(p, prio);
+	trace_sched_pi_setprio(p, prio); /* broken */
 	oldprio = p->prio;
 
 	if (oldprio == prio)
@@ -3612,7 +3649,6 @@ void rt_mutex_setprio(struct task_struct
 	 *          running task
 	 */
 	if (dl_prio(prio)) {
-		struct task_struct *pi_task = rt_mutex_get_top_task(p);
 		if (!dl_prio(p->normal_prio) ||
 		    (pi_task && dl_entity_preempt(&pi_task->dl, &p->dl))) {
 			p->dl.dl_boosted = 1;
@@ -3649,6 +3685,11 @@ void rt_mutex_setprio(struct task_struct
 	balance_callback(rq);
 	preempt_enable();
 }
+#else
+static inline int rt_effective_prio(struct task_struct *p, int prio)
+{
+	return prio;
+}
 #endif
 
 void set_user_nice(struct task_struct *p, long nice)
@@ -3887,10 +3928,9 @@ static void __setscheduler(struct rq *rq
 	 * Keep a potential priority boosting if called from
 	 * sched_setscheduler().
 	 */
+	p->prio = normal_prio(p);
 	if (keep_boost)
-		p->prio = rt_mutex_get_effective_prio(p, normal_prio(p));
-	else
-		p->prio = normal_prio(p);
+		p->prio = rt_effective_prio(p, p->prio);
 
 	if (dl_prio(p->prio))
 		p->sched_class = &dl_sched_class;
@@ -4177,7 +4217,7 @@ static int __sched_setscheduler(struct t
 		 * the runqueue. This will be done when the task deboost
 		 * itself.
 		 */
-		new_effective_prio = rt_mutex_get_effective_prio(p, newprio);
+		new_effective_prio = rt_effective_prio(p, newprio);
 		if (new_effective_prio == oldprio)
 			queue_flags &= ~DEQUEUE_MOVE;
 	}

* [PATCH -v2 7/9] sched,tracing: Update trace_sched_pi_setprio()
  2016-09-26 12:32 [PATCH -v2 0/9] PI and assorted failings Peter Zijlstra
                   ` (5 preceding siblings ...)
  2016-09-26 12:32 ` [PATCH -v2 6/9] sched/rtmutex: Refactor rt_mutex_setprio() Peter Zijlstra
@ 2016-09-26 12:32 ` Peter Zijlstra
  2016-09-26 17:04   ` Steven Rostedt
  2016-09-26 12:32 ` [PATCH -v2 8/9] rtmutex: Fix PI chain order integrity Peter Zijlstra
  2016-09-26 12:32 ` [PATCH -v2 9/9] rtmutex: Fix more prio comparisons Peter Zijlstra
  8 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2016-09-26 12:32 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz

[-- Attachment #1: peterz-cleanup-rt-mutex-4.patch --]
[-- Type: text/plain, Size: 3027 bytes --]

Pass the PI donor task, instead of a numerical priority.

Numerical priorities are not sufficient to describe state ever since
SCHED_DEADLINE.

Annotate all sched tracepoints that are currently broken; fixing them
will bork userspace. *hate*.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/trace/events/sched.h |   16 +++++++++-------
 kernel/sched/core.c          |    2 +-
 2 files changed, 10 insertions(+), 8 deletions(-)

--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -70,7 +70,7 @@ DECLARE_EVENT_CLASS(sched_wakeup_templat
 	TP_fast_assign(
 		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
 		__entry->pid		= p->pid;
-		__entry->prio		= p->prio;
+		__entry->prio		= p->prio; /* XXX SCHED_DEADLINE */
 		__entry->success	= 1; /* rudiment, kill when possible */
 		__entry->target_cpu	= task_cpu(p);
 	),
@@ -147,6 +147,7 @@ TRACE_EVENT(sched_switch,
 		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
 		__entry->next_pid	= next->pid;
 		__entry->next_prio	= next->prio;
+		/* XXX SCHED_DEADLINE */
 	),
 
 	TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
@@ -181,7 +182,7 @@ TRACE_EVENT(sched_migrate_task,
 	TP_fast_assign(
 		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
 		__entry->pid		= p->pid;
-		__entry->prio		= p->prio;
+		__entry->prio		= p->prio; /* XXX SCHED_DEADLINE */
 		__entry->orig_cpu	= task_cpu(p);
 		__entry->dest_cpu	= dest_cpu;
 	),
@@ -206,7 +207,7 @@ DECLARE_EVENT_CLASS(sched_process_templa
 	TP_fast_assign(
 		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
 		__entry->pid		= p->pid;
-		__entry->prio		= p->prio;
+		__entry->prio		= p->prio; /* XXX SCHED_DEADLINE */
 	),
 
 	TP_printk("comm=%s pid=%d prio=%d",
@@ -253,7 +254,7 @@ TRACE_EVENT(sched_process_wait,
 	TP_fast_assign(
 		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
 		__entry->pid		= pid_nr(pid);
-		__entry->prio		= current->prio;
+		__entry->prio		= current->prio; /* XXX SCHED_DEADLINE */
 	),
 
 	TP_printk("comm=%s pid=%d prio=%d",
@@ -413,9 +414,9 @@ DEFINE_EVENT(sched_stat_runtime, sched_s
  */
 TRACE_EVENT(sched_pi_setprio,
 
-	TP_PROTO(struct task_struct *tsk, int newprio),
+	TP_PROTO(struct task_struct *tsk, struct task_struct *pi_task),
 
-	TP_ARGS(tsk, newprio),
+	TP_ARGS(tsk, pi_task),
 
 	TP_STRUCT__entry(
 		__array( char,	comm,	TASK_COMM_LEN	)
@@ -428,7 +429,8 @@ TRACE_EVENT(sched_pi_setprio,
 		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
 		__entry->pid		= tsk->pid;
 		__entry->oldprio	= tsk->prio;
-		__entry->newprio	= newprio;
+		__entry->newprio	= pi_task ? pi_task->prio : tsk->prio;
+		/* XXX SCHED_DEADLINE bits missing */
 	),
 
 	TP_printk("comm=%s pid=%d oldprio=%d newprio=%d",
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3634,7 +3634,7 @@ void rt_mutex_setprio(struct task_struct
 		goto out_unlock;
 	}
 
-	trace_sched_pi_setprio(p, prio); /* broken */
+	trace_sched_pi_setprio(p, pi_task);
 	oldprio = p->prio;
 
 	if (oldprio == prio)

* [PATCH -v2 8/9] rtmutex: Fix PI chain order integrity
  2016-09-26 12:32 [PATCH -v2 0/9] PI and assorted failings Peter Zijlstra
                   ` (6 preceding siblings ...)
  2016-09-26 12:32 ` [PATCH -v2 7/9] sched,tracing: Update trace_sched_pi_setprio() Peter Zijlstra
@ 2016-09-26 12:32 ` Peter Zijlstra
  2016-09-26 12:32 ` [PATCH -v2 9/9] rtmutex: Fix more prio comparisons Peter Zijlstra
  8 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2016-09-26 12:32 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz

[-- Attachment #1: peterz-cleanup-rt-mutex-5.patch --]
[-- Type: text/plain, Size: 3489 bytes --]

rt_mutex_waiter::prio is a copy of task_struct::prio which is updated
during the PI chain walk, such that the PI chain order isn't messed up
by (asynchronous) task state updates.

Currently rt_mutex_waiter_less() uses task state for deadline tasks;
this is broken, since the task state can, as said above, change
asynchronously, causing the RB tree order to change without actual
tree update -> FAIL.

Fix this by also copying the deadline into the rt_mutex_waiter state
and updating it along with its prio field.

Ideally we would also force PI chain updates whenever DL tasks update
their deadline parameters, but as a first approximation this is less
broken than it was.
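
With this, the comparison only looks at the snapshot carried by the
waiter (roughly, as per the hunks below):

    static inline int
    rt_mutex_waiter_less(struct rt_mutex_waiter *left, struct rt_mutex_waiter *right)
    {
            if (left->prio < right->prio)
                    return 1;

            /* if left is DL and the priorities are equal, right is DL too */
            if (dl_prio(left->prio))
                    return dl_time_before(left->deadline, right->deadline);

            return 0;
    }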

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/locking/rtmutex.c        |   29 +++++++++++++++++++++++++++--
 kernel/locking/rtmutex_common.h |    1 +
 2 files changed, 28 insertions(+), 2 deletions(-)

--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -172,8 +172,7 @@ rt_mutex_waiter_less(struct rt_mutex_wai
 	 * then right waiter has a dl_prio() too.
 	 */
 	if (dl_prio(left->prio))
-		return dl_time_before(left->task->dl.deadline,
-				      right->task->dl.deadline);
+		return dl_time_before(left->deadline, right->deadline);
 
 	return 0;
 }
@@ -584,7 +583,26 @@ static int rt_mutex_adjust_prio_chain(st
 
 	/* [7] Requeue the waiter in the lock waiter tree. */
 	rt_mutex_dequeue(lock, waiter);
+
+	/*
+	 * Update the waiter prio fields now that we're dequeued.
+	 *
+	 * These values can have changed through either:
+	 *
+	 *   sys_sched_set_scheduler() / sys_sched_setattr()
+	 *
+	 * or
+	 *
+	 *   DL CBS enforcement advancing the effective deadline.
+	 *
+	 * Even though pi_waiters also uses these fields, and that tree is only
+	 * updated in [11], we can do this here, since we hold [L], which
+	 * serializes all pi_waiters access and rb_erase() does not care about
+	 * the values of the node being removed.
+	 */
 	waiter->prio = task->prio;
+	waiter->deadline = task->dl.deadline;
+
 	rt_mutex_enqueue(lock, waiter);
 
 	/* [8] Release the task */
@@ -711,6 +729,8 @@ static int rt_mutex_adjust_prio_chain(st
 static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 				struct rt_mutex_waiter *waiter)
 {
+	lockdep_assert_held(&lock->wait_lock);
+
 	/*
 	 * Before testing whether we can acquire @lock, we set the
 	 * RT_MUTEX_HAS_WAITERS bit in @lock->owner. This forces all
@@ -838,6 +858,8 @@ static int task_blocks_on_rt_mutex(struc
 	struct rt_mutex *next_lock;
 	int chain_walk = 0, res;
 
+	lockdep_assert_held(&lock->wait_lock);
+
 	/*
 	 * Early deadlock detection. We really don't want the task to
 	 * enqueue on itself just to untangle the mess later. It's not
@@ -855,6 +877,7 @@ static int task_blocks_on_rt_mutex(struc
 	waiter->task = task;
 	waiter->lock = lock;
 	waiter->prio = task->prio;
+	waiter->deadline = task->dl.deadline;
 
 	/* Get the top priority waiter on the lock */
 	if (rt_mutex_has_waiters(lock))
@@ -972,6 +995,8 @@ static void remove_waiter(struct rt_mute
 	struct task_struct *owner = rt_mutex_owner(lock);
 	struct rt_mutex *next_lock;
 
+	lockdep_assert_held(&lock->wait_lock);
+
 	raw_spin_lock(&current->pi_lock);
 	rt_mutex_dequeue(lock, waiter);
 	current->pi_blocked_on = NULL;
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -33,6 +33,7 @@ struct rt_mutex_waiter {
 	struct rt_mutex		*deadlock_lock;
 #endif
 	int prio;
+	u64 deadline;
 };
 
 /*

* [PATCH -v2 9/9] rtmutex: Fix more prio comparisons
  2016-09-26 12:32 [PATCH -v2 0/9] PI and assorted failings Peter Zijlstra
                   ` (7 preceding siblings ...)
  2016-09-26 12:32 ` [PATCH -v2 8/9] rtmutex: Fix PI chain order integrity Peter Zijlstra
@ 2016-09-26 12:32 ` Peter Zijlstra
  8 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2016-09-26 12:32 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz

[-- Attachment #1: peterz-cleanup-rt-mutex-6.patch --]
[-- Type: text/plain, Size: 2701 bytes --]

There was a pure ->prio comparison left in try_to_take_rt_mutex();
convert it to use rt_mutex_waiter_less(), noting that greater-or-equal
is not-less (both in the kernel priority view).

This necessitated the introduction of cmp_task() which creates a
pointer to an unnamed stack variable of struct rt_mutex_waiter type to
compare against tasks.

With this, we can now also create and employ rt_mutex_waiter_equal().
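
For clarity, cmp_task() is a C99 compound literal taken by address, so
(roughly) the following two forms are equivalent:

    if (rt_mutex_waiter_equal(waiter, cmp_task(task)))
            ...

    /* is roughly the same as */
    struct rt_mutex_waiter tmp = {
            .prio     = task->prio,
            .deadline = task->dl.deadline,
    };
    if (rt_mutex_waiter_equal(waiter, &tmp))
            ...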

Reviewed-and-tested-by: Juri Lelli <juri.lelli@arm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/locking/rtmutex.c |   32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -158,6 +158,12 @@ static inline bool unlock_rt_mutex_safe(
 }
 #endif
 
+/*
+ * Only use with rt_mutex_waiter_{less,equal}()
+ */
+#define cmp_task(p)	\
+	&(struct rt_mutex_waiter){ .prio = (p)->prio, .deadline = (p)->dl.deadline }
+
 static inline int
 rt_mutex_waiter_less(struct rt_mutex_waiter *left,
 		     struct rt_mutex_waiter *right)
@@ -177,6 +183,25 @@ rt_mutex_waiter_less(struct rt_mutex_wai
 	return 0;
 }
 
+static inline int
+rt_mutex_waiter_equal(struct rt_mutex_waiter *left,
+		      struct rt_mutex_waiter *right)
+{
+	if (left->prio != right->prio)
+		return 0;
+
+	/*
+	 * If both waiters have dl_prio(), we check the deadlines of the
+	 * associated tasks.
+	 * If left waiter has a dl_prio(), and we didn't return 0 above,
+	 * then right waiter has a dl_prio() too.
+	 */
+	if (dl_prio(left->prio))
+		return left->deadline == right->deadline;
+
+	return 1;
+}
+
 static void
 rt_mutex_enqueue(struct rt_mutex *lock, struct rt_mutex_waiter *waiter)
 {
@@ -487,7 +512,7 @@ static int rt_mutex_adjust_prio_chain(st
 	 * enabled we continue, but stop the requeueing in the chain
 	 * walk.
 	 */
-	if (waiter->prio == task->prio && !dl_task(task)) {
+	if (rt_mutex_waiter_equal(waiter, cmp_task(task))) {
 		if (!detect_deadlock)
 			goto out_unlock_pi;
 		else
@@ -790,7 +815,8 @@ static int try_to_take_rt_mutex(struct r
 			 * the top waiter priority (kernel view),
 			 * @task lost.
 			 */
-			if (task->prio >= rt_mutex_top_waiter(lock)->prio)
+			if (!rt_mutex_waiter_less(cmp_task(task),
+						  rt_mutex_top_waiter(lock)))
 				return 0;
 
 			/*
@@ -1055,7 +1081,7 @@ void rt_mutex_adjust_pi(struct task_stru
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
 
 	waiter = task->pi_blocked_on;
-	if (!waiter || (waiter->prio == task->prio && !dl_prio(task->prio))) {
+	if (!waiter || rt_mutex_waiter_equal(waiter, cmp_task(task))) {
 		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 		return;
 	}

* Re: [PATCH -v2 1/9] rtmutex: Deboost before waking up the top waiter
  2016-09-26 12:32 ` [PATCH -v2 1/9] rtmutex: Deboost before waking up the top waiter Peter Zijlstra
@ 2016-09-26 15:15   ` Steven Rostedt
  2016-09-26 15:22     ` Peter Zijlstra
  2016-09-28  9:07   ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 31+ messages in thread
From: Steven Rostedt @ 2016-09-26 15:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, Ingo Molnar

On Mon, 26 Sep 2016 14:32:14 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> 
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1374,9 +1374,8 @@ static int wake_futex_pi(u32 __user *uad
>  	 * scheduled away before the wake up can take place.
>  	 */
>  	spin_unlock(&hb->lock);
> -	wake_up_q(&wake_q);
> -	if (deboost)
> -		rt_mutex_adjust_prio(current);
> +
> +	rt_mutex_postunlock(&wake_q, deboost);

Hmm...

>  
>  	return 0;
>  }
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -1325,6 +1308,16 @@ static bool __sched rt_mutex_slowunlock(
>  	 */
>  	mark_wakeup_next_waiter(wake_q, lock);
>  
> +	/*
> +	 * We should deboost before waking the top waiter task such that
> +	 * we don't run two tasks with the 'same' priority. This however
> +	 * can lead to prio-inversion if we would get preempted after
> +	 * the deboost but before waking our high-prio task, hence the
> +	 * preempt_disable before unlock. Pairs with preempt_enable() in
> +	 * rt_mutex_postunlock();

There's a preempt_enable() in rt_mutex_postunlock()? Does
wake_futex_pi() know that?

-- Steve


> +	 */
> +	preempt_disable();
> +
>  	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
>  
>  	/* check PI boosting */
> @@ -1390,14 +1383,23 @@ rt_mutex_fastunlock(struct rt_mutex *loc
>  	} else {
>  		bool deboost = slowfn(lock, &wake_q);
>  
> -		wake_up_q(&wake_q);
> -
> -		/* Undo pi boosting if necessary: */
> -		if (deboost)
> -			rt_mutex_adjust_prio(current);
> +		rt_mutex_postunlock(&wake_q, deboost);
>  	}
>  }
>  
> +
> +/*
> + * Undo pi boosting (if necessary) and wake top waiter.
> + */
> +void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost)
> +{
> +	wake_up_q(wake_q);
> +
> +	/* Pairs with preempt_disable() in rt_mutex_slowunlock() */
> +	if (deboost)
> +		preempt_enable();
> +}
> +
>  /**
>   * rt_mutex_lock - lock a rt_mutex
>   *
> --- a/kernel/locking/rtmutex_common.h
> +++ b/kernel/locking/rtmutex_common.h
> @@ -111,6 +111,7 @@ extern int rt_mutex_finish_proxy_lock(st
>  extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
>  extern bool rt_mutex_futex_unlock(struct rt_mutex *lock,
>  				  struct wake_q_head *wqh);
> +extern void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost);
>  extern void rt_mutex_adjust_prio(struct task_struct *task);
>  
>  #ifdef CONFIG_DEBUG_RT_MUTEXES
> 

* Re: [PATCH -v2 2/9] sched/rtmutex/deadline: Fix a PI crash for deadline tasks
  2016-09-26 12:32 ` [PATCH -v2 2/9] sched/rtmutex/deadline: Fix a PI crash for deadline tasks Peter Zijlstra
@ 2016-09-26 15:20   ` Steven Rostedt
  2016-09-26 15:26     ` Peter Zijlstra
  2016-09-29 14:49   ` Thomas Gleixner
  1 sibling, 1 reply; 31+ messages in thread
From: Steven Rostedt @ 2016-09-26 15:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, Ingo Molnar

> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -256,6 +256,16 @@ rt_mutex_dequeue_pi(struct task_struct *
>  	RB_CLEAR_NODE(&waiter->pi_tree_entry);
>  }
>  

Shouldn't we add a comment about what locks are expected to be held
when calling this? Especially if it can be called outside this file.

> +void rt_mutex_update_top_task(struct task_struct *p)
> +{
> +	if (!task_has_pi_waiters(p)) {
> +		p->pi_top_task = NULL;
> +		return;
> +	}
> +
> +	p->pi_top_task = task_top_pi_waiter(p)->task;
> +}
> +
>  /*
>   * Calculate task priority from the waiter tree priority
>   *
> @@ -273,10 +283,7 @@ int rt_mutex_getprio(struct task_struct
>  

Any specific locks that must be held when calling this?

>  struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
>  {
> -	if (likely(!task_has_pi_waiters(task)))
> -		return NULL;
> -
> -	return task_top_pi_waiter(task)->task;
> +	return task->pi_top_task;
>  }

-- Steve

* Re: [PATCH -v2 1/9] rtmutex: Deboost before waking up the top waiter
  2016-09-26 15:15   ` Steven Rostedt
@ 2016-09-26 15:22     ` Peter Zijlstra
  2016-09-26 15:35       ` Steven Rostedt
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2016-09-26 15:22 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: mingo, tglx, juri.lelli, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, Ingo Molnar

On Mon, Sep 26, 2016 at 11:15:11AM -0400, Steven Rostedt wrote:
> On Mon, 26 Sep 2016 14:32:14 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > --- a/kernel/futex.c
> > +++ b/kernel/futex.c
> > @@ -1374,9 +1374,8 @@ static int wake_futex_pi(u32 __user *uad
> >  	 * scheduled away before the wake up can take place.
> >  	 */
> >  	spin_unlock(&hb->lock);
> > -	wake_up_q(&wake_q);
> > -	if (deboost)
> > -		rt_mutex_adjust_prio(current);
> > +
> > +	rt_mutex_postunlock(&wake_q, deboost);
> 
> Hmm...
> 
> >  
> >  	return 0;
> >  }
> > --- a/kernel/locking/rtmutex.c
> > +++ b/kernel/locking/rtmutex.c
> > @@ -1325,6 +1308,16 @@ static bool __sched rt_mutex_slowunlock(
> >  	 */
> >  	mark_wakeup_next_waiter(wake_q, lock);
> >  
> > +	/*
> > +	 * We should deboost before waking the top waiter task such that
> > +	 * we don't run two tasks with the 'same' priority. This however
> > +	 * can lead to prio-inversion if we would get preempted after
> > +	 * the deboost but before waking our high-prio task, hence the
> > +	 * preempt_disable before unlock. Pairs with preempt_enable() in
> > +	 * rt_mutex_postunlock();
> 
> There's a preempt_enable() in rt_mutex_postunlock()? Does
> wake_futex_pi() know that?
> 

Not sure I see your point. rt_mutex_futex_unlock() calls
rt_mutex_slowunlock(), which does the preempt_disable(); we then pass the
return of that into deboost, which we pass into rt_mutex_postunlock(),
and everything should be balanced.
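
Roughly, for the futex slow path:

    wake_futex_pi()
      rt_mutex_futex_unlock()
        rt_mutex_slowunlock()
          mark_wakeup_next_waiter()   /* dequeue top waiter + deboost current */
          preempt_disable()           /* <-- disabled here */
          return true                 /* becomes 'deboost' */
      spin_unlock(&hb->lock)
      rt_mutex_postunlock(&wake_q, deboost)
        wake_up_q(&wake_q)
        preempt_enable()              /* <-- pairs with the disable above */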

* Re: [PATCH -v2 2/9] sched/rtmutex/deadline: Fix a PI crash for deadline tasks
  2016-09-26 15:20   ` Steven Rostedt
@ 2016-09-26 15:26     ` Peter Zijlstra
  0 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2016-09-26 15:26 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: mingo, tglx, juri.lelli, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, Ingo Molnar

On Mon, Sep 26, 2016 at 11:20:58AM -0400, Steven Rostedt wrote:
> > --- a/kernel/locking/rtmutex.c
> > +++ b/kernel/locking/rtmutex.c
> > @@ -256,6 +256,16 @@ rt_mutex_dequeue_pi(struct task_struct *
> >  	RB_CLEAR_NODE(&waiter->pi_tree_entry);
> >  }
> >  
> 
> Shouldn't we add a comment about what locks are expected to be held
> when calling this? Especially if it can be called outside this file.

Comments are somewhat useless.. I would like to do the below, except I
cannot.

> > +void rt_mutex_update_top_task(struct task_struct *p)
> > +{

	lockdep_assert_held(&p->pi_lock);
	lockdep_assert_held(&task_rq(p)->lock); // except that we cannot access rq :/

> > +	if (!task_has_pi_waiters(p)) {
> > +		p->pi_top_task = NULL;
> > +		return;
> > +	}
> > +
> > +	p->pi_top_task = task_top_pi_waiter(p)->task;
> > +}
> > +
> >  /*
> >   * Calculate task priority from the waiter tree priority
> >   *
> > @@ -273,10 +283,7 @@ int rt_mutex_getprio(struct task_struct
> >  
> 
> Any specific locks that must be held when calling this?

#ifdef CONFIG_LOCKDEP
	WARN_ON_ONCE(debug_locks &&
		!lock_is_held(&p->pi_lock) &&
		!lock_is_held(&task_rq(p)->lock)); // again, cannot do this :/
#endif

> >  struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
> >  {
> > -	if (likely(!task_has_pi_waiters(task)))
> > -		return NULL;
> > -
> > -	return task_top_pi_waiter(task)->task;
> > +	return task->pi_top_task;
> >  }
> 
> -- Steve
> 

* Re: [PATCH -v2 1/9] rtmutex: Deboost before waking up the top waiter
  2016-09-26 15:22     ` Peter Zijlstra
@ 2016-09-26 15:35       ` Steven Rostedt
  2016-09-26 15:37         ` Steven Rostedt
  2016-09-26 15:39         ` Peter Zijlstra
  0 siblings, 2 replies; 31+ messages in thread
From: Steven Rostedt @ 2016-09-26 15:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, Ingo Molnar

On Mon, 26 Sep 2016 17:22:28 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> > > +	/*
> > > +	 * We should deboost before waking the top waiter task such that
> > > +	 * we don't run two tasks with the 'same' priority. This however
> > > +	 * can lead to prio-inversion if we would get preempted after
> > > +	 * the deboost but before waking our high-prio task, hence the
> > > +	 * preempt_disable before unlock. Pairs with preempt_enable() in
> > > +	 * rt_mutex_postunlock();  
> > 
> > There's a preempt_enable() in rt_mutex_postunlock()? Does
> > wake_futex_pi() know that?
> >   
> 
> Not sure I see your point. rt_mutex_futex_unlock() calls
> rt_mutex_slowunlock() which does the preempt_disable(), we then pass the
> return of that into deboost, which we pass into rt_mutex_postunlock()
> and everything should be balanced.

Can we please add more comments explaining this? Having a function
disable preemption as a side effect, pass back a bool saying that it
did, and then require a call to another (seemingly unrelated) function
to re-enable preemption seems a bit of a stretch for maintainable
code.

Especially now that the code after the spin_unlock(&hb->lock) is a
critical section (preemption is disabled). There's nothing obvious in
futex.c that says it is.

Just think about looking at this code in another 5 years. Are you going
to remember all this?

-- Steve

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH -v2 1/9] rtmutex: Deboost before waking up the top waiter
  2016-09-26 15:35       ` Steven Rostedt
@ 2016-09-26 15:37         ` Steven Rostedt
  2016-09-26 15:41           ` Peter Zijlstra
  2016-09-26 15:39         ` Peter Zijlstra
  1 sibling, 1 reply; 31+ messages in thread
From: Steven Rostedt @ 2016-09-26 15:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, Ingo Molnar

On Mon, 26 Sep 2016 11:35:03 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> Especially now that the code after the spin_unlock(&hb->lock) is a
> critical section (preemption is disabled). There's nothing obvious in
> futex.c that says it is.

Not to mention, this looks like it will break PREEMPT_RT as wake_up_q()
calls sleepable spin locks.

-- Steve

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH -v2 1/9] rtmutex: Deboost before waking up the top waiter
  2016-09-26 15:35       ` Steven Rostedt
  2016-09-26 15:37         ` Steven Rostedt
@ 2016-09-26 15:39         ` Peter Zijlstra
  1 sibling, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2016-09-26 15:39 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: mingo, tglx, juri.lelli, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, Ingo Molnar

On Mon, Sep 26, 2016 at 11:35:03AM -0400, Steven Rostedt wrote:
> On Mon, 26 Sep 2016 17:22:28 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > > +	/*
> > > > +	 * We should deboost before waking the top waiter task such that
> > > > +	 * we don't run two tasks with the 'same' priority. This however
> > > > +	 * can lead to prio-inversion if we would get preempted after
> > > > +	 * the deboost but before waking our high-prio task, hence the
> > > > +	 * preempt_disable before unlock. Pairs with preempt_enable() in
> > > > +	 * rt_mutex_postunlock();  
> > > 
> > > There's a preempt_enable() in rt_mutex_postunlock()? Does
> > > wake_futex_pi() know that?
> > >   
> > 
> > Not sure I see your point. rt_mutex_futex_unlock() calls
> > rt_mutex_slowunlock(), which does the preempt_disable(); we then pass
> > its return value along as 'deboost' into rt_mutex_postunlock(), and
> > everything should be balanced.
> 
> Can we please add more comments explaining this? Having a function
> disable preemption as a side effect, pass back a bool saying that it
> did, and then require a call to another (seemingly unrelated) function
> to re-enable preemption seems a bit of a stretch for maintainable
> code.
> 
> Especially now that the code after the spin_unlock(&hb->lock) is a
> critical section (preemption is disabled). There's nothing obvious in
> futex.c that says it is.
> 
> Just think about looking at this code in another 5 years. Are you going
> to remember all this?

There are some cleanups later in the series that should clear this up.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH -v2 1/9] rtmutex: Deboost before waking up the top waiter
  2016-09-26 15:37         ` Steven Rostedt
@ 2016-09-26 15:41           ` Peter Zijlstra
  2016-09-29 14:43             ` Thomas Gleixner
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2016-09-26 15:41 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: mingo, tglx, juri.lelli, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, Ingo Molnar

On Mon, Sep 26, 2016 at 11:37:27AM -0400, Steven Rostedt wrote:
> On Mon, 26 Sep 2016 11:35:03 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > Especially now that the code after the spin_unlock(&hb->lock) is a
> > critical section (preemption is disabled). There's nothing obvious in
> > futex.c that says it is.
> 
> Not to mention, this looks like it will break PREEMPT_RT as wake_up_q()
> calls sleepable spin locks.

What locks would that be?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH -v2 3/9] sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update
  2016-09-26 12:32 ` [PATCH -v2 3/9] sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update Peter Zijlstra
@ 2016-09-26 16:03   ` Steven Rostedt
  2016-09-29 14:48   ` Thomas Gleixner
  1 sibling, 0 replies; 31+ messages in thread
From: Steven Rostedt @ 2016-09-26 16:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, Ingo Molnar

On Mon, 26 Sep 2016 14:32:16 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> Currently dl tasks will actually return at the very beginning
> of rt_mutex_adjust_prio_chain() in !detect_deadlock cases:
> 
>     if (waiter->prio == task->prio) {
>         if (!detect_deadlock)
>             goto out_unlock_pi; // out here
>         else
>             requeue = false;
>     }
> 
> As the deadline value of blocked deadline tasks (waiters) never
> changes unless they change their sched_class (and thus their prio),
> this seems reasonable. However, it misses the chance to update the
> rt_mutex_waiter's "dl_runtime(period)_copy" when a waiter updates its
> deadline parameters (dl_runtime, dl_period) or a boosted waiter
> changes to a !deadline class.
> 
> Thus, prevent deadline tasks from taking the early exit by adding the !dl_prio() condition.
> 
> Cc: Steven Rostedt <rostedt@goodmis.org>

Acked-by: Steven Rostedt <rostedt@goodmis.org>

-- Steve

> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Juri Lelli <juri.lelli@arm.com>
> Signed-off-by: Xunlei Pang <xlpang@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/1460633827-345-7-git-send-email-xlpang@redhat.com

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH -v2 5/9] rtmutex: Clean up
  2016-09-26 12:32 ` [PATCH -v2 5/9] rtmutex: Clean up Peter Zijlstra
@ 2016-09-26 16:09   ` Steven Rostedt
  2016-09-29 14:51     ` Thomas Gleixner
  0 siblings, 1 reply; 31+ messages in thread
From: Steven Rostedt @ 2016-09-26 16:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Mon, 26 Sep 2016 14:32:18 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> Previous patches changed the meaning of the return value of
> rt_mutex_slowunlock(); update comments and code to reflect this.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/futex.c                  |   12 ++++++------
>  kernel/locking/rtmutex.c        |   20 +++++++++-----------
>  kernel/locking/rtmutex_common.h |    2 +-
>  3 files changed, 16 insertions(+), 18 deletions(-)
> 
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1261,7 +1261,7 @@ static int wake_futex_pi(u32 __user *uad
>  	struct futex_pi_state *pi_state = this->pi_state;
>  	u32 uninitialized_var(curval), newval;
>  	WAKE_Q(wake_q);
> -	bool deboost;
> +	bool postunlock;
>  	int ret = 0;
>  
>  	if (!pi_state)
> @@ -1327,17 +1327,17 @@ static int wake_futex_pi(u32 __user *uad
>  
>  	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
>  
> -	deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
> +	postunlock = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
>  
>  	/*
>  	 * First unlock HB so the waiter does not spin on it once he got woken
> -	 * up. Second wake up the waiter before the priority is adjusted. If we
> -	 * deboost first (and lose our higher priority), then the task might get
> -	 * scheduled away before the wake up can take place.
> +	 * up. Then wakeup the waiter by calling rt_mutex_postunlock(). Priority
> +	 * is already adjusted and preemption is disabled to avoid inversion.

Can we specify here that preemption is only disabled if
rt_mutex_futex_unlock() returns true, and that it will be enabled again
by rt_mutex_postunlock()?


>  	 */
>  	spin_unlock(&hb->lock);
>  
> -	rt_mutex_postunlock(&wake_q, deboost);
> +	if (postunlock)
> +		rt_mutex_postunlock(&wake_q);
>  
>  	return 0;
>  }
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -1254,7 +1254,8 @@ static inline int rt_mutex_slowtrylock(s
>  
>  /*
>   * Slow path to release a rt-mutex.
> - * Return whether the current task needs to undo a potential priority boosting.
> + *
> + * Return whether the current task needs to call rt_mutex_postunlock().
>   */
>  static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock,
>  					struct wake_q_head *wake_q)
> @@ -1327,7 +1328,7 @@ static bool __sched rt_mutex_slowunlock(
>  
>  	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
>  
> -	/* check PI boosting */
> +	/* call rt_mutex_postunlock() */

Can we rephrase this to "A call to rt_mutex_postunlock() is required".

>  	return true;
>  }
>  
> @@ -1378,15 +1379,14 @@ rt_mutex_fasttrylock(struct rt_mutex *lo
>  }
>  
>  /*
> - * Undo pi boosting (if necessary) and wake top waiter.
> + * Performs the wakeup of the top-waiter and re-enables preemption.
>   */
> -void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost)
> +void rt_mutex_postunlock(struct wake_q_head *wake_q)
>  {
>  	wake_up_q(wake_q);
>  
>  	/* Pairs with preempt_disable() in rt_mutex_slowunlock() */
> -	if (deboost)
> -		preempt_enable();
> +	preempt_enable();
>  }
>  
>  /**
> @@ -1489,9 +1489,8 @@ void __sched rt_mutex_unlock(struct rt_m
>  		rt_mutex_deadlock_account_unlock(current);
>  
>  	} else {
> -		bool deboost = rt_mutex_slowunlock(lock, &wake_q);
> -
> -		rt_mutex_postunlock(&wake_q, deboost);
> +		if (rt_mutex_slowunlock(lock, &wake_q))
> +			rt_mutex_postunlock(&wake_q);
>  	}
>  }
>  EXPORT_SYMBOL_GPL(rt_mutex_unlock);
> @@ -1500,8 +1499,7 @@ EXPORT_SYMBOL_GPL(rt_mutex_unlock);
>   * rt_mutex_futex_unlock - Futex variant of rt_mutex_unlock
>   * @lock: the rt_mutex to be unlocked
>   *
> - * Returns: true/false indicating whether priority adjustment is
> - * required or not.
> + * Returns: true/false indicating whether we should call rt_mutex_postunlock().

Can this be rephrased to: "Returns true if preemption has been
disabled and a call to rt_mutex_postunlock() is required (which will
re-enable preemption)"

-- Steve


>   */
>  bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock,
>  				   struct wake_q_head *wqh)
> --- a/kernel/locking/rtmutex_common.h
> +++ b/kernel/locking/rtmutex_common.h
> @@ -111,7 +111,7 @@ extern int rt_mutex_finish_proxy_lock(st
>  extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
>  extern bool rt_mutex_futex_unlock(struct rt_mutex *lock,
>  				  struct wake_q_head *wqh);
> -extern void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost);
> +extern void rt_mutex_postunlock(struct wake_q_head *wake_q);
>  extern void rt_mutex_adjust_prio(struct task_struct *task);
>  
>  #ifdef CONFIG_DEBUG_RT_MUTEXES
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH -v2 6/9] sched/rtmutex: Refactor rt_mutex_setprio()
  2016-09-26 12:32 ` [PATCH -v2 6/9] sched/rtmutex: Refactor rt_mutex_setprio() Peter Zijlstra
@ 2016-09-26 16:57   ` Steven Rostedt
  0 siblings, 0 replies; 31+ messages in thread
From: Steven Rostedt @ 2016-09-26 16:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Mon, 26 Sep 2016 14:32:19 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> @@ -3586,9 +3625,7 @@ void rt_mutex_setprio(struct task_struct
>  		goto out_unlock;
>  	}
>  
> -	rt_mutex_update_top_task(p);
> -
> -	trace_sched_pi_setprio(p, prio);
> +	trace_sched_pi_setprio(p, prio); /* broken */

Not a very comforting comment to see during a review ;-)

I'll hold off giving a full review of this patch while in the airport.
My mind needs a bit more "quiet" to do this one.

-- Steve

>  	oldprio = p->prio;
>  
>  	if (oldprio == prio)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH -v2 7/9] sched,tracing: Update trace_sched_pi_setprio()
  2016-09-26 12:32 ` [PATCH -v2 7/9] sched,tracing: Update trace_sched_pi_setprio() Peter Zijlstra
@ 2016-09-26 17:04   ` Steven Rostedt
  2016-09-27  7:44     ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Steven Rostedt @ 2016-09-26 17:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Mon, 26 Sep 2016 14:32:20 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> Pass the PI donor task, instead of a numerical priority.
> 
> Numerical priorities are not sufficient to describe state ever since
> SCHED_DEADLINE.
> 
> Annotate all sched tracepoints that are currently broken; fixing them
> will bork userspace. *hate*.

Perhaps display -1 for SCHED_DEADLINE tasks?

-- Steve

> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Reviewed-by: Steven Rostedt <rostedt@goodmis.org>

-- Steve

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH -v2 7/9] sched,tracing: Update trace_sched_pi_setprio()
  2016-09-26 17:04   ` Steven Rostedt
@ 2016-09-27  7:44     ` Peter Zijlstra
  0 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2016-09-27  7:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: mingo, tglx, juri.lelli, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Mon, Sep 26, 2016 at 01:04:30PM -0400, Steven Rostedt wrote:
> On Mon, 26 Sep 2016 14:32:20 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > Pass the PI donor task, instead of a numerical priority.
> > 
> > Numerical priorities are not sufficient to describe state ever since
> > SCHED_DEADLINE.
> > 
> > Annotate all sched tracepoints that are currently broken; fixing them
> > will bork userspace. *hate*.
> 
> Perhaps display -1 for SCHED_DEADLINE tasks?

I think it already does that, but that's plenty useless.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH -v2 1/9] rtmutex: Deboost before waking up the top waiter
  2016-09-26 12:32 ` [PATCH -v2 1/9] rtmutex: Deboost before waking up the top waiter Peter Zijlstra
  2016-09-26 15:15   ` Steven Rostedt
@ 2016-09-28  9:07   ` Sebastian Andrzej Siewior
  2016-09-28  9:24     ` Peter Zijlstra
  1 sibling, 1 reply; 31+ messages in thread
From: Sebastian Andrzej Siewior @ 2016-09-28  9:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, rostedt, xlpang, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, Ingo Molnar

On 2016-09-26 14:32:14 [+0200], Peter Zijlstra wrote:
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1374,9 +1374,8 @@ static int wake_futex_pi(u32 __user *uad
>  	 * scheduled away before the wake up can take place.
>  	 */
>  	spin_unlock(&hb->lock);
> -	wake_up_q(&wake_q);
> -	if (deboost)
> -		rt_mutex_adjust_prio(current);
> +
> +	rt_mutex_postunlock(&wake_q, deboost);

This breaks -RT. Before that spin_unlock() you do a preempt_disable(),
which means the lock was taken with preemption enabled but is now
released with preemption disabled. And this breaks migrate_disable() /
migrate_enable(), because we take the fast path in the in_atomic() case.
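
To make the imbalance concrete, the -RT flow looks roughly like this
(illustrative sketch only, not actual -RT source):

	spin_lock(&hb->lock);	/* -RT: sleeping lock, does migrate_disable() */
	...
	preempt_disable();	/* now done inside rt_mutex_slowunlock() */
	spin_unlock(&hb->lock);	/* -RT: migrate_enable() sees in_atomic() and
				 * takes the fast path, so the migrate_disable()
				 * count is never dropped */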

>  	return 0;
>  }
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -307,24 +307,6 @@ static void __rt_mutex_adjust_prio(struc
>  }
>  
>  /*
> - * Adjust task priority (undo boosting). Called from the exit path of
> - * rt_mutex_slowunlock() and rt_mutex_slowlock().
> - *
> - * (Note: We do this outside of the protection of lock->wait_lock to
> - * allow the lock to be taken while or before we readjust the priority
> - * of task. We do not use the spin_xx_mutex() variants here as we are
> - * outside of the debug path.)
> - */
> -void rt_mutex_adjust_prio(struct task_struct *task)
> -{
> -	unsigned long flags;
> -
> -	raw_spin_lock_irqsave(&task->pi_lock, flags);
> -	__rt_mutex_adjust_prio(task);
> -	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
> -}

I don't see this function being added back anywhere in this patch. There
is one occurrence left in kernel/locking/rtmutex_common.h.

Sebastian

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH -v2 1/9] rtmutex: Deboost before waking up the top waiter
  2016-09-28  9:07   ` Sebastian Andrzej Siewior
@ 2016-09-28  9:24     ` Peter Zijlstra
  0 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2016-09-28  9:24 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: mingo, tglx, juri.lelli, rostedt, xlpang, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, Ingo Molnar

On Wed, Sep 28, 2016 at 11:07:40AM +0200, Sebastian Andrzej Siewior wrote:
> On 2016-09-26 14:32:14 [+0200], Peter Zijlstra wrote:
> > --- a/kernel/futex.c
> > +++ b/kernel/futex.c
> > @@ -1374,9 +1374,8 @@ static int wake_futex_pi(u32 __user *uad
> >  	 * scheduled away before the wake up can take place.
> >  	 */
> >  	spin_unlock(&hb->lock);
> > -	wake_up_q(&wake_q);
> > -	if (deboost)
> > -		rt_mutex_adjust_prio(current);
> > +
> > +	rt_mutex_postunlock(&wake_q, deboost);
> 
> This breaks -RT. Before that spin_unlock() you do a preempt_disable(),
> which means the lock was taken with preemption enabled but is now
> released with preemption disabled. And this breaks migrate_disable() /
> migrate_enable(), because we take the fast path in the in_atomic() case.

Oh crud, the hb lock is not raw :/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH -v2 1/9] rtmutex: Deboost before waking up the top waiter
  2016-09-26 15:41           ` Peter Zijlstra
@ 2016-09-29 14:43             ` Thomas Gleixner
  2016-09-29 14:49               ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2016-09-29 14:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, mingo, juri.lelli, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, Ingo Molnar

On Mon, 26 Sep 2016, Peter Zijlstra wrote:

> On Mon, Sep 26, 2016 at 11:37:27AM -0400, Steven Rostedt wrote:
> > On Mon, 26 Sep 2016 11:35:03 -0400
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> > > Especially now that the code after the spin_unlock(&hb->lock) is a
> > > critical section (preemption is disabled). There's nothing obvious in
> > > futex.c that says it is.
> > 
> > Not to mention, this looks like it will break PREEMPT_RT as wake_up_q()
> > calls sleepable spin locks.
> 
> What locks would that be?

None :)

It still breaks RT in the futex case due to:

   deboost = rt_mutex_futex_unlock();

   spin_unlock(&hb->lock);
	....
	migrate_enable();
	    if (in_atomic())
		return;

So the migrate_disable() which was emitted by spin_lock(&hb->lock) will not
be cleaned up and we leak the migrate disable count. We can work around
that, but it's not pretty.

As a related note, Sebastian decoded another possible priority inversion
issue in the futex mess.

      T1 holds futex

      T2 blocks on futex and boosts T1

      T1 unlocks futex and holds hb->lock

      T1 unlocks rt mutex, so T1 has no more pi waiters

      T3 blocks on hb->lock and adds itself to the pi waiters list of T1

      T1 unlocks hb->lock and deboosts itself

      T4 preempts T1 so the wakeup of T2 gets delayed .....

We tried to fix it with a preempt_disable() and that's where we ran into
that migrate_enable() hiccup. We have a non-deboosting variant for
spin_unlock() for now, but we'll have to revisit that anyway ...

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH -v2 4/9] rtmutex: Remove rt_mutex_fastunlock()
  2016-09-26 12:32 ` [PATCH -v2 4/9] rtmutex: Remove rt_mutex_fastunlock() Peter Zijlstra
@ 2016-09-29 14:47   ` Thomas Gleixner
  0 siblings, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2016-09-29 14:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Mon, 26 Sep 2016, Peter Zijlstra wrote:

> There is but a single user and the previous patch mandates slowfn must
> be rt_mutex_slowunlock(), so fold the lot together and save a few
> lines.

We might have to bring that back for RT, but ok :)

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH -v2 3/9] sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update
  2016-09-26 12:32 ` [PATCH -v2 3/9] sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update Peter Zijlstra
  2016-09-26 16:03   ` Steven Rostedt
@ 2016-09-29 14:48   ` Thomas Gleixner
  1 sibling, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2016-09-29 14:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, Ingo Molnar

On Mon, 26 Sep 2016, Peter Zijlstra wrote:

> Currently dl tasks will actually return at the very beginning
> of rt_mutex_adjust_prio_chain() in !detect_deadlock cases:
> 
>     if (waiter->prio == task->prio) {
>         if (!detect_deadlock)
>             goto out_unlock_pi; // out here
>         else
>             requeue = false;
>     }
> 
> As the deadline value of blocked deadline tasks (waiters) never
> changes unless they change their sched_class (and thus their prio),
> this seems reasonable. However, it misses the chance to update the
> rt_mutex_waiter's "dl_runtime(period)_copy" when a waiter updates its
> deadline parameters (dl_runtime, dl_period) or a boosted waiter
> changes to a !deadline class.
> 
> Thus, prevent deadline tasks from taking the early exit by adding the !dl_prio() condition.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH -v2 2/9] sched/rtmutex/deadline: Fix a PI crash for deadline tasks
  2016-09-26 12:32 ` [PATCH -v2 2/9] sched/rtmutex/deadline: Fix a PI crash for deadline tasks Peter Zijlstra
  2016-09-26 15:20   ` Steven Rostedt
@ 2016-09-29 14:49   ` Thomas Gleixner
  1 sibling, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2016-09-29 14:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, Ingo Molnar

On Mon, 26 Sep 2016, Peter Zijlstra wrote:
> This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
> are only protected by pi_lock when operating on pi waiters, while
> rt_mutex_get_top_task() will access them with the rq lock held but
> without holding pi_lock.
> 
> In order to tackle this, we introduce a new "pi_top_task" pointer
> cached in task_struct, and add a new rt_mutex_update_top_task() to
> update its value. It can be called by rt_mutex_setprio(), which holds
> both the owner's pi_lock and the rq lock. Thus "pi_top_task" can be
> safely accessed by enqueue_task_dl() under the rq lock.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH -v2 1/9] rtmutex: Deboost before waking up the top waiter
  2016-09-29 14:43             ` Thomas Gleixner
@ 2016-09-29 14:49               ` Peter Zijlstra
  0 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2016-09-29 14:49 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Steven Rostedt, mingo, juri.lelli, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, Ingo Molnar

On Thu, Sep 29, 2016 at 10:43:54AM -0400, Thomas Gleixner wrote:
> On Mon, 26 Sep 2016, Peter Zijlstra wrote:
> 
> > On Mon, Sep 26, 2016 at 11:37:27AM -0400, Steven Rostedt wrote:
> > > On Mon, 26 Sep 2016 11:35:03 -0400
> > > Steven Rostedt <rostedt@goodmis.org> wrote:
> > > 
> > > > Especially now that the code after the spin_unlock(&hb->lock) is a
> > > > critical section (preemption is disabled). There's nothing obvious in
> > > > futex.c that says it is.
> > > 
> > > Not to mention, this looks like it will break PREEMPT_RT as wake_up_q()
> > > calls sleepable spin locks.
> > 
> > What locks would that be?
> 
> None :)
> 
> It still breaks RT in the futex case due to:
> 
>    deboost = rt_mutex_futex_unlock();
> 
>    spin_unlock(&hb->lock);
> 	....

Yeah, noticed that already. Am currently trying to untangle the pi_state
locking rules.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH -v2 5/9] rtmutex: Clean up
  2016-09-26 16:09   ` Steven Rostedt
@ 2016-09-29 14:51     ` Thomas Gleixner
  0 siblings, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2016-09-29 14:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, mingo, juri.lelli, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Mon, 26 Sep 2016, Steven Rostedt wrote:
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> Can this be rephrased to: "Returns true if preemption has been
> disabled and a call to rt_mutex_postunlock() is required (which will
> re-enable preemption)"

I agree with Steven that the comments should be rephrased.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2016-09-29 14:54 UTC | newest]

Thread overview: 31+ messages
2016-09-26 12:32 [PATCH -v2 0/9] PI and assorted failings Peter Zijlstra
2016-09-26 12:32 ` [PATCH -v2 1/9] rtmutex: Deboost before waking up the top waiter Peter Zijlstra
2016-09-26 15:15   ` Steven Rostedt
2016-09-26 15:22     ` Peter Zijlstra
2016-09-26 15:35       ` Steven Rostedt
2016-09-26 15:37         ` Steven Rostedt
2016-09-26 15:41           ` Peter Zijlstra
2016-09-29 14:43             ` Thomas Gleixner
2016-09-29 14:49               ` Peter Zijlstra
2016-09-26 15:39         ` Peter Zijlstra
2016-09-28  9:07   ` Sebastian Andrzej Siewior
2016-09-28  9:24     ` Peter Zijlstra
2016-09-26 12:32 ` [PATCH -v2 2/9] sched/rtmutex/deadline: Fix a PI crash for deadline tasks Peter Zijlstra
2016-09-26 15:20   ` Steven Rostedt
2016-09-26 15:26     ` Peter Zijlstra
2016-09-29 14:49   ` Thomas Gleixner
2016-09-26 12:32 ` [PATCH -v2 3/9] sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update Peter Zijlstra
2016-09-26 16:03   ` Steven Rostedt
2016-09-29 14:48   ` Thomas Gleixner
2016-09-26 12:32 ` [PATCH -v2 4/9] rtmutex: Remove rt_mutex_fastunlock() Peter Zijlstra
2016-09-29 14:47   ` Thomas Gleixner
2016-09-26 12:32 ` [PATCH -v2 5/9] rtmutex: Clean up Peter Zijlstra
2016-09-26 16:09   ` Steven Rostedt
2016-09-29 14:51     ` Thomas Gleixner
2016-09-26 12:32 ` [PATCH -v2 6/9] sched/rtmutex: Refactor rt_mutex_setprio() Peter Zijlstra
2016-09-26 16:57   ` Steven Rostedt
2016-09-26 12:32 ` [PATCH -v2 7/9] sched,tracing: Update trace_sched_pi_setprio() Peter Zijlstra
2016-09-26 17:04   ` Steven Rostedt
2016-09-27  7:44     ` Peter Zijlstra
2016-09-26 12:32 ` [PATCH -v2 8/9] rtmutex: Fix PI chain order integrity Peter Zijlstra
2016-09-26 12:32 ` [PATCH -v2 9/9] rtmutex: Fix more prio comparisons Peter Zijlstra
