* [PATCH v3 0/6] sched/deadline/rtmutex: Fix two deadline PI issues
@ 2016-04-14 11:37 Xunlei Pang
  2016-04-14 11:37 ` [PATCH v3 1/6] rtmutex: Deboost before waking up the top waiter Xunlei Pang
                   ` (5 more replies)
  0 siblings, 6 replies; 46+ messages in thread
From: Xunlei Pang @ 2016-04-14 11:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Thomas Gleixner, Juri Lelli, Ingo Molnar,
	Steven Rostedt, Xunlei Pang

PATCH 1~2 mainly fix the deadline PI crash that happens when doing
enqueue_task_dl()->rt_mutex_get_top_task() without holding the rq
lock for the top waiter update.

PATCH 3~6 mainly fix the deadline PI issue that happens when
enqueue_task_dl(), after getting @pi_task, accesses pi_task's data
(dl.dl_runtime and dl.dl_period) without holding any of pi_task's
locks (pi lock or rq lock). PATCH 3~4 are separated out to make
PATCH 5 smaller and easier to review.

The two issues can be fixed using the same logic, so the fixes are
bound together as one series.

Xunlei Pang (6):
  rtmutex: Deboost before waking up the top waiter
  sched/rtmutex/deadline: Fix a PI crash for deadline tasks
  rtmutex: Move "rt_mutex_waiter" definition to
    "include/linux/rtmutex.h"
  sched: Move dl_policy() to "include/linux/sched.h"
  sched/deadline/rtmutex: Fix unprotected PI access in enqueue_task_dl()
  sched/deadline/rtmutex: Don't miss the dl_runtime/dl_period update

 include/linux/init_task.h       |  3 +-
 include/linux/rtmutex.h         | 29 +++++++++++++-
 include/linux/sched.h           | 10 ++++-
 include/linux/sched/deadline.h  | 22 +++++++++++
 kernel/fork.c                   |  1 +
 kernel/futex.c                  |  5 +--
 kernel/locking/rtmutex.c        | 84 ++++++++++++++++++++++++++++-------------
 kernel/locking/rtmutex_common.h | 22 +----------
 kernel/sched/core.c             |  2 +
 kernel/sched/deadline.c         | 10 +++--
 kernel/sched/sched.h            |  4 --
 11 files changed, 132 insertions(+), 60 deletions(-)

-- 
1.8.3.1


* [PATCH v3 1/6] rtmutex: Deboost before waking up the top waiter
  2016-04-14 11:37 [PATCH v3 0/6] sched/deadline/rtmutex: Fix two deadline PI issues Xunlei Pang
@ 2016-04-14 11:37 ` Xunlei Pang
  2016-04-18  8:23   ` Thomas Gleixner
  2016-04-14 11:37 ` [PATCH v3 2/6] sched/rtmutex/deadline: Fix a PI crash for deadline tasks Xunlei Pang
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 46+ messages in thread
From: Xunlei Pang @ 2016-04-14 11:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Thomas Gleixner, Juri Lelli, Ingo Molnar,
	Steven Rostedt, Xunlei Pang

We should deboost before waking the high-prio task, such that we
don't run two tasks with the 'same' priority.

This patch fixes the ordering and introduces rt_mutex_postunlock()
as a small refactoring of the common unlock path.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
---
 kernel/futex.c                  |  5 ++---
 kernel/locking/rtmutex.c        | 28 ++++++++++++++++++++++++----
 kernel/locking/rtmutex_common.h |  1 +
 3 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 4e1a53e..4ae3523 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1524,9 +1524,8 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
 	 * scheduled away before the wake up can take place.
 	 */
 	spin_unlock(&hb->lock);
-	wake_up_q(&wake_q);
-	if (deboost)
-		rt_mutex_adjust_prio(current);
+
+	rt_mutex_postunlock(&wake_q, deboost);
 
 	return 0;
 }
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 3e74660..42bc59b 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1390,12 +1390,32 @@ rt_mutex_fastunlock(struct rt_mutex *lock,
 	} else {
 		bool deboost = slowfn(lock, &wake_q);
 
-		wake_up_q(&wake_q);
+		rt_mutex_postunlock(&wake_q, deboost);
+	}
+}
+
 
-		/* Undo pi boosting if necessary: */
-		if (deboost)
-			rt_mutex_adjust_prio(current);
+/*
+ * Undo pi boosting (if necessary) and wake top waiter.
+ */
+void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost)
+{
+	/*
+	 * We should deboost before waking the high-prio task such that
+	 * we don't run two tasks with the 'same' priority. This however
+	 * can lead to prio-inversion if we would get preempted after
+	 * the deboost but before waking our high-prio task, hence the
+	 * preempt_disable.
+	 */
+	if (deboost) {
+		preempt_disable();
+		rt_mutex_adjust_prio(current);
 	}
+
+	wake_up_q(wake_q);
+
+	if (deboost)
+		preempt_enable();
 }
 
 /**
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 4f5f83c..93b0924 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -111,6 +111,7 @@ extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
 extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
 extern bool rt_mutex_futex_unlock(struct rt_mutex *lock,
 				  struct wake_q_head *wqh);
+extern void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost);
 extern void rt_mutex_adjust_prio(struct task_struct *task);
 
 #ifdef CONFIG_DEBUG_RT_MUTEXES
-- 
1.8.3.1


* [PATCH v3 2/6] sched/rtmutex/deadline: Fix a PI crash for deadline tasks
  2016-04-14 11:37 [PATCH v3 0/6] sched/deadline/rtmutex: Fix two deadline PI issues Xunlei Pang
  2016-04-14 11:37 ` [PATCH v3 1/6] rtmutex: Deboost before waking up the top waiter Xunlei Pang
@ 2016-04-14 11:37 ` Xunlei Pang
  2016-04-20 13:19   ` Peter Zijlstra
  2016-04-14 11:37 ` [PATCH v3 3/6] rtmutex: Move "rt_mutex_waiter" definition to "include/linux/rtmutex.h" Xunlei Pang
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 46+ messages in thread
From: Xunlei Pang @ 2016-04-14 11:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Thomas Gleixner, Juri Lelli, Ingo Molnar,
	Steven Rostedt, Xunlei Pang

A crash happened while I was playing with deadline PI rtmutex.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: [<ffffffff810eeb8f>] rt_mutex_get_top_task+0x1f/0x30
    PGD 232a75067 PUD 230947067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 1 PID: 10994 Comm: a.out Not tainted

    Call Trace:
    [<ffffffff810cf8aa>] ? enqueue_task_dl+0x2a/0x320
    [<ffffffff810b658c>] enqueue_task+0x2c/0x80
    [<ffffffff810ba763>] activate_task+0x23/0x30
    [<ffffffff810d0ab5>] pull_dl_task+0x1d5/0x260
    [<ffffffff810d0be6>] pre_schedule_dl+0x16/0x20
    [<ffffffff8164e783>] __schedule+0xd3/0x900
    [<ffffffff8164efd9>] schedule+0x29/0x70
    [<ffffffff8165035b>] __rt_mutex_slowlock+0x4b/0xc0
    [<ffffffff81650501>] rt_mutex_slowlock+0xd1/0x190
    [<ffffffff810eeb33>] rt_mutex_timed_lock+0x53/0x60
    [<ffffffff810ecbfc>] futex_lock_pi.isra.18+0x28c/0x390
    [<ffffffff810cfa15>] ? enqueue_task_dl+0x195/0x320
    [<ffffffff810d0bac>] ? prio_changed_dl+0x6c/0x90
    [<ffffffff810ed8b0>] do_futex+0x190/0x5b0
    [<ffffffff810edd50>] SyS_futex+0x80/0x180
    [<ffffffff8165a089>] system_call_fastpath+0x16/0x1b
    RIP  [<ffffffff810eeb8f>] rt_mutex_get_top_task+0x1f/0x30

This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi() are
only protected by pi_lock when operating on the pi waiters, while
rt_mutex_get_top_task() (replaced by rt_mutex_get_top_waiter() later
in this series; same issue) accesses them with the rq lock held but
without holding pi_lock.

To tackle this, we introduce a new "pi_waiters_leftmost_copy" field
in task_struct, and add a new function, rt_mutex_update_copy(), to do
the copy. It can be called from rt_mutex_setprio(), which holds the
owner's pi_lock, the rq lock, and the rt_mutex lock. The one exception
is that rt_mutex_setprio() can be called via rt_mutex_adjust_prio()
after mark_wakeup_next_waiter(), with the rtmutex no longer locked,
so we would need mark_wakeup_next_waiter() to take the rq lock first
and then call rt_mutex_update_copy().

We avoid adding that new logic by calling __rt_mutex_adjust_prio()
directly in mark_wakeup_next_waiter() and removing the old
rt_mutex_adjust_prio() call outside the lock. Since this moves the
deboost point earlier, we also move preempt_disable() to before
unlocking the rtmutex, so that the current process cannot be preempted
(due to the earlier deboost) before it finishes wake_up_q().
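
As a rough user-space analogue of this copy-under-both-locks scheme
(purely illustrative; lock_a stands in for pi_lock, lock_b for the rq
lock, and all names below are invented):

    /* Illustrative user-space analogue; names are invented. */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER; /* "pi_lock" */
    static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER; /* "rq lock" */
    static int leftmost;      /* written under lock_a only              */
    static int leftmost_copy; /* written under lock_a AND lock_b        */

    static void update_leftmost(int val)
    {
        pthread_mutex_lock(&lock_a);
        leftmost = val;               /* frequent, cheap update          */
        pthread_mutex_lock(&lock_b);
        leftmost_copy = leftmost;     /* publish snapshot for readers    */
        pthread_mutex_unlock(&lock_b);
        pthread_mutex_unlock(&lock_a);
    }

    static int read_leftmost(void)    /* a lock_b-only reader            */
    {
        int val;

        pthread_mutex_lock(&lock_b);
        val = leftmost_copy;          /* never sees a half-updated value */
        pthread_mutex_unlock(&lock_b);
        return val;
    }

    int main(void)
    {
        update_leftmost(42);
        printf("%d\n", read_leftmost());
        return 0;
    }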

Originally-From: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
---
 include/linux/init_task.h      |  3 ++-
 include/linux/sched.h          |  3 +++
 include/linux/sched/deadline.h |  2 ++
 kernel/fork.c                  |  1 +
 kernel/locking/rtmutex.c       | 61 +++++++++++++++++-------------------------
 kernel/sched/core.c            |  2 ++
 6 files changed, 35 insertions(+), 37 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index f2cb8d4..7be5a83 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -162,7 +162,8 @@ extern struct task_group root_task_group;
 #ifdef CONFIG_RT_MUTEXES
 # define INIT_RT_MUTEXES(tsk)						\
 	.pi_waiters = RB_ROOT,						\
-	.pi_waiters_leftmost = NULL,
+	.pi_waiters_leftmost = NULL,					\
+	.pi_waiters_leftmost_copy = NULL,
 #else
 # define INIT_RT_MUTEXES(tsk)
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 45e848c..332a6b5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1621,7 +1621,10 @@ struct task_struct {
 #ifdef CONFIG_RT_MUTEXES
 	/* PI waiters blocked on a rt_mutex held by this task */
 	struct rb_root pi_waiters;
+	/* Updated under pi_lock and rtmutex lock */
 	struct rb_node *pi_waiters_leftmost;
+	/* Updated under pi_lock, rq_lock, and rtmutex lock */
+	struct rb_node *pi_waiters_leftmost_copy;
 	/* Deadlock detection and priority inheritance handling */
 	struct rt_mutex_waiter *pi_blocked_on;
 #endif
diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
index 9089a2a..e8304d4 100644
--- a/include/linux/sched/deadline.h
+++ b/include/linux/sched/deadline.h
@@ -26,4 +26,6 @@ static inline bool dl_time_before(u64 a, u64 b)
 	return (s64)(a - b) < 0;
 }
 
+extern void rt_mutex_update_copy(struct task_struct *p);
+
 #endif /* _SCHED_DEADLINE_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index a453546..4b847e4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1225,6 +1225,7 @@ static void rt_mutex_init_task(struct task_struct *p)
 #ifdef CONFIG_RT_MUTEXES
 	p->pi_waiters = RB_ROOT;
 	p->pi_waiters_leftmost = NULL;
+	p->pi_waiters_leftmost_copy = NULL;
 	p->pi_blocked_on = NULL;
 #endif
 }
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 42bc59b..00c6560 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -273,10 +273,11 @@ int rt_mutex_getprio(struct task_struct *task)
 
 struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
 {
-	if (likely(!task_has_pi_waiters(task)))
+	if (!task->pi_waiters_leftmost_copy)
 		return NULL;
 
-	return task_top_pi_waiter(task)->task;
+	return rb_entry(task->pi_waiters_leftmost_copy,
+				struct rt_mutex_waiter, pi_tree_entry)->task;
 }
 
 /*
@@ -285,12 +286,20 @@ struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
  */
 int rt_mutex_get_effective_prio(struct task_struct *task, int newprio)
 {
-	if (!task_has_pi_waiters(task))
+	struct task_struct *top_task = rt_mutex_get_top_task(task);
+
+	if (!top_task)
 		return newprio;
 
-	if (task_top_pi_waiter(task)->task->prio <= newprio)
-		return task_top_pi_waiter(task)->task->prio;
-	return newprio;
+	return min(top_task->prio, newprio);
+}
+
+/*
+ * Must hold rq lock, p's pi_lock, and rt_mutex lock.
+ */
+void rt_mutex_update_copy(struct task_struct *p)
+{
+	p->pi_waiters_leftmost_copy = p->pi_waiters_leftmost;
 }
 
 /*
@@ -307,24 +316,6 @@ static void __rt_mutex_adjust_prio(struct task_struct *task)
 }
 
 /*
- * Adjust task priority (undo boosting). Called from the exit path of
- * rt_mutex_slowunlock() and rt_mutex_slowlock().
- *
- * (Note: We do this outside of the protection of lock->wait_lock to
- * allow the lock to be taken while or before we readjust the priority
- * of task. We do not use the spin_xx_mutex() variants here as we are
- * outside of the debug path.)
- */
-void rt_mutex_adjust_prio(struct task_struct *task)
-{
-	unsigned long flags;
-
-	raw_spin_lock_irqsave(&task->pi_lock, flags);
-	__rt_mutex_adjust_prio(task);
-	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
-}
-
-/*
  * Deadlock detection is conditional:
  *
  * If CONFIG_DEBUG_RT_MUTEXES=n, deadlock detection is only conducted
@@ -987,6 +978,7 @@ static void mark_wakeup_next_waiter(struct wake_q_head *wake_q,
 	 * lock->wait_lock.
 	 */
 	rt_mutex_dequeue_pi(current, waiter);
+	__rt_mutex_adjust_prio(current);
 
 	/*
 	 * As we are waking up the top waiter, and the waiter stays
@@ -1325,6 +1317,15 @@ static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock,
 	 */
 	mark_wakeup_next_waiter(wake_q, lock);
 
+	/*
+	 * We should deboost before waking the high-prio task such that
+	 * we don't run two tasks with the 'same' priority. This however
+	 * can lead to prio-inversion if we would get preempted after
+	 * the deboost but before waking our high-prio task, hence the
+	 * preempt_disable before unlock.
+	 */
+	preempt_disable();
+
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 	/* check PI boosting */
@@ -1400,18 +1401,6 @@ rt_mutex_fastunlock(struct rt_mutex *lock,
  */
 void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost)
 {
-	/*
-	 * We should deboost before waking the high-prio task such that
-	 * we don't run two tasks with the 'same' priority. This however
-	 * can lead to prio-inversion if we would get preempted after
-	 * the deboost but before waking our high-prio task, hence the
-	 * preempt_disable.
-	 */
-	if (deboost) {
-		preempt_disable();
-		rt_mutex_adjust_prio(current);
-	}
-
 	wake_up_q(wake_q);
 
 	if (deboost)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1159423..4a2c79d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3480,6 +3480,8 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 		goto out_unlock;
 	}
 
+	rt_mutex_update_copy(p);
+
 	trace_sched_pi_setprio(p, prio);
 	oldprio = p->prio;
 
-- 
1.8.3.1


* [PATCH v3 3/6] rtmutex: Move "rt_mutex_waiter" definition to "include/linux/rtmutex.h"
  2016-04-14 11:37 [PATCH v3 0/6] sched/deadline/rtmutex: Fix two deadline PI issues Xunlei Pang
  2016-04-14 11:37 ` [PATCH v3 1/6] rtmutex: Deboost before waking up the top waiter Xunlei Pang
  2016-04-14 11:37 ` [PATCH v3 2/6] sched/rtmutex/deadline: Fix a PI crash for deadline tasks Xunlei Pang
@ 2016-04-14 11:37 ` Xunlei Pang
  2016-04-14 11:37 ` [PATCH v3 4/6] sched: Move dl_policy() to "include/linux/sched.h" Xunlei Pang
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 46+ messages in thread
From: Xunlei Pang @ 2016-04-14 11:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Thomas Gleixner, Juri Lelli, Ingo Molnar,
	Steven Rostedt, Xunlei Pang

Deadline code will need it, so make it visible to deadline.

Signed-off-by: Xunlei Pang <xlpang@redhat.com>
---
 include/linux/rtmutex.h         | 22 +++++++++++++++++++++-
 kernel/locking/rtmutex_common.h | 21 ---------------------
 2 files changed, 21 insertions(+), 22 deletions(-)

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 1abba5c..f9bf40a 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -39,7 +39,27 @@ struct rt_mutex {
 #endif
 };
 
-struct rt_mutex_waiter;
+/*
+ * This is the control structure for tasks blocked on a rt_mutex,
+ * which is allocated on the kernel stack on of the blocked task.
+ *
+ * @tree_entry:		pi node to enqueue into the mutex waiters tree
+ * @pi_tree_entry:	pi node to enqueue into the mutex owner waiters tree
+ * @task:		task reference to the blocked task
+ */
+struct rt_mutex_waiter {
+	struct rb_node          tree_entry;
+	struct rb_node          pi_tree_entry;
+	struct task_struct	*task;
+	struct rt_mutex		*lock;
+#ifdef CONFIG_DEBUG_RT_MUTEXES
+	unsigned long		ip;
+	struct pid		*deadlock_task_pid;
+	struct rt_mutex		*deadlock_lock;
+#endif
+	int prio;
+};
+
 struct hrtimer_sleeper;
 
 #ifdef CONFIG_DEBUG_RT_MUTEXES
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 93b0924..c94adcd 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -15,27 +15,6 @@
 #include <linux/rtmutex.h>
 
 /*
- * This is the control structure for tasks blocked on a rt_mutex,
- * which is allocated on the kernel stack on of the blocked task.
- *
- * @tree_entry:		pi node to enqueue into the mutex waiters tree
- * @pi_tree_entry:	pi node to enqueue into the mutex owner waiters tree
- * @task:		task reference to the blocked task
- */
-struct rt_mutex_waiter {
-	struct rb_node          tree_entry;
-	struct rb_node          pi_tree_entry;
-	struct task_struct	*task;
-	struct rt_mutex		*lock;
-#ifdef CONFIG_DEBUG_RT_MUTEXES
-	unsigned long		ip;
-	struct pid		*deadlock_task_pid;
-	struct rt_mutex		*deadlock_lock;
-#endif
-	int prio;
-};
-
-/*
  * Various helpers to access the waiters-tree:
  */
 static inline int rt_mutex_has_waiters(struct rt_mutex *lock)
-- 
1.8.3.1


* [PATCH v3 4/6] sched: Move dl_policy() to "include/linux/sched.h"
  2016-04-14 11:37 [PATCH v3 0/6] sched/deadline/rtmutex: Fix two deadline PI issues Xunlei Pang
                   ` (2 preceding siblings ...)
  2016-04-14 11:37 ` [PATCH v3 3/6] rtmutex: Move "rt_mutex_waiter" definition to "include/linux/rtmutex.h" Xunlei Pang
@ 2016-04-14 11:37 ` Xunlei Pang
  2016-04-14 11:37 ` [PATCH v3 5/6] sched/deadline/rtmutex: Fix unprotected PI access in enqueue_task_dl() Xunlei Pang
  2016-04-14 11:37 ` [PATCH v3 6/6] sched/deadline/rtmutex: Don't miss the dl_runtime/dl_period update Xunlei Pang
  5 siblings, 0 replies; 46+ messages in thread
From: Xunlei Pang @ 2016-04-14 11:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Thomas Gleixner, Juri Lelli, Ingo Molnar,
	Steven Rostedt, Xunlei Pang

Rtmutex code will need dl_policy(), so make it visible to rtmutex.

Signed-off-by: Xunlei Pang <xlpang@redhat.com>
---
 include/linux/sched.h | 5 +++++
 kernel/sched/sched.h  | 4 ----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 332a6b5..8ad3522 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1309,6 +1309,11 @@ struct sched_rt_entity {
 #endif
 };
 
+static inline int dl_policy(int policy)
+{
+	return policy == SCHED_DEADLINE;
+}
+
 struct sched_dl_entity {
 	struct rb_node	rb_node;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a7cbad7..0e6ea02 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -99,10 +99,6 @@ static inline int rt_policy(int policy)
 	return policy == SCHED_FIFO || policy == SCHED_RR;
 }
 
-static inline int dl_policy(int policy)
-{
-	return policy == SCHED_DEADLINE;
-}
 static inline bool valid_policy(int policy)
 {
 	return idle_policy(policy) || fair_policy(policy) ||
-- 
1.8.3.1


* [PATCH v3 5/6] sched/deadline/rtmutex: Fix unprotected PI access in enqueue_task_dl()
  2016-04-14 11:37 [PATCH v3 0/6] sched/deadline/rtmutex: Fix two deadline PI issues Xunlei Pang
                   ` (3 preceding siblings ...)
  2016-04-14 11:37 ` [PATCH v3 4/6] sched: Move dl_policy() to "include/linux/sched.h" Xunlei Pang
@ 2016-04-14 11:37 ` Xunlei Pang
  2016-04-14 15:31   ` Peter Zijlstra
  2016-04-14 11:37 ` [PATCH v3 6/6] sched/deadline/rtmutex: Don't miss the dl_runtime/dl_period update Xunlei Pang
  5 siblings, 1 reply; 46+ messages in thread
From: Xunlei Pang @ 2016-04-14 11:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Thomas Gleixner, Juri Lelli, Ingo Molnar,
	Steven Rostedt, Xunlei Pang

We access @pi_task's data without any lock in enqueue_task_dl(); the
"dl_prio(pi_task->normal_prio)" check alone is not enough.

"dl_period" and "dl_runtime" of "pi_task->dl" can change. For example,
if it changes to !deadline class, dl_runtime will be cleared to zero,
then we will hit an endless loop in replenish_dl_entity() below:
    while (dl_se->runtime <= 0) {
        dl_se->deadline += pi_se->dl_period;
        dl_se->runtime += pi_se->dl_runtime;
    }

or hit "BUG_ON(pi_se->dl_runtime <= 0)" earlier.

That's because, without holding any of the top waiter's locks, there
is no guarantee these parameters stay stable.

In order to solve it, we add some members to "rt_mutex_waiter" and use
this structure, instead of task_struct, as the one to be accessed by
enqueue_task_dl(). Specifically, we add:

struct rt_mutex_waiter {
     ... ...

    int prio;
+   /* Updated under waiter's pi_lock and rt_mutex lock */
+   u64 dl_runtime, dl_period;
+   /*
+    * Copied directly from above.
+    * Updated under owner's pi_lock, rq lock, and rt_mutex lock.
+    */
+   u64 dl_runtime_copy, dl_period_copy;
};

We must update "dl_runtime_copy" and "dl_period_copy" under the rt_mutex
lock, because they are copied from rt_mutex_waiter's "dl_runtime" and
"dl_period", which are protected by that same rt_mutex lock. We update
the copies in rt_mutex_update_copy(), introduced previously, as it is
called by rt_mutex_setprio() with the owner's pi_lock, rq lock, and
rt_mutex lock held.

Since "dl_runtime_copy" and "dl_period_copy" are updated under the
owner's pi_lock, rq lock, and rt_mutex lock, and the waiter is reliably
blocked on the rtmutex, they can be safely accessed by
enqueue_task_dl(), which holds the rq lock.

Note that, since we now hand a rt_mutex_waiter to enqueue_task_dl(), we
add a new "struct sched_dl_entity_fake" that masquerades as a real
sched_dl_entity. This is fine as long as we keep its "dl_runtime" and
"dl_period" in the same order (at the same offsets) as in
sched_dl_entity. We also adjust the location of "dl_period" in
sched_dl_entity to make sched_dl_entity_fake smaller.

Signed-off-by: Xunlei Pang <xlpang@redhat.com>
---
 include/linux/rtmutex.h        |  7 +++++++
 include/linux/sched.h          |  2 +-
 include/linux/sched/deadline.h | 20 ++++++++++++++++++++
 kernel/locking/rtmutex.c       | 23 +++++++++++++++++++++++
 kernel/sched/deadline.c        | 10 +++++++---
 5 files changed, 58 insertions(+), 4 deletions(-)

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index f9bf40a..56e2aaf 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -58,6 +58,13 @@ struct rt_mutex_waiter {
 	struct rt_mutex		*deadlock_lock;
 #endif
 	int prio;
+	/* Updated under waiter's pi_lock and rt_mutex lock */
+	u64 dl_runtime, dl_period;
+	/*
+	 * Copied directly from above.
+	 * Updated under owner's pi_lock, rq lock, and rt_mutex lock.
+	 */
+	u64 dl_runtime_copy, dl_period_copy;
 };
 
 struct hrtimer_sleeper;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8ad3522..960465c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1323,8 +1323,8 @@ struct sched_dl_entity {
 	 * the next sched_setattr().
 	 */
 	u64 dl_runtime;		/* maximum runtime for each instance	*/
-	u64 dl_deadline;	/* relative deadline of each instance	*/
 	u64 dl_period;		/* separation of two instances (period) */
+	u64 dl_deadline;	/* relative deadline of each instance	*/
 	u64 dl_bw;		/* dl_runtime / dl_deadline		*/
 
 	/*
diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
index e8304d4..ca5bae5 100644
--- a/include/linux/sched/deadline.h
+++ b/include/linux/sched/deadline.h
@@ -2,6 +2,16 @@
 #define _SCHED_DEADLINE_H
 
 /*
+ * Used by enqueue_task_dl() for PI cases to disguise sched_dl_entity,
+ * thus must be the same order as the counterparts in sched_dl_entity.
+ */
+struct sched_dl_entity_fake {
+	struct rb_node  rb_node;
+	u64 dl_runtime;
+	u64 dl_period;
+};
+
+/*
  * SCHED_DEADLINE tasks has negative priorities, reflecting
  * the fact that any of them has higher prio than RT and
  * NORMAL/BATCH tasks.
@@ -28,4 +38,14 @@ static inline bool dl_time_before(u64 a, u64 b)
 
 extern void rt_mutex_update_copy(struct task_struct *p);
 
+#ifdef CONFIG_RT_MUTEXES
+extern struct rt_mutex_waiter *rt_mutex_get_top_waiter(struct task_struct *p);
+#else
+static inline
+struct rt_mutex_waiter *rt_mutex_get_top_waiter(struct task_struct *p)
+{
+	return NULL;
+}
+#endif
+
 #endif /* _SCHED_DEADLINE_H */
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 00c6560..4d14eee 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -280,6 +280,15 @@ struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
 				struct rt_mutex_waiter, pi_tree_entry)->task;
 }
 
+struct rt_mutex_waiter *rt_mutex_get_top_waiter(struct task_struct *task)
+{
+	if (!task->pi_waiters_leftmost_copy)
+		return NULL;
+
+	return rb_entry(task->pi_waiters_leftmost_copy,
+				struct rt_mutex_waiter, pi_tree_entry);
+}
+
 /*
  * Called by sched_setscheduler() to get the priority which will be
  * effective after the change.
@@ -299,7 +308,17 @@ int rt_mutex_get_effective_prio(struct task_struct *task, int newprio)
  */
 void rt_mutex_update_copy(struct task_struct *p)
 {
+	struct rt_mutex_waiter *top_waiter;
+
+	/* We must always update it, even if NULL */
 	p->pi_waiters_leftmost_copy = p->pi_waiters_leftmost;
+
+	if (!task_has_pi_waiters(p))
+		return;
+
+	top_waiter = task_top_pi_waiter(p);
+	top_waiter->dl_runtime_copy = top_waiter->dl_runtime;
+	top_waiter->dl_period_copy = top_waiter->dl_period;
 }
 
 /*
@@ -632,6 +651,8 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 	/* [7] Requeue the waiter in the lock waiter tree. */
 	rt_mutex_dequeue(lock, waiter);
 	waiter->prio = task->prio;
+	waiter->dl_runtime = dl_policy(task->policy) ? task->dl.dl_runtime : 0;
+	waiter->dl_period = dl_policy(task->policy) ? task->dl.dl_period : 0;
 	rt_mutex_enqueue(lock, waiter);
 
 	/* [8] Release the task */
@@ -902,6 +923,8 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 	waiter->task = task;
 	waiter->lock = lock;
 	waiter->prio = task->prio;
+	waiter->dl_runtime = dl_policy(task->policy) ? task->dl.dl_runtime : 0;
+	waiter->dl_period = dl_policy(task->policy) ? task->dl.dl_period : 0;
 
 	/* Get the top priority waiter on the lock */
 	if (rt_mutex_has_waiters(lock))
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index affd97e..224aa64 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -932,8 +932,9 @@ static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
 
 static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
-	struct task_struct *pi_task = rt_mutex_get_top_task(p);
+	struct rt_mutex_waiter *top_waiter = rt_mutex_get_top_waiter(p);
 	struct sched_dl_entity *pi_se = &p->dl;
+	struct sched_dl_entity_fake pi_se_fake;
 
 	/*
 	 * Use the scheduling parameters of the top pi-waiter
@@ -941,8 +942,11 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	 * smaller than our one... OTW we keep our runtime and
 	 * deadline.
 	 */
-	if (pi_task && p->dl.dl_boosted && dl_prio(pi_task->normal_prio)) {
-		pi_se = &pi_task->dl;
+	if (top_waiter && p->dl.dl_boosted && top_waiter->dl_runtime_copy) {
+		BUG_ON(top_waiter->dl_period_copy == 0);
+		pi_se_fake.dl_runtime = top_waiter->dl_runtime_copy;
+		pi_se_fake.dl_period = top_waiter->dl_period_copy;
+		pi_se = (struct sched_dl_entity *)&pi_se_fake;
 	} else if (!dl_prio(p->normal_prio)) {
 		/*
 		 * Special case in which we have a !SCHED_DEADLINE task
-- 
1.8.3.1


* [PATCH v3 6/6] sched/deadline/rtmutex: Don't miss the dl_runtime/dl_period update
  2016-04-14 11:37 [PATCH v3 0/6] sched/deadline/rtmutex: Fix two deadline PI issues Xunlei Pang
                   ` (4 preceding siblings ...)
  2016-04-14 11:37 ` [PATCH v3 5/6] sched/deadline/rtmutex: Fix unprotected PI access in enqueue_task_dl() Xunlei Pang
@ 2016-04-14 11:37 ` Xunlei Pang
  5 siblings, 0 replies; 46+ messages in thread
From: Xunlei Pang @ 2016-04-14 11:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Thomas Gleixner, Juri Lelli, Ingo Molnar,
	Steven Rostedt, Xunlei Pang

Currently dl tasks will actually return at the very beginning
of rt_mutex_adjust_prio_chain() in !detect_deadlock cases:

    if (waiter->prio == task->prio) {
        if (!detect_deadlock)
            goto out_unlock_pi; // out here
        else
            requeue = false;
    }

Since the deadline value of a blocked deadline task (waiter) never
changes as long as its sched_class (and thus its prio) stays the same,
this seems reasonable. However, it misses the chance of updating
rt_mutex_waiter's "dl_runtime(period)_copy" when a waiter updates its
deadline parameters (dl_runtime, dl_period), or when a boosted waiter
changes to a !deadline class.

Thus, keep deadline tasks from bailing out early by adding the
!dl_task() condition.

Signed-off-by: Xunlei Pang <xlpang@redhat.com>
---
 kernel/locking/rtmutex.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 4d14eee..f3b7d29 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -554,7 +554,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 	 * enabled we continue, but stop the requeueing in the chain
 	 * walk.
 	 */
-	if (waiter->prio == task->prio) {
+	if (waiter->prio == task->prio && !dl_task(task)) {
 		if (!detect_deadlock)
 			goto out_unlock_pi;
 		else
-- 
1.8.3.1


* Re: [PATCH v3 5/6] sched/deadline/rtmutex: Fix unprotected PI access in enqueue_task_dl()
  2016-04-14 11:37 ` [PATCH v3 5/6] sched/deadline/rtmutex: Fix unprotected PI access in enqueue_task_dl() Xunlei Pang
@ 2016-04-14 15:31   ` Peter Zijlstra
  2016-04-15  1:58     ` Xunlei Pang
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2016-04-14 15:31 UTC (permalink / raw)
  To: Xunlei Pang
  Cc: linux-kernel, Thomas Gleixner, Juri Lelli, Ingo Molnar, Steven Rostedt

On Thu, Apr 14, 2016 at 07:37:06PM +0800, Xunlei Pang wrote:
> We access @pi_task's data without any lock in enqueue_task_dl(); the
> "dl_prio(pi_task->normal_prio)" check alone is not enough.

The proper fix is to ensure that pi_task is guaranteed to be blocked.


* Re: [PATCH v3 5/6] sched/deadline/rtmutex: Fix unprotected PI access in enqueue_task_dl()
  2016-04-14 15:31   ` Peter Zijlstra
@ 2016-04-15  1:58     ` Xunlei Pang
  2016-04-15  2:19       ` Xunlei Pang
  0 siblings, 1 reply; 46+ messages in thread
From: Xunlei Pang @ 2016-04-15  1:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Juri Lelli, Ingo Molnar, Steven Rostedt

On 2016/04/14 at 23:31, Peter Zijlstra wrote:
> On Thu, Apr 14, 2016 at 07:37:06PM +0800, Xunlei Pang wrote:
>> We access @pi_task's data without any lock in enqueue_task_dl(); the
>> "dl_prio(pi_task->normal_prio)" check alone is not enough.
> The proper fix is to ensure that pi_task is guaranteed to be blocked.

Even if pi_task was blocked, its parameters are still allowed to be changed,
so we have to do that. Did I miss something?

Regards,
Xunlei


* Re: [PATCH v3 5/6] sched/deadline/rtmutex: Fix unprotected PI access in enqueue_task_dl()
  2016-04-15  1:58     ` Xunlei Pang
@ 2016-04-15  2:19       ` Xunlei Pang
  2016-04-20 12:25         ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Xunlei Pang @ 2016-04-15  2:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Juri Lelli, Ingo Molnar, Steven Rostedt

On 2016/04/15 at 09:58, Xunlei Pang wrote:
> On 2016/04/14 at 23:31, Peter Zijlstra wrote:
>> On Thu, Apr 14, 2016 at 07:37:06PM +0800, Xunlei Pang wrote:
>>> We access @pi_task's data without any lock in enqueue_task_dl(); the
>>> "dl_prio(pi_task->normal_prio)" check alone is not enough.
>> The proper fix is to ensure that pi_task is guaranteed to be blocked.
> Even if pi_task was blocked, its parameters are still allowed to be changed,
> so we have to do that. Did I miss something?
>
> Regards,
> Xunlei

Fortunately, I just reproduced it in an overnight test, so it really does happen in practice, as I thought.

[50697.042391] kernel BUG at kernel/sched/deadline.c:398!
[50697.048212] invalid opcode: 0000 [#1] SMP
[50697.137676] CPU: 1 PID: 10676 Comm: bugon Tainted: G        W       4.6.0-rc3+ #19
[50697.146250] Hardware name: Intel Corporation Broadwell Client platform/SawTooth Peak, BIOS BDW-E1R1.86C.0127.R00.150
8062034 08/06/2015
[50697.159942] task: ffff880089d72b80 ti: ffff880074bb4000 task.ti: ffff880074bb4000
[50697.168420] RIP: 0010:[<ffffffff810cb4ef>]  [<ffffffff810cb4ef>] replenish_dl_entity+0xff/0x110
[50697.178292] RSP: 0000:ffff88016ec43d90  EFLAGS: 00010046
[50697.184307] RAX: 0000000000000001 RBX: ffff880089d72d50 RCX: 0000000000000001
[50697.192390] RDX: 0000000000000010 RSI: ffff8800719858d0 RDI: ffff880089d72d50
[50697.200473] RBP: ffff88016ec43da8 R08: 0000000000000001 R09: 0000000000000097
[50697.208556] R10: 0000000057102e72 R11: 000000000f9e6fd7 R12: ffff88016ec56e40
[50697.216638] R13: ffff88016ec56e40 R14: 0000000000016e40 R15: ffff880089d72d50
[50697.224721] FS:  00007f14e788b700(0000) GS:ffff88016ec40000(0000) knlGS:0000000000000000
[50697.233887] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[50697.240396] CR2: 000055be08240c68 CR3: 000000008a5d5000 CR4: 00000000003406e0
[50697.248478] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[50697.256561] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[50697.264643] Stack:
[50697.266917]  ffff88016ec56e40 ffff880089d72b80 0000000000000010 ffff88016ec43de8
[50697.275338]  ffffffff810cbfe4 ffff88016ec43de8 ffff880089d72b80 ffff88016ec56e40
[50697.283757]  00000000000188c5 ffff880089d72d50 ffff88016ec4f228 ffff88016ec43e18
[50697.292175] Call Trace:
[50697.294943]  <IRQ>
[50697.297122]  [<ffffffff810cbfe4>] enqueue_task_dl+0x264/0x340
[50697.303838]  [<ffffffff810cc453>] update_curr_dl+0x1c3/0x1f0
[50697.310249]  [<ffffffff810cc51c>] task_tick_dl+0x1c/0x80
[50697.316265]  [<ffffffff810b66ac>] scheduler_tick+0x5c/0xe0
[50697.322480]  [<ffffffff811060d0>] ? tick_sched_do_timer+0x50/0x50
[50697.329383]  [<ffffffff810f60e1>] update_process_times+0x51/0x60
[50697.336188]  [<ffffffff81105a25>] tick_sched_handle.isra.17+0x25/0x60
[50697.343486]  [<ffffffff8110610d>] tick_sched_timer+0x3d/0x70
[50697.349895]  [<ffffffff810f6c93>] __hrtimer_run_queues+0xf3/0x270
[50697.356797]  [<ffffffff810f7168>] hrtimer_interrupt+0xa8/0x1a0
[50697.363404]  [<ffffffff81053de5>] local_apic_timer_interrupt+0x35/0x60
[50697.370799]  [<ffffffff816b7a1d>] smp_apic_timer_interrupt+0x3d/0x50
[50697.377996]  [<ffffffff816b5b5c>] apic_timer_interrupt+0x8c/0xa0
[50697.384798]  <EOI>
[50697.386974] Code: a9 48 c7 c7 38 5f a0 81 31 c0 48 89 75 e8 c6 05 5c 48 c8 00 01 e8 74 20 0c 00 49 8b 84 24 28 09 00 00 8b 4b 54 48 8b 75 e8 eb c4 <0f> 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00
[50697.409201] RIP  [<ffffffff810cb4ef>] replenish_dl_entity+0xff/0x110
[50697.416409]  RSP <ffff88016ec43d90>
[50697.433683] ---[ end trace da6e1e42babefb7f ]---
[50697.438913] Kernel panic - not syncing: Fatal exception in interrupt
[50698.484088] Shutting down cpus with NMI
[50698.488434] Kernel Offset: disabled
[50698.492383] ---[ end Kernel panic - not syncing: Fatal exception in interrupt


* Re: [PATCH v3 1/6] rtmutex: Deboost before waking up the top waiter
  2016-04-14 11:37 ` [PATCH v3 1/6] rtmutex: Deboost before waking up the top waiter Xunlei Pang
@ 2016-04-18  8:23   ` Thomas Gleixner
  2016-04-18  8:44     ` Xunlei Pang
  0 siblings, 1 reply; 46+ messages in thread
From: Thomas Gleixner @ 2016-04-18  8:23 UTC (permalink / raw)
  To: Xunlei Pang
  Cc: linux-kernel, Peter Zijlstra, Juri Lelli, Ingo Molnar, Steven Rostedt

On Thu, 14 Apr 2016, Xunlei Pang wrote:
> We should deboost before waking the high-prio task such that
> we don't run two tasks with the 'same' priority.

No. This is fundamentally broken.
 
T1 (prio 0)	lock(X)

-->	 	preemption

T2 (prio 10)	lock(X)
   	 	boost(T1)
		schedule()

T1 (prio 10)	unlock(X)
   	 	deboost()
   (prio 0) 

-->		preemption

T3 (prio 5)	....

Classic priority inversion enabled by a mechanism to avoid it. Brilliant
stuff.

Thanks,

	tglx


* Re: [PATCH v3 1/6] rtmutex: Deboost before waking up the top waiter
  2016-04-18  8:23   ` Thomas Gleixner
@ 2016-04-18  8:44     ` Xunlei Pang
  2016-04-18  9:02       ` Thomas Gleixner
  0 siblings, 1 reply; 46+ messages in thread
From: Xunlei Pang @ 2016-04-18  8:44 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Peter Zijlstra, Juri Lelli, Ingo Molnar, Steven Rostedt

On 2016/04/18 at 16:23, Thomas Gleixner wrote:
> On Thu, 14 Apr 2016, Xunlei Pang wrote:
>> We should deboost before waking the high-prio task such that
>> we don't run two tasks with the 'same' priority.
> No. This is fundamentally broken.
>  
> T1 (prio 0)	lock(X)
>
> -->	 	preemption
>
> T2 (prio 10)	lock(X)
>    	 	boost(T1)
> 		schedule()
>
> T1 (prio 10)	unlock(X)

We add a preempt_disable() before the deboost to avoid that breakage;
there's also a comment about this in the patch's code.

Regards,
Xunlei

>    	 	deboost()
>    (prio 0) 
>
> -->		preemption
>
> T3 (prio 5)	....
>
> Classic priority inversion enabled by a mechanism to avoid it. Brilliant
> stuff.
>
> Thanks,
>
> 	tglx


* Re: [PATCH v3 1/6] rtmutex: Deboost before waking up the top waiter
  2016-04-18  8:44     ` Xunlei Pang
@ 2016-04-18  9:02       ` Thomas Gleixner
  2016-04-18  9:41         ` Xunlei Pang
  2016-04-20 12:20         ` Peter Zijlstra
  0 siblings, 2 replies; 46+ messages in thread
From: Thomas Gleixner @ 2016-04-18  9:02 UTC (permalink / raw)
  To: xlpang
  Cc: linux-kernel, Peter Zijlstra, Juri Lelli, Ingo Molnar, Steven Rostedt

On Mon, 18 Apr 2016, Xunlei Pang wrote:
> On 2016/04/18 at 16:23, Thomas Gleixner wrote:
> > On Thu, 14 Apr 2016, Xunlei Pang wrote:
> >> We should deboost before waking the high-prio task such that
> >> we don't run two tasks with the 'same' priority.
> > No. This is fundamentally broken.
> >  
> > T1 (prio 0)	lock(X)
> >
> > -->	 	preemption
> >
> > T2 (prio 10)	lock(X)
> >    	 	boost(T1)
> > 		schedule()
> >
> > T1 (prio 10)	unlock(X)
> 
> We add a preempt_disable() before the deboost to avoid that breakage;
> there's also a comment about this in the patch's code.

So the changelog is useless and misleading. Neither does it explain what's
wrong with having two tasks with the same priority in running state.

Thanks,

	tglx


* Re: [PATCH v3 1/6] rtmutex: Deboost before waking up the top waiter
  2016-04-18  9:02       ` Thomas Gleixner
@ 2016-04-18  9:41         ` Xunlei Pang
  2016-04-20 12:20         ` Peter Zijlstra
  1 sibling, 0 replies; 46+ messages in thread
From: Xunlei Pang @ 2016-04-18  9:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Peter Zijlstra, Juri Lelli, Ingo Molnar, Steven Rostedt

On 2016/04/18 at 17:02, Thomas Gleixner wrote:
> On Mon, 18 Apr 2016, Xunlei Pang wrote:
>> On 2016/04/18 at 16:23, Thomas Gleixner wrote:
>>> On Thu, 14 Apr 2016, Xunlei Pang wrote:
>>>> We should deboost before waking the high-prio task such that
>>>> we don't run two tasks with the 'same' priority.
>>> No. This is fundamentally broken.
>>>  
>>> T1 (prio 0)	lock(X)
>>>
>>> -->	 	preemption
>>>
>>> T2 (prio 10)	lock(X)
>>>    	 	boost(T1)
>>> 		schedule()
>>>
>>> T1 (prio 10)	unlock(X)
>> We add a preempt_disable() before the deboost to avoid that breakage;
>> there's also a comment about this in the patch's code.
> So the changelog is useless and misleading. Neither does it explain what's
> wrong with having two tasks with the same priority in running state.

Sorry about that, will improve it.

Regards,
Xunlei

>
> Thanks,
>
> 	tglx
>
>
>
>


* Re: [PATCH v3 1/6] rtmutex: Deboost before waking up the top waiter
  2016-04-18  9:02       ` Thomas Gleixner
  2016-04-18  9:41         ` Xunlei Pang
@ 2016-04-20 12:20         ` Peter Zijlstra
  2016-04-20 12:43           ` Thomas Gleixner
  1 sibling, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2016-04-20 12:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: xlpang, linux-kernel, Juri Lelli, Ingo Molnar, Steven Rostedt

On Mon, Apr 18, 2016 at 11:02:28AM +0200, Thomas Gleixner wrote:
> On Mon, 18 Apr 2016, Xunlei Pang wrote:
> > On 2016/04/18 at 16:23, Thomas Gleixner wrote:
> > > On Thu, 14 Apr 2016, Xunlei Pang wrote:
> > >> We should deboost before waking the high-prio task such that
> > >> we don't run two tasks with the 'same' priority.
> > > No. This is fundamentally broken.
> > >  
> > > T1 (prio 0)	lock(X)
> > >
> > > -->	 	preemption
> > >
> > > T2 (prio 10)	lock(X)
> > >    	 	boost(T1)
> > > 		schedule()
> > >
> > > T1 (prio 10)	unlock(X)
> > 
> > We add a preempt_disable() before the deboost to avoid that breakage;
> > there's also a comment about this in the patch's code.
> 
> So the changelog is useless and misleading. Neither does it explain what's
> wrong with having two tasks with the same priority in running state.

So it's semantically icky to have the two tasks running off the same
state and practically icky when you consider bandwidth inheritance --
where the boosted task wants to explicitly modify the state of the
booster.

In that latter case you really want to unboost before you let the
booster run again.

However, you noted we need to deal with this case due to the whole
optimistic spinning crap anyway :/


* Re: [PATCH v3 5/6] sched/deadline/rtmutex: Fix unprotected PI access in enqueue_task_dl()
  2016-04-15  2:19       ` Xunlei Pang
@ 2016-04-20 12:25         ` Peter Zijlstra
  2016-04-20 13:00           ` Xunlei Pang
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2016-04-20 12:25 UTC (permalink / raw)
  To: xlpang
  Cc: linux-kernel, Thomas Gleixner, Juri Lelli, Ingo Molnar, Steven Rostedt

On Fri, Apr 15, 2016 at 10:19:12AM +0800, Xunlei Pang wrote:
> On 2016/04/15 at 09:58, Xunlei Pang wrote:
> > On 2016/04/14 at 23:31, Peter Zijlstra wrote:
> >> On Thu, Apr 14, 2016 at 07:37:06PM +0800, Xunlei Pang wrote:
> >>> We access @pi_task's data without any lock in enqueue_task_dl(); the
> >>> "dl_prio(pi_task->normal_prio)" check alone is not enough.
> >> The proper fix is to ensure that pi_task is guaranteed to be blocked.
> > Even if pi_task was blocked, its parameters are still allowed to be changed,
> > so we have to do that. Did I miss something?
> >
> > Regards,
> > Xunlei
> 
> Fortunately, I just reproduced it in an overnight test, so it really does happen in practice, as I thought.

But what happens? How is it changed when it is blocked?


* Re: [PATCH v3 1/6] rtmutex: Deboost before waking up the top waiter
  2016-04-20 12:20         ` Peter Zijlstra
@ 2016-04-20 12:43           ` Thomas Gleixner
  2016-04-20 13:10             ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Thomas Gleixner @ 2016-04-20 12:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: xlpang, linux-kernel, Juri Lelli, Ingo Molnar, Steven Rostedt

On Wed, 20 Apr 2016, Peter Zijlstra wrote:
> On Mon, Apr 18, 2016 at 11:02:28AM +0200, Thomas Gleixner wrote:
> > On Mon, 18 Apr 2016, Xunlei Pang wrote:
> > > We add a preempt_disable() before the deboost to avoid that breakage;
> > > there's also a comment about this in the patch's code.
> > 
> > So the changelog is useless and misleading. Neither does it explain what's
> > wrong with having two tasks with the same priority in running state.
> 
> So it's semantically icky to have the two tasks running off the same
> state and practically icky when you consider bandwidth inheritance --
> where the boosted task wants to explicitly modify the state of the
> booster.
>
> In that latter case you really want to unboost before you let the
> booster run again.

I understand that. That doesn't make the changelog any better, which mumbles
about priorities :(

> However, you noted we need to deal with this case due to the whole
> optimistic spinning crap anyway :/

Right, but that's another dimension of madness. Both tasks are on a cpu. The
reason why we boost the lock holder before spinning is to make sure that it
does not get preempted by something of medium priority before dropping the
lock. That really gets interesting with bandwidth inheritance ....

Thanks,

	tglx


* Re: [PATCH v3 5/6] sched/deadline/rtmutex: Fix unprotected PI access in enqueue_task_dl()
  2016-04-20 12:25         ` Peter Zijlstra
@ 2016-04-20 13:00           ` Xunlei Pang
  2016-04-20 13:17             ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Xunlei Pang @ 2016-04-20 13:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Juri Lelli, Ingo Molnar, Steven Rostedt

On 2016/04/20 at 20:25, Peter Zijlstra wrote:
> On Fri, Apr 15, 2016 at 10:19:12AM +0800, Xunlei Pang wrote:
>> On 2016/04/15 at 09:58, Xunlei Pang wrote:
>>> On 2016/04/14 at 23:31, Peter Zijlstra wrote:
>>>> On Thu, Apr 14, 2016 at 07:37:06PM +0800, Xunlei Pang wrote:
>>>>> We access @pi_task's data without any lock in enqueue_task_dl(); the
>>>>> "dl_prio(pi_task->normal_prio)" check alone is not enough.
>>>> The proper fix is to ensure that pi_task is guaranteed to be blocked.
>>> Even if pi_task was blocked, its parameters are still allowed to be changed,
>>> so we have to do that. Did I miss something?
>>>
>>> Regards,
>>> Xunlei
>> Fortunately, I just reproduced it in an overnight test, so it really does happen in practice, as I thought.
> But what happens? How is it changed when it is blocked?

The top waiter's policy can be changed by other tasks through the sched_setattr() syscall while it is blocked.
I created another thread doing the following:
     while (1) {
        change the waiter to cfs
        do something
        change the waiter to deadline
    }
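
A concrete user-space sketch of such a flipper thread (hypothetical,
not the actual test program; it uses the raw syscall since glibc has
no sched_setattr() wrapper) could be:

    /* Hypothetical reproducer sketch, not the original test program. */
    #define _GNU_SOURCE
    #include <sched.h>          /* SCHED_OTHER */
    #include <stdint.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>

    #ifndef SCHED_DEADLINE
    #define SCHED_DEADLINE 6
    #endif

    struct sched_attr {         /* layout as documented in sched_setattr(2) */
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;
        uint64_t sched_deadline;
        uint64_t sched_period;
    };

    static int sys_sched_setattr(pid_t pid, struct sched_attr *attr)
    {
        return syscall(SYS_sched_setattr, pid, attr, 0);
    }

    static void flip_forever(pid_t waiter)
    {
        struct sched_attr cfs, dl;

        memset(&cfs, 0, sizeof(cfs));
        cfs.size = sizeof(cfs);
        cfs.sched_policy = SCHED_OTHER;

        memset(&dl, 0, sizeof(dl));
        dl.size = sizeof(dl);
        dl.sched_policy   = SCHED_DEADLINE;
        dl.sched_runtime  = 10 * 1000 * 1000;   /* 10ms */
        dl.sched_deadline = 30 * 1000 * 1000;   /* 30ms */
        dl.sched_period   = 30 * 1000 * 1000;   /* 30ms */

        for (;;) {
            sys_sched_setattr(waiter, &cfs);    /* change the waiter to cfs */
            usleep(1000);                       /* do something             */
            sys_sched_setattr(waiter, &dl);     /* back to deadline         */
        }
    }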

Regards,
Xunlei


* Re: [PATCH v3 1/6] rtmutex: Deboost before waking up the top waiter
  2016-04-20 12:43           ` Thomas Gleixner
@ 2016-04-20 13:10             ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-04-20 13:10 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: xlpang, linux-kernel, Juri Lelli, Ingo Molnar, Steven Rostedt

On Wed, Apr 20, 2016 at 02:43:29PM +0200, Thomas Gleixner wrote:
> > So it's semantically icky to have the two tasks running off the same
> > state and practically icky when you consider bandwidth inheritance --
> > where the boosted task wants to explicitly modify the state of the
> > booster.
> >
> > In that latter case you really want to unboost before you let the
> > booster run again.
> 
> I understand that. That doesn't make the changelog any better, which mumbles
> about priorities :(

Agreed.

> > However, you noted we need to deal with this case due to the whole
> > optimistic spinning crap anyway :/
> 
> Right, but that's another dimension of madness. Both tasks are on a cpu.

> The reason why we boost the lock holder before spinning is to make
> sure that it does not get preempted by something of medium priority
> before dropping the lock. 

Right; I figured that out pretty quickly, which is why this patch does a
preempt_disable() over the unboost+wakeup.

FWIW, the immediate reason for this patch is that it ensures the new
p->pi_task pointer points to something that exists.

> That really gets interesting with bandwidth inheritance ....

I'm more worried about the optimistic spinning case..


* Re: [PATCH v3 5/6] sched/deadline/rtmutex: Fix unprotected PI access in enqueue_task_dl()
  2016-04-20 13:00           ` Xunlei Pang
@ 2016-04-20 13:17             ` Peter Zijlstra
  2016-04-20 13:45               ` Xunlei Pang
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2016-04-20 13:17 UTC (permalink / raw)
  To: xlpang
  Cc: linux-kernel, Thomas Gleixner, Juri Lelli, Ingo Molnar, Steven Rostedt

On Wed, Apr 20, 2016 at 09:00:32PM +0800, Xunlei Pang wrote:

> > But what happens? How is it changed when it is blocked?
> 
> The top waiter's policy can be changed by other tasks through the sched_setattr() syscall while it is blocked.
> I created another thread doing the following:
>      while (1) {
>         change the waiter to cfs
>         do something
>         change the waiter to deadline
>     }

Indeed; so why didn't you say that? That is the single most important
thing in the Changelog -- the _actual_ problem, and you left it out.

I'm not quite sure how to go fix that best, but copying the state is not
right. That precludes being able to change the state.

There's two (obvious but) rather ugly solutions:

 - delay the __sched_setscheduler() call until such a time that the task
   is no longer the top waiter.

 - deboost + __sched_setscheduler() + boost

Both have a different set of problems, but both keep the p->pi_task
pointer and its state 'stable'.


* Re: [PATCH v3 2/6] sched/rtmutex/deadline: Fix a PI crash for deadline tasks
  2016-04-14 11:37 ` [PATCH v3 2/6] sched/rtmutex/deadline: Fix a PI crash for deadline tasks Xunlei Pang
@ 2016-04-20 13:19   ` Peter Zijlstra
  2016-04-20 13:49     ` Xunlei Pang
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2016-04-20 13:19 UTC (permalink / raw)
  To: Xunlei Pang
  Cc: linux-kernel, Thomas Gleixner, Juri Lelli, Ingo Molnar, Steven Rostedt

On Thu, Apr 14, 2016 at 07:37:03PM +0800, Xunlei Pang wrote:
> +	/* Updated under pi_lock and rtmutex lock */
>  	struct rb_node *pi_waiters_leftmost;
> +	struct rb_node *pi_waiters_leftmost_copy;

>  struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
>  {
> +	if (!task->pi_waiters_leftmost_copy)
>  		return NULL;
>  
> +	return rb_entry(task->pi_waiters_leftmost_copy,
> +				struct rt_mutex_waiter, pi_tree_entry)->task;
>  }

why ?! Why not keep a regular task_struct pointer and avoid this stuff?


* Re: [PATCH v3 5/6] sched/deadline/rtmutex: Fix unprotected PI access in enqueue_task_dl()
  2016-04-20 13:17             ` Peter Zijlstra
@ 2016-04-20 13:45               ` Xunlei Pang
  0 siblings, 0 replies; 46+ messages in thread
From: Xunlei Pang @ 2016-04-20 13:45 UTC (permalink / raw)
  To: Peter Zijlstra, xlpang
  Cc: linux-kernel, Thomas Gleixner, Juri Lelli, Ingo Molnar, Steven Rostedt

On 2016/04/20 at 21:17, Peter Zijlstra wrote:
> On Wed, Apr 20, 2016 at 09:00:32PM +0800, Xunlei Pang wrote:
>
>>> But what happens? How is it changed when it is blocked?
>> The top waiter's policy can be changed by other tasks through the sched_setattr() syscall while it is blocked.
>> I created another thread doing the following:
>>      while (1) {
>>         change the waiter to cfs
>>         do something
>>         change the waiter to deadline
>>     }
> Indeed; so why didn't you say that? That is the single most important
> thing in the Changelog -- the _actual_ problem, and you left it out.

Sorry, the changelog only mentioned it briefly; I should have described it in detail.

>
> I'm not quite sure how to go fix that best, but copying the state is not
> right. That precludes being able to change the state.

The patch updates the copy every time the waiter's policy/runtime/period
change. The calling path is rt_mutex_setprio()->rt_mutex_update_copy(),
so the copy is refreshed very soon after
__sched_setscheduler()->rt_mutex_adjust_pi() runs; PATCH 6 also forces
the update for deadline cases.

Is this not acceptable?

Regards,
Xunlei

>
> There's two (obvious but) rather ugly solutions:
>
>  - delay the __sched_setscheduler() call until such a time that the task
>    is no longer the top waiter.
>
>  - deboost + __sched_setscheduler() + boost
>
> Both have a different set of problems, but both keep the p->pi_task
> pointer and its state 'stable'.
>
>


* Re: [PATCH v3 2/6] sched/rtmutex/deadline: Fix a PI crash for deadline tasks
  2016-04-20 13:19   ` Peter Zijlstra
@ 2016-04-20 13:49     ` Xunlei Pang
  2016-04-22  3:26       ` Xunlei Pang
  0 siblings, 1 reply; 46+ messages in thread
From: Xunlei Pang @ 2016-04-20 13:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Juri Lelli, Ingo Molnar, Steven Rostedt

On 2016/04/20 at 21:19, Peter Zijlstra wrote:
> On Thu, Apr 14, 2016 at 07:37:03PM +0800, Xunlei Pang wrote:
>> +	/* Updated under pi_lock and rtmutex lock */
>>  	struct rb_node *pi_waiters_leftmost;
>> +	struct rb_node *pi_waiters_leftmost_copy;
>>  struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
>>  {
>> +	if (!task->pi_waiters_leftmost_copy)
>>  		return NULL;
>>  
>> +	return rb_entry(task->pi_waiters_leftmost_copy,
>> +				struct rt_mutex_waiter, pi_tree_entry)->task;
>>  }
> why ?! Why not keep a regular task_struct pointer and avoid this stuff?

I meant to make it semantically consistent, but I can change it since you think task_struct is better.

Regards,
Xunlei


* Re: [PATCH v3 2/6] sched/rtmutex/deadline: Fix a PI crash for deadline tasks
  2016-04-20 13:49     ` Xunlei Pang
@ 2016-04-22  3:26       ` Xunlei Pang
  0 siblings, 0 replies; 46+ messages in thread
From: Xunlei Pang @ 2016-04-22  3:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Juri Lelli, Ingo Molnar, Steven Rostedt

Hi Peter,

On 2016/04/20 at 21:49, Xunlei Pang wrote:
> On 2016/04/20 at 21:19, Peter Zijlstra wrote:
>> On Thu, Apr 14, 2016 at 07:37:03PM +0800, Xunlei Pang wrote:
>>> +	/* Updated under pi_lock and rtmutex lock */
>>>  	struct rb_node *pi_waiters_leftmost;
>>> +	struct rb_node *pi_waiters_leftmost_copy;
>>>  struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
>>>  {
>>> +	if (!task->pi_waiters_leftmost_copy)
>>>  		return NULL;
>>>  
>>> +	return rb_entry(task->pi_waiters_leftmost_copy,
>>> +				struct rt_mutex_waiter, pi_tree_entry)->task;
>>>  }
>> why ?! Why not keep a regular task_struct pointer and avoid this stuff?
> I meant to make it semantically consistent, but I can change it since you think task_struct is better.

What do you think of this version of PATCH 1 and PATCH 2?
If you are fine with it, I can send it out as v4 separately; then we can focus on the issue in PATCH 5.

Thanks!

>
> Regards,
> Xunlei


* [PATCH -v3 0/8] PI vs SCHED_DEADLINE fixes
@ 2017-03-23 14:56 Peter Zijlstra
  2017-03-23 14:56 ` [PATCH -v3 1/8] rtmutex: Deboost before waking up the top waiter Peter Zijlstra
                   ` (7 more replies)
  0 siblings, 8 replies; 46+ messages in thread
From: Peter Zijlstra @ 2017-03-23 14:56 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz

Now that this pesky little problem with futexes is (hopefully) dealt with:

  https://lkml.kernel.org/r/20170322103547.756091212@infradead.org

We can get on with fixing the actual bug this all started out with.

These patches, started by Xunlei Pang, rework the PI infrastructure a bit,
fixing various problems it has, most notably a NULL deref in SCHED_DEADLINE.

Once this is sorted; we can look at improving the tracing thing (Daniel Bristot
and Julien Desfossez were working on that) and have a better foundation to look
at bandwidth inheritance (Juri and co.).


* [PATCH -v3 1/8] rtmutex: Deboost before waking up the top waiter
  2017-03-23 14:56 [PATCH -v3 0/8] PI vs SCHED_DEADLINE fixes Peter Zijlstra
@ 2017-03-23 14:56 ` Peter Zijlstra
  2017-04-04  9:48   ` [tip:locking/core] " tip-bot for Xunlei Pang
  2017-03-23 14:56 ` [PATCH -v3 2/8] sched/rtmutex/deadline: Fix a PI crash for deadline tasks Peter Zijlstra
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2017-03-23 14:56 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz,
	Ingo Molnar

[-- Attachment #1: xunlei_pang-rtmutex-deboost_before_waking_up_the_top_waiter.patch --]
[-- Type: text/plain, Size: 5302 bytes --]

From: Xunlei Pang <xlpang@redhat.com>

We should deboost before waking the high-priority task, such that we
don't run two tasks with the same "state" (priority, deadline,
sched_class, etc).

In order to make sure the boosting task doesn't start running between
unlock and deboost (due to a 'spurious' wakeup), we move the deboost
under the wait_lock; that way it's serialized against the wait loop in
__rt_mutex_slowlock().

Doing the deboost early can however lead to priority inversion if
current would get preempted after the deboost but before waking our
high-prio task, hence we disable preemption before doing the deboost
and enable it after the wakeup is over.

This gets us the right semantic order, but most importantly, this
change ensures pointer stability for the next patch, where we
have rt_mutex_setprio() cache a pointer to the top-most waiter task.
If we, as before this change, do the wakeup first and then deboost,
this pointer might point into thin air.
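
For reference, a condensed sketch of the resulting ordering (not the
literal kernel code; see the hunks below):

    /* rt_mutex_slowunlock(), with wait_lock held */
    mark_wakeup_next_waiter(wake_q, lock);  /* dequeue top waiter + deboost */
    preempt_disable();                      /* stay non-preemptible across unlock */
    raw_spin_unlock_irqrestore(&lock->wait_lock, flags);

    /* caller side */
    rt_mutex_postunlock(&wake_q, deboost);  /* wake_up_q(), then preempt_enable() */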

[peterz: Changelog + patch munging]
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---

 kernel/futex.c                  |    5 ---
 kernel/locking/rtmutex.c        |   59 +++++++++++++++++++++-------------------
 kernel/locking/rtmutex_common.h |    2 -
 3 files changed, 34 insertions(+), 32 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1465,10 +1465,7 @@ static int wake_futex_pi(u32 __user *uad
 out_unlock:
 	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 
-	if (deboost) {
-		wake_up_q(&wake_q);
-		rt_mutex_adjust_prio(current);
-	}
+	rt_mutex_postunlock(&wake_q, deboost);
 
 	return ret;
 }
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -307,24 +307,6 @@ static void __rt_mutex_adjust_prio(struc
 }
 
 /*
- * Adjust task priority (undo boosting). Called from the exit path of
- * rt_mutex_slowunlock() and rt_mutex_slowlock().
- *
- * (Note: We do this outside of the protection of lock->wait_lock to
- * allow the lock to be taken while or before we readjust the priority
- * of task. We do not use the spin_xx_mutex() variants here as we are
- * outside of the debug path.)
- */
-void rt_mutex_adjust_prio(struct task_struct *task)
-{
-	unsigned long flags;
-
-	raw_spin_lock_irqsave(&task->pi_lock, flags);
-	__rt_mutex_adjust_prio(task);
-	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
-}
-
-/*
  * Deadlock detection is conditional:
  *
  * If CONFIG_DEBUG_RT_MUTEXES=n, deadlock detection is only conducted
@@ -985,6 +967,7 @@ static void mark_wakeup_next_waiter(stru
 	 * lock->wait_lock.
 	 */
 	rt_mutex_dequeue_pi(current, waiter);
+	__rt_mutex_adjust_prio(current);
 
 	/*
 	 * As we are waking up the top waiter, and the waiter stays
@@ -1321,6 +1304,16 @@ static bool __sched rt_mutex_slowunlock(
 	 */
 	mark_wakeup_next_waiter(wake_q, lock);
 
+	/*
+	 * We should deboost before waking the top waiter task such that
+	 * we don't run two tasks with the 'same' priority. This however
+	 * can lead to prio-inversion if we would get preempted after
+	 * the deboost but before waking our high-prio task, hence the
+	 * preempt_disable before unlock. Pairs with preempt_enable() in
+	 * rt_mutex_postunlock();
+	 */
+	preempt_disable();
+
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 	/* check PI boosting */
@@ -1370,6 +1363,18 @@ rt_mutex_fasttrylock(struct rt_mutex *lo
 	return slowfn(lock);
 }
 
+/*
+ * Undo pi boosting (if necessary) and wake top waiter.
+ */
+void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost)
+{
+	wake_up_q(wake_q);
+
+	/* Pairs with preempt_disable() in rt_mutex_slowunlock() */
+	if (deboost)
+		preempt_enable();
+}
+
 static inline void
 rt_mutex_fastunlock(struct rt_mutex *lock,
 		    bool (*slowfn)(struct rt_mutex *lock,
@@ -1383,11 +1388,7 @@ rt_mutex_fastunlock(struct rt_mutex *loc
 
 	deboost = slowfn(lock, &wake_q);
 
-	wake_up_q(&wake_q);
-
-	/* Undo pi boosting if necessary: */
-	if (deboost)
-		rt_mutex_adjust_prio(current);
+	rt_mutex_postunlock(&wake_q, deboost);
 }
 
 /**
@@ -1513,6 +1514,13 @@ bool __sched __rt_mutex_futex_unlock(str
 	}
 
 	mark_wakeup_next_waiter(wake_q, lock);
+	/*
+	 * We've already deboosted, retain preempt_disabled when dropping
+	 * the wait_lock to avoid inversion until the wakeup. Matched
+	 * by rt_mutex_postunlock();
+	 */
+	preempt_disable();
+
 	return true; /* deboost and wakeups */
 }
 
@@ -1525,10 +1533,7 @@ void __sched rt_mutex_futex_unlock(struc
 	deboost = __rt_mutex_futex_unlock(lock, &wake_q);
 	raw_spin_unlock_irq(&lock->wait_lock);
 
-	if (deboost) {
-		wake_up_q(&wake_q);
-		rt_mutex_adjust_prio(current);
-	}
+	rt_mutex_postunlock(&wake_q, deboost);
 }
 
 /**
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -116,7 +116,7 @@ extern void rt_mutex_futex_unlock(struct
 extern bool __rt_mutex_futex_unlock(struct rt_mutex *lock,
 				 struct wake_q_head *wqh);
 
-extern void rt_mutex_adjust_prio(struct task_struct *task);
+extern void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost);
 
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 # include "rtmutex-debug.h"

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH -v3 2/8] sched/rtmutex/deadline: Fix a PI crash for deadline tasks
  2017-03-23 14:56 [PATCH -v3 0/8] PI vs SCHED_DEADLINE fixes Peter Zijlstra
  2017-03-23 14:56 ` [PATCH -v3 1/8] rtmutex: Deboost before waking up the top waiter Peter Zijlstra
@ 2017-03-23 14:56 ` Peter Zijlstra
  2017-04-04  9:49   ` [tip:locking/core] " tip-bot for Xunlei Pang
  2017-03-23 14:56 ` [PATCH -v3 3/8] sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update Peter Zijlstra
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2017-03-23 14:56 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz,
	Ingo Molnar

[-- Attachment #1: xunlei_pang-sched_rtmutex_deadline-fix_a_pi_crash_for_deadline_tasks.patch --]
[-- Type: text/plain, Size: 5245 bytes --]

From: Xunlei Pang <xlpang@redhat.com>

A crash happened while I was playing with deadline PI rtmutex.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: [<ffffffff810eeb8f>] rt_mutex_get_top_task+0x1f/0x30
    PGD 232a75067 PUD 230947067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 1 PID: 10994 Comm: a.out Not tainted

    Call Trace:
    [<ffffffff810b658c>] enqueue_task+0x2c/0x80
    [<ffffffff810ba763>] activate_task+0x23/0x30
    [<ffffffff810d0ab5>] pull_dl_task+0x1d5/0x260
    [<ffffffff810d0be6>] pre_schedule_dl+0x16/0x20
    [<ffffffff8164e783>] __schedule+0xd3/0x900
    [<ffffffff8164efd9>] schedule+0x29/0x70
    [<ffffffff8165035b>] __rt_mutex_slowlock+0x4b/0xc0
    [<ffffffff81650501>] rt_mutex_slowlock+0xd1/0x190
    [<ffffffff810eeb33>] rt_mutex_timed_lock+0x53/0x60
    [<ffffffff810ecbfc>] futex_lock_pi.isra.18+0x28c/0x390
    [<ffffffff810ed8b0>] do_futex+0x190/0x5b0
    [<ffffffff810edd50>] SyS_futex+0x80/0x180

This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
are only protected by pi_lock when operating on pi waiters, while
rt_mutex_get_top_task() will access them with the rq lock held but
without holding pi_lock.

In order to tackle it, we introduce a new "pi_top_task" pointer
cached in task_struct, and add a new rt_mutex_update_top_task()
to update its value. It is called from rt_mutex_setprio(), which
holds both the owner's pi_lock and the rq lock. Thus "pi_top_task"
can be safely accessed by enqueue_task_dl() under the rq lock.
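
A minimal sketch of the resulting locking contract (the reader body is
illustrative only; consume_dl_params() is a placeholder, not a real
function):

    /* writer: rt_mutex_setprio(), holds both p->pi_lock and rq lock */
    rt_mutex_update_top_task(p);            /* refresh p->pi_top_task */

    /* reader: e.g. enqueue_task_dl(), holds only the rq lock */
    struct task_struct *pi_task = rt_mutex_get_top_task(p);
    if (pi_task)
        consume_dl_params(&pi_task->dl);    /* safe: published under both locks */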

Originally-From: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---

 include/linux/init_task.h |    1 +
 include/linux/sched.h     |    2 ++
 include/linux/sched/rt.h  |    1 +
 kernel/fork.c             |    1 +
 kernel/locking/rtmutex.c  |   29 +++++++++++++++++++++--------
 kernel/sched/core.c       |    2 ++
 6 files changed, 28 insertions(+), 8 deletions(-)

--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -181,6 +181,7 @@ extern struct cred init_cred;
 #ifdef CONFIG_RT_MUTEXES
 # define INIT_RT_MUTEXES(tsk)						\
 	.pi_waiters = RB_ROOT,						\
+	.pi_top_task = NULL,						\
 	.pi_waiters_leftmost = NULL,
 #else
 # define INIT_RT_MUTEXES(tsk)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -775,6 +775,8 @@ struct task_struct {
 	/* PI waiters blocked on a rt_mutex held by this task: */
 	struct rb_root			pi_waiters;
 	struct rb_node			*pi_waiters_leftmost;
+	/* Updated under owner's pi_lock and rq lock */
+	struct task_struct		*pi_top_task;
 	/* Deadlock detection and priority inheritance handling: */
 	struct rt_mutex_waiter		*pi_blocked_on;
 #endif
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -21,6 +21,7 @@ static inline int rt_task(struct task_st
 extern int rt_mutex_getprio(struct task_struct *p);
 extern void rt_mutex_setprio(struct task_struct *p, int prio);
 extern int rt_mutex_get_effective_prio(struct task_struct *task, int newprio);
+extern void rt_mutex_update_top_task(struct task_struct *p);
 extern struct task_struct *rt_mutex_get_top_task(struct task_struct *task);
 extern void rt_mutex_adjust_pi(struct task_struct *p);
 static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1438,6 +1438,7 @@ static void rt_mutex_init_task(struct ta
 #ifdef CONFIG_RT_MUTEXES
 	p->pi_waiters = RB_ROOT;
 	p->pi_waiters_leftmost = NULL;
+	p->pi_top_task = NULL;
 	p->pi_blocked_on = NULL;
 #endif
 }
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -323,6 +323,19 @@ rt_mutex_dequeue_pi(struct task_struct *
 }
 
 /*
+ * Must hold both p->pi_lock and task_rq(p)->lock.
+ */
+void rt_mutex_update_top_task(struct task_struct *p)
+{
+	if (!task_has_pi_waiters(p)) {
+		p->pi_top_task = NULL;
+		return;
+	}
+
+	p->pi_top_task = task_top_pi_waiter(p)->task;
+}
+
+/*
  * Calculate task priority from the waiter tree priority
  *
  * Return task->normal_prio when the waiter tree is empty or when
@@ -337,12 +350,12 @@ int rt_mutex_getprio(struct task_struct
 		   task->normal_prio);
 }
 
+/*
+ * Must hold either p->pi_lock or task_rq(p)->lock.
+ */
 struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
 {
-	if (likely(!task_has_pi_waiters(task)))
-		return NULL;
-
-	return task_top_pi_waiter(task)->task;
+	return task->pi_top_task;
 }
 
 /*
@@ -351,12 +364,12 @@ struct task_struct *rt_mutex_get_top_tas
  */
 int rt_mutex_get_effective_prio(struct task_struct *task, int newprio)
 {
-	if (!task_has_pi_waiters(task))
+	struct task_struct *top_task = rt_mutex_get_top_task(task);
+
+	if (!top_task)
 		return newprio;
 
-	if (task_top_pi_waiter(task)->task->prio <= newprio)
-		return task_top_pi_waiter(task)->task->prio;
-	return newprio;
+	return min(top_task->prio, newprio);
 }
 
 /*
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3716,6 +3716,8 @@ void rt_mutex_setprio(struct task_struct
 		goto out_unlock;
 	}
 
+	rt_mutex_update_top_task(p);
+
 	trace_sched_pi_setprio(p, prio);
 	oldprio = p->prio;
 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH -v3 3/8] sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update
  2017-03-23 14:56 [PATCH -v3 0/8] PI vs SCHED_DEADLINE fixes Peter Zijlstra
  2017-03-23 14:56 ` [PATCH -v3 1/8] rtmutex: Deboost before waking up the top waiter Peter Zijlstra
  2017-03-23 14:56 ` [PATCH -v3 2/8] sched/rtmutex/deadline: Fix a PI crash for deadline tasks Peter Zijlstra
@ 2017-03-23 14:56 ` Peter Zijlstra
  2017-04-04  9:50   ` [tip:locking/core] " tip-bot for Xunlei Pang
  2017-03-23 14:56 ` [PATCH -v3 4/8] rtmutex: Clean up Peter Zijlstra
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2017-03-23 14:56 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz,
	Ingo Molnar

[-- Attachment #1: xunlei_pang-sched_deadline_rtmutex-don_t_miss_the_dl_runtime_dl_period_update.patch --]
[-- Type: text/plain, Size: 1548 bytes --]

From: Xunlei Pang <xlpang@redhat.com>

Currently dl tasks will actually return at the very beginning
of rt_mutex_adjust_prio_chain() in !detect_deadlock cases:

    if (waiter->prio == task->prio) {
        if (!detect_deadlock)
            goto out_unlock_pi; // out here
        else
            requeue = false;
    }

As the deadline value of blocked deadline tasks (waiters) that don't
change their sched_class (and thus whose prio doesn't change) never
changes, this seems reasonable, but it actually misses the chance to
update the rt_mutex_waiter's "dl_runtime(period)_copy" when a waiter
updates its deadline parameters (dl_runtime, dl_period) or a boosted
waiter changes to a !deadline class.

Thus, keep deadline tasks from bailing out early by adding a !dl_task() condition.
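
With the fix (taken from the hunk below), the check becomes:

    if (waiter->prio == task->prio && !dl_task(task)) {
        if (!detect_deadlock)
            goto out_unlock_pi;
        else
            requeue = false;
    }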

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1460633827-345-7-git-send-email-xlpang@redhat.com
---
 kernel/locking/rtmutex.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -488,7 +488,7 @@ static int rt_mutex_adjust_prio_chain(st
 	 * enabled we continue, but stop the requeueing in the chain
 	 * walk.
 	 */
-	if (waiter->prio == task->prio) {
+	if (waiter->prio == task->prio && !dl_task(task)) {
 		if (!detect_deadlock)
 			goto out_unlock_pi;
 		else

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH -v3 4/8] rtmutex: Clean up
  2017-03-23 14:56 [PATCH -v3 0/8] PI vs SCHED_DEADLINE fixes Peter Zijlstra
                   ` (2 preceding siblings ...)
  2017-03-23 14:56 ` [PATCH -v3 3/8] sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update Peter Zijlstra
@ 2017-03-23 14:56 ` Peter Zijlstra
  2017-03-23 18:21   ` Steven Rostedt
  2017-04-04  9:50   ` [tip:locking/core] " tip-bot for Peter Zijlstra
  2017-03-23 14:56 ` [PATCH -v3 5/8] sched/rtmutex: Refactor rt_mutex_setprio() Peter Zijlstra
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 46+ messages in thread
From: Peter Zijlstra @ 2017-03-23 14:56 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz

[-- Attachment #1: peterz-cleanup-rt-mutex-2.patch --]
[-- Type: text/plain, Size: 3677 bytes --]

Previous patches changed the meaning of the return value of
rt_mutex_slowunlock(); update comments and code to reflect this.
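
The new convention, sketched from the hunks below: the slow path now
returns whether rt_mutex_postunlock() must be called, instead of whether
to deboost:

    if (slowfn(lock, &wake_q))
        rt_mutex_postunlock(&wake_q);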

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c                  |    7 ++++---
 kernel/locking/rtmutex.c        |   28 +++++++++++++---------------
 kernel/locking/rtmutex_common.h |    2 +-
 3 files changed, 18 insertions(+), 19 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1394,7 +1394,7 @@ static int wake_futex_pi(u32 __user *uad
 {
 	u32 uninitialized_var(curval), newval;
 	struct task_struct *new_owner;
-	bool deboost = false;
+	bool postunlock = false;
 	DEFINE_WAKE_Q(wake_q);
 	int ret = 0;
 
@@ -1455,12 +1455,13 @@ static int wake_futex_pi(u32 __user *uad
 	/*
 	 * We've updated the uservalue, this unlock cannot fail.
 	 */
-	deboost = __rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
+	postunlock = __rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
 
 out_unlock:
 	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 
-	rt_mutex_postunlock(&wake_q, deboost);
+	if (postunlock)
+		rt_mutex_postunlock(&wake_q);
 
 	return ret;
 }
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1330,7 +1330,8 @@ static inline int rt_mutex_slowtrylock(s
 
 /*
  * Slow path to release a rt-mutex.
- * Return whether the current task needs to undo a potential priority boosting.
+ *
+ * Return whether the current task needs to call rt_mutex_postunlock().
  */
 static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock,
 					struct wake_q_head *wake_q)
@@ -1401,8 +1402,7 @@ static bool __sched rt_mutex_slowunlock(
 
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
-	/* check PI boosting */
-	return true;
+	return true; /* call rt_mutex_postunlock() */
 }
 
 /*
@@ -1449,15 +1449,14 @@ rt_mutex_fasttrylock(struct rt_mutex *lo
 }
 
 /*
- * Undo pi boosting (if necessary) and wake top waiter.
+ * Performs the wakeup of the top waiter and re-enables preemption.
  */
-void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost)
+void rt_mutex_postunlock(struct wake_q_head *wake_q)
 {
 	wake_up_q(wake_q);
 
 	/* Pairs with preempt_disable() in rt_mutex_slowunlock() */
-	if (deboost)
-		preempt_enable();
+	preempt_enable();
 }
 
 static inline void
@@ -1466,14 +1465,12 @@ rt_mutex_fastunlock(struct rt_mutex *loc
 				   struct wake_q_head *wqh))
 {
 	DEFINE_WAKE_Q(wake_q);
-	bool deboost;
 
 	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL)))
 		return;
 
-	deboost = slowfn(lock, &wake_q);
-
-	rt_mutex_postunlock(&wake_q, deboost);
+	if (slowfn(lock, &wake_q))
+		rt_mutex_postunlock(&wake_q);
 }
 
 /**
@@ -1593,19 +1590,20 @@ bool __sched __rt_mutex_futex_unlock(str
 	 */
 	preempt_disable();
 
-	return true; /* deboost and wakeups */
+	return true; /* call postunlock() */
 }
 
 void __sched rt_mutex_futex_unlock(struct rt_mutex *lock)
 {
 	DEFINE_WAKE_Q(wake_q);
-	bool deboost;
+	bool postunlock;
 
 	raw_spin_lock_irq(&lock->wait_lock);
-	deboost = __rt_mutex_futex_unlock(lock, &wake_q);
+	postunlock = __rt_mutex_futex_unlock(lock, &wake_q);
 	raw_spin_unlock_irq(&lock->wait_lock);
 
-	rt_mutex_postunlock(&wake_q, deboost);
+	if (postunlock)
+		rt_mutex_postunlock(&wake_q);
 }
 
 /**
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -122,7 +122,7 @@ extern void rt_mutex_futex_unlock(struct
 extern bool __rt_mutex_futex_unlock(struct rt_mutex *lock,
 				 struct wake_q_head *wqh);
 
-extern void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost);
+extern void rt_mutex_postunlock(struct wake_q_head *wake_q);
 
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 # include "rtmutex-debug.h"

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH -v3 5/8] sched/rtmutex: Refactor rt_mutex_setprio()
  2017-03-23 14:56 [PATCH -v3 0/8] PI vs SCHED_DEADLINE fixes Peter Zijlstra
                   ` (3 preceding siblings ...)
  2017-03-23 14:56 ` [PATCH -v3 4/8] rtmutex: Clean up Peter Zijlstra
@ 2017-03-23 14:56 ` Peter Zijlstra
  2017-04-04  9:51   ` [tip:locking/core] " tip-bot for Peter Zijlstra
  2017-03-23 14:56 ` [PATCH -v3 6/8] sched,tracing: Update trace_sched_pi_setprio() Peter Zijlstra
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2017-03-23 14:56 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz

[-- Attachment #1: peterz-cleanup-rt-mutex-3.patch --]
[-- Type: text/plain, Size: 11839 bytes --]

With the introduction of SCHED_DEADLINE the whole notion that priority
is a single number is gone; therefore the @prio argument to
rt_mutex_setprio() doesn't make sense anymore.

So rework the code to pass a pi_task instead.

Note this also fixes a problem with pi_top_task caching; previously we
would not set the pointer (call rt_mutex_update_top_task) if the
priority didn't change, which could lead to a stale pointer.

As for the XXX, I think it's fine to use pi_task->prio, because if it
differs from waiter->prio, a PI chain update is imminent.
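
Condensed, the new entry logic of rt_mutex_setprio() (a sketch of the
hunks below, with declarations elided) is:

    void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
    {
        int prio = __rt_effective_prio(pi_task, p->normal_prio);

        /* nothing changed; bail early */
        if (p->pi_top_task == pi_task && prio == p->prio && !dl_prio(prio))
            return;

        rq = __task_rq_lock(p, &rf);
        p->pi_top_task = pi_task;   /* publish under pi_lock && rq->lock */

        /* for FIFO/RR only prio matters; if it matches we're done */
        if (prio == p->prio && !dl_prio(prio))
            goto out_unlock;
        ...
    }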

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched/rt.h |   24 +++-------
 kernel/locking/rtmutex.c |  112 ++++++++++++-----------------------------------
 kernel/sched/core.c      |   66 ++++++++++++++++++++++-----
 3 files changed, 91 insertions(+), 111 deletions(-)

--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -18,28 +18,20 @@ static inline int rt_task(struct task_st
 }
 
 #ifdef CONFIG_RT_MUTEXES
-extern int rt_mutex_getprio(struct task_struct *p);
-extern void rt_mutex_setprio(struct task_struct *p, int prio);
-extern int rt_mutex_get_effective_prio(struct task_struct *task, int newprio);
-extern void rt_mutex_update_top_task(struct task_struct *p);
-extern struct task_struct *rt_mutex_get_top_task(struct task_struct *task);
+/*
+ * Must hold either p->pi_lock or task_rq(p)->lock.
+ */
+static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *p)
+{
+	return p->pi_top_task;
+}
+extern void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task);
 extern void rt_mutex_adjust_pi(struct task_struct *p);
 static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
 {
 	return tsk->pi_blocked_on != NULL;
 }
 #else
-static inline int rt_mutex_getprio(struct task_struct *p)
-{
-	return p->normal_prio;
-}
-
-static inline int rt_mutex_get_effective_prio(struct task_struct *task,
-					      int newprio)
-{
-	return newprio;
-}
-
 static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
 {
 	return NULL;
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -322,67 +322,16 @@ rt_mutex_dequeue_pi(struct task_struct *
 	RB_CLEAR_NODE(&waiter->pi_tree_entry);
 }
 
-/*
- * Must hold both p->pi_lock and task_rq(p)->lock.
- */
-void rt_mutex_update_top_task(struct task_struct *p)
-{
-	if (!task_has_pi_waiters(p)) {
-		p->pi_top_task = NULL;
-		return;
-	}
-
-	p->pi_top_task = task_top_pi_waiter(p)->task;
-}
-
-/*
- * Calculate task priority from the waiter tree priority
- *
- * Return task->normal_prio when the waiter tree is empty or when
- * the waiter is not allowed to do priority boosting
- */
-int rt_mutex_getprio(struct task_struct *task)
-{
-	if (likely(!task_has_pi_waiters(task)))
-		return task->normal_prio;
-
-	return min(task_top_pi_waiter(task)->prio,
-		   task->normal_prio);
-}
-
-/*
- * Must hold either p->pi_lock or task_rq(p)->lock.
- */
-struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
-{
-	return task->pi_top_task;
-}
-
-/*
- * Called by sched_setscheduler() to get the priority which will be
- * effective after the change.
- */
-int rt_mutex_get_effective_prio(struct task_struct *task, int newprio)
+static void rt_mutex_adjust_prio(struct task_struct *p)
 {
-	struct task_struct *top_task = rt_mutex_get_top_task(task);
+	struct task_struct *pi_task = NULL;
 
-	if (!top_task)
-		return newprio;
+	lockdep_assert_held(&p->pi_lock);
 
-	return min(top_task->prio, newprio);
-}
+	if (task_has_pi_waiters(p))
+		pi_task = task_top_pi_waiter(p)->task;
 
-/*
- * Adjust the priority of a task, after its pi_waiters got modified.
- *
- * This can be both boosting and unboosting. task->pi_lock must be held.
- */
-static void __rt_mutex_adjust_prio(struct task_struct *task)
-{
-	int prio = rt_mutex_getprio(task);
-
-	if (task->prio != prio || dl_prio(prio))
-		rt_mutex_setprio(task, prio);
+	rt_mutex_setprio(p, pi_task);
 }
 
 /*
@@ -742,7 +691,7 @@ static int rt_mutex_adjust_prio_chain(st
 		 */
 		rt_mutex_dequeue_pi(task, prerequeue_top_waiter);
 		rt_mutex_enqueue_pi(task, waiter);
-		__rt_mutex_adjust_prio(task);
+		rt_mutex_adjust_prio(task);
 
 	} else if (prerequeue_top_waiter == waiter) {
 		/*
@@ -758,7 +707,7 @@ static int rt_mutex_adjust_prio_chain(st
 		rt_mutex_dequeue_pi(task, waiter);
 		waiter = rt_mutex_top_waiter(lock);
 		rt_mutex_enqueue_pi(task, waiter);
-		__rt_mutex_adjust_prio(task);
+		rt_mutex_adjust_prio(task);
 	} else {
 		/*
 		 * Nothing changed. No need to do any priority
@@ -966,7 +915,7 @@ static int task_blocks_on_rt_mutex(struc
 		return -EDEADLK;
 
 	raw_spin_lock(&task->pi_lock);
-	__rt_mutex_adjust_prio(task);
+	rt_mutex_adjust_prio(task);
 	waiter->task = task;
 	waiter->lock = lock;
 	waiter->prio = task->prio;
@@ -988,7 +937,7 @@ static int task_blocks_on_rt_mutex(struc
 		rt_mutex_dequeue_pi(owner, top_waiter);
 		rt_mutex_enqueue_pi(owner, waiter);
 
-		__rt_mutex_adjust_prio(owner);
+		rt_mutex_adjust_prio(owner);
 		if (owner->pi_blocked_on)
 			chain_walk = 1;
 	} else if (rt_mutex_cond_detect_deadlock(waiter, chwalk)) {
@@ -1040,13 +989,14 @@ static void mark_wakeup_next_waiter(stru
 	waiter = rt_mutex_top_waiter(lock);
 
 	/*
-	 * Remove it from current->pi_waiters. We do not adjust a
-	 * possible priority boost right now. We execute wakeup in the
-	 * boosted mode and go back to normal after releasing
-	 * lock->wait_lock.
+	 * Remove it from current->pi_waiters and deboost.
+	 *
+	 * We must in fact deboost here in order to ensure we call
+	 * rt_mutex_setprio() to update p->pi_top_task before the
+	 * task unblocks.
 	 */
 	rt_mutex_dequeue_pi(current, waiter);
-	__rt_mutex_adjust_prio(current);
+	rt_mutex_adjust_prio(current);
 
 	/*
 	 * As we are waking up the top waiter, and the waiter stays
@@ -1058,9 +1008,19 @@ static void mark_wakeup_next_waiter(stru
 	 */
 	lock->owner = (void *) RT_MUTEX_HAS_WAITERS;
 
-	raw_spin_unlock(&current->pi_lock);
-
+	/*
+	 * We deboosted before waking the top waiter task such that we don't
+	 * run two tasks with the 'same' priority (and ensure the
+	 * p->pi_top_task pointer points to a blocked task). This however can
+	 * lead to priority inversion if we would get preempted after the
+	 * deboost but before waking our donor task, hence the preempt_disable()
+	 * before unlock.
+	 *
+	 * Pairs with preempt_enable() in rt_mutex_postunlock();
+	 */
+	preempt_disable();
 	wake_q_add(wake_q, waiter->task);
+	raw_spin_unlock(&current->pi_lock);
 }
 
 /*
@@ -1095,7 +1055,7 @@ static void remove_waiter(struct rt_mute
 	if (rt_mutex_has_waiters(lock))
 		rt_mutex_enqueue_pi(owner, rt_mutex_top_waiter(lock));
 
-	__rt_mutex_adjust_prio(owner);
+	rt_mutex_adjust_prio(owner);
 
 	/* Store the lock on which owner is blocked or NULL */
 	next_lock = task_blocked_on_lock(owner);
@@ -1134,8 +1094,7 @@ void rt_mutex_adjust_pi(struct task_stru
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
 
 	waiter = task->pi_blocked_on;
-	if (!waiter || (waiter->prio == task->prio &&
-			!dl_prio(task->prio))) {
+	if (!waiter || (waiter->prio == task->prio && !dl_prio(task->prio))) {
 		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 		return;
 	}
@@ -1389,17 +1348,6 @@ static bool __sched rt_mutex_slowunlock(
 	 * Queue the next waiter for wakeup once we release the wait_lock.
 	 */
 	mark_wakeup_next_waiter(wake_q, lock);
-
-	/*
-	 * We should deboost before waking the top waiter task such that
-	 * we don't run two tasks with the 'same' priority. This however
-	 * can lead to prio-inversion if we would get preempted after
-	 * the deboost but before waking our high-prio task, hence the
-	 * preempt_disable before unlock. Pairs with preempt_enable() in
-	 * rt_mutex_postunlock();
-	 */
-	preempt_disable();
-
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 	return true; /* call rt_mutex_postunlock() */
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3674,10 +3674,25 @@ EXPORT_SYMBOL(default_wake_function);
 
 #ifdef CONFIG_RT_MUTEXES
 
+static inline int __rt_effective_prio(struct task_struct *pi_task, int prio)
+{
+	if (pi_task)
+		prio = min(prio, pi_task->prio);
+
+	return prio;
+}
+
+static inline int rt_effective_prio(struct task_struct *p, int prio)
+{
+	struct task_struct *pi_task = rt_mutex_get_top_task(p);
+
+	return __rt_effective_prio(pi_task, prio);
+}
+
 /*
  * rt_mutex_setprio - set the current priority of a task
- * @p: task
- * @prio: prio value (kernel-internal form)
+ * @p: task to boost
+ * @pi_task: donor task
  *
  * This function changes the 'effective' priority of a task. It does
  * not touch ->normal_prio like __setscheduler().
@@ -3685,18 +3700,42 @@ EXPORT_SYMBOL(default_wake_function);
  * Used by the rt_mutex code to implement priority inheritance
  * logic. Call site only calls if the priority of the task changed.
  */
-void rt_mutex_setprio(struct task_struct *p, int prio)
+void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
 {
-	int oldprio, queued, running, queue_flag =
+	int prio, oldprio, queued, running, queue_flag =
 		DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
 	const struct sched_class *prev_class;
 	struct rq_flags rf;
 	struct rq *rq;
 
-	BUG_ON(prio > MAX_PRIO);
+	/* XXX used to be waiter->prio, not waiter->task->prio */
+	prio = __rt_effective_prio(pi_task, p->normal_prio);
+
+	/*
+	 * If nothing changed; bail early.
+	 */
+	if (p->pi_top_task == pi_task && prio == p->prio && !dl_prio(prio))
+		return;
 
 	rq = __task_rq_lock(p, &rf);
 	update_rq_clock(rq);
+	/*
+	 * Set under pi_lock && rq->lock, such that the value can be used under
+	 * either lock.
+	 *
+	 * Note that it takes loads of trickery to make this pointer cache work
+	 * right. rt_mutex_slowunlock()+rt_mutex_postunlock() work together to
+	 * ensure a task is de-boosted (pi_task is set to NULL) before the
+	 * task is allowed to run again (and can exit). This ensures the pointer
+	 * points to a blocked task -- which guarantees the task is present.
+	 */
+	p->pi_top_task = pi_task;
+
+	/*
+	 * For FIFO/RR we only need to set prio, if that matches we're done.
+	 */
+	if (prio == p->prio && !dl_prio(prio))
+		goto out_unlock;
 
 	/*
 	 * Idle task boosting is a nono in general. There is one
@@ -3716,9 +3755,7 @@ void rt_mutex_setprio(struct task_struct
 		goto out_unlock;
 	}
 
-	rt_mutex_update_top_task(p);
-
-	trace_sched_pi_setprio(p, prio);
+	trace_sched_pi_setprio(p, prio); /* broken */
 	oldprio = p->prio;
 
 	if (oldprio == prio)
@@ -3742,7 +3779,6 @@ void rt_mutex_setprio(struct task_struct
 	 *          running task
 	 */
 	if (dl_prio(prio)) {
-		struct task_struct *pi_task = rt_mutex_get_top_task(p);
 		if (!dl_prio(p->normal_prio) ||
 		    (pi_task && dl_entity_preempt(&pi_task->dl, &p->dl))) {
 			p->dl.dl_boosted = 1;
@@ -3780,6 +3816,11 @@ void rt_mutex_setprio(struct task_struct
 	balance_callback(rq);
 	preempt_enable();
 }
+#else
+static inline int rt_effective_prio(struct task_struct *p, int prio)
+{
+	return prio;
+}
 #endif
 
 void set_user_nice(struct task_struct *p, long nice)
@@ -4026,10 +4067,9 @@ static void __setscheduler(struct rq *rq
 	 * Keep a potential priority boosting if called from
 	 * sched_setscheduler().
 	 */
+	p->prio = normal_prio(p);
 	if (keep_boost)
-		p->prio = rt_mutex_get_effective_prio(p, normal_prio(p));
-	else
-		p->prio = normal_prio(p);
+		p->prio = rt_effective_prio(p, p->prio);
 
 	if (dl_prio(p->prio))
 		p->sched_class = &dl_sched_class;
@@ -4316,7 +4356,7 @@ static int __sched_setscheduler(struct t
 		 * the runqueue. This will be done when the task deboost
 		 * itself.
 		 */
-		new_effective_prio = rt_mutex_get_effective_prio(p, newprio);
+		new_effective_prio = rt_effective_prio(p, newprio);
 		if (new_effective_prio == oldprio)
 			queue_flags &= ~DEQUEUE_MOVE;
 	}

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH -v3 6/8] sched,tracing: Update trace_sched_pi_setprio()
  2017-03-23 14:56 [PATCH -v3 0/8] PI vs SCHED_DEADLINE fixes Peter Zijlstra
                   ` (4 preceding siblings ...)
  2017-03-23 14:56 ` [PATCH -v3 5/8] sched/rtmutex: Refactor rt_mutex_setprio() Peter Zijlstra
@ 2017-03-23 14:56 ` Peter Zijlstra
  2017-04-04  9:51   ` [tip:locking/core] " tip-bot for Peter Zijlstra
  2017-03-23 14:56 ` [PATCH -v3 7/8] rtmutex: Fix PI chain order integrity Peter Zijlstra
  2017-03-23 14:56 ` [PATCH -v3 8/8] rtmutex: Fix more prio comparisons Peter Zijlstra
  7 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2017-03-23 14:56 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz

[-- Attachment #1: peterz-cleanup-rt-mutex-4.patch --]
[-- Type: text/plain, Size: 3077 bytes --]

Pass the PI donor task, instead of a numerical priority.

Numerical priorities are not sufficient to describe state ever since
SCHED_DEADLINE.

Annotate all sched tracepoints that are currently broken; fixing them
will bork userspace. *hate*.
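
Call sites now hand in the donor task itself (see the core.c hunk
below):

    trace_sched_pi_setprio(p, pi_task);     /* was: trace_sched_pi_setprio(p, prio) */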

Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/trace/events/sched.h |   16 +++++++++-------
 kernel/sched/core.c          |    2 +-
 2 files changed, 10 insertions(+), 8 deletions(-)

--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -70,7 +70,7 @@ DECLARE_EVENT_CLASS(sched_wakeup_templat
 	TP_fast_assign(
 		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
 		__entry->pid		= p->pid;
-		__entry->prio		= p->prio;
+		__entry->prio		= p->prio; /* XXX SCHED_DEADLINE */
 		__entry->success	= 1; /* rudiment, kill when possible */
 		__entry->target_cpu	= task_cpu(p);
 	),
@@ -147,6 +147,7 @@ TRACE_EVENT(sched_switch,
 		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
 		__entry->next_pid	= next->pid;
 		__entry->next_prio	= next->prio;
+		/* XXX SCHED_DEADLINE */
 	),
 
 	TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
@@ -181,7 +182,7 @@ TRACE_EVENT(sched_migrate_task,
 	TP_fast_assign(
 		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
 		__entry->pid		= p->pid;
-		__entry->prio		= p->prio;
+		__entry->prio		= p->prio; /* XXX SCHED_DEADLINE */
 		__entry->orig_cpu	= task_cpu(p);
 		__entry->dest_cpu	= dest_cpu;
 	),
@@ -206,7 +207,7 @@ DECLARE_EVENT_CLASS(sched_process_templa
 	TP_fast_assign(
 		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
 		__entry->pid		= p->pid;
-		__entry->prio		= p->prio;
+		__entry->prio		= p->prio; /* XXX SCHED_DEADLINE */
 	),
 
 	TP_printk("comm=%s pid=%d prio=%d",
@@ -253,7 +254,7 @@ TRACE_EVENT(sched_process_wait,
 	TP_fast_assign(
 		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
 		__entry->pid		= pid_nr(pid);
-		__entry->prio		= current->prio;
+		__entry->prio		= current->prio; /* XXX SCHED_DEADLINE */
 	),
 
 	TP_printk("comm=%s pid=%d prio=%d",
@@ -413,9 +414,9 @@ DEFINE_EVENT(sched_stat_runtime, sched_s
  */
 TRACE_EVENT(sched_pi_setprio,
 
-	TP_PROTO(struct task_struct *tsk, int newprio),
+	TP_PROTO(struct task_struct *tsk, struct task_struct *pi_task),
 
-	TP_ARGS(tsk, newprio),
+	TP_ARGS(tsk, pi_task),
 
 	TP_STRUCT__entry(
 		__array( char,	comm,	TASK_COMM_LEN	)
@@ -428,7 +429,8 @@ TRACE_EVENT(sched_pi_setprio,
 		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
 		__entry->pid		= tsk->pid;
 		__entry->oldprio	= tsk->prio;
-		__entry->newprio	= newprio;
+		__entry->newprio	= pi_task ? pi_task->prio : tsk->prio;
+		/* XXX SCHED_DEADLINE bits missing */
 	),
 
 	TP_printk("comm=%s pid=%d oldprio=%d newprio=%d",
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3634,7 +3634,7 @@ void rt_mutex_setprio(struct task_struct
 		goto out_unlock;
 	}
 
-	trace_sched_pi_setprio(p, prio); /* broken */
+	trace_sched_pi_setprio(p, pi_task);
 	oldprio = p->prio;
 
 	if (oldprio == prio)

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH -v3 7/8] rtmutex: Fix PI chain order integrity
  2017-03-23 14:56 [PATCH -v3 0/8] PI vs SCHED_DEADLINE fixes Peter Zijlstra
                   ` (5 preceding siblings ...)
  2017-03-23 14:56 ` [PATCH -v3 6/8] sched,tracing: Update trace_sched_pi_setprio() Peter Zijlstra
@ 2017-03-23 14:56 ` Peter Zijlstra
  2017-04-04  9:52   ` [tip:locking/core] " tip-bot for Peter Zijlstra
  2017-03-23 14:56 ` [PATCH -v3 8/8] rtmutex: Fix more prio comparisons Peter Zijlstra
  7 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2017-03-23 14:56 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz

[-- Attachment #1: peterz-cleanup-rt-mutex-5.patch --]
[-- Type: text/plain, Size: 3489 bytes --]

rt_mutex_waiter::prio is a copy of task_struct::prio which is updated
during the PI chain walk, such that the PI chain order isn't messed up
by (asynchronous) task state updates.

Currently rt_mutex_waiter_less() uses task state for deadline tasks;
this is broken, since the task state can, as said above, change
asynchronously, causing the RB tree order to change without actual
tree update -> FAIL.

Fix this by also copying the deadline into the rt_mutex_waiter state
and updating it along with its prio field.

Ideally we would also force PI chain updates whenever DL tasks update
their deadline parameters, but as a first approximation this is less
broken than it was.
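
The snapshot pattern, sketched from the hunks below: the copies are
refreshed only under lock->wait_lock while the waiter is dequeued, and
ordering decisions then use only those stable copies:

    waiter->prio = task->prio;
    waiter->deadline = task->dl.deadline;
    rt_mutex_enqueue(lock, waiter);

    /* in rt_mutex_waiter_less() */
    if (dl_prio(left->prio))
        return dl_time_before(left->deadline, right->deadline);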

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/locking/rtmutex.c        |   29 +++++++++++++++++++++++++++--
 kernel/locking/rtmutex_common.h |    1 +
 2 files changed, 28 insertions(+), 2 deletions(-)

--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -172,8 +172,7 @@ rt_mutex_waiter_less(struct rt_mutex_wai
 	 * then right waiter has a dl_prio() too.
 	 */
 	if (dl_prio(left->prio))
-		return dl_time_before(left->task->dl.deadline,
-				      right->task->dl.deadline);
+		return dl_time_before(left->deadline, right->deadline);
 
 	return 0;
 }
@@ -584,7 +583,26 @@ static int rt_mutex_adjust_prio_chain(st
 
 	/* [7] Requeue the waiter in the lock waiter tree. */
 	rt_mutex_dequeue(lock, waiter);
+
+	/*
+	 * Update the waiter prio fields now that we're dequeued.
+	 *
+	 * These values can have changed through either:
+	 *
+	 *   sys_sched_set_scheduler() / sys_sched_setattr()
+	 *
+	 * or
+	 *
+	 *   DL CBS enforcement advancing the effective deadline.
+	 *
+	 * Even though pi_waiters also uses these fields, and that tree is only
+	 * updated in [11], we can do this here, since we hold [L], which
+	 * serializes all pi_waiters access and rb_erase() does not care about
+	 * the values of the node being removed.
+	 */
 	waiter->prio = task->prio;
+	waiter->deadline = task->dl.deadline;
+
 	rt_mutex_enqueue(lock, waiter);
 
 	/* [8] Release the task */
@@ -711,6 +729,8 @@ static int rt_mutex_adjust_prio_chain(st
 static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 				struct rt_mutex_waiter *waiter)
 {
+	lockdep_assert_held(&lock->wait_lock);
+
 	/*
 	 * Before testing whether we can acquire @lock, we set the
 	 * RT_MUTEX_HAS_WAITERS bit in @lock->owner. This forces all
@@ -838,6 +858,8 @@ static int task_blocks_on_rt_mutex(struc
 	struct rt_mutex *next_lock;
 	int chain_walk = 0, res;
 
+	lockdep_assert_held(&lock->wait_lock);
+
 	/*
 	 * Early deadlock detection. We really don't want the task to
 	 * enqueue on itself just to untangle the mess later. It's not
@@ -855,6 +877,7 @@ static int task_blocks_on_rt_mutex(struc
 	waiter->task = task;
 	waiter->lock = lock;
 	waiter->prio = task->prio;
+	waiter->deadline = task->dl.deadline;
 
 	/* Get the top priority waiter on the lock */
 	if (rt_mutex_has_waiters(lock))
@@ -972,6 +995,8 @@ static void remove_waiter(struct rt_mute
 	struct task_struct *owner = rt_mutex_owner(lock);
 	struct rt_mutex *next_lock;
 
+	lockdep_assert_held(&lock->wait_lock);
+
 	raw_spin_lock(&current->pi_lock);
 	rt_mutex_dequeue(lock, waiter);
 	current->pi_blocked_on = NULL;
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -33,6 +33,7 @@ struct rt_mutex_waiter {
 	struct rt_mutex		*deadlock_lock;
 #endif
 	int prio;
+	u64 deadline;
 };
 
 /*

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH -v3 8/8] rtmutex: Fix more prio comparisons
  2017-03-23 14:56 [PATCH -v3 0/8] PI vs SCHED_DEADLINE fixes Peter Zijlstra
                   ` (6 preceding siblings ...)
  2017-03-23 14:56 ` [PATCH -v3 7/8] rtmutex: Fix PI chain order integrity Peter Zijlstra
@ 2017-03-23 14:56 ` Peter Zijlstra
  2017-04-04  9:52   ` [tip:locking/core] " tip-bot for Peter Zijlstra
  7 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2017-03-23 14:56 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz

[-- Attachment #1: peterz-cleanup-rt-mutex-6.patch --]
[-- Type: text/plain, Size: 2701 bytes --]

There was a pure ->prio comparison left in try_to_take_rt_mutex();
convert it to use rt_mutex_waiter_less(), noting that greater-or-equal
is not-less (both in the kernel priority view).

This necessitated the introduction of cmp_task(), which creates a
pointer to an unnamed stack variable of struct rt_mutex_waiter type to
compare against tasks.

With this, we can now also create and employ rt_mutex_waiter_equal().
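
For example (both uses appear in the hunks below):

    /* @task loses unless it is strictly less, i.e. higher prio (kernel view) */
    if (!rt_mutex_waiter_less(cmp_task(task), rt_mutex_top_waiter(lock)))
        return 0;

    /* "did anything change?" check in rt_mutex_adjust_pi() */
    if (!waiter || rt_mutex_waiter_equal(waiter, cmp_task(task)))
        return;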

Reviewed-and-tested-by: Juri Lelli <juri.lelli@arm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/locking/rtmutex.c |   32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -158,6 +158,12 @@ static inline bool unlock_rt_mutex_safe(
 }
 #endif
 
+/*
+ * Only use with rt_mutex_waiter_{less,equal}()
+ */
+#define cmp_task(p)	\
+	&(struct rt_mutex_waiter){ .prio = (p)->prio, .deadline = (p)->dl.deadline }
+
 static inline int
 rt_mutex_waiter_less(struct rt_mutex_waiter *left,
 		     struct rt_mutex_waiter *right)
@@ -177,6 +183,25 @@ rt_mutex_waiter_less(struct rt_mutex_wai
 	return 0;
 }
 
+static inline int
+rt_mutex_waiter_equal(struct rt_mutex_waiter *left,
+		      struct rt_mutex_waiter *right)
+{
+	if (left->prio != right->prio)
+		return 0;
+
+	/*
+	 * If both waiters have dl_prio(), we check the deadlines of the
+	 * associated tasks.
+	 * If left waiter has a dl_prio(), and we didn't return 0 above,
+	 * then right waiter has a dl_prio() too.
+	 */
+	if (dl_prio(left->prio))
+		return left->deadline == right->deadline;
+
+	return 1;
+}
+
 static void
 rt_mutex_enqueue(struct rt_mutex *lock, struct rt_mutex_waiter *waiter)
 {
@@ -487,7 +512,7 @@ static int rt_mutex_adjust_prio_chain(st
 	 * enabled we continue, but stop the requeueing in the chain
 	 * walk.
 	 */
-	if (waiter->prio == task->prio && !dl_task(task)) {
+	if (rt_mutex_waiter_equal(waiter, cmp_task(task))) {
 		if (!detect_deadlock)
 			goto out_unlock_pi;
 		else
@@ -790,7 +815,8 @@ static int try_to_take_rt_mutex(struct r
 			 * the top waiter priority (kernel view),
 			 * @task lost.
 			 */
-			if (task->prio >= rt_mutex_top_waiter(lock)->prio)
+			if (!rt_mutex_waiter_less(cmp_task(task),
+						  rt_mutex_top_waiter(lock)))
 				return 0;
 
 			/*
@@ -1055,7 +1081,7 @@ void rt_mutex_adjust_pi(struct task_stru
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
 
 	waiter = task->pi_blocked_on;
-	if (!waiter || (waiter->prio == task->prio && !dl_prio(task->prio))) {
+	if (!waiter || rt_mutex_waiter_equal(waiter, cmp_task(task))) {
 		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 		return;
 	}

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH -v3 4/8] rtmutex: Clean up
  2017-03-23 14:56 ` [PATCH -v3 4/8] rtmutex: Clean up Peter Zijlstra
@ 2017-03-23 18:21   ` Steven Rostedt
  2017-04-04  9:50   ` [tip:locking/core] " tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 46+ messages in thread
From: Steven Rostedt @ 2017-03-23 18:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Thu, 23 Mar 2017 15:56:10 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> Previous patches changed the meaning of the return value of
> rt_mutex_slowunlock(); update comments and code to reflect this.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---


Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

-- Steve

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [tip:locking/core] rtmutex: Deboost before waking up the top waiter
  2017-03-23 14:56 ` [PATCH -v3 1/8] rtmutex: Deboost before waking up the top waiter Peter Zijlstra
@ 2017-04-04  9:48   ` tip-bot for Xunlei Pang
  2017-04-05  8:08     ` Mike Galbraith
  0 siblings, 1 reply; 46+ messages in thread
From: tip-bot for Xunlei Pang @ 2017-04-04  9:48 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: mingo, linux-kernel, rostedt, xlpang, tglx, peterz, hpa

Commit-ID:  2a1c6029940675abb2217b590512dbf691867ec4
Gitweb:     http://git.kernel.org/tip/2a1c6029940675abb2217b590512dbf691867ec4
Author:     Xunlei Pang <xlpang@redhat.com>
AuthorDate: Thu, 23 Mar 2017 15:56:07 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Tue, 4 Apr 2017 11:44:05 +0200

rtmutex: Deboost before waking up the top waiter

We should deboost before waking the high-priority task, such that we
don't run two tasks with the same "state" (priority, deadline,
sched_class, etc).

In order to make sure the boosting task doesn't start running between
unlock and deboost (due to a 'spurious' wakeup), we move the deboost
under the wait_lock; that way it's serialized against the wait loop in
__rt_mutex_slowlock().

Doing the deboost early can however lead to priority inversion if
current would get preempted after the deboost but before waking our
high-prio task, hence we disable preemption before doing the deboost
and enable it after the wakeup is over.

This gets us the right semantic order, but most importantly, this
change ensures pointer stability for the next patch, where we
have rt_mutex_setprio() cache a pointer to the top-most waiter task.
If we, as before this change, do the wakeup first and then deboost,
this pointer might point into thin air.

[peterz: Changelog + patch munging]
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.110065320@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/futex.c                  |  5 +---
 kernel/locking/rtmutex.c        | 59 ++++++++++++++++++++++-------------------
 kernel/locking/rtmutex_common.h |  2 +-
 3 files changed, 34 insertions(+), 32 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 628be42..414a30d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1460,10 +1460,7 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_
 out_unlock:
 	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 
-	if (deboost) {
-		wake_up_q(&wake_q);
-		rt_mutex_adjust_prio(current);
-	}
+	rt_mutex_postunlock(&wake_q, deboost);
 
 	return ret;
 }
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index dd10312..71ecf06 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -373,24 +373,6 @@ static void __rt_mutex_adjust_prio(struct task_struct *task)
 }
 
 /*
- * Adjust task priority (undo boosting). Called from the exit path of
- * rt_mutex_slowunlock() and rt_mutex_slowlock().
- *
- * (Note: We do this outside of the protection of lock->wait_lock to
- * allow the lock to be taken while or before we readjust the priority
- * of task. We do not use the spin_xx_mutex() variants here as we are
- * outside of the debug path.)
- */
-void rt_mutex_adjust_prio(struct task_struct *task)
-{
-	unsigned long flags;
-
-	raw_spin_lock_irqsave(&task->pi_lock, flags);
-	__rt_mutex_adjust_prio(task);
-	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
-}
-
-/*
  * Deadlock detection is conditional:
  *
  * If CONFIG_DEBUG_RT_MUTEXES=n, deadlock detection is only conducted
@@ -1051,6 +1033,7 @@ static void mark_wakeup_next_waiter(struct wake_q_head *wake_q,
 	 * lock->wait_lock.
 	 */
 	rt_mutex_dequeue_pi(current, waiter);
+	__rt_mutex_adjust_prio(current);
 
 	/*
 	 * As we are waking up the top waiter, and the waiter stays
@@ -1393,6 +1376,16 @@ static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock,
 	 */
 	mark_wakeup_next_waiter(wake_q, lock);
 
+	/*
+	 * We should deboost before waking the top waiter task such that
+	 * we don't run two tasks with the 'same' priority. This however
+	 * can lead to prio-inversion if we would get preempted after
+	 * the deboost but before waking our high-prio task, hence the
+	 * preempt_disable before unlock. Pairs with preempt_enable() in
+	 * rt_mutex_postunlock();
+	 */
+	preempt_disable();
+
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 	/* check PI boosting */
@@ -1442,6 +1435,18 @@ rt_mutex_fasttrylock(struct rt_mutex *lock,
 	return slowfn(lock);
 }
 
+/*
+ * Undo pi boosting (if necessary) and wake top waiter.
+ */
+void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost)
+{
+	wake_up_q(wake_q);
+
+	/* Pairs with preempt_disable() in rt_mutex_slowunlock() */
+	if (deboost)
+		preempt_enable();
+}
+
 static inline void
 rt_mutex_fastunlock(struct rt_mutex *lock,
 		    bool (*slowfn)(struct rt_mutex *lock,
@@ -1455,11 +1460,7 @@ rt_mutex_fastunlock(struct rt_mutex *lock,
 
 	deboost = slowfn(lock, &wake_q);
 
-	wake_up_q(&wake_q);
-
-	/* Undo pi boosting if necessary: */
-	if (deboost)
-		rt_mutex_adjust_prio(current);
+	rt_mutex_postunlock(&wake_q, deboost);
 }
 
 /**
@@ -1572,6 +1573,13 @@ bool __sched __rt_mutex_futex_unlock(struct rt_mutex *lock,
 	}
 
 	mark_wakeup_next_waiter(wake_q, lock);
+	/*
+	 * We've already deboosted, retain preempt_disabled when dropping
+	 * the wait_lock to avoid inversion until the wakeup. Matched
+	 * by rt_mutex_postunlock();
+	 */
+	preempt_disable();
+
 	return true; /* deboost and wakeups */
 }
 
@@ -1584,10 +1592,7 @@ void __sched rt_mutex_futex_unlock(struct rt_mutex *lock)
 	deboost = __rt_mutex_futex_unlock(lock, &wake_q);
 	raw_spin_unlock_irq(&lock->wait_lock);
 
-	if (deboost) {
-		wake_up_q(&wake_q);
-		rt_mutex_adjust_prio(current);
-	}
+	rt_mutex_postunlock(&wake_q, deboost);
 }
 
 /**
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index b1ccfea..a09c029 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -122,7 +122,7 @@ extern void rt_mutex_futex_unlock(struct rt_mutex *lock);
 extern bool __rt_mutex_futex_unlock(struct rt_mutex *lock,
 				 struct wake_q_head *wqh);
 
-extern void rt_mutex_adjust_prio(struct task_struct *task);
+extern void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost);
 
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 # include "rtmutex-debug.h"

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [tip:locking/core] sched/rtmutex/deadline: Fix a PI crash for deadline tasks
  2017-03-23 14:56 ` [PATCH -v3 2/8] sched/rtmutex/deadline: Fix a PI crash for deadline tasks Peter Zijlstra
@ 2017-04-04  9:49   ` tip-bot for Xunlei Pang
  0 siblings, 0 replies; 46+ messages in thread
From: tip-bot for Xunlei Pang @ 2017-04-04  9:49 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: mingo, linux-kernel, hpa, peterz, xlpang, tglx, rostedt

Commit-ID:  e96a7705e7d3fef96aec9b590c63b2f6f7d2ba22
Gitweb:     http://git.kernel.org/tip/e96a7705e7d3fef96aec9b590c63b2f6f7d2ba22
Author:     Xunlei Pang <xlpang@redhat.com>
AuthorDate: Thu, 23 Mar 2017 15:56:08 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Tue, 4 Apr 2017 11:44:05 +0200

sched/rtmutex/deadline: Fix a PI crash for deadline tasks

A crash happened while I was playing with deadline PI rtmutex.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: [<ffffffff810eeb8f>] rt_mutex_get_top_task+0x1f/0x30
    PGD 232a75067 PUD 230947067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 1 PID: 10994 Comm: a.out Not tainted

    Call Trace:
    [<ffffffff810b658c>] enqueue_task+0x2c/0x80
    [<ffffffff810ba763>] activate_task+0x23/0x30
    [<ffffffff810d0ab5>] pull_dl_task+0x1d5/0x260
    [<ffffffff810d0be6>] pre_schedule_dl+0x16/0x20
    [<ffffffff8164e783>] __schedule+0xd3/0x900
    [<ffffffff8164efd9>] schedule+0x29/0x70
    [<ffffffff8165035b>] __rt_mutex_slowlock+0x4b/0xc0
    [<ffffffff81650501>] rt_mutex_slowlock+0xd1/0x190
    [<ffffffff810eeb33>] rt_mutex_timed_lock+0x53/0x60
    [<ffffffff810ecbfc>] futex_lock_pi.isra.18+0x28c/0x390
    [<ffffffff810ed8b0>] do_futex+0x190/0x5b0
    [<ffffffff810edd50>] SyS_futex+0x80/0x180

This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
are only protected by pi_lock when operating on pi waiters, while
rt_mutex_get_top_task() will access them with the rq lock held but
without holding pi_lock.

In order to tackle it, we introduce a new "pi_top_task" pointer
cached in task_struct, and add a new rt_mutex_update_top_task()
to update its value. It is called from rt_mutex_setprio(), which
holds both the owner's pi_lock and the rq lock. Thus "pi_top_task"
can be safely accessed by enqueue_task_dl() under the rq lock.

Originally-From: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.157682758@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 include/linux/init_task.h |  1 +
 include/linux/sched.h     |  2 ++
 include/linux/sched/rt.h  |  1 +
 kernel/fork.c             |  1 +
 kernel/locking/rtmutex.c  | 29 +++++++++++++++++++++--------
 kernel/sched/core.c       |  2 ++
 6 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 91d9049..2c487e0 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -181,6 +181,7 @@ extern struct cred init_cred;
 #ifdef CONFIG_RT_MUTEXES
 # define INIT_RT_MUTEXES(tsk)						\
 	.pi_waiters = RB_ROOT,						\
+	.pi_top_task = NULL,						\
 	.pi_waiters_leftmost = NULL,
 #else
 # define INIT_RT_MUTEXES(tsk)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d67eee8..1ea2eee 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -775,6 +775,8 @@ struct task_struct {
 	/* PI waiters blocked on a rt_mutex held by this task: */
 	struct rb_root			pi_waiters;
 	struct rb_node			*pi_waiters_leftmost;
+	/* Updated under owner's pi_lock and rq lock */
+	struct task_struct		*pi_top_task;
 	/* Deadlock detection and priority inheritance handling: */
 	struct rt_mutex_waiter		*pi_blocked_on;
 #endif
diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
index 3bd66841..10ee7ee 100644
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -21,6 +21,7 @@ static inline int rt_task(struct task_struct *p)
 extern int rt_mutex_getprio(struct task_struct *p);
 extern void rt_mutex_setprio(struct task_struct *p, int prio);
 extern int rt_mutex_get_effective_prio(struct task_struct *task, int newprio);
+extern void rt_mutex_update_top_task(struct task_struct *p);
 extern struct task_struct *rt_mutex_get_top_task(struct task_struct *task);
 extern void rt_mutex_adjust_pi(struct task_struct *p);
 static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
diff --git a/kernel/fork.c b/kernel/fork.c
index 6c463c80..b30196a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1438,6 +1438,7 @@ static void rt_mutex_init_task(struct task_struct *p)
 #ifdef CONFIG_RT_MUTEXES
 	p->pi_waiters = RB_ROOT;
 	p->pi_waiters_leftmost = NULL;
+	p->pi_top_task = NULL;
 	p->pi_blocked_on = NULL;
 #endif
 }
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 71ecf06..bc05b10 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -323,6 +323,19 @@ rt_mutex_dequeue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
 }
 
 /*
+ * Must hold both p->pi_lock and task_rq(p)->lock.
+ */
+void rt_mutex_update_top_task(struct task_struct *p)
+{
+	if (!task_has_pi_waiters(p)) {
+		p->pi_top_task = NULL;
+		return;
+	}
+
+	p->pi_top_task = task_top_pi_waiter(p)->task;
+}
+
+/*
  * Calculate task priority from the waiter tree priority
  *
  * Return task->normal_prio when the waiter tree is empty or when
@@ -337,12 +350,12 @@ int rt_mutex_getprio(struct task_struct *task)
 		   task->normal_prio);
 }
 
+/*
+ * Must hold either p->pi_lock or task_rq(p)->lock.
+ */
 struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
 {
-	if (likely(!task_has_pi_waiters(task)))
-		return NULL;
-
-	return task_top_pi_waiter(task)->task;
+	return task->pi_top_task;
 }
 
 /*
@@ -351,12 +364,12 @@ struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
  */
 int rt_mutex_get_effective_prio(struct task_struct *task, int newprio)
 {
-	if (!task_has_pi_waiters(task))
+	struct task_struct *top_task = rt_mutex_get_top_task(task);
+
+	if (!top_task)
 		return newprio;
 
-	if (task_top_pi_waiter(task)->task->prio <= newprio)
-		return task_top_pi_waiter(task)->task->prio;
-	return newprio;
+	return min(top_task->prio, newprio);
 }
 
 /*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ab9f6ac..e1f44ec 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3713,6 +3713,8 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 		goto out_unlock;
 	}
 
+	rt_mutex_update_top_task(p);
+
 	trace_sched_pi_setprio(p, prio);
 	oldprio = p->prio;
 

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [tip:locking/core] sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update
  2017-03-23 14:56 ` [PATCH -v3 3/8] sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update Peter Zijlstra
@ 2017-04-04  9:50   ` tip-bot for Xunlei Pang
  0 siblings, 0 replies; 46+ messages in thread
From: tip-bot for Xunlei Pang @ 2017-04-04  9:50 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: peterz, hpa, rostedt, tglx, xlpang, linux-kernel, mingo

Commit-ID:  85e2d4f992868ad78dc8bb2c077b652fcfb3661a
Gitweb:     http://git.kernel.org/tip/85e2d4f992868ad78dc8bb2c077b652fcfb3661a
Author:     Xunlei Pang <xlpang@redhat.com>
AuthorDate: Thu, 23 Mar 2017 15:56:09 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Tue, 4 Apr 2017 11:44:05 +0200

sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update

Currently dl tasks take the early exit in rt_mutex_adjust_prio_chain()
in the !detect_deadlock case:

    if (waiter->prio == task->prio) {
        if (!detect_deadlock)
            goto out_unlock_pi; // out here
        else
            requeue = false;
    }

As long as a blocked deadline task (waiter) does not change its
sched_class (so its prio does not change), its deadline value never
changes either, which makes the early exit seem reasonable. It
misses, however, the chance to update the rt_mutex_waiter's
"dl_runtime(period)_copy" when a waiter updates its deadline
parameters (dl_runtime, dl_period), or when a boosted waiter
changes to a !deadline class.

Thus, keep deadline tasks in the chain walk by adding a
!dl_task(task) condition to the early exit.
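
As a concrete trigger (user-space sketch, illustrative values; struct
layout per the sched_setattr(2) ABI), think of a blocked deadline
waiter having its parameters tightened at run time -- exactly the
update the early exit used to swallow:

    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* struct sched_attr is often missing from libc headers. */
    struct sched_attr {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;
        uint64_t sched_deadline;
        uint64_t sched_period;
    };

    #ifndef SCHED_DEADLINE
    #define SCHED_DEADLINE 6
    #endif

    static int set_dl_params(pid_t pid, uint64_t runtime_ns, uint64_t period_ns)
    {
        struct sched_attr attr = {
            .size           = sizeof(attr),
            .sched_policy   = SCHED_DEADLINE,
            .sched_runtime  = runtime_ns,
            .sched_deadline = period_ns,
            .sched_period   = period_ns,
        };

        return syscall(SYS_sched_setattr, pid, &attr, 0);
    }

    /* e.g. tighten a waiter from 10ms/100ms to 10ms/50ms:
     *   set_dl_params(pid, 10000000ULL, 50000000ULL);
     */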

Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/1460633827-345-7-git-send-email-xlpang@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.206577901@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/locking/rtmutex.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index bc05b10..8faf472 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -605,7 +605,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 	 * enabled we continue, but stop the requeueing in the chain
 	 * walk.
 	 */
-	if (waiter->prio == task->prio) {
+	if (waiter->prio == task->prio && !dl_task(task)) {
 		if (!detect_deadlock)
 			goto out_unlock_pi;
 		else

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [tip:locking/core] rtmutex: Clean up
  2017-03-23 14:56 ` [PATCH -v3 4/8] rtmutex: Clean up Peter Zijlstra
  2017-03-23 18:21   ` Steven Rostedt
@ 2017-04-04  9:50   ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 46+ messages in thread
From: tip-bot for Peter Zijlstra @ 2017-04-04  9:50 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: tglx, linux-kernel, peterz, hpa, mingo

Commit-ID:  aa2bfe55366552cb7e93e8709d66e698d79ccc47
Gitweb:     http://git.kernel.org/tip/aa2bfe55366552cb7e93e8709d66e698d79ccc47
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 23 Mar 2017 15:56:10 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Tue, 4 Apr 2017 11:44:05 +0200

rtmutex: Clean up

Previous patches changed the meaning of the return value of
rt_mutex_slowunlock(); update comments and code to reflect this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.255058238@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/futex.c                  |  7 ++++---
 kernel/locking/rtmutex.c        | 28 +++++++++++++---------------
 kernel/locking/rtmutex_common.h |  2 +-
 3 files changed, 18 insertions(+), 19 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 414a30d..c3eebcd 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1394,7 +1394,7 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_
 {
 	u32 uninitialized_var(curval), newval;
 	struct task_struct *new_owner;
-	bool deboost = false;
+	bool postunlock = false;
 	DEFINE_WAKE_Q(wake_q);
 	int ret = 0;
 
@@ -1455,12 +1455,13 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_
 	/*
 	 * We've updated the uservalue, this unlock cannot fail.
 	 */
-	deboost = __rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
+	postunlock = __rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
 
 out_unlock:
 	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 
-	rt_mutex_postunlock(&wake_q, deboost);
+	if (postunlock)
+		rt_mutex_postunlock(&wake_q);
 
 	return ret;
 }
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 8faf472..4b1015e 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1330,7 +1330,8 @@ static inline int rt_mutex_slowtrylock(struct rt_mutex *lock)
 
 /*
  * Slow path to release a rt-mutex.
- * Return whether the current task needs to undo a potential priority boosting.
+ *
+ * Return whether the current task needs to call rt_mutex_postunlock().
  */
 static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock,
 					struct wake_q_head *wake_q)
@@ -1401,8 +1402,7 @@ static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock,
 
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
-	/* check PI boosting */
-	return true;
+	return true; /* call rt_mutex_postunlock() */
 }
 
 /*
@@ -1449,15 +1449,14 @@ rt_mutex_fasttrylock(struct rt_mutex *lock,
 }
 
 /*
- * Undo pi boosting (if necessary) and wake top waiter.
+ * Performs the wakeup of the top-waiter and re-enables preemption.
  */
-void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost)
+void rt_mutex_postunlock(struct wake_q_head *wake_q)
 {
 	wake_up_q(wake_q);
 
 	/* Pairs with preempt_disable() in rt_mutex_slowunlock() */
-	if (deboost)
-		preempt_enable();
+	preempt_enable();
 }
 
 static inline void
@@ -1466,14 +1465,12 @@ rt_mutex_fastunlock(struct rt_mutex *lock,
 				   struct wake_q_head *wqh))
 {
 	DEFINE_WAKE_Q(wake_q);
-	bool deboost;
 
 	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL)))
 		return;
 
-	deboost = slowfn(lock, &wake_q);
-
-	rt_mutex_postunlock(&wake_q, deboost);
+	if (slowfn(lock, &wake_q))
+		rt_mutex_postunlock(&wake_q);
 }
 
 /**
@@ -1593,19 +1590,20 @@ bool __sched __rt_mutex_futex_unlock(struct rt_mutex *lock,
 	 */
 	preempt_disable();
 
-	return true; /* deboost and wakeups */
+	return true; /* call postunlock() */
 }
 
 void __sched rt_mutex_futex_unlock(struct rt_mutex *lock)
 {
 	DEFINE_WAKE_Q(wake_q);
-	bool deboost;
+	bool postunlock;
 
 	raw_spin_lock_irq(&lock->wait_lock);
-	deboost = __rt_mutex_futex_unlock(lock, &wake_q);
+	postunlock = __rt_mutex_futex_unlock(lock, &wake_q);
 	raw_spin_unlock_irq(&lock->wait_lock);
 
-	rt_mutex_postunlock(&wake_q, deboost);
+	if (postunlock)
+		rt_mutex_postunlock(&wake_q);
 }
 
 /**
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index a09c029..9e36aed 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -122,7 +122,7 @@ extern void rt_mutex_futex_unlock(struct rt_mutex *lock);
 extern bool __rt_mutex_futex_unlock(struct rt_mutex *lock,
 				 struct wake_q_head *wqh);
 
-extern void rt_mutex_postunlock(struct wake_q_head *wake_q, bool deboost);
+extern void rt_mutex_postunlock(struct wake_q_head *wake_q);
 
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 # include "rtmutex-debug.h"

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [tip:locking/core] sched/rtmutex: Refactor rt_mutex_setprio()
  2017-03-23 14:56 ` [PATCH -v3 5/8] sched/rtmutex: Refactor rt_mutex_setprio() Peter Zijlstra
@ 2017-04-04  9:51   ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: tip-bot for Peter Zijlstra @ 2017-04-04  9:51 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: hpa, linux-kernel, peterz, tglx, mingo

Commit-ID:  acd58620e415aee4a43a808d7d2fd87259ee0001
Gitweb:     http://git.kernel.org/tip/acd58620e415aee4a43a808d7d2fd87259ee0001
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 23 Mar 2017 15:56:11 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Tue, 4 Apr 2017 11:44:06 +0200

sched/rtmutex: Refactor rt_mutex_setprio()

With the introduction of SCHED_DEADLINE the whole notion that
priority is a single number is gone; therefore the @prio argument
to rt_mutex_setprio() doesn't make sense anymore.

So rework the code to pass a pi_task instead.

Note this also fixes a problem with pi_top_task caching; previously
we would not set the pointer (not call rt_mutex_update_top_task) if
the priority didn't change, which could lead to a stale pointer.

As for the XXX, I think it's fine to use pi_task->prio, because if
it differs from waiter->prio, a PI chain update is imminent.
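
A standalone sketch of why a single integer is no longer enough
(illustrative types; mirrors the rt_mutex_waiter_less() logic): all
SCHED_DEADLINE tasks share one dl_prio() value, so ties can only be
broken by comparing absolute deadlines.

    #include <stdbool.h>
    #include <stdint.h>

    struct waiter {
        int      prio;      /* lower value == higher priority */
        uint64_t deadline;  /* absolute deadline, in ns */
    };

    /* Stand-in for dl_prio(): deadline tasks have prio < 0. */
    static bool is_dl(int prio)
    {
        return prio < 0;
    }

    static bool waiter_less(const struct waiter *l, const struct waiter *r)
    {
        if (l->prio < r->prio)
            return true;
        if (l->prio > r->prio)
            return false;
        /* Equal prio: only the deadline can order deadline tasks. */
        if (is_dl(l->prio))
            return l->deadline < r->deadline;
        return false;
    }

    int main(void)
    {
        struct waiter a = { .prio = -1, .deadline = 100 };
        struct waiter b = { .prio = -1, .deadline = 200 };

        /* Same prio, yet a is strictly more urgent than b. */
        return waiter_less(&a, &b) ? 0 : 1;
    }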

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.303827095@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 include/linux/sched/rt.h |  24 ++++------
 kernel/locking/rtmutex.c | 112 +++++++++++++----------------------------------
 kernel/sched/core.c      |  66 ++++++++++++++++++++++------
 3 files changed, 91 insertions(+), 111 deletions(-)

diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
index 10ee7ee..f93329a 100644
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -18,28 +18,20 @@ static inline int rt_task(struct task_struct *p)
 }
 
 #ifdef CONFIG_RT_MUTEXES
-extern int rt_mutex_getprio(struct task_struct *p);
-extern void rt_mutex_setprio(struct task_struct *p, int prio);
-extern int rt_mutex_get_effective_prio(struct task_struct *task, int newprio);
-extern void rt_mutex_update_top_task(struct task_struct *p);
-extern struct task_struct *rt_mutex_get_top_task(struct task_struct *task);
+/*
+ * Must hold either p->pi_lock or task_rq(p)->lock.
+ */
+static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *p)
+{
+	return p->pi_top_task;
+}
+extern void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task);
 extern void rt_mutex_adjust_pi(struct task_struct *p);
 static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
 {
 	return tsk->pi_blocked_on != NULL;
 }
 #else
-static inline int rt_mutex_getprio(struct task_struct *p)
-{
-	return p->normal_prio;
-}
-
-static inline int rt_mutex_get_effective_prio(struct task_struct *task,
-					      int newprio)
-{
-	return newprio;
-}
-
 static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
 {
 	return NULL;
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 4b1015e..00b49cd 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -322,67 +322,16 @@ rt_mutex_dequeue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
 	RB_CLEAR_NODE(&waiter->pi_tree_entry);
 }
 
-/*
- * Must hold both p->pi_lock and task_rq(p)->lock.
- */
-void rt_mutex_update_top_task(struct task_struct *p)
-{
-	if (!task_has_pi_waiters(p)) {
-		p->pi_top_task = NULL;
-		return;
-	}
-
-	p->pi_top_task = task_top_pi_waiter(p)->task;
-}
-
-/*
- * Calculate task priority from the waiter tree priority
- *
- * Return task->normal_prio when the waiter tree is empty or when
- * the waiter is not allowed to do priority boosting
- */
-int rt_mutex_getprio(struct task_struct *task)
-{
-	if (likely(!task_has_pi_waiters(task)))
-		return task->normal_prio;
-
-	return min(task_top_pi_waiter(task)->prio,
-		   task->normal_prio);
-}
-
-/*
- * Must hold either p->pi_lock or task_rq(p)->lock.
- */
-struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
-{
-	return task->pi_top_task;
-}
-
-/*
- * Called by sched_setscheduler() to get the priority which will be
- * effective after the change.
- */
-int rt_mutex_get_effective_prio(struct task_struct *task, int newprio)
+static void rt_mutex_adjust_prio(struct task_struct *p)
 {
-	struct task_struct *top_task = rt_mutex_get_top_task(task);
+	struct task_struct *pi_task = NULL;
 
-	if (!top_task)
-		return newprio;
+	lockdep_assert_held(&p->pi_lock);
 
-	return min(top_task->prio, newprio);
-}
+	if (task_has_pi_waiters(p))
+		pi_task = task_top_pi_waiter(p)->task;
 
-/*
- * Adjust the priority of a task, after its pi_waiters got modified.
- *
- * This can be both boosting and unboosting. task->pi_lock must be held.
- */
-static void __rt_mutex_adjust_prio(struct task_struct *task)
-{
-	int prio = rt_mutex_getprio(task);
-
-	if (task->prio != prio || dl_prio(prio))
-		rt_mutex_setprio(task, prio);
+	rt_mutex_setprio(p, pi_task);
 }
 
 /*
@@ -742,7 +691,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 		 */
 		rt_mutex_dequeue_pi(task, prerequeue_top_waiter);
 		rt_mutex_enqueue_pi(task, waiter);
-		__rt_mutex_adjust_prio(task);
+		rt_mutex_adjust_prio(task);
 
 	} else if (prerequeue_top_waiter == waiter) {
 		/*
@@ -758,7 +707,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 		rt_mutex_dequeue_pi(task, waiter);
 		waiter = rt_mutex_top_waiter(lock);
 		rt_mutex_enqueue_pi(task, waiter);
-		__rt_mutex_adjust_prio(task);
+		rt_mutex_adjust_prio(task);
 	} else {
 		/*
 		 * Nothing changed. No need to do any priority
@@ -966,7 +915,7 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 		return -EDEADLK;
 
 	raw_spin_lock(&task->pi_lock);
-	__rt_mutex_adjust_prio(task);
+	rt_mutex_adjust_prio(task);
 	waiter->task = task;
 	waiter->lock = lock;
 	waiter->prio = task->prio;
@@ -988,7 +937,7 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 		rt_mutex_dequeue_pi(owner, top_waiter);
 		rt_mutex_enqueue_pi(owner, waiter);
 
-		__rt_mutex_adjust_prio(owner);
+		rt_mutex_adjust_prio(owner);
 		if (owner->pi_blocked_on)
 			chain_walk = 1;
 	} else if (rt_mutex_cond_detect_deadlock(waiter, chwalk)) {
@@ -1040,13 +989,14 @@ static void mark_wakeup_next_waiter(struct wake_q_head *wake_q,
 	waiter = rt_mutex_top_waiter(lock);
 
 	/*
-	 * Remove it from current->pi_waiters. We do not adjust a
-	 * possible priority boost right now. We execute wakeup in the
-	 * boosted mode and go back to normal after releasing
-	 * lock->wait_lock.
+	 * Remove it from current->pi_waiters and deboost.
+	 *
+	 * We must in fact deboost here in order to ensure we call
+	 * rt_mutex_setprio() to update p->pi_top_task before the
+	 * task unblocks.
 	 */
 	rt_mutex_dequeue_pi(current, waiter);
-	__rt_mutex_adjust_prio(current);
+	rt_mutex_adjust_prio(current);
 
 	/*
 	 * As we are waking up the top waiter, and the waiter stays
@@ -1058,9 +1008,19 @@ static void mark_wakeup_next_waiter(struct wake_q_head *wake_q,
 	 */
 	lock->owner = (void *) RT_MUTEX_HAS_WAITERS;
 
-	raw_spin_unlock(&current->pi_lock);
-
+	/*
+	 * We deboosted before waking the top waiter task such that we don't
+	 * run two tasks with the 'same' priority (and ensure the
+	 * p->pi_top_task pointer points to a blocked task). This however can
+	 * lead to priority inversion if we would get preempted after the
+	 * deboost but before waking our donor task, hence the preempt_disable()
+	 * before unlock.
+	 *
+	 * Pairs with preempt_enable() in rt_mutex_postunlock();
+	 */
+	preempt_disable();
 	wake_q_add(wake_q, waiter->task);
+	raw_spin_unlock(&current->pi_lock);
 }
 
 /*
@@ -1095,7 +1055,7 @@ static void remove_waiter(struct rt_mutex *lock,
 	if (rt_mutex_has_waiters(lock))
 		rt_mutex_enqueue_pi(owner, rt_mutex_top_waiter(lock));
 
-	__rt_mutex_adjust_prio(owner);
+	rt_mutex_adjust_prio(owner);
 
 	/* Store the lock on which owner is blocked or NULL */
 	next_lock = task_blocked_on_lock(owner);
@@ -1134,8 +1094,7 @@ void rt_mutex_adjust_pi(struct task_struct *task)
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
 
 	waiter = task->pi_blocked_on;
-	if (!waiter || (waiter->prio == task->prio &&
-			!dl_prio(task->prio))) {
+	if (!waiter || (waiter->prio == task->prio && !dl_prio(task->prio))) {
 		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 		return;
 	}
@@ -1389,17 +1348,6 @@ static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock,
 	 * Queue the next waiter for wakeup once we release the wait_lock.
 	 */
 	mark_wakeup_next_waiter(wake_q, lock);
-
-	/*
-	 * We should deboost before waking the top waiter task such that
-	 * we don't run two tasks with the 'same' priority. This however
-	 * can lead to prio-inversion if we would get preempted after
-	 * the deboost but before waking our high-prio task, hence the
-	 * preempt_disable before unlock. Pairs with preempt_enable() in
-	 * rt_mutex_postunlock();
-	 */
-	preempt_disable();
-
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 	return true; /* call rt_mutex_postunlock() */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e1f44ec..37b152a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3671,10 +3671,25 @@ EXPORT_SYMBOL(default_wake_function);
 
 #ifdef CONFIG_RT_MUTEXES
 
+static inline int __rt_effective_prio(struct task_struct *pi_task, int prio)
+{
+	if (pi_task)
+		prio = min(prio, pi_task->prio);
+
+	return prio;
+}
+
+static inline int rt_effective_prio(struct task_struct *p, int prio)
+{
+	struct task_struct *pi_task = rt_mutex_get_top_task(p);
+
+	return __rt_effective_prio(pi_task, prio);
+}
+
 /*
  * rt_mutex_setprio - set the current priority of a task
- * @p: task
- * @prio: prio value (kernel-internal form)
+ * @p: task to boost
+ * @pi_task: donor task
  *
  * This function changes the 'effective' priority of a task. It does
  * not touch ->normal_prio like __setscheduler().
@@ -3682,18 +3697,42 @@ EXPORT_SYMBOL(default_wake_function);
  * Used by the rt_mutex code to implement priority inheritance
  * logic. Call site only calls if the priority of the task changed.
  */
-void rt_mutex_setprio(struct task_struct *p, int prio)
+void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
 {
-	int oldprio, queued, running, queue_flag =
+	int prio, oldprio, queued, running, queue_flag =
 		DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
 	const struct sched_class *prev_class;
 	struct rq_flags rf;
 	struct rq *rq;
 
-	BUG_ON(prio > MAX_PRIO);
+	/* XXX used to be waiter->prio, not waiter->task->prio */
+	prio = __rt_effective_prio(pi_task, p->normal_prio);
+
+	/*
+	 * If nothing changed; bail early.
+	 */
+	if (p->pi_top_task == pi_task && prio == p->prio && !dl_prio(prio))
+		return;
 
 	rq = __task_rq_lock(p, &rf);
 	update_rq_clock(rq);
+	/*
+	 * Set under pi_lock && rq->lock, such that the value can be used under
+	 * either lock.
+	 *
+	 * Note that there is plenty of trickiness to make this pointer cache
+	 * work right. rt_mutex_slowunlock()+rt_mutex_postunlock() work together
+	 * to ensure a task is de-boosted (pi_task is set to NULL) before the
+	 * task is allowed to run again (and can exit). This ensures the pointer
+	 * points to a blocked task -- which guarantees the task is present.
+	 */
+	p->pi_top_task = pi_task;
+
+	/*
+	 * For FIFO/RR we only need to set prio; if that matches we're done.
+	 */
+	if (prio == p->prio && !dl_prio(prio))
+		goto out_unlock;
 
 	/*
 	 * Idle task boosting is a nono in general. There is one
@@ -3713,9 +3752,7 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 		goto out_unlock;
 	}
 
-	rt_mutex_update_top_task(p);
-
-	trace_sched_pi_setprio(p, prio);
+	trace_sched_pi_setprio(p, prio); /* broken */
 	oldprio = p->prio;
 
 	if (oldprio == prio)
@@ -3739,7 +3776,6 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 	 *          running task
 	 */
 	if (dl_prio(prio)) {
-		struct task_struct *pi_task = rt_mutex_get_top_task(p);
 		if (!dl_prio(p->normal_prio) ||
 		    (pi_task && dl_entity_preempt(&pi_task->dl, &p->dl))) {
 			p->dl.dl_boosted = 1;
@@ -3777,6 +3813,11 @@ out_unlock:
 	balance_callback(rq);
 	preempt_enable();
 }
+#else
+static inline int rt_effective_prio(struct task_struct *p, int prio)
+{
+	return prio;
+}
 #endif
 
 void set_user_nice(struct task_struct *p, long nice)
@@ -4023,10 +4064,9 @@ static void __setscheduler(struct rq *rq, struct task_struct *p,
 	 * Keep a potential priority boosting if called from
 	 * sched_setscheduler().
 	 */
+	p->prio = normal_prio(p);
 	if (keep_boost)
-		p->prio = rt_mutex_get_effective_prio(p, normal_prio(p));
-	else
-		p->prio = normal_prio(p);
+		p->prio = rt_effective_prio(p, p->prio);
 
 	if (dl_prio(p->prio))
 		p->sched_class = &dl_sched_class;
@@ -4313,7 +4353,7 @@ change:
 		 * the runqueue. This will be done when the task deboost
 		 * itself.
 		 */
-		new_effective_prio = rt_mutex_get_effective_prio(p, newprio);
+		new_effective_prio = rt_effective_prio(p, newprio);
 		if (new_effective_prio == oldprio)
 			queue_flags &= ~DEQUEUE_MOVE;
 	}

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [tip:locking/core] sched,tracing: Update trace_sched_pi_setprio()
  2017-03-23 14:56 ` [PATCH -v3 6/8] sched,tracing: Update trace_sched_pi_setprio() Peter Zijlstra
@ 2017-04-04  9:51   ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: tip-bot for Peter Zijlstra @ 2017-04-04  9:51 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: mingo, rostedt, peterz, tglx, linux-kernel, hpa

Commit-ID:  b91473ff6e979c0028f02f90e40c844959c736d8
Gitweb:     http://git.kernel.org/tip/b91473ff6e979c0028f02f90e40c844959c736d8
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 23 Mar 2017 15:56:12 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Tue, 4 Apr 2017 11:44:06 +0200

sched,tracing: Update trace_sched_pi_setprio()

Pass the PI donor task, instead of a numerical priority.

Numerical priorities are not sufficient to describe state ever since
SCHED_DEADLINE.

Annotate all sched tracepoints that are currently broken; fixing them
will bork userspace. *hate*.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.353599881@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 include/trace/events/sched.h | 16 +++++++++-------
 kernel/sched/core.c          |  2 +-
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 9e3ef6c..ae1409f 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -70,7 +70,7 @@ DECLARE_EVENT_CLASS(sched_wakeup_template,
 	TP_fast_assign(
 		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
 		__entry->pid		= p->pid;
-		__entry->prio		= p->prio;
+		__entry->prio		= p->prio; /* XXX SCHED_DEADLINE */
 		__entry->success	= 1; /* rudiment, kill when possible */
 		__entry->target_cpu	= task_cpu(p);
 	),
@@ -147,6 +147,7 @@ TRACE_EVENT(sched_switch,
 		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
 		__entry->next_pid	= next->pid;
 		__entry->next_prio	= next->prio;
+		/* XXX SCHED_DEADLINE */
 	),
 
 	TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
@@ -181,7 +182,7 @@ TRACE_EVENT(sched_migrate_task,
 	TP_fast_assign(
 		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
 		__entry->pid		= p->pid;
-		__entry->prio		= p->prio;
+		__entry->prio		= p->prio; /* XXX SCHED_DEADLINE */
 		__entry->orig_cpu	= task_cpu(p);
 		__entry->dest_cpu	= dest_cpu;
 	),
@@ -206,7 +207,7 @@ DECLARE_EVENT_CLASS(sched_process_template,
 	TP_fast_assign(
 		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
 		__entry->pid		= p->pid;
-		__entry->prio		= p->prio;
+		__entry->prio		= p->prio; /* XXX SCHED_DEADLINE */
 	),
 
 	TP_printk("comm=%s pid=%d prio=%d",
@@ -253,7 +254,7 @@ TRACE_EVENT(sched_process_wait,
 	TP_fast_assign(
 		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
 		__entry->pid		= pid_nr(pid);
-		__entry->prio		= current->prio;
+		__entry->prio		= current->prio; /* XXX SCHED_DEADLINE */
 	),
 
 	TP_printk("comm=%s pid=%d prio=%d",
@@ -413,9 +414,9 @@ DEFINE_EVENT(sched_stat_runtime, sched_stat_runtime,
  */
 TRACE_EVENT(sched_pi_setprio,
 
-	TP_PROTO(struct task_struct *tsk, int newprio),
+	TP_PROTO(struct task_struct *tsk, struct task_struct *pi_task),
 
-	TP_ARGS(tsk, newprio),
+	TP_ARGS(tsk, pi_task),
 
 	TP_STRUCT__entry(
 		__array( char,	comm,	TASK_COMM_LEN	)
@@ -428,7 +429,8 @@ TRACE_EVENT(sched_pi_setprio,
 		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
 		__entry->pid		= tsk->pid;
 		__entry->oldprio	= tsk->prio;
-		__entry->newprio	= newprio;
+		__entry->newprio	= pi_task ? pi_task->prio : tsk->prio;
+		/* XXX SCHED_DEADLINE bits missing */
 	),
 
 	TP_printk("comm=%s pid=%d oldprio=%d newprio=%d",
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 37b152a..0c443bb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3752,7 +3752,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
 		goto out_unlock;
 	}
 
-	trace_sched_pi_setprio(p, prio); /* broken */
+	trace_sched_pi_setprio(p, pi_task);
 	oldprio = p->prio;
 
 	if (oldprio == prio)

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [tip:locking/core] rtmutex: Fix PI chain order integrity
  2017-03-23 14:56 ` [PATCH -v3 7/8] rtmutex: Fix PI chain order integrity Peter Zijlstra
@ 2017-04-04  9:52   ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: tip-bot for Peter Zijlstra @ 2017-04-04  9:52 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: mingo, peterz, hpa, tglx, linux-kernel

Commit-ID:  e0aad5b44ff5d28ac1d6ae70cdf84ca228e889dc
Gitweb:     http://git.kernel.org/tip/e0aad5b44ff5d28ac1d6ae70cdf84ca228e889dc
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 23 Mar 2017 15:56:13 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Tue, 4 Apr 2017 11:44:06 +0200

rtmutex: Fix PI chain order integrity

rt_mutex_waiter::prio is a copy of task_struct::prio which is updated
during the PI chain walk, such that the PI chain order isn't messed up
by (asynchronous) task state updates.

Currently rt_mutex_waiter_less() uses task state for deadline tasks;
this is broken, since the task state can, as said above, change
asynchronously, causing the RB tree order to change without actual
tree update -> FAIL.

Fix this by also copying the deadline into the rt_mutex_waiter state
and updating it along with its prio field.

Ideally we would also force PI chain updates whenever DL tasks
update their deadline parameters, but as a first approximation this
is less broken than it was.
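
The rule being restored is generic rbtree hygiene (sketch below uses
the kernel rbtree API; everything not named rb_* is an assumption): a
node's sort key must never change while the node is linked, so a key
update is erase + modify + re-insert -- which is what the chain walk
does via rt_mutex_dequeue()/rt_mutex_enqueue().

    #include <linux/rbtree.h>

    struct keyed_node {
        struct rb_node node;
        u64 key;
    };

    static void keyed_insert(struct rb_root *root, struct keyed_node *n)
    {
        struct rb_node **link = &root->rb_node, *parent = NULL;

        while (*link) {
            struct keyed_node *e = rb_entry(*link, struct keyed_node, node);

            parent = *link;
            if (n->key < e->key)
                link = &(*link)->rb_left;
            else
                link = &(*link)->rb_right;
        }
        rb_link_node(&n->node, parent, link);
        rb_insert_color(&n->node, root);
    }

    /* Changing the key in place would silently corrupt the tree order. */
    static void keyed_update(struct rb_root *root, struct keyed_node *n, u64 key)
    {
        rb_erase(&n->node, root);  /* unlink first */
        n->key = key;              /* now the key may change */
        keyed_insert(root, n);     /* re-link at the correct position */
    }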

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.403992539@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/locking/rtmutex.c        | 29 +++++++++++++++++++++++++++--
 kernel/locking/rtmutex_common.h |  1 +
 2 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 00b49cd..c6eda04 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -238,8 +238,7 @@ rt_mutex_waiter_less(struct rt_mutex_waiter *left,
 	 * then right waiter has a dl_prio() too.
 	 */
 	if (dl_prio(left->prio))
-		return dl_time_before(left->task->dl.deadline,
-				      right->task->dl.deadline);
+		return dl_time_before(left->deadline, right->deadline);
 
 	return 0;
 }
@@ -650,7 +649,26 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 
 	/* [7] Requeue the waiter in the lock waiter tree. */
 	rt_mutex_dequeue(lock, waiter);
+
+	/*
+	 * Update the waiter prio fields now that we're dequeued.
+	 *
+	 * These values can have changed through either:
+	 *
+	 *   sys_sched_set_scheduler() / sys_sched_setattr()
+	 *
+	 * or
+	 *
+	 *   DL CBS enforcement advancing the effective deadline.
+	 *
+	 * Even though pi_waiters also uses these fields, and that tree is only
+	 * updated in [11], we can do this here, since we hold [L], which
+	 * serializes all pi_waiters access and rb_erase() does not care about
+	 * the values of the node being removed.
+	 */
 	waiter->prio = task->prio;
+	waiter->deadline = task->dl.deadline;
+
 	rt_mutex_enqueue(lock, waiter);
 
 	/* [8] Release the task */
@@ -777,6 +795,8 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 				struct rt_mutex_waiter *waiter)
 {
+	lockdep_assert_held(&lock->wait_lock);
+
 	/*
 	 * Before testing whether we can acquire @lock, we set the
 	 * RT_MUTEX_HAS_WAITERS bit in @lock->owner. This forces all
@@ -902,6 +922,8 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 	struct rt_mutex *next_lock;
 	int chain_walk = 0, res;
 
+	lockdep_assert_held(&lock->wait_lock);
+
 	/*
 	 * Early deadlock detection. We really don't want the task to
 	 * enqueue on itself just to untangle the mess later. It's not
@@ -919,6 +941,7 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 	waiter->task = task;
 	waiter->lock = lock;
 	waiter->prio = task->prio;
+	waiter->deadline = task->dl.deadline;
 
 	/* Get the top priority waiter on the lock */
 	if (rt_mutex_has_waiters(lock))
@@ -1036,6 +1059,8 @@ static void remove_waiter(struct rt_mutex *lock,
 	struct task_struct *owner = rt_mutex_owner(lock);
 	struct rt_mutex *next_lock;
 
+	lockdep_assert_held(&lock->wait_lock);
+
 	raw_spin_lock(&current->pi_lock);
 	rt_mutex_dequeue(lock, waiter);
 	current->pi_blocked_on = NULL;
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 9e36aed..72ad45a 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -34,6 +34,7 @@ struct rt_mutex_waiter {
 	struct rt_mutex		*deadlock_lock;
 #endif
 	int prio;
+	u64 deadline;
 };
 
 /*

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [tip:locking/core] rtmutex: Fix more prio comparisons
  2017-03-23 14:56 ` [PATCH -v3 8/8] rtmutex: Fix more prio comparisons Peter Zijlstra
@ 2017-04-04  9:52   ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: tip-bot for Peter Zijlstra @ 2017-04-04  9:52 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: hpa, tglx, juri.lelli, linux-kernel, mingo, peterz

Commit-ID:  19830e55247cddb3f46f1bf60b8e245593491bea
Gitweb:     http://git.kernel.org/tip/19830e55247cddb3f46f1bf60b8e245593491bea
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 23 Mar 2017 15:56:14 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Tue, 4 Apr 2017 11:44:07 +0200

rtmutex: Fix more prio comparisons

There was a pure ->prio comparison left in try_to_take_rt_mutex();
convert it to use rt_mutex_waiter_less(), noting that greater-or-equal
is not-less (both in the kernel priority view).

This necessitated the introduction of task_to_waiter(), which creates
a pointer to an unnamed stack variable of struct rt_mutex_waiter type
to compare against tasks.

With this, we can now also create and employ rt_mutex_waiter_equal().
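
The compound literal is plain C99; a standalone sketch (hypothetical
point type, not kernel code) of what &(struct rt_mutex_waiter){ ... }
does -- it yields a pointer to an unnamed object that lives until the
end of the enclosing block:

    #include <stdio.h>

    struct point {
        int x, y;
    };

    static int left_of(const struct point *a, const struct point *b)
    {
        return a->x < b->x;
    }

    int main(void)
    {
        struct point p = { .x = 3, .y = 4 };

        /* &(struct point){...} points at an unnamed stack object,
         * used the same way as task_to_waiter(p) in the patch. */
        if (left_of(&(struct point){ .x = 1, .y = 2 }, &p))
            printf("the literal is left of p\n");

        return 0;
    }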

Reviewed-and-tested-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.455584638@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/locking/rtmutex.c | 32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index c6eda04..0e641eb 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -224,6 +224,12 @@ static inline bool unlock_rt_mutex_safe(struct rt_mutex *lock,
 }
 #endif
 
+/*
+ * Only use with rt_mutex_waiter_{less,equal}()
+ */
+#define task_to_waiter(p)	\
+	&(struct rt_mutex_waiter){ .prio = (p)->prio, .deadline = (p)->dl.deadline }
+
 static inline int
 rt_mutex_waiter_less(struct rt_mutex_waiter *left,
 		     struct rt_mutex_waiter *right)
@@ -243,6 +249,25 @@ rt_mutex_waiter_less(struct rt_mutex_waiter *left,
 	return 0;
 }
 
+static inline int
+rt_mutex_waiter_equal(struct rt_mutex_waiter *left,
+		      struct rt_mutex_waiter *right)
+{
+	if (left->prio != right->prio)
+		return 0;
+
+	/*
+	 * If both waiters have dl_prio(), we check the deadlines of the
+	 * associated tasks.
+	 * If left waiter has a dl_prio(), and we didn't return 0 above,
+	 * then right waiter has a dl_prio() too.
+	 */
+	if (dl_prio(left->prio))
+		return left->deadline == right->deadline;
+
+	return 1;
+}
+
 static void
 rt_mutex_enqueue(struct rt_mutex *lock, struct rt_mutex_waiter *waiter)
 {
@@ -553,7 +578,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 	 * enabled we continue, but stop the requeueing in the chain
 	 * walk.
 	 */
-	if (waiter->prio == task->prio && !dl_task(task)) {
+	if (rt_mutex_waiter_equal(waiter, task_to_waiter(task))) {
 		if (!detect_deadlock)
 			goto out_unlock_pi;
 		else
@@ -856,7 +881,8 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 			 * the top waiter priority (kernel view),
 			 * @task lost.
 			 */
-			if (task->prio >= rt_mutex_top_waiter(lock)->prio)
+			if (!rt_mutex_waiter_less(task_to_waiter(task),
+						  rt_mutex_top_waiter(lock)))
 				return 0;
 
 			/*
@@ -1119,7 +1145,7 @@ void rt_mutex_adjust_pi(struct task_struct *task)
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
 
 	waiter = task->pi_blocked_on;
-	if (!waiter || (waiter->prio == task->prio && !dl_prio(task->prio))) {
+	if (!waiter || rt_mutex_waiter_equal(waiter, task_to_waiter(task))) {
 		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 		return;
 	}

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [tip:locking/core] rtmutex: Deboost before waking up the top waiter
  2017-04-04  9:48   ` [tip:locking/core] " tip-bot for Xunlei Pang
@ 2017-04-05  8:08     ` Mike Galbraith
  2017-04-05 14:55       ` [tip:locking/core] Retiplockingcore_rtmutex_Deboost_before_waking_up_the_top_waiter tip-bot for Mike Galbraith
                         ` (2 more replies)
  0 siblings, 3 replies; 46+ messages in thread
From: Mike Galbraith @ 2017-04-05  8:08 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, linux-kernel, rostedt, mingo, xlpang, tglx, Peter Zijlstra

locking/rtmutex: Fix preempt leak in __rt_mutex_futex_unlock()

mark_wakeup_next_waiter() already disables preemption, doing so
again leaves us with an unpaired preempt_disable().
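
A user-space analogue of the imbalance (assumed names; the per-CPU
preempt count modelled as a plain nesting counter) -- the assert
fires, showing the count stuck at one:

    #include <assert.h>

    static int preempt_count;  /* stand-in for the per-CPU counter */

    static void preempt_disable(void) { preempt_count++; }
    static void preempt_enable(void)  { preempt_count--; }

    /* Since 2a1c60299406 this already disables preemption... */
    static void mark_wakeup_next_waiter(void) { preempt_disable(); }

    /* ...and this pairs with exactly ONE preempt_disable(). */
    static void rt_mutex_postunlock(void) { preempt_enable(); }

    int main(void)
    {
        mark_wakeup_next_waiter();  /* count: 1 */
        preempt_disable();          /* the leak: count: 2 */
        rt_mutex_postunlock();      /* count: 1, never reaches 0 */

        assert(preempt_count == 0); /* fails, demonstrating the leak */
        return 0;
    }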

Signed-off-by: Mike Galbraith <efault@gmx.de>
---
 kernel/locking/rtmutex.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1581,13 +1581,13 @@ bool __sched __rt_mutex_futex_unlock(str
 		return false; /* done */
 	}
 
-	mark_wakeup_next_waiter(wake_q, lock);
 	/*
-	 * We've already deboosted, retain preempt_disabled when dropping
-	 * the wait_lock to avoid inversion until the wakeup. Matched
-	 * by rt_mutex_postunlock();
+	 * We've already deboosted, mark_wakeup_next_waiter() will
+	 * retain preempt_disabled when we drop the wait_lock, to
+	 * avoid inversion prior to the wakeup.  preempt_disable()
+	 * therein pairs with rt_mutex_postunlock().
 	 */
-	preempt_disable();
+	mark_wakeup_next_waiter(wake_q, lock);
 
 	return true; /* call postunlock() */
 }

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [tip:locking/core] Retiplockingcore_rtmutex_Deboost_before_waking_up_the_top_waiter
  2017-04-05  8:08     ` Mike Galbraith
@ 2017-04-05 14:55       ` tip-bot for Mike Galbraith
  2017-04-05 15:03       ` [tip:locking/core] rtmutex: Plug preempt count leak in rt_mutex_futex_unlock() tip-bot for Mike Galbraith
  2017-04-06  6:16       ` [tip:locking/core] rtmutex: Deboost before waking up the top waiter Xunlei Pang
  2 siblings, 0 replies; 46+ messages in thread
From: tip-bot for Mike Galbraith @ 2017-04-05 14:55 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: mingo, tglx, efault, peterz, linux-kernel, hpa

Commit-ID:  94247f76e7361afd85ba03a3f923bf3d07ba3017
Gitweb:     http://git.kernel.org/tip/94247f76e7361afd85ba03a3f923bf3d07ba3017
Author:     Mike Galbraith <efault@gmx.de>
AuthorDate: Wed, 5 Apr 2017 10:08:27 +0200
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Wed, 5 Apr 2017 16:52:10 +0200

Retiplockingcore_rtmutex_Deboost_before_waking_up_the_top_waiter

mark_wakeup_next_waiter() already disables preemption, doing so again
leaves us with an unpaired preempt_disable().

Fixes: 2a1c60299406 ("rtmutex: Deboost before waking up the top waiter")
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/1491379707.6538.2.camel@gmx.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/locking/rtmutex.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 0e641eb..b955094 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1581,13 +1581,13 @@ bool __sched __rt_mutex_futex_unlock(struct rt_mutex *lock,
 		return false; /* done */
 	}
 
-	mark_wakeup_next_waiter(wake_q, lock);
 	/*
-	 * We've already deboosted, retain preempt_disabled when dropping
-	 * the wait_lock to avoid inversion until the wakeup. Matched
-	 * by rt_mutex_postunlock();
+	 * We've already deboosted, mark_wakeup_next_waiter() will
+	 * retain preempt_disabled when we drop the wait_lock, to
+	 * avoid inversion prior to the wakeup.  preempt_disable()
+	 * therein pairs with rt_mutex_postunlock().
 	 */
-	preempt_disable();
+	mark_wakeup_next_waiter(wake_q, lock);
 
 	return true; /* call postunlock() */
 }

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [tip:locking/core] rtmutex: Plug preempt count leak in rt_mutex_futex_unlock()
  2017-04-05  8:08     ` Mike Galbraith
  2017-04-05 14:55       ` [tip:locking/core] Retiplockingcore_rtmutex_Deboost_before_waking_up_the_top_waiter tip-bot for Mike Galbraith
@ 2017-04-05 15:03       ` tip-bot for Mike Galbraith
  2017-04-06  6:16       ` [tip:locking/core] rtmutex: Deboost before waking up the top waiter Xunlei Pang
  2 siblings, 0 replies; 46+ messages in thread
From: tip-bot for Mike Galbraith @ 2017-04-05 15:03 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: hpa, peterz, tglx, efault, linux-kernel, mingo

Commit-ID:  def34eaae5ce04b324e48e1bfac873091d945213
Gitweb:     http://git.kernel.org/tip/def34eaae5ce04b324e48e1bfac873091d945213
Author:     Mike Galbraith <efault@gmx.de>
AuthorDate: Wed, 5 Apr 2017 10:08:27 +0200
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Wed, 5 Apr 2017 16:59:37 +0200

rtmutex: Plug preempt count leak in rt_mutex_futex_unlock()

mark_wakeup_next_waiter() already disables preemption, doing so again
leaves us with an unpaired preempt_disable().

Fixes: 2a1c60299406 ("rtmutex: Deboost before waking up the top waiter")
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/1491379707.6538.2.camel@gmx.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/locking/rtmutex.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 0e641eb..b955094 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1581,13 +1581,13 @@ bool __sched __rt_mutex_futex_unlock(struct rt_mutex *lock,
 		return false; /* done */
 	}
 
-	mark_wakeup_next_waiter(wake_q, lock);
 	/*
-	 * We've already deboosted, retain preempt_disabled when dropping
-	 * the wait_lock to avoid inversion until the wakeup. Matched
-	 * by rt_mutex_postunlock();
+	 * We've already deboosted, mark_wakeup_next_waiter() will
+	 * retain preempt_disabled when we drop the wait_lock, to
+	 * avoid inversion prior to the wakeup.  preempt_disable()
+	 * therein pairs with rt_mutex_postunlock().
 	 */
-	preempt_disable();
+	mark_wakeup_next_waiter(wake_q, lock);
 
 	return true; /* call postunlock() */
 }

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [tip:locking/core] rtmutex: Deboost before waking up the top waiter
  2017-04-05  8:08     ` Mike Galbraith
  2017-04-05 14:55       ` [tip:locking/core] Retiplockingcore_rtmutex_Deboost_before_waking_up_the_top_waiter tip-bot for Mike Galbraith
  2017-04-05 15:03       ` [tip:locking/core] rtmutex: Plug preempt count leak in rt_mutex_futex_unlock() tip-bot for Mike Galbraith
@ 2017-04-06  6:16       ` Xunlei Pang
  2 siblings, 0 replies; 46+ messages in thread
From: Xunlei Pang @ 2017-04-06  6:16 UTC (permalink / raw)
  To: Mike Galbraith, linux-tip-commits
  Cc: hpa, linux-kernel, rostedt, mingo, xlpang, tglx, Peter Zijlstra

On 04/05/2017 at 04:08 PM, Mike Galbraith wrote:
> locking/rtmutex: Fix preempt leak in __rt_mutex_futex_unlock()
>
> mark_wakeup_next_waiter() already disables preemption, doing so
> again leaves us with an unpaired preempt_disable().

You can also fix the corresponding comment in rt_mutex_postunlock():
    /* Pairs with preempt_disable() in rt_mutex_slowunlock() */
    preempt_enable();

Thanks,
Xunlei

>
> Signed-off-by: Mike Galbraith <efault@gmx.de>
> ---
>  kernel/locking/rtmutex.c |   10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -1581,13 +1581,13 @@ bool __sched __rt_mutex_futex_unlock(str
>  		return false; /* done */
>  	}
>  
> -	mark_wakeup_next_waiter(wake_q, lock);
>  	/*
> -	 * We've already deboosted, retain preempt_disabled when dropping
> -	 * the wait_lock to avoid inversion until the wakeup. Matched
> -	 * by rt_mutex_postunlock();
> +	 * We've already deboosted, mark_wakeup_next_waiter() will
> +	 * retain preempt_disabled when we drop the wait_lock, to
> +	 * avoid inversion prior to the wakeup.  preempt_disable()
> +	 * therein pairs with rt_mutex_postunlock().
>  	 */
> -	preempt_disable();
> +	mark_wakeup_next_waiter(wake_q, lock);
>  
>  	return true; /* call postunlock() */
>  }

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2017-04-06  6:15 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-23 14:56 [PATCH -v3 0/8] PI vs SCHED_DEADLINE fixes Peter Zijlstra
2017-03-23 14:56 ` [PATCH -v3 1/8] rtmutex: Deboost before waking up the top waiter Peter Zijlstra
2017-04-04  9:48   ` [tip:locking/core] " tip-bot for Xunlei Pang
2017-04-05  8:08     ` Mike Galbraith
2017-04-05 14:55       ` [tip:locking/core] Retiplockingcore_rtmutex_Deboost_before_waking_up_the_top_waiter tip-bot for Mike Galbraith
2017-04-05 15:03       ` [tip:locking/core] rtmutex: Plug preempt count leak in rt_mutex_futex_unlock() tip-bot for Mike Galbraith
2017-04-06  6:16       ` [tip:locking/core] rtmutex: Deboost before waking up the top waiter Xunlei Pang
2017-03-23 14:56 ` [PATCH -v3 2/8] sched/rtmutex/deadline: Fix a PI crash for deadline tasks Peter Zijlstra
2017-04-04  9:49   ` [tip:locking/core] " tip-bot for Xunlei Pang
2017-03-23 14:56 ` [PATCH -v3 3/8] sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update Peter Zijlstra
2017-04-04  9:50   ` [tip:locking/core] " tip-bot for Xunlei Pang
2017-03-23 14:56 ` [PATCH -v3 4/8] rtmutex: Clean up Peter Zijlstra
2017-03-23 18:21   ` Steven Rostedt
2017-04-04  9:50   ` [tip:locking/core] " tip-bot for Peter Zijlstra
2017-03-23 14:56 ` [PATCH -v3 5/8] sched/rtmutex: Refactor rt_mutex_setprio() Peter Zijlstra
2017-04-04  9:51   ` [tip:locking/core] " tip-bot for Peter Zijlstra
2017-03-23 14:56 ` [PATCH -v3 6/8] sched,tracing: Update trace_sched_pi_setprio() Peter Zijlstra
2017-04-04  9:51   ` [tip:locking/core] " tip-bot for Peter Zijlstra
2017-03-23 14:56 ` [PATCH -v3 7/8] rtmutex: Fix PI chain order integrity Peter Zijlstra
2017-04-04  9:52   ` [tip:locking/core] " tip-bot for Peter Zijlstra
2017-03-23 14:56 ` [PATCH -v3 8/8] rtmutex: Fix more prio comparisons Peter Zijlstra
2017-04-04  9:52   ` [tip:locking/core] " tip-bot for Peter Zijlstra
  -- strict thread matches above, loose matches on Subject: below --
2016-04-14 11:37 [PATCH v3 0/6] sched/deadline/rtmutex: Fix two deadline PI issues Xunlei Pang
2016-04-14 11:37 ` [PATCH v3 1/6] rtmutex: Deboost before waking up the top waiter Xunlei Pang
2016-04-18  8:23   ` Thomas Gleixner
2016-04-18  8:44     ` Xunlei Pang
2016-04-18  9:02       ` Thomas Gleixner
2016-04-18  9:41         ` Xunlei Pang
2016-04-20 12:20         ` Peter Zijlstra
2016-04-20 12:43           ` Thomas Gleixner
2016-04-20 13:10             ` Peter Zijlstra
2016-04-14 11:37 ` [PATCH v3 2/6] sched/rtmutex/deadline: Fix a PI crash for deadline tasks Xunlei Pang
2016-04-20 13:19   ` Peter Zijlstra
2016-04-20 13:49     ` Xunlei Pang
2016-04-22  3:26       ` Xunlei Pang
2016-04-14 11:37 ` [PATCH v3 3/6] rtmutex: Move "rt_mutex_waiter" definition to "include/linux/rtmutex.h" Xunlei Pang
2016-04-14 11:37 ` [PATCH v3 4/6] sched: Move dl_policy() to "include/linux/sched.h" Xunlei Pang
2016-04-14 11:37 ` [PATCH v3 5/6] sched/deadline/rtmutex: Fix unprotected PI access in enqueue_task_dl() Xunlei Pang
2016-04-14 15:31   ` Peter Zijlstra
2016-04-15  1:58     ` Xunlei Pang
2016-04-15  2:19       ` Xunlei Pang
2016-04-20 12:25         ` Peter Zijlstra
2016-04-20 13:00           ` Xunlei Pang
2016-04-20 13:17             ` Peter Zijlstra
2016-04-20 13:45               ` Xunlei Pang
2016-04-14 11:37 ` [PATCH v3 6/6] sched/deadline/rtmutex: Don't miss the dl_runtime/dl_period update Xunlei Pang
