* [PATCH -v6 00/13] The arduous story of FUTEX_UNLOCK_PI
@ 2017-03-22 10:35 Peter Zijlstra
  2017-03-22 10:35 ` [PATCH -v6 01/13] futex: Cleanup variable names for futex_top_waiter() Peter Zijlstra
                   ` (13 more replies)
  0 siblings, 14 replies; 60+ messages in thread
From: Peter Zijlstra @ 2017-03-22 10:35 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz

Hi all,

Another installment of the futex patches that give you nightmares ;-)

This version sports updated comments and Changelogs as requested last
time around. It also includes two fixes, both reported by Sebastian
who was kind enough to stick this in his RT tree for some testing.

The last patch is RT specific, but I figure we can merge it anyway.

Again; I sincerely hope this to be the very last version.


* [PATCH -v6 01/13] futex: Cleanup variable names for futex_top_waiter()
  2017-03-22 10:35 [PATCH -v6 00/13] The arduous story of FUTEX_UNLOCK_PI Peter Zijlstra
@ 2017-03-22 10:35 ` Peter Zijlstra
  2017-03-23 18:19   ` [tip:locking/core] " tip-bot for Peter Zijlstra
  2017-03-24 21:11   ` [PATCH -v6 01/13] " Darren Hart
  2017-03-22 10:35 ` [PATCH -v6 02/13] futex: Use smp_store_release() in mark_wake_futex() Peter Zijlstra
                   ` (12 subsequent siblings)
  13 siblings, 2 replies; 60+ messages in thread
From: Peter Zijlstra @ 2017-03-22 10:35 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz

[-- Attachment #1: peterz-futex-cleanup-top_waiter.patch --]
[-- Type: text/plain, Size: 3384 bytes --]

futex_top_waiter() returns the top-waiter on the pi_mutex. Assigning
this to a variable 'match' totally obscures the code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c |   30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1120,14 +1120,14 @@ static int attach_to_pi_owner(u32 uval,
 static int lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
 			   union futex_key *key, struct futex_pi_state **ps)
 {
-	struct futex_q *match = futex_top_waiter(hb, key);
+	struct futex_q *top_waiter = futex_top_waiter(hb, key);
 
 	/*
 	 * If there is a waiter on that futex, validate it and
 	 * attach to the pi_state when the validation succeeds.
 	 */
-	if (match)
-		return attach_to_pi_state(uval, match->pi_state, ps);
+	if (top_waiter)
+		return attach_to_pi_state(uval, top_waiter->pi_state, ps);
 
 	/*
 	 * We are the first waiter - try to look up the owner based on
@@ -1174,7 +1174,7 @@ static int futex_lock_pi_atomic(u32 __us
 				struct task_struct *task, int set_waiters)
 {
 	u32 uval, newval, vpid = task_pid_vnr(task);
-	struct futex_q *match;
+	struct futex_q *top_waiter;
 	int ret;
 
 	/*
@@ -1200,9 +1200,9 @@ static int futex_lock_pi_atomic(u32 __us
 	 * Lookup existing state first. If it exists, try to attach to
 	 * its pi_state.
 	 */
-	match = futex_top_waiter(hb, key);
-	if (match)
-		return attach_to_pi_state(uval, match->pi_state, ps);
+	top_waiter = futex_top_waiter(hb, key);
+	if (top_waiter)
+		return attach_to_pi_state(uval, top_waiter->pi_state, ps);
 
 	/*
 	 * No waiter and user TID is 0. We are here because the
@@ -1292,11 +1292,11 @@ static void mark_wake_futex(struct wake_
 	q->lock_ptr = NULL;
 }
 
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
+static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
 			 struct futex_hash_bucket *hb)
 {
 	struct task_struct *new_owner;
-	struct futex_pi_state *pi_state = this->pi_state;
+	struct futex_pi_state *pi_state = top_waiter->pi_state;
 	u32 uninitialized_var(curval), newval;
 	DEFINE_WAKE_Q(wake_q);
 	bool deboost;
@@ -1317,11 +1317,11 @@ static int wake_futex_pi(u32 __user *uad
 
 	/*
 	 * It is possible that the next waiter (the one that brought
-	 * this owner to the kernel) timed out and is no longer
+	 * top_waiter owner to the kernel) timed out and is no longer
 	 * waiting on the lock.
 	 */
 	if (!new_owner)
-		new_owner = this->task;
+		new_owner = top_waiter->task;
 
 	/*
 	 * We pass it to the next owner. The WAITERS bit is always
@@ -2631,7 +2631,7 @@ static int futex_unlock_pi(u32 __user *u
 	u32 uninitialized_var(curval), uval, vpid = task_pid_vnr(current);
 	union futex_key key = FUTEX_KEY_INIT;
 	struct futex_hash_bucket *hb;
-	struct futex_q *match;
+	struct futex_q *top_waiter;
 	int ret;
 
 retry:
@@ -2655,9 +2655,9 @@ static int futex_unlock_pi(u32 __user *u
 	 * all and we at least want to know if user space fiddled
 	 * with the futex value instead of blindly unlocking.
 	 */
-	match = futex_top_waiter(hb, &key);
-	if (match) {
-		ret = wake_futex_pi(uaddr, uval, match, hb);
+	top_waiter = futex_top_waiter(hb, &key);
+	if (top_waiter) {
+		ret = wake_futex_pi(uaddr, uval, top_waiter, hb);
 		/*
 		 * In case of success wake_futex_pi dropped the hash
 		 * bucket lock.


* [PATCH -v6 02/13] futex: Use smp_store_release() in mark_wake_futex()
  2017-03-22 10:35 [PATCH -v6 00/13] The arduous story of FUTEX_UNLOCK_PI Peter Zijlstra
  2017-03-22 10:35 ` [PATCH -v6 01/13] futex: Cleanup variable names for futex_top_waiter() Peter Zijlstra
@ 2017-03-22 10:35 ` Peter Zijlstra
  2017-03-23 18:19   ` [tip:locking/core] " tip-bot for Peter Zijlstra
  2017-03-24 21:16   ` [PATCH -v6 02/13] " Darren Hart
  2017-03-22 10:35 ` [PATCH -v6 03/13] futex: Remove rt_mutex_deadlock_account_*() Peter Zijlstra
                   ` (11 subsequent siblings)
  13 siblings, 2 replies; 60+ messages in thread
From: Peter Zijlstra @ 2017-03-22 10:35 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz

[-- Attachment #1: peterz-futex-mark_wake_futex.patch --]
[-- Type: text/plain, Size: 697 bytes --]

Since the futex_q can disappear the instruction after assigning NULL,
this really should be a RELEASE barrier. That stops loads from hitting
dead memory too.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1288,8 +1288,7 @@ static void mark_wake_futex(struct wake_
 	 * memory barrier is required here to prevent the following
 	 * store to lock_ptr from getting ahead of the plist_del.
 	 */
-	smp_wmb();
-	q->lock_ptr = NULL;
+	smp_store_release(&q->lock_ptr, NULL);
 }
 
 static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
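
For readers following along outside the kernel tree, here is a minimal
userspace sketch of the same idea using C11 atomics. It is only an
analogy: struct waiter, mark_woken() and woken() are hypothetical names,
and memory_order_release stands in for smp_store_release(). The point it
illustrates is that a RELEASE store orders all earlier loads and stores
before the publication, whereas a write barrier followed by a plain store
only orders the earlier stores.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct futex_q; not the kernel layout. */
struct waiter {
	int wake_reason;            /* plain field the waker fills in */
	_Atomic(void *) lock_ptr;   /* non-NULL while queued, NULL once woken */
};

/* Waker side: finish every access to *w, then publish with RELEASE. */
static void mark_woken(struct waiter *w, int reason)
{
	/* Only a queued waiter may be woken. */
	assert(atomic_load_explicit(&w->lock_ptr, memory_order_relaxed) != NULL);

	w->wake_reason = reason;

	/*
	 * Once this store is visible the waiter may free or reuse *w, so
	 * nothing in this function may touch *w after it.
	 */
	atomic_store_explicit(&w->lock_ptr, NULL, memory_order_release);
}

/* Waiter side: the ACQUIRE load pairs with the RELEASE store above. */
static bool woken(struct waiter *w, int *reason)
{
	if (atomic_load_explicit(&w->lock_ptr, memory_order_acquire) != NULL)
		return false;

	*reason = w->wake_reason;
	return true;
}
```

A caller would initialise lock_ptr to some non-NULL address (the bucket
lock in the real code), poll woken() from the waiting side, and call
mark_woken() exactly once from the waking side.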


* [PATCH -v6 03/13] futex: Remove rt_mutex_deadlock_account_*()
  2017-03-22 10:35 [PATCH -v6 00/13] The arduous story of FUTEX_UNLOCK_PI Peter Zijlstra
  2017-03-22 10:35 ` [PATCH -v6 01/13] futex: Cleanup variable names for futex_top_waiter() Peter Zijlstra
  2017-03-22 10:35 ` [PATCH -v6 02/13] futex: Use smp_store_release() in mark_wake_futex() Peter Zijlstra
@ 2017-03-22 10:35 ` Peter Zijlstra
  2017-03-23 18:20   ` [tip:locking/core] " tip-bot for Peter Zijlstra
  2017-03-24 21:29   ` [PATCH -v6 03/13] " Darren Hart
  2017-03-22 10:35 ` [PATCH -v6 04/13] futex,rt_mutex: Provide futex specific rt_mutex API Peter Zijlstra
                   ` (10 subsequent siblings)
  13 siblings, 2 replies; 60+ messages in thread
From: Peter Zijlstra @ 2017-03-22 10:35 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz

[-- Attachment #1: peterz-futex-cleanup-deadlock.patch --]
[-- Type: text/plain, Size: 4950 bytes --]

These are unused and clutter up the code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/locking/rtmutex-debug.c |    9 -------
 kernel/locking/rtmutex-debug.h |    3 --
 kernel/locking/rtmutex.c       |   47 +++++++++++++++--------------------------
 kernel/locking/rtmutex.h       |    2 -
 4 files changed, 18 insertions(+), 43 deletions(-)

--- a/kernel/locking/rtmutex-debug.c
+++ b/kernel/locking/rtmutex-debug.c
@@ -173,12 +173,3 @@ void debug_rt_mutex_init(struct rt_mutex
 	lock->name = name;
 }
 
-void
-rt_mutex_deadlock_account_lock(struct rt_mutex *lock, struct task_struct *task)
-{
-}
-
-void rt_mutex_deadlock_account_unlock(struct task_struct *task)
-{
-}
-
--- a/kernel/locking/rtmutex-debug.h
+++ b/kernel/locking/rtmutex-debug.h
@@ -9,9 +9,6 @@
  * This file contains macros used solely by rtmutex.c. Debug version.
  */
 
-extern void
-rt_mutex_deadlock_account_lock(struct rt_mutex *lock, struct task_struct *task);
-extern void rt_mutex_deadlock_account_unlock(struct task_struct *task);
 extern void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter);
 extern void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter);
 extern void debug_rt_mutex_init(struct rt_mutex *lock, const char *name);
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -936,8 +936,6 @@ static int try_to_take_rt_mutex(struct r
 	 */
 	rt_mutex_set_owner(lock, task);
 
-	rt_mutex_deadlock_account_lock(lock, task);
-
 	return 1;
 }
 
@@ -1340,8 +1338,6 @@ static bool __sched rt_mutex_slowunlock(
 
 	debug_rt_mutex_unlock(lock);
 
-	rt_mutex_deadlock_account_unlock(current);
-
 	/*
 	 * We must be careful here if the fast path is enabled. If we
 	 * have no waiters queued we cannot set owner to NULL here
@@ -1407,11 +1403,10 @@ rt_mutex_fastlock(struct rt_mutex *lock,
 				struct hrtimer_sleeper *timeout,
 				enum rtmutex_chainwalk chwalk))
 {
-	if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) {
-		rt_mutex_deadlock_account_lock(lock, current);
+	if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current)))
 		return 0;
-	} else
-		return slowfn(lock, state, NULL, RT_MUTEX_MIN_CHAINWALK);
+
+	return slowfn(lock, state, NULL, RT_MUTEX_MIN_CHAINWALK);
 }
 
 static inline int
@@ -1423,21 +1418,19 @@ rt_mutex_timed_fastlock(struct rt_mutex
 				      enum rtmutex_chainwalk chwalk))
 {
 	if (chwalk == RT_MUTEX_MIN_CHAINWALK &&
-	    likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) {
-		rt_mutex_deadlock_account_lock(lock, current);
+	    likely(rt_mutex_cmpxchg_acquire(lock, NULL, current)))
 		return 0;
-	} else
-		return slowfn(lock, state, timeout, chwalk);
+
+	return slowfn(lock, state, timeout, chwalk);
 }
 
 static inline int
 rt_mutex_fasttrylock(struct rt_mutex *lock,
 		     int (*slowfn)(struct rt_mutex *lock))
 {
-	if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) {
-		rt_mutex_deadlock_account_lock(lock, current);
+	if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current)))
 		return 1;
-	}
+
 	return slowfn(lock);
 }
 
@@ -1447,19 +1440,18 @@ rt_mutex_fastunlock(struct rt_mutex *loc
 				   struct wake_q_head *wqh))
 {
 	DEFINE_WAKE_Q(wake_q);
+	bool deboost;
 
-	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL))) {
-		rt_mutex_deadlock_account_unlock(current);
+	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL)))
+		return;
 
-	} else {
-		bool deboost = slowfn(lock, &wake_q);
+	deboost = slowfn(lock, &wake_q);
 
-		wake_up_q(&wake_q);
+	wake_up_q(&wake_q);
 
-		/* Undo pi boosting if necessary: */
-		if (deboost)
-			rt_mutex_adjust_prio(current);
-	}
+	/* Undo pi boosting if necessary: */
+	if (deboost)
+		rt_mutex_adjust_prio(current);
 }
 
 /**
@@ -1570,10 +1562,9 @@ EXPORT_SYMBOL_GPL(rt_mutex_unlock);
 bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock,
 				   struct wake_q_head *wqh)
 {
-	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL))) {
-		rt_mutex_deadlock_account_unlock(current);
+	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL)))
 		return false;
-	}
+
 	return rt_mutex_slowunlock(lock, wqh);
 }
 
@@ -1635,7 +1626,6 @@ void rt_mutex_init_proxy_locked(struct r
 	__rt_mutex_init(lock, NULL);
 	debug_rt_mutex_proxy_lock(lock, proxy_owner);
 	rt_mutex_set_owner(lock, proxy_owner);
-	rt_mutex_deadlock_account_lock(lock, proxy_owner);
 }
 
 /**
@@ -1655,7 +1645,6 @@ void rt_mutex_proxy_unlock(struct rt_mut
 {
 	debug_rt_mutex_proxy_unlock(lock);
 	rt_mutex_set_owner(lock, NULL);
-	rt_mutex_deadlock_account_unlock(proxy_owner);
 }
 
 /**
--- a/kernel/locking/rtmutex.h
+++ b/kernel/locking/rtmutex.h
@@ -11,8 +11,6 @@
  */
 
 #define rt_mutex_deadlock_check(l)			(0)
-#define rt_mutex_deadlock_account_lock(m, t)		do { } while (0)
-#define rt_mutex_deadlock_account_unlock(l)		do { } while (0)
 #define debug_rt_mutex_init_waiter(w)			do { } while (0)
 #define debug_rt_mutex_free_waiter(w)			do { } while (0)
 #define debug_rt_mutex_lock(l)				do { } while (0)


* [PATCH -v6 04/13] futex,rt_mutex: Provide futex specific rt_mutex API
  2017-03-22 10:35 [PATCH -v6 00/13] The arduous story of FUTEX_UNLOCK_PI Peter Zijlstra
                   ` (2 preceding siblings ...)
  2017-03-22 10:35 ` [PATCH -v6 03/13] futex: Remove rt_mutex_deadlock_account_*() Peter Zijlstra
@ 2017-03-22 10:35 ` Peter Zijlstra
  2017-03-23 18:20   ` [tip:locking/core] " tip-bot for Peter Zijlstra
                     ` (2 more replies)
  2017-03-22 10:35 ` [PATCH -v6 05/13] futex: Change locking rules Peter Zijlstra
                   ` (9 subsequent siblings)
  13 siblings, 3 replies; 60+ messages in thread
From: Peter Zijlstra @ 2017-03-22 10:35 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz

[-- Attachment #1: peterz-futex-pi-unlock-1.patch --]
[-- Type: text/plain, Size: 6619 bytes --]

Part of what makes futex_unlock_pi() intricate is that
rt_mutex_futex_unlock() -> rt_mutex_slowunlock() can drop
rt_mutex::wait_lock.

This means we cannot rely on the atomicity of wait_lock, which we would
like to do in order to not rely on hb->lock so much.

The reason rt_mutex_slowunlock() needs to drop wait_lock is because it
can race with the rt_mutex fastpath, however futexes have their own
fast path.

Since futexes already have a bunch of separate rt_mutex accessors,
complete that set and implement a rt_mutex variant without fastpath
for them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c                  |   30 ++++++++++-----------
 kernel/locking/rtmutex.c        |   55 +++++++++++++++++++++++++++++-----------
 kernel/locking/rtmutex_common.h |    9 +++++-
 3 files changed, 62 insertions(+), 32 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -916,7 +916,7 @@ void exit_pi_state_list(struct task_stru
 		pi_state->owner = NULL;
 		raw_spin_unlock_irq(&curr->pi_lock);
 
-		rt_mutex_unlock(&pi_state->pi_mutex);
+		rt_mutex_futex_unlock(&pi_state->pi_mutex);
 
 		spin_unlock(&hb->lock);
 
@@ -1364,20 +1364,18 @@ static int wake_futex_pi(u32 __user *uad
 	pi_state->owner = new_owner;
 	raw_spin_unlock(&new_owner->pi_lock);
 
-	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
-
-	deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
-
 	/*
-	 * First unlock HB so the waiter does not spin on it once he got woken
-	 * up. Second wake up the waiter before the priority is adjusted. If we
-	 * deboost first (and lose our higher priority), then the task might get
-	 * scheduled away before the wake up can take place.
+	 * We've updated the uservalue, this unlock cannot fail.
 	 */
+	deboost = __rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
+
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 	spin_unlock(&hb->lock);
-	wake_up_q(&wake_q);
-	if (deboost)
+
+	if (deboost) {
+		wake_up_q(&wake_q);
 		rt_mutex_adjust_prio(current);
+	}
 
 	return 0;
 }
@@ -2253,7 +2251,7 @@ static int fixup_owner(u32 __user *uaddr
 		 * task acquired the rt_mutex after we removed ourself from the
 		 * rt_mutex waiters list.
 		 */
-		if (rt_mutex_trylock(&q->pi_state->pi_mutex)) {
+		if (rt_mutex_futex_trylock(&q->pi_state->pi_mutex)) {
 			locked = 1;
 			goto out;
 		}
@@ -2568,7 +2566,7 @@ static int futex_lock_pi(u32 __user *uad
 	if (!trylock) {
 		ret = rt_mutex_timed_futex_lock(&q.pi_state->pi_mutex, to);
 	} else {
-		ret = rt_mutex_trylock(&q.pi_state->pi_mutex);
+		ret = rt_mutex_futex_trylock(&q.pi_state->pi_mutex);
 		/* Fixup the trylock return value: */
 		ret = ret ? 0 : -EWOULDBLOCK;
 	}
@@ -2591,7 +2589,7 @@ static int futex_lock_pi(u32 __user *uad
 	 * it and return the fault to userspace.
 	 */
 	if (ret && (rt_mutex_owner(&q.pi_state->pi_mutex) == current))
-		rt_mutex_unlock(&q.pi_state->pi_mutex);
+		rt_mutex_futex_unlock(&q.pi_state->pi_mutex);
 
 	/* Unqueue and drop the lock */
 	unqueue_me_pi(&q);
@@ -2898,7 +2896,7 @@ static int futex_wait_requeue_pi(u32 __u
 			spin_lock(q.lock_ptr);
 			ret = fixup_pi_state_owner(uaddr2, &q, current);
 			if (ret && rt_mutex_owner(&q.pi_state->pi_mutex) == current)
-				rt_mutex_unlock(&q.pi_state->pi_mutex);
+				rt_mutex_futex_unlock(&q.pi_state->pi_mutex);
 			/*
 			 * Drop the reference to the pi state which
 			 * the requeue_pi() code acquired for us.
@@ -2938,7 +2936,7 @@ static int futex_wait_requeue_pi(u32 __u
 		 * userspace.
 		 */
 		if (ret && rt_mutex_owner(pi_mutex) == current)
-			rt_mutex_unlock(pi_mutex);
+			rt_mutex_futex_unlock(pi_mutex);
 
 		/* Unqueue and drop the lock. */
 		unqueue_me_pi(&q);
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1488,15 +1488,23 @@ EXPORT_SYMBOL_GPL(rt_mutex_lock_interrup
 
 /*
  * Futex variant with full deadlock detection.
+ * Futex variants must not use the fast-path, see __rt_mutex_futex_unlock().
  */
-int rt_mutex_timed_futex_lock(struct rt_mutex *lock,
+int __sched rt_mutex_timed_futex_lock(struct rt_mutex *lock,
 			      struct hrtimer_sleeper *timeout)
 {
 	might_sleep();
 
-	return rt_mutex_timed_fastlock(lock, TASK_INTERRUPTIBLE, timeout,
-				       RT_MUTEX_FULL_CHAINWALK,
-				       rt_mutex_slowlock);
+	return rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE,
+				 timeout, RT_MUTEX_FULL_CHAINWALK);
+}
+
+/*
+ * Futex variant, must not use fastpath.
+ */
+int __sched rt_mutex_futex_trylock(struct rt_mutex *lock)
+{
+	return rt_mutex_slowtrylock(lock);
 }
 
 /**
@@ -1555,19 +1563,38 @@ void __sched rt_mutex_unlock(struct rt_m
 EXPORT_SYMBOL_GPL(rt_mutex_unlock);
 
 /**
- * rt_mutex_futex_unlock - Futex variant of rt_mutex_unlock
- * @lock: the rt_mutex to be unlocked
- *
- * Returns: true/false indicating whether priority adjustment is
- * required or not.
+ * Futex variant, that since futex variants do not use the fast-path, can be
+ * simple and will not need to retry.
  */
-bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock,
-				   struct wake_q_head *wqh)
+bool __sched __rt_mutex_futex_unlock(struct rt_mutex *lock,
+				    struct wake_q_head *wake_q)
+{
+	lockdep_assert_held(&lock->wait_lock);
+
+	debug_rt_mutex_unlock(lock);
+
+	if (!rt_mutex_has_waiters(lock)) {
+		lock->owner = NULL;
+		return false; /* done */
+	}
+
+	mark_wakeup_next_waiter(wake_q, lock);
+	return true; /* deboost and wakeups */
+}
+
+void __sched rt_mutex_futex_unlock(struct rt_mutex *lock)
 {
-	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL)))
-		return false;
+	DEFINE_WAKE_Q(wake_q);
+	bool deboost;
 
-	return rt_mutex_slowunlock(lock, wqh);
+	raw_spin_lock_irq(&lock->wait_lock);
+	deboost = __rt_mutex_futex_unlock(lock, &wake_q);
+	raw_spin_unlock_irq(&lock->wait_lock);
+
+	if (deboost) {
+		wake_up_q(&wake_q);
+		rt_mutex_adjust_prio(current);
+	}
 }
 
 /**
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -109,9 +109,14 @@ extern int rt_mutex_start_proxy_lock(str
 extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
 				      struct hrtimer_sleeper *to,
 				      struct rt_mutex_waiter *waiter);
+
 extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
-extern bool rt_mutex_futex_unlock(struct rt_mutex *lock,
-				  struct wake_q_head *wqh);
+extern int rt_mutex_futex_trylock(struct rt_mutex *l);
+
+extern void rt_mutex_futex_unlock(struct rt_mutex *lock);
+extern bool __rt_mutex_futex_unlock(struct rt_mutex *lock,
+				 struct wake_q_head *wqh);
+
 extern void rt_mutex_adjust_prio(struct task_struct *task);
 
 #ifdef CONFIG_DEBUG_RT_MUTEXES
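
As a rough illustration of why dropping the fast path simplifies unlock,
here is a toy userspace model. All names (struct toy_lock, toy_unlock()
and friends) are made up for this sketch, the spinlock is a bare
atomic_flag, and waiters are reduced to a counter. Because every state
change happens under wait_lock and there is no cmpxchg fast path racing
with it, the unlock is a single pass that never has to drop wait_lock
and retry.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model of an rt_mutex; hypothetical, not the kernel structure. */
struct toy_lock {
	atomic_flag wait_lock;   /* stands in for rt_mutex::wait_lock */
	int owner;               /* 0 means unowned */
	int nr_waiters;          /* queued waiters, reduced to a count */
};

static void toy_lock_init(struct toy_lock *l, int owner, int nr_waiters)
{
	atomic_flag_clear(&l->wait_lock);
	l->owner = owner;
	l->nr_waiters = nr_waiters;
}

static void wait_lock_acquire(struct toy_lock *l)
{
	while (atomic_flag_test_and_set_explicit(&l->wait_lock,
						 memory_order_acquire))
		; /* spin until the flag was previously clear */
}

static void wait_lock_release(struct toy_lock *l)
{
	atomic_flag_clear_explicit(&l->wait_lock, memory_order_release);
}

/*
 * Caller holds wait_lock. Returns true when a waiter was picked for
 * wakeup, which is also when the caller has to undo any boosting.
 */
static bool toy_unlock_locked(struct toy_lock *l, int *nr_to_wake)
{
	assert(l->owner != 0);   /* unlocking an unowned lock is a bug */

	l->owner = 0;
	if (l->nr_waiters == 0)
		return false;    /* done, nobody to wake */

	l->nr_waiters--;
	(*nr_to_wake)++;
	return true;
}

/* Convenience wrapper; returns how many waiters the caller must wake. */
static int toy_unlock(struct toy_lock *l)
{
	int nr_to_wake = 0;

	wait_lock_acquire(l);
	toy_unlock_locked(l, &nr_to_wake);
	wait_lock_release(l);

	/* Wakeups are issued only after wait_lock has been dropped. */
	return nr_to_wake;
}
```

The split into a _locked helper plus a wrapper mirrors the shape of
__rt_mutex_futex_unlock() and rt_mutex_futex_unlock() in the patch; the
bookkeeping itself is deliberately simplified.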


* [PATCH -v6 05/13] futex: Change locking rules
  2017-03-22 10:35 [PATCH -v6 00/13] The arduous story of FUTEX_UNLOCK_PI Peter Zijlstra
                   ` (3 preceding siblings ...)
  2017-03-22 10:35 ` [PATCH -v6 04/13] futex,rt_mutex: Provide futex specific rt_mutex API Peter Zijlstra
@ 2017-03-22 10:35 ` Peter Zijlstra
  2017-03-23 18:21   ` [tip:locking/core] " tip-bot for Peter Zijlstra
  2017-04-05 21:18   ` [PATCH -v6 05/13] " Darren Hart
  2017-03-22 10:35 ` [PATCH -v6 06/13] futex: Cleanup refcounting Peter Zijlstra
                   ` (8 subsequent siblings)
  13 siblings, 2 replies; 60+ messages in thread
From: Peter Zijlstra @ 2017-03-22 10:35 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz

[-- Attachment #1: peterz-futex-pi-unlock-2.patch --]
[-- Type: text/plain, Size: 10280 bytes --]

Currently futex-pi relies on hb->lock to serialize everything. Since
hb->lock is giving us problems (PI inversions among other things,
since on -rt hb lock itself is a rt_mutex), we want to break this up a
bit.

This patch reworks and documents the locking. Notably, it
consistently uses rt_mutex::wait_lock to serialize {uval, pi_state}.
This would allow us to do rt_mutex_unlock() (including deboost)
without holding hb->lock.

Nothing yet relies on the new locking rules.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c |  165 +++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 132 insertions(+), 33 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -973,6 +973,39 @@ void exit_pi_state_list(struct task_stru
  *
  * [10] There is no transient state which leaves owner and user space
  *	TID out of sync.
+ *
+ *
+ * Serialization and lifetime rules:
+ *
+ * hb->lock:
+ *
+ *	hb -> futex_q, relation
+ *	futex_q -> pi_state, relation
+ *
+ *	(cannot be raw because hb can contain arbitrary amount
+ *	 of futex_q's)
+ *
+ * pi_mutex->wait_lock:
+ *
+ *	{uval, pi_state}
+ *
+ *	(and pi_mutex 'obviously')
+ *
+ * p->pi_lock:
+ *
+ *	p->pi_state_list -> pi_state->list, relation
+ *
+ * pi_state->refcount:
+ *
+ *	pi_state lifetime
+ *
+ *
+ * Lock order:
+ *
+ *   hb->lock
+ *     pi_mutex->wait_lock
+ *       p->pi_lock
+ *
  */
 
 /*
@@ -980,10 +1013,12 @@ void exit_pi_state_list(struct task_stru
  * the pi_state against the user space value. If correct, attach to
  * it.
  */
-static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
+static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
+			      struct futex_pi_state *pi_state,
 			      struct futex_pi_state **ps)
 {
 	pid_t pid = uval & FUTEX_TID_MASK;
+	int ret, uval2;
 
 	/*
 	 * Userspace might have messed up non-PI and PI futexes [3]
@@ -991,9 +1026,34 @@ static int attach_to_pi_state(u32 uval,
 	if (unlikely(!pi_state))
 		return -EINVAL;
 
+	/*
+	 * We get here with hb->lock held, and having found a
+	 * futex_top_waiter(). This means that futex_lock_pi() of said futex_q
+	 * has dropped the hb->lock in between queue_me() and unqueue_me_pi(),
+	 * which in turn means that futex_lock_pi() still has a reference on
+	 * our pi_state.
+	 */
 	WARN_ON(!atomic_read(&pi_state->refcount));
 
 	/*
+	 * Now that we have a pi_state, we can acquire wait_lock
+	 * and do the state validation.
+	 */
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
+
+	/*
+	 * Since {uval, pi_state} is serialized by wait_lock, and our current
+	 * uval was read without holding it, it can have changed. Verify it
+	 * still is what we expect it to be, otherwise retry the entire
+	 * operation.
+	 */
+	if (get_futex_value_locked(&uval2, uaddr))
+		goto out_efault;
+
+	if (uval != uval2)
+		goto out_eagain;
+
+	/*
 	 * Handle the owner died case:
 	 */
 	if (uval & FUTEX_OWNER_DIED) {
@@ -1008,11 +1068,11 @@ static int attach_to_pi_state(u32 uval,
 			 * is not 0. Inconsistent state. [5]
 			 */
 			if (pid)
-				return -EINVAL;
+				goto out_einval;
 			/*
 			 * Take a ref on the state and return success. [4]
 			 */
-			goto out_state;
+			goto out_attach;
 		}
 
 		/*
@@ -1024,14 +1084,14 @@ static int attach_to_pi_state(u32 uval,
 		 * Take a ref on the state and return success. [6]
 		 */
 		if (!pid)
-			goto out_state;
+			goto out_attach;
 	} else {
 		/*
 		 * If the owner died bit is not set, then the pi_state
 		 * must have an owner. [7]
 		 */
 		if (!pi_state->owner)
-			return -EINVAL;
+			goto out_einval;
 	}
 
 	/*
@@ -1040,11 +1100,29 @@ static int attach_to_pi_state(u32 uval,
 	 * user space TID. [9/10]
 	 */
 	if (pid != task_pid_vnr(pi_state->owner))
-		return -EINVAL;
-out_state:
+		goto out_einval;
+
+out_attach:
 	atomic_inc(&pi_state->refcount);
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 	*ps = pi_state;
 	return 0;
+
+out_einval:
+	ret = -EINVAL;
+	goto out_error;
+
+out_eagain:
+	ret = -EAGAIN;
+	goto out_error;
+
+out_efault:
+	ret = -EFAULT;
+	goto out_error;
+
+out_error:
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+	return ret;
 }
 
 /*
@@ -1095,6 +1173,9 @@ static int attach_to_pi_owner(u32 uval,
 
 	/*
 	 * No existing pi state. First waiter. [2]
+	 *
+	 * This creates pi_state, we have hb->lock held, this means nothing can
+	 * observe this state, wait_lock is irrelevant.
 	 */
 	pi_state = alloc_pi_state();
 
@@ -1119,7 +1200,8 @@ static int attach_to_pi_owner(u32 uval,
 	return 0;
 }
 
-static int lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
+static int lookup_pi_state(u32 __user *uaddr, u32 uval,
+			   struct futex_hash_bucket *hb,
 			   union futex_key *key, struct futex_pi_state **ps)
 {
 	struct futex_q *top_waiter = futex_top_waiter(hb, key);
@@ -1129,7 +1211,7 @@ static int lookup_pi_state(u32 uval, str
 	 * attach to the pi_state when the validation succeeds.
 	 */
 	if (top_waiter)
-		return attach_to_pi_state(uval, top_waiter->pi_state, ps);
+		return attach_to_pi_state(uaddr, uval, top_waiter->pi_state, ps);
 
 	/*
 	 * We are the first waiter - try to look up the owner based on
@@ -1148,7 +1230,7 @@ static int lock_pi_update_atomic(u32 __u
 	if (unlikely(cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)))
 		return -EFAULT;
 
-	/*If user space value changed, let the caller retry */
+	/* If user space value changed, let the caller retry */
 	return curval != uval ? -EAGAIN : 0;
 }
 
@@ -1204,7 +1286,7 @@ static int futex_lock_pi_atomic(u32 __us
 	 */
 	top_waiter = futex_top_waiter(hb, key);
 	if (top_waiter)
-		return attach_to_pi_state(uval, top_waiter->pi_state, ps);
+		return attach_to_pi_state(uaddr, uval, top_waiter->pi_state, ps);
 
 	/*
 	 * No waiter and user TID is 0. We are here because the
@@ -1336,6 +1418,7 @@ static int wake_futex_pi(u32 __user *uad
 
 	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)) {
 		ret = -EFAULT;
+
 	} else if (curval != uval) {
 		/*
 		 * If a unconditional UNLOCK_PI operation (user space did not
@@ -1348,6 +1431,7 @@ static int wake_futex_pi(u32 __user *uad
 		else
 			ret = -EINVAL;
 	}
+
 	if (ret) {
 		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 		return ret;
@@ -1823,7 +1907,7 @@ static int futex_requeue(u32 __user *uad
 			 * If that call succeeds then we have pi_state and an
 			 * initial refcount on it.
 			 */
-			ret = lookup_pi_state(ret, hb2, &key2, &pi_state);
+			ret = lookup_pi_state(uaddr2, ret, hb2, &key2, &pi_state);
 		}
 
 		switch (ret) {
@@ -2122,10 +2206,13 @@ static int fixup_pi_state_owner(u32 __us
 {
 	u32 newtid = task_pid_vnr(newowner) | FUTEX_WAITERS;
 	struct futex_pi_state *pi_state = q->pi_state;
-	struct task_struct *oldowner = pi_state->owner;
 	u32 uval, uninitialized_var(curval), newval;
+	struct task_struct *oldowner;
 	int ret;
 
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
+
+	oldowner = pi_state->owner;
 	/* Owner died? */
 	if (!pi_state->owner)
 		newtid |= FUTEX_OWNER_DIED;
@@ -2141,11 +2228,10 @@ static int fixup_pi_state_owner(u32 __us
 	 * because we can fault here. Imagine swapped out pages or a fork
 	 * that marked all the anonymous memory readonly for cow.
 	 *
-	 * Modifying pi_state _before_ the user space value would
-	 * leave the pi_state in an inconsistent state when we fault
-	 * here, because we need to drop the hash bucket lock to
-	 * handle the fault. This might be observed in the PID check
-	 * in lookup_pi_state.
+	 * Modifying pi_state _before_ the user space value would leave the
+	 * pi_state in an inconsistent state when we fault here, because we
+	 * need to drop the locks to handle the fault. This might be observed
+	 * in the PID check in lookup_pi_state.
 	 */
 retry:
 	if (get_futex_value_locked(&uval, uaddr))
@@ -2166,47 +2252,60 @@ static int fixup_pi_state_owner(u32 __us
 	 * itself.
 	 */
 	if (pi_state->owner != NULL) {
-		raw_spin_lock_irq(&pi_state->owner->pi_lock);
+		raw_spin_lock(&pi_state->owner->pi_lock);
 		WARN_ON(list_empty(&pi_state->list));
 		list_del_init(&pi_state->list);
-		raw_spin_unlock_irq(&pi_state->owner->pi_lock);
+		raw_spin_unlock(&pi_state->owner->pi_lock);
 	}
 
 	pi_state->owner = newowner;
 
-	raw_spin_lock_irq(&newowner->pi_lock);
+	raw_spin_lock(&newowner->pi_lock);
 	WARN_ON(!list_empty(&pi_state->list));
 	list_add(&pi_state->list, &newowner->pi_state_list);
-	raw_spin_unlock_irq(&newowner->pi_lock);
+	raw_spin_unlock(&newowner->pi_lock);
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+
 	return 0;
 
 	/*
-	 * To handle the page fault we need to drop the hash bucket
-	 * lock here. That gives the other task (either the highest priority
-	 * waiter itself or the task which stole the rtmutex) the
-	 * chance to try the fixup of the pi_state. So once we are
-	 * back from handling the fault we need to check the pi_state
-	 * after reacquiring the hash bucket lock and before trying to
-	 * do another fixup. When the fixup has been done already we
-	 * simply return.
+	 * To handle the page fault we need to drop the locks here. That gives
+	 * the other task (either the highest priority waiter itself or the
+	 * task which stole the rtmutex) the chance to try the fixup of the
+	 * pi_state. So once we are back from handling the fault we need to
+	 * check the pi_state after reacquiring the locks and before trying to
+	 * do another fixup. When the fixup has been done already we simply
+	 * return.
+	 *
+	 * Note: we hold both hb->lock and pi_mutex->wait_lock. We can safely
+	 * drop hb->lock since the caller owns the hb -> futex_q relation.
+	 * Dropping the pi_mutex->wait_lock requires the state revalidate.
 	 */
 handle_fault:
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 	spin_unlock(q->lock_ptr);
 
 	ret = fault_in_user_writeable(uaddr);
 
 	spin_lock(q->lock_ptr);
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 
 	/*
 	 * Check if someone else fixed it for us:
 	 */
-	if (pi_state->owner != oldowner)
-		return 0;
+	if (pi_state->owner != oldowner) {
+		ret = 0;
+		goto out_unlock;
+	}
 
 	if (ret)
-		return ret;
+		goto out_unlock;
 
 	goto retry;
+
+out_unlock:
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+	return ret;
 }
 
 static long futex_wait_restart(struct restart_block *restart);
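
The revalidation step added to attach_to_pi_state() can be sketched in
plain C as follows. This is a simplified userspace model under stated
assumptions: struct toy_pi_state, toy_attach() and TOY_TID_MASK are
hypothetical, the user word is an ordinary C11 atomic rather than user
memory (so there is no -EFAULT path), and the owner check is reduced to
a TID comparison. It shows the pattern of re-reading a value that was
sampled without the lock, returning -EAGAIN when it changed, and funnelling
every exit through a single unlock.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

#define TOY_TID_MASK 0x3fffffffu   /* low bits of the word hold the owner TID */

/* Hypothetical stand-in for struct futex_pi_state. */
struct toy_pi_state {
	atomic_flag wait_lock;     /* serializes {uval, pi_state} */
	int refcount;
	uint32_t owner_tid;
};

static void toy_pi_state_init(struct toy_pi_state *ps, uint32_t owner_tid)
{
	atomic_flag_clear(&ps->wait_lock);
	ps->refcount = 1;
	ps->owner_tid = owner_tid;
}

/*
 * uval was read by the caller without wait_lock held, so it may be
 * stale. Re-read under the lock and ask the caller to retry if it moved.
 */
static int toy_attach(_Atomic uint32_t *uaddr, uint32_t uval,
		      struct toy_pi_state *ps)
{
	uint32_t uval2;
	int ret;

	/* The queued waiter still holds a reference on ps. */
	assert(ps->refcount > 0);

	while (atomic_flag_test_and_set_explicit(&ps->wait_lock,
						 memory_order_acquire))
		; /* spin */

	uval2 = atomic_load_explicit(uaddr, memory_order_relaxed);
	if (uval2 != uval) {
		ret = -EAGAIN;     /* value changed, retry the operation */
		goto out_unlock;
	}

	if ((uval & TOY_TID_MASK) != ps->owner_tid) {
		ret = -EINVAL;     /* user word and kernel state disagree */
		goto out_unlock;
	}

	ps->refcount++;            /* attach: take our own reference */
	ret = 0;

out_unlock:
	atomic_flag_clear_explicit(&ps->wait_lock, memory_order_release);
	return ret;
}
```

A caller that gets -EAGAIN back is expected to re-read the user word and
start over, which is what the existing retry logic around
lock_pi_update_atomic() already does for the same error code.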


* [PATCH -v6 06/13] futex: Cleanup refcounting
  2017-03-22 10:35 [PATCH -v6 00/13] The arduous story of FUTEX_UNLOCK_PI Peter Zijlstra
                   ` (4 preceding siblings ...)
  2017-03-22 10:35 ` [PATCH -v6 05/13] futex: Change locking rules Peter Zijlstra
@ 2017-03-22 10:35 ` Peter Zijlstra
  2017-03-23 18:21   ` [tip:locking/core] " tip-bot for Peter Zijlstra
  2017-04-05 21:29   ` [PATCH -v6 06/13] " Darren Hart
  2017-03-22 10:35 ` [PATCH -v6 07/13] futex: Rework inconsistent rt_mutex/futex_q state Peter Zijlstra
                   ` (7 subsequent siblings)
  13 siblings, 2 replies; 60+ messages in thread
From: Peter Zijlstra @ 2017-03-22 10:35 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz

[-- Attachment #1: peterz-futex-pi-unlock-8.patch --]
[-- Type: text/plain, Size: 1760 bytes --]

Since we're going to add more refcount fiddling, introduce
get_pi_state() to match the existing put_pi_state().
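
As a rough userspace illustration of the get/put pairing (not part of the
patch), the sketch below models the refcount with C11 atomics. All model_*
names and struct pi_state_model are hypothetical stand-ins; the boolean
return takes the place of the kernel's WARN_ON_ONCE(), and the 'freed' flag
stands in for the free-or-cache step of put_pi_state().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct futex_pi_state; only the refcount matters. */
struct pi_state_model {
	atomic_int refcount;
	bool freed;		/* set once the last reference is dropped */
};

static void model_pi_state_init(struct pi_state_model *ps)
{
	atomic_init(&ps->refcount, 1);	/* the creator holds the first reference */
	ps->freed = false;
}

/* Increment only while the count is non-zero, like atomic_inc_not_zero(). */
static bool model_inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);

	while (old != 0) {
		/* On failure 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}

/* Models get_pi_state(); returns false where the kernel would warn. */
static bool model_get_pi_state(struct pi_state_model *ps)
{
	return model_inc_not_zero(&ps->refcount);
}

/* Models put_pi_state(): the last put marks the object as released. */
static void model_put_pi_state(struct pi_state_model *ps)
{
	int old = atomic_fetch_sub(&ps->refcount, 1);

	assert(old > 0);	/* a put without a matching get is a bug */
	if (old == 1)
		ps->freed = true;
}
```

The point of inc-not-zero over a plain increment is that a get on an object
whose count already reached zero is reported instead of silently
resurrecting it.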

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c |   13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -802,7 +802,7 @@ static int refill_pi_state_cache(void)
 	return 0;
 }
 
-static struct futex_pi_state * alloc_pi_state(void)
+static struct futex_pi_state *alloc_pi_state(void)
 {
 	struct futex_pi_state *pi_state = current->pi_state_cache;
 
@@ -812,6 +812,11 @@ static struct futex_pi_state * alloc_pi_
 	return pi_state;
 }
 
+static void get_pi_state(struct futex_pi_state *pi_state)
+{
+	WARN_ON_ONCE(!atomic_inc_not_zero(&pi_state->refcount));
+}
+
 /*
  * Drops a reference to the pi_state object and frees or caches it
  * when the last reference is gone.
@@ -856,7 +861,7 @@ static void put_pi_state(struct futex_pi
  * Look up the task based on what TID userspace gave us.
  * We dont trust it.
  */
-static struct task_struct * futex_find_get_task(pid_t pid)
+static struct task_struct *futex_find_get_task(pid_t pid)
 {
 	struct task_struct *p;
 
@@ -1103,7 +1108,7 @@ static int attach_to_pi_state(u32 __user
 		goto out_einval;
 
 out_attach:
-	atomic_inc(&pi_state->refcount);
+	get_pi_state(pi_state);
 	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 	*ps = pi_state;
 	return 0;
@@ -1990,7 +1995,7 @@ static int futex_requeue(u32 __user *uad
 			 * refcount on the pi_state and store the pointer in
 			 * the futex_q object of the waiter.
 			 */
-			atomic_inc(&pi_state->refcount);
+			get_pi_state(pi_state);
 			this->pi_state = pi_state;
 			ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
 							this->rt_waiter,


* [PATCH -v6 07/13] futex: Rework inconsistent rt_mutex/futex_q state
  2017-03-22 10:35 [PATCH -v6 00/13] The arduous story of FUTEX_UNLOCK_PI Peter Zijlstra
                   ` (5 preceding siblings ...)
  2017-03-22 10:35 ` [PATCH -v6 06/13] futex: Cleanup refcounting Peter Zijlstra
@ 2017-03-22 10:35 ` Peter Zijlstra
  2017-03-23 18:22   ` [tip:locking/core] " tip-bot for Peter Zijlstra
  2017-04-05 21:58   ` [PATCH -v6 07/13] " Darren Hart
  2017-03-22 10:35 ` [PATCH -v6 08/13] futex: Pull rt_mutex_futex_unlock() out from under hb->lock Peter Zijlstra
                   ` (6 subsequent siblings)
  13 siblings, 2 replies; 60+ messages in thread
From: Peter Zijlstra @ 2017-03-22 10:35 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz

[-- Attachment #1: peter_zijlstra-futex_unlock_pi_wobbles.patch --]
[-- Type: text/plain, Size: 4074 bytes --]

There is a weird state in the futex_unlock_pi() path when it
interleaves with a concurrent futex_lock_pi() at the point where it
drops hb->lock.

In this case, it can happen that the rt_mutex wait_list and the
futex_q disagree on pending waiters, in particular rt_mutex will find
no pending waiters where futex_q thinks there are.

In this case the rt_mutex unlock code cannot assign an owner.

What the current code does in this case is use the futex_q waiter that
got us here; however, when rt_mutex_timed_futex_lock() has already
failed, this leaves things in a weird state, resulting in many
headaches in fixup_owner().

Simplify all this by changing wake_futex_pi() to return -EAGAIN when
this situation occurs. This then gives the futex_lock_pi() code the
opportunity to continue and the retried futex_unlock_pi() will now
observe a coherent state.

The only problem is that this breaks RT timeliness guarantees. That
is, consider the following scenario:

  T1 and T2 are both pinned to CPU0. prio(T2) > prio(T1)

    CPU0

    T1
      lock_pi()
      queue_me()  <- Waiter is visible

    preemption

    T2
      unlock_pi()
	loops with -EAGAIN forever

Which is undesirable for PI primitives. Future patches will rectify
this. For now we want to get rid of the fixup magic.
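
A toy, single-threaded model of the retry and of the starvation case above
might look like the sketch below. It is only an illustration under stated
assumptions: struct pi_model and the model_* functions are hypothetical, the
'waiter_can_run' flag stands in for whether T1 ever gets CPU time, and
'max_tries' exists purely so the model terminates where the real loop
would not.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model state; none of these names exist in kernel/futex.c. */
struct pi_model {
	bool futex_q_waiter;	/* a waiter is visible on the hash bucket */
	int rtmutex_waiters;	/* entries on the rt_mutex wait_list */
};

/* Models the reworked wake_futex_pi(): no rt_mutex waiter means "try again". */
static int model_wake_futex_pi(const struct pi_model *m)
{
	return m->rtmutex_waiters ? 0 : -EAGAIN;
}

/* Models the futex_unlock_pi() retry loop, bounded by max_tries. */
static int model_unlock_pi(struct pi_model *m, bool waiter_can_run,
			   int max_tries, int *tries)
{
	assert(max_tries > 0);

	for (*tries = 1; ; (*tries)++) {
		int ret;

		if (!m->futex_q_waiter)
			return 0;	/* uncontended: nothing to hand over */

		ret = model_wake_futex_pi(m);
		if (ret != -EAGAIN || *tries == max_tries)
			return ret;

		/* Between retries the locker may get to enqueue itself. */
		if (waiter_can_run)
			m->rtmutex_waiters++;
	}
}
```

With waiter_can_run set, the second attempt sees a coherent state and
succeeds; with it cleared (T1 preempted by the higher priority T2 on the
same CPU), every attempt returns -EAGAIN.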

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c |   50 ++++++++++++++------------------------------------
 1 file changed, 14 insertions(+), 36 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1404,12 +1404,19 @@ static int wake_futex_pi(u32 __user *uad
 	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
 
 	/*
-	 * It is possible that the next waiter (the one that brought
-	 * top_waiter owner to the kernel) timed out and is no longer
-	 * waiting on the lock.
+	 * When we interleave with futex_lock_pi() where it does
+	 * rt_mutex_timed_futex_lock(), we might observe @this futex_q waiter,
+	 * but the rt_mutex's wait_list can be empty (either still, or again,
+	 * depending on which side we land).
+	 *
+	 * When this happens, give up our locks and try again, giving the
+	 * futex_lock_pi() instance time to complete, either by waiting on the
+	 * rtmutex or removing itself from the futex queue.
 	 */
-	if (!new_owner)
-		new_owner = top_waiter->task;
+	if (!new_owner) {
+		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+		return -EAGAIN;
+	}
 
 	/*
 	 * We pass it to the next owner. The WAITERS bit is always
@@ -2332,7 +2339,6 @@ static long futex_wait_restart(struct re
  */
 static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked)
 {
-	struct task_struct *owner;
 	int ret = 0;
 
 	if (locked) {
@@ -2346,43 +2352,15 @@ static int fixup_owner(u32 __user *uaddr
 	}
 
 	/*
-	 * Catch the rare case, where the lock was released when we were on the
-	 * way back before we locked the hash bucket.
-	 */
-	if (q->pi_state->owner == current) {
-		/*
-		 * Try to get the rt_mutex now. This might fail as some other
-		 * task acquired the rt_mutex after we removed ourself from the
-		 * rt_mutex waiters list.
-		 */
-		if (rt_mutex_futex_trylock(&q->pi_state->pi_mutex)) {
-			locked = 1;
-			goto out;
-		}
-
-		/*
-		 * pi_state is incorrect, some other task did a lock steal and
-		 * we returned due to timeout or signal without taking the
-		 * rt_mutex. Too late.
-		 */
-		raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock);
-		owner = rt_mutex_owner(&q->pi_state->pi_mutex);
-		if (!owner)
-			owner = rt_mutex_next_owner(&q->pi_state->pi_mutex);
-		raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock);
-		ret = fixup_pi_state_owner(uaddr, q, owner);
-		goto out;
-	}
-
-	/*
 	 * Paranoia check. If we did not take the lock, then we should not be
 	 * the owner of the rt_mutex.
 	 */
-	if (rt_mutex_owner(&q->pi_state->pi_mutex) == current)
+	if (rt_mutex_owner(&q->pi_state->pi_mutex) == current) {
 		printk(KERN_ERR "fixup_owner: ret = %d pi-mutex: %p "
 				"pi-state %p\n", ret,
 				q->pi_state->pi_mutex.owner,
 				q->pi_state->owner);
+	}
 
 out:
 	return ret ? ret : locked;


* [PATCH -v6 08/13] futex: Pull rt_mutex_futex_unlock() out from under hb->lock
  2017-03-22 10:35 [PATCH -v6 00/13] The arduous story of FUTEX_UNLOCK_PI Peter Zijlstra
                   ` (6 preceding siblings ...)
  2017-03-22 10:35 ` [PATCH -v6 07/13] futex: Rework inconsistent rt_mutex/futex_q state Peter Zijlstra
@ 2017-03-22 10:35 ` Peter Zijlstra
  2017-03-23 18:22   ` [tip:locking/core] " tip-bot for Peter Zijlstra
  2017-04-05 23:52   ` [PATCH -v6 08/13] " Darren Hart
  2017-03-22 10:35 ` [PATCH -v6 09/13] futex,rt_mutex: Introduce rt_mutex_init_waiter() Peter Zijlstra
                   ` (5 subsequent siblings)
  13 siblings, 2 replies; 60+ messages in thread
From: Peter Zijlstra @ 2017-03-22 10:35 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz

[-- Attachment #1: peterz-futex-pi-unlock-3.patch --]
[-- Type: text/plain, Size: 10596 bytes --]

There are a number of 'interesting' problems, all caused by holding
hb->lock while doing the rt_mutex_unlock() equivalent.

Notably:

 - a PI inversion on hb->lock; and,

 - a DL crash because of pointer instability.

Because of all the previous patches that:

 - allow us to do rt_mutex_futex_unlock() without dropping wait_lock;
   which in turn allows us to rely on wait_lock atomicity.

 - changed locking rules to cover {uval,pi_state} with wait_lock.

 - simplified the waiter conundrum.

We can now quite simply pull rt_mutex_futex_unlock() out from under
hb->lock; a pi_state reference and wait_lock are sufficient.
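
The ordering this relies on (take a reference, drop hb->lock, unlock the
rt_mutex, drop the reference) can be sketched with a small single-threaded
model. This is an illustration only: struct unlock_model and the model_*
functions are hypothetical, and plain fields stand in for the real locks
and refcount.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bookkeeping; plain fields stand in for real locks. */
struct unlock_model {
	bool hb_locked;		/* models hb->lock being held */
	bool mutex_held;	/* models ownership of pi_state->pi_mutex */
	int refcount;		/* models pi_state->refcount */
	int refs_at_unlock;	/* refcount observed during the unlock */
	bool unlocked_under_hb;	/* true if the unlock ran under hb->lock */
};

/* Models rt_mutex_futex_unlock() and records the context it ran in. */
static void model_rt_mutex_futex_unlock(struct unlock_model *m)
{
	m->unlocked_under_hb = m->hb_locked;
	m->refs_at_unlock = m->refcount;
	m->mutex_held = false;
}

/* Models the new ordering used in the exit and unlock paths. */
static void model_release(struct unlock_model *m)
{
	assert(m->hb_locked && m->refcount > 0);

	m->refcount++;			/* get_pi_state() */
	m->hb_locked = false;		/* spin_unlock(&hb->lock) */
	model_rt_mutex_futex_unlock(m);	/* no longer under hb->lock */
	m->refcount--;			/* put_pi_state() */
}
```

The extra reference is what keeps the pi_state alive across the window in
which hb->lock is no longer held.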

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c |  154 +++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 100 insertions(+), 54 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -921,10 +921,12 @@ void exit_pi_state_list(struct task_stru
 		pi_state->owner = NULL;
 		raw_spin_unlock_irq(&curr->pi_lock);
 
-		rt_mutex_futex_unlock(&pi_state->pi_mutex);
-
+		get_pi_state(pi_state);
 		spin_unlock(&hb->lock);
 
+		rt_mutex_futex_unlock(&pi_state->pi_mutex);
+		put_pi_state(pi_state);
+
 		raw_spin_lock_irq(&curr->pi_lock);
 	}
 	raw_spin_unlock_irq(&curr->pi_lock);
@@ -1037,6 +1039,11 @@ static int attach_to_pi_state(u32 __user
 	 * has dropped the hb->lock in between queue_me() and unqueue_me_pi(),
 	 * which in turn means that futex_lock_pi() still has a reference on
 	 * our pi_state.
+	 *
+	 * The waiter holding a reference on @pi_state also protects against
+	 * the unlocked put_pi_state() in futex_unlock_pi(), futex_lock_pi()
+	 * and futex_wait_requeue_pi() as it cannot go to 0 and consequently
+	 * free pi_state before we can take a reference ourselves.
 	 */
 	WARN_ON(!atomic_read(&pi_state->refcount));
 
@@ -1380,48 +1387,40 @@ static void mark_wake_futex(struct wake_
 	smp_store_release(&q->lock_ptr, NULL);
 }
 
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
-			 struct futex_hash_bucket *hb)
+/*
+ * Caller must hold a reference on @pi_state.
+ */
+static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
 {
-	struct task_struct *new_owner;
-	struct futex_pi_state *pi_state = top_waiter->pi_state;
 	u32 uninitialized_var(curval), newval;
+	struct task_struct *new_owner;
+	bool deboost = false;
 	DEFINE_WAKE_Q(wake_q);
-	bool deboost;
 	int ret = 0;
 
-	if (!pi_state)
-		return -EINVAL;
-
-	/*
-	 * If current does not own the pi_state then the futex is
-	 * inconsistent and user space fiddled with the futex value.
-	 */
-	if (pi_state->owner != current)
-		return -EINVAL;
-
 	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
-
-	/*
-	 * When we interleave with futex_lock_pi() where it does
-	 * rt_mutex_timed_futex_lock(), we might observe @this futex_q waiter,
-	 * but the rt_mutex's wait_list can be empty (either still, or again,
-	 * depending on which side we land).
-	 *
-	 * When this happens, give up our locks and try again, giving the
-	 * futex_lock_pi() instance time to complete, either by waiting on the
-	 * rtmutex or removing itself from the futex queue.
-	 */
 	if (!new_owner) {
-		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
-		return -EAGAIN;
+		/*
+		 * Since we held neither hb->lock nor wait_lock when coming
+		 * into this function, we could have raced with futex_lock_pi()
+		 * such that we might observe @this futex_q waiter, but the
+		 * rt_mutex's wait_list can be empty (either still, or again,
+		 * depending on which side we land).
+		 *
+		 * When this happens, give up our locks and try again, giving
+		 * the futex_lock_pi() instance time to complete, either by
+		 * waiting on the rtmutex or removing itself from the futex
+		 * queue.
+		 */
+		ret = -EAGAIN;
+		goto out_unlock;
 	}
 
 	/*
-	 * We pass it to the next owner. The WAITERS bit is always
-	 * kept enabled while there is PI state around. We cleanup the
-	 * owner died bit, because we are the owner.
+	 * We pass it to the next owner. The WAITERS bit is always kept
+	 * enabled while there is PI state around. We cleanup the owner
+	 * died bit, because we are the owner.
 	 */
 	newval = FUTEX_WAITERS | task_pid_vnr(new_owner);
 
@@ -1444,10 +1443,8 @@ static int wake_futex_pi(u32 __user *uad
 			ret = -EINVAL;
 	}
 
-	if (ret) {
-		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
-		return ret;
-	}
+	if (ret)
+		goto out_unlock;
 
 	raw_spin_lock(&pi_state->owner->pi_lock);
 	WARN_ON(list_empty(&pi_state->list));
@@ -1465,15 +1462,15 @@ static int wake_futex_pi(u32 __user *uad
 	 */
 	deboost = __rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
 
+out_unlock:
 	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
-	spin_unlock(&hb->lock);
 
 	if (deboost) {
 		wake_up_q(&wake_q);
 		rt_mutex_adjust_prio(current);
 	}
 
-	return 0;
+	return ret;
 }
 
 /*
@@ -2232,7 +2229,8 @@ static int fixup_pi_state_owner(u32 __us
 	/*
 	 * We are here either because we stole the rtmutex from the
 	 * previous highest priority waiter or we are the highest priority
-	 * waiter but failed to get the rtmutex the first time.
+	 * waiter but have failed to get the rtmutex the first time.
+	 *
 	 * We have to replace the newowner TID in the user space variable.
 	 * This must be atomic as we have to preserve the owner died bit here.
 	 *
@@ -2249,7 +2247,7 @@ static int fixup_pi_state_owner(u32 __us
 	if (get_futex_value_locked(&uval, uaddr))
 		goto handle_fault;
 
-	while (1) {
+	for (;;) {
 		newval = (uval & FUTEX_OWNER_DIED) | newtid;
 
 		if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval))
@@ -2345,6 +2343,10 @@ static int fixup_owner(u32 __user *uaddr
 		/*
 		 * Got the lock. We might not be the anticipated owner if we
 		 * did a lock-steal - fix up the PI-state in that case:
+		 *
+		 * We can safely read pi_state->owner without holding wait_lock
+		 * because we now own the rt_mutex, only the owner will attempt
+		 * to change it.
 		 */
 		if (q->pi_state->owner != current)
 			ret = fixup_pi_state_owner(uaddr, q, current);
@@ -2584,6 +2586,7 @@ static int futex_lock_pi(u32 __user *uad
 			 ktime_t *time, int trylock)
 {
 	struct hrtimer_sleeper timeout, *to = NULL;
+	struct futex_pi_state *pi_state = NULL;
 	struct futex_hash_bucket *hb;
 	struct futex_q q = futex_q_init;
 	int res, ret;
@@ -2670,12 +2673,19 @@ static int futex_lock_pi(u32 __user *uad
 	 * If fixup_owner() faulted and was unable to handle the fault, unlock
 	 * it and return the fault to userspace.
 	 */
-	if (ret && (rt_mutex_owner(&q.pi_state->pi_mutex) == current))
-		rt_mutex_futex_unlock(&q.pi_state->pi_mutex);
+	if (ret && (rt_mutex_owner(&q.pi_state->pi_mutex) == current)) {
+		pi_state = q.pi_state;
+		get_pi_state(pi_state);
+	}
 
 	/* Unqueue and drop the lock */
 	unqueue_me_pi(&q);
 
+	if (pi_state) {
+		rt_mutex_futex_unlock(&pi_state->pi_mutex);
+		put_pi_state(pi_state);
+	}
+
 	goto out_put_key;
 
 out_unlock_put_key:
@@ -2738,10 +2748,36 @@ static int futex_unlock_pi(u32 __user *u
 	 */
 	top_waiter = futex_top_waiter(hb, &key);
 	if (top_waiter) {
-		ret = wake_futex_pi(uaddr, uval, top_waiter, hb);
+		struct futex_pi_state *pi_state = top_waiter->pi_state;
+
+		ret = -EINVAL;
+		if (!pi_state)
+			goto out_unlock;
+
+		/*
+		 * If current does not own the pi_state then the futex is
+		 * inconsistent and user space fiddled with the futex value.
+		 */
+		if (pi_state->owner != current)
+			goto out_unlock;
+
+		/*
+		 * Grab a reference on the pi_state and drop hb->lock.
+		 *
+		 * The reference ensures pi_state lives, dropping the hb->lock
+		 * is tricky.. wake_futex_pi() will take rt_mutex::wait_lock to
+		 * close the races against futex_lock_pi(), but in case of
+		 * _any_ fail we'll abort and retry the whole deal.
+		 */
+		get_pi_state(pi_state);
+		spin_unlock(&hb->lock);
+
+		ret = wake_futex_pi(uaddr, uval, pi_state);
+
+		put_pi_state(pi_state);
+
 		/*
-		 * In case of success wake_futex_pi dropped the hash
-		 * bucket lock.
+		 * Success, we're done! No tricky corner cases.
 		 */
 		if (!ret)
 			goto out_putkey;
@@ -2756,7 +2792,6 @@ static int futex_unlock_pi(u32 __user *u
 		 * setting the FUTEX_WAITERS bit. Try again.
 		 */
 		if (ret == -EAGAIN) {
-			spin_unlock(&hb->lock);
 			put_futex_key(&key);
 			goto retry;
 		}
@@ -2764,7 +2799,7 @@ static int futex_unlock_pi(u32 __user *u
 		 * wake_futex_pi has detected invalid state. Tell user
 		 * space.
 		 */
-		goto out_unlock;
+		goto out_putkey;
 	}
 
 	/*
@@ -2774,8 +2809,10 @@ static int futex_unlock_pi(u32 __user *u
 	 * preserve the WAITERS bit not the OWNER_DIED one. We are the
 	 * owner.
 	 */
-	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, 0))
+	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, 0)) {
+		spin_unlock(&hb->lock);
 		goto pi_faulted;
+	}
 
 	/*
 	 * If uval has changed, let user space handle it.
@@ -2789,7 +2826,6 @@ static int futex_unlock_pi(u32 __user *u
 	return ret;
 
 pi_faulted:
-	spin_unlock(&hb->lock);
 	put_futex_key(&key);
 
 	ret = fault_in_user_writeable(uaddr);
@@ -2893,6 +2929,7 @@ static int futex_wait_requeue_pi(u32 __u
 				 u32 __user *uaddr2)
 {
 	struct hrtimer_sleeper timeout, *to = NULL;
+	struct futex_pi_state *pi_state = NULL;
 	struct rt_mutex_waiter rt_waiter;
 	struct futex_hash_bucket *hb;
 	union futex_key key2 = FUTEX_KEY_INIT;
@@ -2977,8 +3014,10 @@ static int futex_wait_requeue_pi(u32 __u
 		if (q.pi_state && (q.pi_state->owner != current)) {
 			spin_lock(q.lock_ptr);
 			ret = fixup_pi_state_owner(uaddr2, &q, current);
-			if (ret && rt_mutex_owner(&q.pi_state->pi_mutex) == current)
-				rt_mutex_futex_unlock(&q.pi_state->pi_mutex);
+			if (ret && rt_mutex_owner(&q.pi_state->pi_mutex) == current) {
+				pi_state = q.pi_state;
+				get_pi_state(pi_state);
+			}
 			/*
 			 * Drop the reference to the pi state which
 			 * the requeue_pi() code acquired for us.
@@ -3017,13 +3056,20 @@ static int futex_wait_requeue_pi(u32 __u
 		 * the fault, unlock the rt_mutex and return the fault to
 		 * userspace.
 		 */
-		if (ret && rt_mutex_owner(pi_mutex) == current)
-			rt_mutex_futex_unlock(pi_mutex);
+		if (ret && rt_mutex_owner(&q.pi_state->pi_mutex) == current) {
+			pi_state = q.pi_state;
+			get_pi_state(pi_state);
+		}
 
 		/* Unqueue and drop the lock. */
 		unqueue_me_pi(&q);
 	}
 
+	if (pi_state) {
+		rt_mutex_futex_unlock(&pi_state->pi_mutex);
+		put_pi_state(pi_state);
+	}
+
 	if (ret == -EINTR) {
 		/*
 		 * We've already been requeued, but cannot restart by calling


* [PATCH -v6 09/13] futex,rt_mutex: Introduce rt_mutex_init_waiter()
  2017-03-22 10:35 [PATCH -v6 00/13] The arduous story of FUTEX_UNLOCK_PI Peter Zijlstra
                   ` (7 preceding siblings ...)
  2017-03-22 10:35 ` [PATCH -v6 08/13] futex: Pull rt_mutex_futex_unlock() out from under hb->lock Peter Zijlstra
@ 2017-03-22 10:35 ` Peter Zijlstra
  2017-03-23 18:23   ` [tip:locking/core] " tip-bot for Peter Zijlstra
  2017-04-05 23:57   ` [PATCH -v6 09/13] " Darren Hart
  2017-03-22 10:35 ` [PATCH -v6 10/13] futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock() Peter Zijlstra
                   ` (4 subsequent siblings)
  13 siblings, 2 replies; 60+ messages in thread
From: Peter Zijlstra @ 2017-03-22 10:35 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz

[-- Attachment #1: peterz-futex-pi-unlock-10.patch --]
[-- Type: text/plain, Size: 2163 bytes --]

Since there are already two copies of this code, introduce a helper now
before we get a third instance.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c                  |    5 +----
 kernel/locking/rtmutex.c        |   12 +++++++++---
 kernel/locking/rtmutex_common.h |    1 +
 3 files changed, 11 insertions(+), 7 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2956,10 +2956,7 @@ static int futex_wait_requeue_pi(u32 __u
 	 * The waiter is allocated on our stack, manipulated by the requeue
 	 * code while we sleep on uaddr.
 	 */
-	debug_rt_mutex_init_waiter(&rt_waiter);
-	RB_CLEAR_NODE(&rt_waiter.pi_tree_entry);
-	RB_CLEAR_NODE(&rt_waiter.tree_entry);
-	rt_waiter.task = NULL;
+	rt_mutex_init_waiter(&rt_waiter);
 
 	ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE);
 	if (unlikely(ret != 0))
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1153,6 +1153,14 @@ void rt_mutex_adjust_pi(struct task_stru
 				   next_lock, NULL, task);
 }
 
+void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
+{
+	debug_rt_mutex_init_waiter(waiter);
+	RB_CLEAR_NODE(&waiter->pi_tree_entry);
+	RB_CLEAR_NODE(&waiter->tree_entry);
+	waiter->task = NULL;
+}
+
 /**
  * __rt_mutex_slowlock() - Perform the wait-wake-try-to-take loop
  * @lock:		 the rt_mutex to take
@@ -1235,9 +1243,7 @@ rt_mutex_slowlock(struct rt_mutex *lock,
 	unsigned long flags;
 	int ret = 0;
 
-	debug_rt_mutex_init_waiter(&waiter);
-	RB_CLEAR_NODE(&waiter.pi_tree_entry);
-	RB_CLEAR_NODE(&waiter.tree_entry);
+	rt_mutex_init_waiter(&waiter);
 
 	/*
 	 * Technically we could use raw_spin_[un]lock_irq() here, but this can
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -103,6 +103,7 @@ extern void rt_mutex_init_proxy_locked(s
 				       struct task_struct *proxy_owner);
 extern void rt_mutex_proxy_unlock(struct rt_mutex *lock,
 				  struct task_struct *proxy_owner);
+extern void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter);
 extern int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 				     struct rt_mutex_waiter *waiter,
 				     struct task_struct *task);


* [PATCH -v6 10/13] futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()
  2017-03-22 10:35 [PATCH -v6 00/13] The arduous story of FUTEX_UNLOCK_PI Peter Zijlstra
                   ` (8 preceding siblings ...)
  2017-03-22 10:35 ` [PATCH -v6 09/13] futex,rt_mutex: Introduce rt_mutex_init_waiter() Peter Zijlstra
@ 2017-03-22 10:35 ` Peter Zijlstra
  2017-03-23 18:23   ` [tip:locking/core] " tip-bot for Peter Zijlstra
  2017-04-07 23:30   ` [PATCH -v6 10/13] " Darren Hart
  2017-03-22 10:35 ` [PATCH -v6 11/13] futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock() Peter Zijlstra
                   ` (3 subsequent siblings)
  13 siblings, 2 replies; 60+ messages in thread
From: Peter Zijlstra @ 2017-03-22 10:35 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz

[-- Attachment #1: peterz-futex-pi-unlock-9.patch --]
[-- Type: text/plain, Size: 5049 bytes --]

With the ultimate goal of keeping rt_mutex wait_list and futex_q
waiters consistent we want to split 'rt_mutex_futex_lock()' into finer
parts, such that only the actual blocking can be done without hb->lock
held.

This means we need to split rt_mutex_finish_proxy_lock() into two
parts, one that does the blocking and one that does remove_waiter()
when we fail to acquire.

When we do acquire, we can safely remove ourselves, since there is no
concurrency on the lock owner.

This means that, except for futex_lock_pi(), all wait_list
modifications are done with both hb->lock and wait_lock held.
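
The caller-side contract of the split can be sketched as follows. This is a
hedged model, not the kernel code: struct rtm_model, the integer task ids
and the model_* functions are hypothetical, and the wait result is passed
in rather than produced by actually blocking.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of an rt_mutex as seen by one proxy waiter. */
struct rtm_model {
	int owner;		/* task id of the owner, 0 if unowned */
	bool waiter_enqueued;	/* our waiter is still on the wait_list */
};

/*
 * Models rt_mutex_cleanup_proxy_lock(): returns true if we were not the
 * owner and got dequeued, false if we turned out to own the lock.
 */
static bool model_cleanup_proxy_lock(struct rtm_model *lock, int self)
{
	bool cleanup = false;

	if (lock->owner != self) {
		lock->waiter_enqueued = false;	/* remove_waiter() */
		cleanup = true;
	}
	return cleanup;
}

/* Models the caller: a failed wait is disregarded if we own the lock. */
static int model_wait_then_cleanup(struct rtm_model *lock, int self,
				   int wait_ret)
{
	int ret = wait_ret;

	assert(self != 0);	/* 0 is reserved for "no owner" */
	if (ret && !model_cleanup_proxy_lock(lock, self))
		ret = 0;
	return ret;
}
```

The second case in the contract is the subtle one: the wait can report a
timeout or signal while ownership is granted before the waiter is removed,
and then the error must be dropped.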

[bigeasy@linutronix.de: fix for futex_requeue_pi_signal_restart]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c                  |    7 +++--
 kernel/locking/rtmutex.c        |   53 ++++++++++++++++++++++++++++++++++------
 kernel/locking/rtmutex_common.h |    8 +++---
 3 files changed, 56 insertions(+), 12 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3032,10 +3032,13 @@ static int futex_wait_requeue_pi(u32 __u
 		 */
 		WARN_ON(!q.pi_state);
 		pi_mutex = &q.pi_state->pi_mutex;
-		ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter);
-		debug_rt_mutex_free_waiter(&rt_waiter);
+		ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter);
 
 		spin_lock(q.lock_ptr);
+		if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
+			ret = 0;
+
+		debug_rt_mutex_free_waiter(&rt_waiter);
 		/*
 		 * Fixup the pi_state owner and possibly acquire the lock if we
 		 * haven't already.
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1753,21 +1753,23 @@ struct task_struct *rt_mutex_next_owner(
 }
 
 /**
- * rt_mutex_finish_proxy_lock() - Complete lock acquisition
+ * rt_mutex_wait_proxy_lock() - Wait for lock acquisition
  * @lock:		the rt_mutex we were woken on
  * @to:			the timeout, null if none. hrtimer should already have
  *			been started.
  * @waiter:		the pre-initialized rt_mutex_waiter
  *
- * Complete the lock acquisition started our behalf by another thread.
+ * Wait for the lock acquisition started on our behalf by
+ * rt_mutex_start_proxy_lock(). Upon failure, the caller must call
+ * rt_mutex_cleanup_proxy_lock().
  *
  * Returns:
  *  0 - success
  * <0 - error, one of -EINTR, -ETIMEDOUT
  *
- * Special API call for PI-futex requeue support
+ * Special API call for PI-futex support
  */
-int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
+int rt_mutex_wait_proxy_lock(struct rt_mutex *lock,
 			       struct hrtimer_sleeper *to,
 			       struct rt_mutex_waiter *waiter)
 {
@@ -1780,9 +1782,6 @@ int rt_mutex_finish_proxy_lock(struct rt
 	/* sleep on the mutex */
 	ret = __rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE, to, waiter);
 
-	if (unlikely(ret))
-		remove_waiter(lock, waiter);
-
 	/*
 	 * try_to_take_rt_mutex() sets the waiter bit unconditionally. We might
 	 * have to fix that up.
@@ -1793,3 +1792,43 @@ int rt_mutex_finish_proxy_lock(struct rt
 
 	return ret;
 }
+
+/**
+ * rt_mutex_cleanup_proxy_lock() - Cleanup failed lock acquisition
+ * @lock:		the rt_mutex we were woken on
+ * @waiter:		the pre-initialized rt_mutex_waiter
+ *
+ * Attempt to clean up after a failed rt_mutex_wait_proxy_lock().
+ *
+ * Unless we acquired the lock; we're still enqueued on the wait-list and can
+ * in fact still be granted ownership until we're removed. Therefore we can
+ * find we are in fact the owner and must disregard the
+ * rt_mutex_wait_proxy_lock() failure.
+ *
+ * Returns:
+ *  true  - did the cleanup, we are done.
+ *  false - we acquired the lock after rt_mutex_wait_proxy_lock() returned,
+ *          caller should disregard its return value.
+ *
+ * Special API call for PI-futex support
+ */
+bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock,
+				 struct rt_mutex_waiter *waiter)
+{
+	bool cleanup = false;
+
+	raw_spin_lock_irq(&lock->wait_lock);
+	/*
+	 * Unless we're the owner; we're still enqueued on the wait_list.
+	 * So check if we became owner, if not, take us off the wait_list.
+	 */
+	if (rt_mutex_owner(lock) != current) {
+		remove_waiter(lock, waiter);
+		fixup_rt_mutex_waiters(lock);
+		cleanup = true;
+	}
+	raw_spin_unlock_irq(&lock->wait_lock);
+
+	return cleanup;
+}
+
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -107,9 +107,11 @@ extern void rt_mutex_init_waiter(struct
 extern int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 				     struct rt_mutex_waiter *waiter,
 				     struct task_struct *task);
-extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
-				      struct hrtimer_sleeper *to,
-				      struct rt_mutex_waiter *waiter);
+extern int rt_mutex_wait_proxy_lock(struct rt_mutex *lock,
+			       struct hrtimer_sleeper *to,
+			       struct rt_mutex_waiter *waiter);
+extern bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock,
+				 struct rt_mutex_waiter *waiter);
 
 extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
 extern int rt_mutex_futex_trylock(struct rt_mutex *l);


* [PATCH -v6 11/13] futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
  2017-03-22 10:35 [PATCH -v6 00/13] The arduous story of FUTEX_UNLOCK_PI Peter Zijlstra
                   ` (9 preceding siblings ...)
  2017-03-22 10:35 ` [PATCH -v6 10/13] futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock() Peter Zijlstra
@ 2017-03-22 10:35 ` Peter Zijlstra
  2017-03-23 18:24   ` [tip:locking/core] " tip-bot for Peter Zijlstra
                     ` (2 more replies)
  2017-03-22 10:35 ` [PATCH -v6 12/13] futex: futex_unlock_pi() determinism Peter Zijlstra
                   ` (2 subsequent siblings)
  13 siblings, 3 replies; 60+ messages in thread
From: Peter Zijlstra @ 2017-03-22 10:35 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz

[-- Attachment #1: peterz-futex-pi-unlock-5.patch --]
[-- Type: text/plain, Size: 7373 bytes --]

By changing futex_lock_pi() to use rt_mutex_*_proxy_lock() we arrive
at a point where all wait_list modifications are done under both
hb->lock and wait_lock.

This closes the obvious interleave pattern between futex_lock_pi() and
futex_unlock_pi(), but not entirely so. See below:

Before:

futex_lock_pi()			futex_unlock_pi()
  unlock hb->lock

				  lock hb->lock
				  unlock hb->lock

				  lock rt_mutex->wait_lock
				  unlock rt_mutex->wait_lock
				    -EAGAIN

  lock rt_mutex->wait_lock
  list_add
  unlock rt_mutex->wait_lock

  schedule()

  lock rt_mutex->wait_lock
  list_del
  unlock rt_mutex->wait_lock

				  <idem>
				    -EAGAIN

  lock hb->lock


After:

futex_lock_pi()			futex_unlock_pi()

  lock hb->lock
  lock rt_mutex->wait_lock
  list_add
  unlock rt_mutex->wait_lock
  unlock hb->lock

  schedule()
				  lock hb->lock
				  unlock hb->lock
  lock hb->lock
  lock rt_mutex->wait_lock
  list_del
  unlock rt_mutex->wait_lock

				  lock rt_mutex->wait_lock
				  unlock rt_mutex->wait_lock
				    -EAGAIN

  unlock hb->lock


It does, however, solve the earlier starvation/live-lock scenario that
was introduced with the -EAGAIN. In the before scenario the -EAGAIN
happens while futex_unlock_pi() doesn't hold any locks; in the after
scenario it happens while futex_unlock_pi() actually holds a lock, and
then we can serialize on that lock.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c                  |   70 +++++++++++++++++++++++++++-------------
 kernel/locking/rtmutex.c        |   13 -------
 kernel/locking/rtmutex_common.h |    1 
 3 files changed, 48 insertions(+), 36 deletions(-)

Index: linux-2.6/kernel/futex.c
===================================================================
--- linux-2.6.orig/kernel/futex.c
+++ linux-2.6/kernel/futex.c
@@ -2099,20 +2099,7 @@ queue_unlock(struct futex_hash_bucket *h
 	hb_waiters_dec(hb);
 }
 
-/**
- * queue_me() - Enqueue the futex_q on the futex_hash_bucket
- * @q:	The futex_q to enqueue
- * @hb:	The destination hash bucket
- *
- * The hb->lock must be held by the caller, and is released here. A call to
- * queue_me() is typically paired with exactly one call to unqueue_me().  The
- * exceptions involve the PI related operations, which may use unqueue_me_pi()
- * or nothing if the unqueue is done as part of the wake process and the unqueue
- * state is implicit in the state of woken task (see futex_wait_requeue_pi() for
- * an example).
- */
-static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
-	__releases(&hb->lock)
+static inline void __queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
 {
 	int prio;
 
@@ -2129,6 +2116,24 @@ static inline void queue_me(struct futex
 	plist_node_init(&q->list, prio);
 	plist_add(&q->list, &hb->chain);
 	q->task = current;
+}
+
+/**
+ * queue_me() - Enqueue the futex_q on the futex_hash_bucket
+ * @q:	The futex_q to enqueue
+ * @hb:	The destination hash bucket
+ *
+ * The hb->lock must be held by the caller, and is released here. A call to
+ * queue_me() is typically paired with exactly one call to unqueue_me().  The
+ * exceptions involve the PI related operations, which may use unqueue_me_pi()
+ * or nothing if the unqueue is done as part of the wake process and the unqueue
+ * state is implicit in the state of woken task (see futex_wait_requeue_pi() for
+ * an example).
+ */
+static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
+	__releases(&hb->lock)
+{
+	__queue_me(q, hb);
 	spin_unlock(&hb->lock);
 }
 
@@ -2587,6 +2592,7 @@ static int futex_lock_pi(u32 __user *uad
 {
 	struct hrtimer_sleeper timeout, *to = NULL;
 	struct futex_pi_state *pi_state = NULL;
+	struct rt_mutex_waiter rt_waiter;
 	struct futex_hash_bucket *hb;
 	struct futex_q q = futex_q_init;
 	int res, ret;
@@ -2639,25 +2645,52 @@ retry_private:
 		}
 	}
 
+	WARN_ON(!q.pi_state);
+
 	/*
 	 * Only actually queue now that the atomic ops are done:
 	 */
-	queue_me(&q, hb);
+	__queue_me(&q, hb);
 
-	WARN_ON(!q.pi_state);
-	/*
-	 * Block on the PI mutex:
-	 */
-	if (!trylock) {
-		ret = rt_mutex_timed_futex_lock(&q.pi_state->pi_mutex, to);
-	} else {
+	if (trylock) {
 		ret = rt_mutex_futex_trylock(&q.pi_state->pi_mutex);
 		/* Fixup the trylock return value: */
 		ret = ret ? 0 : -EWOULDBLOCK;
+		goto no_block;
 	}
 
+	/*
+	 * We must add ourselves to the rt_mutex waitlist while holding hb->lock
+	 * such that the hb and rt_mutex wait lists match.
+	 */
+	rt_mutex_init_waiter(&rt_waiter);
+	ret = rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current);
+	if (ret) {
+		if (ret == 1)
+			ret = 0;
+
+		goto no_block;
+	}
+
+	spin_unlock(q.lock_ptr);
+
+	if (unlikely(to))
+		hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
+
+	ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
+
 	spin_lock(q.lock_ptr);
 	/*
+	 * If we failed to acquire the lock (signal/timeout), we must
+	 * first acquire the hb->lock before removing the lock from the
+	 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex
+	 * wait lists consistent.
+	 */
+	if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
+		ret = 0;
+
+no_block:
+	/*
 	 * Fixup the pi_state owner and possibly acquire the lock if we
 	 * haven't already.
 	 */
Index: linux-2.6/kernel/locking/rtmutex.c
===================================================================
--- linux-2.6.orig/kernel/locking/rtmutex.c
+++ linux-2.6/kernel/locking/rtmutex.c
@@ -1493,19 +1493,6 @@ int __sched rt_mutex_lock_interruptible(
 EXPORT_SYMBOL_GPL(rt_mutex_lock_interruptible);
 
 /*
- * Futex variant with full deadlock detection.
- * Futex variants must not use the fast-path, see __rt_mutex_futex_unlock().
- */
-int __sched rt_mutex_timed_futex_lock(struct rt_mutex *lock,
-			      struct hrtimer_sleeper *timeout)
-{
-	might_sleep();
-
-	return rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE,
-				 timeout, RT_MUTEX_FULL_CHAINWALK);
-}
-
-/*
  * Futex variant, must not use fastpath.
  */
 int __sched rt_mutex_futex_trylock(struct rt_mutex *lock)
@@ -1782,12 +1769,6 @@ int rt_mutex_wait_proxy_lock(struct rt_m
 	/* sleep on the mutex */
 	ret = __rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE, to, waiter);
 
-	/*
-	 * try_to_take_rt_mutex() sets the waiter bit unconditionally. We might
-	 * have to fix that up.
-	 */
-	fixup_rt_mutex_waiters(lock);
-
 	raw_spin_unlock_irq(&lock->wait_lock);
 
 	return ret;
@@ -1827,6 +1808,13 @@ bool rt_mutex_cleanup_proxy_lock(struct
 		fixup_rt_mutex_waiters(lock);
 		cleanup = true;
 	}
+
+	/*
+	 * try_to_take_rt_mutex() sets the waiter bit unconditionally. We might
+	 * have to fix that up.
+	 */
+	fixup_rt_mutex_waiters(lock);
+
 	raw_spin_unlock_irq(&lock->wait_lock);
 
 	return cleanup;
Index: linux-2.6/kernel/locking/rtmutex_common.h
===================================================================
--- linux-2.6.orig/kernel/locking/rtmutex_common.h
+++ linux-2.6/kernel/locking/rtmutex_common.h
@@ -113,7 +113,6 @@ extern int rt_mutex_wait_proxy_lock(stru
 extern bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock,
 				 struct rt_mutex_waiter *waiter);
 
-extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
 extern int rt_mutex_futex_trylock(struct rt_mutex *l);
 
 extern void rt_mutex_futex_unlock(struct rt_mutex *lock);


* [PATCH -v6 12/13] futex: futex_unlock_pi() determinism
  2017-03-22 10:35 [PATCH -v6 00/13] The arduous story of FUTEX_UNLOCK_PI Peter Zijlstra
                   ` (10 preceding siblings ...)
  2017-03-22 10:35 ` [PATCH -v6 11/13] futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock() Peter Zijlstra
@ 2017-03-22 10:35 ` Peter Zijlstra
  2017-03-23 18:24   ` [tip:locking/core] futex: Futex_unlock_pi() determinism tip-bot for Peter Zijlstra
  2017-04-08  1:27   ` [PATCH -v6 12/13] futex: futex_unlock_pi() determinism Darren Hart
  2017-03-22 10:36 ` [PATCH -v6 13/13] futex: futex_lock_pi() vs PREEMPT_RT_FULL Peter Zijlstra
  2017-03-24  1:45 ` [PATCH -v6 00/13] The arduous story of FUTEX_UNLOCK_PI Darren Hart
  13 siblings, 2 replies; 60+ messages in thread
From: Peter Zijlstra @ 2017-03-22 10:35 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz

[-- Attachment #1: peterz-futex-pi-unlock-11.patch --]
[-- Type: text/plain, Size: 2650 bytes --]

The problem with returning -EAGAIN when the waiter state mismatches is
that it becomes very hard to prove a bounded execution time on the
operation. And seeing that this is an RT operation, this is somewhat
important.

While in practice, given the previous patch, it will be very unlikely
to ever really take more than one or two rounds, proving so becomes
rather hard.

However, now that modifying wait_list is done while holding both
hb->lock and wait_lock, we can avoid the scenario entirely if we
acquire wait_lock while still holding hb->lock, doing a hand-over
without leaving a hole.
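
The hand-over can be sketched in userspace roughly as below. This is
only an illustration of the lock ordering, not the futex code: the two
pthread mutexes are assumed stand-ins for hb->lock and wait_lock, the
two counters are assumed stand-ins for the hb chain and the rt_mutex
wait_list, and none of the names (pi_lock_model, model_*) exist in the
kernel.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model: two locks and the two "lists" they protect. */
struct pi_lock_model {
	pthread_mutex_t hb_lock;	/* stand-in for hb->lock */
	pthread_mutex_t wait_lock;	/* stand-in for rt_mutex::wait_lock */
	int hb_waiters;			/* entries on the hash-bucket chain */
	int rt_waiters;			/* entries on the rt_mutex wait_list */
};

static void model_init(struct pi_lock_model *m)
{
	pthread_mutex_init(&m->hb_lock, NULL);
	pthread_mutex_init(&m->wait_lock, NULL);
	m->hb_waiters = 0;
	m->rt_waiters = 0;
}

/* Waiter side: the wait_list only changes while both locks are held. */
static void model_enqueue(struct pi_lock_model *m)
{
	pthread_mutex_lock(&m->hb_lock);
	m->hb_waiters++;
	pthread_mutex_lock(&m->wait_lock);
	m->rt_waiters++;
	pthread_mutex_unlock(&m->wait_lock);
	pthread_mutex_unlock(&m->hb_lock);
}

static void model_dequeue(struct pi_lock_model *m)
{
	pthread_mutex_lock(&m->hb_lock);
	pthread_mutex_lock(&m->wait_lock);
	m->rt_waiters--;
	pthread_mutex_unlock(&m->wait_lock);
	m->hb_waiters--;
	pthread_mutex_unlock(&m->hb_lock);
}

/*
 * Unlock side: take wait_lock before dropping hb_lock, so there is no
 * window in which neither lock is held. Returns true when the wait_list
 * still matches what was observed under hb_lock.
 */
static bool model_unlock_observes_match(struct pi_lock_model *m)
{
	int hb_seen;
	bool match;

	pthread_mutex_lock(&m->hb_lock);
	hb_seen = m->hb_waiters;
	pthread_mutex_lock(&m->wait_lock);	/* hand-over starts here */
	pthread_mutex_unlock(&m->hb_lock);	/* wait_lock alone pins rt_waiters */

	match = (hb_seen == m->rt_waiters);
	pthread_mutex_unlock(&m->wait_lock);
	return match;
}
```

In this model a concurrent model_enqueue() or model_dequeue() can at
most block on wait_lock once hb_lock has been dropped; it cannot change
rt_waiters between the observation and the comparison, which is the
property the -EAGAIN retry used to paper over.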

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c |   26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

Index: linux-2.6/kernel/futex.c
===================================================================
--- linux-2.6.orig/kernel/futex.c
+++ linux-2.6/kernel/futex.c
@@ -1398,15 +1398,10 @@ static int wake_futex_pi(u32 __user *uad
 	DEFINE_WAKE_Q(wake_q);
 	int ret = 0;
 
-	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
-	if (!new_owner) {
+	if (WARN_ON_ONCE(!new_owner)) {
 		/*
-		 * Since we held neither hb->lock nor wait_lock when coming
-		 * into this function, we could have raced with futex_lock_pi()
-		 * such that we might observe @this futex_q waiter, but the
-		 * rt_mutex's wait_list can be empty (either still, or again,
-		 * depending on which side we land).
+		 * As per the comment in futex_unlock_pi() this should not happen.
 		 *
 		 * When this happens, give up our locks and try again, giving
 		 * the futex_lock_pi() instance time to complete, either by
@@ -2787,15 +2782,18 @@ retry:
 		if (pi_state->owner != current)
 			goto out_unlock;
 
+		get_pi_state(pi_state);
 		/*
-		 * Grab a reference on the pi_state and drop hb->lock.
+		 * Since modifying the wait_list is done while holding both
+		 * hb->lock and wait_lock, holding either is sufficient to
+		 * observe it.
 		 *
-		 * The reference ensures pi_state lives, dropping the hb->lock
-		 * is tricky.. wake_futex_pi() will take rt_mutex::wait_lock to
-		 * close the races against futex_lock_pi(), but in case of
-		 * _any_ fail we'll abort and retry the whole deal.
+		 * By taking wait_lock while still holding hb->lock, we ensure
+		 * there is no point where we hold neither; and therefore
+		 * wake_futex_pi() must observe a state consistent with what we
+		 * observed.
 		 */
-		get_pi_state(pi_state);
+		raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 		spin_unlock(&hb->lock);
 
 		ret = wake_futex_pi(uaddr, uval, pi_state);


* [PATCH -v6 13/13] futex: futex_lock_pi() vs PREEMPT_RT_FULL
  2017-03-22 10:35 [PATCH -v6 00/13] The arduous story of FUTEX_UNLOCK_PI Peter Zijlstra
                   ` (11 preceding siblings ...)
  2017-03-22 10:35 ` [PATCH -v6 12/13] futex: futex_unlock_pi() determinism Peter Zijlstra
@ 2017-03-22 10:36 ` Peter Zijlstra
  2017-03-23 18:25   ` [tip:locking/core] futex: Drop hb->lock before enqueueing on the rtmutex tip-bot for Peter Zijlstra
  2017-04-08  2:26   ` [PATCH -v6 13/13] futex: futex_lock_pi() vs PREEMPT_RT_FULL Darren Hart
  2017-03-24  1:45 ` [PATCH -v6 00/13] The arduous story of FUTEX_UNLOCK_PI Darren Hart
  13 siblings, 2 replies; 60+ messages in thread
From: Peter Zijlstra @ 2017-03-22 10:36 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz

[-- Attachment #1: peterz-futex-pi-unlock-12.patch --]
[-- Type: text/plain, Size: 6343 bytes --]

When PREEMPT_RT_FULL does the spinlock -> rt_mutex substitution the PI
chain code will (falsely) report a deadlock and BUG.

The problem is that we hold hb->lock (now an rt_mutex) while doing
task_blocks_on_rt_mutex on the futex's pi_state::rtmutex. This, when
interleaved just right with futex_unlock_pi() leads it to believe we
have an AB-BA deadlock.

  Task1 (holds rt_mutex,	Task2 (does FUTEX_LOCK_PI)
         does FUTEX_UNLOCK_PI)

				lock hb->lock
				lock rt_mutex (as per start_proxy)
  lock hb->lock

Which is a trivial AB-BA.

It is not an actual deadlock, because we won't be holding hb->lock by
the time we actually block on rt_mutex, but the chainwalk code doesn't
know that.

To avoid this problem, do the same thing we do in futex_unlock_pi()
and drop hb->lock after acquiring wait_lock. This still fully
serializes against futex_unlock_pi(), since adding to the wait_list
does the very same lock dance, and removing it holds both locks.

Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c                  |   30 +++++++++++++++++-------
 kernel/locking/rtmutex.c        |   49 ++++++++++++++++++++++------------------
 kernel/locking/rtmutex_common.h |    3 ++
 3 files changed, 52 insertions(+), 30 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2654,20 +2654,33 @@ static int futex_lock_pi(u32 __user *uad
 		goto no_block;
 	}
 
+	rt_mutex_init_waiter(&rt_waiter);
+
 	/*
-	 * We must add ourselves to the rt_mutex waitlist while holding hb->lock
-	 * such that the hb and rt_mutex wait lists match.
+	 * On PREEMPT_RT_FULL, when hb->lock becomes an rt_mutex, we must not
+	 * hold it while doing rt_mutex_start_proxy(), because then it will
+	 * include hb->lock in the blocking chain, even through we'll not in
+	 * fact hold it while blocking. This will lead it to report -EDEADLK
+	 * and BUG when futex_unlock_pi() interleaves with this.
+	 *
+	 * Therefore acquire wait_lock while holding hb->lock, but drop the
+	 * latter before calling rt_mutex_start_proxy_lock(). This still fully
+	 * serializes against futex_unlock_pi() as that does the exact same
+	 * lock handoff sequence.
 	 */
-	rt_mutex_init_waiter(&rt_waiter);
-	ret = rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current);
+	raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
+	spin_unlock(q.lock_ptr);
+	ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current);
+	raw_spin_unlock_irq(&q.pi_state->pi_mutex.wait_lock);
+
 	if (ret) {
 		if (ret == 1)
 			ret = 0;
 
+		spin_lock(q.lock_ptr);
 		goto no_block;
 	}
 
-	spin_unlock(q.lock_ptr);
 
 	if (unlikely(to))
 		hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
@@ -2680,6 +2693,9 @@ static int futex_lock_pi(u32 __user *uad
 	 * first acquire the hb->lock before removing the lock from the
 	 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex
 	 * wait lists consistent.
+	 *
+	 * In particular; it is important that futex_unlock_pi() can not
+	 * observe this inconsistency.
 	 */
 	if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
 		ret = 0;
@@ -2791,10 +2807,6 @@ static int futex_unlock_pi(u32 __user *u
 
 		get_pi_state(pi_state);
 		/*
-		 * Since modifying the wait_list is done while holding both
-		 * hb->lock and wait_lock, holding either is sufficient to
-		 * observe it.
-		 *
 		 * By taking wait_lock while still holding hb->lock, we ensure
 		 * there is no point where we hold neither; and therefore
 		 * wake_futex_pi() must observe a state consistent with what we
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1669,31 +1669,14 @@ void rt_mutex_proxy_unlock(struct rt_mut
 	rt_mutex_set_owner(lock, NULL);
 }
 
-/**
- * rt_mutex_start_proxy_lock() - Start lock acquisition for another task
- * @lock:		the rt_mutex to take
- * @waiter:		the pre-initialized rt_mutex_waiter
- * @task:		the task to prepare
- *
- * Returns:
- *  0 - task blocked on lock
- *  1 - acquired the lock for task, caller should wake it up
- * <0 - error
- *
- * Special API call for FUTEX_REQUEUE_PI support.
- */
-int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
+int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 			      struct rt_mutex_waiter *waiter,
 			      struct task_struct *task)
 {
 	int ret;
 
-	raw_spin_lock_irq(&lock->wait_lock);
-
-	if (try_to_take_rt_mutex(lock, task, NULL)) {
-		raw_spin_unlock_irq(&lock->wait_lock);
+	if (try_to_take_rt_mutex(lock, task, NULL))
 		return 1;
-	}
 
 	/* We enforce deadlock detection for futexes */
 	ret = task_blocks_on_rt_mutex(lock, waiter, task,
@@ -1712,12 +1695,36 @@ int rt_mutex_start_proxy_lock(struct rt_
 	if (unlikely(ret))
 		remove_waiter(lock, waiter);
 
-	raw_spin_unlock_irq(&lock->wait_lock);
-
 	debug_rt_mutex_print_deadlock(waiter);
 
 	return ret;
 }
+
+/**
+ * rt_mutex_start_proxy_lock() - Start lock acquisition for another task
+ * @lock:		the rt_mutex to take
+ * @waiter:		the pre-initialized rt_mutex_waiter
+ * @task:		the task to prepare
+ *
+ * Returns:
+ *  0 - task blocked on lock
+ *  1 - acquired the lock for task, caller should wake it up
+ * <0 - error
+ *
+ * Special API call for FUTEX_REQUEUE_PI support.
+ */
+int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
+			      struct rt_mutex_waiter *waiter,
+			      struct task_struct *task)
+{
+	int ret;
+
+	raw_spin_lock_irq(&lock->wait_lock);
+	ret = __rt_mutex_start_proxy_lock(lock, waiter, task);
+	raw_spin_unlock_irq(&lock->wait_lock);
+
+	return ret;
+}
 
 /**
  * rt_mutex_next_owner - return the next owner of the lock
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -104,6 +104,9 @@ extern void rt_mutex_init_proxy_locked(s
 extern void rt_mutex_proxy_unlock(struct rt_mutex *lock,
 				  struct task_struct *proxy_owner);
 extern void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter);
+extern int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
+				     struct rt_mutex_waiter *waiter,
+				     struct task_struct *task);
 extern int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 				     struct rt_mutex_waiter *waiter,
 				     struct task_struct *task);


* [tip:locking/core] futex: Cleanup variable names for futex_top_waiter()
  2017-03-22 10:35 ` [PATCH -v6 01/13] futex: Cleanup variable names for futex_top_waiter() Peter Zijlstra
@ 2017-03-23 18:19   ` tip-bot for Peter Zijlstra
  2017-03-24 21:11   ` [PATCH -v6 01/13] " Darren Hart
  1 sibling, 0 replies; 60+ messages in thread
From: tip-bot for Peter Zijlstra @ 2017-03-23 18:19 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: hpa, linux-kernel, mingo, peterz, tglx

Commit-ID:  499f5aca2cdd5e958b27e2655e7e7f82524f46b1
Gitweb:     http://git.kernel.org/tip/499f5aca2cdd5e958b27e2655e7e7f82524f46b1
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 22 Mar 2017 11:35:48 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Thu, 23 Mar 2017 19:10:06 +0100

futex: Cleanup variable names for futex_top_waiter()

futex_top_waiter() returns the top-waiter on the pi_mutex. Assigning
this to a variable 'match' totally obscures the code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.554710645@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/futex.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 45858ec..1531cc4 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1122,14 +1122,14 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 static int lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
 			   union futex_key *key, struct futex_pi_state **ps)
 {
-	struct futex_q *match = futex_top_waiter(hb, key);
+	struct futex_q *top_waiter = futex_top_waiter(hb, key);
 
 	/*
 	 * If there is a waiter on that futex, validate it and
 	 * attach to the pi_state when the validation succeeds.
 	 */
-	if (match)
-		return attach_to_pi_state(uval, match->pi_state, ps);
+	if (top_waiter)
+		return attach_to_pi_state(uval, top_waiter->pi_state, ps);
 
 	/*
 	 * We are the first waiter - try to look up the owner based on
@@ -1176,7 +1176,7 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
 				struct task_struct *task, int set_waiters)
 {
 	u32 uval, newval, vpid = task_pid_vnr(task);
-	struct futex_q *match;
+	struct futex_q *top_waiter;
 	int ret;
 
 	/*
@@ -1202,9 +1202,9 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
 	 * Lookup existing state first. If it exists, try to attach to
 	 * its pi_state.
 	 */
-	match = futex_top_waiter(hb, key);
-	if (match)
-		return attach_to_pi_state(uval, match->pi_state, ps);
+	top_waiter = futex_top_waiter(hb, key);
+	if (top_waiter)
+		return attach_to_pi_state(uval, top_waiter->pi_state, ps);
 
 	/*
 	 * No waiter and user TID is 0. We are here because the
@@ -1294,11 +1294,11 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
 	q->lock_ptr = NULL;
 }
 
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
+static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
 			 struct futex_hash_bucket *hb)
 {
 	struct task_struct *new_owner;
-	struct futex_pi_state *pi_state = this->pi_state;
+	struct futex_pi_state *pi_state = top_waiter->pi_state;
 	u32 uninitialized_var(curval), newval;
 	DEFINE_WAKE_Q(wake_q);
 	bool deboost;
@@ -1319,11 +1319,11 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
 
 	/*
 	 * It is possible that the next waiter (the one that brought
-	 * this owner to the kernel) timed out and is no longer
+	 * top_waiter owner to the kernel) timed out and is no longer
 	 * waiting on the lock.
 	 */
 	if (!new_owner)
-		new_owner = this->task;
+		new_owner = top_waiter->task;
 
 	/*
 	 * We pass it to the next owner. The WAITERS bit is always
@@ -2633,7 +2633,7 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 	u32 uninitialized_var(curval), uval, vpid = task_pid_vnr(current);
 	union futex_key key = FUTEX_KEY_INIT;
 	struct futex_hash_bucket *hb;
-	struct futex_q *match;
+	struct futex_q *top_waiter;
 	int ret;
 
 retry:
@@ -2657,9 +2657,9 @@ retry:
 	 * all and we at least want to know if user space fiddled
 	 * with the futex value instead of blindly unlocking.
 	 */
-	match = futex_top_waiter(hb, &key);
-	if (match) {
-		ret = wake_futex_pi(uaddr, uval, match, hb);
+	top_waiter = futex_top_waiter(hb, &key);
+	if (top_waiter) {
+		ret = wake_futex_pi(uaddr, uval, top_waiter, hb);
 		/*
 		 * In case of success wake_futex_pi dropped the hash
 		 * bucket lock.


* [tip:locking/core] futex: Use smp_store_release() in mark_wake_futex()
  2017-03-22 10:35 ` [PATCH -v6 02/13] futex: Use smp_store_release() in mark_wake_futex() Peter Zijlstra
@ 2017-03-23 18:19   ` tip-bot for Peter Zijlstra
  2017-03-24 21:16   ` [PATCH -v6 02/13] " Darren Hart
  1 sibling, 0 replies; 60+ messages in thread
From: tip-bot for Peter Zijlstra @ 2017-03-23 18:19 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: peterz, mingo, linux-kernel, hpa, tglx

Commit-ID:  1b367ece0d7e696cab1c8501bab282cc6a538b3f
Gitweb:     http://git.kernel.org/tip/1b367ece0d7e696cab1c8501bab282cc6a538b3f
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 22 Mar 2017 11:35:49 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Thu, 23 Mar 2017 19:10:06 +0100

futex: Use smp_store_release() in mark_wake_futex()

Since the futex_q can disappear the instruction after assigning NULL,
this really should be a RELEASE barrier. That stops loads from hitting
dead memory too.
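
As a rough userspace analogue, assuming C11 atomics in place of the
kernel primitives: a release store keeps every earlier load and store
of the object ahead of the store that publishes NULL. The q_model
struct and the model_* helpers are hypothetical, and the acquire load
on the waiter side is an assumption of this model, not a statement
about how the futex waiter reads lock_ptr.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model of a futex_q; only the two fields that matter here. */
struct q_model {
	int unlinked;			/* plain data, stands in for the plist_del */
	_Atomic(void *) lock_ptr;	/* NULL tells the waiter it has been woken */
};

static void q_model_init(struct q_model *q, void *lock)
{
	q->unlinked = 0;
	atomic_init(&q->lock_ptr, lock);
}

/* Waker side: finish all accesses to q, then publish NULL with release. */
static void model_mark_wake(struct q_model *q)
{
	q->unlinked = 1;
	atomic_store_explicit(&q->lock_ptr, (void *)NULL, memory_order_release);
	/* q may be gone from here on; it must not be touched again. */
}

/*
 * Waiter side: returns -1 while not yet woken, otherwise the value of
 * 'unlinked', which the acquire load guarantees to be the published one.
 */
static int model_waiter_check(struct q_model *q)
{
	if (atomic_load_explicit(&q->lock_ptr, memory_order_acquire) != NULL)
		return -1;
	return q->unlinked;
}
```

The point of the release store over a write barrier plus a plain store
is that it also orders prior loads, so nothing the waker reads from q
can be delayed past the moment the waiter is allowed to free it.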

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.604296452@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/futex.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 1531cc4..cc10340 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1290,8 +1290,7 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
 	 * memory barrier is required here to prevent the following
 	 * store to lock_ptr from getting ahead of the plist_del.
 	 */
-	smp_wmb();
-	q->lock_ptr = NULL;
+	smp_store_release(&q->lock_ptr, NULL);
 }
 
 static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,


* [tip:locking/core] futex: Remove rt_mutex_deadlock_account_*()
  2017-03-22 10:35 ` [PATCH -v6 03/13] futex: Remove rt_mutex_deadlock_account_*() Peter Zijlstra
@ 2017-03-23 18:20   ` tip-bot for Peter Zijlstra
  2017-03-24 21:29   ` [PATCH -v6 03/13] " Darren Hart
  1 sibling, 0 replies; 60+ messages in thread
From: tip-bot for Peter Zijlstra @ 2017-03-23 18:20 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, mingo, tglx, peterz, hpa

Commit-ID:  fffa954fb528963c2fb7b0c0084eb77e2be7ab52
Gitweb:     http://git.kernel.org/tip/fffa954fb528963c2fb7b0c0084eb77e2be7ab52
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 22 Mar 2017 11:35:50 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Thu, 23 Mar 2017 19:10:07 +0100

futex: Remove rt_mutex_deadlock_account_*()

These are unused and clutter up the code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.652692478@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/locking/rtmutex-debug.c |  9 --------
 kernel/locking/rtmutex-debug.h |  3 ---
 kernel/locking/rtmutex.c       | 47 ++++++++++++++++--------------------------
 kernel/locking/rtmutex.h       |  2 --
 4 files changed, 18 insertions(+), 43 deletions(-)

diff --git a/kernel/locking/rtmutex-debug.c b/kernel/locking/rtmutex-debug.c
index 97ee9df..32fe775 100644
--- a/kernel/locking/rtmutex-debug.c
+++ b/kernel/locking/rtmutex-debug.c
@@ -174,12 +174,3 @@ void debug_rt_mutex_init(struct rt_mutex *lock, const char *name)
 	lock->name = name;
 }
 
-void
-rt_mutex_deadlock_account_lock(struct rt_mutex *lock, struct task_struct *task)
-{
-}
-
-void rt_mutex_deadlock_account_unlock(struct task_struct *task)
-{
-}
-
diff --git a/kernel/locking/rtmutex-debug.h b/kernel/locking/rtmutex-debug.h
index d0519c3..b585af9 100644
--- a/kernel/locking/rtmutex-debug.h
+++ b/kernel/locking/rtmutex-debug.h
@@ -9,9 +9,6 @@
  * This file contains macros used solely by rtmutex.c. Debug version.
  */
 
-extern void
-rt_mutex_deadlock_account_lock(struct rt_mutex *lock, struct task_struct *task);
-extern void rt_mutex_deadlock_account_unlock(struct task_struct *task);
 extern void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter);
 extern void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter);
 extern void debug_rt_mutex_init(struct rt_mutex *lock, const char *name);
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 6edc32e..bab66cb 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -938,8 +938,6 @@ takeit:
 	 */
 	rt_mutex_set_owner(lock, task);
 
-	rt_mutex_deadlock_account_lock(lock, task);
-
 	return 1;
 }
 
@@ -1342,8 +1340,6 @@ static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock,
 
 	debug_rt_mutex_unlock(lock);
 
-	rt_mutex_deadlock_account_unlock(current);
-
 	/*
 	 * We must be careful here if the fast path is enabled. If we
 	 * have no waiters queued we cannot set owner to NULL here
@@ -1409,11 +1405,10 @@ rt_mutex_fastlock(struct rt_mutex *lock, int state,
 				struct hrtimer_sleeper *timeout,
 				enum rtmutex_chainwalk chwalk))
 {
-	if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) {
-		rt_mutex_deadlock_account_lock(lock, current);
+	if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current)))
 		return 0;
-	} else
-		return slowfn(lock, state, NULL, RT_MUTEX_MIN_CHAINWALK);
+
+	return slowfn(lock, state, NULL, RT_MUTEX_MIN_CHAINWALK);
 }
 
 static inline int
@@ -1425,21 +1420,19 @@ rt_mutex_timed_fastlock(struct rt_mutex *lock, int state,
 				      enum rtmutex_chainwalk chwalk))
 {
 	if (chwalk == RT_MUTEX_MIN_CHAINWALK &&
-	    likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) {
-		rt_mutex_deadlock_account_lock(lock, current);
+	    likely(rt_mutex_cmpxchg_acquire(lock, NULL, current)))
 		return 0;
-	} else
-		return slowfn(lock, state, timeout, chwalk);
+
+	return slowfn(lock, state, timeout, chwalk);
 }
 
 static inline int
 rt_mutex_fasttrylock(struct rt_mutex *lock,
 		     int (*slowfn)(struct rt_mutex *lock))
 {
-	if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) {
-		rt_mutex_deadlock_account_lock(lock, current);
+	if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current)))
 		return 1;
-	}
+
 	return slowfn(lock);
 }
 
@@ -1449,19 +1442,18 @@ rt_mutex_fastunlock(struct rt_mutex *lock,
 				   struct wake_q_head *wqh))
 {
 	DEFINE_WAKE_Q(wake_q);
+	bool deboost;
 
-	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL))) {
-		rt_mutex_deadlock_account_unlock(current);
+	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL)))
+		return;
 
-	} else {
-		bool deboost = slowfn(lock, &wake_q);
+	deboost = slowfn(lock, &wake_q);
 
-		wake_up_q(&wake_q);
+	wake_up_q(&wake_q);
 
-		/* Undo pi boosting if necessary: */
-		if (deboost)
-			rt_mutex_adjust_prio(current);
-	}
+	/* Undo pi boosting if necessary: */
+	if (deboost)
+		rt_mutex_adjust_prio(current);
 }
 
 /**
@@ -1572,10 +1564,9 @@ EXPORT_SYMBOL_GPL(rt_mutex_unlock);
 bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock,
 				   struct wake_q_head *wqh)
 {
-	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL))) {
-		rt_mutex_deadlock_account_unlock(current);
+	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL)))
 		return false;
-	}
+
 	return rt_mutex_slowunlock(lock, wqh);
 }
 
@@ -1637,7 +1628,6 @@ void rt_mutex_init_proxy_locked(struct rt_mutex *lock,
 	__rt_mutex_init(lock, NULL);
 	debug_rt_mutex_proxy_lock(lock, proxy_owner);
 	rt_mutex_set_owner(lock, proxy_owner);
-	rt_mutex_deadlock_account_lock(lock, proxy_owner);
 }
 
 /**
@@ -1657,7 +1647,6 @@ void rt_mutex_proxy_unlock(struct rt_mutex *lock,
 {
 	debug_rt_mutex_proxy_unlock(lock);
 	rt_mutex_set_owner(lock, NULL);
-	rt_mutex_deadlock_account_unlock(proxy_owner);
 }
 
 /**
diff --git a/kernel/locking/rtmutex.h b/kernel/locking/rtmutex.h
index c406058..6607802 100644
--- a/kernel/locking/rtmutex.h
+++ b/kernel/locking/rtmutex.h
@@ -11,8 +11,6 @@
  */
 
 #define rt_mutex_deadlock_check(l)			(0)
-#define rt_mutex_deadlock_account_lock(m, t)		do { } while (0)
-#define rt_mutex_deadlock_account_unlock(l)		do { } while (0)
 #define debug_rt_mutex_init_waiter(w)			do { } while (0)
 #define debug_rt_mutex_free_waiter(w)			do { } while (0)
 #define debug_rt_mutex_lock(l)				do { } while (0)


* [tip:locking/core] futex,rt_mutex: Provide futex specific rt_mutex API
  2017-03-22 10:35 ` [PATCH -v6 04/13] futex,rt_mutex: Provide futex specific rt_mutex API Peter Zijlstra
@ 2017-03-23 18:20   ` tip-bot for Peter Zijlstra
  2017-03-25  0:37   ` [PATCH -v6 04/13] " Darren Hart
  2017-04-05 15:02   ` Darren Hart
  2 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Peter Zijlstra @ 2017-03-23 18:20 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: peterz, linux-kernel, hpa, mingo, tglx

Commit-ID:  5293c2efda37775346885c7e924d4ef7018ea60b
Gitweb:     http://git.kernel.org/tip/5293c2efda37775346885c7e924d4ef7018ea60b
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 22 Mar 2017 11:35:51 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Thu, 23 Mar 2017 19:10:07 +0100

futex,rt_mutex: Provide futex specific rt_mutex API

Part of what makes futex_unlock_pi() intricate is that
rt_mutex_futex_unlock() -> rt_mutex_slowunlock() can drop
rt_mutex::wait_lock.

This means it cannot rely on the atomicity of wait_lock, which would be
preferred in order to not rely on hb->lock so much.

The reason rt_mutex_slowunlock() needs to drop wait_lock is that it can
race with the rt_mutex fastpath; futexes, however, have their own fast path.

Since futexes already have a bunch of separate rt_mutex accessors, complete
that set and implement a rt_mutex variant without fastpath for them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.702962446@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/futex.c                  | 30 +++++++++++-----------
 kernel/locking/rtmutex.c        | 55 ++++++++++++++++++++++++++++++-----------
 kernel/locking/rtmutex_common.h |  9 +++++--
 3 files changed, 62 insertions(+), 32 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index cc10340..af02291 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -916,7 +916,7 @@ void exit_pi_state_list(struct task_struct *curr)
 		pi_state->owner = NULL;
 		raw_spin_unlock_irq(&curr->pi_lock);
 
-		rt_mutex_unlock(&pi_state->pi_mutex);
+		rt_mutex_futex_unlock(&pi_state->pi_mutex);
 
 		spin_unlock(&hb->lock);
 
@@ -1364,20 +1364,18 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter
 	pi_state->owner = new_owner;
 	raw_spin_unlock(&new_owner->pi_lock);
 
-	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
-
-	deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
-
 	/*
-	 * First unlock HB so the waiter does not spin on it once he got woken
-	 * up. Second wake up the waiter before the priority is adjusted. If we
-	 * deboost first (and lose our higher priority), then the task might get
-	 * scheduled away before the wake up can take place.
+	 * We've updated the uservalue, this unlock cannot fail.
 	 */
+	deboost = __rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
+
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 	spin_unlock(&hb->lock);
-	wake_up_q(&wake_q);
-	if (deboost)
+
+	if (deboost) {
+		wake_up_q(&wake_q);
 		rt_mutex_adjust_prio(current);
+	}
 
 	return 0;
 }
@@ -2253,7 +2251,7 @@ static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked)
 		 * task acquired the rt_mutex after we removed ourself from the
 		 * rt_mutex waiters list.
 		 */
-		if (rt_mutex_trylock(&q->pi_state->pi_mutex)) {
+		if (rt_mutex_futex_trylock(&q->pi_state->pi_mutex)) {
 			locked = 1;
 			goto out;
 		}
@@ -2568,7 +2566,7 @@ retry_private:
 	if (!trylock) {
 		ret = rt_mutex_timed_futex_lock(&q.pi_state->pi_mutex, to);
 	} else {
-		ret = rt_mutex_trylock(&q.pi_state->pi_mutex);
+		ret = rt_mutex_futex_trylock(&q.pi_state->pi_mutex);
 		/* Fixup the trylock return value: */
 		ret = ret ? 0 : -EWOULDBLOCK;
 	}
@@ -2591,7 +2589,7 @@ retry_private:
 	 * it and return the fault to userspace.
 	 */
 	if (ret && (rt_mutex_owner(&q.pi_state->pi_mutex) == current))
-		rt_mutex_unlock(&q.pi_state->pi_mutex);
+		rt_mutex_futex_unlock(&q.pi_state->pi_mutex);
 
 	/* Unqueue and drop the lock */
 	unqueue_me_pi(&q);
@@ -2898,7 +2896,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 			spin_lock(q.lock_ptr);
 			ret = fixup_pi_state_owner(uaddr2, &q, current);
 			if (ret && rt_mutex_owner(&q.pi_state->pi_mutex) == current)
-				rt_mutex_unlock(&q.pi_state->pi_mutex);
+				rt_mutex_futex_unlock(&q.pi_state->pi_mutex);
 			/*
 			 * Drop the reference to the pi state which
 			 * the requeue_pi() code acquired for us.
@@ -2938,7 +2936,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 		 * userspace.
 		 */
 		if (ret && rt_mutex_owner(pi_mutex) == current)
-			rt_mutex_unlock(pi_mutex);
+			rt_mutex_futex_unlock(pi_mutex);
 
 		/* Unqueue and drop the lock. */
 		unqueue_me_pi(&q);
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index bab66cb..7d63bc5 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1488,15 +1488,23 @@ EXPORT_SYMBOL_GPL(rt_mutex_lock_interruptible);
 
 /*
  * Futex variant with full deadlock detection.
+ * Futex variants must not use the fast-path, see __rt_mutex_futex_unlock().
  */
-int rt_mutex_timed_futex_lock(struct rt_mutex *lock,
+int __sched rt_mutex_timed_futex_lock(struct rt_mutex *lock,
 			      struct hrtimer_sleeper *timeout)
 {
 	might_sleep();
 
-	return rt_mutex_timed_fastlock(lock, TASK_INTERRUPTIBLE, timeout,
-				       RT_MUTEX_FULL_CHAINWALK,
-				       rt_mutex_slowlock);
+	return rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE,
+				 timeout, RT_MUTEX_FULL_CHAINWALK);
+}
+
+/*
+ * Futex variant, must not use fastpath.
+ */
+int __sched rt_mutex_futex_trylock(struct rt_mutex *lock)
+{
+	return rt_mutex_slowtrylock(lock);
 }
 
 /**
@@ -1555,19 +1563,38 @@ void __sched rt_mutex_unlock(struct rt_mutex *lock)
 EXPORT_SYMBOL_GPL(rt_mutex_unlock);
 
 /**
- * rt_mutex_futex_unlock - Futex variant of rt_mutex_unlock
- * @lock: the rt_mutex to be unlocked
- *
- * Returns: true/false indicating whether priority adjustment is
- * required or not.
+ * Futex variant, that since futex variants do not use the fast-path, can be
+ * simple and will not need to retry.
  */
-bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock,
-				   struct wake_q_head *wqh)
+bool __sched __rt_mutex_futex_unlock(struct rt_mutex *lock,
+				    struct wake_q_head *wake_q)
 {
-	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL)))
-		return false;
+	lockdep_assert_held(&lock->wait_lock);
+
+	debug_rt_mutex_unlock(lock);
+
+	if (!rt_mutex_has_waiters(lock)) {
+		lock->owner = NULL;
+		return false; /* done */
+	}
+
+	mark_wakeup_next_waiter(wake_q, lock);
+	return true; /* deboost and wakeups */
+}
 
-	return rt_mutex_slowunlock(lock, wqh);
+void __sched rt_mutex_futex_unlock(struct rt_mutex *lock)
+{
+	DEFINE_WAKE_Q(wake_q);
+	bool deboost;
+
+	raw_spin_lock_irq(&lock->wait_lock);
+	deboost = __rt_mutex_futex_unlock(lock, &wake_q);
+	raw_spin_unlock_irq(&lock->wait_lock);
+
+	if (deboost) {
+		wake_up_q(&wake_q);
+		rt_mutex_adjust_prio(current);
+	}
 }
 
 /**
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 856dfff..af667f6 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -109,9 +109,14 @@ extern int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
 				      struct hrtimer_sleeper *to,
 				      struct rt_mutex_waiter *waiter);
+
 extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
-extern bool rt_mutex_futex_unlock(struct rt_mutex *lock,
-				  struct wake_q_head *wqh);
+extern int rt_mutex_futex_trylock(struct rt_mutex *l);
+
+extern void rt_mutex_futex_unlock(struct rt_mutex *lock);
+extern bool __rt_mutex_futex_unlock(struct rt_mutex *lock,
+				 struct wake_q_head *wqh);
+
 extern void rt_mutex_adjust_prio(struct task_struct *task);
 
 #ifdef CONFIG_DEBUG_RT_MUTEXES


* [tip:locking/core] futex: Change locking rules
  2017-03-22 10:35 ` [PATCH -v6 05/13] futex: Change locking rules Peter Zijlstra
@ 2017-03-23 18:21   ` tip-bot for Peter Zijlstra
  2017-04-05 21:18   ` [PATCH -v6 05/13] " Darren Hart
  1 sibling, 0 replies; 60+ messages in thread
From: tip-bot for Peter Zijlstra @ 2017-03-23 18:21 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: mingo, tglx, hpa, peterz, linux-kernel

Commit-ID:  734009e96d1983ad739e5b656e03430b3660c913
Gitweb:     http://git.kernel.org/tip/734009e96d1983ad739e5b656e03430b3660c913
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 22 Mar 2017 11:35:52 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Thu, 23 Mar 2017 19:10:07 +0100

futex: Change locking rules

Currently futex-pi relies on hb->lock to serialize everything. But hb->lock
creates another set of problems, especially priority inversions on RT where
hb->lock becomes a rt_mutex itself.

The rt_mutex::wait_lock is the most obvious protection for keeping the
futex user space value and the kernel internal pi_state in sync.

Rework and document the locking so rt_mutex::wait_lock is held across all
operations which modify the user space value and the pi state.

As a next step, this allows invoking rt_mutex_unlock() (including deboost)
without holding hb->lock.

Nothing yet relies on the new locking rules.
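
A rough userspace model of the revalidation that attach_to_pi_state()
gains in this patch: uval is sampled before wait_lock is taken, so it is
re-read under the lock and the caller is told to retry on a mismatch.
Assumptions: all model_* names are hypothetical, an atomic word stands in
for the user space futex value, the OWNER_DIED handling and the -EFAULT
path are left out, and MODEL_TID_MASK mirrors FUTEX_TID_MASK.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

#define MODEL_TID_MASK 0x3fffffffu	/* same value as FUTEX_TID_MASK */

/* Userspace stand-in for struct futex_pi_state; names are hypothetical. */
struct model_pi_state {
	atomic_flag wait_lock;	/* models pi_mutex.wait_lock */
	int refcount;		/* must be non-zero while attaching */
	uint32_t owner_tid;	/* models task_pid_vnr(pi_state->owner) */
};

static void model_pi_state_init(struct model_pi_state *ps, uint32_t owner_tid)
{
	atomic_flag_clear(&ps->wait_lock);
	ps->refcount = 1;
	ps->owner_tid = owner_tid;
}

/* uval was read by the caller without holding wait_lock. */
static int model_attach_to_pi_state(_Atomic uint32_t *uaddr, uint32_t uval,
				    struct model_pi_state *ps)
{
	uint32_t uval2;
	int ret = 0;

	assert(ps->refcount > 0);	/* models the WARN_ON() on refcount */

	while (atomic_flag_test_and_set_explicit(&ps->wait_lock,
						 memory_order_acquire))
		;	/* spin */

	/* {uval, pi_state} is serialized by wait_lock: re-read and compare. */
	uval2 = atomic_load(uaddr);
	if (uval2 != uval)
		ret = -EAGAIN;		/* value changed, caller retries */
	else if ((uval & MODEL_TID_MASK) != ps->owner_tid)
		ret = -EINVAL;		/* TID and owner out of sync */
	else
		ps->refcount++;		/* attach: take a reference */

	atomic_flag_clear_explicit(&ps->wait_lock, memory_order_release);
	return ret;
}
```

Every exit in the model goes through the single unlock at the bottom, which
corresponds to the out_einval/out_eagain/out_efault labels funnelling into
out_error in the diff below.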

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.751993333@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/futex.c | 165 +++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 132 insertions(+), 33 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index af02291..3e71d66 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -973,6 +973,39 @@ void exit_pi_state_list(struct task_struct *curr)
  *
  * [10] There is no transient state which leaves owner and user space
  *	TID out of sync.
+ *
+ *
+ * Serialization and lifetime rules:
+ *
+ * hb->lock:
+ *
+ *	hb -> futex_q, relation
+ *	futex_q -> pi_state, relation
+ *
+ *	(cannot be raw because hb can contain arbitrary amount
+ *	 of futex_q's)
+ *
+ * pi_mutex->wait_lock:
+ *
+ *	{uval, pi_state}
+ *
+ *	(and pi_mutex 'obviously')
+ *
+ * p->pi_lock:
+ *
+ *	p->pi_state_list -> pi_state->list, relation
+ *
+ * pi_state->refcount:
+ *
+ *	pi_state lifetime
+ *
+ *
+ * Lock order:
+ *
+ *   hb->lock
+ *     pi_mutex->wait_lock
+ *       p->pi_lock
+ *
  */
 
 /*
@@ -980,10 +1013,12 @@ void exit_pi_state_list(struct task_struct *curr)
  * the pi_state against the user space value. If correct, attach to
  * it.
  */
-static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
+static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
+			      struct futex_pi_state *pi_state,
 			      struct futex_pi_state **ps)
 {
 	pid_t pid = uval & FUTEX_TID_MASK;
+	int ret, uval2;
 
 	/*
 	 * Userspace might have messed up non-PI and PI futexes [3]
@@ -991,9 +1026,34 @@ static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
 	if (unlikely(!pi_state))
 		return -EINVAL;
 
+	/*
+	 * We get here with hb->lock held, and having found a
+	 * futex_top_waiter(). This means that futex_lock_pi() of said futex_q
+	 * has dropped the hb->lock in between queue_me() and unqueue_me_pi(),
+	 * which in turn means that futex_lock_pi() still has a reference on
+	 * our pi_state.
+	 */
 	WARN_ON(!atomic_read(&pi_state->refcount));
 
 	/*
+	 * Now that we have a pi_state, we can acquire wait_lock
+	 * and do the state validation.
+	 */
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
+
+	/*
+	 * Since {uval, pi_state} is serialized by wait_lock, and our current
+	 * uval was read without holding it, it can have changed. Verify it
+	 * still is what we expect it to be, otherwise retry the entire
+	 * operation.
+	 */
+	if (get_futex_value_locked(&uval2, uaddr))
+		goto out_efault;
+
+	if (uval != uval2)
+		goto out_eagain;
+
+	/*
 	 * Handle the owner died case:
 	 */
 	if (uval & FUTEX_OWNER_DIED) {
@@ -1008,11 +1068,11 @@ static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
 			 * is not 0. Inconsistent state. [5]
 			 */
 			if (pid)
-				return -EINVAL;
+				goto out_einval;
 			/*
 			 * Take a ref on the state and return success. [4]
 			 */
-			goto out_state;
+			goto out_attach;
 		}
 
 		/*
@@ -1024,14 +1084,14 @@ static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
 		 * Take a ref on the state and return success. [6]
 		 */
 		if (!pid)
-			goto out_state;
+			goto out_attach;
 	} else {
 		/*
 		 * If the owner died bit is not set, then the pi_state
 		 * must have an owner. [7]
 		 */
 		if (!pi_state->owner)
-			return -EINVAL;
+			goto out_einval;
 	}
 
 	/*
@@ -1040,11 +1100,29 @@ static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
 	 * user space TID. [9/10]
 	 */
 	if (pid != task_pid_vnr(pi_state->owner))
-		return -EINVAL;
-out_state:
+		goto out_einval;
+
+out_attach:
 	atomic_inc(&pi_state->refcount);
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 	*ps = pi_state;
 	return 0;
+
+out_einval:
+	ret = -EINVAL;
+	goto out_error;
+
+out_eagain:
+	ret = -EAGAIN;
+	goto out_error;
+
+out_efault:
+	ret = -EFAULT;
+	goto out_error;
+
+out_error:
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+	return ret;
 }
 
 /*
@@ -1095,6 +1173,9 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 
 	/*
 	 * No existing pi state. First waiter. [2]
+	 *
+	 * This creates pi_state, we have hb->lock held, this means nothing can
+	 * observe this state, wait_lock is irrelevant.
 	 */
 	pi_state = alloc_pi_state();
 
@@ -1119,7 +1200,8 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 	return 0;
 }
 
-static int lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
+static int lookup_pi_state(u32 __user *uaddr, u32 uval,
+			   struct futex_hash_bucket *hb,
 			   union futex_key *key, struct futex_pi_state **ps)
 {
 	struct futex_q *top_waiter = futex_top_waiter(hb, key);
@@ -1129,7 +1211,7 @@ static int lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
 	 * attach to the pi_state when the validation succeeds.
 	 */
 	if (top_waiter)
-		return attach_to_pi_state(uval, top_waiter->pi_state, ps);
+		return attach_to_pi_state(uaddr, uval, top_waiter->pi_state, ps);
 
 	/*
 	 * We are the first waiter - try to look up the owner based on
@@ -1148,7 +1230,7 @@ static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval)
 	if (unlikely(cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)))
 		return -EFAULT;
 
-	/*If user space value changed, let the caller retry */
+	/* If user space value changed, let the caller retry */
 	return curval != uval ? -EAGAIN : 0;
 }
 
@@ -1204,7 +1286,7 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
 	 */
 	top_waiter = futex_top_waiter(hb, key);
 	if (top_waiter)
-		return attach_to_pi_state(uval, top_waiter->pi_state, ps);
+		return attach_to_pi_state(uaddr, uval, top_waiter->pi_state, ps);
 
 	/*
 	 * No waiter and user TID is 0. We are here because the
@@ -1336,6 +1418,7 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter
 
 	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)) {
 		ret = -EFAULT;
+
 	} else if (curval != uval) {
 		/*
 		 * If a unconditional UNLOCK_PI operation (user space did not
@@ -1348,6 +1431,7 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter
 		else
 			ret = -EINVAL;
 	}
+
 	if (ret) {
 		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 		return ret;
@@ -1823,7 +1907,7 @@ retry_private:
 			 * If that call succeeds then we have pi_state and an
 			 * initial refcount on it.
 			 */
-			ret = lookup_pi_state(ret, hb2, &key2, &pi_state);
+			ret = lookup_pi_state(uaddr2, ret, hb2, &key2, &pi_state);
 		}
 
 		switch (ret) {
@@ -2122,10 +2206,13 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
 {
 	u32 newtid = task_pid_vnr(newowner) | FUTEX_WAITERS;
 	struct futex_pi_state *pi_state = q->pi_state;
-	struct task_struct *oldowner = pi_state->owner;
 	u32 uval, uninitialized_var(curval), newval;
+	struct task_struct *oldowner;
 	int ret;
 
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
+
+	oldowner = pi_state->owner;
 	/* Owner died? */
 	if (!pi_state->owner)
 		newtid |= FUTEX_OWNER_DIED;
@@ -2141,11 +2228,10 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
 	 * because we can fault here. Imagine swapped out pages or a fork
 	 * that marked all the anonymous memory readonly for cow.
 	 *
-	 * Modifying pi_state _before_ the user space value would
-	 * leave the pi_state in an inconsistent state when we fault
-	 * here, because we need to drop the hash bucket lock to
-	 * handle the fault. This might be observed in the PID check
-	 * in lookup_pi_state.
+	 * Modifying pi_state _before_ the user space value would leave the
+	 * pi_state in an inconsistent state when we fault here, because we
+	 * need to drop the locks to handle the fault. This might be observed
+	 * in the PID check in lookup_pi_state.
 	 */
 retry:
 	if (get_futex_value_locked(&uval, uaddr))
@@ -2166,47 +2252,60 @@ retry:
 	 * itself.
 	 */
 	if (pi_state->owner != NULL) {
-		raw_spin_lock_irq(&pi_state->owner->pi_lock);
+		raw_spin_lock(&pi_state->owner->pi_lock);
 		WARN_ON(list_empty(&pi_state->list));
 		list_del_init(&pi_state->list);
-		raw_spin_unlock_irq(&pi_state->owner->pi_lock);
+		raw_spin_unlock(&pi_state->owner->pi_lock);
 	}
 
 	pi_state->owner = newowner;
 
-	raw_spin_lock_irq(&newowner->pi_lock);
+	raw_spin_lock(&newowner->pi_lock);
 	WARN_ON(!list_empty(&pi_state->list));
 	list_add(&pi_state->list, &newowner->pi_state_list);
-	raw_spin_unlock_irq(&newowner->pi_lock);
+	raw_spin_unlock(&newowner->pi_lock);
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+
 	return 0;
 
 	/*
-	 * To handle the page fault we need to drop the hash bucket
-	 * lock here. That gives the other task (either the highest priority
-	 * waiter itself or the task which stole the rtmutex) the
-	 * chance to try the fixup of the pi_state. So once we are
-	 * back from handling the fault we need to check the pi_state
-	 * after reacquiring the hash bucket lock and before trying to
-	 * do another fixup. When the fixup has been done already we
-	 * simply return.
+	 * To handle the page fault we need to drop the locks here. That gives
+	 * the other task (either the highest priority waiter itself or the
+	 * task which stole the rtmutex) the chance to try the fixup of the
+	 * pi_state. So once we are back from handling the fault we need to
+	 * check the pi_state after reacquiring the locks and before trying to
+	 * do another fixup. When the fixup has been done already we simply
+	 * return.
+	 *
+	 * Note: we hold both hb->lock and pi_mutex->wait_lock. We can safely
+	 * drop hb->lock since the caller owns the hb -> futex_q relation.
+	 * Dropping the pi_mutex->wait_lock requires the state revalidate.
 	 */
 handle_fault:
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 	spin_unlock(q->lock_ptr);
 
 	ret = fault_in_user_writeable(uaddr);
 
 	spin_lock(q->lock_ptr);
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 
 	/*
 	 * Check if someone else fixed it for us:
 	 */
-	if (pi_state->owner != oldowner)
-		return 0;
+	if (pi_state->owner != oldowner) {
+		ret = 0;
+		goto out_unlock;
+	}
 
 	if (ret)
-		return ret;
+		goto out_unlock;
 
 	goto retry;
+
+out_unlock:
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+	return ret;
 }
 
 static long futex_wait_restart(struct restart_block *restart);


* [tip:locking/core] futex: Cleanup refcounting
  2017-03-22 10:35 ` [PATCH -v6 06/13] futex: Cleanup refcounting Peter Zijlstra
@ 2017-03-23 18:21   ` tip-bot for Peter Zijlstra
  2017-04-05 21:29   ` [PATCH -v6 06/13] " Darren Hart
  1 sibling, 0 replies; 60+ messages in thread
From: tip-bot for Peter Zijlstra @ 2017-03-23 18:21 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: peterz, linux-kernel, mingo, hpa, tglx

Commit-ID:  bf92cf3a5100f5a0d5f9834787b130159397cb22
Gitweb:     http://git.kernel.org/tip/bf92cf3a5100f5a0d5f9834787b130159397cb22
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 22 Mar 2017 11:35:53 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Thu, 23 Mar 2017 19:10:08 +0100

futex: Cleanup refcounting

Add a get_pi_state() as counterpart for put_pi_state() so the refcounting
becomes consistent.
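
A small userspace sketch of the get/put pairing, assuming C11 atomics in
place of the kernel's atomic_t. The model_* names are hypothetical, and
where get_pi_state() in the diff below warns via WARN_ON_ONCE() when the
count is already zero, the model returns false so the caller can see it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for a refcounted pi_state; names are hypothetical. */
struct model_obj {
	atomic_int refcount;
	bool freed;		/* set when the last reference is dropped */
};

static void model_obj_init(struct model_obj *o)
{
	atomic_init(&o->refcount, 1);	/* creator holds the first reference */
	o->freed = false;
}

/* Increment unless the count is zero; models atomic_inc_not_zero(). */
static bool model_inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);

	while (old != 0) {
		/* On failure 'old' is reloaded and the zero check repeats. */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}

/* Returns false if the object was already on its way out. */
static bool model_get(struct model_obj *o)
{
	return model_inc_not_zero(&o->refcount);
}

static void model_put(struct model_obj *o)
{
	/* fetch_sub returns the old value: 1 means we dropped the last ref. */
	if (atomic_fetch_sub(&o->refcount, 1) == 1)
		o->freed = true;
}
```

Compared with a bare increment, the inc-not-zero form refuses to revive an
object whose last reference is already gone.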

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.801778516@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/futex.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 3e71d66..3b6dbee 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -802,7 +802,7 @@ static int refill_pi_state_cache(void)
 	return 0;
 }
 
-static struct futex_pi_state * alloc_pi_state(void)
+static struct futex_pi_state *alloc_pi_state(void)
 {
 	struct futex_pi_state *pi_state = current->pi_state_cache;
 
@@ -812,6 +812,11 @@ static struct futex_pi_state * alloc_pi_state(void)
 	return pi_state;
 }
 
+static void get_pi_state(struct futex_pi_state *pi_state)
+{
+	WARN_ON_ONCE(!atomic_inc_not_zero(&pi_state->refcount));
+}
+
 /*
  * Drops a reference to the pi_state object and frees or caches it
  * when the last reference is gone.
@@ -856,7 +861,7 @@ static void put_pi_state(struct futex_pi_state *pi_state)
  * Look up the task based on what TID userspace gave us.
  * We dont trust it.
  */
-static struct task_struct * futex_find_get_task(pid_t pid)
+static struct task_struct *futex_find_get_task(pid_t pid)
 {
 	struct task_struct *p;
 
@@ -1103,7 +1108,7 @@ static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
 		goto out_einval;
 
 out_attach:
-	atomic_inc(&pi_state->refcount);
+	get_pi_state(pi_state);
 	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 	*ps = pi_state;
 	return 0;
@@ -1990,7 +1995,7 @@ retry_private:
 			 * refcount on the pi_state and store the pointer in
 			 * the futex_q object of the waiter.
 			 */
-			atomic_inc(&pi_state->refcount);
+			get_pi_state(pi_state);
 			this->pi_state = pi_state;
 			ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
 							this->rt_waiter,


* [tip:locking/core] futex: Rework inconsistent rt_mutex/futex_q state
  2017-03-22 10:35 ` [PATCH -v6 07/13] futex: Rework inconsistent rt_mutex/futex_q state Peter Zijlstra
@ 2017-03-23 18:22   ` tip-bot for Peter Zijlstra
  2017-04-05 21:58   ` [PATCH -v6 07/13] " Darren Hart
  1 sibling, 0 replies; 60+ messages in thread
From: tip-bot for Peter Zijlstra @ 2017-03-23 18:22 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: hpa, peterz, tglx, linux-kernel, mingo

Commit-ID:  73d786bd043ebc855f349c81ea805f6b11cbf2aa
Gitweb:     http://git.kernel.org/tip/73d786bd043ebc855f349c81ea805f6b11cbf2aa
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 22 Mar 2017 11:35:54 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Thu, 23 Mar 2017 19:10:08 +0100

futex: Rework inconsistent rt_mutex/futex_q state

There is a weird state in the futex_unlock_pi() path when it interleaves
with a concurrent futex_lock_pi() at the point where it drops hb->lock.

In this case, it can happen that the rt_mutex wait_list and the futex_q
disagree on pending waiters, in particular rt_mutex will find no pending
waiters where futex_q thinks there are. In this case the rt_mutex unlock
code cannot assign an owner.

The futex-side fixup code has to clean up the inconsistencies, which
involves quite a bunch of interesting corner cases.

Simplify all this by changing wake_futex_pi() to return -EAGAIN when this
situation occurs. This then gives the futex_lock_pi() code the opportunity
to continue and the retried futex_unlock_pi() will now observe a coherent
state.

The only problem is that this breaks RT timeliness guarantees. That
is, consider the following scenario:

  T1 and T2 are both pinned to CPU0. prio(T2) > prio(T1)

    CPU0

    T1
      lock_pi()
      queue_me()  <- Waiter is visible

    preemption

    T2
      unlock_pi()
	loops with -EAGAIN forever

Which is undesirable for PI primitives. Future patches will rectify
this.
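
The scenario above can be modelled in userspace as a retry loop. This is a
sketch under assumptions, not the kernel logic: all model_* names are
hypothetical, the two counters stand in for the futex_q and rt_mutex waiter
lists, and the locker_step callback stands in for the preempted locker
getting CPU time between unlock attempts.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of the two waiter views that can disagree. */
struct model_futex {
	int futex_q_waiters;	/* waiters visible after queue_me() */
	int rtmutex_waiters;	/* waiters enqueued on the rt_mutex */
};

/* Models wake_futex_pi(): bail out when the two views disagree. */
static int model_wake_futex_pi(const struct model_futex *f)
{
	if (f->futex_q_waiters > 0 && f->rtmutex_waiters == 0)
		return -EAGAIN;
	return 0;
}

/* Models the locker finishing its enqueue on the rt_mutex. */
static void model_locker_enqueues(struct model_futex *f)
{
	f->rtmutex_waiters = f->futex_q_waiters;
}

/*
 * Models the unlock retry. Returns the number of attempts needed, or -1
 * if max_tries attempts all saw -EAGAIN. A NULL locker_step models the
 * pinned, lower-priority locker never being scheduled.
 */
static int model_unlock_pi(struct model_futex *f,
			   void (*locker_step)(struct model_futex *),
			   int max_tries)
{
	int tries;

	assert(max_tries > 0);

	for (tries = 1; tries <= max_tries; tries++) {
		if (model_wake_futex_pi(f) != -EAGAIN)
			return tries;
		if (locker_step)
			locker_step(f);
	}
	return -1;
}
```

With model_locker_enqueues as the callback the second attempt succeeds;
with NULL the loop only ends because the model caps the number of tries,
which is the timeliness problem described above.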

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.850383690@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/futex.c | 50 ++++++++++++++------------------------------------
 1 file changed, 14 insertions(+), 36 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 3b6dbee..51a248a 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1404,12 +1404,19 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter
 	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
 
 	/*
-	 * It is possible that the next waiter (the one that brought
-	 * top_waiter owner to the kernel) timed out and is no longer
-	 * waiting on the lock.
+	 * When we interleave with futex_lock_pi() where it does
+	 * rt_mutex_timed_futex_lock(), we might observe @this futex_q waiter,
+	 * but the rt_mutex's wait_list can be empty (either still, or again,
+	 * depending on which side we land).
+	 *
+	 * When this happens, give up our locks and try again, giving the
+	 * futex_lock_pi() instance time to complete, either by waiting on the
+	 * rtmutex or removing itself from the futex queue.
 	 */
-	if (!new_owner)
-		new_owner = top_waiter->task;
+	if (!new_owner) {
+		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+		return -EAGAIN;
+	}
 
 	/*
 	 * We pass it to the next owner. The WAITERS bit is always
@@ -2332,7 +2339,6 @@ static long futex_wait_restart(struct restart_block *restart);
  */
 static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked)
 {
-	struct task_struct *owner;
 	int ret = 0;
 
 	if (locked) {
@@ -2346,43 +2352,15 @@ static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked)
 	}
 
 	/*
-	 * Catch the rare case, where the lock was released when we were on the
-	 * way back before we locked the hash bucket.
-	 */
-	if (q->pi_state->owner == current) {
-		/*
-		 * Try to get the rt_mutex now. This might fail as some other
-		 * task acquired the rt_mutex after we removed ourself from the
-		 * rt_mutex waiters list.
-		 */
-		if (rt_mutex_futex_trylock(&q->pi_state->pi_mutex)) {
-			locked = 1;
-			goto out;
-		}
-
-		/*
-		 * pi_state is incorrect, some other task did a lock steal and
-		 * we returned due to timeout or signal without taking the
-		 * rt_mutex. Too late.
-		 */
-		raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock);
-		owner = rt_mutex_owner(&q->pi_state->pi_mutex);
-		if (!owner)
-			owner = rt_mutex_next_owner(&q->pi_state->pi_mutex);
-		raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock);
-		ret = fixup_pi_state_owner(uaddr, q, owner);
-		goto out;
-	}
-
-	/*
 	 * Paranoia check. If we did not take the lock, then we should not be
 	 * the owner of the rt_mutex.
 	 */
-	if (rt_mutex_owner(&q->pi_state->pi_mutex) == current)
+	if (rt_mutex_owner(&q->pi_state->pi_mutex) == current) {
 		printk(KERN_ERR "fixup_owner: ret = %d pi-mutex: %p "
 				"pi-state %p\n", ret,
 				q->pi_state->pi_mutex.owner,
 				q->pi_state->owner);
+	}
 
 out:
 	return ret ? ret : locked;


* [tip:locking/core] futex: Pull rt_mutex_futex_unlock() out from under hb->lock
  2017-03-22 10:35 ` [PATCH -v6 08/13] futex: Pull rt_mutex_futex_unlock() out from under hb->lock Peter Zijlstra
@ 2017-03-23 18:22   ` tip-bot for Peter Zijlstra
  2017-04-05 23:52   ` [PATCH -v6 08/13] " Darren Hart
  1 sibling, 0 replies; 60+ messages in thread
From: tip-bot for Peter Zijlstra @ 2017-03-23 18:22 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: tglx, hpa, linux-kernel, peterz, mingo

Commit-ID:  16ffa12d742534d4ff73e8b3a4e81c1de39196f0
Gitweb:     http://git.kernel.org/tip/16ffa12d742534d4ff73e8b3a4e81c1de39196f0
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 22 Mar 2017 11:35:55 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Thu, 23 Mar 2017 19:10:08 +0100

futex: Pull rt_mutex_futex_unlock() out from under hb->lock

There are a number of 'interesting' problems, all caused by holding
hb->lock while doing the rt_mutex_unlock() equivalent.

Notably:

 - a PI inversion on hb->lock; and,

 - a SCHED_DEADLINE crash because of pointer instability.

The previous changes:

 - changed the locking rules to cover {uval,pi_state} with wait_lock.

 - allowed rt_mutex_futex_unlock() to run without dropping wait_lock, which
   in turn allows relying on wait_lock atomicity completely.

 - simplified the waiter conundrum.

It's now sufficient to hold rtmutex::wait_lock and a reference on the
pi_state to protect the state consistency, so hb->lock can be dropped
before calling rt_mutex_futex_unlock().
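
The resulting ordering (pin the pi_state, drop hb->lock, unlock under
wait_lock, drop the pin) can be sketched in userspace as follows. This is
a model under assumptions: all model_* names are hypothetical, atomic_flag
spinlocks stand in for hb->lock and wait_lock, and the bookkeeping flags
exist only so the ordering can be checked.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a pi_state plus the hash bucket lock above it. */
struct model_pi {
	atomic_flag hb_lock;		/* models hb->lock */
	bool hb_lock_held;		/* bookkeeping for the check below */
	atomic_flag wait_lock;		/* models pi_mutex.wait_lock */
	atomic_int refcount;		/* models pi_state->refcount */
	int owner;			/* owning TID, 0 when unowned */
	bool unlocked_under_hb_lock;	/* was hb_lock held at unlock time? */
};

static void model_pi_init(struct model_pi *ps, int owner)
{
	atomic_flag_clear(&ps->hb_lock);
	atomic_flag_clear(&ps->wait_lock);
	atomic_init(&ps->refcount, 1);
	ps->hb_lock_held = false;
	ps->owner = owner;
	ps->unlocked_under_hb_lock = true;	/* overwritten by the unlock */
}

static void model_spin_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* spin */
}

static void model_spin_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

static void model_unlock_outside_hb(struct model_pi *ps)
{
	int old;

	model_spin_lock(&ps->hb_lock);
	ps->hb_lock_held = true;

	/* Pin the state so it survives once hb_lock is gone. */
	old = atomic_fetch_add(&ps->refcount, 1);
	assert(old > 0);
	(void)old;

	ps->hb_lock_held = false;
	model_spin_unlock(&ps->hb_lock);

	/* The unlock itself now runs under wait_lock only. */
	model_spin_lock(&ps->wait_lock);
	ps->unlocked_under_hb_lock = ps->hb_lock_held;
	ps->owner = 0;
	model_spin_unlock(&ps->wait_lock);

	/* Drop the pin taken above. */
	atomic_fetch_sub(&ps->refcount, 1);
}
```

The reference taken before dropping hb_lock is what keeps the object alive
across the window where neither lock is held.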

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.900002056@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/futex.c | 154 +++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 100 insertions(+), 54 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 51a248a..3b0aace 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -921,10 +921,12 @@ void exit_pi_state_list(struct task_struct *curr)
 		pi_state->owner = NULL;
 		raw_spin_unlock_irq(&curr->pi_lock);
 
-		rt_mutex_futex_unlock(&pi_state->pi_mutex);
-
+		get_pi_state(pi_state);
 		spin_unlock(&hb->lock);
 
+		rt_mutex_futex_unlock(&pi_state->pi_mutex);
+		put_pi_state(pi_state);
+
 		raw_spin_lock_irq(&curr->pi_lock);
 	}
 	raw_spin_unlock_irq(&curr->pi_lock);
@@ -1037,6 +1039,11 @@ static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
 	 * has dropped the hb->lock in between queue_me() and unqueue_me_pi(),
 	 * which in turn means that futex_lock_pi() still has a reference on
 	 * our pi_state.
+	 *
+	 * The waiter holding a reference on @pi_state also protects against
+	 * the unlocked put_pi_state() in futex_unlock_pi(), futex_lock_pi()
+	 * and futex_wait_requeue_pi() as it cannot go to 0 and consequently
+	 * free pi_state before we can take a reference ourselves.
 	 */
 	WARN_ON(!atomic_read(&pi_state->refcount));
 
@@ -1380,48 +1387,40 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
 	smp_store_release(&q->lock_ptr, NULL);
 }
 
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
-			 struct futex_hash_bucket *hb)
+/*
+ * Caller must hold a reference on @pi_state.
+ */
+static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
 {
-	struct task_struct *new_owner;
-	struct futex_pi_state *pi_state = top_waiter->pi_state;
 	u32 uninitialized_var(curval), newval;
+	struct task_struct *new_owner;
+	bool deboost = false;
 	DEFINE_WAKE_Q(wake_q);
-	bool deboost;
 	int ret = 0;
 
-	if (!pi_state)
-		return -EINVAL;
-
-	/*
-	 * If current does not own the pi_state then the futex is
-	 * inconsistent and user space fiddled with the futex value.
-	 */
-	if (pi_state->owner != current)
-		return -EINVAL;
-
 	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
-
-	/*
-	 * When we interleave with futex_lock_pi() where it does
-	 * rt_mutex_timed_futex_lock(), we might observe @this futex_q waiter,
-	 * but the rt_mutex's wait_list can be empty (either still, or again,
-	 * depending on which side we land).
-	 *
-	 * When this happens, give up our locks and try again, giving the
-	 * futex_lock_pi() instance time to complete, either by waiting on the
-	 * rtmutex or removing itself from the futex queue.
-	 */
 	if (!new_owner) {
-		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
-		return -EAGAIN;
+		/*
+		 * Since we held neither hb->lock nor wait_lock when coming
+		 * into this function, we could have raced with futex_lock_pi()
+		 * such that we might observe @this futex_q waiter, but the
+		 * rt_mutex's wait_list can be empty (either still, or again,
+		 * depending on which side we land).
+		 *
+		 * When this happens, give up our locks and try again, giving
+		 * the futex_lock_pi() instance time to complete, either by
+		 * waiting on the rtmutex or removing itself from the futex
+		 * queue.
+		 */
+		ret = -EAGAIN;
+		goto out_unlock;
 	}
 
 	/*
-	 * We pass it to the next owner. The WAITERS bit is always
-	 * kept enabled while there is PI state around. We cleanup the
-	 * owner died bit, because we are the owner.
+	 * We pass it to the next owner. The WAITERS bit is always kept
+	 * enabled while there is PI state around. We cleanup the owner
+	 * died bit, because we are the owner.
 	 */
 	newval = FUTEX_WAITERS | task_pid_vnr(new_owner);
 
@@ -1444,10 +1443,8 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter
 			ret = -EINVAL;
 	}
 
-	if (ret) {
-		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
-		return ret;
-	}
+	if (ret)
+		goto out_unlock;
 
 	raw_spin_lock(&pi_state->owner->pi_lock);
 	WARN_ON(list_empty(&pi_state->list));
@@ -1465,15 +1462,15 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter
 	 */
 	deboost = __rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
 
+out_unlock:
 	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
-	spin_unlock(&hb->lock);
 
 	if (deboost) {
 		wake_up_q(&wake_q);
 		rt_mutex_adjust_prio(current);
 	}
 
-	return 0;
+	return ret;
 }
 
 /*
@@ -2232,7 +2229,8 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
 	/*
 	 * We are here either because we stole the rtmutex from the
 	 * previous highest priority waiter or we are the highest priority
-	 * waiter but failed to get the rtmutex the first time.
+	 * waiter but have failed to get the rtmutex the first time.
+	 *
 	 * We have to replace the newowner TID in the user space variable.
 	 * This must be atomic as we have to preserve the owner died bit here.
 	 *
@@ -2249,7 +2247,7 @@ retry:
 	if (get_futex_value_locked(&uval, uaddr))
 		goto handle_fault;
 
-	while (1) {
+	for (;;) {
 		newval = (uval & FUTEX_OWNER_DIED) | newtid;
 
 		if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval))
@@ -2345,6 +2343,10 @@ static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked)
 		/*
 		 * Got the lock. We might not be the anticipated owner if we
 		 * did a lock-steal - fix up the PI-state in that case:
+		 *
+		 * We can safely read pi_state->owner without holding wait_lock
+		 * because we now own the rt_mutex, only the owner will attempt
+		 * to change it.
 		 */
 		if (q->pi_state->owner != current)
 			ret = fixup_pi_state_owner(uaddr, q, current);
@@ -2584,6 +2586,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
 			 ktime_t *time, int trylock)
 {
 	struct hrtimer_sleeper timeout, *to = NULL;
+	struct futex_pi_state *pi_state = NULL;
 	struct futex_hash_bucket *hb;
 	struct futex_q q = futex_q_init;
 	int res, ret;
@@ -2670,12 +2673,19 @@ retry_private:
 	 * If fixup_owner() faulted and was unable to handle the fault, unlock
 	 * it and return the fault to userspace.
 	 */
-	if (ret && (rt_mutex_owner(&q.pi_state->pi_mutex) == current))
-		rt_mutex_futex_unlock(&q.pi_state->pi_mutex);
+	if (ret && (rt_mutex_owner(&q.pi_state->pi_mutex) == current)) {
+		pi_state = q.pi_state;
+		get_pi_state(pi_state);
+	}
 
 	/* Unqueue and drop the lock */
 	unqueue_me_pi(&q);
 
+	if (pi_state) {
+		rt_mutex_futex_unlock(&pi_state->pi_mutex);
+		put_pi_state(pi_state);
+	}
+
 	goto out_put_key;
 
 out_unlock_put_key:
@@ -2738,10 +2748,36 @@ retry:
 	 */
 	top_waiter = futex_top_waiter(hb, &key);
 	if (top_waiter) {
-		ret = wake_futex_pi(uaddr, uval, top_waiter, hb);
+		struct futex_pi_state *pi_state = top_waiter->pi_state;
+
+		ret = -EINVAL;
+		if (!pi_state)
+			goto out_unlock;
+
+		/*
+		 * If current does not own the pi_state then the futex is
+		 * inconsistent and user space fiddled with the futex value.
+		 */
+		if (pi_state->owner != current)
+			goto out_unlock;
+
 		/*
-		 * In case of success wake_futex_pi dropped the hash
-		 * bucket lock.
+		 * Grab a reference on the pi_state and drop hb->lock.
+		 *
+		 * The reference ensures pi_state lives, dropping the hb->lock
+		 * is tricky.. wake_futex_pi() will take rt_mutex::wait_lock to
+		 * close the races against futex_lock_pi(), but in case of
+		 * _any_ fail we'll abort and retry the whole deal.
+		 */
+		get_pi_state(pi_state);
+		spin_unlock(&hb->lock);
+
+		ret = wake_futex_pi(uaddr, uval, pi_state);
+
+		put_pi_state(pi_state);
+
+		/*
+		 * Success, we're done! No tricky corner cases.
 		 */
 		if (!ret)
 			goto out_putkey;
@@ -2756,7 +2792,6 @@ retry:
 		 * setting the FUTEX_WAITERS bit. Try again.
 		 */
 		if (ret == -EAGAIN) {
-			spin_unlock(&hb->lock);
 			put_futex_key(&key);
 			goto retry;
 		}
@@ -2764,7 +2799,7 @@ retry:
 		 * wake_futex_pi has detected invalid state. Tell user
 		 * space.
 		 */
-		goto out_unlock;
+		goto out_putkey;
 	}
 
 	/*
@@ -2774,8 +2809,10 @@ retry:
 	 * preserve the WAITERS bit not the OWNER_DIED one. We are the
 	 * owner.
 	 */
-	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, 0))
+	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, 0)) {
+		spin_unlock(&hb->lock);
 		goto pi_faulted;
+	}
 
 	/*
 	 * If uval has changed, let user space handle it.
@@ -2789,7 +2826,6 @@ out_putkey:
 	return ret;
 
 pi_faulted:
-	spin_unlock(&hb->lock);
 	put_futex_key(&key);
 
 	ret = fault_in_user_writeable(uaddr);
@@ -2893,6 +2929,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 				 u32 __user *uaddr2)
 {
 	struct hrtimer_sleeper timeout, *to = NULL;
+	struct futex_pi_state *pi_state = NULL;
 	struct rt_mutex_waiter rt_waiter;
 	struct futex_hash_bucket *hb;
 	union futex_key key2 = FUTEX_KEY_INIT;
@@ -2977,8 +3014,10 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 		if (q.pi_state && (q.pi_state->owner != current)) {
 			spin_lock(q.lock_ptr);
 			ret = fixup_pi_state_owner(uaddr2, &q, current);
-			if (ret && rt_mutex_owner(&q.pi_state->pi_mutex) == current)
-				rt_mutex_futex_unlock(&q.pi_state->pi_mutex);
+			if (ret && rt_mutex_owner(&q.pi_state->pi_mutex) == current) {
+				pi_state = q.pi_state;
+				get_pi_state(pi_state);
+			}
 			/*
 			 * Drop the reference to the pi state which
 			 * the requeue_pi() code acquired for us.
@@ -3017,13 +3056,20 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 		 * the fault, unlock the rt_mutex and return the fault to
 		 * userspace.
 		 */
-		if (ret && rt_mutex_owner(pi_mutex) == current)
-			rt_mutex_futex_unlock(pi_mutex);
+		if (ret && rt_mutex_owner(&q.pi_state->pi_mutex) == current) {
+			pi_state = q.pi_state;
+			get_pi_state(pi_state);
+		}
 
 		/* Unqueue and drop the lock. */
 		unqueue_me_pi(&q);
 	}
 
+	if (pi_state) {
+		rt_mutex_futex_unlock(&pi_state->pi_mutex);
+		put_pi_state(pi_state);
+	}
+
 	if (ret == -EINTR) {
 		/*
 		 * We've already been requeued, but cannot restart by calling


* [tip:locking/core] futex,rt_mutex: Introduce rt_mutex_init_waiter()
  2017-03-22 10:35 ` [PATCH -v6 09/13] futex,rt_mutex: Introduce rt_mutex_init_waiter() Peter Zijlstra
@ 2017-03-23 18:23   ` tip-bot for Peter Zijlstra
  2017-04-05 23:57   ` [PATCH -v6 09/13] " Darren Hart
  1 sibling, 0 replies; 60+ messages in thread
From: tip-bot for Peter Zijlstra @ 2017-03-23 18:23 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: tglx, mingo, linux-kernel, peterz, hpa

Commit-ID:  50809358dd7199aa7ce232f6877dd09ec30ef374
Gitweb:     http://git.kernel.org/tip/50809358dd7199aa7ce232f6877dd09ec30ef374
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 22 Mar 2017 11:35:56 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Thu, 23 Mar 2017 19:10:09 +0100

futex,rt_mutex: Introduce rt_mutex_init_waiter()

Since there's already two copies of this code, introduce a helper now
before adding a third one.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.950039479@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/futex.c                  |  5 +----
 kernel/locking/rtmutex.c        | 12 +++++++++---
 kernel/locking/rtmutex_common.h |  1 +
 3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 3b0aace..f03ff63 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2956,10 +2956,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 	 * The waiter is allocated on our stack, manipulated by the requeue
 	 * code while we sleep on uaddr.
 	 */
-	debug_rt_mutex_init_waiter(&rt_waiter);
-	RB_CLEAR_NODE(&rt_waiter.pi_tree_entry);
-	RB_CLEAR_NODE(&rt_waiter.tree_entry);
-	rt_waiter.task = NULL;
+	rt_mutex_init_waiter(&rt_waiter);
 
 	ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE);
 	if (unlikely(ret != 0))
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 7d63bc5..d2fe4b4 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1153,6 +1153,14 @@ void rt_mutex_adjust_pi(struct task_struct *task)
 				   next_lock, NULL, task);
 }
 
+void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
+{
+	debug_rt_mutex_init_waiter(waiter);
+	RB_CLEAR_NODE(&waiter->pi_tree_entry);
+	RB_CLEAR_NODE(&waiter->tree_entry);
+	waiter->task = NULL;
+}
+
 /**
  * __rt_mutex_slowlock() - Perform the wait-wake-try-to-take loop
  * @lock:		 the rt_mutex to take
@@ -1235,9 +1243,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
 	unsigned long flags;
 	int ret = 0;
 
-	debug_rt_mutex_init_waiter(&waiter);
-	RB_CLEAR_NODE(&waiter.pi_tree_entry);
-	RB_CLEAR_NODE(&waiter.tree_entry);
+	rt_mutex_init_waiter(&waiter);
 
 	/*
 	 * Technically we could use raw_spin_[un]lock_irq() here, but this can
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index af667f6..10f57d6 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -103,6 +103,7 @@ extern void rt_mutex_init_proxy_locked(struct rt_mutex *lock,
 				       struct task_struct *proxy_owner);
 extern void rt_mutex_proxy_unlock(struct rt_mutex *lock,
 				  struct task_struct *proxy_owner);
+extern void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter);
 extern int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 				     struct rt_mutex_waiter *waiter,
 				     struct task_struct *task);


* [tip:locking/core] futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()
  2017-03-22 10:35 ` [PATCH -v6 10/13] futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock() Peter Zijlstra
@ 2017-03-23 18:23   ` tip-bot for Peter Zijlstra
  2017-04-07 23:30   ` [PATCH -v6 10/13] " Darren Hart
  1 sibling, 0 replies; 60+ messages in thread
From: tip-bot for Peter Zijlstra @ 2017-03-23 18:23 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: peterz, tglx, mingo, hpa, linux-kernel

Commit-ID:  38d589f2fd08f1296aea3ce62bebd185125c6d81
Gitweb:     http://git.kernel.org/tip/38d589f2fd08f1296aea3ce62bebd185125c6d81
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 22 Mar 2017 11:35:57 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Thu, 23 Mar 2017 19:10:09 +0100

futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()

With the ultimate goal of keeping rt_mutex wait_list and futex_q waiters
consistent it's necessary to split 'rt_mutex_futex_lock()' into finer
parts, such that only the actual blocking can be done without hb->lock
held.

Split rt_mutex_finish_proxy_lock() into two parts, one that does the
blocking and one that does remove_waiter() when the lock acquire failed.

When the rtmutex was acquired successfully the waiter can be removed in the
acquisition path safely, since there is no concurrency on the lock owner.

This means that, except for futex_lock_pi(), all wait_list modifications
are done with both hb->lock and wait_lock held.
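
As a rough illustration only (a user-space toy, not the kernel code; the
toy_* names and the struct are hypothetical stand-ins), the caller-side
contract of the split could be modelled like this: a failed wait is
disregarded when the cleanup step finds that we became owner after all.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in; this is NOT the kernel's struct rt_mutex. */
struct toy_lock {
	int owner;	/* task id of the current owner, 0 if unowned */
	int waiter;	/* task id enqueued on the wait list, 0 if none */
};

/*
 * Models the role of rt_mutex_cleanup_proxy_lock() after a failed wait.
 * Returns true if 'self' was taken off the wait list (the failure stands),
 * false if 'self' turned out to be the owner (the failure is disregarded).
 */
static bool toy_cleanup_proxy_lock(struct toy_lock *lock, int self)
{
	assert(lock != NULL);

	if (lock->owner != self) {
		if (lock->waiter == self)
			lock->waiter = 0;
		return true;
	}
	return false;
}

/* Models the caller pattern: 'wait_ret' is the result of the wait step. */
static int toy_finish(struct toy_lock *lock, int self, int wait_ret)
{
	int ret = wait_ret;

	if (ret && !toy_cleanup_proxy_lock(lock, self))
		ret = 0;

	return ret;
}
```

In this toy, a task that was interrupted but still granted ownership sees
0 from toy_finish(), while a task that never got the lock keeps its error
and is no longer on the wait list.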

[bigeasy@linutronix.de: fix for futex_requeue_pi_signal_restart]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.001659630@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/futex.c                  |  7 ++++--
 kernel/locking/rtmutex.c        | 52 +++++++++++++++++++++++++++++++++++------
 kernel/locking/rtmutex_common.h |  8 ++++---
 3 files changed, 55 insertions(+), 12 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index f03ff63..1cd8df7 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3032,10 +3032,13 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 		 */
 		WARN_ON(!q.pi_state);
 		pi_mutex = &q.pi_state->pi_mutex;
-		ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter);
-		debug_rt_mutex_free_waiter(&rt_waiter);
+		ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter);
 
 		spin_lock(q.lock_ptr);
+		if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
+			ret = 0;
+
+		debug_rt_mutex_free_waiter(&rt_waiter);
 		/*
 		 * Fixup the pi_state owner and possibly acquire the lock if we
 		 * haven't already.
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index d2fe4b4..1e8368d 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1753,21 +1753,23 @@ struct task_struct *rt_mutex_next_owner(struct rt_mutex *lock)
 }
 
 /**
- * rt_mutex_finish_proxy_lock() - Complete lock acquisition
+ * rt_mutex_wait_proxy_lock() - Wait for lock acquisition
  * @lock:		the rt_mutex we were woken on
  * @to:			the timeout, null if none. hrtimer should already have
  *			been started.
  * @waiter:		the pre-initialized rt_mutex_waiter
  *
- * Complete the lock acquisition started our behalf by another thread.
+ * Wait for the lock acquisition started on our behalf by
+ * rt_mutex_start_proxy_lock(). Upon failure, the caller must call
+ * rt_mutex_cleanup_proxy_lock().
  *
  * Returns:
  *  0 - success
  * <0 - error, one of -EINTR, -ETIMEDOUT
  *
- * Special API call for PI-futex requeue support
+ * Special API call for PI-futex support
  */
-int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
+int rt_mutex_wait_proxy_lock(struct rt_mutex *lock,
 			       struct hrtimer_sleeper *to,
 			       struct rt_mutex_waiter *waiter)
 {
@@ -1780,9 +1782,6 @@ int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
 	/* sleep on the mutex */
 	ret = __rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE, to, waiter);
 
-	if (unlikely(ret))
-		remove_waiter(lock, waiter);
-
 	/*
 	 * try_to_take_rt_mutex() sets the waiter bit unconditionally. We might
 	 * have to fix that up.
@@ -1793,3 +1792,42 @@ int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
 
 	return ret;
 }
+
+/**
+ * rt_mutex_cleanup_proxy_lock() - Cleanup failed lock acquisition
+ * @lock:		the rt_mutex we were woken on
+ * @waiter:		the pre-initialized rt_mutex_waiter
+ *
+ * Attempt to clean up after a failed rt_mutex_wait_proxy_lock().
+ *
+ * Unless we acquired the lock; we're still enqueued on the wait-list and can
+ * in fact still be granted ownership until we're removed. Therefore we can
+ * find we are in fact the owner and must disregard the
+ * rt_mutex_wait_proxy_lock() failure.
+ *
+ * Returns:
+ *  true  - did the cleanup, we are done.
+ *  false - we acquired the lock after rt_mutex_wait_proxy_lock() returned,
+ *          caller should disregard its return value.
+ *
+ * Special API call for PI-futex support
+ */
+bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock,
+				 struct rt_mutex_waiter *waiter)
+{
+	bool cleanup = false;
+
+	raw_spin_lock_irq(&lock->wait_lock);
+	/*
+	 * Unless we're the owner; we're still enqueued on the wait_list.
+	 * So check if we became owner, if not, take us off the wait_list.
+	 */
+	if (rt_mutex_owner(lock) != current) {
+		remove_waiter(lock, waiter);
+		fixup_rt_mutex_waiters(lock);
+		cleanup = true;
+	}
+	raw_spin_unlock_irq(&lock->wait_lock);
+
+	return cleanup;
+}
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 10f57d6..35361e4 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -107,9 +107,11 @@ extern void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter);
 extern int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 				     struct rt_mutex_waiter *waiter,
 				     struct task_struct *task);
-extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
-				      struct hrtimer_sleeper *to,
-				      struct rt_mutex_waiter *waiter);
+extern int rt_mutex_wait_proxy_lock(struct rt_mutex *lock,
+			       struct hrtimer_sleeper *to,
+			       struct rt_mutex_waiter *waiter);
+extern bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock,
+				 struct rt_mutex_waiter *waiter);
 
 extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
 extern int rt_mutex_futex_trylock(struct rt_mutex *l);


* [tip:locking/core] futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
  2017-03-22 10:35 ` [PATCH -v6 11/13] futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock() Peter Zijlstra
@ 2017-03-23 18:24   ` tip-bot for Peter Zijlstra
  2017-04-08  0:55   ` [PATCH -v6 11/13] " Darren Hart
  2017-04-10 15:51   ` alexander.levin
  2 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Peter Zijlstra @ 2017-03-23 18:24 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: peterz, mingo, tglx, hpa, linux-kernel

Commit-ID:  cfafcd117da0216520568c195cb2f6cd1980c4bb
Gitweb:     http://git.kernel.org/tip/cfafcd117da0216520568c195cb2f6cd1980c4bb
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 22 Mar 2017 11:35:58 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Thu, 23 Mar 2017 19:10:09 +0100

futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()

By changing futex_lock_pi() to use rt_mutex_*_proxy_lock() all wait_list
modifications are done under both hb->lock and wait_lock.

This closes the obvious interleave pattern between futex_lock_pi() and
futex_unlock_pi(), but not entirely so. See below:

Before:

futex_lock_pi()			futex_unlock_pi()
  unlock hb->lock

				  lock hb->lock
				  unlock hb->lock

				  lock rt_mutex->wait_lock
				  unlock rt_mutex->wait_lock
				    -EAGAIN

  lock rt_mutex->wait_lock
  list_add
  unlock rt_mutex->wait_lock

  schedule()

  lock rt_mutex->wait_lock
  list_del
  unlock rt_mutex->wait_lock

				  <idem>
				    -EAGAIN

  lock hb->lock


After:

futex_lock_pi()			futex_unlock_pi()

  lock hb->lock
  lock rt_mutex->wait_lock
  list_add
  unlock rt_mutex->wait_lock
  unlock hb->lock

  schedule()
				  lock hb->lock
				  unlock hb->lock
  lock hb->lock
  lock rt_mutex->wait_lock
  list_del
  unlock rt_mutex->wait_lock

				  lock rt_mutex->wait_lock
				  unlock rt_mutex->wait_lock
				    -EAGAIN

  unlock hb->lock


It does, however, solve the earlier starvation/live-lock scenario that was
introduced with the -EAGAIN. In the before scenario the -EAGAIN happens
while futex_unlock_pi() does not hold any locks; in the after scenario it
happens while futex_unlock_pi() actually holds a lock, so it is serialized
on that lock.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.062785528@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/futex.c                  | 77 +++++++++++++++++++++++++++++------------
 kernel/locking/rtmutex.c        | 26 ++++----------
 kernel/locking/rtmutex_common.h |  1 -
 3 files changed, 62 insertions(+), 42 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 1cd8df7..eecce7b 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2099,20 +2099,7 @@ queue_unlock(struct futex_hash_bucket *hb)
 	hb_waiters_dec(hb);
 }
 
-/**
- * queue_me() - Enqueue the futex_q on the futex_hash_bucket
- * @q:	The futex_q to enqueue
- * @hb:	The destination hash bucket
- *
- * The hb->lock must be held by the caller, and is released here. A call to
- * queue_me() is typically paired with exactly one call to unqueue_me().  The
- * exceptions involve the PI related operations, which may use unqueue_me_pi()
- * or nothing if the unqueue is done as part of the wake process and the unqueue
- * state is implicit in the state of woken task (see futex_wait_requeue_pi() for
- * an example).
- */
-static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
-	__releases(&hb->lock)
+static inline void __queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
 {
 	int prio;
 
@@ -2129,6 +2116,24 @@ static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
 	plist_node_init(&q->list, prio);
 	plist_add(&q->list, &hb->chain);
 	q->task = current;
+}
+
+/**
+ * queue_me() - Enqueue the futex_q on the futex_hash_bucket
+ * @q:	The futex_q to enqueue
+ * @hb:	The destination hash bucket
+ *
+ * The hb->lock must be held by the caller, and is released here. A call to
+ * queue_me() is typically paired with exactly one call to unqueue_me().  The
+ * exceptions involve the PI related operations, which may use unqueue_me_pi()
+ * or nothing if the unqueue is done as part of the wake process and the unqueue
+ * state is implicit in the state of woken task (see futex_wait_requeue_pi() for
+ * an example).
+ */
+static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
+	__releases(&hb->lock)
+{
+	__queue_me(q, hb);
 	spin_unlock(&hb->lock);
 }
 
@@ -2587,6 +2592,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
 {
 	struct hrtimer_sleeper timeout, *to = NULL;
 	struct futex_pi_state *pi_state = NULL;
+	struct rt_mutex_waiter rt_waiter;
 	struct futex_hash_bucket *hb;
 	struct futex_q q = futex_q_init;
 	int res, ret;
@@ -2639,25 +2645,52 @@ retry_private:
 		}
 	}
 
+	WARN_ON(!q.pi_state);
+
 	/*
 	 * Only actually queue now that the atomic ops are done:
 	 */
-	queue_me(&q, hb);
+	__queue_me(&q, hb);
 
-	WARN_ON(!q.pi_state);
-	/*
-	 * Block on the PI mutex:
-	 */
-	if (!trylock) {
-		ret = rt_mutex_timed_futex_lock(&q.pi_state->pi_mutex, to);
-	} else {
+	if (trylock) {
 		ret = rt_mutex_futex_trylock(&q.pi_state->pi_mutex);
 		/* Fixup the trylock return value: */
 		ret = ret ? 0 : -EWOULDBLOCK;
+		goto no_block;
+	}
+
+	/*
+	 * We must add ourselves to the rt_mutex waitlist while holding hb->lock
+	 * such that the hb and rt_mutex wait lists match.
+	 */
+	rt_mutex_init_waiter(&rt_waiter);
+	ret = rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current);
+	if (ret) {
+		if (ret == 1)
+			ret = 0;
+
+		goto no_block;
 	}
 
+	spin_unlock(q.lock_ptr);
+
+	if (unlikely(to))
+		hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
+
+	ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
+
 	spin_lock(q.lock_ptr);
 	/*
+	 * If we failed to acquire the lock (signal/timeout), we must
+	 * first acquire the hb->lock before removing the lock from the
+	 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex
+	 * wait lists consistent.
+	 */
+	if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
+		ret = 0;
+
+no_block:
+	/*
 	 * Fixup the pi_state owner and possibly acquire the lock if we
 	 * haven't already.
 	 */
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 1e8368d..48418a1 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1493,19 +1493,6 @@ int __sched rt_mutex_lock_interruptible(struct rt_mutex *lock)
 EXPORT_SYMBOL_GPL(rt_mutex_lock_interruptible);
 
 /*
- * Futex variant with full deadlock detection.
- * Futex variants must not use the fast-path, see __rt_mutex_futex_unlock().
- */
-int __sched rt_mutex_timed_futex_lock(struct rt_mutex *lock,
-			      struct hrtimer_sleeper *timeout)
-{
-	might_sleep();
-
-	return rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE,
-				 timeout, RT_MUTEX_FULL_CHAINWALK);
-}
-
-/*
  * Futex variant, must not use fastpath.
  */
 int __sched rt_mutex_futex_trylock(struct rt_mutex *lock)
@@ -1782,12 +1769,6 @@ int rt_mutex_wait_proxy_lock(struct rt_mutex *lock,
 	/* sleep on the mutex */
 	ret = __rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE, to, waiter);
 
-	/*
-	 * try_to_take_rt_mutex() sets the waiter bit unconditionally. We might
-	 * have to fix that up.
-	 */
-	fixup_rt_mutex_waiters(lock);
-
 	raw_spin_unlock_irq(&lock->wait_lock);
 
 	return ret;
@@ -1827,6 +1808,13 @@ bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock,
 		fixup_rt_mutex_waiters(lock);
 		cleanup = true;
 	}
+
+	/*
+	 * try_to_take_rt_mutex() sets the waiter bit unconditionally. We might
+	 * have to fix that up.
+	 */
+	fixup_rt_mutex_waiters(lock);
+
 	raw_spin_unlock_irq(&lock->wait_lock);
 
 	return cleanup;
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 35361e4..1e93e15 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -113,7 +113,6 @@ extern int rt_mutex_wait_proxy_lock(struct rt_mutex *lock,
 extern bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock,
 				 struct rt_mutex_waiter *waiter);
 
-extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
 extern int rt_mutex_futex_trylock(struct rt_mutex *l);
 
 extern void rt_mutex_futex_unlock(struct rt_mutex *lock);


* [tip:locking/core] futex: Futex_unlock_pi() determinism
  2017-03-22 10:35 ` [PATCH -v6 12/13] futex: futex_unlock_pi() determinism Peter Zijlstra
@ 2017-03-23 18:24   ` tip-bot for Peter Zijlstra
  2017-04-08  1:27   ` [PATCH -v6 12/13] futex: futex_unlock_pi() determinism Darren Hart
  1 sibling, 0 replies; 60+ messages in thread
From: tip-bot for Peter Zijlstra @ 2017-03-23 18:24 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: tglx, hpa, peterz, mingo, linux-kernel

Commit-ID:  bebe5b514345f09be2c15e414d076b02ecb9cce8
Gitweb:     http://git.kernel.org/tip/bebe5b514345f09be2c15e414d076b02ecb9cce8
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 22 Mar 2017 11:35:59 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Thu, 23 Mar 2017 19:10:10 +0100

futex: Futex_unlock_pi() determinism

The problem with returning -EAGAIN when the waiter state mismatches is that
it becomes very hard to prove a bounded execution time on the
operation. And seeing that this is an RT operation, this is somewhat
important.

While in practice, given the previous patch, it will be very unlikely to
ever really take more than one or two rounds, proving so becomes rather
hard.

However, now that modifying wait_list is done while holding both hb->lock
and wait_lock, the scenario can be avoided entirely by acquiring wait_lock
while still holding hb->lock, doing a hand-over without leaving a hole.
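
The hand-over can be sketched in user space roughly as follows. This is
only an illustration under stated assumptions: pthread mutexes stand in
for hb->lock and rt_mutex::wait_lock, two counters stand in for the two
wait lists, and all toy_* names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins for hb->lock and wait_lock. */
struct toy_futex {
	pthread_mutex_t hb_lock;
	pthread_mutex_t wait_lock;
	int hb_waiters;		/* waiters queued on the hash bucket */
	int rt_waiters;		/* waiters queued on the rt_mutex wait_list */
};

static void toy_init(struct toy_futex *f)
{
	assert(f != NULL);
	pthread_mutex_init(&f->hb_lock, NULL);
	pthread_mutex_init(&f->wait_lock, NULL);
	f->hb_waiters = 0;
	f->rt_waiters = 0;
}

/* Lock side: both lists are modified with both locks held. */
static void toy_enqueue(struct toy_futex *f)
{
	pthread_mutex_lock(&f->hb_lock);
	pthread_mutex_lock(&f->wait_lock);
	f->hb_waiters++;
	f->rt_waiters++;
	pthread_mutex_unlock(&f->wait_lock);
	pthread_mutex_unlock(&f->hb_lock);
}

/*
 * Unlock side: look at the bucket under hb_lock, then take wait_lock
 * BEFORE dropping hb_lock, so there is no point where neither is held.
 * Returns 1 if the two views of the waiter state agree.
 */
static int toy_unlock_observe(struct toy_futex *f)
{
	int seen_hb, seen_rt;

	pthread_mutex_lock(&f->hb_lock);
	seen_hb = f->hb_waiters;
	pthread_mutex_lock(&f->wait_lock);	/* hand-over, no hole */
	pthread_mutex_unlock(&f->hb_lock);
	seen_rt = f->rt_waiters;
	pthread_mutex_unlock(&f->wait_lock);

	return seen_hb == seen_rt;
}
```

Because every modification needs both locks, holding either one is enough
to keep the lists stable, which is why the observer above cannot see a
mismatch.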

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.112378812@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/futex.c | 24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index eecce7b..4cdc603 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1398,15 +1398,10 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_
 	DEFINE_WAKE_Q(wake_q);
 	int ret = 0;
 
-	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
-	if (!new_owner) {
+	if (WARN_ON_ONCE(!new_owner)) {
 		/*
-		 * Since we held neither hb->lock nor wait_lock when coming
-		 * into this function, we could have raced with futex_lock_pi()
-		 * such that we might observe @this futex_q waiter, but the
-		 * rt_mutex's wait_list can be empty (either still, or again,
-		 * depending on which side we land).
+		 * As per the comment in futex_unlock_pi() this should not happen.
 		 *
 		 * When this happens, give up our locks and try again, giving
 		 * the futex_lock_pi() instance time to complete, either by
@@ -2794,15 +2789,18 @@ retry:
 		if (pi_state->owner != current)
 			goto out_unlock;
 
+		get_pi_state(pi_state);
 		/*
-		 * Grab a reference on the pi_state and drop hb->lock.
+		 * Since modifying the wait_list is done while holding both
+		 * hb->lock and wait_lock, holding either is sufficient to
+		 * observe it.
 		 *
-		 * The reference ensures pi_state lives, dropping the hb->lock
-		 * is tricky.. wake_futex_pi() will take rt_mutex::wait_lock to
-		 * close the races against futex_lock_pi(), but in case of
-		 * _any_ fail we'll abort and retry the whole deal.
+		 * By taking wait_lock while still holding hb->lock, we ensure
+		 * there is no point where we hold neither; and therefore
+		 * wake_futex_pi() must observe a state consistent with what we
+		 * observed.
 		 */
-		get_pi_state(pi_state);
+		raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 		spin_unlock(&hb->lock);
 
 		ret = wake_futex_pi(uaddr, uval, pi_state);


* [tip:locking/core] futex: Drop hb->lock before enqueueing on the rtmutex
  2017-03-22 10:36 ` [PATCH -v6 13/13] futex: futex_lock_pi() vs PREEMPT_RT_FULL Peter Zijlstra
@ 2017-03-23 18:25   ` tip-bot for Peter Zijlstra
  2017-04-08  2:26   ` [PATCH -v6 13/13] futex: futex_lock_pi() vs PREEMPT_RT_FULL Darren Hart
  1 sibling, 0 replies; 60+ messages in thread
From: tip-bot for Peter Zijlstra @ 2017-03-23 18:25 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: bigeasy, linux-kernel, tglx, mingo, peterz, hpa

Commit-ID:  56222b212e8edb1cf51f5dd73ff645809b082b40
Gitweb:     http://git.kernel.org/tip/56222b212e8edb1cf51f5dd73ff645809b082b40
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 22 Mar 2017 11:36:00 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Thu, 23 Mar 2017 19:14:59 +0100

futex: Drop hb->lock before enqueueing on the rtmutex

When PREEMPT_RT_FULL does the spinlock -> rt_mutex substitution the PI
chain code will (falsely) report a deadlock and BUG.

The problem is that it holds hb->lock (now an rt_mutex) while doing
task_blocks_on_rt_mutex on the futex's pi_state::rtmutex. This, when
interleaved just right with futex_unlock_pi(), leads it to believe it sees
an AB-BA deadlock.

  Task1 (holds rt_mutex,	Task2 (does FUTEX_LOCK_PI)
         does FUTEX_UNLOCK_PI)

				lock hb->lock
				lock rt_mutex (as per start_proxy)
  lock hb->lock

Which is a trivial AB-BA.

It is not an actual deadlock, because it won't be holding hb->lock by the
time it actually blocks on the rt_mutex, but the chainwalk code doesn't
know that and it would be a nightmare to handle this gracefully.

To avoid this problem, do the same as in futex_unlock_pi() and drop
hb->lock after acquiring wait_lock. This still fully serializes against
futex_unlock_pi(), since adding to the wait_list does the very same lock
dance, and removing it holds both locks.

Aside from solving the RT problem, this makes the lock and unlock
mechanisms symmetric and reduces the hb->lock hold time.
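
To make the false positive concrete, here is a toy chain walk. It is a
sketch only and far simpler than the real PI chain code; the toy_* types
and the function are hypothetical. It follows "lock -> owner -> lock the
owner is blocked on" and reports a deadlock when it arrives back at the
task that wants to block.

```c
#include <assert.h>
#include <stddef.h>

struct toy_task;

/* Hypothetical stand-ins; NOT the kernel's rt_mutex or task_struct. */
struct toy_lock {
	struct toy_task *owner;		/* NULL if unowned */
};

struct toy_task {
	struct toy_lock *blocked_on;	/* NULL if not blocked */
};

/*
 * Returns 1 if 'self' blocking on 'lock' would close a cycle in the
 * blocking chain, 0 otherwise. The depth cap keeps the walk bounded.
 */
static int toy_would_deadlock(const struct toy_task *self,
			      const struct toy_lock *lock)
{
	int depth;

	assert(self != NULL);

	for (depth = 0; lock && depth < 1024; depth++) {
		const struct toy_task *owner = lock->owner;

		if (!owner)
			return 0;
		if (owner == self)
			return 1;
		lock = owner->blocked_on;
	}
	return 0;
}
```

With Task2 still recorded as the owner of hb->lock when it enqueues on the
rt_mutex, and Task1 (the rt_mutex owner) blocked on hb->lock, the walk
ends at Task2 itself. Once Task2 drops hb->lock first, the chain ends at
Task1 and nothing is reported.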

Reported-and-tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.161341537@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/futex.c                  | 30 +++++++++++++++++--------
 kernel/locking/rtmutex.c        | 49 +++++++++++++++++++++++------------------
 kernel/locking/rtmutex_common.h |  3 +++
 3 files changed, 52 insertions(+), 30 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 4cdc603..628be42 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2654,20 +2654,33 @@ retry_private:
 		goto no_block;
 	}
 
+	rt_mutex_init_waiter(&rt_waiter);
+
 	/*
-	 * We must add ourselves to the rt_mutex waitlist while holding hb->lock
-	 * such that the hb and rt_mutex wait lists match.
+	 * On PREEMPT_RT_FULL, when hb->lock becomes an rt_mutex, we must not
+	 * hold it while doing rt_mutex_start_proxy(), because then it will
+	 * include hb->lock in the blocking chain, even through we'll not in
+	 * fact hold it while blocking. This will lead it to report -EDEADLK
+	 * and BUG when futex_unlock_pi() interleaves with this.
+	 *
+	 * Therefore acquire wait_lock while holding hb->lock, but drop the
+	 * latter before calling rt_mutex_start_proxy_lock(). This still fully
+	 * serializes against futex_unlock_pi() as that does the exact same
+	 * lock handoff sequence.
 	 */
-	rt_mutex_init_waiter(&rt_waiter);
-	ret = rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current);
+	raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
+	spin_unlock(q.lock_ptr);
+	ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current);
+	raw_spin_unlock_irq(&q.pi_state->pi_mutex.wait_lock);
+
 	if (ret) {
 		if (ret == 1)
 			ret = 0;
 
+		spin_lock(q.lock_ptr);
 		goto no_block;
 	}
 
-	spin_unlock(q.lock_ptr);
 
 	if (unlikely(to))
 		hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
@@ -2680,6 +2693,9 @@ retry_private:
 	 * first acquire the hb->lock before removing the lock from the
 	 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex
 	 * wait lists consistent.
+	 *
+	 * In particular; it is important that futex_unlock_pi() can not
+	 * observe this inconsistency.
 	 */
 	if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
 		ret = 0;
@@ -2791,10 +2807,6 @@ retry:
 
 		get_pi_state(pi_state);
 		/*
-		 * Since modifying the wait_list is done while holding both
-		 * hb->lock and wait_lock, holding either is sufficient to
-		 * observe it.
-		 *
 		 * By taking wait_lock while still holding hb->lock, we ensure
 		 * there is no point where we hold neither; and therefore
 		 * wake_futex_pi() must observe a state consistent with what we
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 48418a1..dd10312 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1669,31 +1669,14 @@ void rt_mutex_proxy_unlock(struct rt_mutex *lock,
 	rt_mutex_set_owner(lock, NULL);
 }
 
-/**
- * rt_mutex_start_proxy_lock() - Start lock acquisition for another task
- * @lock:		the rt_mutex to take
- * @waiter:		the pre-initialized rt_mutex_waiter
- * @task:		the task to prepare
- *
- * Returns:
- *  0 - task blocked on lock
- *  1 - acquired the lock for task, caller should wake it up
- * <0 - error
- *
- * Special API call for FUTEX_REQUEUE_PI support.
- */
-int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
+int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 			      struct rt_mutex_waiter *waiter,
 			      struct task_struct *task)
 {
 	int ret;
 
-	raw_spin_lock_irq(&lock->wait_lock);
-
-	if (try_to_take_rt_mutex(lock, task, NULL)) {
-		raw_spin_unlock_irq(&lock->wait_lock);
+	if (try_to_take_rt_mutex(lock, task, NULL))
 		return 1;
-	}
 
 	/* We enforce deadlock detection for futexes */
 	ret = task_blocks_on_rt_mutex(lock, waiter, task,
@@ -1712,14 +1695,38 @@ int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 	if (unlikely(ret))
 		remove_waiter(lock, waiter);
 
-	raw_spin_unlock_irq(&lock->wait_lock);
-
 	debug_rt_mutex_print_deadlock(waiter);
 
 	return ret;
 }
 
 /**
+ * rt_mutex_start_proxy_lock() - Start lock acquisition for another task
+ * @lock:		the rt_mutex to take
+ * @waiter:		the pre-initialized rt_mutex_waiter
+ * @task:		the task to prepare
+ *
+ * Returns:
+ *  0 - task blocked on lock
+ *  1 - acquired the lock for task, caller should wake it up
+ * <0 - error
+ *
+ * Special API call for FUTEX_REQUEUE_PI support.
+ */
+int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
+			      struct rt_mutex_waiter *waiter,
+			      struct task_struct *task)
+{
+	int ret;
+
+	raw_spin_lock_irq(&lock->wait_lock);
+	ret = __rt_mutex_start_proxy_lock(lock, waiter, task);
+	raw_spin_unlock_irq(&lock->wait_lock);
+
+	return ret;
+}
+
+/**
  * rt_mutex_next_owner - return the next owner of the lock
  *
  * @lock: the rt lock query
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 1e93e15..b1ccfea 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -104,6 +104,9 @@ extern void rt_mutex_init_proxy_locked(struct rt_mutex *lock,
 extern void rt_mutex_proxy_unlock(struct rt_mutex *lock,
 				  struct task_struct *proxy_owner);
 extern void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter);
+extern int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
+				     struct rt_mutex_waiter *waiter,
+				     struct task_struct *task);
 extern int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 				     struct rt_mutex_waiter *waiter,
 				     struct task_struct *task);


* Re: [PATCH -v6 00/13] The arduous story of FUTEX_UNLOCK_PI
  2017-03-22 10:35 [PATCH -v6 00/13] The arduous story of FUTEX_UNLOCK_PI Peter Zijlstra
                   ` (12 preceding siblings ...)
  2017-03-22 10:36 ` [PATCH -v6 13/13] futex: futex_lock_pi() vs PREEMPT_RT_FULL Peter Zijlstra
@ 2017-03-24  1:45 ` Darren Hart
  13 siblings, 0 replies; 60+ messages in thread
From: Darren Hart @ 2017-03-24  1:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Wed, Mar 22, 2017 at 11:35:47AM +0100, Peter Zijlstra wrote:
> Hi all,
> 
> Another installment of the futex patches that give you nightmares ;-)
> 
> This version sports updated comments and Changelogs as requested last
> time around. It also includes two fixes, both reported by Sebastian
> who was kind enough to stick this in his RT tree for some testing.
> 
> The last patch is RT specific, but I figure we can merge it anyway.
> 
> Again; I sincerely hope this to be the very last version.

Step 1: Apply to tree, kick off automated Futex tests.
Step 2: Fix Makefile for futex selftests that's been broken since November
        (sad face)
Step 3: Pick this up in the morning...

-- 
Darren Hart
VMware Open Source Technology Center


* Re: [PATCH -v6 01/13] futex: Cleanup variable names for futex_top_waiter()
  2017-03-22 10:35 ` [PATCH -v6 01/13] futex: Cleanup variable names for futex_top_waiter() Peter Zijlstra
  2017-03-23 18:19   ` [tip:locking/core] " tip-bot for Peter Zijlstra
@ 2017-03-24 21:11   ` Darren Hart
  1 sibling, 0 replies; 60+ messages in thread
From: Darren Hart @ 2017-03-24 21:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Wed, Mar 22, 2017 at 11:35:48AM +0100, Peter Zijlstra wrote:
> futex_top_waiter() returns the top-waiter on the pi_mutex. Assinging
> this to a variable 'match' totally obscures the code.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Yup, still happy to see this one.

Reviewed-by: Darren Hart (VMware) <dvhart@infradead.org>

> ---
>  kernel/futex.c |   30 +++++++++++++++---------------
>  1 file changed, 15 insertions(+), 15 deletions(-)
> 
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1120,14 +1120,14 @@ static int attach_to_pi_owner(u32 uval,
>  static int lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
>  			   union futex_key *key, struct futex_pi_state **ps)
>  {
> -	struct futex_q *match = futex_top_waiter(hb, key);
> +	struct futex_q *top_waiter = futex_top_waiter(hb, key);
>  
>  	/*
>  	 * If there is a waiter on that futex, validate it and
>  	 * attach to the pi_state when the validation succeeds.
>  	 */
> -	if (match)
> -		return attach_to_pi_state(uval, match->pi_state, ps);
> +	if (top_waiter)
> +		return attach_to_pi_state(uval, top_waiter->pi_state, ps);
>  
>  	/*
>  	 * We are the first waiter - try to look up the owner based on
> @@ -1174,7 +1174,7 @@ static int futex_lock_pi_atomic(u32 __us
>  				struct task_struct *task, int set_waiters)
>  {
>  	u32 uval, newval, vpid = task_pid_vnr(task);
> -	struct futex_q *match;
> +	struct futex_q *top_waiter;
>  	int ret;
>  
>  	/*
> @@ -1200,9 +1200,9 @@ static int futex_lock_pi_atomic(u32 __us
>  	 * Lookup existing state first. If it exists, try to attach to
>  	 * its pi_state.
>  	 */
> -	match = futex_top_waiter(hb, key);
> -	if (match)
> -		return attach_to_pi_state(uval, match->pi_state, ps);
> +	top_waiter = futex_top_waiter(hb, key);
> +	if (top_waiter)
> +		return attach_to_pi_state(uval, top_waiter->pi_state, ps);
>  
>  	/*
>  	 * No waiter and user TID is 0. We are here because the
> @@ -1292,11 +1292,11 @@ static void mark_wake_futex(struct wake_
>  	q->lock_ptr = NULL;
>  }
>  
> -static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
> +static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
>  			 struct futex_hash_bucket *hb)
>  {
>  	struct task_struct *new_owner;
> -	struct futex_pi_state *pi_state = this->pi_state;
> +	struct futex_pi_state *pi_state = top_waiter->pi_state;
>  	u32 uninitialized_var(curval), newval;
>  	DEFINE_WAKE_Q(wake_q);
>  	bool deboost;
> @@ -1317,11 +1317,11 @@ static int wake_futex_pi(u32 __user *uad
>  
>  	/*
>  	 * It is possible that the next waiter (the one that brought
> -	 * this owner to the kernel) timed out and is no longer
> +	 * top_waiter owner to the kernel) timed out and is no longer
>  	 * waiting on the lock.
>  	 */
>  	if (!new_owner)
> -		new_owner = this->task;
> +		new_owner = top_waiter->task;
>  
>  	/*
>  	 * We pass it to the next owner. The WAITERS bit is always
> @@ -2631,7 +2631,7 @@ static int futex_unlock_pi(u32 __user *u
>  	u32 uninitialized_var(curval), uval, vpid = task_pid_vnr(current);
>  	union futex_key key = FUTEX_KEY_INIT;
>  	struct futex_hash_bucket *hb;
> -	struct futex_q *match;
> +	struct futex_q *top_waiter;
>  	int ret;
>  
>  retry:
> @@ -2655,9 +2655,9 @@ static int futex_unlock_pi(u32 __user *u
>  	 * all and we at least want to know if user space fiddled
>  	 * with the futex value instead of blindly unlocking.
>  	 */
> -	match = futex_top_waiter(hb, &key);
> -	if (match) {
> -		ret = wake_futex_pi(uaddr, uval, match, hb);
> +	top_waiter = futex_top_waiter(hb, &key);
> +	if (top_waiter) {
> +		ret = wake_futex_pi(uaddr, uval, top_waiter, hb);
>  		/*
>  		 * In case of success wake_futex_pi dropped the hash
>  		 * bucket lock.
> 
> 
> 

-- 
Darren Hart
VMware Open Source Technology Center


* Re: [PATCH -v6 02/13] futex: Use smp_store_release() in mark_wake_futex()
  2017-03-22 10:35 ` [PATCH -v6 02/13] futex: Use smp_store_release() in mark_wake_futex() Peter Zijlstra
  2017-03-23 18:19   ` [tip:locking/core] " tip-bot for Peter Zijlstra
@ 2017-03-24 21:16   ` Darren Hart
  1 sibling, 0 replies; 60+ messages in thread
From: Darren Hart @ 2017-03-24 21:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Wed, Mar 22, 2017 at 11:35:49AM +0100, Peter Zijlstra wrote:
> Since the futex_q can dissapear the instruction after assigning NULL,
> this really should be a RELEASE barrier. That stops loads from hitting
> dead memory too.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

I reviewed this carefully in the previous thread, confirming that despite the
move to wake queues, spurious wakeups can still lead to the situation Peter
describes. As such:

Reviewed-by: Darren Hart (VMware) <dvhart@infradead.org>

My only suggestion would be to clarify the language in the preceding comment to
make that obvious, as well as clarify which plist_del it is referring to since
it has been moved under the __unqueue_futex. I can do that as a follow-on though.
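
For readers less familiar with the barrier change, here is a userspace
analogue using C11 atomics. It is only a sketch: struct fake_futex_q and
both helpers are hypothetical stand-ins, memory_order_release is used as
the rough C11 counterpart of smp_store_release(), and the acquire load on
the waiter side is an assumption of the sketch, not a statement about what
the kernel's waiter path does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct futex_q; only what the sketch needs. */
struct fake_futex_q {
	int on_list;              /* stands in for plist membership */
	_Atomic(void *) lock_ptr; /* non-NULL while the entry is queued */
};

/* Waker side: unlink first, then publish lock_ptr == NULL as a release. */
void fake_mark_wake(struct fake_futex_q *q)
{
	q->on_list = 0; /* the "plist_del" of this sketch */
	atomic_store_explicit(&q->lock_ptr, NULL, memory_order_release);
	/* q must not be touched past this point: the waiter may free it. */
}

/* Waiter side: returns 1 once the waker is finished with q, else 0. */
int fake_waiter_may_free(struct fake_futex_q *q)
{
	if (atomic_load_explicit(&q->lock_ptr, memory_order_acquire) != NULL)
		return 0;
	/* Acquire pairs with the release above, so on_list == 0 is visible. */
	return q->on_list == 0;
}
```

The point of the release store is that every access the waker made to q
before it, loads included, is ordered ahead of the store that lets the
waiter go.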

> ---
>  kernel/futex.c |    3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1288,8 +1288,7 @@ static void mark_wake_futex(struct wake_
>  	 * memory barrier is required here to prevent the following
>  	 * store to lock_ptr from getting ahead of the plist_del.
>  	 */
> -	smp_wmb();
> -	q->lock_ptr = NULL;
> +	smp_store_release(&q->lock_ptr, NULL);
>  }
>  
>  static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
> 
> 
> 

-- 
Darren Hart
VMware Open Source Technology Center


* Re: [PATCH -v6 03/13] futex: Remove rt_mutex_deadlock_account_*()
  2017-03-22 10:35 ` [PATCH -v6 03/13] futex: Remove rt_mutex_deadlock_account_*() Peter Zijlstra
  2017-03-23 18:20   ` [tip:locking/core] " tip-bot for Peter Zijlstra
@ 2017-03-24 21:29   ` Darren Hart
  2017-03-24 21:31     ` Darren Hart
  1 sibling, 1 reply; 60+ messages in thread
From: Darren Hart @ 2017-03-24 21:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Wed, Mar 22, 2017 at 11:35:50AM +0100, Peter Zijlstra wrote:
> These are unused and clutter up the code.

And apparently have been that way for a very long time.

> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Reviewed-by: Darren Hart (VMware) <dvhart@infradead.org>

-- 
Darren Hart
VMware Open Source Technology Center


* Re: [PATCH -v6 03/13] futex: Remove rt_mutex_deadlock_account_*()
  2017-03-24 21:29   ` [PATCH -v6 03/13] " Darren Hart
@ 2017-03-24 21:31     ` Darren Hart
  0 siblings, 0 replies; 60+ messages in thread
From: Darren Hart @ 2017-03-24 21:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Fri, Mar 24, 2017 at 02:29:27PM -0700, Darren Hart wrote:
> On Wed, Mar 22, 2017 at 11:35:50AM +0100, Peter Zijlstra wrote:
> > These are unused and clutter up the code.
> 
> And apparently have been that way for a very long time.
> 
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> Reviewed-by: Darren Hart (VMware) <dvhart@infradead.org>

Oh, with the nit that the Subject prefix should be rtmutex rather than futex for
this patch.

-- 
Darren Hart
VMware Open Source Technology Center


* Re: [PATCH -v6 04/13] futex,rt_mutex: Provide futex specific rt_mutex API
  2017-03-22 10:35 ` [PATCH -v6 04/13] futex,rt_mutex: Provide futex specific rt_mutex API Peter Zijlstra
  2017-03-23 18:20   ` [tip:locking/core] " tip-bot for Peter Zijlstra
@ 2017-03-25  0:37   ` Darren Hart
  2017-04-06 12:15     ` Peter Zijlstra
  2017-04-05 15:02   ` Darren Hart
  2 siblings, 1 reply; 60+ messages in thread
From: Darren Hart @ 2017-03-25  0:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Wed, Mar 22, 2017 at 11:35:51AM +0100, Peter Zijlstra wrote:
> Part of what makes futex_unlock_pi() intricate is that
> rt_mutex_futex_unlock() -> rt_mutex_slowunlock() can drop
> rt_mutex::wait_lock.
> 
> This means we cannot rely on the atomicy of wait_lock, which we would
> like to do in order to not rely on hb->lock so much.
> 
> The reason rt_mutex_slowunlock() needs to drop wait_lock is because it
> can race with the rt_mutex fastpath, however futexes have their own
> fast path.
> 
> Since futexes already have a bunch of separate rt_mutex accessors,
> complete that set and implement a rt_mutex variant without fastpath
> for them.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Got to here and tried to get some testing going while I was reviewing... to find
that some of the existing pi test suites LTP/realtime, are not building either.
Got a fix, got it into CI, some CI issues, but no obvious fallout from this. So,
review will continue...

But, Peter are you testing this with anything in particular?

-- 
Darren Hart
VMware Open Source Technology Center


* Re: [PATCH -v6 04/13] futex,rt_mutex: Provide futex specific rt_mutex API
  2017-03-22 10:35 ` [PATCH -v6 04/13] futex,rt_mutex: Provide futex specific rt_mutex API Peter Zijlstra
  2017-03-23 18:20   ` [tip:locking/core] " tip-bot for Peter Zijlstra
  2017-03-25  0:37   ` [PATCH -v6 04/13] " Darren Hart
@ 2017-04-05 15:02   ` Darren Hart
  2017-04-06 12:17     ` Peter Zijlstra
  2 siblings, 1 reply; 60+ messages in thread
From: Darren Hart @ 2017-04-05 15:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Wed, Mar 22, 2017 at 11:35:51AM +0100, Peter Zijlstra wrote:
> Part of what makes futex_unlock_pi() intricate is that
> rt_mutex_futex_unlock() -> rt_mutex_slowunlock() can drop
> rt_mutex::wait_lock.
> 
> This means we cannot rely on the atomicy of wait_lock, which we would
> like to do in order to not rely on hb->lock so much.
> 
> The reason rt_mutex_slowunlock() needs to drop wait_lock is because it
> can race with the rt_mutex fastpath, however futexes have their own
> fast path.
> 
> Since futexes already have a bunch of separate rt_mutex accessors,
> complete that set and implement a rt_mutex variant without fastpath
> for them.

Premise makes sense, I'm tripping over some detail - wondering if it is all
related...

> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/futex.c                  |   30 ++++++++++-----------
>  kernel/locking/rtmutex.c        |   55 +++++++++++++++++++++++++++++-----------
>  kernel/locking/rtmutex_common.h |    9 +++++-
>  3 files changed, 62 insertions(+), 32 deletions(-)
> 
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -916,7 +916,7 @@ void exit_pi_state_list(struct task_stru
>  		pi_state->owner = NULL;
>  		raw_spin_unlock_irq(&curr->pi_lock);
>  
> -		rt_mutex_unlock(&pi_state->pi_mutex);
> +		rt_mutex_futex_unlock(&pi_state->pi_mutex);
>  
>  		spin_unlock(&hb->lock);
>  
> @@ -1364,20 +1364,18 @@ static int wake_futex_pi(u32 __user *uad
>  	pi_state->owner = new_owner;
>  	raw_spin_unlock(&new_owner->pi_lock);
>  
> -	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
> -
> -	deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
> -
>  	/*
> -	 * First unlock HB so the waiter does not spin on it once he got woken
> -	 * up. Second wake up the waiter before the priority is adjusted. If we
> -	 * deboost first (and lose our higher priority), then the task might get
> -	 * scheduled away before the wake up can take place.
> +	 * We've updated the uservalue, this unlock cannot fail.

It isn't clear to me what I should understand from this new comment. How does
the value of the uval affect whether or not the pi_state->pi_mutex can be
unlocked? Or are you noting that we've set FUTEX_WAITERS so any valid
userspace operations will be forced into the kernel and can't race with us since
we hold the hb->lock? With futexes, I think it's important that we be very
explicit in our comment blocks.

>  	 */
> +	deboost = __rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
> +
> +	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
>  	spin_unlock(&hb->lock);
> -	wake_up_q(&wake_q);
> -	if (deboost)
> +
> +	if (deboost) {
> +		wake_up_q(&wake_q);

Is moving wake_up_q under deboost related to this change or is it just an
optimization since there is no need to wake unless we are deboosting ourselves -
which was true before as well?

If this is due to the rt_mutex_futex* API, I haven't made the connection.

-- 
Darren Hart
VMware Open Source Technology Center


* Re: [PATCH -v6 05/13] futex: Change locking rules
  2017-03-22 10:35 ` [PATCH -v6 05/13] futex: Change locking rules Peter Zijlstra
  2017-03-23 18:21   ` [tip:locking/core] " tip-bot for Peter Zijlstra
@ 2017-04-05 21:18   ` Darren Hart
  2017-04-06 12:28     ` Peter Zijlstra
  1 sibling, 1 reply; 60+ messages in thread
From: Darren Hart @ 2017-04-05 21:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Wed, Mar 22, 2017 at 11:35:52AM +0100, Peter Zijlstra wrote:
> Currently futex-pi relies on hb->lock to serialize everything. Since
> hb->lock is giving us problems (PI inversions among other things,
> since on -rt hb lock itself is a rt_mutex), we want to break this up a
> bit.
> 
> This patch reworks and documents the locking. Notably, it
> consistently uses rt_mutex::wait_lock to serialize {uval, pi_state}.
> This would allow us to do rt_mutex_unlock() (including deboost)
> without holding hb->lock.
> 
> Nothing yet relies on the new locking rules.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/futex.c |  165 +++++++++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 132 insertions(+), 33 deletions(-)
> 
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -973,6 +973,39 @@ void exit_pi_state_list(struct task_stru
>   *
>   * [10] There is no transient state which leaves owner and user space
>   *	TID out of sync.
> + *
> + *
> + * Serialization and lifetime rules:
> + *
> + * hb->lock:
> + *
> + *	hb -> futex_q, relation
> + *	futex_q -> pi_state, relation
> + *
> + *	(cannot be raw because hb can contain arbitrary amount
> + *	 of futex_q's)
> + *
> + * pi_mutex->wait_lock:
> + *
> + *	{uval, pi_state}
> + *
> + *	(and pi_mutex 'obviously')
> + *
> + * p->pi_lock:

This documentation uses a mix of types and common variable names. I'd recommend
some declarations just below "Serialization and lifetime rules:" to help make
this explicit, e.g.:

struct futex_pi_state *pi_state;
struct futex_hash_bucket *hb;
struct rt_mutex *pi_mutex;
struct futex_q *q;
struct task_struct *p;

> + *
> + *	p->pi_state_list -> pi_state->list, relation
> + *
> + * pi_state->refcount:
> + *
> + *	pi_state lifetime
> + *
> + *
> + * Lock order:
> + *
> + *   hb->lock
> + *     pi_mutex->wait_lock
> + *       p->pi_lock
> + *
>   */
>  
>  /*
> @@ -980,10 +1013,12 @@ void exit_pi_state_list(struct task_stru
>   * the pi_state against the user space value. If correct, attach to
>   * it.
>   */
> -static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
> +static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
> +			      struct futex_pi_state *pi_state,
>  			      struct futex_pi_state **ps)
>  {
>  	pid_t pid = uval & FUTEX_TID_MASK;
> +	int ret, uval2;

The uval should be an unsigned type:

u32 uval2;
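
To illustrate why the type matters, a small userspace sketch. The three
masks mirror the values in the uapi futex header; the helper names are
hypothetical and exist only for this example. Note that converting an
out-of-range value to a signed type is implementation-defined in C; the
sketch assumes the usual two's complement behaviour of GCC and Clang.

```c
#include <assert.h>
#include <stdint.h>

#define FUTEX_WAITERS    0x80000000u
#define FUTEX_OWNER_DIED 0x40000000u
#define FUTEX_TID_MASK   0x3fffffffu

/* TID extraction on the unsigned word ignores both flag bits. */
uint32_t futex_tid(uint32_t uval)
{
	return uval & FUTEX_TID_MASK;
}

/* The WAITERS flag is the top bit of the 32-bit futex word. */
int futex_has_waiters(uint32_t uval)
{
	return (uval & FUTEX_WAITERS) != 0;
}

/*
 * Held in a signed int, that same top bit becomes the sign bit, so a
 * futex word with waiters reads as a negative number.
 */
int signed_copy_is_negative(uint32_t uval)
{
	int32_t as_signed = (int32_t)uval; /* implementation-defined if > INT32_MAX */

	return as_signed < 0;
}
```

Keeping the value in a u32 avoids having to reason about sign extension
whenever the word is compared, shifted or printed.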

>  
>  	/*
>  	 * Userspace might have messed up non-PI and PI futexes [3]
> @@ -991,9 +1026,34 @@ static int attach_to_pi_state(u32 uval,
>  	if (unlikely(!pi_state))
>  		return -EINVAL;
>  
> +	/*
> +	 * We get here with hb->lock held, and having found a
> +	 * futex_top_waiter(). This means that futex_lock_pi() of said futex_q
> +	 * has dropped the hb->lock in between queue_me() and unqueue_me_pi(),

This context got here like this:

	futex_lock_pi
		hb lock
		futex_lock_pi_atomic
			top waiter
			attach_to_pi_state()

	The queue_me and unqueue_me_pi both come after this in futex_lock_pi.
	Also, the hb lock is dropped in queue_me, not between queue_me and
	unqueue_me_pi.

Are you saying that in order to be here, there are at least two tasks contending
for the lock, and one that has come before us has proceeded as far as queue_me()
but has not yet entered unqueue_me_pi(), therefore we know there is a waiter and
it has a pi_state? If so, I think we can make this much clearer by at least
noting the two tasks in play.

...

> @@ -1336,6 +1418,7 @@ static int wake_futex_pi(u32 __user *uad
>  
>  	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)) {
>  		ret = -EFAULT;
> +

Stray whitespace addition? Not explicitly against coding-style, but I don't
normally see a new line before the closing brace leading to an else...

>  	} else if (curval != uval) {
>  		/*
>  		 * If a unconditional UNLOCK_PI operation (user space did not

...

-- 
Darren Hart
VMware Open Source Technology Center


* Re: [PATCH -v6 06/13] futex: Cleanup refcounting
  2017-03-22 10:35 ` [PATCH -v6 06/13] futex: Cleanup refcounting Peter Zijlstra
  2017-03-23 18:21   ` [tip:locking/core] " tip-bot for Peter Zijlstra
@ 2017-04-05 21:29   ` Darren Hart
  1 sibling, 0 replies; 60+ messages in thread
From: Darren Hart @ 2017-04-05 21:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Wed, Mar 22, 2017 at 11:35:53AM +0100, Peter Zijlstra wrote:
> Since we're going to add more refcount fiddling, introduce
> get_pi_state() to match the existing put_pi_state().
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

This looks awfully familiar... https://lkml.org/lkml/2015/12/20/5 ... I guess
timing is everything :-) Since we have an incoming need:

Reviewed-by: Darren Hart (VMware) <dvhart@infradead.org>

-- 
Darren Hart
VMware Open Source Technology Center


* Re: [PATCH -v6 07/13] futex: Rework inconsistent rt_mutex/futex_q state
  2017-03-22 10:35 ` [PATCH -v6 07/13] futex: Rework inconsistent rt_mutex/futex_q state Peter Zijlstra
  2017-03-23 18:22   ` [tip:locking/core] " tip-bot for Peter Zijlstra
@ 2017-04-05 21:58   ` Darren Hart
  1 sibling, 0 replies; 60+ messages in thread
From: Darren Hart @ 2017-04-05 21:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Wed, Mar 22, 2017 at 11:35:54AM +0100, Peter Zijlstra wrote:
> There is a weird state in the futex_unlock_pi() path when it
> interleaves with a concurrent futex_lock_pi() at the point where it
> drops hb->lock.
> 
> In this case, it can happen that the rt_mutex wait_list and the
> futex_q disagree on pending waiters, in particular rt_mutex will find
> no pending waiters where futex_q thinks there are.
> 
> In this case the rt_mutex unlock code cannot assign an owner.
> 
> What the current code does in this case is use the futex_q waiter that
> got us here; however when the rt_mutex_timed_futex_lock() has already
> failed; this leaves things in a weird state, resulting in much
> head-aches in fixup_owner().
> 
> Simplify all this by changing wake_futex_pi() to return -EAGAIN when
> this situation occurs. This then gives the futex_lock_pi() code the
> opportunity to continue and the retried futex_unlock_pi() will now
> observe a coherent state.
> 
> The only problem is that this breaks RT timeliness guarantees. That
> is, consider the following scenario:
> 
>   T1 and T2 are both pinned to CPU0. prio(T2) > prio(T1)
> 
>     CPU0
> 
>     T1
>       lock_pi()
>       queue_me()  <- Waiter is visible
> 
>     preemption
> 
>     T2
>       unlock_pi()
> 	loops with -EAGAIN forever
> 
> Which is undesirable for PI primitives. Future patches will rectify
> this. For now we want to get rid of the fixup magic.

Errrrm... OK... I don't like the idea of having this broken after this commit,
but until I internalize the remaining 5 (that number has never seemed quite so
dauntingly large before... 5...) I can't comment on the alternative. I suppose
having it documented in the commit log means anyone backporting only up to this
point gets what they deserve.

A good patch *removing* code from futex.c is always nice though !

-- 
Darren Hart
VMware Open Source Technology Center


* Re: [PATCH -v6 08/13] futex: Pull rt_mutex_futex_unlock() out from under hb->lock
  2017-03-22 10:35 ` [PATCH -v6 08/13] futex: Pull rt_mutex_futex_unlock() out from under hb->lock Peter Zijlstra
  2017-03-23 18:22   ` [tip:locking/core] " tip-bot for Peter Zijlstra
@ 2017-04-05 23:52   ` Darren Hart
  2017-04-06 12:42     ` Peter Zijlstra
  1 sibling, 1 reply; 60+ messages in thread
From: Darren Hart @ 2017-04-05 23:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Wed, Mar 22, 2017 at 11:35:55AM +0100, Peter Zijlstra wrote:
> There's a number of 'interesting' problems, all caused by holding
> hb->lock while doing the rt_mutex_unlock() equivalient.
> 
> Notably:
> 
>  - a PI inversion on hb->lock; and,
> 
>  - a DL crash because of pointer instability.

A DL crash? What is this? Can you elaborate a bit?

> 
> Because of all the previous patches that:
> 
>  - allow us to do rt_mutex_futex_unlock() without dropping wait_lock;
>    which in turn allows us to rely on wait_lock atomicy.
> 
>  - changed locking rules to cover {uval,pi_state} with wait_lock.
> 
>  - simplified the waiter conundrum.
> 
> We can now quite simply pull rt_mutex_futex_unlock() out from under
> hb->lock, a pi_state reference and wait_lock are sufficient.

OK, ow. I think I've traced most of this through. I have a few gray areas
still, and will continue through the series to see if that addresses them.

A few thoughts as they occurred to me below.

> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/futex.c |  154 +++++++++++++++++++++++++++++++++++++--------------------
>  1 file changed, 100 insertions(+), 54 deletions(-)
> 
> --- a/kernel/futex.c
> +++ b/kernel/futex.c

...

> @@ -1380,48 +1387,40 @@ static void mark_wake_futex(struct wake_
>  	smp_store_release(&q->lock_ptr, NULL);
>  }
>  
> -static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
> -			 struct futex_hash_bucket *hb)
> +/*
> + * Caller must hold a reference on @pi_state.
> + */
> +static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
>  {
> -	struct task_struct *new_owner;
> -	struct futex_pi_state *pi_state = top_waiter->pi_state;
>  	u32 uninitialized_var(curval), newval;
> +	struct task_struct *new_owner;
> +	bool deboost = false;
>  	DEFINE_WAKE_Q(wake_q);
> -	bool deboost;

Nit: Based on what I've seen from Thomas and others, I ask for declarations in
decreasing order of line length. So deboost should have stayed where it was.

>  
>  /*
> @@ -2232,7 +2229,8 @@ static int fixup_pi_state_owner(u32 __us
>  	/*
>  	 * We are here either because we stole the rtmutex from the
>  	 * previous highest priority waiter or we are the highest priority
> -	 * waiter but failed to get the rtmutex the first time.
> +	 * waiter but have failed to get the rtmutex the first time.
> +	 *
>  	 * We have to replace the newowner TID in the user space variable.
>  	 * This must be atomic as we have to preserve the owner died bit here.
>  	 *
> @@ -2249,7 +2247,7 @@ static int fixup_pi_state_owner(u32 __us
>  	if (get_futex_value_locked(&uval, uaddr))
>  		goto handle_fault;
>  
> -	while (1) {
> +	for (;;) {

As far as I'm aware, there is no difference and both are used throughout the
kernel (with the while version having 50% more instances). Is there more to this
than personal preference?

>  		newval = (uval & FUTEX_OWNER_DIED) | newtid;
>  
>  		if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval))
> @@ -2345,6 +2343,10 @@ static int fixup_owner(u32 __user *uaddr
>  		/*
>  		 * Got the lock. We might not be the anticipated owner if we
>  		 * did a lock-steal - fix up the PI-state in that case:
> +		 *
> +		 * We can safely read pi_state->owner without holding wait_lock
> +		 * because we now own the rt_mutex, only the owner will attempt
> +		 * to change it.

This seems to contradict the Serialization and lifetime rules:

+ * pi_mutex->wait_lock:
+ *
+ *     {uval, pi_state}
+ *
+ *     (and pi_mutex 'obviously')

It would seem that simply holding pi_mutex is sufficient for serialization on
pi_state->owner then.

...

> @@ -2738,10 +2748,36 @@ static int futex_unlock_pi(u32 __user *u
>  	 */
>  	top_waiter = futex_top_waiter(hb, &key);
>  	if (top_waiter) {
> -		ret = wake_futex_pi(uaddr, uval, top_waiter, hb);
> +		struct futex_pi_state *pi_state = top_waiter->pi_state;
> +
> +		ret = -EINVAL;
> +		if (!pi_state)
> +			goto out_unlock;
> +
> +		/*
> +		 * If current does not own the pi_state then the futex is
> +		 * inconsistent and user space fiddled with the futex value.
> +		 */
> +		if (pi_state->owner != current)
> +			goto out_unlock;
> +
> +		/*
> +		 * Grab a reference on the pi_state and drop hb->lock.
> +		 *
> +		 * The reference ensures pi_state lives, dropping the hb->lock
> +		 * is tricky.. wake_futex_pi() will take rt_mutex::wait_lock to
> +		 * close the races against futex_lock_pi(), but in case of
> +		 * _any_ fail we'll abort and retry the whole deal.

s/fail/failure/

-- 
Darren Hart
VMware Open Source Technology Center

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH -v6 09/13] futex,rt_mutex: Introduce rt_mutex_init_waiter()
  2017-03-22 10:35 ` [PATCH -v6 09/13] futex,rt_mutex: Introduce rt_mutex_init_waiter() Peter Zijlstra
  2017-03-23 18:23   ` [tip:locking/core] " tip-bot for Peter Zijlstra
@ 2017-04-05 23:57   ` Darren Hart
  1 sibling, 0 replies; 60+ messages in thread
From: Darren Hart @ 2017-04-05 23:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Wed, Mar 22, 2017 at 11:35:56AM +0100, Peter Zijlstra wrote:
> Since there's already two copies of this code, introduce a helper now
> before we get a third instance.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>


An easy one!

Reviewed-by: Darren Hart (VMware) <dvhart@infradead.org>

> ---
>  kernel/futex.c                  |    5 +----
>  kernel/locking/rtmutex.c        |   12 +++++++++---
>  kernel/locking/rtmutex_common.h |    1 +
>  3 files changed, 11 insertions(+), 7 deletions(-)
> 
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -2956,10 +2956,7 @@ static int futex_wait_requeue_pi(u32 __u
>  	 * The waiter is allocated on our stack, manipulated by the requeue
>  	 * code while we sleep on uaddr.
>  	 */
> -	debug_rt_mutex_init_waiter(&rt_waiter);
> -	RB_CLEAR_NODE(&rt_waiter.pi_tree_entry);
> -	RB_CLEAR_NODE(&rt_waiter.tree_entry);
> -	rt_waiter.task = NULL;
> +	rt_mutex_init_waiter(&rt_waiter);
>  
>  	ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE);
>  	if (unlikely(ret != 0))
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -1153,6 +1153,14 @@ void rt_mutex_adjust_pi(struct task_stru
>  				   next_lock, NULL, task);
>  }
>  
> +void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
> +{
> +	debug_rt_mutex_init_waiter(waiter);
> +	RB_CLEAR_NODE(&waiter->pi_tree_entry);
> +	RB_CLEAR_NODE(&waiter->tree_entry);
> +	waiter->task = NULL;
> +}
> +
>  /**
>   * __rt_mutex_slowlock() - Perform the wait-wake-try-to-take loop
>   * @lock:		 the rt_mutex to take
> @@ -1235,9 +1243,7 @@ rt_mutex_slowlock(struct rt_mutex *lock,
>  	unsigned long flags;
>  	int ret = 0;
>  
> -	debug_rt_mutex_init_waiter(&waiter);
> -	RB_CLEAR_NODE(&waiter.pi_tree_entry);
> -	RB_CLEAR_NODE(&waiter.tree_entry);
> +	rt_mutex_init_waiter(&waiter);

Verified: the old code did not set waiter.task to NULL here, but doing so is
harmless, since task_blocks_on_rt_mutex() initializes it before it is referenced.

>  
>  	/*
>  	 * Technically we could use raw_spin_[un]lock_irq() here, but this can
> --- a/kernel/locking/rtmutex_common.h
> +++ b/kernel/locking/rtmutex_common.h
> @@ -103,6 +103,7 @@ extern void rt_mutex_init_proxy_locked(s
>  				       struct task_struct *proxy_owner);
>  extern void rt_mutex_proxy_unlock(struct rt_mutex *lock,
>  				  struct task_struct *proxy_owner);
> +extern void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter);
>  extern int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
>  				     struct rt_mutex_waiter *waiter,
>  				     struct task_struct *task);
> 
> 
> 

-- 
Darren Hart
VMware Open Source Technology Center

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH -v6 04/13] futex,rt_mutex: Provide futex specific rt_mutex API
  2017-03-25  0:37   ` [PATCH -v6 04/13] " Darren Hart
@ 2017-04-06 12:15     ` Peter Zijlstra
  2017-04-06 17:02       ` Darren Hart
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2017-04-06 12:15 UTC (permalink / raw)
  To: Darren Hart
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Fri, Mar 24, 2017 at 05:37:02PM -0700, Darren Hart wrote:
> 
> But, Peter are you testing this with anything in particular?

Testing? :-)

I ran some of the futex pi tests we have, and have a slightly modified
kernel+version of pi_stress to trigger more funnies.

Other than that Sebastian ran various versions through -rt and their
customer workloads (and found bugs).

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH -v6 04/13] futex,rt_mutex: Provide futex specific rt_mutex API
  2017-04-05 15:02   ` Darren Hart
@ 2017-04-06 12:17     ` Peter Zijlstra
  2017-04-06 17:08       ` Darren Hart
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2017-04-06 12:17 UTC (permalink / raw)
  To: Darren Hart
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Wed, Apr 05, 2017 at 08:02:17AM -0700, Darren Hart wrote:
> > @@ -1364,20 +1364,18 @@ static int wake_futex_pi(u32 __user *uad
> >  	pi_state->owner = new_owner;
> >  	raw_spin_unlock(&new_owner->pi_lock);
> >  
> >  	/*
> > +	 * We've updated the uservalue, this unlock cannot fail.
> 
> It isn't clear to me what I should understand from this new comment. How does
> the value of the uval affect whether or not the pi_state->pi_mutex can be
> > unlocked or not? Or are you noting that we've set FUTEX_WAITERS so any valid
> > userspace operations will be forced into the kernel and can't race with us since
> we hold the hb->lock? With futexes, I think it's important that we be very
> explicit in our comment blocks.

The critical point is that once you've modified uval we must not fail;
there is no way to undo things thereafter.
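
A minimal userspace sketch of that "commit point" idea, assuming a toy lock
rather than the real futex code: every check that can fail runs before the
shared word is modified, and everything after the successful compare-exchange
is written so that it has no failure path. The names toy_lock and toy_handoff
are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the futex word plus the kernel-side owner field. */
struct toy_lock {
	_Atomic uint32_t uval;	/* plays the role of the user-space value */
	uint32_t owner;		/* plays the role of pi_state->owner */
};

/*
 * Hand the lock from old_tid to new_tid.
 * Returns 0 on success, -1 if the word changed under us (caller may retry).
 */
static int toy_handoff(struct toy_lock *l, uint32_t old_tid, uint32_t new_tid)
{
	uint32_t expected = old_tid;

	assert(l != NULL);

	/* Fallible phase: nothing has been modified yet, so bailing out is safe. */
	if (!atomic_compare_exchange_strong(&l->uval, &expected, new_tid))
		return -1;

	/*
	 * Commit point passed: the shared word already names the new owner.
	 * There is no way to undo that, so the rest has no error return.
	 */
	l->owner = new_tid;
	return 0;
}
```

In this sketch a failed compare-exchange leaves both fields untouched, which
is what makes the early return safe.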

> >  	 */
> > +	deboost = __rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
> > +
> > +	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
> >  	spin_unlock(&hb->lock);
> > +
> > +	if (deboost) {
> > +		wake_up_q(&wake_q);
> 
> Is moving wake_up_q under deboost related to this change or is it just an
> optimization since there is no need to wake unless we are deboosting ourselves -
> which was true before as well?
> 
> If this is due to the rt_mutex_futex* API, I haven't made the connection.

It's how rt_mutex does wakeups, note that later patches clean this up.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH -v6 05/13] futex: Change locking rules
  2017-04-05 21:18   ` [PATCH -v6 05/13] " Darren Hart
@ 2017-04-06 12:28     ` Peter Zijlstra
  2017-04-06 15:58       ` Joe Perches
  2017-04-06 17:21       ` Darren Hart
  0 siblings, 2 replies; 60+ messages in thread
From: Peter Zijlstra @ 2017-04-06 12:28 UTC (permalink / raw)
  To: Darren Hart
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Wed, Apr 05, 2017 at 02:18:43PM -0700, Darren Hart wrote:
> On Wed, Mar 22, 2017 at 11:35:52AM +0100, Peter Zijlstra wrote:

> > + *
> > + * Serialization and lifetime rules:
> > + *
> > + * hb->lock:
> > + *
> > + *	hb -> futex_q, relation
> > + *	futex_q -> pi_state, relation
> > + *
> > + *	(cannot be raw because hb can contain arbitrary amount
> > + *	 of futex_q's)
> > + *
> > + * pi_mutex->wait_lock:
> > + *
> > + *	{uval, pi_state}
> > + *
> > + *	(and pi_mutex 'obviously')
> > + *
> > + * p->pi_lock:
> 
> This documentation uses a mix of types and common variable names. I'd recommend
> some declarations just below "Serialization and lifetime rules:" to help make
> this explicit, e.g.:
> 
> struct futex_pi_state *pi_state;
> struct futex_hash_bucket *hb;
> struct rt_mutex *pi_mutex;
> struct futex_q *q;
> task_struct *p;

Yeah, not convinced it helps much. If you're stuck at that level, the
rest of futex is going to make your head explode.

> > @@ -980,10 +1013,12 @@ void exit_pi_state_list(struct task_stru
> >   * the pi_state against the user space value. If correct, attach to
> >   * it.
> >   */
> > +static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
> > +			      struct futex_pi_state *pi_state,
> >  			      struct futex_pi_state **ps)
> >  {
> >  	pid_t pid = uval & FUTEX_TID_MASK;
> > +	int ret, uval2;
> 
> The uval should be an unsigned type:
> 
> u32 uval2;

Right you are.
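
For illustration only, a small userspace sketch of how a signed variable can
surprise when it holds a futex word: once FUTEX_WAITERS (the top bit) is set,
the same 32 bits read as a negative number. The bit values match the uapi
futex header; the helpers make_futex_word() and negative_if_signed() are
hypothetical and not part of the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Futex word layout, values as in the kernel's uapi <linux/futex.h>. */
#define FUTEX_WAITERS    0x80000000u
#define FUTEX_OWNER_DIED 0x40000000u
#define FUTEX_TID_MASK   0x3fffffffu

/* Hypothetical helper: build a futex word from an owner TID and a waiters flag. */
static uint32_t make_futex_word(uint32_t tid, bool waiters)
{
	assert(tid <= FUTEX_TID_MASK);	/* the TID must fit in the low 30 bits */
	return tid | (waiters ? FUTEX_WAITERS : 0u);
}

/* Hypothetical helper: what a signed 32-bit variable would report for this word. */
static bool negative_if_signed(uint32_t uval)
{
	/*
	 * Converting an out-of-range value to int32_t is implementation-defined
	 * before C23; mainstream compilers wrap to two's complement.
	 */
	return (int32_t)uval < 0;
}
```

Keeping the value in a u32 avoids having to reason about sign extension or
signed comparisons at all.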

> >  
> >  	/*
> >  	 * Userspace might have messed up non-PI and PI futexes [3]
> > @@ -991,9 +1026,34 @@ static int attach_to_pi_state(u32 uval,
> >  	if (unlikely(!pi_state))
> >  		return -EINVAL;
> >  
> > +	/*
> > +	 * We get here with hb->lock held, and having found a
> > +	 * futex_top_waiter(). This means that futex_lock_pi() of said futex_q
> > +	 * has dropped the hb->lock in between queue_me() and unqueue_me_pi(),
> 
> This context got here like this:
> 
> 	futex_lock_pi
> 		hb lock
> 		futex_lock_pi_atomic
> 			top waiter
> 			attach_to_pi_state()
> 
> 	The queue_me and unqueue_me_pi both come after this in futex_lock_pi.
> 	Also, the hb lock is dropped in queue_me, not between queue_me and
> 	unqueue_me_pi.
> 
> Are you saying that in order to be here, there are at least two tasks contending
> for the lock, and one that has come before us has proceeded as far as queue_me()
> but has not yet entered unqueue_me_pi(), therefore we know there is a waiter and
> it has a pi_state? If so, I think we can make this much clearer by at least
> noting the two tasks in play.

The point is that this other task must have a reference, and since we
now hold hb->lock, it cannot go away.

> 
> ...
> 
> > @@ -1336,6 +1418,7 @@ static int wake_futex_pi(u32 __user *uad
> >  
> >  	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)) {
> >  		ret = -EFAULT;
> > +
> 
> Stray whitespace addition? Not explicitly against coding-style, but I don't
> normally see a new line before the closing brace leading to an else...

I found it more readable that way. Sod checkpatch and co ;-)

> >  	} else if (curval != uval) {
> >  		/*
> >  		 * If a unconditional UNLOCK_PI operation (user space did not

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH -v6 08/13] futex: Pull rt_mutex_futex_unlock() out from under hb->lock
  2017-04-05 23:52   ` [PATCH -v6 08/13] " Darren Hart
@ 2017-04-06 12:42     ` Peter Zijlstra
  2017-04-06 17:42       ` Darren Hart
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2017-04-06 12:42 UTC (permalink / raw)
  To: Darren Hart
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Wed, Apr 05, 2017 at 04:52:25PM -0700, Darren Hart wrote:
> On Wed, Mar 22, 2017 at 11:35:55AM +0100, Peter Zijlstra wrote:
> > There's a number of 'interesting' problems, all caused by holding
> > hb->lock while doing the rt_mutex_unlock() equivalent.
> > 
> > Notably:
> > 
> >  - a PI inversion on hb->lock; and,
> > 
> >  - a DL crash because of pointer instability.
> 
> A DL crash? What is this? Can you elaborate a bit?

See here:

  https://lkml.kernel.org/r/20170323145606.480214279@infradead.org


> > @@ -1380,48 +1387,40 @@ static void mark_wake_futex(struct wake_
> >  	smp_store_release(&q->lock_ptr, NULL);
> >  }
> >  
> > -static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
> > -			 struct futex_hash_bucket *hb)
> > +/*
> > + * Caller must hold a reference on @pi_state.
> > + */
> > +static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
> >  {
> > -	struct task_struct *new_owner;
> > -	struct futex_pi_state *pi_state = top_waiter->pi_state;
> >  	u32 uninitialized_var(curval), newval;
> > +	struct task_struct *new_owner;
> > +	bool deboost = false;
> >  	DEFINE_WAKE_Q(wake_q);
> > -	bool deboost;
> 
> Nit: Based on what I've seen from Thomas and others, I ask for declarations in
> decreasing order of line length. So deboost should have stayed where it was.

Hurm, yeah I mostly do that. No idea what went wrong there.

> >  
> >  /*
> > @@ -2232,7 +2229,8 @@ static int fixup_pi_state_owner(u32 __us
> >  	/*
> >  	 * We are here either because we stole the rtmutex from the
> >  	 * previous highest priority waiter or we are the highest priority
> > -	 * waiter but failed to get the rtmutex the first time.
> > +	 * waiter but have failed to get the rtmutex the first time.
> > +	 *
> >  	 * We have to replace the newowner TID in the user space variable.
> >  	 * This must be atomic as we have to preserve the owner died bit here.
> >  	 *
> > @@ -2249,7 +2247,7 @@ static int fixup_pi_state_owner(u32 __us
> >  	if (get_futex_value_locked(&uval, uaddr))
> >  		goto handle_fault;
> >  
> > -	while (1) {
> > +	for (;;) {
> 
> As far as I'm aware, there is no difference and both are used throughout the
> kernel (with the while version having 50% more instances). Is there more to this
> than personal preference?

Nope. Only that. I think I played around with the loop at one point and
this is all that remained of that.

> >  		newval = (uval & FUTEX_OWNER_DIED) | newtid;
> >  
> >  		if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval))
> > @@ -2345,6 +2343,10 @@ static int fixup_owner(u32 __user *uaddr
> >  		/*
> >  		 * Got the lock. We might not be the anticipated owner if we
> >  		 * did a lock-steal - fix up the PI-state in that case:
> > +		 *
> > +		 * We can safely read pi_state->owner without holding wait_lock
> > +		 * because we now own the rt_mutex, only the owner will attempt
> > +		 * to change it.
> 
> This seems to contradict the Serialization and lifetime rules:
> 
> + * pi_mutex->wait_lock:
> + *
> + *     {uval, pi_state}
> + *
> + *     (and pi_mutex 'obviously')
> 
> It would seem that simply holding pi_mutex is sufficient for serialization on
> pi_state->owner then.

Not a contradiction; just a very specific special case. If current is
the owner of a lock, said owner will not be going anywhere.

> > +
> > +		/*
> > +		 * Grab a reference on the pi_state and drop hb->lock.
> > +		 *
> > +		 * The reference ensures pi_state lives, dropping the hb->lock
> > +		 * is tricky.. wake_futex_pi() will take rt_mutex::wait_lock to
> > +		 * close the races against futex_lock_pi(), but in case of
> > +		 * _any_ fail we'll abort and retry the whole deal.
> 
> s/fail/failure/

I don't think that survives the patch-set. That is, I cannot find it in
the current code.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH -v6 05/13] futex: Change locking rules
  2017-04-06 12:28     ` Peter Zijlstra
@ 2017-04-06 15:58       ` Joe Perches
  2017-04-06 17:21       ` Darren Hart
  1 sibling, 0 replies; 60+ messages in thread
From: Joe Perches @ 2017-04-06 15:58 UTC (permalink / raw)
  To: Peter Zijlstra, Darren Hart
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Thu, 2017-04-06 at 14:28 +0200, Peter Zijlstra wrote:
> On Wed, Apr 05, 2017 at 02:18:43PM -0700, Darren Hart wrote:
> > On Wed, Mar 22, 2017 at 11:35:52AM +0100, Peter Zijlstra wrote:
> > > @@ -1336,6 +1418,7 @@ static int wake_futex_pi(u32 __user *uad
> > >  
> > >  	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)) {
> > >  		ret = -EFAULT;
> > > +
> > 
> > Stray whitespace addition? Not explicitly against coding-style, but I don't
> > normally see a new line before the closing brace leading to an else...
> 
> I found it more readable that way. Sod checkpatch and co ;-)

The only good sod is the stuff you get to play games on.

And this week's best sod is Augusta's immaculate carpet
for the Masters Tournament.

So no worries from me.  checkpatch is a brainless script.
Rules made to be broken, etc.

afaict another way to write that would be to use gotos
and that would be a lot more lines and less readable.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH -v6 04/13] futex,rt_mutex: Provide futex specific rt_mutex API
  2017-04-06 12:15     ` Peter Zijlstra
@ 2017-04-06 17:02       ` Darren Hart
  0 siblings, 0 replies; 60+ messages in thread
From: Darren Hart @ 2017-04-06 17:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Thu, Apr 06, 2017 at 02:15:05PM +0200, Peter Zijlstra wrote:
> On Fri, Mar 24, 2017 at 05:37:02PM -0700, Darren Hart wrote:
> > 
> > But, Peter are you testing this with anything in particular?
> 
> Testing? :-)
> 
> I ran some of the futex pi tests we have, and have a slightly modified
> kernel+version of pi_stress to trigger more funnies.
> 
> Other than that Sebastian ran various versions through -rt and their
> customer workloads (and found bugs).
> 

OK, that's more or less what I assumed, but wanted to confirm. I'm bubbling up
the futex torture test suite on my priority list (now that I find myself with a
lot more time available for this stuff) \o/

-- 
Darren Hart
VMware Open Source Technology Center

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH -v6 04/13] futex,rt_mutex: Provide futex specific rt_mutex API
  2017-04-06 12:17     ` Peter Zijlstra
@ 2017-04-06 17:08       ` Darren Hart
  0 siblings, 0 replies; 60+ messages in thread
From: Darren Hart @ 2017-04-06 17:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Thu, Apr 06, 2017 at 02:17:28PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 05, 2017 at 08:02:17AM -0700, Darren Hart wrote:
> > > @@ -1364,20 +1364,18 @@ static int wake_futex_pi(u32 __user *uad
> > >  	pi_state->owner = new_owner;
> > >  	raw_spin_unlock(&new_owner->pi_lock);
> > >  
> > >  	/*
> > > +	 * We've updated the uservalue, this unlock cannot fail.
> > 
> > It isn't clear to me what I should understand from this new comment. How does
> > the value of the uval affect whether or not the pi_state->pi_mutex can be
> > > unlocked or not? Or are you noting that we've set FUTEX_WAITERS so any valid
> > > userspace operations will be forced into the kernel and can't race with us since
> > we hold the hb->lock? With futexes, I think it's important that we be very
> > explicit in our comment blocks.
> 
> The critical point is that once you've modified uval we must not fail;
> there is no way to undo things thereafter.

Aha, "must not", OK. I interpreted "cannot" as "is incapable of failing". So
let's use something like that for the comment:

/*
 * We updated the user value and are committed to completing the unlock, we must
 * not fail.
 */

Wow... English. I tried a few versions, but cannot, may not, etc. all have
double meanings. :-)

-- 
Darren Hart
VMware Open Source Technology Center

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH -v6 05/13] futex: Change locking rules
  2017-04-06 12:28     ` Peter Zijlstra
  2017-04-06 15:58       ` Joe Perches
@ 2017-04-06 17:21       ` Darren Hart
  1 sibling, 0 replies; 60+ messages in thread
From: Darren Hart @ 2017-04-06 17:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Thu, Apr 06, 2017 at 02:28:32PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 05, 2017 at 02:18:43PM -0700, Darren Hart wrote:
> > On Wed, Mar 22, 2017 at 11:35:52AM +0100, Peter Zijlstra wrote:
> 
> > > + *
> > > + * Serialization and lifetime rules:
> > > + *
> > > + * hb->lock:
> > > + *
> > > + *	hb -> futex_q, relation
> > > + *	futex_q -> pi_state, relation
> > > + *
> > > + *	(cannot be raw because hb can contain arbitrary amount
> > > + *	 of futex_q's)
> > > + *
> > > + * pi_mutex->wait_lock:
> > > + *
> > > + *	{uval, pi_state}
> > > + *
> > > + *	(and pi_mutex 'obviously')
> > > + *
> > > + * p->pi_lock:
> > 
> > This documentation uses a mix of types and common variable names. I'd recommend
> > some declarations just below "Serialization and lifetime rules:" to help make
> > this explicit, e.g.:
> > 
> > struct futex_pi_state *pi_state;
> > struct futex_hash_bucket *hb;
> > struct rt_mutex *pi_mutex;
> > struct futex_q *q;
> > task_struct *p;
> 
> Yeah, not convinced it helps much. If you're stuck at that level, the
> rest of futex is going to make your head explode.

It just presented one more fork in the mindmap to go confirm types and names so
I was sure I was thinking of the same things as what was documented. Being
explicit avoids unnecessary confusion, reduces thought errors, and takes minimal
effort on our part. Well worth it IMHO.


> 
> > > @@ -980,10 +1013,12 @@ void exit_pi_state_list(struct task_stru
> > >   * the pi_state against the user space value. If correct, attach to
> > >   * it.
> > >   */
> > > +static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
> > > +			      struct futex_pi_state *pi_state,
> > >  			      struct futex_pi_state **ps)
> > >  {
> > >  	pid_t pid = uval & FUTEX_TID_MASK;
> > > +	int ret, uval2;
> > 
> > The uval should be an unsigned type:
> > 
> > u32 uval2;
> 
> Right you are.
> 
> > >  
> > >  	/*
> > >  	 * Userspace might have messed up non-PI and PI futexes [3]
> > > @@ -991,9 +1026,34 @@ static int attach_to_pi_state(u32 uval,
> > >  	if (unlikely(!pi_state))
> > >  		return -EINVAL;
> > >  
> > > +	/*
> > > +	 * We get here with hb->lock held, and having found a
> > > +	 * futex_top_waiter(). This means that futex_lock_pi() of said futex_q
> > > +	 * has dropped the hb->lock in between queue_me() and unqueue_me_pi(),
> > 
> > This context got here like this:
> > 
> > 	futex_lock_pi
> > 		hb lock
> > 		futex_lock_pi_atomic
> > 			top waiter
> > 			attach_to_pi_state()
> > 
> > 	The queue_me and unqueue_me_pi both come after this in futex_lock_pi.
> > 	Also, the hb lock is dropped in queue_me, not between queue_me and
> > 	unqueue_me_pi.
> > 
> > Are you saying that in order to be here, there are at least two tasks contending
> > for the lock, and one that has come before us has proceeded as far as queue_me()
> > > but has not yet entered unqueue_me_pi(), therefore we know there is a waiter and
> > it has a pi_state? If so, I think we can make this much clearer by at least
> > noting the two tasks in play.
> 
> The point is that this other task must have a reference, and since we
> now hold hb->lock, it cannot go away.


OK, so yes, two tasks. Noting the two task contexts somewhere in that comment
block would make this easier to follow - which is why we're adding the comment.


> > > @@ -1336,6 +1418,7 @@ static int wake_futex_pi(u32 __user *uad
> > >  
> > >  	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)) {
> > >  		ret = -EFAULT;
> > > +
> > 
> > Stray whitespace addition? Not explicitly against coding-style, but I don't
> > normally see a new line before the closing brace leading to an else...
> 
> I found it more readable that way. Sod checkpatch and co ;-)

Heh, I didn't run checkpatch, just found it odd and unrelated. I hesitate
to call you on style and superfluous change - but hey, if I'd make the comment
to people contributing to platform drivers, it would be hypocritical not to do
the same for you :-) And, if the feedback doesn't apply at this level, then I
should drop it as a barrier for the platform drivers - so serves as a good
litmus test.

-- 
Darren Hart
VMware Open Source Technology Center

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH -v6 08/13] futex: Pull rt_mutex_futex_unlock() out from under hb->lock
  2017-04-06 12:42     ` Peter Zijlstra
@ 2017-04-06 17:42       ` Darren Hart
  0 siblings, 0 replies; 60+ messages in thread
From: Darren Hart @ 2017-04-06 17:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Thu, Apr 06, 2017 at 02:42:48PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 05, 2017 at 04:52:25PM -0700, Darren Hart wrote:
> > On Wed, Mar 22, 2017 at 11:35:55AM +0100, Peter Zijlstra wrote:
> > > There's a number of 'interesting' problems, all caused by holding
> > > > hb->lock while doing the rt_mutex_unlock() equivalent.
> > > 
> > > Notably:
> > > 
> > >  - a PI inversion on hb->lock; and,
> > > 
> > >  - a DL crash because of pointer instability.
> > 
> > A DL crash? What is this? Can you elaborate a bit?
> 
> See here:
> 
>   https://lkml.kernel.org/r/20170323145606.480214279@infradead.org

Ah, DeadLine, thanks.

...

> > >  		newval = (uval & FUTEX_OWNER_DIED) | newtid;
> > >  
> > >  		if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval))
> > > @@ -2345,6 +2343,10 @@ static int fixup_owner(u32 __user *uaddr
> > >  		/*
> > >  		 * Got the lock. We might not be the anticipated owner if we
> > >  		 * did a lock-steal - fix up the PI-state in that case:
> > > +		 *
> > > +		 * We can safely read pi_state->owner without holding wait_lock
> > > +		 * because we now own the rt_mutex, only the owner will attempt
> > > +		 * to change it.
> > 
> > This seems to contradict the Serialization and lifetime rules:
> > 
> > + * pi_mutex->wait_lock:
> > + *
> > + *     {uval, pi_state}
> > + *
> > + *     (and pi_mutex 'obviously')
> > 
> > It would seem that simply holding pi_mutex is sufficient for serialization on
> > pi_state->owner then.
> 
> Not a contradiction; just a very specific special case. If current is
> the owner of a lock, said owner will not be going anywhere.

OK.

> > > +
> > > +		/*
> > > +		 * Grab a reference on the pi_state and drop hb->lock.
> > > +		 *
> > > +		 * The reference ensures pi_state lives, dropping the hb->lock
> > > +		 * is tricky.. wake_futex_pi() will take rt_mutex::wait_lock to
> > > +		 * close the races against futex_lock_pi(), but in case of
> > > +		 * _any_ fail we'll abort and retry the whole deal.
> > 
> > s/fail/failure/
> 
> I don't think that survives the patch-set. That is, I cannot find it in
> the current code.

Ah right, intermediate documentation. Kudos for that! :-)

-- 
Darren Hart
VMware Open Source Technology Center

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH -v6 10/13] futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()
  2017-03-22 10:35 ` [PATCH -v6 10/13] futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock() Peter Zijlstra
  2017-03-23 18:23   ` [tip:locking/core] " tip-bot for Peter Zijlstra
@ 2017-04-07 23:30   ` Darren Hart
  2017-04-07 23:35     ` Darren Hart
  1 sibling, 1 reply; 60+ messages in thread
From: Darren Hart @ 2017-04-07 23:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Wed, Mar 22, 2017 at 11:35:57AM +0100, Peter Zijlstra wrote:
> With the ultimate goal of keeping rt_mutex wait_list and futex_q
> waiters consistent we want to split 'rt_mutex_futex_lock()' into finer

I want to be clear that I understand why this patch is needed - as it actually
moves both the waiter removal and the rt_waiter freeing under the hb lock while
you've been working to be less dependent on the hb lock.

Was inconsistency of the rt_mutex wait_list and the futex_q waiters a problem
before this patch series, or do the previous patches make this one necessary?

It makes sense that for the two to be consistent they should be manipulated
under a common lock.
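
To check my understanding of the wait/cleanup split, here is a userspace
sketch under heavy assumptions: a pthread mutex stands in for wait_lock, an
integer id stands in for current, and a single flag stands in for the wait
list. The names toy_rtmutex, toy_cleanup_proxy_lock() and toy_finish() are
hypothetical. It only models the one property that matters: a waiter that
timed out may still have been granted the lock, so the owner check and the
removal must happen together under the lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: one lock, one possible waiter. */
struct toy_rtmutex {
	pthread_mutex_t wait_lock;	/* protects owner and queued */
	int owner;			/* id of the owning task, 0 if none */
	bool queued;			/* is our waiter still on the wait list? */
};

/*
 * Returns true if the waiter was removed (the wait really failed),
 * false if we turned out to be the owner after all.
 */
static bool toy_cleanup_proxy_lock(struct toy_rtmutex *m, int self)
{
	bool cleanup = false;

	assert(m != NULL);

	pthread_mutex_lock(&m->wait_lock);
	if (m->owner != self) {
		m->queued = false;	/* models remove_waiter() */
		cleanup = true;
	}
	pthread_mutex_unlock(&m->wait_lock);

	return cleanup;
}

/* Caller pattern: a failed wait is disregarded if we own the lock anyway. */
static int toy_finish(struct toy_rtmutex *m, int self, int wait_ret)
{
	if (wait_ret && !toy_cleanup_proxy_lock(m, self))
		wait_ret = 0;
	return wait_ret;
}
```

If the owner check were done outside the lock, ownership could be granted
between the check and the removal, which is the window the split is meant to
close.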

> parts, such that only the actual blocking can be done without hb->lock
> held.
> 
> This means we need to split rt_mutex_finish_proxy_lock() into two
> parts, one that does the blocking and one that does remove_waiter()
> when we fail to acquire.
> 
> When we do acquire, we can safely remove ourselves, since there is no
> concurrency on the lock owner.
> 
> This means that, except for futex_lock_pi(), all wait_list
> modifications are done with both hb->lock and wait_lock held.
> 
> [bigeasy@linutronix.de: fix for futex_requeue_pi_signal_restart]
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/futex.c                  |    7 +++--
>  kernel/locking/rtmutex.c        |   53 ++++++++++++++++++++++++++++++++++------
>  kernel/locking/rtmutex_common.h |    8 +++---
>  3 files changed, 56 insertions(+), 12 deletions(-)
> 
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -3032,10 +3032,13 @@ static int futex_wait_requeue_pi(u32 __u
>  		 */
>  		WARN_ON(!q.pi_state);
>  		pi_mutex = &q.pi_state->pi_mutex;
> -		ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter);
> -		debug_rt_mutex_free_waiter(&rt_waiter);
> +		ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter);
>  
>  		spin_lock(q.lock_ptr);
> +		if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
> +			ret = 0;
> +
> +		debug_rt_mutex_free_waiter(&rt_waiter);
>  		/*
>  		 * Fixup the pi_state owner and possibly acquire the lock if we
>  		 * haven't already.
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -1753,21 +1753,23 @@ struct task_struct *rt_mutex_next_owner(
>  }
>  
>  /**
> - * rt_mutex_finish_proxy_lock() - Complete lock acquisition
> + * rt_mutex_wait_proxy_lock() - Wait for lock acquisition
>   * @lock:		the rt_mutex we were woken on
>   * @to:			the timeout, null if none. hrtimer should already have
>   *			been started.
>   * @waiter:		the pre-initialized rt_mutex_waiter
>   *
> - * Complete the lock acquisition started our behalf by another thread.
> + * Wait for the lock acquisition started on our behalf by
> + * rt_mutex_start_proxy_lock(). Upon failure, the caller must call
> + * rt_mutex_cleanup_proxy_lock().
>   *
>   * Returns:
>   *  0 - success
>   * <0 - error, one of -EINTR, -ETIMEDOUT
>   *
> - * Special API call for PI-futex requeue support
> + * Special API call for PI-futex support
>   */
> -int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
> +int rt_mutex_wait_proxy_lock(struct rt_mutex *lock,
>  			       struct hrtimer_sleeper *to,
>  			       struct rt_mutex_waiter *waiter)
>  {
> @@ -1780,9 +1782,6 @@ int rt_mutex_finish_proxy_lock(struct rt
>  	/* sleep on the mutex */
>  	ret = __rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE, to, waiter);
>  
> -	if (unlikely(ret))
> -		remove_waiter(lock, waiter);
> -
>  	/*
>  	 * try_to_take_rt_mutex() sets the waiter bit unconditionally. We might
>  	 * have to fix that up.
> @@ -1793,3 +1792,43 @@ int rt_mutex_finish_proxy_lock(struct rt
>  
>  	return ret;
>  }
> +
> +/**
> + * rt_mutex_cleanup_proxy_lock() - Cleanup failed lock acquisition
> + * @lock:		the rt_mutex we were woken on
> + * @waiter:		the pre-initialized rt_mutex_waiter
> + *
> + * Attempt to clean up after a failed rt_mutex_wait_proxy_lock().
> + *
> + * Unless we acquired the lock; we're still enqueued on the wait-list and can
> + * in fact still be granted ownership until we're removed. Therefore we can
> + * find we are in fact the owner and must disregard the
> + * rt_mutex_wait_proxy_lock() failure.
> + *
> + * Returns:
> + *  true  - did the cleanup, we're done.
> + *  false - we acquired the lock after rt_mutex_wait_proxy_lock() returned,
> + *          caller should disregard its return value.
> + *
> + * Special API call for PI-futex support
> + */
> +bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock,
> +				 struct rt_mutex_waiter *waiter)
> +{
> +	bool cleanup = false;
> +
> +	raw_spin_lock_irq(&lock->wait_lock);
> +	/*
> +	 * Unless we're the owner; we're still enqueued on the wait_list.
> +	 * So check if we became owner, if not, take us off the wait_list.
> +	 */
> +	if (rt_mutex_owner(lock) != current) {
> +		remove_waiter(lock, waiter);
> +		fixup_rt_mutex_waiters(lock);
> +		cleanup = true;
> +	}
> +	raw_spin_unlock_irq(&lock->wait_lock);
> +
> +	return cleanup;
> +}
> +
> --- a/kernel/locking/rtmutex_common.h
> +++ b/kernel/locking/rtmutex_common.h
> @@ -107,9 +107,11 @@ extern void rt_mutex_init_waiter(struct
>  extern int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
>  				     struct rt_mutex_waiter *waiter,
>  				     struct task_struct *task);
> -extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
> -				      struct hrtimer_sleeper *to,
> -				      struct rt_mutex_waiter *waiter);
> +extern int rt_mutex_wait_proxy_lock(struct rt_mutex *lock,
> +			       struct hrtimer_sleeper *to,
> +			       struct rt_mutex_waiter *waiter);
> +extern bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock,
> +				 struct rt_mutex_waiter *waiter);
>  
>  extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
>  extern int rt_mutex_futex_trylock(struct rt_mutex *l);
> 
> 
> 
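
A minimal userspace sketch of the caller-side contract described in the
kernel-doc quoted above: a failed wait only stands if the cleanup step
confirms that we did not become owner in the meantime. This is a model, not
kernel code; struct model_lock and the model_* helpers are hypothetical names
invented for illustration, and the wait_lock serialization is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct rt_mutex: tracks only owner and queue length. */
struct model_lock {
	const void *owner;	/* models rt_mutex_owner(lock) */
	int nr_waiters;		/* models the length of the wait_list */
};

/*
 * Models rt_mutex_cleanup_proxy_lock(): unless we became owner, we are
 * still enqueued and must take ourselves off the wait_list.
 */
static bool model_cleanup_proxy_lock(struct model_lock *lock, const void *self)
{
	bool cleanup = false;

	if (lock->owner != self) {
		lock->nr_waiters--;
		cleanup = true;
	}
	return cleanup;
}

/*
 * Caller-side pattern: wait_ret is what the (modelled) wait returned.
 * A failure is disregarded when cleanup reports that we own the lock.
 */
static int model_finish_wait(struct model_lock *lock, const void *self,
			     int wait_ret)
{
	if (wait_ret && !model_cleanup_proxy_lock(lock, self))
		wait_ret = 0;
	return wait_ret;
}
```

With the lock owned by another task, a timeout passed to model_finish_wait()
is returned unchanged and the waiter is dequeued; with the lock granted to us
after the wait failed, the same call returns 0.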

-- 
Darren Hart
VMware Open Source Technology Center

* Re: [PATCH -v6 10/13] futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()
  2017-04-07 23:30   ` [PATCH -v6 10/13] " Darren Hart
@ 2017-04-07 23:35     ` Darren Hart
  0 siblings, 0 replies; 60+ messages in thread
From: Darren Hart @ 2017-04-07 23:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Fri, Apr 07, 2017 at 04:30:59PM -0700, Darren Hart wrote:
> On Wed, Mar 22, 2017 at 11:35:57AM +0100, Peter Zijlstra wrote:
> > With the ultimate goal of keeping rt_mutex wait_list and futex_q
> > waiters consistent we want to split 'rt_mutex_futex_lock()' into finer
> 
> I want to be clear that I understand why this patch is needed - as it actually
> moves both the waiter removal and the rt_waiter freeing under the hb lock while
> you've been working to be less dependent on the hb lock.
> 
> Was inconsistency of the rt_mutex wait_list and the futex_q waiters a problem
> before this patch series, or do the previous patches make this one necessary?

Ah, this is a follow-on to the issue described in patch 07/13. Never mind.

-- 
Darren Hart
VMware Open Source Technology Center

* Re: [PATCH -v6 11/13] futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
  2017-03-22 10:35 ` [PATCH -v6 11/13] futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock() Peter Zijlstra
  2017-03-23 18:24   ` [tip:locking/core] " tip-bot for Peter Zijlstra
@ 2017-04-08  0:55   ` Darren Hart
  2017-04-10 15:51   ` alexander.levin
  2 siblings, 0 replies; 60+ messages in thread
From: Darren Hart @ 2017-04-08  0:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Wed, Mar 22, 2017 at 11:35:58AM +0100, Peter Zijlstra wrote:
> By changing futex_lock_pi() to use rt_mutex_*_proxy_lock() we arrive
> at a point where all wait_list modifications are done under both
> hb->lock and wait_lock.
> 
> This closes the obvious interleave pattern between futex_lock_pi() and
> futex_unlock_pi(), but not entirely so. See below:
> 
> Before:
> 
> futex_lock_pi()			futex_unlock_pi()
>   unlock hb->lock
> 
> 				  lock hb->lock
> 				  unlock hb->lock
> 
> 				  lock rt_mutex->wait_lock
> 				  unlock rt_mutex_wait_lock
> 				    -EAGAIN
> 
>   lock rt_mutex->wait_lock
>   list_add
>   unlock rt_mutex->wait_lock
> 
>   schedule()
> 
>   lock rt_mutex->wait_lock
>   list_del
>   unlock rt_mutex->wait_lock
> 
> 				  <idem>
> 				    -EAGAIN
> 
>   lock hb->lock
> 
> 
> After:
> 
> futex_lock_pi()			futex_unlock_pi()
> 
>   lock hb->lock
>   lock rt_mutex->wait_lock
>   list_add
>   unlock rt_mutex->wait_lock
>   unlock hb->lock
> 
>   schedule()
> 				  lock hb->lock
> 				  unlock hb->lock
>   lock hb->lock
>   lock rt_mutex->wait_lock
>   list_del
>   unlock rt_mutex->wait_lock
> 
> 				  lock rt_mutex->wait_lock
> 				  unlock rt_mutex_wait_lock

Underscore to dereference:	  rt_mutex->wait_lock

> 				    -EAGAIN
> 
>   unlock hb->lock
> 
> 
> It does however solve the earlier starvation/live-lock scenario which
> got introduced with the -EAGAIN since unlike the before scenario;
> where the -EAGAIN happens while futex_unlock_pi() doesn't hold any
> locks; in the after scenario it happens while futex_unlock_pi()

I think you mean futex_lock_pi() here ----------^
And possibly in the previous reference, although both are true.

> actually holds a lock, and then we can serialize on that lock.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/futex.c                  |   70 +++++++++++++++++++++++++++-------------
>  kernel/locking/rtmutex.c        |   13 -------
>  kernel/locking/rtmutex_common.h |    1 
>  3 files changed, 48 insertions(+), 36 deletions(-)
> 
> Index: linux-2.6/kernel/futex.c

...

> @@ -2587,6 +2592,7 @@ static int futex_lock_pi(u32 __user *uad

...

> +no_block:
> +	/*
>  	 * Fixup the pi_state owner and possibly acquire the lock if we
>  	 * haven't already.
>  	 */

Deleted a bunch of commentary about the following comment and the code to follow
(which shows up just below this point). Turns out it isn't wrong... it's just
really complex. This snippet used to be self contained within the first if
block, and now the connection to the comment is less direct. I didn't come up
with a better way to say it though.... so just noting this here in case you or
someone else has a better idea.


        /*
         * If fixup_owner() faulted and was unable to handle the fault, unlock
         * it and return the fault to userspace.
         */
        if (ret && (rt_mutex_owner(&q.pi_state->pi_mutex) == current)) {
                pi_state = q.pi_state;
                get_pi_state(pi_state);
        }

        /* Unqueue and drop the lock */
        unqueue_me_pi(&q);

        if (pi_state) {
                rt_mutex_futex_unlock(&pi_state->pi_mutex);
                put_pi_state(pi_state);
        }

-- 
Darren Hart
VMware Open Source Technology Center

* Re: [PATCH -v6 12/13] futex: futex_unlock_pi() determinism
  2017-03-22 10:35 ` [PATCH -v6 12/13] futex: futex_unlock_pi() determinism Peter Zijlstra
  2017-03-23 18:24   ` [tip:locking/core] futex: Futex_unlock_pi() determinism tip-bot for Peter Zijlstra
@ 2017-04-08  1:27   ` Darren Hart
  1 sibling, 0 replies; 60+ messages in thread
From: Darren Hart @ 2017-04-08  1:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Wed, Mar 22, 2017 at 11:35:59AM +0100, Peter Zijlstra wrote:
> The problem with returning -EAGAIN when the waiter state mismatches is
> that it becomes very hard to proof a bounded execution time on the

prove

> operation. And seeing that this is a RT operation, this is somewhat

an RT

> important.
> 
> While in practise; given the previous patch; it will be very unlikely

Heh, that's not what semicolons are for ;-) Commas here, or a parenthetical.

> to ever really take more than one or two rounds, proving so becomes
> rather hard.
> 
> However, now that modifying wait_list is done while holding both
> hb->lock and wait_lock, we can avoid the scenario entirely if we
> acquire wait_lock while still holding hb->lock. Doing a hand-over,
> without leaving a hole.

Nice :)
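
For readers following along, the hand-over can be sketched in userspace with
two pthread mutexes standing in for hb->lock and wait_lock. This is only an
illustration under that assumption; the names hb_lock, wait_lock,
wait_list_len and the model_* functions are hypothetical and not kernel API.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins for hb->lock and rt_mutex->wait_lock. */
static pthread_mutex_t hb_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER;
static int wait_list_len;	/* models the rt_mutex wait_list */

/* Waiter side: the wait_list is only modified with both locks held. */
static void model_enqueue(void)
{
	pthread_mutex_lock(&hb_lock);
	pthread_mutex_lock(&wait_lock);
	wait_list_len++;
	pthread_mutex_unlock(&wait_lock);
	pthread_mutex_unlock(&hb_lock);
}

/*
 * Unlock side: take wait_lock before dropping hb_lock, so there is no
 * window in which neither lock is held and a waiter could slip in.
 */
static int model_unlock_handover(void)
{
	int seen;

	pthread_mutex_lock(&hb_lock);
	pthread_mutex_lock(&wait_lock);		/* acquired while hb_lock is held */
	pthread_mutex_unlock(&hb_lock);		/* hand-over complete */
	seen = wait_list_len;			/* stable: protected by wait_lock */
	pthread_mutex_unlock(&wait_lock);
	return seen;
}
```

Because the waiter side needs both locks to change the list, the state seen
under wait_lock after the hand-over matches what was seen under hb_lock, which
is what removes the need for the -EAGAIN retry in this model.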

-- 
Darren Hart
VMware Open Source Technology Center

* Re: [PATCH -v6 13/13] futex: futex_lock_pi() vs PREEMPT_RT_FULL
  2017-03-22 10:36 ` [PATCH -v6 13/13] futex: futex_lock_pi() vs PREEMPT_RT_FULL Peter Zijlstra
  2017-03-23 18:25   ` [tip:locking/core] futex: Drop hb->lock before enqueueing on the rtmutex tip-bot for Peter Zijlstra
@ 2017-04-08  2:26   ` Darren Hart
  2017-04-08  5:22     ` Mike Galbraith
                       ` (2 more replies)
  1 sibling, 3 replies; 60+ messages in thread
From: Darren Hart @ 2017-04-08  2:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Wed, Mar 22, 2017 at 11:36:00AM +0100, Peter Zijlstra wrote:
> When PREEMPT_RT_FULL does the spinlock -> rt_mutex substitution the PI
> chain code will (falsely) report a deadlock and BUG.
> 
> The problem is that we hold hb->lock (now an rt_mutex) while doing
> task_blocks_on_rt_mutex on the futex's pi_state::rtmutex. This, when
> interleaved just right with futex_unlock_pi() leads it to believe we
> have an AB-BA deadlock.
> 
>   Task1 (holds rt_mutex,	Task2 (does FUTEX_LOCK_PI)
>          does FUTEX_UNLOCK_PI)
> 
> 				lock hb->lock
> 				lock rt_mutex (as per start_proxy)
>   lock hb->lock
> 
> Which is a trivial AB-BA.
> 
> It is not an actual deadlock, because we won't be holding hb->lock by
> the time we actually block on rt_mutex, but the chainwalk code doesn't
> know that.
> 
> To avoid this problem, do the same thing we do in futex_unlock_pi()
> and drop hb->lock after acquiring wait_lock. This still fully
> serializes against futex_unlock_pi(), since adding to the wait_list
> does the very same lock dance, and removing it holds both locks.
> 
> Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

I have gone through each of these carefully, and while I'm not naive enough to
say "there are no possible locking problems", each of Peter's claims was
supported by my review. I went down a number of paths which concerned me, but
eventually they each proved not to be a problem, and it was impressive to see
the knot of locks loosen and come free in the last few patches. That's a really
nice piece of work, Peter.

I've made several comments on the comment blocks and commit messages to clarify
things where I think they would have saved me time or were inconsistent. I've
only made one code change recommendation iirc, which was the simple type
declaration of a new uval from int to u32.

I would like to see more testing because... well... futexes. But, we don't have
a futex torture suite yet, but that is something I'm hoping to be looking into
in the near future. What testing we do have available has passed between my
futex selftests, the LTP suite, the pi_stress, and the RT runs by Sebastian.

Peter, I presume there will be a v7 with the u32 change and hopefully a couple
text updates?

Thanks,

-- 
Darren Hart
VMware Open Source Technology Center

* Re: [PATCH -v6 13/13] futex: futex_lock_pi() vs PREEMPT_RT_FULL
  2017-04-08  2:26   ` [PATCH -v6 13/13] futex: futex_lock_pi() vs PREEMPT_RT_FULL Darren Hart
@ 2017-04-08  5:22     ` Mike Galbraith
  2017-04-10  8:43     ` Sebastian Andrzej Siewior
  2017-04-10  9:08     ` Peter Zijlstra
  2 siblings, 0 replies; 60+ messages in thread
From: Mike Galbraith @ 2017-04-08  5:22 UTC (permalink / raw)
  To: Darren Hart, Peter Zijlstra
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Fri, 2017-04-07 at 19:26 -0700, Darren Hart wrote:

> I would like to see more testing because... well... futexes. But, we don't have
> a futex torture suite yet, but that is something I'm hoping to be looking into
> in the near future. What testing we do have available has passed between my
> futex selftests, the LTP suite, the pi_stress, and the RT runs by Sebastian.

Ditto, tip-rt shows no signs of trouble.

* Re: [PATCH -v6 13/13] futex: futex_lock_pi() vs PREEMPT_RT_FULL
  2017-04-08  2:26   ` [PATCH -v6 13/13] futex: futex_lock_pi() vs PREEMPT_RT_FULL Darren Hart
  2017-04-08  5:22     ` Mike Galbraith
@ 2017-04-10  8:43     ` Sebastian Andrzej Siewior
  2017-04-10  9:08     ` Peter Zijlstra
  2 siblings, 0 replies; 60+ messages in thread
From: Sebastian Andrzej Siewior @ 2017-04-10  8:43 UTC (permalink / raw)
  To: Darren Hart
  Cc: Peter Zijlstra, tglx, mingo, juri.lelli, rostedt, xlpang,
	linux-kernel, mathieu.desnoyers, jdesfossez, bristot

On 2017-04-07 19:26:10 [-0700], Darren Hart wrote:
> I would like to see more testing because... well... futexes. But, we don't have
> a futex torture suite yet, but that is something I'm hoping to be looking into
> in the near future. What testing we do have available has passed between my
> futex selftests, the LTP suite, the pi_stress, and the RT runs by Sebastian.

It is also part of the latest v4.9-rt release, which exposes it to a
greater audience.

> Thanks,

Sebastian

* Re: [PATCH -v6 13/13] futex: futex_lock_pi() vs PREEMPT_RT_FULL
  2017-04-08  2:26   ` [PATCH -v6 13/13] futex: futex_lock_pi() vs PREEMPT_RT_FULL Darren Hart
  2017-04-08  5:22     ` Mike Galbraith
  2017-04-10  8:43     ` Sebastian Andrzej Siewior
@ 2017-04-10  9:08     ` Peter Zijlstra
  2017-04-10 16:05       ` Darren Hart
  2 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2017-04-10  9:08 UTC (permalink / raw)
  To: Darren Hart
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Fri, Apr 07, 2017 at 07:26:10PM -0700, Darren Hart wrote:
> Peter, I presume there will be a v7 with the u32 change and hopefully a couple
> text updates?

Well, tglx already committed these here patches, so no -v7. What I can
do however is do a follow up patch that fixes some of the in-code things
you mentioned.

Something like the below is what I had lying about from your earlier
emails; I've not looked to see if there's anything else from your later
emails.

---
Subject: futex: Small misc fixes..
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri Apr 7 09:04:07 CEST 2017

Feedback from Darren's review.

Reported-by: Darren Hart (VMWare) <dvhart@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c |   11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1025,7 +1025,8 @@ static int attach_to_pi_state(u32 __user
 			      struct futex_pi_state **ps)
 {
 	pid_t pid = uval & FUTEX_TID_MASK;
-	int ret, uval2;
+	u32 uval2;
+	int ret;
 
 	/*
 	 * Userspace might have messed up non-PI and PI futexes [3]
@@ -1441,6 +1442,11 @@ static int wake_futex_pi(u32 __user *uad
 	if (ret)
 		goto out_unlock;
 
+	/*
+	 * This is a point of no return; once we modify the uval there is no
+	 * going back and subsequent operations must not fail.
+	 */
+
 	raw_spin_lock(&pi_state->owner->pi_lock);
 	WARN_ON(list_empty(&pi_state->list));
 	list_del_init(&pi_state->list);
@@ -1452,9 +1458,6 @@ static int wake_futex_pi(u32 __user *uad
 	pi_state->owner = new_owner;
 	raw_spin_unlock(&new_owner->pi_lock);
 
-	/*
-	 * We've updated the uservalue, this unlock cannot fail.
-	 */
 	postunlock = __rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
 
 out_unlock:

* Re: [PATCH -v6 11/13] futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
  2017-03-22 10:35 ` [PATCH -v6 11/13] futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock() Peter Zijlstra
  2017-03-23 18:24   ` [tip:locking/core] " tip-bot for Peter Zijlstra
  2017-04-08  0:55   ` [PATCH -v6 11/13] " Darren Hart
@ 2017-04-10 15:51   ` alexander.levin
  2017-04-10 16:03     ` Thomas Gleixner
  2 siblings, 1 reply; 60+ messages in thread
From: alexander.levin @ 2017-04-10 15:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart

On Wed, Mar 22, 2017 at 11:35:58AM +0100, Peter Zijlstra wrote:
> By changing futex_lock_pi() to use rt_mutex_*_proxy_lock() we arrive
> at a point where all wait_list modifications are done under both
> hb->lock and wait_lock.
[...]

Hey Peter,

I'm seeing the following, which seems to be related to this patch:

[   21.762875] ODEBUG: free active (active state 0) object type: hrtimer hint: hrtimer_wakeup (kernel/time/hrtimer.c:1423)
[   21.771034] ------------[ cut here ]------------
[   21.771657] WARNING: CPU: 6 PID: 1974 at lib/debugobjects.c:289 debug_print_object (lib/debugobjects.c:286)
[   21.772872] Modules linked in:
[   21.773323] CPU: 6 PID: 1974 Comm: trinity-c92 Not tainted 4.11.0-rc5-next-20170407-dirty #21
[   21.774534] task: ffff880389063e40 task.stack: ffff880389158000
[   21.775383] RIP: 0010:debug_print_object (??:?)
[   21.776081] RSP: 0018:ffff88038915f108 EFLAGS: 00010086
[   21.776815] RAX: 0000000000000057 RBX: 0000000000000003 RCX: 0000000000000000
[   21.777791] RDX: 0000000000000057 RSI: 1ffff1007122bdc0 RDI: ffffed007122be17
[   21.778773] RBP: ffff88038915f130 R08: 203a47554245444f R09: 7463612065657266
[   21.779756] R10: 0000000000000000 R11: 000000000000147e R12: ffffffff834598e0
[   21.780741] R13: ffffffff8127c150 R14: 0000000000000000 R15: ffffffff8410e588
[   21.782945] FS:  00007fd5e261f700(0000) GS:ffff88039cb80000(0000) knlGS:0000000000000000
[   21.783942] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   21.784638] CR2: 00007fd5e21f2520 CR3: 000000038d4c9000 CR4: 00000000000406a0
[   21.785718] DR0: 00007fd5e00df000 DR1: 0000000000000000 DR2: 0000000000000000
[   21.786675] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
[   21.787684] Call Trace:
[   21.788050] debug_object_free (lib/debugobjects.c:603)
[   21.791105] destroy_hrtimer_on_stack (kernel/time/hrtimer.c:427)
[   21.791746] futex_lock_pi (kernel/futex.c:2740)
[   21.800721] do_futex (kernel/futex.c:3399)
[   21.818395] SyS_futex (kernel/futex.c:3447 kernel/futex.c:3415)
[   21.822260] do_syscall_64 (arch/x86/entry/common.c:284)
[   21.827328] entry_SYSCALL64_slow_path (arch/x86/entry/entry_64.S:249)
[   21.827960] RIP: 0033:0x7fd5e1f2e8e9
[   21.828455] RSP: 002b:00007ffcfe586f48 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[   21.829931] RAX: ffffffffffffffda RBX: 00000000000000ca RCX: 00007fd5e1f2e8e9
[   21.830875] RDX: 0000000000004000 RSI: 0000000000000086 RDI: 00007fd5e00df000
[   21.831783] RBP: 00007fd5e0a12000 R08: 00007fd5e2426000 R09: 008008000100a000
[   21.832775] R10: 00007fd5e00e1000 R11: 0000000000000246 R12: 0000000000000002
[   21.833705] R13: 00007fd5e0a12048 R14: 00007fd5e261f698 R15: 00007fd5e0a12000
[   21.834644] Code: 0d 48 89 75 d8 e8 30 01 8b ff 48 8b 75 d8 48 8b 14 dd 40 8f 51 83 4d 89 e9 4d 89 e0 44 89 f1 48 c7 c7 e0 85 51 83 e8 e3 29 75 ff <0f> ff 83 05 4a 1e 16 02 01 48 83 c4 08 5b 41 5c 41 5d 41 5e 5d
All code
========
   0:   0d 48 89 75 d8          or     $0xd8758948,%eax
   5:   e8 30 01 8b ff          callq  0xffffffffff8b013a
   a:   48 8b 75 d8             mov    -0x28(%rbp),%rsi
   e:   48 8b 14 dd 40 8f 51    mov    -0x7cae70c0(,%rbx,8),%rdx
  15:   83
  16:   4d 89 e9                mov    %r13,%r9
  19:   4d 89 e0                mov    %r12,%r8
  1c:   44 89 f1                mov    %r14d,%ecx
  1f:   48 c7 c7 e0 85 51 83    mov    $0xffffffff835185e0,%rdi
  26:   e8 e3 29 75 ff          callq  0xffffffffff752a0e
  2b:*  0f ff                   (bad)           <-- trapping instruction
  2d:   83 05 4a 1e 16 02 01    addl   $0x1,0x2161e4a(%rip)        # 0x2161e7e
  34:   48 83 c4 08             add    $0x8,%rsp
  38:   5b                      pop    %rbx
  39:   41 5c                   pop    %r12
  3b:   41 5d                   pop    %r13
  3d:   41 5e                   pop    %r14
  3f:   5d                      pop    %rbp
        ...

Code starting with the faulting instruction
===========================================
   0:   0f ff                   (bad)
   2:   83 05 4a 1e 16 02 01    addl   $0x1,0x2161e4a(%rip)        # 0x2161e53
   9:   48 83 c4 08             add    $0x8,%rsp
   d:   5b                      pop    %rbx
   e:   41 5c                   pop    %r12
  10:   41 5d                   pop    %r13
  12:   41 5e                   pop    %r14
  14:   5d                      pop    %rbp
        ...
[   21.837142] ---[ end trace 9e2690a9beaffa07 ]---
-- 

Thanks,
Sasha

* Re: [PATCH -v6 11/13] futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
  2017-04-10 15:51   ` alexander.levin
@ 2017-04-10 16:03     ` Thomas Gleixner
  2017-04-14  9:30       ` [tip:locking/core] futex: Avoid freeing an active timer tip-bot for Thomas Gleixner
  0 siblings, 1 reply; 60+ messages in thread
From: Thomas Gleixner @ 2017-04-10 16:03 UTC (permalink / raw)
  To: alexander.levin
  Cc: Peter Zijlstra, mingo, juri.lelli, rostedt, xlpang, bigeasy,
	linux-kernel, mathieu.desnoyers, jdesfossez, bristot, dvhart

On Mon, 10 Apr 2017, alexander.levin@verizon.com wrote:
> On Wed, Mar 22, 2017 at 11:35:58AM +0100, Peter Zijlstra wrote:
> > By changing futex_lock_pi() to use rt_mutex_*_proxy_lock() we arrive
> > at a point where all wait_list modifications are done under both
> > hb->lock and wait_lock.
> [...]
> 
> Hey Peter,
> 
> I'm seeing the following, which seems to be related to this patch:
> 
> [   21.762875] ODEBUG: free active (active state 0) object type: hrtimer hint: hrtimer_wakeup (kernel/time/hrtimer.c:1423)

> [   21.788050] debug_object_free (lib/debugobjects.c:603)
> [   21.791105] destroy_hrtimer_on_stack (kernel/time/hrtimer.c:427)
> [   21.791746] futex_lock_pi (kernel/futex.c:2740)
> [   21.800721] do_futex (kernel/futex.c:3399)
> [   21.818395] SyS_futex (kernel/futex.c:3447 kernel/futex.c:3415)
> [   21.822260] do_syscall_64 (arch/x86/entry/common.c:284)
> [   21.827328] entry_SYSCALL64_slow_path (arch/x86/entry/entry_64.S:249)

Yep, that rework dropped the hrtimer cancel. Fix below.

Thanks,

	tglx

8<------------------------

diff --git a/kernel/futex.c b/kernel/futex.c
index c3eebcdac206..7ac167683c9f 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2736,8 +2736,10 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
 out_put_key:
 	put_futex_key(&q.key);
 out:
-	if (to)
+	if (to) {
+		hrtimer_cancel(&to->timer);
 		destroy_hrtimer_on_stack(&to->timer);
+	}
 	return ret != -EINTR ? ret : -ERESTARTNOINTR;
 
 uaddr_faulted:

* Re: [PATCH -v6 13/13] futex: futex_lock_pi() vs PREEMPT_RT_FULL
  2017-04-10  9:08     ` Peter Zijlstra
@ 2017-04-10 16:05       ` Darren Hart
  0 siblings, 0 replies; 60+ messages in thread
From: Darren Hart @ 2017-04-10 16:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Mon, Apr 10, 2017 at 11:08:18AM +0200, Peter Zijlstra wrote:
> On Fri, Apr 07, 2017 at 07:26:10PM -0700, Darren Hart wrote:
> > Peter, I presume there will be a v7 with the u32 change and hopefully a couple
> > text updates?
> 
> Well, tglx already committed these here patches, so no -v7. What I can
> do however is do a follow up patch that fixes some of the in-code things
> you mentioned.
> 
> Something like the below is what I had lying about from your earlier
> emails; I've not looked to see if there's anything else from your later
> emails.
> 
> ---
> Subject: futex: Small misc fixes..
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri Apr 7 09:04:07 CEST 2017
> 
> Feedback from Darren's review.
> 
> Reported-by: Darren Hart (VMWare) <dvhart@infradead.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>


OK, this addresses 2 of the big ones as I recall. I'll go through my feedback
and if I think anything else needs updating, I'll prepare a few patches.

Reviewed-by: Darren Hart (VMware) <dvhart@infradead.org>

Thanks Peter!

-- 
Darren Hart
VMware Open Source Technology Center

* [tip:locking/core] futex: Avoid freeing an active timer
  2017-04-10 16:03     ` Thomas Gleixner
@ 2017-04-14  9:30       ` tip-bot for Thomas Gleixner
  0 siblings, 0 replies; 60+ messages in thread
From: tip-bot for Thomas Gleixner @ 2017-04-14  9:30 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, alexander.levin, torvalds, tglx, mingo, linux-kernel, hpa

Commit-ID:  97181f9bd57405b879403763284537e27d46963d
Gitweb:     http://git.kernel.org/tip/97181f9bd57405b879403763284537e27d46963d
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 10 Apr 2017 18:03:36 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 14 Apr 2017 10:29:53 +0200

futex: Avoid freeing an active timer

Alexander reported a hrtimer debug_object splat:

  ODEBUG: free active (active state 0) object type: hrtimer hint: hrtimer_wakeup (kernel/time/hrtimer.c:1423)

  debug_object_free (lib/debugobjects.c:603)
  destroy_hrtimer_on_stack (kernel/time/hrtimer.c:427)
  futex_lock_pi (kernel/futex.c:2740)
  do_futex (kernel/futex.c:3399)
  SyS_futex (kernel/futex.c:3447 kernel/futex.c:3415)
  do_syscall_64 (arch/x86/entry/common.c:284)
  entry_SYSCALL64_slow_path (arch/x86/entry/entry_64.S:249)

Which was caused by commit:

  cfafcd117da0 ("futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()")

... losing the hrtimer_cancel() in the shuffle. Where previously the
hrtimer_cancel() was done by rt_mutex_slowlock() we now need to do it
manually.

Reported-by: Alexander Levin <alexander.levin@verizon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: cfafcd117da0 ("futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()")
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1704101802370.2906@nanos
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/futex.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index c3eebcd..7ac1676 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2736,8 +2736,10 @@ out_unlock_put_key:
 out_put_key:
 	put_futex_key(&q.key);
 out:
-	if (to)
+	if (to) {
+		hrtimer_cancel(&to->timer);
 		destroy_hrtimer_on_stack(&to->timer);
+	}
 	return ret != -EINTR ? ret : -ERESTARTNOINTR;
 
 uaddr_faulted:
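
The ordering that the fix restores (cancel first, then destroy) can be
modelled in plain C. This is a hedged sketch only: struct model_timer and the
model_* functions are hypothetical stand-ins, and the -EBUSY return merely
mimics the debugobjects complaint about freeing an active object; the real
destroy_hrtimer_on_stack() returns void.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an on-stack hrtimer_sleeper's timer. */
struct model_timer {
	bool active;		/* armed and possibly still queued */
	bool destroyed;		/* storage has been released */
};

/* Models hrtimer_cancel(): after this the timer is no longer active. */
static void model_timer_cancel(struct model_timer *t)
{
	t->active = false;
}

/* Models the debugobjects check: destroying an active timer is an error. */
static int model_timer_destroy(struct model_timer *t)
{
	if (t->active)
		return -EBUSY;
	t->destroyed = true;
	return 0;
}

/* The exit-path pattern from the fix: cancel, then destroy. */
static int model_teardown(struct model_timer *to)
{
	if (!to)
		return 0;
	model_timer_cancel(to);
	return model_timer_destroy(to);
}
```

Calling model_timer_destroy() directly on an armed timer fails in this model,
while model_teardown() succeeds because the cancel happens first.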

end of thread, other threads:[~2017-04-14  9:34 UTC | newest]

Thread overview: 60+ messages
2017-03-22 10:35 [PATCH -v6 00/13] The arduous story of FUTEX_UNLOCK_PI Peter Zijlstra
2017-03-22 10:35 ` [PATCH -v6 01/13] futex: Cleanup variable names for futex_top_waiter() Peter Zijlstra
2017-03-23 18:19   ` [tip:locking/core] " tip-bot for Peter Zijlstra
2017-03-24 21:11   ` [PATCH -v6 01/13] " Darren Hart
2017-03-22 10:35 ` [PATCH -v6 02/13] futex: Use smp_store_release() in mark_wake_futex() Peter Zijlstra
2017-03-23 18:19   ` [tip:locking/core] " tip-bot for Peter Zijlstra
2017-03-24 21:16   ` [PATCH -v6 02/13] " Darren Hart
2017-03-22 10:35 ` [PATCH -v6 03/13] futex: Remove rt_mutex_deadlock_account_*() Peter Zijlstra
2017-03-23 18:20   ` [tip:locking/core] " tip-bot for Peter Zijlstra
2017-03-24 21:29   ` [PATCH -v6 03/13] " Darren Hart
2017-03-24 21:31     ` Darren Hart
2017-03-22 10:35 ` [PATCH -v6 04/13] futex,rt_mutex: Provide futex specific rt_mutex API Peter Zijlstra
2017-03-23 18:20   ` [tip:locking/core] " tip-bot for Peter Zijlstra
2017-03-25  0:37   ` [PATCH -v6 04/13] " Darren Hart
2017-04-06 12:15     ` Peter Zijlstra
2017-04-06 17:02       ` Darren Hart
2017-04-05 15:02   ` Darren Hart
2017-04-06 12:17     ` Peter Zijlstra
2017-04-06 17:08       ` Darren Hart
2017-03-22 10:35 ` [PATCH -v6 05/13] futex: Change locking rules Peter Zijlstra
2017-03-23 18:21   ` [tip:locking/core] " tip-bot for Peter Zijlstra
2017-04-05 21:18   ` [PATCH -v6 05/13] " Darren Hart
2017-04-06 12:28     ` Peter Zijlstra
2017-04-06 15:58       ` Joe Perches
2017-04-06 17:21       ` Darren Hart
2017-03-22 10:35 ` [PATCH -v6 06/13] futex: Cleanup refcounting Peter Zijlstra
2017-03-23 18:21   ` [tip:locking/core] " tip-bot for Peter Zijlstra
2017-04-05 21:29   ` [PATCH -v6 06/13] " Darren Hart
2017-03-22 10:35 ` [PATCH -v6 07/13] futex: Rework inconsistent rt_mutex/futex_q state Peter Zijlstra
2017-03-23 18:22   ` [tip:locking/core] " tip-bot for Peter Zijlstra
2017-04-05 21:58   ` [PATCH -v6 07/13] " Darren Hart
2017-03-22 10:35 ` [PATCH -v6 08/13] futex: Pull rt_mutex_futex_unlock() out from under hb->lock Peter Zijlstra
2017-03-23 18:22   ` [tip:locking/core] " tip-bot for Peter Zijlstra
2017-04-05 23:52   ` [PATCH -v6 08/13] " Darren Hart
2017-04-06 12:42     ` Peter Zijlstra
2017-04-06 17:42       ` Darren Hart
2017-03-22 10:35 ` [PATCH -v6 09/13] futex,rt_mutex: Introduce rt_mutex_init_waiter() Peter Zijlstra
2017-03-23 18:23   ` [tip:locking/core] " tip-bot for Peter Zijlstra
2017-04-05 23:57   ` [PATCH -v6 09/13] " Darren Hart
2017-03-22 10:35 ` [PATCH -v6 10/13] futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock() Peter Zijlstra
2017-03-23 18:23   ` [tip:locking/core] " tip-bot for Peter Zijlstra
2017-04-07 23:30   ` [PATCH -v6 10/13] " Darren Hart
2017-04-07 23:35     ` Darren Hart
2017-03-22 10:35 ` [PATCH -v6 11/13] futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock() Peter Zijlstra
2017-03-23 18:24   ` [tip:locking/core] " tip-bot for Peter Zijlstra
2017-04-08  0:55   ` [PATCH -v6 11/13] " Darren Hart
2017-04-10 15:51   ` alexander.levin
2017-04-10 16:03     ` Thomas Gleixner
2017-04-14  9:30       ` [tip:locking/core] futex: Avoid freeing an active timer tip-bot for Thomas Gleixner
2017-03-22 10:35 ` [PATCH -v6 12/13] futex: futex_unlock_pi() determinism Peter Zijlstra
2017-03-23 18:24   ` [tip:locking/core] futex: Futex_unlock_pi() determinism tip-bot for Peter Zijlstra
2017-04-08  1:27   ` [PATCH -v6 12/13] futex: futex_unlock_pi() determinism Darren Hart
2017-03-22 10:36 ` [PATCH -v6 13/13] futex: futex_lock_pi() vs PREEMPT_RT_FULL Peter Zijlstra
2017-03-23 18:25   ` [tip:locking/core] futex: Drop hb->lock before enqueueing on the rtmutex tip-bot for Peter Zijlstra
2017-04-08  2:26   ` [PATCH -v6 13/13] futex: futex_lock_pi() vs PREEMPT_RT_FULL Darren Hart
2017-04-08  5:22     ` Mike Galbraith
2017-04-10  8:43     ` Sebastian Andrzej Siewior
2017-04-10  9:08     ` Peter Zijlstra
2017-04-10 16:05       ` Darren Hart
2017-03-24  1:45 ` [PATCH -v6 00/13] The arduous story of FUTEX_UNLOCK_PI Darren Hart
