* [RFC][PATCH 0/4] FUTEX_UNLOCK_PI wobbles
@ 2016-10-03  9:12 Peter Zijlstra
  2016-10-03  9:12 ` [RFC][PATCH 1/4] futex: Cleanup variable names for futex_top_waiter() Peter Zijlstra
                   ` (4 more replies)
  0 siblings, 5 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-03  9:12 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz


Hi,

During my last PI patch set it became obvious there are a number of
related failures in FUTEX_UNLOCK_PI that need sorting out before we can
move on with that stuff.

These here patches are the result of staring at that code for a wee bit.

Please have a very _very_ careful look at the last patch; it appears not
to explode when running:

  - perf bench futex lock-pi
  - selftests/futex

but given the immense amount of trickiness involved with both PI and
futexes, there is bound to be something I've overlooked.

Do people have more/better futex-pi test cases?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC][PATCH 1/4] futex: Cleanup variable names for futex_top_waiter()
  2016-10-03  9:12 [RFC][PATCH 0/4] FUTEX_UNLOCK_PI wobbles Peter Zijlstra
@ 2016-10-03  9:12 ` Peter Zijlstra
  2016-10-03 14:15   ` Steven Rostedt
  2016-10-05  3:58   ` Davidlohr Bueso
  2016-10-03  9:12 ` [RFC][PATCH 2/4] futex: Use smp_store_release() in mark_wake_futex() Peter Zijlstra
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-03  9:12 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz

[-- Attachment #1: peterz-futex-cleanup-top_waiter.patch --]
[-- Type: text/plain, Size: 3377 bytes --]

futex_top_waiter() returns the top-waiter on the pi_mutex. Assigning
this to a variable 'match' totally obscures the code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c |   30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1120,14 +1120,14 @@ static int attach_to_pi_owner(u32 uval,
 static int lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
 			   union futex_key *key, struct futex_pi_state **ps)
 {
-	struct futex_q *match = futex_top_waiter(hb, key);
+	struct futex_q *top_waiter = futex_top_waiter(hb, key);
 
 	/*
 	 * If there is a waiter on that futex, validate it and
 	 * attach to the pi_state when the validation succeeds.
 	 */
-	if (match)
-		return attach_to_pi_state(uval, match->pi_state, ps);
+	if (top_waiter)
+		return attach_to_pi_state(uval, top_waiter->pi_state, ps);
 
 	/*
 	 * We are the first waiter - try to look up the owner based on
@@ -1174,7 +1174,7 @@ static int futex_lock_pi_atomic(u32 __us
 				struct task_struct *task, int set_waiters)
 {
 	u32 uval, newval, vpid = task_pid_vnr(task);
-	struct futex_q *match;
+	struct futex_q *top_waiter;
 	int ret;
 
 	/*
@@ -1200,9 +1200,9 @@ static int futex_lock_pi_atomic(u32 __us
 	 * Lookup existing state first. If it exists, try to attach to
 	 * its pi_state.
 	 */
-	match = futex_top_waiter(hb, key);
-	if (match)
-		return attach_to_pi_state(uval, match->pi_state, ps);
+	top_waiter = futex_top_waiter(hb, key);
+	if (top_waiter)
+		return attach_to_pi_state(uval, top_waiter->pi_state, ps);
 
 	/*
 	 * No waiter and user TID is 0. We are here because the
@@ -1292,11 +1292,11 @@ static void mark_wake_futex(struct wake_
 	q->lock_ptr = NULL;
 }
 
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
+static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
 			 struct futex_hash_bucket *hb)
 {
 	struct task_struct *new_owner;
-	struct futex_pi_state *pi_state = this->pi_state;
+	struct futex_pi_state *pi_state = top_waiter->pi_state;
 	u32 uninitialized_var(curval), newval;
 	WAKE_Q(wake_q);
 	bool deboost;
@@ -1317,11 +1317,11 @@ static int wake_futex_pi(u32 __user *uad
 
 	/*
 	 * It is possible that the next waiter (the one that brought
-	 * this owner to the kernel) timed out and is no longer
+	 * top_waiter owner to the kernel) timed out and is no longer
 	 * waiting on the lock.
 	 */
 	if (!new_owner)
-		new_owner = this->task;
+		new_owner = top_waiter->task;
 
 	/*
 	 * We pass it to the next owner. The WAITERS bit is always
@@ -2631,7 +2631,7 @@ static int futex_unlock_pi(u32 __user *u
 	u32 uninitialized_var(curval), uval, vpid = task_pid_vnr(current);
 	union futex_key key = FUTEX_KEY_INIT;
 	struct futex_hash_bucket *hb;
-	struct futex_q *match;
+	struct futex_q *top_waiter;
 	int ret;
 
 retry:
@@ -2655,9 +2655,9 @@ static int futex_unlock_pi(u32 __user *u
 	 * all and we at least want to know if user space fiddled
 	 * with the futex value instead of blindly unlocking.
 	 */
-	match = futex_top_waiter(hb, &key);
-	if (match) {
-		ret = wake_futex_pi(uaddr, uval, match, hb);
+	top_waiter = futex_top_waiter(hb, &key);
+	if (top_waiter) {
+		ret = wake_futex_pi(uaddr, uval, top_waiter, hb);
 		/*
 		 * In case of success wake_futex_pi dropped the hash
 		 * bucket lock.


* [RFC][PATCH 2/4] futex: Use smp_store_release() in mark_wake_futex()
  2016-10-03  9:12 [RFC][PATCH 0/4] FUTEX_UNLOCK_PI wobbles Peter Zijlstra
  2016-10-03  9:12 ` [RFC][PATCH 1/4] futex: Cleanup variable names for futex_top_waiter() Peter Zijlstra
@ 2016-10-03  9:12 ` Peter Zijlstra
  2016-10-03 14:19   ` Steven Rostedt
  2016-10-05  3:57   ` Davidlohr Bueso
  2016-10-03  9:12 ` [RFC][PATCH 3/4] futex: Remove rt_mutex_deadlock_account_*() Peter Zijlstra
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-03  9:12 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz

[-- Attachment #1: peterz-futex-mark_wake_futex.patch --]
[-- Type: text/plain, Size: 697 bytes --]

Since the futex_q can disappear the instruction after assigning NULL,
this really should be a RELEASE barrier. That stops loads from hitting
dead memory too.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1288,8 +1288,7 @@ static void mark_wake_futex(struct wake_
 	 * memory barrier is required here to prevent the following
 	 * store to lock_ptr from getting ahead of the plist_del.
 	 */
-	smp_wmb();
-	q->lock_ptr = NULL;
+	smp_store_release(&q->lock_ptr, NULL);
 }
 
 static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,


* [RFC][PATCH 3/4] futex: Remove rt_mutex_deadlock_account_*()
  2016-10-03  9:12 [RFC][PATCH 0/4] FUTEX_UNLOCK_PI wobbles Peter Zijlstra
  2016-10-03  9:12 ` [RFC][PATCH 1/4] futex: Cleanup variable names for futex_top_waiter() Peter Zijlstra
  2016-10-03  9:12 ` [RFC][PATCH 2/4] futex: Use smp_store_release() in mark_wake_futex() Peter Zijlstra
@ 2016-10-03  9:12 ` Peter Zijlstra
  2016-10-03  9:34   ` Peter Zijlstra
                     ` (3 more replies)
  2016-10-03  9:12 ` [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI Peter Zijlstra
  2016-10-05  1:02 ` [RFC][PATCH 0/4] FUTEX_UNLOCK_PI wobbles Davidlohr Bueso
  4 siblings, 4 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-03  9:12 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz

[-- Attachment #1: peterz-futex-cleanup-deadlock.patch --]
[-- Type: text/plain, Size: 4943 bytes --]

These are unused and clutter up the code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/locking/rtmutex-debug.c |    9 -------
 kernel/locking/rtmutex-debug.h |    3 --
 kernel/locking/rtmutex.c       |   47 +++++++++++++++--------------------------
 kernel/locking/rtmutex.h       |    2 -
 4 files changed, 18 insertions(+), 43 deletions(-)

--- a/kernel/locking/rtmutex-debug.c
+++ b/kernel/locking/rtmutex-debug.c
@@ -173,12 +173,3 @@ void debug_rt_mutex_init(struct rt_mutex
 	lock->name = name;
 }
 
-void
-rt_mutex_deadlock_account_lock(struct rt_mutex *lock, struct task_struct *task)
-{
-}
-
-void rt_mutex_deadlock_account_unlock(struct task_struct *task)
-{
-}
-
--- a/kernel/locking/rtmutex-debug.h
+++ b/kernel/locking/rtmutex-debug.h
@@ -9,9 +9,6 @@
  * This file contains macros used solely by rtmutex.c. Debug version.
  */
 
-extern void
-rt_mutex_deadlock_account_lock(struct rt_mutex *lock, struct task_struct *task);
-extern void rt_mutex_deadlock_account_unlock(struct task_struct *task);
 extern void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter);
 extern void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter);
 extern void debug_rt_mutex_init(struct rt_mutex *lock, const char *name);
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -872,8 +872,6 @@ static int try_to_take_rt_mutex(struct r
 	 */
 	rt_mutex_set_owner(lock, task);
 
-	rt_mutex_deadlock_account_lock(lock, task);
-
 	return 1;
 }
 
@@ -1276,8 +1274,6 @@ static bool __sched rt_mutex_slowunlock(
 
 	debug_rt_mutex_unlock(lock);
 
-	rt_mutex_deadlock_account_unlock(current);
-
 	/*
 	 * We must be careful here if the fast path is enabled. If we
 	 * have no waiters queued we cannot set owner to NULL here
@@ -1343,11 +1339,10 @@ rt_mutex_fastlock(struct rt_mutex *lock,
 				struct hrtimer_sleeper *timeout,
 				enum rtmutex_chainwalk chwalk))
 {
-	if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) {
-		rt_mutex_deadlock_account_lock(lock, current);
+	if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current)))
 		return 0;
-	} else
-		return slowfn(lock, state, NULL, RT_MUTEX_MIN_CHAINWALK);
+
+	return slowfn(lock, state, NULL, RT_MUTEX_MIN_CHAINWALK);
 }
 
 static inline int
@@ -1359,21 +1354,19 @@ rt_mutex_timed_fastlock(struct rt_mutex
 				      enum rtmutex_chainwalk chwalk))
 {
 	if (chwalk == RT_MUTEX_MIN_CHAINWALK &&
-	    likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) {
-		rt_mutex_deadlock_account_lock(lock, current);
+	    likely(rt_mutex_cmpxchg_acquire(lock, NULL, current)))
 		return 0;
-	} else
-		return slowfn(lock, state, timeout, chwalk);
+
+	return slowfn(lock, state, timeout, chwalk);
 }
 
 static inline int
 rt_mutex_fasttrylock(struct rt_mutex *lock,
 		     int (*slowfn)(struct rt_mutex *lock))
 {
-	if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) {
-		rt_mutex_deadlock_account_lock(lock, current);
+	if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current)))
 		return 1;
-	}
+
 	return slowfn(lock);
 }
 
@@ -1383,19 +1376,18 @@ rt_mutex_fastunlock(struct rt_mutex *loc
 				   struct wake_q_head *wqh))
 {
 	WAKE_Q(wake_q);
+	bool deboost;
 
-	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL))) {
-		rt_mutex_deadlock_account_unlock(current);
+	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL)))
+		return;
 
-	} else {
-		bool deboost = slowfn(lock, &wake_q);
+	deboost = slowfn(lock, &wake_q);
 
-		wake_up_q(&wake_q);
+	wake_up_q(&wake_q);
 
-		/* Undo pi boosting if necessary: */
-		if (deboost)
-			rt_mutex_adjust_prio(current);
-	}
+	/* Undo pi boosting if necessary: */
+	if (deboost)
+		rt_mutex_adjust_prio(current);
 }
 
 /**
@@ -1506,10 +1498,9 @@ EXPORT_SYMBOL_GPL(rt_mutex_unlock);
 bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock,
 				   struct wake_q_head *wqh)
 {
-	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL))) {
-		rt_mutex_deadlock_account_unlock(current);
+	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL)))
 		return false;
-	}
+
 	return rt_mutex_slowunlock(lock, wqh);
 }
 
@@ -1567,7 +1558,6 @@ void rt_mutex_init_proxy_locked(struct r
 	__rt_mutex_init(lock, NULL);
 	debug_rt_mutex_proxy_lock(lock, proxy_owner);
 	rt_mutex_set_owner(lock, proxy_owner);
-	rt_mutex_deadlock_account_lock(lock, proxy_owner);
 }
 
 /**
@@ -1583,7 +1573,6 @@ void rt_mutex_proxy_unlock(struct rt_mut
 {
 	debug_rt_mutex_proxy_unlock(lock);
 	rt_mutex_set_owner(lock, NULL);
-	rt_mutex_deadlock_account_unlock(proxy_owner);
 }
 
 /**
--- a/kernel/locking/rtmutex.h
+++ b/kernel/locking/rtmutex.h
@@ -11,8 +11,6 @@
  */
 
 #define rt_mutex_deadlock_check(l)			(0)
-#define rt_mutex_deadlock_account_lock(m, t)		do { } while (0)
-#define rt_mutex_deadlock_account_unlock(l)		do { } while (0)
 #define debug_rt_mutex_init_waiter(w)			do { } while (0)
 #define debug_rt_mutex_free_waiter(w)			do { } while (0)
 #define debug_rt_mutex_lock(l)				do { } while (0)


* [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-10-03  9:12 [RFC][PATCH 0/4] FUTEX_UNLOCK_PI wobbles Peter Zijlstra
                   ` (2 preceding siblings ...)
  2016-10-03  9:12 ` [RFC][PATCH 3/4] futex: Remove rt_mutex_deadlock_account_*() Peter Zijlstra
@ 2016-10-03  9:12 ` Peter Zijlstra
  2016-10-03 15:36   ` Steven Rostedt
                     ` (3 more replies)
  2016-10-05  1:02 ` [RFC][PATCH 0/4] FUTEX_UNLOCK_PI wobbles Davidlohr Bueso
  4 siblings, 4 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-03  9:12 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot, peterz

[-- Attachment #1: peterz-futex-pi-unlock.patch --]
[-- Type: text/plain, Size: 8620 bytes --]

There's a number of 'interesting' problems with FUTEX_UNLOCK_PI, all
caused by holding hb->lock while doing the rt_mutex_unlock()
equivalent.

This patch doesn't attempt to fix any of the actual problems, but
instead reworks the code to not hold hb->lock across the unlock,
paving the way to actually fix the problems later.

The current reason we hold hb->lock over unlock is that it serializes
against FUTEX_LOCK_PI and prevents new waiters from coming in; this then
ensures the rt_mutex_next_owner() value is stable and can be written
into the user-space futex value before doing the unlock, such that the
unlock will indeed end up at new_owner.

This patch recognises that holding rt_mutex::wait_lock results in the
very same guarantee, no new waiters can come in while we hold that
lock -- after all, waiters would need this lock to queue themselves.

It therefore restructures the code to keep rt_mutex::wait_lock held.

This (of course) is not entirely straightforward either; see the
comment in rt_mutex_slowunlock(): doing the unlock itself might drop
wait_lock, letting new waiters in. To cure this,
rt_mutex_futex_unlock() becomes a variant of rt_mutex_slowunlock()
that returns -EAGAIN instead. This ensures the FUTEX_UNLOCK_PI code
aborts and restarts the entire operation.
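The resulting control flow can be sketched with stand-in stubs (hypothetical names; the real functions take the locks and arguments omitted here):

```c
#include <errno.h>

static int attempts;	/* how often the unlock had to be retried */

/*
 * Stands in for the new rt_mutex_futex_unlock(): while there are no
 * waiters, the fast path has to drop wait_lock, prior state goes stale
 * and it returns -EAGAIN; once a waiter is queued for wakeup it
 * returns 1 (caller must wake and deboost).
 */
static int rt_mutex_futex_unlock_stub(void)
{
	if (attempts++ < 2)
		return -EAGAIN;	/* wait_lock was dropped; state invalid */
	return 1;		/* queued a wakeup; caller must deboost */
}

/* Stands in for the FUTEX_UNLOCK_PI retry loop. */
static int futex_unlock_pi_stub(void)
{
	int ret;
retry:
	/* ... take hb->lock, revalidate top waiter, take wait_lock ... */
	ret = rt_mutex_futex_unlock_stub();
	/* ... drop wait_lock ... */
	if (ret == -EAGAIN)
		goto retry;	/* a racer got in; redo the whole thing */
	if (ret > 0) {
		/* wake_up_q() + rt_mutex_adjust_prio(current) */
		ret = 0;
	}
	return ret;
}
```

The key property is that -EAGAIN propagates all the way out so the user-space value is re-read and the top waiter re-validated on each retry.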

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c                  |   63 +++++++++++++++++--------------
 kernel/locking/rtmutex.c        |   81 ++++++++++++++++++++++++++++++++++++----
 kernel/locking/rtmutex_common.h |    4 -
 3 files changed, 110 insertions(+), 38 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1294,24 +1294,21 @@ static void mark_wake_futex(struct wake_
 static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
 			 struct futex_hash_bucket *hb)
 {
-	struct task_struct *new_owner;
 	struct futex_pi_state *pi_state = top_waiter->pi_state;
 	u32 uninitialized_var(curval), newval;
+	struct task_struct *new_owner;
 	WAKE_Q(wake_q);
-	bool deboost;
 	int ret = 0;
 
-	if (!pi_state)
-		return -EINVAL;
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 
+	WARN_ON_ONCE(!atomic_inc_not_zero(&pi_state->refcount));
 	/*
-	 * If current does not own the pi_state then the futex is
-	 * inconsistent and user space fiddled with the futex value.
+	 * Now that we hold wait_lock, no new waiters can happen on the
+	 * rt_mutex and new owner is stable. Drop hb->lock.
 	 */
-	if (pi_state->owner != current)
-		return -EINVAL;
+	spin_unlock(&hb->lock);
 
-	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
 
 	/*
@@ -1334,6 +1331,7 @@ static int wake_futex_pi(u32 __user *uad
 
 	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)) {
 		ret = -EFAULT;
+
 	} else if (curval != uval) {
 		/*
 		 * If a unconditional UNLOCK_PI operation (user space did not
@@ -1346,10 +1344,9 @@ static int wake_futex_pi(u32 __user *uad
 		else
 			ret = -EINVAL;
 	}
-	if (ret) {
-		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
-		return ret;
-	}
+
+	if (ret)
+		goto out_unlock;
 
 	raw_spin_lock(&pi_state->owner->pi_lock);
 	WARN_ON(list_empty(&pi_state->list));
@@ -1362,22 +1359,20 @@ static int wake_futex_pi(u32 __user *uad
 	pi_state->owner = new_owner;
 	raw_spin_unlock(&new_owner->pi_lock);
 
-	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+	ret = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
 
-	deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
+out_unlock:
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 
-	/*
-	 * First unlock HB so the waiter does not spin on it once he got woken
-	 * up. Second wake up the waiter before the priority is adjusted. If we
-	 * deboost first (and lose our higher priority), then the task might get
-	 * scheduled away before the wake up can take place.
-	 */
-	spin_unlock(&hb->lock);
-	wake_up_q(&wake_q);
-	if (deboost)
+	if (ret > 0) {
+		wake_up_q(&wake_q);
 		rt_mutex_adjust_prio(current);
+		ret = 0;
+	}
 
-	return 0;
+	put_pi_state(pi_state);
+
+	return ret;
 }
 
 /*
@@ -2656,6 +2651,20 @@ static int futex_unlock_pi(u32 __user *u
 	 */
 	top_waiter = futex_top_waiter(hb, &key);
 	if (top_waiter) {
+
+		if (!top_waiter->pi_state)
+			goto out_unlock;
+
+		/*
+		 * If current does not own the pi_state then the futex is
+		 * inconsistent and user space fiddled with the futex value.
+		 */
+		if (top_waiter->pi_state->owner != current)
+			goto out_unlock;
+
+		/*
+		 * wake_futex_pi() _will_ drop hb->lock
+		 */
 		ret = wake_futex_pi(uaddr, uval, top_waiter, hb);
 		/*
 		 * In case of success wake_futex_pi dropped the hash
@@ -2674,7 +2683,6 @@ static int futex_unlock_pi(u32 __user *u
 		 * setting the FUTEX_WAITERS bit. Try again.
 		 */
 		if (ret == -EAGAIN) {
-			spin_unlock(&hb->lock);
 			put_futex_key(&key);
 			goto retry;
 		}
@@ -2682,7 +2690,7 @@ static int futex_unlock_pi(u32 __user *u
 		 * wake_futex_pi has detected invalid state. Tell user
 		 * space.
 		 */
-		goto out_unlock;
+		goto out_putkey;
 	}
 
 	/*
@@ -2707,7 +2715,6 @@ static int futex_unlock_pi(u32 __user *u
 	return ret;
 
 pi_faulted:
-	spin_unlock(&hb->lock);
 	put_futex_key(&key);
 
 	ret = fault_in_user_writeable(uaddr);
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1489,19 +1489,84 @@ void __sched rt_mutex_unlock(struct rt_m
 EXPORT_SYMBOL_GPL(rt_mutex_unlock);
 
 /**
- * rt_mutex_futex_unlock - Futex variant of rt_mutex_unlock
+ * rt_mutex_futex_unlock - Futex variant of rt_mutex_slowunlock
  * @lock: the rt_mutex to be unlocked
+ * @wqh: wake queue
  *
- * Returns: true/false indicating whether priority adjustment is
- * required or not.
+ * This cannot just be rt_mutex_slowunlock() since that does the wait_lock
+ * dance, and the futex code is tricky (ha!) and uses the wait_lock
+ * to hold off new waiters, so dropping this lock invalidates prior state
+ * and we need to redo all of it.
+ *
+ * Returns:
+ *  -EAGAIN: if we need to retry the whole operation;
+ *        1: if we need to wake and deboost;
+ *        0: if we're done.
  */
-bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock,
-				   struct wake_q_head *wqh)
+int __sched rt_mutex_futex_unlock(struct rt_mutex *lock,
+				  struct wake_q_head *wake_q)
 {
-	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL)))
-		return false;
+	lockdep_assert_held(&lock->wait_lock);
+
+	debug_rt_mutex_unlock(lock);
+
+	/*
+	 * We must be careful here if the fast path is enabled. If we
+	 * have no waiters queued we cannot set owner to NULL here
+	 * because of:
+	 *
+	 * foo->lock->owner = NULL;
+	 *			rtmutex_lock(foo->lock);   <- fast path
+	 *			free = atomic_dec_and_test(foo->refcnt);
+	 *			rtmutex_unlock(foo->lock); <- fast path
+	 *			if (free)
+	 *				kfree(foo);
+	 * raw_spin_unlock(foo->lock->wait_lock);
+	 *
+	 * So for the fastpath enabled kernel:
+	 *
+	 * Nothing can set the waiters bit as long as we hold
+	 * lock->wait_lock. So we do the following sequence:
+	 *
+	 *	owner = rt_mutex_owner(lock);
+	 *	clear_rt_mutex_waiters(lock);
+	 *	raw_spin_unlock(&lock->wait_lock);
+	 *	if (cmpxchg(&lock->owner, owner, 0) == owner)
+	 *		return;
+	 *	goto retry;
+	 *
+	 * The fastpath disabled variant is simple as all access to
+	 * lock->owner is serialized by lock->wait_lock:
+	 *
+	 *	lock->owner = NULL;
+	 *	raw_spin_unlock(&lock->wait_lock);
+	 */
+	if (!rt_mutex_has_waiters(lock)) {
+		unsigned long flags;
+		int ret = 0;
+
+		local_save_flags(flags);
+
+		/* Drops lock->wait_lock ! */
+		if (unlock_rt_mutex_safe(lock, flags))
+			ret = -EAGAIN;
+
+		/* Relock the rtmutex and try again */
+		raw_spin_lock_irq(&lock->wait_lock);
+
+		return ret;
+	}
+
+	/*
+	 * The wakeup next waiter path does not suffer from the above
+	 * race. See the comments there.
+	 *
+	 * Queue the next waiter for wakeup once we release the wait_lock.
+	 */
+	mark_wakeup_next_waiter(wake_q, lock);
 
-	return rt_mutex_slowunlock(lock, wqh);
+	/* Do wakeups and deboost. */
+	return 1;
 }
 
 /**
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -109,8 +109,8 @@ extern int rt_mutex_finish_proxy_lock(st
 				      struct hrtimer_sleeper *to,
 				      struct rt_mutex_waiter *waiter);
 extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
-extern bool rt_mutex_futex_unlock(struct rt_mutex *lock,
-				  struct wake_q_head *wqh);
+extern int rt_mutex_futex_unlock(struct rt_mutex *lock,
+				 struct wake_q_head *wqh);
 extern void rt_mutex_adjust_prio(struct task_struct *task);
 
 #ifdef CONFIG_DEBUG_RT_MUTEXES


* Re: [RFC][PATCH 3/4] futex: Remove rt_mutex_deadlock_account_*()
  2016-10-03  9:12 ` [RFC][PATCH 3/4] futex: Remove rt_mutex_deadlock_account_*() Peter Zijlstra
@ 2016-10-03  9:34   ` Peter Zijlstra
  2016-10-03 14:25   ` Steven Rostedt
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-03  9:34 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot


^Subject should've had rtmutex: on.. -ENOTENOUGHTEA


* Re: [RFC][PATCH 1/4] futex: Cleanup variable names for futex_top_waiter()
  2016-10-03  9:12 ` [RFC][PATCH 1/4] futex: Cleanup variable names for futex_top_waiter() Peter Zijlstra
@ 2016-10-03 14:15   ` Steven Rostedt
  2016-10-05  3:58   ` Davidlohr Bueso
  1 sibling, 0 replies; 46+ messages in thread
From: Steven Rostedt @ 2016-10-03 14:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Mon, 03 Oct 2016 11:12:35 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> futex_top_waiter() returns the top-waiter on the pi_mutex. Assigning
> this to a variable 'match' totally obscures the code.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Nice!

Reviewed-by: Steven Rostedt <rostedt@goodmis.org>

-- Steve


* Re: [RFC][PATCH 2/4] futex: Use smp_store_release() in mark_wake_futex()
  2016-10-03  9:12 ` [RFC][PATCH 2/4] futex: Use smp_store_release() in mark_wake_futex() Peter Zijlstra
@ 2016-10-03 14:19   ` Steven Rostedt
  2016-10-05  3:57   ` Davidlohr Bueso
  1 sibling, 0 replies; 46+ messages in thread
From: Steven Rostedt @ 2016-10-03 14:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Mon, 03 Oct 2016 11:12:36 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> Since the futex_q can disappear the instruction after assigning NULL,
> this really should be a RELEASE barrier. That stops loads from hitting
> dead memory too.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/futex.c |    3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1288,8 +1288,7 @@ static void mark_wake_futex(struct wake_
>  	 * memory barrier is required here to prevent the following
>  	 * store to lock_ptr from getting ahead of the plist_del.
>  	 */
> -	smp_wmb();
> -	q->lock_ptr = NULL;
> +	smp_store_release(&q->lock_ptr, NULL);
>  }
>  
>  static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
> 

Reviewed-by: Steven Rostedt <rostedt@goodmis.org>

-- Steve


* Re: [RFC][PATCH 3/4] futex: Remove rt_mutex_deadlock_account_*()
  2016-10-03  9:12 ` [RFC][PATCH 3/4] futex: Remove rt_mutex_deadlock_account_*() Peter Zijlstra
  2016-10-03  9:34   ` Peter Zijlstra
@ 2016-10-03 14:25   ` Steven Rostedt
  2016-10-05  1:08   ` Davidlohr Bueso
  2016-10-05  7:29   ` Sebastian Andrzej Siewior
  3 siblings, 0 replies; 46+ messages in thread
From: Steven Rostedt @ 2016-10-03 14:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Mon, 03 Oct 2016 11:12:37 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> These are unused and clutter up the code.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Reviewed-by: Steven Rostedt <rostedt@goodmis.org>

-- Steve


* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-10-03  9:12 ` [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI Peter Zijlstra
@ 2016-10-03 15:36   ` Steven Rostedt
  2016-10-03 15:44     ` Peter Zijlstra
  2016-10-03 15:45     ` Peter Zijlstra
  2016-10-05  7:41   ` Sebastian Andrzej Siewior
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 46+ messages in thread
From: Steven Rostedt @ 2016-10-03 15:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Mon, 03 Oct 2016 11:12:38 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> There's a number of 'interesting' problems with FUTEX_UNLOCK_PI, all
> caused by holding hb->lock while doing the rt_mutex_unlock()
> equivalent.
> 
> This patch doesn't attempt to fix any of the actual problems, but
> instead reworks the code to not hold hb->lock across the unlock,
> paving the way to actually fix the problems later.
> 
> The current reason we hold hb->lock over unlock is that it serializes
> against FUTEX_LOCK_PI and prevents new waiters from coming in; this then
> ensures the rt_mutex_next_owner() value is stable and can be written
> into the user-space futex value before doing the unlock, such that the
> unlock will indeed end up at new_owner.
> 
> This patch recognises that holding rt_mutex::wait_lock results in the
> very same guarantee, no new waiters can come in while we hold that
> lock -- after all, waiters would need this lock to queue themselves.
> 
> It therefore restructures the code to keep rt_mutex::wait_lock held.
> 
> This (of course) is not entirely straightforward either; see the
> comment in rt_mutex_slowunlock(): doing the unlock itself might drop
> wait_lock, letting new waiters in. To cure this,
> rt_mutex_futex_unlock() becomes a variant of rt_mutex_slowunlock()
> that returns -EAGAIN instead. This ensures the FUTEX_UNLOCK_PI code
> aborts and restarts the entire operation.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/futex.c                  |   63 +++++++++++++++++--------------
>  kernel/locking/rtmutex.c        |   81 ++++++++++++++++++++++++++++++++++++----
>  kernel/locking/rtmutex_common.h |    4 -
>  3 files changed, 110 insertions(+), 38 deletions(-)
> 
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1294,24 +1294,21 @@ static void mark_wake_futex(struct wake_
>  static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
>  			 struct futex_hash_bucket *hb)
>  {
> -	struct task_struct *new_owner;
>  	struct futex_pi_state *pi_state = top_waiter->pi_state;
>  	u32 uninitialized_var(curval), newval;
> +	struct task_struct *new_owner;
>  	WAKE_Q(wake_q);
> -	bool deboost;
>  	int ret = 0;
>  
> -	if (!pi_state)
> -		return -EINVAL;
> +	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
>  
> +	WARN_ON_ONCE(!atomic_inc_not_zero(&pi_state->refcount));

Don't we have a rule where WARN_ON() and BUG_ON() should never have
"side effects"? That is, they should only check values, but their
contents should not update values.

hence have:

	ret = atomic_inc_not_zero(&pi_state->refcount);
	WARN_ON_ONCE(!ret);

>  	/*
> -	 * If current does not own the pi_state then the futex is
> -	 * inconsistent and user space fiddled with the futex value.
> +	 * Now that we hold wait_lock, no new waiters can happen on the
> +	 * rt_mutex and new owner is stable. Drop hb->lock.
>  	 */
> -	if (pi_state->owner != current)
> -		return -EINVAL;
> +	spin_unlock(&hb->lock);
>  

Also, as Sebastian has said before, I believe this breaks rt's migrate
disable code. Since migrate_disable() and migrate_enable() are nops when
preemption is disabled, if you hold a raw_spin_lock across a
spin_unlock() then the migrate_enable() will be a nop, and the
migrate disable will never be undone.
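The imbalance described here can be modeled abstractly (hypothetical counters, not the actual RT implementation):

```c
static int preempt_count;	  /* raw_spin_lock disables preemption */
static int migrate_disable_count; /* spin_lock pins the task on RT */

static void preempt_disable(void) { preempt_count++; }
static void preempt_enable(void)  { preempt_count--; }

/* On RT, migrate_disable()/migrate_enable() are nops with preemption off. */
static void migrate_disable(void) { if (!preempt_count) migrate_disable_count++; }
static void migrate_enable(void)  { if (!preempt_count) migrate_disable_count--; }

/* Mimics the lock order in the patch: spin_unlock under a raw lock. */
static int unlock_out_of_order(void)
{
	migrate_disable();	/* spin_lock(&hb->lock)               */
	preempt_disable();	/* raw_spin_lock_irq(&...wait_lock)   */
	migrate_enable();	/* spin_unlock(&hb->lock): a nop!     */
	preempt_enable();	/* raw_spin_unlock_irq(&...wait_lock) */
	return migrate_disable_count;	/* still 1: never undone */
}
```

The task stays pinned forever because the enable that should have balanced the disable ran while preemption was off.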

-- Steve

> -	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
>  	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
>  
>  	/*


* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-10-03 15:36   ` Steven Rostedt
@ 2016-10-03 15:44     ` Peter Zijlstra
  2016-10-03 15:45     ` Peter Zijlstra
  1 sibling, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-03 15:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: mingo, tglx, juri.lelli, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Mon, Oct 03, 2016 at 11:36:24AM -0400, Steven Rostedt wrote:
> >  	/*
> > -	 * If current does not own the pi_state then the futex is
> > -	 * inconsistent and user space fiddled with the futex value.
> > +	 * Now that we hold wait_lock, no new waiters can happen on the
> > +	 * rt_mutex and new owner is stable. Drop hb->lock.
> >  	 */
> > -	if (pi_state->owner != current)
> > -		return -EINVAL;
> > +	spin_unlock(&hb->lock);
> >  
> 
> Also, as Sebastian has said before, I believe this breaks rt's migrate
> disable code. As migrate disable and migrate_enable are a nop if
> preemption is disabled, thus if you hold a raw_spin_lock across a
> spin_unlock() when the migrate enable will be a nop, and the
> migrate_disable() will never stop.

It's too long since I looked at that trainwreck, but yuck, that would
make lock/unlock order important :-(

Now I think we could do something like so.. but I'm not entirely sure on
the various lifetime rules here; it's not overly documented.

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1300,15 +1300,14 @@ static int wake_futex_pi(u32 __user *uad
 	WAKE_Q(wake_q);
 	int ret = 0;
 
-	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
-
 	WARN_ON_ONCE(!atomic_inc_not_zero(&pi_state->refcount));
 	/*
-	 * Now that we hold wait_lock, no new waiters can happen on the
-	 * rt_mutex and new owner is stable. Drop hb->lock.
+	 * XXX
 	 */
 	spin_unlock(&hb->lock);
 
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
+
 	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
 
 	/*


* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-10-03 15:36   ` Steven Rostedt
  2016-10-03 15:44     ` Peter Zijlstra
@ 2016-10-03 15:45     ` Peter Zijlstra
  2016-10-03 16:23       ` Steven Rostedt
  1 sibling, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-03 15:45 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: mingo, tglx, juri.lelli, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Mon, Oct 03, 2016 at 11:36:24AM -0400, Steven Rostedt wrote:

> > +	WARN_ON_ONCE(!atomic_inc_not_zero(&pi_state->refcount));
> 
> Don't we have a rule where WARN_ON() and BUG_ON() should never have
> "side effects"? That is, they should only check values, but their
> contents should not update values.

Not that I'm aware of; there are various places in the kernel (including
kref.h) that rely on WARN_ON*() having side effects.


* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-10-03 15:45     ` Peter Zijlstra
@ 2016-10-03 16:23       ` Steven Rostedt
  0 siblings, 0 replies; 46+ messages in thread
From: Steven Rostedt @ 2016-10-03 16:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Mon, 3 Oct 2016 17:45:14 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Oct 03, 2016 at 11:36:24AM -0400, Steven Rostedt wrote:
> 
> > > +	WARN_ON_ONCE(!atomic_inc_not_zero(&pi_state->refcount));  
> > 
> > Don't we have a rule where WARN_ON() and BUG_ON() should never have
> > "side effects"? That is, they should only check values, but their
> > contents should not update values.  
> 
> Not that I'm aware of; there are various places in the kernel (including
> kref.h) that rely on WARN_ON*() having side effects.

Looking more into it, I believe the "no side effect" is only with
BUG_ON(), as WARN_ON() can be used within if statements.

-- Steve


* Re: [RFC][PATCH 0/4] FUTEX_UNLOCK_PI wobbles
  2016-10-03  9:12 [RFC][PATCH 0/4] FUTEX_UNLOCK_PI wobbles Peter Zijlstra
                   ` (3 preceding siblings ...)
  2016-10-03  9:12 ` [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI Peter Zijlstra
@ 2016-10-05  1:02 ` Davidlohr Bueso
  2016-10-05  6:20   ` Peter Zijlstra
  4 siblings, 1 reply; 46+ messages in thread
From: Davidlohr Bueso @ 2016-10-05  1:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Mon, 03 Oct 2016, Peter Zijlstra wrote:

>
>Hi,
>
>During my last PI failing patch set it became obvious there's a number of
>related fail in FUTEX_UNLOCK_PI that needed sorting before we can move on
>with that stuff.
>
>These here patches are the result of staring at that code for a wee bit.
>
>Please have a very _very_ careful look at the last patch, it appears to not
>explode when running:
>
>  - perf bench futex lock-pi
>  - selftests/futex
>
>but given the immense amount of tricky involved with both PI and futex there is
>bound to be something I've overlooked.
>
>Do people have more/better futex-pi test cases?

pi_stress.


* Re: [RFC][PATCH 3/4] futex: Remove rt_mutex_deadlock_account_*()
  2016-10-03  9:12 ` [RFC][PATCH 3/4] futex: Remove rt_mutex_deadlock_account_*() Peter Zijlstra
  2016-10-03  9:34   ` Peter Zijlstra
  2016-10-03 14:25   ` Steven Rostedt
@ 2016-10-05  1:08   ` Davidlohr Bueso
  2016-10-05  7:29   ` Sebastian Andrzej Siewior
  3 siblings, 0 replies; 46+ messages in thread
From: Davidlohr Bueso @ 2016-10-05  1:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Mon, 03 Oct 2016, Peter Zijlstra wrote:

>These are unused and clutter up the code.
>
>Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>


* Re: [RFC][PATCH 2/4] futex: Use smp_store_release() in mark_wake_futex()
  2016-10-03  9:12 ` [RFC][PATCH 2/4] futex: Use smp_store_release() in mark_wake_futex() Peter Zijlstra
  2016-10-03 14:19   ` Steven Rostedt
@ 2016-10-05  3:57   ` Davidlohr Bueso
  2016-10-05  6:20     ` Peter Zijlstra
  1 sibling, 1 reply; 46+ messages in thread
From: Davidlohr Bueso @ 2016-10-05  3:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Mon, 03 Oct 2016, Peter Zijlstra wrote:

>Since the futex_q can disappear the instruction after assigning NULL,
>this really should be a RELEASE barrier. That stops loads from hitting
>dead memory too.
>
>Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>---
> kernel/futex.c |    3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
>--- a/kernel/futex.c
>+++ b/kernel/futex.c
>@@ -1288,8 +1288,7 @@ static void mark_wake_futex(struct wake_
> 	 * memory barrier is required here to prevent the following
> 	 * store to lock_ptr from getting ahead of the plist_del.
> 	 */
>-	smp_wmb();
>-	q->lock_ptr = NULL;
>+	smp_store_release(&q->lock_ptr, NULL);
> }

Hmm, what if we relied on the implicit barrier in the wake_q_add()
above and got rid of the smp_wmb altogether? We'd obviously have to
move up __unqueue_futex(), but all we care about is the publishing
store to lock_ptr being the last operation, or at least the plist_del,
such that the wakeup order is respected; ie:

   __unqueue_futex(q);
   wake_q_add(wake_q, p);
   q->lock_ptr = NULL;

Thanks,
Davidlohr


* Re: [RFC][PATCH 1/4] futex: Cleanup variable names for futex_top_waiter()
  2016-10-03  9:12 ` [RFC][PATCH 1/4] futex: Cleanup variable names for futex_top_waiter() Peter Zijlstra
  2016-10-03 14:15   ` Steven Rostedt
@ 2016-10-05  3:58   ` Davidlohr Bueso
  1 sibling, 0 replies; 46+ messages in thread
From: Davidlohr Bueso @ 2016-10-05  3:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Mon, 03 Oct 2016, Peter Zijlstra wrote:

>futex_top_waiter() returns the top-waiter on the pi_mutex. Assigning
>this to a variable 'match' totally obscures the code.
>
>Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>


* Re: [RFC][PATCH 2/4] futex: Use smp_store_release() in mark_wake_futex()
  2016-10-05  3:57   ` Davidlohr Bueso
@ 2016-10-05  6:20     ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-05  6:20 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Tue, Oct 04, 2016 at 08:57:55PM -0700, Davidlohr Bueso wrote:
> On Mon, 03 Oct 2016, Peter Zijlstra wrote:
> 
> >Since the futex_q can disappear the instruction after assigning NULL,
> >this really should be a RELEASE barrier. That stops loads from hitting
> >dead memory too.
> >
> >Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> >---
> >kernel/futex.c |    3 +--
> >1 file changed, 1 insertion(+), 2 deletions(-)
> >
> >--- a/kernel/futex.c
> >+++ b/kernel/futex.c
> >@@ -1288,8 +1288,7 @@ static void mark_wake_futex(struct wake_
> >	 * memory barrier is required here to prevent the following
> >	 * store to lock_ptr from getting ahead of the plist_del.
> >	 */
> >-	smp_wmb();
> >-	q->lock_ptr = NULL;
> >+	smp_store_release(&q->lock_ptr, NULL);
> >}
> 
> Hmm, what if we relied on the implicit barrier in the wake_q_add()
> above and got rid of the smp_wmb altogether? We'd obviously have to
> move up __unqueue_futex(), but all we care about is the publishing
> store to lock_ptr being the last operation, or at least the plist_del,
> such that the wakeup order is respected; ie:
> 
>   __unqueue_futex(q);
>   wake_q_add(wake_q, p);
>   q->lock_ptr = NULL;

Less obvious, but it would work I suppose. I didn't look too hard at the
ordering requirements of __unqueue_futex(), but an early-morning glance
(without tea) doesn't show any obvious problems with that.


* Re: [RFC][PATCH 0/4] FUTEX_UNLOCK_PI wobbles
  2016-10-05  1:02 ` [RFC][PATCH 0/4] FUTEX_UNLOCK_PI wobbles Davidlohr Bueso
@ 2016-10-05  6:20   ` Peter Zijlstra
  2016-10-05  7:26     ` Sebastian Andrzej Siewior
  2016-10-05 16:04     ` Davidlohr Bueso
  0 siblings, 2 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-05  6:20 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Tue, Oct 04, 2016 at 06:02:14PM -0700, Davidlohr Bueso wrote:
> On Mon, 03 Oct 2016, Peter Zijlstra wrote:

> >Do people have more/better futex-pi test cases?
> 
> pi_stress.

Where does one find that? Link?


* Re: [RFC][PATCH 0/4] FUTEX_UNLOCK_PI wobbles
  2016-10-05  6:20   ` Peter Zijlstra
@ 2016-10-05  7:26     ` Sebastian Andrzej Siewior
  2016-10-05 16:04     ` Davidlohr Bueso
  1 sibling, 0 replies; 46+ messages in thread
From: Sebastian Andrzej Siewior @ 2016-10-05  7:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Davidlohr Bueso, mingo, tglx, juri.lelli, rostedt, xlpang,
	linux-kernel, mathieu.desnoyers, jdesfossez, bristot

On 2016-10-05 08:20:58 [+0200], Peter Zijlstra wrote:
> > pi_stress.
> 
> Where does one find that? Link?
https://git.kernel.org/cgit/utils/rt-tests/rt-tests.git/tree/src/pi_tests/pi_stress.c

Take the whole package :)

Sebastian


* Re: [RFC][PATCH 3/4] futex: Remove rt_mutex_deadlock_account_*()
  2016-10-03  9:12 ` [RFC][PATCH 3/4] futex: Remove rt_mutex_deadlock_account_*() Peter Zijlstra
                     ` (2 preceding siblings ...)
  2016-10-05  1:08   ` Davidlohr Bueso
@ 2016-10-05  7:29   ` Sebastian Andrzej Siewior
  3 siblings, 0 replies; 46+ messages in thread
From: Sebastian Andrzej Siewior @ 2016-10-05  7:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, rostedt, xlpang, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On 2016-10-03 11:12:37 [+0200], Peter Zijlstra wrote:
> These are unused and clutter up the code.

It seems it has been like that since day one (Ingo's patches from
2006 [0] are like that). I assume we never had real code behind these
functions, right? Not even in -RT?

[0] http://people.redhat.com/mingo/PI-futex-patches/patches/

Sebastian


* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-10-03  9:12 ` [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI Peter Zijlstra
  2016-10-03 15:36   ` Steven Rostedt
@ 2016-10-05  7:41   ` Sebastian Andrzej Siewior
  2016-10-05  8:09     ` Peter Zijlstra
  2016-10-06 10:29   ` Peter Zijlstra
  2016-10-07 11:21   ` Peter Zijlstra
  3 siblings, 1 reply; 46+ messages in thread
From: Sebastian Andrzej Siewior @ 2016-10-05  7:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, rostedt, xlpang, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On 2016-10-03 11:12:38 [+0200], Peter Zijlstra wrote:
> There's a number of 'interesting' problems with FUTEX_UNLOCK_PI, all
> caused by holding hb->lock while doing the rt_mutex_unlock()
> equivalent.

are those problems DL related?

Sebastian


* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-10-05  7:41   ` Sebastian Andrzej Siewior
@ 2016-10-05  8:09     ` Peter Zijlstra
  2016-10-05  8:21       ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-05  8:09 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: mingo, tglx, juri.lelli, rostedt, xlpang, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Wed, Oct 05, 2016 at 09:41:47AM +0200, Sebastian Andrzej Siewior wrote:
> On 2016-10-03 11:12:38 [+0200], Peter Zijlstra wrote:
> > There's a number of 'interesting' problems with FUTEX_UNLOCK_PI, all
> > caused by holding hb->lock while doing the rt_mutex_unlock()
> > equivalent.
> 
> are those problems DL related?

One of them; the other is that PI thing you did that ugly no-deboost
thing for, right?


* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-10-05  8:09     ` Peter Zijlstra
@ 2016-10-05  8:21       ` Sebastian Andrzej Siewior
  2016-10-05  8:32         ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Sebastian Andrzej Siewior @ 2016-10-05  8:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, rostedt, xlpang, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On 2016-10-05 10:09:12 [+0200], Peter Zijlstra wrote:
> On Wed, Oct 05, 2016 at 09:41:47AM +0200, Sebastian Andrzej Siewior wrote:
> > are those problems DL related?
> 
> One of them, the other is that PI thing you did that ugly nodeboost
> thing for, right?

This no-de-boost, yes. It is probably a problem since we have this
"delayed" wake-up. I've been thinking about a marker in the PI state to
ignore a de-boost so the spin_unlock() won't be a problem. But if I
understand it right, that won't solve the DL problem, since you
can't have two tasks at the same priority.

Sebastian


* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-10-05  8:21       ` Sebastian Andrzej Siewior
@ 2016-10-05  8:32         ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-05  8:32 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: mingo, tglx, juri.lelli, rostedt, xlpang, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Wed, Oct 05, 2016 at 10:21:02AM +0200, Sebastian Andrzej Siewior wrote:
> On 2016-10-05 10:09:12 [+0200], Peter Zijlstra wrote:
> > On Wed, Oct 05, 2016 at 09:41:47AM +0200, Sebastian Andrzej Siewior wrote:
> > > are those problems DL related?
> > 
> > One of them, the other is that PI thing you did that ugly nodeboost
> > thing for, right?
> 
> this no-de-boost yes. This is probably a problem since we have this
> "delayed" wake-up. I've been thinking about a marked in PI state to
> ignore a de-boost so the spin_unlock() won't be a problem. But if I
> understand it right, then this won't solve the DL problem since you
> can't have two tasks at the same priority.

The primary concern for DL right now is being able to have a stable
pointer to the top waiter. We do this by having rt_mutex_setprio()
update the pointer while holding both rq->lock and tsk->pi_lock.

This means the pointer is stable when holding either lock, which is
sufficient.

But this means we need to deboost _before_ we wake. Otherwise the task
could've continued running and called do_exit() on us.


Secondly, once we start looking at BWI (bandwidth inheritance), where a
blocked DL task donates its runtime budget along with its deadline, we
also very much need this, since a task cannot be running off its own
budget while at the same time the boosted task is also running off that
same budget.

(having the 'blocked' DL task spin-waiting, as per optimistic spinning,
makes all that rather 'interesting').

In any case, this is two problems:

  - your inversion issue
  - my pointer stability (and eventually bandwidth issue)

that are caused by this hb->lock being in the way.


* Re: [RFC][PATCH 0/4] FUTEX_UNLOCK_PI wobbles
  2016-10-05  6:20   ` Peter Zijlstra
  2016-10-05  7:26     ` Sebastian Andrzej Siewior
@ 2016-10-05 16:04     ` Davidlohr Bueso
  1 sibling, 0 replies; 46+ messages in thread
From: Davidlohr Bueso @ 2016-10-05 16:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Wed, 05 Oct 2016, Peter Zijlstra wrote:

>On Tue, Oct 04, 2016 at 06:02:14PM -0700, Davidlohr Bueso wrote:
>> On Mon, 03 Oct 2016, Peter Zijlstra wrote:
>
>> >Do people have more/better futex-pi test cases?
>>
>> pi_stress.
>
>Where does one find that? Link?

Sorry, git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git


* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-10-03  9:12 ` [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI Peter Zijlstra
  2016-10-03 15:36   ` Steven Rostedt
  2016-10-05  7:41   ` Sebastian Andrzej Siewior
@ 2016-10-06 10:29   ` Peter Zijlstra
  2016-10-07 11:21   ` Peter Zijlstra
  3 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-06 10:29 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot

On Mon, Oct 03, 2016 at 11:12:38AM +0200, Peter Zijlstra wrote:
> There's a number of 'interesting' problems with FUTEX_UNLOCK_PI, all
> caused by holding hb->lock while doing the rt_mutex_unlock()
> equivalent.
> 
> This patch doesn't attempt to fix any of the actual problems, but
> instead reworks the code to not hold hb->lock across the unlock,
> paving the way to actually fix the problems later.
> 
> The current reason we hold hb->lock over unlock is that it serializes
> against FUTEX_LOCK_PI and avoids new waiters from coming in, this then
> ensures the rt_mutex_next_owner() value is stable and can be written
> into the user-space futex value before doing the unlock. Such that the
> unlock will indeed end up at new_owner.
> 
> This patch recognises that holding rt_mutex::wait_lock results in the
> very same guarantee, no new waiters can come in while we hold that
> lock -- after all, waiters would need this lock to queue themselves.
> 
> It therefore restructures the code to keep rt_mutex::wait_lock held.
> 
> This (of course) is not entirely straight forward either, see the
> comment in rt_mutex_slowunlock(), doing the unlock itself might drop
> wait_lock, letting new waiters in. To cure this
> rt_mutex_futex_unlock() becomes a variant of rt_mutex_slowunlock()
> that return -EAGAIN instead. This ensures the FUTEX_UNLOCK_PI code
> aborts and restarts the entire operation.

Urgh, I missed a bunch :/

So there's the !new_owner case in wake_futex_pi() which can happen if
futex_lock_pi()'s rt_mutex_timed_futex_lock() failed but we still see
that task on the futex_q list (it hasn't yet done unqueue_me).

I wondered if we could sort this case by making fixup_owner() more
interesting, which got me looking at that.

And it turns out fixup_owner() relies on futex_unlock_pi() holding
hb->lock as well. It does rt_mutex_owner() while holding wait_lock, but
then drops wait_lock to call fixup_pi_state_owner(), assuming the owner
it read remains valid.

ARGGH, what a mess.

Lemme stare at this more..


* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-10-03  9:12 ` [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI Peter Zijlstra
                     ` (2 preceding siblings ...)
  2016-10-06 10:29   ` Peter Zijlstra
@ 2016-10-07 11:21   ` Peter Zijlstra
  2016-10-08 15:53     ` Thomas Gleixner
                       ` (2 more replies)
  3 siblings, 3 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-07 11:21 UTC (permalink / raw)
  To: mingo, tglx, juri.lelli, rostedt, xlpang, bigeasy
  Cc: linux-kernel, mathieu.desnoyers, jdesfossez, bristot


New version..


This one seems to pass all the (pi) futex tests and survives many hours
of my modified pi_stress (I added MADV_UNMAP to punch holes in the
page-tables to trigger (minor) faults).

---
Subject: futex: Rewrite FUTEX_UNLOCK_PI
From: Peter Zijlstra <peterz@infradead.org>
Date: Sun Oct 2 18:42:33 CEST 2016

There's a number of 'interesting' problems with FUTEX_UNLOCK_PI, all
caused by holding hb->lock while doing the rt_mutex_unlock()
equivalent.

Notably:
 - a PI inversion on hb->lock
 - DL crash because of pointer instability.

This patch doesn't attempt to fix any of the actual problems, but
instead reworks the code to not hold hb->lock across the unlock,
paving the way to actually fix the problems later.

The current reason we hold hb->lock over unlock is that it serializes
against FUTEX_LOCK_PI and prevents new waiters from coming in; this then
ensures the rt_mutex_next_owner() value is stable and can be written
into the user-space futex value before doing the unlock, such that the
unlock will indeed end up at new_owner.

This patch recognises that holding rt_mutex::wait_lock results in the
very same guarantee, no new waiters can come in while we hold that
lock -- after all, waiters would need this lock to queue themselves.

This (of course) is not entirely straightforward either; see the
comment in rt_mutex_slowunlock(): doing the unlock itself might drop
wait_lock, letting new waiters in.

Another problem is the case where futex_lock_pi() failed to acquire
the lock (ie. released rt_mutex::wait_lock) but hasn't yet re-acquired
hb->lock and called unqueue_me_pi(). In this case we're confused about
having waiters (the futex state says yes, the rt_mutex state says no).

The current solution is to assign the futex to the waiter from the
futex state, and have futex_lock_pi() detect this and try to fix it
up. This, again, all relies on hb->lock serializing things.


Solve all that by:

 - using futex specific rt_mutex calls that lack the fastpath, futexes
   have their own fastpath anyway. This makes that
   rt_mutex_futex_unlock() doesn't need to drop rt_mutex::wait_lock
   and the unlock is guaranteed if we manage to update user state.

 - make futex_unlock_pi() drop hb->lock early and only use
   rt_mutex::wait_lock to serialize against rt_mutex waiters
   update the futex value and unlock.

 - in case futex and rt_mutex disagree on waiters, side with rt_mutex
   and simply clear the user value. This works because either there
   really are no waiters left, or futex_lock_pi() triggers the
   lock-steal path and fixes up the WAITERS flag.


Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c                  |  178 ++++++++++++++++++++--------------------
 kernel/locking/rtmutex.c        |   55 +++++++++---
 kernel/locking/rtmutex_common.h |    9 +-
 3 files changed, 139 insertions(+), 103 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -914,7 +914,7 @@ void exit_pi_state_list(struct task_stru
 		pi_state->owner = NULL;
 		raw_spin_unlock_irq(&curr->pi_lock);
 
-		rt_mutex_unlock(&pi_state->pi_mutex);
+		rt_mutex_futex_unlock(&pi_state->pi_mutex);
 
 		spin_unlock(&hb->lock);
 
@@ -1146,7 +1146,7 @@ static int lock_pi_update_atomic(u32 __u
 	if (unlikely(cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)))
 		return -EFAULT;
 
-	/*If user space value changed, let the caller retry */
+	/* If user space value changed, let the caller retry */
 	return curval != uval ? -EAGAIN : 0;
 }
 
@@ -1291,49 +1291,58 @@ static void mark_wake_futex(struct wake_
 	smp_store_release(&q->lock_ptr, NULL);
 }
 
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
-			 struct futex_hash_bucket *hb)
+static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
 {
-	struct task_struct *new_owner;
-	struct futex_pi_state *pi_state = top_waiter->pi_state;
 	u32 uninitialized_var(curval), newval;
+	struct task_struct *new_owner;
+	bool deboost = false;
 	WAKE_Q(wake_q);
-	bool deboost;
 	int ret = 0;
 
-	if (!pi_state)
-		return -EINVAL;
-
-	/*
-	 * If current does not own the pi_state then the futex is
-	 * inconsistent and user space fiddled with the futex value.
-	 */
-	if (pi_state->owner != current)
-		return -EINVAL;
-
 	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
-	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
-
-	/*
-	 * It is possible that the next waiter (the one that brought
-	 * top_waiter owner to the kernel) timed out and is no longer
-	 * waiting on the lock.
-	 */
-	if (!new_owner)
-		new_owner = top_waiter->task;
 
-	/*
-	 * We pass it to the next owner. The WAITERS bit is always
-	 * kept enabled while there is PI state around. We cleanup the
-	 * owner died bit, because we are the owner.
-	 */
-	newval = FUTEX_WAITERS | task_pid_vnr(new_owner);
+	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
+	if (!new_owner) {
+		/*
+		 * This is the case where futex_lock_pi() has not yet or failed
+		 * to acquire the lock but still has the futex_q enqueued. So
+		 * the futex state has a 'waiter' while the rt_mutex state does
+		 * not.
+		 *
+		 * Even though there still is pi_state for this futex, we can
+		 * clear FUTEX_WAITERS. Either:
+		 *
+		 *  - we or futex_lock_pi() will drop the last reference and
+		 *    clean up this pi_state,
+		 *
+		 *  - userspace acquires the futex through its fastpath
+		 *    and the above pi_state cleanup still happens,
+		 *
+		 *  - or futex_lock_pi() will re-set the WAITERS bit in
+		 *    fixup_owner().
+		 */
+		newval = 0;
+		/*
+		 * Since pi_state->owner must point to a valid task, and
+		 * task_pid_vnr(pi_state->owner) must match TID_MASK, use
+		 * init_task.
+		 */
+		new_owner = &init_task;
+	} else {
+		/*
+		 * We pass it to the next owner. The WAITERS bit is always kept
+		 * enabled while there is PI state around. We cleanup the owner
+		 * died bit, because we are the owner.
+		 */
+		newval = FUTEX_WAITERS | task_pid_vnr(new_owner);
+	}
 
 	if (unlikely(should_fail_futex(true)))
 		ret = -EFAULT;
 
 	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)) {
 		ret = -EFAULT;
+
 	} else if (curval != uval) {
 		/*
 		 * If a unconditional UNLOCK_PI operation (user space did not
@@ -1346,10 +1355,9 @@ static int wake_futex_pi(u32 __user *uad
 		else
 			ret = -EINVAL;
 	}
-	if (ret) {
-		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
-		return ret;
-	}
+
+	if (ret)
+		goto out_unlock;
 
 	raw_spin_lock(&pi_state->owner->pi_lock);
 	WARN_ON(list_empty(&pi_state->list));
@@ -1362,22 +1370,20 @@ static int wake_futex_pi(u32 __user *uad
 	pi_state->owner = new_owner;
 	raw_spin_unlock(&new_owner->pi_lock);
 
-	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
-
-	deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
-
 	/*
-	 * First unlock HB so the waiter does not spin on it once he got woken
-	 * up. Second wake up the waiter before the priority is adjusted. If we
-	 * deboost first (and lose our higher priority), then the task might get
-	 * scheduled away before the wake up can take place.
+	 * We've updated the uservalue, this unlock cannot fail.
 	 */
-	spin_unlock(&hb->lock);
-	wake_up_q(&wake_q);
-	if (deboost)
+	deboost = __rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
+
+out_unlock:
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+
+	if (deboost) {
+		wake_up_q(&wake_q);
 		rt_mutex_adjust_prio(current);
+	}
 
-	return 0;
+	return ret;
 }
 
 /*
@@ -2228,7 +2234,6 @@ static long futex_wait_restart(struct re
  */
 static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked)
 {
-	struct task_struct *owner;
 	int ret = 0;
 
 	if (locked) {
@@ -2242,43 +2247,15 @@ static int fixup_owner(u32 __user *uaddr
 	}
 
 	/*
-	 * Catch the rare case, where the lock was released when we were on the
-	 * way back before we locked the hash bucket.
-	 */
-	if (q->pi_state->owner == current) {
-		/*
-		 * Try to get the rt_mutex now. This might fail as some other
-		 * task acquired the rt_mutex after we removed ourself from the
-		 * rt_mutex waiters list.
-		 */
-		if (rt_mutex_trylock(&q->pi_state->pi_mutex)) {
-			locked = 1;
-			goto out;
-		}
-
-		/*
-		 * pi_state is incorrect, some other task did a lock steal and
-		 * we returned due to timeout or signal without taking the
-		 * rt_mutex. Too late.
-		 */
-		raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock);
-		owner = rt_mutex_owner(&q->pi_state->pi_mutex);
-		if (!owner)
-			owner = rt_mutex_next_owner(&q->pi_state->pi_mutex);
-		raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock);
-		ret = fixup_pi_state_owner(uaddr, q, owner);
-		goto out;
-	}
-
-	/*
 	 * Paranoia check. If we did not take the lock, then we should not be
 	 * the owner of the rt_mutex.
 	 */
-	if (rt_mutex_owner(&q->pi_state->pi_mutex) == current)
+	if (rt_mutex_owner(&q->pi_state->pi_mutex) == current) {
 		printk(KERN_ERR "fixup_owner: ret = %d pi-mutex: %p "
 				"pi-state %p\n", ret,
 				q->pi_state->pi_mutex.owner,
 				q->pi_state->owner);
+	}
 
 out:
 	return ret ? ret : locked;
@@ -2566,7 +2543,7 @@ static int futex_lock_pi(u32 __user *uad
 	if (!trylock) {
 		ret = rt_mutex_timed_futex_lock(&q.pi_state->pi_mutex, to);
 	} else {
-		ret = rt_mutex_trylock(&q.pi_state->pi_mutex);
+		ret = rt_mutex_futex_trylock(&q.pi_state->pi_mutex);
 		/* Fixup the trylock return value: */
 		ret = ret ? 0 : -EWOULDBLOCK;
 	}
@@ -2589,7 +2566,7 @@ static int futex_lock_pi(u32 __user *uad
 	 * it and return the fault to userspace.
 	 */
 	if (ret && (rt_mutex_owner(&q.pi_state->pi_mutex) == current))
-		rt_mutex_unlock(&q.pi_state->pi_mutex);
+		rt_mutex_futex_unlock(&q.pi_state->pi_mutex);
 
 	/* Unqueue and drop the lock */
 	unqueue_me_pi(&q);
@@ -2656,7 +2633,34 @@ static int futex_unlock_pi(u32 __user *u
 	 */
 	top_waiter = futex_top_waiter(hb, &key);
 	if (top_waiter) {
-		ret = wake_futex_pi(uaddr, uval, top_waiter, hb);
+		struct futex_pi_state *pi_state = top_waiter->pi_state;
+
+		ret = -EINVAL;
+		if (!pi_state)
+			goto out_unlock;
+
+		/*
+		 * If current does not own the pi_state then the futex is
+		 * inconsistent and user space fiddled with the futex value.
+		 */
+		if (pi_state->owner != current)
+			goto out_unlock;
+
+		/*
+		 * Grab a reference on the pi_state and drop hb->lock.
+		 *
+		 * The reference ensures pi_state lives, dropping the hb->lock
+		 * is tricky.. wake_futex_pi() will take rt_mutex::wait_lock to
+		 * close the races against futex_lock_pi(), but in case of
+		 * _any_ fail we'll abort and retry the whole deal.
+		 */
+		WARN_ON_ONCE(!atomic_inc_not_zero(&pi_state->refcount));
+		spin_unlock(&hb->lock);
+
+		ret = wake_futex_pi(uaddr, uval, pi_state);
+
+		put_pi_state(pi_state);
+
 		/*
 		 * In case of success wake_futex_pi dropped the hash
 		 * bucket lock.
@@ -2674,7 +2678,6 @@ static int futex_unlock_pi(u32 __user *u
 		 * setting the FUTEX_WAITERS bit. Try again.
 		 */
 		if (ret == -EAGAIN) {
-			spin_unlock(&hb->lock);
 			put_futex_key(&key);
 			goto retry;
 		}
@@ -2682,7 +2685,7 @@ static int futex_unlock_pi(u32 __user *u
 		 * wake_futex_pi has detected invalid state. Tell user
 		 * space.
 		 */
-		goto out_unlock;
+		goto out_putkey;
 	}
 
 	/*
@@ -2692,8 +2695,10 @@ static int futex_unlock_pi(u32 __user *u
 	 * preserve the WAITERS bit not the OWNER_DIED one. We are the
 	 * owner.
 	 */
-	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, 0))
+	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, 0)) {
+		spin_unlock(&hb->lock);
 		goto pi_faulted;
+	}
 
 	/*
 	 * If uval has changed, let user space handle it.
@@ -2707,7 +2712,6 @@ static int futex_unlock_pi(u32 __user *u
 	return ret;
 
 pi_faulted:
-	spin_unlock(&hb->lock);
 	put_futex_key(&key);
 
 	ret = fault_in_user_writeable(uaddr);
@@ -2937,7 +2941,7 @@ static int futex_wait_requeue_pi(u32 __u
 	 */
 	if (ret == -EFAULT) {
 		if (pi_mutex && rt_mutex_owner(pi_mutex) == current)
-			rt_mutex_unlock(pi_mutex);
+			rt_mutex_futex_unlock(pi_mutex);
 	} else if (ret == -EINTR) {
 		/*
 		 * We've already been requeued, but cannot restart by calling
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1422,15 +1422,23 @@ EXPORT_SYMBOL_GPL(rt_mutex_lock_interrup
 
 /*
  * Futex variant with full deadlock detection.
+ * Futex variants must not use the fast-path, see __rt_mutex_futex_unlock().
  */
-int rt_mutex_timed_futex_lock(struct rt_mutex *lock,
+int __sched rt_mutex_timed_futex_lock(struct rt_mutex *lock,
 			      struct hrtimer_sleeper *timeout)
 {
 	might_sleep();
 
-	return rt_mutex_timed_fastlock(lock, TASK_INTERRUPTIBLE, timeout,
-				       RT_MUTEX_FULL_CHAINWALK,
-				       rt_mutex_slowlock);
+	return rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE,
+				 timeout, RT_MUTEX_FULL_CHAINWALK);
+}
+
+/*
+ * Futex variant, must not use fastpath.
+ */
+int __sched rt_mutex_futex_trylock(struct rt_mutex *lock)
+{
+	return rt_mutex_slowtrylock(lock);
 }
 
 /**
@@ -1489,19 +1497,38 @@ void __sched rt_mutex_unlock(struct rt_m
 EXPORT_SYMBOL_GPL(rt_mutex_unlock);
 
 /**
- * rt_mutex_futex_unlock - Futex variant of rt_mutex_unlock
- * @lock: the rt_mutex to be unlocked
- *
- * Returns: true/false indicating whether priority adjustment is
- * required or not.
+ * Futex variant: since futex variants do not use the fast-path, this can be
+ * simple and will not need to retry.
  */
-bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock,
-				   struct wake_q_head *wqh)
+bool __sched __rt_mutex_futex_unlock(struct rt_mutex *lock,
+				    struct wake_q_head *wake_q)
+{
+	lockdep_assert_held(&lock->wait_lock);
+
+	debug_rt_mutex_unlock(lock);
+
+	if (!rt_mutex_has_waiters(lock)) {
+		lock->owner = NULL;
+		return false; /* done */
+	}
+
+	mark_wakeup_next_waiter(wake_q, lock);
+	return true; /* deboost and wakeups */
+}
+
+void __sched rt_mutex_futex_unlock(struct rt_mutex *lock)
 {
-	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL)))
-		return false;
+	WAKE_Q(wake_q);
+	bool deboost;
 
-	return rt_mutex_slowunlock(lock, wqh);
+	raw_spin_lock_irq(&lock->wait_lock);
+	deboost = __rt_mutex_futex_unlock(lock, &wake_q);
+	raw_spin_unlock_irq(&lock->wait_lock);
+
+	if (deboost) {
+		wake_up_q(&wake_q);
+		rt_mutex_adjust_prio(current);
+	}
 }
 
 /**
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -108,9 +108,14 @@ extern int rt_mutex_start_proxy_lock(str
 extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
 				      struct hrtimer_sleeper *to,
 				      struct rt_mutex_waiter *waiter);
+
 extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
-extern bool rt_mutex_futex_unlock(struct rt_mutex *lock,
-				  struct wake_q_head *wqh);
+extern int rt_mutex_futex_trylock(struct rt_mutex *l);
+
+extern void rt_mutex_futex_unlock(struct rt_mutex *lock);
+extern bool __rt_mutex_futex_unlock(struct rt_mutex *lock,
+				 struct wake_q_head *wqh);
+
 extern void rt_mutex_adjust_prio(struct task_struct *task);
 
 #ifdef CONFIG_DEBUG_RT_MUTEXES


* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-10-07 11:21   ` Peter Zijlstra
@ 2016-10-08 15:53     ` Thomas Gleixner
  2016-10-08 16:55       ` Peter Zijlstra
  2016-10-08 18:22     ` Thomas Gleixner
  2016-10-09 11:17     ` Thomas Gleixner
  2 siblings, 1 reply; 46+ messages in thread
From: Thomas Gleixner @ 2016-10-08 15:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Fri, 7 Oct 2016, Peter Zijlstra wrote:
> Solve all that by:
> 
>  - using futex specific rt_mutex calls that lack the fastpath, futexes
>    have their own fastpath anyway. This makes that
>    rt_mutex_futex_unlock() doesn't need to drop rt_mutex::wait_lock
>    and the unlock is guaranteed if we manage to update user state.
> 
>  - make futex_unlock_pi() drop hb->lock early and only use
>    rt_mutex::wait_lock to serialize against rt_mutex waiters
>    update the futex value and unlock.
> 
>  - in case futex and rt_mutex disagree on waiters, side with rt_mutex
>    and simply clear the user value. This works because either there
>    really are no waiters left, or futex_lock_pi() triggers the
>    lock-steal path and fixes up the WAITERS flag.

I stared at this for a few hours and while I'm not yet done analyzing all
possible combinations I found at least one thing which is broken:

CPU 0				CPU 1

unlock_pi(f)
  ....
  unlock(hb->lock)
  *f = new_owner_tid | WAITERS;

				lock_pi(f) 
				  lock(hb->lock)
				  uval = *f;
				  topwaiter = futex_top_waiter();
				    attach_to_pi_state(uval, topwaiter->pistate);
				      pid = uval & TID_MASK;
				      if (pid != task_pid_vnr(pistate->owner))
				      	 return -EINVAL;
  ....
  pistate->owner = newowner;

So in this case we tell the caller on CPU 1 that the futex is in
inconsistent state, because pistate->owner still points to the unlocking
task while the user space value already shows the new owner. So this sanity
check triggers and we simply fail while we should not. It's [10] in the
state matrix above attach_to_pi_state().

I suspect that there are more issues like this, especially since I did not
look at requeue_pi yet, but by now my brain is completely fried.

Thanks,

	tglx


* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-10-08 15:53     ` Thomas Gleixner
@ 2016-10-08 16:55       ` Peter Zijlstra
  2016-10-08 17:06         ` Thomas Gleixner
  2016-10-10 10:17         ` Thomas Gleixner
  0 siblings, 2 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-08 16:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Sat, Oct 08, 2016 at 05:53:49PM +0200, Thomas Gleixner wrote:
> On Fri, 7 Oct 2016, Peter Zijlstra wrote:
> > Solve all that by:
> > 
> >  - using futex specific rt_mutex calls that lack the fastpath, futexes
> >    have their own fastpath anyway. This makes that
> >    rt_mutex_futex_unlock() doesn't need to drop rt_mutex::wait_lock
> >    and the unlock is guaranteed if we manage to update user state.
> > 
> >  - make futex_unlock_pi() drop hb->lock early and only use
> >    rt_mutex::wait_lock to serialize against rt_mutex waiters
> >    update the futex value and unlock.
> > 
> >  - in case futex and rt_mutex disagree on waiters, side with rt_mutex
> >    and simply clear the user value. This works because either there
> >    really are no waiters left, or futex_lock_pi() triggers the
> >    lock-steal path and fixes up the WAITERS flag.
> 
> I stared at this for a few hours and while I'm not yet done analyzing all
> possible combinations I found at least one thing which is broken:
> 
> CPU 0				CPU 1
> 
> unlock_pi(f)
>   ....
>   unlock(hb->lock)
>   *f = new_owner_tid | WAITERS;
> 
> 				lock_pi(f) 
> 				  lock(hb->lock)
> 				  uval = *f;
> 				  topwaiter = futex_top_waiter();
> 				    attach_to_pi_state(uval, topwaiter->pistate);
> 				      pid = uval & TID_MASK;
> 				      if (pid != task_pid_vnr(pistate->owner))
> 				      	 return -EINVAL;
>   ....
>   pistate->owner = newowner;
> 
> So in this case we tell the caller on CPU 1 that the futex is in
> inconsistent state, because pistate->owner still points to the unlocking
> task while the user space value already shows the new owner. So this sanity
> check triggers and we simply fail while we should not. It's [10] in the
> state matrix above attach_to_pi_state().

Urgh, yes. I think I can cure that, by taking
pi_state->pi_mutex.wait_lock in attach_to_pi_state(), but blergh.

> I suspect that there are more issues like this, especially since I did not
> look at requeue_pi yet, but by now my brain is completely fried.

Yes, I know about fried brains :-( This stuff has far too many moving
parts. I've been staring at this stuff far too long.


Also, we need better tools to stress this stuff.


* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-10-08 16:55       ` Peter Zijlstra
@ 2016-10-08 17:06         ` Thomas Gleixner
  2016-10-10 10:17         ` Thomas Gleixner
  1 sibling, 0 replies; 46+ messages in thread
From: Thomas Gleixner @ 2016-10-08 17:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Sat, 8 Oct 2016, Peter Zijlstra wrote:
> > So in this case we tell the caller on CPU 1 that the futex is in
> > inconsistent state, because pistate->owner still points to the unlocking
> > task while the user space value already shows the new owner. So this sanity
> > check triggers and we simply fail while we should not. It's [10] in the
> > state matrix above attach_to_pi_state().
> 
> Urgh, yes. I think I can cure that, by taking
> pi_state->pi_mutex.wait_lock in attach_to_pi_state(), but blergh.
> 
> > I suspect that there are more issues like this, especially since I did not
> > look at requeue_pi yet, but by now my brain is completely fried.
> 
> Yes, I know about fried brains :-( This stuff has far too many moving
> parts. I've been staring at this stuff far too long.
> 
> Also, we need better tools to stress this stuff.

I tried to come up with something which forces all corner cases of this
years ago and failed. The state space seems to be infinite.

Thanks,

	tglx


* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-10-07 11:21   ` Peter Zijlstra
  2016-10-08 15:53     ` Thomas Gleixner
@ 2016-10-08 18:22     ` Thomas Gleixner
  2016-10-09 11:17     ` Thomas Gleixner
  2 siblings, 0 replies; 46+ messages in thread
From: Thomas Gleixner @ 2016-10-08 18:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Fri, 7 Oct 2016, Peter Zijlstra wrote:
> +		/*
> +		 * Grab a reference on the pi_state and drop hb->lock.
> +		 *
> +		 * The reference ensures pi_state lives, dropping the hb->lock
> +		 * is tricky.. wake_futex_pi() will take rt_mutex::wait_lock to
> +		 * close the races against futex_lock_pi(), but in case of
> +		 * _any_ fail we'll abort and retry the whole deal.
> +		 */
> +		WARN_ON_ONCE(!atomic_inc_not_zero(&pi_state->refcount));
> +		spin_unlock(&hb->lock);
> +
> +		ret = wake_futex_pi(uaddr, uval, pi_state);
> +
> +		put_pi_state(pi_state);
> +
>  		/*
>  		 * In case of success wake_futex_pi dropped the hash
>  		 * bucket lock.

This comment has become stale.

Thanks,

	tglx


* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-10-07 11:21   ` Peter Zijlstra
  2016-10-08 15:53     ` Thomas Gleixner
  2016-10-08 18:22     ` Thomas Gleixner
@ 2016-10-09 11:17     ` Thomas Gleixner
  2016-10-10 14:06       ` Peter Zijlstra
  2 siblings, 1 reply; 46+ messages in thread
From: Thomas Gleixner @ 2016-10-09 11:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Fri, 7 Oct 2016, Peter Zijlstra wrote:
>  	top_waiter = futex_top_waiter(hb, &key);
>  	if (top_waiter) {
> -		ret = wake_futex_pi(uaddr, uval, top_waiter, hb);
> +		struct futex_pi_state *pi_state = top_waiter->pi_state;
> +
> +		ret = -EINVAL;
> +		if (!pi_state)
> +			goto out_unlock;
> +
> +		/*
> +		 * If current does not own the pi_state then the futex is
> +		 * inconsistent and user space fiddled with the futex value.
> +		 */
> +		if (pi_state->owner != current)
> +			goto out_unlock;
> +
> +		/*
> +		 * Grab a reference on the pi_state and drop hb->lock.
> +		 *
> +		 * The reference ensures pi_state lives, dropping the hb->lock
> +		 * is tricky.. wake_futex_pi() will take rt_mutex::wait_lock to
> +		 * close the races against futex_lock_pi(), but in case of
> +		 * _any_ fail we'll abort and retry the whole deal.
> +		 */
> +		WARN_ON_ONCE(!atomic_inc_not_zero(&pi_state->refcount));
> +		spin_unlock(&hb->lock);
> +
> +		ret = wake_futex_pi(uaddr, uval, pi_state);
> +
> +		put_pi_state(pi_state);

put_pi_state() requires hb->lock protection AFAICT.

CPU0	       			 CPU1

    wake_futex_pi()		 attach_to_pi_state()
    put_pi_state()
	refcount--;	
	if (!refcount)		
	    free_state();	
	    			WARN_ON(!pi_state->refcount);

we might not see the warning, but in any case the following access to
pi_state on cpu1 is borked.

Thanks,

	tglx


   	     	     	      	


* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-10-08 16:55       ` Peter Zijlstra
  2016-10-08 17:06         ` Thomas Gleixner
@ 2016-10-10 10:17         ` Thomas Gleixner
  2016-10-10 11:40           ` Peter Zijlstra
  2016-10-21 12:27           ` Peter Zijlstra
  1 sibling, 2 replies; 46+ messages in thread
From: Thomas Gleixner @ 2016-10-10 10:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Sat, 8 Oct 2016, Peter Zijlstra wrote:
> On Sat, Oct 08, 2016 at 05:53:49PM +0200, Thomas Gleixner wrote:
> > On Fri, 7 Oct 2016, Peter Zijlstra wrote:
> > > Solve all that by:
> > > 
> > >  - using futex specific rt_mutex calls that lack the fastpath, futexes
> > >    have their own fastpath anyway. This makes that
> > >    rt_mutex_futex_unlock() doesn't need to drop rt_mutex::wait_lock
> > >    and the unlock is guaranteed if we manage to update user state.
> > > 
> > >  - make futex_unlock_pi() drop hb->lock early and only use
> > >    rt_mutex::wait_lock to serialize against rt_mutex waiters
> > >    update the futex value and unlock.
> > > 
> > >  - in case futex and rt_mutex disagree on waiters, side with rt_mutex
> > >    and simply clear the user value. This works because either there
> > >    really are no waiters left, or futex_lock_pi() triggers the
> > >    lock-steal path and fixes up the WAITERS flag.
> > 
> > I stared at this for a few hours and while I'm not yet done analyzing all
> > possible combinations I found at least one thing which is broken:
> > 
> > CPU 0				CPU 1
> > 
> > unlock_pi(f)
> >   ....
> >   unlock(hb->lock)
> >   *f = new_owner_tid | WAITERS;
> > 
> > 				lock_pi(f) 
> > 				  lock(hb->lock)
> > 				  uval = *f;
> > 				  topwaiter = futex_top_waiter();
> > 				    attach_to_pi_state(uval, topwaiter->pistate);
> > 				      pid = uval & TID_MASK;
> > 				      if (pid != task_pid_vnr(pistate->owner))
> > 				      	 return -EINVAL;
> >   ....
> >   pistate->owner = newowner;
> > 
> > So in this case we tell the caller on CPU 1 that the futex is in
> > inconsistent state, because pistate->owner still points to the unlocking
> > task while the user space value alread shows the new owner. So this sanity
> > check triggers and we simply fail while we should not. It's [10] in the
> > state matrix above attach_to_pi_state().
> 
> Urgh, yes. I think I can cure that, by taking
> pi_state->pi_mutex.wait_lock in attach_to_pi_state(), but blergh.

There is another problem with all that racing against fixup_owner()
resp. fixup_pi_state_owner().

I fear, we need to rethink this whole locking/protection scheme from
scratch.

Thanks,

	tglx


* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-10-10 10:17         ` Thomas Gleixner
@ 2016-10-10 11:40           ` Peter Zijlstra
  2016-10-21 12:27           ` Peter Zijlstra
  1 sibling, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-10 11:40 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Mon, Oct 10, 2016 at 12:17:48PM +0200, Thomas Gleixner wrote:
> There is another problem with all that racing against fixup_owner()
> resp. fixup_pi_state_owner().
> 
> I fear, we need to rethink this whole locking/protection scheme from
> scratch.


So for pi_state (ie, the attach_to_pi_state() vs put_pi_state() race) I
can see two options: either we re-take hb->lock after we've completed
the futex_unlock_pi() in order to drop it, which is cringeworthy, or we
make pi_state RCU-freed and replace that WARN_ON() in
attach_to_pi_state() with an atomic_inc_not_zero() and deal with the
fail case by going back to the caller and treating it like !top_waiter.


As to the rt_mutex vs futex state coherence, I think I can do all of
that with rt_mutex::wait_lock held, the alternative is doing part with
hb->lock and part with rt_mutex::wait_lock and parts with both, but
that's already hurting my head and I didn't even try yet.


Let me think a wee bit more on this..


* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-10-09 11:17     ` Thomas Gleixner
@ 2016-10-10 14:06       ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-10 14:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Sun, Oct 09, 2016 at 01:17:50PM +0200, Thomas Gleixner wrote:
> On Fri, 7 Oct 2016, Peter Zijlstra wrote:
> >  	top_waiter = futex_top_waiter(hb, &key);
> >  	if (top_waiter) {
> > -		ret = wake_futex_pi(uaddr, uval, top_waiter, hb);
> > +		struct futex_pi_state *pi_state = top_waiter->pi_state;
> > +
> > +		ret = -EINVAL;
> > +		if (!pi_state)
> > +			goto out_unlock;
> > +
> > +		/*
> > +		 * If current does not own the pi_state then the futex is
> > +		 * inconsistent and user space fiddled with the futex value.
> > +		 */
> > +		if (pi_state->owner != current)
> > +			goto out_unlock;
> > +
> > +		/*
> > +		 * Grab a reference on the pi_state and drop hb->lock.
> > +		 *
> > +		 * The reference ensures pi_state lives, dropping the hb->lock
> > +		 * is tricky.. wake_futex_pi() will take rt_mutex::wait_lock to
> > +		 * close the races against futex_lock_pi(), but in case of
> > +		 * _any_ fail we'll abort and retry the whole deal.
> > +		 */
> > +		WARN_ON_ONCE(!atomic_inc_not_zero(&pi_state->refcount));
> > +		spin_unlock(&hb->lock);
> > +
> > +		ret = wake_futex_pi(uaddr, uval, pi_state);
> > +
> > +		put_pi_state(pi_state);
> 
> put_pi_state() requires hb->lock protection AFAICT.
> 
> CPU0	       			 CPU1
> 
>     wake_futex_pi()		 attach_to_pi_state()
>     put_pi_state()
> 	refcount--;	
> 	if (!refcount)		
> 	    free_state();	
> 	    			WARN_ON(!pi_state->refcount);
> 
> we might not see the warning, but in any case the following access to
> pi_state on cpu1 is borked.

Not sure this can happen: we do all attach_to_pi_state() with hb->lock
held, and the only way to get there is through futex_q->pi_state. And as
long as that link is stable, pi_state is too.

That is, the only way for wake_futex_pi() to drop the last reference is
if there are no futex_q's referencing it anymore, but that also means
attach_to_pi_state() cannot happen (!top_waiter).


* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-10-10 10:17         ` Thomas Gleixner
  2016-10-10 11:40           ` Peter Zijlstra
@ 2016-10-21 12:27           ` Peter Zijlstra
  2016-10-27 20:36             ` Thomas Gleixner
  1 sibling, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-21 12:27 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Mon, Oct 10, 2016 at 12:17:48PM +0200, Thomas Gleixner wrote:
> I fear, we need to rethink this whole locking/protection scheme from
> scratch.

Here goes... as discussed at ELCE, this serializes the {uval,
pi_state} state using pi_mutex->wait_lock.

I did my best to reason about requeue_pi, but I did suffer some stack
overflows. That said, I audited all places pi_state is used/modified
and applied consistent locking and/or comments.

The patch could possibly be broken up in 3 parts, namely:

 - introduce the futex specific rt_mutex wrappers that avoid the
   regular rt_mutex fast path (and isolate futex from the normal
   rt_mutex interface).

 - add the pi_mutex->wait_lock locking, which at that point will be
   entirely redundant.

 - rework the unlock_pi path to drop hb->lock.

Let me know if you would prefer to see that. For review I think having
all that in a single patch is easier to reason about.

The only thing I didn't do is update the Changelog. I would like to
leave that until later, when I've managed to convince you this approach
is indeed viable.

Please have a look.

---

Subject: futex: Rewrite FUTEX_UNLOCK_PI
From: Peter Zijlstra <peterz@infradead.org>
Date: Sun Oct 2 18:42:33 CEST 2016

There's a number of 'interesting' problems with FUTEX_UNLOCK_PI, all
caused by holding hb->lock while doing the rt_mutex_unlock()
equivalent.

Notably:
 - a PI inversion on hb->lock
 - DL crash because of pointer instability.

This patch doesn't attempt to fix any of the actual problems, but
instead reworks the code to not hold hb->lock across the unlock,
paving the way to actually fix the problems later.

The current reason we hold hb->lock over unlock is that it serializes
against FUTEX_LOCK_PI and keeps new waiters from coming in; this then
ensures the rt_mutex_next_owner() value is stable and can be written
into the user-space futex value before doing the unlock, such that the
unlock will indeed end up at new_owner.

This patch recognises that holding rt_mutex::wait_lock results in the
very same guarantee, no new waiters can come in while we hold that
lock -- after all, waiters would need this lock to queue themselves.

This (of course) is not entirely straightforward either; see the
comment in rt_mutex_slowunlock(): doing the unlock itself might drop
wait_lock, letting new waiters in.

Another problem is the case where futex_lock_pi() failed to acquire
the lock (ie. released rt_mutex::wait_lock) but hasn't yet re-acquired
hb->lock and called unqueue_me_pi(). In this case we're confused about
having waiters (the futex state says yes, the rt_mutex state says no).

The current solution is to assign the futex to the waiter from the
futex state, and have futex_lock_pi() detect this and try and fix it
up. This again, all relies on hb->lock serializing things.


Solve all that by:

 - using futex specific rt_mutex calls that lack the fastpath, futexes
   have their own fastpath anyway. This makes that
   rt_mutex_futex_unlock() doesn't need to drop rt_mutex::wait_lock
   and the unlock is guaranteed if we manage to update user state.

 - make futex_unlock_pi() drop hb->lock early and only use
   rt_mutex::wait_lock to serialize against rt_mutex waiters
   update the futex value and unlock.

 - in case futex and rt_mutex disagree on waiters, side with rt_mutex
   and simply clear the user value. This works because either there
   really are no waiters left, or futex_lock_pi() triggers the
   lock-steal path and fixes up the WAITERS flag.


XXX update changelog..

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 futex.c                  |  340 ++++++++++++++++++++++++++++++-----------------
 locking/rtmutex.c        |   55 +++++--
 locking/rtmutex_common.h |    9 -
 3 files changed, 270 insertions(+), 134 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -914,7 +914,7 @@ void exit_pi_state_list(struct task_stru
 		pi_state->owner = NULL;
 		raw_spin_unlock_irq(&curr->pi_lock);
 
-		rt_mutex_unlock(&pi_state->pi_mutex);
+		rt_mutex_futex_unlock(&pi_state->pi_mutex);
 
 		spin_unlock(&hb->lock);
 
@@ -963,7 +963,9 @@ void exit_pi_state_list(struct task_stru
  *
  * [7]	pi_state->owner can only be NULL when the OWNER_DIED bit is set.
  *
- * [8]	Owner and user space value match
+ * [8]	Owner and user space value match; there is a special case in
+ *	wake_futex_pi() where we use pi_state->owner = &init_task to
+ *	make this true for TID 0 and avoid rules 9 and 10.
  *
  * [9]	There is no transient state which sets the user space TID to 0
  *	except exit_robust_list(), but this is indicated by the
@@ -971,6 +973,39 @@ void exit_pi_state_list(struct task_stru
  *
  * [10] There is no transient state which leaves owner and user space
  *	TID out of sync.
+ *
+ *
+ * Serialization and lifetime rules:
+ *
+ * hb->lock:
+ *
+ *	hb -> futex_q, relation
+ *	futex_q -> pi_state, relation
+ *
+ *	(cannot be raw because hb can contain an arbitrary
+ *	 number of futex_q's)
+ *
+ * pi_mutex->wait_lock:
+ *
+ *	{uval, pi_state}
+ *
+ *	(and pi_mutex 'obviously')
+ *
+ * p->pi_lock:
+ *
+ * 	p->pi_state_list -> pi_state->list, relation
+ *
+ * pi_state->refcount:
+ *
+ *	pi_state lifetime
+ *
+ *
+ * Lock order:
+ *
+ *   hb->lock
+ *     pi_mutex->wait_lock
+ *       p->pi_lock
+ *
  */
 
 /*
@@ -978,10 +1013,12 @@ void exit_pi_state_list(struct task_stru
  * the pi_state against the user space value. If correct, attach to
  * it.
  */
-static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
+static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
+			      struct futex_pi_state *pi_state,
 			      struct futex_pi_state **ps)
 {
 	pid_t pid = uval & FUTEX_TID_MASK;
+	int ret, uval2;
 
 	/*
 	 * Userspace might have messed up non-PI and PI futexes [3]
@@ -989,9 +1026,37 @@ static int attach_to_pi_state(u32 uval,
 	if (unlikely(!pi_state))
 		return -EINVAL;
 
+	/*
+	 * We get here with hb->lock held, and having found a
+	 * futex_top_waiter(). This means that futex_lock_pi() of said futex_q
+	 * has dropped the hb->lock in between queue_me() and unqueue_me_pi(),
+	 * which in turn means that futex_lock_pi() still has a reference on
+	 * our pi_state.
+	 *
+	 * IOW, we cannot race against the unlocked put_pi_state() in
+	 * futex_unlock_pi().
+	 */
 	WARN_ON(!atomic_read(&pi_state->refcount));
 
 	/*
+	 * Now that we have a pi_state, we can acquire wait_lock
+	 * and do the state validation.
+	 */
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
+
+	/*
+	 * Since {uval, pi_state} is serialized by wait_lock, and our current
+	 * uval was read without holding it, it can have changed. Verify it
+	 * still is what we expect it to be, otherwise retry the entire
+	 * operation.
+	 */
+	if (get_futex_value_locked(&uval2, uaddr))
+		goto out_efault;
+
+	if (uval != uval2)
+		goto out_eagain;
+
+	/*
 	 * Handle the owner died case:
 	 */
 	if (uval & FUTEX_OWNER_DIED) {
@@ -1006,11 +1071,11 @@ static int attach_to_pi_state(u32 uval,
 			 * is not 0. Inconsistent state. [5]
 			 */
 			if (pid)
-				return -EINVAL;
+				goto out_einval;
 			/*
 			 * Take a ref on the state and return success. [4]
 			 */
-			goto out_state;
+			goto out_attach;
 		}
 
 		/*
@@ -1022,14 +1087,14 @@ static int attach_to_pi_state(u32 uval,
 		 * Take a ref on the state and return success. [6]
 		 */
 		if (!pid)
-			goto out_state;
+			goto out_attach;
 	} else {
 		/*
 		 * If the owner died bit is not set, then the pi_state
 		 * must have an owner. [7]
 		 */
 		if (!pi_state->owner)
-			return -EINVAL;
+			goto out_einval;
 	}
 
 	/*
@@ -1038,11 +1103,29 @@ static int attach_to_pi_state(u32 uval,
 	 * user space TID. [9/10]
 	 */
 	if (pid != task_pid_vnr(pi_state->owner))
-		return -EINVAL;
-out_state:
+		goto out_einval;
+
+out_attach:
 	atomic_inc(&pi_state->refcount);
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 	*ps = pi_state;
 	return 0;
+
+out_einval:
+	ret = -EINVAL;
+	goto out_error;
+
+out_eagain:
+	ret = -EAGAIN;
+	goto out_error;
+
+out_efault:
+	ret = -EFAULT;
+	goto out_error;
+
+out_error:
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+	return ret;
 }
 
 /*
@@ -1093,6 +1176,9 @@ static int attach_to_pi_owner(u32 uval,
 
 	/*
 	 * No existing pi state. First waiter. [2]
+	 *
+	 * This creates pi_state, we have hb->lock held, this means nothing can
+	 * observe this state, wait_lock is irrelevant.
 	 */
 	pi_state = alloc_pi_state();
 
@@ -1117,7 +1203,8 @@ static int attach_to_pi_owner(u32 uval,
 	return 0;
 }
 
-static int lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
+static int lookup_pi_state(u32 __user *uaddr, u32 uval,
+			   struct futex_hash_bucket *hb,
 			   union futex_key *key, struct futex_pi_state **ps)
 {
 	struct futex_q *top_waiter = futex_top_waiter(hb, key);
@@ -1127,7 +1214,7 @@ static int lookup_pi_state(u32 uval, str
 	 * attach to the pi_state when the validation succeeds.
 	 */
 	if (top_waiter)
-		return attach_to_pi_state(uval, top_waiter->pi_state, ps);
+		return attach_to_pi_state(uaddr, uval, top_waiter->pi_state, ps);
 
 	/*
 	 * We are the first waiter - try to look up the owner based on
@@ -1146,7 +1233,7 @@ static int lock_pi_update_atomic(u32 __u
 	if (unlikely(cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)))
 		return -EFAULT;
 
-	/*If user space value changed, let the caller retry */
+	/* If user space value changed, let the caller retry */
 	return curval != uval ? -EAGAIN : 0;
 }
 
@@ -1202,7 +1289,7 @@ static int futex_lock_pi_atomic(u32 __us
 	 */
 	top_waiter = futex_top_waiter(hb, key);
 	if (top_waiter)
-		return attach_to_pi_state(uval, top_waiter->pi_state, ps);
+		return attach_to_pi_state(uaddr, uval, top_waiter->pi_state, ps);
 
 	/*
 	 * No waiter and user TID is 0. We are here because the
@@ -1291,49 +1378,58 @@ static void mark_wake_futex(struct wake_
 	smp_store_release(&q->lock_ptr, NULL);
 }
 
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
-			 struct futex_hash_bucket *hb)
+static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
 {
-	struct task_struct *new_owner;
-	struct futex_pi_state *pi_state = top_waiter->pi_state;
 	u32 uninitialized_var(curval), newval;
+	struct task_struct *new_owner;
+	bool deboost = false;
 	WAKE_Q(wake_q);
-	bool deboost;
 	int ret = 0;
 
-	if (!pi_state)
-		return -EINVAL;
-
-	/*
-	 * If current does not own the pi_state then the futex is
-	 * inconsistent and user space fiddled with the futex value.
-	 */
-	if (pi_state->owner != current)
-		return -EINVAL;
-
 	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
-	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
 
-	/*
-	 * It is possible that the next waiter (the one that brought
-	 * top_waiter owner to the kernel) timed out and is no longer
-	 * waiting on the lock.
-	 */
-	if (!new_owner)
-		new_owner = top_waiter->task;
-
-	/*
-	 * We pass it to the next owner. The WAITERS bit is always
-	 * kept enabled while there is PI state around. We cleanup the
-	 * owner died bit, because we are the owner.
-	 */
-	newval = FUTEX_WAITERS | task_pid_vnr(new_owner);
+	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
+	if (!new_owner) {
+		/*
+		 * This is the case where futex_lock_pi() has not yet or failed
+		 * to acquire the lock but still has the futex_q enqueued. So
+		 * the futex state has a 'waiter' while the rt_mutex state does
+		 * not.
+		 *
+		 * Even though there still is pi_state for this futex, we can
+		 * clear FUTEX_WAITERS. Either:
+		 *
+		 *  - we or futex_lock_pi() will drop the last reference and
+		 *    clean up this pi_state,
+		 *
+		 *  - userspace acquires the futex through its fastpath
+		 *    and the above pi_state cleanup still happens,
+		 *
+		 *  - or futex_lock_pi() will re-set the WAITERS bit in
+		 *    fixup_owner().
+		 */
+		newval = 0;
+		/*
+		 * Since pi_state->owner must point to a valid task, and
+		 * task_pid_vnr(pi_state->owner) must match TID_MASK, use
+		 * init_task.
+		 */
+		new_owner = &init_task;
+	} else {
+		/*
+		 * We pass it to the next owner. The WAITERS bit is always kept
+		 * enabled while there is PI state around. We cleanup the owner
+		 * died bit, because we are the owner.
+		 */
+		newval = FUTEX_WAITERS | task_pid_vnr(new_owner);
+	}
 
 	if (unlikely(should_fail_futex(true)))
 		ret = -EFAULT;
 
 	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)) {
 		ret = -EFAULT;
+
 	} else if (curval != uval) {
 		/*
 		 * If a unconditional UNLOCK_PI operation (user space did not
@@ -1346,10 +1442,9 @@ static int wake_futex_pi(u32 __user *uad
 		else
 			ret = -EINVAL;
 	}
-	if (ret) {
-		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
-		return ret;
-	}
+
+	if (ret)
+		goto out_unlock;
 
 	raw_spin_lock(&pi_state->owner->pi_lock);
 	WARN_ON(list_empty(&pi_state->list));
@@ -1362,22 +1457,20 @@ static int wake_futex_pi(u32 __user *uad
 	pi_state->owner = new_owner;
 	raw_spin_unlock(&new_owner->pi_lock);
 
-	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
-
-	deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
-
 	/*
-	 * First unlock HB so the waiter does not spin on it once he got woken
-	 * up. Second wake up the waiter before the priority is adjusted. If we
-	 * deboost first (and lose our higher priority), then the task might get
-	 * scheduled away before the wake up can take place.
+	 * We've updated the uservalue, this unlock cannot fail.
 	 */
-	spin_unlock(&hb->lock);
-	wake_up_q(&wake_q);
-	if (deboost)
+	deboost = __rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
+
+out_unlock:
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+
+	if (deboost) {
+		wake_up_q(&wake_q);
 		rt_mutex_adjust_prio(current);
+	}
 
-	return 0;
+	return ret;
 }
 
 /*
@@ -1823,7 +1916,7 @@ static int futex_requeue(u32 __user *uad
 			 * If that call succeeds then we have pi_state and an
 			 * initial refcount on it.
 			 */
-			ret = lookup_pi_state(ret, hb2, &key2, &pi_state);
+			ret = lookup_pi_state(uaddr2, ret, hb2, &key2, &pi_state);
 		}
 
 		switch (ret) {
@@ -2122,10 +2215,13 @@ static int fixup_pi_state_owner(u32 __us
 {
 	u32 newtid = task_pid_vnr(newowner) | FUTEX_WAITERS;
 	struct futex_pi_state *pi_state = q->pi_state;
-	struct task_struct *oldowner = pi_state->owner;
 	u32 uval, uninitialized_var(curval), newval;
+	struct task_struct *oldowner;
 	int ret;
 
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
+
+	oldowner = pi_state->owner;
 	/* Owner died? */
 	if (!pi_state->owner)
 		newtid |= FUTEX_OWNER_DIED;
@@ -2141,11 +2237,10 @@ static int fixup_pi_state_owner(u32 __us
 	 * because we can fault here. Imagine swapped out pages or a fork
 	 * that marked all the anonymous memory readonly for cow.
 	 *
-	 * Modifying pi_state _before_ the user space value would
-	 * leave the pi_state in an inconsistent state when we fault
-	 * here, because we need to drop the hash bucket lock to
-	 * handle the fault. This might be observed in the PID check
-	 * in lookup_pi_state.
+	 * Modifying pi_state _before_ the user space value would leave the
+	 * pi_state in an inconsistent state when we fault here, because we
+	 * need to drop the locks to handle the fault. This might be observed
+	 * in the PID check in lookup_pi_state.
 	 */
 retry:
 	if (get_futex_value_locked(&uval, uaddr))
@@ -2166,36 +2261,43 @@ static int fixup_pi_state_owner(u32 __us
 	 * itself.
 	 */
 	if (pi_state->owner != NULL) {
-		raw_spin_lock_irq(&pi_state->owner->pi_lock);
+		raw_spin_lock(&pi_state->owner->pi_lock);
 		WARN_ON(list_empty(&pi_state->list));
 		list_del_init(&pi_state->list);
-		raw_spin_unlock_irq(&pi_state->owner->pi_lock);
+		raw_spin_unlock(&pi_state->owner->pi_lock);
 	}
 
 	pi_state->owner = newowner;
 
-	raw_spin_lock_irq(&newowner->pi_lock);
+	raw_spin_lock(&newowner->pi_lock);
 	WARN_ON(!list_empty(&pi_state->list));
 	list_add(&pi_state->list, &newowner->pi_state_list);
-	raw_spin_unlock_irq(&newowner->pi_lock);
+	raw_spin_unlock(&newowner->pi_lock);
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+
 	return 0;
 
 	/*
-	 * To handle the page fault we need to drop the hash bucket
-	 * lock here. That gives the other task (either the highest priority
-	 * waiter itself or the task which stole the rtmutex) the
-	 * chance to try the fixup of the pi_state. So once we are
-	 * back from handling the fault we need to check the pi_state
-	 * after reacquiring the hash bucket lock and before trying to
-	 * do another fixup. When the fixup has been done already we
-	 * simply return.
+	 * To handle the page fault we need to drop the locks here. That gives
+	 * the other task (either the highest priority waiter itself or the
+	 * task which stole the rtmutex) the chance to try the fixup of the
+	 * pi_state. So once we are back from handling the fault we need to
+	 * check the pi_state after reacquiring the locks and before trying to
+	 * do another fixup. When the fixup has been done already we simply
+	 * return.
+	 *
+	 * Note: we hold both hb->lock and pi_mutex->wait_lock. We can safely
+	 * drop hb->lock since the caller owns the hb -> futex_q relation.
+	 * Dropping the pi_mutex->wait_lock requires the state revalidate.
 	 */
 handle_fault:
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 	spin_unlock(q->lock_ptr);
 
 	ret = fault_in_user_writeable(uaddr);
 
 	spin_lock(q->lock_ptr);
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 
 	/*
 	 * Check if someone else fixed it for us:
@@ -2228,45 +2330,20 @@ static long futex_wait_restart(struct re
  */
 static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked)
 {
-	struct task_struct *owner;
 	int ret = 0;
 
 	if (locked) {
 		/*
 		 * Got the lock. We might not be the anticipated owner if we
 		 * did a lock-steal - fix up the PI-state in that case:
+		 *
+		 * We can safely read pi_state->owner without holding wait_lock
+		 * because we now own the rt_mutex, only the owner will attempt
+		 * to change it.
 		 */
 		if (q->pi_state->owner != current)
 			ret = fixup_pi_state_owner(uaddr, q, current);
-		goto out;
-	}
 
-	/*
-	 * Catch the rare case, where the lock was released when we were on the
-	 * way back before we locked the hash bucket.
-	 */
-	if (q->pi_state->owner == current) {
-		/*
-		 * Try to get the rt_mutex now. This might fail as some other
-		 * task acquired the rt_mutex after we removed ourself from the
-		 * rt_mutex waiters list.
-		 */
-		if (rt_mutex_trylock(&q->pi_state->pi_mutex)) {
-			locked = 1;
-			goto out;
-		}
-
-		/*
-		 * pi_state is incorrect, some other task did a lock steal and
-		 * we returned due to timeout or signal without taking the
-		 * rt_mutex. Too late.
-		 */
-		raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock);
-		owner = rt_mutex_owner(&q->pi_state->pi_mutex);
-		if (!owner)
-			owner = rt_mutex_next_owner(&q->pi_state->pi_mutex);
-		raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock);
-		ret = fixup_pi_state_owner(uaddr, q, owner);
 		goto out;
 	}
 
@@ -2274,11 +2351,12 @@ static int fixup_owner(u32 __user *uaddr
 	 * Paranoia check. If we did not take the lock, then we should not be
 	 * the owner of the rt_mutex.
 	 */
-	if (rt_mutex_owner(&q->pi_state->pi_mutex) == current)
+	if (rt_mutex_owner(&q->pi_state->pi_mutex) == current) {
 		printk(KERN_ERR "fixup_owner: ret = %d pi-mutex: %p "
 				"pi-state %p\n", ret,
 				q->pi_state->pi_mutex.owner,
 				q->pi_state->owner);
+	}
 
 out:
 	return ret ? ret : locked;
@@ -2566,7 +2644,7 @@ static int futex_lock_pi(u32 __user *uad
 	if (!trylock) {
 		ret = rt_mutex_timed_futex_lock(&q.pi_state->pi_mutex, to);
 	} else {
-		ret = rt_mutex_trylock(&q.pi_state->pi_mutex);
+		ret = rt_mutex_futex_trylock(&q.pi_state->pi_mutex);
 		/* Fixup the trylock return value: */
 		ret = ret ? 0 : -EWOULDBLOCK;
 	}
@@ -2589,7 +2667,7 @@ static int futex_lock_pi(u32 __user *uad
 	 * it and return the fault to userspace.
 	 */
 	if (ret && (rt_mutex_owner(&q.pi_state->pi_mutex) == current))
-		rt_mutex_unlock(&q.pi_state->pi_mutex);
+		rt_mutex_futex_unlock(&q.pi_state->pi_mutex);
 
 	/* Unqueue and drop the lock */
 	unqueue_me_pi(&q);
@@ -2656,10 +2734,36 @@ static int futex_unlock_pi(u32 __user *u
 	 */
 	top_waiter = futex_top_waiter(hb, &key);
 	if (top_waiter) {
-		ret = wake_futex_pi(uaddr, uval, top_waiter, hb);
+		struct futex_pi_state *pi_state = top_waiter->pi_state;
+
+		ret = -EINVAL;
+		if (!pi_state)
+			goto out_unlock;
+
 		/*
-		 * In case of success wake_futex_pi dropped the hash
-		 * bucket lock.
+		 * If current does not own the pi_state then the futex is
+		 * inconsistent and user space fiddled with the futex value.
+		 */
+		if (pi_state->owner != current)
+			goto out_unlock;
+
+		/*
+		 * Grab a reference on the pi_state and drop hb->lock.
+		 *
+		 * The reference ensures pi_state lives, dropping the hb->lock
+		 * is tricky.. wake_futex_pi() will take rt_mutex::wait_lock to
+		 * close the races against futex_lock_pi(), but in case of
+		 * _any_ fail we'll abort and retry the whole deal.
+		 */
+		WARN_ON_ONCE(!atomic_inc_not_zero(&pi_state->refcount));
+		spin_unlock(&hb->lock);
+
+		ret = wake_futex_pi(uaddr, uval, pi_state);
+
+		put_pi_state(pi_state);
+
+		/*
+		 * Success, we're done! No tricky corner cases.
 		 */
 		if (!ret)
 			goto out_putkey;
@@ -2674,7 +2778,6 @@ static int futex_unlock_pi(u32 __user *u
 		 * setting the FUTEX_WAITERS bit. Try again.
 		 */
 		if (ret == -EAGAIN) {
-			spin_unlock(&hb->lock);
 			put_futex_key(&key);
 			goto retry;
 		}
@@ -2682,7 +2785,7 @@ static int futex_unlock_pi(u32 __user *u
 		 * wake_futex_pi has detected invalid state. Tell user
 		 * space.
 		 */
-		goto out_unlock;
+		goto out_putkey;
 	}
 
 	/*
@@ -2692,8 +2795,10 @@ static int futex_unlock_pi(u32 __user *u
 	 * preserve the WAITERS bit not the OWNER_DIED one. We are the
 	 * owner.
 	 */
-	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, 0))
+	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, 0)) {
+		spin_unlock(&hb->lock);
 		goto pi_faulted;
+	}
 
 	/*
 	 * If uval has changed, let user space handle it.
@@ -2707,7 +2812,6 @@ static int futex_unlock_pi(u32 __user *u
 	return ret;
 
 pi_faulted:
-	spin_unlock(&hb->lock);
 	put_futex_key(&key);
 
 	ret = fault_in_user_writeable(uaddr);
@@ -2937,7 +3041,7 @@ static int futex_wait_requeue_pi(u32 __u
 	 */
 	if (ret == -EFAULT) {
 		if (pi_mutex && rt_mutex_owner(pi_mutex) == current)
-			rt_mutex_unlock(pi_mutex);
+			rt_mutex_futex_unlock(pi_mutex);
 	} else if (ret == -EINTR) {
 		/*
 		 * We've already been requeued, but cannot restart by calling
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1422,15 +1422,23 @@ EXPORT_SYMBOL_GPL(rt_mutex_lock_interrup
 
 /*
  * Futex variant with full deadlock detection.
+ * Futex variants must not use the fast-path, see __rt_mutex_futex_unlock().
  */
-int rt_mutex_timed_futex_lock(struct rt_mutex *lock,
+int __sched rt_mutex_timed_futex_lock(struct rt_mutex *lock,
 			      struct hrtimer_sleeper *timeout)
 {
 	might_sleep();
 
-	return rt_mutex_timed_fastlock(lock, TASK_INTERRUPTIBLE, timeout,
-				       RT_MUTEX_FULL_CHAINWALK,
-				       rt_mutex_slowlock);
+	return rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE,
+				 timeout, RT_MUTEX_FULL_CHAINWALK);
+}
+
+/*
+ * Futex variant, must not use fastpath.
+ */
+int __sched rt_mutex_futex_trylock(struct rt_mutex *lock)
+{
+	return rt_mutex_slowtrylock(lock);
 }
 
 /**
@@ -1489,19 +1497,38 @@ void __sched rt_mutex_unlock(struct rt_m
 EXPORT_SYMBOL_GPL(rt_mutex_unlock);
 
 /**
- * rt_mutex_futex_unlock - Futex variant of rt_mutex_unlock
- * @lock: the rt_mutex to be unlocked
- *
- * Returns: true/false indicating whether priority adjustment is
- * required or not.
+ * Futex variant, that since futex variants do not use the fast-path, can be
+ * simple and will not need to retry.
  */
-bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock,
-				   struct wake_q_head *wqh)
+bool __sched __rt_mutex_futex_unlock(struct rt_mutex *lock,
+				    struct wake_q_head *wake_q)
+{
+	lockdep_assert_held(&lock->wait_lock);
+
+	debug_rt_mutex_unlock(lock);
+
+	if (!rt_mutex_has_waiters(lock)) {
+		lock->owner = NULL;
+		return false; /* done */
+	}
+
+	mark_wakeup_next_waiter(wake_q, lock);
+	return true; /* deboost and wakeups */
+}
+
+void __sched rt_mutex_futex_unlock(struct rt_mutex *lock)
 {
-	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL)))
-		return false;
+	WAKE_Q(wake_q);
+	bool deboost;
 
-	return rt_mutex_slowunlock(lock, wqh);
+	raw_spin_lock_irq(&lock->wait_lock);
+	deboost = __rt_mutex_futex_unlock(lock, &wake_q);
+	raw_spin_unlock_irq(&lock->wait_lock);
+
+	if (deboost) {
+		wake_up_q(&wake_q);
+		rt_mutex_adjust_prio(current);
+	}
 }
 
 /**
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -108,9 +108,14 @@ extern int rt_mutex_start_proxy_lock(str
 extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
 				      struct hrtimer_sleeper *to,
 				      struct rt_mutex_waiter *waiter);
+
 extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
-extern bool rt_mutex_futex_unlock(struct rt_mutex *lock,
-				  struct wake_q_head *wqh);
+extern int rt_mutex_futex_trylock(struct rt_mutex *l);
+
+extern void rt_mutex_futex_unlock(struct rt_mutex *lock);
+extern bool __rt_mutex_futex_unlock(struct rt_mutex *lock,
+				 struct wake_q_head *wqh);
+
 extern void rt_mutex_adjust_prio(struct task_struct *task);
 
 #ifdef CONFIG_DEBUG_RT_MUTEXES

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-10-21 12:27           ` Peter Zijlstra
@ 2016-10-27 20:36             ` Thomas Gleixner
  2016-11-23 19:20               ` Peter Zijlstra
  2016-11-25 14:09               ` Peter Zijlstra
  0 siblings, 2 replies; 46+ messages in thread
From: Thomas Gleixner @ 2016-10-27 20:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Fri, 21 Oct 2016, Peter Zijlstra wrote:
> On Mon, Oct 10, 2016 at 12:17:48PM +0200, Thomas Gleixner wrote:
> > I fear, we need to rethink this whole locking/protection scheme from
> > scratch.
> 
> Here goes... as discussed at ELCE this serializes the {uval,
> pi_state} state using pi_mutex->wait_lock.
> 
> I did my best to reason about requeue_pi, but I did suffer some stack
> overflows. That said, I audited all places pi_state is used/modified
> and applied consistent locking and/or comments.
> 
> The patch could possibly be broken up in 3 parts, namely:
> 
>  - introduce the futex specific rt_mutex wrappers that avoid the
>    regular rt_mutex fast path (and isolate futex from the normal
>    rt_mutex interface).
> 
>  - add the pi_mutex->wait_lock locking, which at that point will be
>    entirely redundant.
> 
>  - rework the unlock_pi path to drop hb->lock.
> 
> Let me know if you would prefer to see that. For review I think having
> all that in a single patch is easier to reason about.

For merging and final review, yes please.
 
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -914,7 +914,7 @@ void exit_pi_state_list(struct task_stru
>  		pi_state->owner = NULL;
>  		raw_spin_unlock_irq(&curr->pi_lock);
>  
> -		rt_mutex_unlock(&pi_state->pi_mutex);
> +		rt_mutex_futex_unlock(&pi_state->pi_mutex);
>  
>  		spin_unlock(&hb->lock);
>  
> @@ -963,7 +963,9 @@ void exit_pi_state_list(struct task_stru
>   *
>   * [7]	pi_state->owner can only be NULL when the OWNER_DIED bit is set.
>   *
> - * [8]	Owner and user space value match
> + * [8]	Owner and user space value match; there is a special case in
> + *	wake_futex_pi() where we use pi_state->owner = &init_task to
> + *	make this true for TID 0 and avoid rules 9 and 10.

So you create a transient state with TID 0, but without setting the waiter
bit, right? I'll comment on that in the code below.
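For readers following along, this is the futex word encoding in play throughout the thread (bit values from include/uapi/linux/futex.h; the helper is purely illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* bit values from include/uapi/linux/futex.h */
#define FUTEX_WAITERS     0x80000000u
#define FUTEX_OWNER_DIED  0x40000000u
#define FUTEX_TID_MASK    0x3fffffffu

/* a held PI futex stores the owner TID in the low bits, flags on top */
static uint32_t futex_word(uint32_t tid, uint32_t flags)
{
	return flags | (tid & FUTEX_TID_MASK);
}
```

The transient state under discussion writes the whole word as 0: no owner TID and no FUTEX_WAITERS bit, even though a waiter still exists on the kernel side.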

>   *
>   * [9]	There is no transient state which sets the user space TID to 0
>   *	except exit_robust_list(), but this is indicated by the
> -static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
> -			 struct futex_hash_bucket *hb)
> +static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
>  {
> -	struct task_struct *new_owner;
> -	struct futex_pi_state *pi_state = top_waiter->pi_state;
>  	u32 uninitialized_var(curval), newval;
> +	struct task_struct *new_owner;
> +	bool deboost = false;
>  	WAKE_Q(wake_q);
> -	bool deboost;
>  	int ret = 0;
>  
> -	if (!pi_state)
> -		return -EINVAL;
> -
> -	/*
> -	 * If current does not own the pi_state then the futex is
> -	 * inconsistent and user space fiddled with the futex value.
> -	 */
> -	if (pi_state->owner != current)
> -		return -EINVAL;
> -
>  	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
> -	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
>  
> -	/*
> -	 * It is possible that the next waiter (the one that brought
> -	 * top_waiter owner to the kernel) timed out and is no longer
> -	 * waiting on the lock.
> -	 */
> -	if (!new_owner)
> -		new_owner = top_waiter->task;
> -
> -	/*
> -	 * We pass it to the next owner. The WAITERS bit is always
> -	 * kept enabled while there is PI state around. We cleanup the
> -	 * owner died bit, because we are the owner.
> -	 */
> -	newval = FUTEX_WAITERS | task_pid_vnr(new_owner);
> +	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
> +	if (!new_owner) {
> +		/*
> +		 * This is the case where futex_lock_pi() has not yet or failed
> +		 * to acquire the lock but still has the futex_q enqueued. So
> +		 * the futex state has a 'waiter' while the rt_mutex state does
> +		 * not.
> +		 *
> +		 * Even though there still is pi_state for this futex, we can
> +		 * clear FUTEX_WAITERS. Either:
> +		 *
> +		 *  - we or futex_lock_pi() will drop the last reference and
> +		 *    clean up this pi_state,
> +		 *
> +		 *  - userspace acquires the futex through its fastpath
> +		 *    and the above pi_state cleanup still happens,
> +		 *
> +		 *  - or futex_lock_pi() will re-set the WAITERS bit in
> +		 *    fixup_owner().
> +		 */
> +		newval = 0;
> +		/*
> +		 * Since pi_state->owner must point to a valid task, and
> +		 * task_pid_vnr(pi_state->owner) must match TID_MASK, use
> +		 * init_task.
> +		 */
> +		new_owner = &init_task;

So that waiter which is now spinning on pi_mutex->lock has already set the
waiters bit, which you undo. So you created the following scenario:

CPU0	     	       	  CPU1				CPU2

TID 0x1001		  TID 0x1000			TID 0x1002

			  Acquires futex in user space
    			  futex value = 0x1000;

futex_lock_pi()
  
  lock_hb()
  set_waiter_bit()
  --> futex value = 0x40001000;

			  futex_unlock_pi()
			   lock_hb()
  attach_to_pi_owner()
    rt_mutex_init_proxy_locked()
  queue_me()
    unlock_hb()
			   unlock_hb()
  rt_mutex_lock()	   wake_futex_pi()
   			   lock(pi_mutex->lock);
   lock(pi_mutex->lock)	   new_owner is NULL;
		 	    --> futex value = 0;
			   rt_mutex_futex_unlock(pi_mutex);
			   unlock(pi_mutex->lock);
   acquire_rt_mutex()	   return to user space
							Acquires futex in user space
							--> futex value = 0x1002
  fixup_owner()
    fixup_pi_state_owner()
       uval = 0x1002;
       newval = 0x40001001;
       cmpxchg(uval, newval) succeeds
       --> futex value = 0x40001001

Voila. Inconsistent state .... TID 0x1002 is borked now.

There is no check in fixup_pi_state_owner() for this case because, until
now, we rightfully forced consistent state.
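The interleaving above can be replayed as a sequential toy model. Note that the waiter-bit value 0x40000000 follows the scenario's illustration; the real FUTEX_WAITERS constant is 0x80000000:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* minimal cmpxchg model: stores new and returns true only on a match */
static bool cmpxchg_u32(uint32_t *p, uint32_t old, uint32_t new)
{
	if (*p != old)
		return false;
	*p = new;
	return true;
}

/* sequential replay of the CPU0/CPU1/CPU2 scenario above */
static uint32_t replay(void)
{
	uint32_t futex = 0x1000;		/* CPU1 acquires in user space */

	futex = 0x40000000 | 0x1000;		/* CPU0: set_waiter_bit() */
	futex = 0;				/* CPU1: wake_futex_pi(), new_owner is NULL */
	cmpxchg_u32(&futex, 0, 0x1002);		/* CPU2: fastpath acquire, succeeds */

	/* CPU0: fixup_pi_state_owner() blindly re-owns whatever it reads */
	cmpxchg_u32(&futex, futex, 0x40000000 | 0x1001);

	return futex;
}
```

replay() ends at 0x40001001: CPU2's perfectly legitimate user-space acquisition has been overwritten, which is exactly the inconsistency described above.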

So with this new transient state we really need to check whether uval is
what we expect it to be, i.e. 0. The completely untested and probably
incorrect patch below might cure the issue in futex_lock_pi(), but does not
handle any of the requeue_pi mess.

The other option we have is to set the futex value to FUTEX_WAITERS instead
of 0. Have not looked at that variant yet, but it might be simpler as it
forces the other task into the kernel instead of giving it the opportunity
to steal the futex in user space. Just assumptions as brain melt has
already set in.
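That alternative is easy to see in miniature: the user-space fastpath can only acquire a completely free word, so leaking FUTEX_WAITERS (0x80000000, from include/uapi/linux/futex.h) blocks the steal. The helper below is a sketch of that fastpath, not the real glibc code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FUTEX_WAITERS 0x80000000u

static bool cmpxchg_u32(uint32_t *p, uint32_t old, uint32_t new)
{
	if (*p != old)
		return false;
	*p = new;
	return true;
}

/* user-space fastpath acquire: 0 -> TID; anything else needs the kernel */
static bool fastpath_acquire(uint32_t *futex, uint32_t tid)
{
	return cmpxchg_u32(futex, 0, tid);
}
```

With the word set to 0 the steal succeeds; with FUTEX_WAITERS left in place it fails and the next locker is forced through FUTEX_LOCK_PI.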

In any case we must carefully document this new transient state with all
its extra bells and whistels.


That transient state 0 made me nervous from the very beginning and I should
have seen the hole earlier...

Sorry for spoiling the party once again!

Thanks,

	tglx

8<------------------------

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2246,16 +2246,27 @@ static int fixup_pi_state_owner(u32 __us
 	if (get_futex_value_locked(&uval, uaddr))
 		goto handle_fault;
 
-	while (1) {
-		newval = (uval & FUTEX_OWNER_DIED) | newtid;
-
-		if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval))
-			goto handle_fault;
-		if (curval == uval)
-			break;
-		uval = curval;
+	/*
+	 * If wake_futex_pi() set the futex to 0 and made init_task the
+	 * transient owner another task might have acquired the futex
+	 * in user space.
+	 */
+	if (oldowner == &init_task && uval != 0) {
+		raw_spin_lock(&pi_state->owner->pi_lock);
+		list_del_init(&pi_state->list);
+		raw_spin_unlock(&pi_state->owner->pi_lock);
+		pi_state->owner = NULL;
+		return -EAGAIN;
 	}
 
+	newval = (uval & FUTEX_OWNER_DIED) | newtid;
+
+	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval))
+		goto handle_fault;
+
+	if (curval != uval)
+		goto retry;
+
 	/*
 	 * We fixed up user space. Now we need to fix the pi_state
 	 * itself.
@@ -2679,6 +2690,10 @@ static int futex_lock_pi(u32 __user *uad
 
 out_put_key:
 	put_futex_key(&q.key);
+
+	if (ret == -EAGAIN)
+		goto retry;
+
 out:
 	if (to)
 		destroy_hrtimer_on_stack(&to->timer);


* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-10-27 20:36             ` Thomas Gleixner
@ 2016-11-23 19:20               ` Peter Zijlstra
  2016-11-24 16:52                 ` Peter Zijlstra
  2016-11-25 14:09               ` Peter Zijlstra
  1 sibling, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2016-11-23 19:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Thu, Oct 27, 2016 at 10:36:00PM +0200, Thomas Gleixner wrote:
> > +	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
> > +	if (!new_owner) {
> > +		/*
> > +		 * This is the case where futex_lock_pi() has not yet or failed
> > +		 * to acquire the lock but still has the futex_q enqueued. So
> > +		 * the futex state has a 'waiter' while the rt_mutex state does
> > +		 * not.
> > +		 *
> > +		 * Even though there still is pi_state for this futex, we can
> > +		 * clear FUTEX_WAITERS. Either:
> > +		 *
> > +		 *  - we or futex_lock_pi() will drop the last reference and
> > +		 *    clean up this pi_state,
> > +		 *
> > +		 *  - userspace acquires the futex through its fastpath
> > +		 *    and the above pi_state cleanup still happens,
> > +		 *
> > +		 *  - or futex_lock_pi() will re-set the WAITERS bit in
> > +		 *    fixup_owner().
> > +		 */
> > +		newval = 0;
> > +		/*
> > +		 * Since pi_state->owner must point to a valid task, and
> > +		 * task_pid_vnr(pi_state->owner) must match TID_MASK, use
> > +		 * init_task.
> > +		 */
> > +		new_owner = &init_task;
> 
> So that waiter which is now spinning on pi_mutex->lock has already set the
> waiters bit, which you undo. So you created the following scenario:
> 
> CPU0	     	       	  CPU1				CPU2
> 
> TID 0x1001		  TID 0x1000			TID 0x1002
> 
> 			  Acquires futex in user space
>     			  futex value = 0x1000;
> 
> futex_lock_pi()
>   
>   lock_hb()
>   set_waiter_bit()
>   --> futex value = 0x40001000;
> 
> 			  futex_unlock_pi()
> 			   lock_hb()
>   attach_to_pi_owner()
>     rt_mutex_init_proxy_locked()
>   queue_me()
>     unlock_hb()
> 			   unlock_hb()
>   rt_mutex_lock()	   wake_futex_pi()
>    			   lock(pi_mutex->lock);
>    lock(pi_mutex->lock)	   new_owner is NULL;
> 		 	    --> futex value = 0;
> 			   rt_mutex_futex_unlock(pi_mutex);
> 			   unlock(pi_mutex->lock);
>    acquire_rt_mutex()	   return to user space
> 							Acquires futex in user space
> 							--> futex value = 0x1002
>   fixup_owner()
>     fixup_pi_state_owner()
>        uval = 0x1002;
>        newval = 0x40001001;
>        cmpxchg(uval, newval) succeeds
>        --> futex value = 0x40001001
> 
> Voila. Inconsistent state .... TID 0x1002 is borked now.

Urgh, right.

> The other option we have is to set the futex value to FUTEX_WAITERS instead
> of 0.

Yeah, I initially did that but didn't really like it; I then went on to
convince myself setting it to 0 was good. Silly me. Leaking the WAITERS
bit is _by_far_ the simplest option though.

Ok, I went and implemented various broken and discarded alternatives
while re-learning all about futexes that I forgot over the past few weeks,
while trying to figure out wtf the below does.

I also tried to create some 4+ CPU races that would hit holes in the
below, no luck so far.

> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -2246,16 +2246,27 @@ static int fixup_pi_state_owner(u32 __us
>  	if (get_futex_value_locked(&uval, uaddr))
>  		goto handle_fault;
>  
> +	/*
> +	 * If wake_futex_pi() set the futex to 0 and made init_task the
> +	 * transient owner another task might have acquired the futex
> +	 * in user space.
> +	 */

True, but that doesn't explain why we do this. Naively leaving
pi_state->owner set while returning EAGAIN shouldn't be a problem
because put_pi_state() should clean that up.

_However_, that does rt_mutex_proxy_unlock() on it as well, and _that_
is a problem, because it's not the rt_mutex owner.

Then again, we can hit this very problem through any of the other
put_pi_state() calls after setting ->owner = &init_task I think. Which
would argue to special case this in put_pi_state() instead.

> +	if (oldowner == &init_task && uval != 0) {
> +		raw_spin_lock(&pi_state->owner->pi_lock);
> +		list_del_init(&pi_state->list);
> +		raw_spin_unlock(&pi_state->owner->pi_lock);
> +		pi_state->owner = NULL;
> +		return -EAGAIN;
>  	}
>  
> +	newval = (uval & FUTEX_OWNER_DIED) | newtid;
> +
> +	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval))
> +		goto handle_fault;
> +
> +	if (curval != uval)
> +		goto retry;

This is slightly weird in that we lose the obvious cmpxchg loop
construct, so I'd write it differently. Also, that
get_futex_value_locked() call is entirely superfluous the second time
around; we got curval after all.
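For reference, the loop construct meant here, sketched with a plain stand-in for cmpxchg_futex_value_locked() (the helper and names are illustrative, not the kernel API):

```c
#include <assert.h>
#include <stdint.h>

#define FUTEX_OWNER_DIED 0x40000000u

/* stand-in: returns the value observed at *uaddr before the attempt */
static uint32_t cmpxchg_u32(uint32_t *uaddr, uint32_t old, uint32_t new)
{
	uint32_t cur = *uaddr;

	if (cur == old)
		*uaddr = new;
	return cur;
}

/* the classic shape: each failed attempt feeds curval straight back in,
 * so no second read of the futex word is ever needed */
static uint32_t fixup_tid_loop(uint32_t *uaddr, uint32_t newtid)
{
	uint32_t uval = *uaddr;

	for (;;) {
		uint32_t newval = (uval & FUTEX_OWNER_DIED) | newtid;
		uint32_t curval = cmpxchg_u32(uaddr, uval, newval);

		if (curval == uval)
			return newval;	/* exchanged */
		uval = curval;		/* raced: retry with what we saw */
	}
}
```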

> +
>  	/*
>  	 * We fixed up user space. Now we need to fix the pi_state
>  	 * itself.
> @@ -2679,6 +2690,10 @@ static int futex_lock_pi(u32 __user *uad
>  
>  out_put_key:
>  	put_futex_key(&q.key);
> +
> +	if (ret == -EAGAIN)
> +		goto retry;
> +

And this is far too clever and really needs a comment. So the crucial
point is that this is after unqueue_me_pi(), which drops the pi_state
and loops back to lookup the pi_state again, which, hopefully, has now
been completely destroyed -- and therefore we hit the regular
attach_to_pi_owner() path, fixing up our 'funny' state.

This is where I was playing with 4+ CPU scenarios, to see if we could
somehow keep the pi_state alive and not make progress.

I think we can do the retry slightly earlier, right after
unqueue_me_pi() and then add retry_queue: right before
futex_lock_pi_atomic(), that would avoid dropping the hb->lock (and
avoids put/get_futex_key).

>  out:
>  	if (to)
>  		destroy_hrtimer_on_stack(&to->timer);

Also, since fixup_pi_state_owner() can now return -EAGAIN, all users
need handling for that.


I'll try and do a patch that does all that and attempt to write coherent
comments on our fancy new state.


* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-11-23 19:20               ` Peter Zijlstra
@ 2016-11-24 16:52                 ` Peter Zijlstra
  2016-11-24 17:56                   ` Thomas Gleixner
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2016-11-24 16:52 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Wed, Nov 23, 2016 at 08:20:05PM +0100, Peter Zijlstra wrote:
> > +	if (oldowner == &init_task && uval != 0) {
> > +		raw_spin_lock(&pi_state->owner->pi_lock);
> > +		list_del_init(&pi_state->list);
> > +		raw_spin_unlock(&pi_state->owner->pi_lock);
> > +		pi_state->owner = NULL;
> > +		return -EAGAIN;

> > @@ -2679,6 +2690,10 @@ static int futex_lock_pi(u32 __user *uad
> >  
> >  out_put_key:
> >  	put_futex_key(&q.key);
> > +
> > +	if (ret == -EAGAIN)
> > +		goto retry;
> > +
> 
> And this is far too clever and really needs a comment. So the crucial
> point is that this is after unqueue_me_pi(), which drops the pi_state
> and loops back to lookup the pi_state again, which, hopefully, has now
> been completely destroyed -- and therefore we hit the regular
> attach_to_pi_owner() path, fixing up our 'funny' state.
> 

I'm stumped on REQUEUE_PI.. this relies on attach_to_pi_owner() and
fixup_owner() being in the same function. But this is not the case for
requeue. WAIT_REQUEUE has the fixup, as its return path finds it has
acquired the outer pi-futex (uaddr2), but the lookup_pi_state() stuff is
done by CMP_REQUEUE, which does the actual transfer of the waiters from
inner futex (uaddr1) to outer futex (uaddr2).


Maybe I can restructure things a bit, I think CMP_REQUEUE would also
know who actually acquired the outer-futex, but I have to think more on
this and the brain is pretty fried...


* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-11-24 16:52                 ` Peter Zijlstra
@ 2016-11-24 17:56                   ` Thomas Gleixner
  2016-11-24 18:58                     ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Thomas Gleixner @ 2016-11-24 17:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Thu, 24 Nov 2016, Peter Zijlstra wrote:
> On Wed, Nov 23, 2016 at 08:20:05PM +0100, Peter Zijlstra wrote:
> > > +	if (oldowner == &init_task && uval != 0) {
> > > +		raw_spin_lock(&pi_state->owner->pi_lock);
> > > +		list_del_init(&pi_state->list);
> > > +		raw_spin_unlock(&pi_state->owner->pi_lock);
> > > +		pi_state->owner = NULL;
> > > +		return -EAGAIN;
> 
> > > @@ -2679,6 +2690,10 @@ static int futex_lock_pi(u32 __user *uad
> > >  
> > >  out_put_key:
> > >  	put_futex_key(&q.key);
> > > +
> > > +	if (ret == -EAGAIN)
> > > +		goto retry;
> > > +
> > 
> > And this is far too clever and really needs a comment. So the crucial
> > point is that this is after unqueue_me_pi(), which drops the pi_state
> > and loops back to lookup the pi_state again, which, hopefully, has now
> > been completely destroyed -- and therefore we hit the regular
> > attach_to_pi_owner() path, fixing up our 'funny' state.
> > 
> 
> I'm stumped on REQUEUE_PI.. this relies on attach_to_pi_owner() and

You mean LOCK_PI, right? 

> fixup_owner() being in the same function. But this is not the case for
> requeue. WAIT_REQUEUE has the fixup, as its return path finds it has
> acquired the outer pi-futex (uaddr2), but the lookup_pi_state() stuff is
> done by CMP_REQUEUE, which does the actual transfer of the waiters from
> inner futex (uaddr1) to outer futex (uaddr2).

Correct. WAIT_REQUEUE puts the futex on the inner (uaddr1) and then gets
moved to the outer. From there it's the same thing as LOCK_PI.

> Maybe I can restructure things a bit, I think CMP_REQUEUE would also
> know who actually acquired the outer-futex, but I have to think more on
> this and the brain is pretty fried...

That is irrelevant at requeue time and the owner can change up to the point
where the waiter is really woken by a UNLOCK_PI.

Thanks,

	tglx

 


* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-11-24 17:56                   ` Thomas Gleixner
@ 2016-11-24 18:58                     ` Peter Zijlstra
  2016-11-25  9:23                       ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2016-11-24 18:58 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Thu, Nov 24, 2016 at 06:56:53PM +0100, Thomas Gleixner wrote:
> > I'm stumped on REQUEUE_PI.. this relies on attach_to_pi_owner() and
> 
> You mean LOCK_PI, right? 
> 
> > fixup_owner() being in the same function. But this is not the case for
> > requeue. WAIT_REQUEUE has the fixup, as its return path finds it has
> > acquired the outer pi-futex (uaddr2), but the lookup_pi_state() stuff is
> > done by CMP_REQUEUE, which does the actual transfer of the waiters from
> > inner futex (uaddr1) to outer futex (uaddr2).
> 
> Correct. WAIT_REQUEUE queues the waiter on the inner futex (uaddr1); it then
> gets moved to the outer one. From there it's the same thing as LOCK_PI.
> 
> > Maybe I can restructure things a bit, I think CMP_REQUEUE would also
> > know who actually acquired the outer-futex, but I have to think more on
> > this and the brain is pretty fried...
> 
> That is irrelevant at requeue time and the owner can change up to the point
> where the waiter is really woken by a UNLOCK_PI.

OK, so clearly I'm confused. So let me try again.

LOCK_PI does both in one function: lookup_pi_state and fixup_owner. If
fixup_owner fails with -EAGAIN, we can redo the pi_state lookup.

The requeue stuff, otoh, has one each: WAIT_REQUEUE has fixup_owner(),
CMP_REQUEUE has lookup_pi_state. Therefore, fixup_owner failing with
-EAGAIN leaves us dead in the water. There's nothing to go back to and
retry.

So far, so 'good', right?

Now, as far as I understand this requeue stuff, we have 2 futexes, an
inner futex and an outer futex. The inner futex is always 'locked' and
serves as a collection pool for waiting threads.

The requeue crap picks one (or more) waiters from the inner futex and
sticks them on the outer futex, which gives them a chance to run.

So WAIT_REQUEUE blocks on the inner futex, but knows that if it ever
gets woken, it will be on the outer futex, and hence needs to
fixup_owner if the futex and rt_mutex state got out of sync.

CMP_REQUEUEUEUE picks the one (or more) waiters of the inner futex and
sticks them on the outer futex.

So far, so 'good' ?

The thing I'm not entirely sure of is what happens with the outer futex,
do we first LOCK_PI it before doing CMP_REQUEUE, giving us waiters, and
then UNLOCK_PI to let them rip? Or do we just CMP_REQUEUE and then let
whoever wins finish with UNLOCK_PI?


In any case, I don't think it matters much, either way we can race
between the 'last' UNLOCK_PI and getting rt_mutex waiters and then hit
the &init_task funny state, such that WAIT_REQUEUE waking hits EAGAIN
and we're 'stuck'.

Now, if we always CMP_REQUEUE to a locked outer futex, then we cannot
know, at CMP_REQUEUE time, who will win and cannot fix up.

The only solution I've come up with so far involves that
rt_mutex_proxy_swizzle() muck which you didn't really fancy much.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-11-24 18:58                     ` Peter Zijlstra
@ 2016-11-25  9:23                       ` Peter Zijlstra
  2016-11-25 10:03                         ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2016-11-25  9:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Thu, Nov 24, 2016 at 07:58:07PM +0100, Peter Zijlstra wrote:

> OK, so clearly I'm confused. So let me try again.
> 
> LOCK_PI does both in one function: lookup_pi_state and fixup_owner. If
> fixup_owner fails with -EAGAIN, we can redo the pi_state lookup.
> 
> The requeue stuff, otoh, has one each: WAIT_REQUEUE has fixup_owner(),
> CMP_REQUEUE has lookup_pi_state. Therefore, fixup_owner failing with
> -EAGAIN leaves us dead in the water. There's nothing to go back to and
> retry.
> 
> So far, so 'good', right?
> 
> Now, as far as I understand this requeue stuff, we have 2 futexes, an
> inner futex and an outer futex. The inner futex is always 'locked' and
> serves as a collection pool for waiting threads.
> 
> The requeue crap picks one (or more) waiters from the inner futex and
> sticks them on the outer futex, which gives them a chance to run.
> 
> So WAIT_REQUEUE blocks on the inner futex, but knows that if it ever
> gets woken, it will be on the outer futex, and hence needs to
> fixup_owner if the futex and rt_mutex state got out of sync.
> 
> CMP_REQUEUEUEUE picks the one (or more) waiters of the inner futex and
> sticks them on the outer futex.
> 
> So far, so 'good' ?
> 
> The thing I'm not entirely sure of is what happens with the outer futex,
> do we first LOCK_PI it before doing CMP_REQUEUE, giving us waiters, and
> then UNLOCK_PI to let them rip? Or do we just CMP_REQUEUE and then let
> whoever wins finish with UNLOCK_PI?
> 
> 
> In any case, I don't think it matters much, either way we can race
> between the 'last' UNLOCK_PI and getting rt_mutex waiters and then hit
> the &init_task funny state, such that WAIT_REQUEUE waking hits EAGAIN
> and we're 'stuck'.
> 
> Now, if we always CMP_REQUEUE to a locked outer futex, then we cannot
> know, at CMP_REQUEUE time, who will win and cannot fix up.

OTOH, if we always first LOCK_PI before doing CMP_REQUEUE, I don't think
we can hit the funny state, LOCK_PI will have fixed it up for us.

So the question is, do we mandate LOCK_PI before CMP_REQUEUE?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-11-25  9:23                       ` Peter Zijlstra
@ 2016-11-25 10:03                         ` Peter Zijlstra
  2016-11-25 19:13                           ` Thomas Gleixner
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2016-11-25 10:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Fri, Nov 25, 2016 at 10:23:26AM +0100, Peter Zijlstra wrote:
> On Thu, Nov 24, 2016 at 07:58:07PM +0100, Peter Zijlstra wrote:
> 
> > OK, so clearly I'm confused. So let me try again.
> > 
> > LOCK_PI does both in one function: lookup_pi_state and fixup_owner. If
> > fixup_owner fails with -EAGAIN, we can redo the pi_state lookup.
> > 
> > The requeue stuff, otoh, has one each: WAIT_REQUEUE has fixup_owner(),
> > CMP_REQUEUE has lookup_pi_state. Therefore, fixup_owner failing with
> > -EAGAIN leaves us dead in the water. There's nothing to go back to and
> > retry.
> > 
> > So far, so 'good', right?
> > 
> > Now, as far as I understand this requeue stuff, we have 2 futexes, an
> > inner futex and an outer futex. The inner futex is always 'locked' and
> > serves as a collection pool for waiting threads.
> > 
> > The requeue crap picks one (or more) waiters from the inner futex and
> > sticks them on the outer futex, which gives them a chance to run.
> > 
> > So WAIT_REQUEUE blocks on the inner futex, but knows that if it ever
> > gets woken, it will be on the outer futex, and hence needs to
> > fixup_owner if the futex and rt_mutex state got out of sync.
> > 
> > CMP_REQUEUEUEUE picks the one (or more) waiters of the inner futex and
> > sticks them on the outer futex.
> > 
> > So far, so 'good' ?
> > 
> > The thing I'm not entirely sure of is what happens with the outer futex,
> > do we first LOCK_PI it before doing CMP_REQUEUE, giving us waiters, and
> > then UNLOCK_PI to let them rip? Or do we just CMP_REQUEUE and then let
> > whoever wins finish with UNLOCK_PI?
> > 
> > 
> > In any case, I don't think it matters much, either way we can race
> > between the 'last' UNLOCK_PI and getting rt_mutex waiters and then hit
> > the &init_task funny state, such that WAIT_REQUEUE waking hits EAGAIN
> > and we're 'stuck'.
> > 
> > Now, if we always CMP_REQUEUE to a locked outer futex, then we cannot
> > know, at CMP_REQUEUE time, who will win and cannot fix up.
> 
> OTOH, if we always first LOCK_PI before doing CMP_REQUEUE, I don't think
> we can hit the funny state, LOCK_PI will have fixed it up for us.
> 
> So the question is, do we mandate LOCK_PI before CMP_REQUEUE?

Going by futex_requeue(), the first thing it does, after validation and
getting the hbs locked, is futex_proxy_trylock_atomic(), which per the
comment above it will attempt to acquire uaddr2.

So no such mandate, otherwise that op would not exist and we'd only need
to validate that uaddr2 was 'current'.

Ah well, back to reading more...

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-10-27 20:36             ` Thomas Gleixner
  2016-11-23 19:20               ` Peter Zijlstra
@ 2016-11-25 14:09               ` Peter Zijlstra
  1 sibling, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-11-25 14:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Thu, Oct 27, 2016 at 10:36:00PM +0200, Thomas Gleixner wrote:
> So that waiter which is now spinning on pi_mutex->lock has already set the
> waiters bit, which you undo. So you created the following scenario:
> 
> CPU0	     	       	  CPU1				CPU2
> 
> TID 0x1001		  TID 0x1000			TID 0x1002
> 
> 			  Acquires futex in user space
>     			  futex value = 0x1000;
> 
> futex_lock_pi()
>   
>   lock_hb()
>   set_waiter_bit()
>   --> futex value = 0x40001000;
> 
> 			  futex_unlock_pi()
> 			   lock_hb()
>   attach_to_pi_owner()
>     rt_mutex_init_proxy_locked()
>   queue_me()
>     unlock_hb()
> 			   unlock_hb()
>   rt_mutex_lock()	   wake_futex_pi()
>    			   lock(pi_mutex->lock);
>    lock(pi_mutex->lock)	   new_owner is NULL;
> 		 	    --> futex value = 0;
> 			   rt_mutex_futex_unlock(pi_mutex);
> 			   unlock(pi_mutex->lock);
>    acquire_rt_mutex()	   return to user space
> 							Acquires futex in user space
> 							--> futex value = 0x1002
>   fixup_owner()
>     fixup_pi_state_owner()
>        uval = 0x1002;
>        newval = 0x40001001;
>        cmpxchg(uval, newval) succeeds
>        --> futex value = 0x40001001
> 
> Voila. Inconsistent state .... TID 0x1002 is borked now.

OK, I think it's much worse...


Consider:


CPU0 (tid=0x1000)		CPU1 (tid=0x1001)		CPU2 (tid=0x1002)


acquire in userspace
  *uaddr = 0x1000

				futex_lock_pi()
				  acq hb->lock
				  attach_to_pi_owner
				    pi_state->refcount == 1
				  queue_me
				    rel hb->lock
								futex_lock_pi()
								  acq hb->lock
								  attach_to_pi_state
								    pi_state->refcount == 2
								  queue_me
								    rel hb->lock

futex_unlock_pi()
  acq pi_mutex->lock
  top_waiter == (CPU1 | CPU2)
  wake_futex_pi()
    new_owner == NULL
      pi_state->owner = &init_task
      *uaddr = 0
    rt_mutex_futex_unlock()
  ret-to-user


acquire in userspace
  *uaddr = 0x1000



From here there are multiple ways to trainwreck the thing; the way you
list above, let's call this A):

								  rt_mutex_lock()
								  acq hb->lock
								  fixup_owner()
								    cmpxchg(uaddr, 0x1000, 0x40001002)

								    *BORKED CPU0*

Alternatively, we can do B):

			CPU3 (or CPU0 with a different tid)

			futex_lock_pi()
			  acq hb->lock
			  attach_to_pi_state()
			    -EINVAL : *uaddr != task_pid_vnr(&init_task)



Now, since CPU1 is pinning the hb->waiter relation, the proposed
fixup_owner() -EAGAIN change cannot work either, since, A1):

								  rt_mutex_lock()
								  acq hb->lock
								  fixup_owner() : -EAGAIN
								  unqueue_me_pi()
								    put_pi_state
								      pi_state->refcount == 1
								    rel hb->lock
								  goto retry_private

								retry_private:
								  acq hb->lock
								  attach_to_pi_state() : -EINVAL


Similar things happen with CMP_REQUEUE doing a lookup_pi_state() around
this point.

I'm sorely tempted to do the 'simple' thing and leak FUTEX_WAITERS,
forcing things into the slow path.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
  2016-11-25 10:03                         ` Peter Zijlstra
@ 2016-11-25 19:13                           ` Thomas Gleixner
  0 siblings, 0 replies; 46+ messages in thread
From: Thomas Gleixner @ 2016-11-25 19:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot

On Fri, 25 Nov 2016, Peter Zijlstra wrote:
> On Fri, Nov 25, 2016 at 10:23:26AM +0100, Peter Zijlstra wrote:
> > On Thu, Nov 24, 2016 at 07:58:07PM +0100, Peter Zijlstra wrote:
> > 
> > > OK, so clearly I'm confused. So let me try again.
> > > 
> > > LOCK_PI does both in one function: lookup_pi_state and fixup_owner. If
> > > fixup_owner fails with -EAGAIN, we can redo the pi_state lookup.
> > > 
> > > The requeue stuff, otoh, has one each: WAIT_REQUEUE has fixup_owner(),
> > > CMP_REQUEUE has lookup_pi_state. Therefore, fixup_owner failing with
> > > -EAGAIN leaves us dead in the water. There's nothing to go back to and
> > > retry.
> > > 
> > > So far, so 'good', right?
> > > 
> > > Now, as far as I understand this requeue stuff, we have 2 futexes, an
> > > inner futex and an outer futex. The inner futex is always 'locked' and
> > > serves as a collection pool for waiting threads.

Yes.
 
> > > The requeue crap picks one (or more) waiters from the inner futex and
> > > sticks them on the outer futex, which gives them a chance to run.

No. Only one gets the chance to run, and only if it can acquire the futex in
user space directly. If the futex is locked, it is queued on the outer
futex/rtmutex and gets to run when the outer futex is unlocked by the owner.
 
> > > So WAIT_REQUEUE blocks on the inner futex, but knows that if it ever
> > > gets woken, it will be on the outer futex, and hence needs to
> > > fixup_owner if the futex and rt_mutex state got out of sync.

It can be woken on the inner one as well (signal/timeout), but that does
not affect the outer futex, so it's not interesting.

> > > CMP_REQUEUEUEUE picks the one (or more) waiters of the inner futex and
> > > sticks them on the outer futex.
> > > 
> > > So far, so 'good' ?
> > > 
> > > The thing I'm not entirely sure of is what happens with the outer futex,
> > > do we first LOCK_PI it before doing CMP_REQUEUE, giving us waiters, and
> > > then UNLOCK_PI to let them rip? Or do we just CMP_REQUEUE and then let
> > > whoever wins finish with UNLOCK_PI?
> > > 
> > > 
> > > In any case, I don't think it matters much, either way we can race
> > > between the 'last' UNLOCK_PI and getting rt_mutex waiters and then hit
> > > the &init_task funny state, such that WAIT_REQUEUE waking hits EAGAIN
> > > and we're 'stuck'.

Right, that would be exceptionally bad.
 
> > > Now, if we always CMP_REQUEUE to a locked outer futex, then we cannot
> > > know, at CMP_REQUEUE time, who will win and cannot fix up.
> > 
> > OTOH, if we always first LOCK_PI before doing CMP_REQUEUE, I don't think
> > we can hit the funny state, LOCK_PI will have fixed it up for us.
> > 
> > So the question is, do we mandate LOCK_PI before CMP_REQUEUE?
> 
> Going by futex_requeue(), the first thing it does, after validation and
> getting the hbs locked, is futex_proxy_trylock_atomic(), which per the
> comment above it will attempt to acquire uaddr2.
> 
> So no such mandate, otherwise that op would not exist and we'd only need
> to validate that uaddr2 was 'current'.

Correct. It can be done with uaddr2 locked or not. That's why we try to
take it directly first.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2016-11-25 19:16 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-03  9:12 [RFC][PATCH 0/4] FUTEX_UNLOCK_PI wobbles Peter Zijlstra
2016-10-03  9:12 ` [RFC][PATCH 1/4] futex: Cleanup variable names for futex_top_waiter() Peter Zijlstra
2016-10-03 14:15   ` Steven Rostedt
2016-10-05  3:58   ` Davidlohr Bueso
2016-10-03  9:12 ` [RFC][PATCH 2/4] futex: Use smp_store_release() in mark_wake_futex() Peter Zijlstra
2016-10-03 14:19   ` Steven Rostedt
2016-10-05  3:57   ` Davidlohr Bueso
2016-10-05  6:20     ` Peter Zijlstra
2016-10-03  9:12 ` [RFC][PATCH 3/4] futex: Remove rt_mutex_deadlock_account_*() Peter Zijlstra
2016-10-03  9:34   ` Peter Zijlstra
2016-10-03 14:25   ` Steven Rostedt
2016-10-05  1:08   ` Davidlohr Bueso
2016-10-05  7:29   ` Sebastian Andrzej Siewior
2016-10-03  9:12 ` [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI Peter Zijlstra
2016-10-03 15:36   ` Steven Rostedt
2016-10-03 15:44     ` Peter Zijlstra
2016-10-03 15:45     ` Peter Zijlstra
2016-10-03 16:23       ` Steven Rostedt
2016-10-05  7:41   ` Sebastian Andrzej Siewior
2016-10-05  8:09     ` Peter Zijlstra
2016-10-05  8:21       ` Sebastian Andrzej Siewior
2016-10-05  8:32         ` Peter Zijlstra
2016-10-06 10:29   ` Peter Zijlstra
2016-10-07 11:21   ` Peter Zijlstra
2016-10-08 15:53     ` Thomas Gleixner
2016-10-08 16:55       ` Peter Zijlstra
2016-10-08 17:06         ` Thomas Gleixner
2016-10-10 10:17         ` Thomas Gleixner
2016-10-10 11:40           ` Peter Zijlstra
2016-10-21 12:27           ` Peter Zijlstra
2016-10-27 20:36             ` Thomas Gleixner
2016-11-23 19:20               ` Peter Zijlstra
2016-11-24 16:52                 ` Peter Zijlstra
2016-11-24 17:56                   ` Thomas Gleixner
2016-11-24 18:58                     ` Peter Zijlstra
2016-11-25  9:23                       ` Peter Zijlstra
2016-11-25 10:03                         ` Peter Zijlstra
2016-11-25 19:13                           ` Thomas Gleixner
2016-11-25 14:09               ` Peter Zijlstra
2016-10-08 18:22     ` Thomas Gleixner
2016-10-09 11:17     ` Thomas Gleixner
2016-10-10 14:06       ` Peter Zijlstra
2016-10-05  1:02 ` [RFC][PATCH 0/4] FUTEX_UNLOCK_PI wobbles Davidlohr Bueso
2016-10-05  6:20   ` Peter Zijlstra
2016-10-05  7:26     ` Sebastian Andrzej Siewior
2016-10-05 16:04     ` Davidlohr Bueso
