* [PATCH -v5 00/14] the saga of FUTEX_UNLOCK_PI wobbles continues
@ 2017-03-04  9:27 Peter Zijlstra
  2017-03-04  9:27 ` [PATCH -v5 01/14] futex: Fix potential use-after-free in FUTEX_REQUEUE_PI Peter Zijlstra
                   ` (13 more replies)
  0 siblings, 14 replies; 36+ messages in thread
From: Peter Zijlstra @ 2017-03-04  9:27 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz

Hi all,

Here be the 5th iteration of these patches, which strive to unlock the
rt_mutex without holding hb->lock.

On the whole this patch-set is very much like the previous one
(including the 'late' fix), but it gets there by a slightly different
-- and hopefully easier to understand -- route.

The last 3 patches could be superfluous if we can prove that the
-EAGAIN retry in futex_unlock_pi() isn't a problem for bounded
execution (or if we are willing to accept it). In that case the
patch-set ends up simpler than before.

I sincerely hope this is the last of it; but please consider carefully, there be
dragons here.

* [PATCH -v5 01/14] futex: Fix potential use-after-free in FUTEX_REQUEUE_PI
  2017-03-04  9:27 [PATCH -v5 00/14] the saga of FUTEX_UNLOCK_PI wobbles continues Peter Zijlstra
@ 2017-03-04  9:27 ` Peter Zijlstra
  2017-03-14 20:48   ` [tip:locking/urgent] " tip-bot for Peter Zijlstra
  2017-03-04  9:27 ` [PATCH -v5 02/14] futex: Add missing error handling to FUTEX_REQUEUE_PI Peter Zijlstra
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2017-03-04  9:27 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz,
	Darren Hart

While working on the futex code, I stumbled over this potential
use-after-free scenario.

pi_mutex is a pointer into pi_state, and we drop our reference on
pi_state in unqueue_me_pi(). Any access through that pointer after
that point is a use-after-free hazard.

Since other sites already do rt_mutex_unlock() with hb->lock held (see
for example futex_lock_pi()), simply move the unlock before
unqueue_me_pi().
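
To illustrate, a sketch of the hazardous ordering (simplified, not the
literal kernel code):

        struct rt_mutex *pi_mutex = &q.pi_state->pi_mutex;

        unqueue_me_pi(&q);      /* drops our pi_state reference; pi_state
                                 * (and thus *pi_mutex) may be freed now */

        if (ret && rt_mutex_owner(pi_mutex) == current)
                rt_mutex_unlock(pi_mutex);      /* use-after-free */

The patch below instead does the unlock before unqueue_me_pi(), while
the pi_state reference is still held.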

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Darren Hart <dvhart@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c |   20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2813,7 +2813,6 @@ static int futex_wait_requeue_pi(u32 __u
 {
 	struct hrtimer_sleeper timeout, *to = NULL;
 	struct rt_mutex_waiter rt_waiter;
-	struct rt_mutex *pi_mutex = NULL;
 	struct futex_hash_bucket *hb;
 	union futex_key key2 = FUTEX_KEY_INIT;
 	struct futex_q q = futex_q_init;
@@ -2905,6 +2904,8 @@ static int futex_wait_requeue_pi(u32 __u
 			spin_unlock(q.lock_ptr);
 		}
 	} else {
+		struct rt_mutex *pi_mutex;
+
 		/*
 		 * We have been woken up by futex_unlock_pi(), a timeout, or a
 		 * signal.  futex_unlock_pi() will not destroy the lock_ptr nor
@@ -2928,18 +2929,19 @@ static int futex_wait_requeue_pi(u32 __u
 		if (res)
 			ret = (res < 0) ? res : 0;
 
+		/*
+		 * If fixup_pi_state_owner() faulted and was unable to handle
+		 * the fault, unlock the rt_mutex and return the fault to
+		 * userspace.
+		 */
+		if (ret && rt_mutex_owner(pi_mutex) == current)
+			rt_mutex_unlock(pi_mutex);
+
 		/* Unqueue and drop the lock. */
 		unqueue_me_pi(&q);
 	}
 
-	/*
-	 * If fixup_pi_state_owner() faulted and was unable to handle the
-	 * fault, unlock the rt_mutex and return the fault to userspace.
-	 */
-	if (ret == -EFAULT) {
-		if (pi_mutex && rt_mutex_owner(pi_mutex) == current)
-			rt_mutex_unlock(pi_mutex);
-	} else if (ret == -EINTR) {
+	if (ret == -EINTR) {
 		/*
 		 * We've already been requeued, but cannot restart by calling
 		 * futex_lock_pi() directly. We could restart this syscall, but

* [PATCH -v5 02/14] futex: Add missing error handling to FUTEX_REQUEUE_PI
  2017-03-04  9:27 [PATCH -v5 00/14] the saga of FUTEX_UNLOCK_PI wobbles continues Peter Zijlstra
  2017-03-04  9:27 ` [PATCH -v5 01/14] futex: Fix potential use-after-free in FUTEX_REQUEUE_PI Peter Zijlstra
@ 2017-03-04  9:27 ` Peter Zijlstra
  2017-03-14 20:49   ` [tip:locking/urgent] " tip-bot for Peter Zijlstra
  2017-03-04  9:27 ` [PATCH -v5 03/14] futex: Cleanup variable names for futex_top_waiter() Peter Zijlstra
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2017-03-04  9:27 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz,
	Darren Hart

Thomas spotted that fixup_pi_state_owner() can return errors and we
fail to unlock the rt_mutex in that case.

Reviewed-by: Darren Hart <dvhart@linux.intel.com>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c |    2 ++
 1 file changed, 2 insertions(+)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2896,6 +2896,8 @@ static int futex_wait_requeue_pi(u32 __u
 		if (q.pi_state && (q.pi_state->owner != current)) {
 			spin_lock(q.lock_ptr);
 			ret = fixup_pi_state_owner(uaddr2, &q, current);
+			if (ret && rt_mutex_owner(&q.pi_state->pi_mutex) == current)
+				rt_mutex_unlock(&q.pi_state->pi_mutex);
 			/*
 			 * Drop the reference to the pi state which
 			 * the requeue_pi() code acquired for us.

* [PATCH -v5 03/14] futex: Cleanup variable names for futex_top_waiter()
  2017-03-04  9:27 [PATCH -v5 00/14] the saga of FUTEX_UNLOCK_PI wobbles continues Peter Zijlstra
  2017-03-04  9:27 ` [PATCH -v5 01/14] futex: Fix potential use-after-free in FUTEX_REQUEUE_PI Peter Zijlstra
  2017-03-04  9:27 ` [PATCH -v5 02/14] futex: Add missing error handling to FUTEX_REQUEUE_PI Peter Zijlstra
@ 2017-03-04  9:27 ` Peter Zijlstra
  2017-03-04  9:27 ` [PATCH -v5 04/14] futex: Use smp_store_release() in mark_wake_futex() Peter Zijlstra
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2017-03-04  9:27 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz

futex_top_waiter() returns the top-waiter on the pi_mutex. Assigning
this to a variable named 'match' totally obscures the code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c |   30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1120,14 +1120,14 @@ static int attach_to_pi_owner(u32 uval,
 static int lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
 			   union futex_key *key, struct futex_pi_state **ps)
 {
-	struct futex_q *match = futex_top_waiter(hb, key);
+	struct futex_q *top_waiter = futex_top_waiter(hb, key);
 
 	/*
 	 * If there is a waiter on that futex, validate it and
 	 * attach to the pi_state when the validation succeeds.
 	 */
-	if (match)
-		return attach_to_pi_state(uval, match->pi_state, ps);
+	if (top_waiter)
+		return attach_to_pi_state(uval, top_waiter->pi_state, ps);
 
 	/*
 	 * We are the first waiter - try to look up the owner based on
@@ -1174,7 +1174,7 @@ static int futex_lock_pi_atomic(u32 __us
 				struct task_struct *task, int set_waiters)
 {
 	u32 uval, newval, vpid = task_pid_vnr(task);
-	struct futex_q *match;
+	struct futex_q *top_waiter;
 	int ret;
 
 	/*
@@ -1200,9 +1200,9 @@ static int futex_lock_pi_atomic(u32 __us
 	 * Lookup existing state first. If it exists, try to attach to
 	 * its pi_state.
 	 */
-	match = futex_top_waiter(hb, key);
-	if (match)
-		return attach_to_pi_state(uval, match->pi_state, ps);
+	top_waiter = futex_top_waiter(hb, key);
+	if (top_waiter)
+		return attach_to_pi_state(uval, top_waiter->pi_state, ps);
 
 	/*
 	 * No waiter and user TID is 0. We are here because the
@@ -1292,11 +1292,11 @@ static void mark_wake_futex(struct wake_
 	q->lock_ptr = NULL;
 }
 
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
+static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
 			 struct futex_hash_bucket *hb)
 {
 	struct task_struct *new_owner;
-	struct futex_pi_state *pi_state = this->pi_state;
+	struct futex_pi_state *pi_state = top_waiter->pi_state;
 	u32 uninitialized_var(curval), newval;
 	DEFINE_WAKE_Q(wake_q);
 	bool deboost;
@@ -1317,11 +1317,11 @@ static int wake_futex_pi(u32 __user *uad
 
 	/*
 	 * It is possible that the next waiter (the one that brought
-	 * this owner to the kernel) timed out and is no longer
+	 * top_waiter owner to the kernel) timed out and is no longer
 	 * waiting on the lock.
 	 */
 	if (!new_owner)
-		new_owner = this->task;
+		new_owner = top_waiter->task;
 
 	/*
 	 * We pass it to the next owner. The WAITERS bit is always
@@ -2631,7 +2631,7 @@ static int futex_unlock_pi(u32 __user *u
 	u32 uninitialized_var(curval), uval, vpid = task_pid_vnr(current);
 	union futex_key key = FUTEX_KEY_INIT;
 	struct futex_hash_bucket *hb;
-	struct futex_q *match;
+	struct futex_q *top_waiter;
 	int ret;
 
 retry:
@@ -2655,9 +2655,9 @@ static int futex_unlock_pi(u32 __user *u
 	 * all and we at least want to know if user space fiddled
 	 * with the futex value instead of blindly unlocking.
 	 */
-	match = futex_top_waiter(hb, &key);
-	if (match) {
-		ret = wake_futex_pi(uaddr, uval, match, hb);
+	top_waiter = futex_top_waiter(hb, &key);
+	if (top_waiter) {
+		ret = wake_futex_pi(uaddr, uval, top_waiter, hb);
 		/*
 		 * In case of success wake_futex_pi dropped the hash
 		 * bucket lock.

* [PATCH -v5 04/14] futex: Use smp_store_release() in mark_wake_futex()
  2017-03-04  9:27 [PATCH -v5 00/14] the saga of FUTEX_UNLOCK_PI wobbles continues Peter Zijlstra
                   ` (2 preceding siblings ...)
  2017-03-04  9:27 ` [PATCH -v5 03/14] futex: Cleanup variable names for futex_top_waiter() Peter Zijlstra
@ 2017-03-04  9:27 ` Peter Zijlstra
  2017-03-04  9:27 ` [PATCH -v5 05/14] futex: Remove rt_mutex_deadlock_account_*() Peter Zijlstra
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2017-03-04  9:27 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz

Since the futex_q can disappear the instant after we assign NULL, this
really should be a RELEASE barrier. That also stops loads from hitting
dead memory.
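
For the record: smp_wmb() only orders prior stores against the NULL
store, while a release store also orders prior loads. A simplified
sketch of why that matters here (not the literal code):

        /* waker, in mark_wake_futex(): */
        __unqueue_futex(q);                     /* reads and writes *q */
        smp_store_release(&q->lock_ptr, NULL);  /* no access to *q can be
                                                 * reordered past this */

        /* waiter, once it observes lock_ptr == NULL: */
        if (!READ_ONCE(q->lock_ptr))
                return;         /* the futex_q lives on the waiter's
                                 * stack, which may now be reused */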

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1288,8 +1288,7 @@ static void mark_wake_futex(struct wake_
 	 * memory barrier is required here to prevent the following
 	 * store to lock_ptr from getting ahead of the plist_del.
 	 */
-	smp_wmb();
-	q->lock_ptr = NULL;
+	smp_store_release(&q->lock_ptr, NULL);
 }
 
 static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,

* [PATCH -v5 05/14] futex: Remove rt_mutex_deadlock_account_*()
  2017-03-04  9:27 [PATCH -v5 00/14] the saga of FUTEX_UNLOCK_PI wobbles continues Peter Zijlstra
                   ` (3 preceding siblings ...)
  2017-03-04  9:27 ` [PATCH -v5 04/14] futex: Use smp_store_release() in mark_wake_futex() Peter Zijlstra
@ 2017-03-04  9:27 ` Peter Zijlstra
  2017-03-04  9:27 ` [PATCH -v5 06/14] futex,rt_mutex: Provide futex specific rt_mutex API Peter Zijlstra
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2017-03-04  9:27 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz

These are unused and clutter up the code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/locking/rtmutex-debug.c |    9 -------
 kernel/locking/rtmutex-debug.h |    3 --
 kernel/locking/rtmutex.c       |   47 +++++++++++++++--------------------------
 kernel/locking/rtmutex.h       |    2 -
 4 files changed, 18 insertions(+), 43 deletions(-)

--- a/kernel/locking/rtmutex-debug.c
+++ b/kernel/locking/rtmutex-debug.c
@@ -173,12 +173,3 @@ void debug_rt_mutex_init(struct rt_mutex
 	lock->name = name;
 }
 
-void
-rt_mutex_deadlock_account_lock(struct rt_mutex *lock, struct task_struct *task)
-{
-}
-
-void rt_mutex_deadlock_account_unlock(struct task_struct *task)
-{
-}
-
--- a/kernel/locking/rtmutex-debug.h
+++ b/kernel/locking/rtmutex-debug.h
@@ -9,9 +9,6 @@
  * This file contains macros used solely by rtmutex.c. Debug version.
  */
 
-extern void
-rt_mutex_deadlock_account_lock(struct rt_mutex *lock, struct task_struct *task);
-extern void rt_mutex_deadlock_account_unlock(struct task_struct *task);
 extern void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter);
 extern void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter);
 extern void debug_rt_mutex_init(struct rt_mutex *lock, const char *name);
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -936,8 +936,6 @@ static int try_to_take_rt_mutex(struct r
 	 */
 	rt_mutex_set_owner(lock, task);
 
-	rt_mutex_deadlock_account_lock(lock, task);
-
 	return 1;
 }
 
@@ -1340,8 +1338,6 @@ static bool __sched rt_mutex_slowunlock(
 
 	debug_rt_mutex_unlock(lock);
 
-	rt_mutex_deadlock_account_unlock(current);
-
 	/*
 	 * We must be careful here if the fast path is enabled. If we
 	 * have no waiters queued we cannot set owner to NULL here
@@ -1407,11 +1403,10 @@ rt_mutex_fastlock(struct rt_mutex *lock,
 				struct hrtimer_sleeper *timeout,
 				enum rtmutex_chainwalk chwalk))
 {
-	if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) {
-		rt_mutex_deadlock_account_lock(lock, current);
+	if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current)))
 		return 0;
-	} else
-		return slowfn(lock, state, NULL, RT_MUTEX_MIN_CHAINWALK);
+
+	return slowfn(lock, state, NULL, RT_MUTEX_MIN_CHAINWALK);
 }
 
 static inline int
@@ -1423,21 +1418,19 @@ rt_mutex_timed_fastlock(struct rt_mutex
 				      enum rtmutex_chainwalk chwalk))
 {
 	if (chwalk == RT_MUTEX_MIN_CHAINWALK &&
-	    likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) {
-		rt_mutex_deadlock_account_lock(lock, current);
+	    likely(rt_mutex_cmpxchg_acquire(lock, NULL, current)))
 		return 0;
-	} else
-		return slowfn(lock, state, timeout, chwalk);
+
+	return slowfn(lock, state, timeout, chwalk);
 }
 
 static inline int
 rt_mutex_fasttrylock(struct rt_mutex *lock,
 		     int (*slowfn)(struct rt_mutex *lock))
 {
-	if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) {
-		rt_mutex_deadlock_account_lock(lock, current);
+	if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current)))
 		return 1;
-	}
+
 	return slowfn(lock);
 }
 
@@ -1447,19 +1440,18 @@ rt_mutex_fastunlock(struct rt_mutex *loc
 				   struct wake_q_head *wqh))
 {
 	DEFINE_WAKE_Q(wake_q);
+	bool deboost;
 
-	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL))) {
-		rt_mutex_deadlock_account_unlock(current);
+	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL)))
+		return;
 
-	} else {
-		bool deboost = slowfn(lock, &wake_q);
+	deboost = slowfn(lock, &wake_q);
 
-		wake_up_q(&wake_q);
+	wake_up_q(&wake_q);
 
-		/* Undo pi boosting if necessary: */
-		if (deboost)
-			rt_mutex_adjust_prio(current);
-	}
+	/* Undo pi boosting if necessary: */
+	if (deboost)
+		rt_mutex_adjust_prio(current);
 }
 
 /**
@@ -1570,10 +1562,9 @@ EXPORT_SYMBOL_GPL(rt_mutex_unlock);
 bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock,
 				   struct wake_q_head *wqh)
 {
-	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL))) {
-		rt_mutex_deadlock_account_unlock(current);
+	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL)))
 		return false;
-	}
+
 	return rt_mutex_slowunlock(lock, wqh);
 }
 
@@ -1635,7 +1626,6 @@ void rt_mutex_init_proxy_locked(struct r
 	__rt_mutex_init(lock, NULL);
 	debug_rt_mutex_proxy_lock(lock, proxy_owner);
 	rt_mutex_set_owner(lock, proxy_owner);
-	rt_mutex_deadlock_account_lock(lock, proxy_owner);
 }
 
 /**
@@ -1655,7 +1645,6 @@ void rt_mutex_proxy_unlock(struct rt_mut
 {
 	debug_rt_mutex_proxy_unlock(lock);
 	rt_mutex_set_owner(lock, NULL);
-	rt_mutex_deadlock_account_unlock(proxy_owner);
 }
 
 /**
--- a/kernel/locking/rtmutex.h
+++ b/kernel/locking/rtmutex.h
@@ -11,8 +11,6 @@
  */
 
 #define rt_mutex_deadlock_check(l)			(0)
-#define rt_mutex_deadlock_account_lock(m, t)		do { } while (0)
-#define rt_mutex_deadlock_account_unlock(l)		do { } while (0)
 #define debug_rt_mutex_init_waiter(w)			do { } while (0)
 #define debug_rt_mutex_free_waiter(w)			do { } while (0)
 #define debug_rt_mutex_lock(l)				do { } while (0)

* [PATCH -v5 06/14] futex,rt_mutex: Provide futex specific rt_mutex API
  2017-03-04  9:27 [PATCH -v5 00/14] the saga of FUTEX_UNLOCK_PI wobbles continues Peter Zijlstra
                   ` (4 preceding siblings ...)
  2017-03-04  9:27 ` [PATCH -v5 05/14] futex: Remove rt_mutex_deadlock_account_*() Peter Zijlstra
@ 2017-03-04  9:27 ` Peter Zijlstra
  2017-03-04  9:27 ` [PATCH -v5 07/14] futex: Change locking rules Peter Zijlstra
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2017-03-04  9:27 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz

Part of what makes futex_unlock_pi() intricate is that
rt_mutex_futex_unlock() -> rt_mutex_slowunlock() can drop
rt_mutex::wait_lock.

This means we cannot rely on the atomicity of wait_lock, which we
would like to do in order to not rely on hb->lock so much.

The reason rt_mutex_slowunlock() needs to drop wait_lock is that it
can race with the rt_mutex fastpath; futexes, however, have their own
fastpath.

Since futexes already have a bunch of separate rt_mutex accessors,
complete that set and implement an rt_mutex variant without the
fastpath for them.
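
Condensed from the patch below: with no fastpath, the futex unlock
variant never has to race a cmpxchg and can keep wait_lock held across
the whole operation:

        void __sched rt_mutex_futex_unlock(struct rt_mutex *lock)
        {
                DEFINE_WAKE_Q(wake_q);
                bool deboost;

                raw_spin_lock_irq(&lock->wait_lock);
                deboost = __rt_mutex_futex_unlock(lock, &wake_q);
                raw_spin_unlock_irq(&lock->wait_lock);

                if (deboost) {
                        wake_up_q(&wake_q);
                        rt_mutex_adjust_prio(current);
                }
        }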

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c                  |   30 ++++++++++-----------
 kernel/locking/rtmutex.c        |   55 +++++++++++++++++++++++++++++-----------
 kernel/locking/rtmutex_common.h |    9 +++++-
 3 files changed, 62 insertions(+), 32 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -916,7 +916,7 @@ void exit_pi_state_list(struct task_stru
 		pi_state->owner = NULL;
 		raw_spin_unlock_irq(&curr->pi_lock);
 
-		rt_mutex_unlock(&pi_state->pi_mutex);
+		rt_mutex_futex_unlock(&pi_state->pi_mutex);
 
 		spin_unlock(&hb->lock);
 
@@ -1364,20 +1364,18 @@ static int wake_futex_pi(u32 __user *uad
 	pi_state->owner = new_owner;
 	raw_spin_unlock(&new_owner->pi_lock);
 
-	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
-
-	deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
-
 	/*
-	 * First unlock HB so the waiter does not spin on it once he got woken
-	 * up. Second wake up the waiter before the priority is adjusted. If we
-	 * deboost first (and lose our higher priority), then the task might get
-	 * scheduled away before the wake up can take place.
+	 * We've updated the uservalue, this unlock cannot fail.
 	 */
+	deboost = __rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
+
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 	spin_unlock(&hb->lock);
-	wake_up_q(&wake_q);
-	if (deboost)
+
+	if (deboost) {
+		wake_up_q(&wake_q);
 		rt_mutex_adjust_prio(current);
+	}
 
 	return 0;
 }
@@ -2253,7 +2251,7 @@ static int fixup_owner(u32 __user *uaddr
 		 * task acquired the rt_mutex after we removed ourself from the
 		 * rt_mutex waiters list.
 		 */
-		if (rt_mutex_trylock(&q->pi_state->pi_mutex)) {
+		if (rt_mutex_futex_trylock(&q->pi_state->pi_mutex)) {
 			locked = 1;
 			goto out;
 		}
@@ -2568,7 +2566,7 @@ static int futex_lock_pi(u32 __user *uad
 	if (!trylock) {
 		ret = rt_mutex_timed_futex_lock(&q.pi_state->pi_mutex, to);
 	} else {
-		ret = rt_mutex_trylock(&q.pi_state->pi_mutex);
+		ret = rt_mutex_futex_trylock(&q.pi_state->pi_mutex);
 		/* Fixup the trylock return value: */
 		ret = ret ? 0 : -EWOULDBLOCK;
 	}
@@ -2591,7 +2589,7 @@ static int futex_lock_pi(u32 __user *uad
 	 * it and return the fault to userspace.
 	 */
 	if (ret && (rt_mutex_owner(&q.pi_state->pi_mutex) == current))
-		rt_mutex_unlock(&q.pi_state->pi_mutex);
+		rt_mutex_futex_unlock(&q.pi_state->pi_mutex);
 
 	/* Unqueue and drop the lock */
 	unqueue_me_pi(&q);
@@ -2898,7 +2896,7 @@ static int futex_wait_requeue_pi(u32 __u
 			spin_lock(q.lock_ptr);
 			ret = fixup_pi_state_owner(uaddr2, &q, current);
 			if (ret && rt_mutex_owner(&q.pi_state->pi_mutex) == current)
-				rt_mutex_unlock(&q.pi_state->pi_mutex);
+				rt_mutex_futex_unlock(&q.pi_state->pi_mutex);
 			/*
 			 * Drop the reference to the pi state which
 			 * the requeue_pi() code acquired for us.
@@ -2938,7 +2936,7 @@ static int futex_wait_requeue_pi(u32 __u
 		 * userspace.
 		 */
 		if (ret && rt_mutex_owner(pi_mutex) == current)
-			rt_mutex_unlock(pi_mutex);
+			rt_mutex_futex_unlock(pi_mutex);
 
 		/* Unqueue and drop the lock. */
 		unqueue_me_pi(&q);
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1488,15 +1488,23 @@ EXPORT_SYMBOL_GPL(rt_mutex_lock_interrup
 
 /*
  * Futex variant with full deadlock detection.
+ * Futex variants must not use the fast-path, see __rt_mutex_futex_unlock().
  */
-int rt_mutex_timed_futex_lock(struct rt_mutex *lock,
+int __sched rt_mutex_timed_futex_lock(struct rt_mutex *lock,
 			      struct hrtimer_sleeper *timeout)
 {
 	might_sleep();
 
-	return rt_mutex_timed_fastlock(lock, TASK_INTERRUPTIBLE, timeout,
-				       RT_MUTEX_FULL_CHAINWALK,
-				       rt_mutex_slowlock);
+	return rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE,
+				 timeout, RT_MUTEX_FULL_CHAINWALK);
+}
+
+/*
+ * Futex variant, must not use fastpath.
+ */
+int __sched rt_mutex_futex_trylock(struct rt_mutex *lock)
+{
+	return rt_mutex_slowtrylock(lock);
 }
 
 /**
@@ -1555,19 +1563,38 @@ void __sched rt_mutex_unlock(struct rt_m
 EXPORT_SYMBOL_GPL(rt_mutex_unlock);
 
 /**
- * rt_mutex_futex_unlock - Futex variant of rt_mutex_unlock
- * @lock: the rt_mutex to be unlocked
- *
- * Returns: true/false indicating whether priority adjustment is
- * required or not.
+ * Futex variant, that since futex variants do not use the fast-path, can be
+ * simple and will not need to retry.
  */
-bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock,
-				   struct wake_q_head *wqh)
+bool __sched __rt_mutex_futex_unlock(struct rt_mutex *lock,
+				    struct wake_q_head *wake_q)
+{
+	lockdep_assert_held(&lock->wait_lock);
+
+	debug_rt_mutex_unlock(lock);
+
+	if (!rt_mutex_has_waiters(lock)) {
+		lock->owner = NULL;
+		return false; /* done */
+	}
+
+	mark_wakeup_next_waiter(wake_q, lock);
+	return true; /* deboost and wakeups */
+}
+
+void __sched rt_mutex_futex_unlock(struct rt_mutex *lock)
 {
-	if (likely(rt_mutex_cmpxchg_release(lock, current, NULL)))
-		return false;
+	DEFINE_WAKE_Q(wake_q);
+	bool deboost;
 
-	return rt_mutex_slowunlock(lock, wqh);
+	raw_spin_lock_irq(&lock->wait_lock);
+	deboost = __rt_mutex_futex_unlock(lock, &wake_q);
+	raw_spin_unlock_irq(&lock->wait_lock);
+
+	if (deboost) {
+		wake_up_q(&wake_q);
+		rt_mutex_adjust_prio(current);
+	}
 }
 
 /**
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -109,9 +109,14 @@ extern int rt_mutex_start_proxy_lock(str
 extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
 				      struct hrtimer_sleeper *to,
 				      struct rt_mutex_waiter *waiter);
+
 extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
-extern bool rt_mutex_futex_unlock(struct rt_mutex *lock,
-				  struct wake_q_head *wqh);
+extern int rt_mutex_futex_trylock(struct rt_mutex *l);
+
+extern void rt_mutex_futex_unlock(struct rt_mutex *lock);
+extern bool __rt_mutex_futex_unlock(struct rt_mutex *lock,
+				 struct wake_q_head *wqh);
+
 extern void rt_mutex_adjust_prio(struct task_struct *task);
 
 #ifdef CONFIG_DEBUG_RT_MUTEXES

* [PATCH -v5 07/14] futex: Change locking rules
  2017-03-04  9:27 [PATCH -v5 00/14] the saga of FUTEX_UNLOCK_PI wobbles continues Peter Zijlstra
                   ` (5 preceding siblings ...)
  2017-03-04  9:27 ` [PATCH -v5 06/14] futex,rt_mutex: Provide futex specific rt_mutex API Peter Zijlstra
@ 2017-03-04  9:27 ` Peter Zijlstra
  2017-03-07 13:22   ` Thomas Gleixner
  2017-03-04  9:27 ` [PATCH -v5 08/14] futex: Cleanup refcounting Peter Zijlstra
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2017-03-04  9:27 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz

Currently futex-pi relies on hb->lock to serialize everything. Since
hb->lock is giving us problems (PI inversions among other things,
because on -rt the hb->lock itself is an rt_mutex), we want to break
this up a bit.

This patch reworks and documents the locking. Notably, it
consistently uses rt_mutex::wait_lock to serialize {uval, pi_state}.
This would allow us to do rt_mutex_unlock() (including deboost)
without holding hb->lock.

Nothing yet relies on the new locking rules.
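
As a sketch, the resulting nesting (mirroring the lock-order comment
added below) is:

        spin_lock(&hb->lock);           /* hb -> futex_q -> pi_state */
        raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
                                        /* {uval, pi_state} */
        raw_spin_lock(&task->pi_lock);  /* p->pi_state_list */
        ...
        raw_spin_unlock(&task->pi_lock);
        raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
        spin_unlock(&hb->lock);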

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c |  153 +++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 123 insertions(+), 30 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -973,6 +973,39 @@ void exit_pi_state_list(struct task_stru
  *
  * [10] There is no transient state which leaves owner and user space
  *	TID out of sync.
+ *
+ *
+ * Serialization and lifetime rules:
+ *
+ * hb->lock:
+ *
+ *	hb -> futex_q, relation
+ *	futex_q -> pi_state, relation
+ *
+ *	(cannot be raw because hb can contain arbitrary amount
+ *	 of futex_q's)
+ *
+ * pi_mutex->wait_lock:
+ *
+ *	{uval, pi_state}
+ *
+ *	(and pi_mutex 'obviously')
+ *
+ * p->pi_lock:
+ *
+ *	p->pi_state_list -> pi_state->list, relation
+ *
+ * pi_state->refcount:
+ *
+ *	pi_state lifetime
+ *
+ *
+ * Lock order:
+ *
+ *   hb->lock
+ *     pi_mutex->wait_lock
+ *       p->pi_lock
+ *
  */
 
 /*
@@ -980,10 +1013,12 @@ void exit_pi_state_list(struct task_stru
  * the pi_state against the user space value. If correct, attach to
  * it.
  */
-static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
+static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
+			      struct futex_pi_state *pi_state,
 			      struct futex_pi_state **ps)
 {
 	pid_t pid = uval & FUTEX_TID_MASK;
+	int ret, uval2;
 
 	/*
 	 * Userspace might have messed up non-PI and PI futexes [3]
@@ -991,9 +1026,34 @@ static int attach_to_pi_state(u32 uval,
 	if (unlikely(!pi_state))
 		return -EINVAL;
 
+	/*
+	 * We get here with hb->lock held, and having found a
+	 * futex_top_waiter(). This means that futex_lock_pi() of said futex_q
+	 * has dropped the hb->lock in between queue_me() and unqueue_me_pi(),
+	 * which in turn means that futex_lock_pi() still has a reference on
+	 * our pi_state.
+	 */
 	WARN_ON(!atomic_read(&pi_state->refcount));
 
 	/*
+	 * Now that we have a pi_state, we can acquire wait_lock
+	 * and do the state validation.
+	 */
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
+
+	/*
+	 * Since {uval, pi_state} is serialized by wait_lock, and our current
+	 * uval was read without holding it, it can have changed. Verify it
+	 * still is what we expect it to be, otherwise retry the entire
+	 * operation.
+	 */
+	if (get_futex_value_locked(&uval2, uaddr))
+		goto out_efault;
+
+	if (uval != uval2)
+		goto out_eagain;
+
+	/*
 	 * Handle the owner died case:
 	 */
 	if (uval & FUTEX_OWNER_DIED) {
@@ -1008,11 +1068,11 @@ static int attach_to_pi_state(u32 uval,
 			 * is not 0. Inconsistent state. [5]
 			 */
 			if (pid)
-				return -EINVAL;
+				goto out_einval;
 			/*
 			 * Take a ref on the state and return success. [4]
 			 */
-			goto out_state;
+			goto out_attach;
 		}
 
 		/*
@@ -1024,14 +1084,14 @@ static int attach_to_pi_state(u32 uval,
 		 * Take a ref on the state and return success. [6]
 		 */
 		if (!pid)
-			goto out_state;
+			goto out_attach;
 	} else {
 		/*
 		 * If the owner died bit is not set, then the pi_state
 		 * must have an owner. [7]
 		 */
 		if (!pi_state->owner)
-			return -EINVAL;
+			goto out_einval;
 	}
 
 	/*
@@ -1040,11 +1100,29 @@ static int attach_to_pi_state(u32 uval,
 	 * user space TID. [9/10]
 	 */
 	if (pid != task_pid_vnr(pi_state->owner))
-		return -EINVAL;
-out_state:
+		goto out_einval;
+
+out_attach:
 	atomic_inc(&pi_state->refcount);
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 	*ps = pi_state;
 	return 0;
+
+out_einval:
+	ret = -EINVAL;
+	goto out_error;
+
+out_eagain:
+	ret = -EAGAIN;
+	goto out_error;
+
+out_efault:
+	ret = -EFAULT;
+	goto out_error;
+
+out_error:
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+	return ret;
 }
 
 /*
@@ -1095,6 +1173,9 @@ static int attach_to_pi_owner(u32 uval,
 
 	/*
 	 * No existing pi state. First waiter. [2]
+	 *
+	 * This creates pi_state, we have hb->lock held, this means nothing can
+	 * observe this state, wait_lock is irrelevant.
 	 */
 	pi_state = alloc_pi_state();
 
@@ -1119,7 +1200,8 @@ static int attach_to_pi_owner(u32 uval,
 	return 0;
 }
 
-static int lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
+static int lookup_pi_state(u32 __user *uaddr, u32 uval,
+			   struct futex_hash_bucket *hb,
 			   union futex_key *key, struct futex_pi_state **ps)
 {
 	struct futex_q *top_waiter = futex_top_waiter(hb, key);
@@ -1129,7 +1211,7 @@ static int lookup_pi_state(u32 uval, str
 	 * attach to the pi_state when the validation succeeds.
 	 */
 	if (top_waiter)
-		return attach_to_pi_state(uval, top_waiter->pi_state, ps);
+		return attach_to_pi_state(uaddr, uval, top_waiter->pi_state, ps);
 
 	/*
 	 * We are the first waiter - try to look up the owner based on
@@ -1148,7 +1230,7 @@ static int lock_pi_update_atomic(u32 __u
 	if (unlikely(cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)))
 		return -EFAULT;
 
-	/*If user space value changed, let the caller retry */
+	/* If user space value changed, let the caller retry */
 	return curval != uval ? -EAGAIN : 0;
 }
 
@@ -1204,7 +1286,7 @@ static int futex_lock_pi_atomic(u32 __us
 	 */
 	top_waiter = futex_top_waiter(hb, key);
 	if (top_waiter)
-		return attach_to_pi_state(uval, top_waiter->pi_state, ps);
+		return attach_to_pi_state(uaddr, uval, top_waiter->pi_state, ps);
 
 	/*
 	 * No waiter and user TID is 0. We are here because the
@@ -1336,6 +1418,7 @@ static int wake_futex_pi(u32 __user *uad
 
 	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)) {
 		ret = -EFAULT;
+
 	} else if (curval != uval) {
 		/*
 		 * If a unconditional UNLOCK_PI operation (user space did not
@@ -1348,6 +1431,7 @@ static int wake_futex_pi(u32 __user *uad
 		else
 			ret = -EINVAL;
 	}
+
 	if (ret) {
 		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 		return ret;
@@ -1823,7 +1907,7 @@ static int futex_requeue(u32 __user *uad
 			 * If that call succeeds then we have pi_state and an
 			 * initial refcount on it.
 			 */
-			ret = lookup_pi_state(ret, hb2, &key2, &pi_state);
+			ret = lookup_pi_state(uaddr2, ret, hb2, &key2, &pi_state);
 		}
 
 		switch (ret) {
@@ -2122,10 +2206,13 @@ static int fixup_pi_state_owner(u32 __us
 {
 	u32 newtid = task_pid_vnr(newowner) | FUTEX_WAITERS;
 	struct futex_pi_state *pi_state = q->pi_state;
-	struct task_struct *oldowner = pi_state->owner;
 	u32 uval, uninitialized_var(curval), newval;
+	struct task_struct *oldowner;
 	int ret;
 
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
+
+	oldowner = pi_state->owner;
 	/* Owner died? */
 	if (!pi_state->owner)
 		newtid |= FUTEX_OWNER_DIED;
@@ -2141,11 +2228,10 @@ static int fixup_pi_state_owner(u32 __us
 	 * because we can fault here. Imagine swapped out pages or a fork
 	 * that marked all the anonymous memory readonly for cow.
 	 *
-	 * Modifying pi_state _before_ the user space value would
-	 * leave the pi_state in an inconsistent state when we fault
-	 * here, because we need to drop the hash bucket lock to
-	 * handle the fault. This might be observed in the PID check
-	 * in lookup_pi_state.
+	 * Modifying pi_state _before_ the user space value would leave the
+	 * pi_state in an inconsistent state when we fault here, because we
+	 * need to drop the locks to handle the fault. This might be observed
+	 * in the PID check in lookup_pi_state.
 	 */
 retry:
 	if (get_futex_value_locked(&uval, uaddr))
@@ -2166,36 +2252,43 @@ static int fixup_pi_state_owner(u32 __us
 	 * itself.
 	 */
 	if (pi_state->owner != NULL) {
-		raw_spin_lock_irq(&pi_state->owner->pi_lock);
+		raw_spin_lock(&pi_state->owner->pi_lock);
 		WARN_ON(list_empty(&pi_state->list));
 		list_del_init(&pi_state->list);
-		raw_spin_unlock_irq(&pi_state->owner->pi_lock);
+		raw_spin_unlock(&pi_state->owner->pi_lock);
 	}
 
 	pi_state->owner = newowner;
 
-	raw_spin_lock_irq(&newowner->pi_lock);
+	raw_spin_lock(&newowner->pi_lock);
 	WARN_ON(!list_empty(&pi_state->list));
 	list_add(&pi_state->list, &newowner->pi_state_list);
-	raw_spin_unlock_irq(&newowner->pi_lock);
+	raw_spin_unlock(&newowner->pi_lock);
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+
 	return 0;
 
 	/*
-	 * To handle the page fault we need to drop the hash bucket
-	 * lock here. That gives the other task (either the highest priority
-	 * waiter itself or the task which stole the rtmutex) the
-	 * chance to try the fixup of the pi_state. So once we are
-	 * back from handling the fault we need to check the pi_state
-	 * after reacquiring the hash bucket lock and before trying to
-	 * do another fixup. When the fixup has been done already we
-	 * simply return.
+	 * To handle the page fault we need to drop the locks here. That gives
+	 * the other task (either the highest priority waiter itself or the
+	 * task which stole the rtmutex) the chance to try the fixup of the
+	 * pi_state. So once we are back from handling the fault we need to
+	 * check the pi_state after reacquiring the locks and before trying to
+	 * do another fixup. When the fixup has been done already we simply
+	 * return.
+	 *
+	 * Note: we hold both hb->lock and pi_mutex->wait_lock. We can safely
+	 * drop hb->lock since the caller owns the hb -> futex_q relation.
+	 * Dropping the pi_mutex->wait_lock requires the state revalidate.
 	 */
 handle_fault:
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 	spin_unlock(q->lock_ptr);
 
 	ret = fault_in_user_writeable(uaddr);
 
 	spin_lock(q->lock_ptr);
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 
 	/*
 	 * Check if someone else fixed it for us:

* [PATCH -v5 08/14] futex: Cleanup refcounting
  2017-03-04  9:27 [PATCH -v5 00/14] the saga of FUTEX_UNLOCK_PI wobbles continues Peter Zijlstra
                   ` (6 preceding siblings ...)
  2017-03-04  9:27 ` [PATCH -v5 07/14] futex: Change locking rules Peter Zijlstra
@ 2017-03-04  9:27 ` Peter Zijlstra
  2017-03-04  9:27 ` [PATCH -v5 09/14] futex: Rework inconsistent rt_mutex/futex_q state Peter Zijlstra
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2017-03-04  9:27 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz

Since we're going to add more refcount fiddling, introduce
get_pi_state() to match the existing put_pi_state().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c |   13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -802,7 +802,7 @@ static int refill_pi_state_cache(void)
 	return 0;
 }
 
-static struct futex_pi_state * alloc_pi_state(void)
+static struct futex_pi_state *alloc_pi_state(void)
 {
 	struct futex_pi_state *pi_state = current->pi_state_cache;
 
@@ -812,6 +812,11 @@ static struct futex_pi_state * alloc_pi_
 	return pi_state;
 }
 
+static void get_pi_state(struct futex_pi_state *pi_state)
+{
+	WARN_ON_ONCE(!atomic_inc_not_zero(&pi_state->refcount));
+}
+
 /*
  * Drops a reference to the pi_state object and frees or caches it
  * when the last reference is gone.
@@ -856,7 +861,7 @@ static void put_pi_state(struct futex_pi
  * Look up the task based on what TID userspace gave us.
  * We dont trust it.
  */
-static struct task_struct * futex_find_get_task(pid_t pid)
+static struct task_struct *futex_find_get_task(pid_t pid)
 {
 	struct task_struct *p;
 
@@ -1103,7 +1108,7 @@ static int attach_to_pi_state(u32 __user
 		goto out_einval;
 
 out_attach:
-	atomic_inc(&pi_state->refcount);
+	get_pi_state(pi_state);
 	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 	*ps = pi_state;
 	return 0;
@@ -1990,7 +1995,7 @@ static int futex_requeue(u32 __user *uad
 			 * refcount on the pi_state and store the pointer in
 			 * the futex_q object of the waiter.
 			 */
-			atomic_inc(&pi_state->refcount);
+			get_pi_state(pi_state);
 			this->pi_state = pi_state;
 			ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
 							this->rt_waiter,

* [PATCH -v5 09/14] futex: Rework inconsistent rt_mutex/futex_q state
  2017-03-04  9:27 [PATCH -v5 00/14] the saga of FUTEX_UNLOCK_PI wobbles continues Peter Zijlstra
                   ` (7 preceding siblings ...)
  2017-03-04  9:27 ` [PATCH -v5 08/14] futex: Cleanup refcounting Peter Zijlstra
@ 2017-03-04  9:27 ` Peter Zijlstra
  2017-03-07 13:26   ` Thomas Gleixner
  2017-03-04  9:27 ` [PATCH -v5 10/14] futex: Pull rt_mutex_futex_unlock() out from under hb->lock Peter Zijlstra
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2017-03-04  9:27 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz

There is a weird state in the futex_unlock_pi() path when it
interleaves with a concurrent futex_lock_pi() at the point where it
drops hb->lock.

In this case, it can happen that the rt_mutex wait_list and the
futex_q disagree on pending waiters; in particular, rt_mutex will find
no pending waiters where futex_q thinks there are.

In this case the rt_mutex unlock code cannot assign an owner.

What the current code does in this case is use the futex_q waiter
that got us here, even though rt_mutex_timed_futex_lock() has already
failed; this leaves things in a weird state, resulting in many
headaches in fixup_owner().

Simplify all this by changing wake_futex_pi() to return -EAGAIN when
this situation occurs. This then gives the futex_lock_pi() code the
opportunity to continue and the retried futex_unlock_pi() will now
observe a coherent state.
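
A simplified interleave, with T1 in futex_lock_pi() and T2 in
futex_unlock_pi():

        T1: queue_me()                  /* futex_q visible on hb     */
        T1: rt_mutex_timed_futex_lock() /* fails: timeout or signal  */
        T1:   remove_waiter()           /* rt_mutex wait_list empty  */
        T2: futex_top_waiter()          /* still sees T1's futex_q   */
        T2: rt_mutex_next_owner()       /* NULL: nobody left to wake */

Rather than handing the lock to T1 anyway, wake_futex_pi() now returns
-EAGAIN, and the retried futex_unlock_pi() runs after T1 has unqueued.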

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c |   49 +++++++++++++------------------------------------
 1 file changed, 13 insertions(+), 36 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1402,12 +1402,18 @@ static int wake_futex_pi(u32 __user *uad
 	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
 
 	/*
-	 * It is possible that the next waiter (the one that brought
-	 * top_waiter owner to the kernel) timed out and is no longer
-	 * waiting on the lock.
+	 * When we interleave with futex_lock_pi() where it does
+	 * rt_mutex_timed_futex_lock(), we might observe @this futex_q waiter,
+	 * but the rt_mutex's wait_list can be empty (either still, or again,
+	 * depending on which side we land).
+	 *
+	 * When this happens, give up our locks and try again, giving the
+	 * futex_lock_pi() instance time to complete and unqueue_me().
 	 */
-	if (!new_owner)
-		new_owner = top_waiter->task;
+	if (!new_owner) {
+		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+		return -EAGAIN;
+	}
 
 	/*
 	 * We pass it to the next owner. The WAITERS bit is always
@@ -2324,7 +2330,6 @@ static long futex_wait_restart(struct re
  */
 static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked)
 {
-	struct task_struct *owner;
 	int ret = 0;
 
 	if (locked) {
@@ -2338,43 +2343,15 @@ static int fixup_owner(u32 __user *uaddr
 	}
 
 	/*
-	 * Catch the rare case, where the lock was released when we were on the
-	 * way back before we locked the hash bucket.
-	 */
-	if (q->pi_state->owner == current) {
-		/*
-		 * Try to get the rt_mutex now. This might fail as some other
-		 * task acquired the rt_mutex after we removed ourself from the
-		 * rt_mutex waiters list.
-		 */
-		if (rt_mutex_futex_trylock(&q->pi_state->pi_mutex)) {
-			locked = 1;
-			goto out;
-		}
-
-		/*
-		 * pi_state is incorrect, some other task did a lock steal and
-		 * we returned due to timeout or signal without taking the
-		 * rt_mutex. Too late.
-		 */
-		raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock);
-		owner = rt_mutex_owner(&q->pi_state->pi_mutex);
-		if (!owner)
-			owner = rt_mutex_next_owner(&q->pi_state->pi_mutex);
-		raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock);
-		ret = fixup_pi_state_owner(uaddr, q, owner);
-		goto out;
-	}
-
-	/*
 	 * Paranoia check. If we did not take the lock, then we should not be
 	 * the owner of the rt_mutex.
 	 */
-	if (rt_mutex_owner(&q->pi_state->pi_mutex) == current)
+	if (rt_mutex_owner(&q->pi_state->pi_mutex) == current) {
 		printk(KERN_ERR "fixup_owner: ret = %d pi-mutex: %p "
 				"pi-state %p\n", ret,
 				q->pi_state->pi_mutex.owner,
 				q->pi_state->owner);
+	}
 
 out:
 	return ret ? ret : locked;

* [PATCH -v5 10/14] futex: Pull rt_mutex_futex_unlock() out from under hb->lock
  2017-03-04  9:27 [PATCH -v5 00/14] the saga of FUTEX_UNLOCK_PI wobbles continues Peter Zijlstra
                   ` (8 preceding siblings ...)
  2017-03-04  9:27 ` [PATCH -v5 09/14] futex: Rework inconsistent rt_mutex/futex_q state Peter Zijlstra
@ 2017-03-04  9:27 ` Peter Zijlstra
  2017-03-07 14:08   ` Thomas Gleixner
  2017-03-04  9:27 ` [PATCH -v5 11/14] futex,rt_mutex: Introduce rt_mutex_init_waiter() Peter Zijlstra
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2017-03-04  9:27 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz

There are a number of 'interesting' problems, all caused by holding
hb->lock while doing the rt_mutex_unlock() equivalent.

Notably:

 - a PI inversion on hb->lock; and

 - a DL crash because of pointer instability.

Because of all the previous patches, which:

 - allow us to do rt_mutex_futex_unlock() without dropping wait_lock,
   which in turn allows us to rely on wait_lock atomicity;

 - change the locking rules to cover {uval,pi_state} with wait_lock; and

 - simplify the waiter conundrum;

we can now quite simply pull rt_mutex_futex_unlock() out from under
hb->lock; a pi_state reference and wait_lock are sufficient.
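
The core of the new futex_unlock_pi() path, condensed from the patch
below:

        top_waiter = futex_top_waiter(hb, &key);
        if (top_waiter) {
                struct futex_pi_state *pi_state = top_waiter->pi_state;

                /* ... validate pi_state and pi_state->owner == current ... */

                get_pi_state(pi_state);         /* pin pi_state, then ... */
                spin_unlock(&hb->lock);         /* ... drop hb->lock      */

                ret = wake_futex_pi(uaddr, uval, pi_state);

                put_pi_state(pi_state);
        }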

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c |  144 ++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 91 insertions(+), 53 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -919,10 +919,12 @@ void exit_pi_state_list(struct task_stru
 		pi_state->owner = NULL;
 		raw_spin_unlock_irq(&curr->pi_lock);
 
-		rt_mutex_futex_unlock(&pi_state->pi_mutex);
-
+		get_pi_state(pi_state);
 		spin_unlock(&hb->lock);
 
+		rt_mutex_futex_unlock(&pi_state->pi_mutex);
+		put_pi_state(pi_state);
+
 		raw_spin_lock_irq(&curr->pi_lock);
 	}
 	raw_spin_unlock_irq(&curr->pi_lock);
@@ -1035,6 +1037,9 @@ static int attach_to_pi_state(u32 __user
 	 * has dropped the hb->lock in between queue_me() and unqueue_me_pi(),
 	 * which in turn means that futex_lock_pi() still has a reference on
 	 * our pi_state.
+	 *
+	 * IOW, we cannot race against the unlocked put_pi_state() in
+	 * futex_unlock_pi().
 	 */
 	WARN_ON(!atomic_read(&pi_state->refcount));
 
@@ -1378,47 +1383,33 @@ static void mark_wake_futex(struct wake_
 	smp_store_release(&q->lock_ptr, NULL);
 }
 
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
-			 struct futex_hash_bucket *hb)
+static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
 {
-	struct task_struct *new_owner;
-	struct futex_pi_state *pi_state = top_waiter->pi_state;
 	u32 uninitialized_var(curval), newval;
+	struct task_struct *new_owner;
+	bool deboost = false;
 	DEFINE_WAKE_Q(wake_q);
-	bool deboost;
 	int ret = 0;
 
-	if (!pi_state)
-		return -EINVAL;
-
-	/*
-	 * If current does not own the pi_state then the futex is
-	 * inconsistent and user space fiddled with the futex value.
-	 */
-	if (pi_state->owner != current)
-		return -EINVAL;
-
 	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
-
-	/*
-	 * When we interleave with futex_lock_pi() where it does
-	 * rt_mutex_timed_futex_lock(), we might observe @this futex_q waiter,
-	 * but the rt_mutex's wait_list can be empty (either still, or again,
-	 * depending on which side we land).
-	 *
-	 * When this happens, give up our locks and try again, giving the
-	 * futex_lock_pi() instance time to complete and unqueue_me().
-	 */
 	if (!new_owner) {
-		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
-		return -EAGAIN;
+		/*
+		 * Since we held neither hb->lock nor wait_lock when coming
+		 * into this function, we could have raced with futex_lock_pi()
+		 * such that it will have removed the waiter that brought us
+		 * here.
+		 *
+		 * In this case, retry the entire operation.
+		 */
+		ret = -EAGAIN;
+		goto out_unlock;
 	}
 
 	/*
-	 * We pass it to the next owner. The WAITERS bit is always
-	 * kept enabled while there is PI state around. We cleanup the
-	 * owner died bit, because we are the owner.
+	 * We pass it to the next owner. The WAITERS bit is always kept
+	 * enabled while there is PI state around. We cleanup the owner
+	 * died bit, because we are the owner.
 	 */
 	newval = FUTEX_WAITERS | task_pid_vnr(new_owner);
 
@@ -1441,10 +1432,8 @@ static int wake_futex_pi(u32 __user *uad
 			ret = -EINVAL;
 	}
 
-	if (ret) {
-		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
-		return ret;
-	}
+	if (ret)
+		goto out_unlock;
 
 	raw_spin_lock(&pi_state->owner->pi_lock);
 	WARN_ON(list_empty(&pi_state->list));
@@ -1462,15 +1451,15 @@ static int wake_futex_pi(u32 __user *uad
 	 */
 	deboost = __rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
 
+out_unlock:
 	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
-	spin_unlock(&hb->lock);
 
 	if (deboost) {
 		wake_up_q(&wake_q);
 		rt_mutex_adjust_prio(current);
 	}
 
-	return 0;
+	return ret;
 }
 
 /*
@@ -2229,7 +2218,8 @@ static int fixup_pi_state_owner(u32 __us
 	/*
 	 * We are here either because we stole the rtmutex from the
 	 * previous highest priority waiter or we are the highest priority
-	 * waiter but failed to get the rtmutex the first time.
+	 * waiter but have failed to get the rtmutex the first time.
+	 *
 	 * We have to replace the newowner TID in the user space variable.
 	 * This must be atomic as we have to preserve the owner died bit here.
 	 *
@@ -2246,7 +2236,7 @@ static int fixup_pi_state_owner(u32 __us
 	if (get_futex_value_locked(&uval, uaddr))
 		goto handle_fault;
 
-	while (1) {
+	for (;;) {
 		newval = (uval & FUTEX_OWNER_DIED) | newtid;
 
 		if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval))
@@ -2336,6 +2326,10 @@ static int fixup_owner(u32 __user *uaddr
 		/*
 		 * Got the lock. We might not be the anticipated owner if we
 		 * did a lock-steal - fix up the PI-state in that case:
+		 *
+		 * We can safely read pi_state->owner without holding wait_lock
+		 * because we now own the rt_mutex, only the owner will attempt
+		 * to change it.
 		 */
 		if (q->pi_state->owner != current)
 			ret = fixup_pi_state_owner(uaddr, q, current);
@@ -2575,6 +2569,7 @@ static int futex_lock_pi(u32 __user *uad
 			 ktime_t *time, int trylock)
 {
 	struct hrtimer_sleeper timeout, *to = NULL;
+	struct futex_pi_state *pi_state = NULL;
 	struct futex_hash_bucket *hb;
 	struct futex_q q = futex_q_init;
 	int res, ret;
@@ -2661,12 +2656,19 @@ static int futex_lock_pi(u32 __user *uad
 	 * If fixup_owner() faulted and was unable to handle the fault, unlock
 	 * it and return the fault to userspace.
 	 */
-	if (ret && (rt_mutex_owner(&q.pi_state->pi_mutex) == current))
-		rt_mutex_futex_unlock(&q.pi_state->pi_mutex);
+	if (ret && (rt_mutex_owner(&q.pi_state->pi_mutex) == current)) {
+		pi_state = q.pi_state;
+		get_pi_state(pi_state);
+	}
 
 	/* Unqueue and drop the lock */
 	unqueue_me_pi(&q);
 
+	if (pi_state) {
+		rt_mutex_futex_unlock(&pi_state->pi_mutex);
+		put_pi_state(pi_state);
+	}
+
 	goto out_put_key;
 
 out_unlock_put_key:
@@ -2729,10 +2731,36 @@ static int futex_unlock_pi(u32 __user *u
 	 */
 	top_waiter = futex_top_waiter(hb, &key);
 	if (top_waiter) {
-		ret = wake_futex_pi(uaddr, uval, top_waiter, hb);
+		struct futex_pi_state *pi_state = top_waiter->pi_state;
+
+		ret = -EINVAL;
+		if (!pi_state)
+			goto out_unlock;
+
+		/*
+		 * If current does not own the pi_state then the futex is
+		 * inconsistent and user space fiddled with the futex value.
+		 */
+		if (pi_state->owner != current)
+			goto out_unlock;
+
 		/*
-		 * In case of success wake_futex_pi dropped the hash
-		 * bucket lock.
+		 * Grab a reference on the pi_state and drop hb->lock.
+		 *
+		 * The reference ensures pi_state lives, dropping the hb->lock
+		 * is tricky.. wake_futex_pi() will take rt_mutex::wait_lock to
+		 * close the races against futex_lock_pi(), but in case of
+		 * _any_ fail we'll abort and retry the whole deal.
+		 */
+		get_pi_state(pi_state);
+		spin_unlock(&hb->lock);
+
+		ret = wake_futex_pi(uaddr, uval, pi_state);
+
+		put_pi_state(pi_state);
+
+		/*
+		 * Success, we're done! No tricky corner cases.
 		 */
 		if (!ret)
 			goto out_putkey;
@@ -2747,7 +2775,6 @@ static int futex_unlock_pi(u32 __user *u
 		 * setting the FUTEX_WAITERS bit. Try again.
 		 */
 		if (ret == -EAGAIN) {
-			spin_unlock(&hb->lock);
 			put_futex_key(&key);
 			goto retry;
 		}
@@ -2755,7 +2782,7 @@ static int futex_unlock_pi(u32 __user *u
 		 * wake_futex_pi has detected invalid state. Tell user
 		 * space.
 		 */
-		goto out_unlock;
+		goto out_putkey;
 	}
 
 	/*
@@ -2765,8 +2792,10 @@ static int futex_unlock_pi(u32 __user *u
 	 * preserve the WAITERS bit not the OWNER_DIED one. We are the
 	 * owner.
 	 */
-	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, 0))
+	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, 0)) {
+		spin_unlock(&hb->lock);
 		goto pi_faulted;
+	}
 
 	/*
 	 * If uval has changed, let user space handle it.
@@ -2780,7 +2809,6 @@ static int futex_unlock_pi(u32 __user *u
 	return ret;
 
 pi_faulted:
-	spin_unlock(&hb->lock);
 	put_futex_key(&key);
 
 	ret = fault_in_user_writeable(uaddr);
@@ -2884,6 +2912,7 @@ static int futex_wait_requeue_pi(u32 __u
 				 u32 __user *uaddr2)
 {
 	struct hrtimer_sleeper timeout, *to = NULL;
+	struct futex_pi_state *pi_state = NULL;
 	struct rt_mutex_waiter rt_waiter;
 	struct futex_hash_bucket *hb;
 	union futex_key key2 = FUTEX_KEY_INIT;
@@ -2968,8 +2997,10 @@ static int futex_wait_requeue_pi(u32 __u
 		if (q.pi_state && (q.pi_state->owner != current)) {
 			spin_lock(q.lock_ptr);
 			ret = fixup_pi_state_owner(uaddr2, &q, current);
-			if (ret && rt_mutex_owner(&q.pi_state->pi_mutex) == current)
-				rt_mutex_futex_unlock(&q.pi_state->pi_mutex);
+			if (ret && rt_mutex_owner(&q.pi_state->pi_mutex) == current) {
+				pi_state = q.pi_state;
+				get_pi_state(pi_state);
+			}
 			/*
 			 * Drop the reference to the pi state which
 			 * the requeue_pi() code acquired for us.
@@ -3008,13 +3039,20 @@ static int futex_wait_requeue_pi(u32 __u
 		 * the fault, unlock the rt_mutex and return the fault to
 		 * userspace.
 		 */
-		if (ret && rt_mutex_owner(pi_mutex) == current)
-			rt_mutex_futex_unlock(pi_mutex);
+		if (ret && rt_mutex_owner(&q.pi_state->pi_mutex) == current) {
+			pi_state = q.pi_state;
+			get_pi_state(pi_state);
+		}
 
 		/* Unqueue and drop the lock. */
 		unqueue_me_pi(&q);
 	}
 
+	if (pi_state) {
+		rt_mutex_futex_unlock(&pi_state->pi_mutex);
+		put_pi_state(pi_state);
+	}
+
 	if (ret == -EINTR) {
 		/*
 		 * We've already been requeued, but cannot restart by calling

* [PATCH -v5 11/14] futex,rt_mutex: Introduce rt_mutex_init_waiter()
  2017-03-04  9:27 [PATCH -v5 00/14] the saga of FUTEX_UNLOCK_PI wobbles continues Peter Zijlstra
                   ` (9 preceding siblings ...)
  2017-03-04  9:27 ` [PATCH -v5 10/14] futex: Pull rt_mutex_futex_unlock() out from under hb->lock Peter Zijlstra
@ 2017-03-04  9:27 ` Peter Zijlstra
  2017-03-04  9:27 ` [PATCH -v5 12/14] futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock() Peter Zijlstra
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2017-03-04  9:27 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz

Since there are already two copies of this code, introduce a helper
now before we grow a third instance.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c                  |    5 +----
 kernel/locking/rtmutex.c        |   12 +++++++++---
 kernel/locking/rtmutex_common.h |    1 +
 3 files changed, 11 insertions(+), 7 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2939,10 +2939,7 @@ static int futex_wait_requeue_pi(u32 __u
 	 * The waiter is allocated on our stack, manipulated by the requeue
 	 * code while we sleep on uaddr.
 	 */
-	debug_rt_mutex_init_waiter(&rt_waiter);
-	RB_CLEAR_NODE(&rt_waiter.pi_tree_entry);
-	RB_CLEAR_NODE(&rt_waiter.tree_entry);
-	rt_waiter.task = NULL;
+	rt_mutex_init_waiter(&rt_waiter);
 
 	ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE);
 	if (unlikely(ret != 0))
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1151,6 +1151,14 @@ void rt_mutex_adjust_pi(struct task_stru
 				   next_lock, NULL, task);
 }
 
+void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
+{
+	debug_rt_mutex_init_waiter(waiter);
+	RB_CLEAR_NODE(&waiter->pi_tree_entry);
+	RB_CLEAR_NODE(&waiter->tree_entry);
+	waiter->task = NULL;
+}
+
 /**
  * __rt_mutex_slowlock() - Perform the wait-wake-try-to-take loop
  * @lock:		 the rt_mutex to take
@@ -1233,9 +1241,7 @@ rt_mutex_slowlock(struct rt_mutex *lock,
 	unsigned long flags;
 	int ret = 0;
 
-	debug_rt_mutex_init_waiter(&waiter);
-	RB_CLEAR_NODE(&waiter.pi_tree_entry);
-	RB_CLEAR_NODE(&waiter.tree_entry);
+	rt_mutex_init_waiter(&waiter);
 
 	/*
 	 * Technically we could use raw_spin_[un]lock_irq() here, but this can
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -102,6 +102,7 @@ extern void rt_mutex_init_proxy_locked(s
 				       struct task_struct *proxy_owner);
 extern void rt_mutex_proxy_unlock(struct rt_mutex *lock,
 				  struct task_struct *proxy_owner);
+extern void rt_mutex_init_waiter(struct rt_mutex_waiter *waiter);
 extern int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 				     struct rt_mutex_waiter *waiter,
 				     struct task_struct *task);

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH -v5 12/14] futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()
  2017-03-04  9:27 [PATCH -v5 00/14] the saga of FUTEX_UNLOCK_PI wobbles continues Peter Zijlstra
                   ` (10 preceding siblings ...)
  2017-03-04  9:27 ` [PATCH -v5 11/14] futex,rt_mutex: Introduce rt_mutex_init_waiter() Peter Zijlstra
@ 2017-03-04  9:27 ` Peter Zijlstra
  2017-03-07 14:18   ` Thomas Gleixner
  2017-03-08 15:29   ` [PATCH] futex: move debug_rt_mutex_free_waiter() further down Sebastian Andrzej Siewior
  2017-03-04  9:27 ` [PATCH -v5 13/14] futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock() Peter Zijlstra
  2017-03-04  9:27 ` [PATCH -v5 14/14] futex: futex_unlock_pi() determinism Peter Zijlstra
  13 siblings, 2 replies; 36+ messages in thread
From: Peter Zijlstra @ 2017-03-04  9:27 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz

[-- Attachment #1: peterz-futex-pi-unlock-9.patch --]
[-- Type: text/plain, Size: 4607 bytes --]

With the ultimate goal of keeping rt_mutex wait_list and futex_q
waiters consistent, we want to split 'rt_mutex_futex_lock()' into finer
parts, such that only the actual blocking can be done without hb->lock
held.

This means we need to split rt_mutex_finish_proxy_lock() into two
parts, one that does the blocking and one that does remove_waiter()
when we fail to acquire.

When we do acquire, we can safely remove ourselves, since there is no
concurrency on the lock owner.

This means that, except for futex_lock_pi(), all wait_list
modifications are done with both hb->lock and wait_lock held.
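
Sketched with the names introduced here (the literal futex.c hunk
follows below), the wait/cleanup caller pattern becomes:

	ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter);

	spin_lock(q.lock_ptr);		/* hb->lock */
	if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
		ret = 0;		/* we acquired the lock after all */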

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c                  |    5 +++-
 kernel/locking/rtmutex.c        |   47 ++++++++++++++++++++++++++++++++++------
 kernel/locking/rtmutex_common.h |    8 ++++--
 3 files changed, 49 insertions(+), 11 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3012,10 +3012,13 @@ static int futex_wait_requeue_pi(u32 __u
 		 */
 		WARN_ON(!q.pi_state);
 		pi_mutex = &q.pi_state->pi_mutex;
-		ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter);
+		ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter);
 		debug_rt_mutex_free_waiter(&rt_waiter);
 
 		spin_lock(q.lock_ptr);
+		if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
+			ret = 0;
+
 		/*
 		 * Fixup the pi_state owner and possibly acquire the lock if we
 		 * haven't already.
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1747,21 +1747,23 @@ struct task_struct *rt_mutex_next_owner(
 }
 
 /**
- * rt_mutex_finish_proxy_lock() - Complete lock acquisition
+ * rt_mutex_wait_proxy_lock() - Wait for lock acquisition
  * @lock:		the rt_mutex we were woken on
  * @to:			the timeout, null if none. hrtimer should already have
  *			been started.
  * @waiter:		the pre-initialized rt_mutex_waiter
  *
- * Complete the lock acquisition started our behalf by another thread.
+ * Wait for the lock acquisition started on our behalf by
+ * rt_mutex_start_proxy_lock(). Upon failure, the caller must call
+ * rt_mutex_cleanup_proxy_lock().
  *
  * Returns:
  *  0 - success
  * <0 - error, one of -EINTR, -ETIMEDOUT
  *
- * Special API call for PI-futex requeue support
+ * Special API call for PI-futex support
  */
-int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
+int rt_mutex_wait_proxy_lock(struct rt_mutex *lock,
 			       struct hrtimer_sleeper *to,
 			       struct rt_mutex_waiter *waiter)
 {
@@ -1774,9 +1776,6 @@ int rt_mutex_finish_proxy_lock(struct rt
 	/* sleep on the mutex */
 	ret = __rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE, to, waiter);
 
-	if (unlikely(ret))
-		remove_waiter(lock, waiter);
-
 	/*
 	 * try_to_take_rt_mutex() sets the waiter bit unconditionally. We might
 	 * have to fix that up.
@@ -1787,3 +1786,37 @@ int rt_mutex_finish_proxy_lock(struct rt
 
 	return ret;
 }
+
+/**
+ * rt_mutex_cleanup_proxy_lock() - Cleanup failed lock acquisition
+ * @lock:		the rt_mutex we were woken on
+ * @waiter:		the pre-initialized rt_mutex_waiter
+ *
+ * Clean up the failed lock acquisition as per rt_mutex_wait_proxy_lock().
+ *
+ * Returns:
+ *  true  - did the cleanup, we're done.
+ *  false - we acquired the lock after rt_mutex_wait_proxy_lock() returned,
+ *          caller should disregard its return value.
+ *
+ * Special API call for PI-futex support
+ */
+bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock,
+				 struct rt_mutex_waiter *waiter)
+{
+	bool cleanup = false;
+
+	raw_spin_lock_irq(&lock->wait_lock);
+	/*
+	 * If we acquired the lock, no cleanup required.
+	 */
+	if (rt_mutex_owner(lock) != current) {
+		remove_waiter(lock, waiter);
+		fixup_rt_mutex_waiters(lock);
+		cleanup = true;
+	}
+	raw_spin_unlock_irq(&lock->wait_lock);
+
+	return cleanup;
+}
+
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -106,9 +106,11 @@ extern void rt_mutex_proxy_unlock(struct
 extern int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 				     struct rt_mutex_waiter *waiter,
 				     struct task_struct *task);
-extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
-				      struct hrtimer_sleeper *to,
-				      struct rt_mutex_waiter *waiter);
+extern int rt_mutex_wait_proxy_lock(struct rt_mutex *lock,
+			       struct hrtimer_sleeper *to,
+			       struct rt_mutex_waiter *waiter);
+extern bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock,
+				 struct rt_mutex_waiter *waiter);
 
 extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
 extern int rt_mutex_futex_trylock(struct rt_mutex *l);

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH -v5 13/14] futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
  2017-03-04  9:27 [PATCH -v5 00/14] the saga of FUTEX_UNLOCK_PI wobbles continues Peter Zijlstra
                   ` (11 preceding siblings ...)
  2017-03-04  9:27 ` [PATCH -v5 12/14] futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock() Peter Zijlstra
@ 2017-03-04  9:27 ` Peter Zijlstra
  2017-03-04  9:27 ` [PATCH -v5 14/14] futex: futex_unlock_pi() determinism Peter Zijlstra
  13 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2017-03-04  9:27 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz

[-- Attachment #1: peterz-futex-pi-unlock-5.patch --]
[-- Type: text/plain, Size: 5921 bytes --]

By changing futex_lock_pi() to use rt_mutex_*_proxy_lock() we arrive
at a point where all wait_list modifications are done under both
hb->lock and wait_lock.

This closes the obvious interleave pattern between futex_lock_pi() and
futex_unlock_pi(), but not entirely so. See below:


Before:

futex_lock_pi()			futex_unlock_pi()
  unlock hb->lock

				  lock hb->lock
				  unlock hb->lock

				  lock rt_mutex->wait_lock
				  unlock rt_mutex_wait_lock
				    -EAGAIN

  lock rt_mutex->wait_lock
  list_add
  unlock rt_mutex->wait_lock

  schedule()

  lock rt_mutex->wait_lock
  list_del
  unlock rt_mutex->wait_lock

				  <idem>
				    -EAGAIN

  lock hb->lock


After:

futex_lock_pi()			futex_unlock_pi()

  lock hb->lock
  lock rt_mutex->wait_lock
  list_add
  unlock rt_mutex->wait_lock
  unlock hb->lock

  schedule()
				  lock hb->lock
				  unlock hb->lock
  lock hb->lock
  lock rt_mutex->wait_lock
  list_del
  unlock rt_mutex->wait_lock

				  lock rt_mutex->wait_lock
				  unlock rt_mutex_wait_lock
				    -EAGAIN

  unlock hb->lock
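
In outline, the new futex_lock_pi() blocking path then is (a sketch of
the hunk below; timer setup elided):

	__queue_me(&q, hb);			/* queue, keep hb->lock held */

	rt_mutex_init_waiter(&rt_waiter);
	rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current);
	spin_unlock(q.lock_ptr);		/* hb->lock */

	ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);

	spin_lock(q.lock_ptr);
	if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
		ret = 0;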


Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c                  |   70 +++++++++++++++++++++++++++-------------
 kernel/locking/rtmutex.c        |   13 -------
 kernel/locking/rtmutex_common.h |    1 
 3 files changed, 48 insertions(+), 36 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2088,20 +2088,7 @@ queue_unlock(struct futex_hash_bucket *h
 	hb_waiters_dec(hb);
 }
 
-/**
- * queue_me() - Enqueue the futex_q on the futex_hash_bucket
- * @q:	The futex_q to enqueue
- * @hb:	The destination hash bucket
- *
- * The hb->lock must be held by the caller, and is released here. A call to
- * queue_me() is typically paired with exactly one call to unqueue_me().  The
- * exceptions involve the PI related operations, which may use unqueue_me_pi()
- * or nothing if the unqueue is done as part of the wake process and the unqueue
- * state is implicit in the state of woken task (see futex_wait_requeue_pi() for
- * an example).
- */
-static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
-	__releases(&hb->lock)
+static inline void __queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
 {
 	int prio;
 
@@ -2118,6 +2105,24 @@ static inline void queue_me(struct futex
 	plist_node_init(&q->list, prio);
 	plist_add(&q->list, &hb->chain);
 	q->task = current;
+}
+
+/**
+ * queue_me() - Enqueue the futex_q on the futex_hash_bucket
+ * @q:	The futex_q to enqueue
+ * @hb:	The destination hash bucket
+ *
+ * The hb->lock must be held by the caller, and is released here. A call to
+ * queue_me() is typically paired with exactly one call to unqueue_me().  The
+ * exceptions involve the PI related operations, which may use unqueue_me_pi()
+ * or nothing if the unqueue is done as part of the wake process and the unqueue
+ * state is implicit in the state of woken task (see futex_wait_requeue_pi() for
+ * an example).
+ */
+static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
+	__releases(&hb->lock)
+{
+	__queue_me(q, hb);
 	spin_unlock(&hb->lock);
 }
 
@@ -2570,6 +2575,7 @@ static int futex_lock_pi(u32 __user *uad
 {
 	struct hrtimer_sleeper timeout, *to = NULL;
 	struct futex_pi_state *pi_state = NULL;
+	struct rt_mutex_waiter rt_waiter;
 	struct futex_hash_bucket *hb;
 	struct futex_q q = futex_q_init;
 	int res, ret;
@@ -2622,25 +2628,45 @@ static int futex_lock_pi(u32 __user *uad
 		}
 	}
 
+	WARN_ON(!q.pi_state);
+
 	/*
 	 * Only actually queue now that the atomic ops are done:
 	 */
-	queue_me(&q, hb);
+	__queue_me(&q, hb);
 
-	WARN_ON(!q.pi_state);
-	/*
-	 * Block on the PI mutex:
-	 */
-	if (!trylock) {
-		ret = rt_mutex_timed_futex_lock(&q.pi_state->pi_mutex, to);
-	} else {
+	if (trylock) {
 		ret = rt_mutex_futex_trylock(&q.pi_state->pi_mutex);
 		/* Fixup the trylock return value: */
 		ret = ret ? 0 : -EWOULDBLOCK;
+		goto did_trylock;
 	}
 
+	/*
+	 * We must add ourselves to the rt_mutex waitlist while holding hb->lock
+	 * such that the hb and rt_mutex wait lists match.
+	 */
+	rt_mutex_init_waiter(&rt_waiter);
+	rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current);
+	spin_unlock(q.lock_ptr);
+
+	if (unlikely(to))
+		hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
+
+	ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
+
 	spin_lock(q.lock_ptr);
 	/*
+	 * If we failed to acquire the lock (signal/timeout), we must
+	 * first acquire the hb->lock before removing the lock from the
+	 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex
+	 * wait lists consistent.
+	 */
+	if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
+		ret = 0;
+
+did_trylock:
+	/*
 	 * Fixup the pi_state owner and possibly acquire the lock if we
 	 * haven't already.
 	 */
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1491,19 +1491,6 @@ int __sched rt_mutex_lock_interruptible(
 EXPORT_SYMBOL_GPL(rt_mutex_lock_interruptible);
 
 /*
- * Futex variant with full deadlock detection.
- * Futex variants must not use the fast-path, see __rt_mutex_futex_unlock().
- */
-int __sched rt_mutex_timed_futex_lock(struct rt_mutex *lock,
-			      struct hrtimer_sleeper *timeout)
-{
-	might_sleep();
-
-	return rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE,
-				 timeout, RT_MUTEX_FULL_CHAINWALK);
-}
-
-/*
  * Futex variant, must not use fastpath.
  */
 int __sched rt_mutex_futex_trylock(struct rt_mutex *lock)
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -112,7 +112,6 @@ extern int rt_mutex_wait_proxy_lock(stru
 extern bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock,
 				 struct rt_mutex_waiter *waiter);
 
-extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
 extern int rt_mutex_futex_trylock(struct rt_mutex *l);
 
 extern void rt_mutex_futex_unlock(struct rt_mutex *lock);

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH -v5 14/14] futex: futex_unlock_pi() determinism
  2017-03-04  9:27 [PATCH -v5 00/14] the saga of FUTEX_UNLOCK_PI wobbles continues Peter Zijlstra
                   ` (12 preceding siblings ...)
  2017-03-04  9:27 ` [PATCH -v5 13/14] futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock() Peter Zijlstra
@ 2017-03-04  9:27 ` Peter Zijlstra
  2017-03-07 14:31   ` Thomas Gleixner
  13 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2017-03-04  9:27 UTC (permalink / raw)
  To: tglx
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart, peterz

[-- Attachment #1: peterz-futex-pi-unlock-11.patch --]
[-- Type: text/plain, Size: 2435 bytes --]

The problem with returning -EAGAIN when the waiter state mismatches is
that it becomes very hard to prove a bounded execution time for the
operation. And seeing that this is an RT operation, this is somewhat
important.

While in practice it will be very unlikely to ever really take more
than one or two rounds, proving so becomes rather hard.

Now that modifying wait_list is done while holding both hb->lock and
wait_lock, we can avoid the scenario entirely if we acquire wait_lock
while still holding hb->lock, doing a hand-over without leaving a
hole.
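
The hand-over itself, sketched (this is the core of the hunk below):

	get_pi_state(pi_state);
	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
	spin_unlock(&hb->lock);		/* at no point do we hold neither lock */

	ret = wake_futex_pi(uaddr, uval, pi_state);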

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/futex.c |   26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1391,16 +1391,11 @@ static int wake_futex_pi(u32 __user *uad
 	DEFINE_WAKE_Q(wake_q);
 	int ret = 0;
 
-	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
-	if (!new_owner) {
+	if (WARN_ON_ONCE(!new_owner)) {
 		/*
-		 * Since we held neither hb->lock nor wait_lock when coming
-		 * into this function, we could have raced with futex_lock_pi()
-		 * such that it will have removed the waiter that brought us
-		 * here.
-		 *
-		 * In this case, retry the entire operation.
+		 * Should be impossible now... but if weirdness happens,
+		 * returning -EAGAIN is safe and correct.
 		 */
 		ret = -EAGAIN;
 		goto out_unlock;
@@ -2770,15 +2765,18 @@ static int futex_unlock_pi(u32 __user *u
 		if (pi_state->owner != current)
 			goto out_unlock;
 
+		get_pi_state(pi_state);
 		/*
-		 * Grab a reference on the pi_state and drop hb->lock.
+		 * Since modifying the wait_list is done while holding both
+		 * hb->lock and wait_lock, holding either is sufficient to
+		 * observe it.
 		 *
-		 * The reference ensures pi_state lives, dropping the hb->lock
-		 * is tricky.. wake_futex_pi() will take rt_mutex::wait_lock to
-		 * close the races against futex_lock_pi(), but in case of
-		 * _any_ fail we'll abort and retry the whole deal.
+		 * By taking wait_lock while still holding hb->lock, we ensure
+		 * there is no point where we hold neither; and therefore
+		 * wake_futex_pi() must observe a state consistent with what we
+		 * observed.
 		 */
-		get_pi_state(pi_state);
+		raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 		spin_unlock(&hb->lock);
 
 		ret = wake_futex_pi(uaddr, uval, pi_state);

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH -v5 07/14] futex: Change locking rules
  2017-03-04  9:27 ` [PATCH -v5 07/14] futex: Change locking rules Peter Zijlstra
@ 2017-03-07 13:22   ` Thomas Gleixner
  2017-03-07 16:47     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 36+ messages in thread
From: Thomas Gleixner @ 2017-03-07 13:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart

On Sat, 4 Mar 2017, Peter Zijlstra wrote:
> @@ -2166,36 +2252,43 @@ static int fixup_pi_state_owner(u32 __us
>  	/*
> -	 * To handle the page fault we need to drop the hash bucket
> -	 * lock here. That gives the other task (either the highest priority
> -	 * waiter itself or the task which stole the rtmutex) the
> -	 * chance to try the fixup of the pi_state. So once we are
> -	 * back from handling the fault we need to check the pi_state
> -	 * after reacquiring the hash bucket lock and before trying to
> -	 * do another fixup. When the fixup has been done already we
> -	 * simply return.
> +	 * To handle the page fault we need to drop the locks here. That gives
> +	 * the other task (either the highest priority waiter itself or the
> +	 * task which stole the rtmutex) the chance to try the fixup of the
> +	 * pi_state. So once we are back from handling the fault we need to
> +	 * check the pi_state after reacquiring the locks and before trying to
> +	 * do another fixup. When the fixup has been done already we simply
> +	 * return.
> +	 *
> +	 * Note: we hold both hb->lock and pi_mutex->wait_lock. We can safely
> +	 * drop hb->lock since the caller owns the hb -> futex_q relation.
> +	 * Dropping the pi_mutex->wait_lock requires the state revalidate.
>  	 */
>  handle_fault:
> +	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
>  	spin_unlock(q->lock_ptr);
>  
>  	ret = fault_in_user_writeable(uaddr);
>  
>  	spin_lock(q->lock_ptr);
> +	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
>  
>  	/*
>  	 * Check if someone else fixed it for us:

Adding context:

         */
        if (pi_state->owner != oldowner)
                return 0;

        if (ret)
                return ret;

        goto retry;

Both 'return' statements leak &pi_state->pi_mutex.wait_lock ....

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH -v5 09/14] futex: Rework inconsistent rt_mutex/futex_q state
  2017-03-04  9:27 ` [PATCH -v5 09/14] futex: Rework inconsistent rt_mutex/futex_q state Peter Zijlstra
@ 2017-03-07 13:26   ` Thomas Gleixner
  0 siblings, 0 replies; 36+ messages in thread
From: Thomas Gleixner @ 2017-03-07 13:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart

On Sat, 4 Mar 2017, Peter Zijlstra wrote:
> @@ -1402,12 +1402,18 @@ static int wake_futex_pi(u32 __user *uad
>  	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
>  
>  	/*
> -	 * It is possible that the next waiter (the one that brought
> -	 * top_waiter owner to the kernel) timed out and is no longer
> -	 * waiting on the lock.
> +	 * When we interleave with futex_lock_pi() where it does
> +	 * rt_mutex_timed_futex_lock(), we might observe @this futex_q waiter,
> +	 * but the rt_mutex's wait_list can be empty (either still, or again,
> +	 * depending on which side we land).
> +	 *
> +	 * When this happens, give up our locks and try again, giving the
> +	 * futex_lock_pi() instance time to complete and unqueue_me().

  time to complete, either by waiting on the rtmutex or removing itself
  from the futex queue.

Or something like that.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH -v5 10/14] futex: Pull rt_mutex_futex_unlock() out from under hb->lock
  2017-03-04  9:27 ` [PATCH -v5 10/14] futex: Pull rt_mutex_futex_unlock() out from under hb->lock Peter Zijlstra
@ 2017-03-07 14:08   ` Thomas Gleixner
  2017-03-07 18:01     ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Thomas Gleixner @ 2017-03-07 14:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart

On Sat, 4 Mar 2017, Peter Zijlstra wrote:
> @@ -1035,6 +1037,9 @@ static int attach_to_pi_state(u32 __user
>  	 * has dropped the hb->lock in between queue_me() and unqueue_me_pi(),
>  	 * which in turn means that futex_lock_pi() still has a reference on
>  	 * our pi_state.
> +	 *
> +	 * IOW, we cannot race against the unlocked put_pi_state() in
> +	 * futex_unlock_pi().

That 'IOW' made my head spin for a while. I rather prefer to spell it out
more explicitly:

	 * The waiter holding a reference on @pi_state protects also
         * against the unlocked put_pi_state() in futex_unlock_pi(),
         * futex_lock_pi() and futex_wait_requeue_pi() as it cannot go to 0
         * and consequently free the pi state before we can take a reference
         * ourselves.

>  	 */
>  	WARN_ON(!atomic_read(&pi_state->refcount));
>  
> @@ -1378,47 +1383,33 @@ static void mark_wake_futex(struct wake_
>  	smp_store_release(&q->lock_ptr, NULL);
>  }
>  
> -static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
> -			 struct futex_hash_bucket *hb)

Please add a comment, that the caller must hold a reference on @pi_state

> +static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
>  {
> -	struct task_struct *new_owner;
> -	struct futex_pi_state *pi_state = top_waiter->pi_state;
>  	u32 uninitialized_var(curval), newval;
> +	struct task_struct *new_owner;
> +	bool deboost = false;
>  	DEFINE_WAKE_Q(wake_q);
> -	bool deboost;
>  	int ret = 0;
>  
> -	if (!pi_state)
> -		return -EINVAL;
> -
> -	/*
> -	 * If current does not own the pi_state then the futex is
> -	 * inconsistent and user space fiddled with the futex value.
> -	 */
> -	if (pi_state->owner != current)
> -		return -EINVAL;
> -
>  	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
>  	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
> -
> -	/*
> -	 * When we interleave with futex_lock_pi() where it does
> -	 * rt_mutex_timed_futex_lock(), we might observe @this futex_q waiter,
> -	 * but the rt_mutex's wait_list can be empty (either still, or again,
> -	 * depending on which side we land).
> -	 *
> -	 * When this happens, give up our locks and try again, giving the
> -	 * futex_lock_pi() instance time to complete and unqueue_me().
> -	 */
>  	if (!new_owner) {
> -		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
> -		return -EAGAIN;
> +		/*
> +		 * Since we held neither hb->lock nor wait_lock when coming
> +		 * into this function, we could have raced with futex_lock_pi()
> +		 * such that it will have removed the waiter that brought us
> +		 * here.

Hmm. That's not entirely correct. There are two cases:

     lock_pi()
	queue_me() <- Makes it visible as waiter in the hash bucket
	unlock(hb->lock)

  [1]

	rtmutex_futex_lock()

  [2]
  
	lock(hb->lock)

Both [1] and [2] are valid reasons why the top waiter is not a rtmutex
waiter.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH -v5 12/14] futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()
  2017-03-04  9:27 ` [PATCH -v5 12/14] futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock() Peter Zijlstra
@ 2017-03-07 14:18   ` Thomas Gleixner
  2017-03-07 17:57     ` Peter Zijlstra
  2017-03-08 15:29   ` [PATCH] futex: move debug_rt_mutex_free_waiter() further down Sebastian Andrzej Siewior
  1 sibling, 1 reply; 36+ messages in thread
From: Thomas Gleixner @ 2017-03-07 14:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart

On Sat, 4 Mar 2017, Peter Zijlstra wrote:
> +/**
> + * rt_mutex_cleanup_proxy_lock() - Cleanup failed lock acquisition
> + * @lock:		the rt_mutex we were woken on
> + * @waiter:		the pre-initialized rt_mutex_waiter
> + *
> + * Clean up the failed lock acquisition as per rt_mutex_wait_proxy_lock().
> + *
> + * Returns:
> + *  true  - did the cleanup, we're done.
> + *  false - we acquired the lock after rt_mutex_wait_proxy_lock() returned,
> + *          caller should disregard its return value.

Hmm. How would that happen? Magic owner assignment to a non-waiter? The
callsite only calls here in the failed case.

I must be missing something

Thanks,

	tglx

> + *
> + * Special API call for PI-futex support
> + */
> +bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock,
> +				 struct rt_mutex_waiter *waiter)
> +{
> +	bool cleanup = false;
> +
> +	raw_spin_lock_irq(&lock->wait_lock);
> +	/*
> +	 * If we acquired the lock, no cleanup required.
> +	 */
> +	if (rt_mutex_owner(lock) != current) {
> +		remove_waiter(lock, waiter);
> +		fixup_rt_mutex_waiters(lock);
> +		cleanup = true;
> +	}
> +	raw_spin_unlock_irq(&lock->wait_lock);
> +
> +	return cleanup;
> +}

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH -v5 14/14] futex: futex_unlock_pi() determinism
  2017-03-04  9:27 ` [PATCH -v5 14/14] futex: futex_unlock_pi() determinism Peter Zijlstra
@ 2017-03-07 14:31   ` Thomas Gleixner
  2017-03-07 17:59     ` Peter Zijlstra
  2017-03-13  9:25     ` Peter Zijlstra
  0 siblings, 2 replies; 36+ messages in thread
From: Thomas Gleixner @ 2017-03-07 14:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart

On Sat, 4 Mar 2017, Peter Zijlstra wrote:

> The problem with returning -EAGAIN when the waiter state mismatches is
> that it becomes very hard to prove a bounded execution time for the
> operation. And seeing that this is an RT operation, this is somewhat
> important.
> 
> While in practice it will be very unlikely to ever really take more
> than one or two rounds, proving so becomes rather hard.

Oh no. Assume the following:

T1 and T2 are both pinned to CPU0. prio(T2) > prio(T1)

CPU0

T1 
  lock_pi()
  queue_me()  <- Waiter is visible

preemption

T2
  unlock_pi()
    loops with -EAGAIN forever

> Now that modifying wait_list is done while holding both hb->lock and
> wait_lock, we can avoid the scenario entirely if we acquire wait_lock
> while still holding hb->lock, doing a hand-over without leaving a
> hole.

> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/futex.c |   26 ++++++++++++--------------
>  1 file changed, 12 insertions(+), 14 deletions(-)
> 
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1391,16 +1391,11 @@ static int wake_futex_pi(u32 __user *uad
>  	DEFINE_WAKE_Q(wake_q);
>  	int ret = 0;
>  
> -	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
>  	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
> -	if (!new_owner) {
> +	if (WARN_ON_ONCE(!new_owner)) {
>  		/*
> -		 * Since we held neither hb->lock nor wait_lock when coming
> -		 * into this function, we could have raced with futex_lock_pi()
> -		 * such that it will have removed the waiter that brought us
> -		 * here.
> -		 *
> -		 * In this case, retry the entire operation.
> +		 * Should be impossible now... but if weirdness happens,

'now...' is not very useful 6 months from NOW :)

> +		 * returning -EAGAIN is safe and correct.
>  		 */
>  		ret = -EAGAIN;
>  		goto out_unlock;
> @@ -2770,15 +2765,18 @@ static int futex_unlock_pi(u32 __user *u
>  		if (pi_state->owner != current)
>  			goto out_unlock;
>  
> +		get_pi_state(pi_state);
>  		/*
> -		 * Grab a reference on the pi_state and drop hb->lock.
> +		 * Since modifying the wait_list is done while holding both
> +		 * hb->lock and wait_lock, holding either is sufficient to
> +		 * observe it.
>  		 *
> -		 * The reference ensures pi_state lives, dropping the hb->lock
> -		 * is tricky.. wake_futex_pi() will take rt_mutex::wait_lock to
> -		 * close the races against futex_lock_pi(), but in case of
> -		 * _any_ fail we'll abort and retry the whole deal.
> +		 * By taking wait_lock while still holding hb->lock, we ensure
> +		 * there is no point where we hold neither; and therefore
> +		 * wake_futex_pi() must observe a state consistent with what we
> +		 * observed.
>  		 */
> -		get_pi_state(pi_state);
> +		raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
>  		spin_unlock(&hb->lock);

Other than that, this is pretty good.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH -v5 07/14] futex: Change locking rules
  2017-03-07 13:22   ` Thomas Gleixner
@ 2017-03-07 16:47     ` Sebastian Andrzej Siewior
  2017-03-07 18:01       ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Sebastian Andrzej Siewior @ 2017-03-07 16:47 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, mingo, juri.lelli, rostedt, xlpang, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart

From: Peter Zijlstra <peterz@infradead.org>
Date: Sat, 4 Mar 2017 10:27:24 +0100
Subject: [PATCH] futex: Change locking rules

Currently futex-pi relies on hb->lock to serialize everything. Since
hb->lock is giving us problems (PI inversions among other things,
since on -rt the hb->lock itself is a rt_mutex), we want to break this
up a bit.

This patch reworks and documents the locking. Notably, it
consistently uses rt_mutex::wait_lock to serialize {uval, pi_state}.
This would allow us to do rt_mutex_unlock() (including deboost)
without holding hb->lock.

Nothing yet relies on the new locking rules.
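
For reference, the resulting lock nesting, sketched (the full rules
are spelled out in the comment block added below):

	spin_lock(&hb->lock);				/* hb -> futex_q */
	raw_spin_lock_irq(&pi_mutex->wait_lock);	/* {uval, pi_state} */
	raw_spin_lock(&p->pi_lock);			/* pi_state->list */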

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
On 2017-03-07 14:22:14 [+0100], Thomas Gleixner wrote:
> Both 'return' statements leak &pi_state->pi_mutex.wait_lock ....

this version has the unlock in both 'return' paths.

 kernel/futex.c | 161 +++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 129 insertions(+), 32 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index af022919933a..eda79e6c4e45 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -973,6 +973,39 @@ void exit_pi_state_list(struct task_struct *curr)
  *
  * [10] There is no transient state which leaves owner and user space
  *	TID out of sync.
+ *
+ *
+ * Serialization and lifetime rules:
+ *
+ * hb->lock:
+ *
+ *	hb -> futex_q, relation
+ *	futex_q -> pi_state, relation
+ *
+ *	(cannot be raw because hb can contain an arbitrary
+ *	 number of futex_q's)
+ *
+ * pi_mutex->wait_lock:
+ *
+ *	{uval, pi_state}
+ *
+ *	(and pi_mutex 'obviously')
+ *
+ * p->pi_lock:
+ *
+ *	p->pi_state_list -> pi_state->list, relation
+ *
+ * pi_state->refcount:
+ *
+ *	pi_state lifetime
+ *
+ *
+ * Lock order:
+ *
+ *   hb->lock
+ *     pi_mutex->wait_lock
+ *       p->pi_lock
+ *
  */
 
 /*
@@ -980,10 +1013,12 @@ void exit_pi_state_list(struct task_struct *curr)
  * the pi_state against the user space value. If correct, attach to
  * it.
  */
-static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
+static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
+			      struct futex_pi_state *pi_state,
 			      struct futex_pi_state **ps)
 {
 	pid_t pid = uval & FUTEX_TID_MASK;
+	int ret, uval2;
 
 	/*
 	 * Userspace might have messed up non-PI and PI futexes [3]
@@ -991,9 +1026,34 @@ static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
 	if (unlikely(!pi_state))
 		return -EINVAL;
 
+	/*
+	 * We get here with hb->lock held, and having found a
+	 * futex_top_waiter(). This means that futex_lock_pi() of said futex_q
+	 * has dropped the hb->lock in between queue_me() and unqueue_me_pi(),
+	 * which in turn means that futex_lock_pi() still has a reference on
+	 * our pi_state.
+	 */
 	WARN_ON(!atomic_read(&pi_state->refcount));
 
 	/*
+	 * Now that we have a pi_state, we can acquire wait_lock
+	 * and do the state validation.
+	 */
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
+
+	/*
+	 * Since {uval, pi_state} is serialized by wait_lock, and our current
+	 * uval was read without holding it, it can have changed. Verify it
+	 * still is what we expect it to be, otherwise retry the entire
+	 * operation.
+	 */
+	if (get_futex_value_locked(&uval2, uaddr))
+		goto out_efault;
+
+	if (uval != uval2)
+		goto out_eagain;
+
+	/*
 	 * Handle the owner died case:
 	 */
 	if (uval & FUTEX_OWNER_DIED) {
@@ -1008,11 +1068,11 @@ static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
 			 * is not 0. Inconsistent state. [5]
 			 */
 			if (pid)
-				return -EINVAL;
+				goto out_einval;
 			/*
 			 * Take a ref on the state and return success. [4]
 			 */
-			goto out_state;
+			goto out_attach;
 		}
 
 		/*
@@ -1024,14 +1084,14 @@ static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
 		 * Take a ref on the state and return success. [6]
 		 */
 		if (!pid)
-			goto out_state;
+			goto out_attach;
 	} else {
 		/*
 		 * If the owner died bit is not set, then the pi_state
 		 * must have an owner. [7]
 		 */
 		if (!pi_state->owner)
-			return -EINVAL;
+			goto out_einval;
 	}
 
 	/*
@@ -1040,11 +1100,29 @@ static int attach_to_pi_state(u32 uval, struct futex_pi_state *pi_state,
 	 * user space TID. [9/10]
 	 */
 	if (pid != task_pid_vnr(pi_state->owner))
-		return -EINVAL;
-out_state:
+		goto out_einval;
+
+out_attach:
 	atomic_inc(&pi_state->refcount);
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 	*ps = pi_state;
 	return 0;
+
+out_einval:
+	ret = -EINVAL;
+	goto out_error;
+
+out_eagain:
+	ret = -EAGAIN;
+	goto out_error;
+
+out_efault:
+	ret = -EFAULT;
+	goto out_error;
+
+out_error:
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+	return ret;
 }
 
 /*
@@ -1095,6 +1173,9 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 
 	/*
 	 * No existing pi state. First waiter. [2]
+	 *
+	 * This creates pi_state, we have hb->lock held, this means nothing can
+	 * observe this state, wait_lock is irrelevant.
 	 */
 	pi_state = alloc_pi_state();
 
@@ -1119,7 +1200,8 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 	return 0;
 }
 
-static int lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
+static int lookup_pi_state(u32 __user *uaddr, u32 uval,
+			   struct futex_hash_bucket *hb,
 			   union futex_key *key, struct futex_pi_state **ps)
 {
 	struct futex_q *top_waiter = futex_top_waiter(hb, key);
@@ -1129,7 +1211,7 @@ static int lookup_pi_state(u32 uval, struct futex_hash_bucket *hb,
 	 * attach to the pi_state when the validation succeeds.
 	 */
 	if (top_waiter)
-		return attach_to_pi_state(uval, top_waiter->pi_state, ps);
+		return attach_to_pi_state(uaddr, uval, top_waiter->pi_state, ps);
 
 	/*
 	 * We are the first waiter - try to look up the owner based on
@@ -1148,7 +1230,7 @@ static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval)
 	if (unlikely(cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)))
 		return -EFAULT;
 
-	/*If user space value changed, let the caller retry */
+	/* If user space value changed, let the caller retry */
 	return curval != uval ? -EAGAIN : 0;
 }
 
@@ -1204,7 +1286,7 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
 	 */
 	top_waiter = futex_top_waiter(hb, key);
 	if (top_waiter)
-		return attach_to_pi_state(uval, top_waiter->pi_state, ps);
+		return attach_to_pi_state(uaddr, uval, top_waiter->pi_state, ps);
 
 	/*
 	 * No waiter and user TID is 0. We are here because the
@@ -1336,6 +1418,7 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter
 
 	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)) {
 		ret = -EFAULT;
+
 	} else if (curval != uval) {
 		/*
 		 * If a unconditional UNLOCK_PI operation (user space did not
@@ -1348,6 +1431,7 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter
 		else
 			ret = -EINVAL;
 	}
+
 	if (ret) {
 		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 		return ret;
@@ -1823,7 +1907,7 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
 			 * If that call succeeds then we have pi_state and an
 			 * initial refcount on it.
 			 */
-			ret = lookup_pi_state(ret, hb2, &key2, &pi_state);
+			ret = lookup_pi_state(uaddr2, ret, hb2, &key2, &pi_state);
 		}
 
 		switch (ret) {
@@ -2122,10 +2206,13 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
 {
 	u32 newtid = task_pid_vnr(newowner) | FUTEX_WAITERS;
 	struct futex_pi_state *pi_state = q->pi_state;
-	struct task_struct *oldowner = pi_state->owner;
 	u32 uval, uninitialized_var(curval), newval;
+	struct task_struct *oldowner;
 	int ret;
 
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
+
+	oldowner = pi_state->owner;
 	/* Owner died? */
 	if (!pi_state->owner)
 		newtid |= FUTEX_OWNER_DIED;
@@ -2141,11 +2228,10 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
 	 * because we can fault here. Imagine swapped out pages or a fork
 	 * that marked all the anonymous memory readonly for cow.
 	 *
-	 * Modifying pi_state _before_ the user space value would
-	 * leave the pi_state in an inconsistent state when we fault
-	 * here, because we need to drop the hash bucket lock to
-	 * handle the fault. This might be observed in the PID check
-	 * in lookup_pi_state.
+	 * Modifying pi_state _before_ the user space value would leave the
+	 * pi_state in an inconsistent state when we fault here, because we
+	 * need to drop the locks to handle the fault. This might be observed
+	 * in the PID check in lookup_pi_state.
 	 */
 retry:
 	if (get_futex_value_locked(&uval, uaddr))
@@ -2166,45 +2252,56 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
 	 * itself.
 	 */
 	if (pi_state->owner != NULL) {
-		raw_spin_lock_irq(&pi_state->owner->pi_lock);
+		raw_spin_lock(&pi_state->owner->pi_lock);
 		WARN_ON(list_empty(&pi_state->list));
 		list_del_init(&pi_state->list);
-		raw_spin_unlock_irq(&pi_state->owner->pi_lock);
+		raw_spin_unlock(&pi_state->owner->pi_lock);
 	}
 
 	pi_state->owner = newowner;
 
-	raw_spin_lock_irq(&newowner->pi_lock);
+	raw_spin_lock(&newowner->pi_lock);
 	WARN_ON(!list_empty(&pi_state->list));
 	list_add(&pi_state->list, &newowner->pi_state_list);
-	raw_spin_unlock_irq(&newowner->pi_lock);
+	raw_spin_unlock(&newowner->pi_lock);
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+
 	return 0;
 
 	/*
-	 * To handle the page fault we need to drop the hash bucket
-	 * lock here. That gives the other task (either the highest priority
-	 * waiter itself or the task which stole the rtmutex) the
-	 * chance to try the fixup of the pi_state. So once we are
-	 * back from handling the fault we need to check the pi_state
-	 * after reacquiring the hash bucket lock and before trying to
-	 * do another fixup. When the fixup has been done already we
-	 * simply return.
+	 * To handle the page fault we need to drop the locks here. That gives
+	 * the other task (either the highest priority waiter itself or the
+	 * task which stole the rtmutex) the chance to try the fixup of the
+	 * pi_state. So once we are back from handling the fault we need to
+	 * check the pi_state after reacquiring the locks and before trying to
+	 * do another fixup. When the fixup has been done already we simply
+	 * return.
+	 *
+	 * Note: we hold both hb->lock and pi_mutex->wait_lock. We can safely
+	 * drop hb->lock since the caller owns the hb -> futex_q relation.
+	 * Dropping the pi_mutex->wait_lock requires the state revalidate.
 	 */
 handle_fault:
+	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 	spin_unlock(q->lock_ptr);
 
 	ret = fault_in_user_writeable(uaddr);
 
 	spin_lock(q->lock_ptr);
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 
 	/*
 	 * Check if someone else fixed it for us:
 	 */
-	if (pi_state->owner != oldowner)
+	if (pi_state->owner != oldowner) {
+		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 		return 0;
+	}
 
-	if (ret)
+	if (ret) {
+		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 		return ret;
+	}
 
 	goto retry;
 }
-- 
2.11.0

Sebastian

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH -v5 12/14] futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()
  2017-03-07 14:18   ` Thomas Gleixner
@ 2017-03-07 17:57     ` Peter Zijlstra
  2017-03-07 17:59       ` Thomas Gleixner
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2017-03-07 17:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart

On Tue, Mar 07, 2017 at 03:18:46PM +0100, Thomas Gleixner wrote:
> On Sat, 4 Mar 2017, Peter Zijlstra wrote:
> > +/**
> > + * rt_mutex_cleanup_proxy_lock() - Cleanup failed lock acquisition
> > + * @lock:		the rt_mutex we were woken on
> > + * @waiter:		the pre-initialized rt_mutex_waiter
> > + *
> > + * Clean up the failed lock acquisition as per rt_mutex_wait_proxy_lock().
> > + *
> > + * Returns:
> > > + *  true  - did the cleanup, we're done.
> > > + *  false - we acquired the lock after rt_mutex_wait_proxy_lock() returned,
> > > + *          caller should disregard its return value.
> 
> > Hmm. How would that happen? Magic owner assignment to a non-waiter? The
> callsite only calls here in the failed case.

Ah, but until the remove_waiter() below, we _still_ are a waiter, and
thus can get assigned ownership.

> > + *
> > + * Special API call for PI-futex support
> > + */
> > +bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock,
> > +				 struct rt_mutex_waiter *waiter)
> > +{
> > +	bool cleanup = false;
> > +
> > +	raw_spin_lock_irq(&lock->wait_lock);
> > +	/*
> > +	 * If we acquired the lock, no cleanup required.
> > +	 */
> > +	if (rt_mutex_owner(lock) != current) {
> > +		remove_waiter(lock, waiter);

See, up till this point, we're still a waiter and any unlock can see us
being one.

> > +		fixup_rt_mutex_waiters(lock);
> > +		cleanup = true;
> > +	}
> > +	raw_spin_unlock_irq(&lock->wait_lock);
> > +
> > +	return cleanup;
> > +}

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH -v5 14/14] futex: futex_unlock_pi() determinism
  2017-03-07 14:31   ` Thomas Gleixner
@ 2017-03-07 17:59     ` Peter Zijlstra
  2017-03-13  9:25     ` Peter Zijlstra
  1 sibling, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2017-03-07 17:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart

On Tue, Mar 07, 2017 at 03:31:50PM +0100, Thomas Gleixner wrote:
> On Sat, 4 Mar 2017, Peter Zijlstra wrote:
> 
> > The problem with returning -EAGAIN when the waiter state mismatches is
> > that it becomes very hard to prove a bounded execution time for the
> > operation. And seeing that this is an RT operation, this is somewhat
> > important.
> > 
> > While in practice it will be very unlikely to ever really take more
> > than one or two rounds, proving so becomes rather hard.
> 
> Oh no. Assume the following:
> 
> T1 and T2 are both pinned to CPU0. prio(T2) > prio(T1)
> 
> CPU0
> 
> T1 
>   lock_pi()
>   queue_me()  <- Waiter is visible
> 
> preemption
> 
> T2
>   unlock_pi()
>     loops with -EAGAIN forever

Ah! indeed.

> > Now that modifying wait_list is done while holding both hb->lock and
> > wait_lock, we can avoid the scenario entirely if we acquire wait_lock
> > while still holding hb->lock, doing a hand-over without leaving a
> > hole.
> 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  kernel/futex.c |   26 ++++++++++++--------------
> >  1 file changed, 12 insertions(+), 14 deletions(-)
> > 
> > --- a/kernel/futex.c
> > +++ b/kernel/futex.c
> > @@ -1391,16 +1391,11 @@ static int wake_futex_pi(u32 __user *uad
> >  	DEFINE_WAKE_Q(wake_q);
> >  	int ret = 0;
> >  
> >  	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
> > +	if (WARN_ON_ONCE(!new_owner)) {
> >  		/*
> > +		 * Should be impossible now... but if weirdness happens,
> 
> 'now...' is not very useful 6 months from NOW :)

I'll put in a reference to the comment below, which explains why this
should now be impossible.

> > +		 * returning -EAGAIN is safe and correct.
> >  		 */
> >  		ret = -EAGAIN;
> >  		goto out_unlock;
> > @@ -2770,15 +2765,18 @@ static int futex_unlock_pi(u32 __user *u
> >  		if (pi_state->owner != current)
> >  			goto out_unlock;
> >  
> > +		get_pi_state(pi_state);
> >  		/*
> > +		 * Since modifying the wait_list is done while holding both
> > +		 * hb->lock and wait_lock, holding either is sufficient to
> > +		 * observe it.
> >  		 *
> > +		 * By taking wait_lock while still holding hb->lock, we ensure
> > +		 * there is no point where we hold neither; and therefore
> > +		 * wake_futex_pi() must observe a state consistent with what we
> > +		 * observed.
> >  		 */


^^ that one.

> > +		raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
> >  		spin_unlock(&hb->lock);

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH -v5 12/14] futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()
  2017-03-07 17:57     ` Peter Zijlstra
@ 2017-03-07 17:59       ` Thomas Gleixner
  0 siblings, 0 replies; 36+ messages in thread
From: Thomas Gleixner @ 2017-03-07 17:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart

On Tue, 7 Mar 2017, Peter Zijlstra wrote:

> On Tue, Mar 07, 2017 at 03:18:46PM +0100, Thomas Gleixner wrote:
> > On Sat, 4 Mar 2017, Peter Zijlstra wrote:
> > > +/**
> > > + * rt_mutex_cleanup_proxy_lock() - Cleanup failed lock acquisition
> > > + * @lock:		the rt_mutex we were woken on
> > > + * @waiter:		the pre-initialized rt_mutex_waiter
> > > + *
> > > + * Clean up the failed lock acquisition as per rt_mutex_wait_proxy_lock().
> > > + *
> > > + * Returns:
> > > > + *  true  - did the cleanup, we're done.
> > > > + *  false - we acquired the lock after rt_mutex_wait_proxy_lock() returned,
> > > > + *          caller should disregard its return value.
> > 
> > > Hmm. How would that happen? Magic owner assignment to a non-waiter? The
> > callsite only calls here in the failed case.
> 
> Ah, but until the remove_waiter() below, we _still_ are a waiter, and
> thus can get assigned ownership.
> 
> > > + *
> > > + * Special API call for PI-futex support
> > > + */
> > > +bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock,
> > > +				 struct rt_mutex_waiter *waiter)
> > > +{
> > > +	bool cleanup = false;
> > > +
> > > +	raw_spin_lock_irq(&lock->wait_lock);
> > > +	/*
> > > +	 * If we acquired the lock, no cleanup required.
> > > +	 */
> > > +	if (rt_mutex_owner(lock) != current) {
> > > +		remove_waiter(lock, waiter);
> 
> See, up till this point, we're still a waiter and any unlock can see us
> being one.

Hmm, true. So the comments should explain that.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH -v5 10/14] futex: Pull rt_mutex_futex_unlock() out from under hb->lock
  2017-03-07 14:08   ` Thomas Gleixner
@ 2017-03-07 18:01     ` Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2017-03-07 18:01 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart

On Tue, Mar 07, 2017 at 03:08:17PM +0100, Thomas Gleixner wrote:
> On Sat, 4 Mar 2017, Peter Zijlstra wrote:
> > @@ -1035,6 +1037,9 @@ static int attach_to_pi_state(u32 __user
> >  	 * has dropped the hb->lock in between queue_me() and unqueue_me_pi(),
> >  	 * which in turn means that futex_lock_pi() still has a reference on
> >  	 * our pi_state.
> > +	 *
> > +	 * IOW, we cannot race against the unlocked put_pi_state() in
> > +	 * futex_unlock_pi().
> 
> That 'IOW' made my head spin for a while. I rather prefer to spell it out
> more explicitly:
> 
> 	 * The waiter holding a reference on @pi_state protects also
>          * against the unlocked put_pi_state() in futex_unlock_pi(),
>          * futex_lock_pi() and futex_wait_requeue_pi() as it cannot go to 0
>          * and consequently free the pi state before we can take a reference
>          * ourselves.

Right you are. After staring at this for too damn long one tends to
forget what 'obvious' means.

> 
> >  	 */
> >  	WARN_ON(!atomic_read(&pi_state->refcount));
> >  
> > @@ -1378,47 +1383,33 @@ static void mark_wake_futex(struct wake_
> >  	smp_store_release(&q->lock_ptr, NULL);
> >  }
> >  
> > -static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
> > -			 struct futex_hash_bucket *hb)
> 
> Please add a comment, that the caller must hold a reference on @pi_state

Will do.

> > +static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
> >  {
> >  	u32 uninitialized_var(curval), newval;
> > +	struct task_struct *new_owner;
> > +	bool deboost = false;
> >  	DEFINE_WAKE_Q(wake_q);
> >  	int ret = 0;
> >  
> >  	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
> >  	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
> >  	if (!new_owner) {
> > +		/*
> > +		 * Since we held neither hb->lock nor wait_lock when coming
> > +		 * into this function, we could have raced with futex_lock_pi()
> > +		 * such that it will have removed the waiter that brought us
> > +		 * here.
> 
> Hmm. That's not entirely correct. There are two cases:
> 
>      lock_pi()
> 	queue_me() <- Makes it visible as waiter in the hash bucket
> 	unlock(hb->lock)
> 
>   [1]
> 
> 	rtmutex_futex_lock()
> 
>   [2]
>   
> 	lock(hb->lock)
> 
> Both [1] and [2] are valid reasons why the top waiter is not a rtmutex
> waiter.

Correct, I've even drawn similar state pictures elsewhere in this
series. I'll update.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH -v5 07/14] futex: Change locking rules
  2017-03-07 16:47     ` Sebastian Andrzej Siewior
@ 2017-03-07 18:01       ` Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2017-03-07 18:01 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Thomas Gleixner, mingo, juri.lelli, rostedt, xlpang,
	linux-kernel, mathieu.desnoyers, jdesfossez, bristot, dvhart

On Tue, Mar 07, 2017 at 05:47:44PM +0100, Sebastian Andrzej Siewior wrote:
> On 2017-03-07 14:22:14 [+0100], Thomas Gleixner wrote:
> > Both 'return' statements leak &pi_state->pi_mutex.wait_lock ....
> 
> this has unlock in both 'return's.

>  handle_fault:
> +	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
>  	spin_unlock(q->lock_ptr);
>  
>  	ret = fault_in_user_writeable(uaddr);
>  
>  	spin_lock(q->lock_ptr);
> +	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
>  
>  	/*
>  	 * Check if someone else fixed it for us:
>  	 */
> -	if (pi_state->owner != oldowner)
> +	if (pi_state->owner != oldowner) {
> +		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
>  		return 0;
> +	}
>  
> -	if (ret)
> +	if (ret) {
> +		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
>  		return ret;
> +	}
>  
>  	goto retry;
>  }

I had already fixed it locally with a common exit:

out_unlock:
  raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
  return ret;

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH] futex: move debug_rt_mutex_free_waiter() further down
  2017-03-04  9:27 ` [PATCH -v5 12/14] futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock() Peter Zijlstra
  2017-03-07 14:18   ` Thomas Gleixner
@ 2017-03-08 15:29   ` Sebastian Andrzej Siewior
  2017-03-08 15:37     ` Sebastian Andrzej Siewior
                       ` (2 more replies)
  1 sibling, 3 replies; 36+ messages in thread
From: Sebastian Andrzej Siewior @ 2017-03-08 15:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart

Without this, futex_requeue_pi_signal_restart will trigger

|kernel BUG at locking/rtmutex_common.h:55!
|Call Trace:
| rt_mutex_cleanup_proxy_lock+0x54/0x90
| futex_wait_requeue_pi.constprop.21+0x387/0x4d0
| do_futex+0x289/0xbf0
|RIP: remove_waiter+0x157/0x170 RSP: ffffc90000e0fbe0

with BUG 2222222222222222 != pointer once this patch is applied.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/futex.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 00ec4a01d3f5..73abfe0da4d0 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3046,11 +3046,11 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 		WARN_ON(!q.pi_state);
 		pi_mutex = &q.pi_state->pi_mutex;
 		ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter);
-		debug_rt_mutex_free_waiter(&rt_waiter);
 
 		spin_lock(q.lock_ptr);
 		if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
 			ret = 0;
+		debug_rt_mutex_free_waiter(&rt_waiter);
 
 		/*
 		 * Fixup the pi_state owner and possibly acquire the lock if we
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH] futex: move debug_rt_mutex_free_waiter() further down
  2017-03-08 15:29   ` [PATCH] futex: move debug_rt_mutex_free_waiter() further down Sebastian Andrzej Siewior
@ 2017-03-08 15:37     ` Sebastian Andrzej Siewior
  2017-03-08 16:21       ` Steven Rostedt
  2017-03-08 16:20     ` Steven Rostedt
  2017-03-13  9:16     ` Peter Zijlstra
  2 siblings, 1 reply; 36+ messages in thread
From: Sebastian Andrzej Siewior @ 2017-03-08 15:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart

On 2017-03-08 16:29:02 [+0100], To Peter Zijlstra wrote:
> Without this, futex_requeue_pi_signal_restart will trigger
> 
> |kernel BUG at locking/rtmutex_common.h:55!
> |Call Trace:
> | rt_mutex_cleanup_proxy_lock+0x54/0x90
> | futex_wait_requeue_pi.constprop.21+0x387/0x4d0
> | do_futex+0x289/0xbf0
> |RIP: remove_waiter+0x157/0x170 RSP: ffffc90000e0fbe0
> 
> with BUG 2222222222222222 != pointer once this patch is applied.

My wording was wrong. The BUG_ON() described in this patch (together
with the test case mentioned) will trigger once

 "[PATCH -v5 12/14] futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()"

is applied.

Sebastian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] futex: move debug_rt_mutex_free_waiter() further down
  2017-03-08 15:29   ` [PATCH] futex: move debug_rt_mutex_free_waiter() further down Sebastian Andrzej Siewior
  2017-03-08 15:37     ` Sebastian Andrzej Siewior
@ 2017-03-08 16:20     ` Steven Rostedt
  2017-03-13  9:16     ` Peter Zijlstra
  2 siblings, 0 replies; 36+ messages in thread
From: Steven Rostedt @ 2017-03-08 16:20 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Peter Zijlstra, tglx, mingo, juri.lelli, xlpang, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart

On Wed, 8 Mar 2017 16:29:02 +0100
Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:

> Without this, futex_requeue_pi_signal_restart will trigger
> 
> |kernel BUG at locking/rtmutex_common.h:55!
> |Call Trace:
> | rt_mutex_cleanup_proxy_lock+0x54/0x90
> | futex_wait_requeue_pi.constprop.21+0x387/0x4d0
> | do_futex+0x289/0xbf0
> |RIP: remove_waiter+0x157/0x170 RSP: ffffc90000e0fbe0
> 
> with BUG 2222222222222222 != pointer once this patch is applied.

This sentence makes no sense. It's a no-sensetence ;-)

-- Steve

> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  kernel/futex.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 00ec4a01d3f5..73abfe0da4d0 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -3046,11 +3046,11 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
>  		WARN_ON(!q.pi_state);
>  		pi_mutex = &q.pi_state->pi_mutex;
>  		ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter);
> -		debug_rt_mutex_free_waiter(&rt_waiter);
>  
>  		spin_lock(q.lock_ptr);
>  		if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
>  			ret = 0;
> +		debug_rt_mutex_free_waiter(&rt_waiter);
>  
>  		/*
>  		 * Fixup the pi_state owner and possibly acquire the lock if we

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] futex: move debug_rt_mutex_free_waiter() further down
  2017-03-08 15:37     ` Sebastian Andrzej Siewior
@ 2017-03-08 16:21       ` Steven Rostedt
  0 siblings, 0 replies; 36+ messages in thread
From: Steven Rostedt @ 2017-03-08 16:21 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Peter Zijlstra, tglx, mingo, juri.lelli, xlpang, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart

On Wed, 8 Mar 2017 16:37:32 +0100
Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:

> On 2017-03-08 16:29:02 [+0100], Sebastian Andrzej Siewior wrote:
> > Without this, futex_requeue_pi_signal_restart will trigger
> > 
> > |kernel BUG at locking/rtmutex_common.h:55!
> > |Call Trace:
> > | rt_mutex_cleanup_proxy_lock+0x54/0x90
> > | futex_wait_requeue_pi.constprop.21+0x387/0x4d0
> > | do_futex+0x289/0xbf0
> > |RIP: remove_waiter+0x157/0x170 RSP: ffffc90000e0fbe0
> > 
> > with BUG 2222222222222222 != pointer once this patch is applied.  
> 
> My wording is wrong: the BUG_ON() described in this patch (together
> with the test case mentioned) will trigger once
> 
>  "[PATCH -v5 12/14] futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()"
> 
> is applied.
>

Now I read this. Ignore my last email.

-- Steve

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] futex: move debug_rt_mutex_free_waiter() further down
  2017-03-08 15:29   ` [PATCH] futex: move debug_rt_mutex_free_waiter() further down Sebastian Andrzej Siewior
  2017-03-08 15:37     ` Sebastian Andrzej Siewior
  2017-03-08 16:20     ` Steven Rostedt
@ 2017-03-13  9:16     ` Peter Zijlstra
  2 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2017-03-13  9:16 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: tglx, mingo, juri.lelli, rostedt, xlpang, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart

On Wed, Mar 08, 2017 at 04:29:02PM +0100, Sebastian Andrzej Siewior wrote:

>  kernel/futex.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 00ec4a01d3f5..73abfe0da4d0 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -3046,11 +3046,11 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
>  		WARN_ON(!q.pi_state);
>  		pi_mutex = &q.pi_state->pi_mutex;
>  		ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter);
> -		debug_rt_mutex_free_waiter(&rt_waiter);
>  
>  		spin_lock(q.lock_ptr);
>  		if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
>  			ret = 0;
> +		debug_rt_mutex_free_waiter(&rt_waiter);
>  
>  		/*
>  		 * Fixup the pi_state owner and possibly acquire the lock if we
> 

Thanks, folded.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH -v5 14/14] futex: futex_unlock_pi() determinism
  2017-03-07 14:31   ` Thomas Gleixner
  2017-03-07 17:59     ` Peter Zijlstra
@ 2017-03-13  9:25     ` Peter Zijlstra
  2017-03-13 14:25       ` Thomas Gleixner
  1 sibling, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2017-03-13  9:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart

On Tue, Mar 07, 2017 at 03:31:50PM +0100, Thomas Gleixner wrote:
> On Sat, 4 Mar 2017, Peter Zijlstra wrote:
> 
> > The problem with returning -EAGAIN when the waiter state mismatches is
> > that it becomes very hard to prove a bounded execution time on the
> > operation. And seeing that this is an RT operation, this is somewhat
> > important.
> > 
> > While in practice it will be very unlikely to ever really take more
> > than one or two rounds, proving so becomes rather hard.
> 
> Oh no. Assume the following:
> 
> T1 and T2 are both pinned to CPU0. prio(T2) > prio(T1)
> 
> CPU0
> 
> T1 
>   lock_pi()
>   queue_me()  <- Waiter is visible
> 
> preemption
> 
> T2
>   unlock_pi()
>     loops with -EAGAIN forever

So this is true before the last patch; but if we look at the locking
changes brought by that (pasting its changelog here):

Before:

futex_lock_pi()                 futex_unlock_pi()
  unlock hb->lock

                                  lock hb->lock
                                  unlock hb->lock

                                  lock rt_mutex->wait_lock
                                  unlock rt_mutex->wait_lock
                                    -EAGAIN

  lock rt_mutex->wait_lock
  list_add
  unlock rt_mutex->wait_lock

  schedule()

  lock rt_mutex->wait_lock
  list_del
  unlock rt_mutex->wait_lock

                                  <idem>
                                    -EAGAIN

  lock hb->lock


After:

futex_lock_pi()                 futex_unlock_pi()

  lock hb->lock
  lock rt_mutex->wait_lock
  list_add
  unlock rt_mutex->wait_lock
  unlock hb->lock

  schedule()
                                  lock hb->lock
                                  unlock hb->lock
  lock hb->lock
  lock rt_mutex->wait_lock
  list_del
  unlock rt_mutex->wait_lock

                                  lock rt_mutex->wait_lock
                                  unlock rt_mutex->wait_lock
                                    -EAGAIN

  unlock hb->lock


Your T2 (of higher prio) would block on T1's hb->lock and boost T1
(since with PREEMPT_FULL, hb->lock is an rt_mutex).

Alternatively (!PREEMPT_FULL), the interleave cannot happen when both
tasks are pinned to a single CPU, because hb->lock, being a spinlock,
disables preemption.


Unless I need to go drink more wake-up-juice...

^ permalink raw reply	[flat|nested] 36+ messages in thread
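
The boosting argument above can be made concrete with a small userspace
analogue: with PREEMPT_FULL, hb->lock behaves like a priority-inheriting
mutex, which is what PTHREAD_PRIO_INHERIT gives a pthread mutex. Below is
a hypothetical, illustrative-only sketch (hb_lock is a stand-in, not the
kernel lock): a high-priority task looping on -EAGAIN blocks on such a
lock and lends its priority to the preempted low-priority owner, so the
owner makes progress and the retry loop stays bounded.

#include <pthread.h>

/* Hypothetical stand-in for hb->lock under PREEMPT_FULL. */
static pthread_mutex_t hb_lock;

int main(void)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	/* Priority inheritance: a blocked high-prio thread boosts the
	 * current owner; the property the argument above relies on. */
	pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
	pthread_mutex_init(&hb_lock, &attr);

	pthread_mutex_lock(&hb_lock);
	/* ... the unlock_pi() retry's critical section would live here ... */
	pthread_mutex_unlock(&hb_lock);

	pthread_mutex_destroy(&hb_lock);
	pthread_mutexattr_destroy(&attr);
	return 0;
}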

* Re: [PATCH -v5 14/14] futex: futex_unlock_pi() determinism
  2017-03-13  9:25     ` Peter Zijlstra
@ 2017-03-13 14:25       ` Thomas Gleixner
  2017-03-13 15:11         ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Thomas Gleixner @ 2017-03-13 14:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart

On Mon, 13 Mar 2017, Peter Zijlstra wrote:

> On Tue, Mar 07, 2017 at 03:31:50PM +0100, Thomas Gleixner wrote:
> > On Sat, 4 Mar 2017, Peter Zijlstra wrote:
> > 
> > > The problem with returning -EAGAIN when the waiter state mismatches is
> > > that it becomes very hard to prove a bounded execution time on the
> > > operation. And seeing that this is an RT operation, this is somewhat
> > > important.
> > > 
> > > While in practice it will be very unlikely to ever really take more
> > > than one or two rounds, proving so becomes rather hard.
> > 
> > Oh no. Assume the following:
> > 
> > T1 and T2 are both pinned to CPU0. prio(T2) > prio(T1)
> > 
> > CPU0
> > 
> > T1 
> >   lock_pi()
> >   queue_me()  <- Waiter is visible
> > 
> > preemption
> > 
> > T2
> >   unlock_pi()
> >     loops with -EAGAIN forever
> 
> So this is true before the last patch; but if we look at the locking
> changes brought by that (pasting its changelog here):

I was referring to the state before the last patch, and to your wording
in the changelog that this is very unlikely.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH -v5 14/14] futex: futex_unlock_pi() determinism
  2017-03-13 14:25       ` Thomas Gleixner
@ 2017-03-13 15:11         ` Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2017-03-13 15:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, juri.lelli, rostedt, xlpang, bigeasy, linux-kernel,
	mathieu.desnoyers, jdesfossez, bristot, dvhart

On Mon, Mar 13, 2017 at 03:25:52PM +0100, Thomas Gleixner wrote:
> On Mon, 13 Mar 2017, Peter Zijlstra wrote:
> 
> > On Tue, Mar 07, 2017 at 03:31:50PM +0100, Thomas Gleixner wrote:
> > > On Sat, 4 Mar 2017, Peter Zijlstra wrote:
> > > 
> > > > The problem with returning -EAGAIN when the waiter state mismatches is
> > > > that it becomes very hard to prove a bounded execution time on the
> > > > operation. And seeing that this is an RT operation, this is somewhat
> > > > important.
> > > > 
> > > > While in practice it will be very unlikely to ever really take more
> > > > than one or two rounds, proving so becomes rather hard.
> > > 
> > > Oh no. Assume the following:
> > > 
> > > T1 and T2 are both pinned to CPU0. prio(T2) > prio(T1)
> > > 
> > > CPU0
> > > 
> > > T1 
> > >   lock_pi()
> > >   queue_me()  <- Waiter is visible
> > > 
> > > preemption
> > > 
> > > T2
> > >   unlock_pi()
> > >     loops with -EAGAIN forever
> > 
> > So this is true before the last patch; but if we look at the locking
> > changes brought by that (pasting its changelog here):
> 
> I was referring to the state before the last patch, and to your wording
> in the changelog that this is very unlikely.

Yeah, I understand that. Lemme see what I can do to clarify both
situations.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [tip:locking/urgent] futex: Fix potential use-after-free in FUTEX_REQUEUE_PI
  2017-03-04  9:27 ` [PATCH -v5 01/14] futex: Fix potential use-after-free in FUTEX_REQUEUE_PI Peter Zijlstra
@ 2017-03-14 20:48   ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot for Peter Zijlstra @ 2017-03-14 20:48 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: hpa, mingo, peterz, tglx, linux-kernel, dvyukov, dvhart

Commit-ID:  c236c8e95a3d395b0494e7108f0d41cf36ec107c
Gitweb:     http://git.kernel.org/tip/c236c8e95a3d395b0494e7108f0d41cf36ec107c
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Sat, 4 Mar 2017 10:27:18 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Tue, 14 Mar 2017 21:45:36 +0100

futex: Fix potential use-after-free in FUTEX_REQUEUE_PI

While working on the futex code, I stumbled over this potential
use-after-free scenario. Dmitry triggered it later with syzkaller.

pi_mutex is a pointer into pi_state, which we drop the reference on in
unqueue_me_pi(). So any access to that pointer after that is bad.

Since other sites already do rt_mutex_unlock() with hb->lock held, see
for example futex_lock_pi(), simply move the unlock before
unqueue_me_pi().

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Darren Hart <dvhart@linux.intel.com>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170304093558.801744246@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/futex.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 229a744..3a4775f 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2815,7 +2815,6 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 {
 	struct hrtimer_sleeper timeout, *to = NULL;
 	struct rt_mutex_waiter rt_waiter;
-	struct rt_mutex *pi_mutex = NULL;
 	struct futex_hash_bucket *hb;
 	union futex_key key2 = FUTEX_KEY_INIT;
 	struct futex_q q = futex_q_init;
@@ -2907,6 +2906,8 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 			spin_unlock(q.lock_ptr);
 		}
 	} else {
+		struct rt_mutex *pi_mutex;
+
 		/*
 		 * We have been woken up by futex_unlock_pi(), a timeout, or a
 		 * signal.  futex_unlock_pi() will not destroy the lock_ptr nor
@@ -2930,18 +2931,19 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 		if (res)
 			ret = (res < 0) ? res : 0;
 
+		/*
+		 * If fixup_pi_state_owner() faulted and was unable to handle
+		 * the fault, unlock the rt_mutex and return the fault to
+		 * userspace.
+		 */
+		if (ret && rt_mutex_owner(pi_mutex) == current)
+			rt_mutex_unlock(pi_mutex);
+
 		/* Unqueue and drop the lock. */
 		unqueue_me_pi(&q);
 	}
 
-	/*
-	 * If fixup_pi_state_owner() faulted and was unable to handle the
-	 * fault, unlock the rt_mutex and return the fault to userspace.
-	 */
-	if (ret == -EFAULT) {
-		if (pi_mutex && rt_mutex_owner(pi_mutex) == current)
-			rt_mutex_unlock(pi_mutex);
-	} else if (ret == -EINTR) {
+	if (ret == -EINTR) {
 		/*
 		 * We've already been requeued, but cannot restart by calling
 		 * futex_lock_pi() directly. We could restart this syscall, but

^ permalink raw reply related	[flat|nested] 36+ messages in thread
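
The shape of the bug fixed above, restated as a hypothetical userspace
analogue (all names invented for illustration; pi_state_like stands in
for pi_state with its embedded pi_mutex): a pointer into a
reference-counted object is only valid while a reference is held, so the
unlock that goes through that pointer must happen before the reference is
dropped.

#include <stdlib.h>

struct pi_state_like {
	int refcount;
	int pi_mutex;			/* stands in for the embedded rt_mutex */
};

static void put_state(struct pi_state_like *s)	/* cf. unqueue_me_pi() */
{
	if (--s->refcount == 0)
		free(s);
}

int main(void)
{
	struct pi_state_like *s = calloc(1, sizeof(*s));
	int *pi_mutex = &s->pi_mutex;	/* pointer *into* the object */

	s->refcount = 1;
	put_state(s);			/* may free s ... */
	/*
	 * *pi_mutex = 0;	   ... making this a use-after-free.
	 * Hence the fix: do the unlock through pi_mutex before
	 * unqueue_me_pi() drops the reference, not after.
	 */
	return 0;
}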

* [tip:locking/urgent] futex: Add missing error handling to FUTEX_REQUEUE_PI
  2017-03-04  9:27 ` [PATCH -v5 02/14] futex: Add missing error handling to FUTEX_REQUEUE_PI Peter Zijlstra
@ 2017-03-14 20:49   ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot for Peter Zijlstra @ 2017-03-14 20:49 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: mingo, hpa, peterz, tglx, dvhart, linux-kernel

Commit-ID:  9bbb25afeb182502ca4f2c4f3f88af0681b34cae
Gitweb:     http://git.kernel.org/tip/9bbb25afeb182502ca4f2c4f3f88af0681b34cae
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Sat, 4 Mar 2017 10:27:19 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Tue, 14 Mar 2017 21:45:36 +0100

futex: Add missing error handling to FUTEX_REQUEUE_PI

Thomas spotted that fixup_pi_state_owner() can return errors and we
fail to unlock the rt_mutex in that case.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Darren Hart <dvhart@linux.intel.com>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170304093558.867401760@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 kernel/futex.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/futex.c b/kernel/futex.c
index 3a4775f..45858ec 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2898,6 +2898,8 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 		if (q.pi_state && (q.pi_state->owner != current)) {
 			spin_lock(q.lock_ptr);
 			ret = fixup_pi_state_owner(uaddr2, &q, current);
+			if (ret && rt_mutex_owner(&q.pi_state->pi_mutex) == current)
+				rt_mutex_unlock(&q.pi_state->pi_mutex);
 			/*
 			 * Drop the reference to the pi state which
 			 * the requeue_pi() code acquired for us.

^ permalink raw reply related	[flat|nested] 36+ messages in thread
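
The rule this two-liner enforces, as a generic hypothetical sketch
(struct lock, fixup_owner() and friends are invented stand-ins, not the
kernel API): when a fixup step can fail after the task may already have
become the lock owner, the failure path must still release the lock
before propagating the error, or a held lock leaks back to userspace.

#include <stdio.h>

struct lock { int owner; };

static int fixup_owner(struct lock *l, int me)
{
	l->owner = me;		/* the fixup made us the owner ... */
	return -1;		/* ... and then failed (stand-in for -EFAULT) */
}

static void unlock(struct lock *l)
{
	l->owner = 0;
}

static int finish_requeue_pi(struct lock *l, int me)
{
	int ret = fixup_owner(l, me);

	/* Even on failure we may own the lock by now: drop it before
	 * returning, mirroring the rt_mutex_unlock() added above. */
	if (ret && l->owner == me)
		unlock(l);
	return ret;
}

int main(void)
{
	struct lock l = { 0 };

	printf("ret=%d owner=%d\n", finish_requeue_pi(&l, 1), l.owner);
	return 0;
}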

end of thread, other threads:[~2017-03-14 20:49 UTC | newest]

Thread overview: 36+ messages
2017-03-04  9:27 [PATCH -v5 00/14] the saga of FUTEX_UNLOCK_PI wobbles continues Peter Zijlstra
2017-03-04  9:27 ` [PATCH -v5 01/14] futex: Fix potential use-after-free in FUTEX_REQUEUE_PI Peter Zijlstra
2017-03-14 20:48   ` [tip:locking/urgent] " tip-bot for Peter Zijlstra
2017-03-04  9:27 ` [PATCH -v5 02/14] futex: Add missing error handling to FUTEX_REQUEUE_PI Peter Zijlstra
2017-03-14 20:49   ` [tip:locking/urgent] " tip-bot for Peter Zijlstra
2017-03-04  9:27 ` [PATCH -v5 03/14] futex: Cleanup variable names for futex_top_waiter() Peter Zijlstra
2017-03-04  9:27 ` [PATCH -v5 04/14] futex: Use smp_store_release() in mark_wake_futex() Peter Zijlstra
2017-03-04  9:27 ` [PATCH -v5 05/14] futex: Remove rt_mutex_deadlock_account_*() Peter Zijlstra
2017-03-04  9:27 ` [PATCH -v5 06/14] futex,rt_mutex: Provide futex specific rt_mutex API Peter Zijlstra
2017-03-04  9:27 ` [PATCH -v5 07/14] futex: Change locking rules Peter Zijlstra
2017-03-07 13:22   ` Thomas Gleixner
2017-03-07 16:47     ` Sebastian Andrzej Siewior
2017-03-07 18:01       ` Peter Zijlstra
2017-03-04  9:27 ` [PATCH -v5 08/14] futex: Cleanup refcounting Peter Zijlstra
2017-03-04  9:27 ` [PATCH -v5 09/14] futex: Rework inconsistent rt_mutex/futex_q state Peter Zijlstra
2017-03-07 13:26   ` Thomas Gleixner
2017-03-04  9:27 ` [PATCH -v5 10/14] futex: Pull rt_mutex_futex_unlock() out from under hb->lock Peter Zijlstra
2017-03-07 14:08   ` Thomas Gleixner
2017-03-07 18:01     ` Peter Zijlstra
2017-03-04  9:27 ` [PATCH -v5 11/14] futex,rt_mutex: Introduce rt_mutex_init_waiter() Peter Zijlstra
2017-03-04  9:27 ` [PATCH -v5 12/14] futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock() Peter Zijlstra
2017-03-07 14:18   ` Thomas Gleixner
2017-03-07 17:57     ` Peter Zijlstra
2017-03-07 17:59       ` Thomas Gleixner
2017-03-08 15:29   ` [PATCH] futex: move debug_rt_mutex_free_waiter() further down Sebastian Andrzej Siewior
2017-03-08 15:37     ` Sebastian Andrzej Siewior
2017-03-08 16:21       ` Steven Rostedt
2017-03-08 16:20     ` Steven Rostedt
2017-03-13  9:16     ` Peter Zijlstra
2017-03-04  9:27 ` [PATCH -v5 13/14] futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock() Peter Zijlstra
2017-03-04  9:27 ` [PATCH -v5 14/14] futex: futex_unlock_pi() determinism Peter Zijlstra
2017-03-07 14:31   ` Thomas Gleixner
2017-03-07 17:59     ` Peter Zijlstra
2017-03-13  9:25     ` Peter Zijlstra
2017-03-13 14:25       ` Thomas Gleixner
2017-03-13 15:11         ` Peter Zijlstra
