* [PATCH -tip 0/4] rtmutex: Spin on owner
@ 2015-05-19 17:24 Davidlohr Bueso
  2015-05-19 17:24 ` [PATCH 1/4] locking/rtmutex: Implement lockless top-waiter wakeup Davidlohr Bueso
                   ` (4 more replies)
  0 siblings, 5 replies; 48+ messages in thread
From: Davidlohr Bueso @ 2015-05-19 17:24 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Mike Galbraith, Paul E. McKenney,
	Sebastian Andrzej Siewior, Davidlohr Bueso, linux-kernel

Hello,

The first three patches are straightforward fixes found while
going through the code. Patch 4 is the actual meat of the
set, and is similar to what we already have in regular mutexes.
While optimistic spinning in rtmutexes is a desired feature,
I've marked it RFC as I could very well have missed
something inherent in the rt flavor.

Details are in the individual patches. The set passes the pi tests from
Darren's futextest suite, as well as a full weekend of running
pi_stress from rt-tests on a 60-core box.

Thanks!

Davidlohr Bueso (4):
  locking/rtmutex: Implement lockless top-waiter wakeup
  locking/rtmutex: Use cmp-cmpxchg
  locking/rtmutex: Update stale plist comments
  locking/rtmutex: Support spin on owner (osq)

 include/linux/rtmutex.h  |   4 ++
 kernel/Kconfig.locks     |   4 ++
 kernel/locking/rtmutex.c | 175 +++++++++++++++++++++++++++++++++++++++++------
 3 files changed, 162 insertions(+), 21 deletions(-)

-- 
2.1.4



* [PATCH 1/4] locking/rtmutex: Implement lockless top-waiter wakeup
  2015-05-19 17:24 [PATCH -tip 0/4] rtmutex: Spin on owner Davidlohr Bueso
@ 2015-05-19 17:24 ` Davidlohr Bueso
  2015-06-05 12:35   ` Thomas Gleixner
                     ` (2 more replies)
  2015-05-19 17:24 ` [PATCH 2/4] locking/rtmutex: Use cmp-cmpxchg Davidlohr Bueso
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 48+ messages in thread
From: Davidlohr Bueso @ 2015-05-19 17:24 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Mike Galbraith, Paul E. McKenney,
	Sebastian Andrzej Siewior, Davidlohr Bueso, linux-kernel,
	Davidlohr Bueso

Mark the task for later wakeup after the wait_lock has been released.
This way, once the next task is awoken, it will have a better chance
of finding the wait_lock free when it continues executing in
__rt_mutex_slowlock() and tries to acquire the rtmutex by calling
try_to_take_rt_mutex(). In contended scenarios, other tasks attempting
to take the lock may acquire it first, right after the wait_lock is
released, but (a) this can also occur with the current code, as it
relies on spinlock fairness, and (b) we are dealing with the top waiter
anyway, so it will always take the lock next.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 kernel/locking/rtmutex.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 36573e9..74188d8 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -955,14 +955,13 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 }
 
 /*
- * Wake up the next waiter on the lock.
- *
  * Remove the top waiter from the current tasks pi waiter list and
- * wake it up.
+ * queue it up.
  *
  * Called with lock->wait_lock held.
  */
-static void wakeup_next_waiter(struct rt_mutex *lock)
+static void mark_wakeup_next_waiter(struct wake_q_head *wake_q,
+				    struct rt_mutex *lock)
 {
 	struct rt_mutex_waiter *waiter;
 	unsigned long flags;
@@ -991,12 +990,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock)
 
 	raw_spin_unlock_irqrestore(&current->pi_lock, flags);
 
-	/*
-	 * It's safe to dereference waiter as it cannot go away as
-	 * long as we hold lock->wait_lock. The waiter task needs to
-	 * acquire it in order to dequeue the waiter.
-	 */
-	wake_up_process(waiter->task);
+	wake_q_add(wake_q, waiter->task);
 }
 
 /*
@@ -1255,6 +1249,8 @@ static inline int rt_mutex_slowtrylock(struct rt_mutex *lock)
 static void __sched
 rt_mutex_slowunlock(struct rt_mutex *lock)
 {
+	WAKE_Q(wake_q);
+
 	raw_spin_lock(&lock->wait_lock);
 
 	debug_rt_mutex_unlock(lock);
@@ -1303,10 +1299,13 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
 	/*
 	 * The wakeup next waiter path does not suffer from the above
 	 * race. See the comments there.
+	 *
+	 * Queue the next waiter for wakeup once we release the wait_lock.
 	 */
-	wakeup_next_waiter(lock);
+	mark_wakeup_next_waiter(&wake_q, lock);
 
 	raw_spin_unlock(&lock->wait_lock);
+	wake_up_q(&wake_q);
 
 	/* Undo pi boosting if necessary: */
 	rt_mutex_adjust_prio(current);
-- 
2.1.4



* [PATCH 2/4] locking/rtmutex: Use cmp-cmpxchg
  2015-05-19 17:24 [PATCH -tip 0/4] rtmutex: Spin on owner Davidlohr Bueso
  2015-05-19 17:24 ` [PATCH 1/4] locking/rtmutex: Implement lockless top-waiter wakeup Davidlohr Bueso
@ 2015-05-19 17:24 ` Davidlohr Bueso
  2015-06-05 12:38   ` Thomas Gleixner
  2015-05-19 17:24 ` [PATCH 3/4] locking/rtmutex: Update stale plist comments Davidlohr Bueso
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 48+ messages in thread
From: Davidlohr Bueso @ 2015-05-19 17:24 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Mike Galbraith, Paul E. McKenney,
	Sebastian Andrzej Siewior, Davidlohr Bueso, linux-kernel,
	Davidlohr Bueso

Avoid unnecessary cmpxchg calls; all of our other locks
use this technique as well.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 kernel/locking/rtmutex.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 74188d8..1d5cc9d 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -74,7 +74,9 @@ static void fixup_rt_mutex_waiters(struct rt_mutex *lock)
  * set up.
  */
 #ifndef CONFIG_DEBUG_RT_MUTEXES
-# define rt_mutex_cmpxchg(l,c,n)	(cmpxchg(&l->owner, c, n) == c)
+# define rt_mutex_cmpxchg(l,c,n) \
+	(l->owner == c && cmpxchg(&l->owner, c, n) == c)
+
 static inline void mark_rt_mutex_waiters(struct rt_mutex *lock)
 {
 	unsigned long owner, *p = (unsigned long *) &lock->owner;
-- 
2.1.4



* [PATCH 3/4] locking/rtmutex: Update stale plist comments
  2015-05-19 17:24 [PATCH -tip 0/4] rtmutex: Spin on owner Davidlohr Bueso
  2015-05-19 17:24 ` [PATCH 1/4] locking/rtmutex: Implement lockless top-waiter wakeup Davidlohr Bueso
  2015-05-19 17:24 ` [PATCH 2/4] locking/rtmutex: Use cmp-cmpxchg Davidlohr Bueso
@ 2015-05-19 17:24 ` Davidlohr Bueso
  2015-06-05 12:39   ` Thomas Gleixner
                     ` (2 more replies)
  2015-05-19 17:24 ` [PATCH -rfc 4/4] locking/rtmutex: Support spin on owner (osq) Davidlohr Bueso
  2015-05-25 20:35 ` [PATCH -tip 0/4] rtmutex: Spin on owner Davidlohr Bueso
  4 siblings, 3 replies; 48+ messages in thread
From: Davidlohr Bueso @ 2015-05-19 17:24 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Mike Galbraith, Paul E. McKenney,
	Sebastian Andrzej Siewior, Davidlohr Bueso, linux-kernel,
	Davidlohr Bueso

... as of commit fb00aca4744 ("rtmutex: Turn the plist into an rb-tree") we
no longer use plists for queuing any waiters. Update the stale comments.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 kernel/locking/rtmutex.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 1d5cc9d..81aad32 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -626,7 +626,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 	 */
 	prerequeue_top_waiter = rt_mutex_top_waiter(lock);
 
-	/* [7] Requeue the waiter in the lock waiter list. */
+	/* [7] Requeue the waiter in the lock waiter tree. */
 	rt_mutex_dequeue(lock, waiter);
 	waiter->prio = task->prio;
 	rt_mutex_enqueue(lock, waiter);
@@ -664,7 +664,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 		/*
 		 * The waiter became the new top (highest priority)
 		 * waiter on the lock. Replace the previous top waiter
-		 * in the owner tasks pi waiters list with this waiter
+		 * in the owner tasks pi waiters tree with this waiter
 		 * and adjust the priority of the owner.
 		 */
 		rt_mutex_dequeue_pi(task, prerequeue_top_waiter);
@@ -675,7 +675,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 		/*
 		 * The waiter was the top waiter on the lock, but is
 		 * no longer the top prority waiter. Replace waiter in
-		 * the owner tasks pi waiters list with the new top
+		 * the owner tasks pi waiters tree with the new top
 		 * (highest priority) waiter and adjust the priority
 		 * of the owner.
 		 * The new top waiter is stored in @waiter so that
@@ -749,7 +749,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
  *
  * @lock:   The lock to be acquired.
  * @task:   The task which wants to acquire the lock
- * @waiter: The waiter that is queued to the lock's wait list if the
+ * @waiter: The waiter that is queued to the lock's wait tree if the
  *	    callsite called task_blocked_on_lock(), otherwise NULL
  */
 static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
@@ -784,7 +784,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 
 	/*
 	 * If @waiter != NULL, @task has already enqueued the waiter
-	 * into @lock waiter list. If @waiter == NULL then this is a
+	 * into @lock waiter tree. If @waiter == NULL then this is a
 	 * trylock attempt.
 	 */
 	if (waiter) {
@@ -797,7 +797,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 
 		/*
 		 * We can acquire the lock. Remove the waiter from the
-		 * lock waiters list.
+		 * lock waiters tree.
 		 */
 		rt_mutex_dequeue(lock, waiter);
 
@@ -829,7 +829,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 			 * No waiters. Take the lock without the
 			 * pi_lock dance.@task->pi_blocked_on is NULL
 			 * and we have no waiters to enqueue in @task
-			 * pi waiters list.
+			 * pi waiters tree.
 			 */
 			goto takeit;
 		}
@@ -846,7 +846,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 	/*
 	 * Finish the lock acquisition. @task is the new owner. If
 	 * other waiters exist we have to insert the highest priority
-	 * waiter into @task->pi_waiters list.
+	 * waiter into @task->pi_waiters tree.
 	 */
 	if (rt_mutex_has_waiters(lock))
 		rt_mutex_enqueue_pi(task, rt_mutex_top_waiter(lock));
@@ -957,7 +957,7 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 }
 
 /*
- * Remove the top waiter from the current tasks pi waiter list and
+ * Remove the top waiter from the current tasks pi waiter tree and
  * queue it up.
  *
  * Called with lock->wait_lock held.
-- 
2.1.4



* [PATCH -rfc 4/4] locking/rtmutex: Support spin on owner (osq)
  2015-05-19 17:24 [PATCH -tip 0/4] rtmutex: Spin on owner Davidlohr Bueso
                   ` (2 preceding siblings ...)
  2015-05-19 17:24 ` [PATCH 3/4] locking/rtmutex: Update stale plist comments Davidlohr Bueso
@ 2015-05-19 17:24 ` Davidlohr Bueso
  2015-05-20  7:11   ` Paul Bolle
                     ` (2 more replies)
  2015-05-25 20:35 ` [PATCH -tip 0/4] rtmutex: Spin on owner Davidlohr Bueso
  4 siblings, 3 replies; 48+ messages in thread
From: Davidlohr Bueso @ 2015-05-19 17:24 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Mike Galbraith, Paul E. McKenney,
	Sebastian Andrzej Siewior, Davidlohr Bueso, linux-kernel,
	Davidlohr Bueso

Similar to what we have in other locks, particularly regular mutexes, the
idea is that as long as the owner is running, there is a fair chance it'll
release the lock soon, and thus a task trying to acquire the rtmutex is
better off spinning instead of blocking immediately after the fastpath.
Conditions to stop spinning and enter the slowpath are simple:

(1) Upon need_resched()
(2) Current lock owner blocks

Spinning tasks' performance is further enhanced by queuing in the cancelable
MCS lock (osq). Because rtmutexes track the lock owner atomically, we can
extend the fastpath to continue polling on the lock owner via
cmpxchg(lock->owner, NULL, current).

This is a conservative approach, such that upon entering the slowpath,
we force all lock spinners (including future ones) to serialize on the
wait_lock as mark_rt_mutex_waiters() is called, unconditionally.

CPU0					CPU1
optimistic_spin() (failed)
try_to_take_rt_mutex()
	mark_rt_mutex_waiters()
					optimistic_spin() (failed cmpxchg)
					spin_lock(&lock->wait_lock)

As such, we check for the RT_MUTEX_HAS_WAITERS bit (bit 0) via
rt_mutex_has_waiters_fast(). This allows priority boosting to take precedence
over spinning, as otherwise we could starve a higher priority queued-up task
(i.e. the top waiter) if spinners constantly steal the lock. An alternative
would be to stop spinning when we should really wake up a higher priority
waiter, but of course we do not hold the wait_lock, so that is racy.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 include/linux/rtmutex.h  |   4 ++
 kernel/Kconfig.locks     |   4 ++
 kernel/locking/rtmutex.c | 132 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 140 insertions(+)

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 1abba5c..b5e85ca 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -15,6 +15,7 @@
 #include <linux/linkage.h>
 #include <linux/rbtree.h>
 #include <linux/spinlock_types.h>
+#include <linux/osq_lock.h>
 
 extern int max_lock_depth; /* for sysctl */
 
@@ -31,6 +32,9 @@ struct rt_mutex {
 	struct rb_root          waiters;
 	struct rb_node          *waiters_leftmost;
 	struct task_struct	*owner;
+#ifdef CONFIG_RT_MUTEX_SPIN_ON_OWNER
+	struct optimistic_spin_queue osq; /* Spinner MCS lock */
+#endif
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 	int			save_state;
 	const char 		*name, *file;
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index ebdb004..628915d 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -227,6 +227,10 @@ config MUTEX_SPIN_ON_OWNER
 	def_bool y
 	depends on SMP && !DEBUG_MUTEXES && ARCH_SUPPORTS_ATOMIC_RMW
 
+config RT_MUTEX_SPIN_ON_OWNER
+	def_bool y
+	depends on SMP && RT_MUTEXES && !CONFIG_DEBUG_RT_MUTEXES && ARCH_SUPPORTS_ATOMIC_RMW
+
 config RWSEM_SPIN_ON_OWNER
        def_bool y
        depends on SMP && RWSEM_XCHGADD_ALGORITHM && ARCH_SUPPORTS_ATOMIC_RMW
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 81aad32..6524c7c 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -63,6 +63,20 @@ static inline void clear_rt_mutex_waiters(struct rt_mutex *lock)
 			((unsigned long)lock->owner & ~RT_MUTEX_HAS_WAITERS);
 }
 
+/*
+ * Lockless alternative to rt_mutex_has_waiters() as we do not need the
+ * wait_lock to check if we are in, for instance, a transitional state
+ * after calling mark_rt_mutex_waiters().
+ */
+static inline bool rt_mutex_has_waiters_fast(struct rt_mutex *lock)
+{
+	unsigned long val = (unsigned long)lock->owner;
+
+	if (!val)
+		return false;
+	return val & RT_MUTEX_HAS_WAITERS;
+}
+
 static void fixup_rt_mutex_waiters(struct rt_mutex *lock)
 {
 	if (!rt_mutex_has_waiters(lock))
@@ -1152,6 +1166,121 @@ static void rt_mutex_handle_deadlock(int res, int detect_deadlock,
 	}
 }
 
+#ifdef CONFIG_RT_MUTEX_SPIN_ON_OWNER
+static noinline
+bool rt_mutex_spin_on_owner(struct rt_mutex *lock, struct task_struct *owner)
+{
+	bool ret = true;
+
+	rcu_read_lock();
+	while (rt_mutex_owner(lock) == owner) {
+		/*
+		 * Ensure we emit the owner->on_cpu, dereference _after_
+		 * checking lock->owner still matches owner. If that fails,
+		 * owner might point to freed memory. If it still matches,
+		 * the rcu_read_lock() ensures the memory stays valid.
+		 */
+		barrier();
+
+		if (!owner->on_cpu || need_resched()) {
+			ret = false;
+			break;
+		}
+
+		cpu_relax_lowlatency();
+	}
+	rcu_read_unlock();
+
+	return ret;
+}
+
+/*
+ * Initial check for entering the mutex spinning loop
+ */
+static inline bool rt_mutex_can_spin_on_owner(struct rt_mutex *lock)
+{
+	struct task_struct *owner;
+	/* default return to spin: if no owner, the lock is free */
+	int ret = true;
+
+	if (need_resched())
+		return 0;
+
+	rcu_read_lock();
+	owner = rt_mutex_owner(lock);
+	if (owner)
+		ret = owner->on_cpu;
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static bool rt_mutex_optimistic_spin(struct rt_mutex *lock)
+{
+	bool taken = false;
+
+	preempt_disable();
+
+	if (!rt_mutex_can_spin_on_owner(lock))
+		goto done;
+	/*
+	 * In order to avoid a stampede of mutex spinners trying to
+	 * acquire the mutex all at once, the spinners need to take a
+	 * MCS (queued) lock first before spinning on the owner field.
+	 */
+	if (!osq_lock(&lock->osq))
+		goto done;
+
+	while (true) {
+		struct task_struct *owner;
+
+		/*
+		 * When another task is competing for the lock in the
+		 * slowpath (either transitional or not), avoid the
+		 * cmpxchg and immediately block. If the situation is
+		 * later fixed by clearing the waiters bit, future
+		 * tasks that atempt to take the rtmutex can spin.
+		 */
+		if (rt_mutex_has_waiters_fast(lock))
+			goto done_unlock;
+
+		/*
+		 * If there's an owner, wait for it to either
+		 * release the lock or go to sleep.
+		 */
+		owner = rt_mutex_owner(lock);
+		if (owner && !rt_mutex_spin_on_owner(lock, owner))
+			break;
+
+		/* Try to acquire the lock, if it is unlocked. */
+		if (rt_mutex_cmpxchg(lock, NULL, current)) {
+			taken = true;
+			goto done_unlock;
+		}
+
+		/*
+		 * The cpu_relax() call is a compiler barrier which forces
+		 * everything in this loop to be re-loaded. We don't need
+		 * memory barriers as we'll eventually observe the right
+		 * values at the cost of a few extra spins.
+		 */
+		cpu_relax_lowlatency();
+	}
+
+done_unlock:
+	osq_unlock(&lock->osq);
+done:
+	preempt_enable();
+	return taken;
+}
+
+#else
+static bool rt_mutex_optimistic_spin(struct rt_mutex *lock)
+{
+	return false;
+}
+#endif
+
 /*
  * Slow path lock function:
  */
@@ -1163,6 +1292,9 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
 	struct rt_mutex_waiter waiter;
 	int ret = 0;
 
+	if (rt_mutex_optimistic_spin(lock))
+		return ret;
+
 	debug_rt_mutex_init_waiter(&waiter);
 	RB_CLEAR_NODE(&waiter.pi_tree_entry);
 	RB_CLEAR_NODE(&waiter.tree_entry);
-- 
2.1.4



* Re: [PATCH -rfc 4/4] locking/rtmutex: Support spin on owner (osq)
  2015-05-19 17:24 ` [PATCH -rfc 4/4] locking/rtmutex: Support spin on owner (osq) Davidlohr Bueso
@ 2015-05-20  7:11   ` Paul Bolle
  2015-05-25 20:35     ` Davidlohr Bueso
  2015-05-29 15:19   ` Davidlohr Bueso
  2015-06-05 13:59   ` Thomas Gleixner
  2 siblings, 1 reply; 48+ messages in thread
From: Paul Bolle @ 2015-05-20  7:11 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Mike Galbraith, Paul E. McKenney, Sebastian Andrzej Siewior,
	linux-kernel, Davidlohr Bueso

(This is posted as an RFC, so you probably don't want feedback going
into detail yet, but I couldn't help spotting this.)

On Tue, 2015-05-19 at 10:24 -0700, Davidlohr Bueso wrote:
> --- a/kernel/Kconfig.locks
> +++ b/kernel/Kconfig.locks

> +config RT_MUTEX_SPIN_ON_OWNER
> +	def_bool y
> +	depends on SMP && RT_MUTEXES && !CONFIG_DEBUG_RT_MUTEXES && ARCH_SUPPORTS_ATOMIC_RMW

s/CONFIG_DEBUG_RT_MUTEXES/DEBUG_RT_MUTEXES/?

Or perhaps even drop "&& !CONFIG_DEBUG_RT_MUTEXES" entirely, because
that test currently always evaluates to true, so it might not be needed.

Running 
   scripts/checkkconfigsymbols.py --diff $sha1..$sha2

helps catch typos like this.


Paul Bolle



* Re: [PATCH -rfc 4/4] locking/rtmutex: Support spin on owner (osq)
  2015-05-20  7:11   ` Paul Bolle
@ 2015-05-25 20:35     ` Davidlohr Bueso
  0 siblings, 0 replies; 48+ messages in thread
From: Davidlohr Bueso @ 2015-05-25 20:35 UTC (permalink / raw)
  To: Paul Bolle
  Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Mike Galbraith, Paul E. McKenney, Sebastian Andrzej Siewior,
	linux-kernel

On Wed, 2015-05-20 at 09:11 +0200, Paul Bolle wrote:
> (This is pasted as an RFC, so you probably don't want feedback going
> into detail yet, but I couldn't help spotting this.)
> 
> On Tue, 2015-05-19 at 10:24 -0700, Davidlohr Bueso wrote:
> > --- a/kernel/Kconfig.locks
> > +++ b/kernel/Kconfig.locks
> 
> > +config RT_MUTEX_SPIN_ON_OWNER
> > +	def_bool y
> > +	depends on SMP && RT_MUTEXES && !CONFIG_DEBUG_RT_MUTEXES && ARCH_SUPPORTS_ATOMIC_RMW
> 
> s/CONFIG_DEBUG_RT_MUTEXES/DEBUG_RT_MUTEXES/?

Ah, yes thanks.



* Re: [PATCH -tip 0/4] rtmutex: Spin on owner
  2015-05-19 17:24 [PATCH -tip 0/4] rtmutex: Spin on owner Davidlohr Bueso
                   ` (3 preceding siblings ...)
  2015-05-19 17:24 ` [PATCH -rfc 4/4] locking/rtmutex: Support spin on owner (osq) Davidlohr Bueso
@ 2015-05-25 20:35 ` Davidlohr Bueso
  2015-05-26 19:05   ` Thomas Gleixner
  4 siblings, 1 reply; 48+ messages in thread
From: Davidlohr Bueso @ 2015-05-25 20:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Mike Galbraith,
	Paul E. McKenney, Sebastian Andrzej Siewior, linux-kernel

ping?



* Re: [PATCH -tip 0/4] rtmutex: Spin on owner
  2015-05-25 20:35 ` [PATCH -tip 0/4] rtmutex: Spin on owner Davidlohr Bueso
@ 2015-05-26 19:05   ` Thomas Gleixner
  0 siblings, 0 replies; 48+ messages in thread
From: Thomas Gleixner @ 2015-05-26 19:05 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Mike Galbraith,
	Paul E. McKenney, Sebastian Andrzej Siewior, linux-kernel

On Mon, 25 May 2015, Davidlohr Bueso wrote:

> ping?

pong!

It's in my review backlog list and it's gonna stay there until I
return from vacation and conferencing end of next week.

Thanks,

	tglx


* Re: [PATCH -rfc 4/4] locking/rtmutex: Support spin on owner (osq)
  2015-05-19 17:24 ` [PATCH -rfc 4/4] locking/rtmutex: Support spin on owner (osq) Davidlohr Bueso
  2015-05-20  7:11   ` Paul Bolle
@ 2015-05-29 15:19   ` Davidlohr Bueso
  2015-05-29 18:01     ` Davidlohr Bueso
  2015-06-05 13:59   ` Thomas Gleixner
  2 siblings, 1 reply; 48+ messages in thread
From: Davidlohr Bueso @ 2015-05-29 15:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Mike Galbraith,
	Paul E. McKenney, Sebastian Andrzej Siewior, linux-kernel

Btw, I just realized I had sent a stale patch where the osq was not
being initialized; fixed below. Thanks!

From: Davidlohr Bueso <dave@stgolabs.net>
Subject: [PATCH v2] locking/rtmutex: Support spin on owner (osq)

Similar to what we have in other locks, particularly regular mutexes, the
idea is that as long as the owner is running, there is a fair chance it'll
release the lock soon, and thus a task trying to acquire the rtmutex is
better off spinning instead of blocking immediately after the fastpath.
Conditions to stop spinning and enter the slowpath are simple:

(1) Upon need_resched()
(2) Current lock owner blocks

Spinning tasks' performance is further enhanced by queuing in the cancelable
MCS lock (osq). Because rtmutexes track the lock owner atomically, we can
extend the fastpath to continue polling on the lock owner via
cmpxchg(lock->owner, NULL, current).

This is a conservative approach, such that upon entering the slowpath,
we force all lock spinners (including future ones) to serialize on the
wait_lock as mark_rt_mutex_waiters() is called, unconditionally.

CPU0					CPU1
optimistic_spin() (failed)
try_to_take_rt_mutex()
	mark_rt_mutex_waiters()
					optimistic_spin() (failed cmpxchg)
					spin_lock(&lock->wait_lock)

As such, we check for the RT_MUTEX_HAS_WAITERS bit (bit 0) via
rt_mutex_has_waiters_fast(). This allows priority boosting to take precedence
over spinning, as otherwise we could starve a higher priority queued-up task
(i.e. the top waiter) if spinners constantly steal the lock. An alternative
would be to stop spinning when we should really wake up a higher priority
waiter, but of course we do not hold the wait_lock, so that is racy.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
Changes from v1:
- Fix DEBUG_RT_MUTEXES symbol name in the Kconfig dependency
- Init lock->osq in rt mutex initializer

 include/linux/percpu-rwsem.h |  14 +++++
 include/linux/rtmutex.h      |   4 ++
 kernel/Kconfig.locks         |   4 ++
 kernel/locking/rtmutex.c     | 136 ++++++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                  |  14 ++---
 5 files changed, 164 insertions(+), 8 deletions(-)

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 266110d..05b4737 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -45,6 +45,20 @@ static inline void percpu_down_read(struct percpu_rw_semaphore *sem)
 	 */
 }
 
+static inline bool percpu_down_read_trylock(struct percpu_rw_semaphore *sem)
+{
+	bool ret = false;
+
+	preempt_disable();
+	if (likely(rcu_sync_is_idle(&sem->rss))) {
+		__this_cpu_inc(*sem->refcount);
+		ret = true;
+		rwsem_acquire_read(&sem->rw_sem.dep_map, 0, 0, _RET_IP_);
+	}
+	preempt_enable();
+	return ret;
+}
+
 static inline void percpu_up_read(struct percpu_rw_semaphore *sem)
 {
 	/*
diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 1abba5c..b5e85ca 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -15,6 +15,7 @@
 #include <linux/linkage.h>
 #include <linux/rbtree.h>
 #include <linux/spinlock_types.h>
+#include <linux/osq_lock.h>
 
 extern int max_lock_depth; /* for sysctl */
 
@@ -31,6 +32,9 @@ struct rt_mutex {
 	struct rb_root          waiters;
 	struct rb_node          *waiters_leftmost;
 	struct task_struct	*owner;
+#ifdef CONFIG_RT_MUTEX_SPIN_ON_OWNER
+	struct optimistic_spin_queue osq; /* Spinner MCS lock */
+#endif
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 	int			save_state;
 	const char 		*name, *file;
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index ebdb004..8602871 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -227,6 +227,10 @@ config MUTEX_SPIN_ON_OWNER
 	def_bool y
 	depends on SMP && !DEBUG_MUTEXES && ARCH_SUPPORTS_ATOMIC_RMW
 
+config RT_MUTEX_SPIN_ON_OWNER
+	def_bool y
+	depends on SMP && RT_MUTEXES && !DEBUG_RT_MUTEXES && ARCH_SUPPORTS_ATOMIC_RMW
+
 config RWSEM_SPIN_ON_OWNER
        def_bool y
        depends on SMP && RWSEM_XCHGADD_ALGORITHM && ARCH_SUPPORTS_ATOMIC_RMW
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 81aad32..c868bd4 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -63,6 +63,20 @@ static inline void clear_rt_mutex_waiters(struct rt_mutex *lock)
 			((unsigned long)lock->owner & ~RT_MUTEX_HAS_WAITERS);
 }
 
+/*
+ * Lockless alternative to rt_mutex_has_waiters() as we do not need the
+ * wait_lock to check if we are in, for instance, a transitional state
+ * after calling mark_rt_mutex_waiters().
+ */
+static inline bool rt_mutex_has_waiters_fast(struct rt_mutex *lock)
+{
+	unsigned long val = (unsigned long)lock->owner;
+
+	if (!val)
+		return false;
+	return val & RT_MUTEX_HAS_WAITERS;
+}
+
 static void fixup_rt_mutex_waiters(struct rt_mutex *lock)
 {
 	if (!rt_mutex_has_waiters(lock))
@@ -1152,6 +1166,121 @@ static void rt_mutex_handle_deadlock(int res, int detect_deadlock,
 	}
 }
 
+#ifdef CONFIG_RT_MUTEX_SPIN_ON_OWNER
+static noinline
+bool rt_mutex_spin_on_owner(struct rt_mutex *lock, struct task_struct *owner)
+{
+	bool ret = true;
+
+	rcu_read_lock();
+	while (rt_mutex_owner(lock) == owner) {
+		/*
+		 * Ensure we emit the owner->on_cpu, dereference _after_
+		 * checking lock->owner still matches owner. If that fails,
+		 * owner might point to freed memory. If it still matches,
+		 * the rcu_read_lock() ensures the memory stays valid.
+		 */
+		barrier();
+
+		if (!owner->on_cpu || need_resched()) {
+			ret = false;
+			break;
+		}
+
+		cpu_relax_lowlatency();
+	}
+	rcu_read_unlock();
+
+	return ret;
+}
+
+/*
+ * Initial check for entering the mutex spinning loop
+ */
+static inline bool rt_mutex_can_spin_on_owner(struct rt_mutex *lock)
+{
+	struct task_struct *owner;
+	/* default return to spin: if no owner, the lock is free */
+	int ret = true;
+
+	if (need_resched())
+		return 0;
+
+	rcu_read_lock();
+	owner = rt_mutex_owner(lock);
+	if (owner)
+		ret = owner->on_cpu;
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static bool rt_mutex_optimistic_spin(struct rt_mutex *lock)
+{
+	bool taken = false;
+
+	preempt_disable();
+
+	if (!rt_mutex_can_spin_on_owner(lock))
+		goto done;
+	/*
+	 * In order to avoid a stampede of mutex spinners trying to
+	 * acquire the mutex all at once, the spinners need to take a
+	 * MCS (queued) lock first before spinning on the owner field.
+	 */
+	if (!osq_lock(&lock->osq))
+		goto done;
+
+	while (true) {
+		struct task_struct *owner;
+
+		/*
+		 * When another task is competing for the lock in the
+		 * slowpath (either transitional or not), avoid the
+		 * cmpxchg and immediately block. If the situation is
+		 * later fixed by clearing the waiters bit, future
+		 * tasks that atempt to take the rtmutex can spin.
+		 */
+		if (rt_mutex_has_waiters_fast(lock))
+			goto done_unlock;
+
+		/*
+		 * If there's an owner, wait for it to either
+		 * release the lock or go to sleep.
+		 */
+		owner = rt_mutex_owner(lock);
+		if (owner && !rt_mutex_spin_on_owner(lock, owner))
+			break;
+
+		/* Try to acquire the lock, if it is unlocked. */
+		if (rt_mutex_cmpxchg(lock, NULL, current)) {
+			taken = true;
+			goto done_unlock;
+		}
+
+		/*
+		 * The cpu_relax() call is a compiler barrier which forces
+		 * everything in this loop to be re-loaded. We don't need
+		 * memory barriers as we'll eventually observe the right
+		 * values at the cost of a few extra spins.
+		 */
+		cpu_relax_lowlatency();
+	}
+
+done_unlock:
+	osq_unlock(&lock->osq);
+done:
+	preempt_enable();
+	return taken;
+}
+
+#else
+static bool rt_mutex_optimistic_spin(struct rt_mutex *lock)
+{
+	return false;
+}
+#endif
+
 /*
  * Slow path lock function:
  */
@@ -1163,6 +1292,9 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
 	struct rt_mutex_waiter waiter;
 	int ret = 0;
 
+	if (rt_mutex_optimistic_spin(lock))
+		return ret;
+
 	debug_rt_mutex_init_waiter(&waiter);
 	RB_CLEAR_NODE(&waiter.pi_tree_entry);
 	RB_CLEAR_NODE(&waiter.tree_entry);
@@ -1500,7 +1632,9 @@ void __rt_mutex_init(struct rt_mutex *lock, const char *name)
 	raw_spin_lock_init(&lock->wait_lock);
 	lock->waiters = RB_ROOT;
 	lock->waiters_leftmost = NULL;
-
+#ifdef CONFIG_RT_MUTEX_SPIN_ON_OWNER
+	osq_lock_init(&lock->osq);
+#endif
 	debug_rt_mutex_init(lock, name);
 }
 EXPORT_SYMBOL_GPL(__rt_mutex_init);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd..f2da14d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -36,7 +36,7 @@
 #include <linux/cpuset.h>
 #include <linux/compaction.h>
 #include <linux/notifier.h>
-#include <linux/rwsem.h>
+#include <linux/percpu-rwsem.h>
 #include <linux/delay.h>
 #include <linux/kthread.h>
 #include <linux/freezer.h>
@@ -211,9 +211,9 @@ int register_shrinker(struct shrinker *shrinker)
 	if (!shrinker->nr_deferred)
 		return -ENOMEM;
 
-	down_write(&shrinker_rwsem);
+	percpu_down_write(&shrinker_rwsem);
 	list_add_tail(&shrinker->list, &shrinker_list);
-	up_write(&shrinker_rwsem);
+	percpu_up_write(&shrinker_rwsem);
 	return 0;
 }
 EXPORT_SYMBOL(register_shrinker);
@@ -223,9 +223,9 @@ EXPORT_SYMBOL(register_shrinker);
  */
 void unregister_shrinker(struct shrinker *shrinker)
 {
-	down_write(&shrinker_rwsem);
+	percpu_down_write(&shrinker_rwsem);
 	list_del(&shrinker->list);
-	up_write(&shrinker_rwsem);
+	percpu_up_write(&shrinker_rwsem);
 	kfree(shrinker->nr_deferred);
 }
 EXPORT_SYMBOL(unregister_shrinker);
@@ -386,7 +386,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 	if (nr_scanned == 0)
 		nr_scanned = SWAP_CLUSTER_MAX;
 
-	if (!down_read_trylock(&shrinker_rwsem)) {
+	if (!percpu_down_read_trylock(&shrinker_rwsem)) {
 		/*
 		 * If we would return 0, our callers would understand that we
 		 * have nothing else to shrink and give up trying. By returning
@@ -413,7 +413,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 		freed += do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible);
 	}
 
-	up_read(&shrinker_rwsem);
+	percpu_up_read(&shrinker_rwsem);
 out:
 	cond_resched();
 	return freed;
-- 
2.1.4





* Re: [PATCH -rfc 4/4] locking/rtmutex: Support spin on owner (osq)
  2015-05-29 15:19   ` Davidlohr Bueso
@ 2015-05-29 18:01     ` Davidlohr Bueso
  0 siblings, 0 replies; 48+ messages in thread
From: Davidlohr Bueso @ 2015-05-29 18:01 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Mike Galbraith,
	Paul E. McKenney, Sebastian Andrzej Siewior, linux-kernel

On Fri, 2015-05-29 at 08:19 -0700, Davidlohr Bueso wrote:
> Btw, I just realized I had sent a stale patch where the osq was not
> being initialized, fixed below. Thanks!

I am an idiot, the patch contained some bogus changes in percpu rwsems...
*sigh* Here's v3, sorry for the noise.

8<---------------------------------------------------------------------
From: Davidlohr Bueso <dave@stgolabs.net>
Subject: [PATCH v3 4/4] locking/rtmutex: Support spin on owner (osq)

Similar to what we have in other locks, particularly regular mutexes, the
idea is that as long as the owner is running, there is a fair chance it'll
release the lock soon, and thus a task trying to acquire the rtmutex is
better off spinning instead of blocking immediately after the fastpath.
Conditions to stop spinning and enter the slowpath are simple:

(1) Upon need_resched()
(2) Current lock owner blocks

Spinning tasks' performance is further enhanced by queuing in the cancelable
MCS lock (osq). Because rtmutexes track the lock owner atomically, we can
extend the fastpath to continue polling on the lock owner via
cmpxchg(lock->owner, NULL, current).

This is a conservative approach, such that upon entering the slowpath,
we force all lock spinners (including future ones) to serialize on the
wait_lock as mark_rt_mutex_waiters() is called, unconditionally.

CPU0					CPU1
optimistic_spin() (failed)
try_to_take_rt_mutex()
	mark_rt_mutex_waiters()
					optimistic_spin() (failed cmpxchg)
					spin_lock(&lock->wait_lock)

As such, we check for the RT_MUTEX_HAS_WAITERS bit (bit 0) via
rt_mutex_has_waiters_fast(). This allows priority boosting to take precedence
over spinning, as otherwise we could starve a higher priority queued-up task
(i.e. the top waiter) if spinners constantly steal the lock. An alternative
would be to stop spinning when we should really wake up a higher priority
waiter, but of course we do not hold the wait_lock, so that is racy.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 include/linux/rtmutex.h  |   4 ++
 kernel/Kconfig.locks     |   4 ++
 kernel/locking/rtmutex.c | 136 ++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 143 insertions(+), 1 deletion(-)

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 1abba5c..b5e85ca 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -15,6 +15,7 @@
 #include <linux/linkage.h>
 #include <linux/rbtree.h>
 #include <linux/spinlock_types.h>
+#include <linux/osq_lock.h>
 
 extern int max_lock_depth; /* for sysctl */
 
@@ -31,6 +32,9 @@ struct rt_mutex {
 	struct rb_root          waiters;
 	struct rb_node          *waiters_leftmost;
 	struct task_struct	*owner;
+#ifdef CONFIG_RT_MUTEX_SPIN_ON_OWNER
+	struct optimistic_spin_queue osq; /* Spinner MCS lock */
+#endif
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 	int			save_state;
 	const char 		*name, *file;
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index ebdb004..8602871 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -227,6 +227,10 @@ config MUTEX_SPIN_ON_OWNER
 	def_bool y
 	depends on SMP && !DEBUG_MUTEXES && ARCH_SUPPORTS_ATOMIC_RMW
 
+config RT_MUTEX_SPIN_ON_OWNER
+	def_bool y
+	depends on SMP && RT_MUTEXES && !DEBUG_RT_MUTEXES && ARCH_SUPPORTS_ATOMIC_RMW
+
 config RWSEM_SPIN_ON_OWNER
        def_bool y
        depends on SMP && RWSEM_XCHGADD_ALGORITHM && ARCH_SUPPORTS_ATOMIC_RMW
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 81aad32..c868bd4 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -63,6 +63,20 @@ static inline void clear_rt_mutex_waiters(struct rt_mutex *lock)
 			((unsigned long)lock->owner & ~RT_MUTEX_HAS_WAITERS);
 }
 
+/*
+ * Lockless alternative to rt_mutex_has_waiters() as we do not need the
+ * wait_lock to check if we are in, for instance, a transitional state
+ * after calling mark_rt_mutex_waiters().
+ */
+static inline bool rt_mutex_has_waiters_fast(struct rt_mutex *lock)
+{
+	unsigned long val = (unsigned long)lock->owner;
+
+	if (!val)
+		return false;
+	return val & RT_MUTEX_HAS_WAITERS;
+}
+
 static void fixup_rt_mutex_waiters(struct rt_mutex *lock)
 {
 	if (!rt_mutex_has_waiters(lock))
@@ -1152,6 +1166,121 @@ static void rt_mutex_handle_deadlock(int res, int detect_deadlock,
 	}
 }
 
+#ifdef CONFIG_RT_MUTEX_SPIN_ON_OWNER
+static noinline
+bool rt_mutex_spin_on_owner(struct rt_mutex *lock, struct task_struct *owner)
+{
+	bool ret = true;
+
+	rcu_read_lock();
+	while (rt_mutex_owner(lock) == owner) {
+		/*
+		 * Ensure we emit the owner->on_cpu, dereference _after_
+		 * checking lock->owner still matches owner. If that fails,
+		 * owner might point to freed memory. If it still matches,
+		 * the rcu_read_lock() ensures the memory stays valid.
+		 */
+		barrier();
+
+		if (!owner->on_cpu || need_resched()) {
+			ret = false;
+			break;
+		}
+
+		cpu_relax_lowlatency();
+	}
+	rcu_read_unlock();
+
+	return ret;
+}
+
+/*
+ * Initial check for entering the mutex spinning loop
+ */
+static inline bool rt_mutex_can_spin_on_owner(struct rt_mutex *lock)
+{
+	struct task_struct *owner;
+	/* default return to spin: if no owner, the lock is free */
+	int ret = true;
+
+	if (need_resched())
+		return 0;
+
+	rcu_read_lock();
+	owner = rt_mutex_owner(lock);
+	if (owner)
+		ret = owner->on_cpu;
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static bool rt_mutex_optimistic_spin(struct rt_mutex *lock)
+{
+	bool taken = false;
+
+	preempt_disable();
+
+	if (!rt_mutex_can_spin_on_owner(lock))
+		goto done;
+	/*
+	 * In order to avoid a stampede of mutex spinners trying to
+	 * acquire the mutex all at once, the spinners need to take a
+	 * MCS (queued) lock first before spinning on the owner field.
+	 */
+	if (!osq_lock(&lock->osq))
+		goto done;
+
+	while (true) {
+		struct task_struct *owner;
+
+		/*
+		 * When another task is competing for the lock in the
+		 * slowpath (either transitional or not), avoid the
+		 * cmpxchg and immediately block. If the situation is
+		 * later fixed by clearing the waiters bit, future
+		 * tasks that atempt to take the rtmutex can spin.
+		 */
+		if (rt_mutex_has_waiters_fast(lock))
+			goto done_unlock;
+
+		/*
+		 * If there's an owner, wait for it to either
+		 * release the lock or go to sleep.
+		 */
+		owner = rt_mutex_owner(lock);
+		if (owner && !rt_mutex_spin_on_owner(lock, owner))
+			break;
+
+		/* Try to acquire the lock, if it is unlocked. */
+		if (rt_mutex_cmpxchg(lock, NULL, current)) {
+			taken = true;
+			goto done_unlock;
+		}
+
+		/*
+		 * The cpu_relax() call is a compiler barrier which forces
+		 * everything in this loop to be re-loaded. We don't need
+		 * memory barriers as we'll eventually observe the right
+		 * values at the cost of a few extra spins.
+		 */
+		cpu_relax_lowlatency();
+	}
+
+done_unlock:
+	osq_unlock(&lock->osq);
+done:
+	preempt_enable();
+	return taken;
+}
+
+#else
+static bool rt_mutex_optimistic_spin(struct rt_mutex *lock)
+{
+	return false;
+}
+#endif
+
 /*
  * Slow path lock function:
  */
@@ -1163,6 +1292,9 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
 	struct rt_mutex_waiter waiter;
 	int ret = 0;
 
+	if (rt_mutex_optimistic_spin(lock))
+		return ret;
+
 	debug_rt_mutex_init_waiter(&waiter);
 	RB_CLEAR_NODE(&waiter.pi_tree_entry);
 	RB_CLEAR_NODE(&waiter.tree_entry);
@@ -1500,7 +1632,9 @@ void __rt_mutex_init(struct rt_mutex *lock, const char *name)
 	raw_spin_lock_init(&lock->wait_lock);
 	lock->waiters = RB_ROOT;
 	lock->waiters_leftmost = NULL;
-
+#ifdef CONFIG_RT_MUTEX_SPIN_ON_OWNER
+	osq_lock_init(&lock->osq);
+#endif
 	debug_rt_mutex_init(lock, name);
 }
 EXPORT_SYMBOL_GPL(__rt_mutex_init);
-- 
2.1.4





* Re: [PATCH 1/4] locking/rtmutex: Implement lockless top-waiter wakeup
  2015-05-19 17:24 ` [PATCH 1/4] locking/rtmutex: Implement lockless top-waiter wakeup Davidlohr Bueso
@ 2015-06-05 12:35   ` Thomas Gleixner
  2015-06-16 19:29   ` [PATCH] futex: lower the lock contention on the HB lock during wake up Sebastian Andrzej Siewior
  2015-06-18 20:30   ` [tip:sched/core] locking/rtmutex: Implement lockless top-waiter wakeup tip-bot for Davidlohr Bueso
  2 siblings, 0 replies; 48+ messages in thread
From: Thomas Gleixner @ 2015-06-05 12:35 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Mike Galbraith,
	Paul E. McKenney, Sebastian Andrzej Siewior, linux-kernel,
	Davidlohr Bueso

On Tue, 19 May 2015, Davidlohr Bueso wrote:

> Mark the task for later wakeup after the wait_lock has been released.
> This way, once the next task is awoken, it will have a better chance
> to of finding the wait_lock free when continuing executing in
> __rt_mutex_slowlock() when trying to acquire the rtmutex, calling
> try_to_take_rt_mutex(). Upon contended scenarios, other tasks attempting
> take the lock may acquire it first, right after the wait_lock is released,
> but (a) this can also occur with the current code, as it relies on the
> spinlock fairness, and (b) we are dealing with the top-waiter anyway,
> so it will always take the lock next.
> 
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>


* Re: [PATCH 2/4] locking/rtmutex: Use cmp-cmpxchg
  2015-05-19 17:24 ` [PATCH 2/4] locking/rtmutex: Use cmp-cmpxchg Davidlohr Bueso
@ 2015-06-05 12:38   ` Thomas Gleixner
  2015-06-06 15:27     ` Davidlohr Bueso
  0 siblings, 1 reply; 48+ messages in thread
From: Thomas Gleixner @ 2015-06-05 12:38 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Mike Galbraith,
	Paul E. McKenney, Sebastian Andrzej Siewior, linux-kernel,
	Davidlohr Bueso

On Tue, 19 May 2015, Davidlohr Bueso wrote:

> Avoid unnecessary cmpxchg calls, all of our other locks
> use it as well.
> 
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> ---
>  kernel/locking/rtmutex.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
> index 74188d8..1d5cc9d 100644
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -74,7 +74,9 @@ static void fixup_rt_mutex_waiters(struct rt_mutex *lock)
>   * set up.
>   */
>  #ifndef CONFIG_DEBUG_RT_MUTEXES
> -# define rt_mutex_cmpxchg(l,c,n)	(cmpxchg(&l->owner, c, n) == c)
> +# define rt_mutex_cmpxchg(l,c,n) \
> +	(l->owner == c && cmpxchg(&l->owner, c, n) == c)

Errm. How does that improve stuff like rt_mutex_lock() ?

Thanks,

	tglx


* Re: [PATCH 3/4] locking/rtmutex: Update stale plist comments
  2015-05-19 17:24 ` [PATCH 3/4] locking/rtmutex: Update stale plist comments Davidlohr Bueso
@ 2015-06-05 12:39   ` Thomas Gleixner
  2015-06-18 20:57   ` [tip:sched/core] " tip-bot for Davidlohr Bueso
  2015-06-19 19:33   ` [tip:sched/locking] " tip-bot for Davidlohr Bueso
  2 siblings, 0 replies; 48+ messages in thread
From: Thomas Gleixner @ 2015-06-05 12:39 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Mike Galbraith,
	Paul E. McKenney, Sebastian Andrzej Siewior, linux-kernel,
	Davidlohr Bueso

On Tue, 19 May 2015, Davidlohr Bueso wrote:

> ... as of fb00aca4744 (rtmutex: Turn the plist into an rb-tree) we
> no longer use plists for queuing any waiters. Update stale comments.
> 
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>


* Re: [PATCH -rfc 4/4] locking/rtmutex: Support spin on owner (osq)
  2015-05-19 17:24 ` [PATCH -rfc 4/4] locking/rtmutex: Support spin on owner (osq) Davidlohr Bueso
  2015-05-20  7:11   ` Paul Bolle
  2015-05-29 15:19   ` Davidlohr Bueso
@ 2015-06-05 13:59   ` Thomas Gleixner
  2015-06-09  4:41     ` Davidlohr Bueso
  2 siblings, 1 reply; 48+ messages in thread
From: Thomas Gleixner @ 2015-06-05 13:59 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Mike Galbraith,
	Paul E. McKenney, Sebastian Andrzej Siewior, linux-kernel,
	Davidlohr Bueso

On Tue, 19 May 2015, Davidlohr Bueso wrote:
>  
> +/*
> + * Lockless alternative to rt_mutex_has_waiters() as we do not need the
> + * wait_lock to check if we are in, for instance, a transitional state
> + * after calling mark_rt_mutex_waiters().

Before I get into a state of brain melt, could you please explain that
in an understandable way? 

rt_mutex_has_waiters() looks at the root pointer of the rbtree head
to see whether it is empty. You can do a lockless check of that as well,
right? So what's the FAST part of that function, and how is it
related to a point after we called mark_rt_mutex_waiters()?

> + */
> +static inline bool rt_mutex_has_waiters_fast(struct rt_mutex *lock)
> +{
> +	unsigned long val = (unsigned long)lock->owner;
> +
> +	if (!val)
> +		return false;
> +	return val & RT_MUTEX_HAS_WAITERS;
> +}
> +

> +/*
> + * Initial check for entering the mutex spinning loop
> + */
> +static inline bool rt_mutex_can_spin_on_owner(struct rt_mutex *lock)
> +{
> +	struct task_struct *owner;
> +	/* default return to spin: if no owner, the lock is free */


Rather than having a comment in the middle of the variable declaration
section, I'd prefer a comment explaining the whole logic of this
function.

> +	int ret = true;

> +static bool rt_mutex_optimistic_spin(struct rt_mutex *lock)
> +{
> +	bool taken = false;
> +
> +	preempt_disable();
> +
> +	if (!rt_mutex_can_spin_on_owner(lock))
> +		goto done;
> +	/*
> +	 * In order to avoid a stampede of mutex spinners trying to
> +	 * acquire the mutex all at once, the spinners need to take a
> +	 * MCS (queued) lock first before spinning on the owner field.
> +	 */
> +	if (!osq_lock(&lock->osq))
> +		goto done;

Hmm. The queue lock is serializing potential spinners, right?

So that's going to lead to a potential priority ordering problem,
because if a lower prio task wins the race to the osq queue,
then the higher prio waiter will be queued behind it and blocked from
taking the lock first.

Thanks,

	tglx


* Re: [PATCH 2/4] locking/rtmutex: Use cmp-cmpxchg
  2015-06-05 12:38   ` Thomas Gleixner
@ 2015-06-06 15:27     ` Davidlohr Bueso
  2015-06-15 18:34       ` Jason Low
  0 siblings, 1 reply; 48+ messages in thread
From: Davidlohr Bueso @ 2015-06-06 15:27 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Mike Galbraith,
	Paul E. McKenney, Sebastian Andrzej Siewior, linux-kernel

On Fri, 2015-06-05 at 14:38 +0200, Thomas Gleixner wrote:
> On Tue, 19 May 2015, Davidlohr Bueso wrote:
> 
> > Avoid unnecessary cmpxchg calls, all of our other locks
> > use it as well.
> > 
> > Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> > ---
> >  kernel/locking/rtmutex.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
> > index 74188d8..1d5cc9d 100644
> > --- a/kernel/locking/rtmutex.c
> > +++ b/kernel/locking/rtmutex.c
> > @@ -74,7 +74,9 @@ static void fixup_rt_mutex_waiters(struct rt_mutex *lock)
> >   * set up.
> >   */
> >  #ifndef CONFIG_DEBUG_RT_MUTEXES
> > -# define rt_mutex_cmpxchg(l,c,n)	(cmpxchg(&l->owner, c, n) == c)
> > +# define rt_mutex_cmpxchg(l,c,n) \
> > +	(l->owner == c && cmpxchg(&l->owner, c, n) == c)
> 
> Errm. How does that improve stuff like rt_mutex_lock() ?

It just avoids a cmpxchg in the fastpath when locked, at the cost of an
extra test when unlocked. CCAS techniques have been proven to boost some
workloads for both rwsems and mutexes. That's all.
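
To make the trade-off concrete, here is a minimal stand-alone sketch of the
pattern (illustrative user-space C only, not the kernel macro; the names are
made up):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_lock {
	_Atomic(void *) owner;		/* NULL when unlocked */
};

/* Plain CAS: always issues the atomic, even when the lock is visibly held. */
static bool trylock_cas(struct fake_lock *l, void *me)
{
	void *expected = NULL;

	return atomic_compare_exchange_strong(&l->owner, &expected, me);
}

/*
 * CCAS (compare, then compare-and-swap): a cheap load filters out the
 * already-owned case, so contended fastpath callers skip the expensive
 * atomic at the cost of one extra read when the lock is free.
 */
static bool trylock_ccas(struct fake_lock *l, void *me)
{
	void *expected = NULL;

	if (atomic_load_explicit(&l->owner, memory_order_relaxed))
		return false;		/* visibly owned: skip the cmpxchg */
	return atomic_compare_exchange_strong(&l->owner, &expected, me);
}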

Thanks,
Davidlohr




* Re: [PATCH -rfc 4/4] locking/rtmutex: Support spin on owner (osq)
  2015-06-05 13:59   ` Thomas Gleixner
@ 2015-06-09  4:41     ` Davidlohr Bueso
  2015-06-09  9:29       ` Thomas Gleixner
  0 siblings, 1 reply; 48+ messages in thread
From: Davidlohr Bueso @ 2015-06-09  4:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Mike Galbraith,
	Paul E. McKenney, Sebastian Andrzej Siewior, linux-kernel

On Fri, 2015-06-05 at 15:59 +0200, Thomas Gleixner wrote:
> On Tue, 19 May 2015, Davidlohr Bueso wrote:
> >  
> > +/*
> > + * Lockless alternative to rt_mutex_has_waiters() as we do not need the
> > + * wait_lock to check if we are in, for instance, a transitional state
> > + * after calling mark_rt_mutex_waiters().
> 
> Before I get into a state of brain melt, could you please explain that
> in an understandable way?

With that I meant that we can check the owner field to see whether the
RT_MUTEX_HAS_WAITERS bit is set without taking the wait_lock, even when
there is no owner.

> 
> rt_mutex_has_waiters() looks at the root pointer of the rbtree head
> whether that's empty. You can do a lockless check of that as well,
> right? So what's the FAST part of that function and how is that
> related to a point after we called mark_rt_mutex_waiters()?

You're right, we could use rt_mutex_has_waiters(). When I thought of
this originally, I was considering something like:

if (rt_mutex_has_waiters(lock)) {
	if (current->prio >= rt_mutex_top_waiter(lock)->prio)
	...

Which obviously requires the wait_lock, but I did not consider just
using the tree. However, the consequence I see in doing this is that we
would miss scenarios where mark_rt_mutex_waiters() is called (under nil
owner, for example), so we would force tasks to block only when there
are truly waiters.
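
Side by side, the two checks we are discussing look roughly like this (sketch
for illustration; the first is the existing tree-based check, the second is
the one from this series):

/* Needs lock->wait_lock (or careful use), as the rbtree is updated under it. */
static inline int rt_mutex_has_waiters(struct rt_mutex *lock)
{
	return !RB_EMPTY_ROOT(&lock->waiters);
}

/*
 * Lockless: relies only on the owner word, so it also sees the transitional
 * state where mark_rt_mutex_waiters() has set RT_MUTEX_HAS_WAITERS while
 * the owner may be NULL.
 */
static inline bool rt_mutex_has_waiters_fast(struct rt_mutex *lock)
{
	unsigned long val = (unsigned long)lock->owner;

	if (!val)
		return false;
	return val & RT_MUTEX_HAS_WAITERS;
}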

> > + */
> > +static inline bool rt_mutex_has_waiters_fast(struct rt_mutex *lock)
> > +{
> > +	unsigned long val = (unsigned long)lock->owner;
> > +
> > +	if (!val)
> > +		return false;
> > +	return val & RT_MUTEX_HAS_WAITERS;
> > +}
> > +
> 
> > +/*
> > + * Initial check for entering the mutex spinning loop
> > + */
> > +static inline bool rt_mutex_can_spin_on_owner(struct rt_mutex *lock)
> > +{
> > +	struct task_struct *owner;
> > +	/* default return to spin: if no owner, the lock is free */
> 
> 
> Rather than having a comment in the middle of the variable declaration
> section, I'd prefer a comment explaing the whole logic of this
> function.

Ok.

> > +	int ret = true;
> 
> > +static bool rt_mutex_optimistic_spin(struct rt_mutex *lock)
> > +{
> > +	bool taken = false;
> > +
> > +	preempt_disable();
> > +
> > +	if (!rt_mutex_can_spin_on_owner(lock))
> > +		goto done;
> > +	/*
> > +	 * In order to avoid a stampede of mutex spinners trying to
> > +	 * acquire the mutex all at once, the spinners need to take a
> > +	 * MCS (queued) lock first before spinning on the owner field.
> > +	 */
> > +	if (!osq_lock(&lock->osq))
> > +		goto done;
> 
> Hmm. The queue lock is serializing potential spinners, right?

Yes.

> 
> So that's going to lead to a potential priority ordering problem
> because if a lower prio task wins the racing to the ocq_lock queue,
> then the higher prio waiter will be queued behind and blocked from
> taking the lock first.

Hmm yes, osq is a fair lock. However, I believe this is mitigated by (a)
the conservative spinning approach, and (b) osq_lock's need_resched()
check, so at least a spinner will abort if a higher prio task comes in.
But of course, this only deals with spinners, and we cannot account for
a lower prio owner task.

So if this is not acceptable, I guess we'll have to do without the MCS-like
properties.

Thanks,
Davidlohr




* Re: [PATCH -rfc 4/4] locking/rtmutex: Support spin on owner (osq)
  2015-06-09  4:41     ` Davidlohr Bueso
@ 2015-06-09  9:29       ` Thomas Gleixner
  2015-06-09 11:21         ` Peter Zijlstra
  0 siblings, 1 reply; 48+ messages in thread
From: Thomas Gleixner @ 2015-06-09  9:29 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Mike Galbraith,
	Paul E. McKenney, Sebastian Andrzej Siewior, linux-kernel

On Mon, 8 Jun 2015, Davidlohr Bueso wrote:
> On Fri, 2015-06-05 at 15:59 +0200, Thomas Gleixner wrote:
> > rt_mutex_has_waiters() looks at the root pointer of the rbtree head
> > whether that's empty. You can do a lockless check of that as well,
> > right? So what's the FAST part of that function and how is that
> > related to a point after we called mark_rt_mutex_waiters()?
> 
> You're right, we could use rt_mutex_has_waiters(). When I thought of
> this originally, I was considering something like:
> 
> if (rt_mutex_has_waiters(lock)) {
> 	if (current->prio >= rt_mutex_top_waiter(lock)->prio)
> 	...
> 
> Which obviously requires the wait_lock, but I did not consider just
> using the tree. However, the consequence I see in doing this is that we
> would miss scenarios where mark_rt_mutex_waiters() is called (under nil
> owner, for example), so we would force tasks to block only when there
> are truly waiters.

Fair enough. But we really want a proper comment explaining it.

> > > +static bool rt_mutex_optimistic_spin(struct rt_mutex *lock)
> > > +{
> > > +	bool taken = false;
> > > +
> > > +	preempt_disable();
> > > +
> > > +	if (!rt_mutex_can_spin_on_owner(lock))
> > > +		goto done;
> > > +	/*
> > > +	 * In order to avoid a stampede of mutex spinners trying to
> > > +	 * acquire the mutex all at once, the spinners need to take a
> > > +	 * MCS (queued) lock first before spinning on the owner field.
> > > +	 */
> > > +	if (!osq_lock(&lock->osq))
> > > +		goto done;
> > 
> > Hmm. The queue lock is serializing potential spinners, right?
> 
> Yes.
> 
> > 
> > So that's going to lead to a potential priority ordering problem
> > because if a lower prio task wins the racing to the ocq_lock queue,
> > then the higher prio waiter will be queued behind and blocked from
> > taking the lock first.
> 
> Hmm yes, ocq is a fair lock. However I believe this is mitigated by (a)
> the conservative spinning approach, and (b) by osq_lock's need_resched()
> check, so at least a spinner will abort if a higher prio task comes in.
> But of course, this only deals with spinners, and we cannot account for
> a lower prio owner task.
> 
> So if this is not acceptable, I guess we'll have to do without the mcs
> like properties.

I'd say it counts as priority inversion.

If you look at the RT code, then you'll notice that in the slow lock
path we queue the incoming waiter (including the PI dance) and then
spin only if the waiter is the top waiter on the lock.
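
Expressed as a (made-up) helper, that check is roughly the following; this is
only a sketch of the idea, not the actual -rt implementation, and the caller
is assumed to hold lock->wait_lock after having queued @waiter via
task_blocks_on_rt_mutex():

static bool rt_waiter_should_spin(struct rt_mutex *lock,
				  struct rt_mutex_waiter *waiter)
{
	struct task_struct *owner = rt_mutex_owner(lock);

	if (waiter != rt_mutex_top_waiter(lock))
		return false;		/* not next in line, just block */
	if (!owner || !owner->on_cpu)
		return false;		/* owner not running, spinning is pointless */
	return !need_resched();
}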

Surely it would be nice to avoid the whole PI dance, but OTOH it can
lead to the following issue (and some others):

CPU0   		CPU1

T0 prio=10	T1 prio=20
lock(RTM);
		lock(RTM);
		spin()
->preempt()
T2 prio=15	   if (!owner->on_cpu)
   		       break;
		block_on_rtmutex();
			prio_boost(T0);	
->preempt()
T0 prio=20
unlock(RTM);

IIRC, the initial RT attempt was to spin w/o the PI dance but we gave
up on it due to latency and correctness issues.

I wrapped my head around doing the following:

  1) Lightweight mark the owner as boosted, w/o actually boosting it to
     keep it on the cpu 

  2) Have a priority check in the spin to drop out when a higher prio
     waiter comes in.

But #1 is a nightmare in the scheduler to do, so I gave up on it. If
you have a better idea, you're welcome.

Thanks,

	tglx


* Re: [PATCH -rfc 4/4] locking/rtmutex: Support spin on owner (osq)
  2015-06-09  9:29       ` Thomas Gleixner
@ 2015-06-09 11:21         ` Peter Zijlstra
  2015-06-09 12:53           ` Thomas Gleixner
  0 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2015-06-09 11:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Davidlohr Bueso, Ingo Molnar, Steven Rostedt, Mike Galbraith,
	Paul E. McKenney, Sebastian Andrzej Siewior, linux-kernel

On Tue, Jun 09, 2015 at 11:29:59AM +0200, Thomas Gleixner wrote:
> 
> If you look at the RT code, then you'll notice that in the slow lock
> path we queue the incoming waiter (including the PI dance) and then
> spin only if the waiter is the top waiter on the lock.

On a related note; does the RT code still add lock stealing to rtmutex?

We allow higher prio lock stealing, but I seem to remember a longish
debate if we should allow equal prio lock stealing.
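
For reference, the higher prio steal amounts to a check of roughly this
shape (rough sketch, made-up helper name, assuming lock->wait_lock is
held as in try_to_take_rt_mutex(); lower ->prio value means higher
priority):

static bool rt_mutex_may_steal(struct rt_mutex *lock,
			       struct task_struct *task)
{
	/* No waiters queued: nothing to steal from. */
	if (!rt_mutex_has_waiters(lock))
		return true;

	/* Strictly higher priority only; equal prio loses. */
	return task->prio < rt_mutex_top_waiter(lock)->prio;
}

The equal prio case failing that last test is exactly the part that was
never settled.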

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH -rfc 4/4] locking/rtmutex: Support spin on owner (osq)
  2015-06-09 11:21         ` Peter Zijlstra
@ 2015-06-09 12:53           ` Thomas Gleixner
  0 siblings, 0 replies; 48+ messages in thread
From: Thomas Gleixner @ 2015-06-09 12:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Davidlohr Bueso, Ingo Molnar, Steven Rostedt, Mike Galbraith,
	Paul E. McKenney, Sebastian Andrzej Siewior, linux-kernel

On Tue, 9 Jun 2015, Peter Zijlstra wrote:

> On Tue, Jun 09, 2015 at 11:29:59AM +0200, Thomas Gleixner wrote:
> > 
> > If you look at the RT code, then you'll notice that in the slow lock
> > path we queue the incoming waiter (including the PI dance) and then
> > spin only if the waiter is the top waiter on the lock.
> 
> On a related note; does the RT code still add lock stealing to rtmutex?

No.
 
> We allow higher prio lock stealing, but I seem to remember a longish
> debate if we should allow equal prio lock stealing.

Right. We never finished that one :)

Thanks,

	tglx
 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/4] locking/rtmutex: Use cmp-cmpxchg
  2015-06-06 15:27     ` Davidlohr Bueso
@ 2015-06-15 18:34       ` Jason Low
  2015-06-15 19:37         ` Davidlohr Bueso
  0 siblings, 1 reply; 48+ messages in thread
From: Jason Low @ 2015-06-15 18:34 UTC (permalink / raw)
  To: Davidlohr Bueso, Jason Low
  Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Mike Galbraith, Paul E. McKenney, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List

Hi David,

On Sat, Jun 6, 2015 at 8:27 AM, Davidlohr Bueso <dave@stgolabs.net> wrote:
> On Fri, 2015-06-05 at 14:38 +0200, Thomas Gleixner wrote:
>> On Tue, 19 May 2015, Davidlohr Bueso wrote:
>>
>> > Avoid unnecessary cmpxchg calls, all of our other locks
>> > use it as well.
>> >
>> > Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
>> > ---
>> >  kernel/locking/rtmutex.c | 4 +++-
>> >  1 file changed, 3 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
>> > index 74188d8..1d5cc9d 100644
>> > --- a/kernel/locking/rtmutex.c
>> > +++ b/kernel/locking/rtmutex.c
>> > @@ -74,7 +74,9 @@ static void fixup_rt_mutex_waiters(struct rt_mutex *lock)
>> >   * set up.
>> >   */
>> >  #ifndef CONFIG_DEBUG_RT_MUTEXES
>> > -# define rt_mutex_cmpxchg(l,c,n)   (cmpxchg(&l->owner, c, n) == c)
>> > +# define rt_mutex_cmpxchg(l,c,n) \
>> > +   (l->owner == c && cmpxchg(&l->owner, c, n) == c)
>>
>> Errm. How does that improve stuff like rt_mutex_lock() ?
>
> It just avoids a cmpxchg in the fastpath when locked, at the cost of an
> extra test when unlocked. CCAS techniques have been proven to boost some
> workloads for both rwsems and mutexes. That's all.

The CCAS technique was typically used in the slow paths for those
other locks, where the chance of the operation returning false is
higher.

The rt_mutex_cmpxchg is used in places such as the rt_mutex fastlocks. We
might not want to add extra cost to the fast paths, particularly when
the rt_mutex_cmpxchg calls are marked "likely" in those cases.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/4] locking/rtmutex: Use cmp-cmpxchg
  2015-06-15 18:34       ` Jason Low
@ 2015-06-15 19:37         ` Davidlohr Bueso
  2015-06-16  1:00           ` Jason Low
  0 siblings, 1 reply; 48+ messages in thread
From: Davidlohr Bueso @ 2015-06-15 19:37 UTC (permalink / raw)
  To: Jason Low
  Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Mike Galbraith, Paul E. McKenney, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List

On Mon, 2015-06-15 at 11:34 -0700, Jason Low wrote:
> The CCAS technique was typically used in the slow paths for those
> other locks, where the chance of the operation returning false is
> higher.

That is true. Although I really want to use it in patch 4, I guess I
could move the check in there, and thus avoid having it in the fastpath.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/4] locking/rtmutex: Use cmp-cmpxchg
  2015-06-15 19:37         ` Davidlohr Bueso
@ 2015-06-16  1:00           ` Jason Low
  0 siblings, 0 replies; 48+ messages in thread
From: Jason Low @ 2015-06-16  1:00 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Jason Low, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Steven Rostedt, Mike Galbraith, Paul E. McKenney,
	Sebastian Andrzej Siewior, Linux Kernel Mailing List

On Mon, Jun 15, 2015 at 12:37 PM, Davidlohr Bueso <dave@stgolabs.net> wrote:
> On Mon, 2015-06-15 at 11:34 -0700, Jason Low wrote:
>> The CCAS technique was typically used in the slow paths for those
>> other locks, where the chance of the operation returning false is
>> higher.
>
> That is true. Although I really want to use it in patch 4, I guess I
> could move the check in there, and thus avoid having it in the fastpath.

I agree, that way, we only have the extra check in cases where it is
beneficial, like in the optimistic spin loop.
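
Something like the following illustrates the idea (hypothetical sketch,
made-up helper name, ignoring the CONFIG_DEBUG_RT_MUTEXES variant; the
plain load filters out most cmpxchg attempts that would fail anyway
while the lock is held):

static inline bool rt_mutex_ccas_trylock(struct rt_mutex *lock)
{
	/* Cheap check first: skip the atomic if the lock is visibly owned. */
	if (READ_ONCE(lock->owner))
		return false;

	return cmpxchg(&lock->owner, NULL, current) == NULL;
}

Used from an optimistic spin loop, that avoids bouncing the lock's
cache line with doomed atomics while somebody else holds it.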

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH] futex: lower the lock contention on the HB lock during wake up
  2015-05-19 17:24 ` [PATCH 1/4] locking/rtmutex: Implement lockless top-waiter wakeup Davidlohr Bueso
  2015-06-05 12:35   ` Thomas Gleixner
@ 2015-06-16 19:29   ` Sebastian Andrzej Siewior
  2015-06-16 19:50     ` Davidlohr Bueso
  2015-06-18 20:30   ` [tip:sched/core] locking/rtmutex: Implement lockless top-waiter wakeup tip-bot for Davidlohr Bueso
  2 siblings, 1 reply; 48+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-06-16 19:29 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Mike Galbraith, Paul E. McKenney, linux-kernel, Davidlohr Bueso

wake_futex_pi() wakes the task before releasing the hash bucket lock
(HB). The first thing the woken-up task usually does is to acquire the
lock, which requires the HB lock. On SMP systems this leads to blocking
on the HB lock, which is released by the owner shortly after.
This patch rearranges the unlock path by first releasing the HB lock and
then waking up the task.

[bigeasy: redo ontop of lockless wake-queues]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---

Davidlohr, would it work for you to replace that patch #1 from your
series with this one?

 kernel/futex.c                  | 32 ++++++++++++++++++++---
 kernel/locking/rtmutex.c        | 56 +++++++++++++++++++++++++++++------------
 kernel/locking/rtmutex_common.h |  3 +++
 3 files changed, 72 insertions(+), 19 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index ea6ca0bca525..026594f02bd2 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1117,12 +1117,15 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
 	q->lock_ptr = NULL;
 }
 
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
+static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
+			 struct futex_hash_bucket *hb)
 {
 	struct task_struct *new_owner;
 	struct futex_pi_state *pi_state = this->pi_state;
 	u32 uninitialized_var(curval), newval;
 	int ret = 0;
+	WAKE_Q(wake_q);
+	bool deboost;
 
 	if (!pi_state)
 		return -EINVAL;
@@ -1173,7 +1176,19 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
 	raw_spin_unlock_irq(&new_owner->pi_lock);
 
 	raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
-	rt_mutex_unlock(&pi_state->pi_mutex);
+
+	deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
+
+	/*
+	 * First unlock HB so the waiter does not spin on it once he got woken
+	 * up. Second wake up the waiter before the priority is adjusted. If we
+	 * deboost first (and lose our higher priority), then the task might get
+	 * scheduled away before the wake up can take place.
+	 */
+	spin_unlock(&hb->lock);
+	wake_up_q(&wake_q);
+	if (deboost)
+		rt_mutex_adjust_prio(current);
 
 	return 0;
 }
@@ -2410,13 +2425,23 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 	 */
 	match = futex_top_waiter(hb, &key);
 	if (match) {
-		ret = wake_futex_pi(uaddr, uval, match);
+		ret = wake_futex_pi(uaddr, uval, match, hb);
+		/*
+		 * In case of success wake_futex_pi dropped the hash
+		 * bucket lock.
+		 */
+		if (!ret)
+			goto out_putkey;
 		/*
 		 * The atomic access to the futex value generated a
 		 * pagefault, so retry the user-access and the wakeup:
 		 */
 		if (ret == -EFAULT)
 			goto pi_faulted;
+		/*
+		 * wake_futex_pi has detected invalid state. Tell user
+		 * space.
+		 */
 		goto out_unlock;
 	}
 
@@ -2437,6 +2462,7 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 
 out_unlock:
 	spin_unlock(&hb->lock);
+out_putkey:
 	put_futex_key(&key);
 	return ret;
 
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 36573e96a477..0c8f43baead6 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -300,7 +300,7 @@ static void __rt_mutex_adjust_prio(struct task_struct *task)
  * of task. We do not use the spin_xx_mutex() variants here as we are
  * outside of the debug path.)
  */
-static void rt_mutex_adjust_prio(struct task_struct *task)
+void rt_mutex_adjust_prio(struct task_struct *task)
 {
 	unsigned long flags;
 
@@ -957,12 +957,12 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 /*
  * Wake up the next waiter on the lock.
  *
- * Remove the top waiter from the current tasks pi waiter list and
- * wake it up.
+ * Remove the top waiter from the current tasks pi waiter list, put him on the
+ * wake head (for later wake up).
  *
  * Called with lock->wait_lock held.
  */
-static void wakeup_next_waiter(struct rt_mutex *lock)
+static void wakeup_next_waiter(struct rt_mutex *lock, struct wake_q_head *wqh)
 {
 	struct rt_mutex_waiter *waiter;
 	unsigned long flags;
@@ -996,7 +996,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock)
 	 * long as we hold lock->wait_lock. The waiter task needs to
 	 * acquire it in order to dequeue the waiter.
 	 */
-	wake_up_process(waiter->task);
+	wake_q_add(wqh, waiter->task);
 }
 
 /*
@@ -1250,10 +1250,11 @@ static inline int rt_mutex_slowtrylock(struct rt_mutex *lock)
 }
 
 /*
- * Slow path to release a rt-mutex:
+ * Slow path to release a rt-mutex.
+ * Return whether the current task needs to undo a potential priority boosting.
  */
-static void __sched
-rt_mutex_slowunlock(struct rt_mutex *lock)
+static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock,
+					struct wake_q_head *wqh)
 {
 	raw_spin_lock(&lock->wait_lock);
 
@@ -1295,7 +1296,7 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
 	while (!rt_mutex_has_waiters(lock)) {
 		/* Drops lock->wait_lock ! */
 		if (unlock_rt_mutex_safe(lock) == true)
-			return;
+			return false;
 		/* Relock the rtmutex and try again */
 		raw_spin_lock(&lock->wait_lock);
 	}
@@ -1304,12 +1305,12 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
 	 * The wakeup next waiter path does not suffer from the above
 	 * race. See the comments there.
 	 */
-	wakeup_next_waiter(lock);
+	wakeup_next_waiter(lock, wqh);
 
 	raw_spin_unlock(&lock->wait_lock);
 
-	/* Undo pi boosting if necessary: */
-	rt_mutex_adjust_prio(current);
+	/* check PI boosting */
+	return true;
 }
 
 /*
@@ -1360,12 +1361,18 @@ rt_mutex_fasttrylock(struct rt_mutex *lock,
 
 static inline void
 rt_mutex_fastunlock(struct rt_mutex *lock,
-		    void (*slowfn)(struct rt_mutex *lock))
+		    bool (*slowfn)(struct rt_mutex *lock,
+				   struct wake_q_head *wqh))
 {
-	if (likely(rt_mutex_cmpxchg(lock, current, NULL)))
+	WAKE_Q(wake_q);
+
+	if (likely(rt_mutex_cmpxchg(lock, current, NULL))) {
 		rt_mutex_deadlock_account_unlock(current);
-	else
-		slowfn(lock);
+
+	} else if (slowfn(lock, &wake_q)) {
+		/* Undo pi boosting if necessary: */
+		rt_mutex_adjust_prio(current);
+	}
 }
 
 /**
@@ -1467,6 +1474,23 @@ void __sched rt_mutex_unlock(struct rt_mutex *lock)
 EXPORT_SYMBOL_GPL(rt_mutex_unlock);
 
 /**
+ * rt_mutex_futex_unlock - Futex variant of rt_mutex_unlock
+ * @lock: the rt_mutex to be unlocked
+ *
+ * Returns: true/false indicating whether priority adjustment is
+ * required or not.
+ */
+bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock,
+				   struct wake_q_head *wqh)
+{
+	if (likely(rt_mutex_cmpxchg(lock, current, NULL))) {
+		rt_mutex_deadlock_account_unlock(current);
+		return false;
+	}
+	return rt_mutex_slowunlock(lock, wqh);
+}
+
+/**
  * rt_mutex_destroy - mark a mutex unusable
  * @lock: the mutex to be destroyed
  *
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 855212501407..7844f8f0e639 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -131,6 +131,9 @@ extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
 				      struct hrtimer_sleeper *to,
 				      struct rt_mutex_waiter *waiter);
 extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
+extern bool rt_mutex_futex_unlock(struct rt_mutex *lock,
+				  struct wake_q_head *wqh);
+extern void rt_mutex_adjust_prio(struct task_struct *task);
 
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 # include "rtmutex-debug.h"
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH] futex: lower the lock contention on the HB lock during wake up
  2015-06-16 19:29   ` [PATCH] futex: lower the lock contention on the HB lock during wake up Sebastian Andrzej Siewior
@ 2015-06-16 19:50     ` Davidlohr Bueso
  2015-06-17  8:33       ` [PATCH v2] " Sebastian Andrzej Siewior
  0 siblings, 1 reply; 48+ messages in thread
From: Davidlohr Bueso @ 2015-06-16 19:50 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Mike Galbraith, Paul E. McKenney, linux-kernel

On Tue, 2015-06-16 at 21:29 +0200, Sebastian Andrzej Siewior wrote:

> Davidlohr, would it work for you to replace that patch #1 from your
> series with this one?

I prefer having two separate patches, thus keeping their own changelog
for the change justification.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v2] futex: lower the lock contention on the HB lock during wake up
  2015-06-16 19:50     ` Davidlohr Bueso
@ 2015-06-17  8:33       ` Sebastian Andrzej Siewior
  2015-06-17 14:17         ` Mike Galbraith
                           ` (3 more replies)
  0 siblings, 4 replies; 48+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-06-17  8:33 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Mike Galbraith, Paul E. McKenney, linux-kernel

wake_futex_pi() wakes the task before releasing the hash bucket lock
(HB). The first thing the woken-up task usually does is to acquire the
lock, which requires the HB lock. On SMP systems this leads to blocking
on the HB lock, which is released by the owner shortly after.
This patch rearranges the unlock path by first releasing the HB lock and
then waking up the task.

[bigeasy: redo ontop of lockless wake-queues]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
* Davidlohr Bueso | 2015-06-16 12:50:26 [-0700]:

>I prefer having two separate patches, thus keeping their own changelog
>for the change justification.

okay, here it is on top of #1.

 kernel/futex.c                  | 32 +++++++++++++++++++++++---
 kernel/locking/rtmutex.c        | 51 +++++++++++++++++++++++++++++------------
 kernel/locking/rtmutex_common.h |  3 +++
 3 files changed, 68 insertions(+), 18 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index ea6ca0bca525..026594f02bd2 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1117,12 +1117,15 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
 	q->lock_ptr = NULL;
 }
 
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
+static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
+			 struct futex_hash_bucket *hb)
 {
 	struct task_struct *new_owner;
 	struct futex_pi_state *pi_state = this->pi_state;
 	u32 uninitialized_var(curval), newval;
 	int ret = 0;
+	WAKE_Q(wake_q);
+	bool deboost;
 
 	if (!pi_state)
 		return -EINVAL;
@@ -1173,7 +1176,19 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
 	raw_spin_unlock_irq(&new_owner->pi_lock);
 
 	raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
-	rt_mutex_unlock(&pi_state->pi_mutex);
+
+	deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
+
+	/*
+	 * First unlock HB so the waiter does not spin on it once he got woken
+	 * up. Second wake up the waiter before the priority is adjusted. If we
+	 * deboost first (and lose our higher priority), then the task might get
+	 * scheduled away before the wake up can take place.
+	 */
+	spin_unlock(&hb->lock);
+	wake_up_q(&wake_q);
+	if (deboost)
+		rt_mutex_adjust_prio(current);
 
 	return 0;
 }
@@ -2410,13 +2425,23 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 	 */
 	match = futex_top_waiter(hb, &key);
 	if (match) {
-		ret = wake_futex_pi(uaddr, uval, match);
+		ret = wake_futex_pi(uaddr, uval, match, hb);
+		/*
+		 * In case of success wake_futex_pi dropped the hash
+		 * bucket lock.
+		 */
+		if (!ret)
+			goto out_putkey;
 		/*
 		 * The atomic access to the futex value generated a
 		 * pagefault, so retry the user-access and the wakeup:
 		 */
 		if (ret == -EFAULT)
 			goto pi_faulted;
+		/*
+		 * wake_futex_pi has detected invalid state. Tell user
+		 * space.
+		 */
 		goto out_unlock;
 	}
 
@@ -2437,6 +2462,7 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 
 out_unlock:
 	spin_unlock(&hb->lock);
+out_putkey:
 	put_futex_key(&key);
 	return ret;
 
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 74188d83b065..53fab686a1c2 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -300,7 +300,7 @@ static void __rt_mutex_adjust_prio(struct task_struct *task)
  * of task. We do not use the spin_xx_mutex() variants here as we are
  * outside of the debug path.)
  */
-static void rt_mutex_adjust_prio(struct task_struct *task)
+void rt_mutex_adjust_prio(struct task_struct *task)
 {
 	unsigned long flags;
 
@@ -1244,13 +1244,12 @@ static inline int rt_mutex_slowtrylock(struct rt_mutex *lock)
 }
 
 /*
- * Slow path to release a rt-mutex:
+ * Slow path to release a rt-mutex.
+ * Return whether the current task needs to undo a potential priority boosting.
  */
-static void __sched
-rt_mutex_slowunlock(struct rt_mutex *lock)
+static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock,
+					struct wake_q_head *wake_q)
 {
-	WAKE_Q(wake_q);
-
 	raw_spin_lock(&lock->wait_lock);
 
 	debug_rt_mutex_unlock(lock);
@@ -1291,7 +1290,7 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
 	while (!rt_mutex_has_waiters(lock)) {
 		/* Drops lock->wait_lock ! */
 		if (unlock_rt_mutex_safe(lock) == true)
-			return;
+			return false;
 		/* Relock the rtmutex and try again */
 		raw_spin_lock(&lock->wait_lock);
 	}
@@ -1302,13 +1301,12 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
 	 *
 	 * Queue the next waiter for wakeup once we release the wait_lock.
 	 */
-	mark_wakeup_next_waiter(&wake_q, lock);
+	mark_wakeup_next_waiter(wake_q, lock);
 
 	raw_spin_unlock(&lock->wait_lock);
-	wake_up_q(&wake_q);
 
-	/* Undo pi boosting if necessary: */
-	rt_mutex_adjust_prio(current);
+	/* check PI boosting */
+	return true;
 }
 
 /*
@@ -1359,12 +1357,18 @@ rt_mutex_fasttrylock(struct rt_mutex *lock,
 
 static inline void
 rt_mutex_fastunlock(struct rt_mutex *lock,
-		    void (*slowfn)(struct rt_mutex *lock))
+		    bool (*slowfn)(struct rt_mutex *lock,
+				   struct wake_q_head *wqh))
 {
-	if (likely(rt_mutex_cmpxchg(lock, current, NULL)))
+	WAKE_Q(wake_q);
+
+	if (likely(rt_mutex_cmpxchg(lock, current, NULL))) {
 		rt_mutex_deadlock_account_unlock(current);
-	else
-		slowfn(lock);
+
+	} else if (slowfn(lock, &wake_q)) {
+		/* Undo pi boosting if necessary: */
+		rt_mutex_adjust_prio(current);
+	}
 }
 
 /**
@@ -1466,6 +1470,23 @@ void __sched rt_mutex_unlock(struct rt_mutex *lock)
 EXPORT_SYMBOL_GPL(rt_mutex_unlock);
 
 /**
+ * rt_mutex_futex_unlock - Futex variant of rt_mutex_unlock
+ * @lock: the rt_mutex to be unlocked
+ *
+ * Returns: true/false indicating whether priority adjustment is
+ * required or not.
+ */
+bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock,
+				   struct wake_q_head *wqh)
+{
+	if (likely(rt_mutex_cmpxchg(lock, current, NULL))) {
+		rt_mutex_deadlock_account_unlock(current);
+		return false;
+	}
+	return rt_mutex_slowunlock(lock, wqh);
+}
+
+/**
  * rt_mutex_destroy - mark a mutex unusable
  * @lock: the mutex to be destroyed
  *
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 855212501407..7844f8f0e639 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -131,6 +131,9 @@ extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
 				      struct hrtimer_sleeper *to,
 				      struct rt_mutex_waiter *waiter);
 extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
+extern bool rt_mutex_futex_unlock(struct rt_mutex *lock,
+				  struct wake_q_head *wqh);
+extern void rt_mutex_adjust_prio(struct task_struct *task);
 
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 # include "rtmutex-debug.h"
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] futex: lower the lock contention on the HB lock during wake up
  2015-06-17  8:33       ` [PATCH v2] " Sebastian Andrzej Siewior
@ 2015-06-17 14:17         ` Mike Galbraith
  2015-06-17 14:28           ` Sebastian Andrzej Siewior
  2015-06-18 20:30         ` [tip:sched/core] futex: Lower " tip-bot for Sebastian Andrzej Siewior
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 48+ messages in thread
From: Mike Galbraith @ 2015-06-17 14:17 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Davidlohr Bueso, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Steven Rostedt, Paul E. McKenney, linux-kernel

On Wed, 2015-06-17 at 10:33 +0200, Sebastian Andrzej Siewior wrote:
> wake_futex_pi() wakes the task before releasing the hash bucket lock
> (HB). The first thing the woken up task usually does is to acquire the
> lock which requires the HB lock. On SMP Systems this leads to blocking
> on the HB lock which is released by the owner shortly after.
> This patch rearranges the unlock path by first releasing the HB lock and
> then waking up the task.
> 
> [bigeasy: redo ontop of lockless wake-queues]
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

4.1-rc8-rt4 contains this via 4.0-rt4, and seems fine on my 64 core
DL980.  I ran a few iterations of futextests and stockfish, then mixed
two loops of futextest at different rt prios, with stockfish also rt,
and ltplight tossed in as... crack filler.  Box is still doing that,
and is way too busy, but not griping about it.

	-Mike


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] futex: lower the lock contention on the HB lock during wake up
  2015-06-17 14:17         ` Mike Galbraith
@ 2015-06-17 14:28           ` Sebastian Andrzej Siewior
  2015-06-17 14:31             ` Mike Galbraith
  2015-06-21  4:35             ` Mike Galbraith
  0 siblings, 2 replies; 48+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-06-17 14:28 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Davidlohr Bueso, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Steven Rostedt, Paul E. McKenney, linux-kernel

On 06/17/2015 04:17 PM, Mike Galbraith wrote:
> On Wed, 2015-06-17 at 10:33 +0200, Sebastian Andrzej Siewior wrote:
>> wake_futex_pi() wakes the task before releasing the hash bucket lock
>> (HB). The first thing the woken up task usually does is to acquire the
>> lock which requires the HB lock. On SMP Systems this leads to blocking
>> on the HB lock which is released by the owner shortly after.
>> This patch rearranges the unlock path by first releasing the HB lock and
>> then waking up the task.
>>
>> [bigeasy: redo ontop of lockless wake-queues]
>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> 
> 4.1-rc8-rt4 contains this via 4.0-rt4, and seems fine on my 64 core
> DL980.  I ran a few iterations of futextests and stockfish, then mixed
> two loops of futextest at different rt prios, with stockfish also rt,
> and ltplight as tossed in as... crack filler.  Box is still doing that,
> is way too busy, but not griping about it.  

There are two patches mostly doing the same thing. The patch posted
here is a redo on top of "lockless wake-queues". It does hb-unlock,
wakeup, de-boost. The patch merged into -RT is the original approach,
which does not use "lockless wake-queues" and performs wakeup,
hb-unlock, de-boost.

I plan to get the final solution into -RT once it hits upstream.

> 
> 	-Mike
> 

Sebastian

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] futex: lower the lock contention on the HB lock during wake up
  2015-06-17 14:28           ` Sebastian Andrzej Siewior
@ 2015-06-17 14:31             ` Mike Galbraith
  2015-06-21  4:35             ` Mike Galbraith
  1 sibling, 0 replies; 48+ messages in thread
From: Mike Galbraith @ 2015-06-17 14:31 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Davidlohr Bueso, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Steven Rostedt, Paul E. McKenney, linux-kernel

On Wed, 2015-06-17 at 16:28 +0200, Sebastian Andrzej Siewior wrote:
> On 06/17/2015 04:17 PM, Mike Galbraith wrote:
> > On Wed, 2015-06-17 at 10:33 +0200, Sebastian Andrzej Siewior wrote:
> >> wake_futex_pi() wakes the task before releasing the hash bucket lock
> >> (HB). The first thing the woken up task usually does is to acquire the
> >> lock which requires the HB lock. On SMP Systems this leads to blocking
> >> on the HB lock which is released by the owner shortly after.
> >> This patch rearranges the unlock path by first releasing the HB lock and
> >> then waking up the task.
> >>
> >> [bigeasy: redo ontop of lockless wake-queues]
> >> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> >> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > 
> > 4.1-rc8-rt4 contains this via 4.0-rt4, and seems fine on my 64 core
> > DL980.  I ran a few iterations of futextests and stockfish, then mixed
> > two loops of futextest at different rt prios, with stockfish also rt,
> > and ltplight as tossed in as... crack filler.  Box is still doing that,
> > is way too busy, but not griping about it.  
> 
> There are two patches mostly doing the same thing. The patch posted
> here is a redo ontop of "lockless wake-queues". It does hb-unlock,
> wakeup, de-boost. The patch merged into -RT is the original approach
> not using "lockless wake-queues" and performing wakeup, hb-unlock,
> de-boost.
> 
> I plan to get into -RT the final solution once it hits upstream.

OK, a glance wasn't enough.  Guess I can let tired ole box rest.

	-Mike


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [tip:sched/core] locking/rtmutex: Implement lockless top-waiter wakeup
  2015-05-19 17:24 ` [PATCH 1/4] locking/rtmutex: Implement lockless top-waiter wakeup Davidlohr Bueso
  2015-06-05 12:35   ` Thomas Gleixner
  2015-06-16 19:29   ` [PATCH] futex: lower the lock contention on the HB lock during wake up Sebastian Andrzej Siewior
@ 2015-06-18 20:30   ` tip-bot for Davidlohr Bueso
  2 siblings, 0 replies; 48+ messages in thread
From: tip-bot for Davidlohr Bueso @ 2015-06-18 20:30 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: umgwanakikbuti, rostedt, hpa, tglx, bigeasy, linux-kernel,
	peterz, paulmck, dave, dbueso, mingo

Commit-ID:  45ab4effc3bee6f8a5cb05652b7bb895ec5b6a7a
Gitweb:     http://git.kernel.org/tip/45ab4effc3bee6f8a5cb05652b7bb895ec5b6a7a
Author:     Davidlohr Bueso <dave@stgolabs.net>
AuthorDate: Tue, 19 May 2015 10:24:55 -0700
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Thu, 18 Jun 2015 22:27:46 +0200

locking/rtmutex: Implement lockless top-waiter wakeup

Mark the task for later wakeup after the wait_lock has been released.
This way, once the next task is awoken, it will have a better chance
of finding the wait_lock free when it continues executing in
__rt_mutex_slowlock() when trying to acquire the rtmutex, calling
try_to_take_rt_mutex(). In contended scenarios, other tasks attempting
to take the lock may acquire it first, right after the wait_lock is released,
but (a) this can also occur with the current code, as it relies on the
spinlock fairness, and (b) we are dealing with the top-waiter anyway,
so it will always take the lock next.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1432056298-18738-2-git-send-email-dave@stgolabs.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/locking/rtmutex.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index b025295..44ee8f8 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -955,14 +955,13 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 }
 
 /*
- * Wake up the next waiter on the lock.
- *
  * Remove the top waiter from the current tasks pi waiter list and
- * wake it up.
+ * queue it up.
  *
  * Called with lock->wait_lock held.
  */
-static void wakeup_next_waiter(struct rt_mutex *lock)
+static void mark_wakeup_next_waiter(struct wake_q_head *wake_q,
+				    struct rt_mutex *lock)
 {
 	struct rt_mutex_waiter *waiter;
 	unsigned long flags;
@@ -991,12 +990,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock)
 
 	raw_spin_unlock_irqrestore(&current->pi_lock, flags);
 
-	/*
-	 * It's safe to dereference waiter as it cannot go away as
-	 * long as we hold lock->wait_lock. The waiter task needs to
-	 * acquire it in order to dequeue the waiter.
-	 */
-	wake_up_process(waiter->task);
+	wake_q_add(wake_q, waiter->task);
 }
 
 /*
@@ -1258,6 +1252,8 @@ static inline int rt_mutex_slowtrylock(struct rt_mutex *lock)
 static void __sched
 rt_mutex_slowunlock(struct rt_mutex *lock)
 {
+	WAKE_Q(wake_q);
+
 	raw_spin_lock(&lock->wait_lock);
 
 	debug_rt_mutex_unlock(lock);
@@ -1306,10 +1302,13 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
 	/*
 	 * The wakeup next waiter path does not suffer from the above
 	 * race. See the comments there.
+	 *
+	 * Queue the next waiter for wakeup once we release the wait_lock.
 	 */
-	wakeup_next_waiter(lock);
+	mark_wakeup_next_waiter(&wake_q, lock);
 
 	raw_spin_unlock(&lock->wait_lock);
+	wake_up_q(&wake_q);
 
 	/* Undo pi boosting if necessary: */
 	rt_mutex_adjust_prio(current);
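
For context, the wake_q interface used above follows this pattern (a
minimal illustrative sketch with a made-up function name, not code from
the patch): wakeups are queued while the lock is held and issued only
after it has been dropped.

static void unlock_and_wake_top_waiter(struct rt_mutex *lock)
{
	WAKE_Q(wake_q);

	raw_spin_lock(&lock->wait_lock);
	/* Pick the task(s) to wake while still protected by wait_lock. */
	if (rt_mutex_has_waiters(lock))
		wake_q_add(&wake_q, rt_mutex_top_waiter(lock)->task);
	raw_spin_unlock(&lock->wait_lock);

	/* The actual wakeups happen outside the critical section. */
	wake_up_q(&wake_q);
}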

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [tip:sched/core] futex: Lower the lock contention on the HB lock during wake up
  2015-06-17  8:33       ` [PATCH v2] " Sebastian Andrzej Siewior
  2015-06-17 14:17         ` Mike Galbraith
@ 2015-06-18 20:30         ` tip-bot for Sebastian Andrzej Siewior
  2015-06-19 17:51         ` [PATCH v2] futex: lower " Kevin Hilman
  2015-06-19 19:33         ` [tip:sched/locking] futex: Lower " tip-bot for Sebastian Andrzej Siewior
  3 siblings, 0 replies; 48+ messages in thread
From: tip-bot for Sebastian Andrzej Siewior @ 2015-06-18 20:30 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: rostedt, tglx, mingo, hpa, bigeasy, dave, peterz, linux-kernel,
	umgwanakikbuti, paulmck

Commit-ID:  881bd58d6e9eba4240b9dbc49fdc03a3374d7508
Gitweb:     http://git.kernel.org/tip/881bd58d6e9eba4240b9dbc49fdc03a3374d7508
Author:     Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate: Wed, 17 Jun 2015 10:33:50 +0200
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Thu, 18 Jun 2015 22:27:46 +0200

futex: Lower the lock contention on the HB lock during wake up

wake_futex_pi() wakes the task before releasing the hash bucket lock
(HB). The first thing the woken-up task usually does is to acquire the
lock, which requires the HB lock. On SMP systems this leads to blocking
on the HB lock, which is released by the owner shortly after.
This patch rearranges the unlock path by first releasing the HB lock and
then waking up the task.

Originally-from: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Link: http://lkml.kernel.org/r/20150617083350.GA2433@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/futex.c                  | 32 +++++++++++++++++++++++---
 kernel/locking/rtmutex.c        | 51 +++++++++++++++++++++++++++++------------
 kernel/locking/rtmutex_common.h |  3 +++
 3 files changed, 68 insertions(+), 18 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index f9984c3..a0cf6fa 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1117,11 +1117,14 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
 	q->lock_ptr = NULL;
 }
 
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
+static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
+			 struct futex_hash_bucket *hb)
 {
 	struct task_struct *new_owner;
 	struct futex_pi_state *pi_state = this->pi_state;
 	u32 uninitialized_var(curval), newval;
+	WAKE_Q(wake_q);
+	bool deboost;
 	int ret = 0;
 
 	if (!pi_state)
@@ -1173,7 +1176,19 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
 	raw_spin_unlock_irq(&new_owner->pi_lock);
 
 	raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
-	rt_mutex_unlock(&pi_state->pi_mutex);
+
+	deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
+
+	/*
+	 * First unlock HB so the waiter does not spin on it once he got woken
+	 * up. Second wake up the waiter before the priority is adjusted. If we
+	 * deboost first (and lose our higher priority), then the task might get
+	 * scheduled away before the wake up can take place.
+	 */
+	spin_unlock(&hb->lock);
+	wake_up_q(&wake_q);
+	if (deboost)
+		rt_mutex_adjust_prio(current);
 
 	return 0;
 }
@@ -2413,13 +2428,23 @@ retry:
 	 */
 	match = futex_top_waiter(hb, &key);
 	if (match) {
-		ret = wake_futex_pi(uaddr, uval, match);
+		ret = wake_futex_pi(uaddr, uval, match, hb);
+		/*
+		 * In case of success wake_futex_pi dropped the hash
+		 * bucket lock.
+		 */
+		if (!ret)
+			goto out_putkey;
 		/*
 		 * The atomic access to the futex value generated a
 		 * pagefault, so retry the user-access and the wakeup:
 		 */
 		if (ret == -EFAULT)
 			goto pi_faulted;
+		/*
+		 * wake_futex_pi has detected invalid state. Tell user
+		 * space.
+		 */
 		goto out_unlock;
 	}
 
@@ -2440,6 +2465,7 @@ retry:
 
 out_unlock:
 	spin_unlock(&hb->lock);
+out_putkey:
 	put_futex_key(&key);
 	return ret;
 
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 44ee8f8..1130130 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -300,7 +300,7 @@ static void __rt_mutex_adjust_prio(struct task_struct *task)
  * of task. We do not use the spin_xx_mutex() variants here as we are
  * outside of the debug path.)
  */
-static void rt_mutex_adjust_prio(struct task_struct *task)
+void rt_mutex_adjust_prio(struct task_struct *task)
 {
 	unsigned long flags;
 
@@ -1247,13 +1247,12 @@ static inline int rt_mutex_slowtrylock(struct rt_mutex *lock)
 }
 
 /*
- * Slow path to release a rt-mutex:
+ * Slow path to release a rt-mutex.
+ * Return whether the current task needs to undo a potential priority boosting.
  */
-static void __sched
-rt_mutex_slowunlock(struct rt_mutex *lock)
+static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock,
+					struct wake_q_head *wake_q)
 {
-	WAKE_Q(wake_q);
-
 	raw_spin_lock(&lock->wait_lock);
 
 	debug_rt_mutex_unlock(lock);
@@ -1294,7 +1293,7 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
 	while (!rt_mutex_has_waiters(lock)) {
 		/* Drops lock->wait_lock ! */
 		if (unlock_rt_mutex_safe(lock) == true)
-			return;
+			return false;
 		/* Relock the rtmutex and try again */
 		raw_spin_lock(&lock->wait_lock);
 	}
@@ -1305,13 +1304,12 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
 	 *
 	 * Queue the next waiter for wakeup once we release the wait_lock.
 	 */
-	mark_wakeup_next_waiter(&wake_q, lock);
+	mark_wakeup_next_waiter(wake_q, lock);
 
 	raw_spin_unlock(&lock->wait_lock);
-	wake_up_q(&wake_q);
 
-	/* Undo pi boosting if necessary: */
-	rt_mutex_adjust_prio(current);
+	/* check PI boosting */
+	return true;
 }
 
 /*
@@ -1362,12 +1360,18 @@ rt_mutex_fasttrylock(struct rt_mutex *lock,
 
 static inline void
 rt_mutex_fastunlock(struct rt_mutex *lock,
-		    void (*slowfn)(struct rt_mutex *lock))
+		    bool (*slowfn)(struct rt_mutex *lock,
+				   struct wake_q_head *wqh))
 {
-	if (likely(rt_mutex_cmpxchg(lock, current, NULL)))
+	WAKE_Q(wake_q);
+
+	if (likely(rt_mutex_cmpxchg(lock, current, NULL))) {
 		rt_mutex_deadlock_account_unlock(current);
-	else
-		slowfn(lock);
+
+	} else if (slowfn(lock, &wake_q)) {
+		/* Undo pi boosting if necessary: */
+		rt_mutex_adjust_prio(current);
+	}
 }
 
 /**
@@ -1462,6 +1466,23 @@ void __sched rt_mutex_unlock(struct rt_mutex *lock)
 EXPORT_SYMBOL_GPL(rt_mutex_unlock);
 
 /**
+ * rt_mutex_futex_unlock - Futex variant of rt_mutex_unlock
+ * @lock: the rt_mutex to be unlocked
+ *
+ * Returns: true/false indicating whether priority adjustment is
+ * required or not.
+ */
+bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock,
+				   struct wake_q_head *wqh)
+{
+	if (likely(rt_mutex_cmpxchg(lock, current, NULL))) {
+		rt_mutex_deadlock_account_unlock(current);
+		return false;
+	}
+	return rt_mutex_slowunlock(lock, wqh);
+}
+
+/**
  * rt_mutex_destroy - mark a mutex unusable
  * @lock: the mutex to be destroyed
  *
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 8552125..7844f8f 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -131,6 +131,9 @@ extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
 				      struct hrtimer_sleeper *to,
 				      struct rt_mutex_waiter *waiter);
 extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
+extern bool rt_mutex_futex_unlock(struct rt_mutex *lock,
+				  struct wake_q_head *wqh);
+extern void rt_mutex_adjust_prio(struct task_struct *task);
 
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 # include "rtmutex-debug.h"

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [tip:sched/core] locking/rtmutex: Update stale plist comments
  2015-05-19 17:24 ` [PATCH 3/4] locking/rtmutex: Update stale plist comments Davidlohr Bueso
  2015-06-05 12:39   ` Thomas Gleixner
@ 2015-06-18 20:57   ` tip-bot for Davidlohr Bueso
  2015-06-19 19:33   ` [tip:sched/locking] " tip-bot for Davidlohr Bueso
  2 siblings, 0 replies; 48+ messages in thread
From: tip-bot for Davidlohr Bueso @ 2015-06-18 20:57 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: paulmck, umgwanakikbuti, rostedt, linux-kernel, dave, bigeasy,
	dbueso, peterz, mingo, hpa, tglx

Commit-ID:  f5de9f848cdfba072a0466c24891167c0c8b3be3
Gitweb:     http://git.kernel.org/tip/f5de9f848cdfba072a0466c24891167c0c8b3be3
Author:     Davidlohr Bueso <dave@stgolabs.net>
AuthorDate: Tue, 19 May 2015 10:24:57 -0700
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Thu, 18 Jun 2015 22:53:36 +0200

locking/rtmutex: Update stale plist comments

... as of fb00aca4744 (rtmutex: Turn the plist into an rb-tree) we
no longer use plists for queuing any waiters. Update stale comments.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1432056298-18738-4-git-send-email-dave@stgolabs.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/locking/rtmutex.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 1130130..5537ebc 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -624,7 +624,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 	 */
 	prerequeue_top_waiter = rt_mutex_top_waiter(lock);
 
-	/* [7] Requeue the waiter in the lock waiter list. */
+	/* [7] Requeue the waiter in the lock waiter tree. */
 	rt_mutex_dequeue(lock, waiter);
 	waiter->prio = task->prio;
 	rt_mutex_enqueue(lock, waiter);
@@ -662,7 +662,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 		/*
 		 * The waiter became the new top (highest priority)
 		 * waiter on the lock. Replace the previous top waiter
-		 * in the owner tasks pi waiters list with this waiter
+		 * in the owner tasks pi waiters tree with this waiter
 		 * and adjust the priority of the owner.
 		 */
 		rt_mutex_dequeue_pi(task, prerequeue_top_waiter);
@@ -673,7 +673,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 		/*
 		 * The waiter was the top waiter on the lock, but is
 		 * no longer the top prority waiter. Replace waiter in
-		 * the owner tasks pi waiters list with the new top
+		 * the owner tasks pi waiters tree with the new top
 		 * (highest priority) waiter and adjust the priority
 		 * of the owner.
 		 * The new top waiter is stored in @waiter so that
@@ -747,7 +747,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
  *
  * @lock:   The lock to be acquired.
  * @task:   The task which wants to acquire the lock
- * @waiter: The waiter that is queued to the lock's wait list if the
+ * @waiter: The waiter that is queued to the lock's wait tree if the
  *	    callsite called task_blocked_on_lock(), otherwise NULL
  */
 static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
@@ -782,7 +782,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 
 	/*
 	 * If @waiter != NULL, @task has already enqueued the waiter
-	 * into @lock waiter list. If @waiter == NULL then this is a
+	 * into @lock waiter tree. If @waiter == NULL then this is a
 	 * trylock attempt.
 	 */
 	if (waiter) {
@@ -795,7 +795,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 
 		/*
 		 * We can acquire the lock. Remove the waiter from the
-		 * lock waiters list.
+		 * lock waiters tree.
 		 */
 		rt_mutex_dequeue(lock, waiter);
 
@@ -827,7 +827,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 			 * No waiters. Take the lock without the
 			 * pi_lock dance.@task->pi_blocked_on is NULL
 			 * and we have no waiters to enqueue in @task
-			 * pi waiters list.
+			 * pi waiters tree.
 			 */
 			goto takeit;
 		}
@@ -844,7 +844,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 	/*
 	 * Finish the lock acquisition. @task is the new owner. If
 	 * other waiters exist we have to insert the highest priority
-	 * waiter into @task->pi_waiters list.
+	 * waiter into @task->pi_waiters tree.
 	 */
 	if (rt_mutex_has_waiters(lock))
 		rt_mutex_enqueue_pi(task, rt_mutex_top_waiter(lock));
@@ -955,7 +955,7 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 }
 
 /*
- * Remove the top waiter from the current tasks pi waiter list and
+ * Remove the top waiter from the current tasks pi waiter tree and
  * queue it up.
  *
  * Called with lock->wait_lock held.

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] futex: lower the lock contention on the HB lock during wake up
  2015-06-17  8:33       ` [PATCH v2] " Sebastian Andrzej Siewior
  2015-06-17 14:17         ` Mike Galbraith
  2015-06-18 20:30         ` [tip:sched/core] futex: Lower " tip-bot for Sebastian Andrzej Siewior
@ 2015-06-19 17:51         ` Kevin Hilman
  2015-06-19 18:54           ` Thomas Gleixner
  2015-06-19 19:33         ` [tip:sched/locking] futex: Lower " tip-bot for Sebastian Andrzej Siewior
  3 siblings, 1 reply; 48+ messages in thread
From: Kevin Hilman @ 2015-06-19 17:51 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Davidlohr Bueso, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Steven Rostedt, Mike Galbraith, Paul E. McKenney, lkml,
	Tyler Baker, Olof Johansson, Tony Lindgren, linux-omap,
	Santosh Shilimkar, Felipe Balbi, Nishanth Menon

On Wed, Jun 17, 2015 at 1:33 AM, Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
> wake_futex_pi() wakes the task before releasing the hash bucket lock
> (HB). The first thing the woken up task usually does is to acquire the
> lock which requires the HB lock. On SMP Systems this leads to blocking
> on the HB lock which is released by the owner shortly after.
> This patch rearranges the unlock path by first releasing the HB lock and
> then waking up the task.
>
> [bigeasy: redo ontop of lockless wake-queues]
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> * Davidlohr Bueso | 2015-06-16 12:50:26 [-0700]:
>
>>I prefer having two separate patches, thus keeping their own changelog
>>for the change justification.
>
> okay, here it is on top of #1.

A handful of boot test failures on ARM/OMAP were found by kernelci.org
in next-20150619[1] and were bisected down to this patch, which hit
next-20150619 in the form of commit 881bd58d6e9e (futex: Lower the
lock contention on the HB lock during wake up).  I confirmed that
reverting that patch on top of next-20150619 gets things booting again
for the affected platforms.

I haven't debugged this any further, but full boot logs are available
for the boot failures[2][3] and the linux-omap list and maintainer are
Cc'd here to help investigate further if needed.

Kevin

[1] http://kernelci.org/boot/all/job/next/kernel/next-20150619/
[2] http://storage.kernelci.org/next/next-20150619/arm-multi_v7_defconfig/lab-khilman/boot-omap5-uevm.html
[3] http://storage.kernelci.org/next/next-20150619/arm-omap2plus_defconfig/lab-tbaker/boot-omap3-beagle-xm.html

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] futex: lower the lock contention on the HB lock during wake up
  2015-06-19 17:51         ` [PATCH v2] futex: lower " Kevin Hilman
@ 2015-06-19 18:54           ` Thomas Gleixner
  2015-06-19 19:32             ` Kevin Hilman
  0 siblings, 1 reply; 48+ messages in thread
From: Thomas Gleixner @ 2015-06-19 18:54 UTC (permalink / raw)
  To: Kevin Hilman
  Cc: Sebastian Andrzej Siewior, Davidlohr Bueso, Peter Zijlstra,
	Ingo Molnar, Steven Rostedt, Mike Galbraith, Paul E. McKenney,
	lkml, Tyler Baker, Olof Johansson, Tony Lindgren, linux-omap,
	Santosh Shilimkar, Felipe Balbi, Nishanth Menon

On Fri, 19 Jun 2015, Kevin Hilman wrote:
> On Wed, Jun 17, 2015 at 1:33 AM, Sebastian Andrzej Siewior
> A handful of boot test failures on ARM/OMAP were found by kernelci.org
> in next-20150619[1] and were bisected down to this patch, which hit
> next-20150619 in the form of commit 881bd58d6e9e (futex: Lower the
> lock contention on the HB lock during wake up).  I confirmed that
> reverting that patch on top of next-20150619 gets things booting again
> for the affected platforms.
> 
> I haven't debugged this any further, but full boot logs are available
> for the boot failures[2][3] and the linux-omap list and maintainer are
> Cc'd here to help investigate further if needed.

Found it. Dunno how I missed that one. Fix below.

Thanks,

	tglx
---

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 10dbeb6fe96f..5674b073473c 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1365,9 +1365,14 @@ rt_mutex_fastunlock(struct rt_mutex *lock,
 	if (likely(rt_mutex_cmpxchg(lock, current, NULL))) {
 		rt_mutex_deadlock_account_unlock(current);
 
-	} else if (slowfn(lock, &wake_q)) {
+	} else {
+		bool deboost = slowfn(lock, &wake_q);
+
+		wake_up_q(&wake_q);
+
 		/* Undo pi boosting if necessary: */
-		rt_mutex_adjust_prio(current);
+		if (deboost)
+			rt_mutex_adjust_prio(current);
 	}
 }
 



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] futex: lower the lock contention on the HB lock during wake up
  2015-06-19 18:54           ` Thomas Gleixner
@ 2015-06-19 19:32             ` Kevin Hilman
  0 siblings, 0 replies; 48+ messages in thread
From: Kevin Hilman @ 2015-06-19 19:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Sebastian Andrzej Siewior, Davidlohr Bueso, Peter Zijlstra,
	Ingo Molnar, Steven Rostedt, Mike Galbraith, Paul E. McKenney,
	lkml, Tyler Baker, Olof Johansson, Tony Lindgren, linux-omap,
	Santosh Shilimkar, Felipe Balbi, Nishanth Menon

Thomas Gleixner <tglx@linutronix.de> writes:

> On Fri, 19 Jun 2015, Kevin Hilman wrote:
>> On Wed, Jun 17, 2015 at 1:33 AM, Sebastian Andrzej Siewior
>> A handful of boot test failures on ARM/OMAP were found by kernelci.org
>> in next-20150619[1] and were bisected down to this patch, which hit
>> next-20150619 in the form of commit 881bd58d6e9e (futex: Lower the
>> lock contention on the HB lock during wake up).  I confirmed that
>> reverting that patch on top of next-20150619 gets things booting again
>> for the affected platforms.
>> 
>> I haven't debugged this any further, but full boot logs are available
>> for the boot failures[2][3] and the linux-omap list and maintainer are
>> Cc'd here to help investigate further if needed.
>
> Found it. Dunno, how I missed that one. Fix below.
>

Yup, that fix on top of next-20150619 gets the two OMAP platforms
booting again.

Tested-by: Kevin Hilman <khilman@linaro.org>

Kevin

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [tip:sched/locking] futex: Lower the lock contention on the HB lock during wake up
  2015-06-17  8:33       ` [PATCH v2] " Sebastian Andrzej Siewior
                           ` (2 preceding siblings ...)
  2015-06-19 17:51         ` [PATCH v2] futex: lower " Kevin Hilman
@ 2015-06-19 19:33         ` tip-bot for Sebastian Andrzej Siewior
  3 siblings, 0 replies; 48+ messages in thread
From: tip-bot for Sebastian Andrzej Siewior @ 2015-06-19 19:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: umgwanakikbuti, tglx, linux-kernel, paulmck, bigeasy, rostedt,
	peterz, mingo, hpa, dave

Commit-ID:  802ab58da74bb49ab348d2872190ef26ddc1a3e0
Gitweb:     http://git.kernel.org/tip/802ab58da74bb49ab348d2872190ef26ddc1a3e0
Author:     Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate: Wed, 17 Jun 2015 10:33:50 +0200
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 19 Jun 2015 21:26:38 +0200

futex: Lower the lock contention on the HB lock during wake up

wake_futex_pi() wakes the task before releasing the hash bucket lock
(HB). The first thing the woken-up task usually does is to acquire the
lock, which requires the HB lock. On SMP systems this leads to blocking
on the HB lock, which is released by the owner shortly after.
This patch rearranges the unlock path by first releasing the HB lock and
then waking up the task.

[ tglx: Fixed up the rtmutex unlock path ]

Originally-from: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Link: http://lkml.kernel.org/r/20150617083350.GA2433@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/futex.c                  | 32 ++++++++++++++++++++---
 kernel/locking/rtmutex.c        | 56 ++++++++++++++++++++++++++++++-----------
 kernel/locking/rtmutex_common.h |  3 +++
 3 files changed, 73 insertions(+), 18 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index f9984c3..a0cf6fa 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1117,11 +1117,14 @@ static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
 	q->lock_ptr = NULL;
 }
 
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
+static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
+			 struct futex_hash_bucket *hb)
 {
 	struct task_struct *new_owner;
 	struct futex_pi_state *pi_state = this->pi_state;
 	u32 uninitialized_var(curval), newval;
+	WAKE_Q(wake_q);
+	bool deboost;
 	int ret = 0;
 
 	if (!pi_state)
@@ -1173,7 +1176,19 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
 	raw_spin_unlock_irq(&new_owner->pi_lock);
 
 	raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
-	rt_mutex_unlock(&pi_state->pi_mutex);
+
+	deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);
+
+	/*
+	 * First unlock HB so the waiter does not spin on it once he got woken
+	 * up. Second wake up the waiter before the priority is adjusted. If we
+	 * deboost first (and lose our higher priority), then the task might get
+	 * scheduled away before the wake up can take place.
+	 */
+	spin_unlock(&hb->lock);
+	wake_up_q(&wake_q);
+	if (deboost)
+		rt_mutex_adjust_prio(current);
 
 	return 0;
 }
@@ -2413,13 +2428,23 @@ retry:
 	 */
 	match = futex_top_waiter(hb, &key);
 	if (match) {
-		ret = wake_futex_pi(uaddr, uval, match);
+		ret = wake_futex_pi(uaddr, uval, match, hb);
+		/*
+		 * In case of success wake_futex_pi dropped the hash
+		 * bucket lock.
+		 */
+		if (!ret)
+			goto out_putkey;
 		/*
 		 * The atomic access to the futex value generated a
 		 * pagefault, so retry the user-access and the wakeup:
 		 */
 		if (ret == -EFAULT)
 			goto pi_faulted;
+		/*
+		 * wake_futex_pi has detected invalid state. Tell user
+		 * space.
+		 */
 		goto out_unlock;
 	}
 
@@ -2440,6 +2465,7 @@ retry:
 
 out_unlock:
 	spin_unlock(&hb->lock);
+out_putkey:
 	put_futex_key(&key);
 	return ret;
 
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 44ee8f8..0add724 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -300,7 +300,7 @@ static void __rt_mutex_adjust_prio(struct task_struct *task)
  * of task. We do not use the spin_xx_mutex() variants here as we are
  * outside of the debug path.)
  */
-static void rt_mutex_adjust_prio(struct task_struct *task)
+void rt_mutex_adjust_prio(struct task_struct *task)
 {
 	unsigned long flags;
 
@@ -1247,13 +1247,12 @@ static inline int rt_mutex_slowtrylock(struct rt_mutex *lock)
 }
 
 /*
- * Slow path to release a rt-mutex:
+ * Slow path to release a rt-mutex.
+ * Return whether the current task needs to undo a potential priority boosting.
  */
-static void __sched
-rt_mutex_slowunlock(struct rt_mutex *lock)
+static bool __sched rt_mutex_slowunlock(struct rt_mutex *lock,
+					struct wake_q_head *wake_q)
 {
-	WAKE_Q(wake_q);
-
 	raw_spin_lock(&lock->wait_lock);
 
 	debug_rt_mutex_unlock(lock);
@@ -1294,7 +1293,7 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
 	while (!rt_mutex_has_waiters(lock)) {
 		/* Drops lock->wait_lock ! */
 		if (unlock_rt_mutex_safe(lock) == true)
-			return;
+			return false;
 		/* Relock the rtmutex and try again */
 		raw_spin_lock(&lock->wait_lock);
 	}
@@ -1305,13 +1304,12 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
 	 *
 	 * Queue the next waiter for wakeup once we release the wait_lock.
 	 */
-	mark_wakeup_next_waiter(&wake_q, lock);
+	mark_wakeup_next_waiter(wake_q, lock);
 
 	raw_spin_unlock(&lock->wait_lock);
-	wake_up_q(&wake_q);
 
-	/* Undo pi boosting if necessary: */
-	rt_mutex_adjust_prio(current);
+	/* check PI boosting */
+	return true;
 }
 
 /*
@@ -1362,12 +1360,23 @@ rt_mutex_fasttrylock(struct rt_mutex *lock,
 
 static inline void
 rt_mutex_fastunlock(struct rt_mutex *lock,
-		    void (*slowfn)(struct rt_mutex *lock))
+		    bool (*slowfn)(struct rt_mutex *lock,
+				   struct wake_q_head *wqh))
 {
-	if (likely(rt_mutex_cmpxchg(lock, current, NULL)))
+	WAKE_Q(wake_q);
+
+	if (likely(rt_mutex_cmpxchg(lock, current, NULL))) {
 		rt_mutex_deadlock_account_unlock(current);
-	else
-		slowfn(lock);
+
+	} else {
+		bool deboost = slowfn(lock, &wake_q);
+
+		wake_up_q(&wake_q);
+
+		/* Undo pi boosting if necessary: */
+		if (deboost)
+			rt_mutex_adjust_prio(current);
+	}
 }
 
 /**
@@ -1462,6 +1471,23 @@ void __sched rt_mutex_unlock(struct rt_mutex *lock)
 EXPORT_SYMBOL_GPL(rt_mutex_unlock);
 
 /**
+ * rt_mutex_futex_unlock - Futex variant of rt_mutex_unlock
+ * @lock: the rt_mutex to be unlocked
+ *
+ * Returns: true/false indicating whether priority adjustment is
+ * required or not.
+ */
+bool __sched rt_mutex_futex_unlock(struct rt_mutex *lock,
+				   struct wake_q_head *wqh)
+{
+	if (likely(rt_mutex_cmpxchg(lock, current, NULL))) {
+		rt_mutex_deadlock_account_unlock(current);
+		return false;
+	}
+	return rt_mutex_slowunlock(lock, wqh);
+}
+
+/**
  * rt_mutex_destroy - mark a mutex unusable
  * @lock: the mutex to be destroyed
  *
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 8552125..7844f8f 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -131,6 +131,9 @@ extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
 				      struct hrtimer_sleeper *to,
 				      struct rt_mutex_waiter *waiter);
 extern int rt_mutex_timed_futex_lock(struct rt_mutex *l, struct hrtimer_sleeper *to);
+extern bool rt_mutex_futex_unlock(struct rt_mutex *lock,
+				  struct wake_q_head *wqh);
+extern void rt_mutex_adjust_prio(struct task_struct *task);
 
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 # include "rtmutex-debug.h"
--
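For reference, the new unlock ordering can be summarized in one place. The
following is a simplified restatement of the wake_futex_pi() hunk above
(kernel-internal fragment, not compilable on its own; error handling and
pi_state bookkeeping are omitted):

/* Simplified restatement of the unlock ordering introduced above. */
static void futex_pi_unlock_order(struct futex_hash_bucket *hb,
				  struct futex_pi_state *pi_state)
{
	WAKE_Q(wake_q);
	bool deboost;

	/* Queue the top waiter for wakeup instead of waking it directly. */
	deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);

	/* 1) Drop the hash bucket lock so the woken waiter cannot spin on it. */
	spin_unlock(&hb->lock);

	/* 2) Wake the waiter while we still run with boosted priority. */
	wake_up_q(&wake_q);

	/* 3) Only then undo a potential priority boost. */
	if (deboost)
		rt_mutex_adjust_prio(current);
}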

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [tip:sched/locking] locking/rtmutex: Update stale plist comments
  2015-05-19 17:24 ` [PATCH 3/4] locking/rtmutex: Update stale plist comments Davidlohr Bueso
  2015-06-05 12:39   ` Thomas Gleixner
  2015-06-18 20:57   ` [tip:sched/core] " tip-bot for Davidlohr Bueso
@ 2015-06-19 19:33   ` tip-bot for Davidlohr Bueso
  2 siblings, 0 replies; 48+ messages in thread
From: tip-bot for Davidlohr Bueso @ 2015-06-19 19:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: dbueso, bigeasy, mingo, dave, umgwanakikbuti, tglx, paulmck,
	rostedt, linux-kernel, hpa, peterz

Commit-ID:  9f40a51a35a0e1445cc4873251c3df2631eda294
Gitweb:     http://git.kernel.org/tip/9f40a51a35a0e1445cc4873251c3df2631eda294
Author:     Davidlohr Bueso <dave@stgolabs.net>
AuthorDate: Tue, 19 May 2015 10:24:57 -0700
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 19 Jun 2015 21:27:21 +0200

locking/rtmutex: Update stale plist comments

... as of fb00aca4744 (rtmutex: Turn the plist into an rb-tree) we
no longer use plists for queuing any waiters. Update stale comments.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1432056298-18738-4-git-send-email-dave@stgolabs.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/locking/rtmutex.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 0add724..86d4853 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -624,7 +624,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 	 */
 	prerequeue_top_waiter = rt_mutex_top_waiter(lock);
 
-	/* [7] Requeue the waiter in the lock waiter list. */
+	/* [7] Requeue the waiter in the lock waiter tree. */
 	rt_mutex_dequeue(lock, waiter);
 	waiter->prio = task->prio;
 	rt_mutex_enqueue(lock, waiter);
@@ -662,7 +662,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 		/*
 		 * The waiter became the new top (highest priority)
 		 * waiter on the lock. Replace the previous top waiter
-		 * in the owner tasks pi waiters list with this waiter
+		 * in the owner tasks pi waiters tree with this waiter
 		 * and adjust the priority of the owner.
 		 */
 		rt_mutex_dequeue_pi(task, prerequeue_top_waiter);
@@ -673,7 +673,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 		/*
 		 * The waiter was the top waiter on the lock, but is
 		 * no longer the top prority waiter. Replace waiter in
-		 * the owner tasks pi waiters list with the new top
+		 * the owner tasks pi waiters tree with the new top
 		 * (highest priority) waiter and adjust the priority
 		 * of the owner.
 		 * The new top waiter is stored in @waiter so that
@@ -747,7 +747,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
  *
  * @lock:   The lock to be acquired.
  * @task:   The task which wants to acquire the lock
- * @waiter: The waiter that is queued to the lock's wait list if the
+ * @waiter: The waiter that is queued to the lock's wait tree if the
  *	    callsite called task_blocked_on_lock(), otherwise NULL
  */
 static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
@@ -782,7 +782,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 
 	/*
 	 * If @waiter != NULL, @task has already enqueued the waiter
-	 * into @lock waiter list. If @waiter == NULL then this is a
+	 * into @lock waiter tree. If @waiter == NULL then this is a
 	 * trylock attempt.
 	 */
 	if (waiter) {
@@ -795,7 +795,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 
 		/*
 		 * We can acquire the lock. Remove the waiter from the
-		 * lock waiters list.
+		 * lock waiters tree.
 		 */
 		rt_mutex_dequeue(lock, waiter);
 
@@ -827,7 +827,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 			 * No waiters. Take the lock without the
 			 * pi_lock dance.@task->pi_blocked_on is NULL
 			 * and we have no waiters to enqueue in @task
-			 * pi waiters list.
+			 * pi waiters tree.
 			 */
 			goto takeit;
 		}
@@ -844,7 +844,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 	/*
 	 * Finish the lock acquisition. @task is the new owner. If
 	 * other waiters exist we have to insert the highest priority
-	 * waiter into @task->pi_waiters list.
+	 * waiter into @task->pi_waiters tree.
 	 */
 	if (rt_mutex_has_waiters(lock))
 		rt_mutex_enqueue_pi(task, rt_mutex_top_waiter(lock));
@@ -955,7 +955,7 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 }
 
 /*
- * Remove the top waiter from the current tasks pi waiter list and
+ * Remove the top waiter from the current tasks pi waiter tree and
  * queue it up.
  *
  * Called with lock->wait_lock held.
--

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v2] futex: lower the lock contention on the HB lock during wake up
  2015-06-17 14:28           ` Sebastian Andrzej Siewior
  2015-06-17 14:31             ` Mike Galbraith
@ 2015-06-21  4:35             ` Mike Galbraith
  1 sibling, 0 replies; 48+ messages in thread
From: Mike Galbraith @ 2015-06-21  4:35 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Davidlohr Bueso, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Steven Rostedt, Paul E. McKenney, linux-kernel

On Wed, 2015-06-17 at 16:28 +0200, Sebastian Andrzej Siewior wrote:
> On 06/17/2015 04:17 PM, Mike Galbraith wrote:
> > On Wed, 2015-06-17 at 10:33 +0200, Sebastian Andrzej Siewior wrote:
> >> wake_futex_pi() wakes the task before releasing the hash bucket lock
> >> (HB). The first thing the woken up task usually does is to acquire the
> >> lock which requires the HB lock. On SMP Systems this leads to blocking
> >> on the HB lock which is released by the owner shortly after.
> >> This patch rearranges the unlock path by first releasing the HB lock and
> >> then waking up the task.
> >>
> >> [bigeasy: redo ontop of lockless wake-queues]
> >> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> >> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > 
> > 4.1-rc8-rt4 contains this via 4.0-rt4, and seems fine on my 64 core
> > DL980.  I ran a few iterations of futextests and stockfish, then mixed
> > two loops of futextest at different rt prios, with stockfish also rt,
> > and ltplight was tossed in as... crack filler.  Box is still doing that,
> > is way too busy, but not griping about it.  
> 
> There are two patches mostly doing the same thing. The patch posted
> here is a redo on top of "lockless wake-queues". It does hb-unlock,
> wakeup, de-boost. The patch merged into -RT is the original approach
> not using "lockless wake-queues" and performing wakeup, hb-unlock,
> de-boost.
> 
> I plan to get into -RT the final solution once it hits upstream.

I plugged patch1 and tip version into rt and beat it, seems solid.

Converting the rest of rtmutex.c to use wake queues with ->save_state to
select wake function went less well.  Kernel does a good impersonation
of a working kernel until I beat it up, then it loses wakeups.  Hohum,
so much for yet another early morning tinker session.

	-Mike

--

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH v2] futex: lower the lock contention on the HB lock during wake up
  2015-09-16 23:57               ` Zhu Jefferry
@ 2015-09-17  7:08                 ` Thomas Gleixner
  0 siblings, 0 replies; 48+ messages in thread
From: Thomas Gleixner @ 2015-09-17  7:08 UTC (permalink / raw)
  To: Zhu Jefferry; +Cc: linux-kernel, bigeasy

On Wed, 16 Sep 2015, Zhu Jefferry wrote:
> Besides the application did not check the return value, the mutex_unlock in
> Libc did not check the return value from kernel neither. 

That's even worse.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH v2] futex: lower the lock contention on the HB lock during wake up
  2015-09-16 13:39             ` Thomas Gleixner
@ 2015-09-16 23:57               ` Zhu Jefferry
  2015-09-17  7:08                 ` Thomas Gleixner
  0 siblings, 1 reply; 48+ messages in thread
From: Zhu Jefferry @ 2015-09-16 23:57 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: linux-kernel, bigeasy

> On Wed, 16 Sep 2015, Zhu Jefferry wrote:
> > > > The primary debugging shows the content of __lock is wrong in first.
> > > > After a call of Mutex_unlock, the value of __lock should not be
> > > > this thread self. But we observed The value of __lock is still
> > > > self after unlock. So, other threads will be stuck,
> > >
> > > How did you observe that?
> >
> > Add one assert in mutex_unlock, after it finish the __lock modify
> > either in User space or kernel space, before return.
> 
> And that assert tells you that the kernel screwed up the futex value?
> No, it does not. It merily tells you that the value is not what you
> expect, but it does not tell you what caused that.
> 
> Hint: There are proper instrumentation tools, e.g. tracing, which tell
> you the exact flow of events and not just the observation after the fact.

I'm trying to get more details about the failure flow, but I'm told that even
a small timing change in the code can make the failure take much longer to
appear, or even make it disappear.

> 
> > > > This thread could lock due to recursive type and __counter keep
> > > > increasing, although mutex_unlock return fails, due to the wrong
> > > > value of __owner, but the application did not check the return
> > > > value. So the thread 0 looks like fine. But thread 1 will be stuck
> forever.
> > >
> > > Oh well. So thread 0 looks all fine, despite not checking return
> values.
> > >
> >
> > Correct.
> 
> No. That's absolutely NOT correct. Not checking return values can cause
> all kind of corruptions. Return values are there for a reason.
> 

Besides the application not checking the return value, the mutex_unlock in
libc does not check the return value from the kernel either.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH v2] futex: lower the lock contention on the HB lock during wake up
  2015-09-16 11:13           ` Zhu Jefferry
@ 2015-09-16 13:39             ` Thomas Gleixner
  2015-09-16 23:57               ` Zhu Jefferry
  0 siblings, 1 reply; 48+ messages in thread
From: Thomas Gleixner @ 2015-09-16 13:39 UTC (permalink / raw)
  To: Zhu Jefferry; +Cc: linux-kernel, bigeasy

On Wed, 16 Sep 2015, Zhu Jefferry wrote:
> > > The primary debugging shows the content of __lock is wrong in first.
> > > After a call of Mutex_unlock, the value of __lock should not be this
> > > thread self. But we observed The value of __lock is still self after
> > > unlock. So, other threads will be stuck,
> > 
> > How did you observe that?
> 
> Add one assert in mutex_unlock, after it finish the __lock modify either in
> User space or kernel space, before return.

And that assert tells you that the kernel screwed up the futex value?
No, it does not. It merely tells you that the value is not what you
expect, but it does not tell you what caused that.

Hint: There are proper instrumentation tools, e.g. tracing, which tell
you the exact flow of events and not just the observation after the
fact.

> > > This thread could lock due to recursive type and __counter keep
> > > increasing, although mutex_unlock return fails, due to the wrong value
> > > of __owner, but the application did not check the return value. So the
> > > thread 0 looks like fine. But thread 1 will be stuck forever.
> > 
> > Oh well. So thread 0 looks all fine, despite not checking return values.
> > 
> 
> Correct.

No. That's absolutely NOT correct. Not checking return values can
cause all kinds of corruption. Return values are there for a reason.
 
> Actually, I'm not clear how about the state changing of futex in kernel.
> I search the Internet, see a similar failure from other users. He is using 
> Kernel 2.6.38. Our customer is using kernel 2.6.34 (WindRiver Linux 4.1)

So your customer should talk to WindRiver about this. I have no idea
what kind of patches WindRiver has in their kernel and I really don't
want to know.

If you can reproduce that issue against a recent mainline kernel, then
I'm happy to analyze that.

>     ====
>     http://www.programdoc.com/1272_157986_1.htm

Your supply of weird web pages seems to be infinite.
 
> But I can not understand the sample failure case which he mentioned. But I think
> It might be helpful for you to analyze the corner case.

No, it's absolutely NOT helpful because it's just random guesswork, as
the flow he is describing is just not possible. That guy never showed
his test case, so I have no idea how he can 'prove' his theory.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH v2] futex: lower the lock contention on the HB lock during wake up
  2015-09-16 10:22         ` Thomas Gleixner
@ 2015-09-16 11:13           ` Zhu Jefferry
  2015-09-16 13:39             ` Thomas Gleixner
  0 siblings, 1 reply; 48+ messages in thread
From: Zhu Jefferry @ 2015-09-16 11:13 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: linux-kernel, bigeasy

> On Wed, 16 Sep 2015, Zhu Jefferry wrote:
> > The application is a multi-thread program, to use the pairs of
> > mutex_lock and mutex_unlock to protect the shared data structure. The
> > type of this mutex is PTHREAD_MUTEX_PI_RECURSIVE_NP. After running
> > long time, to say several days, the mutex_lock data structure in user
> space looks like corrupt.
> >
> >    thread 0 can do mutex_lock/unlock
> >    __lock = this thread | FUTEX_WAITERS
> >    __owner = 0, should be this thread
> 
> The kernel does not know about __owner.

Correct. It shows that the last failure is in mutex_unlock,
which clears __owner in user space.

> 
> >    __counter keep increasing, although there is no recursive mutex_lock
> call.
> >
> >    thread 1 will be stuck
> > 
> > The primary debugging shows the content of __lock is wrong in first.
> > After a call of Mutex_unlock, the value of __lock should not be this
> > thread self. But we observed The value of __lock is still self after
> > unlock. So, other threads will be stuck,
> 
> How did you observe that?

I added one assert in mutex_unlock, after it finishes modifying __lock
(either in user space or in kernel space), right before returning.
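In user space, a roughly equivalent check can be sketched as a wrapper around
pthread_mutex_unlock() (a hypothetical sketch; it assumes glibc/NPTL internals,
a PI mutex whose owner TID lives in the low bits of __data.__lock, and that the
lock is not held recursively at this point, which matches the application as
quoted above):

#include <assert.h>
#include <pthread.h>
#include <linux/futex.h>	/* FUTEX_TID_MASK, FUTEX_WAITERS */
#include <sys/syscall.h>
#include <unistd.h>

static int checked_unlock(pthread_mutex_t *m)
{
	pid_t tid = syscall(SYS_gettid);
	int err = pthread_mutex_unlock(m);

	/* After a successful (non-nested) unlock, the TID portion of the
	 * futex word must no longer be the caller's TID. */
	if (!err)
		assert(((unsigned int)m->__data.__lock & FUTEX_TID_MASK) !=
		       (unsigned int)tid);
	return err;
}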

> 
> > This thread could lock due to recursive type and __counter keep
> > increasing, although mutex_unlock return fails, due to the wrong value
> > of __owner, but the application did not check the return value. So the
> > thread 0 looks like fine. But thread 1 will be stuck forever.
> 
> Oh well. So thread 0 looks all fine, despite not checking return values.
> 

Correct.

Actually, I'm not clear about how the futex state changes inside the kernel.
I searched the Internet and saw a similar failure reported by another user.
He is using kernel 2.6.38; our customer is using kernel 2.6.34
(WindRiver Linux 4.1).

    ====
    http://www.programdoc.com/1272_157986_1.htm

    Maybe there is a bug in pi-futex that lets the program hang in
    user space.
    We have a board: the CPU is a PowerPC 8572, two cores. After running
    for one month, the state of the pi-futex in user space went bad:
    mutex->__data.__lock is 0x8000023e,
    mutex->__data.__count is 0,
    mutex->__data.__owner is 0.

I cannot fully understand the sample failure case he mentioned, but I think
it might still be helpful for you when analyzing the corner cases.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH v2] futex: lower the lock contention on the HB lock during wake up
  2015-09-16  9:52       ` Zhu Jefferry
@ 2015-09-16 10:22         ` Thomas Gleixner
  2015-09-16 11:13           ` Zhu Jefferry
  0 siblings, 1 reply; 48+ messages in thread
From: Thomas Gleixner @ 2015-09-16 10:22 UTC (permalink / raw)
  To: Zhu Jefferry; +Cc: linux-kernel, bigeasy

On Wed, 16 Sep 2015, Zhu Jefferry wrote:
> The application is a multi-thread program, to use the pairs of mutex_lock and 
> mutex_unlock to protect the shared data structure. The type of this mutex
> is PTHREAD_MUTEX_PI_RECURSIVE_NP. After running long time, to say several days,
> the mutex_lock data structure in user space looks like corrupt.
> 
>    thread 0 can do mutex_lock/unlock     
>    __lock = this thread | FUTEX_WAITERS
>    __owner = 0, should be this thread

The kernel does not know about __owner.

>    __counter keep increasing, although there is no recursive mutex_lock call.
>
>    thread 1 will be stuck 
> 
> The primary debugging shows the content of __lock is wrong in first. After a call of
> Mutex_unlock, the value of __lock should not be this thread self. But we observed
> The value of __lock is still self after unlock. So, other threads will be stuck,

How did you observe that?

> This thread could lock due to recursive type and __counter keep increasing, 
> although mutex_unlock return fails, due to the wrong value of __owner, 
> but the application did not check the return value. So the thread 0 looks
> like fine. But thread 1 will be stuck forever.

Oh well. So thread 0 looks all fine, despite not checking return
values.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH v2] futex: lower the lock contention on the HB lock during wake up
  2015-09-16  8:06     ` Thomas Gleixner
@ 2015-09-16  9:52       ` Zhu Jefferry
  2015-09-16 10:22         ` Thomas Gleixner
  0 siblings, 1 reply; 48+ messages in thread
From: Zhu Jefferry @ 2015-09-16  9:52 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: linux-kernel, bigeasy

> > I assume your pseudo code set_waiter_bit is mapped to the real code
> > "futex_lock_pi_atomic", It's possible for futex_lock_pi_atomic to
> > successfully set FUTEX_WAITERS bit, but return with Page fault, for
> > example, like fail in lookup_pi_state().
> 
> No. It's not. lookup_pi_state() cannot return EFAULT. The only function
> which can fault inside of lock_pi_update_atomic() is the actual cmpxchg.
> Though lock_pi_update_atomic() can successfully set the waiter bit and
> then return with some other failure code (ESRCH, EAGAIN, ...). But that
> does not matter at all.
> 
> Any failure return will end up in a retry. And if the waker managed to
> release the futex before the retry takes place then the waiter will see
> that and take the futex.
> 
Let me try to describe the application failure here.

The application is a multi-threaded program that uses pairs of mutex_lock and
mutex_unlock calls to protect a shared data structure. The type of this mutex
is PTHREAD_MUTEX_PI_RECURSIVE_NP. After running for a long time, say several
days, the mutex data structure in user space looks corrupted:

   thread 0 can still do mutex_lock/unlock
   __lock = this thread | FUTEX_WAITERS
   __owner = 0, but it should be this thread
   __counter keeps increasing, although there is no recursive mutex_lock call.

   thread 1 is stuck

The primary debugging shows that the content of __lock goes wrong first. After
a call to mutex_unlock, the value of __lock should no longer be this thread's
TID, but we observed that __lock still contains this thread's TID after the
unlock. So other threads get stuck. This thread can still take the lock
because the mutex is recursive, and __counter keeps increasing even though
mutex_unlock keeps failing due to the wrong value of __owner; since the
application does not check the return value, thread 0 looks fine, but
thread 1 is stuck forever.
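
To make the usage pattern concrete, a minimal self-contained sketch of that
kind of application (PI protocol, recursive type, every return value checked)
could look like the program below. This is a hypothetical illustration for
discussion only, not the customer's code.

/*
 * Hypothetical reproduction skeleton: two threads hammering one PI,
 * recursive mutex, checking every return value.
 * Build with:  gcc -O2 -pthread pi_mutex_demo.c
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS 10000000UL

static pthread_mutex_t m;
static unsigned long counter;

static void *worker(void *arg)
{
	for (unsigned long i = 0; i < ITERS; i++) {
		int err = pthread_mutex_lock(&m);
		if (err) {
			fprintf(stderr, "lock: %d\n", err);
			exit(1);
		}
		counter++;
		err = pthread_mutex_unlock(&m);
		if (err) {
			fprintf(stderr, "unlock: %d\n", err);
			exit(1);
		}
	}
	return NULL;
}

int main(void)
{
	pthread_mutexattr_t attr;
	pthread_t t[2];

	pthread_mutexattr_init(&attr);
	/* Priority inheritance + recursive: the combination described above. */
	pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
	pthread_mutex_init(&m, &attr);

	for (int i = 0; i < 2; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < 2; i++)
		pthread_join(t[i], NULL);

	printf("counter = %lu (expected %lu)\n", counter, 2 * ITERS);
	return 0;
}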


^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH v2] futex: lower the lock contention on the HB lock during wake up
  2015-09-16  0:17   ` Zhu Jefferry
@ 2015-09-16  8:06     ` Thomas Gleixner
  2015-09-16  9:52       ` Zhu Jefferry
  0 siblings, 1 reply; 48+ messages in thread
From: Thomas Gleixner @ 2015-09-16  8:06 UTC (permalink / raw)
  To: Zhu Jefferry; +Cc: linux-kernel, bigeasy

On Wed, 16 Sep 2015, Zhu Jefferry wrote:

> Thanks for your detail guideline and explanations. Please see my questions in-line.

Please trim the reply to the relevant sections. It's annoying if I
have to search your replies inside of useless quoted text.
 
> > -----Original Message-----
> > From: Thomas Gleixner [mailto:tglx@linutronix.de]
> > The flow is:
> > 
> > sys_futex(LOCK_PI, futex, ...)
> > 
> >  retry:
> >   lock(hb(futex));
> >   ret = set_waiter_bit(futex);
> >   if (ret == -EFAULT) {
> >     unlock(hb(futex));
> >     handle_fault();
> >     goto retry;
> >   }
> > 
> >   list_add();
> >   unlock(hb(futex));
> >   schedule();
> > 
> > So when set_waiter_bit() succeeds, then the hash bucket lock is held and
> > blocks the waker. So it's guaranteed that the waker will see the waiter
> > on the list.
> > 
> > If set_waiter_bit() faults, then the waiter bit is not set and therefor
> > there is nothing to wake. So the waker will not enter the kernel because
> > the futex is uncontended.
> > 

> I assume your pseudo code set_waiter_bit is mapped to the real code
> "futex_lock_pi_atomic", It's possible for futex_lock_pi_atomic to
> successfully set FUTEX_WAITERS bit, but return with Page fault, for
> example, like fail in lookup_pi_state().

No. It's not. lookup_pi_state() cannot return EFAULT. The only
function which can fault inside of lock_pi_update_atomic() is the
actual cmpxchg. Though lock_pi_update_atomic() can successfully set
the waiter bit and then return with some other failure code (ESRCH,
EAGAIN, ...). But that does not matter at all.

Any failure return will end up in a retry. And if the waker managed to
release the futex before the retry takes place then the waiter will
see that and take the futex.

As I said before:

> > Random speculation is not helping here.

You still fail to provide the relevant information I asked for. If you
cannot provide that information, we can't help.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH v2] futex: lower the lock contention on the HB lock during wake up
  2015-09-16  0:01 ` Thomas Gleixner
@ 2015-09-16  0:17   ` Zhu Jefferry
  2015-09-16  8:06     ` Thomas Gleixner
  0 siblings, 1 reply; 48+ messages in thread
From: Zhu Jefferry @ 2015-09-16  0:17 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: linux-kernel, bigeasy

Thanks for your detailed guidelines and explanations. Please see my questions inline.

> -----Original Message-----
> From: Thomas Gleixner [mailto:tglx@linutronix.de]
> Sent: Wednesday, September 16, 2015 8:01 AM
> To: Zhu Shuangjun-R65879
> Cc: linux-kernel@vger.kernel.org; bigeasy@linutronix.de
> Subject: RE: [PATCH v2] futex: lower the lock contention on the HB lock
> during wake up
> 
> On Tue, 15 Sep 2015, Zhu Jefferry wrote:
> 
> Please configure your e-mail client proper and follow the basic rules:
> 
> - Choose a meaningful subject for your questions
> 
>   You just copied a random subject line from some other mail thread,
>   which makes your mail look like a patch. But it's not a patch. You
>   have a question about futexes and patches related to them.
> 
> - Make sure your lines are no longer than 72 to 76 characters in length.
> 
>   You can't assume that all e-mail and news clients behave like yours,
> and while yours might wrap lines automatically when the text reaches the
> right of the window containing it, not all do.
> 
>   For the sentence above I need a 190 character wide display ....
> 
> - Do not use colors or other gimmicks. They just make the mail
>   unreadable in simple text based readers.
> 
> > Just in the list, I see the patch "[PATCH v2] futex: lower the lock
> > contention on the HB lock during wake up" at
> > http://www.gossamer-
> threads.com/lists/linux/kernel/2199938?search_string=futex;#2199938.
> 
> > But I see another patch with same name, different content here,
> >
> > 23b7776290b10297fe2cae0fb5f166a4f2c68121(http://code.metager.de/source
> > /xref/linux/stable/kernel/futex.c?r=23b7776290b10297fe2cae0fb5f166a4f2
> > c68121)
> 
> I have no idea what that metager thing tells you and I really don't want
> to know. Plain git tells me:
> 
> # git show 23b7776290b10297fe2cae0fb5f166a4f2c68121
> Merge: 6bc4c3ad3619 6fab54101923
> Author: Linus Torvalds <torvalds@linux-foundation.org
> Date:   Mon Jun 22 15:52:04 2015 -0700
> 
>     Merge branch 'sched-core-for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> 
> So that's a merge commit where Linus pulled a pile of changes into the
> mainline kernel. And that merge does not contain the patch above, but it
> contains a different change to the futex code.
> 
> > Could you please help to give a little bit more explanation on this,
> > why they have same name with different modify in the futex.c? I'm a
> > newbie in the community.
> 
> Use the proper tools and not some random web interface. The commit you
> are looking for is a completely different one.
> 
> # git log kernel/futex.c
> ....
> commit 802ab58da74bb49ab348d2872190ef26ddc1a3e0
> Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de
> Date:   Wed Jun 17 10:33:50 2015 +0200
> 
>     futex: Lower the lock contention on the HB lock during wake up ....
> 
> And that's the same as the one in the LKML thread plus a fixup.
> 
> > Actually, I encounter a customer issue which is related to the glibc
> > code "pthread_mutex_lock", which is using the futex service in kernel,
> > without the patches above.
> 
> The patches above are merily an optimization and completely unrelated to
> your problem.
> 
> You fail to provide the real interesting information here:
> 
>  - Which architecture/SoC
>  - Which kernel version and which extra patches
>  - Which glibc version and which extra patches
> 
> > After lots of customer discussing, ( I could not reproduce the failure
> > in my office), I seriously suspect there might be some particular
> > corner cases in the futex code.
> 
> The futex code is more or less a conglomorate of corner cases.
> 
> But again you fail to provide the real interesting information:
> 
>  - What is the actual failure ?
> 
> The information that you discussed that with your customer is completely
> irrelevant and your suspicion does not clarify the issue either.
> 
> > In the unlock flow, the user space code (pthread_mutex_unlock) will
> > check FUTEX_WAITERS flag first, then wakeup the waiters in the kernel
> > list. But in the lock flow, the kernel code (futex) will set
> > FUTEX_WAITERS in first too, then try to get the waiter from the list.
> > They are following same sequence, flag first, entry in list secondly.
> > But there might be some timing problem in SMP system, if the query
> > (unlock flow) is executing just before the list adding action (lock
> > flow).
> 
> There might be some timing problem, if the code would look like the
> scheme you painted below, but it does not.
> 
> > It might cause the mutex is never really released, and other threads
> > will infinite waiting. Could you please help to take a look at it?
> >
> > CPU 0 (trhead 0)                                CPU 1 (thread 1)
> >
> >  mutex_lock
> >  val = *futex;
> >  sys_futex(LOCK_PI, futex, val);
> >
> >  return to user space
> 
> If the futex is uncontended then you don't enter the kernel for acquiring
> the futex.
> 
> >  after acquire the lock                           mutex_lock
> >                                                   val = *futex;
> >                                                   sys_futex(LOCK_PI,
> > futex, val);
> 
> The futex FUTEX_LOCK_PI operation does not take the user space value.
> That's what FUTEX_WAIT does.
> 
> >
> lock(hash_bucket(futex));
> >                                                     set FUTEX_WAITERS
> flag
> >
> > unlock(hash_bucket(futex)) and retry due to page fault
> 
> So here you are completely off the track. If the 'set FUTEX_WAITERS bit'
> operation fails due to a page fault, then the FUTEX_WAITERS bit is not
> set. So it cannot be observed on the other core.
> 
> The flow is:
> 
> sys_futex(LOCK_PI, futex, ...)
> 
>  retry:
>   lock(hb(futex));
>   ret = set_waiter_bit(futex);
>   if (ret == -EFAULT) {
>     unlock(hb(futex));
>     handle_fault();
>     goto retry;
>   }
> 
>   list_add();
>   unlock(hb(futex));
>   schedule();
> 
> So when set_waiter_bit() succeeds, then the hash bucket lock is held and
> blocks the waker. So it's guaranteed that the waker will see the waiter
> on the list.
> 
> If set_waiter_bit() faults, then the waiter bit is not set and therefor
> there is nothing to wake. So the waker will not enter the kernel because
> the futex is uncontended.
> 

I assume your pseudo code set_waiter_bit maps to the real code
futex_lock_pi_atomic(). It's possible for futex_lock_pi_atomic() to
successfully set the FUTEX_WAITERS bit but still return with a page fault,
for example by failing in lookup_pi_state().


> So now, lets assume that the waiter failed to set the waiter bit and the
> waker unlocked the futex. When the waiter retries then it actually checks
> whether the futex still has an owner. So it observes the owner has been
> cleared, it acquires the futex and returns.
> 
> It's a bit more complex than that due to handling of the gazillion of
> corner cases, but that's the basic synchronization mechanism and there is
> no hidden timing issue on SMP.
> 
> Random speculation is not helping here.
> 
> Thanks,
> 
> 	tglx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH v2] futex: lower the lock contention on the HB lock during wake up
  2015-09-15  1:26 [PATCH v2] futex: lower the lock contention on the HB lock during wake up Zhu Jefferry
@ 2015-09-16  0:01 ` Thomas Gleixner
  2015-09-16  0:17   ` Zhu Jefferry
  0 siblings, 1 reply; 48+ messages in thread
From: Thomas Gleixner @ 2015-09-16  0:01 UTC (permalink / raw)
  To: Zhu Jefferry; +Cc: linux-kernel, bigeasy

On Tue, 15 Sep 2015, Zhu Jefferry wrote:

Please configure your e-mail client proper and follow the basic rules:

- Choose a meaningful subject for your questions

  You just copied a random subject line from some other mail thread,
  which makes your mail look like a patch. But it's not a patch. You
  have a question about futexes and patches related to them.

- Make sure your lines are no longer than 72 to 76 characters in length.

  You can't assume that all e-mail and news clients behave like yours, and while yours might wrap lines automatically when the text reaches the right of the window containing it, not all do.

  For the sentence above I need a 190 character wide display ....

- Do not use colors or other gimmicks. They just make the mail
  unreadable in simple text based readers.

> Just in the list, I see the patch "[PATCH v2] futex: lower the lock
> contention on the HB lock during wake up" at
> http://www.gossamer-threads.com/lists/linux/kernel/2199938?search_string=futex;#2199938.
 
> But I see another patch with same name, different content here,
>     23b7776290b10297fe2cae0fb5f166a4f2c68121(http://code.metager.de/source/xref/linux/stable/kernel/futex.c?r=23b7776290b10297fe2cae0fb5f166a4f2c68121)

I have no idea what that metager thing tells you and I really don't
want to know. Plain git tells me:

# git show 23b7776290b10297fe2cae0fb5f166a4f2c68121
Merge: 6bc4c3ad3619 6fab54101923
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Mon Jun 22 15:52:04 2015 -0700

    Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

So that's a merge commit where Linus pulled a pile of changes into the
mainline kernel. And that merge does not contain the patch above, but
it contains a different change to the futex code.

> Could you please help to give a little bit more explanation on this,
> why they have same name with different modify in the futex.c? I'm a
> newbie in the community.

Use the proper tools and not some random web interface. The commit you
are looking for is a completely different one.

# git log kernel/futex.c
....
commit 802ab58da74bb49ab348d2872190ef26ddc1a3e0
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Wed Jun 17 10:33:50 2015 +0200

    futex: Lower the lock contention on the HB lock during wake up
....

And that's the same as the one in the LKML thread plus a fixup.

> Actually, I encounter a customer issue which is related to the glibc
> code "pthread_mutex_lock", which is using the futex service in
> kernel, without the patches above.

The patches above are merely an optimization and completely unrelated
to your problem.

You fail to provide the real interesting information here:

 - Which architecture/SoC
 - Which kernel version and which extra patches
 - Which glibc version and which extra patches

> After lots of customer discussing, ( I could not reproduce the
> failure in my office), I seriously suspect there might be some
> particular corner cases in the futex code.

The futex code is more or less a conglomerate of corner cases.

But again you fail to provide the real interesting information:

 - What is the actual failure ?

The fact that you discussed this with your customer is completely
irrelevant, and your suspicion does not clarify the issue either.

> In the unlock flow, the user space code (pthread_mutex_unlock) will
> check FUTEX_WAITERS flag first, then wakeup the waiters in the
> kernel list. But in the lock flow, the kernel code (futex) will set
> FUTEX_WAITERS in first too, then try to get the waiter from the
> list. They are following same sequence, flag first, entry in list
> secondly. But there might be some timing problem in SMP system, if
> the query (unlock flow) is executing just before the list adding
> action (lock flow).

There might be some timing problem if the code looked like the scheme
you painted below, but it does not.

> It might cause the mutex is never really released, and other threads
> will infinite waiting. Could you please help to take a look at it?
>
> CPU 0 (trhead 0)                                CPU 1 (thread 1)
> 
>  mutex_lock                                               
>  val = *futex;                                  
>  sys_futex(LOCK_PI, futex, val);                
>
>  return to user space

If the futex is uncontended then you don't enter the kernel for
acquiring the futex.

>  after acquire the lock                           mutex_lock
>                                                   val = *futex;
>                                                   sys_futex(LOCK_PI, futex, val);

The futex FUTEX_LOCK_PI operation does not take the user space value. That's
what FUTEX_WAIT does.

>                                                     lock(hash_bucket(futex));
>                                                     set FUTEX_WAITERS flag
>                                                     unlock(hash_bucket(futex)) and retry due to page fault

So here you are completely off the track. If the 'set FUTEX_WAITERS
bit' operation fails due to a page fault, then the FUTEX_WAITERS bit
is not set. So it cannot be observed on the other core.

The flow is:

sys_futex(LOCK_PI, futex, ...)

 retry:
  lock(hb(futex));
  ret = set_waiter_bit(futex);
  if (ret == -EFAULT) {
    unlock(hb(futex));
    handle_fault();
    goto retry;
  }

  list_add();
  unlock(hb(futex));
  schedule();

So when set_waiter_bit() succeeds, then the hash bucket lock is held
and blocks the waker. So it's guaranteed that the waker will see the
waiter on the list.

If set_waiter_bit() faults, then the waiter bit is not set and
therefore there is nothing to wake. So the waker will not enter the
kernel because the futex is uncontended.

So now, let's assume that the waiter failed to set the waiter bit and
the waker unlocked the futex. When the waiter retries, it actually
checks whether the futex still has an owner. It observes that the owner
has been cleared, acquires the futex and returns.

It's a bit more complex than that due to the handling of a gazillion
corner cases, but that's the basic synchronization mechanism, and there
is no hidden timing issue on SMP.
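
For completeness, the waker side can be sketched in the same pseudo code
style (a simplified illustration, not the exact glibc or kernel code):

mutex_unlock (user space):
  if (cmpxchg(futex, my_tid, 0) == my_tid)
      return;			/* uncontended: the kernel is never entered */

  /* FUTEX_WAITERS is set: somebody queued (or is queueing) under hb lock */
  sys_futex(UNLOCK_PI, futex, ...);

    lock(hb(futex));		/* serializes against the waiter's list_add() */
    if (top_waiter(futex))
        hand the futex over to it and wake it;
    else
        clear the futex word;	/* stale WAITERS bit, nothing to wake */
    unlock(hb(futex));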

Random speculation is not helping here.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH v2] futex: lower the lock contention on the HB lock during wake up
@ 2015-09-15  1:26 Zhu Jefferry
  2015-09-16  0:01 ` Thomas Gleixner
  0 siblings, 1 reply; 48+ messages in thread
From: Zhu Jefferry @ 2015-09-15  1:26 UTC (permalink / raw)
  To: linux-kernel; +Cc: tglx, bigeasy

Hi 

Just in the list, I see the patch "[PATCH v2] futex: lower the lock
contention on the HB lock during wake up" at
http://www.gossamer-threads.com/lists/linux/kernel/2199938?search_string=futex;#2199938.

But I see another patch with the same name but different content here:
    23b7776290b10297fe2cae0fb5f166a4f2c68121
    (http://code.metager.de/source/xref/linux/stable/kernel/futex.c?r=23b7776290b10297fe2cae0fb5f166a4f2c68121)
    23-Jun-2015 Linus Torvalds
    futex: Lower the lock contention on the HB lock during wake up

    wake_futex_pi() wakes the task before releasing the hash bucket lock (HB).
    The first thing the woken up task usually does is to acquire the lock,
    which requires the HB lock. On SMP systems this leads to blocking on the
    HB lock, which is released by the owner shortly after. This patch
    rearranges the unlock path by first releasing the HB lock and then waking
    up the task.

Could you please give a little more explanation of this, i.e. why they have
the same name but make different modifications to futex.c? I'm a newbie in
the community.

Actually, I have encountered a customer issue related to the glibc code
"pthread_mutex_lock", which uses the futex service in the kernel, on a kernel
without the patches above.

After a lot of discussion with the customer (I could not reproduce the
failure in my office), I seriously suspect there might be some particular
corner case in the futex code.

In the unlock flow, the user space code (pthread_mutex_unlock) checks the
FUTEX_WAITERS flag first and then wakes up the waiters on the kernel's list.
In the lock flow, the kernel code (futex) also sets FUTEX_WAITERS first and
then tries to get the waiter from the list. Both follow the same sequence:
flag first, list entry second. But there might be some timing problem on an
SMP system if the query (unlock flow) executes just before the list-add
(lock flow).

It might cause the mutex to never be really released, so other threads will
wait forever. Could you please help take a look at it?

===========================================================================================================================
CPU 0 (thread 0)                                CPU 1 (thread 1)

 mutex_lock                                               
 val = *futex;                                  
 sys_futex(LOCK_PI, futex, val);                
                                                
 return to user space                           
 after acquire the lock                           mutex_lock
                                                  val = *futex;
                                                  sys_futex(LOCK_PI, futex, val);
                                                    lock(hash_bucket(futex));
                                                    set FUTEX_WAITERS flag
                                                    unlock(hash_bucket(futex)) and retry due to page fault
                                                
 mutex_unlock in user space                     
 check FUTEX_WAITERS flag                                               
 sys_futex(UNLOCK_PI, futex, val);              
   lock(hash_bucket(futex));        <--.            
                                       .---------   waiting for the lock of (hash_bucket(futex)) to do list adding
                                                
   try to get the waiter in waiting  <--.           
   list, but it's empty                 |       
                                        |       
   set new_owner to itself              |       
   instead of expecting waiter          |       
                                        |       
                                        |       
   unlock(hash_bucket(futex));          |       
                                        |           lock(hash_bucket(futex));
                                        .--------   add itself to the waiting list
                                                    unlock(hash_bucket(futex));
                                                    waiting forever since nobody will release the PI futex
   the futex is owned by itself                 
   forever in userspace. Because                
   the __owner in user space has                
   been cleared and mutex_unlock                
   will fail forever before it
   calls the kernel.


Thanks,
Jeff


^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2015-09-17  7:09 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-19 17:24 [PATCH -tip 0/4] rtmutex: Spin on owner Davidlohr Bueso
2015-05-19 17:24 ` [PATCH 1/4] locking/rtmutex: Implement lockless top-waiter wakeup Davidlohr Bueso
2015-06-05 12:35   ` Thomas Gleixner
2015-06-16 19:29   ` [PATCH] futex: lower the lock contention on the HB lock during wake up Sebastian Andrzej Siewior
2015-06-16 19:50     ` Davidlohr Bueso
2015-06-17  8:33       ` [PATCH v2] " Sebastian Andrzej Siewior
2015-06-17 14:17         ` Mike Galbraith
2015-06-17 14:28           ` Sebastian Andrzej Siewior
2015-06-17 14:31             ` Mike Galbraith
2015-06-21  4:35             ` Mike Galbraith
2015-06-18 20:30         ` [tip:sched/core] futex: Lower " tip-bot for Sebastian Andrzej Siewior
2015-06-19 17:51         ` [PATCH v2] futex: lower " Kevin Hilman
2015-06-19 18:54           ` Thomas Gleixner
2015-06-19 19:32             ` Kevin Hilman
2015-06-19 19:33         ` [tip:sched/locking] futex: Lower " tip-bot for Sebastian Andrzej Siewior
2015-06-18 20:30   ` [tip:sched/core] locking/rtmutex: Implement lockless top-waiter wakeup tip-bot for Davidlohr Bueso
2015-05-19 17:24 ` [PATCH 2/4] locking/rtmutex: Use cmp-cmpxchg Davidlohr Bueso
2015-06-05 12:38   ` Thomas Gleixner
2015-06-06 15:27     ` Davidlohr Bueso
2015-06-15 18:34       ` Jason Low
2015-06-15 19:37         ` Davidlohr Bueso
2015-06-16  1:00           ` Jason Low
2015-05-19 17:24 ` [PATCH 3/4] locking/rtmutex: Update stale plist comments Davidlohr Bueso
2015-06-05 12:39   ` Thomas Gleixner
2015-06-18 20:57   ` [tip:sched/core] " tip-bot for Davidlohr Bueso
2015-06-19 19:33   ` [tip:sched/locking] " tip-bot for Davidlohr Bueso
2015-05-19 17:24 ` [PATCH -rfc 4/4] locking/rtmutex: Support spin on owner (osq) Davidlohr Bueso
2015-05-20  7:11   ` Paul Bolle
2015-05-25 20:35     ` Davidlohr Bueso
2015-05-29 15:19   ` Davidlohr Bueso
2015-05-29 18:01     ` Davidlohr Bueso
2015-06-05 13:59   ` Thomas Gleixner
2015-06-09  4:41     ` Davidlohr Bueso
2015-06-09  9:29       ` Thomas Gleixner
2015-06-09 11:21         ` Peter Zijlstra
2015-06-09 12:53           ` Thomas Gleixner
2015-05-25 20:35 ` [PATCH -tip 0/4] rtmutex: Spin on owner Davidlohr Bueso
2015-05-26 19:05   ` Thomas Gleixner
2015-09-15  1:26 [PATCH v2] futex: lower the lock contention on the HB lock during wake up Zhu Jefferry
2015-09-16  0:01 ` Thomas Gleixner
2015-09-16  0:17   ` Zhu Jefferry
2015-09-16  8:06     ` Thomas Gleixner
2015-09-16  9:52       ` Zhu Jefferry
2015-09-16 10:22         ` Thomas Gleixner
2015-09-16 11:13           ` Zhu Jefferry
2015-09-16 13:39             ` Thomas Gleixner
2015-09-16 23:57               ` Zhu Jefferry
2015-09-17  7:08                 ` Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).