linux-kernel.vger.kernel.org archive mirror
* [RFC][RT][PATCH 0/4] rtmutex: Simplify PI code
@ 2010-12-23 22:47 Steven Rostedt
  2010-12-23 22:47 ` [RFC][RT][PATCH 1/4] rtmutex: Only save lock depth once in spin_slowlock Steven Rostedt
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Steven Rostedt @ 2010-12-23 22:47 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Lai Jiangshan

This is still a little buggy. I've been hitting a hang when running
dbench 100 in a loop. But I also hit that hang with vanilla -rt.

But I'm sure this still has some things that need to be straightened out.
Anyway, you can either apply the following patches or pull the code
from the listed git repo.

This still needs to be worked out, but since I'm taking off for the
rest of the year, I might as well show you where I ended up.

The following patches are in:

  git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-rt.git

    branch: rt/tip/rt/devel


Lai Jiangshan (1):
      rtmutex: Ensure only the top waiter or higher priority task can take the lock

Steven Rostedt (3):
      rtmutex: Only save lock depth once in spin_slowlock
      rtmutex: Try to take lock early in rt_spin_lock_slowlock()
      rtmutex: Revert Optimize rt lock wakeup

----
 kernel/futex.c          |   22 +--
 kernel/rt.c             |   10 +-
 kernel/rtmutex-debug.c  |    1 -
 kernel/rtmutex.c        |  458 ++++++++++++++++++-----------------------------
 kernel/rtmutex_common.h |   18 +--
 5 files changed, 188 insertions(+), 321 deletions(-)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC][RT][PATCH 1/4] rtmutex: Only save lock depth once in spin_slowlock
  2010-12-23 22:47 [RFC][RT][PATCH 0/4] rtmutex: Simplify PI code Steven Rostedt
@ 2010-12-23 22:47 ` Steven Rostedt
  2010-12-23 22:47 ` [RFC][RT][PATCH 2/4] rtmutex: Try to take lock early in rt_spin_lock_slowlock() Steven Rostedt
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: Steven Rostedt @ 2010-12-23 22:47 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Lai Jiangshan

[-- Attachment #1: 0001-rtmutex-Only-save-lock-depth-once-in-spin_slowlock.patch --]
[-- Type: text/plain, Size: 2101 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

The task that enters rt_spin_lock_slowlock() will not take or
release the kernel lock (BKL) again until it leaves the function.
There's no reason that we need to save and restore the lock depth
for each iteration of the try_lock loop.

Just save and reset the lock depth before entering the loop
and restore it upon exit.
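
For readers skimming the diff below, here is a minimal standalone sketch of
the control-flow change (plain userspace C with illustrative names, not the
kernel code):

#include <stdio.h>

struct task { int lock_depth; };

static int try_take(int attempt)
{
	return attempt == 3;	/* pretend the lock becomes free on the 3rd try */
}

static void slowlock(struct task *t)
{
	int saved_lock_depth = t->lock_depth;	/* save once, before the loop */
	int attempt = 0;

	t->lock_depth = -1;			/* drop the BKL depth once */

	for (;;) {
		if (try_take(++attempt))
			break;
		/* block or adaptive-spin here; no per-iteration save/restore */
	}

	t->lock_depth = saved_lock_depth;	/* restore once, after the loop */
	printf("lock taken after %d tries, lock_depth restored to %d\n",
	       attempt, t->lock_depth);
}

int main(void)
{
	struct task t = { .lock_depth = 2 };

	slowlock(&t);
	return 0;
}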

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/rtmutex.c |   16 +++++++++-------
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 348b1e7..f0ce334 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -821,6 +821,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 	struct rt_mutex_waiter waiter;
 	unsigned long saved_state, flags;
 	struct task_struct *orig_owner;
+	int saved_lock_depth;
 
 	debug_rt_mutex_init_waiter(&waiter);
 	waiter.task = NULL;
@@ -843,8 +844,14 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 	 */
 	saved_state = rt_set_current_blocked_state(current->state);
 
+	/*
+	 * Prevent schedule() to drop BKL, while waiting for
+	 * the lock ! We restore lock_depth when we come back.
+	 */
+	saved_lock_depth = current->lock_depth;
+	current->lock_depth = -1;
+
 	for (;;) {
-		int saved_lock_depth = current->lock_depth;
 
 		/* Try to acquire the lock */
 		if (do_try_to_take_rt_mutex(lock, STEAL_LATERAL))
@@ -863,11 +870,6 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 				continue;
 		}
 
-		/*
-		 * Prevent schedule() to drop BKL, while waiting for
-		 * the lock ! We restore lock_depth when we come back.
-		 */
-		current->lock_depth = -1;
 		orig_owner = rt_mutex_owner(lock);
 		get_task_struct(orig_owner);
 		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
@@ -883,9 +885,9 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 			put_task_struct(orig_owner);
 
 		raw_spin_lock_irqsave(&lock->wait_lock, flags);
-		current->lock_depth = saved_lock_depth;
 		saved_state = rt_set_current_blocked_state(saved_state);
 	}
+	current->lock_depth = saved_lock_depth;
 
 	rt_restore_current_state(saved_state);
 
-- 
1.7.2.3



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC][RT][PATCH 2/4] rtmutex: Try to take lock early in rt_spin_lock_slowlock()
  2010-12-23 22:47 [RFC][RT][PATCH 0/4] rtmutex: Simplify PI code Steven Rostedt
  2010-12-23 22:47 ` [RFC][RT][PATCH 1/4] rtmutex: Only save lock depth once in spin_slowlock Steven Rostedt
@ 2010-12-23 22:47 ` Steven Rostedt
  2010-12-23 22:47 ` [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lock wakeup Steven Rostedt
  2010-12-23 22:47 ` [RFC][RT][PATCH 4/4] rtmutex: Ensure only the top waiter or higher priority task can take the lock Steven Rostedt
  3 siblings, 0 replies; 22+ messages in thread
From: Steven Rostedt @ 2010-12-23 22:47 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Lai Jiangshan

[-- Attachment #1: 0002-rtmutex-Try-to-take-lock-early-in-rt_spin_lock_slowl.patch --]
[-- Type: text/plain, Size: 881 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

Try to take the lock again as soon as we enter the
rt_spin_lock_slowlock() code, before doing the setup and wait loop.
This makes the code closer to what rt_mutex_slowlock() does,
which will help in simplifying this code in later commits.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/rtmutex.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index f0ce334..318d7ed 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -829,6 +829,11 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	init_lists(lock);
 
+	if (do_try_to_take_rt_mutex(lock, STEAL_LATERAL)) {
+		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
+		return;
+	}
+
 	BUG_ON(rt_mutex_owner(lock) == current);
 
 	/*
-- 
1.7.2.3



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lock wakeup
  2010-12-23 22:47 [RFC][RT][PATCH 0/4] rtmutex: Simplify PI code Steven Rostedt
  2010-12-23 22:47 ` [RFC][RT][PATCH 1/4] rtmutex: Only save lock depth once in spin_slowlock Steven Rostedt
  2010-12-23 22:47 ` [RFC][RT][PATCH 2/4] rtmutex: Try to take lock early in rt_spin_lock_slowlock() Steven Rostedt
@ 2010-12-23 22:47 ` Steven Rostedt
  2010-12-24  4:45   ` Gregory Haskins
       [not found]   ` <4D13DF250200005A000793E1@novell.com>
  2010-12-23 22:47 ` [RFC][RT][PATCH 4/4] rtmutex: Ensure only the top waiter or higher priority task can take the lock Steven Rostedt
  3 siblings, 2 replies; 22+ messages in thread
From: Steven Rostedt @ 2010-12-23 22:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Lai Jiangshan,
	Gregory Haskins, Peter Morreale

[-- Attachment #1: 0003-rtmutex-Revert-Optimize-rt-lock-wakeup.patch --]
[-- Type: text/plain, Size: 2377 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

The commit: rtmutex: Optimize rt lock wakeup

does not do what it was supposed to do.
This is because the adaptive waiter sets its state to TASK_(UN)INTERRUPTIBLE
before going into the loop. Thus, the test in wakeup_next_waiter()
will always fail on an adaptive waiter: it only skips the wakeup when
the pending waiter is in TASK_RUNNING, and the pending waiter never has
its state set to TASK_RUNNING unless something else has already woken it up.

The smp_mb() added to make this test work is just as expensive as
calling wakeup itself. And since we fail to skip the wakeup anyway, we
end up doing both the smp_mb() and the wakeup.

I tested this with dbench and we run faster without the optimization.
I also tried a variant that instead fixed the loop, changing the state
only when the spinner was about to go to sleep, and that still did not
show any improvement.
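
For illustration only, a tiny standalone model (userspace C with simplified
names, not the kernel code) of why the skip-wakeup check never fires for an
adaptive waiter:

#include <stdio.h>

enum state { TASK_RUNNING, TASK_UNINTERRUPTIBLE };

struct waiter { enum state state; };

/* owner side: model of the check removed by this patch */
static int owner_skips_wakeup(const struct waiter *w)
{
	return w->state == TASK_RUNNING;
}

int main(void)
{
	struct waiter w = { .state = TASK_RUNNING };

	/* adaptive waiter: state is set *before* entering the spin loop */
	w.state = TASK_UNINTERRUPTIBLE;

	/* ... the waiter is now spinning on the owner in this state ... */

	if (owner_skips_wakeup(&w))
		printf("wakeup skipped\n");	/* never reached in this path */
	else
		printf("smp_mb() + full wakeup, despite the waiter spinning\n");

	return 0;
}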

Cc: Gregory Haskins <ghaskins@novell.com>
Cc: Peter Morreale <pmorreale@novell.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/rtmutex.c |   29 ++---------------------------
 1 files changed, 2 insertions(+), 27 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 318d7ed..e218873 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -554,33 +554,8 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
 	 */
 	if (!savestate)
 		wake_up_process(pendowner);
-	else {
-		/*
-		 * We can skip the actual (expensive) wakeup if the
-		 * waiter is already running, but we have to be careful
-		 * of race conditions because they may be about to sleep.
-		 *
-		 * The waiter-side protocol has the following pattern:
-		 * 1: Set state != RUNNING
-		 * 2: Conditionally sleep if waiter->task != NULL;
-		 *
-		 * And the owner-side has the following:
-		 * A: Set waiter->task = NULL
-		 * B: Conditionally wake if the state != RUNNING
-		 *
-		 * As long as we ensure 1->2 order, and A->B order, we
-		 * will never miss a wakeup.
-		 *
-		 * Therefore, this barrier ensures that waiter->task = NULL
-		 * is visible before we test the pendowner->state.  The
-		 * corresponding barrier is in the sleep logic.
-		 */
-		smp_mb();
-
-		/* If !RUNNING && !RUNNING_MUTEX */
-		if (pendowner->state & ~TASK_RUNNING_MUTEX)
-			wake_up_process_mutex(pendowner);
-	}
+	else
+		wake_up_process_mutex(pendowner);
 
 	rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING);
 
-- 
1.7.2.3



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC][RT][PATCH 4/4] rtmutex: Ensure only the top waiter or higher priority task can take the lock
  2010-12-23 22:47 [RFC][RT][PATCH 0/4] rtmutex: Simplify PI code Steven Rostedt
                   ` (2 preceding siblings ...)
  2010-12-23 22:47 ` [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lock wakeup Steven Rostedt
@ 2010-12-23 22:47 ` Steven Rostedt
  2011-01-04  4:02   ` Steven Rostedt
  3 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2010-12-23 22:47 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Lai Jiangshan

[-- Attachment #1: 0004-rtmutex-Ensure-only-the-top-waiter-or-higher-priorit.patch --]
[-- Type: text/plain, Size: 30982 bytes --]

From: Lai Jiangshan <laijs@cn.fujitsu.com>

In the current rtmutex, the pending owner may be boosted by the tasks
in the rtmutex's waitlist when the pending owner is deboosted
or a task in the waitlist is boosted. This boosting is unneeded,
because the pending owner does not really hold the rtmutex yet.
It is not reasonable.

Example:

time1:
A (high prio) owns the rtmutex.
B (mid prio) and C (low prio) are in the waitlist.

time2:
A releases the lock and B becomes the pending owner.
A (or another high prio task) continues to run. B's prio is lower
than A's, so B is just queued on the runqueue.

time3:
A or the other high prio task sleeps, but some time has passed.
B's and C's priorities changed in the period (time2 ~ time3)
due to boosting or deboosting. Now C has a higher priority
than B. ***Is it reasonable that C has to boost B and help B
get the rtmutex?

NO!! This is unneeded boosting before B really owns the rtmutex.
We should give C a chance to beat B and win the rtmutex.

This is the motivation of this patch. This patch *ensures*
that only the top waiter or a higher priority task can take the lock.

How?
1) We don't dequeue the top waiter on unlock. If the top waiter
   changes, the old top waiter will fail to take the lock and go to
   sleep again.
2) On acquisition, a task gets the lock when the lock is not held and:
   there is no waiter, OR it has higher priority than the waiters,
   OR it is the top waiter (see the sketch below).
3) Whenever the top waiter changes, the new top waiter is woken up.

The algorithm is much simpler than before: no pending owner, no
boosting of a pending owner.
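
A minimal standalone sketch of rule 2) above (userspace C with made-up
names; lower numerical priority means higher priority, as in the kernel):

#include <stdio.h>

struct task { const char *name; int prio; };	/* lower prio value = higher priority */

/* top == NULL means the wait list is empty */
static int can_take_lock(const struct task *t, const struct task *top)
{
	if (!top)			/* no waiter at all       */
		return 1;
	if (t == top)			/* we are the top waiter  */
		return 1;
	return t->prio < top->prio;	/* we beat the top waiter */
}

int main(void)
{
	struct task b = { "B", 20 };	/* old top waiter ("pending owner")  */
	struct task c = { "C", 10 };	/* boosted while B sat in the queue  */

	printf("C can take the free lock: %d\n", can_take_lock(&c, &b));
	printf("B can take the free lock: %d\n", can_take_lock(&b, &b));
	return 0;
}

Whichever of B or C grabs lock->wait_lock first takes the lock; C no longer
has to boost B to make progress.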

Other advantages of this patch:
1) The states of an rtmutex are reduced by half, making the code easier
   to read.
2) The code becomes shorter.
3) The top waiter is not dequeued until it really takes the lock, so
   waiters retain FIFO order when the lock is stolen.

Neither advantage nor disadvantage:
1) Even though we may wake up multiple waiters (any time the top waiter
   changes), we hardly cause a "thundering herd":
   the number of woken tasks is likely one or very few.
2) Two APIs are changed.
   rt_mutex_owner() no longer returns the pending owner; it returns NULL
                    when the top waiter is about to take the lock.
   rt_mutex_next_owner() always returns the top waiter and
                         will not return NULL if there are waiters,
                         because the top waiter is not dequeued.

   I have fixed the code that uses these APIs.

Needs updating after this patch is accepted:
1) Documentation/*
2) the testcase scripts/rt-tester/t4-l2-pi-deboost.tst

[ Ported to the -rt patch set by Steven Rostedt ]

Signed-off-by:  Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4D130D23.1040309@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/futex.c          |   22 +--
 kernel/rt.c             |   10 +-
 kernel/rtmutex-debug.c  |    1 -
 kernel/rtmutex.c        |  414 ++++++++++++++++++-----------------------------
 kernel/rtmutex_common.h |   18 +--
 5 files changed, 175 insertions(+), 290 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 07cd774..570fa4b 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -774,18 +774,10 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
 		return -EINVAL;
 
 	raw_spin_lock(&pi_state->pi_mutex.wait_lock);
+	/* set new owner to the highest priority waiter (top waiter). */
 	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
 
 	/*
-	 * This happens when we have stolen the lock and the original
-	 * pending owner did not enqueue itself back on the rt_mutex.
-	 * Thats not a tragedy. We know that way, that a lock waiter
-	 * is on the fly. We make the futex_q waiter the pending owner.
-	 */
-	if (!new_owner)
-		new_owner = this->task;
-
-	/*
 	 * We pass it to the next owner. (The WAITERS bit is always
 	 * kept enabled while there is PI state around. We must also
 	 * preserve the owner died bit.)
@@ -1515,10 +1507,10 @@ static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
 
 	/*
 	 * We are here either because we stole the rtmutex from the
-	 * pending owner or we are the pending owner which failed to
-	 * get the rtmutex. We have to replace the pending owner TID
-	 * in the user space variable. This must be atomic as we have
-	 * to preserve the owner died bit here.
+	 * previous highest priority waiter or we are the highest priority
+	 * waiter but failed to get the rtmutex the first time.
+	 * We have to replace the newowner TID in the user space variable.
+	 * This must be atomic as we have to preserve the owner died bit here.
 	 *
 	 * Note: We write the user space value _before_ changing the pi_state
 	 * because we can fault here. Imagine swapped out pages or a fork
@@ -1567,8 +1559,8 @@ retry:
 
 	/*
 	 * To handle the page fault we need to drop the hash bucket
-	 * lock here. That gives the other task (either the pending
-	 * owner itself or the task which stole the rtmutex) the
+	 * lock here. That gives the other task (either the highest priority
+	 * waiter itself or the task which stole the rtmutex) the
 	 * chance to try the fixup of the pi_state. So once we are
 	 * back from handling the fault we need to check the pi_state
 	 * after reacquiring the hash bucket lock and before trying to
diff --git a/kernel/rt.c b/kernel/rt.c
index ccbf20f..5ace817 100644
--- a/kernel/rt.c
+++ b/kernel/rt.c
@@ -208,7 +208,7 @@ int __lockfunc rt_read_trylock(rwlock_t *rwlock)
 	 * but not when read_depth == 0 which means that the lock is
 	 * write locked.
 	 */
-	if (rt_mutex_real_owner(lock) != current)
+	if (rt_mutex_owner(lock) != current)
 		ret = rt_mutex_trylock(lock);
 	else if (!rwlock->read_depth)
 		ret = 0;
@@ -238,7 +238,7 @@ void __lockfunc rt_read_lock(rwlock_t *rwlock)
 	/*
 	 * recursive read locks succeed when current owns the lock
 	 */
-	if (rt_mutex_real_owner(lock) != current)
+	if (rt_mutex_owner(lock) != current)
 		__rt_spin_lock(lock);
 	rwlock->read_depth++;
 }
@@ -318,7 +318,7 @@ EXPORT_SYMBOL(rt_up_read);
  */
 void  rt_downgrade_write(struct rw_semaphore *rwsem)
 {
-	BUG_ON(rt_mutex_real_owner(&rwsem->lock) != current);
+	BUG_ON(rt_mutex_owner(&rwsem->lock) != current);
 	rwsem->read_depth = 1;
 }
 EXPORT_SYMBOL(rt_downgrade_write);
@@ -357,7 +357,7 @@ int  rt_down_read_trylock(struct rw_semaphore *rwsem)
 	 * but not when read_depth == 0 which means that the rwsem is
 	 * write locked.
 	 */
-	if (rt_mutex_real_owner(lock) != current)
+	if (rt_mutex_owner(lock) != current)
 		ret = rt_mutex_trylock(&rwsem->lock);
 	else if (!rwsem->read_depth)
 		ret = 0;
@@ -376,7 +376,7 @@ static void __rt_down_read(struct rw_semaphore *rwsem, int subclass)
 
 	rwsem_acquire_read(&rwsem->dep_map, subclass, 0, _RET_IP_);
 
-	if (rt_mutex_real_owner(lock) != current)
+	if (rt_mutex_owner(lock) != current)
 		rt_mutex_lock(&rwsem->lock);
 	rwsem->read_depth++;
 }
diff --git a/kernel/rtmutex-debug.c b/kernel/rtmutex-debug.c
index e7e6314..7b87c81 100644
--- a/kernel/rtmutex-debug.c
+++ b/kernel/rtmutex-debug.c
@@ -161,7 +161,6 @@ void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter)
 	put_pid(waiter->deadlock_task_pid);
 	DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->list_entry));
 	DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
-	DEBUG_LOCKS_WARN_ON(waiter->task);
 	memset(waiter, 0x22, sizeof(*waiter));
 }
 
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index e218873..3eeeb24 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -28,41 +28,34 @@
 /*
  * lock->owner state tracking:
  *
- * lock->owner holds the task_struct pointer of the owner. Bit 0 and 1
- * are used to keep track of the "owner is pending" and "lock has
- * waiters" state.
+ * lock->owner holds the task_struct pointer of the owner. Bit 0
+ * is used to keep track of the "lock has waiters" state.
  *
- * owner	bit1	bit0
- * NULL		0	0	lock is free (fast acquire possible)
- * NULL		0	1	invalid state
- * NULL		1	0	Transitional State*
- * NULL		1	1	invalid state
- * taskpointer	0	0	lock is held (fast release possible)
- * taskpointer	0	1	task is pending owner
- * taskpointer	1	0	lock is held and has waiters
- * taskpointer	1	1	task is pending owner and lock has more waiters
- *
- * Pending ownership is assigned to the top (highest priority)
- * waiter of the lock, when the lock is released. The thread is woken
- * up and can now take the lock. Until the lock is taken (bit 0
- * cleared) a competing higher priority thread can steal the lock
- * which puts the woken up thread back on the waiters list.
+ * owner	bit0
+ * NULL		0	lock is free (fast acquire possible)
+ * NULL		1	lock is free and has waiters and the top waiter
+ *				is going to take the lock*
+ * taskpointer	0	lock is held (fast release possible)
+ * taskpointer	1	lock is held and has waiters**
  *
  * The fast atomic compare exchange based acquire and release is only
- * possible when bit 0 and 1 of lock->owner are 0.
+ * possible when bit 0 of lock->owner is 0.
+ *
+ * (*) It also can be a transitional state when grabbing the lock
+ * with ->wait_lock is held. To prevent any fast path cmpxchg to the lock,
+ * we need to set the bit0 before looking at the lock, and the owner may be
+ * NULL in this small time, hence this can be a transitional state.
  *
- * (*) There's a small time where the owner can be NULL and the
- * "lock has waiters" bit is set.  This can happen when grabbing the lock.
- * To prevent a cmpxchg of the owner releasing the lock, we need to set this
- * bit before looking at the lock, hence the reason this is a transitional
- * state.
+ * (**) There is a small time when bit 0 is set but there are no
+ * waiters. This can happen when grabbing the lock in the slow path.
+ * To prevent a cmpxchg of the owner releasing the lock, we need to
+ * set this bit before looking at the lock.
  */
 
 static void
-rt_mutex_set_owner(struct rt_mutex *lock, struct task_struct *owner,
-		   unsigned long mask)
+rt_mutex_set_owner(struct rt_mutex *lock, struct task_struct *owner)
 {
-	unsigned long val = (unsigned long)owner | mask;
+	unsigned long val = (unsigned long)owner;
 
 	if (rt_mutex_has_waiters(lock))
 		val |= RT_MUTEX_HAS_WAITERS;
@@ -232,15 +225,14 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 	 * reached or the state of the chain has changed while we
 	 * dropped the locks.
 	 */
-	if (!rt_mutex_real_waiter(waiter) || !waiter->task)
+	if (!rt_mutex_real_waiter(waiter))
 		goto out_unlock_pi;
 
 	/*
 	 * Check the orig_waiter state. After we dropped the locks,
-	 * the previous owner of the lock might have released the lock
-	 * and made us the pending owner:
+	 * the previous owner of the lock might have released the lock.
 	 */
-	if (orig_waiter && !orig_waiter->task)
+	if (orig_waiter && !rt_mutex_owner(orig_lock))
 		goto out_unlock_pi;
 
 	/*
@@ -283,6 +275,22 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 
 	/* Release the task */
 	raw_spin_unlock(&task->pi_lock);
+	if (!rt_mutex_owner(lock)) {
+		struct rt_mutex_waiter *lock_top_waiter;
+		/*
+		 * If the requeue above changed the top waiter, then we need
+		 * to wake the new top waiter up to try to get the lock.
+		 */
+		lock_top_waiter = rt_mutex_top_waiter(lock);
+		if (top_waiter != lock_top_waiter) {
+			if (lock_top_waiter->savestate)
+				wake_up_process_mutex(lock_top_waiter->task);
+			else
+				wake_up_process(lock_top_waiter->task);
+		}
+		raw_spin_unlock(&lock->wait_lock);
+		goto out_put_task;
+	}
 	put_task_struct(task);
 
 	/* Grab the next task */
@@ -325,77 +333,16 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 }
 
 /*
- * Optimization: check if we can steal the lock from the
- * assigned pending owner [which might not have taken the
- * lock yet]:
- */
-static inline int try_to_steal_lock(struct rt_mutex *lock,
-				    struct task_struct *task, int mode)
-{
-	struct task_struct *pendowner = rt_mutex_owner(lock);
-	struct rt_mutex_waiter *next;
-
-	if (!rt_mutex_owner_pending(lock))
-		return 0;
-
-	if (pendowner == task)
-		return 1;
-
-	raw_spin_lock(&pendowner->pi_lock);
-	if (!lock_is_stealable(task, pendowner, mode)) {
-		raw_spin_unlock(&pendowner->pi_lock);
-		return 0;
-	}
-
-	/*
-	 * Check if a waiter is enqueued on the pending owners
-	 * pi_waiters list. Remove it and readjust pending owners
-	 * priority.
-	 */
-	if (likely(!rt_mutex_has_waiters(lock))) {
-		raw_spin_unlock(&pendowner->pi_lock);
-		return 1;
-	}
-
-	/* No chain handling, pending owner is not blocked on anything: */
-	next = rt_mutex_top_waiter(lock);
-	plist_del(&next->pi_list_entry, &pendowner->pi_waiters);
-	__rt_mutex_adjust_prio(pendowner);
-	raw_spin_unlock(&pendowner->pi_lock);
-
-	/*
-	 * We are going to steal the lock and a waiter was
-	 * enqueued on the pending owners pi_waiters queue. So
-	 * we have to enqueue this waiter into
-	 * task->pi_waiters list. This covers the case,
-	 * where task is boosted because it holds another
-	 * lock and gets unboosted because the booster is
-	 * interrupted, so we would delay a waiter with higher
-	 * priority as task->normal_prio.
-	 *
-	 * Note: in the rare case of a SCHED_OTHER task changing
-	 * its priority and thus stealing the lock, next->task
-	 * might be task:
-	 */
-	if (likely(next->task != task)) {
-		raw_spin_lock(&task->pi_lock);
-		plist_add(&next->pi_list_entry, &task->pi_waiters);
-		__rt_mutex_adjust_prio(task);
-		raw_spin_unlock(&task->pi_lock);
-	}
-	return 1;
-}
-
-/*
  * Try to take an rt-mutex
  *
- * This fails
- * - when the lock has a real owner
- * - when a different pending owner exists and has higher priority than current
- *
  * Must be called with lock->wait_lock held.
+ *
+ * @lock:   the lock to be acquired.
+ * @task:   the task which wants to acquire the lock
+ * @waiter: the waiter that is queued to the lock's wait list. (could be NULL)
  */
-static int do_try_to_take_rt_mutex(struct rt_mutex *lock, int mode)
+static int do_try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
+				   struct rt_mutex_waiter *waiter, int mode)
 {
 	/*
 	 * We have to be careful here if the atomic speedups are
@@ -418,22 +365,61 @@ static int do_try_to_take_rt_mutex(struct rt_mutex *lock, int mode)
 	 */
 	mark_rt_mutex_waiters(lock);
 
-	if (rt_mutex_owner(lock) && !try_to_steal_lock(lock, current, mode))
+	if (rt_mutex_owner(lock))
 		return 0;
 
+	/*
+	 * It will get the lock because of one of these conditions:
+	 * 1) there is no waiter
+	 * 2) higher priority than waiters
+	 * 3) it is top waiter
+	 */
+	if (rt_mutex_has_waiters(lock)) {
+		struct task_struct *pendowner = rt_mutex_top_waiter(lock)->task;
+		if (task != pendowner && !lock_is_stealable(task, pendowner, mode))
+			return 0;
+	}
+
 	/* We got the lock. */
+
+	if (waiter || rt_mutex_has_waiters(lock)) {
+		unsigned long flags;
+		struct rt_mutex_waiter *top;
+
+		raw_spin_lock_irqsave(&task->pi_lock, flags);
+
+		/* remove the queued waiter. */
+		if (waiter) {
+			plist_del(&waiter->list_entry, &lock->wait_list);
+			task->pi_blocked_on = NULL;
+		}
+
+		/*
+		 * We have to enqueue the top waiter(if it exists) into
+		 * task->pi_waiters list.
+		 */
+		if (rt_mutex_has_waiters(lock)) {
+			top = rt_mutex_top_waiter(lock);
+			top->pi_list_entry.prio = top->list_entry.prio;
+			plist_add(&top->pi_list_entry, &task->pi_waiters);
+		}
+		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+	}
+
 	debug_rt_mutex_lock(lock);
 
-	rt_mutex_set_owner(lock, current, 0);
+	rt_mutex_set_owner(lock, task);
 
-	rt_mutex_deadlock_account_lock(lock, current);
+	rt_mutex_deadlock_account_lock(lock, task);
 
 	return 1;
 }
 
-static inline int try_to_take_rt_mutex(struct rt_mutex *lock)
+static inline int
+try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
+		     struct rt_mutex_waiter *waiter)
 {
-	return do_try_to_take_rt_mutex(lock, STEAL_NORMAL);
+	return do_try_to_take_rt_mutex(lock, task, waiter, STEAL_NORMAL);
 }
 
 /*
@@ -446,7 +432,8 @@ static inline int try_to_take_rt_mutex(struct rt_mutex *lock)
 static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 				   struct rt_mutex_waiter *waiter,
 				   struct task_struct *task,
-				   int detect_deadlock, unsigned long flags)
+				   int detect_deadlock, unsigned long flags,
+				   int savestate)
 {
 	struct task_struct *owner = rt_mutex_owner(lock);
 	struct rt_mutex_waiter *top_waiter = waiter;
@@ -473,6 +460,7 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 	__rt_mutex_adjust_prio(task);
 	waiter->task = task;
 	waiter->lock = lock;
+	waiter->savestate = savestate;
 	plist_node_init(&waiter->list_entry, task->prio);
 	plist_node_init(&waiter->pi_list_entry, task->prio);
 
@@ -485,6 +473,9 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 
 	raw_spin_unlock(&task->pi_lock);
 
+	if (!owner)
+		return 0;
+
 	if (waiter == rt_mutex_top_waiter(lock)) {
 		raw_spin_lock(&owner->pi_lock);
 		plist_del(&top_waiter->pi_list_entry, &owner->pi_waiters);
@@ -521,21 +512,17 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 /*
  * Wake up the next waiter on the lock.
  *
- * Remove the top waiter from the current tasks waiter list and from
- * the lock waiter list. Set it as pending owner. Then wake it up.
+ * Remove the top waiter from the current tasks waiter list and wake it up.
  *
  * Called with lock->wait_lock held.
  */
 static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
 {
 	struct rt_mutex_waiter *waiter;
-	struct task_struct *pendowner;
-	struct rt_mutex_waiter *next;
-
-	raw_spin_lock(&current->pi_lock);
+	struct task_struct *top_task;
 
 	waiter = rt_mutex_top_waiter(lock);
-	plist_del(&waiter->list_entry, &lock->wait_list);
+	top_task = waiter->task;
 
 	/*
 	 * Remove it from current->pi_waiters. We do not adjust a
@@ -543,49 +530,20 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
 	 * boosted mode and go back to normal after releasing
 	 * lock->wait_lock.
 	 */
+	raw_spin_lock(&current->pi_lock);
 	plist_del(&waiter->pi_list_entry, &current->pi_waiters);
-	pendowner = waiter->task;
-	waiter->task = NULL;
-
-	/*
-	 * Do the wakeup before the ownership change to give any spinning
-	 * waiter grantees a headstart over the other threads that will
-	 * trigger once owner changes.
-	 */
-	if (!savestate)
-		wake_up_process(pendowner);
-	else
-		wake_up_process_mutex(pendowner);
-
-	rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING);
-
 	raw_spin_unlock(&current->pi_lock);
 
-	/*
-	 * Clear the pi_blocked_on variable and enqueue a possible
-	 * waiter into the pi_waiters list of the pending owner. This
-	 * prevents that in case the pending owner gets unboosted a
-	 * waiter with higher priority than pending-owner->normal_prio
-	 * is blocked on the unboosted (pending) owner.
-	 */
+	rt_mutex_set_owner(lock, NULL);
 
-	if (rt_mutex_has_waiters(lock))
-		next = rt_mutex_top_waiter(lock);
+	if (!savestate)
+		wake_up_process(top_task);
 	else
-		next = NULL;
-
-	raw_spin_lock(&pendowner->pi_lock);
-
-	WARN_ON(!pendowner->pi_blocked_on);
-	WARN_ON(pendowner->pi_blocked_on != waiter);
-	WARN_ON(pendowner->pi_blocked_on->lock != lock);
-
-	pendowner->pi_blocked_on = NULL;
+		wake_up_process_mutex(top_task);
 
-	if (next)
-		plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
-
-	raw_spin_unlock(&pendowner->pi_lock);
+	WARN_ON(!top_task->pi_blocked_on);
+	WARN_ON(top_task->pi_blocked_on != waiter);
+	WARN_ON(top_task->pi_blocked_on->lock != lock);
 }
 
 /*
@@ -603,11 +561,15 @@ static void remove_waiter(struct rt_mutex *lock,
 
 	raw_spin_lock(&current->pi_lock);
 	plist_del(&waiter->list_entry, &lock->wait_list);
-	waiter->task = NULL;
 	current->pi_blocked_on = NULL;
 	raw_spin_unlock(&current->pi_lock);
 
-	if (first && owner != current) {
+	if (!owner) {
+		BUG_ON(first);
+		return;
+	}
+
+	if (first) {
 
 		raw_spin_lock(&owner->pi_lock);
 
@@ -712,10 +674,6 @@ static int adaptive_wait(struct rt_mutex_waiter *waiter,
 {
 	for (;;) {
 
-		/* we are the owner? */
-		if (!waiter->task)
-			return 0;
-
 		/* Owner changed? Then lets update the original */
 		if (orig_owner != rt_mutex_owner(waiter->lock))
 			return 0;
@@ -795,8 +753,11 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 {
 	struct rt_mutex_waiter waiter;
 	unsigned long saved_state, flags;
-	struct task_struct *orig_owner;
+	/* orig_owner is only set if next_waiter is set */
+	struct task_struct *uninitialized_var(orig_owner);
+	int next_waiter;
 	int saved_lock_depth;
+	int ret;
 
 	debug_rt_mutex_init_waiter(&waiter);
 	waiter.task = NULL;
@@ -804,7 +765,7 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	init_lists(lock);
 
-	if (do_try_to_take_rt_mutex(lock, STEAL_LATERAL)) {
+	if (do_try_to_take_rt_mutex(lock, current, NULL, STEAL_LATERAL)) {
 		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 		return;
 	}
@@ -831,38 +792,33 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 	saved_lock_depth = current->lock_depth;
 	current->lock_depth = -1;
 
+	ret = task_blocks_on_rt_mutex(lock, &waiter, current, 0, flags, 1);
+	BUG_ON(ret);
+
 	for (;;) {
+		int sleep = 1;
 
-		/* Try to acquire the lock */
-		if (do_try_to_take_rt_mutex(lock, STEAL_LATERAL))
+		/* Try to acquire the lock again. */
+		if (do_try_to_take_rt_mutex(lock, current, &waiter, STEAL_LATERAL))
 			break;
 
-		/*
-		 * waiter.task is NULL the first time we come here and
-		 * when we have been woken up by the previous owner
-		 * but the lock got stolen by an higher prio task.
-		 */
-		if (!waiter.task) {
-			task_blocks_on_rt_mutex(lock, &waiter, current, 0,
-						flags);
-			/* Wakeup during boost ? */
-			if (unlikely(!waiter.task))
-				continue;
+		next_waiter = &waiter == rt_mutex_top_waiter(lock);
+		if (next_waiter) {
+			orig_owner = rt_mutex_owner(lock);
+			if (orig_owner)
+				get_task_struct(orig_owner);
 		}
-
-		orig_owner = rt_mutex_owner(lock);
-		get_task_struct(orig_owner);
 		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 		debug_rt_mutex_print_deadlock(&waiter);
 
-		if (adaptive_wait(&waiter, orig_owner)) {
-			put_task_struct(orig_owner);
-
-			if (waiter.task)
-				schedule_rt_mutex(lock);
-		} else
+		if (next_waiter && orig_owner) {
+			if (!adaptive_wait(&waiter, orig_owner))
+				sleep = 0;
 			put_task_struct(orig_owner);
+		}
+		if (sleep)
+			schedule_rt_mutex(lock);
 
 		raw_spin_lock_irqsave(&lock->wait_lock, flags);
 		saved_state = rt_set_current_blocked_state(saved_state);
@@ -872,19 +828,14 @@ rt_spin_lock_slowlock(struct rt_mutex *lock)
 	rt_restore_current_state(saved_state);
 
 	/*
-	 * Extremely rare case, if we got woken up by a non-mutex wakeup,
-	 * and we managed to steal the lock despite us not being the
-	 * highest-prio waiter (due to SCHED_OTHER changing prio), then we
-	 * can end up with a non-NULL waiter.task:
-	 */
-	if (unlikely(waiter.task))
-		remove_waiter(lock, &waiter, flags);
-	/*
 	 * try_to_take_rt_mutex() sets the waiter bit
 	 * unconditionally. We might have to fix that up:
 	 */
 	fixup_rt_mutex_waiters(lock);
 
+	BUG_ON(rt_mutex_has_waiters(lock) && &waiter == rt_mutex_top_waiter(lock));
+	BUG_ON(!plist_node_empty(&waiter.list_entry));
+
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 	debug_rt_mutex_free_waiter(&waiter);
@@ -1056,7 +1007,6 @@ static inline void rt_reacquire_bkl(int saved_lock_depth)
  *			 or TASK_UNINTERRUPTIBLE)
  * @timeout:		 the pre-initialized and started timer, or NULL for none
  * @waiter:		 the pre-initialized rt_mutex_waiter
- * @detect_deadlock:	 passed to task_blocks_on_rt_mutex
  *
  * lock->wait_lock must be held by the caller.
  */
@@ -1064,13 +1014,13 @@ static int __sched
 __rt_mutex_slowlock(struct rt_mutex *lock, int state,
 		    struct hrtimer_sleeper *timeout,
 		    struct rt_mutex_waiter *waiter,
-		    int detect_deadlock, unsigned long flags)
+		    unsigned long flags)
 {
 	int ret = 0;
 
 	for (;;) {
 		/* Try to acquire the lock: */
-		if (try_to_take_rt_mutex(lock))
+		if (try_to_take_rt_mutex(lock, current, waiter))
 			break;
 
 		/*
@@ -1087,39 +1037,11 @@ __rt_mutex_slowlock(struct rt_mutex *lock, int state,
 				break;
 		}
 
-		/*
-		 * waiter->task is NULL the first time we come here and
-		 * when we have been woken up by the previous owner
-		 * but the lock got stolen by a higher prio task.
-		 */
-		if (!waiter->task) {
-			ret = task_blocks_on_rt_mutex(lock, waiter, current,
-						      detect_deadlock, flags);
-			/*
-			 * If we got woken up by the owner then start loop
-			 * all over without going into schedule to try
-			 * to get the lock now:
-			 */
-			if (unlikely(!waiter->task)) {
-				/*
-				 * Reset the return value. We might
-				 * have returned with -EDEADLK and the
-				 * owner released the lock while we
-				 * were walking the pi chain.
-				 */
-				ret = 0;
-				continue;
-			}
-			if (unlikely(ret))
-				break;
-		}
-
 		raw_spin_unlock_irq(&lock->wait_lock);
 
 		debug_rt_mutex_print_deadlock(waiter);
 
-		if (waiter->task)
-			schedule_rt_mutex(lock);
+		schedule_rt_mutex(lock);
 
 		raw_spin_lock_irq(&lock->wait_lock);
 
@@ -1142,13 +1064,12 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
 	unsigned long flags;
 
 	debug_rt_mutex_init_waiter(&waiter);
-	waiter.task = NULL;
 
 	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	init_lists(lock);
 
 	/* Try to acquire the lock again: */
-	if (try_to_take_rt_mutex(lock)) {
+	if (try_to_take_rt_mutex(lock, current, NULL)) {
 		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 		return 0;
 	}
@@ -1169,13 +1090,16 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
 			timeout->task = NULL;
 	}
 
-	ret = __rt_mutex_slowlock(lock, state, timeout, &waiter,
-				  detect_deadlock, flags);
+	ret = task_blocks_on_rt_mutex(lock, &waiter, current, detect_deadlock, flags, 0);
+
+	if (likely(!ret))
+		ret = __rt_mutex_slowlock(lock, state, timeout, &waiter, flags);
 
 	set_current_state(TASK_RUNNING);
 
-	if (unlikely(waiter.task))
+	if (unlikely(ret))
 		remove_waiter(lock, &waiter, flags);
+	BUG_ON(!plist_node_empty(&waiter.list_entry));
 
 	/*
 	 * try_to_take_rt_mutex() sets the waiter bit
@@ -1189,14 +1113,6 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
 	if (unlikely(timeout))
 		hrtimer_cancel(&timeout->timer);
 
-	/*
-	 * Readjust priority, when we did not get the lock. We might
-	 * have been the pending owner and boosted. Since we did not
-	 * take the lock, the PI boost has to go.
-	 */
-	if (unlikely(ret))
-		rt_mutex_adjust_prio(current);
-
 	/* Must we reaquire the BKL? */
 	if (unlikely(saved_lock_depth >= 0))
 		rt_reacquire_bkl(saved_lock_depth);
@@ -1221,7 +1137,7 @@ rt_mutex_slowtrylock(struct rt_mutex *lock)
 
 		init_lists(lock);
 
-		ret = try_to_take_rt_mutex(lock);
+		ret = try_to_take_rt_mutex(lock, current, NULL);
 		/*
 		 * try_to_take_rt_mutex() sets the lock waiters
 		 * bit unconditionally. Clean this up.
@@ -1474,7 +1390,7 @@ void rt_mutex_init_proxy_locked(struct rt_mutex *lock,
 {
 	__rt_mutex_init(lock, NULL);
 	debug_rt_mutex_proxy_lock(lock, proxy_owner);
-	rt_mutex_set_owner(lock, proxy_owner, 0);
+	rt_mutex_set_owner(lock, proxy_owner);
 	rt_mutex_deadlock_account_lock(lock, proxy_owner);
 }
 
@@ -1490,7 +1406,7 @@ void rt_mutex_proxy_unlock(struct rt_mutex *lock,
 			   struct task_struct *proxy_owner)
 {
 	debug_rt_mutex_proxy_unlock(lock);
-	rt_mutex_set_owner(lock, NULL, 0);
+	rt_mutex_set_owner(lock, NULL);
 	rt_mutex_deadlock_account_unlock(proxy_owner);
 }
 
@@ -1517,22 +1433,15 @@ int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 
 	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 
-	mark_rt_mutex_waiters(lock);
-
-	if (!rt_mutex_owner(lock) ||
-	    try_to_steal_lock(lock, task, STEAL_NORMAL)) {
-		/* We got the lock for task. */
-		debug_rt_mutex_lock(lock);
-		rt_mutex_set_owner(lock, task, 0);
+	if (try_to_take_rt_mutex(lock, task, NULL)) {
 		raw_spin_unlock(&lock->wait_lock);
-		rt_mutex_deadlock_account_lock(lock, task);
 		return 1;
 	}
 
 	ret = task_blocks_on_rt_mutex(lock, waiter, task, detect_deadlock,
-				      flags);
+				      flags, 0);
 
-	if (ret == -EDEADLK && !waiter->task) {
+	if (ret == -EDEADLK && !rt_mutex_owner(lock)) {
 		/*
 		 * Reset the return value. We might have
 		 * returned with -EDEADLK and the owner
@@ -1541,6 +1450,10 @@ int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 		 */
 		ret = 0;
 	}
+
+	if (unlikely(ret))
+		remove_waiter(lock, waiter, flags);
+
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 	debug_rt_mutex_print_deadlock(waiter);
@@ -1596,12 +1509,11 @@ int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
 
 	set_current_state(TASK_INTERRUPTIBLE);
 
-	ret = __rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE, to, waiter,
-				  detect_deadlock, flags);
+	ret = __rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE, to, waiter, flags);
 
 	set_current_state(TASK_RUNNING);
 
-	if (unlikely(waiter->task))
+	if (unlikely(ret))
 		remove_waiter(lock, waiter, flags);
 
 	/*
@@ -1612,13 +1524,5 @@ int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
 
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
-	/*
-	 * Readjust priority, when we did not get the lock. We might have been
-	 * the pending owner and boosted. Since we did not take the lock, the
-	 * PI boost has to go.
-	 */
-	if (unlikely(ret))
-		rt_mutex_adjust_prio(current);
-
 	return ret;
 }
diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
index 97fc68c..ee1f353 100644
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -43,6 +43,7 @@ extern void schedule_rt_mutex_test(struct rt_mutex *lock);
  * @list_entry:		pi node to enqueue into the mutex waiters list
  * @pi_list_entry:	pi node to enqueue into the mutex owner waiters list
  * @task:		task reference to the blocked task
+ * @savestate:		should the task be woken with wake_up_process_mutex?
  */
 struct rt_mutex_waiter {
 	struct plist_node	list_entry;
@@ -54,6 +55,7 @@ struct rt_mutex_waiter {
 	struct pid		*deadlock_task_pid;
 	struct rt_mutex		*deadlock_lock;
 #endif
+	int			savestate;
 };
 
 /*
@@ -91,9 +93,8 @@ task_top_pi_waiter(struct task_struct *p)
 /*
  * lock->owner state tracking:
  */
-#define RT_MUTEX_OWNER_PENDING	1UL
-#define RT_MUTEX_HAS_WAITERS	2UL
-#define RT_MUTEX_OWNER_MASKALL	3UL
+#define RT_MUTEX_HAS_WAITERS	1UL
+#define RT_MUTEX_OWNER_MASKALL	1UL
 
 static inline struct task_struct *rt_mutex_owner(struct rt_mutex *lock)
 {
@@ -101,17 +102,6 @@ static inline struct task_struct *rt_mutex_owner(struct rt_mutex *lock)
 		((unsigned long)lock->owner & ~RT_MUTEX_OWNER_MASKALL);
 }
 
-static inline struct task_struct *rt_mutex_real_owner(struct rt_mutex *lock)
-{
-	return (struct task_struct *)
-		((unsigned long)lock->owner & ~RT_MUTEX_HAS_WAITERS);
-}
-
-static inline unsigned long rt_mutex_owner_pending(struct rt_mutex *lock)
-{
-	return (unsigned long)lock->owner & RT_MUTEX_OWNER_PENDING;
-}
-
 /*
  * PI-futex support (proxy locking functions, etc.):
  */
-- 
1.7.2.3



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lock wakeup
  2010-12-23 22:47 ` [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lock wakeup Steven Rostedt
@ 2010-12-24  4:45   ` Gregory Haskins
  2010-12-24  4:54     ` Steven Rostedt
       [not found]   ` <4D13DF250200005A000793E1@novell.com>
  1 sibling, 1 reply; 22+ messages in thread
From: Gregory Haskins @ 2010-12-24  4:45 UTC (permalink / raw)
  To: Steven Rostedt, linux-kernel
  Cc: Lai Jiangshan, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Peter Morreale

Hey Steve,

>>> On 12/23/2010 at 05:47 PM, in message <20101223225116.729981172@goodmis.org>,
Steven Rostedt <rostedt@goodmis.org> wrote: 
> From: Steven Rostedt <srostedt@redhat.com>
> 
> The commit: rtmutex: Optimize rt lock wakeup
> 
> Does not do what it was supposed to do.
> This is because the adaptive waiter sets its state to TASK_(UN)INTERRUPTIBLE
> before going into the loop. Thus, the test in wakeup_next_waiter()
> will always fail on an adaptive waiter, as it only tests to see if
> the pending waiter never has its state set to TASK_RUNNING unless
> something else had woken it up.
> 
> The smp_mb() added to make this test work is just as expensive as
> just calling wakeup. And since we fail to wake up anyway, we are
> doing both a smp_mb() and wakeup as well.
> 
> I tested this with dbench and we run faster without this patch.
> I also tried a variant that instead fixed the loop, to change the state
> only if the spinner was to go to sleep, and that still did not show
> any improvement.

Just a quick note to say I am a bit skeptical of this patch.  I know you are offline next week, so let's plan on hashing it out after the new year before I ack it.

Happy holidays!
-Greg

> 
> Cc: Gregory Haskins <ghaskins@novell.com>
> Cc: Peter Morreale <pmorreale@novell.com>
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
>  kernel/rtmutex.c |   29 ++---------------------------
>  1 files changed, 2 insertions(+), 27 deletions(-)
> 
> diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
> index 318d7ed..e218873 100644
> --- a/kernel/rtmutex.c
> +++ b/kernel/rtmutex.c
> @@ -554,33 +554,8 @@ static void wakeup_next_waiter(struct rt_mutex *lock, 
> int savestate)
>  	 */
>  	if (!savestate)
>  		wake_up_process(pendowner);
> -	else {
> -		/*
> -		 * We can skip the actual (expensive) wakeup if the
> -		 * waiter is already running, but we have to be careful
> -		 * of race conditions because they may be about to sleep.
> -		 *
> -		 * The waiter-side protocol has the following pattern:
> -		 * 1: Set state != RUNNING
> -		 * 2: Conditionally sleep if waiter->task != NULL;
> -		 *
> -		 * And the owner-side has the following:
> -		 * A: Set waiter->task = NULL
> -		 * B: Conditionally wake if the state != RUNNING
> -		 *
> -		 * As long as we ensure 1->2 order, and A->B order, we
> -		 * will never miss a wakeup.
> -		 *
> -		 * Therefore, this barrier ensures that waiter->task = NULL
> -		 * is visible before we test the pendowner->state.  The
> -		 * corresponding barrier is in the sleep logic.
> -		 */
> -		smp_mb();
> -
> -		/* If !RUNNING && !RUNNING_MUTEX */
> -		if (pendowner->state & ~TASK_RUNNING_MUTEX)
> -			wake_up_process_mutex(pendowner);
> -	}
> +	else
> +		wake_up_process_mutex(pendowner);
>  
>  	rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING);
>  





^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lock wakeup
  2010-12-24  4:45   ` Gregory Haskins
@ 2010-12-24  4:54     ` Steven Rostedt
  2010-12-28 14:06       ` Gregory Haskins
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2010-12-24  4:54 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: linux-kernel, Lai Jiangshan, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Peter Morreale

On Thu, 2010-12-23 at 21:45 -0700, Gregory Haskins wrote:
> Hey Steve,
> 
> >>> On 12/23/2010 at 05:47 PM, in message <20101223225116.729981172@goodmis.org>,
> Steven Rostedt <rostedt@goodmis.org> wrote: 
> > From: Steven Rostedt <srostedt@redhat.com>
> > 
> > The commit: rtmutex: Optimize rt lock wakeup
> > 
> > Does not do what it was supposed to do.
> > This is because the adaptive waiter sets its state to TASK_(UN)INTERRUPTIBLE
> > before going into the loop. Thus, the test in wakeup_next_waiter()
> > will always fail on an adaptive waiter, as it only tests to see if
> > the pending waiter never has its state set to TASK_RUNNING unless
> > something else had woken it up.
> > 
> > The smp_mb() added to make this test work is just as expensive as
> > just calling wakeup. And since we fail to wake up anyway, we are
> > doing both a smp_mb() and wakeup as well.
> > 
> > I tested this with dbench and we run faster without this patch.
> > I also tried a variant that instead fixed the loop, to change the state
> > only if the spinner was to go to sleep, and that still did not show
> > any improvement.
> 
> Just a quick note to say I am a bit skeptical of this patch.  I know you are offline next week, so lets plan on hashing it out after the new year before I ack it.

Sure, but as I said, it is mostly broken anyway. I could even insert
some tracepoints to show that this is always missed (heck I'll add an
unlikely and do the branch profiler ;-)

The reason is that adaptive spinners spin in some other state than
TASK_RUNNING, thus it does not help adaptive spinners at all. I first
tried to fix that, but it made dbench run even slower. But I only did a
few tests, and only on a 4 CPU box, so it was a rather small sample. The
removal of the code had to deal with more that it was already broken
than anything else.

But yeah, we can hash this out in the new year. This is one of the
reasons I only posted this patch set as an RFC.

> 
> Happy holidays!

You too!

-- Steve



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lock wakeup
       [not found]   ` <4D13DF250200005A000793E1@novell.com>
@ 2010-12-24 15:47     ` Peter W. Morreale
  0 siblings, 0 replies; 22+ messages in thread
From: Peter W. Morreale @ 2010-12-24 15:47 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Steven Rostedt, linux-kernel, Lai Jiangshan, Ingo Molnar,
	Peter Zijlstra, Thomas Gleixner

On Thu, 2010-12-23 at 21:45 -0700, Gregory Haskins wrote:
> Hey Steve,
> 
> >>> On 12/23/2010 at 05:47 PM, in message <20101223225116.729981172@goodmis.org>,
> Steven Rostedt <rostedt@goodmis.org> wrote: 
> > From: Steven Rostedt <srostedt@redhat.com>
> > 
> > The commit: rtmutex: Optimize rt lock wakeup
> > 
> > Does not do what it was supposed to do.
> > This is because the adaptive waiter sets its state to TASK_(UN)INTERRUPTIBLE
> > before going into the loop. Thus, the test in wakeup_next_waiter()
> > will always fail on an adaptive waiter, as it only tests to see if
> > the pending waiter never has its state set to TASK_RUNNING unless
> > something else had woken it up.
> > 
> > The smp_mb() added to make this test work is just as expensive as
> > just calling wakeup. And since we fail to wake up anyway, we are
> > doing both a smp_mb() and wakeup as well.
> > 
> > I tested this with dbench and we run faster without this patch.
> > I also tried a variant that instead fixed the loop, to change the state
> > only if the spinner was to go to sleep, and that still did not show
> > any improvement.
> 
> Just a quick note to say I am a bit skeptical of this patch.  I know you are offline next week, so lets plan on hashing it out after the new year before I ack it.
> 

We shouldn't be too quick to merely rip this out without a little
thinking.  Clearly this is broken; however, the intent was clear:

if a waiter is spinning, we don't need to wake it up.

The wake up path is substantially more than a barrier; it includes a
barrier as well as grabbing the task_rq_lock only to find that the task
is oncpu. Then various accounting is updated, etc. 

We know definitively that a waiter can only spin if the owner is oncpu,
by definition of adaptive spinning.  We also know that only the owner
can release the lock to a waiter (spinning or not).   So it seems clear
that avoiding unnecessary contention on the rq lock would be a Good
Thing(tm). 

Perhaps this cannot be done safely, but if you saw an improvement in
dbench by merely removing a barrier, imagine the improvement by removing
the contention on the lock.

Happy Holidays to all!

-PWM


 

> Happy holidays!
> -Greg
> 
> > 
> > Cc: Gregory Haskins <ghaskins@novell.com>
> > Cc: Peter Morreale <pmorreale@novell.com>
> > Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> > ---
> >  kernel/rtmutex.c |   29 ++---------------------------
> >  1 files changed, 2 insertions(+), 27 deletions(-)
> > 
> > diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
> > index 318d7ed..e218873 100644
> > --- a/kernel/rtmutex.c
> > +++ b/kernel/rtmutex.c
> > @@ -554,33 +554,8 @@ static void wakeup_next_waiter(struct rt_mutex *lock, 
> > int savestate)
> >  	 */
> >  	if (!savestate)
> >  		wake_up_process(pendowner);
> > -	else {
> > -		/*
> > -		 * We can skip the actual (expensive) wakeup if the
> > -		 * waiter is already running, but we have to be careful
> > -		 * of race conditions because they may be about to sleep.
> > -		 *
> > -		 * The waiter-side protocol has the following pattern:
> > -		 * 1: Set state != RUNNING
> > -		 * 2: Conditionally sleep if waiter->task != NULL;
> > -		 *
> > -		 * And the owner-side has the following:
> > -		 * A: Set waiter->task = NULL
> > -		 * B: Conditionally wake if the state != RUNNING
> > -		 *
> > -		 * As long as we ensure 1->2 order, and A->B order, we
> > -		 * will never miss a wakeup.
> > -		 *
> > -		 * Therefore, this barrier ensures that waiter->task = NULL
> > -		 * is visible before we test the pendowner->state.  The
> > -		 * corresponding barrier is in the sleep logic.
> > -		 */
> > -		smp_mb();
> > -
> > -		/* If !RUNNING && !RUNNING_MUTEX */
> > -		if (pendowner->state & ~TASK_RUNNING_MUTEX)
> > -			wake_up_process_mutex(pendowner);
> > -	}
> > +	else
> > +		wake_up_process_mutex(pendowner);
> >  
> >  	rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING);
> >  
> 
> 
> 



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lock wakeup
  2010-12-24  4:54     ` Steven Rostedt
@ 2010-12-28 14:06       ` Gregory Haskins
  2011-01-03 19:06         ` Steven Rostedt
  0 siblings, 1 reply; 22+ messages in thread
From: Gregory Haskins @ 2010-12-28 14:06 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Lai Jiangshan, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Peter Morreale, linux-kernel

>>> On 12/23/2010 at 11:54 PM, in message
<1293166464.22802.415.camel@gandalf.stny.rr.com>, Steven Rostedt
<rostedt@goodmis.org> wrote: 
> On Thu, 2010-12-23 at 21:45 -0700, Gregory Haskins wrote:
>> Hey Steve,
>> 
>> >>> On 12/23/2010 at 05:47 PM, in message <20101223225116.729981172@goodmis.org>,
>> Steven Rostedt <rostedt@goodmis.org> wrote: 
>> > From: Steven Rostedt <srostedt@redhat.com>
>> > 
>> > The commit: rtmutex: Optimize rt lock wakeup
>> > 
>> > Does not do what it was supposed to do.
>> > This is because the adaptive waiter sets its state to 
> TASK_(UN)INTERRUPTIBLE
>> > before going into the loop. Thus, the test in wakeup_next_waiter()
>> > will always fail on an adaptive waiter, as it only tests to see if
>> > the pending waiter never has its state set to TASK_RUNNING unless
>> > something else had woken it up.
>> > 
>> > The smp_mb() added to make this test work is just as expensive as
>> > just calling wakeup. And since we fail to wake up anyway, we are
>> > doing both a smp_mb() and wakeup as well.
>> > 
>> > I tested this with dbench and we run faster without this patch.
>> > I also tried a variant that instead fixed the loop, to change the state
>> > only if the spinner was to go to sleep, and that still did not show
>> > any improvement.
>> 
>> Just a quick note to say I am a bit skeptical of this patch.  I know you are 
> offline next week, so lets plan on hashing it out after the new year before I 
> ack it.
> 
> Sure, but as I said, it is mostly broken anyway. I could even insert
> some tracepoints to show that this is always missed (heck I'll add an
> unlikely and do the branch profiler ;-)

Well, I think that would be a good datapoint and is one of the things I'd like to see.

> 
> The reason is that adaptive spinners spin in some other state than
> TASK_RUNNING, thus it does not help adaptive spinners at all. I first
> tried to fix that, but it made dbench run even slower.

This is why I am skeptical.  You are essentially asserting there are two issues here, IIUC:

1) The intent of avoiding a wakeup is broken and we take the double whammy of a mb()
plus the wakeup() anyway.

2) mb() is apparently slower than wakeup().

I agree (1) is plausible, though I would like to see the traces to confirm.  It's been a long time
since I looked at that code, but I think the original code either ran in RUNNING_MUTEX and was
inadvertently broken in the meantime, or the other cpu would have transitioned to RUNNING on
its own when we flipped the owner before the release-side check was performed.  Or perhaps
we just plain screwed this up and it was racy ;)  I'm not sure.  But as Peter (M) stated, it seems
like a shame to walk away from the concept without further investigation.  I think everyone can
agree that at the very least, if it is in fact taking a double whammy we should fix that.

For (2), I am skeptical in two parts ;).  You stated you thought mb() was just as expensive as a
wakeup, which seems suspect to me, given a wakeup needs to be a superset of a barrier
II[R|U]C.  Let's call this "2a".  In addition, your results when you removed the logic and went
straight to a wakeup() and found dbench actually was faster than the "fixed mb()" path would
imply wakeup() is actually _faster_ than mb().  Let's call this "2b".

For (2a), I would like to see some traces that compare mb() to wakeup() (of a presumably 
already running task that happens to be in the INTERRUPTIBLE state) to be convinced that wakeup() is
equal/faster.  I suspect it isn't.

For (2b), I would suggest that we don't rely on dbench alone in evaluating the merit of the 
change.  In some ways, it's a great test for this type of change since it leans heavily on the coarse
VFS locks.  However, dbench is also pretty odd and thrives on somewhat chaotic behavior.  For
instance, it loves the "lateral steal" logic, even though this patch technically breaks fairness.  So
I would propose that a suite of benchmarks known for creating as much lock contention as
possible be run in addition to dbench alone.

Happy new year, all,
-Greg


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lock wakeup
  2010-12-28 14:06       ` Gregory Haskins
@ 2011-01-03 19:06         ` Steven Rostedt
  2011-01-03 20:22           ` Steven Rostedt
  2011-01-10 14:49           ` Gregory Haskins
  0 siblings, 2 replies; 22+ messages in thread
From: Steven Rostedt @ 2011-01-03 19:06 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Lai Jiangshan, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Peter Morreale, linux-kernel

On Tue, 2010-12-28 at 07:06 -0700, Gregory Haskins wrote:
> >>> On 12/23/2010 at 11:54 PM, in message

> > Sure, but as I said, it is mostly broken anyway. I could even insert
> > some tracepoints to show that this is always missed (heck I'll add an
> > unlikely and do the branch profiler ;-)
> 
> Well, I think that would be a good datapoint and is one of the things I'd like to see.

OK, here it is, after running a simple "dbench 10":

 correct incorrect  %        Function                  File              Line
 ------- ---------  -        --------                  ----              ----
     150   726979  99 wakeup_next_waiter             rtmutex.c            581

And the patch I added:

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 318d7ed..dacdcb6 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -578,7 +578,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
 		smp_mb();
 
 		/* If !RUNNING && !RUNNING_MUTEX */
-		if (pendowner->state & ~TASK_RUNNING_MUTEX)
+		if (unlikely(pendowner->state & ~TASK_RUNNING_MUTEX))
 			wake_up_process_mutex(pendowner);
 	}
 

Interesting that we hit 150 times that the new owner was already awake.
Perhaps it was because the new owner was about to spin on a spinlock
after it had put itself into the TASK_INTERRUPTIBLE state, and it was
woken up by someone else?


> 
> > 
> > The reason is that adaptive spinners spin in some other state than
> > TASK_RUNNING, thus it does not help adaptive spinners at all. I first
> > tried to fix that, but it made dbench run even slower.
> 
> This is why I am skeptical.  You are essentially asserting there are two issues here, IIUC:
> 
> 1) The intent of avoiding a wakeup is broken and we take the double whammy of a mb()
> plus the wakeup() anyway.

Yep, this is what I believe is happening.

> 
> 2) mb() is apparently slower than wakeup().

Forget this point, as #1 seems to be the main issue. Also, I'm not sure
it is safe to "fix" this, as I started to try.


> 
> I agree (1) is plausible, though I would like to see the traces to confirm.  Its been a long time
> since I looked at that code, but I think the original code either ran in RUNNING_MUTEX and was
> inadvertently broken in the mean time or the other cpu would have transitioned to RUNNING on
> its own when we flipped the owner before the release-side check was performed.  Or perhaps
> we just plain screwed this up and it was racy ;)  I'm not sure.  But as Peter (M) stated, it seems
> like a shame to walk away from the concept without further investigation.  I think everyone can
> agree that at the very least, if it is in fact taking a double whammy we should fix that.

[ cut out all the 2's since I'm not worried about that now, even if it 
  is not a problem. ]


Now, the way I was going to fix this is to have the adaptive wait be in
TASK_RUNNING state, and then change the state iff we are going to sleep.

But this can be a bit more racy, especially with Lai's new changes: with
those changes, the waiter->task does not get nulled anymore.

The test to take the lock now needs to be under the lock->wait_lock
spinlock. We have to grab that lock, check that the owner is NULL, and
check that we are the next waiter in the queue. Thus, on taking the lock
we need something like this:


	if (adaptive_wait(&waiter, orig_owner))
		sleep = 1;
	else
		sleep = 0;

	if (sleep) {
		raw_spin_lock(&lock->wait_lock);
		saved_state = rt_set_current_blocked_state(saved_state);
		if (!lock->owner && &waiter == rt_mutex_top_waiter(lock))
			sleep = 0;
		raw_spin_unlock(&lock->wait_lock);
		if (sleep)
			schedule_rt_mutex(lock);
		saved_state = rt_restore_current_blocked_state(saved_state);
	}

Otherwise we can risk the wakeup_next_waiter() missing the wakeup.

To clarify, we want the adaptive_wait() to run as TASK_RUNNING. Then, if
we must sleep, we must set the state to TASK_UNINTERRUPTIBLE, test again
whether we can still take the lock, and only sleep if we cannot.
Otherwise, if a wakeup happens just before we set the state to
TASK_UNINTERRUPTIBLE, we miss the wakeup altogether.
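
For reference, the same idea in its generic form (just a minimal sketch of
the pattern, not the rtmutex code; condition_holds() is a made-up
placeholder for "we can take the lock"):

	/*
	 * Classic lost-wakeup avoidance: set the sleeping state first, then
	 * re-check the condition, and only then sleep.  The waker does the
	 * mirror image: it makes the condition true before it checks or
	 * changes our state, so a wakeup between the re-check and schedule()
	 * just flips us back to TASK_RUNNING and we are never dequeued for
	 * good.
	 */
	static void sleep_until_condition(void)
	{
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (!condition_holds())		/* made-up re-check */
			schedule();
		__set_current_state(TASK_RUNNING);
	}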

I can do this change, and see what impact it makes.

I'm also curious whether this ever worked. If it did not, are you sure
the tests that showed its benefit were valid? I don't have a large-scale
box at my disposal ATM, so I can only see what this does on 4way
machines.

-- Steve



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lock wakeup
  2011-01-03 19:06         ` Steven Rostedt
@ 2011-01-03 20:22           ` Steven Rostedt
  2011-01-04 15:19             ` Peter W. Morreale
  2011-01-10 14:49           ` Gregory Haskins
  1 sibling, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2011-01-03 20:22 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Lai Jiangshan, Ingo Molnar, Peter Zijlstra, ThomasGleixner,
	Peter Morreale, linux-kernel

On Mon, 2011-01-03 at 14:06 -0500, Steven Rostedt wrote:

> 
> 	if (adaptive_wait(&waiter, orig_owner))
> 		sleep = 1;
> 	else
> 		sleep = 0;
> 
> 	if (sleep)


> 		raw_spin_lock(&lock->wait_lock);
> 		saved_state = rt_set_current_block_state(saved_state);
> 		if (!lock->owner && &waiter == rt_mutex_top_waiter(lock))
> 			sleep = 0;
> 		raw_spin_unlock(&lock->wait_lock);

I may be able to remove the above locks and replace it with:

	saved_state = rt_set_current_blocked_state(saved_state);
	if (orig_owner == rt_mutex_owner(lock))
		schedule_rt_mutex(lock);

-- Steve


> 		if (sleep)
> 			schedule_rt_mutex(lock);
> 		saved_state = rt_restore_current_blocked_state(saved_state);
> 	}
> 
> Otherwise we can risk the wakeup_next_waiter() missing the wakeup.
> 
> To clarify, we want the adaptive_wait() to run as TASK_RUNNING. Then if
> we must sleep, then we must set the state to TASK_UNINTERRUPTIBLE, test
> again if we can still the lock, and if not then sleep. Otherwise, if a
> wakeup happens just before we set the state to TASK_UNINTERRUPTIBLE,
> then we miss the wake up all together.
> 
> I can do this change, and see what impact it makes.
> 
> I'm also curious if this ever worked? If it did not, then are you sure
> your tests that show the benefit of it was true. I don't have a large
> scale box at my disposal ATM, so I can only see what this does on 4way
> machines.
> 
> -- Steve
> 



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][RT][PATCH 4/4] rtmutex: Ensure only the top waiter or higher priority task can take the lock
  2010-12-23 22:47 ` [RFC][RT][PATCH 4/4] rtmutex: Ensure only the top waiter or higher priority task can take the lock Steven Rostedt
@ 2011-01-04  4:02   ` Steven Rostedt
  2011-01-05  7:12     ` Lai Jiangshan
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2011-01-04  4:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Lai Jiangshan

On Thu, 2010-12-23 at 17:47 -0500, Steven Rostedt wrote:
> plain text document attachment
> (0004-rtmutex-Ensure-only-the-top-waiter-or-higher-priorit.patch)
> From: Lai Jiangshan <laijs@cn.fujitsu.com>
> 
> In current rtmutex, the pending owner may be boosted by the tasks
> in the rtmutex's waitlist when the pending owner is deboosted
> or a task in the waitlist is boosted. This boosting is unrelated,
> because the pending owner does not really take the rtmutex.
> It is not reasonable.
> 

I'm still hitting some bugs with the port to -rt, but I also noticed
something that doesn't look too good.

There's several places in the kernel where a task may release and
acquire the same lock multiple times in a row.

The old way of removing the pending owner from the lists and waking it
up once, would have the high prio task wake it up once, and then it can
grab the locks multiple times without modifying the list, since the
pending owner is already awake and not in the pi list anymore.

The new way has the owner remove the woken task from its pi list and
wakes it up, but when it steals the lock again, it adds this owner back
to its pi list. When it releases the lock, it wakes it up again and
removes it from its pi list again. This happens over and over again.

Running dbench, I see this happen (probably on one of the dcache locks)
literally 100 times in a row.
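
For illustration only, this is the kind of pattern that hits it (the lock
is a made-up stand-in for whatever dcache lock dbench is pounding on):

	DEFINE_SPINLOCK(some_hot_lock);	/* on -rt, a sleeping spinlock, i.e. an rtmutex */

	void hammer_the_lock(void)
	{
		int i;

		for (i = 0; i < 100; i++) {
			/* re-taking the lock re-adds the still-queued top
			 * waiter to this task's pi list */
			spin_lock(&some_hot_lock);
			/* short critical section */
			spin_unlock(&some_hot_lock);
			/* unlocking wakes the top waiter and removes it from
			 * our pi list again */
		}
	}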

I'm a bit nervous about adding this overhead. I'm still working in debug
mode, so I'm not at a point where I can do real benchmarks yet.

-- Steve



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lock wakeup
  2011-01-03 20:22           ` Steven Rostedt
@ 2011-01-04 15:19             ` Peter W. Morreale
  2011-01-04 15:47               ` Steven Rostedt
  0 siblings, 1 reply; 22+ messages in thread
From: Peter W. Morreale @ 2011-01-04 15:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Gregory Haskins, Lai Jiangshan, Ingo Molnar, Peter Zijlstra,
	ThomasGleixner, linux-kernel

On Mon, 2011-01-03 at 15:22 -0500, Steven Rostedt wrote:
> On Mon, 2011-01-03 at 14:06 -0500, Steven Rostedt wrote:
> 
> > 
> > 	if (adaptive_wait(&waiter, orig_owner))
> > 		sleep = 1;
> > 	else
> > 		sleep = 0;
> > 
> > 	if (sleep)
> 
> 
> > 		raw_spin_lock(&lock->wait_lock);
> > 		saved_state = rt_set_current_block_state(saved_state);
> > 		if (!lock->owner && &waiter == rt_mutex_top_waiter(lock))
> > 			sleep = 0;
> > 		raw_spin_unlock(&lock->wait_lock);
> 
> I may be able to remove the above locks and replace it with:
> 
> 	saved_state = rt_set_current_blocked_state(saved_state);
> 	if (orig_owner == rt_mutex_owner(lock))
> 		schedule_rt_mutex(lock);
> 
> -- Steve

Isn't it possible to miss a wakeup here if the waiter becomes preempted?

Recall that adaptive wait is a preemptive wait.  Hence the (I believe)
original reason we did the adaptive spin in a (transitioning)  sleep
state.

-PWM

> 
> 
> > 		if (sleep)
> > 			schedule_rt_mutex(lock);
> > 		saved_state = rt_restore_current_blocked_state(saved_state);
> > 	}
> > 
> > Otherwise we can risk the wakeup_next_waiter() missing the wakeup.
> > 
> > To clarify, we want the adaptive_wait() to run as TASK_RUNNING. Then if
> > we must sleep, then we must set the state to TASK_UNINTERRUPTIBLE, test
> > again if we can still the lock, and if not then sleep. Otherwise, if a
> > wakeup happens just before we set the state to TASK_UNINTERRUPTIBLE,
> > then we miss the wake up all together.
> > 
> > I can do this change, and see what impact it makes.
> > 
> > I'm also curious if this ever worked? If it did not, then are you sure
> > your tests that show the benefit of it was true. I don't have a large
> > scale box at my disposal ATM, so I can only see what this does on 4way
> > machines.
> > 
> > -- Steve
> > 
> 
> 



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lock wakeup
  2011-01-04 15:19             ` Peter W. Morreale
@ 2011-01-04 15:47               ` Steven Rostedt
  2011-01-04 17:15                 ` Peter W. Morreale
  2011-01-04 17:24                 ` Peter W. Morreale
  0 siblings, 2 replies; 22+ messages in thread
From: Steven Rostedt @ 2011-01-04 15:47 UTC (permalink / raw)
  To: pmorreale
  Cc: Gregory Haskins, Lai Jiangshan, Ingo Molnar, Peter Zijlstra,
	ThomasGleixner, linux-kernel

On Tue, 2011-01-04 at 08:19 -0700, Peter W. Morreale wrote:
> On Mon, 2011-01-03 at 15:22 -0500, Steven Rostedt wrote:
> > On Mon, 2011-01-03 at 14:06 -0500, Steven Rostedt wrote:
> > 
> > > 
> > > 	if (adaptive_wait(&waiter, orig_owner))
> > > 		sleep = 1;
> > > 	else
> > > 		sleep = 0;
> > > 
> > > 	if (sleep)
> > 
> > 
> > > 		raw_spin_lock(&lock->wait_lock);
> > > 		saved_state = rt_set_current_block_state(saved_state);
> > > 		if (!lock->owner && &waiter == rt_mutex_top_waiter(lock))
> > > 			sleep = 0;
> > > 		raw_spin_unlock(&lock->wait_lock);
> > 
> > I may be able to remove the above locks and replace it with:
> > 
> > 	saved_state = rt_set_current_blocked_state(saved_state);
> > 	if (orig_owner == rt_mutex_owner(lock))
> > 		schedule_rt_mutex(lock);
> > 
> > -- Steve
> 
> Isn't it possible to miss a wakeup here if the waiter becomes preempted?

Why? Preemption doesn't change the task state.

> 
> Recall that adaptive wait is a preemptive wait.  Hence the (I believe)
> original reason we did the adaptive spin in a (transitioning)  sleep
> state.

Yes it is a preemptive wait, I would not have accepted the patches if it
was anything else. But preemption is not affected by the state of the
task. A task could be TASK_RUNNING or TASK_INTERRUPTIBLE or
TASK_UNINTERRUPTIBLE, and that would not affect how it acts when it is
preempted.
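
The scheduler core makes that explicit: when we enter schedule() via
preemption, a non-running ->state does not get the task dequeued.
Roughly (simplified from the 2.6-era schedule(); not an exact quote):

	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
		/* only a voluntary schedule with a non-running state
		 * deactivates the task; a preempted task stays on the rq */
		deactivate_task(rq, prev, 1);
	}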

-- Steve



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lock wakeup
  2011-01-04 15:47               ` Steven Rostedt
@ 2011-01-04 17:15                 ` Peter W. Morreale
  2011-01-04 17:27                   ` Steven Rostedt
  2011-01-04 17:24                 ` Peter W. Morreale
  1 sibling, 1 reply; 22+ messages in thread
From: Peter W. Morreale @ 2011-01-04 17:15 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Gregory Haskins, Lai Jiangshan, Ingo Molnar, Peter Zijlstra,
	ThomasGleixner, linux-kernel

On Tue, 2011-01-04 at 10:47 -0500, Steven Rostedt wrote:
> On Tue, 2011-01-04 at 08:19 -0700, Peter W. Morreale wrote:
> > On Mon, 2011-01-03 at 15:22 -0500, Steven Rostedt wrote:
> > > On Mon, 2011-01-03 at 14:06 -0500, Steven Rostedt wrote:
> > > 
> > > > 
> > > > 	if (adaptive_wait(&waiter, orig_owner))
> > > > 		sleep = 1;
> > > > 	else
> > > > 		sleep = 0;
> > > > 
> > > > 	if (sleep)
> > > 
> > > 
> > > > 		raw_spin_lock(&lock->wait_lock);
> > > > 		saved_state = rt_set_current_block_state(saved_state);
> > > > 		if (!lock->owner && &waiter == rt_mutex_top_waiter(lock))
> > > > 			sleep = 0;
> > > > 		raw_spin_unlock(&lock->wait_lock);
> > > 
> > > I may be able to remove the above locks and replace it with:
> > > 
> > > 	saved_state = rt_set_current_blocked_state(saved_state);
> > > 	if (orig_owner == rt_mutex_owner(lock))
> > > 		schedule_rt_mutex(lock);
> > > 
> > > -- Steve
> > 
> > Isn't it possible to miss a wakeup here if the waiter becomes preempted?
> 
> Why? Preemption doesn't change the task state.
> > 
> > Recall that adaptive wait is a preemptive wait.  Hence the (I believe)
> > original reason we did the adaptive spin in a (transitioning)  sleep
> > state.
> 
> Yes it is a preemptive wait, I would not have accepted the patches if it
> was anything else. But preemption is not affected by the state of the
> task. A task could be TASK_RUNNING or TASK_INTERRUPTIBLE or
> TASK_UNINTERRUPTIBLE, and that would not affect how it acts when it is
> preempted.
> 
> -- Steve
> 
> 

My bad.  I thought preemption did change task state.

This still requires the owner to run through try_to_wake_up() and all
its associated overhead only to find out that the waiter is running.  

The assumption I made when I suggested the original concept to Greg was
that if the new owner is running, there is *nothing* to do wrt
scheduling.  If that was a wrong assumption, then, yes, drop the patch
and clean things up.  

If that was a good assumption, then we are leaving 'cycles on the table'
as waking up a running process is a non-zero-overhead path and that is a
bad thing considering how many times spin_unlock() is called on an rt
system.

Bear in mind that this savings scales directly as the number of CPUs
(assuming all are vectored on the lock).  We can only have nr_cpus-1
spinning waiters at any given time, regardless of the number of tasks in
contention.  Perhaps this is too little to worry about on a 4way system,
but I suspect that it could be substantial on larger systems.  

I'll be quiet now as I know little about the intricacies of
preemption/scheduling (obviously) and like Greg, have been removed from
RT kernel work for several years. <sigh>

-PWM





^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lock wakeup
  2011-01-04 15:47               ` Steven Rostedt
  2011-01-04 17:15                 ` Peter W. Morreale
@ 2011-01-04 17:24                 ` Peter W. Morreale
  1 sibling, 0 replies; 22+ messages in thread
From: Peter W. Morreale @ 2011-01-04 17:24 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Gregory Haskins, Lai Jiangshan, Ingo Molnar, Peter Zijlstra,
	ThomasGleixner, linux-kernel

On Tue, 2011-01-04 at 10:47 -0500, Steven Rostedt wrote: 
> On Tue, 2011-01-04 at 08:19 -0700, Peter W. Morreale wrote:
> > On Mon, 2011-01-03 at 15:22 -0500, Steven Rostedt wrote:
> > > On Mon, 2011-01-03 at 14:06 -0500, Steven Rostedt wrote:
> > > 
> > > > 
> > > > 	if (adaptive_wait(&waiter, orig_owner))
> > > > 		sleep = 1;
> > > > 	else
> > > > 		sleep = 0;
> > > > 
> > > > 	if (sleep)
> > > 
> > > 
> > > > 		raw_spin_lock(&lock->wait_lock);
> > > > 		saved_state = rt_set_current_block_state(saved_state);
> > > > 		if (!lock->owner && &waiter == rt_mutex_top_waiter(lock))
> > > > 			sleep = 0;
> > > > 		raw_spin_unlock(&lock->wait_lock);
> > > 
> > > I may be able to remove the above locks and replace it with:
> > > 
> > > 	saved_state = rt_set_current_blocked_state(saved_state);
> > > 	if (orig_owner == rt_mutex_owner(lock))
> > > 		schedule_rt_mutex(lock);
> > > 
> > > -- Steve
> > 
> > Isn't it possible to miss a wakeup here if the waiter becomes preempted?
> 
> Why? Preemption doesn't change the task state.
> > 
> > Recall that adaptive wait is a preemptive wait.  Hence the (I believe)
> > original reason we did the adaptive spin in a (transitioning)  sleep
> > state.
> 
> Yes it is a preemptive wait, I would not have accepted the patches if it
> was anything else. But preemption is not affected by the state of the
> task. A task could be TASK_RUNNING or TASK_INTERRUPTIBLE or
> TASK_UNINTERRUPTIBLE, and that would not affect how it acts when it is
> preempted.
> 
> -- Steve
> 
> 

My bad.  I thought preemption did change task state.

This still requires the owner to run through try_to_wake_up() and all
its associated overhead only to find out that the waiter is running.  

The assumption I made when I suggested the original concept to Greg was
that if the new owner is running, there is *nothing* to do wrt
scheduling.  If that was a wrong assumption, then, yes, drop the patch
and clean things up.  

If that was a good assumption, then we are leaving 'cycles on the table'
as waking up a running process is a non-zero-overhead path and that is a
bad thing considering how many times spin_unlock() is called on an rt
system.

Bear in mind that this savings scales directly as the number of CPUs
(assuming all are vectored on the lock).  We can only have nr_cpus-1
spinning waiters at any given time, regardless of the number of tasks in
contention.  Perhaps this is too little to worry about on a 4way system,
but I suspect that it could be substantial on larger systems.  

I'll be quiet now as I know little about the intricacies of
preemption/scheduling (obviously) and like Greg, have been removed from
RT kernel work for several years. <sigh>

-PWM






^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lock wakeup
  2011-01-04 17:15                 ` Peter W. Morreale
@ 2011-01-04 17:27                   ` Steven Rostedt
  2011-01-04 17:45                     ` Peter W. Morreale
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2011-01-04 17:27 UTC (permalink / raw)
  To: pmorreale
  Cc: Gregory Haskins, Lai Jiangshan, Ingo Molnar, Peter Zijlstra,
	ThomasGleixner, linux-kernel

On Tue, 2011-01-04 at 10:15 -0700, Peter W. Morreale wrote:

> My bad.  I thought preemption did change task state.
> 
> This still requires the owner to run through try_to_wake_up() and all
> its associated overhead only to find out that the waiter is running.  
> 
> The assumption I made when I suggested the original concept to Greg was
> that if the new owner is running, there is *nothing* to do wrt
> scheduling.  If that was a wrong assumption, then, yes, drop the patch
> and clean things up.  
> 
> If that was a good assumption, then we are leaving 'cycles on the table'
> as waking up a running process is a non-zero-overhead path and that is a
> bad thing considering how many times spin_unlock() is called on an rt
> system.
> 
> Bear in mind that this savings scales directly as the number of CPUs
> (assuming all are vectored on the lock).  We can only have nr_cpus-1
> spinning waiters at any given time, regardless of the number of tasks in
> contention.  Perhaps this is too little to worry about on a 4way system,
> but I suspect that it could be substantial on larger systems.  
> 
> I'll be quiet now as I know little about the intricacies of
> preemption/scheduling (obviously) and like Greg, have been removed from
> RT kernel work for several years. <sigh>

No need to be quiet ;-)

I'm working on making it spin in the TASK_RUNNING state if possible, but
it is making the code a bit more complex, as the code seems to assume
that the wakeup and the change of current->state in the
rt_spin_lock_slowlock code both happen under the lock->wait_lock. I think
I'll scrap this idea.

That said, I think your wakeup patch may be worthwhile with Lai's new
code. His changes cause the owner to wake up the pending owner several
times, because the pending owner is never removed from the lock's
wait_list. If a high prio task grabs and releases the same lock over and
over and there is a waiter, it will try to wake that waiter each time.

Thus, having your patch may prevent that unnecessary wakeup.
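
A rough sketch of what that guard could look like on top of Lai's scheme,
reusing the helpers from the reverted patch (illustrative only, not the
actual -rt code):

	/* Illustrative only: called from the unlock slow path with
	 * lock->wait_lock held and at least one waiter queued. */
	static void wakeup_top_waiter_if_needed(struct rt_mutex *lock)
	{
		struct task_struct *top = rt_mutex_top_waiter(lock)->task;

		smp_mb();
		/* Under Lai's scheme the top waiter stays queued, so skip
		 * the wakeup when it is already running (TASK_RUNNING or
		 * TASK_RUNNING_MUTEX). */
		if (top->state & ~TASK_RUNNING_MUTEX)
			wake_up_process_mutex(top);
	}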

I'll look more into it. Thanks!

-- Steve




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lock wakeup
  2011-01-04 17:27                   ` Steven Rostedt
@ 2011-01-04 17:45                     ` Peter W. Morreale
  2011-01-04 18:06                       ` Steven Rostedt
  0 siblings, 1 reply; 22+ messages in thread
From: Peter W. Morreale @ 2011-01-04 17:45 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Gregory Haskins, Lai Jiangshan, Ingo Molnar, Peter Zijlstra,
	ThomasGleixner, linux-kernel

On Tue, 2011-01-04 at 12:27 -0500, Steven Rostedt wrote:
> On Tue, 2011-01-04 at 10:15 -0700, Peter W. Morreale wrote:
> 
> > My bad.  I thought preemption did change task state.
> > 
> > This still requires the owner to run through try_to_wake_up() and all
> > its associated overhead only to find out that the waiter is running.  
> > 
> > The assumption I made when I suggested the original concept to Greg was
> > that if the new owner is running, there is *nothing* to do wrt
> > scheduling.  If that was a wrong assumption, then, yes, drop the patch
> > and clean things up.  
> > 
> > If that was a good assumption, then we are leaving 'cycles on the table'
> > as waking up a running process is a non-zero-overhead path and that is a
> > bad thing considering how many times spin_unlock() is called on an rt
> > system.
> > 
> > Bear in mind that this savings scales directly as the number of CPUs
> > (assuming all are vectored on the lock).  We can only have nr_cpus-1
> > spinning waiters at any given time, regardless of the number of tasks in
> > contention.  Perhaps this is too little to worry about on a 4way system,
> > but I suspect that it could be substantial on larger systems.  
> > 
> > I'll be quiet now as I know little about the intricacies of
> > preemption/scheduling (obviously) and like Greg, have been removed from
> > RT kernel work for several years. <sigh>
> 
> No need to be quiet ;-)
> 
> I'm working on making it spin in TASK_RUNNING state if possible, but it
> is making the code a bit more complex, as it seems that there is an
> assumption with the wakeup and the changing of the current->state in the
> rt_spin_lock_slowlock code all being under the lock->wait_lock. I think
> I'll scrap this idea.
> 
> That said, I think your wakeup patch may be worth while with Lai's new
> code. His changes causes the owner to wake up the pending owner several
> times, because the pending owner is never removed from the lock
> wait_list. If a high prio task grabs and releases the same lock over and
> over, if there is a waiter it will try to wake up that waiter each time.
> 


Oooohhhh  I wonder if that would enable 'lock-stealing' for FIFO?

IIRC, you came up with a ping-pong scenario that prevented its use
there.

-PWM



> Thus, having your patch may prevent that unnecessary wakeup.
> 
> I'll look more into it. Thanks!
> 
> -- Steve
> 
> 
> 



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lock wakeup
  2011-01-04 17:45                     ` Peter W. Morreale
@ 2011-01-04 18:06                       ` Steven Rostedt
  2011-01-04 20:48                         ` Peter W. Morreale
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2011-01-04 18:06 UTC (permalink / raw)
  To: pmorreale
  Cc: Gregory Haskins, Lai Jiangshan, Ingo Molnar, Peter Zijlstra,
	ThomasGleixner, linux-kernel

On Tue, 2011-01-04 at 10:45 -0700, Peter W. Morreale wrote:
> On Tue, 2011-01-04 at 12:27 -0500, Steven Rostedt wrote:

> > That said, I think your wakeup patch may be worth while with Lai's new
> > code. His changes causes the owner to wake up the pending owner several
> > times, because the pending owner is never removed from the lock
> > wait_list. If a high prio task grabs and releases the same lock over and
> > over, if there is a waiter it will try to wake up that waiter each time.
> > 
> 
> 
> Oooohhhh  I wonder if that would enable 'lock-stealing' for FIFO?

I guess you mean literal FIFO order and not SCHED_FIFO.

But, if I understand you, it would allow for this. Say a process has a
lock and two processes are blocked on it. When the first releases the
lock, it gives it to the higher-priority of the waiters. But if a high
prio task comes along, it will steal that lock away, and the woken task
will need to race to get it from any other lower prio task that comes
along. And yes, if that other task is the same prio, we reverse the FIFO
order here.

Lai's patch prevents this. Either the woken task gets the lock or a
higher prio task does. No mixing around the order of taking the lock
here. The better side effect of Lai's patch is that it cleans up the
code quite a bit, which is what I really want, as we start to introduce
SCHED_DEADLINE.
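
A minimal sketch of that rule, using the helpers from rtmutex_common.h
(this is not Lai's actual patch, just the shape of the check):

	/*
	 * May @task take @lock now?  @waiter is @task's queued waiter, or
	 * NULL if @task just arrived at the lock.  Lower ->prio means
	 * higher priority.
	 */
	static int may_take_lock(struct rt_mutex *lock, struct task_struct *task,
				 struct rt_mutex_waiter *waiter)
	{
		if (!rt_mutex_has_waiters(lock))
			return 1;
		/* the woken top waiter may always take it */
		if (waiter && waiter == rt_mutex_top_waiter(lock))
			return 1;
		/* anyone else only with strictly higher priority than the
		 * queued top waiter */
		return task->prio < rt_mutex_top_waiter(lock)->task->prio;
	}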

> 
> IIRC, you came up with a ping-pong scenario that prevented its use
> there.

Probably.

-- Steve



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lock wakeup
  2011-01-04 18:06                       ` Steven Rostedt
@ 2011-01-04 20:48                         ` Peter W. Morreale
  0 siblings, 0 replies; 22+ messages in thread
From: Peter W. Morreale @ 2011-01-04 20:48 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Gregory Haskins, Lai Jiangshan, Ingo Molnar, Peter Zijlstra,
	ThomasGleixner, linux-kernel

On Tue, 2011-01-04 at 13:06 -0500, Steven Rostedt wrote:
> On Tue, 2011-01-04 at 10:45 -0700, Peter W. Morreale wrote:
> > On Tue, 2011-01-04 at 12:27 -0500, Steven Rostedt wrote:
> 
> > > That said, I think your wakeup patch may be worth while with Lai's new
> > > code. His changes causes the owner to wake up the pending owner several
> > > times, because the pending owner is never removed from the lock
> > > wait_list. If a high prio task grabs and releases the same lock over and
> > > over, if there is a waiter it will try to wake up that waiter each time.
> > > 
> > 
> > 
> > Oooohhhh  I wonder if that would enable 'lock-stealing' for FIFO?
> 
> I guess you mean literal FIFO order and not SCHED_FIFO.
> 
> But, if I understand you, it would allow for this. If a process has a
> lock and two processes are blocked on it. When the first releases the
> lock it gives it to the higher of the waiters. But if a high prio task
> comes a long, it will steal that lock away, and this woken task will
> need to race to get it from the other lower prio task that comes along.
> And yes, if that other task is the same prio, we reverse the FIFO order
> here.
> 
> Lai's patch prevents this. Either the woken task gets the lock or a
> higher prio task does. No mixing around the order of taking the lock
> here. The better side effect of Lai's patch is that it cleans up the
> code quite a bit, which is what I really want, as we start to introduce
> SCHED_DEADLINE.
> 

I was referring to one of the original patches that allowed an incoming
task of similar priority to 'steal' the lock over a pending waiter.  If
the owner had been cleared and the pending waiter not yet given
ownership, the incoming task could steal the lock.

IIRC, you had found a case wrt SCHED_FIFO tasks where that could
negatively impact performance/throughput in some way, so it was
restricted to non-SCHED_FIFO tasks.

This was just an idle question; it sounds like RT is moving towards (and
is perhaps already at) synchronized lock access.

-PWM


> > 
> > IIRC, you came up with a ping-pong scenario that prevented its use
> > there.
> 
> Probably.
> 
> -- Steve
> 
> 



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][RT][PATCH 4/4] rtmutex: Ensure only the top waiter or higher priority task can take the lock
  2011-01-04  4:02   ` Steven Rostedt
@ 2011-01-05  7:12     ` Lai Jiangshan
  0 siblings, 0 replies; 22+ messages in thread
From: Lai Jiangshan @ 2011-01-05  7:12 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Peter Zijlstra

On 01/04/2011 12:02 PM, Steven Rostedt wrote:
> On Thu, 2010-12-23 at 17:47 -0500, Steven Rostedt wrote:
>> plain text document attachment
>> (0004-rtmutex-Ensure-only-the-top-waiter-or-higher-priorit.patch)
>> From: Lai Jiangshan <laijs@cn.fujitsu.com>
>>
>> In current rtmutex, the pending owner may be boosted by the tasks
>> in the rtmutex's waitlist when the pending owner is deboosted
>> or a task in the waitlist is boosted. This boosting is unrelated,
>> because the pending owner does not really take the rtmutex.
>> It is not reasonable.
>>
> 
> I'm still hitting some bugs with the port to -rt, but I also noticed
> something that doesn't look too good.
> 
> There's several places in the kernel where a task may release and
> acquire the same lock multiple times in a row.
> 
> The old way of removing the pending owner from the lists and waking it
> up once, would have the high prio task wake it up once, and then it can
> grab the locks multiple times without modifying the list, since the
> pending owner is already awake and not in the pi list anymore.
> 
> The new way has the owner remove the woken task from its pi list and
> wakes it up, but when it steals the lock again, it adds this owner back
> to its pi list. When it releases the lock, it wakes it up again and
> removes it from its pi list again. This happens over and over again.

This is expected behavior.  With it, we can assume that if a lock has
waiter(s), the top waiter is always on the owner's pi list and the owner
gets (or will get) boosted by it.
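
For example, a debug check of that invariant might look roughly like this
(field names follow the 2.6-era plist-based rtmutex code and may not match
the -rt tree exactly):

	static void check_top_waiter_boosts_owner(struct rt_mutex *lock)
	{
		if (!rt_mutex_owner(lock) || !rt_mutex_has_waiters(lock))
			return;
		/* the top waiter's pi node should be queued (on the owner's
		 * pi_waiters list), i.e. the owner is boosted by it */
		WARN_ON(plist_node_empty(&rt_mutex_top_waiter(lock)->pi_list_entry));
	}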

This simplifies the code and the logic. But performance is more
important, so I will send 2 patches for it this weekend.

Thanks,
Lai.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lock wakeup
  2011-01-03 19:06         ` Steven Rostedt
  2011-01-03 20:22           ` Steven Rostedt
@ 2011-01-10 14:49           ` Gregory Haskins
  1 sibling, 0 replies; 22+ messages in thread
From: Gregory Haskins @ 2011-01-10 14:49 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Lai Jiangshan, Ingo Molnar, Peter Zijlstra, ThomasGleixner,
	Peter Morreale, linux-kernel

Hey Steve,

Just getting back online now....

>>> On 1/3/2011 at 02:06 PM, in message
<1294081596.3948.192.camel@gandalf.stny.rr.com>, Steven Rostedt
<rostedt@goodmis.org> wrote: 
> On Tue, 2010-12-28 at 07:06 -0700, Gregory Haskins wrote:
>> >>> On 12/23/2010 at 11:54 PM, in message
> 
>> > Sure, but as I said, it is mostly broken anyway. I could even insert
>> > some tracepoints to show that this is always missed (heck I'll add an
>> > unlikely and do the branch profiler ;-)
>> 
>> Well, I think that would be a good datapoint and is one of the things I'd like to see.
> 
> OK, here it is, after running a simple "dbench 10":
> 
>  correct incorrect  %        Function                  File              Line
>  ------- ---------  -        --------                  ----              ----
>      150   726979  99 wakeup_next_waiter             rtmutex.c            581
> 

Interesting, thanks for gathering the info

 
> 
> Interesting that we hit 150 times that the new owner was already awake.
> Perhaps it was because the new owner was about to spin on a spinlock
> after it had put itself into the TASK_INTERRUPTIBLE state, and it was
> woken up by someone else?

I speculate that what you are seeing is that the waiter was an adaptive spinner and it beat the releaser,
changing its state from INTERRUPTIBLE->RUNNING in parallel before the releaser checked the condition.  It's
essentially a race condition that apparently only hits ~1% of the time.  Without further instrumentation, it's
just a guess, though.



> 
> 
>> 
>> > 
>> > The reason is that adaptive spinners spin in some other state than
>> > TASK_RUNNING, thus it does not help adaptive spinners at all. I first
>> > tried to fix that, but it made dbench run even slower.
>> 
>> This is why I am skeptical.  You are essentially asserting there are two issues here, IIUC:
>> 
>> 1) The intent of avoiding a wakeup is broken and we take the double whammy of a mb()
>> plus the wakeup() anyway.
> 
> Yep, this is what I believe is happening.
> 
>> 
>> 2) mb() is apparently slower than wakeup().
> 
> Forget this point, as #1 seems to be the main issue. Also, I'm not sure
> it is safe to "fix" this, as I started to try.

ok

> 
> 
>> 
>> I agree (1) is plausible, though I would like to see the traces to confirm.  Its been a long time
>> since I looked at that code, but I think the original code either ran in RUNNING_MUTEX and was
>> inadvertently broken in the mean time or the other cpu would have transitioned to RUNNING on
>> its own when we flipped the owner before the release-side check was performed.  Or perhaps
>> we just plain screwed this up and it was racy ;)  I'm not sure.  But as Peter (M) stated, it seems
>> like a shame to walk away from the concept without further investigation.  I think everyone can
>> agree that at the very least, if it is in fact taking a double whammy we should fix that.
> 
> [ cut out all the 2's since I'm not worried about that now, even if it 
>   is not a problem. ]
> 
> 
> Now, the way I was going to fix this is to have the adaptive wait be in
> TASK_RUNNING state, and then change the state iff we are going to sleep.
> 
> But this can be a bit more race. Especially with Lai's new changes. With
> the new changes, the waiter->task does not get nulled anymore.
> 
> The test to take the lock, now needs to be under the lock->wait_lock
> spinlock. We have to grab that lock, and see if the owner is null, and
> that we are the next waiter in the queue. Thus, on taking the lock we
> need to have something like this:
> 
> 
> 	if (adaptive_wait(&waiter, orig_owner))
> 		sleep = 1;
> 	else
> 		sleep = 0;
> 
> 	if (sleep)
> 		raw_spin_lock(&lock->wait_lock);
> 		saved_state = rt_set_current_block_state(saved_state);
> 		if (!lock->owner && &waiter == rt_mutex_top_waiter(lock))
> 			sleep = 0;
> 		raw_spin_unlock(&lock->wait_lock);
> 		if (sleep)
> 			schedule_rt_mutex(lock);
> 		saved_state = rt_restore_current_blocked_state(saved_state);
> 	}
> 
> Otherwise we can risk the wakeup_next_waiter() missing the wakeup.

Yep, you definitely need something like your proposal above if you want to mess with the state, I agree.

> 
> To clarify, we want the adaptive_wait() to run as TASK_RUNNING. Then if
> we must sleep, then we must set the state to TASK_UNINTERRUPTIBLE, test
> again if we can still the lock, and if not then sleep. Otherwise, if a
> wakeup happens just before we set the state to TASK_UNINTERRUPTIBLE,
> then we miss the wake up all together.
> 
> I can do this change, and see what impact it makes.
> 
> I'm also curious if this ever worked?

Well, I _thought_ so, but it was so long ago I don't remember the specific level of granularity of the unit tests.

> If it did not, then are you sure
> your tests that show the benefit of it was true. I don't have a large
> scale box at my disposal ATM, so I can only see what this does on 4way
> machines.

Yeah, me either.  At the time we had 16 and 32 core boxes, plus your 64 core box.  There were certainly
substantial improvements throughout the series on these machines (which I think you also verified
independently, or you wouldn't have accepted them ;).  Given that what you found amounts to a race,
I suppose the code, even if it was racy from day 1, may have had a positive impact for certain workloads
since the race will be environment specific.

I digress: whether it worked once and broke in the meantime or was always broken is purely an exercise
in evaluating my stupidity ;)  I'll leave that as an exercise to the community.  The important thing is to either
fix the optimization (e.g. with a patch like yours above) or remove the optimization outright.  

Bottom line: Nice find, and let me know if you need me to do anything.

-Greg



^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2011-01-10 14:49 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-12-23 22:47 [RFC][RT][PATCH 0/4] rtmutex: Simplify PI code Steven Rostedt
2010-12-23 22:47 ` [RFC][RT][PATCH 1/4] rtmutex: Only save lock depth once in spin_slowlock Steven Rostedt
2010-12-23 22:47 ` [RFC][RT][PATCH 2/4] rtmutex: Try to take lock early in rt_spin_lock_slowlock() Steven Rostedt
2010-12-23 22:47 ` [RFC][RT][PATCH 3/4] rtmutex: Revert Optimize rt lock wakeup Steven Rostedt
2010-12-24  4:45   ` Gregory Haskins
2010-12-24  4:54     ` Steven Rostedt
2010-12-28 14:06       ` Gregory Haskins
2011-01-03 19:06         ` Steven Rostedt
2011-01-03 20:22           ` Steven Rostedt
2011-01-04 15:19             ` Peter W. Morreale
2011-01-04 15:47               ` Steven Rostedt
2011-01-04 17:15                 ` Peter W. Morreale
2011-01-04 17:27                   ` Steven Rostedt
2011-01-04 17:45                     ` Peter W. Morreale
2011-01-04 18:06                       ` Steven Rostedt
2011-01-04 20:48                         ` Peter W. Morreale
2011-01-04 17:24                 ` Peter W. Morreale
2011-01-10 14:49           ` Gregory Haskins
     [not found]   ` <4D13DF250200005A000793E1@novell.com>
2010-12-24 15:47     ` Peter W. Morreale
2010-12-23 22:47 ` [RFC][RT][PATCH 4/4] rtmutex: Ensure only the top waiter or higher priority task can take the lock Steven Rostedt
2011-01-04  4:02   ` Steven Rostedt
2011-01-05  7:12     ` Lai Jiangshan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).