linux-kernel.vger.kernel.org archive mirror
* RT Mutex patch and tester [PREEMPT_RT]
@ 2006-01-11 17:25 Esben Nielsen
  2006-01-11 17:51 ` Steven Rostedt
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Esben Nielsen @ 2006-01-11 17:25 UTC (permalink / raw)
  To: Ingo Molnar, Steven Rostedt, david singleton, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3837 bytes --]

I have done 2 things which might be of interest:

I) An rt_mutex unittest suite. It might also be useful against the generic
mutexes.

II) I changed the priority inheritance mechanism in rt.c,
achieving the following goals:

1) rt_mutex deadlocks don't become raw_spinlock deadlocks. And more
importantly: futex deadlocks don't become raw_spinlock deadlocks.
2) Time-predictable code. No matter how deeply you nest your locks
(kernel or futex), the time spent with irqs or preemption off is bounded.
3) Simpler code. rt.c was kind of messy. Maybe it still is....:-)

I have lost:
1) Some speed in the slow slow path. I _might_ have gained some in the
normal slow path, though I haven't measured it.


Idea:

When a task blocks on a lock it adds itself to the wait list and calls
schedule(). When it is unblocked it has the lock - or rather, due to
grab-locking, it has to check again. Therefore the schedule() call is
wrapped in a loop.
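
Schematically, the new slow path loop looks like this (a condensed sketch
of ____down() from the patch below; tracing, debugging and the PF_NOSCHED
handling are left out, and lock->wait_lock is held at the top of each
iteration):

	for (;;) {
		old_owner = lock_owner(lock);

		if (allowed_to_take_lock(ti, task, old_owner, lock)) {
			/* granted: become the owner and leave the loop */
			set_new_owner(lock, old_owner, ti);
			remove_pending_owner(task);
			_raw_spin_unlock(&lock->wait_lock);
			return;
		}

		/* enqueue ourselves and sleep; a PI boost may wake us early */
		task_blocks_on_lock(&waiter, ti, lock, TASK_UNINTERRUPTIBLE);
		_raw_spin_unlock(&lock->wait_lock);

		if (waiter.ti)
			schedule();

		/* re-check: either we were handed the lock or we requeue */
		_raw_spin_lock(&lock->wait_lock);
		if (waiter.ti)
			remove_waiter(lock, &waiter, 1);
	}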

Now when a task is PI boosted, it is at the same time checked whether it is
blocked on an rt_mutex. If it is, it is unblocked (wake_up_process_mutex())
and will go around the loop mentioned above again. Within this loop it
boosts the owner of the lock it is blocked on, maybe unblocking that owner,
which in turn can boost and unblock the next task in the lock chain...
At all times there is at least one task, boosted to the highest required
priority, that is unblocked and working on boosting the next task in the
lock chain, so there is no priority inversion.
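
In code terms the hand-off is the fix_prio() helper from the patch below
(reproduced here slightly trimmed): adjust the target's priority and, if
the target is itself blocked, wake it so it re-runs its wait loop and
propagates the boost to the next lock:

	/* called with task->pi_lock held */
	static void fix_prio(task_t *task)
	{
		int prio = calc_pi_prio(task);	/* top prio among PI waiters */

		if (task->prio > prio) {
			mutex_setprio(task, prio);		/* boost */
			if (task->blocked_on)
				wake_up_process_mutex(task);	/* let it boost its lock */
		} else if (task->prio < prio) {
			if (task->blocked_on)
				wake_up_process_mutex(task);	/* let it unboost its lock */
			else
				mutex_setprio(task, prio);
		}
	}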

Boosting a long chain of blocked tasks will clearly take longer than in the
previous version, as there will be task switches. But remember, this is in
the slow slow path! And it only occurs when PI boosting happens on _nested_
locks.

What is gained is that the amount of time where irqs and preemption are off
is limited: one task does its work with preemption disabled, wakes up the
next, enables preemption and schedules. The amount of time spent with
preemption disabled has a clear upper limit, untouched by how complicated
and deep the lock structure is.

So how many locks do we have to worry about? Two.
One for locking the lock. One for locking the various PI-related data on the
task structure, such as the pi_waiters list, blocked_on, pending_owner - and
also prio.
Therefore only lock->wait_lock and sometask->pi_lock will ever be held at the
same time, and in that order. There are therefore no spinlock deadlocks.
And the code is simpler.
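
Every critical section in the new rt.c therefore follows the same simple
pattern (sketch):

	_raw_spin_lock(&lock->wait_lock);	/* the lock and its wait_list */
	_raw_spin_lock(&sometask->pi_lock);	/* one task's PI data         */
	/* ... touch pi_waiters, blocked_on, pending_owner, prio ... */
	_raw_spin_unlock(&sometask->pi_lock);
	_raw_spin_unlock(&lock->wait_lock);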

Because of the simpler code I was able to implement an optimization:
only the first waiter on each lock is a member of owner->pi_waiters.
Therefore no list traversal is needed on either owner->pi_waiters or
lock->wait_list; every operation only has to remove and/or add one element
on these lists.
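
remove_waiter() in the patch below shows the constant amount of work this
buys; roughly (the fixprio handling is left out):

	int first = (waiter == plist_first_entry(&lock->wait_list,
						 struct rt_mutex_waiter, list));

	plist_del(&waiter->list);			/* off lock->wait_list   */
	if (first && owner) {
		_raw_spin_lock(&owner->pi_lock);
		plist_del(&waiter->pi_list);		/* off owner->pi_waiters */
		if (!plist_head_empty(&lock->wait_list)) {
			struct rt_mutex_waiter *next =
				plist_first_entry(&lock->wait_list,
						  struct rt_mutex_waiter, list);
			/* promote the new first waiter - no list walk needed */
			plist_add(&next->pi_list, &owner->pi_waiters);
		}
		_raw_spin_unlock(&owner->pi_lock);
	}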

As for robust futexes: They ought to work out of the box now, blocking in
deadlock situations. I have added an entry to /proc/<pid>/status
"BlckOn: <pid>". This can be used to do "post mortem" deadlock detection
from userspace.
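
For example, a userspace checker could follow the BlckOn chain with
something like the helper below (illustrative only, not part of the patch);
following it from pid to pid until a pid repeats reveals a deadlock cycle:

	#include <stdio.h>

	/* return the pid that <pid> is blocked on, or 0 */
	static int blocked_on(int pid)
	{
		char path[64], line[128];
		int res = 0;
		FILE *f;

		snprintf(path, sizeof(path), "/proc/%d/status", pid);
		f = fopen(path, "r");
		if (!f)
			return 0;
		while (fgets(line, sizeof(line), f))
			if (sscanf(line, "BlckOn: %d", &res) == 1)
				break;
		fclose(f);
		return res;
	}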

What am I missing:
Testing on SMP. I have no SMP machine. The unittest can mimic SMP somewhat,
but no unittest can catch _all_ errors.

Testing with futexes.

ALL_TASKS_PI is always switched on now. This is to keep the code simpler.

My machine fails to run with CONFIG_DEBUG_DEADLOCKS and CONFIG_DEBUG_PREEMPT
on at the same time. I need a serial cable and a console over serial to
debug it. My screen is too small to see enough there.

Figure out more tests to run in my unittester.

So why am I not doing those things before sending the patch? 1) My
girlfriend comes back tomorrow with our child, and I know I will have no
time to code anything substantial then. 2) I want to make sure Ingo sees
this approach before he starts merging preempt_rt and rt_mutex with his
now-mainstream mutex.

Esben







[-- Attachment #2: Type: APPLICATION/x-gzip, Size: 20048 bytes --]

[-- Attachment #3: Type: TEXT/PLAIN, Size: 48007 bytes --]

diff -upr linux-2.6.15-rt3.orig/fs/proc/array.c linux-2.6.15-rt3-pipatch/fs/proc/array.c
--- linux-2.6.15-rt3.orig/fs/proc/array.c	2006-01-11 01:45:18.000000000 +0100
+++ linux-2.6.15-rt3-pipatch/fs/proc/array.c	2006-01-11 03:02:12.000000000 +0100
@@ -295,6 +295,14 @@ static inline char *task_cap(struct task
 			    cap_t(p->cap_effective));
 }
 
+
+static char *show_blocked_on(task_t *task, char *buffer)
+{
+  pid_t pid = get_blocked_on(task);
+  return buffer + sprintf(buffer,"BlckOn: %d\n",pid);
+}
+
+
 int proc_pid_status(struct task_struct *task, char * buffer)
 {
 	char * orig = buffer;
@@ -313,6 +321,7 @@ int proc_pid_status(struct task_struct *
 #if defined(CONFIG_ARCH_S390)
 	buffer = task_show_regs(task, buffer);
 #endif
+	buffer = show_blocked_on(task,buffer);
 	return buffer - orig;
 }
 
diff -upr linux-2.6.15-rt3.orig/include/linux/sched.h linux-2.6.15-rt3-pipatch/include/linux/sched.h
--- linux-2.6.15-rt3.orig/include/linux/sched.h	2006-01-11 01:45:18.000000000 +0100
+++ linux-2.6.15-rt3-pipatch/include/linux/sched.h	2006-01-11 03:02:12.000000000 +0100
@@ -1652,6 +1652,8 @@ extern void recalc_sigpending(void);
 
 extern void signal_wake_up(struct task_struct *t, int resume_stopped);
 
+extern pid_t get_blocked_on(task_t *task);
+
 /*
  * Wrappers for p->thread_info->cpu access. No-op on UP.
  */
diff -upr linux-2.6.15-rt3.orig/kernel/hrtimer.c linux-2.6.15-rt3-pipatch/kernel/hrtimer.c
--- linux-2.6.15-rt3.orig/kernel/hrtimer.c	2006-01-11 01:45:18.000000000 +0100
+++ linux-2.6.15-rt3-pipatch/kernel/hrtimer.c	2006-01-11 03:02:12.000000000 +0100
@@ -404,7 +404,7 @@ kick_off_hrtimer(struct hrtimer *timer, 
 # define hrtimer_hres_active		0
 # define hres_enqueue_expired(t,b,n)	0
 # define hrtimer_check_clocks()		do { } while (0)
-# define kick_off_hrtimer		do { } while (0)
+# define kick_off_hrtimer(timer,base)	do { } while (0)
 
 #endif /* !CONFIG_HIGH_RES_TIMERS */
 
diff -upr linux-2.6.15-rt3.orig/kernel/rt.c linux-2.6.15-rt3-pipatch/kernel/rt.c
--- linux-2.6.15-rt3.orig/kernel/rt.c	2006-01-11 01:45:18.000000000 +0100
+++ linux-2.6.15-rt3-pipatch/kernel/rt.c	2006-01-11 09:08:00.000000000 +0100
@@ -36,7 +36,10 @@
  *   (also by Steven Rostedt)
  *    - Converted single pi_lock to individual task locks.
  *
+ * By Esben Nielsen:
+ *    Doing priority inheritance with help of the scheduler.
  */
+
 #include <linux/config.h>
 #include <linux/rt_lock.h>
 #include <linux/sched.h>
@@ -58,18 +61,26 @@
  *  To keep from having a single lock for PI, each task and lock
  *  has their own locking. The order is as follows:
  *
+ *     lock->wait_lock   -> sometask->pi_lock
+ * You should only hold one wait_lock and one pi_lock
  * blocked task->pi_lock -> lock->wait_lock -> owner task->pi_lock.
  *
- * This is safe since a owner task should never block on a lock that
- * is owned by a blocking task.  Otherwise you would have a deadlock
- * in the normal system.
- * The same goes for the locks. A lock held by one task, should not be
- * taken by task that holds a lock that is blocking this lock's owner.
+ * lock->wait_lock protects everything inside the lock and all the waiters
+ * on lock->wait_list.
+ * sometask->pi_lock protects everything on task-> related to the rt_mutex.
+ *
+ * Invariants  - must be true when unlock lock->wait_lock:
+ *   If lock->wait_list is non-empty 
+ *     1) lock_owner(lock) points to a valid thread.
+ *     2) The first and only the first waiter on the list must be on
+ *        lock_owner(lock)->task->pi_waiters.
+ * 
+ *  A waiter struct is on the lock->wait_list iff waiter->ti!=NULL.
  *
- * A task that is about to grab a lock is first considered to be a
- * blocking task, even if the task successfully acquires the lock.
- * This is because the taking of the locks happen before the
- * task becomes the owner.
+ *  Strategy for boosting a lock chain:
+ *   Task A is blocked on lock 1 owned by task B, blocked on lock 2 owned by C, etc.
+ *  A raises B's prio and wakes B. B tries to get lock 2 again and fails.
+ *  B therefore boosts C.
  */
 
 /*
@@ -117,8 +128,11 @@
  * This flag is good for debugging the PI code - it makes all tasks
  * in the system fall under PI handling. Normally only SCHED_FIFO/RR
  * tasks are PI-handled:
+ *
+ * It must stay on for now, as the invariant that the first waiter is always
+ * on the pi_waiters list is kept only this way (for now).
  */
-#define ALL_TASKS_PI 0
+#define ALL_TASKS_PI 1
 
 #ifdef CONFIG_DEBUG_DEADLOCKS
 # define __EIP_DECL__ , unsigned long eip
@@ -311,7 +325,7 @@ void check_preempt_wakeup(struct task_st
 		}
 }
 
-static inline void
+static void
 account_mutex_owner_down(struct task_struct *task, struct rt_mutex *lock)
 {
 	if (task->lock_count >= MAX_LOCK_STACK) {
@@ -325,7 +339,7 @@ account_mutex_owner_down(struct task_str
 	task->lock_count++;
 }
 
-static inline void
+static void
 account_mutex_owner_up(struct task_struct *task)
 {
 	if (!task->lock_count) {
@@ -729,25 +743,6 @@ restart:
 #if ALL_TASKS_PI && defined(CONFIG_DEBUG_DEADLOCKS)
 
 static void
-check_pi_list_present(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
-		      struct thread_info *old_owner)
-{
-	struct rt_mutex_waiter *w;
-
-	_raw_spin_lock(&old_owner->task->pi_lock);
-	TRACE_WARN_ON_LOCKED(plist_node_empty(&waiter->pi_list));
-
-	plist_for_each_entry(w, &old_owner->task->pi_waiters, pi_list) {
-		if (w == waiter)
-			goto ok;
-	}
-	TRACE_WARN_ON_LOCKED(1);
-ok:
-	_raw_spin_unlock(&old_owner->task->pi_lock);
-	return;
-}
-
-static void
 check_pi_list_empty(struct rt_mutex *lock, struct thread_info *old_owner)
 {
 	struct rt_mutex_waiter *w;
@@ -781,274 +776,116 @@ check_pi_list_empty(struct rt_mutex *loc
 
 #endif
 
-/*
- * Move PI waiters of this lock to the new owner:
- */
-static void
-change_owner(struct rt_mutex *lock, struct thread_info *old_owner,
-	     struct thread_info *new_owner)
-{
-	struct rt_mutex_waiter *w, *tmp;
-	int requeued = 0, sum = 0;
-
-	if (old_owner == new_owner)
-		return;
-
-	SMP_TRACE_BUG_ON_LOCKED(!spin_is_locked(&old_owner->task->pi_lock));
-	SMP_TRACE_BUG_ON_LOCKED(!spin_is_locked(&new_owner->task->pi_lock));
-	plist_for_each_entry_safe(w, tmp, &old_owner->task->pi_waiters, pi_list) {
-		if (w->lock == lock) {
-			trace_special_pid(w->ti->task->pid, w->ti->task->prio, w->ti->task->normal_prio);
-			plist_del(&w->pi_list);
-			w->pi_list.prio = w->ti->task->prio;
-			plist_add(&w->pi_list, &new_owner->task->pi_waiters);
-			requeued++;
-		}
-		sum++;
-	}
-	trace_special(sum, requeued, 0);
-}
 
-int pi_walk, pi_null, pi_prio, pi_initialized;
 
-/*
- * The lock->wait_lock and p->pi_lock must be held.
- */
-static void pi_setprio(struct rt_mutex *lock, struct task_struct *task, int prio)
+static int calc_pi_prio(task_t *task)
 {
-	struct rt_mutex *l = lock;
-	struct task_struct *p = task;
-	/*
-	 * We don't want to release the parameters locks.
-	 */
+	int prio = task->normal_prio;
+	if(!plist_head_empty(&task->pi_waiters)) {
+		struct  rt_mutex_waiter *waiter = 
+			plist_first_entry(&task->pi_waiters, struct rt_mutex_waiter, pi_list);
+		prio = min(waiter->pi_list.prio,prio);
 
-	if (unlikely(!p->pid)) {
-		pi_null++;
-		return;
 	}
+	return prio;
 
-	SMP_TRACE_BUG_ON_LOCKED(!spin_is_locked(&lock->wait_lock));
-	SMP_TRACE_BUG_ON_LOCKED(!spin_is_locked(&p->pi_lock));
-#ifdef CONFIG_DEBUG_DEADLOCKS
-	pi_prio++;
-	if (p->policy != SCHED_NORMAL && prio > normal_prio(p)) {
-		TRACE_OFF();
-
-		printk("huh? (%d->%d??)\n", p->prio, prio);
-		printk("owner:\n");
-		printk_task(p);
-		printk("\ncurrent:\n");
-		printk_task(current);
-		printk("\nlock:\n");
-		printk_lock(lock, 1);
-		dump_stack();
-		trace_local_irq_disable(ti);
 	}
-#endif
-	/*
-	 * If the task is blocked on some other task then boost that
-	 * other task (or tasks) too:
-	 */
-	for (;;) {
-		struct rt_mutex_waiter *w = p->blocked_on;
-#ifdef CONFIG_DEBUG_DEADLOCKS
-		int was_rt = rt_task(p);
-#endif
-
-		mutex_setprio(p, prio);
 
-		/*
-		 * The BKL can really be a pain. It can happen where the
-		 * BKL is being held by one task that is just about to
-		 * block on another task that is waiting for the BKL.
-		 * This isn't a deadlock, since the BKL is released
-		 * when the task goes to sleep.  This also means that
-		 * all holders of the BKL are not blocked, or are just
-		 * about to be blocked.
-		 *
-		 * Another side-effect of this is that there's a small
-		 * window where the spinlocks are not held, and the blocked
-		 * process hasn't released the BKL.  So if we are going
-		 * to boost the owner of the BKL, stop after that,
-		 * since that owner is either running, or about to sleep
-		 * but don't go any further or we are in a loop.
-		 */
-		if (!w || unlikely(p->lock_depth >= 0))
-			break;
-		/*
-		 * If the task is blocked on a lock, and we just made
-		 * it RT, then register the task in the PI list and
-		 * requeue it to the wait list:
-		 */
-
-		/*
-		 * Don't unlock the original lock->wait_lock
-		 */
-		if (l != lock)
-			_raw_spin_unlock(&l->wait_lock);
-		l = w->lock;
-		TRACE_BUG_ON_LOCKED(!lock);
-
-#ifdef CONFIG_PREEMPT_RT
-		/*
-		 * The current task that is blocking can also the one
-		 * holding the BKL, and blocking on a task that wants
-		 * it.  So if it were to get this far, we would deadlock.
-		 */
-		if (unlikely(l == &kernel_sem.lock) && lock_owner(l) == current_thread_info()) {
-			/*
-			 * No locks are held for locks, so fool the unlocking code
-			 * by thinking the last lock was the original.
-			 */
-			l = lock;
-			break;
+static void fix_prio(task_t *task)
+{
+	int prio = calc_pi_prio(task);
+	if(task->prio > prio) {
+		/* Boost him */
+		mutex_setprio(task,prio);
+		if(task->blocked_on) {
+			/* Let it run to boost it's lock */
+			wake_up_process_mutex(task);
 		}
-#endif
-
-		if (l != lock)
-			_raw_spin_lock(&l->wait_lock);
-
-		TRACE_BUG_ON_LOCKED(!lock_owner(l));
-
-		if (!plist_node_empty(&w->pi_list)) {
-			TRACE_BUG_ON_LOCKED(!was_rt && !ALL_TASKS_PI && !rt_task(p));
-			/*
-			 * If the task is blocked on a lock, and we just restored
-			 * it from RT to non-RT then unregister the task from
-			 * the PI list and requeue it to the wait list.
-			 *
-			 * (TODO: this can be unfair to SCHED_NORMAL tasks if they
-			 *        get PI handled.)
-			 */
-			plist_del(&w->pi_list);
-		} else
-			TRACE_BUG_ON_LOCKED((ALL_TASKS_PI || rt_task(p)) && was_rt);
-
-		if (ALL_TASKS_PI || rt_task(p)) {
-			w->pi_list.prio = prio;
-			plist_add(&w->pi_list, &lock_owner(l)->task->pi_waiters);
-		}
-
-		plist_del(&w->list);
-		w->list.prio = prio;
-		plist_add(&w->list, &l->wait_list);
-
-		pi_walk++;
-
-		if (p != task)
-			_raw_spin_unlock(&p->pi_lock);
-
-		p = lock_owner(l)->task;
-		TRACE_BUG_ON_LOCKED(!p);
-		_raw_spin_lock(&p->pi_lock);
-		/*
-		 * If the dependee is already higher-prio then
-		 * no need to boost it, and all further tasks down
-		 * the dependency chain are already boosted:
-		 */
-		if (p->prio <= prio)
-			break;
 	}
-	if (l != lock)
-		_raw_spin_unlock(&l->wait_lock);
-	if (p != task)
-		_raw_spin_unlock(&p->pi_lock);
-}
-
-/*
- * Change priority of a task pi aware
- *
- * There are several aspects to consider:
- * - task is priority boosted
- * - task is blocked on a mutex
- *
- */
-void pi_changeprio(struct task_struct *p, int prio)
-{
-	unsigned long flags;
-	int oldprio;
-
-	spin_lock_irqsave(&p->pi_lock,flags);
-	if (p->blocked_on)
-		spin_lock(&p->blocked_on->lock->wait_lock);
-
-	oldprio = p->normal_prio;
-	if (oldprio == prio)
-		goto out;
-
-	/* Set normal prio in any case */
-	p->normal_prio = prio;
-
-	/* Check, if we can safely lower the priority */
-	if (prio > p->prio && !plist_head_empty(&p->pi_waiters)) {
-		struct rt_mutex_waiter *w;
-		w = plist_first_entry(&p->pi_waiters,
-				      struct rt_mutex_waiter, pi_list);
-		if (w->ti->task->prio < prio)
-			prio = w->ti->task->prio;
+	else if(task->prio < prio) {
+		/* Priority too high */
+		if(task->blocked_on) {
+			/* Let it run to unboost it's lock */
+			wake_up_process_mutex(task);
+		}
+		else {
+			mutex_setprio(task,prio);
+		}
 	}
-
-	if (prio == p->prio)
-		goto out;
-
-	/* Is task blocked on a mutex ? */
-	if (p->blocked_on)
-		pi_setprio(p->blocked_on->lock, p, prio);
-	else
-		mutex_setprio(p, prio);
- out:
-	if (p->blocked_on)
-		spin_unlock(&p->blocked_on->lock->wait_lock);
-
-	spin_unlock_irqrestore(&p->pi_lock, flags);
-
 }
 
+int pi_walk, pi_null, pi_prio, pi_initialized;
+
 /*
  * This is called with both the waiter->task->pi_lock and
  * lock->wait_lock held.
  */
 static void
 task_blocks_on_lock(struct rt_mutex_waiter *waiter, struct thread_info *ti,
-		    struct rt_mutex *lock __EIP_DECL__)
+                    struct rt_mutex *lock, int state __EIP_DECL__)
 {
+	struct rt_mutex_waiter *old_first;
 	struct task_struct *task = ti->task;
 #ifdef CONFIG_DEBUG_DEADLOCKS
 	check_deadlock(lock, 0, ti, eip);
 	/* mark the current thread as blocked on the lock */
 	waiter->eip = eip;
 #endif
+	SMP_TRACE_BUG_ON_LOCKED(!spin_is_locked(&lock->wait_lock));
+	SMP_TRACE_BUG_ON_LOCKED(spin_is_locked(&task->pi_lock));
+
+	if(plist_head_empty(&lock->wait_list)) {
+		old_first = NULL;
+	}
+	else {
+		old_first = plist_first_entry(&lock->wait_list, struct rt_mutex_waiter, list);
+	}
+
+
+	_raw_spin_lock(&task->pi_lock);
 	task->blocked_on = waiter;
 	waiter->lock = lock;
 	waiter->ti = ti;
-	plist_node_init(&waiter->pi_list, task->prio);
+        
+	{
+		/* Fixup the prio of the (current) task here while we have the
+		   pi_lock */
+		int prio = calc_pi_prio(task);
+		if(prio!=task->prio) {
+			mutex_setprio(task,prio);
+		}
+	}
+
+	plist_node_init(&waiter->list, task->prio);
+	plist_add(&waiter->list, &lock->wait_list);
+	set_task_state(task, state);
+	_raw_spin_unlock(&task->pi_lock);
+
+	set_lock_owner_pending(lock);   
+#if !ALL_TASKS_PI
 	/*
 	 * Add SCHED_NORMAL tasks to the end of the waitqueue (FIFO):
 	 */
-	SMP_TRACE_BUG_ON_LOCKED(!spin_is_locked(&task->pi_lock));
-	SMP_TRACE_BUG_ON_LOCKED(!spin_is_locked(&lock->wait_lock));
-#if !ALL_TASKS_PI
 	if ((!rt_task(task) &&
 		!(lock->mutex_attr & FUTEX_ATTR_PRIORITY_INHERITANCE))) {
-		plist_add(&waiter->list, &lock->wait_list);
-		set_lock_owner_pending(lock);
 		return;
 	}
 #endif
-	_raw_spin_lock(&lock_owner(lock)->task->pi_lock);
-	plist_add(&waiter->pi_list, &lock_owner(lock)->task->pi_waiters);
-	/*
-	 * Add RT tasks to the head:
-	 */
-	plist_add(&waiter->list, &lock->wait_list);
-	set_lock_owner_pending(lock);
-	/*
-	 * If the waiter has higher priority than the owner
-	 * then temporarily boost the owner:
-	 */
-	if (task->prio < lock_owner(lock)->task->prio)
-		pi_setprio(lock, lock_owner(lock)->task, task->prio);
-	_raw_spin_unlock(&lock_owner(lock)->task->pi_lock);
+	if(waiter ==
+	   plist_first_entry(&lock->wait_list, struct rt_mutex_waiter, list)) {
+		task_t *owner = lock_owner(lock)->task;
+
+		plist_node_init(&waiter->pi_list, task->prio);
+
+		_raw_spin_lock(&owner->pi_lock);
+		if(old_first) {
+			plist_del(&old_first->pi_list);
+		}
+		plist_add(&waiter->pi_list, &owner->pi_waiters);
+		fix_prio(owner);
+
+		_raw_spin_unlock(&owner->pi_lock);
+	}
 }
 
 /*
@@ -1085,20 +922,45 @@ EXPORT_SYMBOL(__init_rwsem);
 #endif
 
 /*
- * This must be called with both the old_owner and new_owner pi_locks held.
- * As well as the lock->wait_lock.
+ * This must be called with the lock->wait_lock held.
+ * Must: new_owner!=NULL
+ * Likely: old_owner==NULL
  */
-static inline
+static 
 void set_new_owner(struct rt_mutex *lock, struct thread_info *old_owner,
 			struct thread_info *new_owner __EIP_DECL__)
 {
+	SMP_TRACE_BUG_ON_LOCKED(spin_is_locked(&old_owner->task->pi_lock));
+	SMP_TRACE_BUG_ON_LOCKED(spin_is_locked(&new_owner->task->pi_lock));
+	SMP_TRACE_BUG_ON_LOCKED(!spin_is_locked(&lock->wait_lock));
+
 	if (new_owner)
 		trace_special_pid(new_owner->task->pid, new_owner->task->prio, 0);
-	if (unlikely(old_owner))
-		change_owner(lock, old_owner, new_owner);
+	if(old_owner) {
+		account_mutex_owner_up(old_owner->task);
+	}
+#ifdef CONFIG_DEBUG_DEADLOCKS
+	if (trace_on && unlikely(old_owner)) {
+		TRACE_WARN_ON_LOCKED(list_empty(&lock->held_list));
+		list_del_init(&lock->held_list);
+	}
+#endif
 	lock->owner = new_owner;
-	if (!plist_head_empty(&lock->wait_list))
+	if (!plist_head_empty(&lock->wait_list)) {
+		struct rt_mutex_waiter *next =
+			plist_first_entry(&lock->wait_list, 
+					  struct rt_mutex_waiter, list);
+		if(old_owner) {
+			_raw_spin_lock(&old_owner->task->pi_lock);
+			plist_del(&next->pi_list);
+			_raw_spin_unlock(&old_owner->task->pi_lock);
+		}
+		_raw_spin_lock(&new_owner->task->pi_lock);
+		plist_add(&next->pi_list, &new_owner->task->pi_waiters);
 		set_lock_owner_pending(lock);
+		_raw_spin_unlock(&new_owner->task->pi_lock);
+	}
+        
 #ifdef CONFIG_DEBUG_DEADLOCKS
 	if (trace_on) {
 		TRACE_WARN_ON_LOCKED(!list_empty(&lock->held_list));
@@ -1109,6 +971,32 @@ void set_new_owner(struct rt_mutex *lock
 	account_mutex_owner_down(new_owner->task, lock);
 }
 
+
+static inline void remove_waiter(struct rt_mutex *lock, 
+                                 struct rt_mutex_waiter *waiter, 
+                                 int fixprio)
+{
+	task_t *owner = lock_owner(lock) ? lock_owner(lock)->task : NULL;
+	int first = (waiter==plist_first_entry(&lock->wait_list, 
+					       struct rt_mutex_waiter, list));
+        
+	plist_del(&waiter->list);
+	if(first && owner) {
+		_raw_spin_lock(&owner->pi_lock);
+		plist_del(&waiter->pi_list);
+		if(!plist_head_empty(&lock->wait_list)) {
+			struct rt_mutex_waiter *next =
+				plist_first_entry(&lock->wait_list, 
+						  struct rt_mutex_waiter, list);
+			plist_add(&next->pi_list, &owner->pi_waiters);                  
+		}
+		if(fixprio) {
+			fix_prio(owner);
+		}
+		_raw_spin_unlock(&owner->pi_lock);
+	}
+}
+
 /*
  * handle the lock release when processes blocked on it that can now run
  * - the spinlock must be held by the caller
@@ -1123,70 +1011,34 @@ pick_new_owner(struct rt_mutex *lock, st
 	struct thread_info *new_owner;
 
 	SMP_TRACE_BUG_ON_LOCKED(!spin_is_locked(&lock->wait_lock));
+	SMP_TRACE_BUG_ON_LOCKED(spin_is_locked(&old_owner->task->pi_lock));
+
 	/*
 	 * Get the highest prio one:
 	 *
 	 * (same-prio RT tasks go FIFO)
 	 */
 	waiter = plist_first_entry(&lock->wait_list, struct rt_mutex_waiter, list);
-
-#ifdef CONFIG_SMP
- try_again:
-#endif
+	remove_waiter(lock,waiter,0);
 	trace_special_pid(waiter->ti->task->pid, waiter->ti->task->prio, 0);
 
-#if ALL_TASKS_PI
-	check_pi_list_present(lock, waiter, old_owner);
-#endif
 	new_owner = waiter->ti;
-	/*
-	 * The new owner is still blocked on this lock, so we
-	 * must release the lock->wait_lock before grabing
-	 * the new_owner lock.
-	 */
-	_raw_spin_unlock(&lock->wait_lock);
-	_raw_spin_lock(&new_owner->task->pi_lock);
-	_raw_spin_lock(&lock->wait_lock);
-	/*
-	 * In this split second of releasing the lock, a high priority
-	 * process could have come along and blocked as well.
-	 */
-#ifdef CONFIG_SMP
-	waiter = plist_first_entry(&lock->wait_list, struct rt_mutex_waiter, list);
-	if (unlikely(waiter->ti != new_owner)) {
-		_raw_spin_unlock(&new_owner->task->pi_lock);
-		goto try_again;
-	}
-#ifdef CONFIG_PREEMPT_RT
-	/*
-	 * Once again the BKL comes to play.  Since the BKL can be grabbed and released
-	 * out of the normal P1->L1->P2 order, there's a chance that someone has the
-	 * BKL owner's lock and is waiting on the new owner lock.
-	 */
-	if (unlikely(lock == &kernel_sem.lock)) {
-		if (!_raw_spin_trylock(&old_owner->task->pi_lock)) {
-			_raw_spin_unlock(&new_owner->task->pi_lock);
-			goto try_again;
-		}
-	} else
-#endif
-#endif
-		_raw_spin_lock(&old_owner->task->pi_lock);
-
-	plist_del(&waiter->list);
-	plist_del(&waiter->pi_list);
-	waiter->pi_list.prio = waiter->ti->task->prio;
 
 	set_new_owner(lock, old_owner, new_owner __W_EIP__(waiter));
+
+	_raw_spin_lock(&new_owner->task->pi_lock);
 	/* Don't touch waiter after ->task has been NULLed */
 	mb();
 	waiter->ti = NULL;
 	new_owner->task->blocked_on = NULL;
-	TRACE_WARN_ON(save_state != lock->save_state);
-
-	_raw_spin_unlock(&old_owner->task->pi_lock);
+#ifdef CAPTURE_LOCK
+	new_owner->task->rt_flags |= RT_PENDOWNER;
+	new_owner->task->pending_owner = lock;
+#endif
 	_raw_spin_unlock(&new_owner->task->pi_lock);
 
+	TRACE_WARN_ON(save_state != lock->save_state);
+
 	return new_owner;
 }
 
@@ -1222,6 +1074,34 @@ static inline void init_lists(struct rt_
 #endif
 }
 
+
+static void remove_pending_owner_nolock(task_t *owner)
+{
+	owner->rt_flags &= ~RT_PENDOWNER;
+	owner->pending_owner = NULL;
+}
+
+static void remove_pending_owner(task_t *owner)
+{
+	_raw_spin_lock(&owner->pi_lock);
+	remove_pending_owner_nolock(owner);
+	_raw_spin_unlock(&owner->pi_lock);
+}
+
+int task_is_pending_owner_nolock(struct thread_info  *owner, 
+                                 struct rt_mutex *lock)
+{
+	return (lock_owner(lock) == owner) &&
+		(owner->task->pending_owner == lock);
+}
+int task_is_pending_owner(struct thread_info  *owner, struct rt_mutex *lock)
+{
+	int res;
+	_raw_spin_lock(&owner->task->pi_lock);
+	res = task_is_pending_owner_nolock(owner,lock);
+	_raw_spin_unlock(&owner->task->pi_lock);
+	return res;
+}
 /*
  * Try to grab a lock, and if it is owned but the owner
  * hasn't woken up yet, see if we can steal it.
@@ -1233,6 +1113,8 @@ static int __grab_lock(struct rt_mutex *
 {
 #ifndef CAPTURE_LOCK
 	return 0;
+#else
+	int res = 0;
 #endif
 	/*
 	 * The lock is owned, but now test to see if the owner
@@ -1241,111 +1123,36 @@ static int __grab_lock(struct rt_mutex *
 
 	TRACE_BUG_ON_LOCKED(!owner);
 
+	_raw_spin_lock(&owner->pi_lock);
+
 	/* The owner is pending on a lock, but is it this lock? */
 	if (owner->pending_owner != lock)
-		return 0;
+		goto out_unlock;
 
 	/*
 	 * There's an owner, but it hasn't woken up to take the lock yet.
 	 * See if we should steal it from him.
 	 */
 	if (task->prio > owner->prio)
-		return 0;
+		goto out_unlock;
 #ifdef CONFIG_PREEMPT_RT
 	/*
 	 * The BKL is a PITA. Don't ever steal it
 	 */
 	if (lock == &kernel_sem.lock)
-		return 0;
+		goto out_unlock;
 #endif
 	/*
 	 * This task is of higher priority than the current pending
 	 * owner, so we may steal it.
 	 */
-	owner->rt_flags &= ~RT_PENDOWNER;
-	owner->pending_owner = NULL;
+	remove_pending_owner_nolock(owner);
 
-#ifdef CONFIG_DEBUG_DEADLOCKS
-	/*
-	 * This task will be taking the ownership away, and
-	 * when it does, the lock can't be on the held list.
-	 */
-	if (trace_on) {
-		TRACE_WARN_ON_LOCKED(list_empty(&lock->held_list));
-		list_del_init(&lock->held_list);
-	}
-#endif
-	account_mutex_owner_up(owner);
+	res = 1;
 
-	return 1;
-}
-
-/*
- * Bring a task from pending ownership to owning a lock.
- *
- * Return 0 if we secured it, otherwise non-zero if it was
- * stolen.
- */
-static int
-capture_lock(struct rt_mutex_waiter *waiter, struct thread_info *ti,
-	     struct task_struct *task)
-{
-	struct rt_mutex *lock = waiter->lock;
-	struct thread_info *old_owner;
-	unsigned long flags;
-	int ret = 0;
-
-#ifndef CAPTURE_LOCK
-	return 0;
-#endif
-#ifdef CONFIG_PREEMPT_RT
-	/*
-	 * The BKL is special, we always get it.
-	 */
-	if (lock == &kernel_sem.lock)
-		return 0;
-#endif
-
-	trace_lock_irqsave(&trace_lock, flags, ti);
-	/*
-	 * We are no longer blocked on the lock, so we are considered a
-	 * owner. So we must grab the lock->wait_lock first.
-	 */
-	_raw_spin_lock(&lock->wait_lock);
-	_raw_spin_lock(&task->pi_lock);
-
-	if (!(task->rt_flags & RT_PENDOWNER)) {
-		/*
-		 * Someone else stole it.
-		 */
-		old_owner = lock_owner(lock);
-		TRACE_BUG_ON_LOCKED(old_owner == ti);
-		if (likely(!old_owner) || __grab_lock(lock, task, old_owner->task)) {
-			/* we got it back! */
-			if (old_owner) {
-				_raw_spin_lock(&old_owner->task->pi_lock);
-				set_new_owner(lock, old_owner, ti __W_EIP__(waiter));
-				_raw_spin_unlock(&old_owner->task->pi_lock);
-			} else
-				set_new_owner(lock, old_owner, ti __W_EIP__(waiter));
-			ret = 0;
-		} else {
-			/* Add ourselves back to the list */
-			TRACE_BUG_ON_LOCKED(!plist_node_empty(&waiter->list));
-			plist_node_init(&waiter->list, task->prio);
-			task_blocks_on_lock(waiter, ti, lock __W_EIP__(waiter));
-			ret = 1;
-		}
-	} else {
-		task->rt_flags &= ~RT_PENDOWNER;
-		task->pending_owner = NULL;
-	}
-
-	_raw_spin_unlock(&lock->wait_lock);
-	_raw_spin_unlock(&task->pi_lock);
-	trace_unlock_irqrestore(&trace_lock, flags, ti);
-
-	return ret;
+ out_unlock:
+	_raw_spin_unlock(&owner->pi_lock);
+	return res;
 }
 
 static inline void INIT_WAITER(struct rt_mutex_waiter *waiter)
@@ -1366,10 +1173,24 @@ static inline void FREE_WAITER(struct rt
 #endif
 }
 
+static int allowed_to_take_lock(struct thread_info *ti,
+                                task_t *task,
+                                struct thread_info *old_owner,
+                                struct rt_mutex *lock)
+{
+	SMP_TRACE_BUG_ON_LOCKED(!spin_is_locked(&lock->wait_lock));
+	SMP_TRACE_BUG_ON_LOCKED(spin_is_locked(&old_owner->task->pi_lock));
+	SMP_TRACE_BUG_ON_LOCKED(spin_is_locked(&task->pi_lock));
+
+	return !old_owner || 
+		task_is_pending_owner(ti,lock) || 
+		__grab_lock(lock, task, old_owner->task);
+}
+
 /*
  * lock it semaphore-style: no worries about missed wakeups.
  */
-static inline void
+static void
 ____down(struct rt_mutex *lock __EIP_DECL__)
 {
 	struct thread_info *ti = current_thread_info(), *old_owner;
@@ -1379,65 +1200,56 @@ ____down(struct rt_mutex *lock __EIP_DEC
 
 	trace_lock_irqsave(&trace_lock, flags, ti);
 	TRACE_BUG_ON_LOCKED(!raw_irqs_disabled());
-	_raw_spin_lock(&task->pi_lock);
 	_raw_spin_lock(&lock->wait_lock);
 	INIT_WAITER(&waiter);
 
-	old_owner = lock_owner(lock);
 	init_lists(lock);
 
-	if (likely(!old_owner) || __grab_lock(lock, task, old_owner->task)) {
+	/* wait to be given the lock */
+	for (;;) {
+		old_owner = lock_owner(lock);
+
+		if(allowed_to_take_lock(ti, task, old_owner,lock)) {
 		/* granted */
-		TRACE_WARN_ON_LOCKED(!plist_head_empty(&lock->wait_list) && !old_owner);
-		if (old_owner) {
-			_raw_spin_lock(&old_owner->task->pi_lock);
+			TRACE_WARN_ON_LOCKED(!plist_head_empty(&lock->wait_list) && !old_owner);
 			set_new_owner(lock, old_owner, ti __EIP__);
-			_raw_spin_unlock(&old_owner->task->pi_lock);
-		} else
-			set_new_owner(lock, old_owner, ti __EIP__);
-		_raw_spin_unlock(&lock->wait_lock);
-		_raw_spin_unlock(&task->pi_lock);
-		trace_unlock_irqrestore(&trace_lock, flags, ti);
-
-		FREE_WAITER(&waiter);
-		return;
-	}
-
-	set_task_state(task, TASK_UNINTERRUPTIBLE);
-
-	plist_node_init(&waiter.list, task->prio);
-	task_blocks_on_lock(&waiter, ti, lock __EIP__);
+			remove_pending_owner(task);
+			_raw_spin_unlock(&lock->wait_lock);
+			trace_unlock_irqrestore(&trace_lock, flags, ti);
 
-	TRACE_BUG_ON_LOCKED(!raw_irqs_disabled());
-	/* we don't need to touch the lock struct anymore */
-	_raw_spin_unlock(&lock->wait_lock);
-	_raw_spin_unlock(&task->pi_lock);
-	trace_unlock_irqrestore(&trace_lock, flags, ti);
+			FREE_WAITER(&waiter);
+			return;
+		}
+		
+		task_blocks_on_lock(&waiter, ti, lock, TASK_UNINTERRUPTIBLE __EIP__);
 
-	might_sleep();
+		TRACE_BUG_ON_LOCKED(!raw_irqs_disabled());
+		/* we don't need to touch the lock struct anymore */
+		_raw_spin_unlock(&lock->wait_lock);
+		trace_unlock_irqrestore(&trace_lock, flags, ti);
+		
+		might_sleep();
+		
+		nosched_flag = current->flags & PF_NOSCHED;
+		current->flags &= ~PF_NOSCHED;
 
-	nosched_flag = current->flags & PF_NOSCHED;
-	current->flags &= ~PF_NOSCHED;
+		if (waiter.ti)
+		{
+			schedule();
+		}
+		
+		current->flags |= nosched_flag;
+		task->state = TASK_RUNNING;
 
-wait_again:
-	/* wait to be given the lock */
-	for (;;) {
-		if (!waiter.ti)
-			break;
-		schedule();
-		set_task_state(task, TASK_UNINTERRUPTIBLE);
-	}
-	/*
-	 * Check to see if we didn't have ownership stolen.
-	 */
-	if (capture_lock(&waiter, ti, task)) {
-		set_task_state(task, TASK_UNINTERRUPTIBLE);
-		goto wait_again;
+		trace_lock_irqsave(&trace_lock, flags, ti);
+		_raw_spin_lock(&lock->wait_lock);
+		if(waiter.ti) {
+			remove_waiter(lock,&waiter,1);
+		}
 	}
 
-	current->flags |= nosched_flag;
-	task->state = TASK_RUNNING;
-	FREE_WAITER(&waiter);
+	/* Should not get here! */
+	BUG_ON(1);
 }
 
 /*
@@ -1450,122 +1262,105 @@ wait_again:
  * enables the seemless use of arbitrary (blocking) spinlocks within
  * sleep/wakeup event loops.
  */
-static inline void
+static void
 ____down_mutex(struct rt_mutex *lock __EIP_DECL__)
 {
 	struct thread_info *ti = current_thread_info(), *old_owner;
-	unsigned long state, saved_state, nosched_flag;
+	unsigned long state, saved_state;
 	struct task_struct *task = ti->task;
 	struct rt_mutex_waiter waiter;
 	unsigned long flags;
-	int got_wakeup = 0, saved_lock_depth;
+	int got_wakeup = 0;
+	
+	        
 
 	trace_lock_irqsave(&trace_lock, flags, ti);
 	TRACE_BUG_ON_LOCKED(!raw_irqs_disabled());
-	_raw_spin_lock(&task->pi_lock);
 	_raw_spin_lock(&lock->wait_lock);
-	INIT_WAITER(&waiter);
-
-	old_owner = lock_owner(lock);
-	init_lists(lock);
-
-	if (likely(!old_owner) || __grab_lock(lock, task, old_owner->task)) {
-		/* granted */
-		TRACE_WARN_ON_LOCKED(!plist_head_empty(&lock->wait_list) && !old_owner);
-		if (old_owner) {
-			_raw_spin_lock(&old_owner->task->pi_lock);
-			set_new_owner(lock, old_owner, ti __EIP__);
-			_raw_spin_unlock(&old_owner->task->pi_lock);
-		} else
-			set_new_owner(lock, old_owner, ti __EIP__);
-		_raw_spin_unlock(&lock->wait_lock);
-		_raw_spin_unlock(&task->pi_lock);
-		trace_unlock_irqrestore(&trace_lock, flags, ti);
-
-		FREE_WAITER(&waiter);
-		return;
-	}
-
-	plist_node_init(&waiter.list, task->prio);
-	task_blocks_on_lock(&waiter, ti, lock __EIP__);
-
-	TRACE_BUG_ON_LOCKED(!raw_irqs_disabled());
-	/*
+/*
 	 * Here we save whatever state the task was in originally,
 	 * we'll restore it at the end of the function and we'll
 	 * take any intermediate wakeup into account as well,
 	 * independently of the mutex sleep/wakeup mechanism:
 	 */
 	saved_state = xchg(&task->state, TASK_UNINTERRUPTIBLE);
+        
+	INIT_WAITER(&waiter);
 
-	/* we don't need to touch the lock struct anymore */
-	_raw_spin_unlock(&lock->wait_lock);
-	_raw_spin_unlock(&task->pi_lock);
-	trace_unlock(&trace_lock, ti);
-
-	/*
-	 * TODO: check 'flags' for the IRQ bit here - it is illegal to
-	 * call down() from an IRQs-off section that results in
-	 * an actual reschedule.
-	 */
-
-	nosched_flag = current->flags & PF_NOSCHED;
-	current->flags &= ~PF_NOSCHED;
-
-	/*
-	 * BKL users expect the BKL to be held across spinlock/rwlock-acquire.
-	 * Save and clear it, this will cause the scheduler to not drop the
-	 * BKL semaphore if we end up scheduling:
-	 */
-	saved_lock_depth = task->lock_depth;
-	task->lock_depth = -1;
+	init_lists(lock);
 
-wait_again:
 	/* wait to be given the lock */
 	for (;;) {
-		unsigned long saved_flags = current->flags & PF_NOSCHED;
-
-		if (!waiter.ti)
-			break;
-		trace_local_irq_enable(ti);
-		// no need to check for preemption here, we schedule().
-		current->flags &= ~PF_NOSCHED;
+		old_owner = lock_owner(lock);
+        
+		if (allowed_to_take_lock(ti,task,old_owner,lock)) {
+		/* granted */
+			TRACE_WARN_ON_LOCKED(!plist_head_empty(&lock->wait_list) && !old_owner);
+			set_new_owner(lock, old_owner, ti __EIP__);
+			remove_pending_owner(task);
+			_raw_spin_unlock(&lock->wait_lock);
+                        
+			/*
+			 * Only set the task's state to TASK_RUNNING if it got
+			 * a non-mutex wakeup. We keep the original state otherwise.
+			 * A mutex wakeup changes the task's state to TASK_RUNNING_MUTEX,
+			 * not TASK_RUNNING - hence we can differentiate between the two
+			 * cases:
+			 */
+			state = xchg(&task->state, saved_state);
+			if (state == TASK_RUNNING)
+				got_wakeup = 1;
+			if (got_wakeup)
+				task->state = TASK_RUNNING;
+			trace_unlock_irqrestore(&trace_lock, flags, ti);
+			preempt_check_resched();
 
-		schedule();
+			FREE_WAITER(&waiter);
+			return;
+		}
+		
+		task_blocks_on_lock(&waiter, ti, lock,
+				    TASK_UNINTERRUPTIBLE __EIP__);
 
-		current->flags |= saved_flags;
-		trace_local_irq_disable(ti);
-		state = xchg(&task->state, TASK_UNINTERRUPTIBLE);
-		if (state == TASK_RUNNING)
-			got_wakeup = 1;
-	}
-	/*
-	 * Check to see if we didn't have ownership stolen.
-	 */
-	if (capture_lock(&waiter, ti, task)) {
-		state = xchg(&task->state, TASK_UNINTERRUPTIBLE);
-		if (state == TASK_RUNNING)
-			got_wakeup = 1;
-		goto wait_again;
-	}
-	/*
-	 * Only set the task's state to TASK_RUNNING if it got
-	 * a non-mutex wakeup. We keep the original state otherwise.
-	 * A mutex wakeup changes the task's state to TASK_RUNNING_MUTEX,
-	 * not TASK_RUNNING - hence we can differenciate between the two
-	 * cases:
-	 */
-	state = xchg(&task->state, saved_state);
-	if (state == TASK_RUNNING)
-		got_wakeup = 1;
-	if (got_wakeup)
-		task->state = TASK_RUNNING;
-	trace_local_irq_enable(ti);
-	preempt_check_resched();
+		TRACE_BUG_ON_LOCKED(!raw_irqs_disabled());
+		/* we don't need to touch the lock struct anymore */
+		_raw_spin_unlock(&lock->wait_lock);
+		trace_unlock(&trace_lock, ti);
+                
+		if (waiter.ti) {
+			unsigned long saved_flags = 
+				current->flags & PF_NOSCHED;
+			/*
+			 * BKL users expect the BKL to be held across spinlock/rwlock-acquire.
+			 * Save and clear it, this will cause the scheduler to not drop the
+			 * BKL semaphore if we end up scheduling:
+			 */
 
-	task->lock_depth = saved_lock_depth;
-	current->flags |= nosched_flag;
-	FREE_WAITER(&waiter);
+			int saved_lock_depth = task->lock_depth;
+			task->lock_depth = -1;
+			
+
+			trace_local_irq_enable(ti);
+			// no need to check for preemption here, we schedule().
+                        
+			current->flags &= ~PF_NOSCHED;
+			
+			schedule();
+			
+			trace_local_irq_disable(ti);
+			task->flags |= saved_flags;
+			task->lock_depth = saved_lock_depth;
+			state = xchg(&task->state, TASK_RUNNING_MUTEX);
+			if (state == TASK_RUNNING)
+				got_wakeup = 1;
+		}
+		
+		trace_lock_irq(&trace_lock, ti);
+		_raw_spin_lock(&lock->wait_lock);
+		if(waiter.ti) {
+			remove_waiter(lock,&waiter,1);
+		}
+	}
 }
 
 static void __up_mutex_waiter_savestate(struct rt_mutex *lock __EIP_DECL__);
@@ -1574,7 +1369,7 @@ static void __up_mutex_waiter_nosavestat
 /*
  * release the lock:
  */
-static inline void
+static void
 ____up_mutex(struct rt_mutex *lock, int save_state __EIP_DECL__)
 {
 	struct thread_info *ti = current_thread_info();
@@ -1587,13 +1382,6 @@ ____up_mutex(struct rt_mutex *lock, int 
 	_raw_spin_lock(&lock->wait_lock);
 	TRACE_BUG_ON_LOCKED(!lock->wait_list.prio_list.prev && !lock->wait_list.prio_list.next);
 
-#ifdef CONFIG_DEBUG_DEADLOCKS
-	if (trace_on) {
-		TRACE_WARN_ON_LOCKED(lock_owner(lock) != ti);
-		TRACE_WARN_ON_LOCKED(list_empty(&lock->held_list));
-		list_del_init(&lock->held_list);
-	}
-#endif
 
 #if ALL_TASKS_PI
 	if (plist_head_empty(&lock->wait_list))
@@ -1604,11 +1392,19 @@ ____up_mutex(struct rt_mutex *lock, int 
 			__up_mutex_waiter_savestate(lock __EIP__);
 		else
 			__up_mutex_waiter_nosavestate(lock __EIP__);
-	} else
+	} else {
+#ifdef CONFIG_DEBUG_DEADLOCKS
+		if (trace_on) {
+			TRACE_WARN_ON_LOCKED(lock_owner(lock) != ti);
+			TRACE_WARN_ON_LOCKED(list_empty(&lock->held_list));
+			list_del_init(&lock->held_list);
+		}
+#endif
 		lock->owner = NULL;
+		account_mutex_owner_up(ti->task);
+	}
 	_raw_spin_unlock(&lock->wait_lock);
 #if defined(CONFIG_DEBUG_PREEMPT) && defined(CONFIG_PREEMPT_RT)
-	account_mutex_owner_up(current);
 	if (!current->lock_count && !rt_prio(current->normal_prio) &&
 					rt_prio(current->prio)) {
 		static int once = 1;
@@ -1841,125 +1637,99 @@ static int __sched __down_interruptible(
 	struct rt_mutex_waiter waiter;
 	struct timer_list timer;
 	unsigned long expire = 0;
+	int timer_installed = 0;
 	int ret;
 
 	trace_lock_irqsave(&trace_lock, flags, ti);
 	TRACE_BUG_ON_LOCKED(!raw_irqs_disabled());
-	_raw_spin_lock(&task->pi_lock);
 	_raw_spin_lock(&lock->wait_lock);
 	INIT_WAITER(&waiter);
 
-	old_owner = lock_owner(lock);
 	init_lists(lock);
 
-	if (likely(!old_owner) || __grab_lock(lock, task, old_owner->task)) {
+	ret = 0;
+	/* wait to be given the lock */
+	for (;;) {
+		old_owner = lock_owner(lock);
+                
+		if (allowed_to_take_lock(ti,task,old_owner,lock)) {
 		/* granted */
-		TRACE_WARN_ON_LOCKED(!plist_head_empty(&lock->wait_list) && !old_owner);
-		if (old_owner) {
-			_raw_spin_lock(&old_owner->task->pi_lock);
+			TRACE_WARN_ON_LOCKED(!plist_head_empty(&lock->wait_list) && !old_owner);
 			set_new_owner(lock, old_owner, ti __EIP__);
-			_raw_spin_unlock(&old_owner->task->pi_lock);
-		} else
-			set_new_owner(lock, old_owner, ti __EIP__);
-		_raw_spin_unlock(&lock->wait_lock);
-		_raw_spin_unlock(&task->pi_lock);
-		trace_unlock_irqrestore(&trace_lock, flags, ti);
-
-		FREE_WAITER(&waiter);
-		return 0;
-	}
+			_raw_spin_unlock(&lock->wait_lock);
+			trace_unlock_irqrestore(&trace_lock, flags, ti);
 
-	set_task_state(task, TASK_INTERRUPTIBLE);
+			goto out_free_timer;
+		}
 
-	plist_node_init(&waiter.list, task->prio);
-	task_blocks_on_lock(&waiter, ti, lock __EIP__);
+		task_blocks_on_lock(&waiter, ti, lock, TASK_INTERRUPTIBLE __EIP__);
 
-	TRACE_BUG_ON_LOCKED(!raw_irqs_disabled());
-	/* we don't need to touch the lock struct anymore */
-	_raw_spin_unlock(&lock->wait_lock);
-	_raw_spin_unlock(&task->pi_lock);
-	trace_unlock_irqrestore(&trace_lock, flags, ti);
-
-	might_sleep();
+		TRACE_BUG_ON_LOCKED(!raw_irqs_disabled());
+		/* we don't need to touch the lock struct anymore */
+		_raw_spin_unlock(&lock->wait_lock);
+		trace_unlock_irqrestore(&trace_lock, flags, ti);
+		
+		might_sleep();
+		
+		nosched_flag = current->flags & PF_NOSCHED;
+		current->flags &= ~PF_NOSCHED;
+		if (time && !timer_installed) {
+			expire = time + jiffies;
+			init_timer(&timer);
+			timer.expires = expire;
+			timer.data = (unsigned long)current;
+			timer.function = process_timeout;
+			add_timer(&timer);
+			timer_installed = 1;
+		}
 
-	nosched_flag = current->flags & PF_NOSCHED;
-	current->flags &= ~PF_NOSCHED;
-	if (time) {
-		expire = time + jiffies;
-		init_timer(&timer);
-		timer.expires = expire;
-		timer.data = (unsigned long)current;
-		timer.function = process_timeout;
-		add_timer(&timer);
-	}
+                        
+		if (waiter.ti) {
+			schedule();
+		}
+		
+		current->flags |= nosched_flag;
+		task->state = TASK_RUNNING;
 
-	ret = 0;
-wait_again:
-	/* wait to be given the lock */
-	for (;;) {
-		if (signal_pending(current) || (time && !timer_pending(&timer))) {
-			/*
-			 * Remove ourselves from the wait list if we
-			 * didnt get the lock - else return success:
-			 */
-			trace_lock_irq(&trace_lock, ti);
-			_raw_spin_lock(&task->pi_lock);
-			_raw_spin_lock(&lock->wait_lock);
-			if (waiter.ti || time) {
-				plist_del(&waiter.list);
-				/*
-				 * If we were the last waiter then clear
-				 * the pending bit:
-				 */
-				if (plist_head_empty(&lock->wait_list))
-					lock->owner = lock_owner(lock);
-				/*
-				 * Just remove ourselves from the PI list.
-				 * (No big problem if our PI effect lingers
-				 *  a bit - owner will restore prio.)
-				 */
-				TRACE_WARN_ON_LOCKED(waiter.ti != ti);
-				TRACE_WARN_ON_LOCKED(current->blocked_on != &waiter);
-				plist_del(&waiter.pi_list);
-				waiter.pi_list.prio = task->prio;
-				waiter.ti = NULL;
-				current->blocked_on = NULL;
-				if (time) {
-					ret = (int)(expire - jiffies);
-					if (!timer_pending(&timer)) {
-						del_singleshot_timer_sync(&timer);
-						ret = -ETIMEDOUT;
-					}
-				} else
-					ret = -EINTR;
+		trace_lock_irqsave(&trace_lock, flags, ti);
+		_raw_spin_lock(&lock->wait_lock);
+		if(waiter.ti) {
+			remove_waiter(lock,&waiter,1);
+		}
+		if(signal_pending(current)) {
+			if (time) {
+				ret = (int)(expire - jiffies);
+				if (!timer_pending(&timer)) {
+					ret = -ETIMEDOUT;
+				}
 			}
-			_raw_spin_unlock(&lock->wait_lock);
-			_raw_spin_unlock(&task->pi_lock);
-			trace_unlock_irq(&trace_lock, ti);
-			break;
+			else
+				ret = -EINTR;
+			
+			goto out_unlock;
 		}
-		if (!waiter.ti)
-			break;
-		schedule();
-		set_task_state(task, TASK_INTERRUPTIBLE);
-	}
-
-	/*
-	 * Check to see if we didn't have ownership stolen.
-	 */
-	if (!ret) {
-		if (capture_lock(&waiter, ti, task)) {
-			set_task_state(task, TASK_INTERRUPTIBLE);
-			goto wait_again;
+		else if(timer_installed &&
+			!timer_pending(&timer)) {
+			ret = -ETIMEDOUT;
+			goto out_unlock;
 		}
 	}
 
-	task->state = TASK_RUNNING;
-	current->flags |= nosched_flag;
 
+ out_unlock:
+	_raw_spin_unlock(&lock->wait_lock);
+	trace_unlock_irqrestore(&trace_lock, flags, ti);
+
+ out_free_timer:
+	if (time && timer_installed) {
+		if (!timer_pending(&timer)) {
+			del_singleshot_timer_sync(&timer);
+		}
+	}
 	FREE_WAITER(&waiter);
 	return ret;
 }
+
 /*
  * trylock for writing -- returns 1 if successful, 0 if contention
  */
@@ -1972,7 +1742,6 @@ static int __down_trylock(struct rt_mute
 
 	trace_lock_irqsave(&trace_lock, flags, ti);
 	TRACE_BUG_ON_LOCKED(!raw_irqs_disabled());
-	_raw_spin_lock(&task->pi_lock);
 	/*
 	 * It is OK for the owner of the lock to do a trylock on
 	 * a lock it owns, so to prevent deadlocking, we must
@@ -1989,17 +1758,11 @@ static int __down_trylock(struct rt_mute
 	if (likely(!old_owner) || __grab_lock(lock, task, old_owner->task)) {
 		/* granted */
 		TRACE_WARN_ON_LOCKED(!plist_head_empty(&lock->wait_list) && !old_owner);
-		if (old_owner) {
-			_raw_spin_lock(&old_owner->task->pi_lock);
-			set_new_owner(lock, old_owner, ti __EIP__);
-			_raw_spin_unlock(&old_owner->task->pi_lock);
-		} else
-			set_new_owner(lock, old_owner, ti __EIP__);
+		set_new_owner(lock, old_owner, ti __EIP__);
 		ret = 1;
 	}
 	_raw_spin_unlock(&lock->wait_lock);
 failed:
-	_raw_spin_unlock(&task->pi_lock);
 	trace_unlock_irqrestore(&trace_lock, flags, ti);
 
 	return ret;
@@ -2050,7 +1813,6 @@ static void __up_mutex_waiter_nosavestat
 {
 	struct thread_info *old_owner_ti, *new_owner_ti;
 	struct task_struct *old_owner, *new_owner;
-	struct rt_mutex_waiter *w;
 	int prio;
 
 	old_owner_ti = lock_owner(lock);
@@ -2064,25 +1826,11 @@ static void __up_mutex_waiter_nosavestat
 	 * waiter's priority):
 	 */
 	_raw_spin_lock(&old_owner->pi_lock);
-	prio = old_owner->normal_prio;
-	if (unlikely(!plist_head_empty(&old_owner->pi_waiters))) {
-		w = plist_first_entry(&old_owner->pi_waiters, struct rt_mutex_waiter, pi_list);
-		if (w->ti->task->prio < prio)
-			prio = w->ti->task->prio;
-	}
+	prio = calc_pi_prio(old_owner);
+
 	if (unlikely(prio != old_owner->prio))
-		pi_setprio(lock, old_owner, prio);
+		mutex_setprio(old_owner, prio);
 	_raw_spin_unlock(&old_owner->pi_lock);
-#ifdef CAPTURE_LOCK
-#ifdef CONFIG_PREEMPT_RT
-	if (lock != &kernel_sem.lock) {
-#endif
-		new_owner->rt_flags |= RT_PENDOWNER;
-		new_owner->pending_owner = lock;
-#ifdef CONFIG_PREEMPT_RT
-	}
-#endif
-#endif
 	wake_up_process(new_owner);
 }
 
@@ -2090,7 +1838,6 @@ static void __up_mutex_waiter_savestate(
 {
 	struct thread_info *old_owner_ti, *new_owner_ti;
 	struct task_struct *old_owner, *new_owner;
-	struct rt_mutex_waiter *w;
 	int prio;
 
 	old_owner_ti = lock_owner(lock);
@@ -2104,25 +1851,11 @@ static void __up_mutex_waiter_savestate(
 	 * waiter's priority):
 	 */
 	_raw_spin_lock(&old_owner->pi_lock);
-	prio = old_owner->normal_prio;
-	if (unlikely(!plist_head_empty(&old_owner->pi_waiters))) {
-		w = plist_first_entry(&old_owner->pi_waiters, struct rt_mutex_waiter, pi_list);
-		if (w->ti->task->prio < prio)
-			prio = w->ti->task->prio;
-	}
+	prio = calc_pi_prio(old_owner);
+
 	if (unlikely(prio != old_owner->prio))
-		pi_setprio(lock, old_owner, prio);
+		mutex_setprio(old_owner, prio);
 	_raw_spin_unlock(&old_owner->pi_lock);
-#ifdef CAPTURE_LOCK
-#ifdef CONFIG_PREEMPT_RT
-	if (lock != &kernel_sem.lock) {
-#endif
-		new_owner->rt_flags |= RT_PENDOWNER;
-		new_owner->pending_owner = lock;
-#ifdef CONFIG_PREEMPT_RT
-	}
-#endif
-#endif
 	wake_up_process_mutex(new_owner);
 }
 
@@ -2578,7 +2311,7 @@ int __lockfunc _read_trylock(rwlock_t *r
 {
 #ifdef CONFIG_DEBUG_RT_LOCKING_MODE
 	if (!preempt_locks)
-	return _raw_read_trylock(&rwlock->lock.lock.debug_rwlock);
+		return _raw_read_trylock(&rwlock->lock.lock.debug_rwlock);
 	else
 #endif
 		return down_read_trylock_mutex(&rwlock->lock);
@@ -2905,17 +2638,6 @@ notrace int irqs_disabled(void)
 EXPORT_SYMBOL(irqs_disabled);
 #endif
 
-/*
- * This routine changes the owner of a mutex. It's only
- * caller is the futex code which locks a futex on behalf
- * of another thread.
- */
-void fastcall rt_mutex_set_owner(struct rt_mutex *lock, struct thread_info *t)
-{
-	account_mutex_owner_up(current);
-	account_mutex_owner_down(t->task, lock);
-	lock->owner = t;
-}
 
 struct thread_info * fastcall rt_mutex_owner(struct rt_mutex *lock)
 {
@@ -2950,7 +2672,6 @@ down_try_futex(struct rt_mutex *lock, st
 
 	trace_lock_irqsave(&trace_lock, flags, proxy_owner);
 	TRACE_BUG_ON_LOCKED(!raw_irqs_disabled());
-	_raw_spin_lock(&task->pi_lock);
 	_raw_spin_lock(&lock->wait_lock);
 
 	old_owner = lock_owner(lock);
@@ -2959,16 +2680,10 @@ down_try_futex(struct rt_mutex *lock, st
 	if (likely(!old_owner) || __grab_lock(lock, task, old_owner->task)) {
 		/* granted */
 		TRACE_WARN_ON_LOCKED(!plist_head_empty(&lock->wait_list) && !old_owner);
-		if (old_owner) {
-			_raw_spin_lock(&old_owner->task->pi_lock);
-			set_new_owner(lock, old_owner, proxy_owner __EIP__);
-			_raw_spin_unlock(&old_owner->task->pi_lock);
-		} else
 			set_new_owner(lock, old_owner, proxy_owner __EIP__);
 		ret = 1;
 	}
 	_raw_spin_unlock(&lock->wait_lock);
-	_raw_spin_unlock(&task->pi_lock);
 	trace_unlock_irqrestore(&trace_lock, flags, proxy_owner);
 
 	return ret;
@@ -3064,3 +2779,33 @@ void fastcall init_rt_mutex(struct rt_mu
 	__init_rt_mutex(lock, save_state, name, file, line);
 }
 EXPORT_SYMBOL(init_rt_mutex);
+
+
+pid_t get_blocked_on(task_t *task)
+{
+	pid_t res = 0;
+	struct rt_mutex *lock;
+	struct thread_info *owner;
+ try_again:
+	_raw_spin_lock(&task->pi_lock);
+	if(!task->blocked_on) {
+		_raw_spin_unlock(&task->pi_lock);
+		goto out;
+	}
+	lock = task->blocked_on->lock;
+	if(!_raw_spin_trylock(&lock->wait_lock)) {
+		_raw_spin_unlock(&task->pi_lock);
+		goto try_again;
+	}       
+	owner = lock_owner(lock);
+	if(owner)
+		res = owner->task->pid;
+
+	_raw_spin_unlock(&task->pi_lock);
+	_raw_spin_unlock(&lock->wait_lock);
+        
+ out:
+	return res;
+                
+}
+EXPORT_SYMBOL(get_blocked_on);

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RT Mutex patch and tester [PREEMPT_RT]
  2006-01-11 17:25 RT Mutex patch and tester [PREEMPT_RT] Esben Nielsen
@ 2006-01-11 17:51 ` Steven Rostedt
  2006-01-11 21:45   ` Esben Nielsen
  2006-01-12 11:33 ` Bill Huey
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2006-01-11 17:51 UTC (permalink / raw)
  To: Esben Nielsen; +Cc: Ingo Molnar, david singleton, linux-kernel


On Wed, 11 Jan 2006, Esben Nielsen wrote:

> I have done 2 things which might be of interest:
>
> I) An rt_mutex unittest suite. It might also be useful against the generic
> mutexes.
>
> II) I changed the priority inheritance mechanism in rt.c,
> achieving the following goals:
>

Interesting.  I'll take a closer look at this after I finish dealing with
some deadlocks that I found in posix-timers.

[snip]
>
> What am I missing:
> Testing on SMP. I have no SMP machine. The unittest can mimic SMP somewhat,
> but no unittest can catch _all_ errors.

I have an SMP machine that just freed up.  It would be interesting to see
how this works on an 8x machine.  I'll test it first on my 2x, and when
Ingo gets some time he can test it on his big boxes.

>
> Testing with futexes.
>
> ALL_TASKS_PI is always switched on now. This is to keep the code simpler.
>
> My machine fails to run with CONFIG_DEBUG_DEADLOCKS and CONFIG_DEBUG_PREEMPT
> on at the same time. I need a serial cable and a console over serial to
> debug it. My screen is too small to see enough there.
>
> Figure out more tests to run in my unittester.
>
> So why am I not doing those things before sending the patch? 1) My
> girlfriend comes back tomorrow with our child, and I know I will have no
> time to code anything substantial then. 2) I want to make sure Ingo sees
> this approach before he starts merging preempt_rt and rt_mutex with his
> now-mainstream mutex.

If I get time, I might be able to finish this up, if the changes look
decent, and don't cause too much overhead.

-- Steve


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RT Mutex patch and tester [PREEMPT_RT]
  2006-01-11 17:51 ` Steven Rostedt
@ 2006-01-11 21:45   ` Esben Nielsen
  0 siblings, 0 replies; 22+ messages in thread
From: Esben Nielsen @ 2006-01-11 21:45 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Ingo Molnar, david singleton, linux-kernel


On Wed, 11 Jan 2006, Steven Rostedt wrote:

>
> On Wed, 11 Jan 2006, Esben Nielsen wrote:
> [snip]
>
> If I get time, I might be able to finish this up, if the changes look
> decent, and don't cause too much overhead.

This was the answer I was hoping for!  I'll try to get time to test and
improve it myself, of course.
As for optimization: I take and release current->pi_lock and owner->pi_lock
a lot because holding both at once isn't allowed. Some code restructuring
could probably improve it such that it first finishes what it has to finish
under current->pi_lock and then does what it has to do under owner->pi_lock
- or the other way around.
In a few places pi_lock is taken just to be safe; those places might be
removable.

But first we have to establish the principle. Then optimization can begin.

Esben

>
> -- Steve
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RT Mutex patch and tester [PREEMPT_RT]
  2006-01-11 17:25 RT Mutex patch and tester [PREEMPT_RT] Esben Nielsen
  2006-01-11 17:51 ` Steven Rostedt
@ 2006-01-12 11:33 ` Bill Huey
  2006-01-12 12:54   ` Esben Nielsen
  2006-01-15  4:24 ` Bill Huey
       [not found] ` <Pine.LNX.4.44L0.0601181120100.1993-201000@lifa02.phys.au.dk>
  3 siblings, 1 reply; 22+ messages in thread
From: Bill Huey @ 2006-01-12 11:33 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Ingo Molnar, Steven Rostedt, david singleton, linux-kernel,
	Bill Huey (hui)

On Wed, Jan 11, 2006 at 06:25:36PM +0100, Esben Nielsen wrote:
> I have done 2 things which might be of interest:
>
> II) I changed the priority inheritance mechanism in rt.c,
> achieving the following goals:
> 3) Simpler code. rt.c was kind of messy. Maybe it still is....:-)

Awesome. The code was done in what seems like a hurry and mixes up a
bunch of things that should be separated out into individual sub-sections.

The allocation of the waiter object on the thread's stack should undergo
some consideration of whether it should be moved into a more permanent
store. I haven't looked at an implementation of turnstiles recently, but
I suspect that this is what it actually is, and it would eliminate the
moving of waiters to the thread that is actively running with the lock
path's terminal mutex. It works, but it's sloppy stuff.

[loop trickery, priority lending operations handed off to mutex owner]

> What is gained is that the amount of time where irqs and preemption are off
> is limited: one task does its work with preemption disabled, wakes up the
> next, enables preemption and schedules. The amount of time spent with
> preemption disabled has a clear upper limit, untouched by how complicated
> and deep the lock structure is.

task_blocks_on_lock() is another place that one might consider separating
out some bundled functionality into different places in the down()
implementation. I'll look at the preemption stuff next.

Just some ideas. Looks like a decent start.

bill


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RT Mutex patch and tester [PREEMPT_RT]
  2006-01-12 11:33 ` Bill Huey
@ 2006-01-12 12:54   ` Esben Nielsen
  2006-01-13  8:07     ` Bill Huey
  0 siblings, 1 reply; 22+ messages in thread
From: Esben Nielsen @ 2006-01-12 12:54 UTC (permalink / raw)
  To: Bill Huey; +Cc: Ingo Molnar, Steven Rostedt, david singleton, linux-kernel



On Thu, 12 Jan 2006, Bill Huey wrote:

> On Wed, Jan 11, 2006 at 06:25:36PM +0100, Esben Nielsen wrote:
> > I have done 2 things which might be of interrest:
> >
> > II) I changed the priority inheritance mechanism in rt.c,
> > optaining the following goals:
> > 3) Simpler code. rt.c was kind of messy. Maybe it still is....:-)
>
> Awssome. The code was done in what seems like a hurry and mixes up a
> bunch of things that should be seperate out into individual sub-sections.

*nod*
I worked on the tester before Christmas and only had a few evenings for
myself to look at the kernel after Christmas. With a full-time job and a
family I don't get many undisturbed spare hours to code (close to none,
really), so I had to ship it while my girlfriend and child were away for a
few days.

>
> The allocation of the waiter object on the thread's stack should undergo
> some consideration of whether this should be move into a more permanent
> store.

Hmm, why?
When I first saw it in the Linux kernel I thought: damn, this is elegant.
You only need a waiter when you block, and while you block your stack is
untouched. When you don't block you don't need the waiter, so why have it
allocated somewhere else, say in task_t?
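
The pattern is roughly this - a simplified sketch, not the exact slow
path; add_waiter() and try_to_take_lock() are made-up names:

static void down_slowpath_sketch(struct rt_mutex *lock)
{
	struct rt_mutex_waiter waiter;		/* lives on this task's stack  */

	spin_lock(&lock->wait_lock);
	add_waiter(lock, &waiter);		/* queue on lock->wait_list    */
	while (!try_to_take_lock(lock)) {	/* grab-locking: check again   */
		spin_unlock(&lock->wait_lock);
		schedule();			/* sleep until woken           */
		spin_lock(&lock->wait_lock);
	}
	spin_unlock(&lock->wait_lock);
	/* the waiter vanishes with this stack frame - nothing to free */
}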

> I haven't looked at an implementation of turnstiles recently,

turnstiles? What is that?

> but
> I suspect that this is what it actually is and it would eliminate the
> moving of waiters to the thread that's is actively running with the lock
> path terminal mutex. It works, but it's sloppy stuff.
>
> [loop trickery, priority leanding operations handed off to mutex owner]
>
> > What is gained is that the amount of time where irq and preemption is off
> > is limited: One task does it's work with preemption disabled, wakes up the
> > next and enable preemption and schedules. The amount of time spend with
> > preemption disabled is has a clear upper limit, untouched by how
> > complicated and deep the lock structure is.
>
> task_blocks_on_lock() is another place that one might consider seperating
> out some bundled functionality into different places in the down()
> implementation.

What is done now with my patch is "minimal", but you have to add yourself
to the wait list and you have to boost the owner of the lock.
You might be able to split it up by releasing and reacquiring all the
spinlocks; but I am pretty sure this is not the place in the whole system
giving you the longest preemption-off section, so it doesn't make much
sense to improve it.
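
The bundle that has to stay together is roughly this (boost_owner() is a
made-up name; the real task_blocks_on_lock() also handles the pi_lock
ordering and the debug bookkeeping):

static void block_on_lock_sketch(struct rt_mutex *lock,
				 struct rt_mutex_waiter *waiter)
{
	/* 1: add ourselves to the wait list, sorted by priority */
	waiter->list.prio = current->prio;
	plist_add(&waiter->list, &lock->wait_list);
	current->blocked_on = waiter;

	/* 2: boost the owner of the lock */
	boost_owner(lock_owner(lock)->task, current->prio);
}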


> I'll look at the preemption stuff next.
>
> Just some ideas. Looks like a decent start.
>

Thanks for the positive response!

Esben

> bill
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RT Mutex patch and tester [PREEMPT_RT]
  2006-01-12 12:54   ` Esben Nielsen
@ 2006-01-13  8:07     ` Bill Huey
  2006-01-13  8:47       ` Esben Nielsen
  0 siblings, 1 reply; 22+ messages in thread
From: Bill Huey @ 2006-01-13  8:07 UTC (permalink / raw)
  To: Esben Nielsen; +Cc: Ingo Molnar, Steven Rostedt, david singleton, linux-kernel

On Thu, Jan 12, 2006 at 01:54:23PM +0100, Esben Nielsen wrote:
> turnstiles? What is that?

http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/subr_turnstile.c

Please, read. Now tell me or not if that looks familiar ? :)

Moving closer to such an implementation is arguable, but it is something
that should be considered, since folks in both the Solaris and FreeBSD
communities have given a lot more thought to these issues.

The stack-allocated objects are fine for now. Priority inheritance
chains should never get long with a fine-grained kernel, so the use
of a stack-allocated object and migrating PI-ed waiters should not
be a major real-world issue in Linux yet.

Folks should also consider using an adaptive spin in the __grab_lock() (sp?)
related loops as a possible way of optimizing away the immediate blocks.
FreeBSD actually checks whether the owner of a lock is actively running
("current") on another processor, and spins if it is or blocks if it is
not. It's pretty trivial code, so it's not a big issue to implement. This
is ignoring the CPU local storage issues.
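
Something along these lines - a pure sketch with made-up names, ignoring
memory ordering and the CPU local storage issues mentioned above:

static int adaptive_spin_sketch(struct rt_mutex *lock)
{
	struct task_struct *owner;

	while ((owner = lock_owner_task(lock)) != NULL) {
		if (!task_is_running_on_other_cpu(owner))
			return 0;		/* owner not running: block    */
		if (try_to_take_lock(lock))
			return 1;		/* took it without sleeping    */
		cpu_relax();			/* spin a little and retry     */
	}
	return try_to_take_lock(lock);		/* owner released it meanwhile */
}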

bill


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RT Mutex patch and tester [PREEMPT_RT]
  2006-01-13  8:07     ` Bill Huey
@ 2006-01-13  8:47       ` Esben Nielsen
  2006-01-13 10:19         ` Bill Huey
  0 siblings, 1 reply; 22+ messages in thread
From: Esben Nielsen @ 2006-01-13  8:47 UTC (permalink / raw)
  To: Bill Huey; +Cc: Ingo Molnar, Steven Rostedt, david singleton, linux-kernel



On Fri, 13 Jan 2006, Bill Huey wrote:

> On Thu, Jan 12, 2006 at 01:54:23PM +0100, Esben Nielsen wrote:
> > turnstiles? What is that?
>
> http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/subr_turnstile.c
>
> Please, read. Now tell me or not if that looks familiar ? :)

Yes, it reminds me of Ingo's first approach to pi locking:
Everything is done under a global spin lock. In Ingo's approach it was the
pi_lock. In turnstiles it is sched_lock, which (without looking at other
code in FreeBSD) locks the whole scheduler.

Although it makes the code a lot simpler, scalability is of course the
main issue here. But apparently FreeBSD has a global lock protecting the
scheduler anyway.

Otherwise it looks a lot like the rt_mutex.

Esben


>
> Moving closer an implementation is arguable, but it is something that
> should be considered somewhat since folks in both the Solaris (and
> FreeBSD) communities have given a lot more consideration to these issues.
>
> The stack allocated objects are fine for now. Priority inheritance
> chains should never get long with a fine grained kernel, so the use
> of a stack allocated object and migrating pi-ed waiters should not
> be a major real world issue in Linux yet.
>
> Folks should also consider using an adaptive spin in the __grab_lock() (sp?)
> related loops as a possible way of optimizing away the immediate blocks.
> FreeBSD actually checks the owner of a lock aacross another processor
> to see if it's actively running, "current", and will block or wait if
> it's running or not respectively. It's pretty trivial code, so it's
> not a big issue to implement. This is ignoring the CPU local storage
> issues.
>
> bill
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RT Mutex patch and tester [PREEMPT_RT]
  2006-01-13  8:47       ` Esben Nielsen
@ 2006-01-13 10:19         ` Bill Huey
  0 siblings, 0 replies; 22+ messages in thread
From: Bill Huey @ 2006-01-13 10:19 UTC (permalink / raw)
  To: Esben Nielsen; +Cc: Ingo Molnar, Steven Rostedt, david singleton, linux-kernel

On Fri, Jan 13, 2006 at 09:47:39AM +0100, Esben Nielsen wrote:
> On Fri, 13 Jan 2006, Bill Huey wrote:
> 
> > On Thu, Jan 12, 2006 at 01:54:23PM +0100, Esben Nielsen wrote:
> > > turnstiles? What is that?
> >
> > http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/subr_turnstile.c
> >
> > Please, read. Now tell me or not if that looks familiar ? :)
> 
> Yes, it reminds me of Ingo's first approach to pi locking:
> Everything is done under a global spin lock. In Ingo's approach it was the
> pi_lock. In turnstiles it is sched_lock, which (without looking at other
> code in FreeBSD) locks the whole scheduler.
> 
> Although making the code a lot simpler, scalability is ofcourse the main
> issue here. But apparently FreeBSD does have a global lock protecting the
> scheduler anyway.

FreeBSD hasn't really addressed its scalability issues with its locking
yet. The valuable thing about that file is how it manipulates thread
priorities under priority inheritance. Some ideas might be stealable from
it. That's all.

bill


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RT Mutex patch and tester [PREEMPT_RT]
  2006-01-11 17:25 RT Mutex patch and tester [PREEMPT_RT] Esben Nielsen
  2006-01-11 17:51 ` Steven Rostedt
  2006-01-12 11:33 ` Bill Huey
@ 2006-01-15  4:24 ` Bill Huey
  2006-01-16  8:35   ` Esben Nielsen
       [not found] ` <Pine.LNX.4.44L0.0601181120100.1993-201000@lifa02.phys.au.dk>
  3 siblings, 1 reply; 22+ messages in thread
From: Bill Huey @ 2006-01-15  4:24 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Ingo Molnar, Steven Rostedt, david singleton, linux-kernel,
	Bill Huey (hui)

On Wed, Jan 11, 2006 at 06:25:36PM +0100, Esben Nielsen wrote:
> So how many locks do we have to worry about? Two.
> One for locking the lock. One for locking various PI related data on the
> task structure, as the pi_waiters list, blocked_on, pending_owner - and
> also prio.
> Therefore only lock->wait_lock and sometask->pi_lock will be locked at the
> same time. And in that order. There is therefore no spinlock deadlocks.
> And the code is simpler.

Ok, got a question. How do you deal with the false reporting and handling
of a lock-circularity window involving the handoff of task A's BKL to
another task B? Task A is blocked trying to get a mutex owned by task B,
and task A is blocking B since A owns the BKL, which task B is contending
on. It's not a deadlock, since it's a hand-off situation.

I didn't see any handling of this case in the code, and I was wondering
whether the traversal logic you wrote avoids this case as an inherent
property and I just missed that stuff?

bill


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RT Mutex patch and tester [PREEMPT_RT]
  2006-01-15  4:24 ` Bill Huey
@ 2006-01-16  8:35   ` Esben Nielsen
  2006-01-16 10:22     ` Bill Huey
  0 siblings, 1 reply; 22+ messages in thread
From: Esben Nielsen @ 2006-01-16  8:35 UTC (permalink / raw)
  To: Bill Huey; +Cc: Ingo Molnar, Steven Rostedt, david singleton, linux-kernel

On Sat, 14 Jan 2006, Bill Huey wrote:

> On Wed, Jan 11, 2006 at 06:25:36PM +0100, Esben Nielsen wrote:
> > So how many locks do we have to worry about? Two.
> > One for locking the lock. One for locking various PI related data on the
> > task structure, as the pi_waiters list, blocked_on, pending_owner - and
> > also prio.
> > Therefore only lock->wait_lock and sometask->pi_lock will be locked at the
> > same time. And in that order. There is therefore no spinlock deadlocks.
> > And the code is simpler.
>
> Ok, got a question. How do deal with the false reporting and handling of
> a lock circularity window involving the handoff of task A's BKL to another
> task B ? Task A is blocked trying to get a mutex owned by task B, task A
> is block B since it owns BKL which task B is contending on. It's not a
> deadlock since it's a hand off situation.
>
I am not precisely sure what you mean by "false reporting".

Handing off the BKL is done in schedule() in sched.c. I.e. if B owns a
normal mutex, A will give the BKL to B when A calls schedule() in the
down-operation of that mutex.

> I didn't see any handling of this case in the code and I was wondering
> if the traversal logic you wrote avoids this case as an inherent property
> and I missed that stuff ?

The stuff is in kernel/sched.c and lib/kernel_lock.c.
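
For reference, the shape of it in mainline sched.c of that era is roughly
this (heavily trimmed and from memory - the -rt kernel_sem details
differ):

static void schedule_bkl_sketch(void)
{
	struct task_struct *prev = current;

	release_kernel_lock(prev);	/* drop the BKL if prev->lock_depth >= 0 */

	/* ... pick the next task and context-switch ... */

	reacquire_kernel_lock(current);	/* block here, if need be, until the
					 * BKL can be taken again */
}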


>
> bill
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RT Mutex patch and tester [PREEMPT_RT]
  2006-01-16  8:35   ` Esben Nielsen
@ 2006-01-16 10:22     ` Bill Huey
  2006-01-16 10:53       ` Bill Huey
  0 siblings, 1 reply; 22+ messages in thread
From: Bill Huey @ 2006-01-16 10:22 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Ingo Molnar, Steven Rostedt, david singleton, linux-kernel,
	Bill Huey (hui)

On Mon, Jan 16, 2006 at 09:35:42AM +0100, Esben Nielsen wrote:
> On Sat, 14 Jan 2006, Bill Huey wrote:
> I am not precisely sure what you mean by "false reporting".
> 
> Handing off BKL is done in schedule() in sched.c. I.e. if B owns a normal
> mutex, A will give BKL to B when A calls schedule() in the down-operation
> of that mutex.

Task A holding BKL would have to drop BKL when it blocks against a mutex held
by task B in my example and therefore must hit schedule() before any pi boost
operation happens. I'll take another look at your code just to see if this is
clear.

bill


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RT Mutex patch and tester [PREEMPT_RT]
  2006-01-16 10:22     ` Bill Huey
@ 2006-01-16 10:53       ` Bill Huey
  2006-01-16 11:30         ` Esben Nielsen
  0 siblings, 1 reply; 22+ messages in thread
From: Bill Huey @ 2006-01-16 10:53 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Ingo Molnar, Steven Rostedt, david singleton, linux-kernel,
	Bill Huey (hui)

On Mon, Jan 16, 2006 at 02:22:55AM -0800, Bill Huey wrote:
> On Mon, Jan 16, 2006 at 09:35:42AM +0100, Esben Nielsen wrote:
> > On Sat, 14 Jan 2006, Bill Huey wrote:
> > I am not precisely sure what you mean by "false reporting".
> > 
> > Handing off BKL is done in schedule() in sched.c. I.e. if B owns a normal
> > mutex, A will give BKL to B when A calls schedule() in the down-operation
> > of that mutex.
> 
> Task A holding BKL would have to drop BKL when it blocks against a mutex held
> by task B in my example and therefore must hit schedule() before any pi boost
> operation happens. I'll take another look at your code just to see if this is
> clear.

Esben,

Ok, I see what you did. Looking through the raw patch instead of the
applied sources made it not so obvious to me. Looks like the logic to deal
with that case is there, good. I like the patch, but it seems to context
switch twice as much, which might be a killer.

bill


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RT Mutex patch and tester [PREEMPT_RT]
  2006-01-16 10:53       ` Bill Huey
@ 2006-01-16 11:30         ` Esben Nielsen
  0 siblings, 0 replies; 22+ messages in thread
From: Esben Nielsen @ 2006-01-16 11:30 UTC (permalink / raw)
  To: Bill Huey; +Cc: Ingo Molnar, Steven Rostedt, david singleton, linux-kernel

On Mon, 16 Jan 2006, Bill Huey wrote:

> On Mon, Jan 16, 2006 at 02:22:55AM -0800, Bill Huey wrote:
> > On Mon, Jan 16, 2006 at 09:35:42AM +0100, Esben Nielsen wrote:
> > > On Sat, 14 Jan 2006, Bill Huey wrote:
> > > I am not precisely sure what you mean by "false reporting".
> > >
> > > Handing off BKL is done in schedule() in sched.c. I.e. if B owns a normal
> > > mutex, A will give BKL to B when A calls schedule() in the down-operation
> > > of that mutex.
> >
> > Task A holding BKL would have to drop BKL when it blocks against a mutex held
> > by task B in my example and therefore must hit schedule() before any pi boost
> > operation happens. I'll take another look at your code just to see if this is
> > clear.
>
> Esben,
>
> Ok, I see what you did. Looking through the raw patch instead of the applied
> sources made it not so obvious it me.

Only small patches can be read directly....

> Looks the logic for it is there to deal
> with that case, good. I like the patch,

good :-)

> but it does context switch twice as
> much it seems which might a killer.

Twice? It depends on the lock nesting depth. The number of task
switches is the lock nesting depth in any blocking down() operation.
In all other implementations the number of task switches is just 1.
I.e. in the usual case (task A blocks on lock 1 owned by B, which is not
itself blocked), where the lock nesting depth is 1, there is no penalty
with my approach. The penalty comes in the case (task A blocks on lock 1
owned by B, which is blocked on lock 2 owned by C). There B is scheduled
as an agent to boost C, such that A never touches lock 2 or task C (the
hand-over step is sketched below). Precisely this makes the spinlocks a
lot easier to handle.

On the other hand the maximum time spent with preemption off is
O(1) in my implementation, whereas it is at least O(lock nesting depth)
in other implementations. I think the current implementation in
preempt_rt is max(O(total number of PI waiters), O(lock nesting depth)).
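
To illustrate, each woken "agent" does roughly this per step - a
simplified sketch, not the literal patch code:

static void boost_chain_step_sketch(struct rt_mutex *lock, int prio)
{
	task_t *owner;

	spin_lock(&lock->wait_lock);
	if (!lock_owner(lock)) {		/* lock got released meanwhile */
		spin_unlock(&lock->wait_lock);
		return;
	}
	owner = lock_owner(lock)->task;

	_raw_spin_lock(&owner->pi_lock);
	if (owner->prio > prio) {
		mutex_setprio(owner, prio);		/* pass the boost on         */
		if (owner->blocked_on)
			wake_up_process_mutex(owner);	/* it becomes the next agent */
	}
	_raw_spin_unlock(&owner->pi_lock);
	spin_unlock(&lock->wait_lock);
	/* we schedule away again; preemption was only disabled briefly */
}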

Esben

>
> bill
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RT Mutex patch and tester [PREEMPT_RT]
       [not found] ` <Pine.LNX.4.44L0.0601181120100.1993-201000@lifa02.phys.au.dk>
@ 2006-01-18 10:38   ` Ingo Molnar
  2006-01-18 12:49   ` Steven Rostedt
       [not found]   ` <Pine.LNX.4.44L0.0601230047290.31387-201000@lifa01.phys.au.dk>
  2 siblings, 0 replies; 22+ messages in thread
From: Ingo Molnar @ 2006-01-18 10:38 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Steven Rostedt, david singleton, Bill Huey, linux-kernel,
	Thomas Gleixner


* Esben Nielsen <simlo@phys.au.dk> wrote:

> Hi,
>  I have updated it:
> 
> 1) Now ALL_TASKS_PI is 0 again. Only RT tasks will be added to
> task->pi_waiters. Therefore we avoid taking the owner->pi_lock when the
> waiter isn't RT.
> 2) Merged into 2.6.15-rt6.
> 3) Updated the tester to test the hand over of BKL, which was mentioned
> as a potential issue by Bill Huey. Also added/adjusted the tests for the
> ALL_TASKS_PI==0 setup.
> (I really like unittesting: If someone points out an issue or finds a bug,
> make a test first demonstrating the problem. Then fixing the code is a lot
> easier - especially in this case where I run rt.c in userspace where I can
> easily use gdb.)

looks really nice to me. In particular i like the cleanup effect:

 5 files changed, 490 insertions(+), 744 deletions(-)

right now Thomas is merging hrtimer-latest to -rt, which temporarily 
blocks merging of intrusive patches - but i'll try to merge your patch 
after Thomas is done. (hopefully later today)

	Ingo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RT Mutex patch and tester [PREEMPT_RT]
       [not found] ` <Pine.LNX.4.44L0.0601181120100.1993-201000@lifa02.phys.au.dk>
  2006-01-18 10:38   ` Ingo Molnar
@ 2006-01-18 12:49   ` Steven Rostedt
  2006-01-18 14:18     ` Esben Nielsen
       [not found]   ` <Pine.LNX.4.44L0.0601230047290.31387-201000@lifa01.phys.au.dk>
  2 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2006-01-18 12:49 UTC (permalink / raw)
  To: Esben Nielsen; +Cc: Ingo Molnar, david singleton, Bill Huey, linux-kernel

On Wed, 2006-01-18 at 11:31 +0100, Esben Nielsen wrote:
> Hi,
>  I have updated it:
> 
> 1) Now ALL_TASKS_PI is 0 again. Only RT tasks will be added to
> task->pi_waiters. Therefore we avoid taking the owner->pi_lock when the
> waiter isn't RT.
> 2) Merged into 2.6.15-rt6.
> 3) Updated the tester to test the hand over of BKL, which was mentioned
> as a potential issue by Bill Huey. Also added/adjusted the tests for the
> ALL_TASKS_PI==0 setup.
> (I really like unittesting: If someone points out an issue or finds a bug,
> make a test first demonstrating the problem. Then fixing the code is a lot
> easier - especially in this case where I run rt.c in userspace where I can
> easily use gdb.)

Hmm, maybe I'll actually get a chance to finally play with this. I
discovered issues with the hrtimers earlier, and was too busy helping
Thomas with them.  That had to take precedence.

;)

-- Steve



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RT Mutex patch and tester [PREEMPT_RT]
  2006-01-18 12:49   ` Steven Rostedt
@ 2006-01-18 14:18     ` Esben Nielsen
  0 siblings, 0 replies; 22+ messages in thread
From: Esben Nielsen @ 2006-01-18 14:18 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Ingo Molnar, david singleton, Bill Huey, linux-kernel

On Wed, 18 Jan 2006, Steven Rostedt wrote:

> On Wed, 2006-01-18 at 11:31 +0100, Esben Nielsen wrote:
> > Hi,
> >  I have updated it:
> >
> > 1) Now ALL_TASKS_PI is 0 again. Only RT tasks will be added to
> > task->pi_waiters. Therefore we avoid taking the owner->pi_lock when the
> > waiter isn't RT.
> > 2) Merged into 2.6.15-rt6.
> > 3) Updated the tester to test the hand over of BKL, which was mentioned
> > as a potential issue by Bill Huey. Also added/adjusted the tests for the
> > ALL_TASKS_PI==0 setup.
> > (I really like unittesting: If someone points out an issue or finds a bug,
> > make a test first demonstrating the problem. Then fixing the code is a lot
> > easier - especially in this case where I run rt.c in userspace where I can
> > easily use gdb.)
>
> Hmm, maybe I'll actually get a chance to finally play with this. I've
> discovered issues with the hrtimers earlier, and was too busy helping
> Thomas with them.  That had to take precedence.
>
I just keep making small improvements and testing as I get time (which
isn't much).

Esben
> ;)
>
> -- Steve
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RT Mutex patch and tester [PREEMPT_RT]
       [not found]   ` <Pine.LNX.4.44L0.0601230047290.31387-201000@lifa01.phys.au.dk>
@ 2006-01-23  0:38     ` david singleton
  2006-01-23  2:04     ` Bill Huey
  1 sibling, 0 replies; 22+ messages in thread
From: david singleton @ 2006-01-23  0:38 UTC (permalink / raw)
  To: Esben Nielsen; +Cc: Bill Huey, linux-kernel, Ingo Molnar, Steven Rostedt


On Jan 22, 2006, at 4:20 PM, Esben Nielsen wrote:

> Ok this time around I have a patch against 2.6.15-rt12.
>
> Updated since the last time:
> 1) Fixed a bug wrt. BKL. The problem involved the grab-lock mechanism.
> 2) Updated the tester to detect the bug. See
> TestRTMutex/reaquire_bkl_while_waiting.tst :-)
>
> I still need testing on SMP and with robust futexes.
> I test with the old priority inheritance test of mine. Do you guys test
> with anything newer?

I'll try your patch with rt12 and robustness.  I have an SMP test I'm
working on.

David

>
> When looking at BKL i found that the buisness about reacquiring BKL is
> really bad for real-time. The problem is that you without knowing it
> suddenly can block on the BKL!
>
> Here is the problem:
>
> Task B (non-RT) takes BKL. It then takes mutex 1. Then B
> tries to lock mutex 2, which is owned by task C. B goes blocks and 
> releases the
> BKL. Our RT task A comes along and tries to get 1. It boosts task B
> which boosts task C which releases mutex 2. Now B can continue? No, it 
> has
> to reaquire BKL! The netto effect is that our RT task A waits for BKL 
> to
> be released without ever calling into a module using BKL. But just 
> because
> somebody in some non-RT code called into a module otherwise considered
> safe for RT usage with BKL held, A must wait on BKL!
>
> Esben
>
> On Wed, 18 Jan 2006, Esben Nielsen wrote:
>
>> Hi,
>>  I have updated it:
>>
>> 1) Now ALL_TASKS_PI is 0 again. Only RT tasks will be added to
>> task->pi_waiters. Therefore we avoid taking the owner->pi_lock when 
>> the
>> waiter isn't RT.
>> 2) Merged into 2.6.15-rt6.
>> 3) Updated the tester to test the hand over of BKL, which was 
>> mentioned
>> as a potential issue by Bill Huey. Also added/adjusted the tests for 
>> the
>> ALL_TASKS_PI==0 setup.
>> (I really like unittesting: If someone points out an issue or finds a 
>> bug,
>> make a test first demonstrating the problem. Then fixing the code is 
>> a lot
>> easier - especially in this case where I run rt.c in userspace where 
>> I can
>> easily use gdb.)
>>
>> Esben
>>
>> On Wed, 11 Jan 2006, Esben Nielsen wrote:
>>
>>> I have done 2 things which might be of interrest:
>>>
>>> I) A rt_mutex unittest suite. It might also be usefull against the 
>>> generic
>>> mutexes.
>>>
>>> II) I changed the priority inheritance mechanism in rt.c,
>>> optaining the following goals:
>>>
>>> 1) rt_mutex deadlocks doesn't become raw_spinlock deadlocks. And more
>>> importantly: futex_deadlocks doesn't become raw_spinlock deadlocks.
>>> 2) Time-Predictable code. No matter how deep you nest your locks
>>> (kernel or futex) the time spend in irqs or preemption off should be
>>> limited.
>>> 3) Simpler code. rt.c was kind of messy. Maybe it still is....:-)
>>>
>>> I have lost:
>>> 1) Some speed in the slow slow path. I _might_ have gained some in 
>>> the
>>> normal slow path, though, without meassuring it.
>>>
>>>
>>> Idea:
>>>
>>> When a task blocks on a lock it adds itself to the wait list and 
>>> calls
>>> schedule(). When it is unblocked it has the lock. Or rather due to
>>> grab-locking it has to check again. Therefore the schedule() call is
>>> wrapped in a loop.
>>>
>>> Now when a task is PI boosted, it is at the same time checked if it 
>>> is
>>> blocked on a rt_mutex. If it is, it is unblocked ( 
>>> wake_up_process_mutex()
>>> ). It will now go around in the above loop mentioned above. Within 
>>> this loop
>>> it will now boost the owner of the lock it is blocked on, maybe 
>>> unblocking the
>>> owner, which in turn can boost and unblock the next in the lock 
>>> chain...
>>> At all points there is at least one task boosted to the highest 
>>> priority
>>> required unblocked and working on boosting the next in the lock 
>>> chain and
>>> there is therefore no priority inversion.
>>>
>>> The boosting of a long list of blocked tasks will clearly take 
>>> longer than
>>> the previous version as there will be task switches. But remember, 
>>> it is
>>> in the slow slow path! And it only occurs when PI boosting is 
>>> happening on
>>> _nested_ locks.
>>>
>>> What is gained is that the amount of time where irq and preemption 
>>> is off
>>> is limited: One task does it's work with preemption disabled, wakes 
>>> up the
>>> next and enable preemption and schedules. The amount of time spend 
>>> with
>>> preemption disabled is has a clear upper limit, untouched by how
>>> complicated and deep the lock structure is.
>>>
>>> So how many locks do we have to worry about? Two.
>>> One for locking the lock. One for locking various PI related data on 
>>> the
>>> task structure, as the pi_waiters list, blocked_on, pending_owner - 
>>> and
>>> also prio.
>>> Therefore only lock->wait_lock and sometask->pi_lock will be locked 
>>> at the
>>> same time. And in that order. There is therefore no spinlock 
>>> deadlocks.
>>> And the code is simpler.
>>>
>>> Because of the simplere code I was able to implement an optimization:
>>> Only the first waiter on each lock is member of the 
>>> owner->pi_waiters.
>>> Therefore it is not needed to do any list traversels on neither
>>> owner->pi_waiters, not lock->wait_list. Every operation requires 
>>> touching
>>> only removing and/or adding one element to these lists.
>>>
>>> As for robust futexes: They ought to work out of the box now, 
>>> blocking in
>>> deadlock situations. I have added an entry to /proc/<pid>/status
>>> "BlckOn: <pid>". This can be used to do "post mortem" deadlock 
>>> detection
>>> from userspace.
>>>
>>> What am I missing:
>>> Testing on SMP. I have no SMP machine. The unittest can mimic the SMP
>>> somewhat
>>> but no unittest can catch _all_ errors.
>>>
>>> Testing with futexes.
>>>
>>> ALL_PI_TASKS are always switched on now. This is for making the code
>>> simpler.
>>>
>>> My machine fails to run with CONFIG_DEBUG_DEADLOCKS and 
>>> CONFIG_DEBUG_PREEMPT
>>> on at the same time. I need a serial cabel and on consol over serial 
>>> to
>>> debug it. My screen is too small to see enough there.
>>>
>>> Figure out more tests to run in my unittester.
>>>
>>> So why aren't I doing those things before sending the patch? 1) Well 
>>> my
>>> girlfriend comes back tomorrow with our child. I know I will have no 
>>> time to code anything substential
>>> then. 2) I want to make sure Ingo sees this approach before he starts
>>> merging preempt_rt and rt_mutex with his now mainstream mutex.
>>>
>>> Esben
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
> <pi_lock.patch-rt12><TestRTMutex.tgz>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RT Mutex patch and tester [PREEMPT_RT]
       [not found]   ` <Pine.LNX.4.44L0.0601230047290.31387-201000@lifa01.phys.au.dk>
  2006-01-23  0:38     ` david singleton
@ 2006-01-23  2:04     ` Bill Huey
  2006-01-23  9:33       ` Esben Nielsen
  1 sibling, 1 reply; 22+ messages in thread
From: Bill Huey @ 2006-01-23  2:04 UTC (permalink / raw)
  To: Esben Nielsen; +Cc: Ingo Molnar, Steven Rostedt, david singleton, linux-kernel

On Mon, Jan 23, 2006 at 01:20:12AM +0100, Esben Nielsen wrote:
> Here is the problem:
> 
> Task B (non-RT) takes BKL. It then takes mutex 1. Then B
> tries to lock mutex 2, which is owned by task C. B goes blocks and releases the
> BKL. Our RT task A comes along and tries to get 1. It boosts task B
> which boosts task C which releases mutex 2. Now B can continue? No, it has
> to reaquire BKL! The netto effect is that our RT task A waits for BKL to
> be released without ever calling into a module using BKL. But just because
> somebody in some non-RT code called into a module otherwise considered
> safe for RT usage with BKL held, A must wait on BKL!

True, that's major suckage, but I can't name a single place in the kernel that
does that. Remember, the BKL is now preemptible, so the place where it might sleep
in a way similar to the above would be in spinlock_t definitions. But the BKL is held
across schedule()s so that the BKL semantics are preserved. Contending under a priority
inheritance operation isn't too much of a problem anyway, since the use of it already
makes that path indeterminate. Even under contention, a higher-priority task above A can
still run, since the kernel is now preemptive even when manipulating the BKL.

bill


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RT Mutex patch and tester [PREEMPT_RT]
  2006-01-23  2:04     ` Bill Huey
@ 2006-01-23  9:33       ` Esben Nielsen
  2006-01-23 14:23         ` Steven Rostedt
  0 siblings, 1 reply; 22+ messages in thread
From: Esben Nielsen @ 2006-01-23  9:33 UTC (permalink / raw)
  To: Bill Huey; +Cc: Ingo Molnar, Steven Rostedt, david singleton, linux-kernel

On Sun, 22 Jan 2006, Bill Huey wrote:

> On Mon, Jan 23, 2006 at 01:20:12AM +0100, Esben Nielsen wrote:
> > Here is the problem:
> >
> > Task B (non-RT) takes BKL. It then takes mutex 1. Then B
> > tries to lock mutex 2, which is owned by task C. B goes blocks and releases the
> > BKL. Our RT task A comes along and tries to get 1. It boosts task B
> > which boosts task C which releases mutex 2. Now B can continue? No, it has
> > to reaquire BKL! The netto effect is that our RT task A waits for BKL to
> > be released without ever calling into a module using BKL. But just because
> > somebody in some non-RT code called into a module otherwise considered
> > safe for RT usage with BKL held, A must wait on BKL!
>
> True, that's major suckage, but I can't name a single place in the kernel that
> does that.

Sounds good. But someone might put it in...

> Remember, BKL is now preemptible so the place that it might sleep
> similar
> to the above would be in spinlock_t definitions.
I can't see that from how it works. It is explicitly made such that you
are allowed to use semaphores with the BKL held - and such that the BKL is
released if you do.

> But BKL is held across schedules()s
> so that the BKL semantics are preserved.
Only for the spinlock_t (now rt_mutex) operations, not for semaphore/mutex
operations.
> Contending under a priority inheritance
> operation isn't too much of a problem anyways since the use of it already
> makes that
> path indeterminant.
The problem is that you might hit the BKL because of what some other
low-priority task does, thus making your RT code non-deterministic.

> Even under contention, a higher priority task above A can still
> run since the kernel is preemptive now even when manipulating BKL.

No, A waits for the BKL because it waits for B, which waits for the BKL.

Esben
>
> bill
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RT Mutex patch and tester [PREEMPT_RT]
  2006-01-23  9:33       ` Esben Nielsen
@ 2006-01-23 14:23         ` Steven Rostedt
  2006-01-23 15:14           ` Esben Nielsen
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2006-01-23 14:23 UTC (permalink / raw)
  To: Esben Nielsen; +Cc: Bill Huey, Ingo Molnar, david singleton, linux-kernel

On Mon, 2006-01-23 at 10:33 +0100, Esben Nielsen wrote:
> On Sun, 22 Jan 2006, Bill Huey wrote:
> 
> > On Mon, Jan 23, 2006 at 01:20:12AM +0100, Esben Nielsen wrote:
> > > Here is the problem:
> > >
> > > Task B (non-RT) takes BKL. It then takes mutex 1. Then B
> > > tries to lock mutex 2, which is owned by task C. B goes blocks and releases the
> > > BKL. Our RT task A comes along and tries to get 1. It boosts task B
> > > which boosts task C which releases mutex 2. Now B can continue? No, it has
> > > to reaquire BKL! The netto effect is that our RT task A waits for BKL to
> > > be released without ever calling into a module using BKL. But just because
> > > somebody in some non-RT code called into a module otherwise considered
> > > safe for RT usage with BKL held, A must wait on BKL!
> >
> > True, that's major suckage, but I can't name a single place in the kernel that
> > does that.
> 
> Sounds good. But someone might put it in...

Hmm, I wouldn't be surprised if this is done somewhere in the VFS layer.

> 
> > Remember, BKL is now preemptible so the place that it might sleep
> > similar
> > to the above would be in spinlock_t definitions.
> I can't see that from how it works. It is explicitly made such that you
> are allowed to use semaphores with BKL held - and such that the BKL is
> released if you do.

Correct.  I hope you didn't remove my comment in rt.c about the BKL
being a PITA :) (Ingo was nice enough to change my original patch to use
the acronym.)

> 
> > But BKL is held across schedules()s
> > so that the BKL semantics are preserved.
> Only for spinlock_t now rt_mutex operation, not for semaphore/mutex
> operations.
> > Contending under a priority inheritance
> > operation isn't too much of a problem anyways since the use of it already
> > makes that
> > path indeterminant.
> The problem is that you might hit BKL because of what some other low
> priority  task does, thus making your RT code indeterministic.

I disagree here.  The fact that you grab a semaphore that may also be
grabbed by a path while holding the BKL means that grabbing that
semaphore may block on the BKL too.  So the worst-case time for grabbing
a semaphore that can also be grabbed while holding the BKL is the length
of the critical section of the semaphore plus the length of the longest
BKL hold.
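
With purely made-up numbers, just to illustrate the bound: if the
semaphore's critical section is at most 50 us and the longest BKL hold in
the kernel is 10 ms, then

	worst-case wait = 50 us + 10 ms ~= 10.05 ms

which is bounded, but completely dominated by the BKL term.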

Just don't let your RT tasks grab semaphores that can be grabbed while
also holding the BKL :)

But the main point is that it is still deterministic.  Just that it may
be longer than one thinks.

> 
> > Even under contention, a higher priority task above A can still
> > run since the kernel is preemptive now even when manipulating BKL.
> 
> No, A waits for BKL because it waits for B which waits for the BKL.

Right.

-- Steve

PS. I might actually get around to testing your patch today :)  That is,
if -rt12 passes all my tests.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RT Mutex patch and tester [PREEMPT_RT]
  2006-01-23 14:23         ` Steven Rostedt
@ 2006-01-23 15:14           ` Esben Nielsen
  2006-01-27 15:18             ` Esben Nielsen
  0 siblings, 1 reply; 22+ messages in thread
From: Esben Nielsen @ 2006-01-23 15:14 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Bill Huey, Ingo Molnar, david singleton, linux-kernel

On Mon, 23 Jan 2006, Steven Rostedt wrote:

> On Mon, 2006-01-23 at 10:33 +0100, Esben Nielsen wrote:
> > On Sun, 22 Jan 2006, Bill Huey wrote:
> >
> > > On Mon, Jan 23, 2006 at 01:20:12AM +0100, Esben Nielsen wrote:
> > > > Here is the problem:
> > > >
> > > > Task B (non-RT) takes BKL. It then takes mutex 1. Then B
> > > > tries to lock mutex 2, which is owned by task C. B goes blocks and releases the
> > > > BKL. Our RT task A comes along and tries to get 1. It boosts task B
> > > > which boosts task C which releases mutex 2. Now B can continue? No, it has
> > > > to reaquire BKL! The netto effect is that our RT task A waits for BKL to
> > > > be released without ever calling into a module using BKL. But just because
> > > > somebody in some non-RT code called into a module otherwise considered
> > > > safe for RT usage with BKL held, A must wait on BKL!
> > >
> > > True, that's major suckage, but I can't name a single place in the kernel that
> > > does that.
> >
> > Sounds good. But someone might put it in...
>
> Hmm, I wouldn't be surprised if this is done somewhere in the VFS layer.
>
> >
> > > Remember, BKL is now preemptible so the place that it might sleep
> > > similar
> > > to the above would be in spinlock_t definitions.
> > I can't see that from how it works. It is explicitly made such that you
> > are allowed to use semaphores with BKL held - and such that the BKL is
> > released if you do.
>
> Correct.  I hope you didn't remove my comment in the rt.c about BKL
> being a PITA :) (Ingo was nice enough to change my original patch to use
> the acronym.)

I left it there it seems :-)

>
> >
> > > But BKL is held across schedules()s
> > > so that the BKL semantics are preserved.
> > Only for spinlock_t now rt_mutex operation, not for semaphore/mutex
> > operations.
> > > Contending under a priority inheritance
> > > operation isn't too much of a problem anyways since the use of it already
> > > makes that
> > > path indeterminant.
> > The problem is that you might hit BKL because of what some other low
> > priority  task does, thus making your RT code indeterministic.
>
> I disagree here.  The fact that you grab a semaphore that may also be
> grabbed by a path while holding the BKL means that grabbing that
> semaphore may be blocked on the BKL too.  So the length of grabbing a
> semaphore that can be grabbed while also holding the BKL is the length
> of the critical section of the semaphore + the length of the longest BKL
> hold.
Exactly. What is "the length of the longest BKL hold" ? (see below).

>
> Just don't let your RT tasks grab semaphores that can be grabbed while
> also holding the BKL :)

How are you to _know_ that? Even though your code, any code you call, or
any code called from code you call hasn't changed, this situation can
arise!

>
> But the main point is that it is still deterministic.  Just that it may
> be longer than one thinks.
>
I don't consider "the length of the longest BKL hold" deterministic.
People might traverse all kinds of weird lists and data structures while
holding the BKL.

> >
> > > Even under contention, a higher priority task above A can still
> > > run since the kernel is preemptive now even when manipulating BKL.
> >
> > No, A waits for BKL because it waits for B which waits for the BKL.
>
> Right.
>
> -- Steve
>
> PS. I might actually get around to testing your patch today :)  That is,
> if -rt12 passes all my tests.
>

Sounds nice :-) I cross my fingers...

Esben


>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: RT Mutex patch and tester [PREEMPT_RT]
  2006-01-23 15:14           ` Esben Nielsen
@ 2006-01-27 15:18             ` Esben Nielsen
  0 siblings, 0 replies; 22+ messages in thread
From: Esben Nielsen @ 2006-01-27 15:18 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Bill Huey, Ingo Molnar, david singleton, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4088 bytes --]

I have patched against 2.6.15-rt15 and I have found a hyperthreaded P4
machine. It works fine on that one.

Esben

On Mon, 23 Jan 2006, Esben Nielsen wrote:

> On Mon, 23 Jan 2006, Steven Rostedt wrote:
>
> > On Mon, 2006-01-23 at 10:33 +0100, Esben Nielsen wrote:
> > > On Sun, 22 Jan 2006, Bill Huey wrote:
> > >
> > > > On Mon, Jan 23, 2006 at 01:20:12AM +0100, Esben Nielsen wrote:
> > > > > Here is the problem:
> > > > >
> > > > > Task B (non-RT) takes BKL. It then takes mutex 1. Then B
> > > > > tries to lock mutex 2, which is owned by task C. B goes blocks and releases the
> > > > > BKL. Our RT task A comes along and tries to get 1. It boosts task B
> > > > > which boosts task C which releases mutex 2. Now B can continue? No, it has
> > > > > to reaquire BKL! The netto effect is that our RT task A waits for BKL to
> > > > > be released without ever calling into a module using BKL. But just because
> > > > > somebody in some non-RT code called into a module otherwise considered
> > > > > safe for RT usage with BKL held, A must wait on BKL!
> > > >
> > > > True, that's major suckage, but I can't name a single place in the kernel that
> > > > does that.
> > >
> > > Sounds good. But someone might put it in...
> >
> > Hmm, I wouldn't be surprised if this is done somewhere in the VFS layer.
> >
> > >
> > > > Remember, BKL is now preemptible so the place that it might sleep
> > > > similar
> > > > to the above would be in spinlock_t definitions.
> > > I can't see that from how it works. It is explicitly made such that you
> > > are allowed to use semaphores with BKL held - and such that the BKL is
> > > released if you do.
> >
> > Correct.  I hope you didn't remove my comment in the rt.c about BKL
> > being a PITA :) (Ingo was nice enough to change my original patch to use
> > the acronym.)
>
> I left it there it seems :-)
>
> >
> > >
> > > > But BKL is held across schedules()s
> > > > so that the BKL semantics are preserved.
> > > Only for spinlock_t now rt_mutex operation, not for semaphore/mutex
> > > operations.
> > > > Contending under a priority inheritance
> > > > operation isn't too much of a problem anyways since the use of it already
> > > > makes that
> > > > path indeterminant.
> > > The problem is that you might hit BKL because of what some other low
> > > priority  task does, thus making your RT code indeterministic.
> >
> > I disagree here.  The fact that you grab a semaphore that may also be
> > grabbed by a path while holding the BKL means that grabbing that
> > semaphore may be blocked on the BKL too.  So the length of grabbing a
> > semaphore that can be grabbed while also holding the BKL is the length
> > of the critical section of the semaphore + the length of the longest BKL
> > hold.
> Exactly. What is "the length of the longest BKL hold" ? (see below).
>
> >
> > Just don't let your RT tasks grab semaphores that can be grabbed while
> > also holding the BKL :)
>
> How are you to _know_ that. Even though your code or any code you
> call or any code called from code you call haven't changed, this situation
> can arise!
>
> >
> > But the main point is that it is still deterministic.  Just that it may
> > be longer than one thinks.
> >
> I don't consider "the length of the longest BKL hold" deterministic.
> People might traverse all kinds of weird lists and datastructures while
> holding BKL.
>
> > >
> > > > Even under contention, a higher priority task above A can still
> > > > run since the kernel is preemptive now even when manipulating BKL.
> > >
> > > No, A waits for BKL because it waits for B which waits for the BKL.
> >
> > Right.
> >
> > -- Steve
> >
> > PS. I might actually get around to testing your patch today :)  That is,
> > if -rt12 passes all my tests.
> >
>
> Sounds nice :-) I cross my fingers...
>
> Esben
>
>
> >
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> >
>
>

[-- Attachment #2: Type: TEXT/PLAIN, Size: 53400 bytes --]

diff -upr linux-2.6.15-rt15-orig/fs/proc/array.c linux-2.6.15-rt15-pipatch/fs/proc/array.c
--- linux-2.6.15-rt15-orig/fs/proc/array.c	2006-01-24 18:50:37.000000000 +0100
+++ linux-2.6.15-rt15-pipatch/fs/proc/array.c	2006-01-24 18:56:07.000000000 +0100
@@ -295,6 +295,14 @@ static inline char *task_cap(struct task
 			    cap_t(p->cap_effective));
 }
 
+
+static char *show_blocked_on(task_t *task, char *buffer)
+{
+  pid_t pid = get_blocked_on(task);
+  return buffer + sprintf(buffer,"BlckOn: %d\n",pid);
+}
+
+
 int proc_pid_status(struct task_struct *task, char * buffer)
 {
 	char * orig = buffer;
@@ -313,6 +321,7 @@ int proc_pid_status(struct task_struct *
 #if defined(CONFIG_ARCH_S390)
 	buffer = task_show_regs(task, buffer);
 #endif
+	buffer = show_blocked_on(task,buffer);
 	return buffer - orig;
 }
 
diff -upr linux-2.6.15-rt15-orig/include/linux/rt_lock.h linux-2.6.15-rt15-pipatch/include/linux/rt_lock.h
--- linux-2.6.15-rt15-orig/include/linux/rt_lock.h	2006-01-24 18:50:37.000000000 +0100
+++ linux-2.6.15-rt15-pipatch/include/linux/rt_lock.h	2006-01-24 18:56:07.000000000 +0100
@@ -36,6 +36,7 @@ struct rt_mutex {
 	unsigned long		acquire_eip;
 	char 			*name, *file;
 	int			line;
+	int                     verbose;
 # endif
 # ifdef CONFIG_DEBUG_PREEMPT
 	int			was_preempt_off;
@@ -67,7 +68,7 @@ struct rt_mutex_waiter {
 
 #ifdef CONFIG_DEBUG_DEADLOCKS
 # define __RT_MUTEX_DEADLOCK_DETECT_INITIALIZER(lockname) \
-	, .name = #lockname, .file = __FILE__, .line = __LINE__
+	, .name = #lockname, .file = __FILE__, .line = __LINE__, .verbose =0
 #else
 # define __RT_MUTEX_DEADLOCK_DETECT_INITIALIZER(lockname)
 #endif
diff -upr linux-2.6.15-rt15-orig/include/linux/sched.h linux-2.6.15-rt15-pipatch/include/linux/sched.h
--- linux-2.6.15-rt15-orig/include/linux/sched.h	2006-01-24 18:50:37.000000000 +0100
+++ linux-2.6.15-rt15-pipatch/include/linux/sched.h	2006-01-24 18:56:07.000000000 +0100
@@ -1652,6 +1652,8 @@ extern void recalc_sigpending(void);
 
 extern void signal_wake_up(struct task_struct *t, int resume_stopped);
 
+extern pid_t get_blocked_on(task_t *task);
+
 /*
  * Wrappers for p->thread_info->cpu access. No-op on UP.
  */
diff -upr linux-2.6.15-rt15-orig/init/main.c linux-2.6.15-rt15-pipatch/init/main.c
--- linux-2.6.15-rt15-orig/init/main.c	2006-01-24 18:50:37.000000000 +0100
+++ linux-2.6.15-rt15-pipatch/init/main.c	2006-01-24 18:56:07.000000000 +0100
@@ -616,6 +616,12 @@ static void __init do_initcalls(void)
 			printk(KERN_WARNING "error in initcall at 0x%p: "
 				"returned with %s\n", *call, msg);
 		}
+		if (initcall_debug) {
+			printk(KERN_DEBUG "Returned from initcall 0x%p", *call);
+			print_fn_descriptor_symbol(": %s()", (unsigned long) *call);
+			printk("\n");
+		}
+
 	}
 
 	/* Make sure there is no pending stuff from the initcall sequence */
diff -upr linux-2.6.15-rt15-orig/kernel/rt.c linux-2.6.15-rt15-pipatch/kernel/rt.c
--- linux-2.6.15-rt15-orig/kernel/rt.c	2006-01-24 18:50:37.000000000 +0100
+++ linux-2.6.15-rt15-pipatch/kernel/rt.c	2006-01-24 18:56:07.000000000 +0100
@@ -36,7 +36,10 @@
  *   (also by Steven Rostedt)
  *    - Converted single pi_lock to individual task locks.
  *
+ * By Esben Nielsen:
+ *    Doing priority inheritance with help of the scheduler.
  */
+
 #include <linux/config.h>
 #include <linux/rt_lock.h>
 #include <linux/sched.h>
@@ -58,18 +61,26 @@
  *  To keep from having a single lock for PI, each task and lock
  *  has their own locking. The order is as follows:
  *
+ *     lock->wait_lock   -> sometask->pi_lock
+ * You should only hold one wait_lock and one pi_lock
  * blocked task->pi_lock -> lock->wait_lock -> owner task->pi_lock.
  *
- * This is safe since a owner task should never block on a lock that
- * is owned by a blocking task.  Otherwise you would have a deadlock
- * in the normal system.
- * The same goes for the locks. A lock held by one task, should not be
- * taken by task that holds a lock that is blocking this lock's owner.
+ * lock->wait_lock protects everything inside the lock and all the waiters
+ * on lock->wait_list.
+ * sometask->pi_lock protects everything on task-> related to the rt_mutex.
+ *
+ * Invariants  - must be true when unlock lock->wait_lock:
+ *   If lock->wait_list is non-empty 
+ *     1) lock_owner(lock) points to a valid thread.
+ *     2) The first and only the first waiter on the list must be on
+ *        lock_owner(lock)->task->pi_waiters.
+ * 
+ *  A waiter struct is on the lock->wait_list iff waiter->ti!=NULL.
  *
- * A task that is about to grab a lock is first considered to be a
- * blocking task, even if the task successfully acquires the lock.
- * This is because the taking of the locks happen before the
- * task becomes the owner.
+ *  Strategy for boosting lock chain:
+ *   task A blocked on lock 1 owned by task B blocked on lock 2 etc..
+ *  A sets B's prio up and wakes B. B tries to get lock 2 again and fails.
+ *  B therefore boosts C.
  */
 
 /*
@@ -117,6 +128,7 @@
  * This flag is good for debugging the PI code - it makes all tasks
  * in the system fall under PI handling. Normally only SCHED_FIFO/RR
  * tasks are PI-handled:
+ *
  */
 #define ALL_TASKS_PI 0
 
@@ -132,6 +144,19 @@
 # define __CALLER0__
 #endif
 
+int rt_mutex_debug = 0;
+
+#ifdef CONFIG_PREEMPT_RT
+static int is_kernel_lock(struct rt_mutex *lock)
+{
+	return (lock == &kernel_sem.lock);
+
+}
+#else
+#define is_kernel_lock(lock) (0)
+#endif
+
+
 #ifdef CONFIG_DEBUG_DEADLOCKS
 /*
  * We need a global lock when we walk through the multi-process
@@ -311,7 +336,7 @@ void check_preempt_wakeup(struct task_st
 		}
 }
 
-static inline void
+static void
 account_mutex_owner_down(struct task_struct *task, struct rt_mutex *lock)
 {
 	if (task->lock_count >= MAX_LOCK_STACK) {
@@ -325,7 +350,7 @@ account_mutex_owner_down(struct task_str
 	task->lock_count++;
 }
 
-static inline void
+static void
 account_mutex_owner_up(struct task_struct *task)
 {
 	if (!task->lock_count) {
@@ -390,6 +415,21 @@ static void printk_lock(struct rt_mutex 
 	}
 }
 
+static void debug_lock(struct rt_mutex *lock, 
+		       const char *fmt,...)
+{ 
+	if(rt_mutex_debug && lock->verbose) { 
+		va_list args;
+		printk_task(current);
+
+		va_start(args, fmt);
+		vprintk(fmt, args);
+		va_end(args);
+		printk_lock(lock, 1);
+	} 
+}
+
+
 static void printk_waiter(struct rt_mutex_waiter *w)
 {
 	printk("-------------------------\n");
@@ -534,10 +574,9 @@ static int check_deadlock(struct rt_mute
 	 * Special-case: the BKL self-releases at schedule()
 	 * time so it can never deadlock:
 	 */
-#ifdef CONFIG_PREEMPT_RT
-	if (lock == &kernel_sem.lock)
+	if (is_kernel_lock(lock))
 		return 0;
-#endif
+
 	ti = lock_owner(lock);
 	if (!ti)
 		return 0;
@@ -562,13 +601,8 @@ static int check_deadlock(struct rt_mute
 		trace_local_irq_disable(ti);
 		return 0;
 	}
-#ifdef CONFIG_PREEMPT_RT
-	/*
-	 * Skip the BKL:
-	 */
-	if (lockblk == &kernel_sem.lock)
+	if(is_kernel_lock(lockblk))
 		return 0;
-#endif
 	/*
 	 * Ugh, something corrupted the lock data structure?
 	 */
@@ -656,7 +690,7 @@ restart:
 		list_del_init(curr);
 		trace_unlock_irqrestore(&trace_lock, flags, ti);
 
-		if (lock == &kernel_sem.lock) {
+		if (is_kernel_lock(lock)) {
 			printk("BUG: %s/%d, BKL held at task exit time!\n",
 				task->comm, task->pid);
 			printk("BKL acquired at: ");
@@ -724,28 +758,14 @@ restart:
 	return err;
 }
 
-#endif
-
-#if ALL_TASKS_PI && defined(CONFIG_DEBUG_DEADLOCKS)
-
-static void
-check_pi_list_present(struct rt_mutex *lock, struct rt_mutex_waiter *waiter,
-		      struct thread_info *old_owner)
+#else /* ifdef CONFIG_DEBUG_DEADLOCKS */
+static inline void debug_lock(struct rt_mutex *lock, 
+			      const char *fmt,...)
 {
-	struct rt_mutex_waiter *w;
-
-	_raw_spin_lock(&old_owner->task->pi_lock);
-	TRACE_WARN_ON_LOCKED(plist_node_empty(&waiter->pi_list));
-
-	plist_for_each_entry(w, &old_owner->task->pi_waiters, pi_list) {
-		if (w == waiter)
-			goto ok;
-	}
-	TRACE_WARN_ON_LOCKED(1);
-ok:
-	_raw_spin_unlock(&old_owner->task->pi_lock);
-	return;
 }
+#endif /* else CONFIG_DEBUG_DEADLOCKS */
+
+#if ALL_TASKS_PI && defined(CONFIG_DEBUG_DEADLOCKS)
 
 static void
 check_pi_list_empty(struct rt_mutex *lock, struct thread_info *old_owner)
@@ -781,274 +801,115 @@ check_pi_list_empty(struct rt_mutex *loc
 
 #endif
 
-/*
- * Move PI waiters of this lock to the new owner:
- */
-static void
-change_owner(struct rt_mutex *lock, struct thread_info *old_owner,
-	     struct thread_info *new_owner)
+static inline int boosting_waiter(struct  rt_mutex_waiter *waiter)
 {
-	struct rt_mutex_waiter *w, *tmp;
-	int requeued = 0, sum = 0;
-
-	if (old_owner == new_owner)
-		return;
-
-	SMP_TRACE_BUG_ON_LOCKED(!spin_is_locked(&old_owner->task->pi_lock));
-	SMP_TRACE_BUG_ON_LOCKED(!spin_is_locked(&new_owner->task->pi_lock));
-	plist_for_each_entry_safe(w, tmp, &old_owner->task->pi_waiters, pi_list) {
-		if (w->lock == lock) {
-			trace_special_pid(w->ti->task->pid, w->ti->task->prio, w->ti->task->normal_prio);
-			plist_del(&w->pi_list);
-			w->pi_list.prio = w->ti->task->prio;
-			plist_add(&w->pi_list, &new_owner->task->pi_waiters);
-			requeued++;
-		}
-		sum++;
-	}
-	trace_special(sum, requeued, 0);
+  return ALL_TASKS_PI || rt_prio(waiter->list.prio);
 }
 
-int pi_walk, pi_null, pi_prio, pi_initialized;
-
-/*
- * The lock->wait_lock and p->pi_lock must be held.
- */
-static void pi_setprio(struct rt_mutex *lock, struct task_struct *task, int prio)
+static int calc_pi_prio(task_t *task)
 {
-	struct rt_mutex *l = lock;
-	struct task_struct *p = task;
-	/*
-	 * We don't want to release the parameters locks.
-	 */
-
-	if (unlikely(!p->pid)) {
-		pi_null++;
-		return;
+	int prio = task->normal_prio;
+	if(!plist_head_empty(&task->pi_waiters)) {
+		struct  rt_mutex_waiter *waiter = 
+			plist_first_entry(&task->pi_waiters, struct rt_mutex_waiter, pi_list);
+		prio = min(waiter->pi_list.prio,prio);
 	}
 
-	SMP_TRACE_BUG_ON_LOCKED(!spin_is_locked(&lock->wait_lock));
-	SMP_TRACE_BUG_ON_LOCKED(!spin_is_locked(&p->pi_lock));
-#ifdef CONFIG_DEBUG_DEADLOCKS
-	pi_prio++;
-	if (p->policy != SCHED_NORMAL && prio > normal_prio(p)) {
-		TRACE_OFF();
-
-		printk("huh? (%d->%d??)\n", p->prio, prio);
-		printk("owner:\n");
-		printk_task(p);
-		printk("\ncurrent:\n");
-		printk_task(current);
-		printk("\nlock:\n");
-		printk_lock(lock, 1);
-		dump_stack();
-		trace_local_irq_disable(ti);
-	}
-#endif
-	/*
-	 * If the task is blocked on some other task then boost that
-	 * other task (or tasks) too:
-	 */
-	for (;;) {
-		struct rt_mutex_waiter *w = p->blocked_on;
-#ifdef CONFIG_DEBUG_DEADLOCKS
-		int was_rt = rt_task(p);
-#endif
-
-		mutex_setprio(p, prio);
-
-		/*
-		 * The BKL can really be a pain. It can happen where the
-		 * BKL is being held by one task that is just about to
-		 * block on another task that is waiting for the BKL.
-		 * This isn't a deadlock, since the BKL is released
-		 * when the task goes to sleep.  This also means that
-		 * all holders of the BKL are not blocked, or are just
-		 * about to be blocked.
-		 *
-		 * Another side-effect of this is that there's a small
-		 * window where the spinlocks are not held, and the blocked
-		 * process hasn't released the BKL.  So if we are going
-		 * to boost the owner of the BKL, stop after that,
-		 * since that owner is either running, or about to sleep
-		 * but don't go any further or we are in a loop.
-		 */
-		if (!w || unlikely(p->lock_depth >= 0))
-			break;
-		/*
-		 * If the task is blocked on a lock, and we just made
-		 * it RT, then register the task in the PI list and
-		 * requeue it to the wait list:
-		 */
-
-		/*
-		 * Don't unlock the original lock->wait_lock
-		 */
-		if (l != lock)
-			_raw_spin_unlock(&l->wait_lock);
-		l = w->lock;
-		TRACE_BUG_ON_LOCKED(!lock);
+	return prio;
 
-#ifdef CONFIG_PREEMPT_RT
-		/*
-		 * The current task that is blocking can also the one
-		 * holding the BKL, and blocking on a task that wants
-		 * it.  So if it were to get this far, we would deadlock.
-		 */
-		if (unlikely(l == &kernel_sem.lock) && lock_owner(l) == current_thread_info()) {
-			/*
-			 * No locks are held for locks, so fool the unlocking code
-			 * by thinking the last lock was the original.
-			 */
-			l = lock;
-			break;
-		}
-#endif
-
-		if (l != lock)
-			_raw_spin_lock(&l->wait_lock);
-
-		TRACE_BUG_ON_LOCKED(!lock_owner(l));
-
-		if (!plist_node_empty(&w->pi_list)) {
-			TRACE_BUG_ON_LOCKED(!was_rt && !ALL_TASKS_PI && !rt_task(p));
-			/*
-			 * If the task is blocked on a lock, and we just restored
-			 * it from RT to non-RT then unregister the task from
-			 * the PI list and requeue it to the wait list.
-			 *
-			 * (TODO: this can be unfair to SCHED_NORMAL tasks if they
-			 *        get PI handled.)
-			 */
-			plist_del(&w->pi_list);
-		} else
-			TRACE_BUG_ON_LOCKED((ALL_TASKS_PI || rt_task(p)) && was_rt);
-
-		if (ALL_TASKS_PI || rt_task(p)) {
-			w->pi_list.prio = prio;
-			plist_add(&w->pi_list, &lock_owner(l)->task->pi_waiters);
-		}
-
-		plist_del(&w->list);
-		w->list.prio = prio;
-		plist_add(&w->list, &l->wait_list);
-
-		pi_walk++;
-
-		if (p != task)
-			_raw_spin_unlock(&p->pi_lock);
-
-		p = lock_owner(l)->task;
-		TRACE_BUG_ON_LOCKED(!p);
-		_raw_spin_lock(&p->pi_lock);
-		/*
-		 * If the dependee is already higher-prio then
-		 * no need to boost it, and all further tasks down
-		 * the dependency chain are already boosted:
-		 */
-		if (p->prio <= prio)
-			break;
-	}
-	if (l != lock)
-		_raw_spin_unlock(&l->wait_lock);
-	if (p != task)
-		_raw_spin_unlock(&p->pi_lock);
 }
 
-/*
- * Change priority of a task pi aware
- *
- * There are several aspects to consider:
- * - task is priority boosted
- * - task is blocked on a mutex
- *
- */
-void pi_changeprio(struct task_struct *p, int prio)
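+/*
+ * Bring @task's priority in line with calc_pi_prio().  A needed boost is
+ * applied right away; a deboost of a blocked task is left to the task
+ * itself.  In either case a blocked task is woken with
+ * wake_up_process_mutex() so it can propagate the change down the lock
+ * chain.  Caller must hold task->pi_lock.
+ */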
+static void fix_prio(task_t *task)
 {
-	unsigned long flags;
-	int oldprio;
-
-	spin_lock_irqsave(&p->pi_lock,flags);
-	if (p->blocked_on)
-		spin_lock(&p->blocked_on->lock->wait_lock);
-
-	oldprio = p->normal_prio;
-	if (oldprio == prio)
-		goto out;
-
-	/* Set normal prio in any case */
-	p->normal_prio = prio;
-
-	/* Check, if we can safely lower the priority */
-	if (prio > p->prio && !plist_head_empty(&p->pi_waiters)) {
-		struct rt_mutex_waiter *w;
-		w = plist_first_entry(&p->pi_waiters,
-				      struct rt_mutex_waiter, pi_list);
-		if (w->ti->task->prio < prio)
-			prio = w->ti->task->prio;
+	int prio = calc_pi_prio(task);
+	if(task->prio > prio) {
+		/* Boost him */
+		mutex_setprio(task,prio);
+		if(task->blocked_on) {
+			/* Let it run to boost its lock */
+			wake_up_process_mutex(task);
+		}
+	}
+	else if(task->prio < prio) {
+		/* Priority too high */
+		if(task->blocked_on) {
+			/* Let it run to unboost its lock */
+			wake_up_process_mutex(task);
+		}
+		else {
+			mutex_setprio(task,prio);
+		}
 	}
-
-	if (prio == p->prio)
-		goto out;
-
-	/* Is task blocked on a mutex ? */
-	if (p->blocked_on)
-		pi_setprio(p->blocked_on->lock, p, prio);
-	else
-		mutex_setprio(p, prio);
- out:
-	if (p->blocked_on)
-		spin_unlock(&p->blocked_on->lock->wait_lock);
-
-	spin_unlock_irqrestore(&p->pi_lock, flags);
-
 }
 
+int pi_walk, pi_null, pi_prio, pi_initialized;
+
 /*
  * This is called with both the waiter->task->pi_lock and
  * lock->wait_lock held.
  */
 static void
 task_blocks_on_lock(struct rt_mutex_waiter *waiter, struct thread_info *ti,
-		    struct rt_mutex *lock __EIP_DECL__)
+		    struct rt_mutex *lock, int state __EIP_DECL__)
 {
+	struct rt_mutex_waiter *old_first;
 	struct task_struct *task = ti->task;
 #ifdef CONFIG_DEBUG_DEADLOCKS
 	check_deadlock(lock, 0, ti, eip);
 	/* mark the current thread as blocked on the lock */
 	waiter->eip = eip;
 #endif
+	SMP_TRACE_BUG_ON_LOCKED(!spin_is_locked(&lock->wait_lock));
+	SMP_TRACE_BUG_ON_LOCKED(spin_is_locked(&task->pi_lock));
+
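+	/*
+	 * Remember the current first boosting waiter: only the first waiter
+	 * of a lock sits on the owner's pi_waiters, so if we end up in
+	 * front of it we have to replace it there below.
+	 */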
+	if(plist_head_empty(&lock->wait_list)) {
+		old_first = NULL;
+	}
+	else {
+		old_first = plist_first_entry(&lock->wait_list, struct rt_mutex_waiter, list);
+		if(!boosting_waiter(old_first)) {
+			old_first = NULL;
+		}
+	}
+
+	_raw_spin_lock(&task->pi_lock);
 	task->blocked_on = waiter;
 	waiter->lock = lock;
 	waiter->ti = ti;
-	plist_node_init(&waiter->pi_list, task->prio);
-	/*
-	 * Add SCHED_NORMAL tasks to the end of the waitqueue (FIFO):
-	 */
-	SMP_TRACE_BUG_ON_LOCKED(!spin_is_locked(&task->pi_lock));
-	SMP_TRACE_BUG_ON_LOCKED(!spin_is_locked(&lock->wait_lock));
-#if !ALL_TASKS_PI
-	if ((!rt_task(task) &&
-		!(lock->mutex_attr & FUTEX_ATTR_PRIORITY_INHERITANCE))) {
-		plist_add(&waiter->list, &lock->wait_list);
-		set_lock_owner_pending(lock);
-		return;
+
+	{
+		/* Fixup the prio of the (current) task here while we have the
+		   pi_lock */
+		int prio = calc_pi_prio(task);
+		if(prio!=task->prio) {
+			mutex_setprio(task,prio);
+		}
 	}
-#endif
-	_raw_spin_lock(&lock_owner(lock)->task->pi_lock);
-	plist_add(&waiter->pi_list, &lock_owner(lock)->task->pi_waiters);
-	/*
-	 * Add RT tasks to the head:
-	 */
+
+	plist_node_init(&waiter->list, task->prio);
 	plist_add(&waiter->list, &lock->wait_list);
-	set_lock_owner_pending(lock);
-	/*
-	 * If the waiter has higher priority than the owner
-	 * then temporarily boost the owner:
-	 */
-	if (task->prio < lock_owner(lock)->task->prio)
-		pi_setprio(lock, lock_owner(lock)->task, task->prio);
-	_raw_spin_unlock(&lock_owner(lock)->task->pi_lock);
+	set_task_state(task, state);
+	_raw_spin_unlock(&task->pi_lock);
+
+	set_lock_owner_pending(lock);
+
+	if(waiter ==
+	   plist_first_entry(&lock->wait_list, struct rt_mutex_waiter, list)
+	    && boosting_waiter(waiter)) {
+		task_t *owner = lock_owner(lock)->task;
+
+		plist_node_init(&waiter->pi_list, task->prio);
+
+		_raw_spin_lock(&owner->pi_lock);
+		if(old_first) {
+			plist_del(&old_first->pi_list);
+		}
+		plist_add(&waiter->pi_list, &owner->pi_waiters);
+		fix_prio(owner);
+
+		_raw_spin_unlock(&owner->pi_lock);
+	}
 }
 
 /*
@@ -1068,6 +929,7 @@ static void __init_rt_mutex(struct rt_mu
 	lock->name = name;
 	lock->file = file;
 	lock->line = line;
+	lock->verbose = 0;
 #endif
 #ifdef CONFIG_DEBUG_PREEMPT
 	lock->was_preempt_off = 0;
@@ -1085,20 +947,48 @@ EXPORT_SYMBOL(__init_rwsem);
 #endif
 
 /*
- * This must be called with both the old_owner and new_owner pi_locks held.
- * As well as the lock->wait_lock.
+ * This must be called with the lock->wait_lock held.
+ * new_owner must not be NULL; old_owner is usually NULL.
  */
-static inline
+static
 void set_new_owner(struct rt_mutex *lock, struct thread_info *old_owner,
 			struct thread_info *new_owner __EIP_DECL__)
 {
+	SMP_TRACE_BUG_ON_LOCKED(spin_is_locked(&old_owner->task->pi_lock));
+	SMP_TRACE_BUG_ON_LOCKED(spin_is_locked(&new_owner->task->pi_lock));
+	SMP_TRACE_BUG_ON_LOCKED(!spin_is_locked(&lock->wait_lock));
+
 	if (new_owner)
 		trace_special_pid(new_owner->task->pid, new_owner->task->prio, 0);
-	if (unlikely(old_owner))
-		change_owner(lock, old_owner, new_owner);
+	if(old_owner) {
+		account_mutex_owner_up(old_owner->task);
+	}
+#ifdef CONFIG_DEBUG_DEADLOCKS
+	if (trace_on && unlikely(old_owner)) {
+		TRACE_WARN_ON_LOCKED(list_empty(&lock->held_list));
+		list_del_init(&lock->held_list);
+	}
+#endif
 	lock->owner = new_owner;
-	if (!plist_head_empty(&lock->wait_list))
-		set_lock_owner_pending(lock);
+	if (!plist_head_empty(&lock->wait_list)) {
+		struct rt_mutex_waiter *next =
+			plist_first_entry(&lock->wait_list, 
+					  struct rt_mutex_waiter, list);
+		if(boosting_waiter(next)) {
+			if(old_owner) {
+				_raw_spin_lock(&old_owner->task->pi_lock);
+				plist_del(&next->pi_list);
+				_raw_spin_unlock(&old_owner->task->pi_lock);
+			}
+			_raw_spin_lock(&new_owner->task->pi_lock);
+			plist_add(&next->pi_list, 
+				  &new_owner->task->pi_waiters);
+			set_lock_owner_pending(lock);
+			_raw_spin_unlock(&new_owner->task->pi_lock);
+		}
+	}
+
 #ifdef CONFIG_DEBUG_DEADLOCKS
 	if (trace_on) {
 		TRACE_WARN_ON_LOCKED(!list_empty(&lock->held_list));
@@ -1109,6 +999,36 @@ void set_new_owner(struct rt_mutex *lock
 	account_mutex_owner_down(new_owner->task, lock);
 }
 
+
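+/*
+ * Take @waiter off @lock's wait list.  If it was the first waiter, also
+ * take it off the owner's pi_waiters, put the new first waiter there
+ * instead (if that one is a boosting waiter) and, when @fixprio is set,
+ * recompute the owner's priority.  Caller must hold lock->wait_lock.
+ */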
+static void remove_waiter(struct rt_mutex *lock, 
+			  struct rt_mutex_waiter *waiter, 
+			  int fixprio)
+{
+	task_t *owner = lock_owner(lock) ? lock_owner(lock)->task : NULL;
+	int first = (waiter==plist_first_entry(&lock->wait_list,
+					       struct rt_mutex_waiter, list));
+
+	plist_del(&waiter->list);
+	if(first && owner) {
+		_raw_spin_lock(&owner->pi_lock);
+		if(boosting_waiter(waiter)) {
+			plist_del(&waiter->pi_list);
+		}
+		if(!plist_head_empty(&lock->wait_list)) {
+			struct rt_mutex_waiter *next =
+				plist_first_entry(&lock->wait_list, 
+						  struct rt_mutex_waiter, list);
+			if(boosting_waiter(next)) {
+				plist_add(&next->pi_list, &owner->pi_waiters);
+			}
+		}
+		if(fixprio) {
+			fix_prio(owner);
+		}
+		_raw_spin_unlock(&owner->pi_lock);
+	}
+}
+
 /*
  * handle the lock release when processes blocked on it that can now run
  * - the spinlock must be held by the caller
@@ -1123,70 +1043,36 @@ pick_new_owner(struct rt_mutex *lock, st
 	struct thread_info *new_owner;
 
 	SMP_TRACE_BUG_ON_LOCKED(!spin_is_locked(&lock->wait_lock));
+	SMP_TRACE_BUG_ON_LOCKED(spin_is_locked(&old_owner->task->pi_lock));
+
 	/*
 	 * Get the highest prio one:
 	 *
 	 * (same-prio RT tasks go FIFO)
 	 */
 	waiter = plist_first_entry(&lock->wait_list, struct rt_mutex_waiter, list);
-
-#ifdef CONFIG_SMP
- try_again:
-#endif
+	remove_waiter(lock,waiter,0);
 	trace_special_pid(waiter->ti->task->pid, waiter->ti->task->prio, 0);
 
-#if ALL_TASKS_PI
-	check_pi_list_present(lock, waiter, old_owner);
-#endif
 	new_owner = waiter->ti;
-	/*
-	 * The new owner is still blocked on this lock, so we
-	 * must release the lock->wait_lock before grabing
-	 * the new_owner lock.
-	 */
-	_raw_spin_unlock(&lock->wait_lock);
-	_raw_spin_lock(&new_owner->task->pi_lock);
-	_raw_spin_lock(&lock->wait_lock);
-	/*
-	 * In this split second of releasing the lock, a high priority
-	 * process could have come along and blocked as well.
-	 */
-#ifdef CONFIG_SMP
-	waiter = plist_first_entry(&lock->wait_list, struct rt_mutex_waiter, list);
-	if (unlikely(waiter->ti != new_owner)) {
-		_raw_spin_unlock(&new_owner->task->pi_lock);
-		goto try_again;
-	}
-#ifdef CONFIG_PREEMPT_RT
-	/*
-	 * Once again the BKL comes to play.  Since the BKL can be grabbed and released
-	 * out of the normal P1->L1->P2 order, there's a chance that someone has the
-	 * BKL owner's lock and is waiting on the new owner lock.
-	 */
-	if (unlikely(lock == &kernel_sem.lock)) {
-		if (!_raw_spin_trylock(&old_owner->task->pi_lock)) {
-			_raw_spin_unlock(&new_owner->task->pi_lock);
-			goto try_again;
-		}
-	} else
-#endif
-#endif
-		_raw_spin_lock(&old_owner->task->pi_lock);
-
-	plist_del(&waiter->list);
-	plist_del(&waiter->pi_list);
-	waiter->pi_list.prio = waiter->ti->task->prio;
 
 	set_new_owner(lock, old_owner, new_owner __W_EIP__(waiter));
+
+	_raw_spin_lock(&new_owner->task->pi_lock);
 	/* Don't touch waiter after ->task has been NULLed */
 	mb();
 	waiter->ti = NULL;
 	new_owner->task->blocked_on = NULL;
-	TRACE_WARN_ON(save_state != lock->save_state);
-
-	_raw_spin_unlock(&old_owner->task->pi_lock);
+#ifdef CAPTURE_LOCK
+	if (!is_kernel_lock(lock)) {
+		new_owner->task->rt_flags |= RT_PENDOWNER;
+		new_owner->task->pending_owner = lock;
+	}
+#endif
 	_raw_spin_unlock(&new_owner->task->pi_lock);
 
+	TRACE_WARN_ON(save_state != lock->save_state);
+
 	return new_owner;
 }
 
@@ -1217,11 +1103,41 @@ static inline void init_lists(struct rt_
 	}
 #endif
 #ifdef CONFIG_DEBUG_DEADLOCKS
-	if (!lock->held_list.prev && !lock->held_list.next)
+	if (!lock->held_list.prev && !lock->held_list.next) {
 		INIT_LIST_HEAD(&lock->held_list);
+		lock->verbose = 0;
+	}
 #endif
 }
 
+
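+/*
+ * Pending-ownership helpers for the lock-stealing path.  The _nolock
+ * variants expect the owner's pi_lock to be held by the caller.
+ */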
+static void remove_pending_owner_nolock(task_t *owner)
+{
+	owner->rt_flags &= ~RT_PENDOWNER;
+	owner->pending_owner = NULL;
+}
+
+static void remove_pending_owner(task_t *owner)
+{
+	_raw_spin_lock(&owner->pi_lock);
+	remove_pending_owner_nolock(owner);
+	_raw_spin_unlock(&owner->pi_lock);
+}
+
+int task_is_pending_owner_nolock(struct thread_info  *owner, 
+                                 struct rt_mutex *lock)
+{
+	return (lock_owner(lock) == owner) &&
+		(owner->task->pending_owner == lock);
+}
+int task_is_pending_owner(struct thread_info  *owner, struct rt_mutex *lock)
+{
+	int res;
+	_raw_spin_lock(&owner->task->pi_lock);
+	res = task_is_pending_owner_nolock(owner,lock);
+	_raw_spin_unlock(&owner->task->pi_lock);
+	return res;
+}
 /*
  * Try to grab a lock, and if it is owned but the owner
  * hasn't woken up yet, see if we can steal it.
@@ -1233,6 +1149,8 @@ static int __grab_lock(struct rt_mutex *
 {
 #ifndef CAPTURE_LOCK
 	return 0;
+#else
+	int res = 0;
 #endif
 	/*
 	 * The lock is owned, but now test to see if the owner
@@ -1241,111 +1159,36 @@ static int __grab_lock(struct rt_mutex *
 
 	TRACE_BUG_ON_LOCKED(!owner);
 
+	_raw_spin_lock(&owner->pi_lock);
+
 	/* The owner is pending on a lock, but is it this lock? */
 	if (owner->pending_owner != lock)
-		return 0;
+		goto out_unlock;
 
 	/*
 	 * There's an owner, but it hasn't woken up to take the lock yet.
 	 * See if we should steal it from him.
 	 */
 	if (task->prio > owner->prio)
-		return 0;
-#ifdef CONFIG_PREEMPT_RT
+		goto out_unlock;
+
 	/*
 	 * The BKL is a PITA. Don't ever steal it
 	 */
-	if (lock == &kernel_sem.lock)
-		return 0;
-#endif
+	if (is_kernel_lock(lock))
+		goto out_unlock;
+
 	/*
 	 * This task is of higher priority than the current pending
 	 * owner, so we may steal it.
 	 */
-	owner->rt_flags &= ~RT_PENDOWNER;
-	owner->pending_owner = NULL;
-
-#ifdef CONFIG_DEBUG_DEADLOCKS
-	/*
-	 * This task will be taking the ownership away, and
-	 * when it does, the lock can't be on the held list.
-	 */
-	if (trace_on) {
-		TRACE_WARN_ON_LOCKED(list_empty(&lock->held_list));
-		list_del_init(&lock->held_list);
-	}
-#endif
-	account_mutex_owner_up(owner);
-
-	return 1;
-}
-
-/*
- * Bring a task from pending ownership to owning a lock.
- *
- * Return 0 if we secured it, otherwise non-zero if it was
- * stolen.
- */
-static int
-capture_lock(struct rt_mutex_waiter *waiter, struct thread_info *ti,
-	     struct task_struct *task)
-{
-	struct rt_mutex *lock = waiter->lock;
-	struct thread_info *old_owner;
-	unsigned long flags;
-	int ret = 0;
-
-#ifndef CAPTURE_LOCK
-	return 0;
-#endif
-#ifdef CONFIG_PREEMPT_RT
-	/*
-	 * The BKL is special, we always get it.
-	 */
-	if (lock == &kernel_sem.lock)
-		return 0;
-#endif
-
-	trace_lock_irqsave(&trace_lock, flags, ti);
-	/*
-	 * We are no longer blocked on the lock, so we are considered a
-	 * owner. So we must grab the lock->wait_lock first.
-	 */
-	_raw_spin_lock(&lock->wait_lock);
-	_raw_spin_lock(&task->pi_lock);
-
-	if (!(task->rt_flags & RT_PENDOWNER)) {
-		/*
-		 * Someone else stole it.
-		 */
-		old_owner = lock_owner(lock);
-		TRACE_BUG_ON_LOCKED(old_owner == ti);
-		if (likely(!old_owner) || __grab_lock(lock, task, old_owner->task)) {
-			/* we got it back! */
-			if (old_owner) {
-				_raw_spin_lock(&old_owner->task->pi_lock);
-				set_new_owner(lock, old_owner, ti __W_EIP__(waiter));
-				_raw_spin_unlock(&old_owner->task->pi_lock);
-			} else
-				set_new_owner(lock, old_owner, ti __W_EIP__(waiter));
-			ret = 0;
-		} else {
-			/* Add ourselves back to the list */
-			TRACE_BUG_ON_LOCKED(!plist_node_empty(&waiter->list));
-			plist_node_init(&waiter->list, task->prio);
-			task_blocks_on_lock(waiter, ti, lock __W_EIP__(waiter));
-			ret = 1;
-		}
-	} else {
-		task->rt_flags &= ~RT_PENDOWNER;
-		task->pending_owner = NULL;
-	}
+	remove_pending_owner_nolock(owner);
 
-	_raw_spin_unlock(&lock->wait_lock);
-	_raw_spin_unlock(&task->pi_lock);
-	trace_unlock_irqrestore(&trace_lock, flags, ti);
+	res = 1;
 
-	return ret;
+ out_unlock:
+	_raw_spin_unlock(&owner->pi_lock);
+	return res;
 }
 
 static inline void INIT_WAITER(struct rt_mutex_waiter *waiter)
@@ -1366,10 +1209,25 @@ static inline void FREE_WAITER(struct rt
 #endif
 }
 
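+/*
+ * A task may take @lock if the lock has no owner, if it is the BKL and
+ * we already own it, if we are the designated pending owner, or if we
+ * can steal it from a pending owner that does not have higher priority
+ * (__grab_lock).  Called with lock->wait_lock held.
+ */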
+static int allowed_to_take_lock(struct thread_info *ti,
+				task_t *task,
+				struct thread_info *old_owner,
+				struct rt_mutex *lock)
+{
+	SMP_TRACE_BUG_ON_LOCKED(!spin_is_locked(&lock->wait_lock));
+	SMP_TRACE_BUG_ON_LOCKED(spin_is_locked(&old_owner->task->pi_lock));
+	SMP_TRACE_BUG_ON_LOCKED(spin_is_locked(&task->pi_lock));
+
+	return !old_owner ||
+		(is_kernel_lock(lock)  && lock_owner(lock) == ti) ||
+		task_is_pending_owner(ti,lock) || 
+		__grab_lock(lock, task, old_owner->task);
+}
+
 /*
  * lock it semaphore-style: no worries about missed wakeups.
  */
-static inline void
+static void
 ____down(struct rt_mutex *lock __EIP_DECL__)
 {
 	struct thread_info *ti = current_thread_info(), *old_owner;
@@ -1379,65 +1237,66 @@ ____down(struct rt_mutex *lock __EIP_DEC
 
 	trace_lock_irqsave(&trace_lock, flags, ti);
 	TRACE_BUG_ON_LOCKED(!raw_irqs_disabled());
-	_raw_spin_lock(&task->pi_lock);
 	_raw_spin_lock(&lock->wait_lock);
 	INIT_WAITER(&waiter);
 
-	old_owner = lock_owner(lock);
 	init_lists(lock);
 
-	if (likely(!old_owner) || __grab_lock(lock, task, old_owner->task)) {
+	debug_lock(lock,"down");
+	/* wait to be given the lock */
+	for (;;) {
+		old_owner = lock_owner(lock);
+
+		if(allowed_to_take_lock(ti, task, old_owner,lock)) {
 		/* granted */
-		TRACE_WARN_ON_LOCKED(!plist_head_empty(&lock->wait_list) && !old_owner);
-		if (old_owner) {
-			_raw_spin_lock(&old_owner->task->pi_lock);
-			set_new_owner(lock, old_owner, ti __EIP__);
-			_raw_spin_unlock(&old_owner->task->pi_lock);
-		} else
+			TRACE_WARN_ON_LOCKED(!plist_head_empty(&lock->wait_list) && !old_owner);
 			set_new_owner(lock, old_owner, ti __EIP__);
-		_raw_spin_unlock(&lock->wait_lock);
-		_raw_spin_unlock(&task->pi_lock);
-		trace_unlock_irqrestore(&trace_lock, flags, ti);
-
-		FREE_WAITER(&waiter);
-		return;
-	}
-
-	set_task_state(task, TASK_UNINTERRUPTIBLE);
+			if (!is_kernel_lock(lock)) {
+				remove_pending_owner(task);
+			}
+			debug_lock(lock,"got lock");
 
-	plist_node_init(&waiter.list, task->prio);
-	task_blocks_on_lock(&waiter, ti, lock __EIP__);
+			_raw_spin_unlock(&lock->wait_lock);
+			trace_unlock_irqrestore(&trace_lock, flags, ti);
 
-	TRACE_BUG_ON_LOCKED(!raw_irqs_disabled());
-	/* we don't need to touch the lock struct anymore */
-	_raw_spin_unlock(&lock->wait_lock);
-	_raw_spin_unlock(&task->pi_lock);
-	trace_unlock_irqrestore(&trace_lock, flags, ti);
+			FREE_WAITER(&waiter);
+			return;
+		}
+		
+		task_blocks_on_lock(&waiter, ti, lock, TASK_UNINTERRUPTIBLE __EIP__);
 
-	might_sleep();
+		TRACE_BUG_ON_LOCKED(!raw_irqs_disabled());
+		/* we don't need to touch the lock struct anymore */
+		debug_lock(lock,"sleeping on");
+		_raw_spin_unlock(&lock->wait_lock);
+		trace_unlock_irqrestore(&trace_lock, flags, ti);
+		
+		might_sleep();
+		
+		nosched_flag = current->flags & PF_NOSCHED;
+		current->flags &= ~PF_NOSCHED;
 
-	nosched_flag = current->flags & PF_NOSCHED;
-	current->flags &= ~PF_NOSCHED;
+		if (waiter.ti) {
+			schedule();
+		}
+		
+		current->flags |= nosched_flag;
+		task->state = TASK_RUNNING;
 
-wait_again:
-	/* wait to be given the lock */
-	for (;;) {
-		if (!waiter.ti)
-			break;
-		schedule();
-		set_task_state(task, TASK_UNINTERRUPTIBLE);
-	}
-	/*
-	 * Check to see if we didn't have ownership stolen.
-	 */
-	if (capture_lock(&waiter, ti, task)) {
-		set_task_state(task, TASK_UNINTERRUPTIBLE);
-		goto wait_again;
+		trace_lock_irqsave(&trace_lock, flags, ti);
+		_raw_spin_lock(&lock->wait_lock);
+		debug_lock(lock,"waking up on");
+		if(waiter.ti) {
+			remove_waiter(lock,&waiter,1);
+		}
+		_raw_spin_lock(&task->pi_lock);
+		task->blocked_on = NULL;
+		_raw_spin_unlock(&task->pi_lock);
 	}
 
-	current->flags |= nosched_flag;
-	task->state = TASK_RUNNING;
-	FREE_WAITER(&waiter);
+	/* Should not get here! */
+	BUG_ON(1);
 }
 
 /*
@@ -1450,131 +1309,116 @@ wait_again:
  * enables the seemless use of arbitrary (blocking) spinlocks within
  * sleep/wakeup event loops.
  */
-static inline void
+static void
 ____down_mutex(struct rt_mutex *lock __EIP_DECL__)
 {
 	struct thread_info *ti = current_thread_info(), *old_owner;
-	unsigned long state, saved_state, nosched_flag;
+	unsigned long state, saved_state;
 	struct task_struct *task = ti->task;
 	struct rt_mutex_waiter waiter;
 	unsigned long flags;
-	int got_wakeup = 0, saved_lock_depth;
+	int got_wakeup = 0;
 
 	trace_lock_irqsave(&trace_lock, flags, ti);
 	TRACE_BUG_ON_LOCKED(!raw_irqs_disabled());
-	_raw_spin_lock(&task->pi_lock);
 	_raw_spin_lock(&lock->wait_lock);
-	INIT_WAITER(&waiter);
-
-	old_owner = lock_owner(lock);
-	init_lists(lock);
-
-	if (likely(!old_owner) || __grab_lock(lock, task, old_owner->task)) {
-		/* granted */
-		TRACE_WARN_ON_LOCKED(!plist_head_empty(&lock->wait_list) && !old_owner);
-		if (old_owner) {
-			_raw_spin_lock(&old_owner->task->pi_lock);
-			set_new_owner(lock, old_owner, ti __EIP__);
-			_raw_spin_unlock(&old_owner->task->pi_lock);
-		} else
-			set_new_owner(lock, old_owner, ti __EIP__);
-		_raw_spin_unlock(&lock->wait_lock);
-		_raw_spin_unlock(&task->pi_lock);
-		trace_unlock_irqrestore(&trace_lock, flags, ti);
-
-		FREE_WAITER(&waiter);
-		return;
-	}
-
-	plist_node_init(&waiter.list, task->prio);
-	task_blocks_on_lock(&waiter, ti, lock __EIP__);
-
-	TRACE_BUG_ON_LOCKED(!raw_irqs_disabled());
-	/*
+	/*
 	 * Here we save whatever state the task was in originally,
 	 * we'll restore it at the end of the function and we'll
 	 * take any intermediate wakeup into account as well,
 	 * independently of the mutex sleep/wakeup mechanism:
 	 */
 	saved_state = xchg(&task->state, TASK_UNINTERRUPTIBLE);
+
+	INIT_WAITER(&waiter);
 
-	/* we don't need to touch the lock struct anymore */
-	_raw_spin_unlock(&lock->wait_lock);
-	_raw_spin_unlock(&task->pi_lock);
-	trace_unlock(&trace_lock, ti);
-
-	/*
-	 * TODO: check 'flags' for the IRQ bit here - it is illegal to
-	 * call down() from an IRQs-off section that results in
-	 * an actual reschedule.
-	 */
-
-	nosched_flag = current->flags & PF_NOSCHED;
-	current->flags &= ~PF_NOSCHED;
-
-	/*
-	 * BKL users expect the BKL to be held across spinlock/rwlock-acquire.
-	 * Save and clear it, this will cause the scheduler to not drop the
-	 * BKL semaphore if we end up scheduling:
-	 */
-	saved_lock_depth = task->lock_depth;
-	task->lock_depth = -1;
+	init_lists(lock);
 
-wait_again:
 	/* wait to be given the lock */
 	for (;;) {
-		unsigned long saved_flags = current->flags & PF_NOSCHED;
-
-		if (!waiter.ti)
-			break;
-		trace_local_irq_enable(ti);
-		// no need to check for preemption here, we schedule().
-		current->flags &= ~PF_NOSCHED;
+		old_owner = lock_owner(lock);
+
+		if (allowed_to_take_lock(ti,task,old_owner,lock)) {
+		/* granted */
+			TRACE_WARN_ON_LOCKED(!plist_head_empty(&lock->wait_list) && !old_owner);
+			set_new_owner(lock, old_owner, ti __EIP__);
+			remove_pending_owner(task);
+			_raw_spin_unlock(&lock->wait_lock);
+
+			/*
+			 * Only set the task's state to TASK_RUNNING if it got
+			 * a non-mutex wakeup. We keep the original state otherwise.
+			 * A mutex wakeup changes the task's state to TASK_RUNNING_MUTEX,
+			 * not TASK_RUNNING - hence we can differentiate between the two
+			 * cases:
+			 */
+			state = xchg(&task->state, saved_state);
+			if (state == TASK_RUNNING)
+				got_wakeup = 1;
+			if (got_wakeup)
+				task->state = TASK_RUNNING;
+			trace_unlock_irqrestore(&trace_lock, flags, ti);
+			preempt_check_resched();
 
-		schedule();
+			FREE_WAITER(&waiter);
+			return;
+		}
+		
+		task_blocks_on_lock(&waiter, ti, lock,
+				    TASK_UNINTERRUPTIBLE __EIP__);
 
-		current->flags |= saved_flags;
-		trace_local_irq_disable(ti);
-		state = xchg(&task->state, TASK_UNINTERRUPTIBLE);
-		if (state == TASK_RUNNING)
-			got_wakeup = 1;
-	}
-	/*
-	 * Check to see if we didn't have ownership stolen.
-	 */
-	if (capture_lock(&waiter, ti, task)) {
-		state = xchg(&task->state, TASK_UNINTERRUPTIBLE);
-		if (state == TASK_RUNNING)
-			got_wakeup = 1;
-		goto wait_again;
-	}
-	/*
-	 * Only set the task's state to TASK_RUNNING if it got
-	 * a non-mutex wakeup. We keep the original state otherwise.
-	 * A mutex wakeup changes the task's state to TASK_RUNNING_MUTEX,
-	 * not TASK_RUNNING - hence we can differenciate between the two
-	 * cases:
-	 */
-	state = xchg(&task->state, saved_state);
-	if (state == TASK_RUNNING)
-		got_wakeup = 1;
-	if (got_wakeup)
-		task->state = TASK_RUNNING;
-	trace_local_irq_enable(ti);
-	preempt_check_resched();
+		TRACE_BUG_ON_LOCKED(!raw_irqs_disabled());
+		/* we don't need to touch the lock struct anymore */
+		_raw_spin_unlock(&lock->wait_lock);
+		trace_unlock(&trace_lock, ti);
+
+		if (waiter.ti) {
+			unsigned long saved_flags = 
+				current->flags & PF_NOSCHED;
+			/*
+			 * BKL users expect the BKL to be held across spinlock/rwlock-acquire.
+			 * Save and clear it, this will cause the scheduler to not drop the
+			 * BKL semaphore if we end up scheduling:
+			 */
 
-	task->lock_depth = saved_lock_depth;
-	current->flags |= nosched_flag;
-	FREE_WAITER(&waiter);
+			int saved_lock_depth = task->lock_depth;
+			task->lock_depth = -1;
+
+			trace_local_irq_enable(ti);
+			// no need to check for preemption here, we schedule().
+
+			current->flags &= ~PF_NOSCHED;
+			
+			schedule();
+			
+			trace_local_irq_disable(ti);
+			task->flags |= saved_flags;
+			task->lock_depth = saved_lock_depth;
+			state = xchg(&task->state, TASK_RUNNING_MUTEX);
+			if (state == TASK_RUNNING)
+				got_wakeup = 1;
+		}
+		
+		trace_lock_irq(&trace_lock, ti);
+		_raw_spin_lock(&lock->wait_lock);
+		if(waiter.ti) {
+			remove_waiter(lock,&waiter,1);
+		}
+		_raw_spin_lock(&task->pi_lock);
+		task->blocked_on = NULL;
+		_raw_spin_unlock(&task->pi_lock);
+	}
 }
 
-static void __up_mutex_waiter_savestate(struct rt_mutex *lock __EIP_DECL__);
-static void __up_mutex_waiter_nosavestate(struct rt_mutex *lock __EIP_DECL__);
-
+static void __up_mutex_waiter(struct rt_mutex *lock,
+			      int save_state __EIP_DECL__);
 /*
  * release the lock:
  */
-static inline void
+static void
 ____up_mutex(struct rt_mutex *lock, int save_state __EIP_DECL__)
 {
 	struct thread_info *ti = current_thread_info();
@@ -1585,30 +1429,31 @@ ____up_mutex(struct rt_mutex *lock, int 
 	trace_lock_irqsave(&trace_lock, flags, ti);
 	TRACE_BUG_ON_LOCKED(!raw_irqs_disabled());
 	_raw_spin_lock(&lock->wait_lock);
+	debug_lock(lock,"upping");
 	TRACE_BUG_ON_LOCKED(!lock->wait_list.prio_list.prev && !lock->wait_list.prio_list.next);
 
-#ifdef CONFIG_DEBUG_DEADLOCKS
-	if (trace_on) {
-		TRACE_WARN_ON_LOCKED(lock_owner(lock) != ti);
-		TRACE_WARN_ON_LOCKED(list_empty(&lock->held_list));
-		list_del_init(&lock->held_list);
-	}
-#endif
 
 #if ALL_TASKS_PI
 	if (plist_head_empty(&lock->wait_list))
 		check_pi_list_empty(lock, lock_owner(lock));
 #endif
 	if (unlikely(!plist_head_empty(&lock->wait_list))) {
-		if (save_state)
-			__up_mutex_waiter_savestate(lock __EIP__);
-		else
-			__up_mutex_waiter_nosavestate(lock __EIP__);
-	} else
+		__up_mutex_waiter(lock,save_state __EIP__);
+		debug_lock(lock,"woke up waiter");
+	} else {
+#ifdef CONFIG_DEBUG_DEADLOCKS
+		if (trace_on) {
+			TRACE_WARN_ON_LOCKED(lock_owner(lock) != ti);
+			TRACE_WARN_ON_LOCKED(list_empty(&lock->held_list));
+			list_del_init(&lock->held_list);
+		}
+#endif
 		lock->owner = NULL;
+		debug_lock(lock,"there was no waiters");
+		account_mutex_owner_up(ti->task);
+	}
 	_raw_spin_unlock(&lock->wait_lock);
 #if defined(CONFIG_DEBUG_PREEMPT) && defined(CONFIG_PREEMPT_RT)
-	account_mutex_owner_up(current);
 	if (!current->lock_count && !rt_prio(current->normal_prio) &&
 					rt_prio(current->prio)) {
 		static int once = 1;
@@ -1841,125 +1686,103 @@ static int __sched __down_interruptible(
 	struct rt_mutex_waiter waiter;
 	struct timer_list timer;
 	unsigned long expire = 0;
+	int timer_installed = 0;
 	int ret;
 
 	trace_lock_irqsave(&trace_lock, flags, ti);
 	TRACE_BUG_ON_LOCKED(!raw_irqs_disabled());
-	_raw_spin_lock(&task->pi_lock);
 	_raw_spin_lock(&lock->wait_lock);
 	INIT_WAITER(&waiter);
 
-	old_owner = lock_owner(lock);
 	init_lists(lock);
 
-	if (likely(!old_owner) || __grab_lock(lock, task, old_owner->task)) {
+	ret = 0;
+	/* wait to be given the lock */
+	for (;;) {
+		old_owner = lock_owner(lock);
+
+		if (allowed_to_take_lock(ti,task,old_owner,lock)) {
 		/* granted */
-		TRACE_WARN_ON_LOCKED(!plist_head_empty(&lock->wait_list) && !old_owner);
-		if (old_owner) {
-			_raw_spin_lock(&old_owner->task->pi_lock);
-			set_new_owner(lock, old_owner, ti __EIP__);
-			_raw_spin_unlock(&old_owner->task->pi_lock);
-		} else
+			TRACE_WARN_ON_LOCKED(!plist_head_empty(&lock->wait_list) && !old_owner);
 			set_new_owner(lock, old_owner, ti __EIP__);
-		_raw_spin_unlock(&lock->wait_lock);
-		_raw_spin_unlock(&task->pi_lock);
-		trace_unlock_irqrestore(&trace_lock, flags, ti);
-
-		FREE_WAITER(&waiter);
-		return 0;
-	}
+			_raw_spin_unlock(&lock->wait_lock);
+			trace_unlock_irqrestore(&trace_lock, flags, ti);
 
-	set_task_state(task, TASK_INTERRUPTIBLE);
+			goto out_free_timer;
+		}
 
-	plist_node_init(&waiter.list, task->prio);
-	task_blocks_on_lock(&waiter, ti, lock __EIP__);
+		task_blocks_on_lock(&waiter, ti, lock, TASK_INTERRUPTIBLE __EIP__);
 
-	TRACE_BUG_ON_LOCKED(!raw_irqs_disabled());
-	/* we don't need to touch the lock struct anymore */
-	_raw_spin_unlock(&lock->wait_lock);
-	_raw_spin_unlock(&task->pi_lock);
-	trace_unlock_irqrestore(&trace_lock, flags, ti);
+		TRACE_BUG_ON_LOCKED(!raw_irqs_disabled());
+		/* we don't need to touch the lock struct anymore */
+		_raw_spin_unlock(&lock->wait_lock);
+		trace_unlock_irqrestore(&trace_lock, flags, ti);
+		
+		might_sleep();
+		
+		nosched_flag = current->flags & PF_NOSCHED;
+		current->flags &= ~PF_NOSCHED;
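+		/*
+		 * Arm the timeout only on the first pass through the loop;
+		 * if we come around again (e.g. the lock was stolen) the
+		 * timer is already running.
+		 */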
+		if (time && !timer_installed) {
+			expire = time + jiffies;
+			init_timer(&timer);
+			timer.expires = expire;
+			timer.data = (unsigned long)current;
+			timer.function = process_timeout;
+			add_timer(&timer);
+			timer_installed = 1;
+		}
 
-	might_sleep();
+		if (waiter.ti) {
+			schedule();
+		}
+		
+		current->flags |= nosched_flag;
+		task->state = TASK_RUNNING;
 
-	nosched_flag = current->flags & PF_NOSCHED;
-	current->flags &= ~PF_NOSCHED;
-	if (time) {
-		expire = time + jiffies;
-		init_timer(&timer);
-		timer.expires = expire;
-		timer.data = (unsigned long)current;
-		timer.function = process_timeout;
-		add_timer(&timer);
-	}
+		trace_lock_irqsave(&trace_lock, flags, ti);
+		_raw_spin_lock(&lock->wait_lock);
+		if(waiter.ti) {
+			remove_waiter(lock,&waiter,1);
+		}
+		_raw_spin_lock(&task->pi_lock);
+		task->blocked_on = NULL;
+		_raw_spin_unlock(&task->pi_lock);
 
-	ret = 0;
-wait_again:
-	/* wait to be given the lock */
-	for (;;) {
-		if (signal_pending(current) || (time && !timer_pending(&timer))) {
-			/*
-			 * Remove ourselves from the wait list if we
-			 * didnt get the lock - else return success:
-			 */
-			trace_lock_irq(&trace_lock, ti);
-			_raw_spin_lock(&task->pi_lock);
-			_raw_spin_lock(&lock->wait_lock);
-			if (waiter.ti || time) {
-				plist_del(&waiter.list);
-				/*
-				 * If we were the last waiter then clear
-				 * the pending bit:
-				 */
-				if (plist_head_empty(&lock->wait_list))
-					lock->owner = lock_owner(lock);
-				/*
-				 * Just remove ourselves from the PI list.
-				 * (No big problem if our PI effect lingers
-				 *  a bit - owner will restore prio.)
-				 */
-				TRACE_WARN_ON_LOCKED(waiter.ti != ti);
-				TRACE_WARN_ON_LOCKED(current->blocked_on != &waiter);
-				plist_del(&waiter.pi_list);
-				waiter.pi_list.prio = task->prio;
-				waiter.ti = NULL;
-				current->blocked_on = NULL;
-				if (time) {
-					ret = (int)(expire - jiffies);
-					if (!timer_pending(&timer)) {
-						del_singleshot_timer_sync(&timer);
-						ret = -ETIMEDOUT;
-					}
-				} else
-					ret = -EINTR;
+		if(signal_pending(current)) {
+			if (time) {
+				ret = (int)(expire - jiffies);
+				if (!timer_pending(&timer)) {
+					ret = -ETIMEDOUT;
+				}
 			}
-			_raw_spin_unlock(&lock->wait_lock);
-			_raw_spin_unlock(&task->pi_lock);
-			trace_unlock_irq(&trace_lock, ti);
-			break;
+			else
+				ret = -EINTR;
+			
+			goto out_unlock;
 		}
-		if (!waiter.ti)
-			break;
-		schedule();
-		set_task_state(task, TASK_INTERRUPTIBLE);
-	}
-
-	/*
-	 * Check to see if we didn't have ownership stolen.
-	 */
-	if (!ret) {
-		if (capture_lock(&waiter, ti, task)) {
-			set_task_state(task, TASK_INTERRUPTIBLE);
-			goto wait_again;
+		else if(timer_installed &&
+			!timer_pending(&timer)) {
+			ret = -ETIMEDOUT;
+			goto out_unlock;
 		}
 	}
 
-	task->state = TASK_RUNNING;
-	current->flags |= nosched_flag;
 
+ out_unlock:
+	_raw_spin_unlock(&lock->wait_lock);
+	trace_unlock_irqrestore(&trace_lock, flags, ti);
+
+ out_free_timer:
+	if (time && timer_installed) {
+		/* make sure the stack-based timer is gone before we return */
+		del_singleshot_timer_sync(&timer);
+	}
 	FREE_WAITER(&waiter);
 	return ret;
 }
+
 /*
  * trylock for writing -- returns 1 if successful, 0 if contention
  */
@@ -1972,7 +1795,6 @@ static int __down_trylock(struct rt_mute
 
 	trace_lock_irqsave(&trace_lock, flags, ti);
 	TRACE_BUG_ON_LOCKED(!raw_irqs_disabled());
-	_raw_spin_lock(&task->pi_lock);
 	/*
 	 * It is OK for the owner of the lock to do a trylock on
 	 * a lock it owns, so to prevent deadlocking, we must
@@ -1989,17 +1811,11 @@ static int __down_trylock(struct rt_mute
 	if (likely(!old_owner) || __grab_lock(lock, task, old_owner->task)) {
 		/* granted */
 		TRACE_WARN_ON_LOCKED(!plist_head_empty(&lock->wait_list) && !old_owner);
-		if (old_owner) {
-			_raw_spin_lock(&old_owner->task->pi_lock);
-			set_new_owner(lock, old_owner, ti __EIP__);
-			_raw_spin_unlock(&old_owner->task->pi_lock);
-		} else
-			set_new_owner(lock, old_owner, ti __EIP__);
+		set_new_owner(lock, old_owner, ti __EIP__);
 		ret = 1;
 	}
 	_raw_spin_unlock(&lock->wait_lock);
 failed:
-	_raw_spin_unlock(&task->pi_lock);
 	trace_unlock_irqrestore(&trace_lock, flags, ti);
 
 	return ret;
@@ -2046,16 +1862,16 @@ static int down_read_trylock_mutex(struc
 }
 #endif
 
-static void __up_mutex_waiter_nosavestate(struct rt_mutex *lock __EIP_DECL__)
+static void __up_mutex_waiter(struct rt_mutex *lock,
+			      int save_state __EIP_DECL__)
 {
 	struct thread_info *old_owner_ti, *new_owner_ti;
 	struct task_struct *old_owner, *new_owner;
-	struct rt_mutex_waiter *w;
 	int prio;
 
 	old_owner_ti = lock_owner(lock);
 	old_owner = old_owner_ti->task;
-	new_owner_ti = pick_new_owner(lock, old_owner_ti, 0 __EIP__);
+	new_owner_ti = pick_new_owner(lock, old_owner_ti, save_state __EIP__);
 	new_owner = new_owner_ti->task;
 
 	/*
@@ -2063,67 +1879,21 @@ static void __up_mutex_waiter_nosavestat
 	 * to the previous priority (or to the next highest prio
 	 * waiter's priority):
 	 */
-	_raw_spin_lock(&old_owner->pi_lock);
-	prio = old_owner->normal_prio;
-	if (unlikely(!plist_head_empty(&old_owner->pi_waiters))) {
-		w = plist_first_entry(&old_owner->pi_waiters, struct rt_mutex_waiter, pi_list);
-		if (w->ti->task->prio < prio)
-			prio = w->ti->task->prio;
-	}
-	if (unlikely(prio != old_owner->prio))
-		pi_setprio(lock, old_owner, prio);
-	_raw_spin_unlock(&old_owner->pi_lock);
-#ifdef CAPTURE_LOCK
-#ifdef CONFIG_PREEMPT_RT
-	if (lock != &kernel_sem.lock) {
-#endif
-		new_owner->rt_flags |= RT_PENDOWNER;
-		new_owner->pending_owner = lock;
-#ifdef CONFIG_PREEMPT_RT
-	}
-#endif
-#endif
-	wake_up_process(new_owner);
-}
-
-static void __up_mutex_waiter_savestate(struct rt_mutex *lock __EIP_DECL__)
-{
-	struct thread_info *old_owner_ti, *new_owner_ti;
-	struct task_struct *old_owner, *new_owner;
-	struct rt_mutex_waiter *w;
-	int prio;
+	if(ALL_TASKS_PI || rt_prio(old_owner->prio)) {
+		_raw_spin_lock(&old_owner->pi_lock);
 
-	old_owner_ti = lock_owner(lock);
-	old_owner = old_owner_ti->task;
-	new_owner_ti = pick_new_owner(lock, old_owner_ti, 1 __EIP__);
-	new_owner = new_owner_ti->task;
+		prio = calc_pi_prio(old_owner);
+		if (unlikely(prio != old_owner->prio))
+			mutex_setprio(old_owner, prio);
 
-	/*
-	 * If the owner got priority-boosted then restore it
-	 * to the previous priority (or to the next highest prio
-	 * waiter's priority):
-	 */
-	_raw_spin_lock(&old_owner->pi_lock);
-	prio = old_owner->normal_prio;
-	if (unlikely(!plist_head_empty(&old_owner->pi_waiters))) {
-		w = plist_first_entry(&old_owner->pi_waiters, struct rt_mutex_waiter, pi_list);
-		if (w->ti->task->prio < prio)
-			prio = w->ti->task->prio;
-	}
-	if (unlikely(prio != old_owner->prio))
-		pi_setprio(lock, old_owner, prio);
-	_raw_spin_unlock(&old_owner->pi_lock);
-#ifdef CAPTURE_LOCK
-#ifdef CONFIG_PREEMPT_RT
-	if (lock != &kernel_sem.lock) {
-#endif
-		new_owner->rt_flags |= RT_PENDOWNER;
-		new_owner->pending_owner = lock;
-#ifdef CONFIG_PREEMPT_RT
+		_raw_spin_unlock(&old_owner->pi_lock);
+	}
+	if(save_state) {
+		wake_up_process_mutex(new_owner);
+	}
+	else {
+		wake_up_process(new_owner);
 	}
-#endif
-#endif
-	wake_up_process_mutex(new_owner);
 }
 
 #ifdef CONFIG_PREEMPT_RT
@@ -2578,7 +2348,7 @@ int __lockfunc _read_trylock(rwlock_t *r
 {
 #ifdef CONFIG_DEBUG_RT_LOCKING_MODE
 	if (!preempt_locks)
-	return _raw_read_trylock(&rwlock->lock.lock.debug_rwlock);
+		return _raw_read_trylock(&rwlock->lock.lock.debug_rwlock);
 	else
 #endif
 		return down_read_trylock_mutex(&rwlock->lock);
@@ -2905,17 +2675,6 @@ notrace int irqs_disabled(void)
 EXPORT_SYMBOL(irqs_disabled);
 #endif
 
-/*
- * This routine changes the owner of a mutex. It's only
- * caller is the futex code which locks a futex on behalf
- * of another thread.
- */
-void fastcall rt_mutex_set_owner(struct rt_mutex *lock, struct thread_info *t)
-{
-	account_mutex_owner_up(current);
-	account_mutex_owner_down(t->task, lock);
-	lock->owner = t;
-}
 
 struct thread_info * fastcall rt_mutex_owner(struct rt_mutex *lock)
 {
@@ -2950,7 +2709,6 @@ down_try_futex(struct rt_mutex *lock, st
 
 	trace_lock_irqsave(&trace_lock, flags, proxy_owner);
 	TRACE_BUG_ON_LOCKED(!raw_irqs_disabled());
-	_raw_spin_lock(&task->pi_lock);
 	_raw_spin_lock(&lock->wait_lock);
 
 	old_owner = lock_owner(lock);
@@ -2959,16 +2717,10 @@ down_try_futex(struct rt_mutex *lock, st
 	if (likely(!old_owner) || __grab_lock(lock, task, old_owner->task)) {
 		/* granted */
 		TRACE_WARN_ON_LOCKED(!plist_head_empty(&lock->wait_list) && !old_owner);
-		if (old_owner) {
-			_raw_spin_lock(&old_owner->task->pi_lock);
-			set_new_owner(lock, old_owner, proxy_owner __EIP__);
-			_raw_spin_unlock(&old_owner->task->pi_lock);
-		} else
 			set_new_owner(lock, old_owner, proxy_owner __EIP__);
 		ret = 1;
 	}
 	_raw_spin_unlock(&lock->wait_lock);
-	_raw_spin_unlock(&task->pi_lock);
 	trace_unlock_irqrestore(&trace_lock, flags, proxy_owner);
 
 	return ret;
@@ -3064,3 +2816,33 @@ void fastcall init_rt_mutex(struct rt_mu
 	__init_rt_mutex(lock, save_state, name, file, line);
 }
 EXPORT_SYMBOL(init_rt_mutex);
+
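+/*
+ * Return the pid of the task directly blocking @task, or 0 if it is not
+ * blocked.  pi_lock is taken before wait_lock here, the reverse of the
+ * normal order, so the wait_lock is only trylocked and we retry on
+ * contention to avoid deadlocking.
+ */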
+pid_t get_blocked_on(task_t *task)
+{
+	pid_t res = 0;
+	struct rt_mutex *lock;
+	struct thread_info *owner;
+ try_again:
+	_raw_spin_lock(&task->pi_lock);
+	if(!task->blocked_on) {
+		_raw_spin_unlock(&task->pi_lock);
+		goto out;
+	}
+	lock = task->blocked_on->lock;
+	if(!_raw_spin_trylock(&lock->wait_lock)) {
+		_raw_spin_unlock(&task->pi_lock);
+		goto try_again;
+	}
+	owner = lock_owner(lock);
+	if(owner)
+		res = owner->task->pid;
+
+	_raw_spin_unlock(&task->pi_lock);
+	_raw_spin_unlock(&lock->wait_lock);
+
+ out:
+	return res;
+}
+EXPORT_SYMBOL(get_blocked_on);
