* [PATCH 0/4][RT] futex: fix tasks blocking on two rt_mutex locks
@ 2010-07-09 22:32 Darren Hart
  2010-07-09 22:32 ` [PATCH 1/4] rtmutex: avoid null dereference in WARN_ON Darren Hart
                   ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: Darren Hart @ 2010-07-09 22:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Eric Dumazet,
	John Kacur, Steven Rostedt, Mike Galbraith, linux-rt-users

This patch set addresses the following problem reported by Mike Galbraith:

> Stress testing, looking to trigger RCU stalls, I've managed to find a
> way to repeatably create fireworks.  (got RCU stall, see attached)
> 
> 1. download ltp-full-20100630.  Needs to be this version because of
> testcase bustage in earlier versions, and must be built with gcc > 4.3,
> else testcases will segfault due to a gcc bug. 
> 
> 2. apply patchlet so you can run testcases/realtime/perf/latency/run.sh
> at all.
> 
> --- pthread_cond_many.c.org	2010-07-05 09:05:59.000000000 +0200
> +++ pthread_cond_many.c	2010-07-04 12:12:25.000000000 +0200
> @@ -259,7 +259,7 @@ void usage(void)
>  
>  int parse_args(int c, char *v)
>  {
> -	int handled;
> +	int handled = 1;
>          switch (c) {
>  		case 'h':
>  			usage();
> 
> 3. add --realtime for no particular reason.
> 
> --- run.sh.org	2010-07-06 15:54:58.000000000 +0200
> +++ run.sh	2010-07-06 16:37:34.000000000 +0200
> @@ -22,7 +22,7 @@ make
>  # process to run realtime.  The remainder of the processes (if any)
>  # will run non-realtime in any case.
>  
> -nthread=5000
> +nthread=500
>  iter=400
>  nproc=5
>  
> @@ -39,7 +39,7 @@ i=0
>  i=1
>  while test $i -lt $nproc
>  do
> -        ./pthread_cond_many --broadcast -i $iter -n $nthread > $nthread.$iter.$nproc.$i.out &
> +        ./pthread_cond_many --realtime --broadcast -i $iter -n $nthread > $nthread.$iter.$nproc.$i.out &
>          i=`expr $i + 1`
>  done
>  wait

The LTP wreckage will be addressed separately <ugh>

> 4. run it.
> 
> What happens here is we hit WARN_ON(pendowner->pi_blocked_on != waiter),
> this does not make it to consoles (poking sysrq-foo doesn't either).
> Next comes WARN_ON(!pendowner->pi_blocked_on), followed by the NULL
> explosion, which does make it to consoles.
> 
> With explosion avoidance, I also see pendowner->pi_blocked_on->task ==
> NULL at times, but that, as !pendowner->pi_blocked_on, seems to be
> fallout.  The start of bad juju is always pi_blocked_on != waiter.
> 
> [  141.609268] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
> [  141.609268] IP: [<ffffffff8106856d>] wakeup_next_waiter+0x12c/0x177
> [  141.609268] PGD 20e174067 PUD 20e253067 PMD 0 
> [  141.609268] Oops: 0000 [#1] PREEMPT SMP 
> [  141.609268] last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
> [  141.609268] CPU 0 
> [  141.609268] Pid: 8154, comm: pthread_cond_ma Tainted: G        W  2.6.33.6-rt23 #12 MS-7502/MS-7502
> [  141.609268] RIP: 0010:[<ffffffff8106856d>]  [<ffffffff8106856d>] wakeup_next_waiter+0x12c/0x177
> [  141.609268] RSP: 0018:ffff88020e3cdd78  EFLAGS: 00010097
> [  141.609268] RAX: 0000000000000000 RBX: ffff8801e8eba5c0 RCX: 0000000000000000
> [  141.609268] RDX: ffff880028200000 RSI: 0000000000000046 RDI: 0000000000000009
> [  141.609268] RBP: ffff88020e3cdda8 R08: 0000000000000002 R09: 0000000000000000
> [  141.609268] R10: 0000000000000005 R11: 0000000000000000 R12: ffffffff81659068
> [  141.609268] R13: ffff8801e8ebdb58 R14: 0000000000000000 R15: ffff8801e8ebac08
> [  141.609268] FS:  00007f664d539700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
> [  141.609268] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  141.609268] CR2: 0000000000000058 CR3: 0000000214266000 CR4: 00000000000006f0
> [  141.609268] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  141.609268] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  141.609268] Process pthread_cond_ma (pid: 8154, threadinfo ffff88020e3cc000, task ffff88020e2a4700)
> [  141.609268] Stack:
> [  141.609268]  0000000000000000 ffffffff81659068 0000000000000202 0000000000000000
> [  141.609268] <0> 0000000000000000 0000000080001fda ffff88020e3cddc8 ffffffff812fec48
> [  141.609268] <0> ffffffff81659068 0000000000606300 ffff88020e3cddd8 ffffffff812ff1b9
> [  141.609268] Call Trace:
> [  141.609268]  [<ffffffff812fec48>] rt_spin_lock_slowunlock+0x43/0x61
> [  141.609268]  [<ffffffff812ff1b9>] rt_spin_unlock+0x46/0x48
> [  141.609268]  [<ffffffff81067d7f>] do_futex+0x83c/0x935
> [  141.609268]  [<ffffffff810c26ce>] ? handle_mm_fault+0x6de/0x6f1
> [  141.609268]  [<ffffffff81067e36>] ? do_futex+0x8f3/0x935
> [  141.609268]  [<ffffffff81067fba>] sys_futex+0x142/0x154
> [  141.609268]  [<ffffffff81020eb0>] ? do_page_fault+0x23e/0x28e
> [  141.609268]  [<ffffffff81004aa7>] ? math_state_restore+0x3d/0x3f
> [  141.609268]  [<ffffffff81004b08>] ? do_device_not_available+0xe/0x12
> [  141.609268]  [<ffffffff81002c5b>] system_call_fastpath+0x16/0x1b
> [  141.609268] Code: c7 09 6d 41 81 e8 ac 34 fd ff 4c 39 ab 70 06 00 00 74 11 be 47 02 00 00 48 c7 c7 09 6d 41 81 e8 92 34 fd ff 48 8b 83 70 06 00 00 <4c> 39 60 58 74 11 be 48 02 00 00 48 c7 c7 09 6d 41 81 e8 74 34 
> [  141.609268] RIP  [<ffffffff8106856d>] wakeup_next_waiter+0x12c/0x177
> [  141.609268]  RSP <ffff88020e3cdd78>
> [  141.609268] CR2: 0000000000000058
> [  141.609268] ---[ end trace 58805b944e6f93ce ]---
> [  141.609268] note: pthread_cond_ma[8154] exited with preempt_count 2



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/4] rtmutex: avoid null dereference in WARN_ON
  2010-07-09 22:32 [PATCH 0/4][RT] futex: fix tasks blocking on two rt_mutex locks Darren Hart
@ 2010-07-09 22:32 ` Darren Hart
  2010-07-10  0:29   ` Steven Rostedt
  2010-07-09 22:32 ` [PATCH 2/4] rtmutex: add BUG_ON if a task attempts to block on two locks Darren Hart
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 28+ messages in thread
From: Darren Hart @ 2010-07-09 22:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Eric Dumazet,
	John Kacur, Steven Rostedt, Mike Galbraith, linux-rt-users,
	Darren Hart

If the pi_blocked_on variable is NULL, the subsequent WARN_ONs
will dereference it and cause an oops. Only perform the subsequent
checks if pi_blocked_on is valid.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Galbraith <efault@gmx.de>
---
 kernel/rtmutex.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 23dd443..baac7d9 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -579,9 +579,10 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
 
 	raw_spin_lock(&pendowner->pi_lock);
 
-	WARN_ON(!pendowner->pi_blocked_on);
-	WARN_ON(pendowner->pi_blocked_on != waiter);
-	WARN_ON(pendowner->pi_blocked_on->lock != lock);
+	if (!WARN_ON(!pendowner->pi_blocked_on)) {
+		WARN_ON(pendowner->pi_blocked_on != waiter);
+		WARN_ON(pendowner->pi_blocked_on->lock != lock);
+	}
 
 	pendowner->pi_blocked_on = NULL;
 
-- 
1.7.0.4



* [PATCH 2/4] rtmutex: add BUG_ON if a task attempts to block on two locks
  2010-07-09 22:32 [PATCH 0/4][RT] futex: fix tasks blocking on two rt_mutex locks Darren Hart
  2010-07-09 22:32 ` [PATCH 1/4] rtmutex: avoid null dereference in WARN_ON Darren Hart
@ 2010-07-09 22:32 ` Darren Hart
  2010-07-10  0:30   ` Steven Rostedt
  2010-07-09 22:32 ` [PATCH 3/4] futex: free_pi_state outside of hb->lock sections Darren Hart
  2010-07-09 22:33 ` [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t Darren Hart
  3 siblings, 1 reply; 28+ messages in thread
From: Darren Hart @ 2010-07-09 22:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Eric Dumazet,
	John Kacur, Steven Rostedt, Mike Galbraith, linux-rt-users,
	Darren Hart

rtmutex proxy locking complicates the logic a bit and opens up
the possibility for a task to wake and attempt to take another
sleeping lock without knowing it has been enqueued on another
lock already. Add a BUG_ON to catch this scenario early.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Galbraith <efault@gmx.de>
---
 kernel/rtmutex.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index baac7d9..22f9d18 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -459,6 +459,9 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 		top_waiter = rt_mutex_top_waiter(lock);
 	plist_add(&waiter->list_entry, &lock->wait_list);
 
+	/* Tasks can only block on one lock at a time. */
+	BUG_ON(task->pi_blocked_on != NULL);
+
 	task->pi_blocked_on = waiter;
 
 	raw_spin_unlock(&task->pi_lock);
-- 
1.7.0.4



* [PATCH 3/4] futex: free_pi_state outside of hb->lock sections
  2010-07-09 22:32 [PATCH 0/4][RT] futex: fix tasks blocking on two rt_mutex locks Darren Hart
  2010-07-09 22:32 ` [PATCH 1/4] rtmutex: avoid null dereference in WARN_ON Darren Hart
  2010-07-09 22:32 ` [PATCH 2/4] rtmutex: add BUG_ON if a task attempts to block on two locks Darren Hart
@ 2010-07-09 22:32 ` Darren Hart
  2010-07-09 22:55     ` Darren Hart
  2010-07-12 10:35   ` [PATCH 3/4] " Thomas Gleixner
  2010-07-09 22:33 ` [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t Darren Hart
  3 siblings, 2 replies; 28+ messages in thread
From: Darren Hart @ 2010-07-09 22:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Eric Dumazet,
	John Kacur, Steven Rostedt, Mike Galbraith, linux-rt-users,
	Darren Hart

free_pi_state() calls kfree() and might sleep. To prepare for raw hb->locks,
get the calls to free_pi_state() out of the hb->lock critical sections.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Galbraith <efault@gmx.de>
---
 kernel/futex.c |   14 ++++++++------
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index a6cec32..b217972 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -114,7 +114,7 @@ struct futex_q {
 	struct plist_node list;
 
 	struct task_struct *task;
-	spinlock_t *lock_ptr;
+	raw_spinlock_t *lock_ptr;
 	union futex_key key;
 	struct futex_pi_state *pi_state;
 	struct rt_mutex_waiter *rt_waiter;
@@ -487,7 +487,7 @@ void exit_pi_state_list(struct task_struct *curr)
 		 * task still owns the PI-state:
 		 */
 		if (head->next != next) {
-			spin_unlock(&hb->lock);
+			raw_spin_unlock(&hb->lock);
 			continue;
 		}
 
@@ -1339,7 +1339,6 @@ retry_private:
 			} else if (ret) {
 				/* -EDEADLK */
 				this->pi_state = NULL;
-				free_pi_state(pi_state);
 				goto out_unlock;
 			}
 		}
@@ -1437,7 +1436,7 @@ static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
  */
 static int unqueue_me(struct futex_q *q)
 {
-	spinlock_t *lock_ptr;
+	raw_spinlock_t *lock_ptr;
 	int ret = 0;
 
 	/* In the common case we don't take the spinlock, which is nice. */
@@ -1483,16 +1482,19 @@ retry:
  */
 static void unqueue_me_pi(struct futex_q *q)
 {
+	struct futex_pi_state *pi_state = NULL;
+
 	WARN_ON(plist_node_empty(&q->list));
 	plist_del(&q->list, &q->list.plist);
 
 	BUG_ON(!q->pi_state);
-	free_pi_state(q->pi_state);
+	pi_state = q->pi_state;
 	q->pi_state = NULL;
 
 	spin_unlock(q->lock_ptr);
-
 	drop_futex_key_refs(&q->key);
+
+	free_pi_state(pi_state);
 }
 
 /*
-- 
1.7.0.4



* [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t
  2010-07-09 22:32 [PATCH 0/4][RT] futex: fix tasks blocking on two rt_mutex locks Darren Hart
                   ` (2 preceding siblings ...)
  2010-07-09 22:32 ` [PATCH 3/4] futex: free_pi_state outside of hb->lock sections Darren Hart
@ 2010-07-09 22:33 ` Darren Hart
  2010-07-09 22:57   ` [PATCH 4/4 V2] " Darren Hart
                     ` (2 more replies)
  3 siblings, 3 replies; 28+ messages in thread
From: Darren Hart @ 2010-07-09 22:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Eric Dumazet,
	John Kacur, Steven Rostedt, Mike Galbraith, linux-rt-users,
	Darren Hart

The requeue_pi mechanism introduced proxy locking of the rtmutex.  This creates
a scenario where a task can wake up, not knowing it has been enqueued on an
rtmutex. In order to detect this, the task would have to be able to take
task->pi_blocked_on->lock->wait_lock and/or the hb->lock.  Unfortunately,
without already holding one of these, the pi_blocked_on variable can change
from NULL to valid or from valid to NULL. Therefore, the task cannot be allowed
to take a sleeping lock after wakeup, or it could end up trying to block on two
locks, with the second overwriting a valid pi_blocked_on value. This obviously
breaks the PI mechanism.

This patch increases latency. While running the LTP pthread_cond_many test
that Mike Galbraith reported the bug with, I see double-digit hrtimer latencies
(typically only on the first run after boot):

	kernel: hrtimer: interrupt took 75911 ns

This might be addressed by changing the various loops in the futex code to be
incremental, probably at an additional throughput hit. The private hash_bucket
lists discussed in the past could reduce hb->lock contention in some scenarios.
It should be noted that pthread_cond_many is a rather pathological case.

This also introduces problems for plists which want a spinlock_t rather
than a raw_spinlock_t. Any thoughts on how to address this?

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Galbraith <efault@gmx.de>
---
 kernel/futex.c |   67 ++++++++++++++++++++++++++++++--------------------------
 1 files changed, 36 insertions(+), 31 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index b217972..0ad5a85 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -128,7 +128,7 @@ struct futex_q {
  * waiting on a futex.
  */
 struct futex_hash_bucket {
-	spinlock_t lock;
+	raw_spinlock_t lock;
 	struct plist_head chain;
 };
 
@@ -479,7 +479,7 @@ void exit_pi_state_list(struct task_struct *curr)
 		hb = hash_futex(&key);
 		raw_spin_unlock_irq(&curr->pi_lock);
 
-		spin_lock(&hb->lock);
+		raw_spin_lock(&hb->lock);
 
 		raw_spin_lock_irq(&curr->pi_lock);
 		/*
@@ -499,7 +499,7 @@ void exit_pi_state_list(struct task_struct *curr)
 
 		rt_mutex_unlock(&pi_state->pi_mutex);
 
-		spin_unlock(&hb->lock);
+		raw_spin_unlock(&hb->lock);
 
 		raw_spin_lock_irq(&curr->pi_lock);
 	}
@@ -860,21 +860,21 @@ static inline void
 double_lock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
 {
 	if (hb1 <= hb2) {
-		spin_lock(&hb1->lock);
+		raw_spin_lock(&hb1->lock);
 		if (hb1 < hb2)
-			spin_lock_nested(&hb2->lock, SINGLE_DEPTH_NESTING);
+			raw_spin_lock_nested(&hb2->lock, SINGLE_DEPTH_NESTING);
 	} else { /* hb1 > hb2 */
-		spin_lock(&hb2->lock);
-		spin_lock_nested(&hb1->lock, SINGLE_DEPTH_NESTING);
+		raw_spin_lock(&hb2->lock);
+		raw_spin_lock_nested(&hb1->lock, SINGLE_DEPTH_NESTING);
 	}
 }
 
 static inline void
 double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
 {
-	spin_unlock(&hb1->lock);
+	raw_spin_unlock(&hb1->lock);
 	if (hb1 != hb2)
-		spin_unlock(&hb2->lock);
+		raw_spin_unlock(&hb2->lock);
 }
 
 /*
@@ -896,7 +896,7 @@ static int futex_wake(u32 __user *uaddr, int fshared, int nr_wake, u32 bitset)
 		goto out;
 
 	hb = hash_futex(&key);
-	spin_lock(&hb->lock);
+	raw_spin_lock(&hb->lock);
 	head = &hb->chain;
 
 	plist_for_each_entry_safe(this, next, head, list) {
@@ -916,7 +916,7 @@ static int futex_wake(u32 __user *uaddr, int fshared, int nr_wake, u32 bitset)
 		}
 	}
 
-	spin_unlock(&hb->lock);
+	raw_spin_unlock(&hb->lock);
 	put_futex_key(fshared, &key);
 out:
 	return ret;
@@ -1070,6 +1070,7 @@ void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key,
 
 	q->lock_ptr = &hb->lock;
 #ifdef CONFIG_DEBUG_PI_LIST
+	/* FIXME: we're converting this to a raw lock... */
 	q->list.plist.spinlock = &hb->lock;
 #endif
 
@@ -1377,14 +1378,14 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
 	hb = hash_futex(&q->key);
 	q->lock_ptr = &hb->lock;
 
-	spin_lock(&hb->lock);
+	raw_spin_lock(&hb->lock);
 	return hb;
 }
 
 static inline void
 queue_unlock(struct futex_q *q, struct futex_hash_bucket *hb)
 {
-	spin_unlock(&hb->lock);
+	raw_spin_unlock(&hb->lock);
 	drop_futex_key_refs(&q->key);
 }
 
@@ -1416,11 +1417,12 @@ static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
 
 	plist_node_init(&q->list, prio);
 #ifdef CONFIG_DEBUG_PI_LIST
+	/* FIXME: we're converting this to a raw_spinlock */
 	q->list.plist.spinlock = &hb->lock;
 #endif
 	plist_add(&q->list, &hb->chain);
 	q->task = current;
-	spin_unlock(&hb->lock);
+	raw_spin_unlock(&hb->lock);
 }
 
 /**
@@ -1444,7 +1446,7 @@ retry:
 	lock_ptr = q->lock_ptr;
 	barrier();
 	if (lock_ptr != NULL) {
-		spin_lock(lock_ptr);
+		raw_spin_lock(lock_ptr);
 		/*
 		 * q->lock_ptr can change between reading it and
 		 * spin_lock(), causing us to take the wrong lock.  This
@@ -1459,7 +1461,7 @@ retry:
 		 * we can detect whether we acquired the correct lock.
 		 */
 		if (unlikely(lock_ptr != q->lock_ptr)) {
-			spin_unlock(lock_ptr);
+			raw_spin_unlock(lock_ptr);
 			goto retry;
 		}
 		WARN_ON(plist_node_empty(&q->list));
@@ -1467,7 +1469,7 @@ retry:
 
 		BUG_ON(q->pi_state);
 
-		spin_unlock(lock_ptr);
+		raw_spin_unlock(lock_ptr);
 		ret = 1;
 	}
 
@@ -1491,7 +1493,7 @@ static void unqueue_me_pi(struct futex_q *q)
 	pi_state = q->pi_state;
 	q->pi_state = NULL;
 
-	spin_unlock(q->lock_ptr);
+	raw_spin_unlock(q->lock_ptr);
 	drop_futex_key_refs(&q->key);
 
 	free_pi_state(pi_state);
@@ -1579,11 +1581,11 @@ retry:
 	 * simply return.
 	 */
 handle_fault:
-	spin_unlock(q->lock_ptr);
+	raw_spin_unlock(q->lock_ptr);
 
 	ret = fault_in_user_writeable(uaddr);
 
-	spin_lock(q->lock_ptr);
+	raw_spin_lock(q->lock_ptr);
 
 	/*
 	 * Check if someone else fixed it for us:
@@ -1976,7 +1978,7 @@ retry_private:
 		ret = ret ? 0 : -EWOULDBLOCK;
 	}
 
-	spin_lock(q.lock_ptr);
+	raw_spin_lock(q.lock_ptr);
 	/*
 	 * Fixup the pi_state owner and possibly acquire the lock if we
 	 * haven't already.
@@ -2053,7 +2055,7 @@ retry:
 		goto out;
 
 	hb = hash_futex(&key);
-	spin_lock(&hb->lock);
+	raw_spin_lock(&hb->lock);
 
 	/*
 	 * To avoid races, try to do the TID -> 0 atomic transition
@@ -2102,14 +2104,14 @@ retry:
 	}
 
 out_unlock:
-	spin_unlock(&hb->lock);
+	raw_spin_unlock(&hb->lock);
 	put_futex_key(fshared, &key);
 
 out:
 	return ret;
 
 pi_faulted:
-	spin_unlock(&hb->lock);
+	raw_spin_unlock(&hb->lock);
 	put_futex_key(fshared, &key);
 
 	ret = fault_in_user_writeable(uaddr);
@@ -2257,9 +2259,9 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
 	/* Queue the futex_q, drop the hb lock, wait for wakeup. */
 	futex_wait_queue_me(hb, &q, to);
 
-	spin_lock(&hb->lock);
+	raw_spin_lock(&hb->lock);
 	ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
-	spin_unlock(&hb->lock);
+	raw_spin_unlock(&hb->lock);
 	if (ret)
 		goto out_put_keys;
 
@@ -2277,10 +2279,10 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
 		 * did a lock-steal - fix up the PI-state in that case.
 		 */
 		if (q.pi_state && (q.pi_state->owner != current)) {
-			spin_lock(q.lock_ptr);
+			raw_spin_lock(q.lock_ptr);
 			ret = fixup_pi_state_owner(uaddr2, &q, current,
 						   fshared);
-			spin_unlock(q.lock_ptr);
+			raw_spin_unlock(q.lock_ptr);
 		}
 	} else {
 		/*
@@ -2293,7 +2295,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
 		ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter, 1);
 		debug_rt_mutex_free_waiter(&rt_waiter);
 
-		spin_lock(q.lock_ptr);
+		raw_spin_lock(q.lock_ptr);
 		/*
 		 * Fixup the pi_state owner and possibly acquire the lock if we
 		 * haven't already.
@@ -2668,8 +2670,11 @@ static int __init futex_init(void)
 		futex_cmpxchg_enabled = 1;
 
 	for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
-		plist_head_init(&futex_queues[i].chain, &futex_queues[i].lock);
-		spin_lock_init(&futex_queues[i].lock);
+		/* 
+		 * FIXME: plist wants a spinlock, but the hb->lock is a raw_spinlock_t
+		 */
+		plist_head_init(&futex_queues[i].chain, NULL /*&futex_queues[i].lock*/);
+		raw_spin_lock_init(&futex_queues[i].lock);
 	}
 
 	return 0;
-- 
1.7.0.4



* Re: [PATCH 3/4 V2] futex: free_pi_state outside of hb->lock sections
  2010-07-09 22:32 ` [PATCH 3/4] futex: free_pi_state outside of hb->lock sections Darren Hart
@ 2010-07-09 22:55     ` Darren Hart
  2010-07-12 10:35   ` [PATCH 3/4] " Thomas Gleixner
  1 sibling, 0 replies; 28+ messages in thread
From: Darren Hart @ 2010-07-09 22:55 UTC (permalink / raw)
  To: Darren Hart
  Cc: linux-kernel, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Eric Dumazet, John Kacur, Steven Rostedt, Mike Galbraith,
	linux-rt-users

Apologies, mangled a rebase and this patch had bits of 4/4 in it.
Corrected below:


From f40b8e684df6ed2b1ba9167cad3922ab43f4a717 Mon Sep 17 00:00:00 2001
From: Darren Hart <dvhltc@us.ibm.com>
Date: Fri, 9 Jul 2010 17:07:10 -0400
Subject: [PATCH 3/4 V2] futex: free_pi_state outside of hb->lock sections

free_pi_state() calls kfree() and might sleep. To prepare for raw hb->locks,
get the calls to free_pi_state() out of the hb->lock critical sections.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Galbraith <efault@gmx.de>
---
 kernel/futex.c |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index a6cec32..2cd58a2 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1339,7 +1339,6 @@ retry_private:
 			} else if (ret) {
 				/* -EDEADLK */
 				this->pi_state = NULL;
-				free_pi_state(pi_state);
 				goto out_unlock;
 			}
 		}
@@ -1483,16 +1482,19 @@ retry:
  */
 static void unqueue_me_pi(struct futex_q *q)
 {
+	struct futex_pi_state *pi_state = NULL;
+
 	WARN_ON(plist_node_empty(&q->list));
 	plist_del(&q->list, &q->list.plist);
 
 	BUG_ON(!q->pi_state);
-	free_pi_state(q->pi_state);
+	pi_state = q->pi_state;
 	q->pi_state = NULL;
 
 	spin_unlock(q->lock_ptr);
-
 	drop_futex_key_refs(&q->key);
+
+	free_pi_state(pi_state);
 }
 
 /*
-- 
1.7.0.4




* Re: [PATCH 4/4 V2] futex: convert hash_bucket locks to raw_spinlock_t
  2010-07-09 22:33 ` [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t Darren Hart
@ 2010-07-09 22:57   ` Darren Hart
  2010-07-10  0:34     ` Steven Rostedt
  2010-07-10 19:41   ` [PATCH 4/4] " Mike Galbraith
  2010-07-12 13:05   ` Thomas Gleixner
  2 siblings, 1 reply; 28+ messages in thread
From: Darren Hart @ 2010-07-09 22:57 UTC (permalink / raw)
  To: Darren Hart
  Cc: linux-kernel, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Eric Dumazet, John Kacur, Steven Rostedt, Mike Galbraith,
	linux-rt-users

This version pulls in the bits mistakenly left in 3/4.


From 9f8b4faac79518f98131464c2d21a1c64fb841d2 Mon Sep 17 00:00:00 2001
From: Darren Hart <dvhltc@us.ibm.com>
Date: Fri, 9 Jul 2010 16:44:47 -0400
Subject: [PATCH 4/4 V2] futex: convert hash_bucket locks to raw_spinlock_t

The requeue_pi mechanism introduced proxy locking of the rtmutex.  This creates
a scenario where a task can wake up, not knowing it has been enqueued on an
rtmutex. In order to detect this, the task would have to be able to take
task->pi_blocked_on->lock->wait_lock and/or the hb->lock.  Unfortunately,
without already holding one of these, the pi_blocked_on variable can change
from NULL to valid or from valid to NULL. Therefore, the task cannot be allowed
to take a sleeping lock after wakeup, or it could end up trying to block on two
locks, with the second overwriting a valid pi_blocked_on value. This obviously
breaks the PI mechanism.

This patch increases latency. While running the LTP pthread_cond_many test
that Mike Galbraith reported the bug with, I see double-digit hrtimer latencies
(typically only on the first run after boot):

	kernel: hrtimer: interrupt took 75911 ns

This might be addressed by changing the various loops in the futex code to be
incremental, probably at an additional throughput hit. The private hash_bucket
lists discussed in the past could reduce hb->lock contention in some scenarios.
It should be noted that pthread_cond_many is a rather pathological case.

This also introduces problems for plists which want a spinlock_t rather
than a raw_spinlock_t. Any thoughts on how to address this?

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Galbraith <efault@gmx.de>
---
 kernel/futex.c |   73 ++++++++++++++++++++++++++++++--------------------------
 1 files changed, 39 insertions(+), 34 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 2cd58a2..0ad5a85 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -114,7 +114,7 @@ struct futex_q {
 	struct plist_node list;
 
 	struct task_struct *task;
-	spinlock_t *lock_ptr;
+	raw_spinlock_t *lock_ptr;
 	union futex_key key;
 	struct futex_pi_state *pi_state;
 	struct rt_mutex_waiter *rt_waiter;
@@ -128,7 +128,7 @@ struct futex_q {
  * waiting on a futex.
  */
 struct futex_hash_bucket {
-	spinlock_t lock;
+	raw_spinlock_t lock;
 	struct plist_head chain;
 };
 
@@ -479,7 +479,7 @@ void exit_pi_state_list(struct task_struct *curr)
 		hb = hash_futex(&key);
 		raw_spin_unlock_irq(&curr->pi_lock);
 
-		spin_lock(&hb->lock);
+		raw_spin_lock(&hb->lock);
 
 		raw_spin_lock_irq(&curr->pi_lock);
 		/*
@@ -487,7 +487,7 @@ void exit_pi_state_list(struct task_struct *curr)
 		 * task still owns the PI-state:
 		 */
 		if (head->next != next) {
-			spin_unlock(&hb->lock);
+			raw_spin_unlock(&hb->lock);
 			continue;
 		}
 
@@ -499,7 +499,7 @@ void exit_pi_state_list(struct task_struct *curr)
 
 		rt_mutex_unlock(&pi_state->pi_mutex);
 
-		spin_unlock(&hb->lock);
+		raw_spin_unlock(&hb->lock);
 
 		raw_spin_lock_irq(&curr->pi_lock);
 	}
@@ -860,21 +860,21 @@ static inline void
 double_lock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
 {
 	if (hb1 <= hb2) {
-		spin_lock(&hb1->lock);
+		raw_spin_lock(&hb1->lock);
 		if (hb1 < hb2)
-			spin_lock_nested(&hb2->lock, SINGLE_DEPTH_NESTING);
+			raw_spin_lock_nested(&hb2->lock, SINGLE_DEPTH_NESTING);
 	} else { /* hb1 > hb2 */
-		spin_lock(&hb2->lock);
-		spin_lock_nested(&hb1->lock, SINGLE_DEPTH_NESTING);
+		raw_spin_lock(&hb2->lock);
+		raw_spin_lock_nested(&hb1->lock, SINGLE_DEPTH_NESTING);
 	}
 }
 
 static inline void
 double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
 {
-	spin_unlock(&hb1->lock);
+	raw_spin_unlock(&hb1->lock);
 	if (hb1 != hb2)
-		spin_unlock(&hb2->lock);
+		raw_spin_unlock(&hb2->lock);
 }
 
 /*
@@ -896,7 +896,7 @@ static int futex_wake(u32 __user *uaddr, int fshared, int nr_wake, u32 bitset)
 		goto out;
 
 	hb = hash_futex(&key);
-	spin_lock(&hb->lock);
+	raw_spin_lock(&hb->lock);
 	head = &hb->chain;
 
 	plist_for_each_entry_safe(this, next, head, list) {
@@ -916,7 +916,7 @@ static int futex_wake(u32 __user *uaddr, int fshared, int nr_wake, u32 bitset)
 		}
 	}
 
-	spin_unlock(&hb->lock);
+	raw_spin_unlock(&hb->lock);
 	put_futex_key(fshared, &key);
 out:
 	return ret;
@@ -1070,6 +1070,7 @@ void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key,
 
 	q->lock_ptr = &hb->lock;
 #ifdef CONFIG_DEBUG_PI_LIST
+	/* FIXME: we're converting this to a raw lock... */
 	q->list.plist.spinlock = &hb->lock;
 #endif
 
@@ -1377,14 +1378,14 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
 	hb = hash_futex(&q->key);
 	q->lock_ptr = &hb->lock;
 
-	spin_lock(&hb->lock);
+	raw_spin_lock(&hb->lock);
 	return hb;
 }
 
 static inline void
 queue_unlock(struct futex_q *q, struct futex_hash_bucket *hb)
 {
-	spin_unlock(&hb->lock);
+	raw_spin_unlock(&hb->lock);
 	drop_futex_key_refs(&q->key);
 }
 
@@ -1416,11 +1417,12 @@ static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
 
 	plist_node_init(&q->list, prio);
 #ifdef CONFIG_DEBUG_PI_LIST
+	/* FIXME: we're converting this to a raw_spinlock */
 	q->list.plist.spinlock = &hb->lock;
 #endif
 	plist_add(&q->list, &hb->chain);
 	q->task = current;
-	spin_unlock(&hb->lock);
+	raw_spin_unlock(&hb->lock);
 }
 
 /**
@@ -1436,7 +1438,7 @@ static inline void queue_me(struct futex_q *q, struct futex_hash_bucket *hb)
  */
 static int unqueue_me(struct futex_q *q)
 {
-	spinlock_t *lock_ptr;
+	raw_spinlock_t *lock_ptr;
 	int ret = 0;
 
 	/* In the common case we don't take the spinlock, which is nice. */
@@ -1444,7 +1446,7 @@ retry:
 	lock_ptr = q->lock_ptr;
 	barrier();
 	if (lock_ptr != NULL) {
-		spin_lock(lock_ptr);
+		raw_spin_lock(lock_ptr);
 		/*
 		 * q->lock_ptr can change between reading it and
 		 * spin_lock(), causing us to take the wrong lock.  This
@@ -1459,7 +1461,7 @@ retry:
 		 * we can detect whether we acquired the correct lock.
 		 */
 		if (unlikely(lock_ptr != q->lock_ptr)) {
-			spin_unlock(lock_ptr);
+			raw_spin_unlock(lock_ptr);
 			goto retry;
 		}
 		WARN_ON(plist_node_empty(&q->list));
@@ -1467,7 +1469,7 @@ retry:
 
 		BUG_ON(q->pi_state);
 
-		spin_unlock(lock_ptr);
+		raw_spin_unlock(lock_ptr);
 		ret = 1;
 	}
 
@@ -1491,7 +1493,7 @@ static void unqueue_me_pi(struct futex_q *q)
 	pi_state = q->pi_state;
 	q->pi_state = NULL;
 
-	spin_unlock(q->lock_ptr);
+	raw_spin_unlock(q->lock_ptr);
 	drop_futex_key_refs(&q->key);
 
 	free_pi_state(pi_state);
@@ -1579,11 +1581,11 @@ retry:
 	 * simply return.
 	 */
 handle_fault:
-	spin_unlock(q->lock_ptr);
+	raw_spin_unlock(q->lock_ptr);
 
 	ret = fault_in_user_writeable(uaddr);
 
-	spin_lock(q->lock_ptr);
+	raw_spin_lock(q->lock_ptr);
 
 	/*
 	 * Check if someone else fixed it for us:
@@ -1976,7 +1978,7 @@ retry_private:
 		ret = ret ? 0 : -EWOULDBLOCK;
 	}
 
-	spin_lock(q.lock_ptr);
+	raw_spin_lock(q.lock_ptr);
 	/*
 	 * Fixup the pi_state owner and possibly acquire the lock if we
 	 * haven't already.
@@ -2053,7 +2055,7 @@ retry:
 		goto out;
 
 	hb = hash_futex(&key);
-	spin_lock(&hb->lock);
+	raw_spin_lock(&hb->lock);
 
 	/*
 	 * To avoid races, try to do the TID -> 0 atomic transition
@@ -2102,14 +2104,14 @@ retry:
 	}
 
 out_unlock:
-	spin_unlock(&hb->lock);
+	raw_spin_unlock(&hb->lock);
 	put_futex_key(fshared, &key);
 
 out:
 	return ret;
 
 pi_faulted:
-	spin_unlock(&hb->lock);
+	raw_spin_unlock(&hb->lock);
 	put_futex_key(fshared, &key);
 
 	ret = fault_in_user_writeable(uaddr);
@@ -2257,9 +2259,9 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
 	/* Queue the futex_q, drop the hb lock, wait for wakeup. */
 	futex_wait_queue_me(hb, &q, to);
 
-	spin_lock(&hb->lock);
+	raw_spin_lock(&hb->lock);
 	ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
-	spin_unlock(&hb->lock);
+	raw_spin_unlock(&hb->lock);
 	if (ret)
 		goto out_put_keys;
 
@@ -2277,10 +2279,10 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
 		 * did a lock-steal - fix up the PI-state in that case.
 		 */
 		if (q.pi_state && (q.pi_state->owner != current)) {
-			spin_lock(q.lock_ptr);
+			raw_spin_lock(q.lock_ptr);
 			ret = fixup_pi_state_owner(uaddr2, &q, current,
 						   fshared);
-			spin_unlock(q.lock_ptr);
+			raw_spin_unlock(q.lock_ptr);
 		}
 	} else {
 		/*
@@ -2293,7 +2295,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
 		ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter, 1);
 		debug_rt_mutex_free_waiter(&rt_waiter);
 
-		spin_lock(q.lock_ptr);
+		raw_spin_lock(q.lock_ptr);
 		/*
 		 * Fixup the pi_state owner and possibly acquire the lock if we
 		 * haven't already.
@@ -2668,8 +2670,11 @@ static int __init futex_init(void)
 		futex_cmpxchg_enabled = 1;
 
 	for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
-		plist_head_init(&futex_queues[i].chain, &futex_queues[i].lock);
-		spin_lock_init(&futex_queues[i].lock);
+		/* 
+		 * FIXME: plist wants a spinlock, but the hb->lock is a raw_spinlock_t
+		 */
+		plist_head_init(&futex_queues[i].chain, NULL /*&futex_queues[i].lock*/);
+		raw_spin_lock_init(&futex_queues[i].lock);
 	}
 
 	return 0;
-- 
1.7.0.4
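For illustration only (not part of the patch), the unqueue_me() retry idiom above can be sketched in userspace C: q->lock_ptr may be switched by a waker between the read and the lock, so the pointer is re-checked after the lock is taken. Names are hypothetical, pthread mutexes stand in for the hash-bucket locks, and an acquire load stands in for the kernel's barrier():

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a futex_q: lock_ptr may be changed by another thread. */
struct fq {
	pthread_mutex_t *lock_ptr;
};

/* Returns 1 if we dequeued under the verified lock, 0 if already unqueued. */
static int unqueue_sketch(struct fq *q)
{
	pthread_mutex_t *lock_ptr;
	int ret = 0;

retry:
	/* Snapshot the pointer; acquire ordering stands in for barrier(). */
	lock_ptr = __atomic_load_n(&q->lock_ptr, __ATOMIC_ACQUIRE);
	if (lock_ptr != NULL) {
		pthread_mutex_lock(lock_ptr);
		/* The pointer can change between reading it and locking it. */
		if (lock_ptr != q->lock_ptr) {
			pthread_mutex_unlock(lock_ptr);
			goto retry;
		}
		/* ... dequeue here, under the (verified) correct lock ... */
		pthread_mutex_unlock(lock_ptr);
		ret = 1;
	}
	return ret;
}
```

The re-check after locking is what makes the snapshot safe: taking the stale lock is harmless as long as it is detected and the loop retries.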


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/4] rtmutex: avoid null dereference in WARN_ON
  2010-07-09 22:32 ` [PATCH 1/4] rtmutex: avoid null dereference in WARN_ON Darren Hart
@ 2010-07-10  0:29   ` Steven Rostedt
  2010-07-10 14:42     ` Darren Hart
  0 siblings, 1 reply; 28+ messages in thread
From: Steven Rostedt @ 2010-07-10  0:29 UTC (permalink / raw)
  To: Darren Hart
  Cc: linux-kernel, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Eric Dumazet, John Kacur, Mike Galbraith, linux-rt-users

On Fri, 2010-07-09 at 15:32 -0700, Darren Hart wrote:
> If the pi_blocked_on variable is NULL, the subsequent WARN_ON's
> will cause an OOPS. Only perform the subsequent checks if
> pi_blocked_on is valid.
> 
> Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: John Kacur <jkacur@redhat.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Mike Galbraith <efault@gmx.de>
> ---
>  kernel/rtmutex.c |    7 ++++---
>  1 files changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
> index 23dd443..baac7d9 100644
> --- a/kernel/rtmutex.c
> +++ b/kernel/rtmutex.c
> @@ -579,9 +579,10 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
>  
>  	raw_spin_lock(&pendowner->pi_lock);
>  
> -	WARN_ON(!pendowner->pi_blocked_on);
> -	WARN_ON(pendowner->pi_blocked_on != waiter);
> -	WARN_ON(pendowner->pi_blocked_on->lock != lock);
> +	if (!WARN_ON(!pendowner->pi_blocked_on)) {
> +		WARN_ON(pendowner->pi_blocked_on != waiter);

The above actually has no issue if the pi_blocked_on is NULL.

The below, well yeah.

-- Steve

> +		WARN_ON(pendowner->pi_blocked_on->lock != lock);
> +	}
>  
>  	pendowner->pi_blocked_on = NULL;
>  



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/4] rtmutex: add BUG_ON if a task attempts to block on two locks
  2010-07-09 22:32 ` [PATCH 2/4] rtmutex: add BUG_ON if a task attempts to block on two locks Darren Hart
@ 2010-07-10  0:30   ` Steven Rostedt
  2010-07-10 17:30     ` [PATCH 2/4 V2] " Darren Hart
  0 siblings, 1 reply; 28+ messages in thread
From: Steven Rostedt @ 2010-07-10  0:30 UTC (permalink / raw)
  To: Darren Hart
  Cc: linux-kernel, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Eric Dumazet, John Kacur, Mike Galbraith, linux-rt-users

On Fri, 2010-07-09 at 15:32 -0700, Darren Hart wrote:
> rtmutex proxy locking complicates the logic a bit and opens up
> the possibility for a task to wake and attempt to take another
> sleeping lock without knowing it has been enqueued on another
> lock already. Add a BUG_ON to catch this scenario early.
> 
> Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: John Kacur <jkacur@redhat.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Mike Galbraith <efault@gmx.de>
> ---
>  kernel/rtmutex.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
> index baac7d9..22f9d18 100644
> --- a/kernel/rtmutex.c
> +++ b/kernel/rtmutex.c
> @@ -459,6 +459,9 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
>  		top_waiter = rt_mutex_top_waiter(lock);
>  	plist_add(&waiter->list_entry, &lock->wait_list);
>  
> +	/* Tasks can only block on one lock at a time. */
> +	BUG_ON(task->pi_blocked_on != NULL);

WARN_ON may be better. Since it may not cause a system crash or other
huge bug if it is not true.

-- Steve

> +
>  	task->pi_blocked_on = waiter;
>  
>  	raw_spin_unlock(&task->pi_lock);



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/4 V2] futex: free_pi_state outside of hb->lock sections
  2010-07-09 22:55     ` Darren Hart
  (?)
@ 2010-07-10  0:32     ` Steven Rostedt
  2010-07-10 14:41       ` Darren Hart
  -1 siblings, 1 reply; 28+ messages in thread
From: Steven Rostedt @ 2010-07-10  0:32 UTC (permalink / raw)
  To: Darren Hart
  Cc: linux-kernel, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Eric Dumazet, John Kacur, Mike Galbraith, linux-rt-users

On Fri, 2010-07-09 at 15:55 -0700, Darren Hart wrote:
> Apologies, mangled a rebase and this patch had bits of 4/4 in it.
> Corrected below:
> 
> 
> >From f40b8e684df6ed2b1ba9167cad3922ab43f4a717 Mon Sep 17 00:00:00 2001
> From: Darren Hart <dvhltc@us.ibm.com>
> Date: Fri, 9 Jul 2010 17:07:10 -0400
> Subject: [PATCH 3/4 V2] futex: free_pi_state outside of hb->lock sections
> 
> free_pi_state() calls kfree() and might sleep. To prepare for raw hb->locks,
> get the calls to free_pi_state() out of the hb->lock() sections.
> 
> Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: John Kacur <jkacur@redhat.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Mike Galbraith <efault@gmx.de>
> ---
>  kernel/futex.c |    8 +++++---
>  1 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/futex.c b/kernel/futex.c
> index a6cec32..2cd58a2 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1339,7 +1339,6 @@ retry_private:

This is why I always add a space before labels. I have no idea what
function this was in.

>  			} else if (ret) {
>  				/* -EDEADLK */
>  				this->pi_state = NULL;
> -				free_pi_state(pi_state);

Where do we free the pi state here?

-- Steve

>  				goto out_unlock;
>  			}
>  		}
> @@ -1483,16 +1482,19 @@ retry:
>   */
>  static void unqueue_me_pi(struct futex_q *q)
>  {
> +	struct futex_pi_state *pi_state = NULL;
> +
>  	WARN_ON(plist_node_empty(&q->list));
>  	plist_del(&q->list, &q->list.plist);
>  
>  	BUG_ON(!q->pi_state);
> -	free_pi_state(q->pi_state);
> +	pi_state = q->pi_state;
>  	q->pi_state = NULL;
>  
>  	spin_unlock(q->lock_ptr);
> -
>  	drop_futex_key_refs(&q->key);
> +
> +	free_pi_state(pi_state);
>  }
>  
>  /*



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4 V2] futex: convert hash_bucket locks to raw_spinlock_t
  2010-07-09 22:57   ` [PATCH 4/4 V2] " Darren Hart
@ 2010-07-10  0:34     ` Steven Rostedt
  0 siblings, 0 replies; 28+ messages in thread
From: Steven Rostedt @ 2010-07-10  0:34 UTC (permalink / raw)
  To: Darren Hart
  Cc: linux-kernel, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Eric Dumazet, John Kacur, Mike Galbraith, linux-rt-users

On Fri, 2010-07-09 at 15:57 -0700, Darren Hart wrote:
> This version pulls in the bits mistakenly left in 3/4.
> 
> 
> >From 9f8b4faac79518f98131464c2d21a1c64fb841d2 Mon Sep 17 00:00:00 2001
> From: Darren Hart <dvhltc@us.ibm.com>
> Date: Fri, 9 Jul 2010 16:44:47 -0400
> Subject: [PATCH 4/4 V2] futex: convert hash_bucket locks to raw_spinlock_t
> 
> The requeue_pi mechanism introduced proxy locking of the rtmutex.  This creates
> a scenario where a task can wake-up, not knowing it has been enqueued on an
> rtmutex. In order to detect this, the task would have to be able to take either
> task->pi_blocked_on->lock->wait_lock and/or the hb->lock.  Unfortunately,
> without already holding one of these, the pi_blocked_on variable can change
> from NULL to valid or from valid to NULL. Therefore, the task cannot be allowed
> to take a sleeping lock after wakeup or it could end up trying to block on two
> locks, the second overwriting a valid pi_blocked_on value. This obviously
> breaks the pi mechanism.
> 
> This patch increases latency. While running the ltp pthread_cond_many test
> with which Mike reported the bug, I see double-digit hrtimer latencies
> (typically only on the first run after boot):
> 
> 	kernel: hrtimer: interrupt took 75911 ns
> 
> This might be addressed by changing the various loops in the futex code to be
> incremental, probably at an additional throughput hit. The private hash_bucket
> lists discussed in the past could reduce hb->lock contention in some scenarios.
> It should be noted that pthread_cond_many is a rather pathological case.
> 
> This also introduces problems for plists which want a spinlock_t rather
> than a raw_spinlock_t. Any thoughts on how to address this?
> 
> Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: John Kacur <jkacur@redhat.com>

Acked-by: Steven Rostedt <rostedt@goodmis.org>

-- Steve

> Cc: Mike Galbraith <efault@gmx.de>
> ---
>  kernel/futex.c |   73 ++++++++++++++++++++++++++++++--------------------------
>  1 files changed, 39 insertions(+), 34 deletions(-)



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/4 V2] futex: free_pi_state outside of hb->lock sections
  2010-07-10  0:32     ` Steven Rostedt
@ 2010-07-10 14:41       ` Darren Hart
  0 siblings, 0 replies; 28+ messages in thread
From: Darren Hart @ 2010-07-10 14:41 UTC (permalink / raw)
  To: rostedt
  Cc: linux-kernel, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Eric Dumazet, John Kacur, Mike Galbraith, linux-rt-users

On 07/09/2010 05:32 PM, Steven Rostedt wrote:
> On Fri, 2010-07-09 at 15:55 -0700, Darren Hart wrote:
>> Apologies, mangled a rebase and this patch had bits of 4/4 in it.
>> Corrected below:
>>
>>
>> > From f40b8e684df6ed2b1ba9167cad3922ab43f4a717 Mon Sep 17 00:00:00 2001
>> From: Darren Hart<dvhltc@us.ibm.com>
>> Date: Fri, 9 Jul 2010 17:07:10 -0400
>> Subject: [PATCH 3/4 V2] futex: free_pi_state outside of hb->lock sections
>>
>> free_pi_state() calls kfree() and might sleep. To prepare for raw hb->locks,
>> get the calls to free_pi_state() out of the hb->lock() sections.
>>
>> Signed-off-by: Darren Hart<dvhltc@us.ibm.com>
>> Cc: Thomas Gleixner<tglx@linutronix.de>
>> Cc: Peter Zijlstra<peterz@infradead.org>
>> Cc: Ingo Molnar<mingo@elte.hu>
>> Cc: Eric Dumazet<eric.dumazet@gmail.com>
>> Cc: John Kacur<jkacur@redhat.com>
>> Cc: Steven Rostedt<rostedt@goodmis.org>
>> Cc: Mike Galbraith<efault@gmx.de>
>> ---
>>   kernel/futex.c |    8 +++++---
>>   1 files changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/futex.c b/kernel/futex.c
>> index a6cec32..2cd58a2 100644
>> --- a/kernel/futex.c
>> +++ b/kernel/futex.c
>> @@ -1339,7 +1339,6 @@ retry_private:
>
> This is why I always add a space before labels. I have no idea what
> function this was in.
>
>>   			} else if (ret) {
>>   				/* -EDEADLK */
>>   				this->pi_state = NULL;
>> -				free_pi_state(pi_state);
>
> Where do we free the pi state here?

Near the end after goto out_unlock which falls through to the out: 
label. In this case I didn't move it, it just wasn't necessary.

out:
	if (pi_state != NULL)
		free_pi_state(pi_state);


--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/4] rtmutex: avoid null dereference in WARN_ON
  2010-07-10  0:29   ` Steven Rostedt
@ 2010-07-10 14:42     ` Darren Hart
  0 siblings, 0 replies; 28+ messages in thread
From: Darren Hart @ 2010-07-10 14:42 UTC (permalink / raw)
  To: rostedt
  Cc: linux-kernel, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Eric Dumazet, John Kacur, Mike Galbraith, linux-rt-users

On 07/09/2010 05:29 PM, Steven Rostedt wrote:
> On Fri, 2010-07-09 at 15:32 -0700, Darren Hart wrote:
>> If the pi_blocked_on variable is NULL, the subsequent WARN_ON's
>> will cause an OOPS. Only perform the subsequent checks if
>> pi_blocked_on is valid.
>>
>> Signed-off-by: Darren Hart<dvhltc@us.ibm.com>
>> Cc: Thomas Gleixner<tglx@linutronix.de>
>> Cc: Peter Zijlstra<peterz@infradead.org>
>> Cc: Ingo Molnar<mingo@elte.hu>
>> Cc: Eric Dumazet<eric.dumazet@gmail.com>
>> Cc: John Kacur<jkacur@redhat.com>
>> Cc: Steven Rostedt<rostedt@goodmis.org>
>> Cc: Mike Galbraith<efault@gmx.de>
>> ---
>>   kernel/rtmutex.c |    7 ++++---
>>   1 files changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
>> index 23dd443..baac7d9 100644
>> --- a/kernel/rtmutex.c
>> +++ b/kernel/rtmutex.c
>> @@ -579,9 +579,10 @@ static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
>>
>>   	raw_spin_lock(&pendowner->pi_lock);
>>
>> -	WARN_ON(!pendowner->pi_blocked_on);
>> -	WARN_ON(pendowner->pi_blocked_on != waiter);
>> -	WARN_ON(pendowner->pi_blocked_on->lock != lock);
>> +	if (!WARN_ON(!pendowner->pi_blocked_on)) {
>> +		WARN_ON(pendowner->pi_blocked_on != waiter);
>
> The above actually has no issue if the pi_blocked_on is NULL.

It doesn't, but it's also redundant and makes the console noisier for no 
reason. Seemed worthwhile to drop it under the if in the same go.

--
Darren


> The below, well yeah.
>
> -- Steve
>
>> +		WARN_ON(pendowner->pi_blocked_on->lock != lock);
>> +	}
>>
>>   	pendowner->pi_blocked_on = NULL;
>>
>
>


-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
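The `if (!WARN_ON(...))` nesting discussed above works because the kernel's WARN_ON() evaluates to the truth value of its condition, so the result can gate checks that would otherwise dereference NULL. A minimal userspace sketch, with a simplified warn_on macro and hypothetical names:

```c
#include <stdio.h>

/* Simplified stand-in for the kernel's WARN_ON(): warns and yields its condition. */
#define warn_on(cond) \
	({ int __c = !!(cond); if (__c) fprintf(stderr, "warn_on hit\n"); __c; })

struct lock { int id; };
struct waiter { struct lock *lock; };

/* Returns the number of warnings emitted; never dereferences a NULL blocked_on. */
static int check_blocked_on(struct waiter *blocked_on,
			    struct waiter *waiter, struct lock *lock)
{
	int warned = 0;

	if (!warn_on(blocked_on == NULL)) {
		warned += warn_on(blocked_on != waiter);
		/* Safe: blocked_on is known non-NULL in this branch. */
		warned += warn_on(blocked_on->lock != lock);
	} else {
		warned = 1;
	}
	return warned;
}
```

As noted in the reply, nesting the second check is redundant rather than unsafe; only the `->lock` dereference actually needs the NULL guard.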

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/4 V2] rtmutex: add BUG_ON if a task attempts to block on two locks
  2010-07-10  0:30   ` Steven Rostedt
@ 2010-07-10 17:30     ` Darren Hart
  0 siblings, 0 replies; 28+ messages in thread
From: Darren Hart @ 2010-07-10 17:30 UTC (permalink / raw)
  To: rostedt
  Cc: linux-kernel, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Eric Dumazet, John Kacur, Mike Galbraith, linux-rt-users

On 07/09/2010 05:30 PM, Steven Rostedt wrote:
> On Fri, 2010-07-09 at 15:32 -0700, Darren Hart wrote:
>> rtmutex proxy locking complicates the logic a bit and opens up
>> the possibility for a task to wake and attempt to take another
>> sleeping lock without knowing it has been enqueued on another
>> lock already. Add a BUG_ON to catch this scenario early.
>>
>> Signed-off-by: Darren Hart<dvhltc@us.ibm.com>
>> Cc: Thomas Gleixner<tglx@linutronix.de>
>> Cc: Peter Zijlstra<peterz@infradead.org>
>> Cc: Ingo Molnar<mingo@elte.hu>
>> Cc: Eric Dumazet<eric.dumazet@gmail.com>
>> Cc: John Kacur<jkacur@redhat.com>
>> Cc: Steven Rostedt<rostedt@goodmis.org>
>> Cc: Mike Galbraith<efault@gmx.de>
>> ---
>>   kernel/rtmutex.c |    3 +++
>>   1 files changed, 3 insertions(+), 0 deletions(-)
>>
>> diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
>> index baac7d9..22f9d18 100644
>> --- a/kernel/rtmutex.c
>> +++ b/kernel/rtmutex.c
>> @@ -459,6 +459,9 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
>>   		top_waiter = rt_mutex_top_waiter(lock);
>>   	plist_add(&waiter->list_entry,&lock->wait_list);
>>
>> +	/* Tasks can only block on one lock at a time. */
>> +	BUG_ON(task->pi_blocked_on != NULL);
> 
> WARN_ON may be better. Since it may not cause a system crash or other
> huge bug if it is not true.
> 

No objection.


>From 31c7b6c5657bcc897ddc79d7f9bb1942eb4c854a Mon Sep 17 00:00:00 2001
From: Darren Hart <dvhltc@us.ibm.com>
Date: Fri, 9 Jul 2010 17:46:05 -0400
Subject: [PATCH 2/4] rtmutex: add WARN_ON if a task attempts to block on two locks

rtmutex proxy locking complicates the logic a bit and opens up
the possibility for a task to wake and attempt to take another
sleeping lock without knowing it has been enqueued on another
lock already. Add a WARN_ON to catch this scenario early.

V2: use a WARN_ON instead of a BUG_ON per Steven's request.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Galbraith <efault@gmx.de>
---
 kernel/rtmutex.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index baac7d9..5262d0f 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -459,6 +459,9 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 		top_waiter = rt_mutex_top_waiter(lock);
 	plist_add(&waiter->list_entry, &lock->wait_list);
 
+	/* Tasks can only block on one lock at a time. */
+	WARN_ON(task->pi_blocked_on != NULL);
+
 	task->pi_blocked_on = waiter;
 
 	raw_spin_unlock(&task->pi_lock);
-- 
1.7.0.4
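The invariant the WARN_ON above enforces can be sketched in a few lines: a task has a single pi_blocked_on slot, so blocking on a second lock before the first is released would overwrite live PI state. This is a hypothetical userspace illustration, not the kernel's structures:

```c
#include <stddef.h>

struct waiter;
/* A task can record at most one lock it is blocked on. */
struct task {
	struct waiter *pi_blocked_on;
};

/* Returns 0 on success, -1 if the task is already blocked elsewhere. */
static int block_on(struct task *task, struct waiter *waiter)
{
	if (task->pi_blocked_on != NULL)
		return -1;	/* overwriting would corrupt the PI chain */
	task->pi_blocked_on = waiter;
	return 0;
}

static void unblock(struct task *task)
{
	task->pi_blocked_on = NULL;
}
```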


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t
  2010-07-09 22:33 ` [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t Darren Hart
  2010-07-09 22:57   ` [PATCH 4/4 V2] " Darren Hart
@ 2010-07-10 19:41   ` Mike Galbraith
  2010-07-11 13:33     ` Mike Galbraith
  2010-07-12 19:10     ` Darren Hart
  2010-07-12 13:05   ` Thomas Gleixner
  2 siblings, 2 replies; 28+ messages in thread
From: Mike Galbraith @ 2010-07-10 19:41 UTC (permalink / raw)
  To: Darren Hart
  Cc: linux-kernel, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Eric Dumazet, John Kacur, Steven Rostedt, linux-rt-users

On Fri, 2010-07-09 at 15:33 -0700, Darren Hart wrote:
> The requeue_pi mechanism introduced proxy locking of the rtmutex.  This creates
> a scenario where a task can wake-up, not knowing it has been enqueued on an
> rtmutex. In order to detect this, the task would have to be able to take either
> task->pi_blocked_on->lock->wait_lock and/or the hb->lock.  Unfortunately,
> without already holding one of these, the pi_blocked_on variable can change
> from NULL to valid or from valid to NULL. Therefore, the task cannot be allowed
> to take a sleeping lock after wakeup or it could end up trying to block on two
> locks, the second overwriting a valid pi_blocked_on value. This obviously
> breaks the pi mechanism.

copy/paste offline query/reply at Darren's request..

On Sat, 2010-07-10 at 10:26 -0700, Darren Hart wrote:
On 07/09/2010 09:32 PM, Mike Galbraith wrote:
> > On Fri, 2010-07-09 at 13:05 -0700, Darren Hart wrote:
> >
> >> The core of the problem is that the proxy_lock blocks a task on a lock
> >> the task knows nothing about. So when it wakes up inside of
> >> futex_wait_requeue_pi, it immediately tries to block on hb->lock to
> >> check why it woke up. This has the potential to block the task on two
> >> locks (thus overwriting the pi_blocked_on). Any attempt preventing this
> >> involves a lock, and ultimiately the hb->lock. The only solution I see
> >> is to make the hb->locks raw locks (thanks to Steven Rostedt for
> >> original idea and batting this around with me in IRC).
> >
> > Hm, so wakee _was_ munging his own state after all.
> >
> > Out of curiosity, what's wrong with holding his pi_lock across the
> > wakeup?  He can _try_ to block, but can't until pi state is stable.
> >
> > I presume there's a big fat gotcha that's just not obvious to futex
> > locking newbie :)
> 
> It'll take me more time than I have right now to be positive, but:
> 
> 
> 	rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING);
> 
> 	raw_spin_unlock(&current->pi_lock);
> 
> Your patch moved the unlock before the set_owner. I _believe_ this can 
> break the pi boosting logic - current is the owner until it calls 
> set_owner to be pendowner. I haven't traced this entire path yet, but 
> that's my gut feel.

I _think_ it should be fine to do that.  Setting an owner seems to only
require holding the wait_lock.  I could easily be missing subtleties
though.  Looking around, I didn't see any reason not to unlock the
owner's pi_lock after twiddling pi_waiters (and still don't, but...).
 
> However, you're idea has merit as we have to take our ->pi_lock before 
> we can block on the hb->lock (inside task_blocks_on_rt_mutex()).
> 
> If we can't move the unlock above before set_owner, then we may need a:
> 
> retry:
> cur->lock()
> top_waiter = get_top_waiter()
> cur->unlock()
> 
> double_lock(cur, topwaiter)
> if top_waiter != get_top_waiter()
> 	double_unlock(cur, topwaiter)
> 	goto retry
> 
> Not ideal, but I think I prefer that to making all the hb locks raw.
> 
> You dropped the CC list for some reason, probably a good idea to send 
> this back out in response to my raw lock patch (4/4) - your question and 
> my reply. This is crazy stuff, no harm in putting the question out there.
> 
> I'll take a closer look at this when I can, if not tonight, Monday morning.

	-Mike
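Darren's retry pseudocode quoted above (peek at the top waiter under one lock, drop it, take both locks, then re-validate) can be sketched as follows; this is a hypothetical userspace rendering with pthread mutexes standing in for the raw pi_locks, and it assumes a single consistent lock order:

```c
#include <pthread.h>
#include <stddef.h>

struct task {
	pthread_mutex_t pi_lock;
	struct task *top_waiter;	/* protected by pi_lock */
};

/*
 * Lock cur and its current top waiter, retrying if the top waiter
 * changed while neither lock was held.  On success both pi_locks are
 * held; a real implementation would need a deadlock-safe lock order.
 */
static struct task *lock_cur_and_top_waiter(struct task *cur)
{
	struct task *top;

retry:
	pthread_mutex_lock(&cur->pi_lock);
	top = cur->top_waiter;
	pthread_mutex_unlock(&cur->pi_lock);

	if (top == NULL)
		return NULL;

	/* double_lock(cur, top_waiter) */
	pthread_mutex_lock(&cur->pi_lock);
	pthread_mutex_lock(&top->pi_lock);
	if (top != cur->top_waiter) {
		pthread_mutex_unlock(&top->pi_lock);
		pthread_mutex_unlock(&cur->pi_lock);
		goto retry;
	}
	/* Both locks held and top is still the top waiter. */
	return top;
}
```

The caller is responsible for releasing both pi_locks once it has finished with the stable snapshot.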


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t
  2010-07-10 19:41   ` [PATCH 4/4] " Mike Galbraith
@ 2010-07-11 13:33     ` Mike Galbraith
  2010-07-11 15:10       ` Darren Hart
  2010-07-12 11:45       ` Steven Rostedt
  2010-07-12 19:10     ` Darren Hart
  1 sibling, 2 replies; 28+ messages in thread
From: Mike Galbraith @ 2010-07-11 13:33 UTC (permalink / raw)
  To: Darren Hart
  Cc: linux-kernel, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Eric Dumazet, John Kacur, Steven Rostedt, linux-rt-users

On Sat, 2010-07-10 at 21:41 +0200, Mike Galbraith wrote:
> On Fri, 2010-07-09 at 15:33 -0700, Darren Hart wrote:

> > If we can't move the unlock above before set_owner, then we may need a:
> > 
> > retry:
> > cur->lock()
> > top_waiter = get_top_waiter()
> > cur->unlock()
> > 
> > double_lock(cur, topwaiter)
> > if top_waiter != get_top_waiter()
> > 	double_unlock(cur, topwaiter)
> > 	goto retry
> > 
> > Not ideal, but I think I prefer that to making all the hb locks raw.

Another option: only scratch the itchy spot.

futex: non-blocking synchronization point for futex_wait_requeue_pi() and futex_requeue().

Problem analysis by Darren Hart:
The requeue_pi mechanism introduced proxy locking of the rtmutex.  This creates
a scenario where a task can wake-up, not knowing it has been enqueued on an
rtmutex. In order to detect this, the task would have to be able to take either
task->pi_blocked_on->lock->wait_lock and/or the hb->lock.  Unfortunately,
without already holding one of these, the pi_blocked_on variable can change
from NULL to valid or from valid to NULL. Therefore, the task cannot be allowed
to take a sleeping lock after wakeup or it could end up trying to block on two
locks, the second overwriting a valid pi_blocked_on value. This obviously
breaks the pi mechanism.

Rather than convert the hb->lock to a raw spinlock, do so only in the spot where
blocking cannot be allowed, i.e. before we know that lock handoff has completed.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>

diff --git a/kernel/futex.c b/kernel/futex.c
index a6cec32..ef489f3 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2255,7 +2255,14 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
 	/* Queue the futex_q, drop the hb lock, wait for wakeup. */
 	futex_wait_queue_me(hb, &q, to);
 
-	spin_lock(&hb->lock);
+	/*
+	 * Non-blocking synchronization point with futex_requeue().
+	 *
+	 * We dare not block here because this will alter PI state, possibly
+	 * before our waker finishes modifying same in wakeup_next_waiter().
+	 */
+	while (!spin_trylock(&hb->lock))
+		cpu_relax();
 	ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
 	spin_unlock(&hb->lock);
 	if (ret)
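The trylock loop in the patch above acquires the lock without ever sleeping, so the waiter never blocks on a second lock (and never touches pi_blocked_on) while the waker may still be mid-handoff. A hypothetical userspace analogue of the same busy-wait, with sched_yield() standing in for the kernel's cpu_relax():

```c
#include <pthread.h>
#include <sched.h>

/*
 * Spin with trylock instead of blocking: the caller holds the lock on
 * return but was never enqueued as a blocked waiter while spinning.
 */
static void busy_lock(pthread_mutex_t *lock)
{
	while (pthread_mutex_trylock(lock) != 0)
		sched_yield();	/* stand-in for cpu_relax() */
}
```

The trade-off is the one noted in the thread: the spinning waiter burns CPU while the hb->lock is contended, but only tasks on the requeue_pi wakeup path pay that cost.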



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t
  2010-07-11 13:33     ` Mike Galbraith
@ 2010-07-11 15:10       ` Darren Hart
  2010-07-12 11:45       ` Steven Rostedt
  1 sibling, 0 replies; 28+ messages in thread
From: Darren Hart @ 2010-07-11 15:10 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-kernel, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Eric Dumazet, John Kacur, Steven Rostedt, linux-rt-users

On 07/11/2010 06:33 AM, Mike Galbraith wrote:
> On Sat, 2010-07-10 at 21:41 +0200, Mike Galbraith wrote:
>> On Fri, 2010-07-09 at 15:33 -0700, Darren Hart wrote:
>
>>> If we can't move the unlock above before set_owner, then we may need a:
>>>
>>> retry:
>>> cur->lock()
>>> top_waiter = get_top_waiter()
>>> cur->unlock()
>>>
>>> double_lock(cur, topwaiter)
>>> if top_waiter != get_top_waiter()
>>> 	double_unlock(cur, topwaiter)
>>> 	goto retry
>>>
>>> Not ideal, but I think I prefer that to making all the hb locks raw.
>
> Another option: only scratch the itchy spot.
>
> futex: non-blocking synchronization point for futex_wait_requeue_pi() and futex_requeue().
>
> Problem analysis by Darren Hart:
> The requeue_pi mechanism introduced proxy locking of the rtmutex.  This creates
> a scenario where a task can wake-up, not knowing it has been enqueued on an
> rtmutex. In order to detect this, the task would have to be able to take either
> task->pi_blocked_on->lock->wait_lock and/or the hb->lock.  Unfortunately,
> without already holding one of these, the pi_blocked_on variable can change
> from NULL to valid or from valid to NULL. Therefore, the task cannot be allowed
> to take a sleeping lock after wakeup or it could end up trying to block on two
> locks, the second overwriting a valid pi_blocked_on value. This obviously
> breaks the pi mechanism.
>
> Rather than convert the hb->lock to a raw spinlock, do so only in the spot where
> blocking cannot be allowed, i.e. before we know that lock handoff has completed.

I like it. I especially like the change is only evident if you are using 
the code path that introduced the problem in the first place. If you're 
doing a lot of requeue_pi operations, then the waking waiters have an 
advantage over new pending waiters or other tasks with futex keyed on 
the same hash-bucket... but that seems acceptable to me.

I'd like to confirm that holding the pendowner->pi-lock across the 
wakeup in wakeup_next_waiter() isn't feasible first. If it can work, I 
think the impact would be lower. I'll have a look tomorrow.

Nice work Mike.

--
Darren

> Signed-off-by: Mike Galbraith<efault@gmx.de>
> Cc: Darren Hart<dvhltc@us.ibm.com>
> Cc: Thomas Gleixner<tglx@linutronix.de>
> Cc: Peter Zijlstra<peterz@infradead.org>
> Cc: Ingo Molnar<mingo@elte.hu>
> Cc: Eric Dumazet<eric.dumazet@gmail.com>
> Cc: John Kacur<jkacur@redhat.com>
> Cc: Steven Rostedt<rostedt@goodmis.org>
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index a6cec32..ef489f3 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -2255,7 +2255,14 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
>   	/* Queue the futex_q, drop the hb lock, wait for wakeup. */
>   	futex_wait_queue_me(hb,&q, to);
>
> -	spin_lock(&hb->lock);
> +	/*
> +	 * Non-blocking synchronization point with futex_requeue().
> +	 *
> +	 * We dare not block here because this will alter PI state, possibly
> +	 * before our waker finishes modifying same in wakeup_next_waiter().
> +	 */
> +	while (!spin_trylock(&hb->lock))
> +		cpu_relax();
>   	ret = handle_early_requeue_pi_wakeup(hb,&q,&key2, to);
>   	spin_unlock(&hb->lock);
>   	if (ret)
>
>


-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/4] futex: free_pi_state outside of hb->lock sections
  2010-07-09 22:32 ` [PATCH 3/4] futex: free_pi_state outside of hb->lock sections Darren Hart
  2010-07-09 22:55     ` Darren Hart
@ 2010-07-12 10:35   ` Thomas Gleixner
  2010-07-12 10:46     ` Steven Rostedt
  1 sibling, 1 reply; 28+ messages in thread
From: Thomas Gleixner @ 2010-07-12 10:35 UTC (permalink / raw)
  To: Darren Hart
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Eric Dumazet,
	John Kacur, Steven Rostedt, Mike Galbraith, linux-rt-users

On Fri, 9 Jul 2010, Darren Hart wrote:

> free_pi_state() calls kfree() and might sleep. To prepare for raw hb->locks,
> get the calls to free_pi_state() out of the hb->lock() sections.
> 
> Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: John Kacur <jkacur@redhat.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Mike Galbraith <efault@gmx.de>
> ---
>  kernel/futex.c |   14 ++++++++------
>  1 files changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/futex.c b/kernel/futex.c
> index a6cec32..b217972 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -114,7 +114,7 @@ struct futex_q {
>  	struct plist_node list;
>  
>  	struct task_struct *task;
> -	spinlock_t *lock_ptr;
> +	raw_spinlock_t *lock_ptr;

 How is this related to the changelog ?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/4] futex: free_pi_state outside of hb->lock sections
  2010-07-12 10:35   ` [PATCH 3/4] " Thomas Gleixner
@ 2010-07-12 10:46     ` Steven Rostedt
  0 siblings, 0 replies; 28+ messages in thread
From: Steven Rostedt @ 2010-07-12 10:46 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Darren Hart, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Eric Dumazet, John Kacur, Mike Galbraith, linux-rt-users

On Mon, 2010-07-12 at 12:35 +0200, Thomas Gleixner wrote:
> On Fri, 9 Jul 2010, Darren Hart wrote:
> 
> > free_pi_state() calls kfree() and might sleep. To prepare for raw hb->locks,
> > get the calls to free_pi_state() out of the hb->lock() sections.
> > 
> > Signed-off-by: Darren Hart <dvhltc@us.ibm.com>

> >  	struct task_struct *task;
> > -	spinlock_t *lock_ptr;
> > +	raw_spinlock_t *lock_ptr;
> 
>  How is this related to the changelog ?

Read the reply to himself (V2) ;-)

-- Steve



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t
  2010-07-11 13:33     ` Mike Galbraith
  2010-07-11 15:10       ` Darren Hart
@ 2010-07-12 11:45       ` Steven Rostedt
  2010-07-12 12:12         ` Mike Galbraith
  1 sibling, 1 reply; 28+ messages in thread
From: Steven Rostedt @ 2010-07-12 11:45 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Darren Hart, linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Eric Dumazet, John Kacur, linux-rt-users

On Sun, 2010-07-11 at 15:33 +0200, Mike Galbraith wrote:
> On Sat, 2010-07-10 at 21:41 +0200, Mike Galbraith wrote:

> diff --git a/kernel/futex.c b/kernel/futex.c
> index a6cec32..ef489f3 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -2255,7 +2255,14 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
>  	/* Queue the futex_q, drop the hb lock, wait for wakeup. */
>  	futex_wait_queue_me(hb, &q, to);
>  
> -	spin_lock(&hb->lock);
> +	/*
> +	 * Non-blocking synchronization point with futex_requeue().
> +	 *
> +	 * We dare not block here because this will alter PI state, possibly
> +	 * before our waker finishes modifying same in wakeup_next_waiter().
> +	 */
> +	while (!spin_trylock(&hb->lock))
> +		cpu_relax();

I agree that this would work. But I wonder if this should have an:

#ifdef PREEMPT_RT
[...]
#else
	spin_lock(&hb->lock);
#endif

around it. Or encapsulate this lock in a macro that does the same thing
(just to keep the actual code cleaner)

-- Steve

>  	ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
>  	spin_unlock(&hb->lock);
>  	if (ret)
> 
> 



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t
  2010-07-12 11:45       ` Steven Rostedt
@ 2010-07-12 12:12         ` Mike Galbraith
  0 siblings, 0 replies; 28+ messages in thread
From: Mike Galbraith @ 2010-07-12 12:12 UTC (permalink / raw)
  To: rostedt
  Cc: Darren Hart, linux-kernel, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Eric Dumazet, John Kacur, linux-rt-users

On Mon, 2010-07-12 at 07:45 -0400, Steven Rostedt wrote:
> On Sun, 2010-07-11 at 15:33 +0200, Mike Galbraith wrote:
> > On Sat, 2010-07-10 at 21:41 +0200, Mike Galbraith wrote:
> 
> > diff --git a/kernel/futex.c b/kernel/futex.c
> > index a6cec32..ef489f3 100644
> > --- a/kernel/futex.c
> > +++ b/kernel/futex.c
> > @@ -2255,7 +2255,14 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
> >  	/* Queue the futex_q, drop the hb lock, wait for wakeup. */
> >  	futex_wait_queue_me(hb, &q, to);
> >  
> > -	spin_lock(&hb->lock);
> > +	/*
> > +	 * Non-blocking synchronization point with futex_requeue().
> > +	 *
> > +	 * We dare not block here because this will alter PI state, possibly
> > +	 * before our waker finishes modifying same in wakeup_next_waiter().
> > +	 */
> > +	while (!spin_trylock(&hb->lock))
> > +		cpu_relax();
> 
> I agree that this would work. But I wonder if this should have an:
> 
> #ifdef PREEMPT_RT
> [...]
> #else
> 	spin_lock(&hb->lock);
> #endif
> 
> around it. Or encapsulate this lock in a macro that does the same thing
> (just to keep the actual code cleaner)

Yeah, it should.  I'll wait to see what Darren/others say about holding
the wakee's pi_lock across wakeup to plug it.  If he submits something
along that line, I can bin this.

	-Mike


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t
  2010-07-09 22:33 ` [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t Darren Hart
  2010-07-09 22:57   ` [PATCH 4/4 V2] " Darren Hart
  2010-07-10 19:41   ` [PATCH 4/4] " Mike Galbraith
@ 2010-07-12 13:05   ` Thomas Gleixner
  2 siblings, 0 replies; 28+ messages in thread
From: Thomas Gleixner @ 2010-07-12 13:05 UTC (permalink / raw)
  To: Darren Hart
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Eric Dumazet,
	John Kacur, Steven Rostedt, Mike Galbraith, linux-rt-users

On Fri, 9 Jul 2010, Darren Hart wrote:

> The requeue_pi mechanism introduced proxy locking of the rtmutex.  This creates
> a scenario where a task can wake up, not knowing it has been enqueued on an
> rtmutex. In order to detect this, the task would have to be able to take either
> task->pi_blocked_on->lock->wait_lock and/or the hb->lock.  Unfortunately,
> without already holding one of these, the pi_blocked_on variable can change
> from NULL to valid or from valid to NULL. Therefore, the task cannot be allowed
> to take a sleeping lock after wakeup or it could end up trying to block on two
> locks, the second overwriting a valid pi_blocked_on value. This obviously
> breaks the pi mechanism.
> 
> This patch increases latency. While running the ltp pthread_cond_many test
> with which Mike reported the bug, I see double digit hrtimer latencies
> (typically only on the first run after boot):
> 
> 	kernel: hrtimer: interrupt took 75911 ns

Eewwww. There must be some more intelligent and less intrusive way to
detect this.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t
  2010-07-10 19:41   ` [PATCH 4/4] " Mike Galbraith
  2010-07-11 13:33     ` Mike Galbraith
@ 2010-07-12 19:10     ` Darren Hart
  2010-07-12 20:40       ` Thomas Gleixner
  1 sibling, 1 reply; 28+ messages in thread
From: Darren Hart @ 2010-07-12 19:10 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-kernel, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Eric Dumazet, John Kacur, Steven Rostedt, linux-rt-users

On 07/10/2010 12:41 PM, Mike Galbraith wrote:
> On Fri, 2010-07-09 at 15:33 -0700, Darren Hart wrote:
>> The requeue_pi mechanism introduced proxy locking of the rtmutex.  This creates
>> a scenario where a task can wake up, not knowing it has been enqueued on an
>> rtmutex. In order to detect this, the task would have to be able to take either
>> task->pi_blocked_on->lock->wait_lock and/or the hb->lock.  Unfortunately,
>> without already holding one of these, the pi_blocked_on variable can change
>> from NULL to valid or from valid to NULL. Therefore, the task cannot be allowed
>> to take a sleeping lock after wakeup or it could end up trying to block on two
>> locks, the second overwriting a valid pi_blocked_on value. This obviously
>> breaks the pi mechanism.
>
> copy/paste offline query/reply at Darren's request..
>
> On Sat, 2010-07-10 at 10:26 -0700, Darren Hart wrote:
> On 07/09/2010 09:32 PM, Mike Galbraith wrote:
>>> On Fri, 2010-07-09 at 13:05 -0700, Darren Hart wrote:
>>>
>>>> The core of the problem is that the proxy_lock blocks a task on a lock
>>>> the task knows nothing about. So when it wakes up inside of
>>>> futex_wait_requeue_pi, it immediately tries to block on hb->lock to
>>>> check why it woke up. This has the potential to block the task on two
>>>> locks (thus overwriting the pi_blocked_on). Any attempt preventing this
>>>> involves a lock, and ultimately the hb->lock. The only solution I see
>>>> is to make the hb->locks raw locks (thanks to Steven Rostedt for
>>>> original idea and batting this around with me in IRC).
>>>
>>> Hm, so wakee _was_ munging his own state after all.
>>>
>>> Out of curiosity, what's wrong with holding his pi_lock across the
>>> wakeup?  He can _try_ to block, but can't until pi state is stable.
>>>
>>> I presume there's a big fat gotcha that's just not obvious to futex
>>> locking newbie :)

Nor to some of us that have been engrossed in futexes for the last 
couple years! I discussed the pi_lock across the wakeup issue with 
Thomas. While this fixes the problem for this particular failure case, 
it doesn't protect against:

<tglx> assume the following:
<tglx> t1 is on the condvar
<tglx> t2 does the requeue dance and t1 is now blocked on the outer futex
<tglx> t3 takes hb->lock for a futex in the same bucket
<tglx> t2 wakes due to signal/timeout
<tglx> t2 blocks on hb->lock

You are likely to have not hit the above scenario because you only had 
one condvar, so the hash_buckets were not heavily shared and you weren't 
likely to hit:

<tglx> t3 takes hb->lock for a futex in the same bucket


I'm going to roll up a patchset with your (Mike) spin_trylock patch and 
run it through some tests. I'd still prefer a way to detect early wakeup 
without having to grab the hb->lock(), but I haven't found it yet.

+	while (!spin_trylock(&hb->lock))
+		cpu_relax();
  	ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
  	spin_unlock(&hb->lock);

Thanks,

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t
  2010-07-12 19:10     ` Darren Hart
@ 2010-07-12 20:40       ` Thomas Gleixner
  2010-07-12 20:43         ` Thomas Gleixner
  2010-07-13  3:09         ` Mike Galbraith
  0 siblings, 2 replies; 28+ messages in thread
From: Thomas Gleixner @ 2010-07-12 20:40 UTC (permalink / raw)
  To: Darren Hart
  Cc: Mike Galbraith, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Eric Dumazet, John Kacur, Steven Rostedt, linux-rt-users

On Mon, 12 Jul 2010, Darren Hart wrote:
> On 07/10/2010 12:41 PM, Mike Galbraith wrote:
> > On Fri, 2010-07-09 at 15:33 -0700, Darren Hart wrote:
> > > > Out of curiosity, what's wrong with holding his pi_lock across the
> > > > wakeup?  He can _try_ to block, but can't until pi state is stable.
> > > > 
> > > > I presume there's a big fat gotcha that's just not obvious to futex
> > > > locking newbie :)
> 
> Nor to some of us that have been engrossed in futexes for the last couple
> years! I discussed the pi_lock across the wakeup issue with Thomas. While this
> fixes the problem for this particular failure case, it doesn't protect
> against:
> 
> <tglx> assume the following:
> <tglx> t1 is on the condvar
> <tglx> t2 does the requeue dance and t1 is now blocked on the outer futex
> <tglx> t3 takes hb->lock for a futex in the same bucket
> <tglx> t2 wakes due to signal/timeout
> <tglx> t2 blocks on hb->lock
> 
> You are likely to have not hit the above scenario because you only had one
> condvar, so the hash_buckets were not heavily shared and you weren't likely to
> hit:
> 
> <tglx> t3 takes hb->lock for a futex in the same bucket
> 
> 
> I'm going to roll up a patchset with your (Mike) spin_trylock patch and run it
> through some tests. I'd still prefer a way to detect early wakeup without
> having to grab the hb->lock(), but I haven't found it yet.
> 
> +	while (!spin_trylock(&hb->lock))
> +		cpu_relax();
>  	ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
>  	spin_unlock(&hb->lock);

And this is nasty as it will create unbounded priority inversion :(

We discussed another solution on IRC in meantime:

in futex_wait_requeue_pi()

   futex_wait_queue_me(hb, &q, to);

   raw_spin_lock(current->pi_lock);
   if (current->pi_blocked_on) {
      /*
       * We know that we can only be blocked on the outer futex
       * so we can skip the early wakeup check
       */
       raw_spin_unlock(current->pi_lock);
       ret = 0;
   } else {
      current->pi_blocked_on = PI_WAKEUP_INPROGRESS;
      raw_spin_unlock(current->pi_lock);

      spin_lock(&hb->lock);
      ret = handle_early_requeue_pi_wakeup();
      ....
      spin_unlock(&hb->lock);
   }

Now in the rtmutex magic we need in task_blocks_on_rt_mutex():

   raw_spin_lock(task->pi_lock);

   /*
    * Add big fat comment why this is only relevant to futex
    * requeue_pi
    */

   if (task != current && task->pi_blocked_on == PI_WAKEUP_INPROGRESS) {
      raw_spin_unlock(task->pi_lock);

      /*
       * Returning 0 here is fine. The requeue code is just going to
       * move the futex_q to the other bucket, but that'll be fixed
       * up in handle_early_requeue_pi_wakeup()
       */

      return 0;
   }

Thanks,

	tglx

    

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t
  2010-07-12 20:40       ` Thomas Gleixner
@ 2010-07-12 20:43         ` Thomas Gleixner
  2010-07-13  3:09         ` Mike Galbraith
  1 sibling, 0 replies; 28+ messages in thread
From: Thomas Gleixner @ 2010-07-12 20:43 UTC (permalink / raw)
  To: Darren Hart
  Cc: Mike Galbraith, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Eric Dumazet, John Kacur, Steven Rostedt, linux-rt-users



On Mon, 12 Jul 2010, Thomas Gleixner wrote:

> On Mon, 12 Jul 2010, Darren Hart wrote:
> > On 07/10/2010 12:41 PM, Mike Galbraith wrote:
> > > On Fri, 2010-07-09 at 15:33 -0700, Darren Hart wrote:
> > > > > Out of curiosity, what's wrong with holding his pi_lock across the
> > > > > wakeup?  He can _try_ to block, but can't until pi state is stable.
> > > > > 
> > > > > I presume there's a big fat gotcha that's just not obvious to futex
> > > > > locking newbie :)
> > 
> > Nor to some of us that have been engrossed in futexes for the last couple
> > years! I discussed the pi_lock across the wakeup issue with Thomas. While this
> > fixes the problem for this particular failure case, it doesn't protect
> > against:
> > 
> > <tglx> assume the following:
> > <tglx> t1 is on the condvar
> > <tglx> t2 does the requeue dance and t1 is now blocked on the outer futex
> > <tglx> t3 takes hb->lock for a futex in the same bucket
> > <tglx> t2 wakes due to signal/timeout
> > <tglx> t2 blocks on hb->lock
> > 
> > You are likely to have not hit the above scenario because you only had one
> > condvar, so the hash_buckets were not heavily shared and you weren't likely to
> > hit:
> > 
> > <tglx> t3 takes hb->lock for a futex in the same bucket
> > 
> > 
> > I'm going to roll up a patchset with your (Mike) spin_trylock patch and run it
> > through some tests. I'd still prefer a way to detect early wakeup without
> > having to grab the hb->lock(), but I haven't found it yet.
> > 
> > +	while (!spin_trylock(&hb->lock))
> > +		cpu_relax();
> >  	ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
> >  	spin_unlock(&hb->lock);
> 
> And this is nasty as it will create unbounded priority inversion :(
> 
> We discussed another solution on IRC in meantime:
> 
> in futex_wait_requeue_pi()
> 
>    futex_wait_queue_me(hb, &q, to);
> 
>    raw_spin_lock(current->pi_lock);
>    if (current->pi_blocked_on) {
>       /*
>        * We know that we can only be blocked on the outer futex
>        * so we can skip the early wakeup check
>        */
>        raw_spin_unlock(current->pi_lock);
>        ret = 0;
>    } else {
>       current->pi_blocked_on = PI_WAKEUP_INPROGRESS;
>       raw_spin_unlock(current->pi_lock);
> 
>       spin_lock(&hb->lock);
>       ret = handle_early_requeue_pi_wakeup();
>       ....
>       spin_unlock(&hb->lock);
>    }
> 
> Now in the rtmutex magic we need in task_blocks_on_rt_mutex():
> 
>    raw_spin_lock(task->pi_lock);
> 
>    /*
>     * Add big fat comment why this is only relevant to futex
>     * requeue_pi
>     */
> 
>    if (task != current && task->pi_blocked_on == PI_WAKEUP_INPROGRESS) {
>       raw_spin_unlock(task->pi_lock);
> 
>       /*
>        * Returning 0 here is fine. The requeue code is just going to
>        * move the futex_q to the other bucket, but that'll be fixed
>        * up in handle_early_requeue_pi_wakeup()
>        */
> 
>       return 0;

We might also return a sensible error code here and just remove the
waiter from all queues, which needs to be handled in
handle_early_requeue_pi_wakeup() after acquiring hb->lock then.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t
  2010-07-12 20:40       ` Thomas Gleixner
  2010-07-12 20:43         ` Thomas Gleixner
@ 2010-07-13  3:09         ` Mike Galbraith
  2010-07-13  7:12           ` Darren Hart
  1 sibling, 1 reply; 28+ messages in thread
From: Mike Galbraith @ 2010-07-13  3:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Darren Hart, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Eric Dumazet, John Kacur, Steven Rostedt, linux-rt-users

On Mon, 2010-07-12 at 22:40 +0200, Thomas Gleixner wrote:
> On Mon, 12 Jul 2010, Darren Hart wrote:
> > On 07/10/2010 12:41 PM, Mike Galbraith wrote:
> > > On Fri, 2010-07-09 at 15:33 -0700, Darren Hart wrote:
> > > > > Out of curiosity, what's wrong with holding his pi_lock across the
> > > > > wakeup?  He can _try_ to block, but can't until pi state is stable.
> > > > > 
> > > > > I presume there's a big fat gotcha that's just not obvious to futex
> > > > > locking newbie :)
> > 
> > Nor to some of us that have been engrossed in futexes for the last couple
> > years! I discussed the pi_lock across the wakeup issue with Thomas. While this
> > fixes the problem for this particular failure case, it doesn't protect
> > against:
> > 
> > <tglx> assume the following:
> > <tglx> t1 is on the condvar
> > <tglx> t2 does the requeue dance and t1 is now blocked on the outer futex
> > <tglx> t3 takes hb->lock for a futex in the same bucket
> > <tglx> t2 wakes due to signal/timeout
> > <tglx> t2 blocks on hb->lock
> > 
> > You are likely to have not hit the above scenario because you only had one
> > condvar, so the hash_buckets were not heavily shared and you weren't likely to
> > hit:
> > 
> > <tglx> t3 takes hb->lock for a futex in the same bucket
> > 
> > 
> > I'm going to roll up a patchset with your (Mike) spin_trylock patch and run it
> > through some tests. I'd still prefer a way to detect early wakeup without
> > having to grab the hb->lock(), but I haven't found it yet.
> > 
> > +	while (!spin_trylock(&hb->lock))
> > +		cpu_relax();
> >  	ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
> >  	spin_unlock(&hb->lock);
> 
> And this is nasty as it will create unbounded priority inversion :(

Oh ma gawd, _it's a train_ :>
 
> We discussed another solution on IRC in meantime:
> 
> in futex_wait_requeue_pi()
> 
>    futex_wait_queue_me(hb, &q, to);
> 
>    raw_spin_lock(current->pi_lock);
>    if (current->pi_blocked_on) {
>       /*
>        * We know that we can only be blocked on the outer futex
>        * so we can skip the early wakeup check
>        */
>        raw_spin_unlock(current->pi_lock);
>        ret = 0;
>    } else {
>       current->pi_blocked_on = PI_WAKEUP_INPROGRESS;
>       raw_spin_unlock(current->pi_lock);
> 
>       spin_lock(&hb->lock);
>       ret = handle_early_requeue_pi_wakeup();
>       ....
>       spin_unlock(&hb->lock);
>    }
> 
> Now in the rtmutex magic we need in task_blocks_on_rt_mutex():
> 
>    raw_spin_lock(task->pi_lock);
> 
>    /*
>     * Add big fat comment why this is only relevant to futex
>     * requeue_pi
>     */
> 
>    if (task != current && task->pi_blocked_on == PI_WAKEUP_INPROGRESS) {
>       raw_spin_unlock(task->pi_lock);
> 
>       /*
>        * Returning 0 here is fine. The requeue code is just going to
>        * move the futex_q to the other bucket, but that'll be fixed
>        * up in handle_early_requeue_pi_wakeup()
>        */
> 
>       return 0;
>    }
> 
> Thanks,
> 
> 	tglx
> 
>     


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t
  2010-07-13  3:09         ` Mike Galbraith
@ 2010-07-13  7:12           ` Darren Hart
  0 siblings, 0 replies; 28+ messages in thread
From: Darren Hart @ 2010-07-13  7:12 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Thomas Gleixner, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Eric Dumazet, John Kacur, Steven Rostedt, linux-rt-users

On 07/12/2010 08:09 PM, Mike Galbraith wrote:
> On Mon, 2010-07-12 at 22:40 +0200, Thomas Gleixner wrote:
>> On Mon, 12 Jul 2010, Darren Hart wrote:
>>> On 07/10/2010 12:41 PM, Mike Galbraith wrote:
>>>> On Fri, 2010-07-09 at 15:33 -0700, Darren Hart wrote:
>>>>>> Out of curiosity, what's wrong with holding his pi_lock across the
>>>>>> wakeup?  He can _try_ to block, but can't until pi state is stable.
>>>>>>
>>>>>> I presume there's a big fat gotcha that's just not obvious to futex
>>>>>> locking newbie :)
>>>
>>> Nor to some of us that have been engrossed in futexes for the last couple
>>> years! I discussed the pi_lock across the wakeup issue with Thomas. While this
>>> fixes the problem for this particular failure case, it doesn't protect
>>> against:
>>>
>>> <tglx>  assume the following:
>>> <tglx>  t1 is on the condvar
>>> <tglx>  t2 does the requeue dance and t1 is now blocked on the outer futex
>>> <tglx>  t3 takes hb->lock for a futex in the same bucket
>>> <tglx>  t2 wakes due to signal/timeout
>>> <tglx>  t2 blocks on hb->lock
>>>
>>> You are likely to have not hit the above scenario because you only had one
>>> condvar, so the hash_buckets were not heavily shared and you weren't likely to
>>> hit:
>>>
>>> <tglx>  t3 takes hb->lock for a futex in the same bucket
>>>
>>>
>>> I'm going to roll up a patchset with your (Mike) spin_trylock patch and run it
>>> through some tests. I'd still prefer a way to detect early wakeup without
>>> having to grab the hb->lock(), but I haven't found it yet.
>>>
>>> +	while (!spin_trylock(&hb->lock))
>>> +		cpu_relax();
>>>   	ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
>>>   	spin_unlock(&hb->lock);
>>
>> And this is nasty as it will create unbounded priority inversion :(
>
> Oh ma gawd, _it's a train_ :>

Seriously.

I have a fix. Cleaning it up as we speak, still hope to send out tonight.

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2010-07-13  7:12 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-07-09 22:32 [PATCH 0/4][RT] futex: fix tasks blocking on two rt_mutex locks Darren Hart
2010-07-09 22:32 ` [PATCH 1/4] rtmutex: avoid null derefence in WARN_ON Darren Hart
2010-07-10  0:29   ` Steven Rostedt
2010-07-10 14:42     ` Darren Hart
2010-07-09 22:32 ` [PATCH 2/4] rtmutex: add BUG_ON if a task attempts to block on two locks Darren Hart
2010-07-10  0:30   ` Steven Rostedt
2010-07-10 17:30     ` [PATCH 2/4 V2] " Darren Hart
2010-07-09 22:32 ` [PATCH 3/4] futex: free_pi_state outside of hb->lock sections Darren Hart
2010-07-09 22:55   ` [PATCH 3/4 V2] " Darren Hart
2010-07-09 22:55     ` Darren Hart
2010-07-10  0:32     ` Steven Rostedt
2010-07-10 14:41       ` Darren Hart
2010-07-12 10:35   ` [PATCH 3/4] " Thomas Gleixner
2010-07-12 10:46     ` Steven Rostedt
2010-07-09 22:33 ` [PATCH 4/4] futex: convert hash_bucket locks to raw_spinlock_t Darren Hart
2010-07-09 22:57   ` [PATCH 4/4 V2] " Darren Hart
2010-07-10  0:34     ` Steven Rostedt
2010-07-10 19:41   ` [PATCH 4/4] " Mike Galbraith
2010-07-11 13:33     ` Mike Galbraith
2010-07-11 15:10       ` Darren Hart
2010-07-12 11:45       ` Steven Rostedt
2010-07-12 12:12         ` Mike Galbraith
2010-07-12 19:10     ` Darren Hart
2010-07-12 20:40       ` Thomas Gleixner
2010-07-12 20:43         ` Thomas Gleixner
2010-07-13  3:09         ` Mike Galbraith
2010-07-13  7:12           ` Darren Hart
2010-07-12 13:05   ` Thomas Gleixner
