* [RFC][PATCH] fixup pi_state in futex_requeue on lock steal
@ 2009-08-06  0:01 Darren Hart
  2009-08-06 16:37 ` Peter Zijlstra
  2009-08-06 23:07 ` Darren Hart
  0 siblings, 2 replies; 8+ messages in thread
From: Darren Hart @ 2009-08-06  0:01 UTC (permalink / raw)
  To: lkml, linux-rt-users
  Cc: Thomas Gleixner, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	John Kacur, Dinakar Guniguntala, John Stultz

NOT FOR INCLUSION

Fixup the uval and pi_state owner in futex_requeue(requeue_pi=1) in the event
of a lock steal or owner death.  I had hoped to leave it up to the new owner to
fix up the userspace value since we can't really handle a fault here
gracefully.  This should be safe as the lock is contended and should force all
userspace attempts to lock or unlock into the kernel where they'll block on the
hb lock.  However, when I don't update the uaddr, I hit the WARN_ON(pid !=
pi_state->owner->pid) as expected, and the userspace testcase deadlocks.

I need to try to better understand what's happening to hang userspace.  In the
meantime I thought I'd share what I'm working with atm.  This is a complete HACK
and is ugly, non-modular, etc etc.  However, it currently works.  It would explode
in a most impressive fashion should we happen to fault.  So the remaining questions
are:

o Why does userspace deadlock if we leave the uval updating to the new owner
  waking up in futex_wait_requeue_pi()?

o If we have to handle a fault in futex_requeue(), how can we best clean up the
  proxy lock acquisition and get things back into a sane state?  We faulted, so
  determinism is out the window anyway; we just need to recover gracefully.

Thoughts welcome.

Darren Hart

Index: 2.6.31-rc4-rt1/kernel/futex.c
===================================================================
--- 2.6.31-rc4-rt1.orig/kernel/futex.c	2009-08-05 13:27:11.000000000 -0700
+++ 2.6.31-rc4-rt1/kernel/futex.c	2009-08-05 16:19:08.000000000 -0700
@@ -1174,7 +1174,7 @@ static int futex_requeue(u32 __user *uad
 	struct plist_head *head1;
 	struct futex_q *this, *next;
 	struct task_struct *wake_list = &init_task;
-	u32 curval2;
+	u32 curval, curval2, newval;
 
 	if (requeue_pi) {
 		/*
@@ -1329,6 +1329,45 @@ retry_private:
 			if (ret == 1) {
 				/* We got the lock. */
 				requeue_pi_wake_futex(this, &key2, hb2);
+
+				while (1) {
+					newval = (curval2 & FUTEX_OWNER_DIED) |
+						 task_pid_vnr(this->task) |
+						 FUTEX_WAITERS;
+
+					curval = cmpxchg_futex_value_locked(uaddr2, curval2, newval);
+
+					if (curval == -EFAULT)
+						goto handle_fault;
+					if (curval == curval2)
+						break;
+					curval2 = curval;
+				}
+				/*
+				 * Fixup the pi_state owner to ensure pi
+				 * boosting happens in the event of a race
+				 * between us dropping the lock and the new
+				 * owner waking and grabbing the lock.  Since
+				 * the lock is contended, we leave it to the new
+				 * owner to fix up the userspace value since we
+				 * can't handle a fault here gracefully.
+				 */
+				/*
+				 * FIXME: we duplicate this in 3 places now.  If
+				 * we keep this solution, let's refactor this
+				 * pi_state reparenting thing.
+				 */
+				BUG_ON(!pi_state->owner);
+				atomic_spin_lock_irq(&pi_state->owner->pi_lock);
+				WARN_ON(list_empty(&pi_state->list));
+				list_del_init(&pi_state->list);
+				atomic_spin_unlock_irq(&pi_state->owner->pi_lock);
+
+				atomic_spin_lock_irq(&this->task->pi_lock);
+				WARN_ON(!list_empty(&pi_state->list));
+				list_add(&pi_state->list, &this->task->pi_state_list);
+				pi_state->owner = this->task;
+				atomic_spin_unlock_irq(&this->task->pi_lock);
 				continue;
 			} else if (ret) {
 				/* -EDEADLK */
@@ -1363,6 +1402,17 @@ out:
 	if (pi_state != NULL)
 		free_pi_state(pi_state);
 	return ret ? ret : task_count;
+
+handle_fault:
+	/*
+	 * We can't handle a fault, so instead, give up the lock and requeue the
+	 * would-have-been-new-owner instead.
+	 */
+	printk("ERROR: faulted during futex_requeue loop, trying to fixup pi_state owner\n");
+	/* FIXME: write the unlock and requeue code ... */
+	ret = -EFAULT;
+	goto out_unlock;
+
 }
 
 /* The key must be already stored in q->key. */
-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC][PATCH] fixup pi_state in futex_requeue on lock steal
  2009-08-06  0:01 [RFC][PATCH] fixup pi_state in futex_requeue on lock steal Darren Hart
@ 2009-08-06 16:37 ` Peter Zijlstra
  2009-08-06 22:46   ` Darren Hart
  2009-08-06 23:07 ` Darren Hart
  1 sibling, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2009-08-06 16:37 UTC (permalink / raw)
  To: Darren Hart
  Cc: lkml,
	linux-rt-users, Thomas Gleixner, Steven Rostedt, Ingo Molnar,
	John Kacur, Dinakar Guniguntala, John Stultz

On Wed, 2009-08-05 at 17:01 -0700, Darren Hart wrote:
> NOT FOR INCLUSION
> 
> Fixup the uval and pi_state owner in futex_requeue(requeue_pi=1) in the event
> of a lock steal or owner death.  I had hoped to leave it up to the new owner to
> fix up the userspace value since we can't really handle a fault here
> gracefully.  This should be safe as the lock is contended and should force all
> userspace attempts to lock or unlock into the kernel where they'll block on the
> hb lock.  However, when I don't update the uaddr, I hit the WARN_ON(pid !=
> pi_state->owner->pid) as expected, and the userspace testcase deadlocks.
> 
> I need to try to better understand what's happening to hang userspace.  In the
> meantime I thought I'd share what I'm working with atm.  This is a complete HACK
> and is ugly, non-modular, etc etc.  However, it currently works.  It would explode
> in a most impressive fashion should we happen to fault.  So the remaining questions
> are:
> 
> o Why does userspace deadlock if we leave the uval updating to the new owner
>   waking up in futex_wait_requeue_pi()?
> 
> o If we have to handle a fault in futex_requeue(), how can we best clean up the
>   proxy lock acquisition and get things back into a sane state?  We faulted, so
>   determinism is out the window anyway; we just need to recover gracefully.


Do you have a trace of the thing going down?

Tglx and I usually use sched_switch and a few trace_printk()s sprinkled
around, the typical one would be in sys_futex, printing the futex cmd
and arg.

OK, so run me through this one more time.

A condvar has two futexes, an inner and an outer. The inner futex is
always locked and the waiting threads are stacked on that.

Then on signal/broadcast, we lock the outer lock and requeue all the
blocked tasks from the inner to the outer, then we release the outer
lock and let them rip.

Since we're seeing lock steals, I'm thinking the outer lock isn't taken
when we're doing the requeue?

Anyway, during the requeue we lock-steal because the owner isn't running
yet and we iterate a higher prio task in the requeue loop?

This leaves the outer lock's futex field messed up because it points to
the wrong TID.

After we finish the requeue loop, we unlock the HBs.


So far so good?


Now, normally the waking thread will find itself owner and will check
the futex variable and fix her up -- while holding the HB lock.

However, in case the outer lock gets contended again, we can get
interrupted between requeue and wakeup/fixup and observe this messed up
futex value, which is causing this WARN to trigger.


So where do we deadlock, after this all goes down? Do we perhaps lookup
the wrong pi_state using that wrong TID?



Since it's only the outer futex's value that matters, right? Can't we pin
that using get_user_pages() before we take the HB lock and go into the
requeue loop? That way we're sure to be able to change it without
faulting.
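
Something like the below, roughly -- untested and illustrative only, with
error handling and the put_page() on the exit path left out; 'pinned' is
just a made-up local:

	struct page *pinned;

	/*
	 * Pin the page backing uaddr2 up front so the later fixup of the
	 * uval shouldn't fault while we hold the hb locks.
	 */
	if (get_user_pages_fast((unsigned long)uaddr2 & PAGE_MASK, 1, 1,
				&pinned) != 1)
		return -EFAULT;

	/* ... take the hb locks, run the requeue loop, fix up uaddr2 ... */

	put_page(pinned);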



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC][PATCH] fixup pi_state in futex_requeue on lock steal
  2009-08-06 16:37 ` Peter Zijlstra
@ 2009-08-06 22:46   ` Darren Hart
  2009-08-06 22:53     ` Steven Rostedt
  2009-08-08 15:27     ` Ingo Molnar
  0 siblings, 2 replies; 8+ messages in thread
From: Darren Hart @ 2009-08-06 22:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: lkml,
	linux-rt-users, Thomas Gleixner, Steven Rostedt, Ingo Molnar,
	John Kacur, Dinakar Guniguntala, John Stultz

Peter Zijlstra wrote:
> On Wed, 2009-08-05 at 17:01 -0700, Darren Hart wrote:
>> NOT FOR INCLUSION
>>
>> Fixup the uval and pi_state owner in futex_requeue(requeue_pi=1) in the event
>> of a lock steal or owner death.  I had hoped to leave it up to the new owner to
>> fix up the userspace value since we can't really handle a fault here
>> gracefully.  This should be safe as the lock is contended and should force all
>> userspace attempts to lock or unlock into the kernel where they'll block on the
>> hb lock.  However, when I don't update the uaddr, I hit the WARN_ON(pid !=
>> pi_state->owner->pid) as expected, and the userspace testcase deadlocks.
>>
>> I need to try to better understand what's happening to hang userspace.  In the
>> meantime I thought I'd share what I'm working with atm.  This is a complete HACK
>> and is ugly, non-modular, etc etc.  However, it currently works.  It would explode
>> in a most impressive fashion should we happen to fault.  So the remaining questions
>> are:
>>
>> o Why does userspace deadlock if we leave the uval updating to the new owner
>>   waking up in futex_wait_requeue_pi()?
>>
>> o If we have to handle a fault in futex_requeue(), how can we best clean up the
>>   proxy lock acquisition and get things back into a sane state?  We faulted, so
>>   determinism is out the window anyway; we just need to recover gracefully.
> 
> 
> Do you have a trace of the thing going down?

I finally did get a trace... but learned something in the process. 
Elaborating below.

> 
> Tglx and I usually use sched_switch and a few trace_printk()s sprinkled
> around, the typical one would be in sys_futex, printing the futex cmd
> and arg.
> 
> OK, so run me through this one more time.
> 
> A condvar has two futexes, an inner and an outer. The inner futex is
> always locked and the waiting threads are stacked on that.

3 actually:

cond->data->futex (the waitqueue)
cond->data->lock (the lock protecting the internal data)
outer mutex (the pthread_mutex)

> 
> Then on signal/broadcast, we lock the outer lock and requeue all the
> blocked tasks from the inner to the outer, then we release the outer
> lock and let them rip.

Yes - and in requeue_pi with a PI mutex we only let 1 rip, and requeue 
the rest, rather than wake them all as the old implementation for PI 
mutexes did.
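
Roughly, the broadcast side looks like this (illustrative sketch only, not
the glibc code; the names cond_futex/mutex_futex/cond_val are made up, and
it assumes kernel headers that define FUTEX_CMP_REQUEUE_PI, i.e. 2.6.31+):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <limits.h>

/*
 * Wake at most one waiter and requeue the rest from the condvar futex
 * onto the PI mutex futex.  cond_val is the condvar futex value the
 * caller sampled; the kernel re-checks it (the "CMP" part) so a racing
 * waiter isn't requeued against a stale condvar state.
 */
static long broadcast_requeue_pi(unsigned int *cond_futex,
				 unsigned int *mutex_futex,
				 unsigned int cond_val)
{
	return syscall(SYS_futex, cond_futex, FUTEX_CMP_REQUEUE_PI,
		       1,		/* nr_wake: requeue_pi requires exactly 1 */
		       INT_MAX,		/* nr_requeue: move everyone else */
		       mutex_futex, cond_val);
}

The waiters block with FUTEX_WAIT_REQUEUE_PI on cond_futex, naming
mutex_futex as the target, which is what makes the proxy lock acquisition
(and hence the steal) in futex_requeue() possible in the first place.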

> 
> Since we're seeing lock steals, I'm thinking the outer lock isn't taken
> when we're doing the requeue?

Correct.  Unfortunately this is "valid" usage.

> 
> Anyway, during the requeue we lock-steal because the owner isn't running
> yet and we iterate a higher prio task in the requeue loop?

I believe so.

> 
> This leaves the outer lock's futex field messed up because it points to
> the wrong TID.

The futex uval isn't messed up, it just still holds the value of the 
previous owner's TID (not the expected owner we're stealing from).  I 
now believe that this is proper behavior.
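
(For reference, the PI futex word is just the owner TID plus two state
bits, so the previous owner's TID is all the stealing path can leave
there; these are the definitions from linux/futex.h:)

#define FUTEX_WAITERS		0x80000000
#define FUTEX_OWNER_DIED	0x40000000
#define FUTEX_TID_MASK		0x3fffffff
/* uval = owner TID | FUTEX_WAITERS (contended) | FUTEX_OWNER_DIED (robust exit) */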

> 
> After we finish the requeue loop, we unlock the HBs.
> 
> 
> So far so good?

Yup.

> 
> 
> Now, normally the waking thread will find itself owner and will check
> the futex variable and fix her up -- while holding the HB lock.

Correcto.

> 
> However, in case the outer lock gets contended again, we can get
> interrupted between requeue and wakeup/fixup and observe this messed up
> futex value, which is causing this WARN to trigger.

This is where I was mistaken.  I had seen the WARN_ON(pid != 
pi_state->owner->pid) in lookup_pi_state() while working on the previous 
2 patches I sent to the list.  The one which updates the lock_ptr of the 
futex_q to match that of the pi_state should fix this.  What happened 
before was that we would grab the wrong hb lock, so while we were fixing 
up the pi_state and uval in the woken thread, a contending thread would 
read those values while holding the correct hb lock.  That race is fixed 
with the "[PATCH 1/2] Update woken requeued futex_q lock_ptr" patch.

> 
> So where do we deadlock, after this all goes down? Do we perhaps lookup
> the wrong pi_state using that wrong TID?
> 

We only deadlocked while I was (wrongly) trying to update pi_state owner 
from the requeue thread.  Deadlocks don't occur in my testing with only 
patches 1 and 2.

[PATCH 1/2] Update woken requeued futex_q lock_ptr
[PATCH 2/2][RT] Avoid deadlock in rt_mutex_start_proxy_lock()


> Since it's only the outer futex's value that matters, right? Can't we pin
> that using get_user_pages() before we take the HB lock and go into the
> requeue loop? That way we're sure to be able to change it without
> faulting.

I now don't believe we have to do this.  In fact, futex_lock_pi() 
exhibits a similar "race window" (simplified below):

         /*
          * Block on the PI mutex:
          */
         ret = rt_mutex_timed_lock(&q.pi_state->pi_mutex, to, 1);

         [RACE WINDOW ]   (not really, see below)

         spin_lock(q.lock_ptr);
         /*
          * Fixup the pi_state owner and possibly acquire the lock if we
          * haven't already.
          */
         res = fixup_owner(uaddr, fshared, &q, !ret);

Note that the rt_mutex is acquired while the q.lock_ptr (hb->lock) is not 
held (since we can sleep).  This is FINE and not a race.  Let's look at 
what happens if another task tries to get the lock during that time:

futex_lock_pi
	futex_lock_pi_atomic
		lookup_pi_state
			
At this point we have the pi_state.  Its owner field will point to the 
previous owner, not the task that is currently acquiring it.  But the 
rt_mutex itself knows who owns it, so proper boosting should still 
occur.  Once the new owner completes the pi_state update, the pi_state 
will be removed from the old owner's pi_state_list and added to the new 
owner's.  Since the futex uval shows it's owned in both cases, the new 
contender is still forced into the kernel to block on the rt_mutex. 
Since we update the uval and then the pi_state->owner, we are sure to be 
able to access the rt_mutex via the old uval so long as we hold the 
hb->lock.
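
To make the contender side concrete, here is a rough sketch of the
userspace fast path (illustrative only, not the glibc code; error handling
and the FUTEX_WAITERS bit are ignored, and pi_lock/my_tid are made-up
names):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stddef.h>

/*
 * The cmpxchg only succeeds if the futex word is 0 (unowned).  A stale
 * but non-zero TID still fails it, so the contender drops into the
 * kernel and blocks on the rt_mutex, and PI boosting works off the
 * rt_mutex owner rather than the (possibly stale) uval.
 */
static void pi_lock(unsigned int *uaddr, unsigned int my_tid)
{
	if (__sync_bool_compare_and_swap(uaddr, 0, my_tid))
		return;			/* uncontended: we now own it */

	/* Owned, possibly with a stale TID: block on the rt_mutex. */
	syscall(SYS_futex, uaddr, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
}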

So, I think we're fine with respect to the pi_state ownership!  In fact 
I finally managed to catch the lock steal in the requeue loop in my 
tracing, and everything worked fine.  Going to go rerun a bunch more 
tests and see if I hit any other issues; if I do, I suspect they are 
unrelated to this.

Thanks for the help in thinking this through.

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC][PATCH] fixup pi_state in futex_requeue on lock steal
  2009-08-06 22:46   ` Darren Hart
@ 2009-08-06 22:53     ` Steven Rostedt
  2009-08-07  0:36       ` Darren Hart
  2009-08-08 15:27     ` Ingo Molnar
  1 sibling, 1 reply; 8+ messages in thread
From: Steven Rostedt @ 2009-08-06 22:53 UTC (permalink / raw)
  To: Darren Hart
  Cc: Peter Zijlstra, lkml,
	linux-rt-users, Thomas Gleixner, Ingo Molnar, John Kacur,
	Dinakar Guniguntala, John Stultz


On Thu, 6 Aug 2009, Darren Hart wrote:

> Peter Zijlstra wrote:
> > On Wed, 2009-08-05 at 17:01 -0700, Darren Hart wrote:
> > > NOT FOR INCLUSION
> > > 
> > > Fixup the uval and pi_state owner in futex_requeue(requeue_pi=1) in the
> > > event of a lock steal or owner death.  I had hoped to leave it up to the
> > > new owner to fix up the userspace value since we can't really handle a
> > > fault here gracefully.  This should be safe as the lock is contended and
> > > should force all userspace attempts to lock or unlock into the kernel
> > > where they'll block on the hb lock.  However, when I don't update the
> > > uaddr, I hit the WARN_ON(pid != pi_state->owner->pid) as expected, and
> > > the userspace testcase deadlocks.
> > > 
> > > I need to try to better understand what's happening to hang userspace.
> > > In the meantime I thought I'd share what I'm working with atm.  This is
> > > a complete HACK and is ugly, non-modular, etc etc.  However, it
> > > currently works.  It would explode in a most impressive fashion should
> > > we happen to fault.  So the remaining questions are:
> > > 
> > > o Why does userspace deadlock if we leave the uval updating to the new
> > >   owner waking up in futex_wait_requeue_pi()?
> > > 
> > > o If we have to handle a fault in futex_requeue(), how can we best clean
> > >   up the proxy lock acquisition and get things back into a sane state?
> > >   We faulted, so determinism is out the window anyway; we just need to
> > >   recover gracefully.
> > 
> > 
> > Do you have a trace of the thing going down?
> 
> I finally did get a trace... but learned something in the process. Elaborating
> below.

I'm assuming you used ftrace as your tracing infrastructure?

-- Steve


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC][PATCH] fixup pi_state in futex_requeue on lock steal
  2009-08-06  0:01 [RFC][PATCH] fixup pi_state in futex_requeue on lock steal Darren Hart
  2009-08-06 16:37 ` Peter Zijlstra
@ 2009-08-06 23:07 ` Darren Hart
  1 sibling, 0 replies; 8+ messages in thread
From: Darren Hart @ 2009-08-06 23:07 UTC (permalink / raw)
  To: lkml, linux-rt-users
  Cc: Thomas Gleixner, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	John Kacur, Dinakar Guniguntala, John Stultz

Darren Hart wrote:
> NOT FOR INCLUSION

Please disregard this patch, it was ill-conceived.

Thanks,

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC][PATCH] fixup pi_state in futex_requeue on lock steal
  2009-08-06 22:53     ` Steven Rostedt
@ 2009-08-07  0:36       ` Darren Hart
  0 siblings, 0 replies; 8+ messages in thread
From: Darren Hart @ 2009-08-07  0:36 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, lkml,
	linux-rt-users, Thomas Gleixner, Ingo Molnar, John Kacur,
	Dinakar Guniguntala, John Stultz

Steven Rostedt wrote:
> On Thu, 6 Aug 2009, Darren Hart wrote:
> 
>> Peter Zijlstra wrote:
>>> On Wed, 2009-08-05 at 17:01 -0700, Darren Hart wrote:
>>>> NOT FOR INCLUSION
>>>>
>>>> Fixup the uval and pi_state owner in futex_requeue(requeue_pi=1) in the
>>>> event
>>>> of a lock steal or owner died.  I had hoped to leave it up to the new
>>>> owner to
>>>> fix up the userspace value since we can't really handle a fault here
>>>> gracefully.  This should be safe as the lock is contended and should force
>>>> all
>>>> userspace attempts to lock or unlock into the kernel where they'll block
>>>> on the
>>>> hb lock.  However, when I don't update the uaddr, I hit the WARN_ON(pid !=
>>>> pi_state->owner->pid) as expected, and the userspace testcase deadlocks.
>>>>
>>>> I need to try and better understand what's happening to hang userspace.
>>>> In the
>>>> meantime I thought I'd share what I'm working with atm.  This is a
>>>> complete HACK
>>>> and is ugly, non-modular, etc etc.  However, it currently works.  It would
>>>> explode
>>>> in a most impressive fashion should we happen to fault.  So the remaining
>>>> questions
>>>> are:
>>>>
>>>> o Why does userspace deadlock if we leave the uval updating to the new
>>>> owner
>>>>   waking up in futex_wait_requeue_pi()?
>>>>
>>>> o If we have to handle a fault in futex_requeue(), how can we best cleanup
>>>> the
>>>>   proxy lock acquisition and get things back into a sane state.  We
>>>> faulted, so
>>>>   determinism is out the window anyway, we just need to recover
>>>> gracefully.
>>>
>>> Do you have a trace of the thing going down?
>> I finally did get a trace... but learned something in the process. Elaborating
>> below.
> 
> I'm assuming you used ftrace as your tracing infrastructure?

:-)  Yes, ftrace with the nop tracer and some trace_printk() calls worked 
nicely.  I turned on the trace from userspace and stopped the trace from 
within the kernel using tracing_off().  But since the bug didn't actually 
exist, I never hit tracing_off().  The trace_printk()s, however, nicely 
highlighted the "race window" that wasn't actually a window ;-)
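
In case it helps anyone else, the pattern was roughly the following
(illustrative, not the exact markers I used; 'expected_pid' is a stand-in
for whatever you expect the pi_state owner to be at that point):

	trace_printk("requeue: task %d taking uaddr2, pi_state owner %d\n",
		     task_pid_vnr(this->task),
		     task_pid_vnr(pi_state->owner));

	/* Freeze the ring buffer the moment the suspicious case shows up. */
	if (task_pid_vnr(pi_state->owner) != expected_pid)
		tracing_off();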

Thanks for the ring buffer fixes btw.

--
Darren


-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC][PATCH] fixup pi_state in futex_requeue on lock steal
  2009-08-06 22:46   ` Darren Hart
  2009-08-06 22:53     ` Steven Rostedt
@ 2009-08-08 15:27     ` Ingo Molnar
  2009-08-08 23:11       ` Darren Hart
  1 sibling, 1 reply; 8+ messages in thread
From: Ingo Molnar @ 2009-08-08 15:27 UTC (permalink / raw)
  To: Darren Hart
  Cc: Peter Zijlstra, lkml,
	linux-rt-users, Thomas Gleixner, Steven Rostedt, John Kacur,
	Dinakar Guniguntala, John Stultz


* Darren Hart <dvhltc@us.ibm.com> wrote:

> So, I think we're fine with respect to the pi_state ownership!  In 
> fact I finally managed to catch the lock steal in the requeue loop 
> in my tracing, and everything worked fine.  Going to go rerun a 
> bunch more tests and see if I hit any other issues; if I do, I 
> suspect they are unrelated to this.
>
> Thanks for the help in thinking this through.

i've got these queued up:

 00235fe: futex: Update woken requeued futex_q lock_ptr
 1bbf208: rtmutex: Avoid deadlock in rt_mutex_start_proxy_lock()

should i drop them?

	Ingo

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC][PATCH] fixup pi_state in futex_requeue on lock steal
  2009-08-08 15:27     ` Ingo Molnar
@ 2009-08-08 23:11       ` Darren Hart
  0 siblings, 0 replies; 8+ messages in thread
From: Darren Hart @ 2009-08-08 23:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, lkml,
	linux-rt-users, Thomas Gleixner, Steven Rostedt, John Kacur,
	Dinakar Guniguntala, John Stultz

Ingo Molnar wrote:
> * Darren Hart <dvhltc@us.ibm.com> wrote:
> 
>> So, I think we're fine with respect to the pi_state ownership!  In 
>> fact I finally managed to catch the lock steal in the requeue loop 
>> in my tracing, and everything worked fine.  Going to go rerun a 
>> bunch more tests and see if I hit any other issues; if I do, I 
>> suspect they are unrelated to this.
>>
>> Thanks for the help in thinking this through.
> 
> i've got these queued up:
> 
>  00235fe: futex: Update woken requeued futex_q lock_ptr
>  1bbf208: rtmutex: Avoid deadlock in rt_mutex_start_proxy_lock()
> 
> should i drop them?

My apologies for the churn on these, Ingo.  My comments above only apply 
to this RFC thread; the other patches are still needed.  You should 
include the following patches:

tip/core/urgent
===============
rtmutex: Avoid deadlock in rt_mutex_start_proxy_lock()
	1bbf20835c4e088667a090ce6523a0f70b62dc76

[PATCH] futex: Update futex_q lock_ptr on requeue proxy lock (resend)
	from Aug 7, 2009
	The one you committed is older; I resent it on Aug 7 with an
	improved patch description, commentary, and DEBUG_PI_LIST
	ifdefs.  Please drop 00235fe25eba6d3a13f3349b2e3a2d94b699a414
	and pull in the new one.

[PATCH V2] futex: Fix handling of bad requeue syscall pairing
	from Aug 7, 2009


tip/rt/something
================
[PATCH 2/2][RT] Avoid deadlock in rt_mutex_start_proxy_lock()
	from Aug 5, 2009
	This one uses the new atomic_spinlock calls for the RT tree.  I
	suspect you may instead choose to make that one-line change
	yourself.

Thanks,

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

^ permalink raw reply	[flat|nested] 8+ messages in thread

Thread overview: 8+ messages
2009-08-06  0:01 [RFC][PATCH] fixup pi_state in futex_requeue on lock steal Darren Hart
2009-08-06 16:37 ` Peter Zijlstra
2009-08-06 22:46   ` Darren Hart
2009-08-06 22:53     ` Steven Rostedt
2009-08-07  0:36       ` Darren Hart
2009-08-08 15:27     ` Ingo Molnar
2009-08-08 23:11       ` Darren Hart
2009-08-06 23:07 ` Darren Hart
