There is a weird state in the futex_unlock_pi() path when it interleaves
with a concurrent futex_lock_pi() at the point where it drops hb->lock.

In this case, it can happen that the rt_mutex wait_list and the futex_q
disagree on pending waiters; in particular, rt_mutex will find no pending
waiters where futex_q thinks there are. In that case the rt_mutex unlock
code cannot assign an owner.

What the current code does in this case is use the futex_q waiter that
got us here; however, rt_mutex_timed_futex_lock() has already failed at
that point, which leaves things in a weird state and results in many
headaches in fixup_owner().

Simplify all this by changing wake_futex_pi() to return -EAGAIN when this
situation occurs. This then gives the futex_lock_pi() code the
opportunity to continue, and the retried futex_unlock_pi() will now
observe a coherent state.

Signed-off-by: Peter Zijlstra (Intel)
---
 kernel/futex.c | 49 +++++++++++++------------------------------------
 1 file changed, 13 insertions(+), 36 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1402,12 +1402,18 @@ static int wake_futex_pi(u32 __user *uad
 	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
 
 	/*
-	 * It is possible that the next waiter (the one that brought
-	 * top_waiter owner to the kernel) timed out and is no longer
-	 * waiting on the lock.
+	 * When we interleave with futex_lock_pi() where it does
+	 * rt_mutex_timed_futex_lock(), we might observe the @top_waiter
+	 * futex_q waiter, but the rt_mutex's wait_list can be empty (either
+	 * still, or again, depending on which side we land).
+	 *
+	 * When this happens, give up our locks and try again, giving the
+	 * futex_lock_pi() instance time to complete and unqueue_me().
 	 */
-	if (!new_owner)
-		new_owner = top_waiter->task;
+	if (!new_owner) {
+		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+		return -EAGAIN;
+	}
 
 	/*
 	 * We pass it to the next owner. The WAITERS bit is always
@@ -2324,7 +2330,6 @@ static long futex_wait_restart(struct re
  */
static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked)
 {
-	struct task_struct *owner;
 	int ret = 0;
 
 	if (locked) {
@@ -2338,43 +2343,15 @@ static int fixup_owner(u32 __user *uaddr
 	}
 
 	/*
-	 * Catch the rare case, where the lock was released when we were on the
-	 * way back before we locked the hash bucket.
-	 */
-	if (q->pi_state->owner == current) {
-		/*
-		 * Try to get the rt_mutex now. This might fail as some other
-		 * task acquired the rt_mutex after we removed ourself from the
-		 * rt_mutex waiters list.
-		 */
-		if (rt_mutex_futex_trylock(&q->pi_state->pi_mutex)) {
-			locked = 1;
-			goto out;
-		}
-
-		/*
-		 * pi_state is incorrect, some other task did a lock steal and
-		 * we returned due to timeout or signal without taking the
-		 * rt_mutex. Too late.
-		 */
-		raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock);
-		owner = rt_mutex_owner(&q->pi_state->pi_mutex);
-		if (!owner)
-			owner = rt_mutex_next_owner(&q->pi_state->pi_mutex);
-		raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock);
-		ret = fixup_pi_state_owner(uaddr, q, owner);
-		goto out;
-	}
-
-	/*
 	 * Paranoia check. If we did not take the lock, then we should not be
 	 * the owner of the rt_mutex.
 	 */
-	if (rt_mutex_owner(&q->pi_state->pi_mutex) == current)
+	if (rt_mutex_owner(&q->pi_state->pi_mutex) == current) {
 		printk(KERN_ERR "fixup_owner: ret = %d pi-mutex: %p "
 				"pi-state %p\n", ret,
 				q->pi_state->pi_mutex.owner,
 				q->pi_state->owner);
+	}
 
 out:
	return ret ? ret : locked;
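
[Not part of the patch; illustration only.] The diff above covers the
wake_futex_pi() side; for readers wondering who consumes the new -EAGAIN,
the unlock path is expected to drop its locks and retry. Below is a
minimal sketch of that retry pattern, assuming the usual futex_unlock_pi()
skeleton (hb->lock taken, top waiter found via futex_top_waiter()); the
wake_futex_pi() argument list is truncated in the hunk header here, so
the signature shown, and the key/error handling, are simplified
assumptions, not the literal kernel source:

	/*
	 * Sketch only: simplified retry loop on the unlock side once
	 * wake_futex_pi() can fail with -EAGAIN. Key refcounting and
	 * error paths other than -EAGAIN are elided.
	 */
retry:
	spin_lock(&hb->lock);
	top_waiter = futex_top_waiter(hb, &key);
	if (top_waiter) {
		ret = wake_futex_pi(uaddr, uval, top_waiter, hb);
		if (ret == -EAGAIN) {
			/*
			 * The rt_mutex wait_list and the futex_q disagreed;
			 * the interleaving futex_lock_pi() still has to
			 * unqueue_me(). Drop hb->lock so it can make
			 * progress, then retry: the retried unlock will
			 * observe a coherent state.
			 */
			spin_unlock(&hb->lock);
			goto retry;
		}
	}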