From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751795AbbBPUQB (ORCPT ); Mon, 16 Feb 2015 15:16:01 -0500 Received: from mx1.redhat.com ([209.132.183.28]:34853 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751765AbbBPUP7 (ORCPT ); Mon, 16 Feb 2015 15:15:59 -0500 Date: Mon, 16 Feb 2015 21:13:36 +0100 From: Oleg Nesterov To: Darren Hart , Peter Zijlstra , Thomas Gleixner Cc: Jerome Marchand , Larry Woodman , Mateusz Guzik , linux-kernel@vger.kernel.org, Alexey Kuznetsov , Pavel Emelyanov Subject: [PATCH 1/1] futex: don't spin waiting for PF_EXITING -> PF_EXITPIDONE transition Message-ID: <20150216201336.GB18246@redhat.com> References: <20150202140515.GA26398@redhat.com> <20150216201313.GA18246@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150216201313.GA18246@redhat.com> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org It is absolutely not clear why attach_to_pi_owner() returns -EAGAIN which triggers "retry" if the lock owner is PF_EXITING but not PF_EXITPIDONE. This burns CPU for no reason and this can even livelock if the rt_task() caller preempts a PF_EXITING owner. Remove the PF_EXITING check altogether. We do not care if it is exiting, all we need to know is can we rely on exit_pi_state_list() or not. So we also need to set PF_EXITPIDONE before we flush ->pi_state_list and call exit_pi_state_list() unconditionally. Perhaps we can add the fast-path list_empty() check in mm_release() back, but lets fix the problem first. Besides, in theory this check is already not correct, at least it should be list_empty_careful() to avoid the race with free_pi_state() in progress. Signed-off-by: Oleg Nesterov --- kernel/exit.c | 22 +--------------------- kernel/fork.c | 3 +-- kernel/futex.c | 40 ++++++++++------------------------------ 3 files changed, 12 insertions(+), 53 deletions(-) diff --git a/kernel/exit.c b/kernel/exit.c index 6806c55..bc969ed 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -683,27 +683,13 @@ void do_exit(long code) */ if (unlikely(tsk->flags & PF_EXITING)) { pr_alert("Fixing recursive fault but reboot is needed!\n"); - /* - * We can do this unlocked here. The futex code uses - * this flag just to verify whether the pi state - * cleanup has been done or not. In the worst case it - * loops once more. We pretend that the cleanup was - * done as there is no way to return. Either the - * OWNER_DIED bit is set by now or we push the blocked - * task into the wait for ever nirwana as well. - */ + /* Avoid the new additions to ->pi_state_list at least */ tsk->flags |= PF_EXITPIDONE; set_current_state(TASK_UNINTERRUPTIBLE); schedule(); } exit_signals(tsk); /* sets PF_EXITING */ - /* - * tsk->flags are checked in the futex code to protect against - * an exiting task cleaning up the robust pi futexes. - */ - smp_mb(); - raw_spin_unlock_wait(&tsk->pi_lock); if (unlikely(in_atomic())) pr_info("note: %s[%d] exited with preempt_count %d\n", @@ -779,12 +765,6 @@ void do_exit(long code) * Make sure we are holding no locks: */ debug_check_no_locks_held(); - /* - * We can do this unlocked here. The futex code uses this flag - * just to verify whether the pi state cleanup has been done - * or not. In the worst case it loops once more. - */ - tsk->flags |= PF_EXITPIDONE; if (tsk->io_context) exit_io_context(tsk); diff --git a/kernel/fork.c b/kernel/fork.c index 4dc2dda..ec3208e 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -803,8 +803,7 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm) tsk->compat_robust_list = NULL; } #endif - if (unlikely(!list_empty(&tsk->pi_state_list))) - exit_pi_state_list(tsk); + exit_pi_state_list(tsk); #endif uprobe_free_utask(tsk); diff --git a/kernel/futex.c b/kernel/futex.c index b101381..c1104a8 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -716,11 +716,13 @@ void exit_pi_state_list(struct task_struct *curr) if (!futex_cmpxchg_enabled) return; + /* - * We are a ZOMBIE and nobody can enqueue itself on - * pi_state_list anymore, but we have to be careful - * versus waiters unqueueing themselves: + * attach_to_pi_owner() can no longer add the new entry. But + * we have to be careful versus waiters unqueueing themselves. */ + curr->flags |= PF_EXITPIDONE; + raw_spin_lock_irq(&curr->pi_lock); while (!list_empty(head)) { @@ -905,24 +907,12 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key, return -EPERM; } - /* - * We need to look at the task state flags to figure out, - * whether the task is exiting. To protect against the do_exit - * change of the task flags, we do this protected by - * p->pi_lock: - */ raw_spin_lock_irq(&p->pi_lock); - if (unlikely(p->flags & PF_EXITING)) { - /* - * The task is on the way out. When PF_EXITPIDONE is - * set, we know that the task has finished the - * cleanup: - */ - int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN; - + if (unlikely(p->flags & PF_EXITPIDONE)) { + /* exit_pi_state_list() was already called */ raw_spin_unlock_irq(&p->pi_lock); put_task_struct(p); - return ret; + return -ESRCH; } /* @@ -1633,12 +1623,7 @@ retry_private: goto retry; goto out; case -EAGAIN: - /* - * Two reasons for this: - * - Owner is exiting and we just wait for the - * exit to complete. - * - The user space value changed. - */ + /* The user space value changed. */ free_pi_state(pi_state); pi_state = NULL; double_unlock_hb(hb1, hb2); @@ -2295,12 +2280,7 @@ retry_private: case -EFAULT: goto uaddr_faulted; case -EAGAIN: - /* - * Two reasons for this: - * - Task is exiting and we just wait for the - * exit to complete. - * - The user space value changed. - */ + /* The user space value changed. */ queue_unlock(hb); put_futex_key(&q.key); cond_resched(); -- 1.5.5.1