All of lore.kernel.org
 help / color / mirror / Atom feed
From: Oleg Nesterov <oleg@redhat.com>
To: Darren Hart <darren@dvhart.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>
Cc: Jerome Marchand <jmarchan@redhat.com>,
	Larry Woodman <lwoodman@redhat.com>,
	Mateusz Guzik <mguzik@redhat.com>,
	linux-kernel@vger.kernel.org,
	Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>,
	Pavel Emelyanov <xemul@parallels.com>
Subject: [PATCH 1/1] futex: don't spin waiting for PF_EXITING -> PF_EXITPIDONE transition
Date: Mon, 16 Feb 2015 21:13:36 +0100	[thread overview]
Message-ID: <20150216201336.GB18246@redhat.com> (raw)
In-Reply-To: <20150216201313.GA18246@redhat.com>

It is absolutely not clear why attach_to_pi_owner() returns -EAGAIN which
triggers "retry" if the lock owner is PF_EXITING but not PF_EXITPIDONE.
This burns CPU for no reason and this can even livelock if the rt_task()
caller preempts a PF_EXITING owner.

Remove the PF_EXITING check altogether. We do not care if it is exiting,
all we need to know is can we rely on exit_pi_state_list() or not. So we
also need to set PF_EXITPIDONE before we flush ->pi_state_list and call
exit_pi_state_list() unconditionally.

Perhaps we can add the fast-path list_empty() check in mm_release() back,
but lets fix the problem first. Besides, in theory this check is already
not correct, at least it should be list_empty_careful() to avoid the race
with free_pi_state() in progress.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c  |   22 +---------------------
 kernel/fork.c  |    3 +--
 kernel/futex.c |   40 ++++++++++------------------------------
 3 files changed, 12 insertions(+), 53 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 6806c55..bc969ed 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -683,27 +683,13 @@ void do_exit(long code)
 	 */
 	if (unlikely(tsk->flags & PF_EXITING)) {
 		pr_alert("Fixing recursive fault but reboot is needed!\n");
-		/*
-		 * We can do this unlocked here. The futex code uses
-		 * this flag just to verify whether the pi state
-		 * cleanup has been done or not. In the worst case it
-		 * loops once more. We pretend that the cleanup was
-		 * done as there is no way to return. Either the
-		 * OWNER_DIED bit is set by now or we push the blocked
-		 * task into the wait for ever nirwana as well.
-		 */
+		/* Avoid the new additions to ->pi_state_list at least */
 		tsk->flags |= PF_EXITPIDONE;
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		schedule();
 	}
 
 	exit_signals(tsk);  /* sets PF_EXITING */
-	/*
-	 * tsk->flags are checked in the futex code to protect against
-	 * an exiting task cleaning up the robust pi futexes.
-	 */
-	smp_mb();
-	raw_spin_unlock_wait(&tsk->pi_lock);
 
 	if (unlikely(in_atomic()))
 		pr_info("note: %s[%d] exited with preempt_count %d\n",
@@ -779,12 +765,6 @@ void do_exit(long code)
 	 * Make sure we are holding no locks:
 	 */
 	debug_check_no_locks_held();
-	/*
-	 * We can do this unlocked here. The futex code uses this flag
-	 * just to verify whether the pi state cleanup has been done
-	 * or not. In the worst case it loops once more.
-	 */
-	tsk->flags |= PF_EXITPIDONE;
 
 	if (tsk->io_context)
 		exit_io_context(tsk);
diff --git a/kernel/fork.c b/kernel/fork.c
index 4dc2dda..ec3208e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -803,8 +803,7 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
 		tsk->compat_robust_list = NULL;
 	}
 #endif
-	if (unlikely(!list_empty(&tsk->pi_state_list)))
-		exit_pi_state_list(tsk);
+	exit_pi_state_list(tsk);
 #endif
 
 	uprobe_free_utask(tsk);
diff --git a/kernel/futex.c b/kernel/futex.c
index b101381..c1104a8 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -716,11 +716,13 @@ void exit_pi_state_list(struct task_struct *curr)
 
 	if (!futex_cmpxchg_enabled)
 		return;
+
 	/*
-	 * We are a ZOMBIE and nobody can enqueue itself on
-	 * pi_state_list anymore, but we have to be careful
-	 * versus waiters unqueueing themselves:
+	 * attach_to_pi_owner() can no longer add the new entry. But
+	 * we have to be careful versus waiters unqueueing themselves.
 	 */
+	curr->flags |= PF_EXITPIDONE;
+
 	raw_spin_lock_irq(&curr->pi_lock);
 	while (!list_empty(head)) {
 
@@ -905,24 +907,12 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 		return -EPERM;
 	}
 
-	/*
-	 * We need to look at the task state flags to figure out,
-	 * whether the task is exiting. To protect against the do_exit
-	 * change of the task flags, we do this protected by
-	 * p->pi_lock:
-	 */
 	raw_spin_lock_irq(&p->pi_lock);
-	if (unlikely(p->flags & PF_EXITING)) {
-		/*
-		 * The task is on the way out. When PF_EXITPIDONE is
-		 * set, we know that the task has finished the
-		 * cleanup:
-		 */
-		int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
-
+	if (unlikely(p->flags & PF_EXITPIDONE)) {
+		/* exit_pi_state_list() was already called */
 		raw_spin_unlock_irq(&p->pi_lock);
 		put_task_struct(p);
-		return ret;
+		return -ESRCH;
 	}
 
 	/*
@@ -1633,12 +1623,7 @@ retry_private:
 				goto retry;
 			goto out;
 		case -EAGAIN:
-			/*
-			 * Two reasons for this:
-			 * - Owner is exiting and we just wait for the
-			 *   exit to complete.
-			 * - The user space value changed.
-			 */
+			/* The user space value changed. */
 			free_pi_state(pi_state);
 			pi_state = NULL;
 			double_unlock_hb(hb1, hb2);
@@ -2295,12 +2280,7 @@ retry_private:
 		case -EFAULT:
 			goto uaddr_faulted;
 		case -EAGAIN:
-			/*
-			 * Two reasons for this:
-			 * - Task is exiting and we just wait for the
-			 *   exit to complete.
-			 * - The user space value changed.
-			 */
+			/* The user space value changed. */
 			queue_unlock(hb);
 			put_futex_key(&q.key);
 			cond_resched();
-- 
1.5.5.1



  reply	other threads:[~2015-02-16 20:16 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-02-02 14:05 [PATCH 0/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads Oleg Nesterov
2015-02-02 14:05 ` [PATCH 1/1] " Oleg Nesterov
2015-02-04 10:48   ` Peter Zijlstra
2015-02-14 18:01   ` Davidlohr Bueso
2015-02-14 20:57     ` Oleg Nesterov
2015-02-14 21:15       ` Davidlohr Bueso
2015-02-14 21:54         ` Oleg Nesterov
2015-02-18 17:11   ` [tip:locking/core] locking/futex: Check " tip-bot for Oleg Nesterov
2015-02-02 15:11 ` [PATCH 0/1] futex: check " Peter Zijlstra
2015-02-02 15:13   ` Peter Zijlstra
2015-02-02 15:14     ` Peter Zijlstra
2015-02-02 16:20   ` Oleg Nesterov
2015-02-03 20:09   ` Oleg Nesterov
2015-02-04 11:12     ` Peter Zijlstra
2015-02-04 20:25       ` Oleg Nesterov
2015-02-05 16:27         ` Peter Zijlstra
2015-02-05 18:10           ` Oleg Nesterov
2015-02-06 10:46             ` Peter Zijlstra
2015-02-06 17:04               ` Oleg Nesterov
2015-02-09 20:38                 ` Darren Hart
2015-02-10 11:14                   ` Oleg Nesterov
2015-02-16 20:13 ` [PATCH 0/1] futex: don't spin waiting for PF_EXITING -> PF_EXITPIDONE transition Oleg Nesterov
2015-02-16 20:13   ` Oleg Nesterov [this message]
2015-02-27  9:52     ` [PATCH 1/1] " Peter Zijlstra
2015-02-27 11:54       ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150216201336.GB18246@redhat.com \
    --to=oleg@redhat.com \
    --cc=darren@dvhart.com \
    --cc=jmarchan@redhat.com \
    --cc=kuznet@ms2.inr.ac.ru \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lwoodman@redhat.com \
    --cc=mguzik@redhat.com \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=xemul@parallels.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.