From: Oleg Nesterov <oleg@redhat.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Florian Weimer <fweimer@redhat.com>,
	Shawn Landden <shawn@git.icu>,
	libc-alpha@sourceware.org, linux-api@vger.kernel.org,
	LKML <linux-kernel@vger.kernel.org>,
	Arnd Bergmann <arnd@arndb.de>,
	Deepa Dinamani <deepa.kernel@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Keith Packard <keithp@keithp.com>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: handle_exit_race && PF_EXITING
Date: Wed, 6 Nov 2019 09:55:29 +0100
Message-ID: <20191106085529.GA12575@redhat.com>
In-Reply-To: <alpine.DEB.2.21.1911051959260.1869@nanos.tec.linutronix.de>

On 11/05, Thomas Gleixner wrote:
>
>  sys_futex()
>     loop infinite because
>     	 PF_EXITING is set,
> 	 but PF_EXITPIDONE not

Yes.

IOW, the problem is very simple. If an RT task preempts the exiting lock
owner after it sets PF_EXITING but before it sets PF_EXITPIDONE, and both
run on the same CPU, futex_lock_pi() will spin forever.
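
To make the spin concrete, here is a rough user-space model of the two
decisions, assuming nothing beyond what the patch below shows. The
MODEL_PF_* values and all function names are made up for illustration;
they are not the kernel's definitions. The retry counter stands in for
the "retry" path in futex_lock_pi() and is bounded only so the model
terminates.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative flag values only, not the kernel's PF_* definitions. */
#define MODEL_PF_EXITING    0x1u
#define MODEL_PF_EXITPIDONE 0x2u

/* Decision made by attach_to_pi_owner() before the patch. */
static int old_attach_check(unsigned int flags)
{
	if (flags & MODEL_PF_EXITING)
		return (flags & MODEL_PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
	return 0;
}

/* Decision after the patch: only PF_EXITPIDONE is consulted. */
static int new_attach_check(unsigned int flags)
{
	return (flags & MODEL_PF_EXITPIDONE) ? -ESRCH : 0;
}

/*
 * Count how often the caller would retry while the owner's flags stay
 * frozen, which is the case when the owner has been preempted by an RT
 * caller on the same CPU. Gives up after 'limit' retries.
 */
static int retries_before_progress(int (*check)(unsigned int),
				   unsigned int owner_flags, int limit)
{
	int retries = 0;

	assert(limit >= 0);
	while (retries < limit && check(owner_flags) == -EAGAIN)
		retries++;
	return retries;
}
```

With the owner stuck at PF_EXITING only, the old check exhausts any
limit you pass in, while the new check never asks for a retry.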

> So the obvious question is why PF_EXITPIDONE is set way after the futex
> exit cleanup has run,

Another obvious question is why this code checks PF_EXITING. I still think
it should not.

> The way we can deal with that is:
>
>     do_exit()
>     tsk->flags |= PF_EXITING;
>     ...
>     mutex_lock(&tsk->futex_exit_mutex);
>     futex_exit();
>     tsk->flags |= PF_EXITPIDONE;
>     mutex_unlock(&tsk->futex_exit_mutex);
>
> and on the futex lock_pi side:
>
>     if (!(tsk->flags & PF_EXITING))
>     	return 0;		<- All good
>
>     if (tsk->flags & PF_EXITPIDONE)
>         return -EOWNERDEAD;	<- Locker can take over
>
>     mutex_lock(&tsk->futex_exit_mutex);
>     if (tsk->flags & PF_EXITPIDONE) {
>         mutex_unlock(&tsk->futex_exit_mutex);
>         return -EOWNERDEAD;	<- Locker can take over
>     }
>
>     queue_futex();
>     mutex_unlock(&tsk->futex_exit_mutex);
>
> Not that I think it's pretty, but it plugs all holes AFAICT.

I have found the fix I sent in 2015, attached below. I forgot everything
I knew about futex.c, so I need some time to adapt it to the current code.

But I think it is clear what this patch tries to do. Do you see any hole?
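
The ordering the attached patch relies on can be sketched in user space
as follows. This is only a model under stated assumptions: a pthread
mutex stands in for ->pi_lock, a counter stands in for ->pi_state_list,
the flag is a C11 atomic rather than a plain store, and every model_*
name is hypothetical. It is not the kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>

/* Illustrative value only, not the kernel's PF_EXITPIDONE. */
#define MODEL_PF_EXITPIDONE 0x2u

struct model_task {
	pthread_mutex_t pi_lock;	/* stands in for ->pi_lock */
	atomic_uint flags;		/* stands in for ->flags */
	int pi_state_count;		/* stands in for ->pi_state_list */
};

/* Waiter side: refuse to attach once the flag is visible under the lock. */
static int model_attach(struct model_task *p)
{
	int ret = 0;

	pthread_mutex_lock(&p->pi_lock);
	if (atomic_load(&p->flags) & MODEL_PF_EXITPIDONE)
		ret = -ESRCH;
	else
		p->pi_state_count++;
	pthread_mutex_unlock(&p->pi_lock);
	return ret;
}

/* Exit side: set the flag first, then flush under the lock. */
static int model_exit_pi_state_list(struct model_task *curr)
{
	int flushed;

	atomic_fetch_or(&curr->flags, MODEL_PF_EXITPIDONE);

	pthread_mutex_lock(&curr->pi_lock);
	flushed = curr->pi_state_count;
	curr->pi_state_count = 0;
	pthread_mutex_unlock(&curr->pi_lock);
	return flushed;
}

/* Sequential walk-through: early attach is flushed, late attach fails. */
static void model_demo(void)
{
	struct model_task t = { .pi_lock = PTHREAD_MUTEX_INITIALIZER };

	assert(model_attach(&t) == 0);
	assert(model_exit_pi_state_list(&t) == 1);
	assert(model_attach(&t) == -ESRCH);
	assert(t.pi_state_count == 0);
}
```

In this model an attach either completes before the flush takes the
lock, in which case the flush sees the entry, or it takes the lock
afterwards and is refused with -ESRCH. No path returns -EAGAIN.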

Oleg.

[PATCH] futex: don't spin waiting for PF_EXITING -> PF_EXITPIDONE transition

It is absolutely not clear why attach_to_pi_owner() returns -EAGAIN which
triggers "retry" if the lock owner is PF_EXITING but not PF_EXITPIDONE.
This burns CPU for no reason and this can even livelock if the rt_task()
caller preempts a PF_EXITING owner.

Remove the PF_EXITING check altogether. We do not care whether the owner
is exiting; all we need to know is whether we can rely on
exit_pi_state_list(). So we also need to set PF_EXITPIDONE before we flush
->pi_state_list, and call exit_pi_state_list() unconditionally.

Perhaps we can add the fast-path list_empty() check in mm_release() back,
but let's fix the problem first. Besides, in theory this check is already
not correct; at the least it should be list_empty_careful() to avoid the
race with a free_pi_state() in progress.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c  |   22 +---------------------
 kernel/fork.c  |    3 +--
 kernel/futex.c |   40 ++++++++++------------------------------
 3 files changed, 12 insertions(+), 53 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 6806c55..bc969ed 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -683,27 +683,13 @@ void do_exit(long code)
 	 */
 	if (unlikely(tsk->flags & PF_EXITING)) {
 		pr_alert("Fixing recursive fault but reboot is needed!\n");
-		/*
-		 * We can do this unlocked here. The futex code uses
-		 * this flag just to verify whether the pi state
-		 * cleanup has been done or not. In the worst case it
-		 * loops once more. We pretend that the cleanup was
-		 * done as there is no way to return. Either the
-		 * OWNER_DIED bit is set by now or we push the blocked
-		 * task into the wait for ever nirwana as well.
-		 */
+		/* Avoid the new additions to ->pi_state_list at least */
 		tsk->flags |= PF_EXITPIDONE;
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		schedule();
 	}
 
 	exit_signals(tsk);  /* sets PF_EXITING */
-	/*
-	 * tsk->flags are checked in the futex code to protect against
-	 * an exiting task cleaning up the robust pi futexes.
-	 */
-	smp_mb();
-	raw_spin_unlock_wait(&tsk->pi_lock);
 
 	if (unlikely(in_atomic()))
 		pr_info("note: %s[%d] exited with preempt_count %d\n",
@@ -779,12 +765,6 @@ void do_exit(long code)
 	 * Make sure we are holding no locks:
 	 */
 	debug_check_no_locks_held();
-	/*
-	 * We can do this unlocked here. The futex code uses this flag
-	 * just to verify whether the pi state cleanup has been done
-	 * or not. In the worst case it loops once more.
-	 */
-	tsk->flags |= PF_EXITPIDONE;
 
 	if (tsk->io_context)
 		exit_io_context(tsk);
diff --git a/kernel/fork.c b/kernel/fork.c
index 4dc2dda..ec3208e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -803,8 +803,7 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
 		tsk->compat_robust_list = NULL;
 	}
 #endif
-	if (unlikely(!list_empty(&tsk->pi_state_list)))
-		exit_pi_state_list(tsk);
+	exit_pi_state_list(tsk);
 #endif
 
 	uprobe_free_utask(tsk);
diff --git a/kernel/futex.c b/kernel/futex.c
index b101381..c1104a8 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -716,11 +716,13 @@ void exit_pi_state_list(struct task_struct *curr)
 
 	if (!futex_cmpxchg_enabled)
 		return;
+
 	/*
-	 * We are a ZOMBIE and nobody can enqueue itself on
-	 * pi_state_list anymore, but we have to be careful
-	 * versus waiters unqueueing themselves:
+	 * attach_to_pi_owner() can no longer add the new entry. But
+	 * we have to be careful versus waiters unqueueing themselves.
 	 */
+	curr->flags |= PF_EXITPIDONE;
+
 	raw_spin_lock_irq(&curr->pi_lock);
 	while (!list_empty(head)) {
 
@@ -905,24 +907,12 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 		return -EPERM;
 	}
 
-	/*
-	 * We need to look at the task state flags to figure out,
-	 * whether the task is exiting. To protect against the do_exit
-	 * change of the task flags, we do this protected by
-	 * p->pi_lock:
-	 */
 	raw_spin_lock_irq(&p->pi_lock);
-	if (unlikely(p->flags & PF_EXITING)) {
-		/*
-		 * The task is on the way out. When PF_EXITPIDONE is
-		 * set, we know that the task has finished the
-		 * cleanup:
-		 */
-		int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
-
+	if (unlikely(p->flags & PF_EXITPIDONE)) {
+		/* exit_pi_state_list() was already called */
 		raw_spin_unlock_irq(&p->pi_lock);
 		put_task_struct(p);
-		return ret;
+		return -ESRCH;
 	}
 
 	/*
@@ -1633,12 +1623,7 @@ retry_private:
 				goto retry;
 			goto out;
 		case -EAGAIN:
-			/*
-			 * Two reasons for this:
-			 * - Owner is exiting and we just wait for the
-			 *   exit to complete.
-			 * - The user space value changed.
-			 */
+			/* The user space value changed. */
 			free_pi_state(pi_state);
 			pi_state = NULL;
 			double_unlock_hb(hb1, hb2);
@@ -2295,12 +2280,7 @@ retry_private:
 		case -EFAULT:
 			goto uaddr_faulted;
 		case -EAGAIN:
-			/*
-			 * Two reasons for this:
-			 * - Task is exiting and we just wait for the
-			 *   exit to complete.
-			 * - The user space value changed.
-			 */
+			/* The user space value changed. */
 			queue_unlock(hb);
 			put_futex_key(&q.key);
 			cond_resched();
-- 
1.5.5.1
