linux-kernel.vger.kernel.org archive mirror
From: Thomas Gleixner <tglx@linutronix.de>
To: Oleg Nesterov <oleg@redhat.com>
Cc: Florian Weimer <fweimer@redhat.com>,
	Shawn Landden <shawn@git.icu>,
	libc-alpha@sourceware.org, linux-api@vger.kernel.org,
	LKML <linux-kernel@vger.kernel.org>,
	Arnd Bergmann <arnd@arndb.de>,
	Deepa Dinamani <deepa.kernel@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Keith Packard <keithp@keithp.com>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: handle_exit_race && PF_EXITING
Date: Wed, 6 Nov 2019 10:53:01 +0100 (CET)
Message-ID: <alpine.DEB.2.21.1911061028020.1869@nanos.tec.linutronix.de>
In-Reply-To: <20191106085529.GA12575@redhat.com>

Oleg,

On Wed, 6 Nov 2019, Oleg Nesterov wrote:
> I have found the fix I sent in 2015, attached below. I forgot everything
> I knew about futex.c, so I need some time to adapt it to the current code.
> 
> But I think it is clear what this patch tries to do, do you see any hole?

> @@ -716,11 +716,13 @@ void exit_pi_state_list(struct task_struct *curr)
>  
>  	if (!futex_cmpxchg_enabled)
>  		return;
> +
>  	/*
> -	 * We are a ZOMBIE and nobody can enqueue itself on
> -	 * pi_state_list anymore, but we have to be careful
> -	 * versus waiters unqueueing themselves:
> +	 * attach_to_pi_owner() can no longer add the new entry. But
> +	 * we have to be careful versus waiters unqueueing themselves.
>  	 */
> +	curr->flags |= PF_EXITPIDONE;

This would obviously need a barrier, or the flag update would have to be
moved inside the pi_lock region.
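A user-space sketch of the "inside the pi_lock region" variant, with hypothetical names throughout: a pthread mutex stands in for pi_lock, a bool for PF_EXITPIDONE, and a counter for the length of pi_state_list. None of these are kernel APIs. Setting the flag in the same locked region that drains the list means an attacher either enqueues before the drain (and gets cleaned up) or sees the flag and backs off; it cannot add an entry after the drain.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: none of these names are kernel APIs. */
struct task_model {
	pthread_mutex_t pi_lock;  /* stands in for task->pi_lock */
	bool exit_pi_done;        /* stands in for PF_EXITPIDONE */
	int pi_state_entries;     /* stands in for the pi_state_list length */
};

static void model_init(struct task_model *t)
{
	pthread_mutex_init(&t->pi_lock, NULL);
	t->exit_pi_done = false;
	t->pi_state_entries = 0;
}

/* Exit side: publish the flag and drain the list in one locked region. */
static void model_exit_pi_state_list(struct task_model *t)
{
	pthread_mutex_lock(&t->pi_lock);
	t->exit_pi_done = true;
	t->pi_state_entries = 0;
	pthread_mutex_unlock(&t->pi_lock);
}

/* Attach side: test the flag and enqueue under the same lock, so the
 * check and the enqueue cannot straddle the exit-side drain. */
static int model_attach(struct task_model *t)
{
	int ret = 0;

	pthread_mutex_lock(&t->pi_lock);
	if (t->exit_pi_done)
		ret = -1;  /* owner already cleaned up: back off (errno choice is the open question) */
	else
		t->pi_state_entries++;
	pthread_mutex_unlock(&t->pi_lock);
	return ret;
}
```

If the flag were set outside the lock with no barrier, the attacher could test it, lose the race against the exit side setting it and draining the list, and then enqueue an entry that nobody will ever clean up.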

>  	raw_spin_lock_irq(&curr->pi_lock);
>  	while (!list_empty(head)) {
>  
> @@ -905,24 +907,12 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
>  		return -EPERM;
>  	}
>  
> -	/*
> -	 * We need to look at the task state flags to figure out,
> -	 * whether the task is exiting. To protect against the do_exit
> -	 * change of the task flags, we do this protected by
> -	 * p->pi_lock:
> -	 */
>  	raw_spin_lock_irq(&p->pi_lock);
> -	if (unlikely(p->flags & PF_EXITING)) {
> -		/*
> -		 * The task is on the way out. When PF_EXITPIDONE is
> -		 * set, we know that the task has finished the
> -		 * cleanup:
> -		 */
> -		int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
> -
> +	if (unlikely(p->flags & PF_EXITPIDONE)) {
> +		/* exit_pi_state_list() was already called */
>  		raw_spin_unlock_irq(&p->pi_lock);
>  		put_task_struct(p);
> -		return ret;
> +		return -ESRCH;

But this is incorrect: we'd return -ESRCH to user space while the futex
value still holds the TID of the exiting task, and the exiting task will
subsequently clean up the futex and set the owner died bit.

The result is inconsistent state and will trigger the asserts in the futex
test suite and in the pthread_mutex implementation.
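For reference, a simplified sketch of the robust-list cleanup that rewrites the word for a dying owner, loosely following the kernel's handle_futex_death() (no cmpxchg retry loop, no fault handling; the function name here is hypothetical and the bit values match <linux/futex.h>). If the futex is on the exiting task's robust list, a word that still holds that task's TID is about to become OWNER_DIED, which user space is meant to acquire and recover. Answering -ESRCH for it instead tells user space the word is bogus.

```c
#include <assert.h>
#include <stdint.h>

/* Bit layout of a PI futex word; values match <linux/futex.h>. */
#define FUTEX_WAITERS    0x80000000u
#define FUTEX_OWNER_DIED 0x40000000u
#define FUTEX_TID_MASK   0x3fffffffu

/* Hypothetical model of the exit-time cleanup: if the word still names
 * the dying task, drop the TID, keep the waiters bit and set OWNER_DIED.
 * A word owned by some other TID is left unchanged. */
static uint32_t model_owner_death(uint32_t uval, uint32_t dying_tid)
{
	if ((uval & FUTEX_TID_MASK) != dying_tid)
		return uval;
	return (uval & FUTEX_WAITERS) | FUTEX_OWNER_DIED;
}
```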

The only legitimate reason to return -ESRCH is that the user space value of
the futex contains garbage. But in this case it does not contain garbage,
and returning -ESRCH violates the implicit robustness guarantee of PI
futexes and causes unexpected havoc.

See da791a667536 ("futex: Cure exit race") for example.
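Roughly, the handle_exit_race() logic from that commit waits until the owner has finished its exit cleanup, then rereads the user space value and reports -ESRCH only if the value is still unchanged. A simplified model of that decision, with hypothetical names (a bool stands in for the "cleanup done" check, and the -EFAULT path for a failed reread is omitted):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the exit-race decision, not the kernel code.
 * uval:        futex value read when the attach attempt started
 * uval_reread: futex value read again after the owner finished exiting */
static int model_handle_exit_race(bool exit_cleanup_done,
				  uint32_t uval, uint32_t uval_reread)
{
	if (!exit_cleanup_done)
		return -EAGAIN;  /* owner still exiting: caller retries */
	if (uval_reread != uval)
		return -EAGAIN;  /* cleanup changed the word: retry and see it */
	return -ESRCH;           /* unchanged after cleanup: value is bogus */
}
```

With this ordering, a word whose owner had a robust list comes back on retry as OWNER_DIED rather than as -ESRCH.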

The futex PI contract between kernel and user space relies on consistent
state. Guess why that code has more corner case handling than actual
functionality. :)

Thanks,

	tglx



Thread overview: 29+ messages
2019-11-04  0:29 [RFC v2 PATCH] futex: extend set_robust_list to allow 2 locking ABIs at the same time Shawn Landden
2019-11-04  0:51 ` Shawn Landden
2019-11-04 15:37 ` Thomas Gleixner
2019-11-05  0:10   ` Thomas Gleixner
2019-11-05  9:48 ` Florian Weimer
2019-11-05  9:59   ` Thomas Gleixner
2019-11-05 10:06     ` Florian Weimer
2019-11-05 11:56       ` Thomas Gleixner
2019-11-05 14:10         ` Carlos O'Donell
2019-11-05 14:27           ` Florian Weimer
2019-11-05 14:53             ` Thomas Gleixner
2019-11-05 14:27           ` Thomas Gleixner
2019-11-05 14:33             ` Florian Weimer
2019-11-05 14:48               ` Thomas Gleixner
2019-11-06 14:00             ` Zack Weinberg
2019-11-06 14:04               ` Florian Weimer
2019-11-05 15:27     ` handle_exit_race && PF_EXITING Oleg Nesterov
2019-11-05 17:28       ` Thomas Gleixner
2019-11-05 17:59         ` Thomas Gleixner
2019-11-05 18:56           ` Thomas Gleixner
2019-11-05 19:19             ` Thomas Gleixner
2019-11-06  8:55               ` Oleg Nesterov
2019-11-06  9:53                 ` Thomas Gleixner [this message]
2019-11-06 10:35                   ` Oleg Nesterov
2019-11-06 11:07                     ` Thomas Gleixner
2019-11-06 12:11                       ` Oleg Nesterov
2019-11-06 13:38                         ` Thomas Gleixner
2019-11-06 17:42                         ` Thomas Gleixner
2019-11-07 15:51                           ` Oleg Nesterov
