LKML Archive on lore.kernel.org
* [patch] futex: Cure exit race
@ 2018-12-10 15:23 Thomas Gleixner
  2018-12-10 16:02 ` Peter Zijlstra
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Thomas Gleixner @ 2018-12-10 15:23 UTC
  To: LKML
  Cc: Stefan Liebler, Heiko Carstens, Peter Zijlstra, Darren Hart, Ingo Molnar

Stefan reported that the glibc tst-robustpi4 test case fails
occasionally. That case creates the following race between
sys_exit() and sys_futex(LOCK_PI):

 CPU0				CPU1

 sys_exit()			sys_futex()
  do_exit()			 futex_lock_pi()
   exit_signals(tsk)		  No waiters:
    tsk->flags |= PF_EXITING;	  *uaddr == 0x00000PID
  mm_release(tsk)		  Set waiter bit
   exit_robust_list(tsk) {	  *uaddr = 0x80000PID;
      Set owner died		  attach_to_pi_owner() {
    *uaddr = 0xC0000000;	   tsk = get_task(PID);
   }				   if (!(tsk->flags & PF_EXITING)) {
  ...				     attach();
  tsk->flags |= PF_EXITPIDONE;	   } else {
				     if (!(tsk->flags & PF_EXITPIDONE))
				       return -EAGAIN;
				     return -ESRCH; <--- FAIL
				   }

ESRCH is returned all the way to user space, which triggers the glibc test
case assert. Returning ESRCH unconditionally is wrong here because the user
space value has been changed by the exiting task to 0xC0000000, i.e. the
FUTEX_OWNER_DIED bit is set and the futex PID value has been cleared. This
is a valid state and the kernel has to handle it, i.e. taking the futex.
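
For readers unfamiliar with the futex word layout used in the diagram
above, the following is a minimal user-space sketch of how values such as
0x80000PID and 0xC0000000 decompose. The three masks are the ones from the
uapi header linux/futex.h; the struct and helper names are made up for
illustration and do not exist in the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit layout of a PI futex word, values as in the uapi linux/futex.h. */
#define FUTEX_WAITERS		0x80000000u
#define FUTEX_OWNER_DIED	0x40000000u
#define FUTEX_TID_MASK		0x3fffffffu

/* Hypothetical helper type for illustration, not a kernel structure. */
struct futex_word_view {
	uint32_t tid;		/* owner TID, 0 if the futex has no owner */
	bool waiters;		/* FUTEX_WAITERS bit */
	bool owner_died;	/* FUTEX_OWNER_DIED bit */
};

/* Split a raw futex value into its owner TID and the two state bits. */
static struct futex_word_view decode_futex_word(uint32_t uval)
{
	struct futex_word_view v = {
		.tid = uval & FUTEX_TID_MASK,
		.waiters = (uval & FUTEX_WAITERS) != 0,
		.owner_died = (uval & FUTEX_OWNER_DIED) != 0,
	};

	return v;
}
```

With this, 0xC0000000 decodes to TID 0 with both FUTEX_WAITERS and
FUTEX_OWNER_DIED set, which is the state the exiting task leaves behind.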

Cure it by rereading the user space value when PF_EXITING and PF_EXITPIDONE
are set in the task which owns the futex. If the value has changed, let
the kernel retry the operation, which includes all regular sanity checks
and correctly handles the FUTEX_OWNER_DIED case.

If it hasn't changed, then return ESRCH as there is no way to distinguish
this case from malfunctioning user space. This happens when the exiting
task did not have a robust list, the robust list was corrupted or the user
space value in the futex was simply bogus.
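
The decision described above can be sketched as a small user-space model.
This is an illustration under assumptions, not the kernel code:
exit_race_decision() and its parameters are hypothetical names, the
PF_EXITPIDONE state is passed in as a boolean, and the -EFAULT path of
rereading the user space value is left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the outcome for a futex owner that has
 * PF_EXITING set:
 *
 *  exit_pi_done - whether the owner already set PF_EXITPIDONE
 *  uval_seen    - the futex value the locker based its decision on
 *  uval_now     - the futex value reread from user space
 */
static int exit_race_decision(bool exit_pi_done, uint32_t uval_seen,
			      uint32_t uval_now)
{
	/* Owner is still cleaning up: let the caller retry. */
	if (!exit_pi_done)
		return -EAGAIN;

	/* The exiting owner changed the value: retry with fresh state. */
	if (uval_now != uval_seen)
		return -EAGAIN;

	/* Unchanged value: indistinguishable from broken user space. */
	return -ESRCH;
}
```

In the race from the diagram the locker saw 0x80000PID and the reread
yields 0xC0000000, so the model returns -EAGAIN and the retry can take the
futex via the FUTEX_OWNER_DIED handling.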

Reported-by: Stefan Liebler <stli@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200467
---
 kernel/futex.c |   57 +++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 53 insertions(+), 4 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1148,11 +1148,60 @@ static int attach_to_pi_state(u32 __user
 	return ret;
 }
 
+static int handle_exit_race(u32 __user *uaddr, u32 uval, struct task_struct *tsk)
+{
+	u32 uval2;
+
+	/*
+	 * If PF_EXITPIDONE is not yet set try again.
+	 */
+	if (!(tsk->flags & PF_EXITPIDONE))
+		return -EAGAIN;
+
+	/*
+	 * Reread the user space value to handle the following situation:
+	 *
+	 * CPU0				CPU1
+	 *
+	 * sys_exit()			sys_futex()
+	 *  do_exit()			 futex_lock_pi()
+	 *   exit_signals(tsk)		  No waiters:
+	 *    tsk->flags |= PF_EXITING;	  *uaddr == 0x00000PID
+	 *  mm_release(tsk)		  Set waiter bit
+	 *   exit_robust_list(tsk) {	  *uaddr = 0x80000PID;
+	 *      Set owner died		  attach_to_pi_owner() {
+	 *    *uaddr = 0xC0000000;	   tsk = get_task(PID);
+	 *   }				   if (!(tsk->flags & PF_EXITING)) {
+	 *  ...				     attach();
+	 *  tsk->flags |= PF_EXITPIDONE;   } else {
+	 *				     if (!(tsk->flags & PF_EXITPIDONE))
+	 *				       return -EAGAIN;
+	 *				     return -ESRCH; <--- FAIL
+	 *				   }
+	 *
+	 * Returning ESRCH unconditionally is wrong here because the
+	 * user space value has been changed by the exiting task.
+	 */
+	if (get_futex_value_locked(&uval2, uaddr))
+		return -EFAULT;
+
+	/* If the user space value has changed, try again. */
+	if (uval2 != uval)
+		return -EAGAIN;
+
+	/*
+	 * The exiting task did not have a robust list, the robust list was
+	 * corrupted or the user space value in *uaddr is simply bogus.
+	 * Give up and tell user space.
+	 */
+	return -ESRCH;
+}
+
 /*
  * Lookup the task for the TID provided from user space and attach to
  * it after doing proper sanity checks.
  */
-static int attach_to_pi_owner(u32 uval, union futex_key *key,
+static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key,
 			      struct futex_pi_state **ps)
 {
 	pid_t pid = uval & FUTEX_TID_MASK;
@@ -1187,7 +1236,7 @@ static int attach_to_pi_owner(u32 uval,
 		 * set, we know that the task has finished the
 		 * cleanup:
 		 */
-		int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
+		int ret = handle_exit_race(uaddr, uval, p);
 
 		raw_spin_unlock_irq(&p->pi_lock);
 		put_task_struct(p);
@@ -1244,7 +1293,7 @@ static int lookup_pi_state(u32 __user *u
 	 * We are the first waiter - try to look up the owner based on
 	 * @uval and attach to it.
 	 */
-	return attach_to_pi_owner(uval, key, ps);
+	return attach_to_pi_owner(uaddr, uval, key, ps);
 }
 
 static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval)
@@ -1352,7 +1401,7 @@ static int futex_lock_pi_atomic(u32 __us
 	 * attach to the owner. If that fails, no harm done, we only
 	 * set the FUTEX_WAITERS bit in the user space variable.
 	 */
-	return attach_to_pi_owner(uval, key, ps);
+	return attach_to_pi_owner(uaddr, uval, key, ps);
 }
 
 /**



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch] futex: Cure exit race
  2018-12-10 15:23 [patch] futex: Cure exit race Thomas Gleixner
@ 2018-12-10 16:02 ` Peter Zijlstra
  2018-12-10 17:43   ` Thomas Gleixner
       [not found] ` <20181210210920.75EBD20672@mail.kernel.org>
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2018-12-10 16:02 UTC
  To: Thomas Gleixner
  Cc: LKML, Stefan Liebler, Heiko Carstens, Darren Hart, Ingo Molnar

On Mon, Dec 10, 2018 at 04:23:06PM +0100, Thomas Gleixner wrote:

>  kernel/futex.c |   57 +++++++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 53 insertions(+), 4 deletions(-)
> 
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1148,11 +1148,60 @@ static int attach_to_pi_state(u32 __user
>  	return ret;
>  }
>  
> +static int handle_exit_race(u32 __user *uaddr, u32 uval, struct task_struct *tsk)
> +{
> +	u32 uval2;
> +
> +	/*
> +	 * If PF_EXITPIDONE is not yet set try again.
> +	 */
> +	if (!(tsk->flags & PF_EXITPIDONE))
> +		return -EAGAIN;
> +
> +	/*
> +	 * Reread the user space value to handle the following situation:
> +	 *
> +	 * CPU0				CPU1
> +	 *
> +	 * sys_exit()			sys_futex()
> +	 *  do_exit()			 futex_lock_pi()
> +	 *   exit_signals(tsk)		  No waiters:
> +	 *    tsk->flags |= PF_EXITING;	  *uaddr == 0x00000PID
> +	 *  mm_release(tsk)		  Set waiter bit
> +	 *   exit_robust_list(tsk) {	  *uaddr = 0x80000PID;

Just to clarify; this is: sys_futex() <- futex_lock_pi() <-
futex_lock_pi_atomic(), where we do:

  lock_pi_update_atomic(); // changes the futex word
  attach_to_pi_owner(); // possibly returns ESRCH after changing the word


> +	 *      Set owner died		  attach_to_pi_owner() {
> +	 *    *uaddr = 0xC0000000;	   tsk = get_task(PID);
> +	 *   }				   if (!tsk->flags & PF_EXITING) {
> +	 *  ...				     attach();
> +	 *  tsk->flags |= PF_EXITPIDONE;   } else {
> +	 *				     if (!(tsk->flags & PF_EXITPIDONE))
> +	 *				       return -EAGAIN;
> +	 *				     return -ESRCH; <--- FAIL
> +	 *				   }
> +	 *
> +	 * Returning ESRCH unconditionally is wrong here because the
> +	 * user space value has been changed by the exiting task.
> +	 */
> +	if (get_futex_value_locked(&uval2, uaddr))
> +		return -EFAULT;
> +
> +	/* If the user space value has changed, try again. */
> +	if (uval2 != uval)
> +		return -EAGAIN;

And this then goes back to futex_lock_pi(), which does a retry loop.

> +	/*
> +	 * The exiting task did not have a robust list, the robust list was
> +	 * corrupted or the user space value in *uaddr is simply bogus.
> +	 * Give up and tell user space.
> +	 */
> +	return -ESRCH;

If it is unchanged; -ESRCH is a valid return value.

> +}

There is another caller of futex_lock_pi_atomic(),
futex_proxy_trylock_atomic(), which is part of futex_requeue(), that too
does a retry loop on -EAGAIN.

And there is another caller of attach_to_pi_owner(): lookup_pi_state(),
and that too is in futex_requeue() and handles the retry case properly.

Yes, this all looks good.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>


* Re: [patch] futex: Cure exit race
  2018-12-10 16:02 ` Peter Zijlstra
@ 2018-12-10 17:43   ` Thomas Gleixner
  2018-12-12  9:04     ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Thomas Gleixner @ 2018-12-10 17:43 UTC
  To: Peter Zijlstra
  Cc: LKML, Stefan Liebler, Heiko Carstens, Darren Hart, Ingo Molnar

On Mon, 10 Dec 2018, Peter Zijlstra wrote:
> On Mon, Dec 10, 2018 at 04:23:06PM +0100, Thomas Gleixner wrote:
> There is another caller of futex_lock_pi_atomic(),
> futex_proxy_trylock_atomic(), which is part of futex_requeue(), that too
> does a retry loop on -EAGAIN.
> 
> And there is another caller of attach_to_pi_owner(): lookup_pi_state(),
> and that too is in futex_requeue() and handles the retry case properly.
> 
> Yes, this all looks good.
> 
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Bah. The little devil in the unconscious part of my brain insisted on
thinking further about that EAGAIN loop despite my attempt to page
those futex horrors out again immediately after sending that patch.

There is another related issue which is even worse than just mildly
confusing user space:

   task1(SCHED_OTHER)
   sys_exit()
     do_exit()
      exit_mm()
       task1->flags |= PF_EXITING;

   ---> preemption

   task2(SCHED_FIFO)
     sys_futex(LOCK_PI)
       ....
       attach_to_pi_owner() {
         ...
         if (!(task1->flags & PF_EXITING)) {
           attach();
         } else {
              if (!(task1->flags & PF_EXITPIDONE))
                 return -EAGAIN;

Now assume UP or both tasks pinned on the same CPU. That results in a
livelock because task2 is going to loop forever.
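
A toy model of that scenario, assuming a single CPU with strict priority
scheduling. Nothing here is kernel code: owner_finishes_exit() and the
waiter_sleeps_on_retry knob are hypothetical and exist only to contrast a
spinning waiter with one that gives up the CPU. It is not meant as a
proposed fix.

```c
#include <stdbool.h>

/*
 * Hypothetical single-CPU model. The high priority waiter (task2) runs
 * whenever it is runnable; the exiting owner (task1) only runs when the
 * waiter is not runnable. Returns true if the owner got far enough to
 * set its "exit PI done" state within max_steps scheduling steps.
 */
static bool owner_finishes_exit(unsigned int max_steps,
				bool waiter_sleeps_on_retry)
{
	bool exit_pi_done = false;
	bool waiter_runnable = true;

	for (unsigned int step = 0; step < max_steps; step++) {
		if (waiter_runnable) {
			/* Waiter owns the CPU and checks the owner state. */
			if (exit_pi_done)
				return true;
			/* -EAGAIN: either spin again or leave the CPU. */
			if (waiter_sleeps_on_retry)
				waiter_runnable = false;
			continue;
		}

		/* Owner finally runs, completes the cleanup, wakes waiter. */
		exit_pi_done = true;
		waiter_runnable = true;
	}

	return exit_pi_done;
}
```

With waiter_sleeps_on_retry == false the owner never gets the CPU no matter
how large max_steps is, which is the livelock described above.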

No immediate idea how to cure that one w/o creating a mess.

Thanks,

	tglx


* Re: [patch] futex: Cure exit race
       [not found] ` <20181210210920.75EBD20672@mail.kernel.org>
@ 2018-12-10 21:16   ` Thomas Gleixner
  2018-12-10 23:01     ` Sasha Levin
  0 siblings, 1 reply; 13+ messages in thread
From: Thomas Gleixner @ 2018-12-10 21:16 UTC
  To: Sasha Levin; +Cc: LKML, stable

On Mon, 10 Dec 2018, Sasha Levin wrote:
> This commit has been processed because it contains a -stable tag.
> The stable tag indicates that it's relevant for the following trees: all
> 
> The bot has tested the following trees: v4.19.8, v4.14.87, v4.9.144, v4.4.166, v3.18.128
> 
> v4.19.8: Build OK!
> v4.14.87: Build OK!
> v4.9.144: Build failed! Errors:
>     kernel/futex.c:1186:28: error: 'uaddr' undeclared (first use in this function)
> 
> v4.4.166: Build failed! Errors:
>     kernel/futex.c:1181:28: error: 'uaddr' undeclared (first use in this function)
> 
> v3.18.128: Build failed! Errors:
>     kernel/futex.c:1103:28: error: 'uaddr' undeclared (first use in this function)
> 
> How should we proceed with this patch?

I'll look into that once this is sorted... I so love these rotten kernels.

Thanks,

	tglx


* Re: [patch] futex: Cure exit race
  2018-12-10 21:16   ` Thomas Gleixner
@ 2018-12-10 23:01     ` Sasha Levin
  2018-12-11 10:29       ` Thomas Gleixner
  0 siblings, 1 reply; 13+ messages in thread
From: Sasha Levin @ 2018-12-10 23:01 UTC
  To: Thomas Gleixner; +Cc: LKML, stable

On Mon, Dec 10, 2018 at 10:16:03PM +0100, Thomas Gleixner wrote:
>On Mon, 10 Dec 2018, Sasha Levin wrote:
>> This commit has been processed because it contains a -stable tag.
>> The stable tag indicates that it's relevant for the following trees: all
>>
>> The bot has tested the following trees: v4.19.8, v4.14.87, v4.9.144, v4.4.166, v3.18.128
>>
>> v4.19.8: Build OK!
>> v4.14.87: Build OK!
>> v4.9.144: Build failed! Errors:
>>     kernel/futex.c:1186:28: error: 'uaddr' undeclared (first use in this function)
>>
>> v4.4.166: Build failed! Errors:
>>     kernel/futex.c:1181:28: error: 'uaddr' undeclared (first use in this function)
>>
>> v3.18.128: Build failed! Errors:
>>     kernel/futex.c:1103:28: error: 'uaddr' undeclared (first use in this function)
>>
>> How should we proceed with this patch?
>
>I'll look into that once this is sorted... I so love these rotten kernels.

It seems we need:

	734009e96d19 ("futex: Change locking rules")

Which isn't trivial to backport.

--
Thanks,
Sasha


* Re: [patch] futex: Cure exit race
  2018-12-10 15:23 [patch] futex: Cure exit race Thomas Gleixner
  2018-12-10 16:02 ` Peter Zijlstra
       [not found] ` <20181210210920.75EBD20672@mail.kernel.org>
@ 2018-12-11  8:04 ` Stefan Liebler
  2018-12-11 10:32   ` Thomas Gleixner
  2018-12-18 22:18 ` [tip:locking/urgent] " tip-bot for Thomas Gleixner
  3 siblings, 1 reply; 13+ messages in thread
From: Stefan Liebler @ 2018-12-11  8:04 UTC
  To: Thomas Gleixner, LKML
  Cc: Heiko Carstens, Peter Zijlstra, Darren Hart, Ingo Molnar

Hi Thomas,

does this also handle the ESRCH returned by
attach_to_pi_owner(...)
{...
	if (!pid)
		return -ESRCH;
	p = find_get_task_by_vpid(pid);
	if (!p)
		return -ESRCH;
...

I think pid should never be zero when attach_to_pi_owner() is called.
But can it happen that p is NULL? At least I traced the "return -ESRCH"
with the 4.17 kernel. Unfortunately both returns share the same
instruction address, so the trace cannot tell them apart.

Bye
Stefan


* Re: [patch] futex: Cure exit race
  2018-12-10 23:01     ` Sasha Levin
@ 2018-12-11 10:29       ` Thomas Gleixner
  0 siblings, 0 replies; 13+ messages in thread
From: Thomas Gleixner @ 2018-12-11 10:29 UTC
  To: Sasha Levin; +Cc: LKML, stable

On Mon, 10 Dec 2018, Sasha Levin wrote:
> On Mon, Dec 10, 2018 at 10:16:03PM +0100, Thomas Gleixner wrote:
> > On Mon, 10 Dec 2018, Sasha Levin wrote:
> > > How should we proceed with this patch?
> > 
> > I'll look into that once this is sorted... I so love these rotten kernels.
> 
> It seems we need:
> 
> 	734009e96d19 ("futex: Change locking rules")
> 
> Which isn't trivial to backport.

It's simpler to backport the fix. I'll look at that once we have agreed
on the final solution.

Thanks,

	tglx


* Re: [patch] futex: Cure exit race
  2018-12-11  8:04 ` Stefan Liebler
@ 2018-12-11 10:32   ` Thomas Gleixner
  0 siblings, 0 replies; 13+ messages in thread
From: Thomas Gleixner @ 2018-12-11 10:32 UTC
  To: Stefan Liebler
  Cc: LKML, Heiko Carstens, Peter Zijlstra, Darren Hart, Ingo Molnar

Stefan,

On Tue, 11 Dec 2018, Stefan Liebler wrote:
> does this also handle the ESRCH returned by
> attach_to_pi_owner(...)
> {...
> 	if (!pid)
> 		return -ESRCH;
> 	p = find_get_task_by_vpid(pid);
> 	if (!p)
> 		return -ESRCH;
> ...
> 
> I think pid should never be zero when attach_to_pi_owner is called.

Yeah, I just checked again. It's a paranoid check.

> But it can happen that p is null? At least I traced the "return -ESRCH" with
> the 4.17 kernel. Unfortunately both returns were done by the same instruction
> address.

Yes, you are right. We need the same sanity check for that part. Updated
patch below.

Now I "just" have to come up with a cure for that livelock thing ....

Thanks,

	tglx

8<--------------
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1148,11 +1148,65 @@ static int attach_to_pi_state(u32 __user
 	return ret;
 }
 
+static int handle_exit_race(u32 __user *uaddr, u32 uval,
+			    struct task_struct *tsk)
+{
+	u32 uval2;
+
+	/*
+	 * If PF_EXITPIDONE is not yet set, then try again.
+	 */
+	if (tsk && !(tsk->flags & PF_EXITPIDONE))
+		return -EAGAIN;
+
+	/*
+	 * Reread the user space value to handle the following situation:
+	 *
+	 * CPU0				CPU1
+	 *
+	 * sys_exit()			sys_futex()
+	 *  do_exit()			 futex_lock_pi()
+	 *                                futex_lock_pi_atomic()
+	 *   exit_signals(tsk)		    No waiters:
+	 *    tsk->flags |= PF_EXITING;	    *uaddr == 0x00000PID
+	 *  mm_release(tsk)		    Set waiter bit
+	 *   exit_robust_list(tsk) {	    *uaddr = 0x80000PID;
+	 *      Set owner died		    attach_to_pi_owner() {
+	 *    *uaddr = 0xC0000000;	     tsk = get_task(PID);
+	 *   }				     if (!(tsk->flags & PF_EXITING)) {
+	 *  ...				       attach();
+	 *  tsk->flags |= PF_EXITPIDONE;     } else {
+	 *				       if (!(tsk->flags & PF_EXITPIDONE))
+	 *				         return -EAGAIN;
+	 *				       return -ESRCH; <--- FAIL
+	 *				     }
+	 *
+	 * Returning ESRCH unconditionally is wrong here because the
+	 * user space value has been changed by the exiting task.
+	 *
+	 * The same logic applies to the case where the exiting task is
+	 * already gone.
+	 */
+	if (get_futex_value_locked(&uval2, uaddr))
+		return -EFAULT;
+
+	/* If the user space value has changed, try again. */
+	if (uval2 != uval)
+		return -EAGAIN;
+
+	/*
+	 * The exiting task did not have a robust list, the robust list was
+	 * corrupted or the user space value in *uaddr is simply bogus.
+	 * Give up and tell user space.
+	 */
+	return -ESRCH;
+}
+
 /*
  * Lookup the task for the TID provided from user space and attach to
  * it after doing proper sanity checks.
  */
-static int attach_to_pi_owner(u32 uval, union futex_key *key,
+static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key,
 			      struct futex_pi_state **ps)
 {
 	pid_t pid = uval & FUTEX_TID_MASK;
@@ -1162,12 +1216,15 @@ static int attach_to_pi_owner(u32 uval,
 	/*
 	 * We are the first waiter - try to look up the real owner and attach
 	 * the new pi_state to it, but bail out when TID = 0 [1]
+	 *
+	 * The !pid check is paranoid. None of the call sites should end up
+	 * with pid == 0, but better safe than sorry. Let the caller retry
 	 */
 	if (!pid)
-		return -ESRCH;
+		return -EAGAIN;
 	p = find_get_task_by_vpid(pid);
 	if (!p)
-		return -ESRCH;
+		return handle_exit_race(uaddr, uval, NULL);
 
 	if (unlikely(p->flags & PF_KTHREAD)) {
 		put_task_struct(p);
@@ -1187,7 +1244,7 @@ static int attach_to_pi_owner(u32 uval,
 		 * set, we know that the task has finished the
 		 * cleanup:
 		 */
-		int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
+		int ret = handle_exit_race(uaddr, uval, p);
 
 		raw_spin_unlock_irq(&p->pi_lock);
 		put_task_struct(p);
@@ -1244,7 +1301,7 @@ static int lookup_pi_state(u32 __user *u
 	 * We are the first waiter - try to look up the owner based on
 	 * @uval and attach to it.
 	 */
-	return attach_to_pi_owner(uval, key, ps);
+	return attach_to_pi_owner(uaddr, uval, key, ps);
 }
 
 static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval)
@@ -1352,7 +1409,7 @@ static int futex_lock_pi_atomic(u32 __us
 	 * attach to the owner. If that fails, no harm done, we only
 	 * set the FUTEX_WAITERS bit in the user space variable.
 	 */
-	return attach_to_pi_owner(uval, key, ps);
+	return attach_to_pi_owner(uaddr, newval, key, ps);
 }
 
 /**


* Re: [patch] futex: Cure exit race
  2018-12-10 17:43   ` Thomas Gleixner
@ 2018-12-12  9:04     ` Peter Zijlstra
  2018-12-18  9:31       ` Thomas Gleixner
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2018-12-12  9:04 UTC
  To: Thomas Gleixner
  Cc: LKML, Stefan Liebler, Heiko Carstens, Darren Hart, Ingo Molnar

On Mon, Dec 10, 2018 at 06:43:51PM +0100, Thomas Gleixner wrote:
> On Mon, 10 Dec 2018, Peter Zijlstra wrote:
> > On Mon, Dec 10, 2018 at 04:23:06PM +0100, Thomas Gleixner wrote:
> > There is another caller of futex_lock_pi_atomic(),
> > futex_proxy_trylock_atomic(), which is part of futex_requeue(), that too
> > does a retry loop on -EAGAIN.
> > 
> > And there is another caller of attach_to_pi_owner(): lookup_pi_state(),
> > and that too is in futex_requeue() and handles the retry case properly.
> > 
> > Yes, this all looks good.
> > 
> > Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> Bah. The little devil in the unconscious part of my brain insisted on
> thinking further about that EAGAIN loop despite my attempt to page
> those futex horrors out again immediately after sending that patch.
> 
> There is another related issue which is even worse than just mildly
> confusing user space:
> 
>    task1(SCHED_OTHER)
>    sys_exit()
>      do_exit()
>       exit_mm()
>        task1->flags |= PF_EXITING;
> 
>    ---> preemption
> 
>    task2(SCHED_FIFO)
>      sys_futex(LOCK_PI)
>        ....
>        attach_to_pi_owner() {
>          ...
>          if (!task1->flags & PF_EXITING) {
>            attach();
>          } else {
>               if (!(tsk->flags & PF_EXITPIDONE))
> 	         return -EAGAIN;
> 
> Now assume UP or both tasks pinned on the same CPU. That results in a
> livelock because task2 is going to loop forever.
> 
> No immediate idea how to cure that one w/o creating a mess.

One possible, but fairly gruesome, hack would be something like the
below.

Now, this obviously introduces a priority inversion, but that's
arguably better than a live-lock. Also, I'm not sure there's really
anything 'sane' you can do in the case where your lock holder is dying
instead of doing a proper unlock anyway.

But no, I'm not liking this much either...

diff --git a/kernel/exit.c b/kernel/exit.c
index 0e21e6d21f35..bc6a01112d9d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -806,6 +806,8 @@ void __noreturn do_exit(long code)
 		 * task into the wait for ever nirwana as well.
 		 */
 		tsk->flags |= PF_EXITPIDONE;
+		smp_mb();
+		wake_up_bit(&tsk->flags, 3 /* PF_EXITPIDONE */);
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		schedule();
 	}
diff --git a/kernel/futex.c b/kernel/futex.c
index f423f9b6577e..a743d657e783 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1148,8 +1148,8 @@ static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
  * Lookup the task for the TID provided from user space and attach to
  * it after doing proper sanity checks.
  */
-static int attach_to_pi_owner(u32 uval, union futex_key *key,
-			      struct futex_pi_state **ps)
+static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key,
+			      struct futex_pi_state **ps, struct task_struct **pe)
 {
 	pid_t pid = uval & FUTEX_TID_MASK;
 	struct futex_pi_state *pi_state;
@@ -1187,10 +1236,15 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 		 * set, we know that the task has finished the
 		 * cleanup:
 		 */
 		int ret = handle_exit_race(uaddr, uval, p);
 
 		raw_spin_unlock_irq(&p->pi_lock);
-		put_task_struct(p);
+
+		if (ret == -EAGAIN)
+			*pe = p;
+		else
+			put_task_struct(p);
+
 		return ret;
 	}
 
@@ -1244,7 +1298,7 @@ static int lookup_pi_state(u32 __user *uaddr, u32 uval,
 	 * We are the first waiter - try to look up the owner based on
 	 * @uval and attach to it.
 	 */
-	return attach_to_pi_owner(uval, key, ps);
+	return attach_to_pi_owner(uaddr, uval, key, ps);
 }
 
 static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval)
@@ -1282,7 +1336,8 @@ static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval)
 static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
 				union futex_key *key,
 				struct futex_pi_state **ps,
-				struct task_struct *task, int set_waiters)
+				struct task_struct *task, int set_waiters,
+				struct task_struct **exiting)
 {
 	u32 uval, newval, vpid = task_pid_vnr(task);
 	struct futex_q *top_waiter;
@@ -1352,7 +1407,7 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
 	 * attach to the owner. If that fails, no harm done, we only
 	 * set the FUTEX_WAITERS bit in the user space variable.
 	 */
-	return attach_to_pi_owner(uval, key, ps);
+	return attach_to_pi_owner(uaddr, uval, key, ps, exiting);
 }
 
 /**
@@ -2716,6 +2771,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
 	struct rt_mutex_waiter rt_waiter;
 	struct futex_hash_bucket *hb;
 	struct futex_q q = futex_q_init;
+	struct task_struct *exiting;
 	int res, ret;
 
 	if (!IS_ENABLED(CONFIG_FUTEX_PI))
@@ -2733,6 +2789,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
 	}
 
 retry:
+	exiting = NULL;
 	ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key, VERIFY_WRITE);
 	if (unlikely(ret != 0))
 		goto out;
@@ -2740,7 +2797,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
 retry_private:
 	hb = queue_lock(&q);
 
-	ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current, 0);
+	ret = futex_lock_pi_atomic(uaddr, hb, &q.key, &q.pi_state, current, 0, &exiting);
 	if (unlikely(ret)) {
 		/*
 		 * Atomic work succeeded and we got the lock,
@@ -2762,6 +2819,12 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
 			 */
 			queue_unlock(hb);
 			put_futex_key(&q.key);
+
+			if (exiting) {
+				wait_bit(&exiting->flags, 3 /* PF_EXITPIDONE */, TASK_UNINTERRUPTIBLE);
+				put_task_struct(exiting);
+			}
+
 			cond_resched();
 			goto retry;
 		default:


* Re: [patch] futex: Cure exit race
  2018-12-12  9:04     ` Peter Zijlstra
@ 2018-12-18  9:31       ` Thomas Gleixner
  2018-12-19 13:29         ` Thomas Gleixner
  0 siblings, 1 reply; 13+ messages in thread
From: Thomas Gleixner @ 2018-12-18  9:31 UTC
  To: Peter Zijlstra
  Cc: LKML, Stefan Liebler, Heiko Carstens, Darren Hart, Ingo Molnar

On Wed, 12 Dec 2018, Peter Zijlstra wrote:
> On Mon, Dec 10, 2018 at 06:43:51PM +0100, Thomas Gleixner wrote:
> @@ -806,6 +806,8 @@ void __noreturn do_exit(long code)
>  		 * task into the wait for ever nirwana as well.
>  		 */
>  		tsk->flags |= PF_EXITPIDONE;
> +		smp_mb();
> +		wake_up_bit(&tsk->flags, 3 /* PF_EXITPIDONE */);

Using ilog2(PF_EXITPIDONE) spares that horrible inline comment and more
importantly selects the right bit. 0x04 is bit 2 ....
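
To illustrate the point about deriving the bit number from the mask
instead of hardcoding it, here is a user-space stand-in for ilog2().
flag_bit_index() is a hypothetical helper; the actual value of
PF_EXITPIDONE should be taken from include/linux/sched.h of the tree in
question rather than from this sketch.

```c
#include <stdint.h>

/*
 * Hypothetical user-space stand-in for the kernel's ilog2(): returns
 * the index of the highest set bit. For a single-bit flag mask that is
 * the bit number wanted by the bit-wait helpers. The input must be
 * non-zero.
 */
static unsigned int flag_bit_index(uint32_t flag)
{
	unsigned int bit = 0;

	/* Shift right until nothing is left, counting the shifts. */
	while ((flag >>= 1) != 0)
		bit++;

	return bit;
}
```

A mask of 0x04 yields bit 2 and a mask of 0x08 yields bit 3, so a
hardcoded number silently breaks as soon as it does not match the mask.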

> @@ -1187,10 +1236,15 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
>  		 * set, we know that the task has finished the
>  		 * cleanup:
>  		 */
>  		int ret = handle_exit_race(uaddr, uval, p);
>  
>  		raw_spin_unlock_irq(&p->pi_lock);
> -		put_task_struct(p);
> +
> +		if (ret == -EAGAIN)
> +			*pe = p;

Hmm, no. We really want to split the return value for that. EAGAIN is also
returned for other reasons.
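A rough user space sketch of what splitting the return value could look
like. The enum, its names and classify_owner() are hypothetical and only
mirror the decisions made in handle_exit_race(); this is not the code which
ended up in the kernel.

```c
#include <stdbool.h>

/* Hypothetical result codes, one per distinct reaction of the caller. */
enum attach_result {
	ATTACH_OK,		/* owner alive: attach the pi_state */
	ATTACH_WAIT_EXIT,	/* owner exiting, cleanup not finished */
	ATTACH_RETRY,		/* user space value changed: redo the operation */
	ATTACH_NO_OWNER,	/* nothing changed: report ESRCH to user space */
};

/* Classify the owner state instead of folding two cases into -EAGAIN. */
static enum attach_result classify_owner(bool exiting, bool exit_pi_done,
					 bool uval_changed)
{
	if (!exiting)
		return ATTACH_OK;
	if (!exit_pi_done)
		return ATTACH_WAIT_EXIT;
	if (uval_changed)
		return ATTACH_RETRY;
	return ATTACH_NO_OWNER;
}

/* Only the dedicated code tells the caller to wait for the exiting task. */
static bool must_wait_for_owner(enum attach_result res)
{
	return res == ATTACH_WAIT_EXIT;
}
```

The point of the dedicated code is that a plain retry request never makes
the caller wait on a task, while the exiting owner case always does.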

Plus requeue_pi() needs the same treatment. I'm staring into it, but all I
came up with so far is horribly ugly.

Thanks,

	tglx

* [tip:locking/urgent] futex: Cure exit race
  2018-12-10 15:23 [patch] futex: Cure exit race Thomas Gleixner
                   ` (2 preceding siblings ...)
  2018-12-11  8:04 ` Stefan Liebler
@ 2018-12-18 22:18 ` " tip-bot for Thomas Gleixner
  3 siblings, 0 replies; 13+ messages in thread
From: tip-bot for Thomas Gleixner @ 2018-12-18 22:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, stli, mingo, linux-kernel, heiko.carstens, hpa, sashal,
	dvhart, tglx

Commit-ID:  da791a667536bf8322042e38ca85d55a78d3c273
Gitweb:     https://git.kernel.org/tip/da791a667536bf8322042e38ca85d55a78d3c273
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 10 Dec 2018 14:35:14 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Tue, 18 Dec 2018 23:13:15 +0100

futex: Cure exit race

Stefan reported that the glibc tst-robustpi4 test case fails
occasionally. That case creates the following race between
sys_exit() and sys_futex_lock_pi():

 CPU0				CPU1

 sys_exit()			sys_futex()
  do_exit()			 futex_lock_pi()
   exit_signals(tsk)		  No waiters:
    tsk->flags |= PF_EXITING;	  *uaddr == 0x00000PID
  mm_release(tsk)		  Set waiter bit
   exit_robust_list(tsk) {	  *uaddr = 0x80000PID;
      Set owner died		  attach_to_pi_owner() {
    *uaddr = 0xC0000000;	   tsk = get_task(PID);
   }				   if (!(tsk->flags & PF_EXITING)) {
  ...				     attach();
  tsk->flags |= PF_EXITPIDONE;	   } else {
				     if (!(tsk->flags & PF_EXITPIDONE))
				       return -EAGAIN;
				     return -ESRCH; <--- FAIL
				   }

ESRCH is returned all the way to user space, which triggers the glibc test
case assert. Returning ESRCH unconditionally is wrong here because the user
space value has been changed by the exiting task to 0xC0000000, i.e. the
FUTEX_OWNER_DIED bit is set and the futex PID value has been cleared. This
is a valid state and the kernel has to handle it, i.e. taking the futex.
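For illustration, the futex word states named above can be decoded in user
space as follows. This is a sketch: the SKETCH_ constants are local copies
of the bit layout from the uapi futex header, and the helper names are made
up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the futex word layout (assumed to match uapi). */
#define SKETCH_FUTEX_WAITERS	0x80000000u
#define SKETCH_FUTEX_OWNER_DIED	0x40000000u
#define SKETCH_FUTEX_TID_MASK	0x3fffffffu

/* TID of the owner as stored in the futex word, 0 means no owner. */
static uint32_t sketch_futex_tid(uint32_t uval)
{
	return uval & SKETCH_FUTEX_TID_MASK;
}

/* True when the previous owner exited while holding the futex. */
static bool sketch_futex_owner_died(uint32_t uval)
{
	return (uval & SKETCH_FUTEX_OWNER_DIED) != 0;
}

/* True when a waiter has announced itself in the futex word. */
static bool sketch_futex_has_waiters(uint32_t uval)
{
	return (uval & SKETCH_FUTEX_WAITERS) != 0;
}
```

For 0xC0000000 the TID is 0 while both the waiters and the owner died bits
are set, which is the state the exiting task leaves behind in the race
above.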

Cure it by rereading the user space value when PF_EXITING and PF_EXITPIDONE
are set in the task which 'owns' the futex. If the value has changed, let
the kernel retry the operation, which includes all regular sanity checks
and correctly handles the FUTEX_OWNER_DIED case.

If it hasn't changed, then return ESRCH as there is no way to distinguish
this case from malfunctioning user space. This happens when the exiting
task did not have a robust list, the robust list was corrupted or the user
space value in the futex was simply bogus.

Reported-by: Stefan Liebler <stli@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200467
Link: https://lkml.kernel.org/r/20181210152311.986181245@linutronix.de
---
 kernel/futex.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 63 insertions(+), 6 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index f423f9b6577e..5cc8083a4c89 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1148,11 +1148,65 @@ out_error:
 	return ret;
 }
 
+static int handle_exit_race(u32 __user *uaddr, u32 uval,
+			    struct task_struct *tsk)
+{
+	u32 uval2;
+
+	/*
+	 * If PF_EXITPIDONE is not yet set, then try again.
+	 */
+	if (tsk && !(tsk->flags & PF_EXITPIDONE))
+		return -EAGAIN;
+
+	/*
+	 * Reread the user space value to handle the following situation:
+	 *
+	 * CPU0				CPU1
+	 *
+	 * sys_exit()			sys_futex()
+	 *  do_exit()			 futex_lock_pi()
+	 *                                futex_lock_pi_atomic()
+	 *   exit_signals(tsk)		    No waiters:
+	 *    tsk->flags |= PF_EXITING;	    *uaddr == 0x00000PID
+	 *  mm_release(tsk)		    Set waiter bit
+	 *   exit_robust_list(tsk) {	    *uaddr = 0x80000PID;
+	 *      Set owner died		    attach_to_pi_owner() {
+	 *    *uaddr = 0xC0000000;	     tsk = get_task(PID);
+	 *   }				     if (!tsk->flags & PF_EXITING) {
+	 *  ...				       attach();
+	 *  tsk->flags |= PF_EXITPIDONE;     } else {
+	 *				       if (!(tsk->flags & PF_EXITPIDONE))
+	 *				         return -EAGAIN;
+	 *				       return -ESRCH; <--- FAIL
+	 *				     }
+	 *
+	 * Returning ESRCH unconditionally is wrong here because the
+	 * user space value has been changed by the exiting task.
+	 *
+	 * The same logic applies to the case where the exiting task is
+	 * already gone.
+	 */
+	if (get_futex_value_locked(&uval2, uaddr))
+		return -EFAULT;
+
+	/* If the user space value has changed, try again. */
+	if (uval2 != uval)
+		return -EAGAIN;
+
+	/*
+	 * The exiting task did not have a robust list, the robust list was
+	 * corrupted or the user space value in *uaddr is simply bogus.
+	 * Give up and tell user space.
+	 */
+	return -ESRCH;
+}
+
 /*
  * Lookup the task for the TID provided from user space and attach to
  * it after doing proper sanity checks.
  */
-static int attach_to_pi_owner(u32 uval, union futex_key *key,
+static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key,
 			      struct futex_pi_state **ps)
 {
 	pid_t pid = uval & FUTEX_TID_MASK;
@@ -1162,12 +1216,15 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 	/*
 	 * We are the first waiter - try to look up the real owner and attach
 	 * the new pi_state to it, but bail out when TID = 0 [1]
+	 *
+	 * The !pid check is paranoid. None of the call sites should end up
+	 * with pid == 0, but better safe than sorry. Let the caller retry
 	 */
 	if (!pid)
-		return -ESRCH;
+		return -EAGAIN;
 	p = find_get_task_by_vpid(pid);
 	if (!p)
-		return -ESRCH;
+		return handle_exit_race(uaddr, uval, NULL);
 
 	if (unlikely(p->flags & PF_KTHREAD)) {
 		put_task_struct(p);
@@ -1187,7 +1244,7 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
 		 * set, we know that the task has finished the
 		 * cleanup:
 		 */
-		int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
+		int ret = handle_exit_race(uaddr, uval, p);
 
 		raw_spin_unlock_irq(&p->pi_lock);
 		put_task_struct(p);
@@ -1244,7 +1301,7 @@ static int lookup_pi_state(u32 __user *uaddr, u32 uval,
 	 * We are the first waiter - try to look up the owner based on
 	 * @uval and attach to it.
 	 */
-	return attach_to_pi_owner(uval, key, ps);
+	return attach_to_pi_owner(uaddr, uval, key, ps);
 }
 
 static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval)
@@ -1352,7 +1409,7 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
 	 * attach to the owner. If that fails, no harm done, we only
 	 * set the FUTEX_WAITERS bit in the user space variable.
 	 */
-	return attach_to_pi_owner(uval, key, ps);
+	return attach_to_pi_owner(uaddr, newval, key, ps);
 }
 
 /**

* Re: [patch] futex: Cure exit race
  2018-12-18  9:31       ` Thomas Gleixner
@ 2018-12-19 13:29         ` Thomas Gleixner
  2018-12-19 19:13           ` Thomas Gleixner
  0 siblings, 1 reply; 13+ messages in thread
From: Thomas Gleixner @ 2018-12-19 13:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Stefan Liebler, Heiko Carstens, Darren Hart, Ingo Molnar

On 2018-12-18 10:31, Thomas Gleixner wrote:
> On Wed, 12 Dec 2018, Peter Zijlstra wrote:
>> On Mon, Dec 10, 2018 at 06:43:51PM +0100, Thomas Gleixner wrote:
>> @@ -806,6 +806,8 @@ void __noreturn do_exit(long code)
>>  		 * task into the wait for ever nirwana as well.
>>  		 */
>>  		tsk->flags |= PF_EXITPIDONE;
>> +		smp_mb();
>> +		wake_up_bit(&tsk->flags, 3 /* PF_EXITPIDONE */);
>
> Using ilog2(PF_EXITPIDONE) spares that horrible inline comment and more
> importantly selects the right bit. 0x04 is bit 2 ....

Plus wake_up_bit() and wait_on_bit() want an unsigned long, but tsk->flags is
unsigned int....
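A small user space sketch of why the type mismatch matters. struct
sketch_task and its neighbour member are hypothetical; the helper only
reports whether an unsigned long sized access starting at flags would
reach beyond flags on the platform it is compiled for.

```c
#include <stdbool.h>

/* Hypothetical layout: an unsigned int flag word followed by other data. */
struct sketch_task {
	unsigned int flags;
	unsigned int neighbour;
};

/*
 * True when treating &flags as an unsigned long pointer would cover
 * more bytes than flags occupies, i.e. would also touch the neighbour.
 */
static bool long_access_overruns_flags(void)
{
	return sizeof(unsigned long) > sizeof(((struct sketch_task *)0)->flags);
}
```

On targets where unsigned long is wider than unsigned int the helper
returns true, so a bit API working on unsigned long cannot simply be
handed the address of such a flag word.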

Moar staring....


* Re: [patch] futex: Cure exit race
  2018-12-19 13:29         ` Thomas Gleixner
@ 2018-12-19 19:13           ` Thomas Gleixner
  0 siblings, 0 replies; 13+ messages in thread
From: Thomas Gleixner @ 2018-12-19 19:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Stefan Liebler, Heiko Carstens, Darren Hart, Ingo Molnar

On Wed, 19 Dec 2018, Thomas Gleixner wrote:
> On 2018-12-18 10:31, Thomas Gleixner wrote:
> > On Wed, 12 Dec 2018, Peter Zijlstra wrote:
> > > On Mon, Dec 10, 2018 at 06:43:51PM +0100, Thomas Gleixner wrote:
> > > @@ -806,6 +806,8 @@ void __noreturn do_exit(long code)
> > >  		 * task into the wait for ever nirwana as well.
> > >  		 */
> > >  		tsk->flags |= PF_EXITPIDONE;
> > > +		smp_mb();
> > > +		wake_up_bit(&tsk->flags, 3 /* PF_EXITPIDONE */);
> > 
> > Using ilog2(PF_EXITPIDONE) spares that horrible inline comment and more
> > importantly selects the right bit. 0x04 is bit 2 ....
> 
> Plus wake_up_bit() and wait_on_bit() want an unsigned long, but tsk->flags is
> unsigned int....
> 
> Moar staring....

Aside from that, calling wake_up_bit() unconditionally can be slow if the
waitqueue in the hash bucket is not empty.

So while cooking up an alternative solution I found yet another exit race:

  CPU0	 	       		   CPU1

  sys_futex()                      sys_exit()
   futex_lock_pi()                  do_exit()
   No waiters:
   *uaddr == 0x00000PID;
   Set waiter bit
   *uaddr = 0x80000PID;
   attach_to_pi_owner()
    tsk = get_task(PID);            exit_signals(tsk)
    if (!(tsk->flags & PF_EXITING))
       ...                           tsk->flags |= PF_EXITING;
                                    mm_release(tsk)
				      exit_robust_list(tsk)
				        Set owner died and clear PID
					*uaddr = 0xC0000000;
                                      if (unlikely(!list_empty(&tsk->pi_state_list)))
       list_add(&pi_state->list,
             &tsk->pi_state_list);

I put that all on hold until Jan 7.

If somebody is really bored, here is the WIP patch series which addresses
the live lock mess: https://tglx.de/~tglx/patches.tar

Thanks,

	tglx


Thread overview: 13+ messages
2018-12-10 15:23 [patch] futex: Cure exit race Thomas Gleixner
2018-12-10 16:02 ` Peter Zijlstra
2018-12-10 17:43   ` Thomas Gleixner
2018-12-12  9:04     ` Peter Zijlstra
2018-12-18  9:31       ` Thomas Gleixner
2018-12-19 13:29         ` Thomas Gleixner
2018-12-19 19:13           ` Thomas Gleixner
     [not found] ` <20181210210920.75EBD20672@mail.kernel.org>
2018-12-10 21:16   ` Thomas Gleixner
2018-12-10 23:01     ` Sasha Levin
2018-12-11 10:29       ` Thomas Gleixner
2018-12-11  8:04 ` Stefan Liebler
2018-12-11 10:32   ` Thomas Gleixner
2018-12-18 22:18 ` [tip:locking/urgent] " tip-bot for Thomas Gleixner
