linux-kernel.vger.kernel.org archive mirror
* [PATCH] ipc/sem: Fix race between to-be-woken task and waker
@ 2019-09-20 15:54 Waiman Long
  2019-09-26  9:34 ` Peter Zijlstra
  0 siblings, 1 reply; 6+ messages in thread
From: Waiman Long @ 2019-09-20 15:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Arnd Bergmann, Gustavo A. R. Silva,
	Catalin Marinas, Mathieu Malaterre, Peter Zijlstra,
	Davidlohr Bueso, Waiman Long

While looking at a customer bug report about a potential missed wakeup in
the System V semaphore code, I spotted a potential problem.  The fact that
the semaphore waiter stays in the TASK_RUNNING state while checking the
queue status may lead to a missed wakeup if a spurious wakeup happens at
the right moment, as try_to_wake_up() does nothing if the task state isn't
right.

To eliminate this possibility, the task state is now reset to
TASK_INTERRUPTIBLE immediately after wakeup before checking the queue
status. This should eliminate the race condition on the interaction
between the queue status and the task state and fix the potential missed
wakeup problem.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 ipc/sem.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 7da4504bcc7c..1bcd424be047 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -2146,11 +2146,11 @@ static long do_semtimedop(int semid, struct sembuf __user *tsops,
 		sma->complex_count++;
 	}
 
+	__set_current_state(TASK_INTERRUPTIBLE);
 	do {
 		WRITE_ONCE(queue.status, -EINTR);
 		queue.sleeper = current;
 
-		__set_current_state(TASK_INTERRUPTIBLE);
 		sem_unlock(sma, locknum);
 		rcu_read_unlock();
 
@@ -2159,6 +2159,24 @@ static long do_semtimedop(int semid, struct sembuf __user *tsops,
 		else
 			schedule();
 
+		/*
+		 * A spurious wakeup at the right moment can cause race
+		 * between the to-be-woken task and the waker leading to
+		 * missed wakeup. Setting state back to TASK_INTERRUPTIBLE
+		 * before checking queue.status will ensure that the race
+		 * won't happen.
+		 *
+		 *	CPU0				CPU1
+		 *
+		 *  <spurious wakeup>		wake_up_sem_queue_prepare():
+		 *  state = TASK_INTERRUPTIBLE    status = error
+		 *				try_to_wake_up():
+		 *  smp_mb()			  smp_mb()
+		 *  if (status == -EINTR)	  if (!(p->state & state))
+		 *    schedule()		    goto out
+		 */
+		set_current_state(TASK_INTERRUPTIBLE);
+
 		/*
 		 * fastpath: the semop has completed, either successfully or
 		 * not, from the syscall pov, is quite irrelevant to us at this
@@ -2210,6 +2228,7 @@ static long do_semtimedop(int semid, struct sembuf __user *tsops,
 	sem_unlock(sma, locknum);
 	rcu_read_unlock();
 out_free:
+	__set_current_state(TASK_RUNNING);
 	if (sops != fast_sops)
 		kvfree(sops);
 	return error;
-- 
2.18.1



* Re: [PATCH] ipc/sem: Fix race between to-be-woken task and waker
  2019-09-20 15:54 [PATCH] ipc/sem: Fix race between to-be-woken task and waker Waiman Long
@ 2019-09-26  9:34 ` Peter Zijlstra
  2019-09-26 18:12   ` Waiman Long
  0 siblings, 1 reply; 6+ messages in thread
From: Peter Zijlstra @ 2019-09-26  9:34 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, linux-kernel, Arnd Bergmann, Gustavo A. R. Silva,
	Catalin Marinas, Mathieu Malaterre, Davidlohr Bueso, manfred

On Fri, Sep 20, 2019 at 11:54:02AM -0400, Waiman Long wrote:
> While looking at a customer bug report about a potential missed wakeup in
> the System V semaphore code, I spotted a potential problem.  The fact that
> the semaphore waiter stays in the TASK_RUNNING state while checking the
> queue status may lead to a missed wakeup if a spurious wakeup happens at
> the right moment, as try_to_wake_up() does nothing if the task state isn't
> right.
> 
> To eliminate this possibility, the task state is now reset to
> TASK_INTERRUPTIBLE immediately after wakeup before checking the queue
> status. This should eliminate the race condition on the interaction
> between the queue status and the task state and fix the potential missed
> wakeup problem.

Bah, this code always makes my head hurt.

Yes, AFAICT the pattern it uses has been broken since 0a2b9d4c7967,
since that removed doing the actual wakeup from under the sem_lock(),
which is what it relies on.
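
Roughly, the ordering changed like this (a heavily simplified sketch, not
the actual ipc/sem.c code on either side of that commit):

	/* before 0a2b9d4c7967: status write and wakeup both under the lock */
	sem_lock(sma, ...);
	WRITE_ONCE(q->status, error);
	wake_up_process(q->sleeper);	/* serialized against the sleeper */
	sem_unlock(sma, ...);

	/*
	 * after: only the status write is under the lock; the wakeup itself
	 * is deferred (nowadays via a wake_q) and issued after sem_unlock(),
	 * so it is no longer ordered by the lock against the sleeper's
	 * state check.
	 */
	sem_lock(sma, ...);
	WRITE_ONCE(q->status, error);
	wake_q_add(&wake_q, q->sleeper);
	sem_unlock(sma, ...);
	wake_up_q(&wake_q);		/* try_to_wake_up() runs outside the lock */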

> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  ipc/sem.c | 21 ++++++++++++++++++++-
>  1 file changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/ipc/sem.c b/ipc/sem.c
> index 7da4504bcc7c..1bcd424be047 100644
> --- a/ipc/sem.c
> +++ b/ipc/sem.c
> @@ -2146,11 +2146,11 @@ static long do_semtimedop(int semid, struct sembuf __user *tsops,
>  		sma->complex_count++;
>  	}
>  
> +	__set_current_state(TASK_INTERRUPTIBLE);

This could use a comment; it relies on the fact that we are still in the
same sem_lock() section that added us to the queue.
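
For instance, something along these lines (illustrative wording only, not
text from the patch):

+	/*
+	 * We are still inside the sem_lock() section that put us on the
+	 * pending list; set the task state before dropping the lock so
+	 * that a waker which finds us queued can never observe
+	 * TASK_RUNNING and skip the wakeup.
+	 */
+	__set_current_state(TASK_INTERRUPTIBLE);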

>  	do {
>  		WRITE_ONCE(queue.status, -EINTR);
>  		queue.sleeper = current;
>  
> -		__set_current_state(TASK_INTERRUPTIBLE);
>  		sem_unlock(sma, locknum);
>  		rcu_read_unlock();
>  
> @@ -2159,6 +2159,24 @@ static long do_semtimedop(int semid, struct sembuf __user *tsops,
>  		else
>  			schedule();
>  
> +		/*
> +		 * A spurious wakeup at the right moment can cause race
> +		 * between the to-be-woken task and the waker leading to
> +		 * missed wakeup. Setting state back to TASK_INTERRUPTIBLE
> +		 * before checking queue.status will ensure that the race
> +		 * won't happen.
> +		 *
> +		 *	CPU0				CPU1
> +		 *
> +		 *  <spurious wakeup>		wake_up_sem_queue_prepare():
> +		 *  state = TASK_INTERRUPTIBLE    status = error
> +		 *				try_to_wake_up():
> +		 *  smp_mb()			  smp_mb()
> +		 *  if (status == -EINTR)	  if (!(p->state & state))
> +		 *    schedule()		    goto out
> +		 */
> +		set_current_state(TASK_INTERRUPTIBLE);
> +
>  		/*
>  		 * fastpath: the semop has completed, either successfully or
>  		 * not, from the syscall pov, is quite irrelevant to us at this
> @@ -2210,6 +2228,7 @@ static long do_semtimedop(int semid, struct sembuf __user *tsops,
>  	sem_unlock(sma, locknum);
>  	rcu_read_unlock();
>  out_free:
> +	__set_current_state(TASK_RUNNING);
>  	if (sops != fast_sops)
>  		kvfree(sops);
>  	return error;
> -- 
> 2.18.1
> 


* Re: [PATCH] ipc/sem: Fix race between to-be-woken task and waker
  2019-09-26  9:34 ` Peter Zijlstra
@ 2019-09-26 18:12   ` Waiman Long
  2019-09-27  4:59     ` Manfred Spraul
  0 siblings, 1 reply; 6+ messages in thread
From: Waiman Long @ 2019-09-26 18:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, linux-kernel, Arnd Bergmann, Gustavo A. R. Silva,
	Catalin Marinas, Mathieu Malaterre, Davidlohr Bueso, manfred

On 9/26/19 5:34 AM, Peter Zijlstra wrote:
> On Fri, Sep 20, 2019 at 11:54:02AM -0400, Waiman Long wrote:
>> While looking at a customer bug report about a potential missed wakeup in
>> the System V semaphore code, I spotted a potential problem.  The fact that
>> the semaphore waiter stays in the TASK_RUNNING state while checking the
>> queue status may lead to a missed wakeup if a spurious wakeup happens at
>> the right moment, as try_to_wake_up() does nothing if the task state isn't
>> right.
>>
>> To eliminate this possibility, the task state is now reset to
>> TASK_INTERRUPTIBLE immediately after wakeup before checking the queue
>> status. This should eliminate the race condition on the interaction
>> between the queue status and the task state and fix the potential missed
>> wakeup problem.
> Bah, this code always makes my head hurt.
>
> Yes, AFAICT the pattern it uses has been broken since 0a2b9d4c7967,
> since that removed doing the actual wakeup from under the sem_lock(),
> which is what it relies on.

After taking a second look at the code, I probably misread it the first
time around. In the sleeping path, both the check of queue.status and the
setting of the task state are done under the sem lock. So as long as the
setting of the queue status is also done under the lock, they should
synchronize properly.
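
For reference, the existing sleeping path looks roughly like this
(paraphrased, with the timeout and validity checks omitted):

	do {
		/* still holding sem_lock() from the queueing step */
		WRITE_ONCE(queue.status, -EINTR);
		queue.sleeper = current;
		__set_current_state(TASK_INTERRUPTIBLE);	/* under the lock */
		sem_unlock(sma, locknum);
		rcu_read_unlock();

		schedule();

		error = READ_ONCE(queue.status);	/* lockless fastpath... */
		if (error != -EINTR)
			goto out_free;

		rcu_read_lock();
		locknum = sem_lock(sma, sops, nsops);
		error = READ_ONCE(queue.status);	/* ...re-checked under the lock */
	} while (error == -EINTR);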

It looks like the queue status setting is done under the lock, but I can't
use lockdep to confirm that, as the locking can be done with either the
array lock or one of the per-semaphore spinlocks in the array. Are you
aware of a way of doing that?
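
The closest I have come up with is an assertion helper keyed off the value
that sem_lock() returns; a rough, untested sketch (assuming the current
ipc/sem.c internals, i.e. SEM_GLOBAL_LOCK and the sma->sems[] array):

	static void sem_assert_locked(struct sem_array *sma, int locknum)
	{
		if (locknum == SEM_GLOBAL_LOCK)
			lockdep_assert_held(&sma->sem_perm.lock);
		else
			lockdep_assert_held(&sma->sems[locknum].lock);
	}

That only helps in paths where locknum is at hand, though, so it doesn't
cover the wake-up side directly.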

Anyway, I do think we need to add some comments to clarify the situation
to avoid future confusion.

Cheers,
Longman



* Re: [PATCH] ipc/sem: Fix race between to-be-woken task and waker
  2019-09-26 18:12   ` Waiman Long
@ 2019-09-27  4:59     ` Manfred Spraul
  0 siblings, 0 replies; 6+ messages in thread
From: Manfred Spraul @ 2019-09-27  4:59 UTC (permalink / raw)
  To: Waiman Long, Peter Zijlstra
  Cc: Andrew Morton, linux-kernel, Arnd Bergmann, Gustavo A. R. Silva,
	Catalin Marinas, Mathieu Malaterre, Davidlohr Bueso

Hi,
On 9/26/19 8:12 PM, Waiman Long wrote:
> On 9/26/19 5:34 AM, Peter Zijlstra wrote:
>> On Fri, Sep 20, 2019 at 11:54:02AM -0400, Waiman Long wrote:
>>> While looking at a customer bug report about a potential missed wakeup in
>>> the System V semaphore code, I spotted a potential problem.  The fact that
>>> the semaphore waiter stays in the TASK_RUNNING state while checking the
>>> queue status may lead to a missed wakeup if a spurious wakeup happens at
>>> the right moment, as try_to_wake_up() does nothing if the task state isn't
>>> right.
>>>
>>> To eliminate this possibility, the task state is now reset to
>>> TASK_INTERRUPTIBLE immediately after wakeup before checking the queue
>>> status. This should eliminate the race condition on the interaction
>>> between the queue status and the task state and fix the potential missed
>>> wakeup problem.
You are obviously right, there is a huge race condition.
>> Bah, this code always makes my head hurt.
>>
>> Yes, AFAICT the pattern it uses has been broken since 0a2b9d4c7967,
>> since that removed doing the actual wakeup from under the sem_lock(),
>> which is what it relies on.

Correct - I've overlooked that.

First, theory:

Setting queue->status, reading queue->status, and setting
current->state = TASK_INTERRUPTIBLE are all done under the correct spinlock.

(There is an opportunistic read of queue->status without locks, but it
is retried once the lock has been acquired.)

Setting current->state = TASK_RUNNING happens outside of any lock.

So as far as current->state is concerned, the lock doesn't exist. And if 
the lock doesn't exist, we must follow the rules applicable for 
set_current_state().
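
For reference, those rules are the usual pattern documented next to
set_current_state() in include/linux/sched.h, roughly:

	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);	/* store state, then full barrier */
		if (CONDITION)				/* re-check after the barrier */
			break;
		schedule();
	}
	__set_current_state(TASK_RUNNING);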

I'll try to check the code this week.

And we should check the remaining wake_q users; the logic is
identical everywhere.

> After taking a second look at the code, I probably misread it the first
> time around. In the sleeping path, both the check of queue.status and the
> setting of the task state are done under the sem lock. So as long as the
> setting of the queue status is also done under the lock, they should
> synchronize properly.
>
> It looks like the queue status setting is done under the lock, but I can't
> use lockdep to confirm that, as the locking can be done with either the
> array lock or one of the per-semaphore spinlocks in the array. Are you
> aware of a way of doing that?

For testing? Have you considered just always using the global lock?

(untested):

--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -370,7 +370,7 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
         struct sem *sem;
         int idx;

-       if (nsops != 1) {
+       if (nsops != 1 || 1) {
                 /* Complex operation - acquire a full lock */
                 ipc_lock_object(&sma->sem_perm);


> Anyway, I do think we need to add some comments to clarify the situation
> to avoid future confusion.

Around line 190 is the comment that explains locking & memory ordering.

I have only documented the contents of sem_undo and sem_array, but
neither the queue nor current->state :-(


--

     Manfred




* Re: [PATCH] ipc/sem: Fix race between to-be-woken task and waker
  2019-09-29 10:24 ` Manfred Spraul
@ 2019-09-30 13:53   ` Waiman Long
  0 siblings, 0 replies; 6+ messages in thread
From: Waiman Long @ 2019-09-30 13:53 UTC (permalink / raw)
  To: Manfred Spraul, Davidlohr Bueso, Peter Zijlstra
  Cc: Linux Kernel Mailing List, 1vier1

On 9/29/19 6:24 AM, Manfred Spraul wrote:
> Hi Waiman,
>
> I have now written this mail 3 times:
> Twice I thought that I found a race, but during further analysis, it
> always turned out that the spin_lock() is sufficient.
>
> First, to rule out the obvious: until the series containing e.g.
> 27d7be1801a4824e, there was a race inside sem_lock().
>
> Thus it was possible that multiple threads were operating on the same
> semaphore array, with obviously arbitrary impact.
>
I was saying that there was a race because my initial analysis of the
code seemed to indicate that one was possible.


> On 9/20/19 5:54 PM, Waiman Long wrote:
>
>>
>> +		/*
>> +		 * A spurious wakeup at the right moment can cause race
>> +		 * between the to-be-woken task and the waker leading to
>> +		 * missed wakeup. Setting state back to TASK_INTERRUPTIBLE
>> +		 * before checking queue.status will ensure that the race
>> +		 * won't happen.
>> +		 *
>> +		 *	CPU0				CPU1
>> +		 *
>> +		 *  <spurious wakeup>		wake_up_sem_queue_prepare():
>> +		 *  state = TASK_INTERRUPTIBLE    status = error
>> +		 *				try_to_wake_up():
>> +		 *  smp_mb()			  smp_mb()
>> +		 *  if (status == -EINTR)	  if (!(p->state & state))
>> +		 *    schedule()		    goto out
>> +		 */
>> +		set_current_state(TASK_INTERRUPTIBLE);
>> +
>
> So the hypothesis is that we have a race due to the optimization
> within try_to_wake_up():
> If the task state is already TASK_RUNNING, then the wakeup is a nop.
>
> Correct?
>
> The waker wants to use:
>
>     lock();
>     set_conditions();
>     unlock();
>
> as the wake_q is a shared list, completely asynchronously this will
> happen:
>
>     smp_mb(); //// ***1
>     if (current->state == TASK_INTERRUPTIBLE) current->state = TASK_RUNNING;
>
> The only guarantee is that this will happen after lock(); it may
> happen before set_conditions().
>
> The task that goes to sleep uses:
>
>     lock();
>     check_conditions();
>     __set_current_state();
>     unlock(); //// ***2
>     schedule();
>
> You propose to change that to:
>
>     lock();
>     set_current_state();
>     check_conditions();
>     unlock();
>     schedule();
>
> I don't see a race anymore, and I don't see how the proposed change
> will help.
> e.g.: __set_current_state() and smp_mb() have paired memory barriers
> ***1 and ***2 above. 

Now that I have taken a second look at it, I agree that the current code
should be fine. My only comment is that we should probably add extra
comments to clarify the situation so that it won't get messed up in the
future.

Cheers,
Longman



* Re: [PATCH] ipc/sem: Fix race between to-be-woken task and waker
       [not found] <20190920155402.28996-1-longman () redhat ! com>
@ 2019-09-29 10:24 ` Manfred Spraul
  2019-09-30 13:53   ` Waiman Long
  0 siblings, 1 reply; 6+ messages in thread
From: Manfred Spraul @ 2019-09-29 10:24 UTC (permalink / raw)
  To: Waiman Long, Davidlohr Bueso, Peter Zijlstra
  Cc: Linux Kernel Mailing List, 1vier1

Hi Waiman,

I have now written this mail 3 times:
Twice I thought that I found a race, but during further analysis, it
always turned out that the spin_lock() is sufficient.

First, to rule out the obvious: until the series containing e.g.
27d7be1801a4824e, there was a race inside sem_lock().

Thus it was possible that multiple threads were operating on the same 
semaphore array, with obviously arbitrary impact.

On 9/20/19 5:54 PM, Waiman Long wrote:

>   
> +		/*
> +		 * A spurious wakeup at the right moment can cause race
> +		 * between the to-be-woken task and the waker leading to
> +		 * missed wakeup. Setting state back to TASK_INTERRUPTIBLE
> +		 * before checking queue.status will ensure that the race
> +		 * won't happen.
> +		 *
> +		 *	CPU0				CPU1
> +		 *
> +		 *  <spurious wakeup>		wake_up_sem_queue_prepare():
> +		 *  state = TASK_INTERRUPTIBLE    status = error
> +		 *				try_to_wake_up():
> +		 *  smp_mb()			  smp_mb()
> +		 *  if (status == -EINTR)	  if (!(p->state & state))
> +		 *    schedule()		    goto out
> +		 */
> +		set_current_state(TASK_INTERRUPTIBLE);
> +

So the hypothesis is that we have a race due to the optimization
within try_to_wake_up():
If the task state is already TASK_RUNNING, then the wakeup is a nop.
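
That is, the early-exit check at the top of try_to_wake_up(), paraphrased
from the scheduler code of the kernels discussed here:

	raw_spin_lock_irqsave(&p->pi_lock, flags);
	smp_mb__after_spinlock();	/* the smp_mb() / ***1 below */
	if (!(p->state & state))	/* task not in a matching sleep state */
		goto out;		/* the wakeup is a nop */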

Correct?

The waker wants to use:

     lock();
     set_conditions();
     unlock();

as the wake_q is a shared list, completely asynchronously this will happen:

     smp_mb(); //// ***1
     if (current->state == TASK_INTERRUPTIBLE) current->state = TASK_RUNNING;

The only guarantee is that this will happen after lock(); it may happen
before set_conditions().

The task that goes to sleep uses:

     lock();
     check_conditions();
     __set_current_state();
     unlock(); //// ***2
     schedule();

You propose to change that to:

     lock();
     set_current_state();
     check_conditions();
     unlock();
     schedule();

I don't see a race anymore, and I don't see how the proposed change will 
help.
e.g.: __set_current_state() and smp_mb() have paired memory barriers 
***1 and ***2 above.

--

     Manfred




Thread overview: 6+ messages
2019-09-20 15:54 [PATCH] ipc/sem: Fix race between to-be-woken task and waker Waiman Long
2019-09-26  9:34 ` Peter Zijlstra
2019-09-26 18:12   ` Waiman Long
2019-09-27  4:59     ` Manfred Spraul
     [not found] <20190920155402.28996-1-longman () redhat ! com>
2019-09-29 10:24 ` Manfred Spraul
2019-09-30 13:53   ` Waiman Long
