linux-rt-users.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] Fix a lockup in wait_for_completion() and friends
@ 2019-05-08 20:27 minyard
  2019-05-13 12:13 ` Daniel Wagner
  0 siblings, 1 reply; 3+ messages in thread
From: minyard @ 2019-05-08 20:27 UTC (permalink / raw)
  To: linux-rt-users; +Cc: minyard, Corey Minyard

From: Corey Minyard <cminyard@mvista.com>

The function call do_wait_for_common() has a race condition that
can result in lockups waiting for completions.  Adding the thread
to (and removing the thread from) the wait queue for the completion
is done outside the do loop in that function.  However, if the thread
is woken up with swake_up_locked(), that function will delete the
entry from the wait queue.  If that happens and another thread sneaks
in and decrements the done count in the completion to zero, the
loop will go around again, but the thread will no longer be in the
wait queue, so there is no way to wake it up.

Fix it by adding/removing the thread to/from the wait queue inside
the do loop.

Fixes: a04ff6b4ec4ee7e ("completion: Use simple wait queues")
Signed-off-by: Corey Minyard <cminyard@mvista.com>
---
This looks like a fairly serious bug, I guess, but I've never seen a
report on it before.

I found it because I have an out-of-tree feature (hopefully in tree some
day) that takes a core dump of a running process without killing it.  It
makes extensive use of completions, and the test code is fairly brutal.
It didn't lock up on stock 4.19, but failed with the RT patches applied.

The funny thing is, I've never seen this test code fail before on earlier
releases, but it locks up pretty reliably on 4.19-rt.  It looks like this
bug goes back to at least the 4.4-rt kernel.  But we haven't received any
customer reports of failures.

The feature and test are in a public tree if someone wants to try to
reproduce this.  But hopefully this is pretty obvious with the explaination.

Also, you could put the DECLARE_SWAITQUEUE() outside the loop, I think,
but maybe it's cleaner or safer to declare it in the loop?  If someone
cares I can test it that way.

-corey

 kernel/sched/completion.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/completion.c b/kernel/sched/completion.c
index 755a58084978..4cde33cf8b28 100644
--- a/kernel/sched/completion.c
+++ b/kernel/sched/completion.c
@@ -70,10 +70,10 @@ do_wait_for_common(struct completion *x,
 		   long (*action)(long), long timeout, int state)
 {
 	if (!x->done) {
-		DECLARE_SWAITQUEUE(wait);
-
-		__prepare_to_swait(&x->wait, &wait);
 		do {
+			DECLARE_SWAITQUEUE(wait);
+
+			__prepare_to_swait(&x->wait, &wait);
 			if (signal_pending_state(state, current)) {
 				timeout = -ERESTARTSYS;
 				break;
@@ -82,8 +82,8 @@ do_wait_for_common(struct completion *x,
 			raw_spin_unlock_irq(&x->wait.lock);
 			timeout = action(timeout);
 			raw_spin_lock_irq(&x->wait.lock);
+			__finish_swait(&x->wait, &wait);
 		} while (!x->done && timeout);
-		__finish_swait(&x->wait, &wait);
 		if (!x->done)
 			return timeout;
 	}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] Fix a lockup in wait_for_completion() and friends
  2019-05-08 20:27 [PATCH] Fix a lockup in wait_for_completion() and friends minyard
@ 2019-05-13 12:13 ` Daniel Wagner
  2019-05-13 12:19   ` Daniel Wagner
  0 siblings, 1 reply; 3+ messages in thread
From: Daniel Wagner @ 2019-05-13 12:13 UTC (permalink / raw)
  To: minyard, linux-rt-users; +Cc: Corey Minyard, LKML, Peter Zijlstra

Hi Corey,

On 08.05.19 22:27, minyard@acm.org wrote:
> From: Corey Minyard <cminyard@mvista.com>
> 
> The function call do_wait_for_common() has a race condition that
> can result in lockups waiting for completions.  Adding the thread
> to (and removing the thread from) the wait queue for the completion
> is done outside the do loop in that function.  However, if the thread
> is woken up with swake_up_locked(), that function will delete the
> entry from the wait queue.  If that happens and another thread sneaks
> in and decrements the done count in the completion to zero, the
> loop will go around again, but the thread will no longer be in the
> wait queue, so there is no way to wake it up.
> 
> Fix it by adding/removing the thread to/from the wait queue inside
> the do loop.
> 
> Fixes: a04ff6b4ec4ee7e ("completion: Use simple wait queues")
> Signed-off-by: Corey Minyard <cminyard@mvista.com>

Added Peter and lkml to the CC since this is mainline and not -rt only.

Thanks,
Daniel

> ---
> This looks like a fairly serious bug, I guess, but I've never seen a
> report on it before.
> 
> I found it because I have an out-of-tree feature (hopefully in tree some
> day) that takes a core dump of a running process without killing it.  It
> makes extensive use of completions, and the test code is fairly brutal.
> It didn't lock up on stock 4.19, but failed with the RT patches applied.
> 
> The funny thing is, I've never seen this test code fail before on earlier
> releases, but it locks up pretty reliably on 4.19-rt.  It looks like this
> bug goes back to at least the 4.4-rt kernel.  But we haven't received any
> customer reports of failures.
> 
> The feature and test are in a public tree if someone wants to try to
> reproduce this.  But hopefully this is pretty obvious with the explaination.
> 
> Also, you could put the DECLARE_SWAITQUEUE() outside the loop, I think,
> but maybe it's cleaner or safer to declare it in the loop?  If someone
> cares I can test it that way.
> 
> -corey
> 
>  kernel/sched/completion.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/completion.c b/kernel/sched/completion.c
> index 755a58084978..4cde33cf8b28 100644
> --- a/kernel/sched/completion.c
> +++ b/kernel/sched/completion.c
> @@ -70,10 +70,10 @@ do_wait_for_common(struct completion *x,
>  		   long (*action)(long), long timeout, int state)
>  {
>  	if (!x->done) {
> -		DECLARE_SWAITQUEUE(wait);
> -
> -		__prepare_to_swait(&x->wait, &wait);
>  		do {
> +			DECLARE_SWAITQUEUE(wait);
> +
> +			__prepare_to_swait(&x->wait, &wait);
>  			if (signal_pending_state(state, current)) {
>  				timeout = -ERESTARTSYS;
>  				break;
> @@ -82,8 +82,8 @@ do_wait_for_common(struct completion *x,
>  			raw_spin_unlock_irq(&x->wait.lock);
>  			timeout = action(timeout);
>  			raw_spin_lock_irq(&x->wait.lock);
> +			__finish_swait(&x->wait, &wait);
>  		} while (!x->done && timeout);
> -		__finish_swait(&x->wait, &wait);
>  		if (!x->done)
>  			return timeout;
>  	}
> 

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] Fix a lockup in wait_for_completion() and friends
  2019-05-13 12:13 ` Daniel Wagner
@ 2019-05-13 12:19   ` Daniel Wagner
  0 siblings, 0 replies; 3+ messages in thread
From: Daniel Wagner @ 2019-05-13 12:19 UTC (permalink / raw)
  To: minyard, linux-rt-users; +Cc: Corey Minyard, LKML, Peter Zijlstra

> Added Peter and lkml to the CC since this is mainline and not -rt only.

Argh, forget it. was late to the party and missed the discussion.

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2019-05-13 12:19 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-08 20:27 [PATCH] Fix a lockup in wait_for_completion() and friends minyard
2019-05-13 12:13 ` Daniel Wagner
2019-05-13 12:19   ` Daniel Wagner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).