linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Oleg Nesterov <oleg@redhat.com>
Cc: mingo@redhat.com, hpa@zytor.com, linux-kernel@vger.kernel.org,
	a.p.zijlstra@chello.nl, y-goto@jp.fujitsu.com,
	akpm@linux-foundation.org, tglx@linutronix.de, mingo@elte.hu,
	linux-tip-commits@vger.kernel.org
Subject: Re: [tip:sched/core] sched: Fix ancient race in do_exit()
Date: Sun, 29 Jan 2012 09:44:42 -0800	[thread overview]
Message-ID: <CA+55aFz20=H0ZubZEAmEbv0gL8j3C5Kgt4m2+5MU63uVipZt0Q@mail.gmail.com> (raw)
In-Reply-To: <20120129160711.GA20803@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 4066 bytes --]

On Sun, Jan 29, 2012 at 8:07 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 01/28, Linus Torvalds wrote:
>>
>> It would be much nicer to just clear the rwsem waiter->task thing
>> *after* waking the task up, which would avoid this race entirely,
>> afaik.
>
> How? The problem is that wake_up_process(tsk) sees this task in
> TASK_UNINTERRUPTIBLE state (the first "p->state & state" check in
> try_to_wake_up), but then this task changes its state to TASK_DEAD
> without schedule() and ttwu() does s/TASK_DEAD/TASK_RUNNING/.

Exactly.

The problem is that THE TASK HAS LONG SINCE GONE AWAY FROM THE RWSEM!

The task is already finishing up the exit (it might even have
*completed* the exit, which is why we take the extra reference count
to the task) when the wakeup happens.

THAT is the problem. The fact that in that big problem ("we do a
spurious wakeup for the rwsem long after the task no longer waits for
it") we then have a really small and tiny race (the race between
TASK_UNINTERRUPTIBLE -> (TASK_RUNNING ->) TASK_DEAD) is just a tiny
small detail.

So the real problem is that rwsem's are broken. Normal wakeups never
have this issue, because they use the spinlock for the waitqueue to
serialize: either a process wakes up in a timely manner, or the
process has removed itself from the wait-queue - you never see wakeups
long after the process has stopped caring.

The whole problem with the exit sequence is just a random detail. The
same spurious wakeup could happen in other places too - it just
doesn't happen to cause an Oops anywhere else (that we know about -
who knows if some code woudl be unhappy with getting woken up
unexpectedly).




>> Tell me, why wouldn't that work? rwsem_down_failed_common() does
>>
>>         /* wait to be given the lock */
>>         for (;;) {
>>                 if (!waiter.task)
>>                         break;
>>                ...
>>
>> so then we wouldn't need the task refcount crap in rwsem either etc,
>> and we'd get rid of all races with wakeup.
>>
>> I wonder why we're clearing that whole waiter->task so early.
>
> I must have missed something. I can't understand how this can help,
> and "clear the rwsem waiter->task thing *after* waking" looks
> obviously wrong.

Why? It helps because it means that the task that did the "down()" on
the rwsem will *never* leave its task pointer around to be spuriously
woken later: it will wait for the waiter->task entry to be NULL - so
that we know that the wakeup path has used the task _and_ woken it up,
so there is no possibility for the task pointer still being active and
used to wake up some *random* state later.

> If we do this, then we can miss the "!!waiter.task"
> condition. The loop above actually does
>
>        set_task_state(TASK_UNINTERRUPTIBLE);
>
>        if (!waiter.task)
>                break;
>        schedule();
>
> and
>        wake_up_process(tsk);
>        waiter->task = NULL;
>
> can happen right after set_task_state().

So? That's the *point*. We want to wait until the wakeup is *done*
before we return from down(). We do *not* want to "break" (and go off
doing something else) while some other CPU may still be using 'task'.

But yes, if you're talking about TASK_UNINTERRUPTIBLE, we do need to
just remove the setting of that entirely. It needs to be set *before*
adding us to the list, not after. That's just a bug - we get woken up
when we've been given the lock.

IOW, I think the fix should look like the attched.  TOTALLY UNTESTED!
This is certainly a much scarier patch, but I think it fixes the
actual *reason* for the problem, rather than just working around the
symptom.

So it may be completely and utterly broken for some subtle reason, but
I don't see what it would be. It seems to clean up and simplify the
logic, and remove all the bogus workarounds for the fact that we used
to do things stupidly.

But maybe there's some reason for those "stupid" things. I just don't see it.

                   Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1273 bytes --]

 lib/rwsem.c |   10 ++--------
 1 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index 410aa1189b13..7afb403918da 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -92,10 +92,9 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 	 */
 	list_del(&waiter->list);
 	tsk = waiter->task;
+	wake_up_process(tsk);
 	smp_mb();
 	waiter->task = NULL;
-	wake_up_process(tsk);
-	put_task_struct(tsk);
 	goto out;
 
  readers_only:
@@ -146,10 +145,9 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type)
 		waiter = list_entry(next, struct rwsem_waiter, list);
 		next = waiter->list.next;
 		tsk = waiter->task;
+		wake_up_process(tsk);
 		smp_mb();
 		waiter->task = NULL;
-		wake_up_process(tsk);
-		put_task_struct(tsk);
 	}
 
 	sem->wait_list.next = next;
@@ -183,7 +181,6 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 	raw_spin_lock_irq(&sem->wait_lock);
 	waiter.task = tsk;
 	waiter.flags = flags;
-	get_task_struct(tsk);
 
 	if (list_empty(&sem->wait_list))
 		adjustment += RWSEM_WAITING_BIAS;
@@ -211,11 +208,8 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
 		if (!waiter.task)
 			break;
 		schedule();
-		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	}
 
-	tsk->state = TASK_RUNNING;
-
 	return sem;
 }
 

  reply	other threads:[~2012-01-29 17:45 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-12-22  0:42 [BUG] TASK_DEAD task is able to be woken up in special condition Yasunori Goto
2011-12-22  2:14 ` KOSAKI Motohiro
2011-12-22  8:22   ` Yasunori Goto
2011-12-22 20:02     ` KOSAKI Motohiro
2011-12-23  9:49 ` Peter Zijlstra
2011-12-23 15:41   ` Oleg Nesterov
2011-12-26  8:23     ` Yasunori Goto
2011-12-26 17:11       ` Oleg Nesterov
2011-12-27  6:48         ` Yasunori Goto
2012-01-06 10:22           ` Yasunori Goto
2012-01-06 11:01             ` Peter Zijlstra
2012-01-06 12:01               ` Yasunori Goto
2012-01-06 12:43                 ` Peter Zijlstra
2012-01-06 14:12                   ` Oleg Nesterov
2012-01-06 14:19                     ` Oleg Nesterov
2012-01-07  1:31                     ` Yasunori Goto
2012-01-16 11:51                       ` Yasunori Goto
2012-01-16 13:38                         ` Peter Zijlstra
2012-01-17  8:40                           ` Yasunori Goto
2012-01-17  9:06                             ` Ingo Molnar
2012-01-17 15:12                               ` Oleg Nesterov
2012-01-18  9:42                                 ` Ingo Molnar
2012-01-18 14:20                                   ` Oleg Nesterov
2012-01-24 10:19                                     ` Peter Zijlstra
2012-01-24 10:55                                       ` Peter Zijlstra
2012-01-24 17:25                                         ` KOSAKI Motohiro
2012-01-25 15:45                                         ` Oleg Nesterov
2012-01-25 16:51                                           ` Peter Zijlstra
2012-01-25 17:43                                             ` Oleg Nesterov
2012-01-26 15:32                                               ` Peter Zijlstra
2012-01-26 16:26                                                 ` Oleg Nesterov
2012-01-27  8:59                                                   ` Peter Zijlstra
2012-01-24 10:11                                   ` Peter Zijlstra
2012-01-26  9:39                                     ` Ingo Molnar
2012-01-28 12:03                             ` [tip:sched/core] sched: Fix ancient race in do_exit() tip-bot for Yasunori Goto
2012-01-28 21:12                               ` Linus Torvalds
2012-01-29 16:07                                 ` Oleg Nesterov
2012-01-29 17:44                                   ` Linus Torvalds [this message]
2012-01-29 18:28                                     ` Linus Torvalds
2012-01-29 18:59                                     ` Oleg Nesterov
2012-01-30 16:27                                       ` Linus Torvalds
2012-01-06 13:48             ` [BUG] TASK_DEAD task is able to be woken up in special condition Oleg Nesterov
2011-12-28 21:07         ` KOSAKI Motohiro
2012-01-24 10:23           ` Peter Zijlstra
2012-01-24 18:01             ` KOSAKI Motohiro
2012-01-25  6:15               ` Mike Galbraith
2012-01-26 21:24                 ` KOSAKI Motohiro
2012-01-25 10:10           ` Peter Zijlstra
2012-01-26 20:25             ` [tip:sched/urgent] sched: Fix rq->nr_uninterruptible update race tip-bot for Peter Zijlstra
2012-01-27  5:20               ` Rakib Mullick
2012-01-27  8:19                 ` Peter Zijlstra
2012-01-27 14:11                   ` Rakib Mullick
2012-01-26 21:21             ` [BUG] TASK_DEAD task is able to be woken up in special condition KOSAKI Motohiro
2012-01-27  8:21               ` Peter Zijlstra
2011-12-26  6:52   ` Yasunori Goto

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CA+55aFz20=H0ZubZEAmEbv0gL8j3C5Kgt4m2+5MU63uVipZt0Q@mail.gmail.com' \
    --to=torvalds@linux-foundation.org \
    --cc=a.p.zijlstra@chello.nl \
    --cc=akpm@linux-foundation.org \
    --cc=hpa@zytor.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-tip-commits@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=mingo@redhat.com \
    --cc=oleg@redhat.com \
    --cc=tglx@linutronix.de \
    --cc=y-goto@jp.fujitsu.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).