From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753768Ab2A2TGV (ORCPT ); Sun, 29 Jan 2012 14:06:21 -0500 Received: from mx1.redhat.com ([209.132.183.28]:36835 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753684Ab2A2TGU (ORCPT ); Sun, 29 Jan 2012 14:06:20 -0500 Date: Sun, 29 Jan 2012 19:59:49 +0100 From: Oleg Nesterov To: Linus Torvalds Cc: mingo@redhat.com, hpa@zytor.com, linux-kernel@vger.kernel.org, a.p.zijlstra@chello.nl, y-goto@jp.fujitsu.com, akpm@linux-foundation.org, tglx@linutronix.de, mingo@elte.hu, linux-tip-commits@vger.kernel.org Subject: Re: [tip:sched/core] sched: Fix ancient race in do_exit() Message-ID: <20120129185949.GA27192@redhat.com> References: <20120117174031.3118.E1E9C6FF@jp.fujitsu.com> <20120129160711.GA20803@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/29, Linus Torvalds wrote: > > But yes, if you're talking about TASK_UNINTERRUPTIBLE, we do need to > just remove the setting of that entirely. It needs to be set *before* > adding us to the list, not after. That's just a bug - we get woken up > when we've been given the lock. Yes, I think this should work although I am not familiar with this code. If we remove set_task_state() from the main waiting loop we can never race with __rwsem_do_wake()->try_to_wake_up() seeing us in UNINTERRUPTIBLE state. rwsem_down_failed_common() simply can't return until UNINTERRUPTIBLE->RUNNING transition is finished (__rwsem_do_wake does wakeup first). And since we do not play with current->state after spin_unlock(), it is fine to "race" with waiter->task clearing, just we can do the unnecessary but harmless schedule() in TASK_RUNNING. > So it may be completely and utterly broken for some subtle reason, Well, what about another spurious wakeup from somewhere? In this case rwsem_down_failed_common() will do a busy-wait loop. > @@ -92,10 +92,9 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wake_type) > */ > list_del(&waiter->list); > tsk = waiter->task; > + wake_up_process(tsk); > smp_mb(); > waiter->task = NULL; OK, now I understand why do we need "clear after wakeup". But then I don't really understand this mb, perhaps wmb() is enough? Afaics we only need to ensure we change waiter->task after changing task's state. OTOH, > @@ -183,7 +181,6 @@ rwsem_down_failed_common(struct rw_semaphore *sem, > raw_spin_lock_irq(&sem->wait_lock); > waiter.task = tsk; > waiter.flags = flags; > - get_task_struct(tsk); > > if (list_empty(&sem->wait_list)) > adjustment += RWSEM_WAITING_BIAS; > @@ -211,11 +208,8 @@ rwsem_down_failed_common(struct rw_semaphore *sem, > if (!waiter.task) > break; > schedule(); > - set_task_state(tsk, TASK_UNINTERRUPTIBLE); > } > > - tsk->state = TASK_RUNNING; > - > return sem; > } Suppose that this task does tsk->state = TASK_WHATEVER after that. It seems that we need mb() before return, otherwise the next ->state change can be reordered with "if (!waiter.task)" above. Or not? Oleg.