From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758674Ab2AFOTE (ORCPT ); Fri, 6 Jan 2012 09:19:04 -0500 Received: from mx1.redhat.com ([209.132.183.28]:14165 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752166Ab2AFOTD (ORCPT ); Fri, 6 Jan 2012 09:19:03 -0500 Date: Fri, 6 Jan 2012 15:12:58 +0100 From: Oleg Nesterov To: Peter Zijlstra Cc: Yasunori Goto , Ingo Molnar , Hiroyuki KAMEZAWA , Motohiro Kosaki , Linux Kernel ML Subject: Re: [BUG] TASK_DEAD task is able to be woken up in special condition Message-ID: <20120106141258.GB19462@redhat.com> References: <20120106192256.AB15.E1E9C6FF@jp.fujitsu.com> <1325847671.2442.7.camel@twins> <20120106210108.AB18.E1E9C6FF@jp.fujitsu.com> <1325853838.2442.18.camel@twins> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1325853838.2442.18.camel@twins> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/06, Peter Zijlstra wrote: > > On Fri, 2012-01-06 at 21:01 +0900, Yasunori Goto wrote: > > > Do you mean the following patch? > > Yes, something like that. At that point ->state should be TASK_RUNNING > (since we are after all running). The unlock_wait() will synchronize > against any in-progress ttwu() while its fast path is a non-atomic > compare. Any ttwu after this will bail since it will either observe > TASK_RUNNING or TASK_DEAD, neither are a state it will act upon. > > Now the only question that remains is if we need the full memory barrier > or if we can get away with less. > > I guess the mb separates the write to ->state (setting TASK_RUNNING) > from the read of ->pi_lock. The remote CPU must see the TASK_RUNNING, > and we must see ->pi_lock taken if it is. Yes, I think we need the full mb, STORE vs LOAD. > > --- linux-3.2-rc7.orig/kernel/exit.c > > +++ linux-3.2-rc7/kernel/exit.c > > @@ -1038,6 +1038,10 @@ NORET_TYPE void do_exit(long code) > > > > preempt_disable(); > > exit_rcu(); > > + > > + smp_mb(); > > + raw_spin_unlock_wait(&tsk->pi_lock); > > + > > /* causes final put_task_struct in finish_task_switch(). */ > > tsk->state = TASK_DEAD; Interesting. Initially I thought this is wrong and we should do raw_spin_unlock_wait(pi_lock); mb(); tsk->state = TASK_DEAD; This "obviously" serializes LOAD(pi_lock) and STORE(state). But when I re-read your explanation above I think you are right, mb() before unlock_wait() should work too, just it refers to state = RUNNING in the past. But this makes me worry. We are doing a lot of things after exit_mm(). In particular we take tasklist_lock in exit_notify() and then do_exit() takes task_lock(). But every unlock + lock implies mb(). So how it was possible to hit this bug??? Oleg.