From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S932898AbaICNjs (ORCPT ); Wed, 3 Sep 2014 09:39:48 -0400
Received: from mx1.redhat.com ([209.132.183.28]:30648 "EHLO mx1.redhat.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S932357AbaICNjg (ORCPT ); Wed, 3 Sep 2014 09:39:36 -0400
Date: Wed, 3 Sep 2014 15:36:40 +0200
From: Oleg Nesterov
To: Peter Zijlstra
Cc: Kautuk Consul, Ingo Molnar, Andrew Morton, Michal Hocko,
        David Rientjes, Ionut Alexa, Guillaume Morin,
        linux-kernel@vger.kernel.org, Kirill Tkhai
Subject: Re: [PATCH 1/1] do_exit(): Solve possibility of BUG() due to race with try_to_wake_up()
Message-ID: <20140903133640.GA25439@redhat.com>
References: <1408964064-21447-1-git-send-email-consul.kautuk@gmail.com>
        <20140825155738.GA5944@redhat.com>
        <20140901153935.GQ27892@worktop.ger.corp.intel.com>
        <20140901175851.GA15210@redhat.com>
        <20140901190931.GD5806@worktop.ger.corp.intel.com>
        <20140902155208.GA28668@redhat.com>
        <20140902164714.GA17033@redhat.com>
        <20140902173910.GF27892@worktop.ger.corp.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20140902173910.GF27892@worktop.ger.corp.intel.com>
User-Agent: Mutt/1.5.18 (2008-05-17)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Peter, sorry for slow responses.

On 09/02, Peter Zijlstra wrote:
>
> On Tue, Sep 02, 2014 at 06:47:14PM +0200, Oleg Nesterov wrote:
>
> > But since I already wrote v2 yesterday, let me show it anyway. Perhaps
> > you will notice something wrong immediately...
> >
> > So, once again, this patch adds the ugly "goto" into schedule(). OTOH,
> > it removes the ugly spin_unlock_wait(pi_lock).
>
> But schedule() is called _far_ more often than exit(). It would be
> really good not to have to do that.

Yes sure, performance-wise this is not a win.
My point was, this way the whole "last schedule" logic becomes very
simple. But OK, I buy your nack. I understand that we should not penalize
__schedule() if possible. Let's forget this patch.

> > TASK_DEAD can die. The only valid user is schedule_debug(), trivial to
> > change. The usage of TASK_DEAD in task_numa_fault() is wrong in any case.
> >
> > In fact, I think that the next change can change exit_schedule() to use
> > PREEMPT_ACTIVE, and then we can simply remove the TASK_DEAD check in
> > schedule_debug().
>
> So you worry about concurrent wakeups vs setting TASK_DEAD and thereby
> losing it, right?
>
> Would not something like:
>
>         spin_lock_irq(&current->pi_lock);
>         __set_current_state(TASK_DEAD);
>         spin_unlock_irq(&current->pi_lock);

Sure. This should obviously fix the problem. And, I think, another mb()
after unlock_wait should fix it as well.

> Not be race free and similarly expensive to the smp_mb() we have there
> now?

Ah, I simply do not know what is cheaper, even on x86. Well, we need
to enable/disable irqs, but again I do not really know how much this
costs.

I can't even say what (imo) looks better, lock/unlock above or

        // Ensure that the previous __set_current_state(RUNNING) can't
        // leak after spin_unlock_wait()
        smp_mb();
        spin_unlock_wait();
        // Another mb to ensure this too can't be reordered with unlock_wait
        set_current_state(TASK_DEAD);

What do you think looks better?

Oleg.
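
For readers following the thread, the lock/unlock variant quoted above can be
modelled in userspace. This is only a rough sketch under stated assumptions,
not kernel code: a pthread mutex stands in for pi_lock, the state values are
illustrative, and struct task, wake_up_task(), mark_dead() and demo() are
hypothetical names. It shows why taking the lock around the TASK_DEAD store
should be enough: a waker either finishes before the store, or starts after it
and finds that TASK_DEAD does not match its wakeup mask.

```c
#include <assert.h>
#include <pthread.h>

/* Illustrative state bits; the values are assumptions for this sketch. */
enum { TASK_RUNNING = 0, TASK_UNINTERRUPTIBLE = 2, TASK_DEAD = 64 };

/* Hypothetical stand-in for the two task_struct fields that matter here. */
struct task {
        pthread_mutex_t pi_lock;
        int state;
};

/*
 * Model of a waker: the state check and the store of TASK_RUNNING both
 * happen under pi_lock. Returns 1 if the task was woken, 0 otherwise.
 */
static int wake_up_task(struct task *t, int mask)
{
        int woken = 0;

        pthread_mutex_lock(&t->pi_lock);
        if (t->state & mask) {
                t->state = TASK_RUNNING;
                woken = 1;
        }
        pthread_mutex_unlock(&t->pi_lock);
        return woken;
}

/*
 * Model of the exiting task: storing TASK_DEAD under pi_lock means no
 * waker can sit between its check and its store while we write.
 */
static void mark_dead(struct task *t)
{
        pthread_mutex_lock(&t->pi_lock);
        t->state = TASK_DEAD;
        pthread_mutex_unlock(&t->pi_lock);
}

/* Single-threaded walk through the two possible orderings. */
static void demo(void)
{
        struct task t = { PTHREAD_MUTEX_INITIALIZER, TASK_UNINTERRUPTIBLE };

        /* Waker runs first: it wakes the task, then TASK_DEAD overwrites. */
        assert(wake_up_task(&t, TASK_UNINTERRUPTIBLE) == 1);
        mark_dead(&t);
        assert(t.state == TASK_DEAD);

        /* Waker runs second: TASK_DEAD does not match the mask, so no-op. */
        assert(wake_up_task(&t, TASK_UNINTERRUPTIBLE) == 0);
        assert(t.state == TASK_DEAD);
}
```

The sketch says nothing about cost, which is the open question in this mail;
it only illustrates the serialization argument. The smp_mb() plus
spin_unlock_wait() variant relies on memory-ordering guarantees that a mutex
model like this cannot express.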