From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1757487Ab2ARNQN (ORCPT ); Wed, 18 Jan 2012 08:16:13 -0500
Received: from e28smtp05.in.ibm.com ([122.248.162.5]:44903 "EHLO
        e28smtp05.in.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1757397Ab2ARNQK (ORCPT ); Wed, 18 Jan 2012 08:16:10 -0500
Message-ID: <4F16C60B.4030903@linux.vnet.ibm.com>
Date: Wed, 18 Jan 2012 18:45:55 +0530
From: "Srivatsa S. Bhat"
User-Agent: Mozilla/5.0 (X11; Linux i686; rv:7.0) Gecko/20110927 Thunderbird/7.0
MIME-Version: 1.0
To: Suresh Siddha
CC: Linus Torvalds , Ming Lei , Djalal Harouni , Borislav Petkov ,
        Tony Luck , Hidetoshi Seto , Ingo Molnar , Andi Kleen ,
        linux-kernel@vger.kernel.org, Greg Kroah-Hartman , Kay Sievers ,
        gouders@et.bocholt.fh-gelsenkirchen.de, Marcos Souza ,
        Linux PM mailing list , "Rafael J. Wysocki" ,
        "tglx@linutronix.de" , prasad@linux.vnet.ibm.com,
        justinmattock@gmail.com, Jeff Chua , Peter Zijlstra , Mel Gorman ,
        Gilad Ben-Yossef , Sergey Senozhatsky
Subject: Re: x86/mce: machine check warning during poweroff
References: <20120111000051.GA28874@dztty>
        <4F10929E.8070007@linux.vnet.ibm.com>
        <4F10BDF7.8030306@linux.vnet.ibm.com>
        <4F10EB5B.5060804@linux.vnet.ibm.com>
        <1326766892.16150.21.camel@sbsiddha-desk.sc.intel.com>
        <4F1544EA.5060907@linux.vnet.ibm.com>
        <1326856624.5291.20.camel@sbsiddha-mobl2>
In-Reply-To: <1326856624.5291.20.camel@sbsiddha-mobl2>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
x-cbid: 12011813-8256-0000-0000-000000EDF276
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 01/18/2012 08:47 AM, Suresh Siddha wrote:
> On Tue, 2012-01-17 at 15:22 +0530, Srivatsa S. Bhat wrote:
>> Thanks for the patch, but unfortunately it doesn't fix the problem!
>> Exactly the same stack traces are seen during a CPU Hotplug stress test.
>> (I didn't even have to stress it - it is so fragile that just a script
>> to offline all cpus except the boot cpu was good enough to reproduce the
>> problem easily.)
>
> hmm, that's weird. with the patch, sched_ilb_notifier() should have
> cleared the cpu going offline from the nohz.idle_cpus_mask. And this
> should have happened after that cpu is removed from active mask. So
> no-one else should add that cpu back to the nohz.idle_cpus_mask and this
> should prevent the issue from happening.
>
> I could reproduce the problem easily with out the patch but when I
> applied the patch I couldn't recreate the issue. Srivatsa, can you
> please re-check the kernel you tested indeed has the fix?
>
> re-Reviewing the code/patch also doesn't give me a hint.
>
>> I have a few questions regarding the synchronization with CPU Hotplug.
>> What guarantees that the code which selects and IPIs the new ilb is totally
>> race-free with respect to CPU hotplug and we will never IPI an offline CPU?
>
> So, nohz_balancer_kick() gets called only from interrupts disabled.
> During that time (from selecting the ilb_cpu to sending the IPI), no cpu
> can go offline. As the offline happens from the stop-machine process
> context with interrupts disabled.
>
> Only thing we need to make sure is the offlined cpu shouldn't be part of
> the nohz.idle_cpus_mask and for post 3.2 code, posted patch ensures
> that.
>
> For 3.2 and before, when a cpu exits tickless idle, it gets removed from
> the nohz.idle_cpus_mask (and also from the nohz.load_balancer). And if
> the cpu is not in the active mask (while going offline), subsequent
> calls to select_nohz_load_balancer() ensures that the cpu going down
> doesn't update the nohz structures. So I thought 3.2 shouldn't exhibit
> this problem.
>
>
>> (As demonstrated above, this issue is in 3.2-rc7
>> as well.)
>
> hmm, don't think we ran into this before 3.2. So, what am I missing from
> the above? I will try to reproduce it on 3.2 too.
>

I tested again on 3.2. I didn't hit those warnings (IPI to offline cpus).
It happens only in the post-3.2 kernel.

Regards,
Srivatsa S. Bhat
IBM Linux Technology Center
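
For readers following the thread, the invariant Suresh describes (a cpu
going offline must be cleared from nohz.idle_cpus_mask after it leaves
the active mask, so nothing can add it back) can be sketched as a small
user-space toy model. This is only an illustration under stated
assumptions, not the kernel code or the posted patch: the three bitmasks,
NR_CPUS = 8, and every function name below (cpu_enter_nohz_idle,
cpu_offline, find_new_ilb, kick_hits_offline_cpu, run_scenario) are
hypothetical stand-ins.

```c
#include <assert.h>

#define NR_CPUS 8

/* Hypothetical stand-ins for the kernel cpumasks: one bit per CPU. */
static unsigned int online_mask;    /* CPUs that may receive an IPI      */
static unsigned int active_mask;    /* CPUs the scheduler may still use  */
static unsigned int nohz_idle_mask; /* toy model of nohz.idle_cpus_mask  */

/* A CPU entering tickless idle sets its bit, but only while still active. */
static void cpu_enter_nohz_idle(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    if (active_mask & (1u << cpu))
        nohz_idle_mask |= 1u << cpu;
}

/* Offline path; apply_fix models a notifier that clears the idle bit. */
static void cpu_offline(int cpu, int apply_fix)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    active_mask &= ~(1u << cpu);        /* 1: leave the active mask     */
    if (apply_fix)
        nohz_idle_mask &= ~(1u << cpu); /* 2: drop from the idle mask   */
    online_mask &= ~(1u << cpu);        /* 3: the CPU is now offline    */
}

/* Pick the first tickless-idle CPU as the idle load balancer, or -1. */
static int find_new_ilb(void)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (nohz_idle_mask & (1u << cpu))
            return cpu;
    return -1;
}

/* Returns 1 if the kick would target a CPU that is already offline. */
static int kick_hits_offline_cpu(void)
{
    int ilb = find_new_ilb();
    return ilb >= 0 && !(online_mask & (1u << ilb));
}

/* CPUs 1 and 2 go tickless idle, then CPU 1 is taken offline. */
static int run_scenario(int apply_fix)
{
    online_mask = active_mask = 0xffu;
    nohz_idle_mask = 0;
    cpu_enter_nohz_idle(1);
    cpu_enter_nohz_idle(2);
    cpu_offline(1, apply_fix);
    cpu_enter_nohz_idle(1); /* refused: CPU 1 is no longer active */
    return kick_hits_offline_cpu();
}
```

In this model, run_scenario(0) leaves the stale bit for CPU 1 in the idle
mask, so the kick picks an offline CPU and the function returns 1;
run_scenario(1) clears the bit during offline and returns 0. The model
says nothing about why the posted patch did not help on the tested
kernel; it only restates the intended ordering.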