From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754557Ab2ANDGV (ORCPT ); Fri, 13 Jan 2012 22:06:21 -0500
Received: from e23smtp08.au.ibm.com ([202.81.31.141]:43872 "EHLO
	e23smtp08.au.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753872Ab2ANDGT (ORCPT );
	Fri, 13 Jan 2012 22:06:19 -0500
Message-ID: <4F10F117.40006@linux.vnet.ibm.com>
Date: Sat, 14 Jan 2012 08:35:59 +0530
From: "Srivatsa S. Bhat"
User-Agent: Mozilla/5.0 (X11; Linux i686; rv:7.0) Gecko/20110927 Thunderbird/7.0
MIME-Version: 1.0
To: Linus Torvalds
CC: Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck,
	Hidetoshi Seto, Ingo Molnar, Andi Kleen,
	linux-kernel@vger.kernel.org, Greg Kroah-Hartman, Kay Sievers,
	gouders@et.bocholt.fh-gelsenkirchen.de, Marcos Souza,
	Linux PM mailing list, "Rafael J. Wysocki",
	"tglx@linutronix.de", prasad@linux.vnet.ibm.com,
	justinmattock@gmail.com, Jeff Chua, Suresh B Siddha,
	Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef
Subject: Re: x86/mce: machine check warning during poweroff
References: <20120111000051.GA28874@dztty>
	<4F10929E.8070007@linux.vnet.ibm.com>
	<4F10BDF7.8030306@linux.vnet.ibm.com>
	<4F10EB5B.5060804@linux.vnet.ibm.com>
In-Reply-To:
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
x-cbid: 12011317-5140-0000-0000-000000969807
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 01/14/2012 08:23 AM, Linus Torvalds wrote:

> On Fri, Jan 13, 2012 at 6:41 PM, Srivatsa S. Bhat
> wrote:
>>
>> YES!! Finally I have a fix for this whole MCE thing! :-)
>
> Goodie.
>
>> The patch below works perfectly for me - I tested multiple CPU hotplug
>> operations as well as multiple pm_test runs at core level. Please let me
>> know if this solves the suspend issue as well..
>
> Ok, I'll try, and I bet it does.
>
> HOWEVER.
>
> I'd be a whole lot happier knowing exactly which field in "struct
> device" that needed to be NULL before it gets registered.
>
> I don't like how
>
>    device_register() + device_create_file(dev)..
>
> is not sufficiently undone by
>
>    .. device_remove_file(dev) + device_unregister()
>
> so that it can't be repeated. Exactly *what* state is stale and
> re-used incorrectly if you do that device_register() a second time.
>
> It smells like a misfeature of the device core handling.
>
> But that does obviously explain why this started happening with a
> fairly straightforward conversion from sysdev to struct device. It
> just makes me worry about any *other* such conversions.
>
> Of course, normal users will allocate and free the memory, so never
> see this "re-use the same piece of memory" issue. But still..
>

I totally agree with you. Even I had set out to find out *exactly* what
was going wrong. After spending a significant amount of time digging
through the code (unsuccessfully), this idea of zeroing out everything
struck me and it worked, as expected.

Yes, it is definitely important to know the exact issue so that we can
fix the driver core and avoid other mishaps, but I guess finding that out
is not all that simple.. as of now I am rather exhausted following those
zillions of pointers continuously for the past few hours.. ;-/

Regards,
Srivatsa S. Bhat
IBM Linux Technology Center
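
For readers following the thread, the "re-use the same piece of memory"
problem and the zeroing workaround can be sketched in plain userspace C.
This is only an illustration under stated assumptions: struct fake_device,
fake_register(), fake_unregister() and the core_private_set field are all
hypothetical and are not kernel APIs, and the thread itself does not
identify which field of struct device actually goes stale.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for an object handed to a "core" layer.
 * None of these names exist in the kernel. */
struct fake_device {
    const char *name;
    int core_private_set;   /* bookkeeping the core sets on register */
};

/* Assumes it is given a fresh (zeroed) object, and refuses otherwise. */
int fake_register(struct fake_device *dev)
{
    if (dev->core_private_set)
        return -1;          /* stale bookkeeping from a previous cycle */
    dev->core_private_set = 1;
    return 0;
}

/* In this sketch, unregister deliberately leaves the bookkeeping behind. */
void fake_unregister(struct fake_device *dev)
{
    (void)dev;              /* stale state intentionally not cleared */
}

/* Statically allocated, so every register cycle sees the same memory. */
static struct fake_device static_dev;

/* Register, unregister, register again without clearing the object. */
int reregister_without_clear(void)
{
    memset(&static_dev, 0, sizeof(static_dev));
    static_dev.name = "example0";
    assert(fake_register(&static_dev) == 0);
    fake_unregister(&static_dev);
    return fake_register(&static_dev);   /* trips over stale state */
}

/* Same cycle, but the object is zeroed before the second registration. */
int reregister_with_clear(void)
{
    memset(&static_dev, 0, sizeof(static_dev));
    static_dev.name = "example0";
    assert(fake_register(&static_dev) == 0);
    fake_unregister(&static_dev);

    memset(&static_dev, 0, sizeof(static_dev));  /* the workaround */
    static_dev.name = "example0";
    return fake_register(&static_dev);
}
```

In this sketch reregister_without_clear() fails on the second registration
while reregister_with_clear() succeeds. A caller that allocates and frees
the object on every cycle would not hit the first case, which matches the
observation quoted above about normal users.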