From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from ozlabs.org (bilbo.ozlabs.org [203.11.71.1])
 (using TLSv1.2 with cipher ADH-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by lists.ozlabs.org (Postfix) with ESMTPS id 41mPZ55vRCzDqBy
 for ; Thu, 9 Aug 2018 20:26:53 +1000 (AEST)
Received: from ozlabs.org (bilbo.ozlabs.org [203.11.71.1])
 by bilbo.ozlabs.org (Postfix) with ESMTP id 41mPZ55MF6z8tBY
 for ; Thu, 9 Aug 2018 20:26:53 +1000 (AEST)
Received: from mx1.suse.de (mx2.suse.de [195.135.220.15])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by ozlabs.org (Postfix) with ESMTPS id 41mPZ45C50z9s0n
 for ; Thu, 9 Aug 2018 20:26:52 +1000 (AEST)
Date: Thu, 9 Aug 2018 12:26:46 +0200
From: Michal Suchánek
To: Nicholas Piggin
Cc: Ananth N Mavinakayanahalli , "Aneesh Kumar K.V" ,
 Michal Suchanek , Mahesh J Salgaonkar , linuxppc-dev ,
 "Aneesh Kumar K.V" , Laurent Dufour
Subject: Re: [PATCH v7 8/9] powerpc/mce: Add sysctl control for recovery action on MCE.
Message-ID: <20180809122646.2f1a827d@naga.suse.cz>
In-Reply-To: <20180809183333.6097d5ec@roar.ozlabs.ibm.com>
References: <153365127532.14256.1965469477086140841.stgit@jupiter.in.ibm.com>
 <153365146712.14256.11869543914717297278.stgit@jupiter.in.ibm.com>
 <87o9ecaovz.fsf@concordia.ellerman.id.au>
 <87d0us9hgg.fsf@concordia.ellerman.id.au>
 <20180809180253.5665ddf5@roar.ozlabs.ibm.com>
 <20180809080945.5wgxevm5oq7otbpe@in.ibm.com>
 <20180809183333.6097d5ec@roar.ozlabs.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
List-Id: Linux on PowerPC Developers Mail List
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,

On Thu, 9 Aug 2018 18:33:33 +1000
Nicholas Piggin wrote:

> On Thu, 9 Aug 2018 13:39:45 +0530
> Ananth N Mavinakayanahalli wrote:
>
> > On Thu, Aug 09, 2018 at 06:02:53PM +1000, Nicholas Piggin wrote:
> > > On Thu, 09 Aug 2018 16:34:07 +1000
> > > Michael Ellerman wrote:
> > >
> > > > "Aneesh Kumar K.V" writes:
> > > > > On 08/08/2018 08:26 PM, Michael Ellerman wrote:
> > > > >> Mahesh J Salgaonkar writes:
> > > > >>> From: Mahesh Salgaonkar
> > > > >>>
> > > > >>> Introduce recovery action for recovered memory errors
> > > > >>> (MCEs). There are soft memory errors like SLB Multihit,
> > > > >>> which can be a result of a bad hardware OR software BUG.
> > > > >>> Kernel can easily recover from these soft errors by
> > > > >>> flushing SLB contents. After the recovery kernel can still
> > > > >>> continue to function without any issue. But in some
> > > > >>> scenario's we may keep getting these soft errors until the
> > > > >>> root cause is fixed. To be able to analyze and find the
> > > > >>> root cause, best way is to gather enough data and system
> > > > >>> state at the time of MCE. Hence this patch introduces a
> > > > >>> sysctl knob where user can decide either to continue after
> > > > >>> recovery or panic the kernel to capture the dump.
> > > > >>
> > > > >> I'm not convinced we want this.
> > > > >>
> > > > >> As we've discovered it's often not possible to reconstruct
> > > > >> what happened based on a dump anyway.
> > > > >>
> > > > >> The key thing you need is the content of the SLB and that's
> > > > >> not included in a dump.
> > > > >>
> > > > >> So I think we should dump the SLB content when we get the
> > > > >> MCE (which this series does) and any other useful info, and
> > > > >> then if we can recover we should.
> > > > >
> > > > > The reasoning there is what if we got multi-hit due to some
> > > > > corruption in slb_cache_ptr. ie. some part of kernel is
> > > > > wrongly updating the paca data structure due to wrong
> > > > > pointer. Now that is far fetched, but then possible right?.
> > > > > Hence the idea that, if we don't have much insight into why a
> > > > > slb multi-hit occur from the dmesg which include slb content,
> > > > > slb_cache contents etc, there should be an easy way to force
> > > > > a dump that might assist in further debug.
> > > >
> > > > If you're debugging something complex that you can't determine
> > > > from the SLB dump then you should be running a debug kernel
> > > > anyway. And if anything you want to drop into xmon and sit
> > > > there, preserving the most state, rather than taking a dump.
> > >
> > > I'm not saying for a dump specifically, just some form of crash.
> > > And we really should have an option to xmon on panic, but that's
> > > another story.
> >
> > That's fine during development or in a lab, not something we could
> > enforce in a customer environment, could we?
>
> xmon on panic? Not something to enforce but IMO (without thinking
> about it too much but having encountered it several times) it should
> probably be tied xmon on BUG option.

You should get that with this patch and xmon=on or am I missing
something?

Thanks

Michal