Date: Thu, 9 Aug 2018 11:43:30 +1000
From: Nicholas Piggin
To: Michael Ellerman
Cc: Mahesh J Salgaonkar, linuxppc-dev, "Aneesh Kumar K.V", Michal Suchanek, Ananth Narayan, Laurent Dufour
Subject: Re: [PATCH v7 8/9] powerpc/mce: Add sysctl control for recovery action on MCE.
Message-ID: <20180809114330.7cad9866@roar.ozlabs.ibm.com>
In-Reply-To: <87o9ecaovz.fsf@concordia.ellerman.id.au>
References: <153365127532.14256.1965469477086140841.stgit@jupiter.in.ibm.com>
 <153365146712.14256.11869543914717297278.stgit@jupiter.in.ibm.com>
 <87o9ecaovz.fsf@concordia.ellerman.id.au>
List-Id: Linux on PowerPC Developers Mail List

On Thu, 09 Aug 2018 00:56:00 +1000
Michael Ellerman wrote:

> Mahesh J Salgaonkar writes:
> > From: Mahesh Salgaonkar
> >
> > Introduce recovery action for recovered memory errors (MCEs). There are
> > soft memory errors like SLB Multihit, which can be a result of a bad
> > hardware OR software BUG. Kernel can easily recover from these soft errors
> > by flushing SLB contents. After the recovery kernel can still continue to
> > function without any issue. But in some scenario's we may keep getting
> > these soft errors until the root cause is fixed. To be able to analyze and
> > find the root cause, best way is to gather enough data and system state at
> > the time of MCE. Hence this patch introduces a sysctl knob where user can
> > decide either to continue after recovery or panic the kernel to capture the
> > dump.
>
> I'm not convinced we want this.
>
> As we've discovered it's often not possible to reconstruct what happened
> based on a dump anyway.
>
> The key thing you need is the content of the SLB and that's not included
> in a dump.
>
> So I think we should dump the SLB content when we get the MCE (which
> this series does) and any other useful info, and then if we can recover
> we should.

Yeah it's a lot of knobs that administrators can hardly be expected to
tune. Hypervisor or firmware should really eventually make the MCE
unrecoverable if we aren't making progress.

That said, x86 has a bunch of options, and for debugging a rare crash or
specialised installations it might be useful. But we should follow
the normal format, /proc/sys/kernel/panic_on_mce.

Thanks,
Nick