From: Nicholas Piggin <npiggin@gmail.com>
To: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com>,
linuxppc-dev <linuxppc-dev@ozlabs.org>,
"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>,
Michal Suchanek <msuchanek@suse.com>,
Ananth Narayan <ananth@in.ibm.com>,
Laurent Dufour <ldufour@linux.vnet.ibm.com>
Subject: Re: [PATCH v7 8/9] powerpc/mce: Add sysctl control for recovery action on MCE.
Date: Thu, 9 Aug 2018 11:43:30 +1000
Message-ID: <20180809114330.7cad9866@roar.ozlabs.ibm.com>
In-Reply-To: <87o9ecaovz.fsf@concordia.ellerman.id.au>
On Thu, 09 Aug 2018 00:56:00 +1000
Michael Ellerman <mpe@ellerman.id.au> wrote:
> Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:
> > From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> >
> > Introduce recovery action for recovered memory errors (MCEs). There are
> > soft memory errors like SLB Multihit, which can be a result of a bad
> > hardware OR software BUG. Kernel can easily recover from these soft errors
> > by flushing SLB contents. After the recovery kernel can still continue to
> > function without any issue. But in some scenario's we may keep getting
> > these soft errors until the root cause is fixed. To be able to analyze and
> > find the root cause, best way is to gather enough data and system state at
> > the time of MCE. Hence this patch introduces a sysctl knob where user can
> > decide either to continue after recovery or panic the kernel to capture the
> > dump.
>
> I'm not convinced we want this.
>
> As we've discovered it's often not possible to reconstruct what happened
> based on a dump anyway.
>
> The key thing you need is the content of the SLB and that's not included
> in a dump.
>
> So I think we should dump the SLB content when we get the MCE (which
> this series does) and any other useful info, and then if we can recover
> we should.

Yeah it's a lot of knobs that administrators can hardly be expected to
tune. Hypervisor or firmware should really eventually make the MCE
unrecoverable if we aren't making progress.

That said, x86 has a bunch of options, and for debugging a rare crash
or specialised installations it might be useful. But we should follow
the normal format, /proc/sys/kernel/panic_on_mce.

Thanks,
Nick
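
For illustration only, a knob in the suggested format could be checked from userspace roughly as below. This is a minimal sketch under assumptions: the path /proc/sys/kernel/panic_on_mce is the name proposed in this message and may not exist on any given kernel, the helper names read_knob and describe are hypothetical, and the "non-zero means panic" reading is assumed from the patch description rather than confirmed.

```python
from pathlib import Path

# Assumption: the knob name proposed in this thread; it may not exist.
KNOB = Path("/proc/sys/kernel/panic_on_mce")


def read_knob(path=KNOB):
    """Return the knob value as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # Missing file, no permission, or non-numeric content.
        return None


def describe(value):
    """Map a knob value to the behaviour assumed in this sketch."""
    if value is None:
        return "knob not available"
    # Assumption: non-zero requests a panic after a recovered MCE.
    return "panic after recovered MCE" if value else "continue after recovery"


if __name__ == "__main__":
    print(describe(read_knob()))
```

On a kernel without such a knob the script simply reports that it is not available, so it is safe to run anywhere.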
Thread overview: 39+ messages
2018-08-07 14:15 [PATCH v7 0/9] powerpc/pseries: Machine check handler improvements Mahesh J Salgaonkar
2018-08-07 14:16 ` [PATCH v7 1/9] powerpc/pseries: Avoid using the size greater than RTAS_ERROR_LOG_MAX Mahesh J Salgaonkar
2018-08-07 14:16 ` [PATCH v7 2/9] powerpc/pseries: Defer the logging of rtas error to irq work queue Mahesh J Salgaonkar
2018-08-07 14:16 ` [PATCH v7 3/9] powerpc/pseries: Fix endainness while restoring of r3 in MCE handler Mahesh J Salgaonkar
2018-08-13 11:23 ` [v7, " Michael Ellerman
2018-08-07 14:16 ` [PATCH v7 4/9] powerpc/pseries: Define MCE error event section Mahesh J Salgaonkar
2018-08-08 14:42 ` Michael Ellerman
2018-08-10 10:29 ` Mahesh Jagannath Salgaonkar
2018-08-16 4:14 ` Michael Ellerman
2018-08-16 14:44 ` Segher Boessenkool
2018-08-17 11:22 ` Mahesh Jagannath Salgaonkar
2018-08-07 14:17 ` [PATCH v7 5/9] powerpc/pseries: flush SLB contents on SLB MCE errors Mahesh J Salgaonkar
2018-08-07 16:54 ` Michal Suchánek
2018-08-10 10:30 ` Mahesh Jagannath Salgaonkar
2018-08-08 9:04 ` Nicholas Piggin
2018-08-10 10:30 ` Mahesh Jagannath Salgaonkar
2018-08-07 14:17 ` [PATCH v7 6/9] powerpc/pseries: Display machine check error details Mahesh J Salgaonkar
2018-08-07 14:17 ` [PATCH v7 7/9] powerpc/pseries: Dump the SLB contents on SLB MCE errors Mahesh J Salgaonkar
2018-08-09 1:05 ` Michael Ellerman
2018-08-10 10:32 ` Mahesh Jagannath Salgaonkar
2018-08-10 10:49 ` Mahesh Jagannath Salgaonkar
2018-08-11 4:33 ` Nicholas Piggin
2018-08-13 4:17 ` Mahesh Jagannath Salgaonkar
2018-08-13 14:27 ` Nicholas Piggin
2018-08-14 10:57 ` Mahesh Jagannath Salgaonkar
2018-08-14 12:47 ` Aneesh Kumar K.V
2018-08-07 14:17 ` [PATCH v7 8/9] powerpc/mce: Add sysctl control for recovery action on MCE Mahesh J Salgaonkar
2018-08-08 14:56 ` Michael Ellerman
2018-08-08 15:37 ` Aneesh Kumar K.V
2018-08-08 16:09 ` Michal Suchánek
2018-08-10 11:04 ` Michael Ellerman
2018-08-09 6:34 ` Michael Ellerman
2018-08-09 8:02 ` Nicholas Piggin
2018-08-09 8:09 ` Ananth N Mavinakayanahalli
2018-08-09 8:33 ` Nicholas Piggin
2018-08-09 10:26 ` Michal Suchánek
2018-08-10 7:31 ` Nicholas Piggin
2018-08-09 1:43 ` Nicholas Piggin [this message]
2018-08-07 14:18 ` [PATCH v7 9/9] powernv/pseries: consolidate code for mce early handling Mahesh J Salgaonkar