linuxppc-dev.lists.ozlabs.org archive mirror
From: Nicholas Piggin <npiggin@gmail.com>
To: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com>,
	linuxppc-dev <linuxppc-dev@ozlabs.org>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>,
	Michal Suchanek <msuchanek@suse.com>,
	Laurent Dufour <ldufour@linux.vnet.ibm.com>
Subject: Re: [PATCH v7 8/9] powerpc/mce: Add sysctl control for recovery action on MCE.
Date: Thu, 9 Aug 2018 18:33:33 +1000
Message-ID: <20180809183333.6097d5ec@roar.ozlabs.ibm.com>
In-Reply-To: <20180809080945.5wgxevm5oq7otbpe@in.ibm.com>

On Thu, 9 Aug 2018 13:39:45 +0530
Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> wrote:

> On Thu, Aug 09, 2018 at 06:02:53PM +1000, Nicholas Piggin wrote:
> > On Thu, 09 Aug 2018 16:34:07 +1000
> > Michael Ellerman <mpe@ellerman.id.au> wrote:
> >   
> > > "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:  
> > > > On 08/08/2018 08:26 PM, Michael Ellerman wrote:    
> > > >> Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:    
> > > >>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> > > >>>
> > > >>> Introduce recovery action for recovered memory errors (MCEs). There are
> > > >>> soft memory errors like SLB Multihit, which can be a result of a bad
> > > >>> hardware OR software BUG. Kernel can easily recover from these soft errors
> > > >>> by flushing SLB contents. After the recovery kernel can still continue to
> > > >>> function without any issue. But in some scenario's we may keep getting
> > > >>> these soft errors until the root cause is fixed. To be able to analyze and
> > > >>> find the root cause, best way is to gather enough data and system state at
> > > >>> the time of MCE. Hence this patch introduces a sysctl knob where user can
> > > >>> decide either to continue after recovery or panic the kernel to capture the
> > > >>> dump.    
> > > >> 
> > > >> I'm not convinced we want this.
> > > >> 
> > > >> As we've discovered it's often not possible to reconstruct what happened
> > > >> based on a dump anyway.
> > > >> 
> > > >> The key thing you need is the content of the SLB and that's not included
> > > >> in a dump.
> > > >> 
> > > >> So I think we should dump the SLB content when we get the MCE (which
> > > >> this series does) and any other useful info, and then if we can recover
> > > >> we should.    
> > > >
> > > > The reasoning there is what if we got multi-hit due to some corruption 
> > > > in slb_cache_ptr. ie. some part of kernel is wrongly updating the paca 
> > > > data structure due to wrong pointer. Now that is far fetched, but then 
> > > > possible right?. Hence the idea that, if we don't have much insight into 
> > > > why a slb multi-hit occur from the dmesg which include slb content, 
> > > > slb_cache contents etc, there should be an easy way to force a dump that 
> > > > might assist in further debug.    
> > > 
> > > If you're debugging something complex that you can't determine from the
> > > SLB dump then you should be running a debug kernel anyway. And if
> > > anything you want to drop into xmon and sit there, preserving the most
> > > state, rather than taking a dump.  
> > 
> > I'm not saying for a dump specifically, just some form of crash. And we
> > really should have an option to xmon on panic, but that's another story.  
> 
> That's fine during development or in a lab, not something we could
> enforce in a customer environment, could we?

xmon on panic? Not something to enforce, but IMO (without thinking about
it too much, but having encountered it several times) it should probably
be tied to the xmon on BUG option.

> 
> > I think HA/failover kind of environments use options like this too. If
> > anything starts going bad they don't want to try limping along but stop
> > ASAP.  
> 
> Right. And in this particular case, can we guarantee no corruption
> (leading to or post the multihit recovery) when running a customer workload,
> is the question...

I think that's an element of it. If SLB corruption is caused by
software then we could already have memory corruption. If it's hardware
then presumably we're supposed to have some guarantee of error rates.
But still you would say a machine that has taken no MCEs is less likely
to have a problem than one that has taken some MCEs!

It's not just corruption either, I've run into bugs where we get huge
streams of HMIs for example which all get recovered properly but
performance would have been in the toilet.

Anyway, since this is policy, maybe we could drop this patch out of the
SLB MCE series and introduce it afterwards if we think it's necessary.
For an SLB multi-hit caused by a software bug in SLB handling, I'd say
Michael's pretty right about just needing the MCE output with SLB contents.

Thanks,
Nick
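
For readers following the thread, the policy being debated (continue
after a recovered MCE, or panic so a dump can be taken) can be sketched
roughly as below. This is an illustration only and is not taken from the
patch: the enum, its values, and the function name are hypothetical, and
treating every unrecovered error as fatal is a simplifying assumption.

```c
#include <stdbool.h>

/* Hypothetical policy values; the names are illustrative, not from the patch. */
enum mce_recovery_action {
	MCE_ACTION_CONTINUE = 0,	/* keep running after a successful recovery */
	MCE_ACTION_PANIC = 1,		/* stop at once so a dump can be captured */
};

/*
 * Decide whether to stop the machine after a machine check.
 *
 * recovered: the handler repaired the error (e.g. by flushing the SLB).
 * action:    administrator-selected policy, standing in for the proposed
 *            sysctl knob.
 *
 * Simplification (assumption): every unrecovered error is treated as fatal.
 */
static bool mce_should_panic(bool recovered, enum mce_recovery_action action)
{
	if (!recovered)
		return true;

	/* Recovered: only stop if the administrator asked for it. */
	return action == MCE_ACTION_PANIC;
}
```

With the default of MCE_ACTION_CONTINUE a recovered SLB multi-hit would
leave the system running, which matches the behaviour Michael argues for;
an HA/failover setup could select MCE_ACTION_PANIC to stop at the first
sign of trouble.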
