linux-edac.vger.kernel.org archive mirror
From: Corey Minyard <minyard@acm.org>
To: Borislav Petkov <bp@alien8.de>
Cc: "Luck, Tony" <tony.luck@intel.com>,
	Andy Lutomirski <luto@kernel.org>,
	linux-edac@vger.kernel.org, Corey Minyard <cminyard@mvista.com>,
	hidehiro.kawai.ez@hitachi.com, linfeilong@huawei.com,
	liuzhiqiang26@huawei.com
Subject: Re: [PATCH v2] x86: Fix MCE error handing when kdump is enabled
Date: Wed, 30 Sep 2020 13:49:06 -0500
Message-ID: <20200930184906.GZ3674@minyard.net>
In-Reply-To: <20200930175633.GM6810@zn.tnic>

On Wed, Sep 30, 2020 at 07:56:33PM +0200, Borislav Petkov wrote:
> On Tue, Sep 29, 2020 at 04:16:44PM -0500, minyard@acm.org wrote:
> > From: Corey Minyard <cminyard@mvista.com>
> > 
> > If kdump is enabled, the handling of shooting down CPUs does not use the
> > RESET_VECTOR irq before trying to use NMIs to shoot down the CPUs.
> 
> So I've read that commit message like a bunch of times already and am
> getting none the wiser about what the situation is, who's doing what and
> what is this thing fixing.
> 
> It must be something about kdumping a kernel and an MCE happening at the
> same time and we did something about this a while ago, see:
> 
>  5bc329503e81 ("x86/mce: Handle broadcasted MCE gracefully with kexec")
> 
> and that is simply letting CPUs which are not doing the kexec-ing
> continue from the broadcasted MCE holding pattern so that kexec
> finishes.
> 
> So please explain exactly what this problem is, who's doing what, when
> does the MCE happen etc?
> 
> I've found this:
> 
> https://lkml.kernel.org/r/1600339070-570840-1-git-send-email-wubo40@huawei.com
> 
> and that sounds like the problem and I'm going to read that one in
> detail if that is the issue we're talking about. But from skimming over
> it, it sounds like the commit I mentioned above should take care of it.

That is the original post for this, yes.

Wu, what kernel version are you using?  Can you try to reproduce on the
current mainline kernel?  I just assumed it was current.

The description isn't that great, no.  I'll try again.

The problem is that CPUs waiting in wait_for_panic() in the mce code
have interrupts enabled.  In the kdump case, nothing will wake those
CPUs up, so they will sit there in the loop until they time out.

In the meantime, the CPU handling the panic calls some IPMI code that
stores panic information in the IPMI event log.  Since interrupts are
enabled on the CPUs in wait_for_panic(), those CPUs are handling
interrupts from the IPMI hardware.  They will not, however, handle
the NMI IPI that gets sent from the panic() code for kdump.

The IPMI code has disabled locks to avoid a deadlock if the exception
happens while in IPMI code.  So the panic-handling part of IPMI and the
IPMI interrupt handling are both running at the same time, messing each
other up.
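
To make the sequence above concrete, here is a minimal user-space model
of it.  This is only a sketch, not kernel code: struct model_cpu,
struct model_ipmi, ipmi_enter(), ipmi_exit() and simulate_panic() are
all hypothetical names invented for illustration, and the model only
assumes what is described above (locking is skipped on the panic path,
and the waiting CPUs may or may not have interrupts enabled).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a CPU: only tracks whether it takes interrupts. */
struct model_cpu {
	bool irqs_enabled;
};

/* Hypothetical model of the IPMI driver state shared by all contexts. */
struct model_ipmi {
	bool locks_disabled; /* panic path skips locking to avoid deadlock */
	bool lock_held;
	int users_inside;    /* contexts currently touching driver state */
	int overlaps;        /* times a second context got in concurrently */
};

/* Enter the driver; returns false if the lock keeps the caller out. */
static bool ipmi_enter(struct model_ipmi *d)
{
	if (!d->locks_disabled) {
		if (d->lock_held)
			return false;
		d->lock_held = true;
	}
	if (d->users_inside > 0)
		d->overlaps++;
	d->users_inside++;
	return true;
}

static void ipmi_exit(struct model_ipmi *d)
{
	assert(d->users_inside > 0);
	d->users_inside--;
	if (!d->locks_disabled)
		d->lock_held = false;
}

/*
 * CPU 0 panics and starts storing an event; the device then raises one
 * interrupt, which is handled by any CPU that has interrupts enabled.
 * Returns how many overlapping accesses to the driver state occurred.
 */
int simulate_panic(bool waiters_have_irqs_enabled)
{
	struct model_cpu cpus[2] = {
		{ .irqs_enabled = false },                     /* panicking CPU */
		{ .irqs_enabled = waiters_have_irqs_enabled }, /* waiting CPU */
	};
	struct model_ipmi dev = { .locks_disabled = true };

	/* Panic path is now inside the driver, without holding a lock. */
	(void)ipmi_enter(&dev);

	/* Interrupt arrives while the panic path is still inside. */
	for (size_t i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++) {
		if (cpus[i].irqs_enabled && ipmi_enter(&dev))
			ipmi_exit(&dev);
	}

	ipmi_exit(&dev);
	return dev.overlaps;
}
```

In this model simulate_panic(true) reports one overlap and
simulate_panic(false) reports none, which is the point being made:
with locking disabled, the only thing keeping the interrupt handler out
of the panic path's way is having interrupts off on the other CPUs.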

It seems, in general, like a bad idea to have interrupts enabled on some
CPUs while running through the panic() code and after the new kdump
kernel has started.  There are other issues that might come from this.

I'm also not quite sure how kdump register information for the CPUs
in wait_for_panic() gets put into the kernel coredump if you don't do
something like my patch.

Thanks,

-corey

> 
> Although I have no clue what this means:
> 
> "1) MCE appears on all CPUs, Currently all CPUs are in the NMI interrupt 
>    context."
> 
> I think he means, all CPUs are in the #MC handler.
> 
> Also, looking at that mail, what kernel is Wu Bo using?
> 
> [ 4767.947960] BUG: unable to handle kernel paging request at ffff893e40000000
> [ 4767.947962] PGD 13c001067 P4D 13c001067 PUD 0
> [ 4767.947965] Oops: 0000 [#1] SMP PTI
> [ 4767.947967] CPU: 0 PID: 0 Comm: swapper/0
> 
> There's no kernel version on this line above. Taint line is gone too. Why?
> 
> Judging by the "unable to handle kernel paging request" text, that must
> be from before
> 
>   f28b11a2abd9 ("x86/fault: Reword initial BUG message for unhandled page faults")
> 
> which is 5.1. The commit above is in 5.1 but Wu Bo better try the latest
> *upstream* kernel first. The stress being on *upstream*.
> 
> Also that kernel is in a guest - I take MCEs in guests not very
> seriously.
> 
> So before we waste time, let's explain why we're doing all that exercise
> first.
> 
> Thx.
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette

Thread overview: 10+ messages
2020-09-29 21:16 [PATCH v2] x86: Fix MCE error handing when kdump is enabled minyard
2020-09-30 17:56 ` Borislav Petkov
2020-09-30 18:49   ` Corey Minyard [this message]
2020-10-01 11:33     ` Borislav Petkov
2020-10-01 13:44       ` Corey Minyard
2020-10-01 16:16         ` Borislav Petkov
2020-10-01 16:29           ` Luck, Tony
2020-10-01 16:58             ` Borislav Petkov
2020-10-01 17:12             ` Corey Minyard
2020-10-10  1:36 ` Zhiqiang Liu