From: huang ying <huang.ying.caritas@gmail.com>
To: Don Zickus <dzickus@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>, Ingo Molnar <mingo@elte.hu>,
	linux-kernel@vger.kernel.org, Andi Kleen <andi@firstfloor.org>,
	Robert Richter <robert.richter@amd.com>,
	Andi Kleen <ak@linux.intel.com>
Subject: Re: [RFC] x86, NMI, Treat unknown NMI as hardware error
Date: Sat, 14 May 2011 08:20:40 +0800
Message-ID: <BANLkTiktuq3vMXKOdChm03SF6V4c2GhDVw@mail.gmail.com>
In-Reply-To: <20110513135154.GB31888@redhat.com>

On Fri, May 13, 2011 at 9:51 PM, Don Zickus <dzickus@redhat.com> wrote:
> On Fri, May 13, 2011 at 09:17:13PM +0800, huang ying wrote:
>> Hi, Don,
>>
>> On Fri, May 13, 2011 at 8:45 PM, Don Zickus <dzickus@redhat.com> wrote:
>> > On Fri, May 13, 2011 at 04:23:38PM +0800, Huang Ying wrote:
>> >> In general, unknown NMIs are used by hardware and firmware to notify
>> >> the OS of fatal hardware errors. So Linux should treat an unknown NMI
>> >> as a hardware error and panic on it for better error
>> >> containment.
>> >
>> > I have a couple of concerns about this patch.  One, I don't think BIOSes
>> > are ready for this.  I have Intel Westmere boxes that say they have a
>> > valid HEST, GHES, and EINJ table, but when I inject an error there is no
>> > GHES record.  This leaves me with an unknown NMI and panic.  Yeah, it is a
>> > BIOS bug I guess, but I think vendors are going to be slow fixing all this
>> > stuff (my Nehalem box is in even worse shape with this stuff).
>>
>> Although there is no GHES record, I think the Westmere box behavior is
>> acceptable: the BIOS uses an unknown NMI to notify a hardware error,
>> which is what we want to deal with in this patch.
>
> I don't think having HEST changes the situation.  I agree with your
> statement above, but I can also generate unknown NMIs from stressing perf.

Yes, perf can still generate unknown NMIs.  Maybe we should turn off
the panic-on-unknown-NMI logic while perf is running, and warn users
that if they use perf, they may lose some RAS features.
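
To make that idea concrete, here is a rough userspace model of the
decision, not a kernel patch.  All names in it (struct nmi_policy,
perf_active, unknown_nmi_action) are hypothetical and do not come from
the RFC; it only assumes that the kernel can tell whether perf is
currently in use.

```c
#include <stdbool.h>

/* Hypothetical policy inputs; none of these names come from the RFC patch. */
struct nmi_policy {
	bool panic_on_unknown_nmi;	/* admin/distro choice, e.g. a boot option */
	bool perf_active;		/* assumed: perf counters are currently in use */
};

enum nmi_action { NMI_LOG_ONLY, NMI_PANIC };

/* Decide what to do with an NMI that no registered handler claimed. */
static enum nmi_action unknown_nmi_action(const struct nmi_policy *p)
{
	if (!p->panic_on_unknown_nmi)
		return NMI_LOG_ONLY;	/* policy says: just report it */
	if (p->perf_active)
		return NMI_LOG_ONLY;	/* could be a stray perf NMI, do not panic */
	return NMI_PANIC;		/* treat it as a fatal hardware error */
}
```

The trade-off is visible in the second check: while perf is active, a
real hardware error reported through an unknown NMI would only be
logged, which is the RAS loss mentioned above.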

> Broken hardware usually generated NMIs; sometimes they propagated to the
> cpu, other times they were swallowed by the chipset.  Which means having
> HEST or not having HEST doesn't improve anything nor make it any worse.
>
> IOW I don't think we gain anything with this patch.

Without this patch, a real fatal hardware error may silently corrupt
your disk data.  With this patch, the kernel panics before that can
happen.  I think this is what we gain with this patch.

>>
>> > Also, are there any known issues with x86_64 platforms with bad NMIs?  RHEL
>> > has had unknown NMIs panic on x86_64 since x86_64 first came out, I don't
>> > recall any exceptions we had to add to handle 'quirky' hardware.
>> >
>> > Then for the i686 case, because the 'quirky' hardware is so old, can't we
>> > just leave it a kernel config option to switch between using a 'printk'
>> > vs. a 'panic'?  Or even a kernel command line option.
>> >
>> > I figure these 'quirky' hardware machines are more the exception nowadays,
>> > do we really need to add code to whitelist machines?
>> >
>> > Granted I am not familiar enough with the quirky hardware (in fact I don't
>> > think I have seen any mainly because I haven't been around long enough).
>> > Most cases I see when trolling through the fedora bugzilla list for
>> > unknown NMIs, is just bad firmware or acpi power configurations.
>> >
>> > Just wondering if we could just simplify the patch somehow with better
>> > assumptions.
>>
>> So there are still unknown NMIs on real hardware now. I am afraid turning
>> on panic on unknown NMI by default may not be acceptable to some users.
>
> The opposite could be said too.  I think that was Ingo's point.  The
> policy should be left in the hands of the user or distro because there is
> no right answer.

IMHO, Linux is not X, so the Linux kernel does not push all policy to
user space.  And when processing a fatal hardware error, user space may
get no opportunity to run.

Best Regards,
Huang Ying
