From: Borislav Petkov <bp@alien8.de>
To: "Luck, Tony" <tony.luck@intel.com>
Cc: rui wang <ruiv.wang@gmail.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"gong.chen@linux.intel.com" <gong.chen@linux.intel.com>,
	"Wang, Rui Y" <rui.y.wang@intel.com>
Subject: Re: [PATCH v3] x86/mce: Try printing all machine check banks known before panic
Date: Fri, 21 Nov 2014 19:13:34 +0100
Message-ID: <20141121181334.GC4274@pd.tnic>
In-Reply-To: <3908561D78D1C84285E8C5FCA982C28F3294F888@ORSMSX114.amr.corp.intel.com>

On Fri, Nov 21, 2014 at 05:20:53PM +0000, Luck, Tony wrote:
> > leave them in. Then you can read them out again on panic time. The mce
> > log buffer will have to become a circular buffer or something like that.
> 
> This is a mixed bag.  If there are a bunch of errors so that we overflow the buffer,
> then general wisdom says that people want to see the first errors, as the later
> ones may just be secondary effects from the earlier ones.
> 
> But - lots of systems don't run mcelog(8) daemon.  So the buffer just fills
> up with the first 32 errors (perhaps all relatively harmless corrected errors)
> and then when some serious stuff happens we have no place to log :-)

Of course we do - we overwrite the first one. Changing it into a
circular buffer will give us the last 32 errors logged.
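
For illustration, a minimal userspace sketch of the overwrite-oldest
behaviour described here. Everything in it is hypothetical: `struct rec`
is a stand-in for struct mce, and `mce_ring`, `ring_log()`, `ring_count()`,
`ring_get()` and `ring_dump()` are made-up names, not the kernel's mcelog
code. Only the length of 32 is taken from the discussion.

```c
#include <stdio.h>

#define RING_LEN 32	/* same count as the 32 records mentioned above */

/* Hypothetical stand-in for struct mce; carries only what the demo needs. */
struct rec {
	unsigned long long status;
	int bank;
};

struct mce_ring {
	struct rec entry[RING_LEN];
	unsigned long next;	/* total number of records ever logged */
};

/* Log a record; once the ring is full, the oldest record is overwritten. */
static void ring_log(struct mce_ring *r, const struct rec *m)
{
	r->entry[r->next % RING_LEN] = *m;
	r->next++;
}

/* Number of records currently held (at most RING_LEN). */
static unsigned long ring_count(const struct mce_ring *r)
{
	return r->next < RING_LEN ? r->next : RING_LEN;
}

/* Fetch the i-th held record, oldest first; NULL if i is out of range. */
static const struct rec *ring_get(const struct mce_ring *r, unsigned long i)
{
	unsigned long n = ring_count(r);
	unsigned long first = r->next - n;	/* sequence number of the oldest record */

	if (i >= n)
		return NULL;
	return &r->entry[(first + i) % RING_LEN];
}

/* Dump everything still in the ring, oldest first, as one would at panic time. */
static void ring_dump(const struct mce_ring *r)
{
	unsigned long i;

	for (i = 0; i < ring_count(r); i++) {
		const struct rec *m = ring_get(r, i);

		printf("bank %d status 0x%llx\n", m->bank, m->status);
	}
}
```

After logging 40 records into such a ring, the first 8 are gone and
ring_dump() prints records 8 through 39, which is exactly the trade-off
being argued about: the most recent errors survive, the earliest do not.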

> Perhaps we need separate buffers for UC=0 and UC=1?  Or something else??

Well, adding yet another arbitrary length buffer of struct mces used in
yet another context scenario doesn't help that either, does it?

This chunk particularly makes me go WTH?!:

+ * All valid error banks seen during MCE are temporarily saved here.
+ * There are multiple components which can report an error. For now
+ * 8 might be enough but it's subject to change in the future.
+ */
+#define MAX_ERRORS     8
+static struct mce banks_saved[MAX_ERRORS];

It sounds to me like we need to go back to the drawing board and analyze
why that thing happens first:

[  177.806166] Kernel panic - not syncing: Machine check from unknown source

Now this comes from mce_reign() which is entered by the CPU which
entered the #MC handler first, according to the comments above it.

So basically it tells me that we want all the MCEs from the last
"round," so to speak, where we had to summon all cores into the indian
clearing of #MC to quickly show each other's wounds :-) :-)

Now, if MCE_LOG_LEN records, aka 32, are not enough because the error
happened at some point in the past, there are two possibilities:

* the error got logged into mcelog and has long since gone out to dmesg.

So we go look at dmesg. Not very easy to do when we panic, I know, so we'd
better make sure we have a serial console connected.


 [ Btw., we can know when userspace is eating up error data:
   drivers/ras/debugfs.c. If it isn't, we can then dump it to dmesg.
   We'll have to teach the mcelog/ras daemons to open that file so that
   we don't dump to dmesg. ]


* the error is not logged yet, so it is still in mcelog and we simply
dump it out to dmesg.

In any case, we cannot have a fixed-size buffer for some number of errors
and rely on it always having the error which caused the #MC, as something
will consume it at some point anyway.

So maybe if we could get a more detailed explanation of when this thing
happens, then we might address it better.

And also:

        /*
         * No machine check event found. Must be some external
         * source or one CPU is hung. Panic.
         */
        if (global_worst <= MCE_KEEP_SEVERITY && mca_cfg.tolerant < 3)
                mce_panic("Machine check from unknown source", NULL, NULL);

Provided this comment is correct, it doesn't sound like any MCE record
will ever tell us what caused the error, as an external source or a hung
CPU doesn't generate an MCE record in any bank, does it?
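
To make the quoted condition concrete, here is a small userspace model of
just that check. It is a sketch under assumptions, not the kernel code:
the severity enum is a simplified subset whose names mirror the kernel's,
`panics_unknown_source()` is a hypothetical helper, and the per-CPU
severities are passed in as a plain array instead of being computed from
real MCE records.

```c
#include <stdbool.h>

/* Simplified severity ordering; assumption: lower value means less severe. */
enum severity {
	MCE_NO_SEVERITY,
	MCE_KEEP_SEVERITY,
	MCE_SOME_SEVERITY,
	MCE_PANIC_SEVERITY,
};

/*
 * Hypothetical model of the quoted check: take the worst severity seen
 * on any CPU and report whether the "unknown source" panic would fire.
 */
static bool panics_unknown_source(const enum severity *per_cpu, int ncpus,
				  int tolerant)
{
	int global_worst = MCE_NO_SEVERITY;
	int cpu;

	for (cpu = 0; cpu < ncpus; cpu++) {
		if ((int)per_cpu[cpu] > global_worst)
			global_worst = per_cpu[cpu];
	}

	/* Same condition as in the snippet quoted above. */
	return global_worst <= MCE_KEEP_SEVERITY && tolerant < 3;
}
```

In this model the panic fires only when no CPU contributed anything above
MCE_KEEP_SEVERITY, that is, precisely when there is no record worth
printing, which is the point of the question.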

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.