All of lore.kernel.org
 help / color / mirror / Atom feed
From: tip-bot for Tony Luck <tipbot@zytor.com>
To: linux-tip-commits@vger.kernel.org
Cc: tglx@linutronix.de, linux-edac@vger.kernel.org,
	dan.j.williams@intel.com, ashok.raj@intel.com,
	tony.luck@intel.com, qiuxu.zhuo@intel.com, bp@suse.de,
	mingo@kernel.org, hpa@zytor.com, linux-kernel@vger.kernel.org
Subject: [tip:ras/core] x86/mce: Fix incorrect "Machine check from unknown source" message
Date: Fri, 22 Jun 2018 05:40:15 -0700	[thread overview]
Message-ID: <tip-40c36e2741d7fe1e66d6ec55477ba5fd19c9c5d2@git.kernel.org> (raw)
In-Reply-To: <52e049a497e86fd0b71c529651def8871c804df0.1527283897.git.tony.luck@intel.com>

Commit-ID:  40c36e2741d7fe1e66d6ec55477ba5fd19c9c5d2
Gitweb:     https://git.kernel.org/tip/40c36e2741d7fe1e66d6ec55477ba5fd19c9c5d2
Author:     Tony Luck <tony.luck@intel.com>
AuthorDate: Fri, 22 Jun 2018 11:54:23 +0200
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 22 Jun 2018 14:35:50 +0200

x86/mce: Fix incorrect "Machine check from unknown source" message

Some injection testing resulted in the following console log:

  mce: [Hardware Error]: CPU 22: Machine Check Exception: f Bank 1: bd80000000100134
  mce: [Hardware Error]: RIP 10:<ffffffffc05292dd> {pmem_do_bvec+0x11d/0x330 [nd_pmem]}
  mce: [Hardware Error]: TSC c51a63035d52 ADDR 3234bc4000 MISC 88
  mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1526502199 SOCKET 0 APIC 38 microcode 2000043
  mce: [Hardware Error]: Run the above through 'mcelog --ascii'
  Kernel panic - not syncing: Machine check from unknown source

This confused everybody because the first line quite clearly shows
that we found a logged error in "Bank 1", while the last line says
"unknown source".

The problem is that the Linux code doesn't do the right thing
for a local machine check that results in a fatal error.

It turns out that we know very early in the handler whether the
machine check is fatal. The call to mce_no_way_out() has checked
all the banks for the CPU that took the local machine check. If
it says we must crash, we can do so right away with the right
messages.

We do scan all the banks again. This means that we might initially
not see a problem, but during the second scan find something fatal.
If this happens we print a slightly different message (so I can
see if it actually every happens).

[ bp: Remove unneeded severity assignment. ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: stable@vger.kernel.org # 4.2
Link: http://lkml.kernel.org/r/52e049a497e86fd0b71c529651def8871c804df0.1527283897.git.tony.luck@intel.com

---
 arch/x86/kernel/cpu/mcheck/mce.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 7e6f51a9d917..e93670d736a6 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1207,13 +1207,18 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		lmce = m.mcgstatus & MCG_STATUS_LMCES;
 
 	/*
+	 * Local machine check may already know that we have to panic.
+	 * Broadcast machine check begins rendezvous in mce_start()
 	 * Go through all banks in exclusion of the other CPUs. This way we
 	 * don't report duplicated events on shared banks because the first one
-	 * to see it will clear it. If this is a Local MCE, then no need to
-	 * perform rendezvous.
+	 * to see it will clear it.
 	 */
-	if (!lmce)
+	if (lmce) {
+		if (no_way_out)
+			mce_panic("Fatal local machine check", &m, msg);
+	} else {
 		order = mce_start(&no_way_out);
+	}
 
 	for (i = 0; i < cfg->banks; i++) {
 		__clear_bit(i, toclear);
@@ -1289,12 +1294,17 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 			no_way_out = worst >= MCE_PANIC_SEVERITY;
 	} else {
 		/*
-		 * Local MCE skipped calling mce_reign()
-		 * If we found a fatal error, we need to panic here.
+		 * If there was a fatal machine check we should have
+		 * already called mce_panic earlier in this function.
+		 * Since we re-read the banks, we might have found
+		 * something new. Check again to see if we found a
+		 * fatal error. We call "mce_severity()" again to
+		 * make sure we have the right "msg".
 		 */
-		 if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3)
-			mce_panic("Machine check from unknown source",
-				NULL, NULL);
+		if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) {
+			mce_severity(&m, cfg->tolerant, &msg, true);
+			mce_panic("Local fatal machine check!", &m, msg);
+		}
 	}
 
 	/*

WARNING: multiple messages have this Message-ID (diff)
From: tip-bot for Borislav Petkov <tipbot@zytor.com>
To: linux-tip-commits@vger.kernel.org
Cc: tglx@linutronix.de, linux-edac@vger.kernel.org,
	dan.j.williams@intel.com, ashok.raj@intel.com,
	tony.luck@intel.com, qiuxu.zhuo@intel.com, bp@suse.de,
	mingo@kernel.org, hpa@zytor.com, linux-kernel@vger.kernel.org
Subject: [tip:ras/core] x86/mce: Fix incorrect "Machine check from unknown source" message
Date: Fri, 22 Jun 2018 05:40:15 -0700	[thread overview]
Message-ID: <tip-40c36e2741d7fe1e66d6ec55477ba5fd19c9c5d2@git.kernel.org> (raw)

Commit-ID:  40c36e2741d7fe1e66d6ec55477ba5fd19c9c5d2
Gitweb:     https://git.kernel.org/tip/40c36e2741d7fe1e66d6ec55477ba5fd19c9c5d2
Author:     Tony Luck <tony.luck@intel.com>
AuthorDate: Fri, 22 Jun 2018 11:54:23 +0200
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 22 Jun 2018 14:35:50 +0200

x86/mce: Fix incorrect "Machine check from unknown source" message

Some injection testing resulted in the following console log:

  mce: [Hardware Error]: CPU 22: Machine Check Exception: f Bank 1: bd80000000100134
  mce: [Hardware Error]: RIP 10:<ffffffffc05292dd> {pmem_do_bvec+0x11d/0x330 [nd_pmem]}
  mce: [Hardware Error]: TSC c51a63035d52 ADDR 3234bc4000 MISC 88
  mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1526502199 SOCKET 0 APIC 38 microcode 2000043
  mce: [Hardware Error]: Run the above through 'mcelog --ascii'
  Kernel panic - not syncing: Machine check from unknown source

This confused everybody because the first line quite clearly shows
that we found a logged error in "Bank 1", while the last line says
"unknown source".

The problem is that the Linux code doesn't do the right thing
for a local machine check that results in a fatal error.

It turns out that we know very early in the handler whether the
machine check is fatal. The call to mce_no_way_out() has checked
all the banks for the CPU that took the local machine check. If
it says we must crash, we can do so right away with the right
messages.

We do scan all the banks again. This means that we might initially
not see a problem, but during the second scan find something fatal.
If this happens we print a slightly different message (so I can
see if it actually every happens).

[ bp: Remove unneeded severity assignment. ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: stable@vger.kernel.org # 4.2
Link: http://lkml.kernel.org/r/52e049a497e86fd0b71c529651def8871c804df0.1527283897.git.tony.luck@intel.com
---
 arch/x86/kernel/cpu/mcheck/mce.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-edac" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 7e6f51a9d917..e93670d736a6 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1207,13 +1207,18 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		lmce = m.mcgstatus & MCG_STATUS_LMCES;
 
 	/*
+	 * Local machine check may already know that we have to panic.
+	 * Broadcast machine check begins rendezvous in mce_start()
 	 * Go through all banks in exclusion of the other CPUs. This way we
 	 * don't report duplicated events on shared banks because the first one
-	 * to see it will clear it. If this is a Local MCE, then no need to
-	 * perform rendezvous.
+	 * to see it will clear it.
 	 */
-	if (!lmce)
+	if (lmce) {
+		if (no_way_out)
+			mce_panic("Fatal local machine check", &m, msg);
+	} else {
 		order = mce_start(&no_way_out);
+	}
 
 	for (i = 0; i < cfg->banks; i++) {
 		__clear_bit(i, toclear);
@@ -1289,12 +1294,17 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 			no_way_out = worst >= MCE_PANIC_SEVERITY;
 	} else {
 		/*
-		 * Local MCE skipped calling mce_reign()
-		 * If we found a fatal error, we need to panic here.
+		 * If there was a fatal machine check we should have
+		 * already called mce_panic earlier in this function.
+		 * Since we re-read the banks, we might have found
+		 * something new. Check again to see if we found a
+		 * fatal error. We call "mce_severity()" again to
+		 * make sure we have the right "msg".
 		 */
-		 if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3)
-			mce_panic("Machine check from unknown source",
-				NULL, NULL);
+		if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) {
+			mce_severity(&m, cfg->tolerant, &msg, true);
+			mce_panic("Local fatal machine check!", &m, msg);
+		}
 	}
 
 	/*

  parent reply	other threads:[~2018-06-22 12:40 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-05-25 21:40 [PATCH 0/3] x86/mce fixes Tony Luck
2018-05-25 21:41 ` [PATCH 1/3] x86/mce: Improve error message when kernel cannot recover Tony Luck
2018-06-07 20:24   ` [tip:ras/urgent] " tip-bot for Tony Luck
2018-05-25 21:41 ` [PATCH 2/3] x86/mce: Fix incorrect "Machine check from unknown source" message Tony Luck
2018-05-28 20:49   ` Borislav Petkov
2018-05-29 16:15     ` [PATCH 2/3 V2] " Luck, Tony
2018-05-29 17:41       ` Borislav Petkov
2018-05-29 17:50         ` Luck, Tony
2018-05-29 17:53           ` Borislav Petkov
2018-05-29 18:54             ` Luck, Tony
2018-05-29 20:17               ` Dan Williams
2018-05-30  9:26               ` Borislav Petkov
2018-06-19 10:30                 ` Borislav Petkov
2018-05-29 18:22     ` [PATCH 2/3] " Raj, Ashok
2018-05-29 10:42   ` Borislav Petkov
2018-05-29 16:13     ` Luck, Tony
2018-06-22 12:40   ` tip-bot for Tony Luck [this message]
2018-06-22 12:40     ` [tip:ras/core] " tip-bot for Borislav Petkov
2018-05-25 21:42 ` [PATCH 3/3] x86/mce: Check for alternate indication of machine check recovery on Skylake Tony Luck
2018-06-07 17:43   ` Luck, Tony
2018-06-07 20:18     ` Dan Williams
2018-06-07 20:24       ` Borislav Petkov
2018-06-07 22:26         ` Luck, Tony
2018-06-14 21:57         ` Luck, Tony
2018-06-15 11:45           ` Borislav Petkov
2018-06-15 16:34             ` Luck, Tony
2018-06-15 17:16               ` Borislav Petkov
2018-06-07 20:24       ` Thomas Gleixner
2018-06-07 20:25   ` [tip:ras/urgent] " tip-bot for Tony Luck

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=tip-40c36e2741d7fe1e66d6ec55477ba5fd19c9c5d2@git.kernel.org \
    --to=tipbot@zytor.com \
    --cc=ashok.raj@intel.com \
    --cc=bp@suse.de \
    --cc=dan.j.williams@intel.com \
    --cc=hpa@zytor.com \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-tip-commits@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=qiuxu.zhuo@intel.com \
    --cc=tglx@linutronix.de \
    --cc=tony.luck@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.