All of lore.kernel.org
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Tony Luck <tony.luck@intel.com>,
	Borislav Petkov <bp@suse.de>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ashok Raj <ashok.raj@intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Qiuxu Zhuo <qiuxu.zhuo@intel.com>,
	linux-edac <linux-edac@vger.kernel.org>
Subject: [PATCH 4.4 36/47] x86/mce: Fix incorrect "Machine check from unknown source" message
Date: Tue, 10 Jul 2018 20:25:00 +0200	[thread overview]
Message-ID: <20180710182338.673569223@linuxfoundation.org> (raw)
In-Reply-To: <20180710182337.047502999@linuxfoundation.org>

4.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Tony Luck <tony.luck@intel.com>

commit 40c36e2741d7fe1e66d6ec55477ba5fd19c9c5d2 upstream.

Some injection testing resulted in the following console log:

  mce: [Hardware Error]: CPU 22: Machine Check Exception: f Bank 1: bd80000000100134
  mce: [Hardware Error]: RIP 10:<ffffffffc05292dd> {pmem_do_bvec+0x11d/0x330 [nd_pmem]}
  mce: [Hardware Error]: TSC c51a63035d52 ADDR 3234bc4000 MISC 88
  mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1526502199 SOCKET 0 APIC 38 microcode 2000043
  mce: [Hardware Error]: Run the above through 'mcelog --ascii'
  Kernel panic - not syncing: Machine check from unknown source

This confused everybody because the first line quite clearly shows
that we found a logged error in "Bank 1", while the last line says
"unknown source".

The problem is that the Linux code doesn't do the right thing
for a local machine check that results in a fatal error.

It turns out that we know very early in the handler whether the
machine check is fatal. The call to mce_no_way_out() has checked
all the banks for the CPU that took the local machine check. If
it says we must crash, we can do so right away with the right
messages.

We do scan all the banks again. This means that we might initially
not see a problem, but during the second scan find something fatal.
If this happens we print a slightly different message (so I can
see if it actually every happens).

[ bp: Remove unneeded severity assignment. ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: stable@vger.kernel.org # 4.2
Link: http://lkml.kernel.org/r/52e049a497e86fd0b71c529651def8871c804df0.1527283897.git.tony.luck@intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 arch/x86/kernel/cpu/mcheck/mce.c |   26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1052,13 +1052,18 @@ void do_machine_check(struct pt_regs *re
 		lmce = m.mcgstatus & MCG_STATUS_LMCES;
 
 	/*
+	 * Local machine check may already know that we have to panic.
+	 * Broadcast machine check begins rendezvous in mce_start()
 	 * Go through all banks in exclusion of the other CPUs. This way we
 	 * don't report duplicated events on shared banks because the first one
-	 * to see it will clear it. If this is a Local MCE, then no need to
-	 * perform rendezvous.
+	 * to see it will clear it.
 	 */
-	if (!lmce)
+	if (lmce) {
+		if (no_way_out)
+			mce_panic("Fatal local machine check", &m, msg);
+	} else {
 		order = mce_start(&no_way_out);
+	}
 
 	for (i = 0; i < cfg->banks; i++) {
 		__clear_bit(i, toclear);
@@ -1135,12 +1140,17 @@ void do_machine_check(struct pt_regs *re
 			no_way_out = worst >= MCE_PANIC_SEVERITY;
 	} else {
 		/*
-		 * Local MCE skipped calling mce_reign()
-		 * If we found a fatal error, we need to panic here.
+		 * If there was a fatal machine check we should have
+		 * already called mce_panic earlier in this function.
+		 * Since we re-read the banks, we might have found
+		 * something new. Check again to see if we found a
+		 * fatal error. We call "mce_severity()" again to
+		 * make sure we have the right "msg".
 		 */
-		 if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3)
-			mce_panic("Machine check from unknown source",
-				NULL, NULL);
+		if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) {
+			mce_severity(&m, cfg->tolerant, &msg, true);
+			mce_panic("Local fatal machine check!", &m, msg);
+		}
 	}
 
 	/*



WARNING: multiple messages have this Message-ID (diff)
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Tony Luck <tony.luck@intel.com>,
	Borislav Petkov <bp@suse.de>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ashok Raj <ashok.raj@intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Qiuxu Zhuo <qiuxu.zhuo@intel.com>,
	linux-edac <linux-edac@vger.kernel.org>
Subject: [4.4,36/47] x86/mce: Fix incorrect "Machine check from unknown source" message
Date: Tue, 10 Jul 2018 20:25:00 +0200	[thread overview]
Message-ID: <20180710182338.673569223@linuxfoundation.org> (raw)

4.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Tony Luck <tony.luck@intel.com>

commit 40c36e2741d7fe1e66d6ec55477ba5fd19c9c5d2 upstream.

Some injection testing resulted in the following console log:

  mce: [Hardware Error]: CPU 22: Machine Check Exception: f Bank 1: bd80000000100134
  mce: [Hardware Error]: RIP 10:<ffffffffc05292dd> {pmem_do_bvec+0x11d/0x330 [nd_pmem]}
  mce: [Hardware Error]: TSC c51a63035d52 ADDR 3234bc4000 MISC 88
  mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1526502199 SOCKET 0 APIC 38 microcode 2000043
  mce: [Hardware Error]: Run the above through 'mcelog --ascii'
  Kernel panic - not syncing: Machine check from unknown source

This confused everybody because the first line quite clearly shows
that we found a logged error in "Bank 1", while the last line says
"unknown source".

The problem is that the Linux code doesn't do the right thing
for a local machine check that results in a fatal error.

It turns out that we know very early in the handler whether the
machine check is fatal. The call to mce_no_way_out() has checked
all the banks for the CPU that took the local machine check. If
it says we must crash, we can do so right away with the right
messages.

We do scan all the banks again. This means that we might initially
not see a problem, but during the second scan find something fatal.
If this happens we print a slightly different message (so I can
see if it actually every happens).

[ bp: Remove unneeded severity assignment. ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: stable@vger.kernel.org # 4.2
Link: http://lkml.kernel.org/r/52e049a497e86fd0b71c529651def8871c804df0.1527283897.git.tony.luck@intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 arch/x86/kernel/cpu/mcheck/mce.c |   26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe linux-edac" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1052,13 +1052,18 @@ void do_machine_check(struct pt_regs *re
 		lmce = m.mcgstatus & MCG_STATUS_LMCES;
 
 	/*
+	 * Local machine check may already know that we have to panic.
+	 * Broadcast machine check begins rendezvous in mce_start()
 	 * Go through all banks in exclusion of the other CPUs. This way we
 	 * don't report duplicated events on shared banks because the first one
-	 * to see it will clear it. If this is a Local MCE, then no need to
-	 * perform rendezvous.
+	 * to see it will clear it.
 	 */
-	if (!lmce)
+	if (lmce) {
+		if (no_way_out)
+			mce_panic("Fatal local machine check", &m, msg);
+	} else {
 		order = mce_start(&no_way_out);
+	}
 
 	for (i = 0; i < cfg->banks; i++) {
 		__clear_bit(i, toclear);
@@ -1135,12 +1140,17 @@ void do_machine_check(struct pt_regs *re
 			no_way_out = worst >= MCE_PANIC_SEVERITY;
 	} else {
 		/*
-		 * Local MCE skipped calling mce_reign()
-		 * If we found a fatal error, we need to panic here.
+		 * If there was a fatal machine check we should have
+		 * already called mce_panic earlier in this function.
+		 * Since we re-read the banks, we might have found
+		 * something new. Check again to see if we found a
+		 * fatal error. We call "mce_severity()" again to
+		 * make sure we have the right "msg".
 		 */
-		 if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3)
-			mce_panic("Machine check from unknown source",
-				NULL, NULL);
+		if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) {
+			mce_severity(&m, cfg->tolerant, &msg, true);
+			mce_panic("Local fatal machine check!", &m, msg);
+		}
 	}
 
 	/*

  parent reply	other threads:[~2018-07-10 18:29 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-07-10 18:24 [PATCH 4.4 00/47] 4.4.140-stable review Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 01/47] usb: cdc_acm: Add quirk for Uniden UBC125 scanner Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 02/47] USB: serial: cp210x: add CESINEL device ids Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 03/47] USB: serial: cp210x: add Silicon Labs IDs for Windows Update Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 04/47] n_tty: Fix stall at n_tty_receive_char_special() Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 05/47] staging: android: ion: Return an ERR_PTR in ion_map_kernel Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 06/47] n_tty: Access echo_* variables carefully Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 07/47] x86/boot: Fix early command-line parsing when matching at end Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 08/47] ath10k: fix rfc1042 header retrieval in QCA4019 with eth decap mode Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 09/47] i2c: rcar: fix resume by always initializing registers before transfer Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 10/47] ipv4: Fix error return value in fib_convert_metrics() Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 11/47] kprobes/x86: Do not modify singlestep buffer while resuming Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 12/47] nvme-pci: initialize queue memory before interrupts Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 13/47] netfilter: nf_tables: use WARN_ON_ONCE instead of BUG_ON in nft_do_chain() Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 14/47] ARM: dts: imx6q: Use correct SDMA script for SPI5 core Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 15/47] ubi: fastmap: Correctly handle interrupted erasures in EBA Greg Kroah-Hartman
2018-07-26  2:12   ` Ben Hutchings
2018-07-26  6:25     ` Richard Weinberger
2018-07-28  1:28       ` Ben Hutchings
2018-07-28  5:56         ` Richard Weinberger
2018-07-10 18:24 ` [PATCH 4.4 16/47] mm: hugetlb: yield when prepping struct pages Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 17/47] tracing: Fix missing return symbol in function_graph output Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 18/47] scsi: sg: mitigate read/write abuse Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 19/47] s390: Correct register corruption in critical section cleanup Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 20/47] drbd: fix access after free Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 21/47] cifs: Fix infinite loop when using hard mount option Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 22/47] jbd2: dont mark block as modified if the handle is out of credits Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 23/47] ext4: make sure bitmaps and the inode table dont overlap with bg descriptors Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 24/47] ext4: always check block group bounds in ext4_init_block_bitmap() Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 25/47] ext4: only look at the bg_flags field if it is valid Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 26/47] ext4: verify the depth of extent tree in ext4_find_extent() Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 27/47] ext4: include the illegal physical block in the bad map ext4_error msg Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 28/47] ext4: clear i_data in ext4_inode_info when removing inline data Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 29/47] ext4: add more inode number paranoia checks Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 30/47] ext4: add more mount time checks of the superblock Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 31/47] ext4: check superblock mapped prior to committing Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 32/47] HID: i2c-hid: Fix "incomplete report" noise Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 33/47] HID: hiddev: fix potential Spectre v1 Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 34/47] HID: debug: check length before copy_to_user() Greg Kroah-Hartman
2018-07-10 18:24 ` [PATCH 4.4 35/47] x86/mce: Detect local MCEs properly Greg Kroah-Hartman
2018-07-10 18:24   ` [4.4,35/47] " Greg Kroah-Hartman
2018-07-10 18:25 ` Greg Kroah-Hartman [this message]
2018-07-10 18:25   ` [4.4,36/47] x86/mce: Fix incorrect "Machine check from unknown source" message Greg Kroah-Hartman
2018-07-10 18:25 ` [PATCH 4.4 37/47] media: cx25840: Use subdev host data for PLL override Greg Kroah-Hartman
2018-07-10 18:25 ` [PATCH 4.4 38/47] mm, page_alloc: do not break __GFP_THISNODE by zonelist reset Greg Kroah-Hartman
2018-07-10 18:25 ` [PATCH 4.4 39/47] dm bufio: avoid sleeping while holding the dm_bufio lock Greg Kroah-Hartman
2018-07-10 18:25 ` [PATCH 4.4 40/47] dm bufio: drop the lock when doing GFP_NOIO allocation Greg Kroah-Hartman
2018-07-10 18:25 ` [PATCH 4.4 41/47] mtd: rawnand: mxc: set spare area size register explicitly Greg Kroah-Hartman
2018-07-10 18:25 ` [PATCH 4.4 42/47] dm bufio: dont take the lock in dm_bufio_shrink_count Greg Kroah-Hartman
2018-07-10 18:25 ` [PATCH 4.4 43/47] mtd: cfi_cmdset_0002: Change definition naming to retry write operation Greg Kroah-Hartman
2018-07-10 18:25 ` [PATCH 4.4 44/47] mtd: cfi_cmdset_0002: Change erase functions to retry for error Greg Kroah-Hartman
2018-07-10 18:25 ` [PATCH 4.4 45/47] mtd: cfi_cmdset_0002: Change erase functions to check chip good only Greg Kroah-Hartman
2018-07-10 18:25 ` [PATCH 4.4 46/47] netfilter: nf_log: dont hold nf_log_mutex during user access Greg Kroah-Hartman
2018-07-10 18:25 ` [PATCH 4.4 47/47] staging: comedi: quatech_daqp_cs: fix no-op loop daqp_ao_insn_write() Greg Kroah-Hartman
2018-07-10 19:10 ` [PATCH 4.4 00/47] 4.4.140-stable review Nathan Chancellor
2018-07-11 11:21 ` Naresh Kamboju
2018-07-11 13:40 ` Guenter Roeck
2018-07-11 15:10 ` Shuah Khan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180710182338.673569223@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=ashok.raj@intel.com \
    --cc=bp@suse.de \
    --cc=dan.j.williams@intel.com \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=qiuxu.zhuo@intel.com \
    --cc=stable@vger.kernel.org \
    --cc=tglx@linutronix.de \
    --cc=tony.luck@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.