linux-edac.vger.kernel.org archive mirror
* FAILED: patch "[PATCH] x86/mce: Fix incorrect "Machine check from unknown source"" failed to apply to 4.4-stable tree
@ 2018-07-05 18:21 Greg Kroah-Hartman
From: Greg Kroah-Hartman @ 2018-07-05 18:21 UTC
  To: Luck, Tony
  Cc: ashok.raj, bp, dan.j.williams, linux-edac, qiuxu.zhuo, tglx, stable

On Thu, Jul 05, 2018 at 11:15:33AM -0700, Luck, Tony wrote:
> On Thu, Jul 05, 2018 at 08:11:23PM +0200, Greg KH wrote:
> > On Thu, Jun 28, 2018 at 03:09:31PM -0700, Luck, Tony wrote:
> > > On Thu, Jun 28, 2018 at 11:07:22AM +0900, gregkh@linuxfoundation.org wrote:
> > > > 
> > > > The patch below does not apply to the 4.4-stable tree.
> > > > If someone wants it applied there, or to any other stable or longterm
> > > > tree, then please email the backport, including the original git commit
> > > > id to <stable@vger.kernel.org>.
> > > > 
> > > > thanks,
> > > 
> > > This patch relies on:
> > > 
> > > 	3acb431b84d8 ("x86/mce: Detect local MCEs properly")
> > 
> > $ git describe --contains 3acb431b84d8
> > Could not get sha1 for 3acb431b84d8. Skipping.
> > 
> > Are you sure that is correct?
> 
> I must have picked up a commit ID from an older version of this
> patch in a test branch.  This looks to be in Linus tree.
> 
> fead35c68926 ("x86/mce: Detect local MCEs properly")
> 
> Sorry

No problem.  But that commit does not apply cleanly on 4.4.y :(

Can you backport it and this original patch, and send me the series?

thanks,

greg k-h
---
To unsubscribe from this list: send the line "unsubscribe linux-edac" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* FAILED: patch "[PATCH] x86/mce: Fix incorrect "Machine check from unknown source"" failed to apply to 4.4-stable tree
@ 2018-07-05 18:15 Luck, Tony
From: Luck, Tony @ 2018-07-05 18:15 UTC
  To: Greg KH
  Cc: ashok.raj, bp, dan.j.williams, linux-edac, qiuxu.zhuo, tglx, stable

On Thu, Jul 05, 2018 at 08:11:23PM +0200, Greg KH wrote:
> On Thu, Jun 28, 2018 at 03:09:31PM -0700, Luck, Tony wrote:
> > On Thu, Jun 28, 2018 at 11:07:22AM +0900, gregkh@linuxfoundation.org wrote:
> > > 
> > > The patch below does not apply to the 4.4-stable tree.
> > > If someone wants it applied there, or to any other stable or longterm
> > > tree, then please email the backport, including the original git commit
> > > id to <stable@vger.kernel.org>.
> > > 
> > > thanks,
> > 
> > This patch relies on:
> > 
> > 	3acb431b84d8 ("x86/mce: Detect local MCEs properly")
> 
> $ git describe --contains 3acb431b84d8
> Could not get sha1 for 3acb431b84d8. Skipping.
> 
> Are you sure that is correct?

I must have picked up a commit ID from an older version of this
patch in a test branch.  This looks to be in Linus tree.

fead35c68926 ("x86/mce: Detect local MCEs properly")

Sorry

-Tony

* FAILED: patch "[PATCH] x86/mce: Fix incorrect "Machine check from unknown source"" failed to apply to 4.4-stable tree
@ 2018-07-05 18:11 Greg Kroah-Hartman
From: Greg Kroah-Hartman @ 2018-07-05 18:11 UTC
  To: Luck, Tony
  Cc: ashok.raj, bp, dan.j.williams, linux-edac, qiuxu.zhuo, tglx, stable

On Thu, Jun 28, 2018 at 03:09:31PM -0700, Luck, Tony wrote:
> On Thu, Jun 28, 2018 at 11:07:22AM +0900, gregkh@linuxfoundation.org wrote:
> > 
> > The patch below does not apply to the 4.4-stable tree.
> > If someone wants it applied there, or to any other stable or longterm
> > tree, then please email the backport, including the original git commit
> > id to <stable@vger.kernel.org>.
> > 
> > thanks,
> 
> This patch relies on:
> 
> 	3acb431b84d8 ("x86/mce: Detect local MCEs properly")

$ git describe --contains 3acb431b84d8
Could not get sha1 for 3acb431b84d8. Skipping.

Are you sure that is correct?

thanks,

greg k-h

* FAILED: patch "[PATCH] x86/mce: Fix incorrect "Machine check from unknown source"" failed to apply to 4.4-stable tree
@ 2018-06-28 22:09 Luck, Tony
From: Luck, Tony @ 2018-06-28 22:09 UTC
  To: gregkh
  Cc: ashok.raj, bp, dan.j.williams, linux-edac, qiuxu.zhuo, tglx, stable

On Thu, Jun 28, 2018 at 11:07:22AM +0900, gregkh@linuxfoundation.org wrote:
> 
> The patch below does not apply to the 4.4-stable tree.
> If someone wants it applied there, or to any other stable or longterm
> tree, then please email the backport, including the original git commit
> id to <stable@vger.kernel.org>.
> 
> thanks,

This patch relies on:

	3acb431b84d8 ("x86/mce: Detect local MCEs properly")

Cherry-pick that (and fix up the trivial merge conflict around the
change that initializes "lmce = 1;" instead of "lmce = 0;").

Then this will merge cleanly.

-Tony
> 
> ------------------ original commit in Linus's tree ------------------
> 
> From 40c36e2741d7fe1e66d6ec55477ba5fd19c9c5d2 Mon Sep 17 00:00:00 2001
> From: Tony Luck <tony.luck@intel.com>
> Date: Fri, 22 Jun 2018 11:54:23 +0200
> Subject: [PATCH] x86/mce: Fix incorrect "Machine check from unknown source"
>  message
> 
> Some injection testing resulted in the following console log:
> 
>   mce: [Hardware Error]: CPU 22: Machine Check Exception: f Bank 1: bd80000000100134
>   mce: [Hardware Error]: RIP 10:<ffffffffc05292dd> {pmem_do_bvec+0x11d/0x330 [nd_pmem]}
>   mce: [Hardware Error]: TSC c51a63035d52 ADDR 3234bc4000 MISC 88
>   mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1526502199 SOCKET 0 APIC 38 microcode 2000043
>   mce: [Hardware Error]: Run the above through 'mcelog --ascii'
>   Kernel panic - not syncing: Machine check from unknown source
> 
> This confused everybody because the first line quite clearly shows
> that we found a logged error in "Bank 1", while the last line says
> "unknown source".
> 
> The problem is that the Linux code doesn't do the right thing
> for a local machine check that results in a fatal error.
> 
> It turns out that we know very early in the handler whether the
> machine check is fatal. The call to mce_no_way_out() has checked
> all the banks for the CPU that took the local machine check. If
> it says we must crash, we can do so right away with the right
> messages.
> 
> We do scan all the banks again. This means that we might initially
> not see a problem, but during the second scan find something fatal.
> If this happens we print a slightly different message (so I can
> see if it actually ever happens).
> 
> [ bp: Remove unneeded severity assignment. ]
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Borislav Petkov <bp@suse.de>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ashok Raj <ashok.raj@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> Cc: linux-edac <linux-edac@vger.kernel.org>
> Cc: stable@vger.kernel.org # 4.2
> Link: http://lkml.kernel.org/r/52e049a497e86fd0b71c529651def8871c804df0.1527283897.git.tony.luck@intel.com
> 
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 7e6f51a9d917..e93670d736a6 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -1207,13 +1207,18 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  		lmce = m.mcgstatus & MCG_STATUS_LMCES;
>  
>  	/*
> +	 * Local machine check may already know that we have to panic.
> +	 * Broadcast machine check begins rendezvous in mce_start()
>  	 * Go through all banks in exclusion of the other CPUs. This way we
>  	 * don't report duplicated events on shared banks because the first one
> -	 * to see it will clear it. If this is a Local MCE, then no need to
> -	 * perform rendezvous.
> +	 * to see it will clear it.
>  	 */
> -	if (!lmce)
> +	if (lmce) {
> +		if (no_way_out)
> +			mce_panic("Fatal local machine check", &m, msg);
> +	} else {
>  		order = mce_start(&no_way_out);
> +	}
>  
>  	for (i = 0; i < cfg->banks; i++) {
>  		__clear_bit(i, toclear);
> @@ -1289,12 +1294,17 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  			no_way_out = worst >= MCE_PANIC_SEVERITY;
>  	} else {
>  		/*
> -		 * Local MCE skipped calling mce_reign()
> -		 * If we found a fatal error, we need to panic here.
> +		 * If there was a fatal machine check we should have
> +		 * already called mce_panic earlier in this function.
> +		 * Since we re-read the banks, we might have found
> +		 * something new. Check again to see if we found a
> +		 * fatal error. We call "mce_severity()" again to
> +		 * make sure we have the right "msg".
>  		 */
> -		 if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3)
> -			mce_panic("Machine check from unknown source",
> -				NULL, NULL);
> +		if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) {
> +			mce_severity(&m, cfg->tolerant, &msg, true);
> +			mce_panic("Local fatal machine check!", &m, msg);
> +		}
>  	}
>  
>  	/*
>

* FAILED: patch "[PATCH] x86/mce: Fix incorrect "Machine check from unknown source"" failed to apply to 4.4-stable tree
@ 2018-06-28  2:07 Greg Kroah-Hartman
From: Greg Kroah-Hartman @ 2018-06-28  2:07 UTC
  To: tony.luck, ashok.raj, bp, dan.j.williams, linux-edac, qiuxu.zhuo, tglx
  Cc: stable

The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.

thanks,

greg k-h

------------------ original commit in Linus's tree ------------------

From 40c36e2741d7fe1e66d6ec55477ba5fd19c9c5d2 Mon Sep 17 00:00:00 2001
From: Tony Luck <tony.luck@intel.com>
Date: Fri, 22 Jun 2018 11:54:23 +0200
Subject: [PATCH] x86/mce: Fix incorrect "Machine check from unknown source"
 message

Some injection testing resulted in the following console log:

  mce: [Hardware Error]: CPU 22: Machine Check Exception: f Bank 1: bd80000000100134
  mce: [Hardware Error]: RIP 10:<ffffffffc05292dd> {pmem_do_bvec+0x11d/0x330 [nd_pmem]}
  mce: [Hardware Error]: TSC c51a63035d52 ADDR 3234bc4000 MISC 88
  mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1526502199 SOCKET 0 APIC 38 microcode 2000043
  mce: [Hardware Error]: Run the above through 'mcelog --ascii'
  Kernel panic - not syncing: Machine check from unknown source

This confused everybody because the first line quite clearly shows
that we found a logged error in "Bank 1", while the last line says
"unknown source".

The problem is that the Linux code doesn't do the right thing
for a local machine check that results in a fatal error.

It turns out that we know very early in the handler whether the
machine check is fatal. The call to mce_no_way_out() has checked
all the banks for the CPU that took the local machine check. If
it says we must crash, we can do so right away with the right
messages.

We do scan all the banks again. This means that we might initially
not see a problem, but during the second scan find something fatal.
If this happens we print a slightly different message (so I can
see if it actually ever happens).

[ bp: Remove unneeded severity assignment. ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: stable@vger.kernel.org # 4.2
Link: http://lkml.kernel.org/r/52e049a497e86fd0b71c529651def8871c804df0.1527283897.git.tony.luck@intel.com

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 7e6f51a9d917..e93670d736a6 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1207,13 +1207,18 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		lmce = m.mcgstatus & MCG_STATUS_LMCES;
 
 	/*
+	 * Local machine check may already know that we have to panic.
+	 * Broadcast machine check begins rendezvous in mce_start()
 	 * Go through all banks in exclusion of the other CPUs. This way we
 	 * don't report duplicated events on shared banks because the first one
-	 * to see it will clear it. If this is a Local MCE, then no need to
-	 * perform rendezvous.
+	 * to see it will clear it.
 	 */
-	if (!lmce)
+	if (lmce) {
+		if (no_way_out)
+			mce_panic("Fatal local machine check", &m, msg);
+	} else {
 		order = mce_start(&no_way_out);
+	}
 
 	for (i = 0; i < cfg->banks; i++) {
 		__clear_bit(i, toclear);
@@ -1289,12 +1294,17 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 			no_way_out = worst >= MCE_PANIC_SEVERITY;
 	} else {
 		/*
-		 * Local MCE skipped calling mce_reign()
-		 * If we found a fatal error, we need to panic here.
+		 * If there was a fatal machine check we should have
+		 * already called mce_panic earlier in this function.
+		 * Since we re-read the banks, we might have found
+		 * something new. Check again to see if we found a
+		 * fatal error. We call "mce_severity()" again to
+		 * make sure we have the right "msg".
 		 */
-		 if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3)
-			mce_panic("Machine check from unknown source",
-				NULL, NULL);
+		if (worst >= MCE_PANIC_SEVERITY && mca_cfg.tolerant < 3) {
+			mce_severity(&m, cfg->tolerant, &msg, true);
+			mce_panic("Local fatal machine check!", &m, msg);
+		}
 	}
 
 	/*
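
Editorial sketch (not part of the patch): the two panic sites that the
patch touches on the local-MCE path can be modelled in a few lines of
user-space C. This is a simplified illustration only. The enum, the
function names and the boolean inputs are hypothetical stand-ins for
state that do_machine_check() computes itself; the broadcast path
through mce_start() is not modelled. Reading the "f" on the first line
of the console log as the MCG_STATUS value is an assumption about the
log format; if it holds, bit 3 (LMCES) is set, which would make the
logged event a local machine check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* MCG_STATUS bits as defined by the x86 machine-check architecture. */
#define MCG_STATUS_RIPV  (1ULL << 0)  /* restart IP valid */
#define MCG_STATUS_EIPV  (1ULL << 1)  /* error IP valid */
#define MCG_STATUS_MCIP  (1ULL << 2)  /* machine check in progress */
#define MCG_STATUS_LMCES (1ULL << 3)  /* local machine check signaled */

/* Hypothetical labels for the two panic sites changed by the patch. */
enum panic_site {
	NO_LOCAL_PANIC,    /* neither site fires, or the MCE was broadcast */
	PANIC_EARLY,       /* before the bank scan, no_way_out already set */
	PANIC_AFTER_SCAN   /* the bank scan found something newly fatal */
};

/*
 * Simplified model of the patched flow. "no_way_out" stands in for the
 * early mce_no_way_out() result; "fatal_after_scan" stands in for
 * "worst >= MCE_PANIC_SEVERITY" once the banks have been re-read.
 */
static enum panic_site local_mce_panic_site(uint64_t mcgstatus,
					    bool no_way_out,
					    bool fatal_after_scan,
					    int tolerant)
{
	bool lmce = (mcgstatus & MCG_STATUS_LMCES) != 0;

	if (!lmce)
		return NO_LOCAL_PANIC;  /* broadcast path, not modelled here */
	if (no_way_out)
		return PANIC_EARLY;
	if (fatal_after_scan && tolerant < 3)
		return PANIC_AFTER_SCAN;
	return NO_LOCAL_PANIC;
}

/* Message passed to mce_panic() at each site, taken from the diff above. */
static const char *panic_message(enum panic_site site)
{
	switch (site) {
	case PANIC_EARLY:
		return "Fatal local machine check";
	case PANIC_AFTER_SCAN:
		return "Local fatal machine check!";
	default:
		return NULL;
	}
}
```

With mcgstatus 0xf and no_way_out set, the model selects the early
site, so the panic carries the bank details instead of "unknown
source". The second site only fires when the re-read of the banks turns
up a fatal error that the early check did not see.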
