[PATCH v2 0/2] x86/mce: Grade new machine errors for AMD MCEs and include messages for panic cases

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2 0/2] x86/mce: Grade new machine errors for AMD MCEs and include messages for panic cases
@ 2022-03-31 16:38 Carlos Bilbao
  2022-03-31 16:38 ` [PATCH v2 1/2] x86/mce: Extend AMD severity grading function with new types of errors Carlos Bilbao
  2022-03-31 16:38 ` [PATCH v2 2/2] x86/mce: Add messages to describe panic machine errors on AMD's MCEs grading Carlos Bilbao
  0 siblings, 2 replies; 11+ messages in thread
From: Carlos Bilbao @ 2022-03-31 16:38 UTC (permalink / raw)
  To: bp
  Cc: tglx, mingo, dave.hansen, x86, yazen.ghannam, linux-kernel,
	linux-edac, bilbao, Carlos Bilbao

This patchset includes grading of new types of machine errors on AMD's MCE
grading logic mce_severity_amd(), which helps the MCE handler determine
what actions to take. If the error is graded as a PANIC, the EDAC driver
will not decode; so we also include new error messages to describe the MCE
and help debugging critical errors.

Changes since v1:
  On patch 1/2, follow a simplified approach for severity.c, that resembles
  the available PPR more closely. This also simplifies patch 2/2, as less 
  panic error messages are added.

Carlos Bilbao (2):
  x86/mce: Extend AMD severity grading function with new types of errors
  x86/mce: Add messages to describe panic machine errors on AMD's MCEs grading

 arch/x86/kernel/cpu/mce/severity.c | 111 ++++++++++++-----------------
 1 file changed, 44 insertions(+), 67 deletions(-)

-- 
2.27.0

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2 1/2] x86/mce: Extend AMD severity grading function with new types of errors
  2022-03-31 16:38 [PATCH v2 0/2] x86/mce: Grade new machine errors for AMD MCEs and include messages for panic cases Carlos Bilbao
@ 2022-03-31 16:38 ` Carlos Bilbao
  2022-04-05 17:18   ` Yazen Ghannam
  2022-03-31 16:38 ` [PATCH v2 2/2] x86/mce: Add messages to describe panic machine errors on AMD's MCEs grading Carlos Bilbao
  1 sibling, 1 reply; 11+ messages in thread
From: Carlos Bilbao @ 2022-03-31 16:38 UTC (permalink / raw)
  To: bp
  Cc: tglx, mingo, dave.hansen, x86, yazen.ghannam, linux-kernel,
	linux-edac, bilbao, Carlos Bilbao

The MCE handler needs to understand the severity of the machine errors to
act accordingly. In the case of AMD, very few errors are covered in the
grading logic.

Extend the MCEs severity grading of AMD to cover new types of machine
errors.

Signed-off-by: Carlos Bilbao <carlos.bilbao@amd.com>
---
 arch/x86/kernel/cpu/mce/severity.c | 104 ++++++++++-------------------
 1 file changed, 37 insertions(+), 67 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
index 1add86935349..4d52eef21230 100644
--- a/arch/x86/kernel/cpu/mce/severity.c
+++ b/arch/x86/kernel/cpu/mce/severity.c
@@ -301,85 +301,55 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
 	}
 }
 
-static __always_inline int mce_severity_amd_smca(struct mce *m, enum context err_ctx)
-{
-	u64 mcx_cfg;
-
-	/*
-	 * We need to look at the following bits:
-	 * - "succor" bit (data poisoning support), and
-	 * - TCC bit (Task Context Corrupt)
-	 * in MCi_STATUS to determine error severity.
-	 */
-	if (!mce_flags.succor)
-		return MCE_PANIC_SEVERITY;
-
-	mcx_cfg = mce_rdmsrl(MSR_AMD64_SMCA_MCx_CONFIG(m->bank));
-
-	/* TCC (Task context corrupt). If set and if IN_KERNEL, panic. */
-	if ((mcx_cfg & MCI_CONFIG_MCAX) &&
-	    (m->status & MCI_STATUS_TCC) &&
-	    (err_ctx == IN_KERNEL))
-		return MCE_PANIC_SEVERITY;
-
-	 /* ...otherwise invoke hwpoison handler. */
-	return MCE_AR_SEVERITY;
-}
-
 /*
- * See AMD Error Scope Hierarchy table in a newer BKDG. For example
- * 49125_15h_Models_30h-3Fh_BKDG.pdf, section "RAS Features"
+ * See AMD PPR(s) section 3.1 Machine Check Architecture
  */
 static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, char **msg, bool is_excp)
 {
-	enum context ctx = error_context(m, regs);
+	int ret;
+
+	/*
+	 * Default return value: Action required, the error must be handled
+	 * immediately.
+	 */
+	ret = MCE_AR_SEVERITY;
 
 	/* Processor Context Corrupt, no need to fumble too much, die! */
-	if (m->status & MCI_STATUS_PCC)
-		return MCE_PANIC_SEVERITY;
+	if (m->status & MCI_STATUS_PCC) {
+		ret = MCE_PANIC_SEVERITY;
+		goto amd_severity;
+	}
 
-	if (m->status & MCI_STATUS_UC) {
+	/*
+	 * Evaluate the severity of deferred errors for AMD systems, for which only
+	 * scrub error is interesting to notify an action requirement. The poll
+	 * handler catches deferred errors and adds to mce_ring so memorty-failure
+	 * can take recovery actions.
+	 */
+	if (m->status & MCI_STATUS_DEFERRED) {
+		ret = MCE_DEFERRED_SEVERITY;
+		goto amd_severity;
+	}
 
-		if (ctx == IN_KERNEL)
-			return MCE_PANIC_SEVERITY;
+	/* If the UC bit is not set, the error has been corrected */
+	if (!(m->status & MCI_STATUS_UC)) {
+		ret = MCE_KEEP_SEVERITY;
+		goto amd_severity;
+	}
 
-		/*
-		 * On older systems where overflow_recov flag is not present, we
-		 * should simply panic if an error overflow occurs. If
-		 * overflow_recov flag is present and set, then software can try
-		 * to at least kill process to prolong system operation.
-		 */
-		if (mce_flags.overflow_recov) {
-			if (mce_flags.smca)
-				return mce_severity_amd_smca(m, ctx);
-
-			/* kill current process */
-			return MCE_AR_SEVERITY;
-		} else {
-			/* at least one error was not logged */
-			if (m->status & MCI_STATUS_OVER)
-				return MCE_PANIC_SEVERITY;
-		}
-
-		/*
-		 * For any other case, return MCE_UC_SEVERITY so that we log the
-		 * error and exit #MC handler.
-		 */
-		return MCE_UC_SEVERITY;
+	if (((m->status & MCI_STATUS_OVER) && !mce_flags.overflow_recov)
+	     || !mce_flags.succor) {
+		ret = MCE_PANIC_SEVERITY;
+		goto amd_severity;
 	}
 
-	/*
-	 * deferred error: poll handler catches these and adds to mce_ring so
-	 * memory-failure can take recovery actions.
-	 */
-	if (m->status & MCI_STATUS_DEFERRED)
-		return MCE_DEFERRED_SEVERITY;
+	if (error_context(m, regs) == IN_KERNEL) {
+		ret = MCE_PANIC_SEVERITY;
+	}
 
-	/*
-	 * corrected error: poll handler catches these and passes responsibility
-	 * of decoding the error to EDAC
-	 */
-	return MCE_KEEP_SEVERITY;
+amd_severity:
+
+	return ret;
 }
 
 static noinstr int mce_severity_intel(struct mce *m, struct pt_regs *regs, char **msg, bool is_excp)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v2 2/2] x86/mce: Add messages to describe panic machine errors on AMD's MCEs grading
  2022-03-31 16:38 [PATCH v2 0/2] x86/mce: Grade new machine errors for AMD MCEs and include messages for panic cases Carlos Bilbao
  2022-03-31 16:38 ` [PATCH v2 1/2] x86/mce: Extend AMD severity grading function with new types of errors Carlos Bilbao
@ 2022-03-31 16:38 ` Carlos Bilbao
  2022-03-31 17:17   ` Day, Michael
  2022-04-05 17:38   ` Yazen Ghannam
  1 sibling, 2 replies; 11+ messages in thread
From: Carlos Bilbao @ 2022-03-31 16:38 UTC (permalink / raw)
  To: bp
  Cc: tglx, mingo, dave.hansen, x86, yazen.ghannam, linux-kernel,
	linux-edac, bilbao, Carlos Bilbao

When a machine error is graded as PANIC by AMD grading logic, the MCE
handler calls mce_panic(). The notification chain does not come into effect
so the AMD EDAC driver does not decode the errors. In these cases, the
messages displayed to the user are more cryptic and miss information
that might be relevant, like the context in which the error took place.

Fix the above issue including messages on AMD's grading logic for machine
errors graded as PANIC.

Signed-off-by: Carlos Bilbao <carlos.bilbao@amd.com>
---
 arch/x86/kernel/cpu/mce/severity.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
index 4d52eef21230..ea4b9407bbad 100644
--- a/arch/x86/kernel/cpu/mce/severity.c
+++ b/arch/x86/kernel/cpu/mce/severity.c
@@ -307,6 +307,7 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
 static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, char **msg, bool is_excp)
 {
 	int ret;
+	char *panic_msg;
 
 	/*
 	 * Default return value: Action required, the error must be handled
@@ -316,6 +317,7 @@ static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, char **
 
 	/* Processor Context Corrupt, no need to fumble too much, die! */
 	if (m->status & MCI_STATUS_PCC) {
+		panic_msg = "Processor Context Corrupt";
 		ret = MCE_PANIC_SEVERITY;
 		goto amd_severity;
 	}
@@ -339,16 +341,21 @@ static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, char **
 
 	if (((m->status & MCI_STATUS_OVER) && !mce_flags.overflow_recov)
 	     || !mce_flags.succor) {
+		panic_msg = "Uncorrected unrecoverable error";
 		ret = MCE_PANIC_SEVERITY;
 		goto amd_severity;
 	}
 
 	if (error_context(m, regs) == IN_KERNEL) {
+		panic_msg = "Uncorrected error in kernel context";
 		ret = MCE_PANIC_SEVERITY;
 	}
 
 amd_severity:
 
+	if (msg && panic_msg)
+		*msg = panic_msg;
+
 	return ret;
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* RE: [PATCH v2 2/2] x86/mce: Add messages to describe panic machine errors on AMD's MCEs grading
  2022-03-31 16:38 ` [PATCH v2 2/2] x86/mce: Add messages to describe panic machine errors on AMD's MCEs grading Carlos Bilbao
@ 2022-03-31 17:17   ` Day, Michael
  2022-04-05 17:38   ` Yazen Ghannam
  1 sibling, 0 replies; 11+ messages in thread
From: Day, Michael @ 2022-03-31 17:17 UTC (permalink / raw)
  To: Bilbao, Carlos, bp
  Cc: tglx, mingo, dave.hansen, x86, Ghannam, Yazen, linux-kernel,
	linux-edac, bilbao

[AMD Official Use Only]

-----Original Message-----
From: Bilbao, Carlos <Carlos.Bilbao@amd.com> 
Sent: Thursday, March 31, 2022 12:39 PM
To: bp@alien8.de
Cc: tglx@linutronix.de; mingo@redhat.com; dave.hansen@linux.intel.com; x86@kernel.org; Ghannam, Yazen <Yazen.Ghannam@amd.com>; linux-kernel@vger.kernel.org; linux-edac@vger.kernel.org; bilbao@vt.edu; Bilbao, Carlos <Carlos.Bilbao@amd.com>
Subject: [PATCH v2 2/2] x86/mce: Add messages to describe panic machine errors on AMD's MCEs grading

When a machine error is graded as PANIC by AMD grading logic, the MCE handler calls mce_panic(). The notification chain does not come into effect so the AMD EDAC driver does not decode the errors. In these cases, the messages displayed to the user are more cryptic and miss information that might be relevant, like the context in which the error took place.

Fix the above issue including messages on AMD's grading logic for machine errors graded as PANIC.

Signed-off-by: Carlos Bilbao <carlos.bilbao@amd.com>
---
 arch/x86/kernel/cpu/mce/severity.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
index 4d52eef21230..ea4b9407bbad 100644
--- a/arch/x86/kernel/cpu/mce/severity.c
+++ b/arch/x86/kernel/cpu/mce/severity.c
@@ -307,6 +307,7 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)  static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, char **msg, bool is_excp)  {
 	int ret;
+	char *panic_msg;

I suggest char *panic_msg = NULL;

Simply because its stack-based and a non-null value may get assigned to *msg after the amd_severity: label. 
 
 	/*
 	 * Default return value: Action required, the error must be handled @@ -316,6 +317,7 @@ static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, char **
 
 	/* Processor Context Corrupt, no need to fumble too much, die! */
 	if (m->status & MCI_STATUS_PCC) {
+		panic_msg = "Processor Context Corrupt";
 		ret = MCE_PANIC_SEVERITY;
 		goto amd_severity;
 	}
@@ -339,16 +341,21 @@ static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, char **
 
 	if (((m->status & MCI_STATUS_OVER) && !mce_flags.overflow_recov)
 	     || !mce_flags.succor) {
+		panic_msg = "Uncorrected unrecoverable error";
 		ret = MCE_PANIC_SEVERITY;
 		goto amd_severity;
 	}
 
 	if (error_context(m, regs) == IN_KERNEL) {
+		panic_msg = "Uncorrected error in kernel context";
 		ret = MCE_PANIC_SEVERITY;
 	}
 
 amd_severity:
 
+	if (msg && panic_msg)
+		*msg = panic_msg;
+
 	return ret;
 }
 
--
2.31.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] x86/mce: Extend AMD severity grading function with new types of errors
  2022-03-31 16:38 ` [PATCH v2 1/2] x86/mce: Extend AMD severity grading function with new types of errors Carlos Bilbao
@ 2022-04-05 17:18   ` Yazen Ghannam
  2022-04-05 17:24     ` Carlos Bilbao
  0 siblings, 1 reply; 11+ messages in thread
From: Yazen Ghannam @ 2022-04-05 17:18 UTC (permalink / raw)
  To: Carlos Bilbao
  Cc: bp, tglx, mingo, dave.hansen, x86, linux-kernel, linux-edac, bilbao

On Thu, Mar 31, 2022 at 11:38:49AM -0500, Carlos Bilbao wrote:
> The MCE handler needs to understand the severity of the machine errors to
> act accordingly. In the case of AMD, very few errors are covered in the
> grading logic.
> 
> Extend the MCEs severity grading of AMD to cover new types of machine
> errors.
>

This patch does not add new types of machine errors. Please update the commit
message (and cover letter) to be consistent with changes made between patch
revisions.
 
> Signed-off-by: Carlos Bilbao <carlos.bilbao@amd.com>
> ---
>  arch/x86/kernel/cpu/mce/severity.c | 104 ++++++++++-------------------
>  1 file changed, 37 insertions(+), 67 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
> index 1add86935349..4d52eef21230 100644
> --- a/arch/x86/kernel/cpu/mce/severity.c
> +++ b/arch/x86/kernel/cpu/mce/severity.c
> @@ -301,85 +301,55 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
>  	}
>  }
>  
> -static __always_inline int mce_severity_amd_smca(struct mce *m, enum context err_ctx)
> -{
> -	u64 mcx_cfg;
> -
> -	/*
> -	 * We need to look at the following bits:
> -	 * - "succor" bit (data poisoning support), and
> -	 * - TCC bit (Task Context Corrupt)
> -	 * in MCi_STATUS to determine error severity.
> -	 */
> -	if (!mce_flags.succor)
> -		return MCE_PANIC_SEVERITY;
> -
> -	mcx_cfg = mce_rdmsrl(MSR_AMD64_SMCA_MCx_CONFIG(m->bank));
> -
> -	/* TCC (Task context corrupt). If set and if IN_KERNEL, panic. */
> -	if ((mcx_cfg & MCI_CONFIG_MCAX) &&
> -	    (m->status & MCI_STATUS_TCC) &&
> -	    (err_ctx == IN_KERNEL))
> -		return MCE_PANIC_SEVERITY;
> -
> -	 /* ...otherwise invoke hwpoison handler. */
> -	return MCE_AR_SEVERITY;
> -}
> -
>  /*
> - * See AMD Error Scope Hierarchy table in a newer BKDG. For example
> - * 49125_15h_Models_30h-3Fh_BKDG.pdf, section "RAS Features"
> + * See AMD PPR(s) section 3.1 Machine Check Architecture

I don't know that section numbers will be consistent between different PPR
versions, so having the section name is a good idea. The "Machine Check Error
Handling" section is what the severity grading function is based on.

>   */
>  static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, char **msg, bool is_excp)
>  {
> -	enum context ctx = error_context(m, regs);
> +	int ret;
> +
> +	/*
> +	 * Default return value: Action required, the error must be handled
> +	 * immediately.
> +	 */
> +	ret = MCE_AR_SEVERITY;
>  
>  	/* Processor Context Corrupt, no need to fumble too much, die! */
> -	if (m->status & MCI_STATUS_PCC)
> -		return MCE_PANIC_SEVERITY;
> +	if (m->status & MCI_STATUS_PCC) {
> +		ret = MCE_PANIC_SEVERITY;
> +		goto amd_severity;
> +	}
>  
> -	if (m->status & MCI_STATUS_UC) {
> +	/*
> +	 * Evaluate the severity of deferred errors for AMD systems, for which only
> +	 * scrub error is interesting to notify an action requirement. The poll
> +	 * handler catches deferred errors and adds to mce_ring so memorty-failure
> +	 * can take recovery actions.
> +	 */

I think this whole comment can be dropped. The "scrub error" part is not
correct. The polling function may find deferred errors, but they are most
likely to be see by the deferred error interrupt handler on modern AMD
systems. The "mce_ring" was removed a long time ago (in v4.3).

> +	if (m->status & MCI_STATUS_DEFERRED) {
> +		ret = MCE_DEFERRED_SEVERITY;
> +		goto amd_severity;
> +	}
>  
> -		if (ctx == IN_KERNEL)
> -			return MCE_PANIC_SEVERITY;
> +	/* If the UC bit is not set, the error has been corrected */

This comment is not true. Deferred errors are an example of an uncorrectable
error where UC is not set.

> +	if (!(m->status & MCI_STATUS_UC)) {
> +		ret = MCE_KEEP_SEVERITY;
> +		goto amd_severity;
> +	}
>  
> -		/*
> -		 * On older systems where overflow_recov flag is not present, we
> -		 * should simply panic if an error overflow occurs. If
> -		 * overflow_recov flag is present and set, then software can try
> -		 * to at least kill process to prolong system operation.
> -		 */
> -		if (mce_flags.overflow_recov) {
> -			if (mce_flags.smca)
> -				return mce_severity_amd_smca(m, ctx);
> -
> -			/* kill current process */
> -			return MCE_AR_SEVERITY;
> -		} else {
> -			/* at least one error was not logged */
> -			if (m->status & MCI_STATUS_OVER)
> -				return MCE_PANIC_SEVERITY;
> -		}
> -
> -		/*
> -		 * For any other case, return MCE_UC_SEVERITY so that we log the
> -		 * error and exit #MC handler.
> -		 */
> -		return MCE_UC_SEVERITY;
> +	if (((m->status & MCI_STATUS_OVER) && !mce_flags.overflow_recov)
> +	     || !mce_flags.succor) {

I appreciate merged two cases together that have the same result. But I feel
keeping them separate may be easier to follow. They can also each have their
own code comments. Or keep them together and explain each within the same
comment block.

Also, there's a checkpatch "CHECK" here. You'll see it when using the
"--strict" flag with checkpatch.

> +		ret = MCE_PANIC_SEVERITY;
> +		goto amd_severity;
>  	}
>  
> -	/*
> -	 * deferred error: poll handler catches these and adds to mce_ring so
> -	 * memory-failure can take recovery actions.
> -	 */
> -	if (m->status & MCI_STATUS_DEFERRED)
> -		return MCE_DEFERRED_SEVERITY;
> +	if (error_context(m, regs) == IN_KERNEL) {
> +		ret = MCE_PANIC_SEVERITY;
> +	}

Braces aren't needed here. The previous comment about braces was for when
there's a block of "if/else-if/else" statements. A single "if" statement with
a single line doesn't need braces.

>  
> -	/*
> -	 * corrected error: poll handler catches these and passes responsibility
> -	 * of decoding the error to EDAC
> -	 */
> -	return MCE_KEEP_SEVERITY;
> +amd_severity:

This label doesn't look right to me. Maybe I'm too used to seeing "out" and
"err" labels.

Please see "Documentation/process/coding-style.rst" section (7) "Centralized
exiting of functions".

Maybe something like "out_ret_severity" to indicate the code is going to exit
and return the severity. Or maybe just use "out"? Maybe others have thoughts
on this.

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] x86/mce: Extend AMD severity grading function with new types of errors
  2022-04-05 17:18   ` Yazen Ghannam
@ 2022-04-05 17:24     ` Carlos Bilbao
  2022-04-05 17:41       ` Yazen Ghannam
  0 siblings, 1 reply; 11+ messages in thread
From: Carlos Bilbao @ 2022-04-05 17:24 UTC (permalink / raw)
  To: Yazen Ghannam
  Cc: bp, tglx, mingo, dave.hansen, x86, linux-kernel, linux-edac, bilbao

On 4/5/2022 12:18 PM, Yazen Ghannam wrote:
> On Thu, Mar 31, 2022 at 11:38:49AM -0500, Carlos Bilbao wrote:
>> The MCE handler needs to understand the severity of the machine errors to
>> act accordingly. In the case of AMD, very few errors are covered in the
>> grading logic.
>>
>> Extend the MCEs severity grading of AMD to cover new types of machine
>> errors.
>>
> 
> This patch does not add new types of machine errors. Please update the commit
> message (and cover letter) to be consistent with changes made between patch
> revisions.
>  

I'm thinking "cover error cases not previously considered".

>> Signed-off-by: Carlos Bilbao <carlos.bilbao@amd.com>
>> ---
>>  arch/x86/kernel/cpu/mce/severity.c | 104 ++++++++++-------------------
>>  1 file changed, 37 insertions(+), 67 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
>> index 1add86935349..4d52eef21230 100644
>> --- a/arch/x86/kernel/cpu/mce/severity.c
>> +++ b/arch/x86/kernel/cpu/mce/severity.c
>> @@ -301,85 +301,55 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
>>  	}
>>  }
>>  
>> -static __always_inline int mce_severity_amd_smca(struct mce *m, enum context err_ctx)
>> -{
>> -	u64 mcx_cfg;
>> -
>> -	/*
>> -	 * We need to look at the following bits:
>> -	 * - "succor" bit (data poisoning support), and
>> -	 * - TCC bit (Task Context Corrupt)
>> -	 * in MCi_STATUS to determine error severity.
>> -	 */
>> -	if (!mce_flags.succor)
>> -		return MCE_PANIC_SEVERITY;
>> -
>> -	mcx_cfg = mce_rdmsrl(MSR_AMD64_SMCA_MCx_CONFIG(m->bank));
>> -
>> -	/* TCC (Task context corrupt). If set and if IN_KERNEL, panic. */
>> -	if ((mcx_cfg & MCI_CONFIG_MCAX) &&
>> -	    (m->status & MCI_STATUS_TCC) &&
>> -	    (err_ctx == IN_KERNEL))
>> -		return MCE_PANIC_SEVERITY;
>> -
>> -	 /* ...otherwise invoke hwpoison handler. */
>> -	return MCE_AR_SEVERITY;
>> -}
>> -
>>  /*
>> - * See AMD Error Scope Hierarchy table in a newer BKDG. For example
>> - * 49125_15h_Models_30h-3Fh_BKDG.pdf, section "RAS Features"
>> + * See AMD PPR(s) section 3.1 Machine Check Architecture
> 
> I don't know that section numbers will be consistent between different PPR
> versions, so having the section name is a good idea. The "Machine Check Error
> Handling" section is what the severity grading function is based on.
> 

Ack

>>   */
>>  static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, char **msg, bool is_excp)
>>  {
>> -	enum context ctx = error_context(m, regs);
>> +	int ret;
>> +
>> +	/*
>> +	 * Default return value: Action required, the error must be handled
>> +	 * immediately.
>> +	 */
>> +	ret = MCE_AR_SEVERITY;
>>  
>>  	/* Processor Context Corrupt, no need to fumble too much, die! */
>> -	if (m->status & MCI_STATUS_PCC)
>> -		return MCE_PANIC_SEVERITY;
>> +	if (m->status & MCI_STATUS_PCC) {
>> +		ret = MCE_PANIC_SEVERITY;
>> +		goto amd_severity;
>> +	}
>>  
>> -	if (m->status & MCI_STATUS_UC) {
>> +	/*
>> +	 * Evaluate the severity of deferred errors for AMD systems, for which only
>> +	 * scrub error is interesting to notify an action requirement. The poll
>> +	 * handler catches deferred errors and adds to mce_ring so memorty-failure
>> +	 * can take recovery actions.
>> +	 */
> 
> I think this whole comment can be dropped. The "scrub error" part is not
> correct. The polling function may find deferred errors, but they are most
> likely to be see by the deferred error interrupt handler on modern AMD
> systems. The "mce_ring" was removed a long time ago (in v4.3).
> 

Ack

>> +	if (m->status & MCI_STATUS_DEFERRED) {
>> +		ret = MCE_DEFERRED_SEVERITY;
>> +		goto amd_severity;
>> +	}
>>  
>> -		if (ctx == IN_KERNEL)
>> -			return MCE_PANIC_SEVERITY;
>> +	/* If the UC bit is not set, the error has been corrected */
> 
> This comment is not true. Deferred errors are an example of an uncorrectable
> error where UC is not set.
> 

Ack

>> +	if (!(m->status & MCI_STATUS_UC)) {
>> +		ret = MCE_KEEP_SEVERITY;
>> +		goto amd_severity;
>> +	}
>>  
>> -		/*
>> -		 * On older systems where overflow_recov flag is not present, we
>> -		 * should simply panic if an error overflow occurs. If
>> -		 * overflow_recov flag is present and set, then software can try
>> -		 * to at least kill process to prolong system operation.
>> -		 */
>> -		if (mce_flags.overflow_recov) {
>> -			if (mce_flags.smca)
>> -				return mce_severity_amd_smca(m, ctx);
>> -
>> -			/* kill current process */
>> -			return MCE_AR_SEVERITY;
>> -		} else {
>> -			/* at least one error was not logged */
>> -			if (m->status & MCI_STATUS_OVER)
>> -				return MCE_PANIC_SEVERITY;
>> -		}
>> -
>> -		/*
>> -		 * For any other case, return MCE_UC_SEVERITY so that we log the
>> -		 * error and exit #MC handler.
>> -		 */
>> -		return MCE_UC_SEVERITY;
>> +	if (((m->status & MCI_STATUS_OVER) && !mce_flags.overflow_recov)
>> +	     || !mce_flags.succor) {
> 
> I appreciate merged two cases together that have the same result. But I feel
> keeping them separate may be easier to follow. They can also each have their
> own code comments. Or keep them together and explain each within the same
> comment block.
> 

I will divide these two cases.

> Also, there's a checkpatch "CHECK" here. You'll see it when using the
> "--strict" flag with checkpatch.
> 
>> +		ret = MCE_PANIC_SEVERITY;
>> +		goto amd_severity;
>>  	}
>>  
>> -	/*
>> -	 * deferred error: poll handler catches these and adds to mce_ring so
>> -	 * memory-failure can take recovery actions.
>> -	 */
>> -	if (m->status & MCI_STATUS_DEFERRED)
>> -		return MCE_DEFERRED_SEVERITY;
>> +	if (error_context(m, regs) == IN_KERNEL) {
>> +		ret = MCE_PANIC_SEVERITY;
>> +	}
> 
> Braces aren't needed here. The previous comment about braces was for when
> there's a block of "if/else-if/else" statements. A single "if" statement with
> a single line doesn't need braces.
> 

Ack

>>  
>> -	/*
>> -	 * corrected error: poll handler catches these and passes responsibility
>> -	 * of decoding the error to EDAC
>> -	 */
>> -	return MCE_KEEP_SEVERITY;
>> +amd_severity:
> 
> This label doesn't look right to me. Maybe I'm too used to seeing "out" and
> "err" labels.
> 
> Please see "Documentation/process/coding-style.rst" section (7) "Centralized
> exiting of functions".
> 
> Maybe something like "out_ret_severity" to indicate the code is going to exit
> and return the severity. Or maybe just use "out"? Maybe others have thoughts
> on this.
> 

"out_amd_severity" sounds good to me.

> Thanks,
> Yazen

Will send updated pachset. 

Thanks,
Carlos

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 2/2] x86/mce: Add messages to describe panic machine errors on AMD's MCEs grading
  2022-03-31 16:38 ` [PATCH v2 2/2] x86/mce: Add messages to describe panic machine errors on AMD's MCEs grading Carlos Bilbao
  2022-03-31 17:17   ` Day, Michael
@ 2022-04-05 17:38   ` Yazen Ghannam
  1 sibling, 0 replies; 11+ messages in thread
From: Yazen Ghannam @ 2022-04-05 17:38 UTC (permalink / raw)
  To: Carlos Bilbao
  Cc: bp, tglx, mingo, dave.hansen, x86, linux-kernel, linux-edac, bilbao

On Thu, Mar 31, 2022 at 11:38:50AM -0500, Carlos Bilbao wrote:

...

>  static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, char **msg, bool is_excp)
>  {
>  	int ret;
> +	char *panic_msg;

Order variable lines from longest to shortest.

And the pointer should be initiliazed to NULL like Mike said also.

>  
>  	/*
>  	 * Default return value: Action required, the error must be handled
> @@ -316,6 +317,7 @@ static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, char **
>  
>  	/* Processor Context Corrupt, no need to fumble too much, die! */
>  	if (m->status & MCI_STATUS_PCC) {
> +		panic_msg = "Processor Context Corrupt";
>  		ret = MCE_PANIC_SEVERITY;
>  		goto amd_severity;
>  	}
> @@ -339,16 +341,21 @@ static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, char **
>  
>  	if (((m->status & MCI_STATUS_OVER) && !mce_flags.overflow_recov)
>  	     || !mce_flags.succor) {
> +		panic_msg = "Uncorrected unrecoverable error";

So these two cases should definitely be separate. One is "Overflowed
uncorrected error without MCA Overflow Recovery", and the other is
"Uncorrected error without MCA Recovery".

>  		ret = MCE_PANIC_SEVERITY;
>  		goto amd_severity;
>  	}
>  
>  	if (error_context(m, regs) == IN_KERNEL) {
> +		panic_msg = "Uncorrected error in kernel context";

This should be "Uncorrected unrecoverable error in kernel context". There is
the IN_KERNEL_RECOV error context for a recoverable error in kernel context.

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] x86/mce: Extend AMD severity grading function with new types of errors
  2022-04-05 17:24     ` Carlos Bilbao
@ 2022-04-05 17:41       ` Yazen Ghannam
  2022-04-05 17:46         ` Carlos Bilbao
  0 siblings, 1 reply; 11+ messages in thread
From: Yazen Ghannam @ 2022-04-05 17:41 UTC (permalink / raw)
  To: Carlos Bilbao
  Cc: bp, tglx, mingo, dave.hansen, x86, linux-kernel, linux-edac, bilbao

On Tue, Apr 05, 2022 at 12:24:31PM -0500, Carlos Bilbao wrote:
> On 4/5/2022 12:18 PM, Yazen Ghannam wrote:
> > On Thu, Mar 31, 2022 at 11:38:49AM -0500, Carlos Bilbao wrote:
> >> The MCE handler needs to understand the severity of the machine errors to
> >> act accordingly. In the case of AMD, very few errors are covered in the
> >> grading logic.
> >>
> >> Extend the MCEs severity grading of AMD to cover new types of machine
> >> errors.
> >>
> > 
> > This patch does not add new types of machine errors. Please update the commit
> > message (and cover letter) to be consistent with changes made between patch
> > revisions.
> >  
> 
> I'm thinking "cover error cases not previously considered".
>

Please elaborate. It seems that all the cases in this patch are already
covered in the existing code. This patch seems to simplify the existing code.

Thanks,
Yazen 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] x86/mce: Extend AMD severity grading function with new types of errors
  2022-04-05 17:41       ` Yazen Ghannam
@ 2022-04-05 17:46         ` Carlos Bilbao
  0 siblings, 0 replies; 11+ messages in thread
From: Carlos Bilbao @ 2022-04-05 17:46 UTC (permalink / raw)
  To: Yazen Ghannam
  Cc: bp, tglx, mingo, dave.hansen, x86, linux-kernel, linux-edac, bilbao

On 4/5/2022 12:41 PM, Yazen Ghannam wrote:
> On Tue, Apr 05, 2022 at 12:24:31PM -0500, Carlos Bilbao wrote:
>> On 4/5/2022 12:18 PM, Yazen Ghannam wrote:
>>> On Thu, Mar 31, 2022 at 11:38:49AM -0500, Carlos Bilbao wrote:
>>>> The MCE handler needs to understand the severity of the machine errors to
>>>> act accordingly. In the case of AMD, very few errors are covered in the
>>>> grading logic.
>>>>
>>>> Extend the MCEs severity grading of AMD to cover new types of machine
>>>> errors.
>>>>
>>>
>>> This patch does not add new types of machine errors. Please update the commit
>>> message (and cover letter) to be consistent with changes made between patch
>>> revisions.
>>>  
>>
>> I'm thinking "cover error cases not previously considered".
>>
> 
> Please elaborate. It seems that all the cases in this patch are already
> covered in the existing code. This patch seems to simplify the existing code.
> 

You are right, will update. Since we don't "extend" the severity grading anymore,
i.e. will change commit message, it's probably best if I send a completely new patch set 
instead of considering the next iteration v3.

> Thanks,
> Yazen 

Thanks,
Carlos

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 0/2] x86/mce: Grade new machine errors for AMD MCEs and include messages for panic cases
  2022-03-31 16:32 [PATCH v2 0/2] x86/mce: Grade new machine errors for AMD MCEs and include messages for panic cases Carlos Bilbao
@ 2022-03-31 16:35 ` Carlos Bilbao
  0 siblings, 0 replies; 11+ messages in thread
From: Carlos Bilbao @ 2022-03-31 16:35 UTC (permalink / raw)
  To: bp
  Cc: tglx, mingo, dave.hansen, x86, yazen.ghannam, linux-kernel,
	linux-edac, bilbao

On 3/31/2022 11:32 AM, Carlos Bilbao wrote:
> This patchset includes grading of new types of machine errors on AMD's MCE
> grading logic mce_severity_amd(), which helps the MCE handler determine
> what actions to take. If the error is graded as a PANIC, the EDAC driver
> will not decode; so we also include new error messages to describe the MCE
> and help debugging critical errors.
> 
> Changes since v1:
>   On patch 1/2, follow a simplified approach for severity.c, that resembles
>   the available PPR more closely. This also simplifies patch 2/2, as less 
>   panic error messages are added.
> 
> Carlos Bilbao (2):
>   x86/mce: Extend AMD severity grading function with new types of errors
>   x86/mce: Add messages to describe panic machine errors on AMD's MCEs grading
>  

Oops...

>  arch/x86/include/asm/mce.h         |   6 +
>  arch/x86/kernel/cpu/mce/severity.c | 203 ++++++++++++++++++++++++-----
>  2 files changed, 174 insertions(+), 35 deletions(-)

I forgot to update this, my apologies! resending...

> 

Thanks,
Carlos

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2 0/2] x86/mce: Grade new machine errors for AMD MCEs and include messages for panic cases
@ 2022-03-31 16:32 Carlos Bilbao
  2022-03-31 16:35 ` Carlos Bilbao
  0 siblings, 1 reply; 11+ messages in thread
From: Carlos Bilbao @ 2022-03-31 16:32 UTC (permalink / raw)
  To: bp
  Cc: tglx, mingo, dave.hansen, x86, yazen.ghannam, linux-kernel,
	linux-edac, bilbao, Carlos Bilbao

This patchset includes grading of new types of machine errors on AMD's MCE
grading logic mce_severity_amd(), which helps the MCE handler determine
what actions to take. If the error is graded as a PANIC, the EDAC driver
will not decode; so we also include new error messages to describe the MCE
and help debugging critical errors.

Changes since v1:
  On patch 1/2, follow a simplified approach for severity.c, that resembles
  the available PPR more closely. This also simplifies patch 2/2, as less 
  panic error messages are added.

Carlos Bilbao (2):
  x86/mce: Extend AMD severity grading function with new types of errors
  x86/mce: Add messages to describe panic machine errors on AMD's MCEs grading

 arch/x86/include/asm/mce.h         |   6 +
 arch/x86/kernel/cpu/mce/severity.c | 203 ++++++++++++++++++++++++-----
 2 files changed, 174 insertions(+), 35 deletions(-)

-- 
2.27.0

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-04-06  2:02 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-31 16:38 [PATCH v2 0/2] x86/mce: Grade new machine errors for AMD MCEs and include messages for panic cases Carlos Bilbao
2022-03-31 16:38 ` [PATCH v2 1/2] x86/mce: Extend AMD severity grading function with new types of errors Carlos Bilbao
2022-04-05 17:18   ` Yazen Ghannam
2022-04-05 17:24     ` Carlos Bilbao
2022-04-05 17:41       ` Yazen Ghannam
2022-04-05 17:46         ` Carlos Bilbao
2022-03-31 16:38 ` [PATCH v2 2/2] x86/mce: Add messages to describe panic machine errors on AMD's MCEs grading Carlos Bilbao
2022-03-31 17:17   ` Day, Michael
2022-04-05 17:38   ` Yazen Ghannam
  -- strict thread matches above, loose matches on Subject: below --
2022-03-31 16:32 [PATCH v2 0/2] x86/mce: Grade new machine errors for AMD MCEs and include messages for panic cases Carlos Bilbao
2022-03-31 16:35 ` Carlos Bilbao

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.