All of lore.kernel.org
 help / color / mirror / Atom feed
From: Carlos Bilbao <carlos.bilbao@amd.com>
To: <bp@alien8.de>
Cc: <tglx@linutronix.de>, <mingo@redhat.com>,
	<dave.hansen@linux.intel.com>, <x86@kernel.org>,
	<yazen.ghannam@amd.com>, <linux-kernel@vger.kernel.org>,
	<linux-edac@vger.kernel.org>, <bilbao@vt.edu>,
	Carlos Bilbao <carlos.bilbao@amd.com>
Subject: [PATCH v2 1/2] x86/mce: Extend AMD severity grading function with new types of errors
Date: Thu, 31 Mar 2022 11:38:49 -0500	[thread overview]
Message-ID: <20220331163849.6850-2-carlos.bilbao@amd.com> (raw)
In-Reply-To: <20220331163849.6850-1-carlos.bilbao@amd.com>

The MCE handler needs to understand the severity of the machine errors to
act accordingly. In the case of AMD, very few errors are covered in the
grading logic.

Extend the MCEs severity grading of AMD to cover new types of machine
errors.

Signed-off-by: Carlos Bilbao <carlos.bilbao@amd.com>
---
 arch/x86/kernel/cpu/mce/severity.c | 104 ++++++++++-------------------
 1 file changed, 37 insertions(+), 67 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
index 1add86935349..4d52eef21230 100644
--- a/arch/x86/kernel/cpu/mce/severity.c
+++ b/arch/x86/kernel/cpu/mce/severity.c
@@ -301,85 +301,55 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
 	}
 }
 
-static __always_inline int mce_severity_amd_smca(struct mce *m, enum context err_ctx)
-{
-	u64 mcx_cfg;
-
-	/*
-	 * We need to look at the following bits:
-	 * - "succor" bit (data poisoning support), and
-	 * - TCC bit (Task Context Corrupt)
-	 * in MCi_STATUS to determine error severity.
-	 */
-	if (!mce_flags.succor)
-		return MCE_PANIC_SEVERITY;
-
-	mcx_cfg = mce_rdmsrl(MSR_AMD64_SMCA_MCx_CONFIG(m->bank));
-
-	/* TCC (Task context corrupt). If set and if IN_KERNEL, panic. */
-	if ((mcx_cfg & MCI_CONFIG_MCAX) &&
-	    (m->status & MCI_STATUS_TCC) &&
-	    (err_ctx == IN_KERNEL))
-		return MCE_PANIC_SEVERITY;
-
-	 /* ...otherwise invoke hwpoison handler. */
-	return MCE_AR_SEVERITY;
-}
-
 /*
- * See AMD Error Scope Hierarchy table in a newer BKDG. For example
- * 49125_15h_Models_30h-3Fh_BKDG.pdf, section "RAS Features"
+ * See AMD PPR(s) section 3.1 Machine Check Architecture
  */
 static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, char **msg, bool is_excp)
 {
-	enum context ctx = error_context(m, regs);
+	int ret;
+
+	/*
+	 * Default return value: Action required, the error must be handled
+	 * immediately.
+	 */
+	ret = MCE_AR_SEVERITY;
 
 	/* Processor Context Corrupt, no need to fumble too much, die! */
-	if (m->status & MCI_STATUS_PCC)
-		return MCE_PANIC_SEVERITY;
+	if (m->status & MCI_STATUS_PCC) {
+		ret = MCE_PANIC_SEVERITY;
+		goto amd_severity;
+	}
 
-	if (m->status & MCI_STATUS_UC) {
+	/*
+	 * Evaluate the severity of deferred errors for AMD systems, for which only
+	 * scrub error is interesting to notify an action requirement. The poll
+	 * handler catches deferred errors and adds to mce_ring so memorty-failure
+	 * can take recovery actions.
+	 */
+	if (m->status & MCI_STATUS_DEFERRED) {
+		ret = MCE_DEFERRED_SEVERITY;
+		goto amd_severity;
+	}
 
-		if (ctx == IN_KERNEL)
-			return MCE_PANIC_SEVERITY;
+	/* If the UC bit is not set, the error has been corrected */
+	if (!(m->status & MCI_STATUS_UC)) {
+		ret = MCE_KEEP_SEVERITY;
+		goto amd_severity;
+	}
 
-		/*
-		 * On older systems where overflow_recov flag is not present, we
-		 * should simply panic if an error overflow occurs. If
-		 * overflow_recov flag is present and set, then software can try
-		 * to at least kill process to prolong system operation.
-		 */
-		if (mce_flags.overflow_recov) {
-			if (mce_flags.smca)
-				return mce_severity_amd_smca(m, ctx);
-
-			/* kill current process */
-			return MCE_AR_SEVERITY;
-		} else {
-			/* at least one error was not logged */
-			if (m->status & MCI_STATUS_OVER)
-				return MCE_PANIC_SEVERITY;
-		}
-
-		/*
-		 * For any other case, return MCE_UC_SEVERITY so that we log the
-		 * error and exit #MC handler.
-		 */
-		return MCE_UC_SEVERITY;
+	if (((m->status & MCI_STATUS_OVER) && !mce_flags.overflow_recov)
+	     || !mce_flags.succor) {
+		ret = MCE_PANIC_SEVERITY;
+		goto amd_severity;
 	}
 
-	/*
-	 * deferred error: poll handler catches these and adds to mce_ring so
-	 * memory-failure can take recovery actions.
-	 */
-	if (m->status & MCI_STATUS_DEFERRED)
-		return MCE_DEFERRED_SEVERITY;
+	if (error_context(m, regs) == IN_KERNEL) {
+		ret = MCE_PANIC_SEVERITY;
+	}
 
-	/*
-	 * corrected error: poll handler catches these and passes responsibility
-	 * of decoding the error to EDAC
-	 */
-	return MCE_KEEP_SEVERITY;
+amd_severity:
+
+	return ret;
 }
 
 static noinstr int mce_severity_intel(struct mce *m, struct pt_regs *regs, char **msg, bool is_excp)
-- 
2.31.1


  reply	other threads:[~2022-03-31 16:39 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-31 16:38 [PATCH v2 0/2] x86/mce: Grade new machine errors for AMD MCEs and include messages for panic cases Carlos Bilbao
2022-03-31 16:38 ` Carlos Bilbao [this message]
2022-04-05 17:18   ` [PATCH v2 1/2] x86/mce: Extend AMD severity grading function with new types of errors Yazen Ghannam
2022-04-05 17:24     ` Carlos Bilbao
2022-04-05 17:41       ` Yazen Ghannam
2022-04-05 17:46         ` Carlos Bilbao
2022-03-31 16:38 ` [PATCH v2 2/2] x86/mce: Add messages to describe panic machine errors on AMD's MCEs grading Carlos Bilbao
2022-03-31 17:17   ` Day, Michael
2022-04-05 17:38   ` Yazen Ghannam
  -- strict thread matches above, loose matches on Subject: below --
2022-03-31 16:32 [PATCH v2 0/2] x86/mce: Grade new machine errors for AMD MCEs and include messages for panic cases Carlos Bilbao
2022-03-31 16:32 ` [PATCH v2 1/2] x86/mce: Extend AMD severity grading function with new types of errors Carlos Bilbao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220331163849.6850-2-carlos.bilbao@amd.com \
    --to=carlos.bilbao@amd.com \
    --cc=bilbao@vt.edu \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    --cc=yazen.ghannam@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.