linux-edac.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Naveen Krishna Chatradhi <nchatrad@amd.com>
To: <linux-edac@vger.kernel.org>
Cc: <bp@alien8.de>, <mingo@redhat.com>, <mchehab@kernel.org>,
	<mchehab+huawei@kernel.org>, <yazen.ghannam@amd.com>,
	Muralidhara M K <muralimk@amd.com>,
	Naveen Krishna Chatradhi <nchatrad@amd.com>
Subject: [PATCH 3/4] rasdaemon: Enumerate memory on noncpu nodes
Date: Tue, 10 Aug 2021 22:52:13 +0530	[thread overview]
Message-ID: <20210810172214.134099-4-nchatrad@amd.com> (raw)
In-Reply-To: <20210810172214.134099-1-nchatrad@amd.com>

From: Muralidhara M K <muralimk@amd.com>

On newer heterogeneous systems from AMD with GPU nodes
(with HBM2 memory) connected via xGMI links to the CPUs.

The node id information is available in the
MCA_IPID[47:44](InstanceIdHI) register.

The UMC Phys on Aldeberan nodes are enumerated as csrow
The UMC channels connected to HBMs are enumerated as ranks.

Signed-off-by: Muralidhara M K <muralimk@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <nchatrad@amd.com>
---
 mce-amd-smca.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/mce-amd-smca.c b/mce-amd-smca.c
index 3c346f4..9381aa1 100644
--- a/mce-amd-smca.c
+++ b/mce-amd-smca.c
@@ -78,6 +78,16 @@ enum smca_bank_types {
 /* Maximum number of MCA banks per CPU. */
 #define MAX_NR_BANKS	64
 
+/*
+ * On newer heterogeneous systems the data gabrics of the CPUs and GPUs
+ * are connected directly via a custom links, like is done with
+ * 2 socket CPU systems and also within a socket for Multi-chip Module
+ * (MCM) CPUs like Naples.
+ * The first GPU node(non cpu) is assumed to have an "AMD Node ID" value
+ * of 8 (the second GPU node has 9, etc.).
+ */
+#define NONCPU_NODE_INDEX	8
+
 /* SMCA Extended error strings */
 /* Load Store */
 static const char * const smca_ls_mce_desc[] = {
@@ -531,6 +541,26 @@ static int find_umc_channel(struct mce_event *e)
 {
 	return EXTRACT(e->ipid, 0, 31) >> 20;
 }
+
+/*
+ * The HBM memory managed by the UMCCH of the noncpu node
+ * can be calculated based on the [15:12]bits of IPID
+ */
+static int find_hbm_channel(struct mce_event *e)
+{
+	int umc, tmp;
+
+	umc = EXTRACT(e->ipid, 0, 31) >> 20;
+
+	/*
+	 * The HBM channel managed by the UMC of the noncpu node
+	 * can be calculated based on the [15:12]bits of IPID as follows
+	 */
+	tmp = ((e->ipid >> 12) & 0xf);
+
+	return (umc % 2) ? tmp + 4 : tmp;
+}
+
 /* Decode extended errors according to Scalable MCA specification */
 static void decode_smca_error(struct mce_event *e)
 {
@@ -539,6 +569,7 @@ static void decode_smca_error(struct mce_event *e)
 	unsigned short xec = (e->status >> 16) & 0x3f;
 	const struct smca_hwid *s_hwid;
 	uint32_t mcatype_hwid = EXTRACT(e->ipid, 32, 63);
+	uint8_t mcatype_instancehi = EXTRACT(e->ipid, 44, 47);
 	unsigned int csrow = -1, channel = -1;
 	unsigned int i;
 
@@ -548,14 +579,16 @@ static void decode_smca_error(struct mce_event *e)
 			bank_type = s_hwid->bank_type;
 			break;
 		}
+		if (mcatype_instancehi >= NONCPU_NODE_INDEX)
+			bank_type = SMCA_UMC_V2;
 	}
 
-	if (i >= ARRAY_SIZE(smca_hwid_mcatypes)) {
+	if (i >= MAX_NR_BANKS) {
 		strcpy(e->mcastatus_msg, "Couldn't find bank type with IPID");
 		return;
 	}
 
-	if (bank_type >= N_SMCA_BANK_TYPES) {
+	if (bank_type >= MAX_NR_BANKS) {
 		strcpy(e->mcastatus_msg, "Don't know how to decode this bank");
 		return;
 	}
@@ -580,6 +613,16 @@ static void decode_smca_error(struct mce_event *e)
 		mce_snprintf(e->mc_location, "memory_channel=%d,csrow=%d",
 			     channel, csrow);
 	}
+
+	if (bank_type == SMCA_UMC_V2 && xec == 0) {
+		/* The UMCPHY is reported as csrow in case of noncpu nodes */
+		csrow = find_umc_channel(e) / 2;
+		/* UMCCH is managing the HBM memory */
+		channel = find_hbm_channel(e);
+		mce_snprintf(e->mc_location, "memory_channel=%d,csrow=%d",
+			     channel, csrow);
+	}
+
 }
 
 int parse_amd_smca_event(struct ras_events *ras, struct mce_event *e)
-- 
2.17.1


  parent reply	other threads:[~2021-08-10 17:23 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-10 17:22 [PATCH 0/4] rasdaemon: Add support for AMD CPUs family 19h Naveen Krishna Chatradhi
2021-08-10 17:22 ` [PATCH 1/4] rasdaemon: Add new SMCA bank types with error decoding Naveen Krishna Chatradhi
2021-08-10 17:22 ` [PATCH 2/4] rasdaemon: set SMCA maximum number of banks to 64 Naveen Krishna Chatradhi
2021-08-10 17:22 ` Naveen Krishna Chatradhi [this message]
2021-08-10 17:22 ` [PATCH 4/4] rasdaemon: Support MCE for AMD CPU family 19h Naveen Krishna Chatradhi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210810172214.134099-4-nchatrad@amd.com \
    --to=nchatrad@amd.com \
    --cc=bp@alien8.de \
    --cc=linux-edac@vger.kernel.org \
    --cc=mchehab+huawei@kernel.org \
    --cc=mchehab@kernel.org \
    --cc=mingo@redhat.com \
    --cc=muralimk@amd.com \
    --cc=yazen.ghannam@amd.com \
    --subject='Re: [PATCH 3/4] rasdaemon: Enumerate memory on noncpu nodes' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).