linux-edac.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/7] AMD Family 19h Models 90h-9fh EDAC Support
@ 2023-07-20 12:54 Muralidhara M K
  2023-07-20 12:54 ` [PATCH 1/7] x86/amd_nb: Add AMD Family 19h Models(80h-80fh) and (90h-9fh) PCI IDs Muralidhara M K
                   ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Muralidhara M K @ 2023-07-20 12:54 UTC (permalink / raw)
  To: linux-edac, x86
  Cc: linux-kernel, bp, mingo, mchehab, nchatrad, yazen.ghannam,
	Muralidhara M K

Add Support for AMD Family 19h Models 90h-9fh.

Patch 1:
Add MI300 PCI IDs to the AMD NB code.

Patch 2:
Remove SMCA Extended Error code descriptions, because some of the
existing bit definitions in the CTL register of SMCA bank type
are reassigned without defining new HWID and McaType.

Patch 3:
Add New SMCA bank types MALL, USR_DP, USR_CP.

Patch 4:
Add HBM3 memory in the enum.

Patch 5:
Add Family 19h and Models 90h-9fh Enumeration support.

Patch 6:
Decode error instance get_inst_id() to pvt->ops

Patch 7:
Convert ondie ECC DRAM decoded address to Normalized address

Muralidhara M K (7):
  x86/amd_nb: Add AMD Family 19h Models(80h-80fh) and (90h-9fh) PCI IDs
  EDAC/mce_amd: Remove SMCA Extended Error code descriptions
  x86/MCE/AMD: Add new MA_LLC, USR_DP, and USR_CP bank types
  EDAC/mc: Add new HBM3 memory type
  EDAC/amd64: Add MI300 Enumeration support
  EDAC/amd64: Add error instance get_err_info() to pvt->ops
  EDAC/amd64: Add Error address conversion for UMC

 arch/x86/include/asm/mce.h    |   3 +
 arch/x86/kernel/amd_nb.c      |   5 +
 arch/x86/kernel/cpu/mce/amd.c |   6 +
 drivers/edac/amd64_edac.c     | 261 +++++++++++++++++-
 drivers/edac/amd64_edac.h     |   2 +
 drivers/edac/edac_mc.c        |   1 +
 drivers/edac/mce_amd.c        | 480 ----------------------------------
 include/linux/edac.h          |   3 +
 include/linux/pci_ids.h       |   1 +
 9 files changed, 269 insertions(+), 493 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/7] x86/amd_nb: Add AMD Family 19h Models(80h-80fh) and (90h-9fh) PCI IDs
  2023-07-20 12:54 [PATCH 0/7] AMD Family 19h Models 90h-9fh EDAC Support Muralidhara M K
@ 2023-07-20 12:54 ` Muralidhara M K
  2023-07-21 14:44   ` Yazen Ghannam
  2023-07-20 12:54 ` [PATCH 2/7] EDAC/mce_amd: Remove SMCA Extended Error code descriptions Muralidhara M K
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Muralidhara M K @ 2023-07-20 12:54 UTC (permalink / raw)
  To: linux-edac, x86
  Cc: linux-kernel, bp, mingo, mchehab, nchatrad, yazen.ghannam,
	Muralidhara M K

From: Muralidhara M K <muralidhara.mk@amd.com>

Add new Root, Device 18h Function 3, and Function 4 PCI IDS
for x86 AMD family 19h, Models 80h-80fh and 90h-9fh.

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
---
 arch/x86/kernel/amd_nb.c | 5 +++++
 include/linux/pci_ids.h  | 1 +
 2 files changed, 6 insertions(+)

diff --git a/arch/x86/kernel/amd_nb.c b/arch/x86/kernel/amd_nb.c
index 035a3db5330b..0dece8606ae2 100644
--- a/arch/x86/kernel/amd_nb.c
+++ b/arch/x86/kernel/amd_nb.c
@@ -25,6 +25,7 @@
 #define PCI_DEVICE_ID_AMD_19H_M60H_ROOT		0x14d8
 #define PCI_DEVICE_ID_AMD_19H_M70H_ROOT		0x14e8
 #define PCI_DEVICE_ID_AMD_MI200_ROOT		0x14bb
+#define PCI_DEVICE_ID_AMD_MI300_ROOT		0x14f8
 
 #define PCI_DEVICE_ID_AMD_17H_DF_F4		0x1464
 #define PCI_DEVICE_ID_AMD_17H_M10H_DF_F4	0x15ec
@@ -40,6 +41,7 @@
 #define PCI_DEVICE_ID_AMD_19H_M70H_DF_F4	0x14f4
 #define PCI_DEVICE_ID_AMD_19H_M78H_DF_F4	0x12fc
 #define PCI_DEVICE_ID_AMD_MI200_DF_F4		0x14d4
+#define PCI_DEVICE_ID_AMD_MI300_DF_F4		0x152c
 
 /* Protect the PCI config register pairs used for SMN. */
 static DEFINE_MUTEX(smn_mutex);
@@ -57,6 +59,7 @@ static const struct pci_device_id amd_root_ids[] = {
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M60H_ROOT) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M70H_ROOT) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_MI200_ROOT) },
+	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_MI300_ROOT) },
 	{}
 };
 
@@ -86,6 +89,7 @@ static const struct pci_device_id amd_nb_misc_ids[] = {
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M70H_DF_F3) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M78H_DF_F3) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_MI200_DF_F3) },
+	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_MI300_DF_F3) },
 	{}
 };
 
@@ -107,6 +111,7 @@ static const struct pci_device_id amd_nb_link_ids[] = {
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M50H_DF_F4) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_CNB17H_F4) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_MI200_DF_F4) },
+	{ PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_MI300_DF_F4) },
 	{}
 };
 
diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
index 2dc75df1437f..70decb578206 100644
--- a/include/linux/pci_ids.h
+++ b/include/linux/pci_ids.h
@@ -577,6 +577,7 @@
 #define PCI_DEVICE_ID_AMD_19H_M70H_DF_F3 0x14f3
 #define PCI_DEVICE_ID_AMD_19H_M78H_DF_F3 0x12fb
 #define PCI_DEVICE_ID_AMD_MI200_DF_F3	0x14d3
+#define PCI_DEVICE_ID_AMD_MI300_DF_F3	0x152b
 #define PCI_DEVICE_ID_AMD_CNB17H_F3	0x1703
 #define PCI_DEVICE_ID_AMD_LANCE		0x2000
 #define PCI_DEVICE_ID_AMD_LANCE_HOME	0x2001
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/7] EDAC/mce_amd: Remove SMCA Extended Error code descriptions
  2023-07-20 12:54 [PATCH 0/7] AMD Family 19h Models 90h-9fh EDAC Support Muralidhara M K
  2023-07-20 12:54 ` [PATCH 1/7] x86/amd_nb: Add AMD Family 19h Models(80h-80fh) and (90h-9fh) PCI IDs Muralidhara M K
@ 2023-07-20 12:54 ` Muralidhara M K
  2023-07-20 13:59   ` Borislav Petkov
  2023-07-20 12:54 ` [PATCH 3/7] x86/MCE/AMD: Add new MA_LLC, USR_DP, and USR_CP bank types Muralidhara M K
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Muralidhara M K @ 2023-07-20 12:54 UTC (permalink / raw)
  To: linux-edac, x86
  Cc: linux-kernel, bp, mingo, mchehab, nchatrad, yazen.ghannam,
	Muralidhara M K

From: Muralidhara M K <muralidhara.mk@amd.com>

On AMD systems with Scalable MCA, each machine check error of a SMCA bank
type has an associated bit position in the bank's control (CTL) register.

An error's bit position in the CTL register is used during error decoding
for offsetting into the corresponding bank's error description structure.
As new errors are being added in newer AMD systems for existing SMCA bank
types, the underlying SMCA architecture guarantees that the bit positions
of existing errors are not altered.

However, on some AMD systems some of the existing bit definitions in the
CTL register of SMCA bank type are reassigned without defining new HWID
and McaType. Consequently, the errors whose bit definitions have been
reassigned in the CTL register are being erroneously decoded.

Remove SMCA Extended Error Code descriptions. This avoids decoding issues
for incorrectly reassigned bits, and avoids the related maintenance burden
in the kernel. This decoding can be done in external tools or by referring
to AMD documentation. The bank type and Extended Error Code value for an
error will continue to be printed as a convenience.

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
 drivers/edac/mce_amd.c | 480 -----------------------------------------
 1 file changed, 480 deletions(-)

diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 9215c06783df..3a67f02a34ad 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -143,482 +143,6 @@ static const char * const mc6_mce_desc[] = {
 	"Status Register File",
 };
 
-/* Scalable MCA error strings */
-static const char * const smca_ls_mce_desc[] = {
-	"Load queue parity error",
-	"Store queue parity error",
-	"Miss address buffer payload parity error",
-	"Level 1 TLB parity error",
-	"DC Tag error type 5",
-	"DC Tag error type 6",
-	"DC Tag error type 1",
-	"Internal error type 1",
-	"Internal error type 2",
-	"System Read Data Error Thread 0",
-	"System Read Data Error Thread 1",
-	"DC Tag error type 2",
-	"DC Data error type 1 and poison consumption",
-	"DC Data error type 2",
-	"DC Data error type 3",
-	"DC Tag error type 4",
-	"Level 2 TLB parity error",
-	"PDC parity error",
-	"DC Tag error type 3",
-	"DC Tag error type 5",
-	"L2 Fill Data error",
-};
-
-static const char * const smca_ls2_mce_desc[] = {
-	"An ECC error was detected on a data cache read by a probe or victimization",
-	"An ECC error or L2 poison was detected on a data cache read by a load",
-	"An ECC error was detected on a data cache read-modify-write by a store",
-	"An ECC error or poison bit mismatch was detected on a tag read by a probe or victimization",
-	"An ECC error or poison bit mismatch was detected on a tag read by a load",
-	"An ECC error or poison bit mismatch was detected on a tag read by a store",
-	"An ECC error was detected on an EMEM read by a load",
-	"An ECC error was detected on an EMEM read-modify-write by a store",
-	"A parity error was detected in an L1 TLB entry by any access",
-	"A parity error was detected in an L2 TLB entry by any access",
-	"A parity error was detected in a PWC entry by any access",
-	"A parity error was detected in an STQ entry by any access",
-	"A parity error was detected in an LDQ entry by any access",
-	"A parity error was detected in a MAB entry by any access",
-	"A parity error was detected in an SCB entry state field by any access",
-	"A parity error was detected in an SCB entry address field by any access",
-	"A parity error was detected in an SCB entry data field by any access",
-	"A parity error was detected in a WCB entry by any access",
-	"A poisoned line was detected in an SCB entry by any access",
-	"A SystemReadDataError error was reported on read data returned from L2 for a load",
-	"A SystemReadDataError error was reported on read data returned from L2 for an SCB store",
-	"A SystemReadDataError error was reported on read data returned from L2 for a WCB store",
-	"A hardware assertion error was reported",
-	"A parity error was detected in an STLF, SCB EMEM entry or SRB store data by any access",
-};
-
-static const char * const smca_if_mce_desc[] = {
-	"Op Cache Microtag Probe Port Parity Error",
-	"IC Microtag or Full Tag Multi-hit Error",
-	"IC Full Tag Parity Error",
-	"IC Data Array Parity Error",
-	"Decoupling Queue PhysAddr Parity Error",
-	"L0 ITLB Parity Error",
-	"L1 ITLB Parity Error",
-	"L2 ITLB Parity Error",
-	"BPQ Thread 0 Snoop Parity Error",
-	"BPQ Thread 1 Snoop Parity Error",
-	"L1 BTB Multi-Match Error",
-	"L2 BTB Multi-Match Error",
-	"L2 Cache Response Poison Error",
-	"System Read Data Error",
-	"Hardware Assertion Error",
-	"L1-TLB Multi-Hit",
-	"L2-TLB Multi-Hit",
-	"BSR Parity Error",
-	"CT MCE",
-};
-
-static const char * const smca_l2_mce_desc[] = {
-	"L2M Tag Multiple-Way-Hit error",
-	"L2M Tag or State Array ECC Error",
-	"L2M Data Array ECC Error",
-	"Hardware Assert Error",
-};
-
-static const char * const smca_de_mce_desc[] = {
-	"Micro-op cache tag parity error",
-	"Micro-op cache data parity error",
-	"Instruction buffer parity error",
-	"Micro-op queue parity error",
-	"Instruction dispatch queue parity error",
-	"Fetch address FIFO parity error",
-	"Patch RAM data parity error",
-	"Patch RAM sequencer parity error",
-	"Micro-op buffer parity error",
-	"Hardware Assertion MCA Error",
-};
-
-static const char * const smca_ex_mce_desc[] = {
-	"Watchdog Timeout error",
-	"Physical register file parity error",
-	"Flag register file parity error",
-	"Immediate displacement register file parity error",
-	"Address generator payload parity error",
-	"EX payload parity error",
-	"Checkpoint queue parity error",
-	"Retire dispatch queue parity error",
-	"Retire status queue parity error",
-	"Scheduling queue parity error",
-	"Branch buffer queue parity error",
-	"Hardware Assertion error",
-	"Spec Map parity error",
-	"Retire Map parity error",
-};
-
-static const char * const smca_fp_mce_desc[] = {
-	"Physical register file (PRF) parity error",
-	"Freelist (FL) parity error",
-	"Schedule queue parity error",
-	"NSQ parity error",
-	"Retire queue (RQ) parity error",
-	"Status register file (SRF) parity error",
-	"Hardware assertion",
-};
-
-static const char * const smca_l3_mce_desc[] = {
-	"Shadow Tag Macro ECC Error",
-	"Shadow Tag Macro Multi-way-hit Error",
-	"L3M Tag ECC Error",
-	"L3M Tag Multi-way-hit Error",
-	"L3M Data ECC Error",
-	"SDP Parity Error or SystemReadDataError from XI",
-	"L3 Victim Queue Parity Error",
-	"L3 Hardware Assertion",
-};
-
-static const char * const smca_cs_mce_desc[] = {
-	"Illegal Request",
-	"Address Violation",
-	"Security Violation",
-	"Illegal Response",
-	"Unexpected Response",
-	"Request or Probe Parity Error",
-	"Read Response Parity Error",
-	"Atomic Request Parity Error",
-	"Probe Filter ECC Error",
-};
-
-static const char * const smca_cs2_mce_desc[] = {
-	"Illegal Request",
-	"Address Violation",
-	"Security Violation",
-	"Illegal Response",
-	"Unexpected Response",
-	"Request or Probe Parity Error",
-	"Read Response Parity Error",
-	"Atomic Request Parity Error",
-	"SDP read response had no match in the CS queue",
-	"Probe Filter Protocol Error",
-	"Probe Filter ECC Error",
-	"SDP read response had an unexpected RETRY error",
-	"Counter overflow error",
-	"Counter underflow error",
-};
-
-static const char * const smca_pie_mce_desc[] = {
-	"Hardware Assert",
-	"Register security violation",
-	"Link Error",
-	"Poison data consumption",
-	"A deferred error was detected in the DF"
-};
-
-static const char * const smca_umc_mce_desc[] = {
-	"DRAM ECC error",
-	"Data poison error",
-	"SDP parity error",
-	"Advanced peripheral bus error",
-	"Address/Command parity error",
-	"Write data CRC error",
-	"DCQ SRAM ECC error",
-	"AES SRAM ECC error",
-};
-
-static const char * const smca_umc2_mce_desc[] = {
-	"DRAM ECC error",
-	"Data poison error",
-	"SDP parity error",
-	"Reserved",
-	"Address/Command parity error",
-	"Write data parity error",
-	"DCQ SRAM ECC error",
-	"Reserved",
-	"Read data parity error",
-	"Rdb SRAM ECC error",
-	"RdRsp SRAM ECC error",
-	"LM32 MP errors",
-};
-
-static const char * const smca_pb_mce_desc[] = {
-	"An ECC error in the Parameter Block RAM array",
-};
-
-static const char * const smca_psp_mce_desc[] = {
-	"An ECC or parity error in a PSP RAM instance",
-};
-
-static const char * const smca_psp2_mce_desc[] = {
-	"High SRAM ECC or parity error",
-	"Low SRAM ECC or parity error",
-	"Instruction Cache Bank 0 ECC or parity error",
-	"Instruction Cache Bank 1 ECC or parity error",
-	"Instruction Tag Ram 0 parity error",
-	"Instruction Tag Ram 1 parity error",
-	"Data Cache Bank 0 ECC or parity error",
-	"Data Cache Bank 1 ECC or parity error",
-	"Data Cache Bank 2 ECC or parity error",
-	"Data Cache Bank 3 ECC or parity error",
-	"Data Tag Bank 0 parity error",
-	"Data Tag Bank 1 parity error",
-	"Data Tag Bank 2 parity error",
-	"Data Tag Bank 3 parity error",
-	"Dirty Data Ram parity error",
-	"TLB Bank 0 parity error",
-	"TLB Bank 1 parity error",
-	"System Hub Read Buffer ECC or parity error",
-};
-
-static const char * const smca_smu_mce_desc[] = {
-	"An ECC or parity error in an SMU RAM instance",
-};
-
-static const char * const smca_smu2_mce_desc[] = {
-	"High SRAM ECC or parity error",
-	"Low SRAM ECC or parity error",
-	"Data Cache Bank A ECC or parity error",
-	"Data Cache Bank B ECC or parity error",
-	"Data Tag Cache Bank A ECC or parity error",
-	"Data Tag Cache Bank B ECC or parity error",
-	"Instruction Cache Bank A ECC or parity error",
-	"Instruction Cache Bank B ECC or parity error",
-	"Instruction Tag Cache Bank A ECC or parity error",
-	"Instruction Tag Cache Bank B ECC or parity error",
-	"System Hub Read Buffer ECC or parity error",
-	"PHY RAM ECC error",
-};
-
-static const char * const smca_mp5_mce_desc[] = {
-	"High SRAM ECC or parity error",
-	"Low SRAM ECC or parity error",
-	"Data Cache Bank A ECC or parity error",
-	"Data Cache Bank B ECC or parity error",
-	"Data Tag Cache Bank A ECC or parity error",
-	"Data Tag Cache Bank B ECC or parity error",
-	"Instruction Cache Bank A ECC or parity error",
-	"Instruction Cache Bank B ECC or parity error",
-	"Instruction Tag Cache Bank A ECC or parity error",
-	"Instruction Tag Cache Bank B ECC or parity error",
-};
-
-static const char * const smca_mpdma_mce_desc[] = {
-	"Main SRAM [31:0] bank ECC or parity error",
-	"Main SRAM [63:32] bank ECC or parity error",
-	"Main SRAM [95:64] bank ECC or parity error",
-	"Main SRAM [127:96] bank ECC or parity error",
-	"Data Cache Bank A ECC or parity error",
-	"Data Cache Bank B ECC or parity error",
-	"Data Tag Cache Bank A ECC or parity error",
-	"Data Tag Cache Bank B ECC or parity error",
-	"Instruction Cache Bank A ECC or parity error",
-	"Instruction Cache Bank B ECC or parity error",
-	"Instruction Tag Cache Bank A ECC or parity error",
-	"Instruction Tag Cache Bank B ECC or parity error",
-	"Data Cache Bank A ECC or parity error",
-	"Data Cache Bank B ECC or parity error",
-	"Data Tag Cache Bank A ECC or parity error",
-	"Data Tag Cache Bank B ECC or parity error",
-	"Instruction Cache Bank A ECC or parity error",
-	"Instruction Cache Bank B ECC or parity error",
-	"Instruction Tag Cache Bank A ECC or parity error",
-	"Instruction Tag Cache Bank B ECC or parity error",
-	"Data Cache Bank A ECC or parity error",
-	"Data Cache Bank B ECC or parity error",
-	"Data Tag Cache Bank A ECC or parity error",
-	"Data Tag Cache Bank B ECC or parity error",
-	"Instruction Cache Bank A ECC or parity error",
-	"Instruction Cache Bank B ECC or parity error",
-	"Instruction Tag Cache Bank A ECC or parity error",
-	"Instruction Tag Cache Bank B ECC or parity error",
-	"System Hub Read Buffer ECC or parity error",
-	"MPDMA TVF DVSEC Memory ECC or parity error",
-	"MPDMA TVF MMIO Mailbox0 ECC or parity error",
-	"MPDMA TVF MMIO Mailbox1 ECC or parity error",
-	"MPDMA TVF Doorbell Memory ECC or parity error",
-	"MPDMA TVF SDP Slave Memory 0 ECC or parity error",
-	"MPDMA TVF SDP Slave Memory 1 ECC or parity error",
-	"MPDMA TVF SDP Slave Memory 2 ECC or parity error",
-	"MPDMA TVF SDP Master Memory 0 ECC or parity error",
-	"MPDMA TVF SDP Master Memory 1 ECC or parity error",
-	"MPDMA TVF SDP Master Memory 2 ECC or parity error",
-	"MPDMA TVF SDP Master Memory 3 ECC or parity error",
-	"MPDMA TVF SDP Master Memory 4 ECC or parity error",
-	"MPDMA TVF SDP Master Memory 5 ECC or parity error",
-	"MPDMA TVF SDP Master Memory 6 ECC or parity error",
-	"MPDMA PTE Command FIFO ECC or parity error",
-	"MPDMA PTE Hub Data FIFO ECC or parity error",
-	"MPDMA PTE Internal Data FIFO ECC or parity error",
-	"MPDMA PTE Command Memory DMA ECC or parity error",
-	"MPDMA PTE Command Memory Internal ECC or parity error",
-	"MPDMA PTE DMA Completion FIFO ECC or parity error",
-	"MPDMA PTE Tablewalk Completion FIFO ECC or parity error",
-	"MPDMA PTE Descriptor Completion FIFO ECC or parity error",
-	"MPDMA PTE ReadOnly Completion FIFO ECC or parity error",
-	"MPDMA PTE DirectWrite Completion FIFO ECC or parity error",
-	"SDP Watchdog Timer expired",
-};
-
-static const char * const smca_nbio_mce_desc[] = {
-	"ECC or Parity error",
-	"PCIE error",
-	"SDP ErrEvent error",
-	"SDP Egress Poison Error",
-	"IOHC Internal Poison Error",
-};
-
-static const char * const smca_pcie_mce_desc[] = {
-	"CCIX PER Message logging",
-	"CCIX Read Response with Status: Non-Data Error",
-	"CCIX Write Response with Status: Non-Data Error",
-	"CCIX Read Response with Status: Data Error",
-	"CCIX Non-okay write response with data error",
-};
-
-static const char * const smca_pcie2_mce_desc[] = {
-	"SDP Parity Error logging",
-};
-
-static const char * const smca_xgmipcs_mce_desc[] = {
-	"Data Loss Error",
-	"Training Error",
-	"Flow Control Acknowledge Error",
-	"Rx Fifo Underflow Error",
-	"Rx Fifo Overflow Error",
-	"CRC Error",
-	"BER Exceeded Error",
-	"Tx Vcid Data Error",
-	"Replay Buffer Parity Error",
-	"Data Parity Error",
-	"Replay Fifo Overflow Error",
-	"Replay Fifo Underflow Error",
-	"Elastic Fifo Overflow Error",
-	"Deskew Error",
-	"Flow Control CRC Error",
-	"Data Startup Limit Error",
-	"FC Init Timeout Error",
-	"Recovery Timeout Error",
-	"Ready Serial Timeout Error",
-	"Ready Serial Attempt Error",
-	"Recovery Attempt Error",
-	"Recovery Relock Attempt Error",
-	"Replay Attempt Error",
-	"Sync Header Error",
-	"Tx Replay Timeout Error",
-	"Rx Replay Timeout Error",
-	"LinkSub Tx Timeout Error",
-	"LinkSub Rx Timeout Error",
-	"Rx CMD Packet Error",
-};
-
-static const char * const smca_xgmiphy_mce_desc[] = {
-	"RAM ECC Error",
-	"ARC instruction buffer parity error",
-	"ARC data buffer parity error",
-	"PHY APB error",
-};
-
-static const char * const smca_nbif_mce_desc[] = {
-	"Timeout error from GMI",
-	"SRAM ECC error",
-	"NTB Error Event",
-	"SDP Parity error",
-};
-
-static const char * const smca_sata_mce_desc[] = {
-	"Parity error for port 0",
-	"Parity error for port 1",
-	"Parity error for port 2",
-	"Parity error for port 3",
-	"Parity error for port 4",
-	"Parity error for port 5",
-	"Parity error for port 6",
-	"Parity error for port 7",
-};
-
-static const char * const smca_usb_mce_desc[] = {
-	"Parity error or ECC error for S0 RAM0",
-	"Parity error or ECC error for S0 RAM1",
-	"Parity error or ECC error for S0 RAM2",
-	"Parity error for PHY RAM0",
-	"Parity error for PHY RAM1",
-	"AXI Slave Response error",
-};
-
-static const char * const smca_gmipcs_mce_desc[] = {
-	"Data Loss Error",
-	"Training Error",
-	"Replay Parity Error",
-	"Rx Fifo Underflow Error",
-	"Rx Fifo Overflow Error",
-	"CRC Error",
-	"BER Exceeded Error",
-	"Tx Fifo Underflow Error",
-	"Replay Buffer Parity Error",
-	"Tx Overflow Error",
-	"Replay Fifo Overflow Error",
-	"Replay Fifo Underflow Error",
-	"Elastic Fifo Overflow Error",
-	"Deskew Error",
-	"Offline Error",
-	"Data Startup Limit Error",
-	"FC Init Timeout Error",
-	"Recovery Timeout Error",
-	"Ready Serial Timeout Error",
-	"Ready Serial Attempt Error",
-	"Recovery Attempt Error",
-	"Recovery Relock Attempt Error",
-	"Deskew Abort Error",
-	"Rx Buffer Error",
-	"Rx LFDS Fifo Overflow Error",
-	"Rx LFDS Fifo Underflow Error",
-	"LinkSub Tx Timeout Error",
-	"LinkSub Rx Timeout Error",
-	"Rx CMD Packet Error",
-	"LFDS Training Timeout Error",
-	"LFDS FC Init Timeout Error",
-	"Data Loss Error",
-};
-
-struct smca_mce_desc {
-	const char * const *descs;
-	unsigned int num_descs;
-};
-
-static struct smca_mce_desc smca_mce_descs[] = {
-	[SMCA_LS]	= { smca_ls_mce_desc,	ARRAY_SIZE(smca_ls_mce_desc)	},
-	[SMCA_LS_V2]	= { smca_ls2_mce_desc,	ARRAY_SIZE(smca_ls2_mce_desc)	},
-	[SMCA_IF]	= { smca_if_mce_desc,	ARRAY_SIZE(smca_if_mce_desc)	},
-	[SMCA_L2_CACHE]	= { smca_l2_mce_desc,	ARRAY_SIZE(smca_l2_mce_desc)	},
-	[SMCA_DE]	= { smca_de_mce_desc,	ARRAY_SIZE(smca_de_mce_desc)	},
-	[SMCA_EX]	= { smca_ex_mce_desc,	ARRAY_SIZE(smca_ex_mce_desc)	},
-	[SMCA_FP]	= { smca_fp_mce_desc,	ARRAY_SIZE(smca_fp_mce_desc)	},
-	[SMCA_L3_CACHE]	= { smca_l3_mce_desc,	ARRAY_SIZE(smca_l3_mce_desc)	},
-	[SMCA_CS]	= { smca_cs_mce_desc,	ARRAY_SIZE(smca_cs_mce_desc)	},
-	[SMCA_CS_V2]	= { smca_cs2_mce_desc,	ARRAY_SIZE(smca_cs2_mce_desc)	},
-	[SMCA_PIE]	= { smca_pie_mce_desc,	ARRAY_SIZE(smca_pie_mce_desc)	},
-	[SMCA_UMC]	= { smca_umc_mce_desc,	ARRAY_SIZE(smca_umc_mce_desc)	},
-	[SMCA_UMC_V2]	= { smca_umc2_mce_desc,	ARRAY_SIZE(smca_umc2_mce_desc)	},
-	[SMCA_PB]	= { smca_pb_mce_desc,	ARRAY_SIZE(smca_pb_mce_desc)	},
-	[SMCA_PSP]	= { smca_psp_mce_desc,	ARRAY_SIZE(smca_psp_mce_desc)	},
-	[SMCA_PSP_V2]	= { smca_psp2_mce_desc,	ARRAY_SIZE(smca_psp2_mce_desc)	},
-	[SMCA_SMU]	= { smca_smu_mce_desc,	ARRAY_SIZE(smca_smu_mce_desc)	},
-	[SMCA_SMU_V2]	= { smca_smu2_mce_desc,	ARRAY_SIZE(smca_smu2_mce_desc)	},
-	[SMCA_MP5]	= { smca_mp5_mce_desc,	ARRAY_SIZE(smca_mp5_mce_desc)	},
-	[SMCA_MPDMA]	= { smca_mpdma_mce_desc,	ARRAY_SIZE(smca_mpdma_mce_desc)	},
-	[SMCA_NBIO]	= { smca_nbio_mce_desc,	ARRAY_SIZE(smca_nbio_mce_desc)	},
-	[SMCA_PCIE]	= { smca_pcie_mce_desc,	ARRAY_SIZE(smca_pcie_mce_desc)	},
-	[SMCA_PCIE_V2]	= { smca_pcie2_mce_desc,   ARRAY_SIZE(smca_pcie2_mce_desc)	},
-	[SMCA_XGMI_PCS]	= { smca_xgmipcs_mce_desc, ARRAY_SIZE(smca_xgmipcs_mce_desc)	},
-	/* NBIF and SHUB have the same error descriptions, for now. */
-	[SMCA_NBIF]	= { smca_nbif_mce_desc, ARRAY_SIZE(smca_nbif_mce_desc)	},
-	[SMCA_SHUB]	= { smca_nbif_mce_desc, ARRAY_SIZE(smca_nbif_mce_desc)	},
-	[SMCA_SATA]	= { smca_sata_mce_desc, ARRAY_SIZE(smca_sata_mce_desc)	},
-	[SMCA_USB]	= { smca_usb_mce_desc,	ARRAY_SIZE(smca_usb_mce_desc)	},
-	[SMCA_GMI_PCS]	= { smca_gmipcs_mce_desc,  ARRAY_SIZE(smca_gmipcs_mce_desc)	},
-	/* All the PHY bank types have the same error descriptions, for now. */
-	[SMCA_XGMI_PHY]	= { smca_xgmiphy_mce_desc, ARRAY_SIZE(smca_xgmiphy_mce_desc)	},
-	[SMCA_WAFL_PHY]	= { smca_xgmiphy_mce_desc, ARRAY_SIZE(smca_xgmiphy_mce_desc)	},
-	[SMCA_GMI_PHY]	= { smca_xgmiphy_mce_desc, ARRAY_SIZE(smca_xgmiphy_mce_desc)	},
-};
-
 static bool f12h_mc0_mce(u16 ec, u8 xec)
 {
 	bool ret = false;
@@ -1182,10 +706,6 @@ static void decode_smca_error(struct mce *m)
 
 	pr_emerg(HW_ERR "%s Ext. Error Code: %d", ip_name, xec);
 
-	/* Only print the decode of valid error codes */
-	if (xec < smca_mce_descs[bank_type].num_descs)
-		pr_cont(", %s.\n", smca_mce_descs[bank_type].descs[xec]);
-
 	if ((bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2) &&
 	    xec == 0 && decode_dram_ecc)
 		decode_dram_ecc(topology_die_id(m->extcpu), m);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/7] x86/MCE/AMD: Add new MA_LLC, USR_DP, and USR_CP bank types
  2023-07-20 12:54 [PATCH 0/7] AMD Family 19h Models 90h-9fh EDAC Support Muralidhara M K
  2023-07-20 12:54 ` [PATCH 1/7] x86/amd_nb: Add AMD Family 19h Models(80h-80fh) and (90h-9fh) PCI IDs Muralidhara M K
  2023-07-20 12:54 ` [PATCH 2/7] EDAC/mce_amd: Remove SMCA Extended Error code descriptions Muralidhara M K
@ 2023-07-20 12:54 ` Muralidhara M K
  2023-07-22  8:20   ` Borislav Petkov
  2023-07-20 12:54 ` [PATCH 4/7] EDAC/mc: Add new HBM3 memory type Muralidhara M K
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Muralidhara M K @ 2023-07-20 12:54 UTC (permalink / raw)
  To: linux-edac, x86
  Cc: linux-kernel, bp, mingo, mchehab, nchatrad, yazen.ghannam,
	Muralidhara M K

From: Muralidhara M K <muralidhara.mk@amd.com>

Add HWID and McaType values for new SMCA bank types.

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
---
 arch/x86/include/asm/mce.h    | 3 +++
 arch/x86/kernel/cpu/mce/amd.c | 6 ++++++
 2 files changed, 9 insertions(+)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 180b1cbfcc4e..8e0ed4b86e29 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -311,6 +311,7 @@ enum smca_bank_types {
 	SMCA_PIE,	/* Power, Interrupts, etc. */
 	SMCA_UMC,	/* Unified Memory Controller */
 	SMCA_UMC_V2,
+	SMCA_MA_LLC,	/* Memory Attached Last Level Cache */
 	SMCA_PB,	/* Parameter Block */
 	SMCA_PSP,	/* Platform Security Processor */
 	SMCA_PSP_V2,
@@ -326,6 +327,8 @@ enum smca_bank_types {
 	SMCA_SHUB,	/* System HUB Unit */
 	SMCA_SATA,	/* SATA Unit */
 	SMCA_USB,	/* USB Unit */
+	SMCA_USR_DP,	/* Ultra Short Reach Data Plane Controller */
+	SMCA_USR_CP,	/* Ultra Short Reach Control Plane Controller */
 	SMCA_GMI_PCS,	/* GMI PCS Unit */
 	SMCA_XGMI_PHY,	/* xGMI PHY Unit */
 	SMCA_WAFL_PHY,	/* WAFL PHY Unit */
diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 5e74610b39e7..cf8b4616fd31 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -107,6 +107,7 @@ static struct smca_bank_name smca_names[] = {
 	/* UMC v2 is separate because both of them can exist in a single system. */
 	[SMCA_UMC]			= { "umc",		"Unified Memory Controller" },
 	[SMCA_UMC_V2]			= { "umc_v2",		"Unified Memory Controller v2" },
+	[SMCA_MA_LLC]			= { "mall",		"Memory Attached Last Level Cache" },
 	[SMCA_PB]			= { "param_block",	"Parameter Block" },
 	[SMCA_PSP ... SMCA_PSP_V2]	= { "psp",		"Platform Security Processor" },
 	[SMCA_SMU ... SMCA_SMU_V2]	= { "smu",		"System Management Unit" },
@@ -119,6 +120,8 @@ static struct smca_bank_name smca_names[] = {
 	[SMCA_SHUB]			= { "shub",		"System Hub Unit" },
 	[SMCA_SATA]			= { "sata",		"SATA Unit" },
 	[SMCA_USB]			= { "usb",		"USB Unit" },
+	[SMCA_USR_DP]			= { "usr_dp_pcs",	"Ultra Short Reach Data Plane Controller" },
+	[SMCA_USR_CP]			= { "usr_cp_pcs",	"Ultra Short Reach Control Plane Controller" },
 	[SMCA_GMI_PCS]			= { "gmi_pcs",		"Global Memory Interconnect PCS Unit" },
 	[SMCA_XGMI_PHY]			= { "xgmi_phy",		"Ext Global Memory Interconnect PHY Unit" },
 	[SMCA_WAFL_PHY]			= { "wafl_phy",		"WAFL PHY Unit" },
@@ -178,6 +181,7 @@ static const struct smca_hwid smca_hwid_mcatypes[] = {
 	{ SMCA_CS,	 HWID_MCATYPE(0x2E, 0x0)	},
 	{ SMCA_PIE,	 HWID_MCATYPE(0x2E, 0x1)	},
 	{ SMCA_CS_V2,	 HWID_MCATYPE(0x2E, 0x2)	},
+	{ SMCA_MA_LLC,	 HWID_MCATYPE(0x2E, 0x4)	},
 
 	/* Unified Memory Controller MCA type */
 	{ SMCA_UMC,	 HWID_MCATYPE(0x96, 0x0)	},
@@ -212,6 +216,8 @@ static const struct smca_hwid smca_hwid_mcatypes[] = {
 	{ SMCA_SHUB,	 HWID_MCATYPE(0x80, 0x0)	},
 	{ SMCA_SATA,	 HWID_MCATYPE(0xA8, 0x0)	},
 	{ SMCA_USB,	 HWID_MCATYPE(0xAA, 0x0)	},
+	{ SMCA_USR_DP,	 HWID_MCATYPE(0x170, 0x0)	},
+	{ SMCA_USR_CP,	 HWID_MCATYPE(0x180, 0x0)	},
 	{ SMCA_GMI_PCS,  HWID_MCATYPE(0x241, 0x0)	},
 	{ SMCA_XGMI_PHY, HWID_MCATYPE(0x259, 0x0)	},
 	{ SMCA_WAFL_PHY, HWID_MCATYPE(0x267, 0x0)	},
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 4/7] EDAC/mc: Add new HBM3 memory type
  2023-07-20 12:54 [PATCH 0/7] AMD Family 19h Models 90h-9fh EDAC Support Muralidhara M K
                   ` (2 preceding siblings ...)
  2023-07-20 12:54 ` [PATCH 3/7] x86/MCE/AMD: Add new MA_LLC, USR_DP, and USR_CP bank types Muralidhara M K
@ 2023-07-20 12:54 ` Muralidhara M K
  2023-08-03 10:27   ` Borislav Petkov
  2023-07-20 12:54 ` [PATCH 5/7] EDAC/amd64: Add Fam19h Model 90h ~ 9fh enumeration support Muralidhara M K
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Muralidhara M K @ 2023-07-20 12:54 UTC (permalink / raw)
  To: linux-edac, x86
  Cc: linux-kernel, bp, mingo, mchehab, nchatrad, yazen.ghannam,
	Muralidhara M K

From: Muralidhara M K <muralidhara.mk@amd.com>

Add a new entry to 'enum mem_type' and a new string to 'edac_mem_types[]'
for HBM3 (High Bandwidth Memory Gen 3) new memory type.

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
---
 drivers/edac/edac_mc.c | 1 +
 include/linux/edac.h   | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 6faeb2ab3960..d6eed727b0cd 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -166,6 +166,7 @@ const char * const edac_mem_types[] = {
 	[MEM_NVDIMM]	= "Non-volatile-RAM",
 	[MEM_WIO2]	= "Wide-IO-2",
 	[MEM_HBM2]	= "High-bandwidth-memory-Gen2",
+	[MEM_HBM3]	= "High-bandwidth-memory-Gen3",
 };
 EXPORT_SYMBOL_GPL(edac_mem_types);
 
diff --git a/include/linux/edac.h b/include/linux/edac.h
index fa4bda2a70f6..1174beb94ab6 100644
--- a/include/linux/edac.h
+++ b/include/linux/edac.h
@@ -187,6 +187,7 @@ static inline char *mc_event_error_type(const unsigned int err_type)
  * @MEM_NVDIMM:		Non-volatile RAM
  * @MEM_WIO2:		Wide I/O 2.
  * @MEM_HBM2:		High bandwidth Memory Gen 2.
+ * @MEM_HBM3:		High bandwidth Memory Gen 3.
  */
 enum mem_type {
 	MEM_EMPTY = 0,
@@ -218,6 +219,7 @@ enum mem_type {
 	MEM_NVDIMM,
 	MEM_WIO2,
 	MEM_HBM2,
+	MEM_HBM3,
 };
 
 #define MEM_FLAG_EMPTY		BIT(MEM_EMPTY)
@@ -248,6 +250,7 @@ enum mem_type {
 #define MEM_FLAG_NVDIMM		BIT(MEM_NVDIMM)
 #define MEM_FLAG_WIO2		BIT(MEM_WIO2)
 #define MEM_FLAG_HBM2		BIT(MEM_HBM2)
+#define MEM_FLAG_HBM3		BIT(MEM_HBM3)
 
 /**
  * enum edac_type - Error Detection and Correction capabilities and mode
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 5/7] EDAC/amd64: Add Fam19h Model 90h ~ 9fh enumeration support
  2023-07-20 12:54 [PATCH 0/7] AMD Family 19h Models 90h-9fh EDAC Support Muralidhara M K
                   ` (3 preceding siblings ...)
  2023-07-20 12:54 ` [PATCH 4/7] EDAC/mc: Add new HBM3 memory type Muralidhara M K
@ 2023-07-20 12:54 ` Muralidhara M K
  2023-08-05 10:10   ` Borislav Petkov
  2023-07-20 12:54 ` [PATCH 6/7] EDAC/amd64: Add error instance get_err_info() to pvt->ops Muralidhara M K
  2023-07-20 12:54 ` [PATCH 7/7] EDAC/amd64: Add Error address conversion for UMC Muralidhara M K
  6 siblings, 1 reply; 19+ messages in thread
From: Muralidhara M K @ 2023-07-20 12:54 UTC (permalink / raw)
  To: linux-edac, x86
  Cc: linux-kernel, bp, mingo, mchehab, nchatrad, yazen.ghannam,
	Muralidhara M K

From: Muralidhara M K <muralidhara.mk@amd.com>

Add AMD family 19h Model 90h-9fh. Models 90h-9fh are APUs, and
they have built-in HBM3 memory. ECC support is enabled by default.

APU models have a single Data Fabric (DF) per Package. Each DF is
visible to the OS in the same way as chiplet-based systems like
Rome and later. However, the Unified Memory Controllers (UMCs) are
arranged in the same way as GPU-based MI200 devices rather than
CPU-based systems.
So, it uses the gpu_ops for enumeration and adds a few fixups.

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
---
 drivers/edac/amd64_edac.c | 65 +++++++++++++++++++++++++++++++--------
 1 file changed, 53 insertions(+), 12 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 597dae7692b1..45d8093c117a 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -996,12 +996,16 @@ static struct local_node_map {
 #define LNTM_NODE_COUNT				GENMASK(27, 16)
 #define LNTM_BASE_NODE_ID			GENMASK(11, 0)
 
-static int gpu_get_node_map(void)
+static int gpu_get_node_map(struct amd64_pvt *pvt)
 {
 	struct pci_dev *pdev;
 	int ret;
 	u32 tmp;
 
+	/* return early for non heterogeneous systems */
+	if (pvt->F3->device != PCI_DEVICE_ID_AMD_MI200_DF_F3)
+		return 0;
+
 	/*
 	 * Node ID 0 is reserved for CPUs.
 	 * Therefore, a non-zero Node ID means we've already cached the values.
@@ -3851,7 +3855,7 @@ static void gpu_init_csrows(struct mem_ctl_info *mci)
 
 			dimm->nr_pages = gpu_get_csrow_nr_pages(pvt, umc, cs);
 			dimm->edac_mode = EDAC_SECDED;
-			dimm->mtype = MEM_HBM2;
+			dimm->mtype = pvt->dram_type;
 			dimm->dtype = DEV_X16;
 			dimm->grain = 64;
 		}
@@ -3880,6 +3884,9 @@ static bool gpu_ecc_enabled(struct amd64_pvt *pvt)
 	return true;
 }
 
+/* Base address used for channels selection on GPUs */
+static u32 gpu_umc_base = 0x50000;
+
 static inline u32 gpu_get_umc_base(u8 umc, u8 channel)
 {
 	/*
@@ -3893,13 +3900,32 @@ static inline u32 gpu_get_umc_base(u8 umc, u8 channel)
 	 * On GPU nodes channels are selected in 3rd nibble
 	 * HBM chX[3:0]= [Y  ]5X[3:0]000;
 	 * HBM chX[7:4]= [Y+1]5X[3:0]000
+	 *
+	 * On APU nodes, same as GPU but with diff base 0x90000;
 	 */
 	umc *= 2;
 
 	if (channel >= 4)
 		umc++;
 
-	return 0x50000 + (umc << 20) + ((channel % 4) << 12);
+	return gpu_umc_base + (umc << 20) + ((channel % 4) << 12);
+}
+
+static void gpu_determine_memory_type(struct amd64_pvt *pvt)
+{
+	if (pvt->fam == 0x19) {
+		switch (pvt->model) {
+		case 0x30 ... 0x3F:
+			pvt->dram_type = MEM_HBM2;
+			break;
+		case 0x90 ... 0x9F:
+			pvt->dram_type = MEM_HBM3;
+			break;
+		default:
+			break;
+		}
+	}
+	edac_dbg(1, "  MEM type: %s\n", edac_mem_types[pvt->dram_type]);
 }
 
 static void gpu_read_mc_regs(struct amd64_pvt *pvt)
@@ -3960,7 +3986,7 @@ static int gpu_hw_info_get(struct amd64_pvt *pvt)
 {
 	int ret;
 
-	ret = gpu_get_node_map();
+	ret = gpu_get_node_map(pvt);
 	if (ret)
 		return ret;
 
@@ -3971,6 +3997,7 @@ static int gpu_hw_info_get(struct amd64_pvt *pvt)
 	gpu_prep_chip_selects(pvt);
 	gpu_read_base_mask(pvt);
 	gpu_read_mc_regs(pvt);
+	gpu_determine_memory_type(pvt);
 
 	return 0;
 }
@@ -4142,6 +4169,12 @@ static int per_family_init(struct amd64_pvt *pvt)
 			pvt->ctl_name			= "F19h_M70h";
 			pvt->flags.zn_regs_v2		= 1;
 			break;
+		case 0x90 ... 0x9f:
+			pvt->ctl_name			= "F19h_M90h";
+			pvt->max_mcs			= 4;
+			gpu_umc_base			= 0x90000;
+			pvt->ops			= &gpu_ops;
+			break;
 		case 0xa0 ... 0xaf:
 			pvt->ctl_name			= "F19h_MA0h";
 			pvt->max_mcs			= 12;
@@ -4166,23 +4199,31 @@ static const struct attribute_group *amd64_edac_attr_groups[] = {
 	NULL
 };
 
+/*
+ * For Heterogeneous and APU models EDAC CHIP_SELECT and CHANNEL layers
+ * should be swapped to fit into the layers.
+ */
+static unsigned int get_layer_size(struct amd64_pvt *pvt, u8 layer)
+{
+	bool is_gpu = (pvt->ops == &gpu_ops);
+
+	if (!layer)
+		return is_gpu ? pvt->max_mcs : pvt->csels[0].b_cnt;
+
+	return is_gpu ? pvt->csels[0].b_cnt : pvt->max_mcs;
+}
+
 static int init_one_instance(struct amd64_pvt *pvt)
 {
 	struct mem_ctl_info *mci = NULL;
 	struct edac_mc_layer layers[2];
 	int ret = -ENOMEM;
 
-	/*
-	 * For Heterogeneous family EDAC CHIP_SELECT and CHANNEL layers should
-	 * be swapped to fit into the layers.
-	 */
 	layers[0].type = EDAC_MC_LAYER_CHIP_SELECT;
-	layers[0].size = (pvt->F3->device == PCI_DEVICE_ID_AMD_MI200_DF_F3) ?
-			 pvt->max_mcs : pvt->csels[0].b_cnt;
+	layers[0].size = get_layer_size(pvt, 0);
 	layers[0].is_virt_csrow = true;
 	layers[1].type = EDAC_MC_LAYER_CHANNEL;
-	layers[1].size = (pvt->F3->device == PCI_DEVICE_ID_AMD_MI200_DF_F3) ?
-			 pvt->csels[0].b_cnt : pvt->max_mcs;
+	layers[1].size = get_layer_size(pvt, 1);
 	layers[1].is_virt_csrow = false;
 
 	mci = edac_mc_alloc(pvt->mc_node_id, ARRAY_SIZE(layers), layers, 0);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 6/7] EDAC/amd64: Add error instance get_err_info() to pvt->ops
  2023-07-20 12:54 [PATCH 0/7] AMD Family 19h Models 90h-9fh EDAC Support Muralidhara M K
                   ` (4 preceding siblings ...)
  2023-07-20 12:54 ` [PATCH 5/7] EDAC/amd64: Add Fam19h Model 90h ~ 9fh enumeration support Muralidhara M K
@ 2023-07-20 12:54 ` Muralidhara M K
  2023-07-21 14:47   ` Yazen Ghannam
  2023-07-20 12:54 ` [PATCH 7/7] EDAC/amd64: Add Error address conversion for UMC Muralidhara M K
  6 siblings, 1 reply; 19+ messages in thread
From: Muralidhara M K @ 2023-07-20 12:54 UTC (permalink / raw)
  To: linux-edac, x86
  Cc: linux-kernel, bp, mingo, mchehab, nchatrad, yazen.ghannam,
	Muralidhara M K, Naveen Krishna Chatradhi

From: Muralidhara M K <muralidhara.mk@amd.com>

On CPUs the data fabric ID of an instance on a CPU is equal to the
UMC number. since the UMC number and channel are equal in CPU nodes,
the channel can be used as the data fabric ID of the instance.

GPU node has 'X' number of PHYs and 'Y' number of channels.
This results in 'X*Y' number of instances in the data fabric.
Therefore the data fabric ID of an instance in GPU as below:
  df_inst_id = 'X' * number of channels per PHY + 'Y'

Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
---
 drivers/edac/amd64_edac.c | 36 +++++++++++++++++++++++++++++++++++-
 drivers/edac/amd64_edac.h |  2 ++
 2 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 45d8093c117a..74b2b47cc22a 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3047,6 +3047,17 @@ static inline void decode_bus_error(int node_id, struct mce *m)
 	__log_ecc_error(mci, &err, ecc_type);
 }
 
+/*
+ * On CPUs, The data fabric ID of an instance is equal to the UMC number.
+ * and since the UMC number and channel are equal in CPU nodes, the channel can be
+ * used as the data fabric ID of the instance.
+ */
+static int umc_inst_id(struct mem_ctl_info *mci, struct amd64_pvt *pvt,
+		       struct err_info *err)
+{
+	return err->channel;
+}
+
 /*
  * To find the UMC channel represented by this bank we need to match on its
  * instance_id. The instance_id of a bank is held in the lower 32 bits of its
@@ -3071,6 +3082,7 @@ static void decode_umc_error(int node_id, struct mce *m)
 	struct mem_ctl_info *mci;
 	struct amd64_pvt *pvt;
 	struct err_info err;
+	u8 df_inst_id;
 	u64 sys_addr;
 
 	node_id = fixup_node_id(node_id, m);
@@ -3101,8 +3113,9 @@ static void decode_umc_error(int node_id, struct mce *m)
 	}
 
 	pvt->ops->get_err_info(m, &err);
+	df_inst_id = pvt->ops->get_inst_id(mci, pvt, &err);
 
-	if (umc_normaddr_to_sysaddr(m->addr, pvt->mc_node_id, err.channel, &sys_addr)) {
+	if (umc_normaddr_to_sysaddr(m->addr, pvt->mc_node_id, df_inst_id, &sys_addr)) {
 		err.err_code = ERR_NORM_ADDR;
 		goto log_error;
 	}
@@ -3758,6 +3771,25 @@ static int umc_hw_info_get(struct amd64_pvt *pvt)
 	return 0;
 }
 
+/*
+ * A GPU node has 'X' number of PHYs and 'Y' number of channels.
+ * This results in 'X*Y' number of instances in the data fabric.
+ * Therefore the data fabric ID of an instance can be found with the following formula:
+ * df_inst_id = 'X' * number of channels per PHY + 'Y'
+ *
+ */
+static int gpu_inst_id(struct mem_ctl_info *mci, struct amd64_pvt *pvt,
+		       struct err_info *err)
+{
+	int i, channels = 0;
+
+	/* The memory channels in case of GPUs are fully populated */
+	for_each_umc(i)
+		channels += pvt->csels[i].b_cnt;
+
+	return (err->csrow * channels / mci->nr_csrows) + err->channel;
+}
+
 /*
  * The CPUs have one channel per UMC, so UMC number is equivalent to a
  * channel number. The GPUs have 8 channels per UMC, so the UMC number no
@@ -4015,6 +4047,7 @@ static struct low_ops umc_ops = {
 	.setup_mci_misc_attrs		= umc_setup_mci_misc_attrs,
 	.dump_misc_regs			= umc_dump_misc_regs,
 	.get_err_info			= umc_get_err_info,
+	.get_inst_id			= umc_inst_id,
 };
 
 static struct low_ops gpu_ops = {
@@ -4023,6 +4056,7 @@ static struct low_ops gpu_ops = {
 	.setup_mci_misc_attrs		= gpu_setup_mci_misc_attrs,
 	.dump_misc_regs			= gpu_dump_misc_regs,
 	.get_err_info			= gpu_get_err_info,
+	.get_inst_id			= gpu_inst_id,
 };
 
 /* Use Family 16h versions for defaults and adjust as needed below. */
diff --git a/drivers/edac/amd64_edac.h b/drivers/edac/amd64_edac.h
index 5a4e4a59682b..d9e9e62dd4b1 100644
--- a/drivers/edac/amd64_edac.h
+++ b/drivers/edac/amd64_edac.h
@@ -471,6 +471,8 @@ struct low_ops {
 	void (*setup_mci_misc_attrs)(struct mem_ctl_info *mci);
 	void (*dump_misc_regs)(struct amd64_pvt *pvt);
 	void (*get_err_info)(struct mce *m, struct err_info *err);
+	int  (*get_inst_id)(struct mem_ctl_info *mci, struct amd64_pvt *pvt,
+			    struct err_info *err);
 };
 
 int __amd64_read_pci_cfg_dword(struct pci_dev *pdev, int offset,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 7/7] EDAC/amd64: Add Error address conversion for UMC
  2023-07-20 12:54 [PATCH 0/7] AMD Family 19h Models 90h-9fh EDAC Support Muralidhara M K
                   ` (5 preceding siblings ...)
  2023-07-20 12:54 ` [PATCH 6/7] EDAC/amd64: Add error instance get_err_info() to pvt->ops Muralidhara M K
@ 2023-07-20 12:54 ` Muralidhara M K
  2023-07-21 14:49   ` Yazen Ghannam
  6 siblings, 1 reply; 19+ messages in thread
From: Muralidhara M K @ 2023-07-20 12:54 UTC (permalink / raw)
  To: linux-edac, x86
  Cc: linux-kernel, bp, mingo, mchehab, nchatrad, yazen.ghannam,
	Muralidhara M K

From: Muralidhara M K <muralidhara.mk@amd.com>

Reported MCA address is DRAM address which needs to be converted
to normalized address before Data fabric address translation.

Some AMD systems have on-chip memory capable of OnDie ECC support.
OnDie-ECC error address to MCA is a DRAM decoded address reported with
a DRAM address (PC/SID/Bank/ROW/COL) instead of normalized address
unlike MI200’s UMC ECC, as the implementation difference between
HBM3 ODECC and HBM2 host ECC.
Because OnDie-ECC address reporting is done in the back-end of UMC and
it no longer has normalized address at that point.
So software needs to convert the reported MCA Error Address back to
normalized address.

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
---
 drivers/edac/amd64_edac.c | 160 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 160 insertions(+)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 74b2b47cc22a..304d104c25d8 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3076,6 +3076,159 @@ static void umc_get_err_info(struct mce *m, struct err_info *err)
 	err->csrow = m->synd & 0x7;
 }
 
+static bool internal_bit_wise_xor(u32 inp)
+{
+	bool tmp = 0;
+
+	for (int i = 0; i < 32; i++)
+		tmp = tmp ^ ((inp >> i) & 0x1);
+
+	return tmp;
+}
+
+/* mapping of MCA error address to normalized address */
+static const u8 umc_mca2na_mapping[] = {
+	0,  5,  6,  8,  9,  14, 12, 13,
+	10, 11, 15, 16, 17, 18, 19, 20,
+	21, 22, 23, 24, 25, 26, 27, 28,
+	7,  29, 30,
+};
+
+/*
+ * Read AMD PPR UMC::AddrHashBank and
+ * UMC::CH::AddrHashPC/PC2 register fields
+ */
+static struct {
+	u32 xor_enable	:1;
+	u32 col_xor	:13;
+	u32 row_xor	:18;
+} addr_hash_pc, addr_hash_bank[4];
+
+static struct {
+	u32 bank_xor	:6;
+} addr_hash_pc2;
+
+/*
+ * The location of bank, column and row are fixed.
+ * location of column bit must be NA[5].
+ * Row bits are always placed in a contiguous stretch of NA above the
+ * column and bank bits.
+ * Bits below the row bits can be either column or bank in any order,
+ * with the exception that NA[5] must be a column bit.
+ * Stack ID(SID) bits are placed in the MSB position of the NA.
+ */
+static int umc_ondie_addr_to_normaddr(u64 mca_addr, u16 nid)
+{
+	u32 bank[4], bank_hash[4], pc_hash;
+	u32 col, row, rawbank = 0, pc;
+	int i, temp = 0;
+	u64 mca2na;
+
+	u32 gpu_umc_base = 0x90000;
+
+	/*
+	 * the below calculation, trying to maps ondie error address
+	 * to normalized address. logged ondie MCA address format is
+	 * BEQ_MCA_RdDatAddr[27:0] =
+	 *	{SID[1:0],PC[0],row[14:0],bank[3:0],col[4:0],1'b0}
+	 * The conversion mappings are:
+	 *
+	 * Normalized location	  ondie MCA error Address
+	 * ===================	  ======================
+	 * NA[4]		  = 1'b0
+	 * NA[5]	= col[0]  = BEQ_MCA_RdDatAddr[1]
+	 * NA[6]	= col[1]  = BEQ_MCA_RdDatAddr[2]
+	 * NA[8]	= col[2]  = BEQ_MCA_RdDatAddr[3]
+	 * NA[9]	= col[3]  = BEQ_MCA_RdDatAddr[4]
+	 * NA[14]	= col[4]  = BEQ_MCA_RdDatAddr[5]
+	 * NA[12]	= bank[0] = BEQ_MCA_RdDatAddr[5]
+	 * NA[13]	= bank[1] = BEQ_MCA_RdDatAddr[6]
+	 * NA[10]	= bank[2] = BEQ_MCA_RdDatAddr[7]
+	 * NA[11]	= bank[3] = BEQ_MCA_RdDatAddr[8]
+	 *
+	 * row low is 12 bit locations, low lsb bit starts from 10
+	 * NA[15..26] = row[0..11]  = BEQ_MCA_RdDatAddr[10..21]
+	 *
+	 * row high is 2 bit locations, high lsb bit starts from 22
+	 * NA[27..28] = row[12..13] = BEQ_MCA_RdDatAddr[22..23]
+	 *
+	 * NA[7]	= PC[0]   = BEQ_MCA_RdDatAddr[25]
+	 * NA[29]	= sid[0]  = bank[4] = BEQ_MCA_RdDatAddr[26]
+	 * NA[30]	= sid[1]  = bank[5] = BEQ_MCA_RdDatAddr[27]
+	 * Basically, it calculates a locations to fit as shown in
+	 * table umc_mca2na_mapping[].
+	 *
+	 * XORs need to be applied based on the hash settings below.
+	 */
+
+	/* Calculate column and row */
+	col = FIELD_GET(GENMASK(5, 1), mca_addr);
+	row = FIELD_GET(GENMASK(23, 10), mca_addr);
+
+	/* Apply hashing on below banks for bank calculation */
+	for (i = 0; i < 4; i++)
+		bank_hash[i] = (mca_addr >> (6 + i)) & 0x1;
+
+	/* bank hash algorithm */
+	for (i = 0; i < 4; i++) {
+		/* Read AMD PPR UMC::AddrHashBank register*/
+		if (!amd_smn_read(nid, gpu_umc_base + 0xC8 + (i * 4), &temp)) {
+			addr_hash_bank[i].xor_enable = temp & 1;
+			addr_hash_bank[i].col_xor = FIELD_GET(GENMASK(13, 1), temp);
+			addr_hash_bank[i].row_xor = FIELD_GET(GENMASK(31, 14), temp);
+			/* bank hash selection */
+			bank[i] = bank_hash[i] ^ (addr_hash_bank[i].xor_enable &
+				  (internal_bit_wise_xor(col & addr_hash_bank[i].col_xor) ^
+				  internal_bit_wise_xor(row & addr_hash_bank[i].row_xor)));
+		}
+	}
+
+	/* To apply hash on pc bit */
+	pc_hash = (mca_addr >> 25) & 0x1;
+
+	/* Read AMD PPR UMC::CH::AddrHashPC register */
+	if (!amd_smn_read(nid, gpu_umc_base + 0xE0, &temp)) {
+		addr_hash_pc.xor_enable = temp & 1;
+		addr_hash_pc.col_xor = FIELD_GET(GENMASK(13, 1), temp);
+		addr_hash_pc.row_xor = FIELD_GET(GENMASK(31, 14), temp);
+	}
+	/* Read AMD PPR UMC::CH::AddrHashPC2 register*/
+	if (!amd_smn_read(nid, gpu_umc_base + 0xE4, &temp))
+		addr_hash_pc2.bank_xor = FIELD_GET(GENMASK(5, 0), temp);
+
+	/* Calculate bank value from bank[0..3], bank[4] and bank[5] */
+	for (i = 0; i < 4; i++)
+		rawbank |= (bank[i] & 1) << i;
+
+	rawbank |= (mca_addr >> 22) & 0x30;
+
+	/* pseudochannel(pc) hash selection */
+	pc = pc_hash ^ (addr_hash_pc.xor_enable &
+		(internal_bit_wise_xor(col & addr_hash_pc.col_xor) ^
+		internal_bit_wise_xor(row & addr_hash_pc.row_xor) ^
+		internal_bit_wise_xor(rawbank & addr_hash_pc2.bank_xor)));
+
+	/* Mask b'25(pc_bit) and b'[9:6](bank) */
+	mca_addr &= ~0x20003c0ULL;
+
+	for (i = 0; i < 4; i++)
+		mca_addr |= (bank[i] << (6 + i));
+
+	 mca_addr |= (pc << 25);
+
+	/* NA[4..0] is fixed */
+	mca2na = 0x0;
+	/* convert mca error address to normalized address */
+	for (i = 1; i < ARRAY_SIZE(umc_mca2na_mapping); i++)
+		mca2na |= ((mca_addr >> i) & 0x1) << umc_mca2na_mapping[i];
+
+	mca_addr = mca2na;
+	pr_emerg(HW_ERR "Error Addr: 0x%016llx\n", mca_addr);
+	pr_emerg(HW_ERR "Error hit on Bank: %d Row: %d Column: %d\n", rawbank, row, col);
+
+	return mca_addr;
+}
+
 static void decode_umc_error(int node_id, struct mce *m)
 {
 	u8 ecc_type = (m->status >> 45) & 0x3;
@@ -3115,6 +3268,13 @@ static void decode_umc_error(int node_id, struct mce *m)
 	pvt->ops->get_err_info(m, &err);
 	df_inst_id = pvt->ops->get_inst_id(mci, pvt, &err);
 
+	/*
+	 * The reported MCA address(Error Addr) is DRAM decoded address which needs to be
+	 * converted to normalized address before DF address translation.
+	 */
+	if (pvt->fam == 0x19 && (pvt->model >= 0x90 && pvt->model <= 0x9f))
+		m->addr = umc_ondie_addr_to_normaddr(m->addr, pvt->mc_node_id);
+
 	if (umc_normaddr_to_sysaddr(m->addr, pvt->mc_node_id, df_inst_id, &sys_addr)) {
 		err.err_code = ERR_NORM_ADDR;
 		goto log_error;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/7] EDAC/mce_amd: Remove SMCA Extended Error code descriptions
  2023-07-20 12:54 ` [PATCH 2/7] EDAC/mce_amd: Remove SMCA Extended Error code descriptions Muralidhara M K
@ 2023-07-20 13:59   ` Borislav Petkov
  2023-07-20 15:25     ` M K, Muralidhara
  0 siblings, 1 reply; 19+ messages in thread
From: Borislav Petkov @ 2023-07-20 13:59 UTC (permalink / raw)
  To: Muralidhara M K
  Cc: linux-edac, x86, linux-kernel, mingo, mchehab, nchatrad,
	yazen.ghannam, Muralidhara M K

On Thu, Jul 20, 2023 at 12:54:20PM +0000, Muralidhara M K wrote:
> From: Muralidhara M K <muralidhara.mk@amd.com>
> 
> On AMD systems with Scalable MCA, each machine check error of a SMCA bank
> type has an associated bit position in the bank's control (CTL) register.
> 
> An error's bit position in the CTL register is used during error decoding
> for offsetting into the corresponding bank's error description structure.
> As new errors are being added in newer AMD systems for existing SMCA bank
> types, the underlying SMCA architecture guarantees that the bit positions
> of existing errors are not altered.
> 
> However, on some AMD systems some of the existing bit definitions in the
> CTL register of SMCA bank type are reassigned without defining new HWID
> and McaType. Consequently, the errors whose bit definitions have been
> reassigned in the CTL register are being erroneously decoded.
> 
> Remove SMCA Extended Error Code descriptions. This avoids decoding issues
> for incorrectly reassigned bits, and avoids the related maintenance burden
> in the kernel. This decoding can be done in external tools or by referring
> to AMD documentation. The bank type and Extended Error Code value for an
> error will continue to be printed as a convenience.
> 
> Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
> ---
>  drivers/edac/mce_amd.c | 480 -----------------------------------------
>  1 file changed, 480 deletions(-)

This needs to stay until rasdaemon has support for decoding errors - and
I've told you already.

Lemme tell you again, maybe it'll stick this time.

In any case, NAK.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/7] EDAC/mce_amd: Remove SMCA Extended Error code descriptions
  2023-07-20 13:59   ` Borislav Petkov
@ 2023-07-20 15:25     ` M K, Muralidhara
  2023-07-20 15:55       ` Borislav Petkov
  0 siblings, 1 reply; 19+ messages in thread
From: M K, Muralidhara @ 2023-07-20 15:25 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-edac, x86, linux-kernel, mingo, mchehab, nchatrad,
	yazen.ghannam, Muralidhara M K

Hi Boris,

On 7/20/2023 7:29 PM, Borislav Petkov wrote:
> Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.
> 
> 
> On Thu, Jul 20, 2023 at 12:54:20PM +0000, Muralidhara M K wrote:
>> From: Muralidhara M K <muralidhara.mk@amd.com>
>>
>> On AMD systems with Scalable MCA, each machine check error of a SMCA bank
>> type has an associated bit position in the bank's control (CTL) register.
>>
>> An error's bit position in the CTL register is used during error decoding
>> for offsetting into the corresponding bank's error description structure.
>> As new errors are being added in newer AMD systems for existing SMCA bank
>> types, the underlying SMCA architecture guarantees that the bit positions
>> of existing errors are not altered.
>>
>> However, on some AMD systems some of the existing bit definitions in the
>> CTL register of SMCA bank type are reassigned without defining new HWID
>> and McaType. Consequently, the errors whose bit definitions have been
>> reassigned in the CTL register are being erroneously decoded.
>>
>> Remove SMCA Extended Error Code descriptions. This avoids decoding issues
>> for incorrectly reassigned bits, and avoids the related maintenance burden
>> in the kernel. This decoding can be done in external tools or by referring
>> to AMD documentation. The bank type and Extended Error Code value for an
>> error will continue to be printed as a convenience.
>>
>> Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
>> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
>> ---
>>   drivers/edac/mce_amd.c | 480 -----------------------------------------
>>   1 file changed, 480 deletions(-)
> 
> This needs to stay until rasdaemon has support for decoding errors - and
> I've told you already.
> 
> Lemme tell you again, maybe it'll stick this time.
> 
> In any case, NAK.
> 

Pull request created in rasdaemon for the same.
https://github.com/mchehab/rasdaemon/pull/106/commits/09026653864305b7a91dcb3604b91a9c0c0d74f3

> --
> Regards/Gruss,
>      Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/7] EDAC/mce_amd: Remove SMCA Extended Error code descriptions
  2023-07-20 15:25     ` M K, Muralidhara
@ 2023-07-20 15:55       ` Borislav Petkov
  2023-07-21 14:45         ` Yazen Ghannam
  0 siblings, 1 reply; 19+ messages in thread
From: Borislav Petkov @ 2023-07-20 15:55 UTC (permalink / raw)
  To: M K, Muralidhara
  Cc: linux-edac, x86, linux-kernel, mingo, mchehab, nchatrad,
	yazen.ghannam, Muralidhara M K

On Thu, Jul 20, 2023 at 08:55:01PM +0530, M K, Muralidhara wrote:
> Pull request created in rasdaemon for the same.
> https://github.com/mchehab/rasdaemon/pull/106/commits/09026653864305b7a91dcb3604b91a9c0c0d74f3

I'd like to see a single error, once decoded with rasdaemon, after those
are applied, and once with the kernel, before this change.

Then add that info to the commit message so that people know what to do
when they see an error and how to go about decoding it.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/7] x86/amd_nb: Add AMD Family 19h Models(80h-80fh) and (90h-9fh) PCI IDs
  2023-07-20 12:54 ` [PATCH 1/7] x86/amd_nb: Add AMD Family 19h Models(80h-80fh) and (90h-9fh) PCI IDs Muralidhara M K
@ 2023-07-21 14:44   ` Yazen Ghannam
  0 siblings, 0 replies; 19+ messages in thread
From: Yazen Ghannam @ 2023-07-21 14:44 UTC (permalink / raw)
  To: Muralidhara M K, linux-edac, x86
  Cc: yazen.ghannam, linux-kernel, bp, mingo, mchehab, nchatrad,
	Muralidhara M K

On 7/20/2023 8:54 AM, Muralidhara M K wrote:
> From: Muralidhara M K <muralidhara.mk@amd.com>
> 
> Add new Root, Device 18h Function 3, and Function 4 PCI IDS
> for x86 AMD family 19h, Models 80h-80fh and 90h-9fh.
> 

$SUBJECT and commit message reference Family/Models, but the patch uses 
the code name MI300.

I suggest using MI300 in the $SUBJECT. And highlight in the message that 
both groups of MI300 use the same IDs. You can clarify that Models 
80h-8Fh are MI300C and 90h-9Fh are MI300A.

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/7] EDAC/mce_amd: Remove SMCA Extended Error code descriptions
  2023-07-20 15:55       ` Borislav Petkov
@ 2023-07-21 14:45         ` Yazen Ghannam
  2023-10-24  6:18           ` M K, Muralidhara
  0 siblings, 1 reply; 19+ messages in thread
From: Yazen Ghannam @ 2023-07-21 14:45 UTC (permalink / raw)
  To: Borislav Petkov, M K, Muralidhara
  Cc: yazen.ghannam, linux-edac, x86, linux-kernel, mingo, mchehab,
	nchatrad, Muralidhara M K

On 7/20/2023 11:55 AM, Borislav Petkov wrote:
> On Thu, Jul 20, 2023 at 08:55:01PM +0530, M K, Muralidhara wrote:
>> Pull request created in rasdaemon for the same.
>> https://github.com/mchehab/rasdaemon/pull/106/commits/09026653864305b7a91dcb3604b91a9c0c0d74f3
> 
> I'd like to see a single error, once decoded with rasdaemon, after those
> are applied, and once with the kernel, before this change.
> 
> Then add that info to the commit message so that people know what to do
> when they see an error and how to go about decoding it.
> 

Agreed. We have already discussed this...

Thanks,
Yazen


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 6/7] EDAC/amd64: Add error instance get_err_info() to pvt->ops
  2023-07-20 12:54 ` [PATCH 6/7] EDAC/amd64: Add error instance get_err_info() to pvt->ops Muralidhara M K
@ 2023-07-21 14:47   ` Yazen Ghannam
  0 siblings, 0 replies; 19+ messages in thread
From: Yazen Ghannam @ 2023-07-21 14:47 UTC (permalink / raw)
  To: Muralidhara M K, linux-edac, x86
  Cc: yazen.ghannam, linux-kernel, bp, mingo, mchehab, nchatrad,
	Muralidhara M K, Naveen Krishna Chatradhi

On 7/20/2023 8:54 AM, Muralidhara M K wrote:
> From: Muralidhara M K <muralidhara.mk@amd.com>
> 
> On CPUs the data fabric ID of an instance on a CPU is equal to the
> UMC number. since the UMC number and channel are equal in CPU nodes,
> the channel can be used as the data fabric ID of the instance.
> 
> GPU node has 'X' number of PHYs and 'Y' number of channels.
> This results in 'X*Y' number of instances in the data fabric.
> Therefore the data fabric ID of an instance in GPU as below:
>    df_inst_id = 'X' * number of channels per PHY + 'Y'
> 
> Co-developed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
> Signed-off-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
> Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
> ---
>   drivers/edac/amd64_edac.c | 36 +++++++++++++++++++++++++++++++++++-
>   drivers/edac/amd64_edac.h |  2 ++
>   2 files changed, 37 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index 45d8093c117a..74b2b47cc22a 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -3047,6 +3047,17 @@ static inline void decode_bus_error(int node_id, struct mce *m)
>   	__log_ecc_error(mci, &err, ecc_type);
>   }
>   
> +/*
> + * On CPUs, The data fabric ID of an instance is equal to the UMC number.
> + * and since the UMC number and channel are equal in CPU nodes, the channel can be
> + * used as the data fabric ID of the instance.
> + */
> +static int umc_inst_id(struct mem_ctl_info *mci, struct amd64_pvt *pvt,
> +		       struct err_info *err)
> +{
> +	return err->channel;
> +}
> +
>   /*
>    * To find the UMC channel represented by this bank we need to match on its
>    * instance_id. The instance_id of a bank is held in the lower 32 bits of its
> @@ -3071,6 +3082,7 @@ static void decode_umc_error(int node_id, struct mce *m)
>   	struct mem_ctl_info *mci;
>   	struct amd64_pvt *pvt;
>   	struct err_info err;
> +	u8 df_inst_id;
>   	u64 sys_addr;
>   
>   	node_id = fixup_node_id(node_id, m);
> @@ -3101,8 +3113,9 @@ static void decode_umc_error(int node_id, struct mce *m)
>   	}
>   
>   	pvt->ops->get_err_info(m, &err);
> +	df_inst_id = pvt->ops->get_inst_id(mci, pvt, &err);
>   
> -	if (umc_normaddr_to_sysaddr(m->addr, pvt->mc_node_id, err.channel, &sys_addr)) {
> +	if (umc_normaddr_to_sysaddr(m->addr, pvt->mc_node_id, df_inst_id, &sys_addr)) {
>   		err.err_code = ERR_NORM_ADDR;
>   		goto log_error;
>   	}

This patch is not useful until the address translation is updated. So 
lets drop this for now. And these changes can be included as part of the 
address translation updates.

Thanks,
Yazen


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/7] EDAC/amd64: Add Error address conversion for UMC
  2023-07-20 12:54 ` [PATCH 7/7] EDAC/amd64: Add Error address conversion for UMC Muralidhara M K
@ 2023-07-21 14:49   ` Yazen Ghannam
  0 siblings, 0 replies; 19+ messages in thread
From: Yazen Ghannam @ 2023-07-21 14:49 UTC (permalink / raw)
  To: Muralidhara M K, linux-edac, x86
  Cc: yazen.ghannam, linux-kernel, bp, mingo, mchehab, nchatrad,
	Muralidhara M K

On 7/20/2023 8:54 AM, Muralidhara M K wrote:
> From: Muralidhara M K <muralidhara.mk@amd.com>
> 
> Reported MCA address is DRAM address which needs to be converted
> to normalized address before Data fabric address translation.
> 
> Some AMD systems have on-chip memory capable of OnDie ECC support.
> OnDie-ECC error address to MCA is a DRAM decoded address reported with
> a DRAM address (PC/SID/Bank/ROW/COL) instead of normalized address
> unlike MI200’s UMC ECC, as the implementation difference between
> HBM3 ODECC and HBM2 host ECC.
> Because OnDie-ECC address reporting is done in the back-end of UMC and
> it no longer has normalized address at that point.
> So software needs to convert the reported MCA Error Address back to
> normalized address.
> 
> Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
> ---
>   drivers/edac/amd64_edac.c | 160 ++++++++++++++++++++++++++++++++++++++
>   1 file changed, 160 insertions(+)
> 
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index 74b2b47cc22a..304d104c25d8 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -3076,6 +3076,159 @@ static void umc_get_err_info(struct mce *m, struct err_info *err)
>   	err->csrow = m->synd & 0x7;
>   }
>   
> +static bool internal_bit_wise_xor(u32 inp)
> +{
> +	bool tmp = 0;
> +
> +	for (int i = 0; i < 32; i++)
> +		tmp = tmp ^ ((inp >> i) & 0x1);
> +
> +	return tmp;
> +}
> +
> +/* mapping of MCA error address to normalized address */
> +static const u8 umc_mca2na_mapping[] = {
> +	0,  5,  6,  8,  9,  14, 12, 13,
> +	10, 11, 15, 16, 17, 18, 19, 20,
> +	21, 22, 23, 24, 25, 26, 27, 28,
> +	7,  29, 30,
> +};
> +
> +/*
> + * Read AMD PPR UMC::AddrHashBank and
> + * UMC::CH::AddrHashPC/PC2 register fields
> + */
> +static struct {
> +	u32 xor_enable	:1;
> +	u32 col_xor	:13;
> +	u32 row_xor	:18;
> +} addr_hash_pc, addr_hash_bank[4];
> +
> +static struct {
> +	u32 bank_xor	:6;
> +} addr_hash_pc2;
> +
> +/*
> + * The location of bank, column and row are fixed.
> + * location of column bit must be NA[5].
> + * Row bits are always placed in a contiguous stretch of NA above the
> + * column and bank bits.
> + * Bits below the row bits can be either column or bank in any order,
> + * with the exception that NA[5] must be a column bit.
> + * Stack ID(SID) bits are placed in the MSB position of the NA.
> + */
> +static int umc_ondie_addr_to_normaddr(u64 mca_addr, u16 nid)
> +{
> +	u32 bank[4], bank_hash[4], pc_hash;
> +	u32 col, row, rawbank = 0, pc;
> +	int i, temp = 0;
> +	u64 mca2na;
> +
> +	u32 gpu_umc_base = 0x90000;
> +
> +	/*
> +	 * the below calculation, trying to maps ondie error address
> +	 * to normalized address. logged ondie MCA address format is
> +	 * BEQ_MCA_RdDatAddr[27:0] =
> +	 *	{SID[1:0],PC[0],row[14:0],bank[3:0],col[4:0],1'b0}
> +	 * The conversion mappings are:
> +	 *
> +	 * Normalized location	  ondie MCA error Address
> +	 * ===================	  ======================
> +	 * NA[4]		  = 1'b0
> +	 * NA[5]	= col[0]  = BEQ_MCA_RdDatAddr[1]
> +	 * NA[6]	= col[1]  = BEQ_MCA_RdDatAddr[2]
> +	 * NA[8]	= col[2]  = BEQ_MCA_RdDatAddr[3]
> +	 * NA[9]	= col[3]  = BEQ_MCA_RdDatAddr[4]
> +	 * NA[14]	= col[4]  = BEQ_MCA_RdDatAddr[5]
> +	 * NA[12]	= bank[0] = BEQ_MCA_RdDatAddr[5]
> +	 * NA[13]	= bank[1] = BEQ_MCA_RdDatAddr[6]
> +	 * NA[10]	= bank[2] = BEQ_MCA_RdDatAddr[7]
> +	 * NA[11]	= bank[3] = BEQ_MCA_RdDatAddr[8]
> +	 *
> +	 * row low is 12 bit locations, low lsb bit starts from 10
> +	 * NA[15..26] = row[0..11]  = BEQ_MCA_RdDatAddr[10..21]
> +	 *
> +	 * row high is 2 bit locations, high lsb bit starts from 22
> +	 * NA[27..28] = row[12..13] = BEQ_MCA_RdDatAddr[22..23]
> +	 *
> +	 * NA[7]	= PC[0]   = BEQ_MCA_RdDatAddr[25]
> +	 * NA[29]	= sid[0]  = bank[4] = BEQ_MCA_RdDatAddr[26]
> +	 * NA[30]	= sid[1]  = bank[5] = BEQ_MCA_RdDatAddr[27]
> +	 * Basically, it calculates a locations to fit as shown in
> +	 * table umc_mca2na_mapping[].
> +	 *
> +	 * XORs need to be applied based on the hash settings below.
> +	 */
> +
> +	/* Calculate column and row */
> +	col = FIELD_GET(GENMASK(5, 1), mca_addr);
> +	row = FIELD_GET(GENMASK(23, 10), mca_addr);
> +
> +	/* Apply hashing on below banks for bank calculation */
> +	for (i = 0; i < 4; i++)
> +		bank_hash[i] = (mca_addr >> (6 + i)) & 0x1;
> +
> +	/* bank hash algorithm */
> +	for (i = 0; i < 4; i++) {
> +		/* Read AMD PPR UMC::AddrHashBank register*/
> +		if (!amd_smn_read(nid, gpu_umc_base + 0xC8 + (i * 4), &temp)) {
> +			addr_hash_bank[i].xor_enable = temp & 1;
> +			addr_hash_bank[i].col_xor = FIELD_GET(GENMASK(13, 1), temp);
> +			addr_hash_bank[i].row_xor = FIELD_GET(GENMASK(31, 14), temp);
> +			/* bank hash selection */
> +			bank[i] = bank_hash[i] ^ (addr_hash_bank[i].xor_enable &
> +				  (internal_bit_wise_xor(col & addr_hash_bank[i].col_xor) ^
> +				  internal_bit_wise_xor(row & addr_hash_bank[i].row_xor)));
> +		}
> +	}
> +
> +	/* To apply hash on pc bit */
> +	pc_hash = (mca_addr >> 25) & 0x1;
> +
> +	/* Read AMD PPR UMC::CH::AddrHashPC register */
> +	if (!amd_smn_read(nid, gpu_umc_base + 0xE0, &temp)) {
> +		addr_hash_pc.xor_enable = temp & 1;
> +		addr_hash_pc.col_xor = FIELD_GET(GENMASK(13, 1), temp);
> +		addr_hash_pc.row_xor = FIELD_GET(GENMASK(31, 14), temp);
> +	}
> +	/* Read AMD PPR UMC::CH::AddrHashPC2 register*/
> +	if (!amd_smn_read(nid, gpu_umc_base + 0xE4, &temp))
> +		addr_hash_pc2.bank_xor = FIELD_GET(GENMASK(5, 0), temp);
> +
> +	/* Calculate bank value from bank[0..3], bank[4] and bank[5] */
> +	for (i = 0; i < 4; i++)
> +		rawbank |= (bank[i] & 1) << i;
> +
> +	rawbank |= (mca_addr >> 22) & 0x30;
> +
> +	/* pseudochannel(pc) hash selection */
> +	pc = pc_hash ^ (addr_hash_pc.xor_enable &
> +		(internal_bit_wise_xor(col & addr_hash_pc.col_xor) ^
> +		internal_bit_wise_xor(row & addr_hash_pc.row_xor) ^
> +		internal_bit_wise_xor(rawbank & addr_hash_pc2.bank_xor)));
> +
> +	/* Mask b'25(pc_bit) and b'[9:6](bank) */
> +	mca_addr &= ~0x20003c0ULL;
> +
> +	for (i = 0; i < 4; i++)
> +		mca_addr |= (bank[i] << (6 + i));
> +
> +	 mca_addr |= (pc << 25);
> +
> +	/* NA[4..0] is fixed */
> +	mca2na = 0x0;
> +	/* convert mca error address to normalized address */
> +	for (i = 1; i < ARRAY_SIZE(umc_mca2na_mapping); i++)
> +		mca2na |= ((mca_addr >> i) & 0x1) << umc_mca2na_mapping[i];
> +
> +	mca_addr = mca2na;
> +	pr_emerg(HW_ERR "Error Addr: 0x%016llx\n", mca_addr);
> +	pr_emerg(HW_ERR "Error hit on Bank: %d Row: %d Column: %d\n", rawbank, row, col);
> +
> +	return mca_addr;
> +}
> +
>   static void decode_umc_error(int node_id, struct mce *m)
>   {
>   	u8 ecc_type = (m->status >> 45) & 0x3;
> @@ -3115,6 +3268,13 @@ static void decode_umc_error(int node_id, struct mce *m)
>   	pvt->ops->get_err_info(m, &err);
>   	df_inst_id = pvt->ops->get_inst_id(mci, pvt, &err);
>   
> +	/*
> +	 * The reported MCA address(Error Addr) is DRAM decoded address which needs to be
> +	 * converted to normalized address before DF address translation.
> +	 */
> +	if (pvt->fam == 0x19 && (pvt->model >= 0x90 && pvt->model <= 0x9f))
> +		m->addr = umc_ondie_addr_to_normaddr(m->addr, pvt->mc_node_id);
> +
>   	if (umc_normaddr_to_sysaddr(m->addr, pvt->mc_node_id, df_inst_id, &sys_addr)) {
>   		err.err_code = ERR_NORM_ADDR;
>   		goto log_error;

Same comment as previous patch. Leave this until address translation 
updates.

Furthermore, I'm not sure if overwriting m->addr is still a good idea, 
since we'd like to keep the original error information for other uses.

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/7] x86/MCE/AMD: Add new MA_LLC, USR_DP, and USR_CP bank types
  2023-07-20 12:54 ` [PATCH 3/7] x86/MCE/AMD: Add new MA_LLC, USR_DP, and USR_CP bank types Muralidhara M K
@ 2023-07-22  8:20   ` Borislav Petkov
  0 siblings, 0 replies; 19+ messages in thread
From: Borislav Petkov @ 2023-07-22  8:20 UTC (permalink / raw)
  To: Muralidhara M K
  Cc: linux-edac, x86, linux-kernel, mingo, mchehab, nchatrad,
	yazen.ghannam, Muralidhara M K

On Thu, Jul 20, 2023 at 12:54:21PM +0000, Muralidhara M K wrote:
> diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
> index 5e74610b39e7..cf8b4616fd31 100644
> --- a/arch/x86/kernel/cpu/mce/amd.c
> +++ b/arch/x86/kernel/cpu/mce/amd.c
> @@ -107,6 +107,7 @@ static struct smca_bank_name smca_names[] = {
>  	/* UMC v2 is separate because both of them can exist in a single system. */
>  	[SMCA_UMC]			= { "umc",		"Unified Memory Controller" },
>  	[SMCA_UMC_V2]			= { "umc_v2",		"Unified Memory Controller v2" },
> +	[SMCA_MA_LLC]			= { "mall",		"Memory Attached Last Level Cache" },

"ma_llc" - not a mall. :)

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/7] EDAC/mc: Add new HBM3 memory type
  2023-07-20 12:54 ` [PATCH 4/7] EDAC/mc: Add new HBM3 memory type Muralidhara M K
@ 2023-08-03 10:27   ` Borislav Petkov
  0 siblings, 0 replies; 19+ messages in thread
From: Borislav Petkov @ 2023-08-03 10:27 UTC (permalink / raw)
  To: Muralidhara M K
  Cc: linux-edac, x86, linux-kernel, mingo, mchehab, nchatrad,
	yazen.ghannam, Muralidhara M K

On Thu, Jul 20, 2023 at 12:54:22PM +0000, Muralidhara M K wrote:
> From: Muralidhara M K <muralidhara.mk@amd.com>
> 
> Add a new entry to 'enum mem_type' and a new string to 'edac_mem_types[]'

Do not talk about *what* the patch is doing in the commit message - that
should be obvious from the diff itself. Rather, concentrate on the *why*
it needs to be done.

> for HBM3 (High Bandwidth Memory Gen 3) new memory type.
> 
> Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
> ---
>  drivers/edac/edac_mc.c | 1 +
>  include/linux/edac.h   | 3 +++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
> index 6faeb2ab3960..d6eed727b0cd 100644
> --- a/drivers/edac/edac_mc.c
> +++ b/drivers/edac/edac_mc.c
> @@ -166,6 +166,7 @@ const char * const edac_mem_types[] = {
>  	[MEM_NVDIMM]	= "Non-volatile-RAM",
>  	[MEM_WIO2]	= "Wide-IO-2",
>  	[MEM_HBM2]	= "High-bandwidth-memory-Gen2",
> +	[MEM_HBM3]	= "High-bandwidth-memory-Gen3",
>  };
>  EXPORT_SYMBOL_GPL(edac_mem_types);
>  
> diff --git a/include/linux/edac.h b/include/linux/edac.h
> index fa4bda2a70f6..1174beb94ab6 100644
> --- a/include/linux/edac.h
> +++ b/include/linux/edac.h
> @@ -187,6 +187,7 @@ static inline char *mc_event_error_type(const unsigned int err_type)
>   * @MEM_NVDIMM:		Non-volatile RAM
>   * @MEM_WIO2:		Wide I/O 2.
>   * @MEM_HBM2:		High bandwidth Memory Gen 2.
> + * @MEM_HBM3:		High bandwidth Memory Gen 3.

s/bandwidth/Bandwidth/g

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/7] EDAC/amd64: Add Fam19h Model 90h ~ 9fh enumeration support
  2023-07-20 12:54 ` [PATCH 5/7] EDAC/amd64: Add Fam19h Model 90h ~ 9fh enumeration support Muralidhara M K
@ 2023-08-05 10:10   ` Borislav Petkov
  0 siblings, 0 replies; 19+ messages in thread
From: Borislav Petkov @ 2023-08-05 10:10 UTC (permalink / raw)
  To: Muralidhara M K
  Cc: linux-edac, x86, linux-kernel, mingo, mchehab, nchatrad,
	yazen.ghannam, Muralidhara M K

On Thu, Jul 20, 2023 at 12:54:23PM +0000, Muralidhara M K wrote:
> From: Muralidhara M K <muralidhara.mk@amd.com>
> 
> Add AMD family 19h Model 90h-9fh. Models 90h-9fh are APUs, and
> they have built-in HBM3 memory. ECC support is enabled by default.
> 
> APU models have a single Data Fabric (DF) per Package. Each DF is
> visible to the OS in the same way as chiplet-based systems like
> Rome and later. However, the Unified Memory Controllers (UMCs) are
> arranged in the same way as GPU-based MI200 devices rather than
> CPU-based systems.
> So, it uses the gpu_ops for enumeration and adds a few fixups.

s/it uses/use/

Imperative tone:

Pls read section "2) Describe your changes" in
Documentation/process/submitting-patches.rst for more details.

> 
> Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
> ---
>  drivers/edac/amd64_edac.c | 65 +++++++++++++++++++++++++++++++--------
>  1 file changed, 53 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index 597dae7692b1..45d8093c117a 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -996,12 +996,16 @@ static struct local_node_map {
>  #define LNTM_NODE_COUNT				GENMASK(27, 16)
>  #define LNTM_BASE_NODE_ID			GENMASK(11, 0)
>  
> -static int gpu_get_node_map(void)
> +static int gpu_get_node_map(struct amd64_pvt *pvt)
>  {
>  	struct pci_dev *pdev;
>  	int ret;
>  	u32 tmp;
>  
> +	/* return early for non heterogeneous systems */

Superfluous comment.

> +	if (pvt->F3->device != PCI_DEVICE_ID_AMD_MI200_DF_F3)
> +		return 0;
> +
>  	/*
>  	 * Node ID 0 is reserved for CPUs.
>  	 * Therefore, a non-zero Node ID means we've already cached the values.
> @@ -3851,7 +3855,7 @@ static void gpu_init_csrows(struct mem_ctl_info *mci)
>  
>  			dimm->nr_pages = gpu_get_csrow_nr_pages(pvt, umc, cs);
>  			dimm->edac_mode = EDAC_SECDED;
> -			dimm->mtype = MEM_HBM2;
> +			dimm->mtype = pvt->dram_type;
>  			dimm->dtype = DEV_X16;
>  			dimm->grain = 64;
>  		}
> @@ -3880,6 +3884,9 @@ static bool gpu_ecc_enabled(struct amd64_pvt *pvt)
>  	return true;
>  }
>  
> +/* Base address used for channels selection on GPUs */
> +static u32 gpu_umc_base = 0x50000;

Why isn't this part of amd64_pvt like the rest of the fields?

> +
>  static inline u32 gpu_get_umc_base(u8 umc, u8 channel)
>  {
>  	/*
> @@ -3893,13 +3900,32 @@ static inline u32 gpu_get_umc_base(u8 umc, u8 channel)
>  	 * On GPU nodes channels are selected in 3rd nibble
>  	 * HBM chX[3:0]= [Y  ]5X[3:0]000;
>  	 * HBM chX[7:4]= [Y+1]5X[3:0]000
> +	 *
> +	 * On APU nodes, same as GPU but with diff base 0x90000;

"diff"?

>  	 */
>  	umc *= 2;
>  
>  	if (channel >= 4)
>  		umc++;
>  
> -	return 0x50000 + (umc << 20) + ((channel % 4) << 12);
> +	return gpu_umc_base + (umc << 20) + ((channel % 4) << 12);
> +}
> +

...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/7] EDAC/mce_amd: Remove SMCA Extended Error code descriptions
  2023-07-21 14:45         ` Yazen Ghannam
@ 2023-10-24  6:18           ` M K, Muralidhara
  0 siblings, 0 replies; 19+ messages in thread
From: M K, Muralidhara @ 2023-10-24  6:18 UTC (permalink / raw)
  To: Yazen Ghannam, Borislav Petkov
  Cc: linux-edac, x86, linux-kernel, mingo, mchehab, nchatrad, Muralidhara M K

Hi Boris,

On 7/21/2023 8:15 PM, Yazen Ghannam wrote:
> On 7/20/2023 11:55 AM, Borislav Petkov wrote:
>> On Thu, Jul 20, 2023 at 08:55:01PM +0530, M K, Muralidhara wrote:
>>> Pull request created in rasdaemon for the same.
>>> https://github.com/mchehab/rasdaemon/pull/106/commits/09026653864305b7a91dcb3604b91a9c0c0d74f3 
>>>
>>
>> I'd like to see a single error, once decoded with rasdaemon, after those
>> are applied, and once with the kernel, before this change.
>>
>> Then add that info to the commit message so that people know what to do
>> when they see an error and how to go about decoding it.
>>
> 
> Agreed. We have already discussed this...
> 

The below patches got accepted in rasdaemon.
https://github.com/mchehab/rasdaemon/commit/1f74a59ee33b7448b00d7ba13d5ecd4918b9853c 
rasdaemon: Add new MA_LLC, USR_DP, and USR_CP bank types

https://github.com/mchehab/rasdaemon/commit/2d15882a0cbfce0b905039bebc811ac8311cd739 
rasdaemon: Handle reassigned bit definitions for UMC bank

I will describe in commit message and submit v2 of this patch series.

> Thanks,
> Yazen
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2023-10-24  6:18 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-20 12:54 [PATCH 0/7] AMD Family 19h Models 90h-9fh EDAC Support Muralidhara M K
2023-07-20 12:54 ` [PATCH 1/7] x86/amd_nb: Add AMD Family 19h Models(80h-80fh) and (90h-9fh) PCI IDs Muralidhara M K
2023-07-21 14:44   ` Yazen Ghannam
2023-07-20 12:54 ` [PATCH 2/7] EDAC/mce_amd: Remove SMCA Extended Error code descriptions Muralidhara M K
2023-07-20 13:59   ` Borislav Petkov
2023-07-20 15:25     ` M K, Muralidhara
2023-07-20 15:55       ` Borislav Petkov
2023-07-21 14:45         ` Yazen Ghannam
2023-10-24  6:18           ` M K, Muralidhara
2023-07-20 12:54 ` [PATCH 3/7] x86/MCE/AMD: Add new MA_LLC, USR_DP, and USR_CP bank types Muralidhara M K
2023-07-22  8:20   ` Borislav Petkov
2023-07-20 12:54 ` [PATCH 4/7] EDAC/mc: Add new HBM3 memory type Muralidhara M K
2023-08-03 10:27   ` Borislav Petkov
2023-07-20 12:54 ` [PATCH 5/7] EDAC/amd64: Add Fam19h Model 90h ~ 9fh enumeration support Muralidhara M K
2023-08-05 10:10   ` Borislav Petkov
2023-07-20 12:54 ` [PATCH 6/7] EDAC/amd64: Add error instance get_err_info() to pvt->ops Muralidhara M K
2023-07-21 14:47   ` Yazen Ghannam
2023-07-20 12:54 ` [PATCH 7/7] EDAC/amd64: Add Error address conversion for UMC Muralidhara M K
2023-07-21 14:49   ` Yazen Ghannam

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).