linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 0/4] Few cleanups and AMD Family 19h Models 90h-9fh EDAC Support
@ 2023-11-02 11:42 Muralidhara M K
  2023-11-02 11:42 ` [PATCH v3 1/4] EDAC/mce_amd: Remove SMCA Extended Error code descriptions Muralidhara M K
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Muralidhara M K @ 2023-11-02 11:42 UTC (permalink / raw)
  To: linux-edac; +Cc: linux-kernel, bp, Muralidhara M K

From: Muralidhara M K <muralidhara.mk@amd.com>

This patchset is based on previously submitted changes:
https://lore.kernel.org/linux-edac/20231025051455.101424-1-muralimk@amd.com/T/#m5834ca7109d159f7fc869031f3603dff8a79633c

Patch 1:
Remove the SMCA Extended Error Code descriptions, because some of the
existing bit definitions in the CTL register of an SMCA bank type are
reassigned without defining a new HWID and McaType.

Patch 2:
Add new SMCA bank types MA_LLC, USR_DP, and USR_CP.

Patch 3:
Add the HBM3 memory type to 'enum mem_type'.

Patch 4:
Add enumeration support for Family 19h, Models 90h-9fh.

Muralidhara M K (4):
  EDAC/mce_amd: Remove SMCA Extended Error code descriptions
  x86/MCE/AMD: Add new MA_LLC, USR_DP, and USR_CP bank types
  EDAC/mc: Add support for HBM3 memory type
  EDAC/amd64: Add support for family 0x19, models 0x90-9f devices

 arch/x86/include/asm/mce.h    |   3 +
 arch/x86/kernel/cpu/mce/amd.c |   6 +
 drivers/edac/amd64_edac.c     |  62 +++--
 drivers/edac/amd64_edac.h     |   1 +
 drivers/edac/edac_mc.c        |   1 +
 drivers/edac/mce_amd.c        | 480 ----------------------------------
 include/linux/edac.h          |   3 +
 7 files changed, 60 insertions(+), 496 deletions(-)

-- 
2.25.1



* [PATCH v3 1/4] EDAC/mce_amd: Remove SMCA Extended Error code descriptions
  2023-11-02 11:42 [PATCH v3 0/4] Few cleanups and AMD Family 19h Models 90h-9fh EDAC Support Muralidhara M K
@ 2023-11-02 11:42 ` Muralidhara M K
  2023-11-28 14:20   ` [PATCH] Documentation: Begin a RAS section Borislav Petkov
  2023-11-28 14:24   ` [tip: ras/core] EDAC/mce_amd: Remove SMCA Extended Error code descriptions tip-bot2 for Muralidhara M K
  2023-11-02 11:42 ` [PATCH v3 2/4] x86/MCE/AMD: Add new MA_LLC, USR_DP, and USR_CP bank types Muralidhara M K
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 18+ messages in thread
From: Muralidhara M K @ 2023-11-02 11:42 UTC (permalink / raw)
  To: linux-edac; +Cc: linux-kernel, bp, Muralidhara M K, Yazen Ghannam

From: Muralidhara M K <muralidhara.mk@amd.com>

On AMD systems with Scalable MCA (SMCA), each machine check error of an
SMCA bank type has an associated bit position in the bank's control (CTL)
register.

An error's bit position in the CTL register is used during error decoding
for offsetting into the corresponding bank's error description structure.
As new errors are being added in newer AMD systems for existing SMCA bank
types, the underlying SMCA architecture guarantees that the bit positions
of existing errors are not altered.

However, on some AMD systems, some of the existing bit definitions in the
CTL register of an SMCA bank type are reassigned without defining a new
HWID and McaType. Consequently, errors whose bit definitions have been
reassigned in the CTL register are decoded incorrectly.
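
To illustrate the indexing scheme described above, here is a minimal
user-space sketch, not the kernel code itself. The table is a copy of the
smca_l2_mce_desc[] strings removed by this patch; the helper names
describe_xec() and example() are hypothetical and exist only for this
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Copied from the smca_l2_mce_desc[] table that this patch removes. */
static const char * const l2_desc[] = {
	"L2M Tag Multiple-Way-Hit error",
	"L2M Tag or State Array ECC Error",
	"L2M Data Array ECC Error",
	"Hardware Assert Error",
};

/*
 * Hypothetical helper: the extended error code (xec) is used directly as
 * the array index, so the string returned depends only on the bit
 * position. Returns NULL for codes beyond the end of the table.
 */
static const char *describe_xec(unsigned int xec)
{
	if (xec >= ARRAY_SIZE(l2_desc))
		return NULL;
	return l2_desc[xec];
}

/* Small usage example. */
static void example(void)
{
	assert(strcmp(describe_xec(2), "L2M Data Array ECC Error") == 0);
	assert(describe_xec(4) == NULL);
}
```

If hardware reassigns bit 2 to a different error while keeping the same
HWID and McaType, this lookup still returns the old string, which is the
mis-decoding this patch avoids.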

Remove the SMCA Extended Error Code descriptions. This avoids decoding
issues for incorrectly reassigned bits and avoids the related maintenance
burden in the kernel. The bank type and Extended Error Code value for an
error will continue to be printed as a convenience.

SMCA Extended Error Codes can be decoded by referring to AMD documentation
or by using external tools such as rasdaemon.

Offline decoding can be done with the following rasdaemon options, for example:
$ rasdaemon -p --status <STATUS> --ipid <IPID> --smca

The user can also pass a particular family and model to decode the error string:
$ rasdaemon -p --status <STATUS> --ipid <IPID> --smca --family <CPU Family>
	--model <CPU Model> --bank <BANK_NUM>
Refer to the commit below for details:
https://github.com/mchehab/rasdaemon/commit/932118b04a04104dfac6b8536

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Changes:
v2->v3: Add more information on how to decode in commit message

 drivers/edac/mce_amd.c | 480 -----------------------------------------
 1 file changed, 480 deletions(-)

diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 9215c06783df..3a67f02a34ad 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -143,482 +143,6 @@ static const char * const mc6_mce_desc[] = {
 	"Status Register File",
 };
 
-/* Scalable MCA error strings */
-static const char * const smca_ls_mce_desc[] = {
-	"Load queue parity error",
-	"Store queue parity error",
-	"Miss address buffer payload parity error",
-	"Level 1 TLB parity error",
-	"DC Tag error type 5",
-	"DC Tag error type 6",
-	"DC Tag error type 1",
-	"Internal error type 1",
-	"Internal error type 2",
-	"System Read Data Error Thread 0",
-	"System Read Data Error Thread 1",
-	"DC Tag error type 2",
-	"DC Data error type 1 and poison consumption",
-	"DC Data error type 2",
-	"DC Data error type 3",
-	"DC Tag error type 4",
-	"Level 2 TLB parity error",
-	"PDC parity error",
-	"DC Tag error type 3",
-	"DC Tag error type 5",
-	"L2 Fill Data error",
-};
-
-static const char * const smca_ls2_mce_desc[] = {
-	"An ECC error was detected on a data cache read by a probe or victimization",
-	"An ECC error or L2 poison was detected on a data cache read by a load",
-	"An ECC error was detected on a data cache read-modify-write by a store",
-	"An ECC error or poison bit mismatch was detected on a tag read by a probe or victimization",
-	"An ECC error or poison bit mismatch was detected on a tag read by a load",
-	"An ECC error or poison bit mismatch was detected on a tag read by a store",
-	"An ECC error was detected on an EMEM read by a load",
-	"An ECC error was detected on an EMEM read-modify-write by a store",
-	"A parity error was detected in an L1 TLB entry by any access",
-	"A parity error was detected in an L2 TLB entry by any access",
-	"A parity error was detected in a PWC entry by any access",
-	"A parity error was detected in an STQ entry by any access",
-	"A parity error was detected in an LDQ entry by any access",
-	"A parity error was detected in a MAB entry by any access",
-	"A parity error was detected in an SCB entry state field by any access",
-	"A parity error was detected in an SCB entry address field by any access",
-	"A parity error was detected in an SCB entry data field by any access",
-	"A parity error was detected in a WCB entry by any access",
-	"A poisoned line was detected in an SCB entry by any access",
-	"A SystemReadDataError error was reported on read data returned from L2 for a load",
-	"A SystemReadDataError error was reported on read data returned from L2 for an SCB store",
-	"A SystemReadDataError error was reported on read data returned from L2 for a WCB store",
-	"A hardware assertion error was reported",
-	"A parity error was detected in an STLF, SCB EMEM entry or SRB store data by any access",
-};
-
-static const char * const smca_if_mce_desc[] = {
-	"Op Cache Microtag Probe Port Parity Error",
-	"IC Microtag or Full Tag Multi-hit Error",
-	"IC Full Tag Parity Error",
-	"IC Data Array Parity Error",
-	"Decoupling Queue PhysAddr Parity Error",
-	"L0 ITLB Parity Error",
-	"L1 ITLB Parity Error",
-	"L2 ITLB Parity Error",
-	"BPQ Thread 0 Snoop Parity Error",
-	"BPQ Thread 1 Snoop Parity Error",
-	"L1 BTB Multi-Match Error",
-	"L2 BTB Multi-Match Error",
-	"L2 Cache Response Poison Error",
-	"System Read Data Error",
-	"Hardware Assertion Error",
-	"L1-TLB Multi-Hit",
-	"L2-TLB Multi-Hit",
-	"BSR Parity Error",
-	"CT MCE",
-};
-
-static const char * const smca_l2_mce_desc[] = {
-	"L2M Tag Multiple-Way-Hit error",
-	"L2M Tag or State Array ECC Error",
-	"L2M Data Array ECC Error",
-	"Hardware Assert Error",
-};
-
-static const char * const smca_de_mce_desc[] = {
-	"Micro-op cache tag parity error",
-	"Micro-op cache data parity error",
-	"Instruction buffer parity error",
-	"Micro-op queue parity error",
-	"Instruction dispatch queue parity error",
-	"Fetch address FIFO parity error",
-	"Patch RAM data parity error",
-	"Patch RAM sequencer parity error",
-	"Micro-op buffer parity error",
-	"Hardware Assertion MCA Error",
-};
-
-static const char * const smca_ex_mce_desc[] = {
-	"Watchdog Timeout error",
-	"Physical register file parity error",
-	"Flag register file parity error",
-	"Immediate displacement register file parity error",
-	"Address generator payload parity error",
-	"EX payload parity error",
-	"Checkpoint queue parity error",
-	"Retire dispatch queue parity error",
-	"Retire status queue parity error",
-	"Scheduling queue parity error",
-	"Branch buffer queue parity error",
-	"Hardware Assertion error",
-	"Spec Map parity error",
-	"Retire Map parity error",
-};
-
-static const char * const smca_fp_mce_desc[] = {
-	"Physical register file (PRF) parity error",
-	"Freelist (FL) parity error",
-	"Schedule queue parity error",
-	"NSQ parity error",
-	"Retire queue (RQ) parity error",
-	"Status register file (SRF) parity error",
-	"Hardware assertion",
-};
-
-static const char * const smca_l3_mce_desc[] = {
-	"Shadow Tag Macro ECC Error",
-	"Shadow Tag Macro Multi-way-hit Error",
-	"L3M Tag ECC Error",
-	"L3M Tag Multi-way-hit Error",
-	"L3M Data ECC Error",
-	"SDP Parity Error or SystemReadDataError from XI",
-	"L3 Victim Queue Parity Error",
-	"L3 Hardware Assertion",
-};
-
-static const char * const smca_cs_mce_desc[] = {
-	"Illegal Request",
-	"Address Violation",
-	"Security Violation",
-	"Illegal Response",
-	"Unexpected Response",
-	"Request or Probe Parity Error",
-	"Read Response Parity Error",
-	"Atomic Request Parity Error",
-	"Probe Filter ECC Error",
-};
-
-static const char * const smca_cs2_mce_desc[] = {
-	"Illegal Request",
-	"Address Violation",
-	"Security Violation",
-	"Illegal Response",
-	"Unexpected Response",
-	"Request or Probe Parity Error",
-	"Read Response Parity Error",
-	"Atomic Request Parity Error",
-	"SDP read response had no match in the CS queue",
-	"Probe Filter Protocol Error",
-	"Probe Filter ECC Error",
-	"SDP read response had an unexpected RETRY error",
-	"Counter overflow error",
-	"Counter underflow error",
-};
-
-static const char * const smca_pie_mce_desc[] = {
-	"Hardware Assert",
-	"Register security violation",
-	"Link Error",
-	"Poison data consumption",
-	"A deferred error was detected in the DF"
-};
-
-static const char * const smca_umc_mce_desc[] = {
-	"DRAM ECC error",
-	"Data poison error",
-	"SDP parity error",
-	"Advanced peripheral bus error",
-	"Address/Command parity error",
-	"Write data CRC error",
-	"DCQ SRAM ECC error",
-	"AES SRAM ECC error",
-};
-
-static const char * const smca_umc2_mce_desc[] = {
-	"DRAM ECC error",
-	"Data poison error",
-	"SDP parity error",
-	"Reserved",
-	"Address/Command parity error",
-	"Write data parity error",
-	"DCQ SRAM ECC error",
-	"Reserved",
-	"Read data parity error",
-	"Rdb SRAM ECC error",
-	"RdRsp SRAM ECC error",
-	"LM32 MP errors",
-};
-
-static const char * const smca_pb_mce_desc[] = {
-	"An ECC error in the Parameter Block RAM array",
-};
-
-static const char * const smca_psp_mce_desc[] = {
-	"An ECC or parity error in a PSP RAM instance",
-};
-
-static const char * const smca_psp2_mce_desc[] = {
-	"High SRAM ECC or parity error",
-	"Low SRAM ECC or parity error",
-	"Instruction Cache Bank 0 ECC or parity error",
-	"Instruction Cache Bank 1 ECC or parity error",
-	"Instruction Tag Ram 0 parity error",
-	"Instruction Tag Ram 1 parity error",
-	"Data Cache Bank 0 ECC or parity error",
-	"Data Cache Bank 1 ECC or parity error",
-	"Data Cache Bank 2 ECC or parity error",
-	"Data Cache Bank 3 ECC or parity error",
-	"Data Tag Bank 0 parity error",
-	"Data Tag Bank 1 parity error",
-	"Data Tag Bank 2 parity error",
-	"Data Tag Bank 3 parity error",
-	"Dirty Data Ram parity error",
-	"TLB Bank 0 parity error",
-	"TLB Bank 1 parity error",
-	"System Hub Read Buffer ECC or parity error",
-};
-
-static const char * const smca_smu_mce_desc[] = {
-	"An ECC or parity error in an SMU RAM instance",
-};
-
-static const char * const smca_smu2_mce_desc[] = {
-	"High SRAM ECC or parity error",
-	"Low SRAM ECC or parity error",
-	"Data Cache Bank A ECC or parity error",
-	"Data Cache Bank B ECC or parity error",
-	"Data Tag Cache Bank A ECC or parity error",
-	"Data Tag Cache Bank B ECC or parity error",
-	"Instruction Cache Bank A ECC or parity error",
-	"Instruction Cache Bank B ECC or parity error",
-	"Instruction Tag Cache Bank A ECC or parity error",
-	"Instruction Tag Cache Bank B ECC or parity error",
-	"System Hub Read Buffer ECC or parity error",
-	"PHY RAM ECC error",
-};
-
-static const char * const smca_mp5_mce_desc[] = {
-	"High SRAM ECC or parity error",
-	"Low SRAM ECC or parity error",
-	"Data Cache Bank A ECC or parity error",
-	"Data Cache Bank B ECC or parity error",
-	"Data Tag Cache Bank A ECC or parity error",
-	"Data Tag Cache Bank B ECC or parity error",
-	"Instruction Cache Bank A ECC or parity error",
-	"Instruction Cache Bank B ECC or parity error",
-	"Instruction Tag Cache Bank A ECC or parity error",
-	"Instruction Tag Cache Bank B ECC or parity error",
-};
-
-static const char * const smca_mpdma_mce_desc[] = {
-	"Main SRAM [31:0] bank ECC or parity error",
-	"Main SRAM [63:32] bank ECC or parity error",
-	"Main SRAM [95:64] bank ECC or parity error",
-	"Main SRAM [127:96] bank ECC or parity error",
-	"Data Cache Bank A ECC or parity error",
-	"Data Cache Bank B ECC or parity error",
-	"Data Tag Cache Bank A ECC or parity error",
-	"Data Tag Cache Bank B ECC or parity error",
-	"Instruction Cache Bank A ECC or parity error",
-	"Instruction Cache Bank B ECC or parity error",
-	"Instruction Tag Cache Bank A ECC or parity error",
-	"Instruction Tag Cache Bank B ECC or parity error",
-	"Data Cache Bank A ECC or parity error",
-	"Data Cache Bank B ECC or parity error",
-	"Data Tag Cache Bank A ECC or parity error",
-	"Data Tag Cache Bank B ECC or parity error",
-	"Instruction Cache Bank A ECC or parity error",
-	"Instruction Cache Bank B ECC or parity error",
-	"Instruction Tag Cache Bank A ECC or parity error",
-	"Instruction Tag Cache Bank B ECC or parity error",
-	"Data Cache Bank A ECC or parity error",
-	"Data Cache Bank B ECC or parity error",
-	"Data Tag Cache Bank A ECC or parity error",
-	"Data Tag Cache Bank B ECC or parity error",
-	"Instruction Cache Bank A ECC or parity error",
-	"Instruction Cache Bank B ECC or parity error",
-	"Instruction Tag Cache Bank A ECC or parity error",
-	"Instruction Tag Cache Bank B ECC or parity error",
-	"System Hub Read Buffer ECC or parity error",
-	"MPDMA TVF DVSEC Memory ECC or parity error",
-	"MPDMA TVF MMIO Mailbox0 ECC or parity error",
-	"MPDMA TVF MMIO Mailbox1 ECC or parity error",
-	"MPDMA TVF Doorbell Memory ECC or parity error",
-	"MPDMA TVF SDP Slave Memory 0 ECC or parity error",
-	"MPDMA TVF SDP Slave Memory 1 ECC or parity error",
-	"MPDMA TVF SDP Slave Memory 2 ECC or parity error",
-	"MPDMA TVF SDP Master Memory 0 ECC or parity error",
-	"MPDMA TVF SDP Master Memory 1 ECC or parity error",
-	"MPDMA TVF SDP Master Memory 2 ECC or parity error",
-	"MPDMA TVF SDP Master Memory 3 ECC or parity error",
-	"MPDMA TVF SDP Master Memory 4 ECC or parity error",
-	"MPDMA TVF SDP Master Memory 5 ECC or parity error",
-	"MPDMA TVF SDP Master Memory 6 ECC or parity error",
-	"MPDMA PTE Command FIFO ECC or parity error",
-	"MPDMA PTE Hub Data FIFO ECC or parity error",
-	"MPDMA PTE Internal Data FIFO ECC or parity error",
-	"MPDMA PTE Command Memory DMA ECC or parity error",
-	"MPDMA PTE Command Memory Internal ECC or parity error",
-	"MPDMA PTE DMA Completion FIFO ECC or parity error",
-	"MPDMA PTE Tablewalk Completion FIFO ECC or parity error",
-	"MPDMA PTE Descriptor Completion FIFO ECC or parity error",
-	"MPDMA PTE ReadOnly Completion FIFO ECC or parity error",
-	"MPDMA PTE DirectWrite Completion FIFO ECC or parity error",
-	"SDP Watchdog Timer expired",
-};
-
-static const char * const smca_nbio_mce_desc[] = {
-	"ECC or Parity error",
-	"PCIE error",
-	"SDP ErrEvent error",
-	"SDP Egress Poison Error",
-	"IOHC Internal Poison Error",
-};
-
-static const char * const smca_pcie_mce_desc[] = {
-	"CCIX PER Message logging",
-	"CCIX Read Response with Status: Non-Data Error",
-	"CCIX Write Response with Status: Non-Data Error",
-	"CCIX Read Response with Status: Data Error",
-	"CCIX Non-okay write response with data error",
-};
-
-static const char * const smca_pcie2_mce_desc[] = {
-	"SDP Parity Error logging",
-};
-
-static const char * const smca_xgmipcs_mce_desc[] = {
-	"Data Loss Error",
-	"Training Error",
-	"Flow Control Acknowledge Error",
-	"Rx Fifo Underflow Error",
-	"Rx Fifo Overflow Error",
-	"CRC Error",
-	"BER Exceeded Error",
-	"Tx Vcid Data Error",
-	"Replay Buffer Parity Error",
-	"Data Parity Error",
-	"Replay Fifo Overflow Error",
-	"Replay Fifo Underflow Error",
-	"Elastic Fifo Overflow Error",
-	"Deskew Error",
-	"Flow Control CRC Error",
-	"Data Startup Limit Error",
-	"FC Init Timeout Error",
-	"Recovery Timeout Error",
-	"Ready Serial Timeout Error",
-	"Ready Serial Attempt Error",
-	"Recovery Attempt Error",
-	"Recovery Relock Attempt Error",
-	"Replay Attempt Error",
-	"Sync Header Error",
-	"Tx Replay Timeout Error",
-	"Rx Replay Timeout Error",
-	"LinkSub Tx Timeout Error",
-	"LinkSub Rx Timeout Error",
-	"Rx CMD Packet Error",
-};
-
-static const char * const smca_xgmiphy_mce_desc[] = {
-	"RAM ECC Error",
-	"ARC instruction buffer parity error",
-	"ARC data buffer parity error",
-	"PHY APB error",
-};
-
-static const char * const smca_nbif_mce_desc[] = {
-	"Timeout error from GMI",
-	"SRAM ECC error",
-	"NTB Error Event",
-	"SDP Parity error",
-};
-
-static const char * const smca_sata_mce_desc[] = {
-	"Parity error for port 0",
-	"Parity error for port 1",
-	"Parity error for port 2",
-	"Parity error for port 3",
-	"Parity error for port 4",
-	"Parity error for port 5",
-	"Parity error for port 6",
-	"Parity error for port 7",
-};
-
-static const char * const smca_usb_mce_desc[] = {
-	"Parity error or ECC error for S0 RAM0",
-	"Parity error or ECC error for S0 RAM1",
-	"Parity error or ECC error for S0 RAM2",
-	"Parity error for PHY RAM0",
-	"Parity error for PHY RAM1",
-	"AXI Slave Response error",
-};
-
-static const char * const smca_gmipcs_mce_desc[] = {
-	"Data Loss Error",
-	"Training Error",
-	"Replay Parity Error",
-	"Rx Fifo Underflow Error",
-	"Rx Fifo Overflow Error",
-	"CRC Error",
-	"BER Exceeded Error",
-	"Tx Fifo Underflow Error",
-	"Replay Buffer Parity Error",
-	"Tx Overflow Error",
-	"Replay Fifo Overflow Error",
-	"Replay Fifo Underflow Error",
-	"Elastic Fifo Overflow Error",
-	"Deskew Error",
-	"Offline Error",
-	"Data Startup Limit Error",
-	"FC Init Timeout Error",
-	"Recovery Timeout Error",
-	"Ready Serial Timeout Error",
-	"Ready Serial Attempt Error",
-	"Recovery Attempt Error",
-	"Recovery Relock Attempt Error",
-	"Deskew Abort Error",
-	"Rx Buffer Error",
-	"Rx LFDS Fifo Overflow Error",
-	"Rx LFDS Fifo Underflow Error",
-	"LinkSub Tx Timeout Error",
-	"LinkSub Rx Timeout Error",
-	"Rx CMD Packet Error",
-	"LFDS Training Timeout Error",
-	"LFDS FC Init Timeout Error",
-	"Data Loss Error",
-};
-
-struct smca_mce_desc {
-	const char * const *descs;
-	unsigned int num_descs;
-};
-
-static struct smca_mce_desc smca_mce_descs[] = {
-	[SMCA_LS]	= { smca_ls_mce_desc,	ARRAY_SIZE(smca_ls_mce_desc)	},
-	[SMCA_LS_V2]	= { smca_ls2_mce_desc,	ARRAY_SIZE(smca_ls2_mce_desc)	},
-	[SMCA_IF]	= { smca_if_mce_desc,	ARRAY_SIZE(smca_if_mce_desc)	},
-	[SMCA_L2_CACHE]	= { smca_l2_mce_desc,	ARRAY_SIZE(smca_l2_mce_desc)	},
-	[SMCA_DE]	= { smca_de_mce_desc,	ARRAY_SIZE(smca_de_mce_desc)	},
-	[SMCA_EX]	= { smca_ex_mce_desc,	ARRAY_SIZE(smca_ex_mce_desc)	},
-	[SMCA_FP]	= { smca_fp_mce_desc,	ARRAY_SIZE(smca_fp_mce_desc)	},
-	[SMCA_L3_CACHE]	= { smca_l3_mce_desc,	ARRAY_SIZE(smca_l3_mce_desc)	},
-	[SMCA_CS]	= { smca_cs_mce_desc,	ARRAY_SIZE(smca_cs_mce_desc)	},
-	[SMCA_CS_V2]	= { smca_cs2_mce_desc,	ARRAY_SIZE(smca_cs2_mce_desc)	},
-	[SMCA_PIE]	= { smca_pie_mce_desc,	ARRAY_SIZE(smca_pie_mce_desc)	},
-	[SMCA_UMC]	= { smca_umc_mce_desc,	ARRAY_SIZE(smca_umc_mce_desc)	},
-	[SMCA_UMC_V2]	= { smca_umc2_mce_desc,	ARRAY_SIZE(smca_umc2_mce_desc)	},
-	[SMCA_PB]	= { smca_pb_mce_desc,	ARRAY_SIZE(smca_pb_mce_desc)	},
-	[SMCA_PSP]	= { smca_psp_mce_desc,	ARRAY_SIZE(smca_psp_mce_desc)	},
-	[SMCA_PSP_V2]	= { smca_psp2_mce_desc,	ARRAY_SIZE(smca_psp2_mce_desc)	},
-	[SMCA_SMU]	= { smca_smu_mce_desc,	ARRAY_SIZE(smca_smu_mce_desc)	},
-	[SMCA_SMU_V2]	= { smca_smu2_mce_desc,	ARRAY_SIZE(smca_smu2_mce_desc)	},
-	[SMCA_MP5]	= { smca_mp5_mce_desc,	ARRAY_SIZE(smca_mp5_mce_desc)	},
-	[SMCA_MPDMA]	= { smca_mpdma_mce_desc,	ARRAY_SIZE(smca_mpdma_mce_desc)	},
-	[SMCA_NBIO]	= { smca_nbio_mce_desc,	ARRAY_SIZE(smca_nbio_mce_desc)	},
-	[SMCA_PCIE]	= { smca_pcie_mce_desc,	ARRAY_SIZE(smca_pcie_mce_desc)	},
-	[SMCA_PCIE_V2]	= { smca_pcie2_mce_desc,   ARRAY_SIZE(smca_pcie2_mce_desc)	},
-	[SMCA_XGMI_PCS]	= { smca_xgmipcs_mce_desc, ARRAY_SIZE(smca_xgmipcs_mce_desc)	},
-	/* NBIF and SHUB have the same error descriptions, for now. */
-	[SMCA_NBIF]	= { smca_nbif_mce_desc, ARRAY_SIZE(smca_nbif_mce_desc)	},
-	[SMCA_SHUB]	= { smca_nbif_mce_desc, ARRAY_SIZE(smca_nbif_mce_desc)	},
-	[SMCA_SATA]	= { smca_sata_mce_desc, ARRAY_SIZE(smca_sata_mce_desc)	},
-	[SMCA_USB]	= { smca_usb_mce_desc,	ARRAY_SIZE(smca_usb_mce_desc)	},
-	[SMCA_GMI_PCS]	= { smca_gmipcs_mce_desc,  ARRAY_SIZE(smca_gmipcs_mce_desc)	},
-	/* All the PHY bank types have the same error descriptions, for now. */
-	[SMCA_XGMI_PHY]	= { smca_xgmiphy_mce_desc, ARRAY_SIZE(smca_xgmiphy_mce_desc)	},
-	[SMCA_WAFL_PHY]	= { smca_xgmiphy_mce_desc, ARRAY_SIZE(smca_xgmiphy_mce_desc)	},
-	[SMCA_GMI_PHY]	= { smca_xgmiphy_mce_desc, ARRAY_SIZE(smca_xgmiphy_mce_desc)	},
-};
-
 static bool f12h_mc0_mce(u16 ec, u8 xec)
 {
 	bool ret = false;
@@ -1182,10 +706,6 @@ static void decode_smca_error(struct mce *m)
 
 	pr_emerg(HW_ERR "%s Ext. Error Code: %d", ip_name, xec);
 
-	/* Only print the decode of valid error codes */
-	if (xec < smca_mce_descs[bank_type].num_descs)
-		pr_cont(", %s.\n", smca_mce_descs[bank_type].descs[xec]);
-
 	if ((bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2) &&
 	    xec == 0 && decode_dram_ecc)
 		decode_dram_ecc(topology_die_id(m->extcpu), m);
-- 
2.25.1



* [PATCH v3 2/4] x86/MCE/AMD: Add new MA_LLC, USR_DP, and USR_CP bank types
  2023-11-02 11:42 [PATCH v3 0/4] Few cleanups and AMD Family 19h Models 90h-9fh EDAC Support Muralidhara M K
  2023-11-02 11:42 ` [PATCH v3 1/4] EDAC/mce_amd: Remove SMCA Extended Error code descriptions Muralidhara M K
@ 2023-11-02 11:42 ` Muralidhara M K
  2023-11-29 10:33   ` [tip: ras/core] " tip-bot2 for Muralidhara M K
  2023-11-02 11:42 ` [PATCH v3 3/4] EDAC/mc: Add support for HBM3 memory type Muralidhara M K
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Muralidhara M K @ 2023-11-02 11:42 UTC (permalink / raw)
  To: linux-edac; +Cc: linux-kernel, bp, Muralidhara M K

From: Muralidhara M K <muralidhara.mk@amd.com>

Add HWID and McaType values for new SMCA bank types.
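
As a rough illustration of how a (HWID, McaType) pair selects a bank type,
here is a stand-alone sketch. It assumes HWID_MCATYPE() packs the HWID
into the upper 16 bits of a 32-bit key, which is my understanding of the
kernel macro. The enum bank_type, struct hwid_entry, lookup_bank_type()
and example() names are hypothetical; the HWID and McaType values are the
ones added in the hunks below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed packing: HWID in the upper 16 bits, McaType in the lower 16. */
#define HWID_MCATYPE(hwid, mcatype) (((hwid) << 16) | (mcatype))

/* Hypothetical stand-in for the kernel's enum smca_bank_types. */
enum bank_type { BANK_UNKNOWN, BANK_MA_LLC, BANK_USR_DP, BANK_USR_CP };

struct hwid_entry {
	enum bank_type type;
	uint32_t hwid_mcatype;
};

/* Values taken from the smca_hwid_mcatypes[] additions in this patch. */
static const struct hwid_entry table[] = {
	{ BANK_MA_LLC, HWID_MCATYPE(0x2E,  0x4) },
	{ BANK_USR_DP, HWID_MCATYPE(0x170, 0x0) },
	{ BANK_USR_CP, HWID_MCATYPE(0x180, 0x0) },
};

/* Linear search for the packed key; unknown pairs map to BANK_UNKNOWN. */
static enum bank_type lookup_bank_type(uint32_t hwid, uint32_t mcatype)
{
	uint32_t key = HWID_MCATYPE(hwid, mcatype);
	size_t i;

	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
		if (table[i].hwid_mcatype == key)
			return table[i].type;
	}
	return BANK_UNKNOWN;
}

/* Small usage example. */
static void example(void)
{
	assert(lookup_bank_type(0x2E, 0x4) == BANK_MA_LLC);
	assert(lookup_bank_type(0x2E, 0x3) == BANK_UNKNOWN);
}
```

Because the key combines both fields, MA_LLC can share HWID 0x2E with
other bank types and still be told apart by its McaType of 0x4.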

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
---
Changes:
v2->v3: Rename usr_dp_pcs to usr_dp and same for usr_cp_pcs

 arch/x86/include/asm/mce.h    | 3 +++
 arch/x86/kernel/cpu/mce/amd.c | 6 ++++++
 2 files changed, 9 insertions(+)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 6de6e1d95952..9e2bb7bdc3b3 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -311,6 +311,7 @@ enum smca_bank_types {
 	SMCA_PIE,	/* Power, Interrupts, etc. */
 	SMCA_UMC,	/* Unified Memory Controller */
 	SMCA_UMC_V2,
+	SMCA_MA_LLC,	/* Memory Attached Last Level Cache */
 	SMCA_PB,	/* Parameter Block */
 	SMCA_PSP,	/* Platform Security Processor */
 	SMCA_PSP_V2,
@@ -326,6 +327,8 @@ enum smca_bank_types {
 	SMCA_SHUB,	/* System HUB Unit */
 	SMCA_SATA,	/* SATA Unit */
 	SMCA_USB,	/* USB Unit */
+	SMCA_USR_DP,	/* Ultra Short Reach Data Plane Controller */
+	SMCA_USR_CP,	/* Ultra Short Reach Control Plane Controller */
 	SMCA_GMI_PCS,	/* GMI PCS Unit */
 	SMCA_XGMI_PHY,	/* xGMI PHY Unit */
 	SMCA_WAFL_PHY,	/* WAFL PHY Unit */
diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index f3517b8a8e91..4814d8a25ef8 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -107,6 +107,7 @@ static struct smca_bank_name smca_names[] = {
 	/* UMC v2 is separate because both of them can exist in a single system. */
 	[SMCA_UMC]			= { "umc",		"Unified Memory Controller" },
 	[SMCA_UMC_V2]			= { "umc_v2",		"Unified Memory Controller v2" },
+	[SMCA_MA_LLC]			= { "ma_llc",		"Memory Attached Last Level Cache" },
 	[SMCA_PB]			= { "param_block",	"Parameter Block" },
 	[SMCA_PSP ... SMCA_PSP_V2]	= { "psp",		"Platform Security Processor" },
 	[SMCA_SMU ... SMCA_SMU_V2]	= { "smu",		"System Management Unit" },
@@ -119,6 +120,8 @@ static struct smca_bank_name smca_names[] = {
 	[SMCA_SHUB]			= { "shub",		"System Hub Unit" },
 	[SMCA_SATA]			= { "sata",		"SATA Unit" },
 	[SMCA_USB]			= { "usb",		"USB Unit" },
+	[SMCA_USR_DP]			= { "usr_dp",		"Ultra Short Reach Data Plane Controller" },
+	[SMCA_USR_CP]			= { "usr_cp",		"Ultra Short Reach Control Plane Controller" },
 	[SMCA_GMI_PCS]			= { "gmi_pcs",		"Global Memory Interconnect PCS Unit" },
 	[SMCA_XGMI_PHY]			= { "xgmi_phy",		"Ext Global Memory Interconnect PHY Unit" },
 	[SMCA_WAFL_PHY]			= { "wafl_phy",		"WAFL PHY Unit" },
@@ -178,6 +181,7 @@ static const struct smca_hwid smca_hwid_mcatypes[] = {
 	{ SMCA_CS,	 HWID_MCATYPE(0x2E, 0x0)	},
 	{ SMCA_PIE,	 HWID_MCATYPE(0x2E, 0x1)	},
 	{ SMCA_CS_V2,	 HWID_MCATYPE(0x2E, 0x2)	},
+	{ SMCA_MA_LLC,	 HWID_MCATYPE(0x2E, 0x4)	},
 
 	/* Unified Memory Controller MCA type */
 	{ SMCA_UMC,	 HWID_MCATYPE(0x96, 0x0)	},
@@ -212,6 +216,8 @@ static const struct smca_hwid smca_hwid_mcatypes[] = {
 	{ SMCA_SHUB,	 HWID_MCATYPE(0x80, 0x0)	},
 	{ SMCA_SATA,	 HWID_MCATYPE(0xA8, 0x0)	},
 	{ SMCA_USB,	 HWID_MCATYPE(0xAA, 0x0)	},
+	{ SMCA_USR_DP,	 HWID_MCATYPE(0x170, 0x0)	},
+	{ SMCA_USR_CP,	 HWID_MCATYPE(0x180, 0x0)	},
 	{ SMCA_GMI_PCS,  HWID_MCATYPE(0x241, 0x0)	},
 	{ SMCA_XGMI_PHY, HWID_MCATYPE(0x259, 0x0)	},
 	{ SMCA_WAFL_PHY, HWID_MCATYPE(0x267, 0x0)	},
-- 
2.25.1



* [PATCH v3 3/4] EDAC/mc: Add support for HBM3 memory type
  2023-11-02 11:42 [PATCH v3 0/4] Few cleanups and AMD Family 19h Models 90h-9fh EDAC Support Muralidhara M K
  2023-11-02 11:42 ` [PATCH v3 1/4] EDAC/mce_amd: Remove SMCA Extended Error code descriptions Muralidhara M K
  2023-11-02 11:42 ` [PATCH v3 2/4] x86/MCE/AMD: Add new MA_LLC, USR_DP, and USR_CP bank types Muralidhara M K
@ 2023-11-02 11:42 ` Muralidhara M K
  2023-11-02 11:42 ` [PATCH v3 4/4] EDAC/amd64: Add support for family 0x19, models 0x90-9f devices Muralidhara M K
  2023-11-29 10:26 ` [PATCH v3 0/4] Few cleanups and AMD Family 19h Models 90h-9fh EDAC Support Borislav Petkov
  4 siblings, 0 replies; 18+ messages in thread
From: Muralidhara M K @ 2023-11-02 11:42 UTC (permalink / raw)
  To: linux-edac; +Cc: linux-kernel, bp, Muralidhara M K

From: Muralidhara M K <muralidhara.mk@amd.com>

AMD MI300A models use HBM3 (High Bandwidth Memory Gen 3) memory.
HBM is a high-speed computer memory interface for 3D-stacked synchronous
dynamic random-access memory (SDRAM).
Add support for this memory type by adding a new entry to 'enum mem_type'.
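
The pattern behind the new entry can be sketched in isolation as follows.
This is a trimmed, hypothetical subset of 'enum mem_type': the numeric
values here do not match the real enum, and mem_type_name() and example()
are illustrative helpers rather than EDAC functions. The string names are
the ones used in edac_mem_types[].

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BIT(n) (1UL << (n))

/* Trimmed subset for illustration; real enumerator values differ. */
enum mem_type {
	MEM_EMPTY = 0,
	MEM_HBM2,
	MEM_HBM3,	/* new entry, appended after MEM_HBM2 */
};

/* Each memory type gets a matching capability flag. */
#define MEM_FLAG_HBM2	BIT(MEM_HBM2)
#define MEM_FLAG_HBM3	BIT(MEM_HBM3)

/* Name table indexed by the enum, as edac_mem_types[] is. */
static const char * const mem_types[] = {
	[MEM_EMPTY]	= "Empty",
	[MEM_HBM2]	= "High-bandwidth-memory-Gen2",
	[MEM_HBM3]	= "High-bandwidth-memory-Gen3",
};

/* Illustrative lookup; returns NULL for out-of-range values. */
static const char *mem_type_name(unsigned int t)
{
	if (t >= sizeof(mem_types) / sizeof(mem_types[0]))
		return NULL;
	return mem_types[t];
}

/* Small usage example. */
static void example(void)
{
	assert(strcmp(mem_type_name(MEM_HBM3),
		      "High-bandwidth-memory-Gen3") == 0);
	assert((MEM_FLAG_HBM2 & MEM_FLAG_HBM3) == 0);
}
```

Adding a memory type therefore touches three places that must stay in
sync: the enum, the flag macro, and the name table.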

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
---
Changes:
v2->v3: Modify commit message

 drivers/edac/edac_mc.c | 1 +
 include/linux/edac.h   | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 6faeb2ab3960..d6eed727b0cd 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -166,6 +166,7 @@ const char * const edac_mem_types[] = {
 	[MEM_NVDIMM]	= "Non-volatile-RAM",
 	[MEM_WIO2]	= "Wide-IO-2",
 	[MEM_HBM2]	= "High-bandwidth-memory-Gen2",
+	[MEM_HBM3]	= "High-bandwidth-memory-Gen3",
 };
 EXPORT_SYMBOL_GPL(edac_mem_types);
 
diff --git a/include/linux/edac.h b/include/linux/edac.h
index fa4bda2a70f6..1174beb94ab6 100644
--- a/include/linux/edac.h
+++ b/include/linux/edac.h
@@ -187,6 +187,7 @@ static inline char *mc_event_error_type(const unsigned int err_type)
  * @MEM_NVDIMM:		Non-volatile RAM
  * @MEM_WIO2:		Wide I/O 2.
  * @MEM_HBM2:		High bandwidth Memory Gen 2.
+ * @MEM_HBM3:		High bandwidth Memory Gen 3.
  */
 enum mem_type {
 	MEM_EMPTY = 0,
@@ -218,6 +219,7 @@ enum mem_type {
 	MEM_NVDIMM,
 	MEM_WIO2,
 	MEM_HBM2,
+	MEM_HBM3,
 };
 
 #define MEM_FLAG_EMPTY		BIT(MEM_EMPTY)
@@ -248,6 +250,7 @@ enum mem_type {
 #define MEM_FLAG_NVDIMM		BIT(MEM_NVDIMM)
 #define MEM_FLAG_WIO2		BIT(MEM_WIO2)
 #define MEM_FLAG_HBM2		BIT(MEM_HBM2)
+#define MEM_FLAG_HBM3		BIT(MEM_HBM3)
 
 /**
  * enum edac_type - Error Detection and Correction capabilities and mode
-- 
2.25.1



* [PATCH v3 4/4] EDAC/amd64: Add support for family 0x19, models 0x90-9f devices
  2023-11-02 11:42 [PATCH v3 0/4] Few cleanups and AMD Family 19h Models 90h-9fh EDAC Support Muralidhara M K
                   ` (2 preceding siblings ...)
  2023-11-02 11:42 ` [PATCH v3 3/4] EDAC/mc: Add support for HBM3 memory type Muralidhara M K
@ 2023-11-02 11:42 ` Muralidhara M K
  2023-11-29 10:26 ` [PATCH v3 0/4] Few cleanups and AMD Family 19h Models 90h-9fh EDAC Support Borislav Petkov
  4 siblings, 0 replies; 18+ messages in thread
From: Muralidhara M K @ 2023-11-02 11:42 UTC (permalink / raw)
  To: linux-edac; +Cc: linux-kernel, bp, Muralidhara M K

From: Muralidhara M K <muralidhara.mk@amd.com>

AMD Models 90h-9fh are APUs. They have built-in HBM3 memory.
ECC support is enabled by default.

APU models have a single Data Fabric (DF) per Package. Each DF is
visible to the OS in the same way as chiplet-based systems like
Zen2 CPUs and later. However, the Unified Memory Controllers (UMCs)
are arranged in the same way as GPU-based MI200 devices rather than
CPU-based systems.
Use the existing gpu_ops for heterogeneous systems to support
enumeration of nodes and memory topology with a few fixups.
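
One of the fixups is that the UMC register base is no longer a fixed
constant. The following user-space sketch mirrors the arithmetic of
gpu_get_umc_base() with the base passed in as a parameter. The macro
names MI200_UMC_BASE and MI300_UMC_BASE and the helpers umc_base_addr()
and example() are hypothetical; 0x50000 is the constant replaced in the
diff below, and 0x90000 is the base named in the new code comment.

```c
#include <assert.h>
#include <stdint.h>

#define MI200_UMC_BASE	0x50000u	/* constant replaced by this patch */
#define MI300_UMC_BASE	0x90000u	/* base named in the new comment */

/*
 * Each UMC covers two address slots of four channels each: channels 0-3
 * use slot 2*umc, channels 4-7 use slot 2*umc + 1. The slot index goes
 * into bits 20 and up, the channel within the slot into bits 12 and up.
 */
static uint32_t umc_base_addr(uint32_t base, uint8_t umc, uint8_t channel)
{
	uint32_t slot = (uint32_t)umc * 2;

	if (channel >= 4)
		slot++;

	return base + (slot << 20) + ((uint32_t)(channel % 4) << 12);
}

/* Small usage example. */
static void example(void)
{
	assert(umc_base_addr(MI200_UMC_BASE, 0, 0) == 0x50000);
	assert(umc_base_addr(MI300_UMC_BASE, 1, 5) == 0x391000);
}
```

Only the base differs between the two device types in this sketch; the
slot and channel arithmetic is shared, which is why the existing GPU
helpers can be reused.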

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
---
Changes:
v2->v3:
1. Remove determine_memory_type; its logic is added to per_family_init
2. Move gpu_umc_base to the pvt struct and use it directly in per_family_init

 drivers/edac/amd64_edac.c | 62 +++++++++++++++++++++++++++++----------
 drivers/edac/amd64_edac.h |  1 +
 2 files changed, 47 insertions(+), 16 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 9b6642d00871..d080e7d70588 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -996,12 +996,20 @@ static struct local_node_map {
 #define LNTM_NODE_COUNT				GENMASK(27, 16)
 #define LNTM_BASE_NODE_ID			GENMASK(11, 0)
 
-static int gpu_get_node_map(void)
+static int gpu_get_node_map(struct amd64_pvt *pvt)
 {
 	struct pci_dev *pdev;
 	int ret;
 	u32 tmp;
 
+	/*
+	 * Mapping of nodes from hardware-provided AMD Node ID to a
+	 * Linux logical one is applicable for MI200 models.
+	 * Therefore return early for other heterogeneous systems.
+	 */
+	if (pvt->F3->device != PCI_DEVICE_ID_AMD_MI200_DF_F3)
+		return 0;
+
 	/*
 	 * Node ID 0 is reserved for CPUs.
 	 * Therefore, a non-zero Node ID means we've already cached the values.
@@ -3851,7 +3859,7 @@ static void gpu_init_csrows(struct mem_ctl_info *mci)
 
 			dimm->nr_pages = gpu_get_csrow_nr_pages(pvt, umc, cs);
 			dimm->edac_mode = EDAC_SECDED;
-			dimm->mtype = MEM_HBM2;
+			dimm->mtype = pvt->dram_type;
 			dimm->dtype = DEV_X16;
 			dimm->grain = 64;
 		}
@@ -3880,7 +3888,7 @@ static bool gpu_ecc_enabled(struct amd64_pvt *pvt)
 	return true;
 }
 
-static inline u32 gpu_get_umc_base(u8 umc, u8 channel)
+static inline u32 gpu_get_umc_base(struct amd64_pvt *pvt, u8 umc, u8 channel)
 {
 	/*
 	 * On CPUs, there is one channel per UMC, so UMC numbering equals
@@ -3893,13 +3901,16 @@ static inline u32 gpu_get_umc_base(u8 umc, u8 channel)
 	 * On GPU nodes channels are selected in 3rd nibble
 	 * HBM chX[3:0]= [Y  ]5X[3:0]000;
 	 * HBM chX[7:4]= [Y+1]5X[3:0]000
+	 *
+	 * On MI300 APU nodes, same as GPU nodes but channels are selected
+	 * in the base address of 0x90000
 	 */
 	umc *= 2;
 
 	if (channel >= 4)
 		umc++;
 
-	return 0x50000 + (umc << 20) + ((channel % 4) << 12);
+	return pvt->gpu_umc_base + (umc << 20) + ((channel % 4) << 12);
 }
 
 static void gpu_read_mc_regs(struct amd64_pvt *pvt)
@@ -3910,7 +3921,7 @@ static void gpu_read_mc_regs(struct amd64_pvt *pvt)
 
 	/* Read registers from each UMC */
 	for_each_umc(i) {
-		umc_base = gpu_get_umc_base(i, 0);
+		umc_base = gpu_get_umc_base(pvt, i, 0);
 		umc = &pvt->umc[i];
 
 		amd_smn_read(nid, umc_base + UMCCH_UMC_CFG, &umc->umc_cfg);
@@ -3927,7 +3938,7 @@ static void gpu_read_base_mask(struct amd64_pvt *pvt)
 
 	for_each_umc(umc) {
 		for_each_chip_select(cs, umc, pvt) {
-			base_reg = gpu_get_umc_base(umc, cs) + UMCCH_BASE_ADDR;
+			base_reg = gpu_get_umc_base(pvt, umc, cs) + UMCCH_BASE_ADDR;
 			base = &pvt->csels[umc].csbases[cs];
 
 			if (!amd_smn_read(pvt->mc_node_id, base_reg, base)) {
@@ -3935,7 +3946,7 @@ static void gpu_read_base_mask(struct amd64_pvt *pvt)
 					 umc, cs, *base, base_reg);
 			}
 
-			mask_reg = gpu_get_umc_base(umc, cs) + UMCCH_ADDR_MASK;
+			mask_reg = gpu_get_umc_base(pvt, umc, cs) + UMCCH_ADDR_MASK;
 			mask = &pvt->csels[umc].csmasks[cs];
 
 			if (!amd_smn_read(pvt->mc_node_id, mask_reg, mask)) {
@@ -3960,7 +3971,7 @@ static int gpu_hw_info_get(struct amd64_pvt *pvt)
 {
 	int ret;
 
-	ret = gpu_get_node_map();
+	ret = gpu_get_node_map(pvt);
 	if (ret)
 		return ret;
 
@@ -4125,6 +4136,8 @@ static int per_family_init(struct amd64_pvt *pvt)
 			if (pvt->F3->device == PCI_DEVICE_ID_AMD_MI200_DF_F3) {
 				pvt->ctl_name		= "MI200";
 				pvt->max_mcs		= 4;
+				pvt->dram_type		= MEM_HBM2;
+				pvt->gpu_umc_base	= 0x50000;
 				pvt->ops		= &gpu_ops;
 			} else {
 				pvt->ctl_name		= "F19h_M30h";
@@ -4142,6 +4155,13 @@ static int per_family_init(struct amd64_pvt *pvt)
 			pvt->ctl_name			= "F19h_M70h";
 			pvt->flags.zn_regs_v2		= 1;
 			break;
+		case 0x90 ... 0x9f:
+			pvt->ctl_name			= "F19h_M90h";
+			pvt->max_mcs			= 4;
+			pvt->dram_type			= MEM_HBM3;
+			pvt->gpu_umc_base		= 0x90000;
+			pvt->ops			= &gpu_ops;
+			break;
 		case 0xa0 ... 0xaf:
 			pvt->ctl_name			= "F19h_MA0h";
 			pvt->max_mcs			= 12;
@@ -4180,23 +4200,33 @@ static const struct attribute_group *amd64_edac_attr_groups[] = {
 	NULL
 };
 
+/*
+ * For heterogeneous and APU models EDAC CHIP_SELECT and CHANNEL layers
+ * should be swapped to fit into the layers.
+ */
+static unsigned int get_layer_size(struct amd64_pvt *pvt, u8 layer)
+{
+	bool is_gpu = (pvt->ops == &gpu_ops);
+
+	if (!layer)
+		return is_gpu ? pvt->max_mcs
+			      : pvt->csels[0].b_cnt;
+	else
+		return is_gpu ? pvt->csels[0].b_cnt
+			      : pvt->max_mcs;
+}
+
 static int init_one_instance(struct amd64_pvt *pvt)
 {
 	struct mem_ctl_info *mci = NULL;
 	struct edac_mc_layer layers[2];
 	int ret = -ENOMEM;
 
-	/*
-	 * For Heterogeneous family EDAC CHIP_SELECT and CHANNEL layers should
-	 * be swapped to fit into the layers.
-	 */
 	layers[0].type = EDAC_MC_LAYER_CHIP_SELECT;
-	layers[0].size = (pvt->F3->device == PCI_DEVICE_ID_AMD_MI200_DF_F3) ?
-			 pvt->max_mcs : pvt->csels[0].b_cnt;
+	layers[0].size = get_layer_size(pvt, 0);
 	layers[0].is_virt_csrow = true;
 	layers[1].type = EDAC_MC_LAYER_CHANNEL;
-	layers[1].size = (pvt->F3->device == PCI_DEVICE_ID_AMD_MI200_DF_F3) ?
-			 pvt->csels[0].b_cnt : pvt->max_mcs;
+	layers[1].size = get_layer_size(pvt, 1);
 	layers[1].is_virt_csrow = false;
 
 	mci = edac_mc_alloc(pvt->mc_node_id, ARRAY_SIZE(layers), layers, 0);
diff --git a/drivers/edac/amd64_edac.h b/drivers/edac/amd64_edac.h
index 5a4e4a59682b..1665f7932bac 100644
--- a/drivers/edac/amd64_edac.h
+++ b/drivers/edac/amd64_edac.h
@@ -362,6 +362,7 @@ struct amd64_pvt {
 	u32 dct_sel_lo;		/* DRAM Controller Select Low */
 	u32 dct_sel_hi;		/* DRAM Controller Select High */
 	u32 online_spare;	/* On-Line spare Reg */
+	u32 gpu_umc_base;	/* Base address used for channel selection on GPUs */
 
 	/* x4, x8, or x16 syndromes in use */
 	u8 ecc_sym_sz;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH] Documentation: Begin a RAS section
  2023-11-02 11:42 ` [PATCH v3 1/4] EDAC/mce_amd: Remove SMCA Extended Error code descriptions Muralidhara M K
@ 2023-11-28 14:20   ` Borislav Petkov
  2023-11-28 15:23     ` Yazen Ghannam
                       ` (2 more replies)
  2023-11-28 14:24   ` [tip: ras/core] EDAC/mce_amd: Remove SMCA Extended Error code descriptions tip-bot2 for Muralidhara M K
  1 sibling, 3 replies; 18+ messages in thread
From: Borislav Petkov @ 2023-11-28 14:20 UTC (permalink / raw)
  To: Tony Luck, Yazen Ghannam
  Cc: Muralidhara M K, linux-edac, linux-kernel, Muralidhara M K, linux-doc

On Thu, Nov 02, 2023 at 11:42:22AM +0000, Muralidhara M K wrote:
> From: Muralidhara M K <muralidhara.mk@amd.com>
> 
> AMD systems with Scalable MCA, each machine check error of a SMCA bank
> type has an associated bit position in the bank's control (CTL) register.

On top of this. It is long overdue:

---
From: "Borislav Petkov (AMD)" <bp@alien8.de>
Date: Tue, 28 Nov 2023 14:37:56 +0100

Add some initial RAS documentation. The expectation is for this to
collect all the user-visible features for interacting with the RAS
features of the kernel.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
---
 Documentation/RAS/ras.rst | 26 ++++++++++++++++++++++++++
 Documentation/index.rst   |  1 +
 2 files changed, 27 insertions(+)
 create mode 100644 Documentation/RAS/ras.rst

diff --git a/Documentation/RAS/ras.rst b/Documentation/RAS/ras.rst
new file mode 100644
index 000000000000..2556b397cd27
--- /dev/null
+++ b/Documentation/RAS/ras.rst
@@ -0,0 +1,26 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Reliability, Availability and Serviceability features
+=====================================================
+
+This documents different aspects of the RAS functionality present in the
+kernel.
+
+Error decoding
+---------------
+
+* x86
+
+Error decoding on AMD systems should be done using the rasdaemon tool:
+https://github.com/mchehab/rasdaemon/
+
+While the daemon is running, it automatically logs and decodes
+errors. If not, one can still decode such errors by supplying the
+hardware information from the error::
+
+        $ rasdaemon -p --status <STATUS> --ipid <IPID> --smca
+
+Also, the user can pass a particular family and model to decode the error
+string::
+
+        $ rasdaemon -p --status <STATUS> --ipid <IPID> --smca --family <CPU Family> --model <CPU Model> --bank <BANK_NUM>
diff --git a/Documentation/index.rst b/Documentation/index.rst
index 9dfdc826618c..36e61783437c 100644
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -113,6 +113,7 @@ to ReStructured Text format, or are simply too old.
    :maxdepth: 1
 
    staging/index
+   RAS/ras
 
 
 Translations
-- 
2.42.0.rc0.25.ga82fb66fed25


-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [tip: ras/core] EDAC/mce_amd: Remove SMCA Extended Error code descriptions
  2023-11-02 11:42 ` [PATCH v3 1/4] EDAC/mce_amd: Remove SMCA Extended Error code descriptions Muralidhara M K
  2023-11-28 14:20   ` [PATCH] Documentation: Begin a RAS section Borislav Petkov
@ 2023-11-28 14:24   ` tip-bot2 for Muralidhara M K
  1 sibling, 0 replies; 18+ messages in thread
From: tip-bot2 for Muralidhara M K @ 2023-11-28 14:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Muralidhara M K, Borislav Petkov (AMD), Yazen Ghannam, x86, linux-kernel

The following commit has been merged into the ras/core branch of tip:

Commit-ID:     9f988030e85fafa2b03910d467302853ad29a300
Gitweb:        https://git.kernel.org/tip/9f988030e85fafa2b03910d467302853ad29a300
Author:        Muralidhara M K <muralidhara.mk@amd.com>
AuthorDate:    Thu, 02 Nov 2023 11:42:22 
Committer:     Borislav Petkov (AMD) <bp@alien8.de>
CommitterDate: Tue, 28 Nov 2023 15:17:09 +01:00

EDAC/mce_amd: Remove SMCA Extended Error code descriptions

On AMD systems with Scalable MCA, each machine check error of an SMCA bank
type has an associated bit position in the bank's control (CTL)
register.

An error's bit position in the CTL register is used during error decoding
for offsetting into the corresponding bank's error description structure.
As new errors are being added in newer AMD systems for existing SMCA bank
types, the underlying SMCA architecture guarantees that the bit positions
of existing errors are not altered.

However, on some AMD systems some of the existing bit definitions in the
CTL register of an SMCA bank type are reassigned without defining a new HWID
and McaType. Consequently, the errors whose bit definitions have been
reassigned in the CTL register are being erroneously decoded.

Remove SMCA Extended Error Code descriptions. This avoids decoding
issues for incorrectly reassigned bits and avoids the related
maintenance burden in the kernel. The bank type and Extended Error
Code value for an error will continue to be printed as a convenience.

The decoding of SMCA Extended Error Code descriptions can be done by
referring to AMD documentation or by using external tools such as rasdaemon.

Offline decoding can be done using the rasdaemon option below. For example:

  $ rasdaemon -p --status <STATUS> --ipid <IPID> --smca

Also, the user can pass a particular family and model to decode the error
string:

  $ rasdaemon -p --status <STATUS> --ipid <IPID> --smca --family <CPU Family> \
	--model <CPU Model> --bank <BANK_NUM>

Refer to the rasdaemon commit for details:

  https://github.com/mchehab/rasdaemon/commit/932118b04a04104dfac6b8536

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20231102114225.2006878-2-muralimk@amd.com
---
 drivers/edac/mce_amd.c | 480 +----------------------------------------
 1 file changed, 480 deletions(-)

diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 28363eb..ec8b6c9 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -143,482 +143,6 @@ static const char * const mc6_mce_desc[] = {
 	"Status Register File",
 };
 
-/* Scalable MCA error strings */
-static const char * const smca_ls_mce_desc[] = {
-	"Load queue parity error",
-	"Store queue parity error",
-	"Miss address buffer payload parity error",
-	"Level 1 TLB parity error",
-	"DC Tag error type 5",
-	"DC Tag error type 6",
-	"DC Tag error type 1",
-	"Internal error type 1",
-	"Internal error type 2",
-	"System Read Data Error Thread 0",
-	"System Read Data Error Thread 1",
-	"DC Tag error type 2",
-	"DC Data error type 1 and poison consumption",
-	"DC Data error type 2",
-	"DC Data error type 3",
-	"DC Tag error type 4",
-	"Level 2 TLB parity error",
-	"PDC parity error",
-	"DC Tag error type 3",
-	"DC Tag error type 5",
-	"L2 Fill Data error",
-};
-
-static const char * const smca_ls2_mce_desc[] = {
-	"An ECC error was detected on a data cache read by a probe or victimization",
-	"An ECC error or L2 poison was detected on a data cache read by a load",
-	"An ECC error was detected on a data cache read-modify-write by a store",
-	"An ECC error or poison bit mismatch was detected on a tag read by a probe or victimization",
-	"An ECC error or poison bit mismatch was detected on a tag read by a load",
-	"An ECC error or poison bit mismatch was detected on a tag read by a store",
-	"An ECC error was detected on an EMEM read by a load",
-	"An ECC error was detected on an EMEM read-modify-write by a store",
-	"A parity error was detected in an L1 TLB entry by any access",
-	"A parity error was detected in an L2 TLB entry by any access",
-	"A parity error was detected in a PWC entry by any access",
-	"A parity error was detected in an STQ entry by any access",
-	"A parity error was detected in an LDQ entry by any access",
-	"A parity error was detected in a MAB entry by any access",
-	"A parity error was detected in an SCB entry state field by any access",
-	"A parity error was detected in an SCB entry address field by any access",
-	"A parity error was detected in an SCB entry data field by any access",
-	"A parity error was detected in a WCB entry by any access",
-	"A poisoned line was detected in an SCB entry by any access",
-	"A SystemReadDataError error was reported on read data returned from L2 for a load",
-	"A SystemReadDataError error was reported on read data returned from L2 for an SCB store",
-	"A SystemReadDataError error was reported on read data returned from L2 for a WCB store",
-	"A hardware assertion error was reported",
-	"A parity error was detected in an STLF, SCB EMEM entry or SRB store data by any access",
-};
-
-static const char * const smca_if_mce_desc[] = {
-	"Op Cache Microtag Probe Port Parity Error",
-	"IC Microtag or Full Tag Multi-hit Error",
-	"IC Full Tag Parity Error",
-	"IC Data Array Parity Error",
-	"Decoupling Queue PhysAddr Parity Error",
-	"L0 ITLB Parity Error",
-	"L1 ITLB Parity Error",
-	"L2 ITLB Parity Error",
-	"BPQ Thread 0 Snoop Parity Error",
-	"BPQ Thread 1 Snoop Parity Error",
-	"L1 BTB Multi-Match Error",
-	"L2 BTB Multi-Match Error",
-	"L2 Cache Response Poison Error",
-	"System Read Data Error",
-	"Hardware Assertion Error",
-	"L1-TLB Multi-Hit",
-	"L2-TLB Multi-Hit",
-	"BSR Parity Error",
-	"CT MCE",
-};
-
-static const char * const smca_l2_mce_desc[] = {
-	"L2M Tag Multiple-Way-Hit error",
-	"L2M Tag or State Array ECC Error",
-	"L2M Data Array ECC Error",
-	"Hardware Assert Error",
-};
-
-static const char * const smca_de_mce_desc[] = {
-	"Micro-op cache tag parity error",
-	"Micro-op cache data parity error",
-	"Instruction buffer parity error",
-	"Micro-op queue parity error",
-	"Instruction dispatch queue parity error",
-	"Fetch address FIFO parity error",
-	"Patch RAM data parity error",
-	"Patch RAM sequencer parity error",
-	"Micro-op buffer parity error",
-	"Hardware Assertion MCA Error",
-};
-
-static const char * const smca_ex_mce_desc[] = {
-	"Watchdog Timeout error",
-	"Physical register file parity error",
-	"Flag register file parity error",
-	"Immediate displacement register file parity error",
-	"Address generator payload parity error",
-	"EX payload parity error",
-	"Checkpoint queue parity error",
-	"Retire dispatch queue parity error",
-	"Retire status queue parity error",
-	"Scheduling queue parity error",
-	"Branch buffer queue parity error",
-	"Hardware Assertion error",
-	"Spec Map parity error",
-	"Retire Map parity error",
-};
-
-static const char * const smca_fp_mce_desc[] = {
-	"Physical register file (PRF) parity error",
-	"Freelist (FL) parity error",
-	"Schedule queue parity error",
-	"NSQ parity error",
-	"Retire queue (RQ) parity error",
-	"Status register file (SRF) parity error",
-	"Hardware assertion",
-};
-
-static const char * const smca_l3_mce_desc[] = {
-	"Shadow Tag Macro ECC Error",
-	"Shadow Tag Macro Multi-way-hit Error",
-	"L3M Tag ECC Error",
-	"L3M Tag Multi-way-hit Error",
-	"L3M Data ECC Error",
-	"SDP Parity Error or SystemReadDataError from XI",
-	"L3 Victim Queue Parity Error",
-	"L3 Hardware Assertion",
-};
-
-static const char * const smca_cs_mce_desc[] = {
-	"Illegal Request",
-	"Address Violation",
-	"Security Violation",
-	"Illegal Response",
-	"Unexpected Response",
-	"Request or Probe Parity Error",
-	"Read Response Parity Error",
-	"Atomic Request Parity Error",
-	"Probe Filter ECC Error",
-};
-
-static const char * const smca_cs2_mce_desc[] = {
-	"Illegal Request",
-	"Address Violation",
-	"Security Violation",
-	"Illegal Response",
-	"Unexpected Response",
-	"Request or Probe Parity Error",
-	"Read Response Parity Error",
-	"Atomic Request Parity Error",
-	"SDP read response had no match in the CS queue",
-	"Probe Filter Protocol Error",
-	"Probe Filter ECC Error",
-	"SDP read response had an unexpected RETRY error",
-	"Counter overflow error",
-	"Counter underflow error",
-};
-
-static const char * const smca_pie_mce_desc[] = {
-	"Hardware Assert",
-	"Register security violation",
-	"Link Error",
-	"Poison data consumption",
-	"A deferred error was detected in the DF"
-};
-
-static const char * const smca_umc_mce_desc[] = {
-	"DRAM ECC error",
-	"Data poison error",
-	"SDP parity error",
-	"Advanced peripheral bus error",
-	"Address/Command parity error",
-	"Write data CRC error",
-	"DCQ SRAM ECC error",
-	"AES SRAM ECC error",
-};
-
-static const char * const smca_umc2_mce_desc[] = {
-	"DRAM ECC error",
-	"Data poison error",
-	"SDP parity error",
-	"Reserved",
-	"Address/Command parity error",
-	"Write data parity error",
-	"DCQ SRAM ECC error",
-	"Reserved",
-	"Read data parity error",
-	"Rdb SRAM ECC error",
-	"RdRsp SRAM ECC error",
-	"LM32 MP errors",
-};
-
-static const char * const smca_pb_mce_desc[] = {
-	"An ECC error in the Parameter Block RAM array",
-};
-
-static const char * const smca_psp_mce_desc[] = {
-	"An ECC or parity error in a PSP RAM instance",
-};
-
-static const char * const smca_psp2_mce_desc[] = {
-	"High SRAM ECC or parity error",
-	"Low SRAM ECC or parity error",
-	"Instruction Cache Bank 0 ECC or parity error",
-	"Instruction Cache Bank 1 ECC or parity error",
-	"Instruction Tag Ram 0 parity error",
-	"Instruction Tag Ram 1 parity error",
-	"Data Cache Bank 0 ECC or parity error",
-	"Data Cache Bank 1 ECC or parity error",
-	"Data Cache Bank 2 ECC or parity error",
-	"Data Cache Bank 3 ECC or parity error",
-	"Data Tag Bank 0 parity error",
-	"Data Tag Bank 1 parity error",
-	"Data Tag Bank 2 parity error",
-	"Data Tag Bank 3 parity error",
-	"Dirty Data Ram parity error",
-	"TLB Bank 0 parity error",
-	"TLB Bank 1 parity error",
-	"System Hub Read Buffer ECC or parity error",
-};
-
-static const char * const smca_smu_mce_desc[] = {
-	"An ECC or parity error in an SMU RAM instance",
-};
-
-static const char * const smca_smu2_mce_desc[] = {
-	"High SRAM ECC or parity error",
-	"Low SRAM ECC or parity error",
-	"Data Cache Bank A ECC or parity error",
-	"Data Cache Bank B ECC or parity error",
-	"Data Tag Cache Bank A ECC or parity error",
-	"Data Tag Cache Bank B ECC or parity error",
-	"Instruction Cache Bank A ECC or parity error",
-	"Instruction Cache Bank B ECC or parity error",
-	"Instruction Tag Cache Bank A ECC or parity error",
-	"Instruction Tag Cache Bank B ECC or parity error",
-	"System Hub Read Buffer ECC or parity error",
-	"PHY RAM ECC error",
-};
-
-static const char * const smca_mp5_mce_desc[] = {
-	"High SRAM ECC or parity error",
-	"Low SRAM ECC or parity error",
-	"Data Cache Bank A ECC or parity error",
-	"Data Cache Bank B ECC or parity error",
-	"Data Tag Cache Bank A ECC or parity error",
-	"Data Tag Cache Bank B ECC or parity error",
-	"Instruction Cache Bank A ECC or parity error",
-	"Instruction Cache Bank B ECC or parity error",
-	"Instruction Tag Cache Bank A ECC or parity error",
-	"Instruction Tag Cache Bank B ECC or parity error",
-};
-
-static const char * const smca_mpdma_mce_desc[] = {
-	"Main SRAM [31:0] bank ECC or parity error",
-	"Main SRAM [63:32] bank ECC or parity error",
-	"Main SRAM [95:64] bank ECC or parity error",
-	"Main SRAM [127:96] bank ECC or parity error",
-	"Data Cache Bank A ECC or parity error",
-	"Data Cache Bank B ECC or parity error",
-	"Data Tag Cache Bank A ECC or parity error",
-	"Data Tag Cache Bank B ECC or parity error",
-	"Instruction Cache Bank A ECC or parity error",
-	"Instruction Cache Bank B ECC or parity error",
-	"Instruction Tag Cache Bank A ECC or parity error",
-	"Instruction Tag Cache Bank B ECC or parity error",
-	"Data Cache Bank A ECC or parity error",
-	"Data Cache Bank B ECC or parity error",
-	"Data Tag Cache Bank A ECC or parity error",
-	"Data Tag Cache Bank B ECC or parity error",
-	"Instruction Cache Bank A ECC or parity error",
-	"Instruction Cache Bank B ECC or parity error",
-	"Instruction Tag Cache Bank A ECC or parity error",
-	"Instruction Tag Cache Bank B ECC or parity error",
-	"Data Cache Bank A ECC or parity error",
-	"Data Cache Bank B ECC or parity error",
-	"Data Tag Cache Bank A ECC or parity error",
-	"Data Tag Cache Bank B ECC or parity error",
-	"Instruction Cache Bank A ECC or parity error",
-	"Instruction Cache Bank B ECC or parity error",
-	"Instruction Tag Cache Bank A ECC or parity error",
-	"Instruction Tag Cache Bank B ECC or parity error",
-	"System Hub Read Buffer ECC or parity error",
-	"MPDMA TVF DVSEC Memory ECC or parity error",
-	"MPDMA TVF MMIO Mailbox0 ECC or parity error",
-	"MPDMA TVF MMIO Mailbox1 ECC or parity error",
-	"MPDMA TVF Doorbell Memory ECC or parity error",
-	"MPDMA TVF SDP Slave Memory 0 ECC or parity error",
-	"MPDMA TVF SDP Slave Memory 1 ECC or parity error",
-	"MPDMA TVF SDP Slave Memory 2 ECC or parity error",
-	"MPDMA TVF SDP Master Memory 0 ECC or parity error",
-	"MPDMA TVF SDP Master Memory 1 ECC or parity error",
-	"MPDMA TVF SDP Master Memory 2 ECC or parity error",
-	"MPDMA TVF SDP Master Memory 3 ECC or parity error",
-	"MPDMA TVF SDP Master Memory 4 ECC or parity error",
-	"MPDMA TVF SDP Master Memory 5 ECC or parity error",
-	"MPDMA TVF SDP Master Memory 6 ECC or parity error",
-	"MPDMA PTE Command FIFO ECC or parity error",
-	"MPDMA PTE Hub Data FIFO ECC or parity error",
-	"MPDMA PTE Internal Data FIFO ECC or parity error",
-	"MPDMA PTE Command Memory DMA ECC or parity error",
-	"MPDMA PTE Command Memory Internal ECC or parity error",
-	"MPDMA PTE DMA Completion FIFO ECC or parity error",
-	"MPDMA PTE Tablewalk Completion FIFO ECC or parity error",
-	"MPDMA PTE Descriptor Completion FIFO ECC or parity error",
-	"MPDMA PTE ReadOnly Completion FIFO ECC or parity error",
-	"MPDMA PTE DirectWrite Completion FIFO ECC or parity error",
-	"SDP Watchdog Timer expired",
-};
-
-static const char * const smca_nbio_mce_desc[] = {
-	"ECC or Parity error",
-	"PCIE error",
-	"SDP ErrEvent error",
-	"SDP Egress Poison Error",
-	"IOHC Internal Poison Error",
-};
-
-static const char * const smca_pcie_mce_desc[] = {
-	"CCIX PER Message logging",
-	"CCIX Read Response with Status: Non-Data Error",
-	"CCIX Write Response with Status: Non-Data Error",
-	"CCIX Read Response with Status: Data Error",
-	"CCIX Non-okay write response with data error",
-};
-
-static const char * const smca_pcie2_mce_desc[] = {
-	"SDP Parity Error logging",
-};
-
-static const char * const smca_xgmipcs_mce_desc[] = {
-	"Data Loss Error",
-	"Training Error",
-	"Flow Control Acknowledge Error",
-	"Rx Fifo Underflow Error",
-	"Rx Fifo Overflow Error",
-	"CRC Error",
-	"BER Exceeded Error",
-	"Tx Vcid Data Error",
-	"Replay Buffer Parity Error",
-	"Data Parity Error",
-	"Replay Fifo Overflow Error",
-	"Replay Fifo Underflow Error",
-	"Elastic Fifo Overflow Error",
-	"Deskew Error",
-	"Flow Control CRC Error",
-	"Data Startup Limit Error",
-	"FC Init Timeout Error",
-	"Recovery Timeout Error",
-	"Ready Serial Timeout Error",
-	"Ready Serial Attempt Error",
-	"Recovery Attempt Error",
-	"Recovery Relock Attempt Error",
-	"Replay Attempt Error",
-	"Sync Header Error",
-	"Tx Replay Timeout Error",
-	"Rx Replay Timeout Error",
-	"LinkSub Tx Timeout Error",
-	"LinkSub Rx Timeout Error",
-	"Rx CMD Packet Error",
-};
-
-static const char * const smca_xgmiphy_mce_desc[] = {
-	"RAM ECC Error",
-	"ARC instruction buffer parity error",
-	"ARC data buffer parity error",
-	"PHY APB error",
-};
-
-static const char * const smca_nbif_mce_desc[] = {
-	"Timeout error from GMI",
-	"SRAM ECC error",
-	"NTB Error Event",
-	"SDP Parity error",
-};
-
-static const char * const smca_sata_mce_desc[] = {
-	"Parity error for port 0",
-	"Parity error for port 1",
-	"Parity error for port 2",
-	"Parity error for port 3",
-	"Parity error for port 4",
-	"Parity error for port 5",
-	"Parity error for port 6",
-	"Parity error for port 7",
-};
-
-static const char * const smca_usb_mce_desc[] = {
-	"Parity error or ECC error for S0 RAM0",
-	"Parity error or ECC error for S0 RAM1",
-	"Parity error or ECC error for S0 RAM2",
-	"Parity error for PHY RAM0",
-	"Parity error for PHY RAM1",
-	"AXI Slave Response error",
-};
-
-static const char * const smca_gmipcs_mce_desc[] = {
-	"Data Loss Error",
-	"Training Error",
-	"Replay Parity Error",
-	"Rx Fifo Underflow Error",
-	"Rx Fifo Overflow Error",
-	"CRC Error",
-	"BER Exceeded Error",
-	"Tx Fifo Underflow Error",
-	"Replay Buffer Parity Error",
-	"Tx Overflow Error",
-	"Replay Fifo Overflow Error",
-	"Replay Fifo Underflow Error",
-	"Elastic Fifo Overflow Error",
-	"Deskew Error",
-	"Offline Error",
-	"Data Startup Limit Error",
-	"FC Init Timeout Error",
-	"Recovery Timeout Error",
-	"Ready Serial Timeout Error",
-	"Ready Serial Attempt Error",
-	"Recovery Attempt Error",
-	"Recovery Relock Attempt Error",
-	"Deskew Abort Error",
-	"Rx Buffer Error",
-	"Rx LFDS Fifo Overflow Error",
-	"Rx LFDS Fifo Underflow Error",
-	"LinkSub Tx Timeout Error",
-	"LinkSub Rx Timeout Error",
-	"Rx CMD Packet Error",
-	"LFDS Training Timeout Error",
-	"LFDS FC Init Timeout Error",
-	"Data Loss Error",
-};
-
-struct smca_mce_desc {
-	const char * const *descs;
-	unsigned int num_descs;
-};
-
-static struct smca_mce_desc smca_mce_descs[] = {
-	[SMCA_LS]	= { smca_ls_mce_desc,	ARRAY_SIZE(smca_ls_mce_desc)	},
-	[SMCA_LS_V2]	= { smca_ls2_mce_desc,	ARRAY_SIZE(smca_ls2_mce_desc)	},
-	[SMCA_IF]	= { smca_if_mce_desc,	ARRAY_SIZE(smca_if_mce_desc)	},
-	[SMCA_L2_CACHE]	= { smca_l2_mce_desc,	ARRAY_SIZE(smca_l2_mce_desc)	},
-	[SMCA_DE]	= { smca_de_mce_desc,	ARRAY_SIZE(smca_de_mce_desc)	},
-	[SMCA_EX]	= { smca_ex_mce_desc,	ARRAY_SIZE(smca_ex_mce_desc)	},
-	[SMCA_FP]	= { smca_fp_mce_desc,	ARRAY_SIZE(smca_fp_mce_desc)	},
-	[SMCA_L3_CACHE]	= { smca_l3_mce_desc,	ARRAY_SIZE(smca_l3_mce_desc)	},
-	[SMCA_CS]	= { smca_cs_mce_desc,	ARRAY_SIZE(smca_cs_mce_desc)	},
-	[SMCA_CS_V2]	= { smca_cs2_mce_desc,	ARRAY_SIZE(smca_cs2_mce_desc)	},
-	[SMCA_PIE]	= { smca_pie_mce_desc,	ARRAY_SIZE(smca_pie_mce_desc)	},
-	[SMCA_UMC]	= { smca_umc_mce_desc,	ARRAY_SIZE(smca_umc_mce_desc)	},
-	[SMCA_UMC_V2]	= { smca_umc2_mce_desc,	ARRAY_SIZE(smca_umc2_mce_desc)	},
-	[SMCA_PB]	= { smca_pb_mce_desc,	ARRAY_SIZE(smca_pb_mce_desc)	},
-	[SMCA_PSP]	= { smca_psp_mce_desc,	ARRAY_SIZE(smca_psp_mce_desc)	},
-	[SMCA_PSP_V2]	= { smca_psp2_mce_desc,	ARRAY_SIZE(smca_psp2_mce_desc)	},
-	[SMCA_SMU]	= { smca_smu_mce_desc,	ARRAY_SIZE(smca_smu_mce_desc)	},
-	[SMCA_SMU_V2]	= { smca_smu2_mce_desc,	ARRAY_SIZE(smca_smu2_mce_desc)	},
-	[SMCA_MP5]	= { smca_mp5_mce_desc,	ARRAY_SIZE(smca_mp5_mce_desc)	},
-	[SMCA_MPDMA]	= { smca_mpdma_mce_desc,	ARRAY_SIZE(smca_mpdma_mce_desc)	},
-	[SMCA_NBIO]	= { smca_nbio_mce_desc,	ARRAY_SIZE(smca_nbio_mce_desc)	},
-	[SMCA_PCIE]	= { smca_pcie_mce_desc,	ARRAY_SIZE(smca_pcie_mce_desc)	},
-	[SMCA_PCIE_V2]	= { smca_pcie2_mce_desc,   ARRAY_SIZE(smca_pcie2_mce_desc)	},
-	[SMCA_XGMI_PCS]	= { smca_xgmipcs_mce_desc, ARRAY_SIZE(smca_xgmipcs_mce_desc)	},
-	/* NBIF and SHUB have the same error descriptions, for now. */
-	[SMCA_NBIF]	= { smca_nbif_mce_desc, ARRAY_SIZE(smca_nbif_mce_desc)	},
-	[SMCA_SHUB]	= { smca_nbif_mce_desc, ARRAY_SIZE(smca_nbif_mce_desc)	},
-	[SMCA_SATA]	= { smca_sata_mce_desc, ARRAY_SIZE(smca_sata_mce_desc)	},
-	[SMCA_USB]	= { smca_usb_mce_desc,	ARRAY_SIZE(smca_usb_mce_desc)	},
-	[SMCA_GMI_PCS]	= { smca_gmipcs_mce_desc,  ARRAY_SIZE(smca_gmipcs_mce_desc)	},
-	/* All the PHY bank types have the same error descriptions, for now. */
-	[SMCA_XGMI_PHY]	= { smca_xgmiphy_mce_desc, ARRAY_SIZE(smca_xgmiphy_mce_desc)	},
-	[SMCA_WAFL_PHY]	= { smca_xgmiphy_mce_desc, ARRAY_SIZE(smca_xgmiphy_mce_desc)	},
-	[SMCA_GMI_PHY]	= { smca_xgmiphy_mce_desc, ARRAY_SIZE(smca_xgmiphy_mce_desc)	},
-};
-
 static bool f12h_mc0_mce(u16 ec, u8 xec)
 {
 	bool ret = false;
@@ -1220,10 +744,6 @@ static void decode_smca_error(struct mce *m)
 
 	pr_emerg(HW_ERR "%s Ext. Error Code: %d", smca_get_long_name(bank_type), xec);
 
-	/* Only print the decode of valid error codes */
-	if (xec < smca_mce_descs[bank_type].num_descs)
-		pr_cont(", %s.\n", smca_mce_descs[bank_type].descs[xec]);
-
 	if ((bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2) &&
 	    xec == 0 && decode_dram_ecc)
 		decode_dram_ecc(topology_die_id(m->extcpu), m);

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] Documentation: Begin a RAS section
  2023-11-28 14:20   ` [PATCH] Documentation: Begin a RAS section Borislav Petkov
@ 2023-11-28 15:23     ` Yazen Ghannam
  2023-11-28 17:04     ` Randy Dunlap
  2024-01-09 17:47     ` Jonathan Corbet
  2 siblings, 0 replies; 18+ messages in thread
From: Yazen Ghannam @ 2023-11-28 15:23 UTC (permalink / raw)
  To: Borislav Petkov, Tony Luck
  Cc: yazen.ghannam, Muralidhara M K, linux-edac, linux-kernel,
	Muralidhara M K, linux-doc

On 11/28/2023 9:20 AM, Borislav Petkov wrote:
> On Thu, Nov 02, 2023 at 11:42:22AM +0000, Muralidhara M K wrote:
>> From: Muralidhara M K <muralidhara.mk@amd.com>
>>
>> AMD systems with Scalable MCA, each machine check error of a SMCA bank
>> type has an associated bit position in the bank's control (CTL) register.
> 
> On top of this. It is long overdue:
> 
> ---
> From: "Borislav Petkov (AMD)" <bp@alien8.de>
> Date: Tue, 28 Nov 2023 14:37:56 +0100
> 
> Add some initial RAS documentation. The expectation is for this to
> collect all the user-visible features for interacting with the RAS
> features of the kernel.
> 
> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
> ---
>   Documentation/RAS/ras.rst | 26 ++++++++++++++++++++++++++
>   Documentation/index.rst   |  1 +
>   2 files changed, 27 insertions(+)
>   create mode 100644 Documentation/RAS/ras.rst
> 
> diff --git a/Documentation/RAS/ras.rst b/Documentation/RAS/ras.rst
> new file mode 100644
> index 000000000000..2556b397cd27
> --- /dev/null
> +++ b/Documentation/RAS/ras.rst
> @@ -0,0 +1,26 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +Reliability, Availability and Serviceability features
> +=====================================================
> +
> +This documents different aspects of the RAS functionality present in the
> +kernel.
> +
> +Error decoding
> +---------------
> +
> +* x86
> +
> +Error decoding on AMD systems should be done using the rasdaemon tool:
> +https://github.com/mchehab/rasdaemon/
> +
> +While the daemon is running, it automatically logs and decodes
> +errors. If not, one can still decode such errors by supplying the
> +hardware information from the error::
> +
> +        $ rasdaemon -p --status <STATUS> --ipid <IPID> --smca
> +
> +Also, the user can pass a particular family and model to decode the error
> +string::
> +
> +        $ rasdaemon -p --status <STATUS> --ipid <IPID> --smca --family <CPU Family> --model <CPU Model> --bank <BANK_NUM>
> diff --git a/Documentation/index.rst b/Documentation/index.rst
> index 9dfdc826618c..36e61783437c 100644
> --- a/Documentation/index.rst
> +++ b/Documentation/index.rst
> @@ -113,6 +113,7 @@ to ReStructured Text format, or are simply too old.
>      :maxdepth: 1
>   
>      staging/index
> +   RAS/ras
>   
>   
>   Translations

Thanks for starting this. I'll add some notes for the AMD Address 
Translation Library in the next revision.

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] Documentation: Begin a RAS section
  2023-11-28 14:20   ` [PATCH] Documentation: Begin a RAS section Borislav Petkov
  2023-11-28 15:23     ` Yazen Ghannam
@ 2023-11-28 17:04     ` Randy Dunlap
  2023-11-28 19:39       ` Borislav Petkov
  2024-01-09 17:47     ` Jonathan Corbet
  2 siblings, 1 reply; 18+ messages in thread
From: Randy Dunlap @ 2023-11-28 17:04 UTC (permalink / raw)
  To: Borislav Petkov, Tony Luck, Yazen Ghannam
  Cc: Muralidhara M K, linux-edac, linux-kernel, Muralidhara M K, linux-doc

Hi,

On 11/28/23 06:20, Borislav Petkov wrote:
> On Thu, Nov 02, 2023 at 11:42:22AM +0000, Muralidhara M K wrote:
>> From: Muralidhara M K <muralidhara.mk@amd.com>
>>
>> AMD systems with Scalable MCA, each machine check error of a SMCA bank
>> type has an associated bit position in the bank's control (CTL) register.
> 
> Ontop of this. It is long overdue:
> 
> ---
> From: "Borislav Petkov (AMD)" <bp@alien8.de>
> Date: Tue, 28 Nov 2023 14:37:56 +0100
> 
> Add some initial RAS documentation. The expectation is for this to
> collect all the user-visible features for interacting with the RAS
> features of the kernel.
> 

In general, does RAS include EDAC and MCE?

Thanks.

> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
> ---
>  Documentation/RAS/ras.rst | 26 ++++++++++++++++++++++++++
>  Documentation/index.rst   |  1 +
>  2 files changed, 27 insertions(+)
>  create mode 100644 Documentation/RAS/ras.rst
> 


-- 
~Randy

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] Documentation: Begin a RAS section
  2023-11-28 17:04     ` Randy Dunlap
@ 2023-11-28 19:39       ` Borislav Petkov
  0 siblings, 0 replies; 18+ messages in thread
From: Borislav Petkov @ 2023-11-28 19:39 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Tony Luck, Yazen Ghannam, Muralidhara M K, linux-edac,
	linux-kernel, Muralidhara M K, linux-doc

On Tue, Nov 28, 2023 at 09:04:22AM -0800, Randy Dunlap wrote:
> In general, does RAS include EDAC and MCE?

You can say that.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 0/4] Few cleanups and AMD Family 19h Models 90h-9fh EDAC Support
  2023-11-02 11:42 [PATCH v3 0/4] Few cleanups and AMD Family 19h Models 90h-9fh EDAC Support Muralidhara M K
                   ` (3 preceding siblings ...)
  2023-11-02 11:42 ` [PATCH v3 4/4] EDAC/amd64: Add support for family 0x19, models 0x90-9f devices Muralidhara M K
@ 2023-11-29 10:26 ` Borislav Petkov
  4 siblings, 0 replies; 18+ messages in thread
From: Borislav Petkov @ 2023-11-29 10:26 UTC (permalink / raw)
  To: Muralidhara M K; +Cc: linux-edac, linux-kernel, Muralidhara M K

On Thu, Nov 02, 2023 at 11:42:21AM +0000, Muralidhara M K wrote:
> Muralidhara M K (4):
>   EDAC/mce_amd: Remove SMCA Extended Error code descriptions
>   x86/MCE/AMD: Add new MA_LLC, USR_DP, and USR_CP bank types
>   EDAC/mc: Add support for HBM3 memory type
>   EDAC/amd64: Add support for family 0x19, models 0x90-9f devices

Applied, thanks.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [tip: ras/core] x86/MCE/AMD: Add new MA_LLC, USR_DP, and USR_CP bank types
  2023-11-02 11:42 ` [PATCH v3 2/4] x86/MCE/AMD: Add new MA_LLC, USR_DP, and USR_CP bank types Muralidhara M K
@ 2023-11-29 10:33   ` tip-bot2 for Muralidhara M K
  0 siblings, 0 replies; 18+ messages in thread
From: tip-bot2 for Muralidhara M K @ 2023-11-29 10:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Muralidhara M K, Borislav Petkov (AMD), x86, linux-kernel

The following commit has been merged into the ras/core branch of tip:

Commit-ID:     47b744ea5e3cf855087951a74ba9f89180fa1ba5
Gitweb:        https://git.kernel.org/tip/47b744ea5e3cf855087951a74ba9f89180fa1ba5
Author:        Muralidhara M K <muralidhara.mk@amd.com>
AuthorDate:    Thu, 02 Nov 2023 11:42:23 
Committer:     Borislav Petkov (AMD) <bp@alien8.de>
CommitterDate: Tue, 28 Nov 2023 16:26:55 +01:00

x86/MCE/AMD: Add new MA_LLC, USR_DP, and USR_CP bank types

Add HWID and McaType values for new SMCA bank types.

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231102114225.2006878-3-muralimk@amd.com
---
 arch/x86/include/asm/mce.h    | 3 +++
 arch/x86/kernel/cpu/mce/amd.c | 6 ++++++
 2 files changed, 9 insertions(+)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 4ad49af..de31183 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -311,6 +311,7 @@ enum smca_bank_types {
 	SMCA_PIE,	/* Power, Interrupts, etc. */
 	SMCA_UMC,	/* Unified Memory Controller */
 	SMCA_UMC_V2,
+	SMCA_MA_LLC,	/* Memory Attached Last Level Cache */
 	SMCA_PB,	/* Parameter Block */
 	SMCA_PSP,	/* Platform Security Processor */
 	SMCA_PSP_V2,
@@ -326,6 +327,8 @@ enum smca_bank_types {
 	SMCA_SHUB,	/* System HUB Unit */
 	SMCA_SATA,	/* SATA Unit */
 	SMCA_USB,	/* USB Unit */
+	SMCA_USR_DP,	/* Ultra Short Reach Data Plane Controller */
+	SMCA_USR_CP,	/* Ultra Short Reach Control Plane Controller */
 	SMCA_GMI_PCS,	/* GMI PCS Unit */
 	SMCA_XGMI_PHY,	/* xGMI PHY Unit */
 	SMCA_WAFL_PHY,	/* WAFL PHY Unit */
diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index f6c6c1e..2b46eb0 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -102,6 +102,7 @@ static const char * const smca_names[] = {
 	/* UMC v2 is separate because both of them can exist in a single system. */
 	[SMCA_UMC]			= "umc",
 	[SMCA_UMC_V2]			= "umc_v2",
+	[SMCA_MA_LLC]			= "ma_llc",
 	[SMCA_PB]			= "param_block",
 	[SMCA_PSP ... SMCA_PSP_V2]	= "psp",
 	[SMCA_SMU ... SMCA_SMU_V2]	= "smu",
@@ -114,6 +115,8 @@ static const char * const smca_names[] = {
 	[SMCA_SHUB]			= "shub",
 	[SMCA_SATA]			= "sata",
 	[SMCA_USB]			= "usb",
+	[SMCA_USR_DP]			= "usr_dp",
+	[SMCA_USR_CP]			= "usr_cp",
 	[SMCA_GMI_PCS]			= "gmi_pcs",
 	[SMCA_XGMI_PHY]			= "xgmi_phy",
 	[SMCA_WAFL_PHY]			= "wafl_phy",
@@ -164,6 +167,7 @@ static const struct smca_hwid smca_hwid_mcatypes[] = {
 	{ SMCA_CS,	 HWID_MCATYPE(0x2E, 0x0)	},
 	{ SMCA_PIE,	 HWID_MCATYPE(0x2E, 0x1)	},
 	{ SMCA_CS_V2,	 HWID_MCATYPE(0x2E, 0x2)	},
+	{ SMCA_MA_LLC,	 HWID_MCATYPE(0x2E, 0x4)	},
 
 	/* Unified Memory Controller MCA type */
 	{ SMCA_UMC,	 HWID_MCATYPE(0x96, 0x0)	},
@@ -198,6 +202,8 @@ static const struct smca_hwid smca_hwid_mcatypes[] = {
 	{ SMCA_SHUB,	 HWID_MCATYPE(0x80, 0x0)	},
 	{ SMCA_SATA,	 HWID_MCATYPE(0xA8, 0x0)	},
 	{ SMCA_USB,	 HWID_MCATYPE(0xAA, 0x0)	},
+	{ SMCA_USR_DP,	 HWID_MCATYPE(0x170, 0x0)	},
+	{ SMCA_USR_CP,	 HWID_MCATYPE(0x180, 0x0)	},
 	{ SMCA_GMI_PCS,  HWID_MCATYPE(0x241, 0x0)	},
 	{ SMCA_XGMI_PHY, HWID_MCATYPE(0x259, 0x0)	},
 	{ SMCA_WAFL_PHY, HWID_MCATYPE(0x267, 0x0)	},
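
For readers following the table above, here is a minimal sketch of how a (HWID, McaType) pair could be matched against the three new entries. It is not the kernel's code. The packing used by this local HWID_MCATYPE() macro and the MCA_IPID bit positions (HWID in bits 43:32, McaType in bits 63:48) are assumptions for illustration. The names struct bank_entry, new_banks, ipid_to_key() and bank_name_from_ipid() are hypothetical. Only the HWID/McaType values and the name strings come from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed packing: HWID in the upper 16 bits, McaType in the lower 16. */
#define HWID_MCATYPE(hwid, mcatype) (((uint32_t)(hwid) << 16) | (mcatype))

/* Hypothetical, simplified stand-in for one row of the lookup table. */
struct bank_entry {
	const char *name;
	uint32_t hwid_mcatype;
};

/* Only the three bank types added by this patch, values taken from the diff. */
static const struct bank_entry new_banks[] = {
	{ "ma_llc", HWID_MCATYPE(0x2E,  0x4) },
	{ "usr_dp", HWID_MCATYPE(0x170, 0x0) },
	{ "usr_cp", HWID_MCATYPE(0x180, 0x0) },
};

/*
 * Build the lookup key from a raw IPID value.
 * Assumption: HWID lives in bits 43:32 and McaType in bits 63:48.
 */
static uint32_t ipid_to_key(uint64_t ipid)
{
	uint32_t hwid    = (uint32_t)(ipid >> 32) & 0xFFF;
	uint32_t mcatype = (uint32_t)(ipid >> 48) & 0xFFFF;

	return HWID_MCATYPE(hwid, mcatype);
}

/* Linear search; returns NULL when the pair is not one of the new types. */
static const char *bank_name_from_ipid(uint64_t ipid)
{
	uint32_t key = ipid_to_key(ipid);

	for (size_t i = 0; i < sizeof(new_banks) / sizeof(new_banks[0]); i++)
		if (new_banks[i].hwid_mcatype == key)
			return new_banks[i].name;

	return NULL;
}
```

Under these assumptions, bank_name_from_ipid(0x0004002E00000000ULL) resolves to the "ma_llc" entry, while an IPID carrying HWID 0x96 (UMC, which is not in this reduced table) yields NULL.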

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] Documentation: Begin a RAS section
  2023-11-28 14:20   ` [PATCH] Documentation: Begin a RAS section Borislav Petkov
  2023-11-28 15:23     ` Yazen Ghannam
  2023-11-28 17:04     ` Randy Dunlap
@ 2024-01-09 17:47     ` Jonathan Corbet
  2024-01-09 18:36       ` Borislav Petkov
  2 siblings, 1 reply; 18+ messages in thread
From: Jonathan Corbet @ 2024-01-09 17:47 UTC (permalink / raw)
  To: Borislav Petkov, Tony Luck, Yazen Ghannam
  Cc: Muralidhara M K, linux-edac, linux-kernel, Muralidhara M K, linux-doc

Borislav Petkov <bp@alien8.de> writes:

> On Thu, Nov 02, 2023 at 11:42:22AM +0000, Muralidhara M K wrote:
>> From: Muralidhara M K <muralidhara.mk@amd.com>
>> 
>> AMD systems with Scalable MCA, each machine check error of a SMCA bank
>> type has an associated bit position in the bank's control (CTL) register.
>
> Ontop of this. It is long overdue:
>
> ---
> From: "Borislav Petkov (AMD)" <bp@alien8.de>
> Date: Tue, 28 Nov 2023 14:37:56 +0100
>
> Add some initial RAS documentation. The expectation is for this to
> collect all the user-visible features for interacting with the RAS
> features of the kernel.
>
> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
> ---
>  Documentation/RAS/ras.rst | 26 ++++++++++++++++++++++++++
>  Documentation/index.rst   |  1 +
>  2 files changed, 27 insertions(+)
>  create mode 100644 Documentation/RAS/ras.rst

I wish I'd been copied on this ... I've been working to get a handle on
the top-level Documentation/ directories for a while, and would rather
not see a new one added for this.  Offhand, based on this first
document, it looks like material that belongs under
Documentation/admin-guide; can we move it there, please?

Thanks,

jon

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] Documentation: Begin a RAS section
  2024-01-09 17:47     ` Jonathan Corbet
@ 2024-01-09 18:36       ` Borislav Petkov
  2024-01-09 19:44         ` Jonathan Corbet
  0 siblings, 1 reply; 18+ messages in thread
From: Borislav Petkov @ 2024-01-09 18:36 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Tony Luck, Yazen Ghannam, Muralidhara M K, linux-edac,
	linux-kernel, Muralidhara M K, linux-doc

On Tue, Jan 09, 2024 at 10:47:29AM -0700, Jonathan Corbet wrote:
> I wish I'd been copied on this ... 

linux-doc was CCed:

https://lore.kernel.org/all/20231128142049.GTZWX3QQTSaQk%2F+u53@fat_crate.local/

Or would you have preferred to be CCed directly?

> I've been working to get a handle on
> the top-level Documentation/ directories for a while, and would rather
> not see a new one added for this.  Offhand, based on this first
> document, it looks like material that belongs under
> Documentation/admin-guide; can we move it there, please?

Not really an admin guide thing - yes, based on the current content but
actually, the aim for this is to document all things RAS, so it is more
likely a subsystem thing. And all the subsystems are directories under
Documentation/.

So where do you want me to put it?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] Documentation: Begin a RAS section
  2024-01-09 18:36       ` Borislav Petkov
@ 2024-01-09 19:44         ` Jonathan Corbet
  2024-01-09 20:04           ` Borislav Petkov
  0 siblings, 1 reply; 18+ messages in thread
From: Jonathan Corbet @ 2024-01-09 19:44 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tony Luck, Yazen Ghannam, Muralidhara M K, linux-edac,
	linux-kernel, Muralidhara M K, linux-doc

Borislav Petkov <bp@alien8.de> writes:

> On Tue, Jan 09, 2024 at 10:47:29AM -0700, Jonathan Corbet wrote:
>> I wish I'd been copied on this ... 
>
> linux-doc was CCed:
>
> https://lore.kernel.org/all/20231128142049.GTZWX3QQTSaQk%2F+u53@fat_crate.local/
>
> Or would you have preferred to be CCed directly?

Lots of stuff goes to linux-doc, I can miss things.

Of course, I miss things in my own email too...you know the drill...

>> I've been working to get a handle on
>> the top-level Documentation/ directories for a while, and would rather
>> not see a new one added for this.  Offhand, based on this first
>> document, it looks like material that belongs under
>> Documentation/admin-guide; can we move it there, please?
>
> Not really an admin guide thing - yes, based on the current content but
> actually, the aim for this is to document all things RAS, so it is more
> likely a subsystem thing. And all the subsystems are directories under
> Documentation/.
>
> So where do you want me to put it?

The hope with all of this documentation thrashing has been to organize
our docs with the *reader* in mind.  "All things RAS" is convenient for
RAS developers, but not for (say) a sysadmin trying to figure out how to
make use of it.  So I would really rather see RAS documentation placed
under admin-guide or userspace-api as appropriate.

Yes, there is a lot of existing documentation that still doesn't live up
to this idea, but we can try to follow it for new stuff while the rest
is (slowly) fixed up.

Make sense?

Thanks,

jon

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] Documentation: Begin a RAS section
  2024-01-09 19:44         ` Jonathan Corbet
@ 2024-01-09 20:04           ` Borislav Petkov
  2024-01-24 12:40             ` Borislav Petkov
  0 siblings, 1 reply; 18+ messages in thread
From: Borislav Petkov @ 2024-01-09 20:04 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Tony Luck, Yazen Ghannam, Muralidhara M K, linux-edac,
	linux-kernel, Muralidhara M K, linux-doc

On Tue, Jan 09, 2024 at 12:44:41PM -0700, Jonathan Corbet wrote:
> Of course, I miss things in my own email too...you know the drill...

Yeah, tell me about it.

My train of thought with CCing maintainers in such cases usually is: I'd
CC the mailing list as I don't want to bother the maintainer - she/he gets
too much email anyway and this is an FYI thing anyway so she/he'll find
it in the archives eventually.

> Yes, there is a lot of existing documentation that still doesn't live up
> to this idea, but we can try to follow it for new stuff while the rest
> is (slowly) fixed up.

The problem I see here is that not all of the RAS stuff will be
"admin-guide" stuff; some of it will describe design decisions we've
made. I mean, for a really curious admin it'll be right up her/his
alley, but it won't be purely descriptions of administrative tasks.

At the end of the day, I don't really care where it is as long as it is
in one place and we can point people to it and say, here, that's why we
did it the way we did it and what you can do about it.

So I'm fine with admin-guide too - just pointing out a potential issue
I see.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] Documentation: Begin a RAS section
  2024-01-09 20:04           ` Borislav Petkov
@ 2024-01-24 12:40             ` Borislav Petkov
  2024-02-05 19:41               ` Borislav Petkov
  0 siblings, 1 reply; 18+ messages in thread
From: Borislav Petkov @ 2024-01-24 12:40 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Tony Luck, Yazen Ghannam, Muralidhara M K, linux-edac,
	linux-kernel, Muralidhara M K, linux-doc

On Tue, Jan 09, 2024 at 09:04:34PM +0100, Borislav Petkov wrote:
> So I'm fine with admin-guide too - just pointing out a potential issue
> I see.

Ok, how does this look?

I've merged it with ras.rst which we had there already and with some
more new documentation that is coming from:

https://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git/log/?h=edac-amd-atl

Thx.

---
From: "Borislav Petkov (AMD)" <bp@alien8.de>
Date: Wed, 24 Jan 2024 13:37:52 +0100
Subject: [PATCH] Documentation: Move RAS section to admin-guide

This is where this stuff should be.

Requested-by: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
---
 Documentation/RAS/index.rst                        | 14 --------------
 .../{ => admin-guide}/RAS/address-translation.rst  |  0
 .../{ => admin-guide}/RAS/error-decoding.rst       |  0
 Documentation/admin-guide/RAS/index.rst            |  7 +++++++
 .../admin-guide/{ras.rst => RAS/main.rst}          | 10 +++++++---
 Documentation/admin-guide/index.rst                |  2 +-
 Documentation/index.rst                            |  1 -
 7 files changed, 15 insertions(+), 19 deletions(-)
 delete mode 100644 Documentation/RAS/index.rst
 rename Documentation/{ => admin-guide}/RAS/address-translation.rst (100%)
 rename Documentation/{ => admin-guide}/RAS/error-decoding.rst (100%)
 create mode 100644 Documentation/admin-guide/RAS/index.rst
 rename Documentation/admin-guide/{ras.rst => RAS/main.rst} (99%)

diff --git a/Documentation/RAS/index.rst b/Documentation/RAS/index.rst
deleted file mode 100644
index 2794c1816e90..000000000000
--- a/Documentation/RAS/index.rst
+++ /dev/null
@@ -1,14 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-===========================================================
-Reliability, Availability and Serviceability (RAS) features
-===========================================================
-
-This documents different aspects of the RAS functionality present in the
-kernel.
-
-.. toctree::
-   :maxdepth: 2
-
-   error-decoding
-   address-translation
diff --git a/Documentation/RAS/address-translation.rst b/Documentation/admin-guide/RAS/address-translation.rst
similarity index 100%
rename from Documentation/RAS/address-translation.rst
rename to Documentation/admin-guide/RAS/address-translation.rst
diff --git a/Documentation/RAS/error-decoding.rst b/Documentation/admin-guide/RAS/error-decoding.rst
similarity index 100%
rename from Documentation/RAS/error-decoding.rst
rename to Documentation/admin-guide/RAS/error-decoding.rst
diff --git a/Documentation/admin-guide/RAS/index.rst b/Documentation/admin-guide/RAS/index.rst
new file mode 100644
index 000000000000..f4087040a7c0
--- /dev/null
+++ b/Documentation/admin-guide/RAS/index.rst
@@ -0,0 +1,7 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. toctree::
+   :maxdepth: 2
+
+   main
+   error-decoding
+   address-translation
diff --git a/Documentation/admin-guide/ras.rst b/Documentation/admin-guide/RAS/main.rst
similarity index 99%
rename from Documentation/admin-guide/ras.rst
rename to Documentation/admin-guide/RAS/main.rst
index 8e03751d126d..7ac1d4ccc509 100644
--- a/Documentation/admin-guide/ras.rst
+++ b/Documentation/admin-guide/RAS/main.rst
@@ -1,8 +1,12 @@
+.. SPDX-License-Identifier: GPL-2.0
 .. include:: <isonum.txt>
 
-============================================
-Reliability, Availability and Serviceability
-============================================
+==================================================
+Reliability, Availability and Serviceability (RAS)
+==================================================
+
+This documents different aspects of the RAS functionality present in the
+kernel.
 
 RAS concepts
 ************
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index fb40a1f6f79e..dfc06fab9432 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -122,7 +122,7 @@ configure specific aspects of kernel behavior to your liking.
    pmf
    pnp
    rapidio
-   ras
+   RAS/index
    rtc
    serial-console
    svga
diff --git a/Documentation/index.rst b/Documentation/index.rst
index 07f2aa07f0fa..9dfdc826618c 100644
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -113,7 +113,6 @@ to ReStructured Text format, or are simply too old.
    :maxdepth: 1
 
    staging/index
-   RAS/index
 
 
 Translations
-- 
2.43.0

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] Documentation: Begin a RAS section
  2024-01-24 12:40             ` Borislav Petkov
@ 2024-02-05 19:41               ` Borislav Petkov
  0 siblings, 0 replies; 18+ messages in thread
From: Borislav Petkov @ 2024-02-05 19:41 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Tony Luck, Yazen Ghannam, Muralidhara M K, linux-edac,
	linux-kernel, Muralidhara M K, linux-doc

On Wed, Jan 24, 2024 at 01:40:30PM +0100, Borislav Petkov wrote:
> From: "Borislav Petkov (AMD)" <bp@alien8.de>
> Date: Wed, 24 Jan 2024 13:37:52 +0100
> Subject: [PATCH] Documentation: Move RAS section to admin-guide
> 
> This is where this stuff should be.
> 
> Requested-by: Jonathan Corbet <corbet@lwn.net>
> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
> ---
>  Documentation/RAS/index.rst                        | 14 --------------
>  .../{ => admin-guide}/RAS/address-translation.rst  |  0
>  .../{ => admin-guide}/RAS/error-decoding.rst       |  0
>  Documentation/admin-guide/RAS/index.rst            |  7 +++++++
>  .../admin-guide/{ras.rst => RAS/main.rst}          | 10 +++++++---
>  Documentation/admin-guide/index.rst                |  2 +-
>  Documentation/index.rst                            |  1 -
>  7 files changed, 15 insertions(+), 19 deletions(-)
>  delete mode 100644 Documentation/RAS/index.rst
>  rename Documentation/{ => admin-guide}/RAS/address-translation.rst (100%)
>  rename Documentation/{ => admin-guide}/RAS/error-decoding.rst (100%)
>  create mode 100644 Documentation/admin-guide/RAS/index.rst
>  rename Documentation/admin-guide/{ras.rst => RAS/main.rst} (99%)

Now queued.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2024-02-05 19:41 UTC | newest]

Thread overview: 18+ messages
2023-11-02 11:42 [PATCH v3 0/4] Few cleanups and AMD Family 19h Models 90h-9fh EDAC Support Muralidhara M K
2023-11-02 11:42 ` [PATCH v3 1/4] EDAC/mce_amd: Remove SMCA Extended Error code descriptions Muralidhara M K
2023-11-28 14:20   ` [PATCH] Documentation: Begin a RAS section Borislav Petkov
2023-11-28 15:23     ` Yazen Ghannam
2023-11-28 17:04     ` Randy Dunlap
2023-11-28 19:39       ` Borislav Petkov
2024-01-09 17:47     ` Jonathan Corbet
2024-01-09 18:36       ` Borislav Petkov
2024-01-09 19:44         ` Jonathan Corbet
2024-01-09 20:04           ` Borislav Petkov
2024-01-24 12:40             ` Borislav Petkov
2024-02-05 19:41               ` Borislav Petkov
2023-11-28 14:24   ` [tip: ras/core] EDAC/mce_amd: Remove SMCA Extended Error code descriptions tip-bot2 for Muralidhara M K
2023-11-02 11:42 ` [PATCH v3 2/4] x86/MCE/AMD: Add new MA_LLC, USR_DP, and USR_CP bank types Muralidhara M K
2023-11-29 10:33   ` [tip: ras/core] " tip-bot2 for Muralidhara M K
2023-11-02 11:42 ` [PATCH v3 3/4] EDAC/mc: Add support for HBM3 memory type Muralidhara M K
2023-11-02 11:42 ` [PATCH v3 4/4] EDAC/amd64: Add support for family 0x19, models 0x90-9f devices Muralidhara M K
2023-11-29 10:26 ` [PATCH v3 0/4] Few cleanups and AMD Family 19h Models 90h-9fh EDAC Support Borislav Petkov
