Linux-EDAC Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v3 0/8] AMD64 EDAC fixes
@ 2019-08-21 23:59 Ghannam, Yazen
  2019-08-21 23:59 ` [PATCH v3 1/8] EDAC/amd64: Support more than two controllers for chip selects handling Ghannam, Yazen
                   ` (11 more replies)
  0 siblings, 12 replies; 31+ messages in thread
From: Ghannam, Yazen @ 2019-08-21 23:59 UTC (permalink / raw)
  To: linux-edac; +Cc: Ghannam, Yazen, linux-kernel, bp

From: Yazen Ghannam <yazen.ghannam@amd.com>

Hi Boris,

This set contains a few fixes for some changes merged in v5.2. There
are also a couple of fixes for older issues. In addition, there are a
couple of patches to add support for Asymmetric Dual-Rank DIMMs.

I don't have the failing config readily available that you used, but I
believe I found the issue. Please let me know how it goes.

I've also added RFC patches to avoid the "ECC disabled" message for
nodes without memory. I haven't fully tested these, but I wanted to get
your thoughts. Here's an earlier discussion:
https://lkml.kernel.org/r/20180321191335.7832-1-Yazen.Ghannam@amd.com

Thanks,
Yazen

Link:
https://lkml.kernel.org/r/20190709215643.171078-1-Yazen.Ghannam@amd.com

v2->v3:
* Drop Fixes: tags in patch 1.
* Add detection of x8 DRAM devices in patches 2 and 3.
* Fix Chip Select size printing in patch 4.
* Added RFC patches to avoid "ECC disabled" message for nodes without memory.

v1->v2:
* Squash patches 1 and 2 together.


Yazen Ghannam (10):
  EDAC/amd64: Support more than two controllers for chip selects
    handling
  EDAC/amd64: Recognize DRAM device type with EDAC_CTL_CAP
  EDAC/amd64: Initialize DIMM info for systems with more than two
    channels
  EDAC/amd64: Find Chip Select memory size using Address Mask
  EDAC/amd64: Decode syndrome before translating address
  EDAC/amd64: Cache secondary Chip Select registers
  EDAC/amd64: Support Asymmetric Dual-Rank DIMMs
  EDAC/amd64: Gather hardware information early
  EDAC/amd64: Use cached data when checking for ECC
  EDAC/amd64: Check for memory before fully initializing an instance

 drivers/edac/amd64_edac.c | 371 +++++++++++++++++++++++++-------------
 drivers/edac/amd64_edac.h |   9 +-
 2 files changed, 251 insertions(+), 129 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v3 1/8] EDAC/amd64: Support more than two controllers for chip selects handling
  2019-08-21 23:59 [PATCH v3 0/8] AMD64 EDAC fixes Ghannam, Yazen
@ 2019-08-21 23:59 ` Ghannam, Yazen
  2019-08-21 23:59 ` [PATCH v3 2/8] EDAC/amd64: Recognize DRAM device type with EDAC_CTL_CAP Ghannam, Yazen
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 31+ messages in thread
From: Ghannam, Yazen @ 2019-08-21 23:59 UTC (permalink / raw)
  To: linux-edac; +Cc: Ghannam, Yazen, linux-kernel, bp

From: Yazen Ghannam <yazen.ghannam@amd.com>

The struct chip_select array that's used for saving chip select bases
and masks is fixed at length of two. There should be one struct
chip_select for each controller, so this array should be increased to
support systems that may have more than two controllers.

Increase the size of the struct chip_select array to eight, which is the
largest number of controllers per die currently supported on AMD
systems.

Fix number of DIMMs and Chip Select bases/masks on Family17h, because AMD
Family 17h systems support 2 DIMMs, 4 CS bases, and 2 CS masks per
channel.

Also, carve out the Family 17h+ reading of the bases/masks into a
separate function. This effectively reverts the original bases/masks
reading code to before Family 17h support was added.

This is a second version of a commit that was reverted.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20190709215643.171078-2-Yazen.Ghannam@amd.com

v2->v3:
* Drop Fixes: tags.

v1->v2:
* Patches 1 and 2 squashed together.

 drivers/edac/amd64_edac.c | 123 +++++++++++++++++++++-----------------
 drivers/edac/amd64_edac.h |   5 +-
 2 files changed, 71 insertions(+), 57 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 873437be86d9..dd60cf5a3d96 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -810,7 +810,7 @@ static void debug_display_dimm_sizes_df(struct amd64_pvt *pvt, u8 ctrl)
 
 	edac_printk(KERN_DEBUG, EDAC_MC, "UMC%d chip selects:\n", ctrl);
 
-	for (dimm = 0; dimm < 4; dimm++) {
+	for (dimm = 0; dimm < 2; dimm++) {
 		size0 = 0;
 		cs0 = dimm * 2;
 
@@ -942,89 +942,102 @@ static void prep_chip_selects(struct amd64_pvt *pvt)
 	} else if (pvt->fam == 0x15 && pvt->model == 0x30) {
 		pvt->csels[0].b_cnt = pvt->csels[1].b_cnt = 4;
 		pvt->csels[0].m_cnt = pvt->csels[1].m_cnt = 2;
+	} else if (pvt->fam >= 0x17) {
+		int umc;
+
+		for_each_umc(umc) {
+			pvt->csels[umc].b_cnt = 4;
+			pvt->csels[umc].m_cnt = 2;
+		}
+
 	} else {
 		pvt->csels[0].b_cnt = pvt->csels[1].b_cnt = 8;
 		pvt->csels[0].m_cnt = pvt->csels[1].m_cnt = 4;
 	}
 }
 
+static void read_umc_base_mask(struct amd64_pvt *pvt)
+{
+	u32 umc_base_reg, umc_mask_reg;
+	u32 base_reg, mask_reg;
+	u32 *base, *mask;
+	int cs, umc;
+
+	for_each_umc(umc) {
+		umc_base_reg = get_umc_base(umc) + UMCCH_BASE_ADDR;
+
+		for_each_chip_select(cs, umc, pvt) {
+			base = &pvt->csels[umc].csbases[cs];
+
+			base_reg = umc_base_reg + (cs * 4);
+
+			if (!amd_smn_read(pvt->mc_node_id, base_reg, base))
+				edac_dbg(0, "  DCSB%d[%d]=0x%08x reg: 0x%x\n",
+					 umc, cs, *base, base_reg);
+		}
+
+		umc_mask_reg = get_umc_base(umc) + UMCCH_ADDR_MASK;
+
+		for_each_chip_select_mask(cs, umc, pvt) {
+			mask = &pvt->csels[umc].csmasks[cs];
+
+			mask_reg = umc_mask_reg + (cs * 4);
+
+			if (!amd_smn_read(pvt->mc_node_id, mask_reg, mask))
+				edac_dbg(0, "  DCSM%d[%d]=0x%08x reg: 0x%x\n",
+					 umc, cs, *mask, mask_reg);
+		}
+	}
+}
+
 /*
  * Function 2 Offset F10_DCSB0; read in the DCS Base and DCS Mask registers
  */
 static void read_dct_base_mask(struct amd64_pvt *pvt)
 {
-	int base_reg0, base_reg1, mask_reg0, mask_reg1, cs;
+	int cs;
 
 	prep_chip_selects(pvt);
 
-	if (pvt->umc) {
-		base_reg0 = get_umc_base(0) + UMCCH_BASE_ADDR;
-		base_reg1 = get_umc_base(1) + UMCCH_BASE_ADDR;
-		mask_reg0 = get_umc_base(0) + UMCCH_ADDR_MASK;
-		mask_reg1 = get_umc_base(1) + UMCCH_ADDR_MASK;
-	} else {
-		base_reg0 = DCSB0;
-		base_reg1 = DCSB1;
-		mask_reg0 = DCSM0;
-		mask_reg1 = DCSM1;
-	}
+	if (pvt->umc)
+		return read_umc_base_mask(pvt);
 
 	for_each_chip_select(cs, 0, pvt) {
-		int reg0   = base_reg0 + (cs * 4);
-		int reg1   = base_reg1 + (cs * 4);
+		int reg0   = DCSB0 + (cs * 4);
+		int reg1   = DCSB1 + (cs * 4);
 		u32 *base0 = &pvt->csels[0].csbases[cs];
 		u32 *base1 = &pvt->csels[1].csbases[cs];
 
-		if (pvt->umc) {
-			if (!amd_smn_read(pvt->mc_node_id, reg0, base0))
-				edac_dbg(0, "  DCSB0[%d]=0x%08x reg: 0x%x\n",
-					 cs, *base0, reg0);
-
-			if (!amd_smn_read(pvt->mc_node_id, reg1, base1))
-				edac_dbg(0, "  DCSB1[%d]=0x%08x reg: 0x%x\n",
-					 cs, *base1, reg1);
-		} else {
-			if (!amd64_read_dct_pci_cfg(pvt, 0, reg0, base0))
-				edac_dbg(0, "  DCSB0[%d]=0x%08x reg: F2x%x\n",
-					 cs, *base0, reg0);
+		if (!amd64_read_dct_pci_cfg(pvt, 0, reg0, base0))
+			edac_dbg(0, "  DCSB0[%d]=0x%08x reg: F2x%x\n",
+				 cs, *base0, reg0);
 
-			if (pvt->fam == 0xf)
-				continue;
+		if (pvt->fam == 0xf)
+			continue;
 
-			if (!amd64_read_dct_pci_cfg(pvt, 1, reg0, base1))
-				edac_dbg(0, "  DCSB1[%d]=0x%08x reg: F2x%x\n",
-					 cs, *base1, (pvt->fam == 0x10) ? reg1
-								: reg0);
-		}
+		if (!amd64_read_dct_pci_cfg(pvt, 1, reg0, base1))
+			edac_dbg(0, "  DCSB1[%d]=0x%08x reg: F2x%x\n",
+				 cs, *base1, (pvt->fam == 0x10) ? reg1
+							: reg0);
 	}
 
 	for_each_chip_select_mask(cs, 0, pvt) {
-		int reg0   = mask_reg0 + (cs * 4);
-		int reg1   = mask_reg1 + (cs * 4);
+		int reg0   = DCSM0 + (cs * 4);
+		int reg1   = DCSM1 + (cs * 4);
 		u32 *mask0 = &pvt->csels[0].csmasks[cs];
 		u32 *mask1 = &pvt->csels[1].csmasks[cs];
 
-		if (pvt->umc) {
-			if (!amd_smn_read(pvt->mc_node_id, reg0, mask0))
-				edac_dbg(0, "    DCSM0[%d]=0x%08x reg: 0x%x\n",
-					 cs, *mask0, reg0);
-
-			if (!amd_smn_read(pvt->mc_node_id, reg1, mask1))
-				edac_dbg(0, "    DCSM1[%d]=0x%08x reg: 0x%x\n",
-					 cs, *mask1, reg1);
-		} else {
-			if (!amd64_read_dct_pci_cfg(pvt, 0, reg0, mask0))
-				edac_dbg(0, "    DCSM0[%d]=0x%08x reg: F2x%x\n",
-					 cs, *mask0, reg0);
+		if (!amd64_read_dct_pci_cfg(pvt, 0, reg0, mask0))
+			edac_dbg(0, "    DCSM0[%d]=0x%08x reg: F2x%x\n",
+				 cs, *mask0, reg0);
 
-			if (pvt->fam == 0xf)
-				continue;
+		if (pvt->fam == 0xf)
+			continue;
 
-			if (!amd64_read_dct_pci_cfg(pvt, 1, reg0, mask1))
-				edac_dbg(0, "    DCSM1[%d]=0x%08x reg: F2x%x\n",
-					 cs, *mask1, (pvt->fam == 0x10) ? reg1
-								: reg0);
-		}
+		if (!amd64_read_dct_pci_cfg(pvt, 1, reg0, mask1))
+			edac_dbg(0, "    DCSM1[%d]=0x%08x reg: F2x%x\n",
+				 cs, *mask1, (pvt->fam == 0x10) ? reg1
+							: reg0);
 	}
 }
 
diff --git a/drivers/edac/amd64_edac.h b/drivers/edac/amd64_edac.h
index 8f66472f7adc..4dce6a2ac75f 100644
--- a/drivers/edac/amd64_edac.h
+++ b/drivers/edac/amd64_edac.h
@@ -96,6 +96,7 @@
 /* Hardware limit on ChipSelect rows per MC and processors per system */
 #define NUM_CHIPSELECTS			8
 #define DRAM_RANGES			8
+#define NUM_CONTROLLERS			8
 
 #define ON true
 #define OFF false
@@ -351,8 +352,8 @@ struct amd64_pvt {
 	u32 dbam0;		/* DRAM Base Address Mapping reg for DCT0 */
 	u32 dbam1;		/* DRAM Base Address Mapping reg for DCT1 */
 
-	/* one for each DCT */
-	struct chip_select csels[2];
+	/* one for each DCT/UMC */
+	struct chip_select csels[NUM_CONTROLLERS];
 
 	/* DRAM base and limit pairs F1x[78,70,68,60,58,50,48,40] */
 	struct dram_range ranges[DRAM_RANGES];
-- 
2.17.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v3 2/8] EDAC/amd64: Recognize DRAM device type with EDAC_CTL_CAP
  2019-08-21 23:59 [PATCH v3 0/8] AMD64 EDAC fixes Ghannam, Yazen
  2019-08-21 23:59 ` [PATCH v3 1/8] EDAC/amd64: Support more than two controllers for chip selects handling Ghannam, Yazen
@ 2019-08-21 23:59 ` Ghannam, Yazen
  2019-08-21 23:59 ` [PATCH v3 3/8] EDAC/amd64: Initialize DIMM info for systems with more than two channels Ghannam, Yazen
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 31+ messages in thread
From: Ghannam, Yazen @ 2019-08-21 23:59 UTC (permalink / raw)
  To: linux-edac; +Cc: Ghannam, Yazen, linux-kernel, bp

From: Yazen Ghannam <yazen.ghannam@amd.com>

AMD Family 17h systems support x4 and x16 DRAM devices. However, the
device type is not checked when setting EDAC_CTL_CAP.

Set the appropriate EDAC_CTL_CAP flag based on the device type.

Default to x8 DRAM device when neither the x4 or x16 bits are set.

Fixes: 2d09d8f301f5 ("EDAC, amd64: Determine EDAC MC capabilities on Fam17h")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20190709215643.171078-3-Yazen.Ghannam@amd.com

v2->v3:
* Add case for x8 DRAM devices.

v1->v2:
* No change.

 drivers/edac/amd64_edac.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index dd60cf5a3d96..0e8b2137edbb 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3150,12 +3150,15 @@ static bool ecc_enabled(struct pci_dev *F3, u16 nid)
 static inline void
 f17h_determine_edac_ctl_cap(struct mem_ctl_info *mci, struct amd64_pvt *pvt)
 {
-	u8 i, ecc_en = 1, cpk_en = 1;
+	u8 i, ecc_en = 1, cpk_en = 1, dev_x4 = 1, dev_x16 = 1;
 
 	for_each_umc(i) {
 		if (pvt->umc[i].sdp_ctrl & UMC_SDP_INIT) {
 			ecc_en &= !!(pvt->umc[i].umc_cap_hi & UMC_ECC_ENABLED);
 			cpk_en &= !!(pvt->umc[i].umc_cap_hi & UMC_ECC_CHIPKILL_CAP);
+
+			dev_x4 &= !!(pvt->umc[i].dimm_cfg & BIT(6));
+			dev_x16 &= !!(pvt->umc[i].dimm_cfg & BIT(7));
 		}
 	}
 
@@ -3163,8 +3166,14 @@ f17h_determine_edac_ctl_cap(struct mem_ctl_info *mci, struct amd64_pvt *pvt)
 	if (ecc_en) {
 		mci->edac_ctl_cap |= EDAC_FLAG_SECDED;
 
-		if (cpk_en)
-			mci->edac_ctl_cap |= EDAC_FLAG_S4ECD4ED;
+		if (cpk_en) {
+			if (dev_x4)
+				mci->edac_ctl_cap |= EDAC_FLAG_S4ECD4ED;
+			else if (dev_x16)
+				mci->edac_ctl_cap |= EDAC_FLAG_S16ECD16ED;
+			else
+				mci->edac_ctl_cap |= EDAC_FLAG_S8ECD8ED;
+		}
 	}
 }
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v3 3/8] EDAC/amd64: Initialize DIMM info for systems with more than two channels
  2019-08-21 23:59 [PATCH v3 0/8] AMD64 EDAC fixes Ghannam, Yazen
  2019-08-21 23:59 ` [PATCH v3 1/8] EDAC/amd64: Support more than two controllers for chip selects handling Ghannam, Yazen
  2019-08-21 23:59 ` [PATCH v3 2/8] EDAC/amd64: Recognize DRAM device type with EDAC_CTL_CAP Ghannam, Yazen
@ 2019-08-21 23:59 ` Ghannam, Yazen
  2019-08-21 23:59 ` [PATCH v3 4/8] EDAC/amd64: Find Chip Select memory size using Address Mask Ghannam, Yazen
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 31+ messages in thread
From: Ghannam, Yazen @ 2019-08-21 23:59 UTC (permalink / raw)
  To: linux-edac; +Cc: Ghannam, Yazen, linux-kernel, bp

From: Yazen Ghannam <yazen.ghannam@amd.com>

Currently, the DIMM info for AMD Family 17h systems is initialized in
init_csrows(). This function is shared with legacy systems, and it has a
limit of two channel support.

This prevents initialization of the DIMM info for a number of ranks, so
there will be missing ranks in the EDAC sysfs.

Create a new init_csrows_df() for Family17h+ and revert init_csrows()
back to pre-Family17h support.

Loop over all channels in the new function in order to support systems
with more than two channels.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20190709215643.171078-4-Yazen.Ghannam@amd.com

v2->v3:
* Drop Fixes: tag.
* Add x8 DRAM device case.

v1->v2:
* No change.

 drivers/edac/amd64_edac.c | 66 ++++++++++++++++++++++++++++++---------
 1 file changed, 52 insertions(+), 14 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 0e8b2137edbb..001dc85122e9 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -2837,6 +2837,49 @@ static u32 get_csrow_nr_pages(struct amd64_pvt *pvt, u8 dct, int csrow_nr_orig)
 	return nr_pages;
 }
 
+static int init_csrows_df(struct mem_ctl_info *mci)
+{
+	struct amd64_pvt *pvt = mci->pvt_info;
+	enum edac_type edac_mode = EDAC_NONE;
+	enum dev_type dev_type = DEV_UNKNOWN;
+	struct dimm_info *dimm;
+	int empty = 1;
+	u8 umc, cs;
+
+	if (mci->edac_ctl_cap & EDAC_FLAG_S16ECD16ED) {
+		edac_mode = EDAC_S16ECD16ED;
+		dev_type = DEV_X16;
+	} else if (mci->edac_ctl_cap & EDAC_FLAG_S8ECD8ED) {
+		edac_mode = EDAC_S8ECD8ED;
+		dev_type = DEV_X8;
+	} else if (mci->edac_ctl_cap & EDAC_FLAG_S4ECD4ED) {
+		edac_mode = EDAC_S4ECD4ED;
+		dev_type = DEV_X4;
+	} else if (mci->edac_ctl_cap & EDAC_FLAG_SECDED) {
+		edac_mode = EDAC_SECDED;
+	}
+
+	for_each_umc(umc) {
+		for_each_chip_select(cs, umc, pvt) {
+			if (!csrow_enabled(cs, umc, pvt))
+				continue;
+
+			empty = 0;
+			dimm = mci->csrows[cs]->channels[umc]->dimm;
+
+			edac_dbg(1, "MC node: %d, csrow: %d\n",
+					pvt->mc_node_id, cs);
+
+			dimm->nr_pages = get_csrow_nr_pages(pvt, umc, cs);
+			dimm->mtype = pvt->dram_type;
+			dimm->edac_mode = edac_mode;
+			dimm->dtype = dev_type;
+		}
+	}
+
+	return empty;
+}
+
 /*
  * Initialize the array of csrow attribute instances, based on the values
  * from pci config hardware registers.
@@ -2851,15 +2894,16 @@ static int init_csrows(struct mem_ctl_info *mci)
 	int nr_pages = 0;
 	u32 val;
 
-	if (!pvt->umc) {
-		amd64_read_pci_cfg(pvt->F3, NBCFG, &val);
+	if (pvt->umc)
+		return init_csrows_df(mci);
+
+	amd64_read_pci_cfg(pvt->F3, NBCFG, &val);
 
-		pvt->nbcfg = val;
+	pvt->nbcfg = val;
 
-		edac_dbg(0, "node %d, NBCFG=0x%08x[ChipKillEccCap: %d|DramEccEn: %d]\n",
-			 pvt->mc_node_id, val,
-			 !!(val & NBCFG_CHIPKILL), !!(val & NBCFG_ECC_ENABLE));
-	}
+	edac_dbg(0, "node %d, NBCFG=0x%08x[ChipKillEccCap: %d|DramEccEn: %d]\n",
+		 pvt->mc_node_id, val,
+		 !!(val & NBCFG_CHIPKILL), !!(val & NBCFG_ECC_ENABLE));
 
 	/*
 	 * We iterate over DCT0 here but we look at DCT1 in parallel, if needed.
@@ -2896,13 +2940,7 @@ static int init_csrows(struct mem_ctl_info *mci)
 		edac_dbg(1, "Total csrow%d pages: %u\n", i, nr_pages);
 
 		/* Determine DIMM ECC mode: */
-		if (pvt->umc) {
-			if (mci->edac_ctl_cap & EDAC_FLAG_S4ECD4ED)
-				edac_mode = EDAC_S4ECD4ED;
-			else if (mci->edac_ctl_cap & EDAC_FLAG_SECDED)
-				edac_mode = EDAC_SECDED;
-
-		} else if (pvt->nbcfg & NBCFG_ECC_ENABLE) {
+		if (pvt->nbcfg & NBCFG_ECC_ENABLE) {
 			edac_mode = (pvt->nbcfg & NBCFG_CHIPKILL)
 					? EDAC_S4ECD4ED
 					: EDAC_SECDED;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v3 4/8] EDAC/amd64: Find Chip Select memory size using Address Mask
  2019-08-21 23:59 [PATCH v3 0/8] AMD64 EDAC fixes Ghannam, Yazen
                   ` (2 preceding siblings ...)
  2019-08-21 23:59 ` [PATCH v3 3/8] EDAC/amd64: Initialize DIMM info for systems with more than two channels Ghannam, Yazen
@ 2019-08-21 23:59 ` Ghannam, Yazen
  2019-08-22  0:00 ` [PATCH v3 5/8] EDAC/amd64: Decode syndrome before translating address Ghannam, Yazen
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 31+ messages in thread
From: Ghannam, Yazen @ 2019-08-21 23:59 UTC (permalink / raw)
  To: linux-edac; +Cc: Ghannam, Yazen, linux-kernel, bp

From: Yazen Ghannam <yazen.ghannam@amd.com>

Chip Select memory size reporting on AMD Family 17h was recently fixed
in order to account for interleaving. However, the current method is not
robust.

The Chip Select Address Mask can be used to find the memory size. There
are a couple of cases.

1) For single-rank and dual-rank non-interleaved, use the address mask
plus 1 as the size.

2) For dual-rank interleaved, do #1 but "de-interleave" the address mask
first.

Always "de-interleave" the address mask in order to simplify the code
flow. Bit mask manipulation is necessary to check for interleaving, so
just go ahead and do the de-interleaving. In the non-interleaved case,
the original and de-interleaved address masks will be the same.

To de-interleave the mask, count the number of zero bits in the middle
of the mask and swap them with the most significant bits.

For example,
Original=0xFFFF9FE, De-interleaved=0x3FFFFFE

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20190709215643.171078-5-Yazen.Ghannam@amd.com

v2->v3:
* Drop Fixes: tag.
* Add checks to only return CS size for enabled CSes.

v1->v2:
* No change.

 drivers/edac/amd64_edac.c | 114 +++++++++++++++++++++++---------------
 1 file changed, 70 insertions(+), 44 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 001dc85122e9..c4f2d7c59b04 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -788,51 +788,39 @@ static void debug_dump_dramcfg_low(struct amd64_pvt *pvt, u32 dclr, int chan)
 		 (dclr & BIT(15)) ?  "yes" : "no");
 }
 
-/*
- * The Address Mask should be a contiguous set of bits in the non-interleaved
- * case. So to check for CS interleaving, find the most- and least-significant
- * bits of the mask, generate a contiguous bitmask, and compare the two.
- */
-static bool f17_cs_interleaved(struct amd64_pvt *pvt, u8 ctrl, int cs)
+#define CS_EVEN_PRIMARY		BIT(0)
+#define CS_ODD_PRIMARY		BIT(1)
+
+#define CS_EVEN			CS_EVEN_PRIMARY
+#define CS_ODD			CS_ODD_PRIMARY
+
+static int f17_get_cs_mode(int dimm, u8 ctrl, struct amd64_pvt *pvt)
 {
-	u32 mask = pvt->csels[ctrl].csmasks[cs >> 1];
-	u32 msb = fls(mask) - 1, lsb = ffs(mask) - 1;
-	u32 test_mask = GENMASK(msb, lsb);
+	int cs_mode = 0;
 
-	edac_dbg(1, "mask=0x%08x test_mask=0x%08x\n", mask, test_mask);
+	if (csrow_enabled(2 * dimm, ctrl, pvt))
+		cs_mode |= CS_EVEN_PRIMARY;
 
-	return mask ^ test_mask;
+	if (csrow_enabled(2 * dimm + 1, ctrl, pvt))
+		cs_mode |= CS_ODD_PRIMARY;
+
+	return cs_mode;
 }
 
 static void debug_display_dimm_sizes_df(struct amd64_pvt *pvt, u8 ctrl)
 {
-	int dimm, size0, size1, cs0, cs1;
+	int dimm, size0, size1, cs0, cs1, cs_mode;
 
 	edac_printk(KERN_DEBUG, EDAC_MC, "UMC%d chip selects:\n", ctrl);
 
 	for (dimm = 0; dimm < 2; dimm++) {
-		size0 = 0;
 		cs0 = dimm * 2;
-
-		if (csrow_enabled(cs0, ctrl, pvt))
-			size0 = pvt->ops->dbam_to_cs(pvt, ctrl, 0, cs0);
-
-		size1 = 0;
 		cs1 = dimm * 2 + 1;
 
-		if (csrow_enabled(cs1, ctrl, pvt)) {
-			/*
-			 * CS interleaving is only supported if both CSes have
-			 * the same amount of memory. Because they are
-			 * interleaved, it will look like both CSes have the
-			 * full amount of memory. Save the size for both as
-			 * half the amount we found on CS0, if interleaved.
-			 */
-			if (f17_cs_interleaved(pvt, ctrl, cs1))
-				size1 = size0 = (size0 >> 1);
-			else
-				size1 = pvt->ops->dbam_to_cs(pvt, ctrl, 0, cs1);
-		}
+		cs_mode = f17_get_cs_mode(dimm, ctrl, pvt);
+
+		size0 = pvt->ops->dbam_to_cs(pvt, ctrl, cs_mode, cs0);
+		size1 = pvt->ops->dbam_to_cs(pvt, ctrl, cs_mode, cs1);
 
 		amd64_info(EDAC_MC ": %d: %5dMB %d: %5dMB\n",
 				cs0,	size0,
@@ -1569,18 +1557,54 @@ static int f16_dbam_to_chip_select(struct amd64_pvt *pvt, u8 dct,
 		return ddr3_cs_size(cs_mode, false);
 }
 
-static int f17_base_addr_to_cs_size(struct amd64_pvt *pvt, u8 umc,
+static int f17_addr_mask_to_cs_size(struct amd64_pvt *pvt, u8 umc,
 				    unsigned int cs_mode, int csrow_nr)
 {
-	u32 base_addr = pvt->csels[umc].csbases[csrow_nr];
+	u32 addr_mask_orig, addr_mask_deinterleaved;
+	u32 msb, weight, num_zero_bits;
+	int dimm, size = 0;
 
-	/*  Each mask is used for every two base addresses. */
-	u32 addr_mask = pvt->csels[umc].csmasks[csrow_nr >> 1];
+	/* No Chip Selects are enabled. */
+	if (!cs_mode)
+		return size;
 
-	/*  Register [31:1] = Address [39:9]. Size is in kBs here. */
-	u32 size = ((addr_mask >> 1) - (base_addr >> 1) + 1) >> 1;
+	/* Requested size of an even CS but none are enabled. */
+	if (!(cs_mode & CS_EVEN) && !(csrow_nr & 1))
+		return size;
 
-	edac_dbg(1, "BaseAddr: 0x%x, AddrMask: 0x%x\n", base_addr, addr_mask);
+	/* Requested size of an odd CS but none are enabled. */
+	if (!(cs_mode & CS_ODD) && (csrow_nr & 1))
+		return size;
+
+	/*
+	 * There is one mask per DIMM, and two Chip Selects per DIMM.
+	 *	CS0 and CS1 -> DIMM0
+	 *	CS2 and CS3 -> DIMM1
+	 */
+	dimm = csrow_nr >> 1;
+
+	addr_mask_orig = pvt->csels[umc].csmasks[dimm];
+
+	/*
+	 * The number of zero bits in the mask is equal to the number of bits
+	 * in a full mask minus the number of bits in the current mask.
+	 *
+	 * The MSB is the number of bits in the full mask because BIT[0] is
+	 * always 0.
+	 */
+	msb = fls(addr_mask_orig) - 1;
+	weight = hweight_long(addr_mask_orig);
+	num_zero_bits = msb - weight;
+
+	/* Take the number of zero bits off from the top of the mask. */
+	addr_mask_deinterleaved = GENMASK_ULL(msb - num_zero_bits, 1);
+
+	edac_dbg(1, "CS%d DIMM%d AddrMasks:\n", csrow_nr, dimm);
+	edac_dbg(1, "  Original AddrMask: 0x%x\n", addr_mask_orig);
+	edac_dbg(1, "  Deinterleaved AddrMask: 0x%x\n", addr_mask_deinterleaved);
+
+	/* Register [31:1] = Address [39:9]. Size is in kBs here. */
+	size = (addr_mask_deinterleaved >> 2) + 1;
 
 	/* Return size in MBs. */
 	return size >> 10;
@@ -2245,7 +2269,7 @@ static struct amd64_family_type family_types[] = {
 		.f6_id = PCI_DEVICE_ID_AMD_17H_DF_F6,
 		.ops = {
 			.early_channel_count	= f17_early_channel_count,
-			.dbam_to_cs		= f17_base_addr_to_cs_size,
+			.dbam_to_cs		= f17_addr_mask_to_cs_size,
 		}
 	},
 	[F17_M10H_CPUS] = {
@@ -2254,7 +2278,7 @@ static struct amd64_family_type family_types[] = {
 		.f6_id = PCI_DEVICE_ID_AMD_17H_M10H_DF_F6,
 		.ops = {
 			.early_channel_count	= f17_early_channel_count,
-			.dbam_to_cs		= f17_base_addr_to_cs_size,
+			.dbam_to_cs		= f17_addr_mask_to_cs_size,
 		}
 	},
 	[F17_M30H_CPUS] = {
@@ -2263,7 +2287,7 @@ static struct amd64_family_type family_types[] = {
 		.f6_id = PCI_DEVICE_ID_AMD_17H_M30H_DF_F6,
 		.ops = {
 			.early_channel_count	= f17_early_channel_count,
-			.dbam_to_cs		= f17_base_addr_to_cs_size,
+			.dbam_to_cs		= f17_addr_mask_to_cs_size,
 		}
 	},
 };
@@ -2822,10 +2846,12 @@ static u32 get_csrow_nr_pages(struct amd64_pvt *pvt, u8 dct, int csrow_nr_orig)
 	int csrow_nr = csrow_nr_orig;
 	u32 cs_mode, nr_pages;
 
-	if (!pvt->umc)
+	if (!pvt->umc) {
 		csrow_nr >>= 1;
-
-	cs_mode = DBAM_DIMM(csrow_nr, dbam);
+		cs_mode = DBAM_DIMM(csrow_nr, dbam);
+	} else {
+		cs_mode = f17_get_cs_mode(csrow_nr >> 1, dct, pvt);
+	}
 
 	nr_pages   = pvt->ops->dbam_to_cs(pvt, dct, cs_mode, csrow_nr);
 	nr_pages <<= 20 - PAGE_SHIFT;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v3 5/8] EDAC/amd64: Decode syndrome before translating address
  2019-08-21 23:59 [PATCH v3 0/8] AMD64 EDAC fixes Ghannam, Yazen
                   ` (3 preceding siblings ...)
  2019-08-21 23:59 ` [PATCH v3 4/8] EDAC/amd64: Find Chip Select memory size using Address Mask Ghannam, Yazen
@ 2019-08-22  0:00 ` Ghannam, Yazen
  2019-08-22  0:00 ` [PATCH v3 6/8] EDAC/amd64: Cache secondary Chip Select registers Ghannam, Yazen
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 31+ messages in thread
From: Ghannam, Yazen @ 2019-08-22  0:00 UTC (permalink / raw)
  To: linux-edac; +Cc: Ghannam, Yazen, linux-kernel, bp

From: Yazen Ghannam <yazen.ghannam@amd.com>

AMD Family 17h systems currently require address translation in order to
report the system address of a DRAM ECC error. This is currently done
before decoding the syndrome information. The syndrome information does
not depend on the address translation, so the proper EDAC csrow/channel
reporting can function without the address. However, the syndrome
information will not be decoded if the address translation fails.

Decode the syndrome information before doing the address translation.
The syndrome information is architecturally defined in MCA_SYND and can
be considered robust. The address translation is system-specific and may
fail on newer systems without proper updates to the translation
algorithm.

Fixes: 713ad54675fd ("EDAC, amd64: Define and register UMC error decode function")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20190709215643.171078-6-Yazen.Ghannam@amd.com

v2->v3:
* No change.

v1->v2:
* No change.

 drivers/edac/amd64_edac.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index c4f2d7c59b04..de141de7b2e5 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -2574,13 +2574,6 @@ static void decode_umc_error(int node_id, struct mce *m)
 
 	err.channel = find_umc_channel(m);
 
-	if (umc_normaddr_to_sysaddr(m->addr, pvt->mc_node_id, err.channel, &sys_addr)) {
-		err.err_code = ERR_NORM_ADDR;
-		goto log_error;
-	}
-
-	error_address_to_page_and_offset(sys_addr, &err);
-
 	if (!(m->status & MCI_STATUS_SYNDV)) {
 		err.err_code = ERR_SYND;
 		goto log_error;
@@ -2597,6 +2590,13 @@ static void decode_umc_error(int node_id, struct mce *m)
 
 	err.csrow = m->synd & 0x7;
 
+	if (umc_normaddr_to_sysaddr(m->addr, pvt->mc_node_id, err.channel, &sys_addr)) {
+		err.err_code = ERR_NORM_ADDR;
+		goto log_error;
+	}
+
+	error_address_to_page_and_offset(sys_addr, &err);
+
 log_error:
 	__log_ecc_error(mci, &err, ecc_type);
 }
-- 
2.17.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v3 6/8] EDAC/amd64: Cache secondary Chip Select registers
  2019-08-21 23:59 [PATCH v3 0/8] AMD64 EDAC fixes Ghannam, Yazen
                   ` (4 preceding siblings ...)
  2019-08-22  0:00 ` [PATCH v3 5/8] EDAC/amd64: Decode syndrome before translating address Ghannam, Yazen
@ 2019-08-22  0:00 ` Ghannam, Yazen
  2019-08-22  0:00 ` [RFC PATCH v3 08/10] EDAC/amd64: Gather hardware information early Ghannam, Yazen
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 31+ messages in thread
From: Ghannam, Yazen @ 2019-08-22  0:00 UTC (permalink / raw)
  To: linux-edac; +Cc: Ghannam, Yazen, linux-kernel, bp

From: Yazen Ghannam <yazen.ghannam@amd.com>

AMD Family 17h systems have a set of secondary Chip Select Base
Addresses and Address Masks. These do not represent unique Chip
Selects, rather they are used in conjunction with the primary
Chip Select registers in certain use cases.

Cache these secondary Chip Select registers for future use.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20190709215643.171078-7-Yazen.Ghannam@amd.com

v2->v3:
* No change.

v1->v2:
* No change.

 drivers/edac/amd64_edac.c | 23 ++++++++++++++++++++---
 drivers/edac/amd64_edac.h |  4 ++++
 2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index de141de7b2e5..26ce48fcaf00 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -946,34 +946,51 @@ static void prep_chip_selects(struct amd64_pvt *pvt)
 
 static void read_umc_base_mask(struct amd64_pvt *pvt)
 {
-	u32 umc_base_reg, umc_mask_reg;
-	u32 base_reg, mask_reg;
-	u32 *base, *mask;
+	u32 umc_base_reg, umc_base_reg_sec;
+	u32 umc_mask_reg, umc_mask_reg_sec;
+	u32 base_reg, base_reg_sec;
+	u32 mask_reg, mask_reg_sec;
+	u32 *base, *base_sec;
+	u32 *mask, *mask_sec;
 	int cs, umc;
 
 	for_each_umc(umc) {
 		umc_base_reg = get_umc_base(umc) + UMCCH_BASE_ADDR;
+		umc_base_reg_sec = get_umc_base(umc) + UMCCH_BASE_ADDR_SEC;
 
 		for_each_chip_select(cs, umc, pvt) {
 			base = &pvt->csels[umc].csbases[cs];
+			base_sec = &pvt->csels[umc].csbases_sec[cs];
 
 			base_reg = umc_base_reg + (cs * 4);
+			base_reg_sec = umc_base_reg_sec + (cs * 4);
 
 			if (!amd_smn_read(pvt->mc_node_id, base_reg, base))
 				edac_dbg(0, "  DCSB%d[%d]=0x%08x reg: 0x%x\n",
 					 umc, cs, *base, base_reg);
+
+			if (!amd_smn_read(pvt->mc_node_id, base_reg_sec, base_sec))
+				edac_dbg(0, "    DCSB_SEC%d[%d]=0x%08x reg: 0x%x\n",
+					 umc, cs, *base_sec, base_reg_sec);
 		}
 
 		umc_mask_reg = get_umc_base(umc) + UMCCH_ADDR_MASK;
+		umc_mask_reg_sec = get_umc_base(umc) + UMCCH_ADDR_MASK_SEC;
 
 		for_each_chip_select_mask(cs, umc, pvt) {
 			mask = &pvt->csels[umc].csmasks[cs];
+			mask_sec = &pvt->csels[umc].csmasks_sec[cs];
 
 			mask_reg = umc_mask_reg + (cs * 4);
+			mask_reg_sec = umc_mask_reg_sec + (cs * 4);
 
 			if (!amd_smn_read(pvt->mc_node_id, mask_reg, mask))
 				edac_dbg(0, "  DCSM%d[%d]=0x%08x reg: 0x%x\n",
 					 umc, cs, *mask, mask_reg);
+
+			if (!amd_smn_read(pvt->mc_node_id, mask_reg_sec, mask_sec))
+				edac_dbg(0, "    DCSM_SEC%d[%d]=0x%08x reg: 0x%x\n",
+					 umc, cs, *mask_sec, mask_reg_sec);
 		}
 	}
 }
diff --git a/drivers/edac/amd64_edac.h b/drivers/edac/amd64_edac.h
index 4dce6a2ac75f..68f12de6e654 100644
--- a/drivers/edac/amd64_edac.h
+++ b/drivers/edac/amd64_edac.h
@@ -259,7 +259,9 @@
 
 /* UMC CH register offsets */
 #define UMCCH_BASE_ADDR			0x0
+#define UMCCH_BASE_ADDR_SEC		0x10
 #define UMCCH_ADDR_MASK			0x20
+#define UMCCH_ADDR_MASK_SEC		0x28
 #define UMCCH_ADDR_CFG			0x30
 #define UMCCH_DIMM_CFG			0x80
 #define UMCCH_UMC_CFG			0x100
@@ -312,9 +314,11 @@ struct dram_range {
 /* A DCT chip selects collection */
 struct chip_select {
 	u32 csbases[NUM_CHIPSELECTS];
+	u32 csbases_sec[NUM_CHIPSELECTS];
 	u8 b_cnt;
 
 	u32 csmasks[NUM_CHIPSELECTS];
+	u32 csmasks_sec[NUM_CHIPSELECTS];
 	u8 m_cnt;
 };
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v3 7/8] EDAC/amd64: Support Asymmetric Dual-Rank DIMMs
  2019-08-21 23:59 [PATCH v3 0/8] AMD64 EDAC fixes Ghannam, Yazen
                   ` (6 preceding siblings ...)
  2019-08-22  0:00 ` [RFC PATCH v3 08/10] EDAC/amd64: Gather hardware information early Ghannam, Yazen
@ 2019-08-22  0:00 ` Ghannam, Yazen
  2019-08-23 11:26   ` Borislav Petkov
  2019-08-22  0:00 ` [RFC PATCH v3 10/10] EDAC/amd64: Check for memory before fully initializing an instance Ghannam, Yazen
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 31+ messages in thread
From: Ghannam, Yazen @ 2019-08-22  0:00 UTC (permalink / raw)
  To: linux-edac; +Cc: Ghannam, Yazen, linux-kernel, bp

From: Yazen Ghannam <yazen.ghannam@amd.com>

Future AMD systems will support "Asymmetric" Dual-Rank DIMMs. These are
DIMMs where the ranks are of different sizes.

The even rank will use the Primary Even Chip Select registers and the
odd rank will use the Secondary Odd Chip Select registers.

Recognize if a Secondary Odd Chip Select is being used. Use the
Secondary Odd Address Mask when calculating the chip select size.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20190709215643.171078-8-Yazen.Ghannam@amd.com

v2->v3:
* Add check of csrow_nr before using secondary mask.

v1->v2:
* No change.

 drivers/edac/amd64_edac.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 26ce48fcaf00..4d1e6daa7ec4 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -790,9 +790,13 @@ static void debug_dump_dramcfg_low(struct amd64_pvt *pvt, u32 dclr, int chan)
 
 #define CS_EVEN_PRIMARY		BIT(0)
 #define CS_ODD_PRIMARY		BIT(1)
+#define CS_EVEN_SECONDARY	BIT(2)
+#define CS_ODD_SECONDARY	BIT(3)
 
-#define CS_EVEN			CS_EVEN_PRIMARY
-#define CS_ODD			CS_ODD_PRIMARY
+#define CS_EVEN			(CS_EVEN_PRIMARY | CS_EVEN_SECONDARY)
+#define CS_ODD			(CS_ODD_PRIMARY | CS_EVEN_SECONDARY)
+
+#define csrow_sec_enabled(i, dct, pvt)	((pvt)->csels[(dct)].csbases_sec[(i)] & DCSB_CS_ENABLE)
 
 static int f17_get_cs_mode(int dimm, u8 ctrl, struct amd64_pvt *pvt)
 {
@@ -804,6 +808,10 @@ static int f17_get_cs_mode(int dimm, u8 ctrl, struct amd64_pvt *pvt)
 	if (csrow_enabled(2 * dimm + 1, ctrl, pvt))
 		cs_mode |= CS_ODD_PRIMARY;
 
+	/* Asymmetric Dual-Rank DIMM support. */
+	if (csrow_sec_enabled(2 * dimm + 1, ctrl, pvt))
+		cs_mode |= CS_ODD_SECONDARY;
+
 	return cs_mode;
 }
 
@@ -1600,7 +1608,11 @@ static int f17_addr_mask_to_cs_size(struct amd64_pvt *pvt, u8 umc,
 	 */
 	dimm = csrow_nr >> 1;
 
-	addr_mask_orig = pvt->csels[umc].csmasks[dimm];
+	/* Asymmetric Dual-Rank DIMM support. */
+	if ((csrow_nr & 1) && (cs_mode & CS_ODD_SECONDARY))
+		addr_mask_orig = pvt->csels[umc].csmasks_sec[dimm];
+	else
+		addr_mask_orig = pvt->csels[umc].csmasks[dimm];
 
 	/*
 	 * The number of zero bits in the mask is equal to the number of bits
-- 
2.17.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC PATCH v3 08/10] EDAC/amd64: Gather hardware information early
  2019-08-21 23:59 [PATCH v3 0/8] AMD64 EDAC fixes Ghannam, Yazen
                   ` (5 preceding siblings ...)
  2019-08-22  0:00 ` [PATCH v3 6/8] EDAC/amd64: Cache secondary Chip Select registers Ghannam, Yazen
@ 2019-08-22  0:00 ` Ghannam, Yazen
  2019-08-29  9:22   ` Borislav Petkov
  2019-08-22  0:00 ` [PATCH v3 7/8] EDAC/amd64: Support Asymmetric Dual-Rank DIMMs Ghannam, Yazen
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 31+ messages in thread
From: Ghannam, Yazen @ 2019-08-22  0:00 UTC (permalink / raw)
  To: linux-edac; +Cc: Ghannam, Yazen, linux-kernel, bp

From: Yazen Ghannam <yazen.ghannam@amd.com>

Split out gathering hardware information from init_one_instance() into a
separate function get_hardware_info().

This is necessary so that the information can be cached earlier and used
to check if memory is populated and if ECC is enabled on a node.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
 drivers/edac/amd64_edac.c | 76 +++++++++++++++++++++++----------------
 1 file changed, 45 insertions(+), 31 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 4d1e6daa7ec4..84832771dec0 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3405,34 +3405,17 @@ static void compute_num_umcs(void)
 	edac_dbg(1, "Number of UMCs: %x", num_umcs);
 }
 
-static int init_one_instance(unsigned int nid)
+static int get_hardware_info(struct amd64_pvt *pvt,
+			     struct amd64_family_type *fam_type)
 {
-	struct pci_dev *F3 = node_to_amd_nb(nid)->misc;
-	struct amd64_family_type *fam_type = NULL;
-	struct mem_ctl_info *mci = NULL;
-	struct edac_mc_layer layers[2];
-	struct amd64_pvt *pvt = NULL;
 	u16 pci_id1, pci_id2;
-	int err = 0, ret;
-
-	ret = -ENOMEM;
-	pvt = kzalloc(sizeof(struct amd64_pvt), GFP_KERNEL);
-	if (!pvt)
-		goto err_ret;
-
-	pvt->mc_node_id	= nid;
-	pvt->F3 = F3;
-
-	ret = -EINVAL;
-	fam_type = per_family_init(pvt);
-	if (!fam_type)
-		goto err_free;
+	int ret = -EINVAL;
 
 	if (pvt->fam >= 0x17) {
 		pvt->umc = kcalloc(num_umcs, sizeof(struct amd64_umc), GFP_KERNEL);
 		if (!pvt->umc) {
 			ret = -ENOMEM;
-			goto err_free;
+			goto err_ret;
 		}
 
 		pci_id1 = fam_type->f0_id;
@@ -3442,18 +3425,34 @@ static int init_one_instance(unsigned int nid)
 		pci_id2 = fam_type->f2_id;
 	}
 
-	err = reserve_mc_sibling_devs(pvt, pci_id1, pci_id2);
-	if (err)
+	ret = reserve_mc_sibling_devs(pvt, pci_id1, pci_id2);
+	if (ret)
 		goto err_post_init;
 
 	read_mc_regs(pvt);
 
+	return 0;
+
+err_post_init:
+	if (pvt->fam >= 0x17)
+		kfree(pvt->umc);
+
+err_ret:
+	return ret;
+}
+
+static int init_one_instance(struct amd64_pvt *pvt,
+			     struct amd64_family_type *fam_type)
+{
+	struct mem_ctl_info *mci = NULL;
+	struct edac_mc_layer layers[2];
+	int ret = -EINVAL;
+
 	/*
 	 * We need to determine how many memory channels there are. Then use
 	 * that information for calculating the size of the dynamic instance
 	 * tables in the 'mci' structure.
 	 */
-	ret = -EINVAL;
 	pvt->channel_count = pvt->ops->early_channel_count(pvt);
 	if (pvt->channel_count < 0)
 		goto err_siblings;
@@ -3478,7 +3477,7 @@ static int init_one_instance(unsigned int nid)
 		layers[1].size = 2;
 	layers[1].is_virt_csrow = false;
 
-	mci = edac_mc_alloc(nid, ARRAY_SIZE(layers), layers, 0);
+	mci = edac_mc_alloc(pvt->mc_node_id, ARRAY_SIZE(layers), layers, 0);
 	if (!mci)
 		goto err_siblings;
 
@@ -3504,20 +3503,17 @@ static int init_one_instance(unsigned int nid)
 err_siblings:
 	free_mc_sibling_devs(pvt);
 
-err_post_init:
 	if (pvt->fam >= 0x17)
 		kfree(pvt->umc);
 
-err_free:
-	kfree(pvt);
-
-err_ret:
 	return ret;
 }
 
 static int probe_one_instance(unsigned int nid)
 {
 	struct pci_dev *F3 = node_to_amd_nb(nid)->misc;
+	struct amd64_family_type *fam_type = NULL;
+	struct amd64_pvt *pvt = NULL;
 	struct ecc_settings *s;
 	int ret;
 
@@ -3528,6 +3524,21 @@ static int probe_one_instance(unsigned int nid)
 
 	ecc_stngs[nid] = s;
 
+	pvt = kzalloc(sizeof(struct amd64_pvt), GFP_KERNEL);
+	if (!pvt)
+		goto err_settings;
+
+	pvt->mc_node_id	= nid;
+	pvt->F3 = F3;
+
+	fam_type = per_family_init(pvt);
+	if (!fam_type)
+		goto err_enable;
+
+	ret = get_hardware_info(pvt, fam_type);
+	if (ret < 0)
+		goto err_enable;
+
 	if (!ecc_enabled(F3, nid)) {
 		ret = 0;
 
@@ -3544,7 +3555,7 @@ static int probe_one_instance(unsigned int nid)
 			goto err_enable;
 	}
 
-	ret = init_one_instance(nid);
+	ret = init_one_instance(pvt, fam_type);
 	if (ret < 0) {
 		amd64_err("Error probing instance: %d\n", nid);
 
@@ -3557,6 +3568,9 @@ static int probe_one_instance(unsigned int nid)
 	return ret;
 
 err_enable:
+	kfree(pvt);
+
+err_settings:
 	kfree(s);
 	ecc_stngs[nid] = NULL;
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC PATCH v3 09/10] EDAC/amd64: Use cached data when checking for ECC
  2019-08-21 23:59 [PATCH v3 0/8] AMD64 EDAC fixes Ghannam, Yazen
                   ` (8 preceding siblings ...)
  2019-08-22  0:00 ` [RFC PATCH v3 10/10] EDAC/amd64: Check for memory before fully initializing an instance Ghannam, Yazen
@ 2019-08-22  0:00 ` Ghannam, Yazen
  2019-08-22  0:50 ` [PATCH v3 0/8] AMD64 EDAC fixes Adam Borowski
  2019-08-22 18:48 ` [RFC PATCH v2] EDAC/amd64: Check for memory before fully initializing an instance Ghannam, Yazen
  11 siblings, 0 replies; 31+ messages in thread
From: Ghannam, Yazen @ 2019-08-22  0:00 UTC (permalink / raw)
  To: linux-edac; +Cc: Ghannam, Yazen, linux-kernel, bp

From: Yazen Ghannam <yazen.ghannam@amd.com>

...now that the data is available earlier.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
 drivers/edac/amd64_edac.c | 20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 84832771dec0..c1cb0234f085 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3183,31 +3183,27 @@ static const char *ecc_msg =
 	"'ecc_enable_override'.\n"
 	" (Note that use of the override may cause unknown side effects.)\n";
 
-static bool ecc_enabled(struct pci_dev *F3, u16 nid)
+static bool ecc_enabled(struct amd64_pvt *pvt)
 {
+	u16 nid = pvt->mc_node_id;
 	bool nb_mce_en = false;
 	u8 ecc_en = 0, i;
 	u32 value;
 
 	if (boot_cpu_data.x86 >= 0x17) {
 		u8 umc_en_mask = 0, ecc_en_mask = 0;
+		struct amd64_umc *umc;
 
 		for_each_umc(i) {
-			u32 base = get_umc_base(i);
+			umc = &pvt->umc[i];
 
 			/* Only check enabled UMCs. */
-			if (amd_smn_read(nid, base + UMCCH_SDP_CTRL, &value))
-				continue;
-
-			if (!(value & UMC_SDP_INIT))
+			if (!(umc->sdp_ctrl & UMC_SDP_INIT))
 				continue;
 
 			umc_en_mask |= BIT(i);
 
-			if (amd_smn_read(nid, base + UMCCH_UMC_CAP_HI, &value))
-				continue;
-
-			if (value & UMC_ECC_ENABLED)
+			if (umc->umc_cap_hi & UMC_ECC_ENABLED)
 				ecc_en_mask |= BIT(i);
 		}
 
@@ -3220,7 +3216,7 @@ static bool ecc_enabled(struct pci_dev *F3, u16 nid)
 		/* Assume UMC MCA banks are enabled. */
 		nb_mce_en = true;
 	} else {
-		amd64_read_pci_cfg(F3, NBCFG, &value);
+		amd64_read_pci_cfg(pvt->F3, NBCFG, &value);
 
 		ecc_en = !!(value & NBCFG_ECC_ENABLE);
 
@@ -3539,7 +3535,7 @@ static int probe_one_instance(unsigned int nid)
 	if (ret < 0)
 		goto err_enable;
 
-	if (!ecc_enabled(F3, nid)) {
+	if (!ecc_enabled(pvt)) {
 		ret = 0;
 
 		if (!ecc_enable_override)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC PATCH v3 10/10] EDAC/amd64: Check for memory before fully initializing an instance
  2019-08-21 23:59 [PATCH v3 0/8] AMD64 EDAC fixes Ghannam, Yazen
                   ` (7 preceding siblings ...)
  2019-08-22  0:00 ` [PATCH v3 7/8] EDAC/amd64: Support Asymmetric Dual-Rank DIMMs Ghannam, Yazen
@ 2019-08-22  0:00 ` Ghannam, Yazen
  2019-08-22 18:51   ` [RFC PATCH v2] " Ghannam, Yazen
  2019-08-22  0:00 ` [RFC PATCH v3 09/10] EDAC/amd64: Use cached data when checking for ECC Ghannam, Yazen
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 31+ messages in thread
From: Ghannam, Yazen @ 2019-08-22  0:00 UTC (permalink / raw)
  To: linux-edac; +Cc: Ghannam, Yazen, linux-kernel, bp

From: Yazen Ghannam <yazen.ghannam@amd.com>

Return early before checking for ECC if the node does not have any
populated memory.

Free any cached hardware data before returning. Also, return 0 in this
case since this is not a failure. Other nodes may have memory and the
module should attempt to load an instance for them.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
 drivers/edac/amd64_edac.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index c1cb0234f085..7230ed4ff665 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3505,6 +3505,23 @@ static int init_one_instance(struct amd64_pvt *pvt,
 	return ret;
 }
 
+static bool instance_has_memory(struct amd64_pvt *pvt)
+{
+	bool cs_enabled = false;
+	int num_channels = 2;
+	int cs = 0, dct = 0;
+
+	if (pvt->umc)
+		num_channels = num_umcs;
+
+	for (dct = 0; dct < num_channels; dct++) {
+		for_each_chip_select(cs, dct, pvt)
+			cs_enabled |= csrow_enabled(cs, dct, pvt);
+	}
+
+	return cs_enabled;
+}
+
 static int probe_one_instance(unsigned int nid)
 {
 	struct pci_dev *F3 = node_to_amd_nb(nid)->misc;
@@ -3535,6 +3552,10 @@ static int probe_one_instance(unsigned int nid)
 	if (ret < 0)
 		goto err_enable;
 
+	ret = 0;
+	if (!instance_has_memory(pvt))
+		goto err_enable;
+
 	if (!ecc_enabled(pvt)) {
 		ret = 0;
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 0/8] AMD64 EDAC fixes
  2019-08-21 23:59 [PATCH v3 0/8] AMD64 EDAC fixes Ghannam, Yazen
                   ` (9 preceding siblings ...)
  2019-08-22  0:00 ` [RFC PATCH v3 09/10] EDAC/amd64: Use cached data when checking for ECC Ghannam, Yazen
@ 2019-08-22  0:50 ` Adam Borowski
  2019-08-22  8:35   ` Borislav Petkov
  2019-08-22 18:54   ` Ghannam, Yazen
  2019-08-22 18:48 ` [RFC PATCH v2] EDAC/amd64: Check for memory before fully initializing an instance Ghannam, Yazen
  11 siblings, 2 replies; 31+ messages in thread
From: Adam Borowski @ 2019-08-22  0:50 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: linux-edac, linux-kernel, bp

On Wed, Aug 21, 2019 at 11:59:53PM +0000, Ghannam, Yazen wrote:
> I've also added RFC patches to avoid the "ECC disabled" message for
> nodes without memory. I haven't fully tested these, but I wanted to get
> your thoughts. Here's an earlier discussion:
> https://lkml.kernel.org/r/20180321191335.7832-1-Yazen.Ghannam@amd.com

While you're editing that code, could you please also cut the spam if ECC is
actually disabled?  For example, a 2990WX with non-ECC RAM gets 1024 lines;
64 copies of:

[    8.186164] EDAC amd64: Node 0: DRAM ECC disabled.
[    8.188364] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
                Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
                (Note that use of the override may cause unknown side effects.)
[    8.194762] EDAC amd64: Node 1: DRAM ECC disabled.
[    8.196307] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
                Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
                (Note that use of the override may cause unknown side effects.)
[    8.199840] EDAC amd64: Node 2: DRAM ECC disabled.
[    8.200963] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
                Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
                (Note that use of the override may cause unknown side effects.)
[    8.204326] EDAC amd64: Node 3: DRAM ECC disabled.
[    8.205436] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
                Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
                (Note that use of the override may cause unknown side effects.)


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋  The root of a real enemy is an imaginary friend.
⠈⠳⣄⠀⠀⠀⠀

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 0/8] AMD64 EDAC fixes
  2019-08-22  0:50 ` [PATCH v3 0/8] AMD64 EDAC fixes Adam Borowski
@ 2019-08-22  8:35   ` Borislav Petkov
  2019-08-22  9:46     ` Adam Borowski
  2019-08-22 18:54   ` Ghannam, Yazen
  1 sibling, 1 reply; 31+ messages in thread
From: Borislav Petkov @ 2019-08-22  8:35 UTC (permalink / raw)
  To: Adam Borowski; +Cc: Ghannam, Yazen, linux-edac, linux-kernel

On Thu, Aug 22, 2019 at 02:50:20AM +0200, Adam Borowski wrote:
> While you're editing that code, could you please also cut the spam if ECC is
> actually disabled?  For example, a 2990WX with non-ECC RAM gets 1024 lines;

Patch is in there. I'll give you extra points if you spot it.

> Meow!

Wuff!

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 0/8] AMD64 EDAC fixes
  2019-08-22  8:35   ` Borislav Petkov
@ 2019-08-22  9:46     ` Adam Borowski
  2019-08-22  9:55       ` Borislav Petkov
  0 siblings, 1 reply; 31+ messages in thread
From: Adam Borowski @ 2019-08-22  9:46 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Ghannam, Yazen, linux-edac, linux-kernel

On Thu, Aug 22, 2019 at 10:35:48AM +0200, Borislav Petkov wrote:
> On Thu, Aug 22, 2019 at 02:50:20AM +0200, Adam Borowski wrote:
> > While you're editing that code, could you please also cut the spam if ECC is
> > actually disabled?  For example, a 2990WX with non-ECC RAM gets 1024 lines;
> 
> Patch is in there. I'll give you extra points if you spot it.

Yeah, some of messages are no longer emitted for memory-less nodes (NUMA 1
and 3).  Your patch set also overhauls the messages.

But, the amount of redundant messages I'm complaining about has actually
increased:

dmesg|grep EDAC|cut -c 16-|sort|uniq -c
    256 EDAC MC: UMC0 chip selects:
    256 EDAC MC: UMC1 chip selects:
      1 EDAC MC: Ver: 3.0.0
    128 EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
    ^ three lines each
     64 EDAC amd64: F17h detected (node 0).
     64 EDAC amd64: F17h detected (node 1).
     64 EDAC amd64: F17h detected (node 2).
     64 EDAC amd64: F17h detected (node 3).
    512 EDAC amd64: MC: 0:     0MB 1:     0MB
    256 EDAC amd64: MC: 2:     0MB 3:     0MB
    256 EDAC amd64: MC: 2:  8192MB 3:     0MB
     64 EDAC amd64: Node 0: DRAM ECC disabled.
     64 EDAC amd64: Node 2: DRAM ECC disabled.
    256 EDAC amd64: using x4 syndromes.

(Full dmesg at http://ix.io/1T1o)

While on 5.3-rc5 without the patchset I get:

      1 EDAC MC: Ver: 3.0.0
    256 EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
    ^ three lines each
     64 EDAC amd64: Node 0: DRAM ECC disabled.
     64 EDAC amd64: Node 1: DRAM ECC disabled.
     64 EDAC amd64: Node 2: DRAM ECC disabled.
     64 EDAC amd64: Node 3: DRAM ECC disabled.

So I wonder if we could deduplicate those.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄⠀⠀⠀⠀ A master species delegates.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 0/8] AMD64 EDAC fixes
  2019-08-22  9:46     ` Adam Borowski
@ 2019-08-22  9:55       ` Borislav Petkov
  0 siblings, 0 replies; 31+ messages in thread
From: Borislav Petkov @ 2019-08-22  9:55 UTC (permalink / raw)
  To: Adam Borowski; +Cc: Ghannam, Yazen, linux-edac, linux-kernel

On Thu, Aug 22, 2019 at 11:46:07AM +0200, Adam Borowski wrote:
> Yeah, some of messages are no longer emitted for memory-less nodes (NUMA 1
> and 3).  Your patch set also overhauls the messages.

Not my patchset - Yazen's.

> But, the amount of redundant messages I'm complaining about has actually
> increased:

He's working on that - that still needs some love.

Thanks for testing.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC PATCH v2] EDAC/amd64: Check for memory before fully initializing an instance
  2019-08-21 23:59 [PATCH v3 0/8] AMD64 EDAC fixes Ghannam, Yazen
                   ` (10 preceding siblings ...)
  2019-08-22  0:50 ` [PATCH v3 0/8] AMD64 EDAC fixes Adam Borowski
@ 2019-08-22 18:48 ` Ghannam, Yazen
  11 siblings, 0 replies; 31+ messages in thread
From: Ghannam, Yazen @ 2019-08-22 18:48 UTC (permalink / raw)
  To: linux-edac; +Cc: Ghannam, Yazen, linux-kernel, bp

From: Yazen Ghannam <yazen.ghannam@amd.com>

Return early before checking for ECC if the node does not have any
populated memory.

Free any cached hardware data before returning. Also, return 0 in this
case since this is not a failure. Other nodes may have memory and the
module should attempt to load an instance for them.

Move printing of hardware information to after the instance is
initialized, so that the information is only printed for nodes with
memory.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20190821235938.118710-11-Yazen.Ghannam@amd.com

v1->v2:
* Moved hardware info printing to after instance is initialized.
* Added message for when instance has no memory.

 drivers/edac/amd64_edac.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index c1cb0234f085..3f0fe6ed1fa3 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -2831,8 +2831,6 @@ static void read_mc_regs(struct amd64_pvt *pvt)
 	edac_dbg(1, "  DIMM type: %s\n", edac_mem_types[pvt->dram_type]);
 
 	determine_ecc_sym_sz(pvt);
-
-	dump_misc_regs(pvt);
 }
 
 /*
@@ -3505,6 +3503,23 @@ static int init_one_instance(struct amd64_pvt *pvt,
 	return ret;
 }
 
+static bool instance_has_memory(struct amd64_pvt *pvt)
+{
+	bool cs_enabled = false;
+	int num_channels = 2;
+	int cs = 0, dct = 0;
+
+	if (pvt->umc)
+		num_channels = num_umcs;
+
+	for (dct = 0; dct < num_channels; dct++) {
+		for_each_chip_select(cs, dct, pvt)
+			cs_enabled |= csrow_enabled(cs, dct, pvt);
+	}
+
+	return cs_enabled;
+}
+
 static int probe_one_instance(unsigned int nid)
 {
 	struct pci_dev *F3 = node_to_amd_nb(nid)->misc;
@@ -3535,6 +3550,12 @@ static int probe_one_instance(unsigned int nid)
 	if (ret < 0)
 		goto err_enable;
 
+	ret = 0;
+	if (!instance_has_memory(pvt)) {
+		amd64_warn("Node %d: DRAM ECC disabled. No DIMMs detected.\n", nid);
+		goto err_enable;
+	}
+
 	if (!ecc_enabled(pvt)) {
 		ret = 0;
 
@@ -3561,6 +3582,8 @@ static int probe_one_instance(unsigned int nid)
 		goto err_enable;
 	}
 
+	dump_misc_regs(pvt);
+
 	return ret;
 
 err_enable:
-- 
2.17.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC PATCH v2] EDAC/amd64: Check for memory before fully initializing an instance
  2019-08-22  0:00 ` [RFC PATCH v3 10/10] EDAC/amd64: Check for memory before fully initializing an instance Ghannam, Yazen
@ 2019-08-22 18:51   ` " Ghannam, Yazen
  0 siblings, 0 replies; 31+ messages in thread
From: Ghannam, Yazen @ 2019-08-22 18:51 UTC (permalink / raw)
  To: linux-edac; +Cc: Ghannam, Yazen, linux-kernel, bp

From: Yazen Ghannam <yazen.ghannam@amd.com>

Return early before checking for ECC if the node does not have any
populated memory.

Free any cached hardware data before returning. Also, return 0 in this
case since this is not a failure. Other nodes may have memory and the
module should attempt to load an instance for them.

Move printing of hardware information to after the instance is
initialized, so that the information is only printed for nodes with
memory.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20190821235938.118710-11-Yazen.Ghannam@amd.com

v1->v2:
* Moved hardware info printing to after instance is initialized.
* Added message for when instance has no memory.

 drivers/edac/amd64_edac.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index c1cb0234f085..3f0fe6ed1fa3 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -2831,8 +2831,6 @@ static void read_mc_regs(struct amd64_pvt *pvt)
 	edac_dbg(1, "  DIMM type: %s\n", edac_mem_types[pvt->dram_type]);
 
 	determine_ecc_sym_sz(pvt);
-
-	dump_misc_regs(pvt);
 }
 
 /*
@@ -3505,6 +3503,23 @@ static int init_one_instance(struct amd64_pvt *pvt,
 	return ret;
 }
 
+static bool instance_has_memory(struct amd64_pvt *pvt)
+{
+	bool cs_enabled = false;
+	int num_channels = 2;
+	int cs = 0, dct = 0;
+
+	if (pvt->umc)
+		num_channels = num_umcs;
+
+	for (dct = 0; dct < num_channels; dct++) {
+		for_each_chip_select(cs, dct, pvt)
+			cs_enabled |= csrow_enabled(cs, dct, pvt);
+	}
+
+	return cs_enabled;
+}
+
 static int probe_one_instance(unsigned int nid)
 {
 	struct pci_dev *F3 = node_to_amd_nb(nid)->misc;
@@ -3535,6 +3550,12 @@ static int probe_one_instance(unsigned int nid)
 	if (ret < 0)
 		goto err_enable;
 
+	ret = 0;
+	if (!instance_has_memory(pvt)) {
+		amd64_warn("Node %d: DRAM ECC disabled. No DIMMs detected.\n", nid);
+		goto err_enable;
+	}
+
 	if (!ecc_enabled(pvt)) {
 		ret = 0;
 
@@ -3561,6 +3582,8 @@ static int probe_one_instance(unsigned int nid)
 		goto err_enable;
 	}
 
+	dump_misc_regs(pvt);
+
 	return ret;
 
 err_enable:
-- 
2.17.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH v3 0/8] AMD64 EDAC fixes
  2019-08-22  0:50 ` [PATCH v3 0/8] AMD64 EDAC fixes Adam Borowski
  2019-08-22  8:35   ` Borislav Petkov
@ 2019-08-22 18:54   ` Ghannam, Yazen
  2019-08-23 15:28     ` Ghannam, Yazen
  1 sibling, 1 reply; 31+ messages in thread
From: Ghannam, Yazen @ 2019-08-22 18:54 UTC (permalink / raw)
  To: Adam Borowski; +Cc: linux-edac, linux-kernel, bp

> -----Original Message-----
> From: linux-edac-owner@vger.kernel.org <linux-edac-owner@vger.kernel.org> On Behalf Of Adam Borowski
> Sent: Wednesday, August 21, 2019 7:50 PM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org; bp@alien8.de
> Subject: Re: [PATCH v3 0/8] AMD64 EDAC fixes
> 
> On Wed, Aug 21, 2019 at 11:59:53PM +0000, Ghannam, Yazen wrote:
> > I've also added RFC patches to avoid the "ECC disabled" message for
> > nodes without memory. I haven't fully tested these, but I wanted to get
> > your thoughts. Here's an earlier discussion:
> > https://lkml.kernel.org/r/20180321191335.7832-1-Yazen.Ghannam@amd.com
> 
> While you're editing that code, could you please also cut the spam if ECC is
> actually disabled?  For example, a 2990WX with non-ECC RAM gets 1024 lines;
> 64 copies of:
> 
> [    8.186164] EDAC amd64: Node 0: DRAM ECC disabled.
> [    8.188364] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
>                 Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
>                 (Note that use of the override may cause unknown side effects.)
> [    8.194762] EDAC amd64: Node 1: DRAM ECC disabled.
> [    8.196307] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
>                 Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
>                 (Note that use of the override may cause unknown side effects.)
> [    8.199840] EDAC amd64: Node 2: DRAM ECC disabled.
> [    8.200963] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
>                 Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
>                 (Note that use of the override may cause unknown side effects.)
> [    8.204326] EDAC amd64: Node 3: DRAM ECC disabled.
> [    8.205436] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
>                 Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
>                 (Note that use of the override may cause unknown side effects.)
> 

I wonder if the module is being loaded multiple times. I'll try to reproduce this and track it down.

Thanks for testing and reporting this issue.

-Yazen

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 7/8] EDAC/amd64: Support Asymmetric Dual-Rank DIMMs
  2019-08-22  0:00 ` [PATCH v3 7/8] EDAC/amd64: Support Asymmetric Dual-Rank DIMMs Ghannam, Yazen
@ 2019-08-23 11:26   ` Borislav Petkov
  2019-08-23 13:27     ` Ghannam, Yazen
  0 siblings, 1 reply; 31+ messages in thread
From: Borislav Petkov @ 2019-08-23 11:26 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: linux-edac, linux-kernel

On Thu, Aug 22, 2019 at 12:00:02AM +0000, Ghannam, Yazen wrote:
> From: Yazen Ghannam <yazen.ghannam@amd.com>
> 
> Future AMD systems will support "Asymmetric" Dual-Rank DIMMs. These are
> DIMMs where the ranks are of different sizes.
> 
> The even rank will use the Primary Even Chip Select registers and the
> odd rank will use the Secondary Odd Chip Select registers.
> 
> Recognize if a Secondary Odd Chip Select is being used. Use the
> Secondary Odd Address Mask when calculating the chip select size.
> 
> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
> ---
> Link:
> https://lkml.kernel.org/r/20190709215643.171078-8-Yazen.Ghannam@amd.com
> 
> v2->v3:
> * Add check of csrow_nr before using secondary mask.
> 
> v1->v2:
> * No change.
> 
>  drivers/edac/amd64_edac.c | 18 +++++++++++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index 26ce48fcaf00..4d1e6daa7ec4 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -790,9 +790,13 @@ static void debug_dump_dramcfg_low(struct amd64_pvt *pvt, u32 dclr, int chan)
>  
>  #define CS_EVEN_PRIMARY		BIT(0)
>  #define CS_ODD_PRIMARY		BIT(1)
> +#define CS_EVEN_SECONDARY	BIT(2)
> +#define CS_ODD_SECONDARY	BIT(3)
>  
> -#define CS_EVEN			CS_EVEN_PRIMARY
> -#define CS_ODD			CS_ODD_PRIMARY
> +#define CS_EVEN			(CS_EVEN_PRIMARY | CS_EVEN_SECONDARY)
> +#define CS_ODD			(CS_ODD_PRIMARY | CS_EVEN_SECONDARY)

That's just my urge to have stuff ballanced but shouldn't that last line be:

#define CS_ODD			(CS_ODD_PRIMARY | CS_ODD_SECONDARY)

i.e., not have "even" as in CS_EVEN_SECONDARY in there but only "odd"s? :)

> +#define csrow_sec_enabled(i, dct, pvt)	((pvt)->csels[(dct)].csbases_sec[(i)] & DCSB_CS_ENABLE)

I moved that to the header, under csrow_enabled().

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH v3 7/8] EDAC/amd64: Support Asymmetric Dual-Rank DIMMs
  2019-08-23 11:26   ` Borislav Petkov
@ 2019-08-23 13:27     ` Ghannam, Yazen
  2019-08-23 15:11       ` Borislav Petkov
  0 siblings, 1 reply; 31+ messages in thread
From: Ghannam, Yazen @ 2019-08-23 13:27 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-edac, linux-kernel

> -----Original Message-----
> From: linux-edac-owner@vger.kernel.org <linux-edac-owner@vger.kernel.org> On Behalf Of Borislav Petkov
> Sent: Friday, August 23, 2019 6:26 AM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v3 7/8] EDAC/amd64: Support Asymmetric Dual-Rank DIMMs
> 
> On Thu, Aug 22, 2019 at 12:00:02AM +0000, Ghannam, Yazen wrote:
> > From: Yazen Ghannam <yazen.ghannam@amd.com>
> >
> > Future AMD systems will support "Asymmetric" Dual-Rank DIMMs. These are
> > DIMMs where the ranks are of different sizes.
> >
> > The even rank will use the Primary Even Chip Select registers and the
> > odd rank will use the Secondary Odd Chip Select registers.
> >
> > Recognize if a Secondary Odd Chip Select is being used. Use the
> > Secondary Odd Address Mask when calculating the chip select size.
> >
> > Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
> > ---
> > Link:
> > https://lkml.kernel.org/r/20190709215643.171078-8-Yazen.Ghannam@amd.com
> >
> > v2->v3:
> > * Add check of csrow_nr before using secondary mask.
> >
> > v1->v2:
> > * No change.
> >
> >  drivers/edac/amd64_edac.c | 18 +++++++++++++++---
> >  1 file changed, 15 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> > index 26ce48fcaf00..4d1e6daa7ec4 100644
> > --- a/drivers/edac/amd64_edac.c
> > +++ b/drivers/edac/amd64_edac.c
> > @@ -790,9 +790,13 @@ static void debug_dump_dramcfg_low(struct amd64_pvt *pvt, u32 dclr, int chan)
> >
> >  #define CS_EVEN_PRIMARY		BIT(0)
> >  #define CS_ODD_PRIMARY		BIT(1)
> > +#define CS_EVEN_SECONDARY	BIT(2)
> > +#define CS_ODD_SECONDARY	BIT(3)
> >
> > -#define CS_EVEN			CS_EVEN_PRIMARY
> > -#define CS_ODD			CS_ODD_PRIMARY
> > +#define CS_EVEN			(CS_EVEN_PRIMARY | CS_EVEN_SECONDARY)
> > +#define CS_ODD			(CS_ODD_PRIMARY | CS_EVEN_SECONDARY)
> 
> That's just my urge to have stuff ballanced but shouldn't that last line be:
> 
> #define CS_ODD			(CS_ODD_PRIMARY | CS_ODD_SECONDARY)
> 
> i.e., not have "even" as in CS_EVEN_SECONDARY in there but only "odd"s? :)
> 

Yes, sorry I missed that.

> > +#define csrow_sec_enabled(i, dct, pvt)	((pvt)->csels[(dct)].csbases_sec[(i)] & DCSB_CS_ENABLE)
> 
> I moved that to the header, under csrow_enabled().
> 

Okay, thank you.

-Yazen

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 7/8] EDAC/amd64: Support Asymmetric Dual-Rank DIMMs
  2019-08-23 13:27     ` Ghannam, Yazen
@ 2019-08-23 15:11       ` Borislav Petkov
  0 siblings, 0 replies; 31+ messages in thread
From: Borislav Petkov @ 2019-08-23 15:11 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: linux-edac, linux-kernel

On Fri, Aug 23, 2019 at 01:27:50PM +0000, Ghannam, Yazen wrote:
> Yes, sorry I missed that.

Ok, fixed. Version below. So I'm queueing all patches up to and
including this one. I have some more comments for the remaining ones but
they can wait.

Thx.

---
From: Yazen Ghannam <yazen.ghannam@amd.com>
Date: Thu, 22 Aug 2019 00:00:02 +0000
Subject: [PATCH] EDAC/amd64: Support asymmetric dual-rank DIMMs

Future AMD systems will support asymmetric dual-rank DIMMs. These are
DIMMs where the ranks are of different sizes.

The even rank will use the Primary Even Chip Select registers and the
odd rank will use the Secondary Odd Chip Select registers.

Recognize if a Secondary Odd Chip Select is being used. Use the
Secondary Odd Address Mask when calculating the chip select size.

 [ bp: move csrow_sec_enabled() to the header, fix CS_ODD define and
   tone-down the capitalized words spelling. ]

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20190821235938.118710-8-Yazen.Ghannam@amd.com
---
 drivers/edac/amd64_edac.c | 16 +++++++++++++---
 drivers/edac/amd64_edac.h |  3 ++-
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 23251bba8eb6..18ba9c898389 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -790,9 +790,11 @@ static void debug_dump_dramcfg_low(struct amd64_pvt *pvt, u32 dclr, int chan)
 
 #define CS_EVEN_PRIMARY		BIT(0)
 #define CS_ODD_PRIMARY		BIT(1)
+#define CS_EVEN_SECONDARY	BIT(2)
+#define CS_ODD_SECONDARY	BIT(3)
 
-#define CS_EVEN			CS_EVEN_PRIMARY
-#define CS_ODD			CS_ODD_PRIMARY
+#define CS_EVEN			(CS_EVEN_PRIMARY | CS_EVEN_SECONDARY)
+#define CS_ODD			(CS_ODD_PRIMARY | CS_ODD_SECONDARY)
 
 static int f17_get_cs_mode(int dimm, u8 ctrl, struct amd64_pvt *pvt)
 {
@@ -804,6 +806,10 @@ static int f17_get_cs_mode(int dimm, u8 ctrl, struct amd64_pvt *pvt)
 	if (csrow_enabled(2 * dimm + 1, ctrl, pvt))
 		cs_mode |= CS_ODD_PRIMARY;
 
+	/* Asymmetric dual-rank DIMM support. */
+	if (csrow_sec_enabled(2 * dimm + 1, ctrl, pvt))
+		cs_mode |= CS_ODD_SECONDARY;
+
 	return cs_mode;
 }
 
@@ -1600,7 +1606,11 @@ static int f17_addr_mask_to_cs_size(struct amd64_pvt *pvt, u8 umc,
 	 */
 	dimm = csrow_nr >> 1;
 
-	addr_mask_orig = pvt->csels[umc].csmasks[dimm];
+	/* Asymmetric dual-rank DIMM support. */
+	if ((csrow_nr & 1) && (cs_mode & CS_ODD_SECONDARY))
+		addr_mask_orig = pvt->csels[umc].csmasks_sec[dimm];
+	else
+		addr_mask_orig = pvt->csels[umc].csmasks[dimm];
 
 	/*
 	 * The number of zero bits in the mask is equal to the number of bits
diff --git a/drivers/edac/amd64_edac.h b/drivers/edac/amd64_edac.h
index 68f12de6e654..8addc4d95577 100644
--- a/drivers/edac/amd64_edac.h
+++ b/drivers/edac/amd64_edac.h
@@ -169,7 +169,8 @@
 #define DCSM0				0x60
 #define DCSM1				0x160
 
-#define csrow_enabled(i, dct, pvt)	((pvt)->csels[(dct)].csbases[(i)] & DCSB_CS_ENABLE)
+#define csrow_enabled(i, dct, pvt)	((pvt)->csels[(dct)].csbases[(i)]     & DCSB_CS_ENABLE)
+#define csrow_sec_enabled(i, dct, pvt)	((pvt)->csels[(dct)].csbases_sec[(i)] & DCSB_CS_ENABLE)
 
 #define DRAM_CONTROL			0x78
 
-- 
2.21.0


-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH v3 0/8] AMD64 EDAC fixes
  2019-08-22 18:54   ` Ghannam, Yazen
@ 2019-08-23 15:28     ` Ghannam, Yazen
  2019-08-23 15:37       ` Borislav Petkov
  0 siblings, 1 reply; 31+ messages in thread
From: Ghannam, Yazen @ 2019-08-23 15:28 UTC (permalink / raw)
  To: Ghannam, Yazen, Adam Borowski; +Cc: linux-edac, linux-kernel, bp

> -----Original Message-----
> From: linux-edac-owner@vger.kernel.org <linux-edac-owner@vger.kernel.org> On Behalf Of Ghannam, Yazen
> Sent: Thursday, August 22, 2019 1:54 PM
> To: Adam Borowski <kilobyte@angband.pl>
> Cc: linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org; bp@alien8.de
> Subject: RE: [PATCH v3 0/8] AMD64 EDAC fixes
> 
...
> I wonder if the module is being loaded multiple times. I'll try to reproduce this and track it down.
> 

I was able to reproduce a similar failure. I do see that the module is being loaded multiple times on failure.

Here's a call trace from one dump_stack() output:
[  +0.004964] CPU: 132 PID: 2680 Comm: systemd-udevd Not tainted 4.20.0-edac-debug+ #36
[  +0.009802] Call Trace:
[  +0.002727]  dump_stack+0x63/0x85
[  +0.003696]  amd64_edac_init+0x2163/0x3000 [amd64_edac_mod]
[  +0.006216]  ? __wake_up+0x13/0x20
[  +0.003790]  ? 0xffffffffc120d000
[  +0.003694]  do_one_initcall+0x4a/0x1c9
[  +0.004277]  ? _cond_resched+0x19/0x40
[  +0.004178]  ? kmem_cache_alloc_trace+0x15c/0x1d0
[  +0.005244]  do_init_module+0x5f/0x216
[  +0.004180]  load_module+0x21d5/0x2ac0
[  +0.004179]  ? wait_woken+0x80/0x80
[  +0.003889]  __do_sys_finit_module+0xfc/0x120
[  +0.004858]  ? __do_sys_finit_module+0xfc/0x120
[  +0.005052]  __x64_sys_finit_module+0x1a/0x20
[  +0.004857]  do_syscall_64+0x5a/0x120
[  +0.004081]  entry_SYSCALL_64_after_hwframe+0x44/0xa9


So it seems that userspace (systemd-udevd) keeps trying to load the module. I'm not sure how to prevent this from within the module.

Boris,
Do you think it'd be appropriate to change the return values for some cases?

For example, ECC disabled is a hardware configuration. This doesn't mean that the module failed any operations in this case.

In other words, the module checks for a feature. If the feature is not present, then return without failure (and maybe give a message).

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 0/8] AMD64 EDAC fixes
  2019-08-23 15:28     ` Ghannam, Yazen
@ 2019-08-23 15:37       ` Borislav Petkov
  2019-08-26 14:19         ` Ghannam, Yazen
  0 siblings, 1 reply; 31+ messages in thread
From: Borislav Petkov @ 2019-08-23 15:37 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: Adam Borowski, linux-edac, linux-kernel

On Fri, Aug 23, 2019 at 03:28:59PM +0000, Ghannam, Yazen wrote:
> Boris, Do you think it'd be appropriate to change the return values
> for some cases?
>
> For example, ECC disabled is a hardware configuration. This doesn't
> mean that the module failed any operations in this case.
>
> In other words, the module checks for a feature. If the feature is not
> present, then return without failure (and maybe give a message).

That makes sense but AFAICT if probe_one_instance() sees that ECC is not
enabled, it returns 0.

The "if (!edac_has_mcs())" check later is to verify that at least once
instance was loaded successfully and, if not, then return an error.

So where does it return failure?

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH v3 0/8] AMD64 EDAC fixes
  2019-08-23 15:37       ` Borislav Petkov
@ 2019-08-26 14:19         ` Ghannam, Yazen
  2019-08-26 14:59           ` Borislav Petkov
  0 siblings, 1 reply; 31+ messages in thread
From: Ghannam, Yazen @ 2019-08-26 14:19 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Adam Borowski, linux-edac, linux-kernel

> -----Original Message-----
> From: linux-edac-owner@vger.kernel.org <linux-edac-owner@vger.kernel.org> On Behalf Of Borislav Petkov
> Sent: Friday, August 23, 2019 10:38 AM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: Adam Borowski <kilobyte@angband.pl>; linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v3 0/8] AMD64 EDAC fixes
> 
> On Fri, Aug 23, 2019 at 03:28:59PM +0000, Ghannam, Yazen wrote:
> > Boris, Do you think it'd be appropriate to change the return values
> > for some cases?
> >
> > For example, ECC disabled is a hardware configuration. This doesn't
> > mean that the module failed any operations in this case.
> >
> > In other words, the module checks for a feature. If the feature is not
> > present, then return without failure (and maybe give a message).
> 
> That makes sense but AFAICT if probe_one_instance() sees that ECC is not
> enabled, it returns 0.
> 
> The "if (!edac_has_mcs())" check later is to verify that at least once
> instance was loaded successfully and, if not, then return an error.
> 
> So where does it return failure?
> 

I was tracking down the failure with ECC disabled, and that seems to be it.

So I think we should return 0 "if (!edac_has_mcs())", because we'd only get
there if ECC is disabled on all nodes and there wasn't some other initialization
error.

I'll send a patch for this soon.

Adam, would you mind testing this patch?

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 0/8] AMD64 EDAC fixes
  2019-08-26 14:19         ` Ghannam, Yazen
@ 2019-08-26 14:59           ` Borislav Petkov
  2019-08-26 15:05             ` Ghannam, Yazen
  0 siblings, 1 reply; 31+ messages in thread
From: Borislav Petkov @ 2019-08-26 14:59 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: Adam Borowski, linux-edac, linux-kernel

On Mon, Aug 26, 2019 at 02:19:18PM +0000, Ghannam, Yazen wrote:
> I was tracking down the failure with ECC disabled, and that seems to be it.
>
> So I think we should return 0 "if (!edac_has_mcs())", because we'd only get
> there if ECC is disabled on all nodes and there wasn't some other initialization
> error.
>
> I'll send a patch for this soon.
>
> Adam, would you mind testing this patch?

You can't return 0 when ECC is disabled on all nodes because then the
driver remains loaded without driving anything. That silly userspace
needs to understand that ENODEV means "stop trying to load this driver".

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH v3 0/8] AMD64 EDAC fixes
  2019-08-26 14:59           ` Borislav Petkov
@ 2019-08-26 15:05             ` Ghannam, Yazen
  0 siblings, 0 replies; 31+ messages in thread
From: Ghannam, Yazen @ 2019-08-26 15:05 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Adam Borowski, linux-edac, linux-kernel

> -----Original Message-----
> From: linux-edac-owner@vger.kernel.org <linux-edac-owner@vger.kernel.org> On Behalf Of Borislav Petkov
> Sent: Monday, August 26, 2019 9:59 AM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: Adam Borowski <kilobyte@angband.pl>; linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v3 0/8] AMD64 EDAC fixes
> 
> On Mon, Aug 26, 2019 at 02:19:18PM +0000, Ghannam, Yazen wrote:
> > I was tracking down the failure with ECC disabled, and that seems to be it.
> >
> > So I think we should return 0 "if (!edac_has_mcs())", because we'd only get
> > there if ECC is disabled on all nodes and there wasn't some other initialization
> > error.
> >
> > I'll send a patch for this soon.
> >
> > Adam, would you mind testing this patch?
> 
> You can't return 0 when ECC is disabled on all nodes because then the
> driver remains loaded without driving anything. That silly userspace
> needs to understand that ENODEV means "stop trying to load this driver".
> 

Yes, you're right.

I'll try and track down the interaction here between userspace and the module.
Please let me know if you have any suggestions.

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v3 08/10] EDAC/amd64: Gather hardware information early
  2019-08-22  0:00 ` [RFC PATCH v3 08/10] EDAC/amd64: Gather hardware information early Ghannam, Yazen
@ 2019-08-29  9:22   ` Borislav Petkov
  2019-09-06 19:14     ` Ghannam, Yazen
  0 siblings, 1 reply; 31+ messages in thread
From: Borislav Petkov @ 2019-08-29  9:22 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: linux-edac, linux-kernel

On Thu, Aug 22, 2019 at 12:00:02AM +0000, Ghannam, Yazen wrote:
> From: Yazen Ghannam <yazen.ghannam@amd.com>
> 
> Split out gathering hardware information from init_one_instance() into a
> separate function get_hardware_info().
> 
> This is necessary so that the information can be cached earlier and used
> to check if memory is populated and if ECC is enabled on a node.
> 
> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
> ---
>  drivers/edac/amd64_edac.c | 76 +++++++++++++++++++++++----------------
>  1 file changed, 45 insertions(+), 31 deletions(-)
> 
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index 4d1e6daa7ec4..84832771dec0 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -3405,34 +3405,17 @@ static void compute_num_umcs(void)
>  	edac_dbg(1, "Number of UMCs: %x", num_umcs);
>  }
>  
> -static int init_one_instance(unsigned int nid)
> +static int get_hardware_info(struct amd64_pvt *pvt,
> +			     struct amd64_family_type *fam_type)
>  {
> -	struct pci_dev *F3 = node_to_amd_nb(nid)->misc;
> -	struct amd64_family_type *fam_type = NULL;
> -	struct mem_ctl_info *mci = NULL;
> -	struct edac_mc_layer layers[2];
> -	struct amd64_pvt *pvt = NULL;
>  	u16 pci_id1, pci_id2;
> -	int err = 0, ret;
> -
> -	ret = -ENOMEM;
> -	pvt = kzalloc(sizeof(struct amd64_pvt), GFP_KERNEL);
> -	if (!pvt)
> -		goto err_ret;
> -
> -	pvt->mc_node_id	= nid;
> -	pvt->F3 = F3;
> -
> -	ret = -EINVAL;
> -	fam_type = per_family_init(pvt);
> -	if (!fam_type)
> -		goto err_free;
> +	int ret = -EINVAL;
>  
>  	if (pvt->fam >= 0x17) {
>  		pvt->umc = kcalloc(num_umcs, sizeof(struct amd64_umc), GFP_KERNEL);

Yeah, a get_hardware_info() function which does an allocation of that
struct amd64_umc on => F17 which is only 20 bytes. Just add it into the
pvt struct:

struct amd64_pvt {
	...
	struct amd64_umc umc;  /* UMC registers */
};

and be done with it. This should simplify the code flow here a bit and
20 bytes more per pvt is not a big deal.

And I know we do test "if (pvt->umc)" in a bunch of places but that can
be replaced with a "if (pvt->fam >= 0x17)" test which is equivalent.

And that conversion should be a single patch.

>  		if (!pvt->umc) {
>  			ret = -ENOMEM;
> -			goto err_free;
> +			goto err_ret;
>  		}
>  
>  		pci_id1 = fam_type->f0_id;
> @@ -3442,18 +3425,34 @@ static int init_one_instance(unsigned int nid)
>  		pci_id2 = fam_type->f2_id;
>  	}
>  
> -	err = reserve_mc_sibling_devs(pvt, pci_id1, pci_id2);
> -	if (err)
> +	ret = reserve_mc_sibling_devs(pvt, pci_id1, pci_id2);
> +	if (ret)
>  		goto err_post_init;
>  
>  	read_mc_regs(pvt);
>  
> +	return 0;
> +
> +err_post_init:
> +	if (pvt->fam >= 0x17)
> +		kfree(pvt->umc);
> +
> +err_ret:
> +	return ret;
> +}
> +
> +static int init_one_instance(struct amd64_pvt *pvt,
> +			     struct amd64_family_type *fam_type)

Yeah, that fam_type can be made global. No need to hand it around in
functions since it is going to be a single struct per system. Do that in
another, separate patch please.

After you've done those things, this patch would become a lot simpler.

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [RFC PATCH v3 08/10] EDAC/amd64: Gather hardware information early
  2019-08-29  9:22   ` Borislav Petkov
@ 2019-09-06 19:14     ` Ghannam, Yazen
  2019-09-06 20:35       ` Borislav Petkov
  0 siblings, 1 reply; 31+ messages in thread
From: Ghannam, Yazen @ 2019-09-06 19:14 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-edac, linux-kernel

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org <linux-kernel-owner@vger.kernel.org> On Behalf Of Borislav Petkov
> Sent: Thursday, August 29, 2019 4:23 AM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [RFC PATCH v3 08/10] EDAC/amd64: Gather hardware information early
> 
> On Thu, Aug 22, 2019 at 12:00:02AM +0000, Ghannam, Yazen wrote:
> > From: Yazen Ghannam <yazen.ghannam@amd.com>
> >
> > Split out gathering hardware information from init_one_instance() into a
> > separate function get_hardware_info().
> >
> > This is necessary so that the information can be cached earlier and used
> > to check if memory is populated and if ECC is enabled on a node.
> >
> > Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
> > ---
> >  drivers/edac/amd64_edac.c | 76 +++++++++++++++++++++++----------------
> >  1 file changed, 45 insertions(+), 31 deletions(-)
> >
> > diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> > index 4d1e6daa7ec4..84832771dec0 100644
> > --- a/drivers/edac/amd64_edac.c
> > +++ b/drivers/edac/amd64_edac.c
> > @@ -3405,34 +3405,17 @@ static void compute_num_umcs(void)
> >  	edac_dbg(1, "Number of UMCs: %x", num_umcs);
> >  }
> >
> > -static int init_one_instance(unsigned int nid)
> > +static int get_hardware_info(struct amd64_pvt *pvt,
> > +			     struct amd64_family_type *fam_type)
> >  {
> > -	struct pci_dev *F3 = node_to_amd_nb(nid)->misc;
> > -	struct amd64_family_type *fam_type = NULL;
> > -	struct mem_ctl_info *mci = NULL;
> > -	struct edac_mc_layer layers[2];
> > -	struct amd64_pvt *pvt = NULL;
> >  	u16 pci_id1, pci_id2;
> > -	int err = 0, ret;
> > -
> > -	ret = -ENOMEM;
> > -	pvt = kzalloc(sizeof(struct amd64_pvt), GFP_KERNEL);
> > -	if (!pvt)
> > -		goto err_ret;
> > -
> > -	pvt->mc_node_id	= nid;
> > -	pvt->F3 = F3;
> > -
> > -	ret = -EINVAL;
> > -	fam_type = per_family_init(pvt);
> > -	if (!fam_type)
> > -		goto err_free;
> > +	int ret = -EINVAL;
> >
> >  	if (pvt->fam >= 0x17) {
> >  		pvt->umc = kcalloc(num_umcs, sizeof(struct amd64_umc), GFP_KERNEL);
> 
> Yeah, a get_hardware_info() function which does an allocation of that
> struct amd64_umc on => F17 which is only 20 bytes. Just add it into the
> pvt struct:
> 
> struct amd64_pvt {
> 	...
> 	struct amd64_umc umc;  /* UMC registers */
> };
> 
> and be done with it. This should simplify the code flow here a bit and
> 20 bytes more per pvt is not a big deal.
> 

This struct is used per channel, so we may have 2-8 per system. We could fix it at the max (8).
What do you think?

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v3 08/10] EDAC/amd64: Gather hardware information early
  2019-09-06 19:14     ` Ghannam, Yazen
@ 2019-09-06 20:35       ` Borislav Petkov
  2019-09-06 20:49         ` Ghannam, Yazen
  0 siblings, 1 reply; 31+ messages in thread
From: Borislav Petkov @ 2019-09-06 20:35 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: linux-edac, linux-kernel

On Fri, Sep 06, 2019 at 07:14:57PM +0000, Ghannam, Yazen wrote:
> This struct is used per channel, so we may have 2-8 per system.

Ah, true.

> We could fix it at the max (8). What do you think?

Anything in struct amd64_umc that is shared between those channels or
all max 8 of them can be distinct?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [RFC PATCH v3 08/10] EDAC/amd64: Gather hardware information early
  2019-09-06 20:35       ` Borislav Petkov
@ 2019-09-06 20:49         ` Ghannam, Yazen
  2019-09-09 15:31           ` Borislav Petkov
  0 siblings, 1 reply; 31+ messages in thread
From: Ghannam, Yazen @ 2019-09-06 20:49 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-edac, linux-kernel

> -----Original Message-----
> From: linux-edac-owner@vger.kernel.org <linux-edac-owner@vger.kernel.org> On Behalf Of Borislav Petkov
> Sent: Friday, September 6, 2019 3:35 PM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [RFC PATCH v3 08/10] EDAC/amd64: Gather hardware information early
> 
> On Fri, Sep 06, 2019 at 07:14:57PM +0000, Ghannam, Yazen wrote:
> > This struct is used per channel, so we may have 2-8 per system.
> 
> Ah, true.
> 
> > We could fix it at the max (8). What do you think?
> 
> Anything in struct amd64_umc that is shared between those channels or
> all max 8 of them can be distinct?
> 


All the fields are register values, and there are unique instances for each channel. They can
potentially all be different.

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v3 08/10] EDAC/amd64: Gather hardware information early
  2019-09-06 20:49         ` Ghannam, Yazen
@ 2019-09-09 15:31           ` Borislav Petkov
  0 siblings, 0 replies; 31+ messages in thread
From: Borislav Petkov @ 2019-09-09 15:31 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: linux-edac, linux-kernel

On Fri, Sep 06, 2019 at 08:49:52PM +0000, Ghannam, Yazen wrote:
> All the fields are register values, and there are unique instances for
> each channel. They can potentially all be different.

Hmm, ok, that's 160 bytes more per node. And on !Rome it is wasted. Oh
well, let's remain conservative then and allocate it only where needed.

Btw, upon a second look, compute_num_umcs() can be made part of
get_hardware_info() instead of having this as a separate function which
gets called only once before probe_one_instance().

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, back to index

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-21 23:59 [PATCH v3 0/8] AMD64 EDAC fixes Ghannam, Yazen
2019-08-21 23:59 ` [PATCH v3 1/8] EDAC/amd64: Support more than two controllers for chip selects handling Ghannam, Yazen
2019-08-21 23:59 ` [PATCH v3 2/8] EDAC/amd64: Recognize DRAM device type with EDAC_CTL_CAP Ghannam, Yazen
2019-08-21 23:59 ` [PATCH v3 3/8] EDAC/amd64: Initialize DIMM info for systems with more than two channels Ghannam, Yazen
2019-08-21 23:59 ` [PATCH v3 4/8] EDAC/amd64: Find Chip Select memory size using Address Mask Ghannam, Yazen
2019-08-22  0:00 ` [PATCH v3 5/8] EDAC/amd64: Decode syndrome before translating address Ghannam, Yazen
2019-08-22  0:00 ` [PATCH v3 6/8] EDAC/amd64: Cache secondary Chip Select registers Ghannam, Yazen
2019-08-22  0:00 ` [RFC PATCH v3 08/10] EDAC/amd64: Gather hardware information early Ghannam, Yazen
2019-08-29  9:22   ` Borislav Petkov
2019-09-06 19:14     ` Ghannam, Yazen
2019-09-06 20:35       ` Borislav Petkov
2019-09-06 20:49         ` Ghannam, Yazen
2019-09-09 15:31           ` Borislav Petkov
2019-08-22  0:00 ` [PATCH v3 7/8] EDAC/amd64: Support Asymmetric Dual-Rank DIMMs Ghannam, Yazen
2019-08-23 11:26   ` Borislav Petkov
2019-08-23 13:27     ` Ghannam, Yazen
2019-08-23 15:11       ` Borislav Petkov
2019-08-22  0:00 ` [RFC PATCH v3 10/10] EDAC/amd64: Check for memory before fully initializing an instance Ghannam, Yazen
2019-08-22 18:51   ` [RFC PATCH v2] " Ghannam, Yazen
2019-08-22  0:00 ` [RFC PATCH v3 09/10] EDAC/amd64: Use cached data when checking for ECC Ghannam, Yazen
2019-08-22  0:50 ` [PATCH v3 0/8] AMD64 EDAC fixes Adam Borowski
2019-08-22  8:35   ` Borislav Petkov
2019-08-22  9:46     ` Adam Borowski
2019-08-22  9:55       ` Borislav Petkov
2019-08-22 18:54   ` Ghannam, Yazen
2019-08-23 15:28     ` Ghannam, Yazen
2019-08-23 15:37       ` Borislav Petkov
2019-08-26 14:19         ` Ghannam, Yazen
2019-08-26 14:59           ` Borislav Petkov
2019-08-26 15:05             ` Ghannam, Yazen
2019-08-22 18:48 ` [RFC PATCH v2] EDAC/amd64: Check for memory before fully initializing an instance Ghannam, Yazen

Linux-EDAC Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-edac/0 linux-edac/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-edac linux-edac/ https://lore.kernel.org/linux-edac \
		linux-edac@vger.kernel.org
	public-inbox-index linux-edac

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-edac


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git