linux-edac.vger.kernel.org archive mirror
* [PATCH v2 0/6] AMD64 EDAC: Check for nodes without memory, etc.
@ 2019-10-22 20:35 Ghannam, Yazen
  2019-10-22 20:35 ` [PATCH v2 1/6] EDAC/amd64: Make struct amd64_family_type global Ghannam, Yazen
                   ` (6 more replies)
  0 siblings, 7 replies; 13+ messages in thread
From: Ghannam, Yazen @ 2019-10-22 20:35 UTC
  To: linux-edac; +Cc: Ghannam, Yazen, linux-kernel, bp

From: Yazen Ghannam <yazen.ghannam@amd.com>

Hi Boris,

Most of these patches address the issue where the module checks and
complains about DRAM ECC on nodes without memory.

Thanks,
Yazen

Link:
https://lkml.kernel.org/r/20191018153114.39378-1-Yazen.Ghannam@amd.com

Yazen Ghannam (6):
  EDAC/amd64: Make struct amd64_family_type global
  EDAC/amd64: Gather hardware information early
  EDAC/amd64: Save max number of controllers to family type
  EDAC/amd64: Use cached data when checking for ECC
  EDAC/amd64: Check for memory before fully initializing an instance
  EDAC/amd64: Set grain per DIMM

 drivers/edac/amd64_edac.c | 196 +++++++++++++++++++-------------------
 drivers/edac/amd64_edac.h |   2 +
 2 files changed, 100 insertions(+), 98 deletions(-)

-- 
2.17.1



* [PATCH v2 1/6] EDAC/amd64: Make struct amd64_family_type global
  2019-10-22 20:35 [PATCH v2 0/6] AMD64 EDAC: Check for nodes without memory, etc Ghannam, Yazen
@ 2019-10-22 20:35 ` Ghannam, Yazen
  2019-10-22 20:35 ` [PATCH v2 2/6] EDAC/amd64: Gather hardware information early Ghannam, Yazen
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Ghannam, Yazen @ 2019-10-22 20:35 UTC
  To: linux-edac; +Cc: Ghannam, Yazen, linux-kernel, bp

From: Yazen Ghannam <yazen.ghannam@amd.com>

The struct amd64_family_type doesn't change between multiple nodes and
instances of the module, so make it global.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20191018153114.39378-2-Yazen.Ghannam@amd.com

v1 -> v2:
* No change.

rfc -> v1:
* New patch based on suggestion from Boris.

 drivers/edac/amd64_edac.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index c1d4536ae466..b9a712819c68 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -16,6 +16,8 @@ module_param(ecc_enable_override, int, 0644);
 
 static struct msr __percpu *msrs;
 
+static struct amd64_family_type *fam_type;
+
 /* Per-node stuff */
 static struct ecc_settings **ecc_stngs;
 
@@ -3278,8 +3280,7 @@ f17h_determine_edac_ctl_cap(struct mem_ctl_info *mci, struct amd64_pvt *pvt)
 	}
 }
 
-static void setup_mci_misc_attrs(struct mem_ctl_info *mci,
-				 struct amd64_family_type *fam)
+static void setup_mci_misc_attrs(struct mem_ctl_info *mci)
 {
 	struct amd64_pvt *pvt = mci->pvt_info;
 
@@ -3298,7 +3299,7 @@ static void setup_mci_misc_attrs(struct mem_ctl_info *mci,
 
 	mci->edac_cap		= determine_edac_cap(pvt);
 	mci->mod_name		= EDAC_MOD_STR;
-	mci->ctl_name		= fam->ctl_name;
+	mci->ctl_name		= fam_type->ctl_name;
 	mci->dev_name		= pci_name(pvt->F3);
 	mci->ctl_page_to_phys	= NULL;
 
@@ -3312,8 +3313,6 @@ static void setup_mci_misc_attrs(struct mem_ctl_info *mci,
  */
 static struct amd64_family_type *per_family_init(struct amd64_pvt *pvt)
 {
-	struct amd64_family_type *fam_type = NULL;
-
 	pvt->ext_model  = boot_cpu_data.x86_model >> 4;
 	pvt->stepping	= boot_cpu_data.x86_stepping;
 	pvt->model	= boot_cpu_data.x86_model;
@@ -3420,7 +3419,6 @@ static void compute_num_umcs(void)
 static int init_one_instance(unsigned int nid)
 {
 	struct pci_dev *F3 = node_to_amd_nb(nid)->misc;
-	struct amd64_family_type *fam_type = NULL;
 	struct mem_ctl_info *mci = NULL;
 	struct edac_mc_layer layers[2];
 	struct amd64_pvt *pvt = NULL;
@@ -3497,7 +3495,7 @@ static int init_one_instance(unsigned int nid)
 	mci->pvt_info = pvt;
 	mci->pdev = &pvt->F3->dev;
 
-	setup_mci_misc_attrs(mci, fam_type);
+	setup_mci_misc_attrs(mci);
 
 	if (init_csrows(mci))
 		mci->edac_cap = EDAC_FLAG_NONE;
-- 
2.17.1



* [PATCH v2 2/6] EDAC/amd64: Gather hardware information early
  2019-10-22 20:35 [PATCH v2 0/6] AMD64 EDAC: Check for nodes without memory, etc Ghannam, Yazen
  2019-10-22 20:35 ` [PATCH v2 1/6] EDAC/amd64: Make struct amd64_family_type global Ghannam, Yazen
@ 2019-10-22 20:35 ` Ghannam, Yazen
  2019-10-22 20:35 ` [PATCH v2 3/6] EDAC/amd64: Save max number of controllers to family type Ghannam, Yazen
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Ghannam, Yazen @ 2019-10-22 20:35 UTC
  To: linux-edac; +Cc: Ghannam, Yazen, linux-kernel, bp

From: Yazen Ghannam <yazen.ghannam@amd.com>

Split out gathering hardware information from init_one_instance() into a
separate function hw_info_get().

This is necessary so that the information can be cached earlier and used
to check if memory is populated and if ECC is enabled on a node.

Also, define a function hw_info_put() to back out changes made in
hw_info_get(). Currently, this includes two actions: freeing reserved
PCI device siblings and freeing the allocated struct amd64_umc.

Check for an allocated PCI device (Function 0 for Family 17h or Function
1 for pre-Family 17h) before freeing, since hw_info_put() may be called
before PCI siblings are reserved.

Drop the family check when freeing pvt->umc. This will be NULL on
pre-Family 17h systems. However, kfree() is safe and will check for a
NULL pointer before freeing.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20191018153114.39378-3-Yazen.Ghannam@amd.com

v1 -> v2:
* Change get_hardware_info() to hw_info_get().
* Add hw_info_put() to back out changes from hw_info_get().

rfc -> v1:
* Fixup after making struct amd64_family_type fam_type global.
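
Illustration only, not part of the patch: the get/put pairing described in
the commit message can be sketched in user-space C. The names demo_pvt,
demo_hw_info_get() and demo_hw_info_put() are hypothetical, and a plain flag
stands in for the reserved PCI sibling devices. The point is that the put
side can be called from any error path, because it checks what was actually
acquired and relies on free(NULL) being a no-op, just as kfree(NULL) is.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct amd64_pvt; only the fields needed here. */
struct demo_pvt {
	unsigned int fam;	/* CPU family, e.g. 0x10 or 0x17 */
	int *umc;		/* allocated only on family 0x17 and later */
	int siblings_reserved;	/* stands in for the reserved PCI siblings */
};

/* Acquire what the later checks need; return 0 or a negative errno. */
int demo_hw_info_get(struct demo_pvt *pvt, size_t max_mcs)
{
	if (pvt->fam >= 0x17) {
		pvt->umc = calloc(max_mcs, sizeof(*pvt->umc));
		if (!pvt->umc)
			return -ENOMEM;
	}

	pvt->siblings_reserved = 1;
	return 0;
}

/* Undo demo_hw_info_get(); safe even if get failed or never ran. */
void demo_hw_info_put(struct demo_pvt *pvt)
{
	/* Only release the siblings if they were actually reserved. */
	if (pvt->siblings_reserved)
		pvt->siblings_reserved = 0;

	/* free(NULL) is a no-op, so no family check is needed here. */
	free(pvt->umc);
	pvt->umc = NULL;
}
```

With this shape, a caller needs a single cleanup label that calls the put
function, regardless of how far the get function progressed.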

 drivers/edac/amd64_edac.c | 101 +++++++++++++++++++-------------------
 1 file changed, 51 insertions(+), 50 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index b9a712819c68..df7dd9604bb2 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3416,34 +3416,15 @@ static void compute_num_umcs(void)
 	edac_dbg(1, "Number of UMCs: %x", num_umcs);
 }
 
-static int init_one_instance(unsigned int nid)
+static int hw_info_get(struct amd64_pvt *pvt)
 {
-	struct pci_dev *F3 = node_to_amd_nb(nid)->misc;
-	struct mem_ctl_info *mci = NULL;
-	struct edac_mc_layer layers[2];
-	struct amd64_pvt *pvt = NULL;
 	u16 pci_id1, pci_id2;
-	int err = 0, ret;
-
-	ret = -ENOMEM;
-	pvt = kzalloc(sizeof(struct amd64_pvt), GFP_KERNEL);
-	if (!pvt)
-		goto err_ret;
-
-	pvt->mc_node_id	= nid;
-	pvt->F3 = F3;
-
-	ret = -EINVAL;
-	fam_type = per_family_init(pvt);
-	if (!fam_type)
-		goto err_free;
+	int ret = -EINVAL;
 
 	if (pvt->fam >= 0x17) {
 		pvt->umc = kcalloc(num_umcs, sizeof(struct amd64_umc), GFP_KERNEL);
-		if (!pvt->umc) {
-			ret = -ENOMEM;
-			goto err_free;
-		}
+		if (!pvt->umc)
+			return -ENOMEM;
 
 		pci_id1 = fam_type->f0_id;
 		pci_id2 = fam_type->f6_id;
@@ -3452,21 +3433,37 @@ static int init_one_instance(unsigned int nid)
 		pci_id2 = fam_type->f2_id;
 	}
 
-	err = reserve_mc_sibling_devs(pvt, pci_id1, pci_id2);
-	if (err)
-		goto err_post_init;
+	ret = reserve_mc_sibling_devs(pvt, pci_id1, pci_id2);
+	if (ret)
+		return ret;
 
 	read_mc_regs(pvt);
 
+	return 0;
+}
+
+static void hw_info_put(struct amd64_pvt *pvt)
+{
+	if (pvt->F0 || pvt->F1)
+		free_mc_sibling_devs(pvt);
+
+	kfree(pvt->umc);
+}
+
+static int init_one_instance(struct amd64_pvt *pvt)
+{
+	struct mem_ctl_info *mci = NULL;
+	struct edac_mc_layer layers[2];
+	int ret = -EINVAL;
+
 	/*
 	 * We need to determine how many memory channels there are. Then use
 	 * that information for calculating the size of the dynamic instance
 	 * tables in the 'mci' structure.
 	 */
-	ret = -EINVAL;
 	pvt->channel_count = pvt->ops->early_channel_count(pvt);
 	if (pvt->channel_count < 0)
-		goto err_siblings;
+		return ret;
 
 	ret = -ENOMEM;
 	layers[0].type = EDAC_MC_LAYER_CHIP_SELECT;
@@ -3488,9 +3485,9 @@ static int init_one_instance(unsigned int nid)
 		layers[1].size = 2;
 	layers[1].is_virt_csrow = false;
 
-	mci = edac_mc_alloc(nid, ARRAY_SIZE(layers), layers, 0);
+	mci = edac_mc_alloc(pvt->mc_node_id, ARRAY_SIZE(layers), layers, 0);
 	if (!mci)
-		goto err_siblings;
+		return ret;
 
 	mci->pvt_info = pvt;
 	mci->pdev = &pvt->F3->dev;
@@ -3503,31 +3500,17 @@ static int init_one_instance(unsigned int nid)
 	ret = -ENODEV;
 	if (edac_mc_add_mc_with_groups(mci, amd64_edac_attr_groups)) {
 		edac_dbg(1, "failed edac_mc_add_mc()\n");
-		goto err_add_mc;
+		edac_mc_free(mci);
+		return ret;
 	}
 
 	return 0;
-
-err_add_mc:
-	edac_mc_free(mci);
-
-err_siblings:
-	free_mc_sibling_devs(pvt);
-
-err_post_init:
-	if (pvt->fam >= 0x17)
-		kfree(pvt->umc);
-
-err_free:
-	kfree(pvt);
-
-err_ret:
-	return ret;
 }
 
 static int probe_one_instance(unsigned int nid)
 {
 	struct pci_dev *F3 = node_to_amd_nb(nid)->misc;
+	struct amd64_pvt *pvt = NULL;
 	struct ecc_settings *s;
 	int ret;
 
@@ -3538,6 +3521,21 @@ static int probe_one_instance(unsigned int nid)
 
 	ecc_stngs[nid] = s;
 
+	pvt = kzalloc(sizeof(struct amd64_pvt), GFP_KERNEL);
+	if (!pvt)
+		goto err_settings;
+
+	pvt->mc_node_id	= nid;
+	pvt->F3 = F3;
+
+	fam_type = per_family_init(pvt);
+	if (!fam_type)
+		goto err_enable;
+
+	ret = hw_info_get(pvt);
+	if (ret < 0)
+		goto err_enable;
+
 	if (!ecc_enabled(F3, nid)) {
 		ret = 0;
 
@@ -3554,7 +3552,7 @@ static int probe_one_instance(unsigned int nid)
 			goto err_enable;
 	}
 
-	ret = init_one_instance(nid);
+	ret = init_one_instance(pvt);
 	if (ret < 0) {
 		amd64_err("Error probing instance: %d\n", nid);
 
@@ -3567,6 +3565,10 @@ static int probe_one_instance(unsigned int nid)
 	return ret;
 
 err_enable:
+	hw_info_put(pvt);
+	kfree(pvt);
+
+err_settings:
 	kfree(s);
 	ecc_stngs[nid] = NULL;
 
@@ -3593,14 +3595,13 @@ static void remove_one_instance(unsigned int nid)
 
 	restore_ecc_error_reporting(s, nid, F3);
 
-	free_mc_sibling_devs(pvt);
-
 	kfree(ecc_stngs[nid]);
 	ecc_stngs[nid] = NULL;
 
 	/* Free the EDAC CORE resources */
 	mci->pvt_info = NULL;
 
+	hw_info_put(pvt);
 	kfree(pvt);
 	edac_mc_free(mci);
 }
-- 
2.17.1



* [PATCH v2 3/6] EDAC/amd64: Save max number of controllers to family type
  2019-10-22 20:35 [PATCH v2 0/6] AMD64 EDAC: Check for nodes without memory, etc Ghannam, Yazen
  2019-10-22 20:35 ` [PATCH v2 1/6] EDAC/amd64: Make struct amd64_family_type global Ghannam, Yazen
  2019-10-22 20:35 ` [PATCH v2 2/6] EDAC/amd64: Gather hardware information early Ghannam, Yazen
@ 2019-10-22 20:35 ` Ghannam, Yazen
  2019-10-22 20:35 ` [PATCH v2 4/6] EDAC/amd64: Use cached data when checking for ECC Ghannam, Yazen
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Ghannam, Yazen @ 2019-10-22 20:35 UTC
  To: linux-edac; +Cc: Ghannam, Yazen, linux-kernel, bp

From: Yazen Ghannam <yazen.ghannam@amd.com>

The maximum number of memory controllers is fixed within a family/model
group. In most cases, this has been fixed at 2, but some systems may
have up to 8.

The struct amd64_family_type already contains family/model-specific
information, and this can be used rather than adding model checks to
various functions.

Create a new field in struct amd64_family_type for max_mcs.
Set this when setting other family type information, and use this when
needing the maximum number of memory controllers possible for a system.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20191018153114.39378-4-Yazen.Ghannam@amd.com

v1 -> v2:
* Change max_num_controllers to max_mcs.

rfc -> v1:
* New patch.
* Idea came from Boris' comment about compute_num_umcs().
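
Illustration only, not part of the patch: a minimal sketch of the
table-driven approach, where the controller limit lives next to the other
per-family data instead of being derived from model checks. The struct
demo_family_type and both helper functions are hypothetical; the three
max_mcs values are copied from the table in the diff below.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, trimmed-down version of struct amd64_family_type. */
struct demo_family_type {
	const char *ctl_name;
	unsigned char max_mcs;	/* maximum memory controllers per node */
};

/* Only three entries are shown; the values match the patch. */
static const struct demo_family_type demo_family_types[] = {
	{ .ctl_name = "F10h",      .max_mcs = 2 },
	{ .ctl_name = "F17h",      .max_mcs = 2 },
	{ .ctl_name = "F17h_M30h", .max_mcs = 8 },
};

/* Look up a family entry by name; return NULL if it is not in the table. */
const struct demo_family_type *demo_find_family(const char *name)
{
	size_t n = sizeof(demo_family_types) / sizeof(demo_family_types[0]);
	size_t i;

	for (i = 0; i < n; i++) {
		if (strcmp(demo_family_types[i].ctl_name, name) == 0)
			return &demo_family_types[i];
	}

	return NULL;
}

/* The channel layer size comes straight from the table, no model checks. */
unsigned int demo_channel_layer_size(const struct demo_family_type *fam)
{
	return fam->max_mcs;
}
```

Supporting a new model then means adding one table entry rather than
touching every function that needs the limit.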

 drivers/edac/amd64_edac.c | 44 +++++++++++++--------------------------
 drivers/edac/amd64_edac.h |  2 ++
 2 files changed, 16 insertions(+), 30 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index df7dd9604bb2..2d8129c8d183 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -21,9 +21,6 @@ static struct amd64_family_type *fam_type;
 /* Per-node stuff */
 static struct ecc_settings **ecc_stngs;
 
-/* Number of Unified Memory Controllers */
-static u8 num_umcs;
-
 /*
  * Valid scrub rates for the K8 hardware memory scrubber. We map the scrubbing
  * bandwidth to a valid bit pattern. The 'set' operation finds the 'matching-
@@ -456,7 +453,7 @@ static void get_cs_base_and_mask(struct amd64_pvt *pvt, int csrow, u8 dct,
 	for (i = 0; i < pvt->csels[dct].m_cnt; i++)
 
 #define for_each_umc(i) \
-	for (i = 0; i < num_umcs; i++)
+	for (i = 0; i < fam_type->max_mcs; i++)
 
 /*
  * @input_addr is an InputAddr associated with the node given by mci. Return the
@@ -2226,6 +2223,7 @@ static struct amd64_family_type family_types[] = {
 		.ctl_name = "K8",
 		.f1_id = PCI_DEVICE_ID_AMD_K8_NB_ADDRMAP,
 		.f2_id = PCI_DEVICE_ID_AMD_K8_NB_MEMCTL,
+		.max_mcs = 2,
 		.ops = {
 			.early_channel_count	= k8_early_channel_count,
 			.map_sysaddr_to_csrow	= k8_map_sysaddr_to_csrow,
@@ -2236,6 +2234,7 @@ static struct amd64_family_type family_types[] = {
 		.ctl_name = "F10h",
 		.f1_id = PCI_DEVICE_ID_AMD_10H_NB_MAP,
 		.f2_id = PCI_DEVICE_ID_AMD_10H_NB_DRAM,
+		.max_mcs = 2,
 		.ops = {
 			.early_channel_count	= f1x_early_channel_count,
 			.map_sysaddr_to_csrow	= f1x_map_sysaddr_to_csrow,
@@ -2246,6 +2245,7 @@ static struct amd64_family_type family_types[] = {
 		.ctl_name = "F15h",
 		.f1_id = PCI_DEVICE_ID_AMD_15H_NB_F1,
 		.f2_id = PCI_DEVICE_ID_AMD_15H_NB_F2,
+		.max_mcs = 2,
 		.ops = {
 			.early_channel_count	= f1x_early_channel_count,
 			.map_sysaddr_to_csrow	= f1x_map_sysaddr_to_csrow,
@@ -2256,6 +2256,7 @@ static struct amd64_family_type family_types[] = {
 		.ctl_name = "F15h_M30h",
 		.f1_id = PCI_DEVICE_ID_AMD_15H_M30H_NB_F1,
 		.f2_id = PCI_DEVICE_ID_AMD_15H_M30H_NB_F2,
+		.max_mcs = 2,
 		.ops = {
 			.early_channel_count	= f1x_early_channel_count,
 			.map_sysaddr_to_csrow	= f1x_map_sysaddr_to_csrow,
@@ -2266,6 +2267,7 @@ static struct amd64_family_type family_types[] = {
 		.ctl_name = "F15h_M60h",
 		.f1_id = PCI_DEVICE_ID_AMD_15H_M60H_NB_F1,
 		.f2_id = PCI_DEVICE_ID_AMD_15H_M60H_NB_F2,
+		.max_mcs = 2,
 		.ops = {
 			.early_channel_count	= f1x_early_channel_count,
 			.map_sysaddr_to_csrow	= f1x_map_sysaddr_to_csrow,
@@ -2276,6 +2278,7 @@ static struct amd64_family_type family_types[] = {
 		.ctl_name = "F16h",
 		.f1_id = PCI_DEVICE_ID_AMD_16H_NB_F1,
 		.f2_id = PCI_DEVICE_ID_AMD_16H_NB_F2,
+		.max_mcs = 2,
 		.ops = {
 			.early_channel_count	= f1x_early_channel_count,
 			.map_sysaddr_to_csrow	= f1x_map_sysaddr_to_csrow,
@@ -2286,6 +2289,7 @@ static struct amd64_family_type family_types[] = {
 		.ctl_name = "F16h_M30h",
 		.f1_id = PCI_DEVICE_ID_AMD_16H_M30H_NB_F1,
 		.f2_id = PCI_DEVICE_ID_AMD_16H_M30H_NB_F2,
+		.max_mcs = 2,
 		.ops = {
 			.early_channel_count	= f1x_early_channel_count,
 			.map_sysaddr_to_csrow	= f1x_map_sysaddr_to_csrow,
@@ -2296,6 +2300,7 @@ static struct amd64_family_type family_types[] = {
 		.ctl_name = "F17h",
 		.f0_id = PCI_DEVICE_ID_AMD_17H_DF_F0,
 		.f6_id = PCI_DEVICE_ID_AMD_17H_DF_F6,
+		.max_mcs = 2,
 		.ops = {
 			.early_channel_count	= f17_early_channel_count,
 			.dbam_to_cs		= f17_addr_mask_to_cs_size,
@@ -2305,6 +2310,7 @@ static struct amd64_family_type family_types[] = {
 		.ctl_name = "F17h_M10h",
 		.f0_id = PCI_DEVICE_ID_AMD_17H_M10H_DF_F0,
 		.f6_id = PCI_DEVICE_ID_AMD_17H_M10H_DF_F6,
+		.max_mcs = 2,
 		.ops = {
 			.early_channel_count	= f17_early_channel_count,
 			.dbam_to_cs		= f17_addr_mask_to_cs_size,
@@ -2314,6 +2320,7 @@ static struct amd64_family_type family_types[] = {
 		.ctl_name = "F17h_M30h",
 		.f0_id = PCI_DEVICE_ID_AMD_17H_M30H_DF_F0,
 		.f6_id = PCI_DEVICE_ID_AMD_17H_M30H_DF_F6,
+		.max_mcs = 8,
 		.ops = {
 			.early_channel_count	= f17_early_channel_count,
 			.dbam_to_cs		= f17_addr_mask_to_cs_size,
@@ -2323,6 +2330,7 @@ static struct amd64_family_type family_types[] = {
 		.ctl_name = "F17h_M70h",
 		.f0_id = PCI_DEVICE_ID_AMD_17H_M70H_DF_F0,
 		.f6_id = PCI_DEVICE_ID_AMD_17H_M70H_DF_F6,
+		.max_mcs = 2,
 		.ops = {
 			.early_channel_count	= f17_early_channel_count,
 			.dbam_to_cs		= f17_addr_mask_to_cs_size,
@@ -3400,29 +3408,13 @@ static const struct attribute_group *amd64_edac_attr_groups[] = {
 	NULL
 };
 
-/* Set the number of Unified Memory Controllers in the system. */
-static void compute_num_umcs(void)
-{
-	u8 model = boot_cpu_data.x86_model;
-
-	if (boot_cpu_data.x86 < 0x17)
-		return;
-
-	if (model >= 0x30 && model <= 0x3f)
-		num_umcs = 8;
-	else
-		num_umcs = 2;
-
-	edac_dbg(1, "Number of UMCs: %x", num_umcs);
-}
-
 static int hw_info_get(struct amd64_pvt *pvt)
 {
 	u16 pci_id1, pci_id2;
 	int ret = -EINVAL;
 
 	if (pvt->fam >= 0x17) {
-		pvt->umc = kcalloc(num_umcs, sizeof(struct amd64_umc), GFP_KERNEL);
+		pvt->umc = kcalloc(fam_type->max_mcs, sizeof(struct amd64_umc), GFP_KERNEL);
 		if (!pvt->umc)
 			return -ENOMEM;
 
@@ -3475,14 +3467,8 @@ static int init_one_instance(struct amd64_pvt *pvt)
 	 * Always allocate two channels since we can have setups with DIMMs on
 	 * only one channel. Also, this simplifies handling later for the price
 	 * of a couple of KBs tops.
-	 *
-	 * On Fam17h+, the number of controllers may be greater than two. So set
-	 * the size equal to the maximum number of UMCs.
 	 */
-	if (pvt->fam >= 0x17)
-		layers[1].size = num_umcs;
-	else
-		layers[1].size = 2;
+	layers[1].size = fam_type->max_mcs;
 	layers[1].is_virt_csrow = false;
 
 	mci = edac_mc_alloc(pvt->mc_node_id, ARRAY_SIZE(layers), layers, 0);
@@ -3667,8 +3653,6 @@ static int __init amd64_edac_init(void)
 	if (!msrs)
 		goto err_free;
 
-	compute_num_umcs();
-
 	for (i = 0; i < amd_nb_num(); i++) {
 		err = probe_one_instance(i);
 		if (err) {
diff --git a/drivers/edac/amd64_edac.h b/drivers/edac/amd64_edac.h
index 8c3cda81e619..9be31688110b 100644
--- a/drivers/edac/amd64_edac.h
+++ b/drivers/edac/amd64_edac.h
@@ -479,6 +479,8 @@ struct low_ops {
 struct amd64_family_type {
 	const char *ctl_name;
 	u16 f0_id, f1_id, f2_id, f6_id;
+	/* Maximum number of memory controllers per die/node. */
+	u8 max_mcs;
 	struct low_ops ops;
 };
 
-- 
2.17.1



* [PATCH v2 4/6] EDAC/amd64: Use cached data when checking for ECC
  2019-10-22 20:35 [PATCH v2 0/6] AMD64 EDAC: Check for nodes without memory, etc Ghannam, Yazen
                   ` (2 preceding siblings ...)
  2019-10-22 20:35 ` [PATCH v2 3/6] EDAC/amd64: Save max number of controllers to family type Ghannam, Yazen
@ 2019-10-22 20:35 ` Ghannam, Yazen
  2019-10-22 20:35 ` [PATCH v2 5/6] EDAC/amd64: Check for memory before fully initializing an instance Ghannam, Yazen
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Ghannam, Yazen @ 2019-10-22 20:35 UTC
  To: linux-edac; +Cc: Ghannam, Yazen, linux-kernel, bp

From: Yazen Ghannam <yazen.ghannam@amd.com>

Use the cached hardware data when checking for ECC, now that the data is
available earlier.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20191018153114.39378-5-Yazen.Ghannam@amd.com

v1 -> v2:
* No change.

rfc -> v1:
* No change.
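
Illustration only, not part of the patch: a user-space sketch of the mask
comparison done on cached register values. The struct demo_umc, the
function name and the two bit positions are assumptions made for this
sketch. The final policy (ECC counts as enabled only when at least one
controller is enabled and every enabled controller reports ECC) is also an
assumption, since that part of ecc_enabled() lies outside the hunk shown
below.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions are assumptions for this sketch. */
#define DEMO_UMC_ECC_ENABLED	(UINT32_C(1) << 30)
#define DEMO_UMC_SDP_INIT	(UINT32_C(1) << 31)

/* Hypothetical cache of the two registers the check needs. */
struct demo_umc {
	uint32_t sdp_ctrl;
	uint32_t umc_cap_hi;
};

bool demo_ecc_enabled(const struct demo_umc *umc, unsigned int max_mcs)
{
	uint8_t umc_en_mask = 0, ecc_en_mask = 0;
	unsigned int i;

	/* The masks are 8 bits wide, so at most 8 controllers are tracked. */
	for (i = 0; i < max_mcs && i < 8; i++) {
		/* Only check enabled controllers. */
		if (!(umc[i].sdp_ctrl & DEMO_UMC_SDP_INIT))
			continue;

		umc_en_mask |= (uint8_t)(1u << i);

		if (umc[i].umc_cap_hi & DEMO_UMC_ECC_ENABLED)
			ecc_en_mask |= (uint8_t)(1u << i);
	}

	/* Assumed policy: every enabled controller must have ECC enabled. */
	return umc_en_mask && umc_en_mask == ecc_en_mask;
}
```

Because the function only reads the cache, it performs no register access
of its own and cannot fail partway through the loop.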

 drivers/edac/amd64_edac.c | 20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 2d8129c8d183..6b6df53e8ae7 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3200,31 +3200,27 @@ static const char *ecc_msg =
 	"'ecc_enable_override'.\n"
 	" (Note that use of the override may cause unknown side effects.)\n";
 
-static bool ecc_enabled(struct pci_dev *F3, u16 nid)
+static bool ecc_enabled(struct amd64_pvt *pvt)
 {
+	u16 nid = pvt->mc_node_id;
 	bool nb_mce_en = false;
 	u8 ecc_en = 0, i;
 	u32 value;
 
 	if (boot_cpu_data.x86 >= 0x17) {
 		u8 umc_en_mask = 0, ecc_en_mask = 0;
+		struct amd64_umc *umc;
 
 		for_each_umc(i) {
-			u32 base = get_umc_base(i);
+			umc = &pvt->umc[i];
 
 			/* Only check enabled UMCs. */
-			if (amd_smn_read(nid, base + UMCCH_SDP_CTRL, &value))
-				continue;
-
-			if (!(value & UMC_SDP_INIT))
+			if (!(umc->sdp_ctrl & UMC_SDP_INIT))
 				continue;
 
 			umc_en_mask |= BIT(i);
 
-			if (amd_smn_read(nid, base + UMCCH_UMC_CAP_HI, &value))
-				continue;
-
-			if (value & UMC_ECC_ENABLED)
+			if (umc->umc_cap_hi & UMC_ECC_ENABLED)
 				ecc_en_mask |= BIT(i);
 		}
 
@@ -3237,7 +3233,7 @@ static bool ecc_enabled(struct pci_dev *F3, u16 nid)
 		/* Assume UMC MCA banks are enabled. */
 		nb_mce_en = true;
 	} else {
-		amd64_read_pci_cfg(F3, NBCFG, &value);
+		amd64_read_pci_cfg(pvt->F3, NBCFG, &value);
 
 		ecc_en = !!(value & NBCFG_ECC_ENABLE);
 
@@ -3522,7 +3518,7 @@ static int probe_one_instance(unsigned int nid)
 	if (ret < 0)
 		goto err_enable;
 
-	if (!ecc_enabled(F3, nid)) {
+	if (!ecc_enabled(pvt)) {
 		ret = 0;
 
 		if (!ecc_enable_override)
-- 
2.17.1



* [PATCH v2 5/6] EDAC/amd64: Check for memory before fully initializing an instance
  2019-10-22 20:35 [PATCH v2 0/6] AMD64 EDAC: Check for nodes without memory, etc Ghannam, Yazen
                   ` (3 preceding siblings ...)
  2019-10-22 20:35 ` [PATCH v2 4/6] EDAC/amd64: Use cached data when checking for ECC Ghannam, Yazen
@ 2019-10-22 20:35 ` Ghannam, Yazen
  2019-10-22 20:35 ` [PATCH v2 6/6] EDAC/amd64: Set grain per DIMM Ghannam, Yazen
  2019-10-25 13:34 ` [PATCH v2 0/6] AMD64 EDAC: Check for nodes without memory, etc Borislav Petkov
  6 siblings, 0 replies; 13+ messages in thread
From: Ghannam, Yazen @ 2019-10-22 20:35 UTC
  To: linux-edac; +Cc: Ghannam, Yazen, linux-kernel, bp

From: Yazen Ghannam <yazen.ghannam@amd.com>

Return early before checking for ECC if the node does not have any
populated memory.

Free any cached hardware data before returning. Also, return 0 in this
case since this is not a failure. Other nodes may have memory and the
module should attempt to load an instance for them.

Move printing of hardware information to after the instance is
initialized, so that the information is only printed for nodes with
memory.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20191018153114.39378-6-Yazen.Ghannam@amd.com

v1 -> v2:
* No change.

rfc -> v1:
* Change message severity to "info".
  * A node without memory is a valid configuration. The user doesn't
    need to be warned.
* Drop "DRAM ECC disabled" from message.
  * The message is given when no memory was detected on a node.
  * The state of DRAM ECC is not checked here.
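
Illustration only, not part of the patch: a user-space sketch of the
"any chip select enabled on any controller" test and of the early return
with 0 for an empty node. The types, the function names and the choice of
bit 0 as the enable bit are assumptions made for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAX_CS	8
/* Assumption for this sketch: bit 0 of a chip-select base means enabled. */
#define DEMO_CS_ENABLE	UINT32_C(1)

/* Hypothetical cache of the chip-select base registers of one controller. */
struct demo_ctrl {
	uint32_t cs_base[DEMO_MAX_CS];
};

/* True if any chip select on any controller of the node is enabled. */
bool demo_instance_has_memory(const struct demo_ctrl *ctrl,
			      unsigned int max_mcs)
{
	bool cs_enabled = false;
	unsigned int mc, cs;

	for (mc = 0; mc < max_mcs; mc++) {
		for (cs = 0; cs < DEMO_MAX_CS; cs++)
			cs_enabled |= (ctrl[mc].cs_base[cs] & DEMO_CS_ENABLE) != 0;
	}

	return cs_enabled;
}

/*
 * Return 0 both when the node is skipped and when it is set up; *loaded
 * tells the caller which of the two happened.
 */
int demo_probe(const struct demo_ctrl *ctrl, unsigned int max_mcs,
	       bool *loaded)
{
	*loaded = false;

	/* Not a failure: other nodes may still have memory. */
	if (!demo_instance_has_memory(ctrl, max_mcs))
		return 0;

	*loaded = true;
	return 0;
}
```

Returning 0 for the empty node lets a caller that loops over all nodes
continue instead of unwinding the instances it has already set up.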

 drivers/edac/amd64_edac.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 6b6df53e8ae7..114e7395daab 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -2848,8 +2848,6 @@ static void read_mc_regs(struct amd64_pvt *pvt)
 	edac_dbg(1, "  DIMM type: %s\n", edac_mem_types[pvt->dram_type]);
 
 	determine_ecc_sym_sz(pvt);
-
-	dump_misc_regs(pvt);
 }
 
 /*
@@ -3489,6 +3487,19 @@ static int init_one_instance(struct amd64_pvt *pvt)
 	return 0;
 }
 
+static bool instance_has_memory(struct amd64_pvt *pvt)
+{
+	bool cs_enabled = false;
+	int cs = 0, dct = 0;
+
+	for (dct = 0; dct < fam_type->max_mcs; dct++) {
+		for_each_chip_select(cs, dct, pvt)
+			cs_enabled |= csrow_enabled(cs, dct, pvt);
+	}
+
+	return cs_enabled;
+}
+
 static int probe_one_instance(unsigned int nid)
 {
 	struct pci_dev *F3 = node_to_amd_nb(nid)->misc;
@@ -3518,6 +3529,12 @@ static int probe_one_instance(unsigned int nid)
 	if (ret < 0)
 		goto err_enable;
 
+	ret = 0;
+	if (!instance_has_memory(pvt)) {
+		amd64_info("Node %d: No DIMMs detected.\n", nid);
+		goto err_enable;
+	}
+
 	if (!ecc_enabled(pvt)) {
 		ret = 0;
 
@@ -3544,6 +3561,8 @@ static int probe_one_instance(unsigned int nid)
 		goto err_enable;
 	}
 
+	dump_misc_regs(pvt);
+
 	return ret;
 
 err_enable:
-- 
2.17.1



* [PATCH v2 6/6] EDAC/amd64: Set grain per DIMM
  2019-10-22 20:35 [PATCH v2 0/6] AMD64 EDAC: Check for nodes without memory, etc Ghannam, Yazen
                   ` (4 preceding siblings ...)
  2019-10-22 20:35 ` [PATCH v2 5/6] EDAC/amd64: Check for memory before fully initializing an instance Ghannam, Yazen
@ 2019-10-22 20:35 ` Ghannam, Yazen
  2019-10-25 13:41   ` Borislav Petkov
  2019-10-25 13:34 ` [PATCH v2 0/6] AMD64 EDAC: Check for nodes without memory, etc Borislav Petkov
  6 siblings, 1 reply; 13+ messages in thread
From: Ghannam, Yazen @ 2019-10-22 20:35 UTC
  To: linux-edac; +Cc: Ghannam, Yazen, linux-kernel, bp

From: Yazen Ghannam <yazen.ghannam@amd.com>

The following commit introduced a warning on error reports without a
non-zero grain value.

  3724ace582d9 ("EDAC/mc: Fix grain_bits calculation")

The amd64_edac_mod module does not provide a value, so the warning will
be given on the first reported memory error.

Set the grain per DIMM to cacheline size (64 bytes). This is the current
recommendation.

Fixes: 3724ace582d9 ("EDAC/mc: Fix grain_bits calculation")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20191018153114.39378-7-Yazen.Ghannam@amd.com

v1 -> v2:
* No change.

rfc -> v1:
* New patch.
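
Illustration only, not part of the patch: plain arithmetic showing what a
64-byte grain means in address bits. The helper demo_grain_bits() is
hypothetical and is not the EDAC core's calculation, which is not shown in
this thread. It returns the smallest number of low address bits that
covers a grain of the given size in bytes.

```c
#include <limits.h>

/* Smallest number of address bits that covers `grain` bytes. */
unsigned int demo_grain_bits(unsigned long grain)
{
	const unsigned int max_bits = sizeof(unsigned long) * CHAR_BIT - 1;
	unsigned int bits = 0;

	/* Stop before the shift count reaches the width of the type. */
	while (bits < max_bits && (1UL << bits) < grain)
		bits++;

	return bits;
}
```

For a cacheline-sized grain of 64 bytes this gives 6, meaning an error
report locates a fault to within a 64-byte block.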

 drivers/edac/amd64_edac.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 114e7395daab..4ab7bcdede51 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -2944,6 +2944,7 @@ static int init_csrows_df(struct mem_ctl_info *mci)
 			dimm->mtype = pvt->dram_type;
 			dimm->edac_mode = edac_mode;
 			dimm->dtype = dev_type;
+			dimm->grain = 64;
 		}
 	}
 
@@ -3020,6 +3021,7 @@ static int init_csrows(struct mem_ctl_info *mci)
 			dimm = csrow->channels[j]->dimm;
 			dimm->mtype = pvt->dram_type;
 			dimm->edac_mode = edac_mode;
+			dimm->grain = 64;
 		}
 	}
 
-- 
2.17.1



* Re: [PATCH v2 0/6] AMD64 EDAC: Check for nodes without memory, etc.
  2019-10-22 20:35 [PATCH v2 0/6] AMD64 EDAC: Check for nodes without memory, etc Ghannam, Yazen
                   ` (5 preceding siblings ...)
  2019-10-22 20:35 ` [PATCH v2 6/6] EDAC/amd64: Set grain per DIMM Ghannam, Yazen
@ 2019-10-25 13:34 ` Borislav Petkov
  2019-11-01 15:19   ` Ghannam, Yazen
  6 siblings, 1 reply; 13+ messages in thread
From: Borislav Petkov @ 2019-10-25 13:34 UTC
  To: Ghannam, Yazen; +Cc: linux-edac, linux-kernel

On Tue, Oct 22, 2019 at 08:35:08PM +0000, Ghannam, Yazen wrote:
> From: Yazen Ghannam <yazen.ghannam@amd.com>
> 
> Hi Boris,
> 
> Most of these patches address the issue where the module checks and
> complains about DRAM ECC on nodes without memory.
> 
> Thanks,
> Yazen
> 
> Link:
> https://lkml.kernel.org/r/20191018153114.39378-1-Yazen.Ghannam@amd.com
> 
> Yazen Ghannam (6):
>   EDAC/amd64: Make struct amd64_family_type global
>   EDAC/amd64: Gather hardware information early
>   EDAC/amd64: Save max number of controllers to family type
>   EDAC/amd64: Use cached data when checking for ECC
>   EDAC/amd64: Check for memory before fully initializing an instance
>   EDAC/amd64: Set grain per DIMM
> 
>  drivers/edac/amd64_edac.c | 196 +++++++++++++++++++-------------------
>  drivers/edac/amd64_edac.h |   2 +
>  2 files changed, 100 insertions(+), 98 deletions(-)

Almost there: now it dumps the whole shebang twice. This is on an old
F10h box which doesn't have ECC DIMMs:

[    2.222853] EDAC MC: Ver: 3.0.0
[    2.226881] EDAC DEBUG: edac_mc_sysfs_init: device mc created
[    5.726912] EDAC amd64: F10h detected (node 0).
[    5.732709] EDAC DEBUG: reserve_mc_sibling_devs: F1: 0000:00:18.1
[    5.750886] EDAC DEBUG: reserve_mc_sibling_devs: F2: 0000:00:18.2
[    5.758427] EDAC DEBUG: reserve_mc_sibling_devs: F3: 0000:00:18.3
[    5.765871] EDAC DEBUG: read_mc_regs:   TOP_MEM:  0x00000000d0000000
[    5.774098] EDAC DEBUG: read_mc_regs:   TOP_MEM2: 0x0000000230000000
[    5.782339] EDAC DEBUG: read_dram_ctl_register: F2x110 (DCTSelLow): 0xffffffff, High range addrs at: 0xfffff800
[    5.793976] EDAC DEBUG: read_dram_ctl_register:   DCTs operate in ganged mode
[    5.802429] EDAC DEBUG: read_dram_ctl_register:   data interleave for ECC: enabled, DRAM cleared since last warm reset: yes
[    5.814702] EDAC DEBUG: read_dram_ctl_register:   channel interleave: enabled, interleave bits selector: 0x3
[    5.826142] EDAC DEBUG: read_mc_regs:   DRAM range[0], base: 0x0000ff0000000000; limit: 0x0000ff022fffffff
[    5.837070] EDAC DEBUG: read_mc_regs:    IntlvEn=Disabled; Range access: RW IntlvSel=0 DstNode=0
[    5.847061] EDAC DEBUG: read_dct_base_mask:   DCSB0[0]=0x00000001 reg: F2x40
[    5.854699] EDAC DEBUG: read_dct_base_mask:   DCSB1[0]=0x00000000 reg: F2x140
[    5.862763] EDAC DEBUG: read_dct_base_mask:   DCSB0[1]=0x00000101 reg: F2x44
[    5.870614] EDAC DEBUG: read_dct_base_mask:   DCSB1[1]=0x00000000 reg: F2x144
[    5.878457] EDAC DEBUG: read_dct_base_mask:   DCSB0[2]=0x00000201 reg: F2x48
[    5.888483] EDAC DEBUG: read_dct_base_mask:   DCSB1[2]=0x00000000 reg: F2x148
[    5.897359] EDAC DEBUG: read_dct_base_mask:   DCSB0[3]=0x00000301 reg: F2x4c
[    5.906307] EDAC DEBUG: read_dct_base_mask:   DCSB1[3]=0x00000000 reg: F2x14c
[    5.913698] EDAC DEBUG: read_dct_base_mask:   DCSB0[4]=0x00000000 reg: F2x50
[    5.921646] EDAC DEBUG: read_dct_base_mask:   DCSB1[4]=0x00000000 reg: F2x150
[    5.930415] EDAC DEBUG: read_dct_base_mask:   DCSB0[5]=0x00000000 reg: F2x54
[    5.937772] EDAC DEBUG: read_dct_base_mask:   DCSB1[5]=0x00000000 reg: F2x154
[    5.945684] EDAC DEBUG: read_dct_base_mask:   DCSB0[6]=0x00000000 reg: F2x58
[    5.953523] EDAC DEBUG: read_dct_base_mask:   DCSB1[6]=0x00000000 reg: F2x158
[    5.961546] EDAC DEBUG: read_dct_base_mask:   DCSB0[7]=0x00000000 reg: F2x5c
[    5.969385] EDAC DEBUG: read_dct_base_mask:   DCSB1[7]=0x00000000 reg: F2x15c
[    5.977333] EDAC DEBUG: read_dct_base_mask:     DCSM0[0]=0x00f83ce0 reg: F2x60
[    5.986777] EDAC DEBUG: read_dct_base_mask:     DCSM1[0]=0x00000000 reg: F2x160
[    6.000195] EDAC DEBUG: read_dct_base_mask:     DCSM0[1]=0x00f83ce0 reg: F2x64
[    6.012487] EDAC DEBUG: read_dct_base_mask:     DCSM1[1]=0x00000000 reg: F2x164
[    6.019946] EDAC DEBUG: read_dct_base_mask:     DCSM0[2]=0x00000000 reg: F2x68
[    6.027283] EDAC DEBUG: read_dct_base_mask:     DCSM1[2]=0x00000000 reg: F2x168
[    6.035342] EDAC DEBUG: read_dct_base_mask:     DCSM0[3]=0x00000000 reg: F2x6c
[    6.042800] EDAC DEBUG: read_dct_base_mask:     DCSM1[3]=0x00000000 reg: F2x16c
[    6.050913] EDAC DEBUG: read_mc_regs:   DIMM type: Unbuffered-DDR2
[    6.057183] EDAC DEBUG: nb_mce_bank_enabled_on_node: core: 0, MCG_CTL: 0x3f, NB MSR is enabled
[    6.065925] EDAC DEBUG: nb_mce_bank_enabled_on_node: core: 1, MCG_CTL: 0x3f, NB MSR is enabled
[    6.081200] EDAC amd64: Node 0: DRAM ECC disabled.
[    6.092690] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
[    6.208087] EDAC amd64: F10h detected (node 0).
[    6.212966] EDAC DEBUG: reserve_mc_sibling_devs: F1: 0000:00:18.1
[    6.235500] EDAC DEBUG: reserve_mc_sibling_devs: F2: 0000:00:18.2
[    6.241661] EDAC DEBUG: reserve_mc_sibling_devs: F3: 0000:00:18.3
[    6.252691] EDAC DEBUG: read_mc_regs:   TOP_MEM:  0x00000000d0000000
[    6.259134] EDAC DEBUG: read_mc_regs:   TOP_MEM2: 0x0000000230000000
[    6.265823] EDAC DEBUG: read_dram_ctl_register: F2x110 (DCTSelLow): 0xffffffff, High range addrs at: 0xfffff800
[    6.275978] EDAC DEBUG: read_dram_ctl_register:   DCTs operate in ganged mode
[    6.283271] EDAC DEBUG: read_dram_ctl_register:   data interleave for ECC: enabled, DRAM cleared since last warm reset: yes
[    6.294635] EDAC DEBUG: read_dram_ctl_register:   channel interleave: enabled, interleave bits selector: 0x3
[    6.304565] EDAC DEBUG: read_mc_regs:   DRAM range[0], base: 0x0000ff0000000000; limit: 0x0000ff022fffffff
[    6.314367] EDAC DEBUG: read_mc_regs:    IntlvEn=Disabled; Range access: RW IntlvSel=0 DstNode=0
[    6.323259] EDAC DEBUG: read_dct_base_mask:   DCSB0[0]=0x00000001 reg: F2x40
[    6.330434] EDAC DEBUG: read_dct_base_mask:   DCSB1[0]=0x00000000 reg: F2x140
[    6.337648] EDAC DEBUG: read_dct_base_mask:   DCSB0[1]=0x00000101 reg: F2x44
[    6.351551] EDAC DEBUG: read_dct_base_mask:   DCSB1[1]=0x00000000 reg: F2x144
[    6.364985] EDAC DEBUG: read_dct_base_mask:   DCSB0[2]=0x00000201 reg: F2x48
[    6.379708] EDAC DEBUG: read_dct_base_mask:   DCSB1[2]=0x00000000 reg: F2x148
[    6.386913] EDAC DEBUG: read_dct_base_mask:   DCSB0[3]=0x00000301 reg: F2x4c
[    6.394037] EDAC DEBUG: read_dct_base_mask:   DCSB1[3]=0x00000000 reg: F2x14c
[    6.401259] EDAC DEBUG: read_dct_base_mask:   DCSB0[4]=0x00000000 reg: F2x50
[    6.408377] EDAC DEBUG: read_dct_base_mask:   DCSB1[4]=0x00000000 reg: F2x150
[    6.415854] EDAC DEBUG: read_dct_base_mask:   DCSB0[5]=0x00000000 reg: F2x54
[    6.422976] EDAC DEBUG: read_dct_base_mask:   DCSB1[5]=0x00000000 reg: F2x154
[    6.430178] EDAC DEBUG: read_dct_base_mask:   DCSB0[6]=0x00000000 reg: F2x58
[    6.437300] EDAC DEBUG: read_dct_base_mask:   DCSB1[6]=0x00000000 reg: F2x158
[    6.444507] EDAC DEBUG: read_dct_base_mask:   DCSB0[7]=0x00000000 reg: F2x5c
[    6.451621] EDAC DEBUG: read_dct_base_mask:   DCSB1[7]=0x00000000 reg: F2x15c
[    6.458833] EDAC DEBUG: read_dct_base_mask:     DCSM0[0]=0x00f83ce0 reg: F2x60
[    6.466155] EDAC DEBUG: read_dct_base_mask:     DCSM1[0]=0x00000000 reg: F2x160
[    6.473571] EDAC DEBUG: read_dct_base_mask:     DCSM0[1]=0x00f83ce0 reg: F2x64
[    6.480901] EDAC DEBUG: read_dct_base_mask:     DCSM1[1]=0x00000000 reg: F2x164
[    6.488305] EDAC DEBUG: read_dct_base_mask:     DCSM0[2]=0x00000000 reg: F2x68
[    6.495647] EDAC DEBUG: read_dct_base_mask:     DCSM1[2]=0x00000000 reg: F2x168
[    6.511447] EDAC DEBUG: read_dct_base_mask:     DCSM0[3]=0x00000000 reg: F2x6c
[    6.511448] EDAC DEBUG: read_dct_base_mask:     DCSM1[3]=0x00000000 reg: F2x16c
[    6.511451] EDAC DEBUG: read_mc_regs:   DIMM type: Unbuffered-DDR2
[    6.511458] EDAC DEBUG: nb_mce_bank_enabled_on_node: core: 0, MCG_CTL: 0x3f, NB MSR is enabled
[    6.511459] EDAC DEBUG: nb_mce_bank_enabled_on_node: core: 1, MCG_CTL: 0x3f, NB MSR is enabled
[    6.511460] EDAC amd64: Node 0: DRAM ECC disabled.
[    6.511461] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH v2 6/6] EDAC/amd64: Set grain per DIMM
  2019-10-22 20:35 ` [PATCH v2 6/6] EDAC/amd64: Set grain per DIMM Ghannam, Yazen
@ 2019-10-25 13:41   ` Borislav Petkov
  0 siblings, 0 replies; 13+ messages in thread
From: Borislav Petkov @ 2019-10-25 13:41 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: linux-edac, linux-kernel

On Tue, Oct 22, 2019 at 08:35:14PM +0000, Ghannam, Yazen wrote:
> From: Yazen Ghannam <yazen.ghannam@amd.com>
> 
> The following commit introduced a warning on error reports without a
> non-zero grain value.
> 
>   3724ace582d9 ("EDAC/mc: Fix grain_bits calculation")
> 
> The amd64_edac_mod module does not provide a value, so the warning will
> be given on the first reported memory error.
> 
> Set the grain per DIMM to cacheline size (64 bytes). This is the current
> recommendation.
> 
> Fixes: 3724ace582d9 ("EDAC/mc: Fix grain_bits calculation")
> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
> ---
> Link:
> https://lkml.kernel.org/r/20191018153114.39378-7-Yazen.Ghannam@amd.com
> 
> v1 -> v2:
> * No change.
> 
> rfc -> v1:
> * New patch.
> 
>  drivers/edac/amd64_edac.c | 2 ++
>  1 file changed, 2 insertions(+)

This one I can take now. Applied, thanks.
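
For readers following along, here is a rough, standalone sketch of why a 64-byte grain avoids the warning. It assumes the EDAC core turns the grain into a bit count by taking the position of the highest set bit of (grain - 1) and warns when the grain is zero; that is a reading of the commit message above, not a copy of the kernel code. The names highest_bit() and grain_to_bits() are hypothetical.

```c
/*
 * Hypothetical userspace model, not kernel code.
 * Returns the 1-based position of the highest set bit, or 0 if v == 0.
 */
static unsigned int highest_bit(unsigned long v)
{
	unsigned int n = 0;

	while (v) {
		n++;
		v >>= 1;
	}
	return n;
}

/*
 * Assumed conversion from a grain in bytes to a number of address bits.
 * A grain of 0 stands for "driver did not set a grain", which is the case
 * the warning is about; it is treated as a 1-byte grain here.
 */
unsigned int grain_to_bits(unsigned long grain)
{
	if (grain == 0)
		grain = 1;

	return highest_bit(grain - 1);
}
```

With a 64-byte cacheline grain this gives 6 bits, so an error address is reported at cacheline granularity.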

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* RE: [PATCH v2 0/6] AMD64 EDAC: Check for nodes without memory, etc.
  2019-10-25 13:34 ` [PATCH v2 0/6] AMD64 EDAC: Check for nodes without memory, etc Borislav Petkov
@ 2019-11-01 15:19   ` Ghannam, Yazen
  2019-11-01 15:54     ` Borislav Petkov
  0 siblings, 1 reply; 13+ messages in thread
From: Ghannam, Yazen @ 2019-11-01 15:19 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-edac, linux-kernel

> -----Original Message-----
> From: Borislav Petkov <bp@alien8.de>
> Sent: Friday, October 25, 2019 9:35 AM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v2 0/6] AMD64 EDAC: Check for nodes without memory, etc.
> 
> On Tue, Oct 22, 2019 at 08:35:08PM +0000, Ghannam, Yazen wrote:
> > From: Yazen Ghannam <yazen.ghannam@amd.com>
> >
> > Hi Boris,
> >
> > Most of these patches address the issue where the module checks and
> > complains about DRAM ECC on nodes without memory.
> >
> > Thanks,
> > Yazen
> >
> > Link:
> > https://lkml.kernel.org/r/20191018153114.39378-1-Yazen.Ghannam@amd.com
> >
> > Yazen Ghannam (6):
> >   EDAC/amd64: Make struct amd64_family_type global
> >   EDAC/amd64: Gather hardware information early
> >   EDAC/amd64: Save max number of controllers to family type
> >   EDAC/amd64: Use cached data when checking for ECC
> >   EDAC/amd64: Check for memory before fully initializing an instance
> >   EDAC/amd64: Set grain per DIMM
> >
> >  drivers/edac/amd64_edac.c | 196 +++++++++++++++++++-------------------
> >  drivers/edac/amd64_edac.h |   2 +
> >  2 files changed, 100 insertions(+), 98 deletions(-)
> 
> Almost there: now it dumps the whole shebang twice. This is on an old
> F10h box which doesn't have ECC DIMMs:
> 
> [    2.222853] EDAC MC: Ver: 3.0.0
> [    2.226881] EDAC DEBUG: edac_mc_sysfs_init: device mc created
> [    5.726912] EDAC amd64: F10h detected (node 0).
...
> [    6.208087] EDAC amd64: F10h detected (node 0).

Is the module being probed twice? We have this problem in general, e.g. the
module gets loaded multiple times on failure.

The clue for me is that node 0 gets detected twice. This is done in
per_family_init() early in probe_one_instance().

In any case, I think we can make !ecc_enabled(pvt) in probe_one_instance() a
failure now that we have an explicit check for memory on a node. In other
words, if we have memory and ECC is disabled then this is a failure for the
module.
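
A minimal sketch of the policy described above, as a standalone model rather than the driver code. The struct node_state, the enum values and probe_policy() are all hypothetical names; in the driver the two inputs would come from the new memory check and from ecc_enabled(pvt).

```c
#include <stdbool.h>

/* Hypothetical stand-in for what the driver knows about one node. */
struct node_state {
	bool has_memory;   /* result of the explicit memory check */
	bool ecc_enabled;  /* result of the DRAM ECC check */
};

/* Possible outcomes of probing one node under the proposed policy. */
enum probe_result {
	PROBE_SKIP_NODE,   /* no memory: nothing to do, not an error */
	PROBE_FAIL,        /* memory present but ECC disabled: fail the module */
	PROBE_INIT,        /* memory present and ECC enabled: keep initializing */
};

enum probe_result probe_policy(const struct node_state *node)
{
	/* Nodes without memory are skipped before ECC is even considered. */
	if (!node->has_memory)
		return PROBE_SKIP_NODE;

	/* With memory present, disabled ECC becomes a hard failure. */
	if (!node->ecc_enabled)
		return PROBE_FAIL;

	return PROBE_INIT;
}
```

The ordering is the point: the ECC check only runs for nodes that have memory, so a memoryless node can no longer trigger the "ECC disabled" complaint.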

Thanks,
Yazen


* Re: [PATCH v2 0/6] AMD64 EDAC: Check for nodes without memory, etc.
  2019-11-01 15:19   ` Ghannam, Yazen
@ 2019-11-01 15:54     ` Borislav Petkov
  2019-11-05 13:38       ` Ghannam, Yazen
  0 siblings, 1 reply; 13+ messages in thread
From: Borislav Petkov @ 2019-11-01 15:54 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: linux-edac, linux-kernel

On Fri, Nov 01, 2019 at 03:19:36PM +0000, Ghannam, Yazen wrote:
> Is the module being probed twice? We have this problem in general, e.g. the
> module gets loaded multiple times on failure.

Yap, it looks like it.

> The clue for me is that node 0 gets detected twice. This is done in
> per_family_init() early in probe_one_instance().
>
> In any case, I think we can make !ecc_enabled(pvt) in probe_one_instance() a
> failure now that we have an explicit check for memory on a node. In other
> words, if we have memory and ECC is disabled then this is a failure for the
> module.

Yeah, for that case we should be printing ecc_msg. Makes sense.

Thx.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--


* RE: [PATCH v2 0/6] AMD64 EDAC: Check for nodes without memory, etc.
  2019-11-01 15:54     ` Borislav Petkov
@ 2019-11-05 13:38       ` Ghannam, Yazen
  2019-11-05 13:48         ` Borislav Petkov
  0 siblings, 1 reply; 13+ messages in thread
From: Ghannam, Yazen @ 2019-11-05 13:38 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-edac, linux-kernel

> -----Original Message-----
> From: linux-edac-owner@vger.kernel.org <linux-edac-owner@vger.kernel.org> On Behalf Of Borislav Petkov
> Sent: Friday, November 1, 2019 11:54 AM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v2 0/6] AMD64 EDAC: Check for nodes without memory, etc.
> 
> On Fri, Nov 01, 2019 at 03:19:36PM +0000, Ghannam, Yazen wrote:
> > Is the module being probed twice? We have this problem in general, e.g. the
> > module gets loaded multiple times on failure.
> 
> Yap, it looks like it.
> 
> > The clue for me is that node 0 gets detected twice. This is done in
> > per_family_init() early in probe_one_instance().
> >
> > In any case, I think we can make !ecc_enabled(pvt) in probe_one_instance() a
> > failure now that we have an explicit check for memory on a node. In other
> > words, if we have memory and ECC is disabled then this is a failure for the
> > module.
> 
> Yeah, for that case we should be printing ecc_msg. Makes sense.
> 

Do you have any other comments on this set? Should I send another revision
with this change?

Thanks,
Yazen


* Re: [PATCH v2 0/6] AMD64 EDAC: Check for nodes without memory, etc.
  2019-11-05 13:38       ` Ghannam, Yazen
@ 2019-11-05 13:48         ` Borislav Petkov
  0 siblings, 0 replies; 13+ messages in thread
From: Borislav Petkov @ 2019-11-05 13:48 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: linux-edac, linux-kernel

On Tue, Nov 05, 2019 at 01:38:15PM +0000, Ghannam, Yazen wrote:
> Do you have any other comments on this set?

No, it looks good otherwise.

> Should I send another revision with this change?

Pls do, thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

