linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc.
@ 2019-11-06  1:24 Ghannam, Yazen
  2019-11-06  1:24 ` [PATCH v3 1/5] EDAC/amd64: Make struct amd64_family_type global Ghannam, Yazen
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Ghannam, Yazen @ 2019-11-06  1:24 UTC (permalink / raw)
  To: linux-edac; +Cc: Ghannam, Yazen, linux-kernel, bp

From: Yazen Ghannam <yazen.ghannam@amd.com>

Hi Boris,

These patches address the issue where the module checks and complains
about DRAM ECC on nodes without memory.

Changes from last revision:
  1) Dropped patch 6 which was for adding a grain value.
  2) Added an error code for !ecc_enabled() in patch 5.

Thanks,
Yazen

Link:
https://lkml.kernel.org/r/20191022203448.13962-1-Yazen.Ghannam@amd.com

Yazen Ghannam (5):
  EDAC/amd64: Make struct amd64_family_type global
  EDAC/amd64: Gather hardware information early
  EDAC/amd64: Save max number of controllers to family type
  EDAC/amd64: Use cached data when checking for ECC
  EDAC/amd64: Check for memory before fully initializing an instance

 drivers/edac/amd64_edac.c | 196 +++++++++++++++++++-------------------
 drivers/edac/amd64_edac.h |   2 +
 2 files changed, 99 insertions(+), 99 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v3 1/5] EDAC/amd64: Make struct amd64_family_type global
  2019-11-06  1:24 [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc Ghannam, Yazen
@ 2019-11-06  1:24 ` Ghannam, Yazen
  2019-11-06  1:25 ` [PATCH v3 2/5] EDAC/amd64: Gather hardware information early Ghannam, Yazen
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Ghannam, Yazen @ 2019-11-06  1:24 UTC (permalink / raw)
  To: linux-edac; +Cc: Ghannam, Yazen, linux-kernel, bp

From: Yazen Ghannam <yazen.ghannam@amd.com>

The struct amd64_family_type doesn't change between multiple nodes and
instances of the modules, so make it global.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20191022203448.13962-2-Yazen.Ghannam@amd.com

v2 -> v3:
* No change.

v1 -> v2:
* No change.

rfc -> v1:
* New patch based on suggestion from Boris.

 drivers/edac/amd64_edac.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index cc5e56d752c8..83c659e38084 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -16,6 +16,8 @@ module_param(ecc_enable_override, int, 0644);
 
 static struct msr __percpu *msrs;
 
+static struct amd64_family_type *fam_type;
+
 /* Per-node stuff */
 static struct ecc_settings **ecc_stngs;
 
@@ -3280,8 +3282,7 @@ f17h_determine_edac_ctl_cap(struct mem_ctl_info *mci, struct amd64_pvt *pvt)
 	}
 }
 
-static void setup_mci_misc_attrs(struct mem_ctl_info *mci,
-				 struct amd64_family_type *fam)
+static void setup_mci_misc_attrs(struct mem_ctl_info *mci)
 {
 	struct amd64_pvt *pvt = mci->pvt_info;
 
@@ -3300,7 +3301,7 @@ static void setup_mci_misc_attrs(struct mem_ctl_info *mci,
 
 	mci->edac_cap		= determine_edac_cap(pvt);
 	mci->mod_name		= EDAC_MOD_STR;
-	mci->ctl_name		= fam->ctl_name;
+	mci->ctl_name		= fam_type->ctl_name;
 	mci->dev_name		= pci_name(pvt->F3);
 	mci->ctl_page_to_phys	= NULL;
 
@@ -3314,8 +3315,6 @@ static void setup_mci_misc_attrs(struct mem_ctl_info *mci,
  */
 static struct amd64_family_type *per_family_init(struct amd64_pvt *pvt)
 {
-	struct amd64_family_type *fam_type = NULL;
-
 	pvt->ext_model  = boot_cpu_data.x86_model >> 4;
 	pvt->stepping	= boot_cpu_data.x86_stepping;
 	pvt->model	= boot_cpu_data.x86_model;
@@ -3422,7 +3421,6 @@ static void compute_num_umcs(void)
 static int init_one_instance(unsigned int nid)
 {
 	struct pci_dev *F3 = node_to_amd_nb(nid)->misc;
-	struct amd64_family_type *fam_type = NULL;
 	struct mem_ctl_info *mci = NULL;
 	struct edac_mc_layer layers[2];
 	struct amd64_pvt *pvt = NULL;
@@ -3499,7 +3497,7 @@ static int init_one_instance(unsigned int nid)
 	mci->pvt_info = pvt;
 	mci->pdev = &pvt->F3->dev;
 
-	setup_mci_misc_attrs(mci, fam_type);
+	setup_mci_misc_attrs(mci);
 
 	if (init_csrows(mci))
 		mci->edac_cap = EDAC_FLAG_NONE;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 2/5] EDAC/amd64: Gather hardware information early
  2019-11-06  1:24 [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc Ghannam, Yazen
  2019-11-06  1:24 ` [PATCH v3 1/5] EDAC/amd64: Make struct amd64_family_type global Ghannam, Yazen
@ 2019-11-06  1:25 ` Ghannam, Yazen
  2019-11-06  1:25 ` [PATCH v3 3/5] EDAC/amd64: Save max number of controllers to family type Ghannam, Yazen
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Ghannam, Yazen @ 2019-11-06  1:25 UTC (permalink / raw)
  To: linux-edac; +Cc: Ghannam, Yazen, linux-kernel, bp

From: Yazen Ghannam <yazen.ghannam@amd.com>

Split out gathering hardware information from init_one_instance() into a
separate function hw_info_get().

This is necessary so that the information can be cached earlier and used
to check if memory is populated and if ECC is enabled on a node.

Also, define a function hw_info_put() to back out changes made in
hw_info_get(). Currently, this includes two actions: freeing reserved
PCI device siblings and freeing the allocated struct amd64_umc.

Check for an allocated PCI device (Function 0 for Family 17h or Function
1 for pre-Family 17h) before freeing, since hw_info_put() may be called
before PCI siblings are reserved.

Drop the family check when freeing pvt->umc. This will be NULL on
pre-Family 17h systems. However, kfree() is safe and will check for a
NULL pointer before freeing.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20191022203448.13962-3-Yazen.Ghannam@amd.com

v2 -> v3:
* No change.

v1 -> v2:
* Change get_hardware_info() to hw_info_get().
* Add hw_info_put() to backout changes from hw_info_get().

rfc -> v1:
* Fixup after making struct amd64_family_type fam_type global.

 drivers/edac/amd64_edac.c | 101 +++++++++++++++++++-------------------
 1 file changed, 51 insertions(+), 50 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 83c659e38084..6e1c739b7fad 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3418,34 +3418,15 @@ static void compute_num_umcs(void)
 	edac_dbg(1, "Number of UMCs: %x", num_umcs);
 }
 
-static int init_one_instance(unsigned int nid)
+static int hw_info_get(struct amd64_pvt *pvt)
 {
-	struct pci_dev *F3 = node_to_amd_nb(nid)->misc;
-	struct mem_ctl_info *mci = NULL;
-	struct edac_mc_layer layers[2];
-	struct amd64_pvt *pvt = NULL;
 	u16 pci_id1, pci_id2;
-	int err = 0, ret;
-
-	ret = -ENOMEM;
-	pvt = kzalloc(sizeof(struct amd64_pvt), GFP_KERNEL);
-	if (!pvt)
-		goto err_ret;
-
-	pvt->mc_node_id	= nid;
-	pvt->F3 = F3;
-
-	ret = -EINVAL;
-	fam_type = per_family_init(pvt);
-	if (!fam_type)
-		goto err_free;
+	int ret = -EINVAL;
 
 	if (pvt->fam >= 0x17) {
 		pvt->umc = kcalloc(num_umcs, sizeof(struct amd64_umc), GFP_KERNEL);
-		if (!pvt->umc) {
-			ret = -ENOMEM;
-			goto err_free;
-		}
+		if (!pvt->umc)
+			return -ENOMEM;
 
 		pci_id1 = fam_type->f0_id;
 		pci_id2 = fam_type->f6_id;
@@ -3454,21 +3435,37 @@ static int init_one_instance(unsigned int nid)
 		pci_id2 = fam_type->f2_id;
 	}
 
-	err = reserve_mc_sibling_devs(pvt, pci_id1, pci_id2);
-	if (err)
-		goto err_post_init;
+	ret = reserve_mc_sibling_devs(pvt, pci_id1, pci_id2);
+	if (ret)
+		return ret;
 
 	read_mc_regs(pvt);
 
+	return 0;
+}
+
+static void hw_info_put(struct amd64_pvt *pvt)
+{
+	if (pvt->F0 || pvt->F1)
+		free_mc_sibling_devs(pvt);
+
+	kfree(pvt->umc);
+}
+
+static int init_one_instance(struct amd64_pvt *pvt)
+{
+	struct mem_ctl_info *mci = NULL;
+	struct edac_mc_layer layers[2];
+	int ret = -EINVAL;
+
 	/*
 	 * We need to determine how many memory channels there are. Then use
 	 * that information for calculating the size of the dynamic instance
 	 * tables in the 'mci' structure.
 	 */
-	ret = -EINVAL;
 	pvt->channel_count = pvt->ops->early_channel_count(pvt);
 	if (pvt->channel_count < 0)
-		goto err_siblings;
+		return ret;
 
 	ret = -ENOMEM;
 	layers[0].type = EDAC_MC_LAYER_CHIP_SELECT;
@@ -3490,9 +3487,9 @@ static int init_one_instance(unsigned int nid)
 		layers[1].size = 2;
 	layers[1].is_virt_csrow = false;
 
-	mci = edac_mc_alloc(nid, ARRAY_SIZE(layers), layers, 0);
+	mci = edac_mc_alloc(pvt->mc_node_id, ARRAY_SIZE(layers), layers, 0);
 	if (!mci)
-		goto err_siblings;
+		return ret;
 
 	mci->pvt_info = pvt;
 	mci->pdev = &pvt->F3->dev;
@@ -3505,31 +3502,17 @@ static int init_one_instance(unsigned int nid)
 	ret = -ENODEV;
 	if (edac_mc_add_mc_with_groups(mci, amd64_edac_attr_groups)) {
 		edac_dbg(1, "failed edac_mc_add_mc()\n");
-		goto err_add_mc;
+		edac_mc_free(mci);
+		return ret;
 	}
 
 	return 0;
-
-err_add_mc:
-	edac_mc_free(mci);
-
-err_siblings:
-	free_mc_sibling_devs(pvt);
-
-err_post_init:
-	if (pvt->fam >= 0x17)
-		kfree(pvt->umc);
-
-err_free:
-	kfree(pvt);
-
-err_ret:
-	return ret;
 }
 
 static int probe_one_instance(unsigned int nid)
 {
 	struct pci_dev *F3 = node_to_amd_nb(nid)->misc;
+	struct amd64_pvt *pvt = NULL;
 	struct ecc_settings *s;
 	int ret;
 
@@ -3540,6 +3523,21 @@ static int probe_one_instance(unsigned int nid)
 
 	ecc_stngs[nid] = s;
 
+	pvt = kzalloc(sizeof(struct amd64_pvt), GFP_KERNEL);
+	if (!pvt)
+		goto err_settings;
+
+	pvt->mc_node_id	= nid;
+	pvt->F3 = F3;
+
+	fam_type = per_family_init(pvt);
+	if (!fam_type)
+		goto err_enable;
+
+	ret = hw_info_get(pvt);
+	if (ret < 0)
+		goto err_enable;
+
 	if (!ecc_enabled(F3, nid)) {
 		ret = 0;
 
@@ -3556,7 +3554,7 @@ static int probe_one_instance(unsigned int nid)
 			goto err_enable;
 	}
 
-	ret = init_one_instance(nid);
+	ret = init_one_instance(pvt);
 	if (ret < 0) {
 		amd64_err("Error probing instance: %d\n", nid);
 
@@ -3569,6 +3567,10 @@ static int probe_one_instance(unsigned int nid)
 	return ret;
 
 err_enable:
+	hw_info_put(pvt);
+	kfree(pvt);
+
+err_settings:
 	kfree(s);
 	ecc_stngs[nid] = NULL;
 
@@ -3595,14 +3597,13 @@ static void remove_one_instance(unsigned int nid)
 
 	restore_ecc_error_reporting(s, nid, F3);
 
-	free_mc_sibling_devs(pvt);
-
 	kfree(ecc_stngs[nid]);
 	ecc_stngs[nid] = NULL;
 
 	/* Free the EDAC CORE resources */
 	mci->pvt_info = NULL;
 
+	hw_info_put(pvt);
 	kfree(pvt);
 	edac_mc_free(mci);
 }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 3/5] EDAC/amd64: Save max number of controllers to family type
  2019-11-06  1:24 [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc Ghannam, Yazen
  2019-11-06  1:24 ` [PATCH v3 1/5] EDAC/amd64: Make struct amd64_family_type global Ghannam, Yazen
  2019-11-06  1:25 ` [PATCH v3 2/5] EDAC/amd64: Gather hardware information early Ghannam, Yazen
@ 2019-11-06  1:25 ` Ghannam, Yazen
  2019-11-06  1:25 ` [PATCH v3 4/5] EDAC/amd64: Use cached data when checking for ECC Ghannam, Yazen
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Ghannam, Yazen @ 2019-11-06  1:25 UTC (permalink / raw)
  To: linux-edac; +Cc: Ghannam, Yazen, linux-kernel, bp

From: Yazen Ghannam <yazen.ghannam@amd.com>

The maximum number of memory controllers is fixed within a family/model
group. In most cases, this has been fixed at 2, but some systems may
have up to 8.

The struct amd64_family_type already contains family/model-specific
information, and this can be used rather than adding model checks to
various functions.

Create a new field in struct amd64_family_type for max_mcs.
Set this when setting other family type information, and use this when
needing the maximum number of memory controllers possible for a system.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20191022203448.13962-4-Yazen.Ghannam@amd.com

v2 -> v3:
* No change.

v1 -> v2:
* Change max_num_controllers to max_mcs.

rfc -> v1:
* New patch.
* Idea came up from Boris' comment about compute_num_umcs().

 drivers/edac/amd64_edac.c | 44 +++++++++++++--------------------------
 drivers/edac/amd64_edac.h |  2 ++
 2 files changed, 16 insertions(+), 30 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 6e1c739b7fad..110ed0d27998 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -21,9 +21,6 @@ static struct amd64_family_type *fam_type;
 /* Per-node stuff */
 static struct ecc_settings **ecc_stngs;
 
-/* Number of Unified Memory Controllers */
-static u8 num_umcs;
-
 /*
  * Valid scrub rates for the K8 hardware memory scrubber. We map the scrubbing
  * bandwidth to a valid bit pattern. The 'set' operation finds the 'matching-
@@ -456,7 +453,7 @@ static void get_cs_base_and_mask(struct amd64_pvt *pvt, int csrow, u8 dct,
 	for (i = 0; i < pvt->csels[dct].m_cnt; i++)
 
 #define for_each_umc(i) \
-	for (i = 0; i < num_umcs; i++)
+	for (i = 0; i < fam_type->max_mcs; i++)
 
 /*
  * @input_addr is an InputAddr associated with the node given by mci. Return the
@@ -2226,6 +2223,7 @@ static struct amd64_family_type family_types[] = {
 		.ctl_name = "K8",
 		.f1_id = PCI_DEVICE_ID_AMD_K8_NB_ADDRMAP,
 		.f2_id = PCI_DEVICE_ID_AMD_K8_NB_MEMCTL,
+		.max_mcs = 2,
 		.ops = {
 			.early_channel_count	= k8_early_channel_count,
 			.map_sysaddr_to_csrow	= k8_map_sysaddr_to_csrow,
@@ -2236,6 +2234,7 @@ static struct amd64_family_type family_types[] = {
 		.ctl_name = "F10h",
 		.f1_id = PCI_DEVICE_ID_AMD_10H_NB_MAP,
 		.f2_id = PCI_DEVICE_ID_AMD_10H_NB_DRAM,
+		.max_mcs = 2,
 		.ops = {
 			.early_channel_count	= f1x_early_channel_count,
 			.map_sysaddr_to_csrow	= f1x_map_sysaddr_to_csrow,
@@ -2246,6 +2245,7 @@ static struct amd64_family_type family_types[] = {
 		.ctl_name = "F15h",
 		.f1_id = PCI_DEVICE_ID_AMD_15H_NB_F1,
 		.f2_id = PCI_DEVICE_ID_AMD_15H_NB_F2,
+		.max_mcs = 2,
 		.ops = {
 			.early_channel_count	= f1x_early_channel_count,
 			.map_sysaddr_to_csrow	= f1x_map_sysaddr_to_csrow,
@@ -2256,6 +2256,7 @@ static struct amd64_family_type family_types[] = {
 		.ctl_name = "F15h_M30h",
 		.f1_id = PCI_DEVICE_ID_AMD_15H_M30H_NB_F1,
 		.f2_id = PCI_DEVICE_ID_AMD_15H_M30H_NB_F2,
+		.max_mcs = 2,
 		.ops = {
 			.early_channel_count	= f1x_early_channel_count,
 			.map_sysaddr_to_csrow	= f1x_map_sysaddr_to_csrow,
@@ -2266,6 +2267,7 @@ static struct amd64_family_type family_types[] = {
 		.ctl_name = "F15h_M60h",
 		.f1_id = PCI_DEVICE_ID_AMD_15H_M60H_NB_F1,
 		.f2_id = PCI_DEVICE_ID_AMD_15H_M60H_NB_F2,
+		.max_mcs = 2,
 		.ops = {
 			.early_channel_count	= f1x_early_channel_count,
 			.map_sysaddr_to_csrow	= f1x_map_sysaddr_to_csrow,
@@ -2276,6 +2278,7 @@ static struct amd64_family_type family_types[] = {
 		.ctl_name = "F16h",
 		.f1_id = PCI_DEVICE_ID_AMD_16H_NB_F1,
 		.f2_id = PCI_DEVICE_ID_AMD_16H_NB_F2,
+		.max_mcs = 2,
 		.ops = {
 			.early_channel_count	= f1x_early_channel_count,
 			.map_sysaddr_to_csrow	= f1x_map_sysaddr_to_csrow,
@@ -2286,6 +2289,7 @@ static struct amd64_family_type family_types[] = {
 		.ctl_name = "F16h_M30h",
 		.f1_id = PCI_DEVICE_ID_AMD_16H_M30H_NB_F1,
 		.f2_id = PCI_DEVICE_ID_AMD_16H_M30H_NB_F2,
+		.max_mcs = 2,
 		.ops = {
 			.early_channel_count	= f1x_early_channel_count,
 			.map_sysaddr_to_csrow	= f1x_map_sysaddr_to_csrow,
@@ -2296,6 +2300,7 @@ static struct amd64_family_type family_types[] = {
 		.ctl_name = "F17h",
 		.f0_id = PCI_DEVICE_ID_AMD_17H_DF_F0,
 		.f6_id = PCI_DEVICE_ID_AMD_17H_DF_F6,
+		.max_mcs = 2,
 		.ops = {
 			.early_channel_count	= f17_early_channel_count,
 			.dbam_to_cs		= f17_addr_mask_to_cs_size,
@@ -2305,6 +2310,7 @@ static struct amd64_family_type family_types[] = {
 		.ctl_name = "F17h_M10h",
 		.f0_id = PCI_DEVICE_ID_AMD_17H_M10H_DF_F0,
 		.f6_id = PCI_DEVICE_ID_AMD_17H_M10H_DF_F6,
+		.max_mcs = 2,
 		.ops = {
 			.early_channel_count	= f17_early_channel_count,
 			.dbam_to_cs		= f17_addr_mask_to_cs_size,
@@ -2314,6 +2320,7 @@ static struct amd64_family_type family_types[] = {
 		.ctl_name = "F17h_M30h",
 		.f0_id = PCI_DEVICE_ID_AMD_17H_M30H_DF_F0,
 		.f6_id = PCI_DEVICE_ID_AMD_17H_M30H_DF_F6,
+		.max_mcs = 8,
 		.ops = {
 			.early_channel_count	= f17_early_channel_count,
 			.dbam_to_cs		= f17_addr_mask_to_cs_size,
@@ -2323,6 +2330,7 @@ static struct amd64_family_type family_types[] = {
 		.ctl_name = "F17h_M70h",
 		.f0_id = PCI_DEVICE_ID_AMD_17H_M70H_DF_F0,
 		.f6_id = PCI_DEVICE_ID_AMD_17H_M70H_DF_F6,
+		.max_mcs = 2,
 		.ops = {
 			.early_channel_count	= f17_early_channel_count,
 			.dbam_to_cs		= f17_addr_mask_to_cs_size,
@@ -3402,29 +3410,13 @@ static const struct attribute_group *amd64_edac_attr_groups[] = {
 	NULL
 };
 
-/* Set the number of Unified Memory Controllers in the system. */
-static void compute_num_umcs(void)
-{
-	u8 model = boot_cpu_data.x86_model;
-
-	if (boot_cpu_data.x86 < 0x17)
-		return;
-
-	if (model >= 0x30 && model <= 0x3f)
-		num_umcs = 8;
-	else
-		num_umcs = 2;
-
-	edac_dbg(1, "Number of UMCs: %x", num_umcs);
-}
-
 static int hw_info_get(struct amd64_pvt *pvt)
 {
 	u16 pci_id1, pci_id2;
 	int ret = -EINVAL;
 
 	if (pvt->fam >= 0x17) {
-		pvt->umc = kcalloc(num_umcs, sizeof(struct amd64_umc), GFP_KERNEL);
+		pvt->umc = kcalloc(fam_type->max_mcs, sizeof(struct amd64_umc), GFP_KERNEL);
 		if (!pvt->umc)
 			return -ENOMEM;
 
@@ -3477,14 +3469,8 @@ static int init_one_instance(struct amd64_pvt *pvt)
 	 * Always allocate two channels since we can have setups with DIMMs on
 	 * only one channel. Also, this simplifies handling later for the price
 	 * of a couple of KBs tops.
-	 *
-	 * On Fam17h+, the number of controllers may be greater than two. So set
-	 * the size equal to the maximum number of UMCs.
 	 */
-	if (pvt->fam >= 0x17)
-		layers[1].size = num_umcs;
-	else
-		layers[1].size = 2;
+	layers[1].size = fam_type->max_mcs;
 	layers[1].is_virt_csrow = false;
 
 	mci = edac_mc_alloc(pvt->mc_node_id, ARRAY_SIZE(layers), layers, 0);
@@ -3669,8 +3655,6 @@ static int __init amd64_edac_init(void)
 	if (!msrs)
 		goto err_free;
 
-	compute_num_umcs();
-
 	for (i = 0; i < amd_nb_num(); i++) {
 		err = probe_one_instance(i);
 		if (err) {
diff --git a/drivers/edac/amd64_edac.h b/drivers/edac/amd64_edac.h
index 8c3cda81e619..9be31688110b 100644
--- a/drivers/edac/amd64_edac.h
+++ b/drivers/edac/amd64_edac.h
@@ -479,6 +479,8 @@ struct low_ops {
 struct amd64_family_type {
 	const char *ctl_name;
 	u16 f0_id, f1_id, f2_id, f6_id;
+	/* Maximum number of memory controllers per die/node. */
+	u8 max_mcs;
 	struct low_ops ops;
 };
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 5/5] EDAC/amd64: Check for memory before fully initializing an instance
  2019-11-06  1:24 [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc Ghannam, Yazen
                   ` (3 preceding siblings ...)
  2019-11-06  1:25 ` [PATCH v3 4/5] EDAC/amd64: Use cached data when checking for ECC Ghannam, Yazen
@ 2019-11-06  1:25 ` Ghannam, Yazen
  2019-11-06 16:06 ` [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc Borislav Petkov
  5 siblings, 0 replies; 16+ messages in thread
From: Ghannam, Yazen @ 2019-11-06  1:25 UTC (permalink / raw)
  To: linux-edac; +Cc: Ghannam, Yazen, linux-kernel, bp

From: Yazen Ghannam <yazen.ghannam@amd.com>

Return early before checking for ECC if the node does not have any
populated memory.

Free any cached hardware data before returning. Also, return 0 in this
case since this is not a failure. Other nodes may have memory and the
module should attempt to load an instance for them.

Move printing of hardware information to after the instance is
initialized, so that the information is only printed for nodes with
memory.

Return an error code when ECC is disabled. This check happens after
checking for memory. The module should explicitly fail to load if memory
is populated on a node and ECC is disabled.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20191022203448.13962-6-Yazen.Ghannam@amd.com

v2 -> v3:
* Add error code to !ecc_enabled() path.

v1 -> v2:
* No change.

rfc -> v1:
* Change message severity to "info".
  * Nodes without memory is a valid configuration. The user doesn't
    need to be warned.
* Drop "DRAM ECC disabled" from message.
  * The message is given when no memory was detected on a node.
  * The state of DRAM ECC is not checked here.

 drivers/edac/amd64_edac.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index d38ba7f17753..3aeb5173e200 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -2848,8 +2848,6 @@ static void read_mc_regs(struct amd64_pvt *pvt)
 	edac_dbg(1, "  DIMM type: %s\n", edac_mem_types[pvt->dram_type]);
 
 	determine_ecc_sym_sz(pvt);
-
-	dump_misc_regs(pvt);
 }
 
 /*
@@ -3491,6 +3489,19 @@ static int init_one_instance(struct amd64_pvt *pvt)
 	return 0;
 }
 
+static bool instance_has_memory(struct amd64_pvt *pvt)
+{
+	bool cs_enabled = false;
+	int cs = 0, dct = 0;
+
+	for (dct = 0; dct < fam_type->max_mcs; dct++) {
+		for_each_chip_select(cs, dct, pvt)
+			cs_enabled |= csrow_enabled(cs, dct, pvt);
+	}
+
+	return cs_enabled;
+}
+
 static int probe_one_instance(unsigned int nid)
 {
 	struct pci_dev *F3 = node_to_amd_nb(nid)->misc;
@@ -3520,8 +3531,14 @@ static int probe_one_instance(unsigned int nid)
 	if (ret < 0)
 		goto err_enable;
 
+	ret = 0;
+	if (!instance_has_memory(pvt)) {
+		amd64_info("Node %d: No DIMMs detected.\n", nid);
+		goto err_enable;
+	}
+
 	if (!ecc_enabled(pvt)) {
-		ret = 0;
+		ret = -ENODEV;
 
 		if (!ecc_enable_override)
 			goto err_enable;
@@ -3546,6 +3563,8 @@ static int probe_one_instance(unsigned int nid)
 		goto err_enable;
 	}
 
+	dump_misc_regs(pvt);
+
 	return ret;
 
 err_enable:
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 4/5] EDAC/amd64: Use cached data when checking for ECC
  2019-11-06  1:24 [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc Ghannam, Yazen
                   ` (2 preceding siblings ...)
  2019-11-06  1:25 ` [PATCH v3 3/5] EDAC/amd64: Save max number of controllers to family type Ghannam, Yazen
@ 2019-11-06  1:25 ` Ghannam, Yazen
  2019-11-06  1:25 ` [PATCH v3 5/5] EDAC/amd64: Check for memory before fully initializing an instance Ghannam, Yazen
  2019-11-06 16:06 ` [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc Borislav Petkov
  5 siblings, 0 replies; 16+ messages in thread
From: Ghannam, Yazen @ 2019-11-06  1:25 UTC (permalink / raw)
  To: linux-edac; +Cc: Ghannam, Yazen, linux-kernel, bp

From: Yazen Ghannam <yazen.ghannam@amd.com>

...now that the data is available earlier.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
Link:
https://lkml.kernel.org/r/20191022203448.13962-5-Yazen.Ghannam@amd.com

v2 -> v3:
* No change.

v1 -> v2:
* No change.

rfc -> v1:
* No change.

 drivers/edac/amd64_edac.c | 20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 110ed0d27998..d38ba7f17753 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3202,31 +3202,27 @@ static const char *ecc_msg =
 	"'ecc_enable_override'.\n"
 	" (Note that use of the override may cause unknown side effects.)\n";
 
-static bool ecc_enabled(struct pci_dev *F3, u16 nid)
+static bool ecc_enabled(struct amd64_pvt *pvt)
 {
+	u16 nid = pvt->mc_node_id;
 	bool nb_mce_en = false;
 	u8 ecc_en = 0, i;
 	u32 value;
 
 	if (boot_cpu_data.x86 >= 0x17) {
 		u8 umc_en_mask = 0, ecc_en_mask = 0;
+		struct amd64_umc *umc;
 
 		for_each_umc(i) {
-			u32 base = get_umc_base(i);
+			umc = &pvt->umc[i];
 
 			/* Only check enabled UMCs. */
-			if (amd_smn_read(nid, base + UMCCH_SDP_CTRL, &value))
-				continue;
-
-			if (!(value & UMC_SDP_INIT))
+			if (!(umc->sdp_ctrl & UMC_SDP_INIT))
 				continue;
 
 			umc_en_mask |= BIT(i);
 
-			if (amd_smn_read(nid, base + UMCCH_UMC_CAP_HI, &value))
-				continue;
-
-			if (value & UMC_ECC_ENABLED)
+			if (umc->umc_cap_hi & UMC_ECC_ENABLED)
 				ecc_en_mask |= BIT(i);
 		}
 
@@ -3239,7 +3235,7 @@ static bool ecc_enabled(struct pci_dev *F3, u16 nid)
 		/* Assume UMC MCA banks are enabled. */
 		nb_mce_en = true;
 	} else {
-		amd64_read_pci_cfg(F3, NBCFG, &value);
+		amd64_read_pci_cfg(pvt->F3, NBCFG, &value);
 
 		ecc_en = !!(value & NBCFG_ECC_ENABLE);
 
@@ -3524,7 +3520,7 @@ static int probe_one_instance(unsigned int nid)
 	if (ret < 0)
 		goto err_enable;
 
-	if (!ecc_enabled(F3, nid)) {
+	if (!ecc_enabled(pvt)) {
 		ret = 0;
 
 		if (!ecc_enable_override)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc.
  2019-11-06  1:24 [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc Ghannam, Yazen
                   ` (4 preceding siblings ...)
  2019-11-06  1:25 ` [PATCH v3 5/5] EDAC/amd64: Check for memory before fully initializing an instance Ghannam, Yazen
@ 2019-11-06 16:06 ` Borislav Petkov
  2019-11-06 18:16   ` Ghannam, Yazen
  5 siblings, 1 reply; 16+ messages in thread
From: Borislav Petkov @ 2019-11-06 16:06 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: linux-edac, linux-kernel

On Wed, Nov 06, 2019 at 01:24:59AM +0000, Ghannam, Yazen wrote:
> From: Yazen Ghannam <yazen.ghannam@amd.com>
> 
> Hi Boris,
> 
> These patches address the issue where the module checks and complains
> about DRAM ECC on nodes without memory.
> 
> Changes from last revision:
>   1) Dropped patch 6 which was for adding a grain value.
>   2) Added an error code for !ecc_enabled() in patch 5.

Still doesn't help. The load gets attempted twice still. Try reproducing
it on a small, single-node box where ECC is disabled.

[    2.590123] EDAC MC: Ver: 3.0.0
[    2.594153] EDAC DEBUG: edac_mc_sysfs_init: device mc created
[    5.946482] EDAC amd64: F10h detected (node 0).
[    5.952134] EDAC DEBUG: reserve_mc_sibling_devs: F1: 0000:00:18.1
[    5.958967] EDAC DEBUG: reserve_mc_sibling_devs: F2: 0000:00:18.2
[    5.969869] EDAC DEBUG: reserve_mc_sibling_devs: F3: 0000:00:18.3
[    5.981125] EDAC DEBUG: read_mc_regs:   TOP_MEM:  0x00000000d0000000
[    5.981126] EDAC DEBUG: read_mc_regs:   TOP_MEM2: 0x0000000230000000
[    5.981130] EDAC DEBUG: read_dram_ctl_register: F2x110 (DCTSelLow): 0xffffffff, High range addrs at: 0xfffff800
[    5.981131] EDAC DEBUG: read_dram_ctl_register:   DCTs operate in ganged mode
[    5.981132] EDAC DEBUG: read_dram_ctl_register:   data interleave for ECC: enabled, DRAM cleared since last warm reset: yes
[    5.981133] EDAC DEBUG: read_dram_ctl_register:   channel interleave: enabled, interleave bits selector: 0x3
[    5.981137] EDAC DEBUG: read_mc_regs:   DRAM range[0], base: 0x0000ff0000000000; limit: 0x0000ff022fffffff
[    5.981138] EDAC DEBUG: read_mc_regs:    IntlvEn=Disabled; Range access: RW IntlvSel=0 DstNode=0
[    5.981144] EDAC DEBUG: read_dct_base_mask:   DCSB0[0]=0x00000001 reg: F2x40
[    5.981146] EDAC DEBUG: read_dct_base_mask:   DCSB1[0]=0x00000000 reg: F2x140
[    5.981147] EDAC DEBUG: read_dct_base_mask:   DCSB0[1]=0x00000101 reg: F2x44
[    5.981148] EDAC DEBUG: read_dct_base_mask:   DCSB1[1]=0x00000000 reg: F2x144
[    5.981149] EDAC DEBUG: read_dct_base_mask:   DCSB0[2]=0x00000201 reg: F2x48
[    5.981150] EDAC DEBUG: read_dct_base_mask:   DCSB1[2]=0x00000000 reg: F2x148
[    5.981151] EDAC DEBUG: read_dct_base_mask:   DCSB0[3]=0x00000301 reg: F2x4c
[    5.981152] EDAC DEBUG: read_dct_base_mask:   DCSB1[3]=0x00000000 reg: F2x14c
[    5.981153] EDAC DEBUG: read_dct_base_mask:   DCSB0[4]=0x00000000 reg: F2x50
[    5.981154] EDAC DEBUG: read_dct_base_mask:   DCSB1[4]=0x00000000 reg: F2x150
[    5.981155] EDAC DEBUG: read_dct_base_mask:   DCSB0[5]=0x00000000 reg: F2x54
[    5.981156] EDAC DEBUG: read_dct_base_mask:   DCSB1[5]=0x00000000 reg: F2x154
[    5.981157] EDAC DEBUG: read_dct_base_mask:   DCSB0[6]=0x00000000 reg: F2x58
[    5.981158] EDAC DEBUG: read_dct_base_mask:   DCSB1[6]=0x00000000 reg: F2x158
[    5.981159] EDAC DEBUG: read_dct_base_mask:   DCSB0[7]=0x00000000 reg: F2x5c
[    5.981160] EDAC DEBUG: read_dct_base_mask:   DCSB1[7]=0x00000000 reg: F2x15c
[    5.981161] EDAC DEBUG: read_dct_base_mask:     DCSM0[0]=0x00f83ce0 reg: F2x60
[    5.981162] EDAC DEBUG: read_dct_base_mask:     DCSM1[0]=0x00000000 reg: F2x160
[    5.981163] EDAC DEBUG: read_dct_base_mask:     DCSM0[1]=0x00f83ce0 reg: F2x64
[    5.981164] EDAC DEBUG: read_dct_base_mask:     DCSM1[1]=0x00000000 reg: F2x164
[    5.981165] EDAC DEBUG: read_dct_base_mask:     DCSM0[2]=0x00000000 reg: F2x68
[    5.981166] EDAC DEBUG: read_dct_base_mask:     DCSM1[2]=0x00000000 reg: F2x168
[    5.981167] EDAC DEBUG: read_dct_base_mask:     DCSM0[3]=0x00000000 reg: F2x6c
[    5.981168] EDAC DEBUG: read_dct_base_mask:     DCSM1[3]=0x00000000 reg: F2x16c
[    5.981169] EDAC DEBUG: read_mc_regs:   DIMM type: Unbuffered-DDR2
[    5.981219] EDAC DEBUG: nb_mce_bank_enabled_on_node: core: 0, MCG_CTL: 0x3f, NB MSR is enabled
[    5.981221] EDAC DEBUG: nb_mce_bank_enabled_on_node: core: 1, MCG_CTL: 0x3f, NB MSR is enabled
[    5.981221] EDAC amd64: Node 0: DRAM ECC disabled.
[    5.981223] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
[    6.302561] EDAC amd64: F10h detected (node 0).
[    6.307276] EDAC DEBUG: reserve_mc_sibling_devs: F1: 0000:00:18.1
[    6.313630] EDAC DEBUG: reserve_mc_sibling_devs: F2: 0000:00:18.2
[    6.320589] EDAC DEBUG: reserve_mc_sibling_devs: F3: 0000:00:18.3
[    6.328359] EDAC DEBUG: read_mc_regs:   TOP_MEM:  0x00000000d0000000
[    6.335150] EDAC DEBUG: read_mc_regs:   TOP_MEM2: 0x0000000230000000
[    6.342188] EDAC DEBUG: read_dram_ctl_register: F2x110 (DCTSelLow): 0xffffffff, High range addrs at: 0xfffff800
[    6.353691] EDAC DEBUG: read_dram_ctl_register:   DCTs operate in ganged mode
[    6.361527] EDAC DEBUG: read_dram_ctl_register:   data interleave for ECC: enabled, DRAM cleared since last warm reset: yes
[    6.374204] EDAC DEBUG: read_dram_ctl_register:   channel interleave: enabled, interleave bits selector: 0x3
[    6.384343] EDAC DEBUG: read_mc_regs:   DRAM range[0], base: 0x0000ff0000000000; limit: 0x0000ff022fffffff
[    6.395942] EDAC DEBUG: read_mc_regs:    IntlvEn=Disabled; Range access: RW IntlvSel=0 DstNode=0
[    6.406619] EDAC DEBUG: read_dct_base_mask:   DCSB0[0]=0x00000001 reg: F2x40
[    6.414646] EDAC DEBUG: read_dct_base_mask:   DCSB1[0]=0x00000000 reg: F2x140
[    6.422526] EDAC DEBUG: read_dct_base_mask:   DCSB0[1]=0x00000101 reg: F2x44
[    6.430823] EDAC DEBUG: read_dct_base_mask:   DCSB1[1]=0x00000000 reg: F2x144
[    6.438710] EDAC DEBUG: read_dct_base_mask:   DCSB0[2]=0x00000201 reg: F2x48
[    6.446810] EDAC DEBUG: read_dct_base_mask:   DCSB1[2]=0x00000000 reg: F2x148
[    6.454788] EDAC DEBUG: read_dct_base_mask:   DCSB0[3]=0x00000301 reg: F2x4c
[    6.462743] EDAC DEBUG: read_dct_base_mask:   DCSB1[3]=0x00000000 reg: F2x14c
[    6.470585] EDAC DEBUG: read_dct_base_mask:   DCSB0[4]=0x00000000 reg: F2x50
[    6.478698] EDAC DEBUG: read_dct_base_mask:   DCSB1[4]=0x00000000 reg: F2x150
[    6.486624] EDAC DEBUG: read_dct_base_mask:   DCSB0[5]=0x00000000 reg: F2x54
[    6.494631] EDAC DEBUG: read_dct_base_mask:   DCSB1[5]=0x00000000 reg: F2x154
[    6.502866] EDAC DEBUG: read_dct_base_mask:   DCSB0[6]=0x00000000 reg: F2x58
[    6.510817] EDAC DEBUG: read_dct_base_mask:   DCSB1[6]=0x00000000 reg: F2x158
[    6.518602] EDAC DEBUG: read_dct_base_mask:   DCSB0[7]=0x00000000 reg: F2x5c
[    6.527120] EDAC DEBUG: read_dct_base_mask:   DCSB1[7]=0x00000000 reg: F2x15c
[    6.534926] EDAC DEBUG: read_dct_base_mask:     DCSM0[0]=0x00f83ce0 reg: F2x60
[    6.548356] EDAC DEBUG: read_dct_base_mask:     DCSM1[0]=0x00000000 reg: F2x160
[    6.560715] EDAC DEBUG: read_dct_base_mask:     DCSM0[1]=0x00f83ce0 reg: F2x64
[    6.568116] EDAC DEBUG: read_dct_base_mask:     DCSM1[1]=0x00000000 reg: F2x164
[    6.575596] EDAC DEBUG: read_dct_base_mask:     DCSM0[2]=0x00000000 reg: F2x68
[    6.584317] EDAC DEBUG: read_dct_base_mask:     DCSM1[2]=0x00000000 reg: F2x168
[    6.591899] EDAC DEBUG: read_dct_base_mask:     DCSM0[3]=0x00000000 reg: F2x6c
[    6.599460] EDAC DEBUG: read_dct_base_mask:     DCSM1[3]=0x00000000 reg: F2x16c
[    6.606877] EDAC DEBUG: read_mc_regs:   DIMM type: Unbuffered-DDR2
[    6.619722] EDAC DEBUG: nb_mce_bank_enabled_on_node: core: 0, MCG_CTL: 0x3f, NB MSR is enabled
[    6.628463] EDAC DEBUG: nb_mce_bank_enabled_on_node: core: 1, MCG_CTL: 0x3f, NB MSR is enabled
[    6.648232] EDAC amd64: Node 0: DRAM ECC disabled.
[    6.657843] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc.
  2019-11-06 16:06 ` [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc Borislav Petkov
@ 2019-11-06 18:16   ` Ghannam, Yazen
  2019-11-06 19:54     ` Borislav Petkov
  0 siblings, 1 reply; 16+ messages in thread
From: Ghannam, Yazen @ 2019-11-06 18:16 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-edac, linux-kernel

> -----Original Message-----
> From: linux-edac-owner@vger.kernel.org <linux-edac-owner@vger.kernel.org> On Behalf Of Borislav Petkov
> Sent: Wednesday, November 6, 2019 11:06 AM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc.
> 
> On Wed, Nov 06, 2019 at 01:24:59AM +0000, Ghannam, Yazen wrote:
> > From: Yazen Ghannam <yazen.ghannam@amd.com>
> >
> > Hi Boris,
> >
> > These patches address the issue where the module checks and complains
> > about DRAM ECC on nodes without memory.
> >
> > Changes from last revision:
> >   1) Dropped patch 6 which was for adding a grain value.
> >   2) Added an error code for !ecc_enabled() in patch 5.
> 
> Still doesn't help. The load gets attempted twice still. Try reproducing
> it on a small, single-node box where ECC is disabled.
> 

We had a thread before about usersapce loading the module multiple times on
failure:
https://lore.kernel.org/linux-edac/20190822005020.GA403@angband.pl/

I tried to look into it a bit, but I didn't get very far.

So is the behavior you see only happening with the new patchset applied? That
may be a clue that we can fix this in the module.

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc.
  2019-11-06 18:16   ` Ghannam, Yazen
@ 2019-11-06 19:54     ` Borislav Petkov
  2019-11-07 10:38       ` Borislav Petkov
  0 siblings, 1 reply; 16+ messages in thread
From: Borislav Petkov @ 2019-11-06 19:54 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: linux-edac, linux-kernel

On Wed, Nov 06, 2019 at 06:16:12PM +0000, Ghannam, Yazen wrote:
> We had a thread before about usersapce loading the module multiple times on
> failure:
> https://lore.kernel.org/linux-edac/20190822005020.GA403@angband.pl/
> 
> I tried to look into it a bit, but I didn't get very far.

Right, I'll try to have a look soon, as it reproduces here.

> So is the behavior you see only happening with the new patchset applied? That
> may be a clue that we can fix this in the module.

Actually, it did try twice before your patchset and I didn't notice it
then because it wouldn't spit so much debug output. But that happens now
because your patchset pulls up the detection early. And without it we
had:

$ dmesg | grep -i edac
[    2.590869] EDAC MC: Ver: 3.0.0
[    2.594855] EDAC DEBUG: edac_mc_sysfs_init: device mc created
[    5.939351] EDAC DEBUG: nb_mce_bank_enabled_on_node: core: 0, MCG_CTL: 0x3f, NB MSR is enabled
[    5.948488] EDAC DEBUG: nb_mce_bank_enabled_on_node: core: 1, MCG_CTL: 0x3f, NB MSR is enabled
[    5.957312] EDAC amd64: Node 0: DRAM ECC disabled.
[    5.967746] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
[    6.031424] EDAC DEBUG: nb_mce_bank_enabled_on_node: core: 0, MCG_CTL: 0x3f, NB MSR is enabled
[    6.042173] EDAC DEBUG: nb_mce_bank_enabled_on_node: core: 1, MCG_CTL: 0x3f, NB MSR is enabled
[    6.052253] EDAC amd64: Node 0: DRAM ECC disabled.
[    6.057804] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.

which are also two attempts.

Anyway, I'll queue your set and I'll try to debug that thing because it
is getting on my nerves slowly...

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc.
  2019-11-06 19:54     ` Borislav Petkov
@ 2019-11-07 10:38       ` Borislav Petkov
  2019-11-07 13:47         ` Ghannam, Yazen
  0 siblings, 1 reply; 16+ messages in thread
From: Borislav Petkov @ 2019-11-07 10:38 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: linux-edac, linux-kernel

On Wed, Nov 06, 2019 at 08:54:17PM +0100, Borislav Petkov wrote:
> which are also two attempts.
> 
> Anyway, I'll queue your set and I'll try to debug that thing because it
> is getting on my nerves slowly...

Yah, the problem is that because we have:

MODULE_DEVICE_TABLE(x86cpu, amd64_cpuids);

it gets tried on each CPU because an uevent gets dispatched for each
device, and each CPU is a device.

That's why I see it twice on this box - it has two CPUs.

And Greg says making it attempt once per system can't be done. Unless we
start doing hacks with sending uevents per BSP only which is too much.
Or we can remember the previous return value of the module init function
into edac_core but that's nasty too.

I'm thinking we should simply kill this fat ecc_msg thing which is not
very useful and be done with it:

[    5.697275] EDAC MC: Ver: 3.0.0
[    5.909530] EDAC amd64: F10h detected (node 0).
[    6.345231] EDAC amd64: Node 0: DRAM ECC disabled.
[    6.370815] EDAC amd64: F10h detected (node 0).
[    6.370929] EDAC amd64: Node 0: DRAM ECC disabled.

That's probably still a bit annoying on a large machine but better than
nothing.

---
diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 3aeb5173e200..0738237e3f09 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3188,18 +3188,6 @@ static void restore_ecc_error_reporting(struct ecc_settings *s, u16 nid,
 		amd64_warn("Error restoring NB MCGCTL settings!\n");
 }
 
-/*
- * EDAC requires that the BIOS have ECC enabled before
- * taking over the processing of ECC errors. A command line
- * option allows to force-enable hardware ECC later in
- * enable_ecc_error_reporting().
- */
-static const char *ecc_msg =
-	"ECC disabled in the BIOS or no ECC capability, module will not load.\n"
-	" Either enable ECC checking or force module loading by setting "
-	"'ecc_enable_override'.\n"
-	" (Note that use of the override may cause unknown side effects.)\n";
-
 static bool ecc_enabled(struct amd64_pvt *pvt)
 {
 	u16 nid = pvt->mc_node_id;
@@ -3246,11 +3234,10 @@ static bool ecc_enabled(struct amd64_pvt *pvt)
 	amd64_info("Node %d: DRAM ECC %s.\n",
 		   nid, (ecc_en ? "enabled" : "disabled"));
 
-	if (!ecc_en || !nb_mce_en) {
-		amd64_info("%s", ecc_msg);
+	if (!ecc_en || !nb_mce_en)
 		return false;
-	}
-	return true;
+	else
+		return true;
 }
 
 static inline void

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* RE: [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc.
  2019-11-07 10:38       ` Borislav Petkov
@ 2019-11-07 13:47         ` Ghannam, Yazen
  2019-11-07 15:40           ` Borislav Petkov
  0 siblings, 1 reply; 16+ messages in thread
From: Ghannam, Yazen @ 2019-11-07 13:47 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-edac, linux-kernel

> -----Original Message-----
> From: linux-edac-owner@vger.kernel.org <linux-edac-owner@vger.kernel.org> On Behalf Of Borislav Petkov
> Sent: Thursday, November 7, 2019 5:39 AM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc.
> 
> On Wed, Nov 06, 2019 at 08:54:17PM +0100, Borislav Petkov wrote:
> > which are also two attempts.
> >
> > Anyway, I'll queue your set and I'll try to debug that thing because it
> > is getting on my nerves slowly...
> 
> Yah, the problem is that because we have:
> 
> MODULE_DEVICE_TABLE(x86cpu, amd64_cpuids);
> 
> it gets tried on each CPU because an uevent gets dispatched for each
> device, and each CPU is a device.
> 
> That's why I see it twice on this box - it has two CPUs.
> 

Okay, that's makes sense.

BTW, what do you think about loading based on PCI devices? The module
used to do this. I ask because I'm starting to see that future systems may
re-use PCI IDs, and this indicates the same level of hardware support.

> And Greg says making it attempt once per system can't be done. Unless we
> start doing hacks with sending uevents per BSP only which is too much.
> Or we can remember the previous return value of the module init function
> into edac_core but that's nasty too.
> 
> I'm thinking we should simply kill this fat ecc_msg thing which is not
> very useful and be done with it:
> 
> [    5.697275] EDAC MC: Ver: 3.0.0
> [    5.909530] EDAC amd64: F10h detected (node 0).
> [    6.345231] EDAC amd64: Node 0: DRAM ECC disabled.
> [    6.370815] EDAC amd64: F10h detected (node 0).
> [    6.370929] EDAC amd64: Node 0: DRAM ECC disabled.
> 
> That's probably still a bit annoying on a large machine but better than
> nothing.
> 

Yeah, I agree.

> ---
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index 3aeb5173e200..0738237e3f09 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -3188,18 +3188,6 @@ static void restore_ecc_error_reporting(struct ecc_settings *s, u16 nid,
>  		amd64_warn("Error restoring NB MCGCTL settings!\n");
>  }
> 
> -/*
> - * EDAC requires that the BIOS have ECC enabled before
> - * taking over the processing of ECC errors. A command line
> - * option allows to force-enable hardware ECC later in
> - * enable_ecc_error_reporting().
> - */
> -static const char *ecc_msg =
> -	"ECC disabled in the BIOS or no ECC capability, module will not load.\n"
> -	" Either enable ECC checking or force module loading by setting "
> -	"'ecc_enable_override'.\n"
> -	" (Note that use of the override may cause unknown side effects.)\n";
> -
>  static bool ecc_enabled(struct amd64_pvt *pvt)
>  {
>  	u16 nid = pvt->mc_node_id;
> @@ -3246,11 +3234,10 @@ static bool ecc_enabled(struct amd64_pvt *pvt)
>  	amd64_info("Node %d: DRAM ECC %s.\n",
>  		   nid, (ecc_en ? "enabled" : "disabled"));
> 
> -	if (!ecc_en || !nb_mce_en) {
> -		amd64_info("%s", ecc_msg);
> +	if (!ecc_en || !nb_mce_en)
>  		return false;
> -	}
> -	return true;
> +	else
> +		return true;
>  }

Just a nit, but this else seems unnecessary right?

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc.
  2019-11-07 13:47         ` Ghannam, Yazen
@ 2019-11-07 15:40           ` Borislav Petkov
  2019-11-07 19:20             ` Ghannam, Yazen
  0 siblings, 1 reply; 16+ messages in thread
From: Borislav Petkov @ 2019-11-07 15:40 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: linux-edac, linux-kernel

On Thu, Nov 07, 2019 at 01:47:53PM +0000, Ghannam, Yazen wrote:
> BTW, what do you think about loading based on PCI devices? The module
> used to do this. I ask because I'm starting to see that future systems may
> re-use PCI IDs, and this indicates the same level of hardware support.

The reason we switched to family-based autoloading was that almost
every new platform would add a new PCI device ID, which would require
enablement work...

> Just a nit, but this else seems unnecessary right?

Maybe it is easier if you look at the function end in the .c file directly as
diffs can be confusing:

static bool ecc_enabled(struct amd64_pvt *pvt)
{

	...

        amd64_info("Node %d: DRAM ECC %s.\n",
                   nid, (ecc_en ? "enabled" : "disabled"));

        if (!ecc_en || !nb_mce_en)
                return false;
        else
                return true;
}

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc.
  2019-11-07 15:40           ` Borislav Petkov
@ 2019-11-07 19:20             ` Ghannam, Yazen
  2019-11-07 19:34               ` Borislav Petkov
  0 siblings, 1 reply; 16+ messages in thread
From: Ghannam, Yazen @ 2019-11-07 19:20 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-edac, linux-kernel

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org <linux-kernel-owner@vger.kernel.org> On Behalf Of Borislav Petkov
> Sent: Thursday, November 7, 2019 10:40 AM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc.
> 
> On Thu, Nov 07, 2019 at 01:47:53PM +0000, Ghannam, Yazen wrote:
> > BTW, what do you think about loading based on PCI devices? The module
> > used to do this. I ask because I'm starting to see that future systems may
> > re-use PCI IDs, and this indicates the same level of hardware support.
> 
> The reason we switched to family-based autoloading was that almost
> every new platform would add a new PCI device ID, which would require
> enablement work...
> 

Yes, that's right. But it looks like future systems will re-use PCI IDs even
across families and models. And the PCI IDs will be more closely related to
hardware capabilities than family and model.

In any case, we can address that when we get there.

> > Just a nit, but this else seems unnecessary right?
> 
> Maybe it is easier if you look at the function end in the .c file directly as
> diffs can be confusing:
> 
> static bool ecc_enabled(struct amd64_pvt *pvt)
> {
> 
> 	...
> 
>         amd64_info("Node %d: DRAM ECC %s.\n",
>                    nid, (ecc_en ? "enabled" : "disabled"));
> 
>         if (!ecc_en || !nb_mce_en)
>                 return false;
>         else

Right, I meant you can drop this else and just return true.

>                 return true;
> }
> 

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc.
  2019-11-07 19:20             ` Ghannam, Yazen
@ 2019-11-07 19:34               ` Borislav Petkov
  2019-11-07 19:41                 ` Ghannam, Yazen
  0 siblings, 1 reply; 16+ messages in thread
From: Borislav Petkov @ 2019-11-07 19:34 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: linux-edac, linux-kernel

On Thu, Nov 07, 2019 at 07:20:25PM +0000, Ghannam, Yazen wrote:
> Yes, that's right. But it looks like future systems will re-use PCI IDs even
> across families and models. And the PCI IDs will be more closely related to
> hardware capabilities than family and model.
> 
> In any case, we can address that when we get there.

I'd be fine with it if this really is the case and we don't end up
having to keep adding PCI IDs like crazy again. That was a moderate
PITA, AFAIR, especially for distro kernels having to constantly pick up
enablement patches and people complaining about it.

So you need to make sure the PCI IDs will really get reused before
converting back...

> >         if (!ecc_en || !nb_mce_en)
> >                 return false;
> >         else
> 
> Right, I meant you can drop this else and just return true.
> 
> >                 return true;

I prefer the regular if-else way because it reads faster and it is
straight-forward when one skims over the code.

But I can drop if if you insist. :-)

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc.
  2019-11-07 19:34               ` Borislav Petkov
@ 2019-11-07 19:41                 ` Ghannam, Yazen
  2019-11-09  9:08                   ` [PATCH] EDAC/amd64: Get rid of the ECC disabled long message Borislav Petkov
  0 siblings, 1 reply; 16+ messages in thread
From: Ghannam, Yazen @ 2019-11-07 19:41 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-edac, linux-kernel

> -----Original Message-----
> From: linux-edac-owner@vger.kernel.org <linux-edac-owner@vger.kernel.org> On Behalf Of Borislav Petkov
> Sent: Thursday, November 7, 2019 2:34 PM
> To: Ghannam, Yazen <Yazen.Ghannam@amd.com>
> Cc: linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc.
> 
> On Thu, Nov 07, 2019 at 07:20:25PM +0000, Ghannam, Yazen wrote:
> > Yes, that's right. But it looks like future systems will re-use PCI IDs even
> > across families and models. And the PCI IDs will be more closely related to
> > hardware capabilities than family and model.
> >
> > In any case, we can address that when we get there.
> 
> I'd be fine with it if this really is the case and we don't end up
> having to keep adding PCI IDs like crazy again. That was a moderate
> PITA, AFAIR, especially for distro kernels having to constantly pick up
> enablement patches and people complaining about it.
> 
> So you need to make sure the PCI IDs will really get reused before
> converting back...
> 

Will do.

> > >         if (!ecc_en || !nb_mce_en)
> > >                 return false;
> > >         else
> >
> > Right, I meant you can drop this else and just return true.
> >
> > >                 return true;
> 
> I prefer the regular if-else way because it reads faster and it is
> straight-forward when one skims over the code.
> 
> But I can drop if if you insist. :-)
> 

No, I don't mind.

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH] EDAC/amd64: Get rid of the ECC disabled long message
  2019-11-07 19:41                 ` Ghannam, Yazen
@ 2019-11-09  9:08                   ` Borislav Petkov
  0 siblings, 0 replies; 16+ messages in thread
From: Borislav Petkov @ 2019-11-09  9:08 UTC (permalink / raw)
  To: Ghannam, Yazen; +Cc: linux-edac, linux-kernel

Ok, I've queued this:

---
From: Borislav Petkov <bp@suse.de>
Date: Sat, 9 Nov 2019 10:00:54 +0100

This message keeps flooding dmesg on boxes where ECC is disabled or the
DIMMs do not support ECC but the module gets auto-probed. What's even
worse is that autoprobing happens on every CPU due to the CPU-family
matching the driver does and uevent being generated for each CPU device.

What is more, this message is becoming even more useless on newer
systems where forcing ECC is not recommended and it should be done in
the BIOS so the BIOS can do all the necessary work, i.e., just setting a
bit in an MSR is not enough anymore.

So get rid of it.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>
Cc: linux-edac@vger.kernel.org
Link: https://lkml.kernel.org/r/20191106160607.GC28380@zn.tnic
---
 drivers/edac/amd64_edac.c | 19 +++----------------
 1 file changed, 3 insertions(+), 16 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 3aeb5173e200..428ce98f6776 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -3188,18 +3188,6 @@ static void restore_ecc_error_reporting(struct ecc_settings *s, u16 nid,
 		amd64_warn("Error restoring NB MCGCTL settings!\n");
 }
 
-/*
- * EDAC requires that the BIOS have ECC enabled before
- * taking over the processing of ECC errors. A command line
- * option allows to force-enable hardware ECC later in
- * enable_ecc_error_reporting().
- */
-static const char *ecc_msg =
-	"ECC disabled in the BIOS or no ECC capability, module will not load.\n"
-	" Either enable ECC checking or force module loading by setting "
-	"'ecc_enable_override'.\n"
-	" (Note that use of the override may cause unknown side effects.)\n";
-
 static bool ecc_enabled(struct amd64_pvt *pvt)
 {
 	u16 nid = pvt->mc_node_id;
@@ -3246,11 +3234,10 @@ static bool ecc_enabled(struct amd64_pvt *pvt)
 	amd64_info("Node %d: DRAM ECC %s.\n",
 		   nid, (ecc_en ? "enabled" : "disabled"));
 
-	if (!ecc_en || !nb_mce_en) {
-		amd64_info("%s", ecc_msg);
+	if (!ecc_en || !nb_mce_en)
 		return false;
-	}
-	return true;
+	else
+		return true;
 }
 
 static inline void
-- 
2.21.0

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply related	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2019-11-09  9:08 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-06  1:24 [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc Ghannam, Yazen
2019-11-06  1:24 ` [PATCH v3 1/5] EDAC/amd64: Make struct amd64_family_type global Ghannam, Yazen
2019-11-06  1:25 ` [PATCH v3 2/5] EDAC/amd64: Gather hardware information early Ghannam, Yazen
2019-11-06  1:25 ` [PATCH v3 3/5] EDAC/amd64: Save max number of controllers to family type Ghannam, Yazen
2019-11-06  1:25 ` [PATCH v3 4/5] EDAC/amd64: Use cached data when checking for ECC Ghannam, Yazen
2019-11-06  1:25 ` [PATCH v3 5/5] EDAC/amd64: Check for memory before fully initializing an instance Ghannam, Yazen
2019-11-06 16:06 ` [PATCH v3 0/5] AMD64 EDAC: Check for nodes without memory, etc Borislav Petkov
2019-11-06 18:16   ` Ghannam, Yazen
2019-11-06 19:54     ` Borislav Petkov
2019-11-07 10:38       ` Borislav Petkov
2019-11-07 13:47         ` Ghannam, Yazen
2019-11-07 15:40           ` Borislav Petkov
2019-11-07 19:20             ` Ghannam, Yazen
2019-11-07 19:34               ` Borislav Petkov
2019-11-07 19:41                 ` Ghannam, Yazen
2019-11-09  9:08                   ` [PATCH] EDAC/amd64: Get rid of the ECC disabled long message Borislav Petkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).