linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4] EDAC: Add ARM64 EDAC
@ 2015-10-28 16:13 Brijesh Singh
  2015-10-28 16:40 ` Mark Rutland
  0 siblings, 1 reply; 7+ messages in thread
From: Brijesh Singh @ 2015-10-28 16:13 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Brijesh Singh, robh+dt, pawel.moll, mark.rutland, ijc+devicetree,
	galak, dougthompson, bp, mchehab, devicetree, guohanjun,
	andre.przywara, arnd, sboyd, linux-kernel, linux-edac

Add support for Cortex A57 and A53 EDAC driver.

Signed-off-by: Brijesh Singh <brijeshkumar.singh@amd.com>
CC: robh+dt@kernel.org
CC: pawel.moll@arm.com
CC: mark.rutland@arm.com
CC: ijc+devicetree@hellion.org.uk
CC: galak@codeaurora.org
CC: dougthompson@xmission.com
CC: bp@alien8.de
CC: mchehab@osg.samsung.com
CC: devicetree@vger.kernel.org
CC: guohanjun@huawei.com
CC: andre.przywara@arm.com
CC: arnd@arndb.de
CC: sboyd@codeaurora.org
CC: linux-kernel@vger.kernel.org
CC: linux-edac@vger.kernel.org
---
Tested error reporting via rasdeamon on AMD seattle platform.

v4:
* remove THIS_MODULE assignment from platform_driver
* use module_platform_driver(), remove module_init() and module_exit()
* remove default n from Kconfig

v3:
* change DT binding compatibilty string to 'cortex-a57-edac'
* remove A57/A53 prefix from register bit definition
* unify A57 and A53 L1/L2 error parsing functions
* use mc trace event function to report the error
* update Kconfig default to 'n'

v2:
* convert into generic arm64 edac driver
* remove AMD specific references from dt binding
* remove poll_msec property from dt binding
* add poll_msec as a module param, default is 100ms
* update copyright text
* define macro mnemonics for L1 and L2 RAMID
* check L2 error per-cluster instead of per core
* update function names
* use get_online_cpus() and put_online_cpus() to make L1 and L2 register 
  read hotplug-safe
* add error check in probe routine

 .../devicetree/bindings/edac/cortex-arm64-edac.txt |  15 +
 drivers/edac/Kconfig                               |   7 +
 drivers/edac/Makefile                              |   1 +
 drivers/edac/cortex_arm64_edac.c                   | 336 +++++++++++++++++++++
 4 files changed, 359 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/edac/cortex-arm64-edac.txt
 create mode 100644 drivers/edac/cortex_arm64_edac.c

diff --git a/Documentation/devicetree/bindings/edac/cortex-arm64-edac.txt b/Documentation/devicetree/bindings/edac/cortex-arm64-edac.txt
new file mode 100644
index 0000000..552f0c7
--- /dev/null
+++ b/Documentation/devicetree/bindings/edac/cortex-arm64-edac.txt
@@ -0,0 +1,15 @@
+* ARM Cortex A57 and A53 L1/L2 cache error reporting
+
+CPU Memory Error Syndrome and L2 Memory Error Syndrome registers can be used
+for checking L1 and L2 memory errors.
+
+The following section describes the Cortex A57/A53 EDAC DT node binding.
+
+Required properties:
+- compatible: Should be "arm,cortex-a57-edac" or "arm,cortex-a53-edac"
+
+Example:
+	edac {
+		compatible = "arm,cortex-a57-edac";
+	};
+
diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index ef25000..e8f5306 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -390,4 +390,11 @@ config EDAC_XGENE
 	  Support for error detection and correction on the
 	  APM X-Gene family of SOCs.
 
+config EDAC_CORTEX_ARM64
+	tristate "ARM Cortex A57/A53"
+	depends on EDAC_MM_EDAC && ARM64
+	help
+	  Support for error detection and correction on the
+	  ARM Cortex A57 and A53.
+
 endif # EDAC
diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
index ae3c5f3..ac01660 100644
--- a/drivers/edac/Makefile
+++ b/drivers/edac/Makefile
@@ -68,3 +68,4 @@ obj-$(CONFIG_EDAC_OCTEON_PCI)		+= octeon_edac-pci.o
 obj-$(CONFIG_EDAC_ALTERA_MC)		+= altera_edac.o
 obj-$(CONFIG_EDAC_SYNOPSYS)		+= synopsys_edac.o
 obj-$(CONFIG_EDAC_XGENE)		+= xgene_edac.o
+obj-$(CONFIG_EDAC_CORTEX_ARM64)		+= cortex_arm64_edac.o
diff --git a/drivers/edac/cortex_arm64_edac.c b/drivers/edac/cortex_arm64_edac.c
new file mode 100644
index 0000000..b2a8723
--- /dev/null
+++ b/drivers/edac/cortex_arm64_edac.c
@@ -0,0 +1,336 @@
+/*
+ * Cortex A57 and A53 EDAC
+ *
+ * Copyright (c) 2015, Advanced Micro Devices
+ * Author: Brijesh Singh <brijeshkumar.singh@amd.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/of_device.h>
+#include <linux/platform_device.h>
+#include <ras/ras_event.h>
+
+#include "edac_core.h"
+
+#define DRV_NAME			"cortex_edac"
+
+#define CPUMERRSR_EL1_INDEX(x, y)	((x) & (y))
+#define CPUMERRSR_EL1_BANK_WAY(x, y)	(((x) >> 18) & (y))
+#define CPUMERRSR_EL1_RAMID(x)		(((x) >> 24) & 0x7f)
+#define CPUMERRSR_EL1_VALID(x)		((x) & (1 << 31))
+#define CPUMERRSR_EL1_REPEAT(x)		(((x) >> 32) & 0x7f)
+#define CPUMERRSR_EL1_OTHER(x)		(((x) >> 40) & 0xff)
+#define CPUMERRSR_EL1_FATAL(x)		((x) & (1UL << 63))
+#define L1_I_TAG_RAM			0x00
+#define L1_I_DATA_RAM			0x01
+#define L1_D_TAG_RAM			0x08
+#define L1_D_DATA_RAM			0x09
+#define L1_D_DIRTY_RAM			0x14
+#define TLB_RAM				0x18
+
+#define L2MERRSR_EL1_CPUID_WAY(x)	(((x) >> 18) & 0xf)
+#define L2MERRSR_EL1_RAMID(x)		(((x) >> 24) & 0x7f)
+#define L2MERRSR_EL1_VALID(x)		((x) & (1 << 31))
+#define L2MERRSR_EL1_REPEAT(x)		(((x) >> 32) & 0xff)
+#define L2MERRSR_EL1_OTHER(x)		(((x) >> 40) & 0xff)
+#define L2MERRSR_EL1_FATAL(x)		((x) & (1UL << 63))
+#define L2_TAG_RAM			0x10
+#define L2_DATA_RAM			0x11
+#define L2_SNOOP_RAM			0x12
+#define L2_DIRTY_RAM			0x14
+#define L2_INCLUSION_PF_RAM		0x18
+
+#define L1_CACHE			0
+#define L2_CACHE			1
+
+#define EDAC_MOD_STR			DRV_NAME
+
+static int poll_msec = 100;
+
+struct cortex_arm64_edac {
+	struct edac_device_ctl_info *edac_ctl;
+};
+
+static inline u64 read_cpumerrsr_el1(void)
+{
+	u64 val;
+
+	asm volatile("mrs %0, s3_1_c15_c2_2" : "=r" (val));
+	return val;
+}
+
+static inline void write_cpumerrsr_el1(u64 val)
+{
+	asm volatile("msr s3_1_c15_c2_2, %0" :: "r" (val));
+}
+
+static inline u64 read_l2merrsr_el1(void)
+{
+	u64 val;
+
+	asm volatile("mrs %0, s3_1_c15_c2_3" : "=r" (val));
+	return val;
+}
+
+static inline void write_l2merrsr_el1(u64 val)
+{
+	asm volatile("msr s3_1_c15_c2_3, %0" :: "r" (val));
+}
+
+static void parse_cpumerrsr(void *arg)
+{
+	int cpu, partnum, way;
+	unsigned int index = 0;
+	u64 val = read_cpumerrsr_el1();
+	int repeat_err, other_err;
+
+	/* we do not support fatal error handling so far */
+	if (CPUMERRSR_EL1_FATAL(val))
+		return;
+
+	/* check if we have valid error before continuing */
+	if (!CPUMERRSR_EL1_VALID(val))
+		return;
+
+	cpu = smp_processor_id();
+	partnum = read_cpuid_part_number();
+	repeat_err = CPUMERRSR_EL1_REPEAT(val);
+	other_err = CPUMERRSR_EL1_OTHER(val);
+
+	/* way/bank and index address bit ranges are different between
+	 * A57 and A53 */
+	if (partnum == ARM_CPU_PART_CORTEX_A57) {
+		index = CPUMERRSR_EL1_INDEX(val, 0x1ffff);
+		way = CPUMERRSR_EL1_BANK_WAY(val, 0x1f);
+	} else {
+		index = CPUMERRSR_EL1_INDEX(val, 0xfff);
+		way = CPUMERRSR_EL1_BANK_WAY(val, 0x7);
+	}
+
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPU%d L1 error detected!\n", cpu);
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "index=%#x, RAMID=", index);
+
+	switch (CPUMERRSR_EL1_RAMID(val)) {
+	case L1_I_TAG_RAM:
+		pr_cont("'L1-I Tag RAM' (way %d)", way);
+		break;
+	case L1_I_DATA_RAM:
+		pr_cont("'L1-I Data RAM' (bank %d)", way);
+		break;
+	case L1_D_TAG_RAM:
+		pr_cont("'L1-D Tag RAM' (way %d)", way);
+		break;
+	case L1_D_DATA_RAM:
+		pr_cont("'L1-D Data RAM' (bank %d)", way);
+		break;
+	case L1_D_DIRTY_RAM:
+		pr_cont("'L1 Dirty RAM'");
+		break;
+	case TLB_RAM:
+		pr_cont("'TLB RAM'");
+		break;
+	default:
+		pr_cont("'unknown'");
+		break;
+	}
+
+	pr_cont(", repeat=%d, other=%d (CPUMERRSR_EL1=%#llx)\n", repeat_err,
+		other_err, val);
+
+	trace_mc_event(HW_EVENT_ERR_CORRECTED, "L1 non-fatal error",
+		       "", repeat_err, 0, 0, 0, -1, index, 0, 0, DRV_NAME);
+	write_cpumerrsr_el1(0);
+}
+
+static void a57_parse_l2merrsr_way(u8 ramid, u8 val)
+{
+	switch (ramid) {
+	case L2_TAG_RAM:
+	case L2_DATA_RAM:
+	case L2_DIRTY_RAM:
+		pr_cont("(cpu%d tag, way %d)", val / 2, val % 2);
+		break;
+	case L2_SNOOP_RAM:
+		pr_cont("(cpu%d tag, way %d)", (val & 0x6) >> 1,
+			(val & 0x1));
+		break;
+	}
+}
+
+static void a53_parse_l2merrsr_way(u8 ramid, u8 val)
+{
+	switch (ramid) {
+	case L2_TAG_RAM:
+		pr_cont("(way %d)", val);
+	case L2_DATA_RAM:
+		pr_cont("(bank %d)", val);
+		break;
+	case L2_SNOOP_RAM:
+		pr_cont("(cpu%d tag, way %d)", val / 2, val % 4);
+		break;
+	}
+}
+
+static void parse_l2merrsr(void *arg)
+{
+	int cpu, partnum;
+	unsigned int index;
+	int repeat_err, other_err;
+	u64 val = read_l2merrsr_el1();
+
+	/* we do not support fatal error handling so far */
+	if (L2MERRSR_EL1_FATAL(val))
+		return;
+
+	/* check if we have valid error before continuing */
+	if (!L2MERRSR_EL1_VALID(val))
+		return;
+
+	cpu = smp_processor_id();
+	partnum = read_cpuid_part_number();
+	repeat_err = L2MERRSR_EL1_REPEAT(val);
+	other_err = L2MERRSR_EL1_OTHER(val);
+
+	/* index address range is different between A57 and A53 */
+	if (partnum == ARM_CPU_PART_CORTEX_A57)
+		index = val & 0x1ffff;
+	else
+		index = (val >> 3) & 0x3fff;
+
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPU%d L2 error detected!\n", cpu);
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "index=%#x RAMID=", index);
+
+	switch (L2MERRSR_EL1_RAMID(val)) {
+	case L2_TAG_RAM:
+		pr_cont("'L2 Tag RAM'");
+		break;
+	case L2_DATA_RAM:
+		pr_cont("'L2 Data RAM'");
+		break;
+	case L2_SNOOP_RAM:
+		pr_cont("'L2 Snoop tag RAM'");
+		break;
+	case L2_DIRTY_RAM:
+		pr_cont("'L2 Dirty RAM'");
+		break;
+	case L2_INCLUSION_PF_RAM:
+		pr_cont("'L2 inclusion PF RAM'");
+		break;
+	default:
+		pr_cont("unknown");
+		break;
+	}
+
+	/* cpuid/way bit description is different between A57 and A53 */
+	if (partnum == ARM_CPU_PART_CORTEX_A57)
+		a57_parse_l2merrsr_way(L2MERRSR_EL1_RAMID(val),
+				       L2MERRSR_EL1_CPUID_WAY(val));
+	else
+		a53_parse_l2merrsr_way(L2MERRSR_EL1_RAMID(val),
+				       L2MERRSR_EL1_CPUID_WAY(val));
+
+	pr_cont(", repeat=%d, other=%d (L2MERRSR_EL1=%#llx)\n", repeat_err,
+		other_err, val);
+	trace_mc_event(HW_EVENT_ERR_CORRECTED, "L2 non-fatal error",
+		       "", repeat_err, 0, 0, 0, -1, index, 0, 0, DRV_NAME);
+	write_l2merrsr_el1(0);
+}
+
+static void cortex_arm64_edac_check(struct edac_device_ctl_info *edac_ctl)
+{
+	int cpu;
+	struct cpumask cluster_mask, old_mask;
+
+	cpumask_clear(&cluster_mask);
+	cpumask_clear(&old_mask);
+
+	get_online_cpus();
+	for_each_online_cpu(cpu) {
+		/* Check CPU L1 error */
+		smp_call_function_single(cpu, parse_cpumerrsr, NULL, 0);
+		cpumask_copy(&cluster_mask, topology_core_cpumask(cpu));
+		if (cpumask_equal(&cluster_mask, &old_mask))
+			continue;
+		cpumask_copy(&old_mask, &cluster_mask);
+		/* Check CPU L2 error */
+		smp_call_function_any(&cluster_mask, parse_l2merrsr, NULL, 0);
+	}
+	put_online_cpus();
+}
+
+static int cortex_arm64_edac_probe(struct platform_device *pdev)
+{
+	int rc;
+	struct cortex_arm64_edac *drv;
+	struct device *dev = &pdev->dev;
+
+	drv = devm_kzalloc(dev, sizeof(*drv), GFP_KERNEL);
+	if (!drv)
+		return -ENOMEM;
+
+	/* Only POLL mode is supported */
+	edac_op_state = EDAC_OPSTATE_POLL;
+
+	drv->edac_ctl = edac_device_alloc_ctl_info(0, "cpu_cache", 1, "L", 2,
+						   0, NULL, 0,
+						   edac_device_alloc_index());
+	if (IS_ERR(drv->edac_ctl))
+		return -ENOMEM;
+
+	drv->edac_ctl->poll_msec = poll_msec;
+	drv->edac_ctl->edac_check = cortex_arm64_edac_check;
+	drv->edac_ctl->dev = dev;
+	drv->edac_ctl->mod_name = dev_name(dev);
+	drv->edac_ctl->dev_name = dev_name(dev);
+	drv->edac_ctl->ctl_name = "cache_err";
+	platform_set_drvdata(pdev, drv);
+
+	rc = edac_device_add_device(drv->edac_ctl);
+	if (rc)
+		edac_device_free_ctl_info(drv->edac_ctl);
+
+	return rc;
+}
+
+static int cortex_arm64_edac_remove(struct platform_device *pdev)
+{
+	struct cortex_arm64_edac *drv = dev_get_drvdata(&pdev->dev);
+	struct edac_device_ctl_info *edac_ctl = drv->edac_ctl;
+
+	edac_device_del_device(edac_ctl->dev);
+	edac_device_free_ctl_info(edac_ctl);
+
+	return 0;
+}
+
+static const struct of_device_id cortex_arm64_edac_of_match[] = {
+	{ .compatible = "arm,cortex-a57-edac" },
+	{ .compatible = "arm,cortex-a53-edac" },
+	{}
+};
+MODULE_DEVICE_TABLE(of, cortex_arm64_edac_of_match);
+
+static struct platform_driver cortex_arm64_edac_driver = {
+	.probe = cortex_arm64_edac_probe,
+	.remove = cortex_arm64_edac_remove,
+	.driver = {
+		.name = DRV_NAME,
+		.of_match_table = cortex_arm64_edac_of_match,
+	},
+};
+module_platform_driver(cortex_arm64_edac_driver);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Brijesh Singh <brijeshkumar.singh@amd.com>");
+MODULE_DESCRIPTION("Cortex A57 and A53 EDAC driver");
+module_param(poll_msec, int, 0444);
+MODULE_PARM_DESC(poll_msec, "EDAC monitor poll interval in msec");
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v4] EDAC: Add ARM64 EDAC
  2015-10-28 16:13 [PATCH v4] EDAC: Add ARM64 EDAC Brijesh Singh
@ 2015-10-28 16:40 ` Mark Rutland
  2015-10-30 16:26   ` Brijesh Singh
  0 siblings, 1 reply; 7+ messages in thread
From: Mark Rutland @ 2015-10-28 16:40 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: linux-arm-kernel, robh+dt, pawel.moll, ijc+devicetree, galak,
	dougthompson, bp, mchehab, devicetree, guohanjun, andre.przywara,
	arnd, sboyd, linux-kernel, linux-edac

Hi,

On Wed, Oct 28, 2015 at 11:13:49AM -0500, Brijesh Singh wrote:
> Add support for Cortex A57 and A53 EDAC driver.
> 
> Signed-off-by: Brijesh Singh <brijeshkumar.singh@amd.com>
> CC: robh+dt@kernel.org
> CC: pawel.moll@arm.com
> CC: mark.rutland@arm.com
> CC: ijc+devicetree@hellion.org.uk
> CC: galak@codeaurora.org
> CC: dougthompson@xmission.com
> CC: bp@alien8.de
> CC: mchehab@osg.samsung.com
> CC: devicetree@vger.kernel.org
> CC: guohanjun@huawei.com
> CC: andre.przywara@arm.com
> CC: arnd@arndb.de
> CC: sboyd@codeaurora.org
> CC: linux-kernel@vger.kernel.org
> CC: linux-edac@vger.kernel.org
> ---
> Tested error reporting via rasdeamon on AMD seattle platform.
> 
> v4:
> * remove THIS_MODULE assignment from platform_driver
> * use module_platform_driver(), remove module_init() and module_exit()
> * remove default n from Kconfig
> 
> v3:
> * change DT binding compatibilty string to 'cortex-a57-edac'
> * remove A57/A53 prefix from register bit definition
> * unify A57 and A53 L1/L2 error parsing functions
> * use mc trace event function to report the error
> * update Kconfig default to 'n'
> 
> v2:
> * convert into generic arm64 edac driver
> * remove AMD specific references from dt binding
> * remove poll_msec property from dt binding
> * add poll_msec as a module param, default is 100ms
> * update copyright text
> * define macro mnemonics for L1 and L2 RAMID
> * check L2 error per-cluster instead of per core
> * update function names
> * use get_online_cpus() and put_online_cpus() to make L1 and L2 register 
>   read hotplug-safe
> * add error check in probe routine
> 
>  .../devicetree/bindings/edac/cortex-arm64-edac.txt |  15 +
>  drivers/edac/Kconfig                               |   7 +
>  drivers/edac/Makefile                              |   1 +
>  drivers/edac/cortex_arm64_edac.c                   | 336 +++++++++++++++++++++
>  4 files changed, 359 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/edac/cortex-arm64-edac.txt
>  create mode 100644 drivers/edac/cortex_arm64_edac.c
> 
> diff --git a/Documentation/devicetree/bindings/edac/cortex-arm64-edac.txt b/Documentation/devicetree/bindings/edac/cortex-arm64-edac.txt
> new file mode 100644
> index 0000000..552f0c7
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/edac/cortex-arm64-edac.txt
> @@ -0,0 +1,15 @@
> +* ARM Cortex A57 and A53 L1/L2 cache error reporting
> +
> +CPU Memory Error Syndrome and L2 Memory Error Syndrome registers can be used
> +for checking L1 and L2 memory errors.
> +
> +The following section describes the Cortex A57/A53 EDAC DT node binding.
> +
> +Required properties:
> +- compatible: Should be "arm,cortex-a57-edac" or "arm,cortex-a53-edac"
> +
> +Example:
> +	edac {
> +		compatible = "arm,cortex-a57-edac";
> +	};
> +

This is insufficient for big.LITTLE, no interrupt is possible, and we
haven't defined the rules for accessing the registers (e.g. whether
write backs are permitted).

Please see my prior comments [1] on those points.

If we're going to use this feature directly within the kernel, we need
to consider the envelope of possible implementations rather than your
use-case alone.

> +static void parse_cpumerrsr(void *arg)
> +{
> +	int cpu, partnum, way;
> +	unsigned int index = 0;
> +	u64 val = read_cpumerrsr_el1();
> +	int repeat_err, other_err;
> +
> +	/* we do not support fatal error handling so far */
> +	if (CPUMERRSR_EL1_FATAL(val))
> +		return;

Silently ignoring fatal errors doesn't seem good.

Shouldn't this be checked after the valid bit?

> +	/* check if we have valid error before continuing */
> +	if (!CPUMERRSR_EL1_VALID(val))
> +		return;
> +
> +	cpu = smp_processor_id();
> +	partnum = read_cpuid_part_number();
> +	repeat_err = CPUMERRSR_EL1_REPEAT(val);
> +	other_err = CPUMERRSR_EL1_OTHER(val);
> +
> +	/* way/bank and index address bit ranges are different between
> +	 * A57 and A53 */
> +	if (partnum == ARM_CPU_PART_CORTEX_A57) {
> +		index = CPUMERRSR_EL1_INDEX(val, 0x1ffff);
> +		way = CPUMERRSR_EL1_BANK_WAY(val, 0x1f);
> +	} else {
> +		index = CPUMERRSR_EL1_INDEX(val, 0xfff);
> +		way = CPUMERRSR_EL1_BANK_WAY(val, 0x7);
> +	}

Can the bit ranges not be determined at probe time rather than reading
the MIDR repeatedly?

> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPU%d L1 error detected!\n", cpu);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "index=%#x, RAMID=", index);
> +
> +	switch (CPUMERRSR_EL1_RAMID(val)) {
> +	case L1_I_TAG_RAM:
> +		pr_cont("'L1-I Tag RAM' (way %d)", way);
> +		break;
> +	case L1_I_DATA_RAM:
> +		pr_cont("'L1-I Data RAM' (bank %d)", way);
> +		break;
> +	case L1_D_TAG_RAM:
> +		pr_cont("'L1-D Tag RAM' (way %d)", way);
> +		break;
> +	case L1_D_DATA_RAM:
> +		pr_cont("'L1-D Data RAM' (bank %d)", way);
> +		break;
> +	case L1_D_DIRTY_RAM:
> +		pr_cont("'L1 Dirty RAM'");
> +		break;
> +	case TLB_RAM:
> +		pr_cont("'TLB RAM'");
> +		break;
> +	default:
> +		pr_cont("'unknown'");
> +		break;
> +	}
> +
> +	pr_cont(", repeat=%d, other=%d (CPUMERRSR_EL1=%#llx)\n", repeat_err,
> +		other_err, val);
> +
> +	trace_mc_event(HW_EVENT_ERR_CORRECTED, "L1 non-fatal error",
> +		       "", repeat_err, 0, 0, 0, -1, index, 0, 0, DRV_NAME);
> +	write_cpumerrsr_el1(0);
> +}

The comments above also apply to parse_l2merrsr.

> +static void cortex_arm64_edac_check(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int cpu;
> +	struct cpumask cluster_mask, old_mask;
> +
> +	cpumask_clear(&cluster_mask);
> +	cpumask_clear(&old_mask);
> +
> +	get_online_cpus();
> +	for_each_online_cpu(cpu) {
> +		/* Check CPU L1 error */
> +		smp_call_function_single(cpu, parse_cpumerrsr, NULL, 0);
> +		cpumask_copy(&cluster_mask, topology_core_cpumask(cpu));
> +		if (cpumask_equal(&cluster_mask, &old_mask))
> +			continue;
> +		cpumask_copy(&old_mask, &cluster_mask);
> +		/* Check CPU L2 error */
> +		smp_call_function_any(&cluster_mask, parse_l2merrsr, NULL, 0);
> +	}

I don't think the use of old_mask is sufficient here, given the mapping
of logical to physical ID is arbitrary. For example, we could have CPUs
0,5,6,7 in one cluster, and CPUs 1,2,3,4 in another, and in that case
we'd check the first cluster twice.

This also is wrong for big.LITTLE; we can't necessarily check on every
CPU.

Thanks,
Mark.

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2015-October/380942.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v4] EDAC: Add ARM64 EDAC
  2015-10-28 16:40 ` Mark Rutland
@ 2015-10-30 16:26   ` Brijesh Singh
  2015-10-30 17:06     ` Mark Rutland
  0 siblings, 1 reply; 7+ messages in thread
From: Brijesh Singh @ 2015-10-30 16:26 UTC (permalink / raw)
  To: Mark Rutland
  Cc: brijeshkumar.singh, linux-arm-kernel, robh+dt, pawel.moll,
	ijc+devicetree, galak, dougthompson, bp, mchehab, devicetree,
	guohanjun, andre.przywara, arnd, sboyd, linux-kernel, linux-edac

Hi Mark,

>> +
>> +Required properties:
>> +- compatible: Should be "arm,cortex-a57-edac" or "arm,cortex-a53-edac"
>> +
>> +Example:
>> +	edac {
>> +		compatible = "arm,cortex-a57-edac";
>> +	};
>> +
> 
> This is insufficient for big.LITTLE, no interrupt is possible, and we
> haven't defined the rules for accessing the registers (e.g. whether
> write backs are permitted).
> 
> Please see my prior comments [1] on those points.
> 
> If we're going to use this feature directly within the kernel, we need
> to consider the envelope of possible implementations rather than your
> use-case alone.
>

I have looked at possibility of pushing correctable error logging in the
firmware; but given current hardware limitation it seems like OS is the best
place to implement it. Let me summaries the issues we are running into:

* Correctable errors does not generate any interrupt:
  If we have to implement error parsing inside the firmware then work need
  to be split between OS and firmware. Maybe OS can call SMC instruction to 
  dial into firmware and then firmware can check error syndrome registers; 
  if it finds correctable error then build HEST table. This method will introduce
  performance issue because it require OS executing SMC every 100ms or so to just
  poll for correctable error. If you have any other recommendation then please share it.


> * Interaction with firmware
> - When/do we handle interrupts?

We can a properties in dt bindings:

1) "num-interrupts = 1" - number of interrupt count. One interrupts per cluster
    e.g if you have 4 cluster then num-interrupts=4.
2) interrupts = <0, 92, 0> <0, 94, 0> <0, 96, 0> <0, 98, 0>  // interrupt mapping

If num-interrupts = 0, then firmware handles interrupts. Optionally we can use HEST FIRMWARE-FIRST
bit, if bit is set then firmware is handling the interrupt otherwise use DT information.

>
>  - When is it valid to write back and clear an error? We should not do
>    this behind the back of any firmware that owns the interface.

As per A57 TRM is concerned you are right both the correctable and uncorrectable 
error needs to clear VALID bit in L1/L2 syndrome registers. So yes we need to define
a rule for accessing the registers. I can think of two possible approach here:

1) add "error-syndrome-reg-write-access=1" property in dt.
   * if '1' then OS has exclusive write backs access to error syndrome register
   * if '0' then OS will not clear the valid bit on fatal error

  The handler looks like this:

  parse_error_syndrome () {
   val = read_cpumerrsr

   if (!IS_VALID(val))
     return

   /* log the error details */

   /* if fatal error and OS does not have exclusive write back access */
   if (IS_FATAL(val) && !error-syndrom-reg-write-access)
     return; 

   val = ~(1UL << 31); /* clear valid bit */
  }

2) Use HEST FIRMWARE-FIRST bit field, if the bit is set then OS should not clear
   the valid bit on fatal error and similarly if bit is clear then OS clears the VALID bit.

Since firmware will never handle the correctable error hence its always safe to clear
the VALID bit on non-fatal error. If you have any other suggestions then please share it.

I am not pushing my use-case only; I am trying to work through current hardware
limitation and still support all the possibilities. I am open to hear your suggestions.
I am also not well versed on big.LITTILE CPU, so you may need to point me on right 
direction as we progress. My testing is limited to Cortex A57.

> 
> I don't think the use of old_mask is sufficient here, given the mapping
> of logical to physical ID is arbitrary. For example, we could have CPUs
> 0,5,6,7 in one cluster, and CPUs 1,2,3,4 in another, and in that case
> we'd check the first cluster twice.
> 

Noted. I should use physical ID instead of logical mapping.

> This also is wrong for big.LITTLE; we can't necessarily check on every
> CPU.
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v4] EDAC: Add ARM64 EDAC
  2015-10-30 16:26   ` Brijesh Singh
@ 2015-10-30 17:06     ` Mark Rutland
  2015-10-30 17:32       ` Mark Rutland
                         ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Mark Rutland @ 2015-10-30 17:06 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: linux-arm-kernel, robh+dt, pawel.moll, ijc+devicetree, galak,
	dougthompson, bp, mchehab, devicetree, guohanjun, andre.przywara,
	arnd, sboyd, linux-kernel, linux-edac

On Fri, Oct 30, 2015 at 11:26:58AM -0500, Brijesh Singh wrote:
> Hi Mark,
> 
> >> +
> >> +Required properties:
> >> +- compatible: Should be "arm,cortex-a57-edac" or "arm,cortex-a53-edac"
> >> +
> >> +Example:
> >> +	edac {
> >> +		compatible = "arm,cortex-a57-edac";
> >> +	};
> >> +
> > 
> > This is insufficient for big.LITTLE, no interrupt is possible, and we
> > haven't defined the rules for accessing the registers (e.g. whether
> > write backs are permitted).
> > 
> > Please see my prior comments [1] on those points.
> > 
> > If we're going to use this feature directly within the kernel, we need
> > to consider the envelope of possible implementations rather than your
> > use-case alone.
> >
> 
> I have looked at possibility of pushing correctable error logging in the
> firmware; but given current hardware limitation it seems like OS is the best
> place to implement it. Let me summaries the issues we are running into:
> 
> * Correctable errors does not generate any interrupt:
>   If we have to implement error parsing inside the firmware then work need
>   to be split between OS and firmware. Maybe OS can call SMC instruction to 
>   dial into firmware and then firmware can check error syndrome registers; 
>   if it finds correctable error then build HEST table. This method will introduce
>   performance issue because it require OS executing SMC every 100ms or so to just
>   poll for correctable error. If you have any other recommendation then please share it.

I agree that this is a problem, and is an unfortunate hardware
limitation.

I am still wary of making use of IMPLEMENTATION DEFINED features like
this in the kernel.

> > * Interaction with firmware
> > - When/do we handle interrupts?
> 
> We can a properties in dt bindings:
> 
> 1) "num-interrupts = 1" - number of interrupt count. One interrupts per cluster
>     e.g if you have 4 cluster then num-interrupts=4.
> 2) interrupts = <0, 92, 0> <0, 94, 0> <0, 96, 0> <0, 98, 0>  // interrupt mapping
> 
> If num-interrupts = 0, then firmware handles interrupts. Optionally we can use HEST FIRMWARE-FIRST
> bit, if bit is set then firmware is handling the interrupt otherwise use DT information.

You won't have the HEST and DT information at the same time, given that
at runtime the kernel uses one of ACPI or DT.

This also doesn't define the affinity of interrupts (i.e. which
cluster/CPU does each interrupt get generated by), which is the more
important part.

> >  - When is it valid to write back and clear an error? We should not do
> >    this behind the back of any firmware that owns the interface.
> 
> As per A57 TRM is concerned you are right both the correctable and uncorrectable 
> error needs to clear VALID bit in L1/L2 syndrome registers. So yes we need to define
> a rule for accessing the registers. I can think of two possible approach here:
> 
> 1) add "error-syndrome-reg-write-access=1" property in dt.
>    * if '1' then OS has exclusive write backs access to error syndrome register
>    * if '0' then OS will not clear the valid bit on fatal error

What about correctable errors? What if someone wants to poll that in FW
(no matter how horrible that may be for performance)? What if a future
CPU revision adds an optional interrupt for that?

>   The handler looks like this:
> 
>   parse_error_syndrome () {
>    val = read_cpumerrsr
> 
>    if (!IS_VALID(val))
>      return
> 
>    /* log the error details */
> 
>    /* if fatal error and OS does not have exclusive write back access */
>    if (IS_FATAL(val) && !error-syndrom-reg-write-access)
>      return; 
> 
>    val = ~(1UL << 31); /* clear valid bit */
>   }

Ok.

> 2) Use HEST FIRMWARE-FIRST bit field, if the bit is set then OS should not clear
>    the valid bit on fatal error and similarly if bit is clear then OS clears the VALID bit.
> 
> Since firmware will never handle the correctable error hence its always safe to clear
> the VALID bit on non-fatal error. If you have any other suggestions then please share it.

As above, I don't think this is true in general.

> I am not pushing my use-case only; I am trying to work through current hardware
> limitation and still support all the possibilities. I am open to hear your suggestions.
> I am also not well versed on big.LITTILE CPU, so you may need to point me on right 
> direction as we progress. My testing is limited to Cortex A57.

The important thing to consider is that an arbitrary subset of CPUs may
have this feature, while another arbitrary subset may not.

> > I don't think the use of old_mask is sufficient here, given the mapping
> > of logical to physical ID is arbitrary. For example, we could have CPUs
> > 0,5,6,7 in one cluster, and CPUs 1,2,3,4 in another, and in that case
> > we'd check the first cluster twice.
> > 
> 
> Noted. I should use physical ID instead of logical mapping.

Or you could simply keep a mask of CPUs whose clusters have already been
checked...

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v4] EDAC: Add ARM64 EDAC
  2015-10-30 17:06     ` Mark Rutland
@ 2015-10-30 17:32       ` Mark Rutland
  2015-10-30 17:51       ` Borislav Petkov
  2015-10-30 19:12       ` Brijesh Singh
  2 siblings, 0 replies; 7+ messages in thread
From: Mark Rutland @ 2015-10-30 17:32 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: devicetree, arnd, pawel.moll, ijc+devicetree, mchehab, galak,
	guohanjun, linux-kernel, andre.przywara, robh+dt, bp,
	dougthompson, sboyd, linux-arm-kernel, linux-edac

> > > * Interaction with firmware
> > > - When/do we handle interrupts?
> > 
> > We can a properties in dt bindings:
> > 
> > 1) "num-interrupts = 1" - number of interrupt count. One interrupts per cluster
> >     e.g if you have 4 cluster then num-interrupts=4.
> > 2) interrupts = <0, 92, 0> <0, 94, 0> <0, 96, 0> <0, 98, 0>  // interrupt mapping
> > 
> > If num-interrupts = 0, then firmware handles interrupts. Optionally we can use HEST FIRMWARE-FIRST
> > bit, if bit is set then firmware is handling the interrupt otherwise use DT information.
> 
> You won't have the HEST and DT information at the same time, given that
> at runtime the kernel uses one of ACPI or DT.

>From a quick look at the HEST definition, I don't doesn't seem like it's
possible to describe this feature -- there's no error source descriptor
for it and the generic hardware error source is not applicable.

So I'm worried that it may not be possible to use this feature with
ACPI.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v4] EDAC: Add ARM64 EDAC
  2015-10-30 17:06     ` Mark Rutland
  2015-10-30 17:32       ` Mark Rutland
@ 2015-10-30 17:51       ` Borislav Petkov
  2015-10-30 19:12       ` Brijesh Singh
  2 siblings, 0 replies; 7+ messages in thread
From: Borislav Petkov @ 2015-10-30 17:51 UTC (permalink / raw)
  To: Mark Rutland, Brijesh Singh
  Cc: linux-arm-kernel, robh+dt, pawel.moll, ijc+devicetree, galak,
	dougthompson, mchehab, devicetree, guohanjun, andre.przywara,
	arnd, sboyd, linux-kernel, linux-edac

On Fri, Oct 30, 2015 at 05:06:06PM +0000, Mark Rutland wrote:
> > * Correctable errors does not generate any interrupt:
> >   If we have to implement error parsing inside the firmware then work need
> >   to be split between OS and firmware. Maybe OS can call SMC instruction to 
> >   dial into firmware and then firmware can check error syndrome registers; 
> >   if it finds correctable error then build HEST table. This method will introduce
> >   performance issue because it require OS executing SMC every 100ms or so to just
> >   poll for correctable error. If you have any other recommendation then please share it.
> 
> I agree that this is a problem, and is an unfortunate hardware
> limitation.
> 
> I am still wary of making use of IMPLEMENTATION DEFINED features like
> this in the kernel.

Well, you could do all the correctable errors collecting in the firmware
and only report those errors to the OS when they're overflowing/reach a
certain threshold.

The idea behind it being that you don't really want to upset the user
about *every* correctable error happening because it was correctable and
the hardware, well, doh, corrected it. No problem.

But when those errors start repeating and hitting the same DIMM and
addresses in close proximity, there might be a problem which you should
report.

Btw, we have been looking for doing something like that on x86:

https://lkml.kernel.org/r/1404242623-10094-1-git-send-email-bp@alien8.de

and one of those days I'll upstream the damn thing!

:-)

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v4] EDAC: Add ARM64 EDAC
  2015-10-30 17:06     ` Mark Rutland
  2015-10-30 17:32       ` Mark Rutland
  2015-10-30 17:51       ` Borislav Petkov
@ 2015-10-30 19:12       ` Brijesh Singh
  2 siblings, 0 replies; 7+ messages in thread
From: Brijesh Singh @ 2015-10-30 19:12 UTC (permalink / raw)
  To: Mark Rutland
  Cc: brijeshkumar.singh, linux-arm-kernel, robh+dt, pawel.moll,
	ijc+devicetree, galak, dougthompson, bp, mchehab, devicetree,
	guohanjun, andre.przywara, arnd, sboyd, linux-kernel, linux-edac

Hi,

>> I have looked at possibility of pushing correctable error logging in the
>> firmware; but given current hardware limitation it seems like OS is the best
>> place to implement it. Let me summaries the issues we are running into:
>>
>> * Correctable errors does not generate any interrupt:
>>   If we have to implement error parsing inside the firmware then work need
>>   to be split between OS and firmware. Maybe OS can call SMC instruction to 
>>   dial into firmware and then firmware can check error syndrome registers; 
>>   if it finds correctable error then build HEST table. This method will introduce
>>   performance issue because it require OS executing SMC every 100ms or so to just
>>   poll for correctable error. If you have any other recommendation then please share it.
> 
> I agree that this is a problem, and is an unfortunate hardware
> limitation.
> 
> I am still wary of making use of IMPLEMENTATION DEFINED features like
> this in the kernel.
>
I do see the reason behind shy away from IMPLEMENTATION DEFINED features
but to support L1/L2 correctable error feature we have limited choices:
- use of IMPLEMENTATION DEFINED register in the kernel
   or
- go with SMC kind of method explained above which causes performance hit
  (which also means platform specific driver)
  or
- don't implement it ;)

> 
> You won't have the HEST and DT information at the same time, given that
> at runtime the kernel uses one of ACPI or DT.
> 
> This also doesn't define the affinity of interrupts (i.e. which
> cluster/CPU does each interrupt get generated by), which is the more
> important part.
> 

It seems like we might need to describe
- cluster topology
- which cpu have the feature
- interrupt affinity
e.g
edac {
    cluster0 {
        core0 { cpu = <&cpu_0>; }
	core1 { cpu = <&cpu_1>; }
    }

    cluster1 {
	 core0 { cpu = <&cpu_2>; }
	 core1 { cpu = <&cpu_3>; }
    }

    ..............
    ..............
    num-interrupts = 2;
    interrupts  = <0, 92, 0> <0, 94, 0> ;
    interrupt-affinity = <&cluster0> <&cluster1>;    
}

Is it possible to avoid all these DT entries and do something at runtime ?
e.g we know for sure that fatal interrupts are generate per cluster.

> What about correctable errors? What if someone wants to poll that in FW
> (no matter how horrible that may be for performance)? What if a future
> CPU revision adds an optional interrupt for that?
> 
If FW handles the correctable error then DT should not contain the edac entry
and driver will not be probed. This also raises question on how we are going
to load the driver in ACPI case ?


Thanks
Brijesh

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2015-10-30 19:13 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-28 16:13 [PATCH v4] EDAC: Add ARM64 EDAC Brijesh Singh
2015-10-28 16:40 ` Mark Rutland
2015-10-30 16:26   ` Brijesh Singh
2015-10-30 17:06     ` Mark Rutland
2015-10-30 17:32       ` Mark Rutland
2015-10-30 17:51       ` Borislav Petkov
2015-10-30 19:12       ` Brijesh Singh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).