All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-21 20:41 ` Brijesh Singh
  0 siblings, 0 replies; 34+ messages in thread
From: Brijesh Singh @ 2015-10-21 20:41 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Brijesh Singh, robh+dt, pawel.moll, mark.rutland, ijc+devicetree,
	galak, dougthompson, bp, mchehab, devicetree, guohanjun,
	andre.przywara, arnd, linux-kernel, linux-edac

Add support for Cortex A57 and A53 EDAC driver.

Signed-off-by: Brijesh Singh <brijeshkumar.singh@amd.com>
CC: robh+dt@kernel.org
CC: pawel.moll@arm.com
CC: mark.rutland@arm.com
CC: ijc+devicetree@hellion.org.uk
CC: galak@codeaurora.org
CC: dougthompson@xmission.com
CC: bp@alien8.de
CC: mchehab@osg.samsung.com
CC: devicetree@vger.kernel.org
CC: guohanjun@huawei.com
CC: andre.przywara@arm.com
CC: arnd@arndb.de
CC: linux-kernel@vger.kernel.org
CC: linux-edac@vger.kernel.org
---

v2:
* convert into generic arm64 edac driver
* remove AMD specific references from dt binding
* remove poll_msec property from dt binding
* add poll_msec as a module param, default is 100ms
* update copyright text
* define macro mnemonics for L1 and L2 RAMID
* check L2 error per-cluster instead of per core
* update function names
* use get_online_cpus() and put_online_cpus() to make L1 and L2 register 
  read hotplug-safe
* add error check in probe routine

 .../devicetree/bindings/edac/armv8-edac.txt        |  15 +
 drivers/edac/Kconfig                               |   6 +
 drivers/edac/Makefile                              |   1 +
 drivers/edac/cortex_arm64_edac.c                   | 457 +++++++++++++++++++++
 4 files changed, 479 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/edac/armv8-edac.txt
 create mode 100644 drivers/edac/cortex_arm64_edac.c

diff --git a/Documentation/devicetree/bindings/edac/armv8-edac.txt b/Documentation/devicetree/bindings/edac/armv8-edac.txt
new file mode 100644
index 0000000..dfd128f
--- /dev/null
+++ b/Documentation/devicetree/bindings/edac/armv8-edac.txt
@@ -0,0 +1,15 @@
+* ARMv8 L1/L2 cache error reporting
+
+On ARMv8, CPU Memory Error Syndrome Register and L2 Memory Error Syndrome
+Register can be used for checking L1 and L2 memory errors.
+
+The following section describes the ARMv8 EDAC DT node binding.
+
+Required properties:
+- compatible: Should be "arm,armv8-edac"
+
+Example:
+	edac {
+		compatible = "arm,armv8-edac";
+	};
+
diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index ef25000..dd7c195 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -390,4 +390,10 @@ config EDAC_XGENE
 	  Support for error detection and correction on the
 	  APM X-Gene family of SOCs.
 
+config EDAC_CORTEX_ARM64
+	tristate "ARM Cortex A57/A53"
+	depends on EDAC_MM_EDAC && ARM64
+	help
+	  Support for error detection and correction on the
+	  ARM Cortex A57 and A53.
 endif # EDAC
diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
index ae3c5f3..ac01660 100644
--- a/drivers/edac/Makefile
+++ b/drivers/edac/Makefile
@@ -68,3 +68,4 @@ obj-$(CONFIG_EDAC_OCTEON_PCI)		+= octeon_edac-pci.o
 obj-$(CONFIG_EDAC_ALTERA_MC)		+= altera_edac.o
 obj-$(CONFIG_EDAC_SYNOPSYS)		+= synopsys_edac.o
 obj-$(CONFIG_EDAC_XGENE)		+= xgene_edac.o
+obj-$(CONFIG_EDAC_CORTEX_ARM64)		+= cortex_arm64_edac.o
diff --git a/drivers/edac/cortex_arm64_edac.c b/drivers/edac/cortex_arm64_edac.c
new file mode 100644
index 0000000..c37bb94
--- /dev/null
+++ b/drivers/edac/cortex_arm64_edac.c
@@ -0,0 +1,457 @@
+/*
+ * Cortex ARM64 EDAC
+ *
+ * Copyright (c) 2015, Advanced Micro Devices
+ * Author: Brijesh Singh <brijeshkumar.singh@amd.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/of_device.h>
+#include <linux/platform_device.h>
+
+#include "edac_core.h"
+
+#define EDAC_MOD_STR             "cortex_arm64_edac"
+
+#define A57_CPUMERRSR_EL1_INDEX(x)   ((x) & 0x1ffff)
+#define A57_CPUMERRSR_EL1_BANK(x)    (((x) >> 18) & 0x1f)
+#define A57_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
+#define A57_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
+#define A57_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0x7f)
+#define A57_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
+#define A57_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
+#define A57_L1_I_TAG_RAM	     0x00
+#define A57_L1_I_DATA_RAM	     0x01
+#define A57_L1_D_TAG_RAM	     0x08
+#define A57_L1_D_DATA_RAM	     0x09
+#define A57_L1_TLB_RAM		     0x18
+
+#define A57_L2MERRSR_EL1_INDEX(x)    ((x) & 0x1ffff)
+#define A57_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0xf)
+#define A57_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
+#define A57_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
+#define A57_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
+#define A57_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
+#define A57_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
+#define A57_L2_TAG_RAM		     0x10
+#define A57_L2_DATA_RAM		     0x11
+#define A57_L2_SNOOP_TAG_RAM	     0x12
+#define A57_L2_DIRTY_RAM	     0x14
+#define A57_L2_INCLUSION_PF_RAM      0x18
+
+#define A53_CPUMERRSR_EL1_ADDR(x)    ((x) & 0xfff)
+#define A53_CPUMERRSR_EL1_CPUID(x)   (((x) >> 18) & 0x07)
+#define A53_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
+#define A53_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
+#define A53_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0xff)
+#define A53_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
+#define A53_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
+#define A53_L1_I_TAG_RAM	     0x00
+#define A53_L1_I_DATA_RAM	     0x01
+#define A53_L1_D_TAG_RAM	     0x08
+#define A53_L1_D_DATA_RAM	     0x09
+#define A53_L1_D_DIRT_RAM	     0x0A
+#define A53_L1_TLB_RAM		     0x18
+
+#define A53_L2MERRSR_EL1_INDEX(x)    (((x) >> 3) & 0x3fff)
+#define A53_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0x0f)
+#define A53_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
+#define A53_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
+#define A53_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
+#define A53_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
+#define A53_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
+#define A53_L2_TAG_RAM		     0x10
+#define A53_L2_DATA_RAM		     0x11
+#define A53_L2_SNOOP_RAM	     0x12
+
+#define L1_CACHE		     0
+#define L2_CACHE		     1
+
+int poll_msec = 100;
+
+struct cortex_arm64_edac {
+	struct edac_device_ctl_info *edac_ctl;
+};
+
+static inline u64 read_cpumerrsr_el1(void)
+{
+	u64 val;
+
+	asm volatile("mrs %0, s3_1_c15_c2_2" : "=r" (val));
+	return val;
+}
+
+static inline void write_cpumerrsr_el1(u64 val)
+{
+	asm volatile("msr s3_1_c15_c2_2, %0" :: "r" (val));
+}
+
+static inline u64 read_l2merrsr_el1(void)
+{
+	u64 val;
+
+	asm volatile("mrs %0, s3_1_c15_c2_3" : "=r" (val));
+	return val;
+}
+
+static inline void write_l2merrsr_el1(u64 val)
+{
+	asm volatile("msr s3_1_c15_c2_3, %0" :: "r" (val));
+}
+
+static void a53_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
+{
+	int fatal;
+	int repeat_err, other_err;
+	u64 val = read_l2merrsr_el1();
+
+	if (!A53_L2MERRSR_EL1_VALID(val))
+		return;
+
+	fatal = A53_L2MERRSR_EL1_FATAL(val);
+	repeat_err = A53_L2MERRSR_EL1_REPEAT(val);
+	other_err = A53_L2MERRSR_EL1_OTHER(val);
+
+	edac_printk(KERN_CRIT, EDAC_MOD_STR,
+		    "A53 CPU%d L2 %s error detected!\n", smp_processor_id(),
+		    fatal ? "fatal" : "non-fatal");
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
+
+	switch (A53_L2MERRSR_EL1_RAMID(val)) {
+	case A53_L2_TAG_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
+		break;
+	case A53_L2_DATA_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
+		break;
+	case A53_L2_SNOOP_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop filter RAM\n");
+		break;
+	default:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
+		break;
+	}
+
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
+		    repeat_err);
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
+		    other_err);
+	if (fatal)
+		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
+				      edac_ctl->name);
+	else
+		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
+				      edac_ctl->name);
+	write_l2merrsr_el1(0);
+}
+
+static void a57_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
+{
+	int fatal;
+	int repeat_err, other_err;
+	u64 val = read_l2merrsr_el1();
+
+	if (!A57_L2MERRSR_EL1_VALID(val))
+		return;
+
+	fatal = A57_L2MERRSR_EL1_FATAL(val);
+	repeat_err = A57_L2MERRSR_EL1_REPEAT(val);
+	other_err = A57_L2MERRSR_EL1_OTHER(val);
+
+	edac_printk(KERN_CRIT, EDAC_MOD_STR,
+		    "A57 CPU%d L2 %s error detected!\n", smp_processor_id(),
+		    fatal ? "fatal" : "non-fatal");
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
+
+	switch (A57_L2MERRSR_EL1_RAMID(val)) {
+	case A57_L2_TAG_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
+		break;
+	case A57_L2_DATA_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
+		break;
+	case A57_L2_SNOOP_TAG_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop tag RAM\n");
+		break;
+	case A57_L2_DIRTY_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Dirty RAM\n");
+		break;
+	case A57_L2_INCLUSION_PF_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 inclusion PF RAM\n");
+		break;
+	default:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
+		break;
+	}
+
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
+		    repeat_err);
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
+		    other_err);
+	if (fatal)
+		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
+				      edac_ctl->name);
+	else
+		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
+				      edac_ctl->name);
+	write_l2merrsr_el1(0);
+}
+
+static void a57_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
+{
+	int fatal;
+	int repeat_err, other_err;
+	u64 val = read_cpumerrsr_el1();
+
+	if (!A57_CPUMERRSR_EL1_VALID(val))
+		return;
+
+	fatal = A57_CPUMERRSR_EL1_FATAL(val);
+	repeat_err = A57_CPUMERRSR_EL1_REPEAT(val);
+	other_err = A57_CPUMERRSR_EL1_OTHER(val);
+
+	edac_printk(KERN_CRIT, EDAC_MOD_STR,
+		    "CPU%d L1 %s error detected!\n", smp_processor_id(),
+		    fatal ? "fatal" : "non-fatal");
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
+
+	switch (A57_CPUMERRSR_EL1_RAMID(val)) {
+	case A57_L1_I_TAG_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
+		break;
+	case A57_L1_I_DATA_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
+		break;
+	case A57_L1_D_TAG_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
+		break;
+	case A57_L1_D_DATA_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
+		break;
+	case A57_L1_TLB_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
+		break;
+	default:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
+		break;
+	}
+
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
+		    repeat_err);
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
+		    other_err);
+
+	if (fatal)
+		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
+				      edac_ctl->name);
+	else
+		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
+				      edac_ctl->name);
+	write_cpumerrsr_el1(0);
+}
+
+static void a53_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
+{
+	int fatal;
+	int repeat_err, other_err;
+	u64 val = read_cpumerrsr_el1();
+
+	if (!A53_CPUMERRSR_EL1_VALID(val))
+		return;
+
+	fatal = A53_CPUMERRSR_EL1_FATAL(val);
+	repeat_err = A53_CPUMERRSR_EL1_REPEAT(val);
+	other_err = A53_CPUMERRSR_EL1_OTHER(val);
+
+	edac_printk(KERN_CRIT, EDAC_MOD_STR,
+		    "A53 CPU%d L1 %s error detected!\n", smp_processor_id(),
+		    fatal ? "fatal" : "non-fatal");
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
+
+	switch (A53_CPUMERRSR_EL1_RAMID(val)) {
+	case A53_L1_I_TAG_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
+		break;
+	case A53_L1_I_DATA_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
+		break;
+	case A53_L1_D_TAG_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
+		break;
+	case A53_L1_D_DATA_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
+		break;
+	case A53_L1_TLB_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
+		break;
+	default:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
+		break;
+	}
+
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
+		    repeat_err);
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
+		    other_err);
+
+	if (fatal)
+		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
+				      edac_ctl->name);
+	else
+		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
+				      edac_ctl->name);
+	write_cpumerrsr_el1(0);
+}
+
+static void parse_cpumerrsr(void *args)
+{
+	struct edac_device_ctl_info *edac_ctl = args;
+	int partnum = read_cpuid_part_number();
+
+	switch (partnum) {
+	case ARM_CPU_PART_CORTEX_A57:
+		a57_parse_cpumerrsr(edac_ctl);
+		break;
+	case ARM_CPU_PART_CORTEX_A53:
+		a53_parse_cpumerrsr(edac_ctl);
+		break;
+	}
+}
+
+static void parse_l2merrsr(void *args)
+{
+	struct edac_device_ctl_info *edac_ctl = args;
+	int partnum = read_cpuid_part_number();
+
+	switch (partnum) {
+	case ARM_CPU_PART_CORTEX_A57:
+		a57_parse_l2merrsr(edac_ctl);
+		break;
+	case ARM_CPU_PART_CORTEX_A53:
+		a53_parse_l2merrsr(edac_ctl);
+		break;
+	}
+}
+
+static void arm64_monitor_cache_errors(struct edac_device_ctl_info *edev_ctl)
+{
+	int cpu;
+	struct cpumask cluster_mask, old_mask;
+
+	cpumask_clear(&cluster_mask);
+	cpumask_clear(&old_mask);
+
+	get_online_cpus();
+	for_each_online_cpu(cpu) {
+		smp_call_function_single(cpu, parse_cpumerrsr, edev_ctl, 0);
+		cpumask_copy(&cluster_mask, topology_core_cpumask(cpu));
+		if (cpumask_equal(&cluster_mask, &old_mask))
+			continue;
+		cpumask_copy(&old_mask, &cluster_mask);
+		smp_call_function_any(&cluster_mask, parse_l2merrsr,
+				      edev_ctl, 0);
+	}
+	put_online_cpus();
+}
+
+static int cortex_arm64_edac_probe(struct platform_device *pdev)
+{
+	int rc;
+	struct cortex_arm64_edac *drv;
+	struct device *dev = &pdev->dev;
+
+	drv = devm_kzalloc(dev, sizeof(*drv), GFP_KERNEL);
+	if (!drv)
+		return -ENOMEM;
+
+	drv->edac_ctl = edac_device_alloc_ctl_info(0, "cpu",
+						   num_possible_cpus(), "L", 2,
+						   1, NULL, 0,
+						   edac_device_alloc_index());
+	if (IS_ERR(drv->edac_ctl))
+		return -ENOMEM;
+
+	drv->edac_ctl->poll_msec = poll_msec;
+	drv->edac_ctl->edac_check = arm64_monitor_cache_errors;
+	drv->edac_ctl->dev = dev;
+	drv->edac_ctl->mod_name = dev_name(dev);
+	drv->edac_ctl->dev_name = dev_name(dev);
+	drv->edac_ctl->ctl_name = "cpu_err";
+	drv->edac_ctl->panic_on_ue = 1;
+	platform_set_drvdata(pdev, drv);
+
+	rc = edac_device_add_device(drv->edac_ctl);
+	if (rc)
+		goto edac_alloc_failed;
+
+	return 0;
+
+edac_alloc_failed:
+	edac_device_free_ctl_info(drv->edac_ctl);
+	return rc;
+}
+
+static int cortex_arm64_edac_remove(struct platform_device *pdev)
+{
+	struct cortex_arm64_edac *drv = dev_get_drvdata(&pdev->dev);
+	struct edac_device_ctl_info *edac_ctl = drv->edac_ctl;
+
+	edac_device_del_device(edac_ctl->dev);
+	edac_device_free_ctl_info(edac_ctl);
+
+	return 0;
+}
+
+static const struct of_device_id cortex_arm64_edac_of_match[] = {
+	{ .compatible = "arm,armv8-edac" },
+	{},
+};
+MODULE_DEVICE_TABLE(of, cortex_arm64_edac_of_match);
+
+static struct platform_driver cortex_arm64_edac_driver = {
+	.probe = cortex_arm64_edac_probe,
+	.remove = cortex_arm64_edac_remove,
+	.driver = {
+		.name = "arm64-edac",
+		.owner = THIS_MODULE,
+		.of_match_table = cortex_arm64_edac_of_match,
+	},
+};
+
+static int __init cortex_arm64_edac_init(void)
+{
+	int rc;
+
+	/* Only POLL mode is supported so far */
+	edac_op_state = EDAC_OPSTATE_POLL;
+
+	rc = platform_driver_register(&cortex_arm64_edac_driver);
+	if (rc) {
+		edac_printk(KERN_ERR, EDAC_MOD_STR, "failed to register\n");
+		return rc;
+	}
+
+	return 0;
+}
+module_init(cortex_arm64_edac_init);
+
+static void __exit cortex_arm64_edac_exit(void)
+{
+	platform_driver_unregister(&cortex_arm64_edac_driver);
+}
+module_exit(cortex_arm64_edac_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Brijesh Singh <brijeshkumar.singh@amd.com>");
+MODULE_DESCRIPTION("Cortex A57 and A53 EDAC driver");
+module_param(poll_msec, int, 0444);
+MODULE_PARM_DESC(poll_msec, "EDAC monitor poll interval in msec");
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-21 20:41 ` Brijesh Singh
  0 siblings, 0 replies; 34+ messages in thread
From: Brijesh Singh @ 2015-10-21 20:41 UTC (permalink / raw)
  To: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r
  Cc: Brijesh Singh, robh+dt-DgEjT+Ai2ygdnm+yROfE0A,
	pawel.moll-5wv7dgnIgG8, mark.rutland-5wv7dgnIgG8,
	ijc+devicetree-KcIKpvwj1kUDXYZnReoRVg,
	galak-sgV2jX0FEOL9JmXXK+q4OQ,
	dougthompson-aS9lmoZGLiVWk0Htik3J/w, bp-Gina5bIWoIWzQB+pC5nmwQ,
	mchehab-JPH+aEBZ4P+UEJcrhfAQsw,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	guohanjun-hv44wF8Li93QT0dZR+AlfA, andre.przywara-5wv7dgnIgG8,
	arnd-r2nGTMty4D4, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-edac-u79uwXL29TY76Z2rM5mHXA

Add support for Cortex A57 and A53 EDAC driver.

Signed-off-by: Brijesh Singh <brijeshkumar.singh-5C7GfCeVMHo@public.gmane.org>
CC: robh+dt-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
CC: pawel.moll-5wv7dgnIgG8@public.gmane.org
CC: mark.rutland-5wv7dgnIgG8@public.gmane.org
CC: ijc+devicetree-KcIKpvwj1kUDXYZnReoRVg@public.gmane.org
CC: galak-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org
CC: dougthompson-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org
CC: bp-Gina5bIWoIWzQB+pC5nmwQ@public.gmane.org
CC: mchehab-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org
CC: devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
CC: guohanjun-hv44wF8Li93QT0dZR+AlfA@public.gmane.org
CC: andre.przywara-5wv7dgnIgG8@public.gmane.org
CC: arnd-r2nGTMty4D4@public.gmane.org
CC: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
CC: linux-edac-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---

v2:
* convert into generic arm64 edac driver
* remove AMD specific references from dt binding
* remove poll_msec property from dt binding
* add poll_msec as a module param, default is 100ms
* update copyright text
* define macro mnemonics for L1 and L2 RAMID
* check L2 error per-cluster instead of per core
* update function names
* use get_online_cpus() and put_online_cpus() to make L1 and L2 register 
  read hotplug-safe
* add error check in probe routine

 .../devicetree/bindings/edac/armv8-edac.txt        |  15 +
 drivers/edac/Kconfig                               |   6 +
 drivers/edac/Makefile                              |   1 +
 drivers/edac/cortex_arm64_edac.c                   | 457 +++++++++++++++++++++
 4 files changed, 479 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/edac/armv8-edac.txt
 create mode 100644 drivers/edac/cortex_arm64_edac.c

diff --git a/Documentation/devicetree/bindings/edac/armv8-edac.txt b/Documentation/devicetree/bindings/edac/armv8-edac.txt
new file mode 100644
index 0000000..dfd128f
--- /dev/null
+++ b/Documentation/devicetree/bindings/edac/armv8-edac.txt
@@ -0,0 +1,15 @@
+* ARMv8 L1/L2 cache error reporting
+
+On ARMv8, CPU Memory Error Syndrome Register and L2 Memory Error Syndrome
+Register can be used for checking L1 and L2 memory errors.
+
+The following section describes the ARMv8 EDAC DT node binding.
+
+Required properties:
+- compatible: Should be "arm,armv8-edac"
+
+Example:
+	edac {
+		compatible = "arm,armv8-edac";
+	};
+
diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index ef25000..dd7c195 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -390,4 +390,10 @@ config EDAC_XGENE
 	  Support for error detection and correction on the
 	  APM X-Gene family of SOCs.
 
+config EDAC_CORTEX_ARM64
+	tristate "ARM Cortex A57/A53"
+	depends on EDAC_MM_EDAC && ARM64
+	help
+	  Support for error detection and correction on the
+	  ARM Cortex A57 and A53.
 endif # EDAC
diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
index ae3c5f3..ac01660 100644
--- a/drivers/edac/Makefile
+++ b/drivers/edac/Makefile
@@ -68,3 +68,4 @@ obj-$(CONFIG_EDAC_OCTEON_PCI)		+= octeon_edac-pci.o
 obj-$(CONFIG_EDAC_ALTERA_MC)		+= altera_edac.o
 obj-$(CONFIG_EDAC_SYNOPSYS)		+= synopsys_edac.o
 obj-$(CONFIG_EDAC_XGENE)		+= xgene_edac.o
+obj-$(CONFIG_EDAC_CORTEX_ARM64)		+= cortex_arm64_edac.o
diff --git a/drivers/edac/cortex_arm64_edac.c b/drivers/edac/cortex_arm64_edac.c
new file mode 100644
index 0000000..c37bb94
--- /dev/null
+++ b/drivers/edac/cortex_arm64_edac.c
@@ -0,0 +1,457 @@
+/*
+ * Cortex ARM64 EDAC
+ *
+ * Copyright (c) 2015, Advanced Micro Devices
+ * Author: Brijesh Singh <brijeshkumar.singh-5C7GfCeVMHo@public.gmane.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/of_device.h>
+#include <linux/platform_device.h>
+
+#include "edac_core.h"
+
+#define EDAC_MOD_STR             "cortex_arm64_edac"
+
+#define A57_CPUMERRSR_EL1_INDEX(x)   ((x) & 0x1ffff)
+#define A57_CPUMERRSR_EL1_BANK(x)    (((x) >> 18) & 0x1f)
+#define A57_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
+#define A57_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
+#define A57_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0x7f)
+#define A57_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
+#define A57_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
+#define A57_L1_I_TAG_RAM	     0x00
+#define A57_L1_I_DATA_RAM	     0x01
+#define A57_L1_D_TAG_RAM	     0x08
+#define A57_L1_D_DATA_RAM	     0x09
+#define A57_L1_TLB_RAM		     0x18
+
+#define A57_L2MERRSR_EL1_INDEX(x)    ((x) & 0x1ffff)
+#define A57_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0xf)
+#define A57_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
+#define A57_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
+#define A57_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
+#define A57_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
+#define A57_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
+#define A57_L2_TAG_RAM		     0x10
+#define A57_L2_DATA_RAM		     0x11
+#define A57_L2_SNOOP_TAG_RAM	     0x12
+#define A57_L2_DIRTY_RAM	     0x14
+#define A57_L2_INCLUSION_PF_RAM      0x18
+
+#define A53_CPUMERRSR_EL1_ADDR(x)    ((x) & 0xfff)
+#define A53_CPUMERRSR_EL1_CPUID(x)   (((x) >> 18) & 0x07)
+#define A53_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
+#define A53_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
+#define A53_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0xff)
+#define A53_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
+#define A53_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
+#define A53_L1_I_TAG_RAM	     0x00
+#define A53_L1_I_DATA_RAM	     0x01
+#define A53_L1_D_TAG_RAM	     0x08
+#define A53_L1_D_DATA_RAM	     0x09
+#define A53_L1_D_DIRT_RAM	     0x0A
+#define A53_L1_TLB_RAM		     0x18
+
+#define A53_L2MERRSR_EL1_INDEX(x)    (((x) >> 3) & 0x3fff)
+#define A53_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0x0f)
+#define A53_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
+#define A53_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
+#define A53_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
+#define A53_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
+#define A53_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
+#define A53_L2_TAG_RAM		     0x10
+#define A53_L2_DATA_RAM		     0x11
+#define A53_L2_SNOOP_RAM	     0x12
+
+#define L1_CACHE		     0
+#define L2_CACHE		     1
+
+int poll_msec = 100;
+
+struct cortex_arm64_edac {
+	struct edac_device_ctl_info *edac_ctl;
+};
+
+static inline u64 read_cpumerrsr_el1(void)
+{
+	u64 val;
+
+	asm volatile("mrs %0, s3_1_c15_c2_2" : "=r" (val));
+	return val;
+}
+
+static inline void write_cpumerrsr_el1(u64 val)
+{
+	asm volatile("msr s3_1_c15_c2_2, %0" :: "r" (val));
+}
+
+static inline u64 read_l2merrsr_el1(void)
+{
+	u64 val;
+
+	asm volatile("mrs %0, s3_1_c15_c2_3" : "=r" (val));
+	return val;
+}
+
+static inline void write_l2merrsr_el1(u64 val)
+{
+	asm volatile("msr s3_1_c15_c2_3, %0" :: "r" (val));
+}
+
+static void a53_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
+{
+	int fatal;
+	int repeat_err, other_err;
+	u64 val = read_l2merrsr_el1();
+
+	if (!A53_L2MERRSR_EL1_VALID(val))
+		return;
+
+	fatal = A53_L2MERRSR_EL1_FATAL(val);
+	repeat_err = A53_L2MERRSR_EL1_REPEAT(val);
+	other_err = A53_L2MERRSR_EL1_OTHER(val);
+
+	edac_printk(KERN_CRIT, EDAC_MOD_STR,
+		    "A53 CPU%d L2 %s error detected!\n", smp_processor_id(),
+		    fatal ? "fatal" : "non-fatal");
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
+
+	switch (A53_L2MERRSR_EL1_RAMID(val)) {
+	case A53_L2_TAG_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
+		break;
+	case A53_L2_DATA_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
+		break;
+	case A53_L2_SNOOP_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop filter RAM\n");
+		break;
+	default:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
+		break;
+	}
+
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
+		    repeat_err);
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
+		    other_err);
+	if (fatal)
+		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
+				      edac_ctl->name);
+	else
+		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
+				      edac_ctl->name);
+	write_l2merrsr_el1(0);
+}
+
+static void a57_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
+{
+	int fatal;
+	int repeat_err, other_err;
+	u64 val = read_l2merrsr_el1();
+
+	if (!A57_L2MERRSR_EL1_VALID(val))
+		return;
+
+	fatal = A57_L2MERRSR_EL1_FATAL(val);
+	repeat_err = A57_L2MERRSR_EL1_REPEAT(val);
+	other_err = A57_L2MERRSR_EL1_OTHER(val);
+
+	edac_printk(KERN_CRIT, EDAC_MOD_STR,
+		    "A57 CPU%d L2 %s error detected!\n", smp_processor_id(),
+		    fatal ? "fatal" : "non-fatal");
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
+
+	switch (A57_L2MERRSR_EL1_RAMID(val)) {
+	case A57_L2_TAG_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
+		break;
+	case A57_L2_DATA_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
+		break;
+	case A57_L2_SNOOP_TAG_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop tag RAM\n");
+		break;
+	case A57_L2_DIRTY_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Dirty RAM\n");
+		break;
+	case A57_L2_INCLUSION_PF_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 inclusion PF RAM\n");
+		break;
+	default:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
+		break;
+	}
+
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
+		    repeat_err);
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
+		    other_err);
+	if (fatal)
+		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
+				      edac_ctl->name);
+	else
+		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
+				      edac_ctl->name);
+	write_l2merrsr_el1(0);
+}
+
+static void a57_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
+{
+	int fatal;
+	int repeat_err, other_err;
+	u64 val = read_cpumerrsr_el1();
+
+	if (!A57_CPUMERRSR_EL1_VALID(val))
+		return;
+
+	fatal = A57_CPUMERRSR_EL1_FATAL(val);
+	repeat_err = A57_CPUMERRSR_EL1_REPEAT(val);
+	other_err = A57_CPUMERRSR_EL1_OTHER(val);
+
+	edac_printk(KERN_CRIT, EDAC_MOD_STR,
+		    "CPU%d L1 %s error detected!\n", smp_processor_id(),
+		    fatal ? "fatal" : "non-fatal");
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
+
+	switch (A57_CPUMERRSR_EL1_RAMID(val)) {
+	case A57_L1_I_TAG_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
+		break;
+	case A57_L1_I_DATA_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
+		break;
+	case A57_L1_D_TAG_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
+		break;
+	case A57_L1_D_DATA_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
+		break;
+	case A57_L1_TLB_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
+		break;
+	default:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
+		break;
+	}
+
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
+		    repeat_err);
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
+		    other_err);
+
+	if (fatal)
+		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
+				      edac_ctl->name);
+	else
+		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
+				      edac_ctl->name);
+	write_cpumerrsr_el1(0);
+}
+
+static void a53_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
+{
+	int fatal;
+	int repeat_err, other_err;
+	u64 val = read_cpumerrsr_el1();
+
+	if (!A53_CPUMERRSR_EL1_VALID(val))
+		return;
+
+	fatal = A53_CPUMERRSR_EL1_FATAL(val);
+	repeat_err = A53_CPUMERRSR_EL1_REPEAT(val);
+	other_err = A53_CPUMERRSR_EL1_OTHER(val);
+
+	edac_printk(KERN_CRIT, EDAC_MOD_STR,
+		    "A53 CPU%d L1 %s error detected!\n", smp_processor_id(),
+		    fatal ? "fatal" : "non-fatal");
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
+
+	switch (A53_CPUMERRSR_EL1_RAMID(val)) {
+	case A53_L1_I_TAG_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
+		break;
+	case A53_L1_I_DATA_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
+		break;
+	case A53_L1_D_TAG_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
+		break;
+	case A53_L1_D_DATA_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
+		break;
+	case A53_L1_TLB_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
+		break;
+	default:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
+		break;
+	}
+
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
+		    repeat_err);
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
+		    other_err);
+
+	if (fatal)
+		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
+				      edac_ctl->name);
+	else
+		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
+				      edac_ctl->name);
+	write_cpumerrsr_el1(0);
+}
+
+static void parse_cpumerrsr(void *args)
+{
+	struct edac_device_ctl_info *edac_ctl = args;
+	int partnum = read_cpuid_part_number();
+
+	switch (partnum) {
+	case ARM_CPU_PART_CORTEX_A57:
+		a57_parse_cpumerrsr(edac_ctl);
+		break;
+	case ARM_CPU_PART_CORTEX_A53:
+		a53_parse_cpumerrsr(edac_ctl);
+		break;
+	}
+}
+
+static void parse_l2merrsr(void *args)
+{
+	struct edac_device_ctl_info *edac_ctl = args;
+	int partnum = read_cpuid_part_number();
+
+	switch (partnum) {
+	case ARM_CPU_PART_CORTEX_A57:
+		a57_parse_l2merrsr(edac_ctl);
+		break;
+	case ARM_CPU_PART_CORTEX_A53:
+		a53_parse_l2merrsr(edac_ctl);
+		break;
+	}
+}
+
+static void arm64_monitor_cache_errors(struct edac_device_ctl_info *edev_ctl)
+{
+	int cpu;
+	struct cpumask cluster_mask, old_mask;
+
+	cpumask_clear(&cluster_mask);
+	cpumask_clear(&old_mask);
+
+	get_online_cpus();
+	for_each_online_cpu(cpu) {
+		smp_call_function_single(cpu, parse_cpumerrsr, edev_ctl, 0);
+		cpumask_copy(&cluster_mask, topology_core_cpumask(cpu));
+		if (cpumask_equal(&cluster_mask, &old_mask))
+			continue;
+		cpumask_copy(&old_mask, &cluster_mask);
+		smp_call_function_any(&cluster_mask, parse_l2merrsr,
+				      edev_ctl, 0);
+	}
+	put_online_cpus();
+}
+
+static int cortex_arm64_edac_probe(struct platform_device *pdev)
+{
+	int rc;
+	struct cortex_arm64_edac *drv;
+	struct device *dev = &pdev->dev;
+
+	drv = devm_kzalloc(dev, sizeof(*drv), GFP_KERNEL);
+	if (!drv)
+		return -ENOMEM;
+
+	drv->edac_ctl = edac_device_alloc_ctl_info(0, "cpu",
+						   num_possible_cpus(), "L", 2,
+						   1, NULL, 0,
+						   edac_device_alloc_index());
+	if (IS_ERR(drv->edac_ctl))
+		return -ENOMEM;
+
+	drv->edac_ctl->poll_msec = poll_msec;
+	drv->edac_ctl->edac_check = arm64_monitor_cache_errors;
+	drv->edac_ctl->dev = dev;
+	drv->edac_ctl->mod_name = dev_name(dev);
+	drv->edac_ctl->dev_name = dev_name(dev);
+	drv->edac_ctl->ctl_name = "cpu_err";
+	drv->edac_ctl->panic_on_ue = 1;
+	platform_set_drvdata(pdev, drv);
+
+	rc = edac_device_add_device(drv->edac_ctl);
+	if (rc)
+		goto edac_alloc_failed;
+
+	return 0;
+
+edac_alloc_failed:
+	edac_device_free_ctl_info(drv->edac_ctl);
+	return rc;
+}
+
+static int cortex_arm64_edac_remove(struct platform_device *pdev)
+{
+	struct cortex_arm64_edac *drv = dev_get_drvdata(&pdev->dev);
+	struct edac_device_ctl_info *edac_ctl = drv->edac_ctl;
+
+	edac_device_del_device(edac_ctl->dev);
+	edac_device_free_ctl_info(edac_ctl);
+
+	return 0;
+}
+
+static const struct of_device_id cortex_arm64_edac_of_match[] = {
+	{ .compatible = "arm,armv8-edac" },
+	{},
+};
+MODULE_DEVICE_TABLE(of, cortex_arm64_edac_of_match);
+
+static struct platform_driver cortex_arm64_edac_driver = {
+	.probe = cortex_arm64_edac_probe,
+	.remove = cortex_arm64_edac_remove,
+	.driver = {
+		.name = "arm64-edac",
+		.owner = THIS_MODULE,
+		.of_match_table = cortex_arm64_edac_of_match,
+	},
+};
+
+static int __init cortex_arm64_edac_init(void)
+{
+	int rc;
+
+	/* Only POLL mode is supported so far */
+	edac_op_state = EDAC_OPSTATE_POLL;
+
+	rc = platform_driver_register(&cortex_arm64_edac_driver);
+	if (rc) {
+		edac_printk(KERN_ERR, EDAC_MOD_STR, "failed to register\n");
+		return rc;
+	}
+
+	return 0;
+}
+module_init(cortex_arm64_edac_init);
+
+static void __exit cortex_arm64_edac_exit(void)
+{
+	platform_driver_unregister(&cortex_arm64_edac_driver);
+}
+module_exit(cortex_arm64_edac_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Brijesh Singh <brijeshkumar.singh-5C7GfCeVMHo@public.gmane.org>");
+MODULE_DESCRIPTION("Cortex A57 and A53 EDAC driver");
+module_param(poll_msec, int, 0444);
+MODULE_PARM_DESC(poll_msec, "EDAC monitor poll interval in msec");
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-21 20:41 ` Brijesh Singh
  0 siblings, 0 replies; 34+ messages in thread
From: Brijesh Singh @ 2015-10-21 20:41 UTC (permalink / raw)
  To: linux-arm-kernel

Add support for Cortex A57 and A53 EDAC driver.

Signed-off-by: Brijesh Singh <brijeshkumar.singh@amd.com>
CC: robh+dt at kernel.org
CC: pawel.moll at arm.com
CC: mark.rutland at arm.com
CC: ijc+devicetree at hellion.org.uk
CC: galak at codeaurora.org
CC: dougthompson at xmission.com
CC: bp at alien8.de
CC: mchehab at osg.samsung.com
CC: devicetree at vger.kernel.org
CC: guohanjun at huawei.com
CC: andre.przywara at arm.com
CC: arnd at arndb.de
CC: linux-kernel at vger.kernel.org
CC: linux-edac at vger.kernel.org
---

v2:
* convert into generic arm64 edac driver
* remove AMD specific references from dt binding
* remove poll_msec property from dt binding
* add poll_msec as a module param, default is 100ms
* update copyright text
* define macro mnemonics for L1 and L2 RAMID
* check L2 error per-cluster instead of per core
* update function names
* use get_online_cpus() and put_online_cpus() to make L1 and L2 register 
  read hotplug-safe
* add error check in probe routine

 .../devicetree/bindings/edac/armv8-edac.txt        |  15 +
 drivers/edac/Kconfig                               |   6 +
 drivers/edac/Makefile                              |   1 +
 drivers/edac/cortex_arm64_edac.c                   | 457 +++++++++++++++++++++
 4 files changed, 479 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/edac/armv8-edac.txt
 create mode 100644 drivers/edac/cortex_arm64_edac.c

diff --git a/Documentation/devicetree/bindings/edac/armv8-edac.txt b/Documentation/devicetree/bindings/edac/armv8-edac.txt
new file mode 100644
index 0000000..dfd128f
--- /dev/null
+++ b/Documentation/devicetree/bindings/edac/armv8-edac.txt
@@ -0,0 +1,15 @@
+* ARMv8 L1/L2 cache error reporting
+
+On ARMv8, CPU Memory Error Syndrome Register and L2 Memory Error Syndrome
+Register can be used for checking L1 and L2 memory errors.
+
+The following section describes the ARMv8 EDAC DT node binding.
+
+Required properties:
+- compatible: Should be "arm,armv8-edac"
+
+Example:
+	edac {
+		compatible = "arm,armv8-edac";
+	};
+
diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index ef25000..dd7c195 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -390,4 +390,10 @@ config EDAC_XGENE
 	  Support for error detection and correction on the
 	  APM X-Gene family of SOCs.
 
+config EDAC_CORTEX_ARM64
+	tristate "ARM Cortex A57/A53"
+	depends on EDAC_MM_EDAC && ARM64
+	help
+	  Support for error detection and correction on the
+	  ARM Cortex A57 and A53.
 endif # EDAC
diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
index ae3c5f3..ac01660 100644
--- a/drivers/edac/Makefile
+++ b/drivers/edac/Makefile
@@ -68,3 +68,4 @@ obj-$(CONFIG_EDAC_OCTEON_PCI)		+= octeon_edac-pci.o
 obj-$(CONFIG_EDAC_ALTERA_MC)		+= altera_edac.o
 obj-$(CONFIG_EDAC_SYNOPSYS)		+= synopsys_edac.o
 obj-$(CONFIG_EDAC_XGENE)		+= xgene_edac.o
+obj-$(CONFIG_EDAC_CORTEX_ARM64)		+= cortex_arm64_edac.o
diff --git a/drivers/edac/cortex_arm64_edac.c b/drivers/edac/cortex_arm64_edac.c
new file mode 100644
index 0000000..c37bb94
--- /dev/null
+++ b/drivers/edac/cortex_arm64_edac.c
@@ -0,0 +1,457 @@
+/*
+ * Cortex ARM64 EDAC
+ *
+ * Copyright (c) 2015, Advanced Micro Devices
+ * Author: Brijesh Singh <brijeshkumar.singh@amd.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/of_device.h>
+#include <linux/platform_device.h>
+
+#include "edac_core.h"
+
+#define EDAC_MOD_STR             "cortex_arm64_edac"
+
+#define A57_CPUMERRSR_EL1_INDEX(x)   ((x) & 0x1ffff)
+#define A57_CPUMERRSR_EL1_BANK(x)    (((x) >> 18) & 0x1f)
+#define A57_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
+#define A57_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
+#define A57_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0x7f)
+#define A57_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
+#define A57_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
+#define A57_L1_I_TAG_RAM	     0x00
+#define A57_L1_I_DATA_RAM	     0x01
+#define A57_L1_D_TAG_RAM	     0x08
+#define A57_L1_D_DATA_RAM	     0x09
+#define A57_L1_TLB_RAM		     0x18
+
+#define A57_L2MERRSR_EL1_INDEX(x)    ((x) & 0x1ffff)
+#define A57_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0xf)
+#define A57_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
+#define A57_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
+#define A57_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
+#define A57_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
+#define A57_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
+#define A57_L2_TAG_RAM		     0x10
+#define A57_L2_DATA_RAM		     0x11
+#define A57_L2_SNOOP_TAG_RAM	     0x12
+#define A57_L2_DIRTY_RAM	     0x14
+#define A57_L2_INCLUSION_PF_RAM      0x18
+
+#define A53_CPUMERRSR_EL1_ADDR(x)    ((x) & 0xfff)
+#define A53_CPUMERRSR_EL1_CPUID(x)   (((x) >> 18) & 0x07)
+#define A53_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
+#define A53_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
+#define A53_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0xff)
+#define A53_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
+#define A53_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
+#define A53_L1_I_TAG_RAM	     0x00
+#define A53_L1_I_DATA_RAM	     0x01
+#define A53_L1_D_TAG_RAM	     0x08
+#define A53_L1_D_DATA_RAM	     0x09
+#define A53_L1_D_DIRT_RAM	     0x0A
+#define A53_L1_TLB_RAM		     0x18
+
+#define A53_L2MERRSR_EL1_INDEX(x)    (((x) >> 3) & 0x3fff)
+#define A53_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0x0f)
+#define A53_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
+#define A53_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
+#define A53_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
+#define A53_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
+#define A53_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
+#define A53_L2_TAG_RAM		     0x10
+#define A53_L2_DATA_RAM		     0x11
+#define A53_L2_SNOOP_RAM	     0x12
+
+#define L1_CACHE		     0
+#define L2_CACHE		     1
+
+int poll_msec = 100;
+
+struct cortex_arm64_edac {
+	struct edac_device_ctl_info *edac_ctl;
+};
+
+static inline u64 read_cpumerrsr_el1(void)
+{
+	u64 val;
+
+	asm volatile("mrs %0, s3_1_c15_c2_2" : "=r" (val));
+	return val;
+}
+
+static inline void write_cpumerrsr_el1(u64 val)
+{
+	asm volatile("msr s3_1_c15_c2_2, %0" :: "r" (val));
+}
+
+static inline u64 read_l2merrsr_el1(void)
+{
+	u64 val;
+
+	asm volatile("mrs %0, s3_1_c15_c2_3" : "=r" (val));
+	return val;
+}
+
+static inline void write_l2merrsr_el1(u64 val)
+{
+	asm volatile("msr s3_1_c15_c2_3, %0" :: "r" (val));
+}
+
+static void a53_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
+{
+	int fatal;
+	int repeat_err, other_err;
+	u64 val = read_l2merrsr_el1();
+
+	if (!A53_L2MERRSR_EL1_VALID(val))
+		return;
+
+	fatal = A53_L2MERRSR_EL1_FATAL(val);
+	repeat_err = A53_L2MERRSR_EL1_REPEAT(val);
+	other_err = A53_L2MERRSR_EL1_OTHER(val);
+
+	edac_printk(KERN_CRIT, EDAC_MOD_STR,
+		    "A53 CPU%d L2 %s error detected!\n", smp_processor_id(),
+		    fatal ? "fatal" : "non-fatal");
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
+
+	switch (A53_L2MERRSR_EL1_RAMID(val)) {
+	case A53_L2_TAG_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
+		break;
+	case A53_L2_DATA_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
+		break;
+	case A53_L2_SNOOP_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop filter RAM\n");
+		break;
+	default:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
+		break;
+	}
+
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
+		    repeat_err);
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
+		    other_err);
+	if (fatal)
+		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
+				      edac_ctl->name);
+	else
+		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
+				      edac_ctl->name);
+	write_l2merrsr_el1(0);
+}
+
+static void a57_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
+{
+	int fatal;
+	int repeat_err, other_err;
+	u64 val = read_l2merrsr_el1();
+
+	if (!A57_L2MERRSR_EL1_VALID(val))
+		return;
+
+	fatal = A57_L2MERRSR_EL1_FATAL(val);
+	repeat_err = A57_L2MERRSR_EL1_REPEAT(val);
+	other_err = A57_L2MERRSR_EL1_OTHER(val);
+
+	edac_printk(KERN_CRIT, EDAC_MOD_STR,
+		    "A57 CPU%d L2 %s error detected!\n", smp_processor_id(),
+		    fatal ? "fatal" : "non-fatal");
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
+
+	switch (A57_L2MERRSR_EL1_RAMID(val)) {
+	case A57_L2_TAG_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
+		break;
+	case A57_L2_DATA_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
+		break;
+	case A57_L2_SNOOP_TAG_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop tag RAM\n");
+		break;
+	case A57_L2_DIRTY_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Dirty RAM\n");
+		break;
+	case A57_L2_INCLUSION_PF_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 inclusion PF RAM\n");
+		break;
+	default:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
+		break;
+	}
+
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
+		    repeat_err);
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
+		    other_err);
+	if (fatal)
+		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
+				      edac_ctl->name);
+	else
+		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
+				      edac_ctl->name);
+	write_l2merrsr_el1(0);
+}
+
+static void a57_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
+{
+	int fatal;
+	int repeat_err, other_err;
+	u64 val = read_cpumerrsr_el1();
+
+	if (!A57_CPUMERRSR_EL1_VALID(val))
+		return;
+
+	fatal = A57_CPUMERRSR_EL1_FATAL(val);
+	repeat_err = A57_CPUMERRSR_EL1_REPEAT(val);
+	other_err = A57_CPUMERRSR_EL1_OTHER(val);
+
+	edac_printk(KERN_CRIT, EDAC_MOD_STR,
+		    "CPU%d L1 %s error detected!\n", smp_processor_id(),
+		    fatal ? "fatal" : "non-fatal");
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
+
+	switch (A57_CPUMERRSR_EL1_RAMID(val)) {
+	case A57_L1_I_TAG_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
+		break;
+	case A57_L1_I_DATA_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
+		break;
+	case A57_L1_D_TAG_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
+		break;
+	case A57_L1_D_DATA_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
+		break;
+	case A57_L1_TLB_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
+		break;
+	default:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
+		break;
+	}
+
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
+		    repeat_err);
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
+		    other_err);
+
+	if (fatal)
+		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
+				      edac_ctl->name);
+	else
+		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
+				      edac_ctl->name);
+	write_cpumerrsr_el1(0);
+}
+
+static void a53_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
+{
+	int fatal;
+	int repeat_err, other_err;
+	u64 val = read_cpumerrsr_el1();
+
+	if (!A53_CPUMERRSR_EL1_VALID(val))
+		return;
+
+	fatal = A53_CPUMERRSR_EL1_FATAL(val);
+	repeat_err = A53_CPUMERRSR_EL1_REPEAT(val);
+	other_err = A53_CPUMERRSR_EL1_OTHER(val);
+
+	edac_printk(KERN_CRIT, EDAC_MOD_STR,
+		    "A53 CPU%d L1 %s error detected!\n", smp_processor_id(),
+		    fatal ? "fatal" : "non-fatal");
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
+
+	switch (A53_CPUMERRSR_EL1_RAMID(val)) {
+	case A53_L1_I_TAG_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
+		break;
+	case A53_L1_I_DATA_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
+		break;
+	case A53_L1_D_TAG_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
+		break;
+	case A53_L1_D_DATA_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
+		break;
+	case A53_L1_TLB_RAM:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
+		break;
+	default:
+		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
+		break;
+	}
+
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
+		    repeat_err);
+	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
+		    other_err);
+
+	if (fatal)
+		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
+				      edac_ctl->name);
+	else
+		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
+				      edac_ctl->name);
+	write_cpumerrsr_el1(0);
+}
+
+static void parse_cpumerrsr(void *args)
+{
+	struct edac_device_ctl_info *edac_ctl = args;
+	int partnum = read_cpuid_part_number();
+
+	switch (partnum) {
+	case ARM_CPU_PART_CORTEX_A57:
+		a57_parse_cpumerrsr(edac_ctl);
+		break;
+	case ARM_CPU_PART_CORTEX_A53:
+		a53_parse_cpumerrsr(edac_ctl);
+		break;
+	}
+}
+
+static void parse_l2merrsr(void *args)
+{
+	struct edac_device_ctl_info *edac_ctl = args;
+	int partnum = read_cpuid_part_number();
+
+	switch (partnum) {
+	case ARM_CPU_PART_CORTEX_A57:
+		a57_parse_l2merrsr(edac_ctl);
+		break;
+	case ARM_CPU_PART_CORTEX_A53:
+		a53_parse_l2merrsr(edac_ctl);
+		break;
+	}
+}
+
+static void arm64_monitor_cache_errors(struct edac_device_ctl_info *edev_ctl)
+{
+	int cpu;
+	struct cpumask cluster_mask, old_mask;
+
+	cpumask_clear(&cluster_mask);
+	cpumask_clear(&old_mask);
+
+	get_online_cpus();
+	for_each_online_cpu(cpu) {
+		smp_call_function_single(cpu, parse_cpumerrsr, edev_ctl, 0);
+		cpumask_copy(&cluster_mask, topology_core_cpumask(cpu));
+		if (cpumask_equal(&cluster_mask, &old_mask))
+			continue;
+		cpumask_copy(&old_mask, &cluster_mask);
+		smp_call_function_any(&cluster_mask, parse_l2merrsr,
+				      edev_ctl, 0);
+	}
+	put_online_cpus();
+}
+
+static int cortex_arm64_edac_probe(struct platform_device *pdev)
+{
+	int rc;
+	struct cortex_arm64_edac *drv;
+	struct device *dev = &pdev->dev;
+
+	drv = devm_kzalloc(dev, sizeof(*drv), GFP_KERNEL);
+	if (!drv)
+		return -ENOMEM;
+
+	drv->edac_ctl = edac_device_alloc_ctl_info(0, "cpu",
+						   num_possible_cpus(), "L", 2,
+						   1, NULL, 0,
+						   edac_device_alloc_index());
+	if (IS_ERR(drv->edac_ctl))
+		return -ENOMEM;
+
+	drv->edac_ctl->poll_msec = poll_msec;
+	drv->edac_ctl->edac_check = arm64_monitor_cache_errors;
+	drv->edac_ctl->dev = dev;
+	drv->edac_ctl->mod_name = dev_name(dev);
+	drv->edac_ctl->dev_name = dev_name(dev);
+	drv->edac_ctl->ctl_name = "cpu_err";
+	drv->edac_ctl->panic_on_ue = 1;
+	platform_set_drvdata(pdev, drv);
+
+	rc = edac_device_add_device(drv->edac_ctl);
+	if (rc)
+		goto edac_alloc_failed;
+
+	return 0;
+
+edac_alloc_failed:
+	edac_device_free_ctl_info(drv->edac_ctl);
+	return rc;
+}
+
+static int cortex_arm64_edac_remove(struct platform_device *pdev)
+{
+	struct cortex_arm64_edac *drv = dev_get_drvdata(&pdev->dev);
+	struct edac_device_ctl_info *edac_ctl = drv->edac_ctl;
+
+	edac_device_del_device(edac_ctl->dev);
+	edac_device_free_ctl_info(edac_ctl);
+
+	return 0;
+}
+
+static const struct of_device_id cortex_arm64_edac_of_match[] = {
+	{ .compatible = "arm,armv8-edac" },
+	{},
+};
+MODULE_DEVICE_TABLE(of, cortex_arm64_edac_of_match);
+
+static struct platform_driver cortex_arm64_edac_driver = {
+	.probe = cortex_arm64_edac_probe,
+	.remove = cortex_arm64_edac_remove,
+	.driver = {
+		.name = "arm64-edac",
+		.owner = THIS_MODULE,
+		.of_match_table = cortex_arm64_edac_of_match,
+	},
+};
+
+static int __init cortex_arm64_edac_init(void)
+{
+	int rc;
+
+	/* Only POLL mode is supported so far */
+	edac_op_state = EDAC_OPSTATE_POLL;
+
+	rc = platform_driver_register(&cortex_arm64_edac_driver);
+	if (rc) {
+		edac_printk(KERN_ERR, EDAC_MOD_STR, "failed to register\n");
+		return rc;
+	}
+
+	return 0;
+}
+module_init(cortex_arm64_edac_init);
+
+static void __exit cortex_arm64_edac_exit(void)
+{
+	platform_driver_unregister(&cortex_arm64_edac_driver);
+}
+module_exit(cortex_arm64_edac_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Brijesh Singh <brijeshkumar.singh@amd.com>");
+MODULE_DESCRIPTION("Cortex A57 and A53 EDAC driver");
+module_param(poll_msec, int, 0444);
+MODULE_PARM_DESC(poll_msec, "EDAC monitor poll interval in msec");
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-21 21:25   ` Mauro Carvalho Chehab
  0 siblings, 0 replies; 34+ messages in thread
From: Mauro Carvalho Chehab @ 2015-10-21 21:25 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: linux-arm-kernel, robh+dt, pawel.moll, mark.rutland,
	ijc+devicetree, galak, dougthompson, bp, devicetree, guohanjun,
	andre.przywara, arnd, linux-kernel, linux-edac

Em Wed, 21 Oct 2015 15:41:37 -0500
Brijesh Singh <brijeshkumar.singh@amd.com> escreveu:

> Add support for Cortex A57 and A53 EDAC driver.
> 
> Signed-off-by: Brijesh Singh <brijeshkumar.singh@amd.com>
> CC: robh+dt@kernel.org
> CC: pawel.moll@arm.com
> CC: mark.rutland@arm.com
> CC: ijc+devicetree@hellion.org.uk
> CC: galak@codeaurora.org
> CC: dougthompson@xmission.com
> CC: bp@alien8.de
> CC: mchehab@osg.samsung.com
> CC: devicetree@vger.kernel.org
> CC: guohanjun@huawei.com
> CC: andre.przywara@arm.com
> CC: arnd@arndb.de
> CC: linux-kernel@vger.kernel.org
> CC: linux-edac@vger.kernel.org
> ---
> 
> v2:
> * convert into generic arm64 edac driver
> * remove AMD specific references from dt binding
> * remove poll_msec property from dt binding
> * add poll_msec as a module param, default is 100ms
> * update copyright text
> * define macro mnemonics for L1 and L2 RAMID
> * check L2 error per-cluster instead of per core
> * update function names
> * use get_online_cpus() and put_online_cpus() to make L1 and L2 register 
>   read hotplug-safe
> * add error check in probe routine
> 
>  .../devicetree/bindings/edac/armv8-edac.txt        |  15 +
>  drivers/edac/Kconfig                               |   6 +
>  drivers/edac/Makefile                              |   1 +
>  drivers/edac/cortex_arm64_edac.c                   | 457 +++++++++++++++++++++
>  4 files changed, 479 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/edac/armv8-edac.txt
>  create mode 100644 drivers/edac/cortex_arm64_edac.c
> 
> diff --git a/Documentation/devicetree/bindings/edac/armv8-edac.txt b/Documentation/devicetree/bindings/edac/armv8-edac.txt
> new file mode 100644
> index 0000000..dfd128f
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/edac/armv8-edac.txt
> @@ -0,0 +1,15 @@
> +* ARMv8 L1/L2 cache error reporting
> +
> +On ARMv8, CPU Memory Error Syndrome Register and L2 Memory Error Syndrome
> +Register can be used for checking L1 and L2 memory errors.
> +
> +The following section describes the ARMv8 EDAC DT node binding.
> +
> +Required properties:
> +- compatible: Should be "arm,armv8-edac"
> +
> +Example:
> +	edac {
> +		compatible = "arm,armv8-edac";
> +	};
> +
> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
> index ef25000..dd7c195 100644
> --- a/drivers/edac/Kconfig
> +++ b/drivers/edac/Kconfig
> @@ -390,4 +390,10 @@ config EDAC_XGENE
>  	  Support for error detection and correction on the
>  	  APM X-Gene family of SOCs.
>  
> +config EDAC_CORTEX_ARM64
> +	tristate "ARM Cortex A57/A53"
> +	depends on EDAC_MM_EDAC && ARM64

It would be good to be able to compile it on non-ARM64 archs
if COMPILE_TEST, e. g.:

	depends on EDAC_MM_EDAC && (ARM64 || COMPILE_TEST)

That would allow testing tools like Coverity to test it. As far as
I know, the public license we use only works on x86.

> +	help
> +	  Support for error detection and correction on the
> +	  ARM Cortex A57 and A53.
>  endif # EDAC
> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
> index ae3c5f3..ac01660 100644
> --- a/drivers/edac/Makefile
> +++ b/drivers/edac/Makefile
> @@ -68,3 +68,4 @@ obj-$(CONFIG_EDAC_OCTEON_PCI)		+= octeon_edac-pci.o
>  obj-$(CONFIG_EDAC_ALTERA_MC)		+= altera_edac.o
>  obj-$(CONFIG_EDAC_SYNOPSYS)		+= synopsys_edac.o
>  obj-$(CONFIG_EDAC_XGENE)		+= xgene_edac.o
> +obj-$(CONFIG_EDAC_CORTEX_ARM64)		+= cortex_arm64_edac.o
> diff --git a/drivers/edac/cortex_arm64_edac.c b/drivers/edac/cortex_arm64_edac.c
> new file mode 100644
> index 0000000..c37bb94
> --- /dev/null
> +++ b/drivers/edac/cortex_arm64_edac.c
> @@ -0,0 +1,457 @@
> +/*
> + * Cortex ARM64 EDAC
> + *
> + * Copyright (c) 2015, Advanced Micro Devices
> + * Author: Brijesh Singh <brijeshkumar.singh@amd.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/of_device.h>
> +#include <linux/platform_device.h>
> +
> +#include "edac_core.h"
> +
> +#define EDAC_MOD_STR             "cortex_arm64_edac"
> +
> +#define A57_CPUMERRSR_EL1_INDEX(x)   ((x) & 0x1ffff)
> +#define A57_CPUMERRSR_EL1_BANK(x)    (((x) >> 18) & 0x1f)
> +#define A57_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
> +#define A57_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
> +#define A57_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0x7f)
> +#define A57_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
> +#define A57_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
> +#define A57_L1_I_TAG_RAM	     0x00
> +#define A57_L1_I_DATA_RAM	     0x01
> +#define A57_L1_D_TAG_RAM	     0x08
> +#define A57_L1_D_DATA_RAM	     0x09
> +#define A57_L1_TLB_RAM		     0x18
> +
> +#define A57_L2MERRSR_EL1_INDEX(x)    ((x) & 0x1ffff)
> +#define A57_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0xf)
> +#define A57_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
> +#define A57_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
> +#define A57_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
> +#define A57_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
> +#define A57_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
> +#define A57_L2_TAG_RAM		     0x10
> +#define A57_L2_DATA_RAM		     0x11
> +#define A57_L2_SNOOP_TAG_RAM	     0x12
> +#define A57_L2_DIRTY_RAM	     0x14
> +#define A57_L2_INCLUSION_PF_RAM      0x18
> +
> +#define A53_CPUMERRSR_EL1_ADDR(x)    ((x) & 0xfff)
> +#define A53_CPUMERRSR_EL1_CPUID(x)   (((x) >> 18) & 0x07)
> +#define A53_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
> +#define A53_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
> +#define A53_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0xff)
> +#define A53_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
> +#define A53_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
> +#define A53_L1_I_TAG_RAM	     0x00
> +#define A53_L1_I_DATA_RAM	     0x01
> +#define A53_L1_D_TAG_RAM	     0x08
> +#define A53_L1_D_DATA_RAM	     0x09
> +#define A53_L1_D_DIRT_RAM	     0x0A
> +#define A53_L1_TLB_RAM		     0x18
> +
> +#define A53_L2MERRSR_EL1_INDEX(x)    (((x) >> 3) & 0x3fff)
> +#define A53_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0x0f)
> +#define A53_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
> +#define A53_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
> +#define A53_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
> +#define A53_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
> +#define A53_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
> +#define A53_L2_TAG_RAM		     0x10
> +#define A53_L2_DATA_RAM		     0x11
> +#define A53_L2_SNOOP_RAM	     0x12
> +
> +#define L1_CACHE		     0
> +#define L2_CACHE		     1
> +
> +int poll_msec = 100;
> +
> +struct cortex_arm64_edac {
> +	struct edac_device_ctl_info *edac_ctl;
> +};
> +
> +static inline u64 read_cpumerrsr_el1(void)
> +{
> +	u64 val;
> +
> +	asm volatile("mrs %0, s3_1_c15_c2_2" : "=r" (val));
> +	return val;
> +}
> +
> +static inline void write_cpumerrsr_el1(u64 val)
> +{
> +	asm volatile("msr s3_1_c15_c2_2, %0" :: "r" (val));
> +}
> +
> +static inline u64 read_l2merrsr_el1(void)
> +{
> +	u64 val;
> +
> +	asm volatile("mrs %0, s3_1_c15_c2_3" : "=r" (val));
> +	return val;
> +}
> +
> +static inline void write_l2merrsr_el1(u64 val)
> +{
> +	asm volatile("msr s3_1_c15_c2_3, %0" :: "r" (val));
> +}
> +

If we're willing to compile with COMPILE_TEST, we'll need to provide
some stubs for the above functions that won't use asm.

> +static void a53_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_l2merrsr_el1();
> +
> +	if (!A53_L2MERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A53_L2MERRSR_EL1_FATAL(val);
> +	repeat_err = A53_L2MERRSR_EL1_REPEAT(val);
> +	other_err = A53_L2MERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "A53 CPU%d L2 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
> +
> +	switch (A53_L2MERRSR_EL1_RAMID(val)) {
> +	case A53_L2_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
> +		break;
> +	case A53_L2_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
> +		break;
> +	case A53_L2_SNOOP_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop filter RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	write_l2merrsr_el1(0);
> +}
> +
> +static void a57_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_l2merrsr_el1();
> +
> +	if (!A57_L2MERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A57_L2MERRSR_EL1_FATAL(val);
> +	repeat_err = A57_L2MERRSR_EL1_REPEAT(val);
> +	other_err = A57_L2MERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "A57 CPU%d L2 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
> +
> +	switch (A57_L2MERRSR_EL1_RAMID(val)) {
> +	case A57_L2_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
> +		break;
> +	case A57_L2_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
> +		break;
> +	case A57_L2_SNOOP_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop tag RAM\n");
> +		break;
> +	case A57_L2_DIRTY_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Dirty RAM\n");
> +		break;
> +	case A57_L2_INCLUSION_PF_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 inclusion PF RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	write_l2merrsr_el1(0);
> +}
> +
> +static void a57_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_cpumerrsr_el1();
> +
> +	if (!A57_CPUMERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A57_CPUMERRSR_EL1_FATAL(val);
> +	repeat_err = A57_CPUMERRSR_EL1_REPEAT(val);
> +	other_err = A57_CPUMERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "CPU%d L1 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
> +
> +	switch (A57_CPUMERRSR_EL1_RAMID(val)) {
> +	case A57_L1_I_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
> +		break;
> +	case A57_L1_I_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
> +		break;
> +	case A57_L1_D_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
> +		break;
> +	case A57_L1_D_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
> +		break;
> +	case A57_L1_TLB_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	write_cpumerrsr_el1(0);
> +}
> +
> +static void a53_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_cpumerrsr_el1();
> +
> +	if (!A53_CPUMERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A53_CPUMERRSR_EL1_FATAL(val);
> +	repeat_err = A53_CPUMERRSR_EL1_REPEAT(val);
> +	other_err = A53_CPUMERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "A53 CPU%d L1 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
> +
> +	switch (A53_CPUMERRSR_EL1_RAMID(val)) {
> +	case A53_L1_I_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
> +		break;
> +	case A53_L1_I_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
> +		break;
> +	case A53_L1_D_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
> +		break;
> +	case A53_L1_D_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
> +		break;
> +	case A53_L1_TLB_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);

The above code doesn't look right to me. It should be, instead, calling
one of the functions that output the errors also via trace or to call one
of the trace functions directly (see the trace functions currently defined
at  include/ras/ras_event.h).

Failing to do that would cause RAS tools (like rasdaemon) to not get
the errors.

> +	write_cpumerrsr_el1(0);
> +}
> +
> +static void parse_cpumerrsr(void *args)
> +{
> +	struct edac_device_ctl_info *edac_ctl = args;
> +	int partnum = read_cpuid_part_number();
> +
> +	switch (partnum) {
> +	case ARM_CPU_PART_CORTEX_A57:
> +		a57_parse_cpumerrsr(edac_ctl);
> +		break;
> +	case ARM_CPU_PART_CORTEX_A53:
> +		a53_parse_cpumerrsr(edac_ctl);
> +		break;
> +	}
> +}
> +
> +static void parse_l2merrsr(void *args)
> +{
> +	struct edac_device_ctl_info *edac_ctl = args;
> +	int partnum = read_cpuid_part_number();
> +
> +	switch (partnum) {
> +	case ARM_CPU_PART_CORTEX_A57:
> +		a57_parse_l2merrsr(edac_ctl);
> +		break;
> +	case ARM_CPU_PART_CORTEX_A53:
> +		a53_parse_l2merrsr(edac_ctl);
> +		break;
> +	}
> +}
> +
> +static void arm64_monitor_cache_errors(struct edac_device_ctl_info *edev_ctl)
> +{
> +	int cpu;
> +	struct cpumask cluster_mask, old_mask;
> +
> +	cpumask_clear(&cluster_mask);
> +	cpumask_clear(&old_mask);
> +
> +	get_online_cpus();
> +	for_each_online_cpu(cpu) {
> +		smp_call_function_single(cpu, parse_cpumerrsr, edev_ctl, 0);
> +		cpumask_copy(&cluster_mask, topology_core_cpumask(cpu));
> +		if (cpumask_equal(&cluster_mask, &old_mask))
> +			continue;
> +		cpumask_copy(&old_mask, &cluster_mask);
> +		smp_call_function_any(&cluster_mask, parse_l2merrsr,
> +				      edev_ctl, 0);
> +	}
> +	put_online_cpus();
> +}
> +
> +static int cortex_arm64_edac_probe(struct platform_device *pdev)
> +{
> +	int rc;
> +	struct cortex_arm64_edac *drv;
> +	struct device *dev = &pdev->dev;
> +
> +	drv = devm_kzalloc(dev, sizeof(*drv), GFP_KERNEL);
> +	if (!drv)
> +		return -ENOMEM;
> +
> +	drv->edac_ctl = edac_device_alloc_ctl_info(0, "cpu",
> +						   num_possible_cpus(), "L", 2,
> +						   1, NULL, 0,
> +						   edac_device_alloc_index());
> +	if (IS_ERR(drv->edac_ctl))
> +		return -ENOMEM;
> +
> +	drv->edac_ctl->poll_msec = poll_msec;
> +	drv->edac_ctl->edac_check = arm64_monitor_cache_errors;
> +	drv->edac_ctl->dev = dev;
> +	drv->edac_ctl->mod_name = dev_name(dev);
> +	drv->edac_ctl->dev_name = dev_name(dev);
> +	drv->edac_ctl->ctl_name = "cpu_err";
> +	drv->edac_ctl->panic_on_ue = 1;
> +	platform_set_drvdata(pdev, drv);
> +
> +	rc = edac_device_add_device(drv->edac_ctl);
> +	if (rc)
> +		goto edac_alloc_failed;
> +
> +	return 0;
> +
> +edac_alloc_failed:
> +	edac_device_free_ctl_info(drv->edac_ctl);
> +	return rc;
> +}
> +
> +static int cortex_arm64_edac_remove(struct platform_device *pdev)
> +{
> +	struct cortex_arm64_edac *drv = dev_get_drvdata(&pdev->dev);
> +	struct edac_device_ctl_info *edac_ctl = drv->edac_ctl;
> +
> +	edac_device_del_device(edac_ctl->dev);
> +	edac_device_free_ctl_info(edac_ctl);
> +
> +	return 0;
> +}
> +
> +static const struct of_device_id cortex_arm64_edac_of_match[] = {
> +	{ .compatible = "arm,armv8-edac" },
> +	{},
> +};
> +MODULE_DEVICE_TABLE(of, cortex_arm64_edac_of_match);
> +
> +static struct platform_driver cortex_arm64_edac_driver = {
> +	.probe = cortex_arm64_edac_probe,
> +	.remove = cortex_arm64_edac_remove,
> +	.driver = {
> +		.name = "arm64-edac",
> +		.owner = THIS_MODULE,
> +		.of_match_table = cortex_arm64_edac_of_match,
> +	},
> +};
> +
> +static int __init cortex_arm64_edac_init(void)
> +{
> +	int rc;
> +
> +	/* Only POLL mode is supported so far */
> +	edac_op_state = EDAC_OPSTATE_POLL;
> +
> +	rc = platform_driver_register(&cortex_arm64_edac_driver);
> +	if (rc) {
> +		edac_printk(KERN_ERR, EDAC_MOD_STR, "failed to register\n");
> +		return rc;
> +	}
> +
> +	return 0;
> +}
> +module_init(cortex_arm64_edac_init);
> +
> +static void __exit cortex_arm64_edac_exit(void)
> +{
> +	platform_driver_unregister(&cortex_arm64_edac_driver);
> +}
> +module_exit(cortex_arm64_edac_exit);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Brijesh Singh <brijeshkumar.singh@amd.com>");
> +MODULE_DESCRIPTION("Cortex A57 and A53 EDAC driver");
> +module_param(poll_msec, int, 0444);
> +MODULE_PARM_DESC(poll_msec, "EDAC monitor poll interval in msec");

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-21 21:25   ` Mauro Carvalho Chehab
  0 siblings, 0 replies; 34+ messages in thread
From: Mauro Carvalho Chehab @ 2015-10-21 21:25 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	robh+dt-DgEjT+Ai2ygdnm+yROfE0A, pawel.moll-5wv7dgnIgG8,
	mark.rutland-5wv7dgnIgG8, ijc+devicetree-KcIKpvwj1kUDXYZnReoRVg,
	galak-sgV2jX0FEOL9JmXXK+q4OQ,
	dougthompson-aS9lmoZGLiVWk0Htik3J/w, bp-Gina5bIWoIWzQB+pC5nmwQ,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	guohanjun-hv44wF8Li93QT0dZR+AlfA, andre.przywara-5wv7dgnIgG8,
	arnd-r2nGTMty4D4, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-edac-u79uwXL29TY76Z2rM5mHXA

Em Wed, 21 Oct 2015 15:41:37 -0500
Brijesh Singh <brijeshkumar.singh-5C7GfCeVMHo@public.gmane.org> escreveu:

> Add support for Cortex A57 and A53 EDAC driver.
> 
> Signed-off-by: Brijesh Singh <brijeshkumar.singh-5C7GfCeVMHo@public.gmane.org>
> CC: robh+dt-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
> CC: pawel.moll-5wv7dgnIgG8@public.gmane.org
> CC: mark.rutland-5wv7dgnIgG8@public.gmane.org
> CC: ijc+devicetree-KcIKpvwj1kUDXYZnReoRVg@public.gmane.org
> CC: galak-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org
> CC: dougthompson-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org
> CC: bp-Gina5bIWoIWzQB+pC5nmwQ@public.gmane.org
> CC: mchehab-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org
> CC: devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> CC: guohanjun-hv44wF8Li93QT0dZR+AlfA@public.gmane.org
> CC: andre.przywara-5wv7dgnIgG8@public.gmane.org
> CC: arnd-r2nGTMty4D4@public.gmane.org
> CC: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> CC: linux-edac-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> ---
> 
> v2:
> * convert into generic arm64 edac driver
> * remove AMD specific references from dt binding
> * remove poll_msec property from dt binding
> * add poll_msec as a module param, default is 100ms
> * update copyright text
> * define macro mnemonics for L1 and L2 RAMID
> * check L2 error per-cluster instead of per core
> * update function names
> * use get_online_cpus() and put_online_cpus() to make L1 and L2 register 
>   read hotplug-safe
> * add error check in probe routine
> 
>  .../devicetree/bindings/edac/armv8-edac.txt        |  15 +
>  drivers/edac/Kconfig                               |   6 +
>  drivers/edac/Makefile                              |   1 +
>  drivers/edac/cortex_arm64_edac.c                   | 457 +++++++++++++++++++++
>  4 files changed, 479 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/edac/armv8-edac.txt
>  create mode 100644 drivers/edac/cortex_arm64_edac.c
> 
> diff --git a/Documentation/devicetree/bindings/edac/armv8-edac.txt b/Documentation/devicetree/bindings/edac/armv8-edac.txt
> new file mode 100644
> index 0000000..dfd128f
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/edac/armv8-edac.txt
> @@ -0,0 +1,15 @@
> +* ARMv8 L1/L2 cache error reporting
> +
> +On ARMv8, CPU Memory Error Syndrome Register and L2 Memory Error Syndrome
> +Register can be used for checking L1 and L2 memory errors.
> +
> +The following section describes the ARMv8 EDAC DT node binding.
> +
> +Required properties:
> +- compatible: Should be "arm,armv8-edac"
> +
> +Example:
> +	edac {
> +		compatible = "arm,armv8-edac";
> +	};
> +
> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
> index ef25000..dd7c195 100644
> --- a/drivers/edac/Kconfig
> +++ b/drivers/edac/Kconfig
> @@ -390,4 +390,10 @@ config EDAC_XGENE
>  	  Support for error detection and correction on the
>  	  APM X-Gene family of SOCs.
>  
> +config EDAC_CORTEX_ARM64
> +	tristate "ARM Cortex A57/A53"
> +	depends on EDAC_MM_EDAC && ARM64

It would be good to be able to compile it on non-ARM64 archs
if COMPILE_TEST, e. g.:

	depends on EDAC_MM_EDAC && (ARM64 || COMPILE_TEST)

That would allow testing tools like Coverity to test it. As far as
I know, the public license we use only works on x86.

> +	help
> +	  Support for error detection and correction on the
> +	  ARM Cortex A57 and A53.
>  endif # EDAC
> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
> index ae3c5f3..ac01660 100644
> --- a/drivers/edac/Makefile
> +++ b/drivers/edac/Makefile
> @@ -68,3 +68,4 @@ obj-$(CONFIG_EDAC_OCTEON_PCI)		+= octeon_edac-pci.o
>  obj-$(CONFIG_EDAC_ALTERA_MC)		+= altera_edac.o
>  obj-$(CONFIG_EDAC_SYNOPSYS)		+= synopsys_edac.o
>  obj-$(CONFIG_EDAC_XGENE)		+= xgene_edac.o
> +obj-$(CONFIG_EDAC_CORTEX_ARM64)		+= cortex_arm64_edac.o
> diff --git a/drivers/edac/cortex_arm64_edac.c b/drivers/edac/cortex_arm64_edac.c
> new file mode 100644
> index 0000000..c37bb94
> --- /dev/null
> +++ b/drivers/edac/cortex_arm64_edac.c
> @@ -0,0 +1,457 @@
> +/*
> + * Cortex ARM64 EDAC
> + *
> + * Copyright (c) 2015, Advanced Micro Devices
> + * Author: Brijesh Singh <brijeshkumar.singh-5C7GfCeVMHo@public.gmane.org>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/of_device.h>
> +#include <linux/platform_device.h>
> +
> +#include "edac_core.h"
> +
> +#define EDAC_MOD_STR             "cortex_arm64_edac"
> +
> +#define A57_CPUMERRSR_EL1_INDEX(x)   ((x) & 0x1ffff)
> +#define A57_CPUMERRSR_EL1_BANK(x)    (((x) >> 18) & 0x1f)
> +#define A57_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
> +#define A57_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
> +#define A57_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0x7f)
> +#define A57_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
> +#define A57_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
> +#define A57_L1_I_TAG_RAM	     0x00
> +#define A57_L1_I_DATA_RAM	     0x01
> +#define A57_L1_D_TAG_RAM	     0x08
> +#define A57_L1_D_DATA_RAM	     0x09
> +#define A57_L1_TLB_RAM		     0x18
> +
> +#define A57_L2MERRSR_EL1_INDEX(x)    ((x) & 0x1ffff)
> +#define A57_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0xf)
> +#define A57_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
> +#define A57_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
> +#define A57_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
> +#define A57_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
> +#define A57_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
> +#define A57_L2_TAG_RAM		     0x10
> +#define A57_L2_DATA_RAM		     0x11
> +#define A57_L2_SNOOP_TAG_RAM	     0x12
> +#define A57_L2_DIRTY_RAM	     0x14
> +#define A57_L2_INCLUSION_PF_RAM      0x18
> +
> +#define A53_CPUMERRSR_EL1_ADDR(x)    ((x) & 0xfff)
> +#define A53_CPUMERRSR_EL1_CPUID(x)   (((x) >> 18) & 0x07)
> +#define A53_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
> +#define A53_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
> +#define A53_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0xff)
> +#define A53_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
> +#define A53_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
> +#define A53_L1_I_TAG_RAM	     0x00
> +#define A53_L1_I_DATA_RAM	     0x01
> +#define A53_L1_D_TAG_RAM	     0x08
> +#define A53_L1_D_DATA_RAM	     0x09
> +#define A53_L1_D_DIRT_RAM	     0x0A
> +#define A53_L1_TLB_RAM		     0x18
> +
> +#define A53_L2MERRSR_EL1_INDEX(x)    (((x) >> 3) & 0x3fff)
> +#define A53_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0x0f)
> +#define A53_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
> +#define A53_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
> +#define A53_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
> +#define A53_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
> +#define A53_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
> +#define A53_L2_TAG_RAM		     0x10
> +#define A53_L2_DATA_RAM		     0x11
> +#define A53_L2_SNOOP_RAM	     0x12
> +
> +#define L1_CACHE		     0
> +#define L2_CACHE		     1
> +
> +int poll_msec = 100;
> +
> +struct cortex_arm64_edac {
> +	struct edac_device_ctl_info *edac_ctl;
> +};
> +
> +static inline u64 read_cpumerrsr_el1(void)
> +{
> +	u64 val;
> +
> +	asm volatile("mrs %0, s3_1_c15_c2_2" : "=r" (val));
> +	return val;
> +}
> +
> +static inline void write_cpumerrsr_el1(u64 val)
> +{
> +	asm volatile("msr s3_1_c15_c2_2, %0" :: "r" (val));
> +}
> +
> +static inline u64 read_l2merrsr_el1(void)
> +{
> +	u64 val;
> +
> +	asm volatile("mrs %0, s3_1_c15_c2_3" : "=r" (val));
> +	return val;
> +}
> +
> +static inline void write_l2merrsr_el1(u64 val)
> +{
> +	asm volatile("msr s3_1_c15_c2_3, %0" :: "r" (val));
> +}
> +

If we're willing to compile with COMPILE_TEST, we'll need to provide
some stubs for the above functions that won't use asm.

> +static void a53_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_l2merrsr_el1();
> +
> +	if (!A53_L2MERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A53_L2MERRSR_EL1_FATAL(val);
> +	repeat_err = A53_L2MERRSR_EL1_REPEAT(val);
> +	other_err = A53_L2MERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "A53 CPU%d L2 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
> +
> +	switch (A53_L2MERRSR_EL1_RAMID(val)) {
> +	case A53_L2_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
> +		break;
> +	case A53_L2_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
> +		break;
> +	case A53_L2_SNOOP_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop filter RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	write_l2merrsr_el1(0);
> +}
> +
> +static void a57_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_l2merrsr_el1();
> +
> +	if (!A57_L2MERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A57_L2MERRSR_EL1_FATAL(val);
> +	repeat_err = A57_L2MERRSR_EL1_REPEAT(val);
> +	other_err = A57_L2MERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "A57 CPU%d L2 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
> +
> +	switch (A57_L2MERRSR_EL1_RAMID(val)) {
> +	case A57_L2_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
> +		break;
> +	case A57_L2_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
> +		break;
> +	case A57_L2_SNOOP_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop tag RAM\n");
> +		break;
> +	case A57_L2_DIRTY_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Dirty RAM\n");
> +		break;
> +	case A57_L2_INCLUSION_PF_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 inclusion PF RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	write_l2merrsr_el1(0);
> +}
> +
> +static void a57_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_cpumerrsr_el1();
> +
> +	if (!A57_CPUMERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A57_CPUMERRSR_EL1_FATAL(val);
> +	repeat_err = A57_CPUMERRSR_EL1_REPEAT(val);
> +	other_err = A57_CPUMERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "CPU%d L1 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
> +
> +	switch (A57_CPUMERRSR_EL1_RAMID(val)) {
> +	case A57_L1_I_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
> +		break;
> +	case A57_L1_I_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
> +		break;
> +	case A57_L1_D_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
> +		break;
> +	case A57_L1_D_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
> +		break;
> +	case A57_L1_TLB_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	write_cpumerrsr_el1(0);
> +}
> +
> +static void a53_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_cpumerrsr_el1();
> +
> +	if (!A53_CPUMERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A53_CPUMERRSR_EL1_FATAL(val);
> +	repeat_err = A53_CPUMERRSR_EL1_REPEAT(val);
> +	other_err = A53_CPUMERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "A53 CPU%d L1 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
> +
> +	switch (A53_CPUMERRSR_EL1_RAMID(val)) {
> +	case A53_L1_I_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
> +		break;
> +	case A53_L1_I_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
> +		break;
> +	case A53_L1_D_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
> +		break;
> +	case A53_L1_D_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
> +		break;
> +	case A53_L1_TLB_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);

The above code doesn't look right to me. It should be, instead, calling
one of the functions that output the errors also via trace or to call one
of the trace functions directly (see the trace functions currently defined
at  include/ras/ras_event.h).

Failing to do that would cause RAS tools (like rasdaemon) to not get
the errors.

> +	write_cpumerrsr_el1(0);
> +}
> +
> +static void parse_cpumerrsr(void *args)
> +{
> +	struct edac_device_ctl_info *edac_ctl = args;
> +	int partnum = read_cpuid_part_number();
> +
> +	switch (partnum) {
> +	case ARM_CPU_PART_CORTEX_A57:
> +		a57_parse_cpumerrsr(edac_ctl);
> +		break;
> +	case ARM_CPU_PART_CORTEX_A53:
> +		a53_parse_cpumerrsr(edac_ctl);
> +		break;
> +	}
> +}
> +
> +static void parse_l2merrsr(void *args)
> +{
> +	struct edac_device_ctl_info *edac_ctl = args;
> +	int partnum = read_cpuid_part_number();
> +
> +	switch (partnum) {
> +	case ARM_CPU_PART_CORTEX_A57:
> +		a57_parse_l2merrsr(edac_ctl);
> +		break;
> +	case ARM_CPU_PART_CORTEX_A53:
> +		a53_parse_l2merrsr(edac_ctl);
> +		break;
> +	}
> +}
> +
> +static void arm64_monitor_cache_errors(struct edac_device_ctl_info *edev_ctl)
> +{
> +	int cpu;
> +	struct cpumask cluster_mask, old_mask;
> +
> +	cpumask_clear(&cluster_mask);
> +	cpumask_clear(&old_mask);
> +
> +	get_online_cpus();
> +	for_each_online_cpu(cpu) {
> +		smp_call_function_single(cpu, parse_cpumerrsr, edev_ctl, 0);
> +		cpumask_copy(&cluster_mask, topology_core_cpumask(cpu));
> +		if (cpumask_equal(&cluster_mask, &old_mask))
> +			continue;
> +		cpumask_copy(&old_mask, &cluster_mask);
> +		smp_call_function_any(&cluster_mask, parse_l2merrsr,
> +				      edev_ctl, 0);
> +	}
> +	put_online_cpus();
> +}
> +
> +static int cortex_arm64_edac_probe(struct platform_device *pdev)
> +{
> +	int rc;
> +	struct cortex_arm64_edac *drv;
> +	struct device *dev = &pdev->dev;
> +
> +	drv = devm_kzalloc(dev, sizeof(*drv), GFP_KERNEL);
> +	if (!drv)
> +		return -ENOMEM;
> +
> +	drv->edac_ctl = edac_device_alloc_ctl_info(0, "cpu",
> +						   num_possible_cpus(), "L", 2,
> +						   1, NULL, 0,
> +						   edac_device_alloc_index());
> +	if (IS_ERR(drv->edac_ctl))
> +		return -ENOMEM;
> +
> +	drv->edac_ctl->poll_msec = poll_msec;
> +	drv->edac_ctl->edac_check = arm64_monitor_cache_errors;
> +	drv->edac_ctl->dev = dev;
> +	drv->edac_ctl->mod_name = dev_name(dev);
> +	drv->edac_ctl->dev_name = dev_name(dev);
> +	drv->edac_ctl->ctl_name = "cpu_err";
> +	drv->edac_ctl->panic_on_ue = 1;
> +	platform_set_drvdata(pdev, drv);
> +
> +	rc = edac_device_add_device(drv->edac_ctl);
> +	if (rc)
> +		goto edac_alloc_failed;
> +
> +	return 0;
> +
> +edac_alloc_failed:
> +	edac_device_free_ctl_info(drv->edac_ctl);
> +	return rc;
> +}
> +
> +static int cortex_arm64_edac_remove(struct platform_device *pdev)
> +{
> +	struct cortex_arm64_edac *drv = dev_get_drvdata(&pdev->dev);
> +	struct edac_device_ctl_info *edac_ctl = drv->edac_ctl;
> +
> +	edac_device_del_device(edac_ctl->dev);
> +	edac_device_free_ctl_info(edac_ctl);
> +
> +	return 0;
> +}
> +
> +static const struct of_device_id cortex_arm64_edac_of_match[] = {
> +	{ .compatible = "arm,armv8-edac" },
> +	{},
> +};
> +MODULE_DEVICE_TABLE(of, cortex_arm64_edac_of_match);
> +
> +static struct platform_driver cortex_arm64_edac_driver = {
> +	.probe = cortex_arm64_edac_probe,
> +	.remove = cortex_arm64_edac_remove,
> +	.driver = {
> +		.name = "arm64-edac",
> +		.owner = THIS_MODULE,
> +		.of_match_table = cortex_arm64_edac_of_match,
> +	},
> +};
> +
> +static int __init cortex_arm64_edac_init(void)
> +{
> +	int rc;
> +
> +	/* Only POLL mode is supported so far */
> +	edac_op_state = EDAC_OPSTATE_POLL;
> +
> +	rc = platform_driver_register(&cortex_arm64_edac_driver);
> +	if (rc) {
> +		edac_printk(KERN_ERR, EDAC_MOD_STR, "failed to register\n");
> +		return rc;
> +	}
> +
> +	return 0;
> +}
> +module_init(cortex_arm64_edac_init);
> +
> +static void __exit cortex_arm64_edac_exit(void)
> +{
> +	platform_driver_unregister(&cortex_arm64_edac_driver);
> +}
> +module_exit(cortex_arm64_edac_exit);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Brijesh Singh <brijeshkumar.singh-5C7GfCeVMHo@public.gmane.org>");
> +MODULE_DESCRIPTION("Cortex A57 and A53 EDAC driver");
> +module_param(poll_msec, int, 0444);
> +MODULE_PARM_DESC(poll_msec, "EDAC monitor poll interval in msec");
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-21 21:25   ` Mauro Carvalho Chehab
  0 siblings, 0 replies; 34+ messages in thread
From: Mauro Carvalho Chehab @ 2015-10-21 21:25 UTC (permalink / raw)
  To: linux-arm-kernel

Em Wed, 21 Oct 2015 15:41:37 -0500
Brijesh Singh <brijeshkumar.singh@amd.com> escreveu:

> Add support for Cortex A57 and A53 EDAC driver.
> 
> Signed-off-by: Brijesh Singh <brijeshkumar.singh@amd.com>
> CC: robh+dt at kernel.org
> CC: pawel.moll at arm.com
> CC: mark.rutland at arm.com
> CC: ijc+devicetree at hellion.org.uk
> CC: galak at codeaurora.org
> CC: dougthompson at xmission.com
> CC: bp at alien8.de
> CC: mchehab at osg.samsung.com
> CC: devicetree at vger.kernel.org
> CC: guohanjun at huawei.com
> CC: andre.przywara at arm.com
> CC: arnd at arndb.de
> CC: linux-kernel at vger.kernel.org
> CC: linux-edac at vger.kernel.org
> ---
> 
> v2:
> * convert into generic arm64 edac driver
> * remove AMD specific references from dt binding
> * remove poll_msec property from dt binding
> * add poll_msec as a module param, default is 100ms
> * update copyright text
> * define macro mnemonics for L1 and L2 RAMID
> * check L2 error per-cluster instead of per core
> * update function names
> * use get_online_cpus() and put_online_cpus() to make L1 and L2 register 
>   read hotplug-safe
> * add error check in probe routine
> 
>  .../devicetree/bindings/edac/armv8-edac.txt        |  15 +
>  drivers/edac/Kconfig                               |   6 +
>  drivers/edac/Makefile                              |   1 +
>  drivers/edac/cortex_arm64_edac.c                   | 457 +++++++++++++++++++++
>  4 files changed, 479 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/edac/armv8-edac.txt
>  create mode 100644 drivers/edac/cortex_arm64_edac.c
> 
> diff --git a/Documentation/devicetree/bindings/edac/armv8-edac.txt b/Documentation/devicetree/bindings/edac/armv8-edac.txt
> new file mode 100644
> index 0000000..dfd128f
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/edac/armv8-edac.txt
> @@ -0,0 +1,15 @@
> +* ARMv8 L1/L2 cache error reporting
> +
> +On ARMv8, CPU Memory Error Syndrome Register and L2 Memory Error Syndrome
> +Register can be used for checking L1 and L2 memory errors.
> +
> +The following section describes the ARMv8 EDAC DT node binding.
> +
> +Required properties:
> +- compatible: Should be "arm,armv8-edac"
> +
> +Example:
> +	edac {
> +		compatible = "arm,armv8-edac";
> +	};
> +
> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
> index ef25000..dd7c195 100644
> --- a/drivers/edac/Kconfig
> +++ b/drivers/edac/Kconfig
> @@ -390,4 +390,10 @@ config EDAC_XGENE
>  	  Support for error detection and correction on the
>  	  APM X-Gene family of SOCs.
>  
> +config EDAC_CORTEX_ARM64
> +	tristate "ARM Cortex A57/A53"
> +	depends on EDAC_MM_EDAC && ARM64

It would be good to be able to compile it on non-ARM64 archs
if COMPILE_TEST, e. g.:

	depends on EDAC_MM_EDAC && (ARM64 || COMPILE_TEST)

That would allow testing tools like Coverity to test it. As far as
I know, the public license we use only works on x86.

> +	help
> +	  Support for error detection and correction on the
> +	  ARM Cortex A57 and A53.
>  endif # EDAC
> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
> index ae3c5f3..ac01660 100644
> --- a/drivers/edac/Makefile
> +++ b/drivers/edac/Makefile
> @@ -68,3 +68,4 @@ obj-$(CONFIG_EDAC_OCTEON_PCI)		+= octeon_edac-pci.o
>  obj-$(CONFIG_EDAC_ALTERA_MC)		+= altera_edac.o
>  obj-$(CONFIG_EDAC_SYNOPSYS)		+= synopsys_edac.o
>  obj-$(CONFIG_EDAC_XGENE)		+= xgene_edac.o
> +obj-$(CONFIG_EDAC_CORTEX_ARM64)		+= cortex_arm64_edac.o
> diff --git a/drivers/edac/cortex_arm64_edac.c b/drivers/edac/cortex_arm64_edac.c
> new file mode 100644
> index 0000000..c37bb94
> --- /dev/null
> +++ b/drivers/edac/cortex_arm64_edac.c
> @@ -0,0 +1,457 @@
> +/*
> + * Cortex ARM64 EDAC
> + *
> + * Copyright (c) 2015, Advanced Micro Devices
> + * Author: Brijesh Singh <brijeshkumar.singh@amd.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/of_device.h>
> +#include <linux/platform_device.h>
> +
> +#include "edac_core.h"
> +
> +#define EDAC_MOD_STR             "cortex_arm64_edac"
> +
> +#define A57_CPUMERRSR_EL1_INDEX(x)   ((x) & 0x1ffff)
> +#define A57_CPUMERRSR_EL1_BANK(x)    (((x) >> 18) & 0x1f)
> +#define A57_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
> +#define A57_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
> +#define A57_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0x7f)
> +#define A57_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
> +#define A57_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
> +#define A57_L1_I_TAG_RAM	     0x00
> +#define A57_L1_I_DATA_RAM	     0x01
> +#define A57_L1_D_TAG_RAM	     0x08
> +#define A57_L1_D_DATA_RAM	     0x09
> +#define A57_L1_TLB_RAM		     0x18
> +
> +#define A57_L2MERRSR_EL1_INDEX(x)    ((x) & 0x1ffff)
> +#define A57_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0xf)
> +#define A57_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
> +#define A57_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
> +#define A57_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
> +#define A57_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
> +#define A57_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
> +#define A57_L2_TAG_RAM		     0x10
> +#define A57_L2_DATA_RAM		     0x11
> +#define A57_L2_SNOOP_TAG_RAM	     0x12
> +#define A57_L2_DIRTY_RAM	     0x14
> +#define A57_L2_INCLUSION_PF_RAM      0x18
> +
> +#define A53_CPUMERRSR_EL1_ADDR(x)    ((x) & 0xfff)
> +#define A53_CPUMERRSR_EL1_CPUID(x)   (((x) >> 18) & 0x07)
> +#define A53_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
> +#define A53_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
> +#define A53_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0xff)
> +#define A53_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
> +#define A53_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
> +#define A53_L1_I_TAG_RAM	     0x00
> +#define A53_L1_I_DATA_RAM	     0x01
> +#define A53_L1_D_TAG_RAM	     0x08
> +#define A53_L1_D_DATA_RAM	     0x09
> +#define A53_L1_D_DIRT_RAM	     0x0A
> +#define A53_L1_TLB_RAM		     0x18
> +
> +#define A53_L2MERRSR_EL1_INDEX(x)    (((x) >> 3) & 0x3fff)
> +#define A53_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0x0f)
> +#define A53_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
> +#define A53_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
> +#define A53_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
> +#define A53_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
> +#define A53_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
> +#define A53_L2_TAG_RAM		     0x10
> +#define A53_L2_DATA_RAM		     0x11
> +#define A53_L2_SNOOP_RAM	     0x12
> +
> +#define L1_CACHE		     0
> +#define L2_CACHE		     1
> +
> +int poll_msec = 100;
> +
> +struct cortex_arm64_edac {
> +	struct edac_device_ctl_info *edac_ctl;
> +};
> +
> +static inline u64 read_cpumerrsr_el1(void)
> +{
> +	u64 val;
> +
> +	asm volatile("mrs %0, s3_1_c15_c2_2" : "=r" (val));
> +	return val;
> +}
> +
> +static inline void write_cpumerrsr_el1(u64 val)
> +{
> +	asm volatile("msr s3_1_c15_c2_2, %0" :: "r" (val));
> +}
> +
> +static inline u64 read_l2merrsr_el1(void)
> +{
> +	u64 val;
> +
> +	asm volatile("mrs %0, s3_1_c15_c2_3" : "=r" (val));
> +	return val;
> +}
> +
> +static inline void write_l2merrsr_el1(u64 val)
> +{
> +	asm volatile("msr s3_1_c15_c2_3, %0" :: "r" (val));
> +}
> +

If we're willing to compile with COMPILE_TEST, we'll need to provide
some stubs for the above functions that won't use asm.

> +static void a53_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_l2merrsr_el1();
> +
> +	if (!A53_L2MERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A53_L2MERRSR_EL1_FATAL(val);
> +	repeat_err = A53_L2MERRSR_EL1_REPEAT(val);
> +	other_err = A53_L2MERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "A53 CPU%d L2 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
> +
> +	switch (A53_L2MERRSR_EL1_RAMID(val)) {
> +	case A53_L2_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
> +		break;
> +	case A53_L2_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
> +		break;
> +	case A53_L2_SNOOP_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop filter RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	write_l2merrsr_el1(0);
> +}
> +
> +static void a57_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_l2merrsr_el1();
> +
> +	if (!A57_L2MERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A57_L2MERRSR_EL1_FATAL(val);
> +	repeat_err = A57_L2MERRSR_EL1_REPEAT(val);
> +	other_err = A57_L2MERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "A57 CPU%d L2 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
> +
> +	switch (A57_L2MERRSR_EL1_RAMID(val)) {
> +	case A57_L2_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
> +		break;
> +	case A57_L2_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
> +		break;
> +	case A57_L2_SNOOP_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop tag RAM\n");
> +		break;
> +	case A57_L2_DIRTY_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Dirty RAM\n");
> +		break;
> +	case A57_L2_INCLUSION_PF_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 inclusion PF RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	write_l2merrsr_el1(0);
> +}
> +
> +static void a57_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_cpumerrsr_el1();
> +
> +	if (!A57_CPUMERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A57_CPUMERRSR_EL1_FATAL(val);
> +	repeat_err = A57_CPUMERRSR_EL1_REPEAT(val);
> +	other_err = A57_CPUMERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "CPU%d L1 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
> +
> +	switch (A57_CPUMERRSR_EL1_RAMID(val)) {
> +	case A57_L1_I_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
> +		break;
> +	case A57_L1_I_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
> +		break;
> +	case A57_L1_D_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
> +		break;
> +	case A57_L1_D_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
> +		break;
> +	case A57_L1_TLB_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	write_cpumerrsr_el1(0);
> +}
> +
> +static void a53_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_cpumerrsr_el1();
> +
> +	if (!A53_CPUMERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A53_CPUMERRSR_EL1_FATAL(val);
> +	repeat_err = A53_CPUMERRSR_EL1_REPEAT(val);
> +	other_err = A53_CPUMERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "A53 CPU%d L1 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
> +
> +	switch (A53_CPUMERRSR_EL1_RAMID(val)) {
> +	case A53_L1_I_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
> +		break;
> +	case A53_L1_I_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
> +		break;
> +	case A53_L1_D_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
> +		break;
> +	case A53_L1_D_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
> +		break;
> +	case A53_L1_TLB_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);

The above code doesn't look right to me. It should be, instead, calling
one of the functions that output the errors also via trace or to call one
of the trace functions directly (see the trace functions currently defined
at  include/ras/ras_event.h).

Failing to do that would cause RAS tools (like rasdaemon) to not get
the errors.

> +	write_cpumerrsr_el1(0);
> +}
> +
> +static void parse_cpumerrsr(void *args)
> +{
> +	struct edac_device_ctl_info *edac_ctl = args;
> +	int partnum = read_cpuid_part_number();
> +
> +	switch (partnum) {
> +	case ARM_CPU_PART_CORTEX_A57:
> +		a57_parse_cpumerrsr(edac_ctl);
> +		break;
> +	case ARM_CPU_PART_CORTEX_A53:
> +		a53_parse_cpumerrsr(edac_ctl);
> +		break;
> +	}
> +}
> +
> +static void parse_l2merrsr(void *args)
> +{
> +	struct edac_device_ctl_info *edac_ctl = args;
> +	int partnum = read_cpuid_part_number();
> +
> +	switch (partnum) {
> +	case ARM_CPU_PART_CORTEX_A57:
> +		a57_parse_l2merrsr(edac_ctl);
> +		break;
> +	case ARM_CPU_PART_CORTEX_A53:
> +		a53_parse_l2merrsr(edac_ctl);
> +		break;
> +	}
> +}
> +
> +static void arm64_monitor_cache_errors(struct edac_device_ctl_info *edev_ctl)
> +{
> +	int cpu;
> +	struct cpumask cluster_mask, old_mask;
> +
> +	cpumask_clear(&cluster_mask);
> +	cpumask_clear(&old_mask);
> +
> +	get_online_cpus();
> +	for_each_online_cpu(cpu) {
> +		smp_call_function_single(cpu, parse_cpumerrsr, edev_ctl, 0);
> +		cpumask_copy(&cluster_mask, topology_core_cpumask(cpu));
> +		if (cpumask_equal(&cluster_mask, &old_mask))
> +			continue;
> +		cpumask_copy(&old_mask, &cluster_mask);
> +		smp_call_function_any(&cluster_mask, parse_l2merrsr,
> +				      edev_ctl, 0);
> +	}
> +	put_online_cpus();
> +}
> +
> +static int cortex_arm64_edac_probe(struct platform_device *pdev)
> +{
> +	int rc;
> +	struct cortex_arm64_edac *drv;
> +	struct device *dev = &pdev->dev;
> +
> +	drv = devm_kzalloc(dev, sizeof(*drv), GFP_KERNEL);
> +	if (!drv)
> +		return -ENOMEM;
> +
> +	drv->edac_ctl = edac_device_alloc_ctl_info(0, "cpu",
> +						   num_possible_cpus(), "L", 2,
> +						   1, NULL, 0,
> +						   edac_device_alloc_index());
> +	if (IS_ERR(drv->edac_ctl))
> +		return -ENOMEM;
> +
> +	drv->edac_ctl->poll_msec = poll_msec;
> +	drv->edac_ctl->edac_check = arm64_monitor_cache_errors;
> +	drv->edac_ctl->dev = dev;
> +	drv->edac_ctl->mod_name = dev_name(dev);
> +	drv->edac_ctl->dev_name = dev_name(dev);
> +	drv->edac_ctl->ctl_name = "cpu_err";
> +	drv->edac_ctl->panic_on_ue = 1;
> +	platform_set_drvdata(pdev, drv);
> +
> +	rc = edac_device_add_device(drv->edac_ctl);
> +	if (rc)
> +		goto edac_alloc_failed;
> +
> +	return 0;
> +
> +edac_alloc_failed:
> +	edac_device_free_ctl_info(drv->edac_ctl);
> +	return rc;
> +}
> +
> +static int cortex_arm64_edac_remove(struct platform_device *pdev)
> +{
> +	struct cortex_arm64_edac *drv = dev_get_drvdata(&pdev->dev);
> +	struct edac_device_ctl_info *edac_ctl = drv->edac_ctl;
> +
> +	edac_device_del_device(edac_ctl->dev);
> +	edac_device_free_ctl_info(edac_ctl);
> +
> +	return 0;
> +}
> +
> +static const struct of_device_id cortex_arm64_edac_of_match[] = {
> +	{ .compatible = "arm,armv8-edac" },
> +	{},
> +};
> +MODULE_DEVICE_TABLE(of, cortex_arm64_edac_of_match);
> +
> +static struct platform_driver cortex_arm64_edac_driver = {
> +	.probe = cortex_arm64_edac_probe,
> +	.remove = cortex_arm64_edac_remove,
> +	.driver = {
> +		.name = "arm64-edac",
> +		.owner = THIS_MODULE,
> +		.of_match_table = cortex_arm64_edac_of_match,
> +	},
> +};
> +
> +static int __init cortex_arm64_edac_init(void)
> +{
> +	int rc;
> +
> +	/* Only POLL mode is supported so far */
> +	edac_op_state = EDAC_OPSTATE_POLL;
> +
> +	rc = platform_driver_register(&cortex_arm64_edac_driver);
> +	if (rc) {
> +		edac_printk(KERN_ERR, EDAC_MOD_STR, "failed to register\n");
> +		return rc;
> +	}
> +
> +	return 0;
> +}
> +module_init(cortex_arm64_edac_init);
> +
> +static void __exit cortex_arm64_edac_exit(void)
> +{
> +	platform_driver_unregister(&cortex_arm64_edac_driver);
> +}
> +module_exit(cortex_arm64_edac_exit);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Brijesh Singh <brijeshkumar.singh@amd.com>");
> +MODULE_DESCRIPTION("Cortex A57 and A53 EDAC driver");
> +module_param(poll_msec, int, 0444);
> +MODULE_PARM_DESC(poll_msec, "EDAC monitor poll interval in msec");

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-21 23:52   ` Andre Przywara
  0 siblings, 0 replies; 34+ messages in thread
From: Andre Przywara @ 2015-10-21 23:52 UTC (permalink / raw)
  To: Brijesh Singh, linux-arm-kernel
  Cc: robh+dt, pawel.moll, mark.rutland, ijc+devicetree, galak,
	dougthompson, bp, mchehab, devicetree, guohanjun, arnd,
	linux-kernel, linux-edac

On 21/10/15 21:41, Brijesh Singh wrote:
> Add support for Cortex A57 and A53 EDAC driver.

Hi Brijesh,

thanks for the quick update! Some comments below.

> 
> Signed-off-by: Brijesh Singh <brijeshkumar.singh@amd.com>
> CC: robh+dt@kernel.org
> CC: pawel.moll@arm.com
> CC: mark.rutland@arm.com
> CC: ijc+devicetree@hellion.org.uk
> CC: galak@codeaurora.org
> CC: dougthompson@xmission.com
> CC: bp@alien8.de
> CC: mchehab@osg.samsung.com
> CC: devicetree@vger.kernel.org
> CC: guohanjun@huawei.com
> CC: andre.przywara@arm.com
> CC: arnd@arndb.de
> CC: linux-kernel@vger.kernel.org
> CC: linux-edac@vger.kernel.org
> ---
> 
> v2:
> * convert into generic arm64 edac driver
> * remove AMD specific references from dt binding
> * remove poll_msec property from dt binding
> * add poll_msec as a module param, default is 100ms
> * update copyright text
> * define macro mnemonics for L1 and L2 RAMID
> * check L2 error per-cluster instead of per core
> * update function names
> * use get_online_cpus() and put_online_cpus() to make L1 and L2 register 
>   read hotplug-safe
> * add error check in probe routine
> 
>  .../devicetree/bindings/edac/armv8-edac.txt        |  15 +
>  drivers/edac/Kconfig                               |   6 +
>  drivers/edac/Makefile                              |   1 +
>  drivers/edac/cortex_arm64_edac.c                   | 457 +++++++++++++++++++++
>  4 files changed, 479 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/edac/armv8-edac.txt
>  create mode 100644 drivers/edac/cortex_arm64_edac.c
> 
> diff --git a/Documentation/devicetree/bindings/edac/armv8-edac.txt b/Documentation/devicetree/bindings/edac/armv8-edac.txt
> new file mode 100644
> index 0000000..dfd128f
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/edac/armv8-edac.txt
> @@ -0,0 +1,15 @@
> +* ARMv8 L1/L2 cache error reporting
> +
> +On ARMv8, CPU Memory Error Syndrome Register and L2 Memory Error Syndrome
> +Register can be used for checking L1 and L2 memory errors.
> +
> +The following section describes the ARMv8 EDAC DT node binding.
> +
> +Required properties:
> +- compatible: Should be "arm,armv8-edac"
> +
> +Example:
> +	edac {
> +		compatible = "arm,armv8-edac";
> +	};
> +

So if there is nothing in here, why do we need the DT binding at all (I
think Mark hinted at that already)?
Can't we just use the MIDR as already suggested by others?
Secondly, armv8-edac is wrong here, as this feature is ARM-Cortex
specific and not architectural.

> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
> index ef25000..dd7c195 100644
> --- a/drivers/edac/Kconfig
> +++ b/drivers/edac/Kconfig
> @@ -390,4 +390,10 @@ config EDAC_XGENE
>  	  Support for error detection and correction on the
>  	  APM X-Gene family of SOCs.
>  
> +config EDAC_CORTEX_ARM64
> +	tristate "ARM Cortex A57/A53"
> +	depends on EDAC_MM_EDAC && ARM64
> +	help
> +	  Support for error detection and correction on the
> +	  ARM Cortex A57 and A53.
>  endif # EDAC
> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
> index ae3c5f3..ac01660 100644
> --- a/drivers/edac/Makefile
> +++ b/drivers/edac/Makefile
> @@ -68,3 +68,4 @@ obj-$(CONFIG_EDAC_OCTEON_PCI)		+= octeon_edac-pci.o
>  obj-$(CONFIG_EDAC_ALTERA_MC)		+= altera_edac.o
>  obj-$(CONFIG_EDAC_SYNOPSYS)		+= synopsys_edac.o
>  obj-$(CONFIG_EDAC_XGENE)		+= xgene_edac.o
> +obj-$(CONFIG_EDAC_CORTEX_ARM64)		+= cortex_arm64_edac.o
> diff --git a/drivers/edac/cortex_arm64_edac.c b/drivers/edac/cortex_arm64_edac.c
> new file mode 100644
> index 0000000..c37bb94
> --- /dev/null
> +++ b/drivers/edac/cortex_arm64_edac.c
> @@ -0,0 +1,457 @@
> +/*
> + * Cortex ARM64 EDAC
> + *
> + * Copyright (c) 2015, Advanced Micro Devices
> + * Author: Brijesh Singh <brijeshkumar.singh@amd.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/of_device.h>
> +#include <linux/platform_device.h>
> +
> +#include "edac_core.h"
> +
> +#define EDAC_MOD_STR             "cortex_arm64_edac"
> +
> +#define A57_CPUMERRSR_EL1_INDEX(x)   ((x) & 0x1ffff)
> +#define A57_CPUMERRSR_EL1_BANK(x)    (((x) >> 18) & 0x1f)
> +#define A57_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
> +#define A57_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
> +#define A57_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0x7f)
> +#define A57_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
> +#define A57_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
> +#define A57_L1_I_TAG_RAM	     0x00
> +#define A57_L1_I_DATA_RAM	     0x01
> +#define A57_L1_D_TAG_RAM	     0x08
> +#define A57_L1_D_DATA_RAM	     0x09
> +#define A57_L1_TLB_RAM		     0x18
> +
> +#define A57_L2MERRSR_EL1_INDEX(x)    ((x) & 0x1ffff)
> +#define A57_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0xf)
> +#define A57_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
> +#define A57_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
> +#define A57_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
> +#define A57_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
> +#define A57_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
> +#define A57_L2_TAG_RAM		     0x10
> +#define A57_L2_DATA_RAM		     0x11
> +#define A57_L2_SNOOP_TAG_RAM	     0x12
> +#define A57_L2_DIRTY_RAM	     0x14
> +#define A57_L2_INCLUSION_PF_RAM      0x18
> +
> +#define A53_CPUMERRSR_EL1_ADDR(x)    ((x) & 0xfff)
> +#define A53_CPUMERRSR_EL1_CPUID(x)   (((x) >> 18) & 0x07)
> +#define A53_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
> +#define A53_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
> +#define A53_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0xff)
> +#define A53_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
> +#define A53_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
> +#define A53_L1_I_TAG_RAM	     0x00
> +#define A53_L1_I_DATA_RAM	     0x01
> +#define A53_L1_D_TAG_RAM	     0x08
> +#define A53_L1_D_DATA_RAM	     0x09
> +#define A53_L1_D_DIRT_RAM	     0x0A
> +#define A53_L1_TLB_RAM		     0x18
> +
> +#define A53_L2MERRSR_EL1_INDEX(x)    (((x) >> 3) & 0x3fff)
> +#define A53_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0x0f)
> +#define A53_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
> +#define A53_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
> +#define A53_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
> +#define A53_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
> +#define A53_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
> +#define A53_L2_TAG_RAM		     0x10
> +#define A53_L2_DATA_RAM		     0x11
> +#define A53_L2_SNOOP_RAM	     0x12

I guess you can get rid of the A53/A57 prefix for most of the
definitions - given they are identical.
Just keep it around for the differing bits (but not as a prefix).

> +
> +#define L1_CACHE		     0
> +#define L2_CACHE		     1
> +
> +int poll_msec = 100;
> +
> +struct cortex_arm64_edac {
> +	struct edac_device_ctl_info *edac_ctl;
> +};
> +
> +static inline u64 read_cpumerrsr_el1(void)
> +{
> +	u64 val;
> +
> +	asm volatile("mrs %0, s3_1_c15_c2_2" : "=r" (val));
> +	return val;
> +}
> +
> +static inline void write_cpumerrsr_el1(u64 val)
> +{
> +	asm volatile("msr s3_1_c15_c2_2, %0" :: "r" (val));
> +}
> +
> +static inline u64 read_l2merrsr_el1(void)
> +{
> +	u64 val;
> +
> +	asm volatile("mrs %0, s3_1_c15_c2_3" : "=r" (val));
> +	return val;
> +}
> +
> +static inline void write_l2merrsr_el1(u64 val)
> +{
> +	asm volatile("msr s3_1_c15_c2_3, %0" :: "r" (val));
> +}
> +
> +static void a53_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_l2merrsr_el1();
> +
> +	if (!A53_L2MERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A53_L2MERRSR_EL1_FATAL(val);
> +	repeat_err = A53_L2MERRSR_EL1_REPEAT(val);
> +	other_err = A53_L2MERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "A53 CPU%d L2 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
> +
> +	switch (A53_L2MERRSR_EL1_RAMID(val)) {
> +	case A53_L2_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
> +		break;
> +	case A53_L2_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
> +		break;
> +	case A53_L2_SNOOP_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop filter RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	write_l2merrsr_el1(0);
> +}
> +
> +static void a57_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
> +{

That looks like copied for most of the code, so it cries for unification.
The semantics is almost identical, though the TRM describes it in a
slighly different way sometimes (for CPUID/way for instance).

> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_l2merrsr_el1();
> +
> +	if (!A57_L2MERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A57_L2MERRSR_EL1_FATAL(val);
> +	repeat_err = A57_L2MERRSR_EL1_REPEAT(val);
> +	other_err = A57_L2MERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "A57 CPU%d L2 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
> +
> +	switch (A57_L2MERRSR_EL1_RAMID(val)) {
> +	case A57_L2_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
> +		break;
> +	case A57_L2_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
> +		break;
> +	case A57_L2_SNOOP_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop tag RAM\n");
> +		break;
> +	case A57_L2_DIRTY_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Dirty RAM\n");
> +		break;
> +	case A57_L2_INCLUSION_PF_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 inclusion PF RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	write_l2merrsr_el1(0);
> +}
> +
> +static void a57_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_cpumerrsr_el1();
> +
> +	if (!A57_CPUMERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A57_CPUMERRSR_EL1_FATAL(val);
> +	repeat_err = A57_CPUMERRSR_EL1_REPEAT(val);
> +	other_err = A57_CPUMERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "CPU%d L1 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
> +
> +	switch (A57_CPUMERRSR_EL1_RAMID(val)) {
> +	case A57_L1_I_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
> +		break;
> +	case A57_L1_I_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
> +		break;
> +	case A57_L1_D_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
> +		break;
> +	case A57_L1_D_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
> +		break;
> +	case A57_L1_TLB_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	write_cpumerrsr_el1(0);
> +}
> +
> +static void a53_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
> +{

Same as above. Please unify to have one function.

> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_cpumerrsr_el1();
> +
> +	if (!A53_CPUMERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A53_CPUMERRSR_EL1_FATAL(val);
> +	repeat_err = A53_CPUMERRSR_EL1_REPEAT(val);
> +	other_err = A53_CPUMERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "A53 CPU%d L1 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
> +
> +	switch (A53_CPUMERRSR_EL1_RAMID(val)) {
> +	case A53_L1_I_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
> +		break;
> +	case A53_L1_I_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
> +		break;
> +	case A53_L1_D_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
> +		break;
> +	case A53_L1_D_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
> +		break;
> +	case A53_L1_TLB_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	write_cpumerrsr_el1(0);
> +}
> +
> +static void parse_cpumerrsr(void *args)
> +{
> +	struct edac_device_ctl_info *edac_ctl = args;
> +	int partnum = read_cpuid_part_number();
> +
> +	switch (partnum) {
> +	case ARM_CPU_PART_CORTEX_A57:

You should have that distinction in the above (unified) function just
for the differing bits. If that looks too ugly, hide it in a macro.

> +		a57_parse_cpumerrsr(edac_ctl);
> +		break;
> +	case ARM_CPU_PART_CORTEX_A53:
> +		a53_parse_cpumerrsr(edac_ctl);
> +		break;
> +	}
> +}
> +
> +static void parse_l2merrsr(void *args)
> +{
> +	struct edac_device_ctl_info *edac_ctl = args;
> +	int partnum = read_cpuid_part_number();
> +
> +	switch (partnum) {
> +	case ARM_CPU_PART_CORTEX_A57:
> +		a57_parse_l2merrsr(edac_ctl);
> +		break;
> +	case ARM_CPU_PART_CORTEX_A53:
> +		a53_parse_l2merrsr(edac_ctl);
> +		break;
> +	}
> +}
> +
> +static void arm64_monitor_cache_errors(struct edac_device_ctl_info *edev_ctl)
> +{
> +	int cpu;
> +	struct cpumask cluster_mask, old_mask;
> +
> +	cpumask_clear(&cluster_mask);
> +	cpumask_clear(&old_mask);
> +
> +	get_online_cpus();
> +	for_each_online_cpu(cpu) {
> +		smp_call_function_single(cpu, parse_cpumerrsr, edev_ctl, 0);
> +		cpumask_copy(&cluster_mask, topology_core_cpumask(cpu));
> +		if (cpumask_equal(&cluster_mask, &old_mask))
> +			continue;
> +		cpumask_copy(&old_mask, &cluster_mask);
> +		smp_call_function_any(&cluster_mask, parse_l2merrsr,
> +				      edev_ctl, 0);
> +	}
> +	put_online_cpus();
> +}
> +
> +static int cortex_arm64_edac_probe(struct platform_device *pdev)
> +{
> +	int rc;
> +	struct cortex_arm64_edac *drv;
> +	struct device *dev = &pdev->dev;
> +
> +	drv = devm_kzalloc(dev, sizeof(*drv), GFP_KERNEL);
> +	if (!drv)
> +		return -ENOMEM;
> +
> +	drv->edac_ctl = edac_device_alloc_ctl_info(0, "cpu",
> +						   num_possible_cpus(), "L", 2,
> +						   1, NULL, 0,
> +						   edac_device_alloc_index());
> +	if (IS_ERR(drv->edac_ctl))
> +		return -ENOMEM;
> +
> +	drv->edac_ctl->poll_msec = poll_msec;
> +	drv->edac_ctl->edac_check = arm64_monitor_cache_errors;
> +	drv->edac_ctl->dev = dev;
> +	drv->edac_ctl->mod_name = dev_name(dev);
> +	drv->edac_ctl->dev_name = dev_name(dev);
> +	drv->edac_ctl->ctl_name = "cpu_err";
> +	drv->edac_ctl->panic_on_ue = 1;
> +	platform_set_drvdata(pdev, drv);
> +
> +	rc = edac_device_add_device(drv->edac_ctl);
> +	if (rc)
> +		goto edac_alloc_failed;
> +
> +	return 0;
> +
> +edac_alloc_failed:
> +	edac_device_free_ctl_info(drv->edac_ctl);
> +	return rc;
> +}
> +
> +static int cortex_arm64_edac_remove(struct platform_device *pdev)
> +{
> +	struct cortex_arm64_edac *drv = dev_get_drvdata(&pdev->dev);
> +	struct edac_device_ctl_info *edac_ctl = drv->edac_ctl;
> +
> +	edac_device_del_device(edac_ctl->dev);
> +	edac_device_free_ctl_info(edac_ctl);
> +
> +	return 0;
> +}
> +
> +static const struct of_device_id cortex_arm64_edac_of_match[] = {
> +	{ .compatible = "arm,armv8-edac" },
> +	{},
> +};
> +MODULE_DEVICE_TABLE(of, cortex_arm64_edac_of_match);
> +
> +static struct platform_driver cortex_arm64_edac_driver = {
> +	.probe = cortex_arm64_edac_probe,
> +	.remove = cortex_arm64_edac_remove,
> +	.driver = {
> +		.name = "arm64-edac",
> +		.owner = THIS_MODULE,
> +		.of_match_table = cortex_arm64_edac_of_match,

So you would need to get rid of that without a DT binding. Not sure if
there is a precedence for a driver loaded by an MIDR match already (PMU
comes to mind)?

Please note that would mean that the driver always loads on any Cortex
CPU, I guess this is not always desirable (thinking about phones here).
I am not sure if module blacklisting is a solution or we would need to
come up with some clever way of potentially disabling some platforms -
which admittedly cries for a DT binding again ;-)

Cheers,
Andre.

> +	},
> +};
> +
> +static int __init cortex_arm64_edac_init(void)
> +{
> +	int rc;
> +
> +	/* Only POLL mode is supported so far */
> +	edac_op_state = EDAC_OPSTATE_POLL;
> +
> +	rc = platform_driver_register(&cortex_arm64_edac_driver);
> +	if (rc) {
> +		edac_printk(KERN_ERR, EDAC_MOD_STR, "failed to register\n");
> +		return rc;
> +	}
> +
> +	return 0;
> +}
> +module_init(cortex_arm64_edac_init);
> +
> +static void __exit cortex_arm64_edac_exit(void)
> +{
> +	platform_driver_unregister(&cortex_arm64_edac_driver);
> +}
> +module_exit(cortex_arm64_edac_exit);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Brijesh Singh <brijeshkumar.singh@amd.com>");
> +MODULE_DESCRIPTION("Cortex A57 and A53 EDAC driver");
> +module_param(poll_msec, int, 0444);
> +MODULE_PARM_DESC(poll_msec, "EDAC monitor poll interval in msec");
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-21 23:52   ` Andre Przywara
  0 siblings, 0 replies; 34+ messages in thread
From: Andre Przywara @ 2015-10-21 23:52 UTC (permalink / raw)
  To: Brijesh Singh, linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r
  Cc: robh+dt-DgEjT+Ai2ygdnm+yROfE0A, pawel.moll-5wv7dgnIgG8,
	mark.rutland-5wv7dgnIgG8, ijc+devicetree-KcIKpvwj1kUDXYZnReoRVg,
	galak-sgV2jX0FEOL9JmXXK+q4OQ,
	dougthompson-aS9lmoZGLiVWk0Htik3J/w, bp-Gina5bIWoIWzQB+pC5nmwQ,
	mchehab-JPH+aEBZ4P+UEJcrhfAQsw,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	guohanjun-hv44wF8Li93QT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-edac-u79uwXL29TY76Z2rM5mHXA

On 21/10/15 21:41, Brijesh Singh wrote:
> Add support for Cortex A57 and A53 EDAC driver.

Hi Brijesh,

thanks for the quick update! Some comments below.

> 
> Signed-off-by: Brijesh Singh <brijeshkumar.singh-5C7GfCeVMHo@public.gmane.org>
> CC: robh+dt-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
> CC: pawel.moll-5wv7dgnIgG8@public.gmane.org
> CC: mark.rutland-5wv7dgnIgG8@public.gmane.org
> CC: ijc+devicetree-KcIKpvwj1kUDXYZnReoRVg@public.gmane.org
> CC: galak-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org
> CC: dougthompson-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org
> CC: bp-Gina5bIWoIWzQB+pC5nmwQ@public.gmane.org
> CC: mchehab-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org
> CC: devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> CC: guohanjun-hv44wF8Li93QT0dZR+AlfA@public.gmane.org
> CC: andre.przywara-5wv7dgnIgG8@public.gmane.org
> CC: arnd-r2nGTMty4D4@public.gmane.org
> CC: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> CC: linux-edac-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> ---
> 
> v2:
> * convert into generic arm64 edac driver
> * remove AMD specific references from dt binding
> * remove poll_msec property from dt binding
> * add poll_msec as a module param, default is 100ms
> * update copyright text
> * define macro mnemonics for L1 and L2 RAMID
> * check L2 error per-cluster instead of per core
> * update function names
> * use get_online_cpus() and put_online_cpus() to make L1 and L2 register 
>   read hotplug-safe
> * add error check in probe routine
> 
>  .../devicetree/bindings/edac/armv8-edac.txt        |  15 +
>  drivers/edac/Kconfig                               |   6 +
>  drivers/edac/Makefile                              |   1 +
>  drivers/edac/cortex_arm64_edac.c                   | 457 +++++++++++++++++++++
>  4 files changed, 479 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/edac/armv8-edac.txt
>  create mode 100644 drivers/edac/cortex_arm64_edac.c
> 
> diff --git a/Documentation/devicetree/bindings/edac/armv8-edac.txt b/Documentation/devicetree/bindings/edac/armv8-edac.txt
> new file mode 100644
> index 0000000..dfd128f
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/edac/armv8-edac.txt
> @@ -0,0 +1,15 @@
> +* ARMv8 L1/L2 cache error reporting
> +
> +On ARMv8, CPU Memory Error Syndrome Register and L2 Memory Error Syndrome
> +Register can be used for checking L1 and L2 memory errors.
> +
> +The following section describes the ARMv8 EDAC DT node binding.
> +
> +Required properties:
> +- compatible: Should be "arm,armv8-edac"
> +
> +Example:
> +	edac {
> +		compatible = "arm,armv8-edac";
> +	};
> +

So if there is nothing in here, why do we need the DT binding at all (I
think Mark hinted at that already)?
Can't we just use the MIDR as already suggested by others?
Secondly, armv8-edac is wrong here, as this feature is ARM-Cortex
specific and not architectural.

> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
> index ef25000..dd7c195 100644
> --- a/drivers/edac/Kconfig
> +++ b/drivers/edac/Kconfig
> @@ -390,4 +390,10 @@ config EDAC_XGENE
>  	  Support for error detection and correction on the
>  	  APM X-Gene family of SOCs.
>  
> +config EDAC_CORTEX_ARM64
> +	tristate "ARM Cortex A57/A53"
> +	depends on EDAC_MM_EDAC && ARM64
> +	help
> +	  Support for error detection and correction on the
> +	  ARM Cortex A57 and A53.
>  endif # EDAC
> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
> index ae3c5f3..ac01660 100644
> --- a/drivers/edac/Makefile
> +++ b/drivers/edac/Makefile
> @@ -68,3 +68,4 @@ obj-$(CONFIG_EDAC_OCTEON_PCI)		+= octeon_edac-pci.o
>  obj-$(CONFIG_EDAC_ALTERA_MC)		+= altera_edac.o
>  obj-$(CONFIG_EDAC_SYNOPSYS)		+= synopsys_edac.o
>  obj-$(CONFIG_EDAC_XGENE)		+= xgene_edac.o
> +obj-$(CONFIG_EDAC_CORTEX_ARM64)		+= cortex_arm64_edac.o
> diff --git a/drivers/edac/cortex_arm64_edac.c b/drivers/edac/cortex_arm64_edac.c
> new file mode 100644
> index 0000000..c37bb94
> --- /dev/null
> +++ b/drivers/edac/cortex_arm64_edac.c
> @@ -0,0 +1,457 @@
> +/*
> + * Cortex ARM64 EDAC
> + *
> + * Copyright (c) 2015, Advanced Micro Devices
> + * Author: Brijesh Singh <brijeshkumar.singh-5C7GfCeVMHo@public.gmane.org>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/of_device.h>
> +#include <linux/platform_device.h>
> +
> +#include "edac_core.h"
> +
> +#define EDAC_MOD_STR             "cortex_arm64_edac"
> +
> +#define A57_CPUMERRSR_EL1_INDEX(x)   ((x) & 0x1ffff)
> +#define A57_CPUMERRSR_EL1_BANK(x)    (((x) >> 18) & 0x1f)
> +#define A57_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
> +#define A57_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
> +#define A57_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0x7f)
> +#define A57_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
> +#define A57_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
> +#define A57_L1_I_TAG_RAM	     0x00
> +#define A57_L1_I_DATA_RAM	     0x01
> +#define A57_L1_D_TAG_RAM	     0x08
> +#define A57_L1_D_DATA_RAM	     0x09
> +#define A57_L1_TLB_RAM		     0x18
> +
> +#define A57_L2MERRSR_EL1_INDEX(x)    ((x) & 0x1ffff)
> +#define A57_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0xf)
> +#define A57_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
> +#define A57_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
> +#define A57_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
> +#define A57_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
> +#define A57_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
> +#define A57_L2_TAG_RAM		     0x10
> +#define A57_L2_DATA_RAM		     0x11
> +#define A57_L2_SNOOP_TAG_RAM	     0x12
> +#define A57_L2_DIRTY_RAM	     0x14
> +#define A57_L2_INCLUSION_PF_RAM      0x18
> +
> +#define A53_CPUMERRSR_EL1_ADDR(x)    ((x) & 0xfff)
> +#define A53_CPUMERRSR_EL1_CPUID(x)   (((x) >> 18) & 0x07)
> +#define A53_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
> +#define A53_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
> +#define A53_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0xff)
> +#define A53_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
> +#define A53_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
> +#define A53_L1_I_TAG_RAM	     0x00
> +#define A53_L1_I_DATA_RAM	     0x01
> +#define A53_L1_D_TAG_RAM	     0x08
> +#define A53_L1_D_DATA_RAM	     0x09
> +#define A53_L1_D_DIRT_RAM	     0x0A
> +#define A53_L1_TLB_RAM		     0x18
> +
> +#define A53_L2MERRSR_EL1_INDEX(x)    (((x) >> 3) & 0x3fff)
> +#define A53_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0x0f)
> +#define A53_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
> +#define A53_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
> +#define A53_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
> +#define A53_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
> +#define A53_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
> +#define A53_L2_TAG_RAM		     0x10
> +#define A53_L2_DATA_RAM		     0x11
> +#define A53_L2_SNOOP_RAM	     0x12

I guess you can get rid of the A53/A57 prefix for most of the
definitions - given they are identical.
Just keep it around for the differing bits (but not as a prefix).

> +
> +#define L1_CACHE		     0
> +#define L2_CACHE		     1
> +
> +int poll_msec = 100;
> +
> +struct cortex_arm64_edac {
> +	struct edac_device_ctl_info *edac_ctl;
> +};
> +
> +static inline u64 read_cpumerrsr_el1(void)
> +{
> +	u64 val;
> +
> +	asm volatile("mrs %0, s3_1_c15_c2_2" : "=r" (val));
> +	return val;
> +}
> +
> +static inline void write_cpumerrsr_el1(u64 val)
> +{
> +	asm volatile("msr s3_1_c15_c2_2, %0" :: "r" (val));
> +}
> +
> +static inline u64 read_l2merrsr_el1(void)
> +{
> +	u64 val;
> +
> +	asm volatile("mrs %0, s3_1_c15_c2_3" : "=r" (val));
> +	return val;
> +}
> +
> +static inline void write_l2merrsr_el1(u64 val)
> +{
> +	asm volatile("msr s3_1_c15_c2_3, %0" :: "r" (val));
> +}
> +
> +static void a53_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_l2merrsr_el1();
> +
> +	if (!A53_L2MERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A53_L2MERRSR_EL1_FATAL(val);
> +	repeat_err = A53_L2MERRSR_EL1_REPEAT(val);
> +	other_err = A53_L2MERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "A53 CPU%d L2 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
> +
> +	switch (A53_L2MERRSR_EL1_RAMID(val)) {
> +	case A53_L2_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
> +		break;
> +	case A53_L2_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
> +		break;
> +	case A53_L2_SNOOP_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop filter RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	write_l2merrsr_el1(0);
> +}
> +
> +static void a57_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
> +{

That looks like copied for most of the code, so it cries for unification.
The semantics is almost identical, though the TRM describes it in a
slighly different way sometimes (for CPUID/way for instance).

> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_l2merrsr_el1();
> +
> +	if (!A57_L2MERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A57_L2MERRSR_EL1_FATAL(val);
> +	repeat_err = A57_L2MERRSR_EL1_REPEAT(val);
> +	other_err = A57_L2MERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "A57 CPU%d L2 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
> +
> +	switch (A57_L2MERRSR_EL1_RAMID(val)) {
> +	case A57_L2_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
> +		break;
> +	case A57_L2_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
> +		break;
> +	case A57_L2_SNOOP_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop tag RAM\n");
> +		break;
> +	case A57_L2_DIRTY_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Dirty RAM\n");
> +		break;
> +	case A57_L2_INCLUSION_PF_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 inclusion PF RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	write_l2merrsr_el1(0);
> +}
> +
> +static void a57_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_cpumerrsr_el1();
> +
> +	if (!A57_CPUMERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A57_CPUMERRSR_EL1_FATAL(val);
> +	repeat_err = A57_CPUMERRSR_EL1_REPEAT(val);
> +	other_err = A57_CPUMERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "CPU%d L1 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
> +
> +	switch (A57_CPUMERRSR_EL1_RAMID(val)) {
> +	case A57_L1_I_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
> +		break;
> +	case A57_L1_I_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
> +		break;
> +	case A57_L1_D_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
> +		break;
> +	case A57_L1_D_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
> +		break;
> +	case A57_L1_TLB_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	write_cpumerrsr_el1(0);
> +}
> +
> +static void a53_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
> +{

Same as above. Please unify to have one function.

> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_cpumerrsr_el1();
> +
> +	if (!A53_CPUMERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A53_CPUMERRSR_EL1_FATAL(val);
> +	repeat_err = A53_CPUMERRSR_EL1_REPEAT(val);
> +	other_err = A53_CPUMERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "A53 CPU%d L1 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
> +
> +	switch (A53_CPUMERRSR_EL1_RAMID(val)) {
> +	case A53_L1_I_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
> +		break;
> +	case A53_L1_I_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
> +		break;
> +	case A53_L1_D_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
> +		break;
> +	case A53_L1_D_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
> +		break;
> +	case A53_L1_TLB_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	write_cpumerrsr_el1(0);
> +}
> +
> +static void parse_cpumerrsr(void *args)
> +{
> +	struct edac_device_ctl_info *edac_ctl = args;
> +	int partnum = read_cpuid_part_number();
> +
> +	switch (partnum) {
> +	case ARM_CPU_PART_CORTEX_A57:

You should have that distinction in the above (unified) function just
for the differing bits. If that looks too ugly, hide it in a macro.

> +		a57_parse_cpumerrsr(edac_ctl);
> +		break;
> +	case ARM_CPU_PART_CORTEX_A53:
> +		a53_parse_cpumerrsr(edac_ctl);
> +		break;
> +	}
> +}
> +
> +static void parse_l2merrsr(void *args)
> +{
> +	struct edac_device_ctl_info *edac_ctl = args;
> +	int partnum = read_cpuid_part_number();
> +
> +	switch (partnum) {
> +	case ARM_CPU_PART_CORTEX_A57:
> +		a57_parse_l2merrsr(edac_ctl);
> +		break;
> +	case ARM_CPU_PART_CORTEX_A53:
> +		a53_parse_l2merrsr(edac_ctl);
> +		break;
> +	}
> +}
> +
> +static void arm64_monitor_cache_errors(struct edac_device_ctl_info *edev_ctl)
> +{
> +	int cpu;
> +	struct cpumask cluster_mask, old_mask;
> +
> +	cpumask_clear(&cluster_mask);
> +	cpumask_clear(&old_mask);
> +
> +	get_online_cpus();
> +	for_each_online_cpu(cpu) {
> +		smp_call_function_single(cpu, parse_cpumerrsr, edev_ctl, 0);
> +		cpumask_copy(&cluster_mask, topology_core_cpumask(cpu));
> +		if (cpumask_equal(&cluster_mask, &old_mask))
> +			continue;
> +		cpumask_copy(&old_mask, &cluster_mask);
> +		smp_call_function_any(&cluster_mask, parse_l2merrsr,
> +				      edev_ctl, 0);
> +	}
> +	put_online_cpus();
> +}
> +
> +static int cortex_arm64_edac_probe(struct platform_device *pdev)
> +{
> +	int rc;
> +	struct cortex_arm64_edac *drv;
> +	struct device *dev = &pdev->dev;
> +
> +	drv = devm_kzalloc(dev, sizeof(*drv), GFP_KERNEL);
> +	if (!drv)
> +		return -ENOMEM;
> +
> +	drv->edac_ctl = edac_device_alloc_ctl_info(0, "cpu",
> +						   num_possible_cpus(), "L", 2,
> +						   1, NULL, 0,
> +						   edac_device_alloc_index());
> +	if (IS_ERR(drv->edac_ctl))
> +		return -ENOMEM;
> +
> +	drv->edac_ctl->poll_msec = poll_msec;
> +	drv->edac_ctl->edac_check = arm64_monitor_cache_errors;
> +	drv->edac_ctl->dev = dev;
> +	drv->edac_ctl->mod_name = dev_name(dev);
> +	drv->edac_ctl->dev_name = dev_name(dev);
> +	drv->edac_ctl->ctl_name = "cpu_err";
> +	drv->edac_ctl->panic_on_ue = 1;
> +	platform_set_drvdata(pdev, drv);
> +
> +	rc = edac_device_add_device(drv->edac_ctl);
> +	if (rc)
> +		goto edac_alloc_failed;
> +
> +	return 0;
> +
> +edac_alloc_failed:
> +	edac_device_free_ctl_info(drv->edac_ctl);
> +	return rc;
> +}
> +
> +static int cortex_arm64_edac_remove(struct platform_device *pdev)
> +{
> +	struct cortex_arm64_edac *drv = dev_get_drvdata(&pdev->dev);
> +	struct edac_device_ctl_info *edac_ctl = drv->edac_ctl;
> +
> +	edac_device_del_device(edac_ctl->dev);
> +	edac_device_free_ctl_info(edac_ctl);
> +
> +	return 0;
> +}
> +
> +static const struct of_device_id cortex_arm64_edac_of_match[] = {
> +	{ .compatible = "arm,armv8-edac" },
> +	{},
> +};
> +MODULE_DEVICE_TABLE(of, cortex_arm64_edac_of_match);
> +
> +static struct platform_driver cortex_arm64_edac_driver = {
> +	.probe = cortex_arm64_edac_probe,
> +	.remove = cortex_arm64_edac_remove,
> +	.driver = {
> +		.name = "arm64-edac",
> +		.owner = THIS_MODULE,
> +		.of_match_table = cortex_arm64_edac_of_match,

So you would need to get rid of that without a DT binding. Not sure if
there is a precedence for a driver loaded by an MIDR match already (PMU
comes to mind)?

Please note that would mean that the driver always loads on any Cortex
CPU, I guess this is not always desirable (thinking about phones here).
I am not sure if module blacklisting is a solution or we would need to
come up with some clever way of potentially disabling some platforms -
which admittedly cries for a DT binding again ;-)

Cheers,
Andre.

> +	},
> +};
> +
> +static int __init cortex_arm64_edac_init(void)
> +{
> +	int rc;
> +
> +	/* Only POLL mode is supported so far */
> +	edac_op_state = EDAC_OPSTATE_POLL;
> +
> +	rc = platform_driver_register(&cortex_arm64_edac_driver);
> +	if (rc) {
> +		edac_printk(KERN_ERR, EDAC_MOD_STR, "failed to register\n");
> +		return rc;
> +	}
> +
> +	return 0;
> +}
> +module_init(cortex_arm64_edac_init);
> +
> +static void __exit cortex_arm64_edac_exit(void)
> +{
> +	platform_driver_unregister(&cortex_arm64_edac_driver);
> +}
> +module_exit(cortex_arm64_edac_exit);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Brijesh Singh <brijeshkumar.singh-5C7GfCeVMHo@public.gmane.org>");
> +MODULE_DESCRIPTION("Cortex A57 and A53 EDAC driver");
> +module_param(poll_msec, int, 0444);
> +MODULE_PARM_DESC(poll_msec, "EDAC monitor poll interval in msec");
> 

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-21 23:52   ` Andre Przywara
  0 siblings, 0 replies; 34+ messages in thread
From: Andre Przywara @ 2015-10-21 23:52 UTC (permalink / raw)
  To: linux-arm-kernel

On 21/10/15 21:41, Brijesh Singh wrote:
> Add support for Cortex A57 and A53 EDAC driver.

Hi Brijesh,

thanks for the quick update! Some comments below.

> 
> Signed-off-by: Brijesh Singh <brijeshkumar.singh@amd.com>
> CC: robh+dt at kernel.org
> CC: pawel.moll at arm.com
> CC: mark.rutland at arm.com
> CC: ijc+devicetree at hellion.org.uk
> CC: galak at codeaurora.org
> CC: dougthompson at xmission.com
> CC: bp at alien8.de
> CC: mchehab at osg.samsung.com
> CC: devicetree at vger.kernel.org
> CC: guohanjun at huawei.com
> CC: andre.przywara at arm.com
> CC: arnd at arndb.de
> CC: linux-kernel at vger.kernel.org
> CC: linux-edac at vger.kernel.org
> ---
> 
> v2:
> * convert into generic arm64 edac driver
> * remove AMD specific references from dt binding
> * remove poll_msec property from dt binding
> * add poll_msec as a module param, default is 100ms
> * update copyright text
> * define macro mnemonics for L1 and L2 RAMID
> * check L2 error per-cluster instead of per core
> * update function names
> * use get_online_cpus() and put_online_cpus() to make L1 and L2 register 
>   read hotplug-safe
> * add error check in probe routine
> 
>  .../devicetree/bindings/edac/armv8-edac.txt        |  15 +
>  drivers/edac/Kconfig                               |   6 +
>  drivers/edac/Makefile                              |   1 +
>  drivers/edac/cortex_arm64_edac.c                   | 457 +++++++++++++++++++++
>  4 files changed, 479 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/edac/armv8-edac.txt
>  create mode 100644 drivers/edac/cortex_arm64_edac.c
> 
> diff --git a/Documentation/devicetree/bindings/edac/armv8-edac.txt b/Documentation/devicetree/bindings/edac/armv8-edac.txt
> new file mode 100644
> index 0000000..dfd128f
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/edac/armv8-edac.txt
> @@ -0,0 +1,15 @@
> +* ARMv8 L1/L2 cache error reporting
> +
> +On ARMv8, CPU Memory Error Syndrome Register and L2 Memory Error Syndrome
> +Register can be used for checking L1 and L2 memory errors.
> +
> +The following section describes the ARMv8 EDAC DT node binding.
> +
> +Required properties:
> +- compatible: Should be "arm,armv8-edac"
> +
> +Example:
> +	edac {
> +		compatible = "arm,armv8-edac";
> +	};
> +

So if there is nothing in here, why do we need the DT binding at all (I
think Mark hinted at that already)?
Can't we just use the MIDR as already suggested by others?
Secondly, armv8-edac is wrong here, as this feature is ARM-Cortex
specific and not architectural.

> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
> index ef25000..dd7c195 100644
> --- a/drivers/edac/Kconfig
> +++ b/drivers/edac/Kconfig
> @@ -390,4 +390,10 @@ config EDAC_XGENE
>  	  Support for error detection and correction on the
>  	  APM X-Gene family of SOCs.
>  
> +config EDAC_CORTEX_ARM64
> +	tristate "ARM Cortex A57/A53"
> +	depends on EDAC_MM_EDAC && ARM64
> +	help
> +	  Support for error detection and correction on the
> +	  ARM Cortex A57 and A53.
>  endif # EDAC
> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
> index ae3c5f3..ac01660 100644
> --- a/drivers/edac/Makefile
> +++ b/drivers/edac/Makefile
> @@ -68,3 +68,4 @@ obj-$(CONFIG_EDAC_OCTEON_PCI)		+= octeon_edac-pci.o
>  obj-$(CONFIG_EDAC_ALTERA_MC)		+= altera_edac.o
>  obj-$(CONFIG_EDAC_SYNOPSYS)		+= synopsys_edac.o
>  obj-$(CONFIG_EDAC_XGENE)		+= xgene_edac.o
> +obj-$(CONFIG_EDAC_CORTEX_ARM64)		+= cortex_arm64_edac.o
> diff --git a/drivers/edac/cortex_arm64_edac.c b/drivers/edac/cortex_arm64_edac.c
> new file mode 100644
> index 0000000..c37bb94
> --- /dev/null
> +++ b/drivers/edac/cortex_arm64_edac.c
> @@ -0,0 +1,457 @@
> +/*
> + * Cortex ARM64 EDAC
> + *
> + * Copyright (c) 2015, Advanced Micro Devices
> + * Author: Brijesh Singh <brijeshkumar.singh@amd.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/of_device.h>
> +#include <linux/platform_device.h>
> +
> +#include "edac_core.h"
> +
> +#define EDAC_MOD_STR             "cortex_arm64_edac"
> +
> +#define A57_CPUMERRSR_EL1_INDEX(x)   ((x) & 0x1ffff)
> +#define A57_CPUMERRSR_EL1_BANK(x)    (((x) >> 18) & 0x1f)
> +#define A57_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
> +#define A57_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
> +#define A57_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0x7f)
> +#define A57_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
> +#define A57_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
> +#define A57_L1_I_TAG_RAM	     0x00
> +#define A57_L1_I_DATA_RAM	     0x01
> +#define A57_L1_D_TAG_RAM	     0x08
> +#define A57_L1_D_DATA_RAM	     0x09
> +#define A57_L1_TLB_RAM		     0x18
> +
> +#define A57_L2MERRSR_EL1_INDEX(x)    ((x) & 0x1ffff)
> +#define A57_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0xf)
> +#define A57_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
> +#define A57_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
> +#define A57_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
> +#define A57_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
> +#define A57_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
> +#define A57_L2_TAG_RAM		     0x10
> +#define A57_L2_DATA_RAM		     0x11
> +#define A57_L2_SNOOP_TAG_RAM	     0x12
> +#define A57_L2_DIRTY_RAM	     0x14
> +#define A57_L2_INCLUSION_PF_RAM      0x18
> +
> +#define A53_CPUMERRSR_EL1_ADDR(x)    ((x) & 0xfff)
> +#define A53_CPUMERRSR_EL1_CPUID(x)   (((x) >> 18) & 0x07)
> +#define A53_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
> +#define A53_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
> +#define A53_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0xff)
> +#define A53_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
> +#define A53_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
> +#define A53_L1_I_TAG_RAM	     0x00
> +#define A53_L1_I_DATA_RAM	     0x01
> +#define A53_L1_D_TAG_RAM	     0x08
> +#define A53_L1_D_DATA_RAM	     0x09
> +#define A53_L1_D_DIRT_RAM	     0x0A
> +#define A53_L1_TLB_RAM		     0x18
> +
> +#define A53_L2MERRSR_EL1_INDEX(x)    (((x) >> 3) & 0x3fff)
> +#define A53_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0x0f)
> +#define A53_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
> +#define A53_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
> +#define A53_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
> +#define A53_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
> +#define A53_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
> +#define A53_L2_TAG_RAM		     0x10
> +#define A53_L2_DATA_RAM		     0x11
> +#define A53_L2_SNOOP_RAM	     0x12

I guess you can get rid of the A53/A57 prefix for most of the
definitions - given they are identical.
Just keep it around for the differing bits (but not as a prefix).

> +
> +#define L1_CACHE		     0
> +#define L2_CACHE		     1
> +
> +int poll_msec = 100;
> +
> +struct cortex_arm64_edac {
> +	struct edac_device_ctl_info *edac_ctl;
> +};
> +
> +static inline u64 read_cpumerrsr_el1(void)
> +{
> +	u64 val;
> +
> +	asm volatile("mrs %0, s3_1_c15_c2_2" : "=r" (val));
> +	return val;
> +}
> +
> +static inline void write_cpumerrsr_el1(u64 val)
> +{
> +	asm volatile("msr s3_1_c15_c2_2, %0" :: "r" (val));
> +}
> +
> +static inline u64 read_l2merrsr_el1(void)
> +{
> +	u64 val;
> +
> +	asm volatile("mrs %0, s3_1_c15_c2_3" : "=r" (val));
> +	return val;
> +}
> +
> +static inline void write_l2merrsr_el1(u64 val)
> +{
> +	asm volatile("msr s3_1_c15_c2_3, %0" :: "r" (val));
> +}
> +
> +static void a53_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_l2merrsr_el1();
> +
> +	if (!A53_L2MERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A53_L2MERRSR_EL1_FATAL(val);
> +	repeat_err = A53_L2MERRSR_EL1_REPEAT(val);
> +	other_err = A53_L2MERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "A53 CPU%d L2 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
> +
> +	switch (A53_L2MERRSR_EL1_RAMID(val)) {
> +	case A53_L2_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
> +		break;
> +	case A53_L2_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
> +		break;
> +	case A53_L2_SNOOP_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop filter RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	write_l2merrsr_el1(0);
> +}
> +
> +static void a57_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
> +{

That looks like copied for most of the code, so it cries for unification.
The semantics is almost identical, though the TRM describes it in a
slighly different way sometimes (for CPUID/way for instance).

> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_l2merrsr_el1();
> +
> +	if (!A57_L2MERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A57_L2MERRSR_EL1_FATAL(val);
> +	repeat_err = A57_L2MERRSR_EL1_REPEAT(val);
> +	other_err = A57_L2MERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "A57 CPU%d L2 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
> +
> +	switch (A57_L2MERRSR_EL1_RAMID(val)) {
> +	case A57_L2_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
> +		break;
> +	case A57_L2_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
> +		break;
> +	case A57_L2_SNOOP_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop tag RAM\n");
> +		break;
> +	case A57_L2_DIRTY_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Dirty RAM\n");
> +		break;
> +	case A57_L2_INCLUSION_PF_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 inclusion PF RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	write_l2merrsr_el1(0);
> +}
> +
> +static void a57_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_cpumerrsr_el1();
> +
> +	if (!A57_CPUMERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A57_CPUMERRSR_EL1_FATAL(val);
> +	repeat_err = A57_CPUMERRSR_EL1_REPEAT(val);
> +	other_err = A57_CPUMERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "CPU%d L1 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
> +
> +	switch (A57_CPUMERRSR_EL1_RAMID(val)) {
> +	case A57_L1_I_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
> +		break;
> +	case A57_L1_I_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
> +		break;
> +	case A57_L1_D_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
> +		break;
> +	case A57_L1_D_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
> +		break;
> +	case A57_L1_TLB_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	write_cpumerrsr_el1(0);
> +}
> +
> +static void a53_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
> +{

Same as above. Please unify to have one function.

> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_cpumerrsr_el1();
> +
> +	if (!A53_CPUMERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A53_CPUMERRSR_EL1_FATAL(val);
> +	repeat_err = A53_CPUMERRSR_EL1_REPEAT(val);
> +	other_err = A53_CPUMERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "A53 CPU%d L1 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
> +
> +	switch (A53_CPUMERRSR_EL1_RAMID(val)) {
> +	case A53_L1_I_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
> +		break;
> +	case A53_L1_I_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
> +		break;
> +	case A53_L1_D_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
> +		break;
> +	case A53_L1_D_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
> +		break;
> +	case A53_L1_TLB_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	write_cpumerrsr_el1(0);
> +}
> +
> +static void parse_cpumerrsr(void *args)
> +{
> +	struct edac_device_ctl_info *edac_ctl = args;
> +	int partnum = read_cpuid_part_number();
> +
> +	switch (partnum) {
> +	case ARM_CPU_PART_CORTEX_A57:

You should have that distinction in the above (unified) function just
for the differing bits. If that looks too ugly, hide it in a macro.

> +		a57_parse_cpumerrsr(edac_ctl);
> +		break;
> +	case ARM_CPU_PART_CORTEX_A53:
> +		a53_parse_cpumerrsr(edac_ctl);
> +		break;
> +	}
> +}
> +
> +static void parse_l2merrsr(void *args)
> +{
> +	struct edac_device_ctl_info *edac_ctl = args;
> +	int partnum = read_cpuid_part_number();
> +
> +	switch (partnum) {
> +	case ARM_CPU_PART_CORTEX_A57:
> +		a57_parse_l2merrsr(edac_ctl);
> +		break;
> +	case ARM_CPU_PART_CORTEX_A53:
> +		a53_parse_l2merrsr(edac_ctl);
> +		break;
> +	}
> +}
> +
> +static void arm64_monitor_cache_errors(struct edac_device_ctl_info *edev_ctl)
> +{
> +	int cpu;
> +	struct cpumask cluster_mask, old_mask;
> +
> +	cpumask_clear(&cluster_mask);
> +	cpumask_clear(&old_mask);
> +
> +	get_online_cpus();
> +	for_each_online_cpu(cpu) {
> +		smp_call_function_single(cpu, parse_cpumerrsr, edev_ctl, 0);
> +		cpumask_copy(&cluster_mask, topology_core_cpumask(cpu));
> +		if (cpumask_equal(&cluster_mask, &old_mask))
> +			continue;
> +		cpumask_copy(&old_mask, &cluster_mask);
> +		smp_call_function_any(&cluster_mask, parse_l2merrsr,
> +				      edev_ctl, 0);
> +	}
> +	put_online_cpus();
> +}
> +
> +static int cortex_arm64_edac_probe(struct platform_device *pdev)
> +{
> +	int rc;
> +	struct cortex_arm64_edac *drv;
> +	struct device *dev = &pdev->dev;
> +
> +	drv = devm_kzalloc(dev, sizeof(*drv), GFP_KERNEL);
> +	if (!drv)
> +		return -ENOMEM;
> +
> +	drv->edac_ctl = edac_device_alloc_ctl_info(0, "cpu",
> +						   num_possible_cpus(), "L", 2,
> +						   1, NULL, 0,
> +						   edac_device_alloc_index());
> +	if (IS_ERR(drv->edac_ctl))
> +		return -ENOMEM;
> +
> +	drv->edac_ctl->poll_msec = poll_msec;
> +	drv->edac_ctl->edac_check = arm64_monitor_cache_errors;
> +	drv->edac_ctl->dev = dev;
> +	drv->edac_ctl->mod_name = dev_name(dev);
> +	drv->edac_ctl->dev_name = dev_name(dev);
> +	drv->edac_ctl->ctl_name = "cpu_err";
> +	drv->edac_ctl->panic_on_ue = 1;
> +	platform_set_drvdata(pdev, drv);
> +
> +	rc = edac_device_add_device(drv->edac_ctl);
> +	if (rc)
> +		goto edac_alloc_failed;
> +
> +	return 0;
> +
> +edac_alloc_failed:
> +	edac_device_free_ctl_info(drv->edac_ctl);
> +	return rc;
> +}
> +
> +static int cortex_arm64_edac_remove(struct platform_device *pdev)
> +{
> +	struct cortex_arm64_edac *drv = dev_get_drvdata(&pdev->dev);
> +	struct edac_device_ctl_info *edac_ctl = drv->edac_ctl;
> +
> +	edac_device_del_device(edac_ctl->dev);
> +	edac_device_free_ctl_info(edac_ctl);
> +
> +	return 0;
> +}
> +
> +static const struct of_device_id cortex_arm64_edac_of_match[] = {
> +	{ .compatible = "arm,armv8-edac" },
> +	{},
> +};
> +MODULE_DEVICE_TABLE(of, cortex_arm64_edac_of_match);
> +
> +static struct platform_driver cortex_arm64_edac_driver = {
> +	.probe = cortex_arm64_edac_probe,
> +	.remove = cortex_arm64_edac_remove,
> +	.driver = {
> +		.name = "arm64-edac",
> +		.owner = THIS_MODULE,
> +		.of_match_table = cortex_arm64_edac_of_match,

So you would need to get rid of that without a DT binding. Not sure if
there is a precedence for a driver loaded by an MIDR match already (PMU
comes to mind)?

Please note that would mean that the driver always loads on any Cortex
CPU, I guess this is not always desirable (thinking about phones here).
I am not sure if module blacklisting is a solution or we would need to
come up with some clever way of potentially disabling some platforms -
which admittedly cries for a DT binding again ;-)

Cheers,
Andre.

> +	},
> +};
> +
> +static int __init cortex_arm64_edac_init(void)
> +{
> +	int rc;
> +
> +	/* Only POLL mode is supported so far */
> +	edac_op_state = EDAC_OPSTATE_POLL;
> +
> +	rc = platform_driver_register(&cortex_arm64_edac_driver);
> +	if (rc) {
> +		edac_printk(KERN_ERR, EDAC_MOD_STR, "failed to register\n");
> +		return rc;
> +	}
> +
> +	return 0;
> +}
> +module_init(cortex_arm64_edac_init);
> +
> +static void __exit cortex_arm64_edac_exit(void)
> +{
> +	platform_driver_unregister(&cortex_arm64_edac_driver);
> +}
> +module_exit(cortex_arm64_edac_exit);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Brijesh Singh <brijeshkumar.singh@amd.com>");
> +MODULE_DESCRIPTION("Cortex A57 and A53 EDAC driver");
> +module_param(poll_msec, int, 0444);
> +MODULE_PARM_DESC(poll_msec, "EDAC monitor poll interval in msec");
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-22 14:46     ` Brijesh Singh
  0 siblings, 0 replies; 34+ messages in thread
From: Brijesh Singh @ 2015-10-22 14:46 UTC (permalink / raw)
  To: Andre Przywara, linux-arm-kernel
  Cc: brijeshkumar.singh, robh+dt, pawel.moll, mark.rutland,
	ijc+devicetree, galak, dougthompson, bp, mchehab, devicetree,
	guohanjun, arnd, linux-kernel, linux-edac

Hi Andre,

On 10/21/2015 06:52 PM, Andre Przywara wrote:
> On 21/10/15 21:41, Brijesh Singh wrote:
>> Add support for Cortex A57 and A53 EDAC driver.
> 
> Hi Brijesh,
> 
> thanks for the quick update! Some comments below.
> 
>>
>> Signed-off-by: Brijesh Singh <brijeshkumar.singh@amd.com>
>> CC: robh+dt@kernel.org
>> CC: pawel.moll@arm.com
>> CC: mark.rutland@arm.com
>> CC: ijc+devicetree@hellion.org.uk
>> CC: galak@codeaurora.org
>> CC: dougthompson@xmission.com
>> CC: bp@alien8.de
>> CC: mchehab@osg.samsung.com
>> CC: devicetree@vger.kernel.org
>> CC: guohanjun@huawei.com
>> CC: andre.przywara@arm.com
>> CC: arnd@arndb.de
>> CC: linux-kernel@vger.kernel.org
>> CC: linux-edac@vger.kernel.org
>> ---
>>
>> v2:
>> * convert into generic arm64 edac driver
>> * remove AMD specific references from dt binding
>> * remove poll_msec property from dt binding
>> * add poll_msec as a module param, default is 100ms
>> * update copyright text
>> * define macro mnemonics for L1 and L2 RAMID
>> * check L2 error per-cluster instead of per core
>> * update function names
>> * use get_online_cpus() and put_online_cpus() to make L1 and L2 register 
>>   read hotplug-safe
>> * add error check in probe routine
>>
>>  .../devicetree/bindings/edac/armv8-edac.txt        |  15 +
>>  drivers/edac/Kconfig                               |   6 +
>>  drivers/edac/Makefile                              |   1 +
>>  drivers/edac/cortex_arm64_edac.c                   | 457 +++++++++++++++++++++
>>  4 files changed, 479 insertions(+)
>>  create mode 100644 Documentation/devicetree/bindings/edac/armv8-edac.txt
>>  create mode 100644 drivers/edac/cortex_arm64_edac.c
>>
>> diff --git a/Documentation/devicetree/bindings/edac/armv8-edac.txt b/Documentation/devicetree/bindings/edac/armv8-edac.txt
>> new file mode 100644
>> index 0000000..dfd128f
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/edac/armv8-edac.txt
>> @@ -0,0 +1,15 @@
>> +* ARMv8 L1/L2 cache error reporting
>> +
>> +On ARMv8, CPU Memory Error Syndrome Register and L2 Memory Error Syndrome
>> +Register can be used for checking L1 and L2 memory errors.
>> +
>> +The following section describes the ARMv8 EDAC DT node binding.
>> +
>> +Required properties:
>> +- compatible: Should be "arm,armv8-edac"
>> +
>> +Example:
>> +	edac {
>> +		compatible = "arm,armv8-edac";
>> +	};
>> +
> 
> So if there is nothing in here, why do we need the DT binding at all (I
> think Mark hinted at that already)?
> Can't we just use the MIDR as already suggested by others?
> Secondly, armv8-edac is wrong here, as this feature is ARM-Cortex
> specific and not architectural.
> 
Yes, I was going with Mark suggestion to remove DT binding but then came across these cases which kind of hinted to keep DT binding:

* Without DT binding, the driver will always be loaded on arm64 unless its blacklisted.
* Its possible that other SoC's might handle single-bit and double-bit errors differently compare to 
  Seattle platform. In Seattle platform both errors are handled by firmware but if other SoC 
  wants OS to handle these errors then they might need DT binding to provide the irq numbers etc.

But if recommendation is to remove DT binding then I can remove it in next version. 

>> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
>> index ef25000..dd7c195 100644
>> --- a/drivers/edac/Kconfig
>> +++ b/drivers/edac/Kconfig
>> @@ -390,4 +390,10 @@ config EDAC_XGENE
>>  	  Support for error detection and correction on the
>>  	  APM X-Gene family of SOCs.
>>  
>> +config EDAC_CORTEX_ARM64
>> +	tristate "ARM Cortex A57/A53"
>> +	depends on EDAC_MM_EDAC && ARM64
>> +	help
>> +	  Support for error detection and correction on the
>> +	  ARM Cortex A57 and A53.
>>  endif # EDAC
>> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
>> index ae3c5f3..ac01660 100644
>> --- a/drivers/edac/Makefile
>> +++ b/drivers/edac/Makefile
>> @@ -68,3 +68,4 @@ obj-$(CONFIG_EDAC_OCTEON_PCI)		+= octeon_edac-pci.o
>>  obj-$(CONFIG_EDAC_ALTERA_MC)		+= altera_edac.o
>>  obj-$(CONFIG_EDAC_SYNOPSYS)		+= synopsys_edac.o
>>  obj-$(CONFIG_EDAC_XGENE)		+= xgene_edac.o
>> +obj-$(CONFIG_EDAC_CORTEX_ARM64)		+= cortex_arm64_edac.o
>> diff --git a/drivers/edac/cortex_arm64_edac.c b/drivers/edac/cortex_arm64_edac.c
>> new file mode 100644
>> index 0000000..c37bb94
>> --- /dev/null
>> +++ b/drivers/edac/cortex_arm64_edac.c
>> @@ -0,0 +1,457 @@
>> +/*
>> + * Cortex ARM64 EDAC
>> + *
>> + * Copyright (c) 2015, Advanced Micro Devices
>> + * Author: Brijesh Singh <brijeshkumar.singh@amd.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + */
>> +
>> +#include <linux/module.h>
>> +#include <linux/of_device.h>
>> +#include <linux/platform_device.h>
>> +
>> +#include "edac_core.h"
>> +
>> +#define EDAC_MOD_STR             "cortex_arm64_edac"
>> +
>> +#define A57_CPUMERRSR_EL1_INDEX(x)   ((x) & 0x1ffff)
>> +#define A57_CPUMERRSR_EL1_BANK(x)    (((x) >> 18) & 0x1f)
>> +#define A57_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
>> +#define A57_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
>> +#define A57_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0x7f)
>> +#define A57_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
>> +#define A57_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
>> +#define A57_L1_I_TAG_RAM	     0x00
>> +#define A57_L1_I_DATA_RAM	     0x01
>> +#define A57_L1_D_TAG_RAM	     0x08
>> +#define A57_L1_D_DATA_RAM	     0x09
>> +#define A57_L1_TLB_RAM		     0x18
>> +
>> +#define A57_L2MERRSR_EL1_INDEX(x)    ((x) & 0x1ffff)
>> +#define A57_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0xf)
>> +#define A57_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
>> +#define A57_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
>> +#define A57_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
>> +#define A57_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
>> +#define A57_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
>> +#define A57_L2_TAG_RAM		     0x10
>> +#define A57_L2_DATA_RAM		     0x11
>> +#define A57_L2_SNOOP_TAG_RAM	     0x12
>> +#define A57_L2_DIRTY_RAM	     0x14
>> +#define A57_L2_INCLUSION_PF_RAM      0x18
>> +
>> +#define A53_CPUMERRSR_EL1_ADDR(x)    ((x) & 0xfff)
>> +#define A53_CPUMERRSR_EL1_CPUID(x)   (((x) >> 18) & 0x07)
>> +#define A53_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
>> +#define A53_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
>> +#define A53_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0xff)
>> +#define A53_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
>> +#define A53_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
>> +#define A53_L1_I_TAG_RAM	     0x00
>> +#define A53_L1_I_DATA_RAM	     0x01
>> +#define A53_L1_D_TAG_RAM	     0x08
>> +#define A53_L1_D_DATA_RAM	     0x09
>> +#define A53_L1_D_DIRT_RAM	     0x0A
>> +#define A53_L1_TLB_RAM		     0x18
>> +
>> +#define A53_L2MERRSR_EL1_INDEX(x)    (((x) >> 3) & 0x3fff)
>> +#define A53_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0x0f)
>> +#define A53_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
>> +#define A53_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
>> +#define A53_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
>> +#define A53_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
>> +#define A53_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
>> +#define A53_L2_TAG_RAM		     0x10
>> +#define A53_L2_DATA_RAM		     0x11
>> +#define A53_L2_SNOOP_RAM	     0x12
> 
> I guess you can get rid of the A53/A57 prefix for most of the
> definitions - given they are identical.
> Just keep it around for the differing bits (but not as a prefix).
> 
Ok
>> +
>> +#define L1_CACHE		     0
>> +#define L2_CACHE		     1
>> +
>> +int poll_msec = 100;
>> +
>> +struct cortex_arm64_edac {
>> +	struct edac_device_ctl_info *edac_ctl;
>> +};
>> +
>> +static inline u64 read_cpumerrsr_el1(void)
>> +{
>> +	u64 val;
>> +
>> +	asm volatile("mrs %0, s3_1_c15_c2_2" : "=r" (val));
>> +	return val;
>> +}
>> +
>> +static inline void write_cpumerrsr_el1(u64 val)
>> +{
>> +	asm volatile("msr s3_1_c15_c2_2, %0" :: "r" (val));
>> +}
>> +
>> +static inline u64 read_l2merrsr_el1(void)
>> +{
>> +	u64 val;
>> +
>> +	asm volatile("mrs %0, s3_1_c15_c2_3" : "=r" (val));
>> +	return val;
>> +}
>> +
>> +static inline void write_l2merrsr_el1(u64 val)
>> +{
>> +	asm volatile("msr s3_1_c15_c2_3, %0" :: "r" (val));
>> +}
>> +
>> +static void a53_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
>> +{
>> +	int fatal;
>> +	int repeat_err, other_err;
>> +	u64 val = read_l2merrsr_el1();
>> +
>> +	if (!A53_L2MERRSR_EL1_VALID(val))
>> +		return;
>> +
>> +	fatal = A53_L2MERRSR_EL1_FATAL(val);
>> +	repeat_err = A53_L2MERRSR_EL1_REPEAT(val);
>> +	other_err = A53_L2MERRSR_EL1_OTHER(val);
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
>> +		    "A53 CPU%d L2 %s error detected!\n", smp_processor_id(),
>> +		    fatal ? "fatal" : "non-fatal");
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
>> +
>> +	switch (A53_L2MERRSR_EL1_RAMID(val)) {
>> +	case A53_L2_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
>> +		break;
>> +	case A53_L2_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
>> +		break;
>> +	case A53_L2_SNOOP_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop filter RAM\n");
>> +		break;
>> +	default:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
>> +		break;
>> +	}
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
>> +		    repeat_err);
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
>> +		    other_err);
>> +	if (fatal)
>> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
>> +				      edac_ctl->name);
>> +	else
>> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
>> +				      edac_ctl->name);
>> +	write_l2merrsr_el1(0);
>> +}
>> +
>> +static void a57_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
>> +{
> 
> That looks like copied for most of the code, so it cries for unification.
> The semantics is almost identical, though the TRM describes it in a
> slighly different way sometimes (for CPUID/way for instance).

Yes RAMID description was different between A57 and A53. Will try to unify

>> +	int fatal;
>> +	int repeat_err, other_err;
>> +	u64 val = read_l2merrsr_el1();
>> +
>> +	if (!A57_L2MERRSR_EL1_VALID(val))
>> +		return;
>> +
>> +	fatal = A57_L2MERRSR_EL1_FATAL(val);
>> +	repeat_err = A57_L2MERRSR_EL1_REPEAT(val);
>> +	other_err = A57_L2MERRSR_EL1_OTHER(val);
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
>> +		    "A57 CPU%d L2 %s error detected!\n", smp_processor_id(),
>> +		    fatal ? "fatal" : "non-fatal");
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
>> +
>> +	switch (A57_L2MERRSR_EL1_RAMID(val)) {
>> +	case A57_L2_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
>> +		break;
>> +	case A57_L2_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
>> +		break;
>> +	case A57_L2_SNOOP_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop tag RAM\n");
>> +		break;
>> +	case A57_L2_DIRTY_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Dirty RAM\n");
>> +		break;
>> +	case A57_L2_INCLUSION_PF_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 inclusion PF RAM\n");
>> +		break;
>> +	default:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
>> +		break;
>> +	}
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
>> +		    repeat_err);
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
>> +		    other_err);
>> +	if (fatal)
>> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
>> +				      edac_ctl->name);
>> +	else
>> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
>> +				      edac_ctl->name);
>> +	write_l2merrsr_el1(0);
>> +}
>> +
>> +static void a57_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
>> +{
>> +	int fatal;
>> +	int repeat_err, other_err;
>> +	u64 val = read_cpumerrsr_el1();
>> +
>> +	if (!A57_CPUMERRSR_EL1_VALID(val))
>> +		return;
>> +
>> +	fatal = A57_CPUMERRSR_EL1_FATAL(val);
>> +	repeat_err = A57_CPUMERRSR_EL1_REPEAT(val);
>> +	other_err = A57_CPUMERRSR_EL1_OTHER(val);
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
>> +		    "CPU%d L1 %s error detected!\n", smp_processor_id(),
>> +		    fatal ? "fatal" : "non-fatal");
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
>> +
>> +	switch (A57_CPUMERRSR_EL1_RAMID(val)) {
>> +	case A57_L1_I_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
>> +		break;
>> +	case A57_L1_I_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
>> +		break;
>> +	case A57_L1_D_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
>> +		break;
>> +	case A57_L1_D_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
>> +		break;
>> +	case A57_L1_TLB_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
>> +		break;
>> +	default:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
>> +		break;
>> +	}
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
>> +		    repeat_err);
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
>> +		    other_err);
>> +
>> +	if (fatal)
>> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
>> +				      edac_ctl->name);
>> +	else
>> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
>> +				      edac_ctl->name);
>> +	write_cpumerrsr_el1(0);
>> +}
>> +
>> +static void a53_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
>> +{
> 
> Same as above. Please unify to have one function.
> 
>> +	int fatal;
>> +	int repeat_err, other_err;
>> +	u64 val = read_cpumerrsr_el1();
>> +
>> +	if (!A53_CPUMERRSR_EL1_VALID(val))
>> +		return;
>> +
>> +	fatal = A53_CPUMERRSR_EL1_FATAL(val);
>> +	repeat_err = A53_CPUMERRSR_EL1_REPEAT(val);
>> +	other_err = A53_CPUMERRSR_EL1_OTHER(val);
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
>> +		    "A53 CPU%d L1 %s error detected!\n", smp_processor_id(),
>> +		    fatal ? "fatal" : "non-fatal");
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
>> +
>> +	switch (A53_CPUMERRSR_EL1_RAMID(val)) {
>> +	case A53_L1_I_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
>> +		break;
>> +	case A53_L1_I_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
>> +		break;
>> +	case A53_L1_D_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
>> +		break;
>> +	case A53_L1_D_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
>> +		break;
>> +	case A53_L1_TLB_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
>> +		break;
>> +	default:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
>> +		break;
>> +	}
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
>> +		    repeat_err);
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
>> +		    other_err);
>> +
>> +	if (fatal)
>> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
>> +				      edac_ctl->name);
>> +	else
>> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
>> +				      edac_ctl->name);
>> +	write_cpumerrsr_el1(0);
>> +}
>> +
>> +static void parse_cpumerrsr(void *args)
>> +{
>> +	struct edac_device_ctl_info *edac_ctl = args;
>> +	int partnum = read_cpuid_part_number();
>> +
>> +	switch (partnum) {
>> +	case ARM_CPU_PART_CORTEX_A57:
> 
> You should have that distinction in the above (unified) function just
> for the differing bits. If that looks too ugly, hide it in a macro.
> 
>> +		a57_parse_cpumerrsr(edac_ctl);
>> +		break;
>> +	case ARM_CPU_PART_CORTEX_A53:
>> +		a53_parse_cpumerrsr(edac_ctl);
>> +		break;
>> +	}
>> +}
>> +
>> +static void parse_l2merrsr(void *args)
>> +{
>> +	struct edac_device_ctl_info *edac_ctl = args;
>> +	int partnum = read_cpuid_part_number();
>> +
>> +	switch (partnum) {
>> +	case ARM_CPU_PART_CORTEX_A57:
>> +		a57_parse_l2merrsr(edac_ctl);
>> +		break;
>> +	case ARM_CPU_PART_CORTEX_A53:
>> +		a53_parse_l2merrsr(edac_ctl);
>> +		break;
>> +	}
>> +}
>> +
>> +static void arm64_monitor_cache_errors(struct edac_device_ctl_info *edev_ctl)
>> +{
>> +	int cpu;
>> +	struct cpumask cluster_mask, old_mask;
>> +
>> +	cpumask_clear(&cluster_mask);
>> +	cpumask_clear(&old_mask);
>> +
>> +	get_online_cpus();
>> +	for_each_online_cpu(cpu) {
>> +		smp_call_function_single(cpu, parse_cpumerrsr, edev_ctl, 0);
>> +		cpumask_copy(&cluster_mask, topology_core_cpumask(cpu));
>> +		if (cpumask_equal(&cluster_mask, &old_mask))
>> +			continue;
>> +		cpumask_copy(&old_mask, &cluster_mask);
>> +		smp_call_function_any(&cluster_mask, parse_l2merrsr,
>> +				      edev_ctl, 0);
>> +	}
>> +	put_online_cpus();
>> +}
>> +
>> +static int cortex_arm64_edac_probe(struct platform_device *pdev)
>> +{
>> +	int rc;
>> +	struct cortex_arm64_edac *drv;
>> +	struct device *dev = &pdev->dev;
>> +
>> +	drv = devm_kzalloc(dev, sizeof(*drv), GFP_KERNEL);
>> +	if (!drv)
>> +		return -ENOMEM;
>> +
>> +	drv->edac_ctl = edac_device_alloc_ctl_info(0, "cpu",
>> +						   num_possible_cpus(), "L", 2,
>> +						   1, NULL, 0,
>> +						   edac_device_alloc_index());
>> +	if (IS_ERR(drv->edac_ctl))
>> +		return -ENOMEM;
>> +
>> +	drv->edac_ctl->poll_msec = poll_msec;
>> +	drv->edac_ctl->edac_check = arm64_monitor_cache_errors;
>> +	drv->edac_ctl->dev = dev;
>> +	drv->edac_ctl->mod_name = dev_name(dev);
>> +	drv->edac_ctl->dev_name = dev_name(dev);
>> +	drv->edac_ctl->ctl_name = "cpu_err";
>> +	drv->edac_ctl->panic_on_ue = 1;
>> +	platform_set_drvdata(pdev, drv);
>> +
>> +	rc = edac_device_add_device(drv->edac_ctl);
>> +	if (rc)
>> +		goto edac_alloc_failed;
>> +
>> +	return 0;
>> +
>> +edac_alloc_failed:
>> +	edac_device_free_ctl_info(drv->edac_ctl);
>> +	return rc;
>> +}
>> +
>> +static int cortex_arm64_edac_remove(struct platform_device *pdev)
>> +{
>> +	struct cortex_arm64_edac *drv = dev_get_drvdata(&pdev->dev);
>> +	struct edac_device_ctl_info *edac_ctl = drv->edac_ctl;
>> +
>> +	edac_device_del_device(edac_ctl->dev);
>> +	edac_device_free_ctl_info(edac_ctl);
>> +
>> +	return 0;
>> +}
>> +
>> +static const struct of_device_id cortex_arm64_edac_of_match[] = {
>> +	{ .compatible = "arm,armv8-edac" },
>> +	{},
>> +};
>> +MODULE_DEVICE_TABLE(of, cortex_arm64_edac_of_match);
>> +
>> +static struct platform_driver cortex_arm64_edac_driver = {
>> +	.probe = cortex_arm64_edac_probe,
>> +	.remove = cortex_arm64_edac_remove,
>> +	.driver = {
>> +		.name = "arm64-edac",
>> +		.owner = THIS_MODULE,
>> +		.of_match_table = cortex_arm64_edac_of_match,
> 
> So you would need to get rid of that without a DT binding. Not sure if
> there is a precedence for a driver loaded by an MIDR match already (PMU
> comes to mind)?
> 
> Please note that would mean that the driver always loads on any Cortex
> CPU, I guess this is not always desirable (thinking about phones here).
> I am not sure if module blacklisting is a solution or we would need to
> come up with some clever way of potentially disabling some platforms -
> which admittedly cries for a DT binding again ;-)
> 
Yes, you read my mind :) See my comment above on DT binding.

> Cheers,
> Andre.
> 
>> +	},
>> +};
>> +
>> +static int __init cortex_arm64_edac_init(void)
>> +{
>> +	int rc;
>> +
>> +	/* Only POLL mode is supported so far */
>> +	edac_op_state = EDAC_OPSTATE_POLL;
>> +
>> +	rc = platform_driver_register(&cortex_arm64_edac_driver);
>> +	if (rc) {
>> +		edac_printk(KERN_ERR, EDAC_MOD_STR, "failed to register\n");
>> +		return rc;
>> +	}
>> +
>> +	return 0;
>> +}
>> +module_init(cortex_arm64_edac_init);
>> +
>> +static void __exit cortex_arm64_edac_exit(void)
>> +{
>> +	platform_driver_unregister(&cortex_arm64_edac_driver);
>> +}
>> +module_exit(cortex_arm64_edac_exit);
>> +
>> +MODULE_LICENSE("GPL");
>> +MODULE_AUTHOR("Brijesh Singh <brijeshkumar.singh@amd.com>");
>> +MODULE_DESCRIPTION("Cortex A57 and A53 EDAC driver");
>> +module_param(poll_msec, int, 0444);
>> +MODULE_PARM_DESC(poll_msec, "EDAC monitor poll interval in msec");
>>
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-22 14:46     ` Brijesh Singh
  0 siblings, 0 replies; 34+ messages in thread
From: Brijesh Singh @ 2015-10-22 14:46 UTC (permalink / raw)
  To: Andre Przywara, linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r
  Cc: brijeshkumar.singh-5C7GfCeVMHo, robh+dt-DgEjT+Ai2ygdnm+yROfE0A,
	pawel.moll-5wv7dgnIgG8, mark.rutland-5wv7dgnIgG8,
	ijc+devicetree-KcIKpvwj1kUDXYZnReoRVg,
	galak-sgV2jX0FEOL9JmXXK+q4OQ,
	dougthompson-aS9lmoZGLiVWk0Htik3J/w, bp-Gina5bIWoIWzQB+pC5nmwQ,
	mchehab-JPH+aEBZ4P+UEJcrhfAQsw,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	guohanjun-hv44wF8Li93QT0dZR+AlfA, arnd-r2nGTMty4D4,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-edac-u79uwXL29TY76Z2rM5mHXA

Hi Andre,

On 10/21/2015 06:52 PM, Andre Przywara wrote:
> On 21/10/15 21:41, Brijesh Singh wrote:
>> Add support for Cortex A57 and A53 EDAC driver.
> 
> Hi Brijesh,
> 
> thanks for the quick update! Some comments below.
> 
>>
>> Signed-off-by: Brijesh Singh <brijeshkumar.singh-5C7GfCeVMHo@public.gmane.org>
>> CC: robh+dt-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
>> CC: pawel.moll-5wv7dgnIgG8@public.gmane.org
>> CC: mark.rutland-5wv7dgnIgG8@public.gmane.org
>> CC: ijc+devicetree-KcIKpvwj1kUDXYZnReoRVg@public.gmane.org
>> CC: galak-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org
>> CC: dougthompson-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org
>> CC: bp-Gina5bIWoIWzQB+pC5nmwQ@public.gmane.org
>> CC: mchehab-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org
>> CC: devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> CC: guohanjun-hv44wF8Li93QT0dZR+AlfA@public.gmane.org
>> CC: andre.przywara-5wv7dgnIgG8@public.gmane.org
>> CC: arnd-r2nGTMty4D4@public.gmane.org
>> CC: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> CC: linux-edac-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> ---
>>
>> v2:
>> * convert into generic arm64 edac driver
>> * remove AMD specific references from dt binding
>> * remove poll_msec property from dt binding
>> * add poll_msec as a module param, default is 100ms
>> * update copyright text
>> * define macro mnemonics for L1 and L2 RAMID
>> * check L2 error per-cluster instead of per core
>> * update function names
>> * use get_online_cpus() and put_online_cpus() to make L1 and L2 register 
>>   read hotplug-safe
>> * add error check in probe routine
>>
>>  .../devicetree/bindings/edac/armv8-edac.txt        |  15 +
>>  drivers/edac/Kconfig                               |   6 +
>>  drivers/edac/Makefile                              |   1 +
>>  drivers/edac/cortex_arm64_edac.c                   | 457 +++++++++++++++++++++
>>  4 files changed, 479 insertions(+)
>>  create mode 100644 Documentation/devicetree/bindings/edac/armv8-edac.txt
>>  create mode 100644 drivers/edac/cortex_arm64_edac.c
>>
>> diff --git a/Documentation/devicetree/bindings/edac/armv8-edac.txt b/Documentation/devicetree/bindings/edac/armv8-edac.txt
>> new file mode 100644
>> index 0000000..dfd128f
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/edac/armv8-edac.txt
>> @@ -0,0 +1,15 @@
>> +* ARMv8 L1/L2 cache error reporting
>> +
>> +On ARMv8, CPU Memory Error Syndrome Register and L2 Memory Error Syndrome
>> +Register can be used for checking L1 and L2 memory errors.
>> +
>> +The following section describes the ARMv8 EDAC DT node binding.
>> +
>> +Required properties:
>> +- compatible: Should be "arm,armv8-edac"
>> +
>> +Example:
>> +	edac {
>> +		compatible = "arm,armv8-edac";
>> +	};
>> +
> 
> So if there is nothing in here, why do we need the DT binding at all (I
> think Mark hinted at that already)?
> Can't we just use the MIDR as already suggested by others?
> Secondly, armv8-edac is wrong here, as this feature is ARM-Cortex
> specific and not architectural.
> 
Yes, I was going with Mark suggestion to remove DT binding but then came across these cases which kind of hinted to keep DT binding:

* Without DT binding, the driver will always be loaded on arm64 unless its blacklisted.
* Its possible that other SoC's might handle single-bit and double-bit errors differently compare to 
  Seattle platform. In Seattle platform both errors are handled by firmware but if other SoC 
  wants OS to handle these errors then they might need DT binding to provide the irq numbers etc.

But if recommendation is to remove DT binding then I can remove it in next version. 

>> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
>> index ef25000..dd7c195 100644
>> --- a/drivers/edac/Kconfig
>> +++ b/drivers/edac/Kconfig
>> @@ -390,4 +390,10 @@ config EDAC_XGENE
>>  	  Support for error detection and correction on the
>>  	  APM X-Gene family of SOCs.
>>  
>> +config EDAC_CORTEX_ARM64
>> +	tristate "ARM Cortex A57/A53"
>> +	depends on EDAC_MM_EDAC && ARM64
>> +	help
>> +	  Support for error detection and correction on the
>> +	  ARM Cortex A57 and A53.
>>  endif # EDAC
>> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
>> index ae3c5f3..ac01660 100644
>> --- a/drivers/edac/Makefile
>> +++ b/drivers/edac/Makefile
>> @@ -68,3 +68,4 @@ obj-$(CONFIG_EDAC_OCTEON_PCI)		+= octeon_edac-pci.o
>>  obj-$(CONFIG_EDAC_ALTERA_MC)		+= altera_edac.o
>>  obj-$(CONFIG_EDAC_SYNOPSYS)		+= synopsys_edac.o
>>  obj-$(CONFIG_EDAC_XGENE)		+= xgene_edac.o
>> +obj-$(CONFIG_EDAC_CORTEX_ARM64)		+= cortex_arm64_edac.o
>> diff --git a/drivers/edac/cortex_arm64_edac.c b/drivers/edac/cortex_arm64_edac.c
>> new file mode 100644
>> index 0000000..c37bb94
>> --- /dev/null
>> +++ b/drivers/edac/cortex_arm64_edac.c
>> @@ -0,0 +1,457 @@
>> +/*
>> + * Cortex ARM64 EDAC
>> + *
>> + * Copyright (c) 2015, Advanced Micro Devices
>> + * Author: Brijesh Singh <brijeshkumar.singh-5C7GfCeVMHo@public.gmane.org>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + */
>> +
>> +#include <linux/module.h>
>> +#include <linux/of_device.h>
>> +#include <linux/platform_device.h>
>> +
>> +#include "edac_core.h"
>> +
>> +#define EDAC_MOD_STR             "cortex_arm64_edac"
>> +
>> +#define A57_CPUMERRSR_EL1_INDEX(x)   ((x) & 0x1ffff)
>> +#define A57_CPUMERRSR_EL1_BANK(x)    (((x) >> 18) & 0x1f)
>> +#define A57_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
>> +#define A57_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
>> +#define A57_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0x7f)
>> +#define A57_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
>> +#define A57_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
>> +#define A57_L1_I_TAG_RAM	     0x00
>> +#define A57_L1_I_DATA_RAM	     0x01
>> +#define A57_L1_D_TAG_RAM	     0x08
>> +#define A57_L1_D_DATA_RAM	     0x09
>> +#define A57_L1_TLB_RAM		     0x18
>> +
>> +#define A57_L2MERRSR_EL1_INDEX(x)    ((x) & 0x1ffff)
>> +#define A57_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0xf)
>> +#define A57_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
>> +#define A57_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
>> +#define A57_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
>> +#define A57_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
>> +#define A57_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
>> +#define A57_L2_TAG_RAM		     0x10
>> +#define A57_L2_DATA_RAM		     0x11
>> +#define A57_L2_SNOOP_TAG_RAM	     0x12
>> +#define A57_L2_DIRTY_RAM	     0x14
>> +#define A57_L2_INCLUSION_PF_RAM      0x18
>> +
>> +#define A53_CPUMERRSR_EL1_ADDR(x)    ((x) & 0xfff)
>> +#define A53_CPUMERRSR_EL1_CPUID(x)   (((x) >> 18) & 0x07)
>> +#define A53_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
>> +#define A53_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
>> +#define A53_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0xff)
>> +#define A53_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
>> +#define A53_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
>> +#define A53_L1_I_TAG_RAM	     0x00
>> +#define A53_L1_I_DATA_RAM	     0x01
>> +#define A53_L1_D_TAG_RAM	     0x08
>> +#define A53_L1_D_DATA_RAM	     0x09
>> +#define A53_L1_D_DIRT_RAM	     0x0A
>> +#define A53_L1_TLB_RAM		     0x18
>> +
>> +#define A53_L2MERRSR_EL1_INDEX(x)    (((x) >> 3) & 0x3fff)
>> +#define A53_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0x0f)
>> +#define A53_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
>> +#define A53_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
>> +#define A53_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
>> +#define A53_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
>> +#define A53_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
>> +#define A53_L2_TAG_RAM		     0x10
>> +#define A53_L2_DATA_RAM		     0x11
>> +#define A53_L2_SNOOP_RAM	     0x12
> 
> I guess you can get rid of the A53/A57 prefix for most of the
> definitions - given they are identical.
> Just keep it around for the differing bits (but not as a prefix).
> 
Ok
>> +
>> +#define L1_CACHE		     0
>> +#define L2_CACHE		     1
>> +
>> +int poll_msec = 100;
>> +
>> +struct cortex_arm64_edac {
>> +	struct edac_device_ctl_info *edac_ctl;
>> +};
>> +
>> +static inline u64 read_cpumerrsr_el1(void)
>> +{
>> +	u64 val;
>> +
>> +	asm volatile("mrs %0, s3_1_c15_c2_2" : "=r" (val));
>> +	return val;
>> +}
>> +
>> +static inline void write_cpumerrsr_el1(u64 val)
>> +{
>> +	asm volatile("msr s3_1_c15_c2_2, %0" :: "r" (val));
>> +}
>> +
>> +static inline u64 read_l2merrsr_el1(void)
>> +{
>> +	u64 val;
>> +
>> +	asm volatile("mrs %0, s3_1_c15_c2_3" : "=r" (val));
>> +	return val;
>> +}
>> +
>> +static inline void write_l2merrsr_el1(u64 val)
>> +{
>> +	asm volatile("msr s3_1_c15_c2_3, %0" :: "r" (val));
>> +}
>> +
>> +static void a53_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
>> +{
>> +	int fatal;
>> +	int repeat_err, other_err;
>> +	u64 val = read_l2merrsr_el1();
>> +
>> +	if (!A53_L2MERRSR_EL1_VALID(val))
>> +		return;
>> +
>> +	fatal = A53_L2MERRSR_EL1_FATAL(val);
>> +	repeat_err = A53_L2MERRSR_EL1_REPEAT(val);
>> +	other_err = A53_L2MERRSR_EL1_OTHER(val);
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
>> +		    "A53 CPU%d L2 %s error detected!\n", smp_processor_id(),
>> +		    fatal ? "fatal" : "non-fatal");
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
>> +
>> +	switch (A53_L2MERRSR_EL1_RAMID(val)) {
>> +	case A53_L2_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
>> +		break;
>> +	case A53_L2_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
>> +		break;
>> +	case A53_L2_SNOOP_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop filter RAM\n");
>> +		break;
>> +	default:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
>> +		break;
>> +	}
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
>> +		    repeat_err);
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
>> +		    other_err);
>> +	if (fatal)
>> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
>> +				      edac_ctl->name);
>> +	else
>> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
>> +				      edac_ctl->name);
>> +	write_l2merrsr_el1(0);
>> +}
>> +
>> +static void a57_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
>> +{
> 
> That looks like copied for most of the code, so it cries for unification.
> The semantics is almost identical, though the TRM describes it in a
> slighly different way sometimes (for CPUID/way for instance).

Yes RAMID description was different between A57 and A53. Will try to unify

>> +	int fatal;
>> +	int repeat_err, other_err;
>> +	u64 val = read_l2merrsr_el1();
>> +
>> +	if (!A57_L2MERRSR_EL1_VALID(val))
>> +		return;
>> +
>> +	fatal = A57_L2MERRSR_EL1_FATAL(val);
>> +	repeat_err = A57_L2MERRSR_EL1_REPEAT(val);
>> +	other_err = A57_L2MERRSR_EL1_OTHER(val);
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
>> +		    "A57 CPU%d L2 %s error detected!\n", smp_processor_id(),
>> +		    fatal ? "fatal" : "non-fatal");
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
>> +
>> +	switch (A57_L2MERRSR_EL1_RAMID(val)) {
>> +	case A57_L2_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
>> +		break;
>> +	case A57_L2_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
>> +		break;
>> +	case A57_L2_SNOOP_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop tag RAM\n");
>> +		break;
>> +	case A57_L2_DIRTY_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Dirty RAM\n");
>> +		break;
>> +	case A57_L2_INCLUSION_PF_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 inclusion PF RAM\n");
>> +		break;
>> +	default:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
>> +		break;
>> +	}
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
>> +		    repeat_err);
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
>> +		    other_err);
>> +	if (fatal)
>> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
>> +				      edac_ctl->name);
>> +	else
>> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
>> +				      edac_ctl->name);
>> +	write_l2merrsr_el1(0);
>> +}
>> +
>> +static void a57_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
>> +{
>> +	int fatal;
>> +	int repeat_err, other_err;
>> +	u64 val = read_cpumerrsr_el1();
>> +
>> +	if (!A57_CPUMERRSR_EL1_VALID(val))
>> +		return;
>> +
>> +	fatal = A57_CPUMERRSR_EL1_FATAL(val);
>> +	repeat_err = A57_CPUMERRSR_EL1_REPEAT(val);
>> +	other_err = A57_CPUMERRSR_EL1_OTHER(val);
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
>> +		    "CPU%d L1 %s error detected!\n", smp_processor_id(),
>> +		    fatal ? "fatal" : "non-fatal");
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
>> +
>> +	switch (A57_CPUMERRSR_EL1_RAMID(val)) {
>> +	case A57_L1_I_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
>> +		break;
>> +	case A57_L1_I_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
>> +		break;
>> +	case A57_L1_D_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
>> +		break;
>> +	case A57_L1_D_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
>> +		break;
>> +	case A57_L1_TLB_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
>> +		break;
>> +	default:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
>> +		break;
>> +	}
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
>> +		    repeat_err);
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
>> +		    other_err);
>> +
>> +	if (fatal)
>> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
>> +				      edac_ctl->name);
>> +	else
>> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
>> +				      edac_ctl->name);
>> +	write_cpumerrsr_el1(0);
>> +}
>> +
>> +static void a53_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
>> +{
> 
> Same as above. Please unify to have one function.
> 
>> +	int fatal;
>> +	int repeat_err, other_err;
>> +	u64 val = read_cpumerrsr_el1();
>> +
>> +	if (!A53_CPUMERRSR_EL1_VALID(val))
>> +		return;
>> +
>> +	fatal = A53_CPUMERRSR_EL1_FATAL(val);
>> +	repeat_err = A53_CPUMERRSR_EL1_REPEAT(val);
>> +	other_err = A53_CPUMERRSR_EL1_OTHER(val);
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
>> +		    "A53 CPU%d L1 %s error detected!\n", smp_processor_id(),
>> +		    fatal ? "fatal" : "non-fatal");
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
>> +
>> +	switch (A53_CPUMERRSR_EL1_RAMID(val)) {
>> +	case A53_L1_I_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
>> +		break;
>> +	case A53_L1_I_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
>> +		break;
>> +	case A53_L1_D_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
>> +		break;
>> +	case A53_L1_D_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
>> +		break;
>> +	case A53_L1_TLB_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
>> +		break;
>> +	default:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
>> +		break;
>> +	}
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
>> +		    repeat_err);
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
>> +		    other_err);
>> +
>> +	if (fatal)
>> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
>> +				      edac_ctl->name);
>> +	else
>> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
>> +				      edac_ctl->name);
>> +	write_cpumerrsr_el1(0);
>> +}
>> +
>> +static void parse_cpumerrsr(void *args)
>> +{
>> +	struct edac_device_ctl_info *edac_ctl = args;
>> +	int partnum = read_cpuid_part_number();
>> +
>> +	switch (partnum) {
>> +	case ARM_CPU_PART_CORTEX_A57:
> 
> You should have that distinction in the above (unified) function just
> for the differing bits. If that looks too ugly, hide it in a macro.
> 
>> +		a57_parse_cpumerrsr(edac_ctl);
>> +		break;
>> +	case ARM_CPU_PART_CORTEX_A53:
>> +		a53_parse_cpumerrsr(edac_ctl);
>> +		break;
>> +	}
>> +}
>> +
>> +static void parse_l2merrsr(void *args)
>> +{
>> +	struct edac_device_ctl_info *edac_ctl = args;
>> +	int partnum = read_cpuid_part_number();
>> +
>> +	switch (partnum) {
>> +	case ARM_CPU_PART_CORTEX_A57:
>> +		a57_parse_l2merrsr(edac_ctl);
>> +		break;
>> +	case ARM_CPU_PART_CORTEX_A53:
>> +		a53_parse_l2merrsr(edac_ctl);
>> +		break;
>> +	}
>> +}
>> +
>> +static void arm64_monitor_cache_errors(struct edac_device_ctl_info *edev_ctl)
>> +{
>> +	int cpu;
>> +	struct cpumask cluster_mask, old_mask;
>> +
>> +	cpumask_clear(&cluster_mask);
>> +	cpumask_clear(&old_mask);
>> +
>> +	get_online_cpus();
>> +	for_each_online_cpu(cpu) {
>> +		smp_call_function_single(cpu, parse_cpumerrsr, edev_ctl, 0);
>> +		cpumask_copy(&cluster_mask, topology_core_cpumask(cpu));
>> +		if (cpumask_equal(&cluster_mask, &old_mask))
>> +			continue;
>> +		cpumask_copy(&old_mask, &cluster_mask);
>> +		smp_call_function_any(&cluster_mask, parse_l2merrsr,
>> +				      edev_ctl, 0);
>> +	}
>> +	put_online_cpus();
>> +}
>> +
>> +static int cortex_arm64_edac_probe(struct platform_device *pdev)
>> +{
>> +	int rc;
>> +	struct cortex_arm64_edac *drv;
>> +	struct device *dev = &pdev->dev;
>> +
>> +	drv = devm_kzalloc(dev, sizeof(*drv), GFP_KERNEL);
>> +	if (!drv)
>> +		return -ENOMEM;
>> +
>> +	drv->edac_ctl = edac_device_alloc_ctl_info(0, "cpu",
>> +						   num_possible_cpus(), "L", 2,
>> +						   1, NULL, 0,
>> +						   edac_device_alloc_index());
>> +	if (IS_ERR(drv->edac_ctl))
>> +		return -ENOMEM;
>> +
>> +	drv->edac_ctl->poll_msec = poll_msec;
>> +	drv->edac_ctl->edac_check = arm64_monitor_cache_errors;
>> +	drv->edac_ctl->dev = dev;
>> +	drv->edac_ctl->mod_name = dev_name(dev);
>> +	drv->edac_ctl->dev_name = dev_name(dev);
>> +	drv->edac_ctl->ctl_name = "cpu_err";
>> +	drv->edac_ctl->panic_on_ue = 1;
>> +	platform_set_drvdata(pdev, drv);
>> +
>> +	rc = edac_device_add_device(drv->edac_ctl);
>> +	if (rc)
>> +		goto edac_alloc_failed;
>> +
>> +	return 0;
>> +
>> +edac_alloc_failed:
>> +	edac_device_free_ctl_info(drv->edac_ctl);
>> +	return rc;
>> +}
>> +
>> +static int cortex_arm64_edac_remove(struct platform_device *pdev)
>> +{
>> +	struct cortex_arm64_edac *drv = dev_get_drvdata(&pdev->dev);
>> +	struct edac_device_ctl_info *edac_ctl = drv->edac_ctl;
>> +
>> +	edac_device_del_device(edac_ctl->dev);
>> +	edac_device_free_ctl_info(edac_ctl);
>> +
>> +	return 0;
>> +}
>> +
>> +static const struct of_device_id cortex_arm64_edac_of_match[] = {
>> +	{ .compatible = "arm,armv8-edac" },
>> +	{},
>> +};
>> +MODULE_DEVICE_TABLE(of, cortex_arm64_edac_of_match);
>> +
>> +static struct platform_driver cortex_arm64_edac_driver = {
>> +	.probe = cortex_arm64_edac_probe,
>> +	.remove = cortex_arm64_edac_remove,
>> +	.driver = {
>> +		.name = "arm64-edac",
>> +		.owner = THIS_MODULE,
>> +		.of_match_table = cortex_arm64_edac_of_match,
> 
> So you would need to get rid of that without a DT binding. Not sure if
> there is a precedence for a driver loaded by an MIDR match already (PMU
> comes to mind)?
> 
> Please note that would mean that the driver always loads on any Cortex
> CPU, I guess this is not always desirable (thinking about phones here).
> I am not sure if module blacklisting is a solution or we would need to
> come up with some clever way of potentially disabling some platforms -
> which admittedly cries for a DT binding again ;-)
> 
Yes, you read my mind :) See my comment above on DT binding.

> Cheers,
> Andre.
> 
>> +	},
>> +};
>> +
>> +static int __init cortex_arm64_edac_init(void)
>> +{
>> +	int rc;
>> +
>> +	/* Only POLL mode is supported so far */
>> +	edac_op_state = EDAC_OPSTATE_POLL;
>> +
>> +	rc = platform_driver_register(&cortex_arm64_edac_driver);
>> +	if (rc) {
>> +		edac_printk(KERN_ERR, EDAC_MOD_STR, "failed to register\n");
>> +		return rc;
>> +	}
>> +
>> +	return 0;
>> +}
>> +module_init(cortex_arm64_edac_init);
>> +
>> +static void __exit cortex_arm64_edac_exit(void)
>> +{
>> +	platform_driver_unregister(&cortex_arm64_edac_driver);
>> +}
>> +module_exit(cortex_arm64_edac_exit);
>> +
>> +MODULE_LICENSE("GPL");
>> +MODULE_AUTHOR("Brijesh Singh <brijeshkumar.singh-5C7GfCeVMHo@public.gmane.org>");
>> +MODULE_DESCRIPTION("Cortex A57 and A53 EDAC driver");
>> +module_param(poll_msec, int, 0444);
>> +MODULE_PARM_DESC(poll_msec, "EDAC monitor poll interval in msec");
>>
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-22 14:46     ` Brijesh Singh
  0 siblings, 0 replies; 34+ messages in thread
From: Brijesh Singh @ 2015-10-22 14:46 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Andre,

On 10/21/2015 06:52 PM, Andre Przywara wrote:
> On 21/10/15 21:41, Brijesh Singh wrote:
>> Add support for Cortex A57 and A53 EDAC driver.
> 
> Hi Brijesh,
> 
> thanks for the quick update! Some comments below.
> 
>>
>> Signed-off-by: Brijesh Singh <brijeshkumar.singh@amd.com>
>> CC: robh+dt at kernel.org
>> CC: pawel.moll at arm.com
>> CC: mark.rutland at arm.com
>> CC: ijc+devicetree at hellion.org.uk
>> CC: galak at codeaurora.org
>> CC: dougthompson at xmission.com
>> CC: bp at alien8.de
>> CC: mchehab at osg.samsung.com
>> CC: devicetree at vger.kernel.org
>> CC: guohanjun at huawei.com
>> CC: andre.przywara at arm.com
>> CC: arnd at arndb.de
>> CC: linux-kernel at vger.kernel.org
>> CC: linux-edac at vger.kernel.org
>> ---
>>
>> v2:
>> * convert into generic arm64 edac driver
>> * remove AMD specific references from dt binding
>> * remove poll_msec property from dt binding
>> * add poll_msec as a module param, default is 100ms
>> * update copyright text
>> * define macro mnemonics for L1 and L2 RAMID
>> * check L2 error per-cluster instead of per core
>> * update function names
>> * use get_online_cpus() and put_online_cpus() to make L1 and L2 register 
>>   read hotplug-safe
>> * add error check in probe routine
>>
>>  .../devicetree/bindings/edac/armv8-edac.txt        |  15 +
>>  drivers/edac/Kconfig                               |   6 +
>>  drivers/edac/Makefile                              |   1 +
>>  drivers/edac/cortex_arm64_edac.c                   | 457 +++++++++++++++++++++
>>  4 files changed, 479 insertions(+)
>>  create mode 100644 Documentation/devicetree/bindings/edac/armv8-edac.txt
>>  create mode 100644 drivers/edac/cortex_arm64_edac.c
>>
>> diff --git a/Documentation/devicetree/bindings/edac/armv8-edac.txt b/Documentation/devicetree/bindings/edac/armv8-edac.txt
>> new file mode 100644
>> index 0000000..dfd128f
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/edac/armv8-edac.txt
>> @@ -0,0 +1,15 @@
>> +* ARMv8 L1/L2 cache error reporting
>> +
>> +On ARMv8, CPU Memory Error Syndrome Register and L2 Memory Error Syndrome
>> +Register can be used for checking L1 and L2 memory errors.
>> +
>> +The following section describes the ARMv8 EDAC DT node binding.
>> +
>> +Required properties:
>> +- compatible: Should be "arm,armv8-edac"
>> +
>> +Example:
>> +	edac {
>> +		compatible = "arm,armv8-edac";
>> +	};
>> +
> 
> So if there is nothing in here, why do we need the DT binding at all (I
> think Mark hinted at that already)?
> Can't we just use the MIDR as already suggested by others?
> Secondly, armv8-edac is wrong here, as this feature is ARM-Cortex
> specific and not architectural.
> 
Yes, I was going with Mark suggestion to remove DT binding but then came across these cases which kind of hinted to keep DT binding:

* Without DT binding, the driver will always be loaded on arm64 unless its blacklisted.
* Its possible that other SoC's might handle single-bit and double-bit errors differently compare to 
  Seattle platform. In Seattle platform both errors are handled by firmware but if other SoC 
  wants OS to handle these errors then they might need DT binding to provide the irq numbers etc.

But if recommendation is to remove DT binding then I can remove it in next version. 

>> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
>> index ef25000..dd7c195 100644
>> --- a/drivers/edac/Kconfig
>> +++ b/drivers/edac/Kconfig
>> @@ -390,4 +390,10 @@ config EDAC_XGENE
>>  	  Support for error detection and correction on the
>>  	  APM X-Gene family of SOCs.
>>  
>> +config EDAC_CORTEX_ARM64
>> +	tristate "ARM Cortex A57/A53"
>> +	depends on EDAC_MM_EDAC && ARM64
>> +	help
>> +	  Support for error detection and correction on the
>> +	  ARM Cortex A57 and A53.
>>  endif # EDAC
>> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
>> index ae3c5f3..ac01660 100644
>> --- a/drivers/edac/Makefile
>> +++ b/drivers/edac/Makefile
>> @@ -68,3 +68,4 @@ obj-$(CONFIG_EDAC_OCTEON_PCI)		+= octeon_edac-pci.o
>>  obj-$(CONFIG_EDAC_ALTERA_MC)		+= altera_edac.o
>>  obj-$(CONFIG_EDAC_SYNOPSYS)		+= synopsys_edac.o
>>  obj-$(CONFIG_EDAC_XGENE)		+= xgene_edac.o
>> +obj-$(CONFIG_EDAC_CORTEX_ARM64)		+= cortex_arm64_edac.o
>> diff --git a/drivers/edac/cortex_arm64_edac.c b/drivers/edac/cortex_arm64_edac.c
>> new file mode 100644
>> index 0000000..c37bb94
>> --- /dev/null
>> +++ b/drivers/edac/cortex_arm64_edac.c
>> @@ -0,0 +1,457 @@
>> +/*
>> + * Cortex ARM64 EDAC
>> + *
>> + * Copyright (c) 2015, Advanced Micro Devices
>> + * Author: Brijesh Singh <brijeshkumar.singh@amd.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + */
>> +
>> +#include <linux/module.h>
>> +#include <linux/of_device.h>
>> +#include <linux/platform_device.h>
>> +
>> +#include "edac_core.h"
>> +
>> +#define EDAC_MOD_STR             "cortex_arm64_edac"
>> +
>> +#define A57_CPUMERRSR_EL1_INDEX(x)   ((x) & 0x1ffff)
>> +#define A57_CPUMERRSR_EL1_BANK(x)    (((x) >> 18) & 0x1f)
>> +#define A57_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
>> +#define A57_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
>> +#define A57_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0x7f)
>> +#define A57_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
>> +#define A57_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
>> +#define A57_L1_I_TAG_RAM	     0x00
>> +#define A57_L1_I_DATA_RAM	     0x01
>> +#define A57_L1_D_TAG_RAM	     0x08
>> +#define A57_L1_D_DATA_RAM	     0x09
>> +#define A57_L1_TLB_RAM		     0x18
>> +
>> +#define A57_L2MERRSR_EL1_INDEX(x)    ((x) & 0x1ffff)
>> +#define A57_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0xf)
>> +#define A57_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
>> +#define A57_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
>> +#define A57_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
>> +#define A57_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
>> +#define A57_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
>> +#define A57_L2_TAG_RAM		     0x10
>> +#define A57_L2_DATA_RAM		     0x11
>> +#define A57_L2_SNOOP_TAG_RAM	     0x12
>> +#define A57_L2_DIRTY_RAM	     0x14
>> +#define A57_L2_INCLUSION_PF_RAM      0x18
>> +
>> +#define A53_CPUMERRSR_EL1_ADDR(x)    ((x) & 0xfff)
>> +#define A53_CPUMERRSR_EL1_CPUID(x)   (((x) >> 18) & 0x07)
>> +#define A53_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
>> +#define A53_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
>> +#define A53_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0xff)
>> +#define A53_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
>> +#define A53_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
>> +#define A53_L1_I_TAG_RAM	     0x00
>> +#define A53_L1_I_DATA_RAM	     0x01
>> +#define A53_L1_D_TAG_RAM	     0x08
>> +#define A53_L1_D_DATA_RAM	     0x09
>> +#define A53_L1_D_DIRT_RAM	     0x0A
>> +#define A53_L1_TLB_RAM		     0x18
>> +
>> +#define A53_L2MERRSR_EL1_INDEX(x)    (((x) >> 3) & 0x3fff)
>> +#define A53_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0x0f)
>> +#define A53_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
>> +#define A53_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
>> +#define A53_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
>> +#define A53_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
>> +#define A53_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
>> +#define A53_L2_TAG_RAM		     0x10
>> +#define A53_L2_DATA_RAM		     0x11
>> +#define A53_L2_SNOOP_RAM	     0x12
> 
> I guess you can get rid of the A53/A57 prefix for most of the
> definitions - given they are identical.
> Just keep it around for the differing bits (but not as a prefix).
> 
Ok
>> +
>> +#define L1_CACHE		     0
>> +#define L2_CACHE		     1
>> +
>> +int poll_msec = 100;
>> +
>> +struct cortex_arm64_edac {
>> +	struct edac_device_ctl_info *edac_ctl;
>> +};
>> +
>> +static inline u64 read_cpumerrsr_el1(void)
>> +{
>> +	u64 val;
>> +
>> +	asm volatile("mrs %0, s3_1_c15_c2_2" : "=r" (val));
>> +	return val;
>> +}
>> +
>> +static inline void write_cpumerrsr_el1(u64 val)
>> +{
>> +	asm volatile("msr s3_1_c15_c2_2, %0" :: "r" (val));
>> +}
>> +
>> +static inline u64 read_l2merrsr_el1(void)
>> +{
>> +	u64 val;
>> +
>> +	asm volatile("mrs %0, s3_1_c15_c2_3" : "=r" (val));
>> +	return val;
>> +}
>> +
>> +static inline void write_l2merrsr_el1(u64 val)
>> +{
>> +	asm volatile("msr s3_1_c15_c2_3, %0" :: "r" (val));
>> +}
>> +
>> +static void a53_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
>> +{
>> +	int fatal;
>> +	int repeat_err, other_err;
>> +	u64 val = read_l2merrsr_el1();
>> +
>> +	if (!A53_L2MERRSR_EL1_VALID(val))
>> +		return;
>> +
>> +	fatal = A53_L2MERRSR_EL1_FATAL(val);
>> +	repeat_err = A53_L2MERRSR_EL1_REPEAT(val);
>> +	other_err = A53_L2MERRSR_EL1_OTHER(val);
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
>> +		    "A53 CPU%d L2 %s error detected!\n", smp_processor_id(),
>> +		    fatal ? "fatal" : "non-fatal");
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
>> +
>> +	switch (A53_L2MERRSR_EL1_RAMID(val)) {
>> +	case A53_L2_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
>> +		break;
>> +	case A53_L2_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
>> +		break;
>> +	case A53_L2_SNOOP_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop filter RAM\n");
>> +		break;
>> +	default:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
>> +		break;
>> +	}
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
>> +		    repeat_err);
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
>> +		    other_err);
>> +	if (fatal)
>> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
>> +				      edac_ctl->name);
>> +	else
>> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
>> +				      edac_ctl->name);
>> +	write_l2merrsr_el1(0);
>> +}
>> +
>> +static void a57_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
>> +{
> 
> That looks like copied for most of the code, so it cries for unification.
> The semantics is almost identical, though the TRM describes it in a
> slighly different way sometimes (for CPUID/way for instance).

Yes RAMID description was different between A57 and A53. Will try to unify

>> +	int fatal;
>> +	int repeat_err, other_err;
>> +	u64 val = read_l2merrsr_el1();
>> +
>> +	if (!A57_L2MERRSR_EL1_VALID(val))
>> +		return;
>> +
>> +	fatal = A57_L2MERRSR_EL1_FATAL(val);
>> +	repeat_err = A57_L2MERRSR_EL1_REPEAT(val);
>> +	other_err = A57_L2MERRSR_EL1_OTHER(val);
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
>> +		    "A57 CPU%d L2 %s error detected!\n", smp_processor_id(),
>> +		    fatal ? "fatal" : "non-fatal");
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
>> +
>> +	switch (A57_L2MERRSR_EL1_RAMID(val)) {
>> +	case A57_L2_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
>> +		break;
>> +	case A57_L2_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
>> +		break;
>> +	case A57_L2_SNOOP_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop tag RAM\n");
>> +		break;
>> +	case A57_L2_DIRTY_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Dirty RAM\n");
>> +		break;
>> +	case A57_L2_INCLUSION_PF_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 inclusion PF RAM\n");
>> +		break;
>> +	default:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
>> +		break;
>> +	}
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
>> +		    repeat_err);
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
>> +		    other_err);
>> +	if (fatal)
>> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
>> +				      edac_ctl->name);
>> +	else
>> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
>> +				      edac_ctl->name);
>> +	write_l2merrsr_el1(0);
>> +}
>> +
>> +static void a57_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
>> +{
>> +	int fatal;
>> +	int repeat_err, other_err;
>> +	u64 val = read_cpumerrsr_el1();
>> +
>> +	if (!A57_CPUMERRSR_EL1_VALID(val))
>> +		return;
>> +
>> +	fatal = A57_CPUMERRSR_EL1_FATAL(val);
>> +	repeat_err = A57_CPUMERRSR_EL1_REPEAT(val);
>> +	other_err = A57_CPUMERRSR_EL1_OTHER(val);
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
>> +		    "CPU%d L1 %s error detected!\n", smp_processor_id(),
>> +		    fatal ? "fatal" : "non-fatal");
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
>> +
>> +	switch (A57_CPUMERRSR_EL1_RAMID(val)) {
>> +	case A57_L1_I_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
>> +		break;
>> +	case A57_L1_I_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
>> +		break;
>> +	case A57_L1_D_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
>> +		break;
>> +	case A57_L1_D_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
>> +		break;
>> +	case A57_L1_TLB_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
>> +		break;
>> +	default:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
>> +		break;
>> +	}
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
>> +		    repeat_err);
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
>> +		    other_err);
>> +
>> +	if (fatal)
>> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
>> +				      edac_ctl->name);
>> +	else
>> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
>> +				      edac_ctl->name);
>> +	write_cpumerrsr_el1(0);
>> +}
>> +
>> +static void a53_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
>> +{
> 
> Same as above. Please unify to have one function.
> 
>> +	int fatal;
>> +	int repeat_err, other_err;
>> +	u64 val = read_cpumerrsr_el1();
>> +
>> +	if (!A53_CPUMERRSR_EL1_VALID(val))
>> +		return;
>> +
>> +	fatal = A53_CPUMERRSR_EL1_FATAL(val);
>> +	repeat_err = A53_CPUMERRSR_EL1_REPEAT(val);
>> +	other_err = A53_CPUMERRSR_EL1_OTHER(val);
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
>> +		    "A53 CPU%d L1 %s error detected!\n", smp_processor_id(),
>> +		    fatal ? "fatal" : "non-fatal");
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
>> +
>> +	switch (A53_CPUMERRSR_EL1_RAMID(val)) {
>> +	case A53_L1_I_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
>> +		break;
>> +	case A53_L1_I_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
>> +		break;
>> +	case A53_L1_D_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
>> +		break;
>> +	case A53_L1_D_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
>> +		break;
>> +	case A53_L1_TLB_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
>> +		break;
>> +	default:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
>> +		break;
>> +	}
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
>> +		    repeat_err);
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
>> +		    other_err);
>> +
>> +	if (fatal)
>> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
>> +				      edac_ctl->name);
>> +	else
>> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
>> +				      edac_ctl->name);
>> +	write_cpumerrsr_el1(0);
>> +}
>> +
>> +static void parse_cpumerrsr(void *args)
>> +{
>> +	struct edac_device_ctl_info *edac_ctl = args;
>> +	int partnum = read_cpuid_part_number();
>> +
>> +	switch (partnum) {
>> +	case ARM_CPU_PART_CORTEX_A57:
> 
> You should have that distinction in the above (unified) function just
> for the differing bits. If that looks too ugly, hide it in a macro.
> 
>> +		a57_parse_cpumerrsr(edac_ctl);
>> +		break;
>> +	case ARM_CPU_PART_CORTEX_A53:
>> +		a53_parse_cpumerrsr(edac_ctl);
>> +		break;
>> +	}
>> +}
>> +
>> +static void parse_l2merrsr(void *args)
>> +{
>> +	struct edac_device_ctl_info *edac_ctl = args;
>> +	int partnum = read_cpuid_part_number();
>> +
>> +	switch (partnum) {
>> +	case ARM_CPU_PART_CORTEX_A57:
>> +		a57_parse_l2merrsr(edac_ctl);
>> +		break;
>> +	case ARM_CPU_PART_CORTEX_A53:
>> +		a53_parse_l2merrsr(edac_ctl);
>> +		break;
>> +	}
>> +}
>> +
>> +static void arm64_monitor_cache_errors(struct edac_device_ctl_info *edev_ctl)
>> +{
>> +	int cpu;
>> +	struct cpumask cluster_mask, old_mask;
>> +
>> +	cpumask_clear(&cluster_mask);
>> +	cpumask_clear(&old_mask);
>> +
>> +	get_online_cpus();
>> +	for_each_online_cpu(cpu) {
>> +		smp_call_function_single(cpu, parse_cpumerrsr, edev_ctl, 0);
>> +		cpumask_copy(&cluster_mask, topology_core_cpumask(cpu));
>> +		if (cpumask_equal(&cluster_mask, &old_mask))
>> +			continue;
>> +		cpumask_copy(&old_mask, &cluster_mask);
>> +		smp_call_function_any(&cluster_mask, parse_l2merrsr,
>> +				      edev_ctl, 0);
>> +	}
>> +	put_online_cpus();
>> +}
>> +
>> +static int cortex_arm64_edac_probe(struct platform_device *pdev)
>> +{
>> +	int rc;
>> +	struct cortex_arm64_edac *drv;
>> +	struct device *dev = &pdev->dev;
>> +
>> +	drv = devm_kzalloc(dev, sizeof(*drv), GFP_KERNEL);
>> +	if (!drv)
>> +		return -ENOMEM;
>> +
>> +	drv->edac_ctl = edac_device_alloc_ctl_info(0, "cpu",
>> +						   num_possible_cpus(), "L", 2,
>> +						   1, NULL, 0,
>> +						   edac_device_alloc_index());
>> +	if (IS_ERR(drv->edac_ctl))
>> +		return -ENOMEM;
>> +
>> +	drv->edac_ctl->poll_msec = poll_msec;
>> +	drv->edac_ctl->edac_check = arm64_monitor_cache_errors;
>> +	drv->edac_ctl->dev = dev;
>> +	drv->edac_ctl->mod_name = dev_name(dev);
>> +	drv->edac_ctl->dev_name = dev_name(dev);
>> +	drv->edac_ctl->ctl_name = "cpu_err";
>> +	drv->edac_ctl->panic_on_ue = 1;
>> +	platform_set_drvdata(pdev, drv);
>> +
>> +	rc = edac_device_add_device(drv->edac_ctl);
>> +	if (rc)
>> +		goto edac_alloc_failed;
>> +
>> +	return 0;
>> +
>> +edac_alloc_failed:
>> +	edac_device_free_ctl_info(drv->edac_ctl);
>> +	return rc;
>> +}
>> +
>> +static int cortex_arm64_edac_remove(struct platform_device *pdev)
>> +{
>> +	struct cortex_arm64_edac *drv = dev_get_drvdata(&pdev->dev);
>> +	struct edac_device_ctl_info *edac_ctl = drv->edac_ctl;
>> +
>> +	edac_device_del_device(edac_ctl->dev);
>> +	edac_device_free_ctl_info(edac_ctl);
>> +
>> +	return 0;
>> +}
>> +
>> +static const struct of_device_id cortex_arm64_edac_of_match[] = {
>> +	{ .compatible = "arm,armv8-edac" },
>> +	{},
>> +};
>> +MODULE_DEVICE_TABLE(of, cortex_arm64_edac_of_match);
>> +
>> +static struct platform_driver cortex_arm64_edac_driver = {
>> +	.probe = cortex_arm64_edac_probe,
>> +	.remove = cortex_arm64_edac_remove,
>> +	.driver = {
>> +		.name = "arm64-edac",
>> +		.owner = THIS_MODULE,
>> +		.of_match_table = cortex_arm64_edac_of_match,
> 
> So you would need to get rid of that without a DT binding. Not sure if
> there is a precedence for a driver loaded by an MIDR match already (PMU
> comes to mind)?
> 
> Please note that would mean that the driver always loads on any Cortex
> CPU, I guess this is not always desirable (thinking about phones here).
> I am not sure if module blacklisting is a solution or we would need to
> come up with some clever way of potentially disabling some platforms -
> which admittedly cries for a DT binding again ;-)
> 
Yes, you read my mind :) See my comment above on DT binding.

> Cheers,
> Andre.
> 
>> +	},
>> +};
>> +
>> +static int __init cortex_arm64_edac_init(void)
>> +{
>> +	int rc;
>> +
>> +	/* Only POLL mode is supported so far */
>> +	edac_op_state = EDAC_OPSTATE_POLL;
>> +
>> +	rc = platform_driver_register(&cortex_arm64_edac_driver);
>> +	if (rc) {
>> +		edac_printk(KERN_ERR, EDAC_MOD_STR, "failed to register\n");
>> +		return rc;
>> +	}
>> +
>> +	return 0;
>> +}
>> +module_init(cortex_arm64_edac_init);
>> +
>> +static void __exit cortex_arm64_edac_exit(void)
>> +{
>> +	platform_driver_unregister(&cortex_arm64_edac_driver);
>> +}
>> +module_exit(cortex_arm64_edac_exit);
>> +
>> +MODULE_LICENSE("GPL");
>> +MODULE_AUTHOR("Brijesh Singh <brijeshkumar.singh@amd.com>");
>> +MODULE_DESCRIPTION("Cortex A57 and A53 EDAC driver");
>> +module_param(poll_msec, int, 0444);
>> +MODULE_PARM_DESC(poll_msec, "EDAC monitor poll interval in msec");
>>
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-22 18:47     ` Brijesh Singh
  0 siblings, 0 replies; 34+ messages in thread
From: Brijesh Singh @ 2015-10-22 18:47 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: brijeshkumar.singh, linux-arm-kernel, robh+dt, pawel.moll,
	mark.rutland, ijc+devicetree, galak, dougthompson, bp,
	devicetree, guohanjun, andre.przywara, arnd, linux-kernel,
	linux-edac

Hi Mauro,

On 10/21/2015 04:25 PM, Mauro Carvalho Chehab wrote:
> Em Wed, 21 Oct 2015 15:41:37 -0500
> Brijesh Singh <brijeshkumar.singh@amd.com> escreveu:
> 
>> Add support for Cortex A57 and A53 EDAC driver.
>>
>> Signed-off-by: Brijesh Singh <brijeshkumar.singh@amd.com>
>> CC: robh+dt@kernel.org
>> CC: pawel.moll@arm.com
>> CC: mark.rutland@arm.com
>> CC: ijc+devicetree@hellion.org.uk
>> CC: galak@codeaurora.org
>> CC: dougthompson@xmission.com
>> CC: bp@alien8.de
>> CC: mchehab@osg.samsung.com
>> CC: devicetree@vger.kernel.org
>> CC: guohanjun@huawei.com
>> CC: andre.przywara@arm.com
>> CC: arnd@arndb.de
>> CC: linux-kernel@vger.kernel.org
>> CC: linux-edac@vger.kernel.org
>> ---
>>
>> v2:
>> * convert into generic arm64 edac driver
>> * remove AMD specific references from dt binding
>> * remove poll_msec property from dt binding
>> * add poll_msec as a module param, default is 100ms
>> * update copyright text
>> * define macro mnemonics for L1 and L2 RAMID
>> * check L2 error per-cluster instead of per core
>> * update function names
>> * use get_online_cpus() and put_online_cpus() to make L1 and L2 register 
>>   read hotplug-safe
>> * add error check in probe routine
>>
>>  .../devicetree/bindings/edac/armv8-edac.txt        |  15 +
>>  drivers/edac/Kconfig                               |   6 +
>>  drivers/edac/Makefile                              |   1 +
>>  drivers/edac/cortex_arm64_edac.c                   | 457 +++++++++++++++++++++
>>  4 files changed, 479 insertions(+)
>>  create mode 100644 Documentation/devicetree/bindings/edac/armv8-edac.txt
>>  create mode 100644 drivers/edac/cortex_arm64_edac.c
>>
>> diff --git a/Documentation/devicetree/bindings/edac/armv8-edac.txt b/Documentation/devicetree/bindings/edac/armv8-edac.txt
>> new file mode 100644
>> index 0000000..dfd128f
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/edac/armv8-edac.txt
>> @@ -0,0 +1,15 @@
>> +* ARMv8 L1/L2 cache error reporting
>> +
>> +On ARMv8, CPU Memory Error Syndrome Register and L2 Memory Error Syndrome
>> +Register can be used for checking L1 and L2 memory errors.
>> +
>> +The following section describes the ARMv8 EDAC DT node binding.
>> +
>> +Required properties:
>> +- compatible: Should be "arm,armv8-edac"
>> +
>> +Example:
>> +	edac {
>> +		compatible = "arm,armv8-edac";
>> +	};
>> +
>> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
>> index ef25000..dd7c195 100644
>> --- a/drivers/edac/Kconfig
>> +++ b/drivers/edac/Kconfig
>> @@ -390,4 +390,10 @@ config EDAC_XGENE
>>  	  Support for error detection and correction on the
>>  	  APM X-Gene family of SOCs.
>>  
>> +config EDAC_CORTEX_ARM64
>> +	tristate "ARM Cortex A57/A53"
>> +	depends on EDAC_MM_EDAC && ARM64
> 
> It would be good to be able to compile it on non-ARM64 archs
> if COMPILE_TEST, e. g.:
> 
> 	depends on EDAC_MM_EDAC && (ARM64 || COMPILE_TEST)
> 
> That would allow testing tools like Coverity to test it. As far as
> I know, the public license we use only works on x86.
> 
>> +	help
>> +	  Support for error detection and correction on the
>> +	  ARM Cortex A57 and A53.
>>  endif # EDAC
>> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
>> index ae3c5f3..ac01660 100644
>> --- a/drivers/edac/Makefile
>> +++ b/drivers/edac/Makefile
>> @@ -68,3 +68,4 @@ obj-$(CONFIG_EDAC_OCTEON_PCI)		+= octeon_edac-pci.o
>>  obj-$(CONFIG_EDAC_ALTERA_MC)		+= altera_edac.o
>>  obj-$(CONFIG_EDAC_SYNOPSYS)		+= synopsys_edac.o
>>  obj-$(CONFIG_EDAC_XGENE)		+= xgene_edac.o
>> +obj-$(CONFIG_EDAC_CORTEX_ARM64)		+= cortex_arm64_edac.o
>> diff --git a/drivers/edac/cortex_arm64_edac.c b/drivers/edac/cortex_arm64_edac.c
>> new file mode 100644
>> index 0000000..c37bb94
>> --- /dev/null
>> +++ b/drivers/edac/cortex_arm64_edac.c
>> @@ -0,0 +1,457 @@
>> +/*
>> + * Cortex ARM64 EDAC
>> + *
>> + * Copyright (c) 2015, Advanced Micro Devices
>> + * Author: Brijesh Singh <brijeshkumar.singh@amd.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + */
>> +
>> +#include <linux/module.h>
>> +#include <linux/of_device.h>
>> +#include <linux/platform_device.h>
>> +
>> +#include "edac_core.h"
>> +
>> +#define EDAC_MOD_STR             "cortex_arm64_edac"
>> +
>> +#define A57_CPUMERRSR_EL1_INDEX(x)   ((x) & 0x1ffff)
>> +#define A57_CPUMERRSR_EL1_BANK(x)    (((x) >> 18) & 0x1f)
>> +#define A57_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
>> +#define A57_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
>> +#define A57_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0x7f)
>> +#define A57_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
>> +#define A57_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
>> +#define A57_L1_I_TAG_RAM	     0x00
>> +#define A57_L1_I_DATA_RAM	     0x01
>> +#define A57_L1_D_TAG_RAM	     0x08
>> +#define A57_L1_D_DATA_RAM	     0x09
>> +#define A57_L1_TLB_RAM		     0x18
>> +
>> +#define A57_L2MERRSR_EL1_INDEX(x)    ((x) & 0x1ffff)
>> +#define A57_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0xf)
>> +#define A57_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
>> +#define A57_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
>> +#define A57_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
>> +#define A57_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
>> +#define A57_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
>> +#define A57_L2_TAG_RAM		     0x10
>> +#define A57_L2_DATA_RAM		     0x11
>> +#define A57_L2_SNOOP_TAG_RAM	     0x12
>> +#define A57_L2_DIRTY_RAM	     0x14
>> +#define A57_L2_INCLUSION_PF_RAM      0x18
>> +
>> +#define A53_CPUMERRSR_EL1_ADDR(x)    ((x) & 0xfff)
>> +#define A53_CPUMERRSR_EL1_CPUID(x)   (((x) >> 18) & 0x07)
>> +#define A53_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
>> +#define A53_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
>> +#define A53_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0xff)
>> +#define A53_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
>> +#define A53_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
>> +#define A53_L1_I_TAG_RAM	     0x00
>> +#define A53_L1_I_DATA_RAM	     0x01
>> +#define A53_L1_D_TAG_RAM	     0x08
>> +#define A53_L1_D_DATA_RAM	     0x09
>> +#define A53_L1_D_DIRT_RAM	     0x0A
>> +#define A53_L1_TLB_RAM		     0x18
>> +
>> +#define A53_L2MERRSR_EL1_INDEX(x)    (((x) >> 3) & 0x3fff)
>> +#define A53_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0x0f)
>> +#define A53_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
>> +#define A53_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
>> +#define A53_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
>> +#define A53_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
>> +#define A53_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
>> +#define A53_L2_TAG_RAM		     0x10
>> +#define A53_L2_DATA_RAM		     0x11
>> +#define A53_L2_SNOOP_RAM	     0x12
>> +
>> +#define L1_CACHE		     0
>> +#define L2_CACHE		     1
>> +
>> +int poll_msec = 100;
>> +
>> +struct cortex_arm64_edac {
>> +	struct edac_device_ctl_info *edac_ctl;
>> +};
>> +
>> +static inline u64 read_cpumerrsr_el1(void)
>> +{
>> +	u64 val;
>> +
>> +	asm volatile("mrs %0, s3_1_c15_c2_2" : "=r" (val));
>> +	return val;
>> +}
>> +
>> +static inline void write_cpumerrsr_el1(u64 val)
>> +{
>> +	asm volatile("msr s3_1_c15_c2_2, %0" :: "r" (val));
>> +}
>> +
>> +static inline u64 read_l2merrsr_el1(void)
>> +{
>> +	u64 val;
>> +
>> +	asm volatile("mrs %0, s3_1_c15_c2_3" : "=r" (val));
>> +	return val;
>> +}
>> +
>> +static inline void write_l2merrsr_el1(u64 val)
>> +{
>> +	asm volatile("msr s3_1_c15_c2_3, %0" :: "r" (val));
>> +}
>> +
> 
> If we're willing to compile with COMPILE_TEST, we'll need to provide
> some stubs for the above functions that won't use asm.
> 
>> +static void a53_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
>> +{
>> +	int fatal;
>> +	int repeat_err, other_err;
>> +	u64 val = read_l2merrsr_el1();
>> +
>> +	if (!A53_L2MERRSR_EL1_VALID(val))
>> +		return;
>> +
>> +	fatal = A53_L2MERRSR_EL1_FATAL(val);
>> +	repeat_err = A53_L2MERRSR_EL1_REPEAT(val);
>> +	other_err = A53_L2MERRSR_EL1_OTHER(val);
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
>> +		    "A53 CPU%d L2 %s error detected!\n", smp_processor_id(),
>> +		    fatal ? "fatal" : "non-fatal");
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
>> +
>> +	switch (A53_L2MERRSR_EL1_RAMID(val)) {
>> +	case A53_L2_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
>> +		break;
>> +	case A53_L2_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
>> +		break;
>> +	case A53_L2_SNOOP_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop filter RAM\n");
>> +		break;
>> +	default:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
>> +		break;
>> +	}
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
>> +		    repeat_err);
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
>> +		    other_err);
>> +	if (fatal)
>> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
>> +				      edac_ctl->name);
>> +	else
>> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
>> +				      edac_ctl->name);
>> +	write_l2merrsr_el1(0);
>> +}
>> +
>> +static void a57_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
>> +{
>> +	int fatal;
>> +	int repeat_err, other_err;
>> +	u64 val = read_l2merrsr_el1();
>> +
>> +	if (!A57_L2MERRSR_EL1_VALID(val))
>> +		return;
>> +
>> +	fatal = A57_L2MERRSR_EL1_FATAL(val);
>> +	repeat_err = A57_L2MERRSR_EL1_REPEAT(val);
>> +	other_err = A57_L2MERRSR_EL1_OTHER(val);
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
>> +		    "A57 CPU%d L2 %s error detected!\n", smp_processor_id(),
>> +		    fatal ? "fatal" : "non-fatal");
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
>> +
>> +	switch (A57_L2MERRSR_EL1_RAMID(val)) {
>> +	case A57_L2_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
>> +		break;
>> +	case A57_L2_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
>> +		break;
>> +	case A57_L2_SNOOP_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop tag RAM\n");
>> +		break;
>> +	case A57_L2_DIRTY_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Dirty RAM\n");
>> +		break;
>> +	case A57_L2_INCLUSION_PF_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 inclusion PF RAM\n");
>> +		break;
>> +	default:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
>> +		break;
>> +	}
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
>> +		    repeat_err);
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
>> +		    other_err);
>> +	if (fatal)
>> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
>> +				      edac_ctl->name);
>> +	else
>> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
>> +				      edac_ctl->name);
>> +	write_l2merrsr_el1(0);
>> +}
>> +
>> +static void a57_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
>> +{
>> +	int fatal;
>> +	int repeat_err, other_err;
>> +	u64 val = read_cpumerrsr_el1();
>> +
>> +	if (!A57_CPUMERRSR_EL1_VALID(val))
>> +		return;
>> +
>> +	fatal = A57_CPUMERRSR_EL1_FATAL(val);
>> +	repeat_err = A57_CPUMERRSR_EL1_REPEAT(val);
>> +	other_err = A57_CPUMERRSR_EL1_OTHER(val);
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
>> +		    "CPU%d L1 %s error detected!\n", smp_processor_id(),
>> +		    fatal ? "fatal" : "non-fatal");
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
>> +
>> +	switch (A57_CPUMERRSR_EL1_RAMID(val)) {
>> +	case A57_L1_I_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
>> +		break;
>> +	case A57_L1_I_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
>> +		break;
>> +	case A57_L1_D_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
>> +		break;
>> +	case A57_L1_D_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
>> +		break;
>> +	case A57_L1_TLB_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
>> +		break;
>> +	default:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
>> +		break;
>> +	}
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
>> +		    repeat_err);
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
>> +		    other_err);
>> +
>> +	if (fatal)
>> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
>> +				      edac_ctl->name);
>> +	else
>> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
>> +				      edac_ctl->name);
>> +	write_cpumerrsr_el1(0);
>> +}
>> +
>> +static void a53_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
>> +{
>> +	int fatal;
>> +	int repeat_err, other_err;
>> +	u64 val = read_cpumerrsr_el1();
>> +
>> +	if (!A53_CPUMERRSR_EL1_VALID(val))
>> +		return;
>> +
>> +	fatal = A53_CPUMERRSR_EL1_FATAL(val);
>> +	repeat_err = A53_CPUMERRSR_EL1_REPEAT(val);
>> +	other_err = A53_CPUMERRSR_EL1_OTHER(val);
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
>> +		    "A53 CPU%d L1 %s error detected!\n", smp_processor_id(),
>> +		    fatal ? "fatal" : "non-fatal");
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
>> +
>> +	switch (A53_CPUMERRSR_EL1_RAMID(val)) {
>> +	case A53_L1_I_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
>> +		break;
>> +	case A53_L1_I_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
>> +		break;
>> +	case A53_L1_D_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
>> +		break;
>> +	case A53_L1_D_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
>> +		break;
>> +	case A53_L1_TLB_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
>> +		break;
>> +	default:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
>> +		break;
>> +	}
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
>> +		    repeat_err);
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
>> +		    other_err);
>> +
>> +	if (fatal)
>> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
>> +				      edac_ctl->name);
>> +	else
>> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
>> +				      edac_ctl->name);
> 
> The above code doesn't look right to me. It should be, instead, calling
> one of the functions that output the errors also via trace or to call one
> of the trace functions directly (see the trace functions currently defined
> at  include/ras/ras_event.h).
> 
> Failing to do that would cause RAS tools (like rasdaemon) to not get
> the errors.
> 
Noted.

I will use trace_mc_event() to generate event but it seems that I still need to call
edac_device_handle_ce/ue () to log the error in sysfs files. Also in case of UE I noticed
that edac_device_handle_ue() takes care of causing panic (the expected behavior).

So is it okay to use both trace event as well as edac_device_handle_xx. Something like this

        if (L2MERRSR_EL1_FATAL(val)) {
                trace_mc_event(HW_EVENT_ERR_UNCORRECTED, "L2 fatal error", "",
                                repeat_err, 0, 0, 0, 0, index, 0, 0,
                                "cortex_arm64_edac");
                edac_device_handle_ue(edac_ctl, cpu, L2_CACHE, edac_ctl->name);
        } else {
                trace_mc_event(HW_EVENT_ERR_CORRECTED, "L2 non-fatal error",
                                "", repeat_err, 0, 0, 0, 0, index, 0, 0,
                                "cortex_arm64_edac");
                edac_device_handle_ce(edac_ctl, cpu, L2_CACHE, edac_ctl->name);
        }


>> +	write_cpumerrsr_el1(0);
>> +}
>> +
>> +static void parse_cpumerrsr(void *args)
>> +{
>> +	struct edac_device_ctl_info *edac_ctl = args;
>> +	int partnum = read_cpuid_part_number();
>> +
>> +	switch (partnum) {
>> +	case ARM_CPU_PART_CORTEX_A57:
>> +		a57_parse_cpumerrsr(edac_ctl);
>> +		break;
>> +	case ARM_CPU_PART_CORTEX_A53:
>> +		a53_parse_cpumerrsr(edac_ctl);
>> +		break;
>> +	}
>> +}
>> +
>> +static void parse_l2merrsr(void *args)
>> +{
>> +	struct edac_device_ctl_info *edac_ctl = args;
>> +	int partnum = read_cpuid_part_number();
>> +
>> +	switch (partnum) {
>> +	case ARM_CPU_PART_CORTEX_A57:
>> +		a57_parse_l2merrsr(edac_ctl);
>> +		break;
>> +	case ARM_CPU_PART_CORTEX_A53:
>> +		a53_parse_l2merrsr(edac_ctl);
>> +		break;
>> +	}
>> +}
>> +
>> +static void arm64_monitor_cache_errors(struct edac_device_ctl_info *edev_ctl)
>> +{
>> +	int cpu;
>> +	struct cpumask cluster_mask, old_mask;
>> +
>> +	cpumask_clear(&cluster_mask);
>> +	cpumask_clear(&old_mask);
>> +
>> +	get_online_cpus();
>> +	for_each_online_cpu(cpu) {
>> +		smp_call_function_single(cpu, parse_cpumerrsr, edev_ctl, 0);
>> +		cpumask_copy(&cluster_mask, topology_core_cpumask(cpu));
>> +		if (cpumask_equal(&cluster_mask, &old_mask))
>> +			continue;
>> +		cpumask_copy(&old_mask, &cluster_mask);
>> +		smp_call_function_any(&cluster_mask, parse_l2merrsr,
>> +				      edev_ctl, 0);
>> +	}
>> +	put_online_cpus();
>> +}
>> +
>> +static int cortex_arm64_edac_probe(struct platform_device *pdev)
>> +{
>> +	int rc;
>> +	struct cortex_arm64_edac *drv;
>> +	struct device *dev = &pdev->dev;
>> +
>> +	drv = devm_kzalloc(dev, sizeof(*drv), GFP_KERNEL);
>> +	if (!drv)
>> +		return -ENOMEM;
>> +
>> +	drv->edac_ctl = edac_device_alloc_ctl_info(0, "cpu",
>> +						   num_possible_cpus(), "L", 2,
>> +						   1, NULL, 0,
>> +						   edac_device_alloc_index());
>> +	if (IS_ERR(drv->edac_ctl))
>> +		return -ENOMEM;
>> +
>> +	drv->edac_ctl->poll_msec = poll_msec;
>> +	drv->edac_ctl->edac_check = arm64_monitor_cache_errors;
>> +	drv->edac_ctl->dev = dev;
>> +	drv->edac_ctl->mod_name = dev_name(dev);
>> +	drv->edac_ctl->dev_name = dev_name(dev);
>> +	drv->edac_ctl->ctl_name = "cpu_err";
>> +	drv->edac_ctl->panic_on_ue = 1;
>> +	platform_set_drvdata(pdev, drv);
>> +
>> +	rc = edac_device_add_device(drv->edac_ctl);
>> +	if (rc)
>> +		goto edac_alloc_failed;
>> +
>> +	return 0;
>> +
>> +edac_alloc_failed:
>> +	edac_device_free_ctl_info(drv->edac_ctl);
>> +	return rc;
>> +}
>> +
>> +static int cortex_arm64_edac_remove(struct platform_device *pdev)
>> +{
>> +	struct cortex_arm64_edac *drv = dev_get_drvdata(&pdev->dev);
>> +	struct edac_device_ctl_info *edac_ctl = drv->edac_ctl;
>> +
>> +	edac_device_del_device(edac_ctl->dev);
>> +	edac_device_free_ctl_info(edac_ctl);
>> +
>> +	return 0;
>> +}
>> +
>> +static const struct of_device_id cortex_arm64_edac_of_match[] = {
>> +	{ .compatible = "arm,armv8-edac" },
>> +	{},
>> +};
>> +MODULE_DEVICE_TABLE(of, cortex_arm64_edac_of_match);
>> +
>> +static struct platform_driver cortex_arm64_edac_driver = {
>> +	.probe = cortex_arm64_edac_probe,
>> +	.remove = cortex_arm64_edac_remove,
>> +	.driver = {
>> +		.name = "arm64-edac",
>> +		.owner = THIS_MODULE,
>> +		.of_match_table = cortex_arm64_edac_of_match,
>> +	},
>> +};
>> +
>> +static int __init cortex_arm64_edac_init(void)
>> +{
>> +	int rc;
>> +
>> +	/* Only POLL mode is supported so far */
>> +	edac_op_state = EDAC_OPSTATE_POLL;
>> +
>> +	rc = platform_driver_register(&cortex_arm64_edac_driver);
>> +	if (rc) {
>> +		edac_printk(KERN_ERR, EDAC_MOD_STR, "failed to register\n");
>> +		return rc;
>> +	}
>> +
>> +	return 0;
>> +}
>> +module_init(cortex_arm64_edac_init);
>> +
>> +static void __exit cortex_arm64_edac_exit(void)
>> +{
>> +	platform_driver_unregister(&cortex_arm64_edac_driver);
>> +}
>> +module_exit(cortex_arm64_edac_exit);
>> +
>> +MODULE_LICENSE("GPL");
>> +MODULE_AUTHOR("Brijesh Singh <brijeshkumar.singh@amd.com>");
>> +MODULE_DESCRIPTION("Cortex A57 and A53 EDAC driver");
>> +module_param(poll_msec, int, 0444);
>> +MODULE_PARM_DESC(poll_msec, "EDAC monitor poll interval in msec");

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-22 18:47     ` Brijesh Singh
  0 siblings, 0 replies; 34+ messages in thread
From: Brijesh Singh @ 2015-10-22 18:47 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: brijeshkumar.singh-5C7GfCeVMHo,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	robh+dt-DgEjT+Ai2ygdnm+yROfE0A, pawel.moll-5wv7dgnIgG8,
	mark.rutland-5wv7dgnIgG8, ijc+devicetree-KcIKpvwj1kUDXYZnReoRVg,
	galak-sgV2jX0FEOL9JmXXK+q4OQ,
	dougthompson-aS9lmoZGLiVWk0Htik3J/w, bp-Gina5bIWoIWzQB+pC5nmwQ,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	guohanjun-hv44wF8Li93QT0dZR+AlfA, andre.przywara-5wv7dgnIgG8,
	arnd-r2nGTMty4D4, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-edac-u79uwXL29TY76Z2rM5mHXA

Hi Mauro,

On 10/21/2015 04:25 PM, Mauro Carvalho Chehab wrote:
> Em Wed, 21 Oct 2015 15:41:37 -0500
> Brijesh Singh <brijeshkumar.singh-5C7GfCeVMHo@public.gmane.org> escreveu:
> 
>> Add support for Cortex A57 and A53 EDAC driver.
>>
>> Signed-off-by: Brijesh Singh <brijeshkumar.singh-5C7GfCeVMHo@public.gmane.org>
>> CC: robh+dt-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
>> CC: pawel.moll-5wv7dgnIgG8@public.gmane.org
>> CC: mark.rutland-5wv7dgnIgG8@public.gmane.org
>> CC: ijc+devicetree-KcIKpvwj1kUDXYZnReoRVg@public.gmane.org
>> CC: galak-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org
>> CC: dougthompson-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org
>> CC: bp-Gina5bIWoIWzQB+pC5nmwQ@public.gmane.org
>> CC: mchehab-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org
>> CC: devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> CC: guohanjun-hv44wF8Li93QT0dZR+AlfA@public.gmane.org
>> CC: andre.przywara-5wv7dgnIgG8@public.gmane.org
>> CC: arnd-r2nGTMty4D4@public.gmane.org
>> CC: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> CC: linux-edac-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> ---
>>
>> v2:
>> * convert into generic arm64 edac driver
>> * remove AMD specific references from dt binding
>> * remove poll_msec property from dt binding
>> * add poll_msec as a module param, default is 100ms
>> * update copyright text
>> * define macro mnemonics for L1 and L2 RAMID
>> * check L2 error per-cluster instead of per core
>> * update function names
>> * use get_online_cpus() and put_online_cpus() to make L1 and L2 register 
>>   read hotplug-safe
>> * add error check in probe routine
>>
>>  .../devicetree/bindings/edac/armv8-edac.txt        |  15 +
>>  drivers/edac/Kconfig                               |   6 +
>>  drivers/edac/Makefile                              |   1 +
>>  drivers/edac/cortex_arm64_edac.c                   | 457 +++++++++++++++++++++
>>  4 files changed, 479 insertions(+)
>>  create mode 100644 Documentation/devicetree/bindings/edac/armv8-edac.txt
>>  create mode 100644 drivers/edac/cortex_arm64_edac.c
>>
>> diff --git a/Documentation/devicetree/bindings/edac/armv8-edac.txt b/Documentation/devicetree/bindings/edac/armv8-edac.txt
>> new file mode 100644
>> index 0000000..dfd128f
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/edac/armv8-edac.txt
>> @@ -0,0 +1,15 @@
>> +* ARMv8 L1/L2 cache error reporting
>> +
>> +On ARMv8, CPU Memory Error Syndrome Register and L2 Memory Error Syndrome
>> +Register can be used for checking L1 and L2 memory errors.
>> +
>> +The following section describes the ARMv8 EDAC DT node binding.
>> +
>> +Required properties:
>> +- compatible: Should be "arm,armv8-edac"
>> +
>> +Example:
>> +	edac {
>> +		compatible = "arm,armv8-edac";
>> +	};
>> +
>> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
>> index ef25000..dd7c195 100644
>> --- a/drivers/edac/Kconfig
>> +++ b/drivers/edac/Kconfig
>> @@ -390,4 +390,10 @@ config EDAC_XGENE
>>  	  Support for error detection and correction on the
>>  	  APM X-Gene family of SOCs.
>>  
>> +config EDAC_CORTEX_ARM64
>> +	tristate "ARM Cortex A57/A53"
>> +	depends on EDAC_MM_EDAC && ARM64
> 
> It would be good to be able to compile it on non-ARM64 archs
> if COMPILE_TEST, e. g.:
> 
> 	depends on EDAC_MM_EDAC && (ARM64 || COMPILE_TEST)
> 
> That would allow testing tools like Coverity to test it. As far as
> I know, the public license we use only works on x86.
> 
>> +	help
>> +	  Support for error detection and correction on the
>> +	  ARM Cortex A57 and A53.
>>  endif # EDAC
>> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
>> index ae3c5f3..ac01660 100644
>> --- a/drivers/edac/Makefile
>> +++ b/drivers/edac/Makefile
>> @@ -68,3 +68,4 @@ obj-$(CONFIG_EDAC_OCTEON_PCI)		+= octeon_edac-pci.o
>>  obj-$(CONFIG_EDAC_ALTERA_MC)		+= altera_edac.o
>>  obj-$(CONFIG_EDAC_SYNOPSYS)		+= synopsys_edac.o
>>  obj-$(CONFIG_EDAC_XGENE)		+= xgene_edac.o
>> +obj-$(CONFIG_EDAC_CORTEX_ARM64)		+= cortex_arm64_edac.o
>> diff --git a/drivers/edac/cortex_arm64_edac.c b/drivers/edac/cortex_arm64_edac.c
>> new file mode 100644
>> index 0000000..c37bb94
>> --- /dev/null
>> +++ b/drivers/edac/cortex_arm64_edac.c
>> @@ -0,0 +1,457 @@
>> +/*
>> + * Cortex ARM64 EDAC
>> + *
>> + * Copyright (c) 2015, Advanced Micro Devices
>> + * Author: Brijesh Singh <brijeshkumar.singh-5C7GfCeVMHo@public.gmane.org>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + */
>> +
>> +#include <linux/module.h>
>> +#include <linux/of_device.h>
>> +#include <linux/platform_device.h>
>> +
>> +#include "edac_core.h"
>> +
>> +#define EDAC_MOD_STR             "cortex_arm64_edac"
>> +
>> +#define A57_CPUMERRSR_EL1_INDEX(x)   ((x) & 0x1ffff)
>> +#define A57_CPUMERRSR_EL1_BANK(x)    (((x) >> 18) & 0x1f)
>> +#define A57_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
>> +#define A57_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
>> +#define A57_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0x7f)
>> +#define A57_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
>> +#define A57_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
>> +#define A57_L1_I_TAG_RAM	     0x00
>> +#define A57_L1_I_DATA_RAM	     0x01
>> +#define A57_L1_D_TAG_RAM	     0x08
>> +#define A57_L1_D_DATA_RAM	     0x09
>> +#define A57_L1_TLB_RAM		     0x18
>> +
>> +#define A57_L2MERRSR_EL1_INDEX(x)    ((x) & 0x1ffff)
>> +#define A57_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0xf)
>> +#define A57_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
>> +#define A57_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
>> +#define A57_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
>> +#define A57_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
>> +#define A57_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
>> +#define A57_L2_TAG_RAM		     0x10
>> +#define A57_L2_DATA_RAM		     0x11
>> +#define A57_L2_SNOOP_TAG_RAM	     0x12
>> +#define A57_L2_DIRTY_RAM	     0x14
>> +#define A57_L2_INCLUSION_PF_RAM      0x18
>> +
>> +#define A53_CPUMERRSR_EL1_ADDR(x)    ((x) & 0xfff)
>> +#define A53_CPUMERRSR_EL1_CPUID(x)   (((x) >> 18) & 0x07)
>> +#define A53_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
>> +#define A53_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
>> +#define A53_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0xff)
>> +#define A53_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
>> +#define A53_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
>> +#define A53_L1_I_TAG_RAM	     0x00
>> +#define A53_L1_I_DATA_RAM	     0x01
>> +#define A53_L1_D_TAG_RAM	     0x08
>> +#define A53_L1_D_DATA_RAM	     0x09
>> +#define A53_L1_D_DIRT_RAM	     0x0A
>> +#define A53_L1_TLB_RAM		     0x18
>> +
>> +#define A53_L2MERRSR_EL1_INDEX(x)    (((x) >> 3) & 0x3fff)
>> +#define A53_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0x0f)
>> +#define A53_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
>> +#define A53_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
>> +#define A53_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
>> +#define A53_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
>> +#define A53_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
>> +#define A53_L2_TAG_RAM		     0x10
>> +#define A53_L2_DATA_RAM		     0x11
>> +#define A53_L2_SNOOP_RAM	     0x12
>> +
>> +#define L1_CACHE		     0
>> +#define L2_CACHE		     1
>> +
>> +int poll_msec = 100;
>> +
>> +struct cortex_arm64_edac {
>> +	struct edac_device_ctl_info *edac_ctl;
>> +};
>> +
>> +static inline u64 read_cpumerrsr_el1(void)
>> +{
>> +	u64 val;
>> +
>> +	asm volatile("mrs %0, s3_1_c15_c2_2" : "=r" (val));
>> +	return val;
>> +}
>> +
>> +static inline void write_cpumerrsr_el1(u64 val)
>> +{
>> +	asm volatile("msr s3_1_c15_c2_2, %0" :: "r" (val));
>> +}
>> +
>> +static inline u64 read_l2merrsr_el1(void)
>> +{
>> +	u64 val;
>> +
>> +	asm volatile("mrs %0, s3_1_c15_c2_3" : "=r" (val));
>> +	return val;
>> +}
>> +
>> +static inline void write_l2merrsr_el1(u64 val)
>> +{
>> +	asm volatile("msr s3_1_c15_c2_3, %0" :: "r" (val));
>> +}
>> +
> 
> If we're willing to compile with COMPILE_TEST, we'll need to provide
> some stubs for the above functions that won't use asm.
> 
>> +static void a53_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
>> +{
>> +	int fatal;
>> +	int repeat_err, other_err;
>> +	u64 val = read_l2merrsr_el1();
>> +
>> +	if (!A53_L2MERRSR_EL1_VALID(val))
>> +		return;
>> +
>> +	fatal = A53_L2MERRSR_EL1_FATAL(val);
>> +	repeat_err = A53_L2MERRSR_EL1_REPEAT(val);
>> +	other_err = A53_L2MERRSR_EL1_OTHER(val);
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
>> +		    "A53 CPU%d L2 %s error detected!\n", smp_processor_id(),
>> +		    fatal ? "fatal" : "non-fatal");
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
>> +
>> +	switch (A53_L2MERRSR_EL1_RAMID(val)) {
>> +	case A53_L2_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
>> +		break;
>> +	case A53_L2_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
>> +		break;
>> +	case A53_L2_SNOOP_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop filter RAM\n");
>> +		break;
>> +	default:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
>> +		break;
>> +	}
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
>> +		    repeat_err);
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
>> +		    other_err);
>> +	if (fatal)
>> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
>> +				      edac_ctl->name);
>> +	else
>> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
>> +				      edac_ctl->name);
>> +	write_l2merrsr_el1(0);
>> +}
>> +
>> +static void a57_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
>> +{
>> +	int fatal;
>> +	int repeat_err, other_err;
>> +	u64 val = read_l2merrsr_el1();
>> +
>> +	if (!A57_L2MERRSR_EL1_VALID(val))
>> +		return;
>> +
>> +	fatal = A57_L2MERRSR_EL1_FATAL(val);
>> +	repeat_err = A57_L2MERRSR_EL1_REPEAT(val);
>> +	other_err = A57_L2MERRSR_EL1_OTHER(val);
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
>> +		    "A57 CPU%d L2 %s error detected!\n", smp_processor_id(),
>> +		    fatal ? "fatal" : "non-fatal");
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
>> +
>> +	switch (A57_L2MERRSR_EL1_RAMID(val)) {
>> +	case A57_L2_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
>> +		break;
>> +	case A57_L2_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
>> +		break;
>> +	case A57_L2_SNOOP_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop tag RAM\n");
>> +		break;
>> +	case A57_L2_DIRTY_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Dirty RAM\n");
>> +		break;
>> +	case A57_L2_INCLUSION_PF_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 inclusion PF RAM\n");
>> +		break;
>> +	default:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
>> +		break;
>> +	}
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
>> +		    repeat_err);
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
>> +		    other_err);
>> +	if (fatal)
>> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
>> +				      edac_ctl->name);
>> +	else
>> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
>> +				      edac_ctl->name);
>> +	write_l2merrsr_el1(0);
>> +}
>> +
>> +static void a57_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
>> +{
>> +	int fatal;
>> +	int repeat_err, other_err;
>> +	u64 val = read_cpumerrsr_el1();
>> +
>> +	if (!A57_CPUMERRSR_EL1_VALID(val))
>> +		return;
>> +
>> +	fatal = A57_CPUMERRSR_EL1_FATAL(val);
>> +	repeat_err = A57_CPUMERRSR_EL1_REPEAT(val);
>> +	other_err = A57_CPUMERRSR_EL1_OTHER(val);
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
>> +		    "CPU%d L1 %s error detected!\n", smp_processor_id(),
>> +		    fatal ? "fatal" : "non-fatal");
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
>> +
>> +	switch (A57_CPUMERRSR_EL1_RAMID(val)) {
>> +	case A57_L1_I_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
>> +		break;
>> +	case A57_L1_I_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
>> +		break;
>> +	case A57_L1_D_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
>> +		break;
>> +	case A57_L1_D_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
>> +		break;
>> +	case A57_L1_TLB_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
>> +		break;
>> +	default:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
>> +		break;
>> +	}
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
>> +		    repeat_err);
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
>> +		    other_err);
>> +
>> +	if (fatal)
>> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
>> +				      edac_ctl->name);
>> +	else
>> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
>> +				      edac_ctl->name);
>> +	write_cpumerrsr_el1(0);
>> +}
>> +
>> +static void a53_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
>> +{
>> +	int fatal;
>> +	int repeat_err, other_err;
>> +	u64 val = read_cpumerrsr_el1();
>> +
>> +	if (!A53_CPUMERRSR_EL1_VALID(val))
>> +		return;
>> +
>> +	fatal = A53_CPUMERRSR_EL1_FATAL(val);
>> +	repeat_err = A53_CPUMERRSR_EL1_REPEAT(val);
>> +	other_err = A53_CPUMERRSR_EL1_OTHER(val);
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
>> +		    "A53 CPU%d L1 %s error detected!\n", smp_processor_id(),
>> +		    fatal ? "fatal" : "non-fatal");
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
>> +
>> +	switch (A53_CPUMERRSR_EL1_RAMID(val)) {
>> +	case A53_L1_I_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
>> +		break;
>> +	case A53_L1_I_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
>> +		break;
>> +	case A53_L1_D_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
>> +		break;
>> +	case A53_L1_D_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
>> +		break;
>> +	case A53_L1_TLB_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
>> +		break;
>> +	default:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
>> +		break;
>> +	}
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
>> +		    repeat_err);
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
>> +		    other_err);
>> +
>> +	if (fatal)
>> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
>> +				      edac_ctl->name);
>> +	else
>> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
>> +				      edac_ctl->name);
> 
> The above code doesn't look right to me. It should be, instead, calling
> one of the functions that output the errors also via trace or to call one
> of the trace functions directly (see the trace functions currently defined
> at  include/ras/ras_event.h).
> 
> Failing to do that would cause RAS tools (like rasdaemon) to not get
> the errors.
> 
Noted.

I will use trace_mc_event() to generate event but it seems that I still need to call
edac_device_handle_ce/ue () to log the error in sysfs files. Also in case of UE I noticed
that edac_device_handle_ue() takes care of causing panic (the expected behavior).

So is it okay to use both trace event as well as edac_device_handle_xx. Something like this

        if (L2MERRSR_EL1_FATAL(val)) {
                trace_mc_event(HW_EVENT_ERR_UNCORRECTED, "L2 fatal error", "",
                                repeat_err, 0, 0, 0, 0, index, 0, 0,
                                "cortex_arm64_edac");
                edac_device_handle_ue(edac_ctl, cpu, L2_CACHE, edac_ctl->name);
        } else {
                trace_mc_event(HW_EVENT_ERR_CORRECTED, "L2 non-fatal error",
                                "", repeat_err, 0, 0, 0, 0, index, 0, 0,
                                "cortex_arm64_edac");
                edac_device_handle_ce(edac_ctl, cpu, L2_CACHE, edac_ctl->name);
        }


>> +	write_cpumerrsr_el1(0);
>> +}
>> +
>> +static void parse_cpumerrsr(void *args)
>> +{
>> +	struct edac_device_ctl_info *edac_ctl = args;
>> +	int partnum = read_cpuid_part_number();
>> +
>> +	switch (partnum) {
>> +	case ARM_CPU_PART_CORTEX_A57:
>> +		a57_parse_cpumerrsr(edac_ctl);
>> +		break;
>> +	case ARM_CPU_PART_CORTEX_A53:
>> +		a53_parse_cpumerrsr(edac_ctl);
>> +		break;
>> +	}
>> +}
>> +
>> +static void parse_l2merrsr(void *args)
>> +{
>> +	struct edac_device_ctl_info *edac_ctl = args;
>> +	int partnum = read_cpuid_part_number();
>> +
>> +	switch (partnum) {
>> +	case ARM_CPU_PART_CORTEX_A57:
>> +		a57_parse_l2merrsr(edac_ctl);
>> +		break;
>> +	case ARM_CPU_PART_CORTEX_A53:
>> +		a53_parse_l2merrsr(edac_ctl);
>> +		break;
>> +	}
>> +}
>> +
>> +static void arm64_monitor_cache_errors(struct edac_device_ctl_info *edev_ctl)
>> +{
>> +	int cpu;
>> +	struct cpumask cluster_mask, old_mask;
>> +
>> +	cpumask_clear(&cluster_mask);
>> +	cpumask_clear(&old_mask);
>> +
>> +	get_online_cpus();
>> +	for_each_online_cpu(cpu) {
>> +		smp_call_function_single(cpu, parse_cpumerrsr, edev_ctl, 0);
>> +		cpumask_copy(&cluster_mask, topology_core_cpumask(cpu));
>> +		if (cpumask_equal(&cluster_mask, &old_mask))
>> +			continue;
>> +		cpumask_copy(&old_mask, &cluster_mask);
>> +		smp_call_function_any(&cluster_mask, parse_l2merrsr,
>> +				      edev_ctl, 0);
>> +	}
>> +	put_online_cpus();
>> +}
>> +
>> +static int cortex_arm64_edac_probe(struct platform_device *pdev)
>> +{
>> +	int rc;
>> +	struct cortex_arm64_edac *drv;
>> +	struct device *dev = &pdev->dev;
>> +
>> +	drv = devm_kzalloc(dev, sizeof(*drv), GFP_KERNEL);
>> +	if (!drv)
>> +		return -ENOMEM;
>> +
>> +	drv->edac_ctl = edac_device_alloc_ctl_info(0, "cpu",
>> +						   num_possible_cpus(), "L", 2,
>> +						   1, NULL, 0,
>> +						   edac_device_alloc_index());
>> +	if (IS_ERR(drv->edac_ctl))
>> +		return -ENOMEM;
>> +
>> +	drv->edac_ctl->poll_msec = poll_msec;
>> +	drv->edac_ctl->edac_check = arm64_monitor_cache_errors;
>> +	drv->edac_ctl->dev = dev;
>> +	drv->edac_ctl->mod_name = dev_name(dev);
>> +	drv->edac_ctl->dev_name = dev_name(dev);
>> +	drv->edac_ctl->ctl_name = "cpu_err";
>> +	drv->edac_ctl->panic_on_ue = 1;
>> +	platform_set_drvdata(pdev, drv);
>> +
>> +	rc = edac_device_add_device(drv->edac_ctl);
>> +	if (rc)
>> +		goto edac_alloc_failed;
>> +
>> +	return 0;
>> +
>> +edac_alloc_failed:
>> +	edac_device_free_ctl_info(drv->edac_ctl);
>> +	return rc;
>> +}
>> +
>> +static int cortex_arm64_edac_remove(struct platform_device *pdev)
>> +{
>> +	struct cortex_arm64_edac *drv = dev_get_drvdata(&pdev->dev);
>> +	struct edac_device_ctl_info *edac_ctl = drv->edac_ctl;
>> +
>> +	edac_device_del_device(edac_ctl->dev);
>> +	edac_device_free_ctl_info(edac_ctl);
>> +
>> +	return 0;
>> +}
>> +
>> +static const struct of_device_id cortex_arm64_edac_of_match[] = {
>> +	{ .compatible = "arm,armv8-edac" },
>> +	{},
>> +};
>> +MODULE_DEVICE_TABLE(of, cortex_arm64_edac_of_match);
>> +
>> +static struct platform_driver cortex_arm64_edac_driver = {
>> +	.probe = cortex_arm64_edac_probe,
>> +	.remove = cortex_arm64_edac_remove,
>> +	.driver = {
>> +		.name = "arm64-edac",
>> +		.owner = THIS_MODULE,
>> +		.of_match_table = cortex_arm64_edac_of_match,
>> +	},
>> +};
>> +
>> +static int __init cortex_arm64_edac_init(void)
>> +{
>> +	int rc;
>> +
>> +	/* Only POLL mode is supported so far */
>> +	edac_op_state = EDAC_OPSTATE_POLL;
>> +
>> +	rc = platform_driver_register(&cortex_arm64_edac_driver);
>> +	if (rc) {
>> +		edac_printk(KERN_ERR, EDAC_MOD_STR, "failed to register\n");
>> +		return rc;
>> +	}
>> +
>> +	return 0;
>> +}
>> +module_init(cortex_arm64_edac_init);
>> +
>> +static void __exit cortex_arm64_edac_exit(void)
>> +{
>> +	platform_driver_unregister(&cortex_arm64_edac_driver);
>> +}
>> +module_exit(cortex_arm64_edac_exit);
>> +
>> +MODULE_LICENSE("GPL");
>> +MODULE_AUTHOR("Brijesh Singh <brijeshkumar.singh-5C7GfCeVMHo@public.gmane.org>");
>> +MODULE_DESCRIPTION("Cortex A57 and A53 EDAC driver");
>> +module_param(poll_msec, int, 0444);
>> +MODULE_PARM_DESC(poll_msec, "EDAC monitor poll interval in msec");
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-22 18:47     ` Brijesh Singh
  0 siblings, 0 replies; 34+ messages in thread
From: Brijesh Singh @ 2015-10-22 18:47 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Mauro,

On 10/21/2015 04:25 PM, Mauro Carvalho Chehab wrote:
> Em Wed, 21 Oct 2015 15:41:37 -0500
> Brijesh Singh <brijeshkumar.singh@amd.com> escreveu:
> 
>> Add support for Cortex A57 and A53 EDAC driver.
>>
>> Signed-off-by: Brijesh Singh <brijeshkumar.singh@amd.com>
>> CC: robh+dt at kernel.org
>> CC: pawel.moll at arm.com
>> CC: mark.rutland at arm.com
>> CC: ijc+devicetree at hellion.org.uk
>> CC: galak at codeaurora.org
>> CC: dougthompson at xmission.com
>> CC: bp at alien8.de
>> CC: mchehab at osg.samsung.com
>> CC: devicetree at vger.kernel.org
>> CC: guohanjun at huawei.com
>> CC: andre.przywara at arm.com
>> CC: arnd at arndb.de
>> CC: linux-kernel at vger.kernel.org
>> CC: linux-edac at vger.kernel.org
>> ---
>>
>> v2:
>> * convert into generic arm64 edac driver
>> * remove AMD specific references from dt binding
>> * remove poll_msec property from dt binding
>> * add poll_msec as a module param, default is 100ms
>> * update copyright text
>> * define macro mnemonics for L1 and L2 RAMID
>> * check L2 error per-cluster instead of per core
>> * update function names
>> * use get_online_cpus() and put_online_cpus() to make L1 and L2 register 
>>   read hotplug-safe
>> * add error check in probe routine
>>
>>  .../devicetree/bindings/edac/armv8-edac.txt        |  15 +
>>  drivers/edac/Kconfig                               |   6 +
>>  drivers/edac/Makefile                              |   1 +
>>  drivers/edac/cortex_arm64_edac.c                   | 457 +++++++++++++++++++++
>>  4 files changed, 479 insertions(+)
>>  create mode 100644 Documentation/devicetree/bindings/edac/armv8-edac.txt
>>  create mode 100644 drivers/edac/cortex_arm64_edac.c
>>
>> diff --git a/Documentation/devicetree/bindings/edac/armv8-edac.txt b/Documentation/devicetree/bindings/edac/armv8-edac.txt
>> new file mode 100644
>> index 0000000..dfd128f
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/edac/armv8-edac.txt
>> @@ -0,0 +1,15 @@
>> +* ARMv8 L1/L2 cache error reporting
>> +
>> +On ARMv8, CPU Memory Error Syndrome Register and L2 Memory Error Syndrome
>> +Register can be used for checking L1 and L2 memory errors.
>> +
>> +The following section describes the ARMv8 EDAC DT node binding.
>> +
>> +Required properties:
>> +- compatible: Should be "arm,armv8-edac"
>> +
>> +Example:
>> +	edac {
>> +		compatible = "arm,armv8-edac";
>> +	};
>> +
>> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
>> index ef25000..dd7c195 100644
>> --- a/drivers/edac/Kconfig
>> +++ b/drivers/edac/Kconfig
>> @@ -390,4 +390,10 @@ config EDAC_XGENE
>>  	  Support for error detection and correction on the
>>  	  APM X-Gene family of SOCs.
>>  
>> +config EDAC_CORTEX_ARM64
>> +	tristate "ARM Cortex A57/A53"
>> +	depends on EDAC_MM_EDAC && ARM64
> 
> It would be good to be able to compile it on non-ARM64 archs
> if COMPILE_TEST, e. g.:
> 
> 	depends on EDAC_MM_EDAC && (ARM64 || COMPILE_TEST)
> 
> That would allow testing tools like Coverity to test it. As far as
> I know, the public license we use only works on x86.
> 
>> +	help
>> +	  Support for error detection and correction on the
>> +	  ARM Cortex A57 and A53.
>>  endif # EDAC
>> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
>> index ae3c5f3..ac01660 100644
>> --- a/drivers/edac/Makefile
>> +++ b/drivers/edac/Makefile
>> @@ -68,3 +68,4 @@ obj-$(CONFIG_EDAC_OCTEON_PCI)		+= octeon_edac-pci.o
>>  obj-$(CONFIG_EDAC_ALTERA_MC)		+= altera_edac.o
>>  obj-$(CONFIG_EDAC_SYNOPSYS)		+= synopsys_edac.o
>>  obj-$(CONFIG_EDAC_XGENE)		+= xgene_edac.o
>> +obj-$(CONFIG_EDAC_CORTEX_ARM64)		+= cortex_arm64_edac.o
>> diff --git a/drivers/edac/cortex_arm64_edac.c b/drivers/edac/cortex_arm64_edac.c
>> new file mode 100644
>> index 0000000..c37bb94
>> --- /dev/null
>> +++ b/drivers/edac/cortex_arm64_edac.c
>> @@ -0,0 +1,457 @@
>> +/*
>> + * Cortex ARM64 EDAC
>> + *
>> + * Copyright (c) 2015, Advanced Micro Devices
>> + * Author: Brijesh Singh <brijeshkumar.singh@amd.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + */
>> +
>> +#include <linux/module.h>
>> +#include <linux/of_device.h>
>> +#include <linux/platform_device.h>
>> +
>> +#include "edac_core.h"
>> +
>> +#define EDAC_MOD_STR             "cortex_arm64_edac"
>> +
>> +#define A57_CPUMERRSR_EL1_INDEX(x)   ((x) & 0x1ffff)
>> +#define A57_CPUMERRSR_EL1_BANK(x)    (((x) >> 18) & 0x1f)
>> +#define A57_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
>> +#define A57_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
>> +#define A57_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0x7f)
>> +#define A57_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
>> +#define A57_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
>> +#define A57_L1_I_TAG_RAM	     0x00
>> +#define A57_L1_I_DATA_RAM	     0x01
>> +#define A57_L1_D_TAG_RAM	     0x08
>> +#define A57_L1_D_DATA_RAM	     0x09
>> +#define A57_L1_TLB_RAM		     0x18
>> +
>> +#define A57_L2MERRSR_EL1_INDEX(x)    ((x) & 0x1ffff)
>> +#define A57_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0xf)
>> +#define A57_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
>> +#define A57_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
>> +#define A57_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
>> +#define A57_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
>> +#define A57_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
>> +#define A57_L2_TAG_RAM		     0x10
>> +#define A57_L2_DATA_RAM		     0x11
>> +#define A57_L2_SNOOP_TAG_RAM	     0x12
>> +#define A57_L2_DIRTY_RAM	     0x14
>> +#define A57_L2_INCLUSION_PF_RAM      0x18
>> +
>> +#define A53_CPUMERRSR_EL1_ADDR(x)    ((x) & 0xfff)
>> +#define A53_CPUMERRSR_EL1_CPUID(x)   (((x) >> 18) & 0x07)
>> +#define A53_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
>> +#define A53_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
>> +#define A53_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0xff)
>> +#define A53_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
>> +#define A53_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
>> +#define A53_L1_I_TAG_RAM	     0x00
>> +#define A53_L1_I_DATA_RAM	     0x01
>> +#define A53_L1_D_TAG_RAM	     0x08
>> +#define A53_L1_D_DATA_RAM	     0x09
>> +#define A53_L1_D_DIRT_RAM	     0x0A
>> +#define A53_L1_TLB_RAM		     0x18
>> +
>> +#define A53_L2MERRSR_EL1_INDEX(x)    (((x) >> 3) & 0x3fff)
>> +#define A53_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0x0f)
>> +#define A53_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
>> +#define A53_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
>> +#define A53_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
>> +#define A53_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
>> +#define A53_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
>> +#define A53_L2_TAG_RAM		     0x10
>> +#define A53_L2_DATA_RAM		     0x11
>> +#define A53_L2_SNOOP_RAM	     0x12
>> +
>> +#define L1_CACHE		     0
>> +#define L2_CACHE		     1
>> +
>> +int poll_msec = 100;
>> +
>> +struct cortex_arm64_edac {
>> +	struct edac_device_ctl_info *edac_ctl;
>> +};
>> +
>> +static inline u64 read_cpumerrsr_el1(void)
>> +{
>> +	u64 val;
>> +
>> +	asm volatile("mrs %0, s3_1_c15_c2_2" : "=r" (val));
>> +	return val;
>> +}
>> +
>> +static inline void write_cpumerrsr_el1(u64 val)
>> +{
>> +	asm volatile("msr s3_1_c15_c2_2, %0" :: "r" (val));
>> +}
>> +
>> +static inline u64 read_l2merrsr_el1(void)
>> +{
>> +	u64 val;
>> +
>> +	asm volatile("mrs %0, s3_1_c15_c2_3" : "=r" (val));
>> +	return val;
>> +}
>> +
>> +static inline void write_l2merrsr_el1(u64 val)
>> +{
>> +	asm volatile("msr s3_1_c15_c2_3, %0" :: "r" (val));
>> +}
>> +
> 
> If we're willing to compile with COMPILE_TEST, we'll need to provide
> some stubs for the above functions that won't use asm.
> 
>> +static void a53_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
>> +{
>> +	int fatal;
>> +	int repeat_err, other_err;
>> +	u64 val = read_l2merrsr_el1();
>> +
>> +	if (!A53_L2MERRSR_EL1_VALID(val))
>> +		return;
>> +
>> +	fatal = A53_L2MERRSR_EL1_FATAL(val);
>> +	repeat_err = A53_L2MERRSR_EL1_REPEAT(val);
>> +	other_err = A53_L2MERRSR_EL1_OTHER(val);
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
>> +		    "A53 CPU%d L2 %s error detected!\n", smp_processor_id(),
>> +		    fatal ? "fatal" : "non-fatal");
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
>> +
>> +	switch (A53_L2MERRSR_EL1_RAMID(val)) {
>> +	case A53_L2_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
>> +		break;
>> +	case A53_L2_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
>> +		break;
>> +	case A53_L2_SNOOP_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop filter RAM\n");
>> +		break;
>> +	default:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
>> +		break;
>> +	}
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
>> +		    repeat_err);
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
>> +		    other_err);
>> +	if (fatal)
>> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
>> +				      edac_ctl->name);
>> +	else
>> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
>> +				      edac_ctl->name);
>> +	write_l2merrsr_el1(0);
>> +}
>> +
>> +static void a57_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
>> +{
>> +	int fatal;
>> +	int repeat_err, other_err;
>> +	u64 val = read_l2merrsr_el1();
>> +
>> +	if (!A57_L2MERRSR_EL1_VALID(val))
>> +		return;
>> +
>> +	fatal = A57_L2MERRSR_EL1_FATAL(val);
>> +	repeat_err = A57_L2MERRSR_EL1_REPEAT(val);
>> +	other_err = A57_L2MERRSR_EL1_OTHER(val);
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
>> +		    "A57 CPU%d L2 %s error detected!\n", smp_processor_id(),
>> +		    fatal ? "fatal" : "non-fatal");
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
>> +
>> +	switch (A57_L2MERRSR_EL1_RAMID(val)) {
>> +	case A57_L2_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
>> +		break;
>> +	case A57_L2_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
>> +		break;
>> +	case A57_L2_SNOOP_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop tag RAM\n");
>> +		break;
>> +	case A57_L2_DIRTY_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Dirty RAM\n");
>> +		break;
>> +	case A57_L2_INCLUSION_PF_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 inclusion PF RAM\n");
>> +		break;
>> +	default:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
>> +		break;
>> +	}
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
>> +		    repeat_err);
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
>> +		    other_err);
>> +	if (fatal)
>> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
>> +				      edac_ctl->name);
>> +	else
>> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
>> +				      edac_ctl->name);
>> +	write_l2merrsr_el1(0);
>> +}
>> +
>> +static void a57_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
>> +{
>> +	int fatal;
>> +	int repeat_err, other_err;
>> +	u64 val = read_cpumerrsr_el1();
>> +
>> +	if (!A57_CPUMERRSR_EL1_VALID(val))
>> +		return;
>> +
>> +	fatal = A57_CPUMERRSR_EL1_FATAL(val);
>> +	repeat_err = A57_CPUMERRSR_EL1_REPEAT(val);
>> +	other_err = A57_CPUMERRSR_EL1_OTHER(val);
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
>> +		    "CPU%d L1 %s error detected!\n", smp_processor_id(),
>> +		    fatal ? "fatal" : "non-fatal");
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
>> +
>> +	switch (A57_CPUMERRSR_EL1_RAMID(val)) {
>> +	case A57_L1_I_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
>> +		break;
>> +	case A57_L1_I_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
>> +		break;
>> +	case A57_L1_D_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
>> +		break;
>> +	case A57_L1_D_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
>> +		break;
>> +	case A57_L1_TLB_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
>> +		break;
>> +	default:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
>> +		break;
>> +	}
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
>> +		    repeat_err);
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
>> +		    other_err);
>> +
>> +	if (fatal)
>> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
>> +				      edac_ctl->name);
>> +	else
>> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
>> +				      edac_ctl->name);
>> +	write_cpumerrsr_el1(0);
>> +}
>> +
>> +static void a53_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
>> +{
>> +	int fatal;
>> +	int repeat_err, other_err;
>> +	u64 val = read_cpumerrsr_el1();
>> +
>> +	if (!A53_CPUMERRSR_EL1_VALID(val))
>> +		return;
>> +
>> +	fatal = A53_CPUMERRSR_EL1_FATAL(val);
>> +	repeat_err = A53_CPUMERRSR_EL1_REPEAT(val);
>> +	other_err = A53_CPUMERRSR_EL1_OTHER(val);
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
>> +		    "A53 CPU%d L1 %s error detected!\n", smp_processor_id(),
>> +		    fatal ? "fatal" : "non-fatal");
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
>> +
>> +	switch (A53_CPUMERRSR_EL1_RAMID(val)) {
>> +	case A53_L1_I_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
>> +		break;
>> +	case A53_L1_I_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
>> +		break;
>> +	case A53_L1_D_TAG_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
>> +		break;
>> +	case A53_L1_D_DATA_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
>> +		break;
>> +	case A53_L1_TLB_RAM:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
>> +		break;
>> +	default:
>> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
>> +		break;
>> +	}
>> +
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
>> +		    repeat_err);
>> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
>> +		    other_err);
>> +
>> +	if (fatal)
>> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
>> +				      edac_ctl->name);
>> +	else
>> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
>> +				      edac_ctl->name);
> 
> The above code doesn't look right to me. It should be, instead, calling
> one of the functions that output the errors also via trace or to call one
> of the trace functions directly (see the trace functions currently defined
> at  include/ras/ras_event.h).
> 
> Failing to do that would cause RAS tools (like rasdaemon) to not get
> the errors.
> 
Noted.

I will use trace_mc_event() to generate event but it seems that I still need to call
edac_device_handle_ce/ue () to log the error in sysfs files. Also in case of UE I noticed
that edac_device_handle_ue() takes care of causing panic (the expected behavior).

So is it okay to use both trace event as well as edac_device_handle_xx. Something like this

        if (L2MERRSR_EL1_FATAL(val)) {
                trace_mc_event(HW_EVENT_ERR_UNCORRECTED, "L2 fatal error", "",
                                repeat_err, 0, 0, 0, 0, index, 0, 0,
                                "cortex_arm64_edac");
                edac_device_handle_ue(edac_ctl, cpu, L2_CACHE, edac_ctl->name);
        } else {
                trace_mc_event(HW_EVENT_ERR_CORRECTED, "L2 non-fatal error",
                                "", repeat_err, 0, 0, 0, 0, index, 0, 0,
                                "cortex_arm64_edac");
                edac_device_handle_ce(edac_ctl, cpu, L2_CACHE, edac_ctl->name);
        }


>> +	write_cpumerrsr_el1(0);
>> +}
>> +
>> +static void parse_cpumerrsr(void *args)
>> +{
>> +	struct edac_device_ctl_info *edac_ctl = args;
>> +	int partnum = read_cpuid_part_number();
>> +
>> +	switch (partnum) {
>> +	case ARM_CPU_PART_CORTEX_A57:
>> +		a57_parse_cpumerrsr(edac_ctl);
>> +		break;
>> +	case ARM_CPU_PART_CORTEX_A53:
>> +		a53_parse_cpumerrsr(edac_ctl);
>> +		break;
>> +	}
>> +}
>> +
>> +static void parse_l2merrsr(void *args)
>> +{
>> +	struct edac_device_ctl_info *edac_ctl = args;
>> +	int partnum = read_cpuid_part_number();
>> +
>> +	switch (partnum) {
>> +	case ARM_CPU_PART_CORTEX_A57:
>> +		a57_parse_l2merrsr(edac_ctl);
>> +		break;
>> +	case ARM_CPU_PART_CORTEX_A53:
>> +		a53_parse_l2merrsr(edac_ctl);
>> +		break;
>> +	}
>> +}
>> +
>> +static void arm64_monitor_cache_errors(struct edac_device_ctl_info *edev_ctl)
>> +{
>> +	int cpu;
>> +	struct cpumask cluster_mask, old_mask;
>> +
>> +	cpumask_clear(&cluster_mask);
>> +	cpumask_clear(&old_mask);
>> +
>> +	get_online_cpus();
>> +	for_each_online_cpu(cpu) {
>> +		smp_call_function_single(cpu, parse_cpumerrsr, edev_ctl, 0);
>> +		cpumask_copy(&cluster_mask, topology_core_cpumask(cpu));
>> +		if (cpumask_equal(&cluster_mask, &old_mask))
>> +			continue;
>> +		cpumask_copy(&old_mask, &cluster_mask);
>> +		smp_call_function_any(&cluster_mask, parse_l2merrsr,
>> +				      edev_ctl, 0);
>> +	}
>> +	put_online_cpus();
>> +}
>> +
>> +static int cortex_arm64_edac_probe(struct platform_device *pdev)
>> +{
>> +	int rc;
>> +	struct cortex_arm64_edac *drv;
>> +	struct device *dev = &pdev->dev;
>> +
>> +	drv = devm_kzalloc(dev, sizeof(*drv), GFP_KERNEL);
>> +	if (!drv)
>> +		return -ENOMEM;
>> +
>> +	drv->edac_ctl = edac_device_alloc_ctl_info(0, "cpu",
>> +						   num_possible_cpus(), "L", 2,
>> +						   1, NULL, 0,
>> +						   edac_device_alloc_index());
>> +	if (IS_ERR(drv->edac_ctl))
>> +		return -ENOMEM;
>> +
>> +	drv->edac_ctl->poll_msec = poll_msec;
>> +	drv->edac_ctl->edac_check = arm64_monitor_cache_errors;
>> +	drv->edac_ctl->dev = dev;
>> +	drv->edac_ctl->mod_name = dev_name(dev);
>> +	drv->edac_ctl->dev_name = dev_name(dev);
>> +	drv->edac_ctl->ctl_name = "cpu_err";
>> +	drv->edac_ctl->panic_on_ue = 1;
>> +	platform_set_drvdata(pdev, drv);
>> +
>> +	rc = edac_device_add_device(drv->edac_ctl);
>> +	if (rc)
>> +		goto edac_alloc_failed;
>> +
>> +	return 0;
>> +
>> +edac_alloc_failed:
>> +	edac_device_free_ctl_info(drv->edac_ctl);
>> +	return rc;
>> +}
>> +
>> +static int cortex_arm64_edac_remove(struct platform_device *pdev)
>> +{
>> +	struct cortex_arm64_edac *drv = dev_get_drvdata(&pdev->dev);
>> +	struct edac_device_ctl_info *edac_ctl = drv->edac_ctl;
>> +
>> +	edac_device_del_device(edac_ctl->dev);
>> +	edac_device_free_ctl_info(edac_ctl);
>> +
>> +	return 0;
>> +}
>> +
>> +static const struct of_device_id cortex_arm64_edac_of_match[] = {
>> +	{ .compatible = "arm,armv8-edac" },
>> +	{},
>> +};
>> +MODULE_DEVICE_TABLE(of, cortex_arm64_edac_of_match);
>> +
>> +static struct platform_driver cortex_arm64_edac_driver = {
>> +	.probe = cortex_arm64_edac_probe,
>> +	.remove = cortex_arm64_edac_remove,
>> +	.driver = {
>> +		.name = "arm64-edac",
>> +		.owner = THIS_MODULE,
>> +		.of_match_table = cortex_arm64_edac_of_match,
>> +	},
>> +};
>> +
>> +static int __init cortex_arm64_edac_init(void)
>> +{
>> +	int rc;
>> +
>> +	/* Only POLL mode is supported so far */
>> +	edac_op_state = EDAC_OPSTATE_POLL;
>> +
>> +	rc = platform_driver_register(&cortex_arm64_edac_driver);
>> +	if (rc) {
>> +		edac_printk(KERN_ERR, EDAC_MOD_STR, "failed to register\n");
>> +		return rc;
>> +	}
>> +
>> +	return 0;
>> +}
>> +module_init(cortex_arm64_edac_init);
>> +
>> +static void __exit cortex_arm64_edac_exit(void)
>> +{
>> +	platform_driver_unregister(&cortex_arm64_edac_driver);
>> +}
>> +module_exit(cortex_arm64_edac_exit);
>> +
>> +MODULE_LICENSE("GPL");
>> +MODULE_AUTHOR("Brijesh Singh <brijeshkumar.singh@amd.com>");
>> +MODULE_DESCRIPTION("Cortex A57 and A53 EDAC driver");
>> +module_param(poll_msec, int, 0444);
>> +MODULE_PARM_DESC(poll_msec, "EDAC monitor poll interval in msec");

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] EDAC: Add ARM64 EDAC
  2015-10-22 14:46     ` Brijesh Singh
  (?)
@ 2015-10-23  1:41       ` Hanjun Guo
  -1 siblings, 0 replies; 34+ messages in thread
From: Hanjun Guo @ 2015-10-23  1:41 UTC (permalink / raw)
  To: Brijesh Singh, Andre Przywara, linux-arm-kernel
  Cc: robh+dt, pawel.moll, mark.rutland, ijc+devicetree, galak,
	dougthompson, bp, mchehab, devicetree, arnd, linux-kernel,
	linux-edac, Huxinwei

Hi Brijesh,

On 2015/10/22 22:46, Brijesh Singh wrote:
> Hi Andre,
>
> On 10/21/2015 06:52 PM, Andre Przywara wrote:
>> On 21/10/15 21:41, Brijesh Singh wrote:
>>> Add support for Cortex A57 and A53 EDAC driver.
>> Hi Brijesh,
>>
>> thanks for the quick update! Some comments below.
>>
>>> Signed-off-by: Brijesh Singh <brijeshkumar.singh@amd.com>
>>> CC: robh+dt@kernel.org
>>> CC: pawel.moll@arm.com
>>> CC: mark.rutland@arm.com
>>> CC: ijc+devicetree@hellion.org.uk
>>> CC: galak@codeaurora.org
>>> CC: dougthompson@xmission.com
>>> CC: bp@alien8.de
>>> CC: mchehab@osg.samsung.com
>>> CC: devicetree@vger.kernel.org
>>> CC: guohanjun@huawei.com
>>> CC: andre.przywara@arm.com
>>> CC: arnd@arndb.de
>>> CC: linux-kernel@vger.kernel.org
>>> CC: linux-edac@vger.kernel.org
>>> ---
>>>
>>> v2:
>>> * convert into generic arm64 edac driver
>>> * remove AMD specific references from dt binding
>>> * remove poll_msec property from dt binding
>>> * add poll_msec as a module param, default is 100ms
>>> * update copyright text
>>> * define macro mnemonics for L1 and L2 RAMID
>>> * check L2 error per-cluster instead of per core
>>> * update function names
>>> * use get_online_cpus() and put_online_cpus() to make L1 and L2 register 
>>>   read hotplug-safe
>>> * add error check in probe routine
>>>
>>>  .../devicetree/bindings/edac/armv8-edac.txt        |  15 +
>>>  drivers/edac/Kconfig                               |   6 +
>>>  drivers/edac/Makefile                              |   1 +
>>>  drivers/edac/cortex_arm64_edac.c                   | 457 +++++++++++++++++++++
>>>  4 files changed, 479 insertions(+)
>>>  create mode 100644 Documentation/devicetree/bindings/edac/armv8-edac.txt
>>>  create mode 100644 drivers/edac/cortex_arm64_edac.c
>>>
>>> diff --git a/Documentation/devicetree/bindings/edac/armv8-edac.txt b/Documentation/devicetree/bindings/edac/armv8-edac.txt
>>> new file mode 100644
>>> index 0000000..dfd128f
>>> --- /dev/null
>>> +++ b/Documentation/devicetree/bindings/edac/armv8-edac.txt
>>> @@ -0,0 +1,15 @@
>>> +* ARMv8 L1/L2 cache error reporting
>>> +
>>> +On ARMv8, CPU Memory Error Syndrome Register and L2 Memory Error Syndrome
>>> +Register can be used for checking L1 and L2 memory errors.
>>> +
>>> +The following section describes the ARMv8 EDAC DT node binding.
>>> +
>>> +Required properties:
>>> +- compatible: Should be "arm,armv8-edac"
>>> +
>>> +Example:
>>> +	edac {
>>> +		compatible = "arm,armv8-edac";
>>> +	};
>>> +
>> So if there is nothing in here, why do we need the DT binding at all (I
>> think Mark hinted at that already)?
>> Can't we just use the MIDR as already suggested by others?
>> Secondly, armv8-edac is wrong here, as this feature is ARM-Cortex
>> specific and not architectural.
>>
> Yes, I was going with Mark suggestion to remove DT binding but then came across these cases which kind of hinted to keep DT binding:
>
> * Without DT binding, the driver will always be loaded on arm64 unless its blacklisted.
> * Its possible that other SoC's might handle single-bit and double-bit errors differently compare to 
>   Seattle platform. In Seattle platform both errors are handled by firmware but if other SoC 
>   wants OS to handle these errors then they might need DT binding to provide the irq numbers etc.

I totally agree with you here,  thanks for putting them together.
Different SoCs may handle the error in different ways, we need
bindings to specialize them, irq number is a good example :)

Thanks
Hanjun


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-23  1:41       ` Hanjun Guo
  0 siblings, 0 replies; 34+ messages in thread
From: Hanjun Guo @ 2015-10-23  1:41 UTC (permalink / raw)
  To: Brijesh Singh, Andre Przywara, linux-arm-kernel
  Cc: robh+dt, pawel.moll, mark.rutland, ijc+devicetree, galak,
	dougthompson, bp, mchehab, devicetree, arnd, linux-kernel,
	linux-edac, Huxinwei

Hi Brijesh,

On 2015/10/22 22:46, Brijesh Singh wrote:
> Hi Andre,
>
> On 10/21/2015 06:52 PM, Andre Przywara wrote:
>> On 21/10/15 21:41, Brijesh Singh wrote:
>>> Add support for Cortex A57 and A53 EDAC driver.
>> Hi Brijesh,
>>
>> thanks for the quick update! Some comments below.
>>
>>> Signed-off-by: Brijesh Singh <brijeshkumar.singh@amd.com>
>>> CC: robh+dt@kernel.org
>>> CC: pawel.moll@arm.com
>>> CC: mark.rutland@arm.com
>>> CC: ijc+devicetree@hellion.org.uk
>>> CC: galak@codeaurora.org
>>> CC: dougthompson@xmission.com
>>> CC: bp@alien8.de
>>> CC: mchehab@osg.samsung.com
>>> CC: devicetree@vger.kernel.org
>>> CC: guohanjun@huawei.com
>>> CC: andre.przywara@arm.com
>>> CC: arnd@arndb.de
>>> CC: linux-kernel@vger.kernel.org
>>> CC: linux-edac@vger.kernel.org
>>> ---
>>>
>>> v2:
>>> * convert into generic arm64 edac driver
>>> * remove AMD specific references from dt binding
>>> * remove poll_msec property from dt binding
>>> * add poll_msec as a module param, default is 100ms
>>> * update copyright text
>>> * define macro mnemonics for L1 and L2 RAMID
>>> * check L2 error per-cluster instead of per core
>>> * update function names
>>> * use get_online_cpus() and put_online_cpus() to make L1 and L2 register 
>>>   read hotplug-safe
>>> * add error check in probe routine
>>>
>>>  .../devicetree/bindings/edac/armv8-edac.txt        |  15 +
>>>  drivers/edac/Kconfig                               |   6 +
>>>  drivers/edac/Makefile                              |   1 +
>>>  drivers/edac/cortex_arm64_edac.c                   | 457 +++++++++++++++++++++
>>>  4 files changed, 479 insertions(+)
>>>  create mode 100644 Documentation/devicetree/bindings/edac/armv8-edac.txt
>>>  create mode 100644 drivers/edac/cortex_arm64_edac.c
>>>
>>> diff --git a/Documentation/devicetree/bindings/edac/armv8-edac.txt b/Documentation/devicetree/bindings/edac/armv8-edac.txt
>>> new file mode 100644
>>> index 0000000..dfd128f
>>> --- /dev/null
>>> +++ b/Documentation/devicetree/bindings/edac/armv8-edac.txt
>>> @@ -0,0 +1,15 @@
>>> +* ARMv8 L1/L2 cache error reporting
>>> +
>>> +On ARMv8, CPU Memory Error Syndrome Register and L2 Memory Error Syndrome
>>> +Register can be used for checking L1 and L2 memory errors.
>>> +
>>> +The following section describes the ARMv8 EDAC DT node binding.
>>> +
>>> +Required properties:
>>> +- compatible: Should be "arm,armv8-edac"
>>> +
>>> +Example:
>>> +	edac {
>>> +		compatible = "arm,armv8-edac";
>>> +	};
>>> +
>> So if there is nothing in here, why do we need the DT binding at all (I
>> think Mark hinted at that already)?
>> Can't we just use the MIDR as already suggested by others?
>> Secondly, armv8-edac is wrong here, as this feature is ARM-Cortex
>> specific and not architectural.
>>
> Yes, I was going with Mark suggestion to remove DT binding but then came across these cases which kind of hinted to keep DT binding:
>
> * Without DT binding, the driver will always be loaded on arm64 unless its blacklisted.
> * Its possible that other SoC's might handle single-bit and double-bit errors differently compare to 
>   Seattle platform. In Seattle platform both errors are handled by firmware but if other SoC 
>   wants OS to handle these errors then they might need DT binding to provide the irq numbers etc.

I totally agree with you here,  thanks for putting them together.
Different SoCs may handle the error in different ways, we need
bindings to specialize them, irq number is a good example :)

Thanks
Hanjun

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-23  1:41       ` Hanjun Guo
  0 siblings, 0 replies; 34+ messages in thread
From: Hanjun Guo @ 2015-10-23  1:41 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Brijesh,

On 2015/10/22 22:46, Brijesh Singh wrote:
> Hi Andre,
>
> On 10/21/2015 06:52 PM, Andre Przywara wrote:
>> On 21/10/15 21:41, Brijesh Singh wrote:
>>> Add support for Cortex A57 and A53 EDAC driver.
>> Hi Brijesh,
>>
>> thanks for the quick update! Some comments below.
>>
>>> Signed-off-by: Brijesh Singh <brijeshkumar.singh@amd.com>
>>> CC: robh+dt at kernel.org
>>> CC: pawel.moll at arm.com
>>> CC: mark.rutland at arm.com
>>> CC: ijc+devicetree at hellion.org.uk
>>> CC: galak at codeaurora.org
>>> CC: dougthompson at xmission.com
>>> CC: bp at alien8.de
>>> CC: mchehab at osg.samsung.com
>>> CC: devicetree at vger.kernel.org
>>> CC: guohanjun at huawei.com
>>> CC: andre.przywara at arm.com
>>> CC: arnd at arndb.de
>>> CC: linux-kernel at vger.kernel.org
>>> CC: linux-edac at vger.kernel.org
>>> ---
>>>
>>> v2:
>>> * convert into generic arm64 edac driver
>>> * remove AMD specific references from dt binding
>>> * remove poll_msec property from dt binding
>>> * add poll_msec as a module param, default is 100ms
>>> * update copyright text
>>> * define macro mnemonics for L1 and L2 RAMID
>>> * check L2 error per-cluster instead of per core
>>> * update function names
>>> * use get_online_cpus() and put_online_cpus() to make L1 and L2 register 
>>>   read hotplug-safe
>>> * add error check in probe routine
>>>
>>>  .../devicetree/bindings/edac/armv8-edac.txt        |  15 +
>>>  drivers/edac/Kconfig                               |   6 +
>>>  drivers/edac/Makefile                              |   1 +
>>>  drivers/edac/cortex_arm64_edac.c                   | 457 +++++++++++++++++++++
>>>  4 files changed, 479 insertions(+)
>>>  create mode 100644 Documentation/devicetree/bindings/edac/armv8-edac.txt
>>>  create mode 100644 drivers/edac/cortex_arm64_edac.c
>>>
>>> diff --git a/Documentation/devicetree/bindings/edac/armv8-edac.txt b/Documentation/devicetree/bindings/edac/armv8-edac.txt
>>> new file mode 100644
>>> index 0000000..dfd128f
>>> --- /dev/null
>>> +++ b/Documentation/devicetree/bindings/edac/armv8-edac.txt
>>> @@ -0,0 +1,15 @@
>>> +* ARMv8 L1/L2 cache error reporting
>>> +
>>> +On ARMv8, CPU Memory Error Syndrome Register and L2 Memory Error Syndrome
>>> +Register can be used for checking L1 and L2 memory errors.
>>> +
>>> +The following section describes the ARMv8 EDAC DT node binding.
>>> +
>>> +Required properties:
>>> +- compatible: Should be "arm,armv8-edac"
>>> +
>>> +Example:
>>> +	edac {
>>> +		compatible = "arm,armv8-edac";
>>> +	};
>>> +
>> So if there is nothing in here, why do we need the DT binding at all (I
>> think Mark hinted at that already)?
>> Can't we just use the MIDR as already suggested by others?
>> Secondly, armv8-edac is wrong here, as this feature is ARM-Cortex
>> specific and not architectural.
>>
> Yes, I was going with Mark suggestion to remove DT binding but then came across these cases which kind of hinted to keep DT binding:
>
> * Without DT binding, the driver will always be loaded on arm64 unless its blacklisted.
> * Its possible that other SoC's might handle single-bit and double-bit errors differently compare to 
>   Seattle platform. In Seattle platform both errors are handled by firmware but if other SoC 
>   wants OS to handle these errors then they might need DT binding to provide the irq numbers etc.

I totally agree with you here,  thanks for putting them together.
Different SoCs may handle the error in different ways, we need
bindings to specialize them, irq number is a good example :)

Thanks
Hanjun

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-23  9:51         ` Andre Przywara
  0 siblings, 0 replies; 34+ messages in thread
From: Andre Przywara @ 2015-10-23  9:51 UTC (permalink / raw)
  To: Hanjun Guo, Brijesh Singh, linux-arm-kernel
  Cc: robh+dt, pawel.moll, mark.rutland, ijc+devicetree, galak,
	dougthompson, bp, mchehab, devicetree, arnd, linux-kernel,
	linux-edac, Huxinwei

On 23/10/15 02:41, Hanjun Guo wrote:
> Hi Brijesh,
> 
> On 2015/10/22 22:46, Brijesh Singh wrote:
>> Hi Andre,
>>
>> On 10/21/2015 06:52 PM, Andre Przywara wrote:
>>> On 21/10/15 21:41, Brijesh Singh wrote:
>>>> Add support for Cortex A57 and A53 EDAC driver.
>>> Hi Brijesh,
>>>
>>> thanks for the quick update! Some comments below.
>>>
>>>> Signed-off-by: Brijesh Singh <brijeshkumar.singh@amd.com>
>>>> CC: robh+dt@kernel.org
>>>> CC: pawel.moll@arm.com
>>>> CC: mark.rutland@arm.com
>>>> CC: ijc+devicetree@hellion.org.uk
>>>> CC: galak@codeaurora.org
>>>> CC: dougthompson@xmission.com
>>>> CC: bp@alien8.de
>>>> CC: mchehab@osg.samsung.com
>>>> CC: devicetree@vger.kernel.org
>>>> CC: guohanjun@huawei.com
>>>> CC: andre.przywara@arm.com
>>>> CC: arnd@arndb.de
>>>> CC: linux-kernel@vger.kernel.org
>>>> CC: linux-edac@vger.kernel.org
>>>> ---
>>>>
>>>> v2:
>>>> * convert into generic arm64 edac driver
>>>> * remove AMD specific references from dt binding
>>>> * remove poll_msec property from dt binding
>>>> * add poll_msec as a module param, default is 100ms
>>>> * update copyright text
>>>> * define macro mnemonics for L1 and L2 RAMID
>>>> * check L2 error per-cluster instead of per core
>>>> * update function names
>>>> * use get_online_cpus() and put_online_cpus() to make L1 and L2 register 
>>>>   read hotplug-safe
>>>> * add error check in probe routine
>>>>
>>>>  .../devicetree/bindings/edac/armv8-edac.txt        |  15 +
>>>>  drivers/edac/Kconfig                               |   6 +
>>>>  drivers/edac/Makefile                              |   1 +
>>>>  drivers/edac/cortex_arm64_edac.c                   | 457 +++++++++++++++++++++
>>>>  4 files changed, 479 insertions(+)
>>>>  create mode 100644 Documentation/devicetree/bindings/edac/armv8-edac.txt
>>>>  create mode 100644 drivers/edac/cortex_arm64_edac.c
>>>>
>>>> diff --git a/Documentation/devicetree/bindings/edac/armv8-edac.txt b/Documentation/devicetree/bindings/edac/armv8-edac.txt
>>>> new file mode 100644
>>>> index 0000000..dfd128f
>>>> --- /dev/null
>>>> +++ b/Documentation/devicetree/bindings/edac/armv8-edac.txt
>>>> @@ -0,0 +1,15 @@
>>>> +* ARMv8 L1/L2 cache error reporting
>>>> +
>>>> +On ARMv8, CPU Memory Error Syndrome Register and L2 Memory Error Syndrome
>>>> +Register can be used for checking L1 and L2 memory errors.
>>>> +
>>>> +The following section describes the ARMv8 EDAC DT node binding.
>>>> +
>>>> +Required properties:
>>>> +- compatible: Should be "arm,armv8-edac"
>>>> +
>>>> +Example:
>>>> +	edac {
>>>> +		compatible = "arm,armv8-edac";
>>>> +	};
>>>> +
>>> So if there is nothing in here, why do we need the DT binding at all (I
>>> think Mark hinted at that already)?
>>> Can't we just use the MIDR as already suggested by others?
>>> Secondly, armv8-edac is wrong here, as this feature is ARM-Cortex
>>> specific and not architectural.
>>>
>> Yes, I was going with Mark suggestion to remove DT binding but then came across these cases which kind of hinted to keep DT binding:
>>
>> * Without DT binding, the driver will always be loaded on arm64 unless its blacklisted.

So I checked the x86 code: the driver is always loaded as soon as the
hardware is there (looking at PCI device IDs from the on-chip
northbridge, for instance). The trick here is to have the Kconfig option
defaulting to "=n", so a kernel builder would have to explicitly enable
this. Android or embedded kernels wouldn't do this, for instance, while
a server distribution would do.
If a user doesn't want to be bothered with the driver, there is always
the possibility of blacklisting the module.
Setting a system policy is IMHO out of scope for a DT binding.

>> * Its possible that other SoC's might handle single-bit and double-bit errors differently compare to 
>>   Seattle platform. In Seattle platform both errors are handled by firmware but if other SoC 
>>   wants OS to handle these errors then they might need DT binding to provide the irq numbers etc.

What do you mean exactly with "firmware handles these errors"?
What would the firmware do? I guess just logging the error and then
possibly reset the register? How would this change the driver?

> I totally agree with you here,  thanks for putting them together.
> Different SoCs may handle the error in different ways, we need
> bindings to specialize them, irq number is a good example :)

But how does this affect this very driver, polling just those two registers?
Where would the interrupt come into the game here? Where is the proposed
DT binding for that interrupt?

AFAICT EL3 firmware handling errors would just hide this information
from the driver, so if the f/w decides to "handle" uncorrectable ECC
errors in a fatal way, there is nothing the driver could do anyway, right?

Can you sketch a concrete example where we would actually need the
driver to know about the firmware capabilities?

Cheers,
Andre.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-23  9:51         ` Andre Przywara
  0 siblings, 0 replies; 34+ messages in thread
From: Andre Przywara @ 2015-10-23  9:51 UTC (permalink / raw)
  To: Hanjun Guo, Brijesh Singh,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r
  Cc: robh+dt-DgEjT+Ai2ygdnm+yROfE0A, pawel.moll-5wv7dgnIgG8,
	mark.rutland-5wv7dgnIgG8, ijc+devicetree-KcIKpvwj1kUDXYZnReoRVg,
	galak-sgV2jX0FEOL9JmXXK+q4OQ,
	dougthompson-aS9lmoZGLiVWk0Htik3J/w, bp-Gina5bIWoIWzQB+pC5nmwQ,
	mchehab-JPH+aEBZ4P+UEJcrhfAQsw,
	devicetree-u79uwXL29TY76Z2rM5mHXA, arnd-r2nGTMty4D4,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-edac-u79uwXL29TY76Z2rM5mHXA, Huxinwei

On 23/10/15 02:41, Hanjun Guo wrote:
> Hi Brijesh,
> 
> On 2015/10/22 22:46, Brijesh Singh wrote:
>> Hi Andre,
>>
>> On 10/21/2015 06:52 PM, Andre Przywara wrote:
>>> On 21/10/15 21:41, Brijesh Singh wrote:
>>>> Add support for Cortex A57 and A53 EDAC driver.
>>> Hi Brijesh,
>>>
>>> thanks for the quick update! Some comments below.
>>>
>>>> Signed-off-by: Brijesh Singh <brijeshkumar.singh-5C7GfCeVMHo@public.gmane.org>
>>>> CC: robh+dt-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
>>>> CC: pawel.moll-5wv7dgnIgG8@public.gmane.org
>>>> CC: mark.rutland-5wv7dgnIgG8@public.gmane.org
>>>> CC: ijc+devicetree-KcIKpvwj1kUDXYZnReoRVg@public.gmane.org
>>>> CC: galak-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org
>>>> CC: dougthompson-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org
>>>> CC: bp-Gina5bIWoIWzQB+pC5nmwQ@public.gmane.org
>>>> CC: mchehab-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org
>>>> CC: devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>> CC: guohanjun-hv44wF8Li93QT0dZR+AlfA@public.gmane.org
>>>> CC: andre.przywara-5wv7dgnIgG8@public.gmane.org
>>>> CC: arnd-r2nGTMty4D4@public.gmane.org
>>>> CC: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>> CC: linux-edac-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>> ---
>>>>
>>>> v2:
>>>> * convert into generic arm64 edac driver
>>>> * remove AMD specific references from dt binding
>>>> * remove poll_msec property from dt binding
>>>> * add poll_msec as a module param, default is 100ms
>>>> * update copyright text
>>>> * define macro mnemonics for L1 and L2 RAMID
>>>> * check L2 error per-cluster instead of per core
>>>> * update function names
>>>> * use get_online_cpus() and put_online_cpus() to make L1 and L2 register 
>>>>   read hotplug-safe
>>>> * add error check in probe routine
>>>>
>>>>  .../devicetree/bindings/edac/armv8-edac.txt        |  15 +
>>>>  drivers/edac/Kconfig                               |   6 +
>>>>  drivers/edac/Makefile                              |   1 +
>>>>  drivers/edac/cortex_arm64_edac.c                   | 457 +++++++++++++++++++++
>>>>  4 files changed, 479 insertions(+)
>>>>  create mode 100644 Documentation/devicetree/bindings/edac/armv8-edac.txt
>>>>  create mode 100644 drivers/edac/cortex_arm64_edac.c
>>>>
>>>> diff --git a/Documentation/devicetree/bindings/edac/armv8-edac.txt b/Documentation/devicetree/bindings/edac/armv8-edac.txt
>>>> new file mode 100644
>>>> index 0000000..dfd128f
>>>> --- /dev/null
>>>> +++ b/Documentation/devicetree/bindings/edac/armv8-edac.txt
>>>> @@ -0,0 +1,15 @@
>>>> +* ARMv8 L1/L2 cache error reporting
>>>> +
>>>> +On ARMv8, CPU Memory Error Syndrome Register and L2 Memory Error Syndrome
>>>> +Register can be used for checking L1 and L2 memory errors.
>>>> +
>>>> +The following section describes the ARMv8 EDAC DT node binding.
>>>> +
>>>> +Required properties:
>>>> +- compatible: Should be "arm,armv8-edac"
>>>> +
>>>> +Example:
>>>> +	edac {
>>>> +		compatible = "arm,armv8-edac";
>>>> +	};
>>>> +
>>> So if there is nothing in here, why do we need the DT binding at all (I
>>> think Mark hinted at that already)?
>>> Can't we just use the MIDR as already suggested by others?
>>> Secondly, armv8-edac is wrong here, as this feature is ARM-Cortex
>>> specific and not architectural.
>>>
>> Yes, I was going with Mark suggestion to remove DT binding but then came across these cases which kind of hinted to keep DT binding:
>>
>> * Without DT binding, the driver will always be loaded on arm64 unless its blacklisted.

So I checked the x86 code: the driver is always loaded as soon as the
hardware is there (looking at PCI device IDs from the on-chip
northbridge, for instance). The trick here is to have the Kconfig option
defaulting to "=n", so a kernel builder would have to explicitly enable
this. Android or embedded kernels wouldn't do this, for instance, while
a server distribution would do.
If a user doesn't want to be bothered with the driver, there is always
the possibility of blacklisting the module.
Setting a system policy is IMHO out of scope for a DT binding.

>> * Its possible that other SoC's might handle single-bit and double-bit errors differently compare to 
>>   Seattle platform. In Seattle platform both errors are handled by firmware but if other SoC 
>>   wants OS to handle these errors then they might need DT binding to provide the irq numbers etc.

What do you mean exactly with "firmware handles these errors"?
What would the firmware do? I guess just logging the error and then
possibly reset the register? How would this change the driver?

> I totally agree with you here,  thanks for putting them together.
> Different SoCs may handle the error in different ways, we need
> bindings to specialize them, irq number is a good example :)

But how does this affect this very driver, polling just those two registers?
Where would the interrupt come into the game here? Where is the proposed
DT binding for that interrupt?

AFAICT EL3 firmware handling errors would just hide this information
from the driver, so if the f/w decides to "handle" uncorrectable ECC
errors in a fatal way, there is nothing the driver could do anyway, right?

Can you sketch a concrete example where we would actually need the
driver to know about the firmware capabilities?

Cheers,
Andre.

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-23  9:51         ` Andre Przywara
  0 siblings, 0 replies; 34+ messages in thread
From: Andre Przywara @ 2015-10-23  9:51 UTC (permalink / raw)
  To: linux-arm-kernel

On 23/10/15 02:41, Hanjun Guo wrote:
> Hi Brijesh,
> 
> On 2015/10/22 22:46, Brijesh Singh wrote:
>> Hi Andre,
>>
>> On 10/21/2015 06:52 PM, Andre Przywara wrote:
>>> On 21/10/15 21:41, Brijesh Singh wrote:
>>>> Add support for Cortex A57 and A53 EDAC driver.
>>> Hi Brijesh,
>>>
>>> thanks for the quick update! Some comments below.
>>>
>>>> Signed-off-by: Brijesh Singh <brijeshkumar.singh@amd.com>
>>>> CC: robh+dt at kernel.org
>>>> CC: pawel.moll at arm.com
>>>> CC: mark.rutland at arm.com
>>>> CC: ijc+devicetree at hellion.org.uk
>>>> CC: galak at codeaurora.org
>>>> CC: dougthompson at xmission.com
>>>> CC: bp at alien8.de
>>>> CC: mchehab at osg.samsung.com
>>>> CC: devicetree at vger.kernel.org
>>>> CC: guohanjun at huawei.com
>>>> CC: andre.przywara at arm.com
>>>> CC: arnd at arndb.de
>>>> CC: linux-kernel at vger.kernel.org
>>>> CC: linux-edac at vger.kernel.org
>>>> ---
>>>>
>>>> v2:
>>>> * convert into generic arm64 edac driver
>>>> * remove AMD specific references from dt binding
>>>> * remove poll_msec property from dt binding
>>>> * add poll_msec as a module param, default is 100ms
>>>> * update copyright text
>>>> * define macro mnemonics for L1 and L2 RAMID
>>>> * check L2 error per-cluster instead of per core
>>>> * update function names
>>>> * use get_online_cpus() and put_online_cpus() to make L1 and L2 register 
>>>>   read hotplug-safe
>>>> * add error check in probe routine
>>>>
>>>>  .../devicetree/bindings/edac/armv8-edac.txt        |  15 +
>>>>  drivers/edac/Kconfig                               |   6 +
>>>>  drivers/edac/Makefile                              |   1 +
>>>>  drivers/edac/cortex_arm64_edac.c                   | 457 +++++++++++++++++++++
>>>>  4 files changed, 479 insertions(+)
>>>>  create mode 100644 Documentation/devicetree/bindings/edac/armv8-edac.txt
>>>>  create mode 100644 drivers/edac/cortex_arm64_edac.c
>>>>
>>>> diff --git a/Documentation/devicetree/bindings/edac/armv8-edac.txt b/Documentation/devicetree/bindings/edac/armv8-edac.txt
>>>> new file mode 100644
>>>> index 0000000..dfd128f
>>>> --- /dev/null
>>>> +++ b/Documentation/devicetree/bindings/edac/armv8-edac.txt
>>>> @@ -0,0 +1,15 @@
>>>> +* ARMv8 L1/L2 cache error reporting
>>>> +
>>>> +On ARMv8, CPU Memory Error Syndrome Register and L2 Memory Error Syndrome
>>>> +Register can be used for checking L1 and L2 memory errors.
>>>> +
>>>> +The following section describes the ARMv8 EDAC DT node binding.
>>>> +
>>>> +Required properties:
>>>> +- compatible: Should be "arm,armv8-edac"
>>>> +
>>>> +Example:
>>>> +	edac {
>>>> +		compatible = "arm,armv8-edac";
>>>> +	};
>>>> +
>>> So if there is nothing in here, why do we need the DT binding at all (I
>>> think Mark hinted at that already)?
>>> Can't we just use the MIDR as already suggested by others?
>>> Secondly, armv8-edac is wrong here, as this feature is ARM-Cortex
>>> specific and not architectural.
>>>
>> Yes, I was going with Mark suggestion to remove DT binding but then came across these cases which kind of hinted to keep DT binding:
>>
>> * Without DT binding, the driver will always be loaded on arm64 unless its blacklisted.

So I checked the x86 code: the driver is always loaded as soon as the
hardware is there (looking at PCI device IDs from the on-chip
northbridge, for instance). The trick here is to have the Kconfig option
defaulting to "=n", so a kernel builder would have to explicitly enable
this. Android or embedded kernels wouldn't do this, for instance, while
a server distribution would do.
If a user doesn't want to be bothered with the driver, there is always
the possibility of blacklisting the module.
Setting a system policy is IMHO out of scope for a DT binding.

>> * Its possible that other SoC's might handle single-bit and double-bit errors differently compare to 
>>   Seattle platform. In Seattle platform both errors are handled by firmware but if other SoC 
>>   wants OS to handle these errors then they might need DT binding to provide the irq numbers etc.

What do you mean exactly with "firmware handles these errors"?
What would the firmware do? I guess just logging the error and then
possibly reset the register? How would this change the driver?

> I totally agree with you here,  thanks for putting them together.
> Different SoCs may handle the error in different ways, we need
> bindings to specialize them, irq number is a good example :)

But how does this affect this very driver, polling just those two registers?
Where would the interrupt come into the game here? Where is the proposed
DT binding for that interrupt?

AFAICT EL3 firmware handling errors would just hide this information
from the driver, so if the f/w decides to "handle" uncorrectable ECC
errors in a fatal way, there is nothing the driver could do anyway, right?

Can you sketch a concrete example where we would actually need the
driver to know about the firmware capabilities?

Cheers,
Andre.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] EDAC: Add ARM64 EDAC
  2015-10-21 20:41 ` Brijesh Singh
@ 2015-10-23 16:58   ` Stephen Boyd
  -1 siblings, 0 replies; 34+ messages in thread
From: Stephen Boyd @ 2015-10-23 16:58 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: linux-arm-kernel, mark.rutland, devicetree, pawel.moll,
	ijc+devicetree, andre.przywara, dougthompson, guohanjun,
	linux-kernel, arnd, robh+dt, bp, galak, mchehab, linux-edac

Drive by nitpicks

On 10/21, Brijesh Singh wrote:
> diff --git a/drivers/edac/cortex_arm64_edac.c b/drivers/edac/cortex_arm64_edac.c
> new file mode 100644
> index 0000000..c37bb94
> --- /dev/null
> +++ b/drivers/edac/cortex_arm64_edac.c
> +
> +#define L1_CACHE		     0
> +#define L2_CACHE		     1
> +
> +int poll_msec = 100;

static?

> +
> +struct cortex_arm64_edac {
> +	struct edac_device_ctl_info *edac_ctl;
> +};
[..]
> +
> +static int cortex_arm64_edac_probe(struct platform_device *pdev)
> +{
> +	int rc;
> +	struct cortex_arm64_edac *drv;
> +	struct device *dev = &pdev->dev;
> +
> +	drv = devm_kzalloc(dev, sizeof(*drv), GFP_KERNEL);
> +	if (!drv)
> +		return -ENOMEM;
> +
> +	drv->edac_ctl = edac_device_alloc_ctl_info(0, "cpu",
> +						   num_possible_cpus(), "L", 2,
> +						   1, NULL, 0,
> +						   edac_device_alloc_index());
> +	if (IS_ERR(drv->edac_ctl))
> +		return -ENOMEM;
> +
> +	drv->edac_ctl->poll_msec = poll_msec;
> +	drv->edac_ctl->edac_check = arm64_monitor_cache_errors;
> +	drv->edac_ctl->dev = dev;
> +	drv->edac_ctl->mod_name = dev_name(dev);
> +	drv->edac_ctl->dev_name = dev_name(dev);
> +	drv->edac_ctl->ctl_name = "cpu_err";
> +	drv->edac_ctl->panic_on_ue = 1;
> +	platform_set_drvdata(pdev, drv);
> +
> +	rc = edac_device_add_device(drv->edac_ctl);
> +	if (rc)
> +		goto edac_alloc_failed;
> +
> +	return 0;
> +
> +edac_alloc_failed:
> +	edac_device_free_ctl_info(drv->edac_ctl);
> +	return rc;

Simplify to:

	rc = edac_device_add_device(...
	if (rc)
		edac_device_free_ctl_info(..

	return rc;

> +}
> +
> +
> +static const struct of_device_id cortex_arm64_edac_of_match[] = {
> +	{ .compatible = "arm,armv8-edac" },
> +	{},

Dropping the comma here is good style because it forces us to add
a comma if we were to add an element after the sentinel,
hopefully causing us to question why we're doing that in the
first place.

> +};
> +MODULE_DEVICE_TABLE(of, cortex_arm64_edac_of_match);
> +
> +static struct platform_driver cortex_arm64_edac_driver = {
> +	.probe = cortex_arm64_edac_probe,
> +	.remove = cortex_arm64_edac_remove,
> +	.driver = {
> +		.name = "arm64-edac",
> +		.owner = THIS_MODULE,

platform_driver_register() sets this so we can drop this
assignment here.

> +		.of_match_table = cortex_arm64_edac_of_match,
> +	},
> +};
> +
> +static int __init cortex_arm64_edac_init(void)
> +{
> +	int rc;
> +
> +	/* Only POLL mode is supported so far */
> +	edac_op_state = EDAC_OPSTATE_POLL;
> +
> +	rc = platform_driver_register(&cortex_arm64_edac_driver);
> +	if (rc) {
> +		edac_printk(KERN_ERR, EDAC_MOD_STR, "failed to register\n");
> +		return rc;
> +	}
> +
> +	return 0;

This could be simplified to

	rc = ...
	if (rc)
		edac_printk(...

	return rc;

Or even just 'return platform_driver_register()' and not care
about printing a message in that case because the end-user can't
do anything with the message anyway.

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-23 16:58   ` Stephen Boyd
  0 siblings, 0 replies; 34+ messages in thread
From: Stephen Boyd @ 2015-10-23 16:58 UTC (permalink / raw)
  To: linux-arm-kernel

Drive by nitpicks

On 10/21, Brijesh Singh wrote:
> diff --git a/drivers/edac/cortex_arm64_edac.c b/drivers/edac/cortex_arm64_edac.c
> new file mode 100644
> index 0000000..c37bb94
> --- /dev/null
> +++ b/drivers/edac/cortex_arm64_edac.c
> +
> +#define L1_CACHE		     0
> +#define L2_CACHE		     1
> +
> +int poll_msec = 100;

static?

> +
> +struct cortex_arm64_edac {
> +	struct edac_device_ctl_info *edac_ctl;
> +};
[..]
> +
> +static int cortex_arm64_edac_probe(struct platform_device *pdev)
> +{
> +	int rc;
> +	struct cortex_arm64_edac *drv;
> +	struct device *dev = &pdev->dev;
> +
> +	drv = devm_kzalloc(dev, sizeof(*drv), GFP_KERNEL);
> +	if (!drv)
> +		return -ENOMEM;
> +
> +	drv->edac_ctl = edac_device_alloc_ctl_info(0, "cpu",
> +						   num_possible_cpus(), "L", 2,
> +						   1, NULL, 0,
> +						   edac_device_alloc_index());
> +	if (IS_ERR(drv->edac_ctl))
> +		return -ENOMEM;
> +
> +	drv->edac_ctl->poll_msec = poll_msec;
> +	drv->edac_ctl->edac_check = arm64_monitor_cache_errors;
> +	drv->edac_ctl->dev = dev;
> +	drv->edac_ctl->mod_name = dev_name(dev);
> +	drv->edac_ctl->dev_name = dev_name(dev);
> +	drv->edac_ctl->ctl_name = "cpu_err";
> +	drv->edac_ctl->panic_on_ue = 1;
> +	platform_set_drvdata(pdev, drv);
> +
> +	rc = edac_device_add_device(drv->edac_ctl);
> +	if (rc)
> +		goto edac_alloc_failed;
> +
> +	return 0;
> +
> +edac_alloc_failed:
> +	edac_device_free_ctl_info(drv->edac_ctl);
> +	return rc;

Simplify to:

	rc = edac_device_add_device(...
	if (rc)
		edac_device_free_ctl_info(..

	return rc;

> +}
> +
> +
> +static const struct of_device_id cortex_arm64_edac_of_match[] = {
> +	{ .compatible = "arm,armv8-edac" },
> +	{},

Dropping the comma here is good style because it forces us to add
a comma if we were to add an element after the sentinel,
hopefully causing us to question why we're doing that in the
first place.

> +};
> +MODULE_DEVICE_TABLE(of, cortex_arm64_edac_of_match);
> +
> +static struct platform_driver cortex_arm64_edac_driver = {
> +	.probe = cortex_arm64_edac_probe,
> +	.remove = cortex_arm64_edac_remove,
> +	.driver = {
> +		.name = "arm64-edac",
> +		.owner = THIS_MODULE,

platform_driver_register() sets this so we can drop this
assignment here.

> +		.of_match_table = cortex_arm64_edac_of_match,
> +	},
> +};
> +
> +static int __init cortex_arm64_edac_init(void)
> +{
> +	int rc;
> +
> +	/* Only POLL mode is supported so far */
> +	edac_op_state = EDAC_OPSTATE_POLL;
> +
> +	rc = platform_driver_register(&cortex_arm64_edac_driver);
> +	if (rc) {
> +		edac_printk(KERN_ERR, EDAC_MOD_STR, "failed to register\n");
> +		return rc;
> +	}
> +
> +	return 0;

This could be simplified to

	rc = ...
	if (rc)
		edac_printk(...

	return rc;

Or even just 'return platform_driver_register()' and not care
about printing a message in that case because the end-user can't
do anything with the message anyway.

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] EDAC: Add ARM64 EDAC
  2015-10-23  9:51         ` Andre Przywara
  (?)
@ 2015-10-23 17:58           ` Brijesh Singh
  -1 siblings, 0 replies; 34+ messages in thread
From: Brijesh Singh @ 2015-10-23 17:58 UTC (permalink / raw)
  To: Andre Przywara, Hanjun Guo, linux-arm-kernel
  Cc: brijeshkumar.singh, robh+dt, pawel.moll, mark.rutland,
	ijc+devicetree, galak, dougthompson, bp, mchehab, devicetree,
	arnd, linux-kernel, linux-edac, Huxinwei


> So I checked the x86 code: the driver is always loaded as soon as the
> hardware is there (looking at PCI device IDs from the on-chip
> northbridge, for instance). The trick here is to have the Kconfig option
> defaulting to "=n", so a kernel builder would have to explicitly enable
> this. Android or embedded kernels wouldn't do this, for instance, while
> a server distribution would do.
> If a user doesn't want to be bothered with the driver, there is always
> the possibility of blacklisting the module.
> Setting a system policy is IMHO out of scope for a DT binding.
> 
Will update Kconfig to make it n by default.

>>> * Its possible that other SoC's might handle single-bit and double-bit errors differently compare to 
>>>   Seattle platform. In Seattle platform both errors are handled by firmware but if other SoC 
>>>   wants OS to handle these errors then they might need DT binding to provide the irq numbers etc.
> 
> What do you mean exactly with "firmware handles these errors"?
> What would the firmware do? I guess just logging the error and then
> possibly reset the register? How would this change the driver?
> 
On Seattle platform SoC generates a interrupt on both single bit and double bit error
and that interrupt is handled by firmware, so we don't need to do anything in the driver.
Driver just need to poll registers to log correctable errors (because they do not generate interrupt).
This very driver is doing exactly what we want. DT binding is not required.

But Hanjun's comment on very first patch hinted me that there is possibility that
SoC generate a interrupt on single bit and double bit but firmware does not handle it.
In those cases driver will need be extended to handle interrupt.

I will submit v3 for review with DT binding removed. We can revisit DT binding need in future.

>> I totally agree with you here,  thanks for putting them together.
>> Different SoCs may handle the error in different ways, we need
>> bindings to specialize them, irq number is a good example :)
> 
> But how does this affect this very driver, polling just those two registers?
> Where would the interrupt come into the game here? Where is the proposed
> DT binding for that interrupt?
> 
> AFAICT EL3 firmware handling errors would just hide this information
> from the driver, so if the f/w decides to "handle" uncorrectable ECC
> errors in a fatal way, there is nothing the driver could do anyway, right?
> 
> Can you sketch a concrete example where we would actually need the
> driver to know about the firmware capabilities?
> 
> Cheers,
> Andre.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-23 17:58           ` Brijesh Singh
  0 siblings, 0 replies; 34+ messages in thread
From: Brijesh Singh @ 2015-10-23 17:58 UTC (permalink / raw)
  To: Andre Przywara, Hanjun Guo, linux-arm-kernel
  Cc: brijeshkumar.singh, robh+dt, pawel.moll, mark.rutland,
	ijc+devicetree, galak, dougthompson, bp, mchehab, devicetree,
	arnd, linux-kernel, linux-edac, Huxinwei


> So I checked the x86 code: the driver is always loaded as soon as the
> hardware is there (looking at PCI device IDs from the on-chip
> northbridge, for instance). The trick here is to have the Kconfig option
> defaulting to "=n", so a kernel builder would have to explicitly enable
> this. Android or embedded kernels wouldn't do this, for instance, while
> a server distribution would do.
> If a user doesn't want to be bothered with the driver, there is always
> the possibility of blacklisting the module.
> Setting a system policy is IMHO out of scope for a DT binding.
> 
Will update Kconfig to make it n by default.

>>> * Its possible that other SoC's might handle single-bit and double-bit errors differently compare to 
>>>   Seattle platform. In Seattle platform both errors are handled by firmware but if other SoC 
>>>   wants OS to handle these errors then they might need DT binding to provide the irq numbers etc.
> 
> What do you mean exactly with "firmware handles these errors"?
> What would the firmware do? I guess just logging the error and then
> possibly reset the register? How would this change the driver?
> 
On Seattle platform SoC generates a interrupt on both single bit and double bit error
and that interrupt is handled by firmware, so we don't need to do anything in the driver.
Driver just need to poll registers to log correctable errors (because they do not generate interrupt).
This very driver is doing exactly what we want. DT binding is not required.

But Hanjun's comment on very first patch hinted me that there is possibility that
SoC generate a interrupt on single bit and double bit but firmware does not handle it.
In those cases driver will need be extended to handle interrupt.

I will submit v3 for review with DT binding removed. We can revisit DT binding need in future.

>> I totally agree with you here,  thanks for putting them together.
>> Different SoCs may handle the error in different ways, we need
>> bindings to specialize them, irq number is a good example :)
> 
> But how does this affect this very driver, polling just those two registers?
> Where would the interrupt come into the game here? Where is the proposed
> DT binding for that interrupt?
> 
> AFAICT EL3 firmware handling errors would just hide this information
> from the driver, so if the f/w decides to "handle" uncorrectable ECC
> errors in a fatal way, there is nothing the driver could do anyway, right?
> 
> Can you sketch a concrete example where we would actually need the
> driver to know about the firmware capabilities?
> 
> Cheers,
> Andre.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-23 17:58           ` Brijesh Singh
  0 siblings, 0 replies; 34+ messages in thread
From: Brijesh Singh @ 2015-10-23 17:58 UTC (permalink / raw)
  To: linux-arm-kernel


> So I checked the x86 code: the driver is always loaded as soon as the
> hardware is there (looking at PCI device IDs from the on-chip
> northbridge, for instance). The trick here is to have the Kconfig option
> defaulting to "=n", so a kernel builder would have to explicitly enable
> this. Android or embedded kernels wouldn't do this, for instance, while
> a server distribution would do.
> If a user doesn't want to be bothered with the driver, there is always
> the possibility of blacklisting the module.
> Setting a system policy is IMHO out of scope for a DT binding.
> 
Will update Kconfig to make it n by default.

>>> * Its possible that other SoC's might handle single-bit and double-bit errors differently compare to 
>>>   Seattle platform. In Seattle platform both errors are handled by firmware but if other SoC 
>>>   wants OS to handle these errors then they might need DT binding to provide the irq numbers etc.
> 
> What do you mean exactly with "firmware handles these errors"?
> What would the firmware do? I guess just logging the error and then
> possibly reset the register? How would this change the driver?
> 
On Seattle platform SoC generates a interrupt on both single bit and double bit error
and that interrupt is handled by firmware, so we don't need to do anything in the driver.
Driver just need to poll registers to log correctable errors (because they do not generate interrupt).
This very driver is doing exactly what we want. DT binding is not required.

But Hanjun's comment on very first patch hinted me that there is possibility that
SoC generate a interrupt on single bit and double bit but firmware does not handle it.
In those cases driver will need be extended to handle interrupt.

I will submit v3 for review with DT binding removed. We can revisit DT binding need in future.

>> I totally agree with you here,  thanks for putting them together.
>> Different SoCs may handle the error in different ways, we need
>> bindings to specialize them, irq number is a good example :)
> 
> But how does this affect this very driver, polling just those two registers?
> Where would the interrupt come into the game here? Where is the proposed
> DT binding for that interrupt?
> 
> AFAICT EL3 firmware handling errors would just hide this information
> from the driver, so if the f/w decides to "handle" uncorrectable ECC
> errors in a fatal way, there is nothing the driver could do anyway, right?
> 
> Can you sketch a concrete example where we would actually need the
> driver to know about the firmware capabilities?
> 
> Cheers,
> Andre.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] EDAC: Add ARM64 EDAC
  2015-10-23 17:58           ` Brijesh Singh
  (?)
@ 2015-10-24  2:36             ` Hanjun Guo
  -1 siblings, 0 replies; 34+ messages in thread
From: Hanjun Guo @ 2015-10-24  2:36 UTC (permalink / raw)
  To: Brijesh Singh, Andre Przywara, linux-arm-kernel
  Cc: robh+dt, pawel.moll, mark.rutland, ijc+devicetree, galak,
	dougthompson, bp, mchehab, devicetree, arnd, linux-kernel,
	linux-edac, Huxinwei

On 2015/10/24 1:58, Brijesh Singh wrote:
>> So I checked the x86 code: the driver is always loaded as soon as the
>> hardware is there (looking at PCI device IDs from the on-chip
>> northbridge, for instance). The trick here is to have the Kconfig option
>> defaulting to "=n", so a kernel builder would have to explicitly enable
>> this. Android or embedded kernels wouldn't do this, for instance, while
>> a server distribution would do.
>> If a user doesn't want to be bothered with the driver, there is always
>> the possibility of blacklisting the module.
>> Setting a system policy is IMHO out of scope for a DT binding.
>>
> Will update Kconfig to make it n by default.
>
>>>> * Its possible that other SoC's might handle single-bit and double-bit errors differently compare to 
>>>>   Seattle platform. In Seattle platform both errors are handled by firmware but if other SoC 
>>>>   wants OS to handle these errors then they might need DT binding to provide the irq numbers etc.
>> What do you mean exactly with "firmware handles these errors"?
>> What would the firmware do? I guess just logging the error and then
>> possibly reset the register? How would this change the driver?
>>
> On Seattle platform SoC generates a interrupt on both single bit and double bit error
> and that interrupt is handled by firmware, so we don't need to do anything in the driver.
> Driver just need to poll registers to log correctable errors (because they do not generate interrupt).
> This very driver is doing exactly what we want. DT binding is not required.
>
> But Hanjun's comment on very first patch hinted me that there is possibility that
> SoC generate a interrupt on single bit and double bit but firmware does not handle it.
> In those cases driver will need be extended to handle interrupt.

yes, exactly.

>
> I will submit v3 for review with DT binding removed. We can revisit DT binding need in future.
>
>>> I totally agree with you here,  thanks for putting them together.
>>> Different SoCs may handle the error in different ways, we need
>>> bindings to specialize them, irq number is a good example :)
>> But how does this affect this very driver, polling just those two registers?
>> Where would the interrupt come into the game here? Where is the proposed
>> DT binding for that interrupt?
>>
>> AFAICT EL3 firmware handling errors would just hide this information
>> from the driver, so if the f/w decides to "handle" uncorrectable ECC
>> errors in a fatal way, there is nothing the driver could do anyway, right?

Yes, if EL3 firmware is involved, the driver don't need to handle such interrupt.

>>
>> Can you sketch a concrete example where we would actually need the
>> driver to know about the firmware capabilities?

So if firmware don't handle it, just like the APM xgene did in xgene_edac.c, we
need handle it in the driver, then DT bindings with irq number are needed.
You know, I'm working on ACPI and will enthusiastically encourage people using
APEI with firmware handle error first :) , but I think we can't rule out such
cases (driver handle the errors).

Thanks
Hanjun


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-24  2:36             ` Hanjun Guo
  0 siblings, 0 replies; 34+ messages in thread
From: Hanjun Guo @ 2015-10-24  2:36 UTC (permalink / raw)
  To: Brijesh Singh, Andre Przywara, linux-arm-kernel
  Cc: robh+dt, pawel.moll, mark.rutland, ijc+devicetree, galak,
	dougthompson, bp, mchehab, devicetree, arnd, linux-kernel,
	linux-edac, Huxinwei

On 2015/10/24 1:58, Brijesh Singh wrote:
>> So I checked the x86 code: the driver is always loaded as soon as the
>> hardware is there (looking at PCI device IDs from the on-chip
>> northbridge, for instance). The trick here is to have the Kconfig option
>> defaulting to "=n", so a kernel builder would have to explicitly enable
>> this. Android or embedded kernels wouldn't do this, for instance, while
>> a server distribution would do.
>> If a user doesn't want to be bothered with the driver, there is always
>> the possibility of blacklisting the module.
>> Setting a system policy is IMHO out of scope for a DT binding.
>>
> Will update Kconfig to make it n by default.
>
>>>> * Its possible that other SoC's might handle single-bit and double-bit errors differently compare to 
>>>>   Seattle platform. In Seattle platform both errors are handled by firmware but if other SoC 
>>>>   wants OS to handle these errors then they might need DT binding to provide the irq numbers etc.
>> What do you mean exactly with "firmware handles these errors"?
>> What would the firmware do? I guess just logging the error and then
>> possibly reset the register? How would this change the driver?
>>
> On Seattle platform SoC generates a interrupt on both single bit and double bit error
> and that interrupt is handled by firmware, so we don't need to do anything in the driver.
> Driver just need to poll registers to log correctable errors (because they do not generate interrupt).
> This very driver is doing exactly what we want. DT binding is not required.
>
> But Hanjun's comment on very first patch hinted me that there is possibility that
> SoC generate a interrupt on single bit and double bit but firmware does not handle it.
> In those cases driver will need be extended to handle interrupt.

yes, exactly.

>
> I will submit v3 for review with DT binding removed. We can revisit DT binding need in future.
>
>>> I totally agree with you here,  thanks for putting them together.
>>> Different SoCs may handle the error in different ways, we need
>>> bindings to specialize them, irq number is a good example :)
>> But how does this affect this very driver, polling just those two registers?
>> Where would the interrupt come into the game here? Where is the proposed
>> DT binding for that interrupt?
>>
>> AFAICT EL3 firmware handling errors would just hide this information
>> from the driver, so if the f/w decides to "handle" uncorrectable ECC
>> errors in a fatal way, there is nothing the driver could do anyway, right?

Yes, if EL3 firmware is involved, the driver don't need to handle such interrupt.

>>
>> Can you sketch a concrete example where we would actually need the
>> driver to know about the firmware capabilities?

So if firmware don't handle it, just like the APM xgene did in xgene_edac.c, we
need handle it in the driver, then DT bindings with irq number are needed.
You know, I'm working on ACPI and will enthusiastically encourage people using
APEI with firmware handle error first :) , but I think we can't rule out such
cases (driver handle the errors).

Thanks
Hanjun

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-24  2:36             ` Hanjun Guo
  0 siblings, 0 replies; 34+ messages in thread
From: Hanjun Guo @ 2015-10-24  2:36 UTC (permalink / raw)
  To: linux-arm-kernel

On 2015/10/24 1:58, Brijesh Singh wrote:
>> So I checked the x86 code: the driver is always loaded as soon as the
>> hardware is there (looking at PCI device IDs from the on-chip
>> northbridge, for instance). The trick here is to have the Kconfig option
>> defaulting to "=n", so a kernel builder would have to explicitly enable
>> this. Android or embedded kernels wouldn't do this, for instance, while
>> a server distribution would do.
>> If a user doesn't want to be bothered with the driver, there is always
>> the possibility of blacklisting the module.
>> Setting a system policy is IMHO out of scope for a DT binding.
>>
> Will update Kconfig to make it n by default.
>
>>>> * Its possible that other SoC's might handle single-bit and double-bit errors differently compare to 
>>>>   Seattle platform. In Seattle platform both errors are handled by firmware but if other SoC 
>>>>   wants OS to handle these errors then they might need DT binding to provide the irq numbers etc.
>> What do you mean exactly with "firmware handles these errors"?
>> What would the firmware do? I guess just logging the error and then
>> possibly reset the register? How would this change the driver?
>>
> On Seattle platform SoC generates a interrupt on both single bit and double bit error
> and that interrupt is handled by firmware, so we don't need to do anything in the driver.
> Driver just need to poll registers to log correctable errors (because they do not generate interrupt).
> This very driver is doing exactly what we want. DT binding is not required.
>
> But Hanjun's comment on very first patch hinted me that there is possibility that
> SoC generate a interrupt on single bit and double bit but firmware does not handle it.
> In those cases driver will need be extended to handle interrupt.

yes, exactly.

>
> I will submit v3 for review with DT binding removed. We can revisit DT binding need in future.
>
>>> I totally agree with you here,  thanks for putting them together.
>>> Different SoCs may handle the error in different ways, we need
>>> bindings to specialize them, irq number is a good example :)
>> But how does this affect this very driver, polling just those two registers?
>> Where would the interrupt come into the game here? Where is the proposed
>> DT binding for that interrupt?
>>
>> AFAICT EL3 firmware handling errors would just hide this information
>> from the driver, so if the f/w decides to "handle" uncorrectable ECC
>> errors in a fatal way, there is nothing the driver could do anyway, right?

Yes, if EL3 firmware is involved, the driver don't need to handle such interrupt.

>>
>> Can you sketch a concrete example where we would actually need the
>> driver to know about the firmware capabilities?

So if firmware don't handle it, just like the APM xgene did in xgene_edac.c, we
need handle it in the driver, then DT bindings with irq number are needed.
You know, I'm working on ACPI and will enthusiastically encourage people using
APEI with firmware handle error first :) , but I think we can't rule out such
cases (driver handle the errors).

Thanks
Hanjun

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-26 12:46   ` Mark Rutland
  0 siblings, 0 replies; 34+ messages in thread
From: Mark Rutland @ 2015-10-26 12:46 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: linux-arm-kernel, robh+dt, pawel.moll, ijc+devicetree, galak,
	dougthompson, bp, mchehab, devicetree, guohanjun, andre.przywara,
	arnd, linux-kernel, linux-edac

On Wed, Oct 21, 2015 at 03:41:37PM -0500, Brijesh Singh wrote:
> Add support for Cortex A57 and A53 EDAC driver.
> 
> Signed-off-by: Brijesh Singh <brijeshkumar.singh@amd.com>
> CC: robh+dt@kernel.org
> CC: pawel.moll@arm.com
> CC: mark.rutland@arm.com
> CC: ijc+devicetree@hellion.org.uk
> CC: galak@codeaurora.org
> CC: dougthompson@xmission.com
> CC: bp@alien8.de
> CC: mchehab@osg.samsung.com
> CC: devicetree@vger.kernel.org
> CC: guohanjun@huawei.com
> CC: andre.przywara@arm.com
> CC: arnd@arndb.de
> CC: linux-kernel@vger.kernel.org
> CC: linux-edac@vger.kernel.org
> ---
> 
> v2:
> * convert into generic arm64 edac driver
> * remove AMD specific references from dt binding
> * remove poll_msec property from dt binding
> * add poll_msec as a module param, default is 100ms
> * update copyright text
> * define macro mnemonics for L1 and L2 RAMID
> * check L2 error per-cluster instead of per core
> * update function names
> * use get_online_cpus() and put_online_cpus() to make L1 and L2 register 
>   read hotplug-safe
> * add error check in probe routine
> 
>  .../devicetree/bindings/edac/armv8-edac.txt        |  15 +
>  drivers/edac/Kconfig                               |   6 +
>  drivers/edac/Makefile                              |   1 +
>  drivers/edac/cortex_arm64_edac.c                   | 457 +++++++++++++++++++++
>  4 files changed, 479 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/edac/armv8-edac.txt
>  create mode 100644 drivers/edac/cortex_arm64_edac.c
> 
> diff --git a/Documentation/devicetree/bindings/edac/armv8-edac.txt b/Documentation/devicetree/bindings/edac/armv8-edac.txt
> new file mode 100644
> index 0000000..dfd128f
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/edac/armv8-edac.txt
> @@ -0,0 +1,15 @@
> +* ARMv8 L1/L2 cache error reporting
> +
> +On ARMv8, CPU Memory Error Syndrome Register and L2 Memory Error Syndrome
> +Register can be used for checking L1 and L2 memory errors.
> +
> +The following section describes the ARMv8 EDAC DT node binding.
> +

To counter my original point, I now believe that the MIDR alone is
woefully insufficient to detect if we can use this feature. That needs
to be described explicitly to us (e.g. via DT), or the feature needs to
be abstracted entirely (e.g. using APEI).

More on that below.

> +Required properties:
> +- compatible: Should be "arm,armv8-edac"
> +
> +Example:
> +	edac {
> +		compatible = "arm,armv8-edac";
> +	};

As I mentioned previously, this is _not_ an ARMv8 feature. It's not a
Cortex-A series feature (nor a Cortex series feature).

This is an IMPLEMENTATION DEFINED feature. If we need compatible
strings, we need one for each particular implementation, as we have for
the IMPLEMENTATION DEFINED PMU bindings (e.g. "arm,cortex-a57-pmu"). For
ACPI I expect vendors to implement APEI.

We also need to consider:

* big.LITTLE and/or multi-cluster
  - Describe _which_ CPUs have the feature
  - Describe the affinity of any interrupts
  - Do we need to describe cluster topology (i.e. which CPUs are in the
    same cluster)?

* Virtualization
  - HCR_EL2.TIDCP will trap access to this feature, and hypervisors will
    have to set this to prevent guests from corrupting the HW state. As
    far as I am aware, KVM and/or Xen will likely kill the guest in this
    case (e.g. by injecting an undefined instruction abort).

* Interaction with firmware
  - When/do we handle interrupts?
  - When is it valid to write back and clear an error? We should not do
    this behind the back of any firmware that owns the interface.

* Future CPU revisions.
  - The feature is IMPLEMENTATION DEFINED, and has no discoverability
    mechanism. We have no guarantee that future revisions of the CPUs
    currently supporting the feature will continue to support the
    feature and/or have a compatible interface. Handling this is
    painful.

We shy from using IMPLEMENTATION DEFINED features because of issues like
these. Ideally, this would all be left to firmware, and handled with a
generic interface like APEI.

Thanks,
Mark.

> +
> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
> index ef25000..dd7c195 100644
> --- a/drivers/edac/Kconfig
> +++ b/drivers/edac/Kconfig
> @@ -390,4 +390,10 @@ config EDAC_XGENE
>  	  Support for error detection and correction on the
>  	  APM X-Gene family of SOCs.
>  
> +config EDAC_CORTEX_ARM64
> +	tristate "ARM Cortex A57/A53"
> +	depends on EDAC_MM_EDAC && ARM64
> +	help
> +	  Support for error detection and correction on the
> +	  ARM Cortex A57 and A53.
>  endif # EDAC
> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
> index ae3c5f3..ac01660 100644
> --- a/drivers/edac/Makefile
> +++ b/drivers/edac/Makefile
> @@ -68,3 +68,4 @@ obj-$(CONFIG_EDAC_OCTEON_PCI)		+= octeon_edac-pci.o
>  obj-$(CONFIG_EDAC_ALTERA_MC)		+= altera_edac.o
>  obj-$(CONFIG_EDAC_SYNOPSYS)		+= synopsys_edac.o
>  obj-$(CONFIG_EDAC_XGENE)		+= xgene_edac.o
> +obj-$(CONFIG_EDAC_CORTEX_ARM64)		+= cortex_arm64_edac.o
> diff --git a/drivers/edac/cortex_arm64_edac.c b/drivers/edac/cortex_arm64_edac.c
> new file mode 100644
> index 0000000..c37bb94
> --- /dev/null
> +++ b/drivers/edac/cortex_arm64_edac.c
> @@ -0,0 +1,457 @@
> +/*
> + * Cortex ARM64 EDAC
> + *
> + * Copyright (c) 2015, Advanced Micro Devices
> + * Author: Brijesh Singh <brijeshkumar.singh@amd.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/of_device.h>
> +#include <linux/platform_device.h>
> +
> +#include "edac_core.h"
> +
> +#define EDAC_MOD_STR             "cortex_arm64_edac"
> +
> +#define A57_CPUMERRSR_EL1_INDEX(x)   ((x) & 0x1ffff)
> +#define A57_CPUMERRSR_EL1_BANK(x)    (((x) >> 18) & 0x1f)
> +#define A57_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
> +#define A57_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
> +#define A57_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0x7f)
> +#define A57_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
> +#define A57_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
> +#define A57_L1_I_TAG_RAM	     0x00
> +#define A57_L1_I_DATA_RAM	     0x01
> +#define A57_L1_D_TAG_RAM	     0x08
> +#define A57_L1_D_DATA_RAM	     0x09
> +#define A57_L1_TLB_RAM		     0x18
> +
> +#define A57_L2MERRSR_EL1_INDEX(x)    ((x) & 0x1ffff)
> +#define A57_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0xf)
> +#define A57_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
> +#define A57_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
> +#define A57_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
> +#define A57_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
> +#define A57_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
> +#define A57_L2_TAG_RAM		     0x10
> +#define A57_L2_DATA_RAM		     0x11
> +#define A57_L2_SNOOP_TAG_RAM	     0x12
> +#define A57_L2_DIRTY_RAM	     0x14
> +#define A57_L2_INCLUSION_PF_RAM      0x18
> +
> +#define A53_CPUMERRSR_EL1_ADDR(x)    ((x) & 0xfff)
> +#define A53_CPUMERRSR_EL1_CPUID(x)   (((x) >> 18) & 0x07)
> +#define A53_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
> +#define A53_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
> +#define A53_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0xff)
> +#define A53_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
> +#define A53_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
> +#define A53_L1_I_TAG_RAM	     0x00
> +#define A53_L1_I_DATA_RAM	     0x01
> +#define A53_L1_D_TAG_RAM	     0x08
> +#define A53_L1_D_DATA_RAM	     0x09
> +#define A53_L1_D_DIRT_RAM	     0x0A
> +#define A53_L1_TLB_RAM		     0x18
> +
> +#define A53_L2MERRSR_EL1_INDEX(x)    (((x) >> 3) & 0x3fff)
> +#define A53_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0x0f)
> +#define A53_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
> +#define A53_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
> +#define A53_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
> +#define A53_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
> +#define A53_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
> +#define A53_L2_TAG_RAM		     0x10
> +#define A53_L2_DATA_RAM		     0x11
> +#define A53_L2_SNOOP_RAM	     0x12
> +
> +#define L1_CACHE		     0
> +#define L2_CACHE		     1
> +
> +int poll_msec = 100;
> +
> +struct cortex_arm64_edac {
> +	struct edac_device_ctl_info *edac_ctl;
> +};
> +
> +static inline u64 read_cpumerrsr_el1(void)
> +{
> +	u64 val;
> +
> +	asm volatile("mrs %0, s3_1_c15_c2_2" : "=r" (val));
> +	return val;
> +}
> +
> +static inline void write_cpumerrsr_el1(u64 val)
> +{
> +	asm volatile("msr s3_1_c15_c2_2, %0" :: "r" (val));
> +}
> +
> +static inline u64 read_l2merrsr_el1(void)
> +{
> +	u64 val;
> +
> +	asm volatile("mrs %0, s3_1_c15_c2_3" : "=r" (val));
> +	return val;
> +}
> +
> +static inline void write_l2merrsr_el1(u64 val)
> +{
> +	asm volatile("msr s3_1_c15_c2_3, %0" :: "r" (val));
> +}
> +
> +static void a53_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_l2merrsr_el1();
> +
> +	if (!A53_L2MERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A53_L2MERRSR_EL1_FATAL(val);
> +	repeat_err = A53_L2MERRSR_EL1_REPEAT(val);
> +	other_err = A53_L2MERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "A53 CPU%d L2 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
> +
> +	switch (A53_L2MERRSR_EL1_RAMID(val)) {
> +	case A53_L2_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
> +		break;
> +	case A53_L2_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
> +		break;
> +	case A53_L2_SNOOP_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop filter RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	write_l2merrsr_el1(0);
> +}
> +
> +static void a57_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_l2merrsr_el1();
> +
> +	if (!A57_L2MERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A57_L2MERRSR_EL1_FATAL(val);
> +	repeat_err = A57_L2MERRSR_EL1_REPEAT(val);
> +	other_err = A57_L2MERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "A57 CPU%d L2 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
> +
> +	switch (A57_L2MERRSR_EL1_RAMID(val)) {
> +	case A57_L2_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
> +		break;
> +	case A57_L2_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
> +		break;
> +	case A57_L2_SNOOP_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop tag RAM\n");
> +		break;
> +	case A57_L2_DIRTY_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Dirty RAM\n");
> +		break;
> +	case A57_L2_INCLUSION_PF_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 inclusion PF RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	write_l2merrsr_el1(0);
> +}
> +
> +static void a57_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_cpumerrsr_el1();
> +
> +	if (!A57_CPUMERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A57_CPUMERRSR_EL1_FATAL(val);
> +	repeat_err = A57_CPUMERRSR_EL1_REPEAT(val);
> +	other_err = A57_CPUMERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "CPU%d L1 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
> +
> +	switch (A57_CPUMERRSR_EL1_RAMID(val)) {
> +	case A57_L1_I_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
> +		break;
> +	case A57_L1_I_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
> +		break;
> +	case A57_L1_D_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
> +		break;
> +	case A57_L1_D_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
> +		break;
> +	case A57_L1_TLB_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	write_cpumerrsr_el1(0);
> +}
> +
> +static void a53_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_cpumerrsr_el1();
> +
> +	if (!A53_CPUMERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A53_CPUMERRSR_EL1_FATAL(val);
> +	repeat_err = A53_CPUMERRSR_EL1_REPEAT(val);
> +	other_err = A53_CPUMERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "A53 CPU%d L1 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
> +
> +	switch (A53_CPUMERRSR_EL1_RAMID(val)) {
> +	case A53_L1_I_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
> +		break;
> +	case A53_L1_I_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
> +		break;
> +	case A53_L1_D_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
> +		break;
> +	case A53_L1_D_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
> +		break;
> +	case A53_L1_TLB_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	write_cpumerrsr_el1(0);
> +}
> +
> +static void parse_cpumerrsr(void *args)
> +{
> +	struct edac_device_ctl_info *edac_ctl = args;
> +	int partnum = read_cpuid_part_number();
> +
> +	switch (partnum) {
> +	case ARM_CPU_PART_CORTEX_A57:
> +		a57_parse_cpumerrsr(edac_ctl);
> +		break;
> +	case ARM_CPU_PART_CORTEX_A53:
> +		a53_parse_cpumerrsr(edac_ctl);
> +		break;
> +	}
> +}
> +
> +static void parse_l2merrsr(void *args)
> +{
> +	struct edac_device_ctl_info *edac_ctl = args;
> +	int partnum = read_cpuid_part_number();
> +
> +	switch (partnum) {
> +	case ARM_CPU_PART_CORTEX_A57:
> +		a57_parse_l2merrsr(edac_ctl);
> +		break;
> +	case ARM_CPU_PART_CORTEX_A53:
> +		a53_parse_l2merrsr(edac_ctl);
> +		break;
> +	}
> +}
> +
> +static void arm64_monitor_cache_errors(struct edac_device_ctl_info *edev_ctl)
> +{
> +	int cpu;
> +	struct cpumask cluster_mask, old_mask;
> +
> +	cpumask_clear(&cluster_mask);
> +	cpumask_clear(&old_mask);
> +
> +	get_online_cpus();
> +	for_each_online_cpu(cpu) {
> +		smp_call_function_single(cpu, parse_cpumerrsr, edev_ctl, 0);
> +		cpumask_copy(&cluster_mask, topology_core_cpumask(cpu));
> +		if (cpumask_equal(&cluster_mask, &old_mask))
> +			continue;
> +		cpumask_copy(&old_mask, &cluster_mask);
> +		smp_call_function_any(&cluster_mask, parse_l2merrsr,
> +				      edev_ctl, 0);
> +	}
> +	put_online_cpus();
> +}
> +
> +static int cortex_arm64_edac_probe(struct platform_device *pdev)
> +{
> +	int rc;
> +	struct cortex_arm64_edac *drv;
> +	struct device *dev = &pdev->dev;
> +
> +	drv = devm_kzalloc(dev, sizeof(*drv), GFP_KERNEL);
> +	if (!drv)
> +		return -ENOMEM;
> +
> +	drv->edac_ctl = edac_device_alloc_ctl_info(0, "cpu",
> +						   num_possible_cpus(), "L", 2,
> +						   1, NULL, 0,
> +						   edac_device_alloc_index());
> +	if (IS_ERR(drv->edac_ctl))
> +		return -ENOMEM;
> +
> +	drv->edac_ctl->poll_msec = poll_msec;
> +	drv->edac_ctl->edac_check = arm64_monitor_cache_errors;
> +	drv->edac_ctl->dev = dev;
> +	drv->edac_ctl->mod_name = dev_name(dev);
> +	drv->edac_ctl->dev_name = dev_name(dev);
> +	drv->edac_ctl->ctl_name = "cpu_err";
> +	drv->edac_ctl->panic_on_ue = 1;
> +	platform_set_drvdata(pdev, drv);
> +
> +	rc = edac_device_add_device(drv->edac_ctl);
> +	if (rc)
> +		goto edac_alloc_failed;
> +
> +	return 0;
> +
> +edac_alloc_failed:
> +	edac_device_free_ctl_info(drv->edac_ctl);
> +	return rc;
> +}
> +
> +static int cortex_arm64_edac_remove(struct platform_device *pdev)
> +{
> +	struct cortex_arm64_edac *drv = dev_get_drvdata(&pdev->dev);
> +	struct edac_device_ctl_info *edac_ctl = drv->edac_ctl;
> +
> +	edac_device_del_device(edac_ctl->dev);
> +	edac_device_free_ctl_info(edac_ctl);
> +
> +	return 0;
> +}
> +
> +static const struct of_device_id cortex_arm64_edac_of_match[] = {
> +	{ .compatible = "arm,armv8-edac" },
> +	{},
> +};
> +MODULE_DEVICE_TABLE(of, cortex_arm64_edac_of_match);
> +
> +static struct platform_driver cortex_arm64_edac_driver = {
> +	.probe = cortex_arm64_edac_probe,
> +	.remove = cortex_arm64_edac_remove,
> +	.driver = {
> +		.name = "arm64-edac",
> +		.owner = THIS_MODULE,
> +		.of_match_table = cortex_arm64_edac_of_match,
> +	},
> +};
> +
> +static int __init cortex_arm64_edac_init(void)
> +{
> +	int rc;
> +
> +	/* Only POLL mode is supported so far */
> +	edac_op_state = EDAC_OPSTATE_POLL;
> +
> +	rc = platform_driver_register(&cortex_arm64_edac_driver);
> +	if (rc) {
> +		edac_printk(KERN_ERR, EDAC_MOD_STR, "failed to register\n");
> +		return rc;
> +	}
> +
> +	return 0;
> +}
> +module_init(cortex_arm64_edac_init);
> +
> +static void __exit cortex_arm64_edac_exit(void)
> +{
> +	platform_driver_unregister(&cortex_arm64_edac_driver);
> +}
> +module_exit(cortex_arm64_edac_exit);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Brijesh Singh <brijeshkumar.singh@amd.com>");
> +MODULE_DESCRIPTION("Cortex A57 and A53 EDAC driver");
> +module_param(poll_msec, int, 0444);
> +MODULE_PARM_DESC(poll_msec, "EDAC monitor poll interval in msec");
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-26 12:46   ` Mark Rutland
  0 siblings, 0 replies; 34+ messages in thread
From: Mark Rutland @ 2015-10-26 12:46 UTC (permalink / raw)
  To: Brijesh Singh
  Cc: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	robh+dt-DgEjT+Ai2ygdnm+yROfE0A, pawel.moll-5wv7dgnIgG8,
	ijc+devicetree-KcIKpvwj1kUDXYZnReoRVg,
	galak-sgV2jX0FEOL9JmXXK+q4OQ,
	dougthompson-aS9lmoZGLiVWk0Htik3J/w, bp-Gina5bIWoIWzQB+pC5nmwQ,
	mchehab-JPH+aEBZ4P+UEJcrhfAQsw,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	guohanjun-hv44wF8Li93QT0dZR+AlfA, andre.przywara-5wv7dgnIgG8,
	arnd-r2nGTMty4D4, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-edac-u79uwXL29TY76Z2rM5mHXA

On Wed, Oct 21, 2015 at 03:41:37PM -0500, Brijesh Singh wrote:
> Add support for Cortex A57 and A53 EDAC driver.
> 
> Signed-off-by: Brijesh Singh <brijeshkumar.singh-5C7GfCeVMHo@public.gmane.org>
> CC: robh+dt-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
> CC: pawel.moll-5wv7dgnIgG8@public.gmane.org
> CC: mark.rutland-5wv7dgnIgG8@public.gmane.org
> CC: ijc+devicetree-KcIKpvwj1kUDXYZnReoRVg@public.gmane.org
> CC: galak-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org
> CC: dougthompson-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org
> CC: bp-Gina5bIWoIWzQB+pC5nmwQ@public.gmane.org
> CC: mchehab-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org
> CC: devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> CC: guohanjun-hv44wF8Li93QT0dZR+AlfA@public.gmane.org
> CC: andre.przywara-5wv7dgnIgG8@public.gmane.org
> CC: arnd-r2nGTMty4D4@public.gmane.org
> CC: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> CC: linux-edac-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> ---
> 
> v2:
> * convert into generic arm64 edac driver
> * remove AMD specific references from dt binding
> * remove poll_msec property from dt binding
> * add poll_msec as a module param, default is 100ms
> * update copyright text
> * define macro mnemonics for L1 and L2 RAMID
> * check L2 error per-cluster instead of per core
> * update function names
> * use get_online_cpus() and put_online_cpus() to make L1 and L2 register 
>   read hotplug-safe
> * add error check in probe routine
> 
>  .../devicetree/bindings/edac/armv8-edac.txt        |  15 +
>  drivers/edac/Kconfig                               |   6 +
>  drivers/edac/Makefile                              |   1 +
>  drivers/edac/cortex_arm64_edac.c                   | 457 +++++++++++++++++++++
>  4 files changed, 479 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/edac/armv8-edac.txt
>  create mode 100644 drivers/edac/cortex_arm64_edac.c
> 
> diff --git a/Documentation/devicetree/bindings/edac/armv8-edac.txt b/Documentation/devicetree/bindings/edac/armv8-edac.txt
> new file mode 100644
> index 0000000..dfd128f
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/edac/armv8-edac.txt
> @@ -0,0 +1,15 @@
> +* ARMv8 L1/L2 cache error reporting
> +
> +On ARMv8, CPU Memory Error Syndrome Register and L2 Memory Error Syndrome
> +Register can be used for checking L1 and L2 memory errors.
> +
> +The following section describes the ARMv8 EDAC DT node binding.
> +

To counter my original point, I now believe that the MIDR alone is
woefully insufficient to detect if we can use this feature. That needs
to be described explicitly to us (e.g. via DT), or the feature needs to
be abstracted entirely (e.g. using APEI).

More on that below.

> +Required properties:
> +- compatible: Should be "arm,armv8-edac"
> +
> +Example:
> +	edac {
> +		compatible = "arm,armv8-edac";
> +	};

As I mentioned previously, this is _not_ an ARMv8 feature. It's not a
Cortex-A series feature (nor a Cortex series feature).

This is an IMPLEMENTATION DEFINED feature. If we need compatible
strings, we need one for each particular implementation, as we have for
the IMPLEMENTATION DEFINED PMU bindings (e.g. "arm,cortex-a57-pmu"). For
ACPI I expect vendors to implement APEI.

We also need to consider:

* big.LITTLE and/or multi-cluster
  - Describe _which_ CPUs have the feature
  - Describe the affinity of any interrupts
  - Do we need to describe cluster topology (i.e. which CPUs are in the
    same cluster)?

* Virtualization
  - HCR_EL2.TIDCP will trap access to this feature, and hypervisors will
    have to set this to prevent guests from corrupting the HW state. As
    far as I am aware, KVM and/or Xen will likely kill the guest in this
    case (e.g. by injecting an undefined instruction abort).

* Interaction with firmware
  - When/do we handle interrupts?
  - When is it valid to write back and clear an error? We should not do
    this behind the back of any firmware that owns the interface.

* Future CPU revisions.
  - The feature is IMPLEMENTATION DEFINED, and has no discoverability
    mechanism. We have no guarantee that future revisions of the CPUs
    currently supporting the feature will continue to support the
    feature and/or have a compatible interface. Handling this is
    painful.

We shy from using IMPLEMENTATION DEFINED features because of issues like
these. Ideally, this would all be left to firmware, and handled with a
generic interface like APEI.

Thanks,
Mark.

> +
> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
> index ef25000..dd7c195 100644
> --- a/drivers/edac/Kconfig
> +++ b/drivers/edac/Kconfig
> @@ -390,4 +390,10 @@ config EDAC_XGENE
>  	  Support for error detection and correction on the
>  	  APM X-Gene family of SOCs.
>  
> +config EDAC_CORTEX_ARM64
> +	tristate "ARM Cortex A57/A53"
> +	depends on EDAC_MM_EDAC && ARM64
> +	help
> +	  Support for error detection and correction on the
> +	  ARM Cortex A57 and A53.
>  endif # EDAC
> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
> index ae3c5f3..ac01660 100644
> --- a/drivers/edac/Makefile
> +++ b/drivers/edac/Makefile
> @@ -68,3 +68,4 @@ obj-$(CONFIG_EDAC_OCTEON_PCI)		+= octeon_edac-pci.o
>  obj-$(CONFIG_EDAC_ALTERA_MC)		+= altera_edac.o
>  obj-$(CONFIG_EDAC_SYNOPSYS)		+= synopsys_edac.o
>  obj-$(CONFIG_EDAC_XGENE)		+= xgene_edac.o
> +obj-$(CONFIG_EDAC_CORTEX_ARM64)		+= cortex_arm64_edac.o
> diff --git a/drivers/edac/cortex_arm64_edac.c b/drivers/edac/cortex_arm64_edac.c
> new file mode 100644
> index 0000000..c37bb94
> --- /dev/null
> +++ b/drivers/edac/cortex_arm64_edac.c
> @@ -0,0 +1,457 @@
> +/*
> + * Cortex ARM64 EDAC
> + *
> + * Copyright (c) 2015, Advanced Micro Devices
> + * Author: Brijesh Singh <brijeshkumar.singh-5C7GfCeVMHo@public.gmane.org>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/of_device.h>
> +#include <linux/platform_device.h>
> +
> +#include "edac_core.h"
> +
> +#define EDAC_MOD_STR             "cortex_arm64_edac"
> +
> +#define A57_CPUMERRSR_EL1_INDEX(x)   ((x) & 0x1ffff)
> +#define A57_CPUMERRSR_EL1_BANK(x)    (((x) >> 18) & 0x1f)
> +#define A57_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
> +#define A57_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
> +#define A57_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0x7f)
> +#define A57_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
> +#define A57_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
> +#define A57_L1_I_TAG_RAM	     0x00
> +#define A57_L1_I_DATA_RAM	     0x01
> +#define A57_L1_D_TAG_RAM	     0x08
> +#define A57_L1_D_DATA_RAM	     0x09
> +#define A57_L1_TLB_RAM		     0x18
> +
> +#define A57_L2MERRSR_EL1_INDEX(x)    ((x) & 0x1ffff)
> +#define A57_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0xf)
> +#define A57_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
> +#define A57_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
> +#define A57_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
> +#define A57_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
> +#define A57_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
> +#define A57_L2_TAG_RAM		     0x10
> +#define A57_L2_DATA_RAM		     0x11
> +#define A57_L2_SNOOP_TAG_RAM	     0x12
> +#define A57_L2_DIRTY_RAM	     0x14
> +#define A57_L2_INCLUSION_PF_RAM      0x18
> +
> +#define A53_CPUMERRSR_EL1_ADDR(x)    ((x) & 0xfff)
> +#define A53_CPUMERRSR_EL1_CPUID(x)   (((x) >> 18) & 0x07)
> +#define A53_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
> +#define A53_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
> +#define A53_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0xff)
> +#define A53_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
> +#define A53_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
> +#define A53_L1_I_TAG_RAM	     0x00
> +#define A53_L1_I_DATA_RAM	     0x01
> +#define A53_L1_D_TAG_RAM	     0x08
> +#define A53_L1_D_DATA_RAM	     0x09
> +#define A53_L1_D_DIRT_RAM	     0x0A
> +#define A53_L1_TLB_RAM		     0x18
> +
> +#define A53_L2MERRSR_EL1_INDEX(x)    (((x) >> 3) & 0x3fff)
> +#define A53_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0x0f)
> +#define A53_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
> +#define A53_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
> +#define A53_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
> +#define A53_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
> +#define A53_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
> +#define A53_L2_TAG_RAM		     0x10
> +#define A53_L2_DATA_RAM		     0x11
> +#define A53_L2_SNOOP_RAM	     0x12
> +
> +#define L1_CACHE		     0
> +#define L2_CACHE		     1
> +
> +int poll_msec = 100;
> +
> +struct cortex_arm64_edac {
> +	struct edac_device_ctl_info *edac_ctl;
> +};
> +
> +static inline u64 read_cpumerrsr_el1(void)
> +{
> +	u64 val;
> +
> +	asm volatile("mrs %0, s3_1_c15_c2_2" : "=r" (val));
> +	return val;
> +}
> +
> +static inline void write_cpumerrsr_el1(u64 val)
> +{
> +	asm volatile("msr s3_1_c15_c2_2, %0" :: "r" (val));
> +}
> +
> +static inline u64 read_l2merrsr_el1(void)
> +{
> +	u64 val;
> +
> +	asm volatile("mrs %0, s3_1_c15_c2_3" : "=r" (val));
> +	return val;
> +}
> +
> +static inline void write_l2merrsr_el1(u64 val)
> +{
> +	asm volatile("msr s3_1_c15_c2_3, %0" :: "r" (val));
> +}
> +
> +static void a53_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_l2merrsr_el1();
> +
> +	if (!A53_L2MERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A53_L2MERRSR_EL1_FATAL(val);
> +	repeat_err = A53_L2MERRSR_EL1_REPEAT(val);
> +	other_err = A53_L2MERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "A53 CPU%d L2 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
> +
> +	switch (A53_L2MERRSR_EL1_RAMID(val)) {
> +	case A53_L2_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
> +		break;
> +	case A53_L2_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
> +		break;
> +	case A53_L2_SNOOP_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop filter RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	write_l2merrsr_el1(0);
> +}
> +
> +static void a57_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_l2merrsr_el1();
> +
> +	if (!A57_L2MERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A57_L2MERRSR_EL1_FATAL(val);
> +	repeat_err = A57_L2MERRSR_EL1_REPEAT(val);
> +	other_err = A57_L2MERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "A57 CPU%d L2 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
> +
> +	switch (A57_L2MERRSR_EL1_RAMID(val)) {
> +	case A57_L2_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
> +		break;
> +	case A57_L2_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
> +		break;
> +	case A57_L2_SNOOP_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop tag RAM\n");
> +		break;
> +	case A57_L2_DIRTY_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Dirty RAM\n");
> +		break;
> +	case A57_L2_INCLUSION_PF_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 inclusion PF RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	write_l2merrsr_el1(0);
> +}
> +
> +static void a57_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_cpumerrsr_el1();
> +
> +	if (!A57_CPUMERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A57_CPUMERRSR_EL1_FATAL(val);
> +	repeat_err = A57_CPUMERRSR_EL1_REPEAT(val);
> +	other_err = A57_CPUMERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "CPU%d L1 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
> +
> +	switch (A57_CPUMERRSR_EL1_RAMID(val)) {
> +	case A57_L1_I_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
> +		break;
> +	case A57_L1_I_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
> +		break;
> +	case A57_L1_D_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
> +		break;
> +	case A57_L1_D_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
> +		break;
> +	case A57_L1_TLB_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	write_cpumerrsr_el1(0);
> +}
> +
> +static void a53_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_cpumerrsr_el1();
> +
> +	if (!A53_CPUMERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A53_CPUMERRSR_EL1_FATAL(val);
> +	repeat_err = A53_CPUMERRSR_EL1_REPEAT(val);
> +	other_err = A53_CPUMERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "A53 CPU%d L1 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
> +
> +	switch (A53_CPUMERRSR_EL1_RAMID(val)) {
> +	case A53_L1_I_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
> +		break;
> +	case A53_L1_I_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
> +		break;
> +	case A53_L1_D_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
> +		break;
> +	case A53_L1_D_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
> +		break;
> +	case A53_L1_TLB_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	write_cpumerrsr_el1(0);
> +}
> +
> +static void parse_cpumerrsr(void *args)
> +{
> +	struct edac_device_ctl_info *edac_ctl = args;
> +	int partnum = read_cpuid_part_number();
> +
> +	switch (partnum) {
> +	case ARM_CPU_PART_CORTEX_A57:
> +		a57_parse_cpumerrsr(edac_ctl);
> +		break;
> +	case ARM_CPU_PART_CORTEX_A53:
> +		a53_parse_cpumerrsr(edac_ctl);
> +		break;
> +	}
> +}
> +
> +static void parse_l2merrsr(void *args)
> +{
> +	struct edac_device_ctl_info *edac_ctl = args;
> +	int partnum = read_cpuid_part_number();
> +
> +	switch (partnum) {
> +	case ARM_CPU_PART_CORTEX_A57:
> +		a57_parse_l2merrsr(edac_ctl);
> +		break;
> +	case ARM_CPU_PART_CORTEX_A53:
> +		a53_parse_l2merrsr(edac_ctl);
> +		break;
> +	}
> +}
> +
> +static void arm64_monitor_cache_errors(struct edac_device_ctl_info *edev_ctl)
> +{
> +	int cpu;
> +	struct cpumask cluster_mask, old_mask;
> +
> +	cpumask_clear(&cluster_mask);
> +	cpumask_clear(&old_mask);
> +
> +	get_online_cpus();
> +	for_each_online_cpu(cpu) {
> +		smp_call_function_single(cpu, parse_cpumerrsr, edev_ctl, 0);
> +		cpumask_copy(&cluster_mask, topology_core_cpumask(cpu));
> +		if (cpumask_equal(&cluster_mask, &old_mask))
> +			continue;
> +		cpumask_copy(&old_mask, &cluster_mask);
> +		smp_call_function_any(&cluster_mask, parse_l2merrsr,
> +				      edev_ctl, 0);
> +	}
> +	put_online_cpus();
> +}
> +
> +static int cortex_arm64_edac_probe(struct platform_device *pdev)
> +{
> +	int rc;
> +	struct cortex_arm64_edac *drv;
> +	struct device *dev = &pdev->dev;
> +
> +	drv = devm_kzalloc(dev, sizeof(*drv), GFP_KERNEL);
> +	if (!drv)
> +		return -ENOMEM;
> +
> +	drv->edac_ctl = edac_device_alloc_ctl_info(0, "cpu",
> +						   num_possible_cpus(), "L", 2,
> +						   1, NULL, 0,
> +						   edac_device_alloc_index());
> +	if (IS_ERR(drv->edac_ctl))
> +		return -ENOMEM;
> +
> +	drv->edac_ctl->poll_msec = poll_msec;
> +	drv->edac_ctl->edac_check = arm64_monitor_cache_errors;
> +	drv->edac_ctl->dev = dev;
> +	drv->edac_ctl->mod_name = dev_name(dev);
> +	drv->edac_ctl->dev_name = dev_name(dev);
> +	drv->edac_ctl->ctl_name = "cpu_err";
> +	drv->edac_ctl->panic_on_ue = 1;
> +	platform_set_drvdata(pdev, drv);
> +
> +	rc = edac_device_add_device(drv->edac_ctl);
> +	if (rc)
> +		goto edac_alloc_failed;
> +
> +	return 0;
> +
> +edac_alloc_failed:
> +	edac_device_free_ctl_info(drv->edac_ctl);
> +	return rc;
> +}
> +
> +static int cortex_arm64_edac_remove(struct platform_device *pdev)
> +{
> +	struct cortex_arm64_edac *drv = dev_get_drvdata(&pdev->dev);
> +	struct edac_device_ctl_info *edac_ctl = drv->edac_ctl;
> +
> +	edac_device_del_device(edac_ctl->dev);
> +	edac_device_free_ctl_info(edac_ctl);
> +
> +	return 0;
> +}
> +
> +static const struct of_device_id cortex_arm64_edac_of_match[] = {
> +	{ .compatible = "arm,armv8-edac" },
> +	{},
> +};
> +MODULE_DEVICE_TABLE(of, cortex_arm64_edac_of_match);
> +
> +static struct platform_driver cortex_arm64_edac_driver = {
> +	.probe = cortex_arm64_edac_probe,
> +	.remove = cortex_arm64_edac_remove,
> +	.driver = {
> +		.name = "arm64-edac",
> +		.owner = THIS_MODULE,
> +		.of_match_table = cortex_arm64_edac_of_match,
> +	},
> +};
> +
> +static int __init cortex_arm64_edac_init(void)
> +{
> +	int rc;
> +
> +	/* Only POLL mode is supported so far */
> +	edac_op_state = EDAC_OPSTATE_POLL;
> +
> +	rc = platform_driver_register(&cortex_arm64_edac_driver);
> +	if (rc) {
> +		edac_printk(KERN_ERR, EDAC_MOD_STR, "failed to register\n");
> +		return rc;
> +	}
> +
> +	return 0;
> +}
> +module_init(cortex_arm64_edac_init);
> +
> +static void __exit cortex_arm64_edac_exit(void)
> +{
> +	platform_driver_unregister(&cortex_arm64_edac_driver);
> +}
> +module_exit(cortex_arm64_edac_exit);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Brijesh Singh <brijeshkumar.singh-5C7GfCeVMHo@public.gmane.org>");
> +MODULE_DESCRIPTION("Cortex A57 and A53 EDAC driver");
> +module_param(poll_msec, int, 0444);
> +MODULE_PARM_DESC(poll_msec, "EDAC monitor poll interval in msec");
> -- 
> 1.9.1
> 
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-26 12:46   ` Mark Rutland
  0 siblings, 0 replies; 34+ messages in thread
From: Mark Rutland @ 2015-10-26 12:46 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Oct 21, 2015 at 03:41:37PM -0500, Brijesh Singh wrote:
> Add support for Cortex A57 and A53 EDAC driver.
> 
> Signed-off-by: Brijesh Singh <brijeshkumar.singh@amd.com>
> CC: robh+dt at kernel.org
> CC: pawel.moll at arm.com
> CC: mark.rutland at arm.com
> CC: ijc+devicetree at hellion.org.uk
> CC: galak at codeaurora.org
> CC: dougthompson at xmission.com
> CC: bp at alien8.de
> CC: mchehab at osg.samsung.com
> CC: devicetree at vger.kernel.org
> CC: guohanjun at huawei.com
> CC: andre.przywara at arm.com
> CC: arnd at arndb.de
> CC: linux-kernel at vger.kernel.org
> CC: linux-edac at vger.kernel.org
> ---
> 
> v2:
> * convert into generic arm64 edac driver
> * remove AMD specific references from dt binding
> * remove poll_msec property from dt binding
> * add poll_msec as a module param, default is 100ms
> * update copyright text
> * define macro mnemonics for L1 and L2 RAMID
> * check L2 error per-cluster instead of per core
> * update function names
> * use get_online_cpus() and put_online_cpus() to make L1 and L2 register 
>   read hotplug-safe
> * add error check in probe routine
> 
>  .../devicetree/bindings/edac/armv8-edac.txt        |  15 +
>  drivers/edac/Kconfig                               |   6 +
>  drivers/edac/Makefile                              |   1 +
>  drivers/edac/cortex_arm64_edac.c                   | 457 +++++++++++++++++++++
>  4 files changed, 479 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/edac/armv8-edac.txt
>  create mode 100644 drivers/edac/cortex_arm64_edac.c
> 
> diff --git a/Documentation/devicetree/bindings/edac/armv8-edac.txt b/Documentation/devicetree/bindings/edac/armv8-edac.txt
> new file mode 100644
> index 0000000..dfd128f
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/edac/armv8-edac.txt
> @@ -0,0 +1,15 @@
> +* ARMv8 L1/L2 cache error reporting
> +
> +On ARMv8, CPU Memory Error Syndrome Register and L2 Memory Error Syndrome
> +Register can be used for checking L1 and L2 memory errors.
> +
> +The following section describes the ARMv8 EDAC DT node binding.
> +

To counter my original point, I now believe that the MIDR alone is
woefully insufficient to detect if we can use this feature. That needs
to be described explicitly to us (e.g. via DT), or the feature needs to
be abstracted entirely (e.g. using APEI).

More on that below.

> +Required properties:
> +- compatible: Should be "arm,armv8-edac"
> +
> +Example:
> +	edac {
> +		compatible = "arm,armv8-edac";
> +	};

As I mentioned previously, this is _not_ an ARMv8 feature. It's not a
Cortex-A series feature (nor a Cortex series feature).

This is an IMPLEMENTATION DEFINED feature. If we need compatible
strings, we need one for each particular implementation, as we have for
the IMPLEMENTATION DEFINED PMU bindings (e.g. "arm,cortex-a57-pmu"). For
ACPI I expect vendors to implement APEI.

We also need to consider:

* big.LITTLE and/or multi-cluster
  - Describe _which_ CPUs have the feature
  - Describe the affinity of any interrupts
  - Do we need to describe cluster topology (i.e. which CPUs are in the
    same cluster)?

* Virtualization
  - HCR_EL2.TIDCP will trap access to this feature, and hypervisors will
    have to set this to prevent guests from corrupting the HW state. As
    far as I am aware, KVM and/or Xen will likely kill the guest in this
    case (e.g. by injecting an undefined instruction abort).

* Interaction with firmware
  - When/do we handle interrupts?
  - When is it valid to write back and clear an error? We should not do
    this behind the back of any firmware that owns the interface.

* Future CPU revisions.
  - The feature is IMPLEMENTATION DEFINED, and has no discoverability
    mechanism. We have no guarantee that future revisions of the CPUs
    currently supporting the feature will continue to support the
    feature and/or have a compatible interface. Handling this is
    painful.

We shy from using IMPLEMENTATION DEFINED features because of issues like
these. Ideally, this would all be left to firmware, and handled with a
generic interface like APEI.

Thanks,
Mark.

> +
> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
> index ef25000..dd7c195 100644
> --- a/drivers/edac/Kconfig
> +++ b/drivers/edac/Kconfig
> @@ -390,4 +390,10 @@ config EDAC_XGENE
>  	  Support for error detection and correction on the
>  	  APM X-Gene family of SOCs.
>  
> +config EDAC_CORTEX_ARM64
> +	tristate "ARM Cortex A57/A53"
> +	depends on EDAC_MM_EDAC && ARM64
> +	help
> +	  Support for error detection and correction on the
> +	  ARM Cortex A57 and A53.
>  endif # EDAC
> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
> index ae3c5f3..ac01660 100644
> --- a/drivers/edac/Makefile
> +++ b/drivers/edac/Makefile
> @@ -68,3 +68,4 @@ obj-$(CONFIG_EDAC_OCTEON_PCI)		+= octeon_edac-pci.o
>  obj-$(CONFIG_EDAC_ALTERA_MC)		+= altera_edac.o
>  obj-$(CONFIG_EDAC_SYNOPSYS)		+= synopsys_edac.o
>  obj-$(CONFIG_EDAC_XGENE)		+= xgene_edac.o
> +obj-$(CONFIG_EDAC_CORTEX_ARM64)		+= cortex_arm64_edac.o
> diff --git a/drivers/edac/cortex_arm64_edac.c b/drivers/edac/cortex_arm64_edac.c
> new file mode 100644
> index 0000000..c37bb94
> --- /dev/null
> +++ b/drivers/edac/cortex_arm64_edac.c
> @@ -0,0 +1,457 @@
> +/*
> + * Cortex ARM64 EDAC
> + *
> + * Copyright (c) 2015, Advanced Micro Devices
> + * Author: Brijesh Singh <brijeshkumar.singh@amd.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/of_device.h>
> +#include <linux/platform_device.h>
> +
> +#include "edac_core.h"
> +
> +#define EDAC_MOD_STR             "cortex_arm64_edac"
> +
> +#define A57_CPUMERRSR_EL1_INDEX(x)   ((x) & 0x1ffff)
> +#define A57_CPUMERRSR_EL1_BANK(x)    (((x) >> 18) & 0x1f)
> +#define A57_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
> +#define A57_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
> +#define A57_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0x7f)
> +#define A57_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
> +#define A57_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
> +#define A57_L1_I_TAG_RAM	     0x00
> +#define A57_L1_I_DATA_RAM	     0x01
> +#define A57_L1_D_TAG_RAM	     0x08
> +#define A57_L1_D_DATA_RAM	     0x09
> +#define A57_L1_TLB_RAM		     0x18
> +
> +#define A57_L2MERRSR_EL1_INDEX(x)    ((x) & 0x1ffff)
> +#define A57_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0xf)
> +#define A57_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
> +#define A57_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
> +#define A57_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
> +#define A57_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
> +#define A57_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
> +#define A57_L2_TAG_RAM		     0x10
> +#define A57_L2_DATA_RAM		     0x11
> +#define A57_L2_SNOOP_TAG_RAM	     0x12
> +#define A57_L2_DIRTY_RAM	     0x14
> +#define A57_L2_INCLUSION_PF_RAM      0x18
> +
> +#define A53_CPUMERRSR_EL1_ADDR(x)    ((x) & 0xfff)
> +#define A53_CPUMERRSR_EL1_CPUID(x)   (((x) >> 18) & 0x07)
> +#define A53_CPUMERRSR_EL1_RAMID(x)   (((x) >> 24) & 0x7f)
> +#define A53_CPUMERRSR_EL1_VALID(x)   ((x) & (1 << 31))
> +#define A53_CPUMERRSR_EL1_REPEAT(x)  (((x) >> 32) & 0xff)
> +#define A53_CPUMERRSR_EL1_OTHER(x)   (((x) >> 40) & 0xff)
> +#define A53_CPUMERRSR_EL1_FATAL(x)   ((x) & (1UL << 63))
> +#define A53_L1_I_TAG_RAM	     0x00
> +#define A53_L1_I_DATA_RAM	     0x01
> +#define A53_L1_D_TAG_RAM	     0x08
> +#define A53_L1_D_DATA_RAM	     0x09
> +#define A53_L1_D_DIRT_RAM	     0x0A
> +#define A53_L1_TLB_RAM		     0x18
> +
> +#define A53_L2MERRSR_EL1_INDEX(x)    (((x) >> 3) & 0x3fff)
> +#define A53_L2MERRSR_EL1_CPUID(x)    (((x) >> 18) & 0x0f)
> +#define A53_L2MERRSR_EL1_RAMID(x)    (((x) >> 24) & 0x7f)
> +#define A53_L2MERRSR_EL1_VALID(x)    ((x) & (1 << 31))
> +#define A53_L2MERRSR_EL1_REPEAT(x)   (((x) >> 32) & 0xff)
> +#define A53_L2MERRSR_EL1_OTHER(x)    (((x) >> 40) & 0xff)
> +#define A53_L2MERRSR_EL1_FATAL(x)    ((x) & (1UL << 63))
> +#define A53_L2_TAG_RAM		     0x10
> +#define A53_L2_DATA_RAM		     0x11
> +#define A53_L2_SNOOP_RAM	     0x12
> +
> +#define L1_CACHE		     0
> +#define L2_CACHE		     1
> +
> +int poll_msec = 100;
> +
> +struct cortex_arm64_edac {
> +	struct edac_device_ctl_info *edac_ctl;
> +};
> +
> +static inline u64 read_cpumerrsr_el1(void)
> +{
> +	u64 val;
> +
> +	asm volatile("mrs %0, s3_1_c15_c2_2" : "=r" (val));
> +	return val;
> +}
> +
> +static inline void write_cpumerrsr_el1(u64 val)
> +{
> +	asm volatile("msr s3_1_c15_c2_2, %0" :: "r" (val));
> +}
> +
> +static inline u64 read_l2merrsr_el1(void)
> +{
> +	u64 val;
> +
> +	asm volatile("mrs %0, s3_1_c15_c2_3" : "=r" (val));
> +	return val;
> +}
> +
> +static inline void write_l2merrsr_el1(u64 val)
> +{
> +	asm volatile("msr s3_1_c15_c2_3, %0" :: "r" (val));
> +}
> +
> +static void a53_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_l2merrsr_el1();
> +
> +	if (!A53_L2MERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A53_L2MERRSR_EL1_FATAL(val);
> +	repeat_err = A53_L2MERRSR_EL1_REPEAT(val);
> +	other_err = A53_L2MERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "A53 CPU%d L2 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
> +
> +	switch (A53_L2MERRSR_EL1_RAMID(val)) {
> +	case A53_L2_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
> +		break;
> +	case A53_L2_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
> +		break;
> +	case A53_L2_SNOOP_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop filter RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	write_l2merrsr_el1(0);
> +}
> +
> +static void a57_parse_l2merrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_l2merrsr_el1();
> +
> +	if (!A57_L2MERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A57_L2MERRSR_EL1_FATAL(val);
> +	repeat_err = A57_L2MERRSR_EL1_REPEAT(val);
> +	other_err = A57_L2MERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "A57 CPU%d L2 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2MERRSR_EL1=%#llx\n", val);
> +
> +	switch (A57_L2MERRSR_EL1_RAMID(val)) {
> +	case A57_L2_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Tag RAM\n");
> +		break;
> +	case A57_L2_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Data RAM\n");
> +		break;
> +	case A57_L2_SNOOP_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Snoop tag RAM\n");
> +		break;
> +	case A57_L2_DIRTY_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 Dirty RAM\n");
> +		break;
> +	case A57_L2_INCLUSION_PF_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 inclusion PF RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L2_CACHE,
> +				      edac_ctl->name);
> +	write_l2merrsr_el1(0);
> +}
> +
> +static void a57_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_cpumerrsr_el1();
> +
> +	if (!A57_CPUMERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A57_CPUMERRSR_EL1_FATAL(val);
> +	repeat_err = A57_CPUMERRSR_EL1_REPEAT(val);
> +	other_err = A57_CPUMERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "CPU%d L1 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
> +
> +	switch (A57_CPUMERRSR_EL1_RAMID(val)) {
> +	case A57_L1_I_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
> +		break;
> +	case A57_L1_I_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
> +		break;
> +	case A57_L1_D_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
> +		break;
> +	case A57_L1_D_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
> +		break;
> +	case A57_L1_TLB_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	write_cpumerrsr_el1(0);
> +}
> +
> +static void a53_parse_cpumerrsr(struct edac_device_ctl_info *edac_ctl)
> +{
> +	int fatal;
> +	int repeat_err, other_err;
> +	u64 val = read_cpumerrsr_el1();
> +
> +	if (!A53_CPUMERRSR_EL1_VALID(val))
> +		return;
> +
> +	fatal = A53_CPUMERRSR_EL1_FATAL(val);
> +	repeat_err = A53_CPUMERRSR_EL1_REPEAT(val);
> +	other_err = A53_CPUMERRSR_EL1_OTHER(val);
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR,
> +		    "A53 CPU%d L1 %s error detected!\n", smp_processor_id(),
> +		    fatal ? "fatal" : "non-fatal");
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "CPUMERRSR_EL1=%#llx\n", val);
> +
> +	switch (A53_CPUMERRSR_EL1_RAMID(val)) {
> +	case A53_L1_I_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Tag RAM\n");
> +		break;
> +	case A53_L1_I_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-I Data RAM\n");
> +		break;
> +	case A53_L1_D_TAG_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Tag RAM\n");
> +		break;
> +	case A53_L1_D_DATA_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L1-D Data RAM\n");
> +		break;
> +	case A53_L1_TLB_RAM:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "L2 TLB RAM\n");
> +		break;
> +	default:
> +		edac_printk(KERN_CRIT, EDAC_MOD_STR, "unknown RAMID\n");
> +		break;
> +	}
> +
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Repeated error count=%d",
> +		    repeat_err);
> +	edac_printk(KERN_CRIT, EDAC_MOD_STR, "Other error count=%d\n",
> +		    other_err);
> +
> +	if (fatal)
> +		edac_device_handle_ue(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	else
> +		edac_device_handle_ce(edac_ctl, smp_processor_id(), L1_CACHE,
> +				      edac_ctl->name);
> +	write_cpumerrsr_el1(0);
> +}
> +
> +static void parse_cpumerrsr(void *args)
> +{
> +	struct edac_device_ctl_info *edac_ctl = args;
> +	int partnum = read_cpuid_part_number();
> +
> +	switch (partnum) {
> +	case ARM_CPU_PART_CORTEX_A57:
> +		a57_parse_cpumerrsr(edac_ctl);
> +		break;
> +	case ARM_CPU_PART_CORTEX_A53:
> +		a53_parse_cpumerrsr(edac_ctl);
> +		break;
> +	}
> +}
> +
> +static void parse_l2merrsr(void *args)
> +{
> +	struct edac_device_ctl_info *edac_ctl = args;
> +	int partnum = read_cpuid_part_number();
> +
> +	switch (partnum) {
> +	case ARM_CPU_PART_CORTEX_A57:
> +		a57_parse_l2merrsr(edac_ctl);
> +		break;
> +	case ARM_CPU_PART_CORTEX_A53:
> +		a53_parse_l2merrsr(edac_ctl);
> +		break;
> +	}
> +}
> +
> +static void arm64_monitor_cache_errors(struct edac_device_ctl_info *edev_ctl)
> +{
> +	int cpu;
> +	struct cpumask cluster_mask, old_mask;
> +
> +	cpumask_clear(&cluster_mask);
> +	cpumask_clear(&old_mask);
> +
> +	get_online_cpus();
> +	for_each_online_cpu(cpu) {
> +		smp_call_function_single(cpu, parse_cpumerrsr, edev_ctl, 0);
> +		cpumask_copy(&cluster_mask, topology_core_cpumask(cpu));
> +		if (cpumask_equal(&cluster_mask, &old_mask))
> +			continue;
> +		cpumask_copy(&old_mask, &cluster_mask);
> +		smp_call_function_any(&cluster_mask, parse_l2merrsr,
> +				      edev_ctl, 0);
> +	}
> +	put_online_cpus();
> +}
> +
> +static int cortex_arm64_edac_probe(struct platform_device *pdev)
> +{
> +	int rc;
> +	struct cortex_arm64_edac *drv;
> +	struct device *dev = &pdev->dev;
> +
> +	drv = devm_kzalloc(dev, sizeof(*drv), GFP_KERNEL);
> +	if (!drv)
> +		return -ENOMEM;
> +
> +	drv->edac_ctl = edac_device_alloc_ctl_info(0, "cpu",
> +						   num_possible_cpus(), "L", 2,
> +						   1, NULL, 0,
> +						   edac_device_alloc_index());
> +	if (IS_ERR(drv->edac_ctl))
> +		return -ENOMEM;
> +
> +	drv->edac_ctl->poll_msec = poll_msec;
> +	drv->edac_ctl->edac_check = arm64_monitor_cache_errors;
> +	drv->edac_ctl->dev = dev;
> +	drv->edac_ctl->mod_name = dev_name(dev);
> +	drv->edac_ctl->dev_name = dev_name(dev);
> +	drv->edac_ctl->ctl_name = "cpu_err";
> +	drv->edac_ctl->panic_on_ue = 1;
> +	platform_set_drvdata(pdev, drv);
> +
> +	rc = edac_device_add_device(drv->edac_ctl);
> +	if (rc)
> +		goto edac_alloc_failed;
> +
> +	return 0;
> +
> +edac_alloc_failed:
> +	edac_device_free_ctl_info(drv->edac_ctl);
> +	return rc;
> +}
> +
> +static int cortex_arm64_edac_remove(struct platform_device *pdev)
> +{
> +	struct cortex_arm64_edac *drv = dev_get_drvdata(&pdev->dev);
> +	struct edac_device_ctl_info *edac_ctl = drv->edac_ctl;
> +
> +	edac_device_del_device(edac_ctl->dev);
> +	edac_device_free_ctl_info(edac_ctl);
> +
> +	return 0;
> +}
> +
> +static const struct of_device_id cortex_arm64_edac_of_match[] = {
> +	{ .compatible = "arm,armv8-edac" },
> +	{},
> +};
> +MODULE_DEVICE_TABLE(of, cortex_arm64_edac_of_match);
> +
> +static struct platform_driver cortex_arm64_edac_driver = {
> +	.probe = cortex_arm64_edac_probe,
> +	.remove = cortex_arm64_edac_remove,
> +	.driver = {
> +		.name = "arm64-edac",
> +		.owner = THIS_MODULE,
> +		.of_match_table = cortex_arm64_edac_of_match,
> +	},
> +};
> +
> +static int __init cortex_arm64_edac_init(void)
> +{
> +	int rc;
> +
> +	/* Only POLL mode is supported so far */
> +	edac_op_state = EDAC_OPSTATE_POLL;
> +
> +	rc = platform_driver_register(&cortex_arm64_edac_driver);
> +	if (rc) {
> +		edac_printk(KERN_ERR, EDAC_MOD_STR, "failed to register\n");
> +		return rc;
> +	}
> +
> +	return 0;
> +}
> +module_init(cortex_arm64_edac_init);
> +
> +static void __exit cortex_arm64_edac_exit(void)
> +{
> +	platform_driver_unregister(&cortex_arm64_edac_driver);
> +}
> +module_exit(cortex_arm64_edac_exit);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Brijesh Singh <brijeshkumar.singh@amd.com>");
> +MODULE_DESCRIPTION("Cortex A57 and A53 EDAC driver");
> +module_param(poll_msec, int, 0444);
> +MODULE_PARM_DESC(poll_msec, "EDAC monitor poll interval in msec");
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH v2] EDAC: Add ARM64 EDAC
  2015-10-23  1:51 Stepan Moskovchenko
@ 2015-10-23  3:07 ` Singh, Brijeshkumar
  0 siblings, 0 replies; 34+ messages in thread
From: Singh, Brijeshkumar @ 2015-10-23  3:07 UTC (permalink / raw)
  To: Stepan Moskovchenko
  Cc: robh+dt, pawel.moll, mark.rutland, ijc+devicetree, galak,
	dougthompson, bp, mchehab, devicetree, guohanjun, andre.przywara,
	arnd, linux-kernel, linux-edac

Hi Steve,

Thanks for pointing the link, I have not seen that driver before; I was mainly looking at driver/edac/xgene_edac.c and some other arm edac drivers.  My first attempt was to do AMD specific edac driver to log correctable L1/L2 error but based on feedback I worked on v2 generic driver which now looks very similar to your driver, are you planning to upstream your driver ? I can work with you to verify it on my current hardware setup. It seems your driver also handles single-bit and double-bit error which I have no way to test in my current hardware setup. On Seattle platform most of error handling is done through firmware APEI except correctable L1/L2.  Let me know your thoughts. 

-Brijesh
________________________________________
From: Stepan Moskovchenko [stepanm@codeaurora.org]
Sent: Thursday, October 22, 2015 8:51 PM
To: Singh, Brijeshkumar
Cc: robh+dt@kernel.org; pawel.moll@arm.com; mark.rutland@arm.com; ijc+devicetree@hellion.org.uk; galak@codeaurora.org; dougthompson@xmission.com; bp@alien8.de; mchehab@osg.samsung.com; devicetree@vger.kernel.org; guohanjun@huawei.com; andre.przywara@arm.com; arnd@arndb.de; linux-kernel@vger.kernel.org; linux-edac@vger.kernel.org
Subject: Re: [PATCH v2] EDAC: Add ARM64 EDAC

 >>> +++ b/drivers/edac/cortex_arm64_edac.c
 >>> @@ -0,0 +1,457 @@
 >>> +/*
 >>> + * Cortex ARM64 EDAC
 >>> + *
 >>> + * Copyright (c) 2015, Advanced Micro Devices
 >>> + * Author: Brijesh Singh <brijeshkumar.singh@amd.com>
 >>> + *

>Hi Brijesh,

>Your ARM64 EDAC driver seems rather similar to the existing driver that
>is linked below. If you have indeed based your driver on this one, can
>you please provide the appropriate attribution?

>https://www.codeaurora.org/cgit/quic/la/kernel/msm-3.14/tree/drivers/edac/cortex_arm64_edac.c?h=LA.HB.1.1.1_rb1.10


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2] EDAC: Add ARM64 EDAC
@ 2015-10-23  1:51 Stepan Moskovchenko
  2015-10-23  3:07 ` Singh, Brijeshkumar
  0 siblings, 1 reply; 34+ messages in thread
From: Stepan Moskovchenko @ 2015-10-23  1:51 UTC (permalink / raw)
  To: brijeshkumar.singh
  Cc: robh+dt, pawel.moll, mark.rutland, ijc+devicetree, galak,
	dougthompson, bp, mchehab, devicetree, guohanjun, andre.przywara,
	arnd, linux-kernel, linux-edac

 >>> +++ b/drivers/edac/cortex_arm64_edac.c
 >>> @@ -0,0 +1,457 @@
 >>> +/*
 >>> + * Cortex ARM64 EDAC
 >>> + *
 >>> + * Copyright (c) 2015, Advanced Micro Devices
 >>> + * Author: Brijesh Singh <brijeshkumar.singh@amd.com>
 >>> + *

Hi Brijesh,

Your ARM64 EDAC driver seems rather similar to the existing driver that 
is linked below. If you have indeed based your driver on this one, can 
you please provide the appropriate attribution?

https://www.codeaurora.org/cgit/quic/la/kernel/msm-3.14/tree/drivers/edac/cortex_arm64_edac.c?h=LA.HB.1.1.1_rb1.10

Thank you
Steve

-- 
  The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
  hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2015-10-26 12:46 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-21 20:41 [PATCH v2] EDAC: Add ARM64 EDAC Brijesh Singh
2015-10-21 20:41 ` Brijesh Singh
2015-10-21 20:41 ` Brijesh Singh
2015-10-21 21:25 ` Mauro Carvalho Chehab
2015-10-21 21:25   ` Mauro Carvalho Chehab
2015-10-21 21:25   ` Mauro Carvalho Chehab
2015-10-22 18:47   ` Brijesh Singh
2015-10-22 18:47     ` Brijesh Singh
2015-10-22 18:47     ` Brijesh Singh
2015-10-21 23:52 ` Andre Przywara
2015-10-21 23:52   ` Andre Przywara
2015-10-21 23:52   ` Andre Przywara
2015-10-22 14:46   ` Brijesh Singh
2015-10-22 14:46     ` Brijesh Singh
2015-10-22 14:46     ` Brijesh Singh
2015-10-23  1:41     ` Hanjun Guo
2015-10-23  1:41       ` Hanjun Guo
2015-10-23  1:41       ` Hanjun Guo
2015-10-23  9:51       ` Andre Przywara
2015-10-23  9:51         ` Andre Przywara
2015-10-23  9:51         ` Andre Przywara
2015-10-23 17:58         ` Brijesh Singh
2015-10-23 17:58           ` Brijesh Singh
2015-10-23 17:58           ` Brijesh Singh
2015-10-24  2:36           ` Hanjun Guo
2015-10-24  2:36             ` Hanjun Guo
2015-10-24  2:36             ` Hanjun Guo
2015-10-23 16:58 ` Stephen Boyd
2015-10-23 16:58   ` Stephen Boyd
2015-10-26 12:46 ` Mark Rutland
2015-10-26 12:46   ` Mark Rutland
2015-10-26 12:46   ` Mark Rutland
2015-10-23  1:51 Stepan Moskovchenko
2015-10-23  3:07 ` Singh, Brijeshkumar

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.