linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/2] dt-bindings: edac: Add DT bindings for Kryo EDAC
       [not found] <cover.1575529553.git.saiprakash.ranjan@codeaurora.org>
@ 2019-12-05  9:53 ` Sai Prakash Ranjan
  2019-12-18 23:37   ` Rob Herring
  2020-01-15 18:48   ` James Morse
  2019-12-05  9:53 ` [PATCH 2/2] drivers: edac: Add EDAC support for Kryo CPU caches Sai Prakash Ranjan
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 16+ messages in thread
From: Sai Prakash Ranjan @ 2019-12-05  9:53 UTC (permalink / raw)
  To: Andy Gross, Bjorn Andersson, Mark Rutland, Rob Herring,
	devicetree, Borislav Petkov, Mauro Carvalho Chehab, Tony Luck,
	James Morse, Robert Richter, linux-edac
  Cc: linux-arm-kernel, linux-kernel, linux-arm-msm, Stephen Boyd,
	Evan Green, tsoni, psodagud, Sai Prakash Ranjan

This adds DT bindings for Kryo EDAC implemented with RAS
extensions on KRYO{3,4}XX CPU cores for reporting of cache
errors.

Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
---
 .../bindings/edac/qcom-kryo-edac.yaml         | 67 +++++++++++++++++++
 1 file changed, 67 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/edac/qcom-kryo-edac.yaml

diff --git a/Documentation/devicetree/bindings/edac/qcom-kryo-edac.yaml b/Documentation/devicetree/bindings/edac/qcom-kryo-edac.yaml
new file mode 100644
index 000000000000..1a39429a73b4
--- /dev/null
+++ b/Documentation/devicetree/bindings/edac/qcom-kryo-edac.yaml
@@ -0,0 +1,67 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/edac/qcom-kryo-edac.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Kryo Error Detection and Correction(EDAC)
+
+maintainers:
+  - Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
+
+description: |
+  Kryo EDAC is defined to describe on-chip error detection and correction
+  for the Kryo CPU cores which implement RAS extensions. It will report
+  all Single Bit Errors and Double Bit Errors found in L1/L2 caches in
+  in two registers ERXSTATUS_EL1 and ERXMISC0_EL1. L3-SCU cache errors
+  are reported in ERR1STATUS and ERR1MISC0 registers.
+    ERXSTATUS_EL1 - Selected Error Record Primary Status Register, EL1
+    ERXMISC0_EL1 - Selected Error Record Miscellaneous Register 0, EL1
+    ERR1STATUS - Error Record Primary Status Register
+    ERR1MISC0 - Error Record Miscellaneous Register 0
+  Current implementation of Kryo ECC(Error Correcting Code) mechanism is
+  based on interrupts.
+
+properties:
+  compatible:
+    enum:
+      - qcom,kryo-edac
+
+  interrupts:
+    minItems: 1
+    maxItems: 4
+    items:
+      - description: l1-l2 cache faultirq interrupt
+      - description: l1-l2 cache errirq interrupt
+      - description: l3-scu cache errirq interrupt
+      - description: l3-scu cache faultirq interrupt
+
+  interrupt-names:
+    minItems: 1
+    maxItems: 4
+    items:
+      - const: l1-l2-faultirq
+      - const: l1-l2-errirq
+      - const: l3-scu-errirq
+      - const: l3-scu-faultirq
+
+required:
+  - compatible
+  - interrupts
+  - interrupt-names
+
+examples:
+  - |
+    #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+    kryo_edac {
+      compatible = "qcom,kryo-edac";
+      interrupts = <GIC_PPI 6 IRQ_TYPE_LEVEL_HIGH>,
+                   <GIC_PPI 7 IRQ_TYPE_LEVEL_HIGH>,
+                   <GIC_SPI 34 IRQ_TYPE_LEVEL_HIGH>,
+                   <GIC_SPI 35 IRQ_TYPE_LEVEL_HIGH>;
+      interrupt-names = "l1-l2-faultirq",
+                        "l1-l2-errirq",
+                        "l3-scu-errirq",
+                        "l3-scu-faultirq";
+    };
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/2] drivers: edac: Add EDAC support for Kryo CPU caches
       [not found] <cover.1575529553.git.saiprakash.ranjan@codeaurora.org>
  2019-12-05  9:53 ` [PATCH 1/2] dt-bindings: edac: Add DT bindings for Kryo EDAC Sai Prakash Ranjan
@ 2019-12-05  9:53 ` Sai Prakash Ranjan
       [not found] ` <0101016ed57a6311-e815485c-4b77-4342-a3de-203673941602-000000@us-west-2.amazonses.com>
       [not found] ` <0101016ed57a6559-46c6c649-db28-4945-a11c-7441b8e9ac5b-000000@us-west-2.amazonses.com>
  3 siblings, 0 replies; 16+ messages in thread
From: Sai Prakash Ranjan @ 2019-12-05  9:53 UTC (permalink / raw)
  To: Andy Gross, Bjorn Andersson, Mark Rutland, Rob Herring,
	devicetree, Borislav Petkov, Mauro Carvalho Chehab, Tony Luck,
	James Morse, Robert Richter, linux-edac
  Cc: linux-arm-kernel, linux-kernel, linux-arm-msm, Stephen Boyd,
	Evan Green, tsoni, psodagud, Sai Prakash Ranjan

Kryo{3,4}XX CPU cores implement RAS extensions to support
Error Correcting Code(ECC). Currently all Kryo{3,4}XX CPU
cores (gold/silver a.k.a big/LITTLE) support ECC via RAS.
This adds an interrupt based driver for those CPUs and
provides an optional polling of error recording system
registers.

Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
---
 MAINTAINERS                   |   7 +
 drivers/edac/Kconfig          |  20 +
 drivers/edac/Makefile         |   1 +
 drivers/edac/qcom_kryo_edac.c | 679 ++++++++++++++++++++++++++++++++++
 4 files changed, 707 insertions(+)
 create mode 100644 drivers/edac/qcom_kryo_edac.c

diff --git a/MAINTAINERS b/MAINTAINERS
index c2d80079dccc..f58c93f963f6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6049,6 +6049,13 @@ L:	linux-edac@vger.kernel.org
 S:	Maintained
 F:	drivers/edac/qcom_edac.c
 
+EDAC-KRYO-QCOM
+M:	Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
+L:	linux-arm-msm@vger.kernel.org
+L:	linux-edac@vger.kernel.org
+S:	Maintained
+F:	drivers/edac/qcom_kryo_edac.c
+
 EDIROL UA-101/UA-1000 DRIVER
 M:	Clemens Ladisch <clemens@ladisch.de>
 L:	alsa-devel@alsa-project.org (moderated for non-subscribers)
diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index 417dad635526..cd78ac2917c9 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -508,6 +508,26 @@ config EDAC_QCOM
 	  For debugging issues having to do with stability and overall system
 	  health, you should probably say 'Y' here.
 
+config EDAC_QCOM_KRYO
+	tristate "QCOM Kryo EDAC for CPU L1/L2/L3-SCU caches"
+	depends on ARCH_QCOM && ARM64_RAS_EXTN
+	help
+	  Support for Error detection and correction on Kryo Gold and Silver CPU
+	  cores with RAS extensions. Currently it detects and reports all Single
+	  Bit Errors (SBEs) and Double Bit Errors (DBEs).
+
+	  For debugging issues having to do with stability and overall system
+	  health, you should probably say 'Y' here.
+
+config EDAC_QCOM_KRYO_POLL
+	depends on EDAC_QCOM_KRYO
+	bool "Poll on Kryo ECC registers"
+	help
+	  This option chooses whether or not you want to poll on the Kryo ECC
+	  registers. When this is enabled, the polling rate can be set as a
+	  module parameter. By default, it will call the polling function every
+	  second.
+
 config EDAC_ASPEED
 	tristate "Aspeed AST 2500 SoC"
 	depends on MACH_ASPEED_G5
diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
index d77200c9680b..29edcfa6ec0e 100644
--- a/drivers/edac/Makefile
+++ b/drivers/edac/Makefile
@@ -85,5 +85,6 @@ obj-$(CONFIG_EDAC_SYNOPSYS)		+= synopsys_edac.o
 obj-$(CONFIG_EDAC_XGENE)		+= xgene_edac.o
 obj-$(CONFIG_EDAC_TI)			+= ti_edac.o
 obj-$(CONFIG_EDAC_QCOM)			+= qcom_edac.o
+obj-$(CONFIG_EDAC_QCOM_KRYO)		+= qcom_kryo_edac.o
 obj-$(CONFIG_EDAC_ASPEED)		+= aspeed_edac.o
 obj-$(CONFIG_EDAC_BLUEFIELD)		+= bluefield_edac.o
diff --git a/drivers/edac/qcom_kryo_edac.c b/drivers/edac/qcom_kryo_edac.c
new file mode 100644
index 000000000000..05b60ad3cb0e
--- /dev/null
+++ b/drivers/edac/qcom_kryo_edac.c
@@ -0,0 +1,679 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2019, The Linux Foundation. All rights reserved.
+ */
+
+#include <linux/bitfield.h>
+#include <linux/cpu_pm.h>
+#include <linux/of_irq.h>
+#include <linux/smp.h>
+
+#include <asm/cputype.h>
+#include <asm/sysreg.h>
+
+#include "edac_device.h"
+#include "edac_mc.h"
+
+#define DRV_NAME		"qcom_kryo_edac"
+
+/*
+ * ARM Cortex-A55, Cortex-A75, Cortex-A76 TRM Chapter B3.3
+ * ARM DSU TRM Chapter B2.3
+ * CFI = Corrected Fault Handling interrupt, FI = Fault handling interrupt
+ * UI = Uncorrected error recovery interrupt, ED = Error Detection
+ */
+#define KRYO_ERRXCTLR_ED	BIT(0)
+#define KRYO_ERRXCTLR_UI	BIT(2)
+#define KRYO_ERRXCTLR_FI	BIT(3)
+#define KRYO_ERRXCTLR_CFI	BIT(8)
+#define KRYO_ERRXCTLR_ENABLE	(KRYO_ERRXCTLR_CFI | KRYO_ERRXCTLR_FI | \
+				 KRYO_ERRXCTLR_UI | KRYO_ERRXCTLR_ED)
+
+/*
+ * ARM Cortex-A55, Cortex-A75, Cortex-A76 TRM Chapter B3.4
+ * ARM DSU TRM Chapter B2.4
+ */
+#define KRYO_ERRXFR_ED		GENMASK(1, 0)
+#define KRYO_ERRXFR_DE		GENMASK(3, 2)
+#define KRYO_ERRXFR_UI		GENMASK(5, 4)
+#define KRYO_ERRXFR_FI		GENMASK(7, 6)
+#define KRYO_ERRXFR_UE		GENMASK(9, 8)
+#define KRYO_ERRXFR_CFI		GENMASK(11, 10)
+#define KRYO_ERRXFR_CEC		GENMASK(14, 12)
+#define KRYO_ERRXFR_RP		BIT(15)
+#define KRYO_ERRXFR_SUPPORTED	(KRYO_ERRXFR_ED | KRYO_ERRXFR_DE | \
+				 KRYO_ERRXFR_UI | KRYO_ERRXFR_FI | \
+				 KRYO_ERRXFR_UE | KRYO_ERRXFR_CFI | \
+				 KRYO_ERRXFR_CEC | KRYO_ERRXFR_RP)
+
+/*
+ * ARM Cortex-A55, Cortex-A75, Cortex-A76 TRM Chapter B3.5
+ * ARM DSU TRM Chapter B2.5
+ */
+#define KRYO_ERRXMISC0_CECR	GENMASK_ULL(38, 32)
+#define KRYO_ERRXMISC0_CECO	GENMASK_ULL(46, 40)
+
+/* ARM Cortex-A76 TRM Chapter B3.5 */
+#define KRYO_ERRXMISC0_UNIT	GENMASK(3, 0)
+#define KRYO_ERRXMISC0_LVL	GENMASK(3, 1)
+
+/* ARM Cortex-A76 TRM Chapter B3.10 has SERR bitfields 4:0
+ * but Cortex-A55, Cortex-A75 and DSU TRM has SERR bitfields 7:0.
+ * Since max error record is 21, we can use bitfields 4:0 for
+ * Kryo{3,4}XX CPUs.
+ */
+#define KRYO_ERRXSTATUS_SERR	GENMASK(4, 0)
+#define KRYO_ERRXSTATUS_DE	BIT(23)
+#define KRYO_ERRXSTATUS_CE	GENMASK(25, 24)
+#define KRYO_ERRXSTATUS_MV	BIT(26)
+#define KRYO_ERRXSTATUS_UE	BIT(29)
+#define KRYO_ERRXSTATUS_VALID	BIT(30)
+
+/* ARM Cortex-A76 TRM Chapter B3.5
+ * IC = Instruction Cache, DC = Data Cache
+ */
+#define KRYO_L1_UNIT_IC		0x1
+#define KRYO_L2_UNIT_TLB	0x2
+#define KRYO_L1_UNIT_DC		0x4
+#define KRYO_L2_UNIT		0x8
+
+/*
+ * ARM Cortex-A55 TRM Chapter B2.36
+ * ARM Cortex-A75, Cortex-A76 TRM Chapter B2.37
+ */
+#define KRYO_ERR_RECORD_L1_L2	0x0
+#define KRYO_ERR_RECORD_L3	0x1
+
+/* ARM DSU TRM Chapter B2.10 */
+#define BUS_ERROR		0x12
+
+/* QCOM Kryo CPU part numbers */
+#define KRYO3XX_GOLD		0x802
+#define KRYO4XX_GOLD		0x804
+#define KRYO4XX_SILVER_V1	0x803
+#define KRYO4XX_SILVER_V2	0x805
+
+#define KRYO_EDAC_MSG_MAX	256
+
+static int poll_msec = 1000;
+module_param(poll_msec, int, 0444);
+
+enum {
+	KRYO_L1 = 0,
+	KRYO_L2,
+	KRYO_L3,
+};
+
+/* CE = Corrected Error, UE = Uncorrected Error, DE = Deferred Error */
+enum {
+	KRYO_L1_CE = 0,
+	KRYO_L1_UE,
+	KRYO_L1_DE,
+	KRYO_L2_CE,
+	KRYO_L2_UE,
+	KRYO_L2_DE,
+	KRYO_L3_CE,
+	KRYO_L3_UE,
+	KRYO_L3_DE,
+};
+
+struct error_record {
+	u32 error_code;
+	const char *error_msg;
+};
+
+struct error_type {
+	void (*fn)(struct edac_device_ctl_info *edev_ctl,
+		   int inst_nr, int block_nr, const char *msg);
+	const char *msg;
+};
+
+/*
+ * ARM Cortex-A55, Cortex-A75, Cortex-A76 TRM Chapter B3.10
+ * ARM DSU TRM Chapter B2.10
+ */
+static const struct error_record serror_record[] = {
+	{ 0x1,	"Errors due to fault injection"		},
+	{ 0x2,	"ECC error from internal data buffer"	},
+	{ 0x6,	"ECC error on cache data RAM"		},
+	{ 0x7,	"ECC error on cache tag or dirty RAM"	},
+	{ 0x8,	"Parity error on TLB data RAM"		},
+	{ 0x9,	"Parity error on TLB tag RAM"		},
+	{ 0x12,	"Error response for a cache copyback"	},
+	{ 0x15,	"Deferred error not supported"		},
+};
+
+static const struct error_type err_type[] = {
+	{ edac_device_handle_ce, "Kryo L1 Corrected Error"	},
+	{ edac_device_handle_ue, "Kryo L1 Uncorrected Error"	},
+	{ edac_device_handle_ue, "Kryo L1 Deferred Error"	},
+	{ edac_device_handle_ce, "Kryo L2 Corrected Error"	},
+	{ edac_device_handle_ue, "Kryo L2 Uncorrected Error"	},
+	{ edac_device_handle_ue, "Kryo L2 Deferred Error"	},
+	{ edac_device_handle_ce, "L3 Corrected Error"		},
+	{ edac_device_handle_ue, "L3 Uncorrected Error"		},
+	{ edac_device_handle_ue, "L3 Deferred Error"		},
+};
+
+static struct edac_device_ctl_info __percpu *edac_dev;
+static struct edac_device_ctl_info *drv_edev_ctl;
+
+static const char *get_error_msg(u64 errxstatus)
+{
+	const struct error_record *rec;
+	u32 errxstatus_serr;
+
+	errxstatus_serr = FIELD_GET(KRYO_ERRXSTATUS_SERR, errxstatus);
+
+	for (rec = serror_record; rec->error_code; rec++) {
+		if (errxstatus_serr == rec->error_code)
+			return rec->error_msg;
+	}
+
+	return NULL;
+}
+
+static void dump_syndrome_reg(int error_type, int level,
+			      u64 errxstatus, u64 errxmisc,
+			      struct edac_device_ctl_info *edev_ctl)
+{
+	char msg[KRYO_EDAC_MSG_MAX];
+	const char *error_msg;
+	int cpu;
+
+	cpu = raw_smp_processor_id();
+
+	error_msg = get_error_msg(errxstatus);
+	if (!error_msg)
+		return;
+
+	snprintf(msg, KRYO_EDAC_MSG_MAX,
+		 "CPU%d: %s, ERRXSTATUS_EL1:%#llx ERRXMISC0_EL1:%#llx, %s",
+		 cpu, err_type[error_type].msg, errxstatus, errxmisc,
+		 error_msg);
+
+	err_type[error_type].fn(edev_ctl, 0, level, msg);
+}
+
+static void kryo_check_err_type(u64 errxstatus, u64 errxmisc,
+				struct edac_device_ctl_info *edev_ctl,
+				int level)
+{
+	u32 errxstatus_ue, errxstatus_ce, errxstatus_de;
+
+	errxstatus_ce = FIELD_GET(KRYO_ERRXSTATUS_CE, errxstatus);
+	errxstatus_ue = FIELD_GET(KRYO_ERRXSTATUS_UE, errxstatus);
+	errxstatus_de = FIELD_GET(KRYO_ERRXSTATUS_DE, errxstatus);
+
+	switch (level) {
+	case KRYO_L1:
+		if (errxstatus_ce)
+			dump_syndrome_reg(KRYO_L1_CE, level, errxstatus,
+					  errxmisc, edev_ctl);
+		else if (errxstatus_ue)
+			dump_syndrome_reg(KRYO_L1_UE, level, errxstatus,
+					  errxmisc, edev_ctl);
+		else if (errxstatus_de)
+			dump_syndrome_reg(KRYO_L1_DE, level, errxstatus,
+					  errxmisc, edev_ctl);
+		else
+			edac_printk(KERN_ERR, DRV_NAME, "Unknown error\n");
+		break;
+	case KRYO_L2:
+		if (errxstatus_ce)
+			dump_syndrome_reg(KRYO_L2_CE, level, errxstatus,
+					  errxmisc, edev_ctl);
+		else if (errxstatus_ue)
+			dump_syndrome_reg(KRYO_L2_UE, level, errxstatus,
+					  errxmisc, edev_ctl);
+		else if (errxstatus_de)
+			dump_syndrome_reg(KRYO_L2_DE, level, errxstatus,
+					  errxmisc, edev_ctl);
+		else
+			edac_printk(KERN_ERR, DRV_NAME, "Unknown error\n");
+		break;
+	case KRYO_L3:
+		if (errxstatus_ce)
+			dump_syndrome_reg(KRYO_L3_CE, level, errxstatus,
+					  errxmisc, edev_ctl);
+		else if (errxstatus_ue)
+			dump_syndrome_reg(KRYO_L3_UE, level, errxstatus,
+					  errxmisc, edev_ctl);
+		else if (errxstatus_de)
+			dump_syndrome_reg(KRYO_L3_DE, level, errxstatus,
+					  errxmisc, edev_ctl);
+		else
+			edac_printk(KERN_ERR, DRV_NAME, "Unknown error\n");
+		break;
+	default:
+		edac_printk(KERN_ERR, DRV_NAME, "Unknown level\n");
+	}
+}
+
+static inline void kryo_clear_error(u64 errxstatus)
+{
+	write_sysreg_s(errxstatus, SYS_ERXSTATUS_EL1);
+	isb();
+}
+
+static void kryo_parse_l1_l2_cache_error(u64 errxstatus, u64 errxmisc,
+					 struct edac_device_ctl_info *edev_ctl,
+					 int cpu)
+{
+	u32 part_num = read_cpuid_part_number();
+
+	switch (part_num) {
+	/* Kryo3XX gold CPU cores do not have a UNIT bitfield */
+	case KRYO3XX_GOLD:
+	case KRYO4XX_SILVER_V1:
+	case KRYO4XX_SILVER_V2:
+		switch (FIELD_GET(KRYO_ERRXMISC0_LVL, errxmisc)) {
+		case KRYO_L1:
+			kryo_check_err_type(errxstatus, errxmisc,
+					    edev_ctl, KRYO_L1);
+			break;
+		case KRYO_L2:
+			kryo_check_err_type(errxstatus, errxmisc,
+					    edev_ctl, KRYO_L2);
+			break;
+		default:
+			edac_printk(KERN_ERR, DRV_NAME,
+				    "silver cpu:%d unknown error: %lu\n", cpu,
+				    FIELD_GET(KRYO_ERRXMISC0_LVL, errxmisc));
+		}
+		break;
+	/* Kryo4XX gold CPU cores have a UNIT bitfield to identify levels */
+	case KRYO4XX_GOLD:
+		switch (FIELD_GET(KRYO_ERRXMISC0_UNIT, errxmisc)) {
+		case KRYO_L1_UNIT_DC:
+		case KRYO_L1_UNIT_IC:
+			kryo_check_err_type(errxstatus, errxmisc,
+					    edev_ctl, KRYO_L1);
+			break;
+		case KRYO_L2_UNIT:
+		case KRYO_L2_UNIT_TLB:
+			kryo_check_err_type(errxstatus, errxmisc,
+					    edev_ctl, KRYO_L2);
+			break;
+		default:
+			edac_printk(KERN_ERR, DRV_NAME,
+				    "gold cpu:%d unknown error: %lu\n", cpu,
+				    FIELD_GET(KRYO_ERRXMISC0_UNIT, errxmisc));
+		}
+		break;
+	default:
+		edac_printk(KERN_ERR, DRV_NAME,
+			    "Error in matching cpu%d with part num:%u\n",
+			    cpu, part_num);
+	}
+}
+
+static inline bool kryo_check_regs_valid(u64 errxstatus)
+{
+	/* Check if status and misc regs are valid */
+	if (!(FIELD_GET(KRYO_ERRXSTATUS_VALID, errxstatus)) ||
+	    !(FIELD_GET(KRYO_ERRXSTATUS_MV, errxstatus)))
+		return false;
+
+	return true;
+}
+
+static void kryo_check_l1_l2_ecc(void *info)
+{
+	struct edac_device_ctl_info *edev_ctl = info;
+	u64 errxstatus;
+	u64 errxmisc;
+	int cpu;
+
+	cpu = smp_processor_id();
+	/* We know record 0 is L1 and L2 */
+	write_sysreg_s(0, SYS_ERRSELR_EL1);
+	isb();
+
+	errxstatus = read_sysreg_s(SYS_ERXSTATUS_EL1);
+	if (!kryo_check_regs_valid(errxstatus))
+		return;
+
+	errxmisc = read_sysreg_s(SYS_ERXMISC0_EL1);
+	/* Check if L1/L2 error */
+	if (!(FIELD_GET(KRYO_ERRXMISC0_LVL, errxmisc) == KRYO_L1) &&
+	    !(FIELD_GET(KRYO_ERRXMISC0_LVL, errxmisc) == KRYO_L2))
+		return;
+
+	kryo_parse_l1_l2_cache_error(errxstatus, errxmisc, edev_ctl, cpu);
+	kryo_clear_error(errxstatus);
+}
+
+static irqreturn_t kryo_l1_l2_handler(int irq, void *drvdata)
+{
+	struct edac_device_ctl_info *edev_ctl = *(void **)drvdata;
+
+	kryo_check_l1_l2_ecc(edev_ctl);
+
+	return IRQ_HANDLED;
+}
+
+static bool kryo_check_l3_bus_error(u64 errxstatus)
+{
+	if (FIELD_GET(KRYO_ERRXSTATUS_SERR, errxstatus) == BUS_ERROR) {
+		edac_printk(KERN_ERR, DRV_NAME, "Bus Error\n");
+		return true;
+	}
+
+	return false;
+}
+
+static void kryo_check_l3_scu_ecc(struct edac_device_ctl_info *edev_ctl)
+{
+	u64 errxstatus, errxmisc;
+
+	/* We know record 1 is L3-SCU */
+	write_sysreg_s(1, SYS_ERRSELR_EL1);
+	isb();
+
+	errxstatus = read_sysreg_s(SYS_ERXSTATUS_EL1);
+	if (!kryo_check_regs_valid(errxstatus))
+		return;
+
+	errxmisc = read_sysreg_s(SYS_ERXMISC0_EL1);
+	/* Check if L3/bus error */
+	if (!(FIELD_GET(KRYO_ERRXMISC0_LVL, errxmisc) == KRYO_L3) ||
+	    kryo_check_l3_bus_error(errxstatus))
+		return;
+
+	/* Check if Corrected/Uncorrected/Deferred error and dump regs */
+	kryo_check_err_type(errxstatus, errxmisc, edev_ctl, KRYO_L3);
+	kryo_clear_error(errxstatus);
+}
+
+static irqreturn_t kryo_l3_scu_handler(int irq, void *edev_ctl)
+{
+	kryo_check_l3_scu_ecc(edev_ctl);
+
+	return IRQ_HANDLED;
+}
+
+static void kryo_l1_l2_irq_disable(void *drvdata)
+{
+	int irq = *(int *)drvdata;
+
+	disable_percpu_irq(irq);
+}
+
+static void kryo_l1_l2_irq_enable(void *drvdata)
+{
+	int irq = *(int *)drvdata;
+
+	enable_percpu_irq(irq, IRQ_TYPE_LEVEL_HIGH);
+}
+
+static int kryo_l1_l2_setup_irq(struct platform_device *pdev,
+				struct edac_device_ctl_info *edev_ctl)
+{
+	int cpu, errirq, faultirq, ret;
+
+	edac_dev = devm_alloc_percpu(&pdev->dev, *edac_dev);
+	if (!edac_dev)
+		return -ENOMEM;
+
+	for_each_possible_cpu(cpu) {
+		preempt_disable();
+		per_cpu(edac_dev, cpu) = edev_ctl;
+		preempt_enable();
+	}
+
+	faultirq = platform_get_irq_byname(pdev, "l1-l2-faultirq");
+	if (faultirq < 0) {
+		ret = faultirq;
+		goto out_fault;
+	}
+
+	ret = request_percpu_irq(faultirq, kryo_l1_l2_handler,
+				 "kryo_l1_l2_ecc_faultirq",
+				 &edac_dev);
+	if (ret) {
+		edac_printk(KERN_DEBUG, DRV_NAME,
+			    "Failed to request l1-l2-faultirq %d\n", faultirq);
+		goto out_fault;
+	}
+
+	on_each_cpu(kryo_l1_l2_irq_enable, &faultirq, 1);
+
+out_fault:
+	errirq = platform_get_irq_byname(pdev, "l1-l2-errirq");
+	if (errirq < 0) {
+		ret = errirq;
+		goto out_err;
+	}
+
+	ret = request_percpu_irq(errirq, kryo_l1_l2_handler,
+				 "kryo_l1_l2_ecc_errirq",
+				 &edac_dev);
+	if (ret) {
+		edac_printk(KERN_DEBUG, DRV_NAME,
+			    "Failed to request l1-l2-errirq %d\n", errirq);
+		goto out_err;
+	}
+
+	on_each_cpu(kryo_l1_l2_irq_enable, &errirq, 1);
+
+	return ret;
+
+out_err:
+	free_percpu_irq(faultirq, &edac_dev);
+
+	return ret;
+}
+
+static int kryo_l3_setup_irq(struct platform_device *pdev, void *edev_ctl)
+{
+	int errirq, faultirq, ret;
+
+	faultirq = platform_get_irq_byname(pdev, "l3-scu-faultirq");
+	if (faultirq < 0) {
+		ret = faultirq;
+		goto out_fault;
+	}
+
+	ret = devm_request_irq(&pdev->dev, faultirq, kryo_l3_scu_handler,
+			       IRQF_ONESHOT | IRQF_TRIGGER_HIGH,
+			       "kryo_l3_scu_ecc_faultirq", edev_ctl);
+	if (ret) {
+		edac_printk(KERN_DEBUG, DRV_NAME,
+			    "Failed to request l3-scu-faultirq %d\n", faultirq);
+	}
+
+out_fault:
+	errirq = platform_get_irq_byname(pdev, "l3-scu-errirq");
+	if (errirq < 0) {
+		ret = errirq;
+		goto out_err;
+	}
+
+	ret = devm_request_irq(&pdev->dev, errirq, kryo_l3_scu_handler,
+			       IRQF_ONESHOT | IRQF_TRIGGER_HIGH,
+			       "kryo_l3_scu_ecc_errirq", edev_ctl);
+	if (ret) {
+		edac_printk(KERN_DEBUG, DRV_NAME,
+			    "Failed to request l3-scu-errirq %d\n", errirq);
+	}
+
+out_err:
+	return ret;
+}
+
+static int qcom_kryo_edac_setup_irq(struct platform_device *pdev,
+				    struct edac_device_ctl_info *edev_ctl)
+{
+	int ret;
+
+	ret = kryo_l1_l2_setup_irq(pdev, edev_ctl);
+	if (ret) {
+		edac_printk(KERN_DEBUG, DRV_NAME,
+			    "Failed to setup l1-l2 irq\n");
+	}
+
+	ret = kryo_l3_setup_irq(pdev, edev_ctl);
+	if (ret) {
+		edac_printk(KERN_DEBUG, DRV_NAME,
+			    "Failed to setup l3-scu irq\n");
+	}
+
+	return ret;
+}
+
+static void kryo_poll_cache_error(struct edac_device_ctl_info *edev_ctl)
+{
+	if (!edev_ctl)
+		edev_ctl = drv_edev_ctl;
+
+	on_each_cpu(kryo_check_l1_l2_ecc, edev_ctl, 1);
+	kryo_check_l3_scu_ecc(edev_ctl);
+}
+
+static inline void kryo_enable_err_record(void)
+{
+	write_sysreg_s(KRYO_ERRXCTLR_ENABLE, SYS_ERXCTLR_EL1);
+	write_sysreg_s(KRYO_ERRXMISC0_CECR | KRYO_ERRXMISC0_CECO,
+		       SYS_ERXMISC0_EL1);
+	isb();
+}
+
+static inline void qcom_kryo_init_l3(void)
+{
+	if (!FIELD_GET(KRYO_ERRXFR_SUPPORTED, read_sysreg_s(SYS_ERXFR_EL1)))
+		return;
+
+	/* L3 is shared */
+	write_sysreg_s(KRYO_ERR_RECORD_L3, SYS_ERRSELR_EL1);
+	kryo_enable_err_record();
+}
+
+static void qcom_kryo_init_l1_l2(void *data)
+{
+	if (!FIELD_GET(KRYO_ERRXFR_SUPPORTED, read_sysreg_s(SYS_ERXFR_EL1)))
+		return;
+
+	/* L1 and L2 is per-cpu */
+	write_sysreg_s(KRYO_ERR_RECORD_L1_L2, SYS_ERRSELR_EL1);
+	kryo_enable_err_record();
+}
+
+static int kryo_edac_pm_notify(struct notifier_block *nb, unsigned long action,
+			       void *data)
+{
+	switch (action) {
+	case CPU_PM_EXIT:
+		qcom_kryo_init_l1_l2(NULL);
+		qcom_kryo_init_l3();
+		kryo_check_l1_l2_ecc(drv_edev_ctl);
+		kryo_check_l3_scu_ecc(drv_edev_ctl);
+		break;
+	default:
+		return NOTIFY_DONE;
+	}
+
+	return NOTIFY_OK;
+}
+
+static struct notifier_block kryo_edac_pm_nb = {
+	.notifier_call = kryo_edac_pm_notify,
+};
+
+static inline void qcom_kryo_edac_setup(void)
+{
+	on_each_cpu(qcom_kryo_init_l1_l2, NULL, 1);
+	qcom_kryo_init_l3();
+}
+
+static int qcom_kryo_edac_probe(struct platform_device *pdev)
+{
+	struct edac_device_ctl_info *edev_ctl;
+	struct device *dev = &pdev->dev;
+	int ret;
+
+	qcom_kryo_edac_setup();
+
+	edev_ctl = edac_device_alloc_ctl_info(0, DRV_NAME, 1, "L", 3, 1, NULL,
+					      0, edac_device_alloc_index());
+	if (!edev_ctl)
+		return -ENOMEM;
+
+	if (IS_ENABLED(CONFIG_EDAC_QCOM_KRYO_POLL)) {
+		edev_ctl->poll_msec = poll_msec;
+		edev_ctl->edac_check = kryo_poll_cache_error;
+	}
+	edev_ctl->dev = dev;
+	edev_ctl->mod_name = DRV_NAME;
+	edev_ctl->dev_name = dev_name(dev);
+	edev_ctl->ctl_name = "qcom_kryo_cache";
+	edev_ctl->panic_on_ue = 1;
+
+	ret = edac_device_add_device(edev_ctl);
+	if (ret)
+		goto out_mem;
+
+	platform_set_drvdata(pdev, edev_ctl);
+	drv_edev_ctl = edev_ctl;
+
+	ret = qcom_kryo_edac_setup_irq(pdev, edev_ctl);
+	if (ret)
+		goto out_dev;
+
+	cpu_pm_register_notifier(&kryo_edac_pm_nb);
+
+	return ret;
+
+out_dev:
+	edac_device_del_device(edev_ctl->dev);
+out_mem:
+	edac_device_free_ctl_info(edev_ctl);
+
+	return ret;
+}
+
+static void qcom_kryo_edac_teardown(struct platform_device *pdev)
+{
+	int errirq, faultirq;
+
+	faultirq = platform_get_irq_byname(pdev, "l1-l2-faultirq");
+	on_each_cpu(kryo_l1_l2_irq_disable, &faultirq, 1);
+	free_percpu_irq(faultirq, &edac_dev);
+
+	errirq = platform_get_irq_byname(pdev, "l1-l2-errirq");
+	on_each_cpu(kryo_l1_l2_irq_disable, &errirq, 1);
+	free_percpu_irq(errirq, &edac_dev);
+
+	cpu_pm_unregister_notifier(&kryo_edac_pm_nb);
+}
+
+static int qcom_kryo_edac_remove(struct platform_device *pdev)
+{
+	struct edac_device_ctl_info *edev_ctl = platform_get_drvdata(pdev);
+
+	qcom_kryo_edac_teardown(pdev);
+
+	edac_device_del_device(edev_ctl->dev);
+	edac_device_free_ctl_info(edev_ctl);
+
+	return 0;
+}
+
+static const struct of_device_id qcom_kryo_edac_of_match[] = {
+	{ .compatible = "qcom,kryo-edac" },
+	{ }
+};
+MODULE_DEVICE_TABLE(of, qcom_kryo_edac_of_match);
+
+static struct platform_driver qcom_kryo_edac_driver = {
+	.probe = qcom_kryo_edac_probe,
+	.remove = qcom_kryo_edac_remove,
+	.driver = {
+		.name = DRV_NAME,
+		.of_match_table	= qcom_kryo_edac_of_match,
+	},
+};
+module_platform_driver(qcom_kryo_edac_driver);
+
+MODULE_DESCRIPTION("QCOM Kryo EDAC driver");
+MODULE_LICENSE("GPL v2");
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] drivers: edac: Add EDAC support for Kryo CPU caches
       [not found] ` <0101016ed57a6311-e815485c-4b77-4342-a3de-203673941602-000000@us-west-2.amazonses.com>
@ 2019-12-11 19:32   ` Evan Green
       [not found]     ` <5df16ebe.1c69fb81.6481f.a011@mx.google.com>
  0 siblings, 1 reply; 16+ messages in thread
From: Evan Green @ 2019-12-11 19:32 UTC (permalink / raw)
  To: Sai Prakash Ranjan
  Cc: Andy Gross, Bjorn Andersson, Mark Rutland, Rob Herring,
	open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS,
	Borislav Petkov, Mauro Carvalho Chehab, Tony Luck, James Morse,
	Robert Richter, linux-edac, linux-arm Mailing List, LKML,
	linux-arm-msm, Stephen Boyd, tsoni, psodagud

Hi Sai,

On Thu, Dec 5, 2019 at 1:53 AM Sai Prakash Ranjan
<saiprakash.ranjan@codeaurora.org> wrote:
>
> Kryo{3,4}XX CPU cores implement RAS extensions to support
> Error Correcting Code(ECC). Currently all Kryo{3,4}XX CPU
> cores (gold/silver a.k.a big/LITTLE) support ECC via RAS.
> This adds an interrupt based driver for those CPUs and
> provides an optional polling of error recording system
> registers.
>
> Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
> ---
>  MAINTAINERS                   |   7 +
>  drivers/edac/Kconfig          |  20 +
>  drivers/edac/Makefile         |   1 +
>  drivers/edac/qcom_kryo_edac.c | 679 ++++++++++++++++++++++++++++++++++
>  4 files changed, 707 insertions(+)
>  create mode 100644 drivers/edac/qcom_kryo_edac.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index c2d80079dccc..f58c93f963f6 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -6049,6 +6049,13 @@ L:       linux-edac@vger.kernel.org
>  S:     Maintained
>  F:     drivers/edac/qcom_edac.c
>
> +EDAC-KRYO-QCOM
> +M:     Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
> +L:     linux-arm-msm@vger.kernel.org
> +L:     linux-edac@vger.kernel.org
> +S:     Maintained
> +F:     drivers/edac/qcom_kryo_edac.c
> +
>  EDIROL UA-101/UA-1000 DRIVER
>  M:     Clemens Ladisch <clemens@ladisch.de>
>  L:     alsa-devel@alsa-project.org (moderated for non-subscribers)
> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
> index 417dad635526..cd78ac2917c9 100644
> --- a/drivers/edac/Kconfig
> +++ b/drivers/edac/Kconfig
> @@ -508,6 +508,26 @@ config EDAC_QCOM
>           For debugging issues having to do with stability and overall system
>           health, you should probably say 'Y' here.
>
> +config EDAC_QCOM_KRYO
> +       tristate "QCOM Kryo EDAC for CPU L1/L2/L3-SCU caches"
> +       depends on ARCH_QCOM && ARM64_RAS_EXTN
> +       help
> +         Support for Error detection and correction on Kryo Gold and Silver CPU
> +         cores with RAS extensions. Currently it detects and reports all Single
> +         Bit Errors (SBEs) and Double Bit Errors (DBEs).
> +
> +         For debugging issues having to do with stability and overall system
> +         health, you should probably say 'Y' here.
> +
> +config EDAC_QCOM_KRYO_POLL
> +       depends on EDAC_QCOM_KRYO
> +       bool "Poll on Kryo ECC registers"
> +       help
> +         This option chooses whether or not you want to poll on the Kryo ECC
> +         registers. When this is enabled, the polling rate can be set as a
> +         module parameter. By default, it will call the polling function every
> +         second.
> +
>  config EDAC_ASPEED
>         tristate "Aspeed AST 2500 SoC"
>         depends on MACH_ASPEED_G5
> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
> index d77200c9680b..29edcfa6ec0e 100644
> --- a/drivers/edac/Makefile
> +++ b/drivers/edac/Makefile
> @@ -85,5 +85,6 @@ obj-$(CONFIG_EDAC_SYNOPSYS)           += synopsys_edac.o
>  obj-$(CONFIG_EDAC_XGENE)               += xgene_edac.o
>  obj-$(CONFIG_EDAC_TI)                  += ti_edac.o
>  obj-$(CONFIG_EDAC_QCOM)                        += qcom_edac.o
> +obj-$(CONFIG_EDAC_QCOM_KRYO)           += qcom_kryo_edac.o
>  obj-$(CONFIG_EDAC_ASPEED)              += aspeed_edac.o
>  obj-$(CONFIG_EDAC_BLUEFIELD)           += bluefield_edac.o
> diff --git a/drivers/edac/qcom_kryo_edac.c b/drivers/edac/qcom_kryo_edac.c
> new file mode 100644
> index 000000000000..05b60ad3cb0e
> --- /dev/null
> +++ b/drivers/edac/qcom_kryo_edac.c
> @@ -0,0 +1,679 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2019, The Linux Foundation. All rights reserved.
> + */
> +
> +#include <linux/bitfield.h>
> +#include <linux/cpu_pm.h>
> +#include <linux/of_irq.h>
> +#include <linux/smp.h>
> +
> +#include <asm/cputype.h>
> +#include <asm/sysreg.h>
> +
> +#include "edac_device.h"
> +#include "edac_mc.h"
> +
> +#define DRV_NAME               "qcom_kryo_edac"
> +
> +/*
> + * ARM Cortex-A55, Cortex-A75, Cortex-A76 TRM Chapter B3.3
> + * ARM DSU TRM Chapter B2.3
> + * CFI = Corrected Fault Handling interrupt, FI = Fault handling interrupt
> + * UI = Uncorrected error recovery interrupt, ED = Error Detection
> + */
> +#define KRYO_ERRXCTLR_ED       BIT(0)
> +#define KRYO_ERRXCTLR_UI       BIT(2)
> +#define KRYO_ERRXCTLR_FI       BIT(3)
> +#define KRYO_ERRXCTLR_CFI      BIT(8)
> +#define KRYO_ERRXCTLR_ENABLE   (KRYO_ERRXCTLR_CFI | KRYO_ERRXCTLR_FI | \
> +                                KRYO_ERRXCTLR_UI | KRYO_ERRXCTLR_ED)
> +
> +/*
> + * ARM Cortex-A55, Cortex-A75, Cortex-A76 TRM Chapter B3.4
> + * ARM DSU TRM Chapter B2.4
> + */
> +#define KRYO_ERRXFR_ED         GENMASK(1, 0)
> +#define KRYO_ERRXFR_DE         GENMASK(3, 2)
> +#define KRYO_ERRXFR_UI         GENMASK(5, 4)
> +#define KRYO_ERRXFR_FI         GENMASK(7, 6)
> +#define KRYO_ERRXFR_UE         GENMASK(9, 8)
> +#define KRYO_ERRXFR_CFI                GENMASK(11, 10)
> +#define KRYO_ERRXFR_CEC                GENMASK(14, 12)
> +#define KRYO_ERRXFR_RP         BIT(15)
> +#define KRYO_ERRXFR_SUPPORTED  (KRYO_ERRXFR_ED | KRYO_ERRXFR_DE | \
> +                                KRYO_ERRXFR_UI | KRYO_ERRXFR_FI | \
> +                                KRYO_ERRXFR_UE | KRYO_ERRXFR_CFI | \
> +                                KRYO_ERRXFR_CEC | KRYO_ERRXFR_RP)
> +
> +/*
> + * ARM Cortex-A55, Cortex-A75, Cortex-A76 TRM Chapter B3.5
> + * ARM DSU TRM Chapter B2.5
> + */
> +#define KRYO_ERRXMISC0_CECR    GENMASK_ULL(38, 32)
> +#define KRYO_ERRXMISC0_CECO    GENMASK_ULL(46, 40)
> +
> +/* ARM Cortex-A76 TRM Chapter B3.5 */
> +#define KRYO_ERRXMISC0_UNIT    GENMASK(3, 0)
> +#define KRYO_ERRXMISC0_LVL     GENMASK(3, 1)
> +
> +/* ARM Cortex-A76 TRM Chapter B3.10 has SERR bitfields 4:0
> + * but Cortex-A55, Cortex-A75 and DSU TRM has SERR bitfields 7:0.
> + * Since max error record is 21, we can use bitfields 4:0 for
> + * Kryo{3,4}XX CPUs.
> + */
> +#define KRYO_ERRXSTATUS_SERR   GENMASK(4, 0)
> +#define KRYO_ERRXSTATUS_DE     BIT(23)
> +#define KRYO_ERRXSTATUS_CE     GENMASK(25, 24)
> +#define KRYO_ERRXSTATUS_MV     BIT(26)
> +#define KRYO_ERRXSTATUS_UE     BIT(29)
> +#define KRYO_ERRXSTATUS_VALID  BIT(30)
> +
> +/* ARM Cortex-A76 TRM Chapter B3.5
> + * IC = Instruction Cache, DC = Data Cache
> + */
> +#define KRYO_L1_UNIT_IC                0x1
> +#define KRYO_L2_UNIT_TLB       0x2
> +#define KRYO_L1_UNIT_DC                0x4
> +#define KRYO_L2_UNIT           0x8
> +
> +/*
> + * ARM Cortex-A55 TRM Chapter B2.36
> + * ARM Cortex-A75, Cortex-A76 TRM Chapter B2.37
> + */
> +#define KRYO_ERR_RECORD_L1_L2  0x0
> +#define KRYO_ERR_RECORD_L3     0x1
> +
> +/* ARM DSU TRM Chapter B2.10 */
> +#define BUS_ERROR              0x12
> +
> +/* QCOM Kryo CPU part numbers */
> +#define KRYO3XX_GOLD           0x802
> +#define KRYO4XX_GOLD           0x804
> +#define KRYO4XX_SILVER_V1      0x803
> +#define KRYO4XX_SILVER_V2      0x805
> +
> +#define KRYO_EDAC_MSG_MAX      256
> +
> +static int poll_msec = 1000;
> +module_param(poll_msec, int, 0444);
> +
> +enum {
> +       KRYO_L1 = 0,
> +       KRYO_L2,
> +       KRYO_L3,
> +};
> +
> +/* CE = Corrected Error, UE = Uncorrected Error, DE = Deferred Error */
> +enum {

No name?

> +       KRYO_L1_CE = 0,
> +       KRYO_L1_UE,
> +       KRYO_L1_DE,
> +       KRYO_L2_CE,
> +       KRYO_L2_UE,
> +       KRYO_L2_DE,
> +       KRYO_L3_CE,
> +       KRYO_L3_UE,
> +       KRYO_L3_DE,
> +};
> +
> +struct error_record {
> +       u32 error_code;
> +       const char *error_msg;
> +};
> +
> +struct error_type {
> +       void (*fn)(struct edac_device_ctl_info *edev_ctl,
> +                  int inst_nr, int block_nr, const char *msg);
> +       const char *msg;
> +};
> +
> +/*
> + * ARM Cortex-A55, Cortex-A75, Cortex-A76 TRM Chapter B3.10
> + * ARM DSU TRM Chapter B2.10
> + */
> +static const struct error_record serror_record[] = {
> +       { 0x1,  "Errors due to fault injection"         },
> +       { 0x2,  "ECC error from internal data buffer"   },
> +       { 0x6,  "ECC error on cache data RAM"           },
> +       { 0x7,  "ECC error on cache tag or dirty RAM"   },
> +       { 0x8,  "Parity error on TLB data RAM"          },
> +       { 0x9,  "Parity error on TLB tag RAM"           },
> +       { 0x12, "Error response for a cache copyback"   },
> +       { 0x15, "Deferred error not supported"          },
> +};
> +
> +static const struct error_type err_type[] = {
> +       { edac_device_handle_ce, "Kryo L1 Corrected Error"      },
> +       { edac_device_handle_ue, "Kryo L1 Uncorrected Error"    },
> +       { edac_device_handle_ue, "Kryo L1 Deferred Error"       },
> +       { edac_device_handle_ce, "Kryo L2 Corrected Error"      },
> +       { edac_device_handle_ue, "Kryo L2 Uncorrected Error"    },
> +       { edac_device_handle_ue, "Kryo L2 Deferred Error"       },
> +       { edac_device_handle_ce, "L3 Corrected Error"           },
> +       { edac_device_handle_ue, "L3 Uncorrected Error"         },
> +       { edac_device_handle_ue, "L3 Deferred Error"            },
> +};

A comment is warranted to indicate that err_type is indexed by the
enum, as this would be easy to mess up in later changes.

> +
> +static struct edac_device_ctl_info __percpu *edac_dev;
> +static struct edac_device_ctl_info *drv_edev_ctl;
> +
> +static const char *get_error_msg(u64 errxstatus)
> +{
> +       const struct error_record *rec;
> +       u32 errxstatus_serr;
> +
> +       errxstatus_serr = FIELD_GET(KRYO_ERRXSTATUS_SERR, errxstatus);
> +
> +       for (rec = serror_record; rec->error_code; rec++) {

It looks like you expect the table to be zero terminated, but it's
not. Add the missing zero entry.

> +               if (errxstatus_serr == rec->error_code)
> +                       return rec->error_msg;
> +       }
> +
> +       return NULL;
> +}
> +
> +static void dump_syndrome_reg(int error_type, int level,
> +                             u64 errxstatus, u64 errxmisc,
> +                             struct edac_device_ctl_info *edev_ctl)
> +{
> +       char msg[KRYO_EDAC_MSG_MAX];
> +       const char *error_msg;
> +       int cpu;
> +
> +       cpu = raw_smp_processor_id();
> +
> +       error_msg = get_error_msg(errxstatus);
> +       if (!error_msg)
> +               return;
> +
> +       snprintf(msg, KRYO_EDAC_MSG_MAX,
> +                "CPU%d: %s, ERRXSTATUS_EL1:%#llx ERRXMISC0_EL1:%#llx, %s",
> +                cpu, err_type[error_type].msg, errxstatus, errxmisc,
> +                error_msg);
> +
> +       err_type[error_type].fn(edev_ctl, 0, level, msg);
> +}
> +
> +static void kryo_check_err_type(u64 errxstatus, u64 errxmisc,
> +                               struct edac_device_ctl_info *edev_ctl,
> +                               int level)
> +{
> +       u32 errxstatus_ue, errxstatus_ce, errxstatus_de;
> +
> +       errxstatus_ce = FIELD_GET(KRYO_ERRXSTATUS_CE, errxstatus);
> +       errxstatus_ue = FIELD_GET(KRYO_ERRXSTATUS_UE, errxstatus);
> +       errxstatus_de = FIELD_GET(KRYO_ERRXSTATUS_DE, errxstatus);
> +
> +       switch (level) {
> +       case KRYO_L1:
> +               if (errxstatus_ce)
> +                       dump_syndrome_reg(KRYO_L1_CE, level, errxstatus,
> +                                         errxmisc, edev_ctl);
> +               else if (errxstatus_ue)
> +                       dump_syndrome_reg(KRYO_L1_UE, level, errxstatus,
> +                                         errxmisc, edev_ctl);
> +               else if (errxstatus_de)
> +                       dump_syndrome_reg(KRYO_L1_DE, level, errxstatus,
> +                                         errxmisc, edev_ctl);
> +               else
> +                       edac_printk(KERN_ERR, DRV_NAME, "Unknown error\n");
> +               break;
> +       case KRYO_L2:
> +               if (errxstatus_ce)
> +                       dump_syndrome_reg(KRYO_L2_CE, level, errxstatus,
> +                                         errxmisc, edev_ctl);
> +               else if (errxstatus_ue)
> +                       dump_syndrome_reg(KRYO_L2_UE, level, errxstatus,
> +                                         errxmisc, edev_ctl);
> +               else if (errxstatus_de)
> +                       dump_syndrome_reg(KRYO_L2_DE, level, errxstatus,
> +                                         errxmisc, edev_ctl);
> +               else
> +                       edac_printk(KERN_ERR, DRV_NAME, "Unknown error\n");
> +               break;
> +       case KRYO_L3:
> +               if (errxstatus_ce)
> +                       dump_syndrome_reg(KRYO_L3_CE, level, errxstatus,
> +                                         errxmisc, edev_ctl);
> +               else if (errxstatus_ue)
> +                       dump_syndrome_reg(KRYO_L3_UE, level, errxstatus,
> +                                         errxmisc, edev_ctl);
> +               else if (errxstatus_de)
> +                       dump_syndrome_reg(KRYO_L3_DE, level, errxstatus,
> +                                         errxmisc, edev_ctl);
> +               else
> +                       edac_printk(KERN_ERR, DRV_NAME, "Unknown error\n");
> +               break;
> +       default:
> +               edac_printk(KERN_ERR, DRV_NAME, "Unknown level\n");
> +       }
> +}
> +
> +static inline void kryo_clear_error(u64 errxstatus)
> +{
> +       write_sysreg_s(errxstatus, SYS_ERXSTATUS_EL1);
> +       isb();

Is the isb() necessary? If so, why not a dsb as well?

> +}
> +
> +static void kryo_parse_l1_l2_cache_error(u64 errxstatus, u64 errxmisc,
> +                                        struct edac_device_ctl_info *edev_ctl,
> +                                        int cpu)
> +{
> +       u32 part_num = read_cpuid_part_number();
> +
> +       switch (part_num) {
> +       /* Kryo3XX gold CPU cores do not have a UNIT bitfield */
> +       case KRYO3XX_GOLD:
> +       case KRYO4XX_SILVER_V1:
> +       case KRYO4XX_SILVER_V2:
> +               switch (FIELD_GET(KRYO_ERRXMISC0_LVL, errxmisc)) {
> +               case KRYO_L1:
> +                       kryo_check_err_type(errxstatus, errxmisc,
> +                                           edev_ctl, KRYO_L1);
> +                       break;
> +               case KRYO_L2:
> +                       kryo_check_err_type(errxstatus, errxmisc,
> +                                           edev_ctl, KRYO_L2);
> +                       break;
> +               default:
> +                       edac_printk(KERN_ERR, DRV_NAME,
> +                                   "silver cpu:%d unknown error: %lu\n", cpu,
> +                                   FIELD_GET(KRYO_ERRXMISC0_LVL, errxmisc));
> +               }
> +               break;
> +       /* Kryo4XX gold CPU cores have a UNIT bitfield to identify levels */
> +       case KRYO4XX_GOLD:
> +               switch (FIELD_GET(KRYO_ERRXMISC0_UNIT, errxmisc)) {
> +               case KRYO_L1_UNIT_DC:
> +               case KRYO_L1_UNIT_IC:
> +                       kryo_check_err_type(errxstatus, errxmisc,
> +                                           edev_ctl, KRYO_L1);
> +                       break;
> +               case KRYO_L2_UNIT:
> +               case KRYO_L2_UNIT_TLB:
> +                       kryo_check_err_type(errxstatus, errxmisc,
> +                                           edev_ctl, KRYO_L2);
> +                       break;
> +               default:
> +                       edac_printk(KERN_ERR, DRV_NAME,
> +                                   "gold cpu:%d unknown error: %lu\n", cpu,
> +                                   FIELD_GET(KRYO_ERRXMISC0_UNIT, errxmisc));
> +               }
> +               break;
> +       default:
> +               edac_printk(KERN_ERR, DRV_NAME,
> +                           "Error in matching cpu%d with part num:%u\n",
> +                           cpu, part_num);
> +       }
> +}
> +
> +static inline bool kryo_check_regs_valid(u64 errxstatus)
> +{
> +       /* Check if status and misc regs are valid */
> +       if (!(FIELD_GET(KRYO_ERRXSTATUS_VALID, errxstatus)) ||
> +           !(FIELD_GET(KRYO_ERRXSTATUS_MV, errxstatus)))
> +               return false;
> +
> +       return true;
> +}
> +
> +static void kryo_check_l1_l2_ecc(void *info)
> +{
> +       struct edac_device_ctl_info *edev_ctl = info;
> +       u64 errxstatus;
> +       u64 errxmisc;
> +       int cpu;
> +
> +       cpu = smp_processor_id();
> +       /* We know record 0 is L1 and L2 */
> +       write_sysreg_s(0, SYS_ERRSELR_EL1);
> +       isb();

Another isb I'm not sure about. Is this meant to provide a barrier
between ERRSELR and ERXSTATUS? Wouldn't that be dsb, not isb?

> +
> +       errxstatus = read_sysreg_s(SYS_ERXSTATUS_EL1);
> +       if (!kryo_check_regs_valid(errxstatus))
> +               return;
> +
> +       errxmisc = read_sysreg_s(SYS_ERXMISC0_EL1);
> +       /* Check if L1/L2 error */
> +       if (!(FIELD_GET(KRYO_ERRXMISC0_LVL, errxmisc) == KRYO_L1) &&
> +           !(FIELD_GET(KRYO_ERRXMISC0_LVL, errxmisc) == KRYO_L2))
> +               return;
> +
> +       kryo_parse_l1_l2_cache_error(errxstatus, errxmisc, edev_ctl, cpu);
> +       kryo_clear_error(errxstatus);
> +}
> +

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] drivers: edac: Add EDAC support for Kryo CPU caches
       [not found]     ` <5df16ebe.1c69fb81.6481f.a011@mx.google.com>
@ 2019-12-13  5:31       ` Sai Prakash Ranjan
  0 siblings, 0 replies; 16+ messages in thread
From: Sai Prakash Ranjan @ 2019-12-13  5:31 UTC (permalink / raw)
  To: Stephen Boyd
  Cc: Evan Green, Mark Rutland,
	OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS, Tony Luck,
	psodagud, linux-arm-msm, tsoni, LKML, Rob Herring,
	Bjorn Andersson, Robert Richter, Andy Gross, Borislav Petkov,
	James Morse, Mauro Carvalho Chehab, linux-arm Mailing List,
	linux-edac

On 2019-12-12 04:03, Stephen Boyd wrote:
> Quoting Evan Green (2019-12-11 11:32:37)
>> Hi Sai,
>> 
>> On Thu, Dec 5, 2019 at 1:53 AM Sai Prakash Ranjan
>> <saiprakash.ranjan@codeaurora.org> wrote:
>> > diff --git a/drivers/edac/qcom_kryo_edac.c b/drivers/edac/qcom_kryo_edac.c
>> > new file mode 100644
>> > index 000000000000..05b60ad3cb0e
>> > --- /dev/null
>> > +++ b/drivers/edac/qcom_kryo_edac.c
>> > @@ -0,0 +1,679 @@
> [...]
>> > +static const struct error_record serror_record[] = {
>> > +       { 0x1,  "Errors due to fault injection"         },
>> > +       { 0x2,  "ECC error from internal data buffer"   },
>> > +       { 0x6,  "ECC error on cache data RAM"           },
>> > +       { 0x7,  "ECC error on cache tag or dirty RAM"   },
>> > +       { 0x8,  "Parity error on TLB data RAM"          },
>> > +       { 0x9,  "Parity error on TLB tag RAM"           },
>> > +       { 0x12, "Error response for a cache copyback"   },
>> > +       { 0x15, "Deferred error not supported"          },
>> > +};
>> > +
>> > +static const struct error_type err_type[] = {
>> > +       { edac_device_handle_ce, "Kryo L1 Corrected Error"      },
>> > +       { edac_device_handle_ue, "Kryo L1 Uncorrected Error"    },
>> > +       { edac_device_handle_ue, "Kryo L1 Deferred Error"       },
>> > +       { edac_device_handle_ce, "Kryo L2 Corrected Error"      },
>> > +       { edac_device_handle_ue, "Kryo L2 Uncorrected Error"    },
>> > +       { edac_device_handle_ue, "Kryo L2 Deferred Error"       },
>> > +       { edac_device_handle_ce, "L3 Corrected Error"           },
>> > +       { edac_device_handle_ue, "L3 Uncorrected Error"         },
>> > +       { edac_device_handle_ue, "L3 Deferred Error"            },
>> > +};
>> 
>> A comment is warranted to indicate that err_type is indexed by the
>> enum, as this would be easy to mess up in later changes.
> 
> Instead of a comment please use array indexing.
> 
> 	[KRYO_L1_CE] = { edac_device_handle_ce, "Kryo L1..." },
> 	...

Will do this in the next spin.

-Sai

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a 
member
of Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] dt-bindings: edac: Add DT bindings for Kryo EDAC
  2019-12-05  9:53 ` [PATCH 1/2] dt-bindings: edac: Add DT bindings for Kryo EDAC Sai Prakash Ranjan
@ 2019-12-18 23:37   ` Rob Herring
  2019-12-19  6:50     ` Sai Prakash Ranjan
  2020-01-15 18:48   ` James Morse
  1 sibling, 1 reply; 16+ messages in thread
From: Rob Herring @ 2019-12-18 23:37 UTC (permalink / raw)
  To: Sai Prakash Ranjan
  Cc: Andy Gross, Bjorn Andersson, Mark Rutland, devicetree,
	Borislav Petkov, Mauro Carvalho Chehab, Tony Luck, James Morse,
	Robert Richter, linux-edac, linux-arm-kernel, linux-kernel,
	linux-arm-msm, Stephen Boyd, Evan Green, tsoni, psodagud

On Thu, Dec 05, 2019 at 09:53:05AM +0000, Sai Prakash Ranjan wrote:
> This adds DT bindings for Kryo EDAC implemented with RAS
> extensions on KRYO{3,4}XX CPU cores for reporting of cache
> errors.
> 
> Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
> ---
>  .../bindings/edac/qcom-kryo-edac.yaml         | 67 +++++++++++++++++++
>  1 file changed, 67 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/edac/qcom-kryo-edac.yaml
> 
> diff --git a/Documentation/devicetree/bindings/edac/qcom-kryo-edac.yaml b/Documentation/devicetree/bindings/edac/qcom-kryo-edac.yaml
> new file mode 100644
> index 000000000000..1a39429a73b4
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/edac/qcom-kryo-edac.yaml
> @@ -0,0 +1,67 @@
> +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/edac/qcom-kryo-edac.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: Kryo Error Detection and Correction(EDAC)
> +
> +maintainers:
> +  - Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
> +
> +description: |
> +  Kryo EDAC is defined to describe on-chip error detection and correction
> +  for the Kryo CPU cores which implement RAS extensions. It will report
> +  all Single Bit Errors and Double Bit Errors found in L1/L2 caches in
> +  in two registers ERXSTATUS_EL1 and ERXMISC0_EL1. L3-SCU cache errors
> +  are reported in ERR1STATUS and ERR1MISC0 registers.
> +    ERXSTATUS_EL1 - Selected Error Record Primary Status Register, EL1
> +    ERXMISC0_EL1 - Selected Error Record Miscellaneous Register 0, EL1
> +    ERR1STATUS - Error Record Primary Status Register
> +    ERR1MISC0 - Error Record Miscellaneous Register 0
> +  Current implementation of Kryo ECC(Error Correcting Code) mechanism is
> +  based on interrupts.
> +
> +properties:
> +  compatible:
> +    enum:
> +      - qcom,kryo-edac
> +
> +  interrupts:
> +    minItems: 1
> +    maxItems: 4
> +    items:
> +      - description: l1-l2 cache faultirq interrupt
> +      - description: l1-l2 cache errirq interrupt
> +      - description: l3-scu cache errirq interrupt
> +      - description: l3-scu cache faultirq interrupt
> +
> +  interrupt-names:
> +    minItems: 1
> +    maxItems: 4

You are saying only these combinations are valid:

l1-l2-faultirq

l1-l2-faultirq
l1-l2-errirq

l1-l2-faultirq
l1-l2-errirq
l3-scu-errirq

l1-l2-faultirq
l1-l2-errirq
l3-scu-errirq
l3-scu-faultirq

Is that your intent?

> +    items:
> +      - const: l1-l2-faultirq
> +      - const: l1-l2-errirq
> +      - const: l3-scu-errirq
> +      - const: l3-scu-faultirq
> +
> +required:
> +  - compatible
> +  - interrupts
> +  - interrupt-names
> +
> +examples:
> +  - |
> +    #include <dt-bindings/interrupt-controller/arm-gic.h>
> +
> +    kryo_edac {
> +      compatible = "qcom,kryo-edac";
> +      interrupts = <GIC_PPI 6 IRQ_TYPE_LEVEL_HIGH>,
> +                   <GIC_PPI 7 IRQ_TYPE_LEVEL_HIGH>,
> +                   <GIC_SPI 34 IRQ_TYPE_LEVEL_HIGH>,
> +                   <GIC_SPI 35 IRQ_TYPE_LEVEL_HIGH>;
> +      interrupt-names = "l1-l2-faultirq",
> +                        "l1-l2-errirq",
> +                        "l3-scu-errirq",
> +                        "l3-scu-faultirq";
> +    };
> -- 
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
> of Code Aurora Forum, hosted by The Linux Foundation
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] dt-bindings: edac: Add DT bindings for Kryo EDAC
  2019-12-18 23:37   ` Rob Herring
@ 2019-12-19  6:50     ` Sai Prakash Ranjan
  2019-12-19 13:58       ` Rob Herring
  0 siblings, 1 reply; 16+ messages in thread
From: Sai Prakash Ranjan @ 2019-12-19  6:50 UTC (permalink / raw)
  To: Rob Herring
  Cc: Andy Gross, Bjorn Andersson, Mark Rutland, devicetree,
	Borislav Petkov, Mauro Carvalho Chehab, Tony Luck, James Morse,
	Robert Richter, linux-edac, linux-arm-kernel, linux-kernel,
	linux-arm-msm, Stephen Boyd, Evan Green, tsoni, psodagud

Hi Rob,

On 2019-12-19 05:07, Rob Herring wrote:
> On Thu, Dec 05, 2019 at 09:53:05AM +0000, Sai Prakash Ranjan wrote:
>> This adds DT bindings for Kryo EDAC implemented with RAS
>> extensions on KRYO{3,4}XX CPU cores for reporting of cache
>> errors.
>> 
>> Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
>> ---
>>  .../bindings/edac/qcom-kryo-edac.yaml         | 67 
>> +++++++++++++++++++
>>  1 file changed, 67 insertions(+)
>>  create mode 100644 
>> Documentation/devicetree/bindings/edac/qcom-kryo-edac.yaml
>> 
>> diff --git 
>> a/Documentation/devicetree/bindings/edac/qcom-kryo-edac.yaml 
>> b/Documentation/devicetree/bindings/edac/qcom-kryo-edac.yaml
>> new file mode 100644
>> index 000000000000..1a39429a73b4
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/edac/qcom-kryo-edac.yaml
>> @@ -0,0 +1,67 @@
>> +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
>> +%YAML 1.2
>> +---
>> +$id: http://devicetree.org/schemas/edac/qcom-kryo-edac.yaml#
>> +$schema: http://devicetree.org/meta-schemas/core.yaml#
>> +
>> +title: Kryo Error Detection and Correction(EDAC)
>> +
>> +maintainers:
>> +  - Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
>> +
>> +description: |
>> +  Kryo EDAC is defined to describe on-chip error detection and 
>> correction
>> +  for the Kryo CPU cores which implement RAS extensions. It will 
>> report
>> +  all Single Bit Errors and Double Bit Errors found in L1/L2 caches 
>> in
>> +  in two registers ERXSTATUS_EL1 and ERXMISC0_EL1. L3-SCU cache 
>> errors
>> +  are reported in ERR1STATUS and ERR1MISC0 registers.
>> +    ERXSTATUS_EL1 - Selected Error Record Primary Status Register, 
>> EL1
>> +    ERXMISC0_EL1 - Selected Error Record Miscellaneous Register 0, 
>> EL1
>> +    ERR1STATUS - Error Record Primary Status Register
>> +    ERR1MISC0 - Error Record Miscellaneous Register 0
>> +  Current implementation of Kryo ECC(Error Correcting Code) mechanism 
>> is
>> +  based on interrupts.
>> +
>> +properties:
>> +  compatible:
>> +    enum:
>> +      - qcom,kryo-edac
>> +
>> +  interrupts:
>> +    minItems: 1
>> +    maxItems: 4
>> +    items:
>> +      - description: l1-l2 cache faultirq interrupt
>> +      - description: l1-l2 cache errirq interrupt
>> +      - description: l3-scu cache errirq interrupt
>> +      - description: l3-scu cache faultirq interrupt
>> +
>> +  interrupt-names:
>> +    minItems: 1
>> +    maxItems: 4
> 
> You are saying only these combinations are valid:
> 
> l1-l2-faultirq
> 
> l1-l2-faultirq
> l1-l2-errirq
> 
> l1-l2-faultirq
> l1-l2-errirq
> l3-scu-errirq
> 
> l1-l2-faultirq
> l1-l2-errirq
> l3-scu-errirq
> l3-scu-faultirq
> 
> Is that your intent?
> 

No, I want any combination of interrupts to be valid with atleast one 
interrupt as mandatory.
I thought specifying minItems as 1 and maxItems as 4 will take care of 
this,  am I doing something wrong?

Thanks,
Sai

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a 
member
of Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] dt-bindings: edac: Add DT bindings for Kryo EDAC
  2019-12-19  6:50     ` Sai Prakash Ranjan
@ 2019-12-19 13:58       ` Rob Herring
  2019-12-19 14:48         ` Sai Prakash Ranjan
  0 siblings, 1 reply; 16+ messages in thread
From: Rob Herring @ 2019-12-19 13:58 UTC (permalink / raw)
  To: Sai Prakash Ranjan
  Cc: Andy Gross, Bjorn Andersson, Mark Rutland, devicetree,
	Borislav Petkov, Mauro Carvalho Chehab, Tony Luck, James Morse,
	Robert Richter, linux-edac,
	moderated list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE,
	linux-kernel, linux-arm-msm, Stephen Boyd, Evan Green,
	Trilok Soni, Prasad Sodagudi

On Thu, Dec 19, 2019 at 12:50 AM Sai Prakash Ranjan
<saiprakash.ranjan@codeaurora.org> wrote:
>
> Hi Rob,
>
> On 2019-12-19 05:07, Rob Herring wrote:
> > On Thu, Dec 05, 2019 at 09:53:05AM +0000, Sai Prakash Ranjan wrote:
> >> This adds DT bindings for Kryo EDAC implemented with RAS
> >> extensions on KRYO{3,4}XX CPU cores for reporting of cache
> >> errors.
> >>
> >> Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
> >> ---
> >>  .../bindings/edac/qcom-kryo-edac.yaml         | 67
> >> +++++++++++++++++++
> >>  1 file changed, 67 insertions(+)
> >>  create mode 100644
> >> Documentation/devicetree/bindings/edac/qcom-kryo-edac.yaml
> >>
> >> diff --git
> >> a/Documentation/devicetree/bindings/edac/qcom-kryo-edac.yaml
> >> b/Documentation/devicetree/bindings/edac/qcom-kryo-edac.yaml
> >> new file mode 100644
> >> index 000000000000..1a39429a73b4
> >> --- /dev/null
> >> +++ b/Documentation/devicetree/bindings/edac/qcom-kryo-edac.yaml
> >> @@ -0,0 +1,67 @@
> >> +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> >> +%YAML 1.2
> >> +---
> >> +$id: http://devicetree.org/schemas/edac/qcom-kryo-edac.yaml#
> >> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> >> +
> >> +title: Kryo Error Detection and Correction(EDAC)
> >> +
> >> +maintainers:
> >> +  - Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
> >> +
> >> +description: |
> >> +  Kryo EDAC is defined to describe on-chip error detection and
> >> correction
> >> +  for the Kryo CPU cores which implement RAS extensions. It will
> >> report
> >> +  all Single Bit Errors and Double Bit Errors found in L1/L2 caches
> >> in
> >> +  in two registers ERXSTATUS_EL1 and ERXMISC0_EL1. L3-SCU cache
> >> errors
> >> +  are reported in ERR1STATUS and ERR1MISC0 registers.
> >> +    ERXSTATUS_EL1 - Selected Error Record Primary Status Register,
> >> EL1
> >> +    ERXMISC0_EL1 - Selected Error Record Miscellaneous Register 0,
> >> EL1
> >> +    ERR1STATUS - Error Record Primary Status Register
> >> +    ERR1MISC0 - Error Record Miscellaneous Register 0
> >> +  Current implementation of Kryo ECC(Error Correcting Code) mechanism
> >> is
> >> +  based on interrupts.
> >> +
> >> +properties:
> >> +  compatible:
> >> +    enum:
> >> +      - qcom,kryo-edac
> >> +
> >> +  interrupts:
> >> +    minItems: 1
> >> +    maxItems: 4
> >> +    items:
> >> +      - description: l1-l2 cache faultirq interrupt
> >> +      - description: l1-l2 cache errirq interrupt
> >> +      - description: l3-scu cache errirq interrupt
> >> +      - description: l3-scu cache faultirq interrupt
> >> +
> >> +  interrupt-names:
> >> +    minItems: 1
> >> +    maxItems: 4
> >
> > You are saying only these combinations are valid:
> >
> > l1-l2-faultirq
> >
> > l1-l2-faultirq
> > l1-l2-errirq
> >
> > l1-l2-faultirq
> > l1-l2-errirq
> > l3-scu-errirq
> >
> > l1-l2-faultirq
> > l1-l2-errirq
> > l3-scu-errirq
> > l3-scu-faultirq
> >
> > Is that your intent?
> >
>
> No, I want any combination of interrupts to be valid with atleast one
> interrupt as mandatory.
> I thought specifying minItems as 1 and maxItems as 4 will take care of
> this,  am I doing something wrong?

Interrupts (really all properties) have a defined order in DT and an
'items' list defines both the order and index. You'll need to use
oneOf and list out the possibilities. Stick to ones you actually need.

Rob

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] dt-bindings: edac: Add DT bindings for Kryo EDAC
  2019-12-19 13:58       ` Rob Herring
@ 2019-12-19 14:48         ` Sai Prakash Ranjan
  0 siblings, 0 replies; 16+ messages in thread
From: Sai Prakash Ranjan @ 2019-12-19 14:48 UTC (permalink / raw)
  To: Rob Herring
  Cc: Andy Gross, Bjorn Andersson, Mark Rutland, devicetree,
	Borislav Petkov, Mauro Carvalho Chehab, Tony Luck, James Morse,
	Robert Richter, linux-edac,
	moderated list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE,
	linux-kernel, linux-arm-msm, Stephen Boyd, Evan Green,
	Trilok Soni, Prasad Sodagudi

On 2019-12-19 19:28, Rob Herring wrote:
>> > Is that your intent?
>> >
>> 
>> No, I want any combination of interrupts to be valid with atleast one
>> interrupt as mandatory.
>> I thought specifying minItems as 1 and maxItems as 4 will take care of
>> this,  am I doing something wrong?
> 
> Interrupts (really all properties) have a defined order in DT and an
> 'items' list defines both the order and index. You'll need to use
> oneOf and list out the possibilities. Stick to ones you actually need.
> 

Thanks, I will make the change in the next spin.

-Sai

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a 
member
of Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] drivers: edac: Add EDAC support for Kryo CPU caches
       [not found] ` <0101016ed57a6559-46c6c649-db28-4945-a11c-7441b8e9ac5b-000000@us-west-2.amazonses.com>
@ 2019-12-30 11:50   ` Borislav Petkov
  2020-01-13  5:44     ` Sai Prakash Ranjan
  0 siblings, 1 reply; 16+ messages in thread
From: Borislav Petkov @ 2019-12-30 11:50 UTC (permalink / raw)
  To: Sai Prakash Ranjan, James Morse
  Cc: Andy Gross, Bjorn Andersson, Mark Rutland, Rob Herring,
	devicetree, Mauro Carvalho Chehab, Tony Luck, Robert Richter,
	linux-edac, linux-arm-kernel, linux-kernel, linux-arm-msm,
	Stephen Boyd, Evan Green, tsoni, psodagud

On Thu, Dec 05, 2019 at 09:53:18AM +0000, Sai Prakash Ranjan wrote:
> Kryo{3,4}XX CPU cores implement RAS extensions to support
> Error Correcting Code(ECC). Currently all Kryo{3,4}XX CPU
> cores (gold/silver a.k.a big/LITTLE) support ECC via RAS.

via RAS what? ARM64_RAS_EXTN?

In any case, this needs James to look at and especially if there's some
ARM-generic functionality in there which should be shared, of course.

> This adds an interrupt based driver for those CPUs and

s/This adds/Add/

> provides an optional polling of error recording system
> registers.
> 
> Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
> ---
>  MAINTAINERS                   |   7 +
>  drivers/edac/Kconfig          |  20 +
>  drivers/edac/Makefile         |   1 +
>  drivers/edac/qcom_kryo_edac.c | 679 ++++++++++++++++++++++++++++++++++
>  4 files changed, 707 insertions(+)
>  create mode 100644 drivers/edac/qcom_kryo_edac.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index c2d80079dccc..f58c93f963f6 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -6049,6 +6049,13 @@ L:	linux-edac@vger.kernel.org
>  S:	Maintained
>  F:	drivers/edac/qcom_edac.c
>  
> +EDAC-KRYO-QCOM
> +M:	Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
> +L:	linux-arm-msm@vger.kernel.org
> +L:	linux-edac@vger.kernel.org
> +S:	Maintained
> +F:	drivers/edac/qcom_kryo_edac.c
> +
>  EDIROL UA-101/UA-1000 DRIVER
>  M:	Clemens Ladisch <clemens@ladisch.de>
>  L:	alsa-devel@alsa-project.org (moderated for non-subscribers)
> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
> index 417dad635526..cd78ac2917c9 100644
> --- a/drivers/edac/Kconfig
> +++ b/drivers/edac/Kconfig
> @@ -508,6 +508,26 @@ config EDAC_QCOM
>  	  For debugging issues having to do with stability and overall system
>  	  health, you should probably say 'Y' here.
>  
> +config EDAC_QCOM_KRYO
> +	tristate "QCOM Kryo EDAC for CPU L1/L2/L3-SCU caches"
> +	depends on ARCH_QCOM && ARM64_RAS_EXTN
> +	help
> +	  Support for Error detection and correction on Kryo Gold and Silver CPU
> +	  cores with RAS extensions. Currently it detects and reports all Single
> +	  Bit Errors (SBEs) and Double Bit Errors (DBEs).
> +
> +	  For debugging issues having to do with stability and overall system
> +	  health, you should probably say 'Y' here.
> +
> +config EDAC_QCOM_KRYO_POLL
> +	depends on EDAC_QCOM_KRYO
> +	bool "Poll on Kryo ECC registers"
> +	help
> +	  This option chooses whether or not you want to poll on the Kryo ECC
> +	  registers. When this is enabled, the polling rate can be set as a
> +	  module parameter. By default, it will call the polling function every
> +	  second.

Why is this a separate option and why should people use that?

Can the polling/irq be switched automatically?

> +
>  config EDAC_ASPEED
>  	tristate "Aspeed AST 2500 SoC"
>  	depends on MACH_ASPEED_G5
> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
> index d77200c9680b..29edcfa6ec0e 100644
> --- a/drivers/edac/Makefile
> +++ b/drivers/edac/Makefile
> @@ -85,5 +85,6 @@ obj-$(CONFIG_EDAC_SYNOPSYS)		+= synopsys_edac.o
>  obj-$(CONFIG_EDAC_XGENE)		+= xgene_edac.o
>  obj-$(CONFIG_EDAC_TI)			+= ti_edac.o
>  obj-$(CONFIG_EDAC_QCOM)			+= qcom_edac.o
> +obj-$(CONFIG_EDAC_QCOM_KRYO)		+= qcom_kryo_edac.o

What is the difference between this new driver and the qcom_edac one? Can
functionality be shared?

Should this new one be called simply kryo_edac instead?

>  obj-$(CONFIG_EDAC_ASPEED)		+= aspeed_edac.o
>  obj-$(CONFIG_EDAC_BLUEFIELD)		+= bluefield_edac.o
> diff --git a/drivers/edac/qcom_kryo_edac.c b/drivers/edac/qcom_kryo_edac.c
> new file mode 100644
> index 000000000000..05b60ad3cb0e
> --- /dev/null
> +++ b/drivers/edac/qcom_kryo_edac.c
> @@ -0,0 +1,679 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2019, The Linux Foundation. All rights reserved.
> + */
> +
> +#include <linux/bitfield.h>
> +#include <linux/cpu_pm.h>
> +#include <linux/of_irq.h>
> +#include <linux/smp.h>
> +
> +#include <asm/cputype.h>
> +#include <asm/sysreg.h>
> +
> +#include "edac_device.h"
> +#include "edac_mc.h"
> +
> +#define DRV_NAME		"qcom_kryo_edac"
> +
> +/*
> + * ARM Cortex-A55, Cortex-A75, Cortex-A76 TRM Chapter B3.3

Chapter? Where? URL?

> + * ARM DSU TRM Chapter B2.3
> + * CFI = Corrected Fault Handling interrupt, FI = Fault handling interrupt
> + * UI = Uncorrected error recovery interrupt, ED = Error Detection
> + */
> +#define KRYO_ERRXCTLR_ED	BIT(0)
> +#define KRYO_ERRXCTLR_UI	BIT(2)
> +#define KRYO_ERRXCTLR_FI	BIT(3)
> +#define KRYO_ERRXCTLR_CFI	BIT(8)
> +#define KRYO_ERRXCTLR_ENABLE	(KRYO_ERRXCTLR_CFI | KRYO_ERRXCTLR_FI | \
> +				 KRYO_ERRXCTLR_UI | KRYO_ERRXCTLR_ED)
> +
> +/*
> + * ARM Cortex-A55, Cortex-A75, Cortex-A76 TRM Chapter B3.4
> + * ARM DSU TRM Chapter B2.4
> + */
> +#define KRYO_ERRXFR_ED		GENMASK(1, 0)
> +#define KRYO_ERRXFR_DE		GENMASK(3, 2)
> +#define KRYO_ERRXFR_UI		GENMASK(5, 4)
> +#define KRYO_ERRXFR_FI		GENMASK(7, 6)
> +#define KRYO_ERRXFR_UE		GENMASK(9, 8)
> +#define KRYO_ERRXFR_CFI		GENMASK(11, 10)
> +#define KRYO_ERRXFR_CEC		GENMASK(14, 12)
> +#define KRYO_ERRXFR_RP		BIT(15)
> +#define KRYO_ERRXFR_SUPPORTED	(KRYO_ERRXFR_ED | KRYO_ERRXFR_DE | \
> +				 KRYO_ERRXFR_UI | KRYO_ERRXFR_FI | \
> +				 KRYO_ERRXFR_UE | KRYO_ERRXFR_CFI | \
> +				 KRYO_ERRXFR_CEC | KRYO_ERRXFR_RP)
> +
> +/*
> + * ARM Cortex-A55, Cortex-A75, Cortex-A76 TRM Chapter B3.5
> + * ARM DSU TRM Chapter B2.5
> + */
> +#define KRYO_ERRXMISC0_CECR	GENMASK_ULL(38, 32)
> +#define KRYO_ERRXMISC0_CECO	GENMASK_ULL(46, 40)
> +
> +/* ARM Cortex-A76 TRM Chapter B3.5 */
> +#define KRYO_ERRXMISC0_UNIT	GENMASK(3, 0)
> +#define KRYO_ERRXMISC0_LVL	GENMASK(3, 1)
> +
> +/* ARM Cortex-A76 TRM Chapter B3.10 has SERR bitfields 4:0
> + * but Cortex-A55, Cortex-A75 and DSU TRM has SERR bitfields 7:0.
> + * Since max error record is 21, we can use bitfields 4:0 for
> + * Kryo{3,4}XX CPUs.
> + */
> +#define KRYO_ERRXSTATUS_SERR	GENMASK(4, 0)
> +#define KRYO_ERRXSTATUS_DE	BIT(23)
> +#define KRYO_ERRXSTATUS_CE	GENMASK(25, 24)
> +#define KRYO_ERRXSTATUS_MV	BIT(26)
> +#define KRYO_ERRXSTATUS_UE	BIT(29)
> +#define KRYO_ERRXSTATUS_VALID	BIT(30)
> +
> +/* ARM Cortex-A76 TRM Chapter B3.5
> + * IC = Instruction Cache, DC = Data Cache
> + */
> +#define KRYO_L1_UNIT_IC		0x1
> +#define KRYO_L2_UNIT_TLB	0x2
> +#define KRYO_L1_UNIT_DC		0x4
> +#define KRYO_L2_UNIT		0x8
> +
> +/*
> + * ARM Cortex-A55 TRM Chapter B2.36
> + * ARM Cortex-A75, Cortex-A76 TRM Chapter B2.37
> + */
> +#define KRYO_ERR_RECORD_L1_L2	0x0
> +#define KRYO_ERR_RECORD_L3	0x1
> +
> +/* ARM DSU TRM Chapter B2.10 */
> +#define BUS_ERROR		0x12
> +
> +/* QCOM Kryo CPU part numbers */
> +#define KRYO3XX_GOLD		0x802
> +#define KRYO4XX_GOLD		0x804
> +#define KRYO4XX_SILVER_V1	0x803
> +#define KRYO4XX_SILVER_V2	0x805
> +
> +#define KRYO_EDAC_MSG_MAX	256
> +
> +static int poll_msec = 1000;
> +module_param(poll_msec, int, 0444);
> +
> +enum {
> +	KRYO_L1 = 0,
> +	KRYO_L2,
> +	KRYO_L3,
> +};
> +
> +/* CE = Corrected Error, UE = Uncorrected Error, DE = Deferred Error */
> +enum {
> +	KRYO_L1_CE = 0,
> +	KRYO_L1_UE,
> +	KRYO_L1_DE,
> +	KRYO_L2_CE,
> +	KRYO_L2_UE,
> +	KRYO_L2_DE,
> +	KRYO_L3_CE,
> +	KRYO_L3_UE,
> +	KRYO_L3_DE,
> +};
> +
> +struct error_record {
> +	u32 error_code;
> +	const char *error_msg;
> +};
> +
> +struct error_type {
> +	void (*fn)(struct edac_device_ctl_info *edev_ctl,
> +		   int inst_nr, int block_nr, const char *msg);
> +	const char *msg;
> +};
> +
> +/*
> + * ARM Cortex-A55, Cortex-A75, Cortex-A76 TRM Chapter B3.10
> + * ARM DSU TRM Chapter B2.10
> + */
> +static const struct error_record serror_record[] = {
> +	{ 0x1,	"Errors due to fault injection"		},
> +	{ 0x2,	"ECC error from internal data buffer"	},
> +	{ 0x6,	"ECC error on cache data RAM"		},
> +	{ 0x7,	"ECC error on cache tag or dirty RAM"	},
> +	{ 0x8,	"Parity error on TLB data RAM"		},
> +	{ 0x9,	"Parity error on TLB tag RAM"		},
> +	{ 0x12,	"Error response for a cache copyback"	},
> +	{ 0x15,	"Deferred error not supported"		},
> +};
> +
> +static const struct error_type err_type[] = {
> +	{ edac_device_handle_ce, "Kryo L1 Corrected Error"	},
> +	{ edac_device_handle_ue, "Kryo L1 Uncorrected Error"	},
> +	{ edac_device_handle_ue, "Kryo L1 Deferred Error"	},
> +	{ edac_device_handle_ce, "Kryo L2 Corrected Error"	},
> +	{ edac_device_handle_ue, "Kryo L2 Uncorrected Error"	},
> +	{ edac_device_handle_ue, "Kryo L2 Deferred Error"	},
> +	{ edac_device_handle_ce, "L3 Corrected Error"		},
> +	{ edac_device_handle_ue, "L3 Uncorrected Error"		},
> +	{ edac_device_handle_ue, "L3 Deferred Error"		},
> +};
> +

All that is not really needed - you can put the whole error type
detection and dumping in kryo_check_err_type() in nicely readable
switch-case statement. No need for the function pointers and special
structs.

> +static struct edac_device_ctl_info __percpu *edac_dev;
> +static struct edac_device_ctl_info *drv_edev_ctl;
> +
> +static const char *get_error_msg(u64 errxstatus)
> +{
> +	const struct error_record *rec;
> +	u32 errxstatus_serr;
> +
> +	errxstatus_serr = FIELD_GET(KRYO_ERRXSTATUS_SERR, errxstatus);
> +
> +	for (rec = serror_record; rec->error_code; rec++) {
> +		if (errxstatus_serr == rec->error_code)
> +			return rec->error_msg;
> +	}
> +
> +	return NULL;
> +}
> +
> +static void dump_syndrome_reg(int error_type, int level,
> +			      u64 errxstatus, u64 errxmisc,
> +			      struct edac_device_ctl_info *edev_ctl)
> +{
> +	char msg[KRYO_EDAC_MSG_MAX];
> +	const char *error_msg;
> +	int cpu;
> +
> +	cpu = raw_smp_processor_id();

Why raw_?

> +
> +	error_msg = get_error_msg(errxstatus);
> +	if (!error_msg)
> +		return;
> +
> +	snprintf(msg, KRYO_EDAC_MSG_MAX,
> +		 "CPU%d: %s, ERRXSTATUS_EL1:%#llx ERRXMISC0_EL1:%#llx, %s",
> +		 cpu, err_type[error_type].msg, errxstatus, errxmisc,
> +		 error_msg);
> +
> +	err_type[error_type].fn(edev_ctl, 0, level, msg);
> +}

...

> +static int kryo_l1_l2_setup_irq(struct platform_device *pdev,
> +				struct edac_device_ctl_info *edev_ctl)
> +{
> +	int cpu, errirq, faultirq, ret;
> +
> +	edac_dev = devm_alloc_percpu(&pdev->dev, *edac_dev);
> +	if (!edac_dev)
> +		return -ENOMEM;
> +
> +	for_each_possible_cpu(cpu) {
> +		preempt_disable();
> +		per_cpu(edac_dev, cpu) = edev_ctl;
> +		preempt_enable();
> +	}

That sillyness doesn't belong here, if at all.

...

> +static void kryo_poll_cache_error(struct edac_device_ctl_info *edev_ctl)
> +{
> +	if (!edev_ctl)
> +		edev_ctl = drv_edev_ctl;

That's silly.

> +
> +	on_each_cpu(kryo_check_l1_l2_ecc, edev_ctl, 1);
> +	kryo_check_l3_scu_ecc(edev_ctl);
> +}

...

> +static int qcom_kryo_edac_probe(struct platform_device *pdev)
> +{
> +	struct edac_device_ctl_info *edev_ctl;
> +	struct device *dev = &pdev->dev;
> +	int ret;
> +
> +	qcom_kryo_edac_setup();

This function needs to have a return value saying whether it did setup
the hw properly or not and the probe function needs to return here if
not.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] drivers: edac: Add EDAC support for Kryo CPU caches
  2019-12-30 11:50   ` Borislav Petkov
@ 2020-01-13  5:44     ` Sai Prakash Ranjan
  2020-01-15 18:49       ` James Morse
  0 siblings, 1 reply; 16+ messages in thread
From: Sai Prakash Ranjan @ 2020-01-13  5:44 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: James Morse, Andy Gross, Bjorn Andersson, Mark Rutland,
	Rob Herring, devicetree, Mauro Carvalho Chehab, Tony Luck,
	Robert Richter, linux-edac, linux-arm-kernel, linux-kernel,
	linux-arm-msm, Stephen Boyd, Evan Green, tsoni, psodagud

Hi Boris,

Thanks for the review comments.

On 2019-12-30 17:20, Borislav Petkov wrote:
> On Thu, Dec 05, 2019 at 09:53:18AM +0000, Sai Prakash Ranjan wrote:
>> Kryo{3,4}XX CPU cores implement RAS extensions to support
>> Error Correcting Code(ECC). Currently all Kryo{3,4}XX CPU
>> cores (gold/silver a.k.a big/LITTLE) support ECC via RAS.
> 
> via RAS what? ARM64_RAS_EXTN?
> 
> In any case, this needs James to look at and especially if there's some
> ARM-generic functionality in there which should be shared, of course.
> 

Yes it is ARM64_RAS_EXTN and I have been hoping if James can provide the 
feedback,
it has been some time now since I posted this out.

>> This adds an interrupt based driver for those CPUs and
> 
> s/This adds/Add/
> 

Will correct.

>> +
>> +config EDAC_QCOM_KRYO_POLL
>> +	depends on EDAC_QCOM_KRYO
>> +	bool "Poll on Kryo ECC registers"
>> +	help
>> +	  This option chooses whether or not you want to poll on the Kryo 
>> ECC
>> +	  registers. When this is enabled, the polling rate can be set as a
>> +	  module parameter. By default, it will call the polling function 
>> every
>> +	  second.
> 
> Why is this a separate option and why should people use that?
> 
> Can the polling/irq be switched automatically?
> 

No it cannot be switched automatically. It is used in case some SoCs do 
not support an irq based mechanism for EDAC.
But I am contradicting myself because I am telling that atleast one 
interrupt should be specified in bindings,
so it is best if I drop this polling option for now.

>> +
>>  config EDAC_ASPEED
>>  	tristate "Aspeed AST 2500 SoC"
>>  	depends on MACH_ASPEED_G5
>> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
>> index d77200c9680b..29edcfa6ec0e 100644
>> --- a/drivers/edac/Makefile
>> +++ b/drivers/edac/Makefile
>> @@ -85,5 +85,6 @@ obj-$(CONFIG_EDAC_SYNOPSYS)		+= synopsys_edac.o
>>  obj-$(CONFIG_EDAC_XGENE)		+= xgene_edac.o
>>  obj-$(CONFIG_EDAC_TI)			+= ti_edac.o
>>  obj-$(CONFIG_EDAC_QCOM)			+= qcom_edac.o
>> +obj-$(CONFIG_EDAC_QCOM_KRYO)		+= qcom_kryo_edac.o
> 
> What is the difference between this new driver and the qcom_edac one? 
> Can
> functionality be shared?
> 
> Should this new one be called simply kryo_edac instead?
> 

qcom_edac driver is for QCOM system cache(last level cache), it should 
be renamed to qcom_llcc_edac.c.
This new driver is for QCOM Kryo CPU core caches(L1,L2,L3).

Functionality cannot be shared as these two are different IP blocks and 
best kept separate.

>> +
>> +#define DRV_NAME		"qcom_kryo_edac"
>> +
>> +/*
>> + * ARM Cortex-A55, Cortex-A75, Cortex-A76 TRM Chapter B3.3
> 
> Chapter? Where? URL?
> 

I chose this because these TRMs are openly available and if you search 
for these above terms like
"Cortex-A76 TRM Chapter B3.3" in google, then the first search result 
will be the TRM pdf, otherwise
I would have to specify the long URL for the pdf and we do not know how 
long that URL link will be active.

>> +
>> +static const struct error_type err_type[] = {
>> +	{ edac_device_handle_ce, "Kryo L1 Corrected Error"	},
>> +	{ edac_device_handle_ue, "Kryo L1 Uncorrected Error"	},
>> +	{ edac_device_handle_ue, "Kryo L1 Deferred Error"	},
>> +	{ edac_device_handle_ce, "Kryo L2 Corrected Error"	},
>> +	{ edac_device_handle_ue, "Kryo L2 Uncorrected Error"	},
>> +	{ edac_device_handle_ue, "Kryo L2 Deferred Error"	},
>> +	{ edac_device_handle_ce, "L3 Corrected Error"		},
>> +	{ edac_device_handle_ue, "L3 Uncorrected Error"		},
>> +	{ edac_device_handle_ue, "L3 Deferred Error"		},
>> +};
>> +
> 
> All that is not really needed - you can put the whole error type
> detection and dumping in kryo_check_err_type() in nicely readable
> switch-case statement. No need for the function pointers and special
> structs.
> 

How is this not easily readable? If I put this in kryo_check_err_type, 
then
there will be nested switch which I think is not so great in terms of 
readability
since it will not fit in one screen and is just more lines of code.

>> +static struct edac_device_ctl_info __percpu *edac_dev;
>> +static struct edac_device_ctl_info *drv_edev_ctl;
>> +
>> +static const char *get_error_msg(u64 errxstatus)
>> +{
>> +	const struct error_record *rec;
>> +	u32 errxstatus_serr;
>> +
>> +	errxstatus_serr = FIELD_GET(KRYO_ERRXSTATUS_SERR, errxstatus);
>> +
>> +	for (rec = serror_record; rec->error_code; rec++) {
>> +		if (errxstatus_serr == rec->error_code)
>> +			return rec->error_msg;
>> +	}
>> +
>> +	return NULL;
>> +}
>> +
>> +static void dump_syndrome_reg(int error_type, int level,
>> +			      u64 errxstatus, u64 errxmisc,
>> +			      struct edac_device_ctl_info *edev_ctl)
>> +{
>> +	char msg[KRYO_EDAC_MSG_MAX];
>> +	const char *error_msg;
>> +	int cpu;
>> +
>> +	cpu = raw_smp_processor_id();
> 
> Why raw_?
> 

Because we will be calling smp_processor_id in preemptible context and 
if we enable CONFIG_DEBUG_PREEMPT,
we would get a nice backtrace.

[    3.747468] BUG: using smp_processor_id() in preemptible [00000000] 
code: swapper/0/1
[    3.755527] caller is qcom_kryo_edac_probe+0x138/0x2b8
[    3.760819] CPU: 2 PID: 1 Comm: swapper/0 Tainted: G S                
5.4.0-rc7-next-20191113-00009-g8666855d6a5b-dirty #107
[    3.772323] Hardware name: Qualcomm Technologies, Inc. SM8150 MTP 
(DT)
[    3.779030] Call trace:
[    3.781556]  dump_backtrace+0x0/0x158
[    3.785331]  show_stack+0x14/0x20
[    3.788741]  dump_stack+0xb0/0xf4
[    3.792164]  debug_smp_processor_id+0xd8/0xe0
[    3.796639]  qcom_kryo_edac_probe+0x138/0x2b8
[    3.801116]  platform_drv_probe+0x50/0xa8
[    3.805236]  really_probe+0x108/0x360
[    3.808999]  driver_probe_device+0x58/0x100
[    3.813304]  device_driver_attach+0x6c/0x78
[    3.817606]  __driver_attach+0xb0/0xf0
[    3.821459]  bus_for_each_dev+0x68/0xc8
[    3.825407]  driver_attach+0x20/0x28
[    3.829083]  bus_add_driver+0x160/0x1f0
[    3.833030]  driver_register+0x60/0x110
[    3.836976]  __platform_driver_register+0x40/0x48
[    3.841813]  qcom_kryo_edac_driver_init+0x18/0x20
[    3.846645]  do_one_initcall+0x58/0x1a0
[    3.850596]  kernel_init_freeable+0x19c/0x240
[    3.855075]  kernel_init+0x10/0x108
[    3.858665]  ret_from_fork+0x10/0x1c


>> +static int kryo_l1_l2_setup_irq(struct platform_device *pdev,
>> +				struct edac_device_ctl_info *edev_ctl)
>> +{
>> +	int cpu, errirq, faultirq, ret;
>> +
>> +	edac_dev = devm_alloc_percpu(&pdev->dev, *edac_dev);
>> +	if (!edac_dev)
>> +		return -ENOMEM;
>> +
>> +	for_each_possible_cpu(cpu) {
>> +		preempt_disable();
>> +		per_cpu(edac_dev, cpu) = edev_ctl;
>> +		preempt_enable();
>> +	}
> 
> That sillyness doesn't belong here, if at all.
> 

Sorry but I do not understand the sillyness here. Could you please 
explain?

> ...
> 
>> +static void kryo_poll_cache_error(struct edac_device_ctl_info 
>> *edev_ctl)
>> +{
>> +	if (!edev_ctl)
>> +		edev_ctl = drv_edev_ctl;
> 
> That's silly.
> 

Actually its not silly. In case, polling is enabled and on PM exit 
edev_ctl could be NULL.

>> +
>> +	on_each_cpu(kryo_check_l1_l2_ecc, edev_ctl, 1);
>> +	kryo_check_l3_scu_ecc(edev_ctl);
>> +}
> 
> ...
> 
>> +static int qcom_kryo_edac_probe(struct platform_device *pdev)
>> +{
>> +	struct edac_device_ctl_info *edev_ctl;
>> +	struct device *dev = &pdev->dev;
>> +	int ret;
>> +
>> +	qcom_kryo_edac_setup();
> 
> This function needs to have a return value saying whether it did setup
> the hw properly or not and the probe function needs to return here if
> not.

Ok will add a return check.

Thanks,
Sai

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a 
member
of Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] dt-bindings: edac: Add DT bindings for Kryo EDAC
  2019-12-05  9:53 ` [PATCH 1/2] dt-bindings: edac: Add DT bindings for Kryo EDAC Sai Prakash Ranjan
  2019-12-18 23:37   ` Rob Herring
@ 2020-01-15 18:48   ` James Morse
  2020-01-24 14:21     ` Sai Prakash Ranjan
  1 sibling, 1 reply; 16+ messages in thread
From: James Morse @ 2020-01-15 18:48 UTC (permalink / raw)
  To: Sai Prakash Ranjan
  Cc: Andy Gross, Bjorn Andersson, Mark Rutland, Rob Herring,
	devicetree, Borislav Petkov, Mauro Carvalho Chehab, Tony Luck,
	Robert Richter, linux-edac, linux-arm-kernel, linux-kernel,
	linux-arm-msm, Stephen Boyd, Evan Green, tsoni, psodagud, baicar

Hi Sai,

(CC: +Tyler)

On 05/12/2019 09:53, Sai Prakash Ranjan wrote:
> This adds DT bindings for Kryo EDAC implemented with RAS
> extensions on KRYO{3,4}XX CPU cores for reporting of cache
> errors.

KRYO{3,4}XX isn't the only SoC with the RAS extensions. The DT needs to convey the range
of ways this armv8 RAS extensions stuff can be wired up.

The folk who look after the ACPI specs have made a start:
https://static.docs.arm.com/den0085/a/DEN0085_RAS_ACPI_1.0_BETA_1.pdf

(I suspect that isn't the latest version, I'll try and find out)

I'd like the ACPI table and DT to convey the same information so that we don't need to
convert or infer things in the driver. If something is missing, we should get it added!


> diff --git a/Documentation/devicetree/bindings/edac/qcom-kryo-edac.yaml b/Documentation/devicetree/bindings/edac/qcom-kryo-edac.yaml
> new file mode 100644
> index 000000000000..1a39429a73b4
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/edac/qcom-kryo-edac.yaml
> @@ -0,0 +1,67 @@
> +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/edac/qcom-kryo-edac.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: Kryo Error Detection and Correction(EDAC)
> +
> +maintainers:
> +  - Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
> +
> +description: |
> +  Kryo EDAC is defined to describe on-chip error detection and correction
> +  for the Kryo CPU cores which implement RAS extensions.

Please don't make this Kryo specific, otherwise this binding becomes an extra thing we
need to support with a 'v8.2 RAS' driver.

What I'd like is a single 'armv82_ras' edac driver that handles faults and errors reported
by interrupts, and interacts with the arch code's handling of 'external aborts'. This
should work for all platforms using v8.2 RAS and later.


> +  It will report
> +  all Single Bit Errors and Double Bit Errors found in L1/L2 caches in
> +  in two registers ERXSTATUS_EL1 and ERXMISC0_EL1. L3-SCU cache errors
> +  are reported in ERR1STATUS and ERR1MISC0 registers.
> +    ERXSTATUS_EL1 - Selected Error Record Primary Status Register, EL1
> +    ERXMISC0_EL1 - Selected Error Record Miscellaneous Register 0, EL1
> +    ERR1STATUS - Error Record Primary Status Register
> +    ERR1MISC0 - Error Record Miscellaneous Register 0
> +  Current implementation of Kryo ECC(Error Correcting Code) mechanism is
> +  based on interrupts.

Your SoC picked the system registers as the interface to these component's registers.
The binding would need to specify which index the 'l1-l2' records start at, and how many
there are. The same for the 'l3-scu'. You can't hard code these, they are different on
other platforms.

There is also an MMIO interface which needs a base address, along with the index and
ranges. (which may be different). The same component may use both the system register and
the MMIO interface.

This stuff is likely to vary on big/little systems, so you need a way of describing which
CPUs the settings refer to. This probably isn't something the ACPI tables capture as ACPI
machines are typically homogenous.


Thanks,

James

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] drivers: edac: Add EDAC support for Kryo CPU caches
  2020-01-13  5:44     ` Sai Prakash Ranjan
@ 2020-01-15 18:49       ` James Morse
  2020-01-24 14:52         ` Sai Prakash Ranjan
  0 siblings, 1 reply; 16+ messages in thread
From: James Morse @ 2020-01-15 18:49 UTC (permalink / raw)
  To: Sai Prakash Ranjan
  Cc: Borislav Petkov, Andy Gross, Bjorn Andersson, Mark Rutland,
	Rob Herring, devicetree, Mauro Carvalho Chehab, Tony Luck,
	Robert Richter, linux-edac, linux-arm-kernel, linux-kernel,
	linux-arm-msm, Stephen Boyd, Evan Green, tsoni, psodagud, baicar

Hi guys,

(CC: +Tyler)

On 13/01/2020 05:44, Sai Prakash Ranjan wrote:
> On 2019-12-30 17:20, Borislav Petkov wrote:
>> On Thu, Dec 05, 2019 at 09:53:18AM +0000, Sai Prakash Ranjan wrote:
>>> Kryo{3,4}XX CPU cores implement RAS extensions to support
>>> Error Correcting Code(ECC). Currently all Kryo{3,4}XX CPU
>>> cores (gold/silver a.k.a big/LITTLE) support ECC via RAS.
>>
>> via RAS what? ARM64_RAS_EXTN?
>>
>> In any case, this needs James to look at and especially if there's some
>> ARM-generic functionality in there which should be shared, of course.

> Yes it is ARM64_RAS_EXTN and I have been hoping if James can provide the feedback,
> it has been some time now since I posted this out.

Sorry, I was out of the office for most of November/December, and I'm slowly catching up...


>>> +
>>> +config EDAC_QCOM_KRYO_POLL
>>> +    depends on EDAC_QCOM_KRYO
>>> +    bool "Poll on Kryo ECC registers"
>>> +    help
>>> +      This option chooses whether or not you want to poll on the Kryo ECC
>>> +      registers. When this is enabled, the polling rate can be set as a
>>> +      module parameter. By default, it will call the polling function every
>>> +      second.
>>
>> Why is this a separate option and why should people use that?
>>
>> Can the polling/irq be switched automatically?

> No it cannot be switched automatically. It is used in case some SoCs do not support an irq
> based mechanism for EDAC.
> But I am contradicting myself because I am telling that atleast one interrupt should be
> specified in bindings,
> so it is best if I drop this polling option for now.

For now, sure. But I think this will come back for systems with embarrassing amounts of
RAM that would rather scrub the errors than take a flood of IRQs. I'd like this to be
controllable from user-space.


>>> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
>>> index d77200c9680b..29edcfa6ec0e 100644
>>> --- a/drivers/edac/Makefile
>>> +++ b/drivers/edac/Makefile
>>> @@ -85,5 +85,6 @@ obj-$(CONFIG_EDAC_SYNOPSYS)        += synopsys_edac.o
>>>  obj-$(CONFIG_EDAC_XGENE)        += xgene_edac.o
>>>  obj-$(CONFIG_EDAC_TI)            += ti_edac.o
>>>  obj-$(CONFIG_EDAC_QCOM)            += qcom_edac.o
>>> +obj-$(CONFIG_EDAC_QCOM_KRYO)        += qcom_kryo_edac.o
>>
>> What is the difference between this new driver and the qcom_edac one? Can
>> functionality be shared?

High-level story time:
Until the 'v8.2' revision of the 'v8' Arm-architecture (the 64bit one), arm didn't
describe how RAS should work. Partners implemented what they needed, and we ended up with
this collection of drivers because they were all different.

v8.2 fixed all this, the good news is once its done, we should never need another edac
driver. (at least, not for SoCs built for v8.2). The downside is there is quite a lot in
there, and we need to cover ACPI machines as well as DT.

> qcom_edac driver is for QCOM system cache(last level cache), it should be renamed to
> qcom_llcc_edac.c.
> This new driver is for QCOM Kryo CPU core caches(L1,L2,L3).
>
> Functionality cannot be shared as these two are different IP blocks and best kept separate.

The qcom_edac will be Qualcomm's pre-v8.2 support.
This series is about the v8.2 support which all looks totally different to Linux.


>>> + * ARM Cortex-A55, Cortex-A75, Cortex-A76 TRM Chapter B3.3
>>
>> Chapter? Where? URL?
>>
> 
> I chose this because these TRMs are openly available and if you search for these above
> terms like
> "Cortex-A76 TRM Chapter B3.3" in google, then the first search result will be the TRM pdf,
> otherwise
> I would have to specify the long URL for the pdf and we do not know how long that URL link
> will be active.

These are SoC/CPU specific. Using these we can't solve the whole problem.

The architecture all those should fit into is here:
https://static.docs.arm.com/ddi0587/cb/2019_07_05_DD_0587_C_b.pdf
(or https://developer.arm.com/docs/ and look for 'RAS')

... and the arm-arm.


>>> +static void dump_syndrome_reg(int error_type, int level,
>>> +                  u64 errxstatus, u64 errxmisc,
>>> +                  struct edac_device_ctl_info *edev_ctl)
>>> +{
>>> +    char msg[KRYO_EDAC_MSG_MAX];
>>> +    const char *error_msg;
>>> +    int cpu;
>>> +
>>> +    cpu = raw_smp_processor_id();
>>
>> Why raw_?
>>
> 
> Because we will be calling smp_processor_id in preemptible context and if we enable
> CONFIG_DEBUG_PREEMPT,
> we would get a nice backtrace.
> 
> [    3.747468] BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1
> [    3.755527] caller is qcom_kryo_edac_probe+0x138/0x2b8
> [    3.760819] CPU: 2 PID: 1 Comm: swapper/0 Tainted: G S               
> 5.4.0-rc7-next-20191113-00009-g8666855d6a5b-dirty #107
> [    3.772323] Hardware name: Qualcomm Technologies, Inc. SM8150 MTP (DT)
> [    3.779030] Call trace:
> [    3.781556]  dump_backtrace+0x0/0x158
> [    3.785331]  show_stack+0x14/0x20
> [    3.788741]  dump_stack+0xb0/0xf4
> [    3.792164]  debug_smp_processor_id+0xd8/0xe0
> [    3.796639]  qcom_kryo_edac_probe+0x138/0x2b8
> [    3.801116]  platform_drv_probe+0x50/0xa8
> [    3.805236]  really_probe+0x108/0x360
> [    3.808999]  driver_probe_device+0x58/0x100
> [    3.813304]  device_driver_attach+0x6c/0x78
> [    3.817606]  __driver_attach+0xb0/0xf0
> [    3.821459]  bus_for_each_dev+0x68/0xc8
> [    3.825407]  driver_attach+0x20/0x28
> [    3.829083]  bus_add_driver+0x160/0x1f0
> [    3.833030]  driver_register+0x60/0x110
> [    3.836976]  __platform_driver_register+0x40/0x48
> [    3.841813]  qcom_kryo_edac_driver_init+0x18/0x20
> [    3.846645]  do_one_initcall+0x58/0x1a0
> [    3.850596]  kernel_init_freeable+0x19c/0x240
> [    3.855075]  kernel_init+0x10/0x108
> [    3.858665]  ret_from_fork+0x10/0x1c

and raw_ stops the backtrace? You are still preemptible. The problem still exists, you've
just suppressed the warning.

At any time in dump_syndrome_reg(), you could get an interrupt and another task gets
scheduled. Later your thread is started on another cpu... but not the one whose cpu number
you read from smp_processor_id(). Whatever you needed it for, might have the wrong value.


>>> +static int kryo_l1_l2_setup_irq(struct platform_device *pdev,
>>> +                struct edac_device_ctl_info *edev_ctl)
>>> +{
>>> +    int cpu, errirq, faultirq, ret;
>>> +
>>> +    edac_dev = devm_alloc_percpu(&pdev->dev, *edac_dev);
>>> +    if (!edac_dev)
>>> +        return -ENOMEM;
>>> +
>>> +    for_each_possible_cpu(cpu) {
>>> +        preempt_disable();
>>> +        per_cpu(edac_dev, cpu) = edev_ctl;
>>> +        preempt_enable();
>>> +    }
>>
>> That sillyness doesn't belong here, if at all.

> Sorry but I do not understand the sillyness here. Could you please explain?

preempt_disable() prevents another task being scheduled instead of you, avoiding the risk
that you get scheduled on another cpu. In this case it doesn't matter which cpu you are
running on as you aren't accessing _this_ cpu's edac_dev, you are accessing each one in a
loop.


Thanks,

James

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] dt-bindings: edac: Add DT bindings for Kryo EDAC
  2020-01-15 18:48   ` James Morse
@ 2020-01-24 14:21     ` Sai Prakash Ranjan
  2020-02-26 17:12       ` James Morse
  0 siblings, 1 reply; 16+ messages in thread
From: Sai Prakash Ranjan @ 2020-01-24 14:21 UTC (permalink / raw)
  To: James Morse
  Cc: Andy Gross, Bjorn Andersson, Mark Rutland, Rob Herring,
	devicetree, Borislav Petkov, Mauro Carvalho Chehab, Tony Luck,
	Robert Richter, linux-edac, linux-arm-kernel, linux-kernel,
	linux-arm-msm, Stephen Boyd, Evan Green, tsoni, psodagud, baicar

Hi James,

On 2020-01-16 00:18, James Morse wrote:
> Hi Sai,
> 
> (CC: +Tyler)
> 
> On 05/12/2019 09:53, Sai Prakash Ranjan wrote:
>> This adds DT bindings for Kryo EDAC implemented with RAS
>> extensions on KRYO{3,4}XX CPU cores for reporting of cache
>> errors.
> 
> KRYO{3,4}XX isn't the only SoC with the RAS extensions. The DT needs
> to convey the range
> of ways this armv8 RAS extensions stuff can be wired up.
> 

Right, but I was going for Kryo specific implementation and hence the 
binding as such.

> The folk who look after the ACPI specs have made a start:
> https://static.docs.arm.com/den0085/a/DEN0085_RAS_ACPI_1.0_BETA_1.pdf
> 
> (I suspect that isn't the latest version, I'll try and find out)
> 

That would be helpful, thanks.

> I'd like the ACPI table and DT to convey the same information so that
> we don't need to
> convert or infer things in the driver. If something is missing, we
> should get it added!
> 

Sure, I think it is decided now that kernel first RAS implementation 
will be generic.

> 
>> diff --git 
>> a/Documentation/devicetree/bindings/edac/qcom-kryo-edac.yaml 
>> b/Documentation/devicetree/bindings/edac/qcom-kryo-edac.yaml
>> new file mode 100644
>> index 000000000000..1a39429a73b4
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/edac/qcom-kryo-edac.yaml
>> @@ -0,0 +1,67 @@
>> +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
>> +%YAML 1.2
>> +---
>> +$id: http://devicetree.org/schemas/edac/qcom-kryo-edac.yaml#
>> +$schema: http://devicetree.org/meta-schemas/core.yaml#
>> +
>> +title: Kryo Error Detection and Correction(EDAC)
>> +
>> +maintainers:
>> +  - Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
>> +
>> +description: |
>> +  Kryo EDAC is defined to describe on-chip error detection and 
>> correction
>> +  for the Kryo CPU cores which implement RAS extensions.
> 
> Please don't make this Kryo specific, otherwise this binding becomes
> an extra thing we
> need to support with a 'v8.2 RAS' driver.
> 
> What I'd like is a single 'armv82_ras' edac driver that handles faults
> and errors reported
> by interrupts, and interacts with the arch code's handling of
> 'external aborts'. This
> should work for all platforms using v8.2 RAS and later.
> 
> 

Ok sure.

>> +  It will report
>> +  all Single Bit Errors and Double Bit Errors found in L1/L2 caches 
>> in
>> +  in two registers ERXSTATUS_EL1 and ERXMISC0_EL1. L3-SCU cache 
>> errors
>> +  are reported in ERR1STATUS and ERR1MISC0 registers.
>> +    ERXSTATUS_EL1 - Selected Error Record Primary Status Register, 
>> EL1
>> +    ERXMISC0_EL1 - Selected Error Record Miscellaneous Register 0, 
>> EL1
>> +    ERR1STATUS - Error Record Primary Status Register
>> +    ERR1MISC0 - Error Record Miscellaneous Register 0
>> +  Current implementation of Kryo ECC(Error Correcting Code) mechanism 
>> is
>> +  based on interrupts.
> 
> Your SoC picked the system registers as the interface to these
> component's registers.
> The binding would need to specify which index the 'l1-l2' records
> start at, and how many
> there are. The same for the 'l3-scu'. You can't hard code these, they
> are different on
> other platforms.
> 

Ok will keep this in mind for the next version.

> There is also an MMIO interface which needs a base address, along with
> the index and
> ranges. (which may be different). The same component may use both the
> system register and
> the MMIO interface.
> 

I have some doubts here, Where do I get this info? Will this be 
implementation specific?

> This stuff is likely to vary on big/little systems, so you need a way
> of describing which
> CPUs the settings refer to. This probably isn't something the ACPI
> tables capture as ACPI
> machines are typically homogenous.
> 

Our SoCs are based on big.LITTLE arch, so this will be needed.

Thanks,
Sai

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a 
member
of Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] drivers: edac: Add EDAC support for Kryo CPU caches
  2020-01-15 18:49       ` James Morse
@ 2020-01-24 14:52         ` Sai Prakash Ranjan
  0 siblings, 0 replies; 16+ messages in thread
From: Sai Prakash Ranjan @ 2020-01-24 14:52 UTC (permalink / raw)
  To: James Morse
  Cc: Borislav Petkov, Andy Gross, Bjorn Andersson, Mark Rutland,
	Rob Herring, devicetree, Mauro Carvalho Chehab, Tony Luck,
	Robert Richter, linux-edac, linux-arm-kernel, linux-kernel,
	linux-arm-msm, Stephen Boyd, Evan Green, tsoni, psodagud, baicar

Hi James,

On 2020-01-16 00:19, James Morse wrote:
> Hi guys,
> 
> (CC: +Tyler)
> 
> On 13/01/2020 05:44, Sai Prakash Ranjan wrote:
>> On 2019-12-30 17:20, Borislav Petkov wrote:
>>> On Thu, Dec 05, 2019 at 09:53:18AM +0000, Sai Prakash Ranjan wrote:
>>>> Kryo{3,4}XX CPU cores implement RAS extensions to support
>>>> Error Correcting Code(ECC). Currently all Kryo{3,4}XX CPU
>>>> cores (gold/silver a.k.a big/LITTLE) support ECC via RAS.
>>> 
>>> via RAS what? ARM64_RAS_EXTN?
>>> 
>>> In any case, this needs James to look at and especially if there's 
>>> some
>>> ARM-generic functionality in there which should be shared, of course.
> 
>> Yes it is ARM64_RAS_EXTN and I have been hoping if James can provide 
>> the feedback,
>> it has been some time now since I posted this out.
> 
> Sorry, I was out of the office for most of November/December, and I'm
> slowly catching up...
> 
> 
>>>> +
>>>> +config EDAC_QCOM_KRYO_POLL
>>>> +    depends on EDAC_QCOM_KRYO
>>>> +    bool "Poll on Kryo ECC registers"
>>>> +    help
>>>> +      This option chooses whether or not you want to poll on the 
>>>> Kryo ECC
>>>> +      registers. When this is enabled, the polling rate can be set 
>>>> as a
>>>> +      module parameter. By default, it will call the polling 
>>>> function every
>>>> +      second.
>>> 
>>> Why is this a separate option and why should people use that?
>>> 
>>> Can the polling/irq be switched automatically?
> 
>> No it cannot be switched automatically. It is used in case some SoCs 
>> do not support an irq
>> based mechanism for EDAC.
>> But I am contradicting myself because I am telling that atleast one 
>> interrupt should be
>> specified in bindings,
>> so it is best if I drop this polling option for now.
> 
> For now, sure. But I think this will come back for systems with
> embarrassing amounts of
> RAM that would rather scrub the errors than take a flood of IRQs. I'd
> like this to be
> controllable from user-space.
> 

Ok so we should have an option to switch between polling and irq.

> 
>>>> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
>>>> index d77200c9680b..29edcfa6ec0e 100644
>>>> --- a/drivers/edac/Makefile
>>>> +++ b/drivers/edac/Makefile
>>>> @@ -85,5 +85,6 @@ obj-$(CONFIG_EDAC_SYNOPSYS)        += 
>>>> synopsys_edac.o
>>>>  obj-$(CONFIG_EDAC_XGENE)        += xgene_edac.o
>>>>  obj-$(CONFIG_EDAC_TI)            += ti_edac.o
>>>>  obj-$(CONFIG_EDAC_QCOM)            += qcom_edac.o
>>>> +obj-$(CONFIG_EDAC_QCOM_KRYO)        += qcom_kryo_edac.o
>>> 
>>> What is the difference between this new driver and the qcom_edac one? 
>>> Can
>>> functionality be shared?
> 
> High-level story time:
> Until the 'v8.2' revision of the 'v8' Arm-architecture (the 64bit
> one), arm didn't
> describe how RAS should work. Partners implemented what they needed,
> and we ended up with
> this collection of drivers because they were all different.
> 
> v8.2 fixed all this, the good news is once its done, we should never
> need another edac
> driver. (at least, not for SoCs built for v8.2). The downside is there
> is quite a lot in
> there, and we need to cover ACPI machines as well as DT.
> 

That is true but the qcom_edac one which is merged is for LLC(system 
cache) which is a QCOM IP.

>> qcom_edac driver is for QCOM system cache(last level cache), it should 
>> be renamed to
>> qcom_llcc_edac.c.
>> This new driver is for QCOM Kryo CPU core caches(L1,L2,L3).
>> 
>> Functionality cannot be shared as these two are different IP blocks 
>> and best kept separate.
> 
> The qcom_edac will be Qualcomm's pre-v8.2 support.
> This series is about the v8.2 support which all looks totally
> different to Linux.
> 

As said before qcom_edac is for LLC which is not available on all SoCs.
QCOM's pre v8.2 support is not upstreamed.

> 
>>>> + * ARM Cortex-A55, Cortex-A75, Cortex-A76 TRM Chapter B3.3
>>> 
>>> Chapter? Where? URL?
>>> 
>> 
>> I chose this because these TRMs are openly available and if you search 
>> for these above
>> terms like
>> "Cortex-A76 TRM Chapter B3.3" in google, then the first search result 
>> will be the TRM pdf,
>> otherwise
>> I would have to specify the long URL for the pdf and we do not know 
>> how long that URL link
>> will be active.
> 
> These are SoC/CPU specific. Using these we can't solve the whole 
> problem.
> 
> The architecture all those should fit into is here:
> https://static.docs.arm.com/ddi0587/cb/2019_07_05_DD_0587_C_b.pdf
> (or https://developer.arm.com/docs/ and look for 'RAS')
> 
> ... and the arm-arm.
> 

Thanks for the link.

> 
>>>> +static void dump_syndrome_reg(int error_type, int level,
>>>> +                  u64 errxstatus, u64 errxmisc,
>>>> +                  struct edac_device_ctl_info *edev_ctl)
>>>> +{
>>>> +    char msg[KRYO_EDAC_MSG_MAX];
>>>> +    const char *error_msg;
>>>> +    int cpu;
>>>> +
>>>> +    cpu = raw_smp_processor_id();
>>> 
>>> Why raw_?
>>> 
>> 
>> Because we will be calling smp_processor_id in preemptible context and 
>> if we enable
>> CONFIG_DEBUG_PREEMPT,
>> we would get a nice backtrace.
>> 
>> [    3.747468] BUG: using smp_processor_id() in preemptible [00000000] 
>> code: swapper/0/1
>> [    3.755527] caller is qcom_kryo_edac_probe+0x138/0x2b8
>> [    3.760819] CPU: 2 PID: 1 Comm: swapper/0 Tainted: G 
>> S               
>> 5.4.0-rc7-next-20191113-00009-g8666855d6a5b-dirty #107
>> [    3.772323] Hardware name: Qualcomm Technologies, Inc. SM8150 MTP 
>> (DT)
>> [    3.779030] Call trace:
>> [    3.781556]  dump_backtrace+0x0/0x158
>> [    3.785331]  show_stack+0x14/0x20
>> [    3.788741]  dump_stack+0xb0/0xf4
>> [    3.792164]  debug_smp_processor_id+0xd8/0xe0
>> [    3.796639]  qcom_kryo_edac_probe+0x138/0x2b8
>> [    3.801116]  platform_drv_probe+0x50/0xa8
>> [    3.805236]  really_probe+0x108/0x360
>> [    3.808999]  driver_probe_device+0x58/0x100
>> [    3.813304]  device_driver_attach+0x6c/0x78
>> [    3.817606]  __driver_attach+0xb0/0xf0
>> [    3.821459]  bus_for_each_dev+0x68/0xc8
>> [    3.825407]  driver_attach+0x20/0x28
>> [    3.829083]  bus_add_driver+0x160/0x1f0
>> [    3.833030]  driver_register+0x60/0x110
>> [    3.836976]  __platform_driver_register+0x40/0x48
>> [    3.841813]  qcom_kryo_edac_driver_init+0x18/0x20
>> [    3.846645]  do_one_initcall+0x58/0x1a0
>> [    3.850596]  kernel_init_freeable+0x19c/0x240
>> [    3.855075]  kernel_init+0x10/0x108
>> [    3.858665]  ret_from_fork+0x10/0x1c
> 
> and raw_ stops the backtrace? You are still preemptible. The problem
> still exists, you've
> just suppressed the warning.
> 
> At any time in dump_syndrome_reg(), you could get an interrupt and
> another task gets
> scheduled. Later your thread is started on another cpu... but not the
> one whose cpu number
> you read from smp_processor_id(). Whatever you needed it for, might
> have the wrong value.
> 

Ok will correct this.

> 
>>>> +static int kryo_l1_l2_setup_irq(struct platform_device *pdev,
>>>> +                struct edac_device_ctl_info *edev_ctl)
>>>> +{
>>>> +    int cpu, errirq, faultirq, ret;
>>>> +
>>>> +    edac_dev = devm_alloc_percpu(&pdev->dev, *edac_dev);
>>>> +    if (!edac_dev)
>>>> +        return -ENOMEM;
>>>> +
>>>> +    for_each_possible_cpu(cpu) {
>>>> +        preempt_disable();
>>>> +        per_cpu(edac_dev, cpu) = edev_ctl;
>>>> +        preempt_enable();
>>>> +    }
>>> 
>>> That sillyness doesn't belong here, if at all.
> 
>> Sorry but I do not understand the sillyness here. Could you please 
>> explain?
> 
> preempt_disable() prevents another task being scheduled instead of
> you, avoiding the risk
> that you get scheduled on another cpu. In this case it doesn't matter
> which cpu you are
> running on as you aren't accessing _this_ cpu's edac_dev, you are
> accessing each one in a
> loop.
> 

Thanks for the explanation James, now I get the sillyness.

Thanks,
Sai

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a 
member
of Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] dt-bindings: edac: Add DT bindings for Kryo EDAC
  2020-01-24 14:21     ` Sai Prakash Ranjan
@ 2020-02-26 17:12       ` James Morse
  0 siblings, 0 replies; 16+ messages in thread
From: James Morse @ 2020-02-26 17:12 UTC (permalink / raw)
  To: Sai Prakash Ranjan
  Cc: Andy Gross, Bjorn Andersson, Mark Rutland, Rob Herring,
	devicetree, Borislav Petkov, Mauro Carvalho Chehab, Tony Luck,
	Robert Richter, linux-edac, linux-arm-kernel, linux-kernel,
	linux-arm-msm, Stephen Boyd, Evan Green, tsoni, psodagud, baicar

Hi Sai,

On 24/01/2020 14:21, Sai Prakash Ranjan wrote:
> On 2020-01-16 00:18, James Morse wrote:
>> On 05/12/2019 09:53, Sai Prakash Ranjan wrote:
>>> This adds DT bindings for Kryo EDAC implemented with RAS
>>> extensions on KRYO{3,4}XX CPU cores for reporting of cache
>>> errors.

>>> diff --git a/Documentation/devicetree/bindings/edac/qcom-kryo-edac.yaml
>>> b/Documentation/devicetree/bindings/edac/qcom-kryo-edac.yaml
>>> new file mode 100644
>>> index 000000000000..1a39429a73b4
>>> --- /dev/null
>>> +++ b/Documentation/devicetree/bindings/edac/qcom-kryo-edac.yaml

>> There is also an MMIO interface which needs a base address, along with
>> the index and
>> ranges. (which may be different). The same component may use both the
>> system register and the MMIO interface.

> I have some doubts here, Where do I get this info? Will this be implementation specific?

It will be implementation specific. The ACPI spec folk have gathered some of the range of
ways people are putting this together. We should take that into account with the binding,
otherwise we end up with a 'v1' and 'v2' of the binding and have to support both.


There is a 'Beta 2' of that ACPI document. It should appear on the website at some point.
Qualcomm should have this somewhere, its called 'DEN0085_RAS_ACPI_1.0_RELEASE_BETA2.pdf.


Thanks,

James

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] drivers: edac: Add EDAC support for Kryo CPU caches
@ 2019-12-13  9:05 Sai Prakash Ranjan
  0 siblings, 0 replies; 16+ messages in thread
From: Sai Prakash Ranjan @ 2019-12-13  9:05 UTC (permalink / raw)
  To: Evan Green
  Cc: Andy Gross, Bjorn Andersson, Mark Rutland, Rob Herring,
	open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS,
	Borislav Petkov, Mauro Carvalho Chehab, Tony Luck, James Morse,
	Robert Richter, linux-edac, linux-arm Mailing List, LKML,
	linux-arm-msm, Stephen Boyd, tsoni, psodagud

Hi Evan,

Thanks for the review comments.

On 2019-12-12 01:02, Evan Green wrote:
> 
> No name?
> 

Will add them in next spin.

> 
> A comment is warranted to indicate that err_type is indexed by the
> enum, as this would be easy to mess up in later changes.
> 

Will use array index as suggested by Stephen.

>> +static const char *get_error_msg(u64 errxstatus)
>> +{
>> +       const struct error_record *rec;
>> +       u32 errxstatus_serr;
>> +
>> +       errxstatus_serr = FIELD_GET(KRYO_ERRXSTATUS_SERR, errxstatus);
>> +
>> +       for (rec = serror_record; rec->error_code; rec++) {
> 
> It looks like you expect the table to be zero terminated, but it's
> not. Add the missing zero entry.
> 

Will add it.

>> +
>> +static inline void kryo_clear_error(u64 errxstatus)
>> +{
>> +       write_sysreg_s(errxstatus, SYS_ERXSTATUS_EL1);
>> +       isb();
> 
> Is the isb() necessary? If so, why not a dsb as well?
> 

We usually use isb() with cache and system control registers.
I do not see anything about isb or dsb mentioned in the TRM
for error record registers so it's probably OK to remove this.
James can help us here.

>> +
>> +static void kryo_check_l1_l2_ecc(void *info)
>> +{
>> +       struct edac_device_ctl_info *edev_ctl = info;
>> +       u64 errxstatus;
>> +       u64 errxmisc;
>> +       int cpu;
>> +
>> +       cpu = smp_processor_id();
>> +       /* We know record 0 is L1 and L2 */
>> +       write_sysreg_s(0, SYS_ERRSELR_EL1);
>> +       isb();
> 
> Another isb I'm not sure about. Is this meant to provide a barrier
> between ERRSELR and ERXSTATUS? Wouldn't that be dsb, not isb?
> 

Same as above.

I will repost with your comments addressed once I get more feedbacks 
from EDAC maintainers.

Thanks,
Sai

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a 
member
of Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2020-02-26 17:12 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <cover.1575529553.git.saiprakash.ranjan@codeaurora.org>
2019-12-05  9:53 ` [PATCH 1/2] dt-bindings: edac: Add DT bindings for Kryo EDAC Sai Prakash Ranjan
2019-12-18 23:37   ` Rob Herring
2019-12-19  6:50     ` Sai Prakash Ranjan
2019-12-19 13:58       ` Rob Herring
2019-12-19 14:48         ` Sai Prakash Ranjan
2020-01-15 18:48   ` James Morse
2020-01-24 14:21     ` Sai Prakash Ranjan
2020-02-26 17:12       ` James Morse
2019-12-05  9:53 ` [PATCH 2/2] drivers: edac: Add EDAC support for Kryo CPU caches Sai Prakash Ranjan
     [not found] ` <0101016ed57a6311-e815485c-4b77-4342-a3de-203673941602-000000@us-west-2.amazonses.com>
2019-12-11 19:32   ` Evan Green
     [not found]     ` <5df16ebe.1c69fb81.6481f.a011@mx.google.com>
2019-12-13  5:31       ` Sai Prakash Ranjan
     [not found] ` <0101016ed57a6559-46c6c649-db28-4945-a11c-7441b8e9ac5b-000000@us-west-2.amazonses.com>
2019-12-30 11:50   ` Borislav Petkov
2020-01-13  5:44     ` Sai Prakash Ranjan
2020-01-15 18:49       ` James Morse
2020-01-24 14:52         ` Sai Prakash Ranjan
2019-12-13  9:05 Sai Prakash Ranjan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).