Linux-EDAC Archive on lore.kernel.org
* [PATCH 0/6] rasdaemon:add logging of HiSilicon HIP08 non-standard H/W errors and changes in the error decoding code
       [not found] <Shiju Jose>
@ 2019-06-17 14:28 ` Shiju Jose
  2019-06-17 14:28   ` [PATCH 1/6] rasdaemon:print non-standard error data if not decoded Shiju Jose
                     ` (6 more replies)
  2019-08-12 10:11 ` [PATCH RFC 0/4] ACPI: APEI: Add support to notify the vendor specific HW errors Shiju Jose
                   ` (2 subsequent siblings)
  3 siblings, 7 replies; 33+ messages in thread
From: Shiju Jose @ 2019-06-17 14:28 UTC (permalink / raw)
  To: mchehab, linux-edac, linuxarm; +Cc: Shiju Jose

This patch set makes a few changes to the non-standard error decoding
code and adds logging for the HiSilicon HIP08 non-standard H/W errors.

Shiju Jose (6):
  rasdaemon:print non-standard error data if not decoded
  rasdaemon: rearrange HiSilicon HIP07 decoding function table
  rasdaemon: update iteration logic for the non-standard error decoding
    functions
  rasdaemon:add logging HiSilicon HIP08 H/W errors reported in the OEM
    format1
  rasdaemon:add logging HiSilicon HIP08 H/W errors reported in the OEM
    format2
  rasdaemon:add logging HiSilicon HIP08 PCIe local errors

 Makefile.am                |   2 +-
 non-standard-hisi_hip07.c  |  36 +-
 non-standard-hisi_hip08.c  | 855 +++++++++++++++++++++++++++++++++++++++++++++
 ras-non-standard-handler.c |  36 +-
 ras-non-standard-handler.h |   8 +-
 ras-record.c               |  30 +-
 ras-record.h               |  13 +
 7 files changed, 932 insertions(+), 48 deletions(-)
 create mode 100644 non-standard-hisi_hip08.c

-- 
1.9.1




* [PATCH 1/6] rasdaemon:print non-standard error data if not decoded
  2019-06-17 14:28 ` [PATCH 0/6] rasdaemon:add logging of HiSilicon HIP08 non-standard H/W errors and changes in the error decoding code Shiju Jose
@ 2019-06-17 14:28   ` Shiju Jose
  2019-06-17 14:28   ` [PATCH 2/6] rasdaemon: rearrange HiSilicon HIP07 decoding function table Shiju Jose
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 33+ messages in thread
From: Shiju Jose @ 2019-06-17 14:28 UTC (permalink / raw)
  To: mchehab, linux-edac, linuxarm; +Cc: Shiju Jose

This patch changes the handler to print the raw non-standard error
data only if no registered decoder has decoded it.

Suggested-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
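For readers following the change, the fallback behaviour described above
can be sketched as a small standalone program. This is only an
illustration under assumptions: dump_le_words() and report_section() are
hypothetical names, and the sketch writes into a plain buffer instead of
using trace_seq_printf() and print_le_hex() as rasdaemon does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper, not part of rasdaemon: format the bytes in `data`
 * as little-endian 32-bit words, four words per row, each row prefixed
 * with its byte offset. Returns the number of characters written.
 */
static size_t dump_le_words(char *out, size_t size,
			    const uint8_t *data, size_t len)
{
	size_t used = 0;
	size_t off;

	assert(out != NULL && size > 0);
	out[0] = '\0';

	for (off = 0; off + 4 <= len; off += 4) {
		uint32_t word = (uint32_t)data[off] |
				(uint32_t)data[off + 1] << 8 |
				(uint32_t)data[off + 2] << 16 |
				(uint32_t)data[off + 3] << 24;
		int n;

		if (off % 16 == 0)	/* start of a new row */
			n = snprintf(out + used, size - used,
				     "%s  %08zx: %08x", off ? "\n" : "",
				     off, (unsigned int)word);
		else
			n = snprintf(out + used, size - used, " %08x",
				     (unsigned int)word);

		if (n < 0 || (size_t)n >= size - used) {
			out[used] = '\0';	/* drop the partial word */
			break;
		}
		used += (size_t)n;
	}
	return used;
}

/* Emit the raw dump only when no vendor decoder handled the section. */
static size_t report_section(char *out, size_t size, bool decoded,
			     const uint8_t *data, size_t len)
{
	assert(out != NULL && size > 0);
	if (decoded) {
		out[0] = '\0';
		return 0;
	}
	return dump_le_words(out, size, data, len);
}
```

With decoded set to true the output buffer stays empty, so a section
that a vendor decoder already printed is not dumped a second time.
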
 ras-non-standard-handler.c | 31 +++++++++++++++++--------------
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/ras-non-standard-handler.c b/ras-non-standard-handler.c
index 21e6a76..d343a2a 100644
--- a/ras-non-standard-handler.c
+++ b/ras-non-standard-handler.c
@@ -160,20 +160,6 @@ int ras_non_standard_event_handler(struct trace_seq *s,
 	ev.error = pevent_get_field_raw(s, event, "buf", record, &len, 1);
 	if(!ev.error)
 		return -1;
-	len = ev.length;
-	i = 0;
-	line_count = 0;
-	trace_seq_printf(s, " error:\n  %08x: ", i);
-	while(len >= 4) {
-		print_le_hex(s, ev.error, i);
-		i+=4;
-		len-=4;
-		if(++line_count == 4) {
-			trace_seq_printf(s, "\n  %08x: ", i);
-			line_count = 0;
-		} else
-			trace_seq_printf(s, " ");
-	}
 
 	for (count = 0; count < dec_tab_count && !dec_done; count++) {
 		dec_tab = ns_dec_tab[count];
@@ -187,6 +173,23 @@ int ras_non_standard_event_handler(struct trace_seq *s,
 		}
 	}
 
+	if (!dec_done) {
+		len = ev.length;
+		i = 0;
+		line_count = 0;
+		trace_seq_printf(s, " error:\n  %08x: ", i);
+		while (len >= 4) {
+			print_le_hex(s, ev.error, i);
+			i += 4;
+			len -= 4;
+			if (++line_count == 4) {
+				trace_seq_printf(s, "\n  %08x: ", i);
+				line_count = 0;
+			} else
+				trace_seq_printf(s, " ");
+		}
+	}
+
 	/* Insert data into the SGBD */
 #ifdef HAVE_SQLITE3
 	ras_store_non_standard_record(ras, &ev);
-- 
1.9.1




* [PATCH 2/6] rasdaemon: rearrange HiSilicon HIP07 decoding function table
  2019-06-17 14:28 ` [PATCH 0/6] rasdaemon:add logging of HiSilicon HIP08 non-standard H/W errors and changes in the error decoding code Shiju Jose
  2019-06-17 14:28   ` [PATCH 1/6] rasdaemon:print non-standard error data if not decoded Shiju Jose
@ 2019-06-17 14:28   ` Shiju Jose
  2019-06-17 14:28   ` [PATCH 3/6] rasdaemon: update iteration logic for the non-standard error decoding functions Shiju Jose
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 33+ messages in thread
From: Shiju Jose @ 2019-06-17 14:28 UTC (permalink / raw)
  To: mchehab, linux-edac, linuxarm; +Cc: Shiju Jose

This patch moves the decoding function table for the HiSilicon HIP07
non-standard errors below the decoding functions, so that the forward
declarations of those functions are no longer needed.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 non-standard-hisi_hip07.c | 26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/non-standard-hisi_hip07.c b/non-standard-hisi_hip07.c
index 3e9dabd..19a5c47 100644
--- a/non-standard-hisi_hip07.c
+++ b/non-standard-hisi_hip07.c
@@ -24,20 +24,6 @@
 #define HISI_SAS_VALID_ERR_TYPE       BIT(2)
 #define HISI_SAS_VALID_AXI_ERR_INFO   BIT(3)
 
-static int decode_hip07_sas_error(struct trace_seq *s, const void *error);
-static int decode_hip07_hns_error(struct trace_seq *s, const void *error);
-
-struct ras_ns_dec_tab hisi_ns_dec_tab[] = {
-	{
-		.sec_type = "daffd8146eba4d8c8a91bc9bbf4aa301",
-		.decode = decode_hip07_sas_error,
-	},
-	{
-		.sec_type = "fbc2d923ea7a453dab132949f5af9e53",
-		.decode = decode_hip07_hns_error,
-	},
-};
-
 struct hisi_sas_err_sec {
 	uint64_t   val_bits;
 	uint64_t   physical_addr;
@@ -138,6 +124,18 @@ static int decode_hip07_hns_error(struct trace_seq *s, const void *error)
 {
 	return 0;
 }
+
+struct ras_ns_dec_tab hisi_ns_dec_tab[] = {
+	{
+		.sec_type = "daffd8146eba4d8c8a91bc9bbf4aa301",
+		.decode = decode_hip07_sas_error,
+	},
+	{
+		.sec_type = "fbc2d923ea7a453dab132949f5af9e53",
+		.decode = decode_hip07_hns_error,
+	},
+};
+
 __attribute__((constructor))
 static void hip07_init(void)
 {
-- 
1.9.1




* [PATCH 3/6] rasdaemon: update iteration logic for the non-standard error decoding functions
  2019-06-17 14:28 ` [PATCH 0/6] rasdaemon:add logging of HiSilicon HIP08 non-standard H/W errors and changes in the error decoding code Shiju Jose
  2019-06-17 14:28   ` [PATCH 1/6] rasdaemon:print non-standard error data if not decoded Shiju Jose
  2019-06-17 14:28   ` [PATCH 2/6] rasdaemon: rearrange HiSilicon HIP07 decoding function table Shiju Jose
@ 2019-06-17 14:28   ` Shiju Jose
  2019-06-17 14:28   ` [PATCH 4/6] rasdaemon:add logging HiSilicon HIP08 H/W errors reported in the OEM format1 Shiju Jose
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 33+ messages in thread
From: Shiju Jose @ 2019-06-17 14:28 UTC (permalink / raw)
  To: mchehab, linux-edac, linuxarm; +Cc: Shiju Jose

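This patch updates the iteration logic for the non-standard error
decoding functions. Each decoding table is now terminated by an empty
sentinel entry, and the handler iterates until it reaches an entry
whose decode callback is NULL. The len field, which was stored in the
first table entry, is removed.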

Suggested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
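The sentinel-terminated iteration described above can be sketched in
isolation as follows. This is a simplified illustration, not the
rasdaemon code: struct dec_entry, demo_tab and find_decoder() are
hypothetical stand-ins, and strcmp() is used in place of uuid_le_cmp().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for struct ras_ns_dec_tab (illustration only). */
struct dec_entry {
	const char *sec_type;
	int (*decode)(const void *err);
};

static int decode_a(const void *err) { (void)err; return 1; }
static int decode_b(const void *err) { (void)err; return 2; }

static const struct dec_entry demo_tab[] = {
	{ .sec_type = "aaaa", .decode = decode_a },
	{ .sec_type = "bbbb", .decode = decode_b },
	{ .sec_type = NULL,   .decode = NULL },	/* sentinel */
};

/*
 * Walk the table until the sentinel (decode == NULL) and return the
 * matching entry, or NULL if no entry handles sec_type.
 */
static const struct dec_entry *find_decoder(const struct dec_entry *tab,
					    const char *sec_type)
{
	size_t i;

	assert(tab != NULL && sec_type != NULL);
	for (i = 0; tab[i].decode; i++) {
		if (strcmp(tab[i].sec_type, sec_type) == 0)
			return &tab[i];
	}
	return NULL;
}
```

Because the loop stops at the first entry without a decode callback, a
table no longer has to record its own length, and adding an entry only
means inserting it above the sentinel.
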
 non-standard-hisi_hip07.c  | 2 +-
 ras-non-standard-handler.c | 2 +-
 ras-non-standard-handler.h | 1 -
 3 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/non-standard-hisi_hip07.c b/non-standard-hisi_hip07.c
index 19a5c47..bb2576e 100644
--- a/non-standard-hisi_hip07.c
+++ b/non-standard-hisi_hip07.c
@@ -134,11 +134,11 @@ struct ras_ns_dec_tab hisi_ns_dec_tab[] = {
 		.sec_type = "fbc2d923ea7a453dab132949f5af9e53",
 		.decode = decode_hip07_hns_error,
 	},
+	{ /* sentinel */ }
 };
 
 __attribute__((constructor))
 static void hip07_init(void)
 {
-	hisi_ns_dec_tab[0].len = ARRAY_SIZE(hisi_ns_dec_tab);
 	register_ns_dec_tab(hisi_ns_dec_tab);
 }
diff --git a/ras-non-standard-handler.c b/ras-non-standard-handler.c
index d343a2a..392bb27 100644
--- a/ras-non-standard-handler.c
+++ b/ras-non-standard-handler.c
@@ -163,7 +163,7 @@ int ras_non_standard_event_handler(struct trace_seq *s,
 
 	for (count = 0; count < dec_tab_count && !dec_done; count++) {
 		dec_tab = ns_dec_tab[count];
-		for (i = 0; i < dec_tab[0].len; i++) {
+		for (i = 0; dec_tab[i].decode; i++) {
 			if (uuid_le_cmp(ev.sec_type,
 					dec_tab[i].sec_type) == 0) {
 				dec_tab[i].decode(s, ev.error);
diff --git a/ras-non-standard-handler.h b/ras-non-standard-handler.h
index b9e9fb1..b2c9743 100644
--- a/ras-non-standard-handler.h
+++ b/ras-non-standard-handler.h
@@ -23,7 +23,6 @@
 typedef struct ras_ns_dec_tab {
 	const char *sec_type;
 	int (*decode)(struct trace_seq *s, const void *err);
-	size_t len;
 } *p_ns_dec_tab;
 
 int ras_non_standard_event_handler(struct trace_seq *s,
-- 
1.9.1




* [PATCH 4/6] rasdaemon:add logging HiSilicon HIP08 H/W errors reported in the OEM format1
  2019-06-17 14:28 ` [PATCH 0/6] rasdaemon:add logging of HiSilicon HIP08 non-standard H/W errors and changes in the error decoding code Shiju Jose
                     ` (2 preceding siblings ...)
  2019-06-17 14:28   ` [PATCH 3/6] rasdaemon: update iteration logic for the non-standard error decoding functions Shiju Jose
@ 2019-06-17 14:28   ` Shiju Jose
  2019-06-17 14:28   ` [PATCH 5/6] rasdaemon:add logging HiSilicon HIP08 H/W errors reported in the OEM format2 Shiju Jose
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 33+ messages in thread
From: Shiju Jose @ 2019-06-17 14:28 UTC (permalink / raw)
  To: mchehab, linux-edac, linuxarm; +Cc: Shiju Jose

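This patch adds logging of the HiSilicon HIP08 H/W errors reported
in the non-standard OEM format1.
These errors come from the H/W modules MN, PLL, SLLC, AA, SIOE,
POE, DISP, LPC, SAS and SATA.
When rasdaemon is built with SQLite3 support, the decoded fields are
also recorded in the hip08_oem_type1_event table.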

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
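The decoder in this patch prints a field only when its bit is set in
val_bits. The pattern can be sketched on its own as below. This is a
hedged illustration: struct demo_err_sec, the DEMO_* macros, append()
and format_err() are hypothetical, cut-down names, and the sketch uses a
bounded vsnprintf() helper where the patch uses sprintf() into a
1024-byte buffer.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>

#define BIT(n)			(1U << (n))
#define DEMO_VALID_SOC_ID	BIT(0)
#define DEMO_VALID_SOCKET_ID	BIT(1)
#define DEMO_VALID_ERR_MISC_0	BIT(6)

/* Cut-down, hypothetical version of struct hisi_oem_type1_err_sec. */
struct demo_err_sec {
	uint32_t val_bits;
	uint8_t  soc_id;
	uint8_t  socket_id;
	uint32_t err_misc_0;
};

/* Append formatted text at *used; truncate instead of overrunning buf. */
static void append(char *buf, size_t size, size_t *used,
		   const char *fmt, ...)
{
	va_list ap;
	size_t room;
	int n;

	if (*used >= size)
		return;
	room = size - *used;
	va_start(ap, fmt);
	n = vsnprintf(buf + *used, room, fmt, ap);
	va_end(ap);
	if (n < 0)
		return;
	/* On truncation only room - 1 characters were stored. */
	*used += (size_t)n < room ? (size_t)n : room - 1;
}

/* Format only the fields that val_bits marks as valid. */
static size_t format_err(char *buf, size_t size,
			 const struct demo_err_sec *err)
{
	size_t used = 0;

	assert(buf != NULL && size > 0 && err != NULL);
	buf[0] = '\0';

	if (err->val_bits & DEMO_VALID_SOC_ID)
		append(buf, size, &used, "SOC ID=%d ", err->soc_id);
	if (err->val_bits & DEMO_VALID_SOCKET_ID)
		append(buf, size, &used, "socket ID=%d ", err->socket_id);
	if (err->val_bits & DEMO_VALID_ERR_MISC_0)
		append(buf, size, &used, "ERR_MISC0=0x%x ",
		       (unsigned int)err->err_misc_0);
	return used;
}
```

Fields whose validation bit is clear are skipped entirely, so a record
with an unset socket ID bit never prints a stale socket_id value.
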
 Makefile.am                |   2 +-
 non-standard-hisi_hip07.c  |   8 +-
 non-standard-hisi_hip08.c  | 332 +++++++++++++++++++++++++++++++++++++++++++++
 ras-non-standard-handler.c |   3 +-
 ras-non-standard-handler.h |   7 +-
 ras-record.c               |  30 ++--
 ras-record.h               |  13 ++
 7 files changed, 378 insertions(+), 17 deletions(-)
 create mode 100644 non-standard-hisi_hip08.c

diff --git a/Makefile.am b/Makefile.am
index f036ffd..3d89672 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -49,7 +49,7 @@ if WITH_ABRT_REPORT
    rasdaemon_SOURCES += ras-report.c
 endif
 if WITH_HISI_NS_DECODE
-   rasdaemon_SOURCES += non-standard-hisi_hip07.c
+   rasdaemon_SOURCES += non-standard-hisi_hip07.c non-standard-hisi_hip08.c
 endif
 rasdaemon_LDADD = -lpthread $(SQLITE3_LIBS) libtrace/libtrace.a
 
diff --git a/non-standard-hisi_hip07.c b/non-standard-hisi_hip07.c
index bb2576e..7f58fb3 100644
--- a/non-standard-hisi_hip07.c
+++ b/non-standard-hisi_hip07.c
@@ -87,7 +87,9 @@ static char *sas_axi_err_type(int etype)
 	return "unknown error";
 }
 
-static int decode_hip07_sas_error(struct trace_seq *s, const void *error)
+static int decode_hip07_sas_error(struct ras_events *ras,
+				  struct ras_ns_dec_tab *dec_tab,
+				  struct trace_seq *s, const void *error)
 {
 	char buf[1024];
 	char *p = buf;
@@ -120,7 +122,9 @@ static int decode_hip07_sas_error(struct trace_seq *s, const void *error)
 	return 0;
 }
 
-static int decode_hip07_hns_error(struct trace_seq *s, const void *error)
+static int decode_hip07_hns_error(struct ras_events *ras,
+				  struct ras_ns_dec_tab *dec_tab,
+				  struct trace_seq *s, const void *error)
 {
 	return 0;
 }
diff --git a/non-standard-hisi_hip08.c b/non-standard-hisi_hip08.c
new file mode 100644
index 0000000..240e832
--- /dev/null
+++ b/non-standard-hisi_hip08.c
@@ -0,0 +1,332 @@
+/*
+ * Copyright (c) 2019 Hisilicon Limited.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include "ras-record.h"
+#include "ras-logger.h"
+#include "ras-report.h"
+#include "ras-non-standard-handler.h"
+
+/* HISI OEM error definitions */
+/* HISI OEM format1 error definitions */
+#define HISI_OEM_MODULE_ID_MN	0
+#define HISI_OEM_MODULE_ID_PLL	1
+#define HISI_OEM_MODULE_ID_SLLC	2
+#define HISI_OEM_MODULE_ID_AA	3
+#define HISI_OEM_MODULE_ID_SIOE	4
+#define HISI_OEM_MODULE_ID_POE	5
+#define HISI_OEM_MODULE_ID_DISP	8
+#define HISI_OEM_MODULE_ID_LPC	9
+#define HISI_OEM_MODULE_ID_SAS	15
+#define HISI_OEM_MODULE_ID_SATA	16
+
+#define HISI_OEM_VALID_SOC_ID		BIT(0)
+#define HISI_OEM_VALID_SOCKET_ID	BIT(1)
+#define HISI_OEM_VALID_NIMBUS_ID	BIT(2)
+#define HISI_OEM_VALID_MODULE_ID	BIT(3)
+#define HISI_OEM_VALID_SUB_MODULE_ID	BIT(4)
+#define HISI_OEM_VALID_ERR_SEVERITY	BIT(5)
+
+#define HISI_OEM_TYPE1_VALID_ERR_MISC_0	BIT(6)
+#define HISI_OEM_TYPE1_VALID_ERR_MISC_1	BIT(7)
+#define HISI_OEM_TYPE1_VALID_ERR_MISC_2	BIT(8)
+#define HISI_OEM_TYPE1_VALID_ERR_MISC_3	BIT(9)
+#define HISI_OEM_TYPE1_VALID_ERR_MISC_4	BIT(10)
+#define HISI_OEM_TYPE1_VALID_ERR_ADDR	BIT(11)
+
+struct hisi_oem_type1_err_sec {
+	uint32_t   val_bits;
+	uint8_t    version;
+	uint8_t    soc_id;
+	uint8_t    socket_id;
+	uint8_t    nimbus_id;
+	uint8_t    module_id;
+	uint8_t    sub_module_id;
+	uint8_t    err_severity;
+	uint8_t    reserv;
+	uint32_t   err_misc_0;
+	uint32_t   err_misc_1;
+	uint32_t   err_misc_2;
+	uint32_t   err_misc_3;
+	uint32_t   err_misc_4;
+	uint64_t   err_addr;
+};
+
+enum hisi_oem_data_type {
+	hisi_oem_data_type_int,
+	hisi_oem_data_type_int64,
+	hisi_oem_data_type_text,
+};
+
+enum {
+	hip08_oem_type1_field_id,
+	hip08_oem_type1_field_version,
+	hip08_oem_type1_field_soc_id,
+	hip08_oem_type1_field_socket_id,
+	hip08_oem_type1_field_nimbus_id,
+	hip08_oem_type1_field_module_id,
+	hip08_oem_type1_field_sub_module_id,
+	hip08_oem_type1_field_err_sev,
+	hip08_oem_type1_field_err_misc_0,
+	hip08_oem_type1_field_err_misc_1,
+	hip08_oem_type1_field_err_misc_2,
+	hip08_oem_type1_field_err_misc_3,
+	hip08_oem_type1_field_err_misc_4,
+	hip08_oem_type1_field_err_addr,
+};
+
+/* helper functions */
+static char *err_severity(uint8_t err_sev)
+{
+	switch (err_sev) {
+	case 0: return "recoverable";
+	case 1: return "fatal";
+	case 2: return "corrected";
+	case 3: return "none";
+	}
+	return "unknown";
+}
+
+static char *oem_type1_module_name(uint8_t module_id)
+{
+	switch (module_id) {
+	case HISI_OEM_MODULE_ID_MN: return "MN";
+	case HISI_OEM_MODULE_ID_PLL: return "PLL";
+	case HISI_OEM_MODULE_ID_SLLC: return "SLLC";
+	case HISI_OEM_MODULE_ID_AA: return "AA";
+	case HISI_OEM_MODULE_ID_SIOE: return "SIOE";
+	case HISI_OEM_MODULE_ID_POE: return "POE";
+	case HISI_OEM_MODULE_ID_DISP: return "DISP";
+	case HISI_OEM_MODULE_ID_LPC: return "LPC";
+	case HISI_OEM_MODULE_ID_SAS: return "SAS";
+	case HISI_OEM_MODULE_ID_SATA: return "SATA";
+	}
+	return "unknown";
+}
+
+#ifdef HAVE_SQLITE3
+static const struct db_fields hip08_oem_type1_event_fields[] = {
+	{ .name = "id",			.type = "INTEGER PRIMARY KEY" },
+	{ .name = "version",		.type = "INTEGER" },
+	{ .name = "soc_id",		.type = "INTEGER" },
+	{ .name = "socket_id",		.type = "INTEGER" },
+	{ .name = "nimbus_id",		.type = "INTEGER" },
+	{ .name = "module_id",		.type = "TEXT" },
+	{ .name = "sub_module_id",	.type = "INTEGER" },
+	{ .name = "err_severity",	.type = "TEXT" },
+	{ .name = "err_misc_0",		.type = "INTEGER" },
+	{ .name = "err_misc_1",		.type = "INTEGER" },
+	{ .name = "err_misc_2",		.type = "INTEGER" },
+	{ .name = "err_misc_3",		.type = "INTEGER" },
+	{ .name = "err_misc_4",		.type = "INTEGER" },
+	{ .name = "err_addr",		.type = "INTEGER" },
+};
+
+static const struct db_table_descriptor hip08_oem_type1_event_tab = {
+	.name = "hip08_oem_type1_event",
+	.fields = hip08_oem_type1_event_fields,
+	.num_fields = ARRAY_SIZE(hip08_oem_type1_event_fields),
+};
+
+static void record_vendor_data(struct ras_ns_dec_tab *dec_tab,
+			       enum hisi_oem_data_type data_type,
+			       int id, int64_t data, const char *text)
+{
+	switch (data_type) {
+	case hisi_oem_data_type_int:
+		sqlite3_bind_int(dec_tab->stmt_dec_record, id, data);
+		break;
+	case hisi_oem_data_type_int64:
+		sqlite3_bind_int64(dec_tab->stmt_dec_record, id, data);
+		break;
+	case hisi_oem_data_type_text:
+		sqlite3_bind_text(dec_tab->stmt_dec_record, id, text, -1, NULL);
+		break;
+	default:
+		break;
+	}
+}
+
+static int step_vendor_data_tab(struct ras_ns_dec_tab *dec_tab, char *name)
+{
+	int rc;
+
+	rc = sqlite3_step(dec_tab->stmt_dec_record);
+	if (rc != SQLITE_OK && rc != SQLITE_DONE)
+		log(TERM, LOG_ERR,
+		    "Failed to do %s step on sqlite: error = %d\n", name, rc);
+
+	rc = sqlite3_reset(dec_tab->stmt_dec_record);
+	if (rc != SQLITE_OK && rc != SQLITE_DONE)
+		log(TERM, LOG_ERR,
+		    "Failed to reset %s on sqlite: error = %d\n", name, rc);
+
+	rc = sqlite3_clear_bindings(dec_tab->stmt_dec_record);
+	if (rc != SQLITE_OK && rc != SQLITE_DONE)
+		log(TERM, LOG_ERR,
+		    "Failed to clear bindings %s on sqlite: error = %d\n",
+		    name, rc);
+
+	return rc;
+}
+#else
+static void record_vendor_data(struct ras_ns_dec_tab *dec_tab,
+			       enum hisi_oem_data_type data_type,
+			       int id, int64_t data, const char *text)
+{ }
+
+static int step_vendor_data_tab(struct ras_ns_dec_tab *dec_tab, char *name)
+{
+	return 0;
+}
+#endif
+
+/* error data decoding functions */
+static int decode_hip08_oem_type1_error(struct ras_events *ras,
+					struct ras_ns_dec_tab *dec_tab,
+					struct trace_seq *s, const void *error)
+{
+	const struct hisi_oem_type1_err_sec *err = error;
+	char buf[1024];
+	char *p = buf;
+
+	if (err->val_bits == 0) {
+		trace_seq_printf(s, "%s: no valid error information\n",
+				 __func__);
+		return -1;
+	}
+
+#ifdef HAVE_SQLITE3
+	if (!dec_tab->stmt_dec_record) {
+		if (ras_mc_add_vendor_table(ras, &dec_tab->stmt_dec_record,
+					    &hip08_oem_type1_event_tab)
+			!= SQLITE_OK) {
+			trace_seq_printf(s,
+					"create sql hip08_oem_type1_event_tab fail\n");
+			return -1;
+		}
+	}
+#endif
+
+	p += sprintf(p, "[ ");
+	p += sprintf(p, "Table version=%d ", err->version);
+	record_vendor_data(dec_tab, hisi_oem_data_type_int,
+			   hip08_oem_type1_field_version, err->version, NULL);
+
+	if (err->val_bits & HISI_OEM_VALID_SOC_ID) {
+		p += sprintf(p, "SOC ID=%d ", err->soc_id);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_oem_type1_field_soc_id,
+				   err->soc_id, NULL);
+	}
+
+	if (err->val_bits & HISI_OEM_VALID_SOCKET_ID) {
+		p += sprintf(p, "socket ID=%d ", err->socket_id);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_oem_type1_field_socket_id,
+				   err->socket_id, NULL);
+	}
+
+	if (err->val_bits & HISI_OEM_VALID_NIMBUS_ID) {
+		p += sprintf(p, "nimbus ID=%d ", err->nimbus_id);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_oem_type1_field_nimbus_id,
+				   err->nimbus_id, NULL);
+	}
+
+	if (err->val_bits & HISI_OEM_VALID_MODULE_ID) {
+		p += sprintf(p, "module=%s-",
+			     oem_type1_module_name(err->module_id));
+		record_vendor_data(dec_tab, hisi_oem_data_type_text,
+				   hip08_oem_type1_field_module_id,
+				   0, oem_type1_module_name(err->module_id));
+		if (err->val_bits & HISI_OEM_VALID_SUB_MODULE_ID) {
+			p += sprintf(p, "%d ", err->sub_module_id);
+			record_vendor_data(dec_tab, hisi_oem_data_type_int,
+					   hip08_oem_type1_field_sub_module_id,
+					   err->sub_module_id, NULL);
+		}
+	}
+
+	if (err->val_bits & HISI_OEM_VALID_ERR_SEVERITY) {
+		p += sprintf(p, "error severity=%s ",
+			     err_severity(err->err_severity));
+		record_vendor_data(dec_tab, hisi_oem_data_type_text,
+				   hip08_oem_type1_field_err_sev,
+				   0, err_severity(err->err_severity));
+	}
+
+	p += sprintf(p, "]");
+	trace_seq_printf(s, "\nHISI HIP08: OEM Type-1 Error\n");
+	trace_seq_printf(s, "%s\n", buf);
+
+	trace_seq_printf(s, "Reg Dump:\n");
+	if (err->val_bits & HISI_OEM_TYPE1_VALID_ERR_MISC_0) {
+		trace_seq_printf(s, "ERR_MISC0=0x%x\n", err->err_misc_0);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_oem_type1_field_err_misc_0,
+				   err->err_misc_0, NULL);
+	}
+
+	if (err->val_bits & HISI_OEM_TYPE1_VALID_ERR_MISC_1) {
+		trace_seq_printf(s, "ERR_MISC1=0x%x\n", err->err_misc_1);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_oem_type1_field_err_misc_1,
+				   err->err_misc_1, NULL);
+	}
+
+	if (err->val_bits & HISI_OEM_TYPE1_VALID_ERR_MISC_2) {
+		trace_seq_printf(s, "ERR_MISC2=0x%x\n", err->err_misc_2);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_oem_type1_field_err_misc_2,
+				   err->err_misc_2, NULL);
+	}
+
+	if (err->val_bits & HISI_OEM_TYPE1_VALID_ERR_MISC_3) {
+		trace_seq_printf(s, "ERR_MISC3=0x%x\n", err->err_misc_3);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_oem_type1_field_err_misc_3,
+				   err->err_misc_3, NULL);
+	}
+
+	if (err->val_bits & HISI_OEM_TYPE1_VALID_ERR_MISC_4) {
+		trace_seq_printf(s, "ERR_MISC4=0x%x\n", err->err_misc_4);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_oem_type1_field_err_misc_4,
+				   err->err_misc_4, NULL);
+	}
+
+	if (err->val_bits & HISI_OEM_TYPE1_VALID_ERR_ADDR) {
+		trace_seq_printf(s, "ERR_ADDR=%p\n", (void *)err->err_addr);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int64,
+				   hip08_oem_type1_field_err_addr,
+				   err->err_addr, NULL);
+	}
+
+	step_vendor_data_tab(dec_tab, "hip08_oem_type1_event_tab");
+
+	return 0;
+}
+
+struct ras_ns_dec_tab hip08_ns_oem_tab[] = {
+	{
+		.sec_type = "1f8161e155d641e6bd107afd1dc5f7c5",
+		.decode = decode_hip08_oem_type1_error,
+	},
+	{ /* sentinel */ }
+};
+
+__attribute__((constructor))
+static void hip08_init(void)
+{
+	register_ns_dec_tab(hip08_ns_oem_tab);
+}
diff --git a/ras-non-standard-handler.c b/ras-non-standard-handler.c
index 392bb27..4eda80b 100644
--- a/ras-non-standard-handler.c
+++ b/ras-non-standard-handler.c
@@ -166,7 +166,8 @@ int ras_non_standard_event_handler(struct trace_seq *s,
 		for (i = 0; dec_tab[i].decode; i++) {
 			if (uuid_le_cmp(ev.sec_type,
 					dec_tab[i].sec_type) == 0) {
-				dec_tab[i].decode(s, ev.error);
+				dec_tab[i].decode(ras, &dec_tab[i],
+						  s, ev.error);
 				dec_done = true;
 				break;
 			}
diff --git a/ras-non-standard-handler.h b/ras-non-standard-handler.h
index b2c9743..a7e48a3 100644
--- a/ras-non-standard-handler.h
+++ b/ras-non-standard-handler.h
@@ -22,7 +22,12 @@
 
 typedef struct ras_ns_dec_tab {
 	const char *sec_type;
-	int (*decode)(struct trace_seq *s, const void *err);
+	int (*decode)(struct ras_events *ras, struct ras_ns_dec_tab *dec_tab,
+		      struct trace_seq *s, const void *err);
+#ifdef HAVE_SQLITE3
+#include <sqlite3.h>
+	sqlite3_stmt *stmt_dec_record;
+#endif
 } *p_ns_dec_tab;
 
 int ras_non_standard_event_handler(struct trace_seq *s,
diff --git a/ras-record.c b/ras-record.c
index 4c8b55b..b212607 100644
--- a/ras-record.c
+++ b/ras-record.c
@@ -38,17 +38,6 @@
 
 #define ARRAY_SIZE(x) (sizeof(x)/sizeof(*(x)))
 
-struct db_fields {
-	char *name;
-	char *type;
-};
-
-struct db_table_descriptor {
-	char			*name;
-	const struct db_fields	*fields;
-	size_t			num_fields;
-};
-
 /*
  * Table and functions to handle ras:mc_event
  */
@@ -511,7 +500,7 @@ static int ras_mc_create_table(struct sqlite3_priv *priv,
 {
 	const struct db_fields *field;
 	char sql[1024], *p = sql, *end = sql + sizeof(sql);
-	int i,rc;
+	int i, rc;
 
 	p += snprintf(p, end - p, "CREATE TABLE IF NOT EXISTS %s (",
 		      db_tab->name);
@@ -538,6 +527,23 @@ static int ras_mc_create_table(struct sqlite3_priv *priv,
 	return rc;
 }
 
+int ras_mc_add_vendor_table(struct ras_events *ras,
+			    sqlite3_stmt **stmt,
+			    const struct db_table_descriptor *db_tab)
+{
+	int rc;
+	struct sqlite3_priv *priv = ras->db_priv;
+
+	if (!priv)
+		return -1;
+
+	rc = ras_mc_create_table(priv, db_tab);
+	if (rc == SQLITE_OK)
+		rc = ras_mc_prepare_stmt(priv, stmt, db_tab);
+
+	return rc;
+}
+
 int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras)
 {
 	int rc;
diff --git a/ras-record.h b/ras-record.h
index 2183167..432a571 100644
--- a/ras-record.h
+++ b/ras-record.h
@@ -119,7 +119,20 @@ struct sqlite3_priv {
 #endif
 };
 
+struct db_fields {
+	char *name;
+	char *type;
+};
+
+struct db_table_descriptor {
+	char                    *name;
+	const struct db_fields  *fields;
+	size_t                  num_fields;
+};
+
 int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras);
+int ras_mc_add_vendor_table(struct ras_events *ras, sqlite3_stmt **stmt,
+			    const struct db_table_descriptor *db_tab);
 int ras_store_mc_event(struct ras_events *ras, struct ras_mc_event *ev);
 int ras_store_aer_event(struct ras_events *ras, struct ras_aer_event *ev);
 int ras_store_mce_record(struct ras_events *ras, struct mce_event *ev);
-- 
1.9.1




* [PATCH 5/6] rasdaemon:add logging HiSilicon HIP08 H/W errors reported in the OEM format2
  2019-06-17 14:28 ` [PATCH 0/6] rasdaemon:add logging of HiSilicon HIP08 non-standard H/W errors and changes in the error decoding code Shiju Jose
                     ` (3 preceding siblings ...)
  2019-06-17 14:28   ` [PATCH 4/6] rasdaemon:add logging HiSilicon HIP08 H/W errors reported in the OEM format1 Shiju Jose
@ 2019-06-17 14:28   ` Shiju Jose
  2019-06-17 14:28   ` [PATCH 6/6] rasdaemon:add logging HiSilicon HIP08 PCIe local errors Shiju Jose
  2019-06-21 18:42   ` [PATCH 0/6] rasdaemon:add logging of HiSilicon HIP08 non-standard H/W errors and changes in the error decoding code Mauro Carvalho Chehab
  6 siblings, 0 replies; 33+ messages in thread
From: Shiju Jose @ 2019-06-17 14:28 UTC (permalink / raw)
  To: mchehab, linux-edac, linuxarm; +Cc: Shiju Jose

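This patch adds logging of the HiSilicon HIP08 H/W errors reported
in the non-standard OEM format2.
These errors come from the H/W modules SMMU, HHA, HLLC, PA and DDRC.
When rasdaemon is built with SQLite3 support, the decoded fields are
also recorded in the hip08_oem_type2_event table.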

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
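The sub-module naming for HHA and DDRC in this patch is written as a
chain of if/else branches. As a possible alternative, and only as a
sketch, the same mapping can be expressed with lookup tables. The names
hha_names, ddrc_names, DEMO_MODULE_* and sub_module_name() are
hypothetical; the strings and module ID values are copied from the
patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))

/* Values mirror HISI_OEM_MODULE_ID_HHA and HISI_OEM_MODULE_ID_DDRC. */
enum demo_module {
	DEMO_MODULE_HHA  = 1,
	DEMO_MODULE_DDRC = 4,
};

static const char * const hha_names[] = {
	"TA HHA0", "TA HHA1", "TB HHA0", "TB HHA1",
};

static const char * const ddrc_names[] = {
	"TA DDRC0", "TA DDRC1", "TA DDRC2", "TA DDRC3",
	"TB DDRC0", "TB DDRC1", "TB DDRC2", "TB DDRC3",
};

/*
 * Return the sub-module name, or NULL when the module has no named
 * sub-modules or the ID is out of range.
 */
static const char *sub_module_name(uint8_t module_id, uint8_t sub_module_id)
{
	/* Sanity check: the patch lists 4 HHA and 8 DDRC names. */
	assert(ARRAY_SIZE(hha_names) == 4 && ARRAY_SIZE(ddrc_names) == 8);

	switch (module_id) {
	case DEMO_MODULE_HHA:
		if (sub_module_id < ARRAY_SIZE(hha_names))
			return hha_names[sub_module_id];
		break;
	case DEMO_MODULE_DDRC:
		if (sub_module_id < ARRAY_SIZE(ddrc_names))
			return ddrc_names[sub_module_id];
		break;
	}
	return NULL;
}
```

A caller would print the returned string when it is not NULL and fall
back to the numeric sub-module ID otherwise, which also makes the
out-of-range case explicit.
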
 non-standard-hisi_hip08.c | 300 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 300 insertions(+)

diff --git a/non-standard-hisi_hip08.c b/non-standard-hisi_hip08.c
index 240e832..6fe5cbe 100644
--- a/non-standard-hisi_hip08.c
+++ b/non-standard-hisi_hip08.c
@@ -43,6 +43,20 @@
 #define HISI_OEM_TYPE1_VALID_ERR_MISC_4	BIT(10)
 #define HISI_OEM_TYPE1_VALID_ERR_ADDR	BIT(11)
 
+/* HISI OEM format2 error definitions */
+#define HISI_OEM_MODULE_ID_SMMU	0
+#define HISI_OEM_MODULE_ID_HHA	1
+#define HISI_OEM_MODULE_ID_HLLC	2
+#define HISI_OEM_MODULE_ID_PA	3
+#define HISI_OEM_MODULE_ID_DDRC	4
+
+#define HISI_OEM_TYPE2_VALID_ERR_FR	BIT(6)
+#define HISI_OEM_TYPE2_VALID_ERR_CTRL	BIT(7)
+#define HISI_OEM_TYPE2_VALID_ERR_STATUS	BIT(8)
+#define HISI_OEM_TYPE2_VALID_ERR_ADDR	BIT(9)
+#define HISI_OEM_TYPE2_VALID_ERR_MISC_0	BIT(10)
+#define HISI_OEM_TYPE2_VALID_ERR_MISC_1	BIT(11)
+
 struct hisi_oem_type1_err_sec {
 	uint32_t   val_bits;
 	uint8_t    version;
@@ -61,6 +75,30 @@ struct hisi_oem_type1_err_sec {
 	uint64_t   err_addr;
 };
 
+struct hisi_oem_type2_err_sec {
+	uint32_t   val_bits;
+	uint8_t    version;
+	uint8_t    soc_id;
+	uint8_t    socket_id;
+	uint8_t    nimbus_id;
+	uint8_t    module_id;
+	uint8_t    sub_module_id;
+	uint8_t    err_severity;
+	uint8_t    reserv;
+	uint32_t   err_fr_0;
+	uint32_t   err_fr_1;
+	uint32_t   err_ctrl_0;
+	uint32_t   err_ctrl_1;
+	uint32_t   err_status_0;
+	uint32_t   err_status_1;
+	uint32_t   err_addr_0;
+	uint32_t   err_addr_1;
+	uint32_t   err_misc0_0;
+	uint32_t   err_misc0_1;
+	uint32_t   err_misc1_0;
+	uint32_t   err_misc1_1;
+};
+
 enum hisi_oem_data_type {
 	hisi_oem_data_type_int,
 	hisi_oem_data_type_int64,
@@ -84,6 +122,29 @@ enum {
 	hip08_oem_type1_field_err_addr,
 };
 
+enum {
+	hip08_oem_type2_field_id,
+	hip08_oem_type2_field_version,
+	hip08_oem_type2_field_soc_id,
+	hip08_oem_type2_field_socket_id,
+	hip08_oem_type2_field_nimbus_id,
+	hip08_oem_type2_field_module_id,
+	hip08_oem_type2_field_sub_module_id,
+	hip08_oem_type2_field_err_sev,
+	hip08_oem_type2_field_err_fr_0,
+	hip08_oem_type2_field_err_fr_1,
+	hip08_oem_type2_field_err_ctrl_0,
+	hip08_oem_type2_field_err_ctrl_1,
+	hip08_oem_type2_field_err_status_0,
+	hip08_oem_type2_field_err_status_1,
+	hip08_oem_type2_field_err_addr_0,
+	hip08_oem_type2_field_err_addr_1,
+	hip08_oem_type2_field_err_misc0_0,
+	hip08_oem_type2_field_err_misc0_1,
+	hip08_oem_type2_field_err_misc1_0,
+	hip08_oem_type2_field_err_misc1_1,
+};
+
 /* helper functions */
 static char *err_severity(uint8_t err_sev)
 {
@@ -113,6 +174,62 @@ static char *oem_type1_module_name(uint8_t module_id)
 	return "unknown";
 }
 
+static char *oem_type2_module_name(uint8_t module_id)
+{
+	switch (module_id) {
+	case HISI_OEM_MODULE_ID_SMMU: return "SMMU";
+	case HISI_OEM_MODULE_ID_HHA: return "HHA";
+	case HISI_OEM_MODULE_ID_HLLC: return "HLLC";
+	case HISI_OEM_MODULE_ID_PA: return "PA";
+	case HISI_OEM_MODULE_ID_DDRC: return "DDRC";
+	}
+	return "unknown module";
+}
+
+static char *oem_type2_sub_module_id(char *p, uint8_t module_id,
+				     uint8_t sub_module_id)
+{
+	switch (module_id) {
+	case HISI_OEM_MODULE_ID_SMMU:
+	case HISI_OEM_MODULE_ID_HLLC:
+	case HISI_OEM_MODULE_ID_PA:
+		p += sprintf(p, "%d ", sub_module_id);
+		break;
+
+	case HISI_OEM_MODULE_ID_HHA:
+		if (sub_module_id == 0)
+			p += sprintf(p, "TA HHA0 ");
+		else if (sub_module_id == 1)
+			p += sprintf(p, "TA HHA1 ");
+		else if (sub_module_id == 2)
+			p += sprintf(p, "TB HHA0 ");
+		else if (sub_module_id == 3)
+			p += sprintf(p, "TB HHA1 ");
+		break;
+
+	case HISI_OEM_MODULE_ID_DDRC:
+		if (sub_module_id == 0)
+			p += sprintf(p, "TA DDRC0 ");
+		else if (sub_module_id == 1)
+			p += sprintf(p, "TA DDRC1 ");
+		else if (sub_module_id == 2)
+			p += sprintf(p, "TA DDRC2 ");
+		else if (sub_module_id == 3)
+			p += sprintf(p, "TA DDRC3 ");
+		else if (sub_module_id == 4)
+			p += sprintf(p, "TB DDRC0 ");
+		else if (sub_module_id == 5)
+			p += sprintf(p, "TB DDRC1 ");
+		else if (sub_module_id == 6)
+			p += sprintf(p, "TB DDRC2 ");
+		else if (sub_module_id == 7)
+			p += sprintf(p, "TB DDRC3 ");
+		break;
+	}
+
+	return p;
+}
+
 #ifdef HAVE_SQLITE3
 static const struct db_fields hip08_oem_type1_event_fields[] = {
 	{ .name = "id",			.type = "INTEGER PRIMARY KEY" },
@@ -137,6 +254,35 @@ static const struct db_table_descriptor hip08_oem_type1_event_tab = {
 	.num_fields = ARRAY_SIZE(hip08_oem_type1_event_fields),
 };
 
+static const struct db_fields hip08_oem_type2_event_fields[] = {
+	{ .name = "id",                 .type = "INTEGER PRIMARY KEY" },
+	{ .name = "version",            .type = "INTEGER" },
+	{ .name = "soc_id",             .type = "INTEGER" },
+	{ .name = "socket_id",          .type = "INTEGER" },
+	{ .name = "nimbus_id",          .type = "INTEGER" },
+	{ .name = "module_id",          .type = "TEXT" },
+	{ .name = "sub_module_id",      .type = "INTEGER" },
+	{ .name = "err_severity",       .type = "TEXT" },
+	{ .name = "err_fr_0",		.type = "INTEGER" },
+	{ .name = "err_fr_1",		.type = "INTEGER" },
+	{ .name = "err_ctrl_0",		.type = "INTEGER" },
+	{ .name = "err_ctrl_1",		.type = "INTEGER" },
+	{ .name = "err_status_0",	.type = "INTEGER" },
+	{ .name = "err_status_1",	.type = "INTEGER" },
+	{ .name = "err_addr_0",         .type = "INTEGER" },
+	{ .name = "err_addr_1",         .type = "INTEGER" },
+	{ .name = "err_misc0_0",	.type = "INTEGER" },
+	{ .name = "err_misc0_1",	.type = "INTEGER" },
+	{ .name = "err_misc1_0",	.type = "INTEGER" },
+	{ .name = "err_misc1_1",	.type = "INTEGER" },
+};
+
+static const struct db_table_descriptor hip08_oem_type2_event_tab = {
+	.name = "hip08_oem_type2_event",
+	.fields = hip08_oem_type2_event_fields,
+	.num_fields = ARRAY_SIZE(hip08_oem_type2_event_fields),
+};
+
 static void record_vendor_data(struct ras_ns_dec_tab *dec_tab,
 			       enum hisi_oem_data_type data_type,
 			       int id, int64_t data, const char *text)
@@ -317,11 +463,165 @@ static int decode_hip08_oem_type1_error(struct ras_events *ras,
 	return 0;
 }
 
+static int decode_hip08_oem_type2_error(struct ras_events *ras,
+					struct ras_ns_dec_tab *dec_tab,
+					struct trace_seq *s, const void *error)
+{
+	const struct hisi_oem_type2_err_sec *err = error;
+	char buf[1024];
+	char *p = buf;
+
+	if (err->val_bits == 0) {
+		trace_seq_printf(s, "%s: no valid error information\n",
+				 __func__);
+		return -1;
+	}
+
+#ifdef HAVE_SQLITE3
+	if (!dec_tab->stmt_dec_record) {
+		if (ras_mc_add_vendor_table(ras, &dec_tab->stmt_dec_record,
+			&hip08_oem_type2_event_tab) != SQLITE_OK) {
+			trace_seq_printf(s,
+				"create sql hip08_oem_type2_event_tab fail\n");
+			return -1;
+		}
+	}
+#endif
+	p += sprintf(p, "[ ");
+	p += sprintf(p, "Table version=%d ", err->version);
+	record_vendor_data(dec_tab, hisi_oem_data_type_int,
+			   hip08_oem_type2_field_version,
+			   err->version, NULL);
+	if (err->val_bits & HISI_OEM_VALID_SOC_ID) {
+		p += sprintf(p, "SOC ID=%d ", err->soc_id);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_oem_type2_field_soc_id,
+				   err->soc_id, NULL);
+	}
+
+	if (err->val_bits & HISI_OEM_VALID_SOCKET_ID) {
+		p += sprintf(p, "socket ID=%d ", err->socket_id);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_oem_type2_field_socket_id,
+				   err->socket_id, NULL);
+	}
+
+	if (err->val_bits & HISI_OEM_VALID_NIMBUS_ID) {
+		p += sprintf(p, "nimbus ID=%d ", err->nimbus_id);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_oem_type2_field_nimbus_id,
+				   err->nimbus_id, NULL);
+	}
+
+	if (err->val_bits & HISI_OEM_VALID_MODULE_ID) {
+		p += sprintf(p, "module=%s ",
+			     oem_type2_module_name(err->module_id));
+		record_vendor_data(dec_tab, hisi_oem_data_type_text,
+				   hip08_oem_type2_field_module_id,
+				   0, oem_type2_module_name(err->module_id));
+	}
+
+	if (err->val_bits & HISI_OEM_VALID_SUB_MODULE_ID) {
+		p = oem_type2_sub_module_id(p, err->module_id,
+					    err->sub_module_id);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_oem_type2_field_sub_module_id,
+				   err->sub_module_id, NULL);
+	}
+
+	if (err->val_bits & HISI_OEM_VALID_ERR_SEVERITY) {
+		p += sprintf(p, "error severity=%s ",
+			     err_severity(err->err_severity));
+		record_vendor_data(dec_tab, hisi_oem_data_type_text,
+				   hip08_oem_type2_field_err_sev,
+				   0, err_severity(err->err_severity));
+	}
+
+	p += sprintf(p, "]");
+	trace_seq_printf(s, "\nHISI HIP08: OEM Type-2 Error\n");
+	trace_seq_printf(s, "%s\n", buf);
+
+	trace_seq_printf(s, "Reg Dump:\n");
+	if (err->val_bits & HISI_OEM_TYPE2_VALID_ERR_FR) {
+		trace_seq_printf(s, "ERR_FR_0=0x%x\n", err->err_fr_0);
+		trace_seq_printf(s, "ERR_FR_1=0x%x\n", err->err_fr_1);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_oem_type2_field_err_fr_0,
+				   err->err_fr_0, NULL);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_oem_type2_field_err_fr_1,
+				   err->err_fr_1, NULL);
+	}
+
+	if (err->val_bits & HISI_OEM_TYPE2_VALID_ERR_CTRL) {
+		trace_seq_printf(s, "ERR_CTRL_0=0x%x\n", err->err_ctrl_0);
+		trace_seq_printf(s, "ERR_CTRL_1=0x%x\n", err->err_ctrl_1);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_oem_type2_field_err_ctrl_0,
+				   err->err_ctrl_0, NULL);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_oem_type2_field_err_ctrl_1,
+				   err->err_ctrl_1, NULL);
+	}
+
+	if (err->val_bits & HISI_OEM_TYPE2_VALID_ERR_STATUS) {
+		trace_seq_printf(s, "ERR_STATUS_0=0x%x\n", err->err_status_0);
+		trace_seq_printf(s, "ERR_STATUS_1=0x%x\n", err->err_status_1);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_oem_type2_field_err_status_0,
+				   err->err_status_0, NULL);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_oem_type2_field_err_status_1,
+				   err->err_status_1, NULL);
+	}
+
+	if (err->val_bits & HISI_OEM_TYPE2_VALID_ERR_ADDR) {
+		trace_seq_printf(s, "ERR_ADDR_0=0x%x\n", err->err_addr_0);
+		trace_seq_printf(s, "ERR_ADDR_1=0x%x\n", err->err_addr_1);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_oem_type2_field_err_addr_0,
+				   err->err_addr_0, NULL);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_oem_type2_field_err_addr_1,
+				   err->err_addr_1, NULL);
+	}
+
+	if (err->val_bits & HISI_OEM_TYPE2_VALID_ERR_MISC_0) {
+		trace_seq_printf(s, "ERR_MISC0_0=0x%x\n", err->err_misc0_0);
+		trace_seq_printf(s, "ERR_MISC0_1=0x%x\n", err->err_misc0_1);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_oem_type2_field_err_misc0_0,
+				   err->err_misc0_0, NULL);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_oem_type2_field_err_misc0_1,
+				   err->err_misc0_1, NULL);
+	}
+
+	if (err->val_bits & HISI_OEM_TYPE2_VALID_ERR_MISC_1) {
+		trace_seq_printf(s, "ERR_MISC1_0=0x%x\n", err->err_misc1_0);
+		trace_seq_printf(s, "ERR_MISC1_1=0x%x\n", err->err_misc1_1);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_oem_type2_field_err_misc1_0,
+				   err->err_misc1_0, NULL);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_oem_type2_field_err_misc1_1,
+				   err->err_misc1_1, NULL);
+	}
+
+	step_vendor_data_tab(dec_tab, "hip08_oem_type2_event_tab");
+
+	return 0;
+}
+
 struct ras_ns_dec_tab hip08_ns_oem_tab[] = {
 	{
 		.sec_type = "1f8161e155d641e6bd107afd1dc5f7c5",
 		.decode = decode_hip08_oem_type1_error,
 	},
+	{
+		.sec_type = "45534ea6ce2341158535e07ab3aef91d",
+		.decode = decode_hip08_oem_type2_error,
+	},
 	{ /* sentinel */ }
 };
 
-- 
1.9.1




* [PATCH 6/6] rasdaemon:add logging HiSilicon HIP08 PCIe local errors
  2019-06-17 14:28 ` [PATCH 0/6] rasdaemon:add logging of HiSilicon HIP08 non-standard H/W errors and changes in the error decoding code Shiju Jose
                     ` (4 preceding siblings ...)
  2019-06-17 14:28   ` [PATCH 5/6] rasdaemon:add logging HiSilicon HIP08 H/W errors reported in the OEM format2 Shiju Jose
@ 2019-06-17 14:28   ` Shiju Jose
  2019-06-21 18:42   ` [PATCH 0/6] rasdaemon:add logging of HiSilicon HIP08 non-standard H/W errors and changes in the error decoding code Mauro Carvalho Chehab
  6 siblings, 0 replies; 33+ messages in thread
From: Shiju Jose @ 2019-06-17 14:28 UTC (permalink / raw)
  To: mchehab, linux-edac, linuxarm; +Cc: Shiju Jose

This patch adds logging for the HiSilicon HIP08 PCIe local errors.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 non-standard-hisi_hip08.c | 223 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 223 insertions(+)
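
Note on the register dump: the decoder below treats bit 9 of val_bits as
the valid flag for err_misc[0], bit 10 for err_misc[1], and so on up to
bit 41 for err_misc[32]. A minimal standalone sketch of that mapping
follows. BIT64, ERR_MISC_COUNT, err_misc_valid and err_misc_valid_count
are hypothetical names used only in this sketch, and the 64-bit constant
is an assumption made here so that shifts above bit 31 stay well defined
however BIT() happens to be defined in the tree.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper for this sketch: a bit mask that is 64 bits wide. */
#define BIT64(nr)	(UINT64_C(1) << (nr))

/* First validity bit used for the err_misc[] registers (from the patch). */
#define HISI_PCIE_LOCAL_VALID_ERR_MISC	9
/* Hypothetical name for the size of err_misc[] in the error section. */
#define ERR_MISC_COUNT			33

/* Return 1 if val_bits marks err_misc[i] as valid, 0 otherwise. */
static int err_misc_valid(uint64_t val_bits, unsigned int i)
{
	assert(i < ERR_MISC_COUNT);
	return (val_bits & BIT64(HISI_PCIE_LOCAL_VALID_ERR_MISC + i)) != 0;
}

/* Count how many err_misc[] registers val_bits marks as valid. */
static unsigned int err_misc_valid_count(uint64_t val_bits)
{
	unsigned int i, n = 0;

	for (i = 0; i < ERR_MISC_COUNT; i++)
		n += err_misc_valid(val_bits, i);
	return n;
}
```

With val_bits set to BIT64(9) | BIT64(41), only err_misc[0] and
err_misc[32] would be reported as valid; the lower bits 0..8 belong to
the header fields and do not affect the count.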

diff --git a/non-standard-hisi_hip08.c b/non-standard-hisi_hip08.c
index 6fe5cbe..ae543d6 100644
--- a/non-standard-hisi_hip08.c
+++ b/non-standard-hisi_hip08.c
@@ -57,6 +57,24 @@
 #define HISI_OEM_TYPE2_VALID_ERR_MISC_0	BIT(10)
 #define HISI_OEM_TYPE2_VALID_ERR_MISC_1	BIT(11)
 
+/* HISI PCIe Local error definitions */
+#define HISI_PCIE_SUB_MODULE_ID_AP	0
+#define HISI_PCIE_SUB_MODULE_ID_TL	1
+#define HISI_PCIE_SUB_MODULE_ID_MAC	2
+#define HISI_PCIE_SUB_MODULE_ID_DL	3
+#define HISI_PCIE_SUB_MODULE_ID_SDI	4
+
+#define HISI_PCIE_LOCAL_VALID_VERSION		BIT(0)
+#define HISI_PCIE_LOCAL_VALID_SOC_ID		BIT(1)
+#define HISI_PCIE_LOCAL_VALID_SOCKET_ID		BIT(2)
+#define HISI_PCIE_LOCAL_VALID_NIMBUS_ID		BIT(3)
+#define HISI_PCIE_LOCAL_VALID_SUB_MODULE_ID	BIT(4)
+#define HISI_PCIE_LOCAL_VALID_CORE_ID		BIT(5)
+#define HISI_PCIE_LOCAL_VALID_PORT_ID		BIT(6)
+#define HISI_PCIE_LOCAL_VALID_ERR_TYPE		BIT(7)
+#define HISI_PCIE_LOCAL_VALID_ERR_SEVERITY	BIT(8)
+#define HISI_PCIE_LOCAL_VALID_ERR_MISC		9
+
 struct hisi_oem_type1_err_sec {
 	uint32_t   val_bits;
 	uint8_t    version;
@@ -99,6 +117,21 @@ struct hisi_oem_type2_err_sec {
 	uint32_t   err_misc1_1;
 };
 
+struct hisi_pcie_local_err_sec {
+	uint64_t   val_bits;
+	uint8_t    version;
+	uint8_t    soc_id;
+	uint8_t    socket_id;
+	uint8_t    nimbus_id;
+	uint8_t    sub_module_id;
+	uint8_t    core_id;
+	uint8_t    port_id;
+	uint8_t    err_severity;
+	uint16_t   err_type;
+	uint8_t    reserv[2];
+	uint32_t   err_misc[33];
+};
+
 enum hisi_oem_data_type {
 	hisi_oem_data_type_int,
 	hisi_oem_data_type_int64,
@@ -145,6 +178,20 @@ enum {
 	hip08_oem_type2_field_err_misc1_1,
 };
 
+enum {
+	hip08_pcie_local_field_id,
+	hip08_pcie_local_field_version,
+	hip08_pcie_local_field_soc_id,
+	hip08_pcie_local_field_socket_id,
+	hip08_pcie_local_field_nimbus_id,
+	hip08_pcie_local_field_sub_module_id,
+	hip08_pcie_local_field_core_id,
+	hip08_pcie_local_field_port_id,
+	hip08_pcie_local_field_err_sev,
+	hip08_pcie_local_field_err_type,
+	hip08_pcie_local_field_err_misc,
+};
+
 /* helper functions */
 static char *err_severity(uint8_t err_sev)
 {
@@ -230,6 +277,18 @@ static char *oem_type2_sub_module_id(char *p, uint8_t module_id,
 	return p;
 }
 
+static char *pcie_local_sub_module_name(uint8_t id)
+{
+	switch (id) {
+	case HISI_PCIE_SUB_MODULE_ID_AP: return "AP Layer";
+	case HISI_PCIE_SUB_MODULE_ID_TL: return "TL Layer";
+	case HISI_PCIE_SUB_MODULE_ID_MAC: return "MAC Layer";
+	case HISI_PCIE_SUB_MODULE_ID_DL: return "DL Layer";
+	case HISI_PCIE_SUB_MODULE_ID_SDI: return "SDI Layer";
+	}
+	return "unknown";
+}
+
 #ifdef HAVE_SQLITE3
 static const struct db_fields hip08_oem_type1_event_fields[] = {
 	{ .name = "id",			.type = "INTEGER PRIMARY KEY" },
@@ -283,6 +342,58 @@ static const struct db_table_descriptor hip08_oem_type2_event_tab = {
 	.num_fields = ARRAY_SIZE(hip08_oem_type2_event_fields),
 };
 
+static const struct db_fields hip08_pcie_local_event_fields[] = {
+	{ .name = "id",                 .type = "INTEGER PRIMARY KEY" },
+	{ .name = "version",            .type = "INTEGER" },
+	{ .name = "soc_id",             .type = "INTEGER" },
+	{ .name = "socket_id",          .type = "INTEGER" },
+	{ .name = "nimbus_id",          .type = "INTEGER" },
+	{ .name = "sub_module_id",      .type = "TEXT" },
+	{ .name = "core_id",		.type = "INTEGER" },
+	{ .name = "port_id",		.type = "INTEGER" },
+	{ .name = "err_severity",       .type = "TEXT" },
+	{ .name = "err_type",		.type = "INTEGER" },
+	{ .name = "err_misc0",		.type = "INTEGER" },
+	{ .name = "err_misc1",		.type = "INTEGER" },
+	{ .name = "err_misc2",		.type = "INTEGER" },
+	{ .name = "err_misc3",		.type = "INTEGER" },
+	{ .name = "err_misc4",		.type = "INTEGER" },
+	{ .name = "err_misc5",		.type = "INTEGER" },
+	{ .name = "err_misc6",		.type = "INTEGER" },
+	{ .name = "err_misc7",		.type = "INTEGER" },
+	{ .name = "err_misc8",		.type = "INTEGER" },
+	{ .name = "err_misc9",		.type = "INTEGER" },
+	{ .name = "err_misc10",		.type = "INTEGER" },
+	{ .name = "err_misc11",		.type = "INTEGER" },
+	{ .name = "err_misc12",		.type = "INTEGER" },
+	{ .name = "err_misc13",		.type = "INTEGER" },
+	{ .name = "err_misc14",		.type = "INTEGER" },
+	{ .name = "err_misc15",		.type = "INTEGER" },
+	{ .name = "err_misc16",		.type = "INTEGER" },
+	{ .name = "err_misc17",		.type = "INTEGER" },
+	{ .name = "err_misc18",		.type = "INTEGER" },
+	{ .name = "err_misc19",		.type = "INTEGER" },
+	{ .name = "err_misc20",		.type = "INTEGER" },
+	{ .name = "err_misc21",		.type = "INTEGER" },
+	{ .name = "err_misc22",		.type = "INTEGER" },
+	{ .name = "err_misc23",		.type = "INTEGER" },
+	{ .name = "err_misc24",		.type = "INTEGER" },
+	{ .name = "err_misc25",		.type = "INTEGER" },
+	{ .name = "err_misc26",		.type = "INTEGER" },
+	{ .name = "err_misc27",		.type = "INTEGER" },
+	{ .name = "err_misc28",		.type = "INTEGER" },
+	{ .name = "err_misc29",		.type = "INTEGER" },
+	{ .name = "err_misc30",		.type = "INTEGER" },
+	{ .name = "err_misc31",		.type = "INTEGER" },
+	{ .name = "err_misc32",		.type = "INTEGER" },
+};
+
+static const struct db_table_descriptor hip08_pcie_local_event_tab = {
+	.name = "hip08_pcie_local_event",
+	.fields = hip08_pcie_local_event_fields,
+	.num_fields = ARRAY_SIZE(hip08_pcie_local_event_fields),
+};
+
 static void record_vendor_data(struct ras_ns_dec_tab *dec_tab,
 			       enum hisi_oem_data_type data_type,
 			       int id, int64_t data, const char *text)
@@ -613,6 +724,114 @@ static int decode_hip08_oem_type2_error(struct ras_events *ras,
 	return 0;
 }
 
+static int decode_hip08_pcie_local_error(struct ras_events *ras,
+					 struct ras_ns_dec_tab *dec_tab,
+					 struct trace_seq *s, const void *error)
+{
+	const struct hisi_pcie_local_err_sec *err = error;
+	char buf[1024];
+	char *p = buf;
+	uint32_t i;
+
+	if (err->val_bits == 0) {
+		trace_seq_printf(s, "%s: no valid error information\n",
+				 __func__);
+		return -1;
+	}
+
+#ifdef HAVE_SQLITE3
+	if (!dec_tab->stmt_dec_record) {
+		if (ras_mc_add_vendor_table(ras, &dec_tab->stmt_dec_record,
+				&hip08_pcie_local_event_tab) != SQLITE_OK) {
+			trace_seq_printf(s,
+				"create sql hip08_pcie_local_event_tab fail\n");
+			return -1;
+		}
+	}
+#endif
+	p += sprintf(p, "[ ");
+	p += sprintf(p, "Table version=%d ", err->version);
+	record_vendor_data(dec_tab, hisi_oem_data_type_int,
+			   hip08_pcie_local_field_version,
+			   err->version, NULL);
+	if (err->val_bits & HISI_PCIE_LOCAL_VALID_SOC_ID) {
+		p += sprintf(p, "SOC ID=%d ", err->soc_id);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_pcie_local_field_soc_id,
+				   err->soc_id, NULL);
+	}
+
+	if (err->val_bits & HISI_PCIE_LOCAL_VALID_SOCKET_ID) {
+		p += sprintf(p, "socket ID=%d ", err->socket_id);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_pcie_local_field_socket_id,
+				   err->socket_id, NULL);
+	}
+
+	if (err->val_bits & HISI_PCIE_LOCAL_VALID_NIMBUS_ID) {
+		p += sprintf(p, "nimbus ID=%d ", err->nimbus_id);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_pcie_local_field_nimbus_id,
+				   err->nimbus_id, NULL);
+	}
+
+	if (err->val_bits & HISI_PCIE_LOCAL_VALID_SUB_MODULE_ID) {
+		p += sprintf(p, "sub module=%s ",
+			     pcie_local_sub_module_name(err->sub_module_id));
+		record_vendor_data(dec_tab, hisi_oem_data_type_text,
+				   hip08_pcie_local_field_sub_module_id,
+				   0, pcie_local_sub_module_name(err->sub_module_id));
+	}
+
+	if (err->val_bits & HISI_PCIE_LOCAL_VALID_CORE_ID) {
+		p += sprintf(p, "core ID=core%d ", err->core_id);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_pcie_local_field_core_id,
+				   err->core_id, NULL);
+	}
+
+	if (err->val_bits & HISI_PCIE_LOCAL_VALID_PORT_ID) {
+		p += sprintf(p, "port ID=port%d ", err->port_id);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_pcie_local_field_port_id,
+				   err->port_id, NULL);
+	}
+
+	if (err->val_bits & HISI_PCIE_LOCAL_VALID_ERR_SEVERITY) {
+		p += sprintf(p, "error severity=%s ",
+			     err_severity(err->err_severity));
+		record_vendor_data(dec_tab, hisi_oem_data_type_text,
+				   hip08_pcie_local_field_err_sev,
+				   0, err_severity(err->err_severity));
+	}
+
+	if (err->val_bits & HISI_PCIE_LOCAL_VALID_ERR_TYPE) {
+		p += sprintf(p, "error type=0x%x ", err->err_type);
+		record_vendor_data(dec_tab, hisi_oem_data_type_int,
+				   hip08_pcie_local_field_err_type,
+				   err->err_type, NULL);
+	}
+	p += sprintf(p, "]");
+
+	trace_seq_printf(s, "\nHISI HIP08: PCIe local error\n");
+	trace_seq_printf(s, "%s\n", buf);
+
+	trace_seq_printf(s, "Reg Dump:\n");
+	for (i = 0; i < 33; i++) {
+		if (err->val_bits & BIT(HISI_PCIE_LOCAL_VALID_ERR_MISC + i)) {
+			trace_seq_printf(s, "ERR_MISC_%d=0x%x\n", i,
+					 err->err_misc[i]);
+			record_vendor_data(dec_tab, hisi_oem_data_type_int,
+					   (hip08_pcie_local_field_err_misc + i),
+					   err->err_misc[i], NULL);
+		}
+	}
+
+	step_vendor_data_tab(dec_tab, "hip08_pcie_local_event_tab");
+
+	return 0;
+}
+
 struct ras_ns_dec_tab hip08_ns_oem_tab[] = {
 	{
 		.sec_type = "1f8161e155d641e6bd107afd1dc5f7c5",
@@ -622,6 +841,10 @@ struct ras_ns_dec_tab hip08_ns_oem_tab[] = {
 		.sec_type = "45534ea6ce2341158535e07ab3aef91d",
 		.decode = decode_hip08_oem_type2_error,
 	},
+	{
+		.sec_type = "b2889fc9e7d74f9da867af42e98be772",
+		.decode = decode_hip08_pcie_local_error,
+	},
 	{ /* sentinel */ }
 };
 
-- 
1.9.1




* Re: [PATCH 0/6] rasdaemon:add logging of HiSilicon HIP08 non-standard H/W errors and changes in the error decoding code
  2019-06-17 14:28 ` [PATCH 0/6] rasdaemon:add logging of HiSilicon HIP08 non-standard H/W errors and changes in the error decoding code Shiju Jose
                     ` (5 preceding siblings ...)
  2019-06-17 14:28   ` [PATCH 6/6] rasdaemon:add logging HiSilicon HIP08 PCIe local errors Shiju Jose
@ 2019-06-21 18:42   ` Mauro Carvalho Chehab
  6 siblings, 0 replies; 33+ messages in thread
From: Mauro Carvalho Chehab @ 2019-06-21 18:42 UTC (permalink / raw)
  To: Shiju Jose; +Cc: linux-edac, linuxarm

Em Mon, 17 Jun 2019 15:28:46 +0100
Shiju Jose <shiju.jose@huawei.com> escreveu:

> This patch set add few changes in the non-standard error decoding code and
> logging for the HiSilicon HIP08 non-standard H/W errors.
> 
> Shiju Jose (6):
>   rasdaemon:print non-standard error data if not decoded
>   rasdaemon: rearrange HiSilicon HIP07 decoding function table
>   rasdaemon: update iteration logic for the non-standard error decoding
>     functions
>   rasdaemon:add logging HiSilicon HIP08 H/W errors reported in the OEM
>     format1
>   rasdaemon:add logging HiSilicon HIP08 H/W errors reported in the OEM
>     format2
>   rasdaemon:add logging HiSilicon HIP08 PCIe local errors
> 
>  Makefile.am                |   2 +-
>  non-standard-hisi_hip07.c  |  36 +-
>  non-standard-hisi_hip08.c  | 855 +++++++++++++++++++++++++++++++++++++++++++++
>  ras-non-standard-handler.c |  36 +-
>  ras-non-standard-handler.h |   8 +-
>  ras-record.c               |  30 +-
>  ras-record.h               |  13 +
>  7 files changed, 932 insertions(+), 48 deletions(-)
>  create mode 100644 non-standard-hisi_hip08.c
> 

Applied, thanks!


Thanks,
Mauro


* [PATCH RFC 0/4] ACPI: APEI: Add support to notify the vendor specific HW errors
       [not found] <Shiju Jose>
  2019-06-17 14:28 ` [PATCH 0/6] rasdaemon:add logging of HiSilicon HIP08 non-standard H/W errors and changes in the error decoding code Shiju Jose
@ 2019-08-12 10:11 ` Shiju Jose
  2019-08-12 10:11   ` [PATCH RFC 1/4] " Shiju Jose
                     ` (4 more replies)
  2019-10-16 16:33 ` [PATCH 0/7] rasdaemon: add fixes, database closure and signal handling Shiju Jose
  2019-11-13 16:31 ` [PATCH rasdaemon 0/2] rasdaemon: add fix for the sql table Shiju Jose
  3 siblings, 5 replies; 33+ messages in thread
From: Shiju Jose @ 2019-08-12 10:11 UTC (permalink / raw)
  To: linux-acpi, linux-edac, linux-kernel, rjw, lenb, james.morse,
	tony.luck, bp, baicar
  Cc: linuxarm, jonathan.cameron, tanxiaofei, Shiju Jose

Presently the kernel does not support reporting vendor-specific HW errors,
in the non-standard format, to the vendor drivers for recovery.

This patch set adds this support and also moves the existing handler
functions for the standard errors to the new callback method.
The CCIX RAS patches could also be moved to the proposed callback method.
https://www.spinics.net/lists/linux-edac/msg10508.html
https://patchwork.kernel.org/patch/10979491/
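
As a rough illustration of the proposed callback method (a userspace
sketch, not the kernel code in this series): handlers are registered per
section type, every handler whose type matches is called, and the caller
falls back to the generic logging when nothing matched. The names
sec_guid, err_handler, handler_register, dispatch and count_handler are
hypothetical and exist only in this sketch; locking and RCU are left out
on purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_HANDLERS 8

/* Stand-in for a 16-byte CPER section type GUID. */
typedef struct { unsigned char b[16]; } sec_guid;
typedef void (*err_handler)(const void *payload, int sev, void *data);

struct handler_entry {
	sec_guid sec_type;
	err_handler handle;
	void *data;		/* private data of the registering driver */
};

static struct handler_entry handlers[MAX_HANDLERS];
static size_t nr_handlers;

/* Register a handler for one section type; 0 on success, -1 if full. */
static int handler_register(const sec_guid *sec_type, err_handler handle,
			    void *data)
{
	assert(sec_type && handle);
	if (nr_handlers == MAX_HANDLERS)
		return -1;
	handlers[nr_handlers].sec_type = *sec_type;
	handlers[nr_handlers].handle = handle;
	handlers[nr_handlers].data = data;
	nr_handlers++;
	return 0;
}

/*
 * Call every handler registered for sec_type. Returns the number of
 * handlers notified; 0 tells the caller to fall back to generic logging.
 */
static int dispatch(const sec_guid *sec_type, const void *payload, int sev)
{
	size_t i;
	int notified = 0;

	for (i = 0; i < nr_handlers; i++) {
		if (!memcmp(&handlers[i].sec_type, sec_type, sizeof(*sec_type))) {
			handlers[i].handle(payload, sev, handlers[i].data);
			notified++;
		}
	}
	return notified;
}

/* Example handler: counts how often it was called via its private data. */
static void count_handler(const void *payload, int sev, void *data)
{
	(void)payload;
	(void)sev;
	++*(int *)data;
}
```

Returning the match count per call, rather than keeping one flag for the
whole status block, lets the fallback decision be made independently for
each section.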

Shiju Jose (4):
  ACPI: APEI: Add support to notify the vendor specific HW errors
  ACPI: APEI: Add ghes_handle_memory_failure to the new notification
    method
  ACPI: APEI: Add ghes_handle_aer to the new notification method
  ACPI: APEI: Add log_arm_hw_error to the new notification method

 drivers/acpi/apei/ghes.c | 170 +++++++++++++++++++++++++++++++++++++++++------
 drivers/ras/ras.c        |   5 +-
 include/acpi/ghes.h      |  47 +++++++++++++
 include/linux/ras.h      |   7 +-
 4 files changed, 205 insertions(+), 24 deletions(-)

-- 
1.9.1




* [PATCH RFC 1/4] ACPI: APEI: Add support to notify the vendor specific HW errors
  2019-08-12 10:11 ` [PATCH RFC 0/4] ACPI: APEI: Add support to notify the vendor specific HW errors Shiju Jose
@ 2019-08-12 10:11   ` " Shiju Jose
  2019-08-21 17:23     ` James Morse
  2019-08-12 10:11   ` [PATCH RFC 2/4] ACPI: APEI: Add ghes_handle_memory_failure to the new notification method Shiju Jose
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 33+ messages in thread
From: Shiju Jose @ 2019-08-12 10:11 UTC (permalink / raw)
  To: linux-acpi, linux-edac, linux-kernel, rjw, lenb, james.morse,
	tony.luck, bp, baicar
  Cc: linuxarm, jonathan.cameron, tanxiaofei, Shiju Jose

Presently the vendor-specific HW errors, in the non-standard format,
are not reported to the vendor drivers for recovery.

This patch adds support for notifying the registered kernel drivers of
vendor-specific HW errors.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 drivers/acpi/apei/ghes.c | 118 +++++++++++++++++++++++++++++++++++++++++++++--
 include/acpi/ghes.h      |  47 +++++++++++++++++++
 2 files changed, 160 insertions(+), 5 deletions(-)
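
Note on the register/unregister pairing: an entry is identified by the
pair (sec_type, handle), so unregistering must match both. A small
userspace sketch of that lifecycle follows, assuming a plain singly
linked list with no locking or RCU. notify_node, notify_register,
notify_unregister and noop_handler are hypothetical names for this
sketch only and do not mirror the kernel signatures.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef void (*notify_fn)(void *data);

struct notify_node {
	struct notify_node *next;
	unsigned char sec_type[16];	/* stand-in for guid_t */
	notify_fn handle;
	void *data;
};

static struct notify_node *notify_head;

/* Add an entry at the head of the list; 0 on success, -1 on failure. */
static int notify_register(const unsigned char sec_type[16],
			   notify_fn handle, void *data)
{
	struct notify_node *node;

	assert(sec_type && handle);
	/* Allocate first, so a failed allocation leaves nothing to undo. */
	node = calloc(1, sizeof(*node));
	if (!node)
		return -1;

	memcpy(node->sec_type, sec_type, 16);
	node->handle = handle;
	node->data = data;
	node->next = notify_head;
	notify_head = node;
	return 0;
}

/* Remove the first entry matching both keys; 1 if removed, 0 if absent. */
static int notify_unregister(const unsigned char sec_type[16],
			     notify_fn handle)
{
	struct notify_node **pp;

	for (pp = &notify_head; *pp; pp = &(*pp)->next) {
		struct notify_node *node = *pp;

		if (!memcmp(node->sec_type, sec_type, 16) &&
		    node->handle == handle) {
			*pp = node->next;	/* unlink before freeing */
			free(node);
			return 1;
		}
	}
	return 0;
}

/* Example handler that does nothing. */
static void noop_handler(void *data)
{
	(void)data;
}
```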

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index a66e00f..374d197 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -477,6 +477,77 @@ static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
 #endif
 }
 
+struct ghes_error_notify {
+	struct list_head list;
+	struct rcu_head	rcu_head;
+	guid_t sec_type; /* guid of the error record */
+	error_handle handle; /* error handler function */
+	void *data; /* handler driver's private data if any */
+};
+
+/* List to store the registered error handling functions */
+static DEFINE_MUTEX(ghes_error_notify_mutex);
+static LIST_HEAD(ghes_error_notify_list);
+static refcount_t ghes_ref_count;
+
+/**
+ * ghes_error_notify_register - register an error handling function
+ * for the hw errors.
+ * @sec_type: sec_type of the corresponding CPER to be notified.
+ * @handle: pointer to the error handling function.
+ * @data: handler driver's private data.
+ *
+ * return 0 : SUCCESS, non-zero : FAIL
+ */
+int ghes_error_notify_register(guid_t sec_type, error_handle handle, void *data)
+{
+	struct ghes_error_notify *err_notify;
+
+	err_notify = kzalloc(sizeof(*err_notify), GFP_KERNEL);
+	if (!err_notify)
+		return -ENOMEM;
+
+	mutex_lock(&ghes_error_notify_mutex);
+	err_notify->handle = handle;
+	guid_copy(&err_notify->sec_type, &sec_type);
+	err_notify->data = data;
+	list_add_rcu(&err_notify->list, &ghes_error_notify_list);
+	mutex_unlock(&ghes_error_notify_mutex);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(ghes_error_notify_register);
+
+/**
+ * ghes_error_notify_unregister - unregister an error handling function.
+ * @sec_type: sec_type of the corresponding CPER.
+ * @handle: pointer to the error handling function.
+ *
+ * return none.
+ */
+void ghes_error_notify_unregister(guid_t sec_type, error_handle handle)
+{
+	struct ghes_error_notify *err_notify;
+	bool found = 0;
+
+	mutex_lock(&ghes_error_notify_mutex);
+	rcu_read_lock();
+	list_for_each_entry_rcu(err_notify, &ghes_error_notify_list, list) {
+		if (guid_equal(&err_notify->sec_type, &sec_type) &&
+		    err_notify->handle == handle) {
+			list_del_rcu(&err_notify->list);
+			found = 1;
+			break;
+		}
+	}
+	rcu_read_unlock();
+	synchronize_rcu();
+	mutex_unlock(&ghes_error_notify_mutex);
+	if (found)
+		kfree(err_notify);
+}
+EXPORT_SYMBOL_GPL(ghes_error_notify_unregister);
+
 static void ghes_do_proc(struct ghes *ghes,
 			 const struct acpi_hest_generic_status *estatus)
 {
@@ -485,6 +556,8 @@ static void ghes_do_proc(struct ghes *ghes,
 	guid_t *sec_type;
 	guid_t *fru_id = &NULL_UUID_LE;
 	char *fru_text = "";
+	bool is_notify = 0;
+	struct ghes_error_notify *err_notify;
 
 	sev = ghes_severity(estatus->error_severity);
 	apei_estatus_for_each_section(estatus, gdata) {
@@ -512,11 +585,29 @@ static void ghes_do_proc(struct ghes *ghes,
 
 			log_arm_hw_error(err);
 		} else {
-			void *err = acpi_hest_get_payload(gdata);
-
-			log_non_standard_event(sec_type, fru_id, fru_text,
-					       sec_sev, err,
-					       gdata->error_data_length);
+			rcu_read_lock();
+			list_for_each_entry_rcu(err_notify,
+						&ghes_error_notify_list, list) {
+				if (guid_equal(&err_notify->sec_type,
+					       sec_type)) {
+					/* The notification is called in the
+					 * interrupt context, so the handler
+					 * functions must take care of that.
+					 */
+					err_notify->handle(gdata, sev,
+							   err_notify->data);
+					is_notify = 1;
+				}
+			}
+			rcu_read_unlock();
+
+			if (!is_notify) {
+				void *err = acpi_hest_get_payload(gdata);
+
+				log_non_standard_event(sec_type, fru_id,
+						       fru_text, sec_sev, err,
+						       gdata->error_data_length);
+			}
 		}
 	}
 }
@@ -1217,6 +1308,11 @@ static int ghes_probe(struct platform_device *ghes_dev)
 
 	ghes_edac_register(ghes, &ghes_dev->dev);
 
+	if (!refcount_read(&ghes_ref_count))
+		refcount_set(&ghes_ref_count, 1);
+	else
+		refcount_inc(&ghes_ref_count);
+
 	/* Handle any pending errors right away */
 	spin_lock_irqsave(&ghes_notify_lock_irq, flags);
 	ghes_proc(ghes);
@@ -1237,6 +1333,7 @@ static int ghes_remove(struct platform_device *ghes_dev)
 	int rc;
 	struct ghes *ghes;
 	struct acpi_hest_generic *generic;
+	struct ghes_error_notify *err_notify, *tmp;
 
 	ghes = platform_get_drvdata(ghes_dev);
 	generic = ghes->generic;
@@ -1279,6 +1376,17 @@ static int ghes_remove(struct platform_device *ghes_dev)
 
 	ghes_fini(ghes);
 
+	if (refcount_dec_and_test(&ghes_ref_count) &&
+	    !list_empty(&ghes_error_notify_list)) {
+		mutex_lock(&ghes_error_notify_mutex);
+		list_for_each_entry_safe(err_notify, tmp,
+					 &ghes_error_notify_list, list) {
+			list_del_rcu(&err_notify->list);
+			kfree_rcu(err_notify, rcu_head);
+		}
+		mutex_unlock(&ghes_error_notify_mutex);
+	}
+
 	ghes_edac_unregister(ghes);
 
 	kfree(ghes);
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index e3f1cdd..d480537 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -50,6 +50,53 @@ enum {
 	GHES_SEV_PANIC = 0x3,
 };
 
+/**
+ * error_handle - error handling function for the hw errors.
+ * This handle function is called in the interrupt context.
+ * @gdata: acpi_hest_generic_data.
+ * @sev: error severity of the entire error event defined in the
+ * ACPI spec table generic error status block.
+ * @data: handler driver's private data.
+ *
+ * return : none.
+ */
+typedef void (*error_handle)(struct acpi_hest_generic_data *gdata, int sev,
+			     void *data);
+
+#ifdef CONFIG_ACPI_APEI_GHES
+/**
+ * ghes_error_notify_register - register an error handling function
+ * for the hw errors.
+ * @sec_type: sec_type of the corresponding CPER to be notified.
+ * @handle: pointer to the error handling function.
+ * @data: handler driver's private data.
+ *
+ * return : 0 - SUCCESS, non-zero - FAIL.
+ */
+int ghes_error_notify_register(guid_t sec_type, error_handle handle,
+			       void *data);
+
+/**
+ * ghes_error_notify_unregister - unregister an error handling function
+ * for the hw errors.
+ * @sec_type: sec_type of the corresponding CPER.
+ * @handle: pointer to the error handling function.
+ *
+ * return none.
+ */
+void ghes_error_notify_unregister(guid_t sec_type, error_handle handle);
+
+#else
+static inline int ghes_error_notify_register(guid_t sec_type,
+					     error_handle handle, void *data)
+{
+	return -ENODEV;
+}
+
+static inline void ghes_error_notify_unregister(guid_t sec_type,
+						error_handle handle) { }
+#endif
+
 int ghes_estatus_pool_init(int num_ghes);
 
 /* From drivers/edac/ghes_edac.c */
-- 
1.9.1




* [PATCH RFC 2/4] ACPI: APEI: Add ghes_handle_memory_failure to the new notification method
  2019-08-12 10:11 ` [PATCH RFC 0/4] ACPI: APEI: Add support to notify the vendor specific HW errors Shiju Jose
  2019-08-12 10:11   ` [PATCH RFC 1/4] " Shiju Jose
@ 2019-08-12 10:11   ` Shiju Jose
  2019-08-21 17:22     ` James Morse
  2019-08-12 10:11   ` [PATCH RFC 3/4] ACPI: APEI: Add ghes_handle_aer " Shiju Jose
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 33+ messages in thread
From: Shiju Jose @ 2019-08-12 10:11 UTC (permalink / raw)
  To: linux-acpi, linux-edac, linux-kernel, rjw, lenb, james.morse,
	tony.luck, bp, baicar
  Cc: linuxarm, jonathan.cameron, tanxiaofei, Shiju Jose

This patch adds ghes_handle_memory_failure to the new error
notification method.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 drivers/acpi/apei/ghes.c | 51 ++++++++++++++++++++++++++++++++++--------------
 1 file changed, 36 insertions(+), 15 deletions(-)
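
Note on the handler table: the standard-error handlers are kept in a
table that ends with an empty sentinel entry and is walked until the
handler pointer is NULL. A minimal userspace sketch of that pattern
follows, assuming a string name in place of the section-type GUID.
std_handler_tab, register_all, handle_mem, reg_ok and reg_fail are
hypothetical names used only here.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*std_handler)(int sev);

struct std_handler_tab {
	const char *name;	/* stands in for the section-type GUID */
	std_handler handle;
};

/* Example handler that ignores its argument. */
static void handle_mem(int sev)
{
	(void)sev;
}

static const struct std_handler_tab std_tab[] = {
	{ .name = "platform-mem", .handle = handle_mem },
	{ .name = NULL, .handle = NULL }	/* sentinel */
};

/*
 * Walk the table up to the sentinel and pass each entry to reg().
 * Returns the number of entries registered, or -1 on the first failure.
 */
static int register_all(const struct std_handler_tab *tab,
			int (*reg)(const struct std_handler_tab *entry))
{
	int n = 0;

	assert(tab && reg);
	for (; tab->handle; tab++) {
		if (reg(tab))
			return -1;
		n++;
	}
	return n;
}

/* Example registration callbacks: one always succeeds, one always fails. */
static int reg_ok(const struct std_handler_tab *entry)
{
	(void)entry;
	return 0;
}

static int reg_fail(const struct std_handler_tab *entry)
{
	(void)entry;
	return -1;
}
```

Stopping at the first failure mirrors the probe path below, which bails
out as soon as one registration fails.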

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 374d197..4400d56 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -401,14 +401,18 @@ static void ghes_clear_estatus(struct ghes *ghes,
 		ghes_ack_error(ghes->generic_v2);
 }
 
-static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int sev)
+static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
+				       int sev, void *data)
 {
-#ifdef CONFIG_ACPI_APEI_MEMORY_FAILURE
+	struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
+	int sec_sev = ghes_severity(gdata->error_severity);
 	unsigned long pfn;
 	int flags = -1;
-	int sec_sev = ghes_severity(gdata->error_severity);
-	struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
 
+	ghes_edac_report_mem_error(sev, mem_err);
+	arch_apei_report_mem_error(sev, mem_err);
+
+#ifdef CONFIG_ACPI_APEI_MEMORY_FAILURE
 	if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
 		return;
 
@@ -569,15 +573,7 @@ static void ghes_do_proc(struct ghes *ghes,
 		if (gdata->validation_bits & CPER_SEC_VALID_FRU_TEXT)
 			fru_text = gdata->fru_text;
 
-		if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
-			struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
-
-			ghes_edac_report_mem_error(sev, mem_err);
-
-			arch_apei_report_mem_error(sev, mem_err);
-			ghes_handle_memory_failure(gdata, sev);
-		}
-		else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
+		if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
 			ghes_handle_aer(gdata);
 		}
 		else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
@@ -1190,11 +1186,25 @@ static int apei_sdei_unregister_ghes(struct ghes *ghes)
 	return sdei_unregister_ghes(ghes);
 }
 
+struct ghes_err_handler_tab {
+	guid_t sec_type;
+	error_handle handle;
+};
+
+static struct ghes_err_handler_tab handler_tab[] = {
+	{
+		.sec_type = CPER_SEC_PLATFORM_MEM,
+		.handle = ghes_handle_memory_failure,
+	},
+	{ /* sentinel */ }
+};
+
 static int ghes_probe(struct platform_device *ghes_dev)
 {
 	struct acpi_hest_generic *generic;
 	struct ghes *ghes = NULL;
 	unsigned long flags;
+	int i;
 
 	int rc = -EINVAL;
 
@@ -1308,9 +1318,20 @@ static int ghes_probe(struct platform_device *ghes_dev)
 
 	ghes_edac_register(ghes, &ghes_dev->dev);
 
-	if (!refcount_read(&ghes_ref_count))
+	if (!refcount_read(&ghes_ref_count)) {
 		refcount_set(&ghes_ref_count, 1);
-	else
+		/* register handler functions for the standard errors.
+		 * This may be done from the corresponding drivers.
+		 */
+		for (i = 0; handler_tab[i].handle; i++) {
+			if (ghes_error_notify_register(handler_tab[i].sec_type,
+						handler_tab[i].handle, NULL)) {
+				ghes_edac_unregister(ghes);
+				platform_set_drvdata(ghes_dev, NULL);
+				goto err;
+			}
+		}
+	} else
 		refcount_inc(&ghes_ref_count);
 
 	/* Handle any pending errors right away */
-- 
1.9.1




* [PATCH RFC 3/4] ACPI: APEI: Add ghes_handle_aer to the new notification method
  2019-08-12 10:11 ` [PATCH RFC 0/4] ACPI: APEI: Add support to notify the vendor specific HW errors Shiju Jose
  2019-08-12 10:11   ` [PATCH RFC 1/4] " Shiju Jose
  2019-08-12 10:11   ` [PATCH RFC 2/4] ACPI: APEI: Add ghes_handle_memory_failure to the new notification method Shiju Jose
@ 2019-08-12 10:11   ` " Shiju Jose
  2019-08-12 10:11   ` [PATCH RFC 4/4] ACPI: APEI: Add log_arm_hw_error " Shiju Jose
  2019-08-21 17:22   ` [PATCH RFC 0/4] ACPI: APEI: Add support to notify the vendor specific HW errors James Morse
  4 siblings, 0 replies; 33+ messages in thread
From: Shiju Jose @ 2019-08-12 10:11 UTC (permalink / raw)
  To: linux-acpi, linux-edac, linux-kernel, rjw, lenb, james.morse,
	tony.luck, bp, baicar
  Cc: linuxarm, jonathan.cameron, tanxiaofei, Shiju Jose

This patch adds ghes_handle_aer to the new error notification method.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 drivers/acpi/apei/ghes.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 4400d56..ffc309c 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -450,7 +450,8 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
  * GHES_SEV_PANIC does not make it to this handling since the kernel must
  *     panic.
  */
-static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
+static void ghes_handle_aer(struct acpi_hest_generic_data *gdata,
+			    int sev, void *data)
 {
 #ifdef CONFIG_ACPI_APEI_PCIEAER
 	struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
@@ -573,10 +574,7 @@ static void ghes_do_proc(struct ghes *ghes,
 		if (gdata->validation_bits & CPER_SEC_VALID_FRU_TEXT)
 			fru_text = gdata->fru_text;
 
-		if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
-			ghes_handle_aer(gdata);
-		}
-		else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
+		if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
 			struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
 
 			log_arm_hw_error(err);
@@ -1196,6 +1194,10 @@ struct ghes_err_handler_tab {
 		.sec_type = CPER_SEC_PLATFORM_MEM,
 		.handle = ghes_handle_memory_failure,
 	},
+	{
+		.sec_type = CPER_SEC_PCIE,
+		.handle = ghes_handle_aer,
+	},
 	{ /* sentinel */ }
 };
 
-- 
1.9.1



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH RFC 4/4] ACPI: APEI: Add log_arm_hw_error to the new notification method
  2019-08-12 10:11 ` [PATCH RFC 0/4] ACPI: APEI: Add support to notify the vendor specific HW errors Shiju Jose
                     ` (2 preceding siblings ...)
  2019-08-12 10:11   ` [PATCH RFC 3/4] ACPI: APEI: Add ghes_handle_aer " Shiju Jose
@ 2019-08-12 10:11   ` " Shiju Jose
  2019-08-21 17:22   ` [PATCH RFC 0/4] ACPI: APEI: Add support to notify the vendor specific HW errors James Morse
  4 siblings, 0 replies; 33+ messages in thread
From: Shiju Jose @ 2019-08-12 10:11 UTC (permalink / raw)
  To: linux-acpi, linux-edac, linux-kernel, rjw, lenb, james.morse,
	tony.luck, bp, baicar
  Cc: linuxarm, jonathan.cameron, tanxiaofei, Shiju Jose

This patch adds log_arm_hw_error to the new error notification
method.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 drivers/acpi/apei/ghes.c | 47 ++++++++++++++++++++++-------------------------
 drivers/ras/ras.c        |  5 ++++-
 include/linux/ras.h      |  7 +++++--
 3 files changed, 31 insertions(+), 28 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index ffc309c..013fea0 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -574,34 +574,27 @@ static void ghes_do_proc(struct ghes *ghes,
 		if (gdata->validation_bits & CPER_SEC_VALID_FRU_TEXT)
 			fru_text = gdata->fru_text;
 
-		if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
-			struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
-
-			log_arm_hw_error(err);
-		} else {
-			rcu_read_lock();
-			list_for_each_entry_rcu(err_notify,
-						&ghes_error_notify_list, list) {
-				if (guid_equal(&err_notify->sec_type,
-					       sec_type)) {
-					/* The notification is called in the
-					 * interrupt context, thus the handler
-					 * functions should be take care of it.
-					 */
-					err_notify->handle(gdata, sev,
-							   err_notify->data);
-					is_notify = 1;
-				}
+		rcu_read_lock();
+		list_for_each_entry_rcu(err_notify, &ghes_error_notify_list,
+					list) {
+			if (guid_equal(&err_notify->sec_type, sec_type)) {
+				/* The notification is called in the
+				 * interrupt context, thus the handler
+				 * functions should be take care of it.
+				 */
+				err_notify->handle(gdata, sev,
+						   err_notify->data);
+				is_notify = 1;
 			}
-			rcu_read_unlock();
+		}
+		rcu_read_unlock();
 
-			if (!is_notify) {
-				void *err = acpi_hest_get_payload(gdata);
+		if (!is_notify) {
+			void *err = acpi_hest_get_payload(gdata);
 
-				log_non_standard_event(sec_type, fru_id,
-						       fru_text, sec_sev, err,
-						       gdata->error_data_length);
-			}
+			log_non_standard_event(sec_type, fru_id,
+					       fru_text, sec_sev, err,
+					       gdata->error_data_length);
 		}
 	}
 }
@@ -1198,6 +1191,10 @@ struct ghes_err_handler_tab {
 		.sec_type = CPER_SEC_PCIE,
 		.handle = ghes_handle_aer,
 	},
+	{
+		.sec_type = CPER_SEC_PROC_ARM,
+		.handle = log_arm_hw_error,
+	},
 	{ /* sentinel */ }
 };
 
diff --git a/drivers/ras/ras.c b/drivers/ras/ras.c
index 95540ea..7ec3eeb 100644
--- a/drivers/ras/ras.c
+++ b/drivers/ras/ras.c
@@ -21,8 +21,11 @@ void log_non_standard_event(const guid_t *sec_type, const guid_t *fru_id,
 	trace_non_standard_event(sec_type, fru_id, fru_text, sev, err, len);
 }
 
-void log_arm_hw_error(struct cper_sec_proc_arm *err)
+void log_arm_hw_error(struct acpi_hest_generic_data *gdata,
+		      int sev, void *data)
 {
+	struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
+
 	trace_arm_event(err);
 }
 
diff --git a/include/linux/ras.h b/include/linux/ras.h
index 7c3debb..05b662d 100644
--- a/include/linux/ras.h
+++ b/include/linux/ras.h
@@ -5,6 +5,7 @@
 #include <asm/errno.h>
 #include <linux/uuid.h>
 #include <linux/cper.h>
+#include <acpi/ghes.h>
 
 #ifdef CONFIG_DEBUG_FS
 int ras_userspace_consumers(void);
@@ -29,7 +30,8 @@ static inline void __init cec_init(void)	{ }
 void log_non_standard_event(const guid_t *sec_type,
 			    const guid_t *fru_id, const char *fru_text,
 			    const u8 sev, const u8 *err, const u32 len);
-void log_arm_hw_error(struct cper_sec_proc_arm *err);
+void log_arm_hw_error(struct acpi_hest_generic_data *gdata,
+		      int sev, void *data);
 #else
 static inline void
 log_non_standard_event(const guid_t *sec_type,
@@ -37,7 +39,8 @@ void log_non_standard_event(const guid_t *sec_type,
 		       const u8 sev, const u8 *err, const u32 len)
 { return; }
 static inline void
-log_arm_hw_error(struct cper_sec_proc_arm *err) { return; }
+log_arm_hw_error(struct acpi_hest_generic_data *gdata,
+		 int sev, void *data) { return; }
 #endif
 
 #endif /* __RAS_H__ */
-- 
1.9.1



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC 0/4] ACPI: APEI: Add support to notify the vendor specific HW errors
  2019-08-12 10:11 ` [PATCH RFC 0/4] ACPI: APEI: Add support to notify the vendor specific HW errors Shiju Jose
                     ` (3 preceding siblings ...)
  2019-08-12 10:11   ` [PATCH RFC 4/4] ACPI: APEI: Add log_arm_hw_error " Shiju Jose
@ 2019-08-21 17:22   ` James Morse
  2019-08-22 16:56     ` Shiju Jose
  4 siblings, 1 reply; 33+ messages in thread
From: James Morse @ 2019-08-21 17:22 UTC (permalink / raw)
  To: Shiju Jose
  Cc: linux-acpi, linux-edac, linux-kernel, rjw, lenb, tony.luck, bp,
	baicar, linuxarm, jonathan.cameron, tanxiaofei

Hi,

On 12/08/2019 11:11, Shiju Jose wrote:
> Presently kernel does not support reporting the vendor specific HW errors,
> in the non-standard format, to the vendor drivers for the recovery.

'non standard' here is probably a little jarring to the casual reader. You're referring to
the UEFI spec's "N.2.3 Non-standard Section Body", which refers to any section type
published somewhere other than the UEFI spec.

These still have to have a GUID to identify them, so they still have the same section
header format.


> This patch set add this support and also move the existing handler
> functions for the standard errors to the new callback method.

Could you give an example of where this would be useful? You're adding an API with no
caller to justify its existence.


GUIDs should only belong to one driver.

I don't think we should call drivers for something described as a fatal error. (which is
the case with what you have here)


> Also the CCIX RAS patches could be move to the proposed callback method.

Presumably for any vendor-specific stuff?


Thanks,

James

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC 2/4] ACPI: APEI: Add ghes_handle_memory_failure to the new notification method
  2019-08-12 10:11   ` [PATCH RFC 2/4] ACPI: APEI: Add ghes_handle_memory_failure to the new notification method Shiju Jose
@ 2019-08-21 17:22     ` James Morse
  2019-08-22 16:57       ` Shiju Jose
  0 siblings, 1 reply; 33+ messages in thread
From: James Morse @ 2019-08-21 17:22 UTC (permalink / raw)
  To: Shiju Jose
  Cc: linux-acpi, linux-edac, linux-kernel, rjw, lenb, tony.luck, bp,
	baicar, linuxarm, jonathan.cameron, tanxiaofei

Hi,

On 12/08/2019 11:11, Shiju Jose wrote:
> This patch adds ghes_handle_memory_failure to the new error
> notification method.

The commit message doesn't answer the question: why?

The existing code works. This just looks like additional churn.
Given a user, I think the vendor specific example is useful. I don't think making this
thing more pluggable is a good idea.


Thanks,

James

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH RFC 1/4] ACPI: APEI: Add support to notify the vendor specific HW errors
  2019-08-12 10:11   ` [PATCH RFC 1/4] " Shiju Jose
@ 2019-08-21 17:23     ` James Morse
  2019-08-22 16:57       ` Shiju Jose
  0 siblings, 1 reply; 33+ messages in thread
From: James Morse @ 2019-08-21 17:23 UTC (permalink / raw)
  To: Shiju Jose
  Cc: linux-acpi, linux-edac, linux-kernel, rjw, lenb, tony.luck, bp,
	baicar, linuxarm, jonathan.cameron, tanxiaofei

Hi,

On 12/08/2019 11:11, Shiju Jose wrote:
> Presently the vendor specific HW errors, in the non-standard format,
> are not reported to the vendor drivers for the recovery.
> 
> This patch adds support to notify the vendor specific HW errors to the
> registered kernel drivers.

> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index a66e00f..374d197 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -477,6 +477,77 @@ static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
>  #endif
>  }
>  
> +struct ghes_error_notify {
> +	struct list_head list;
> +	struct rcu_head	rcu_head;
> +	guid_t sec_type; /* guid of the error record */

> +	error_handle handle; /* error handler function */

ghes_error_handler_t error_handler; ?


> +	void *data; /* handler driver's private data if any */
> +};
> +
> +/* List to store the registered error handling functions */
> +static DEFINE_MUTEX(ghes_error_notify_mutex);
> +static LIST_HEAD(ghes_error_notify_list);

> +static refcount_t ghes_ref_count;

I don't think this refcount is needed.


> +/**
> + * ghes_error_notify_register - register an error handling function
> + * for the hw errors.
> + * @sec_type: sec_type of the corresponding CPER to be notified.
> + * @handle: pointer to the error handling function.
> + * @data: handler driver's private data.
> + *
> + * return 0 : SUCCESS, non-zero : FAIL
> + */
> +int ghes_error_notify_register(guid_t sec_type, error_handle handle, void *data)
> +{
> +	struct ghes_error_notify *err_notify;
> +
> +	mutex_lock(&ghes_error_notify_mutex);
> +	err_notify = kzalloc(sizeof(*err_notify), GFP_KERNEL);
> +	if (!err_notify)
> +		return -ENOMEM;

Leaving the mutex locked.
You may as well allocate the memory before taking the lock.
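
A minimal user-space sketch of the ordering suggested above, not the patch itself. It assumes a pthread mutex in place of the kernel mutex, calloc() in place of kzalloc(), a plain singly linked list in place of the RCU list, and a 16-byte array in place of guid_t. The names notify_register(), struct notify_entry, handler_fn and noop_handler() are hypothetical and only illustrate the point that no error path returns with the lock held.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define GUID_LEN 16

/* Hypothetical stand-in for the patch's error_handle typedef. */
typedef void (*handler_fn)(const void *payload, int sev, void *data);

struct notify_entry {
	struct notify_entry *next;
	unsigned char sec_type[GUID_LEN];
	handler_fn handler;
	void *data;
};

static pthread_mutex_t notify_lock = PTHREAD_MUTEX_INITIALIZER;
static struct notify_entry *notify_head;

/* Example handler, present only so the sketch can be exercised. */
void noop_handler(const void *payload, int sev, void *data)
{
	(void)payload;
	(void)sev;
	(void)data;
}

int notify_register(const unsigned char *sec_type, handler_fn handler,
		    void *data)
{
	struct notify_entry *e;

	if (!sec_type || !handler)
		return -EINVAL;

	/* Allocate and fill the entry before taking the lock ... */
	e = calloc(1, sizeof(*e));
	if (!e)
		return -ENOMEM;	/* ... so this early return holds no lock. */

	memcpy(e->sec_type, sec_type, GUID_LEN);
	e->handler = handler;
	e->data = data;

	/* The critical section shrinks to the list insertion itself. */
	pthread_mutex_lock(&notify_lock);
	e->next = notify_head;
	notify_head = e;
	pthread_mutex_unlock(&notify_lock);

	return 0;
}
```

Because the lock is taken only after every failure check has passed, a second registration cannot block behind a failed first one.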


> +
> +	err_notify->handle = handle;
> +	guid_copy(&err_notify->sec_type, &sec_type);
> +	err_notify->data = data;
> +	list_add_rcu(&err_notify->list, &ghes_error_notify_list);
> +	mutex_unlock(&ghes_error_notify_mutex);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(ghes_error_notify_register);

Could we leave exporting this to modules until there is a user?


> +/**
> + * ghes_error_notify_unregister - unregister an error handling function.
> + * @sec_type: sec_type of the corresponding CPER.
> + * @handle: pointer to the error handling function.
> + *
> + * return none.
> + */
> +void ghes_error_notify_unregister(guid_t sec_type, error_handle handle)

Why do we need the handle(r) a second time? Surely there can only be one callback for a
given guid.


> +{
> +	struct ghes_error_notify *err_notify;
> +	bool found = 0;
> +
> +	mutex_lock(&ghes_error_notify_mutex);
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(err_notify, &ghes_error_notify_list, list) {
> +		if (guid_equal(&err_notify->sec_type, &sec_type) &&
> +		    err_notify->handle == handle) {
> +			list_del_rcu(&err_notify->list);
> +			found = 1;
> +			break;
> +		}
> +	}
> +	rcu_read_unlock();

> +	synchronize_rcu();

Is this for the kfree()? Please keep them together so it's obvious what it's for.
Putting it outside the mutex will also save any contended waiter some time.
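
A user-space sketch of that shape, under the same kind of assumptions as any such model: a pthread mutex and a plain linked list stand in for the kernel mutex and the RCU list, and free() stands in for kfree(). There is no RCU here, so the comment only marks where synchronize_rcu() would sit in the kernel version. notify_unregister(), struct notify_entry and sample_handler() are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define GUID_LEN 16

typedef void (*handler_fn)(const void *payload, int sev, void *data);

struct notify_entry {
	struct notify_entry *next;
	unsigned char sec_type[GUID_LEN];
	handler_fn handler;
	void *data;
};

static pthread_mutex_t notify_lock = PTHREAD_MUTEX_INITIALIZER;
static struct notify_entry *notify_head;

/* Example handler, present only so the sketch can be exercised. */
void sample_handler(const void *payload, int sev, void *data)
{
	(void)payload;
	(void)sev;
	(void)data;
}

/* Returns 1 if a matching entry was removed, 0 otherwise. */
int notify_unregister(const unsigned char *sec_type, handler_fn handler)
{
	struct notify_entry **pp, *victim = NULL;

	/* Only the unlink happens under the lock. */
	pthread_mutex_lock(&notify_lock);
	for (pp = &notify_head; *pp; pp = &(*pp)->next) {
		if (!memcmp((*pp)->sec_type, sec_type, GUID_LEN) &&
		    (*pp)->handler == handler) {
			victim = *pp;
			*pp = victim->next;
			break;
		}
	}
	pthread_mutex_unlock(&notify_lock);

	if (!victim)
		return 0;

	/*
	 * Kernel version: synchronize_rcu() would go here, directly
	 * before the free and after the mutex has been dropped, so
	 * waiters on the mutex do not sit through the grace period.
	 */
	free(victim);
	return 1;
}
```

Nothing is freed, and no grace period would be waited for, when no entry matched.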


> +	mutex_unlock(&ghes_error_notify_mutex);
> +	if (found)
> +		kfree(err_notify);
> +}
> +EXPORT_SYMBOL_GPL(ghes_error_notify_unregister);
> +

>  static void ghes_do_proc(struct ghes *ghes,
>  			 const struct acpi_hest_generic_status *estatus)
>  {
> @@ -512,11 +585,29 @@ static void ghes_do_proc(struct ghes *ghes,
>  
>  			log_arm_hw_error(err);
>  		} else {
> -			void *err = acpi_hest_get_payload(gdata);
> -
> -			log_non_standard_event(sec_type, fru_id, fru_text,
> -					       sec_sev, err,
> -					       gdata->error_data_length);

> +			rcu_read_lock();
> +			list_for_each_entry_rcu(err_notify,
> +						&ghes_error_notify_list, list) {
> +				if (guid_equal(&err_notify->sec_type,
> +					       sec_type)) {

> +					/* The notification is called in the
> +					 * interrupt context, thus the handler
> +					 * functions should be take care of it.
> +					 */

I read this as "the handler will be called", which doesn't seem to be a useful comment.


> +					err_notify->handle(gdata, sev,
> +							   err_notify->data);
> +					is_notify = 1;

					break;

> +				}
> +			}
> +			rcu_read_unlock();

> +			if (!is_notify) {

if (!found) Seems more natural.


> +				void *err = acpi_hest_get_payload(gdata);
> +
> +				log_non_standard_event(sec_type, fru_id,
> +						       fru_text, sec_sev, err,
> +						       gdata->error_data_length);
> +			}

This is tricky to read as it's so bunched up. Please pull it out into a separate function.
ghes_handle_non_standard_event() ?


Because you skip log_non_standard_event(), rasdaemon will no longer see these in
user-space. For any kernel consumer of these, we need to know we aren't breaking the
user-space component.
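
One possible reading of the two points above, as a user-space sketch: a separate helper that always emits the log record (so a user-space consumer still sees the event) and then calls at most one matching handler. This is an assumption about how the concerns could be reconciled, not something the patch does. handle_non_standard_event(), log_event(), count_calls() and the table layout are hypothetical; log_event() stands in for log_non_standard_event().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define GUID_LEN 16

typedef void (*handler_fn)(const void *payload, size_t len, int sev,
			   void *data);

struct notify_entry {
	unsigned char sec_type[GUID_LEN];
	handler_fn handler;
	void *data;
};

static unsigned int logged_events;

/* Stand-in for the trace event that user space consumes. */
static void log_event(const unsigned char *sec_type, size_t len)
{
	logged_events++;
	printf("non-standard event: type %02x.., %zu bytes\n",
	       sec_type[0], len);
}

/* Example handler: counts how often it was invoked via @data. */
void count_calls(const void *payload, size_t len, int sev, void *data)
{
	(void)payload;
	(void)len;
	(void)sev;
	(*(int *)data)++;
}

/* Returns true if a registered handler consumed the section. */
bool handle_non_standard_event(const struct notify_entry *tab, size_t n,
			       const unsigned char *sec_type,
			       const void *payload, size_t len, int sev)
{
	bool found = false;
	size_t i;

	/* Log unconditionally so the user-space view does not change. */
	log_event(sec_type, len);

	for (i = 0; i < n; i++) {
		if (!memcmp(tab[i].sec_type, sec_type, GUID_LEN)) {
			tab[i].handler(payload, len, sev, tab[i].data);
			found = true;
			break;	/* one handler per GUID */
		}
	}

	return found;
}
```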


>  		}
>  	}
>  }
> @@ -1217,6 +1308,11 @@ static int ghes_probe(struct platform_device *ghes_dev)
>  
>  	ghes_edac_register(ghes, &ghes_dev->dev);
>  
> +	if (!refcount_read(&ghes_ref_count))
> +		refcount_set(&ghes_ref_count, 1);

What stops this from racing with itself if two ghes platform devices are probed at the
same time?

If the refcount needs initialising, please do it in ghes_init()....
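
To illustrate the race: a separate read followed by a set lets two callers both observe zero. A user-space sketch of a check that cannot be split, assuming C11 atomics rather than the kernel's refcount_t; probe_device() and register_standard_handlers() are hypothetical names. Note the limitation in the comment: this only guarantees a single winner, which is one more reason a one-time init function is the simpler place for such set-up.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_flag handlers_registered = ATOMIC_FLAG_INIT;
static int registration_runs;

/* Stand-in for the one-time registration of standard handlers. */
static void register_standard_handlers(void)
{
	registration_runs++;
}

/* Returns 1 if this call performed the one-time set-up, 0 otherwise. */
int probe_device(void)
{
	/*
	 * Test and set happen as one atomic step, so exactly one
	 * caller sees the flag clear. A caller that loses may still
	 * continue before the winner has finished registering; doing
	 * the set-up once in an init function avoids that window.
	 */
	if (!atomic_flag_test_and_set(&handlers_registered)) {
		register_standard_handlers();
		return 1;
	}

	return 0;
}
```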

> +	else
> +		refcount_inc(&ghes_ref_count);

.. but I don't think this refcount is needed.


>  	/* Handle any pending errors right away */
>  	spin_lock_irqsave(&ghes_notify_lock_irq, flags);
>  	ghes_proc(ghes);

> @@ -1279,6 +1376,17 @@ static int ghes_remove(struct platform_device *ghes_dev)
>  
>  	ghes_fini(ghes);
>  
> +	if (refcount_dec_and_test(&ghes_ref_count) &&
> +	    !list_empty(&ghes_error_notify_list)) {
> +		mutex_lock(&ghes_error_notify_mutex);
> +		list_for_each_entry_safe(err_notify, tmp,
> +					 &ghes_error_notify_list, list) {
> +			list_del_rcu(&err_notify->list);
> +			kfree_rcu(err_notify, rcu_head);
> +		}
> +		mutex_unlock(&ghes_error_notify_mutex);
> +	}

... If someone unregisters, and re-registers all the GHES platform devices, the last one
out flushes the vendor-specific error handlers away. Then we re-probe the devices again,
but this time the vendor-specific error handlers don't work.

As you have an add/remove API for drivers, it's up to drivers to clean up when they are
removed. The comings and goings of GHES platform devices aren't relevant.


>  	ghes_edac_unregister(ghes);
>  
>  	kfree(ghes);
> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
> index e3f1cdd..d480537 100644
> --- a/include/acpi/ghes.h
> +++ b/include/acpi/ghes.h
> @@ -50,6 +50,53 @@ enum {
>  	GHES_SEV_PANIC = 0x3,
>  };
>  
> +/**
> + * error_handle - error handling function for the hw errors.

Fatal errors get dealt with earlier, so drivers will never see them.
| error handling function for non-fatal hardware errors.


> + * This handle function is called in the interrupt context.

As this overrides ghes's logging of the error, we should mention:
| The handler is responsible for any logging of the error.


> + * @gdata: acpi_hest_generic_data.
> + * @sev: error severity of the entire error event defined in the
> + * ACPI spec table generic error status block.
> + * @data: handler driver's private data.
> + *
> + * return : none.
> + */
> +typedef void (*error_handle)(struct acpi_hest_generic_data *gdata, int sev,
> +			     void *data);


Thanks,

James

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH RFC 0/4] ACPI: APEI: Add support to notify the vendor specific HW errors
  2019-08-21 17:22   ` [PATCH RFC 0/4] ACPI: APEI: Add support to notify the vendor specific HW errors James Morse
@ 2019-08-22 16:56     ` Shiju Jose
  2019-10-03 17:21       ` James Morse
  0 siblings, 1 reply; 33+ messages in thread
From: Shiju Jose @ 2019-08-22 16:56 UTC (permalink / raw)
  To: James Morse
  Cc: linux-acpi, linux-edac, linux-kernel, rjw, lenb, tony.luck, bp,
	baicar, Linuxarm, Jonathan Cameron, tanxiaofei

Hi James, 

Thanks for the feedback.

>-----Original Message-----
>From: linux-acpi-owner@vger.kernel.org [mailto:linux-acpi-
>owner@vger.kernel.org] On Behalf Of James Morse
>Sent: 21 August 2019 18:23
>To: Shiju Jose <shiju.jose@huawei.com>
>Cc: linux-acpi@vger.kernel.org; linux-edac@vger.kernel.org; linux-
>kernel@vger.kernel.org; rjw@rjwysocki.net; lenb@kernel.org;
>tony.luck@intel.com; bp@alien8.de; baicar@os.amperecomputing.com;
>Linuxarm <linuxarm@huawei.com>; Jonathan Cameron
><jonathan.cameron@huawei.com>; tanxiaofei <tanxiaofei@huawei.com>
>Subject: Re: [PATCH RFC 0/4] ACPI: APEI: Add support to notify the vendor
>specific HW errors
>
>Hi,
>
>On 12/08/2019 11:11, Shiju Jose wrote:
>> Presently kernel does not support reporting the vendor specific HW
>> errors, in the non-standard format, to the vendor drivers for the recovery.
>
>'non standard' here is probably a little jarring to the casual reader. You're
>referring to the UEFI spec's "N.2.3 Non-standard Section Body", which refers to
>any section type published somewhere other than the UEFI spec.
OK. I will change it.  
>
>These still have to have a GUID to identify them, so they still have the same
>section header format.
Yes. 
 
>
>
>> This patch set add this support and also move the existing handler
>> functions for the standard errors to the new callback method.
>
>Could you give an example of where this would be useful? You're adding an API
>with no caller to justify its existence.
One such example is handling local errors that occur in a device controller, such as PCIe.

>
>
>GUIDs should only belong to one driver.
The UEFI spec's "N.2.3 Non-standard Section Body" says: "The type (e.g. format) of a non-standard section is identified by the GUID populated in the Section Descriptor's Section Type field."
It is possible to define a common non-standard error section format that is used by more than one driver, if the error data to be reported has the same format. In that case, can the same GUID belong to multiple drivers?

>
>I don't think we should call drivers for something described as a fatal error.
>(which is the case with what you have here)
The notification is intended only for recoverable errors, since ghes_proc() panics on fatal errors at an early stage.

>
>
>> Also the CCIX RAS patches could be move to the proposed callback method.
>
>Presumably for any vendor-specific stuff?
This was related to the proposal to replace the chain of if (guid_equal(...)) else if (guid_equal(...)) checks in ghes_do_proc() for the existing UEFI-spec-defined error sections (such as PCIe, memory, and ARM HW errors) by registering the corresponding handler functions with the proposed notification method. The same applies to the CCIX error sections and to any other error sections the UEFI spec defines in the future.
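
A small user-space sketch of the table-driven idea described here, assuming a sentinel-terminated static table like the handler_tab[] in the patches. The byte patterns below are placeholders, NOT the real CPER section GUIDs, and dispatch_section() and the handle_*() functions are hypothetical names that return distinct values only so the dispatch can be observed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define GUID_LEN 16

typedef int (*section_handler)(const void *payload);

/* Placeholder byte patterns, not the real section GUIDs. */
static const unsigned char SEC_MEM[GUID_LEN]  = { 0x01 };
static const unsigned char SEC_PCIE[GUID_LEN] = { 0x02 };
static const unsigned char SEC_ARM[GUID_LEN]  = { 0x03 };

static int handle_mem(const void *payload)  { (void)payload; return 1; }
static int handle_pcie(const void *payload) { (void)payload; return 2; }
static int handle_arm(const void *payload)  { (void)payload; return 3; }

struct handler_tab {
	const unsigned char *sec_type;
	section_handler handle;
};

/* Adding a section type means adding one row, not one more else-if. */
static const struct handler_tab handler_tab[] = {
	{ SEC_MEM,  handle_mem },
	{ SEC_PCIE, handle_pcie },
	{ SEC_ARM,  handle_arm },
	{ NULL, NULL }	/* sentinel */
};

/* Returns the handler's result, or 0 if no row matches. */
int dispatch_section(const unsigned char *sec_type, const void *payload)
{
	size_t i;

	for (i = 0; handler_tab[i].handle; i++) {
		if (!memcmp(handler_tab[i].sec_type, sec_type, GUID_LEN))
			return handler_tab[i].handle(payload);
	}

	return 0;
}
```

The lookup is still a linear scan, so the gain over the if/else chain is in maintenance rather than speed.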

>
>
>Thanks,
>
>James

Thanks,
Shiju

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH RFC 2/4] ACPI: APEI: Add ghes_handle_memory_failure to the new notification method
  2019-08-21 17:22     ` James Morse
@ 2019-08-22 16:57       ` Shiju Jose
  0 siblings, 0 replies; 33+ messages in thread
From: Shiju Jose @ 2019-08-22 16:57 UTC (permalink / raw)
  To: James Morse
  Cc: linux-acpi, linux-edac, linux-kernel, rjw, lenb, tony.luck, bp,
	baicar, Linuxarm, Jonathan Cameron, tanxiaofei

Hi James,

>-----Original Message-----
>From: James Morse [mailto:james.morse@arm.com]
>Sent: 21 August 2019 18:23
>To: Shiju Jose <shiju.jose@huawei.com>
>Cc: linux-acpi@vger.kernel.org; linux-edac@vger.kernel.org; linux-
>kernel@vger.kernel.org; rjw@rjwysocki.net; lenb@kernel.org;
>tony.luck@intel.com; bp@alien8.de; baicar@os.amperecomputing.com;
>Linuxarm <linuxarm@huawei.com>; Jonathan Cameron
><jonathan.cameron@huawei.com>; tanxiaofei <tanxiaofei@huawei.com>
>Subject: Re: [PATCH RFC 2/4] ACPI: APEI: Add ghes_handle_memory_failure to
>the new notification method
>
>Hi,
>
>On 12/08/2019 11:11, Shiju Jose wrote:
>> This patch adds ghes_handle_memory_failure to the new error
>> notification method.
>
>The commit message doesn't answer the question: why?
>
>The existing code works. This just looks like additional churn.
>Given a user, I think the vendor specific example is useful. I don't think making
>this thing more pluggable is a good idea.
This was intended to replace the chain of if (guid_equal(...)) else if (guid_equal(...)) checks in ghes_do_proc(), which would grow as new UEFI-defined error sections are added in the future.
>
>
>Thanks,
>
>James

Thanks,
Shiju

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH RFC 1/4] ACPI: APEI: Add support to notify the vendor specific HW errors
  2019-08-21 17:23     ` James Morse
@ 2019-08-22 16:57       ` Shiju Jose
  0 siblings, 0 replies; 33+ messages in thread
From: Shiju Jose @ 2019-08-22 16:57 UTC (permalink / raw)
  To: James Morse
  Cc: linux-acpi, linux-edac, linux-kernel, rjw, lenb, tony.luck, bp,
	baicar, Linuxarm, Jonathan Cameron, tanxiaofei

Hi James,

>-----Original Message-----
>From: James Morse [mailto:james.morse@arm.com]
>Sent: 21 August 2019 18:24
>To: Shiju Jose <shiju.jose@huawei.com>
>Cc: linux-acpi@vger.kernel.org; linux-edac@vger.kernel.org; linux-
>kernel@vger.kernel.org; rjw@rjwysocki.net; lenb@kernel.org;
>tony.luck@intel.com; bp@alien8.de; baicar@os.amperecomputing.com;
>Linuxarm <linuxarm@huawei.com>; Jonathan Cameron
><jonathan.cameron@huawei.com>; tanxiaofei <tanxiaofei@huawei.com>
>Subject: Re: [PATCH RFC 1/4] ACPI: APEI: Add support to notify the vendor
>specific HW errors
>
>Hi,
>
>On 12/08/2019 11:11, Shiju Jose wrote:
>> Presently the vendor specific HW errors, in the non-standard format,
>> are not reported to the vendor drivers for the recovery.
>>
>> This patch adds support to notify the vendor specific HW errors to the
>> registered kernel drivers.
>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index a66e00f..374d197 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -477,6 +477,77 @@ static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
>>  #endif
>>  }
>>
>> +struct ghes_error_notify {
>> +	struct list_head list;
>> +	struct rcu_head	rcu_head;
>> +	guid_t sec_type; /* guid of the error record */
>
>> +	error_handle handle; /* error handler function */
>
>ghes_error_handler_t error_handler; ?
Sure.

>
>
>> +	void *data; /* handler driver's private data if any */
>> +};
>> +
>> +/* List to store the registered error handling functions */
>> +static DEFINE_MUTEX(ghes_error_notify_mutex);
>> +static LIST_HEAD(ghes_error_notify_list);
>
>> +static refcount_t ghes_ref_count;
>
>I don't think this refcount is needed.
The refcount was added so that the standard error handlers are registered with this notification method only once when
multiple ghes platform devices are probed.
 
>
>
>> +/**
>> + * ghes_error_notify_register - register an error handling function
>> + * for the hw errors.
>> + * @sec_type: sec_type of the corresponding CPER to be notified.
>> + * @handle: pointer to the error handling function.
>> + * @data: handler driver's private data.
>> + *
>> + * return 0 : SUCCESS, non-zero : FAIL
>> + */
>> +int ghes_error_notify_register(guid_t sec_type, error_handle handle, void *data)
>> +{
>> +	struct ghes_error_notify *err_notify;
>> +
>> +	mutex_lock(&ghes_error_notify_mutex);
>> +	err_notify = kzalloc(sizeof(*err_notify), GFP_KERNEL);
>> +	if (!err_notify)
>> +		return -ENOMEM;
>
>Leaving the mutex locked.
>You may as well allocate the memory before taking the lock.
Good spot. I will fix.

>
>
>> +
>> +	err_notify->handle = handle;
>> +	guid_copy(&err_notify->sec_type, &sec_type);
>> +	err_notify->data = data;
>> +	list_add_rcu(&err_notify->list, &ghes_error_notify_list);
>> +	mutex_unlock(&ghes_error_notify_mutex);
>> +
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(ghes_error_notify_register);
>
>Could we leave exporting this to modules until there is a user?
>
>
>> +/**
>> + * ghes_error_notify_unregister - unregister an error handling function.
>> + * @sec_type: sec_type of the corresponding CPER.
>> + * @handle: pointer to the error handling function.
>> + *
>> + * return none.
>> + */
>> +void ghes_error_notify_unregister(guid_t sec_type, error_handle handle)
>
>Why do we need the handle(r) a second time? Surely there can only be one
>callback for a given guid.
The GUID could be shared between drivers if the non-standard error section format is common
to more than one device, that is, if the error data to be reported is in the same format.
 
>
>
>> +{
>> +	struct ghes_error_notify *err_notify;
>> +	bool found = 0;
>> +
>> +	mutex_lock(&ghes_error_notify_mutex);
>> +	rcu_read_lock();
>> +	list_for_each_entry_rcu(err_notify, &ghes_error_notify_list, list) {
>> +		if (guid_equal(&err_notify->sec_type, &sec_type) &&
>> +		    err_notify->handle == handle) {
>> +			list_del_rcu(&err_notify->list);
>> +			found = 1;
>> +			break;
>> +		}
>> +	}
>> +	rcu_read_unlock();
>
>> +	synchronize_rcu();
>
>Is this for the kfree()? Please keep them together so it's obvious what it's for.
>Putting it outside the mutex will also save any contended waiter some time.
Yes. I will move synchronize_rcu() to just before kfree().
 
>
>
>> +	mutex_unlock(&ghes_error_notify_mutex);
>> +	if (found)
>> +		kfree(err_notify);
>> +}
>> +EXPORT_SYMBOL_GPL(ghes_error_notify_unregister);
>> +
>
>>  static void ghes_do_proc(struct ghes *ghes,
>>  			 const struct acpi_hest_generic_status *estatus)
>>  {
>> @@ -512,11 +585,29 @@ static void ghes_do_proc(struct ghes *ghes,
>>
>>  			log_arm_hw_error(err);
>>  		} else {
>> -			void *err = acpi_hest_get_payload(gdata);
>> -
>> -			log_non_standard_event(sec_type, fru_id, fru_text,
>> -					       sec_sev, err,
>> -					       gdata->error_data_length);
>
>> +			rcu_read_lock();
>> +			list_for_each_entry_rcu(err_notify,
>> +						&ghes_error_notify_list, list) {
>> +				if (guid_equal(&err_notify->sec_type,
>> +					       sec_type)) {
>
>> +					/* The notification is called in the
>> +					 * interrupt context, thus the handler
>> +					 * functions should be take care of it.
>> +					 */
>
>I read this as "the handler will be called", which doesn't seem to be a useful
>comment.
Ok. I will correct the comment.
>
>
>> +					err_notify->handle(gdata, sev,
>> +							   err_notify->data);
>> +					is_notify = 1;
>
>					break;
>
>> +				}
>> +			}
>> +			rcu_read_unlock();
>
>> +			if (!is_notify) {
>
>if (!found) Seems more natural.
Ok. I will change "is_notify" to "found".

>
>
>> +				void *err = acpi_hest_get_payload(gdata);
>> +
>> +				log_non_standard_event(sec_type, fru_id,
>> +						       fru_text, sec_sev, err,
>> +						       gdata->error_data_length);
>> +			}
>
>This is tricky to read as it's so bunched up. Please pull it out into a separate
>function.
>ghes_handle_non_standard_event() ?
Ok. I will move this into a new ghes_handle_non_standard_event() function.

>
>
>Because you skip log_non_standard_event(), rasdaemon will no longer see
>these in user-space. For any kernel consumer of these, we need to know we
>aren't breaking the user-space component.
>
>
>>  		}
>>  	}
>>  }
>> @@ -1217,6 +1308,11 @@ static int ghes_probe(struct platform_device *ghes_dev)
>>
>>  	ghes_edac_register(ghes, &ghes_dev->dev);
>>
>> +	if (!refcount_read(&ghes_ref_count))
>> +		refcount_set(&ghes_ref_count, 1);
>
>What stops this from racing with itself if two ghes platform devices are probed
>at the same time?
Yes. It is an issue.
>
>If the refcount needs initialising, please do it in ghes_init()....
The refcount was added so that the standard error handlers are registered with the notification
method only the first time, when the ghes device is probed multiple times.
I will check whether the refcount can be avoided by moving the registration
of the standard error handlers to ghes_init().
>
>> +	else
>> +		refcount_inc(&ghes_ref_count);
>
>.. but I don't think this refcount is needed.
>
>
>>  	/* Handle any pending errors right away */
>>  	spin_lock_irqsave(&ghes_notify_lock_irq, flags);
>>  	ghes_proc(ghes);
>
>> @@ -1279,6 +1376,17 @@ static int ghes_remove(struct platform_device *ghes_dev)
>>
>>  	ghes_fini(ghes);
>>
>> +	if (refcount_dec_and_test(&ghes_ref_count) &&
>> +	    !list_empty(&ghes_error_notify_list)) {
>> +		mutex_lock(&ghes_error_notify_mutex);
>> +		list_for_each_entry_safe(err_notify, tmp,
>> +					 &ghes_error_notify_list, list) {
>> +			list_del_rcu(&err_notify->list);
>> +			kfree_rcu(err_notify, rcu_head);
>> +		}
>> +		mutex_unlock(&ghes_error_notify_mutex);
>> +	}
>
>... If someone unregisters, and re-registers all the GHES platform devices, the
>last one out flushes the vendor-specific error handlers away. Then we re-probe
>the devices again, but this time the vendor-specific error handlers don't work.
>
>As you have an add/remove API for drivers, it's up to drivers to clean up when
>they are removed. The comings and goings of GHES platform devices aren't
>relevant.
Ok. Got it. I will either keep the unregister only for the standard error handlers,
if the standard error handlers can be part of the notification method, or remove it completely
if the registration can be done in ghes_init().
>
>
>>  	ghes_edac_unregister(ghes);
>>
>>  	kfree(ghes);
>> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
>> index e3f1cdd..d480537 100644
>> --- a/include/acpi/ghes.h
>> +++ b/include/acpi/ghes.h
>> @@ -50,6 +50,53 @@ enum {
>>  	GHES_SEV_PANIC = 0x3,
>>  };
>>
>> +/**
>> + * error_handle - error handling function for the hw errors.
>
>Fatal errors get dealt with earlier, so drivers will never see them.
>| error handling function for non-fatal hardware errors.
Ok. I will change the comment to say "recoverable HW errors".
>
>
>> + * This handle function is called in the interrupt context.
>
>As this overrides ghes's logging of the error, we should mention:
>| The handler is responsible for any logging of the error.
Ok. I will add in the comment.
>
>
>> + * @gdata: acpi_hest_generic_data.
>> + * @sev: error severity of the entire error event defined in the
>> + * ACPI spec table generic error status block.
>> + * @data: handler driver's private data.
>> + *
>> + * return : none.
>> + */
>> +typedef void (*error_handle)(struct acpi_hest_generic_data *gdata, int sev,
>> +			     void *data);
>
>
>Thanks,
>
>James
Thanks,
Shiju


* Re: [PATCH RFC 0/4] ACPI: APEI: Add support to notify the vendor specific HW errors
  2019-08-22 16:56     ` Shiju Jose
@ 2019-10-03 17:21       ` James Morse
  0 siblings, 0 replies; 33+ messages in thread
From: James Morse @ 2019-10-03 17:21 UTC (permalink / raw)
  To: Shiju Jose
  Cc: linux-acpi, linux-edac, linux-kernel, rjw, lenb, tony.luck, bp,
	baicar, Linuxarm, Jonathan Cameron, tanxiaofei

Hi Shiju,

On 22/08/2019 17:56, Shiju Jose wrote:
> James Morse wrote:
>> On 12/08/2019 11:11, Shiju Jose wrote:
>>> Presently kernel does not support reporting the vendor specific HW
>>> errors, in the non-standard format, to the vendor drivers for the recovery.
>>
>> 'non standard' here is probably a little jarring to the casual reader. You're
>> referring to the UEFI spec's "N.2.3 Non-standard Section Body", which refers to
>> any section type published somewhere other than the UEFI spec.

>>> This patch set add this support and also move the existing handler
>>> functions for the standard errors to the new callback method.
>>
>> Could you give an example of where this would be useful? You're adding an API
>> with no caller to justify its existence.

> One such example is handling the local errors that occurred in a device controller, such as PCIe.

Could we have the example in the form of patches? (sorry, I wasn't clear)

I don't think it's realistic that a PCIe device driver would want to know about errors on
other devices in the system. (SAS-HBA meet the GPU).

PCIe has AER for handling errors that (may have) occurred on a PCIe link, and this has
its own CPER records.


>> GUIDs should only belong to one driver.

> UEFI spec's N.2.3 Non-standard Section Body says, "The type (e.g. format) of a
> non-standard section is identified by the GUID populated in the Section Descriptor's
> Section Type field."
> It is possible to define a common non-standard error section format

I agree the GUID describes the format of the error record,


> which will
> be used for more than one driver if the error data to be reported is in the same format.
> Then can the same GUID belong to multiple drivers?

... but here we disagree.

CPER has a component/block-diagram view of the system. It describes a Memory error or an
error with a PCIe endpoint. An error record affects one component.

If you wanted to describe an error caused by a failed transaction between a PCIe device
and memory, you would need two of these records, and it's guesswork as to what happened
between them.

But the PCIe device has no business poking around in the memory error. Even if it did, APEI
would be the wrong place to do this as it's not the only caller of memory_failure().


>>> Also the CCIX RAS patches could be moved to the proposed callback method.
>>
>> Presumably for any vendor-specific stuff?

> This information was related to the proposal to replace the number of if(guid_equal(...)) else
> if(guid_equal(...)) checks in ghes_do_proc() for the existing UEFI spec defined error
> sections (such as PCIe, Memory, ARM HW error)

'the standard ones'

> by registering the corresponding handler functions to the proposed notification method.

I really don't like this. Registering a handler for 'memory corruption' would require
walking a list of dynamically allocated pointers. Can there be more than one entry? Can
random drivers block memory_failure() while they allocate more memory to send packets over
USB? What if it loops?

For the standard error sources the kernel needs to run 'the' handler as quickly as
possible, with a minimum of code/memory-access in the meantime. It already takes too long.


Thanks,

James


> The same applies to the CCIX error sections and any other
> error sections defined by the UEFI spec in the future.




* [PATCH 0/7] rasdaemon: add fixes, database closure and signal handling
       [not found] <Shiju Jose>
  2019-06-17 14:28 ` [PATCH 0/6] rasdaemon:add logging of HiSilicon HIP08 non-standard H/W errors and changes in the error decoding code Shiju Jose
  2019-08-12 10:11 ` [PATCH RFC 0/4] ACPI: APEI: Add support to notify the vendor specific HW errors Shiju Jose
@ 2019-10-16 16:33 ` Shiju Jose
  2019-10-16 16:33   ` [PATCH 1/7] rasdaemon: fix cleanup issues in ras-events.c:read_ras_event_all_cpus() Shiju Jose
                     ` (7 more replies)
  2019-11-13 16:31 ` [PATCH rasdaemon 0/2] rasdaemon: add fix for the sql table Shiju Jose
  3 siblings, 8 replies; 33+ messages in thread
From: Shiju Jose @ 2019-10-16 16:33 UTC (permalink / raw)
  To: mchehab, linux-edac; +Cc: linuxarm, Shiju Jose

This patch set adds:
1. fixes for some memory leaks and file closure.
2. closure for the sqlite3 database.
3. signal handling for the cleanup.

Shiju Jose (7):
  rasdaemon: fix cleanup issues in
    ras-events.c:read_ras_event_all_cpus()
  rasdaemon: fix memory leak in ras-events.c:handle_ras_events()
  rasdaemon: fix missing fclose in
    ras-events.c:select_tracing_timestamp()
  rasdaemon: fix memory leak in ras-events.c:add_event_handler()
  rasdaemon: delete multiple definitions of ARRAY_SIZE
  rasdaemon: add closure and cleanups for the database
  rasdaemon: add signal handling for the cleanup

 ras-diskerror-handler.c    |   2 -
 ras-events.c               |  88 +++++++++++++++++++++++++++-----
 ras-mce-handler.h          |   3 --
 ras-non-standard-handler.c |  16 ++++++
 ras-non-standard-handler.h |   6 ++-
 ras-record.c               | 123 +++++++++++++++++++++++++++++++++++++++++++--
 ras-record.h               |   5 ++
 7 files changed, 222 insertions(+), 21 deletions(-)

-- 
2.1.4




* [PATCH 1/7] rasdaemon: fix cleanup issues in ras-events.c:read_ras_event_all_cpus()
  2019-10-16 16:33 ` [PATCH 0/7] rasdaemon: add fixes, database closure and signal handling Shiju Jose
@ 2019-10-16 16:33   ` Shiju Jose
  2019-10-16 16:33   ` [PATCH 2/7] rasdaemon: fix memory leak in ras-events.c:handle_ras_events() Shiju Jose
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 33+ messages in thread
From: Shiju Jose @ 2019-10-16 16:33 UTC (permalink / raw)
  To: mchehab, linux-edac; +Cc: linuxarm, Shiju Jose

This patch fixes memory leaks and closes the open files if the
open_trace() or read(fds[i].fd, page, pdata[i].ras->page_size)
calls fail.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 ras-events.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/ras-events.c b/ras-events.c
index 3cdac19..d1773b1 100644
--- a/ras-events.c
+++ b/ras-events.c
@@ -353,6 +353,7 @@ static int read_ras_event_all_cpus(struct pthread_data *pdata,
 	struct pollfd fds[n_cpus];
 	int warnonce[n_cpus];
 	char pipe_raw[PATH_MAX];
+	int legacy_kernel = 0;
 #if 0
 	int need_sleep = 0;
 #endif
@@ -372,6 +373,9 @@ static int read_ras_event_all_cpus(struct pthread_data *pdata,
 		return -ENOMEM;
 	}
 
+	for (i = 0; i < n_cpus; i++)
+		fds[i].fd = -1;
+
 	for (i = 0; i < n_cpus; i++) {
 		fds[i].events = POLLIN;
 
@@ -382,9 +386,7 @@ static int read_ras_event_all_cpus(struct pthread_data *pdata,
 		fds[i].fd = open_trace(pdata[0].ras, pipe_raw, O_RDONLY);
 		if (fds[i].fd < 0) {
 			log(TERM, LOG_ERR, "Can't open trace_pipe_raw\n");
-			kbuffer_free(kbuf);
-			free(page);
-			return -EINVAL;
+			goto error;
 		}
 	}
 
@@ -416,7 +418,7 @@ static int read_ras_event_all_cpus(struct pthread_data *pdata,
 			size = read(fds[i].fd, page, pdata[i].ras->page_size);
 			if (size < 0) {
 				log(TERM, LOG_WARNING, "read\n");
-				return -1;
+				goto error;
 			} else if (size > 0) {
 				kbuffer_load_subbuffer(kbuf, page);
 
@@ -441,6 +443,7 @@ static int read_ras_event_all_cpus(struct pthread_data *pdata,
 		 */
 		if (count_nready == n_cpus) {
 			/* Should only happen with legacy kernels */
+			legacy_kernel = 1;
 			break;
 		}
 #endif
@@ -449,12 +452,18 @@ static int read_ras_event_all_cpus(struct pthread_data *pdata,
 	/* poll() is not supported. We need to fallback to the old way */
 	log(TERM, LOG_INFO,
 	    "Old kernel detected. Stop listening and fall back to pthread way.\n");
+error:
 	kbuffer_free(kbuf);
 	free(page);
-	for (i = 0; i < n_cpus; i++)
-		close(fds[i].fd);
+	for (i = 0; i < n_cpus; i++) {
+		if (fds[i].fd > 0)
+			close(fds[i].fd);
+	}
 
-	return -255;
+	if (legacy_kernel)
+		return -255;
+	else
+		return -1;
 }
 
 static int read_ras_event(int fd,
-- 
2.1.4
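
The cleanup pattern this patch moves to can be sketched outside rasdaemon roughly as follows: every descriptor slot starts at -1, so a single exit label can release whatever was acquired before the failure. This is only an illustration under stated assumptions; `probe_all()` and its parameters are hypothetical names, not rasdaemon code. The sketch tests `>= 0` before `close()`, since 0 is a valid file descriptor.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of rasdaemon): open every path in
 * 'paths', then release everything again through one exit label.
 * Returns 0 if all opens succeeded, -1 otherwise.
 */
int probe_all(const char *const *paths, int n)
{
	int rc = -1;
	int i;
	int *fds;
	char *page = NULL;

	if (n <= 0)
		return -1;

	fds = malloc(n * sizeof(*fds));
	if (!fds)
		return -1;

	/* Mark every slot as "not open" before anything can fail. */
	for (i = 0; i < n; i++)
		fds[i] = -1;

	page = malloc(4096);
	if (!page)
		goto out;

	for (i = 0; i < n; i++) {
		fds[i] = open(paths[i], O_RDONLY);
		if (fds[i] < 0)
			goto out;
	}

	rc = 0;
out:
	/* Only close slots that were really opened. */
	for (i = 0; i < n; i++) {
		if (fds[i] >= 0)
			close(fds[i]);
	}
	free(fds);
	free(page);	/* free(NULL) is a no-op */
	return rc;
}
```

Calling `probe_all()` with one unopenable path returns -1 without leaving the earlier descriptors open, which is the property the patch is after.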




* [PATCH 2/7] rasdaemon: fix memory leak in ras-events.c:handle_ras_events()
  2019-10-16 16:33 ` [PATCH 0/7] rasdaemon: add fixes, database closure and signal handling Shiju Jose
  2019-10-16 16:33   ` [PATCH 1/7] rasdaemon: fix cleanup issues in ras-events.c:read_ras_event_all_cpus() Shiju Jose
@ 2019-10-16 16:33   ` Shiju Jose
  2019-10-16 16:33   ` [PATCH 3/7] rasdaemon: fix missing fclose in ras-events.c:select_tracing_timestamp() Shiju Jose
                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 33+ messages in thread
From: Shiju Jose @ 2019-10-16 16:33 UTC (permalink / raw)
  To: mchehab, linux-edac; +Cc: linuxarm, Shiju Jose

This patch fixes a memory leak in handle_ras_events()
when it fails to trace all supported RAS events.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 ras-events.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/ras-events.c b/ras-events.c
index d1773b1..d543251 100644
--- a/ras-events.c
+++ b/ras-events.c
@@ -846,7 +846,8 @@ int handle_ras_events(int record_events)
 	if (!num_events) {
 		log(ALL, LOG_INFO,
 		    "Failed to trace all supported RAS events. Aborting.\n");
-		return EINVAL;
+		rc = -EINVAL;
+		goto err;
 	}
 
 	data = calloc(sizeof(*data), cpus);
-- 
2.1.4




* [PATCH 3/7] rasdaemon: fix missing fclose in ras-events.c:select_tracing_timestamp()
  2019-10-16 16:33 ` [PATCH 0/7] rasdaemon: add fixes, database closure and signal handling Shiju Jose
  2019-10-16 16:33   ` [PATCH 1/7] rasdaemon: fix cleanup issues in ras-events.c:read_ras_event_all_cpus() Shiju Jose
  2019-10-16 16:33   ` [PATCH 2/7] rasdaemon: fix memory leak in ras-events.c:handle_ras_events() Shiju Jose
@ 2019-10-16 16:33   ` Shiju Jose
  2019-10-16 16:33   ` [PATCH 4/7] rasdaemon: fix memory leak in ras-events.c:add_event_handler() Shiju Jose
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 33+ messages in thread
From: Shiju Jose @ 2019-10-16 16:33 UTC (permalink / raw)
  To: mchehab, linux-edac; +Cc: linuxarm, Shiju Jose

This patch adds the missing fclose() in select_tracing_timestamp()
on the error return taken when /proc/uptime cannot be parsed.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 ras-events.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ras-events.c b/ras-events.c
index d543251..fc6b288 100644
--- a/ras-events.c
+++ b/ras-events.c
@@ -600,12 +600,12 @@ static int select_tracing_timestamp(struct ras_events *ras)
 		return 0;
 	}
 	rc = fscanf(fp, "%zu.%u ", &uptime, &j1);
+	fclose(fp);
 	if (rc <= 0) {
 		log(TERM, LOG_ERR, "Can't parse /proc/uptime!\n");
 		return -1;
 	}
 	now = time(NULL);
-	fclose(fp);
 
 	ras->use_uptime = 1;
 	ras->uptime_diff = now - uptime;
-- 
2.1.4
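
The reordering in this patch amounts to closing the stream as soon as the last read from it is done, before any early return. A minimal sketch of that idea follows; `read_uptime_secs()` is a hypothetical helper, not the rasdaemon function, and unlike the original it requires both numeric fields to parse.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not rasdaemon code): read the integer part of
 * the first "seconds.fraction" value from 'path' into *secs.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
int read_uptime_secs(const char *path, unsigned long *secs)
{
	FILE *fp;
	unsigned long up;
	unsigned int frac;
	int rc;

	fp = fopen(path, "r");
	if (!fp)
		return -1;

	rc = fscanf(fp, "%lu.%u", &up, &frac);
	/* Close before checking rc so no return path leaks the stream. */
	fclose(fp);

	if (rc != 2)
		return -1;

	*secs = up;
	return 0;
}
```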




* [PATCH 4/7] rasdaemon: fix memory leak in ras-events.c:add_event_handler()
  2019-10-16 16:33 ` [PATCH 0/7] rasdaemon: add fixes, database closure and signal handling Shiju Jose
                     ` (2 preceding siblings ...)
  2019-10-16 16:33   ` [PATCH 3/7] rasdaemon: fix missing fclose in ras-events.c:select_tracing_timestamp() Shiju Jose
@ 2019-10-16 16:33   ` Shiju Jose
  2019-10-16 16:33   ` [PATCH 5/7] rasdaemon: delete multiple definitions of ARRAY_SIZE Shiju Jose
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 33+ messages in thread
From: Shiju Jose @ 2019-10-16 16:33 UTC (permalink / raw)
  To: mchehab, linux-edac; +Cc: linuxarm, Shiju Jose

This patch rearranges the free(page) call to prevent the
memory leak when __toggle_ras_mc_event() fails.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 ras-events.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ras-events.c b/ras-events.c
index fc6b288..f912dae 100644
--- a/ras-events.c
+++ b/ras-events.c
@@ -688,6 +688,7 @@ static int add_event_handler(struct ras_events *ras, struct pevent *pevent,
 
 	/* Enable RAS events */
 	rc = __toggle_ras_mc_event(ras, group, event, 1);
+	free(page);
 	if (rc < 0) {
 		log(TERM, LOG_ERR, "Can't enable %s:%s tracing\n",
 		    group, event);
@@ -697,7 +698,6 @@ static int add_event_handler(struct ras_events *ras, struct pevent *pevent,
 
 	log(ALL, LOG_INFO, "Enabled event %s:%s\n", group, event);
 
-	free(page);
 	return 0;
 }
 
-- 
2.1.4




* [PATCH 5/7] rasdaemon: delete multiple definitions of ARRAY_SIZE
  2019-10-16 16:33 ` [PATCH 0/7] rasdaemon: add fixes, database closure and signal handling Shiju Jose
                     ` (3 preceding siblings ...)
  2019-10-16 16:33   ` [PATCH 4/7] rasdaemon: fix memory leak in ras-events.c:add_event_handler() Shiju Jose
@ 2019-10-16 16:33   ` Shiju Jose
  2019-10-16 16:34   ` [PATCH 6/7] rasdaemon: add closure and cleanups for the database Shiju Jose
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 33+ messages in thread
From: Shiju Jose @ 2019-10-16 16:33 UTC (permalink / raw)
  To: mchehab, linux-edac; +Cc: linuxarm, Shiju Jose

This patch deletes multiple definitions of ARRAY_SIZE and
moves the definition to a common file.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 ras-diskerror-handler.c | 2 --
 ras-mce-handler.h       | 3 ---
 ras-record.c            | 3 ---
 ras-record.h            | 2 ++
 4 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/ras-diskerror-handler.c b/ras-diskerror-handler.c
index 271dfac..68c0c77 100644
--- a/ras-diskerror-handler.c
+++ b/ras-diskerror-handler.c
@@ -50,8 +50,6 @@ static const struct {
 	{ -EIO,       "I/O error" },
 };
 
-#define ARRAY_SIZE(x) (sizeof(x)/sizeof(*(x)))
-
 static const char *get_blk_error(int err)
 {
 	int i;
diff --git a/ras-mce-handler.h b/ras-mce-handler.h
index 94395eb..4d615b4 100644
--- a/ras-mce-handler.h
+++ b/ras-mce-handler.h
@@ -24,9 +24,6 @@
 #include "ras-events.h"
 #include "libtrace/event-parse.h"
 
-
-#define ARRAY_SIZE(x) (sizeof(x)/sizeof(*(x)))
-
 enum cputype {
 	CPU_GENERIC,
 	CPU_P6OLD,
diff --git a/ras-record.c b/ras-record.c
index ae5d359..8f1c550 100644
--- a/ras-record.c
+++ b/ras-record.c
@@ -35,9 +35,6 @@
 
 #define SQLITE_RAS_DB RASSTATEDIR "/" RAS_DB_FNAME
 
-
-#define ARRAY_SIZE(x) (sizeof(x)/sizeof(*(x)))
-
 /*
  * Table and functions to handle ras:mc_event
  */
diff --git a/ras-record.h b/ras-record.h
index 5311c67..c9af5ae 100644
--- a/ras-record.h
+++ b/ras-record.h
@@ -23,6 +23,8 @@
 #include <stdint.h>
 #include "config.h"
 
+#define ARRAY_SIZE(x) (sizeof(x)/sizeof(*(x)))
+
 extern long user_hz;
 
 struct ras_events *ras;
-- 
2.1.4




* [PATCH 6/7] rasdaemon: add closure and cleanups for the database
  2019-10-16 16:33 ` [PATCH 0/7] rasdaemon: add fixes, database closure and signal handling Shiju Jose
                     ` (4 preceding siblings ...)
  2019-10-16 16:33   ` [PATCH 5/7] rasdaemon: delete multiple definitions of ARRAY_SIZE Shiju Jose
@ 2019-10-16 16:34   ` Shiju Jose
  2019-10-16 16:34   ` [PATCH 7/7] rasdaemon: add signal handling for the cleanup Shiju Jose
  2019-11-13 16:38   ` [PATCH 0/7] rasdaemon: add fixes, database closure and signal handling Shiju Jose
  7 siblings, 0 replies; 33+ messages in thread
From: Shiju Jose @ 2019-10-16 16:34 UTC (permalink / raw)
  To: mchehab, linux-edac; +Cc: linuxarm, Shiju Jose

This patch adds closure and cleanups for the sqlite3 database.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 ras-events.c               |  14 +++++-
 ras-non-standard-handler.c |  16 ++++++
 ras-non-standard-handler.h |   6 ++-
 ras-record.c               | 120 +++++++++++++++++++++++++++++++++++++++++++++
 ras-record.h               |   3 ++
 5 files changed, 157 insertions(+), 2 deletions(-)

diff --git a/ras-events.c b/ras-events.c
index f912dae..d155caa 100644
--- a/ras-events.c
+++ b/ras-events.c
@@ -418,7 +418,7 @@ static int read_ras_event_all_cpus(struct pthread_data *pdata,
 			size = read(fds[i].fd, page, pdata[i].ras->page_size);
 			if (size < 0) {
 				log(TERM, LOG_WARNING, "read\n");
-				goto error;
+				goto cleanup;
 			} else if (size > 0) {
 				kbuffer_load_subbuffer(kbuf, page);
 
@@ -452,6 +452,13 @@ static int read_ras_event_all_cpus(struct pthread_data *pdata,
 	/* poll() is not supported. We need to fallback to the old way */
 	log(TERM, LOG_INFO,
 	    "Old kernel detected. Stop listening and fall back to pthread way.\n");
+
+cleanup:
+	if (pdata[0].ras->record_events) {
+		unregister_ns_dec_tab();
+		ras_mc_event_closedb(pdata[0].cpu, pdata[0].ras);
+	}
+
 error:
 	kbuffer_free(kbuf);
 	free(page);
@@ -540,6 +547,11 @@ static void *handle_ras_events_cpu(void *priv)
 
 	read_ras_event(fd, pdata, kbuf, page);
 
+	if (pdata->ras->record_events) {
+		unregister_ns_dec_tab();
+		ras_mc_event_closedb(pdata->cpu, pdata->ras);
+	}
+
 	close(fd);
 	kbuffer_free(kbuf);
 	free(page);
diff --git a/ras-non-standard-handler.c b/ras-non-standard-handler.c
index 1b5d67a..d92fd42 100644
--- a/ras-non-standard-handler.c
+++ b/ras-non-standard-handler.c
@@ -41,6 +41,22 @@ int register_ns_dec_tab(const p_ns_dec_tab tab)
 void unregister_ns_dec_tab(void)
 {
 	if (ns_dec_tab) {
+#ifdef HAVE_SQLITE3
+		p_ns_dec_tab dec_tab;
+		int i, count;
+
+		for (count = 0; count < dec_tab_count; count++) {
+			dec_tab = ns_dec_tab[count];
+			for (i = 0; dec_tab[i].decode; i++) {
+				if (dec_tab[i].stmt_dec_record) {
+					ras_mc_finalize_vendor_table(
+						dec_tab[i].stmt_dec_record);
+					dec_tab[i].stmt_dec_record = NULL;
+				}
+			}
+		}
+#endif
+
 		free(ns_dec_tab);
 		ns_dec_tab = NULL;
 		dec_tab_count = 0;
diff --git a/ras-non-standard-handler.h b/ras-non-standard-handler.h
index fd9dd92..2b9bf40 100644
--- a/ras-non-standard-handler.h
+++ b/ras-non-standard-handler.h
@@ -36,8 +36,12 @@ int ras_non_standard_event_handler(struct trace_seq *s,
 
 void print_le_hex(struct trace_seq *s, const uint8_t *buf, int index);
 
+#ifdef HAVE_NON_STANDARD
 int register_ns_dec_tab(const p_ns_dec_tab tab);
-
 void unregister_ns_dec_tab(void);
+#else
+static inline int register_ns_dec_tab(const p_ns_dec_tab tab) { return 0; };
+static inline void unregister_ns_dec_tab(void) { return; };
+#endif
 
 #endif
diff --git a/ras-record.c b/ras-record.c
index 8f1c550..c6cf61a 100644
--- a/ras-record.c
+++ b/ras-record.c
@@ -595,6 +595,18 @@ int ras_mc_add_vendor_table(struct ras_events *ras,
 	return rc;
 }
 
+int ras_mc_finalize_vendor_table(sqlite3_stmt *stmt)
+{
+	int rc;
+
+	rc = sqlite3_finalize(stmt);
+	if (rc != SQLITE_OK)
+		log(TERM, LOG_ERR,
+		    "Failed to finalize sqlite: error = %d\n", rc);
+
+	return rc;
+}
+
 int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras)
 {
 	int rc;
@@ -692,3 +704,111 @@ int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras)
 		ras->db_priv = priv;
 	return 0;
 }
+
+int ras_mc_event_closedb(unsigned int cpu, struct ras_events *ras)
+{
+	int rc;
+	sqlite3 *db;
+	struct sqlite3_priv *priv = ras->db_priv;
+
+	printf("Calling %s()\n", __func__);
+
+	if (!priv)
+		return -1;
+
+	db = priv->db;
+	if (!db)
+		return -1;
+
+	if (priv->stmt_mc_event) {
+		rc = sqlite3_finalize(priv->stmt_mc_event);
+		if (rc != SQLITE_OK)
+			log(TERM, LOG_ERR,
+			    "cpu %u: Failed to finalize mc_event sqlite: error = %d\n",
+			    cpu, rc);
+	}
+
+#ifdef HAVE_AER
+	if (priv->stmt_aer_event) {
+		rc = sqlite3_finalize(priv->stmt_aer_event);
+		if (rc != SQLITE_OK)
+			log(TERM, LOG_ERR,
+			    "cpu %u: Failed to finalize aer_event sqlite: error = %d\n",
+			    cpu, rc);
+	}
+#endif
+
+#ifdef HAVE_EXTLOG
+	if (priv->stmt_extlog_record) {
+		rc = sqlite3_finalize(priv->stmt_extlog_record);
+		if (rc != SQLITE_OK)
+			log(TERM, LOG_ERR,
+			    "cpu %u: Failed to finalize extlog_record sqlite: error = %d\n",
+			    cpu, rc);
+	}
+#endif
+
+
+#ifdef HAVE_MCE
+	if (priv->stmt_mce_record) {
+		rc = sqlite3_finalize(priv->stmt_mce_record);
+		if (rc != SQLITE_OK)
+			log(TERM, LOG_ERR,
+			    "cpu %u: Failed to finalize mce_record sqlite: error = %d\n",
+			    cpu, rc);
+	}
+#endif
+
+#ifdef HAVE_NON_STANDARD
+	if (priv->stmt_non_standard_record) {
+		rc = sqlite3_finalize(priv->stmt_non_standard_record);
+		if (rc != SQLITE_OK)
+			log(TERM, LOG_ERR,
+			    "cpu %u: Failed to finalize non_standard_record sqlite: error = %d\n",
+			    cpu, rc);
+	}
+#endif
+
+#ifdef HAVE_ARM
+	if (priv->stmt_arm_record) {
+		rc = sqlite3_finalize(priv->stmt_arm_record);
+		if (rc != SQLITE_OK)
+			log(TERM, LOG_ERR,
+			    "cpu %u: Failed to finalize arm_record sqlite: error = %d\n",
+			    cpu, rc);
+	}
+#endif
+
+#ifdef HAVE_DEVLINK
+	if (priv->stmt_devlink_event) {
+		rc = sqlite3_finalize(priv->stmt_devlink_event);
+		if (rc != SQLITE_OK)
+			log(TERM, LOG_ERR,
+			    "cpu %u: Failed to finalize devlink_event sqlite: error = %d\n",
+			    cpu, rc);
+	}
+#endif
+
+#ifdef HAVE_DISKERROR
+	if (priv->stmt_diskerror_event) {
+		rc = sqlite3_finalize(priv->stmt_diskerror_event);
+		if (rc != SQLITE_OK)
+			log(TERM, LOG_ERR,
+			    "cpu %u: Failed to finalize diskerror_event sqlite: error = %d\n",
+			    cpu, rc);
+	}
+#endif
+
+	rc = sqlite3_close_v2(db);
+	if (rc != SQLITE_OK)
+		log(TERM, LOG_ERR,
+		    "cpu %u: Failed to close sqlite: error = %d\n", cpu, rc);
+
+	rc = sqlite3_shutdown();
+	if (rc != SQLITE_OK)
+		log(TERM, LOG_ERR,
+		    "cpu %u: Failed to shutdown sqlite: error = %d\n", cpu, rc);
+	free(priv);
+
+	return 0;
+}
diff --git a/ras-record.h b/ras-record.h
index c9af5ae..a67b193 100644
--- a/ras-record.h
+++ b/ras-record.h
@@ -147,8 +147,10 @@ struct db_table_descriptor {
 };
 
 int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras);
+int ras_mc_event_closedb(unsigned int cpu, struct ras_events *ras);
 int ras_mc_add_vendor_table(struct ras_events *ras, sqlite3_stmt **stmt,
 			    const struct db_table_descriptor *db_tab);
+int ras_mc_finalize_vendor_table(sqlite3_stmt *stmt);
 int ras_store_mc_event(struct ras_events *ras, struct ras_mc_event *ev);
 int ras_store_aer_event(struct ras_events *ras, struct ras_aer_event *ev);
 int ras_store_mce_record(struct ras_events *ras, struct mce_event *ev);
@@ -160,6 +162,7 @@ int ras_store_diskerror_event(struct ras_events *ras, struct diskerror_event *ev
 
 #else
 static inline int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras) { return 0; };
+static inline int ras_mc_event_closedb(unsigned int cpu, struct ras_events *ras) { return 0; };
 static inline int ras_store_mc_event(struct ras_events *ras, struct ras_mc_event *ev) { return 0; };
 static inline int ras_store_aer_event(struct ras_events *ras, struct ras_aer_event *ev) { return 0; };
 static inline int ras_store_mce_record(struct ras_events *ras, struct mce_event *ev) { return 0; };
-- 
2.1.4




* [PATCH 7/7] rasdaemon: add signal handling for the cleanup
  2019-10-16 16:33 ` [PATCH 0/7] rasdaemon: add fixes, database closure and signal handling Shiju Jose
                     ` (5 preceding siblings ...)
  2019-10-16 16:34   ` [PATCH 6/7] rasdaemon: add closure and cleanups for the database Shiju Jose
@ 2019-10-16 16:34   ` Shiju Jose
  2019-11-13 16:38   ` [PATCH 0/7] rasdaemon: add fixes, database closure and signal handling Shiju Jose
  7 siblings, 0 replies; 33+ messages in thread
From: Shiju Jose @ 2019-10-16 16:34 UTC (permalink / raw)
  To: mchehab, linux-edac; +Cc: linuxarm, Shiju Jose

Presently rasdaemon does not free allocated memory or do
other cleanup when it is stopped with ctrl+c, kill, etc.
This patch adds handling of the signals SIGINT, SIGTERM, SIGHUP
and SIGQUIT and does the necessary cleanup when one of these
signals is received.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 ras-events.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 46 insertions(+), 4 deletions(-)

diff --git a/ras-events.c b/ras-events.c
index d155caa..511c93d 100644
--- a/ras-events.c
+++ b/ras-events.c
@@ -25,6 +25,8 @@
 #include <sys/stat.h>
 #include <sys/types.h>
 #include <sys/poll.h>
+#include <signal.h>
+#include <sys/signalfd.h>
 #include "libtrace/kbuffer.h"
 #include "libtrace/event-parse.h"
 #include "ras-mc-handler.h"
@@ -350,7 +352,9 @@ static int read_ras_event_all_cpus(struct pthread_data *pdata,
 	int ready, i, count_nready;
 	struct kbuffer *kbuf;
 	void *page;
-	struct pollfd fds[n_cpus];
+	struct pollfd fds[n_cpus + 1];
+	struct signalfd_siginfo fdsiginfo;
+	sigset_t mask;
 	int warnonce[n_cpus];
 	char pipe_raw[PATH_MAX];
 	int legacy_kernel = 0;
@@ -373,7 +377,7 @@ static int read_ras_event_all_cpus(struct pthread_data *pdata,
 		return -ENOMEM;
 	}
 
-	for (i = 0; i < n_cpus; i++)
+	for (i = 0; i < (n_cpus + 1); i++)
 		fds[i].fd = -1;
 
 	for (i = 0; i < n_cpus; i++) {
@@ -390,15 +394,51 @@ static int read_ras_event_all_cpus(struct pthread_data *pdata,
 		}
 	}
 
+	sigemptyset(&mask);
+	sigaddset(&mask, SIGINT);
+	sigaddset(&mask, SIGTERM);
+	sigaddset(&mask, SIGHUP);
+	sigaddset(&mask, SIGQUIT);
+	if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1)
+		log(TERM, LOG_WARNING, "sigprocmask\n");
+	fds[n_cpus].events = POLLIN;
+	fds[n_cpus].fd = signalfd(-1, &mask, 0);
+	if (fds[n_cpus].fd < 0) {
+		log(TERM, LOG_WARNING, "signalfd\n");
+		goto error;
+	}
+
 	log(TERM, LOG_INFO, "Listening to events for cpus 0 to %d\n", n_cpus - 1);
 	if (pdata[0].ras->record_events)
 		ras_mc_event_opendb(pdata[0].cpu, pdata[0].ras);
 
 	do {
-		ready = poll(fds, n_cpus, -1);
+		ready = poll(fds, (n_cpus + 1), -1);
 		if (ready < 0) {
 			log(TERM, LOG_WARNING, "poll\n");
 		}
+
+		/* check for the signal */
+		if (fds[n_cpus].revents & POLLIN) {
+			size = read(fds[n_cpus].fd, &fdsiginfo,
+				    sizeof(struct signalfd_siginfo));
+			if (size != sizeof(struct signalfd_siginfo))
+				log(TERM, LOG_WARNING, "signalfd read\n");
+
+			if (fdsiginfo.ssi_signo == SIGINT ||
+			    fdsiginfo.ssi_signo == SIGTERM ||
+			    fdsiginfo.ssi_signo == SIGHUP ||
+			    fdsiginfo.ssi_signo == SIGQUIT) {
+				log(TERM, LOG_INFO, "Received signal=%d\n",
+				    fdsiginfo.ssi_signo);
+				goto cleanup;
+			} else {
+				log(TERM, LOG_INFO,
+				    "Received unexpected signal=%d\n",
+				    fdsiginfo.ssi_signo);
+			}
+		}
+
 		count_nready = 0;
 		for (i = 0; i < n_cpus; i++) {
 			if (fds[i].revents & POLLERR) {
@@ -462,7 +502,9 @@ cleanup:
 error:
 	kbuffer_free(kbuf);
 	free(page);
-	for (i = 0; i < n_cpus; i++) {
+	sigprocmask(SIG_UNBLOCK, &mask, NULL);
+
+	for (i = 0; i < (n_cpus + 1); i++) {
 		if (fds[i].fd > 0)
 			close(fds[i].fd);
 	}
-- 
2.1.4




* [PATCH rasdaemon 0/2] rasdaemon: add fix for the sql table
       [not found] <Shiju Jose>
                   ` (2 preceding siblings ...)
  2019-10-16 16:33 ` [PATCH 0/7] rasdaemon: add fixes, database closure and signal handling Shiju Jose
@ 2019-11-13 16:31 ` Shiju Jose
  2019-11-13 16:31   ` [PATCH rasdaemon 1/2] rasdaemon: fix for the ras-record.c:ras_mc_prepare_stmt() failure when new fields added to " Shiju Jose
  2019-11-13 16:31   ` [PATCH rasdaemon 2/2] rasdaemon: store PCIe dev name and TLP header for the aer event Shiju Jose
  3 siblings, 2 replies; 33+ messages in thread
From: Shiju Jose @ 2019-11-13 16:31 UTC (permalink / raw)
  To: mchehab, linux-edac; +Cc: linuxarm, Shiju Jose

rasdaemon fails to prepare the SQL table when new fields are added on top
of the existing SQL table present in the system.

This patch set adds a solution for this issue and a patch that adds missing
information to the aer_event table.

Shiju Jose (2):
  rasdaemon: fix for the ras-record.c:ras_mc_prepare_stmt() failure when
    new fields added to the sql table
  rasdaemon: store PCIe dev name and TLP header for the aer event

 ras-aer-handler.c | 21 +++++++++++-
 ras-record.c      | 99 +++++++++++++++++++++++++++++++++++++++++++++++++++----
 ras-record.h      |  2 ++
 3 files changed, 115 insertions(+), 7 deletions(-)

-- 
1.9.1




* [PATCH rasdaemon 1/2] rasdaemon: fix for the ras-record.c:ras_mc_prepare_stmt() failure when new fields added to the sql table
  2019-11-13 16:31 ` [PATCH rasdaemon 0/2] rasdaemon: add fix for the sql table Shiju Jose
@ 2019-11-13 16:31   ` " Shiju Jose
  2019-11-13 16:31   ` [PATCH rasdaemon 2/2] rasdaemon: store PCIe dev name and TLP header for the aer event Shiju Jose
  1 sibling, 0 replies; 33+ messages in thread
From: Shiju Jose @ 2019-11-13 16:31 UTC (permalink / raw)
  To: mchehab, linux-edac; +Cc: linuxarm, Shiju Jose

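rasdaemon fails in the ras_mc_prepare_stmt() function when new fields are
added to the table's db_fields on top of the existing SQL table in the
system.

This patch adds a solution for this issue: if preparing the insert
statement fails, the existing columns are queried, each missing field is
added with ALTER TABLE, and the statement is prepared again.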

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 ras-record.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 89 insertions(+), 4 deletions(-)

diff --git a/ras-record.c b/ras-record.c
index c6cf61a..ca58b22 100644
--- a/ras-record.c
+++ b/ras-record.c
@@ -499,10 +499,9 @@ int ras_store_diskerror_event(struct ras_events *ras, struct diskerror_event *ev
 /*
  * Generic code
  */
-
-static int ras_mc_prepare_stmt(struct sqlite3_priv *priv,
-			       sqlite3_stmt **stmt,
-			       const struct db_table_descriptor *db_tab)
+static int __ras_mc_prepare_stmt(struct sqlite3_priv *priv,
+				 sqlite3_stmt **stmt,
+				 const struct db_table_descriptor *db_tab)
 
 {
 	int i, rc;
@@ -578,6 +577,92 @@ static int ras_mc_create_table(struct sqlite3_priv *priv,
 	return rc;
 }
 
+static int ras_mc_alter_table(struct sqlite3_priv *priv,
+			      sqlite3_stmt **stmt,
+			      const struct db_table_descriptor *db_tab)
+{
+	char sql[1024], *p = sql, *end = sql + sizeof(sql);
+	const struct db_fields *field;
+	int col_count;
+	int i, j, rc, found;
+
+	snprintf(p, end - p, "SELECT * FROM %s", db_tab->name);
+	rc = sqlite3_prepare_v2(priv->db, sql, -1, stmt, NULL);
+	if (rc != SQLITE_OK) {
+		log(TERM, LOG_ERR,
+		    "Failed to query fields from the table %s on %s: error = %d\n",
+		    db_tab->name, SQLITE_RAS_DB, rc);
+		return rc;
+	}
+
+	col_count = sqlite3_column_count(*stmt);
+	for (i = 0; i < db_tab->num_fields; i++) {
+		field = &db_tab->fields[i];
+		found = 0;
+		for (j = 0; j < col_count; j++) {
+			if (!strcmp(field->name,
+			    sqlite3_column_name(*stmt, j))) {
+				found = 1;
+				break;
+			}
+		}
+
+		if (!found) {
+			/* add new field */
+			p += snprintf(p, end - p, "ALTER TABLE %s ADD ",
+				      db_tab->name);
+			p += snprintf(p, end - p,
+				      "%s %s", field->name, field->type);
+#ifdef DEBUG_SQL
+			log(TERM, LOG_INFO, "SQL: %s\n", sql);
+#endif
+			rc = sqlite3_exec(priv->db, sql, NULL, NULL, NULL);
+			if (rc != SQLITE_OK) {
+				log(TERM, LOG_ERR,
+				    "Failed to add new field %s to the table %s on %s: error = %d\n",
+				    field->name, db_tab->name,
+				    SQLITE_RAS_DB, rc);
+				return rc;
+			}
+			p = sql;
+			memset(sql, 0, sizeof(sql));
+		}
+	}
+
+	return rc;
+}
+
+static int ras_mc_prepare_stmt(struct sqlite3_priv *priv,
+			       sqlite3_stmt **stmt,
+			       const struct db_table_descriptor *db_tab)
+{
+	int rc;
+
+	rc = __ras_mc_prepare_stmt(priv, stmt, db_tab);
+	if (rc != SQLITE_OK) {
+		log(TERM, LOG_ERR,
+		    "Failed to prepare insert db at table %s (db %s): error = %s\n",
+		    db_tab->name, SQLITE_RAS_DB, sqlite3_errmsg(priv->db));
+
+		log(TERM, LOG_INFO, "Trying to alter db at table %s (db %s)\n",
+		    db_tab->name, SQLITE_RAS_DB);
+
+		rc = ras_mc_alter_table(priv, stmt, db_tab);
+		if (rc != SQLITE_OK && rc != SQLITE_DONE) {
+			log(TERM, LOG_ERR,
+			    "Failed to alter db at table %s (db %s): error = %s\n",
+			    db_tab->name, SQLITE_RAS_DB,
+			    sqlite3_errmsg(priv->db));
+			*stmt = NULL;
+			return rc;
+		}
+
+		rc = __ras_mc_prepare_stmt(priv, stmt, db_tab);
+	}
+
+	return rc;
+}
+
 int ras_mc_add_vendor_table(struct ras_events *ras,
 			    sqlite3_stmt **stmt,
 			    const struct db_table_descriptor *db_tab)
-- 
1.9.1
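
Note on the approach: ras_mc_alter_table() above compares the fields in
the table descriptor with the columns already present in the table and
issues one "ALTER TABLE ... ADD" per missing field. The following is a
minimal standalone sketch of only that comparison and string-building
step, with no sqlite3 calls. struct field_desc, next_missing_field() and
build_add_column_sql() are hypothetical names used for illustration and
are not part of rasdaemon.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for rasdaemon's struct db_fields. */
struct field_desc {
	const char *name;
	const char *type;
};

/* Return 1 if name matches one of cols[0..ncols-1], else 0. */
static int has_column(const char *const *cols, size_t ncols, const char *name)
{
	size_t i;

	for (i = 0; i < ncols; i++)
		if (!strcmp(cols[i], name))
			return 1;
	return 0;
}

/*
 * Return the index of the first field at or after start that has no
 * matching column, or nfields if all remaining fields already exist.
 */
static size_t next_missing_field(const struct field_desc *fields,
				 size_t nfields,
				 const char *const *cols, size_t ncols,
				 size_t start)
{
	size_t i;

	for (i = start; i < nfields; i++)
		if (!has_column(cols, ncols, fields[i].name))
			break;
	return i;
}

/*
 * Build the statement that adds one field to the table.
 * Returns 0 on success, -1 if the statement does not fit in sql.
 */
static int build_add_column_sql(char *sql, size_t len, const char *table,
				const struct field_desc *f)
{
	int n;

	assert(sql != NULL && len > 0);

	n = snprintf(sql, len, "ALTER TABLE %s ADD %s %s",
		     table, f->name, f->type);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With the fields {id, timestamp, dev_name, err_type, err_msg} and the
existing columns {id, timestamp, err_type, err_msg}, next_missing_field()
returns index 2 and build_add_column_sql() for an example table named
aer_event produces "ALTER TABLE aer_event ADD dev_name TEXT". Checking
the snprintf() return value, which the sketch does and the patch does
not, reports a statement that would not fit instead of executing a
truncated one. As far as I know, SQLite appends a column added this way
at the end of the table, so a migrated table can have a different column
order from a freshly created one.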




* [PATCH rasdaemon 2/2] rasdaemon: store PCIe dev name and TLP header for the aer event
  2019-11-13 16:31 ` [PATCH rasdaemon 0/2] rasdaemon: add fix for the sql table Shiju Jose
  2019-11-13 16:31   ` [PATCH rasdaemon 1/2] rasdaemon: fix for the ras-record.c:ras_mc_prepare_stmt() failure when new fields added to " Shiju Jose
@ 2019-11-13 16:31   ` Shiju Jose
  1 sibling, 0 replies; 33+ messages in thread
From: Shiju Jose @ 2019-11-13 16:31 UTC (permalink / raw)
  To: mchehab, linux-edac; +Cc: linuxarm, Shiju Jose

This patch adds logging and recording of the PCIe device name and the
TLP header for the AER event.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
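Note: the ras-aer-handler.c hunk below appends the four TLP header words
to the message already in buf, using snprintf() at buf + strlen(ev.msg).
The following is a minimal standalone sketch of that append step with two
extra checks that the hunk does not have: a NULL check on the header
pointer (pevent_get_field_raw() can return NULL, as the existing dev_name
check in the same function suggests) and detection of truncation.
append_tlp_header() is a hypothetical helper name, not a rasdaemon
function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Append " TLP Header: ..." to the NUL-terminated message already in buf.
 * Returns 0 on success, -1 if hdr is NULL, if buf is not NUL-terminated
 * within buflen, or if the result would not fit. On failure the existing
 * message is left unchanged.
 */
static int append_tlp_header(char *buf, size_t buflen, const uint32_t *hdr)
{
	const char *nul;
	size_t used;
	int n;

	assert(buf != NULL && buflen > 0);

	if (!hdr)
		return -1;

	/* Find the end of the current message without reading past buflen. */
	nul = memchr(buf, '\0', buflen);
	if (!nul)
		return -1;
	used = (size_t)(nul - buf);

	n = snprintf(buf + used, buflen - used,
		     " TLP Header: %08x %08x %08x %08x",
		     (unsigned int)hdr[0], (unsigned int)hdr[1],
		     (unsigned int)hdr[2], (unsigned int)hdr[3]);
	if (n < 0 || (size_t)n >= buflen - used) {
		buf[used] = '\0';	/* drop the truncated tail */
		return -1;
	}

	return 0;
}
```

In the handler this would be called as append_tlp_header(buf, sizeof(buf),
ev.tlp_header) once tlp_header_valid is set. Whether a missing tlp_header
field should abort the handler or only skip the header text is a design
choice the sketch leaves to the caller.
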
 ras-aer-handler.c | 21 ++++++++++++++++++++-
 ras-record.c      |  6 ++++--
 ras-record.h      |  2 ++
 3 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/ras-aer-handler.c b/ras-aer-handler.c
index 664f7b4..8ddd439 100644
--- a/ras-aer-handler.c
+++ b/ras-aer-handler.c
@@ -52,6 +52,8 @@ static const char *aer_uncor_errors[32] = {
 	[20] = "Unsupported Request",
 };
 
+#define BUF_LEN	1024
+
 int ras_aer_event_handler(struct trace_seq *s,
 			 struct pevent_record *record,
 			 struct event_format *event, void *context)
@@ -59,11 +61,12 @@ int ras_aer_event_handler(struct trace_seq *s,
 	int len;
 	unsigned long long severity_val;
 	unsigned long long status_val;
+	unsigned long long val;
 	struct ras_events *ras = context;
 	time_t now;
 	struct tm *tm;
 	struct ras_aer_event ev;
-	char buf[1024];
+	char buf[BUF_LEN];
 
 	/*
 	 * Newer kernels (3.10-rc1 or upper) provide an uptime clock.
@@ -89,6 +92,7 @@ int ras_aer_event_handler(struct trace_seq *s,
 					   record, &len, 1);
 	if (!ev.dev_name)
 		return -1;
+	trace_seq_printf(s, "%s ", ev.dev_name);
 
 	if (pevent_get_field_val(s,  event, "status", record, &status_val, 1) < 0)
 		return -1;
@@ -104,6 +108,21 @@ int ras_aer_event_handler(struct trace_seq *s,
 	else
 		bitfield_msg(buf, sizeof(buf), aer_uncor_errors, 32, 0, 0, status_val);
 	ev.msg = buf;
+
+	if (pevent_get_field_val(s, event, "tlp_header_valid",
+				record, &val, 1) < 0)
+		return -1;
+
+	ev.tlp_header_valid = val;
+	if (ev.tlp_header_valid) {
+		ev.tlp_header = pevent_get_field_raw(s, event, "tlp_header",
+						     record, &len, 1);
+		snprintf((buf + strlen(ev.msg)), BUF_LEN - strlen(ev.msg),
+			 " TLP Header: %08x %08x %08x %08x",
+			 ev.tlp_header[0], ev.tlp_header[1],
+			 ev.tlp_header[2], ev.tlp_header[3]);
+	}
+
 	trace_seq_printf(s, "%s ", ev.msg);
 
 	/* Use hw_event_aer_err_type switch between different severity_val */
diff --git a/ras-record.c b/ras-record.c
index ca58b22..318bace 100644
--- a/ras-record.c
+++ b/ras-record.c
@@ -106,6 +106,7 @@ int ras_store_mc_event(struct ras_events *ras, struct ras_mc_event *ev)
 static const struct db_fields aer_event_fields[] = {
 		{ .name="id",			.type="INTEGER PRIMARY KEY" },
 		{ .name="timestamp",		.type="TEXT" },
+		{ .name="dev_name",		.type="TEXT" },
 		{ .name="err_type",		.type="TEXT" },
 		{ .name="err_msg",		.type="TEXT" },
 };
@@ -126,8 +127,9 @@ int ras_store_aer_event(struct ras_events *ras, struct ras_aer_event *ev)
 	log(TERM, LOG_INFO, "aer_event store: %p\n", priv->stmt_aer_event);
 
 	sqlite3_bind_text(priv->stmt_aer_event,  1, ev->timestamp, -1, NULL);
-	sqlite3_bind_text(priv->stmt_aer_event,  2, ev->error_type, -1, NULL);
-	sqlite3_bind_text(priv->stmt_aer_event,  3, ev->msg, -1, NULL);
+	sqlite3_bind_text(priv->stmt_aer_event,  2, ev->dev_name, -1, NULL);
+	sqlite3_bind_text(priv->stmt_aer_event,  3, ev->error_type, -1, NULL);
+	sqlite3_bind_text(priv->stmt_aer_event,  4, ev->msg, -1, NULL);
 
 	rc = sqlite3_step(priv->stmt_aer_event);
 	if (rc != SQLITE_OK && rc != SQLITE_DONE)
diff --git a/ras-record.h b/ras-record.h
index 440669d..cc217a9 100644
--- a/ras-record.h
+++ b/ras-record.h
@@ -43,6 +43,8 @@ struct ras_aer_event {
 	char timestamp[64];
 	const char *error_type;
 	const char *dev_name;
+	uint8_t tlp_header_valid;
+	uint32_t *tlp_header;
 	const char *msg;
 };
 
-- 
1.9.1




* RE: [PATCH 0/7] rasdaemon: add fixes, database closure and signal handling
  2019-10-16 16:33 ` [PATCH 0/7] rasdaemon: add fixes, database closure and signal handling Shiju Jose
                     ` (6 preceding siblings ...)
  2019-10-16 16:34   ` [PATCH 7/7] rasdaemon: add signal handling for the cleanup Shiju Jose
@ 2019-11-13 16:38   ` Shiju Jose
  2019-11-20  4:37     ` Mauro Carvalho Chehab
  7 siblings, 1 reply; 33+ messages in thread
From: Shiju Jose @ 2019-11-13 16:38 UTC (permalink / raw)
  To: mchehab, linux-edac; +Cc: Linuxarm

Hi Mauro,

Can you please review this patch set?

Thanks,
Shiju

>-----Original Message-----
>From: Shiju Jose
>Sent: 16 October 2019 17:34
>To: mchehab@kernel.org; linux-edac@vger.kernel.org
>Cc: Linuxarm <linuxarm@huawei.com>; Shiju Jose <shiju.jose@huawei.com>
>Subject: [PATCH 0/7] rasdaemon: add fixes, database closure and signal
>handling
>
>This patch set adds
>1. fixes for some memory leaks and file closure.
>2. closure for the sqlite3 database.
>3. signal handling for the cleanup.
>
>Shiju Jose (7):
>  rasdaemon: fix cleanup issues in
>    ras-events.c:read_ras_event_all_cpus()
>  rasdaemon: fix memory leak in ras-events.c:handle_ras_events()
>  rasdaemon: fix missing fclose in
>    ras-events.c:select_tracing_timestamp()
>  rasdaemon: fix memory leak in ras-events.c:add_event_handler()
>  rasdaemon: delete multiple definitions of ARRAY_SIZE
>  rasdaemon: add closure and cleanups for the database
>  rasdaemon: add signal handling for the cleanup
>
> ras-diskerror-handler.c    |   2 -
> ras-events.c               |  88 +++++++++++++++++++++++++++-----
> ras-mce-handler.h          |   3 --
> ras-non-standard-handler.c |  16 ++++++
> ras-non-standard-handler.h |   6 ++-
> ras-record.c               | 123
>+++++++++++++++++++++++++++++++++++++++++++--
> ras-record.h               |   5 ++
> 7 files changed, 222 insertions(+), 21 deletions(-)
>
>--
>2.1.4
>



* Re: [PATCH 0/7] rasdaemon: add fixes, database closure and signal handling
  2019-11-13 16:38   ` [PATCH 0/7] rasdaemon: add fixes, database closure and signal handling Shiju Jose
@ 2019-11-20  4:37     ` Mauro Carvalho Chehab
  0 siblings, 0 replies; 33+ messages in thread
From: Mauro Carvalho Chehab @ 2019-11-20  4:37 UTC (permalink / raw)
  To: Shiju Jose; +Cc: mchehab, linux-edac, Linuxarm

Em Wed, 13 Nov 2019 16:38:45 +0000
Shiju Jose <shiju.jose@huawei.com> escreveu:

> Hi Mauro,
> 
> Can you please review this patch set?

Reviewed both rasdaemon patchsets and applied.

Thanks!
Mauro

> 
> Thanks,
> Shiju
> 
> >-----Original Message-----
> >From: Shiju Jose
> >Sent: 16 October 2019 17:34
> >To: mchehab@kernel.org; linux-edac@vger.kernel.org
> >Cc: Linuxarm <linuxarm@huawei.com>; Shiju Jose <shiju.jose@huawei.com>
> >Subject: [PATCH 0/7] rasdaemon: add fixes, database closure and signal
> >handling
> >
> >This patch set adds
> >1. fixes for some memory leaks and file closure.
> >2. closure for the sqlite3 database.
> >3. signal handling for the cleanup.
> >
> >Shiju Jose (7):
> >  rasdaemon: fix cleanup issues in
> >    ras-events.c:read_ras_event_all_cpus()
> >  rasdaemon: fix memory leak in ras-events.c:handle_ras_events()
> >  rasdaemon: fix missing fclose in
> >    ras-events.c:select_tracing_timestamp()
> >  rasdaemon: fix memory leak in ras-events.c:add_event_handler()
> >  rasdaemon: delete multiple definitions of ARRAY_SIZE
> >  rasdaemon: add closure and cleanups for the database
> >  rasdaemon: add signal handling for the cleanup
> >
> > ras-diskerror-handler.c    |   2 -
> > ras-events.c               |  88 +++++++++++++++++++++++++++-----
> > ras-mce-handler.h          |   3 --
> > ras-non-standard-handler.c |  16 ++++++
> > ras-non-standard-handler.h |   6 ++-
> > ras-record.c               | 123
> >+++++++++++++++++++++++++++++++++++++++++++--
> > ras-record.h               |   5 ++
> > 7 files changed, 222 insertions(+), 21 deletions(-)
> >
> >--
> >2.1.4
> >
> 




Cheers,
Mauro


