linux-edac.vger.kernel.org archive mirror
* [PATCH V2 0/4] rasdaemon: Add support for the CXL error events
@ 2023-01-24 16:57 shiju.jose
  2023-01-24 16:57 ` [PATCH V2 1/4] rasdaemon: Move definition for BIT and BIT_ULL to a common file shiju.jose
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: shiju.jose @ 2023-01-24 16:57 UTC (permalink / raw)
  To: linux-edac, linux-cxl, mchehab; +Cc: jonathan.cameron, linuxarm, shiju.jose

From: Shiju Jose <shiju.jose@huawei.com>

Log and record the following CXL errors reported through the kernel
trace events: CXL poison errors, CXL AER uncorrectable errors, and CXL AER
correctable errors.

Note: The default poll method in rasdaemon for receiving trace events
did not work in QEMU, so the pthread method was used instead to test
the CXL error events.
To do the same, make the following change in ras-events.c:
<change start ...>
/* rc = read_ras_event_all_cpus(data, cpus); */
rc = -255;
< ...change end >
/* Poll doesn't work on this kernel. Fallback to pthread way */
if (rc == -255) {
...

Shiju Jose (4):
  rasdaemon: Move definition for BIT and BIT_ULL to a common file
  rasdaemon: Add support for the CXL poison events
  rasdaemon: Add support for the CXL AER uncorrectable errors
  rasdaemon: Add support for the CXL AER correctable errors

Changes:
RFC V1 -> V2
1. Rename uuid to region_uuid in the log and SQLite DB.
2. Rebase to the latest rasdaemon code.
3. Adapt to the renamed interface structures and functions in the
   latest libtraceevent-dev, which rasdaemon uses.

 Makefile.am                |   7 +-
 configure.ac               |  11 ++
 ras-cxl-handler.c          | 351 +++++++++++++++++++++++++++++++++++++
 ras-cxl-handler.h          |  32 ++++
 ras-events.c               |  33 ++++
 ras-events.h               |   3 +
 ras-non-standard-handler.h |   3 -
 ras-record.c               | 203 +++++++++++++++++++++
 ras-record.h               |  49 ++++++
 ras-report.c               | 219 +++++++++++++++++++++++
 ras-report.h               |   6 +
 11 files changed, 913 insertions(+), 4 deletions(-)
 create mode 100644 ras-cxl-handler.c
 create mode 100644 ras-cxl-handler.h

-- 
2.25.1



* [PATCH V2 1/4] rasdaemon: Move definition for BIT and BIT_ULL to a common file
  2023-01-24 16:57 [PATCH V2 0/4] rasdaemon: Add support for the CXL error events shiju.jose
@ 2023-01-24 16:57 ` shiju.jose
  2023-01-25 16:34   ` Dave Jiang
  2023-01-24 16:57 ` [PATCH V2 2/4] rasdaemon: Add support for the CXL poison events shiju.jose
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: shiju.jose @ 2023-01-24 16:57 UTC (permalink / raw)
  To: linux-edac, linux-cxl, mchehab; +Cc: jonathan.cameron, linuxarm, shiju.jose

From: Shiju Jose <shiju.jose@huawei.com>

Move the definitions of BIT() and BIT_ULL() to the common header
ras-record.h, so that the CXL handler added later in this series can
use them.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
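For readers unfamiliar with the two macros, here is a minimal sketch of how
they differ. The macro bodies are copied from the patch; the EXAMPLE_* names
and the helper function are hypothetical and exist only for illustration.
BIT() yields an unsigned long, which may be only 32 bits wide on some
targets, so BIT_ULL() is the safer choice for bit positions of 32 and above.

```c
#include <assert.h>
#include <stdint.h>

/* Same definitions the patch moves into ras-record.h. */
#define BIT(nr)      (1UL << (nr))
#define BIT_ULL(nr)  (1ULL << (nr))

/* Hypothetical flag names, for illustration only. */
#define EXAMPLE_FLAG_LOW   BIT(1)
#define EXAMPLE_FLAG_HIGH  BIT_ULL(40)

/* Returns 1 if every bit in mask is set in status, 0 otherwise. */
static int example_flags_set(uint64_t status, uint64_t mask)
{
	assert(mask != 0);
	return (status & mask) == mask;
}
```

A caller would combine masks with bitwise OR, for example
example_flags_set(status, EXAMPLE_FLAG_LOW | EXAMPLE_FLAG_HIGH).
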
 ras-non-standard-handler.h | 3 ---
 ras-record.h               | 3 +++
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/ras-non-standard-handler.h b/ras-non-standard-handler.h
index 4d9f938..c360eaf 100644
--- a/ras-non-standard-handler.h
+++ b/ras-non-standard-handler.h
@@ -17,9 +17,6 @@
 #include "ras-events.h"
 #include <traceevent/event-parse.h>
 
-#define BIT(nr)                 (1UL << (nr))
-#define BIT_ULL(nr)             (1ULL << (nr))
-
 struct ras_ns_ev_decoder {
 	struct ras_ns_ev_decoder *next;
 	const char *sec_type;
diff --git a/ras-record.h b/ras-record.h
index d9f7733..219f10b 100644
--- a/ras-record.h
+++ b/ras-record.h
@@ -25,6 +25,9 @@
 
 #define ARRAY_SIZE(x) (sizeof(x)/sizeof(*(x)))
 
+#define BIT(nr)                 (1UL << (nr))
+#define BIT_ULL(nr)             (1ULL << (nr))
+
 extern long user_hz;
 
 struct ras_events;
-- 
2.25.1



* [PATCH V2 2/4] rasdaemon: Add support for the CXL poison events
  2023-01-24 16:57 [PATCH V2 0/4] rasdaemon: Add support for the CXL error events shiju.jose
  2023-01-24 16:57 ` [PATCH V2 1/4] rasdaemon: Move definition for BIT and BIT_ULL to a common file shiju.jose
@ 2023-01-24 16:57 ` shiju.jose
  2023-01-25 22:34   ` Ira Weiny
  2023-01-24 16:57 ` [PATCH V2 3/4] rasdaemon: Add support for the CXL AER uncorrectable errors shiju.jose
  2023-01-24 16:57 ` [PATCH V2 4/4] rasdaemon: Add support for the CXL AER correctable errors shiju.jose
  3 siblings, 1 reply; 11+ messages in thread
From: shiju.jose @ 2023-01-24 16:57 UTC (permalink / raw)
  To: linux-edac, linux-cxl, mchehab; +Cc: jonathan.cameron, linuxarm, shiju.jose

From: Shiju Jose <shiju.jose@huawei.com>

Add support to log and record the CXL poison events.

The corresponding kernel patches are here:
https://lore.kernel.org/linux-cxl/de11785ff05844299b40b100f8e0f56c7eef7f08.1674070170.git.alison.schofield@intel.com/

This is presently an RFC draft that only logs the events. It could be
extended to take policy-based recovery action on frequent poison events,
depending on the above kernel patches.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
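The handler below converts the poison list overflow timestamp, which the
code comment describes as nanoseconds since 01-Jan-1970 UTC, into a
printable string. A minimal standalone sketch of that conversion follows.
The function name is hypothetical, and it deliberately uses gmtime_r()
instead of the localtime() call in the patch so that the result does not
depend on the local time zone.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Format a nanoseconds-since-epoch value as "YYYY-MM-DD HH:MM:SS +0000".
 * Assumption: UTC output via gmtime_r(), unlike the patch, which uses
 * localtime() and "%z".
 * Returns 0 on success, -1 if conversion fails or the buffer is too small.
 */
static int example_format_overflow_ts(uint64_t ns, char *out, size_t out_len)
{
	time_t secs = (time_t)(ns / NSEC_PER_SEC);
	struct tm tm;

	assert(out != NULL);
	if (!gmtime_r(&secs, &tm))
		return -1;
	if (strftime(out, out_len, "%Y-%m-%d %H:%M:%S +0000", &tm) == 0)
		return -1;
	return 0;
}
```

Sub-second precision is dropped by the integer division, which matches what
the handler does with the value before formatting it.
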
 Makefile.am       |   7 +-
 configure.ac      |  11 ++++
 ras-cxl-handler.c | 162 ++++++++++++++++++++++++++++++++++++++++++++++
 ras-cxl-handler.h |  24 +++++++
 ras-events.c      |  15 +++++
 ras-events.h      |   1 +
 ras-record.c      |  81 +++++++++++++++++++++++
 ras-record.h      |  20 ++++++
 ras-report.c      |  83 ++++++++++++++++++++++++
 ras-report.h      |   2 +
 10 files changed, 405 insertions(+), 1 deletion(-)
 create mode 100644 ras-cxl-handler.c
 create mode 100644 ras-cxl-handler.h

diff --git a/Makefile.am b/Makefile.am
index a9832cc..bd7b2ae 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -73,6 +73,11 @@ endif
 if WITH_CPU_FAULT_ISOLATION
    rasdaemon_SOURCES += ras-cpu-isolation.c queue.c
 endif
+
+if WITH_CXL
+   rasdaemon_SOURCES += ras-cxl-handler.c
+endif
+
 rasdaemon_LDADD = -lpthread $(SQLITE3_LIBS) $(LIBTRACEEVENT_LIBS)
 rasdaemon_CFLAGS = $(SQLITE3_CFLAGS) $(LIBTRACEEVENT_CFLAGS)
 
@@ -81,7 +86,7 @@ include_HEADERS = config.h  ras-events.h  ras-logger.h  ras-mc-handler.h \
 		  ras-extlog-handler.h ras-arm-handler.h ras-non-standard-handler.h \
 		  ras-devlink-handler.h ras-diskerror-handler.h rbtree.h ras-page-isolation.h \
 		  non-standard-hisilicon.h non-standard-ampere.h ras-memory-failure-handler.h \
-		  ras-cpu-isolation.h queue.h
+		  ras-cxl-handler.h ras-cpu-isolation.h queue.h
 
 # This rule can't be called with more than one Makefile job (like make -j8)
 # I can't figure out a way to fix that
diff --git a/configure.ac b/configure.ac
index c973aaf..028b9b3 100644
--- a/configure.ac
+++ b/configure.ac
@@ -127,6 +127,16 @@ AS_IF([test "x$enable_memory_failure" = "xyes" || test "x$enable_all" = "xyes"],
 AM_CONDITIONAL([WITH_MEMORY_FAILURE], [test x$enable_memory_failure = xyes || test x$enable_all = xyes])
 AM_COND_IF([WITH_MEMORY_FAILURE], [USE_MEMORY_FAILURE="yes"], [USE_MEMORY_FAILURE="no"])
 
+AC_ARG_ENABLE([cxl],
+    AS_HELP_STRING([--enable-cxl], [enable CXL events (currently experimental)]))
+
+AS_IF([test "x$enable_cxl" = "xyes" || test "x$enable_all" = "xyes"], [
+  AC_DEFINE(HAVE_CXL,1,"have CXL events collect")
+  AC_SUBST([WITH_CXL])
+])
+AM_CONDITIONAL([WITH_CXL], [test x$enable_cxl = xyes || test x$enable_all = xyes])
+AM_COND_IF([WITH_CXL], [USE_CXL="yes"], [USE_CXL="no"])
+
 AC_ARG_ENABLE([abrt_report],
     AS_HELP_STRING([--enable-abrt-report], [enable report event to ABRT (currently experimental)]))
 
@@ -215,6 +225,7 @@ compile time options summary
     DEVLINK             : $USE_DEVLINK
     Disk I/O errors     : $USE_DISKERROR
     Memory Failure      : $USE_MEMORY_FAILURE
+    CXL events          : $USE_CXL
     Memory CE PFA       : $USE_MEMORY_CE_PFA
     AMP RAS errors      : $USE_AMP_NS_DECODE
     CPU fault isolation : $USE_CPU_FAULT_ISOLATION
diff --git a/ras-cxl-handler.c b/ras-cxl-handler.c
new file mode 100644
index 0000000..0b7cdca
--- /dev/null
+++ b/ras-cxl-handler.c
@@ -0,0 +1,162 @@
+/*
+ * Copyright (c) Huawei Technologies Co., Ltd. 2023. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <traceevent/kbuffer.h>
+#include "ras-cxl-handler.h"
+#include "ras-record.h"
+#include "ras-logger.h"
+#include "ras-report.h"
+
+/* Poison List: Payload out flags */
+#define CXL_POISON_FLAG_MORE            BIT(0)
+#define CXL_POISON_FLAG_OVERFLOW        BIT(1)
+#define CXL_POISON_FLAG_SCANNING        BIT(2)
+
+/* CXL poison - source types */
+enum cxl_poison_source {
+	CXL_POISON_SOURCE_UNKNOWN = 0,
+	CXL_POISON_SOURCE_EXTERNAL = 1,
+	CXL_POISON_SOURCE_INTERNAL = 2,
+	CXL_POISON_SOURCE_INJECTED = 3,
+	CXL_POISON_SOURCE_VENDOR = 7,
+};
+
+int ras_cxl_poison_event_handler(struct trace_seq *s,
+				 struct tep_record *record,
+				 struct tep_event *event, void *context)
+{
+	int len;
+	unsigned long long val;
+	struct ras_events *ras = context;
+	time_t now;
+	struct tm *tm;
+	struct ras_cxl_poison_event ev;
+
+	now = record->ts/user_hz + ras->uptime_diff;
+	tm = localtime(&now);
+	if (tm)
+		strftime(ev.timestamp, sizeof(ev.timestamp),
+			 "%Y-%m-%d %H:%M:%S %z", tm);
+	else
+		strncpy(ev.timestamp, "1970-01-01 00:00:00 +0000", sizeof(ev.timestamp));
+	trace_seq_printf(s, "%s ", ev.timestamp);
+
+	ev.memdev = tep_get_field_raw(s, event, "memdev",
+				      record, &len, 1);
+	if (!ev.memdev)
+		return -1;
+	trace_seq_printf(s, "memdev:%s ", ev.memdev);
+
+	ev.pcidev = tep_get_field_raw(s, event, "pcidev",
+				      record, &len, 1);
+	if (!ev.pcidev)
+		return -1;
+	trace_seq_printf(s, "pcidev:%s ", ev.pcidev);
+
+	ev.region = tep_get_field_raw(s, event, "region",
+				      record, &len, 1);
+	if (!ev.region)
+		return -1;
+	trace_seq_printf(s, "region:%s ", ev.region);
+
+	ev.uuid = tep_get_field_raw(s, event, "uuid",
+				    record, &len, 1);
+	if (!ev.uuid)
+		return -1;
+	trace_seq_printf(s, "region_uuid:%s ", ev.uuid);
+
+	if (tep_get_field_val(s, event, "hpa", record, &val, 1) < 0)
+		return -1;
+	ev.hpa = val;
+	trace_seq_printf(s, "poison list: hpa:0x%llx ", (unsigned long long)ev.hpa);
+
+	if (tep_get_field_val(s, event, "dpa", record, &val, 1) < 0)
+		return -1;
+	ev.dpa = val;
+	trace_seq_printf(s, "dpa:0x%llx ", (unsigned long long)ev.dpa);
+
+	if (tep_get_field_val(s, event, "length", record, &val, 1) < 0)
+		return -1;
+	ev.length = val;
+	trace_seq_printf(s, "length:%u ", ev.length);
+
+	if (tep_get_field_val(s,  event, "source", record, &val, 1) < 0)
+		return -1;
+
+	switch (val) {
+	case CXL_POISON_SOURCE_UNKNOWN:
+		ev.source = "Unknown";
+		break;
+	case CXL_POISON_SOURCE_EXTERNAL:
+		ev.source = "External";
+		break;
+	case CXL_POISON_SOURCE_INTERNAL:
+		ev.source = "Internal";
+		break;
+	case CXL_POISON_SOURCE_INJECTED:
+		ev.source = "Injected";
+		break;
+	case CXL_POISON_SOURCE_VENDOR:
+		ev.source = "Vendor";
+		break;
+	default:
+		ev.source = "Invalid";
+	}
+	trace_seq_printf(s, "source:%s ", ev.source);
+
+	if (tep_get_field_val(s,  event, "flags", record, &val, 1) < 0)
+		return -1;
+	ev.flags = val;
+	trace_seq_printf(s, "flags:%d ", ev.flags);
+
+	if (ev.flags & CXL_POISON_FLAG_OVERFLOW) {
+		if (tep_get_field_val(s,  event, "overflow_t", record, &val, 1) < 0)
+			return -1;
+		if (val) {
+			/* CXL Specification 3.0
+			 * Overflow timestamp - The number of unsigned nanoseconds
+			 * that have elapsed since midnight, 01-Jan-1970 UTC
+			 */
+			time_t ovf_ts_secs = val / 1000000000ULL;
+
+			tm = localtime(&ovf_ts_secs);
+			if (tm) {
+				strftime(ev.overflow_ts, sizeof(ev.overflow_ts),
+					 "%Y-%m-%d %H:%M:%S %z", tm);
+			}
+		}
+		if (!val || !tm)
+			strncpy(ev.overflow_ts, "1970-01-01 00:00:00 +0000",
+				sizeof(ev.overflow_ts));
+	} else
+		strncpy(ev.overflow_ts, "1970-01-01 00:00:00 +0000", sizeof(ev.overflow_ts));
+	trace_seq_printf(s, "overflow timestamp:%s ", ev.overflow_ts);
+	trace_seq_printf(s, "\n");
+
+	/* Insert data into the database */
+#ifdef HAVE_SQLITE3
+	ras_store_cxl_poison_event(ras, &ev);
+#endif
+
+#ifdef HAVE_ABRT_REPORT
+	/* Report event to ABRT */
+	ras_report_cxl_poison_event(ras, &ev);
+#endif
+
+	return 0;
+}
diff --git a/ras-cxl-handler.h b/ras-cxl-handler.h
new file mode 100644
index 0000000..84d5cc6
--- /dev/null
+++ b/ras-cxl-handler.h
@@ -0,0 +1,24 @@
+/*
+ * Copyright (c) Huawei Technologies Co., Ltd. 2023. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef __RAS_CXL_HANDLER_H
+#define __RAS_CXL_HANDLER_H
+
+#include "ras-events.h"
+#include <traceevent/event-parse.h>
+
+int ras_cxl_poison_event_handler(struct trace_seq *s,
+				 struct tep_record *record,
+				 struct tep_event *event, void *context);
+#endif
diff --git a/ras-events.c b/ras-events.c
index 39f9ce2..6555125 100644
--- a/ras-events.c
+++ b/ras-events.c
@@ -40,6 +40,7 @@
 #include "ras-devlink-handler.h"
 #include "ras-diskerror-handler.h"
 #include "ras-memory-failure-handler.h"
+#include "ras-cxl-handler.h"
 #include "ras-record.h"
 #include "ras-logger.h"
 #include "ras-page-isolation.h"
@@ -243,6 +244,10 @@ int toggle_ras_mc_event(int enable)
 	rc |= __toggle_ras_mc_event(ras, "ras", "memory_failure_event", enable);
 #endif
 
+#ifdef HAVE_CXL
+	rc |= __toggle_ras_mc_event(ras, "cxl", "cxl_poison", enable);
+#endif
+
 free_ras:
 	free(ras);
 	return rc;
@@ -951,6 +956,16 @@ int handle_ras_events(int record_events)
 		    "ras", "memory_failure_event");
 #endif
 
+#ifdef HAVE_CXL
+	rc = add_event_handler(ras, pevent, page_size, "cxl", "cxl_poison",
+			       ras_cxl_poison_event_handler, NULL, CXL_POISON_EVENT);
+	if (!rc)
+		num_events++;
+	else
+		log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
+		    "cxl", "cxl_poison");
+#endif
+
 	if (!num_events) {
 		log(ALL, LOG_INFO,
 		    "Failed to trace all supported RAS events. Aborting.\n");
diff --git a/ras-events.h b/ras-events.h
index 6c9f507..fc51070 100644
--- a/ras-events.h
+++ b/ras-events.h
@@ -39,6 +39,7 @@ enum {
 	DEVLINK_EVENT,
 	DISKERROR_EVENT,
 	MF_EVENT,
+	CXL_POISON_EVENT,
 	NR_EVENTS
 };
 
diff --git a/ras-record.c b/ras-record.c
index a367939..f54fb41 100644
--- a/ras-record.c
+++ b/ras-record.c
@@ -559,6 +559,67 @@ int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev)
 }
 #endif
 
+#ifdef HAVE_CXL
+/*
+ * Table and functions to handle cxl:cxl_poison
+ */
+static const struct db_fields cxl_poison_event_fields[] = {
+	{ .name = "id",                   .type = "INTEGER PRIMARY KEY" },
+	{ .name = "timestamp",            .type = "TEXT" },
+	{ .name = "memdev",               .type = "TEXT" },
+	{ .name = "pcidev",               .type = "TEXT" },
+	{ .name = "region",               .type = "TEXT" },
+	{ .name = "region_uuid",          .type = "TEXT" },
+	{ .name = "hpa",                  .type = "INTEGER" },
+	{ .name = "dpa",                  .type = "INTEGER" },
+	{ .name = "length",               .type = "INTEGER" },
+	{ .name = "source",               .type = "TEXT" },
+	{ .name = "flags",                .type = "INTEGER" },
+	{ .name = "overflow_ts",          .type = "TEXT" },
+};
+
+static const struct db_table_descriptor cxl_poison_event_tab = {
+	.name = "cxl_poison_event",
+	.fields = cxl_poison_event_fields,
+	.num_fields = ARRAY_SIZE(cxl_poison_event_fields),
+};
+
+int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev)
+{
+	int rc;
+	struct sqlite3_priv *priv = ras->db_priv;
+
+	if (!priv || !priv->stmt_cxl_poison_event)
+		return 0;
+	log(TERM, LOG_INFO, "cxl_poison_event store: %p\n", priv->stmt_cxl_poison_event);
+
+	sqlite3_bind_text(priv->stmt_cxl_poison_event, 1, ev->timestamp, -1, NULL);
+	sqlite3_bind_text(priv->stmt_cxl_poison_event, 2, ev->memdev, -1, NULL);
+	sqlite3_bind_text(priv->stmt_cxl_poison_event, 3, ev->pcidev, -1, NULL);
+	sqlite3_bind_text(priv->stmt_cxl_poison_event, 4, ev->region, -1, NULL);
+	sqlite3_bind_text(priv->stmt_cxl_poison_event, 5, ev->uuid, -1, NULL);
+	sqlite3_bind_int64(priv->stmt_cxl_poison_event, 6, ev->hpa);
+	sqlite3_bind_int64(priv->stmt_cxl_poison_event, 7, ev->dpa);
+	sqlite3_bind_int(priv->stmt_cxl_poison_event, 8, ev->length);
+	sqlite3_bind_text(priv->stmt_cxl_poison_event, 9, ev->source, -1, NULL);
+	sqlite3_bind_int(priv->stmt_cxl_poison_event, 10, ev->flags);
+	sqlite3_bind_text(priv->stmt_cxl_poison_event, 11, ev->overflow_ts, -1, NULL);
+
+	rc = sqlite3_step(priv->stmt_cxl_poison_event);
+	if (rc != SQLITE_OK && rc != SQLITE_DONE)
+		log(TERM, LOG_ERR,
+		    "Failed to do cxl_poison_event step on sqlite: error = %d\n", rc);
+	rc = sqlite3_reset(priv->stmt_cxl_poison_event);
+	if (rc != SQLITE_OK && rc != SQLITE_DONE)
+		log(TERM, LOG_ERR,
+		    "Failed reset cxl_poison_event on sqlite: error = %d\n",
+		    rc);
+	log(TERM, LOG_INFO, "register inserted at db\n");
+
+	return rc;
+}
+#endif
+
 /*
  * Generic code
  */
@@ -896,6 +957,16 @@ int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras)
 	}
 #endif
 
+#ifdef HAVE_CXL
+	rc = ras_mc_create_table(priv, &cxl_poison_event_tab);
+	if (rc == SQLITE_OK) {
+		rc = ras_mc_prepare_stmt(priv, &priv->stmt_cxl_poison_event,
+					 &cxl_poison_event_tab);
+		if (rc != SQLITE_OK)
+			goto error;
+	}
+#endif
+
 	ras->db_priv = priv;
 	return 0;
 
@@ -1008,6 +1079,16 @@ int ras_mc_event_closedb(unsigned int cpu, struct ras_events *ras)
 	}
 #endif
 
+#ifdef HAVE_CXL
+	if (priv->stmt_cxl_poison_event) {
+		rc = sqlite3_finalize(priv->stmt_cxl_poison_event);
+		if (rc != SQLITE_OK)
+			log(TERM, LOG_ERR,
+			    "cpu %u: Failed to finalize cxl_poison_event sqlite: error = %d\n",
+			    cpu, rc);
+	}
+#endif
+
 	rc = sqlite3_close_v2(db);
 	if (rc != SQLITE_OK)
 		log(TERM, LOG_ERR,
diff --git a/ras-record.h b/ras-record.h
index 219f10b..e5bf483 100644
--- a/ras-record.h
+++ b/ras-record.h
@@ -114,6 +114,20 @@ struct ras_mf_event {
 	const char *action_result;
 };
 
+struct ras_cxl_poison_event {
+	char timestamp[64];
+	const char *memdev;
+	const char *pcidev;
+	const char *region;
+	const char *uuid;
+	uint64_t hpa;
+	uint64_t dpa;
+	uint32_t length;
+	const char *source;
+	uint8_t flags;
+	char overflow_ts[64];
+};
+
 struct ras_mc_event;
 struct ras_aer_event;
 struct ras_extlog_event;
@@ -123,6 +137,7 @@ struct mce_event;
 struct devlink_event;
 struct diskerror_event;
 struct ras_mf_event;
+struct ras_cxl_poison_event;
 
 #ifdef HAVE_SQLITE3
 
@@ -155,6 +170,9 @@ struct sqlite3_priv {
 #ifdef HAVE_MEMORY_FAILURE
 	sqlite3_stmt	*stmt_mf_event;
 #endif
+#ifdef HAVE_CXL
+	sqlite3_stmt	*stmt_cxl_poison_event;
+#endif
 };
 
 struct db_fields {
@@ -182,6 +200,7 @@ int ras_store_arm_record(struct ras_events *ras, struct ras_arm_event *ev);
 int ras_store_devlink_event(struct ras_events *ras, struct devlink_event *ev);
 int ras_store_diskerror_event(struct ras_events *ras, struct diskerror_event *ev);
 int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev);
+int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev);
 
 #else
 static inline int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras) { return 0; };
@@ -195,6 +214,7 @@ static inline int ras_store_arm_record(struct ras_events *ras, struct ras_arm_ev
 static inline int ras_store_devlink_event(struct ras_events *ras, struct devlink_event *ev) { return 0; };
 static inline int ras_store_diskerror_event(struct ras_events *ras, struct diskerror_event *ev) { return 0; };
 static inline int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev) { return 0; };
+static inline int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev) { return 0; };
 
 #endif
 
diff --git a/ras-report.c b/ras-report.c
index 62d5eb7..589e640 100644
--- a/ras-report.c
+++ b/ras-report.c
@@ -331,6 +331,42 @@ static int set_mf_event_backtrace(char *buf, struct ras_mf_event *ev)
 	return 0;
 }
 
+static int set_cxl_poison_event_backtrace(char *buf, struct ras_cxl_poison_event *ev)
+{
+	char bt_buf[MAX_BACKTRACE_SIZE];
+
+	if (!buf || !ev)
+		return -1;
+
+	sprintf(bt_buf, "BACKTRACE="	\
+						"timestamp=%s\n"	\
+						"memdev=%s\n"		\
+						"pcidev=%s\n"		\
+						"region=%s\n"		\
+						"uuid=%s\n"		\
+						"hpa=0x%lx\n"		\
+						"dpa=0x%lx\n"		\
+						"length=%d\n"		\
+						"source=%s\n"		\
+						"flags=%d\n"		\
+						"overflow_timestamp=%s\n", \
+						ev->timestamp,		\
+						ev->memdev,		\
+						ev->pcidev,		\
+						ev->region,		\
+						ev->uuid,		\
+						ev->hpa,		\
+						ev->dpa,		\
+						ev->length,		\
+						ev->source,		\
+						ev->flags,		\
+						ev->overflow_ts);
+
+	strcat(buf, bt_buf);
+
+	return 0;
+}
+
 static int commit_report_backtrace(int sockfd, int type, void *ev){
 	char buf[MAX_BACKTRACE_SIZE];
 	char *pbuf = buf;
@@ -368,6 +404,9 @@ static int commit_report_backtrace(int sockfd, int type, void *ev){
 	case MF_EVENT:
 		rc = set_mf_event_backtrace(buf, (struct ras_mf_event *)ev);
 		break;
+	case CXL_POISON_EVENT:
+		rc = set_cxl_poison_event_backtrace(buf, (struct ras_cxl_poison_event *)ev);
+		break;
 	default:
 		return -1;
 	}
@@ -776,3 +815,47 @@ mf_fail:
 	else
 		return -1;
 }
+
+int ras_report_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev)
+{
+	char buf[MAX_MESSAGE_SIZE];
+	int sockfd = 0;
+	int done = 0;
+	int rc = -1;
+
+	memset(buf, 0, sizeof(buf));
+
+	sockfd = setup_report_socket();
+	if (sockfd < 0)
+		return -1;
+
+	rc = commit_report_basic(sockfd);
+	if (rc < 0)
+		goto cxl_poison_fail;
+
+	rc = commit_report_backtrace(sockfd, CXL_POISON_EVENT, ev);
+	if (rc < 0)
+		goto cxl_poison_fail;
+
+	sprintf(buf, "ANALYZER=%s", "rasdaemon-cxl-poison");
+	rc = write(sockfd, buf, strlen(buf) + 1);
+	if (rc < strlen(buf) + 1)
+		goto cxl_poison_fail;
+
+	sprintf(buf, "REASON=%s", "CXL poison");
+	rc = write(sockfd, buf, strlen(buf) + 1);
+	if (rc < strlen(buf) + 1)
+		goto cxl_poison_fail;
+
+	done = 1;
+
+cxl_poison_fail:
+
+	if (sockfd >= 0)
+		close(sockfd);
+
+	if (done)
+		return 0;
+	else
+		return -1;
+}
diff --git a/ras-report.h b/ras-report.h
index e605eb1..d1591ce 100644
--- a/ras-report.h
+++ b/ras-report.h
@@ -39,6 +39,7 @@ int ras_report_arm_event(struct ras_events *ras, struct ras_arm_event *ev);
 int ras_report_devlink_event(struct ras_events *ras, struct devlink_event *ev);
 int ras_report_diskerror_event(struct ras_events *ras, struct diskerror_event *ev);
 int ras_report_mf_event(struct ras_events *ras, struct ras_mf_event *ev);
+int ras_report_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev);
 
 #else
 
@@ -50,6 +51,7 @@ static inline int ras_report_arm_event(struct ras_events *ras, struct ras_arm_ev
 static inline int ras_report_devlink_event(struct ras_events *ras, struct devlink_event *ev) { return 0; };
 static inline int ras_report_diskerror_event(struct ras_events *ras, struct diskerror_event *ev) { return 0; };
 static inline int ras_report_mf_event(struct ras_events *ras, struct ras_mf_event *ev) { return 0; };
+static inline int ras_report_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev) { return 0; };
 
 #endif
 
-- 
2.25.1
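
set_cxl_poison_event_backtrace() above formats the event with sprintf() and
the "%lx" conversion for 64-bit fields. As a possible alternative, the sketch
below shows a bounded variant using snprintf() and the <inttypes.h> format
macros. It is not part of the patch: the struct is a hypothetical,
trimmed-down stand-in for struct ras_cxl_poison_event, and the function name
is made up for the example.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for struct ras_cxl_poison_event. */
struct example_poison_event {
	const char *memdev;
	uint64_t hpa;
	uint64_t dpa;
	uint32_t length;
};

/*
 * Format a subset of the fields into out. Returns 0 on success and -1 if
 * the text did not fit (snprintf reports the length it wanted to write).
 */
static int example_format_backtrace(char *out, size_t out_len,
				    const struct example_poison_event *ev)
{
	int n;

	assert(out != NULL && ev != NULL);
	n = snprintf(out, out_len,
		     "BACKTRACE=memdev=%s\n"
		     "hpa=0x%" PRIx64 "\n"
		     "dpa=0x%" PRIx64 "\n"
		     "length=%" PRIu32 "\n",
		     ev->memdev, ev->hpa, ev->dpa, ev->length);
	if (n < 0 || (size_t)n >= out_len)
		return -1;
	return 0;
}
```

PRIx64 and PRIu32 expand to the correct conversion for the fixed-width types
on both 32-bit and 64-bit builds, and the length check lets the caller detect
truncation instead of overrunning the buffer.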



* [PATCH V2 3/4] rasdaemon: Add support for the CXL AER uncorrectable errors
  2023-01-24 16:57 [PATCH V2 0/4] rasdaemon: Add support for the CXL error events shiju.jose
  2023-01-24 16:57 ` [PATCH V2 1/4] rasdaemon: Move definition for BIT and BIT_ULL to a common file shiju.jose
  2023-01-24 16:57 ` [PATCH V2 2/4] rasdaemon: Add support for the CXL poison events shiju.jose
@ 2023-01-24 16:57 ` shiju.jose
  2023-01-25 16:54   ` Dave Jiang
  2023-01-24 16:57 ` [PATCH V2 4/4] rasdaemon: Add support for the CXL AER correctable errors shiju.jose
  3 siblings, 1 reply; 11+ messages in thread
From: shiju.jose @ 2023-01-24 16:57 UTC (permalink / raw)
  To: linux-edac, linux-cxl, mchehab; +Cc: jonathan.cameron, linuxarm, shiju.jose

From: Shiju Jose <shiju.jose@huawei.com>

Add support to log and record the CXL AER uncorrectable errors.

The corresponding kernel patch is here:
https://patchwork.kernel.org/project/cxl/patch/166974413388.1608150.5875712482260436188.stgit@djiang5-desk3.ch.intel.com/

It was found that the header log data has to be converted to big-endian
format to be stored correctly in the SQLite database, likely because the
SQLite database seems to use big-endian storage.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
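To illustrate the effect of the htobe32() conversion on the header log, here
is a minimal sketch. The function name is hypothetical. It relies on
htobe32() from <endian.h>, the same glibc header the patch includes. After
conversion, the first byte of each 32-bit word in memory is its most
significant byte, so a byte-wise dump of the stored BLOB should read the same
as the "%08x" words printed in the log, whatever the host byte order.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <endian.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy n 32-bit words into out as raw bytes after converting each word
 * to big-endian, so byte 0 of every word is its most significant byte
 * regardless of host endianness. out must hold at least 4 * n bytes.
 */
static void example_words_to_be_bytes(const uint32_t *words, size_t n,
				      uint8_t *out)
{
	size_t i;

	assert(words != NULL && out != NULL);
	for (i = 0; i < n; i++) {
		uint32_t be = htobe32(words[i]);

		memcpy(out + 4 * i, &be, sizeof(be));
	}
}
```

Unlike the handler, which converts ev.header_log in place, this sketch writes
to a separate buffer and leaves the input words untouched.
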
 ras-cxl-handler.c | 125 ++++++++++++++++++++++++++++++++++++++++++++++
 ras-cxl-handler.h |   5 ++
 ras-events.c      |   9 ++++
 ras-events.h      |   1 +
 ras-record.c      |  65 ++++++++++++++++++++++++
 ras-record.h      |  16 ++++++
 ras-report.c      |  69 +++++++++++++++++++++++++
 ras-report.h      |   2 +
 8 files changed, 292 insertions(+)

diff --git a/ras-cxl-handler.c b/ras-cxl-handler.c
index 0b7cdca..d042dc9 100644
--- a/ras-cxl-handler.c
+++ b/ras-cxl-handler.c
@@ -21,6 +21,7 @@
 #include "ras-record.h"
 #include "ras-logger.h"
 #include "ras-report.h"
+#include <endian.h>
 
 /* Poison List: Payload out flags */
 #define CXL_POISON_FLAG_MORE            BIT(0)
@@ -160,3 +161,127 @@ int ras_cxl_poison_event_handler(struct trace_seq *s,
 
 	return 0;
 }
+
+/* CXL AER Errors */
+
+#define CXL_AER_UE_CACHE_DATA_PARITY	BIT(0)
+#define CXL_AER_UE_CACHE_ADDR_PARITY	BIT(1)
+#define CXL_AER_UE_CACHE_BE_PARITY	BIT(2)
+#define CXL_AER_UE_CACHE_DATA_ECC	BIT(3)
+#define CXL_AER_UE_MEM_DATA_PARITY	BIT(4)
+#define CXL_AER_UE_MEM_ADDR_PARITY	BIT(5)
+#define CXL_AER_UE_MEM_BE_PARITY	BIT(6)
+#define CXL_AER_UE_MEM_DATA_ECC		BIT(7)
+#define CXL_AER_UE_REINIT_THRESH	BIT(8)
+#define CXL_AER_UE_RSVD_ENCODE		BIT(9)
+#define CXL_AER_UE_POISON		BIT(10)
+#define CXL_AER_UE_RECV_OVERFLOW	BIT(11)
+#define CXL_AER_UE_INTERNAL_ERR		BIT(14)
+#define CXL_AER_UE_IDE_TX_ERR		BIT(15)
+#define CXL_AER_UE_IDE_RX_ERR		BIT(16)
+
+struct cxl_error_list {
+	uint32_t bit;
+	const char *error;
+};
+
+static const struct cxl_error_list cxl_aer_ue[] = {
+	{ .bit = CXL_AER_UE_CACHE_DATA_PARITY, .error = "Cache Data Parity Error" },
+	{ .bit = CXL_AER_UE_CACHE_ADDR_PARITY, .error = "Cache Address Parity Error" },
+	{ .bit = CXL_AER_UE_CACHE_BE_PARITY, .error = "Cache Byte Enable Parity Error" },
+	{ .bit = CXL_AER_UE_CACHE_DATA_ECC, .error = "Cache Data ECC Error" },
+	{ .bit = CXL_AER_UE_MEM_DATA_PARITY, .error = "Memory Data Parity Error" },
+	{ .bit = CXL_AER_UE_MEM_ADDR_PARITY, .error = "Memory Address Parity Error" },
+	{ .bit = CXL_AER_UE_MEM_BE_PARITY, .error = "Memory Byte Enable Parity Error" },
+	{ .bit = CXL_AER_UE_MEM_DATA_ECC, .error = "Memory Data ECC Error" },
+	{ .bit = CXL_AER_UE_REINIT_THRESH, .error = "REINIT Threshold Hit" },
+	{ .bit = CXL_AER_UE_RSVD_ENCODE, .error = "Received Unrecognized Encoding" },
+	{ .bit = CXL_AER_UE_POISON, .error = "Received Poison From Peer" },
+	{ .bit = CXL_AER_UE_RECV_OVERFLOW, .error = "Receiver Overflow" },
+	{ .bit = CXL_AER_UE_INTERNAL_ERR, .error = "Component Specific Error" },
+	{ .bit = CXL_AER_UE_IDE_TX_ERR, .error = "IDE Tx Error" },
+	{ .bit = CXL_AER_UE_IDE_RX_ERR, .error = "IDE Rx Error" },
+};
+
+static void decode_cxl_error_status(struct trace_seq *s, uint32_t status,
+				   const struct cxl_error_list *cxl_error_list,
+				   uint8_t num_elems)
+{
+	int i;
+
+	for (i = 0; i < num_elems; i++) {
+		if (status & cxl_error_list[i].bit)
+			trace_seq_printf(s, "\'%s\' ", cxl_error_list[i].error);
+	}
+}
+
+int ras_cxl_aer_ue_event_handler(struct trace_seq *s,
+				 struct tep_record *record,
+				 struct tep_event *event, void *context)
+{
+	int len, i;
+	unsigned long long val;
+	time_t now;
+	struct tm *tm;
+	struct ras_events *ras = context;
+	struct ras_cxl_aer_ue_event ev;
+
+	memset(&ev, 0, sizeof(ev));
+	now = record->ts/user_hz + ras->uptime_diff;
+	tm = localtime(&now);
+	if (tm)
+		strftime(ev.timestamp, sizeof(ev.timestamp),
+			 "%Y-%m-%d %H:%M:%S %z", tm);
+	else
+		strncpy(ev.timestamp, "1970-01-01 00:00:00 +0000", sizeof(ev.timestamp));
+	trace_seq_printf(s, "%s ", ev.timestamp);
+
+	ev.dev_name = tep_get_field_raw(s, event, "dev_name",
+					record, &len, 1);
+	if (!ev.dev_name)
+		return -1;
+	trace_seq_printf(s, "dev_name:%s ", ev.dev_name);
+
+	if (tep_get_field_val(s, event, "status", record, &val, 1) < 0)
+		return -1;
+	ev.error_status = val;
+
+	trace_seq_printf(s, "error status:");
+	decode_cxl_error_status(s, ev.error_status,
+				cxl_aer_ue, ARRAY_SIZE(cxl_aer_ue));
+
+	if (tep_get_field_val(s,  event, "first_error", record, &val, 1) < 0)
+		return -1;
+	ev.first_error = val;
+
+	trace_seq_printf(s, "first error:");
+	decode_cxl_error_status(s, ev.first_error,
+				cxl_aer_ue, ARRAY_SIZE(cxl_aer_ue));
+
+	ev.header_log = tep_get_field_raw(s, event, "header_log",
+					  record, &len, 1);
+	if (!ev.header_log)
+		return -1;
+	trace_seq_printf(s, "header log:\n");
+	for (i = 0; i < CXL_HEADERLOG_SIZE_U32; i++) {
+		trace_seq_printf(s, "%08x ", ev.header_log[i]);
+		if ((i > 0) && ((i % 20) == 0))
+			trace_seq_printf(s, "\n");
+		/* Convert header log data to the big-endian format because
+		 * the SQLite database seems to use big-endian storage.
+		 */
+		ev.header_log[i] = htobe32(ev.header_log[i]);
+	}
+
+	/* Insert data into the database */
+#ifdef HAVE_SQLITE3
+	ras_store_cxl_aer_ue_event(ras, &ev);
+#endif
+
+#ifdef HAVE_ABRT_REPORT
+	/* Report event to ABRT */
+	ras_report_cxl_aer_ue_event(ras, &ev);
+#endif
+
+	return 0;
+}
diff --git a/ras-cxl-handler.h b/ras-cxl-handler.h
index 84d5cc6..18b3120 100644
--- a/ras-cxl-handler.h
+++ b/ras-cxl-handler.h
@@ -21,4 +21,9 @@
 int ras_cxl_poison_event_handler(struct trace_seq *s,
 				 struct tep_record *record,
 				 struct tep_event *event, void *context);
+
+int ras_cxl_aer_ue_event_handler(struct trace_seq *s,
+				 struct tep_record *record,
+				 struct tep_event *event, void *context);
+
 #endif
diff --git a/ras-events.c b/ras-events.c
index 6555125..ead792b 100644
--- a/ras-events.c
+++ b/ras-events.c
@@ -246,6 +246,7 @@ int toggle_ras_mc_event(int enable)
 
 #ifdef HAVE_CXL
 	rc |= __toggle_ras_mc_event(ras, "cxl", "cxl_poison", enable);
+	rc |= __toggle_ras_mc_event(ras, "cxl", "cxl_aer_uncorrectable_error", enable);
 #endif
 
 free_ras:
@@ -964,6 +965,14 @@ int handle_ras_events(int record_events)
 	else
 		log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
 		    "cxl", "cxl_poison");
+
+	rc = add_event_handler(ras, pevent, page_size, "cxl", "cxl_aer_uncorrectable_error",
+			       ras_cxl_aer_ue_event_handler, NULL, CXL_AER_UE_EVENT);
+	if (!rc)
+		num_events++;
+	else
+		log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
+		    "cxl", "cxl_aer_uncorrectable_error");
 #endif
 
 	if (!num_events) {
diff --git a/ras-events.h b/ras-events.h
index fc51070..65f9d9a 100644
--- a/ras-events.h
+++ b/ras-events.h
@@ -40,6 +40,7 @@ enum {
 	DISKERROR_EVENT,
 	MF_EVENT,
 	CXL_POISON_EVENT,
+	CXL_AER_UE_EVENT,
 	NR_EVENTS
 };
 
diff --git a/ras-record.c b/ras-record.c
index f54fb41..4703790 100644
--- a/ras-record.c
+++ b/ras-record.c
@@ -618,6 +618,54 @@ int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_eve
 
 	return rc;
 }
+
+/*
+ * Table and functions to handle cxl:cxl_aer_uncorrectable_error
+ */
+static const struct db_fields cxl_aer_ue_event_fields[] = {
+	{ .name = "id",                   .type = "INTEGER PRIMARY KEY" },
+	{ .name = "timestamp",            .type = "TEXT" },
+	{ .name = "dev_name",             .type = "TEXT" },
+	{ .name = "error_status",         .type = "INTEGER" },
+	{ .name = "first_error",          .type = "INTEGER" },
+	{ .name = "header_log",           .type = "BLOB" },
+};
+
+static const struct db_table_descriptor cxl_aer_ue_event_tab = {
+	.name = "cxl_aer_ue_event",
+	.fields = cxl_aer_ue_event_fields,
+	.num_fields = ARRAY_SIZE(cxl_aer_ue_event_fields),
+};
+
+int ras_store_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev)
+{
+	int rc;
+	struct sqlite3_priv *priv = ras->db_priv;
+
+	if (!priv || !priv->stmt_cxl_aer_ue_event)
+		return 0;
+	log(TERM, LOG_INFO, "cxl_aer_ue_event store: %p\n", priv->stmt_cxl_aer_ue_event);
+
+	sqlite3_bind_text(priv->stmt_cxl_aer_ue_event, 1, ev->timestamp, -1, NULL);
+	sqlite3_bind_text(priv->stmt_cxl_aer_ue_event, 2, ev->dev_name, -1, NULL);
+	sqlite3_bind_int(priv->stmt_cxl_aer_ue_event, 3, ev->error_status);
+	sqlite3_bind_int(priv->stmt_cxl_aer_ue_event, 4, ev->first_error);
+	sqlite3_bind_blob(priv->stmt_cxl_aer_ue_event, 5, ev->header_log, CXL_HEADERLOG_SIZE, NULL);
+
+	rc = sqlite3_step(priv->stmt_cxl_aer_ue_event);
+	if (rc != SQLITE_OK && rc != SQLITE_DONE)
+		log(TERM, LOG_ERR,
+		    "Failed to do cxl_aer_ue_event step on sqlite: error = %d\n", rc);
+	rc = sqlite3_reset(priv->stmt_cxl_aer_ue_event);
+	if (rc != SQLITE_OK && rc != SQLITE_DONE)
+		log(TERM, LOG_ERR,
+		    "Failed reset cxl_aer_ue_event on sqlite: error = %d\n",
+		    rc);
+	log(TERM, LOG_INFO, "register inserted at db\n");
+
+	return rc;
+}
+
 #endif
 
 /*
@@ -965,6 +1013,15 @@ int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras)
 		if (rc != SQLITE_OK)
 			goto error;
 	}
+
+	rc = ras_mc_create_table(priv, &cxl_aer_ue_event_tab);
+	if (rc == SQLITE_OK) {
+		rc = ras_mc_prepare_stmt(priv, &priv->stmt_cxl_aer_ue_event,
+					 &cxl_aer_ue_event_tab);
+		if (rc != SQLITE_OK)
+			goto error;
+	}
+
 #endif
 
 	ras->db_priv = priv;
@@ -1087,6 +1144,14 @@ int ras_mc_event_closedb(unsigned int cpu, struct ras_events *ras)
 			    "cpu %u: Failed to finalize cxl_poison_event sqlite: error = %d\n",
 			    cpu, rc);
 	}
+
+	if (priv->stmt_cxl_aer_ue_event) {
+		rc = sqlite3_finalize(priv->stmt_cxl_aer_ue_event);
+		if (rc != SQLITE_OK)
+			log(TERM, LOG_ERR,
+			    "cpu %u: Failed to finalize cxl_aer_ue_event sqlite: error = %d\n",
+			    cpu, rc);
+	}
 #endif
 
 	rc = sqlite3_close_v2(db);
diff --git a/ras-record.h b/ras-record.h
index e5bf483..0e2c178 100644
--- a/ras-record.h
+++ b/ras-record.h
@@ -128,6 +128,18 @@ struct ras_cxl_poison_event {
 	char overflow_ts[64];
 };
 
+#define SZ_512                          0x200
+#define CXL_HEADERLOG_SIZE              SZ_512
+#define CXL_HEADERLOG_SIZE_U32          (SZ_512 / sizeof(uint32_t))
+
+struct ras_cxl_aer_ue_event {
+	char timestamp[64];
+	const char *dev_name;
+	uint32_t error_status;
+	uint32_t first_error;
+	uint32_t *header_log;
+};
+
 struct ras_mc_event;
 struct ras_aer_event;
 struct ras_extlog_event;
@@ -138,6 +150,7 @@ struct devlink_event;
 struct diskerror_event;
 struct ras_mf_event;
 struct ras_cxl_poison_event;
+struct ras_cxl_aer_ue_event;
 
 #ifdef HAVE_SQLITE3
 
@@ -172,6 +185,7 @@ struct sqlite3_priv {
 #endif
 #ifdef HAVE_CXL
 	sqlite3_stmt	*stmt_cxl_poison_event;
+	sqlite3_stmt	*stmt_cxl_aer_ue_event;
 #endif
 };
 
@@ -201,6 +215,7 @@ int ras_store_devlink_event(struct ras_events *ras, struct devlink_event *ev);
 int ras_store_diskerror_event(struct ras_events *ras, struct diskerror_event *ev);
 int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev);
 int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev);
+int ras_store_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev);
 
 #else
 static inline int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras) { return 0; };
@@ -215,6 +230,7 @@ static inline int ras_store_devlink_event(struct ras_events *ras, struct devlink
 static inline int ras_store_diskerror_event(struct ras_events *ras, struct diskerror_event *ev) { return 0; };
 static inline int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev) { return 0; };
 static inline int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev) { return 0; };
+static inline int ras_store_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev) { return 0; };
 
 #endif
 
diff --git a/ras-report.c b/ras-report.c
index 589e640..4c09061 100644
--- a/ras-report.c
+++ b/ras-report.c
@@ -367,6 +367,28 @@ static int set_cxl_poison_event_backtrace(char *buf, struct ras_cxl_poison_event
 	return 0;
 }
 
+static int set_cxl_aer_ue_event_backtrace(char *buf, struct ras_cxl_aer_ue_event *ev)
+{
+	char bt_buf[MAX_BACKTRACE_SIZE];
+
+	if (!buf || !ev)
+		return -1;
+
+	sprintf(bt_buf, "BACKTRACE="	\
+						"timestamp=%s\n"	\
+						"dev_name=%s\n"		\
+						"error_status=%u\n"	\
+						"first_error=%u\n",	\
+						ev->timestamp,		\
+						ev->dev_name,		\
+						ev->error_status,	\
+						ev->first_error);
+
+	strcat(buf, bt_buf);
+
+	return 0;
+}
+
 static int commit_report_backtrace(int sockfd, int type, void *ev){
 	char buf[MAX_BACKTRACE_SIZE];
 	char *pbuf = buf;
@@ -407,6 +429,9 @@ static int commit_report_backtrace(int sockfd, int type, void *ev){
 	case CXL_POISON_EVENT:
 		rc = set_cxl_poison_event_backtrace(buf, (struct ras_cxl_poison_event *)ev);
 		break;
+	case CXL_AER_UE_EVENT:
+		rc = set_cxl_aer_ue_event_backtrace(buf, (struct ras_cxl_aer_ue_event *)ev);
+		break;
 	default:
 		return -1;
 	}
@@ -859,3 +884,47 @@ cxl_poison_fail:
 	else
 		return -1;
 }
+
+int ras_report_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev)
+{
+	char buf[MAX_MESSAGE_SIZE];
+	int sockfd = 0;
+	int done = 0;
+	int rc = -1;
+
+	memset(buf, 0, sizeof(buf));
+
+	sockfd = setup_report_socket();
+	if (sockfd < 0)
+		return -1;
+
+	rc = commit_report_basic(sockfd);
+	if (rc < 0)
+		goto cxl_aer_ue_fail;
+
+	rc = commit_report_backtrace(sockfd, CXL_AER_UE_EVENT, ev);
+	if (rc < 0)
+		goto cxl_aer_ue_fail;
+
+	sprintf(buf, "ANALYZER=%s", "rasdaemon-cxl-aer-uncorrectable-error");
+	rc = write(sockfd, buf, strlen(buf) + 1);
+	if (rc < strlen(buf) + 1)
+		goto cxl_aer_ue_fail;
+
+	sprintf(buf, "REASON=%s", "CXL AER uncorrectable error");
+	rc = write(sockfd, buf, strlen(buf) + 1);
+	if (rc < strlen(buf) + 1)
+		goto cxl_aer_ue_fail;
+
+	done = 1;
+
+cxl_aer_ue_fail:
+
+	if (sockfd >= 0)
+		close(sockfd);
+
+	if (done)
+		return 0;
+	else
+		return -1;
+}
diff --git a/ras-report.h b/ras-report.h
index d1591ce..dfe89d1 100644
--- a/ras-report.h
+++ b/ras-report.h
@@ -40,6 +40,7 @@ int ras_report_devlink_event(struct ras_events *ras, struct devlink_event *ev);
 int ras_report_diskerror_event(struct ras_events *ras, struct diskerror_event *ev);
 int ras_report_mf_event(struct ras_events *ras, struct ras_mf_event *ev);
 int ras_report_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev);
+int ras_report_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev);
 
 #else
 
@@ -52,6 +53,7 @@ static inline int ras_report_devlink_event(struct ras_events *ras, struct devlin
 static inline int ras_report_diskerror_event(struct ras_events *ras, struct diskerror_event *ev) { return 0; };
 static inline int ras_report_mf_event(struct ras_events *ras, struct ras_mf_event *ev) { return 0; };
 static inline int ras_report_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev) { return 0; };
+static inline int ras_report_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev) { return 0; };
 
 #endif
 
-- 
2.25.1


* [PATCH V2 4/4] rasdaemon: Add support for the CXL AER correctable errors
  2023-01-24 16:57 [PATCH V2 0/4] rasdaemon: Add support for the CXL error events shiju.jose
                   ` (2 preceding siblings ...)
  2023-01-24 16:57 ` [PATCH V2 3/4] rasdaemon: Add support for the CXL AER uncorrectable errors shiju.jose
@ 2023-01-24 16:57 ` shiju.jose
  2023-01-25 16:56   ` Dave Jiang
  3 siblings, 1 reply; 11+ messages in thread
From: shiju.jose @ 2023-01-24 16:57 UTC (permalink / raw)
  To: linux-edac, linux-cxl, mchehab; +Cc: jonathan.cameron, linuxarm, shiju.jose

From: Shiju Jose <shiju.jose@huawei.com>

Add support to log and record the CXL AER correctable errors.

The corresponding kernel patch is here:
https://patchwork.kernel.org/project/cxl/patch/166974413388.1608150.5875712482260436188.stgit@djiang5-desk3.ch.intel.com/

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 ras-cxl-handler.c | 64 ++++++++++++++++++++++++++++++++++++++++++++
 ras-cxl-handler.h |  3 +++
 ras-events.c      |  9 +++++++
 ras-events.h      |  1 +
 ras-record.c      | 57 ++++++++++++++++++++++++++++++++++++++++
 ras-record.h      | 10 +++++++
 ras-report.c      | 67 +++++++++++++++++++++++++++++++++++++++++++++++
 ras-report.h      |  2 ++
 8 files changed, 213 insertions(+)

diff --git a/ras-cxl-handler.c b/ras-cxl-handler.c
index d042dc9..c2775c8 100644
--- a/ras-cxl-handler.c
+++ b/ras-cxl-handler.c
@@ -180,6 +180,14 @@ int ras_cxl_poison_event_handler(struct trace_seq *s,
 #define CXL_AER_UE_IDE_TX_ERR		BIT(15)
 #define CXL_AER_UE_IDE_RX_ERR		BIT(16)
 
+#define CXL_AER_CE_CACHE_DATA_ECC	BIT(0)
+#define CXL_AER_CE_MEM_DATA_ECC		BIT(1)
+#define CXL_AER_CE_CRC_THRESH		BIT(2)
+#define CXL_AER_CE_RETRY_THRESH		BIT(3)
+#define CXL_AER_CE_CACHE_POISON		BIT(4)
+#define CXL_AER_CE_MEM_POISON		BIT(5)
+#define CXL_AER_CE_PHYS_LAYER_ERR	BIT(6)
+
 struct cxl_error_list {
 	uint32_t bit;
 	const char *error;
@@ -203,6 +211,16 @@ static const struct cxl_error_list cxl_aer_ue[] = {
 	{ .bit = CXL_AER_UE_IDE_RX_ERR, .error = "IDE Rx Error" },
 };
 
+static const struct cxl_error_list cxl_aer_ce[] = {
+	{ .bit = CXL_AER_CE_CACHE_DATA_ECC, .error = "Cache Data ECC Error" },
+	{ .bit = CXL_AER_CE_MEM_DATA_ECC, .error = "Memory Data ECC Error" },
+	{ .bit = CXL_AER_CE_CRC_THRESH, .error = "CRC Threshold Hit" },
+	{ .bit = CXL_AER_CE_RETRY_THRESH, .error = "Retry Threshold" },
+	{ .bit = CXL_AER_CE_CACHE_POISON, .error = "Received Cache Poison From Peer" },
+	{ .bit = CXL_AER_CE_MEM_POISON, .error = "Received Memory Poison From Peer" },
+	{ .bit = CXL_AER_CE_PHYS_LAYER_ERR, .error = "Received Error From Physical Layer" },
+};
+
 static void decode_cxl_error_status(struct trace_seq *s, uint32_t status,
 				   const struct cxl_error_list *cxl_error_list,
 				   uint8_t num_elems)
@@ -285,3 +303,49 @@ int ras_cxl_aer_ue_event_handler(struct trace_seq *s,
 
 	return 0;
 }
+
+int ras_cxl_aer_ce_event_handler(struct trace_seq *s,
+				 struct tep_record *record,
+				 struct tep_event *event, void *context)
+{
+	int len;
+	unsigned long long val;
+	time_t now;
+	struct tm *tm;
+	struct ras_events *ras = context;
+	struct ras_cxl_aer_ce_event ev;
+
+	now = record->ts/user_hz + ras->uptime_diff;
+	tm = localtime(&now);
+	if (tm)
+		strftime(ev.timestamp, sizeof(ev.timestamp),
+			 "%Y-%m-%d %H:%M:%S %z", tm);
+	else
+		strncpy(ev.timestamp, "1970-01-01 00:00:00 +0000", sizeof(ev.timestamp));
+	trace_seq_printf(s, "%s ", ev.timestamp);
+
+	ev.dev_name = tep_get_field_raw(s, event, "dev_name",
+					record, &len, 1);
+	if (!ev.dev_name)
+		return -1;
+	trace_seq_printf(s, "dev_name:%s ", ev.dev_name);
+
+	if (tep_get_field_val(s, event, "status", record, &val, 1) < 0)
+		return -1;
+	ev.error_status = val;
+	trace_seq_printf(s, "error status:");
+	decode_cxl_error_status(s, ev.error_status,
+				cxl_aer_ce, ARRAY_SIZE(cxl_aer_ce));
+
+	/* Insert data into the database */
+#ifdef HAVE_SQLITE3
+	ras_store_cxl_aer_ce_event(ras, &ev);
+#endif
+
+#ifdef HAVE_ABRT_REPORT
+	/* Report event to ABRT */
+	ras_report_cxl_aer_ce_event(ras, &ev);
+#endif
+
+	return 0;
+}
diff --git a/ras-cxl-handler.h b/ras-cxl-handler.h
index 18b3120..711daf4 100644
--- a/ras-cxl-handler.h
+++ b/ras-cxl-handler.h
@@ -26,4 +26,7 @@ int ras_cxl_aer_ue_event_handler(struct trace_seq *s,
 				 struct tep_record *record,
 				 struct tep_event *event, void *context);
 
+int ras_cxl_aer_ce_event_handler(struct trace_seq *s,
+				 struct tep_record *record,
+				 struct tep_event *event, void *context);
 #endif
diff --git a/ras-events.c b/ras-events.c
index ead792b..3691311 100644
--- a/ras-events.c
+++ b/ras-events.c
@@ -247,6 +247,7 @@ int toggle_ras_mc_event(int enable)
 #ifdef HAVE_CXL
 	rc |= __toggle_ras_mc_event(ras, "cxl", "cxl_poison", enable);
 	rc |= __toggle_ras_mc_event(ras, "cxl", "cxl_aer_uncorrectable_error", enable);
+	rc |= __toggle_ras_mc_event(ras, "cxl", "cxl_aer_correctable_error", enable);
 #endif
 
 free_ras:
@@ -973,6 +974,14 @@ int handle_ras_events(int record_events)
 	else
 		log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
 		    "cxl", "cxl_aer_uncorrectable_error");
+
+	rc = add_event_handler(ras, pevent, page_size, "cxl", "cxl_aer_correctable_error",
+			       ras_cxl_aer_ce_event_handler, NULL, CXL_AER_CE_EVENT);
+	if (!rc)
+		num_events++;
+	else
+		log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
+		    "cxl", "cxl_aer_correctable_error");
 #endif
 
 	if (!num_events) {
diff --git a/ras-events.h b/ras-events.h
index 65f9d9a..dc7bdfb 100644
--- a/ras-events.h
+++ b/ras-events.h
@@ -41,6 +41,7 @@ enum {
 	MF_EVENT,
 	CXL_POISON_EVENT,
 	CXL_AER_UE_EVENT,
+	CXL_AER_CE_EVENT,
 	NR_EVENTS
 };
 
diff --git a/ras-record.c b/ras-record.c
index 4703790..c318a18 100644
--- a/ras-record.c
+++ b/ras-record.c
@@ -666,6 +666,48 @@ int ras_store_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_eve
 	return rc;
 }
 
+/*
+ * Table and functions to handle cxl:cxl_aer_correctable_error
+ */
+static const struct db_fields cxl_aer_ce_event_fields[] = {
+	{ .name = "id",                   .type = "INTEGER PRIMARY KEY" },
+	{ .name = "timestamp",            .type = "TEXT" },
+	{ .name = "dev_name",             .type = "TEXT" },
+	{ .name = "error_status",         .type = "INTEGER" },
+};
+
+static const struct db_table_descriptor cxl_aer_ce_event_tab = {
+	.name = "cxl_aer_ce_event",
+	.fields = cxl_aer_ce_event_fields,
+	.num_fields = ARRAY_SIZE(cxl_aer_ce_event_fields),
+};
+
+int ras_store_cxl_aer_ce_event(struct ras_events *ras, struct ras_cxl_aer_ce_event *ev)
+{
+	int rc;
+	struct sqlite3_priv *priv = ras->db_priv;
+
+	if (!priv || !priv->stmt_cxl_aer_ce_event)
+		return 0;
+	log(TERM, LOG_INFO, "cxl_aer_ce_event store: %p\n", priv->stmt_cxl_aer_ce_event);
+
+	sqlite3_bind_text(priv->stmt_cxl_aer_ce_event, 1, ev->timestamp, -1, NULL);
+	sqlite3_bind_text(priv->stmt_cxl_aer_ce_event, 2, ev->dev_name, -1, NULL);
+	sqlite3_bind_int(priv->stmt_cxl_aer_ce_event, 3, ev->error_status);
+
+	rc = sqlite3_step(priv->stmt_cxl_aer_ce_event);
+	if (rc != SQLITE_OK && rc != SQLITE_DONE)
+		log(TERM, LOG_ERR,
+		    "Failed to do cxl_aer_ce_event step on sqlite: error = %d\n", rc);
+	rc = sqlite3_reset(priv->stmt_cxl_aer_ce_event);
+	if (rc != SQLITE_OK && rc != SQLITE_DONE)
+		log(TERM, LOG_ERR,
+		    "Failed reset cxl_aer_ce_event on sqlite: error = %d\n",
+		    rc);
+	log(TERM, LOG_INFO, "register inserted at db\n");
+
+	return rc;
+}
 #endif
 
 /*
@@ -1022,6 +1064,13 @@ int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras)
 			goto error;
 	}
 
+	rc = ras_mc_create_table(priv, &cxl_aer_ce_event_tab);
+	if (rc == SQLITE_OK) {
+		rc = ras_mc_prepare_stmt(priv, &priv->stmt_cxl_aer_ce_event,
+					 &cxl_aer_ce_event_tab);
+		if (rc != SQLITE_OK)
+			goto error;
+	}
 #endif
 
 	ras->db_priv = priv;
@@ -1152,6 +1201,14 @@ int ras_mc_event_closedb(unsigned int cpu, struct ras_events *ras)
 			    "cpu %u: Failed to finalize cxl_aer_ue_event sqlite: error = %d\n",
 			    cpu, rc);
 	}
+
+	if (priv->stmt_cxl_aer_ce_event) {
+		rc = sqlite3_finalize(priv->stmt_cxl_aer_ce_event);
+		if (rc != SQLITE_OK)
+			log(TERM, LOG_ERR,
+			    "cpu %u: Failed to finalize cxl_aer_ce_event sqlite: error = %d\n",
+			    cpu, rc);
+	}
 #endif
 
 	rc = sqlite3_close_v2(db);
diff --git a/ras-record.h b/ras-record.h
index 0e2c178..1f28cc1 100644
--- a/ras-record.h
+++ b/ras-record.h
@@ -140,6 +140,12 @@ struct ras_cxl_aer_ue_event {
 	uint32_t *header_log;
 };
 
+struct ras_cxl_aer_ce_event {
+	char timestamp[64];
+	const char *dev_name;
+	uint32_t error_status;
+};
+
 struct ras_mc_event;
 struct ras_aer_event;
 struct ras_extlog_event;
@@ -151,6 +157,7 @@ struct diskerror_event;
 struct ras_mf_event;
 struct ras_cxl_poison_event;
 struct ras_cxl_aer_ue_event;
+struct ras_cxl_aer_ce_event;
 
 #ifdef HAVE_SQLITE3
 
@@ -186,6 +193,7 @@ struct sqlite3_priv {
 #ifdef HAVE_CXL
 	sqlite3_stmt	*stmt_cxl_poison_event;
 	sqlite3_stmt	*stmt_cxl_aer_ue_event;
+	sqlite3_stmt	*stmt_cxl_aer_ce_event;
 #endif
 };
 
@@ -216,6 +224,7 @@ int ras_store_diskerror_event(struct ras_events *ras, struct diskerror_event *ev
 int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev);
 int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev);
 int ras_store_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev);
+int ras_store_cxl_aer_ce_event(struct ras_events *ras, struct ras_cxl_aer_ce_event *ev);
 
 #else
 static inline int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras) { return 0; };
@@ -231,6 +240,7 @@ static inline int ras_store_diskerror_event(struct ras_events *ras, struct diske
 static inline int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev) { return 0; };
 static inline int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev) { return 0; };
 static inline int ras_store_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev) { return 0; };
+static inline int ras_store_cxl_aer_ce_event(struct ras_events *ras, struct ras_cxl_aer_ce_event *ev) { return 0; };
 
 #endif
 
diff --git a/ras-report.c b/ras-report.c
index 4c09061..796abab 100644
--- a/ras-report.c
+++ b/ras-report.c
@@ -389,6 +389,26 @@ static int set_cxl_aer_ue_event_backtrace(char *buf, struct ras_cxl_aer_ue_event
 	return 0;
 }
 
+static int set_cxl_aer_ce_event_backtrace(char *buf, struct ras_cxl_aer_ce_event *ev)
+{
+	char bt_buf[MAX_BACKTRACE_SIZE];
+
+	if (!buf || !ev)
+		return -1;
+
+	sprintf(bt_buf, "BACKTRACE="	\
+						"timestamp=%s\n"	\
+						"dev_name=%s\n"		\
+						"error_status=%u\n",	\
+						ev->timestamp,		\
+						ev->dev_name,		\
+						ev->error_status);
+
+	strcat(buf, bt_buf);
+
+	return 0;
+}
+
 static int commit_report_backtrace(int sockfd, int type, void *ev){
 	char buf[MAX_BACKTRACE_SIZE];
 	char *pbuf = buf;
@@ -432,6 +452,9 @@ static int commit_report_backtrace(int sockfd, int type, void *ev){
 	case CXL_AER_UE_EVENT:
 		rc = set_cxl_aer_ue_event_backtrace(buf, (struct ras_cxl_aer_ue_event *)ev);
 		break;
+	case CXL_AER_CE_EVENT:
+		rc = set_cxl_aer_ce_event_backtrace(buf, (struct ras_cxl_aer_ce_event *)ev);
+		break;
 	default:
 		return -1;
 	}
@@ -928,3 +951,47 @@ cxl_aer_ue_fail:
 	else
 		return -1;
 }
+
+int ras_report_cxl_aer_ce_event(struct ras_events *ras, struct ras_cxl_aer_ce_event *ev)
+{
+	char buf[MAX_MESSAGE_SIZE];
+	int sockfd = 0;
+	int done = 0;
+	int rc = -1;
+
+	memset(buf, 0, sizeof(buf));
+
+	sockfd = setup_report_socket();
+	if (sockfd < 0)
+		return -1;
+
+	rc = commit_report_basic(sockfd);
+	if (rc < 0)
+		goto cxl_aer_ce_fail;
+
+	rc = commit_report_backtrace(sockfd, CXL_AER_CE_EVENT, ev);
+	if (rc < 0)
+		goto cxl_aer_ce_fail;
+
+	sprintf(buf, "ANALYZER=%s", "rasdaemon-cxl-aer-correctable-error");
+	rc = write(sockfd, buf, strlen(buf) + 1);
+	if (rc < strlen(buf) + 1)
+		goto cxl_aer_ce_fail;
+
+	sprintf(buf, "REASON=%s", "CXL AER correctable error");
+	rc = write(sockfd, buf, strlen(buf) + 1);
+	if (rc < strlen(buf) + 1)
+		goto cxl_aer_ce_fail;
+
+	done = 1;
+
+cxl_aer_ce_fail:
+
+	if (sockfd >= 0)
+		close(sockfd);
+
+	if (done)
+		return 0;
+	else
+		return -1;
+}
diff --git a/ras-report.h b/ras-report.h
index dfe89d1..46155ee 100644
--- a/ras-report.h
+++ b/ras-report.h
@@ -41,6 +41,7 @@ int ras_report_diskerror_event(struct ras_events *ras, struct diskerror_event *e
 int ras_report_mf_event(struct ras_events *ras, struct ras_mf_event *ev);
 int ras_report_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev);
 int ras_report_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev);
+int ras_report_cxl_aer_ce_event(struct ras_events *ras, struct ras_cxl_aer_ce_event *ev);
 
 #else
 
@@ -54,6 +55,7 @@ static inline int ras_report_diskerror_event(struct ras_events *ras, struct disk
 static inline int ras_report_mf_event(struct ras_events *ras, struct ras_mf_event *ev) { return 0; };
 static inline int ras_report_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev) { return 0; };
 static inline int ras_report_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev) { return 0; };
+static inline int ras_report_cxl_aer_ce_event(struct ras_events *ras, struct ras_cxl_aer_ce_event *ev) { return 0; };
 
 #endif
 
-- 
2.25.1


* Re: [PATCH V2 1/4] rasdaemon: Move definition for BIT and BIT_ULL to a common file
  2023-01-24 16:57 ` [PATCH V2 1/4] rasdaemon: Move definition for BIT and BIT_ULL to a common file shiju.jose
@ 2023-01-25 16:34   ` Dave Jiang
  0 siblings, 0 replies; 11+ messages in thread
From: Dave Jiang @ 2023-01-25 16:34 UTC (permalink / raw)
  To: shiju.jose, linux-edac, linux-cxl, mchehab; +Cc: jonathan.cameron, linuxarm



On 1/24/23 9:57 AM, shiju.jose@huawei.com wrote:
> From: Shiju Jose <shiju.jose@huawei.com>
> 
> Move definition for BIT() and BIT_ULL() to the
> common file ras-record.h
> 
> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> ---
>   ras-non-standard-handler.h | 3 ---
>   ras-record.h               | 3 +++
>   2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/ras-non-standard-handler.h b/ras-non-standard-handler.h
> index 4d9f938..c360eaf 100644
> --- a/ras-non-standard-handler.h
> +++ b/ras-non-standard-handler.h
> @@ -17,9 +17,6 @@
>   #include "ras-events.h"
>   #include <traceevent/event-parse.h>
>   
> -#define BIT(nr)                 (1UL << (nr))
> -#define BIT_ULL(nr)             (1ULL << (nr))
> -
>   struct ras_ns_ev_decoder {
>   	struct ras_ns_ev_decoder *next;
>   	const char *sec_type;
> diff --git a/ras-record.h b/ras-record.h
> index d9f7733..219f10b 100644
> --- a/ras-record.h
> +++ b/ras-record.h
> @@ -25,6 +25,9 @@
>   
>   #define ARRAY_SIZE(x) (sizeof(x)/sizeof(*(x)))
>   
> +#define BIT(nr)                 (1UL << (nr))
> +#define BIT_ULL(nr)             (1ULL << (nr))
> +
>   extern long user_hz;
>   
>   struct ras_events;

* Re: [PATCH V2 3/4] rasdaemon: Add support for the CXL AER uncorrectable errors
  2023-01-24 16:57 ` [PATCH V2 3/4] rasdaemon: Add support for the CXL AER uncorrectable errors shiju.jose
@ 2023-01-25 16:54   ` Dave Jiang
  2023-01-26  9:18     ` Shiju Jose
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Jiang @ 2023-01-25 16:54 UTC (permalink / raw)
  To: shiju.jose, linux-edac, linux-cxl, mchehab; +Cc: jonathan.cameron, linuxarm



On 1/24/23 9:57 AM, shiju.jose@huawei.com wrote:
> From: Shiju Jose <shiju.jose@huawei.com>
> 
> Add support to log and record the CXL AER uncorrectable errors.
> 
> The corresponding kernel patch is here:
> https://patchwork.kernel.org/project/cxl/patch/166974413388.1608150.5875712482260436188.stgit@djiang5-desk3.ch.intel.com/
> 
> It was found that the header log data must be converted to
> big-endian format to be stored correctly in the SQLite database,
> likely because SQLite seems to use big-endian storage.
> 
> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
>   ras-cxl-handler.c | 125 ++++++++++++++++++++++++++++++++++++++++++++++
>   ras-cxl-handler.h |   5 ++
>   ras-events.c      |   9 ++++
>   ras-events.h      |   1 +
>   ras-record.c      |  65 ++++++++++++++++++++++++
>   ras-record.h      |  16 ++++++
>   ras-report.c      |  69 +++++++++++++++++++++++++
>   ras-report.h      |   2 +
>   8 files changed, 292 insertions(+)
> 
> diff --git a/ras-cxl-handler.c b/ras-cxl-handler.c
> index 0b7cdca..d042dc9 100644
> --- a/ras-cxl-handler.c
> +++ b/ras-cxl-handler.c
> @@ -21,6 +21,7 @@
>   #include "ras-record.h"
>   #include "ras-logger.h"
>   #include "ras-report.h"
> +#include <endian.h>
>   
>   /* Poison List: Payload out flags */
>   #define CXL_POISON_FLAG_MORE            BIT(0)
> @@ -160,3 +161,127 @@ int ras_cxl_poison_event_handler(struct trace_seq *s,
>   
>   	return 0;
>   }
> +
> +/* CXL AER Errors */
> +
> +#define CXL_AER_UE_CACHE_DATA_PARITY	BIT(0)
> +#define CXL_AER_UE_CACHE_ADDR_PARITY	BIT(1)
> +#define CXL_AER_UE_CACHE_BE_PARITY	BIT(2)
> +#define CXL_AER_UE_CACHE_DATA_ECC	BIT(3)
> +#define CXL_AER_UE_MEM_DATA_PARITY	BIT(4)
> +#define CXL_AER_UE_MEM_ADDR_PARITY	BIT(5)
> +#define CXL_AER_UE_MEM_BE_PARITY	BIT(6)
> +#define CXL_AER_UE_MEM_DATA_ECC		BIT(7)
> +#define CXL_AER_UE_REINIT_THRESH	BIT(8)
> +#define CXL_AER_UE_RSVD_ENCODE		BIT(9)
> +#define CXL_AER_UE_POISON		BIT(10)
> +#define CXL_AER_UE_RECV_OVERFLOW	BIT(11)
> +#define CXL_AER_UE_INTERNAL_ERR		BIT(14)
> +#define CXL_AER_UE_IDE_TX_ERR		BIT(15)
> +#define CXL_AER_UE_IDE_RX_ERR		BIT(16)
> +
> +struct cxl_error_list {
> +	uint32_t bit;
> +	const char *error;
> +};
> +
> +static const struct cxl_error_list cxl_aer_ue[] = {
> +	{ .bit = CXL_AER_UE_CACHE_DATA_PARITY, .error = "Cache Data Parity Error" },
> +	{ .bit = CXL_AER_UE_CACHE_ADDR_PARITY, .error = "Cache Address Parity Error" },
> +	{ .bit = CXL_AER_UE_CACHE_BE_PARITY, .error = "Cache Byte Enable Parity Error" },
> +	{ .bit = CXL_AER_UE_CACHE_DATA_ECC, .error = "Cache Data ECC Error" },
> +	{ .bit = CXL_AER_UE_MEM_DATA_PARITY, .error = "Memory Data Parity Error" },
> +	{ .bit = CXL_AER_UE_MEM_ADDR_PARITY, .error = "Memory Address Parity Error" },
> +	{ .bit = CXL_AER_UE_MEM_BE_PARITY, .error = "Memory Byte Enable Parity Error" },
> +	{ .bit = CXL_AER_UE_MEM_DATA_ECC, .error = "Memory Data ECC Error" },
> +	{ .bit = CXL_AER_UE_REINIT_THRESH, .error = "REINIT Threshold Hit" },
> +	{ .bit = CXL_AER_UE_RSVD_ENCODE, .error = "Received Unrecognized Encoding" },
> +	{ .bit = CXL_AER_UE_POISON, .error = "Received Poison From Peer" },
> +	{ .bit = CXL_AER_UE_RECV_OVERFLOW, .error = "Receiver Overflow" },
> +	{ .bit = CXL_AER_UE_INTERNAL_ERR, .error = "Component Specific Error" },
> +	{ .bit = CXL_AER_UE_IDE_TX_ERR, .error = "IDE Tx Error" },
> +	{ .bit = CXL_AER_UE_IDE_RX_ERR, .error = "IDE Rx Error" },
> +};
> +
> +static void decode_cxl_error_status(struct trace_seq *s, uint32_t status,
> +				   const struct cxl_error_list *cxl_error_list,
> +				   uint8_t num_elems)
> +{
> +	int i;
> +
> +	for (i = 0; i < num_elems; i++) {
> +		if (status & cxl_error_list[i].bit)
> +			trace_seq_printf(s, "\'%s\' ", cxl_error_list[i].error);

This comment applies to every trace_seq_printf() call in this patch: I
think it may return an error. Should the return value be checked (<= 0)?

> +	}
> +}
> +
> +int ras_cxl_aer_ue_event_handler(struct trace_seq *s,
> +				 struct tep_record *record,
> +				 struct tep_event *event, void *context)
> +{
> +	int len, i;
> +	unsigned long long val;
> +	time_t now;
> +	struct tm *tm;
> +	struct ras_events *ras = context;
> +	struct ras_cxl_aer_ue_event ev;
> +
> +	memset(&ev, 0, sizeof(ev));
> +	now = record->ts/user_hz + ras->uptime_diff;

Adding spaces around '/' makes it easier to read:
now = record->ts / user_hz + ras->uptime_diff;

DJ

> +	tm = localtime(&now);
> +	if (tm)
> +		strftime(ev.timestamp, sizeof(ev.timestamp),
> +			 "%Y-%m-%d %H:%M:%S %z", tm);
> +	else
> +		strncpy(ev.timestamp, "1970-01-01 00:00:00 +0000", sizeof(ev.timestamp));
> +	trace_seq_printf(s, "%s ", ev.timestamp);
> +
> +	ev.dev_name = tep_get_field_raw(s, event, "dev_name",
> +					record, &len, 1);
> +	if (!ev.dev_name)
> +		return -1;
> +	trace_seq_printf(s, "dev_name:%s ", ev.dev_name);
> +
> +	if (tep_get_field_val(s, event, "status", record, &val, 1) < 0)
> +		return -1;
> +	ev.error_status = val;
> +
> +	trace_seq_printf(s, "error status:");
> +	decode_cxl_error_status(s, ev.error_status,
> +				cxl_aer_ue, ARRAY_SIZE(cxl_aer_ue));
> +
> +	if (tep_get_field_val(s,  event, "first_error", record, &val, 1) < 0)
> +		return -1;
> +	ev.first_error = val;
> +
> +	trace_seq_printf(s, "first error:");
> +	decode_cxl_error_status(s, ev.first_error,
> +				cxl_aer_ue, ARRAY_SIZE(cxl_aer_ue));
> +
> +	ev.header_log = tep_get_field_raw(s, event, "header_log",
> +					  record, &len, 1);
> +	if (!ev.header_log)
> +		return -1;
> +	trace_seq_printf(s, "header log:\n");
> +	for (i = 0; i < CXL_HEADERLOG_SIZE_U32; i++) {
> +		trace_seq_printf(s, "%08x ", ev.header_log[i]);
> +		if ((i > 0) && ((i % 20) == 0))
> +			trace_seq_printf(s, "\n");
> +		/* Convert header log data to the big-endian format because
> +		 * the SQLite database seems to use big-endian storage.
> +		 */
> +		ev.header_log[i] = htobe32(ev.header_log[i]);
> +	}
> +
> +	/* Insert data into the database */
> +#ifdef HAVE_SQLITE3
> +	ras_store_cxl_aer_ue_event(ras, &ev);
> +#endif
> +
> +#ifdef HAVE_ABRT_REPORT
> +	/* Report event to ABRT */
> +	ras_report_cxl_aer_ue_event(ras, &ev);
> +#endif
> +
> +	return 0;
> +}
> diff --git a/ras-cxl-handler.h b/ras-cxl-handler.h
> index 84d5cc6..18b3120 100644
> --- a/ras-cxl-handler.h
> +++ b/ras-cxl-handler.h
> @@ -21,4 +21,9 @@
>   int ras_cxl_poison_event_handler(struct trace_seq *s,
>   				 struct tep_record *record,
>   				 struct tep_event *event, void *context);
> +
> +int ras_cxl_aer_ue_event_handler(struct trace_seq *s,
> +				 struct tep_record *record,
> +				 struct tep_event *event, void *context);
> +
>   #endif
> diff --git a/ras-events.c b/ras-events.c
> index 6555125..ead792b 100644
> --- a/ras-events.c
> +++ b/ras-events.c
> @@ -246,6 +246,7 @@ int toggle_ras_mc_event(int enable)
>   
>   #ifdef HAVE_CXL
>   	rc |= __toggle_ras_mc_event(ras, "cxl", "cxl_poison", enable);
> +	rc |= __toggle_ras_mc_event(ras, "cxl", "cxl_aer_uncorrectable_error", enable);
>   #endif
>   
>   free_ras:
> @@ -964,6 +965,14 @@ int handle_ras_events(int record_events)
>   	else
>   		log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
>   		    "cxl", "cxl_poison");
> +
> +	rc = add_event_handler(ras, pevent, page_size, "cxl", "cxl_aer_uncorrectable_error",
> +			       ras_cxl_aer_ue_event_handler, NULL, CXL_AER_UE_EVENT);
> +	if (!rc)
> +		num_events++;
> +	else
> +		log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
> +		    "cxl", "cxl_aer_uncorrectable_error");
>   #endif
>   
>   	if (!num_events) {
> diff --git a/ras-events.h b/ras-events.h
> index fc51070..65f9d9a 100644
> --- a/ras-events.h
> +++ b/ras-events.h
> @@ -40,6 +40,7 @@ enum {
>   	DISKERROR_EVENT,
>   	MF_EVENT,
>   	CXL_POISON_EVENT,
> +	CXL_AER_UE_EVENT,
>   	NR_EVENTS
>   };
>   
> diff --git a/ras-record.c b/ras-record.c
> index f54fb41..4703790 100644
> --- a/ras-record.c
> +++ b/ras-record.c
> @@ -618,6 +618,54 @@ int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_eve
>   
>   	return rc;
>   }
> +
> +/*
> + * Table and functions to handle cxl:cxl_aer_uncorrectable_error
> + */
> +static const struct db_fields cxl_aer_ue_event_fields[] = {
> +	{ .name = "id",                   .type = "INTEGER PRIMARY KEY" },
> +	{ .name = "timestamp",            .type = "TEXT" },
> +	{ .name = "dev_name",             .type = "TEXT" },
> +	{ .name = "error_status",         .type = "INTEGER" },
> +	{ .name = "first_error",          .type = "INTEGER" },
> +	{ .name = "header_log",           .type = "BLOB" },
> +};
> +
> +static const struct db_table_descriptor cxl_aer_ue_event_tab = {
> +	.name = "cxl_aer_ue_event",
> +	.fields = cxl_aer_ue_event_fields,
> +	.num_fields = ARRAY_SIZE(cxl_aer_ue_event_fields),
> +};
> +
> +int ras_store_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev)
> +{
> +	int rc;
> +	struct sqlite3_priv *priv = ras->db_priv;
> +
> +	if (!priv || !priv->stmt_cxl_aer_ue_event)
> +		return 0;
> +	log(TERM, LOG_INFO, "cxl_aer_ue_event store: %p\n", priv->stmt_cxl_aer_ue_event);
> +
> +	sqlite3_bind_text(priv->stmt_cxl_aer_ue_event, 1, ev->timestamp, -1, NULL);
> +	sqlite3_bind_text(priv->stmt_cxl_aer_ue_event, 2, ev->dev_name, -1, NULL);
> +	sqlite3_bind_int(priv->stmt_cxl_aer_ue_event, 3, ev->error_status);
> +	sqlite3_bind_int(priv->stmt_cxl_aer_ue_event, 4, ev->first_error);
> +	sqlite3_bind_blob(priv->stmt_cxl_aer_ue_event, 5, ev->header_log, CXL_HEADERLOG_SIZE, NULL);
> +
> +	rc = sqlite3_step(priv->stmt_cxl_aer_ue_event);
> +	if (rc != SQLITE_OK && rc != SQLITE_DONE)
> +		log(TERM, LOG_ERR,
> +		    "Failed to do cxl_aer_ue_event step on sqlite: error = %d\n", rc);
> +	rc = sqlite3_reset(priv->stmt_cxl_aer_ue_event);
> +	if (rc != SQLITE_OK && rc != SQLITE_DONE)
> +		log(TERM, LOG_ERR,
> +		    "Failed reset cxl_aer_ue_event on sqlite: error = %d\n",
> +		    rc);
> +	log(TERM, LOG_INFO, "register inserted at db\n");
> +
> +	return rc;
> +}
> +
>   #endif
>   
>   /*
> @@ -965,6 +1013,15 @@ int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras)
>   		if (rc != SQLITE_OK)
>   			goto error;
>   	}
> +
> +	rc = ras_mc_create_table(priv, &cxl_aer_ue_event_tab);
> +	if (rc == SQLITE_OK) {
> +		rc = ras_mc_prepare_stmt(priv, &priv->stmt_cxl_aer_ue_event,
> +					 &cxl_aer_ue_event_tab);
> +		if (rc != SQLITE_OK)
> +			goto error;
> +	}
> +
>   #endif
>   
>   	ras->db_priv = priv;
> @@ -1087,6 +1144,14 @@ int ras_mc_event_closedb(unsigned int cpu, struct ras_events *ras)
>   			    "cpu %u: Failed to finalize cxl_poison_event sqlite: error = %d\n",
>   			    cpu, rc);
>   	}
> +
> +	if (priv->stmt_cxl_aer_ue_event) {
> +		rc = sqlite3_finalize(priv->stmt_cxl_aer_ue_event);
> +		if (rc != SQLITE_OK)
> +			log(TERM, LOG_ERR,
> +			    "cpu %u: Failed to finalize cxl_aer_ue_event sqlite: error = %d\n",
> +			    cpu, rc);
> +	}
>   #endif
>   
>   	rc = sqlite3_close_v2(db);
> diff --git a/ras-record.h b/ras-record.h
> index e5bf483..0e2c178 100644
> --- a/ras-record.h
> +++ b/ras-record.h
> @@ -128,6 +128,18 @@ struct ras_cxl_poison_event {
>   	char overflow_ts[64];
>   };
>   
> +#define SZ_512                          0x200
> +#define CXL_HEADERLOG_SIZE              SZ_512
> +#define CXL_HEADERLOG_SIZE_U32          (SZ_512 / sizeof(uint32_t))
> +
> +struct ras_cxl_aer_ue_event {
> +	char timestamp[64];
> +	const char *dev_name;
> +	uint32_t error_status;
> +	uint32_t first_error;
> +	uint32_t *header_log;
> +};
> +
>   struct ras_mc_event;
>   struct ras_aer_event;
>   struct ras_extlog_event;
> @@ -138,6 +150,7 @@ struct devlink_event;
>   struct diskerror_event;
>   struct ras_mf_event;
>   struct ras_cxl_poison_event;
> +struct ras_cxl_aer_ue_event;
>   
>   #ifdef HAVE_SQLITE3
>   
> @@ -172,6 +185,7 @@ struct sqlite3_priv {
>   #endif
>   #ifdef HAVE_CXL
>   	sqlite3_stmt	*stmt_cxl_poison_event;
> +	sqlite3_stmt	*stmt_cxl_aer_ue_event;
>   #endif
>   };
>   
> @@ -201,6 +215,7 @@ int ras_store_devlink_event(struct ras_events *ras, struct devlink_event *ev);
>   int ras_store_diskerror_event(struct ras_events *ras, struct diskerror_event *ev);
>   int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev);
>   int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev);
> +int ras_store_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev);
>   
>   #else
>   static inline int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras) { return 0; };
> @@ -215,6 +230,7 @@ static inline int ras_store_devlink_event(struct ras_events *ras, struct devlink
>   static inline int ras_store_diskerror_event(struct ras_events *ras, struct diskerror_event *ev) { return 0; };
>   static inline int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev) { return 0; };
>   static inline int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev) { return 0; };
> +static inline int ras_store_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev) { return 0; };
>   
>   #endif
>   
> diff --git a/ras-report.c b/ras-report.c
> index 589e640..4c09061 100644
> --- a/ras-report.c
> +++ b/ras-report.c
> @@ -367,6 +367,28 @@ static int set_cxl_poison_event_backtrace(char *buf, struct ras_cxl_poison_event
>   	return 0;
>   }
>   
> +static int set_cxl_aer_ue_event_backtrace(char *buf, struct ras_cxl_aer_ue_event *ev)
> +{
> +	char bt_buf[MAX_BACKTRACE_SIZE];
> +
> +	if (!buf || !ev)
> +		return -1;
> +
> +	sprintf(bt_buf, "BACKTRACE="	\
> +						"timestamp=%s\n"	\
> +						"dev_name=%s\n"		\
> +						"error_status=%u\n"	\
> +						"first_error=%u\n",	\
> +						ev->timestamp,		\
> +						ev->dev_name,		\
> +						ev->error_status,	\
> +						ev->first_error);
> +
> +	strcat(buf, bt_buf);
> +
> +	return 0;
> +}
> +
>   static int commit_report_backtrace(int sockfd, int type, void *ev){
>   	char buf[MAX_BACKTRACE_SIZE];
>   	char *pbuf = buf;
> @@ -407,6 +429,9 @@ static int commit_report_backtrace(int sockfd, int type, void *ev){
>   	case CXL_POISON_EVENT:
>   		rc = set_cxl_poison_event_backtrace(buf, (struct ras_cxl_poison_event *)ev);
>   		break;
> +	case CXL_AER_UE_EVENT:
> +		rc = set_cxl_aer_ue_event_backtrace(buf, (struct ras_cxl_aer_ue_event *)ev);
> +		break;
>   	default:
>   		return -1;
>   	}
> @@ -859,3 +884,47 @@ cxl_poison_fail:
>   	else
>   		return -1;
>   }
> +
> +int ras_report_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev)
> +{
> +	char buf[MAX_MESSAGE_SIZE];
> +	int sockfd = 0;
> +	int done = 0;
> +	int rc = -1;
> +
> +	memset(buf, 0, sizeof(buf));
> +
> +	sockfd = setup_report_socket();
> +	if (sockfd < 0)
> +		return -1;
> +
> +	rc = commit_report_basic(sockfd);
> +	if (rc < 0)
> +		goto cxl_aer_ue_fail;
> +
> +	rc = commit_report_backtrace(sockfd, CXL_AER_UE_EVENT, ev);
> +	if (rc < 0)
> +		goto cxl_aer_ue_fail;
> +
> +	sprintf(buf, "ANALYZER=%s", "rasdaemon-cxl-aer-uncorrectable-error");
> +	rc = write(sockfd, buf, strlen(buf) + 1);
> +	if (rc < strlen(buf) + 1)
> +		goto cxl_aer_ue_fail;
> +
> +	sprintf(buf, "REASON=%s", "CXL AER uncorrectable error");
> +	rc = write(sockfd, buf, strlen(buf) + 1);
> +	if (rc < strlen(buf) + 1)
> +		goto cxl_aer_ue_fail;
> +
> +	done = 1;
> +
> +cxl_aer_ue_fail:
> +
> +	if (sockfd >= 0)
> +		close(sockfd);
> +
> +	if (done)
> +		return 0;
> +	else
> +		return -1;
> +}
> diff --git a/ras-report.h b/ras-report.h
> index d1591ce..dfe89d1 100644
> --- a/ras-report.h
> +++ b/ras-report.h
> @@ -40,6 +40,7 @@ int ras_report_devlink_event(struct ras_events *ras, struct devlink_event *ev);
>   int ras_report_diskerror_event(struct ras_events *ras, struct diskerror_event *ev);
>   int ras_report_mf_event(struct ras_events *ras, struct ras_mf_event *ev);
>   int ras_report_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev);
> +int ras_report_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev);
>   
>   #else
>   
> @@ -52,6 +53,7 @@ static inline int ras_report_devlink_event(struct ras_events *ras, struct devlin
>   static inline int ras_report_diskerror_event(struct ras_events *ras, struct diskerror_event *ev) { return 0; };
>   static inline int ras_report_mf_event(struct ras_events *ras, struct ras_mf_event *ev) { return 0; };
>   static inline int ras_report_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev) { return 0; };
> +static inline int ras_report_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev) { return 0; };
>   
>   #endif
>   


* Re: [PATCH V2 4/4] rasdaemon: Add support for the CXL AER correctable errors
  2023-01-24 16:57 ` [PATCH V2 4/4] rasdaemon: Add support for the CXL AER correctable errors shiju.jose
@ 2023-01-25 16:56   ` Dave Jiang
  0 siblings, 0 replies; 11+ messages in thread
From: Dave Jiang @ 2023-01-25 16:56 UTC (permalink / raw)
  To: shiju.jose, linux-edac, linux-cxl, mchehab; +Cc: jonathan.cameron, linuxarm



On 1/24/23 9:57 AM, shiju.jose@huawei.com wrote:
> From: Shiju Jose <shiju.jose@huawei.com>
> 
> Add support to log and record the CXL AER correctable errors.
> 
> The corresponding kernel patch is here:
> https://patchwork.kernel.org/project/cxl/patch/166974413388.1608150.5875712482260436188.stgit@djiang5-desk3.ch.intel.com/
> 
> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Same comments as previous patch.

> ---
>   ras-cxl-handler.c | 64 ++++++++++++++++++++++++++++++++++++++++++++
>   ras-cxl-handler.h |  3 +++
>   ras-events.c      |  9 +++++++
>   ras-events.h      |  1 +
>   ras-record.c      | 57 ++++++++++++++++++++++++++++++++++++++++
>   ras-record.h      | 10 +++++++
>   ras-report.c      | 67 +++++++++++++++++++++++++++++++++++++++++++++++
>   ras-report.h      |  2 ++
>   8 files changed, 213 insertions(+)
> 
> diff --git a/ras-cxl-handler.c b/ras-cxl-handler.c
> index d042dc9..c2775c8 100644
> --- a/ras-cxl-handler.c
> +++ b/ras-cxl-handler.c
> @@ -180,6 +180,14 @@ int ras_cxl_poison_event_handler(struct trace_seq *s,
>   #define CXL_AER_UE_IDE_TX_ERR		BIT(15)
>   #define CXL_AER_UE_IDE_RX_ERR		BIT(16)
>   
> +#define CXL_AER_CE_CACHE_DATA_ECC	BIT(0)
> +#define CXL_AER_CE_MEM_DATA_ECC		BIT(1)
> +#define CXL_AER_CE_CRC_THRESH		BIT(2)
> +#define CXL_AER_CE_RETRY_THRESH		BIT(3)
> +#define CXL_AER_CE_CACHE_POISON		BIT(4)
> +#define CXL_AER_CE_MEM_POISON		BIT(5)
> +#define CXL_AER_CE_PHYS_LAYER_ERR	BIT(6)
> +
>   struct cxl_error_list {
>   	uint32_t bit;
>   	const char *error;
> @@ -203,6 +211,16 @@ static const struct cxl_error_list cxl_aer_ue[] = {
>   	{ .bit = CXL_AER_UE_IDE_RX_ERR, .error = "IDE Rx Error" },
>   };
>   
> +static const struct cxl_error_list cxl_aer_ce[] = {
> +	{ .bit = CXL_AER_CE_CACHE_DATA_ECC, .error = "Cache Data ECC Error" },
> +	{ .bit = CXL_AER_CE_MEM_DATA_ECC, .error = "Memory Data ECC Error" },
> +	{ .bit = CXL_AER_CE_CRC_THRESH, .error = "CRC Threshold Hit" },
> +	{ .bit = CXL_AER_CE_RETRY_THRESH, .error = "Retry Threshold" },
> +	{ .bit = CXL_AER_CE_CACHE_POISON, .error = "Received Cache Poison From Peer" },
> +	{ .bit = CXL_AER_CE_MEM_POISON, .error = "Received Memory Poison From Peer" },
> +	{ .bit = CXL_AER_CE_PHYS_LAYER_ERR, .error = "Received Error From Physical Layer" },
> +};
> +
>   static void decode_cxl_error_status(struct trace_seq *s, uint32_t status,
>   				   const struct cxl_error_list *cxl_error_list,
>   				   uint8_t num_elems)
> @@ -285,3 +303,49 @@ int ras_cxl_aer_ue_event_handler(struct trace_seq *s,
>   
>   	return 0;
>   }
> +
> +int ras_cxl_aer_ce_event_handler(struct trace_seq *s,
> +				 struct tep_record *record,
> +				 struct tep_event *event, void *context)
> +{
> +	int len;
> +	unsigned long long val;
> +	time_t now;
> +	struct tm *tm;
> +	struct ras_events *ras = context;
> +	struct ras_cxl_aer_ce_event ev;
> +
> +	now = record->ts/user_hz + ras->uptime_diff;
> +	tm = localtime(&now);
> +	if (tm)
> +		strftime(ev.timestamp, sizeof(ev.timestamp),
> +			 "%Y-%m-%d %H:%M:%S %z", tm);
> +	else
> +		strncpy(ev.timestamp, "1970-01-01 00:00:00 +0000", sizeof(ev.timestamp));
> +	trace_seq_printf(s, "%s ", ev.timestamp);
> +
> +	ev.dev_name = tep_get_field_raw(s, event, "dev_name",
> +					record, &len, 1);
> +	if (!ev.dev_name)
> +		return -1;
> +	trace_seq_printf(s, "dev_name:%s ", ev.dev_name);
> +
> +	if (tep_get_field_val(s, event, "status", record, &val, 1) < 0)
> +		return -1;
> +	ev.error_status = val;
> +	trace_seq_printf(s, "error status:");
> +	decode_cxl_error_status(s, ev.error_status,
> +				cxl_aer_ce, ARRAY_SIZE(cxl_aer_ce));
> +
> +	/* Insert data into the database */
> +#ifdef HAVE_SQLITE3
> +	ras_store_cxl_aer_ce_event(ras, &ev);
> +#endif
> +
> +#ifdef HAVE_ABRT_REPORT
> +	/* Report event to ABRT */
> +	ras_report_cxl_aer_ce_event(ras, &ev);
> +#endif
> +
> +	return 0;
> +}
> diff --git a/ras-cxl-handler.h b/ras-cxl-handler.h
> index 18b3120..711daf4 100644
> --- a/ras-cxl-handler.h
> +++ b/ras-cxl-handler.h
> @@ -26,4 +26,7 @@ int ras_cxl_aer_ue_event_handler(struct trace_seq *s,
>   				 struct tep_record *record,
>   				 struct tep_event *event, void *context);
>   
> +int ras_cxl_aer_ce_event_handler(struct trace_seq *s,
> +				 struct tep_record *record,
> +				 struct tep_event *event, void *context);
>   #endif
> diff --git a/ras-events.c b/ras-events.c
> index ead792b..3691311 100644
> --- a/ras-events.c
> +++ b/ras-events.c
> @@ -247,6 +247,7 @@ int toggle_ras_mc_event(int enable)
>   #ifdef HAVE_CXL
>   	rc |= __toggle_ras_mc_event(ras, "cxl", "cxl_poison", enable);
>   	rc |= __toggle_ras_mc_event(ras, "cxl", "cxl_aer_uncorrectable_error", enable);
> +	rc |= __toggle_ras_mc_event(ras, "cxl", "cxl_aer_correctable_error", enable);
>   #endif
>   
>   free_ras:
> @@ -973,6 +974,14 @@ int handle_ras_events(int record_events)
>   	else
>   		log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
>   		    "cxl", "cxl_aer_uncorrectable_error");
> +
> +	rc = add_event_handler(ras, pevent, page_size, "cxl", "cxl_aer_correctable_error",
> +			       ras_cxl_aer_ce_event_handler, NULL, CXL_AER_CE_EVENT);
> +	if (!rc)
> +		num_events++;
> +	else
> +		log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
> +		    "cxl", "cxl_aer_correctable_error");
>   #endif
>   
>   	if (!num_events) {
> diff --git a/ras-events.h b/ras-events.h
> index 65f9d9a..dc7bdfb 100644
> --- a/ras-events.h
> +++ b/ras-events.h
> @@ -41,6 +41,7 @@ enum {
>   	MF_EVENT,
>   	CXL_POISON_EVENT,
>   	CXL_AER_UE_EVENT,
> +	CXL_AER_CE_EVENT,
>   	NR_EVENTS
>   };
>   
> diff --git a/ras-record.c b/ras-record.c
> index 4703790..c318a18 100644
> --- a/ras-record.c
> +++ b/ras-record.c
> @@ -666,6 +666,48 @@ int ras_store_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_eve
>   	return rc;
>   }
>   
> +/*
> + * Table and functions to handle cxl:cxl_aer_correctable_error
> + */
> +static const struct db_fields cxl_aer_ce_event_fields[] = {
> +	{ .name = "id",                   .type = "INTEGER PRIMARY KEY" },
> +	{ .name = "timestamp",            .type = "TEXT" },
> +	{ .name = "dev_name",             .type = "TEXT" },
> +	{ .name = "error_status",         .type = "INTEGER" },
> +};
> +
> +static const struct db_table_descriptor cxl_aer_ce_event_tab = {
> +	.name = "cxl_aer_ce_event",
> +	.fields = cxl_aer_ce_event_fields,
> +	.num_fields = ARRAY_SIZE(cxl_aer_ce_event_fields),
> +};
> +
> +int ras_store_cxl_aer_ce_event(struct ras_events *ras, struct ras_cxl_aer_ce_event *ev)
> +{
> +	int rc;
> +	struct sqlite3_priv *priv = ras->db_priv;
> +
> +	if (!priv || !priv->stmt_cxl_aer_ce_event)
> +		return 0;
> +	log(TERM, LOG_INFO, "cxl_aer_ce_event store: %p\n", priv->stmt_cxl_aer_ce_event);
> +
> +	sqlite3_bind_text(priv->stmt_cxl_aer_ce_event, 1, ev->timestamp, -1, NULL);
> +	sqlite3_bind_text(priv->stmt_cxl_aer_ce_event, 2, ev->dev_name, -1, NULL);
> +	sqlite3_bind_int(priv->stmt_cxl_aer_ce_event, 3, ev->error_status);
> +
> +	rc = sqlite3_step(priv->stmt_cxl_aer_ce_event);
> +	if (rc != SQLITE_OK && rc != SQLITE_DONE)
> +		log(TERM, LOG_ERR,
> +		    "Failed to do cxl_aer_ce_event step on sqlite: error = %d\n", rc);
> +	rc = sqlite3_reset(priv->stmt_cxl_aer_ce_event);
> +	if (rc != SQLITE_OK && rc != SQLITE_DONE)
> +		log(TERM, LOG_ERR,
> +		    "Failed reset cxl_aer_ce_event on sqlite: error = %d\n",
> +		    rc);
> +	log(TERM, LOG_INFO, "register inserted at db\n");
> +
> +	return rc;
> +}
>   #endif
>   
>   /*
> @@ -1022,6 +1064,13 @@ int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras)
>   			goto error;
>   	}
>   
> +	rc = ras_mc_create_table(priv, &cxl_aer_ce_event_tab);
> +	if (rc == SQLITE_OK) {
> +		rc = ras_mc_prepare_stmt(priv, &priv->stmt_cxl_aer_ce_event,
> +					 &cxl_aer_ce_event_tab);
> +		if (rc != SQLITE_OK)
> +			goto error;
> +	}
>   #endif
>   
>   	ras->db_priv = priv;
> @@ -1152,6 +1201,14 @@ int ras_mc_event_closedb(unsigned int cpu, struct ras_events *ras)
>   			    "cpu %u: Failed to finalize cxl_aer_ue_event sqlite: error = %d\n",
>   			    cpu, rc);
>   	}
> +
> +	if (priv->stmt_cxl_aer_ce_event) {
> +		rc = sqlite3_finalize(priv->stmt_cxl_aer_ce_event);
> +		if (rc != SQLITE_OK)
> +			log(TERM, LOG_ERR,
> +			    "cpu %u: Failed to finalize cxl_aer_ce_event sqlite: error = %d\n",
> +			    cpu, rc);
> +	}
>   #endif
>   
>   	rc = sqlite3_close_v2(db);
> diff --git a/ras-record.h b/ras-record.h
> index 0e2c178..1f28cc1 100644
> --- a/ras-record.h
> +++ b/ras-record.h
> @@ -140,6 +140,12 @@ struct ras_cxl_aer_ue_event {
>   	uint32_t *header_log;
>   };
>   
> +struct ras_cxl_aer_ce_event {
> +	char timestamp[64];
> +	const char *dev_name;
> +	uint32_t error_status;
> +};
> +
>   struct ras_mc_event;
>   struct ras_aer_event;
>   struct ras_extlog_event;
> @@ -151,6 +157,7 @@ struct diskerror_event;
>   struct ras_mf_event;
>   struct ras_cxl_poison_event;
>   struct ras_cxl_aer_ue_event;
> +struct ras_cxl_aer_ce_event;
>   
>   #ifdef HAVE_SQLITE3
>   
> @@ -186,6 +193,7 @@ struct sqlite3_priv {
>   #ifdef HAVE_CXL
>   	sqlite3_stmt	*stmt_cxl_poison_event;
>   	sqlite3_stmt	*stmt_cxl_aer_ue_event;
> +	sqlite3_stmt	*stmt_cxl_aer_ce_event;
>   #endif
>   };
>   
> @@ -216,6 +224,7 @@ int ras_store_diskerror_event(struct ras_events *ras, struct diskerror_event *ev
>   int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev);
>   int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev);
>   int ras_store_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev);
> +int ras_store_cxl_aer_ce_event(struct ras_events *ras, struct ras_cxl_aer_ce_event *ev);
>   
>   #else
>   static inline int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras) { return 0; };
> @@ -231,6 +240,7 @@ static inline int ras_store_diskerror_event(struct ras_events *ras, struct diske
>   static inline int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev) { return 0; };
>   static inline int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev) { return 0; };
>   static inline int ras_store_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev) { return 0; };
> +static inline int ras_store_cxl_aer_ce_event(struct ras_events *ras, struct ras_cxl_aer_ce_event *ev) { return 0; };
>   
>   #endif
>   
> diff --git a/ras-report.c b/ras-report.c
> index 4c09061..796abab 100644
> --- a/ras-report.c
> +++ b/ras-report.c
> @@ -389,6 +389,26 @@ static int set_cxl_aer_ue_event_backtrace(char *buf, struct ras_cxl_aer_ue_event
>   	return 0;
>   }
>   
> +static int set_cxl_aer_ce_event_backtrace(char *buf, struct ras_cxl_aer_ce_event *ev)
> +{
> +	char bt_buf[MAX_BACKTRACE_SIZE];
> +
> +	if (!buf || !ev)
> +		return -1;
> +
> +	sprintf(bt_buf, "BACKTRACE="	\
> +						"timestamp=%s\n"	\
> +						"dev_name=%s\n"		\
> +						"error_status=%u\n",	\
> +						ev->timestamp,		\
> +						ev->dev_name,		\
> +						ev->error_status);
> +
> +	strcat(buf, bt_buf);
> +
> +	return 0;
> +}
> +
>   static int commit_report_backtrace(int sockfd, int type, void *ev){
>   	char buf[MAX_BACKTRACE_SIZE];
>   	char *pbuf = buf;
> @@ -432,6 +452,9 @@ static int commit_report_backtrace(int sockfd, int type, void *ev){
>   	case CXL_AER_UE_EVENT:
>   		rc = set_cxl_aer_ue_event_backtrace(buf, (struct ras_cxl_aer_ue_event *)ev);
>   		break;
> +	case CXL_AER_CE_EVENT:
> +		rc = set_cxl_aer_ce_event_backtrace(buf, (struct ras_cxl_aer_ce_event *)ev);
> +		break;
>   	default:
>   		return -1;
>   	}
> @@ -928,3 +951,47 @@ cxl_aer_ue_fail:
>   	else
>   		return -1;
>   }
> +
> +int ras_report_cxl_aer_ce_event(struct ras_events *ras, struct ras_cxl_aer_ce_event *ev)
> +{
> +	char buf[MAX_MESSAGE_SIZE];
> +	int sockfd = 0;
> +	int done = 0;
> +	int rc = -1;
> +
> +	memset(buf, 0, sizeof(buf));
> +
> +	sockfd = setup_report_socket();
> +	if (sockfd < 0)
> +		return -1;
> +
> +	rc = commit_report_basic(sockfd);
> +	if (rc < 0)
> +		goto cxl_aer_ce_fail;
> +
> +	rc = commit_report_backtrace(sockfd, CXL_AER_CE_EVENT, ev);
> +	if (rc < 0)
> +		goto cxl_aer_ce_fail;
> +
> +	sprintf(buf, "ANALYZER=%s", "rasdaemon-cxl-aer-correctable-error");
> +	rc = write(sockfd, buf, strlen(buf) + 1);
> +	if (rc < strlen(buf) + 1)
> +		goto cxl_aer_ce_fail;
> +
> +	sprintf(buf, "REASON=%s", "CXL AER correctable error");
> +	rc = write(sockfd, buf, strlen(buf) + 1);
> +	if (rc < strlen(buf) + 1)
> +		goto cxl_aer_ce_fail;
> +
> +	done = 1;
> +
> +cxl_aer_ce_fail:
> +
> +	if (sockfd >= 0)
> +		close(sockfd);
> +
> +	if (done)
> +		return 0;
> +	else
> +		return -1;
> +}
> diff --git a/ras-report.h b/ras-report.h
> index dfe89d1..46155ee 100644
> --- a/ras-report.h
> +++ b/ras-report.h
> @@ -41,6 +41,7 @@ int ras_report_diskerror_event(struct ras_events *ras, struct diskerror_event *e
>   int ras_report_mf_event(struct ras_events *ras, struct ras_mf_event *ev);
>   int ras_report_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev);
>   int ras_report_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev);
> +int ras_report_cxl_aer_ce_event(struct ras_events *ras, struct ras_cxl_aer_ce_event *ev);
>   
>   #else
>   
> @@ -54,6 +55,7 @@ static inline int ras_report_diskerror_event(struct ras_events *ras, struct disk
>   static inline int ras_report_mf_event(struct ras_events *ras, struct ras_mf_event *ev) { return 0; };
>   static inline int ras_report_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev) { return 0; };
>   static inline int ras_report_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev) { return 0; };
> +static inline int ras_report_cxl_aer_ce_event(struct ras_events *ras, struct ras_cxl_aer_ce_event *ev) { return 0; };
>   
>   #endif
>   


* Re: [PATCH V2 2/4] rasdaemon: Add support for the CXL poison events
  2023-01-24 16:57 ` [PATCH V2 2/4] rasdaemon: Add support for the CXL poison events shiju.jose
@ 2023-01-25 22:34   ` Ira Weiny
  2023-01-26 10:04     ` Shiju Jose
  0 siblings, 1 reply; 11+ messages in thread
From: Ira Weiny @ 2023-01-25 22:34 UTC (permalink / raw)
  To: shiju.jose, linux-edac, linux-cxl, mchehab
  Cc: jonathan.cameron, linuxarm, shiju.jose

shiju.jose@ wrote:
> From: Shiju Jose <shiju.jose@huawei.com>
> 
> Add support to log and record the CXL poison events.
> 
> The corresponding kernel patches are here:
> https://lore.kernel.org/linux-cxl/de11785ff05844299b40b100f8e0f56c7eef7f08.1674070170.git.alison.schofield@intel.com/
> 
> This is presently an RFC draft that only logs the events. It could be
> extended with policy-based recovery actions for frequent poison events,
> depending on the above kernel patches.
> 
> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

[snip]

> +
> +int ras_cxl_poison_event_handler(struct trace_seq *s,
> +				 struct tep_record *record,
> +				 struct tep_event *event, void *context)
> +{
> +	int len;
> +	unsigned long long val;
> +	struct ras_events *ras = context;
> +	time_t now;
> +	struct tm *tm;
> +	struct ras_cxl_poison_event ev;
> +
> +	now = record->ts/user_hz + ras->uptime_diff;
> +	tm = localtime(&now);
> +	if (tm)
> +		strftime(ev.timestamp, sizeof(ev.timestamp),
> +			 "%Y-%m-%d %H:%M:%S %z", tm);
> +	else
> +		strncpy(ev.timestamp, "1970-01-01 00:00:00 +0000", sizeof(ev.timestamp));
> +	trace_seq_printf(s, "%s ", ev.timestamp);
> +
> +	ev.memdev = tep_get_field_raw(s, event, "memdev",
> +				      record, &len, 1);
> +	if (!ev.memdev)
> +		return -1;
> +	trace_seq_printf(s, "memdev:%s ", ev.memdev);
> +
> +	ev.pcidev = tep_get_field_raw(s, event, "pcidev",
> +				      record, &len, 1);
> +	if (!ev.pcidev)
> +		return -1;
> +	trace_seq_printf(s, "pcidev:%s ", ev.pcidev);
> +
> +	ev.region = tep_get_field_raw(s, event, "region",
> +				      record, &len, 1);
> +	if (!ev.region)
> +		return -1;
> +	trace_seq_printf(s, "region:%s ", ev.region);
> +
> +	ev.uuid = tep_get_field_raw(s, event, "uuid",
> +				    record, &len, 1);
> +	if (!ev.uuid)
> +		return -1;
> +	trace_seq_printf(s, "region_uuid:%s ", ev.uuid);
> +
> +	if (tep_get_field_val(s, event, "hpa", record, &val, 1) < 0)
> +		return -1;
> +	ev.hpa = val;
> +	trace_seq_printf(s, "poison list: hpa:0x%llx ", (unsigned long long)ev.hpa);
> +
> +	if (tep_get_field_val(s, event, "dpa", record, &val, 1) < 0)
> +		return -1;
> +	ev.dpa = val;
> +	trace_seq_printf(s, "dpa:0x%llx ", (unsigned long long)ev.dpa);
> +
> +	if (tep_get_field_val(s, event, "length", record, &val, 1) < 0)
> +		return -1;
> +	ev.length = val;
> +	trace_seq_printf(s, "length:%d ", ev.length);
> +
> +	if (tep_get_field_val(s,  event, "source", record, &val, 1) < 0)
> +		return -1;
> +
> +	switch (val) {
> +	case CXL_POISON_SOURCE_UNKNOWN:
> +		ev.source = "Unknown";
> +		break;
> +	case CXL_POISON_SOURCE_EXTERNAL:
> +		ev.source = "External";
> +		break;
> +	case CXL_POISON_SOURCE_INTERNAL:
> +		ev.source = "Internal";
> +		break;
> +	case CXL_POISON_SOURCE_INJECTED:
> +		ev.source = "Injected";
> +		break;
> +	case CXL_POISON_SOURCE_VENDOR:
> +		ev.source = "Vendor";
> +		break;
> +	default:
> +		ev.source = "Invalid";
> +	}
> +	trace_seq_printf(s, "source:%s ", ev.source);
> +
> +	if (tep_get_field_val(s,  event, "flags", record, &val, 1) < 0)
> +		return -1;
> +	ev.flags = val;
> +	trace_seq_printf(s, "flags:%d ", ev.flags);
> +
> +	if (ev.flags & CXL_POISON_FLAG_OVERFLOW) {
> +		if (tep_get_field_val(s,  event, "overflow_t", record, &val, 1) < 0)
> +			return -1;
> +		if (val) {
> +			/* CXL Specification 3.0
> +			 * Overflow timestamp - The number of unsigned nanoseconds
> +			 * that have elapsed since midnight, 01-Jan-1970 UTC
> +			 */
> +			time_t ovf_ts_secs = val / 1000000000ULL;
> +
> +			tm = localtime(&ovf_ts_secs);
> +			if (tm) {
> +				strftime(ev.overflow_ts, sizeof(ev.overflow_ts),
> +					 "%Y-%m-%d %H:%M:%S %z", tm);
> +			}
> +		}
> +		if (!val || !tm)
> +			strncpy(ev.overflow_ts, "1970-01-01 00:00:00 +0000",
> +				sizeof(ev.overflow_ts));
> +	} else
> +		strncpy(ev.overflow_ts, "1970-01-01 00:00:00 +0000", sizeof(ev.overflow_ts));
> +	trace_seq_printf(s, "overflow timestamp:%s ", ev.overflow_ts);
> +	trace_seq_printf(s, "\n");
> +
> +	/* Insert data into the database */
> +#ifdef HAVE_SQLITE3
> +	ras_store_cxl_poison_event(ras, &ev);
> +#endif

I know nothing about the rasdaemon build system, but it seems like this
needs an #ifdef HAVE_CXL around it?

[snip]

> --- a/ras-record.c
> +++ b/ras-record.c
> @@ -559,6 +559,67 @@ int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev)
>  }
>  #endif
>  
> +#ifdef HAVE_CXL
> +/*
> + * Table and functions to handle cxl:cxl_poison
> + */
> +static const struct db_fields cxl_poison_event_fields[] = {
> +	{ .name = "id",                   .type = "INTEGER PRIMARY KEY" },
> +	{ .name = "timestamp",            .type = "TEXT" },
> +	{ .name = "memdev",               .type = "TEXT" },
> +	{ .name = "pcidev",               .type = "TEXT" },
> +	{ .name = "region",               .type = "TEXT" },
> +	{ .name = "region_uuid",          .type = "TEXT" },
> +	{ .name = "hpa",                  .type = "INTEGER" },
> +	{ .name = "dpa",                  .type = "INTEGER" },
> +	{ .name = "length",               .type = "INTEGER" },
> +	{ .name = "source",               .type = "TEXT" },
> +	{ .name = "flags",                .type = "INTEGER" },
> +	{ .name = "overflow_ts",          .type = "TEXT" },
> +};
> +
> +static const struct db_table_descriptor cxl_poison_event_tab = {
> +	.name = "cxl_poison_event",
> +	.fields = cxl_poison_event_fields,
> +	.num_fields = ARRAY_SIZE(cxl_poison_event_fields),
> +};
> +
> +int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev)

Because I believe this is not defined if (!HAVE_CXL and HAVE_SQLITE3)

[snip]

>  
>  #ifdef HAVE_SQLITE3
>  
> @@ -155,6 +170,9 @@ struct sqlite3_priv {
>  #ifdef HAVE_MEMORY_FAILURE
>  	sqlite3_stmt	*stmt_mf_event;
>  #endif
> +#ifdef HAVE_CXL
> +	sqlite3_stmt	*stmt_cxl_poison_event;
> +#endif
>  };
>  
>  struct db_fields {
> @@ -182,6 +200,7 @@ int ras_store_arm_record(struct ras_events *ras, struct ras_arm_event *ev);
>  int ras_store_devlink_event(struct ras_events *ras, struct devlink_event *ev);
>  int ras_store_diskerror_event(struct ras_events *ras, struct diskerror_event *ev);
>  int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev);
> +int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev);
>  
>  #else
>  static inline int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras) { return 0; };
> @@ -195,6 +214,7 @@ static inline int ras_store_arm_record(struct ras_events *ras, struct ras_arm_ev
>  static inline int ras_store_devlink_event(struct ras_events *ras, struct devlink_event *ev) { return 0; };
>  static inline int ras_store_diskerror_event(struct ras_events *ras, struct diskerror_event *ev) { return 0; };
>  static inline int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev) { return 0; };
> +static inline int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev) { return 0; };

But I could be missing something.
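To make the concern concrete, here is a minimal sketch of one way the
header could guard the declaration. It is not taken from the patch: the
combined #if condition and the handle_poison_event() caller are
assumptions for illustration, and whether the problem can occur at all
depends on whether ras-cxl-handler.c is ever built without HAVE_CXL,
which this thread does not settle.

```c
#include <stddef.h>

/* Hypothetical build switches: SQLite enabled, CXL support disabled. */
#define HAVE_SQLITE3 1
/* #define HAVE_CXL 1 */

struct ras_events;
struct ras_cxl_poison_event;

#if defined(HAVE_SQLITE3) && defined(HAVE_CXL)
/* Real implementation, defined in ras-record.c under #ifdef HAVE_CXL. */
int ras_store_cxl_poison_event(struct ras_events *ras,
			       struct ras_cxl_poison_event *ev);
#else
/* Stub so that callers still link when either feature is compiled out. */
static inline int ras_store_cxl_poison_event(struct ras_events *ras,
					     struct ras_cxl_poison_event *ev)
{
	(void)ras;
	(void)ev;
	return 0;
}
#endif

/* Hypothetical caller standing in for the trace event handler. */
static int handle_poison_event(struct ras_events *ras,
			       struct ras_cxl_poison_event *ev)
{
#ifdef HAVE_SQLITE3
	return ras_store_cxl_poison_event(ras, ev);
#else
	return 0;
#endif
}
```

With only a bare prototype under HAVE_SQLITE3, the same configuration
would compile but fail at link time, since the definition is inside
#ifdef HAVE_CXL.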

Ira

[snip]


* RE: [PATCH V2 3/4] rasdaemon: Add support for the CXL AER uncorrectable errors
  2023-01-25 16:54   ` Dave Jiang
@ 2023-01-26  9:18     ` Shiju Jose
  0 siblings, 0 replies; 11+ messages in thread
From: Shiju Jose @ 2023-01-26  9:18 UTC (permalink / raw)
  To: Dave Jiang, linux-edac, linux-cxl, mchehab; +Cc: Jonathan Cameron, Linuxarm


Hi Dave,

>-----Original Message-----
>From: Dave Jiang <dave.jiang@intel.com>
>Sent: 25 January 2023 16:54
>To: Shiju Jose <shiju.jose@huawei.com>; linux-edac@vger.kernel.org;
>linux-cxl@vger.kernel.org; mchehab@kernel.org
>Cc: Jonathan Cameron <jonathan.cameron@huawei.com>; Linuxarm
><linuxarm@huawei.com>
>Subject: Re: [PATCH V2 3/4] rasdaemon: Add support for the CXL AER
>uncorrectable errors
>
>
>
>On 1/24/23 9:57 AM, shiju.jose@huawei.com wrote:
>> From: Shiju Jose <shiju.jose@huawei.com>
>>
>> Add support to log and record the CXL AER uncorrectable errors.
>>
>> The corresponding kernel patch is here:
>> https://patchwork.kernel.org/project/cxl/patch/166974413388.1608150.5875712482260436188.stgit@djiang5-desk3.ch.intel.com/
>>
>> It was found that the header log data has to be converted to
>> big-endian format to be stored correctly in the SQLite database,
>> likely because SQLite seems to use big-endian storage.
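The conversion described above could look roughly like the sketch below.
It is an assumption for illustration, not the code from the patch: the
helper name header_log_to_be32() is hypothetical, and htobe32() comes
from glibc's <endian.h>, which the patch does include. The caller would
pass CXL_HEADERLOG_SIZE_U32 words (512 bytes / 4 = 128) and then bind
the converted buffer as the BLOB.

```c
#define _DEFAULT_SOURCE		/* expose htobe32() in glibc's <endian.h> */
#include <endian.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: copy n 32-bit header log words from src to dst,
 * converting each from host byte order to big-endian.
 */
static void header_log_to_be32(const uint32_t *src, uint32_t *dst, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		dst[i] = htobe32(src[i]);
}
```

After the conversion, the most significant byte of each word comes first
in memory regardless of the host's endianness, so a hex dump of the
stored BLOB reads in the same order as the register values.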
>>
>> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> ---
>>   ras-cxl-handler.c | 125 ++++++++++++++++++++++++++++++++++++++++++++++
>>   ras-cxl-handler.h |   5 ++
>>   ras-events.c      |   9 ++++
>>   ras-events.h      |   1 +
>>   ras-record.c      |  65 ++++++++++++++++++++++++
>>   ras-record.h      |  16 ++++++
>>   ras-report.c      |  69 +++++++++++++++++++++++++
>>   ras-report.h      |   2 +
>>   8 files changed, 292 insertions(+)
>>
>> diff --git a/ras-cxl-handler.c b/ras-cxl-handler.c index
>> 0b7cdca..d042dc9 100644
>> --- a/ras-cxl-handler.c
>> +++ b/ras-cxl-handler.c
>> @@ -21,6 +21,7 @@
>>   #include "ras-record.h"
>>   #include "ras-logger.h"
>>   #include "ras-report.h"
>> +#include <endian.h>
>>
>>   /* Poison List: Payload out flags */
>>   #define CXL_POISON_FLAG_MORE            BIT(0)
>> @@ -160,3 +161,127 @@ int ras_cxl_poison_event_handler(struct
>> trace_seq *s,
>>
>>   	return 0;
>>   }
>> +
>> +/* CXL AER Errors */
>> +
>> +#define CXL_AER_UE_CACHE_DATA_PARITY	BIT(0)
>> +#define CXL_AER_UE_CACHE_ADDR_PARITY	BIT(1)
>> +#define CXL_AER_UE_CACHE_BE_PARITY	BIT(2)
>> +#define CXL_AER_UE_CACHE_DATA_ECC	BIT(3)
>> +#define CXL_AER_UE_MEM_DATA_PARITY	BIT(4)
>> +#define CXL_AER_UE_MEM_ADDR_PARITY	BIT(5)
>> +#define CXL_AER_UE_MEM_BE_PARITY	BIT(6)
>> +#define CXL_AER_UE_MEM_DATA_ECC		BIT(7)
>> +#define CXL_AER_UE_REINIT_THRESH	BIT(8)
>> +#define CXL_AER_UE_RSVD_ENCODE		BIT(9)
>> +#define CXL_AER_UE_POISON		BIT(10)
>> +#define CXL_AER_UE_RECV_OVERFLOW	BIT(11)
>> +#define CXL_AER_UE_INTERNAL_ERR		BIT(14)
>> +#define CXL_AER_UE_IDE_TX_ERR		BIT(15)
>> +#define CXL_AER_UE_IDE_RX_ERR		BIT(16)
>> +
>> +struct cxl_error_list {
>> +	uint32_t bit;
>> +	const char *error;
>> +};
>> +
>> +static const struct cxl_error_list cxl_aer_ue[] = {
>> +	{ .bit = CXL_AER_UE_CACHE_DATA_PARITY, .error = "Cache Data Parity Error" },
>> +	{ .bit = CXL_AER_UE_CACHE_ADDR_PARITY, .error = "Cache Address Parity Error" },
>> +	{ .bit = CXL_AER_UE_CACHE_BE_PARITY, .error = "Cache Byte Enable Parity Error" },
>> +	{ .bit = CXL_AER_UE_CACHE_DATA_ECC, .error = "Cache Data ECC Error" },
>> +	{ .bit = CXL_AER_UE_MEM_DATA_PARITY, .error = "Memory Data Parity Error" },
>> +	{ .bit = CXL_AER_UE_MEM_ADDR_PARITY, .error = "Memory Address Parity Error" },
>> +	{ .bit = CXL_AER_UE_MEM_BE_PARITY, .error = "Memory Byte Enable Parity Error" },
>> +	{ .bit = CXL_AER_UE_MEM_DATA_ECC, .error = "Memory Data ECC Error" },
>> +	{ .bit = CXL_AER_UE_REINIT_THRESH, .error = "REINIT Threshold Hit" },
>> +	{ .bit = CXL_AER_UE_RSVD_ENCODE, .error = "Received Unrecognized Encoding" },
>> +	{ .bit = CXL_AER_UE_POISON, .error = "Received Poison From Peer" },
>> +	{ .bit = CXL_AER_UE_RECV_OVERFLOW, .error = "Receiver Overflow" },
>> +	{ .bit = CXL_AER_UE_INTERNAL_ERR, .error = "Component Specific Error" },
>> +	{ .bit = CXL_AER_UE_IDE_TX_ERR, .error = "IDE Tx Error" },
>> +	{ .bit = CXL_AER_UE_IDE_RX_ERR, .error = "IDE Rx Error" },
>> +};
>> +
>> +static void decode_cxl_error_status(struct trace_seq *s, uint32_t status,
>> +				   const struct cxl_error_list *cxl_error_list,
>> +				   uint8_t num_elems)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < num_elems; i++) {
>> +		if (status & cxl_error_list[i].bit)
>> +			trace_seq_printf(s, "\'%s\' ", cxl_error_list[i].error);
>
>A comment for all instances of trace_seq_printf() in this patch. I think it may
>return an error. Check return value <= 0?

I will add error checking for the trace_seq_printf() calls.
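
As a rough sketch of what that check could look like in the decode loop, the
snippet below stops at the first failed print and reports it to the caller.
It does not link against libtraceevent: demo_seq, demo_seq_printf() and
demo_decode_status() are hypothetical stand-ins, and the return convention
(characters appended on success, 0 when the text does not fit, negative on
error) is an assumption taken from the "<= 0" suggestion above, so please
check it against the installed library before relying on it.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for struct trace_seq: fixed buffer plus fill level. */
struct demo_seq {
	char buf[64];
	size_t len;
};

/*
 * Hypothetical stand-in for trace_seq_printf(). Assumed convention:
 * returns the number of characters appended, 0 if the text does not
 * fit, and a negative value on a formatting error.
 */
static int demo_seq_printf(struct demo_seq *s, const char *fmt, ...)
{
	va_list ap;
	size_t room;
	int n;

	assert(s != NULL);
	room = sizeof(s->buf) - s->len;
	va_start(ap, fmt);
	n = vsnprintf(s->buf + s->len, room, fmt, ap);
	va_end(ap);
	if (n < 0)
		return n;
	if ((size_t)n >= room) {
		s->buf[s->len] = '\0';	/* drop the partial write */
		return 0;
	}
	s->len += (size_t)n;
	return n;
}

struct demo_error_list {
	uint32_t bit;
	const char *error;
};

/* Returns 0 on success, -1 as soon as one print fails. */
static int demo_decode_status(struct demo_seq *s, uint32_t status,
			      const struct demo_error_list *list, size_t num)
{
	size_t i;

	for (i = 0; i < num; i++) {
		if (!(status & list[i].bit))
			continue;
		if (demo_seq_printf(s, "'%s' ", list[i].error) <= 0)
			return -1;
	}
	return 0;
}
```

With this shape the event handler can propagate the failure (for example by
returning -1) instead of silently logging a truncated record.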
>
>> +	}
>> +}
>> +
>> +int ras_cxl_aer_ue_event_handler(struct trace_seq *s,
>> +				 struct tep_record *record,
>> +				 struct tep_event *event, void *context)
>> +{
>> +	int len, i;
>> +	unsigned long long val;
>> +	time_t now;
>> +	struct tm *tm;
>> +	struct ras_events *ras = context;
>> +	struct ras_cxl_aer_ue_event ev;
>> +
>> +	memset(&ev, 0, sizeof(ev));
>> +	now = record->ts/user_hz + ras->uptime_diff;
>
>Adding a space around '/' makes it easier to read:
>now = record->ts / user_hz + ras->uptime_diff;

I will change this.
>
>DJ
>
>> +	tm = localtime(&now);
>> +	if (tm)
>> +		strftime(ev.timestamp, sizeof(ev.timestamp),
>> +			 "%Y-%m-%d %H:%M:%S %z", tm);
>> +	else
>> +		strncpy(ev.timestamp, "1970-01-01 00:00:00 +0000", sizeof(ev.timestamp));
>> +	trace_seq_printf(s, "%s ", ev.timestamp);
>> +
>> +	ev.dev_name = tep_get_field_raw(s, event, "dev_name",
>> +					record, &len, 1);
>> +	if (!ev.dev_name)
>> +		return -1;
>> +	trace_seq_printf(s, "dev_name:%s ", ev.dev_name);
>> +
>> +	if (tep_get_field_val(s, event, "status", record, &val, 1) < 0)
>> +		return -1;
>> +	ev.error_status = val;
>> +
>> +	trace_seq_printf(s, "error status:");
>> +	decode_cxl_error_status(s, ev.error_status,
>> +				cxl_aer_ue, ARRAY_SIZE(cxl_aer_ue));
>> +
>> +	if (tep_get_field_val(s,  event, "first_error", record, &val, 1) < 0)
>> +		return -1;
>> +	ev.first_error = val;
>> +
>> +	trace_seq_printf(s, "first error:");
>> +	decode_cxl_error_status(s, ev.first_error,
>> +				cxl_aer_ue, ARRAY_SIZE(cxl_aer_ue));
>> +
>> +	ev.header_log = tep_get_field_raw(s, event, "header_log",
>> +					  record, &len, 1);
>> +	if (!ev.header_log)
>> +		return -1;
>> +	trace_seq_printf(s, "header log:\n");
>> +	for (i = 0; i < CXL_HEADERLOG_SIZE_U32; i++) {
>> +		trace_seq_printf(s, "%08x ", ev.header_log[i]);
>> +		if ((i > 0) && ((i % 20) == 0))
>> +			trace_seq_printf(s, "\n");
>> +		/* Convert header log data to the big-endian format because
>> +		 * the SQLite database seems to use big-endian storage.
>> +		 */
>> +		ev.header_log[i] = htobe32(ev.header_log[i]);
>> +	}
>> +
>> +	/* Insert data into the SGBD */
>> +#ifdef HAVE_SQLITE3
>> +	ras_store_cxl_aer_ue_event(ras, &ev);
>> +#endif
>> +
>> +#ifdef HAVE_ABRT_REPORT
>> +	/* Report event to ABRT */
>> +	ras_report_cxl_aer_ue_event(ras, &ev);
>> +#endif
>> +
>> +	return 0;
>> +}
>> diff --git a/ras-cxl-handler.h b/ras-cxl-handler.h index
>> 84d5cc6..18b3120 100644
>> --- a/ras-cxl-handler.h
>> +++ b/ras-cxl-handler.h
>> @@ -21,4 +21,9 @@
>>   int ras_cxl_poison_event_handler(struct trace_seq *s,
>>   				 struct tep_record *record,
>>   				 struct tep_event *event, void *context);
>> +
>> +int ras_cxl_aer_ue_event_handler(struct trace_seq *s,
>> +				 struct tep_record *record,
>> +				 struct tep_event *event, void *context);
>> +
>>   #endif
>> diff --git a/ras-events.c b/ras-events.c index 6555125..ead792b 100644
>> --- a/ras-events.c
>> +++ b/ras-events.c
>> @@ -246,6 +246,7 @@ int toggle_ras_mc_event(int enable)
>>
>>   #ifdef HAVE_CXL
>>   	rc |= __toggle_ras_mc_event(ras, "cxl", "cxl_poison", enable);
>> +	rc |= __toggle_ras_mc_event(ras, "cxl",
>> +"cxl_aer_uncorrectable_error", enable);
>>   #endif
>>
>>   free_ras:
>> @@ -964,6 +965,14 @@ int handle_ras_events(int record_events)
>>   	else
>>   		log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
>>   		    "cxl", "cxl_poison");
>> +
>> +	rc = add_event_handler(ras, pevent, page_size, "cxl",
>"cxl_aer_uncorrectable_error",
>> +			       ras_cxl_aer_ue_event_handler, NULL,
>CXL_AER_UE_EVENT);
>> +	if (!rc)
>> +		num_events++;
>> +	else
>> +		log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
>> +		    "cxl", "cxl_aer_uncorrectable_error");
>>   #endif
>>
>>   	if (!num_events) {
>> diff --git a/ras-events.h b/ras-events.h index fc51070..65f9d9a 100644
>> --- a/ras-events.h
>> +++ b/ras-events.h
>> @@ -40,6 +40,7 @@ enum {
>>   	DISKERROR_EVENT,
>>   	MF_EVENT,
>>   	CXL_POISON_EVENT,
>> +	CXL_AER_UE_EVENT,
>>   	NR_EVENTS
>>   };
>>
>> diff --git a/ras-record.c b/ras-record.c index f54fb41..4703790 100644
>> --- a/ras-record.c
>> +++ b/ras-record.c
>> @@ -618,6 +618,54 @@ int ras_store_cxl_poison_event(struct ras_events
>> *ras, struct ras_cxl_poison_eve
>>
>>   	return rc;
>>   }
>> +
>> +/*
>> + * Table and functions to handle cxl:cxl_aer_uncorrectable_error
>> + */
>> +static const struct db_fields cxl_aer_ue_event_fields[] = {
>> +	{ .name = "id",                   .type = "INTEGER PRIMARY KEY" },
>> +	{ .name = "timestamp",            .type = "TEXT" },
>> +	{ .name = "dev_name",             .type = "TEXT" },
>> +	{ .name = "error_status",         .type = "INTEGER" },
>> +	{ .name = "first_error",          .type = "INTEGER" },
>> +	{ .name = "header_log",           .type = "BLOB" },
>> +};
>> +
>> +static const struct db_table_descriptor cxl_aer_ue_event_tab = {
>> +	.name = "cxl_aer_ue_event",
>> +	.fields = cxl_aer_ue_event_fields,
>> +	.num_fields = ARRAY_SIZE(cxl_aer_ue_event_fields),
>> +};
>> +
>> +int ras_store_cxl_aer_ue_event(struct ras_events *ras, struct ras_cxl_aer_ue_event *ev)
>> +{
>> +	int rc;
>> +	struct sqlite3_priv *priv = ras->db_priv;
>> +
>> +	if (!priv || !priv->stmt_cxl_aer_ue_event)
>> +		return 0;
>> +	log(TERM, LOG_INFO, "cxl_aer_ue_event store: %p\n",
>> +priv->stmt_cxl_aer_ue_event);
>> +
>> +	sqlite3_bind_text(priv->stmt_cxl_aer_ue_event, 1, ev->timestamp, -
>1, NULL);
>> +	sqlite3_bind_text(priv->stmt_cxl_aer_ue_event, 2, ev->dev_name, -
>1, NULL);
>> +	sqlite3_bind_int(priv->stmt_cxl_aer_ue_event, 3, ev->error_status);
>> +	sqlite3_bind_int(priv->stmt_cxl_aer_ue_event, 4, ev->first_error);
>> +	sqlite3_bind_blob(priv->stmt_cxl_aer_ue_event, 5, ev->header_log,
>> +CXL_HEADERLOG_SIZE, NULL);
>> +
>> +	rc = sqlite3_step(priv->stmt_cxl_aer_ue_event);
>> +	if (rc != SQLITE_OK && rc != SQLITE_DONE)
>> +		log(TERM, LOG_ERR,
>> +		    "Failed to do cxl_aer_ue_event step on sqlite: error =
>%d\n", rc);
>> +	rc = sqlite3_reset(priv->stmt_cxl_aer_ue_event);
>> +	if (rc != SQLITE_OK && rc != SQLITE_DONE)
>> +		log(TERM, LOG_ERR,
>> +		    "Failed reset cxl_aer_ue_event on sqlite: error = %d\n",
>> +		    rc);
>> +	log(TERM, LOG_INFO, "register inserted at db\n");
>> +
>> +	return rc;
>> +}
>> +
>>   #endif
>>
>>   /*
>> @@ -965,6 +1013,15 @@ int ras_mc_event_opendb(unsigned cpu, struct
>ras_events *ras)
>>   		if (rc != SQLITE_OK)
>>   			goto error;
>>   	}
>> +
>> +	rc = ras_mc_create_table(priv, &cxl_aer_ue_event_tab);
>> +	if (rc == SQLITE_OK) {
>> +		rc = ras_mc_prepare_stmt(priv, &priv-
>>stmt_cxl_aer_ue_event,
>> +					 &cxl_aer_ue_event_tab);
>> +		if (rc != SQLITE_OK)
>> +			goto error;
>> +	}
>> +
>>   #endif
>>
>>   	ras->db_priv = priv;
>> @@ -1087,6 +1144,14 @@ int ras_mc_event_closedb(unsigned int cpu,
>struct ras_events *ras)
>>   			    "cpu %u: Failed to finalize cxl_poison_event sqlite:
>error = %d\n",
>>   			    cpu, rc);
>>   	}
>> +
>> +	if (priv->stmt_cxl_aer_ue_event) {
>> +		rc = sqlite3_finalize(priv->stmt_cxl_aer_ue_event);
>> +		if (rc != SQLITE_OK)
>> +			log(TERM, LOG_ERR,
>> +			    "cpu %u: Failed to finalize cxl_aer_ue_event sqlite:
>error = %d\n",
>> +			    cpu, rc);
>> +	}
>>   #endif
>>
>>   	rc = sqlite3_close_v2(db);
>> diff --git a/ras-record.h b/ras-record.h index e5bf483..0e2c178 100644
>> --- a/ras-record.h
>> +++ b/ras-record.h
>> @@ -128,6 +128,18 @@ struct ras_cxl_poison_event {
>>   	char overflow_ts[64];
>>   };
>>
>> +#define SZ_512                          0x200
>> +#define CXL_HEADERLOG_SIZE              SZ_512
>> +#define CXL_HEADERLOG_SIZE_U32          (SZ_512 / sizeof(uint32_t))
>> +
>> +struct ras_cxl_aer_ue_event {
>> +	char timestamp[64];
>> +	const char *dev_name;
>> +	uint32_t error_status;
>> +	uint32_t first_error;
>> +	uint32_t *header_log;
>> +};
>> +
>>   struct ras_mc_event;
>>   struct ras_aer_event;
>>   struct ras_extlog_event;
>> @@ -138,6 +150,7 @@ struct devlink_event;
>>   struct diskerror_event;
>>   struct ras_mf_event;
>>   struct ras_cxl_poison_event;
>> +struct ras_cxl_aer_ue_event;
>>
>>   #ifdef HAVE_SQLITE3
>>
>> @@ -172,6 +185,7 @@ struct sqlite3_priv {
>>   #endif
>>   #ifdef HAVE_CXL
>>   	sqlite3_stmt	*stmt_cxl_poison_event;
>> +	sqlite3_stmt	*stmt_cxl_aer_ue_event;
>>   #endif
>>   };
>>
>> @@ -201,6 +215,7 @@ int ras_store_devlink_event(struct ras_events *ras,
>struct devlink_event *ev);
>>   int ras_store_diskerror_event(struct ras_events *ras, struct
>diskerror_event *ev);
>>   int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev);
>>   int ras_store_cxl_poison_event(struct ras_events *ras, struct
>> ras_cxl_poison_event *ev);
>> +int ras_store_cxl_aer_ue_event(struct ras_events *ras, struct
>> +ras_cxl_aer_ue_event *ev);
>>
>>   #else
>>   static inline int ras_mc_event_opendb(unsigned cpu, struct
>> ras_events *ras) { return 0; }; @@ -215,6 +230,7 @@ static inline int
>ras_store_devlink_event(struct ras_events *ras, struct devlink
>>   static inline int ras_store_diskerror_event(struct ras_events *ras, struct
>diskerror_event *ev) { return 0; };
>>   static inline int ras_store_mf_event(struct ras_events *ras, struct
>ras_mf_event *ev) { return 0; };
>>   static inline int ras_store_cxl_poison_event(struct ras_events *ras,
>> struct ras_cxl_poison_event *ev) { return 0; };
>> +static inline int ras_store_cxl_aer_ue_event(struct ras_events *ras,
>> +struct ras_cxl_aer_ue_event *ev) { return 0; };
>>
>>   #endif
>>
>> diff --git a/ras-report.c b/ras-report.c index 589e640..4c09061 100644
>> --- a/ras-report.c
>> +++ b/ras-report.c
>> @@ -367,6 +367,28 @@ static int set_cxl_poison_event_backtrace(char
>*buf, struct ras_cxl_poison_event
>>   	return 0;
>>   }
>>
>> +static int set_cxl_aer_ue_event_backtrace(char *buf, struct
>> +ras_cxl_aer_ue_event *ev) {
>> +	char bt_buf[MAX_BACKTRACE_SIZE];
>> +
>> +	if (!buf || !ev)
>> +		return -1;
>> +
>> +	sprintf(bt_buf, "BACKTRACE="	\
>> +						"timestamp=%s\n"	\
>> +						"dev_name=%s\n"	\
>> +						"error_status=%u\n"	\
>> +						"first_error=%u\n",	\
>> +						ev->timestamp,	\
>> +						ev->dev_name,	\
>> +						ev->error_status,	\
>> +						ev->first_error);
>> +
>> +	strcat(buf, bt_buf);
>> +
>> +	return 0;
>> +}
>> +
>>   static int commit_report_backtrace(int sockfd, int type, void *ev){
>>   	char buf[MAX_BACKTRACE_SIZE];
>>   	char *pbuf = buf;
>> @@ -407,6 +429,9 @@ static int commit_report_backtrace(int sockfd, int
>type, void *ev){
>>   	case CXL_POISON_EVENT:
>>   		rc = set_cxl_poison_event_backtrace(buf, (struct
>ras_cxl_poison_event *)ev);
>>   		break;
>> +	case CXL_AER_UE_EVENT:
>> +		rc = set_cxl_aer_ue_event_backtrace(buf, (struct
>ras_cxl_aer_ue_event *)ev);
>> +		break;
>>   	default:
>>   		return -1;
>>   	}
>> @@ -859,3 +884,47 @@ cxl_poison_fail:
>>   	else
>>   		return -1;
>>   }
>> +
>> +int ras_report_cxl_aer_ue_event(struct ras_events *ras, struct
>> +ras_cxl_aer_ue_event *ev) {
>> +	char buf[MAX_MESSAGE_SIZE];
>> +	int sockfd = 0;
>> +	int done = 0;
>> +	int rc = -1;
>> +
>> +	memset(buf, 0, sizeof(buf));
>> +
>> +	sockfd = setup_report_socket();
>> +	if (sockfd < 0)
>> +		return -1;
>> +
>> +	rc = commit_report_basic(sockfd);
>> +	if (rc < 0)
>> +		goto cxl_aer_ue_fail;
>> +
>> +	rc = commit_report_backtrace(sockfd, CXL_AER_UE_EVENT, ev);
>> +	if (rc < 0)
>> +		goto cxl_aer_ue_fail;
>> +
>> +	sprintf(buf, "ANALYZER=%s", "rasdaemon-cxl-aer-uncorrectable-
>error");
>> +	rc = write(sockfd, buf, strlen(buf) + 1);
>> +	if (rc < strlen(buf) + 1)
>> +		goto cxl_aer_ue_fail;
>> +
>> +	sprintf(buf, "REASON=%s", "CXL AER uncorrectable error");
>> +	rc = write(sockfd, buf, strlen(buf) + 1);
>> +	if (rc < strlen(buf) + 1)
>> +		goto cxl_aer_ue_fail;
>> +
>> +	done = 1;
>> +
>> +cxl_aer_ue_fail:
>> +
>> +	if (sockfd >= 0)
>> +		close(sockfd);
>> +
>> +	if (done)
>> +		return 0;
>> +	else
>> +		return -1;
>> +}
>> diff --git a/ras-report.h b/ras-report.h index d1591ce..dfe89d1 100644
>> --- a/ras-report.h
>> +++ b/ras-report.h
>> @@ -40,6 +40,7 @@ int ras_report_devlink_event(struct ras_events *ras,
>struct devlink_event *ev);
>>   int ras_report_diskerror_event(struct ras_events *ras, struct
>diskerror_event *ev);
>>   int ras_report_mf_event(struct ras_events *ras, struct ras_mf_event
>*ev);
>>   int ras_report_cxl_poison_event(struct ras_events *ras, struct
>> ras_cxl_poison_event *ev);
>> +int ras_report_cxl_aer_ue_event(struct ras_events *ras, struct
>> +ras_cxl_aer_ue_event *ev);
>>
>>   #else
>>
>> @@ -52,6 +53,7 @@ static inline int ras_report_devlink_event(struct
>ras_events *ras, struct devlin
>>   static inline int ras_report_diskerror_event(struct ras_events *ras, struct
>diskerror_event *ev) { return 0; };
>>   static inline int ras_report_mf_event(struct ras_events *ras, struct
>ras_mf_event *ev) { return 0; };
>>   static inline int ras_report_cxl_poison_event(struct ras_events
>> *ras, struct ras_cxl_poison_event *ev) { return 0; };
>> +static inline int ras_report_cxl_aer_ue_event(struct ras_events *ras,
>> +struct ras_cxl_aer_ue_event *ev) { return 0; };
>>
>>   #endif
>>
Thanks,
Shiju

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [PATCH V2 2/4] rasdaemon: Add support for the CXL poison events
  2023-01-25 22:34   ` Ira Weiny
@ 2023-01-26 10:04     ` Shiju Jose
  0 siblings, 0 replies; 11+ messages in thread
From: Shiju Jose @ 2023-01-26 10:04 UTC (permalink / raw)
  To: Ira Weiny, linux-edac, linux-cxl, mchehab; +Cc: Jonathan Cameron, Linuxarm

Hi Ira,

Thank you for the feedback.

>-----Original Message-----
>From: Ira Weiny <ira.weiny@intel.com>
>Sent: 25 January 2023 22:35
>To: Shiju Jose <shiju.jose@huawei.com>; linux-edac@vger.kernel.org; linux-
>cxl@vger.kernel.org; mchehab@kernel.org
>Cc: Jonathan Cameron <jonathan.cameron@huawei.com>; Linuxarm
><linuxarm@huawei.com>; Shiju Jose <shiju.jose@huawei.com>
>Subject: Re: [PATCH V2 2/4] rasdaemon: Add support for the CXL poison
>events
>
>shiju.jose@ wrote:
>> From: Shiju Jose <shiju.jose@huawei.com>
>>
>> Add support to log and record the CXL poison events.
>>
>> The corresponding Kernel patches here:
>> https://lore.kernel.org/linux-cxl/de11785ff05844299b40b100f8e0f56c7eef
>> 7f08.1674070170.git.alison.schofield@intel.com/
>>
>> Presently RFC draft version for logging, could be extended for the
>> policy based recovery action for the frequent poison events depending
>> on the above kernel patches.
>>
>> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>
>[snip]
>
>> +
>> +int ras_cxl_poison_event_handler(struct trace_seq *s,
>> +				 struct tep_record *record,
>> +				 struct tep_event *event, void *context)
>> +{
>> +	int len;
>> +	unsigned long long val;
>> +	struct ras_events *ras = context;
>> +	time_t now;
>> +	struct tm *tm;
>> +	struct ras_cxl_poison_event ev;
>> +
>> +	now = record->ts/user_hz + ras->uptime_diff;
>> +	tm = localtime(&now);
>> +	if (tm)
>> +		strftime(ev.timestamp, sizeof(ev.timestamp),
>> +			 "%Y-%m-%d %H:%M:%S %z", tm);
>> +	else
>> +		strncpy(ev.timestamp, "1970-01-01 00:00:00 +0000",
>sizeof(ev.timestamp));
>> +	trace_seq_printf(s, "%s ", ev.timestamp);
>> +
>> +	ev.memdev = tep_get_field_raw(s, event, "memdev",
>> +				      record, &len, 1);
>> +	if (!ev.memdev)
>> +		return -1;
>> +	trace_seq_printf(s, "memdev:%s ", ev.memdev);
>> +
>> +	ev.pcidev = tep_get_field_raw(s, event, "pcidev",
>> +				      record, &len, 1);
>> +	if (!ev.pcidev)
>> +		return -1;
>> +	trace_seq_printf(s, "pcidev:%s ", ev.pcidev);
>> +
>> +	ev.region = tep_get_field_raw(s, event, "region",
>> +				      record, &len, 1);
>> +	if (!ev.region)
>> +		return -1;
>> +	trace_seq_printf(s, "region:%s ", ev.region);
>> +
>> +	ev.uuid = tep_get_field_raw(s, event, "uuid",
>> +				    record, &len, 1);
>> +	if (!ev.uuid)
>> +		return -1;
>> +	trace_seq_printf(s, "region_uuid:%s ", ev.uuid);
>> +
>> +	if (tep_get_field_val(s, event, "hpa", record, &val, 1) < 0)
>> +		return -1;
>> +	ev.hpa = val;
>> +	trace_seq_printf(s, "poison list: hpa:0x%llx ", (unsigned long
>> +long)ev.hpa);
>> +
>> +	if (tep_get_field_val(s, event, "dpa", record, &val, 1) < 0)
>> +		return -1;
>> +	ev.dpa = val;
>> +	trace_seq_printf(s, "dpa:0x%llx ", (unsigned long long)ev.dpa);
>> +
>> +	if (tep_get_field_val(s, event, "length", record, &val, 1) < 0)
>> +		return -1;
>> +	ev.length = val;
>> +	trace_seq_printf(s, "length:%d ", ev.length);
>> +
>> +	if (tep_get_field_val(s,  event, "source", record, &val, 1) < 0)
>> +		return -1;
>> +
>> +	switch (val) {
>> +	case CXL_POISON_SOURCE_UNKNOWN:
>> +		ev.source = "Unknown";
>> +		break;
>> +	case CXL_POISON_SOURCE_EXTERNAL:
>> +		ev.source = "External";
>> +		break;
>> +	case CXL_POISON_SOURCE_INTERNAL:
>> +		ev.source = "Internal";
>> +		break;
>> +	case CXL_POISON_SOURCE_INJECTED:
>> +		ev.source = "Injected";
>> +		break;
>> +	case CXL_POISON_SOURCE_VENDOR:
>> +		ev.source = "Vendor";
>> +		break;
>> +	default:
>> +		ev.source = "Invalid";
>> +	}
>> +	trace_seq_printf(s, "source:%s ", ev.source);
>> +
>> +	if (tep_get_field_val(s,  event, "flags", record, &val, 1) < 0)
>> +		return -1;
>> +	ev.flags = val;
>> +	trace_seq_printf(s, "flags:%d ", ev.flags);
>> +
>> +	if (ev.flags & CXL_POISON_FLAG_OVERFLOW) {
>> +		if (tep_get_field_val(s,  event, "overflow_t", record, &val, 1) <
>0)
>> +			return -1;
>> +		if (val) {
>> +			/* CXL Specification 3.0
>> +			 * Overflow timestamp - The number of unsigned
>nanoseconds
>> +			 * that have elapsed since midnight, 01-Jan-1970 UTC
>> +			 */
>> +			time_t ovf_ts_secs = val / 1000000000ULL;
>> +
>> +			tm = localtime(&ovf_ts_secs);
>> +			if (tm) {
>> +				strftime(ev.overflow_ts,
>sizeof(ev.overflow_ts),
>> +					 "%Y-%m-%d %H:%M:%S %z", tm);
>> +			}
>> +		}
>> +		if (!val || !tm)
>> +			strncpy(ev.overflow_ts, "1970-01-01 00:00:00 +0000",
>> +				sizeof(ev.overflow_ts));
>> +	} else
>> +		strncpy(ev.overflow_ts, "1970-01-01 00:00:00 +0000",
>sizeof(ev.overflow_ts));
>> +	trace_seq_printf(s, "overflow timestamp:%s ", ev.overflow_ts);
>> +	trace_seq_printf(s, "\n");
>> +
>> +	/* Insert data into the SGBD */
>> +#ifdef HAVE_SQLITE3
>> +	ras_store_cxl_poison_event(ras, &ev);
>> +#endif
>
>I know nothing about the rasdaemon build system but it seems like this needs
>a ifdef HAVE_CXL around it?
>
>[snip]

The file ras-cxl-handler.c is included in the build only if configure is run with --enable-cxl.
With --enable-cxl, the build scripts define HAVE_CXL.
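
To illustrate the guard pattern being discussed, here is a minimal sketch
under stated assumptions. The caller stands for code that is only compiled
in the CXL-enabled build, and the store function falls back to a static
inline stub when database support is compiled out. DEMO_HAVE_SQLITE3,
demo_event, demo_store_event() and demo_handle_event() are hypothetical
names used only for this example; they are not part of rasdaemon.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event type, standing in for struct ras_cxl_poison_event. */
struct demo_event {
	const char *memdev;
};

/*
 * Define DEMO_HAVE_SQLITE3 (a hypothetical macro mirroring HAVE_SQLITE3)
 * at compile time to select the "real" store path.
 */
#ifdef DEMO_HAVE_SQLITE3
static int demo_stored_count;

static int demo_store_event(const struct demo_event *ev)
{
	assert(ev != NULL);
	/* A real build would bind and step a prepared statement here. */
	demo_stored_count++;
	return 0;
}
#else
/* Stub: keeps callers free of #ifdef when storage is compiled out. */
static inline int demo_store_event(const struct demo_event *ev)
{
	(void)ev;
	return 0;
}
#endif

/* Caller that is only built in the CXL-enabled configuration. */
static int demo_handle_event(const struct demo_event *ev)
{
	if (!ev || !ev->memdev)
		return -1;
	return demo_store_event(ev);
}
```

Because the only caller lives in a file that is built solely with
--enable-cxl, a configuration with SQLite but without CXL never references
the store function, which is why no extra guard is needed at the call site.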

>
>> --- a/ras-record.c
>> +++ b/ras-record.c
>> @@ -559,6 +559,67 @@ int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev)
>>  }
>>  #endif
>>
>> +#ifdef HAVE_CXL
>> +/*
>> + * Table and functions to handle cxl:cxl_poison
>> + */
>> +static const struct db_fields cxl_poison_event_fields[] = {
>> +	{ .name = "id",                   .type = "INTEGER PRIMARY KEY" },
>> +	{ .name = "timestamp",            .type = "TEXT" },
>> +	{ .name = "memdev",               .type = "TEXT" },
>> +	{ .name = "pcidev",               .type = "TEXT" },
>> +	{ .name = "region",               .type = "TEXT" },
>> +	{ .name = "region_uuid",          .type = "TEXT" },
>> +	{ .name = "hpa",                  .type = "INTEGER" },
>> +	{ .name = "dpa",                  .type = "INTEGER" },
>> +	{ .name = "length",               .type = "INTEGER" },
>> +	{ .name = "source",               .type = "TEXT" },
>> +	{ .name = "flags",                .type = "INTEGER" },
>> +	{ .name = "overflow_ts",          .type = "TEXT" },
>> +};
>> +
>> +static const struct db_table_descriptor cxl_poison_event_tab = {
>> +	.name = "cxl_poison_event",
>> +	.fields = cxl_poison_event_fields,
>> +	.num_fields = ARRAY_SIZE(cxl_poison_event_fields),
>> +};
>> +
>> +int ras_store_cxl_poison_event(struct ras_events *ras, struct
>> +ras_cxl_poison_event *ev)
>
>Because I believe this is not defined if (!HAVE_CXL and HAVE_SQLITE3)
>
>[snip]
>
>>
>>  #ifdef HAVE_SQLITE3
>>
>> @@ -155,6 +170,9 @@ struct sqlite3_priv {
>>  #ifdef HAVE_MEMORY_FAILURE
>>  	sqlite3_stmt	*stmt_mf_event;
>>  #endif
>> +#ifdef HAVE_CXL
>> +	sqlite3_stmt	*stmt_cxl_poison_event;
>> +#endif
>>  };
>>
>>  struct db_fields {
>> @@ -182,6 +200,7 @@ int ras_store_arm_record(struct ras_events *ras, struct ras_arm_event *ev);
>>  int ras_store_devlink_event(struct ras_events *ras, struct devlink_event *ev);
>>  int ras_store_diskerror_event(struct ras_events *ras, struct diskerror_event *ev);
>>  int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev);
>> +int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev);
>>
>>  #else
>>  static inline int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras) { return 0; };
>> @@ -195,6 +214,7 @@ static inline int ras_store_arm_record(struct ras_events *ras, struct ras_arm_ev
>>  static inline int ras_store_devlink_event(struct ras_events *ras, struct devlink_event *ev) { return 0; };
>>  static inline int ras_store_diskerror_event(struct ras_events *ras, struct diskerror_event *ev) { return 0; };
>>  static inline int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev) { return 0; };
>> +static inline int ras_store_cxl_poison_event(struct ras_events *ras, struct ras_cxl_poison_event *ev) { return 0; };
>
>But I could be missing something.
>
>Ira
>
>[snip]

Thanks,
Shiju

^ permalink raw reply	[flat|nested] 11+ messages in thread

Thread overview: 11+ messages
2023-01-24 16:57 [PATCH V2 0/4] rasdaemon: Add support for the CXL error events shiju.jose
2023-01-24 16:57 ` [PATCH V2 1/4] rasdaemon: Move definition for BIT and BIT_ULL to a common file shiju.jose
2023-01-25 16:34   ` Dave Jiang
2023-01-24 16:57 ` [PATCH V2 2/4] rasdaemon: Add support for the CXL poison events shiju.jose
2023-01-25 22:34   ` Ira Weiny
2023-01-26 10:04     ` Shiju Jose
2023-01-24 16:57 ` [PATCH V2 3/4] rasdaemon: Add support for the CXL AER uncorrectable errors shiju.jose
2023-01-25 16:54   ` Dave Jiang
2023-01-26  9:18     ` Shiju Jose
2023-01-24 16:57 ` [PATCH V2 4/4] rasdaemon: Add support for the CXL AER correctable errors shiju.jose
2023-01-25 16:56   ` Dave Jiang
