* [PATCH v10 1/3] aerdrv: Trace Event for AER
@ 2013-01-16 23:51 Lance Ortiz
  2013-01-16 23:51 ` [PATCH v10 2/3] aerdrv: Enhanced AER logging Lance Ortiz
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Lance Ortiz @ 2013-01-16 23:51 UTC (permalink / raw)
  To: bhelgaas, lance_ortiz, jiang.liu, tony.luck, bp, rostedt,
	mchehab, linux-acpi, linux-pci, linux-kernel

This header file will define a new trace event that will be triggered when
an AER event occurs.  The following data will be provided to the trace
event.

char * dev_name - The name of the slot where the device resides
                  ([domain:]bus:device.function).

u32 status - Either the correctable or uncorrectable register
             indicating what error or errors have been seen.

u8 severity - error severity 0:NONFATAL 1:FATAL 2:CORRECTED

The trace event will also provide a trace string that may look like:

"0000:05:00.0 PCIe Bus Error:severity=Uncorrected (Non-Fatal), Poisoned
TLP"

v1-v2 Move header from include/ras/aer_event.h to
include/trace/events/ras.h
v3-v4 Cleaned up comments and commit header
v4-v5 More cleanup remove () from if statement in print.
      Renamed string define to be more specific.
v5-v6 change TRACE_SYSTEM define to be ras and not aer.

Signed-off-by: Lance Ortiz <lance.ortiz@hp.com>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Tony Luck <tony.luck@intel.com>
---

 include/trace/events/ras.h |   77 ++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 77 insertions(+), 0 deletions(-)
 create mode 100644 include/trace/events/ras.h

diff --git a/include/trace/events/ras.h b/include/trace/events/ras.h
new file mode 100644
index 0000000..88b8783
--- /dev/null
+++ b/include/trace/events/ras.h
@@ -0,0 +1,77 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM ras
+
+#if !defined(_TRACE_AER_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_AER_H
+
+#include <linux/tracepoint.h>
+#include <linux/edac.h>
+
+
+/*
+ * PCIe AER Trace event
+ *
+ * These events are generated when hardware detects a corrected or
+ * uncorrected event on a PCIe device. The event report has
+ * the following structure:
+ *
+ * char * dev_name -	The name of the slot where the device resides
+ *			([domain:]bus:device.function).
+ * u32 status -		Either the correctable or uncorrectable register
+ *			indicating what error or errors have been seen
+ * u8 severity -	error severity 0:NONFATAL 1:FATAL 2:CORRECTED
+ */
+
+#define aer_correctable_errors		\
+	{BIT(0),	"Receiver Error"},		\
+	{BIT(6),	"Bad TLP"},			\
+	{BIT(7),	"Bad DLLP"},			\
+	{BIT(8),	"REPLAY_NUM Rollover"},		\
+	{BIT(12),	"Replay Timer Timeout"},	\
+	{BIT(13),	"Advisory Non-Fatal"}
+
+#define aer_uncorrectable_errors		\
+	{BIT(4),	"Data Link Protocol"},		\
+	{BIT(12),	"Poisoned TLP"},		\
+	{BIT(13),	"Flow Control Protocol"},	\
+	{BIT(14),	"Completion Timeout"},		\
+	{BIT(15),	"Completer Abort"},		\
+	{BIT(16),	"Unexpected Completion"},	\
+	{BIT(17),	"Receiver Overflow"},		\
+	{BIT(18),	"Malformed TLP"},		\
+	{BIT(19),	"ECRC"},			\
+	{BIT(20),	"Unsupported Request"}
+
+TRACE_EVENT(aer_event,
+	TP_PROTO(const char *dev_name,
+		 const u32 status,
+		 const u8 severity),
+
+	TP_ARGS(dev_name, status, severity),
+
+	TP_STRUCT__entry(
+		__string(	dev_name,	dev_name	)
+		__field(	u32,		status		)
+		__field(	u8,		severity	)
+	),
+
+	TP_fast_assign(
+		__assign_str(dev_name, dev_name);
+		__entry->status		= status;
+		__entry->severity	= severity;
+	),
+
+	TP_printk("%s PCIe Bus Error: severity=%s, %s\n",
+		__get_str(dev_name),
+		__entry->severity == HW_EVENT_ERR_CORRECTED ? "Corrected" :
+			__entry->severity == HW_EVENT_ERR_FATAL ?
+			"Fatal" : "Uncorrected",
+		__entry->severity == HW_EVENT_ERR_CORRECTED ?
+		__print_flags(__entry->status, "|", aer_correctable_errors) :
+		__print_flags(__entry->status, "|", aer_uncorrectable_errors))
+);
+
+#endif /* _TRACE_AER_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>

^ permalink raw reply related	[flat|nested] 17+ messages in thread
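
The TP_printk() in the patch above relies on __print_flags() to turn the
status register into a "|"-separated list of names. As a rough userspace
illustration of that idea (not the kernel's implementation), the sketch
below walks a {mask, name} table and joins the names of the bits that are
set. The names flag_name, aer_correctable_names and decode_flags are
hypothetical, and the sketch silently ignores bits that have no table
entry.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical userspace mirror of the {mask, name} pairs in the patch. */
struct flag_name {
	uint32_t mask;
	const char *name;
};

/* Correctable status bits, with bit 8 spelled as in the PCIe spec. */
static const struct flag_name aer_correctable_names[] = {
	{ 1u << 0,  "Receiver Error" },
	{ 1u << 6,  "Bad TLP" },
	{ 1u << 7,  "Bad DLLP" },
	{ 1u << 8,  "REPLAY_NUM Rollover" },
	{ 1u << 12, "Replay Timer Timeout" },
	{ 1u << 13, "Advisory Non-Fatal" },
};

#define AER_CORRECTABLE_COUNT \
	(sizeof(aer_correctable_names) / sizeof(aer_correctable_names[0]))

/*
 * Write the names of all bits set in 'status' into 'buf', separated by
 * 'delim'. Returns the length of the resulting string. If a name does
 * not fit, the output stops cleanly before that name.
 */
static size_t decode_flags(uint32_t status, const char *delim,
			   const struct flag_name *tbl, size_t n,
			   char *buf, size_t len)
{
	size_t used = 0;
	size_t i;

	if (len == 0)
		return 0;
	buf[0] = '\0';

	for (i = 0; i < n; i++) {
		int w;

		if (!(status & tbl[i].mask))
			continue;
		w = snprintf(buf + used, len - used, "%s%s",
			     used ? delim : "", tbl[i].name);
		if (w < 0 || (size_t)w >= len - used) {
			buf[used] = '\0';	/* drop the partial name */
			break;
		}
		used += (size_t)w;
	}
	return used;
}
```

Calling decode_flags() with bits 6 and 12 set and "|" as the delimiter
should yield "Bad TLP|Replay Timer Timeout", which is the shape of the
last field in the trace string.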

* [PATCH v10 2/3] aerdrv: Enhanced AER logging
  2013-01-16 23:51 [PATCH v10 1/3] aerdrv: Trace Event for AER Lance Ortiz
@ 2013-01-16 23:51 ` Lance Ortiz
  2013-01-16 23:51 ` [PATCH v10 3/3] aerdrv: Cleanup log output for AER Lance Ortiz
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Lance Ortiz @ 2013-01-16 23:51 UTC (permalink / raw)
  To: bhelgaas, lance_ortiz, jiang.liu, tony.luck, bp, rostedt,
	mchehab, linux-acpi, linux-pci, linux-kernel

This patch will provide a more reliable and easy way for user-space
applications to have access to AER logs rather than reading them from the
message buffer. It also provides a way to notify user-space when an AER
event occurs.

The AER driver is updated to emit the 'aer_event' trace event whenever
a PCIe error is reported over the AER interface.  The trace event is
emitted from both the interrupt-based AER path and the firmware-first path.

v2-v3 Update to new location of trace header. Update print to remove
warning.
v3-v4 Reworked logic when getting ready to call cper_print_aer
v6-v7 Change print from pr_info to pr_err if !dev
v7-v8 Add pfx argument back into call to cper_print_aer()

Signed-off-by: Lance Ortiz <lance.ortiz@hp.com>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
---

 drivers/acpi/apei/cper.c               |   19 ++++++++++++++++---
 drivers/pci/pcie/aer/aerdrv_errprint.c |    9 ++++++++-
 include/linux/aer.h                    |    4 ++--
 3 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/drivers/acpi/apei/cper.c b/drivers/acpi/apei/cper.c
index e6defd8..1e5d8a4 100644
--- a/drivers/acpi/apei/cper.c
+++ b/drivers/acpi/apei/cper.c
@@ -29,6 +29,7 @@
 #include <linux/time.h>
 #include <linux/cper.h>
 #include <linux/acpi.h>
+#include <linux/pci.h>
 #include <linux/aer.h>
 
 /*
@@ -249,6 +250,10 @@ static const char *cper_pcie_port_type_strs[] = {
 static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
 			    const struct acpi_hest_generic_data *gdata)
 {
+#ifdef CONFIG_ACPI_APEI_PCIEAER
+	struct pci_dev *dev;
+#endif
+
 	if (pcie->validation_bits & CPER_PCIE_VALID_PORT_TYPE)
 		printk("%s""port_type: %d, %s\n", pfx, pcie->port_type,
 		       pcie->port_type < ARRAY_SIZE(cper_pcie_port_type_strs) ?
@@ -281,10 +286,18 @@ static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie,
 	"%s""bridge: secondary_status: 0x%04x, control: 0x%04x\n",
 	pfx, pcie->bridge.secondary_status, pcie->bridge.control);
 #ifdef CONFIG_ACPI_APEI_PCIEAER
-	if (pcie->validation_bits & CPER_PCIE_VALID_AER_INFO) {
-		struct aer_capability_regs *aer_regs = (void *)pcie->aer_info;
-		cper_print_aer(pfx, gdata->error_severity, aer_regs);
+	dev = pci_get_domain_bus_and_slot(pcie->device_id.segment,
+			pcie->device_id.bus, pcie->device_id.function);
+	if (!dev) {
+		pr_err("PCI AER Cannot get PCI device %04x:%02x:%02x.%d\n",
+			pcie->device_id.segment, pcie->device_id.bus,
+			pcie->device_id.slot, pcie->device_id.function);
+		return;
 	}
+	if (pcie->validation_bits & CPER_PCIE_VALID_AER_INFO)
+		cper_print_aer(pfx, dev, gdata->error_severity,
+				(struct aer_capability_regs *) pcie->aer_info);
+	pci_dev_put(dev);
 #endif
 }
 
diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
index 3ea5173..d3e5fc5 100644
--- a/drivers/pci/pcie/aer/aerdrv_errprint.c
+++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
@@ -23,6 +23,9 @@
 
 #include "aerdrv.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/ras.h>
+
 #define AER_AGENT_RECEIVER		0
 #define AER_AGENT_REQUESTER		1
 #define AER_AGENT_COMPLETER		2
@@ -194,6 +197,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 	if (info->id && info->error_dev_num > 1 && info->id == id)
 		printk("%s""  Error of this Agent(%04x) is reported first\n",
 			prefix, id);
+	trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
+			info->severity);
 }
 
 void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
@@ -217,7 +222,7 @@ int cper_severity_to_aer(int cper_severity)
 }
 EXPORT_SYMBOL_GPL(cper_severity_to_aer);
 
-void cper_print_aer(const char *prefix, int cper_severity,
+void cper_print_aer(const char *prefix, struct pci_dev *dev, int cper_severity,
 		    struct aer_capability_regs *aer)
 {
 	int aer_severity, layer, agent, status_strs_size, tlp_header_valid = 0;
@@ -259,5 +264,7 @@ void cper_print_aer(const char *prefix, int cper_severity,
 			*(tlp + 8), *(tlp + 15), *(tlp + 14),
 			*(tlp + 13), *(tlp + 12));
 	}
+	trace_aer_event(dev_name(&dev->dev), (status & ~mask),
+			aer_severity);
 }
 #endif
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 544abdb..ec10e1b 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -49,8 +49,8 @@ static inline int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev)
 }
 #endif
 
-extern void cper_print_aer(const char *prefix, int cper_severity,
-			   struct aer_capability_regs *aer);
+extern void cper_print_aer(const char *prefix, struct pci_dev *dev,
+			   int cper_severity, struct aer_capability_regs *aer);
 extern int cper_severity_to_aer(int cper_severity);
 extern void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
 			      int severity);

^ permalink raw reply related	[flat|nested] 17+ messages in thread
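
A note on the lookup in cper_print_pcie() above: the third parameter of
pci_get_domain_bus_and_slot() is a combined device/function number
(devfn), not a bare function number. The hunk passes
pcie->device_id.function on its own, which would select device 0 on the
bus. The sketch below only illustrates the devfn packing in userspace.
The DEMO_ macros use the same bit layout as the kernel's PCI_DEVFN(),
PCI_SLOT() and PCI_FUNC(), and the struct and function names are
hypothetical. Which CPER field carries the device number is not shown in
this patch, so the sketch takes it as a plain parameter.

```c
#include <stdint.h>

/* Local copies of the devfn layout: device in bits 7:3, function in 2:0. */
#define DEMO_PCI_DEVFN(dev, func)	((((dev) & 0x1f) << 3) | ((func) & 0x07))
#define DEMO_PCI_SLOT(devfn)		(((devfn) >> 3) & 0x1f)
#define DEMO_PCI_FUNC(devfn)		((devfn) & 0x07)

/* Hypothetical stand-in for the address fields of a CPER PCIe record. */
struct demo_pcie_addr {
	uint16_t segment;
	uint8_t bus;
	uint8_t device;
	uint8_t function;
};

/* Build the devfn value that a domain/bus/devfn lookup expects. */
static unsigned int demo_devfn(const struct demo_pcie_addr *addr)
{
	return DEMO_PCI_DEVFN(addr->device, addr->function);
}
```

For a device at 0000:05:1c.3, demo_devfn() gives 0xe3, whereas passing
the function number 3 alone would decode as device 0, function 3.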

* [PATCH v10 3/3] aerdrv: Cleanup log output for AER
  2013-01-16 23:51 [PATCH v10 1/3] aerdrv: Trace Event for AER Lance Ortiz
  2013-01-16 23:51 ` [PATCH v10 2/3] aerdrv: Enhanced AER logging Lance Ortiz
@ 2013-01-16 23:51 ` Lance Ortiz
  2013-01-17 17:21     ` Luck, Tony
  2013-12-02  5:05 ` [PATCH v10 1/3] aerdrv: Trace Event " rui wang
  2013-12-04  3:10 ` [BUG] " rui wang
  3 siblings, 1 reply; 17+ messages in thread
From: Lance Ortiz @ 2013-01-16 23:51 UTC (permalink / raw)
  To: bhelgaas, lance_ortiz, jiang.liu, tony.luck, bp, rostedt,
	mchehab, linux-acpi, linux-pci, linux-kernel

These changes make cper_print_aer more consistent with aer_print_error
and clean things up by eliminating the use of the prefix variable and
replacing it with dev_err.

v3-v4 remove agent id stuff and kept print the same to avoid
compatibility issues
v7-v8 Updated to use dev_printk instead of prefix. Changed
log levels to KERN_ERR as per Mauro's suggestion.
v8-v9 Changed dev_printk to dev_err since all log levels are KERN_ERR.

Signed-off-by: Lance Ortiz <lance.ortiz@hp.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
---

 drivers/pci/pcie/aer/aerdrv_errprint.c |   54 +++++++++++++++-----------------
 1 files changed, 26 insertions(+), 28 deletions(-)

diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
index d3e5fc5..5ab1425 100644
--- a/drivers/pci/pcie/aer/aerdrv_errprint.c
+++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
@@ -124,12 +124,11 @@ static const char *aer_agent_string[] = {
 	"Transmitter ID"
 };
 
-static void __aer_print_error(const char *prefix,
+static void __aer_print_error(struct pci_dev *dev,
 			      struct aer_err_info *info)
 {
 	int i, status;
 	const char *errmsg = NULL;
-
 	status = (info->status & ~info->mask);
 
 	for (i = 0; i < 32; i++) {
@@ -144,26 +143,22 @@ static void __aer_print_error(const char *prefix,
 				aer_uncorrectable_error_string[i] : NULL;
 
 		if (errmsg)
-			printk("%s""   [%2d] %-22s%s\n", prefix, i, errmsg,
+			dev_err(&dev->dev, "   [%2d] %-22s%s\n", i, errmsg,
 				info->first_error == i ? " (First)" : "");
 		else
-			printk("%s""   [%2d] Unknown Error Bit%s\n", prefix, i,
-				info->first_error == i ? " (First)" : "");
+			dev_err(&dev->dev, "   [%2d] Unknown Error Bit%s\n",
+				i, info->first_error == i ? " (First)" : "");
 	}
 }
 
 void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 {
 	int id = ((dev->bus->number << 8) | dev->devfn);
-	char prefix[44];
-
-	snprintf(prefix, sizeof(prefix), "%s%s %s: ",
-		 (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR,
-		 dev_driver_string(&dev->dev), dev_name(&dev->dev));
 
 	if (info->status == 0) {
-		printk("%s""PCIe Bus Error: severity=%s, type=Unaccessible, "
-			"id=%04x(Unregistered Agent ID)\n", prefix,
+		dev_err(&dev->dev,
+			"PCIe Bus Error: severity=%s, type=Unaccessible, "
+			"id=%04x(Unregistered Agent ID)\n",
 			aer_error_severity_string[info->severity], id);
 	} else {
 		int layer, agent;
@@ -171,22 +166,24 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 		layer = AER_GET_LAYER_ERROR(info->severity, info->status);
 		agent = AER_GET_AGENT(info->severity, info->status);
 
-		printk("%s""PCIe Bus Error: severity=%s, type=%s, id=%04x(%s)\n",
-			prefix, aer_error_severity_string[info->severity],
+		dev_err(&dev->dev,
+			"PCIe Bus Error: severity=%s, type=%s, id=%04x(%s)\n",
+			aer_error_severity_string[info->severity],
 			aer_error_layer[layer], id, aer_agent_string[agent]);
 
-		printk("%s""  device [%04x:%04x] error status/mask=%08x/%08x\n",
-			prefix, dev->vendor, dev->device,
+		dev_err(&dev->dev,
+			"  device [%04x:%04x] error status/mask=%08x/%08x\n",
+			dev->vendor, dev->device,
 			info->status, info->mask);
 
-		__aer_print_error(prefix, info);
+		__aer_print_error(dev, info);
 
 		if (info->tlp_header_valid) {
 			unsigned char *tlp = (unsigned char *) &info->tlp;
-			printk("%s""  TLP Header:"
+			dev_err(&dev->dev, "  TLP Header:"
 				" %02x%02x%02x%02x %02x%02x%02x%02x"
 				" %02x%02x%02x%02x %02x%02x%02x%02x\n",
-				prefix, *(tlp + 3), *(tlp + 2), *(tlp + 1), *tlp,
+				*(tlp + 3), *(tlp + 2), *(tlp + 1), *tlp,
 				*(tlp + 7), *(tlp + 6), *(tlp + 5), *(tlp + 4),
 				*(tlp + 11), *(tlp + 10), *(tlp + 9),
 				*(tlp + 8), *(tlp + 15), *(tlp + 14),
@@ -195,8 +192,9 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 	}
 
 	if (info->id && info->error_dev_num > 1 && info->id == id)
-		printk("%s""  Error of this Agent(%04x) is reported first\n",
-			prefix, id);
+		dev_err(&dev->dev,
+			   "  Error of this Agent(%04x) is reported first\n",
+			id);
 	trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
 			info->severity);
 }
@@ -244,21 +242,21 @@ void cper_print_aer(const char *prefix, struct pci_dev *dev, int cper_severity,
 	}
 	layer = AER_GET_LAYER_ERROR(aer_severity, status);
 	agent = AER_GET_AGENT(aer_severity, status);
-	printk("%s""aer_status: 0x%08x, aer_mask: 0x%08x\n",
-	       prefix, status, mask);
+	dev_err(&dev->dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n",
+	       status, mask);
 	cper_print_bits(prefix, status, status_strs, status_strs_size);
-	printk("%s""aer_layer=%s, aer_agent=%s\n", prefix,
+	dev_err(&dev->dev, "aer_layer=%s, aer_agent=%s\n",
 	       aer_error_layer[layer], aer_agent_string[agent]);
 	if (aer_severity != AER_CORRECTABLE)
-		printk("%s""aer_uncor_severity: 0x%08x\n",
-		       prefix, aer->uncor_severity);
+		dev_err(&dev->dev, "aer_uncor_severity: 0x%08x\n",
+		       aer->uncor_severity);
 	if (tlp_header_valid) {
 		const unsigned char *tlp;
 		tlp = (const unsigned char *)&aer->header_log;
-		printk("%s""aer_tlp_header:"
+		dev_err(&dev->dev, "aer_tlp_header:"
 			" %02x%02x%02x%02x %02x%02x%02x%02x"
 			" %02x%02x%02x%02x %02x%02x%02x%02x\n",
-			prefix, *(tlp + 3), *(tlp + 2), *(tlp + 1), *tlp,
+			*(tlp + 3), *(tlp + 2), *(tlp + 1), *tlp,
 			*(tlp + 7), *(tlp + 6), *(tlp + 5), *(tlp + 4),
 			*(tlp + 11), *(tlp + 10), *(tlp + 9),
 			*(tlp + 8), *(tlp + 15), *(tlp + 14),


^ permalink raw reply related	[flat|nested] 17+ messages in thread
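
Both TLP header prints in the patch above use the same byte ordering:
the 16 header bytes are taken in groups of four, and each group is
printed highest-addressed byte first. A minimal userspace sketch of that
formatting follows. The function name format_tlp_header is hypothetical
and the code only reproduces the argument order seen in the dev_err()
calls.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Format a 16-byte TLP header as four space-separated groups of eight
 * hex digits. Within each group the bytes are emitted in reverse memory
 * order, matching the argument order used in the patch. Returns the
 * snprintf() result, i.e. the length the full string needs.
 */
static int format_tlp_header(const unsigned char *tlp, char *buf, size_t len)
{
	return snprintf(buf, len,
			"%02x%02x%02x%02x %02x%02x%02x%02x"
			" %02x%02x%02x%02x %02x%02x%02x%02x",
			tlp[3], tlp[2], tlp[1], tlp[0],
			tlp[7], tlp[6], tlp[5], tlp[4],
			tlp[11], tlp[10], tlp[9], tlp[8],
			tlp[15], tlp[14], tlp[13], tlp[12]);
}
```

With header bytes 0x00 through 0x0f in memory order, the result is
"03020100 07060504 0b0a0908 0f0e0d0c".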

* RE: [PATCH v10 3/3] aerdrv: Cleanup log output for AER
  2013-01-16 23:51 ` [PATCH v10 3/3] aerdrv: Cleanup log output for AER Lance Ortiz
  2013-01-17 17:21     ` Luck, Tony
@ 2013-01-17 17:21     ` Luck, Tony
  0 siblings, 0 replies; 17+ messages in thread
From: Luck, Tony @ 2013-01-17 17:21 UTC (permalink / raw)
  To: Lance Ortiz, bhelgaas, lance_ortiz, jiang.liu, bp, rostedt,
	mchehab, linux-acpi, linux-pci, linux-kernel

> These changes make cper_print_aer more consistent with aer_print_error
> and clean things up by eliminating the use of the prefix variable and
> replacing it with dev_printk.

Applied v10 series and put it into my "next" branch so linux-next will
grab it on the next cycle.  Will try to interest the "tip" maintainers in
pulling again in a few days.

-Tony

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v10 1/3] aerdrv: Trace Event for AER
  2013-01-16 23:51 [PATCH v10 1/3] aerdrv: Trace Event for AER Lance Ortiz
  2013-01-16 23:51 ` [PATCH v10 2/3] aerdrv: Enhanced AER logging Lance Ortiz
  2013-01-16 23:51 ` [PATCH v10 3/3] aerdrv: Cleanup log output for AER Lance Ortiz
@ 2013-12-02  5:05 ` rui wang
  2013-12-04 20:38   ` Borislav Petkov
  2013-12-04  3:10 ` [BUG] " rui wang
  3 siblings, 1 reply; 17+ messages in thread
From: rui wang @ 2013-12-02  5:05 UTC (permalink / raw)
  To: Lance Ortiz
  Cc: bhelgaas, lance_ortiz, jiang.liu, tony.luck, bp, rostedt,
	mchehab, linux-acpi, linux-pci, linux-kernel, gong.chen

On 1/17/13, Lance Ortiz <lance.ortiz@hp.com> wrote:
> This header file will define a new trace event that will be triggered when
> an AER event occurs.  The following data will be provided to the trace
> event.
>
> char * dev_name - The name of the slot where the device resides
>                   ([domain:]bus:device.function).
>
> u32 status - Either the correctable or uncorrectable register
>              indicating what error or errors have been seen.
>
> u8 severity - error severity 0:NONFATAL 1:FATAL 2:CORRECTED
>
> The trace event will also provide a trace string that may look like:
>
> "0000:05:00.0 PCIe Bus Error:severity=Uncorrected (Non-Fatal), Poisoned
> TLP"
>
> v1-v2 Move header from include/ras/aer_event.h to
> include/trace/events/ras.h
> v3-v4 Cleaned up comments and commit header
> v4-v5 More cleanup remove () from if statement in print.
>       Renamed string define to be more specific.
> v5-v6 change TRACE_SYSTEM define to be ras and not aer.
>
> Signed-off-by: Lance Ortiz <lance.ortiz@hp.com>
> Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
> Acked-by: Tony Luck <tony.luck@intel.com>
> ---
>
>  include/trace/events/ras.h |   77
> ++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 77 insertions(+), 0 deletions(-)
>  create mode 100644 include/trace/events/ras.h
>
> diff --git a/include/trace/events/ras.h b/include/trace/events/ras.h
> new file mode 100644
> index 0000000..88b8783
> --- /dev/null
> +++ b/include/trace/events/ras.h
> @@ -0,0 +1,77 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM ras
> +
> +#if !defined(_TRACE_AER_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_AER_H
> +
> +#include <linux/tracepoint.h>
> +#include <linux/edac.h>
> +
> +
> +/*
> + * PCIe AER Trace event
> + *
> + * These events are generated when hardware detects a corrected or
> + * uncorrected event on a PCIe device. The event report has
> + * the following structure:
> + *
> + * char * dev_name -	The name of the slot where the device resides
> + *			([domain:]bus:device.function).
> + * u32 status -		Either the correctable or uncorrectable register
> + *			indicating what error or errors have been seen
> + * u8 severity -	error severity 0:NONFATAL 1:FATAL 2:CORRECTED
> + */
> +
> +#define aer_correctable_errors		\
> +	{BIT(0),	"Receiver Error"},		\
> +	{BIT(6),	"Bad TLP"},			\
> +	{BIT(7),	"Bad DLLP"},			\
> +	{BIT(8),	"RELAY_NUM Rollover"},		\
> +	{BIT(12),	"Replay Timer Timeout"},	\
> +	{BIT(13),	"Advisory Non-Fatal"}
> +
> +#define aer_uncorrectable_errors		\
> +	{BIT(4),	"Data Link Protocol"},		\
> +	{BIT(12),	"Poisoned TLP"},		\
> +	{BIT(13),	"Flow Control Protocol"},	\
> +	{BIT(14),	"Completion Timeout"},		\
> +	{BIT(15),	"Completer Abort"},		\
> +	{BIT(16),	"Unexpected Completion"},	\
> +	{BIT(17),	"Receiver Overflow"},		\
> +	{BIT(18),	"Malformed TLP"},		\
> +	{BIT(19),	"ECRC"},			\
> +	{BIT(20),	"Unsupported Request"}
> +
> +TRACE_EVENT(aer_event,
> +	TP_PROTO(const char *dev_name,
> +		 const u32 status,
> +		 const u8 severity),
> +
> +	TP_ARGS(dev_name, status, severity),
> +
> +	TP_STRUCT__entry(
> +		__string(	dev_name,	dev_name	)
> +		__field(	u32,		status		)
> +		__field(	u8,		severity	)
> +	),
> +
> +	TP_fast_assign(
> +		__assign_str(dev_name, dev_name);
> +		__entry->status		= status;
> +		__entry->severity	= severity;
> +	),
> +
> +	TP_printk("%s PCIe Bus Error: severity=%s, %s\n",
> +		__get_str(dev_name),
> +		__entry->severity == HW_EVENT_ERR_CORRECTED ? "Corrected" :
> +			__entry->severity == HW_EVENT_ERR_FATAL ?
> +			"Fatal" : "Uncorrected",
> +		__entry->severity == HW_EVENT_ERR_CORRECTED ?
> +		__print_flags(__entry->status, "|", aer_correctable_errors) :
> +		__print_flags(__entry->status, "|", aer_uncorrectable_errors))
> +);

This causes inconsistency between dmesg and the trace event output.
When dmesg says "severity=Corrected", the trace event says
"severity=Fatal". What happens is that HW_EVENT_ERR_CORRECTED is
defined in edac.h:

enum hw_event_mc_err_type {
        HW_EVENT_ERR_CORRECTED,
        HW_EVENT_ERR_UNCORRECTED,
        HW_EVENT_ERR_FATAL,
        HW_EVENT_ERR_INFO,
};

while aer_print_error() uses aer_error_severity_string[] defined as:

static const char *aer_error_severity_string[] = {
        "Uncorrected (Non-Fatal)",
        "Uncorrected (Fatal)",
        "Corrected"
};

In this case dmesg is correct because info->severity is assigned in
aer_isr_one_error() using the definitions in include/linux/aer.h:
#define AER_NONFATAL                    0
#define AER_FATAL                       1
#define AER_CORRECTABLE                 2

So which one is the standard? Is there a plan to unify all these names?

Thanks
Rui Wang
> +
> +#endif /* _TRACE_AER_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 17+ messages in thread
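
The mismatch described in the message above can be reproduced outside
the kernel. The sketch below copies the AER_* values and the edac.h enum
as quoted in the thread, then compares the severity string chosen by the
posted TP_printk() with one chosen by comparing against the AER_*
constants instead. The function names are hypothetical, and the second
function is only one possible way to make the two outputs agree, not
necessarily the fix that should be applied.

```c
#include <stdint.h>

/* AER severity values as quoted in the thread. */
#define AER_NONFATAL	0
#define AER_FATAL	1
#define AER_CORRECTABLE	2

/* Enum as quoted from edac.h in the thread. */
enum hw_event_mc_err_type {
	HW_EVENT_ERR_CORRECTED,
	HW_EVENT_ERR_UNCORRECTED,
	HW_EVENT_ERR_FATAL,
	HW_EVENT_ERR_INFO,
};

/* Severity string as selected by the TP_printk() in the posted patch. */
static const char *trace_severity_as_posted(uint8_t severity)
{
	return severity == HW_EVENT_ERR_CORRECTED ? "Corrected" :
	       severity == HW_EVENT_ERR_FATAL ? "Fatal" : "Uncorrected";
}

/* Hypothetical alternative: compare against the AER_* values instead. */
static const char *trace_severity_aer(uint8_t severity)
{
	return severity == AER_CORRECTABLE ? "Corrected" :
	       severity == AER_FATAL ? "Fatal" : "Uncorrected";
}
```

Since AER_CORRECTABLE and HW_EVENT_ERR_FATAL are both 2,
trace_severity_as_posted(AER_CORRECTABLE) returns "Fatal", while
trace_severity_aer(AER_CORRECTABLE) returns "Corrected", which matches
dmesg.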

* [BUG] Re: [PATCH v10 1/3] aerdrv: Trace Event for AER
  2013-01-16 23:51 [PATCH v10 1/3] aerdrv: Trace Event for AER Lance Ortiz
                   ` (2 preceding siblings ...)
  2013-12-02  5:05 ` [PATCH v10 1/3] aerdrv: Trace Event " rui wang
@ 2013-12-04  3:10 ` rui wang
  2013-12-04 15:28   ` Ethan Zhao
  3 siblings, 1 reply; 17+ messages in thread
From: rui wang @ 2013-12-04  3:10 UTC (permalink / raw)
  To: Lance Ortiz
  Cc: bhelgaas, lance_ortiz, jiang.liu, tony.luck, bp, rostedt,
	m.chehab, linux-acpi, linux-pci, linux-kernel, gong.chen

Resending, adding Mauro's new email address...


On 1/17/13, Lance Ortiz <lance.ortiz@hp.com> wrote:
> This header file will define a new trace event that will be triggered when
> an AER event occurs.  The following data will be provided to the trace
> event.
>
> char * dev_name - The name of the slot where the device resides
>                   ([domain:]bus:device.function).
>
> u32 status - Either the correctable or uncorrectable register
>              indicating what error or errors have been seen.
>
> u8 severity - error severity 0:NONFATAL 1:FATAL 2:CORRECTED
>
> The trace event will also provide a trace string that may look like:
>
> "0000:05:00.0 PCIe Bus Error:severity=Uncorrected (Non-Fatal), Poisoned
> TLP"
>
> v1-v2 Move header from include/ras/aer_event.h to
> include/trace/events/ras.h
> v3-v4 Cleaned up comments and commit header
> v4-v5 More cleanup remove () from if statement in print.
>       Renamed string define to be more specific.
> v5-v6 change TRACE_SYSTEM define to be ras and not aer.
>
> Signed-off-by: Lance Ortiz <lance.ortiz@hp.com>
> Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
> Acked-by: Tony Luck <tony.luck@intel.com>
> ---
>
>  include/trace/events/ras.h |   77
> ++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 77 insertions(+), 0 deletions(-)
>  create mode 100644 include/trace/events/ras.h
>
> diff --git a/include/trace/events/ras.h b/include/trace/events/ras.h
> new file mode 100644
> index 0000000..88b8783
> --- /dev/null
> +++ b/include/trace/events/ras.h
> @@ -0,0 +1,77 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM ras
> +
> +#if !defined(_TRACE_AER_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_AER_H
> +
> +#include <linux/tracepoint.h>
> +#include <linux/edac.h>
> +
> +
> +/*
> + * PCIe AER Trace event
> + *
> + * These events are generated when hardware detects a corrected or
> + * uncorrected event on a PCIe device. The event report has
> + * the following structure:
> + *
> + * char * dev_name -	The name of the slot where the device resides
> + *			([domain:]bus:device.function).
> + * u32 status -		Either the correctable or uncorrectable register
> + *			indicating what error or errors have been seen
> + * u8 severity -	error severity 0:NONFATAL 1:FATAL 2:CORRECTED
> + */
> +
> +#define aer_correctable_errors		\
> +	{BIT(0),	"Receiver Error"},		\
> +	{BIT(6),	"Bad TLP"},			\
> +	{BIT(7),	"Bad DLLP"},			\
> +	{BIT(8),	"RELAY_NUM Rollover"},		\
> +	{BIT(12),	"Replay Timer Timeout"},	\
> +	{BIT(13),	"Advisory Non-Fatal"}
> +
> +#define aer_uncorrectable_errors		\
> +	{BIT(4),	"Data Link Protocol"},		\
> +	{BIT(12),	"Poisoned TLP"},		\
> +	{BIT(13),	"Flow Control Protocol"},	\
> +	{BIT(14),	"Completion Timeout"},		\
> +	{BIT(15),	"Completer Abort"},		\
> +	{BIT(16),	"Unexpected Completion"},	\
> +	{BIT(17),	"Receiver Overflow"},		\
> +	{BIT(18),	"Malformed TLP"},		\
> +	{BIT(19),	"ECRC"},			\
> +	{BIT(20),	"Unsupported Request"}
> +
> +TRACE_EVENT(aer_event,
> +	TP_PROTO(const char *dev_name,
> +		 const u32 status,
> +		 const u8 severity),
> +
> +	TP_ARGS(dev_name, status, severity),
> +
> +	TP_STRUCT__entry(
> +		__string(	dev_name,	dev_name	)
> +		__field(	u32,		status		)
> +		__field(	u8,		severity	)
> +	),
> +
> +	TP_fast_assign(
> +		__assign_str(dev_name, dev_name);
> +		__entry->status		= status;
> +		__entry->severity	= severity;
> +	),
> +
> +	TP_printk("%s PCIe Bus Error: severity=%s, %s\n",
> +		__get_str(dev_name),
> +		__entry->severity == HW_EVENT_ERR_CORRECTED ? "Corrected" :
> +			__entry->severity == HW_EVENT_ERR_FATAL ?
> +			"Fatal" : "Uncorrected",
> +		__entry->severity == HW_EVENT_ERR_CORRECTED ?
> +		__print_flags(__entry->status, "|", aer_correctable_errors) :
> +		__print_flags(__entry->status, "|", aer_uncorrectable_errors))
> +);

Here's a bug causing inconsistency between dmesg and the trace event output.
When dmesg says "severity=Corrected", the trace event says
"severity=Fatal". What happens is that HW_EVENT_ERR_CORRECTED is
defined in edac.h:

enum hw_event_mc_err_type {
        HW_EVENT_ERR_CORRECTED,
        HW_EVENT_ERR_UNCORRECTED,
        HW_EVENT_ERR_FATAL,
        HW_EVENT_ERR_INFO,
};

while aer_print_error() uses aer_error_severity_string[] defined as:

static const char *aer_error_severity_string[] = {
        "Uncorrected (Non-Fatal)",
        "Uncorrected (Fatal)",
        "Corrected"
};

In this case dmesg is correct because info->severity is assigned in
aer_isr_one_error() using the definitions in include/linux/aer.h:
#define AER_NONFATAL                    0
#define AER_FATAL                       1
#define AER_CORRECTABLE                 2

So which one is the standard? Is there a plan to unify all these names?

Thanks
Rui Wang

> +
> +#endif /* _TRACE_AER_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] Re: [PATCH v10 1/3] aerdrv: Trace Event for AER
  2013-12-04  3:10 ` [BUG] " rui wang
@ 2013-12-04 15:28   ` Ethan Zhao
  2013-12-05 18:21       ` Betty Dall
  0 siblings, 1 reply; 17+ messages in thread
From: Ethan Zhao @ 2013-12-04 15:28 UTC (permalink / raw)
  To: rui wang
  Cc: Lance Ortiz, Bjorn Helgaas, lance_ortiz, jiang.liu, tony.luck,
	bp, rostedt, m.chehab, linux-acpi, linux-pci, LKML, gong.chen

Rui,
   Agreed. There are many such confusing error-type definitions that
need to be standardized or unified. Some of them are ambiguous or
inconsistent, and some of them violate the ACPI/PCI specs.

  According to the ACPI spec, 'FATAL' is in fact a sub-category of
'UNCORRECTABLE'; the other one is 'NON-FATAL'. So how should the
'UNCORRECTABLE' values below be understood? Do they mean
uncorrectable-non-fatal? One can only guess, which is confusing.

acpi/actbl1.h

#define ACPI_EINJ_PROCESSOR_CORRECTABLE     (1)
#define ACPI_EINJ_PROCESSOR_UNCORRECTABLE   (1<<1)
#define ACPI_EINJ_PROCESSOR_FATAL           (1<<2)
#define ACPI_EINJ_MEMORY_CORRECTABLE        (1<<3)
#define ACPI_EINJ_MEMORY_UNCORRECTABLE      (1<<4)
#define ACPI_EINJ_MEMORY_FATAL              (1<<5)
#define ACPI_EINJ_PCIX_CORRECTABLE          (1<<6)
#define ACPI_EINJ_PCIX_UNCORRECTABLE        (1<<7)
#define ACPI_EINJ_PCIX_FATAL                (1<<8)
#define ACPI_EINJ_PLATFORM_CORRECTABLE      (1<<9)
#define ACPI_EINJ_PLATFORM_UNCORRECTABLE    (1<<10)
#define ACPI_EINJ_PLATFORM_FATAL            (1<<11)

edac.h

enum hw_event_mc_err_type {
        HW_EVENT_ERR_CORRECTED,
        HW_EVENT_ERR_UNCORRECTED,
        HW_EVENT_ERR_FATAL,
        HW_EVENT_ERR_INFO,
};

ghes.h

enum {
        GHES_SEV_NO = 0x0,
        GHES_SEV_CORRECTED = 0x1,
        GHES_SEV_RECOVERABLE = 0x2,
        GHES_SEV_PANIC = 0x3,
};

What does GHES_SEV_PANIC mean? Why not 'FATAL', as described in ACPI
spec section 18.3.2.6.1:
"
Error Severity (byte length 4, byte offset 16): Identifies the error
severity of the reported error:
0 – Recoverable
1 – Fatal
2 – Corrected
3 – None
"
If there is some other intention behind the name, it still gets
translated into 'FATAL' later:

case GHES_SEV_PANIC:
  type = HW_EVENT_ERR_FATAL;

And these look reasonable:
aer.h

#define AER_NONFATAL 0
#define AER_FATAL 1
#define AER_CORRECTABLE 2


Thanks,
Ethan

On Wed, Dec 4, 2013 at 11:10 AM, rui wang <ruiv.wang@gmail.com> wrote:
> Resending adding Mauro's new Email address...
>
>
> On 1/17/13, Lance Ortiz <lance.ortiz@hp.com> wrote:
>> This header file will define a new trace event that will be triggered when
>> an AER event occurs.  The following data will be provided to the trace
>> event.
>>
>> char * dev_name - The name of the slot where the device resides
>>                   ([domain:]bus:device.function).
>>
>> u32 status - Either the correctable or uncorrectable register
>>              indicating what error or errors have been seen.
>>
>> u8 severity - error severity 0:NONFATAL 1:FATAL 2:CORRECTED
>>
>> The trace event will also provide a trace string that may look like:
>>
>> "0000:05:00.0 PCIe Bus Error:severity=Uncorrected (Non-Fatal), Poisoned
>> TLP"
>>
>> v1-v2 Move header from include/ras/aer_event.h to
>> include/trace/events/ras.h
>> v3-v4 Cleaned up comments and commit header
>> v4-v5 More cleanup remove () from if statement in print.
>>       Renamed string define to be more specific.
>> v5-v6 change TRACE_SYSTEM define to be ras and not aer.
>>
>> Signed-off-by: Lance Ortiz <lance.ortiz@hp.com>
>> Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
>> Acked-by: Tony Luck <tony.luck@intel.com>
>> ---
>>
>>  include/trace/events/ras.h |   77
>> ++++++++++++++++++++++++++++++++++++++++++++
>>  1 files changed, 77 insertions(+), 0 deletions(-)
>>  create mode 100644 include/trace/events/ras.h
>>
>> diff --git a/include/trace/events/ras.h b/include/trace/events/ras.h
>> new file mode 100644
>> index 0000000..88b8783
>> --- /dev/null
>> +++ b/include/trace/events/ras.h
>> @@ -0,0 +1,77 @@
>> +#undef TRACE_SYSTEM
>> +#define TRACE_SYSTEM ras
>> +
>> +#if !defined(_TRACE_AER_H) || defined(TRACE_HEADER_MULTI_READ)
>> +#define _TRACE_AER_H
>> +
>> +#include <linux/tracepoint.h>
>> +#include <linux/edac.h>
>> +
>> +
>> +/*
>> + * PCIe AER Trace event
>> + *
>> + * These events are generated when hardware detects a corrected or
>> + * uncorrected event on a PCIe device. The event report has
>> + * the following structure:
>> + *
>> + * char * dev_name - The name of the slot where the device resides
>> + *                   ([domain:]bus:device.function).
>> + * u32 status -              Either the correctable or uncorrectable register
>> + *                   indicating what error or errors have been seen
>> + * u8 severity -     error severity 0:NONFATAL 1:FATAL 2:CORRECTED
>> + */
>> +
>> +#define aer_correctable_errors               \
>> +     {BIT(0),        "Receiver Error"},              \
>> +     {BIT(6),        "Bad TLP"},                     \
>> +     {BIT(7),        "Bad DLLP"},                    \
>> +     {BIT(8),        "RELAY_NUM Rollover"},          \
>> +     {BIT(12),       "Replay Timer Timeout"},        \
>> +     {BIT(13),       "Advisory Non-Fatal"}
>> +
>> +#define aer_uncorrectable_errors             \
>> +     {BIT(4),        "Data Link Protocol"},          \
>> +     {BIT(12),       "Poisoned TLP"},                \
>> +     {BIT(13),       "Flow Control Protocol"},       \
>> +     {BIT(14),       "Completion Timeout"},          \
>> +     {BIT(15),       "Completer Abort"},             \
>> +     {BIT(16),       "Unexpected Completion"},       \
>> +     {BIT(17),       "Receiver Overflow"},           \
>> +     {BIT(18),       "Malformed TLP"},               \
>> +     {BIT(19),       "ECRC"},                        \
>> +     {BIT(20),       "Unsupported Request"}
>> +
>> +TRACE_EVENT(aer_event,
>> +     TP_PROTO(const char *dev_name,
>> +              const u32 status,
>> +              const u8 severity),
>> +
>> +     TP_ARGS(dev_name, status, severity),
>> +
>> +     TP_STRUCT__entry(
>> +             __string(       dev_name,       dev_name        )
>> +             __field(        u32,            status          )
>> +             __field(        u8,             severity        )
>> +     ),
>> +
>> +     TP_fast_assign(
>> +             __assign_str(dev_name, dev_name);
>> +             __entry->status         = status;
>> +             __entry->severity       = severity;
>> +     ),
>> +
>> +     TP_printk("%s PCIe Bus Error: severity=%s, %s\n",
>> +             __get_str(dev_name),
>> +             __entry->severity == HW_EVENT_ERR_CORRECTED ? "Corrected" :
>> +                     __entry->severity == HW_EVENT_ERR_FATAL ?
>> +                     "Fatal" : "Uncorrected",
>> +             __entry->severity == HW_EVENT_ERR_CORRECTED ?
>> +             __print_flags(__entry->status, "|", aer_correctable_errors) :
>> +             __print_flags(__entry->status, "|", aer_uncorrectable_errors))
>> +);
>
> Here's a bug causing inconsistency between dmesg and the trace event output.
> When dmesg says "severity=Corrected", the trace event says
> "severity=Fatal". What happens is that HW_EVENT_ERR_CORRECTED is
> defined in edac.h:
>
> enum hw_event_mc_err_type {
>         HW_EVENT_ERR_CORRECTED,
>         HW_EVENT_ERR_UNCORRECTED,
>         HW_EVENT_ERR_FATAL,
>         HW_EVENT_ERR_INFO,
> };
>
> while aer_print_error() uses aer_error_severity_string[] defined as:
>
> static const char *aer_error_severity_string[] = {
>         "Uncorrected (Non-Fatal)",
>         "Uncorrected (Fatal)",
>         "Corrected"
> };
>
> In this case dmesg is correct because info->severity is assigned in
> aer_isr_one_error() using the definitions in include/linux/aer.h:
> #define AER_NONFATAL                    0
> #define AER_FATAL                       1
> #define AER_CORRECTABLE                 2
>
> So which one is the standard? Is there a plan to unify all these names?
>
> Thanks
> Rui Wang
>
>> +
>> +#endif /* _TRACE_AER_H */
>> +
>> +/* This part must be outside protection */
>> +#include <trace/define_trace.h>
>>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v10 1/3] aerdrv: Trace Event for AER
  2013-12-02  5:05 ` [PATCH v10 1/3] aerdrv: Trace Event " rui wang
@ 2013-12-04 20:38   ` Borislav Petkov
  2013-12-06  9:06     ` rui wang
  0 siblings, 1 reply; 17+ messages in thread
From: Borislav Petkov @ 2013-12-04 20:38 UTC (permalink / raw)
  To: rui wang
  Cc: Lance Ortiz, bhelgaas, lance_ortiz, jiang.liu, tony.luck,
	rostedt, mchehab, linux-acpi, linux-pci, linux-kernel, gong.chen

On Mon, Dec 02, 2013 at 01:05:16PM +0800, rui wang wrote:
> > +	TP_printk("%s PCIe Bus Error: severity=%s, %s\n",
> > +		__get_str(dev_name),
> > +		__entry->severity == HW_EVENT_ERR_CORRECTED ? "Corrected" :
> > +			__entry->severity == HW_EVENT_ERR_FATAL ?
> > +			"Fatal" : "Uncorrected",
> > +		__entry->severity == HW_EVENT_ERR_CORRECTED ?
> > +		__print_flags(__entry->status, "|", aer_correctable_errors) :
> > +		__print_flags(__entry->status, "|", aer_uncorrectable_errors))
> > +);
> 
> This causes inconsistency between dmesg and the trace event output.
> When dmesg says "severity=Corrected", the trace event says
> "severity=Fatal". What happens is that HW_EVENT_ERR_CORRECTED is
> defined in edac.h:
> 
> enum hw_event_mc_err_type {
>         HW_EVENT_ERR_CORRECTED,
>         HW_EVENT_ERR_UNCORRECTED,
>         HW_EVENT_ERR_FATAL,
>         HW_EVENT_ERR_INFO,
> };
> 
> while aer_print_error() uses aer_error_severity_string[] defined as:
> 
> static const char *aer_error_severity_string[] = {
>         "Uncorrected (Non-Fatal)",
>         "Uncorrected (Fatal)",
>         "Corrected"
> };
> 
> In this case dmesg is correct because info->severity is assigned in
> aer_isr_one_error() using the definitions in include/linux/aer.h:
> #define AER_NONFATAL                    0
> #define AER_FATAL                       1
> #define AER_CORRECTABLE                 2
> 
> So which one is the standard? Is there a plan to unify all these names?

Yes, the AER tracepoint above should use the AER_* defines and not the
HW_EVENT_ERR_* ones which are for memory errors.
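
To make the mismatch concrete, here is a minimal user-space sketch (not
kernel code). The enum and the AER_* values are copied from the messages
quoted above; sev_as_posted() is a hypothetical helper that only mirrors
the severity ternary from TP_printk in the posted patch.

```c
#include <stdio.h>

/* Copied from the thread: the EDAC memory-error enum (edac.h). */
enum hw_event_mc_err_type {
	HW_EVENT_ERR_CORRECTED,		/* 0 */
	HW_EVENT_ERR_UNCORRECTED,	/* 1 */
	HW_EVENT_ERR_FATAL,		/* 2 */
	HW_EVENT_ERR_INFO,		/* 3 */
};

/* Copied from the thread: the AER severities the driver actually passes. */
#define AER_NONFATAL	0
#define AER_FATAL	1
#define AER_CORRECTABLE	2

/*
 * Hypothetical helper: same comparison chain as the posted TP_printk,
 * i.e. an AER severity compared against the EDAC enum values.
 */
static const char *sev_as_posted(unsigned char severity)
{
	return severity == HW_EVENT_ERR_CORRECTED ? "Corrected" :
	       severity == HW_EVENT_ERR_FATAL ? "Fatal" : "Uncorrected";
}

/* Print what the posted chain yields for each AER severity. */
static void dump_as_posted(void)
{
	printf("AER_NONFATAL    -> %s\n", sev_as_posted(AER_NONFATAL));
	printf("AER_FATAL       -> %s\n", sev_as_posted(AER_FATAL));
	printf("AER_CORRECTABLE -> %s\n", sev_as_posted(AER_CORRECTABLE));
}
```

Because AER_CORRECTABLE and HW_EVENT_ERR_FATAL are both 2, a corrected
AER error comes out as "Fatal", which matches the report above. The other
two AER severities are mislabelled as well: AER_NONFATAL prints as
"Corrected" and AER_FATAL as "Uncorrected".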

Wanna send a fix?

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] Re: [PATCH v10 1/3] aerdrv: Trace Event for AER
  2013-12-04 15:28   ` Ethan Zhao
@ 2013-12-05 18:21       ` Betty Dall
  0 siblings, 0 replies; 17+ messages in thread
From: Betty Dall @ 2013-12-05 18:21 UTC (permalink / raw)
  To: Ethan Zhao
  Cc: rui wang, Lance Ortiz, Bjorn Helgaas, lance_ortiz, jiang.liu,
	tony.luck, bp, rostedt, m.chehab, linux-acpi, linux-pci, LKML,
	gong.chen

On Wed, 2013-12-04 at 23:28 +0800, Ethan Zhao wrote:
> Rui,
>    Agreed. There are many such confusing error type definitions that
> need to be standardized or unified; some of them are ambiguous or
> inconsistent, and some of them violate the ACPI/PCI specs.
> 
>   According to the ACPI spec, 'FATAL' is in fact a sub-category of
> 'UNCORRECTABLE', the other one being 'NON-FATAL'. So how should we
> understand the 'UNCORRECTABLE's below? Do they mean
> uncorrectable-non-fatal? I can only guess, which is confusing.
> 
> acpi/actbl1.h
> 
> #define ACPI_EINJ_PROCESSOR_CORRECTABLE     (1)
> #define ACPI_EINJ_PROCESSOR_UNCORRECTABLE   (1<<1)
> #define ACPI_EINJ_PROCESSOR_FATAL           (1<<2)
> #define ACPI_EINJ_MEMORY_CORRECTABLE        (1<<3)
> #define ACPI_EINJ_MEMORY_UNCORRECTABLE      (1<<4)
> #define ACPI_EINJ_MEMORY_FATAL              (1<<5)
> #define ACPI_EINJ_PCIX_CORRECTABLE          (1<<6)
> #define ACPI_EINJ_PCIX_UNCORRECTABLE        (1<<7)
> #define ACPI_EINJ_PCIX_FATAL                (1<<8)
> #define ACPI_EINJ_PLATFORM_CORRECTABLE      (1<<9)
> #define ACPI_EINJ_PLATFORM_UNCORRECTABLE    (1<<10)
> #define ACPI_EINJ_PLATFORM_FATAL            (1<<11)
> 
> edac.h
> 
> enum hw_event_mc_err_type {
> HW_EVENT_ERR_CORRECTED,
> HW_EVENT_ERR_UNCORRECTED,
> HW_EVENT_ERR_FATAL,
> HW_EVENT_ERR_INFO,
> };
> 
> ghes.h
> enum {
> GHES_SEV_NO = 0x0,
> GHES_SEV_CORRECTED = 0x1,
> GHES_SEV_RECOVERABLE = 0x2,
> GHES_SEV_PANIC = 0x3,
> };
> 
> What does GHES_SEV_PANIC mean? Why not 'FATAL', as described in ACPI
> spec section 18.3.2.6.1:
> "
> Error Severity (byte length 4, byte offset 16): Identifies the error
> severity of the reported error:
> 0 – Recoverable
> 1 – Fatal
> 2 – Corrected
> 3 – None
> "
> If there is some other intention behind the name, it still gets
> translated into 'FATAL' later:
> 
> case GHES_SEV_PANIC:
>   type = HW_EVENT_ERR_FATAL;
> 
> And these look reasonable:
> aer.h
> 
> #define AER_NONFATAL 0
> #define AER_FATAL 1
> #define AER_CORRECTABLE 2

The definition of the GHES_SEV* matches up with the error severity
definition of the CPER records as defined in the UEFI spec section
N.2.1:
"Indicates the severity of the error condition. The severity of
the error record corresponds to the most severe error
section.
0 - Recoverable (also called non-fatal uncorrected)
1 - Fatal
2 - Corrected
3 - Informational
All other values are reserved.
Note that severity of "Informational" indicates that the record
could be safely ignored by error handling software."

The ghes code uses the CPER record's severity and always calls the
function ghes_severity() to convert to the GHES_SEV value. Since the
ACPI spec defines the GHES severity, it makes sense to maintain an enum
for it and use the ghes_severity() to convert where necessary. This is
what I am thinking:

Author: Betty Dall <betty.dall@hp.com>
Date:   Thu Dec 5 11:05:43 2013 -0700

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index a30bc31..c59144e 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -301,16 +301,18 @@ static inline int ghes_severity(int severity)
 {
        switch (severity) {
        case CPER_SEV_INFORMATIONAL:
-               return GHES_SEV_NO;
+               return GHES_SEV_NONE;
        case CPER_SEV_CORRECTED:
                return GHES_SEV_CORRECTED;
        case CPER_SEV_RECOVERABLE:
                return GHES_SEV_RECOVERABLE;
        case CPER_SEV_FATAL:
-               return GHES_SEV_PANIC;
+               return GHES_SEV_FATAL;
        default:
                /* Unknown, go panic */
-               return GHES_SEV_PANIC;
+               pr_warn(FW_WARN GHES_PFX
+                       "Invalid CPER severity: %d\n", severity);
+               return GHES_SEV_FATAL;
        }
 }
 
@@ -828,7 +830,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
        if (ret == NMI_DONE)
                goto out;
 
-       if (sev_global >= GHES_SEV_PANIC) {
+       if (sev_global >= GHES_SEV_FATAL) {
                oops_begin();
                ghes_print_queued_estatus();
                __ghes_print_estatus(KERN_EMERG, ghes_global->generic,
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index dfd60d0..7cefa89 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -39,10 +39,10 @@ struct ghes_estatus_cache {
 };
 
 enum {
-       GHES_SEV_NO = 0x0,
-       GHES_SEV_CORRECTED = 0x1,
-       GHES_SEV_RECOVERABLE = 0x2,
-       GHES_SEV_PANIC = 0x3,
+       GHES_SEV_RECOVERABLE = 0x0,
+       GHES_SEV_FATAL = 0x1,
+       GHES_SEV_CORRECTED = 0x2,
+       GHES_SEV_NONE = 0x3,
 };
 
 /* From drivers/edac/ghes_edac.c */


-Betty



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [BUG] Re: [PATCH v10 1/3] aerdrv: Trace Event for AER
  2013-12-05 18:21       ` Betty Dall
@ 2013-12-05 21:12         ` Borislav Petkov
  -1 siblings, 0 replies; 17+ messages in thread
From: Borislav Petkov @ 2013-12-05 21:12 UTC (permalink / raw)
  To: Betty Dall
  Cc: Ethan Zhao, rui wang, Lance Ortiz, Bjorn Helgaas, lance_ortiz,
	jiang.liu, tony.luck, rostedt, m.chehab, linux-acpi, linux-pci,
	LKML, gong.chen

On Thu, Dec 05, 2013 at 11:21:10AM -0700, Betty Dall wrote:
> The definition of the GHES_SEV* matches up with the error severity
> definition of the CPER records as defined in the UEFI spec section
> N.2.1:
> "Indicates the severity of the error condition. The severity of
> the error record corresponds to the most severe error
> section.
> 0 - Recoverable (also called non-fatal uncorrected)
> 1 - Fatal
> 2 - Corrected
> 3 - Informational
> All other values are reserved.
> Note that severity of "Informational" indicates that the record
> could be safely ignored by error handling software."

Actually, we can go even one radical step further and drop
ghes_severity() completely because GHES severity in the ACPI spec 5.0 is
defined almost exactly the same:

"18.3.2.6.1 Generic Error Data

...

Identifies the error severity of the reported error:
0 – Recoverable
1 – Fatal
2 – Corrected
3 – None
Note: This is the error severity of the entire event. Each Generic
Error Data Entry also includes its own Error Severity field."

I don't know which version of the spec dictated

enum {
        GHES_SEV_NO = 0x0,
        GHES_SEV_CORRECTED = 0x1,
        GHES_SEV_RECOVERABLE = 0x2,
        GHES_SEV_PANIC = 0x3,
};

though and whether we're going to have to differentiate between the old
and GHES numerical severity levels. Which, if we have to, would be very
nasty...
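
One thing to watch with the spec numbering: the values no longer grow
with severity, so an ordered test such as "sev >= fatal" stops working on
the raw number. A minimal sketch follows, assuming the ACPI 5.0 numbering
quoted above; the SPEC_SEV_* names and both helpers are hypothetical and
not existing kernel symbols.

```c
/* Numbering as quoted from ACPI 5.0 section 18.3.2.6.1 (names are made up). */
enum spec_sev {
	SPEC_SEV_RECOVERABLE	= 0,
	SPEC_SEV_FATAL		= 1,
	SPEC_SEV_CORRECTED	= 2,
	SPEC_SEV_NONE		= 3,
};

/*
 * Hypothetical helper: map a spec severity to a rank that increases
 * with how severe the error is, so ordered comparisons are meaningful.
 */
static int spec_sev_rank(int sev)
{
	switch (sev) {
	case SPEC_SEV_NONE:
		return 0;
	case SPEC_SEV_CORRECTED:
		return 1;
	case SPEC_SEV_RECOVERABLE:
		return 2;
	case SPEC_SEV_FATAL:
		return 3;
	default:
		/* Unknown value: treat as most severe, like ghes_severity(). */
		return 3;
	}
}

/* Ordered "at least fatal" test done on the rank, not on the raw value. */
static int spec_sev_is_fatal(int sev)
{
	return spec_sev_rank(sev) >= spec_sev_rank(SPEC_SEV_FATAL);
}
```

With the raw values, SPEC_SEV_CORRECTED >= SPEC_SEV_FATAL is true (2 >= 1),
so a check written like the one in ghes_notify_nmi() would also fire for
corrected and informational records. Going through a rank avoids that,
which is roughly the job the current ghes_severity() ordering does.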

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v10 1/3] aerdrv: Trace Event for AER
  2013-12-04 20:38   ` Borislav Petkov
@ 2013-12-06  9:06     ` rui wang
  2013-12-06 15:11       ` Ethan Zhao
  0 siblings, 1 reply; 17+ messages in thread
From: rui wang @ 2013-12-06  9:06 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Lance Ortiz, bhelgaas, lance_ortiz, jiang.liu, tony.luck,
	rostedt, mchehab, linux-acpi, linux-pci, linux-kernel, gong.chen

On 12/5/13, Borislav Petkov <bp@alien8.de> wrote:

> Yes, the AER tracepoint above should use the AER_* defines and not the
> HW_EVENT_ERR_* ones which are for memory errors.
>
> Wanna send a fix?
>

Yes. Does it translate into something like this?

From: Rui Wang <rui.y.wang@intel.com>
Date: Fri, 6 Dec 2013 16:47:46 +0800
Subject: [PATCH] Fix severity usage in aer trace event

Signed-off-by: Rui Wang <rui.y.wang@intel.com>
---
 include/trace/events/ras.h |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/trace/events/ras.h b/include/trace/events/ras.h
index 88b8783..e2a17d8 100644
--- a/include/trace/events/ras.h
+++ b/include/trace/events/ras.h
@@ -5,7 +5,7 @@
 #define _TRACE_AER_H

 #include <linux/tracepoint.h>
-#include <linux/edac.h>
+#include <linux/aer.h>


 /*
@@ -63,10 +63,10 @@ TRACE_EVENT(aer_event,

        TP_printk("%s PCIe Bus Error: severity=%s, %s\n",
                __get_str(dev_name),
-               __entry->severity == HW_EVENT_ERR_CORRECTED ? "Corrected" :
-                       __entry->severity == HW_EVENT_ERR_FATAL ?
+               __entry->severity == AER_CORRECTABLE ? "Corrected" :
+                       __entry->severity == AER_FATAL ?
                        "Fatal" : "Uncorrected",
-               __entry->severity == HW_EVENT_ERR_CORRECTED ?
+               __entry->severity == AER_CORRECTABLE ?
                __print_flags(__entry->status, "|", aer_correctable_errors) :
                __print_flags(__entry->status, "|", aer_uncorrectable_errors))
 );
-- 
1.7.5.4

Regards,
Rui

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v10 1/3] aerdrv: Trace Event for AER
  2013-12-06  9:06     ` rui wang
@ 2013-12-06 15:11       ` Ethan Zhao
  2013-12-07 17:45         ` Borislav Petkov
  0 siblings, 1 reply; 17+ messages in thread
From: Ethan Zhao @ 2013-12-06 15:11 UTC (permalink / raw)
  To: rui wang
  Cc: Borislav Petkov, Lance Ortiz, Bjorn Helgaas, lance_ortiz,
	jiang.liu, tony.luck, rostedt, mchehab, linux-acpi, linux-pci,
	LKML, gong.chen

On Fri, Dec 6, 2013 at 5:06 PM, rui wang <ruiv.wang@gmail.com> wrote:
> On 12/5/13, Borislav Petkov <bp@alien8.de> wrote:
>
>> Yes, the AER tracepoint above should use the AER_* defines and not the
>> HW_EVENT_ERR_* ones which are for memory errors.
>>
>> Wanna send a fix?
>>
>
> Yes. Does it translate into something like this?
>
> From: Rui Wang <rui.y.wang@intel.com>
> Date: Fri, 6 Dec 2013 16:47:46 +0800
> Subject: [PATCH] Fix severity usage in aer trace event
>
> Signed-off-by: Rui Wang <rui.y.wang@intel.com>
> ---
>  include/trace/events/ras.h |    8 ++++----
>  1 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/include/trace/events/ras.h b/include/trace/events/ras.h
> index 88b8783..e2a17d8 100644
> --- a/include/trace/events/ras.h
> +++ b/include/trace/events/ras.h
> @@ -5,7 +5,7 @@
>  #define _TRACE_AER_H
>
>  #include <linux/tracepoint.h>
> -#include <linux/edac.h>
> +#include <linux/aer.h>
>
>
>  /*
> @@ -63,10 +63,10 @@ TRACE_EVENT(aer_event,
>
>         TP_printk("%s PCIe Bus Error: severity=%s, %s\n",
>                 __get_str(dev_name),
> -               __entry->severity == HW_EVENT_ERR_CORRECTED ? "Corrected" :
> -                       __entry->severity == HW_EVENT_ERR_FATAL ?
> +               __entry->severity == AER_CORRECTABLE ? "Corrected" :
> +                       __entry->severity == AER_FATAL ?
>                         "Fatal" : "Uncorrected",

Why not "Fatal" : "Non-fatal"? Per the PCIe spec, "Fatal" and
"Non-fatal" are sub-categories of "Uncorrected". But here
"Uncorrected" means "Non-fatal".

Thanks,
Ethan

> -               __entry->severity == HW_EVENT_ERR_CORRECTED ?
> +               __entry->severity == AER_CORRECTABLE ?
>                 __print_flags(__entry->status, "|", aer_correctable_errors) :
>                 __print_flags(__entry->status, "|", aer_uncorrectable_errors))
>  );
> --
> 1.7.5.4
>
> Regards,
> Rui

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v10 1/3] aerdrv: Trace Event for AER
  2013-12-06 15:11       ` Ethan Zhao
@ 2013-12-07 17:45         ` Borislav Petkov
  0 siblings, 0 replies; 17+ messages in thread
From: Borislav Petkov @ 2013-12-07 17:45 UTC (permalink / raw)
  To: Ethan Zhao
  Cc: rui wang, Lance Ortiz, Bjorn Helgaas, lance_ortiz, jiang.liu,
	tony.luck, rostedt, mchehab, linux-acpi, linux-pci, LKML,
	gong.chen

On Fri, Dec 06, 2013 at 11:11:07PM +0800, Ethan Zhao wrote:
> > @@ -63,10 +63,10 @@ TRACE_EVENT(aer_event,
> >
> >         TP_printk("%s PCIe Bus Error: severity=%s, %s\n",
> >                 __get_str(dev_name),
> > -               __entry->severity == HW_EVENT_ERR_CORRECTED ? "Corrected" :
> > -                       __entry->severity == HW_EVENT_ERR_FATAL ?
> > +               __entry->severity == AER_CORRECTABLE ? "Corrected" :
> > +                       __entry->severity == AER_FATAL ?
> >                         "Fatal" : "Uncorrected",
> 
> Why not "Fatal" : "Non-fatal"? Per the PCIe spec, "Fatal" and
> "Non-fatal" are sub-categories of "Uncorrected". But here
> "Uncorrected" means "Non-fatal".

... and just to denote that, it'll probably be best to say:

		__entry->severity == AER_CORRECTABLE ? "Corrected" :
			__entry->severity == AER_FATAL ?
			   "Fatal" : "Uncorrected, non-fatal"

right?
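
Pulled out of the tracepoint, the mapping would look roughly like the
user-space sketch below. The AER_* values are the ones quoted earlier in
the thread (in the kernel they would come from <linux/aer.h>);
aer_sev_str() is a hypothetical helper used only for illustration.

```c
/* AER severities as quoted in this thread. */
#define AER_NONFATAL	0
#define AER_FATAL	1
#define AER_CORRECTABLE	2

/*
 * Hypothetical helper: the severity ternary from the suggestion above,
 * comparing against AER_* instead of the EDAC HW_EVENT_ERR_* values.
 * Anything that is neither correctable nor fatal falls through to the
 * non-fatal string.
 */
static const char *aer_sev_str(unsigned char severity)
{
	return severity == AER_CORRECTABLE ? "Corrected" :
	       severity == AER_FATAL ? "Fatal" : "Uncorrected, non-fatal";
}
```

With this, aer_sev_str(AER_CORRECTABLE) gives "Corrected", so the trace
string should agree with what aer_print_error() puts in dmesg for a
corrected error.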

Btw, Rui, your patch is whitespace-damaged so next time please try
sending from a real mail client which doesn't mangle whitespace and not
from the gmail web interface. Sending the patch to yourself and trying
to apply it is always a good test for that.

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2013-12-07 17:45 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-01-16 23:51 [PATCH v10 1/3] aerdrv: Trace Event for AER Lance Ortiz
2013-01-16 23:51 ` [PATCH v10 2/3] aerdrv: Enhanced AER logging Lance Ortiz
2013-01-16 23:51 ` [PATCH v10 3/3] aerdrv: Cleanup log output for AER Lance Ortiz
2013-01-17 17:21   ` Luck, Tony
2013-01-17 17:21     ` Luck, Tony
2013-01-17 17:21     ` Luck, Tony
2013-12-02  5:05 ` [PATCH v10 1/3] aerdrv: Trace Event " rui wang
2013-12-04 20:38   ` Borislav Petkov
2013-12-06  9:06     ` rui wang
2013-12-06 15:11       ` Ethan Zhao
2013-12-07 17:45         ` Borislav Petkov
2013-12-04  3:10 ` [BUG] " rui wang
2013-12-04 15:28   ` Ethan Zhao
2013-12-05 18:21     ` Betty Dall
2013-12-05 18:21       ` Betty Dall
2013-12-05 21:12       ` Borislav Petkov
2013-12-05 21:12         ` Borislav Petkov
