linux-cxl.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling
@ 2022-09-16 23:10 Dave Jiang
  2022-09-16 23:11 ` [PATCH RFC v2 1/9] cxl/pci: Cleanup repeated code in cxl_probe_regs() helpers Dave Jiang
                   ` (9 more replies)
  0 siblings, 10 replies; 34+ messages in thread
From: Dave Jiang @ 2022-09-16 23:10 UTC (permalink / raw)
  To: linux-cxl
  Cc: alison.schofield, vishal.l.verma, bwidawsk, dan.j.williams,
	jonathan.cameron, shiju.jose, rrichter

Series set to RFC since there's no means to test. Would like to get opinion
on whether going with using trace events as reporting mechanism is ok.

Jonathan,
We currently don't have any ways to test AER events. Do you have any plans
to support AER events via QEMU emulation?

v2:
- Convert error reporting via printk to trace events
- Drop ".rmap =" initialization (Jonathan)
- return PCI_ERS_RESULT_NEED_RESET for UE in pci_channel_io_normal (Shiju)

Add a 'struct pci_error_handlers' instance for the cxl_pci driver.
Section 8.2.4.16 "CXL RAS Capability Structure" of the CXL rev3.0
specification defines the error sources considered in this
implementation. The RAS Capability Structure defines protocol, link and
internal errors which are distinct from memory poison errors that are
conveyed via direct consumption and/or media scanning.

The errors reported by the RAS registers are categorized into
correctable and uncorrectable errors, where the uncorrectable errors are
optionally steered to either fatal or non-fatal AER events. Table 12-2 
"Device Specific Error Reporting and Nomenclature Guidelines" in the CXL
rev3.0 specification outlines that the remediation for uncorrectable errors
is a reset to recover. This matches how the Linux PCIe AER core treats
uncorrectable errors as occasions to reset the device to recover
operation.

While the specification notes "CXL Reset" or "Secondary Bus Reset" as
theoretical recovery options, they are not feasible in practice since
in-flight CXL.mem operations may not terminate and cause knock-on system
fatal events. Reset is only reliable for recovering CXL.io, it is not
reliable for recovering CXL.mem. Assuming the system survives, a reset
causes CXL.mem operation to restart from scratch.

The "ECN: Error Isolation on CXL.mem and CXL.cache" [1] document
recognizes the CXL Reset vs CXL.mem operational conflict and helps to at
least provide a mechanism for the Root Port to terminate in flight
CXL.mem operations with completions. That still poses problems in
practice if the kernel is running out of "System RAM" backed by the CXL
device and poison is used to convey the data lost to the protocol error.

Regardless of whether the reset and restart of CXL.mem operations is
feasible / successful, the logging is still useful. So, the
implementation reads, reports, and clears the status in the RAS
Capability Structure registers, and it notifies the 'struct cxl_memdev'
associated with the given PCIe endpoint to reattach to its driver over
the reset so that the HDM decoder configuration can be reconstructed.

The first half of the series reworks component register mapping so that
the cxl_pci driver can own the RAS Capability while the cxl_port driver
continues to own the HDM Decoder Capability. The last half implements
the RAS Capability Structure mapping and reporting via 'struct
pci_error_handlers'.

The reporting of error information is done through event tracing. A new
cxl_ras event is introduced to report the Uncorrectable and Correctable
errors raised by CXL. The expectation is a monitoring user daemon such as
"cxl monitor" will harvest those events and record them in a log in a
format (JSON) that's consumable by management applications..

[1]: https://www.computeexpresslink.org/spec-landing

---

Dan Williams (8):
      cxl/pci: Cleanup repeated code in cxl_probe_regs() helpers
      cxl/pci: Cleanup cxl_map_device_regs()
      cxl/pci: Kill cxl_map_regs()
      cxl/core/regs: Make cxl_map_{component, device}_regs() device generic
      cxl/port: Limit the port driver to just the HDM Decoder Capability
      cxl/pci: Prepare for mapping RAS Capability Structure
      cxl/pci: Find and map the RAS Capability Structure
      cxl/pci: Add (hopeful) error handling support

Dave Jiang (1):
      cxl/pci: add tracepoint events for CXL RAS


 drivers/cxl/core/hdm.c         |  33 ++---
 drivers/cxl/core/memdev.c      |   1 +
 drivers/cxl/core/pci.c         |   3 +-
 drivers/cxl/core/port.c        |   2 +-
 drivers/cxl/core/regs.c        | 172 +++++++++++++++-----------
 drivers/cxl/cxl.h              |  39 ++++--
 drivers/cxl/cxlmem.h           |   2 +
 drivers/cxl/cxlpci.h           |   9 --
 drivers/cxl/pci.c              | 216 +++++++++++++++++++++++++++------
 include/trace/events/cxl_ras.h | 117 ++++++++++++++++++
 10 files changed, 445 insertions(+), 149 deletions(-)
 create mode 100644 include/trace/events/cxl_ras.h

--


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH RFC v2 1/9] cxl/pci: Cleanup repeated code in cxl_probe_regs() helpers
  2022-09-16 23:10 [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling Dave Jiang
@ 2022-09-16 23:11 ` Dave Jiang
  2022-09-16 23:11 ` [PATCH RFC v2 2/9] cxl/pci: Cleanup cxl_map_device_regs() Dave Jiang
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 34+ messages in thread
From: Dave Jiang @ 2022-09-16 23:11 UTC (permalink / raw)
  To: linux-cxl
  Cc: alison.schofield, vishal.l.verma, bwidawsk, dan.j.williams,
	jonathan.cameron, shiju.jose, rrichter

From: Dan Williams <dan.j.williams@intel.com>

Rather then duplicating the setting of valid, length, and offset for
each type, just convey a pointer to the register map to common code.

Yes, the change in cxl_probe_component_regs() does not save
any lines of code, but it is preparation for adding another component
register type to map (RAS Capability Structure).

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/cxl/core/regs.c |   46 ++++++++++++++++++++++++++--------------------
 1 file changed, 26 insertions(+), 20 deletions(-)

diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
index 39a129c57d40..bd6ae14b679e 100644
--- a/drivers/cxl/core/regs.c
+++ b/drivers/cxl/core/regs.c
@@ -59,36 +59,41 @@ void cxl_probe_component_regs(struct device *dev, void __iomem *base,
 
 	for (cap = 1; cap <= cap_count; cap++) {
 		void __iomem *register_block;
-		u32 hdr;
-		int decoder_cnt;
+		struct cxl_reg_map *rmap;
 		u16 cap_id, offset;
-		u32 length;
+		u32 length, hdr;
 
 		hdr = readl(base + cap * 0x4);
 
 		cap_id = FIELD_GET(CXL_CM_CAP_HDR_ID_MASK, hdr);
 		offset = FIELD_GET(CXL_CM_CAP_PTR_MASK, hdr);
 		register_block = base + offset;
+		hdr = readl(register_block);
 
+		rmap = NULL;
 		switch (cap_id) {
-		case CXL_CM_CAP_CAP_ID_HDM:
+		case CXL_CM_CAP_CAP_ID_HDM: {
+			int decoder_cnt;
+
 			dev_dbg(dev, "found HDM decoder capability (0x%x)\n",
 				offset);
 
-			hdr = readl(register_block);
-
 			decoder_cnt = cxl_hdm_decoder_count(hdr);
 			length = 0x20 * decoder_cnt + 0x10;
-
-			map->hdm_decoder.valid = true;
-			map->hdm_decoder.offset = CXL_CM_OFFSET + offset;
-			map->hdm_decoder.size = length;
+			rmap = &map->hdm_decoder;
 			break;
+		}
 		default:
 			dev_dbg(dev, "Unknown CM cap ID: %d (0x%x)\n", cap_id,
 				offset);
 			break;
 		}
+
+		if (!rmap)
+			continue;
+		rmap->valid = true;
+		rmap->offset = CXL_CM_OFFSET + offset;
+		rmap->size = length;
 	}
 }
 EXPORT_SYMBOL_NS_GPL(cxl_probe_component_regs, CXL);
@@ -117,6 +122,7 @@ void cxl_probe_device_regs(struct device *dev, void __iomem *base,
 	cap_count = FIELD_GET(CXLDEV_CAP_ARRAY_COUNT_MASK, cap_array);
 
 	for (cap = 1; cap <= cap_count; cap++) {
+		struct cxl_reg_map *rmap;
 		u32 offset, length;
 		u16 cap_id;
 
@@ -125,28 +131,22 @@ void cxl_probe_device_regs(struct device *dev, void __iomem *base,
 		offset = readl(base + cap * 0x10 + 0x4);
 		length = readl(base + cap * 0x10 + 0x8);
 
+		rmap = NULL;
 		switch (cap_id) {
 		case CXLDEV_CAP_CAP_ID_DEVICE_STATUS:
 			dev_dbg(dev, "found Status capability (0x%x)\n", offset);
-
-			map->status.valid = true;
-			map->status.offset = offset;
-			map->status.size = length;
+			rmap = &map->status;
 			break;
 		case CXLDEV_CAP_CAP_ID_PRIMARY_MAILBOX:
 			dev_dbg(dev, "found Mailbox capability (0x%x)\n", offset);
-			map->mbox.valid = true;
-			map->mbox.offset = offset;
-			map->mbox.size = length;
+			rmap = &map->mbox;
 			break;
 		case CXLDEV_CAP_CAP_ID_SECONDARY_MAILBOX:
 			dev_dbg(dev, "found Secondary Mailbox capability (0x%x)\n", offset);
 			break;
 		case CXLDEV_CAP_CAP_ID_MEMDEV:
 			dev_dbg(dev, "found Memory Device capability (0x%x)\n", offset);
-			map->memdev.valid = true;
-			map->memdev.offset = offset;
-			map->memdev.size = length;
+			rmap = &map->memdev;
 			break;
 		default:
 			if (cap_id >= 0x8000)
@@ -155,6 +155,12 @@ void cxl_probe_device_regs(struct device *dev, void __iomem *base,
 				dev_dbg(dev, "Unknown cap ID: %#x offset: %#x\n", cap_id, offset);
 			break;
 		}
+
+		if (!rmap)
+			continue;
+		rmap->valid = true;
+		rmap->offset = offset;
+		rmap->size = length;
 	}
 }
 EXPORT_SYMBOL_NS_GPL(cxl_probe_device_regs, CXL);



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH RFC v2 2/9] cxl/pci: Cleanup cxl_map_device_regs()
  2022-09-16 23:10 [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling Dave Jiang
  2022-09-16 23:11 ` [PATCH RFC v2 1/9] cxl/pci: Cleanup repeated code in cxl_probe_regs() helpers Dave Jiang
@ 2022-09-16 23:11 ` Dave Jiang
  2022-09-16 23:11 ` [PATCH RFC v2 3/9] cxl/pci: Kill cxl_map_regs() Dave Jiang
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 34+ messages in thread
From: Dave Jiang @ 2022-09-16 23:11 UTC (permalink / raw)
  To: linux-cxl
  Cc: alison.schofield, vishal.l.verma, bwidawsk, dan.j.williams,
	jonathan.cameron, shiju.jose, rrichter

From: Dan Williams <dan.j.williams@intel.com>

Use a loop to reduce the duplicated code in cxl_map_device_regs(). This
is in preparation for deleting cxl_map_regs().

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/cxl/core/regs.c |   51 ++++++++++++++++++-----------------------------
 1 file changed, 20 insertions(+), 31 deletions(-)

diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
index bd6ae14b679e..f03ede86412d 100644
--- a/drivers/cxl/core/regs.c
+++ b/drivers/cxl/core/regs.c
@@ -211,42 +211,31 @@ int cxl_map_device_regs(struct pci_dev *pdev,
 			struct cxl_device_regs *regs,
 			struct cxl_register_map *map)
 {
+	resource_size_t phys_addr =
+		pci_resource_start(pdev, map->barno) + map->block_offset;
 	struct device *dev = &pdev->dev;
-	resource_size_t phys_addr;
-
-	phys_addr = pci_resource_start(pdev, map->barno);
-	phys_addr += map->block_offset;
-
-	if (map->device_map.status.valid) {
-		resource_size_t addr;
+	struct mapinfo {
+		struct cxl_reg_map *rmap;
+		void __iomem **addr;
+	} mapinfo[] = {
+		{ &map->device_map.status, &regs->status, },
+		{ &map->device_map.mbox, &regs->mbox, },
+		{ &map->device_map.memdev, &regs->memdev, },
+	};
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(mapinfo); i++) {
+		struct mapinfo *mi = &mapinfo[i];
 		resource_size_t length;
-
-		addr = phys_addr + map->device_map.status.offset;
-		length = map->device_map.status.size;
-		regs->status = devm_cxl_iomap_block(dev, addr, length);
-		if (!regs->status)
-			return -ENOMEM;
-	}
-
-	if (map->device_map.mbox.valid) {
 		resource_size_t addr;
-		resource_size_t length;
 
-		addr = phys_addr + map->device_map.mbox.offset;
-		length = map->device_map.mbox.size;
-		regs->mbox = devm_cxl_iomap_block(dev, addr, length);
-		if (!regs->mbox)
-			return -ENOMEM;
-	}
-
-	if (map->device_map.memdev.valid) {
-		resource_size_t addr;
-		resource_size_t length;
+		if (!mi->rmap->valid)
+			continue;
 
-		addr = phys_addr + map->device_map.memdev.offset;
-		length = map->device_map.memdev.size;
-		regs->memdev = devm_cxl_iomap_block(dev, addr, length);
-		if (!regs->memdev)
+		addr = phys_addr + mi->rmap->offset;
+		length = mi->rmap->size;
+		*(mi->addr) = devm_cxl_iomap_block(dev, addr, length);
+		if (!*(mi->addr))
 			return -ENOMEM;
 	}
 



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH RFC v2 3/9] cxl/pci: Kill cxl_map_regs()
  2022-09-16 23:10 [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling Dave Jiang
  2022-09-16 23:11 ` [PATCH RFC v2 1/9] cxl/pci: Cleanup repeated code in cxl_probe_regs() helpers Dave Jiang
  2022-09-16 23:11 ` [PATCH RFC v2 2/9] cxl/pci: Cleanup cxl_map_device_regs() Dave Jiang
@ 2022-09-16 23:11 ` Dave Jiang
  2022-10-18 13:43   ` Jonathan Cameron
  2022-09-16 23:11 ` [PATCH RFC v2 4/9] cxl/core/regs: Make cxl_map_{component, device}_regs() device generic Dave Jiang
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Dave Jiang @ 2022-09-16 23:11 UTC (permalink / raw)
  To: linux-cxl
  Cc: alison.schofield, vishal.l.verma, bwidawsk, dan.j.williams,
	jonathan.cameron, shiju.jose, rrichter

From: Dan Williams <dan.j.williams@intel.com>

The component registers are currently unused by the cxl_pci driver.
Only the physical address base of the component registers is conveyed to
the cxl_mem driver. Just call cxl_map_device_registers() directly.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/cxl/pci.c |   23 +----------------------
 1 file changed, 1 insertion(+), 22 deletions(-)

diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index faeb5d9d7a7a..82023cf0cdcf 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -347,27 +347,6 @@ static int cxl_probe_regs(struct pci_dev *pdev, struct cxl_register_map *map)
 	return 0;
 }
 
-static int cxl_map_regs(struct cxl_dev_state *cxlds, struct cxl_register_map *map)
-{
-	struct device *dev = cxlds->dev;
-	struct pci_dev *pdev = to_pci_dev(dev);
-
-	switch (map->reg_type) {
-	case CXL_REGLOC_RBI_COMPONENT:
-		cxl_map_component_regs(pdev, &cxlds->regs.component, map);
-		dev_dbg(dev, "Mapping component registers...\n");
-		break;
-	case CXL_REGLOC_RBI_MEMDEV:
-		cxl_map_device_regs(pdev, &cxlds->regs.device_regs, map);
-		dev_dbg(dev, "Probing device registers...\n");
-		break;
-	default:
-		break;
-	}
-
-	return 0;
-}
-
 static int cxl_setup_regs(struct pci_dev *pdev, enum cxl_regloc_type type,
 			  struct cxl_register_map *map)
 {
@@ -461,7 +440,7 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (rc)
 		return rc;
 
-	rc = cxl_map_regs(cxlds, &map);
+	rc = cxl_map_device_regs(pdev, &cxlds->regs.device_regs, &map);
 	if (rc)
 		return rc;
 



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH RFC v2 4/9] cxl/core/regs: Make cxl_map_{component, device}_regs() device generic
  2022-09-16 23:10 [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling Dave Jiang
                   ` (2 preceding siblings ...)
  2022-09-16 23:11 ` [PATCH RFC v2 3/9] cxl/pci: Kill cxl_map_regs() Dave Jiang
@ 2022-09-16 23:11 ` Dave Jiang
  2022-09-16 23:11 ` [PATCH RFC v2 5/9] cxl/port: Limit the port driver to just the HDM Decoder Capability Dave Jiang
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 34+ messages in thread
From: Dave Jiang @ 2022-09-16 23:11 UTC (permalink / raw)
  To: linux-cxl
  Cc: alison.schofield, vishal.l.verma, bwidawsk, dan.j.williams,
	jonathan.cameron, shiju.jose, rrichter

From: Dan Williams <dan.j.williams@intel.com>

There is no need to carry the barno and the block offset through the
stack, just convert them to a resource base immediately.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/cxl/core/pci.c  |    3 +--
 drivers/cxl/core/port.c |    2 +-
 drivers/cxl/core/regs.c |   40 +++++++++++++++++++++++-----------------
 drivers/cxl/cxl.h       |   14 ++++++--------
 drivers/cxl/cxlpci.h    |    9 ---------
 drivers/cxl/pci.c       |   25 ++++++-------------------
 6 files changed, 37 insertions(+), 56 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 9240df53ed87..2a8a88d38533 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -54,8 +54,7 @@ static int match_add_dports(struct pci_dev *pdev, void *data)
 		dev_dbg(&port->dev, "failed to find component registers\n");
 
 	port_num = FIELD_GET(PCI_EXP_LNKCAP_PN, lnkcap);
-	dport = devm_cxl_add_dport(port, &pdev->dev, port_num,
-				   cxl_regmap_to_base(pdev, &map));
+	dport = devm_cxl_add_dport(port, &pdev->dev, port_num, map.resource);
 	if (IS_ERR(dport)) {
 		ctx->error = PTR_ERR(dport);
 		return PTR_ERR(dport);
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index bffde862de0b..2c78b28dbcc6 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1236,7 +1236,7 @@ static resource_size_t find_component_registers(struct device *dev)
 	pdev = to_pci_dev(dev);
 
 	cxl_find_regblock(pdev, CXL_REGLOC_RBI_COMPONENT, &map);
-	return cxl_regmap_to_base(pdev, &map);
+	return map.resource;
 }
 
 static int add_port_attach_ep(struct cxl_memdev *cxlmd,
diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
index f03ede86412d..e4b0d52ac3a1 100644
--- a/drivers/cxl/core/regs.c
+++ b/drivers/cxl/core/regs.c
@@ -186,17 +186,13 @@ void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,
 	return ret_val;
 }
 
-int cxl_map_component_regs(struct pci_dev *pdev,
-			   struct cxl_component_regs *regs,
+int cxl_map_component_regs(struct device *dev, struct cxl_component_regs *regs,
 			   struct cxl_register_map *map)
 {
-	struct device *dev = &pdev->dev;
 	resource_size_t phys_addr;
 	resource_size_t length;
 
-	phys_addr = pci_resource_start(pdev, map->barno);
-	phys_addr += map->block_offset;
-
+	phys_addr = map->resource;
 	phys_addr += map->component_map.hdm_decoder.offset;
 	length = map->component_map.hdm_decoder.size;
 	regs->hdm_decoder = devm_cxl_iomap_block(dev, phys_addr, length);
@@ -207,13 +203,11 @@ int cxl_map_component_regs(struct pci_dev *pdev,
 }
 EXPORT_SYMBOL_NS_GPL(cxl_map_component_regs, CXL);
 
-int cxl_map_device_regs(struct pci_dev *pdev,
+int cxl_map_device_regs(struct device *dev,
 			struct cxl_device_regs *regs,
 			struct cxl_register_map *map)
 {
-	resource_size_t phys_addr =
-		pci_resource_start(pdev, map->barno) + map->block_offset;
-	struct device *dev = &pdev->dev;
+	resource_size_t phys_addr = map->resource;
 	struct mapinfo {
 		struct cxl_reg_map *rmap;
 		void __iomem **addr;
@@ -243,13 +237,24 @@ int cxl_map_device_regs(struct pci_dev *pdev,
 }
 EXPORT_SYMBOL_NS_GPL(cxl_map_device_regs, CXL);
 
-static void cxl_decode_regblock(u32 reg_lo, u32 reg_hi,
+static bool cxl_decode_regblock(struct pci_dev *pdev, u32 reg_lo, u32 reg_hi,
 				struct cxl_register_map *map)
 {
-	map->block_offset = ((u64)reg_hi << 32) |
-			    (reg_lo & CXL_DVSEC_REG_LOCATOR_BLOCK_OFF_LOW_MASK);
-	map->barno = FIELD_GET(CXL_DVSEC_REG_LOCATOR_BIR_MASK, reg_lo);
+	int bar = FIELD_GET(CXL_DVSEC_REG_LOCATOR_BIR_MASK, reg_lo);
+	u64 offset = ((u64)reg_hi << 32) |
+		     (reg_lo & CXL_DVSEC_REG_LOCATOR_BLOCK_OFF_LOW_MASK);
+
+	if (offset > pci_resource_len(pdev, bar)) {
+		dev_warn(&pdev->dev,
+			 "BAR%d: %pr: too small (offset: %pa, type: %d)\n", bar,
+			 &pdev->resource[bar], &offset, map->reg_type);
+		return false;
+	}
+
 	map->reg_type = FIELD_GET(CXL_DVSEC_REG_LOCATOR_BLOCK_ID_MASK, reg_lo);
+	map->resource = pci_resource_start(pdev, bar) + offset;
+	map->max_size = pci_resource_len(pdev, bar) - offset;
+	return true;
 }
 
 /**
@@ -269,7 +274,7 @@ int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type,
 	u32 regloc_size, regblocks;
 	int regloc, i;
 
-	map->block_offset = U64_MAX;
+	map->resource = CXL_RESOURCE_NONE;
 	regloc = pci_find_dvsec_capability(pdev, PCI_DVSEC_VENDOR_ID_CXL,
 					   CXL_DVSEC_REG_LOCATOR);
 	if (!regloc)
@@ -287,13 +292,14 @@ int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type,
 		pci_read_config_dword(pdev, regloc, &reg_lo);
 		pci_read_config_dword(pdev, regloc + 4, &reg_hi);
 
-		cxl_decode_regblock(reg_lo, reg_hi, map);
+		if (!cxl_decode_regblock(pdev, reg_lo, reg_hi, map))
+			continue;
 
 		if (map->reg_type == type)
 			return 0;
 	}
 
-	map->block_offset = U64_MAX;
+	map->resource = CXL_RESOURCE_NONE;
 	return -ENODEV;
 }
 EXPORT_SYMBOL_NS_GPL(cxl_find_regblock, CXL);
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index f680450f0b16..75506286f39c 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -187,17 +187,17 @@ struct cxl_device_reg_map {
 /**
  * struct cxl_register_map - DVSEC harvested register block mapping parameters
  * @base: virtual base of the register-block-BAR + @block_offset
- * @block_offset: offset to start of register block in @barno
+ * @resource: physical resource base of the register block
+ * @max_size: maximum mapping size to perform register search
  * @reg_type: see enum cxl_regloc_type
- * @barno: PCI BAR number containing the register block
  * @component_map: cxl_reg_map for component registers
  * @device_map: cxl_reg_maps for device registers
  */
 struct cxl_register_map {
 	void __iomem *base;
-	u64 block_offset;
+	resource_size_t resource;
+	resource_size_t max_size;
 	u8 reg_type;
-	u8 barno;
 	union {
 		struct cxl_component_reg_map component_map;
 		struct cxl_device_reg_map device_map;
@@ -208,11 +208,9 @@ void cxl_probe_component_regs(struct device *dev, void __iomem *base,
 			      struct cxl_component_reg_map *map);
 void cxl_probe_device_regs(struct device *dev, void __iomem *base,
 			   struct cxl_device_reg_map *map);
-int cxl_map_component_regs(struct pci_dev *pdev,
-			   struct cxl_component_regs *regs,
+int cxl_map_component_regs(struct device *dev, struct cxl_component_regs *regs,
 			   struct cxl_register_map *map);
-int cxl_map_device_regs(struct pci_dev *pdev,
-			struct cxl_device_regs *regs,
+int cxl_map_device_regs(struct device *dev, struct cxl_device_regs *regs,
 			struct cxl_register_map *map);
 
 enum cxl_regloc_type;
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index eec597dbe763..920909791bb9 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -62,15 +62,6 @@ enum cxl_regloc_type {
 	CXL_REGLOC_RBI_TYPES
 };
 
-static inline resource_size_t cxl_regmap_to_base(struct pci_dev *pdev,
-						 struct cxl_register_map *map)
-{
-	if (map->block_offset == U64_MAX)
-		return CXL_RESOURCE_NONE;
-
-	return pci_resource_start(pdev, map->barno) + map->block_offset;
-}
-
 int devm_cxl_port_enumerate_dports(struct cxl_port *port);
 struct cxl_dev_state;
 int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm);
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 82023cf0cdcf..aba31c2291c4 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -276,35 +276,22 @@ static int cxl_pci_setup_mailbox(struct cxl_dev_state *cxlds)
 
 static int cxl_map_regblock(struct pci_dev *pdev, struct cxl_register_map *map)
 {
-	void __iomem *addr;
-	int bar = map->barno;
 	struct device *dev = &pdev->dev;
-	resource_size_t offset = map->block_offset;
 
-	/* Basic sanity check that BAR is big enough */
-	if (pci_resource_len(pdev, bar) < offset) {
-		dev_err(dev, "BAR%d: %pr: too small (offset: %pa)\n", bar,
-			&pdev->resource[bar], &offset);
-		return -ENXIO;
-	}
-
-	addr = pci_iomap(pdev, bar, 0);
-	if (!addr) {
+	map->base = ioremap(map->resource, map->max_size);
+	if (!map->base) {
 		dev_err(dev, "failed to map registers\n");
 		return -ENOMEM;
 	}
 
-	dev_dbg(dev, "Mapped CXL Memory Device resource bar %u @ %pa\n",
-		bar, &offset);
-
-	map->base = addr + map->block_offset;
+	dev_dbg(dev, "Mapped CXL Memory Device resource %pa\n", &map->resource);
 	return 0;
 }
 
 static void cxl_unmap_regblock(struct pci_dev *pdev,
 			       struct cxl_register_map *map)
 {
-	pci_iounmap(pdev, map->base - map->block_offset);
+	iounmap(map->base);
 	map->base = NULL;
 }
 
@@ -440,7 +427,7 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (rc)
 		return rc;
 
-	rc = cxl_map_device_regs(pdev, &cxlds->regs.device_regs, &map);
+	rc = cxl_map_device_regs(&pdev->dev, &cxlds->regs.device_regs, &map);
 	if (rc)
 		return rc;
 
@@ -453,7 +440,7 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (rc)
 		dev_warn(&pdev->dev, "No component registers (%d)\n", rc);
 
-	cxlds->component_reg_phys = cxl_regmap_to_base(pdev, &map);
+	cxlds->component_reg_phys = map.resource;
 
 	devm_cxl_pci_create_doe(cxlds);
 



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH RFC v2 5/9] cxl/port: Limit the port driver to just the HDM Decoder Capability
  2022-09-16 23:10 [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling Dave Jiang
                   ` (3 preceding siblings ...)
  2022-09-16 23:11 ` [PATCH RFC v2 4/9] cxl/core/regs: Make cxl_map_{component, device}_regs() device generic Dave Jiang
@ 2022-09-16 23:11 ` Dave Jiang
  2022-10-20 16:54   ` Jonathan Cameron
  2022-09-16 23:11 ` [PATCH RFC v2 6/9] cxl/pci: Prepare for mapping RAS Capability Structure Dave Jiang
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Dave Jiang @ 2022-09-16 23:11 UTC (permalink / raw)
  To: linux-cxl
  Cc: alison.schofield, vishal.l.verma, bwidawsk, dan.j.williams,
	jonathan.cameron, shiju.jose, rrichter

From: Dan Williams <dan.j.williams@intel.com>

Update the port driver to use cxl_map_component_registers() so that the
component register block can be shared between the cxl_pci driver and
the cxl_port driver. I.e. stop the port driver from reserving the entire
component register block for itself via request_region() when it only
needs the HDM Decoder Capability subset.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/cxl/core/hdm.c |   32 ++++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index d1d2caea5c62..061551148cfe 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -82,18 +82,22 @@ static void parse_hdm_decoder_caps(struct cxl_hdm *cxlhdm)
 		cxlhdm->interleave_mask |= GENMASK(14, 12);
 }
 
-static void __iomem *map_hdm_decoder_regs(struct cxl_port *port,
-					  void __iomem *crb)
+static int map_hdm_decoder_regs(struct cxl_port *port, void __iomem *crb,
+				struct cxl_component_regs *regs)
 {
-	struct cxl_component_reg_map map;
+	struct cxl_register_map map = {
+		.resource = port->component_reg_phys,
+		.base = crb,
+		.max_size = CXL_COMPONENT_REG_BLOCK_SIZE,
+	};
 
-	cxl_probe_component_regs(&port->dev, crb, &map);
-	if (!map.hdm_decoder.valid) {
+	cxl_probe_component_regs(&port->dev, crb, &map.component_map);
+	if (!map.component_map.hdm_decoder.valid) {
 		dev_err(&port->dev, "HDM decoder registers invalid\n");
-		return IOMEM_ERR_PTR(-ENXIO);
+		return -ENXIO;
 	}
 
-	return crb + map.hdm_decoder.offset;
+	return cxl_map_component_regs(&port->dev, regs, &map);
 }
 
 /**
@@ -103,25 +107,25 @@ static void __iomem *map_hdm_decoder_regs(struct cxl_port *port,
 struct cxl_hdm *devm_cxl_setup_hdm(struct cxl_port *port)
 {
 	struct device *dev = &port->dev;
-	void __iomem *crb, *hdm;
 	struct cxl_hdm *cxlhdm;
+	void __iomem *crb;
+	int rc;
 
 	cxlhdm = devm_kzalloc(dev, sizeof(*cxlhdm), GFP_KERNEL);
 	if (!cxlhdm)
 		return ERR_PTR(-ENOMEM);
 
 	cxlhdm->port = port;
-	crb = devm_cxl_iomap_block(dev, port->component_reg_phys,
-				   CXL_COMPONENT_REG_BLOCK_SIZE);
+	crb = ioremap(port->component_reg_phys, CXL_COMPONENT_REG_BLOCK_SIZE);
 	if (!crb) {
 		dev_err(dev, "No component registers mapped\n");
 		return ERR_PTR(-ENXIO);
 	}
 
-	hdm = map_hdm_decoder_regs(port, crb);
-	if (IS_ERR(hdm))
-		return ERR_CAST(hdm);
-	cxlhdm->regs.hdm_decoder = hdm;
+	rc = map_hdm_decoder_regs(port, crb, &cxlhdm->regs);
+	iounmap(crb);
+	if (rc)
+		return ERR_PTR(rc);
 
 	parse_hdm_decoder_caps(cxlhdm);
 	if (cxlhdm->decoder_count == 0) {



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH RFC v2 6/9] cxl/pci: Prepare for mapping RAS Capability Structure
  2022-09-16 23:10 [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling Dave Jiang
                   ` (4 preceding siblings ...)
  2022-09-16 23:11 ` [PATCH RFC v2 5/9] cxl/port: Limit the port driver to just the HDM Decoder Capability Dave Jiang
@ 2022-09-16 23:11 ` Dave Jiang
  2022-09-16 23:11 ` [PATCH RFC v2 7/9] cxl/pci: Find and map the " Dave Jiang
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 34+ messages in thread
From: Dave Jiang @ 2022-09-16 23:11 UTC (permalink / raw)
  To: linux-cxl
  Cc: alison.schofield, vishal.l.verma, bwidawsk, dan.j.williams,
	jonathan.cameron, shiju.jose, rrichter

From: Dan Williams <dan.j.williams@intel.com>

The RAS Capabilitiy Structure is a CXL Component register capability
block. Unlike the HDM Decoder Capability, it will be referenced by the
cxl_pci driver in response to PCIe AER events. Due to this it is no
longer the case that cxl_map_component_regs() can assume that it should
map all component registers. Plumb a bitmask of capability ids to map
through cxl_map_component_regs().

For symmetry cxl_probe_device_regs() is updated to populate @id in
'struct cxl_reg_map' even though cxl_map_device_regs() does not have a
need to map a subset of the device registers per caller.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/cxl/core/hdm.c  |    3 ++-
 drivers/cxl/core/regs.c |   36 ++++++++++++++++++++++++++----------
 drivers/cxl/cxl.h       |    4 +++-
 3 files changed, 31 insertions(+), 12 deletions(-)

diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 061551148cfe..100d0881bde4 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -97,7 +97,8 @@ static int map_hdm_decoder_regs(struct cxl_port *port, void __iomem *crb,
 		return -ENXIO;
 	}
 
-	return cxl_map_component_regs(&port->dev, regs, &map);
+	return cxl_map_component_regs(&port->dev, regs, &map,
+				      BIT(CXL_CM_CAP_CAP_ID_HDM));
 }
 
 /**
diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
index e4b0d52ac3a1..97e8f4201493 100644
--- a/drivers/cxl/core/regs.c
+++ b/drivers/cxl/core/regs.c
@@ -92,6 +92,7 @@ void cxl_probe_component_regs(struct device *dev, void __iomem *base,
 		if (!rmap)
 			continue;
 		rmap->valid = true;
+		rmap->id = cap_id;
 		rmap->offset = CXL_CM_OFFSET + offset;
 		rmap->size = length;
 	}
@@ -159,6 +160,7 @@ void cxl_probe_device_regs(struct device *dev, void __iomem *base,
 		if (!rmap)
 			continue;
 		rmap->valid = true;
+		rmap->id = cap_id;
 		rmap->offset = offset;
 		rmap->size = length;
 	}
@@ -187,17 +189,31 @@ void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,
 }
 
 int cxl_map_component_regs(struct device *dev, struct cxl_component_regs *regs,
-			   struct cxl_register_map *map)
+			   struct cxl_register_map *map, unsigned long map_mask)
 {
-	resource_size_t phys_addr;
-	resource_size_t length;
-
-	phys_addr = map->resource;
-	phys_addr += map->component_map.hdm_decoder.offset;
-	length = map->component_map.hdm_decoder.size;
-	regs->hdm_decoder = devm_cxl_iomap_block(dev, phys_addr, length);
-	if (!regs->hdm_decoder)
-		return -ENOMEM;
+	struct mapinfo {
+		struct cxl_reg_map *rmap;
+		void __iomem **addr;
+	} mapinfo[] = {
+		{ &map->component_map.hdm_decoder, &regs->hdm_decoder },
+	};
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(mapinfo); i++) {
+		struct mapinfo *mi = &mapinfo[i];
+		resource_size_t phys_addr;
+		resource_size_t length;
+
+		if (!mi->rmap->valid)
+			continue;
+		if (!test_bit(mi->rmap->id, &map_mask))
+			continue;
+		phys_addr = map->resource + mi->rmap->offset;
+		length = mi->rmap->size;
+		*(mi->addr) = devm_cxl_iomap_block(dev, phys_addr, length);
+		if (!*(mi->addr))
+			return -ENOMEM;
+	}
 
 	return 0;
 }
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 75506286f39c..c2bf1759d55c 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -170,6 +170,7 @@ struct cxl_regs {
 
 struct cxl_reg_map {
 	bool valid;
+	int id;
 	unsigned long offset;
 	unsigned long size;
 };
@@ -209,7 +210,8 @@ void cxl_probe_component_regs(struct device *dev, void __iomem *base,
 void cxl_probe_device_regs(struct device *dev, void __iomem *base,
 			   struct cxl_device_reg_map *map);
 int cxl_map_component_regs(struct device *dev, struct cxl_component_regs *regs,
-			   struct cxl_register_map *map);
+			   struct cxl_register_map *map,
+			   unsigned long map_mask);
 int cxl_map_device_regs(struct device *dev, struct cxl_device_regs *regs,
 			struct cxl_register_map *map);
 



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH RFC v2 7/9] cxl/pci: Find and map the RAS Capability Structure
  2022-09-16 23:10 [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling Dave Jiang
                   ` (5 preceding siblings ...)
  2022-09-16 23:11 ` [PATCH RFC v2 6/9] cxl/pci: Prepare for mapping RAS Capability Structure Dave Jiang
@ 2022-09-16 23:11 ` Dave Jiang
  2022-09-16 23:11 ` [PATCH RFC v2 8/9] cxl/pci: add tracepoint events for CXL RAS Dave Jiang
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 34+ messages in thread
From: Dave Jiang @ 2022-09-16 23:11 UTC (permalink / raw)
  To: linux-cxl
  Cc: alison.schofield, vishal.l.verma, bwidawsk, dan.j.williams,
	jonathan.cameron, shiju.jose, rrichter

From: Dan Williams <dan.j.williams@intel.com>

The RAS Capability Structure has some ancillary information that may be
relevant with respect to AER events, link and protcol error status
registers. Map the RAS Capability Registers in support of defining a
'struct pci_error_handlers' instance for the cxl_pci driver.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/cxl/core/regs.c |    7 +++++++
 drivers/cxl/cxl.h       |   19 +++++++++++++++++++
 drivers/cxl/pci.c       |    8 ++++++++
 3 files changed, 34 insertions(+)

diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
index 97e8f4201493..b10f8b79ec40 100644
--- a/drivers/cxl/core/regs.c
+++ b/drivers/cxl/core/regs.c
@@ -83,6 +83,12 @@ void cxl_probe_component_regs(struct device *dev, void __iomem *base,
 			rmap = &map->hdm_decoder;
 			break;
 		}
+		case CXL_CM_CAP_CAP_ID_RAS:
+			dev_dbg(dev, "found RAS capability (0x%x)\n",
+				offset);
+			length = CXL_RAS_CAPABILITY_LENGTH;
+			rmap = &map->ras;
+			break;
 		default:
 			dev_dbg(dev, "Unknown CM cap ID: %d (0x%x)\n", cap_id,
 				offset);
@@ -196,6 +202,7 @@ int cxl_map_component_regs(struct device *dev, struct cxl_component_regs *regs,
 		void __iomem **addr;
 	} mapinfo[] = {
 		{ &map->component_map.hdm_decoder, &regs->hdm_decoder },
+		{ &map->component_map.ras, &regs->ras },
 	};
 	int i;
 
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index c2bf1759d55c..ce17ccd8b125 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -33,6 +33,7 @@
 #define   CXL_CM_CAP_HDR_ARRAY_SIZE_MASK GENMASK(31, 24)
 #define CXL_CM_CAP_PTR_MASK GENMASK(31, 20)
 
+#define   CXL_CM_CAP_CAP_ID_RAS 0x2
 #define   CXL_CM_CAP_CAP_ID_HDM 0x5
 #define   CXL_CM_CAP_CAP_HDM_VERSION 1
 
@@ -119,6 +120,21 @@ static inline int ways_to_cxl(unsigned int ways, u8 *iw)
 	return 0;
 }
 
+/* RAS Registers CXL 2.0 8.2.5.9 CXL RAS Capability Structure */
+#define CXL_RAS_UNCORRECTABLE_STATUS_OFFSET 0x0
+#define   CXL_RAS_UNCORRECTABLE_STATUS_MASK (GENMASK(16, 14) | GENMASK(11, 0))
+#define CXL_RAS_UNCORRECTABLE_MASK_OFFSET 0x4
+#define   CXL_RAS_UNCORRECTABLE_MASK_MASK (GENMASK(16, 14) | GENMASK(11, 0))
+#define CXL_RAS_UNCORRECTABLE_SEVERITY_OFFSET 0x8
+#define   CXL_RAS_UNCORRECTABLE_SEVERITY_MASK (GENMASK(16, 14) | GENMASK(11, 0))
+#define CXL_RAS_CORRECTABLE_STATUS_OFFSET 0xC
+#define   CXL_RAS_CORRECTABLE_STATUS_MASK GENMASK(6, 0)
+#define CXL_RAS_CORRECTABLE_MASK_OFFSET 0x10
+#define   CXL_RAS_CORRECTABLE_MASK_MASK GENMASK(6, 0)
+#define CXL_RAS_CAP_CONTROL_OFFSET 0x14
+#define CXL_RAS_HEADER_LOG_OFFSET 0x18
+#define CXL_RAS_CAPABILITY_LENGTH 0x58
+
 /* CXL 2.0 8.2.8.1 Device Capabilities Array Register */
 #define CXLDEV_CAP_ARRAY_OFFSET 0x0
 #define   CXLDEV_CAP_ARRAY_CAP_ID 0
@@ -153,9 +169,11 @@ struct cxl_regs {
 	/*
 	 * Common set of CXL Component register block base pointers
 	 * @hdm_decoder: CXL 2.0 8.2.5.12 CXL HDM Decoder Capability Structure
+	 * @ras: CXL 2.0 8.2.5.9 CXL RAS Capability Structure
 	 */
 	struct_group_tagged(cxl_component_regs, component,
 		void __iomem *hdm_decoder;
+		void __iomem *ras;
 	);
 	/*
 	 * Common set of CXL Device register block base pointers
@@ -177,6 +195,7 @@ struct cxl_reg_map {
 
 struct cxl_component_reg_map {
 	struct cxl_reg_map hdm_decoder;
+	struct cxl_reg_map ras;
 };
 
 struct cxl_device_reg_map {
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index aba31c2291c4..610b3a77f205 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -311,6 +311,9 @@ static int cxl_probe_regs(struct pci_dev *pdev, struct cxl_register_map *map)
 			return -ENXIO;
 		}
 
+		if (!comp_map->ras.valid)
+			dev_dbg(dev, "RAS registers not found\n");
+
 		dev_dbg(dev, "Set up component registers\n");
 		break;
 	case CXL_REGLOC_RBI_MEMDEV:
@@ -444,6 +447,11 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 	devm_cxl_pci_create_doe(cxlds);
 
+	rc = cxl_map_component_regs(&pdev->dev, &cxlds->regs.component,
+				    &map, BIT(CXL_CM_CAP_CAP_ID_RAS));
+	if (rc)
+		dev_dbg(&pdev->dev, "Failed to map RAS capability.\n");
+
 	rc = cxl_pci_setup_mailbox(cxlds);
 	if (rc)
 		return rc;



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH RFC v2 8/9] cxl/pci: add tracepoint events for CXL RAS
  2022-09-16 23:10 [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling Dave Jiang
                   ` (6 preceding siblings ...)
  2022-09-16 23:11 ` [PATCH RFC v2 7/9] cxl/pci: Find and map the " Dave Jiang
@ 2022-09-16 23:11 ` Dave Jiang
  2022-10-20 17:02   ` Jonathan Cameron
  2022-09-16 23:11 ` [PATCH RFC v2 9/9] cxl/pci: Add (hopeful) error handling support Dave Jiang
  2022-10-11 14:17 ` [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling Jonathan Cameron
  9 siblings, 1 reply; 34+ messages in thread
From: Dave Jiang @ 2022-09-16 23:11 UTC (permalink / raw)
  To: linux-cxl
  Cc: alison.schofield, vishal.l.verma, bwidawsk, dan.j.williams,
	jonathan.cameron, shiju.jose, rrichter

Add tracepoint events for recording the CXL uncorrectable and correctable
errors. For uncorrectable errors, there is additional data up to 512B from
the header log register (CXL spec rev3 8.2.4.16.7). The content of the
register depends on the Uncorrectable Errors (UC) register status (CXL spec
rev3 8.2.4.16.1).  This implementation supports the Receiver_Overflow error
where the definition is defined as first 3 bits of the Header Log data. The
trace event will intake a dynamic array that will dump the Header Log data
based on error. If multiple errors are set in the status register, then the
'first error' field (CXL spec rev3 v8.2.4.16.6) is read from the Error
Capabilities and Control Register in order to determine the error.

This implementation does not include CXL IDE Error details.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/cxl/pci.c              |    2 +
 include/trace/events/cxl_ras.h |  117 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 119 insertions(+)
 create mode 100644 include/trace/events/cxl_ras.h

diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 610b3a77f205..357de704e42c 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -13,6 +13,8 @@
 #include "cxlmem.h"
 #include "cxlpci.h"
 #include "cxl.h"
+#define CREATE_TRACE_POINTS
+#include <trace/events/cxl_ras.h>
 
 /**
  * DOC: cxl pci
diff --git a/include/trace/events/cxl_ras.h b/include/trace/events/cxl_ras.h
new file mode 100644
index 000000000000..6bb41c3b87c8
--- /dev/null
+++ b/include/trace/events/cxl_ras.h
@@ -0,0 +1,117 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM cxl_ras
+
+#if !defined(_CXL_RAS_EVENTS_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _CXL_RAS_EVENTS_H
+
+#include <linux/tracepoint.h>
+
+#define CXL_RAS_UC_CACHE_DATA_PARITY	BIT(0)
+#define CXL_RAS_UC_CACHE_ADDR_PARITY	BIT(1)
+#define CXL_RAS_UC_CACHE_BE_PARITY	BIT(2)
+#define CXL_RAS_UC_CACHE_DATA_ECC	BIT(3)
+#define CXL_RAS_UC_MEM_DATA_PARITY	BIT(4)
+#define CXL_RAS_UC_MEM_ADDR_PARITY	BIT(5)
+#define CXL_RAS_UC_MEM_BE_PARITY	BIT(6)
+#define CXL_RAS_UC_MEM_DATA_ECC		BIT(7)
+#define CXL_RAS_UC_REINIT_THRESH	BIT(8)
+#define CXL_RAS_UC_RSVD_ENCODE		BIT(9)
+#define CXL_RAS_UC_POISON		BIT(10)
+#define CXL_RAS_UC_RECV_OVERFLOW	BIT(11)
+#define CXL_RAS_UC_INTERNAL_ERR		BIT(14)
+#define CXL_RAS_UC_IDE_TX_ERR		BIT(15)
+#define CXL_RAS_UC_IDE_RX_ERR		BIT(16)
+
+#define show_uc_errs(status)	__print_flags(status, " | ",		  \
+	{ CXL_RAS_UC_CACHE_DATA_PARITY, "Cache Data Parity Error" },	  \
+	{ CXL_RAS_UC_CACHE_ADDR_PARITY, "Cache Address Parity Error" },	  \
+	{ CXL_RAS_UC_CACHE_BE_PARITY, "Cache Byte Enable Parity Error" }, \
+	{ CXL_RAS_UC_CACHE_DATA_ECC, "Cache Data ECC Error" },		  \
+	{ CXL_RAS_UC_MEM_DATA_PARITY, "Memory Data Parity Error" },	  \
+	{ CXL_RAS_UC_MEM_ADDR_PARITY, "Memory Address Parity Error" },	  \
+	{ CXL_RAS_UC_MEM_BE_PARITY, "Memory Byte Enable Parity Error" },  \
+	{ CXL_RAS_UC_MEM_DATA_ECC, "Memory Data ECC Error" },		  \
+	{ CXL_RAS_UC_REINIT_THRESH, "REINIT Threshold Hit" },		  \
+	{ CXL_RAS_UC_RSVD_ENCODE, "Received Unrecognized Encoding" },	  \
+	{ CXL_RAS_UC_POISON, "Received Poison From Peer" },		  \
+	{ CXL_RAS_UC_RECV_OVERFLOW, "Receiver Overflow" },		  \
+	{ CXL_RAS_UC_INTERNAL_ERR, "Component Specific Error" },	  \
+	{ CXL_RAS_UC_IDE_TX_ERR, "IDE Tx Error" },			  \
+	{ CXL_RAS_UC_IDE_RX_ERR, "IDE Rx Error" }			  \
+)
+
+#define CXL_RAS_UC_OVFL_D2H_REQ		BIT(0)
+#define CXL_RAS_UC_OVFL_D2H_RSP		BIT(1)
+#define CXL_RAS_UC_OVFL_D2H_DATA	BIT(2)
+#define CXL_RAS_UC_OVFL_S2M_NDR		BIT(3)
+#define CXL_RAS_UC_OVFL_S2M_DRS		BIT(4)
+
+#define show_uc_ovfl(hl)	__print_flags(hl, " | ",		\
+	{ CXL_RAS_UC_OVFL_D2H_REQ, "Receiver Overflow D2H Req" },	\
+	{ CXL_RAS_UC_OVFL_D2H_RSP, "Receiver Overflow D2H Rsp" },	\
+	{ CXL_RAS_UC_OVFL_D2H_DATA, "Receiver Overflow D2H Data" },	\
+	{ CXL_RAS_UC_OVFL_S2M_NDR, "Receiver Overflow S2M NDR" },	\
+	{ CXL_RAS_UC_OVFL_S2M_DRS, "Receiver Overflow S2M DRS" }	\
+)
+
+TRACE_EVENT(cxl_ras_uc,
+	TP_PROTO(const char *dev_name, u32 status, u32 fe, u8 *hl, int hl_len),
+	TP_ARGS(dev_name, status, fe, hl, hl_len),
+	TP_STRUCT__entry(
+		__string(dev_name, dev_name)
+		__field(u32, status)
+		__field(u32, first_error)
+		__dynamic_array(u8, header_log, hl_len)
+		__field(int, header_log_len)
+	),
+	TP_fast_assign(
+		__assign_str(dev_name, dev_name);
+		__entry->status = status;
+		__entry->first_error = fe;
+		memcpy(__get_dynamic_array(header_log), hl, hl_len);
+		__entry->header_log_len = hl_len;
+	),
+	TP_printk("%s: status: '%s' first_error: '%s' header log: %s",
+		  __get_str(dev_name), show_uc_errs(__entry->status),
+		  show_uc_errs(__entry->first_error),
+		  __print_array(__get_dynamic_array(header_log), __entry->header_log_len, 1)
+	)
+);
+
+#define CXL_RAS_CE_CACHE_DATA_ECC	BIT(0)
+#define CXL_RAS_CE_MEM_DATA_ECC		BIT(1)
+#define CXL_RAS_CE_CRC_THRESH		BIT(2)
+#define CXL_RAS_CE_CACHE_POISON		BIT(3)
+#define CXL_RAS_CE_MEM_POISON		BIT(4)
+#define CXL_RAS_CE_PHYS_LAYER_ERR	BIT(5)
+
+#define show_ce_errs(status)	__print_flags(status, " | ",			\
+	{ CXL_RAS_CE_CACHE_DATA_ECC, "Cache Data ECC Error" },			\
+	{ CXL_RAS_CE_MEM_DATA_ECC, "Memory Data Ecc Error" },			\
+	{ CXL_RAS_CE_CRC_THRESH, "CRC Threshold Hit" },				\
+	{ CXL_RAS_CE_CACHE_POISON, "Received Cache Poison From Peer" },		\
+	{ CXL_RAS_CE_MEM_POISON, "Received Memory Poison From Peer" },		\
+	{ CXL_RAS_CE_PHYS_LAYER_ERR, "Received Error From Physical Layer" }	\
+)
+
+TRACE_EVENT(cxl_ras_ce,
+	TP_PROTO(const char *dev_name, u32 status),
+	TP_ARGS(dev_name, status),
+	TP_STRUCT__entry(
+		__string(dev_name, dev_name)
+		__field(u32, status)
+	),
+	TP_fast_assign(
+		__assign_str(dev_name, dev_name);
+		__entry->status = status;
+	),
+	TP_printk("%s: status: '%s'",
+		  __get_str(dev_name), show_ce_errs(__entry->status)
+	)
+);
+
+#endif /* _CXL_RAS_EVENTS_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH RFC v2 9/9] cxl/pci: Add (hopeful) error handling support
  2022-09-16 23:10 [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling Dave Jiang
                   ` (7 preceding siblings ...)
  2022-09-16 23:11 ` [PATCH RFC v2 8/9] cxl/pci: add tracepoint events for CXL RAS Dave Jiang
@ 2022-09-16 23:11 ` Dave Jiang
  2022-10-20 13:45   ` Jonathan Cameron
                     ` (2 more replies)
  2022-10-11 14:17 ` [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling Jonathan Cameron
  9 siblings, 3 replies; 34+ messages in thread
From: Dave Jiang @ 2022-09-16 23:11 UTC (permalink / raw)
  To: linux-cxl
  Cc: alison.schofield, vishal.l.verma, bwidawsk, dan.j.williams,
	jonathan.cameron, shiju.jose, rrichter

From: Dan Williams <dan.j.williams@intel.com>

Add nominal error handling that tears down CXL.mem in response to error
notifications that imply a device reset. Given some CXL.mem may be
operating as System RAM, there is a high likelihood that these error
events are fatal. However, if the system survives the notification the
expectation is that the driver behavior is equivalent to a hot-unplug
and re-plug of an endpoint.

Note that this does not change the mask values from the default. That
awaits CXL _OSC support to determine whether platform firmware is in
control of the mask registers.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/cxl/core/memdev.c |    1 
 drivers/cxl/cxl.h         |    2 +
 drivers/cxl/cxlmem.h      |    2 +
 drivers/cxl/pci.c         |  160 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 165 insertions(+)

diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index 20ce488a7754..a74a93310d26 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -344,6 +344,7 @@ struct cxl_memdev *devm_cxl_add_memdev(struct cxl_dev_state *cxlds)
 	 * needed as this is ordered with cdev_add() publishing the device.
 	 */
 	cxlmd->cxlds = cxlds;
+	cxlds->cxlmd = cxlmd;
 
 	cdev = &cxlmd->cdev;
 	rc = cdev_device_add(cdev, dev);
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index ce17ccd8b125..35434b110a3b 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -132,7 +132,9 @@ static inline int ways_to_cxl(unsigned int ways, u8 *iw)
 #define CXL_RAS_CORRECTABLE_MASK_OFFSET 0x10
 #define   CXL_RAS_CORRECTABLE_MASK_MASK GENMASK(6, 0)
 #define CXL_RAS_CAP_CONTROL_OFFSET 0x14
+#define CXL_RAS_CAP_CONTROL_FE_MASK GENMASK(5, 0)
 #define CXL_RAS_HEADER_LOG_OFFSET 0x18
+#define CXL_RAX_HEADER_LOG_SIZE 512
 #define CXL_RAS_CAPABILITY_LENGTH 0x58
 
 /* CXL 2.0 8.2.8.1 Device Capabilities Array Register */
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 88e3a8e54b6a..b3117fd67f42 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -186,6 +186,7 @@ struct cxl_endpoint_dvsec_info {
  * Currently only memory devices are represented.
  *
  * @dev: The device associated with this CXL state
+ * @cxlmd: The device representing the CXL.mem capabilities of @dev
  * @regs: Parsed register blocks
  * @cxl_dvsec: Offset to the PCIe device DVSEC
  * @payload_size: Size of space for payload
@@ -218,6 +219,7 @@ struct cxl_endpoint_dvsec_info {
  */
 struct cxl_dev_state {
 	struct device *dev;
+	struct cxl_memdev *cxlmd;
 
 	struct cxl_regs regs;
 	int cxl_dvsec;
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 357de704e42c..d51e34c4c87e 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -9,6 +9,7 @@
 #include <linux/list.h>
 #include <linux/pci.h>
 #include <linux/pci-doe.h>
+#include <linux/aer.h>
 #include <linux/io.h>
 #include "cxlmem.h"
 #include "cxlpci.h"
@@ -399,6 +400,11 @@ static void devm_cxl_pci_create_doe(struct cxl_dev_state *cxlds)
 	}
 }
 
+static void disable_aer(void *pdev)
+{
+	pci_disable_pcie_error_reporting(pdev);
+}
+
 static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	struct cxl_register_map map;
@@ -420,6 +426,7 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	cxlds = cxl_dev_state_create(&pdev->dev);
 	if (IS_ERR(cxlds))
 		return PTR_ERR(cxlds);
+	pci_set_drvdata(pdev, cxlds);
 
 	cxlds->serial = pci_get_dsn(pdev);
 	cxlds->cxl_dvsec = pci_find_dvsec_capability(
@@ -474,6 +481,14 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (IS_ERR(cxlmd))
 		return PTR_ERR(cxlmd);
 
+	if (cxlds->regs.ras) {
+		pci_enable_pcie_error_reporting(pdev);
+		rc = devm_add_action_or_reset(&pdev->dev, disable_aer, pdev);
+		if (rc)
+			return rc;
+	}
+	pci_save_state(pdev);
+
 	if (resource_size(&cxlds->pmem_res) && IS_ENABLED(CONFIG_CXL_PMEM))
 		rc = devm_cxl_add_nvdimm(&pdev->dev, cxlmd);
 
@@ -487,10 +502,155 @@ static const struct pci_device_id cxl_mem_pci_tbl[] = {
 };
 MODULE_DEVICE_TABLE(pci, cxl_mem_pci_tbl);
 
+/* CXL spec rev3.0 8.2.4.16.1 */
+#define DATA_HEADER_SIZE 16
+#define FLIT_SIZE (64 + 2)
+static int header_log_setup(struct cxl_dev_state *cxlds, u32 fe, u8 *log)
+{
+	void __iomem *addr;
+
+	addr = cxlds->regs.ras + CXL_RAS_HEADER_LOG_OFFSET;
+
+	if (fe & CXL_RAS_UC_CACHE_DATA_PARITY || fe & CXL_RAS_UC_CACHE_ADDR_PARITY ||
+	    fe & CXL_RAS_UC_CACHE_BE_PARITY || fe & CXL_RAS_UC_CACHE_DATA_ECC ||
+	    fe & CXL_RAS_UC_MEM_DATA_PARITY || fe & CXL_RAS_UC_MEM_ADDR_PARITY ||
+	    fe & CXL_RAS_UC_MEM_BE_PARITY || fe & CXL_RAS_UC_MEM_DATA_ECC) {
+		memcpy_fromio(log, addr, DATA_HEADER_SIZE);
+		return DATA_HEADER_SIZE;
+	}
+
+	if (fe & CXL_RAS_UC_RSVD_ENCODE) {
+		memcpy_fromio(log, addr, FLIT_SIZE);
+		return FLIT_SIZE;
+	}
+
+	if (fe & CXL_RAS_UC_RECV_OVERFLOW) {
+		*log = readb(addr);
+		return sizeof(u8);
+	}
+
+	return 0;
+}
+
+/*
+ * Log the state of the RAS status registers and prepare them to log the
+ * next error status. Return 1 if reset needed.
+ */
+static bool cxl_report_and_clear(struct cxl_dev_state *cxlds)
+{
+	struct cxl_memdev *cxlmd = cxlds->cxlmd;
+	struct device *dev = &cxlmd->dev;
+	void __iomem *addr;
+	u32 status;
+	bool ue = false;
+
+	if (!cxlds->regs.ras)
+		return false;
+
+	addr = cxlds->regs.ras + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
+	status = le32_to_cpu(readl(addr));
+	if (status & CXL_RAS_UNCORRECTABLE_STATUS_MASK) {
+		u8 hl[CXL_RAX_HEADER_LOG_SIZE];
+		u32 fe;
+		int size;
+
+		writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
+		ue = true;
+
+		/* If multiple errors, log header points to first error from ctrl reg */
+		if (hweight32(status) > 1) {
+			addr = cxlds->regs.ras + CXL_RAS_CAP_CONTROL_OFFSET;
+			fe = BIT(le32_to_cpu(readl(addr)) & CXL_RAS_CAP_CONTROL_FE_MASK);
+		} else {
+			fe = status;
+		}
+
+		size = header_log_setup(cxlds, fe, hl);
+		trace_cxl_ras_uc(dev_name(dev), status, fe, hl, size);
+	}
+
+	addr = cxlds->regs.ras + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
+	status = le32_to_cpu(readl(addr));
+	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
+		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
+		trace_cxl_ras_ce(dev_name(dev), status);
+	}
+
+	return ue;
+}
+
+static pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
+					   pci_channel_state_t state)
+{
+	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
+	struct cxl_memdev *cxlmd = cxlds->cxlmd;
+	struct device *dev = &cxlmd->dev;
+	bool ue;
+
+	/*
+	 * A frozen channel indicates an impending reset which is fatal to
+	 * CXL.mem operation, and will likely crash the system. On the off
+	 * chance the situation is recoverable dump the status of the RAS
+	 * capability registers and bounce the active state of the memdev.
+	 */
+	ue = cxl_report_and_clear(cxlds);
+
+	switch (state) {
+	case pci_channel_io_normal:
+		if (ue) {
+			device_release_driver(dev);
+			return PCI_ERS_RESULT_NEED_RESET;
+		}
+		return PCI_ERS_RESULT_CAN_RECOVER;
+	case pci_channel_io_frozen:
+		dev_warn(&pdev->dev,
+			 "%s: frozen state error detected, disable CXL.mem\n",
+			 dev_name(dev));
+		device_release_driver(dev);
+		return PCI_ERS_RESULT_NEED_RESET;
+	case pci_channel_io_perm_failure:
+		dev_warn(&pdev->dev,
+			 "failure state error detected, request disconnect\n");
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+	return PCI_ERS_RESULT_NEED_RESET;
+}
+
+static pci_ers_result_t cxl_slot_reset(struct pci_dev *pdev)
+{
+	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
+	struct cxl_memdev *cxlmd = cxlds->cxlmd;
+	struct device *dev = &cxlmd->dev;
+
+	dev_info(&pdev->dev, "%s: restart CXL.mem after slot reset\n",
+		 dev_name(dev));
+	pci_restore_state(pdev);
+	if (device_attach(dev) <= 0)
+		return PCI_ERS_RESULT_DISCONNECT;
+	return PCI_ERS_RESULT_RECOVERED;
+}
+
+static void cxl_error_resume(struct pci_dev *pdev)
+{
+	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
+	struct cxl_memdev *cxlmd = cxlds->cxlmd;
+	struct device *dev = &cxlmd->dev;
+
+	dev_info(&pdev->dev, "%s: error resume %s\n", dev_name(dev),
+		 dev->driver ? "successful" : "failed");
+}
+
+static const struct pci_error_handlers cxl_error_handlers = {
+	.error_detected	= cxl_error_detected,
+	.slot_reset	= cxl_slot_reset,
+	.resume		= cxl_error_resume,
+};
+
 static struct pci_driver cxl_pci_driver = {
 	.name			= KBUILD_MODNAME,
 	.id_table		= cxl_mem_pci_tbl,
 	.probe			= cxl_pci_probe,
+	.err_handler		= &cxl_error_handlers,
 	.driver	= {
 		.probe_type	= PROBE_PREFER_ASYNCHRONOUS,
 	},



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling
  2022-09-16 23:10 [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling Dave Jiang
                   ` (8 preceding siblings ...)
  2022-09-16 23:11 ` [PATCH RFC v2 9/9] cxl/pci: Add (hopeful) error handling support Dave Jiang
@ 2022-10-11 14:17 ` Jonathan Cameron
  2022-10-11 15:18   ` Dave Jiang
  9 siblings, 1 reply; 34+ messages in thread
From: Jonathan Cameron @ 2022-10-11 14:17 UTC (permalink / raw)
  To: Dave Jiang
  Cc: linux-cxl, alison.schofield, vishal.l.verma, bwidawsk,
	dan.j.williams, shiju.jose, rrichter

On Fri, 16 Sep 2022 16:10:53 -0700
Dave Jiang <dave.jiang@intel.com> wrote:

> Series set to RFC since there's no means to test. Would like to get opinion
> on whether going with using trace events as reporting mechanism is ok.
> 
> Jonathan,
> We currently don't have any ways to test AER events. Do you have any plans
> to support AER events via QEMU emulation?

Sorry - missed this entirely as gotten a bit behind reading CXL emails.

Hmm. AER brings a few complexities IIRC. Can be handled either via
native handling in the RCEC / RP, or via GHES records, GED etc.

I don't think it would be particularly hard to emulate either of them.
I have some old code for AER firmware first injection that I could
recycle for that side of things but that's probably less interesting
here than the native case.

I have a few other things to send out as WIP first, but can have
a mess around with AER paths after that.  There is some support
already in QEMU for generic AER error injection, but we will need
to build on top of that to get the rest of the status in place
before the error is generated.  See hw/pci/pcie_aer.c

Obviously if it's something you want to take a look at in QEMU that
would be great too!

Jonathan

> 
> v2:
> - Convert error reporting via printk to trace events
> - Drop ".rmap =" initialization (Jonathan)
> - return PCI_ERS_RESULT_NEED_RESET for UE in pci_channel_io_normal (Shiju)
> 
> Add a 'struct pci_error_handlers' instance for the cxl_pci driver.
> Section 8.2.4.16 "CXL RAS Capability Structure" of the CXL rev3.0
> specification defines the error sources considered in this
> implementation. The RAS Capability Structure defines protocol, link and
> internal errors which are distinct from memory poison errors that are
> conveyed via direct consumption and/or media scanning.
> 
> The errors reported by the RAS registers are categorized into
> correctable and uncorrectable errors, where the uncorrectable errors are
> optionally steered to either fatal or non-fatal AER events. Table 12-2 
> "Device Specific Error Reporting and Nomenclature Guidelines" in the CXL
> rev3.0 specification outlines that the remediation for uncorrectable errors
> is a reset to recover. This matches how the Linux PCIe AER core treats
> uncorrectable errors as occasions to reset the device to recover
> operation.
> 
> While the specification notes "CXL Reset" or "Secondary Bus Reset" as
> theoretical recovery options, they are not feasible in practice since
> in-flight CXL.mem operations may not terminate and cause knock-on system
> fatal events. Reset is only reliable for recovering CXL.io, it is not
> reliable for recovering CXL.mem. Assuming the system survives, a reset
> causes CXL.mem operation to restart from scratch.
> 
> The "ECN: Error Isolation on CXL.mem and CXL.cache" [1] document
> recognizes the CXL Reset vs CXL.mem operational conflict and helps to at
> least provide a mechanism for the Root Port to terminate in flight
> CXL.mem operations with completions. That still poses problems in
> practice if the kernel is running out of "System RAM" backed by the CXL
> device and poison is used to convey the data lost to the protocol error.
> 
> Regardless of whether the reset and restart of CXL.mem operations is
> feasible / successful, the logging is still useful. So, the
> implementation reads, reports, and clears the status in the RAS
> Capability Structure registers, and it notifies the 'struct cxl_memdev'
> associated with the given PCIe endpoint to reattach to its driver over
> the reset so that the HDM decoder configuration can be reconstructed.
> 
> The first half of the series reworks component register mapping so that
> the cxl_pci driver can own the RAS Capability while the cxl_port driver
> continues to own the HDM Decoder Capability. The last half implements
> the RAS Capability Structure mapping and reporting via 'struct
> pci_error_handlers'.
> 
> The reporting of error information is done through event tracing. A new
> cxl_ras event is introduced to report the Uncorrectable and Correctable
> errors raised by CXL. The expectation is a monitoring user daemon such as
> "cxl monitor" will harvest those events and record them in a log in a
> format (JSON) that's consumable by management applications..
> 
> [1]: https://www.computeexpresslink.org/spec-landing
> 
> ---
> 
> Dan Williams (8):
>       cxl/pci: Cleanup repeated code in cxl_probe_regs() helpers
>       cxl/pci: Cleanup cxl_map_device_regs()
>       cxl/pci: Kill cxl_map_regs()
>       cxl/core/regs: Make cxl_map_{component, device}_regs() device generic
>       cxl/port: Limit the port driver to just the HDM Decoder Capability
>       cxl/pci: Prepare for mapping RAS Capability Structure
>       cxl/pci: Find and map the RAS Capability Structure
>       cxl/pci: Add (hopeful) error handling support
> 
> Dave Jiang (1):
>       cxl/pci: add tracepoint events for CXL RAS
> 
> 
>  drivers/cxl/core/hdm.c         |  33 ++---
>  drivers/cxl/core/memdev.c      |   1 +
>  drivers/cxl/core/pci.c         |   3 +-
>  drivers/cxl/core/port.c        |   2 +-
>  drivers/cxl/core/regs.c        | 172 +++++++++++++++-----------
>  drivers/cxl/cxl.h              |  39 ++++--
>  drivers/cxl/cxlmem.h           |   2 +
>  drivers/cxl/cxlpci.h           |   9 --
>  drivers/cxl/pci.c              | 216 +++++++++++++++++++++++++++------
>  include/trace/events/cxl_ras.h | 117 ++++++++++++++++++
>  10 files changed, 445 insertions(+), 149 deletions(-)
>  create mode 100644 include/trace/events/cxl_ras.h
> 
> --
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling
  2022-10-11 14:17 ` [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling Jonathan Cameron
@ 2022-10-11 15:18   ` Dave Jiang
  2022-10-11 17:19     ` Jonathan Cameron
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Jiang @ 2022-10-11 15:18 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, alison.schofield, vishal.l.verma, bwidawsk,
	dan.j.williams, shiju.jose, rrichter


On 10/11/2022 7:17 AM, Jonathan Cameron wrote:
> On Fri, 16 Sep 2022 16:10:53 -0700
> Dave Jiang <dave.jiang@intel.com> wrote:
>
>> Series set to RFC since there's no means to test. Would like to get opinion
>> on whether going with using trace events as reporting mechanism is ok.
>>
>> Jonathan,
>> We currently don't have any ways to test AER events. Do you have any plans
>> to support AER events via QEMU emulation?
> Sorry - missed this entirely as gotten a bit behind reading CXL emails.
>
> Hmm. AER brings a few complexities IIRC. Can be handled either via
> native handling in the RCEC / RP, or via GHES records, GED etc.
>
> I don't think it would be particularly hard to emulate either of them.
> I have some old code for AER firmware first injection that I could
> recycle for that side of things but that's probably less interesting
> here than the native case.
>
> I have a few other things to send out as WIP first, but can have
> a mess around with AER paths after that.  There is some support
> already in QEMU for generic AER error injection, but we will need
> to build on top of that to get the rest of the status in place
> before the error is generated.  See hw/pci/pcie_aer.c
>
> Obviously if it's something you want to take a look at in QEMU that
> would be great too!

No worries. I know you are super busy. Before we go into injection, I 
think first step is having the CXL device advertise _OSC handover of AER 
handling? How complicated is it to have qemu advertise that?


>
> Jonathan
>
>> v2:
>> - Convert error reporting via printk to trace events
>> - Drop ".rmap =" initialization (Jonathan)
>> - return PCI_ERS_RESULT_NEED_RESET for UE in pci_channel_io_normal (Shiju)
>>
>> Add a 'struct pci_error_handlers' instance for the cxl_pci driver.
>> Section 8.2.4.16 "CXL RAS Capability Structure" of the CXL rev3.0
>> specification defines the error sources considered in this
>> implementation. The RAS Capability Structure defines protocol, link and
>> internal errors which are distinct from memory poison errors that are
>> conveyed via direct consumption and/or media scanning.
>>
>> The errors reported by the RAS registers are categorized into
>> correctable and uncorrectable errors, where the uncorrectable errors are
>> optionally steered to either fatal or non-fatal AER events. Table 12-2
>> "Device Specific Error Reporting and Nomenclature Guidelines" in the CXL
>> rev3.0 specification outlines that the remediation for uncorrectable errors
>> is a reset to recover. This matches how the Linux PCIe AER core treats
>> uncorrectable errors as occasions to reset the device to recover
>> operation.
>>
>> While the specification notes "CXL Reset" or "Secondary Bus Reset" as
>> theoretical recovery options, they are not feasible in practice since
>> in-flight CXL.mem operations may not terminate and cause knock-on system
>> fatal events. Reset is only reliable for recovering CXL.io, it is not
>> reliable for recovering CXL.mem. Assuming the system survives, a reset
>> causes CXL.mem operation to restart from scratch.
>>
>> The "ECN: Error Isolation on CXL.mem and CXL.cache" [1] document
>> recognizes the CXL Reset vs CXL.mem operational conflict and helps to at
>> least provide a mechanism for the Root Port to terminate in flight
>> CXL.mem operations with completions. That still poses problems in
>> practice if the kernel is running out of "System RAM" backed by the CXL
>> device and poison is used to convey the data lost to the protocol error.
>>
>> Regardless of whether the reset and restart of CXL.mem operations is
>> feasible / successful, the logging is still useful. So, the
>> implementation reads, reports, and clears the status in the RAS
>> Capability Structure registers, and it notifies the 'struct cxl_memdev'
>> associated with the given PCIe endpoint to reattach to its driver over
>> the reset so that the HDM decoder configuration can be reconstructed.
>>
>> The first half of the series reworks component register mapping so that
>> the cxl_pci driver can own the RAS Capability while the cxl_port driver
>> continues to own the HDM Decoder Capability. The last half implements
>> the RAS Capability Structure mapping and reporting via 'struct
>> pci_error_handlers'.
>>
>> The reporting of error information is done through event tracing. A new
>> cxl_ras event is introduced to report the Uncorrectable and Correctable
>> errors raised by CXL. The expectation is a monitoring user daemon such as
>> "cxl monitor" will harvest those events and record them in a log in a
>> format (JSON) that's consumable by management applications..
>>
>> [1]: https://www.computeexpresslink.org/spec-landing
>>
>> ---
>>
>> Dan Williams (8):
>>        cxl/pci: Cleanup repeated code in cxl_probe_regs() helpers
>>        cxl/pci: Cleanup cxl_map_device_regs()
>>        cxl/pci: Kill cxl_map_regs()
>>        cxl/core/regs: Make cxl_map_{component, device}_regs() device generic
>>        cxl/port: Limit the port driver to just the HDM Decoder Capability
>>        cxl/pci: Prepare for mapping RAS Capability Structure
>>        cxl/pci: Find and map the RAS Capability Structure
>>        cxl/pci: Add (hopeful) error handling support
>>
>> Dave Jiang (1):
>>        cxl/pci: add tracepoint events for CXL RAS
>>
>>
>>   drivers/cxl/core/hdm.c         |  33 ++---
>>   drivers/cxl/core/memdev.c      |   1 +
>>   drivers/cxl/core/pci.c         |   3 +-
>>   drivers/cxl/core/port.c        |   2 +-
>>   drivers/cxl/core/regs.c        | 172 +++++++++++++++-----------
>>   drivers/cxl/cxl.h              |  39 ++++--
>>   drivers/cxl/cxlmem.h           |   2 +
>>   drivers/cxl/cxlpci.h           |   9 --
>>   drivers/cxl/pci.c              | 216 +++++++++++++++++++++++++++------
>>   include/trace/events/cxl_ras.h | 117 ++++++++++++++++++
>>   10 files changed, 445 insertions(+), 149 deletions(-)
>>   create mode 100644 include/trace/events/cxl_ras.h
>>
>> --
>>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling
  2022-10-11 15:18   ` Dave Jiang
@ 2022-10-11 17:19     ` Jonathan Cameron
  2022-10-19 17:30       ` Jonathan Cameron
  0 siblings, 1 reply; 34+ messages in thread
From: Jonathan Cameron @ 2022-10-11 17:19 UTC (permalink / raw)
  To: Dave Jiang
  Cc: linux-cxl, alison.schofield, vishal.l.verma, bwidawsk,
	dan.j.williams, shiju.jose, rrichter

On Tue, 11 Oct 2022 08:18:34 -0700
Dave Jiang <dave.jiang@intel.com> wrote:

> On 10/11/2022 7:17 AM, Jonathan Cameron wrote:
> > On Fri, 16 Sep 2022 16:10:53 -0700
> > Dave Jiang <dave.jiang@intel.com> wrote:
> >  
> >> Series set to RFC since there's no means to test. Would like to get opinion
> >> on whether going with using trace events as reporting mechanism is ok.
> >>
> >> Jonathan,
> >> We currently don't have any ways to test AER events. Do you have any plans
> >> to support AER events via QEMU emulation?  
> > Sorry - missed this entirely as gotten a bit behind reading CXL emails.
> >
> > Hmm. AER brings a few complexities IIRC. Can be handled either via
> > native handling in the RCEC / RP, or via GHES records, GED etc.
> >
> > I don't think it would be particularly hard to emulate either of them.
> > I have some old code for AER firmware first injection that I could
> > recycle for that side of things but that's probably less interesting
> > here than the native case.
> >
> > I have a few other things to send out as WIP first, but can have
> > a mess around with AER paths after that.  There is some support
> > already in QEMU for generic AER error injection, but we will need
> > to build on top of that to get the rest of the status in place
> > before the error is generated.  See hw/pci/pcie_aer.c
> >
> > Obviously if it's something you want to take a look at in QEMU that
> > would be great too!  
> 
> No worries. I know you are super busy. Before we go into injection, I 
> think first step is having the CXL device advertise _OSC handover of AER 
> handling? How complicated is it to have qemu advertise that?
> 
Should be trivial. Actually if the comments are right we already have
it set to say the OS can take control - which is reasonable seeing as
we haven't implemented any firmware handling yet.

https://elixir.bootlin.com/qemu/latest/source/hw/acpi/cxl.c#L193

the AML in DSDT on my ARM test machine (happened to have that running)
looks plausible though my ability to parse AML is a bit limited!

Seems to mask with 0x1f which means the firmware accepts:
Native PCI express hotplug
SHPC Native hot plug control
PCI Express Native Power Management
PCI Express Advanced Error Reporting
PCI Express Capability Structure control
if the OS asks for them.

Reject OS request for higher bit controlled things like DPC.

Absolute minimum we'd then need would be AER capability for the EP.
cxl-rp needs support but I have a feeling we already got that by
inheriting from PCIE_ROOT_PORT :)
So may just need to hook up AER at the type 3 device and then figure out
the routing to fake the result of ERR_COR and the other messages.

Obviously we'd need to fill the rest of the error stuff in at
the type 3 device or (switch :) before "sending".

Jonathan

> 
> >
> > Jonathan
> >  
> >> v2:
> >> - Convert error reporting via printk to trace events
> >> - Drop ".rmap =" initialization (Jonathan)
> >> - return PCI_ERS_RESULT_NEED_RESET for UE in pci_channel_io_normal (Shiju)
> >>
> >> Add a 'struct pci_error_handlers' instance for the cxl_pci driver.
> >> Section 8.2.4.16 "CXL RAS Capability Structure" of the CXL rev3.0
> >> specification defines the error sources considered in this
> >> implementation. The RAS Capability Structure defines protocol, link and
> >> internal errors which are distinct from memory poison errors that are
> >> conveyed via direct consumption and/or media scanning.
> >>
> >> The errors reported by the RAS registers are categorized into
> >> correctable and uncorrectable errors, where the uncorrectable errors are
> >> optionally steered to either fatal or non-fatal AER events. Table 12-2
> >> "Device Specific Error Reporting and Nomenclature Guidelines" in the CXL
> >> rev3.0 specification outlines that the remediation for uncorrectable errors
> >> is a reset to recover. This matches how the Linux PCIe AER core treats
> >> uncorrectable errors as occasions to reset the device to recover
> >> operation.
> >>
> >> While the specification notes "CXL Reset" or "Secondary Bus Reset" as
> >> theoretical recovery options, they are not feasible in practice since
> >> in-flight CXL.mem operations may not terminate and cause knock-on system
> >> fatal events. Reset is only reliable for recovering CXL.io, it is not
> >> reliable for recovering CXL.mem. Assuming the system survives, a reset
> >> causes CXL.mem operation to restart from scratch.
> >>
> >> The "ECN: Error Isolation on CXL.mem and CXL.cache" [1] document
> >> recognizes the CXL Reset vs CXL.mem operational conflict and helps to at
> >> least provide a mechanism for the Root Port to terminate in flight
> >> CXL.mem operations with completions. That still poses problems in
> >> practice if the kernel is running out of "System RAM" backed by the CXL
> >> device and poison is used to convey the data lost to the protocol error.
> >>
> >> Regardless of whether the reset and restart of CXL.mem operations is
> >> feasible / successful, the logging is still useful. So, the
> >> implementation reads, reports, and clears the status in the RAS
> >> Capability Structure registers, and it notifies the 'struct cxl_memdev'
> >> associated with the given PCIe endpoint to reattach to its driver over
> >> the reset so that the HDM decoder configuration can be reconstructed.
> >>
> >> The first half of the series reworks component register mapping so that
> >> the cxl_pci driver can own the RAS Capability while the cxl_port driver
> >> continues to own the HDM Decoder Capability. The last half implements
> >> the RAS Capability Structure mapping and reporting via 'struct
> >> pci_error_handlers'.
> >>
> >> The reporting of error information is done through event tracing. A new
> >> cxl_ras event is introduced to report the Uncorrectable and Correctable
> >> errors raised by CXL. The expectation is a monitoring user daemon such as
> >> "cxl monitor" will harvest those events and record them in a log in a
> >> format (JSON) that's consumable by management applications..
> >>
> >> [1]: https://www.computeexpresslink.org/spec-landing
> >>
> >> ---
> >>
> >> Dan Williams (8):
> >>        cxl/pci: Cleanup repeated code in cxl_probe_regs() helpers
> >>        cxl/pci: Cleanup cxl_map_device_regs()
> >>        cxl/pci: Kill cxl_map_regs()
> >>        cxl/core/regs: Make cxl_map_{component, device}_regs() device generic
> >>        cxl/port: Limit the port driver to just the HDM Decoder Capability
> >>        cxl/pci: Prepare for mapping RAS Capability Structure
> >>        cxl/pci: Find and map the RAS Capability Structure
> >>        cxl/pci: Add (hopeful) error handling support
> >>
> >> Dave Jiang (1):
> >>        cxl/pci: add tracepoint events for CXL RAS
> >>
> >>
> >>   drivers/cxl/core/hdm.c         |  33 ++---
> >>   drivers/cxl/core/memdev.c      |   1 +
> >>   drivers/cxl/core/pci.c         |   3 +-
> >>   drivers/cxl/core/port.c        |   2 +-
> >>   drivers/cxl/core/regs.c        | 172 +++++++++++++++-----------
> >>   drivers/cxl/cxl.h              |  39 ++++--
> >>   drivers/cxl/cxlmem.h           |   2 +
> >>   drivers/cxl/cxlpci.h           |   9 --
> >>   drivers/cxl/pci.c              | 216 +++++++++++++++++++++++++++------
> >>   include/trace/events/cxl_ras.h | 117 ++++++++++++++++++
> >>   10 files changed, 445 insertions(+), 149 deletions(-)
> >>   create mode 100644 include/trace/events/cxl_ras.h
> >>
> >> --
> >>  
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 3/9] cxl/pci: Kill cxl_map_regs()
  2022-09-16 23:11 ` [PATCH RFC v2 3/9] cxl/pci: Kill cxl_map_regs() Dave Jiang
@ 2022-10-18 13:43   ` Jonathan Cameron
  0 siblings, 0 replies; 34+ messages in thread
From: Jonathan Cameron @ 2022-10-18 13:43 UTC (permalink / raw)
  To: Dave Jiang
  Cc: linux-cxl, alison.schofield, vishal.l.verma, bwidawsk,
	dan.j.williams, shiju.jose, rrichter

On Fri, 16 Sep 2022 16:11:11 -0700
Dave Jiang <dave.jiang@intel.com> wrote:

> From: Dan Williams <dan.j.williams@intel.com>
> 
> The component registers are currently unused by the cxl_pci driver.
> Only the physical address base of the component registers is conveyed to
> the cxl_mem driver. Just call cxl_map_device_registers() directly.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>

These first 3 impact the CPMU driver code. So I'd very much like
to have these at least upstream asap even if the rest takes a while
to follow.

Thanks,

Jonathan


> ---
>  drivers/cxl/pci.c |   23 +----------------------
>  1 file changed, 1 insertion(+), 22 deletions(-)
> 
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index faeb5d9d7a7a..82023cf0cdcf 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -347,27 +347,6 @@ static int cxl_probe_regs(struct pci_dev *pdev, struct cxl_register_map *map)
>  	return 0;
>  }
>  
> -static int cxl_map_regs(struct cxl_dev_state *cxlds, struct cxl_register_map *map)
> -{
> -	struct device *dev = cxlds->dev;
> -	struct pci_dev *pdev = to_pci_dev(dev);
> -
> -	switch (map->reg_type) {
> -	case CXL_REGLOC_RBI_COMPONENT:
> -		cxl_map_component_regs(pdev, &cxlds->regs.component, map);
> -		dev_dbg(dev, "Mapping component registers...\n");
> -		break;
> -	case CXL_REGLOC_RBI_MEMDEV:
> -		cxl_map_device_regs(pdev, &cxlds->regs.device_regs, map);
> -		dev_dbg(dev, "Probing device registers...\n");
> -		break;
> -	default:
> -		break;
> -	}
> -
> -	return 0;
> -}
> -
>  static int cxl_setup_regs(struct pci_dev *pdev, enum cxl_regloc_type type,
>  			  struct cxl_register_map *map)
>  {
> @@ -461,7 +440,7 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	if (rc)
>  		return rc;
>  
> -	rc = cxl_map_regs(cxlds, &map);
> +	rc = cxl_map_device_regs(pdev, &cxlds->regs.device_regs, &map);
>  	if (rc)
>  		return rc;
>  
> 
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling
  2022-10-11 17:19     ` Jonathan Cameron
@ 2022-10-19 17:30       ` Jonathan Cameron
  2022-10-19 17:38         ` Dave Jiang
  0 siblings, 1 reply; 34+ messages in thread
From: Jonathan Cameron @ 2022-10-19 17:30 UTC (permalink / raw)
  To: Dave Jiang
  Cc: linux-cxl, alison.schofield, vishal.l.verma, bwidawsk,
	dan.j.williams, shiju.jose, rrichter

On Tue, 11 Oct 2022 18:19:15 +0100
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:

> On Tue, 11 Oct 2022 08:18:34 -0700
> Dave Jiang <dave.jiang@intel.com> wrote:
> 
> > On 10/11/2022 7:17 AM, Jonathan Cameron wrote:  
> > > On Fri, 16 Sep 2022 16:10:53 -0700
> > > Dave Jiang <dave.jiang@intel.com> wrote:
> > >    
> > >> Series set to RFC since there's no means to test. Would like to get opinion
> > >> on whether going with using trace events as reporting mechanism is ok.
> > >>
> > >> Jonathan,
> > >> We currently don't have any ways to test AER events. Do you have any plans
> > >> to support AER events via QEMU emulation?    
> > > Sorry - missed this entirely as gotten a bit behind reading CXL emails.

Hi Dave,

Quick update.

Working QEMU emulation - but needs some/lots of cleanup. Particularly fun was
figuring out why I wasn't getting messages past the upstream switch port.
Turned out the serial number ECAP was on top of the AER ECAP. Oops - thankfully
that patch isn't upstream yet.
Also QEMU AER rooting seems to be based on some older PCIE spec
so needed some tweaks to get the device to actually issue ERR_FATAL etc.

Anyhow, should have something you can play with in a day or two.
In meantime an example dump (not writing the header log yet!)

pcieport 0000:0c:00.0: AER: Uncorrected (Non-Fatal) error received: 0000:0f:00.0
cxl_pci 0000:0f:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
cxl_pci 0000:0f:00.0:   device [8086:0d93] error status/mask=00004000/00000000
cxl_pci 0000:0f:00.0:    [14] CmpltTO                (First)
cxl_ras_uc: mem3: status: 'Cache Data Parity Error' first_error: 'Cache Data Parity Error' header log: {0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0}
cxl_pci 0000:0f:00.0: mem3: restart CXL.mem after slot reset
cxl_port endpoint6: No CMA mailbox
cxl_pci 0000:0f:00.0: mem3: error resume successful
pcieport 0000:0e:00.0: AER: device recovery successful

Jonathan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling
  2022-10-19 17:30       ` Jonathan Cameron
@ 2022-10-19 17:38         ` Dave Jiang
  2022-10-24 16:01           ` Jonathan Cameron
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Jiang @ 2022-10-19 17:38 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, alison.schofield, vishal.l.verma, bwidawsk,
	dan.j.williams, shiju.jose, rrichter


On 10/19/2022 10:30 AM, Jonathan Cameron wrote:
> On Tue, 11 Oct 2022 18:19:15 +0100
> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
>
>> On Tue, 11 Oct 2022 08:18:34 -0700
>> Dave Jiang <dave.jiang@intel.com> wrote:
>>
>>> On 10/11/2022 7:17 AM, Jonathan Cameron wrote:
>>>> On Fri, 16 Sep 2022 16:10:53 -0700
>>>> Dave Jiang <dave.jiang@intel.com> wrote:
>>>>     
>>>>> Series set to RFC since there's no means to test. Would like to get opinion
>>>>> on whether going with using trace events as reporting mechanism is ok.
>>>>>
>>>>> Jonathan,
>>>>> We currently don't have any ways to test AER events. Do you have any plans
>>>>> to support AER events via QEMU emulation?
>>>> Sorry - missed this entirely as gotten a bit behind reading CXL emails.
> Hi Dave,
>
> Quick update.
>
> Working QEMU emulation - but needs some/lots of cleanup. Particularly fun was
> figuring out why I wasn't getting messages past the upstream switch port.
> Turned out the serial number ECAP was on top of the AER ECAP. Oops - thankfully
> that patch isn't upstream yet.
> Also QEMU AER rooting seems to be based on some older PCIE spec
> so needed some tweaks to get the device to actually issue ERR_FATAL etc.
>
> Anyhow, should have something you can play with in a day or two.

Awesome! Thanks! :)


> In meantime an example dump (not writing the header log yet!)
>
> pcieport 0000:0c:00.0: AER: Uncorrected (Non-Fatal) error received: 0000:0f:00.0
> cxl_pci 0000:0f:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> cxl_pci 0000:0f:00.0:   device [8086:0d93] error status/mask=00004000/00000000
> cxl_pci 0000:0f:00.0:    [14] CmpltTO                (First)
> cxl_ras_uc: mem3: status: 'Cache Data Parity Error' first_error: 'Cache Data Parity Error' header log: {0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0}
> cxl_pci 0000:0f:00.0: mem3: restart CXL.mem after slot reset
> cxl_port endpoint6: No CMA mailbox
> cxl_pci 0000:0f:00.0: mem3: error resume successful
> pcieport 0000:0e:00.0: AER: device recovery successful
>
> Jonathan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 9/9] cxl/pci: Add (hopeful) error handling support
  2022-09-16 23:11 ` [PATCH RFC v2 9/9] cxl/pci: Add (hopeful) error handling support Dave Jiang
@ 2022-10-20 13:45   ` Jonathan Cameron
  2022-10-20 14:50     ` Dave Jiang
  2022-10-20 14:03   ` Jonathan Cameron
  2022-10-20 15:52   ` Jonathan Cameron
  2 siblings, 1 reply; 34+ messages in thread
From: Jonathan Cameron @ 2022-10-20 13:45 UTC (permalink / raw)
  To: Dave Jiang
  Cc: linux-cxl, alison.schofield, vishal.l.verma, bwidawsk,
	dan.j.williams, shiju.jose, rrichter

On Fri, 16 Sep 2022 16:11:45 -0700
Dave Jiang <dave.jiang@intel.com> wrote:

> From: Dan Williams <dan.j.williams@intel.com>
> 
> Add nominal error handling that tears down CXL.mem in response to error
> notifications that imply a device reset. Given some CXL.mem may be
> operating as System RAM, there is a high likelihood that these error
> events are fatal. However, if the system survives the notification the
> expectation is that the driver behavior is equivalent to a hot-unplug
> and re-plug of an endpoint.
> 
> Note that this does not change the mask values from the default. That
> awaits CXL _OSC support to determine whether platform firmware is in
> control of the mask registers.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>

...

> +/* CXL spec rev3.0 8.2.4.16.1 */
> +#define DATA_HEADER_SIZE 16
> +#define FLIT_SIZE (64 + 2)
> +static int header_log_setup(struct cxl_dev_state *cxlds, u32 fe, u8 *log)
> +{
> +	void __iomem *addr;
> +
> +	addr = cxlds->regs.ras + CXL_RAS_HEADER_LOG_OFFSET;
> +
> +	if (fe & CXL_RAS_UC_CACHE_DATA_PARITY || fe & CXL_RAS_UC_CACHE_ADDR_PARITY ||
> +	    fe & CXL_RAS_UC_CACHE_BE_PARITY || fe & CXL_RAS_UC_CACHE_DATA_ECC ||
> +	    fe & CXL_RAS_UC_MEM_DATA_PARITY || fe & CXL_RAS_UC_MEM_ADDR_PARITY ||
> +	    fe & CXL_RAS_UC_MEM_BE_PARITY || fe & CXL_RAS_UC_MEM_DATA_ECC) {
> +		memcpy_fromio(log, addr, DATA_HEADER_SIZE);
I'd forgotten his gremlin.

You can't use memcpy_fromio() because on some architectures it will issue 8 byte
reads and 8.2.4.16.7 states that the log shall be accessed as aligned 4-byte
quantities.

> +		return DATA_HEADER_SIZE;
> +	}
> +
> +	if (fe & CXL_RAS_UC_RSVD_ENCODE) {
> +		memcpy_fromio(log, addr, FLIT_SIZE);
> +		return FLIT_SIZE;
> +	}
> +
> +	if (fe & CXL_RAS_UC_RECV_OVERFLOW) {
> +		*log = readb(addr);
Also not valid for same reason.  Do a 32bit read and mask out the bottom byte.

That was a pain to track down (and worst of all we hit the same thing for another
bit of CXL last year I'd forgotten about it :(

> +		return sizeof(u8);
> +	}
> +
> +	return 0;
> +}
> +

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 9/9] cxl/pci: Add (hopeful) error handling support
  2022-09-16 23:11 ` [PATCH RFC v2 9/9] cxl/pci: Add (hopeful) error handling support Dave Jiang
  2022-10-20 13:45   ` Jonathan Cameron
@ 2022-10-20 14:03   ` Jonathan Cameron
  2022-10-20 14:57     ` Dave Jiang
  2022-10-20 15:52   ` Jonathan Cameron
  2 siblings, 1 reply; 34+ messages in thread
From: Jonathan Cameron @ 2022-10-20 14:03 UTC (permalink / raw)
  To: Dave Jiang
  Cc: linux-cxl, alison.schofield, vishal.l.verma, bwidawsk,
	dan.j.williams, shiju.jose, rrichter

On Fri, 16 Sep 2022 16:11:45 -0700
Dave Jiang <dave.jiang@intel.com> wrote:

> From: Dan Williams <dan.j.williams@intel.com>
> 
> Add nominal error handling that tears down CXL.mem in response to error
> notifications that imply a device reset. Given some CXL.mem may be
> operating as System RAM, there is a high likelihood that these error
> events are fatal. However, if the system survives the notification the
> expectation is that the driver behavior is equivalent to a hot-unplug
> and re-plug of an endpoint.
> 
> Note that this does not change the mask values from the default. That
> awaits CXL _OSC support to determine whether platform firmware is in
> control of the mask registers.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> ---

>  
> +/* CXL spec rev3.0 8.2.4.16.1 */
> +#define DATA_HEADER_SIZE 16

I'm not immediately seeing a spec justification for these sizes.
The table refes to containing H2D or D2H headers. 
Jumping back to 3.2.3.3 D2H Data
The D2H Data Header is between 17 and 24 bits (assuming PBR irrelevant here)
H2D header is 24 to 28 bits.

So where does 16 bytes come from?  I'd be tempted to just spit out the whole
512 bit register in 32 bit chunks and leave interpretation of it to userspace.


> +#define FLIT_SIZE (64 + 2)
> +static int header_log_setup(struct cxl_dev_state *cxlds, u32 fe, u8 *log)
> +{
> +	void __iomem *addr;
> +
> +	addr = cxlds->regs.ras + CXL_RAS_HEADER_LOG_OFFSET;
> +
> +	if (fe & CXL_RAS_UC_CACHE_DATA_PARITY || fe & CXL_RAS_UC_CACHE_ADDR_PARITY ||
> +	    fe & CXL_RAS_UC_CACHE_BE_PARITY || fe & CXL_RAS_UC_CACHE_DATA_ECC ||
> +	    fe & CXL_RAS_UC_MEM_DATA_PARITY || fe & CXL_RAS_UC_MEM_ADDR_PARITY ||
> +	    fe & CXL_RAS_UC_MEM_BE_PARITY || fe & CXL_RAS_UC_MEM_DATA_ECC) {
> +		memcpy_fromio(log, addr, DATA_HEADER_SIZE);
> +		return DATA_HEADER_SIZE;
> +	}
> +
> +	if (fe & CXL_RAS_UC_RSVD_ENCODE) {
> +		memcpy_fromio(log, addr, FLIT_SIZE);
> +		return FLIT_SIZE;
> +	}
> +
> +	if (fe & CXL_RAS_UC_RECV_OVERFLOW) {
> +		*log = readb(addr);
> +		return sizeof(u8);
> +	}
> +
> +	return 0;
> +}
> +

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 9/9] cxl/pci: Add (hopeful) error handling support
  2022-10-20 13:45   ` Jonathan Cameron
@ 2022-10-20 14:50     ` Dave Jiang
  0 siblings, 0 replies; 34+ messages in thread
From: Dave Jiang @ 2022-10-20 14:50 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, alison.schofield, vishal.l.verma, bwidawsk,
	dan.j.williams, shiju.jose, rrichter


On 10/20/2022 6:45 AM, Jonathan Cameron wrote:
> On Fri, 16 Sep 2022 16:11:45 -0700
> Dave Jiang <dave.jiang@intel.com> wrote:
>
>> From: Dan Williams <dan.j.williams@intel.com>
>>
>> Add nominal error handling that tears down CXL.mem in response to error
>> notifications that imply a device reset. Given some CXL.mem may be
>> operating as System RAM, there is a high likelihood that these error
>> events are fatal. However, if the system survives the notification the
>> expectation is that the driver behavior is equivalent to a hot-unplug
>> and re-plug of an endpoint.
>>
>> Note that this does not change the mask values from the default. That
>> awaits CXL _OSC support to determine whether platform firmware is in
>> control of the mask registers.
>>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> ...
>
>> +/* CXL spec rev3.0 8.2.4.16.1 */
>> +#define DATA_HEADER_SIZE 16
>> +#define FLIT_SIZE (64 + 2)
>> +static int header_log_setup(struct cxl_dev_state *cxlds, u32 fe, u8 *log)
>> +{
>> +	void __iomem *addr;
>> +
>> +	addr = cxlds->regs.ras + CXL_RAS_HEADER_LOG_OFFSET;
>> +
>> +	if (fe & CXL_RAS_UC_CACHE_DATA_PARITY || fe & CXL_RAS_UC_CACHE_ADDR_PARITY ||
>> +	    fe & CXL_RAS_UC_CACHE_BE_PARITY || fe & CXL_RAS_UC_CACHE_DATA_ECC ||
>> +	    fe & CXL_RAS_UC_MEM_DATA_PARITY || fe & CXL_RAS_UC_MEM_ADDR_PARITY ||
>> +	    fe & CXL_RAS_UC_MEM_BE_PARITY || fe & CXL_RAS_UC_MEM_DATA_ECC) {
>> +		memcpy_fromio(log, addr, DATA_HEADER_SIZE);
> I'd forgotten his gremlin.
>
> You can't use memcpy_fromio() because on some architectures it will issue 8 byte
> reads and 8.2.4.16.7 states that the log shall be accessed as aligned 4-byte
> quantities.
Ok I'll fix.
>
>> +		return DATA_HEADER_SIZE;
>> +	}
>> +
>> +	if (fe & CXL_RAS_UC_RSVD_ENCODE) {
>> +		memcpy_fromio(log, addr, FLIT_SIZE);
>> +		return FLIT_SIZE;
>> +	}
>> +
>> +	if (fe & CXL_RAS_UC_RECV_OVERFLOW) {
>> +		*log = readb(addr);
> Also not valid for same reason.  Do a 32bit read and mask out the bottom byte.

Will fix. Thanks.


>
> That was a pain to track down (and worst of all we hit the same thing for another
> bit of CXL last year I'd forgotten about it :(
>
>> +		return sizeof(u8);
>> +	}
>> +
>> +	return 0;
>> +}
>> +

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 9/9] cxl/pci: Add (hopeful) error handling support
  2022-10-20 14:03   ` Jonathan Cameron
@ 2022-10-20 14:57     ` Dave Jiang
  0 siblings, 0 replies; 34+ messages in thread
From: Dave Jiang @ 2022-10-20 14:57 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, alison.schofield, vishal.l.verma, bwidawsk,
	dan.j.williams, shiju.jose, rrichter


On 10/20/2022 7:03 AM, Jonathan Cameron wrote:
> On Fri, 16 Sep 2022 16:11:45 -0700
> Dave Jiang <dave.jiang@intel.com> wrote:
>
>> From: Dan Williams <dan.j.williams@intel.com>
>>
>> Add nominal error handling that tears down CXL.mem in response to error
>> notifications that imply a device reset. Given some CXL.mem may be
>> operating as System RAM, there is a high likelihood that these error
>> events are fatal. However, if the system survives the notification the
>> expectation is that the driver behavior is equivalent to a hot-unplug
>> and re-plug of an endpoint.
>>
>> Note that this does not change the mask values from the default. That
>> awaits CXL _OSC support to determine whether platform firmware is in
>> control of the mask registers.
>>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
>> ---
>>   
>> +/* CXL spec rev3.0 8.2.4.16.1 */
>> +#define DATA_HEADER_SIZE 16
> I'm not immediately seeing a spec justification for these sizes.
> The table refes to containing H2D or D2H headers.
> Jumping back to 3.2.3.3 D2H Data
> The D2H Data Header is between 17 and 24 bits (assuming PBR irrelevant here)
> H2D header is 24 to 28 bits.
>
> So where does 16 bytes come from?  I'd be tempted to just spit out the whole
> 512 bit register in 32 bit chunks and leave interpretation of it to userspace.

Fair enough. That would make the kernel code simpler.


>
>
>> +#define FLIT_SIZE (64 + 2)
>> +static int header_log_setup(struct cxl_dev_state *cxlds, u32 fe, u8 *log)
>> +{
>> +	void __iomem *addr;
>> +
>> +	addr = cxlds->regs.ras + CXL_RAS_HEADER_LOG_OFFSET;
>> +
>> +	if (fe & CXL_RAS_UC_CACHE_DATA_PARITY || fe & CXL_RAS_UC_CACHE_ADDR_PARITY ||
>> +	    fe & CXL_RAS_UC_CACHE_BE_PARITY || fe & CXL_RAS_UC_CACHE_DATA_ECC ||
>> +	    fe & CXL_RAS_UC_MEM_DATA_PARITY || fe & CXL_RAS_UC_MEM_ADDR_PARITY ||
>> +	    fe & CXL_RAS_UC_MEM_BE_PARITY || fe & CXL_RAS_UC_MEM_DATA_ECC) {
>> +		memcpy_fromio(log, addr, DATA_HEADER_SIZE);
>> +		return DATA_HEADER_SIZE;
>> +	}
>> +
>> +	if (fe & CXL_RAS_UC_RSVD_ENCODE) {
>> +		memcpy_fromio(log, addr, FLIT_SIZE);
>> +		return FLIT_SIZE;
>> +	}
>> +
>> +	if (fe & CXL_RAS_UC_RECV_OVERFLOW) {
>> +		*log = readb(addr);
>> +		return sizeof(u8);
>> +	}
>> +
>> +	return 0;
>> +}
>> +

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 9/9] cxl/pci: Add (hopeful) error handling support
  2022-09-16 23:11 ` [PATCH RFC v2 9/9] cxl/pci: Add (hopeful) error handling support Dave Jiang
  2022-10-20 13:45   ` Jonathan Cameron
  2022-10-20 14:03   ` Jonathan Cameron
@ 2022-10-20 15:52   ` Jonathan Cameron
  2022-10-20 16:06     ` Dave Jiang
  2 siblings, 1 reply; 34+ messages in thread
From: Jonathan Cameron @ 2022-10-20 15:52 UTC (permalink / raw)
  To: Dave Jiang
  Cc: linux-cxl, alison.schofield, vishal.l.verma, bwidawsk,
	dan.j.williams, shiju.jose, rrichter

On Fri, 16 Sep 2022 16:11:45 -0700
Dave Jiang <dave.jiang@intel.com> wrote:

> From: Dan Williams <dan.j.williams@intel.com>
> 
> Add nominal error handling that tears down CXL.mem in response to error
> notifications that imply a device reset. Given some CXL.mem may be
> operating as System RAM, there is a high likelihood that these error
> events are fatal. However, if the system survives the notification the
> expectation is that the driver behavior is equivalent to a hot-unplug
> and re-plug of an endpoint.
> 
> Note that this does not change the mask values from the default. That
> awaits CXL _OSC support to determine whether platform firmware is in
> control of the mask registers.

Hi Dave,

So I just implemented correctable error reporting and it never gets
to the handling in here.  My perhaps wrong assumption is that the
device would use ERR_COR messages to indicate those?

They get to the AER handlers (which print appropriately) but because
they have been corrected are never reported to the PCIe drivers.

https://elixir.bootlin.com/linux/latest/source/drivers/pci/pcie/aer.c#L956

I guess we will want a hook for those as well so we can log the
extra info on what the error was when they occur.

Jonathan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 9/9] cxl/pci: Add (hopeful) error handling support
  2022-10-20 15:52   ` Jonathan Cameron
@ 2022-10-20 16:06     ` Dave Jiang
  2022-10-20 16:11       ` Jonathan Cameron
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Jiang @ 2022-10-20 16:06 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, alison.schofield, vishal.l.verma, bwidawsk,
	dan.j.williams, shiju.jose, rrichter


On 10/20/2022 8:52 AM, Jonathan Cameron wrote:
> On Fri, 16 Sep 2022 16:11:45 -0700
> Dave Jiang <dave.jiang@intel.com> wrote:
>
>> From: Dan Williams <dan.j.williams@intel.com>
>>
>> Add nominal error handling that tears down CXL.mem in response to error
>> notifications that imply a device reset. Given some CXL.mem may be
>> operating as System RAM, there is a high likelihood that these error
>> events are fatal. However, if the system survives the notification the
>> expectation is that the driver behavior is equivalent to a hot-unplug
>> and re-plug of an endpoint.
>>
>> Note that this does not change the mask values from the default. That
>> awaits CXL _OSC support to determine whether platform firmware is in
>> control of the mask registers.
> Hi Dave,
>
> So I just implemented correctable error reporting and it never gets
> to the handling in here.  My perhaps wrong assumption is that the
> device would use ERR_COR messages to indicate those?
>
> They get to the AER handlers (which print appropriately) but because
> they have been corrected are never reported to the PCIe drivers.
>
> https://elixir.bootlin.com/linux/latest/source/drivers/pci/pcie/aer.c#L956
>
> I guess we will want a hook for those as well so we can log the
> extra info on what the error was when they occur.

Are you suggesting having that function call pdrv->err_handler with 
either going through ->error_detected() or a new callback like 
->correctable_error_notify()?


>
> Jonathan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 9/9] cxl/pci: Add (hopeful) error handling support
  2022-10-20 16:06     ` Dave Jiang
@ 2022-10-20 16:11       ` Jonathan Cameron
  0 siblings, 0 replies; 34+ messages in thread
From: Jonathan Cameron @ 2022-10-20 16:11 UTC (permalink / raw)
  To: Dave Jiang
  Cc: linux-cxl, alison.schofield, vishal.l.verma, bwidawsk,
	dan.j.williams, shiju.jose, rrichter

On Thu, 20 Oct 2022 09:06:50 -0700
Dave Jiang <dave.jiang@intel.com> wrote:

> On 10/20/2022 8:52 AM, Jonathan Cameron wrote:
> > On Fri, 16 Sep 2022 16:11:45 -0700
> > Dave Jiang <dave.jiang@intel.com> wrote:
> >  
> >> From: Dan Williams <dan.j.williams@intel.com>
> >>
> >> Add nominal error handling that tears down CXL.mem in response to error
> >> notifications that imply a device reset. Given some CXL.mem may be
> >> operating as System RAM, there is a high likelihood that these error
> >> events are fatal. However, if the system survives the notification the
> >> expectation is that the driver behavior is equivalent to a hot-unplug
> >> and re-plug of an endpoint.
> >>
> >> Note that this does not change the mask values from the default. That
> >> awaits CXL _OSC support to determine whether platform firmware is in
> >> control of the mask registers.  
> > Hi Dave,
> >
> > So I just implemented correctable error reporting and it never gets
> > to the handling in here.  My perhaps wrong assumption is that the
> > device would use ERR_COR messages to indicate those?
> >
> > They get to the AER handlers (which print appropriately) but because
> > they have been corrected are never reported to the PCIe drivers.
> >
> > https://elixir.bootlin.com/linux/latest/source/drivers/pci/pcie/aer.c#L956
> >
> > I guess we will want a hook for those as well so we can log the
> > extra info on what the error was when they occur.  
> 
> Are you suggesting having that function call pdrv->err_handler with 
> either going through ->error_detected() or a new callback like 
> ->correctable_error_notify()?  

Probably a new callback to avoid any side effects.  Probably a question to
ask on linux-pci...

J

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 5/9] cxl/port: Limit the port driver to just the HDM Decoder Capability
  2022-09-16 23:11 ` [PATCH RFC v2 5/9] cxl/port: Limit the port driver to just the HDM Decoder Capability Dave Jiang
@ 2022-10-20 16:54   ` Jonathan Cameron
  0 siblings, 0 replies; 34+ messages in thread
From: Jonathan Cameron @ 2022-10-20 16:54 UTC (permalink / raw)
  To: Dave Jiang
  Cc: linux-cxl, alison.schofield, vishal.l.verma, bwidawsk,
	dan.j.williams, shiju.jose, rrichter

On Fri, 16 Sep 2022 16:11:22 -0700
Dave Jiang <dave.jiang@intel.com> wrote:

> From: Dan Williams <dan.j.williams@intel.com>
> 
> Update the port driver to use cxl_map_component_registers() so that the
> component register block can be shared between the cxl_pci driver and
> the cxl_port driver. I.e. stop the port driver from reserving the entire
> component register block for itself via request_region() when it only
> needs the HDM Decoder Capability subset.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>

Makes sense
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

> ---
>  drivers/cxl/core/hdm.c |   32 ++++++++++++++++++--------------
>  1 file changed, 18 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
> index d1d2caea5c62..061551148cfe 100644
> --- a/drivers/cxl/core/hdm.c
> +++ b/drivers/cxl/core/hdm.c
> @@ -82,18 +82,22 @@ static void parse_hdm_decoder_caps(struct cxl_hdm *cxlhdm)
>  		cxlhdm->interleave_mask |= GENMASK(14, 12);
>  }
>  
> -static void __iomem *map_hdm_decoder_regs(struct cxl_port *port,
> -					  void __iomem *crb)
> +static int map_hdm_decoder_regs(struct cxl_port *port, void __iomem *crb,
> +				struct cxl_component_regs *regs)
>  {
> -	struct cxl_component_reg_map map;
> +	struct cxl_register_map map = {
> +		.resource = port->component_reg_phys,
> +		.base = crb,
> +		.max_size = CXL_COMPONENT_REG_BLOCK_SIZE,
> +	};
>  
> -	cxl_probe_component_regs(&port->dev, crb, &map);
> -	if (!map.hdm_decoder.valid) {
> +	cxl_probe_component_regs(&port->dev, crb, &map.component_map);
> +	if (!map.component_map.hdm_decoder.valid) {
>  		dev_err(&port->dev, "HDM decoder registers invalid\n");
> -		return IOMEM_ERR_PTR(-ENXIO);
> +		return -ENXIO;
>  	}
>  
> -	return crb + map.hdm_decoder.offset;
> +	return cxl_map_component_regs(&port->dev, regs, &map);
>  }
>  
>  /**
> @@ -103,25 +107,25 @@ static void __iomem *map_hdm_decoder_regs(struct cxl_port *port,
>  struct cxl_hdm *devm_cxl_setup_hdm(struct cxl_port *port)
>  {
>  	struct device *dev = &port->dev;
> -	void __iomem *crb, *hdm;
>  	struct cxl_hdm *cxlhdm;
> +	void __iomem *crb;
> +	int rc;
>  
>  	cxlhdm = devm_kzalloc(dev, sizeof(*cxlhdm), GFP_KERNEL);
>  	if (!cxlhdm)
>  		return ERR_PTR(-ENOMEM);
>  
>  	cxlhdm->port = port;
> -	crb = devm_cxl_iomap_block(dev, port->component_reg_phys,
> -				   CXL_COMPONENT_REG_BLOCK_SIZE);
> +	crb = ioremap(port->component_reg_phys, CXL_COMPONENT_REG_BLOCK_SIZE);
>  	if (!crb) {
>  		dev_err(dev, "No component registers mapped\n");
>  		return ERR_PTR(-ENXIO);
>  	}
>  
> -	hdm = map_hdm_decoder_regs(port, crb);
> -	if (IS_ERR(hdm))
> -		return ERR_CAST(hdm);
> -	cxlhdm->regs.hdm_decoder = hdm;
> +	rc = map_hdm_decoder_regs(port, crb, &cxlhdm->regs);
> +	iounmap(crb);
> +	if (rc)
> +		return ERR_PTR(rc);
>  
>  	parse_hdm_decoder_caps(cxlhdm);
>  	if (cxlhdm->decoder_count == 0) {
> 
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 8/9] cxl/pci: add tracepoint events for CXL RAS
  2022-09-16 23:11 ` [PATCH RFC v2 8/9] cxl/pci: add tracepoint events for CXL RAS Dave Jiang
@ 2022-10-20 17:02   ` Jonathan Cameron
  2022-10-20 17:07     ` Dave Jiang
  0 siblings, 1 reply; 34+ messages in thread
From: Jonathan Cameron @ 2022-10-20 17:02 UTC (permalink / raw)
  To: Dave Jiang
  Cc: linux-cxl, alison.schofield, vishal.l.verma, bwidawsk,
	dan.j.williams, shiju.jose, rrichter, Steven Rostedt

On Fri, 16 Sep 2022 16:11:39 -0700
Dave Jiang <dave.jiang@intel.com> wrote:

> Add tracepoint events for recording the CXL uncorrectable and correctable
> errors. For uncorrectable errors, there is additional data up to 512B from
> the header log register (CXL spec rev3 8.2.4.16.7). The content of the
> register depends on the Uncorrectable Errors (UC) register status (CXL spec
> rev3 8.2.4.16.1).  This implementation supports the Receiver_Overflow error
> where the definition is defined as first 3 bits of the Header Log data. The
> trace event will intake a dynamic array that will dump the Header Log data
> based on error. If multiple errors are set in the status register, then the
> 'first error' field (CXL spec rev3 v8.2.4.16.6) is read from the Error
> Capabilities and Control Register in order to determine the error.
> 
> This implementation does not include CXL IDE Error details.
> 
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>

Given the useful review we've gotten on other race points seems wise to
Cc Stephen.

The overflow flags seems to be inconsistent wrt to spec.  They aren't flags
as such in the spec.  However I note that isn't used anyway so maybe drop
it for now?

Jonathan

> ---
>  drivers/cxl/pci.c              |    2 +
>  include/trace/events/cxl_ras.h |  117 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 119 insertions(+)
>  create mode 100644 include/trace/events/cxl_ras.h
> 
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 610b3a77f205..357de704e42c 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -13,6 +13,8 @@
>  #include "cxlmem.h"
>  #include "cxlpci.h"
>  #include "cxl.h"
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/cxl_ras.h>
>  
>  /**
>   * DOC: cxl pci
> diff --git a/include/trace/events/cxl_ras.h b/include/trace/events/cxl_ras.h
> new file mode 100644
> index 000000000000..6bb41c3b87c8
> --- /dev/null
> +++ b/include/trace/events/cxl_ras.h
> @@ -0,0 +1,117 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM cxl_ras
> +
> +#if !defined(_CXL_RAS_EVENTS_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _CXL_RAS_EVENTS_H
> +
> +#include <linux/tracepoint.h>
> +
> +#define CXL_RAS_UC_CACHE_DATA_PARITY	BIT(0)
> +#define CXL_RAS_UC_CACHE_ADDR_PARITY	BIT(1)
> +#define CXL_RAS_UC_CACHE_BE_PARITY	BIT(2)
> +#define CXL_RAS_UC_CACHE_DATA_ECC	BIT(3)
> +#define CXL_RAS_UC_MEM_DATA_PARITY	BIT(4)
> +#define CXL_RAS_UC_MEM_ADDR_PARITY	BIT(5)
> +#define CXL_RAS_UC_MEM_BE_PARITY	BIT(6)
> +#define CXL_RAS_UC_MEM_DATA_ECC		BIT(7)
> +#define CXL_RAS_UC_REINIT_THRESH	BIT(8)
> +#define CXL_RAS_UC_RSVD_ENCODE		BIT(9)
> +#define CXL_RAS_UC_POISON		BIT(10)
> +#define CXL_RAS_UC_RECV_OVERFLOW	BIT(11)
> +#define CXL_RAS_UC_INTERNAL_ERR		BIT(14)
> +#define CXL_RAS_UC_IDE_TX_ERR		BIT(15)
> +#define CXL_RAS_UC_IDE_RX_ERR		BIT(16)
> +
> +#define show_uc_errs(status)	__print_flags(status, " | ",		  \
> +	{ CXL_RAS_UC_CACHE_DATA_PARITY, "Cache Data Parity Error" },	  \
> +	{ CXL_RAS_UC_CACHE_ADDR_PARITY, "Cache Address Parity Error" },	  \
> +	{ CXL_RAS_UC_CACHE_BE_PARITY, "Cache Byte Enable Parity Error" }, \
> +	{ CXL_RAS_UC_CACHE_DATA_ECC, "Cache Data ECC Error" },		  \
> +	{ CXL_RAS_UC_MEM_DATA_PARITY, "Memory Data Parity Error" },	  \
> +	{ CXL_RAS_UC_MEM_ADDR_PARITY, "Memory Address Parity Error" },	  \
> +	{ CXL_RAS_UC_MEM_BE_PARITY, "Memory Byte Enable Parity Error" },  \
> +	{ CXL_RAS_UC_MEM_DATA_ECC, "Memory Data ECC Error" },		  \
> +	{ CXL_RAS_UC_REINIT_THRESH, "REINIT Threshold Hit" },		  \
> +	{ CXL_RAS_UC_RSVD_ENCODE, "Received Unrecognized Encoding" },	  \
> +	{ CXL_RAS_UC_POISON, "Received Poison From Peer" },		  \
> +	{ CXL_RAS_UC_RECV_OVERFLOW, "Receiver Overflow" },		  \
> +	{ CXL_RAS_UC_INTERNAL_ERR, "Component Specific Error" },	  \
> +	{ CXL_RAS_UC_IDE_TX_ERR, "IDE Tx Error" },			  \
> +	{ CXL_RAS_UC_IDE_RX_ERR, "IDE Rx Error" }			  \
> +)
> +
> +#define CXL_RAS_UC_OVFL_D2H_REQ		BIT(0)
> +#define CXL_RAS_UC_OVFL_D2H_RSP		BIT(1)
> +#define CXL_RAS_UC_OVFL_D2H_DATA	BIT(2)
> +#define CXL_RAS_UC_OVFL_S2M_NDR		BIT(3)
> +#define CXL_RAS_UC_OVFL_S2M_DRS		BIT(4)

Why not align these with the values in the spec? 
They aren't flags as such...  Mind you not used anyway so probably just
drop it for now?

> +
> +#define show_uc_ovfl(hl)	__print_flags(hl, " | ",		\
> +	{ CXL_RAS_UC_OVFL_D2H_REQ, "Receiver Overflow D2H Req" },	\
> +	{ CXL_RAS_UC_OVFL_D2H_RSP, "Receiver Overflow D2H Rsp" },	\
> +	{ CXL_RAS_UC_OVFL_D2H_DATA, "Receiver Overflow D2H Data" },	\
> +	{ CXL_RAS_UC_OVFL_S2M_NDR, "Receiver Overflow S2M NDR" },	\
> +	{ CXL_RAS_UC_OVFL_S2M_DRS, "Receiver Overflow S2M DRS" }	\
> +)
> +
> +TRACE_EVENT(cxl_ras_uc,
> +	TP_PROTO(const char *dev_name, u32 status, u32 fe, u8 *hl, int hl_len),
> +	TP_ARGS(dev_name, status, fe, hl, hl_len),
> +	TP_STRUCT__entry(
> +		__string(dev_name, dev_name)
> +		__field(u32, status)
> +		__field(u32, first_error)
> +		__dynamic_array(u8, header_log, hl_len)
> +		__field(int, header_log_len)
> +	),
> +	TP_fast_assign(
> +		__assign_str(dev_name, dev_name);
> +		__entry->status = status;
> +		__entry->first_error = fe;
> +		memcpy(__get_dynamic_array(header_log), hl, hl_len);
> +		__entry->header_log_len = hl_len;
> +	),
> +	TP_printk("%s: status: '%s' first_error: '%s' header log: %s",
> +		  __get_str(dev_name), show_uc_errs(__entry->status),
> +		  show_uc_errs(__entry->first_error),
> +		  __print_array(__get_dynamic_array(header_log), __entry->header_log_len, 1)
> +	)
> +);
> +
> +#define CXL_RAS_CE_CACHE_DATA_ECC	BIT(0)
> +#define CXL_RAS_CE_MEM_DATA_ECC		BIT(1)
> +#define CXL_RAS_CE_CRC_THRESH		BIT(2)
> +#define CXL_RAS_CE_CACHE_POISON		BIT(3)
> +#define CXL_RAS_CE_MEM_POISON		BIT(4)
> +#define CXL_RAS_CE_PHYS_LAYER_ERR	BIT(5)
> +
> +#define show_ce_errs(status)	__print_flags(status, " | ",			\
> +	{ CXL_RAS_CE_CACHE_DATA_ECC, "Cache Data ECC Error" },			\
> +	{ CXL_RAS_CE_MEM_DATA_ECC, "Memory Data Ecc Error" },			\
> +	{ CXL_RAS_CE_CRC_THRESH, "CRC Threshold Hit" },				\
> +	{ CXL_RAS_CE_CACHE_POISON, "Received Cache Poison From Peer" },		\
> +	{ CXL_RAS_CE_MEM_POISON, "Received Memory Poison From Peer" },		\
> +	{ CXL_RAS_CE_PHYS_LAYER_ERR, "Received Error From Physical Layer" }	\
> +)
> +
> +TRACE_EVENT(cxl_ras_ce,
> +	TP_PROTO(const char *dev_name, u32 status),
> +	TP_ARGS(dev_name, status),
> +	TP_STRUCT__entry(
> +		__string(dev_name, dev_name)
> +		__field(u32, status)
> +	),
> +	TP_fast_assign(
> +		__assign_str(dev_name, dev_name);
> +		__entry->status = status;
> +	),
> +	TP_printk("%s: status: '%s'",
> +		  __get_str(dev_name), show_ce_errs(__entry->status)
> +	)
> +);
> +
> +#endif /* _CXL_RAS_EVENTS_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
> 
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 8/9] cxl/pci: add tracepoint events for CXL RAS
  2022-10-20 17:02   ` Jonathan Cameron
@ 2022-10-20 17:07     ` Dave Jiang
  2022-10-20 17:52       ` Steven Rostedt
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Jiang @ 2022-10-20 17:07 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, alison.schofield, vishal.l.verma, bwidawsk,
	dan.j.williams, shiju.jose, rrichter, Steven Rostedt


On 10/20/2022 10:02 AM, Jonathan Cameron wrote:
> On Fri, 16 Sep 2022 16:11:39 -0700
> Dave Jiang <dave.jiang@intel.com> wrote:
>
>> Add tracepoint events for recording the CXL uncorrectable and correctable
>> errors. For uncorrectable errors, there is additional data up to 512B from
>> the header log register (CXL spec rev3 8.2.4.16.7). The content of the
>> register depends on the Uncorrectable Errors (UC) register status (CXL spec
>> rev3 8.2.4.16.1).  This implementation supports the Receiver_Overflow error
>> where the definition is defined as first 3 bits of the Header Log data. The
>> trace event will intake a dynamic array that will dump the Header Log data
>> based on error. If multiple errors are set in the status register, then the
>> 'first error' field (CXL spec rev3 v8.2.4.16.6) is read from the Error
>> Capabilities and Control Register in order to determine the error.
>>
>> This implementation does not include CXL IDE Error details.
>>
>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> Given the useful review we've gotten on other race points seems wise to
> Cc Stephen.
Yes I will next rev.
>
> The overflow flags seems to be inconsistent wrt to spec.  They aren't flags
> as such in the spec.  However I note that isn't used anyway so maybe drop
> it for now?

Will do.


>
> Jonathan
>
>> ---
>>   drivers/cxl/pci.c              |    2 +
>>   include/trace/events/cxl_ras.h |  117 ++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 119 insertions(+)
>>   create mode 100644 include/trace/events/cxl_ras.h
>>
>> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
>> index 610b3a77f205..357de704e42c 100644
>> --- a/drivers/cxl/pci.c
>> +++ b/drivers/cxl/pci.c
>> @@ -13,6 +13,8 @@
>>   #include "cxlmem.h"
>>   #include "cxlpci.h"
>>   #include "cxl.h"
>> +#define CREATE_TRACE_POINTS
>> +#include <trace/events/cxl_ras.h>
>>   
>>   /**
>>    * DOC: cxl pci
>> diff --git a/include/trace/events/cxl_ras.h b/include/trace/events/cxl_ras.h
>> new file mode 100644
>> index 000000000000..6bb41c3b87c8
>> --- /dev/null
>> +++ b/include/trace/events/cxl_ras.h
>> @@ -0,0 +1,117 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#undef TRACE_SYSTEM
>> +#define TRACE_SYSTEM cxl_ras
>> +
>> +#if !defined(_CXL_RAS_EVENTS_H) || defined(TRACE_HEADER_MULTI_READ)
>> +#define _CXL_RAS_EVENTS_H
>> +
>> +#include <linux/tracepoint.h>
>> +
>> +#define CXL_RAS_UC_CACHE_DATA_PARITY	BIT(0)
>> +#define CXL_RAS_UC_CACHE_ADDR_PARITY	BIT(1)
>> +#define CXL_RAS_UC_CACHE_BE_PARITY	BIT(2)
>> +#define CXL_RAS_UC_CACHE_DATA_ECC	BIT(3)
>> +#define CXL_RAS_UC_MEM_DATA_PARITY	BIT(4)
>> +#define CXL_RAS_UC_MEM_ADDR_PARITY	BIT(5)
>> +#define CXL_RAS_UC_MEM_BE_PARITY	BIT(6)
>> +#define CXL_RAS_UC_MEM_DATA_ECC		BIT(7)
>> +#define CXL_RAS_UC_REINIT_THRESH	BIT(8)
>> +#define CXL_RAS_UC_RSVD_ENCODE		BIT(9)
>> +#define CXL_RAS_UC_POISON		BIT(10)
>> +#define CXL_RAS_UC_RECV_OVERFLOW	BIT(11)
>> +#define CXL_RAS_UC_INTERNAL_ERR		BIT(14)
>> +#define CXL_RAS_UC_IDE_TX_ERR		BIT(15)
>> +#define CXL_RAS_UC_IDE_RX_ERR		BIT(16)
>> +
>> +#define show_uc_errs(status)	__print_flags(status, " | ",		  \
>> +	{ CXL_RAS_UC_CACHE_DATA_PARITY, "Cache Data Parity Error" },	  \
>> +	{ CXL_RAS_UC_CACHE_ADDR_PARITY, "Cache Address Parity Error" },	  \
>> +	{ CXL_RAS_UC_CACHE_BE_PARITY, "Cache Byte Enable Parity Error" }, \
>> +	{ CXL_RAS_UC_CACHE_DATA_ECC, "Cache Data ECC Error" },		  \
>> +	{ CXL_RAS_UC_MEM_DATA_PARITY, "Memory Data Parity Error" },	  \
>> +	{ CXL_RAS_UC_MEM_ADDR_PARITY, "Memory Address Parity Error" },	  \
>> +	{ CXL_RAS_UC_MEM_BE_PARITY, "Memory Byte Enable Parity Error" },  \
>> +	{ CXL_RAS_UC_MEM_DATA_ECC, "Memory Data ECC Error" },		  \
>> +	{ CXL_RAS_UC_REINIT_THRESH, "REINIT Threshold Hit" },		  \
>> +	{ CXL_RAS_UC_RSVD_ENCODE, "Received Unrecognized Encoding" },	  \
>> +	{ CXL_RAS_UC_POISON, "Received Poison From Peer" },		  \
>> +	{ CXL_RAS_UC_RECV_OVERFLOW, "Receiver Overflow" },		  \
>> +	{ CXL_RAS_UC_INTERNAL_ERR, "Component Specific Error" },	  \
>> +	{ CXL_RAS_UC_IDE_TX_ERR, "IDE Tx Error" },			  \
>> +	{ CXL_RAS_UC_IDE_RX_ERR, "IDE Rx Error" }			  \
>> +)
>> +
>> +#define CXL_RAS_UC_OVFL_D2H_REQ		BIT(0)
>> +#define CXL_RAS_UC_OVFL_D2H_RSP		BIT(1)
>> +#define CXL_RAS_UC_OVFL_D2H_DATA	BIT(2)
>> +#define CXL_RAS_UC_OVFL_S2M_NDR		BIT(3)
>> +#define CXL_RAS_UC_OVFL_S2M_DRS		BIT(4)
> Why not align these with the values in the spec?
> They aren't flags as such...  Mind you not used anyway so probably just
> drop it for now?
>
>> +
>> +#define show_uc_ovfl(hl)	__print_flags(hl, " | ",		\
>> +	{ CXL_RAS_UC_OVFL_D2H_REQ, "Receiver Overflow D2H Req" },	\
>> +	{ CXL_RAS_UC_OVFL_D2H_RSP, "Receiver Overflow D2H Rsp" },	\
>> +	{ CXL_RAS_UC_OVFL_D2H_DATA, "Receiver Overflow D2H Data" },	\
>> +	{ CXL_RAS_UC_OVFL_S2M_NDR, "Receiver Overflow S2M NDR" },	\
>> +	{ CXL_RAS_UC_OVFL_S2M_DRS, "Receiver Overflow S2M DRS" }	\
>> +)
>> +
>> +TRACE_EVENT(cxl_ras_uc,
>> +	TP_PROTO(const char *dev_name, u32 status, u32 fe, u8 *hl, int hl_len),
>> +	TP_ARGS(dev_name, status, fe, hl, hl_len),
>> +	TP_STRUCT__entry(
>> +		__string(dev_name, dev_name)
>> +		__field(u32, status)
>> +		__field(u32, first_error)
>> +		__dynamic_array(u8, header_log, hl_len)
>> +		__field(int, header_log_len)
>> +	),
>> +	TP_fast_assign(
>> +		__assign_str(dev_name, dev_name);
>> +		__entry->status = status;
>> +		__entry->first_error = fe;
>> +		memcpy(__get_dynamic_array(header_log), hl, hl_len);
>> +		__entry->header_log_len = hl_len;
>> +	),
>> +	TP_printk("%s: status: '%s' first_error: '%s' header log: %s",
>> +		  __get_str(dev_name), show_uc_errs(__entry->status),
>> +		  show_uc_errs(__entry->first_error),
>> +		  __print_array(__get_dynamic_array(header_log), __entry->header_log_len, 1)
>> +	)
>> +);
>> +
>> +#define CXL_RAS_CE_CACHE_DATA_ECC	BIT(0)
>> +#define CXL_RAS_CE_MEM_DATA_ECC		BIT(1)
>> +#define CXL_RAS_CE_CRC_THRESH		BIT(2)
>> +#define CXL_RAS_CE_CACHE_POISON		BIT(3)
>> +#define CXL_RAS_CE_MEM_POISON		BIT(4)
>> +#define CXL_RAS_CE_PHYS_LAYER_ERR	BIT(5)
>> +
>> +#define show_ce_errs(status)	__print_flags(status, " | ",			\
>> +	{ CXL_RAS_CE_CACHE_DATA_ECC, "Cache Data ECC Error" },			\
>> +	{ CXL_RAS_CE_MEM_DATA_ECC, "Memory Data Ecc Error" },			\
>> +	{ CXL_RAS_CE_CRC_THRESH, "CRC Threshold Hit" },				\
>> +	{ CXL_RAS_CE_CACHE_POISON, "Received Cache Poison From Peer" },		\
>> +	{ CXL_RAS_CE_MEM_POISON, "Received Memory Poison From Peer" },		\
>> +	{ CXL_RAS_CE_PHYS_LAYER_ERR, "Received Error From Physical Layer" }	\
>> +)
>> +
>> +TRACE_EVENT(cxl_ras_ce,
>> +	TP_PROTO(const char *dev_name, u32 status),
>> +	TP_ARGS(dev_name, status),
>> +	TP_STRUCT__entry(
>> +		__string(dev_name, dev_name)
>> +		__field(u32, status)
>> +	),
>> +	TP_fast_assign(
>> +		__assign_str(dev_name, dev_name);
>> +		__entry->status = status;
>> +	),
>> +	TP_printk("%s: status: '%s'",
>> +		  __get_str(dev_name), show_ce_errs(__entry->status)
>> +	)
>> +);
>> +
>> +#endif /* _CXL_RAS_EVENTS_H */
>> +
>> +/* This part must be outside protection */
>> +#include <trace/define_trace.h>
>>
>>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 8/9] cxl/pci: add tracepoint events for CXL RAS
  2022-10-20 17:07     ` Dave Jiang
@ 2022-10-20 17:52       ` Steven Rostedt
  0 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2022-10-20 17:52 UTC (permalink / raw)
  To: Dave Jiang
  Cc: Jonathan Cameron, linux-cxl, alison.schofield, vishal.l.verma,
	bwidawsk, dan.j.williams, shiju.jose, rrichter

On Thu, 20 Oct 2022 10:07:30 -0700
Dave Jiang <dave.jiang@intel.com> wrote:

> > Given the useful review we've gotten on other race points seems wise to
> > Cc Stephen.  
> Yes I will next rev.

You can Cc me instead of this Stephen ;-)

-- Steve

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling
  2022-10-19 17:38         ` Dave Jiang
@ 2022-10-24 16:01           ` Jonathan Cameron
  2022-10-25 15:22             ` Dave Jiang
  2022-11-03 12:58             ` Jonathan Cameron
  0 siblings, 2 replies; 34+ messages in thread
From: Jonathan Cameron @ 2022-10-24 16:01 UTC (permalink / raw)
  To: Dave Jiang
  Cc: linux-cxl, alison.schofield, vishal.l.verma, bwidawsk,
	dan.j.williams, shiju.jose, rrichter

On Wed, 19 Oct 2022 10:38:13 -0700
Dave Jiang <dave.jiang@intel.com> wrote:

> On 10/19/2022 10:30 AM, Jonathan Cameron wrote:
> > On Tue, 11 Oct 2022 18:19:15 +0100
> > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> >  
> >> On Tue, 11 Oct 2022 08:18:34 -0700
> >> Dave Jiang <dave.jiang@intel.com> wrote:
> >>  
> >>> On 10/11/2022 7:17 AM, Jonathan Cameron wrote:  
> >>>> On Fri, 16 Sep 2022 16:10:53 -0700
> >>>> Dave Jiang <dave.jiang@intel.com> wrote:
> >>>>       
> >>>>> Series set to RFC since there's no means to test. Would like to get opinion
> >>>>> on whether going with using trace events as reporting mechanism is ok.
> >>>>>
> >>>>> Jonathan,
> >>>>> We currently don't have any ways to test AER events. Do you have any plans
> >>>>> to support AER events via QEMU emulation?  
> >>>> Sorry - missed this entirely as gotten a bit behind reading CXL emails.  
> > Hi Dave,
> >
> > Quick update.
> >
> > Working QEMU emulation - but needs some/lots of cleanup. Particularly fun was
> > figuring out why I wasn't getting messages past the upstream switch port.
> > Turned out the serial number ECAP was on top of the AER ECAP. Oops - thankfully
> > that patch isn't upstream yet.
> > Also QEMU AER rooting seems to be based on some older PCIE spec
> > so needed some tweaks to get the device to actually issue ERR_FATAL etc.
> >
> > Anyhow, should have something you can play with in a day or two.  
> 
> Awesome! Thanks! :)

Took a little longer than expected..

Anyhow, now at
https://gitlab.com/jic23/qemu/-/commits/cxl-2022-10-24

That tree is carrying far too many things right now for it make much sense
to me to email this to qemu-devel - though I may pull
hw/pci/aer: Add missing routing for AER errors
out in advance as that's closing a spec different between QEMU emulation of AER
and what the PCI spec says.

Hopefully set of out of tree patches will start to shrink soon - v9 of the DOE
patches have been on list for a week or so.

Top patch includes a very short 'how to' in patch description.  Basically fire
up QMP: Add something like -qmp tcp:localhost:444,server=on,wait=off to your
qemu commandline and use commands like:

{ "execute": "qmp_capabilities" }
...
{ "execute": "cxl-inject-uncorrectable-error",
    "arguments": {
        "path": "/machine/peripheral/cxl-pmem0",
        "type": "cache-address-parity",
        "header": [ 3, 4]
    } }
...
{ "execute": "cxl-inject-correctable-error",
    "arguments": {
        "path": "/machine/peripheral/cxl-pmem0",
        "type": "physical",
        "header": [ 3, 4]
    } }



> 
> 
> > In meantime an example dump (not writing the header log yet!)
> >
> > pcieport 0000:0c:00.0: AER: Uncorrected (Non-Fatal) error received: 0000:0f:00.0
> > cxl_pci 0000:0f:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> > cxl_pci 0000:0f:00.0:   device [8086:0d93] error status/mask=00004000/00000000
> > cxl_pci 0000:0f:00.0:    [14] CmpltTO                (First)
> > cxl_ras_uc: mem3: status: 'Cache Data Parity Error' first_error: 'Cache Data Parity Error' header log: {0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0}
> > cxl_pci 0000:0f:00.0: mem3: restart CXL.mem after slot reset
> > cxl_port endpoint6: No CMA mailbox
> > cxl_pci 0000:0f:00.0: mem3: error resume successful
> > pcieport 0000:0e:00.0: AER: device recovery successful
> >
> > Jonathan  
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling
  2022-10-24 16:01           ` Jonathan Cameron
@ 2022-10-25 15:22             ` Dave Jiang
  2022-11-03 12:58             ` Jonathan Cameron
  1 sibling, 0 replies; 34+ messages in thread
From: Dave Jiang @ 2022-10-25 15:22 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, alison.schofield, vishal.l.verma, bwidawsk,
	dan.j.williams, shiju.jose, rrichter


On 10/24/2022 9:01 AM, Jonathan Cameron wrote:
> On Wed, 19 Oct 2022 10:38:13 -0700
> Dave Jiang <dave.jiang@intel.com> wrote:
>
>> On 10/19/2022 10:30 AM, Jonathan Cameron wrote:
>>> On Tue, 11 Oct 2022 18:19:15 +0100
>>> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
>>>   
>>>> On Tue, 11 Oct 2022 08:18:34 -0700
>>>> Dave Jiang <dave.jiang@intel.com> wrote:
>>>>   
>>>>> On 10/11/2022 7:17 AM, Jonathan Cameron wrote:
>>>>>> On Fri, 16 Sep 2022 16:10:53 -0700
>>>>>> Dave Jiang <dave.jiang@intel.com> wrote:
>>>>>>        
>>>>>>> Series set to RFC since there's no means to test. Would like to get opinion
>>>>>>> on whether going with using trace events as reporting mechanism is ok.
>>>>>>>
>>>>>>> Jonathan,
>>>>>>> We currently don't have any ways to test AER events. Do you have any plans
>>>>>>> to support AER events via QEMU emulation?
>>>>>> Sorry - missed this entirely as gotten a bit behind reading CXL emails.
>>> Hi Dave,
>>>
>>> Quick update.
>>>
>>> Working QEMU emulation - but needs some/lots of cleanup. Particularly fun was
>>> figuring out why I wasn't getting messages past the upstream switch port.
>>> Turned out the serial number ECAP was on top of the AER ECAP. Oops - thankfully
>>> that patch isn't upstream yet.
>>> Also QEMU AER rooting seems to be based on some older PCIE spec
>>> so needed some tweaks to get the device to actually issue ERR_FATAL etc.
>>>
>>> Anyhow, should have something you can play with in a day or two.
>> Awesome! Thanks! :)
> Took a little longer than expected..
>
> Anyhow, now at
> https://gitlab.com/jic23/qemu/-/commits/cxl-2022-10-24

Thank you! I'll try this out as soon as I get a chance.

>
> That tree is carrying far too many things right now for it make much sense
> to me to email this to qemu-devel - though I may pull
> hw/pci/aer: Add missing routing for AER errors
> out in advance as that's closing a spec different between QEMU emulation of AER
> and what the PCI spec says.
>
> Hopefully set of out of tree patches will start to shrink soon - v9 of the DOE
> patches have been on list for a week or so.
>
> Top patch includes a very short 'how to' in patch description.  Basically fire
> up QMP: Add something like -qmp tcp:localhost:444,server=on,wait=off to your
> qemu commandline and use commands like:
>
> { "execute": "qmp_capabilities" }
> ...
> { "execute": "cxl-inject-uncorrectable-error",
>      "arguments": {
>          "path": "/machine/peripheral/cxl-pmem0",
>          "type": "cache-address-parity",
>          "header": [ 3, 4]
>      } }
> ...
> { "execute": "cxl-inject-correctable-error",
>      "arguments": {
>          "path": "/machine/peripheral/cxl-pmem0",
>          "type": "physical",
>          "header": [ 3, 4]
>      } }
>
>
>
>>
>>> In meantime an example dump (not writing the header log yet!)
>>>
>>> pcieport 0000:0c:00.0: AER: Uncorrected (Non-Fatal) error received: 0000:0f:00.0
>>> cxl_pci 0000:0f:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
>>> cxl_pci 0000:0f:00.0:   device [8086:0d93] error status/mask=00004000/00000000
>>> cxl_pci 0000:0f:00.0:    [14] CmpltTO                (First)
>>> cxl_ras_uc: mem3: status: 'Cache Data Parity Error' first_error: 'Cache Data Parity Error' header log: {0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0}
>>> cxl_pci 0000:0f:00.0: mem3: restart CXL.mem after slot reset
>>> cxl_port endpoint6: No CMA mailbox
>>> cxl_pci 0000:0f:00.0: mem3: error resume successful
>>> pcieport 0000:0e:00.0: AER: device recovery successful
>>>
>>> Jonathan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling
  2022-10-24 16:01           ` Jonathan Cameron
  2022-10-25 15:22             ` Dave Jiang
@ 2022-11-03 12:58             ` Jonathan Cameron
  2022-11-03 13:27               ` Jonathan Cameron
  1 sibling, 1 reply; 34+ messages in thread
From: Jonathan Cameron @ 2022-11-03 12:58 UTC (permalink / raw)
  To: Dave Jiang
  Cc: linux-cxl, alison.schofield, vishal.l.verma, bwidawsk,
	dan.j.williams, shiju.jose, rrichter

On Mon, 24 Oct 2022 17:01:02 +0100
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:

> On Wed, 19 Oct 2022 10:38:13 -0700
> Dave Jiang <dave.jiang@intel.com> wrote:
> 
> > On 10/19/2022 10:30 AM, Jonathan Cameron wrote:  
> > > On Tue, 11 Oct 2022 18:19:15 +0100
> > > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > >    
> > >> On Tue, 11 Oct 2022 08:18:34 -0700
> > >> Dave Jiang <dave.jiang@intel.com> wrote:
> > >>    
> > >>> On 10/11/2022 7:17 AM, Jonathan Cameron wrote:    
> > >>>> On Fri, 16 Sep 2022 16:10:53 -0700
> > >>>> Dave Jiang <dave.jiang@intel.com> wrote:
> > >>>>         
> > >>>>> Series set to RFC since there's no means to test. Would like to get opinion
> > >>>>> on whether going with using trace events as reporting mechanism is ok.
> > >>>>>
> > >>>>> Jonathan,
> > >>>>> We currently don't have any ways to test AER events. Do you have any plans
> > >>>>> to support AER events via QEMU emulation?    
> > >>>> Sorry - missed this entirely as gotten a bit behind reading CXL emails.    
> > > Hi Dave,
> > >
> > > Quick update.
> > >
> > > Working QEMU emulation - but needs some/lots of cleanup. Particularly fun was
> > > figuring out why I wasn't getting messages past the upstream switch port.
> > > Turned out the serial number ECAP was on top of the AER ECAP. Oops - thankfully
> > > that patch isn't upstream yet.
> > > Also QEMU AER rooting seems to be based on some older PCIE spec
> > > so needed some tweaks to get the device to actually issue ERR_FATAL etc.
> > >
> > > Anyhow, should have something you can play with in a day or two.    
> > 
> > Awesome! Thanks! :)  
> 
> Took a little longer than expected..
> 
> Anyhow, now at
> https://gitlab.com/jic23/qemu/-/commits/cxl-2022-10-24
> 
> That tree is carrying far too many things right now for it make much sense
> to me to email this to qemu-devel - though I may pull
> hw/pci/aer: Add missing routing for AER errors
> out in advance as that's closing a spec different between QEMU emulation of AER
> and what the PCI spec says.
> 
> Hopefully set of out of tree patches will start to shrink soon - v9 of the DOE
> patches have been on list for a week or so.
> 
> Top patch includes a very short 'how to' in patch description.  Basically fire
> up QMP: Add something like -qmp tcp:localhost:444,server=on,wait=off to your
> qemu commandline and use commands like:
> 
> { "execute": "qmp_capabilities" }
> ...
> { "execute": "cxl-inject-uncorrectable-error",
>     "arguments": {
>         "path": "/machine/peripheral/cxl-pmem0",
>         "type": "cache-address-parity",
>         "header": [ 3, 4]
>     } }
> ...
> { "execute": "cxl-inject-correctable-error",
>     "arguments": {
>         "path": "/machine/peripheral/cxl-pmem0",
>         "type": "physical",
>         "header": [ 3, 4]
>     } }
> 

So Dave reported that this wasn't working on x86 qemu machines.

A fun bit of debugging later (I hate AML) and I think I have find the issue +
have a hack to workaround it for now.

So need some background.
1) CXL code is based on QEMU's pci expander bridge root bridge - there is a complex
   bit of handling to create appropriate ACPI DSDT magic.
2) The CXL root port is based on pcie_root_port.c 
3) Both CXL root port and pcie root port use traditional PCI interrupts, not MSI/MSIX
   for their signaling.
4) Q35 machine uses an IOAPIC and the resulting PCI bus interrupt routing lands the
   actual interrupt on line 23 for my particular configuration
5) The ACPI table says it's on line 11.
6) x86 code for creating the PRT has an informative comment...
https://elixir.bootlin.com/qemu/latest/source/hw/i386/acpi-build.c#L697
 * The main goal is to equaly distribute the interrupts
 * over the 4 existing ACPI links (works only for i440fx).

So the hack I'm running is below (note the UID thing is a separate bug that stops
iasl from disassembling the DSDT due to a duplicate entry - I'll send out a fix
for that shortly).

There are a bunch of possible approaches to fix this if my identification of
the issue is correct.

1) Clean equivalent of this hack that runs on appropriate machines only.
2) Use MSI instead. (ioh3420 root port takes this approach I think).

From 286c8f9b6d229d9e71f64657b6b3ccb70cb98306 Mon Sep 17 00:00:00 2001
From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Date: Thu, 3 Nov 2022 12:20:25 +0000
Subject: [PATCH] HACK: Fix-up interrupt routing for CXL on q35.

I need to do some more thinking to figure out correct approach
to solve this problem.

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 hw/i386/acpi-build.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 4f54b61904..8055253e68 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -746,7 +746,7 @@ static Aml *build_prt(bool is_pci0_prt)
                       lnk_idx));
 
         /* route[2] = "LNK[D|A|B|C]", selection based on pin % 3  */
-        aml_append(while_ctx, initialize_route(route, "LNKD", lnk_idx, 0));
+        aml_append(while_ctx, initialize_route(route, "GSIH", lnk_idx, 0));
         if (is_pci0_prt) {
             Aml *if_device_1, *if_pin_4, *else_pin_4;
 
@@ -762,16 +762,16 @@ static Aml *build_prt(bool is_pci0_prt)
                 else_pin_4 = aml_else();
                 {
                     aml_append(else_pin_4,
-                        aml_store(build_prt_entry("LNKA"), route));
+                        aml_store(build_prt_entry("GSIE"), route));
                 }
                 aml_append(if_device_1, else_pin_4);
             }
             aml_append(while_ctx, if_device_1);
         } else {
-            aml_append(while_ctx, initialize_route(route, "LNKA", lnk_idx, 1));
+            aml_append(while_ctx, initialize_route(route, "GSIE", lnk_idx, 1));
         }
-        aml_append(while_ctx, initialize_route(route, "LNKB", lnk_idx, 2));
-        aml_append(while_ctx, initialize_route(route, "LNKC", lnk_idx, 3));
+        aml_append(while_ctx, initialize_route(route, "GSIF", lnk_idx, 2));
+        aml_append(while_ctx, initialize_route(route, "GSIG", lnk_idx, 3));
 
         /* route[0] = 0x[slot]FFFF */
         aml_append(while_ctx,
@@ -1627,7 +1627,7 @@ build_dsdt(GArray *table_data, BIOSLinker *linker,
                 aml_append(pkg, aml_eisaid("PNP0A03"));
                 aml_append(dev, aml_name_decl("_CID", pkg));
                 aml_append(dev, aml_name_decl("_ADR", aml_int(0)));
-                aml_append(dev, aml_name_decl("_UID", aml_int(bus_num)));
+//                aml_append(dev, aml_name_decl("_UID", aml_int(bus_num)));
                 build_cxl_osc_method(dev);
             } else if (pci_bus_is_express(bus)) {
                 aml_append(dev, aml_name_decl("_HID", aml_eisaid("PNP0A08")));
-- 
2.37.2
> 
> > 
> >   
> > > In meantime an example dump (not writing the header log yet!)
> > >
> > > pcieport 0000:0c:00.0: AER: Uncorrected (Non-Fatal) error received: 0000:0f:00.0
> > > cxl_pci 0000:0f:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> > > cxl_pci 0000:0f:00.0:   device [8086:0d93] error status/mask=00004000/00000000
> > > cxl_pci 0000:0f:00.0:    [14] CmpltTO                (First)
> > > cxl_ras_uc: mem3: status: 'Cache Data Parity Error' first_error: 'Cache Data Parity Error' header log: {0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0}
> > > cxl_pci 0000:0f:00.0: mem3: restart CXL.mem after slot reset
> > > cxl_port endpoint6: No CMA mailbox
> > > cxl_pci 0000:0f:00.0: mem3: error resume successful
> > > pcieport 0000:0e:00.0: AER: device recovery successful
> > >
> > > Jonathan    
> >   
> 
> 


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling
  2022-11-03 12:58             ` Jonathan Cameron
@ 2022-11-03 13:27               ` Jonathan Cameron
  2022-11-16 23:20                 ` Dave Jiang
  0 siblings, 1 reply; 34+ messages in thread
From: Jonathan Cameron @ 2022-11-03 13:27 UTC (permalink / raw)
  To: Dave Jiang
  Cc: linux-cxl, alison.schofield, vishal.l.verma, bwidawsk,
	dan.j.williams, shiju.jose, rrichter

On Thu, 3 Nov 2022 12:58:51 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:

> On Mon, 24 Oct 2022 17:01:02 +0100
> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> 
> > On Wed, 19 Oct 2022 10:38:13 -0700
> > Dave Jiang <dave.jiang@intel.com> wrote:
> >   
> > > On 10/19/2022 10:30 AM, Jonathan Cameron wrote:    
> > > > On Tue, 11 Oct 2022 18:19:15 +0100
> > > > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > > >      
> > > >> On Tue, 11 Oct 2022 08:18:34 -0700
> > > >> Dave Jiang <dave.jiang@intel.com> wrote:
> > > >>      
> > > >>> On 10/11/2022 7:17 AM, Jonathan Cameron wrote:      
> > > >>>> On Fri, 16 Sep 2022 16:10:53 -0700
> > > >>>> Dave Jiang <dave.jiang@intel.com> wrote:
> > > >>>>           
> > > >>>>> Series set to RFC since there's no means to test. Would like to get opinion
> > > >>>>> on whether going with using trace events as reporting mechanism is ok.
> > > >>>>>
> > > >>>>> Jonathan,
> > > >>>>> We currently don't have any ways to test AER events. Do you have any plans
> > > >>>>> to support AER events via QEMU emulation?      
> > > >>>> Sorry - missed this entirely as gotten a bit behind reading CXL emails.      
> > > > Hi Dave,
> > > >
> > > > Quick update.
> > > >
> > > > Working QEMU emulation - but needs some/lots of cleanup. Particularly fun was
> > > > figuring out why I wasn't getting messages past the upstream switch port.
> > > > Turned out the serial number ECAP was on top of the AER ECAP. Oops - thankfully
> > > > that patch isn't upstream yet.
> > > > Also QEMU AER rooting seems to be based on some older PCIE spec
> > > > so needed some tweaks to get the device to actually issue ERR_FATAL etc.
> > > >
> > > > Anyhow, should have something you can play with in a day or two.      
> > > 
> > > Awesome! Thanks! :)    
> > 
> > Took a little longer than expected..
> > 
> > Anyhow, now at
> > https://gitlab.com/jic23/qemu/-/commits/cxl-2022-10-24
> > 
> > That tree is carrying far too many things right now for it make much sense
> > to me to email this to qemu-devel - though I may pull
> > hw/pci/aer: Add missing routing for AER errors
> > out in advance as that's closing a spec different between QEMU emulation of AER
> > and what the PCI spec says.
> > 
> > Hopefully set of out of tree patches will start to shrink soon - v9 of the DOE
> > patches have been on list for a week or so.
> > 
> > Top patch includes a very short 'how to' in patch description.  Basically fire
> > up QMP: Add something like -qmp tcp:localhost:444,server=on,wait=off to your
> > qemu commandline and use commands like:
> > 
> > { "execute": "qmp_capabilities" }
> > ...
> > { "execute": "cxl-inject-uncorrectable-error",
> >     "arguments": {
> >         "path": "/machine/peripheral/cxl-pmem0",
> >         "type": "cache-address-parity",
> >         "header": [ 3, 4]
> >     } }
> > ...
> > { "execute": "cxl-inject-correctable-error",
> >     "arguments": {
> >         "path": "/machine/peripheral/cxl-pmem0",
> >         "type": "physical",
> >         "header": [ 3, 4]
> >     } }
> >   
> 
> So Dave reported that this wasn't working on x86 qemu machines.
> 
> A fun bit of debugging later (I hate AML) and I think I have find the issue +
> have a hack to workaround it for now.
> 
> So need some background.
> 1) CXL code is based on QEMU's pci expander bridge root bridge - there is a complex
>    bit of handling to create appropriate ACPI DSDT magic.
> 2) The CXL root port is based on pcie_root_port.c 
> 3) Both CXL root port and pcie root port use traditional PCI interrupts, not MSI/MSIX
>    for their signaling.
> 4) Q35 machine uses an IOAPIC and the resulting PCI bus interrupt routing lands the
>    actual interrupt on line 23 for my particular configuration
> 5) The ACPI table says it's on line 11.
> 6) x86 code for creating the PRT has an informative comment...
> https://elixir.bootlin.com/qemu/latest/source/hw/i386/acpi-build.c#L697
>  * The main goal is to equaly distribute the interrupts
>  * over the 4 existing ACPI links (works only for i440fx).
> 
> So the hack I'm running is below (note the UID thing is a separate bug that stops
> iasl from disassembling the DSDT due to a duplicate entry - I'll send out a fix
> for that shortly).
> 
> There are a bunch of possible approaches to fix this if my identification of
> the issue is correct.
> 
> 1) Clean equivalent of this hack that runs on appropriate machines only.
> 2) Use MSI instead. (ioh3420 root port takes this approach I think).

This turned out to be trivial case of cut and pate.  So MSI support it is (mostly because it doesn't
involve fixing AML generation :)

Very lightly tested on one config of x86 machine and one of arm64.
I've not thought about this at all yet, so it's a direct copy of the ioh3420 with 
appropriate renames for it being in the cxl_rp code.


From a5e85d90cb734fc1de81e0d99e443747348e65e7 Mon Sep 17 00:00:00 2001
From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Date: Thu, 3 Nov 2022 13:10:24 +0000
Subject: [PATCH] hw: pci-bridge: cxl_root_port: Write up MSI

Done to avoid fixing ACPI rout description of traditional PCI interrupts on q35.

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 hw/pci-bridge/cxl_root_port.c | 63 +++++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)

diff --git a/hw/pci-bridge/cxl_root_port.c b/hw/pci-bridge/cxl_root_port.c
index 1a9363a6de..a7475c427e 100644
--- a/hw/pci-bridge/cxl_root_port.c
+++ b/hw/pci-bridge/cxl_root_port.c
@@ -22,6 +22,7 @@
 #include "qemu/range.h"
 #include "hw/pci/pci_bridge.h"
 #include "hw/pci/pcie_port.h"
+#include "hw/pci/msi.h"
 #include "hw/qdev-properties.h"
 #include "hw/sysbus.h"
 #include "qapi/error.h"
@@ -29,6 +30,10 @@
 
 #define CXL_ROOT_PORT_DID 0x7075
 
+#define CXL_RP_MSI_OFFSET               0x60
+#define CXL_RP_MSI_SUPPORTED_FLAGS      PCI_MSI_FLAGS_MASKBIT
+#define CXL_RP_MSI_NR_VECTOR            2
+
 /* Copied from the gen root port which we derive */
 #define GEN_PCIE_ROOT_PORT_AER_OFFSET 0x100
 #define GEN_PCIE_ROOT_PORT_ACS_OFFSET \
@@ -36,6 +41,7 @@
 #define CXL_ROOT_PORT_DVSEC_OFFSET \
     (GEN_PCIE_ROOT_PORT_ACS_OFFSET + PCI_ACS_SIZEOF)
 
+
 typedef struct CXLRootPort {
     /*< private >*/
     PCIESlot parent_obj;
@@ -47,6 +53,50 @@ typedef struct CXLRootPort {
 #define TYPE_CXL_ROOT_PORT "cxl-rp"
 DECLARE_INSTANCE_CHECKER(CXLRootPort, CXL_ROOT_PORT, TYPE_CXL_ROOT_PORT)
 
+/*
+ * If two MSI vector are allocated, Advanced Error Interrupt Message Number
+ * is 1. otherwise 0.
+ * 17.12.5.10 RPERRSTS,  32:27 bit Advanced Error Interrupt Message Number.
+ */
+static uint8_t cxl_rp_aer_vector(const PCIDevice *d)
+{
+    switch (msi_nr_vectors_allocated(d)) {
+    case 1:
+        return 0;
+    case 2:
+        return 1;
+    case 4:
+    case 8:
+    case 16:
+    case 32:
+    default:
+        break;
+    }
+    abort();
+    return 0;
+}
+
+static int cxl_rp_interrupts_init(PCIDevice *d, Error **errp)
+{
+    int rc;
+
+    rc = msi_init(d, CXL_RP_MSI_OFFSET, CXL_RP_MSI_NR_VECTOR,
+                  CXL_RP_MSI_SUPPORTED_FLAGS & PCI_MSI_FLAGS_64BIT,
+                  CXL_RP_MSI_SUPPORTED_FLAGS & PCI_MSI_FLAGS_MASKBIT,
+                  errp);
+    if (rc < 0) {
+        assert(rc == -ENOTSUP);
+    }
+
+    return rc;
+}
+
+static void cxl_rp_interrupts_uninit(PCIDevice *d)
+{
+    msi_uninit(d);
+}
+
+    
 static void latch_registers(CXLRootPort *crp)
 {
     uint32_t *reg_state = crp->cxl_cstate.crb.cache_mem_registers;
@@ -177,6 +227,15 @@ static void cxl_rp_dvsec_write_config(PCIDevice *dev, uint32_t addr,
     }
 }
 
+static void cxl_rp_aer_vector_update(PCIDevice *d)
+{
+    PCIERootPortClass *rpc = PCIE_ROOT_PORT_GET_CLASS(d);
+
+    if (rpc->aer_vector) {
+        pcie_aer_root_set_vector(d, rpc->aer_vector(d));
+    }
+}
+
 static void cxl_rp_write_config(PCIDevice *d, uint32_t address, uint32_t val,
                                 int len)
 {
@@ -203,6 +262,7 @@ static void cxl_rp_write_config(PCIDevice *d, uint32_t address, uint32_t val,
         }
     }
     pci_bridge_write_config(d, address, val, len);
+    cxl_rp_aer_vector_update(d);
     pcie_cap_flr_write_config(d, address, val, len);
     pcie_cap_slot_write_config(d, slt_ctl, slt_sta, address, val, len);
     pcie_aer_write_config(d, address, val, len);
@@ -229,6 +289,9 @@ static void cxl_root_port_class_init(ObjectClass *oc, void *data)
 
     rpc->aer_offset = GEN_PCIE_ROOT_PORT_AER_OFFSET;
     rpc->acs_offset = GEN_PCIE_ROOT_PORT_ACS_OFFSET;
+    rpc->aer_vector = cxl_rp_aer_vector;
+    rpc->interrupts_init = cxl_rp_interrupts_init;
+    rpc->interrupts_uninit = cxl_rp_interrupts_uninit;
 
     dc->hotpluggable = false;
 }
-- 
2.37.2


> 
> From 286c8f9b6d229d9e71f64657b6b3ccb70cb98306 Mon Sep 17 00:00:00 2001
> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Date: Thu, 3 Nov 2022 12:20:25 +0000
> Subject: [PATCH] HACK: Fix-up interrupt routing for CXL on q35.
> 
> I need to do some more thinking to figure out correct approach
> to solve this problem.
> 
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
>  hw/i386/acpi-build.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index 4f54b61904..8055253e68 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -746,7 +746,7 @@ static Aml *build_prt(bool is_pci0_prt)
>                        lnk_idx));
>  
>          /* route[2] = "LNK[D|A|B|C]", selection based on pin % 3  */
> -        aml_append(while_ctx, initialize_route(route, "LNKD", lnk_idx, 0));
> +        aml_append(while_ctx, initialize_route(route, "GSIH", lnk_idx, 0));
>          if (is_pci0_prt) {
>              Aml *if_device_1, *if_pin_4, *else_pin_4;
>  
> @@ -762,16 +762,16 @@ static Aml *build_prt(bool is_pci0_prt)
>                  else_pin_4 = aml_else();
>                  {
>                      aml_append(else_pin_4,
> -                        aml_store(build_prt_entry("LNKA"), route));
> +                        aml_store(build_prt_entry("GSIE"), route));
>                  }
>                  aml_append(if_device_1, else_pin_4);
>              }
>              aml_append(while_ctx, if_device_1);
>          } else {
> -            aml_append(while_ctx, initialize_route(route, "LNKA", lnk_idx, 1));
> +            aml_append(while_ctx, initialize_route(route, "GSIE", lnk_idx, 1));
>          }
> -        aml_append(while_ctx, initialize_route(route, "LNKB", lnk_idx, 2));
> -        aml_append(while_ctx, initialize_route(route, "LNKC", lnk_idx, 3));
> +        aml_append(while_ctx, initialize_route(route, "GSIF", lnk_idx, 2));
> +        aml_append(while_ctx, initialize_route(route, "GSIG", lnk_idx, 3));
>  
>          /* route[0] = 0x[slot]FFFF */
>          aml_append(while_ctx,
> @@ -1627,7 +1627,7 @@ build_dsdt(GArray *table_data, BIOSLinker *linker,
>                  aml_append(pkg, aml_eisaid("PNP0A03"));
>                  aml_append(dev, aml_name_decl("_CID", pkg));
>                  aml_append(dev, aml_name_decl("_ADR", aml_int(0)));
> -                aml_append(dev, aml_name_decl("_UID", aml_int(bus_num)));
> +//                aml_append(dev, aml_name_decl("_UID", aml_int(bus_num)));
>                  build_cxl_osc_method(dev);
>              } else if (pci_bus_is_express(bus)) {
>                  aml_append(dev, aml_name_decl("_HID", aml_eisaid("PNP0A08")));


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling
  2022-11-03 13:27               ` Jonathan Cameron
@ 2022-11-16 23:20                 ` Dave Jiang
  2022-11-17 13:50                   ` Jonathan Cameron
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Jiang @ 2022-11-16 23:20 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, alison.schofield, vishal.l.verma, bwidawsk,
	dan.j.williams, shiju.jose, rrichter



On 11/3/2022 6:27 AM, Jonathan Cameron wrote:
> On Thu, 3 Nov 2022 12:58:51 +0000
> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> 
>> On Mon, 24 Oct 2022 17:01:02 +0100
>> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
>>
>>> On Wed, 19 Oct 2022 10:38:13 -0700
>>> Dave Jiang <dave.jiang@intel.com> wrote:
>>>    
>>>> On 10/19/2022 10:30 AM, Jonathan Cameron wrote:
>>>>> On Tue, 11 Oct 2022 18:19:15 +0100
>>>>> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
>>>>>       
>>>>>> On Tue, 11 Oct 2022 08:18:34 -0700
>>>>>> Dave Jiang <dave.jiang@intel.com> wrote:
>>>>>>       
>>>>>>> On 10/11/2022 7:17 AM, Jonathan Cameron wrote:
>>>>>>>> On Fri, 16 Sep 2022 16:10:53 -0700
>>>>>>>> Dave Jiang <dave.jiang@intel.com> wrote:
>>>>>>>>            
>>>>>>>>> Series set to RFC since there's no means to test. Would like to get opinion
>>>>>>>>> on whether going with using trace events as reporting mechanism is ok.
>>>>>>>>>
>>>>>>>>> Jonathan,
>>>>>>>>> We currently don't have any ways to test AER events. Do you have any plans
>>>>>>>>> to support AER events via QEMU emulation?
>>>>>>>> Sorry - missed this entirely as gotten a bit behind reading CXL emails.
>>>>> Hi Dave,
>>>>>
>>>>> Quick update.
>>>>>
>>>>> Working QEMU emulation - but needs some/lots of cleanup. Particularly fun was
>>>>> figuring out why I wasn't getting messages past the upstream switch port.
>>>>> Turned out the serial number ECAP was on top of the AER ECAP. Oops - thankfully
>>>>> that patch isn't upstream yet.
>>>>> Also QEMU AER rooting seems to be based on some older PCIE spec
>>>>> so needed some tweaks to get the device to actually issue ERR_FATAL etc.
>>>>>
>>>>> Anyhow, should have something you can play with in a day or two.
>>>>
>>>> Awesome! Thanks! :)
>>>
>>> Took a little longer than expected..
>>>
>>> Anyhow, now at
>>> https://gitlab.com/jic23/qemu/-/commits/cxl-2022-10-24
>>>
>>> That tree is carrying far too many things right now for it make much sense
>>> to me to email this to qemu-devel - though I may pull
>>> hw/pci/aer: Add missing routing for AER errors
>>> out in advance as that's closing a spec different between QEMU emulation of AER
>>> and what the PCI spec says.
>>>
>>> Hopefully set of out of tree patches will start to shrink soon - v9 of the DOE
>>> patches have been on list for a week or so.
>>>
>>> Top patch includes a very short 'how to' in patch description.  Basically fire
>>> up QMP: Add something like -qmp tcp:localhost:444,server=on,wait=off to your
>>> qemu commandline and use commands like:
>>>
>>> { "execute": "qmp_capabilities" }
>>> ...
>>> { "execute": "cxl-inject-uncorrectable-error",
>>>      "arguments": {
>>>          "path": "/machine/peripheral/cxl-pmem0",
>>>          "type": "cache-address-parity",
>>>          "header": [ 3, 4]
>>>      } }
>>> ...
>>> { "execute": "cxl-inject-correctable-error",
>>>      "arguments": {
>>>          "path": "/machine/peripheral/cxl-pmem0",
>>>          "type": "physical",
>>>          "header": [ 3, 4]
>>>      } }
>>>    
>>
>> So Dave reported that this wasn't working on x86 qemu machines.
>>
>> A fun bit of debugging later (I hate AML) and I think I have find the issue +
>> have a hack to workaround it for now.
>>
>> So need some background.
>> 1) CXL code is based on QEMU's pci expander bridge root bridge - there is a complex
>>     bit of handling to create appropriate ACPI DSDT magic.
>> 2) The CXL root port is based on pcie_root_port.c
>> 3) Both CXL root port and pcie root port use traditional PCI interrupts, not MSI/MSIX
>>     for their signaling.
>> 4) Q35 machine uses an IOAPIC and the resulting PCI bus interrupt routing lands the
>>     actual interrupt on line 23 for my particular configuration
>> 5) The ACPI table says it's on line 11.
>> 6) x86 code for creating the PRT has an informative comment...
>> https://elixir.bootlin.com/qemu/latest/source/hw/i386/acpi-build.c#L697
>>   * The main goal is to equaly distribute the interrupts
>>   * over the 4 existing ACPI links (works only for i440fx).
>>
>> So the hack I'm running is below (note the UID thing is a separate bug that stops
>> iasl from disassembling the DSDT due to a duplicate entry - I'll send out a fix
>> for that shortly).
>>
>> There are a bunch of possible approaches to fix this if my identification of
>> the issue is correct.
>>
>> 1) Clean equivalent of this hack that runs on appropriate machines only.
>> 2) Use MSI instead. (ioh3420 root port takes this approach I think).
> 
> This turned out to be trivial case of cut and pate.  So MSI support it is (mostly because it doesn't
> involve fixing AML generation :)
> 
> Very lightly tested on one config of x86 machine and one of arm64.
> I've not thought about this at all yet, so it's a direct copy of the ioh3420 with
> appropriate renames for it being in the cxl_rp code.

Hi Jonathan, do you have a qemu branch with the latest code? Maybe I'll 
give it a try. Thanks.

> 
> 
>  From a5e85d90cb734fc1de81e0d99e443747348e65e7 Mon Sep 17 00:00:00 2001
> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Date: Thu, 3 Nov 2022 13:10:24 +0000
> Subject: [PATCH] hw: pci-bridge: cxl_root_port: Write up MSI
> 
> Done to avoid fixing ACPI rout description of traditional PCI interrupts on q35.
> 
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
>   hw/pci-bridge/cxl_root_port.c | 63 +++++++++++++++++++++++++++++++++++
>   1 file changed, 63 insertions(+)
> 
> diff --git a/hw/pci-bridge/cxl_root_port.c b/hw/pci-bridge/cxl_root_port.c
> index 1a9363a6de..a7475c427e 100644
> --- a/hw/pci-bridge/cxl_root_port.c
> +++ b/hw/pci-bridge/cxl_root_port.c
> @@ -22,6 +22,7 @@
>   #include "qemu/range.h"
>   #include "hw/pci/pci_bridge.h"
>   #include "hw/pci/pcie_port.h"
> +#include "hw/pci/msi.h"
>   #include "hw/qdev-properties.h"
>   #include "hw/sysbus.h"
>   #include "qapi/error.h"
> @@ -29,6 +30,10 @@
>   
>   #define CXL_ROOT_PORT_DID 0x7075
>   
> +#define CXL_RP_MSI_OFFSET               0x60
> +#define CXL_RP_MSI_SUPPORTED_FLAGS      PCI_MSI_FLAGS_MASKBIT
> +#define CXL_RP_MSI_NR_VECTOR            2
> +
>   /* Copied from the gen root port which we derive */
>   #define GEN_PCIE_ROOT_PORT_AER_OFFSET 0x100
>   #define GEN_PCIE_ROOT_PORT_ACS_OFFSET \
> @@ -36,6 +41,7 @@
>   #define CXL_ROOT_PORT_DVSEC_OFFSET \
>       (GEN_PCIE_ROOT_PORT_ACS_OFFSET + PCI_ACS_SIZEOF)
>   
> +
>   typedef struct CXLRootPort {
>       /*< private >*/
>       PCIESlot parent_obj;
> @@ -47,6 +53,50 @@ typedef struct CXLRootPort {
>   #define TYPE_CXL_ROOT_PORT "cxl-rp"
>   DECLARE_INSTANCE_CHECKER(CXLRootPort, CXL_ROOT_PORT, TYPE_CXL_ROOT_PORT)
>   
> +/*
> + * If two MSI vector are allocated, Advanced Error Interrupt Message Number
> + * is 1. otherwise 0.
> + * 17.12.5.10 RPERRSTS,  32:27 bit Advanced Error Interrupt Message Number.
> + */
> +static uint8_t cxl_rp_aer_vector(const PCIDevice *d)
> +{
> +    switch (msi_nr_vectors_allocated(d)) {
> +    case 1:
> +        return 0;
> +    case 2:
> +        return 1;
> +    case 4:
> +    case 8:
> +    case 16:
> +    case 32:
> +    default:
> +        break;
> +    }
> +    abort();
> +    return 0;
> +}
> +
> +static int cxl_rp_interrupts_init(PCIDevice *d, Error **errp)
> +{
> +    int rc;
> +
> +    rc = msi_init(d, CXL_RP_MSI_OFFSET, CXL_RP_MSI_NR_VECTOR,
> +                  CXL_RP_MSI_SUPPORTED_FLAGS & PCI_MSI_FLAGS_64BIT,
> +                  CXL_RP_MSI_SUPPORTED_FLAGS & PCI_MSI_FLAGS_MASKBIT,
> +                  errp);
> +    if (rc < 0) {
> +        assert(rc == -ENOTSUP);
> +    }
> +
> +    return rc;
> +}
> +
> +static void cxl_rp_interrupts_uninit(PCIDevice *d)
> +{
> +    msi_uninit(d);
> +}
> +
> +
>   static void latch_registers(CXLRootPort *crp)
>   {
>       uint32_t *reg_state = crp->cxl_cstate.crb.cache_mem_registers;
> @@ -177,6 +227,15 @@ static void cxl_rp_dvsec_write_config(PCIDevice *dev, uint32_t addr,
>       }
>   }
>   
> +static void cxl_rp_aer_vector_update(PCIDevice *d)
> +{
> +    PCIERootPortClass *rpc = PCIE_ROOT_PORT_GET_CLASS(d);
> +
> +    if (rpc->aer_vector) {
> +        pcie_aer_root_set_vector(d, rpc->aer_vector(d));
> +    }
> +}
> +
>   static void cxl_rp_write_config(PCIDevice *d, uint32_t address, uint32_t val,
>                                   int len)
>   {
> @@ -203,6 +262,7 @@ static void cxl_rp_write_config(PCIDevice *d, uint32_t address, uint32_t val,
>           }
>       }
>       pci_bridge_write_config(d, address, val, len);
> +    cxl_rp_aer_vector_update(d);
>       pcie_cap_flr_write_config(d, address, val, len);
>       pcie_cap_slot_write_config(d, slt_ctl, slt_sta, address, val, len);
>       pcie_aer_write_config(d, address, val, len);
> @@ -229,6 +289,9 @@ static void cxl_root_port_class_init(ObjectClass *oc, void *data)
>   
>       rpc->aer_offset = GEN_PCIE_ROOT_PORT_AER_OFFSET;
>       rpc->acs_offset = GEN_PCIE_ROOT_PORT_ACS_OFFSET;
> +    rpc->aer_vector = cxl_rp_aer_vector;
> +    rpc->interrupts_init = cxl_rp_interrupts_init;
> +    rpc->interrupts_uninit = cxl_rp_interrupts_uninit;
>   
>       dc->hotpluggable = false;
>   }

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling
  2022-11-16 23:20                 ` Dave Jiang
@ 2022-11-17 13:50                   ` Jonathan Cameron
  2022-11-18 17:15                     ` Dave Jiang
  0 siblings, 1 reply; 34+ messages in thread
From: Jonathan Cameron @ 2022-11-17 13:50 UTC (permalink / raw)
  To: Dave Jiang
  Cc: linux-cxl, alison.schofield, vishal.l.verma, bwidawsk,
	dan.j.williams, shiju.jose, rrichter

On Wed, 16 Nov 2022 16:20:20 -0700
Dave Jiang <dave.jiang@intel.com> wrote:

> On 11/3/2022 6:27 AM, Jonathan Cameron wrote:
> > On Thu, 3 Nov 2022 12:58:51 +0000
> > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> >   
> >> On Mon, 24 Oct 2022 17:01:02 +0100
> >> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> >>  
> >>> On Wed, 19 Oct 2022 10:38:13 -0700
> >>> Dave Jiang <dave.jiang@intel.com> wrote:
> >>>      
> >>>> On 10/19/2022 10:30 AM, Jonathan Cameron wrote:  
> >>>>> On Tue, 11 Oct 2022 18:19:15 +0100
> >>>>> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> >>>>>         
> >>>>>> On Tue, 11 Oct 2022 08:18:34 -0700
> >>>>>> Dave Jiang <dave.jiang@intel.com> wrote:
> >>>>>>         
> >>>>>>> On 10/11/2022 7:17 AM, Jonathan Cameron wrote:  
> >>>>>>>> On Fri, 16 Sep 2022 16:10:53 -0700
> >>>>>>>> Dave Jiang <dave.jiang@intel.com> wrote:
> >>>>>>>>              
> >>>>>>>>> Series set to RFC since there's no means to test. Would like to get opinion
> >>>>>>>>> on whether going with using trace events as reporting mechanism is ok.
> >>>>>>>>>
> >>>>>>>>> Jonathan,
> >>>>>>>>> We currently don't have any ways to test AER events. Do you have any plans
> >>>>>>>>> to support AER events via QEMU emulation?  
> >>>>>>>> Sorry - missed this entirely as gotten a bit behind reading CXL emails.  
> >>>>> Hi Dave,
> >>>>>
> >>>>> Quick update.
> >>>>>
> >>>>> Working QEMU emulation - but needs some/lots of cleanup. Particularly fun was
> >>>>> figuring out why I wasn't getting messages past the upstream switch port.
> >>>>> Turned out the serial number ECAP was on top of the AER ECAP. Oops - thankfully
> >>>>> that patch isn't upstream yet.
> >>>>> Also QEMU AER rooting seems to be based on some older PCIE spec
> >>>>> so needed some tweaks to get the device to actually issue ERR_FATAL etc.
> >>>>>
> >>>>> Anyhow, should have something you can play with in a day or two.  
> >>>>
> >>>> Awesome! Thanks! :)  
> >>>
> >>> Took a little longer than expected..
> >>>
> >>> Anyhow, now at
> >>> https://gitlab.com/jic23/qemu/-/commits/cxl-2022-10-24
> >>>
> >>> That tree is carrying far too many things right now for it make much sense
> >>> to me to email this to qemu-devel - though I may pull
> >>> hw/pci/aer: Add missing routing for AER errors
> >>> out in advance as that's closing a spec different between QEMU emulation of AER
> >>> and what the PCI spec says.
> >>>
> >>> Hopefully set of out of tree patches will start to shrink soon - v9 of the DOE
> >>> patches have been on list for a week or so.
> >>>
> >>> Top patch includes a very short 'how to' in patch description.  Basically fire
> >>> up QMP: Add something like -qmp tcp:localhost:444,server=on,wait=off to your
> >>> qemu commandline and use commands like:
> >>>
> >>> { "execute": "qmp_capabilities" }
> >>> ...
> >>> { "execute": "cxl-inject-uncorrectable-error",
> >>>      "arguments": {
> >>>          "path": "/machine/peripheral/cxl-pmem0",
> >>>          "type": "cache-address-parity",
> >>>          "header": [ 3, 4]
> >>>      } }
> >>> ...
> >>> { "execute": "cxl-inject-correctable-error",
> >>>      "arguments": {
> >>>          "path": "/machine/peripheral/cxl-pmem0",
> >>>          "type": "physical",
> >>>          "header": [ 3, 4]
> >>>      } }
> >>>      
> >>
> >> So Dave reported that this wasn't working on x86 qemu machines.
> >>
> >> A fun bit of debugging later (I hate AML) and I think I have find the issue +
> >> have a hack to workaround it for now.
> >>
> >> So need some background.
> >> 1) CXL code is based on QEMU's pci expander bridge root bridge - there is a complex
> >>     bit of handling to create appropriate ACPI DSDT magic.
> >> 2) The CXL root port is based on pcie_root_port.c
> >> 3) Both CXL root port and pcie root port use traditional PCI interrupts, not MSI/MSIX
> >>     for their signaling.
> >> 4) Q35 machine uses an IOAPIC and the resulting PCI bus interrupt routing lands the
> >>     actual interrupt on line 23 for my particular configuration
> >> 5) The ACPI table says it's on line 11.
> >> 6) x86 code for creating the PRT has an informative comment...
> >> https://elixir.bootlin.com/qemu/latest/source/hw/i386/acpi-build.c#L697
> >>   * The main goal is to equaly distribute the interrupts
> >>   * over the 4 existing ACPI links (works only for i440fx).
> >>
> >> So the hack I'm running is below (note the UID thing is a separate bug that stops
> >> iasl from disassembling the DSDT due to a duplicate entry - I'll send out a fix
> >> for that shortly).
> >>
> >> There are a bunch of possible approaches to fix this if my identification of
> >> the issue is correct.
> >>
> >> 1) Clean equivalent of this hack that runs on appropriate machines only.
> >> 2) Use MSI instead. (ioh3420 root port takes this approach I think).  
> > 
> > This turned out to be trivial case of cut and pate.  So MSI support it is (mostly because it doesn't
> > involve fixing AML generation :)
> > 
> > Very lightly tested on one config of x86 machine and one of arm64.
> > I've not thought about this at all yet, so it's a direct copy of the ioh3420 with
> > appropriate renames for it being in the cxl_rp code.  
> 
> Hi Jonathan, do you have a qemu branch with the latest code? Maybe I'll 
> give it a try. Thanks.
> 

https://gitlab.com/jic23/qemu/-/tree/cxl-2022-11-17

should work... (famous last words).

Jonathan



> > 
> > 
> >  From a5e85d90cb734fc1de81e0d99e443747348e65e7 Mon Sep 17 00:00:00 2001
> > From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Date: Thu, 3 Nov 2022 13:10:24 +0000
> > Subject: [PATCH] hw: pci-bridge: cxl_root_port: Write up MSI
> > 
> > Done to avoid fixing ACPI rout description of traditional PCI interrupts on q35.
> > 
> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > ---
> >   hw/pci-bridge/cxl_root_port.c | 63 +++++++++++++++++++++++++++++++++++
> >   1 file changed, 63 insertions(+)
> > 
> > diff --git a/hw/pci-bridge/cxl_root_port.c b/hw/pci-bridge/cxl_root_port.c
> > index 1a9363a6de..a7475c427e 100644
> > --- a/hw/pci-bridge/cxl_root_port.c
> > +++ b/hw/pci-bridge/cxl_root_port.c
> > @@ -22,6 +22,7 @@
> >   #include "qemu/range.h"
> >   #include "hw/pci/pci_bridge.h"
> >   #include "hw/pci/pcie_port.h"
> > +#include "hw/pci/msi.h"
> >   #include "hw/qdev-properties.h"
> >   #include "hw/sysbus.h"
> >   #include "qapi/error.h"
> > @@ -29,6 +30,10 @@
> >   
> >   #define CXL_ROOT_PORT_DID 0x7075
> >   
> > +#define CXL_RP_MSI_OFFSET               0x60
> > +#define CXL_RP_MSI_SUPPORTED_FLAGS      PCI_MSI_FLAGS_MASKBIT
> > +#define CXL_RP_MSI_NR_VECTOR            2
> > +
> >   /* Copied from the gen root port which we derive */
> >   #define GEN_PCIE_ROOT_PORT_AER_OFFSET 0x100
> >   #define GEN_PCIE_ROOT_PORT_ACS_OFFSET \
> > @@ -36,6 +41,7 @@
> >   #define CXL_ROOT_PORT_DVSEC_OFFSET \
> >       (GEN_PCIE_ROOT_PORT_ACS_OFFSET + PCI_ACS_SIZEOF)
> >   
> > +
> >   typedef struct CXLRootPort {
> >       /*< private >*/
> >       PCIESlot parent_obj;
> > @@ -47,6 +53,50 @@ typedef struct CXLRootPort {
> >   #define TYPE_CXL_ROOT_PORT "cxl-rp"
> >   DECLARE_INSTANCE_CHECKER(CXLRootPort, CXL_ROOT_PORT, TYPE_CXL_ROOT_PORT)
> >   
> > +/*
> > + * If two MSI vector are allocated, Advanced Error Interrupt Message Number
> > + * is 1. otherwise 0.
> > + * 17.12.5.10 RPERRSTS,  32:27 bit Advanced Error Interrupt Message Number.
> > + */
> > +static uint8_t cxl_rp_aer_vector(const PCIDevice *d)
> > +{
> > +    switch (msi_nr_vectors_allocated(d)) {
> > +    case 1:
> > +        return 0;
> > +    case 2:
> > +        return 1;
> > +    case 4:
> > +    case 8:
> > +    case 16:
> > +    case 32:
> > +    default:
> > +        break;
> > +    }
> > +    abort();
> > +    return 0;
> > +}
> > +
> > +static int cxl_rp_interrupts_init(PCIDevice *d, Error **errp)
> > +{
> > +    int rc;
> > +
> > +    rc = msi_init(d, CXL_RP_MSI_OFFSET, CXL_RP_MSI_NR_VECTOR,
> > +                  CXL_RP_MSI_SUPPORTED_FLAGS & PCI_MSI_FLAGS_64BIT,
> > +                  CXL_RP_MSI_SUPPORTED_FLAGS & PCI_MSI_FLAGS_MASKBIT,
> > +                  errp);
> > +    if (rc < 0) {
> > +        assert(rc == -ENOTSUP);
> > +    }
> > +
> > +    return rc;
> > +}
> > +
> > +static void cxl_rp_interrupts_uninit(PCIDevice *d)
> > +{
> > +    msi_uninit(d);
> > +}
> > +
> > +
> >   static void latch_registers(CXLRootPort *crp)
> >   {
> >       uint32_t *reg_state = crp->cxl_cstate.crb.cache_mem_registers;
> > @@ -177,6 +227,15 @@ static void cxl_rp_dvsec_write_config(PCIDevice *dev, uint32_t addr,
> >       }
> >   }
> >   
> > +static void cxl_rp_aer_vector_update(PCIDevice *d)
> > +{
> > +    PCIERootPortClass *rpc = PCIE_ROOT_PORT_GET_CLASS(d);
> > +
> > +    if (rpc->aer_vector) {
> > +        pcie_aer_root_set_vector(d, rpc->aer_vector(d));
> > +    }
> > +}
> > +
> >   static void cxl_rp_write_config(PCIDevice *d, uint32_t address, uint32_t val,
> >                                   int len)
> >   {
> > @@ -203,6 +262,7 @@ static void cxl_rp_write_config(PCIDevice *d, uint32_t address, uint32_t val,
> >           }
> >       }
> >       pci_bridge_write_config(d, address, val, len);
> > +    cxl_rp_aer_vector_update(d);
> >       pcie_cap_flr_write_config(d, address, val, len);
> >       pcie_cap_slot_write_config(d, slt_ctl, slt_sta, address, val, len);
> >       pcie_aer_write_config(d, address, val, len);
> > @@ -229,6 +289,9 @@ static void cxl_root_port_class_init(ObjectClass *oc, void *data)
> >   
> >       rpc->aer_offset = GEN_PCIE_ROOT_PORT_AER_OFFSET;
> >       rpc->acs_offset = GEN_PCIE_ROOT_PORT_ACS_OFFSET;
> > +    rpc->aer_vector = cxl_rp_aer_vector;
> > +    rpc->interrupts_init = cxl_rp_interrupts_init;
> > +    rpc->interrupts_uninit = cxl_rp_interrupts_uninit;
> >   
> >       dc->hotpluggable = false;
> >   }  


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling
  2022-11-17 13:50                   ` Jonathan Cameron
@ 2022-11-18 17:15                     ` Dave Jiang
  0 siblings, 0 replies; 34+ messages in thread
From: Dave Jiang @ 2022-11-18 17:15 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, alison.schofield, vishal.l.verma, bwidawsk,
	dan.j.williams, shiju.jose, rrichter



On 11/17/2022 6:50 AM, Jonathan Cameron wrote:
> On Wed, 16 Nov 2022 16:20:20 -0700
> Dave Jiang <dave.jiang@intel.com> wrote:
> 
>> On 11/3/2022 6:27 AM, Jonathan Cameron wrote:
>>> On Thu, 3 Nov 2022 12:58:51 +0000
>>> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
>>>    
>>>> On Mon, 24 Oct 2022 17:01:02 +0100
>>>> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
>>>>   
>>>>> On Wed, 19 Oct 2022 10:38:13 -0700
>>>>> Dave Jiang <dave.jiang@intel.com> wrote:
>>>>>       
>>>>>> On 10/19/2022 10:30 AM, Jonathan Cameron wrote:
>>>>>>> On Tue, 11 Oct 2022 18:19:15 +0100
>>>>>>> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
>>>>>>>          
>>>>>>>> On Tue, 11 Oct 2022 08:18:34 -0700
>>>>>>>> Dave Jiang <dave.jiang@intel.com> wrote:
>>>>>>>>          
>>>>>>>>> On 10/11/2022 7:17 AM, Jonathan Cameron wrote:
>>>>>>>>>> On Fri, 16 Sep 2022 16:10:53 -0700
>>>>>>>>>> Dave Jiang <dave.jiang@intel.com> wrote:
>>>>>>>>>>               
>>>>>>>>>>> Series set to RFC since there's no means to test. Would like to get opinion
>>>>>>>>>>> on whether going with using trace events as reporting mechanism is ok.
>>>>>>>>>>>
>>>>>>>>>>> Jonathan,
>>>>>>>>>>> We currently don't have any ways to test AER events. Do you have any plans
>>>>>>>>>>> to support AER events via QEMU emulation?
>>>>>>>>>> Sorry - missed this entirely as gotten a bit behind reading CXL emails.
>>>>>>> Hi Dave,
>>>>>>>
>>>>>>> Quick update.
>>>>>>>
>>>>>>> Working QEMU emulation - but needs some/lots of cleanup. Particularly fun was
>>>>>>> figuring out why I wasn't getting messages past the upstream switch port.
>>>>>>> Turned out the serial number ECAP was on top of the AER ECAP. Oops - thankfully
>>>>>>> that patch isn't upstream yet.
>>>>>>> Also QEMU AER rooting seems to be based on some older PCIE spec
>>>>>>> so needed some tweaks to get the device to actually issue ERR_FATAL etc.
>>>>>>>
>>>>>>> Anyhow, should have something you can play with in a day or two.
>>>>>>
>>>>>> Awesome! Thanks! :)
>>>>>
>>>>> Took a little longer than expected..
>>>>>
>>>>> Anyhow, now at
>>>>> https://gitlab.com/jic23/qemu/-/commits/cxl-2022-10-24
>>>>>
>>>>> That tree is carrying far too many things right now for it make much sense
>>>>> to me to email this to qemu-devel - though I may pull
>>>>> hw/pci/aer: Add missing routing for AER errors
>>>>> out in advance as that's closing a spec different between QEMU emulation of AER
>>>>> and what the PCI spec says.
>>>>>
>>>>> Hopefully set of out of tree patches will start to shrink soon - v9 of the DOE
>>>>> patches have been on list for a week or so.
>>>>>
>>>>> Top patch includes a very short 'how to' in patch description.  Basically fire
>>>>> up QMP: Add something like -qmp tcp:localhost:444,server=on,wait=off to your
>>>>> qemu commandline and use commands like:
>>>>>
>>>>> { "execute": "qmp_capabilities" }
>>>>> ...
>>>>> { "execute": "cxl-inject-uncorrectable-error",
>>>>>       "arguments": {
>>>>>           "path": "/machine/peripheral/cxl-pmem0",
>>>>>           "type": "cache-address-parity",
>>>>>           "header": [ 3, 4]
>>>>>       } }
>>>>> ...
>>>>> { "execute": "cxl-inject-correctable-error",
>>>>>       "arguments": {
>>>>>           "path": "/machine/peripheral/cxl-pmem0",
>>>>>           "type": "physical",
>>>>>           "header": [ 3, 4]
>>>>>       } }
>>>>>       
>>>>
>>>> So Dave reported that this wasn't working on x86 qemu machines.
>>>>
>>>> A fun bit of debugging later (I hate AML) and I think I have find the issue +
>>>> have a hack to workaround it for now.
>>>>
>>>> So need some background.
>>>> 1) CXL code is based on QEMU's pci expander bridge root bridge - there is a complex
>>>>      bit of handling to create appropriate ACPI DSDT magic.
>>>> 2) The CXL root port is based on pcie_root_port.c
>>>> 3) Both CXL root port and pcie root port use traditional PCI interrupts, not MSI/MSIX
>>>>      for their signaling.
>>>> 4) Q35 machine uses an IOAPIC and the resulting PCI bus interrupt routing lands the
>>>>      actual interrupt on line 23 for my particular configuration
>>>> 5) The ACPI table says it's on line 11.
>>>> 6) x86 code for creating the PRT has an informative comment...
>>>> https://elixir.bootlin.com/qemu/latest/source/hw/i386/acpi-build.c#L697
>>>>    * The main goal is to equaly distribute the interrupts
>>>>    * over the 4 existing ACPI links (works only for i440fx).
>>>>
>>>> So the hack I'm running is below (note the UID thing is a separate bug that stops
>>>> iasl from disassembling the DSDT due to a duplicate entry - I'll send out a fix
>>>> for that shortly).
>>>>
>>>> There are a bunch of possible approaches to fix this if my identification of
>>>> the issue is correct.
>>>>
>>>> 1) Clean equivalent of this hack that runs on appropriate machines only.
>>>> 2) Use MSI instead. (ioh3420 root port takes this approach I think).
>>>
>>> This turned out to be trivial case of cut and pate.  So MSI support it is (mostly because it doesn't
>>> involve fixing AML generation :)
>>>
>>> Very lightly tested on one config of x86 machine and one of arm64.
>>> I've not thought about this at all yet, so it's a direct copy of the ioh3420 with
>>> appropriate renames for it being in the cxl_rp code.
>>
>> Hi Jonathan, do you have a qemu branch with the latest code? Maybe I'll
>> give it a try. Thanks.
>>
> 
> https://gitlab.com/jic23/qemu/-/tree/cxl-2022-11-17
> 
> should work... (famous last words).

Yup it works great! Got my code tested. Thank you!

> 
> Jonathan
> 
> 
> 
>>>
>>>
>>>   From a5e85d90cb734fc1de81e0d99e443747348e65e7 Mon Sep 17 00:00:00 2001
>>> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>> Date: Thu, 3 Nov 2022 13:10:24 +0000
>>> Subject: [PATCH] hw: pci-bridge: cxl_root_port: Write up MSI
>>>
>>> Done to avoid fixing ACPI rout description of traditional PCI interrupts on q35.
>>>
>>> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>> ---
>>>    hw/pci-bridge/cxl_root_port.c | 63 +++++++++++++++++++++++++++++++++++
>>>    1 file changed, 63 insertions(+)
>>>
>>> diff --git a/hw/pci-bridge/cxl_root_port.c b/hw/pci-bridge/cxl_root_port.c
>>> index 1a9363a6de..a7475c427e 100644
>>> --- a/hw/pci-bridge/cxl_root_port.c
>>> +++ b/hw/pci-bridge/cxl_root_port.c
>>> @@ -22,6 +22,7 @@
>>>    #include "qemu/range.h"
>>>    #include "hw/pci/pci_bridge.h"
>>>    #include "hw/pci/pcie_port.h"
>>> +#include "hw/pci/msi.h"
>>>    #include "hw/qdev-properties.h"
>>>    #include "hw/sysbus.h"
>>>    #include "qapi/error.h"
>>> @@ -29,6 +30,10 @@
>>>    
>>>    #define CXL_ROOT_PORT_DID 0x7075
>>>    
>>> +#define CXL_RP_MSI_OFFSET               0x60
>>> +#define CXL_RP_MSI_SUPPORTED_FLAGS      PCI_MSI_FLAGS_MASKBIT
>>> +#define CXL_RP_MSI_NR_VECTOR            2
>>> +
>>>    /* Copied from the gen root port which we derive */
>>>    #define GEN_PCIE_ROOT_PORT_AER_OFFSET 0x100
>>>    #define GEN_PCIE_ROOT_PORT_ACS_OFFSET \
>>> @@ -36,6 +41,7 @@
>>>    #define CXL_ROOT_PORT_DVSEC_OFFSET \
>>>        (GEN_PCIE_ROOT_PORT_ACS_OFFSET + PCI_ACS_SIZEOF)
>>>    
>>> +
>>>    typedef struct CXLRootPort {
>>>        /*< private >*/
>>>        PCIESlot parent_obj;
>>> @@ -47,6 +53,50 @@ typedef struct CXLRootPort {
>>>    #define TYPE_CXL_ROOT_PORT "cxl-rp"
>>>    DECLARE_INSTANCE_CHECKER(CXLRootPort, CXL_ROOT_PORT, TYPE_CXL_ROOT_PORT)
>>>    
>>> +/*
>>> + * If two MSI vector are allocated, Advanced Error Interrupt Message Number
>>> + * is 1. otherwise 0.
>>> + * 17.12.5.10 RPERRSTS,  32:27 bit Advanced Error Interrupt Message Number.
>>> + */
>>> +static uint8_t cxl_rp_aer_vector(const PCIDevice *d)
>>> +{
>>> +    switch (msi_nr_vectors_allocated(d)) {
>>> +    case 1:
>>> +        return 0;
>>> +    case 2:
>>> +        return 1;
>>> +    case 4:
>>> +    case 8:
>>> +    case 16:
>>> +    case 32:
>>> +    default:
>>> +        break;
>>> +    }
>>> +    abort();
>>> +    return 0;
>>> +}
>>> +
>>> +static int cxl_rp_interrupts_init(PCIDevice *d, Error **errp)
>>> +{
>>> +    int rc;
>>> +
>>> +    rc = msi_init(d, CXL_RP_MSI_OFFSET, CXL_RP_MSI_NR_VECTOR,
>>> +                  CXL_RP_MSI_SUPPORTED_FLAGS & PCI_MSI_FLAGS_64BIT,
>>> +                  CXL_RP_MSI_SUPPORTED_FLAGS & PCI_MSI_FLAGS_MASKBIT,
>>> +                  errp);
>>> +    if (rc < 0) {
>>> +        assert(rc == -ENOTSUP);
>>> +    }
>>> +
>>> +    return rc;
>>> +}
>>> +
>>> +static void cxl_rp_interrupts_uninit(PCIDevice *d)
>>> +{
>>> +    msi_uninit(d);
>>> +}
>>> +
>>> +
>>>    static void latch_registers(CXLRootPort *crp)
>>>    {
>>>        uint32_t *reg_state = crp->cxl_cstate.crb.cache_mem_registers;
>>> @@ -177,6 +227,15 @@ static void cxl_rp_dvsec_write_config(PCIDevice *dev, uint32_t addr,
>>>        }
>>>    }
>>>    
>>> +static void cxl_rp_aer_vector_update(PCIDevice *d)
>>> +{
>>> +    PCIERootPortClass *rpc = PCIE_ROOT_PORT_GET_CLASS(d);
>>> +
>>> +    if (rpc->aer_vector) {
>>> +        pcie_aer_root_set_vector(d, rpc->aer_vector(d));
>>> +    }
>>> +}
>>> +
>>>    static void cxl_rp_write_config(PCIDevice *d, uint32_t address, uint32_t val,
>>>                                    int len)
>>>    {
>>> @@ -203,6 +262,7 @@ static void cxl_rp_write_config(PCIDevice *d, uint32_t address, uint32_t val,
>>>            }
>>>        }
>>>        pci_bridge_write_config(d, address, val, len);
>>> +    cxl_rp_aer_vector_update(d);
>>>        pcie_cap_flr_write_config(d, address, val, len);
>>>        pcie_cap_slot_write_config(d, slt_ctl, slt_sta, address, val, len);
>>>        pcie_aer_write_config(d, address, val, len);
>>> @@ -229,6 +289,9 @@ static void cxl_root_port_class_init(ObjectClass *oc, void *data)
>>>    
>>>        rpc->aer_offset = GEN_PCIE_ROOT_PORT_AER_OFFSET;
>>>        rpc->acs_offset = GEN_PCIE_ROOT_PORT_ACS_OFFSET;
>>> +    rpc->aer_vector = cxl_rp_aer_vector;
>>> +    rpc->interrupts_init = cxl_rp_interrupts_init;
>>> +    rpc->interrupts_uninit = cxl_rp_interrupts_uninit;
>>>    
>>>        dc->hotpluggable = false;
>>>    }
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2022-11-18 17:15 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-16 23:10 [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling Dave Jiang
2022-09-16 23:11 ` [PATCH RFC v2 1/9] cxl/pci: Cleanup repeated code in cxl_probe_regs() helpers Dave Jiang
2022-09-16 23:11 ` [PATCH RFC v2 2/9] cxl/pci: Cleanup cxl_map_device_regs() Dave Jiang
2022-09-16 23:11 ` [PATCH RFC v2 3/9] cxl/pci: Kill cxl_map_regs() Dave Jiang
2022-10-18 13:43   ` Jonathan Cameron
2022-09-16 23:11 ` [PATCH RFC v2 4/9] cxl/core/regs: Make cxl_map_{component, device}_regs() device generic Dave Jiang
2022-09-16 23:11 ` [PATCH RFC v2 5/9] cxl/port: Limit the port driver to just the HDM Decoder Capability Dave Jiang
2022-10-20 16:54   ` Jonathan Cameron
2022-09-16 23:11 ` [PATCH RFC v2 6/9] cxl/pci: Prepare for mapping RAS Capability Structure Dave Jiang
2022-09-16 23:11 ` [PATCH RFC v2 7/9] cxl/pci: Find and map the " Dave Jiang
2022-09-16 23:11 ` [PATCH RFC v2 8/9] cxl/pci: add tracepoint events for CXL RAS Dave Jiang
2022-10-20 17:02   ` Jonathan Cameron
2022-10-20 17:07     ` Dave Jiang
2022-10-20 17:52       ` Steven Rostedt
2022-09-16 23:11 ` [PATCH RFC v2 9/9] cxl/pci: Add (hopeful) error handling support Dave Jiang
2022-10-20 13:45   ` Jonathan Cameron
2022-10-20 14:50     ` Dave Jiang
2022-10-20 14:03   ` Jonathan Cameron
2022-10-20 14:57     ` Dave Jiang
2022-10-20 15:52   ` Jonathan Cameron
2022-10-20 16:06     ` Dave Jiang
2022-10-20 16:11       ` Jonathan Cameron
2022-10-11 14:17 ` [PATCH RFC v2 0/9] cxl/pci: Add fundamental error handling Jonathan Cameron
2022-10-11 15:18   ` Dave Jiang
2022-10-11 17:19     ` Jonathan Cameron
2022-10-19 17:30       ` Jonathan Cameron
2022-10-19 17:38         ` Dave Jiang
2022-10-24 16:01           ` Jonathan Cameron
2022-10-25 15:22             ` Dave Jiang
2022-11-03 12:58             ` Jonathan Cameron
2022-11-03 13:27               ` Jonathan Cameron
2022-11-16 23:20                 ` Dave Jiang
2022-11-17 13:50                   ` Jonathan Cameron
2022-11-18 17:15                     ` Dave Jiang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).