From: Dan Williams <>
Subject: [PATCH 0/8] cxl/pci: Add fundamental error handling
Date: Tue, 15 Mar 2022 21:13:42 -0700

Add a 'struct pci_error_handlers' instance for the cxl_pci driver.
Section "CXL RAS Capability Structure" of the CXL 2.0
specification defines the error sources considered in this
implementation. The RAS Capability Structure defines protocol, link and
internal errors which are distinct from memory poison errors that are
conveyed via direct consumption and/or media scanning.

The errors reported by the RAS registers are categorized into
correctable and uncorrectable errors, where the uncorrectable errors are
optionally steered to either fatal or non-fatal AER events. Table 224
"Device Specific Error Reporting and Nomenclature Guidelines" in the CXL
2.0 specification outlines that the remediation for uncorrectable errors
is a reset to recover. This matches how the Linux PCIe AER core treats
uncorrectable errors as occasions to reset the device to recover.
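
As a rough illustration of the categorization above, the following
userspace sketch picks the most severe unmasked error from a set of
status words. The structure layout, field names, and the sketch_*
helpers are hypothetical and are not taken from this series or from the
specification; only the idea that a severity setting steers each
uncorrectable error to a fatal or non-fatal event comes from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the RAS status words; not the real layout. */
struct sketch_ras_regs {
	uint32_t uncor_status;   /* one bit per uncorrectable error source */
	uint32_t uncor_mask;     /* set bit: source is not reported */
	uint32_t uncor_severity; /* set bit: fatal, clear bit: non-fatal */
	uint32_t cor_status;     /* one bit per correctable error source */
	uint32_t cor_mask;       /* set bit: source is not reported */
};

enum sketch_severity {
	SKETCH_NONE,
	SKETCH_CORRECTABLE,
	SKETCH_NONFATAL,
	SKETCH_FATAL,
};

/* Return the most severe unmasked error currently latched. */
static enum sketch_severity sketch_classify(const struct sketch_ras_regs *r)
{
	uint32_t uncor, cor;

	assert(r);
	uncor = r->uncor_status & ~r->uncor_mask;
	cor = r->cor_status & ~r->cor_mask;

	if (uncor & r->uncor_severity)
		return SKETCH_FATAL;
	if (uncor)
		return SKETCH_NONFATAL;
	if (cor)
		return SKETCH_CORRECTABLE;
	return SKETCH_NONE;
}
```

In this model both SKETCH_NONFATAL and SKETCH_FATAL would lead to a
reset, while SKETCH_CORRECTABLE would only be logged.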

While the specification notes "CXL Reset" or "Secondary Bus Reset" as
theoretical recovery options, they are not feasible in practice since
in-flight CXL.mem operations may not terminate and may cause knock-on
system fatal events. Reset is only reliable for recovering CXL.io; it is
not reliable for recovering CXL.mem. Assuming the system survives, a
reset causes CXL.mem operations to restart from scratch.

The "ECN: Error Isolation on CXL.mem and CXL.cache" [1] document
recognizes the CXL Reset vs CXL.mem operational conflict and helps to at
least provide a mechanism for the Root Port to terminate in-flight
CXL.mem operations with completions. That still poses problems in
practice if the kernel is running out of "System RAM" backed by the CXL
device and poison is used to convey the data lost to the protocol error.

Regardless of whether the reset and restart of CXL.mem operations is
feasible / successful, the logging is still useful. So, the
implementation reads, reports, and clears the status in the RAS
Capability Structure registers, and it notifies the 'struct cxl_memdev'
associated with the given PCIe endpoint to reattach to its driver over
the reset so that the HDM decoder configuration can be reconstructed.
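
The read, report, and clear sequence described above can be sketched in
userspace as follows. This is a minimal model, not the code from the
series: it assumes the status register has write-1-to-clear semantics
(the text does not say so), and the sketch_* names are hypothetical
stand-ins for real MMIO accessors.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a write-1-to-clear status register. */
struct sketch_rw1c {
	uint32_t val;
};

static uint32_t sketch_read(const struct sketch_rw1c *reg)
{
	return reg->val;
}

/* Writing a 1 clears that bit; writing a 0 leaves it untouched. */
static void sketch_write(struct sketch_rw1c *reg, uint32_t v)
{
	reg->val &= ~v;
}

/*
 * Read the latched status, report it, then write back exactly the bits
 * that were observed, so that an error latched after the read is not
 * cleared unseen. Returns the status that was reported.
 */
static uint32_t sketch_report_and_clear(struct sketch_rw1c *reg,
					const char *name)
{
	uint32_t status;

	assert(reg && name);
	status = sketch_read(reg);
	if (status) {
		fprintf(stderr, "%s: status %#010x\n", name,
			(unsigned int)status);
		sketch_write(reg, status);
	}
	return status;
}
```

Writing back the observed value, rather than all ones, is the design
choice worth noting: it keeps any bit that latches between the read and
the write pending for the next pass.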

The first half of the series reworks component register mapping so that
the cxl_pci driver can own the RAS Capability while the cxl_port driver
continues to own the HDM Decoder Capability. The last half implements
the RAS Capability Structure mapping and reporting via 'struct
pci_error_handlers'.


Dan Williams (8):
      cxl/pci: Cleanup repeated code in cxl_probe_regs() helpers
      cxl/pci: Cleanup cxl_map_device_regs()
      cxl/pci: Kill cxl_map_regs()
      cxl/core/regs: Make cxl_map_{component,device}_regs() device generic
      cxl/port: Limit the port driver to just the HDM Decoder Capability
      cxl/pci: Prepare for mapping RAS Capability Structure
      cxl/pci: Find and map the RAS Capability Structure
      cxl/pci: Add (hopeful) error handling support

 drivers/cxl/core/hdm.c    |   33 +++++----
 drivers/cxl/core/memdev.c |    1 
 drivers/cxl/core/pci.c    |    3 -
 drivers/cxl/core/port.c   |    2 -
 drivers/cxl/core/regs.c   |  172 ++++++++++++++++++++++++++-------------------
 drivers/cxl/cxl.h         |   36 +++++++--
 drivers/cxl/cxlmem.h      |    2 +
 drivers/cxl/cxlpci.h      |    9 --
 drivers/cxl/pci.c         |  163 ++++++++++++++++++++++++++++++++-----------
 9 files changed, 273 insertions(+), 148 deletions(-)

base-commit: 74be98774dfbc5b8b795db726bd772e735d2edd4


Thread overview: 26+ messages
2022-03-16  4:13 Dan Williams [this message]
2022-03-16  4:13 ` [PATCH 1/8] cxl/pci: Cleanup repeated code in cxl_probe_regs() helpers Dan Williams
2022-03-17 10:02   ` Jonathan Cameron
2022-03-16  4:13 ` [PATCH 2/8] cxl/pci: Cleanup cxl_map_device_regs() Dan Williams
2022-03-17 10:07   ` Jonathan Cameron
2022-03-18 17:13     ` Dan Williams
2022-03-16  4:13 ` [PATCH 3/8] cxl/pci: Kill cxl_map_regs() Dan Williams
2022-03-17 10:09   ` Jonathan Cameron
2022-03-18 17:08     ` Dan Williams
2022-03-16  4:14 ` [PATCH 4/8] cxl/core/regs: Make cxl_map_{component,device}_regs() device generic Dan Williams
2022-03-17 10:25   ` Jonathan Cameron
2022-03-18 17:06     ` Dan Williams
2022-03-16  4:14 ` [PATCH 5/8] cxl/port: Limit the port driver to just the HDM Decoder Capability Dan Williams
2022-03-17 10:48   ` Jonathan Cameron
2022-03-16  4:14 ` [PATCH 6/8] cxl/pci: Prepare for mapping RAS Capability Structure Dan Williams
2022-03-17 10:56   ` Jonathan Cameron
2022-03-18 19:51     ` Dan Williams
2022-03-17 17:32   ` Ben Widawsky
2022-03-18 16:19     ` Dan Williams
2022-03-16  4:14 ` [PATCH 7/8] cxl/pci: Find and map the RAS Capability Structure Dan Williams
2022-03-17 15:10   ` Jonathan Cameron
2022-03-16  4:14 ` [PATCH 8/8] cxl/pci: Add (hopeful) error handling support Dan Williams
2022-03-17 15:16   ` Jonathan Cameron
2022-03-18  9:41   ` Shiju Jose
2022-04-24 22:15     ` Dan Williams
2022-03-16  4:23 ` [PATCH 0/8] cxl/pci: Add fundamental error handling Dan Williams
