* [PATCH v3 0/6] cxl/pci: Add support for RCH RAS error handling @ 2023-04-11 18:02 Terry Bowman 2023-04-11 18:02 ` [PATCH v3 1/6] cxl/pci: Add RCH downstream port AER and RAS register discovery Terry Bowman ` (5 more replies) 0 siblings, 6 replies; 52+ messages in thread From: Terry Bowman @ 2023-04-11 18:02 UTC (permalink / raw) To: alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, Jonathan.Cameron, linux-cxl Cc: terry.bowman, rrichter, linux-kernel, bhelgaas This patchset adds error handling support for restricted CXL host (RCH) downstream ports. This is necessary because RCH downstream ports are implemented in RCRBs and report protocol errors through a root complex event collector (RCEC). The RCH error reporting flow is not currently supported by the CXL driver and will be added by this patchset. The first patch discovers the RCH dport AER and RAS registers. These will be mapped later and used in CXL driver error logging. The second patch exports cper_mem_err_unpack(). cper_mem_err_unpack() is a dependency for using the cper_print_aer() AER trace logging. The third patch exports cper_print_aer(). cper_print_aer() is used for CXL AER error logging because it provides a common format for logging into dmesg. The fourth patch maps the AER and RAS registers. This patch also adds the RCH handler for logging downstream port AER and RAS information. The fifth patch is AER port driver changes forwarding RCH errors to the RCiEP RCH handler. The sixth patch enables internal AER errors for RCEC's with CXL RCiEPs. The CONFIG_PCIEAER_CXL kernel option is introduced to enable this logic. Changes in V3: - Correct base commit in cover sheet. - Change hardcoded return 0 to NULL in regs.c. - Remove calls to pci_disable_pcie_error_reporting(pdev) and pci_enable_pcie_error_reporting(pdev) in mem.c; - Move RCEC interrupt unmask to PCIe port AER driver's probe. - Fixes missing PCIEAER and PCIEPORTBUS config option error. 
- Rename cxl_rcrb_setup() to cxl_setup_rcrb() in mem.c. - Update cper_mem_err_unpack() patch subject and description. Changes in V2: - Refactor RCH initialization into cxl_mem driver. - Includes RCH RAS and AER register discovery and mapping. - Add RCEC protocol error interrupt forwarding to CXL endpoint handler. - Change AER and RAS logging to use existing trace routines. - Enable RCEC AER internal errors. Robert Richter (2): PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler PCI/AER: Unmask RCEC internal errors to enable RCH downstream port error handling Terry Bowman (4): cxl/pci: Add RCH downstream port AER and RAS register discovery efi/cper: Export cper_mem_err_unpack() for use by modules PCI/AER: Export cper_print_aer() for use by modules cxl/pci: Add RCH downstream port error logging drivers/cxl/core/pci.c | 126 +++++++++++++++++++++++++++++---- drivers/cxl/core/regs.c | 94 +++++++++++++++++++++---- drivers/cxl/cxl.h | 18 +++++ drivers/cxl/mem.c | 110 ++++++++++++++++++++++++++--- drivers/firmware/efi/cper.c | 1 + drivers/pci/pcie/Kconfig | 8 +++ drivers/pci/pcie/aer.c | 135 ++++++++++++++++++++++++++++++++++++ 7 files changed, 457 insertions(+), 35 deletions(-) base-commit: ca712e47054678c5ce93a0e0f686353ad5561195 -- 2.34.1 ^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v3 1/6] cxl/pci: Add RCH downstream port AER and RAS register discovery 2023-04-11 18:02 [PATCH v3 0/6] cxl/pci: Add support for RCH RAS error handling Terry Bowman @ 2023-04-11 18:02 ` Terry Bowman 2023-04-13 15:30 ` Jonathan Cameron 2023-04-17 23:00 ` Dan Williams 2023-04-11 18:02 ` [PATCH v3 2/6] efi/cper: Export cper_mem_err_unpack() for use by modules Terry Bowman ` (4 subsequent siblings) 5 siblings, 2 replies; 52+ messages in thread From: Terry Bowman @ 2023-04-11 18:02 UTC (permalink / raw) To: alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, Jonathan.Cameron, linux-cxl Cc: terry.bowman, rrichter, linux-kernel, bhelgaas Restricted CXL host (RCH) downstream port AER information is not currently logged while in the error state. One problem preventing existing PCIe AER functions from logging errors is the AER registers are not accessible. The CXL driver requires changes to find RCH downstream port AER registers for purpose of error logging. RCH downstream ports are not enumerated during a PCI bus scan and are instead discovered using system firmware, ACPI in this case.[1] The downstream port is implemented as a Root Complex Register Block (RCRB). The RCRB is a 4k memory block containing PCIe registers based on the PCIe root port.[2] The RCRB includes AER extended capability registers used for reporting errors. Note, the RCH's AER Capability is located in the RCRB memory space instead of PCI configuration space, thus its register access is different. Existing kernel PCIe AER functions can not be used to manage the downstream port AER capabilities because the port was not enumerated during PCI scan and the registers are not PCI config accessible. Discover RCH downstream port AER extended capability registers. This requires using MMIO accesses to search for extended AER capability in RCRB register space. 
[1] CXL 3.0 Spec, 9.11.2 - System Firmware View of CXL 1.1 Hierarchy [2] CXL 3.0 Spec, 8.2.1.1 - RCH Downstream Port RCRB Co-developed-by: Robert Richter <rrichter@amd.com> Signed-off-by: Robert Richter <rrichter@amd.com> Signed-off-by: Terry Bowman <terry.bowman@amd.com> --- drivers/cxl/core/regs.c | 93 +++++++++++++++++++++++++++++++++++------ drivers/cxl/cxl.h | 5 +++ drivers/cxl/mem.c | 39 +++++++++++------ 3 files changed, 113 insertions(+), 24 deletions(-) diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c index 1476a0299c9b..bde1fffab09e 100644 --- a/drivers/cxl/core/regs.c +++ b/drivers/cxl/core/regs.c @@ -332,10 +332,36 @@ int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type, } EXPORT_SYMBOL_NS_GPL(cxl_find_regblock, CXL); +static void __iomem *cxl_map_reg(struct device *dev, struct cxl_register_map *map, + char *name) +{ + + if (!request_mem_region(map->resource, map->max_size, name)) + return NULL; + + map->base = ioremap(map->resource, map->max_size); + if (!map->base) { + release_mem_region(map->resource, map->max_size); + return NULL; + } + + return map->base; +} + +static void cxl_unmap_reg(struct device *dev, struct cxl_register_map *map) +{ + iounmap(map->base); + release_mem_region(map->resource, map->max_size); +} + resource_size_t cxl_rcrb_to_component(struct device *dev, resource_size_t rcrb, enum cxl_rcrb which) { + struct cxl_register_map map = { + .resource = rcrb, + .max_size = SZ_4K + }; resource_size_t component_reg_phys; void __iomem *addr; u32 bar0, bar1; @@ -343,7 +369,10 @@ resource_size_t cxl_rcrb_to_component(struct device *dev, u32 id; if (which == CXL_RCRB_UPSTREAM) - rcrb += SZ_4K; + map.resource += SZ_4K; + + if (!cxl_map_reg(dev, &map, "CXL RCRB")) + return CXL_RESOURCE_NONE; /* * RCRB's BAR[0..1] point to component block containing CXL @@ -351,21 +380,12 @@ resource_size_t cxl_rcrb_to_component(struct device *dev, * the PCI Base spec here, esp. 
64 bit extraction and memory * ranges alignment (6.0, 7.5.1.2.1). */ - if (!request_mem_region(rcrb, SZ_4K, "CXL RCRB")) - return CXL_RESOURCE_NONE; - addr = ioremap(rcrb, SZ_4K); - if (!addr) { - dev_err(dev, "Failed to map region %pr\n", addr); - release_mem_region(rcrb, SZ_4K); - return CXL_RESOURCE_NONE; - } - + addr = map.base; id = readl(addr + PCI_VENDOR_ID); cmd = readw(addr + PCI_COMMAND); bar0 = readl(addr + PCI_BASE_ADDRESS_0); bar1 = readl(addr + PCI_BASE_ADDRESS_1); - iounmap(addr); - release_mem_region(rcrb, SZ_4K); + cxl_unmap_reg(dev, &map); /* * Sanity check, see CXL 3.0 Figure 9-8 CXL Device that Does Not @@ -396,3 +416,52 @@ resource_size_t cxl_rcrb_to_component(struct device *dev, return component_reg_phys; } EXPORT_SYMBOL_NS_GPL(cxl_rcrb_to_component, CXL); + +u16 cxl_rcrb_to_aer(struct device *dev, resource_size_t rcrb) +{ + struct cxl_register_map map = { + .resource = rcrb, + .max_size = SZ_4K, + }; + u32 cap_hdr; + u16 offset = 0; + + if (!cxl_map_reg(dev, &map, "CXL RCRB")) + return 0; + + cap_hdr = readl(map.base + offset); + while (PCI_EXT_CAP_ID(cap_hdr) != PCI_EXT_CAP_ID_ERR) { + + offset = PCI_EXT_CAP_NEXT(cap_hdr); + if (!offset) { + cxl_unmap_reg(dev, &map); + return 0; + } + cap_hdr = readl(map.base + offset); + } + + dev_dbg(dev, "found AER extended capability (0x%x)\n", offset); + cxl_unmap_reg(dev, &map); + + return offset; +} +EXPORT_SYMBOL_NS_GPL(cxl_rcrb_to_aer, CXL); + +u16 cxl_component_to_ras(struct device *dev, resource_size_t component_reg_phys) +{ + struct cxl_register_map map = { + .resource = component_reg_phys, + .max_size = CXL_COMPONENT_REG_BLOCK_SIZE, + }; + + if (!cxl_map_reg(dev, &map, "component")) + return 0; + + cxl_probe_component_regs(dev, map.base, &map.component_map); + cxl_unmap_reg(dev, &map); + if (!map.component_map.ras.valid) + return 0; + + return map.component_map.ras.offset; +} +EXPORT_SYMBOL_NS_GPL(cxl_component_to_ras, CXL); diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h index 
044a92d9813e..df64c402e6e6 100644 --- a/drivers/cxl/cxl.h +++ b/drivers/cxl/cxl.h @@ -270,6 +270,9 @@ enum cxl_rcrb { resource_size_t cxl_rcrb_to_component(struct device *dev, resource_size_t rcrb, enum cxl_rcrb which); +u16 cxl_rcrb_to_aer(struct device *dev, resource_size_t rcrb); +u16 cxl_component_to_ras(struct device *dev, + resource_size_t component_reg_phys); #define CXL_RESOURCE_NONE ((resource_size_t) -1) #define CXL_TARGET_STRLEN 20 @@ -601,6 +604,8 @@ struct cxl_dport { int port_id; resource_size_t component_reg_phys; resource_size_t rcrb; + u16 aer_cap; + u16 ras_cap; bool rch; struct cxl_port *port; }; diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c index 39c4b54f0715..014295ab6bc6 100644 --- a/drivers/cxl/mem.c +++ b/drivers/cxl/mem.c @@ -45,13 +45,36 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data) return 0; } +static void cxl_setup_rcrb(struct cxl_dev_state *cxlds, + struct cxl_dport *parent_dport) +{ + struct cxl_memdev *cxlmd = cxlds->cxlmd; + + if (!parent_dport->rch) + return; + + /* + * The component registers for an RCD might come from the + * host-bridge RCRB if they are not already mapped via the + * typical register locator mechanism. 
+ */ + if (cxlds->component_reg_phys == CXL_RESOURCE_NONE) + cxlds->component_reg_phys = cxl_rcrb_to_component( + &cxlmd->dev, parent_dport->rcrb, CXL_RCRB_UPSTREAM); + + parent_dport->aer_cap = cxl_rcrb_to_aer(parent_dport->dport, + parent_dport->rcrb); + + parent_dport->ras_cap = cxl_component_to_ras(parent_dport->dport, + parent_dport->component_reg_phys); +} + static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd, struct cxl_dport *parent_dport) { struct cxl_port *parent_port = parent_dport->port; struct cxl_dev_state *cxlds = cxlmd->cxlds; struct cxl_port *endpoint, *iter, *down; - resource_size_t component_reg_phys; int rc; /* @@ -66,17 +89,9 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd, ep->next = down; } - /* - * The component registers for an RCD might come from the - * host-bridge RCRB if they are not already mapped via the - * typical register locator mechanism. - */ - if (parent_dport->rch && cxlds->component_reg_phys == CXL_RESOURCE_NONE) - component_reg_phys = cxl_rcrb_to_component( - &cxlmd->dev, parent_dport->rcrb, CXL_RCRB_UPSTREAM); - else - component_reg_phys = cxlds->component_reg_phys; - endpoint = devm_cxl_add_port(host, &cxlmd->dev, component_reg_phys, + cxl_setup_rcrb(cxlds, parent_dport); + + endpoint = devm_cxl_add_port(host, &cxlmd->dev, cxlds->component_reg_phys, parent_dport); if (IS_ERR(endpoint)) return PTR_ERR(endpoint); -- 2.34.1 ^ permalink raw reply related [flat|nested] 52+ messages in thread
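Editorial sketch (not part of the posted patch): the capability walk that cxl_rcrb_to_aer() performs over MMIO can be tried out in user space against a RAM copy of a register block. The field macros below mirror the layout of PCI_EXT_CAP_ID() and PCI_EXT_CAP_NEXT() in include/uapi/linux/pci_regs.h; fake_readl(), make_ext_cap_hdr(), find_ext_cap() and the hop limit are hypothetical names and additions for illustration only. As in the patch, the walk starts at offset 0 and treats a next pointer of 0 as the end of the list.

```c
#include <stddef.h>
#include <stdint.h>

/* Local copies of the extended capability header layout (pci_regs.h). */
#define EXT_CAP_ID(hdr)   ((hdr) & 0x0000ffff)
#define EXT_CAP_NEXT(hdr) (((hdr) >> 20) & 0xffc)
#define EXT_CAP_ID_ERR    0x0001 /* Advanced Error Reporting */

/* Hypothetical stand-in for readl(): read one dword from a RAM copy. */
static uint32_t fake_readl(const uint32_t *regs, uint16_t offset)
{
	return regs[offset / 4];
}

/* Hypothetical helper: build a version 1 header with the given ID and next pointer. */
static uint32_t make_ext_cap_hdr(uint16_t id, uint16_t next)
{
	return (uint32_t)id | (1u << 16) | ((uint32_t)(next & 0xffc) << 20);
}

/*
 * Walk the capability list starting at offset 0 and return the offset of
 * the first header whose ID matches cap_id, or 0 if none is found.
 * The bounds check and hop limit are assumptions added for this sketch;
 * the posted patch does not have them.
 */
static uint16_t find_ext_cap(const uint32_t *regs, size_t size, uint16_t cap_id)
{
	size_t max_hops = size / 4;
	uint16_t offset = 0;
	uint32_t hdr;

	if (size < 4)
		return 0;

	hdr = fake_readl(regs, offset);
	while (EXT_CAP_ID(hdr) != cap_id) {
		offset = EXT_CAP_NEXT(hdr);
		if (!offset || (size_t)offset + 4 > size || !max_hops--)
			return 0;
		hdr = fake_readl(regs, offset);
	}

	return offset;
}
```

With a 4K buffer whose headers chain 0x000 to 0x100 to 0x200 and an AER header at 0x200, find_ext_cap(buf, 4096, EXT_CAP_ID_ERR) returns 0x200. A return of 0 means "not found", matching the convention used by cxl_rcrb_to_aer().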
* Re: [PATCH v3 1/6] cxl/pci: Add RCH downstream port AER and RAS register discovery 2023-04-11 18:02 ` [PATCH v3 1/6] cxl/pci: Add RCH downstream port AER and RAS register discovery Terry Bowman @ 2023-04-13 15:30 ` Jonathan Cameron 2023-04-13 19:13 ` Terry Bowman 2023-04-17 23:00 ` Dan Williams 1 sibling, 1 reply; 52+ messages in thread From: Jonathan Cameron @ 2023-04-13 15:30 UTC (permalink / raw) To: Terry Bowman Cc: alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, linux-cxl, rrichter, linux-kernel, bhelgaas On Tue, 11 Apr 2023 13:02:57 -0500 Terry Bowman <terry.bowman@amd.com> wrote: > Restricted CXL host (RCH) downstream port AER information is not currently > logged while in the error state. One problem preventing existing PCIe AER > functions from logging errors is the AER registers are not accessible. The > CXL driver requires changes to find RCH downstream port AER registers for > purpose of error logging. > > RCH downstream ports are not enumerated during a PCI bus scan and are > instead discovered using system firmware, ACPI in this case.[1] The > downstream port is implemented as a Root Complex Register Block (RCRB). > The RCRB is a 4k memory block containing PCIe registers based on the PCIe > root port.[2] The RCRB includes AER extended capability registers used for > reporting errors. Note, the RCH's AER Capability is located in the RCRB > memory space instead of PCI configuration space, thus its register access > is different. Existing kernel PCIe AER functions can not be used to manage > the downstream port AER capabilities because the port was not enumerated > during PCI scan and the registers are not PCI config accessible. > > Discover RCH downstream port AER extended capability registers. This > requires using MMIO accesses to search for extended AER capability in > RCRB register space. 
> > [1] CXL 3.0 Spec, 9.11.2 - System Firmware View of CXL 1.1 Hierarchy > [2] CXL 3.0 Spec, 8.2.1.1 - RCH Downstream Port RCRB > > Co-developed-by: Robert Richter <rrichter@amd.com> > Signed-off-by: Robert Richter <rrichter@amd.com> > Signed-off-by: Terry Bowman <terry.bowman@amd.com> Hi Terry, Sorry I missed the first few versions. Playing catch up. A few minor comments only inline. > --- > drivers/cxl/core/regs.c | 93 +++++++++++++++++++++++++++++++++++------ > drivers/cxl/cxl.h | 5 +++ > drivers/cxl/mem.c | 39 +++++++++++------ > 3 files changed, 113 insertions(+), 24 deletions(-) > > diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c > index 1476a0299c9b..bde1fffab09e 100644 > --- a/drivers/cxl/core/regs.c > +++ b/drivers/cxl/core/regs.c > @@ -332,10 +332,36 @@ int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type, > } > EXPORT_SYMBOL_NS_GPL(cxl_find_regblock, CXL); > > +static void __iomem *cxl_map_reg(struct device *dev, struct cxl_register_map *map, > + char *name) dev isn't used. > +{ > + Trivial but no point in blank line here. > + if (!request_mem_region(map->resource, map->max_size, name)) > + return NULL; > + > + map->base = ioremap(map->resource, map->max_size); > + if (!map->base) { > + release_mem_region(map->resource, map->max_size); > + return NULL; > + } > + > + return map->base; Why return a value you've already stashed in map->base? > +} > + This is similar enough to devm_cxl_iomap_block() that I'd kind of like them to take the same parameters. That would mean moving the map structure outside of the calls and instead passing in the 3 relevant parameters. Perhaps not worth it. > +static void cxl_unmap_reg(struct device *dev, struct cxl_register_map *map) > +{ dev isn't used here either. Makes little sense to pass it in to either function.
> + iounmap(map->base); > + release_mem_region(map->resource, map->max_size); > +} > + > resource_size_t cxl_rcrb_to_component(struct device *dev, > resource_size_t rcrb, > enum cxl_rcrb which) > { > + struct cxl_register_map map = { > + .resource = rcrb, > + .max_size = SZ_4K > + }; > resource_size_t component_reg_phys; > void __iomem *addr; > u32 bar0, bar1; > @@ -343,7 +369,10 @@ resource_size_t cxl_rcrb_to_component(struct device *dev, > u32 id; > > if (which == CXL_RCRB_UPSTREAM) > - rcrb += SZ_4K; > + map.resource += SZ_4K; > + > + if (!cxl_map_reg(dev, &map, "CXL RCRB")) > + return CXL_RESOURCE_NONE; > > /* > * RCRB's BAR[0..1] point to component block containing CXL > @@ -351,21 +380,12 @@ resource_size_t cxl_rcrb_to_component(struct device *dev, > * the PCI Base spec here, esp. 64 bit extraction and memory > * ranges alignment (6.0, 7.5.1.2.1). > */ > - if (!request_mem_region(rcrb, SZ_4K, "CXL RCRB")) > - return CXL_RESOURCE_NONE; > - addr = ioremap(rcrb, SZ_4K); > - if (!addr) { > - dev_err(dev, "Failed to map region %pr\n", addr); > - release_mem_region(rcrb, SZ_4K); > - return CXL_RESOURCE_NONE; > - } > - > + addr = map.base; I'd have preferred to see this refactor as a precursor patch to the 'real changes' that follow. > id = readl(addr + PCI_VENDOR_ID); > cmd = readw(addr + PCI_COMMAND); > bar0 = readl(addr + PCI_BASE_ADDRESS_0); > bar1 = readl(addr + PCI_BASE_ADDRESS_1); > - iounmap(addr); > - release_mem_region(rcrb, SZ_4K); > + cxl_unmap_reg(dev, &map); > > /* > * Sanity check, see CXL 3.0 Figure 9-8 CXL Device that Does Not > @@ -396,3 +416,52 @@ resource_size_t cxl_rcrb_to_component(struct device *dev, > return component_reg_phys; > } > EXPORT_SYMBOL_NS_GPL(cxl_rcrb_to_component, CXL); ... 
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h > index 044a92d9813e..df64c402e6e6 100644 > --- a/drivers/cxl/cxl.h > +++ b/drivers/cxl/cxl.h > @@ -270,6 +270,9 @@ enum cxl_rcrb { > resource_size_t cxl_rcrb_to_component(struct device *dev, > resource_size_t rcrb, > enum cxl_rcrb which); > +u16 cxl_rcrb_to_aer(struct device *dev, resource_size_t rcrb); > +u16 cxl_component_to_ras(struct device *dev, > + resource_size_t component_reg_phys); > > #define CXL_RESOURCE_NONE ((resource_size_t) -1) > #define CXL_TARGET_STRLEN 20 > @@ -601,6 +604,8 @@ struct cxl_dport { > int port_id; > resource_size_t component_reg_phys; > resource_size_t rcrb; > + u16 aer_cap; > + u16 ras_cap; This structure has kernel-doc that needs to be updated for these new entries. > bool rch; > struct cxl_port *port; > }; > diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c > index 39c4b54f0715..014295ab6bc6 100644 > --- a/drivers/cxl/mem.c > +++ b/drivers/cxl/mem.c > @@ -45,13 +45,36 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data) > return 0; > } > > +static void cxl_setup_rcrb(struct cxl_dev_state *cxlds, > + struct cxl_dport *parent_dport) > +{ > + struct cxl_memdev *cxlmd = cxlds->cxlmd; extra space before = > + > + if (!parent_dport->rch) > + return; > + > + /* > + * The component registers for an RCD might come from the > + * host-bridge RCRB if they are not already mapped via the > + * typical register locator mechanism. 
> + */ > + if (cxlds->component_reg_phys == CXL_RESOURCE_NONE) > + cxlds->component_reg_phys = cxl_rcrb_to_component( > + &cxlmd->dev, parent_dport->rcrb, CXL_RCRB_UPSTREAM); > + > + parent_dport->aer_cap = cxl_rcrb_to_aer(parent_dport->dport, > + parent_dport->rcrb); > + > + parent_dport->ras_cap = cxl_component_to_ras(parent_dport->dport, > + parent_dport->component_reg_phys); > +} > + > static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd, > struct cxl_dport *parent_dport) > { > struct cxl_port *parent_port = parent_dport->port; > struct cxl_dev_state *cxlds = cxlmd->cxlds; > struct cxl_port *endpoint, *iter, *down; > - resource_size_t component_reg_phys; > int rc; > > /* > @@ -66,17 +89,9 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd, > ep->next = down; > } > > - /* > - * The component registers for an RCD might come from the > - * host-bridge RCRB if they are not already mapped via the > - * typical register locator mechanism. > - */ > - if (parent_dport->rch && cxlds->component_reg_phys == CXL_RESOURCE_NONE) > - component_reg_phys = cxl_rcrb_to_component( > - &cxlmd->dev, parent_dport->rcrb, CXL_RCRB_UPSTREAM); > - else > - component_reg_phys = cxlds->component_reg_phys; > - endpoint = devm_cxl_add_port(host, &cxlmd->dev, component_reg_phys, > + cxl_setup_rcrb(cxlds, parent_dport); > + > + endpoint = devm_cxl_add_port(host, &cxlmd->dev, cxlds->component_reg_phys, > parent_dport); As above, I'd prefer to see this refactor done in a precursor patch before the new stuff is added. I like reviewing noop patches as I don't have to think much (so can do it when I'm supposedly in a meeting ;) Jonathan > if (IS_ERR(endpoint)) > return PTR_ERR(endpoint); ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 1/6] cxl/pci: Add RCH downstream port AER and RAS register discovery 2023-04-13 15:30 ` Jonathan Cameron @ 2023-04-13 19:13 ` Terry Bowman 2023-04-14 11:47 ` Jonathan Cameron 2023-04-14 11:51 ` Robert Richter 0 siblings, 2 replies; 52+ messages in thread From: Terry Bowman @ 2023-04-13 19:13 UTC (permalink / raw) To: Jonathan Cameron Cc: alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, linux-cxl, rrichter, linux-kernel, bhelgaas Hi Jonathan, Thanks for the review. I added comments below. On 4/13/23 10:30, Jonathan Cameron wrote: > On Tue, 11 Apr 2023 13:02:57 -0500 > Terry Bowman <terry.bowman@amd.com> wrote: > >> Restricted CXL host (RCH) downstream port AER information is not currently >> logged while in the error state. One problem preventing existing PCIe AER >> functions from logging errors is the AER registers are not accessible. The >> CXL driver requires changes to find RCH downstream port AER registers for >> purpose of error logging. >> >> RCH downstream ports are not enumerated during a PCI bus scan and are >> instead discovered using system firmware, ACPI in this case.[1] The >> downstream port is implemented as a Root Complex Register Block (RCRB). >> The RCRB is a 4k memory block containing PCIe registers based on the PCIe >> root port.[2] The RCRB includes AER extended capability registers used for >> reporting errors. Note, the RCH's AER Capability is located in the RCRB >> memory space instead of PCI configuration space, thus its register access >> is different. Existing kernel PCIe AER functions can not be used to manage >> the downstream port AER capabilities because the port was not enumerated >> during PCI scan and the registers are not PCI config accessible. >> >> Discover RCH downstream port AER extended capability registers. This >> requires using MMIO accesses to search for extended AER capability in >> RCRB register space. 
>> >> [1] CXL 3.0 Spec, 9.11.2 - System Firmware View of CXL 1.1 Hierarchy >> [2] CXL 3.0 Spec, 8.2.1.1 - RCH Downstream Port RCRB >> >> Co-developed-by: Robert Richter <rrichter@amd.com> >> Signed-off-by: Robert Richter <rrichter@amd.com> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com> > > Hi Terry, > > Sorry I missed the first few versions. Playing catch up. > > A few minor comments only inline. > > > >> --- >> drivers/cxl/core/regs.c | 93 +++++++++++++++++++++++++++++++++++------ >> drivers/cxl/cxl.h | 5 +++ >> drivers/cxl/mem.c | 39 +++++++++++------ >> 3 files changed, 113 insertions(+), 24 deletions(-) >> >> diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c >> index 1476a0299c9b..bde1fffab09e 100644 >> --- a/drivers/cxl/core/regs.c >> +++ b/drivers/cxl/core/regs.c >> @@ -332,10 +332,36 @@ int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type, >> } >> EXPORT_SYMBOL_NS_GPL(cxl_find_regblock, CXL); >> >> +static void __iomem *cxl_map_reg(struct device *dev, struct cxl_register_map *map, >> + char *name) > > dev isn't used. > 'dev' was used earlier for logging that has since been removed. >> +{ >> + > > Trivial but no point in blank line here. > I'll remove it. >> + if (!request_mem_region(map->resource, map->max_size, name)) >> + return NULL; >> + >> + map->base = ioremap(map->resource, map->max_size); >> + if (!map->base) { >> + release_mem_region(map->resource, map->max_size); >> + return NULL; >> + } >> + >> + return map->base; > > Why return a value you've already stashed in map->base? > This allowed for a clean return check where cxl_map_reg() is called. This could/should have been a boolean. This will be fixed with the refactoring mentioned below. >> +} >> + > > This is similar enough to devm_cxl_iomap_block() that I'd kind > of like them to take the same parameters. That would mean > moving the map structure outside of the calls and instead passing > in the 3 relevant parameters. Perhaps not worth it.
> The intent was to clean up the cxl_map_reg() callers. Using a 'struct cxl_register_map' carries all the variables required for mapping and reduces the number of variables otherwise declared in the callers. But I understand why a common interface is preferred in this case. Ok. I'll change the parameters and return value to match devm_cxl_iomap_block(). >> +static void cxl_unmap_reg(struct device *dev, struct cxl_register_map *map) >> +{ > > dev isn't used here either. Makes little sense to pass it in to either function. > >> + iounmap(map->base); >> + release_mem_region(map->resource, map->max_size); >> +} >> + >> resource_size_t cxl_rcrb_to_component(struct device *dev, >> resource_size_t rcrb, >> enum cxl_rcrb which) >> { >> + struct cxl_register_map map = { >> + .resource = rcrb, >> + .max_size = SZ_4K >> + }; >> resource_size_t component_reg_phys; >> void __iomem *addr; >> u32 bar0, bar1; >> @@ -343,7 +369,10 @@ resource_size_t cxl_rcrb_to_component(struct device *dev, >> u32 id; >> >> if (which == CXL_RCRB_UPSTREAM) >> - rcrb += SZ_4K; >> + map.resource += SZ_4K; >> + >> + if (!cxl_map_reg(dev, &map, "CXL RCRB")) >> + return CXL_RESOURCE_NONE; >> >> /* >> * RCRB's BAR[0..1] point to component block containing CXL >> @@ -351,21 +380,12 @@ resource_size_t cxl_rcrb_to_component(struct device *dev, >> * the PCI Base spec here, esp. 64 bit extraction and memory >> * ranges alignment (6.0, 7.5.1.2.1). >> */ >> - if (!request_mem_region(rcrb, SZ_4K, "CXL RCRB")) >> - return CXL_RESOURCE_NONE; >> - addr = ioremap(rcrb, SZ_4K); >> - if (!addr) { >> - dev_err(dev, "Failed to map region %pr\n", addr); >> - release_mem_region(rcrb, SZ_4K); >> - return CXL_RESOURCE_NONE; >> - } >> - >> + addr = map.base; > > I'd have preferred to see this refactor as a precursor patch to the > 'real changes' that follow. > Ok. I can move the cxl_map_reg() addition and cxl_rcrb_to_component() refactor into a separate patch.
>> id = readl(addr + PCI_VENDOR_ID); >> cmd = readw(addr + PCI_COMMAND); >> bar0 = readl(addr + PCI_BASE_ADDRESS_0); >> bar1 = readl(addr + PCI_BASE_ADDRESS_1); >> - iounmap(addr); >> - release_mem_region(rcrb, SZ_4K); >> + cxl_unmap_reg(dev, &map); >> >> /* >> * Sanity check, see CXL 3.0 Figure 9-8 CXL Device that Does Not >> @@ -396,3 +416,52 @@ resource_size_t cxl_rcrb_to_component(struct device *dev, >> return component_reg_phys; >> } >> EXPORT_SYMBOL_NS_GPL(cxl_rcrb_to_component, CXL); > > > ... > >> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h >> index 044a92d9813e..df64c402e6e6 100644 >> --- a/drivers/cxl/cxl.h >> +++ b/drivers/cxl/cxl.h >> @@ -270,6 +270,9 @@ enum cxl_rcrb { >> resource_size_t cxl_rcrb_to_component(struct device *dev, >> resource_size_t rcrb, >> enum cxl_rcrb which); >> +u16 cxl_rcrb_to_aer(struct device *dev, resource_size_t rcrb); >> +u16 cxl_component_to_ras(struct device *dev, >> + resource_size_t component_reg_phys); >> >> #define CXL_RESOURCE_NONE ((resource_size_t) -1) >> #define CXL_TARGET_STRLEN 20 >> @@ -601,6 +604,8 @@ struct cxl_dport { >> int port_id; >> resource_size_t component_reg_phys; >> resource_size_t rcrb; >> + u16 aer_cap; >> + u16 ras_cap; > > This structure has kernel-doc that needs to be updated for these new entries. > I'll add it. >> bool rch; >> struct cxl_port *port; >> }; >> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c >> index 39c4b54f0715..014295ab6bc6 100644 >> --- a/drivers/cxl/mem.c >> +++ b/drivers/cxl/mem.c >> @@ -45,13 +45,36 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data) >> return 0; >> } >> >> +static void cxl_setup_rcrb(struct cxl_dev_state *cxlds, >> + struct cxl_dport *parent_dport) >> +{ >> + struct cxl_memdev *cxlmd = cxlds->cxlmd; > > extra space before = > Ok. I'll remove the extra space.
>> + >> + if (!parent_dport->rch) >> + return; >> + >> + /* >> + * The component registers for an RCD might come from the >> + * host-bridge RCRB if they are not already mapped via the >> + * typical register locator mechanism. >> + */ >> + if (cxlds->component_reg_phys == CXL_RESOURCE_NONE) >> + cxlds->component_reg_phys = cxl_rcrb_to_component( >> + &cxlmd->dev, parent_dport->rcrb, CXL_RCRB_UPSTREAM); >> + >> + parent_dport->aer_cap = cxl_rcrb_to_aer(parent_dport->dport, >> + parent_dport->rcrb); >> + >> + parent_dport->ras_cap = cxl_component_to_ras(parent_dport->dport, >> + parent_dport->component_reg_phys); >> +} >> + >> static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd, >> struct cxl_dport *parent_dport) >> { >> struct cxl_port *parent_port = parent_dport->port; >> struct cxl_dev_state *cxlds = cxlmd->cxlds; >> struct cxl_port *endpoint, *iter, *down; >> - resource_size_t component_reg_phys; >> int rc; >> >> /* >> @@ -66,17 +89,9 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd, >> ep->next = down; >> } >> >> - /* >> - * The component registers for an RCD might come from the >> - * host-bridge RCRB if they are not already mapped via the >> - * typical register locator mechanism. >> - */ >> - if (parent_dport->rch && cxlds->component_reg_phys == CXL_RESOURCE_NONE) >> - component_reg_phys = cxl_rcrb_to_component( >> - &cxlmd->dev, parent_dport->rcrb, CXL_RCRB_UPSTREAM); >> - else >> - component_reg_phys = cxlds->component_reg_phys; >> - endpoint = devm_cxl_add_port(host, &cxlmd->dev, component_reg_phys, >> + cxl_setup_rcrb(cxlds, parent_dport); >> + >> + endpoint = devm_cxl_add_port(host, &cxlmd->dev, cxlds->component_reg_phys, >> parent_dport); > As above, I'd prefer to see this refactor done in a precursor patch before the new > stuff is added. I like reviewing noop patches as I don't have to think much (so > can do it when I'm supposedly in a meeting ;) > Ok. 
I'll add an earlier patch that introduces cxl_setup_rcrb() and first moves this chunk into cxl_setup_rcrb(). The following patch will replace the cxl_setup_rcrb() logic with the AER and RAS discovery. My understanding is that the requested refactoring then splits this patch into the 3 patches listed below (in git log order, latest first): - Add RCH downstream port AER and RAS register discovery - Refactor RCD component discovery into separate function - Refactor RCRB register mapping into separate function Regards, Terry > Jonathan >> if (IS_ERR(endpoint)) >> return PTR_ERR(endpoint); > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 1/6] cxl/pci: Add RCH downstream port AER and RAS register discovery 2023-04-13 19:13 ` Terry Bowman @ 2023-04-14 11:47 ` Jonathan Cameron 2023-04-14 11:51 ` Robert Richter 1 sibling, 0 replies; 52+ messages in thread From: Jonathan Cameron @ 2023-04-14 11:47 UTC (permalink / raw) To: Terry Bowman Cc: alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, linux-cxl, rrichter, linux-kernel, bhelgaas > >> + endpoint = devm_cxl_add_port(host, &cxlmd->dev, cxlds->component_reg_phys, > >> parent_dport); > > As above, I'd prefer to see this refactor done in a precursor patch before the new > > stuff is added. I like reviewing noop patches as I don't have to think much (so > > can do it when I'm supposedly in a meeting ;) > > > > Ok. I'll add an earlier patch that introduces cxl_setup_rcrb() and first moves this > chunk into cxl_setup_rcrb(). The following patch will replace the cxl_setup_rcrb() > logic with the AER and RAS discovery. > > My understanding is the requested refactoring changes then splits this patch into > the 3 patches listed below (using git log latest first order): > - Add RCH downstream port AER and RAS register discovery > - Refactor RCD component discovery into separate function > - Refactor RCRB register mapping into separate function Spot on I think. Thanks, Jonathan ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 1/6] cxl/pci: Add RCH downstream port AER and RAS register discovery
  2023-04-13 19:13     ` Terry Bowman
  2023-04-14 11:47       ` Jonathan Cameron
@ 2023-04-14 11:51       ` Robert Richter
  1 sibling, 0 replies; 52+ messages in thread
From: Robert Richter @ 2023-04-14 11:51 UTC (permalink / raw)
  To: Terry Bowman
  Cc: Jonathan Cameron, alison.schofield, vishal.l.verma, ira.weiny,
	bwidawsk, dan.j.williams, dave.jiang, linux-cxl, linux-kernel,
	bhelgaas

On 13.04.23 14:13:16, Terry Bowman wrote:
> On 4/13/23 10:30, Jonathan Cameron wrote:
> > On Tue, 11 Apr 2023 13:02:57 -0500
> > Terry Bowman <terry.bowman@amd.com> wrote:
> >> +static void __iomem *cxl_map_reg(struct device *dev, struct cxl_register_map *map,
> >> +				char *name)
> >
> > dev isn't used.
> >
>
> 'dev' was used earlier for logging that is since removed.
>
> >> +{
> >> +
> >
> > Trivial but no point in blank line here.
> >
>
> I'll remove it.
>
> >> +	if (!request_mem_region(map->resource, map->max_size, name))
> >> +		return NULL;
> >> +
> >> +	map->base = ioremap(map->resource, map->max_size);
> >> +	if (!map->base) {
> >> +		release_mem_region(map->resource, map->max_size);
> >> +		return NULL;
> >> +	}
> >> +
> >> +	return map->base;
> >
> > Why return a value you've already stashed in map->base?
> >
> This allowed for a clean return check where cxl_map_reg() is called.
> This could/should have been a boolean. This will be fixed with the refactoring
> mentioned below.

The intention was to have a shortcut to get the base addr directly,
which could often be the case, while the rest of the struct map is
only used to unmap things. To be precise, we do not check a bool here
but instead check an address for being non-zero. Please do not change
the return value.

We did not use devm_* here to allow temporary mappings during init
(which might happen multiple times). With devm_* only one permanent
mapping would be possible and we would need to store and maintain the
base addr in some struct. This implementation allows local usage.

>
> >> +}
> >> +
> >
> > This is similar enough to devm_cxl_iomap_block() that I'd kind
> > of like them to take the same parameters. That would mean
> > moving the map structure outside of the calls and instead passing
> > in the 3 relevant parameters. Perhaps not worth it.
> >
> The intent was to cleanup the cxl_map_reg() callers. Using a 'struct
> cxl_register_map' carries all the variables required for mapping and reduces
> the number of variables otherwise declared in the callers. But, I understand
> why a common interface is preferred in this case.
>
> Ok. I'll change the parameters and return value to match devm_cxl_iomap_block().

See my comment above. Struct cxl_register_map was chosen to keep data
in one place and also for paired use with cxl_map_reg() and
cxl_unmap_reg() (in the sense of an object-oriented programming
style). The struct is widely used in CXL code for similar reasons. I
would prefer to keep the struct as argument.

>
> >> +static void cxl_unmap_reg(struct device *dev, struct cxl_register_map *map)
> >> +{
> >
> > dev isn't used here either. Makes little sense to pass it in to either function.

Yes, dev should be removed for both functions. Thanks for catching
this.

-Robert

> >
> >> +	iounmap(map->base);
> >> +	release_mem_region(map->resource, map->max_size);
> >> +}
> >> +

^ permalink raw reply [flat|nested] 52+ messages in thread
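The paired map/unmap usage argued for in the reply above can be illustrated with a small userspace sketch. This is not the kernel code: `struct reg_map`, `map_reg()` and `unmap_reg()` are hypothetical stand-ins for `struct cxl_register_map`, `cxl_map_reg()` and `cxl_unmap_reg()`, and `calloc()`/`free()` only stand in for `request_mem_region()` + `ioremap()` and `iounmap()` + `release_mem_region()`. It shows the two points under discussion: the mapper returns the base address as a shortcut while also caching it in the struct, and the same struct can be mapped temporarily more than once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct cxl_register_map. */
struct reg_map {
	uint64_t resource;	/* physical base; only recorded in this sketch */
	size_t max_size;	/* size of the window to map */
	void *base;		/* mapped address, NULL while unmapped */
};

/*
 * Map the window described by @map. The base address is cached in
 * map->base for the later unmap and also returned, so the caller can
 * test it directly for NULL.
 */
static void *map_reg(struct reg_map *map)
{
	/* calloc() stands in for request_mem_region() + ioremap(). */
	map->base = calloc(1, map->max_size);
	return map->base;
}

/* Undo map_reg(); only the struct is needed to release everything. */
static void unmap_reg(struct reg_map *map)
{
	/* free() stands in for iounmap() + release_mem_region(). */
	free(map->base);
	map->base = NULL;
}
```

A caller would typically write `if (!map_reg(&map)) return ...;`, read what it needs through the returned pointer, and call `unmap_reg(&map)` before returning, which is the local, temporary usage that a devm-managed mapping would not allow.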
* RE: [PATCH v3 1/6] cxl/pci: Add RCH downstream port AER and RAS register discovery 2023-04-11 18:02 ` [PATCH v3 1/6] cxl/pci: Add RCH downstream port AER and RAS register discovery Terry Bowman 2023-04-13 15:30 ` Jonathan Cameron @ 2023-04-17 23:00 ` Dan Williams 2023-04-18 15:59 ` Terry Bowman 2023-04-27 13:52 ` Robert Richter 1 sibling, 2 replies; 52+ messages in thread From: Dan Williams @ 2023-04-17 23:00 UTC (permalink / raw) To: Terry Bowman, alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, Jonathan.Cameron, linux-cxl Cc: terry.bowman, rrichter, linux-kernel, bhelgaas Terry Bowman wrote: > Restricted CXL host (RCH) downstream port AER information is not currently > logged while in the error state. One problem preventing existing PCIe AER > functions from logging errors is the AER registers are not accessible. The > CXL driver requires changes to find RCH downstream port AER registers for > purpose of error logging. > > RCH downstream ports are not enumerated during a PCI bus scan and are > instead discovered using system firmware, ACPI in this case.[1] The > downstream port is implemented as a Root Complex Register Block (RCRB). > The RCRB is a 4k memory block containing PCIe registers based on the PCIe > root port.[2] The RCRB includes AER extended capability registers used for > reporting errors. Note, the RCH's AER Capability is located in the RCRB > memory space instead of PCI configuration space, thus its register access > is different. Existing kernel PCIe AER functions can not be used to manage > the downstream port AER capabilities because the port was not enumerated > during PCI scan and the registers are not PCI config accessible. > > Discover RCH downstream port AER extended capability registers. This > requires using MMIO accesses to search for extended AER capability in > RCRB register space. 
> > [1] CXL 3.0 Spec, 9.11.2 - System Firmware View of CXL 1.1 Hierarchy > [2] CXL 3.0 Spec, 8.2.1.1 - RCH Downstream Port RCRB > > Co-developed-by: Robert Richter <rrichter@amd.com> > Signed-off-by: Robert Richter <rrichter@amd.com> > Signed-off-by: Terry Bowman <terry.bowman@amd.com> > --- > drivers/cxl/core/regs.c | 93 +++++++++++++++++++++++++++++++++++------ > drivers/cxl/cxl.h | 5 +++ > drivers/cxl/mem.c | 39 +++++++++++------ > 3 files changed, 113 insertions(+), 24 deletions(-) > > diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c > index 1476a0299c9b..bde1fffab09e 100644 > --- a/drivers/cxl/core/regs.c > +++ b/drivers/cxl/core/regs.c > @@ -332,10 +332,36 @@ int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type, > } > EXPORT_SYMBOL_NS_GPL(cxl_find_regblock, CXL); > > +static void __iomem *cxl_map_reg(struct device *dev, struct cxl_register_map *map, > + char *name) > +{ > + > + if (!request_mem_region(map->resource, map->max_size, name)) > + return NULL; > + > + map->base = ioremap(map->resource, map->max_size); > + if (!map->base) { > + release_mem_region(map->resource, map->max_size); > + return NULL; > + } > + > + return map->base; > +} > + > +static void cxl_unmap_reg(struct device *dev, struct cxl_register_map *map) > +{ > + iounmap(map->base); > + release_mem_region(map->resource, map->max_size); > +} Not clear why these new functions are needed vs cxl_map_regblock() / cxl_unmap_regblock(), and this refactoring looks unrelated to the claimed changes in the patch changelog. ...oh, I think I see why you went this way, a potential counter-proposal below. 
> + > resource_size_t cxl_rcrb_to_component(struct device *dev, > resource_size_t rcrb, > enum cxl_rcrb which) > { > + struct cxl_register_map map = { > + .resource = rcrb, > + .max_size = SZ_4K > + }; > resource_size_t component_reg_phys; > void __iomem *addr; > u32 bar0, bar1; > @@ -343,7 +369,10 @@ resource_size_t cxl_rcrb_to_component(struct device *dev, > u32 id; > > if (which == CXL_RCRB_UPSTREAM) > - rcrb += SZ_4K; > + map.resource += SZ_4K; > + > + if (!cxl_map_reg(dev, &map, "CXL RCRB")) > + return CXL_RESOURCE_NONE; > > /* > * RCRB's BAR[0..1] point to component block containing CXL > @@ -351,21 +380,12 @@ resource_size_t cxl_rcrb_to_component(struct device *dev, > * the PCI Base spec here, esp. 64 bit extraction and memory > * ranges alignment (6.0, 7.5.1.2.1). > */ > - if (!request_mem_region(rcrb, SZ_4K, "CXL RCRB")) > - return CXL_RESOURCE_NONE; > - addr = ioremap(rcrb, SZ_4K); > - if (!addr) { > - dev_err(dev, "Failed to map region %pr\n", addr); > - release_mem_region(rcrb, SZ_4K); > - return CXL_RESOURCE_NONE; > - } > - > + addr = map.base; > id = readl(addr + PCI_VENDOR_ID); > cmd = readw(addr + PCI_COMMAND); > bar0 = readl(addr + PCI_BASE_ADDRESS_0); > bar1 = readl(addr + PCI_BASE_ADDRESS_1); > - iounmap(addr); > - release_mem_region(rcrb, SZ_4K); > + cxl_unmap_reg(dev, &map); > > /* > * Sanity check, see CXL 3.0 Figure 9-8 CXL Device that Does Not > @@ -396,3 +416,52 @@ resource_size_t cxl_rcrb_to_component(struct device *dev, > return component_reg_phys; > } > EXPORT_SYMBOL_NS_GPL(cxl_rcrb_to_component, CXL); > + > +u16 cxl_rcrb_to_aer(struct device *dev, resource_size_t rcrb) > +{ > + struct cxl_register_map map = { > + .resource = rcrb, > + .max_size = SZ_4K, > + }; > + u32 cap_hdr; > + u16 offset = 0; > + > + if (!cxl_map_reg(dev, &map, "CXL RCRB")) > + return 0; > + > + cap_hdr = readl(map.base + offset); > + while (PCI_EXT_CAP_ID(cap_hdr) != PCI_EXT_CAP_ID_ERR) { > + > + offset = PCI_EXT_CAP_NEXT(cap_hdr); > + if (!offset) { > + 
cxl_unmap_reg(dev, &map); > + return 0; > + } > + cap_hdr = readl(map.base + offset); > + } > + > + dev_dbg(dev, "found AER extended capability (0x%x)\n", offset); > + cxl_unmap_reg(dev, &map); > + > + return offset; > +} > +EXPORT_SYMBOL_NS_GPL(cxl_rcrb_to_aer, CXL); > + > +u16 cxl_component_to_ras(struct device *dev, resource_size_t component_reg_phys) > +{ > + struct cxl_register_map map = { > + .resource = component_reg_phys, > + .max_size = CXL_COMPONENT_REG_BLOCK_SIZE, > + }; > + > + if (!cxl_map_reg(dev, &map, "component")) > + return 0; > + > + cxl_probe_component_regs(dev, map.base, &map.component_map); > + cxl_unmap_reg(dev, &map); > + if (!map.component_map.ras.valid) > + return 0; > + > + return map.component_map.ras.offset; > +} > +EXPORT_SYMBOL_NS_GPL(cxl_component_to_ras, CXL); > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h > index 044a92d9813e..df64c402e6e6 100644 > --- a/drivers/cxl/cxl.h > +++ b/drivers/cxl/cxl.h > @@ -270,6 +270,9 @@ enum cxl_rcrb { > resource_size_t cxl_rcrb_to_component(struct device *dev, > resource_size_t rcrb, > enum cxl_rcrb which); > +u16 cxl_rcrb_to_aer(struct device *dev, resource_size_t rcrb); > +u16 cxl_component_to_ras(struct device *dev, > + resource_size_t component_reg_phys); > > #define CXL_RESOURCE_NONE ((resource_size_t) -1) > #define CXL_TARGET_STRLEN 20 > @@ -601,6 +604,8 @@ struct cxl_dport { > int port_id; > resource_size_t component_reg_phys; > resource_size_t rcrb; > + u16 aer_cap; > + u16 ras_cap; > bool rch; > struct cxl_port *port; > }; > diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c > index 39c4b54f0715..014295ab6bc6 100644 > --- a/drivers/cxl/mem.c > +++ b/drivers/cxl/mem.c > @@ -45,13 +45,36 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data) > return 0; > } > > +static void cxl_setup_rcrb(struct cxl_dev_state *cxlds, > + struct cxl_dport *parent_dport) > +{ > + struct cxl_memdev *cxlmd = cxlds->cxlmd; > + > + if (!parent_dport->rch) > + return; > + > + /* > + * The component 
registers for an RCD might come from the > + * host-bridge RCRB if they are not already mapped via the > + * typical register locator mechanism. > + */ > + if (cxlds->component_reg_phys == CXL_RESOURCE_NONE) > + cxlds->component_reg_phys = cxl_rcrb_to_component( > + &cxlmd->dev, parent_dport->rcrb, CXL_RCRB_UPSTREAM); > + > + parent_dport->aer_cap = cxl_rcrb_to_aer(parent_dport->dport, > + parent_dport->rcrb); Hmm, how about just retrieve this as part of cxl_rcrb_to_component() (renamed to cxl_probe_rcrb()), and make an rch dport its own distinct object? Otherwise it feels odd to be retrieving downstream port properties this late at upstream port component register detection time. It also feels awkward to keep adding more RCH dport specific details to the common 'struct cxl_dport'. So, I'm thinking something like the following (compiled and cxl_test regression passed): -- >8 -- From 18fbc72f98655d10301c7a35f614b6152f46c44b Mon Sep 17 00:00:00 2001 From: Dan Williams <dan.j.williams@intel.com> Date: Mon, 17 Apr 2023 15:45:50 -0700 Subject: [PATCH] cxl/rch: Prepare for caching the MMIO mapped PCIe AER capability Prepare cxl_probe_rcrb() for retrieving more than just the component register block. The RCH AER handling code wants to get back to the AER capability that happens to be MMIO mapped rather then configuration cycles. Move rcrb specific dport data, like the RCRB base and the AER capability offset, into its own data structure ('struct cxl_rcrb_info') for cxl_probe_rcrb() to fill. Introduce 'struct cxl_rch_dport' to wrap a 'struct cxl_dport' with a 'struct cxl_rcrb_info' attribute. This centralizes all RCRB scanning in one routine. 
Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- drivers/cxl/acpi.c | 16 ++++++++-------- drivers/cxl/core/port.c | 33 +++++++++++++++++++++------------ drivers/cxl/core/regs.c | 12 ++++++++---- drivers/cxl/cxl.h | 21 +++++++++++++++------ drivers/cxl/mem.c | 15 ++++++++++----- tools/testing/cxl/Kbuild | 2 +- tools/testing/cxl/test/cxl.c | 10 ++++++---- tools/testing/cxl/test/mock.c | 12 ++++++------ tools/testing/cxl/test/mock.h | 7 ++++--- 9 files changed, 79 insertions(+), 49 deletions(-) diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c index 4e66483f1fd3..2647eb04fcdb 100644 --- a/drivers/cxl/acpi.c +++ b/drivers/cxl/acpi.c @@ -375,7 +375,7 @@ static int add_host_bridge_uport(struct device *match, void *arg) struct cxl_chbs_context { struct device *dev; unsigned long long uid; - resource_size_t rcrb; + struct cxl_rcrb_info rcrb; resource_size_t chbcr; u32 cxl_version; }; @@ -395,7 +395,7 @@ static int cxl_get_chbcr(union acpi_subtable_headers *header, void *arg, return 0; ctx->cxl_version = chbs->cxl_version; - ctx->rcrb = CXL_RESOURCE_NONE; + ctx->rcrb.base = CXL_RESOURCE_NONE; ctx->chbcr = CXL_RESOURCE_NONE; if (!chbs->base) @@ -409,9 +409,8 @@ static int cxl_get_chbcr(union acpi_subtable_headers *header, void *arg, if (chbs->length != CXL_RCRB_SIZE) return 0; - ctx->rcrb = chbs->base; - ctx->chbcr = cxl_rcrb_to_component(ctx->dev, chbs->base, - CXL_RCRB_DOWNSTREAM); + ctx->chbcr = cxl_probe_rcrb(ctx->dev, chbs->base, &ctx->rcrb, + CXL_RCRB_DOWNSTREAM); return 0; } @@ -451,8 +450,9 @@ static int add_host_bridge_dport(struct device *match, void *arg) return 0; } - if (ctx.rcrb != CXL_RESOURCE_NONE) - dev_dbg(match, "RCRB found for UID %lld: %pa\n", uid, &ctx.rcrb); + if (ctx.rcrb.base != CXL_RESOURCE_NONE) + dev_dbg(match, "RCRB found for UID %lld: %pa\n", uid, + &ctx.rcrb.base); if (ctx.chbcr == CXL_RESOURCE_NONE) { dev_warn(match, "CHBCR invalid for Host Bridge (UID %lld)\n", @@ -466,7 +466,7 @@ static int add_host_bridge_dport(struct device 
*match, void *arg) bridge = pci_root->bus->bridge; if (ctx.cxl_version == ACPI_CEDT_CHBS_VERSION_CXL11) dport = devm_cxl_add_rch_dport(root_port, bridge, uid, - ctx.chbcr, ctx.rcrb); + ctx.chbcr, &ctx.rcrb); else dport = devm_cxl_add_dport(root_port, bridge, uid, ctx.chbcr); diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c index 4003f445320c..d194f48259ff 100644 --- a/drivers/cxl/core/port.c +++ b/drivers/cxl/core/port.c @@ -920,7 +920,7 @@ static void cxl_dport_unlink(void *data) static struct cxl_dport * __devm_cxl_add_dport(struct cxl_port *port, struct device *dport_dev, int port_id, resource_size_t component_reg_phys, - resource_size_t rcrb) + struct cxl_rcrb_info *ri) { char link_name[CXL_TARGET_STRLEN]; struct cxl_dport *dport; @@ -942,17 +942,26 @@ __devm_cxl_add_dport(struct cxl_port *port, struct device *dport_dev, CXL_TARGET_STRLEN) return ERR_PTR(-EINVAL); - dport = devm_kzalloc(host, sizeof(*dport), GFP_KERNEL); - if (!dport) - return ERR_PTR(-ENOMEM); + if (ri && ri->base != CXL_RESOURCE_NONE) { + struct cxl_rch_dport *rdport; + + rdport = devm_kzalloc(host, sizeof(*rdport), GFP_KERNEL); + if (!rdport) + return ERR_PTR(-ENOMEM); + + rdport->rcrb.base = ri->base; + dport = &rdport->dport; + dport->rch = true; + } else { + dport = devm_kzalloc(host, sizeof(*dport), GFP_KERNEL); + if (!dport) + return ERR_PTR(-ENOMEM); + } dport->dport = dport_dev; dport->port_id = port_id; dport->component_reg_phys = component_reg_phys; dport->port = port; - if (rcrb != CXL_RESOURCE_NONE) - dport->rch = true; - dport->rcrb = rcrb; cond_cxl_root_lock(port); rc = add_dport(port, dport); @@ -994,7 +1003,7 @@ struct cxl_dport *devm_cxl_add_dport(struct cxl_port *port, struct cxl_dport *dport; dport = __devm_cxl_add_dport(port, dport_dev, port_id, - component_reg_phys, CXL_RESOURCE_NONE); + component_reg_phys, NULL); if (IS_ERR(dport)) { dev_dbg(dport_dev, "failed to add dport to %s: %ld\n", dev_name(&port->dev), PTR_ERR(dport)); @@ -1013,24 +1022,24 @@ 
EXPORT_SYMBOL_NS_GPL(devm_cxl_add_dport, CXL); * @dport_dev: firmware or PCI device representing the dport * @port_id: identifier for this dport in a decoder's target list * @component_reg_phys: optional location of CXL component registers - * @rcrb: mandatory location of a Root Complex Register Block + * @ri: mandatory data about the Root Complex Register Block layout * * See CXL 3.0 9.11.8 CXL Devices Attached to an RCH */ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port, struct device *dport_dev, int port_id, resource_size_t component_reg_phys, - resource_size_t rcrb) + struct cxl_rcrb_info *ri) { struct cxl_dport *dport; - if (rcrb == CXL_RESOURCE_NONE) { + if (!ri || ri->base == CXL_RESOURCE_NONE) { dev_dbg(&port->dev, "failed to add RCH dport, missing RCRB\n"); return ERR_PTR(-EINVAL); } dport = __devm_cxl_add_dport(port, dport_dev, port_id, - component_reg_phys, rcrb); + component_reg_phys, ri); if (IS_ERR(dport)) { dev_dbg(dport_dev, "failed to add RCH dport to %s: %ld\n", dev_name(&port->dev), PTR_ERR(dport)); diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c index 52d1dbeda527..b1c0db898a50 100644 --- a/drivers/cxl/core/regs.c +++ b/drivers/cxl/core/regs.c @@ -332,9 +332,8 @@ int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type, } EXPORT_SYMBOL_NS_GPL(cxl_find_regblock, CXL); -resource_size_t cxl_rcrb_to_component(struct device *dev, - resource_size_t rcrb, - enum cxl_rcrb which) +resource_size_t cxl_probe_rcrb(struct device *dev, resource_size_t rcrb, + struct cxl_rcrb_info *ri, enum cxl_rcrb which) { resource_size_t component_reg_phys; void __iomem *addr; @@ -344,6 +343,8 @@ resource_size_t cxl_rcrb_to_component(struct device *dev, if (which == CXL_RCRB_UPSTREAM) rcrb += SZ_4K; + else + ri->base = rcrb; /* * RCRB's BAR[0..1] point to component block containing CXL @@ -364,6 +365,9 @@ resource_size_t cxl_rcrb_to_component(struct device *dev, cmd = readw(addr + PCI_COMMAND); bar0 = readl(addr + 
PCI_BASE_ADDRESS_0); bar1 = readl(addr + PCI_BASE_ADDRESS_1); + + /* TODO: retrieve rcrb->aer_cap here */ + iounmap(addr); release_mem_region(rcrb, SZ_4K); @@ -395,4 +399,4 @@ resource_size_t cxl_rcrb_to_component(struct device *dev, return component_reg_phys; } -EXPORT_SYMBOL_NS_GPL(cxl_rcrb_to_component, CXL); +EXPORT_SYMBOL_NS_GPL(cxl_probe_rcrb, CXL); diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h index 1503ccec9a84..b0807f54e9fd 100644 --- a/drivers/cxl/cxl.h +++ b/drivers/cxl/cxl.h @@ -267,9 +267,9 @@ enum cxl_rcrb { CXL_RCRB_DOWNSTREAM, CXL_RCRB_UPSTREAM, }; -resource_size_t cxl_rcrb_to_component(struct device *dev, - resource_size_t rcrb, - enum cxl_rcrb which); +struct cxl_rcrb_info; +resource_size_t cxl_probe_rcrb(struct device *dev, resource_size_t rcrb, + struct cxl_rcrb_info *ri, enum cxl_rcrb which); #define CXL_RESOURCE_NONE ((resource_size_t) -1) #define CXL_TARGET_STRLEN 20 @@ -589,12 +589,12 @@ cxl_find_dport_by_dev(struct cxl_port *port, const struct device *dport_dev) return xa_load(&port->dports, (unsigned long)dport_dev); } + /** * struct cxl_dport - CXL downstream port * @dport: PCI bridge or firmware device representing the downstream link * @port_id: unique hardware identifier for dport in decoder target list * @component_reg_phys: downstream port component registers - * @rcrb: base address for the Root Complex Register Block * @rch: Indicate whether this dport was enumerated in RCH or VH mode * @port: reference to cxl_port that contains this downstream port */ @@ -602,11 +602,20 @@ struct cxl_dport { struct device *dport; int port_id; resource_size_t component_reg_phys; - resource_size_t rcrb; bool rch; struct cxl_port *port; }; +struct cxl_rcrb_info { + resource_size_t base; + u16 aer_cap; +}; + +struct cxl_rch_dport { + struct cxl_dport dport; + struct cxl_rcrb_info rcrb; +}; + /** * struct cxl_ep - track an endpoint's interest in a port * @ep: device that hosts a generic CXL endpoint (expander or accelerator) @@ -674,7 +683,7 @@ 
struct cxl_dport *devm_cxl_add_dport(struct cxl_port *port, struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port, struct device *dport_dev, int port_id, resource_size_t component_reg_phys, - resource_size_t rcrb); + struct cxl_rcrb_info *ri); struct cxl_decoder *to_cxl_decoder(struct device *dev); struct cxl_root_decoder *to_cxl_root_decoder(struct device *dev); diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c index 097d86dd2a8e..7da6135e0b17 100644 --- a/drivers/cxl/mem.c +++ b/drivers/cxl/mem.c @@ -71,10 +71,15 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd, * host-bridge RCRB if they are not already mapped via the * typical register locator mechanism. */ - if (parent_dport->rch && cxlds->component_reg_phys == CXL_RESOURCE_NONE) - component_reg_phys = cxl_rcrb_to_component( - &cxlmd->dev, parent_dport->rcrb, CXL_RCRB_UPSTREAM); - else + if (parent_dport->rch && + cxlds->component_reg_phys == CXL_RESOURCE_NONE) { + struct cxl_rch_dport *rdport = + container_of(parent_dport, typeof(*rdport), dport); + + component_reg_phys = + cxl_probe_rcrb(&cxlmd->dev, rdport->rcrb.base, + &rdport->rcrb, CXL_RCRB_UPSTREAM); + } else component_reg_phys = cxlds->component_reg_phys; endpoint = devm_cxl_add_port(host, &cxlmd->dev, component_reg_phys, parent_dport); @@ -92,7 +97,7 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd, } return 0; -} + } static int cxl_mem_probe(struct device *dev) { diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild index fba7bec96acd..bef1bc3bd912 100644 --- a/tools/testing/cxl/Kbuild +++ b/tools/testing/cxl/Kbuild @@ -11,7 +11,7 @@ ldflags-y += --wrap=devm_cxl_enumerate_decoders ldflags-y += --wrap=cxl_await_media_ready ldflags-y += --wrap=cxl_hdm_decode_init ldflags-y += --wrap=cxl_dvsec_rr_decode -ldflags-y += --wrap=cxl_rcrb_to_component +ldflags-y += --wrap=cxl_probe_rcrb DRIVERS := ../../../drivers CXL_SRC := $(DRIVERS)/cxl diff --git 
a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c index 385cdeeab22c..805c79491485 100644 --- a/tools/testing/cxl/test/cxl.c +++ b/tools/testing/cxl/test/cxl.c @@ -983,12 +983,14 @@ static int mock_cxl_port_enumerate_dports(struct cxl_port *port) return 0; } -resource_size_t mock_cxl_rcrb_to_component(struct device *dev, - resource_size_t rcrb, - enum cxl_rcrb which) +resource_size_t mock_cxl_probe_rcrb(struct device *dev, resource_size_t rcrb, + struct cxl_rcrb_info *ri, enum cxl_rcrb which) { dev_dbg(dev, "rcrb: %pa which: %d\n", &rcrb, which); + if (which == CXL_RCRB_DOWNSTREAM) + ri->base = rcrb; + return (resource_size_t) which + 1; } @@ -1000,7 +1002,7 @@ static struct cxl_mock_ops cxl_mock_ops = { .is_mock_dev = is_mock_dev, .acpi_table_parse_cedt = mock_acpi_table_parse_cedt, .acpi_evaluate_integer = mock_acpi_evaluate_integer, - .cxl_rcrb_to_component = mock_cxl_rcrb_to_component, + .cxl_probe_rcrb = mock_cxl_probe_rcrb, .acpi_pci_find_root = mock_acpi_pci_find_root, .devm_cxl_port_enumerate_dports = mock_cxl_port_enumerate_dports, .devm_cxl_setup_hdm = mock_cxl_setup_hdm, diff --git a/tools/testing/cxl/test/mock.c b/tools/testing/cxl/test/mock.c index c4e53f22e421..148bd4f184f5 100644 --- a/tools/testing/cxl/test/mock.c +++ b/tools/testing/cxl/test/mock.c @@ -244,9 +244,9 @@ int __wrap_cxl_dvsec_rr_decode(struct device *dev, int dvsec, } EXPORT_SYMBOL_NS_GPL(__wrap_cxl_dvsec_rr_decode, CXL); -resource_size_t __wrap_cxl_rcrb_to_component(struct device *dev, - resource_size_t rcrb, - enum cxl_rcrb which) +resource_size_t __wrap_cxl_probe_rcrb(struct device *dev, resource_size_t rcrb, + struct cxl_rcrb_info *ri, + enum cxl_rcrb which) { int index; resource_size_t component_reg_phys; @@ -254,14 +254,14 @@ resource_size_t __wrap_cxl_rcrb_to_component(struct device *dev, if (ops && ops->is_mock_port(dev)) component_reg_phys = - ops->cxl_rcrb_to_component(dev, rcrb, which); + ops->cxl_probe_rcrb(dev, rcrb, ri, which); else - component_reg_phys = 
cxl_rcrb_to_component(dev, rcrb, which); + component_reg_phys = cxl_probe_rcrb(dev, rcrb, ri, which); put_cxl_mock_ops(index); return component_reg_phys; } -EXPORT_SYMBOL_NS_GPL(__wrap_cxl_rcrb_to_component, CXL); +EXPORT_SYMBOL_NS_GPL(__wrap_cxl_probe_rcrb, CXL); MODULE_LICENSE("GPL v2"); MODULE_IMPORT_NS(ACPI); diff --git a/tools/testing/cxl/test/mock.h b/tools/testing/cxl/test/mock.h index bef8817b01f2..7ef21356d052 100644 --- a/tools/testing/cxl/test/mock.h +++ b/tools/testing/cxl/test/mock.h @@ -15,9 +15,10 @@ struct cxl_mock_ops { acpi_string pathname, struct acpi_object_list *arguments, unsigned long long *data); - resource_size_t (*cxl_rcrb_to_component)(struct device *dev, - resource_size_t rcrb, - enum cxl_rcrb which); + resource_size_t (*cxl_probe_rcrb)(struct device *dev, + resource_size_t rcrb, + struct cxl_rcrb_info *ri, + enum cxl_rcrb which); struct acpi_pci_root *(*acpi_pci_find_root)(acpi_handle handle); bool (*is_mock_bus)(struct pci_bus *bus); bool (*is_mock_port)(struct device *dev); -- 2.39.2 -- >8 -- > + > + parent_dport->ras_cap = cxl_component_to_ras(parent_dport->dport, > + parent_dport->component_reg_phys); Since this is component register offset based can it not be shared with the VH case? I have been expecting that RCH RAS capability and VH RAS capability scanning would need to be unified in the cxl_port driver. ^ permalink raw reply related [flat|nested] 52+ messages in thread
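The counter-proposal above wraps the common dport in an RCH-specific object and leaves the AER lookup as a TODO. A rough userspace sketch of both pieces follows, under stated assumptions: the struct names are shortened versions of the ones in the thread, `to_rch_dport()` and `find_aer_cap()` are hypothetical helpers, a plain byte buffer stands in for the ioremap()'d 4K RCRB, and the walk starts at offset 0x100 (where PCIe extended capabilities begin) rather than at offset 0 as in the v3 patch. The header decoding mirrors what `PCI_EXT_CAP_ID()` and `PCI_EXT_CAP_NEXT()` do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RCRB_SIZE	0x1000u
#define EXT_CAP_START	0x100u	/* first extended capability header */
#define EXT_CAP_ID_AER	0x0001u	/* same value as PCI_EXT_CAP_ID_ERR */

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Shortened, illustrative versions of the structs in the proposal. */
struct dport { int port_id; int rch; };
struct rcrb_info { uint64_t base; uint16_t aer_cap; };
struct rch_dport { struct dport dport; struct rcrb_info rcrb; };

/* Recover the RCH wrapper from the embedded common dport. */
static struct rch_dport *to_rch_dport(struct dport *d)
{
	assert(d->rch);	/* only valid for dports allocated as rch_dport */
	return container_of(d, struct rch_dport, dport);
}

/* Stand-in for readl() on the mapped RCRB (host byte order). */
static uint32_t read32(const uint8_t *regs, uint16_t off)
{
	uint32_t v;

	memcpy(&v, regs + off, sizeof(v));
	return v;
}

/* Walk the extended capability list; return the AER offset or 0. */
static uint16_t find_aer_cap(const uint8_t *regs)
{
	uint16_t off = EXT_CAP_START;
	int ttl = (RCRB_SIZE - EXT_CAP_START) / 8;	/* guard against loops */

	while (off >= EXT_CAP_START && off <= RCRB_SIZE - 4 && ttl-- > 0) {
		uint32_t hdr = read32(regs, off);

		if (hdr == 0 || hdr == 0xffffffffu)
			return 0;		/* empty or unreadable list */
		if ((hdr & 0xffffu) == EXT_CAP_ID_AER)
			return off;
		off = (hdr >> 20) & 0xffcu;	/* next capability offset */
	}
	return 0;
}
```

With this split, common code keeps passing `struct dport *` around, and only the RCH error path calls `to_rch_dport()` to reach `rcrb.aer_cap`, which a probe routine would fill once from `find_aer_cap()`.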
* Re: [PATCH v3 1/6] cxl/pci: Add RCH downstream port AER and RAS register discovery 2023-04-17 23:00 ` Dan Williams @ 2023-04-18 15:59 ` Terry Bowman 2023-04-27 13:52 ` Robert Richter 1 sibling, 0 replies; 52+ messages in thread From: Terry Bowman @ 2023-04-18 15:59 UTC (permalink / raw) To: Dan Williams, alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dave.jiang, Jonathan.Cameron, linux-cxl Cc: rrichter, linux-kernel, bhelgaas Hi Dan, Thanks for the review comments. I added responses inline below. On 4/17/23 18:00, Dan Williams wrote: > Terry Bowman wrote: >> Restricted CXL host (RCH) downstream port AER information is not currently >> logged while in the error state. One problem preventing existing PCIe AER >> functions from logging errors is the AER registers are not accessible. The >> CXL driver requires changes to find RCH downstream port AER registers for >> purpose of error logging. >> >> RCH downstream ports are not enumerated during a PCI bus scan and are >> instead discovered using system firmware, ACPI in this case.[1] The >> downstream port is implemented as a Root Complex Register Block (RCRB). >> The RCRB is a 4k memory block containing PCIe registers based on the PCIe >> root port.[2] The RCRB includes AER extended capability registers used for >> reporting errors. Note, the RCH's AER Capability is located in the RCRB >> memory space instead of PCI configuration space, thus its register access >> is different. Existing kernel PCIe AER functions can not be used to manage >> the downstream port AER capabilities because the port was not enumerated >> during PCI scan and the registers are not PCI config accessible. >> >> Discover RCH downstream port AER extended capability registers. This >> requires using MMIO accesses to search for extended AER capability in >> RCRB register space. 
>> >> [1] CXL 3.0 Spec, 9.11.2 - System Firmware View of CXL 1.1 Hierarchy >> [2] CXL 3.0 Spec, 8.2.1.1 - RCH Downstream Port RCRB >> >> Co-developed-by: Robert Richter <rrichter@amd.com> >> Signed-off-by: Robert Richter <rrichter@amd.com> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com> >> --- >> drivers/cxl/core/regs.c | 93 +++++++++++++++++++++++++++++++++++------ >> drivers/cxl/cxl.h | 5 +++ >> drivers/cxl/mem.c | 39 +++++++++++------ >> 3 files changed, 113 insertions(+), 24 deletions(-) >> >> diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c >> index 1476a0299c9b..bde1fffab09e 100644 >> --- a/drivers/cxl/core/regs.c >> +++ b/drivers/cxl/core/regs.c >> @@ -332,10 +332,36 @@ int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type, >> } >> EXPORT_SYMBOL_NS_GPL(cxl_find_regblock, CXL); >> >> +static void __iomem *cxl_map_reg(struct device *dev, struct cxl_register_map *map, >> + char *name) >> +{ >> + >> + if (!request_mem_region(map->resource, map->max_size, name)) >> + return NULL; >> + >> + map->base = ioremap(map->resource, map->max_size); >> + if (!map->base) { >> + release_mem_region(map->resource, map->max_size); >> + return NULL; >> + } >> + >> + return map->base; >> +} >> + >> +static void cxl_unmap_reg(struct device *dev, struct cxl_register_map *map) >> +{ >> + iounmap(map->base); >> + release_mem_region(map->resource, map->max_size); >> +} > > Not clear why these new functions are needed vs cxl_map_regblock() / > cxl_unmap_regblock(), and this refactoring looks unrelated to the > claimed changes in the patch changelog. > > ...oh, I think I see why you went this way, a potential counter-proposal > below. 
> >> + >> resource_size_t cxl_rcrb_to_component(struct device *dev, >> resource_size_t rcrb, >> enum cxl_rcrb which) >> { >> + struct cxl_register_map map = { >> + .resource = rcrb, >> + .max_size = SZ_4K >> + }; >> resource_size_t component_reg_phys; >> void __iomem *addr; >> u32 bar0, bar1; >> @@ -343,7 +369,10 @@ resource_size_t cxl_rcrb_to_component(struct device *dev, >> u32 id; >> >> if (which == CXL_RCRB_UPSTREAM) >> - rcrb += SZ_4K; >> + map.resource += SZ_4K; >> + >> + if (!cxl_map_reg(dev, &map, "CXL RCRB")) >> + return CXL_RESOURCE_NONE; >> >> /* >> * RCRB's BAR[0..1] point to component block containing CXL >> @@ -351,21 +380,12 @@ resource_size_t cxl_rcrb_to_component(struct device *dev, >> * the PCI Base spec here, esp. 64 bit extraction and memory >> * ranges alignment (6.0, 7.5.1.2.1). >> */ >> - if (!request_mem_region(rcrb, SZ_4K, "CXL RCRB")) >> - return CXL_RESOURCE_NONE; >> - addr = ioremap(rcrb, SZ_4K); >> - if (!addr) { >> - dev_err(dev, "Failed to map region %pr\n", addr); >> - release_mem_region(rcrb, SZ_4K); >> - return CXL_RESOURCE_NONE; >> - } >> - >> + addr = map.base; >> id = readl(addr + PCI_VENDOR_ID); >> cmd = readw(addr + PCI_COMMAND); >> bar0 = readl(addr + PCI_BASE_ADDRESS_0); >> bar1 = readl(addr + PCI_BASE_ADDRESS_1); >> - iounmap(addr); >> - release_mem_region(rcrb, SZ_4K); >> + cxl_unmap_reg(dev, &map); >> >> /* >> * Sanity check, see CXL 3.0 Figure 9-8 CXL Device that Does Not >> @@ -396,3 +416,52 @@ resource_size_t cxl_rcrb_to_component(struct device *dev, >> return component_reg_phys; >> } >> EXPORT_SYMBOL_NS_GPL(cxl_rcrb_to_component, CXL); >> + >> +u16 cxl_rcrb_to_aer(struct device *dev, resource_size_t rcrb) >> +{ >> + struct cxl_register_map map = { >> + .resource = rcrb, >> + .max_size = SZ_4K, >> + }; >> + u32 cap_hdr; >> + u16 offset = 0; >> + >> + if (!cxl_map_reg(dev, &map, "CXL RCRB")) >> + return 0; >> + >> + cap_hdr = readl(map.base + offset); >> + while (PCI_EXT_CAP_ID(cap_hdr) != PCI_EXT_CAP_ID_ERR) { >> + >> 
+ offset = PCI_EXT_CAP_NEXT(cap_hdr); >> + if (!offset) { >> + cxl_unmap_reg(dev, &map); >> + return 0; >> + } >> + cap_hdr = readl(map.base + offset); >> + } >> + >> + dev_dbg(dev, "found AER extended capability (0x%x)\n", offset); >> + cxl_unmap_reg(dev, &map); >> + >> + return offset; >> +} >> +EXPORT_SYMBOL_NS_GPL(cxl_rcrb_to_aer, CXL); > >> + >> +u16 cxl_component_to_ras(struct device *dev, resource_size_t component_reg_phys) >> +{ >> + struct cxl_register_map map = { >> + .resource = component_reg_phys, >> + .max_size = CXL_COMPONENT_REG_BLOCK_SIZE, >> + }; >> + >> + if (!cxl_map_reg(dev, &map, "component")) >> + return 0; >> + >> + cxl_probe_component_regs(dev, map.base, &map.component_map); >> + cxl_unmap_reg(dev, &map); >> + if (!map.component_map.ras.valid) >> + return 0; >> + >> + return map.component_map.ras.offset; >> +} >> +EXPORT_SYMBOL_NS_GPL(cxl_component_to_ras, CXL); >> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h >> index 044a92d9813e..df64c402e6e6 100644 >> --- a/drivers/cxl/cxl.h >> +++ b/drivers/cxl/cxl.h >> @@ -270,6 +270,9 @@ enum cxl_rcrb { >> resource_size_t cxl_rcrb_to_component(struct device *dev, >> resource_size_t rcrb, >> enum cxl_rcrb which); >> +u16 cxl_rcrb_to_aer(struct device *dev, resource_size_t rcrb); >> +u16 cxl_component_to_ras(struct device *dev, >> + resource_size_t component_reg_phys); >> >> #define CXL_RESOURCE_NONE ((resource_size_t) -1) >> #define CXL_TARGET_STRLEN 20 >> @@ -601,6 +604,8 @@ struct cxl_dport { >> int port_id; >> resource_size_t component_reg_phys; >> resource_size_t rcrb; >> + u16 aer_cap; >> + u16 ras_cap; >> bool rch; >> struct cxl_port *port; >> }; >> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c >> index 39c4b54f0715..014295ab6bc6 100644 >> --- a/drivers/cxl/mem.c >> +++ b/drivers/cxl/mem.c >> @@ -45,13 +45,36 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data) >> return 0; >> } >> >> +static void cxl_setup_rcrb(struct cxl_dev_state *cxlds, >> + struct cxl_dport 
*parent_dport) >> +{ >> + struct cxl_memdev *cxlmd = cxlds->cxlmd; >> + >> + if (!parent_dport->rch) >> + return; >> + >> + /* >> + * The component registers for an RCD might come from the >> + * host-bridge RCRB if they are not already mapped via the >> + * typical register locator mechanism. >> + */ >> + if (cxlds->component_reg_phys == CXL_RESOURCE_NONE) >> + cxlds->component_reg_phys = cxl_rcrb_to_component( >> + &cxlmd->dev, parent_dport->rcrb, CXL_RCRB_UPSTREAM); >> + >> + parent_dport->aer_cap = cxl_rcrb_to_aer(parent_dport->dport, >> + parent_dport->rcrb); > > Hmm, how about just retrieve this as part of cxl_rcrb_to_component() > (renamed to cxl_probe_rcrb()), and make an rch dport its own distinct > object? Otherwise it feels odd to be retrieving downstream port > properties this late at upstream port component register detection time. > It also feels awkward to keep adding more RCH dport specific details to > the common 'struct cxl_dport'. So, I'm thinking something like the > following (compiled and cxl_test regression passed): > Thanks. I applied to this patchset's base and will include in our next revision. > -- >8 -- > From 18fbc72f98655d10301c7a35f614b6152f46c44b Mon Sep 17 00:00:00 2001 > From: Dan Williams <dan.j.williams@intel.com> > Date: Mon, 17 Apr 2023 15:45:50 -0700 > Subject: [PATCH] cxl/rch: Prepare for caching the MMIO mapped PCIe AER > capability > > Prepare cxl_probe_rcrb() for retrieving more than just the component > register block. The RCH AER handling code wants to get back to the AER > capability that happens to be MMIO mapped rather then configuration > cycles. > > Move rcrb specific dport data, like the RCRB base and the AER capability > offset, into its own data structure ('struct cxl_rcrb_info') for > cxl_probe_rcrb() to fill. Introduce 'struct cxl_rch_dport' to wrap a > 'struct cxl_dport' with a 'struct cxl_rcrb_info' attribute. > > This centralizes all RCRB scanning in one routine. 
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com> > --- > drivers/cxl/acpi.c | 16 ++++++++-------- > drivers/cxl/core/port.c | 33 +++++++++++++++++++++------------ > drivers/cxl/core/regs.c | 12 ++++++++---- > drivers/cxl/cxl.h | 21 +++++++++++++++------ > drivers/cxl/mem.c | 15 ++++++++++----- > tools/testing/cxl/Kbuild | 2 +- > tools/testing/cxl/test/cxl.c | 10 ++++++---- > tools/testing/cxl/test/mock.c | 12 ++++++------ > tools/testing/cxl/test/mock.h | 7 ++++--- > 9 files changed, 79 insertions(+), 49 deletions(-) > > diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c > index 4e66483f1fd3..2647eb04fcdb 100644 > --- a/drivers/cxl/acpi.c > +++ b/drivers/cxl/acpi.c > @@ -375,7 +375,7 @@ static int add_host_bridge_uport(struct device *match, void *arg) > struct cxl_chbs_context { > struct device *dev; > unsigned long long uid; > - resource_size_t rcrb; > + struct cxl_rcrb_info rcrb; > resource_size_t chbcr; > u32 cxl_version; > }; > @@ -395,7 +395,7 @@ static int cxl_get_chbcr(union acpi_subtable_headers *header, void *arg, > return 0; > > ctx->cxl_version = chbs->cxl_version; > - ctx->rcrb = CXL_RESOURCE_NONE; > + ctx->rcrb.base = CXL_RESOURCE_NONE; > ctx->chbcr = CXL_RESOURCE_NONE; > > if (!chbs->base) > @@ -409,9 +409,8 @@ static int cxl_get_chbcr(union acpi_subtable_headers *header, void *arg, > if (chbs->length != CXL_RCRB_SIZE) > return 0; > > - ctx->rcrb = chbs->base; > - ctx->chbcr = cxl_rcrb_to_component(ctx->dev, chbs->base, > - CXL_RCRB_DOWNSTREAM); > + ctx->chbcr = cxl_probe_rcrb(ctx->dev, chbs->base, &ctx->rcrb, > + CXL_RCRB_DOWNSTREAM); > > return 0; > } > @@ -451,8 +450,9 @@ static int add_host_bridge_dport(struct device *match, void *arg) > return 0; > } > > - if (ctx.rcrb != CXL_RESOURCE_NONE) > - dev_dbg(match, "RCRB found for UID %lld: %pa\n", uid, &ctx.rcrb); > + if (ctx.rcrb.base != CXL_RESOURCE_NONE) > + dev_dbg(match, "RCRB found for UID %lld: %pa\n", uid, > + &ctx.rcrb.base); > > if (ctx.chbcr == CXL_RESOURCE_NONE) { > 
dev_warn(match, "CHBCR invalid for Host Bridge (UID %lld)\n", > @@ -466,7 +466,7 @@ static int add_host_bridge_dport(struct device *match, void *arg) > bridge = pci_root->bus->bridge; > if (ctx.cxl_version == ACPI_CEDT_CHBS_VERSION_CXL11) > dport = devm_cxl_add_rch_dport(root_port, bridge, uid, > - ctx.chbcr, ctx.rcrb); > + ctx.chbcr, &ctx.rcrb); > else > dport = devm_cxl_add_dport(root_port, bridge, uid, > ctx.chbcr); > diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c > index 4003f445320c..d194f48259ff 100644 > --- a/drivers/cxl/core/port.c > +++ b/drivers/cxl/core/port.c > @@ -920,7 +920,7 @@ static void cxl_dport_unlink(void *data) > static struct cxl_dport * > __devm_cxl_add_dport(struct cxl_port *port, struct device *dport_dev, > int port_id, resource_size_t component_reg_phys, > - resource_size_t rcrb) > + struct cxl_rcrb_info *ri) > { > char link_name[CXL_TARGET_STRLEN]; > struct cxl_dport *dport; > @@ -942,17 +942,26 @@ __devm_cxl_add_dport(struct cxl_port *port, struct device *dport_dev, > CXL_TARGET_STRLEN) > return ERR_PTR(-EINVAL); > > - dport = devm_kzalloc(host, sizeof(*dport), GFP_KERNEL); > - if (!dport) > - return ERR_PTR(-ENOMEM); > + if (ri && ri->base != CXL_RESOURCE_NONE) { > + struct cxl_rch_dport *rdport; > + > + rdport = devm_kzalloc(host, sizeof(*rdport), GFP_KERNEL); > + if (!rdport) > + return ERR_PTR(-ENOMEM); > + > + rdport->rcrb.base = ri->base; > + dport = &rdport->dport; > + dport->rch = true; > + } else { > + dport = devm_kzalloc(host, sizeof(*dport), GFP_KERNEL); > + if (!dport) > + return ERR_PTR(-ENOMEM); > + } > > dport->dport = dport_dev; > dport->port_id = port_id; > dport->component_reg_phys = component_reg_phys; > dport->port = port; > - if (rcrb != CXL_RESOURCE_NONE) > - dport->rch = true; > - dport->rcrb = rcrb; > > cond_cxl_root_lock(port); > rc = add_dport(port, dport); > @@ -994,7 +1003,7 @@ struct cxl_dport *devm_cxl_add_dport(struct cxl_port *port, > struct cxl_dport *dport; > > dport = 
__devm_cxl_add_dport(port, dport_dev, port_id, > - component_reg_phys, CXL_RESOURCE_NONE); > + component_reg_phys, NULL); > if (IS_ERR(dport)) { > dev_dbg(dport_dev, "failed to add dport to %s: %ld\n", > dev_name(&port->dev), PTR_ERR(dport)); > @@ -1013,24 +1022,24 @@ EXPORT_SYMBOL_NS_GPL(devm_cxl_add_dport, CXL); > * @dport_dev: firmware or PCI device representing the dport > * @port_id: identifier for this dport in a decoder's target list > * @component_reg_phys: optional location of CXL component registers > - * @rcrb: mandatory location of a Root Complex Register Block > + * @ri: mandatory data about the Root Complex Register Block layout > * > * See CXL 3.0 9.11.8 CXL Devices Attached to an RCH > */ > struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port, > struct device *dport_dev, int port_id, > resource_size_t component_reg_phys, > - resource_size_t rcrb) > + struct cxl_rcrb_info *ri) > { > struct cxl_dport *dport; > > - if (rcrb == CXL_RESOURCE_NONE) { > + if (!ri || ri->base == CXL_RESOURCE_NONE) { > dev_dbg(&port->dev, "failed to add RCH dport, missing RCRB\n"); > return ERR_PTR(-EINVAL); > } > > dport = __devm_cxl_add_dport(port, dport_dev, port_id, > - component_reg_phys, rcrb); > + component_reg_phys, ri); > if (IS_ERR(dport)) { > dev_dbg(dport_dev, "failed to add RCH dport to %s: %ld\n", > dev_name(&port->dev), PTR_ERR(dport)); > diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c > index 52d1dbeda527..b1c0db898a50 100644 > --- a/drivers/cxl/core/regs.c > +++ b/drivers/cxl/core/regs.c > @@ -332,9 +332,8 @@ int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type, > } > EXPORT_SYMBOL_NS_GPL(cxl_find_regblock, CXL); > > -resource_size_t cxl_rcrb_to_component(struct device *dev, > - resource_size_t rcrb, > - enum cxl_rcrb which) > +resource_size_t cxl_probe_rcrb(struct device *dev, resource_size_t rcrb, > + struct cxl_rcrb_info *ri, enum cxl_rcrb which) > { > resource_size_t component_reg_phys; > void __iomem *addr; > @@ 
-344,6 +343,8 @@ resource_size_t cxl_rcrb_to_component(struct device *dev, > > if (which == CXL_RCRB_UPSTREAM) > rcrb += SZ_4K; > + else > + ri->base = rcrb; > > /* > * RCRB's BAR[0..1] point to component block containing CXL > @@ -364,6 +365,9 @@ resource_size_t cxl_rcrb_to_component(struct device *dev, > cmd = readw(addr + PCI_COMMAND); > bar0 = readl(addr + PCI_BASE_ADDRESS_0); > bar1 = readl(addr + PCI_BASE_ADDRESS_1); > + > + /* TODO: retrieve rcrb->aer_cap here */ > + Ack > iounmap(addr); > release_mem_region(rcrb, SZ_4K); > > @@ -395,4 +399,4 @@ resource_size_t cxl_rcrb_to_component(struct device *dev, > > return component_reg_phys; > } > -EXPORT_SYMBOL_NS_GPL(cxl_rcrb_to_component, CXL); > +EXPORT_SYMBOL_NS_GPL(cxl_probe_rcrb, CXL); > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h > index 1503ccec9a84..b0807f54e9fd 100644 > --- a/drivers/cxl/cxl.h > +++ b/drivers/cxl/cxl.h > @@ -267,9 +267,9 @@ enum cxl_rcrb { > CXL_RCRB_DOWNSTREAM, > CXL_RCRB_UPSTREAM, > }; > -resource_size_t cxl_rcrb_to_component(struct device *dev, > - resource_size_t rcrb, > - enum cxl_rcrb which); > +struct cxl_rcrb_info; > +resource_size_t cxl_probe_rcrb(struct device *dev, resource_size_t rcrb, > + struct cxl_rcrb_info *ri, enum cxl_rcrb which); > > #define CXL_RESOURCE_NONE ((resource_size_t) -1) > #define CXL_TARGET_STRLEN 20 > @@ -589,12 +589,12 @@ cxl_find_dport_by_dev(struct cxl_port *port, const struct device *dport_dev) > return xa_load(&port->dports, (unsigned long)dport_dev); > } > > + > /** > * struct cxl_dport - CXL downstream port > * @dport: PCI bridge or firmware device representing the downstream link > * @port_id: unique hardware identifier for dport in decoder target list > * @component_reg_phys: downstream port component registers > - * @rcrb: base address for the Root Complex Register Block > * @rch: Indicate whether this dport was enumerated in RCH or VH mode > * @port: reference to cxl_port that contains this downstream port > */ > @@ -602,11 +602,20 @@ 
struct cxl_dport { > struct device *dport; > int port_id; > resource_size_t component_reg_phys; > - resource_size_t rcrb; > bool rch; > struct cxl_port *port; > }; > > +struct cxl_rcrb_info { > + resource_size_t base; > + u16 aer_cap; > +}; > + > +struct cxl_rch_dport { > + struct cxl_dport dport; > + struct cxl_rcrb_info rcrb; > +}; > + > /** > * struct cxl_ep - track an endpoint's interest in a port > * @ep: device that hosts a generic CXL endpoint (expander or accelerator) > @@ -674,7 +683,7 @@ struct cxl_dport *devm_cxl_add_dport(struct cxl_port *port, > struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port, > struct device *dport_dev, int port_id, > resource_size_t component_reg_phys, > - resource_size_t rcrb); > + struct cxl_rcrb_info *ri); > > struct cxl_decoder *to_cxl_decoder(struct device *dev); > struct cxl_root_decoder *to_cxl_root_decoder(struct device *dev); > diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c > index 097d86dd2a8e..7da6135e0b17 100644 > --- a/drivers/cxl/mem.c > +++ b/drivers/cxl/mem.c > @@ -71,10 +71,15 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd, > * host-bridge RCRB if they are not already mapped via the > * typical register locator mechanism. 
> */ > - if (parent_dport->rch && cxlds->component_reg_phys == CXL_RESOURCE_NONE) > - component_reg_phys = cxl_rcrb_to_component( > - &cxlmd->dev, parent_dport->rcrb, CXL_RCRB_UPSTREAM); > - else > + if (parent_dport->rch && > + cxlds->component_reg_phys == CXL_RESOURCE_NONE) { > + struct cxl_rch_dport *rdport = > + container_of(parent_dport, typeof(*rdport), dport); > + > + component_reg_phys = > + cxl_probe_rcrb(&cxlmd->dev, rdport->rcrb.base, > + &rdport->rcrb, CXL_RCRB_UPSTREAM); > + } else > component_reg_phys = cxlds->component_reg_phys; > endpoint = devm_cxl_add_port(host, &cxlmd->dev, component_reg_phys, > parent_dport); > @@ -92,7 +97,7 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd, > } > > return 0; > -} > + } > > static int cxl_mem_probe(struct device *dev) > { > diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild > index fba7bec96acd..bef1bc3bd912 100644 > --- a/tools/testing/cxl/Kbuild > +++ b/tools/testing/cxl/Kbuild > @@ -11,7 +11,7 @@ ldflags-y += --wrap=devm_cxl_enumerate_decoders > ldflags-y += --wrap=cxl_await_media_ready > ldflags-y += --wrap=cxl_hdm_decode_init > ldflags-y += --wrap=cxl_dvsec_rr_decode > -ldflags-y += --wrap=cxl_rcrb_to_component > +ldflags-y += --wrap=cxl_probe_rcrb > > DRIVERS := ../../../drivers > CXL_SRC := $(DRIVERS)/cxl > diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c > index 385cdeeab22c..805c79491485 100644 > --- a/tools/testing/cxl/test/cxl.c > +++ b/tools/testing/cxl/test/cxl.c > @@ -983,12 +983,14 @@ static int mock_cxl_port_enumerate_dports(struct cxl_port *port) > return 0; > } > > -resource_size_t mock_cxl_rcrb_to_component(struct device *dev, > - resource_size_t rcrb, > - enum cxl_rcrb which) > +resource_size_t mock_cxl_probe_rcrb(struct device *dev, resource_size_t rcrb, > + struct cxl_rcrb_info *ri, enum cxl_rcrb which) > { > dev_dbg(dev, "rcrb: %pa which: %d\n", &rcrb, which); > > + if (which == CXL_RCRB_DOWNSTREAM) > + ri->base = rcrb; 
> + > return (resource_size_t) which + 1; > } > > @@ -1000,7 +1002,7 @@ static struct cxl_mock_ops cxl_mock_ops = { > .is_mock_dev = is_mock_dev, > .acpi_table_parse_cedt = mock_acpi_table_parse_cedt, > .acpi_evaluate_integer = mock_acpi_evaluate_integer, > - .cxl_rcrb_to_component = mock_cxl_rcrb_to_component, > + .cxl_probe_rcrb = mock_cxl_probe_rcrb, > .acpi_pci_find_root = mock_acpi_pci_find_root, > .devm_cxl_port_enumerate_dports = mock_cxl_port_enumerate_dports, > .devm_cxl_setup_hdm = mock_cxl_setup_hdm, > diff --git a/tools/testing/cxl/test/mock.c b/tools/testing/cxl/test/mock.c > index c4e53f22e421..148bd4f184f5 100644 > --- a/tools/testing/cxl/test/mock.c > +++ b/tools/testing/cxl/test/mock.c > @@ -244,9 +244,9 @@ int __wrap_cxl_dvsec_rr_decode(struct device *dev, int dvsec, > } > EXPORT_SYMBOL_NS_GPL(__wrap_cxl_dvsec_rr_decode, CXL); > > -resource_size_t __wrap_cxl_rcrb_to_component(struct device *dev, > - resource_size_t rcrb, > - enum cxl_rcrb which) > +resource_size_t __wrap_cxl_probe_rcrb(struct device *dev, resource_size_t rcrb, > + struct cxl_rcrb_info *ri, > + enum cxl_rcrb which) > { > int index; > resource_size_t component_reg_phys; > @@ -254,14 +254,14 @@ resource_size_t __wrap_cxl_rcrb_to_component(struct device *dev, > > if (ops && ops->is_mock_port(dev)) > component_reg_phys = > - ops->cxl_rcrb_to_component(dev, rcrb, which); > + ops->cxl_probe_rcrb(dev, rcrb, ri, which); > else > - component_reg_phys = cxl_rcrb_to_component(dev, rcrb, which); > + component_reg_phys = cxl_probe_rcrb(dev, rcrb, ri, which); > put_cxl_mock_ops(index); > > return component_reg_phys; > } > -EXPORT_SYMBOL_NS_GPL(__wrap_cxl_rcrb_to_component, CXL); > +EXPORT_SYMBOL_NS_GPL(__wrap_cxl_probe_rcrb, CXL); > > MODULE_LICENSE("GPL v2"); > MODULE_IMPORT_NS(ACPI); > diff --git a/tools/testing/cxl/test/mock.h b/tools/testing/cxl/test/mock.h > index bef8817b01f2..7ef21356d052 100644 > --- a/tools/testing/cxl/test/mock.h > +++ b/tools/testing/cxl/test/mock.h > @@ -15,9 +15,10 
@@ struct cxl_mock_ops {
> acpi_string pathname,
> struct acpi_object_list *arguments,
> unsigned long long *data);
> - resource_size_t (*cxl_rcrb_to_component)(struct device *dev,
> - resource_size_t rcrb,
> - enum cxl_rcrb which);
> + resource_size_t (*cxl_probe_rcrb)(struct device *dev,
> + resource_size_t rcrb,
> + struct cxl_rcrb_info *ri,
> + enum cxl_rcrb which);
> struct acpi_pci_root *(*acpi_pci_find_root)(acpi_handle handle);
> bool (*is_mock_bus)(struct pci_bus *bus);
> bool (*is_mock_port)(struct device *dev);

My email client chopped off the following:

--
2.39.2
-- >8 --

> +
> + parent_dport->ras_cap = cxl_component_to_ras(parent_dport->dport,
> + parent_dport->component_reg_phys);

[Dan]
Since this is component register offset based can it not be shared with
the VH case? I have been expecting that RCH RAS capability and VH RAS
capability scanning would need to be unified in the cxl_port driver.

[Terry]
The same probe function is called indirectly:

cxl_component_to_ras()
  cxl_probe_component_regs() <<==

The VH has:

cxl_setup_regs() (cxl/pci.c)
  cxl_probe_regs() (cxl/pci.c)
    cxl_probe_component_regs() (cxl/core/regs.c) <<==

Would you like to see the RCH RAS discovery reuse cxl_probe_regs() or
(parts of) cxl_setup_regs() as well? I ask because cxl_probe_regs() and
cxl_setup_regs() are static in cxl/pci.c. Of course they can be moved out
to cxl/core/regs.c to avoid an unwanted dependency on an exported pci.c
symbol. This would be a significant change just for the RCH case.

You mentioned earlier it is late to add downstream port properties in mem
device initialization. Where would you prefer the downstream RAS is
discovered and mapped? I understand it needs to reuse as much existing
code as possible, but where should it be called from? The options I see
are here in mem.c or in __devm_cxl_add_dport(), but neither is ideal.

^ permalink raw reply [flat|nested] 52+ messages in thread
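[Editorial sketch, not part of the thread] The capability walk in the quoted cxl_rcrb_to_aer() hunk can be tried outside the kernel. The following is a minimal userspace sketch, assuming the RCRB is available as a 4K byte buffer. The helpers rcrb_read32(), rcrb_write32(), ext_cap_hdr() and rcrb_find_aer() are hypothetical names introduced for illustration; only the header field layout (ID in bits 15:0, next pointer in bits 31:20, AER ID 0x01) mirrors the PCI_EXT_CAP_* macros the patch uses. The hop limit is an addition of this sketch and is not in the quoted code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Extended capability header fields, mirroring PCI_EXT_CAP_ID()/_NEXT(). */
#define EXT_CAP_ID(hdr)   ((hdr) & 0x0000ffffu)
#define EXT_CAP_NEXT(hdr) (((hdr) >> 20) & 0xffcu)
#define EXT_CAP_ID_ERR    0x01u	/* Advanced Error Reporting */

#define RCRB_SIZE         4096u

/* Hypothetical stand-in for readl() on a mapped RCRB (host byte order). */
static uint32_t rcrb_read32(const uint8_t *rcrb, uint16_t offset)
{
	uint32_t val;

	assert(rcrb != NULL && offset <= RCRB_SIZE - sizeof(val));
	memcpy(&val, rcrb + offset, sizeof(val));
	return val;
}

/* Hypothetical helper used only to build test images. */
static void rcrb_write32(uint8_t *rcrb, uint16_t offset, uint32_t val)
{
	assert(rcrb != NULL && offset <= RCRB_SIZE - sizeof(val));
	memcpy(rcrb + offset, &val, sizeof(val));
}

/* Build a header: capability ID, version 1, dword-aligned next pointer. */
static uint32_t ext_cap_hdr(uint16_t id, uint16_t next)
{
	return (uint32_t)id | (1u << 16) | ((uint32_t)next << 20);
}

/*
 * Walk the list from offset 0, as the quoted cxl_rcrb_to_aer() does.
 * Returns the AER capability offset, or 0 if the list ends first.
 * The hop limit (an assumption of this sketch) stops a looping list.
 */
static uint16_t rcrb_find_aer(const uint8_t *rcrb)
{
	uint16_t offset = 0;
	uint32_t hdr = rcrb_read32(rcrb, offset);
	unsigned int hops = RCRB_SIZE / 4;

	while (EXT_CAP_ID(hdr) != EXT_CAP_ID_ERR) {
		offset = EXT_CAP_NEXT(hdr);
		if (!offset || !--hops)
			return 0;
		hdr = rcrb_read32(rcrb, offset);
	}
	return offset;
}
```

As in the quoted function, a return value of 0 means "no AER capability found", so callers would treat a zero aer_cap as absent.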
* Re: [PATCH v3 1/6] cxl/pci: Add RCH downstream port AER and RAS register discovery 2023-04-17 23:00 ` Dan Williams 2023-04-18 15:59 ` Terry Bowman @ 2023-04-27 13:52 ` Robert Richter 1 sibling, 0 replies; 52+ messages in thread From: Robert Richter @ 2023-04-27 13:52 UTC (permalink / raw) To: Dan Williams Cc: Terry Bowman, alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dave.jiang, Jonathan.Cameron, linux-cxl, linux-kernel, bhelgaas Hi Dan, see comment to your patch below. On 17.04.23 16:00:47, Dan Williams wrote: > Terry Bowman wrote: > > + /* > > + * The component registers for an RCD might come from the > > + * host-bridge RCRB if they are not already mapped via the > > + * typical register locator mechanism. > > + */ > > + if (cxlds->component_reg_phys == CXL_RESOURCE_NONE) > > + cxlds->component_reg_phys = cxl_rcrb_to_component( > > + &cxlmd->dev, parent_dport->rcrb, CXL_RCRB_UPSTREAM); > > + > > + parent_dport->aer_cap = cxl_rcrb_to_aer(parent_dport->dport, > > + parent_dport->rcrb); > > Hmm, how about just retrieve this as part of cxl_rcrb_to_component() > (renamed to cxl_probe_rcrb()), and make an rch dport its own distinct > object? Otherwise it feels odd to be retrieving downstream port > properties this late at upstream port component register detection time. > It also feels awkward to keep adding more RCH dport specific details to > the common 'struct cxl_dport'. So, I'm thinking something like the > following (compiled and cxl_test regression passed): > > -- >8 -- > From 18fbc72f98655d10301c7a35f614b6152f46c44b Mon Sep 17 00:00:00 2001 > From: Dan Williams <dan.j.williams@intel.com> > Date: Mon, 17 Apr 2023 15:45:50 -0700 > Subject: [PATCH] cxl/rch: Prepare for caching the MMIO mapped PCIe AER > capability > > Prepare cxl_probe_rcrb() for retrieving more than just the component > register block. The RCH AER handling code wants to get back to the AER > capability that happens to be MMIO mapped rather then configuration > cycles. 
> > Move rcrb specific dport data, like the RCRB base and the AER capability > offset, into its own data structure ('struct cxl_rcrb_info') for > cxl_probe_rcrb() to fill. Introduce 'struct cxl_rch_dport' to wrap a > 'struct cxl_dport' with a 'struct cxl_rcrb_info' attribute. > > This centralizes all RCRB scanning in one routine. > > Signed-off-by: Dan Williams <dan.j.williams@intel.com> > --- > drivers/cxl/acpi.c | 16 ++++++++-------- > drivers/cxl/core/port.c | 33 +++++++++++++++++++++------------ > drivers/cxl/core/regs.c | 12 ++++++++---- > drivers/cxl/cxl.h | 21 +++++++++++++++------ > drivers/cxl/mem.c | 15 ++++++++++----- > tools/testing/cxl/Kbuild | 2 +- > tools/testing/cxl/test/cxl.c | 10 ++++++---- > tools/testing/cxl/test/mock.c | 12 ++++++------ > tools/testing/cxl/test/mock.h | 7 ++++--- > 9 files changed, 79 insertions(+), 49 deletions(-) > > diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c > index 4e66483f1fd3..2647eb04fcdb 100644 > --- a/drivers/cxl/acpi.c > +++ b/drivers/cxl/acpi.c > @@ -375,7 +375,7 @@ static int add_host_bridge_uport(struct device *match, void *arg) > struct cxl_chbs_context { > struct device *dev; > unsigned long long uid; > - resource_size_t rcrb; > + struct cxl_rcrb_info rcrb; > resource_size_t chbcr; > u32 cxl_version; > }; > @@ -395,7 +395,7 @@ static int cxl_get_chbcr(union acpi_subtable_headers *header, void *arg, > return 0; > > ctx->cxl_version = chbs->cxl_version; > - ctx->rcrb = CXL_RESOURCE_NONE; > + ctx->rcrb.base = CXL_RESOURCE_NONE; > ctx->chbcr = CXL_RESOURCE_NONE; > > if (!chbs->base) > @@ -409,9 +409,8 @@ static int cxl_get_chbcr(union acpi_subtable_headers *header, void *arg, > if (chbs->length != CXL_RCRB_SIZE) > return 0; > > - ctx->rcrb = chbs->base; > - ctx->chbcr = cxl_rcrb_to_component(ctx->dev, chbs->base, > - CXL_RCRB_DOWNSTREAM); > + ctx->chbcr = cxl_probe_rcrb(ctx->dev, chbs->base, &ctx->rcrb, > + CXL_RCRB_DOWNSTREAM); Let's just extract the rcrb base here and do the probe later in 
__devm_cxl_add_dport(). Which means chbcr will be extracted there and we completely remove the cxl_rcrb_to_component() here. The code here becomes much simpler and the ACPI table parser no longer contains mmio mapping calls etc. > > return 0; > } > @@ -451,8 +450,9 @@ static int add_host_bridge_dport(struct device *match, void *arg) > return 0; > } > > - if (ctx.rcrb != CXL_RESOURCE_NONE) > - dev_dbg(match, "RCRB found for UID %lld: %pa\n", uid, &ctx.rcrb); > + if (ctx.rcrb.base != CXL_RESOURCE_NONE) > + dev_dbg(match, "RCRB found for UID %lld: %pa\n", uid, > + &ctx.rcrb.base); > > if (ctx.chbcr == CXL_RESOURCE_NONE) { > dev_warn(match, "CHBCR invalid for Host Bridge (UID %lld)\n", > @@ -466,7 +466,7 @@ static int add_host_bridge_dport(struct device *match, void *arg) > bridge = pci_root->bus->bridge; > if (ctx.cxl_version == ACPI_CEDT_CHBS_VERSION_CXL11) > dport = devm_cxl_add_rch_dport(root_port, bridge, uid, > - ctx.chbcr, ctx.rcrb); > + ctx.chbcr, &ctx.rcrb); > else > dport = devm_cxl_add_dport(root_port, bridge, uid, > ctx.chbcr); > diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c > index 4003f445320c..d194f48259ff 100644 > --- a/drivers/cxl/core/port.c > +++ b/drivers/cxl/core/port.c > @@ -920,7 +920,7 @@ static void cxl_dport_unlink(void *data) > static struct cxl_dport * > __devm_cxl_add_dport(struct cxl_port *port, struct device *dport_dev, > int port_id, resource_size_t component_reg_phys, > - resource_size_t rcrb) > + struct cxl_rcrb_info *ri) > { > char link_name[CXL_TARGET_STRLEN]; > struct cxl_dport *dport; > @@ -942,17 +942,26 @@ __devm_cxl_add_dport(struct cxl_port *port, struct device *dport_dev, > CXL_TARGET_STRLEN) > return ERR_PTR(-EINVAL); > > - dport = devm_kzalloc(host, sizeof(*dport), GFP_KERNEL); > - if (!dport) > - return ERR_PTR(-ENOMEM); > + if (ri && ri->base != CXL_RESOURCE_NONE) { > + struct cxl_rch_dport *rdport; > + > + rdport = devm_kzalloc(host, sizeof(*rdport), GFP_KERNEL); > + if (!rdport) > + return 
ERR_PTR(-ENOMEM);
> +
> + rdport->rcrb.base = ri->base;
> + dport = &rdport->dport;
> + dport->rch = true;
> + } else {
> + dport = devm_kzalloc(host, sizeof(*dport), GFP_KERNEL);
> + if (!dport)
> + return ERR_PTR(-ENOMEM);

I think we can simplify the allocation part if we just move the struct
into 'struct cxl_dport', see below.

> + }
>
> dport->dport = dport_dev;
> dport->port_id = port_id;
> dport->component_reg_phys = component_reg_phys;
> dport->port = port;
> - if (rcrb != CXL_RESOURCE_NONE)
> - dport->rch = true;
> - dport->rcrb = rcrb;
>
> cond_cxl_root_lock(port);
> rc = add_dport(port, dport);
> @@ -994,7 +1003,7 @@ struct cxl_dport *devm_cxl_add_dport(struct cxl_port *port,
> struct cxl_dport *dport;
>
> dport = __devm_cxl_add_dport(port, dport_dev, port_id,
> - component_reg_phys, CXL_RESOURCE_NONE);
> + component_reg_phys, NULL);
> if (IS_ERR(dport)) {
> dev_dbg(dport_dev, "failed to add dport to %s: %ld\n",
> dev_name(&port->dev), PTR_ERR(dport));
> @@ -1013,24 +1022,24 @@ EXPORT_SYMBOL_NS_GPL(devm_cxl_add_dport, CXL);
> * @dport_dev: firmware or PCI device representing the dport
> * @port_id: identifier for this dport in a decoder's target list
> * @component_reg_phys: optional location of CXL component registers
> - * @rcrb: mandatory location of a Root Complex Register Block
> + * @ri: mandatory data about the Root Complex Register Block layout
> *
> * See CXL 3.0 9.11.8 CXL Devices Attached to an RCH
> */
> struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
> struct device *dport_dev, int port_id,
> resource_size_t component_reg_phys,
> - resource_size_t rcrb)
> + struct cxl_rcrb_info *ri)
> {
> struct cxl_dport *dport;
>
> - if (rcrb == CXL_RESOURCE_NONE) {
> + if (!ri || ri->base == CXL_RESOURCE_NONE) {
> dev_dbg(&port->dev, "failed to add RCH dport, missing RCRB\n");
> return ERR_PTR(-EINVAL);
> }
>
> dport = __devm_cxl_add_dport(port, dport_dev, port_id,
> - component_reg_phys, rcrb);
> + component_reg_phys, ri);
> if
(IS_ERR(dport)) {
> dev_dbg(dport_dev, "failed to add RCH dport to %s: %ld\n",
> dev_name(&port->dev), PTR_ERR(dport));
> diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
> index 52d1dbeda527..b1c0db898a50 100644
> --- a/drivers/cxl/core/regs.c
> +++ b/drivers/cxl/core/regs.c
> @@ -332,9 +332,8 @@ int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type,
> }
> EXPORT_SYMBOL_NS_GPL(cxl_find_regblock, CXL);
>
> -resource_size_t cxl_rcrb_to_component(struct device *dev,
> - resource_size_t rcrb,
> - enum cxl_rcrb which)
> +resource_size_t cxl_probe_rcrb(struct device *dev, resource_size_t rcrb,
> + struct cxl_rcrb_info *ri, enum cxl_rcrb which)
> {
> resource_size_t component_reg_phys;
> void __iomem *addr;
> @@ -344,6 +343,8 @@ resource_size_t cxl_rcrb_to_component(struct device *dev,
>
> if (which == CXL_RCRB_UPSTREAM)
> rcrb += SZ_4K;
> + else
> + ri->base = rcrb;

For upstream ports nothing is written to ri, so allow a NULL pointer for
ri, but check for NULL here.

>
> /*
> * RCRB's BAR[0..1] point to component block containing CXL
> @@ -364,6 +365,9 @@ resource_size_t cxl_rcrb_to_component(struct device *dev,
> cmd = readw(addr + PCI_COMMAND);
> bar0 = readl(addr + PCI_BASE_ADDRESS_0);
> bar1 = readl(addr + PCI_BASE_ADDRESS_1);
> +
> + /* TODO: retrieve rcrb->aer_cap here */
> +

Yes, very good. The AER cap of the RCRB would then be available very
early and would be independent of drivers other than cxl_acpi, esp. the
PCI subsystem.
> iounmap(addr); > release_mem_region(rcrb, SZ_4K); > > @@ -395,4 +399,4 @@ resource_size_t cxl_rcrb_to_component(struct device *dev, > > return component_reg_phys; > } > -EXPORT_SYMBOL_NS_GPL(cxl_rcrb_to_component, CXL); > +EXPORT_SYMBOL_NS_GPL(cxl_probe_rcrb, CXL); > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h > index 1503ccec9a84..b0807f54e9fd 100644 > --- a/drivers/cxl/cxl.h > +++ b/drivers/cxl/cxl.h > @@ -267,9 +267,9 @@ enum cxl_rcrb { > CXL_RCRB_DOWNSTREAM, > CXL_RCRB_UPSTREAM, > }; > -resource_size_t cxl_rcrb_to_component(struct device *dev, > - resource_size_t rcrb, > - enum cxl_rcrb which); > +struct cxl_rcrb_info; > +resource_size_t cxl_probe_rcrb(struct device *dev, resource_size_t rcrb, > + struct cxl_rcrb_info *ri, enum cxl_rcrb which); > > #define CXL_RESOURCE_NONE ((resource_size_t) -1) > #define CXL_TARGET_STRLEN 20 > @@ -589,12 +589,12 @@ cxl_find_dport_by_dev(struct cxl_port *port, const struct device *dport_dev) > return xa_load(&port->dports, (unsigned long)dport_dev); > } > > + We will drop that. > /** > * struct cxl_dport - CXL downstream port > * @dport: PCI bridge or firmware device representing the downstream link > * @port_id: unique hardware identifier for dport in decoder target list > * @component_reg_phys: downstream port component registers > - * @rcrb: base address for the Root Complex Register Block > * @rch: Indicate whether this dport was enumerated in RCH or VH mode > * @port: reference to cxl_port that contains this downstream port > */ > @@ -602,11 +602,20 @@ struct cxl_dport { > struct device *dport; > int port_id; > resource_size_t component_reg_phys; > - resource_size_t rcrb; > bool rch; > struct cxl_port *port; > }; > > +struct cxl_rcrb_info { > + resource_size_t base; > + u16 aer_cap; > +}; > + > +struct cxl_rch_dport { > + struct cxl_dport dport; > + struct cxl_rcrb_info rcrb; > +}; How about including cxl_rcrb_info directly in cxl_dport? 
This simplifies dport allocation and allows direct access in cxl_dport to
the cxl_rcrb_info without a container_of() call:

	struct cxl_dport {
		struct device *dport;
		struct cxl_port *port;
		int port_id;
		resource_size_t component_reg_phys;
		bool rch;
		struct cxl_rcrb_info rcrb;
	};

I know you were complaining about too many RCH dport specific details,
but all this is kept in cxl_rcrb_info and the struct itself is not too
big. The flat structure allows quick access, like:

	if (dport->rch)
		component_reg_phys = cxl_probe_rcrb(..., dport->rcrb.base, ...)

> +
> /**
> * struct cxl_ep - track an endpoint's interest in a port
> * @ep: device that hosts a generic CXL endpoint (expander or accelerator)
> @@ -674,7 +683,7 @@ struct cxl_dport *devm_cxl_add_dport(struct cxl_port *port,
> struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
> struct device *dport_dev, int port_id,
> resource_size_t component_reg_phys,
> - resource_size_t rcrb);
> + struct cxl_rcrb_info *ri);
>
> struct cxl_decoder *to_cxl_decoder(struct device *dev);
> struct cxl_root_decoder *to_cxl_root_decoder(struct device *dev);
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index 097d86dd2a8e..7da6135e0b17 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -71,10 +71,15 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
> * host-bridge RCRB if they are not already mapped via the
> * typical register locator mechanism.
> */ > - if (parent_dport->rch && cxlds->component_reg_phys == CXL_RESOURCE_NONE) > - component_reg_phys = cxl_rcrb_to_component( > - &cxlmd->dev, parent_dport->rcrb, CXL_RCRB_UPSTREAM); > - else > + if (parent_dport->rch && > + cxlds->component_reg_phys == CXL_RESOURCE_NONE) { > + struct cxl_rch_dport *rdport = > + container_of(parent_dport, typeof(*rdport), dport); > + > + component_reg_phys = > + cxl_probe_rcrb(&cxlmd->dev, rdport->rcrb.base, > + &rdport->rcrb, CXL_RCRB_UPSTREAM); This could overwrite the dport's contents with the upstream port info. But since we only need the info and write to it in case of a downstream port, let's set that to null here (plus adding a check in cxl_probe_rcrb()). Similar to the host case (cxl_acpi driver) where the rcrb is probed early, this code should be moved to cxl_pci. But since RAS does not use the upstream port's RCRB it is subject of a separate patch not part of this series. > + } else > component_reg_phys = cxlds->component_reg_phys; > endpoint = devm_cxl_add_port(host, &cxlmd->dev, component_reg_phys, > parent_dport); > @@ -92,7 +97,7 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd, > } > > return 0; > -} > + } Dropping that change. 
> > static int cxl_mem_probe(struct device *dev) > { > diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild > index fba7bec96acd..bef1bc3bd912 100644 > --- a/tools/testing/cxl/Kbuild > +++ b/tools/testing/cxl/Kbuild > @@ -11,7 +11,7 @@ ldflags-y += --wrap=devm_cxl_enumerate_decoders > ldflags-y += --wrap=cxl_await_media_ready > ldflags-y += --wrap=cxl_hdm_decode_init > ldflags-y += --wrap=cxl_dvsec_rr_decode > -ldflags-y += --wrap=cxl_rcrb_to_component > +ldflags-y += --wrap=cxl_probe_rcrb > > DRIVERS := ../../../drivers > CXL_SRC := $(DRIVERS)/cxl > diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c > index 385cdeeab22c..805c79491485 100644 > --- a/tools/testing/cxl/test/cxl.c > +++ b/tools/testing/cxl/test/cxl.c > @@ -983,12 +983,14 @@ static int mock_cxl_port_enumerate_dports(struct cxl_port *port) > return 0; > } > > -resource_size_t mock_cxl_rcrb_to_component(struct device *dev, > - resource_size_t rcrb, > - enum cxl_rcrb which) > +resource_size_t mock_cxl_probe_rcrb(struct device *dev, resource_size_t rcrb, > + struct cxl_rcrb_info *ri, enum cxl_rcrb which) > { > dev_dbg(dev, "rcrb: %pa which: %d\n", &rcrb, which); > > + if (which == CXL_RCRB_DOWNSTREAM) > + ri->base = rcrb; > + > return (resource_size_t) which + 1; > } > > @@ -1000,7 +1002,7 @@ static struct cxl_mock_ops cxl_mock_ops = { > .is_mock_dev = is_mock_dev, > .acpi_table_parse_cedt = mock_acpi_table_parse_cedt, > .acpi_evaluate_integer = mock_acpi_evaluate_integer, > - .cxl_rcrb_to_component = mock_cxl_rcrb_to_component, > + .cxl_probe_rcrb = mock_cxl_probe_rcrb, > .acpi_pci_find_root = mock_acpi_pci_find_root, > .devm_cxl_port_enumerate_dports = mock_cxl_port_enumerate_dports, > .devm_cxl_setup_hdm = mock_cxl_setup_hdm, > diff --git a/tools/testing/cxl/test/mock.c b/tools/testing/cxl/test/mock.c > index c4e53f22e421..148bd4f184f5 100644 > --- a/tools/testing/cxl/test/mock.c > +++ b/tools/testing/cxl/test/mock.c > @@ -244,9 +244,9 @@ int 
__wrap_cxl_dvsec_rr_decode(struct device *dev, int dvsec, > } > EXPORT_SYMBOL_NS_GPL(__wrap_cxl_dvsec_rr_decode, CXL); > > -resource_size_t __wrap_cxl_rcrb_to_component(struct device *dev, > - resource_size_t rcrb, > - enum cxl_rcrb which) > +resource_size_t __wrap_cxl_probe_rcrb(struct device *dev, resource_size_t rcrb, > + struct cxl_rcrb_info *ri, > + enum cxl_rcrb which) > { > int index; > resource_size_t component_reg_phys; > @@ -254,14 +254,14 @@ resource_size_t __wrap_cxl_rcrb_to_component(struct device *dev, > > if (ops && ops->is_mock_port(dev)) > component_reg_phys = > - ops->cxl_rcrb_to_component(dev, rcrb, which); > + ops->cxl_probe_rcrb(dev, rcrb, ri, which); > else > - component_reg_phys = cxl_rcrb_to_component(dev, rcrb, which); > + component_reg_phys = cxl_probe_rcrb(dev, rcrb, ri, which); > put_cxl_mock_ops(index); > > return component_reg_phys; > } > -EXPORT_SYMBOL_NS_GPL(__wrap_cxl_rcrb_to_component, CXL); > +EXPORT_SYMBOL_NS_GPL(__wrap_cxl_probe_rcrb, CXL); > > MODULE_LICENSE("GPL v2"); > MODULE_IMPORT_NS(ACPI); > diff --git a/tools/testing/cxl/test/mock.h b/tools/testing/cxl/test/mock.h > index bef8817b01f2..7ef21356d052 100644 > --- a/tools/testing/cxl/test/mock.h > +++ b/tools/testing/cxl/test/mock.h > @@ -15,9 +15,10 @@ struct cxl_mock_ops { > acpi_string pathname, > struct acpi_object_list *arguments, > unsigned long long *data); > - resource_size_t (*cxl_rcrb_to_component)(struct device *dev, > - resource_size_t rcrb, > - enum cxl_rcrb which); > + resource_size_t (*cxl_probe_rcrb)(struct device *dev, > + resource_size_t rcrb, > + struct cxl_rcrb_info *ri, > + enum cxl_rcrb which); > struct acpi_pci_root *(*acpi_pci_find_root)(acpi_handle handle); > bool (*is_mock_bus)(struct pci_bus *bus); > bool (*is_mock_port)(struct device *dev); > -- > 2.39.2 > -- >8 -- > > > > + > > + parent_dport->ras_cap = cxl_component_to_ras(parent_dport->dport, > > + parent_dport->component_reg_phys); > > Since this is component register offset based can it not 
be shared with
> the VH case? I have been expecting that RCH RAS capability and VH RAS
> capability scanning would need to be unified in the cxl_port driver.

I have a modified version of your patch with the following changes:

 * cxl_probe_rcrb():

   * Moved cxl_probe_rcrb() out of ACPI CEDT parse to
     __devm_cxl_add_dport() (separate patch).

   * Set cxl_rcrb_info pointer to NULL for upstream ports, including a
     NULL pointer check.

 * Integrated 'struct cxl_rcrb_info' in 'struct cxl_dport'.

 * Addressed comments above.

 * Dropped unrelated newlines and whitespace.

We will include the patches in v4.

Thanks,

-Robert

^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v3 2/6] efi/cper: Export cper_mem_err_unpack() for use by modules 2023-04-11 18:02 [PATCH v3 0/6] cxl/pci: Add support for RCH RAS error handling Terry Bowman 2023-04-11 18:02 ` [PATCH v3 1/6] cxl/pci: Add RCH downstream port AER and RAS register discovery Terry Bowman @ 2023-04-11 18:02 ` Terry Bowman 2023-04-12 11:04 ` Ard Biesheuvel ` (2 more replies) 2023-04-11 18:02 ` [PATCH v3 3/6] PCI/AER: Export cper_print_aer() " Terry Bowman ` (3 subsequent siblings) 5 siblings, 3 replies; 52+ messages in thread From: Terry Bowman @ 2023-04-11 18:02 UTC (permalink / raw) To: alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, Jonathan.Cameron, linux-cxl Cc: terry.bowman, rrichter, linux-kernel, bhelgaas, Ard Biesheuvel, linux-efi The CXL driver plans to use cper_print_aer() for restricted CXL host (RCH) logging. This is not currently possible if CXL is built as a loadable module because cper_print_aer() depends on cper_mem_err_unpack() which is not exported. Export cper_mem_err_unpack() to enable cper_print_aer() usage in CXL and other loadable modules. Signed-off-by: Terry Bowman <terry.bowman@amd.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: linux-efi@vger.kernel.org --- drivers/firmware/efi/cper.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c index 35c37f667781..ff15e12160ae 100644 --- a/drivers/firmware/efi/cper.c +++ b/drivers/firmware/efi/cper.c @@ -350,6 +350,7 @@ const char *cper_mem_err_unpack(struct trace_seq *p, return ret; } +EXPORT_SYMBOL_GPL(cper_mem_err_unpack); static void cper_print_mem(const char *pfx, const struct cper_sec_mem_err *mem, int len) -- 2.34.1 ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH v3 2/6] efi/cper: Export cper_mem_err_unpack() for use by modules 2023-04-11 18:02 ` [PATCH v3 2/6] efi/cper: Export cper_mem_err_unpack() for use by modules Terry Bowman @ 2023-04-12 11:04 ` Ard Biesheuvel 2023-04-13 16:08 ` Jonathan Cameron 2023-04-17 23:08 ` Dan Williams 2 siblings, 0 replies; 52+ messages in thread From: Ard Biesheuvel @ 2023-04-12 11:04 UTC (permalink / raw) To: Terry Bowman Cc: alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, Jonathan.Cameron, linux-cxl, rrichter, linux-kernel, bhelgaas, linux-efi On Tue, 11 Apr 2023 at 20:03, Terry Bowman <terry.bowman@amd.com> wrote: > > The CXL driver plans to use cper_print_aer() for restricted CXL host (RCH) > logging. This is not currently possible if CXL is built as a loadable > module because cper_print_aer() depends on cper_mem_err_unpack() which > is not exported. > > Export cper_mem_err_unpack() to enable cper_print_aer() usage in > CXL and other loadable modules. > > Signed-off-by: Terry Bowman <terry.bowman@amd.com> > Cc: Ard Biesheuvel <ardb@kernel.org> > Cc: linux-efi@vger.kernel.org Acked-by: Ard Biesheuvel <ardb@kernel.org> > --- > drivers/firmware/efi/cper.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c > index 35c37f667781..ff15e12160ae 100644 > --- a/drivers/firmware/efi/cper.c > +++ b/drivers/firmware/efi/cper.c > @@ -350,6 +350,7 @@ const char *cper_mem_err_unpack(struct trace_seq *p, > > return ret; > } > +EXPORT_SYMBOL_GPL(cper_mem_err_unpack); > > static void cper_print_mem(const char *pfx, const struct cper_sec_mem_err *mem, > int len) > -- > 2.34.1 > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 2/6] efi/cper: Export cper_mem_err_unpack() for use by modules 2023-04-11 18:02 ` [PATCH v3 2/6] efi/cper: Export cper_mem_err_unpack() for use by modules Terry Bowman 2023-04-12 11:04 ` Ard Biesheuvel @ 2023-04-13 16:08 ` Jonathan Cameron 2023-04-13 19:40 ` Terry Bowman 2023-04-17 23:08 ` Dan Williams 2 siblings, 1 reply; 52+ messages in thread From: Jonathan Cameron @ 2023-04-13 16:08 UTC (permalink / raw) To: Terry Bowman Cc: alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, linux-cxl, rrichter, linux-kernel, bhelgaas, Ard Biesheuvel, linux-efi On Tue, 11 Apr 2023 13:02:58 -0500 Terry Bowman <terry.bowman@amd.com> wrote: > The CXL driver plans to use cper_print_aer() for restricted CXL host (RCH) > logging. This is not currently possible if CXL is built as a loadable > module because cper_print_aer() depends on cper_mem_err_unpack() which > is not exported. > > Export cper_mem_err_unpack() to enable cper_print_aer() usage in > CXL and other loadable modules. No problem with the export, but I'm struggling to see the path that needs it. Could you give a little more detail, perhaps a call path? Thanks, Jonathan > > Signed-off-by: Terry Bowman <terry.bowman@amd.com> > Cc: Ard Biesheuvel <ardb@kernel.org> > Cc: linux-efi@vger.kernel.org > --- > drivers/firmware/efi/cper.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c > index 35c37f667781..ff15e12160ae 100644 > --- a/drivers/firmware/efi/cper.c > +++ b/drivers/firmware/efi/cper.c > @@ -350,6 +350,7 @@ const char *cper_mem_err_unpack(struct trace_seq *p, > > return ret; > } > +EXPORT_SYMBOL_GPL(cper_mem_err_unpack); > > static void cper_print_mem(const char *pfx, const struct cper_sec_mem_err *mem, > int len) ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 2/6] efi/cper: Export cper_mem_err_unpack() for use by modules 2023-04-13 16:08 ` Jonathan Cameron @ 2023-04-13 19:40 ` Terry Bowman 2023-04-14 11:48 ` Jonathan Cameron 0 siblings, 1 reply; 52+ messages in thread From: Terry Bowman @ 2023-04-13 19:40 UTC (permalink / raw) To: Jonathan Cameron Cc: alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, linux-cxl, rrichter, linux-kernel, bhelgaas, Ard Biesheuvel, linux-efi Hi Jonathan, On 4/13/23 11:08, Jonathan Cameron wrote: > On Tue, 11 Apr 2023 13:02:58 -0500 > Terry Bowman <terry.bowman@amd.com> wrote: > >> The CXL driver plans to use cper_print_aer() for restricted CXL host (RCH) >> logging. This is not currently possible if CXL is built as a loadable >> module because cper_print_aer() depends on cper_mem_err_unpack() which >> is not exported. >> >> Export cper_mem_err_unpack() to enable cper_print_aer() usage in >> CXL and other loadable modules. > > No problem with the export, but I'm struggling to see the path that needs it. > Could you give a little more detail, perhaps a call path? > The cper_print_aer() is used to log RCH dport AER errors. This is needed because the RCH dport AER errors are not handled directly by the AER port driver. I'll add these details to the patch. 
Regards, Terry > Thanks, > > Jonathan > >> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com> >> Cc: Ard Biesheuvel <ardb@kernel.org> >> Cc: linux-efi@vger.kernel.org >> --- >> drivers/firmware/efi/cper.c | 1 + >> 1 file changed, 1 insertion(+) >> >> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c >> index 35c37f667781..ff15e12160ae 100644 >> --- a/drivers/firmware/efi/cper.c >> +++ b/drivers/firmware/efi/cper.c >> @@ -350,6 +350,7 @@ const char *cper_mem_err_unpack(struct trace_seq *p, >> >> return ret; >> } >> +EXPORT_SYMBOL_GPL(cper_mem_err_unpack); >> >> static void cper_print_mem(const char *pfx, const struct cper_sec_mem_err *mem, >> int len) > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 2/6] efi/cper: Export cper_mem_err_unpack() for use by modules 2023-04-13 19:40 ` Terry Bowman @ 2023-04-14 11:48 ` Jonathan Cameron 2023-04-14 12:44 ` Robert Richter [not found] ` <aba5d2ee-f451-145c-81c2-72595129483b@amd.com> 0 siblings, 2 replies; 52+ messages in thread From: Jonathan Cameron @ 2023-04-14 11:48 UTC (permalink / raw) To: Terry Bowman Cc: alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, linux-cxl, rrichter, linux-kernel, bhelgaas, Ard Biesheuvel, linux-efi On Thu, 13 Apr 2023 14:40:10 -0500 Terry Bowman <Terry.Bowman@amd.com> wrote: > Hi Jonathan, > > On 4/13/23 11:08, Jonathan Cameron wrote: > > On Tue, 11 Apr 2023 13:02:58 -0500 > > Terry Bowman <terry.bowman@amd.com> wrote: > > > >> The CXL driver plans to use cper_print_aer() for restricted CXL host (RCH) > >> logging. This is not currently possible if CXL is built as a loadable > >> module because cper_print_aer() depends on cper_mem_err_unpack() which > >> is not exported. > >> > >> Export cper_mem_err_unpack() to enable cper_print_aer() usage in > >> CXL and other loadable modules. > > > > No problem with the export, but I'm struggling to see the path that needs it. > > Could you give a little more detail, perhaps a call path? > > > > The cper_print_aer() is used to log RCH dport AER errors. This is needed > because the RCH dport AER errors are not handled directly by the AER port > driver. I'll add these details to the patch. Ah. I wasn't particularly clear. cper_print_aer() is fine, but oddly I'm not seeing where that results in a call to cper_mem_err_unpack() More than possible my grep skills are failing me! 
Jonathan > > Regards, > Terry > > > Thanks, > > > > Jonathan > > > >> > >> Signed-off-by: Terry Bowman <terry.bowman@amd.com> > >> Cc: Ard Biesheuvel <ardb@kernel.org> > >> Cc: linux-efi@vger.kernel.org > >> --- > >> drivers/firmware/efi/cper.c | 1 + > >> 1 file changed, 1 insertion(+) > >> > >> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c > >> index 35c37f667781..ff15e12160ae 100644 > >> --- a/drivers/firmware/efi/cper.c > >> +++ b/drivers/firmware/efi/cper.c > >> @@ -350,6 +350,7 @@ const char *cper_mem_err_unpack(struct trace_seq *p, > >> > >> return ret; > >> } > >> +EXPORT_SYMBOL_GPL(cper_mem_err_unpack); > >> > >> static void cper_print_mem(const char *pfx, const struct cper_sec_mem_err *mem, > >> int len) > > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 2/6] efi/cper: Export cper_mem_err_unpack() for use by modules 2023-04-14 11:48 ` Jonathan Cameron @ 2023-04-14 12:44 ` Robert Richter [not found] ` <aba5d2ee-f451-145c-81c2-72595129483b@amd.com> 1 sibling, 0 replies; 52+ messages in thread From: Robert Richter @ 2023-04-14 12:44 UTC (permalink / raw) To: Jonathan Cameron Cc: Terry Bowman, alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, linux-cxl, linux-kernel, bhelgaas, Ard Biesheuvel, linux-efi On 14.04.23 12:48:05, Jonathan Cameron wrote: > On Thu, 13 Apr 2023 14:40:10 -0500 > Terry Bowman <Terry.Bowman@amd.com> wrote: > > > Hi Jonathan, > > > > On 4/13/23 11:08, Jonathan Cameron wrote: > > > On Tue, 11 Apr 2023 13:02:58 -0500 > > > Terry Bowman <terry.bowman@amd.com> wrote: > > > > > >> The CXL driver plans to use cper_print_aer() for restricted CXL host (RCH) > > >> logging. This is not currently possible if CXL is built as a loadable > > >> module because cper_print_aer() depends on cper_mem_err_unpack() which > > >> is not exported. > > >> > > >> Export cper_mem_err_unpack() to enable cper_print_aer() usage in > > >> CXL and other loadable modules. > > > > > > No problem with the export, but I'm struggling to see the path that needs it. > > > Could you give a little more detail, perhaps a call path? > > > > > > > The cper_print_aer() is used to log RCH dport AER errors. This is needed > > because the RCH dport AER errors are not handled directly by the AER port > > driver. I'll add these details to the patch. > > Ah. I wasn't particularly clear. cper_print_aer() is fine, but oddly > I'm not seeing where that results in a call to cper_mem_err_unpack() > > More than possible my grep skills are failing me! No worries, it is used in some odd tracepoint macro magic included with ras_event.h. -Robert ^ permalink raw reply [flat|nested] 52+ messages in thread
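The "tracepoint macro magic" mentioned in the reply above is easy to miss with grep, because a helper can be referenced only as an argument of a macro invocation in a header. The userspace sketch below illustrates the general shape of such a dependency. It is not the code from include/ras/ras_event.h; DEMO_DEFINE_EVENT, demo_unpack() and demo_print_mem_event() are hypothetical names made up for this illustration.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper standing in for an "unpack" routine: formats a few
 * fields into a caller-supplied scratch buffer and returns that buffer.
 */
static const char *demo_unpack(char *scratch, size_t len,
			       unsigned int node, unsigned int card)
{
	snprintf(scratch, len, "node:%u card:%u", node, card);
	return scratch;
}

/*
 * Hypothetical event-definition macro. The format arguments are pasted
 * verbatim into the generated function body, so a helper named only in
 * the macro invocation becomes a real call in every file that expands it.
 */
#define DEMO_DEFINE_EVENT(name, fmt, ...)				\
static int demo_print_##name(char *out, size_t out_len,		\
			     char *scratch, size_t scratch_len,	\
			     unsigned int node, unsigned int card)	\
{									\
	return snprintf(out, out_len, fmt, __VA_ARGS__);		\
}

/* The only textual reference to demo_unpack() outside its definition. */
DEMO_DEFINE_EVENT(mem_event, "mem error: %s",
		  demo_unpack(scratch, scratch_len, node, card))
```

Calling demo_print_mem_event() from a file that expands the macro pulls in demo_unpack(), even though no ordinary function in that file names it. In a kernel module the analogous situation would require the helper to be exported, which is the kind of dependency being discussed here.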
[parent not found: <aba5d2ee-f451-145c-81c2-72595129483b@amd.com>]
* Re: [PATCH v3 2/6] efi/cper: Export cper_mem_err_unpack() for use by modules [not found] ` <aba5d2ee-f451-145c-81c2-72595129483b@amd.com> @ 2023-04-14 15:17 ` Terry Bowman 0 siblings, 0 replies; 52+ messages in thread From: Terry Bowman @ 2023-04-14 15:17 UTC (permalink / raw) To: Jonathan Cameron Cc: alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, linux-cxl, rrichter, linux-kernel, bhelgaas, Ard Biesheuvel, linux-efi On 4/14/23 08:24, Terry Bowman wrote: > > > On 4/14/23 06:48, Jonathan Cameron wrote: >> On Thu, 13 Apr 2023 14:40:10 -0500 >> Terry Bowman <Terry.Bowman@amd.com> wrote: >> >>> Hi Jonathan, >>> >>> On 4/13/23 11:08, Jonathan Cameron wrote: >>>> On Tue, 11 Apr 2023 13:02:58 -0500 >>>> Terry Bowman <terry.bowman@amd.com> wrote: >>>> >>>>> The CXL driver plans to use cper_print_aer() for restricted CXL host (RCH) >>>>> logging. This is not currently possible if CXL is built as a loadable >>>>> module because cper_print_aer() depends on cper_mem_err_unpack() which >>>>> is not exported. >>>>> >>>>> Export cper_mem_err_unpack() to enable cper_print_aer() usage in >>>>> CXL and other loadable modules. >>>> >>>> No problem with the export, but I'm struggling to see the path that needs it. >>>> Could you give a little more detail, perhaps a call path? >>>> >>> >>> The cper_print_aer() is used to log RCH dport AER errors. This is needed >>> because the RCH dport AER errors are not handled directly by the AER port >>> driver. I'll add these details to the patch. >> >> Ah. I wasn't particularly clear. cper_print_aer() is fine, but oddly >> I'm not seeing where that results in a call to cper_mem_err_unpack() >> >> More than possible my grep skills are failing me! >> >> Jonathan >> > > I see. Without this patch, if include/ras/ras_event.h cper_mem_err_unpack() > > We use > > Testing shows this patch is no longer needed. This patch was required for earlier implementation calling the aer trace macros directly. 
I will remove this patch in the next patchset revision.

Regards,
Terry

>>>
>>> Regards,
>>> Terry
>>>
>>>> Thanks,
>>>>
>>>> Jonathan
>>>>
>>>>>
>>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>>>> Cc: Ard Biesheuvel <ardb@kernel.org>
>>>>> Cc: linux-efi@vger.kernel.org
>>>>> ---
>>>>>  drivers/firmware/efi/cper.c | 1 +
>>>>>  1 file changed, 1 insertion(+)
>>>>>
>>>>> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
>>>>> index 35c37f667781..ff15e12160ae 100644
>>>>> --- a/drivers/firmware/efi/cper.c
>>>>> +++ b/drivers/firmware/efi/cper.c
>>>>> @@ -350,6 +350,7 @@ const char *cper_mem_err_unpack(struct trace_seq *p,
>>>>>
>>>>>  	return ret;
>>>>>  }
>>>>> +EXPORT_SYMBOL_GPL(cper_mem_err_unpack);
>>>>>
>>>>>  static void cper_print_mem(const char *pfx, const struct cper_sec_mem_err *mem,
>>>>>  	int len)
>>>>
>>

^ permalink raw reply [flat|nested] 52+ messages in thread
* RE: [PATCH v3 2/6] efi/cper: Export cper_mem_err_unpack() for use by modules 2023-04-11 18:02 ` [PATCH v3 2/6] efi/cper: Export cper_mem_err_unpack() for use by modules Terry Bowman 2023-04-12 11:04 ` Ard Biesheuvel 2023-04-13 16:08 ` Jonathan Cameron @ 2023-04-17 23:08 ` Dan Williams 2 siblings, 0 replies; 52+ messages in thread From: Dan Williams @ 2023-04-17 23:08 UTC (permalink / raw) To: Terry Bowman, alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, Jonathan.Cameron, linux-cxl Cc: terry.bowman, rrichter, linux-kernel, bhelgaas, Ard Biesheuvel, linux-efi Terry Bowman wrote: > The CXL driver plans to use cper_print_aer() for restricted CXL host (RCH) > logging. This is not currently possible if CXL is built as a loadable > module because cper_print_aer() depends on cper_mem_err_unpack() which > is not exported. > > Export cper_mem_err_unpack() to enable cper_print_aer() usage in > CXL and other loadable modules. > > Signed-off-by: Terry Bowman <terry.bowman@amd.com> > Cc: Ard Biesheuvel <ardb@kernel.org> > Cc: linux-efi@vger.kernel.org > --- > drivers/firmware/efi/cper.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c > index 35c37f667781..ff15e12160ae 100644 > --- a/drivers/firmware/efi/cper.c > +++ b/drivers/firmware/efi/cper.c > @@ -350,6 +350,7 @@ const char *cper_mem_err_unpack(struct trace_seq *p, > > return ret; > } > +EXPORT_SYMBOL_GPL(cper_mem_err_unpack); Looks ok to me. You could make it: EXPORT_SYMBOL_NS_GPL(cper_mem_err_unpack, CXL) ...to make it clear that this is really only meant to be consumed by the CXL subsystem. That was also the approach taken with the otherwise internal-only insert_resource_expand_to_fit() symbol. ^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v3 3/6] PCI/AER: Export cper_print_aer() for use by modules 2023-04-11 18:02 [PATCH v3 0/6] cxl/pci: Add support for RCH RAS error handling Terry Bowman 2023-04-11 18:02 ` [PATCH v3 1/6] cxl/pci: Add RCH downstream port AER and RAS register discovery Terry Bowman 2023-04-11 18:02 ` [PATCH v3 2/6] efi/cper: Export cper_mem_err_unpack() for use by modules Terry Bowman @ 2023-04-11 18:02 ` Terry Bowman 2023-04-13 16:13 ` Jonathan Cameron 2023-04-17 23:11 ` Dan Williams 2023-04-11 18:03 ` [PATCH v3 4/6] cxl/pci: Add RCH downstream port error logging Terry Bowman ` (2 subsequent siblings) 5 siblings, 2 replies; 52+ messages in thread From: Terry Bowman @ 2023-04-11 18:02 UTC (permalink / raw) To: alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, Jonathan.Cameron, linux-cxl Cc: terry.bowman, rrichter, linux-kernel, bhelgaas, Mahesh J Salgaonkar, Oliver O'Halloran, linux-pci The CXL driver plans to use cper_print_aer() for restricted CXL host (RCH) logging. cper_print_aer() is not exported and as a result is not available to the CXL driver or other loadable modules. Export cper_print_aer() making it available to CXL and other loadable modules. Signed-off-by: Terry Bowman <terry.bowman@amd.com> Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: linux-pci@vger.kernel.org --- drivers/pci/pcie/aer.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index f6c24ded134c..7a25b62d9e01 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -812,6 +812,7 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity, trace_aer_event(dev_name(&dev->dev), (status & ~mask), aer_severity, tlp_header_valid, &aer->header_log); } +EXPORT_SYMBOL_GPL(cper_print_aer); #endif /** -- 2.34.1 ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: [PATCH v3 3/6] PCI/AER: Export cper_print_aer() for use by modules 2023-04-11 18:02 ` [PATCH v3 3/6] PCI/AER: Export cper_print_aer() " Terry Bowman @ 2023-04-13 16:13 ` Jonathan Cameron 2023-04-17 23:11 ` Dan Williams 1 sibling, 0 replies; 52+ messages in thread From: Jonathan Cameron @ 2023-04-13 16:13 UTC (permalink / raw) To: Terry Bowman Cc: alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, linux-cxl, rrichter, linux-kernel, bhelgaas, Mahesh J Salgaonkar, Oliver O'Halloran, linux-pci On Tue, 11 Apr 2023 13:02:59 -0500 Terry Bowman <terry.bowman@amd.com> wrote: > The CXL driver plans to use cper_print_aer() for restricted CXL host > (RCH) logging. cper_print_aer() is not exported and as a result is not > available to the CXL driver or other loadable modules. Export > cper_print_aer() making it available to CXL and other loadable modules. > > Signed-off-by: Terry Bowman <terry.bowman@amd.com> > Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com> > Cc: "Oliver O'Halloran" <oohall@gmail.com> > Cc: Bjorn Helgaas <bhelgaas@google.com> > Cc: linux-pci@vger.kernel.org Seems reasonable. FWIW Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> > --- > drivers/pci/pcie/aer.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > index f6c24ded134c..7a25b62d9e01 100644 > --- a/drivers/pci/pcie/aer.c > +++ b/drivers/pci/pcie/aer.c > @@ -812,6 +812,7 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity, > trace_aer_event(dev_name(&dev->dev), (status & ~mask), > aer_severity, tlp_header_valid, &aer->header_log); > } > +EXPORT_SYMBOL_GPL(cper_print_aer); > #endif > > /** ^ permalink raw reply [flat|nested] 52+ messages in thread
* RE: [PATCH v3 3/6] PCI/AER: Export cper_print_aer() for use by modules 2023-04-11 18:02 ` [PATCH v3 3/6] PCI/AER: Export cper_print_aer() " Terry Bowman 2023-04-13 16:13 ` Jonathan Cameron @ 2023-04-17 23:11 ` Dan Williams 1 sibling, 0 replies; 52+ messages in thread From: Dan Williams @ 2023-04-17 23:11 UTC (permalink / raw) To: Terry Bowman, alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, Jonathan.Cameron, linux-cxl Cc: terry.bowman, rrichter, linux-kernel, bhelgaas, Mahesh J Salgaonkar, Oliver O'Halloran, linux-pci Terry Bowman wrote: > The CXL driver plans to use cper_print_aer() for restricted CXL host > (RCH) logging. cper_print_aer() is not exported and as a result is not > available to the CXL driver or other loadable modules. Export > cper_print_aer() making it available to CXL and other loadable modules. > > Signed-off-by: Terry Bowman <terry.bowman@amd.com> > Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com> > Cc: "Oliver O'Halloran" <oohall@gmail.com> > Cc: Bjorn Helgaas <bhelgaas@google.com> > Cc: linux-pci@vger.kernel.org > --- > drivers/pci/pcie/aer.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > index f6c24ded134c..7a25b62d9e01 100644 > --- a/drivers/pci/pcie/aer.c > +++ b/drivers/pci/pcie/aer.c > @@ -812,6 +812,7 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity, > trace_aer_event(dev_name(&dev->dev), (status & ~mask), > aer_severity, tlp_header_valid, &aer->header_log); > } > +EXPORT_SYMBOL_GPL(cper_print_aer); Same EXPORT_SYMBOL_NS_GPL() as the last patch, I can't imagine another scenario where this symbol needs exporting. Does this not need a stub in the CONFIG_PCIEAER=n case? Maybe that's handled in the CXL code, I'll keep reading... ^ permalink raw reply [flat|nested] 52+ messages in thread
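The question above about a stub for the CONFIG_PCIEAER=n case refers to a common header pattern: declare the real function when the option is enabled and provide a static inline no-op otherwise, so callers need no #ifdef. Whether this series needs such a stub is left open in the message; the following is only a userspace sketch of the pattern. DEMO_CONFIG_AER, struct demo_aer_regs, demo_print_aer() and demo_handle_error() are hypothetical names, not kernel symbols.

```c
#include <stdio.h>

/*
 * DEMO_CONFIG_AER is a hypothetical stand-in for a Kconfig symbol.
 * Build with -DDEMO_CONFIG_AER=1 to select the real implementation.
 */
#ifndef DEMO_CONFIG_AER
#define DEMO_CONFIG_AER 0
#endif

/* Hypothetical, trimmed-down register snapshot. */
struct demo_aer_regs {
	unsigned int uncor_status;
	unsigned int cor_status;
};

#if DEMO_CONFIG_AER
/* Feature enabled: print the snapshot and return 1 to say it was logged. */
static inline int demo_print_aer(const char *dev,
				 const struct demo_aer_regs *regs)
{
	printf("%s: uncor=%#x cor=%#x\n", dev,
	       regs->uncor_status, regs->cor_status);
	return 1;
}
#else
/* Feature disabled: a no-op stub keeps callers free of #ifdef blocks. */
static inline int demo_print_aer(const char *dev,
				 const struct demo_aer_regs *regs)
{
	(void)dev;
	(void)regs;
	return 0;
}
#endif

/* The caller compiles unchanged in both configurations. */
static int demo_handle_error(const struct demo_aer_regs *regs)
{
	if (!regs->uncor_status && !regs->cor_status)
		return -1;	/* nothing pending */

	return demo_print_aer("demo0", regs);
}
```

With the default DEMO_CONFIG_AER=0 the caller still builds and links, and the logging call collapses to a no-op, which is the property the stub exists to provide.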
* [PATCH v3 4/6] cxl/pci: Add RCH downstream port error logging 2023-04-11 18:02 [PATCH v3 0/6] cxl/pci: Add support for RCH RAS error handling Terry Bowman ` (2 preceding siblings ...) 2023-04-11 18:02 ` [PATCH v3 3/6] PCI/AER: Export cper_print_aer() " Terry Bowman @ 2023-04-11 18:03 ` Terry Bowman 2023-04-12 1:32 ` kernel test robot ` (3 more replies) 2023-04-11 18:03 ` [PATCH v3 5/6] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler Terry Bowman 2023-04-11 18:03 ` [PATCH v3 6/6] PCI/AER: Unmask RCEC internal errors to enable RCH downstream port error handling Terry Bowman 5 siblings, 4 replies; 52+ messages in thread From: Terry Bowman @ 2023-04-11 18:03 UTC (permalink / raw) To: alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, Jonathan.Cameron, linux-cxl Cc: terry.bowman, rrichter, linux-kernel, bhelgaas RCH downstream port error logging is missing in the current CXL driver. The missing AER and RAS error logging is needed for communicating driver error details to userspace. Update the driver to include PCIe AER and CXL RAS error logging. Add RCH downstream port error handling into the existing RCiEP handler. The downstream port error handler is added to the RCiEP error handler because the downstream port is implemented in a RCRB, is not PCI enumerable, and as a result is not directly accessible to the PCI AER root port driver. The AER root port driver calls the RCiEP handler for handling RCD errors and RCH downstream port protocol errors. Update mem.c to include RAS and AER setup. This includes AER and RAS capability discovery and mapping for later use in the error handler. Disable RCH downstream port's root port cmd interrupts.[1] Update existing RCiEP correctable and uncorrectable handlers to also call the RCH handler. The RCH handler will read the RCH AER registers, check for error severity, and if an error exists will log using an existing kernel AER trace routine. 
The RCH handler will also log downstream port RAS errors if they exist. [1] CXL 3.0 Spec, 12.2.1.1 - RCH Downstream Port Detected Errors Co-developed-by: Robert Richter <rrichter@amd.com> Signed-off-by: Robert Richter <rrichter@amd.com> Signed-off-by: Terry Bowman <terry.bowman@amd.com> --- drivers/cxl/core/pci.c | 126 ++++++++++++++++++++++++++++++++++++---- drivers/cxl/core/regs.c | 1 + drivers/cxl/cxl.h | 13 +++++ drivers/cxl/mem.c | 73 +++++++++++++++++++++++ 4 files changed, 201 insertions(+), 12 deletions(-) diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c index 523d5b9fd7fc..d435ed2ff8b6 100644 --- a/drivers/cxl/core/pci.c +++ b/drivers/cxl/core/pci.c @@ -5,6 +5,7 @@ #include <linux/delay.h> #include <linux/pci.h> #include <linux/pci-doe.h> +#include <linux/aer.h> #include <cxlpci.h> #include <cxlmem.h> #include <cxl.h> @@ -613,32 +614,88 @@ void read_cdat_data(struct cxl_port *port) } EXPORT_SYMBOL_NS_GPL(read_cdat_data, CXL); -void cxl_cor_error_detected(struct pci_dev *pdev) +/* Get AER severity. Return false if there is no error. */ +static bool cxl_rch_get_aer_severity(struct aer_capability_regs *aer_regs, + int *severity) +{ + if (aer_regs->uncor_status & ~aer_regs->uncor_mask) { + if (aer_regs->uncor_status & PCI_ERR_ROOT_FATAL_RCV) + *severity = AER_FATAL; + else + *severity = AER_NONFATAL; + return true; + } + + if (aer_regs->cor_status & ~aer_regs->cor_mask) { + *severity = AER_CORRECTABLE; + return true; + } + + return false; +} + +/* + * Copy the AER capability registers to a buffer. This is necessary + * because RCRB AER capability is MMIO mapped. Clear the status + * after copying. 
+ * + * @aer_base: base address of AER capability block in RCRB + * @aer_regs: destination for copying AER capability + */ +static bool cxl_rch_get_aer_info(void __iomem *aer_base, + struct aer_capability_regs *aer_regs) +{ + int read_cnt = PCI_AER_CAPABILITY_LENGTH / sizeof(u32); + u32 *aer_regs_buf = (u32 *)aer_regs; + int n; + + if (!aer_base) + return false; + + for (n = 0; n < read_cnt; n++) + aer_regs_buf[n] = readl(aer_base + n * sizeof(u32)); + + writel(aer_regs->uncor_status, aer_base + PCI_ERR_UNCOR_STATUS); + writel(aer_regs->cor_status, aer_base + PCI_ERR_COR_STATUS); + + return true; +} + +static void __cxl_log_correctable_ras(struct cxl_dev_state *cxlds, + void __iomem *ras_base) { - struct cxl_dev_state *cxlds = pci_get_drvdata(pdev); void __iomem *addr; u32 status; - if (!cxlds->regs.ras) + if (!ras_base) return; - addr = cxlds->regs.ras + CXL_RAS_CORRECTABLE_STATUS_OFFSET; + addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET; status = readl(addr); if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) { writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr); trace_cxl_aer_correctable_error(cxlds->cxlmd, status); } } -EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, CXL); + +static void cxl_log_correctable_ras_endpoint(struct cxl_dev_state *cxlds) +{ + return __cxl_log_correctable_ras(cxlds, cxlds->regs.ras); +} + +static void cxl_log_correctable_ras_dport(struct cxl_dev_state *cxlds) +{ + return __cxl_log_correctable_ras(cxlds, cxlds->regs.dport_ras); +} /* CXL spec rev3.0 8.2.4.16.1 */ -static void header_log_copy(struct cxl_dev_state *cxlds, u32 *log) +static void header_log_copy(void __iomem *ras_base, u32 *log) { void __iomem *addr; u32 *log_addr; int i, log_u32_size = CXL_HEADERLOG_SIZE / sizeof(u32); - addr = cxlds->regs.ras + CXL_RAS_HEADER_LOG_OFFSET; + addr = ras_base + CXL_RAS_HEADER_LOG_OFFSET; log_addr = log; for (i = 0; i < log_u32_size; i++) { @@ -652,17 +709,18 @@ static void header_log_copy(struct cxl_dev_state *cxlds, u32 *log) * Log the state 
of the RAS status registers and prepare them to log the * next error status. Return 1 if reset needed. */ -static bool cxl_report_and_clear(struct cxl_dev_state *cxlds) +static bool __cxl_report_and_clear(struct cxl_dev_state *cxlds, + void __iomem *ras_base) { u32 hl[CXL_HEADERLOG_SIZE_U32]; void __iomem *addr; u32 status; u32 fe; - if (!cxlds->regs.ras) + if (!ras_base) return false; - addr = cxlds->regs.ras + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET; + addr = ras_base + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET; status = readl(addr); if (!(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK)) return false; @@ -670,7 +728,7 @@ static bool cxl_report_and_clear(struct cxl_dev_state *cxlds) /* If multiple errors, log header points to first error from ctrl reg */ if (hweight32(status) > 1) { void __iomem *rcc_addr = - cxlds->regs.ras + CXL_RAS_CAP_CONTROL_OFFSET; + ras_base + CXL_RAS_CAP_CONTROL_OFFSET; fe = BIT(FIELD_GET(CXL_RAS_CAP_CONTROL_FE_MASK, readl(rcc_addr))); @@ -678,13 +736,54 @@ static bool cxl_report_and_clear(struct cxl_dev_state *cxlds) fe = status; } - header_log_copy(cxlds, hl); + header_log_copy(ras_base, hl); trace_cxl_aer_uncorrectable_error(cxlds->cxlmd, status, fe, hl); writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr); return true; } +static bool cxl_report_and_clear(struct cxl_dev_state *cxlds) +{ + return __cxl_report_and_clear(cxlds, cxlds->regs.ras); +} + +static bool cxl_report_and_clear_dport(struct cxl_dev_state *cxlds) +{ + return __cxl_report_and_clear(cxlds, cxlds->regs.dport_ras); +} + +static void cxl_rch_log_error(struct cxl_dev_state *cxlds) +{ + struct pci_dev *pdev = to_pci_dev(cxlds->dev); + struct aer_capability_regs aer_regs; + int severity; + + if (!cxl_rch_get_aer_info(cxlds->regs.aer, &aer_regs)) + return; + + if (!cxl_rch_get_aer_severity(&aer_regs, &severity)) + return; + + cper_print_aer(pdev, severity, &aer_regs); + + if (severity == AER_CORRECTABLE) + cxl_log_correctable_ras_dport(cxlds); + else + 
cxl_report_and_clear_dport(cxlds); +} + +void cxl_cor_error_detected(struct pci_dev *pdev) +{ + struct cxl_dev_state *cxlds = pci_get_drvdata(pdev); + + if (cxlds->rcd) + cxl_rch_log_error(cxlds); + + cxl_log_correctable_ras_endpoint(cxlds); +} +EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, CXL); + pci_ers_result_t cxl_error_detected(struct pci_dev *pdev, pci_channel_state_t state) { @@ -693,6 +792,9 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev, struct device *dev = &cxlmd->dev; bool ue; + if (cxlds->rcd) + cxl_rch_log_error(cxlds); + /* * A frozen channel indicates an impending reset which is fatal to * CXL.mem operation, and will likely crash the system. On the off diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c index bde1fffab09e..dfa6fcfc428a 100644 --- a/drivers/cxl/core/regs.c +++ b/drivers/cxl/core/regs.c @@ -198,6 +198,7 @@ void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr, return ret_val; } +EXPORT_SYMBOL_NS_GPL(devm_cxl_iomap_block, CXL); int cxl_map_component_regs(struct device *dev, struct cxl_component_regs *regs, struct cxl_register_map *map, unsigned long map_mask) diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h index df64c402e6e6..dae3f141ffcb 100644 --- a/drivers/cxl/cxl.h +++ b/drivers/cxl/cxl.h @@ -66,6 +66,8 @@ #define CXL_DECODER_MIN_GRANULARITY 256 #define CXL_DECODER_MAX_ENCODED_IG 6 +#define PCI_AER_CAPABILITY_LENGTH 56 + static inline int cxl_hdm_decoder_count(u32 cap_hdr) { int val = FIELD_GET(CXL_HDM_DECODER_COUNT_MASK, cap_hdr); @@ -209,6 +211,15 @@ struct cxl_regs { struct_group_tagged(cxl_device_regs, device_regs, void __iomem *status, *mbox, *memdev; ); + + /* + * Pointer to RCH cxl_dport AER. 
(only for RCH/RCD mode) + * @dport_aer: CXL 2.0 12.2.11 RCH Downstream Port-detected Errors + */ + struct_group_tagged(cxl_rch_regs, rch_regs, + void __iomem *aer; + void __iomem *dport_ras; + ); }; struct cxl_reg_map { @@ -249,6 +260,8 @@ struct cxl_register_map { }; }; +void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr, + resource_size_t length); void cxl_probe_component_regs(struct device *dev, void __iomem *base, struct cxl_component_reg_map *map); void cxl_probe_device_regs(struct device *dev, void __iomem *base, diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c index 014295ab6bc6..dd5ae0a4560c 100644 --- a/drivers/cxl/mem.c +++ b/drivers/cxl/mem.c @@ -4,6 +4,7 @@ #include <linux/device.h> #include <linux/module.h> #include <linux/pci.h> +#include <linux/aer.h> #include "cxlmem.h" #include "cxlpci.h" @@ -45,6 +46,71 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data) return 0; } +static void rch_disable_root_ints(void __iomem *aer_base) +{ + u32 aer_cmd_mask, aer_cmd; + + /* + * Disable RCH root port command interrupts. 
+ * CXL3.0 12.2.1.1 - RCH Downstream Port-detected Errors + */ + aer_cmd_mask = (PCI_ERR_ROOT_CMD_COR_EN | + PCI_ERR_ROOT_CMD_NONFATAL_EN | + PCI_ERR_ROOT_CMD_FATAL_EN); + aer_cmd = readl(aer_base + PCI_ERR_ROOT_COMMAND); + aer_cmd &= ~aer_cmd_mask; + writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND); +} + +static int cxl_rch_map_ras(struct cxl_dev_state *cxlds, + struct cxl_dport *parent_dport) +{ + struct device *dev = parent_dport->dport; + resource_size_t aer_phys, ras_phys; + void __iomem *aer, *dport_ras; + + if (!parent_dport->rch) + return 0; + + if (!parent_dport->aer_cap || !parent_dport->ras_cap || + parent_dport->component_reg_phys == CXL_RESOURCE_NONE) + return -ENODEV; + + aer_phys = parent_dport->aer_cap + parent_dport->rcrb; + aer = devm_cxl_iomap_block(dev, aer_phys, + PCI_AER_CAPABILITY_LENGTH); + + if (!aer) + return -ENOMEM; + + ras_phys = parent_dport->ras_cap + parent_dport->component_reg_phys; + dport_ras = devm_cxl_iomap_block(dev, ras_phys, + CXL_RAS_CAPABILITY_LENGTH); + + if (!dport_ras) + return -ENOMEM; + + cxlds->regs.aer = aer; + cxlds->regs.dport_ras = dport_ras; + + return 0; +} + +static int cxl_setup_ras(struct cxl_dev_state *cxlds, + struct cxl_dport *parent_dport) +{ + int rc; + + rc = cxl_rch_map_ras(cxlds, parent_dport); + if (rc) + return rc; + + if (cxlds->rcd) + rch_disable_root_ints(cxlds->regs.aer); + + return rc; +} + static void cxl_setup_rcrb(struct cxl_dev_state *cxlds, struct cxl_dport *parent_dport) { @@ -91,6 +157,13 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd, cxl_setup_rcrb(cxlds, parent_dport); + rc = cxl_setup_ras(cxlds, parent_dport); + /* Continue with RAS setup errors */ + if (rc) + dev_warn(&cxlmd->dev, "CXL RAS setup failed: %d\n", rc); + else + dev_info(&cxlmd->dev, "CXL error handling enabled\n"); + endpoint = devm_cxl_add_port(host, &cxlmd->dev, cxlds->component_reg_phys, parent_dport); if (IS_ERR(endpoint)) -- 2.34.1 ^ permalink raw reply related [flat|nested] 52+ 
messages in thread
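As an aside for readers of the diff above: the first-error selection and the status clearing in __cxl_report_and_clear() can be tried out in user space. This is a minimal sketch, not the driver code. It assumes a plain struct stands in for the MMIO RAS registers and that the first-error pointer occupies the low six bits of the control register; struct fake_ras, FAKE_FE_MASK, ras_first_error() and ras_clear() are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the RAS capability; real hardware is MMIO. */
struct fake_ras {
	uint32_t uncor_status;	/* write-1-to-clear status bits */
	uint32_t cap_control;	/* low bits hold the first-error pointer */
};

/* Assumed field width, for illustration only. */
#define FAKE_FE_MASK 0x3fu

/* Count set bits, in the spirit of the kernel's hweight32(). */
static unsigned int popcount32(uint32_t v)
{
	unsigned int n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}

/*
 * With at most one status bit set, the status itself is the first error.
 * With several set, the first-error pointer names the bit to report.
 */
uint32_t ras_first_error(const struct fake_ras *ras)
{
	uint32_t ptr;

	assert(ras);
	if (popcount32(ras->uncor_status) <= 1)
		return ras->uncor_status;

	ptr = ras->cap_control & FAKE_FE_MASK;
	/* Guard the shift: a pointer of 32 or more has no bit in a u32. */
	return ptr < 32 ? UINT32_C(1) << ptr : 0;
}

/* Model a write-1-to-clear write: every 1 written clears that bit. */
void ras_clear(struct fake_ras *ras, uint32_t written)
{
	assert(ras);
	ras->uncor_status &= ~written;
}
```

With a status of 0x5 and a first-error pointer of 2, ras_first_error() reports bit 2 (0x4). Clearing that bit leaves 0x1 pending, which is then reported as the first error on its own.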
* Re: [PATCH v3 4/6] cxl/pci: Add RCH downstream port error logging 2023-04-11 18:03 ` [PATCH v3 4/6] cxl/pci: Add RCH downstream port error logging Terry Bowman @ 2023-04-12 1:32 ` kernel test robot 2023-04-12 3:04 ` kernel test robot ` (2 subsequent siblings) 3 siblings, 0 replies; 52+ messages in thread From: kernel test robot @ 2023-04-12 1:32 UTC (permalink / raw) To: Terry Bowman, alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, Jonathan.Cameron, linux-cxl Cc: oe-kbuild-all, terry.bowman, rrichter, linux-kernel, bhelgaas Hi Terry, kernel test robot noticed the following build errors: [auto build test ERROR on ca712e47054678c5ce93a0e0f686353ad5561195] url: https://github.com/intel-lab-lkp/linux/commits/Terry-Bowman/cxl-pci-Add-RCH-downstream-port-AER-and-RAS-register-discovery/20230412-020957 base: ca712e47054678c5ce93a0e0f686353ad5561195 patch link: https://lore.kernel.org/r/20230411180302.2678736-5-terry.bowman%40amd.com patch subject: [PATCH v3 4/6] cxl/pci: Add RCH downstream port error logging config: loongarch-buildonly-randconfig-r004-20230409 (https://download.01.org/0day-ci/archive/20230412/202304120926.dekDF6um-lkp@intel.com/config) compiler: loongarch64-linux-gcc (GCC) 12.1.0 reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # https://github.com/intel-lab-lkp/linux/commit/7f1c5cefb1e75bd709dc35c7f5e3e29dd5df65e1 git remote add linux-review https://github.com/intel-lab-lkp/linux git fetch --no-tags linux-review Terry-Bowman/cxl-pci-Add-RCH-downstream-port-AER-and-RAS-register-discovery/20230412-020957 git checkout 7f1c5cefb1e75bd709dc35c7f5e3e29dd5df65e1 # save the config file mkdir build_dir && cp config build_dir/.config COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=loongarch olddefconfig COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir 
ARCH=loongarch SHELL=/bin/bash If you fix the issue, kindly add following tag where applicable | Reported-by: kernel test robot <lkp@intel.com> | Link: https://lore.kernel.org/oe-kbuild-all/202304120926.dekDF6um-lkp@intel.com/ All errors (new ones prefixed by >>): loongarch64-linux-ld: drivers/cxl/core/pci.o: in function `cxl_rch_log_error': drivers/cxl/core/pci.c:768: undefined reference to `cper_print_aer' >> loongarch64-linux-ld: drivers/cxl/core/pci.c:768: undefined reference to `cper_print_aer' >> loongarch64-linux-ld: drivers/cxl/core/pci.c:768: undefined reference to `cper_print_aer' -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 4/6] cxl/pci: Add RCH downstream port error logging 2023-04-11 18:03 ` [PATCH v3 4/6] cxl/pci: Add RCH downstream port error logging Terry Bowman 2023-04-12 1:32 ` kernel test robot @ 2023-04-12 3:04 ` kernel test robot 2023-04-13 16:50 ` Jonathan Cameron 2023-04-18 0:06 ` Dan Williams 3 siblings, 0 replies; 52+ messages in thread From: kernel test robot @ 2023-04-12 3:04 UTC (permalink / raw) To: Terry Bowman, alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, Jonathan.Cameron, linux-cxl Cc: oe-kbuild-all, terry.bowman, rrichter, linux-kernel, bhelgaas Hi Terry, kernel test robot noticed the following build errors: [auto build test ERROR on ca712e47054678c5ce93a0e0f686353ad5561195] url: https://github.com/intel-lab-lkp/linux/commits/Terry-Bowman/cxl-pci-Add-RCH-downstream-port-AER-and-RAS-register-discovery/20230412-020957 base: ca712e47054678c5ce93a0e0f686353ad5561195 patch link: https://lore.kernel.org/r/20230411180302.2678736-5-terry.bowman%40amd.com patch subject: [PATCH v3 4/6] cxl/pci: Add RCH downstream port error logging config: riscv-randconfig-r014-20230410 (https://download.01.org/0day-ci/archive/20230412/202304121055.UceD86D7-lkp@intel.com/config) compiler: riscv64-linux-gcc (GCC) 12.1.0 reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # https://github.com/intel-lab-lkp/linux/commit/7f1c5cefb1e75bd709dc35c7f5e3e29dd5df65e1 git remote add linux-review https://github.com/intel-lab-lkp/linux git fetch --no-tags linux-review Terry-Bowman/cxl-pci-Add-RCH-downstream-port-AER-and-RAS-register-discovery/20230412-020957 git checkout 7f1c5cefb1e75bd709dc35c7f5e3e29dd5df65e1 # save the config file mkdir build_dir && cp config build_dir/.config COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=riscv olddefconfig COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross 
W=1 O=build_dir ARCH=riscv SHELL=/bin/bash If you fix the issue, kindly add following tag where applicable | Reported-by: kernel test robot <lkp@intel.com> | Link: https://lore.kernel.org/oe-kbuild-all/202304121055.UceD86D7-lkp@intel.com/ All errors (new ones prefixed by >>): riscv64-linux-ld: riscv64-linux-ld: DWARF error: could not find abbrev number 463040 drivers/cxl/core/pci.o: in function `.L0 ': pci.c:(.text+0x1ae2): undefined reference to `cper_print_aer' >> riscv64-linux-ld: pci.c:(.text+0x1afa): undefined reference to `cper_print_aer' -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 4/6] cxl/pci: Add RCH downstream port error logging 2023-04-11 18:03 ` [PATCH v3 4/6] cxl/pci: Add RCH downstream port error logging Terry Bowman 2023-04-12 1:32 ` kernel test robot 2023-04-12 3:04 ` kernel test robot @ 2023-04-13 16:50 ` Jonathan Cameron 2023-04-14 16:36 ` Terry Bowman 2023-04-18 0:06 ` Dan Williams 3 siblings, 1 reply; 52+ messages in thread From: Jonathan Cameron @ 2023-04-13 16:50 UTC (permalink / raw) To: Terry Bowman Cc: alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, linux-cxl, rrichter, linux-kernel, bhelgaas On Tue, 11 Apr 2023 13:03:00 -0500 Terry Bowman <terry.bowman@amd.com> wrote: > RCH downstream port error logging is missing in the current CXL driver. The > missing AER and RAS error logging is needed for communicating driver error > details to userspace. Update the driver to include PCIe AER and CXL RAS > error logging. > > Add RCH downstream port error handling into the existing RCiEP handler. > The downstream port error handler is added to the RCiEP error handler > because the downstream port is implemented in a RCRB, is not PCI > enumerable, and as a result is not directly accessible to the PCI AER > root port driver. The AER root port driver calls the RCiEP handler for > handling RCD errors and RCH downstream port protocol errors. > > Update mem.c to include RAS and AER setup. This includes AER and RAS > capability discovery and mapping for later use in the error handler. > > Disable RCH downstream port's root port cmd interrupts.[1] > > Update existing RCiEP correctable and uncorrectable handlers to also call > the RCH handler. The RCH handler will read the RCH AER registers, check for > error severity, and if an error exists will log using an existing kernel > AER trace routine. The RCH handler will also log downstream port RAS errors > if they exist. 
> > [1] CXL 3.0 Spec, 12.2.1.1 - RCH Downstream Port Detected Errors > > Co-developed-by: Robert Richter <rrichter@amd.com> > Signed-off-by: Robert Richter <rrichter@amd.com> > Signed-off-by: Terry Bowman <terry.bowman@amd.com> Some minor stuff inline. Looks fine to me otherwise. I do find it a little confusing how often we go into an RCD or RCH specific function then drop out directly for 2.0+ case, but you do seem to be consistent with existing code so fair enough. Jonathan > --- > drivers/cxl/core/pci.c | 126 ++++++++++++++++++++++++++++++++++++---- > drivers/cxl/core/regs.c | 1 + > drivers/cxl/cxl.h | 13 +++++ > drivers/cxl/mem.c | 73 +++++++++++++++++++++++ > 4 files changed, 201 insertions(+), 12 deletions(-) > > diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c > index 523d5b9fd7fc..d435ed2ff8b6 100644 > --- a/drivers/cxl/core/pci.c > +++ b/drivers/cxl/core/pci.c > +/* > + * Copy the AER capability registers to a buffer. This is necessary > + * because RCRB AER capability is MMIO mapped. Clear the status > + * after copying. > + * > + * @aer_base: base address of AER capability block in RCRB > + * @aer_regs: destination for copying AER capability > + */ > +static bool cxl_rch_get_aer_info(void __iomem *aer_base, > + struct aer_capability_regs *aer_regs) > +{ > + int read_cnt = PCI_AER_CAPABILITY_LENGTH / sizeof(u32); > + u32 *aer_regs_buf = (u32 *)aer_regs; > + int n; > + > + if (!aer_base) > + return false; > + > + for (n = 0; n < read_cnt; n++) > + aer_regs_buf[n] = readl(aer_base + n * sizeof(u32)); Maybe add a comment here on why memcpy_fromio() doesn't work for us. I'm assuming we need these to definitely be 32bit reads. Otherwise someone will 'optimize' it in future. 
> + > + writel(aer_regs->uncor_status, aer_base + PCI_ERR_UNCOR_STATUS); > + writel(aer_regs->cor_status, aer_base + PCI_ERR_COR_STATUS); > + > + return true; > +} = > diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c > index bde1fffab09e..dfa6fcfc428a 100644 > --- a/drivers/cxl/core/regs.c > +++ b/drivers/cxl/core/regs.c > @@ -198,6 +198,7 @@ void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr, > > return ret_val; > } > +EXPORT_SYMBOL_NS_GPL(devm_cxl_iomap_block, CXL); > > int cxl_map_component_regs(struct device *dev, struct cxl_component_regs *regs, > struct cxl_register_map *map, unsigned long map_mask) > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h > index df64c402e6e6..dae3f141ffcb 100644 > --- a/drivers/cxl/cxl.h > +++ b/drivers/cxl/cxl.h > @@ -66,6 +66,8 @@ > #define CXL_DECODER_MIN_GRANULARITY 256 > #define CXL_DECODER_MAX_ENCODED_IG 6 > > +#define PCI_AER_CAPABILITY_LENGTH 56 Odd place to find a PCI specific define. Also a spec reference is always good for these. What's the the length of? PCI r6.0 has cap going up to address 0x5c so length 0x60. This seems to be igoring the header log register. > + > static inline int cxl_hdm_decoder_count(u32 cap_hdr) > { > int val = FIELD_GET(CXL_HDM_DECODER_COUNT_MASK, cap_hdr); > @@ -209,6 +211,15 @@ struct cxl_regs { > struct_group_tagged(cxl_device_regs, device_regs, > void __iomem *status, *mbox, *memdev; > ); > + > + /* > + * Pointer to RCH cxl_dport AER. (only for RCH/RCD mode) > + * @dport_aer: CXL 2.0 12.2.11 RCH Downstream Port-detected Errors As with other cases, I'd like full comments, so something for @aer as well. 
> + */ > + struct_group_tagged(cxl_rch_regs, rch_regs, > + void __iomem *aer; > + void __iomem *dport_ras; > + ); > }; > > struct cxl_reg_map { > @@ -249,6 +260,8 @@ struct cxl_register_map { > }; > }; > > +void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr, > + resource_size_t length); > void cxl_probe_component_regs(struct device *dev, void __iomem *base, > struct cxl_component_reg_map *map); > void cxl_probe_device_regs(struct device *dev, void __iomem *base, > diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c > index 014295ab6bc6..dd5ae0a4560c 100644 > --- a/drivers/cxl/mem.c > +++ b/drivers/cxl/mem.c > @@ -4,6 +4,7 @@ > #include <linux/device.h> > #include <linux/module.h> > #include <linux/pci.h> > +#include <linux/aer.h> > > #include "cxlmem.h" > #include "cxlpci.h" > @@ -45,6 +46,71 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data) > return 0; > } > > +static void rch_disable_root_ints(void __iomem *aer_base) > +{ > + u32 aer_cmd_mask, aer_cmd; > + > + /* > + * Disable RCH root port command interrupts. > + * CXL3.0 12.2.1.1 - RCH Downstream Port-detected Errors > + */ > + aer_cmd_mask = (PCI_ERR_ROOT_CMD_COR_EN | > + PCI_ERR_ROOT_CMD_NONFATAL_EN | > + PCI_ERR_ROOT_CMD_FATAL_EN); > + aer_cmd = readl(aer_base + PCI_ERR_ROOT_COMMAND); > + aer_cmd &= ~aer_cmd_mask; > + writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND); Should we be touching these if firmware hasn't granted control to the OS? Description in the spec refers to 'software'. Is that the kernel? No idea. I guess this is safe even if it has already been done. Perhaps a comment to say it should already be in this state? 
> +} > + > +static int cxl_rch_map_ras(struct cxl_dev_state *cxlds, > + struct cxl_dport *parent_dport) > +{ > + struct device *dev = parent_dport->dport; > + resource_size_t aer_phys, ras_phys; > + void __iomem *aer, *dport_ras; > + > + if (!parent_dport->rch) > + return 0; > + > + if (!parent_dport->aer_cap || !parent_dport->ras_cap || > + parent_dport->component_reg_phys == CXL_RESOURCE_NONE) > + return -ENODEV; > + > + aer_phys = parent_dport->aer_cap + parent_dport->rcrb; > + aer = devm_cxl_iomap_block(dev, aer_phys, > + PCI_AER_CAPABILITY_LENGTH); > + > + if (!aer) > + return -ENOMEM; > + > + ras_phys = parent_dport->ras_cap + parent_dport->component_reg_phys; > + dport_ras = devm_cxl_iomap_block(dev, ras_phys, > + CXL_RAS_CAPABILITY_LENGTH); > + > + if (!dport_ras) > + return -ENOMEM; > + > + cxlds->regs.aer = aer; > + cxlds->regs.dport_ras = dport_ras; > + > + return 0; > +} > + > +static int cxl_setup_ras(struct cxl_dev_state *cxlds, > + struct cxl_dport *parent_dport) > +{ > + int rc; > + > + rc = cxl_rch_map_ras(cxlds, parent_dport); > + if (rc) > + return rc; > + > + if (cxlds->rcd) > + rch_disable_root_ints(cxlds->regs.aer); > + > + return rc; > +} > + > static void cxl_setup_rcrb(struct cxl_dev_state *cxlds, > struct cxl_dport *parent_dport) > { > @@ -91,6 +157,13 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd, > > cxl_setup_rcrb(cxlds, parent_dport); > > + rc = cxl_setup_ras(cxlds, parent_dport); > + /* Continue with RAS setup errors */ > + if (rc) > + dev_warn(&cxlmd->dev, "CXL RAS setup failed: %d\n", rc); > + else > + dev_info(&cxlmd->dev, "CXL error handling enabled\n"); This feels a little noisy as something to add given we didn't shout about it for non RCD cases (I think). Maybe a dev_dbg()? > + > endpoint = devm_cxl_add_port(host, &cxlmd->dev, cxlds->component_reg_phys, > parent_dport); > if (IS_ERR(endpoint)) ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 4/6] cxl/pci: Add RCH downstream port error logging 2023-04-13 16:50 ` Jonathan Cameron @ 2023-04-14 16:36 ` Terry Bowman 2023-04-17 16:56 ` Jonathan Cameron 0 siblings, 1 reply; 52+ messages in thread From: Terry Bowman @ 2023-04-14 16:36 UTC (permalink / raw) To: Jonathan Cameron Cc: alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, linux-cxl, rrichter, linux-kernel, bhelgaas Hi Jonathan, I added responses inline below. On 4/13/23 11:50, Jonathan Cameron wrote: > On Tue, 11 Apr 2023 13:03:00 -0500 > Terry Bowman <terry.bowman@amd.com> wrote: > >> RCH downstream port error logging is missing in the current CXL driver. The >> missing AER and RAS error logging is needed for communicating driver error >> details to userspace. Update the driver to include PCIe AER and CXL RAS >> error logging. >> >> Add RCH downstream port error handling into the existing RCiEP handler. >> The downstream port error handler is added to the RCiEP error handler >> because the downstream port is implemented in a RCRB, is not PCI >> enumerable, and as a result is not directly accessible to the PCI AER >> root port driver. The AER root port driver calls the RCiEP handler for >> handling RCD errors and RCH downstream port protocol errors. >> >> Update mem.c to include RAS and AER setup. This includes AER and RAS >> capability discovery and mapping for later use in the error handler. >> >> Disable RCH downstream port's root port cmd interrupts.[1] >> >> Update existing RCiEP correctable and uncorrectable handlers to also call >> the RCH handler. The RCH handler will read the RCH AER registers, check for >> error severity, and if an error exists will log using an existing kernel >> AER trace routine. The RCH handler will also log downstream port RAS errors >> if they exist. 
>> >> [1] CXL 3.0 Spec, 12.2.1.1 - RCH Downstream Port Detected Errors >> >> Co-developed-by: Robert Richter <rrichter@amd.com> >> Signed-off-by: Robert Richter <rrichter@amd.com> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com> > Some minor stuff inline. Looks fine to me otherwise. > > I do find it a little confusing how often we go into an RCD or RCH specific > function then drop out directly for 2.0+ case, but you do seem to be consistent > with existing code so fair enough. > > Jonathan > This was to simplify the code from the caller(s) perspective while also trying to generalize the logic. >> --- >> drivers/cxl/core/pci.c | 126 ++++++++++++++++++++++++++++++++++++---- >> drivers/cxl/core/regs.c | 1 + >> drivers/cxl/cxl.h | 13 +++++ >> drivers/cxl/mem.c | 73 +++++++++++++++++++++++ >> 4 files changed, 201 insertions(+), 12 deletions(-) >> >> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c >> index 523d5b9fd7fc..d435ed2ff8b6 100644 >> --- a/drivers/cxl/core/pci.c >> +++ b/drivers/cxl/core/pci.c > > >> +/* >> + * Copy the AER capability registers to a buffer. This is necessary >> + * because RCRB AER capability is MMIO mapped. Clear the status >> + * after copying. >> + * >> + * @aer_base: base address of AER capability block in RCRB >> + * @aer_regs: destination for copying AER capability >> + */ >> +static bool cxl_rch_get_aer_info(void __iomem *aer_base, >> + struct aer_capability_regs *aer_regs) >> +{ >> + int read_cnt = PCI_AER_CAPABILITY_LENGTH / sizeof(u32); >> + u32 *aer_regs_buf = (u32 *)aer_regs; >> + int n; >> + >> + if (!aer_base) >> + return false; >> + >> + for (n = 0; n < read_cnt; n++) >> + aer_regs_buf[n] = readl(aer_base + n * sizeof(u32)); > > Maybe add a comment here on why memcpy_fromio() doesn't work for us. > I'm assuming we need these to definitely be 32bit reads. > Otherwise someone will 'optimize' it in future. > Correct, this was to enforce 32-bit accesses. I will add a comment. 
>> + >> + writel(aer_regs->uncor_status, aer_base + PCI_ERR_UNCOR_STATUS); >> + writel(aer_regs->cor_status, aer_base + PCI_ERR_COR_STATUS); >> + >> + return true; >> +} > = >> diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c >> index bde1fffab09e..dfa6fcfc428a 100644 >> --- a/drivers/cxl/core/regs.c >> +++ b/drivers/cxl/core/regs.c >> @@ -198,6 +198,7 @@ void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr, >> >> return ret_val; >> } >> +EXPORT_SYMBOL_NS_GPL(devm_cxl_iomap_block, CXL); >> >> int cxl_map_component_regs(struct device *dev, struct cxl_component_regs *regs, >> struct cxl_register_map *map, unsigned long map_mask) >> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h >> index df64c402e6e6..dae3f141ffcb 100644 >> --- a/drivers/cxl/cxl.h >> +++ b/drivers/cxl/cxl.h >> @@ -66,6 +66,8 @@ >> #define CXL_DECODER_MIN_GRANULARITY 256 >> #define CXL_DECODER_MAX_ENCODED_IG 6 >> >> +#define PCI_AER_CAPABILITY_LENGTH 56 > > Odd place to find a PCI specific define. Also a spec reference is > always good for these. What's the the length of? PCI r6.0 has > cap going up to address 0x5c so length 0x60. This seems to be igoring > the header log register. > This was to avoid including the TLP log at 0x38+. I can use sizeof(struct aer_capability_regs) or sizeof(*aer_regs) instead. It's the same 38h(56) and will allow me to remove this #define in the patchset revision. >> + >> static inline int cxl_hdm_decoder_count(u32 cap_hdr) >> { >> int val = FIELD_GET(CXL_HDM_DECODER_COUNT_MASK, cap_hdr); >> @@ -209,6 +211,15 @@ struct cxl_regs { >> struct_group_tagged(cxl_device_regs, device_regs, >> void __iomem *status, *mbox, *memdev; >> ); >> + >> + /* >> + * Pointer to RCH cxl_dport AER. (only for RCH/RCD mode) >> + * @dport_aer: CXL 2.0 12.2.11 RCH Downstream Port-detected Errors > > As with other cases, I'd like full comments, so something for @aer as well. 
> >> + */ >> + struct_group_tagged(cxl_rch_regs, rch_regs, >> + void __iomem *aer; >> + void __iomem *dport_ras; >> + ); >> }; >> >> struct cxl_reg_map { >> @@ -249,6 +260,8 @@ struct cxl_register_map { >> }; >> }; >> >> +void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr, >> + resource_size_t length); >> void cxl_probe_component_regs(struct device *dev, void __iomem *base, >> struct cxl_component_reg_map *map); >> void cxl_probe_device_regs(struct device *dev, void __iomem *base, >> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c >> index 014295ab6bc6..dd5ae0a4560c 100644 >> --- a/drivers/cxl/mem.c >> +++ b/drivers/cxl/mem.c >> @@ -4,6 +4,7 @@ >> #include <linux/device.h> >> #include <linux/module.h> >> #include <linux/pci.h> >> +#include <linux/aer.h> >> >> #include "cxlmem.h" >> #include "cxlpci.h" >> @@ -45,6 +46,71 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data) >> return 0; >> } >> >> +static void rch_disable_root_ints(void __iomem *aer_base) >> +{ >> + u32 aer_cmd_mask, aer_cmd; >> + >> + /* >> + * Disable RCH root port command interrupts. >> + * CXL3.0 12.2.1.1 - RCH Downstream Port-detected Errors >> + */ >> + aer_cmd_mask = (PCI_ERR_ROOT_CMD_COR_EN | >> + PCI_ERR_ROOT_CMD_NONFATAL_EN | >> + PCI_ERR_ROOT_CMD_FATAL_EN); >> + aer_cmd = readl(aer_base + PCI_ERR_ROOT_COMMAND); >> + aer_cmd &= ~aer_cmd_mask; >> + writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND); > > Should we be touching these if firmware hasn't granted control to > the OS? Description in the spec refers to 'software'. Is that > the kernel? No idea. I guess this is safe even if it has already > been done. Perhaps a comment to say it should already be in this state? > > These need to be disabled because the RCH shouldn't behave as a root port/RCEC generating interrupts as a result of correctable, fatal, or non-fatal AER errors. 
I added this per the CXL3.0 spec but, as you mentioned, it isn't likely necessary because they are disabled by default per PCI6.0.[1][2] This would be the case for OS/native and HW/FW error reporting. I'll add a comment stating it is already in this state. [1] CXL3.0 - 12.2.1.1 RCH Downstream Port-detected Errors [2] PCI 6.0 - 7.8.4.9 Root Error Command Register (Offset 2Ch) >> +} >> + >> +static int cxl_rch_map_ras(struct cxl_dev_state *cxlds, >> + struct cxl_dport *parent_dport) >> +{ >> + struct device *dev = parent_dport->dport; >> + resource_size_t aer_phys, ras_phys; >> + void __iomem *aer, *dport_ras; >> + >> + if (!parent_dport->rch) >> + return 0; >> + >> + if (!parent_dport->aer_cap || !parent_dport->ras_cap || >> + parent_dport->component_reg_phys == CXL_RESOURCE_NONE) >> + return -ENODEV; >> + >> + aer_phys = parent_dport->aer_cap + parent_dport->rcrb; >> + aer = devm_cxl_iomap_block(dev, aer_phys, >> + PCI_AER_CAPABILITY_LENGTH); >> + >> + if (!aer) >> + return -ENOMEM; >> + >> + ras_phys = parent_dport->ras_cap + parent_dport->component_reg_phys; >> + dport_ras = devm_cxl_iomap_block(dev, ras_phys, >> + CXL_RAS_CAPABILITY_LENGTH); >> + >> + if (!dport_ras) >> + return -ENOMEM; >> + >> + cxlds->regs.aer = aer; >> + cxlds->regs.dport_ras = dport_ras; >> + >> + return 0; >> +} >> + >> +static int cxl_setup_ras(struct cxl_dev_state *cxlds, >> + struct cxl_dport *parent_dport) >> +{ >> + int rc; >> + >> + rc = cxl_rch_map_ras(cxlds, parent_dport); >> + if (rc) >> + return rc; >> + >> + if (cxlds->rcd) >> + rch_disable_root_ints(cxlds->regs.aer); >> + >> + return rc; >> +} >> + >> static void cxl_setup_rcrb(struct cxl_dev_state *cxlds, >> struct cxl_dport *parent_dport) >> { >> @@ -91,6 +157,13 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd, >> >> cxl_setup_rcrb(cxlds, parent_dport); >> >> + rc = cxl_setup_ras(cxlds, parent_dport); >> + /* Continue with RAS setup errors */ >> + if (rc) >> + dev_warn(&cxlmd->dev, "CXL RAS
setup failed: %d\n", rc); >> + else >> + dev_info(&cxlmd->dev, "CXL error handling enabled\n"); > > This feels a little noisy as something to add given we didn't shout about it for > non RCD cases (I think). Maybe a dev_dbg()? > Ok. Regards, Terry >> + >> endpoint = devm_cxl_add_port(host, &cxlmd->dev, cxlds->component_reg_phys, >> parent_dport); >> if (IS_ERR(endpoint)) > ^ permalink raw reply [flat|nested] 52+ messages in thread
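The read-modify-write discussed in this reply is idempotent, which is why running it when the enables are already clear is harmless. A minimal user-space sketch, assuming a volatile variable stands in for the Root Error Command register; the bit values follow the PCI definition of that register, while root_cmd_ints_disabled() and root_cmd_disable() are hypothetical helper names, not kernel functions:

```c
#include <assert.h>
#include <stdint.h>

/* Enable bits of the Root Error Command register (offset 2Ch in AER). */
#define ROOT_CMD_COR_EN		0x00000001u
#define ROOT_CMD_NONFATAL_EN	0x00000002u
#define ROOT_CMD_FATAL_EN	0x00000004u

/* Return the command value with the three interrupt enables cleared. */
uint32_t root_cmd_ints_disabled(uint32_t cmd)
{
	const uint32_t mask = ROOT_CMD_COR_EN |
			      ROOT_CMD_NONFATAL_EN |
			      ROOT_CMD_FATAL_EN;

	return cmd & ~mask;
}

/*
 * Read-modify-write against a register stand-in.
 * Returns 1 if the write changed the value, 0 if it was already clear.
 */
int root_cmd_disable(volatile uint32_t *reg)
{
	uint32_t old, updated;

	assert(reg);
	old = *reg;
	updated = root_cmd_ints_disabled(old);
	*reg = updated;

	return old != updated;
}
```

Calling root_cmd_disable() a second time reports no change, and bits outside the three enables are left as they were.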
* Re: [PATCH v3 4/6] cxl/pci: Add RCH downstream port error logging 2023-04-14 16:36 ` Terry Bowman @ 2023-04-17 16:56 ` Jonathan Cameron 0 siblings, 0 replies; 52+ messages in thread From: Jonathan Cameron @ 2023-04-17 16:56 UTC (permalink / raw) To: Terry Bowman Cc: alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, linux-cxl, rrichter, linux-kernel, bhelgaas > > >> + > >> + writel(aer_regs->uncor_status, aer_base + PCI_ERR_UNCOR_STATUS); > >> + writel(aer_regs->cor_status, aer_base + PCI_ERR_COR_STATUS); > >> + > >> + return true; > >> +} > > = > >> diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c > >> index bde1fffab09e..dfa6fcfc428a 100644 > >> --- a/drivers/cxl/core/regs.c > >> +++ b/drivers/cxl/core/regs.c > >> @@ -198,6 +198,7 @@ void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr, > >> > >> return ret_val; > >> } > >> +EXPORT_SYMBOL_NS_GPL(devm_cxl_iomap_block, CXL); > >> > >> int cxl_map_component_regs(struct device *dev, struct cxl_component_regs *regs, > >> struct cxl_register_map *map, unsigned long map_mask) > >> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h > >> index df64c402e6e6..dae3f141ffcb 100644 > >> --- a/drivers/cxl/cxl.h > >> +++ b/drivers/cxl/cxl.h > >> @@ -66,6 +66,8 @@ > >> #define CXL_DECODER_MIN_GRANULARITY 256 > >> #define CXL_DECODER_MAX_ENCODED_IG 6 > >> > >> +#define PCI_AER_CAPABILITY_LENGTH 56 > > > > Odd place to find a PCI specific define. Also a spec reference is > > always good for these. What's the the length of? PCI r6.0 has > > cap going up to address 0x5c so length 0x60. This seems to be igoring > > the header log register. > > > > This was to avoid including the TLP log at 0x38+. > > I can use sizeof(struct aer_capability_regs) or sizeof(*aer_regs) instead. > It's the same 38h(56) and will allow me to remove this #define in the > patchset revision. That works better than a define that people might think is more generic. 
Otherwise you get PCI_AER_CAP_WITHOUT_TLP_LOG_LENGTH or something equally horrible. (or define the TLP_LOG length as another define and subtract that?)
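The sizeof() approach discussed in this exchange can be checked with a small user-space model. This is a sketch under assumptions: struct aer_regs_sketch is a hypothetical mirror of the AER register block up to offset 0x38 (the field list is an assumption for illustration, not copied from the kernel header), and regs_copy32() is a made-up stand-in for the readl() loop in the patch, with a volatile array in place of the mapped RCRB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout of the AER block, stopping before the TLP log at 0x38. */
struct aer_regs_sketch {
	uint32_t header;		/* 0x00 */
	uint32_t uncor_status;		/* 0x04 */
	uint32_t uncor_mask;		/* 0x08 */
	uint32_t uncor_severity;	/* 0x0c */
	uint32_t cor_status;		/* 0x10 */
	uint32_t cor_mask;		/* 0x14 */
	uint32_t cap_control;		/* 0x18 */
	uint32_t header_log[4];		/* 0x1c to 0x2b */
	uint32_t root_command;		/* 0x2c */
	uint32_t root_status;		/* 0x30 */
	uint16_t cor_err_source;	/* 0x34 */
	uint16_t uncor_err_source;	/* 0x36 */
};

/* The size drives the read count, so no separate length define is needed. */
static_assert(sizeof(struct aer_regs_sketch) == 0x38,
	      "sketch must cover 0x00 to 0x37");
static_assert(offsetof(struct aer_regs_sketch, root_command) == 0x2c,
	      "Root Error Command expected at 2Ch");

/*
 * Copy the block with one 32-bit load per register. The volatile
 * qualifier keeps the compiler from merging or dropping the loads.
 */
void regs_copy32(struct aer_regs_sketch *dst, const volatile uint32_t *base)
{
	size_t n;

	assert(dst && base);
	for (n = 0; n < sizeof(*dst) / sizeof(uint32_t); n++) {
		uint32_t v = base[n];

		memcpy((unsigned char *)dst + n * sizeof(v), &v, sizeof(v));
	}
}
```

Under this assumed layout the loop issues 14 reads, and the header log falls inside the copied range; only what follows offset 0x38 is left out.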
* RE: [PATCH v3 4/6] cxl/pci: Add RCH downstream port error logging 2023-04-11 18:03 ` [PATCH v3 4/6] cxl/pci: Add RCH downstream port error logging Terry Bowman ` (2 preceding siblings ...) 2023-04-13 16:50 ` Jonathan Cameron @ 2023-04-18 0:06 ` Dan Williams 2023-04-24 18:39 ` Terry Bowman 3 siblings, 1 reply; 52+ messages in thread From: Dan Williams @ 2023-04-18 0:06 UTC (permalink / raw) To: Terry Bowman, alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, Jonathan.Cameron, linux-cxl Cc: terry.bowman, rrichter, linux-kernel, bhelgaas Terry Bowman wrote: > RCH downstream port error logging is missing in the current CXL driver. The > missing AER and RAS error logging is needed for communicating driver error > details to userspace. Update the driver to include PCIe AER and CXL RAS > error logging. > > Add RCH downstream port error handling into the existing RCiEP handler. > The downstream port error handler is added to the RCiEP error handler > because the downstream port is implemented in a RCRB, is not PCI > enumerable, and as a result is not directly accessible to the PCI AER > root port driver. The AER root port driver calls the RCiEP handler for > handling RCD errors and RCH downstream port protocol errors. > > Update mem.c to include RAS and AER setup. This includes AER and RAS > capability discovery and mapping for later use in the error handler. > > Disable RCH downstream port's root port cmd interrupts.[1] > > Update existing RCiEP correctable and uncorrectable handlers to also call > the RCH handler. The RCH handler will read the RCH AER registers, check for > error severity, and if an error exists will log using an existing kernel > AER trace routine. The RCH handler will also log downstream port RAS errors > if they exist. 
I think this patch wants a lead in refactoring to move the existing
probe of the CXL RAS capability into the cxl_port driver so that the RCH
path and the VH path can be unified for register mapping and error
handling invocation. I do not see a compelling rationale to have 2
separate ways to map the RAS capability. The timing of when
cxl_setup_ras() is called looks problematic relative to when the first
error handler callback might happen.

For example what happens when an error fires after cxl_pci has
registered its error handlers, but before the component registers have
been mapped out of the RCRB?

This implies the need for a callback for cxl_pci to notify the cxl_port
driver of CXL errors to handle relative to a PCI AER event.
* Re: [PATCH v3 4/6] cxl/pci: Add RCH downstream port error logging 2023-04-18 0:06 ` Dan Williams @ 2023-04-24 18:39 ` Terry Bowman 0 siblings, 0 replies; 52+ messages in thread From: Terry Bowman @ 2023-04-24 18:39 UTC (permalink / raw) To: Dan Williams, alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dave.jiang, Jonathan.Cameron, linux-cxl Cc: rrichter, linux-kernel, bhelgaas Hi Dan, I added comments inline below. On 4/17/23 19:06, Dan Williams wrote: > Terry Bowman wrote: >> RCH downstream port error logging is missing in the current CXL driver. The >> missing AER and RAS error logging is needed for communicating driver error >> details to userspace. Update the driver to include PCIe AER and CXL RAS >> error logging. >> >> Add RCH downstream port error handling into the existing RCiEP handler. >> The downstream port error handler is added to the RCiEP error handler >> because the downstream port is implemented in a RCRB, is not PCI >> enumerable, and as a result is not directly accessible to the PCI AER >> root port driver. The AER root port driver calls the RCiEP handler for >> handling RCD errors and RCH downstream port protocol errors. >> >> Update mem.c to include RAS and AER setup. This includes AER and RAS >> capability discovery and mapping for later use in the error handler. >> >> Disable RCH downstream port's root port cmd interrupts.[1] >> >> Update existing RCiEP correctable and uncorrectable handlers to also call >> the RCH handler. The RCH handler will read the RCH AER registers, check for >> error severity, and if an error exists will log using an existing kernel >> AER trace routine. The RCH handler will also log downstream port RAS errors >> if they exist. > > I think this patch wants a lead in refactoring to move the existing > probe of the CXL RAS capability into the cxl_port driver so that the RCH > path and the VH path can be unified for register mapping and error > handling invocation. 
I do not see a compelling rationale to have 2 > separate ways to map the RAS capability. The timing of when > cxl_setup_ras() is called looks problematic relative to when the first > error handler callback might happen. > With respect to timing, I see this works for probing AER and RAS. Will it work for caching the mapped AER and RAS addresses? I ask because the mapped AER and RAS addresses are stored in cxlds and cxlds is created in cxl_pci and isn't necessarily available during RCH dport discovery. RCH dport is discovered within cxl_acpi context (beginning from cxl_acpi_probe()). Also, port.c code shows cxlds is not typically used. If you like I can change RCH RAS mapping to use cxl_map_component_regs()? This was in cxl_rch_map_ras() to handle the RCH odd case for AER and RAS mapping. The RAS can be moved out but RCH AER would still need to be mapped presumably still in cxl_rch_map_ras(). > For example what happens when an error fires after cxl_pci has > registered its error handlers, but before the component registers have > been mapped out of the RCRB? > The RCiEP ISR would execute but the RCH AER and RAS would not be logged because neither are mapped and are instead NULL. The AER and RAS register status would stay resident and be logged in the next ISR entry. > This implies the need for a callback for cxl_pci to notify the cxl_port > driver of CXL errors to handle relative to a PCI AER event. Along similar lines, could the RCH AER and RAS status be checked immediately after mapping and logged if status is present? Regards, Terry ^ permalink raw reply [flat|nested] 52+ messages in thread
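The behaviour Terry describes above (the handler runs, finds the RCH registers unmapped, and leaves the latched status for a later pass) can be sketched in user space. This is a minimal model under stated assumptions, not the driver code: struct model_dport, model_rch_handle_error() and model_map_regs() are hypothetical names, a plain uint32_t stands in for the RAS status register, and clearing is modelled as a simple store rather than the real write-1-to-clear access. model_map_regs() also illustrates the question at the end of the mail, checking the status immediately after mapping.

```c
/* User-space model only; none of these names exist in the kernel. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct model_dport {
	volatile uint32_t *ras_status;	/* NULL until the RCRB registers are mapped */
	unsigned int logged;		/* number of errors logged so far */
};

/* Returns 1 if an error was logged, 0 if unmapped or nothing pending. */
int model_rch_handle_error(struct model_dport *dp)
{
	uint32_t status;

	assert(dp != NULL);

	/* Not mapped yet: skip logging, the status stays latched in hardware. */
	if (!dp->ras_status)
		return 0;

	status = *dp->ras_status;
	if (!status)
		return 0;

	dp->logged++;
	*dp->ras_status = 0;	/* simplified stand-in for a write-1-to-clear */
	return 1;
}

/* Map the registers, then check at once for status latched before mapping. */
int model_map_regs(struct model_dport *dp, volatile uint32_t *regs)
{
	assert(dp != NULL);

	dp->ras_status = regs;
	return model_rch_handle_error(dp);
}
```

In this model an error raised before mapping is not lost, but it is only reported once something calls the handler again, which is the window Dan's question is about.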
* [PATCH v3 5/6] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler 2023-04-11 18:02 [PATCH v3 0/6] cxl/pci: Add support for RCH RAS error handling Terry Bowman ` (3 preceding siblings ...) 2023-04-11 18:03 ` [PATCH v3 4/6] cxl/pci: Add RCH downstream port error logging Terry Bowman @ 2023-04-11 18:03 ` Terry Bowman 2023-04-12 22:02 ` Bjorn Helgaas ` (2 more replies) 2023-04-11 18:03 ` [PATCH v3 6/6] PCI/AER: Unmask RCEC internal errors to enable RCH downstream port error handling Terry Bowman 5 siblings, 3 replies; 52+ messages in thread From: Terry Bowman @ 2023-04-11 18:03 UTC (permalink / raw) To: alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, Jonathan.Cameron, linux-cxl Cc: terry.bowman, rrichter, linux-kernel, bhelgaas, Oliver O'Halloran, Mahesh J Salgaonkar, linuxppc-dev, linux-pci From: Robert Richter <rrichter@amd.com> In Restricted CXL Device (RCD) mode a CXL device is exposed as an RCiEP, but CXL downstream and upstream ports are not enumerated and not visible in the PCIe hierarchy. Protocol and link errors are sent to an RCEC. Restricted CXL host (RCH) downstream port-detected errors are signaled as internal AER errors, either Uncorrectable Internal Error (UIE) or Corrected Internal Errors (CIE). The error source is the id of the RCEC. A CXL handler must then inspect the error status in various CXL registers residing in the dport's component register space (CXL RAS cap) or the dport's RCRB (AER ext cap). [1] Errors showing up in the RCEC's error handler must be handled and connected to the CXL subsystem. Implement this by forwarding the error to all CXL devices below the RCEC. Since the entire CXL device is controlled only using PCIe Configuration Space of device 0, Function 0, only pass it there [2]. These devices have the Memory Device class code set (PCI_CLASS_MEMORY_CXL, 502h) and the existing cxl_pci driver can implement the handler. 
In addition to errors directed to the CXL endpoint device, the handler must also inspect the CXL downstream port's CXL RAS and PCIe AER external capabilities that is connected to the device. Since CXL downstream port errors are signaled using internal errors, the handler requires those errors to be unmasked. This is subject of a follow-on patch. The reason for choosing this implementation is that a CXL RCEC device is bound to the AER port driver, but the driver does not allow it to register a custom specific handler to support CXL. Connecting the RCEC hard-wired with a CXL handler does not work, as the CXL subsystem might not be present all the time. The alternative to add an implementation to the portdrv to allow the registration of a custom RCEC error handler isn't worth doing it as CXL would be its only user. Instead, just check for an CXL RCEC and pass it down to the connected CXL device's error handler. With this approach the code can entirely be implemented in the PCIe AER driver and is independent of the CXL subsystem. The CXL driver only provides the handler. 
[1] CXL 3.0 spec, 12.2.1.1 RCH Downstream Port-detected Errors [2] CXL 3.0 spec, 8.1.3 PCIe DVSEC for CXL Devices Co-developed-by: Terry Bowman <terry.bowman@amd.com> Signed-off-by: Robert Richter <rrichter@amd.com> Signed-off-by: Terry Bowman <terry.bowman@amd.com> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com> Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-pci@vger.kernel.org --- drivers/pci/pcie/Kconfig | 8 ++++++ drivers/pci/pcie/aer.c | 61 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 69 insertions(+) diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig index 228652a59f27..b0dbd864d3a3 100644 --- a/drivers/pci/pcie/Kconfig +++ b/drivers/pci/pcie/Kconfig @@ -49,6 +49,14 @@ config PCIEAER_INJECT gotten from: https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/ +config PCIEAER_CXL + bool "PCI Express CXL RAS support" + default y + depends on PCIEAER && CXL_PCI + help + This enables CXL error handling for Restricted CXL Hosts + (RCHs). + # # PCI Express ECRC # diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 7a25b62d9e01..171a08fd8ebd 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -946,6 +946,65 @@ static bool find_source_device(struct pci_dev *parent, return true; } +#ifdef CONFIG_PCIEAER_CXL + +static bool is_cxl_mem_dev(struct pci_dev *dev) +{ + /* + * A CXL device is controlled only using PCIe Configuration + * Space of device 0, Function 0. 
+ */ + if (dev->devfn != PCI_DEVFN(0, 0)) + return false; + + /* Right now there is only a CXL.mem driver */ + if ((dev->class >> 8) != PCI_CLASS_MEMORY_CXL) + return false; + + return true; +} + +static bool is_internal_error(struct aer_err_info *info) +{ + if (info->severity == AER_CORRECTABLE) + return info->status & PCI_ERR_COR_INTERNAL; + + return info->status & PCI_ERR_UNC_INTN; +} + +static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info); + +static int cxl_handle_error_iter(struct pci_dev *dev, void *data) +{ + struct aer_err_info *e_info = (struct aer_err_info *)data; + + if (!is_cxl_mem_dev(dev)) + return 0; + + /* pci_dev_put() in handle_error_source() */ + dev = pci_dev_get(dev); + if (dev) + handle_error_source(dev, e_info); + + return 0; +} + +static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info) +{ + /* + * CXL downstream port errors are signaled as RCEC internal + * errors. Forward them to all CXL devices below the RCEC. + */ + if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC && + is_internal_error(info)) + pcie_walk_rcec(dev, cxl_handle_error_iter, info); +} + +#else +static inline void cxl_handle_error(struct pci_dev *dev, + struct aer_err_info *info) { } +#endif + /** * handle_error_source - handle logging error into an event log * @dev: pointer to pci_dev data structure of error source device @@ -957,6 +1016,8 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info) { int aer = dev->aer_cap; + cxl_handle_error(dev, info); + if (info->severity == AER_CORRECTABLE) { /* * Correctable error does not need software intervention. -- 2.34.1 ^ permalink raw reply related [flat|nested] 52+ messages in thread
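The two filters in the patch above can be exercised outside the kernel with a small model. This is a sketch under stated assumptions: model_is_cxl_mem_dev() and model_is_internal_error() are hypothetical stand-ins that take plain values instead of struct pci_dev and struct aer_err_info, and the constants are copied by hand from what I understand pci_regs.h, pci_ids.h and aer.h to define, so they should be checked against the tree before being relied on.

```c
/* User-space model of the patch's filters; not kernel code. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the kernel headers; verify against the tree. */
#define PCI_ERR_UNC_INTN	0x00400000u	/* Uncorrectable Internal Error */
#define PCI_ERR_COR_INTERNAL	0x00004000u	/* Corrected Internal Error */
#define PCI_CLASS_MEMORY_CXL	0x0502u
#define PCI_DEVFN(slot, func)	((((slot) & 0x1f) << 3) | ((func) & 0x07))

enum { AER_NONFATAL = 0, AER_FATAL = 1, AER_CORRECTABLE = 2 };

/* class_code is the 24-bit value: base class, sub-class, prog-if. */
bool model_is_cxl_mem_dev(unsigned int devfn, uint32_t class_code)
{
	/* Only Device 0, Function 0 controls the CXL device. */
	if (devfn != PCI_DEVFN(0, 0))
		return false;

	/* Drop the prog-if byte before comparing, as the patch does. */
	return (class_code >> 8) == PCI_CLASS_MEMORY_CXL;
}

/* status comes from the correctable or uncorrectable status register. */
bool model_is_internal_error(int severity, uint32_t status)
{
	assert(severity >= AER_NONFATAL && severity <= AER_CORRECTABLE);

	if (severity == AER_CORRECTABLE)
		return (status & PCI_ERR_COR_INTERNAL) != 0;

	return (status & PCI_ERR_UNC_INTN) != 0;
}
```

The severity decides which status register the bits came from, so the same status value can match for one severity and not for the other.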
* Re: [PATCH v3 5/6] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler 2023-04-11 18:03 ` [PATCH v3 5/6] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler Terry Bowman @ 2023-04-12 22:02 ` Bjorn Helgaas 2023-04-13 11:40 ` Robert Richter 2023-04-14 12:19 ` Jonathan Cameron 2023-04-18 1:01 ` Dan Williams 2 siblings, 1 reply; 52+ messages in thread From: Bjorn Helgaas @ 2023-04-12 22:02 UTC (permalink / raw) To: Terry Bowman Cc: alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, Jonathan.Cameron, linux-cxl, rrichter, linux-kernel, bhelgaas, Oliver O'Halloran, Mahesh J Salgaonkar, linuxppc-dev, linux-pci On Tue, Apr 11, 2023 at 01:03:01PM -0500, Terry Bowman wrote: > From: Robert Richter <rrichter@amd.com> > > In Restricted CXL Device (RCD) mode a CXL device is exposed as an > RCiEP, but CXL downstream and upstream ports are not enumerated and > not visible in the PCIe hierarchy. Protocol and link errors are sent > to an RCEC. > > Restricted CXL host (RCH) downstream port-detected errors are signaled > as internal AER errors, either Uncorrectable Internal Error (UIE) or > Corrected Internal Errors (CIE). The error source is the id of the > RCEC. A CXL handler must then inspect the error status in various CXL > registers residing in the dport's component register space (CXL RAS > cap) or the dport's RCRB (AER ext cap). [1] > > Errors showing up in the RCEC's error handler must be handled and > connected to the CXL subsystem. Implement this by forwarding the error > to all CXL devices below the RCEC. Since the entire CXL device is > controlled only using PCIe Configuration Space of device 0, Function > 0, Capitalize "device" and "Function" the same way (also appears in comment below). > only pass it there [2]. These devices have the Memory Device class > code set (PCI_CLASS_MEMORY_CXL, 502h) and the existing cxl_pci driver > can implement the handler. 
In addition to errors directed to the CXL > endpoint device, the handler must also inspect the CXL downstream > port's CXL RAS and PCIe AER external capabilities that is connected to "AER external capabilities" -- is that referring to the "AER *Extended* capability"? If so, we usually don't bother including the "extended" part because it's usually not relevant. But if you intended "external", I don't know what it means. > the device. > > Since CXL downstream port errors are signaled using internal errors, > the handler requires those errors to be unmasked. This is subject of a > follow-on patch. > > The reason for choosing this implementation is that a CXL RCEC device > is bound to the AER port driver, but the driver does not allow it to > register a custom specific handler to support CXL. Connecting the RCEC > hard-wired with a CXL handler does not work, as the CXL subsystem > might not be present all the time. The alternative to add an > implementation to the portdrv to allow the registration of a custom > RCEC error handler isn't worth doing it as CXL would be its only user. > Instead, just check for an CXL RCEC and pass it down to the connected > CXL device's error handler. With this approach the code can entirely > be implemented in the PCIe AER driver and is independent of the CXL > subsystem. The CXL driver only provides the handler. Can you make this more concrete with an example topology so we can work through how this all works? 
Correct me when I go off the rails here:

The current code uses pcie_walk_rcec() in this path, which basically
searches below a Root Port or RCEC for devices that have an AER error
status bit set, add them to the e_info[] list, and call
handle_error_source() for each one:

  aer_isr_one_error
    # get e_src from aer_fifo
    find_source_device(e_src)
      pcie_walk_rcec(find_device_iter)
        find_device_iter
          is_error_source
            # read PCI_ERR_COR_STATUS or PCI_ERR_UNCOR_STATUS
          if (error-source)
            add_error_device
              # add device to e_info[] list
    # now call handle_error_source for everything in e_info[]
    aer_process_err_devices
      for (i = 0; i < e_info->err_dev_num; i++)
        handle_error_source

IIUC, this patch basically says that an RCEC should have an AER error
status bit (UIE or CIE) set, but the devices "below" the RCEC will
not, so they won't get added to e_info[].

So we insert cxl_handle_error() in handle_error_source(), where it
gets called for the RCEC, and then it uses pcie_walk_rcec() again to
forcibly call handle_error_source() for *every* device "below" the
RCEC (even though they don't have AER error status bits set).

Then handle_error_source() ultimately calls the CXL driver err_handler
entry points (.cor_error_detected(), .error_detected(), etc), which
can look at the CXL-specific error status in the CXL RAS or RCRB or
whatever.

So this basically looks like a workaround for the fact that the AER
code only calls handle_error_source() when it finds AER error status,
and CXL doesn't *set* that AER error status. There's not that much
code here, but it seems like a quite a bit of complexity in an area
that is already pretty complicated.

Here's another idea: the ACPI GHES code (ghes_handle_aer()) basically
receives a packet of error status from firmware and queues it for
recovery via pcie_do_recovery(). What if you had a CXL module that
knew how to look for the CXL error status, package it up similarly,
and queue it via aer_recover_queue()?
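The two-pass behaviour described above (collect the devices that have AER status set, then let an RCEC internal error fan out to the CXL devices below it) can be modelled in a few lines. This is only a sketch under stated assumptions: struct model_pdev, model_handle_error_source() and model_aer_isr_one_error() are hypothetical names, a flat array stands in for "everything associated with the RCEC", and the handler call is reduced to a counter.

```c
/* User-space model of the forwarding flow; not the aer.c implementation. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_pdev {
	bool is_rcec;		/* Root Complex Event Collector */
	bool is_cxl_mem_fn0;	/* CXL memory device, Device 0, Function 0 */
	bool aer_status_set;	/* device's own AER status shows an error */
	int handler_calls;	/* how often the error handler ran for it */
};

void model_handle_error_source(struct model_pdev *devs, size_t n,
			       struct model_pdev *dev, bool internal_error)
{
	size_t i;

	assert(devs != NULL && dev != NULL);

	/* Models cxl_handle_error(): forward RCEC internal errors. */
	if (dev->is_rcec && internal_error) {
		for (i = 0; i < n; i++) {
			if (devs[i].is_cxl_mem_fn0 && !devs[i].is_rcec)
				model_handle_error_source(devs, n, &devs[i],
							  internal_error);
		}
	}

	/* Models the regular handling of the device itself. */
	dev->handler_calls++;
}

void model_aer_isr_one_error(struct model_pdev *devs, size_t n,
			     bool internal_error)
{
	size_t i;

	/* First pass: only devices with AER status set are error sources. */
	for (i = 0; i < n; i++) {
		if (devs[i].aer_status_set)
			model_handle_error_source(devs, n, &devs[i],
						  internal_error);
	}
}
```

With one RCEC reporting an internal error, the CXL memory device gets its handler called although its own AER status is clear, while an unrelated RCiEP is left alone.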
> [1] CXL 3.0 spec, 12.2.1.1 RCH Downstream Port-detected Errors > [2] CXL 3.0 spec, 8.1.3 PCIe DVSEC for CXL Devices > > Co-developed-by: Terry Bowman <terry.bowman@amd.com> > Signed-off-by: Robert Richter <rrichter@amd.com> > Signed-off-by: Terry Bowman <terry.bowman@amd.com> > Cc: "Oliver O'Halloran" <oohall@gmail.com> > Cc: Bjorn Helgaas <bhelgaas@google.com> > Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com> > Cc: linuxppc-dev@lists.ozlabs.org > Cc: linux-pci@vger.kernel.org > --- > drivers/pci/pcie/Kconfig | 8 ++++++ > drivers/pci/pcie/aer.c | 61 ++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 69 insertions(+) > > diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig > index 228652a59f27..b0dbd864d3a3 100644 > --- a/drivers/pci/pcie/Kconfig > +++ b/drivers/pci/pcie/Kconfig > @@ -49,6 +49,14 @@ config PCIEAER_INJECT > gotten from: > https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/ > > +config PCIEAER_CXL > + bool "PCI Express CXL RAS support" > + default y > + depends on PCIEAER && CXL_PCI > + help > + This enables CXL error handling for Restricted CXL Hosts > + (RCHs). > + > # > # PCI Express ECRC > # > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > index 7a25b62d9e01..171a08fd8ebd 100644 > --- a/drivers/pci/pcie/aer.c > +++ b/drivers/pci/pcie/aer.c > @@ -946,6 +946,65 @@ static bool find_source_device(struct pci_dev *parent, > return true; > } > > +#ifdef CONFIG_PCIEAER_CXL > + > +static bool is_cxl_mem_dev(struct pci_dev *dev) > +{ > + /* > + * A CXL device is controlled only using PCIe Configuration > + * Space of device 0, Function 0. 
> + */ > + if (dev->devfn != PCI_DEVFN(0, 0)) > + return false; > + > + /* Right now there is only a CXL.mem driver */ > + if ((dev->class >> 8) != PCI_CLASS_MEMORY_CXL) > + return false; > + > + return true; > +} > + > +static bool is_internal_error(struct aer_err_info *info) > +{ > + if (info->severity == AER_CORRECTABLE) > + return info->status & PCI_ERR_COR_INTERNAL; > + > + return info->status & PCI_ERR_UNC_INTN; > +} > + > +static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info); > + > +static int cxl_handle_error_iter(struct pci_dev *dev, void *data) > +{ > + struct aer_err_info *e_info = (struct aer_err_info *)data; > + > + if (!is_cxl_mem_dev(dev)) > + return 0; > + > + /* pci_dev_put() in handle_error_source() */ > + dev = pci_dev_get(dev); > + if (dev) > + handle_error_source(dev, e_info); > + > + return 0; > +} > + > +static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info) > +{ > + /* > + * CXL downstream port errors are signaled as RCEC internal > + * errors. Forward them to all CXL devices below the RCEC. > + */ > + if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC && > + is_internal_error(info)) > + pcie_walk_rcec(dev, cxl_handle_error_iter, info); > +} > + > +#else > +static inline void cxl_handle_error(struct pci_dev *dev, > + struct aer_err_info *info) { } > +#endif > + > /** > * handle_error_source - handle logging error into an event log > * @dev: pointer to pci_dev data structure of error source device > @@ -957,6 +1016,8 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info) > { > int aer = dev->aer_cap; > > + cxl_handle_error(dev, info); > + > if (info->severity == AER_CORRECTABLE) { > /* > * Correctable error does not need software intervention. > -- > 2.34.1 > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 5/6] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler 2023-04-12 22:02 ` Bjorn Helgaas @ 2023-04-13 11:40 ` Robert Richter 2023-04-14 21:32 ` Bjorn Helgaas 0 siblings, 1 reply; 52+ messages in thread From: Robert Richter @ 2023-04-13 11:40 UTC (permalink / raw) To: Bjorn Helgaas Cc: Terry Bowman, alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, Jonathan.Cameron, linux-cxl, linux-kernel, bhelgaas, Oliver O'Halloran, Mahesh J Salgaonkar, linuxppc-dev, linux-pci Bjorn, thanks for your detailed review. On 12.04.23 17:02:33, Bjorn Helgaas wrote: > On Tue, Apr 11, 2023 at 01:03:01PM -0500, Terry Bowman wrote: > > From: Robert Richter <rrichter@amd.com> > > > > In Restricted CXL Device (RCD) mode a CXL device is exposed as an > > RCiEP, but CXL downstream and upstream ports are not enumerated and > > not visible in the PCIe hierarchy. Protocol and link errors are sent > > to an RCEC. > > > > Restricted CXL host (RCH) downstream port-detected errors are signaled > > as internal AER errors, either Uncorrectable Internal Error (UIE) or > > Corrected Internal Errors (CIE). The error source is the id of the > > RCEC. A CXL handler must then inspect the error status in various CXL > > registers residing in the dport's component register space (CXL RAS > > cap) or the dport's RCRB (AER ext cap). [1] > > > > Errors showing up in the RCEC's error handler must be handled and > > connected to the CXL subsystem. Implement this by forwarding the error > > to all CXL devices below the RCEC. Since the entire CXL device is > > controlled only using PCIe Configuration Space of device 0, Function > > 0, > > Capitalize "device" and "Function" the same way (also appears in > comment below). Changed that. > > > only pass it there [2]. These devices have the Memory Device class > > code set (PCI_CLASS_MEMORY_CXL, 502h) and the existing cxl_pci driver > > can implement the handler. 
In addition to errors directed to the CXL > > endpoint device, the handler must also inspect the CXL downstream > > port's CXL RAS and PCIe AER external capabilities that is connected to > > "AER external capabilities" -- is that referring to the "AER > *Extended* capability"? If so, we usually don't bother including the > "extended" part because it's usually not relevant. But if you intended > "external", I don't know what it means. Right, "extended" is meant here, but I will drop it to also fit with the 'CXL RAS capability'. > > > the device. > > > > Since CXL downstream port errors are signaled using internal errors, > > the handler requires those errors to be unmasked. This is subject of a > > follow-on patch. > > > > The reason for choosing this implementation is that a CXL RCEC device > > is bound to the AER port driver, but the driver does not allow it to > > register a custom specific handler to support CXL. Connecting the RCEC > > hard-wired with a CXL handler does not work, as the CXL subsystem > > might not be present all the time. The alternative to add an > > implementation to the portdrv to allow the registration of a custom > > RCEC error handler isn't worth doing it as CXL would be its only user. > > Instead, just check for an CXL RCEC and pass it down to the connected > > CXL device's error handler. With this approach the code can entirely > > be implemented in the PCIe AER driver and is independent of the CXL > > subsystem. The CXL driver only provides the handler. > > Can you make this more concrete with an example topology so we can > work through how this all works? 
Correct me when I go off the rails > here: Let's assume just a simple CXL RCH topology: PCI hierarchy: ----------------- | ACPI0016 |---------------- Host bridge (CXL host) | - CEDT | | -------------| - RCRB base | | | ----------------- : | | | | | ------------------- --------- | | RCiEP |.....| RCEC | Endpoint (CXL dev) | --------| - BDF | | - BDF | | | | - PCIe AER | --------- | | | - CXL dvsec | | | | (v2: reg loc) | | | | - Comp regs | | | | - CXL RAS | | | ------------------- : : CXL hierarchy: : : : ------------------ | | | CXL root port |<-------------- | | | |----------->| - dport RCRB |<-------------- | | - PCIe AER | | | | - Comp regs | | | | - CXL RAS | | | ------------------ | | : | | | ------------------ | | ------->| CXL endpoint |--------------- | | (v1: RCRB) | ------------>| - uport RCRB | | - Comp regs | | - CXL RAS | ------------------ Dport detected errors are reported using PCIe AER and CXL RAS caps in the dports RCRB. Uport detected errors are reported using RCiEP's PCIe AER cap and either the uport's RCRB RAS cap or the RAS cap of the comp regs located using CXL DVSEC register locator. In all cases the RCEC is used with either the RCEC (dport errors) or the RCiEP (uport errors) error source id (BDF: bus, dev, func). > > The current code uses pcie_walk_rcec() in this path, which basically > searches below a Root Port or RCEC for devices that have an AER error > status bit set, add them to the e_info[] list, and call > handle_error_source() for each one: For reference, this series adds support to handle RCH downstream port-detected errors as described in CXL 3.0, 12.2.1.1. This flow looks correct to me, see comments inline. > > aer_isr_one_error > # get e_src from aer_fifo > find_source_device(e_src) e_src is the RCEC. > pcie_walk_rcec(find_device_iter) > find_device_iter > is_error_source > # read PCI_ERR_COR_STATUS or PCI_ERR_UNCOR_STATUS It is an internal error (CIE or UIE). 
> if (error-source) An early version of the spec did not require the RCEC as an error source. But this case is not handled with this series. > add_error_device > # add device to e_info[] list > # now call handle_error_source for everything in e_info[] > aer_process_err_devices > for (i = 0; i < e_info->err_dev_num; i++) > handle_error_source handle_error_source() is called with the RCEC as pci_dev. > > IIUC, this patch basically says that an RCEC should have an AER error > status bit (UIE or CIE) set, but the devices "below" the RCEC will > not, so they won't get added to e_info[]. An internal error of the RCEC indicates a CXL dport error. > > So we insert cxl_handle_error() in handle_error_source(), where it > gets called for the RCEC, and then it uses pcie_walk_rcec() again to > forcibly call handle_error_source() for *every* device "below" the > RCEC (even though they don't have AER error status bits set). The CXL device contains the links to the dport's caps. Also, there can be multiple RCs with CXL devs connected to it. So we must search for all CXL devices now, determine the corresponding dport and inspect both, PCIe AER and CXL RAS caps. > > Then handle_error_source() ultimately calls the CXL driver err_handler > entry points (.cor_error_detected(), .error_detected(), etc), which > can look at the CXL-specific error status in the CXL RAS or RCRB or > whatever. The AER driver (portdrv) does not have the knowledge of CXL internals. Thus the approach is to pass dport errors to the cxl_mem driver to handle it there in addition to cxl mem dev errors. > > So this basically looks like a workaround for the fact that the AER > code only calls handle_error_source() when it finds AER error status, > and CXL doesn't *set* that AER error status. There's not that much > code here, but it seems like a quite a bit of complexity in an area > that is already pretty complicated. 
>
> Here's another idea: the ACPI GHES code (ghes_handle_aer()) basically
> receives a packet of error status from firmware and queues it for
> recovery via pcie_do_recovery(). What if you had a CXL module that
> knew how to look for the CXL error status, package it up similarly,
> and queue it via aer_recover_queue()?

The CXL module knows how and where to look for errors, but it does not
receive interrupts (for dport errors). The interrupts land in the
portdrv (the RCEC's pci driver) and the CXL module must be notified by
the portdrv. But the portdrv (AER driver) does not know the CXL module,
nor is the CXL module always present (e.g. the CXL bus must be
enumerated first etc.).

aer_recover_queue() is interesting for reporting AER errors that have
been retrieved outside the PCIe hierarchy, in particular the dport AER
cap in the RCRB (see patch #4). We could collect all the data and just
send it to aer_recover_queue(). I think aer_recover_work_func() must be
extended to also handle corrected errors, otherwise the function is
already almost the same as handle_error_source(). But first, RCEC error
notifications (RCEC AER interrupts) must be sent to the CXL driver to
look into the dport's RCRB.

-Robert > > > [1] CXL 3.0 spec, 12.2.1.1 RCH Downstream Port-detected Errors > > [2] CXL 3.0 spec, 8.1.3 PCIe DVSEC for CXL Devices > > > > Co-developed-by: Terry Bowman <terry.bowman@amd.com> > > Signed-off-by: Robert Richter <rrichter@amd.com> > > Signed-off-by: Terry Bowman <terry.bowman@amd.com> > > Cc: "Oliver O'Halloran" <oohall@gmail.com> > > Cc: Bjorn Helgaas <bhelgaas@google.com> > > Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com> > > Cc: linuxppc-dev@lists.ozlabs.org > > Cc: linux-pci@vger.kernel.org > > --- > > drivers/pci/pcie/Kconfig | 8 ++++++ > > drivers/pci/pcie/aer.c | 61 ++++++++++++++++++++++++++++++++++++++++ > > 2 files changed, 69 insertions(+) > > > > diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig > > index 228652a59f27..b0dbd864d3a3 100644 > > --- a/drivers/pci/pcie/Kconfig > > +++ b/drivers/pci/pcie/Kconfig > > @@ -49,6 +49,14 @@ config PCIEAER_INJECT > > gotten from: > > https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/ > > > > +config PCIEAER_CXL > > + bool "PCI Express CXL RAS support" > > + default y > > + depends on PCIEAER && CXL_PCI > > + help > > + This enables CXL error handling for Restricted CXL Hosts > > + (RCHs). > > + > > # > > # PCI Express ECRC > > # > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > > index 7a25b62d9e01..171a08fd8ebd 100644 > > --- a/drivers/pci/pcie/aer.c > > +++ b/drivers/pci/pcie/aer.c > > @@ -946,6 +946,65 @@ static bool find_source_device(struct pci_dev *parent, > > return true; > > } > > > > +#ifdef CONFIG_PCIEAER_CXL > > + > > +static bool is_cxl_mem_dev(struct pci_dev *dev) > > +{ > > + /* > > + * A CXL device is controlled only using PCIe Configuration > > + * Space of device 0, Function 0. 
> > + */ > > + if (dev->devfn != PCI_DEVFN(0, 0)) > > + return false; > > + > > + /* Right now there is only a CXL.mem driver */ > > + if ((dev->class >> 8) != PCI_CLASS_MEMORY_CXL) > > + return false; > > + > > + return true; > > +} > > + > > +static bool is_internal_error(struct aer_err_info *info) > > +{ > > + if (info->severity == AER_CORRECTABLE) > > + return info->status & PCI_ERR_COR_INTERNAL; > > + > > + return info->status & PCI_ERR_UNC_INTN; > > +} > > + > > +static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info); > > + > > +static int cxl_handle_error_iter(struct pci_dev *dev, void *data) > > +{ > > + struct aer_err_info *e_info = (struct aer_err_info *)data; > > + > > + if (!is_cxl_mem_dev(dev)) > > + return 0; > > + > > + /* pci_dev_put() in handle_error_source() */ > > + dev = pci_dev_get(dev); > > + if (dev) > > + handle_error_source(dev, e_info); > > + > > + return 0; > > +} > > + > > +static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info) > > +{ > > + /* > > + * CXL downstream port errors are signaled as RCEC internal > > + * errors. Forward them to all CXL devices below the RCEC. > > + */ > > + if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC && > > + is_internal_error(info)) > > + pcie_walk_rcec(dev, cxl_handle_error_iter, info); > > +} > > + > > +#else > > +static inline void cxl_handle_error(struct pci_dev *dev, > > + struct aer_err_info *info) { } > > +#endif > > + > > /** > > * handle_error_source - handle logging error into an event log > > * @dev: pointer to pci_dev data structure of error source device > > @@ -957,6 +1016,8 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info) > > { > > int aer = dev->aer_cap; > > > > + cxl_handle_error(dev, info); > > + > > if (info->severity == AER_CORRECTABLE) { > > /* > > * Correctable error does not need software intervention. > > -- > > 2.34.1 > > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 5/6] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler 2023-04-13 11:40 ` Robert Richter @ 2023-04-14 21:32 ` Bjorn Helgaas 2023-04-17 22:00 ` Robert Richter 0 siblings, 1 reply; 52+ messages in thread From: Bjorn Helgaas @ 2023-04-14 21:32 UTC (permalink / raw) To: Robert Richter Cc: alison.schofield, dave.jiang, Terry Bowman, vishal.l.verma, linuxppc-dev, linux-pci, linux-kernel, linux-cxl, Mahesh J Salgaonkar, bhelgaas, Oliver O'Halloran, Jonathan.Cameron, bwidawsk, dan.j.williams, ira.weiny On Thu, Apr 13, 2023 at 01:40:52PM +0200, Robert Richter wrote: > On 12.04.23 17:02:33, Bjorn Helgaas wrote: > > On Tue, Apr 11, 2023 at 01:03:01PM -0500, Terry Bowman wrote: > > > From: Robert Richter <rrichter@amd.com> > ... > Let's assume just a simple CXL RCH topology: > > PCI hierarchy: > > ----------------- > | ACPI0016 |-------------- Host bridge (CXL host) > | - CEDT | | > -----------| - RCRB base | | > | ----------------- : > | | > | | > | ------------------- --------- > | | RCiEP |.....| RCEC | Endpoint (CXL dev) > | --------| - BDF | | - BDF | > | | | - PCIe AER | --------- > | | | - CXL dvsec | > | | | (v2: reg loc) | > | | | - Comp regs | > | | | - CXL RAS | > | | ------------------- > : : > > CXL hierarchy: > > : : > : ------------------ | > | | CXL root port |<------------ > | | | > |--------->| - dport RCRB |<------------ > | | - PCIe AER | | > | | - Comp regs | | > | | - CXL RAS | | > | ------------------ | > | : | > | | ------------------ | > | ------->| CXL endpoint |------------- > | | (v1: RCRB) | > ---------->| - uport RCRB | > | - Comp regs | > | - CXL RAS | > ------------------ > > Dport detected errors are reported using PCIe AER and CXL RAS caps in > the dports RCRB. > > Uport detected errors are reported using RCiEP's PCIe AER cap and > either the uport's RCRB RAS cap or the RAS cap of the comp regs > located using CXL DVSEC register locator. 
> > In all cases the RCEC is used with either the RCEC (dport errors) or > the RCiEP (uport errors) error source id (BDF: bus, dev, func). I'm mostly interested in the PCI entities involved because that's all aer.c can deal with. For the above, I think the PCI core only knows about these: 00:00.0 RCEC with AER, RCEC EA includes 00:01.0 00:01.0 RCiEP with AER aer_irq() would handle AER interrupts from 00:00.0. cxl_handle_error() would be called for 00:00.0 and would call handle_error_source() for everything below it (only 00:01.0 here). > > The current code uses pcie_walk_rcec() in this path, which basically > > searches below a Root Port or RCEC for devices that have an AER error > > status bit set, add them to the e_info[] list, and call > > handle_error_source() for each one: > > For reference, this series adds support to handle RCH downstream > port-detected errors as described in CXL 3.0, 12.2.1.1. > > This flow looks correct to me, see comments inline. We seem to be on the same page here, so I'll trim it out. > ... > > So we insert cxl_handle_error() in handle_error_source(), where it > > gets called for the RCEC, and then it uses pcie_walk_rcec() again to > > forcibly call handle_error_source() for *every* device "below" the > > RCEC (even though they don't have AER error status bits set). > > The CXL device contains the links to the dport's caps. Also, there can > be multiple RCs with CXL devs connected to it. So we must search for > all CXL devices now, determine the corresponding dport and inspect > both, PCIe AER and CXL RAS caps. > > > Then handle_error_source() ultimately calls the CXL driver err_handler > > entry points (.cor_error_detected(), .error_detected(), etc), which > > can look at the CXL-specific error status in the CXL RAS or RCRB or > > whatever. > > The AER driver (portdrv) does not have the knowledge of CXL internals. > Thus the approach is to pass dport errors to the cxl_mem driver to > handle it there in addition to cxl mem dev errors. 
>
> > So this basically looks like a workaround for the fact that the AER
> > code only calls handle_error_source() when it finds AER error status,
> > and CXL doesn't *set* that AER error status. There's not that much
> > code here, but it seems like a quite a bit of complexity in an area
> > that is already pretty complicated.

My main point here (correct me if I got this wrong) is that:

  - A RCEC generates an AER interrupt

  - find_source_device() searches all devices below the RCEC and
    builds a list of everything for which to call handle_error_source()

  - cxl_handle_error() *again* looks at all devices below the same
    RCEC and calls handle_error_source() for each one

So the main difference here is that the existing flow only calls
handle_error_source() when it finds an error logged in an AER status
register, while the new CXL flow calls handle_error_source() for
*every* device below the RCEC.

I think it's OK to do that, but the almost recursive structure and the
unusual reference counting make the overall AER flow much harder to
understand.

What if we changed is_error_source() to add every CXL.mem device it
finds to the e_info[] list, which I think could nicely encapsulate the
idea that "CXL devices have error state we don't know how to interpret
here"? Would the existing loop in aer_process_err_devices() then do
what you need?

> > Here's another idea: the ACPI GHES code (ghes_handle_aer()) basically
> > receives a packet of error status from firmware and queues it for
> > recovery via pcie_do_recovery(). What if you had a CXL module that
> > knew how to look for the CXL error status, package it up similarly,
> > and queue it via aer_recover_queue()?
>
> ...
> But first, RCEC error notifications (RCEC AER interrupts) must be sent
> to the CXL driver to look into the dport's RCRB.

Right. I think it could be solvable to have aer_irq() call or wake a
CXL interface that has been registered. But maybe changing
is_error_source() would be simpler.
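
For illustration only (this is not part of the patch): a minimal userspace sketch of what "add every CXL.mem device to the e_info[] list" would run into. The types `model_dev` and `model_dev_list` and the function `model_add_error_device()` are hypothetical stand-ins, not kernel APIs. The capacity `AER_MAX_MULTI_ERR_DEVICES` is assumed to be 5, as in the kernel's drivers/pci/pci.h at the time of this thread.

```c
#include <errno.h>
#include <stddef.h>

/* Assumed to match the kernel's fixed list capacity. */
#define AER_MAX_MULTI_ERR_DEVICES 5

/* Hypothetical stand-in for struct pci_dev. */
struct model_dev {
	int id;
};

/* Hypothetical stand-in for the device list kept in struct aer_err_info. */
struct model_dev_list {
	const struct model_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
	int error_dev_num;
};

/*
 * Append a device to the list. Once the fixed capacity is reached,
 * further devices are rejected with -ENOSPC and would not be handled.
 */
static int model_add_error_device(struct model_dev_list *list,
				  const struct model_dev *dev)
{
	if (list->error_dev_num >= AER_MAX_MULTI_ERR_DEVICES)
		return -ENOSPC;

	list->dev[list->error_dev_num++] = dev;
	return 0;
}
```

With a fixed-size list, an RCEC that has more associated CXL.mem devices than free slots would silently leave some of them out, so this approach would probably need the list size or its implementation revisited.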
Bjorn
* Re: [PATCH v3 5/6] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler 2023-04-14 21:32 ` Bjorn Helgaas @ 2023-04-17 22:00 ` Robert Richter 2023-04-19 14:17 ` Robert Richter 0 siblings, 1 reply; 52+ messages in thread From: Robert Richter @ 2023-04-17 22:00 UTC (permalink / raw) To: Bjorn Helgaas Cc: alison.schofield, dave.jiang, Terry Bowman, vishal.l.verma, linuxppc-dev, linux-pci, linux-kernel, linux-cxl, Mahesh J Salgaonkar, bhelgaas, Oliver O'Halloran, Jonathan.Cameron, bwidawsk, dan.j.williams, ira.weiny On 14.04.23 16:32:54, Bjorn Helgaas wrote: > On Thu, Apr 13, 2023 at 01:40:52PM +0200, Robert Richter wrote: > > On 12.04.23 17:02:33, Bjorn Helgaas wrote: > > > On Tue, Apr 11, 2023 at 01:03:01PM -0500, Terry Bowman wrote: > > > > From: Robert Richter <rrichter@amd.com> > > > ... > > Let's assume just a simple CXL RCH topology: > > > > PCI hierarchy: > > > > ----------------- > > | ACPI0016 |-------------- Host bridge (CXL host) > > | - CEDT | | > > -----------| - RCRB base | | > > | ----------------- : > > | | > > | | > > | ------------------- --------- > > | | RCiEP |.....| RCEC | Endpoint (CXL dev) > > | --------| - BDF | | - BDF | > > | | | - PCIe AER | --------- > > | | | - CXL dvsec | > > | | | (v2: reg loc) | > > | | | - Comp regs | > > | | | - CXL RAS | > > | | ------------------- > > : : > > > > CXL hierarchy: > > > > : : > > : ------------------ | > > | | CXL root port |<------------ > > | | | > > |--------->| - dport RCRB |<------------ > > | | - PCIe AER | | > > | | - Comp regs | | > > | | - CXL RAS | | > > | ------------------ | > > | : | > > | | ------------------ | > > | ------->| CXL endpoint |------------- > > | | (v1: RCRB) | > > ---------->| - uport RCRB | > > | - Comp regs | > > | - CXL RAS | > > ------------------ > > > > Dport detected errors are reported using PCIe AER and CXL RAS caps in > > the dports RCRB. 
> > > > Uport detected errors are reported using RCiEP's PCIe AER cap and > > either the uport's RCRB RAS cap or the RAS cap of the comp regs > > located using CXL DVSEC register locator. > > > > In all cases the RCEC is used with either the RCEC (dport errors) or > > the RCiEP (uport errors) error source id (BDF: bus, dev, func). > > I'm mostly interested in the PCI entities involved because that's all > aer.c can deal with. For the above, I think the PCI core only knows > about these: > > 00:00.0 RCEC with AER, RCEC EA includes 00:01.0 > 00:01.0 RCiEP with AER > > aer_irq() would handle AER interrupts from 00:00.0. > cxl_handle_error() would be called for 00:00.0 and would call > handle_error_source() for everything below it (only 00:01.0 here). > > > > The current code uses pcie_walk_rcec() in this path, which basically > > > searches below a Root Port or RCEC for devices that have an AER error > > > status bit set, add them to the e_info[] list, and call > > > handle_error_source() for each one: > > > > For reference, this series adds support to handle RCH downstream > > port-detected errors as described in CXL 3.0, 12.2.1.1. > > > > This flow looks correct to me, see comments inline. > > We seem to be on the same page here, so I'll trim it out. > > > ... > > > So we insert cxl_handle_error() in handle_error_source(), where it > > > gets called for the RCEC, and then it uses pcie_walk_rcec() again to > > > forcibly call handle_error_source() for *every* device "below" the > > > RCEC (even though they don't have AER error status bits set). > > > > The CXL device contains the links to the dport's caps. Also, there can > > be multiple RCs with CXL devs connected to it. So we must search for > > all CXL devices now, determine the corresponding dport and inspect > > both, PCIe AER and CXL RAS caps. 
> > > > > Then handle_error_source() ultimately calls the CXL driver err_handler > > > entry points (.cor_error_detected(), .error_detected(), etc), which > > > can look at the CXL-specific error status in the CXL RAS or RCRB or > > > whatever. > > > > The AER driver (portdrv) does not have the knowledge of CXL internals. > > Thus the approach is to pass dport errors to the cxl_mem driver to > > handle it there in addition to cxl mem dev errors. > > > > > So this basically looks like a workaround for the fact that the AER > > > code only calls handle_error_source() when it finds AER error status, > > > and CXL doesn't *set* that AER error status. There's not that much > > > code here, but it seems like a quite a bit of complexity in an area > > > that is already pretty complicated. > > My main point here (correct me if I got this wrong) is that: > > - A RCEC generates an AER interrupt > > - find_source_device() searches all devices below the RCEC and > builds a list everything for which to call handle_error_source() find_source_device() does not walk the RCEC if the error source is the RCEC itself (note that find_device_iter() is called for the root/rcec device first and exits early then). > > - cxl_handle_error() *again* looks at all devices below the same > RCEC and calls handle_error_source() for each one > > So the main difference here is that the existing flow only calls > handle_error_source() when it finds an error logged in an AER status > register, while the new CXL flow calls handle_error_source() for > *every* device below the RCEC. That is limited as much as possible: * The RCEC walk to handle CXL dport errors is done only in case of internal errors, for an RCEC only (not a port) (check in cxl_handle_error()). * Internal errors are only enabled for RCECs connected to CXL devices (handles_cxl_errors()). * The handler is only called if it is a CXL memory device (class code set and zero devfn) (check in cxl_handle_error_iter()). 
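
As a minimal sketch of the first check in the list above (RCEC only, internal errors only), the gate can be modelled outside the kernel like this. `struct model_err_info` and the `model_*` functions are simplified, hypothetical stand-ins for `struct aer_err_info`, is_internal_error() and the condition in cxl_handle_error(). The constant values are assumed to match include/uapi/linux/pci_regs.h and include/linux/aer.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values assumed to match the kernel headers. */
#define PCI_EXP_TYPE_RC_EC	0xa		/* Root Complex Event Collector */
#define PCI_ERR_COR_INTERNAL	0x00004000u	/* Corrected Internal Error */
#define PCI_ERR_UNC_INTN	0x00400000u	/* Uncorrectable Internal Error */
#define AER_NONFATAL		0
#define AER_FATAL		1
#define AER_CORRECTABLE		2

/* Hypothetical, simplified stand-in for struct aer_err_info. */
struct model_err_info {
	int severity;
	uint32_t status;
};

/* Pick the status bit to test according to the reported severity. */
static bool model_is_internal_error(const struct model_err_info *info)
{
	if (info->severity == AER_CORRECTABLE)
		return (info->status & PCI_ERR_COR_INTERNAL) != 0;

	return (info->status & PCI_ERR_UNC_INTN) != 0;
}

/* True only when the RCEC walk would be started. */
static bool model_should_walk_rcec(int pcie_type,
				   const struct model_err_info *info)
{
	return pcie_type == PCI_EXP_TYPE_RC_EC &&
	       model_is_internal_error(info);
}
```

Any error reported by a non-RCEC device, or any RCEC error that is not an internal error, takes the unchanged AER path in this model.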
An optimization I see here is to convert some runtime checks to cached
values determined during device enumeration (CXL device list, RCEC is
associated with CXL devices). Some sort of RCEC-to-CXL-dev
association, similar to rcec->rcec_ea.

>
> I think it's OK to do that, but the almost recursive structure and the
> unusual reference counting make the overall AER flow much harder to
> understand.
>
> What if we changed is_error_source() to add every CXL.mem device it
> finds to the e_info[] list, which I think could nicely encapsulate the
> idea that "CXL devices have error state we don't know how to interpret
> here"? Would the existing loop in aer_process_err_devices() then do
> what you need?

I did not want to mix this with devices determined by the Error Source
Identification Register. A CXL device may not be the source of the
error, which may cause unwanted side effects. We would then also have
to touch AER_MAX_MULTI_ERR_DEVICES and how the dev list is
implemented, as the maximum number of devices is unclear.

>
> > > Here's another idea: the ACPI GHES code (ghes_handle_aer()) basically
> > > receives a packet of error status from firmware and queues it for
> > > recovery via pcie_do_recovery(). What if you had a CXL module that
> > > knew how to look for the CXL error status, package it up similarly,
> > > and queue it via aer_recover_queue()?
> >
> > ...
> > But first, RCEC error notifications (RCEC AER interrupts) must be sent
> > to the CXL driver to look into the dport's RCRB.
>
> Right. I think it could be solvable to have aer_irq() call or wake a
> CXL interface that has been registered. But maybe changing
> is_error_source() would be simpler.

I am going to see if is_error_source() can be used to also find CXL
devices. But my main concern here is mixing CXL devices with the
actual devices identified by the Error Source ID.

Thanks,

-Robert
* Re: [PATCH v3 5/6] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler 2023-04-17 22:00 ` Robert Richter @ 2023-04-19 14:17 ` Robert Richter 0 siblings, 0 replies; 52+ messages in thread From: Robert Richter @ 2023-04-19 14:17 UTC (permalink / raw) To: Bjorn Helgaas Cc: alison.schofield, dave.jiang, Terry Bowman, vishal.l.verma, linuxppc-dev, linux-pci, linux-kernel, linux-cxl, Mahesh J Salgaonkar, bhelgaas, Oliver O'Halloran, Jonathan.Cameron, bwidawsk, dan.j.williams, ira.weiny Bjorn, On 18.04.23 00:00:58, Robert Richter wrote: > On 14.04.23 16:32:54, Bjorn Helgaas wrote: > > On Thu, Apr 13, 2023 at 01:40:52PM +0200, Robert Richter wrote: > > > On 12.04.23 17:02:33, Bjorn Helgaas wrote: > > > > On Tue, Apr 11, 2023 at 01:03:01PM -0500, Terry Bowman wrote: > > I'm mostly interested in the PCI entities involved because that's all > > aer.c can deal with. For the above, I think the PCI core only knows > > about these: > > > > 00:00.0 RCEC with AER, RCEC EA includes 00:01.0 > > 00:01.0 RCiEP with AER > > > > aer_irq() would handle AER interrupts from 00:00.0. > > cxl_handle_error() would be called for 00:00.0 and would call > > handle_error_source() for everything below it (only 00:01.0 here). > > > > > > The current code uses pcie_walk_rcec() in this path, which basically > > > > searches below a Root Port or RCEC for devices that have an AER error > > > > status bit set, add them to the e_info[] list, and call > > > > handle_error_source() for each one: > > > > > > For reference, this series adds support to handle RCH downstream > > > port-detected errors as described in CXL 3.0, 12.2.1.1. > > > > > > This flow looks correct to me, see comments inline. > > > > We seem to be on the same page here, so I'll trim it out. > > > > > ... 
> > > > So we insert cxl_handle_error() in handle_error_source(), where it > > > > gets called for the RCEC, and then it uses pcie_walk_rcec() again to > > > > forcibly call handle_error_source() for *every* device "below" the > > > > RCEC (even though they don't have AER error status bits set). > > > > > > The CXL device contains the links to the dport's caps. Also, there can > > > be multiple RCs with CXL devs connected to it. So we must search for > > > all CXL devices now, determine the corresponding dport and inspect > > > both, PCIe AER and CXL RAS caps. > > > > > > > Then handle_error_source() ultimately calls the CXL driver err_handler > > > > entry points (.cor_error_detected(), .error_detected(), etc), which > > > > can look at the CXL-specific error status in the CXL RAS or RCRB or > > > > whatever. > > > > > > The AER driver (portdrv) does not have the knowledge of CXL internals. > > > Thus the approach is to pass dport errors to the cxl_mem driver to > > > handle it there in addition to cxl mem dev errors. > > > > > > > So this basically looks like a workaround for the fact that the AER > > > > code only calls handle_error_source() when it finds AER error status, > > > > and CXL doesn't *set* that AER error status. There's not that much > > > > code here, but it seems like a quite a bit of complexity in an area > > > > that is already pretty complicated. > > > > My main point here (correct me if I got this wrong) is that: > > > > - A RCEC generates an AER interrupt > > > > - find_source_device() searches all devices below the RCEC and > > builds a list everything for which to call handle_error_source() > > find_source_device() does not walk the RCEC if the error source is the > RCEC itself (note that find_device_iter() is called for the root/rcec > device first and exits early then). 
> > > > > - cxl_handle_error() *again* looks at all devices below the same > > RCEC and calls handle_error_source() for each one > > > > So the main difference here is that the existing flow only calls > > handle_error_source() when it finds an error logged in an AER status > > register, while the new CXL flow calls handle_error_source() for > > *every* device below the RCEC. > > That is limited as much as possible: > > * The RCEC walk to handle CXL dport errors is done only in case of > internal errors, for an RCEC only (not a port) (check in > cxl_handle_error()). > > * Internal errors are only enabled for RCECs connected to CXL devices > (handles_cxl_errors()). > > * The handler is only called if it is a CXL memory device (class code > set and zero devfn) (check in cxl_handle_error_iter()). > > An optimization I see here is to convert some runtime checks to cached > values determined during device enumeration (CXL device list, RCEC is > associated with CXL devices). Some sort of RCEC-to-CXL-dev > association, similar to rcec->rcec_ea. > > > > > I think it's OK to do that, but the almost recursive structure and the > > unusual reference counting make the overall AER flow much harder to > > understand. > > > > What if we changed is_error_source() to add every CXL.mem device it > > finds to the e_info[] list, which I think could nicely encapsulate the > > idea that "CXL devices have error state we don't know how to interpret > > here"? Would the existing loop in aer_process_err_devices() then do > > what you need? > > I did not want to mix this with devices determined by the Error Source > Identification Register. CXL device may not be the error source of an > error which may cause some unwanted side-effects. We must also touch > AER_MAX_MULTI_ERR_DEVICES then and how the dev list is implemented as > the max number of devices is unclear. 
>
> >
> > > > Here's another idea: the ACPI GHES code (ghes_handle_aer()) basically
> > > > receives a packet of error status from firmware and queues it for
> > > > recovery via pcie_do_recovery(). What if you had a CXL module that
> > > > knew how to look for the CXL error status, package it up similarly,
> > > > and queue it via aer_recover_queue()?
> > >
> > > ...
> > > But first, RCEC error notifications (RCEC AER interrupts) must be sent
> > > to the CXL driver to look into the dport's RCRB.
> >
> > Right. I think it could be solvable to have aer_irq() call or wake a
> > CXL interface that has been registered. But maybe changing
> > is_error_source() would be simpler.
>
> I am going to see if is_error_source() can be used to also find CXL
> devices. But my main concern here is to mix CXL devices with actual
> devices identified by the Error Source ID.

I have looked into reusing is_error_source() and modifying
find_source_device() to also add CXL devices (the RCiEPs) to the dev
list in e_info. The problem I see is that at the AER level it is
unknown whether an error happened or not. The downstream port AER
capability also does not reside in a PCI config space header and thus
is not directly bound to a pci_dev. That means the endpoint's AER
capability in pci_dev is not the one we need; instead, a CXL-aware
driver must look up the RCRB which contains the AER. Additionally, the
CXL RAS cap must be inspected by that driver.

Assuming we add the RCiEP to the dev list, the CXL endpoint will be
processed by aer_get_device_error_info(), aer_print_error() and
handle_error_source(). This is done for the endpoint device even if
the source is the dport. Also, we need to check the error status of
both cap registers first. This will cause error reports and status
checks of devices that are not the error source.

That said, I think the best option is still to delegate the error down
to a CXL handler and do the error status check, reporting and handling
of the CXL specifics there.
I see your point that especially the pci_dev's refcount handling needs
to be improved. I will address that along with the other review
comments in the next version of this patch series. Let's then revisit
this discussion here?

Thanks,

-Robert
* Re: [PATCH v3 5/6] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler 2023-04-11 18:03 ` [PATCH v3 5/6] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler Terry Bowman 2023-04-12 22:02 ` Bjorn Helgaas @ 2023-04-14 12:19 ` Jonathan Cameron 2023-04-14 14:35 ` Robert Richter 2023-04-18 1:01 ` Dan Williams 2 siblings, 1 reply; 52+ messages in thread From: Jonathan Cameron @ 2023-04-14 12:19 UTC (permalink / raw) To: Terry Bowman Cc: alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, linux-cxl, rrichter, linux-kernel, bhelgaas, Oliver O'Halloran, Mahesh J Salgaonkar, linuxppc-dev, linux-pci On Tue, 11 Apr 2023 13:03:01 -0500 Terry Bowman <terry.bowman@amd.com> wrote: > From: Robert Richter <rrichter@amd.com> > > In Restricted CXL Device (RCD) mode a CXL device is exposed as an > RCiEP, but CXL downstream and upstream ports are not enumerated and > not visible in the PCIe hierarchy. Protocol and link errors are sent > to an RCEC. > > Restricted CXL host (RCH) downstream port-detected errors are signaled > as internal AER errors, either Uncorrectable Internal Error (UIE) or > Corrected Internal Errors (CIE). The error source is the id of the > RCEC. A CXL handler must then inspect the error status in various CXL > registers residing in the dport's component register space (CXL RAS > cap) or the dport's RCRB (AER ext cap). [1] > > Errors showing up in the RCEC's error handler must be handled and > connected to the CXL subsystem. Implement this by forwarding the error > to all CXL devices below the RCEC. Since the entire CXL device is > controlled only using PCIe Configuration Space of device 0, Function > 0, only pass it there [2]. These devices have the Memory Device class > code set (PCI_CLASS_MEMORY_CXL, 502h) and the existing cxl_pci driver > can implement the handler. This comment implies only class code compliant drivers. 
Sure we don't have drivers for anything else yet, but we should try to
avoid saying there won't be any (which I think the above implies).

You have a comment in the code, but maybe relax the description above
to "currently supported devices have..."

> In addition to errors directed to the CXL
> endpoint device, the handler must also inspect the CXL downstream
> port's CXL RAS and PCIe AER external capabilities that is connected to
> the device.
>
> Since CXL downstream port errors are signaled using internal errors,
> the handler requires those errors to be unmasked. This is subject of a
> follow-on patch.
>
> The reason for choosing this implementation is that a CXL RCEC device
> is bound to the AER port driver, but the driver does not allow it to
> register a custom specific handler to support CXL. Connecting the RCEC
> hard-wired with a CXL handler does not work, as the CXL subsystem
> might not be present all the time. The alternative to add an
> implementation to the portdrv to allow the registration of a custom
> RCEC error handler isn't worth doing it as CXL would be its only user.
> Instead, just check for an CXL RCEC and pass it down to the connected
> CXL device's error handler. With this approach the code can entirely
> be implemented in the PCIe AER driver and is independent of the CXL
> subsystem. The CXL driver only provides the handler.
>
> [1] CXL 3.0 spec, 12.2.1.1 RCH Downstream Port-detected Errors
> [2] CXL 3.0 spec, 8.1.3 PCIe DVSEC for CXL Devices
>
> Co-developed-by: Terry Bowman <terry.bowman@amd.com>
> Signed-off-by: Robert Richter <rrichter@amd.com>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Cc: "Oliver O'Halloran" <oohall@gmail.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com>
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-pci@vger.kernel.org

Generally looks good to me. A few trivial comments inline.
> ---
>  drivers/pci/pcie/Kconfig |  8 ++++++
>  drivers/pci/pcie/aer.c   | 61 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 69 insertions(+)
>
> diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig
> index 228652a59f27..b0dbd864d3a3 100644
> --- a/drivers/pci/pcie/Kconfig
> +++ b/drivers/pci/pcie/Kconfig
> @@ -49,6 +49,14 @@ config PCIEAER_INJECT
>  	  gotten from:
>  	     https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/
>
> +config PCIEAER_CXL
> +	bool "PCI Express CXL RAS support"

Description makes this sound too general. I'd mention restricted
hosts even in the menu option title.

> +	default y
> +	depends on PCIEAER && CXL_PCI
> +	help
> +	  This enables CXL error handling for Restricted CXL Hosts
> +	  (RCHs).

Spec term is probably fine in the title, but in the help I'd
expand it as per the CXL 3.0 glossary to include
"CXL Host that is operating in RCD mode."
It might otherwise surprise people that this matters on their shiny
new CXL X.0 host (because they found an old CXL 1.1 card in a box
and decided to plug it in).

Do we actually need this protection at all? It's a tiny amount of code
and I can't see anything immediately that requires the CXL_PCI dependency
other than it's a bit pointless if that isn't here.

> +
>  #
>  # PCI Express ECRC
>  #
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 7a25b62d9e01..171a08fd8ebd 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -946,6 +946,65 @@ static bool find_source_device(struct pci_dev *parent,
>  	return true;
>  }
>
> +#ifdef CONFIG_PCIEAER_CXL
> +
> +static bool is_cxl_mem_dev(struct pci_dev *dev)
> +{
> +	/*
> +	 * A CXL device is controlled only using PCIe Configuration
> +	 * Space of device 0, Function 0.

That's not true in general. Definitely true that CXL protocol
error reporting is controlled only using this Devfn, but
more generally there could be other stuff in later functions.
So perhaps make the comment more specific.
> + */ > + if (dev->devfn != PCI_DEVFN(0, 0)) > + return false; > + > + /* Right now there is only a CXL.mem driver */ > + if ((dev->class >> 8) != PCI_CLASS_MEMORY_CXL) > + return false; > + > + return true; > +} > + > +static bool is_internal_error(struct aer_err_info *info) > +{ > + if (info->severity == AER_CORRECTABLE) > + return info->status & PCI_ERR_COR_INTERNAL; > + > + return info->status & PCI_ERR_UNC_INTN; > +} > + > +static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info); > + > +static int cxl_handle_error_iter(struct pci_dev *dev, void *data) > +{ > + struct aer_err_info *e_info = (struct aer_err_info *)data; > + > + if (!is_cxl_mem_dev(dev)) > + return 0; > + > + /* pci_dev_put() in handle_error_source() */ > + dev = pci_dev_get(dev); > + if (dev) > + handle_error_source(dev, e_info); > + > + return 0; > +} > + > +static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info) > +{ > + /* > + * CXL downstream port errors are signaled as RCEC internal Make this comment more specific (to RCH I think). > + * errors. Forward them to all CXL devices below the RCEC. > + */ > + if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC && > + is_internal_error(info)) > + pcie_walk_rcec(dev, cxl_handle_error_iter, info); > +} > + > +#else > +static inline void cxl_handle_error(struct pci_dev *dev, > + struct aer_err_info *info) { } > +#endif > + > /** > * handle_error_source - handle logging error into an event log > * @dev: pointer to pci_dev data structure of error source device > @@ -957,6 +1016,8 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info) > { > int aer = dev->aer_cap; > > + cxl_handle_error(dev, info); > + > if (info->severity == AER_CORRECTABLE) { > /* > * Correctable error does not need software intervention. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 5/6] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler 2023-04-14 12:19 ` Jonathan Cameron @ 2023-04-14 14:35 ` Robert Richter 2023-04-17 16:54 ` Jonathan Cameron 0 siblings, 1 reply; 52+ messages in thread From: Robert Richter @ 2023-04-14 14:35 UTC (permalink / raw) To: Jonathan Cameron Cc: Terry Bowman, alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, linux-cxl, linux-kernel, bhelgaas, Oliver O'Halloran, Mahesh J Salgaonkar, linuxppc-dev, linux-pci On 14.04.23 13:19:50, Jonathan Cameron wrote: > On Tue, 11 Apr 2023 13:03:01 -0500 > Terry Bowman <terry.bowman@amd.com> wrote: > > > From: Robert Richter <rrichter@amd.com> > > > > In Restricted CXL Device (RCD) mode a CXL device is exposed as an > > RCiEP, but CXL downstream and upstream ports are not enumerated and > > not visible in the PCIe hierarchy. Protocol and link errors are sent > > to an RCEC. > > > > Restricted CXL host (RCH) downstream port-detected errors are signaled > > as internal AER errors, either Uncorrectable Internal Error (UIE) or > > Corrected Internal Errors (CIE). The error source is the id of the > > RCEC. A CXL handler must then inspect the error status in various CXL > > registers residing in the dport's component register space (CXL RAS > > cap) or the dport's RCRB (AER ext cap). [1] > > > > Errors showing up in the RCEC's error handler must be handled and > > connected to the CXL subsystem. Implement this by forwarding the error > > to all CXL devices below the RCEC. Since the entire CXL device is > > controlled only using PCIe Configuration Space of device 0, Function > > 0, only pass it there [2]. These devices have the Memory Device class > > code set (PCI_CLASS_MEMORY_CXL, 502h) and the existing cxl_pci driver > > can implement the handler. > > This comment implies only class code compliant drivers. 
Sure we don't > have drivers for anything else yet, but we should try to avoid saying > there won't be any (which I think above implies). > > You have a comment in the code, but maybe relaxing the description above > to "currently support devices have..." It is used here to identify CXL memory devices and limit the enablement to those. The spec requires this to be set for CXL mem devs (see cxl 3.0, 8.1.12.2). There could be other CXL devices (e.g. cache), but other drivers are not yet implemented. That is what I am referring to. The check makes sure there is actually a driver with a handler for it (cxl_pci). > > > In addition to errors directed to the CXL > > endpoint device, the handler must also inspect the CXL downstream > > port's CXL RAS and PCIe AER external capabilities that is connected to > > the device. > > > > Since CXL downstream port errors are signaled using internal errors, > > the handler requires those errors to be unmasked. This is subject of a > > follow-on patch. > > > > The reason for choosing this implementation is that a CXL RCEC device > > is bound to the AER port driver, but the driver does not allow it to > > register a custom specific handler to support CXL. Connecting the RCEC > > hard-wired with a CXL handler does not work, as the CXL subsystem > > might not be present all the time. The alternative to add an > > implementation to the portdrv to allow the registration of a custom > > RCEC error handler isn't worth doing it as CXL would be its only user. > > Instead, just check for an CXL RCEC and pass it down to the connected > > CXL device's error handler. With this approach the code can entirely > > be implemented in the PCIe AER driver and is independent of the CXL > > subsystem. The CXL driver only provides the handler. 
> > > > [1] CXL 3.0 spec, 12.2.1.1 RCH Downstream Port-detected Errors > > [2] CXL 3.0 spec, 8.1.3 PCIe DVSEC for CXL Devices > > > > Co-developed-by: Terry Bowman <terry.bowman@amd.com> > > Signed-off-by: Robert Richter <rrichter@amd.com> > > Signed-off-by: Terry Bowman <terry.bowman@amd.com> > > Cc: "Oliver O'Halloran" <oohall@gmail.com> > > Cc: Bjorn Helgaas <bhelgaas@google.com> > > Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com> > > Cc: linuxppc-dev@lists.ozlabs.org > > Cc: linux-pci@vger.kernel.org > > Generally looks good to me. A few trivial comments inline. > > > --- > > drivers/pci/pcie/Kconfig | 8 ++++++ > > drivers/pci/pcie/aer.c | 61 ++++++++++++++++++++++++++++++++++++++++ > > 2 files changed, 69 insertions(+) > > > > diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig > > index 228652a59f27..b0dbd864d3a3 100644 > > --- a/drivers/pci/pcie/Kconfig > > +++ b/drivers/pci/pcie/Kconfig > > @@ -49,6 +49,14 @@ config PCIEAER_INJECT > > gotten from: > > https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/ > > > > +config PCIEAER_CXL > > + bool "PCI Express CXL RAS support" > > Description makes this sound too general. I'd mentioned restricted > hosts even in the menu option title. > > > > + default y > > + depends on PCIEAER && CXL_PCI > > + help > > + This enables CXL error handling for Restricted CXL Hosts > > + (RCHs). > > Spec term is probably fine in the title, but in the help I'd > expand it as per the CXL 3.0 glossary to include > "CXL Host that is operating in RCD mode." > It might otherwise surprise people that this matters on their shiny > new CXL X.0 host (because they found an old CXL 1.1 card in a box > and decided to plug it in) > > Do we actually need this protection at all? It's a tiny amount of code > and I can't see anything immediately that requires the CXL_PCI dependency > other than it's a bit pointless if that isn't here. 
> > > + > > # > > # PCI Express ECRC > > # > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > > index 7a25b62d9e01..171a08fd8ebd 100644 > > --- a/drivers/pci/pcie/aer.c > > +++ b/drivers/pci/pcie/aer.c > > @@ -946,6 +946,65 @@ static bool find_source_device(struct pci_dev *parent, > > return true; > > } > > > > +#ifdef CONFIG_PCIEAER_CXL > > + > > +static bool is_cxl_mem_dev(struct pci_dev *dev) > > +{ > > + /* > > + * A CXL device is controlled only using PCIe Configuration > > + * Space of device 0, Function 0. > > That's not true in general. Definitely true that CXL protocol > error reporting is controlled only using this Devfn, but > more generally there could be other stuff in later functions. > So perhaps make the comment more specific. I actually mean CXL device in RCD mode here (seen as RCiEP in the PCI hierarchy). The spec says (cxl 3.0, 8.1.3): """ In either case [(RCD and non-RCD)], the capability, status, and control fields in Device 0, Function 0 DVSEC control the CXL functionality of the entire device. """ So dev 0, func 0 must contain a CXL PCIe DVSEC. Thus it is a CXL device and able to handle CXL AER errors. The limitation to the first device prevents the handler from being run multiple times for the same event. 
> > > + */ > > + if (dev->devfn != PCI_DEVFN(0, 0)) > > + return false; > > + > > + /* Right now there is only a CXL.mem driver */ > > + if ((dev->class >> 8) != PCI_CLASS_MEMORY_CXL) > > + return false; > > + > > + return true; > > +} > > + > > +static bool is_internal_error(struct aer_err_info *info) > > +{ > > + if (info->severity == AER_CORRECTABLE) > > + return info->status & PCI_ERR_COR_INTERNAL; > > + > > + return info->status & PCI_ERR_UNC_INTN; > > +} > > + > > +static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info); > > + > > +static int cxl_handle_error_iter(struct pci_dev *dev, void *data) > > +{ > > + struct aer_err_info *e_info = (struct aer_err_info *)data; > > + > > + if (!is_cxl_mem_dev(dev)) > > + return 0; > > + > > + /* pci_dev_put() in handle_error_source() */ > > + dev = pci_dev_get(dev); > > + if (dev) > > + handle_error_source(dev, e_info); > > + > > + return 0; > > +} > > + > > +static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info) > > +{ > > + /* > > + * CXL downstream port errors are signaled as RCEC internal > > Make this comment more specific (to RCH I think). Right, same here, this is restricted mode only. Thanks for review. -Robert > > > + * errors. Forward them to all CXL devices below the RCEC. 
> > + */ > > + if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC && > > + is_internal_error(info)) > > + pcie_walk_rcec(dev, cxl_handle_error_iter, info); > > +} > > + > > +#else > > +static inline void cxl_handle_error(struct pci_dev *dev, > > + struct aer_err_info *info) { } > > +#endif > > + > > /** > > * handle_error_source - handle logging error into an event log > > * @dev: pointer to pci_dev data structure of error source device > > @@ -957,6 +1016,8 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info) > > { > > int aer = dev->aer_cap; > > > > + cxl_handle_error(dev, info); > > + > > if (info->severity == AER_CORRECTABLE) { > > /* > > * Correctable error does not need software intervention. >
* Re: [PATCH v3 5/6] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler 2023-04-14 14:35 ` Robert Richter @ 2023-04-17 16:54 ` Jonathan Cameron 2023-04-17 20:36 ` Robert Richter 0 siblings, 1 reply; 52+ messages in thread From: Jonathan Cameron @ 2023-04-17 16:54 UTC (permalink / raw) To: Robert Richter Cc: Terry Bowman, alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, linux-cxl, linux-kernel, bhelgaas, Oliver O'Halloran, Mahesh J Salgaonkar, linuxppc-dev, linux-pci On Fri, 14 Apr 2023 16:35:05 +0200 Robert Richter <rrichter@amd.com> wrote: > On 14.04.23 13:19:50, Jonathan Cameron wrote: > > On Tue, 11 Apr 2023 13:03:01 -0500 > > Terry Bowman <terry.bowman@amd.com> wrote: > > > > > From: Robert Richter <rrichter@amd.com> > > > > > > In Restricted CXL Device (RCD) mode a CXL device is exposed as an > > > RCiEP, but CXL downstream and upstream ports are not enumerated and > > > not visible in the PCIe hierarchy. Protocol and link errors are sent > > > to an RCEC. > > > > > > Restricted CXL host (RCH) downstream port-detected errors are signaled > > > as internal AER errors, either Uncorrectable Internal Error (UIE) or > > > Corrected Internal Errors (CIE). The error source is the id of the > > > RCEC. A CXL handler must then inspect the error status in various CXL > > > registers residing in the dport's component register space (CXL RAS > > > cap) or the dport's RCRB (AER ext cap). [1] > > > > > > Errors showing up in the RCEC's error handler must be handled and > > > connected to the CXL subsystem. Implement this by forwarding the error > > > to all CXL devices below the RCEC. Since the entire CXL device is > > > controlled only using PCIe Configuration Space of device 0, Function > > > 0, only pass it there [2]. These devices have the Memory Device class > > > code set (PCI_CLASS_MEMORY_CXL, 502h) and the existing cxl_pci driver > > > can implement the handler. 
> > > > This comment implies only class code compliant drivers. Sure we don't > > have drivers for anything else yet, but we should try to avoid saying > > there won't be any (which I think above implies). > > > > You have a comment in the code, but maybe relaxing the description above > > to "currently support devices have..." > > It is used here to identify CXL memory devices and limit the > enablement to those. The spec requires this to be set for CXL mem devs > (see cxl 3.0, 8.1.12.2). > > There could be other CXL devices (e.g. cache), but other drivers are > not yet implemented. That is what I am referring to. The check makes > sure there is actually a driver with a handler for it (cxl_pci). Understood on intent. My worry is that the above can be read as a statement on hardware restrictions, rather than on what software currently implements. Meh. Minor point so I don't care that much! Unlikely anyone will read the patch description after it merges anyway ;) > > > > > > In addition to errors directed to the CXL > > > endpoint device, the handler must also inspect the CXL downstream > > > port's CXL RAS and PCIe AER external capabilities that is connected to > > > the device. > > > > > > Since CXL downstream port errors are signaled using internal errors, > > > the handler requires those errors to be unmasked. This is subject of a > > > follow-on patch. > > > > > > The reason for choosing this implementation is that a CXL RCEC device > > > is bound to the AER port driver, but the driver does not allow it to > > > register a custom specific handler to support CXL. Connecting the RCEC > > > hard-wired with a CXL handler does not work, as the CXL subsystem > > > might not be present all the time. The alternative to add an > > > implementation to the portdrv to allow the registration of a custom > > > RCEC error handler isn't worth doing it as CXL would be its only user.
> > > Instead, just check for an CXL RCEC and pass it down to the connected > > > CXL device's error handler. With this approach the code can entirely > > > be implemented in the PCIe AER driver and is independent of the CXL > > > subsystem. The CXL driver only provides the handler. > > > > > > [1] CXL 3.0 spec, 12.2.1.1 RCH Downstream Port-detected Errors > > > [2] CXL 3.0 spec, 8.1.3 PCIe DVSEC for CXL Devices > > > > > > Co-developed-by: Terry Bowman <terry.bowman@amd.com> > > > Signed-off-by: Robert Richter <rrichter@amd.com> > > > Signed-off-by: Terry Bowman <terry.bowman@amd.com> > > > Cc: "Oliver O'Halloran" <oohall@gmail.com> > > > Cc: Bjorn Helgaas <bhelgaas@google.com> > > > Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com> > > > Cc: linuxppc-dev@lists.ozlabs.org > > > Cc: linux-pci@vger.kernel.org > > > > Generally looks good to me. A few trivial comments inline. > > > > > --- > > > drivers/pci/pcie/Kconfig | 8 ++++++ > > > drivers/pci/pcie/aer.c | 61 ++++++++++++++++++++++++++++++++++++++++ > > > 2 files changed, 69 insertions(+) > > > > > > diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig > > > index 228652a59f27..b0dbd864d3a3 100644 > > > --- a/drivers/pci/pcie/Kconfig > > > +++ b/drivers/pci/pcie/Kconfig > > > @@ -49,6 +49,14 @@ config PCIEAER_INJECT > > > gotten from: > > > https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/ > > > > > > +config PCIEAER_CXL > > > + bool "PCI Express CXL RAS support" > > > > Description makes this sound too general. I'd mentioned restricted > > hosts even in the menu option title. > > > > > > > + default y > > > + depends on PCIEAER && CXL_PCI > > > + help > > > + This enables CXL error handling for Restricted CXL Hosts > > > + (RCHs). > > > > Spec term is probably fine in the title, but in the help I'd > > expand it as per the CXL 3.0 glossary to include > > "CXL Host that is operating in RCD mode." 
> > It might otherwise surprise people that this matters on their shiny > > new CXL X.0 host (because they found an old CXL 1.1 card in a box > > and decided to plug it in) > > > > Do we actually need this protection at all? It's a tiny amount of code > > and I can't see anything immediately that requires the CXL_PCI dependency > > other than it's a bit pointless if that isn't here. > > > > > + > > > # > > > # PCI Express ECRC > > > # > > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > > > index 7a25b62d9e01..171a08fd8ebd 100644 > > > --- a/drivers/pci/pcie/aer.c > > > +++ b/drivers/pci/pcie/aer.c > > > @@ -946,6 +946,65 @@ static bool find_source_device(struct pci_dev *parent, > > > return true; > > > } > > > > > > +#ifdef CONFIG_PCIEAER_CXL > > > + > > > +static bool is_cxl_mem_dev(struct pci_dev *dev) > > > +{ > > > + /* > > > + * A CXL device is controlled only using PCIe Configuration > > > + * Space of device 0, Function 0. > > > > That's not true in general. Definitely true that CXL protocol > > error reporting is controlled only using this Devfn, but > > more generally there could be other stuff in later functions. > > So perhaps make the comment more specific. > > I actually mean CXL device in RCD mode here (seen as RCiEP in the PCI > hierarchy). > > The spec says (cxl 3.0, 8.1.3): > > """ > In either case [(RCD and non-RCD)], the capability, status, and > control fields in Device 0, Function 0 DVSEC control the CXL > functionality of the entire device. > """ > > So dev 0, func 0 must contain a CXL PCIe DVSEC. Thus it is a CXL > device and able to handle CXL AER errors. The limitation to the first > device prevents the handler from being run multiple times for the same > event. Fine with limitation. Text says "device is controlled only using". That is true for what you are controlling here, but other aspects of the device are controlled via whatever interface they like. Perhaps just quote the specification as you have done in your reply. 
Then it is clear that we mean just these registers. > > > > > > > + */ > > > + if (dev->devfn != PCI_DEVFN(0, 0)) > > > + return false; > > > + > > > + /* Right now there is only a CXL.mem driver */ > > > + if ((dev->class >> 8) != PCI_CLASS_MEMORY_CXL) > > > + return false; > > > + > > > + return true; > > > +} > > > + > > > +static bool is_internal_error(struct aer_err_info *info) > > > +{ > > > + if (info->severity == AER_CORRECTABLE) > > > + return info->status & PCI_ERR_COR_INTERNAL; > > > + > > > + return info->status & PCI_ERR_UNC_INTN; > > > +} > > > + > > > +static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info); > > > + > > > +static int cxl_handle_error_iter(struct pci_dev *dev, void *data) > > > +{ > > > + struct aer_err_info *e_info = (struct aer_err_info *)data; > > > + > > > + if (!is_cxl_mem_dev(dev)) > > > + return 0; > > > + > > > + /* pci_dev_put() in handle_error_source() */ > > > + dev = pci_dev_get(dev); > > > + if (dev) > > > + handle_error_source(dev, e_info); > > > + > > > + return 0; > > > +} > > > + > > > +static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info) > > > +{ > > > + /* > > > + * CXL downstream port errors are signaled as RCEC internal > > > > Make this comment more specific (to RCH I think). > > Right, same here, this is restricted mode only. > > Thanks for review. > > -Robert > > > > > > > + * errors. Forward them to all CXL devices below the RCEC. 
> > > + */ > > > + if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC && > > > + is_internal_error(info)) > > > + pcie_walk_rcec(dev, cxl_handle_error_iter, info); > > > +} > > > + > > > +#else > > > +static inline void cxl_handle_error(struct pci_dev *dev, > > > + struct aer_err_info *info) { } > > > +#endif > > > + > > > /** > > > * handle_error_source - handle logging error into an event log > > > * @dev: pointer to pci_dev data structure of error source device > > > @@ -957,6 +1016,8 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info) > > > { > > > int aer = dev->aer_cap; > > > > > > + cxl_handle_error(dev, info); > > > + > > > if (info->severity == AER_CORRECTABLE) { > > > /* > > > * Correctable error does not need software intervention. > >
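The device filter debated in this reply can be modeled outside the kernel. The following is a minimal userspace sketch, not the patch itself: `struct fake_pci_dev` is a hypothetical stand-in for the two `struct pci_dev` fields the check reads, and the macro values are assumed to match the kernel's PCI headers (the 0x0502 class code is the "502h" quoted in the patch description).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match the kernel's PCI header definitions. */
#define PCI_DEVFN(slot, func)	((((slot) & 0x1f) << 3) | ((func) & 0x07))
#define PCI_CLASS_MEMORY_CXL	0x0502

/* Hypothetical stand-in for the two struct pci_dev fields the check reads. */
struct fake_pci_dev {
	unsigned int devfn;	/* encoded device/function number */
	uint32_t class;		/* 24 bits: base class, sub-class, prog-if */
};

static bool is_cxl_mem_dev(const struct fake_pci_dev *dev)
{
	assert(dev != NULL);

	/* The Device 0, Function 0 DVSEC controls the CXL functionality. */
	if (dev->devfn != PCI_DEVFN(0, 0))
		return false;

	/* Drop the prog-if byte, then compare base class and sub-class. */
	if ((dev->class >> 8) != PCI_CLASS_MEMORY_CXL)
		return false;

	return true;
}
```

Checking only Function 0 also means the handler runs at most once per device for a given event, which is the limitation Robert describes above.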
* Re: [PATCH v3 5/6] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler 2023-04-17 16:54 ` Jonathan Cameron @ 2023-04-17 20:36 ` Robert Richter 0 siblings, 0 replies; 52+ messages in thread From: Robert Richter @ 2023-04-17 20:36 UTC (permalink / raw) To: Jonathan Cameron Cc: Terry Bowman, alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, linux-cxl, linux-kernel, bhelgaas, Oliver O'Halloran, Mahesh J Salgaonkar, linuxppc-dev, linux-pci Hi Jonathan, On 17.04.23 17:54:31, Jonathan Cameron wrote: > On Fri, 14 Apr 2023 16:35:05 +0200 > Robert Richter <rrichter@amd.com> wrote: > > > On 14.04.23 13:19:50, Jonathan Cameron wrote: > > > On Tue, 11 Apr 2023 13:03:01 -0500 > > > Terry Bowman <terry.bowman@amd.com> wrote: > > > > > > > From: Robert Richter <rrichter@amd.com> > > > > > > > > In Restricted CXL Device (RCD) mode a CXL device is exposed as an > > > > RCiEP, but CXL downstream and upstream ports are not enumerated and > > > > not visible in the PCIe hierarchy. Protocol and link errors are sent > > > > to an RCEC. > > > > > > > > Restricted CXL host (RCH) downstream port-detected errors are signaled > > > > as internal AER errors, either Uncorrectable Internal Error (UIE) or > > > > Corrected Internal Errors (CIE). The error source is the id of the > > > > RCEC. A CXL handler must then inspect the error status in various CXL > > > > registers residing in the dport's component register space (CXL RAS > > > > cap) or the dport's RCRB (AER ext cap). [1] > > > > > > > > Errors showing up in the RCEC's error handler must be handled and > > > > connected to the CXL subsystem. Implement this by forwarding the error > > > > to all CXL devices below the RCEC. Since the entire CXL device is > > > > controlled only using PCIe Configuration Space of device 0, Function > > > > 0, only pass it there [2]. 
These devices have the Memory Device class > > > > code set (PCI_CLASS_MEMORY_CXL, 502h) and the existing cxl_pci driver > > > > can implement the handler. > > > > > > This comment implies only class code compliant drivers. Sure we don't > > > have drivers for anything else yet, but we should try to avoid saying > > > there won't be any (which I think above implies). > > > > > > You have a comment in the code, but maybe relaxing the description above > > > to "currently support devices have..." > > > > It is used here to identify CXL memory devices and limit the > > enablement to those. The spec requires this to be set for CXL mem devs > > (see cxl 3.0, 8.1.12.2). > > > > There could be other CXL devices (e.g. cache), but other drivers are > > not yet implemented. That is what I am referring to. The check makes > > sure there is actually a driver with a handler for it (cxl_pci). > > Understood on intent. My worry is that the above can be read as a > statement on hardware restrictions, rather than on what software currently > implements. Meh. Minor point so I don't care that much! > Unlikely anyone will read the patch description after it merges anyway ;) I have updated the description ... > > > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > > > > index 7a25b62d9e01..171a08fd8ebd 100644 > > > > --- a/drivers/pci/pcie/aer.c > > > > +++ b/drivers/pci/pcie/aer.c > > > > @@ -946,6 +946,65 @@ static bool find_source_device(struct pci_dev *parent, > > > > return true; > > > > } > > > > > > > > +#ifdef CONFIG_PCIEAER_CXL > > > > + > > > > +static bool is_cxl_mem_dev(struct pci_dev *dev) > > > > +{ > > > > + /* > > > > + * A CXL device is controlled only using PCIe Configuration > > > > + * Space of device 0, Function 0. > > > > > > That's not true in general. Definitely true that CXL protocol > > > error reporting is controlled only using this Devfn, but > > > more generally there could be other stuff in later functions.
> > > So perhaps make the comment more specific. > > > > I actually mean CXL device in RCD mode here (seen as RCiEP in the PCI > > hierarchy). > > > > The spec says (cxl 3.0, 8.1.3): > > > > """ > > In either case [(RCD and non-RCD)], the capability, status, and > > control fields in Device 0, Function 0 DVSEC control the CXL > > functionality of the entire device. > > > """ > > > > So dev 0, func 0 must contain a CXL PCIe DVSEC. Thus it is a CXL > > device and able to handle CXL AER errors. The limitation to the first > > device prevents the handler from being run multiple times for the same > > event. > > Fine with limitation. Text says "device is controlled only using". > That is true for what you are controlling here, but other aspects of the > device are controlled via whatever interface they like. > > Perhaps just quote the specification as you have done in your reply. Then it > is clear that we mean just these registers. ... and comments. Thanks, -Robert
* RE: [PATCH v3 5/6] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler 2023-04-11 18:03 ` [PATCH v3 5/6] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler Terry Bowman 2023-04-12 22:02 ` Bjorn Helgaas 2023-04-14 12:19 ` Jonathan Cameron @ 2023-04-18 1:01 ` Dan Williams 2023-04-19 13:30 ` Robert Richter 2 siblings, 1 reply; 52+ messages in thread From: Dan Williams @ 2023-04-18 1:01 UTC (permalink / raw) To: Terry Bowman, alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, Jonathan.Cameron, linux-cxl Cc: terry.bowman, rrichter, linux-kernel, bhelgaas, Oliver O'Halloran, Mahesh J Salgaonkar, linuxppc-dev, linux-pci Terry Bowman wrote: > From: Robert Richter <rrichter@amd.com> > > In Restricted CXL Device (RCD) mode a CXL device is exposed as an > RCiEP, but CXL downstream and upstream ports are not enumerated and > not visible in the PCIe hierarchy. Protocol and link errors are sent > to an RCEC. > > Restricted CXL host (RCH) downstream port-detected errors are signaled > as internal AER errors, either Uncorrectable Internal Error (UIE) or > Corrected Internal Errors (CIE). The error source is the id of the > RCEC. A CXL handler must then inspect the error status in various CXL > registers residing in the dport's component register space (CXL RAS > cap) or the dport's RCRB (AER ext cap). [1] > > Errors showing up in the RCEC's error handler must be handled and > connected to the CXL subsystem. Implement this by forwarding the error > to all CXL devices below the RCEC. Since the entire CXL device is > controlled only using PCIe Configuration Space of device 0, Function > 0, only pass it there [2]. These devices have the Memory Device class > code set (PCI_CLASS_MEMORY_CXL, 502h) and the existing cxl_pci driver > can implement the handler. 
In addition to errors directed to the CXL > endpoint device, the handler must also inspect the CXL downstream > port's CXL RAS and PCIe AER external capabilities that is connected to > the device. > > Since CXL downstream port errors are signaled using internal errors, > the handler requires those errors to be unmasked. This is subject of a > follow-on patch. > > The reason for choosing this implementation is that a CXL RCEC device > is bound to the AER port driver, but the driver does not allow it to > register a custom specific handler to support CXL. Connecting the RCEC > hard-wired with a CXL handler does not work, as the CXL subsystem > might not be present all the time. The alternative to add an > implementation to the portdrv to allow the registration of a custom > RCEC error handler isn't worth doing it as CXL would be its only user. > Instead, just check for an CXL RCEC and pass it down to the connected > CXL device's error handler. With this approach the code can entirely > be implemented in the PCIe AER driver and is independent of the CXL > subsystem. The CXL driver only provides the handler. 
> > [1] CXL 3.0 spec, 12.2.1.1 RCH Downstream Port-detected Errors > [2] CXL 3.0 spec, 8.1.3 PCIe DVSEC for CXL Devices > > Co-developed-by: Terry Bowman <terry.bowman@amd.com> > Signed-off-by: Robert Richter <rrichter@amd.com> > Signed-off-by: Terry Bowman <terry.bowman@amd.com> > Cc: "Oliver O'Halloran" <oohall@gmail.com> > Cc: Bjorn Helgaas <bhelgaas@google.com> > Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com> > Cc: linuxppc-dev@lists.ozlabs.org > Cc: linux-pci@vger.kernel.org > --- > drivers/pci/pcie/Kconfig | 8 ++++++ > drivers/pci/pcie/aer.c | 61 ++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 69 insertions(+) > > diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig > index 228652a59f27..b0dbd864d3a3 100644 > --- a/drivers/pci/pcie/Kconfig > +++ b/drivers/pci/pcie/Kconfig > @@ -49,6 +49,14 @@ config PCIEAER_INJECT > gotten from: > https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/ > > +config PCIEAER_CXL > + bool "PCI Express CXL RAS support" > + default y > + depends on PCIEAER && CXL_PCI > + help > + This enables CXL error handling for Restricted CXL Hosts > + (RCHs). > + > # > # PCI Express ECRC > # > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > index 7a25b62d9e01..171a08fd8ebd 100644 > --- a/drivers/pci/pcie/aer.c > +++ b/drivers/pci/pcie/aer.c > @@ -946,6 +946,65 @@ static bool find_source_device(struct pci_dev *parent, > return true; > } > > +#ifdef CONFIG_PCIEAER_CXL > + > +static bool is_cxl_mem_dev(struct pci_dev *dev) > +{ > + /* > + * A CXL device is controlled only using PCIe Configuration > + * Space of device 0, Function 0. 
> + */ > + if (dev->devfn != PCI_DEVFN(0, 0)) > + return false; > + > + /* Right now there is only a CXL.mem driver */ > + if ((dev->class >> 8) != PCI_CLASS_MEMORY_CXL) > + return false; > + > + return true; > +} This part feels broken because most the errors of concern here are CXL link generic and that can involve CXL.cache and CXL.mem errors on devices that are not PCI_CLASS_MEMORY_CXL. This situation feels like it wants formal acknowledgement in 'struct pci_dev' that CXL links ride on top of PCIe links. If it were not for RCRBs then the PCI core could just do: dvsec = pci_find_dvsec_capability(pdev, PCI_DVSEC_VENDOR_ID_CXL, CXL_DVSEC_FLEXBUS_PORT); ...at bus scan time to identify devices with active CXL links. RCRBs unfortunately make it so the link presence can not be detected until a CXL driver is loaded to read that DVSEC out of MMIO space. However, I still think that looks like a CXL aware driver registering a 'struct cxl_link' (for lack of a better name) object with a corresponding PCI device. That link can indicate whether this is an RCH topology and whether it needs to do the RCEC walk, and that registration event can flag the RCEC as having CXL link duties to attend to on AER events. I suspect 'struct cxl_link' can also be used if/when we get to incorporating CXL Reset into PCI reset handling. > + > +static bool is_internal_error(struct aer_err_info *info) > +{ > + if (info->severity == AER_CORRECTABLE) > + return info->status & PCI_ERR_COR_INTERNAL; > + > + return info->status & PCI_ERR_UNC_INTN; > +} > + > +static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info); > + > +static int cxl_handle_error_iter(struct pci_dev *dev, void *data) > +{ > + struct aer_err_info *e_info = (struct aer_err_info *)data; > + > + if (!is_cxl_mem_dev(dev)) > + return 0; I assume this also needs to reference the RDPAS if present?
CXL 3.0 9.17.1.5 RCEC Downstream Port Association Structure (RDPAS) > + > + /* pci_dev_put() in handle_error_source() */ > + dev = pci_dev_get(dev); > + if (dev) > + handle_error_source(dev, e_info); I went looking but missed it: where does handle_error_source() synchronize against driver ->remove()? > + > + return 0; > +} > + > +static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info) Naming suggestion... Given that the VH topology does not require this scanning and association step, let's call this cxl_rch_handle_error() to make it clear this is only here to undo the awkwardness of CXL 1.1 platforms hiding registers from typical PCI scanning. A reference to: CXL 3.0 9.11.8 CXL Devices Attached to an RCH ...might be useful to a future reader that wonders why the CXL RCH case is so complicated from an AER perspective.
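To make the forwarding rule under review easier to follow, here is a small userspace sketch of it, using the cxl_rch_handle_error() name proposed above. It is an illustration under stated assumptions, not the kernel code: `struct fake_dev` and `struct fake_err_info` are hypothetical stand-ins for `struct pci_dev` and `struct aer_err_info`, a flat array replaces the devices that pcie_walk_rcec() would visit, and the constant values are assumed to match the kernel's PCI and AER headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match the kernel's PCI and AER header definitions. */
#define PCI_EXP_TYPE_RC_EC	0xa		/* Root Complex Event Collector */
#define PCI_ERR_UNC_INTN	0x00400000u	/* Uncorrectable Internal Error */
#define PCI_ERR_COR_INTERNAL	0x00004000u	/* Corrected Internal Error */
#define AER_CORRECTABLE		2

/* Hypothetical stand-in for struct aer_err_info. */
struct fake_err_info {
	int severity;
	uint32_t status;
};

/* Hypothetical stand-in for struct pci_dev. */
struct fake_dev {
	int pcie_type;		/* what pci_pcie_type() would return */
	bool cxl_mem;		/* outcome of an is_cxl_mem_dev() style filter */
	int errors_seen;	/* bumped each time an error is forwarded here */
};

static bool is_internal_error(const struct fake_err_info *info)
{
	if (info->severity == AER_CORRECTABLE)
		return (info->status & PCI_ERR_COR_INTERNAL) != 0;

	return (info->status & PCI_ERR_UNC_INTN) != 0;
}

/*
 * Forward an RCEC internal error to every matching device in 'rciep'.
 * Returns the number of devices the error was forwarded to.
 */
static int cxl_rch_handle_error(const struct fake_dev *rcec,
				struct fake_dev *rciep, size_t n,
				const struct fake_err_info *info)
{
	int forwarded = 0;

	assert(rcec != NULL && info != NULL);

	/* Only internal errors reported by an RCEC are of interest. */
	if (rcec->pcie_type != PCI_EXP_TYPE_RC_EC || !is_internal_error(info))
		return 0;

	for (size_t i = 0; i < n; i++) {
		if (!rciep[i].cxl_mem)
			continue;
		rciep[i].errors_seen++;
		forwarded++;
	}

	return forwarded;
}
```

In this model any error that is not an internal error, or that does not originate from an RCEC, leaves the device array untouched, which mirrors the early exit in the quoted cxl_handle_error().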
* Re: [PATCH v3 5/6] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler 2023-04-18 1:01 ` Dan Williams @ 2023-04-19 13:30 ` Robert Richter 0 siblings, 0 replies; 52+ messages in thread From: Robert Richter @ 2023-04-19 13:30 UTC (permalink / raw) To: Dan Williams Cc: Terry Bowman, alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dave.jiang, Jonathan.Cameron, linux-cxl, linux-kernel, bhelgaas, Oliver O'Halloran, Mahesh J Salgaonkar, linuxppc-dev, linux-pci Dan, thanks for review, see comments inline. On 17.04.23 18:01:41, Dan Williams wrote: > Terry Bowman wrote: > > From: Robert Richter <rrichter@amd.com> > > > > In Restricted CXL Device (RCD) mode a CXL device is exposed as an > > RCiEP, but CXL downstream and upstream ports are not enumerated and > > not visible in the PCIe hierarchy. Protocol and link errors are sent > > to an RCEC. > > > > Restricted CXL host (RCH) downstream port-detected errors are signaled > > as internal AER errors, either Uncorrectable Internal Error (UIE) or > > Corrected Internal Errors (CIE). The error source is the id of the > > RCEC. A CXL handler must then inspect the error status in various CXL > > registers residing in the dport's component register space (CXL RAS > > cap) or the dport's RCRB (AER ext cap). [1] > > > > Errors showing up in the RCEC's error handler must be handled and > > connected to the CXL subsystem. Implement this by forwarding the error > > to all CXL devices below the RCEC. Since the entire CXL device is > > controlled only using PCIe Configuration Space of device 0, Function > > 0, only pass it there [2]. These devices have the Memory Device class > > code set (PCI_CLASS_MEMORY_CXL, 502h) and the existing cxl_pci driver > > can implement the handler. In addition to errors directed to the CXL > > endpoint device, the handler must also inspect the CXL downstream > > port's CXL RAS and PCIe AER external capabilities that is connected to > > the device. 
> > > > Since CXL downstream port errors are signaled using internal errors, > > the handler requires those errors to be unmasked. This is subject of a > > follow-on patch. > > > > The reason for choosing this implementation is that a CXL RCEC device > > is bound to the AER port driver, but the driver does not allow it to > > register a custom specific handler to support CXL. Connecting the RCEC > > hard-wired with a CXL handler does not work, as the CXL subsystem > > might not be present all the time. The alternative to add an > > implementation to the portdrv to allow the registration of a custom > > RCEC error handler isn't worth doing it as CXL would be its only user. > > Instead, just check for an CXL RCEC and pass it down to the connected > > CXL device's error handler. With this approach the code can entirely > > be implemented in the PCIe AER driver and is independent of the CXL > > subsystem. The CXL driver only provides the handler. > > > > [1] CXL 3.0 spec, 12.2.1.1 RCH Downstream Port-detected Errors > > [2] CXL 3.0 spec, 8.1.3 PCIe DVSEC for CXL Devices > > > > Co-developed-by: Terry Bowman <terry.bowman@amd.com> > > Signed-off-by: Robert Richter <rrichter@amd.com> > > Signed-off-by: Terry Bowman <terry.bowman@amd.com> > > Cc: "Oliver O'Halloran" <oohall@gmail.com> > > Cc: Bjorn Helgaas <bhelgaas@google.com> > > Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com> > > Cc: linuxppc-dev@lists.ozlabs.org > > Cc: linux-pci@vger.kernel.org > > --- > > drivers/pci/pcie/Kconfig | 8 ++++++ > > drivers/pci/pcie/aer.c | 61 ++++++++++++++++++++++++++++++++++++++++ > > 2 files changed, 69 insertions(+) > > > > diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig > > index 228652a59f27..b0dbd864d3a3 100644 > > --- a/drivers/pci/pcie/Kconfig > > +++ b/drivers/pci/pcie/Kconfig > > @@ -49,6 +49,14 @@ config PCIEAER_INJECT > > gotten from: > > https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/ > > > > +config PCIEAER_CXL > > + bool "PCI 
Express CXL RAS support" > > + default y > > + depends on PCIEAER && CXL_PCI > > + help > > + This enables CXL error handling for Restricted CXL Hosts > > + (RCHs). > > + > > # > > # PCI Express ECRC > > # > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > > index 7a25b62d9e01..171a08fd8ebd 100644 > > --- a/drivers/pci/pcie/aer.c > > +++ b/drivers/pci/pcie/aer.c > > @@ -946,6 +946,65 @@ static bool find_source_device(struct pci_dev *parent, > > return true; > > } > > > > +#ifdef CONFIG_PCIEAER_CXL > > + > > +static bool is_cxl_mem_dev(struct pci_dev *dev) > > +{ > > + /* > > + * A CXL device is controlled only using PCIe Configuration > > + * Space of device 0, Function 0. > > + */ > > + if (dev->devfn != PCI_DEVFN(0, 0)) > > + return false; > > + > > + /* Right now there is only a CXL.mem driver */ > > + if ((dev->class >> 8) != PCI_CLASS_MEMORY_CXL) > > + return false; > > + > > + return true; > > +} > > This part feels broken because most the errors of concern here are CXL > link generic and that can involve CXL.cache and CXL.mem errors on > devices that are not PCI_CLASS_MEMORY_CXL. This situation feels like it > wants formal acknowledgement in 'struct pci_dev' that CXL links ride on > top of PCIe links. There is already rcec->rcec_ea that holds the RCEC-to-endpoint association. Determining if the RCiEP is a CXL dev is a small check which is exactly what is_cxl_mem_dev() is for. I don't see a benefit in holding the same information in an additional cxl_link structure. And as you also said below, for RCRB handling a CXL driver is needed which is why is_cxl_mem_dev() with the class check is used below. > > If it were not for RCRBs then the PCI core could just do: > > dvsec = pci_find_dvsec_capability(pdev, PCI_DVSEC_VENDOR_ID_CXL, > CXL_DVSEC_FLEXBUS_PORT); > > ...at bus scan time to identify devices with active CXL links. 
RCRBs > unfortunately make it so the link presence can not be detected until a > CXL driver is loaded to read that DVSEC out of MMIO space. In a VH topology those errors can be handled directly in a PCI driver for CXL ports; if the portdrv handles that, the check could be useful. But this is not the subject of this patch series. > > However, I still think that looks like a CXL aware driver registering a > 'struct cxl_link' (for lack of a better name) object with a > corresponding PCI device. That link can indicate whether this is an RCH > topology and whether it needs to do the RCEC walk, and that registration > event can flag the RCEC as having CXL link duties to attend to on AER > events. For CXL awareness of the AER driver the simple checks from above could be used, either called directly for the pci_dev (VH mode), or by walking the RCEC. IMO, a 'struct cxl_link' and a function to register it are not really needed here. > > I suspect 'struct cxl_link' can also be used if/when we get to > incorporating CXL Reset into PCI reset handling. > > > + > > +static bool is_internal_error(struct aer_err_info *info) > > +{ > > + if (info->severity == AER_CORRECTABLE) > > + return info->status & PCI_ERR_COR_INTERNAL; > > + > > + return info->status & PCI_ERR_UNC_INTN; > > +} > > + > > +static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info); > > + > > +static int cxl_handle_error_iter(struct pci_dev *dev, void *data) > > +{ > > + struct aer_err_info *e_info = (struct aer_err_info *)data; > > + > > + if (!is_cxl_mem_dev(dev)) > > + return 0; > > > I assume this also needs to reference the RDPAS if present? That is subject of a follow-on patch. Here I see why you may need a struct cxl_link. But that list must not reside in the pci_dev; instead, a CXL-aware driver can look up a self-maintained list of RDPAS mappings (RCEC-to-Downstream Port associations) to decide whether to look up the dport's AER and RAS capabilities.
> > CXL 3.0 9.17.1.5 RCEC Downstream Port Association Structure (RDPAS) > > > + > > + /* pci_dev_put() in handle_error_source() */ > > + dev = pci_dev_get(dev); > > + if (dev) > > + handle_error_source(dev, e_info); > > I went looking but missed where does handle_error_source() synchronize > against driver ->remove()? Right, the device_lock() is missing in handle_error_source() while accessing pdrv and calling the handler. Will send a fix. > > > + > > + return 0; > > +} > > + > > +static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info) > > Naming suggestion... > > Given that the VH topology does not require this scanning and > assoication step, lets call this cxl_rch_handle_error() to make it clear > this is only here to undo the awkwardness of CXL 1.1 platforms hiding > registers from typical PCI scanning. A reference to: > > CXL 3.0 9.11.8 CXL Devices Attached to an RCH > > ...might be useful to a future reader that wonders why the CXL RCH case > is so complicated from an AER perspective. Ok. Thanks, -Robert ^ permalink raw reply [flat|nested] 52+ messages in thread
* [PATCH v3 6/6] PCI/AER: Unmask RCEC internal errors to enable RCH downstream port error handling 2023-04-11 18:02 [PATCH v3 0/6] cxl/pci: Add support for RCH RAS error handling Terry Bowman ` (4 preceding siblings ...) 2023-04-11 18:03 ` [PATCH v3 5/6] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler Terry Bowman @ 2023-04-11 18:03 ` Terry Bowman 2023-04-12 21:29 ` Bjorn Helgaas 2023-04-18 2:37 ` Dan Williams 5 siblings, 2 replies; 52+ messages in thread From: Terry Bowman @ 2023-04-11 18:03 UTC (permalink / raw) To: alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, Jonathan.Cameron, linux-cxl Cc: terry.bowman, rrichter, linux-kernel, bhelgaas, Oliver O'Halloran, Mahesh J Salgaonkar, linuxppc-dev, linux-pci From: Robert Richter <rrichter@amd.com> RCEC AER corrected and uncorrectable internal errors (CIE/UIE) are disabled by default. [1][2] Enable them to receive CXL downstream port errors of a Restricted CXL Host (RCH). [1] CXL 3.0 Spec, 12.2.1.1 - RCH Downstream Port Detected Errors [2] PCIe Base Spec 6.0, 7.8.4.3 Uncorrectable Error Mask Register, 7.8.4.6 Correctable Error Mask Register Co-developed-by: Terry Bowman <terry.bowman@amd.com> Signed-off-by: Robert Richter <rrichter@amd.com> Signed-off-by: Terry Bowman <terry.bowman@amd.com> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com> Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-pci@vger.kernel.org --- drivers/pci/pcie/aer.c | 73 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 73 insertions(+) diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 171a08fd8ebd..3973c731e11d 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -1000,7 +1000,79 @@ static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info) pcie_walk_rcec(dev, cxl_handle_error_iter, info); } +static bool cxl_error_is_native(struct pci_dev *dev) +{ + 
struct pci_host_bridge *host = pci_find_host_bridge(dev->bus); + + if (pcie_ports_native) + return true; + + return host->native_aer && host->native_cxl_error; +} + +static int handles_cxl_error_iter(struct pci_dev *dev, void *data) +{ + int *handles_cxl = data; + + *handles_cxl = is_cxl_mem_dev(dev) && cxl_error_is_native(dev); + + return *handles_cxl; +} + +static bool handles_cxl_errors(struct pci_dev *rcec) +{ + int handles_cxl = 0; + + if (!rcec->aer_cap) + return false; + + if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC) + pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl); + + return !!handles_cxl; +} + +static int __cxl_unmask_internal_errors(struct pci_dev *rcec) +{ + int aer, rc; + u32 mask; + + /* + * Internal errors are masked by default, unmask RCEC's here + * PCI6.0 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h) + * PCI6.0 7.8.4.6 Correctable Error Mask Register (Offset 14h) + */ + aer = rcec->aer_cap; + rc = pci_read_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, &mask); + if (rc) + return rc; + mask &= ~PCI_ERR_UNC_INTN; + rc = pci_write_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, mask); + if (rc) + return rc; + + rc = pci_read_config_dword(rcec, aer + PCI_ERR_COR_MASK, &mask); + if (rc) + return rc; + mask &= ~PCI_ERR_COR_INTERNAL; + rc = pci_write_config_dword(rcec, aer + PCI_ERR_COR_MASK, mask); + + return rc; +} + +static void cxl_unmask_internal_errors(struct pci_dev *rcec) +{ + if (!handles_cxl_errors(rcec)) + return; + + if (__cxl_unmask_internal_errors(rcec)) + dev_err(&rcec->dev, "cxl: Failed to unmask internal errors"); + else + dev_dbg(&rcec->dev, "cxl: Internal errors unmasked"); +} + #else +static inline void cxl_unmask_internal_errors(struct pci_dev *dev) { } static inline void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info) { } #endif @@ -1397,6 +1469,7 @@ static int aer_probe(struct pcie_device *dev) return status; } + cxl_unmask_internal_errors(port); aer_enable_rootport(rpc); pci_info(port, "enabled 
with IRQ %d\n", dev->irq); return 0; -- 2.34.1 ^ permalink raw reply related [flat|nested] 52+ messages in thread
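The read-modify-write sequence in __cxl_unmask_internal_errors() can be modelled in userspace. What follows is a minimal sketch under stated assumptions, not kernel code: struct fake_aer_cap and the fake_read_dword()/fake_write_dword() accessors are hypothetical stand-ins for the AER capability and for pci_read_config_dword()/pci_write_config_dword(). The register offsets are the ones cited in the patch comment (08h and 14h); the bit values are assumed to match include/uapi/linux/pci_regs.h. Error returns from config space access are left out of the model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offsets from the patch comment; bit values assumed from pci_regs.h. */
#define PCI_ERR_UNCOR_MASK   0x08        /* Uncorrectable Error Mask */
#define PCI_ERR_UNC_INTN     0x00400000  /* Uncorrectable Internal Error */
#define PCI_ERR_COR_MASK     0x14        /* Correctable Error Mask */
#define PCI_ERR_COR_INTERNAL 0x00004000  /* Corrected Internal Error */

/* Hypothetical model of an AER capability as an array of dwords. */
struct fake_aer_cap {
	uint32_t regs[16];
};

/* Hypothetical stand-in for pci_read_config_dword(). */
static uint32_t fake_read_dword(const struct fake_aer_cap *cap, unsigned int off)
{
	return cap->regs[off / 4];
}

/* Hypothetical stand-in for pci_write_config_dword(). */
static void fake_write_dword(struct fake_aer_cap *cap, unsigned int off,
			     uint32_t val)
{
	cap->regs[off / 4] = val;
}

/* Clear only the internal-error mask bits; every other mask bit is kept. */
static void unmask_internal_errors(struct fake_aer_cap *cap)
{
	uint32_t mask;

	assert(cap != NULL);

	mask = fake_read_dword(cap, PCI_ERR_UNCOR_MASK);
	mask &= ~(uint32_t)PCI_ERR_UNC_INTN;
	fake_write_dword(cap, PCI_ERR_UNCOR_MASK, mask);

	mask = fake_read_dword(cap, PCI_ERR_COR_MASK);
	mask &= ~(uint32_t)PCI_ERR_COR_INTERNAL;
	fake_write_dword(cap, PCI_ERR_COR_MASK, mask);
}
```

The point of the read-modify-write is that mask bits set by firmware or by other kernel code for unrelated error types survive; only the two internal-error bits change.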
* Re: [PATCH v3 6/6] PCI/AER: Unmask RCEC internal errors to enable RCH downstream port error handling 2023-04-11 18:03 ` [PATCH v3 6/6] PCI/AER: Unmask RCEC internal errors to enable RCH downstream port error handling Terry Bowman @ 2023-04-12 21:29 ` Bjorn Helgaas 2023-04-13 13:38 ` Robert Richter 2023-04-13 17:01 ` Jonathan Cameron 2023-04-18 2:37 ` Dan Williams 1 sibling, 2 replies; 52+ messages in thread From: Bjorn Helgaas @ 2023-04-12 21:29 UTC (permalink / raw) To: Terry Bowman Cc: alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, Jonathan.Cameron, linux-cxl, rrichter, linux-kernel, bhelgaas, Oliver O'Halloran, Mahesh J Salgaonkar, linuxppc-dev, linux-pci On Tue, Apr 11, 2023 at 01:03:02PM -0500, Terry Bowman wrote: > From: Robert Richter <rrichter@amd.com> > > RCEC AER corrected and uncorrectable internal errors (CIE/UIE) are > disabled by default. "Disabled by default" just means "the power-up state of CIE/UIC is that they are masked", right? It doesn't mean that Linux normally masks them. > [1][2] Enable them to receive CXL downstream port > errors of a Restricted CXL Host (RCH). 
> > [1] CXL 3.0 Spec, 12.2.1.1 - RCH Downstream Port Detected Errors > [2] PCIe Base Spec 6.0, 7.8.4.3 Uncorrectable Error Mask Register, > 7.8.4.6 Correctable Error Mask Register > > Co-developed-by: Terry Bowman <terry.bowman@amd.com> > Signed-off-by: Robert Richter <rrichter@amd.com> > Signed-off-by: Terry Bowman <terry.bowman@amd.com> > Cc: "Oliver O'Halloran" <oohall@gmail.com> > Cc: Bjorn Helgaas <bhelgaas@google.com> > Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com> > Cc: linuxppc-dev@lists.ozlabs.org > Cc: linux-pci@vger.kernel.org > --- > drivers/pci/pcie/aer.c | 73 ++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 73 insertions(+) > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > index 171a08fd8ebd..3973c731e11d 100644 > --- a/drivers/pci/pcie/aer.c > +++ b/drivers/pci/pcie/aer.c > @@ -1000,7 +1000,79 @@ static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info) > pcie_walk_rcec(dev, cxl_handle_error_iter, info); > } > > +static bool cxl_error_is_native(struct pci_dev *dev) > +{ > + struct pci_host_bridge *host = pci_find_host_bridge(dev->bus); > + > + if (pcie_ports_native) > + return true; > + > + return host->native_aer && host->native_cxl_error; > +} > + > +static int handles_cxl_error_iter(struct pci_dev *dev, void *data) > +{ > + int *handles_cxl = data; > + > + *handles_cxl = is_cxl_mem_dev(dev) && cxl_error_is_native(dev); > + > + return *handles_cxl; > +} > + > +static bool handles_cxl_errors(struct pci_dev *rcec) > +{ > + int handles_cxl = 0; > + > + if (!rcec->aer_cap) > + return false; > + > + if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC) > + pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl); > + > + return !!handles_cxl; > +} > + > +static int __cxl_unmask_internal_errors(struct pci_dev *rcec) > +{ > + int aer, rc; > + u32 mask; > + > + /* > + * Internal errors are masked by default, unmask RCEC's here > + * PCI6.0 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h) > + * PCI6.0 
7.8.4.6 Correctable Error Mask Register (Offset 14h) > + */ Unmasking internal errors doesn't have anything specific to do with CXL, so I don't think it should have "cxl" in the function name. Maybe something like "pci_aer_unmask_internal_errors()". This also has nothing special to do with RCECs, so I think we should refer to the device as "dev" as is typical in this file. I think this needs to check pcie_aer_is_native() as is done by pci_aer_clear_nonfatal_status() and other functions that write the AER Capability. With the exception of this function, this patch looks like all CXL code that maybe could be with other CXL code. Would require making pcie_walk_rcec() available outside drivers/pci, I guess. > + aer = rcec->aer_cap; > + rc = pci_read_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, &mask); > + if (rc) > + return rc; > + mask &= ~PCI_ERR_UNC_INTN; > + rc = pci_write_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, mask); > + if (rc) > + return rc; > + > + rc = pci_read_config_dword(rcec, aer + PCI_ERR_COR_MASK, &mask); > + if (rc) > + return rc; > + mask &= ~PCI_ERR_COR_INTERNAL; > + rc = pci_write_config_dword(rcec, aer + PCI_ERR_COR_MASK, mask); > + > + return rc; > +} > + > +static void cxl_unmask_internal_errors(struct pci_dev *rcec) > +{ > + if (!handles_cxl_errors(rcec)) > + return; > + > + if (__cxl_unmask_internal_errors(rcec)) > + dev_err(&rcec->dev, "cxl: Failed to unmask internal errors"); > + else > + dev_dbg(&rcec->dev, "cxl: Internal errors unmasked"); > +} > + > #else > +static inline void cxl_unmask_internal_errors(struct pci_dev *dev) { } > static inline void cxl_handle_error(struct pci_dev *dev, > struct aer_err_info *info) { } > #endif > @@ -1397,6 +1469,7 @@ static int aer_probe(struct pcie_device *dev) > return status; > } > > + cxl_unmask_internal_errors(port); > aer_enable_rootport(rpc); > pci_info(port, "enabled with IRQ %d\n", dev->irq); > return 0; > -- > 2.34.1 > ^ permalink raw reply [flat|nested] 52+ messages in thread
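The ownership question raised in this review (only write the AER Capability when the OS owns it) follows the same pattern the patch already applies to the endpoints in cxl_error_is_native(). A minimal userspace sketch of that decision, assuming nothing beyond the logic quoted above: struct fake_host_bridge is a hypothetical stand-in for the two struct pci_host_bridge fields the patch reads, and the ports_native parameter models the pcie_ports_native global. The kernel's pcie_aer_is_native() itself is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the struct pci_host_bridge fields used by the patch. */
struct fake_host_bridge {
	bool native_aer;        /* OS was granted control of AER */
	bool native_cxl_error;  /* OS was granted control of CXL error reporting */
};

/*
 * Mirrors the logic of cxl_error_is_native() from the patch.
 * ports_native models the pcie_ports_native override.
 */
static bool cxl_error_is_native(bool ports_native,
				const struct fake_host_bridge *host)
{
	assert(host != NULL);

	/* The override forces native handling regardless of what was granted. */
	if (ports_native)
		return true;

	/* Otherwise both AER and CXL error control must be OS-owned. */
	return host->native_aer && host->native_cxl_error;
}
```

With this shape, a caller would skip the mask register writes entirely whenever the function returns false, leaving firmware-owned registers untouched.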
* Re: [PATCH v3 6/6] PCI/AER: Unmask RCEC internal errors to enable RCH downstream port error handling 2023-04-12 21:29 ` Bjorn Helgaas @ 2023-04-13 13:38 ` Robert Richter 2023-04-13 17:05 ` Jonathan Cameron 2023-04-14 21:49 ` Bjorn Helgaas 2023-04-13 17:01 ` Jonathan Cameron 1 sibling, 2 replies; 52+ messages in thread From: Robert Richter @ 2023-04-13 13:38 UTC (permalink / raw) To: Bjorn Helgaas Cc: Terry Bowman, alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, Jonathan.Cameron, linux-cxl, linux-kernel, bhelgaas, Oliver O'Halloran, Mahesh J Salgaonkar, linuxppc-dev, linux-pci On 12.04.23 16:29:01, Bjorn Helgaas wrote: > On Tue, Apr 11, 2023 at 01:03:02PM -0500, Terry Bowman wrote: > > From: Robert Richter <rrichter@amd.com> > > > > RCEC AER corrected and uncorrectable internal errors (CIE/UIE) are > > disabled by default. > > "Disabled by default" just means "the power-up state of CIE/UIC is > that they are masked", right? It doesn't mean that Linux normally > masks them. Yes, will change the wording here. > > [1][2] Enable them to receive CXL downstream port > > errors of a Restricted CXL Host (RCH). 
> > > > [1] CXL 3.0 Spec, 12.2.1.1 - RCH Downstream Port Detected Errors > > [2] PCIe Base Spec 6.0, 7.8.4.3 Uncorrectable Error Mask Register, > > 7.8.4.6 Correctable Error Mask Register > > > > Co-developed-by: Terry Bowman <terry.bowman@amd.com> > > Signed-off-by: Robert Richter <rrichter@amd.com> > > Signed-off-by: Terry Bowman <terry.bowman@amd.com> > > Cc: "Oliver O'Halloran" <oohall@gmail.com> > > Cc: Bjorn Helgaas <bhelgaas@google.com> > > Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com> > > Cc: linuxppc-dev@lists.ozlabs.org > > Cc: linux-pci@vger.kernel.org > > --- > > drivers/pci/pcie/aer.c | 73 ++++++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 73 insertions(+) > > > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > > index 171a08fd8ebd..3973c731e11d 100644 > > --- a/drivers/pci/pcie/aer.c > > +++ b/drivers/pci/pcie/aer.c > > @@ -1000,7 +1000,79 @@ static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info) > > pcie_walk_rcec(dev, cxl_handle_error_iter, info); > > } > > > > +static bool cxl_error_is_native(struct pci_dev *dev) > > +{ > > + struct pci_host_bridge *host = pci_find_host_bridge(dev->bus); > > + > > + if (pcie_ports_native) > > + return true; > > + > > + return host->native_aer && host->native_cxl_error; > > +} > > + > > +static int handles_cxl_error_iter(struct pci_dev *dev, void *data) > > +{ > > + int *handles_cxl = data; > > + > > + *handles_cxl = is_cxl_mem_dev(dev) && cxl_error_is_native(dev); > > + > > + return *handles_cxl; > > +} > > + > > +static bool handles_cxl_errors(struct pci_dev *rcec) > > +{ > > + int handles_cxl = 0; > > + > > + if (!rcec->aer_cap) > > + return false; > > + > > + if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC) > > + pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl); > > + > > + return !!handles_cxl; > > +} > > + > > +static int __cxl_unmask_internal_errors(struct pci_dev *rcec) > > +{ > > + int aer, rc; > > + u32 mask; > > + > > + /* > > + * Internal errors 
are masked by default, unmask RCEC's here > > + * PCI6.0 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h) > > + * PCI6.0 7.8.4.6 Correctable Error Mask Register (Offset 14h) > > + */ > > Unmasking internal errors doesn't have anything specific to do with > CXL, so I don't think it should have "cxl" in the function name. > Maybe something like "pci_aer_unmask_internal_errors()". Since it is static I renamed it to aer_unmask_internal_errors() and also moved it to the beginning of the #ifdef block for easier later reuse. > > This also has nothing special to do with RCECs, so I think we should > refer to the device as "dev" as is typical in this file. Changed. > > I think this needs to check pcie_aer_is_native() as is done by > pci_aer_clear_nonfatal_status() and other functions that write the AER > Capability. Also added the check to aer_unmask_internal_errors(). There was a check for native_* in handles_cxl_errors() already, but only for the pci devs of the RCEC. I added a check of the RCEC there too. > > With the exception of this function, this patch looks like all CXL > code that maybe could be with other CXL code. Would require making > pcie_walk_rcec() available outside drivers/pci, I guess. Even this is CXL code, it implements AER support and fits better here around AER code. Export of pcie_walk_rcec() (and others?) is not the main issue here. CXL drivers can come as modules and would need to register a hook at the aer handler. This would add even more complexity here. In contrast, current solution just adds two functions for enablement and handling which are empty stubs if code is disabled. I could move that code to aer_cxl.c similar to aer_inject.c. Since the CXL part is small compared to the remaining aer code I left it in aer.c. Also, it is guarded by #ifdef which additionally encapsulates it. 
> > > + aer = rcec->aer_cap; > > + rc = pci_read_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, &mask); > > + if (rc) > > + return rc; > > + mask &= ~PCI_ERR_UNC_INTN; > > + rc = pci_write_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, mask); > > + if (rc) > > + return rc; > > + > > + rc = pci_read_config_dword(rcec, aer + PCI_ERR_COR_MASK, &mask); > > + if (rc) > > + return rc; > > + mask &= ~PCI_ERR_COR_INTERNAL; > > + rc = pci_write_config_dword(rcec, aer + PCI_ERR_COR_MASK, mask); > > + > > + return rc; > > +} > > + > > +static void cxl_unmask_internal_errors(struct pci_dev *rcec) Also renaming this to cxl_enable_rcec() to more generalize the function. > > +{ > > + if (!handles_cxl_errors(rcec)) > > + return; > > + > > + if (__cxl_unmask_internal_errors(rcec)) > > + dev_err(&rcec->dev, "cxl: Failed to unmask internal errors"); > > + else > > + dev_dbg(&rcec->dev, "cxl: Internal errors unmasked"); I am going to change this to a pci_info() for alignment with other messages around: [ 14.200265] pcieport 0000:40:00.3: PME: Signaling with IRQ 44 [ 14.213925] pcieport 0000:40:00.3: AER: cxl: Internal errors unmasked [ 14.228413] pcieport 0000:40:00.3: AER: enabled with IRQ 44 Plus, using pci_err() instead of dev_err(). > > +} > > + > > #else > > +static inline void cxl_unmask_internal_errors(struct pci_dev *dev) { } > > static inline void cxl_handle_error(struct pci_dev *dev, > > struct aer_err_info *info) { } > > #endif > > @@ -1397,6 +1469,7 @@ static int aer_probe(struct pcie_device *dev) > > return status; > > } > > > > + cxl_unmask_internal_errors(port); > > aer_enable_rootport(rpc); > > pci_info(port, "enabled with IRQ %d\n", dev->irq); > > return 0; > > -- > > 2.34.1 > > Thanks for review, -Robert ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 6/6] PCI/AER: Unmask RCEC internal errors to enable RCH downstream port error handling 2023-04-13 13:38 ` Robert Richter @ 2023-04-13 17:05 ` Jonathan Cameron 2023-04-14 11:58 ` Robert Richter 2023-04-14 21:49 ` Bjorn Helgaas 1 sibling, 1 reply; 52+ messages in thread From: Jonathan Cameron @ 2023-04-13 17:05 UTC (permalink / raw) To: Robert Richter Cc: Bjorn Helgaas, Terry Bowman, alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, linux-cxl, linux-kernel, bhelgaas, Oliver O'Halloran, Mahesh J Salgaonkar, linuxppc-dev, linux-pci On Thu, 13 Apr 2023 15:38:07 +0200 Robert Richter <rrichter@amd.com> wrote: > On 12.04.23 16:29:01, Bjorn Helgaas wrote: > > On Tue, Apr 11, 2023 at 01:03:02PM -0500, Terry Bowman wrote: > > > From: Robert Richter <rrichter@amd.com> > > > > > > RCEC AER corrected and uncorrectable internal errors (CIE/UIE) are > > > disabled by default. > > > > "Disabled by default" just means "the power-up state of CIE/UIC is > > that they are masked", right? It doesn't mean that Linux normally > > masks them. > > Yes, will change the wording here. > > > > [1][2] Enable them to receive CXL downstream port > > > errors of a Restricted CXL Host (RCH). 
> > > > > > [1] CXL 3.0 Spec, 12.2.1.1 - RCH Downstream Port Detected Errors > > > [2] PCIe Base Spec 6.0, 7.8.4.3 Uncorrectable Error Mask Register, > > > 7.8.4.6 Correctable Error Mask Register > > > > > > Co-developed-by: Terry Bowman <terry.bowman@amd.com> > > > Signed-off-by: Robert Richter <rrichter@amd.com> > > > Signed-off-by: Terry Bowman <terry.bowman@amd.com> > > > Cc: "Oliver O'Halloran" <oohall@gmail.com> > > > Cc: Bjorn Helgaas <bhelgaas@google.com> > > > Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com> > > > Cc: linuxppc-dev@lists.ozlabs.org > > > Cc: linux-pci@vger.kernel.org > > > --- > > > drivers/pci/pcie/aer.c | 73 ++++++++++++++++++++++++++++++++++++++++++ > > > 1 file changed, 73 insertions(+) > > > > > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > > > index 171a08fd8ebd..3973c731e11d 100644 > > > --- a/drivers/pci/pcie/aer.c > > > +++ b/drivers/pci/pcie/aer.c > > > @@ -1000,7 +1000,79 @@ static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info) > > > pcie_walk_rcec(dev, cxl_handle_error_iter, info); > > > } > > > > > > +static bool cxl_error_is_native(struct pci_dev *dev) > > > +{ > > > + struct pci_host_bridge *host = pci_find_host_bridge(dev->bus); > > > + > > > + if (pcie_ports_native) > > > + return true; > > > + > > > + return host->native_aer && host->native_cxl_error; > > > +} > > > + > > > +static int handles_cxl_error_iter(struct pci_dev *dev, void *data) > > > +{ > > > + int *handles_cxl = data; > > > + > > > + *handles_cxl = is_cxl_mem_dev(dev) && cxl_error_is_native(dev); > > > + > > > + return *handles_cxl; > > > +} > > > + > > > +static bool handles_cxl_errors(struct pci_dev *rcec) > > > +{ > > > + int handles_cxl = 0; > > > + > > > + if (!rcec->aer_cap) > > > + return false; > > > + > > > + if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC) > > > + pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl); > > > + > > > + return !!handles_cxl; > > > +} > > > + > > > +static int 
__cxl_unmask_internal_errors(struct pci_dev *rcec) > > > +{ > > > + int aer, rc; > > > + u32 mask; > > > + > > > + /* > > > + * Internal errors are masked by default, unmask RCEC's here > > > + * PCI6.0 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h) > > > + * PCI6.0 7.8.4.6 Correctable Error Mask Register (Offset 14h) > > > + */ > > > > Unmasking internal errors doesn't have anything specific to do with > > CXL, so I don't think it should have "cxl" in the function name. > > Maybe something like "pci_aer_unmask_internal_errors()". > > Since it is static I renamed it to aer_unmask_internal_errors() and > also moved it to the beginning of the #ifdef block for easier later > reuse. > > > > > This also has nothing special to do with RCECs, so I think we should > > refer to the device as "dev" as is typical in this file. > > Changed. > > > > > I think this needs to check pcie_aer_is_native() as is done by > > pci_aer_clear_nonfatal_status() and other functions that write the AER > > Capability. > > Also added the check to aer_unmask_internal_errors(). There was a > check for native_* in handles_cxl_errors() already, but only for the > pci devs of the RCEC. I added a check of the RCEC there too. > > > > > With the exception of this function, this patch looks like all CXL > > code that maybe could be with other CXL code. Would require making > > pcie_walk_rcec() available outside drivers/pci, I guess. > > Even this is CXL code, it implements AER support and fits better here > around AER code. Export of pcie_walk_rcec() (and others?) is not the > main issue here. CXL drivers can come as modules and would need to > register a hook at the aer handler. This would add even more > complexity here. In contrast, current solution just adds two functions > for enablement and handling which are empty stubs if code is disabled. > > I could move that code to aer_cxl.c similar to aer_inject.c. 
Since the > CXL part is small compared to the remaining aer code I left it in > aer.c. Also, it is guarded by #ifdef which additionally encapsulates > it. > To throw another option in there (what Bjorn suggested IIRC for the more general case..) Just enable internal errors always. No need to know if they are CXL or something else. There will/might be fallout and it will be fun. Jonathan > > > > > + aer = rcec->aer_cap; > > > + rc = pci_read_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, &mask); > > > + if (rc) > > > + return rc; > > > + mask &= ~PCI_ERR_UNC_INTN; > > > + rc = pci_write_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, mask); > > > + if (rc) > > > + return rc; > > > + > > > + rc = pci_read_config_dword(rcec, aer + PCI_ERR_COR_MASK, &mask); > > > + if (rc) > > > + return rc; > > > + mask &= ~PCI_ERR_COR_INTERNAL; > > > + rc = pci_write_config_dword(rcec, aer + PCI_ERR_COR_MASK, mask); > > > + > > > + return rc; > > > +} > > > + > > > +static void cxl_unmask_internal_errors(struct pci_dev *rcec) > > Also renaming this to cxl_enable_rcec() to more generalize the > function. > > > > +{ > > > + if (!handles_cxl_errors(rcec)) > > > + return; > > > + > > > + if (__cxl_unmask_internal_errors(rcec)) > > > + dev_err(&rcec->dev, "cxl: Failed to unmask internal errors"); > > > + else > > > + dev_dbg(&rcec->dev, "cxl: Internal errors unmasked"); > > I am going to change this to a pci_info() for alignment with other > messages around: > > [ 14.200265] pcieport 0000:40:00.3: PME: Signaling with IRQ 44 > [ 14.213925] pcieport 0000:40:00.3: AER: cxl: Internal errors unmasked > [ 14.228413] pcieport 0000:40:00.3: AER: enabled with IRQ 44 > > Plus, using pci_err() instead of dev_err(). 
> > > > +} > > > + > > > #else > > > +static inline void cxl_unmask_internal_errors(struct pci_dev *dev) { } > > > static inline void cxl_handle_error(struct pci_dev *dev, > > > struct aer_err_info *info) { } > > > #endif > > > @@ -1397,6 +1469,7 @@ static int aer_probe(struct pcie_device *dev) > > > return status; > > > } > > > > > > + cxl_unmask_internal_errors(port); > > > aer_enable_rootport(rpc); > > > pci_info(port, "enabled with IRQ %d\n", dev->irq); > > > return 0; > > > -- > > > 2.34.1 > > > > > Thanks for review, > > -Robert ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 6/6] PCI/AER: Unmask RCEC internal errors to enable RCH downstream port error handling 2023-04-13 17:05 ` Jonathan Cameron @ 2023-04-14 11:58 ` Robert Richter 0 siblings, 0 replies; 52+ messages in thread From: Robert Richter @ 2023-04-14 11:58 UTC (permalink / raw) To: Jonathan Cameron Cc: Bjorn Helgaas, Terry Bowman, alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, linux-cxl, linux-kernel, bhelgaas, Oliver O'Halloran, Mahesh J Salgaonkar, linuxppc-dev, linux-pci On 13.04.23 18:05:08, Jonathan Cameron wrote: > On Thu, 13 Apr 2023 15:38:07 +0200 > Robert Richter <rrichter@amd.com> wrote: > > > On 12.04.23 16:29:01, Bjorn Helgaas wrote: > > > On Tue, Apr 11, 2023 at 01:03:02PM -0500, Terry Bowman wrote: > > > With the exception of this function, this patch looks like all CXL > > > code that maybe could be with other CXL code. Would require making > > > pcie_walk_rcec() available outside drivers/pci, I guess. > > > > Even this is CXL code, it implements AER support and fits better here > > around AER code. Export of pcie_walk_rcec() (and others?) is not the > > main issue here. CXL drivers can come as modules and would need to > > register a hook at the aer handler. This would add even more > > complexity here. In contrast, current solution just adds two functions > > for enablement and handling which are empty stubs if code is disabled. > > > > I could move that code to aer_cxl.c similar to aer_inject.c. Since the > > CXL part is small compared to the remaining aer code I left it in > > aer.c. Also, it is guarded by #ifdef which additionally encapsulates > > it. > > > > To throw another option in there (what Bjorn suggested IIRC for the more > general case..) > > Just enable internal errors always. No need to know if they are CXL > or something else. > > There will/might be fallout and it will be fun. I left the fun part to others. :-) If some PCI root port goes crazy it tears down the whole system, would avoid that. 
Since internal errors are implementation specific, I would only enable them once a handler exists. That's why enablement is limited to CXL RCECs only. -Robert ^ permalink raw reply [flat|nested] 52+ messages in thread
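The restriction to CXL RCECs described in this reply comes down to one question: does any endpoint associated with the RCEC pass the is_cxl_mem_dev() test? The following is a minimal userspace sketch of that decision, not the kernel implementation. struct fake_pci_dev and rcec_has_cxl_mem_dev() are hypothetical; the array scan stands in for pcie_walk_rcec(), the native-ownership check from the patch is omitted, and the value of PCI_CLASS_MEMORY_CXL is assumed from include/linux/pci_ids.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_DEVFN(slot, func) ((((slot) & 0x1f) << 3) | ((func) & 0x07))
#define PCI_CLASS_MEMORY_CXL 0x0502  /* assumed value from pci_ids.h */

/* Hypothetical stand-in for the struct pci_dev fields the check reads. */
struct fake_pci_dev {
	unsigned int devfn;
	uint32_t class_code;  /* 24 bits: base class, sub-class, prog-if */
};

/* Same two tests as is_cxl_mem_dev() in the patch. */
static bool is_cxl_mem_dev(const struct fake_pci_dev *dev)
{
	/* A CXL device is controlled only through device 0, function 0. */
	if (dev->devfn != PCI_DEVFN(0, 0))
		return false;

	/* Drop the prog-if byte, then compare base class and sub-class. */
	return (dev->class_code >> 8) == PCI_CLASS_MEMORY_CXL;
}

/*
 * Hypothetical replacement for the pcie_walk_rcec() based lookup:
 * scan the endpoints associated with an RCEC and stop at the first match.
 */
static bool rcec_has_cxl_mem_dev(const struct fake_pci_dev *devs, size_t n)
{
	size_t i;

	assert(devs != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (is_cxl_mem_dev(&devs[i]))
			return true;
	}

	return false;
}
```

An RCEC with no matching endpoint keeps its internal errors masked, which is what keeps an unrelated root complex device from raising errors that nothing handles.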
* Re: [PATCH v3 6/6] PCI/AER: Unmask RCEC internal errors to enable RCH downstream port error handling 2023-04-13 13:38 ` Robert Richter 2023-04-13 17:05 ` Jonathan Cameron @ 2023-04-14 21:49 ` Bjorn Helgaas 1 sibling, 0 replies; 52+ messages in thread From: Bjorn Helgaas @ 2023-04-14 21:49 UTC (permalink / raw) To: Robert Richter Cc: Terry Bowman, alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, Jonathan.Cameron, linux-cxl, linux-kernel, bhelgaas, Oliver O'Halloran, Mahesh J Salgaonkar, linuxppc-dev, linux-pci On Thu, Apr 13, 2023 at 03:38:07PM +0200, Robert Richter wrote: > On 12.04.23 16:29:01, Bjorn Helgaas wrote: > > On Tue, Apr 11, 2023 at 01:03:02PM -0500, Terry Bowman wrote: > > > From: Robert Richter <rrichter@amd.com> > > > > > > RCEC AER corrected and uncorrectable internal errors (CIE/UIE) are > > > disabled by default. > > > +static void cxl_unmask_internal_errors(struct pci_dev *rcec) > > Also renaming this to cxl_enable_rcec() to more generalize the > function. I didn't follow this. "cxl_enable_rcec" doesn't say anything about "unmasking" or "internal errors", which seems like the whole point. And the function doesn't actually *enable* an RCEC. > > > +{ > > > + if (!handles_cxl_errors(rcec)) > > > + return; > > > + > > > + if (__cxl_unmask_internal_errors(rcec)) > > > + dev_err(&rcec->dev, "cxl: Failed to unmask internal errors"); > > > + else > > > + dev_dbg(&rcec->dev, "cxl: Internal errors unmasked"); > > I am going to change this to a pci_info() for alignment with other > messages around: > > [ 14.200265] pcieport 0000:40:00.3: PME: Signaling with IRQ 44 > [ 14.213925] pcieport 0000:40:00.3: AER: cxl: Internal errors unmasked > [ 14.228413] pcieport 0000:40:00.3: AER: enabled with IRQ 44 > > Plus, using pci_err() instead of dev_err(). Thanks for that! Bjorn ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 6/6] PCI/AER: Unmask RCEC internal errors to enable RCH downstream port error handling 2023-04-12 21:29 ` Bjorn Helgaas 2023-04-13 13:38 ` Robert Richter @ 2023-04-13 17:01 ` Jonathan Cameron 2023-04-13 22:52 ` Ira Weiny 1 sibling, 1 reply; 52+ messages in thread From: Jonathan Cameron @ 2023-04-13 17:01 UTC (permalink / raw) To: Bjorn Helgaas Cc: Terry Bowman, alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, linux-cxl, rrichter, linux-kernel, bhelgaas, Oliver O'Halloran, Mahesh J Salgaonkar, linuxppc-dev, linux-pci On Wed, 12 Apr 2023 16:29:01 -0500 Bjorn Helgaas <helgaas@kernel.org> wrote: > On Tue, Apr 11, 2023 at 01:03:02PM -0500, Terry Bowman wrote: > > From: Robert Richter <rrichter@amd.com> > > > > RCEC AER corrected and uncorrectable internal errors (CIE/UIE) are > > disabled by default. > > "Disabled by default" just means "the power-up state of CIE/UIC is > that they are masked", right? It doesn't mean that Linux normally > masks them. > > > [1][2] Enable them to receive CXL downstream port > > errors of a Restricted CXL Host (RCH). 
> > > > [1] CXL 3.0 Spec, 12.2.1.1 - RCH Downstream Port Detected Errors > > [2] PCIe Base Spec 6.0, 7.8.4.3 Uncorrectable Error Mask Register, > > 7.8.4.6 Correctable Error Mask Register > > > > Co-developed-by: Terry Bowman <terry.bowman@amd.com> > > Signed-off-by: Robert Richter <rrichter@amd.com> > > Signed-off-by: Terry Bowman <terry.bowman@amd.com> > > Cc: "Oliver O'Halloran" <oohall@gmail.com> > > Cc: Bjorn Helgaas <bhelgaas@google.com> > > Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com> > > Cc: linuxppc-dev@lists.ozlabs.org > > Cc: linux-pci@vger.kernel.org > > --- > > drivers/pci/pcie/aer.c | 73 ++++++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 73 insertions(+) > > > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > > index 171a08fd8ebd..3973c731e11d 100644 > > --- a/drivers/pci/pcie/aer.c > > +++ b/drivers/pci/pcie/aer.c > > @@ -1000,7 +1000,79 @@ static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info) > > pcie_walk_rcec(dev, cxl_handle_error_iter, info); > > } > > > > +static bool cxl_error_is_native(struct pci_dev *dev) > > +{ > > + struct pci_host_bridge *host = pci_find_host_bridge(dev->bus); > > + > > + if (pcie_ports_native) > > + return true; > > + > > + return host->native_aer && host->native_cxl_error; > > +} > > + > > +static int handles_cxl_error_iter(struct pci_dev *dev, void *data) > > +{ > > + int *handles_cxl = data; > > + > > + *handles_cxl = is_cxl_mem_dev(dev) && cxl_error_is_native(dev); > > + > > + return *handles_cxl; > > +} > > + > > +static bool handles_cxl_errors(struct pci_dev *rcec) > > +{ > > + int handles_cxl = 0; > > + > > + if (!rcec->aer_cap) > > + return false; > > + > > + if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC) > > + pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl); > > + > > + return !!handles_cxl; > > +} > > + > > +static int __cxl_unmask_internal_errors(struct pci_dev *rcec) > > +{ > > + int aer, rc; > > + u32 mask; > > + > > + /* > > + * Internal errors 
are masked by default, unmask RCEC's here > > + * PCI6.0 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h) > > + * PCI6.0 7.8.4.6 Correctable Error Mask Register (Offset 14h) > > + */ > > Unmasking internal errors doesn't have anything specific to do with > CXL, so I don't think it should have "cxl" in the function name. > Maybe something like "pci_aer_unmask_internal_errors()". This reminds me. Not sure we resolved earlier discussion on changing the system wide policy to turn these on https://lore.kernel.org/linux-cxl/20221229172731.GA611562@bhelgaas/ which needs pretty much the same thing. Ira, I think you were picking this one up? https://lore.kernel.org/linux-cxl/63e5fb533f304_13244829412@iweiny-mobl.notmuch/ Thanks, Jonathan > > This also has nothing special to do with RCECs, so I think we should > refer to the device as "dev" as is typical in this file. > > I think this needs to check pcie_aer_is_native() as is done by > pci_aer_clear_nonfatal_status() and other functions that write the AER > Capability. > > With the exception of this function, this patch looks like all CXL > code that maybe could be with other CXL code. Would require making > pcie_walk_rcec() available outside drivers/pci, I guess. 
> > > + aer = rcec->aer_cap; > > + rc = pci_read_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, &mask); > > + if (rc) > > + return rc; > > + mask &= ~PCI_ERR_UNC_INTN; > > + rc = pci_write_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, mask); > > + if (rc) > > + return rc; > > + > > + rc = pci_read_config_dword(rcec, aer + PCI_ERR_COR_MASK, &mask); > > + if (rc) > > + return rc; > > + mask &= ~PCI_ERR_COR_INTERNAL; > > + rc = pci_write_config_dword(rcec, aer + PCI_ERR_COR_MASK, mask); > > + > > + return rc; > > +} > > + > > +static void cxl_unmask_internal_errors(struct pci_dev *rcec) > > +{ > > + if (!handles_cxl_errors(rcec)) > > + return; > > + > > + if (__cxl_unmask_internal_errors(rcec)) > > + dev_err(&rcec->dev, "cxl: Failed to unmask internal errors"); > > + else > > + dev_dbg(&rcec->dev, "cxl: Internal errors unmasked"); > > +} > > + > > #else > > +static inline void cxl_unmask_internal_errors(struct pci_dev *dev) { } > > static inline void cxl_handle_error(struct pci_dev *dev, > > struct aer_err_info *info) { } > > #endif > > @@ -1397,6 +1469,7 @@ static int aer_probe(struct pcie_device *dev) > > return status; > > } > > > > + cxl_unmask_internal_errors(port); > > aer_enable_rootport(rpc); > > pci_info(port, "enabled with IRQ %d\n", dev->irq); > > return 0; > > -- > > 2.34.1 > > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 6/6] PCI/AER: Unmask RCEC internal errors to enable RCH downstream port error handling
  2023-04-13 17:01     ` Jonathan Cameron
@ 2023-04-13 22:52       ` Ira Weiny
  2023-04-14 11:21         ` Robert Richter
  0 siblings, 1 reply; 52+ messages in thread
From: Ira Weiny @ 2023-04-13 22:52 UTC (permalink / raw)
  To: Jonathan Cameron, Bjorn Helgaas
  Cc: Terry Bowman, alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, linux-cxl, rrichter, linux-kernel, bhelgaas, Oliver O'Halloran, Mahesh J Salgaonkar, linuxppc-dev, linux-pci

Jonathan Cameron wrote:
> On Wed, 12 Apr 2023 16:29:01 -0500
> Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> > On Tue, Apr 11, 2023 at 01:03:02PM -0500, Terry Bowman wrote:
> > > From: Robert Richter <rrichter@amd.com>
> > >
> > > RCEC AER corrected and uncorrectable internal errors (CIE/UIE) are
> > > disabled by default.
> >
> > "Disabled by default" just means "the power-up state of CIE/UIE is
> > that they are masked", right?  It doesn't mean that Linux normally
> > masks them.
> >
> > > [1][2] Enable them to receive CXL downstream port
> > > errors of a Restricted CXL Host (RCH).
> > > > > > [1] CXL 3.0 Spec, 12.2.1.1 - RCH Downstream Port Detected Errors > > > [2] PCIe Base Spec 6.0, 7.8.4.3 Uncorrectable Error Mask Register, > > > 7.8.4.6 Correctable Error Mask Register > > > > > > Co-developed-by: Terry Bowman <terry.bowman@amd.com> > > > Signed-off-by: Robert Richter <rrichter@amd.com> > > > Signed-off-by: Terry Bowman <terry.bowman@amd.com> > > > Cc: "Oliver O'Halloran" <oohall@gmail.com> > > > Cc: Bjorn Helgaas <bhelgaas@google.com> > > > Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com> > > > Cc: linuxppc-dev@lists.ozlabs.org > > > Cc: linux-pci@vger.kernel.org > > > --- > > > drivers/pci/pcie/aer.c | 73 ++++++++++++++++++++++++++++++++++++++++++ > > > 1 file changed, 73 insertions(+) > > > > > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > > > index 171a08fd8ebd..3973c731e11d 100644 > > > --- a/drivers/pci/pcie/aer.c > > > +++ b/drivers/pci/pcie/aer.c > > > @@ -1000,7 +1000,79 @@ static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info) > > > pcie_walk_rcec(dev, cxl_handle_error_iter, info); > > > } > > > > > > +static bool cxl_error_is_native(struct pci_dev *dev) > > > +{ > > > + struct pci_host_bridge *host = pci_find_host_bridge(dev->bus); > > > + > > > + if (pcie_ports_native) > > > + return true; > > > + > > > + return host->native_aer && host->native_cxl_error; > > > +} > > > + > > > +static int handles_cxl_error_iter(struct pci_dev *dev, void *data) > > > +{ > > > + int *handles_cxl = data; > > > + > > > + *handles_cxl = is_cxl_mem_dev(dev) && cxl_error_is_native(dev); > > > + > > > + return *handles_cxl; > > > +} > > > + > > > +static bool handles_cxl_errors(struct pci_dev *rcec) > > > +{ > > > + int handles_cxl = 0; > > > + > > > + if (!rcec->aer_cap) > > > + return false; > > > + > > > + if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC) > > > + pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl); > > > + > > > + return !!handles_cxl; > > > +} > > > + > > > +static int 
__cxl_unmask_internal_errors(struct pci_dev *rcec) > > > +{ > > > + int aer, rc; > > > + u32 mask; > > > + > > > + /* > > > + * Internal errors are masked by default, unmask RCEC's here > > > + * PCI6.0 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h) > > > + * PCI6.0 7.8.4.6 Correctable Error Mask Register (Offset 14h) > > > + */ > > > > Unmasking internal errors doesn't have anything specific to do with > > CXL, so I don't think it should have "cxl" in the function name. > > Maybe something like "pci_aer_unmask_internal_errors()". > > This reminds me. Not sure we resolved earlier discussion on changing > the system wide policy to turn these on > https://lore.kernel.org/linux-cxl/20221229172731.GA611562@bhelgaas/ > which needs pretty much the same thing. > > Ira, I think you were picking this one up? > https://lore.kernel.org/linux-cxl/63e5fb533f304_13244829412@iweiny-mobl.notmuch/ After this discussion I posted an RFC to enable those errors. https://lore.kernel.org/all/20230209-cxl-pci-aer-v1-1-f9a817fa4016@intel.com/ Unfortunately the prevailing opinion was that this was unsafe. And no one piped up with a reason to pursue the alternative of a pci core call to enable them as needed. So I abandoned the work. 
I think the direction things were headed was to have a call like:

int pci_enable_pci_internal_errors(struct pci_dev *dev)
{
	int pos_cap_err;
	u32 reg;

	if (!pcie_aer_is_native(dev))
		return -EIO;

	pos_cap_err = dev->aer_cap;

	/* Unmask correctable and uncorrectable (non-fatal) internal errors */
	pci_read_config_dword(dev, pos_cap_err + PCI_ERR_COR_MASK, &reg);
	reg &= ~PCI_ERR_COR_INTERNAL;
	pci_write_config_dword(dev, pos_cap_err + PCI_ERR_COR_MASK, reg);

	pci_read_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_SEVER, &reg);
	reg &= ~PCI_ERR_UNC_INTN;
	pci_write_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_SEVER, reg);

	pci_read_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_MASK, &reg);
	reg &= ~PCI_ERR_UNC_INTN;
	pci_write_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_MASK, reg);

	return 0;
}

... and call this from the cxl code where it is needed.

Is this an acceptable direction?  Terry is welcome to steal the above from my
patch and throw it into the PCI core.

Looking at the current state of things I think cxl_pci_ras_unmask() may
actually be broken now without calling something like the above.  For that I
dropped the ball.

Ira

^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: [PATCH v3 6/6] PCI/AER: Unmask RCEC internal errors to enable RCH downstream port error handling 2023-04-13 22:52 ` Ira Weiny @ 2023-04-14 11:21 ` Robert Richter 2023-04-14 11:55 ` Jonathan Cameron 0 siblings, 1 reply; 52+ messages in thread From: Robert Richter @ 2023-04-14 11:21 UTC (permalink / raw) To: Ira Weiny Cc: Jonathan Cameron, Bjorn Helgaas, Terry Bowman, alison.schofield, vishal.l.verma, bwidawsk, dan.j.williams, dave.jiang, linux-cxl, linux-kernel, bhelgaas, Oliver O'Halloran, Mahesh J Salgaonkar, linuxppc-dev, linux-pci On 13.04.23 15:52:36, Ira Weiny wrote: > Jonathan Cameron wrote: > > On Wed, 12 Apr 2023 16:29:01 -0500 > > Bjorn Helgaas <helgaas@kernel.org> wrote: > > > > > On Tue, Apr 11, 2023 at 01:03:02PM -0500, Terry Bowman wrote: > > > > From: Robert Richter <rrichter@amd.com> > > > > > > > > +static int __cxl_unmask_internal_errors(struct pci_dev *rcec) > > > > +{ > > > > + int aer, rc; > > > > + u32 mask; > > > > + > > > > + /* > > > > + * Internal errors are masked by default, unmask RCEC's here > > > > + * PCI6.0 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h) > > > > + * PCI6.0 7.8.4.6 Correctable Error Mask Register (Offset 14h) > > > > + */ > > > > > > Unmasking internal errors doesn't have anything specific to do with > > > CXL, so I don't think it should have "cxl" in the function name. > > > Maybe something like "pci_aer_unmask_internal_errors()". > > > > This reminds me. Not sure we resolved earlier discussion on changing > > the system wide policy to turn these on > > https://lore.kernel.org/linux-cxl/20221229172731.GA611562@bhelgaas/ > > which needs pretty much the same thing. > > > > Ira, I think you were picking this one up? > > https://lore.kernel.org/linux-cxl/63e5fb533f304_13244829412@iweiny-mobl.notmuch/ > > After this discussion I posted an RFC to enable those errors. 
>
> https://lore.kernel.org/all/20230209-cxl-pci-aer-v1-1-f9a817fa4016@intel.com/
>
> Unfortunately the prevailing opinion was that this was unsafe.  And no one
> piped up with a reason to pursue the alternative of a pci core call to enable
> them as needed.
>
> So I abandoned the work.
>
> I think the direction things were headed was to have a call like:
>
> int pci_enable_pci_internal_errors(struct pci_dev *dev)
> {
> 	int pos_cap_err;
> 	u32 reg;
>
> 	if (!pcie_aer_is_native(dev))
> 		return -EIO;
>
> 	pos_cap_err = dev->aer_cap;
>
> 	/* Unmask correctable and uncorrectable (non-fatal) internal errors */
> 	pci_read_config_dword(dev, pos_cap_err + PCI_ERR_COR_MASK, &reg);
> 	reg &= ~PCI_ERR_COR_INTERNAL;
> 	pci_write_config_dword(dev, pos_cap_err + PCI_ERR_COR_MASK, reg);
>
> 	pci_read_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_SEVER, &reg);
> 	reg &= ~PCI_ERR_UNC_INTN;
> 	pci_write_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_SEVER, reg);
>
> 	pci_read_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_MASK, &reg);
> 	reg &= ~PCI_ERR_UNC_INTN;
> 	pci_write_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_MASK, reg);
>
> 	return 0;
> }
>
> ... and call this from the cxl code where it is needed.

The version I have ready after addressing Bjorn's comments is pretty
much the same, apart from error checking of the read/writes.

From your patch proposed you will need it in aer.c too and we do not
need to export it.

This patch only enables it for (CXL) RCECs. You might want to extend
this for CXL endpoints (and ports?) then.

>
> Is this an acceptable direction?  Terry is welcome to steal the above from my
> patch and throw it into the PCI core.
>
> Looking at the current state of things I think cxl_pci_ras_unmask() may
> actually be broken now without calling something like the above.  For that I
> dropped the ball.

Thanks,

-Robert

>
> Ira

^ permalink raw reply	[flat|nested] 52+ messages in thread
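For readers following along, the combination discussed here (Ira's pcie_aer_is_native() gate plus the read/write error checking Robert mentions) can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the code either author posted: struct sim_dev, sim_read(), sim_write() and sim_unmask_internal_errors() are hypothetical stand-ins for struct pci_dev and the config accessors, and the simulated accessors return -EIO on failure as a simplification. The register offsets and bit values are the ones from include/uapi/linux/pci_regs.h.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* AER register offsets and bits, as in include/uapi/linux/pci_regs.h */
#define PCI_ERR_UNCOR_MASK	0x08
#define PCI_ERR_COR_MASK	0x14
#define PCI_ERR_UNC_INTN	0x00400000u	/* uncorrectable internal error */
#define PCI_ERR_COR_INTERNAL	0x00004000u	/* corrected internal error */

/* Hypothetical model of a device: AER registers indexed by offset / 4 */
struct sim_dev {
	uint32_t aer[16];
	bool aer_native;	/* stands in for pcie_aer_is_native() */
	bool cfg_broken;	/* when set, every config access fails */
};

static int sim_read(struct sim_dev *d, int off, uint32_t *val)
{
	if (d->cfg_broken)
		return -EIO;
	*val = d->aer[off / 4];
	return 0;
}

static int sim_write(struct sim_dev *d, int off, uint32_t val)
{
	if (d->cfg_broken)
		return -EIO;
	d->aer[off / 4] = val;
	return 0;
}

/* Clear the internal-error mask bits, stopping at the first failed access */
static int sim_unmask_internal_errors(struct sim_dev *d)
{
	uint32_t mask;
	int rc;

	if (!d->aer_native)
		return -EIO;

	rc = sim_read(d, PCI_ERR_UNCOR_MASK, &mask);
	if (rc)
		return rc;
	mask &= ~PCI_ERR_UNC_INTN;
	rc = sim_write(d, PCI_ERR_UNCOR_MASK, mask);
	if (rc)
		return rc;

	rc = sim_read(d, PCI_ERR_COR_MASK, &mask);
	if (rc)
		return rc;
	mask &= ~PCI_ERR_COR_INTERNAL;
	return sim_write(d, PCI_ERR_COR_MASK, mask);
}
```

Starting from mask registers with all bits set, a successful call leaves every bit except the two internal-error bits untouched; a device that is not natively handled, or whose config access fails, returns an error without completing the sequence.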
* Re: [PATCH v3 6/6] PCI/AER: Unmask RCEC internal errors to enable RCH downstream port error handling 2023-04-14 11:21 ` Robert Richter @ 2023-04-14 11:55 ` Jonathan Cameron 2023-04-14 14:47 ` Robert Richter 0 siblings, 1 reply; 52+ messages in thread From: Jonathan Cameron @ 2023-04-14 11:55 UTC (permalink / raw) To: Robert Richter Cc: Ira Weiny, Bjorn Helgaas, Terry Bowman, alison.schofield, vishal.l.verma, bwidawsk, dan.j.williams, dave.jiang, linux-cxl, linux-kernel, bhelgaas, Oliver O'Halloran, Mahesh J Salgaonkar, linuxppc-dev, linux-pci On Fri, 14 Apr 2023 13:21:37 +0200 Robert Richter <rrichter@amd.com> wrote: > On 13.04.23 15:52:36, Ira Weiny wrote: > > Jonathan Cameron wrote: > > > On Wed, 12 Apr 2023 16:29:01 -0500 > > > Bjorn Helgaas <helgaas@kernel.org> wrote: > > > > > > > On Tue, Apr 11, 2023 at 01:03:02PM -0500, Terry Bowman wrote: > > > > > From: Robert Richter <rrichter@amd.com> > > > > > > > > > > > +static int __cxl_unmask_internal_errors(struct pci_dev *rcec) > > > > > +{ > > > > > + int aer, rc; > > > > > + u32 mask; > > > > > + > > > > > + /* > > > > > + * Internal errors are masked by default, unmask RCEC's here > > > > > + * PCI6.0 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h) > > > > > + * PCI6.0 7.8.4.6 Correctable Error Mask Register (Offset 14h) > > > > > + */ > > > > > > > > Unmasking internal errors doesn't have anything specific to do with > > > > CXL, so I don't think it should have "cxl" in the function name. > > > > Maybe something like "pci_aer_unmask_internal_errors()". > > > > > > This reminds me. Not sure we resolved earlier discussion on changing > > > the system wide policy to turn these on > > > https://lore.kernel.org/linux-cxl/20221229172731.GA611562@bhelgaas/ > > > which needs pretty much the same thing. > > > > > > Ira, I think you were picking this one up? 
> > > https://lore.kernel.org/linux-cxl/63e5fb533f304_13244829412@iweiny-mobl.notmuch/
> >
> > After this discussion I posted an RFC to enable those errors.
> >
> > https://lore.kernel.org/all/20230209-cxl-pci-aer-v1-1-f9a817fa4016@intel.com/
> >

Ah. I'd forgotten that thread. Thanks!

> > Unfortunately the prevailing opinion was that this was unsafe.  And no one
> > piped up with a reason to pursue the alternative of a pci core call to enable
> > them as needed.
> >
> > So I abandoned the work.
> >
> > I think the direction things were headed was to have a call like:
> >
> > int pci_enable_pci_internal_errors(struct pci_dev *dev)
> > {
> > 	int pos_cap_err;
> > 	u32 reg;
> >
> > 	if (!pcie_aer_is_native(dev))
> > 		return -EIO;
> >
> > 	pos_cap_err = dev->aer_cap;
> >
> > 	/* Unmask correctable and uncorrectable (non-fatal) internal errors */
> > 	pci_read_config_dword(dev, pos_cap_err + PCI_ERR_COR_MASK, &reg);
> > 	reg &= ~PCI_ERR_COR_INTERNAL;
> > 	pci_write_config_dword(dev, pos_cap_err + PCI_ERR_COR_MASK, reg);
> >
> > 	pci_read_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_SEVER, &reg);
> > 	reg &= ~PCI_ERR_UNC_INTN;
> > 	pci_write_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_SEVER, reg);
> >
> > 	pci_read_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_MASK, &reg);
> > 	reg &= ~PCI_ERR_UNC_INTN;
> > 	pci_write_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_MASK, reg);
> >
> > 	return 0;
> > }
> >
> > ... and call this from the cxl code where it is needed.
>
> The version I have ready after addressing Bjorn's comments is pretty
> much the same, apart from error checking of the read/writes.
>
> From your patch proposed you will need it in aer.c too and we do not
> need to export it.

I think for the other components we'll want to call it from cxl_pci_ras_unmask()
so an export is needed.

I also wonder if a more generic function would be better, as it seems likely
similar code will be needed for errors other than this pair.

>
> This patch only enables it for (CXL) RCECs.
You might want to extend > this for CXL endpoints (and ports?) then. Definitely. We have the same limitation you are seeing. No errors without turning this on. Jonathan > > > > > Is this an acceptable direction? Terry is welcome to steal the above from my > > patch and throw it into the PCI core. > > > > Looking at the current state of things I think cxl_pci_ras_unmask() may > > actually be broken now without calling something like the above. For that I > > dropped the ball. > > Thanks, > > -Robert > > > > > Ira ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [PATCH v3 6/6] PCI/AER: Unmask RCEC internal errors to enable RCH downstream port error handling 2023-04-14 11:55 ` Jonathan Cameron @ 2023-04-14 14:47 ` Robert Richter 0 siblings, 0 replies; 52+ messages in thread From: Robert Richter @ 2023-04-14 14:47 UTC (permalink / raw) To: Jonathan Cameron Cc: Ira Weiny, Bjorn Helgaas, Terry Bowman, alison.schofield, vishal.l.verma, bwidawsk, dan.j.williams, dave.jiang, linux-cxl, linux-kernel, bhelgaas, Oliver O'Halloran, Mahesh J Salgaonkar, linuxppc-dev, linux-pci On 14.04.23 12:55:43, Jonathan Cameron wrote: > On Fri, 14 Apr 2023 13:21:37 +0200 > Robert Richter <rrichter@amd.com> wrote: > > The version I have ready after addressing Bjorn's comments is pretty > > much the same, apart from error checking of the read/writes. > > > > From your patch proposed you will need it in aer.c too and we do not > > need to export it. > > I think for the other components we'll want to call it from cxl_pci_ras_unmask() > so an export needed. > > I also wonder if a more generic function would be better as seems likely > similar code will be needed for errors other than this pair. There are only a few masked by default, but not only internals. Will consider that and also make it easy to export later once needed. Thanks, -Robert ^ permalink raw reply [flat|nested] 52+ messages in thread
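The "more generic function" idea raised above could take the bits to unmask as parameters rather than hard-coding the internal-error pair. The sketch below is a guess at the shape only: struct sim_aer_masks and sim_aer_unmask() are hypothetical names, the registers are modelled as plain integers, and the thread does not say which additional bits would be unmasked. PCI_ERR_COR_ADV_NFAT appears in the usage note purely as an example of another bit a caller might pass.

```c
#include <stdint.h>

/* Bit values as in include/uapi/linux/pci_regs.h */
#define PCI_ERR_UNC_INTN	0x00400000u	/* uncorrectable internal error */
#define PCI_ERR_COR_ADV_NFAT	0x00002000u	/* advisory non-fatal error */
#define PCI_ERR_COR_INTERNAL	0x00004000u	/* corrected internal error */

/* Hypothetical model of the two AER mask registers */
struct sim_aer_masks {
	uint32_t uncor_mask;
	uint32_t cor_mask;
};

/*
 * Clear only the caller-selected bits in each mask register.
 * All other mask bits keep their current state.
 */
static void sim_aer_unmask(struct sim_aer_masks *regs,
			   uint32_t uncor_bits, uint32_t cor_bits)
{
	regs->uncor_mask &= ~uncor_bits;
	regs->cor_mask &= ~cor_bits;
}
```

A CXL caller would pass PCI_ERR_UNC_INTN and PCI_ERR_COR_INTERNAL; another user could OR in further bits such as PCI_ERR_COR_ADV_NFAT without needing a second helper.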
* RE: [PATCH v3 6/6] PCI/AER: Unmask RCEC internal errors to enable RCH downstream port error handling 2023-04-11 18:03 ` [PATCH v3 6/6] PCI/AER: Unmask RCEC internal errors to enable RCH downstream port error handling Terry Bowman 2023-04-12 21:29 ` Bjorn Helgaas @ 2023-04-18 2:37 ` Dan Williams 1 sibling, 0 replies; 52+ messages in thread From: Dan Williams @ 2023-04-18 2:37 UTC (permalink / raw) To: Terry Bowman, alison.schofield, vishal.l.verma, ira.weiny, bwidawsk, dan.j.williams, dave.jiang, Jonathan.Cameron, linux-cxl Cc: terry.bowman, rrichter, linux-kernel, bhelgaas, Oliver O'Halloran, Mahesh J Salgaonkar, linuxppc-dev, linux-pci Terry Bowman wrote: > From: Robert Richter <rrichter@amd.com> > > RCEC AER corrected and uncorrectable internal errors (CIE/UIE) are > disabled by default. [1][2] Enable them to receive CXL downstream port > errors of a Restricted CXL Host (RCH). > > [1] CXL 3.0 Spec, 12.2.1.1 - RCH Downstream Port Detected Errors > [2] PCIe Base Spec 6.0, 7.8.4.3 Uncorrectable Error Mask Register, > 7.8.4.6 Correctable Error Mask Register My comment on patch5 to make CXL link details a first class property of a 'struct pci_dev': http://lore.kernel.org/r/643debf5af445_1b66294f4@dwillia2-xfh.jf.intel.com.notmuch/ ...also applies here. Other than that nothing more from me on this one beyond what Bjorn and Jonathan have said. I do agree with Robert about being cautious about only enabling this for CXL devices for now and not all internal errors for all AER capable devices globally. The rationale being that CXL devices are a new link on top of PCIe and abuse/reuse internal errors when they are conceptually functionally equivalent to PCIe link errors. ^ permalink raw reply [flat|nested] 52+ messages in thread