* [PATCHv4 00/12] pci error handling fixes
@ 2018-09-20 16:27 Keith Busch
  2018-09-20 16:27 ` [PATCHv4 01/12] PCI: portdrv: Initialize service drivers directly Keith Busch
                   ` (13 more replies)
  0 siblings, 14 replies; 28+ messages in thread
From: Keith Busch @ 2018-09-20 16:27 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Mika Westerberg, Keith Busch

Changes since v3 are:

  Initialize port services directly from port driver (suggested by
  Bjorn)

  Updated changelogs to match local style

  Added documentation update for ERR_FATAL behavior

  Included the DPC save/restore state fix that was missing from the
  previous version

  Included AER upstream port fix that was mistakenly added to a
  different patch set

Keith Busch (12):
  PCI: portdrv: Initialize service drivers directly
  PCI: portdrv: Restore pci state on slot reset
  PCI: DPC: Save and restore control state
  PCI: AER: Take reference on error devices
  PCI: AER: Don't read upstream ports below fatal errors
  PCI: ERR: Use slot reset if available
  PCI: ERR: Handle fatal error recovery
  PCI: ERR: Always use the first downstream port
  PCI: ERR: Simplify broadcast callouts
  PCI: ERR: Report current recovery status for udev
  PCI: Unify device inaccessible
  PCI: Make link active reporting detection generic

 Documentation/PCI/pci-error-recovery.txt |  35 ++--
 drivers/pci/hotplug/pciehp.h             |   6 -
 drivers/pci/hotplug/pciehp_core.c        |   3 +-
 drivers/pci/hotplug/pciehp_hpc.c         |  22 +--
 drivers/pci/pci.c                        |  68 +++++++-
 drivers/pci/pci.h                        |  74 ++++++++-
 drivers/pci/pcie/aer.c                   |  25 +--
 drivers/pci/pcie/dpc.c                   |  72 ++++++--
 drivers/pci/pcie/err.c                   | 276 ++++++++-----------------------
 drivers/pci/pcie/pme.c                   |   3 +-
 drivers/pci/pcie/portdrv.h               |  24 +++
 drivers/pci/pcie/portdrv_pci.c           |  17 ++
 drivers/pci/probe.c                      |   1 +
 drivers/pci/slot.c                       |   2 +-
 include/linux/pci.h                      |   1 +
 15 files changed, 331 insertions(+), 298 deletions(-)

-- 
2.14.4



* [PATCHv4 01/12] PCI: portdrv: Initialize service drivers directly
  2018-09-20 16:27 [PATCHv4 00/12] pci error handling fixes Keith Busch
@ 2018-09-20 16:27 ` Keith Busch
  2018-09-20 16:27 ` [PATCHv4 02/12] PCI: portdrv: Restore pci state on slot reset Keith Busch
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 28+ messages in thread
From: Keith Busch @ 2018-09-20 16:27 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Mika Westerberg, Keith Busch

The PCI port driver saves the PCI state after initializing the device
with the applicable service devices. That happened, however, before the
service drivers were even registered, because PCI probe runs before the
device_initcalls that initialized those service drivers. As a result,
the config space state the services set up was never saved, and if the
PCI state later needed to be restored, devices would stop reacting to
events their drivers believed were still enabled.

This patch fixes that by having the portdrv driver call the service
initializations directly instead of relying on init calls. This gets
the state saved as desired, while making the relationship between the
port driver and the services under it more explicit in the code.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/hotplug/pciehp_core.c |  3 +--
 drivers/pci/pcie/aer.c            |  3 +--
 drivers/pci/pcie/dpc.c            |  3 +--
 drivers/pci/pcie/pme.c            |  3 +--
 drivers/pci/pcie/portdrv.h        | 24 ++++++++++++++++++++++++
 drivers/pci/pcie/portdrv_pci.c    |  9 +++++++++
 6 files changed, 37 insertions(+), 8 deletions(-)

diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
index 68b20e667764..944047976569 100644
--- a/drivers/pci/hotplug/pciehp_core.c
+++ b/drivers/pci/hotplug/pciehp_core.c
@@ -287,7 +287,7 @@ static struct pcie_port_service_driver hpdriver_portdrv = {
 #endif	/* PM */
 };
 
-static int __init pcied_init(void)
+int __init pcie_hp_init(void)
 {
 	int retval = 0;
 
@@ -298,4 +298,3 @@ static int __init pcied_init(void)
 
 	return retval;
 }
-device_initcall(pcied_init);
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 83180edd6ed4..637d638f73da 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1569,10 +1569,9 @@ static struct pcie_port_service_driver aerdriver = {
  *
  * Invoked when AER root service driver is loaded.
  */
-static int __init aer_service_init(void)
+int __init pcie_aer_init(void)
 {
 	if (!pci_aer_available() || aer_acpi_firmware_first())
 		return -ENXIO;
 	return pcie_port_service_register(&aerdriver);
 }
-device_initcall(aer_service_init);
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index f03279fc87cd..a1fd16bf1cab 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -282,8 +282,7 @@ static struct pcie_port_service_driver dpcdriver = {
 	.reset_link	= dpc_reset_link,
 };
 
-static int __init dpc_service_init(void)
+int __init pcie_dpc_init(void)
 {
 	return pcie_port_service_register(&dpcdriver);
 }
-device_initcall(dpc_service_init);
diff --git a/drivers/pci/pcie/pme.c b/drivers/pci/pcie/pme.c
index 3ed67676ea2a..1a8b85051b1b 100644
--- a/drivers/pci/pcie/pme.c
+++ b/drivers/pci/pcie/pme.c
@@ -446,8 +446,7 @@ static struct pcie_port_service_driver pcie_pme_driver = {
 /**
  * pcie_pme_service_init - Register the PCIe PME service driver.
  */
-static int __init pcie_pme_service_init(void)
+int __init pcie_pme_init(void)
 {
 	return pcie_port_service_register(&pcie_pme_driver);
 }
-device_initcall(pcie_pme_service_init);
diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index d59afa42fc14..2498b2d34009 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -23,6 +23,30 @@
 
 #define PCIE_PORT_DEVICE_MAXSERVICES   4
 
+#ifdef CONFIG_PCIEAER
+int pcie_aer_init(void);
+#else
+static inline int pcie_aer_init(void) { return 0; }
+#endif
+
+#ifdef CONFIG_HOTPLUG_PCI_PCIE
+int pcie_hp_init(void);
+#else
+static inline int pcie_hp_init(void) { return 0; }
+#endif
+
+#ifdef CONFIG_PCIE_PME
+int pcie_pme_init(void);
+#else
+static inline int pcie_pme_init(void) { return 0; }
+#endif
+
+#ifdef CONFIG_PCIE_DPC
+int pcie_dpc_init(void);
+#else
+static inline int pcie_dpc_init(void) { return 0; }
+#endif
+
 /* Port Type */
 #define PCIE_ANY_PORT			(~0)
 
diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
index eef22dc29140..23a5a0c2c3fe 100644
--- a/drivers/pci/pcie/portdrv_pci.c
+++ b/drivers/pci/pcie/portdrv_pci.c
@@ -226,11 +226,20 @@ static const struct dmi_system_id pcie_portdrv_dmi_table[] __initconst = {
 	 {}
 };
 
+static void __init pcie_init_services(void)
+{
+	pcie_aer_init();
+	pcie_pme_init();
+	pcie_dpc_init();
+	pcie_hp_init();
+}
+
 static int __init pcie_portdrv_init(void)
 {
 	if (pcie_ports_disabled)
 		return -EACCES;
 
+	pcie_init_services();
 	dmi_check_system(pcie_portdrv_dmi_table);
 
 	return pci_register_driver(&pcie_portdriver);
-- 
2.14.4



* [PATCHv4 02/12] PCI: portdrv: Restore pci state on slot reset
  2018-09-20 16:27 [PATCHv4 00/12] pci error handling fixes Keith Busch
  2018-09-20 16:27 ` [PATCHv4 01/12] PCI: portdrv: Initialize service drivers directly Keith Busch
@ 2018-09-20 16:27 ` Keith Busch
  2018-09-20 16:27 ` [PATCHv4 03/12] PCI: DPC: Save and restore control state Keith Busch
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 28+ messages in thread
From: Keith Busch @ 2018-09-20 16:27 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Mika Westerberg, Keith Busch

The port's config space may be cleared after a link reset, which wipes
out the bridge's bus and memory windows. We need to restore the config
space that was saved during probe in order to successfully access
downstream devices.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/portdrv_pci.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
index 23a5a0c2c3fe..17256733fa43 100644
--- a/drivers/pci/pcie/portdrv_pci.c
+++ b/drivers/pci/pcie/portdrv_pci.c
@@ -146,6 +146,13 @@ static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
 	return PCI_ERS_RESULT_CAN_RECOVER;
 }
 
+static pci_ers_result_t pcie_portdrv_slot_reset(struct pci_dev *dev)
+{
+	pci_restore_state(dev);
+	pci_save_state(dev);
+	return PCI_ERS_RESULT_RECOVERED;
+}
+
 static pci_ers_result_t pcie_portdrv_mmio_enabled(struct pci_dev *dev)
 {
 	return PCI_ERS_RESULT_RECOVERED;
@@ -185,6 +192,7 @@ static const struct pci_device_id port_pci_ids[] = { {
 
 static const struct pci_error_handlers pcie_portdrv_err_handler = {
 	.error_detected = pcie_portdrv_error_detected,
+	.slot_reset = pcie_portdrv_slot_reset,
 	.mmio_enabled = pcie_portdrv_mmio_enabled,
 	.resume = pcie_portdrv_err_resume,
 };
-- 
2.14.4



* [PATCHv4 03/12] PCI: DPC: Save and restore control state
  2018-09-20 16:27 [PATCHv4 00/12] pci error handling fixes Keith Busch
  2018-09-20 16:27 ` [PATCHv4 01/12] PCI: portdrv: Initialize service drivers directly Keith Busch
  2018-09-20 16:27 ` [PATCHv4 02/12] PCI: portdrv: Restore pci state on slot reset Keith Busch
@ 2018-09-20 16:27 ` Keith Busch
  2018-09-20 19:46   ` Sinan Kaya
  2018-09-20 16:27 ` [PATCHv4 04/12] PCI: AER: Take reference on error devices Keith Busch
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 28+ messages in thread
From: Keith Busch @ 2018-09-20 16:27 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Mika Westerberg, Keith Busch

This patch adds DPC save and restore capabilities. They are needed so
the driver keeps observing DPC events when the port's configuration
space has to be restored after a reset.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pci.c      |  2 ++
 drivers/pci/pci.h      |  8 +++++++
 drivers/pci/pcie/dpc.c | 61 +++++++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 65 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index c4801a83fcc5..bcb70b7c190c 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1284,6 +1284,7 @@ int pci_save_state(struct pci_dev *dev)
 	if (i != 0)
 		return i;
 
+	pci_save_dpc_state(dev);
 	return pci_save_vc_state(dev);
 }
 EXPORT_SYMBOL(pci_save_state);
@@ -1389,6 +1390,7 @@ void pci_restore_state(struct pci_dev *dev)
 	pci_restore_ats_state(dev);
 	pci_restore_vc_state(dev);
 	pci_restore_rebar_state(dev);
+	pci_restore_dpc_state(dev);
 
 	pci_cleanup_aer_error_status_regs(dev);
 
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 6e0d1528d471..5a5c6099b253 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -346,6 +346,14 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info);
 void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
 #endif	/* CONFIG_PCIEAER */
 
+#ifdef CONFIG_PCIE_DPC
+void pci_save_dpc_state(struct pci_dev *dev);
+void pci_restore_dpc_state(struct pci_dev *dev);
+#else
+static inline void pci_save_dpc_state(struct pci_dev *dev) {}
+static inline void pci_restore_dpc_state(struct pci_dev *dev) {}
+#endif
+
 #ifdef CONFIG_PCI_ATS
 void pci_restore_ats_state(struct pci_dev *dev);
 #else
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index a1fd16bf1cab..ed815a28512e 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -44,6 +44,58 @@ static const char * const rp_pio_error_string[] = {
 	"Memory Request Completion Timeout",		 /* Bit Position 18 */
 };
 
+static struct dpc_dev *to_dpc_dev(struct pci_dev *dev)
+{
+	struct device *device;
+
+	device = pcie_port_find_device(dev, PCIE_PORT_SERVICE_DPC);
+	if (!device)
+		return NULL;
+	return get_service_data(to_pcie_device(device));
+}
+
+void pci_save_dpc_state(struct pci_dev *dev)
+{
+	struct dpc_dev *dpc;
+	struct pci_cap_saved_state *save_state;
+	u16 *cap;
+
+	if (!pci_is_pcie(dev))
+		return;
+
+	dpc = to_dpc_dev(dev);
+	if (!dpc)
+		return;
+
+	save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_DPC);
+	if (!save_state)
+		return;
+
+	cap = (u16 *)&save_state->cap.data[0];
+	pci_read_config_word(dev, dpc->cap_pos + PCI_EXP_DPC_CTL, cap);
+}
+
+void pci_restore_dpc_state(struct pci_dev *dev)
+{
+	struct dpc_dev *dpc;
+	struct pci_cap_saved_state *save_state;
+	u16 *cap;
+
+	if (!pci_is_pcie(dev))
+		return;
+
+	dpc = to_dpc_dev(dev);
+	if (!dpc)
+		return;
+
+	save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_DPC);
+	if (!save_state)
+		return;
+
+	cap = (u16 *)&save_state->cap.data[0];
+	pci_write_config_word(dev, dpc->cap_pos + PCI_EXP_DPC_CTL, *cap);
+}
+
 static int dpc_wait_rp_inactive(struct dpc_dev *dpc)
 {
 	unsigned long timeout = jiffies + HZ;
@@ -67,18 +119,13 @@ static int dpc_wait_rp_inactive(struct dpc_dev *dpc)
 static pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
 {
 	struct dpc_dev *dpc;
-	struct pcie_device *pciedev;
-	struct device *devdpc;
-
 	u16 cap;
 
 	/*
 	 * DPC disables the Link automatically in hardware, so it has
 	 * already been reset by the time we get here.
 	 */
-	devdpc = pcie_port_find_device(pdev, PCIE_PORT_SERVICE_DPC);
-	pciedev = to_pcie_device(devdpc);
-	dpc = get_service_data(pciedev);
+	dpc = to_dpc_dev(pdev);
 	cap = dpc->cap_pos;
 
 	/*
@@ -259,6 +306,8 @@ static int dpc_probe(struct pcie_device *dev)
 		FLAG(cap, PCI_EXP_DPC_CAP_POISONED_TLP),
 		FLAG(cap, PCI_EXP_DPC_CAP_SW_TRIGGER), dpc->rp_log_size,
 		FLAG(cap, PCI_EXP_DPC_CAP_DL_ACTIVE));
+
+	pci_add_ext_cap_save_buffer(pdev, PCI_EXT_CAP_ID_DPC, sizeof(u16));
 	return status;
 }
 
-- 
2.14.4



* [PATCHv4 04/12] PCI: AER: Take reference on error devices
  2018-09-20 16:27 [PATCHv4 00/12] pci error handling fixes Keith Busch
                   ` (2 preceding siblings ...)
  2018-09-20 16:27 ` [PATCHv4 03/12] PCI: DPC: Save and restore control state Keith Busch
@ 2018-09-20 16:27 ` Keith Busch
  2018-09-20 16:27 ` [PATCHv4 05/12] PCI: AER: Don't read upstream ports below fatal errors Keith Busch
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 28+ messages in thread
From: Keith Busch @ 2018-09-20 16:27 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Mika Westerberg, Keith Busch

Error handling may run in parallel with a hot removal. This patch takes
a reference on each device that AER handling tracks so the device cannot
be freed while AER still needs to reference it.
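
For background, the general pattern being applied is sketched below; the
helper name is hypothetical and only illustrates the reference counting,
it is not part of this patch:

  #include <linux/pci.h>

  /*
   * Pin the device while error handling holds a pointer to it so a
   * concurrent surprise removal cannot free the pci_dev underneath us.
   */
  static void example_handle_deferred(struct pci_dev *dev)
  {
  	dev = pci_dev_get(dev);		/* take a reference */
  	if (!dev)
  		return;

  	/* ... error handling that may outlive the physical device ... */

  	pci_dev_put(dev);		/* drop it once handling is done */
  }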

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/aer.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 637d638f73da..ffbbd759683c 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -866,7 +866,7 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity,
 static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
 {
 	if (e_info->error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
-		e_info->dev[e_info->error_dev_num] = dev;
+		e_info->dev[e_info->error_dev_num] = pci_dev_get(dev);
 		e_info->error_dev_num++;
 		return 0;
 	}
@@ -1013,6 +1013,7 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
 		pcie_do_nonfatal_recovery(dev);
 	else if (info->severity == AER_FATAL)
 		pcie_do_fatal_recovery(dev, PCIE_PORT_SERVICE_AER);
+	pci_dev_put(dev);
 }
 
 #ifdef CONFIG_ACPI_APEI_PCIEAER
-- 
2.14.4



* [PATCHv4 05/12] PCI: AER: Don't read upstream ports below fatal errors
  2018-09-20 16:27 [PATCHv4 00/12] pci error handling fixes Keith Busch
                   ` (3 preceding siblings ...)
  2018-09-20 16:27 ` [PATCHv4 04/12] PCI: AER: Take reference on error devices Keith Busch
@ 2018-09-20 16:27 ` Keith Busch
  2018-09-20 16:27 ` [PATCHv4 06/12] PCI: ERR: Use slot reset if available Keith Busch
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 28+ messages in thread
From: Keith Busch @ 2018-09-20 16:27 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Mika Westerberg, Keith Busch

The AER driver has never read the config space of an end device
that reported a fatal error because the link to that device is considered
unreliable.

An ERR_FATAL from an upstream port was almost certainly detected on
that port's upstream link, so we can't expect to reliably read its
config space either, for the same reason.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/aer.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index ffbbd759683c..5c3ea7254c6a 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1116,8 +1116,9 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
 			&info->mask);
 		if (!(info->status & ~info->mask))
 			return 0;
-	} else if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE ||
-		info->severity == AER_NONFATAL) {
+	} else if (pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT ||
+	           pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM ||
+		   info->severity == AER_NONFATAL) {
 
 		/* Link is still healthy for IO reads */
 		pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS,
-- 
2.14.4



* [PATCHv4 06/12] PCI: ERR: Use slot reset if available
  2018-09-20 16:27 [PATCHv4 00/12] pci error handling fixes Keith Busch
                   ` (4 preceding siblings ...)
  2018-09-20 16:27 ` [PATCHv4 05/12] PCI: AER: Don't read upstream ports below fatal errors Keith Busch
@ 2018-09-20 16:27 ` Keith Busch
  2018-09-20 16:27 ` [PATCHv4 07/12] PCI: ERR: Handle fatal error recovery Keith Busch
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 28+ messages in thread
From: Keith Busch @ 2018-09-20 16:27 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Mika Westerberg, Keith Busch

A secondary bus reset may have link side effects that a hotplug capable
port could incorrectly react to. This patch uses the slot-specific reset
for hotplug ports instead, avoiding the undesirable link down-up handling
during error recovery.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pci.c      | 33 +++++++++++++++++++++++++++++++++
 drivers/pci/pci.h      |  2 ++
 drivers/pci/pcie/aer.c |  2 +-
 drivers/pci/pcie/err.c |  2 +-
 drivers/pci/slot.c     |  2 +-
 5 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index bcb70b7c190c..ce4dafbfe2d1 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5169,6 +5169,39 @@ static int pci_bus_reset(struct pci_bus *bus, int probe)
 	return ret;
 }
 
+/**
+ * pci_bus_error_reset - reset the bridge's subordinate bus
+ * @bridge: The parent device that connects to the bus to reset
+ *
+ * This function will first try to reset the slots on this bus if the method is
+ * available. If slot reset fails or is not available, this will fall back to a
+ * secondary bus reset.
+ */
+int pci_bus_error_reset(struct pci_dev *bridge)
+{
+	struct pci_bus *bus = bridge->subordinate;
+
+	if (!bus)
+		return -ENOTTY;
+
+	mutex_lock(&pci_slot_mutex);
+	if (!list_empty(&bus->slots)) {
+		struct pci_slot *slot;
+
+		list_for_each_entry(slot, &bus->slots, list)
+			if (pci_probe_reset_slot(slot))
+				goto bus_reset;
+		list_for_each_entry(slot, &bus->slots, list)
+			if (pci_slot_reset(slot, 0))
+				goto bus_reset;
+	}
+	mutex_unlock(&pci_slot_mutex);
+	return 0;
+bus_reset:
+	mutex_unlock(&pci_slot_mutex);
+	return pci_bus_reset(bridge->subordinate, 0);
+}
+
 /**
  * pci_probe_reset_bus - probe whether a PCI bus can be reset
  * @bus: PCI bus to probe
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 5a5c6099b253..807e31e24fd0 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -35,6 +35,7 @@ int pci_mmap_fits(struct pci_dev *pdev, int resno, struct vm_area_struct *vmai,
 
 int pci_probe_reset_function(struct pci_dev *dev);
 int pci_bridge_secondary_bus_reset(struct pci_dev *dev);
+int pci_bus_error_reset(struct pci_dev *dev);
 
 /**
  * struct pci_platform_pm_ops - Firmware PM callbacks
@@ -136,6 +137,7 @@ static inline void pci_remove_legacy_files(struct pci_bus *bus) { return; }
 
 /* Lock for read/write access to pci device and bus lists */
 extern struct rw_semaphore pci_bus_sem;
+extern struct mutex pci_slot_mutex;
 
 extern raw_spinlock_t pci_lock;
 
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 5c3ea7254c6a..1563e22600ec 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1528,7 +1528,7 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
 	reg32 &= ~ROOT_PORT_INTR_ON_MESG_MASK;
 	pci_write_config_dword(dev, pos + PCI_ERR_ROOT_COMMAND, reg32);
 
-	rc = pci_bridge_secondary_bus_reset(dev);
+	rc = pci_bus_error_reset(dev);
 	pci_printk(KERN_DEBUG, dev, "Root Port link has been reset\n");
 
 	/* Clear Root Error Status */
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index cac406b6e936..62ab665f0f03 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -177,7 +177,7 @@ static pci_ers_result_t default_reset_link(struct pci_dev *dev)
 {
 	int rc;
 
-	rc = pci_bridge_secondary_bus_reset(dev);
+	rc = pci_bus_error_reset(dev);
 	pci_printk(KERN_DEBUG, dev, "downstream link has been reset\n");
 	return rc ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
 }
diff --git a/drivers/pci/slot.c b/drivers/pci/slot.c
index 145cd953b518..3da03fcc6fbf 100644
--- a/drivers/pci/slot.c
+++ b/drivers/pci/slot.c
@@ -14,7 +14,7 @@
 
 struct kset *pci_slots_kset;
 EXPORT_SYMBOL_GPL(pci_slots_kset);
-static DEFINE_MUTEX(pci_slot_mutex);
+DEFINE_MUTEX(pci_slot_mutex);
 
 static ssize_t pci_slot_attr_show(struct kobject *kobj,
 					struct attribute *attr, char *buf)
-- 
2.14.4



* [PATCHv4 07/12] PCI: ERR: Handle fatal error recovery
  2018-09-20 16:27 [PATCHv4 00/12] pci error handling fixes Keith Busch
                   ` (5 preceding siblings ...)
  2018-09-20 16:27 ` [PATCHv4 06/12] PCI: ERR: Use slot reset if available Keith Busch
@ 2018-09-20 16:27 ` Keith Busch
  2018-09-20 16:27 ` [PATCHv4 08/12] PCI: ERR: Always use the first downstream port Keith Busch
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 28+ messages in thread
From: Keith Busch @ 2018-09-20 16:27 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Mika Westerberg, Keith Busch

We don't need to be paranoid about the topology changing while handling
an error. If the device has changed in a hotplug capable slot, we can
rely on presence detection to react to the changing topology. This patch
restores the fatal error handling behavior that existed before DPC was
merged with AER in 7e9084b3674 ("PCI/AER: Handle ERR_FATAL with removal
and re-enumeration of devices").
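
With this change a reporting agent maps the error severity to a channel
state and calls a single entry point for both cases. A caller-side
sketch, assuming the internal drivers/pci headers; it mirrors the aer.c
hunk below and is shown only for illustration:

  /* Hypothetical caller: pick the channel state from the AER severity. */
  static void example_report_error(struct pci_dev *dev, int severity)
  {
  	enum pci_channel_state state = (severity == AER_FATAL) ?
  		pci_channel_io_frozen : pci_channel_io_normal;

  	pcie_do_recovery(dev, state, PCIE_PORT_SERVICE_AER);
  }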

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 Documentation/PCI/pci-error-recovery.txt | 35 +++++----------
 drivers/pci/pci.h                        |  4 +-
 drivers/pci/pcie/aer.c                   | 12 +++--
 drivers/pci/pcie/dpc.c                   |  4 +-
 drivers/pci/pcie/err.c                   | 75 +++-----------------------------
 5 files changed, 28 insertions(+), 102 deletions(-)

diff --git a/Documentation/PCI/pci-error-recovery.txt b/Documentation/PCI/pci-error-recovery.txt
index 688b69121e82..0b6bb3ef449e 100644
--- a/Documentation/PCI/pci-error-recovery.txt
+++ b/Documentation/PCI/pci-error-recovery.txt
@@ -110,7 +110,7 @@ The actual steps taken by a platform to recover from a PCI error
 event will be platform-dependent, but will follow the general
 sequence described below.
 
-STEP 0: Error Event: ERR_NONFATAL
+STEP 0: Error Event
 -------------------
 A PCI bus error is detected by the PCI hardware.  On powerpc, the slot
 is isolated, in that all I/O is blocked: all reads return 0xffffffff,
@@ -228,7 +228,13 @@ proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
 If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
 proceeds to STEP 4 (Slot Reset)
 
-STEP 3: Slot Reset
+STEP 3: Link Reset
+------------------
+The platform resets the link.  This is a PCI-Express specific step
+and is done whenever a fatal error has been detected that can be
+"solved" by resetting the link.
+
+STEP 4: Slot Reset
 ------------------
 
 In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
@@ -314,7 +320,7 @@ Failure).
 >>> However, it probably should.
 
 
-STEP 4: Resume Operations
+STEP 5: Resume Operations
 -------------------------
 The platform will call the resume() callback on all affected device
 drivers if all drivers on the segment have returned
@@ -326,7 +332,7 @@ a result code.
 At this point, if a new error happens, the platform will restart
 a new error recovery sequence.
 
-STEP 5: Permanent Failure
+STEP 6: Permanent Failure
 -------------------------
 A "permanent failure" has occurred, and the platform cannot recover
 the device.  The platform will call error_detected() with a
@@ -349,27 +355,6 @@ errors. See the discussion in powerpc/eeh-pci-error-recovery.txt
 for additional detail on real-life experience of the causes of
 software errors.
 
-STEP 0: Error Event: ERR_FATAL
--------------------
-PCI bus error is detected by the PCI hardware. On powerpc, the slot is
-isolated, in that all I/O is blocked: all reads return 0xffffffff, all
-writes are ignored.
-
-STEP 1: Remove devices
---------------------
-Platform removes the devices depending on the error agent, it could be
-this port for all subordinates or upstream component (likely downstream
-port)
-
-STEP 2: Reset link
---------------------
-The platform resets the link.  This is a PCI-Express specific step and is
-done whenever a fatal error has been detected that can be "solved" by
-resetting the link.
-
-STEP 3: Re-enumerate the devices
---------------------
-Initiates the re-enumeration.
 
 Conclusion; General Remarks
 ---------------------------
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 807e31e24fd0..028abe34a15b 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -433,8 +433,8 @@ static inline int pci_dev_specific_disable_acs_redir(struct pci_dev *dev)
 #endif
 
 /* PCI error reporting and recovery */
-void pcie_do_fatal_recovery(struct pci_dev *dev, u32 service);
-void pcie_do_nonfatal_recovery(struct pci_dev *dev);
+void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
+		      u32 service);
 
 bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
 #ifdef CONFIG_PCIEASPM
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 1563e22600ec..0619ec5d7bb5 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1010,9 +1010,11 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
 					info->status);
 		pci_aer_clear_device_status(dev);
 	} else if (info->severity == AER_NONFATAL)
-		pcie_do_nonfatal_recovery(dev);
+		pcie_do_recovery(dev, pci_channel_io_normal,
+				 PCIE_PORT_SERVICE_AER);
 	else if (info->severity == AER_FATAL)
-		pcie_do_fatal_recovery(dev, PCIE_PORT_SERVICE_AER);
+		pcie_do_recovery(dev, pci_channel_io_frozen,
+				 PCIE_PORT_SERVICE_AER);
 	pci_dev_put(dev);
 }
 
@@ -1048,9 +1050,11 @@ static void aer_recover_work_func(struct work_struct *work)
 		}
 		cper_print_aer(pdev, entry.severity, entry.regs);
 		if (entry.severity == AER_NONFATAL)
-			pcie_do_nonfatal_recovery(pdev);
+			pcie_do_recovery(pdev, pci_channel_io_normal,
+					 PCIE_PORT_SERVICE_AER);
 		else if (entry.severity == AER_FATAL)
-			pcie_do_fatal_recovery(pdev, PCIE_PORT_SERVICE_AER);
+			pcie_do_recovery(pdev, pci_channel_io_frozen,
+					 PCIE_PORT_SERVICE_AER);
 		pci_dev_put(pdev);
 	}
 }
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index ed815a28512e..23e063aefddf 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -216,7 +216,7 @@ static irqreturn_t dpc_handler(int irq, void *context)
 
 	reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN) >> 1;
 	ext_reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT) >> 5;
-	dev_warn(dev, "DPC %s detected, remove downstream devices\n",
+	dev_warn(dev, "DPC %s detected\n",
 		 (reason == 0) ? "unmasked uncorrectable error" :
 		 (reason == 1) ? "ERR_NONFATAL" :
 		 (reason == 2) ? "ERR_FATAL" :
@@ -233,7 +233,7 @@ static irqreturn_t dpc_handler(int irq, void *context)
 	}
 
 	/* We configure DPC so it only triggers on ERR_FATAL */
-	pcie_do_fatal_recovery(pdev, PCIE_PORT_SERVICE_DPC);
+	pcie_do_recovery(pdev, pci_channel_io_frozen, PCIE_PORT_SERVICE_DPC);
 
 	return IRQ_HANDLED;
 }
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 62ab665f0f03..644f3f725ef0 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -271,83 +271,20 @@ static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
 	return result_data.result;
 }
 
-/**
- * pcie_do_fatal_recovery - handle fatal error recovery process
- * @dev: pointer to a pci_dev data structure of agent detecting an error
- *
- * Invoked when an error is fatal. Once being invoked, removes the devices
- * beneath this AER agent, followed by reset link e.g. secondary bus reset
- * followed by re-enumeration of devices.
- */
-void pcie_do_fatal_recovery(struct pci_dev *dev, u32 service)
-{
-	struct pci_dev *udev;
-	struct pci_bus *parent;
-	struct pci_dev *pdev, *temp;
-	pci_ers_result_t result;
-
-	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)
-		udev = dev;
-	else
-		udev = dev->bus->self;
-
-	parent = udev->subordinate;
-	pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
-
-	pci_lock_rescan_remove();
-	pci_dev_get(dev);
-	list_for_each_entry_safe_reverse(pdev, temp, &parent->devices,
-					 bus_list) {
-		pci_stop_and_remove_bus_device(pdev);
-	}
-
-	result = reset_link(udev, service);
-
-	if ((service == PCIE_PORT_SERVICE_AER) &&
-	    (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)) {
-		/*
-		 * If the error is reported by a bridge, we think this error
-		 * is related to the downstream link of the bridge, so we
-		 * do error recovery on all subordinates of the bridge instead
-		 * of the bridge and clear the error status of the bridge.
-		 */
-		pci_aer_clear_fatal_status(dev);
-		pci_aer_clear_device_status(dev);
-	}
-
-	if (result == PCI_ERS_RESULT_RECOVERED) {
-		if (pcie_wait_for_link(udev, true))
-			pci_rescan_bus(udev->bus);
-		pci_info(dev, "Device recovery from fatal error successful\n");
-	} else {
-		pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
-		pci_info(dev, "Device recovery from fatal error failed\n");
-	}
-
-	pci_dev_put(dev);
-	pci_unlock_rescan_remove();
-}
-
-/**
- * pcie_do_nonfatal_recovery - handle nonfatal error recovery process
- * @dev: pointer to a pci_dev data structure of agent detecting an error
- *
- * Invoked when an error is nonfatal/fatal. Once being invoked, broadcast
- * error detected message to all downstream drivers within a hierarchy in
- * question and return the returned code.
- */
-void pcie_do_nonfatal_recovery(struct pci_dev *dev)
+void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
+		      u32 service)
 {
 	pci_ers_result_t status;
-	enum pci_channel_state state;
-
-	state = pci_channel_io_normal;
 
 	status = broadcast_error_message(dev,
 			state,
 			"error_detected",
 			report_error_detected);
 
+	if (state == pci_channel_io_frozen &&
+	    reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED)
+		goto failed;
+
 	if (status == PCI_ERS_RESULT_CAN_RECOVER)
 		status = broadcast_error_message(dev,
 				state,
-- 
2.14.4



* [PATCHv4 08/12] PCI: ERR: Always use the first downstream port
  2018-09-20 16:27 [PATCHv4 00/12] pci error handling fixes Keith Busch
                   ` (6 preceding siblings ...)
  2018-09-20 16:27 ` [PATCHv4 07/12] PCI: ERR: Handle fatal error recovery Keith Busch
@ 2018-09-20 16:27 ` Keith Busch
  2018-09-26 22:01   ` Bjorn Helgaas
  2018-09-20 16:27 ` [PATCHv4 09/12] PCI: ERR: Simplify broadcast callouts Keith Busch
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 28+ messages in thread
From: Keith Busch @ 2018-09-20 16:27 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Mika Westerberg, Keith Busch

The link reset always used the first bridge device, but AER broadcast
error handling may have reported an end device. This means the reset may
hit devices that were never notified of the impending error recovery.

This patch runs recovery from the first downstream port in the hierarchy
that can be considered reliable. An error detected by a switch upstream
port should mean the error occurred on that port's upstream link, so the
patch selects the parent device whenever the reporting device is not a
root or downstream port.

This allows two other clean-ups. First, error handling can only run on
bridges, so this patch removes the checks for end devices. Second, the
first accessible port does not inherit the channel error state, since we
can still access it, so the special cases for error detection and resume
are no longer necessary.
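
The selection logic reads roughly as the helper below; the function name
is made up for illustration, and the real change open-codes the check in
pcie_do_recovery():

  #include <linux/pci.h>

  /*
   * Move up to the parent bridge unless the reporting device is already
   * a Root Port or Switch Downstream Port; that port's secondary bus is
   * the domain error recovery operates on.
   */
  static struct pci_dev *example_first_downstream_port(struct pci_dev *dev)
  {
  	if (pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT ||
  	    pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM)
  		return dev;

  	return dev->bus->self;	/* the bridge leading to this device */
  }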

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/err.c | 85 +++++++++++++-------------------------------------
 1 file changed, 21 insertions(+), 64 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 644f3f725ef0..0fa5e1417a4a 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -63,30 +63,12 @@ static int report_error_detected(struct pci_dev *dev, void *data)
 	if (!dev->driver ||
 		!dev->driver->err_handler ||
 		!dev->driver->err_handler->error_detected) {
-		if (result_data->state == pci_channel_io_frozen &&
-			dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
-			/*
-			 * In case of fatal recovery, if one of down-
-			 * stream device has no driver. We might be
-			 * unable to recover because a later insmod
-			 * of a driver for this device is unaware of
-			 * its hw state.
-			 */
-			pci_printk(KERN_DEBUG, dev, "device has %s\n",
-				   dev->driver ?
-				   "no AER-aware driver" : "no driver");
-		}
-
 		/*
-		 * If there's any device in the subtree that does not
-		 * have an error_detected callback, returning
-		 * PCI_ERS_RESULT_NO_AER_DRIVER prevents calling of
-		 * the subsequent mmio_enabled/slot_reset/resume
-		 * callbacks of "any" device in the subtree. All the
-		 * devices in the subtree are left in the error state
-		 * without recovery.
+		 * If any device in the subtree does not have an error_detected
+		 * callback, PCI_ERS_RESULT_NO_AER_DRIVER prevents subsequent
+		 * error callbacks of "any" device in the subtree, and will
+		 * exit in the disconnected error state.
 		 */
-
 		if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
 			vote = PCI_ERS_RESULT_NO_AER_DRIVER;
 		else
@@ -184,34 +166,23 @@ static pci_ers_result_t default_reset_link(struct pci_dev *dev)
 
 static pci_ers_result_t reset_link(struct pci_dev *dev, u32 service)
 {
-	struct pci_dev *udev;
 	pci_ers_result_t status;
 	struct pcie_port_service_driver *driver = NULL;
 
-	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
-		/* Reset this port for all subordinates */
-		udev = dev;
-	} else {
-		/* Reset the upstream component (likely downstream port) */
-		udev = dev->bus->self;
-	}
-
-	/* Use the aer driver of the component firstly */
-	driver = pcie_port_find_service(udev, service);
-
+	driver = pcie_port_find_service(dev, service);
 	if (driver && driver->reset_link) {
-		status = driver->reset_link(udev);
-	} else if (udev->has_secondary_link) {
-		status = default_reset_link(udev);
+		status = driver->reset_link(dev);
+	} else if (dev->has_secondary_link) {
+		status = default_reset_link(dev);
 	} else {
 		pci_printk(KERN_DEBUG, dev, "no link-reset support at upstream device %s\n",
-			pci_name(udev));
+			pci_name(dev));
 		return PCI_ERS_RESULT_DISCONNECT;
 	}
 
 	if (status != PCI_ERS_RESULT_RECOVERED) {
 		pci_printk(KERN_DEBUG, dev, "link reset at upstream device %s failed\n",
-			pci_name(udev));
+			pci_name(dev));
 		return PCI_ERS_RESULT_DISCONNECT;
 	}
 
@@ -243,31 +214,7 @@ static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
 	else
 		result_data.result = PCI_ERS_RESULT_RECOVERED;
 
-	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
-		/*
-		 * If the error is reported by a bridge, we think this error
-		 * is related to the downstream link of the bridge, so we
-		 * do error recovery on all subordinates of the bridge instead
-		 * of the bridge and clear the error status of the bridge.
-		 */
-		if (cb == report_error_detected)
-			dev->error_state = state;
-		pci_walk_bus(dev->subordinate, cb, &result_data);
-		if (cb == report_resume) {
-			pci_aer_clear_device_status(dev);
-			pci_cleanup_aer_uncorrect_error_status(dev);
-			dev->error_state = pci_channel_io_normal;
-		}
-	} else {
-		/*
-		 * If the error is reported by an end point, we think this
-		 * error is related to the upstream link of the end point.
-		 * The error is non fatal so the bus is ok; just invoke
-		 * the callback for the function that logged the error.
-		 */
-		cb(dev, &result_data);
-	}
-
+	pci_walk_bus(dev->subordinate, cb, &result_data);
 	return result_data.result;
 }
 
@@ -276,6 +223,14 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
 {
 	pci_ers_result_t status;
 
+	/*
+	 * Error recovery runs on all subordinates of the first downstream port.
+	 * If the downstream port detected the error, it is cleared at the end.
+	 */
+	if (!(pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT ||
+	      pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM))
+		dev = dev->bus->self;
+
 	status = broadcast_error_message(dev,
 			state,
 			"error_detected",
@@ -311,6 +266,8 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
 				"resume",
 				report_resume);
 
+	pci_aer_clear_device_status(dev);
+	pci_cleanup_aer_uncorrect_error_status(dev);
 	pci_info(dev, "AER: Device recovery successful\n");
 	return;
 
-- 
2.14.4



* [PATCHv4 09/12] PCI: ERR: Simplify broadcast callouts
  2018-09-20 16:27 [PATCHv4 00/12] pci error handling fixes Keith Busch
                   ` (7 preceding siblings ...)
  2018-09-20 16:27 ` [PATCHv4 08/12] PCI: ERR: Always use the first downstream port Keith Busch
@ 2018-09-20 16:27 ` Keith Busch
  2018-09-20 16:27 ` [PATCHv4 10/12] PCI: ERR: Report current recovery status for udev Keith Busch
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 28+ messages in thread
From: Keith Busch @ 2018-09-20 16:27 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Mika Westerberg, Keith Busch

There is no point in having a generic broadcast function if it needs
to have special cases for each callback it broadcasts. This patch
abstracts the error broadcast to only the necessary information and
removes the now unnecessary helper to walk the bus.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/err.c | 107 ++++++++++++++++++-------------------------------
 1 file changed, 38 insertions(+), 69 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 0fa5e1417a4a..362a717c831a 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -19,11 +19,6 @@
 #include "portdrv.h"
 #include "../pci.h"
 
-struct aer_broadcast_data {
-	enum pci_channel_state state;
-	enum pci_ers_result result;
-};
-
 static pci_ers_result_t merge_result(enum pci_ers_result orig,
 				  enum pci_ers_result new)
 {
@@ -49,16 +44,15 @@ static pci_ers_result_t merge_result(enum pci_ers_result orig,
 	return orig;
 }
 
-static int report_error_detected(struct pci_dev *dev, void *data)
+static int report_error_detected(struct pci_dev *dev,
+				 enum pci_channel_state state,
+				 enum pci_ers_result *result)
 {
 	pci_ers_result_t vote;
 	const struct pci_error_handlers *err_handler;
-	struct aer_broadcast_data *result_data;
-
-	result_data = (struct aer_broadcast_data *) data;
 
 	device_lock(&dev->dev);
-	dev->error_state = result_data->state;
+	dev->error_state = state;
 
 	if (!dev->driver ||
 		!dev->driver->err_handler ||
@@ -75,22 +69,29 @@ static int report_error_detected(struct pci_dev *dev, void *data)
 			vote = PCI_ERS_RESULT_NONE;
 	} else {
 		err_handler = dev->driver->err_handler;
-		vote = err_handler->error_detected(dev, result_data->state);
+		vote = err_handler->error_detected(dev, state);
 		pci_uevent_ers(dev, PCI_ERS_RESULT_NONE);
 	}
 
-	result_data->result = merge_result(result_data->result, vote);
+	*result = merge_result(*result, vote);
 	device_unlock(&dev->dev);
 	return 0;
 }
 
+static int report_frozen_detected(struct pci_dev *dev, void *data)
+{
+	return report_error_detected(dev, pci_channel_io_frozen, data);
+}
+
+static int report_normal_detected(struct pci_dev *dev, void *data)
+{
+	return report_error_detected(dev, pci_channel_io_normal, data);
+}
+
 static int report_mmio_enabled(struct pci_dev *dev, void *data)
 {
-	pci_ers_result_t vote;
+	pci_ers_result_t vote, *result = data;
 	const struct pci_error_handlers *err_handler;
-	struct aer_broadcast_data *result_data;
-
-	result_data = (struct aer_broadcast_data *) data;
 
 	device_lock(&dev->dev);
 	if (!dev->driver ||
@@ -100,7 +101,7 @@ static int report_mmio_enabled(struct pci_dev *dev, void *data)
 
 	err_handler = dev->driver->err_handler;
 	vote = err_handler->mmio_enabled(dev);
-	result_data->result = merge_result(result_data->result, vote);
+	*result = merge_result(*result, vote);
 out:
 	device_unlock(&dev->dev);
 	return 0;
@@ -108,11 +109,8 @@ static int report_mmio_enabled(struct pci_dev *dev, void *data)
 
 static int report_slot_reset(struct pci_dev *dev, void *data)
 {
-	pci_ers_result_t vote;
+	pci_ers_result_t vote, *result = data;
 	const struct pci_error_handlers *err_handler;
-	struct aer_broadcast_data *result_data;
-
-	result_data = (struct aer_broadcast_data *) data;
 
 	device_lock(&dev->dev);
 	if (!dev->driver ||
@@ -122,7 +120,7 @@ static int report_slot_reset(struct pci_dev *dev, void *data)
 
 	err_handler = dev->driver->err_handler;
 	vote = err_handler->slot_reset(dev);
-	result_data->result = merge_result(result_data->result, vote);
+	*result = merge_result(*result, vote);
 out:
 	device_unlock(&dev->dev);
 	return 0;
@@ -189,39 +187,11 @@ static pci_ers_result_t reset_link(struct pci_dev *dev, u32 service)
 	return status;
 }
 
-/**
- * broadcast_error_message - handle message broadcast to downstream drivers
- * @dev: pointer to from where in a hierarchy message is broadcasted down
- * @state: error state
- * @error_mesg: message to print
- * @cb: callback to be broadcasted
- *
- * Invoked during error recovery process. Once being invoked, the content
- * of error severity will be broadcasted to all downstream drivers in a
- * hierarchy in question.
- */
-static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
-	enum pci_channel_state state,
-	char *error_mesg,
-	int (*cb)(struct pci_dev *, void *))
-{
-	struct aer_broadcast_data result_data;
-
-	pci_printk(KERN_DEBUG, dev, "broadcast %s message\n", error_mesg);
-	result_data.state = state;
-	if (cb == report_error_detected)
-		result_data.result = PCI_ERS_RESULT_CAN_RECOVER;
-	else
-		result_data.result = PCI_ERS_RESULT_RECOVERED;
-
-	pci_walk_bus(dev->subordinate, cb, &result_data);
-	return result_data.result;
-}
-
 void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
 		      u32 service)
 {
-	pci_ers_result_t status;
+	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
+	struct pci_bus *bus;
 
 	/*
 	 * Error recovery runs on all subordinates of the first downstream port.
@@ -230,21 +200,23 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
 	if (!(pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT ||
 	      pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM))
 		dev = dev->bus->self;
+	bus = dev->subordinate;
 
-	status = broadcast_error_message(dev,
-			state,
-			"error_detected",
-			report_error_detected);
+	pci_dbg(dev, "broadcast error_detected message\n");
+	if (state == pci_channel_io_frozen)
+		pci_walk_bus(bus, report_frozen_detected, &status);
+	else
+		pci_walk_bus(bus, report_normal_detected, &status);
 
 	if (state == pci_channel_io_frozen &&
 	    reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED)
 		goto failed;
 
-	if (status == PCI_ERS_RESULT_CAN_RECOVER)
-		status = broadcast_error_message(dev,
-				state,
-				"mmio_enabled",
-				report_mmio_enabled);
+	if (status == PCI_ERS_RESULT_CAN_RECOVER) {
+		status = PCI_ERS_RESULT_RECOVERED;
+		pci_dbg(dev, "broadcast mmio_enabled message\n");
+		pci_walk_bus(bus, report_mmio_enabled, &status);
+	}
 
 	if (status == PCI_ERS_RESULT_NEED_RESET) {
 		/*
@@ -252,19 +224,16 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
 		 * functions to reset slot before calling
 		 * drivers' slot_reset callbacks?
 		 */
-		status = broadcast_error_message(dev,
-				state,
-				"slot_reset",
-				report_slot_reset);
+		status = PCI_ERS_RESULT_RECOVERED;
+		pci_dbg(dev, "broadcast slot_reset message\n");
+		pci_walk_bus(bus, report_slot_reset, &status);
 	}
 
 	if (status != PCI_ERS_RESULT_RECOVERED)
 		goto failed;
 
-	broadcast_error_message(dev,
-				state,
-				"resume",
-				report_resume);
+	pci_dbg(dev, "broadcast resume message\n");
+	pci_walk_bus(bus, report_resume, &status);
 
 	pci_aer_clear_device_status(dev);
 	pci_cleanup_aer_uncorrect_error_status(dev);
-- 
2.14.4



* [PATCHv4 10/12] PCI: ERR: Report current recovery status for udev
  2018-09-20 16:27 [PATCHv4 00/12] pci error handling fixes Keith Busch
                   ` (8 preceding siblings ...)
  2018-09-20 16:27 ` [PATCHv4 09/12] PCI: ERR: Simplify broadcast callouts Keith Busch
@ 2018-09-20 16:27 ` Keith Busch
  2018-09-20 16:27 ` [PATCHv4 11/12] PCI: Unify device inaccessible Keith Busch
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 28+ messages in thread
From: Keith Busch @ 2018-09-20 16:27 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Mika Westerberg, Keith Busch

A device still participates in error recovery even if it doesn't
implement the error callbacks. This patch reports the current recovery
status to udev watchers for such devices as well.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/err.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 362a717c831a..31e8a4314384 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -70,9 +70,8 @@ static int report_error_detected(struct pci_dev *dev,
 	} else {
 		err_handler = dev->driver->err_handler;
 		vote = err_handler->error_detected(dev, state);
-		pci_uevent_ers(dev, PCI_ERS_RESULT_NONE);
 	}
-
+	pci_uevent_ers(dev, vote);
 	*result = merge_result(*result, vote);
 	device_unlock(&dev->dev);
 	return 0;
@@ -140,8 +139,8 @@ static int report_resume(struct pci_dev *dev, void *data)
 
 	err_handler = dev->driver->err_handler;
 	err_handler->resume(dev);
-	pci_uevent_ers(dev, PCI_ERS_RESULT_RECOVERED);
 out:
+	pci_uevent_ers(dev, PCI_ERS_RESULT_RECOVERED);
 	device_unlock(&dev->dev);
 	return 0;
 }
-- 
2.14.4



* [PATCHv4 11/12] PCI: Unify device inaccessible
  2018-09-20 16:27 [PATCHv4 00/12] pci error handling fixes Keith Busch
                   ` (9 preceding siblings ...)
  2018-09-20 16:27 ` [PATCHv4 10/12] PCI: ERR: Report current recovery status for udev Keith Busch
@ 2018-09-20 16:27 ` Keith Busch
  2018-09-20 16:27 ` [PATCHv4 12/12] PCI: Make link active reporting detection generic Keith Busch
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 28+ messages in thread
From: Keith Busch @ 2018-09-20 16:27 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Mika Westerberg, Keith Busch

This patch brings surprise removals and permanent failures together so
we no longer need separate flags. The implementation enforces that error
handling cannot override a surprise removal's permanent channel failure.
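
For example, with the pci_dev_set_io_state() helper added below, a caller
that tries to freeze an already removed device is refused. A usage sketch
(the wrapper function is hypothetical, and drivers/pci/pci.h is assumed):

  static void example_mark_frozen(struct pci_dev *dev)
  {
  	device_lock(&dev->dev);
  	/* returns false if error_state is already pci_channel_io_perm_failure */
  	if (!pci_dev_set_io_state(dev, pci_channel_io_frozen))
  		pci_info(dev, "already disconnected, not marking frozen\n");
  	device_unlock(&dev->dev);
  }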

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pci.h      | 60 +++++++++++++++++++++++++++++++++++++++++++++-----
 drivers/pci/pcie/err.c | 10 ++++-----
 2 files changed, 59 insertions(+), 11 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 028abe34a15b..a244bd0c5ca7 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -295,21 +295,71 @@ struct pci_sriov {
 	bool		drivers_autoprobe; /* Auto probing of VFs by driver */
 };
 
-/* pci_dev priv_flags */
-#define PCI_DEV_DISCONNECTED 0
-#define PCI_DEV_ADDED 1
+/**
+ * pci_dev_set_io_state - Set the new error state if possible.
+ *
+ * @dev - pci device to set new error_state
+ * @new - the state we want dev to be in
+ *
+ * Must be called with device_lock held.
+ *
+ * Returns true if state has been changed to the requested state.
+ */
+static inline bool pci_dev_set_io_state(struct pci_dev *dev,
+					pci_channel_state_t new)
+{
+	bool changed = false;
+
+	device_lock_assert(&dev->dev);
+	switch (new) {
+	case pci_channel_io_perm_failure:
+		switch (dev->error_state) {
+		case pci_channel_io_frozen:
+		case pci_channel_io_normal:
+		case pci_channel_io_perm_failure:
+			changed = true;
+			break;
+		}
+		break;
+	case pci_channel_io_frozen:
+		switch (dev->error_state) {
+		case pci_channel_io_frozen:
+		case pci_channel_io_normal:
+			changed = true;
+			break;
+		}
+		break;
+	case pci_channel_io_normal:
+		switch (dev->error_state) {
+		case pci_channel_io_frozen:
+		case pci_channel_io_normal:
+			changed = true;
+			break;
+		}
+		break;
+	}
+	if (changed)
+		dev->error_state = new;
+	return changed;
+}
 
 static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
 {
-	set_bit(PCI_DEV_DISCONNECTED, &dev->priv_flags);
+	device_lock(&dev->dev);
+	pci_dev_set_io_state(dev, pci_channel_io_perm_failure);
+	device_unlock(&dev->dev);
+
 	return 0;
 }
 
 static inline bool pci_dev_is_disconnected(const struct pci_dev *dev)
 {
-	return test_bit(PCI_DEV_DISCONNECTED, &dev->priv_flags);
+	return dev->error_state == pci_channel_io_perm_failure;
 }
 
+/* pci_dev priv_flags */
+#define PCI_DEV_ADDED 0
+
 static inline void pci_dev_assign_added(struct pci_dev *dev, bool added)
 {
 	assign_bit(PCI_DEV_ADDED, &dev->priv_flags, added);
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 31e8a4314384..4da2a62b4f77 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -52,9 +52,8 @@ static int report_error_detected(struct pci_dev *dev,
 	const struct pci_error_handlers *err_handler;
 
 	device_lock(&dev->dev);
-	dev->error_state = state;
-
-	if (!dev->driver ||
+	if (!pci_dev_set_io_state(dev, state) ||
+		!dev->driver ||
 		!dev->driver->err_handler ||
 		!dev->driver->err_handler->error_detected) {
 		/*
@@ -130,9 +129,8 @@ static int report_resume(struct pci_dev *dev, void *data)
 	const struct pci_error_handlers *err_handler;
 
 	device_lock(&dev->dev);
-	dev->error_state = pci_channel_io_normal;
-
-	if (!dev->driver ||
+	if (!pci_dev_set_io_state(dev, pci_channel_io_normal) ||
+		!dev->driver ||
 		!dev->driver->err_handler ||
 		!dev->driver->err_handler->resume)
 		goto out;
-- 
2.14.4



* [PATCHv4 12/12] PCI: Make link active reporting detection generic
  2018-09-20 16:27 [PATCHv4 00/12] pci error handling fixes Keith Busch
                   ` (10 preceding siblings ...)
  2018-09-20 16:27 ` [PATCHv4 11/12] PCI: Unify device inaccessible Keith Busch
@ 2018-09-20 16:27 ` Keith Busch
  2018-09-20 20:00 ` [PATCHv4 00/12] pci error handling fixes Sinan Kaya
  2018-09-20 21:17 ` Bjorn Helgaas
  13 siblings, 0 replies; 28+ messages in thread
From: Keith Busch @ 2018-09-20 16:27 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Mika Westerberg, Keith Busch

The spec has timing requirements when waiting for a link to become
active after a conventional reset. This patch implements those delays
when waiting for an active link so the pciehp and DPC drivers don't
need to duplicate them.

Since hotplug and DPC capable downstream ports exist that do not
implement Data Link Layer Link Active Reporting, the PCI core will wait
a fixed, recommended time when the device reports that it lacks link
active reporting.
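
A caller-side sketch of the resulting contract; the helper name is
hypothetical, drivers/pci/pci.h is assumed, and the delays themselves
live inside pcie_wait_for_link():

  /*
   * After resetting the link below @bridge, wait per the spec before
   * touching config space of downstream devices.  pcie_wait_for_link()
   * applies the fixed delay for ports without link active reporting and
   * the 20 ms + poll + 100 ms sequence for ports that have it.
   */
  static int example_wait_then_enumerate(struct pci_dev *bridge)
  {
  	if (!pcie_wait_for_link(bridge, true))
  		return -ENODEV;	/* device removed or link permanently down */

  	/* configuration requests downstream are now allowed */
  	return 0;
  }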

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/hotplug/pciehp.h     |  6 ------
 drivers/pci/hotplug/pciehp_hpc.c | 22 ++--------------------
 drivers/pci/pci.c                | 33 +++++++++++++++++++++++++++------
 drivers/pci/pcie/dpc.c           |  4 +++-
 drivers/pci/probe.c              |  1 +
 include/linux/pci.h              |  1 +
 6 files changed, 34 insertions(+), 33 deletions(-)

diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
index 3740f1a759c5..75fd52571107 100644
--- a/drivers/pci/hotplug/pciehp.h
+++ b/drivers/pci/hotplug/pciehp.h
@@ -62,11 +62,6 @@ do {									\
  * struct controller - PCIe hotplug controller
  * @pcie: pointer to the controller's PCIe port service device
  * @slot_cap: cached copy of the Slot Capabilities register
- * @link_active_reporting: cached copy of Data Link Layer Link Active Reporting
- *	Capable bit in Link Capabilities register; if this bit is zero, the
- *	Data Link Layer Link Active bit in the Link Status register will never
- *	be set and the driver is thus confined to wait 1 second before assuming
- *	the link to a hotplugged device is up and accessing it
  * @slot_ctrl: cached copy of the Slot Control register
  * @ctrl_lock: serializes writes to the Slot Control register
  * @cmd_started: jiffies when the Slot Control register was last written;
@@ -103,7 +98,6 @@ struct controller {
 	struct pcie_device *pcie;
 
 	u32 slot_cap;				/* capabilities and quirks */
-	unsigned int link_active_reporting:1;
 
 	u16 slot_ctrl;				/* control register access */
 	struct mutex ctrl_lock;
diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 7b5f9db60d9a..f0f3f4a3dac4 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -214,13 +214,6 @@ bool pciehp_check_link_active(struct controller *ctrl)
 	return ret;
 }
 
-static void pcie_wait_link_active(struct controller *ctrl)
-{
-	struct pci_dev *pdev = ctrl_dev(ctrl);
-
-	pcie_wait_for_link(pdev, true);
-}
-
 static bool pci_bus_check_dev(struct pci_bus *bus, int devfn)
 {
 	u32 l;
@@ -253,18 +246,9 @@ int pciehp_check_link_status(struct controller *ctrl)
 	bool found;
 	u16 lnk_status;
 
-	/*
-	 * Data Link Layer Link Active Reporting must be capable for
-	 * hot-plug capable downstream port. But old controller might
-	 * not implement it. In this case, we wait for 1000 ms.
-	*/
-	if (ctrl->link_active_reporting)
-		pcie_wait_link_active(ctrl);
-	else
-		msleep(1000);
+	if (!pcie_wait_for_link(pdev, true))
+		return -1;
 
-	/* wait 100ms before read pci conf, and try in 1s */
-	msleep(100);
 	found = pci_bus_check_dev(ctrl->pcie->port->subordinate,
 					PCI_DEVFN(0, 0));
 
@@ -865,8 +849,6 @@ struct controller *pcie_init(struct pcie_device *dev)
 
 	/* Check if Data Link Layer Link Active Reporting is implemented */
 	pcie_capability_read_dword(pdev, PCI_EXP_LNKCAP, &link_cap);
-	if (link_cap & PCI_EXP_LNKCAP_DLLLARC)
-		ctrl->link_active_reporting = 1;
 
 	/* Clear all remaining event bits in Slot Status register. */
 	pcie_capability_write_word(pdev, PCI_EXP_SLTSTA,
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index ce4dafbfe2d1..2b4117011313 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4501,21 +4501,42 @@ bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
 	bool ret;
 	u16 lnk_status;
 
+	/*
+	 * Some controllers might not implement link active reporting. In this
+	 * case, we wait for 1000 + 100 ms.
+	 */
+	if (!pdev->link_active_reporting) {
+		msleep(1100);
+		return true;
+	}
+
+	/*
+	 * PCIe 4.0r1 6.6.1, a component must enter LTSSM Detect within 20ms,
+	 * after which we should expect the link to be active if the reset
+	 * was successful. If so, software must wait a minimum of 100ms before
+	 * sending configuration requests to devices downstream of this port.
+	 *
+	 * If the link fails to activate, either the device was physically
+	 * removed or the link is permanently failed.
+	 */
+	if (active)
+		msleep(20);
 	for (;;) {
 		pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status);
 		ret = !!(lnk_status & PCI_EXP_LNKSTA_DLLLA);
 		if (ret == active)
-			return true;
+			break;
 		if (timeout <= 0)
 			break;
 		msleep(10);
 		timeout -= 10;
 	}
-
-	pci_info(pdev, "Data Link Layer Link Active not %s in 1000 msec\n",
-		 active ? "set" : "cleared");
-
-	return false;
+	if (active && ret)
+		msleep(100);
+	else if (ret != active)
+		pci_info(pdev, "Data Link Layer Link Active not %s in 1000 msec\n",
+			active ? "set" : "cleared");
+	return ret == active;
 }
 
 void pci_reset_secondary_bus(struct pci_dev *dev)
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index 23e063aefddf..e435d12e61a0 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -140,10 +140,12 @@ static pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
 	pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
 			      PCI_EXP_DPC_STATUS_TRIGGER);
 
+	if (!pcie_wait_for_link(pdev, true))
+		return PCI_ERS_RESULT_DISCONNECT;
+
 	return PCI_ERS_RESULT_RECOVERED;
 }
 
-
 static void dpc_process_rp_pio_error(struct dpc_dev *dpc)
 {
 	struct device *dev = &dpc->dev->device;
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 7c422ccbf9b4..ea2a6289e513 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -713,6 +713,7 @@ static void pci_set_bus_speed(struct pci_bus *bus)
 
 		pcie_capability_read_dword(bridge, PCI_EXP_LNKCAP, &linkcap);
 		bus->max_bus_speed = pcie_link_speed[linkcap & PCI_EXP_LNKCAP_SLS];
+		bridge->link_active_reporting = !!(linkcap & PCI_EXP_LNKCAP_DLLLARC);
 
 		pcie_capability_read_word(bridge, PCI_EXP_LNKSTA, &linksta);
 		pcie_update_link_speed(bus, linksta);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 6925828f9f25..896b42032ec5 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -402,6 +402,7 @@ struct pci_dev {
 	unsigned int	has_secondary_link:1;
 	unsigned int	non_compliant_bars:1;	/* Broken BARs; ignore them */
 	unsigned int	is_probed:1;		/* Device probing in progress */
+	unsigned int	link_active_reporting:1;/* Device capable of reporting link active */
 	pci_dev_flags_t dev_flags;
 	atomic_t	enable_cnt;	/* pci_enable_device has been called */
 
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 03/12] PCI: DPC: Save and restore control state
  2018-09-20 16:27 ` [PATCHv4 03/12] PCI: DPC: Save and restore control state Keith Busch
@ 2018-09-20 19:46   ` Sinan Kaya
  2018-09-20 19:47     ` Sinan Kaya
  0 siblings, 1 reply; 28+ messages in thread
From: Sinan Kaya @ 2018-09-20 19:46 UTC (permalink / raw)
  To: Keith Busch, Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Thomas Tai, poza, Lukas Wunner,
	Christoph Hellwig, Mika Westerberg

On 9/20/2018 12:27 PM, Keith Busch wrote:
>   	/*
>   	 * DPC disables the Link automatically in hardware, so it has
>   	 * already been reset by the time we get here.
>   	 */
> -	devdpc = pcie_port_find_device(pdev, PCIE_PORT_SERVICE_DPC);
> -	pciedev = to_pcie_device(devdpc);
> -	dpc = get_service_data(pciedev);
> +	dpc = to_dpc_dev(pdev);

I thought that the struct pci_dev sent here is the bridge and we
need to locate the struct device of the DPC object here.

Isn't this change wrong now?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 03/12] PCI: DPC: Save and restore control state
  2018-09-20 19:46   ` Sinan Kaya
@ 2018-09-20 19:47     ` Sinan Kaya
  2018-09-20 19:54       ` Keith Busch
  0 siblings, 1 reply; 28+ messages in thread
From: Sinan Kaya @ 2018-09-20 19:47 UTC (permalink / raw)
  To: Keith Busch, Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Thomas Tai, poza, Lukas Wunner,
	Christoph Hellwig, Mika Westerberg

On 9/20/2018 3:46 PM, Sinan Kaya wrote:
> On 9/20/2018 12:27 PM, Keith Busch wrote:
>>       /*
>>        * DPC disables the Link automatically in hardware, so it has
>>        * already been reset by the time we get here.
>>        */
>> -    devdpc = pcie_port_find_device(pdev, PCIE_PORT_SERVICE_DPC);
>> -    pciedev = to_pcie_device(devdpc);
>> -    dpc = get_service_data(pciedev);
>> +    dpc = to_dpc_dev(pdev);
> 
> I thought that the struct pci_dev sent here is the bridge and we
> need to locate the struct device of the DPC object here.
> 
> Isn't this change wrong now?

Gosh, I should have looked 30 lines above. Nevermind.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 03/12] PCI: DPC: Save and restore control state
  2018-09-20 19:47     ` Sinan Kaya
@ 2018-09-20 19:54       ` Keith Busch
  0 siblings, 0 replies; 28+ messages in thread
From: Keith Busch @ 2018-09-20 19:54 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Thomas Tai,
	poza, Lukas Wunner, Christoph Hellwig, Mika Westerberg

On Thu, Sep 20, 2018 at 03:47:36PM -0400, Sinan Kaya wrote:
> On 9/20/2018 3:46 PM, Sinan Kaya wrote:
> > On 9/20/2018 12:27 PM, Keith Busch wrote:
> > >       /*
> > >        * DPC disables the Link automatically in hardware, so it has
> > >        * already been reset by the time we get here.
> > >        */
> > > -    devdpc = pcie_port_find_device(pdev, PCIE_PORT_SERVICE_DPC);
> > > -    pciedev = to_pcie_device(devdpc);
> > > -    dpc = get_service_data(pciedev);
> > > +    dpc = to_dpc_dev(pdev);
> > 
> > I thought that the struct pci_dev sent here is the bridge and we
> > need to locate the struct device of the DPC object here.
> > 
> > Isn't this change wrong now?
> 
> Gosh, I should have looked 30 lines above. Nevermind.

Yeah, this part should be equivalent to before. This patch just created
the opportunity to move this to a common helper, otherwise it would have
been duplicated two more times.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 00/12] pci error handling fixes
  2018-09-20 16:27 [PATCHv4 00/12] pci error handling fixes Keith Busch
                   ` (11 preceding siblings ...)
  2018-09-20 16:27 ` [PATCHv4 12/12] PCI: Make link active reporting detection generic Keith Busch
@ 2018-09-20 20:00 ` Sinan Kaya
  2018-09-20 21:17 ` Bjorn Helgaas
  13 siblings, 0 replies; 28+ messages in thread
From: Sinan Kaya @ 2018-09-20 20:00 UTC (permalink / raw)
  To: Keith Busch, Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Thomas Tai, poza, Lukas Wunner,
	Christoph Hellwig, Mika Westerberg

On 9/20/2018 12:27 PM, Keith Busch wrote:
> Changes since v3 are:
> 
>    Initialize port services directly from port driver (suggested by
>    Bjorn)
> 
>    Updated changelogs to match local style
> 
>    Added documentation update for ERR_FATAL behavior
> 
>    Included DPC save/restore state fix that was missing from previous
> 
>    Included AER upstream port fix that was mistakenly added to a
>    different patch set
> 
> Keith Busch (12):
>    PCI: portdrv: Initialize service drivers directly
>    PCI: portdrv: Restore pci state on slot reset
>    PCI: DPC: Save and restore control state
>    PCI: AER: Take reference on error devices
>    PCI: AER: Don't read upstream ports below fatal errors
>    PCI: ERR: Use slot reset if available
>    PCI: ERR: Handle fatal error recovery
>    PCI: ERR: Always use the first downstream port
>    PCI: ERR: Simplify broadcast callouts
>    PCI: ERR: Report current recovery status for udev
>    PCI: Unify device inaccessible
>    PCI: Make link active reporting detection generic

Reviewed-by: Sinan Kaya <okaya@kernel.org>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 00/12] pci error handling fixes
  2018-09-20 16:27 [PATCHv4 00/12] pci error handling fixes Keith Busch
                   ` (12 preceding siblings ...)
  2018-09-20 20:00 ` [PATCHv4 00/12] pci error handling fixes Sinan Kaya
@ 2018-09-20 21:17 ` Bjorn Helgaas
  13 siblings, 0 replies; 28+ messages in thread
From: Bjorn Helgaas @ 2018-09-20 21:17 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Lukas Wunner, Christoph Hellwig,
	Mika Westerberg

On Thu, Sep 20, 2018 at 10:27:05AM -0600, Keith Busch wrote:
> Changes since v3 are:
> 
>   Initialize port services directly from port driver (suggested by
>   Bjorn)
> 
>   Updated changelogs to match local style
> 
>   Added documentation update for ERR_FATAL behavior
> 
>   Included DPC save/restore state fix that was missing from previous
> 
>   Included AER upstream port fix that was mistakenly added to a
>   different patch set
> 
> Keith Busch (12):
>   PCI: portdrv: Initialize service drivers directly
>   PCI: portdrv: Restore pci state on slot reset
>   PCI: DPC: Save and restore control state
>   PCI: AER: Take reference on error devices
>   PCI: AER: Don't read upstream ports below fatal errors
>   PCI: ERR: Use slot reset if available
>   PCI: ERR: Handle fatal error recovery
>   PCI: ERR: Always use the first downstream port
>   PCI: ERR: Simplify broadcast callouts
>   PCI: ERR: Report current recovery status for udev
>   PCI: Unify device inaccessible
>   PCI: Make link active reporting detection generic
> 
>  Documentation/PCI/pci-error-recovery.txt |  35 ++--
>  drivers/pci/hotplug/pciehp.h             |   6 -
>  drivers/pci/hotplug/pciehp_core.c        |   3 +-
>  drivers/pci/hotplug/pciehp_hpc.c         |  22 +--
>  drivers/pci/pci.c                        |  68 +++++++-
>  drivers/pci/pci.h                        |  74 ++++++++-
>  drivers/pci/pcie/aer.c                   |  25 +--
>  drivers/pci/pcie/dpc.c                   |  72 ++++++--
>  drivers/pci/pcie/err.c                   | 276 ++++++++-----------------------
>  drivers/pci/pcie/pme.c                   |   3 +-
>  drivers/pci/pcie/portdrv.h               |  24 +++
>  drivers/pci/pcie/portdrv_pci.c           |  17 ++
>  drivers/pci/probe.c                      |   1 +
>  drivers/pci/slot.c                       |   2 +-
>  include/linux/pci.h                      |   1 +
>  15 files changed, 331 insertions(+), 298 deletions(-)

Applied with Sinan's reviewed-by to pci/hotplug for v4.20, thanks!

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 08/12] PCI: ERR: Always use the first downstream port
  2018-09-20 16:27 ` [PATCHv4 08/12] PCI: ERR: Always use the first downstream port Keith Busch
@ 2018-09-26 22:01   ` Bjorn Helgaas
  2018-09-26 22:19     ` Keith Busch
  0 siblings, 1 reply; 28+ messages in thread
From: Bjorn Helgaas @ 2018-09-26 22:01 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Lukas Wunner, Christoph Hellwig,
	Mika Westerberg

On Thu, Sep 20, 2018 at 10:27:13AM -0600, Keith Busch wrote:
> The link reset always used the first bridge device, but AER broadcast
> error handling may have reported an end device. This means the reset may
> hit devices that were never notified of the impending error recovery.
> 
> This patch uses the first downstream port in the hierarchy considered
> reliable. An error detected by a switch upstream port should mean it
> occurred on its upstream link, so the patch selects the parent device
> if the error is not a root or downstream port.

I'm not really clear on what "Always use the first downstream port"
means.  Always use it for *what*?

I already applied this, but if we can improve the changelog, I'll
gladly update it.

> This allows two other clean-ups. First, error handling can only run
> on bridges so this patch removes checks for end devices. Second, the
> first accessible port does not inherit the channel error state since we
> can access it, so the special cases for error detect and resume are no
> longer necessary.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/pci/pcie/err.c | 85 +++++++++++++-------------------------------------
>  1 file changed, 21 insertions(+), 64 deletions(-)
> 
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 644f3f725ef0..0fa5e1417a4a 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -63,30 +63,12 @@ static int report_error_detected(struct pci_dev *dev, void *data)
>  	if (!dev->driver ||
>  		!dev->driver->err_handler ||
>  		!dev->driver->err_handler->error_detected) {
> -		if (result_data->state == pci_channel_io_frozen &&
> -			dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
> -			/*
> -			 * In case of fatal recovery, if one of down-
> -			 * stream device has no driver. We might be
> -			 * unable to recover because a later insmod
> -			 * of a driver for this device is unaware of
> -			 * its hw state.
> -			 */
> -			pci_printk(KERN_DEBUG, dev, "device has %s\n",
> -				   dev->driver ?
> -				   "no AER-aware driver" : "no driver");
> -		}
> -
>  		/*
> -		 * If there's any device in the subtree that does not
> -		 * have an error_detected callback, returning
> -		 * PCI_ERS_RESULT_NO_AER_DRIVER prevents calling of
> -		 * the subsequent mmio_enabled/slot_reset/resume
> -		 * callbacks of "any" device in the subtree. All the
> -		 * devices in the subtree are left in the error state
> -		 * without recovery.
> +		 * If any device in the subtree does not have an error_detected
> +		 * callback, PCI_ERS_RESULT_NO_AER_DRIVER prevents subsequent
> +		 * error callbacks of "any" device in the subtree, and will
> +		 * exit in the disconnected error state.
>  		 */
> -
>  		if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
>  			vote = PCI_ERS_RESULT_NO_AER_DRIVER;
>  		else
> @@ -184,34 +166,23 @@ static pci_ers_result_t default_reset_link(struct pci_dev *dev)
>  
>  static pci_ers_result_t reset_link(struct pci_dev *dev, u32 service)
>  {
> -	struct pci_dev *udev;
>  	pci_ers_result_t status;
>  	struct pcie_port_service_driver *driver = NULL;
>  
> -	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
> -		/* Reset this port for all subordinates */
> -		udev = dev;
> -	} else {
> -		/* Reset the upstream component (likely downstream port) */
> -		udev = dev->bus->self;
> -	}
> -
> -	/* Use the aer driver of the component firstly */
> -	driver = pcie_port_find_service(udev, service);
> -
> +	driver = pcie_port_find_service(dev, service);
>  	if (driver && driver->reset_link) {
> -		status = driver->reset_link(udev);
> -	} else if (udev->has_secondary_link) {
> -		status = default_reset_link(udev);
> +		status = driver->reset_link(dev);
> +	} else if (dev->has_secondary_link) {
> +		status = default_reset_link(dev);
>  	} else {
>  		pci_printk(KERN_DEBUG, dev, "no link-reset support at upstream device %s\n",
> -			pci_name(udev));
> +			pci_name(dev));
>  		return PCI_ERS_RESULT_DISCONNECT;
>  	}
>  
>  	if (status != PCI_ERS_RESULT_RECOVERED) {
>  		pci_printk(KERN_DEBUG, dev, "link reset at upstream device %s failed\n",
> -			pci_name(udev));
> +			pci_name(dev));
>  		return PCI_ERS_RESULT_DISCONNECT;
>  	}
>  
> @@ -243,31 +214,7 @@ static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
>  	else
>  		result_data.result = PCI_ERS_RESULT_RECOVERED;
>  
> -	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
> -		/*
> -		 * If the error is reported by a bridge, we think this error
> -		 * is related to the downstream link of the bridge, so we
> -		 * do error recovery on all subordinates of the bridge instead
> -		 * of the bridge and clear the error status of the bridge.
> -		 */
> -		if (cb == report_error_detected)
> -			dev->error_state = state;
> -		pci_walk_bus(dev->subordinate, cb, &result_data);
> -		if (cb == report_resume) {
> -			pci_aer_clear_device_status(dev);
> -			pci_cleanup_aer_uncorrect_error_status(dev);
> -			dev->error_state = pci_channel_io_normal;
> -		}
> -	} else {
> -		/*
> -		 * If the error is reported by an end point, we think this
> -		 * error is related to the upstream link of the end point.
> -		 * The error is non fatal so the bus is ok; just invoke
> -		 * the callback for the function that logged the error.
> -		 */
> -		cb(dev, &result_data);
> -	}
> -
> +	pci_walk_bus(dev->subordinate, cb, &result_data);
>  	return result_data.result;
>  }
>  
> @@ -276,6 +223,14 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
>  {
>  	pci_ers_result_t status;
>  
> +	/*
> +	 * Error recovery runs on all subordinates of the first downstream port.
> +	 * If the downstream port detected the error, it is cleared at the end.
> +	 */
> +	if (!(pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT ||
> +	      pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM))
> +		dev = dev->bus->self;
> +
>  	status = broadcast_error_message(dev,
>  			state,
>  			"error_detected",
> @@ -311,6 +266,8 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
>  				"resume",
>  				report_resume);
>  
> +	pci_aer_clear_device_status(dev);
> +	pci_cleanup_aer_uncorrect_error_status(dev);
>  	pci_info(dev, "AER: Device recovery successful\n");
>  	return;
>  
> -- 
> 2.14.4
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 08/12] PCI: ERR: Always use the first downstream port
  2018-09-26 22:01   ` Bjorn Helgaas
@ 2018-09-26 22:19     ` Keith Busch
  2018-09-27 22:56       ` Bjorn Helgaas
  0 siblings, 1 reply; 28+ messages in thread
From: Keith Busch @ 2018-09-26 22:19 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Lukas Wunner, Christoph Hellwig,
	Mika Westerberg

On Wed, Sep 26, 2018 at 05:01:16PM -0500, Bjorn Helgaas wrote:
> On Thu, Sep 20, 2018 at 10:27:13AM -0600, Keith Busch wrote:
> > The link reset always used the first bridge device, but AER broadcast
> > error handling may have reported an end device. This means the reset may
> > hit devices that were never notified of the impending error recovery.
> > 
> > This patch uses the first downstream port in the hierarchy considered
> > reliable. An error detected by a switch upstream port should mean it
> > occurred on its upstream link, so the patch selects the parent device
> > if the error is not a root or downstream port.
> 
> I'm not really clear on what "Always use the first downstream port"
> means.  Always use it for *what*?
> 
> I already applied this, but if we can improve the changelog, I'll
> gladly update it.

I'll see if I can better rephrase.

Error handling should notify all affected pci functions. If an end device
detects and emits ERR_FATAL, the old way would have only notified that
end-device driver, but other functions may be on or below the same bus.

Using the downstream port that connects to the bus where the error was
detected as the anchor point to broadcast error handling progression,
we can notify all functions so they have a chance to prepare for the
link reset.
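
Roughly, the shape of that walk looks like the sketch below; this is
illustrative only, and report_one()/broadcast_below() are made-up names
here, not functions from the patch:

#include <linux/pci.h>

/* Illustrative callback: runs for every function on or below the bus */
static int report_one(struct pci_dev *dev, void *data)
{
	const struct pci_error_handlers *err = dev->driver ?
					dev->driver->err_handler : NULL;

	if (err && err->error_detected)
		err->error_detected(dev, pci_channel_io_frozen);
	return 0;
}

/* Broadcast from the downstream port anchoring the affected bus */
static void broadcast_below(struct pci_dev *bridge)
{
	if (bridge->subordinate)
		pci_walk_bus(bridge->subordinate, report_one, NULL);
}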

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 08/12] PCI: ERR: Always use the first downstream port
  2018-09-26 22:19     ` Keith Busch
@ 2018-09-27 22:56       ` Bjorn Helgaas
  2018-09-28 15:42         ` Keith Busch
  0 siblings, 1 reply; 28+ messages in thread
From: Bjorn Helgaas @ 2018-09-27 22:56 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Lukas Wunner, Christoph Hellwig,
	Mika Westerberg

On Wed, Sep 26, 2018 at 04:19:25PM -0600, Keith Busch wrote:
> On Wed, Sep 26, 2018 at 05:01:16PM -0500, Bjorn Helgaas wrote:
> > On Thu, Sep 20, 2018 at 10:27:13AM -0600, Keith Busch wrote:
> > > The link reset always used the first bridge device, but AER broadcast
> > > error handling may have reported an end device. This means the reset may
> > > hit devices that were never notified of the impending error recovery.
> > > 
> > > This patch uses the first downstream port in the hierarchy considered
> > > reliable. An error detected by a switch upstream port should mean it
> > > occurred on its upstream link, so the patch selects the parent device
> > > if the error is not a root or downstream port.
> > 
> > I'm not really clear on what "Always use the first downstream port"
> > means.  Always use it for *what*?

And I forgot to ask what "first downstream port" means.

> I'll see if I can better rephrase.
> 
> Error handling should notify all affected pci functions. If an end device
> detects and emits ERR_FATAL, the old way would have only notified that
> end-device driver, but other functions may be on or below the same bus.
> 
> Using the downstream port that connects to that bus where the error was
> detectedas the anchor point to broadcast error handling progression,
> we can notify all functions so they have a chance to prepare for the
> link reset.

So do I understand correctly that if the ERR_FATAL source is:

  - a Switch Upstream Port, you assume the problem is with the Link
    upstream from the Port, and that Link may need to be reset, so you
    notify everything below that Link, including the Upstream Port,
    everything below it (the Downstream Ports and anything below
    them), and potentially even any peers of the Upstream Port (is it
    even possible for an Upstream Port to have peer multi-function
    devices?)

  - a Switch Downstream Port, you assume the Port (and the Link going
    downstream) may need to be reset, so you notify the Port and
    anything below it

  - an Endpoint, you assume the Link leading to the Endpoint may need
    to be reset, so you notify the Endpoint, any peer multi-function
    devices, any related SR-IOV devices, and any devices below a peer
    that happens to be a bridge

And this is different from before because it notifies more devices in
some cases?  There was a pci_walk_bus() in broadcast_error_message(),
so we should have notified several devices in *some* cases, at least.

Bjorn

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 08/12] PCI: ERR: Always use the first downstream port
  2018-09-27 22:56       ` Bjorn Helgaas
@ 2018-09-28 15:42         ` Keith Busch
  2018-09-28 20:50           ` Bjorn Helgaas
  0 siblings, 1 reply; 28+ messages in thread
From: Keith Busch @ 2018-09-28 15:42 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Lukas Wunner, Christoph Hellwig,
	Mika Westerberg

On Thu, Sep 27, 2018 at 05:56:25PM -0500, Bjorn Helgaas wrote:
> On Wed, Sep 26, 2018 at 04:19:25PM -0600, Keith Busch wrote:
> > On Wed, Sep 26, 2018 at 05:01:16PM -0500, Bjorn Helgaas wrote:
> > > On Thu, Sep 20, 2018 at 10:27:13AM -0600, Keith Busch wrote:
> > > > The link reset always used the first bridge device, but AER broadcast
> > > > error handling may have reported an end device. This means the reset may
> > > > hit devices that were never notified of the impending error recovery.
> > > > 
> > > > This patch uses the first downstream port in the hierarchy considered
> > > > reliable. An error detected by a switch upstream port should mean it
> > > > occurred on its upstream link, so the patch selects the parent device
> > > > if the error is not a root or downstream port.
> > > 
> > > I'm not really clear on what "Always use the first downstream port"
> > > means.  Always use it for *what*?
> 
> And I forgot to ask what "first downstream port" means.

The "first downstream port" was supposed to mean the first DSP we
find when walking toward the root from the pci_dev that reported
ERR_[NON]FATAL.

By "use", I mean "walking down the sub-tree for error handling".

After thinking more on this, that doesn't really capture the intent. A
better way might be:

  Run error handling starting from the downstream port of the bus
  reporting an error

I'm struggling to make that short enough for a changelog subject.
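
For illustration, the selection itself amounts to something like this
sketch (error_bridge() is just a name I'm using here, not a helper the
patch adds):

#include <linux/pci.h>

/* Pick the port whose downstream link/bus is suspected of the error */
static struct pci_dev *error_bridge(struct pci_dev *dev)
{
	if (pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT ||
	    pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM)
		return dev;	/* the port itself is still reachable */

	return dev->bus->self;	/* otherwise, the port above the reporting device */
}

Everything on the returned port's subordinate bus then gets the recovery
callbacks.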

> > I'll see if I can better rephrase.
> > 
> > Error handling should notify all affected pci functions. If an end device
> > detects and emits ERR_FATAL, the old way would have only notified that
> > end-device driver, but other functions may be on or below the same bus.
> > 
> > Using the downstream port that connects to that bus where the error was
> > detectedas the anchor point to broadcast error handling progression,
> > we can notify all functions so they have a chance to prepare for the
> > link reset.
> 
> So do I understand correctly that if the ERR_FATAL source is:
> 
>   - a Switch Upstream Port, you assume the problem is with the Link
>     upstream from the Port, and that Link may need to be reset, so you
>     notify everything below that Link, including the Upstream Port,
>     everything below it (the Downstream Ports and anything below
>     them), and potentially even any peers of the Upstream Port (is it
>     even possible for an Upstream Port to have peer multi-function
>     devices?)

Yep, the Microsemi Switchtec is one such real life example of an end
device function on the same bus as a USP.


>   - a Switch Downstream Port, you assume the Port (and the Link going
>     downstream) may need to be reset, so you notify the Port and
>     anything below it
> 
>   - an Endpoint, you assume the Link leading to the Endpoint may need
>     to be reset, so you notify the Endpoint, any peer multi-function
>     devices, any related SR-IOV devices, and any devices below a peer
>     that happens to be a bridge
> 
> And this is different from before because it notifies more devices in
> some cases?  There was a pci_walk_bus() in broadcast_error_message(),
> so we should have notified several devices in *some* cases, at least.

broadcast_error_message() had been using the pci_dev that detected the
error, and its pci_walk_bus() used dev->subordinate. If the pci_dev
that detected an error was an end device, we didn't walk the bus.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 08/12] PCI: ERR: Always use the first downstream port
  2018-09-28 15:42         ` Keith Busch
@ 2018-09-28 20:50           ` Bjorn Helgaas
  2018-09-28 21:35             ` Keith Busch
  0 siblings, 1 reply; 28+ messages in thread
From: Bjorn Helgaas @ 2018-09-28 20:50 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Lukas Wunner, Christoph Hellwig,
	Mika Westerberg

On Fri, Sep 28, 2018 at 09:42:20AM -0600, Keith Busch wrote:
> On Thu, Sep 27, 2018 at 05:56:25PM -0500, Bjorn Helgaas wrote:
> > On Wed, Sep 26, 2018 at 04:19:25PM -0600, Keith Busch wrote:
> > > On Wed, Sep 26, 2018 at 05:01:16PM -0500, Bjorn Helgaas wrote:
> > > > On Thu, Sep 20, 2018 at 10:27:13AM -0600, Keith Busch wrote:
> > > > > The link reset always used the first bridge device, but AER broadcast
> > > > > error handling may have reported an end device. This means the reset may
> > > > > hit devices that were never notified of the impending error recovery.
> > > > > 
> > > > > This patch uses the first downstream port in the hierarchy considered
> > > > > reliable. An error detected by a switch upstream port should mean it
> > > > > occurred on its upstream link, so the patch selects the parent device
> > > > > if the error is not a root or downstream port.
> > > > 
> > > > I'm not really clear on what "Always use the first downstream port"
> > > > means.  Always use it for *what*?
> > 
> > And I forgot to ask what "first downstream port" means.
> 
> The "first downstream port" was supposed to mean the first DSP we
> find when walking toward the root from the pci_dev that reported
> ERR_[NON]FATAL.
> 
> By "use", I mean "walking down the sub-tree for error handling".
> 
> After thinking more on this, that doesn't really capture the intent. A
> better way might be:
> 
>   Run error handling starting from the downstream port of the bus
>   reporting an error

I think the light is beginning to dawn.  Does this make sense (as far
as it goes)?

  PCI/ERR: Run error recovery callbacks for all affected devices

  If an Endpoint reports an error with ERR_FATAL, we previously ran
  driver error recovery callbacks only for the Endpoint.  But if
  recovery requires that we reset the Endpoint, we may have to use a
  port farther upstream to initiate a Link reset, and that may affect
  components other than the Endpoint, e.g., multi-function peers and
  their children.  Drivers for those devices were never notified of
  the impending reset.

  Call driver error recovery callbacks for every device that will be
  reset.

Now help me understand this part:

> This allows two other clean-ups.  First, error handling can only run
> on bridges so this patch removes checks for endpoints.  

"error handling can only run on bridges"?  I *think* only Root Ports
and Root Complex Event Collectors can assert AER interrupts, so
aer_irq() is only run for them (actually I don't think we quite
support Event Collectors yet).  Is this what you're getting at?

Probably not, because the "dev" passed to pcie_do_recovery() isn't an
RP or RCEC.  But the e_info->dev[i] that aer_process_err_devices()
eventually passes in doesn't have to be a bridge at all, does it?

So I can't quite figure out the bridge connection here.

> Second, the first accessible port does not inherit the channel error
> state since we can access it, so the special cases for error detect
> and resume are no longer necessary.

When we call pcie_do_recovery(), I guess we specify a "dev" that
logged an AER event and a pci_channel_state.  It seems a little bit
weird to handle those separately, as opposed to incorporating it into
the pci_dev or something.

But if you're just saying that "if A is frozen because of an error,
that doesn't mean bridges upstream from A are frozen", that makes good
sense.

Bjorn

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 08/12] PCI: ERR: Always use the first downstream port
  2018-09-28 20:50           ` Bjorn Helgaas
@ 2018-09-28 21:35             ` Keith Busch
  2018-09-28 23:28               ` Bjorn Helgaas
  0 siblings, 1 reply; 28+ messages in thread
From: Keith Busch @ 2018-09-28 21:35 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Lukas Wunner, Christoph Hellwig,
	Mika Westerberg

On Fri, Sep 28, 2018 at 03:50:34PM -0500, Bjorn Helgaas wrote:
> On Fri, Sep 28, 2018 at 09:42:20AM -0600, Keith Busch wrote:
> > On Thu, Sep 27, 2018 at 05:56:25PM -0500, Bjorn Helgaas wrote:
> > > On Wed, Sep 26, 2018 at 04:19:25PM -0600, Keith Busch wrote:
> > The "first downstream port" was supposed to mean the first DSP we
> > find when walking toward the root from the pci_dev that reported
> > ERR_[NON]FATAL.
> > 
> > By "use", I mean "walking down the sub-tree for error handling".
> > 
> > After thinking more on this, that doesn't really capture the intent. A
> > better way might be:
> > 
> >   Run error handling starting from the downstream port of the bus
> >   reporting an error
> 
> I think the light is beginning to dawn.  Does this make sense (as far
> as it goes)?
> 
>   PCI/ERR: Run error recovery callbacks for all affected devices
> 
>   If an Endpoint reports an error with ERR_FATAL, we previously ran
>   driver error recovery callbacks only for the Endpoint.  But if
>   recovery requires that we reset the Endpoint, we may have to use a
>   port farther upstream to initiate a Link reset, and that may affect
>   components other than the Endpoint, e.g., multi-function peers and
>   their children.  Drivers for those devices were never notified of
>   the impending reset.
> 
>   Call driver error recovery callbacks for every device that will be
>   reset.

Yes!
 
> Now help me understand this part:
> 
> > This allows two other clean-ups.  First, error handling can only run
> > on bridges so this patch removes checks for endpoints.  
> 
> "error handling can only run on bridges"?  I *think* only Root Ports
> and Root Complex Event Collectors can assert AER interrupts, so
> aer_irq() is only run for them (actually I don't think we quite
> support Event Collectors yet).  Is this what you're getting at?

I mean the pci_dev sent to pcie_do_recovery(), which may be any device
in the topology at or below the root port that aer_irq() serviced.

> Probably not, because the "dev" passed to pcie_do_recovery() isn't an
> RP or RCEC.  But the e_info->dev[i] that aer_process_err_devices()
> eventually passes in doesn't have to be a bridge at all, does it?

Yes, e_info->dev[i] is sent to pcie_do_recovery(). That could be an RP,
but it may also be anything below it.

The assumption I'm making (which I think is a safe assumption with
general consensus) is that errors detected on an end point or an upstream
port happened because of something wrong with the link going upstream:
end devices have no other option, and I don't think it's possible for a
bus error to occur on "virtual" busses.

The patch then assumes we should avoid sending any traffic through that
link until error recovery completes.
 
> So I can't quite figure out the bridge connection here.

I should have included RP's, or more generically "type 1 header devices".

> > Second, the first accessible port does not inherit the channel error
> > state since we can access it, so the special cases for error detect
> > and resume are no longer necessary.
> 
> When we call pcie_do_recovery(), I guess we specify a "dev" that
> logged an AER event and a pci_channel_state.  It seems a little bit
> weird to handle those separately, as opposed to incorporating into 
> the pci_dev or something.
>
> But if you're just saying that "if A is frozen because of an error,
> that doesn't mean bridges upstream from A are frozen", that makes good
> sense.

Yes. Another way to think of it is in terms of busses instead of
devices. While devices report errors, the error is because of the
bus. Paths going through that bus are considered frozen, and everything
else is not. We can safely communicate with the RP or DSP that connects
to that bus because communicating with those does not go through the
frozen bus.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 08/12] PCI: ERR: Always use the first downstream port
  2018-09-28 21:35             ` Keith Busch
@ 2018-09-28 23:28               ` Bjorn Helgaas
  2018-10-01 15:14                 ` Keith Busch
  0 siblings, 1 reply; 28+ messages in thread
From: Bjorn Helgaas @ 2018-09-28 23:28 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Lukas Wunner, Christoph Hellwig,
	Mika Westerberg

On Fri, Sep 28, 2018 at 03:35:23PM -0600, Keith Busch wrote:
> On Fri, Sep 28, 2018 at 03:50:34PM -0500, Bjorn Helgaas wrote:
> > On Fri, Sep 28, 2018 at 09:42:20AM -0600, Keith Busch wrote:
> > > On Thu, Sep 27, 2018 at 05:56:25PM -0500, Bjorn Helgaas wrote:
> > > > On Wed, Sep 26, 2018 at 04:19:25PM -0600, Keith Busch wrote:
> > > The "first downstream port" was supposed to mean the first DSP we
> > > find when walking toward the root from the pci_dev that reported
> > > ERR_[NON]FATAL.
> > > 
> > > By "use", I mean "walking down the sub-tree for error handling".
> > > 
> > > After thinking more on this, that doesn't really capture the intent. A
> > > better way might be:
> > > 
> > >   Run error handling starting from the downstream port of the bus
> > >   reporting an error
> > 
> > I think the light is beginning to dawn.  Does this make sense (as far
> > as it goes)?
> > 
> >   PCI/ERR: Run error recovery callbacks for all affected devices
> > 
> >   If an Endpoint reports an error with ERR_FATAL, we previously ran
> >   driver error recovery callbacks only for the Endpoint.  But if
> >   recovery requires that we reset the Endpoint, we may have to use a
> >   port farther upstream to initiate a Link reset, and that may affect
> >   components other than the Endpoint, e.g., multi-function peers and
> >   their children.  Drivers for those devices were never notified of
> >   the impending reset.
> > 
> >   Call driver error recovery callbacks for every device that will be
> >   reset.
> 
> Yes!
>  
> > Now help me understand this part:
> > 
> > > This allows two other clean-ups.  First, error handling can only run
> > > on bridges so this patch removes checks for endpoints.  
> > 
> > "error handling can only run on bridges"?  I *think* only Root Ports
> > and Root Complex Event Collectors can assert AER interrupts, so
> > aer_irq() is only run for them (actually I don't think we quite
> > support Event Collectors yet).  Is this what you're getting at?
> 
> I mean the pci_dev sent to pcie_do_recovery(), which may be any device
> in the topology including or below the root port that aer_irq() serviced.

Yep.

> > Probably not, because the "dev" passed to pcie_do_recovery() isn't an
> > RP or RCEC.  But the e_info->dev[i] that aer_process_err_devices()
> > eventually passes in doesn't have to be a bridge at all, does it?
> 
> Yes, e_info->dev[i] is sent to pcie_do_recovery(). That could be an RP,
> but it may also be anything below it.

Yep.

> The assumption I'm making (which I think is a safe assumption with
> general consensus) is that errors detected on an end point or an upstream
> port happened because of something wrong with the link going upstream:
> end devices have no other option, 

Is this really true?  It looks like "Internal Errors" (sec 6.2.9) may
be unrelated to a packet or event (though they are supposed to be
associated with a PCIe interface).

It says the only method of recovering is reset or hardware
replacement.  It doesn't specify, but it seems like a FLR or similar
reset might be appropriate and we may not have to reset the link.

Getting back to the changelog, "error handling can only run on
bridges" clearly doesn't refer to the driver callbacks (since those
only apply to endpoints).  Maybe "error handling" refers to the
reset_link(), which can only be done on a bridge?

That would make sense to me, although the current code may be
resetting more devices than necessary if Internal Errors can be
handled without a link reset.

Bjorn

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 08/12] PCI: ERR: Always use the first downstream port
  2018-09-28 23:28               ` Bjorn Helgaas
@ 2018-10-01 15:14                 ` Keith Busch
  2018-10-02 19:35                   ` Bjorn Helgaas
  0 siblings, 1 reply; 28+ messages in thread
From: Keith Busch @ 2018-10-01 15:14 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Lukas Wunner, Christoph Hellwig,
	Mika Westerberg

On Fri, Sep 28, 2018 at 06:28:02PM -0500, Bjorn Helgaas wrote:
> On Fri, Sep 28, 2018 at 03:35:23PM -0600, Keith Busch wrote:
> > The assumption I'm making (which I think is a safe assumption with
> > general consensus) is that errors detected on an end point or an upstream
> > port happened because of something wrong with the link going upstream:
> > end devices have no other option, 
> 
> Is this really true?  It looks like "Internal Errors" (sec 6.2.9) may
> be unrelated to a packet or event (though they are supposed to be
> associated with a PCIe interface).
> 
> It says the only method of recovering is reset or hardware
> replacement.  It doesn't specify, but it seems like a FLR or similar
> reset might be appropriate and we may not have to reset the link.

That is an interesting case we might want to handle better. I have a
couple of concerns to consider when implementing it:

We don't know an ERR_FATAL occurred for an internal reason until we read the
config register across the link, and the AER driver historically avoided
accessing potentially unhealthy links. I don't *think* it's harmful to
attempt reading the register, but we'd just need to check for an "all 1's"
completion before trusting the result.
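
Something like the following sketch is what I have in mind for that check;
it's only illustrative, and config_read_ok() is a made-up name rather than
an existing helper:

#include <linux/pci.h>

/* Treat an all 1's completion as "don't trust what we just read" */
static bool config_read_ok(struct pci_dev *dev)
{
	u32 id;

	if (pci_read_config_dword(dev, PCI_VENDOR_ID, &id))
		return false;		/* the config read itself failed */

	return id != 0xffffffff;	/* all 1's: no valid completion came back */
}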

The other issue with trying to use FLR is a device may not implement
it, so pci reset has fallback methods depending on the device's
capabilities. We can end up calling pci_parent_bus_reset(), which does the
same secondary bus reset that already happens as part of error recovery.
We'd just need to make sure affected devices and drivers have a chance
to be notified (which is this patch's intention).
 
> Getting back to the changelog, "error handling can only run on
> bridges" clearly doesn't refer to the driver callbacks (since those
> only apply to endpoints).  Maybe "error handling" refers to the
> reset_link(), which can only be done on a bridge?

Yep, referring to how the link reset (reset_link) is only issued from bridges.
 
> That would make sense to me, although the current code may be
> resetting more devices than necessary if Internal Errors can be
> handled without a link reset.

That sounds good, I'll test some scenarios out here.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 08/12] PCI: ERR: Always use the first downstream port
  2018-10-01 15:14                 ` Keith Busch
@ 2018-10-02 19:35                   ` Bjorn Helgaas
  2018-10-02 19:55                     ` Keith Busch
  0 siblings, 1 reply; 28+ messages in thread
From: Bjorn Helgaas @ 2018-10-02 19:35 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Lukas Wunner, Christoph Hellwig,
	Mika Westerberg

On Mon, Oct 01, 2018 at 09:14:51AM -0600, Keith Busch wrote:
> On Fri, Sep 28, 2018 at 06:28:02PM -0500, Bjorn Helgaas wrote:
> > On Fri, Sep 28, 2018 at 03:35:23PM -0600, Keith Busch wrote:
> > > The assumption I'm making (which I think is a safe assumption with
> > > general consensus) is that errors detected on an end point or an upstream
> > > port happened because of something wrong with the link going upstream:
> > > end devices have no other option, 
> > 
> > Is this really true?  It looks like "Internal Errors" (sec 6.2.9) may
> > be unrelated to a packet or event (though they are supposed to be
> > associated with a PCIe interface).
> > 
> > It says the only method of recovering is reset or hardware
> > replacement.  It doesn't specify, but it seems like a FLR or similar
> > reset might be appropriate and we may not have to reset the link.
> 
> That is an interesting case we might want to handle better. I've a couple
> concerns to consider for implementing:
> 
> We don't know an ERR_FATAL occurred for an internal reason until we read the
> config register across the link, and the AER driver historically avoided
> accessing potentially unhealthy links. I don't *think* it's harmful to
> attempt reading the register, but we'd just need to check for an "all 1's"
> completion before trusting the result.
> 
> The other issue with trying to use FLR is a device may not implement
> it, so pci reset has fallback methods depending on the device's
> capabilities. We can end up calling pci_parent_bus_reset(), which does the
> same secondary bus reset that already happens as part of error recovery.
> We'd just need to make sure affected devices and drivers have a chance
> to be notified (which is this patch's intention).
>  
> > Getting back to the changelog, "error handling can only run on
> > bridges" clearly doesn't refer to the driver callbacks (since those
> > only apply to endpoints).  Maybe "error handling" refers to the
> > reset_link(), which can only be done on a bridge?
> 
> Yep, referring to how link reset_link is only sent from bridges.
>  
> > That would make sense to me, although the current code may be
> > resetting more devices than necessary if Internal Errors can be
> > handled without a link reset.
> 
> That sounds good, I'll test some scenarios out here.

The main point here is that we call the driver callbacks for every
device that might be reset.  If that set of devices is larger than
strictly necessary, that's an opportunity for future optimization,
which we can defer for now.

Here's my proposal for the changelog.  Let me know what I screwed up.


commit 1f7d2967334433d885c0712b8ac3f073f20211ee
Author: Keith Busch <keith.busch@intel.com>
Date:   Thu Sep 20 10:27:13 2018 -0600

    PCI/ERR: Run error recovery callbacks for all affected devices
    
    If an Endpoint reported an error with ERR_FATAL, we previously ran driver
    error recovery callbacks only for the Endpoint's driver.  But if we reset a
    Link to recover from the error, all downstream components are affected,
    including the Endpoint, any multi-function peers, and children of those
    peers.
    
    Initiate the Link reset from the deepest Downstream Port that is
    reliable, and call the error recovery callbacks for all its children.
    
    If a Downstream Port (including a Root Port) reports an error, we assume
    the Port itself is reliable and we need to reset its downstream Link.  In
    all other cases (Switch Upstream Ports, Endpoints, Bridges, etc), we assume
    the Link leading to the component needs to be reset, so we initiate the
    reset at the parent Downstream Port.
    
    This allows two other clean-ups.  First, we currently only use a Link
    reset, which can only be initiated using a Downstream Port, so we can
    remove checks for Endpoints.  Second, the Downstream Port where we initiate
    the Link reset is reliable (unlike the device that reported the error), so
    the special cases for error detect and resume are no longer necessary.
    
    Signed-off-by: Keith Busch <keith.busch@intel.com>
    [bhelgaas: changelog]
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
    Reviewed-by: Sinan Kaya <okaya@kernel.org>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 08/12] PCI: ERR: Always use the first downstream port
  2018-10-02 19:35                   ` Bjorn Helgaas
@ 2018-10-02 19:55                     ` Keith Busch
  0 siblings, 0 replies; 28+ messages in thread
From: Keith Busch @ 2018-10-02 19:55 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Lukas Wunner, Christoph Hellwig,
	Mika Westerberg

On Tue, Oct 02, 2018 at 02:35:22PM -0500, Bjorn Helgaas wrote:
> Here's my proposal for the changelog.  Let me know what I screwed up.
> 
> commit 1f7d2967334433d885c0712b8ac3f073f20211ee
> Author: Keith Busch <keith.busch@intel.com>
> Date:   Thu Sep 20 10:27:13 2018 -0600
> 
>     PCI/ERR: Run error recovery callbacks for all affected devices
>     
>     If an Endpoint reported an error with ERR_FATAL, we previously ran driver
>     error recovery callbacks only for the Endpoint's driver.  But if we reset a
>     Link to recover from the error, all downstream components are affected,
>     including the Endpoint, any multi-function peers, and children of those
>     peers.
>     
>     Initiate the Link reset from the deepest Downstream Port that is
>     reliable, and call the error recovery callbacks for all its children.
>     
>     If a Downstream Port (including a Root Port) reports an error, we assume
>     the Port itself is reliable and we need to reset its downstream Link.  In
>     all other cases (Switch Upstream Ports, Endpoints, Bridges, etc), we assume
>     the Link leading to the component needs to be reset, so we initiate the
>     reset at the parent Downstream Port.
>     
>     This allows two other clean-ups.  First, we currently only use a Link
>     reset, which can only be initiated using a Downstream Port, so we can
>     remove checks for Endpoints.  Second, the Downstream Port where we initiate
>     the Link reset is reliable (unlike the device that reported the error), so
>     the special cases for error detect and resume are no longer necessary.

A downstream port may have been the device that reported the error, but
we still consider it to be accessible. Maybe "unlike its subordinate
bus".

Otherwise this sounds good to me.

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread

Thread overview: 28+ messages
2018-09-20 16:27 [PATCHv4 00/12] pci error handling fixes Keith Busch
2018-09-20 16:27 ` [PATCHv4 01/12] PCI: portdrv: Initialize service drivers directly Keith Busch
2018-09-20 16:27 ` [PATCHv4 02/12] PCI: portdrv: Restore pci state on slot reset Keith Busch
2018-09-20 16:27 ` [PATCHv4 03/12] PCI: DPC: Save and restore control state Keith Busch
2018-09-20 19:46   ` Sinan Kaya
2018-09-20 19:47     ` Sinan Kaya
2018-09-20 19:54       ` Keith Busch
2018-09-20 16:27 ` [PATCHv4 04/12] PCI: AER: Take reference on error devices Keith Busch
2018-09-20 16:27 ` [PATCHv4 05/12] PCI: AER: Don't read upstream ports below fatal errors Keith Busch
2018-09-20 16:27 ` [PATCHv4 06/12] PCI: ERR: Use slot reset if available Keith Busch
2018-09-20 16:27 ` [PATCHv4 07/12] PCI: ERR: Handle fatal error recovery Keith Busch
2018-09-20 16:27 ` [PATCHv4 08/12] PCI: ERR: Always use the first downstream port Keith Busch
2018-09-26 22:01   ` Bjorn Helgaas
2018-09-26 22:19     ` Keith Busch
2018-09-27 22:56       ` Bjorn Helgaas
2018-09-28 15:42         ` Keith Busch
2018-09-28 20:50           ` Bjorn Helgaas
2018-09-28 21:35             ` Keith Busch
2018-09-28 23:28               ` Bjorn Helgaas
2018-10-01 15:14                 ` Keith Busch
2018-10-02 19:35                   ` Bjorn Helgaas
2018-10-02 19:55                     ` Keith Busch
2018-09-20 16:27 ` [PATCHv4 09/12] PCI: ERR: Simplify broadcast callouts Keith Busch
2018-09-20 16:27 ` [PATCHv4 10/12] PCI: ERR: Report current recovery status for udev Keith Busch
2018-09-20 16:27 ` [PATCHv4 11/12] PCI: Unify device inaccessible Keith Busch
2018-09-20 16:27 ` [PATCHv4 12/12] PCI: Make link active reporting detection generic Keith Busch
2018-09-20 20:00 ` [PATCHv4 00/12] pci error handling fixes Sinan Kaya
2018-09-20 21:17 ` Bjorn Helgaas
