All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/4] pciehp error recovery fix + cleanups
@ 2021-07-31 12:39 Lukas Wunner
  2021-07-31 12:39 ` [PATCH 1/4] PCI: pciehp: Ignore Link Down/Up caused by error-induced Hot Reset Lukas Wunner
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Lukas Wunner @ 2021-07-31 12:39 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Kuppuswamy Sathyanarayanan, Sinan Kaya, Ashok Raj, Keith Busch,
	Yicong Yang, linux-pci, Russell Currey, Oliver OHalloran,
	Stuart Hayes, Mika Westerberg, Oza Pawandeep

One fix for a pciehp error recovery issue spotted by Stuart
plus three cleanups.  Please review and test.  Thanks!

Lukas Wunner (4):
  PCI: pciehp: Ignore Link Down/Up caused by error-induced Hot Reset
  PCI/portdrv: Remove unused resume err_handler
  PCI/portdrv: Remove unused pcie_port_bus_{,un}register() declarations
  PCI/ERR: Reduce compile time for CONFIG_PCIEAER=n

 drivers/pci/Kconfig               |  3 +++
 drivers/pci/hotplug/pciehp.h      |  2 ++
 drivers/pci/hotplug/pciehp_core.c |  4 ++++
 drivers/pci/hotplug/pciehp_hpc.c  | 28 ++++++++++++++++++++++++++++
 drivers/pci/pci-driver.c          |  2 +-
 drivers/pci/pci.c                 |  2 ++
 drivers/pci/pcie/Makefile         |  4 ++--
 drivers/pci/pcie/portdrv.h        |  6 ++----
 drivers/pci/pcie/portdrv_core.c   | 20 ++++++++++----------
 drivers/pci/pcie/portdrv_pci.c    | 27 +++------------------------
 10 files changed, 57 insertions(+), 41 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/4] PCI: pciehp: Ignore Link Down/Up caused by error-induced Hot Reset
  2021-07-31 12:39 [PATCH 0/4] pciehp error recovery fix + cleanups Lukas Wunner
@ 2021-07-31 12:39 ` Lukas Wunner
  2021-09-29 20:40   ` stuart hayes
  2021-10-07 23:00   ` Bjorn Helgaas
  2021-07-31 12:39 ` [PATCH 2/4] PCI/portdrv: Remove unused resume err_handler Lukas Wunner
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 12+ messages in thread
From: Lukas Wunner @ 2021-07-31 12:39 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Kuppuswamy Sathyanarayanan, Sinan Kaya, Ashok Raj, Keith Busch,
	Yicong Yang, linux-pci, Russell Currey, Oliver OHalloran,
	Stuart Hayes, Mika Westerberg

Stuart Hayes reports that an error handled by DPC at a Root Port results
in pciehp gratuitously bringing down a subordinate hotplug port:

  RP -- UP -- DP -- UP -- DP (hotplug) -- EP

pciehp brings the slot down because the Link to the Endpoint goes down.
That is caused by a Hot Reset being propagated as a result of DPC.
Per PCIe Base Spec 5.0, section 6.6.1 "Conventional Reset":

  For a Switch, the following must cause a hot reset to be sent on all
  Downstream Ports: [...]

  * The Data Link Layer of the Upstream Port reporting DL_Down status.
    In Switches that support Link speeds greater than 5.0 GT/s, the
    Upstream Port must direct the LTSSM of each Downstream Port to the
    Hot Reset state, but not hold the LTSSMs in that state. This permits
    each Downstream Port to begin Link training immediately after its
    hot reset completes. This behavior is recommended for all Switches.

  * Receiving a hot reset on the Upstream Port.

Once DPC recovers, pcie_do_recovery() walks down the hierarchy and
invokes pcie_portdrv_slot_reset() to restore each port's config space.
At that point, a hotplug interrupt is signaled per PCIe Base Spec r5.0,
section 6.7.3.4 "Software Notification of Hot-Plug Events":

  If the Port is enabled for edge-triggered interrupt signaling using
  MSI or MSI-X, an interrupt message must be sent every time the logical
  AND of the following conditions transitions from FALSE to TRUE: [...]

  * The Hot-Plug Interrupt Enable bit in the Slot Control register is
    set to 1b.

  * At least one hot-plug event status bit in the Slot Status register
    and its associated enable bit in the Slot Control register are both
    set to 1b.

Prevent pciehp from gratuitously bringing down the slot by clearing the
error-induced Data Link Layer State Changed event before restoring
config space.  Afterwards, check whether the link has unexpectedly
failed to retrain and synthesize a DLLSC event if so.

Allow each pcie_port_service_driver (one of them being pciehp) to define
a slot_reset callback and re-use the existing pm_iter() function to
iterate over the callbacks.

Thereby, the Endpoint driver remains bound throughout error recovery and
may restore the device to working state.

Surprise removal during error recovery is detected through a Presence
Detect Changed event.  The hotplug port is expected to not signal that
event as a result of a Hot Reset.

The issue isn't DPC-specific, it also occurs when an error is handled by
AER through aer_root_reset().  So while the issue was noticed only now,
it's been around since 2006 when AER support was first introduced.

Fixes: 6c2b374d7485 ("PCI-Express AER implemetation: AER core and aerdriver")
Link: https://lore.kernel.org/linux-pci/08c046b0-c9f2-3489-eeef-7e7aca435bb9@gmail.com/
Reported-by: Stuart Hayes <stuart.w.hayes@gmail.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: stable@vger.kernel.org # v2.6.19+: ba952824e6c1: PCI/portdrv: Report reset for frozen channel
Cc: Keith Busch <kbusch@kernel.org>
---
 drivers/pci/Kconfig               |  3 +++
 drivers/pci/hotplug/pciehp.h      |  2 ++
 drivers/pci/hotplug/pciehp_core.c |  4 ++++
 drivers/pci/hotplug/pciehp_hpc.c  | 28 ++++++++++++++++++++++++++++
 drivers/pci/pcie/portdrv.h        |  2 ++
 drivers/pci/pcie/portdrv_core.c   | 20 ++++++++++----------
 drivers/pci/pcie/portdrv_pci.c    |  3 +++
 7 files changed, 52 insertions(+), 10 deletions(-)

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 0c473d75e625..a295d3c48927 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -182,6 +182,9 @@ config PCI_LABEL
 	def_bool y if (DMI || ACPI)
 	select NLS
 
+config PCI_ERROR_RECOVERY
+	def_bool PCIEAER || EEH
+
 config PCI_HYPERV
 	tristate "Hyper-V PCI Frontend"
 	depends on X86_64 && HYPERV && PCI_MSI && PCI_MSI_IRQ_DOMAIN && SYSFS
diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
index d4a930881054..7f24fe30a898 100644
--- a/drivers/pci/hotplug/pciehp.h
+++ b/drivers/pci/hotplug/pciehp.h
@@ -189,6 +189,8 @@ int pciehp_get_attention_status(struct hotplug_slot *hotplug_slot, u8 *status);
 int pciehp_set_raw_indicator_status(struct hotplug_slot *h_slot, u8 status);
 int pciehp_get_raw_indicator_status(struct hotplug_slot *h_slot, u8 *status);
 
+int pciehp_slot_reset(struct pcie_device *dev);
+
 static inline const char *slot_name(struct controller *ctrl)
 {
 	return hotplug_slot_name(&ctrl->hotplug_slot);
diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
index ad3393930ecb..46a62237dcc8 100644
--- a/drivers/pci/hotplug/pciehp_core.c
+++ b/drivers/pci/hotplug/pciehp_core.c
@@ -351,6 +351,10 @@ static struct pcie_port_service_driver hpdriver_portdrv = {
 	.runtime_suspend = pciehp_runtime_suspend,
 	.runtime_resume	= pciehp_runtime_resume,
 #endif	/* PM */
+
+#ifdef	CONFIG_PCI_ERROR_RECOVERY
+	.slot_reset	= pciehp_slot_reset,
+#endif
 };
 
 int __init pcie_hp_init(void)
diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 9d06939736c0..72ef22d0c2c9 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -862,6 +862,34 @@ void pcie_disable_interrupt(struct controller *ctrl)
 	pcie_write_cmd(ctrl, 0, mask);
 }
 
+#ifdef CONFIG_PCI_ERROR_RECOVERY
+/**
+ * pciehp_slot_reset() - ignore link event caused by error-induced hot reset
+ * @dev: PCI Express port service device
+ *
+ * Called from pcie_portdrv_slot_reset() after AER or DPC initiated a reset
+ * further up in the hierarchy to recover from an error.  The reset was
+ * propagated down to this hotplug port.  Ignore the resulting link flap.
+ * If the link failed to retrain successfully, synthesize the ignored event.
+ * Surprise removal during reset is detected through Presence Detect Changed.
+ */
+int pciehp_slot_reset(struct pcie_device *dev)
+{
+	struct controller *ctrl = get_service_data(dev);
+
+	if (ctrl->state != ON_STATE)
+		return 0;
+
+	pcie_capability_write_word(dev->port, PCI_EXP_SLTSTA,
+				   PCI_EXP_SLTSTA_DLLSC);
+
+	if (!pciehp_check_link_active(ctrl))
+		pciehp_request(ctrl, PCI_EXP_SLTSTA_DLLSC);
+
+	return 0;
+}
+#endif
+
 /*
  * pciehp has a 1:1 bus:slot relationship so we ultimately want a secondary
  * bus reset of the bridge, but at the same time we want to ensure that it is
diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index 2ff5724b8f13..92a776d211ec 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -85,6 +85,7 @@ struct pcie_port_service_driver {
 	int (*runtime_suspend)(struct pcie_device *dev);
 	int (*runtime_resume)(struct pcie_device *dev);
 
+	int (*slot_reset)(struct pcie_device *dev);
 	/* Device driver may resume normal operations */
 	void (*error_resume)(struct pci_dev *dev);
 
@@ -110,6 +111,7 @@ void pcie_port_service_unregister(struct pcie_port_service_driver *new);
 
 extern struct bus_type pcie_port_bus_type;
 int pcie_port_device_register(struct pci_dev *dev);
+int pcie_port_device_iter(struct device *dev, void *data);
 #ifdef CONFIG_PM
 int pcie_port_device_suspend(struct device *dev);
 int pcie_port_device_resume_noirq(struct device *dev);
diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
index e1fed6649c41..ebcec7daa245 100644
--- a/drivers/pci/pcie/portdrv_core.c
+++ b/drivers/pci/pcie/portdrv_core.c
@@ -362,24 +362,24 @@ int pcie_port_device_register(struct pci_dev *dev)
 	return status;
 }
 
-#ifdef CONFIG_PM
-typedef int (*pcie_pm_callback_t)(struct pcie_device *);
+typedef int (*pcie_callback_t)(struct pcie_device *);
 
-static int pm_iter(struct device *dev, void *data)
+int pcie_port_device_iter(struct device *dev, void *data)
 {
 	struct pcie_port_service_driver *service_driver;
 	size_t offset = *(size_t *)data;
-	pcie_pm_callback_t cb;
+	pcie_callback_t cb;
 
 	if ((dev->bus == &pcie_port_bus_type) && dev->driver) {
 		service_driver = to_service_driver(dev->driver);
-		cb = *(pcie_pm_callback_t *)((void *)service_driver + offset);
+		cb = *(pcie_callback_t *)((void *)service_driver + offset);
 		if (cb)
 			return cb(to_pcie_device(dev));
 	}
 	return 0;
 }
 
+#ifdef CONFIG_PM
 /**
  * pcie_port_device_suspend - suspend port services associated with a PCIe port
  * @dev: PCI Express port to handle
@@ -387,13 +387,13 @@ static int pm_iter(struct device *dev, void *data)
 int pcie_port_device_suspend(struct device *dev)
 {
 	size_t off = offsetof(struct pcie_port_service_driver, suspend);
-	return device_for_each_child(dev, &off, pm_iter);
+	return device_for_each_child(dev, &off, pcie_port_device_iter);
 }
 
 int pcie_port_device_resume_noirq(struct device *dev)
 {
 	size_t off = offsetof(struct pcie_port_service_driver, resume_noirq);
-	return device_for_each_child(dev, &off, pm_iter);
+	return device_for_each_child(dev, &off, pcie_port_device_iter);
 }
 
 /**
@@ -403,7 +403,7 @@ int pcie_port_device_resume_noirq(struct device *dev)
 int pcie_port_device_resume(struct device *dev)
 {
 	size_t off = offsetof(struct pcie_port_service_driver, resume);
-	return device_for_each_child(dev, &off, pm_iter);
+	return device_for_each_child(dev, &off, pcie_port_device_iter);
 }
 
 /**
@@ -413,7 +413,7 @@ int pcie_port_device_resume(struct device *dev)
 int pcie_port_device_runtime_suspend(struct device *dev)
 {
 	size_t off = offsetof(struct pcie_port_service_driver, runtime_suspend);
-	return device_for_each_child(dev, &off, pm_iter);
+	return device_for_each_child(dev, &off, pcie_port_device_iter);
 }
 
 /**
@@ -423,7 +423,7 @@ int pcie_port_device_runtime_suspend(struct device *dev)
 int pcie_port_device_runtime_resume(struct device *dev)
 {
 	size_t off = offsetof(struct pcie_port_service_driver, runtime_resume);
-	return device_for_each_child(dev, &off, pm_iter);
+	return device_for_each_child(dev, &off, pcie_port_device_iter);
 }
 #endif /* PM */
 
diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
index c7ff1eea225a..1af74c3d9d5d 100644
--- a/drivers/pci/pcie/portdrv_pci.c
+++ b/drivers/pci/pcie/portdrv_pci.c
@@ -160,6 +160,9 @@ static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
 
 static pci_ers_result_t pcie_portdrv_slot_reset(struct pci_dev *dev)
 {
+	size_t off = offsetof(struct pcie_port_service_driver, slot_reset);
+	device_for_each_child(&dev->dev, &off, pcie_port_device_iter);
+
 	pci_restore_state(dev);
 	pci_save_state(dev);
 	return PCI_ERS_RESULT_RECOVERED;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/4] PCI/portdrv: Remove unused resume err_handler
  2021-07-31 12:39 [PATCH 0/4] pciehp error recovery fix + cleanups Lukas Wunner
  2021-07-31 12:39 ` [PATCH 1/4] PCI: pciehp: Ignore Link Down/Up caused by error-induced Hot Reset Lukas Wunner
@ 2021-07-31 12:39 ` Lukas Wunner
  2021-07-31 12:39 ` [PATCH 3/4] PCI/portdrv: Remove unused pcie_port_bus_{,un}register() declarations Lukas Wunner
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Lukas Wunner @ 2021-07-31 12:39 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Kuppuswamy Sathyanarayanan, Sinan Kaya, Ashok Raj, Keith Busch,
	Yicong Yang, linux-pci, Russell Currey, Oliver OHalloran,
	Stuart Hayes, Mika Westerberg

Commit 3e41a317ae45 ("PCI/AER: Remove unused aer_error_resume()")
removed the resume err_handler from AER.  Since no other port service
implements the callback, support for it can be removed from portdrv.
It can be revived later if need be, preferably by re-using the
pcie_port_device_iter() iterator.

Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Keith Busch <kbusch@kernel.org>
---
 drivers/pci/pcie/portdrv.h     |  2 --
 drivers/pci/pcie/portdrv_pci.c | 24 ------------------------
 2 files changed, 26 deletions(-)

diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index 92a776d211ec..e9fe0fef6cde 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -86,8 +86,6 @@ struct pcie_port_service_driver {
 	int (*runtime_resume)(struct pcie_device *dev);
 
 	int (*slot_reset)(struct pcie_device *dev);
-	/* Device driver may resume normal operations */
-	void (*error_resume)(struct pci_dev *dev);
 
 	int port_type;  /* Type of the port this driver can handle */
 	u32 service;    /* Port service this device represents */
diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
index 1af74c3d9d5d..35eca6277a96 100644
--- a/drivers/pci/pcie/portdrv_pci.c
+++ b/drivers/pci/pcie/portdrv_pci.c
@@ -173,29 +173,6 @@ static pci_ers_result_t pcie_portdrv_mmio_enabled(struct pci_dev *dev)
 	return PCI_ERS_RESULT_RECOVERED;
 }
 
-static int resume_iter(struct device *device, void *data)
-{
-	struct pcie_device *pcie_device;
-	struct pcie_port_service_driver *driver;
-
-	if (device->bus == &pcie_port_bus_type && device->driver) {
-		driver = to_service_driver(device->driver);
-		if (driver && driver->error_resume) {
-			pcie_device = to_pcie_device(device);
-
-			/* Forward error message to service drivers */
-			driver->error_resume(pcie_device->port);
-		}
-	}
-
-	return 0;
-}
-
-static void pcie_portdrv_err_resume(struct pci_dev *dev)
-{
-	device_for_each_child(&dev->dev, NULL, resume_iter);
-}
-
 /*
  * LINUX Device Driver Model
  */
@@ -213,7 +190,6 @@ static const struct pci_error_handlers pcie_portdrv_err_handler = {
 	.error_detected = pcie_portdrv_error_detected,
 	.slot_reset = pcie_portdrv_slot_reset,
 	.mmio_enabled = pcie_portdrv_mmio_enabled,
-	.resume = pcie_portdrv_err_resume,
 };
 
 static struct pci_driver pcie_portdriver = {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 3/4] PCI/portdrv: Remove unused pcie_port_bus_{,un}register() declarations
  2021-07-31 12:39 [PATCH 0/4] pciehp error recovery fix + cleanups Lukas Wunner
  2021-07-31 12:39 ` [PATCH 1/4] PCI: pciehp: Ignore Link Down/Up caused by error-induced Hot Reset Lukas Wunner
  2021-07-31 12:39 ` [PATCH 2/4] PCI/portdrv: Remove unused resume err_handler Lukas Wunner
@ 2021-07-31 12:39 ` Lukas Wunner
  2021-07-31 12:39 ` [PATCH 4/4] PCI/ERR: Reduce compile time for CONFIG_PCIEAER=n Lukas Wunner
  2021-10-15 19:29 ` [PATCH 0/4] pciehp error recovery fix + cleanups Bjorn Helgaas
  4 siblings, 0 replies; 12+ messages in thread
From: Lukas Wunner @ 2021-07-31 12:39 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Kuppuswamy Sathyanarayanan, Sinan Kaya, Ashok Raj, Keith Busch,
	Yicong Yang, linux-pci, Russell Currey, Oliver OHalloran,
	Stuart Hayes, Mika Westerberg

Commit c6c889d932bb ("PCI/portdrv: Remove pcie_port_bus_type link order
dependency") removed pcie_port_bus_{,un}register() but erroneously
retained their declarations in portdrv.h.  Remove them as well.

Signed-off-by: Lukas Wunner <lukas@wunner.de>
---
 drivers/pci/pcie/portdrv.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index e9fe0fef6cde..0ef4bf5f811d 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -118,8 +118,6 @@ int pcie_port_device_runtime_suspend(struct device *dev);
 int pcie_port_device_runtime_resume(struct device *dev);
 #endif
 void pcie_port_device_remove(struct pci_dev *dev);
-int __must_check pcie_port_bus_register(void);
-void pcie_port_bus_unregister(void);
 
 struct pci_dev;
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 4/4] PCI/ERR: Reduce compile time for CONFIG_PCIEAER=n
  2021-07-31 12:39 [PATCH 0/4] pciehp error recovery fix + cleanups Lukas Wunner
                   ` (2 preceding siblings ...)
  2021-07-31 12:39 ` [PATCH 3/4] PCI/portdrv: Remove unused pcie_port_bus_{,un}register() declarations Lukas Wunner
@ 2021-07-31 12:39 ` Lukas Wunner
  2021-10-15 19:29 ` [PATCH 0/4] pciehp error recovery fix + cleanups Bjorn Helgaas
  4 siblings, 0 replies; 12+ messages in thread
From: Lukas Wunner @ 2021-07-31 12:39 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Kuppuswamy Sathyanarayanan, Sinan Kaya, Ashok Raj, Keith Busch,
	Yicong Yang, linux-pci, Russell Currey, Oliver OHalloran,
	Stuart Hayes, Mika Westerberg

The sole non-static function in err.c, pcie_do_recovery(), is only
called from:

* aer.c (if CONFIG_PCIEAER=y)
* dpc.c (if CONFIG_PCIE_DPC=y, which depends on CONFIG_PCIEAER)
* edr.c (if CONFIG_PCIE_EDR=y, which depends on CONFIG_PCIE_DPC)

Thus, err.c need not be compiled if CONFIG_PCIEAER=n.

Also, pci_uevent_ers() and pcie_clear_device_status(), which are called
from err.c, can be #ifdef'ed away unless CONFIG_PCIEAER=y.

Since x86_64_defconfig doesn't enable CONFIG_PCIEAER, this change may
slightly reduce compile time for anyone doing a test build with that
config.

Signed-off-by: Lukas Wunner <lukas@wunner.de>
---
 drivers/pci/pci-driver.c  | 2 +-
 drivers/pci/pci.c         | 2 ++
 drivers/pci/pcie/Makefile | 4 ++--
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 3a72352aa5cf..bd990a4f9d53 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -1529,7 +1529,7 @@ static int pci_uevent(struct device *dev, struct kobj_uevent_env *env)
 	return 0;
 }
 
-#if defined(CONFIG_PCIEPORTBUS) || defined(CONFIG_EEH)
+#ifdef CONFIG_PCI_ERROR_RECOVERY
 /**
  * pci_uevent_ers - emit a uevent during recovery path of PCI device
  * @pdev: PCI device undergoing error recovery
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 2f519074f0f8..ee9f86feac06 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2216,6 +2216,7 @@ int pci_set_pcie_reset_state(struct pci_dev *dev, enum pcie_reset_state state)
 }
 EXPORT_SYMBOL_GPL(pci_set_pcie_reset_state);
 
+#ifdef CONFIG_PCIEAER
 void pcie_clear_device_status(struct pci_dev *dev)
 {
 	u16 sta;
@@ -2223,6 +2224,7 @@ void pcie_clear_device_status(struct pci_dev *dev)
 	pcie_capability_read_word(dev, PCI_EXP_DEVSTA, &sta);
 	pcie_capability_write_word(dev, PCI_EXP_DEVSTA, sta);
 }
+#endif
 
 /**
  * pcie_clear_root_pme_status - Clear root port PME interrupt status.
diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
index b2980db88cc0..5783a2f79e6a 100644
--- a/drivers/pci/pcie/Makefile
+++ b/drivers/pci/pcie/Makefile
@@ -2,12 +2,12 @@
 #
 # Makefile for PCI Express features and port driver
 
-pcieportdrv-y			:= portdrv_core.o portdrv_pci.o err.o rcec.o
+pcieportdrv-y			:= portdrv_core.o portdrv_pci.o rcec.o
 
 obj-$(CONFIG_PCIEPORTBUS)	+= pcieportdrv.o
 
 obj-$(CONFIG_PCIEASPM)		+= aspm.o
-obj-$(CONFIG_PCIEAER)		+= aer.o
+obj-$(CONFIG_PCIEAER)		+= aer.o err.o
 obj-$(CONFIG_PCIEAER_INJECT)	+= aer_inject.o
 obj-$(CONFIG_PCIE_PME)		+= pme.o
 obj-$(CONFIG_PCIE_DPC)		+= dpc.o
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/4] PCI: pciehp: Ignore Link Down/Up caused by error-induced Hot Reset
  2021-07-31 12:39 ` [PATCH 1/4] PCI: pciehp: Ignore Link Down/Up caused by error-induced Hot Reset Lukas Wunner
@ 2021-09-29 20:40   ` stuart hayes
  2021-10-07 23:00   ` Bjorn Helgaas
  1 sibling, 0 replies; 12+ messages in thread
From: stuart hayes @ 2021-09-29 20:40 UTC (permalink / raw)
  To: Lukas Wunner, Bjorn Helgaas
  Cc: Kuppuswamy Sathyanarayanan, Sinan Kaya, Ashok Raj, Keith Busch,
	Yicong Yang, linux-pci, Russell Currey, Oliver OHalloran,
	Mika Westerberg



On 7/31/2021 7:39 AM, Lukas Wunner wrote:
> Stuart Hayes reports that an error handled by DPC at a Root Port results
> in pciehp gratuitously bringing down a subordinate hotplug port:
> 
>    RP -- UP -- DP -- UP -- DP (hotplug) -- EP
> 
> pciehp brings the slot down because the Link to the Endpoint goes down.
> That is caused by a Hot Reset being propagated as a result of DPC.
> Per PCIe Base Spec 5.0, section 6.6.1 "Conventional Reset":
> 
>    For a Switch, the following must cause a hot reset to be sent on all
>    Downstream Ports: [...]
> 
>    * The Data Link Layer of the Upstream Port reporting DL_Down status.
>      In Switches that support Link speeds greater than 5.0 GT/s, the
>      Upstream Port must direct the LTSSM of each Downstream Port to the
>      Hot Reset state, but not hold the LTSSMs in that state. This permits
>      each Downstream Port to begin Link training immediately after its
>      hot reset completes. This behavior is recommended for all Switches.
> 
>    * Receiving a hot reset on the Upstream Port.
> 
> Once DPC recovers, pcie_do_recovery() walks down the hierarchy and
> invokes pcie_portdrv_slot_reset() to restore each port's config space.
> At that point, a hotplug interrupt is signaled per PCIe Base Spec r5.0,
> section 6.7.3.4 "Software Notification of Hot-Plug Events":
> 
>    If the Port is enabled for edge-triggered interrupt signaling using
>    MSI or MSI-X, an interrupt message must be sent every time the logical
>    AND of the following conditions transitions from FALSE to TRUE: [...]
> 
>    * The Hot-Plug Interrupt Enable bit in the Slot Control register is
>      set to 1b.
> 
>    * At least one hot-plug event status bit in the Slot Status register
>      and its associated enable bit in the Slot Control register are both
>      set to 1b.
> 
> Prevent pciehp from gratuitously bringing down the slot by clearing the
> error-induced Data Link Layer State Changed event before restoring
> config space.  Afterwards, check whether the link has unexpectedly
> failed to retrain and synthesize a DLLSC event if so.
> 
> Allow each pcie_port_service_driver (one of them being pciehp) to define
> a slot_reset callback and re-use the existing pm_iter() function to
> iterate over the callbacks.
> 
> Thereby, the Endpoint driver remains bound throughout error recovery and
> may restore the device to working state.
> 
> Surprise removal during error recovery is detected through a Presence
> Detect Changed event.  The hotplug port is expected to not signal that
> event as a result of a Hot Reset.
> 
> The issue isn't DPC-specific, it also occurs when an error is handled by
> AER through aer_root_reset().  So while the issue was noticed only now,
> it's been around since 2006 when AER support was first introduced.
> 
> Fixes: 6c2b374d7485 ("PCI-Express AER implemetation: AER core and aerdriver")
> Link: https://lore.kernel.org/linux-pci/08c046b0-c9f2-3489-eeef-7e7aca435bb9@gmail.com/
> Reported-by: Stuart Hayes <stuart.w.hayes@gmail.com>
> Signed-off-by: Lukas Wunner <lukas@wunner.de>
> Cc: stable@vger.kernel.org # v2.6.19+: ba952824e6c1: PCI/portdrv: Report reset for frozen channel
> Cc: Keith Busch <kbusch@kernel.org>

Tested-by: Stuart Hayes <stuart.w.hayes@gmail.com>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/4] PCI: pciehp: Ignore Link Down/Up caused by error-induced Hot Reset
  2021-07-31 12:39 ` [PATCH 1/4] PCI: pciehp: Ignore Link Down/Up caused by error-induced Hot Reset Lukas Wunner
  2021-09-29 20:40   ` stuart hayes
@ 2021-10-07 23:00   ` Bjorn Helgaas
  2021-10-10  9:14     ` Lukas Wunner
  1 sibling, 1 reply; 12+ messages in thread
From: Bjorn Helgaas @ 2021-10-07 23:00 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Kuppuswamy Sathyanarayanan, Sinan Kaya, Ashok Raj, Keith Busch,
	Yicong Yang, linux-pci, Russell Currey, Oliver OHalloran,
	Stuart Hayes, Mika Westerberg

On Sat, Jul 31, 2021 at 02:39:01PM +0200, Lukas Wunner wrote:
> Stuart Hayes reports that an error handled by DPC at a Root Port results
> in pciehp gratuitously bringing down a subordinate hotplug port:
> 
>   RP -- UP -- DP -- UP -- DP (hotplug) -- EP
> 
> pciehp brings the slot down because the Link to the Endpoint goes down.
> That is caused by a Hot Reset being propagated as a result of DPC.
> Per PCIe Base Spec 5.0, section 6.6.1 "Conventional Reset":
> 
>   For a Switch, the following must cause a hot reset to be sent on all
>   Downstream Ports: [...]
> 
>   * The Data Link Layer of the Upstream Port reporting DL_Down status.
>     In Switches that support Link speeds greater than 5.0 GT/s, the
>     Upstream Port must direct the LTSSM of each Downstream Port to the
>     Hot Reset state, but not hold the LTSSMs in that state. This permits
>     each Downstream Port to begin Link training immediately after its
>     hot reset completes. This behavior is recommended for all Switches.
> 
>   * Receiving a hot reset on the Upstream Port.
> 
> Once DPC recovers, pcie_do_recovery() walks down the hierarchy and
> invokes pcie_portdrv_slot_reset() to restore each port's config space.
> At that point, a hotplug interrupt is signaled per PCIe Base Spec r5.0,
> section 6.7.3.4 "Software Notification of Hot-Plug Events":
> 
>   If the Port is enabled for edge-triggered interrupt signaling using
>   MSI or MSI-X, an interrupt message must be sent every time the logical
>   AND of the following conditions transitions from FALSE to TRUE: [...]
> 
>   * The Hot-Plug Interrupt Enable bit in the Slot Control register is
>     set to 1b.
> 
>   * At least one hot-plug event status bit in the Slot Status register
>     and its associated enable bit in the Slot Control register are both
>     set to 1b.
> 
> Prevent pciehp from gratuitously bringing down the slot by clearing the
> error-induced Data Link Layer State Changed event before restoring
> config space.  Afterwards, check whether the link has unexpectedly
> failed to retrain and synthesize a DLLSC event if so.
> 
> Allow each pcie_port_service_driver (one of them being pciehp) to define
> a slot_reset callback and re-use the existing pm_iter() function to
> iterate over the callbacks.
> 
> Thereby, the Endpoint driver remains bound throughout error recovery and
> may restore the device to working state.
> 
> Surprise removal during error recovery is detected through a Presence
> Detect Changed event.  The hotplug port is expected to not signal that
> event as a result of a Hot Reset.
> 
> The issue isn't DPC-specific, it also occurs when an error is handled by
> AER through aer_root_reset().  So while the issue was noticed only now,
> it's been around since 2006 when AER support was first introduced.
> 
> Fixes: 6c2b374d7485 ("PCI-Express AER implemetation: AER core and aerdriver")
> Link: https://lore.kernel.org/linux-pci/08c046b0-c9f2-3489-eeef-7e7aca435bb9@gmail.com/
> Reported-by: Stuart Hayes <stuart.w.hayes@gmail.com>
> Signed-off-by: Lukas Wunner <lukas@wunner.de>
> Cc: stable@vger.kernel.org # v2.6.19+: ba952824e6c1: PCI/portdrv: Report reset for frozen channel

Have you tried backporting this to v2.6.19?  I bet it's tough.  Not
sure we should suggest that stable pick this up unless there's a
reasonable path to do that.

> Cc: Keith Busch <kbusch@kernel.org>
> ---
>  drivers/pci/Kconfig               |  3 +++
>  drivers/pci/hotplug/pciehp.h      |  2 ++
>  drivers/pci/hotplug/pciehp_core.c |  4 ++++
>  drivers/pci/hotplug/pciehp_hpc.c  | 28 ++++++++++++++++++++++++++++
>  drivers/pci/pcie/portdrv.h        |  2 ++
>  drivers/pci/pcie/portdrv_core.c   | 20 ++++++++++----------
>  drivers/pci/pcie/portdrv_pci.c    |  3 +++
>  7 files changed, 52 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 0c473d75e625..a295d3c48927 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -182,6 +182,9 @@ config PCI_LABEL
>  	def_bool y if (DMI || ACPI)
>  	select NLS
>  
> +config PCI_ERROR_RECOVERY
> +	def_bool PCIEAER || EEH

I'm having a hard time connecting this to the code.  

pcie_portdrv_slot_reset() is the .slot_reset() method for
pcie_portdriver.  When CONFIG_PCIEPORTBUS=y, portdrv is bound to every
PCIe port and RCEC.

The callers of pci_driver->err_handler->slot_reset() that I see are:

  eeh_report_reset		# arch/powerpc/kernel/eeh_driver.c
  report_slot_reset		# drivers/pci/pcie/err.c
  pci_error_handlers		# drivers/misc/cxl/guest.c
  cxl_pci_slot_reset		# drivers/misc/cxl/pci.c
  pcifront_common_process	# drivers/pci/xen-pcifront.c

I guess the cxl and xen cases probably do not involve PCIe ports;
they're probably only concerned with endpoints, so maybe we can
exclude those.

But this still seems like it's kind of dangling.  It's not obvious to
me why pciehp_slot_reset() should be inside that #ifdef.  Do we need
the #ifdef for a functional reason, or is it there because we know it
will never be called?  If the latter, I don't think the savings are
worth it.

>  config PCI_HYPERV
>  	tristate "Hyper-V PCI Frontend"
>  	depends on X86_64 && HYPERV && PCI_MSI && PCI_MSI_IRQ_DOMAIN && SYSFS
> diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
> index d4a930881054..7f24fe30a898 100644
> --- a/drivers/pci/hotplug/pciehp.h
> +++ b/drivers/pci/hotplug/pciehp.h
> @@ -189,6 +189,8 @@ int pciehp_get_attention_status(struct hotplug_slot *hotplug_slot, u8 *status);
>  int pciehp_set_raw_indicator_status(struct hotplug_slot *h_slot, u8 status);
>  int pciehp_get_raw_indicator_status(struct hotplug_slot *h_slot, u8 *status);
>  
> +int pciehp_slot_reset(struct pcie_device *dev);
> +
>  static inline const char *slot_name(struct controller *ctrl)
>  {
>  	return hotplug_slot_name(&ctrl->hotplug_slot);
> diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
> index ad3393930ecb..46a62237dcc8 100644
> --- a/drivers/pci/hotplug/pciehp_core.c
> +++ b/drivers/pci/hotplug/pciehp_core.c
> @@ -351,6 +351,10 @@ static struct pcie_port_service_driver hpdriver_portdrv = {
>  	.runtime_suspend = pciehp_runtime_suspend,
>  	.runtime_resume	= pciehp_runtime_resume,
>  #endif	/* PM */
> +
> +#ifdef	CONFIG_PCI_ERROR_RECOVERY
> +	.slot_reset	= pciehp_slot_reset,
> +#endif
>  };
>  
>  int __init pcie_hp_init(void)
> diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
> index 9d06939736c0..72ef22d0c2c9 100644
> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -862,6 +862,34 @@ void pcie_disable_interrupt(struct controller *ctrl)
>  	pcie_write_cmd(ctrl, 0, mask);
>  }
>  
> +#ifdef CONFIG_PCI_ERROR_RECOVERY
> +/**
> + * pciehp_slot_reset() - ignore link event caused by error-induced hot reset
> + * @dev: PCI Express port service device
> + *
> + * Called from pcie_portdrv_slot_reset() after AER or DPC initiated a reset
> + * further up in the hierarchy to recover from an error.  The reset was
> + * propagated down to this hotplug port.  Ignore the resulting link flap.
> + * If the link failed to retrain successfully, synthesize the ignored event.
> + * Surprise removal during reset is detected through Presence Detect Changed.
> + */
> +int pciehp_slot_reset(struct pcie_device *dev)
> +{
> +	struct controller *ctrl = get_service_data(dev);
> +
> +	if (ctrl->state != ON_STATE)
> +		return 0;
> +
> +	pcie_capability_write_word(dev->port, PCI_EXP_SLTSTA,
> +				   PCI_EXP_SLTSTA_DLLSC);
> +
> +	if (!pciehp_check_link_active(ctrl))
> +		pciehp_request(ctrl, PCI_EXP_SLTSTA_DLLSC);
> +
> +	return 0;
> +}
> +#endif
> +
>  /*
>   * pciehp has a 1:1 bus:slot relationship so we ultimately want a secondary
>   * bus reset of the bridge, but at the same time we want to ensure that it is
> diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
> index 2ff5724b8f13..92a776d211ec 100644
> --- a/drivers/pci/pcie/portdrv.h
> +++ b/drivers/pci/pcie/portdrv.h
> @@ -85,6 +85,7 @@ struct pcie_port_service_driver {
>  	int (*runtime_suspend)(struct pcie_device *dev);
>  	int (*runtime_resume)(struct pcie_device *dev);
>  
> +	int (*slot_reset)(struct pcie_device *dev);
>  	/* Device driver may resume normal operations */
>  	void (*error_resume)(struct pci_dev *dev);
>  
> @@ -110,6 +111,7 @@ void pcie_port_service_unregister(struct pcie_port_service_driver *new);
>  
>  extern struct bus_type pcie_port_bus_type;
>  int pcie_port_device_register(struct pci_dev *dev);
> +int pcie_port_device_iter(struct device *dev, void *data);
>  #ifdef CONFIG_PM
>  int pcie_port_device_suspend(struct device *dev);
>  int pcie_port_device_resume_noirq(struct device *dev);
> diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
> index e1fed6649c41..ebcec7daa245 100644
> --- a/drivers/pci/pcie/portdrv_core.c
> +++ b/drivers/pci/pcie/portdrv_core.c
> @@ -362,24 +362,24 @@ int pcie_port_device_register(struct pci_dev *dev)
>  	return status;
>  }
>  
> -#ifdef CONFIG_PM
> -typedef int (*pcie_pm_callback_t)(struct pcie_device *);
> +typedef int (*pcie_callback_t)(struct pcie_device *);
>  
> -static int pm_iter(struct device *dev, void *data)
> +int pcie_port_device_iter(struct device *dev, void *data)

I like this change, and it seems like it's basically a rename that
could be split off from rest to make the slot_reset part a little more
focused.

>  {
>  	struct pcie_port_service_driver *service_driver;
>  	size_t offset = *(size_t *)data;
> -	pcie_pm_callback_t cb;
> +	pcie_callback_t cb;
>  
>  	if ((dev->bus == &pcie_port_bus_type) && dev->driver) {
>  		service_driver = to_service_driver(dev->driver);
> -		cb = *(pcie_pm_callback_t *)((void *)service_driver + offset);
> +		cb = *(pcie_callback_t *)((void *)service_driver + offset);
>  		if (cb)
>  			return cb(to_pcie_device(dev));
>  	}
>  	return 0;
>  }
>  
> +#ifdef CONFIG_PM
>  /**
>   * pcie_port_device_suspend - suspend port services associated with a PCIe port
>   * @dev: PCI Express port to handle
> @@ -387,13 +387,13 @@ static int pm_iter(struct device *dev, void *data)
>  int pcie_port_device_suspend(struct device *dev)
>  {
>  	size_t off = offsetof(struct pcie_port_service_driver, suspend);
> -	return device_for_each_child(dev, &off, pm_iter);
> +	return device_for_each_child(dev, &off, pcie_port_device_iter);
>  }
>  
>  int pcie_port_device_resume_noirq(struct device *dev)
>  {
>  	size_t off = offsetof(struct pcie_port_service_driver, resume_noirq);
> -	return device_for_each_child(dev, &off, pm_iter);
> +	return device_for_each_child(dev, &off, pcie_port_device_iter);
>  }
>  
>  /**
> @@ -403,7 +403,7 @@ int pcie_port_device_resume_noirq(struct device *dev)
>  int pcie_port_device_resume(struct device *dev)
>  {
>  	size_t off = offsetof(struct pcie_port_service_driver, resume);
> -	return device_for_each_child(dev, &off, pm_iter);
> +	return device_for_each_child(dev, &off, pcie_port_device_iter);
>  }
>  
>  /**
> @@ -413,7 +413,7 @@ int pcie_port_device_resume(struct device *dev)
>  int pcie_port_device_runtime_suspend(struct device *dev)
>  {
>  	size_t off = offsetof(struct pcie_port_service_driver, runtime_suspend);
> -	return device_for_each_child(dev, &off, pm_iter);
> +	return device_for_each_child(dev, &off, pcie_port_device_iter);
>  }
>  
>  /**
> @@ -423,7 +423,7 @@ int pcie_port_device_runtime_suspend(struct device *dev)
>  int pcie_port_device_runtime_resume(struct device *dev)
>  {
>  	size_t off = offsetof(struct pcie_port_service_driver, runtime_resume);
> -	return device_for_each_child(dev, &off, pm_iter);
> +	return device_for_each_child(dev, &off, pcie_port_device_iter);
>  }
>  #endif /* PM */
>  
> diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
> index c7ff1eea225a..1af74c3d9d5d 100644
> --- a/drivers/pci/pcie/portdrv_pci.c
> +++ b/drivers/pci/pcie/portdrv_pci.c
> @@ -160,6 +160,9 @@ static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
>  
>  static pci_ers_result_t pcie_portdrv_slot_reset(struct pci_dev *dev)
>  {
> +	size_t off = offsetof(struct pcie_port_service_driver, slot_reset);
> +	device_for_each_child(&dev->dev, &off, pcie_port_device_iter);
> +
>  	pci_restore_state(dev);
>  	pci_save_state(dev);
>  	return PCI_ERS_RESULT_RECOVERED;
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/4] PCI: pciehp: Ignore Link Down/Up caused by error-induced Hot Reset
  2021-10-07 23:00   ` Bjorn Helgaas
@ 2021-10-10  9:14     ` Lukas Wunner
  2021-10-11 17:25       ` Bjorn Helgaas
  0 siblings, 1 reply; 12+ messages in thread
From: Lukas Wunner @ 2021-10-10  9:14 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Kuppuswamy Sathyanarayanan, Sinan Kaya, Ashok Raj, Keith Busch,
	Yicong Yang, linux-pci, Russell Currey, Oliver OHalloran,
	Stuart Hayes, Mika Westerberg

On Thu, Oct 07, 2021 at 06:00:20PM -0500, Bjorn Helgaas wrote:
> On Sat, Jul 31, 2021 at 02:39:01PM +0200, Lukas Wunner wrote:
> > The issue isn't DPC-specific, it also occurs when an error is handled by
> > AER through aer_root_reset().  So while the issue was noticed only now,
> > it's been around since 2006 when AER support was first introduced.
> > 
> > Fixes: 6c2b374d7485 ("PCI-Express AER implemetation: AER core and aerdriver")
[...]
> > Cc: stable@vger.kernel.org # v2.6.19+: ba952824e6c1: PCI/portdrv: Report reset for frozen channel
> 
> Have you tried backporting this to v2.6.19?  I bet it's tough.  Not
> sure we should suggest that stable pick this up unless there's a
> reasonable path to do that.

The oldest kernel.org stable release that still receives updates is v4.4.
There may be older distribution kernels out there which continue to be
supported.  We don't know for sure, so I think it's customary to tag
the release that introduced an issue and leave it to the stable and
distribution maintainers to choose what they backport.

It's true that patches often do not apply cleanly to older releases.
There are some good folks who regularly take up the thankless task of
backporting (Sudip Mukherjee to name but one), but I've also frequently
backported patches myself where necessary.


> > +config PCI_ERROR_RECOVERY
> > +	def_bool PCIEAER || EEH
> 
> I'm having a hard time connecting this to the code.
[...]
> But this still seems like it's kind of dangling.  It's not obvious to
> me why pciehp_slot_reset() should be inside that #ifdef.  Do we need
> the #ifdef for a functional reason, or is it there because we know it
> will never be called?  If the latter, I don't think the savings are
> worth it.

The motivation for the #ifdef was merely to reduce code size if neither
of PCIEAER or EEH is enabled.  Happy to remove it.

I have a different question though.  We've often discussed deprecating
portdrv and moving port services to the core instead.  I've stumbled
across commit a39bd851dccf ("PCI/PM: Clear PCIe PME Status bit in core,
not PCIe port driver"), wherein you moved pcie_pme_root_status_cleanup()
from portdrv to the core, which allowed you to drop the ->resume_noirq
callback from portdrv.

I've been thinking what moving port services to the core would look like
and what your vision for that might be.  If that commit is any indication,
it seems you'd probably prefer that pciehp_slot_reset() (as introduced in
the present patch) is called directly from report_slot_reset() in
drivers/pci/pcie/err.c.

That would eliminate much of the plumbing contained in the present patch
to allow each port service driver to define a ->slot_reset callback and
iterate over those callbacks.  pciehp is probably the only port service
which requires special handling upon ->slot_reset and the same goes for
a lot of the error handling and power management callbacks defined by
port services.  So all that plumbing is probably just a very roundabout
way of doing things.

When calling into port service code from core code, one needs to find
the port service's driver data.  For now we can resort to
pcie_port_find_device(), but I imagine once we move all port services
to the core, we'll be able to access their data directly from
struct pci_dev, e.g. via pdev->hotplug or pdev->pcie->hotplug pointers.

Does that make sense to you?

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/4] PCI: pciehp: Ignore Link Down/Up caused by error-induced Hot Reset
  2021-10-10  9:14     ` Lukas Wunner
@ 2021-10-11 17:25       ` Bjorn Helgaas
  0 siblings, 0 replies; 12+ messages in thread
From: Bjorn Helgaas @ 2021-10-11 17:25 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Kuppuswamy Sathyanarayanan, Sinan Kaya, Ashok Raj, Keith Busch,
	Yicong Yang, linux-pci, Russell Currey, Oliver OHalloran,
	Stuart Hayes, Mika Westerberg

On Sun, Oct 10, 2021 at 11:14:07AM +0200, Lukas Wunner wrote:
> On Thu, Oct 07, 2021 at 06:00:20PM -0500, Bjorn Helgaas wrote:
> > On Sat, Jul 31, 2021 at 02:39:01PM +0200, Lukas Wunner wrote:
> > > The issue isn't DPC-specific, it also occurs when an error is handled by
> > > AER through aer_root_reset().  So while the issue was noticed only now,
> > > it's been around since 2006 when AER support was first introduced.
> > > 
> > > Fixes: 6c2b374d7485 ("PCI-Express AER implemetation: AER core and aerdriver")
> [...]
> > > Cc: stable@vger.kernel.org # v2.6.19+: ba952824e6c1: PCI/portdrv: Report reset for frozen channel
> > 
> > Have you tried backporting this to v2.6.19?  I bet it's tough.  Not
> > sure we should suggest that stable pick this up unless there's a
> > reasonable path to do that.
> 
> The oldest kernel.org stable release that still receives updates is v4.4.
> There may be older distribution kernels out there which continue to be
> supported.  We don't know for sure, so I think it's customary to tag
> the release that introduced an issue and leave it to the stable and
> distribution maintainers to choose what they backport.
> 
> It's true that patches often do not apply cleanly to older releases.
> There are some good folks who regularly take up the thankless task of
> backporting (Sudip Mukherjee to name but one), but I've also frequently
> backported patches myself where necessary.
> 
> 
> > > +config PCI_ERROR_RECOVERY
> > > +	def_bool PCIEAER || EEH
> > 
> > I'm having a hard time connecting this to the code.
> [...]
> > But this still seems like it's kind of dangling.  It's not obvious to
> > me why pciehp_slot_reset() should be inside that #ifdef.  Do we need
> > the #ifdef for a functional reason, or is it there because we know it
> > will never be called?  If the latter, I don't think the savings are
> > worth it.
> 
> The motivation for the #ifdef was merely to reduce code size if neither
> of PCIEAER or EEH is enabled.  Happy to remove it.

That'd be great.

> I have a different question though.  We've often discussed deprecating
> portdrv and moving port services to the core instead.  I've stumbled
> across commit a39bd851dccf ("PCI/PM: Clear PCIe PME Status bit in core,
> not PCIe port driver"), wherein you moved pcie_pme_root_status_cleanup()
> from portdrv to the core, which allowed you to drop the ->resume_noirq
> callback from portdrv.
> 
> I've been thinking what moving port services to the core would look like
> and what your vision for that might be.  If that commit is any indication,
> it seems you'd probably prefer that pciehp_slot_reset() (as introduced in
> the present patch) is called directly from report_slot_reset() in
> drivers/pci/pcie/err.c.
> 
> That would eliminate much of the plumbing contained in the present patch
> to allow each port service driver to define a ->slot_reset callback and
> iterate over those callbacks.  pciehp is probably the only port service
> which requires special handling upon ->slot_reset and the same goes for
> a lot of the error handling and power management callbacks defined by
> port services.  So all that plumbing is probably just a very roundabout
> way of doing things.
> 
> When calling into port service code from core code, one needs to find
> the port service's driver data.  For now we can resort to
> pcie_port_find_device(), but I imagine once we move all port services
> to the core, we'll be able to access their data directly from
> struct pci_dev, e.g. via pdev->hotplug or pdev->pcie->hotplug pointers.

We managed to get rid of pcie_port_find_service() recently, and I'd
love to get rid of pcie_port_find_device() as well.  I'd like to get
rid of struct pcie_device, too, but that's visible via sysfs and would
be harder.  But maybe we could at least stop using it internally.

I don't know exactly what the ->slot_reset() path would look like, and
I'm not suggesting you need to change that for this patch.

Bjorn

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/4] pciehp error recovery fix + cleanups
  2021-07-31 12:39 [PATCH 0/4] pciehp error recovery fix + cleanups Lukas Wunner
                   ` (3 preceding siblings ...)
  2021-07-31 12:39 ` [PATCH 4/4] PCI/ERR: Reduce compile time for CONFIG_PCIEAER=n Lukas Wunner
@ 2021-10-15 19:29 ` Bjorn Helgaas
  2021-10-16  9:06   ` Lukas Wunner
  4 siblings, 1 reply; 12+ messages in thread
From: Bjorn Helgaas @ 2021-10-15 19:29 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Kuppuswamy Sathyanarayanan, Sinan Kaya, Ashok Raj, Keith Busch,
	Yicong Yang, linux-pci, Russell Currey, Oliver OHalloran,
	Stuart Hayes, Mika Westerberg, Oza Pawandeep

On Sat, Jul 31, 2021 at 02:39:00PM +0200, Lukas Wunner wrote:
> One fix for a pciehp error recovery issue spotted by Stuart
> plus three cleanups.  Please review and test.  Thanks!
> 
> Lukas Wunner (4):
>   PCI: pciehp: Ignore Link Down/Up caused by error-induced Hot Reset
>   PCI/portdrv: Remove unused resume err_handler
>   PCI/portdrv: Remove unused pcie_port_bus_{,un}register() declarations
>   PCI/ERR: Reduce compile time for CONFIG_PCIEAER=n
> 
>  drivers/pci/Kconfig               |  3 +++
>  drivers/pci/hotplug/pciehp.h      |  2 ++
>  drivers/pci/hotplug/pciehp_core.c |  4 ++++
>  drivers/pci/hotplug/pciehp_hpc.c  | 28 ++++++++++++++++++++++++++++
>  drivers/pci/pci-driver.c          |  2 +-
>  drivers/pci/pci.c                 |  2 ++
>  drivers/pci/pcie/Makefile         |  4 ++--
>  drivers/pci/pcie/portdrv.h        |  6 ++----
>  drivers/pci/pcie/portdrv_core.c   | 20 ++++++++++----------
>  drivers/pci/pcie/portdrv_pci.c    | 27 +++------------------------
>  10 files changed, 57 insertions(+), 41 deletions(-)

Applied to pci/hotplug for v5.16, thanks!

I split off the pm_iter() to its own patch at the beginning.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/4] pciehp error recovery fix + cleanups
  2021-10-15 19:29 ` [PATCH 0/4] pciehp error recovery fix + cleanups Bjorn Helgaas
@ 2021-10-16  9:06   ` Lukas Wunner
  2021-10-16 14:23     ` Bjorn Helgaas
  0 siblings, 1 reply; 12+ messages in thread
From: Lukas Wunner @ 2021-10-16  9:06 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Kuppuswamy Sathyanarayanan, Sinan Kaya, Ashok Raj, Keith Busch,
	Yicong Yang, linux-pci, Russell Currey, Oliver OHalloran,
	Stuart Hayes, Mika Westerberg, Oza Pawandeep

On Fri, Oct 15, 2021 at 02:29:16PM -0500, Bjorn Helgaas wrote:
> On Sat, Jul 31, 2021 at 02:39:00PM +0200, Lukas Wunner wrote:
> > One fix for a pciehp error recovery issue spotted by Stuart
> > plus three cleanups.  Please review and test.  Thanks!
> > 
> > Lukas Wunner (4):
> >   PCI: pciehp: Ignore Link Down/Up caused by error-induced Hot Reset
> >   PCI/portdrv: Remove unused resume err_handler
> >   PCI/portdrv: Remove unused pcie_port_bus_{,un}register() declarations
> >   PCI/ERR: Reduce compile time for CONFIG_PCIEAER=n
> 
> Applied to pci/hotplug for v5.16, thanks!
> 
> I split off the pm_iter() to its own patch at the beginning.

Thanks a lot, I wouldn't have gotten around to respinning the series
until in a couple of days, so it's helpful that you took over.

The lkp robot reported a trivial build issue caused by a dangling
'#ifdef CONFIG_PCI_ERROR_RECOVERY'.  If you could squash the below
fix into the top-most commit on the branch I'd be grateful.  Thanks!

-- >8 --
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 4d3f0e2..b0923bd 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -1542,7 +1542,7 @@ static int pci_uevent(struct device *dev, struct kobj_uevent_env *env)
 	return 0;
 }
 
-#ifdef CONFIG_PCI_ERROR_RECOVERY
+#if defined(CONFIG_PCIEAER) || defined(CONFIG_EEH)
 /**
  * pci_uevent_ers - emit a uevent during recovery path of PCI device
  * @pdev: PCI device undergoing error recovery

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/4] pciehp error recovery fix + cleanups
  2021-10-16  9:06   ` Lukas Wunner
@ 2021-10-16 14:23     ` Bjorn Helgaas
  0 siblings, 0 replies; 12+ messages in thread
From: Bjorn Helgaas @ 2021-10-16 14:23 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Kuppuswamy Sathyanarayanan, Sinan Kaya, Ashok Raj, Keith Busch,
	Yicong Yang, linux-pci, Russell Currey, Oliver OHalloran,
	Stuart Hayes, Mika Westerberg, Oza Pawandeep

On Sat, Oct 16, 2021 at 11:06:36AM +0200, Lukas Wunner wrote:
> On Fri, Oct 15, 2021 at 02:29:16PM -0500, Bjorn Helgaas wrote:
> > On Sat, Jul 31, 2021 at 02:39:00PM +0200, Lukas Wunner wrote:
> > > One fix for a pciehp error recovery issue spotted by Stuart
> > > plus three cleanups.  Please review and test.  Thanks!
> > > 
> > > Lukas Wunner (4):
> > >   PCI: pciehp: Ignore Link Down/Up caused by error-induced Hot Reset
> > >   PCI/portdrv: Remove unused resume err_handler
> > >   PCI/portdrv: Remove unused pcie_port_bus_{,un}register() declarations
> > >   PCI/ERR: Reduce compile time for CONFIG_PCIEAER=n
> > 
> > Applied to pci/hotplug for v5.16, thanks!
> > 
> > I split off the pm_iter() to its own patch at the beginning.
> 
> Thanks a lot, I wouldn't have gotten around to respinning the series
> until in a couple of days, so it's helpful that you took over.
> 
> The lkp robot reported a trivial build issue caused by a dangling
> '#ifdef CONFIG_PCI_ERROR_RECOVERY'.  If you could squash the below
> fix into the top-most commit on the branch I'd be grateful.  Thanks!

Oops, sorry, my fault!  Thanks a lot, I squashed this in.  Updated
patch appended.

> -- >8 --
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 4d3f0e2..b0923bd 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -1542,7 +1542,7 @@ static int pci_uevent(struct device *dev, struct kobj_uevent_env *env)
>  	return 0;
>  }
>  
> -#ifdef CONFIG_PCI_ERROR_RECOVERY
> +#if defined(CONFIG_PCIEAER) || defined(CONFIG_EEH)
>  /**
>   * pci_uevent_ers - emit a uevent during recovery path of PCI device
>   * @pdev: PCI device undergoing error recovery

commit f9a6c8ad4922 ("PCI/ERR: Reduce compile time for CONFIG_PCIEAER=n")
Author: Lukas Wunner <lukas@wunner.de>
Date:   Sat Jul 31 14:39:04 2021 +0200

    PCI/ERR: Reduce compile time for CONFIG_PCIEAER=n
    
    The sole non-static function in err.c, pcie_do_recovery(), is only
    called from:
    
    * aer.c (if CONFIG_PCIEAER=y)
    * dpc.c (if CONFIG_PCIE_DPC=y, which depends on CONFIG_PCIEAER)
    * edr.c (if CONFIG_PCIE_EDR=y, which depends on CONFIG_PCIE_DPC)
    
    Thus, err.c need not be compiled if CONFIG_PCIEAER=n.
    
    Also, pci_uevent_ers() and pcie_clear_device_status(), which are called
    from err.c, can be #ifdef'ed away unless CONFIG_PCIEAER=y.
    
    Since x86_64_defconfig doesn't enable CONFIG_PCIEAER, this change may
    slightly reduce compile time for anyone doing a test build with that
    config.
    
    Link: https://lore.kernel.org/r/98f9041151268c1c035ab64cca320ad86803f64a.1627638184.git.lukas@wunner.de
    Signed-off-by: Lukas Wunner <lukas@wunner.de>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 2761ab86490d..b0923bdcdb2e 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -1542,7 +1542,7 @@ static int pci_uevent(struct device *dev, struct kobj_uevent_env *env)
 	return 0;
 }
 
-#if defined(CONFIG_PCIEPORTBUS) || defined(CONFIG_EEH)
+#if defined(CONFIG_PCIEAER) || defined(CONFIG_EEH)
 /**
  * pci_uevent_ers - emit a uevent during recovery path of PCI device
  * @pdev: PCI device undergoing error recovery
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index ce2ab62b64cf..e9d95189c79b 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2218,6 +2218,7 @@ int pci_set_pcie_reset_state(struct pci_dev *dev, enum pcie_reset_state state)
 }
 EXPORT_SYMBOL_GPL(pci_set_pcie_reset_state);
 
+#ifdef CONFIG_PCIEAER
 void pcie_clear_device_status(struct pci_dev *dev)
 {
 	u16 sta;
@@ -2225,6 +2226,7 @@ void pcie_clear_device_status(struct pci_dev *dev)
 	pcie_capability_read_word(dev, PCI_EXP_DEVSTA, &sta);
 	pcie_capability_write_word(dev, PCI_EXP_DEVSTA, sta);
 }
+#endif
 
 /**
  * pcie_clear_root_pme_status - Clear root port PME interrupt status.
diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
index b2980db88cc0..5783a2f79e6a 100644
--- a/drivers/pci/pcie/Makefile
+++ b/drivers/pci/pcie/Makefile
@@ -2,12 +2,12 @@
 #
 # Makefile for PCI Express features and port driver
 
-pcieportdrv-y			:= portdrv_core.o portdrv_pci.o err.o rcec.o
+pcieportdrv-y			:= portdrv_core.o portdrv_pci.o rcec.o
 
 obj-$(CONFIG_PCIEPORTBUS)	+= pcieportdrv.o
 
 obj-$(CONFIG_PCIEASPM)		+= aspm.o
-obj-$(CONFIG_PCIEAER)		+= aer.o
+obj-$(CONFIG_PCIEAER)		+= aer.o err.o
 obj-$(CONFIG_PCIEAER_INJECT)	+= aer_inject.o
 obj-$(CONFIG_PCIE_PME)		+= pme.o
 obj-$(CONFIG_PCIE_DPC)		+= dpc.o

^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2021-10-16 14:23 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-31 12:39 [PATCH 0/4] pciehp error recovery fix + cleanups Lukas Wunner
2021-07-31 12:39 ` [PATCH 1/4] PCI: pciehp: Ignore Link Down/Up caused by error-induced Hot Reset Lukas Wunner
2021-09-29 20:40   ` stuart hayes
2021-10-07 23:00   ` Bjorn Helgaas
2021-10-10  9:14     ` Lukas Wunner
2021-10-11 17:25       ` Bjorn Helgaas
2021-07-31 12:39 ` [PATCH 2/4] PCI/portdrv: Remove unused resume err_handler Lukas Wunner
2021-07-31 12:39 ` [PATCH 3/4] PCI/portdrv: Remove unused pcie_port_bus_{,un}register() declarations Lukas Wunner
2021-07-31 12:39 ` [PATCH 4/4] PCI/ERR: Reduce compile time for CONFIG_PCIEAER=n Lukas Wunner
2021-10-15 19:29 ` [PATCH 0/4] pciehp error recovery fix + cleanups Bjorn Helgaas
2021-10-16  9:06   ` Lukas Wunner
2021-10-16 14:23     ` Bjorn Helgaas

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.