[PATCHv2 00/20] PCI, error handling and hot plug

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCHv2 00/20] PCI, error handling and hot plug
@ 2018-09-05 20:35 Keith Busch
  2018-09-05 20:35 ` [PATCHv2 01/20] PCI: Simplify disconnected marking Keith Busch
                   ` (20 more replies)
  0 siblings, 21 replies; 44+ messages in thread
From: Keith Busch @ 2018-09-05 20:35 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Keith Busch

v1 -> v2:

  * Use Dennis' patch for the incorrect slot reset detection since he
    posted that fix first

  * I found some DPC and HPC capable ports (PLX Device 9781 to be
    specific) that don't have data-link active reporting capabilities,
    so I added another patch to handle that

  * If the recovery determines the precence detection has changed during
    error handling, we need to prevent the downstream driver from
    accessing the device under its old context. This was a little tricky
    because of a circular dependency on the pci_bus_sem, so there is a
    prep patch to allow recursive pci bus walking, and then we use it
    from pciehp's slot_reset callback.

  * Make error handling not able to change the error state away from
    pci_channel_state_perm_failure so that hotplug and error handling
    may use the same state (suggested by Benjamin Herrenschmidt)

  * Moved the link active wait requirements into generic code
    (suggested by Sinan Kaya)

  * Check for successful secondary bus reset on recovery failure
    (suggested by Sinan Kaya)

  * Use pcie_device for service driver error callbacks (suggested by
    Lukas Wunner)

  * Hold pci_slot_lock when doing a slot reset (suggested by
    Lukas Wunner)

  * Fixed processing user orderly hotplug requests during error handling
    suggested by Lukas Wunner)

  * Various code cleanups (suggested by Christoph Hellwig)

  * Split dpc code cleanup into separate patch

  * Changelog grammer fixes and wording clarity

Dennis Dalessandro (1):
  PCI: Fix faulty logic in pci_reset_bus()

Keith Busch (18):
  PCI: Add required waits on link active
  PCI/AER: Remove dead code
  PCI/ERR: Use slot reset if available
  PCI/ERR: Handle fatal error recovery
  PCI/ERR: Always use the first downstream port
  PCI/ERR: Simplify broadcast callouts
  PCI/ERR: Report current recovery status for udev
  PCI/ERR: Remove devices on recovery failure
  PCI/portdrv: Provide pci error callbacks
  PCI/portdrv: Restore pci state on slot reset
  PCI: Make link active reporting detection generic
  PCI: Create recursive bus walk
  PCI/pciehp: Fix powerfault detection order
  PCI/pciehp: Implement error handling callbacks
  PCI/pciehp: Ignore link events during DPC event
  PCI/DPC: Wait for link active after reset
  PCI/DPC: Link reset code cleanup
  PCI: Unify device inaccessible

Lukas Wunner (1):
  PCI: Simplify disconnected marking

 drivers/pci/bus.c                 |  14 +-
 drivers/pci/hotplug/pciehp.h      |   2 +-
 drivers/pci/hotplug/pciehp_core.c |  39 +++++
 drivers/pci/hotplug/pciehp_hpc.c  |  56 +++----
 drivers/pci/hotplug/pciehp_pci.c  |   9 +-
 drivers/pci/pci.c                 |  68 +++++++-
 drivers/pci/pci.h                 |  17 +-
 drivers/pci/pcie/aer.c            |  27 ++--
 drivers/pci/pcie/dpc.c            |  37 +++--
 drivers/pci/pcie/err.c            | 327 +++++++++++++-------------------------
 drivers/pci/pcie/portdrv.h        |  10 +-
 drivers/pci/pcie/portdrv_pci.c    |  45 +++++-
 drivers/pci/probe.c               |   1 +
 drivers/pci/slot.c                |   2 +-
 include/linux/pci.h               |  10 ++
 15 files changed, 353 insertions(+), 311 deletions(-)

-- 
2.14.4

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCHv2 01/20] PCI: Simplify disconnected marking
  2018-09-05 20:35 [PATCHv2 00/20] PCI, error handling and hot plug Keith Busch
@ 2018-09-05 20:35 ` Keith Busch
  2018-09-05 20:35 ` [PATCHv2 02/20] PCI: Fix faulty logic in pci_reset_bus() Keith Busch
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2018-09-05 20:35 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig

From: Lukas Wunner <lukas@wunner.de>

Commit 89ee9f768003 ("PCI: Add device disconnected state") iterates over
the devices on a parent bus, marks each as disconnected, then marks
each device's children as disconnected using pci_walk_bus().

The same can be achieved more succinctly by calling pci_walk_bus() on
the parent bus.  Moreover, this does not need to wait until acquiring
pci_lock_rescan_remove(), so move it out of that critical section.

The critical section in err.c contains a pci_dev_get() / pci_dev_put()
pair which was apparently copy-pasted from pciehp_pci.c.  In the latter
it serves the purpose of holding the struct pci_dev in place until the
Command register is updated.  err.c doesn't do anything like that, hence
the pair is unnecessary.  Remove it.

Signed-off-by: Lukas Wunner <lukas@wunner.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Cc: Oza Pawandeep <poza@codeaurora.org>
Cc: Sinan Kaya <okaya@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 drivers/pci/hotplug/pciehp_pci.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
index 18f83e554c73..0322bd4f0a7a 100644
--- a/drivers/pci/hotplug/pciehp_pci.c
+++ b/drivers/pci/hotplug/pciehp_pci.c
@@ -91,6 +91,9 @@ void pciehp_unconfigure_device(struct slot *p_slot, bool presence)
 	ctrl_dbg(ctrl, "%s: domain:bus:dev = %04x:%02x:00\n",
 		 __func__, pci_domain_nr(parent), parent->number);
 
+	if (!presence)
+		pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
+
 	pci_lock_rescan_remove();
 
 	/*
@@ -102,12 +105,6 @@ void pciehp_unconfigure_device(struct slot *p_slot, bool presence)
 	list_for_each_entry_safe_reverse(dev, temp, &parent->devices,
 					 bus_list) {
 		pci_dev_get(dev);
-		if (!presence) {
-			pci_dev_set_disconnected(dev, NULL);
-			if (pci_has_subordinate(dev))
-				pci_walk_bus(dev->subordinate,
-					     pci_dev_set_disconnected, NULL);
-		}
 		pci_stop_and_remove_bus_device(dev);
 		/*
 		 * Ensure that no new Requests will be generated from
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCHv2 02/20] PCI: Fix faulty logic in pci_reset_bus()
  2018-09-05 20:35 [PATCHv2 00/20] PCI, error handling and hot plug Keith Busch
  2018-09-05 20:35 ` [PATCHv2 01/20] PCI: Simplify disconnected marking Keith Busch
@ 2018-09-05 20:35 ` Keith Busch
  2018-09-05 20:35 ` [PATCHv2 03/20] PCI: Add required waits on link active Keith Busch
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2018-09-05 20:35 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Dennis Dalessandro, Sinan Kaya

From: Dennis Dalessandro <dennis.dalessandro@intel.com>

The pci_rest_bus() function calls into pci_probe_reset_slot() to determine
whether to call the slot or bus reset. The check has faulty logic in that
it does not account for pci_probe_reset_slot() being able to return an
errno. Fix by only calling the slot reset when the function returns 0.
Treat < 1 and > 1 the same.

Cc: Sinan Kaya <okaya@codeaurora.org>
Fixes: 811c5cb37df4 ("PCI: Unify try slot and bus reset API")
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/pci/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 29ff9619b5fa..30b260332a10 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5200,7 +5200,7 @@ static int __pci_reset_bus(struct pci_bus *bus)
  */
 int pci_reset_bus(struct pci_dev *pdev)
 {
-	return pci_probe_reset_slot(pdev->slot) ?
+	return (!pci_probe_reset_slot(pdev->slot)) ?
 	    __pci_reset_slot(pdev->slot) : __pci_reset_bus(pdev->bus);
 }
 EXPORT_SYMBOL_GPL(pci_reset_bus);
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCHv2 03/20] PCI: Add required waits on link active
  2018-09-05 20:35 [PATCHv2 00/20] PCI, error handling and hot plug Keith Busch
  2018-09-05 20:35 ` [PATCHv2 01/20] PCI: Simplify disconnected marking Keith Busch
  2018-09-05 20:35 ` [PATCHv2 02/20] PCI: Fix faulty logic in pci_reset_bus() Keith Busch
@ 2018-09-05 20:35 ` Keith Busch
  2018-09-06 11:42   ` Lukas Wunner
  2018-09-05 20:35 ` [PATCHv2 04/20] PCI/AER: Remove dead code Keith Busch
                   ` (17 subsequent siblings)
  20 siblings, 1 reply; 44+ messages in thread
From: Keith Busch @ 2018-09-05 20:35 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Keith Busch

The spec has hard timing requirements when waiting for a link to become
active. This patch implements those hard delays when waiting for an
active link so each caller doesn't have to respect those timings.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pci.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 30b260332a10..8eb1d869fc98 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4485,21 +4485,33 @@ bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
 	bool ret;
 	u16 lnk_status;
 
+	/*
+	 * PCIe 4.0r1 6.6.1, a component must enter LTSSM Detect within 20ms,
+	 * after which we should expect an link active if the reset was
+	 * successful. If so, software must wait a minimum 100ms before sending
+	 * configuration requests to devices downstream this port.
+	 *
+	 * If the link fails to activate, either the device was physically
+	 * removed or the link is permanently failed.
+	 */
+	if (active)
+		msleep(20);
 	for (;;) {
 		pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status);
 		ret = !!(lnk_status & PCI_EXP_LNKSTA_DLLLA);
 		if (ret == active)
-			return true;
+			break;
 		if (timeout <= 0)
 			break;
 		msleep(10);
 		timeout -= 10;
 	}
-
-	pci_info(pdev, "Data Link Layer Link Active not %s in 1000 msec\n",
-		 active ? "set" : "cleared");
-
-	return false;
+	if (active && ret)
+		msleep(100);
+	else if (ret != active)
+		pci_info(pdev, "Data Link Layer Link Active not %s in 1000 msec\n",
+			active ? "set" : "cleared");
+	return ret == active;
 }
 
 void pci_reset_secondary_bus(struct pci_dev *dev)
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCHv2 04/20] PCI/AER: Remove dead code
  2018-09-05 20:35 [PATCHv2 00/20] PCI, error handling and hot plug Keith Busch
                   ` (2 preceding siblings ...)
  2018-09-05 20:35 ` [PATCHv2 03/20] PCI: Add required waits on link active Keith Busch
@ 2018-09-05 20:35 ` Keith Busch
  2018-09-05 20:35 ` [PATCHv2 05/20] PCI/ERR: Use slot reset if available Keith Busch
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2018-09-05 20:35 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Keith Busch

The error recovery callbacks are only run on child devices. A root port
is never a child device, so this error resume callback was never
invoked.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/aer.c | 13 -------------
 1 file changed, 13 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 83180edd6ed4..a95ceba977bb 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1541,18 +1541,6 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
 	return rc ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
 }
 
-/**
- * aer_error_resume - clean up corresponding error status bits
- * @dev: pointer to Root Port's pci_dev data structure
- *
- * Invoked by Port Bus driver during nonfatal recovery.
- */
-static void aer_error_resume(struct pci_dev *dev)
-{
-	pci_aer_clear_device_status(dev);
-	pci_cleanup_aer_uncorrect_error_status(dev);
-}
-
 static struct pcie_port_service_driver aerdriver = {
 	.name		= "aer",
 	.port_type	= PCI_EXP_TYPE_ROOT_PORT,
@@ -1560,7 +1548,6 @@ static struct pcie_port_service_driver aerdriver = {
 
 	.probe		= aer_probe,
 	.remove		= aer_remove,
-	.error_resume	= aer_error_resume,
 	.reset_link	= aer_root_reset,
 };
 
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCHv2 05/20] PCI/ERR: Use slot reset if available
  2018-09-05 20:35 [PATCHv2 00/20] PCI, error handling and hot plug Keith Busch
                   ` (3 preceding siblings ...)
  2018-09-05 20:35 ` [PATCHv2 04/20] PCI/AER: Remove dead code Keith Busch
@ 2018-09-05 20:35 ` Keith Busch
  2018-09-05 20:35 ` [PATCHv2 06/20] PCI/ERR: Handle fatal error recovery Keith Busch
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2018-09-05 20:35 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Keith Busch

The secondary bus reset may have link side effects that a hot plug
capable port may incorrectly react to. This patch will use the slot
specific reset for hotplug ports, fixing the undesirable link down-up
handling during error recovering.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pci.c      | 33 +++++++++++++++++++++++++++++++++
 drivers/pci/pci.h      |  2 ++
 drivers/pci/pcie/aer.c |  2 +-
 drivers/pci/pcie/err.c |  2 +-
 drivers/pci/slot.c     |  2 +-
 5 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 8eb1d869fc98..1d3b7248e5b9 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5164,6 +5164,39 @@ static int pci_bus_reset(struct pci_bus *bus, int probe)
 	return ret;
 }
 
+/**
+ * pci_bus_error_reset - reset the bridge's subordinate bus
+ * @bridge: The parent device that connects to the bus to reset
+ *
+ * This function will first try to reset the slots on this bus if the method is
+ * available. If slot reset fails or is not available, this will fall back to a
+ * secondary bus reset.
+ */
+int pci_bus_error_reset(struct pci_dev *bridge)
+{
+	struct pci_bus *bus = bridge->subordinate;
+
+	if (!bus)
+		return -ENOTTY;
+
+	mutex_lock(&pci_slot_mutex);
+	if (!list_empty(&bus->slots)) {
+		struct pci_slot *slot;
+
+		list_for_each_entry(slot, &bus->slots, list)
+			if (pci_probe_reset_slot(slot))
+				goto bus_reset;
+		list_for_each_entry(slot, &bus->slots, list)
+			if (pci_slot_reset(slot, 0))
+				goto bus_reset;
+	}
+	mutex_unlock(&pci_slot_mutex);
+	return 0;
+bus_reset:
+	mutex_unlock(&pci_slot_mutex);
+	return pci_bus_reset(bridge->subordinate, 0);
+}
+
 /**
  * pci_probe_reset_bus - probe whether a PCI bus can be reset
  * @bus: PCI bus to probe
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 6e0d1528d471..21bfa30db18d 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -35,6 +35,7 @@ int pci_mmap_fits(struct pci_dev *pdev, int resno, struct vm_area_struct *vmai,
 
 int pci_probe_reset_function(struct pci_dev *dev);
 int pci_bridge_secondary_bus_reset(struct pci_dev *dev);
+int pci_bus_error_reset(struct pci_dev *dev);
 
 /**
  * struct pci_platform_pm_ops - Firmware PM callbacks
@@ -136,6 +137,7 @@ static inline void pci_remove_legacy_files(struct pci_bus *bus) { return; }
 
 /* Lock for read/write access to pci device and bus lists */
 extern struct rw_semaphore pci_bus_sem;
+extern struct mutex pci_slot_mutex;
 
 extern raw_spinlock_t pci_lock;
 
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index a95ceba977bb..a96cea39ee80 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1526,7 +1526,7 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
 	reg32 &= ~ROOT_PORT_INTR_ON_MESG_MASK;
 	pci_write_config_dword(dev, pos + PCI_ERR_ROOT_COMMAND, reg32);
 
-	rc = pci_bridge_secondary_bus_reset(dev);
+	rc = pci_bus_error_reset(dev);
 	pci_printk(KERN_DEBUG, dev, "Root Port link has been reset\n");
 
 	/* Clear Root Error Status */
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 708fd3a0d646..12c1205e1d80 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -177,7 +177,7 @@ static pci_ers_result_t default_reset_link(struct pci_dev *dev)
 {
 	int rc;
 
-	rc = pci_bridge_secondary_bus_reset(dev);
+	rc = pci_bus_error_reset(dev);
 	pci_printk(KERN_DEBUG, dev, "downstream link has been reset\n");
 	return rc ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
 }
diff --git a/drivers/pci/slot.c b/drivers/pci/slot.c
index e634229ece89..193c909bdd45 100644
--- a/drivers/pci/slot.c
+++ b/drivers/pci/slot.c
@@ -14,7 +14,7 @@
 
 struct kset *pci_slots_kset;
 EXPORT_SYMBOL_GPL(pci_slots_kset);
-static DEFINE_MUTEX(pci_slot_mutex);
+DEFINE_MUTEX(pci_slot_mutex);
 
 static ssize_t pci_slot_attr_show(struct kobject *kobj,
 					struct attribute *attr, char *buf)
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCHv2 06/20] PCI/ERR: Handle fatal error recovery
  2018-09-05 20:35 [PATCHv2 00/20] PCI, error handling and hot plug Keith Busch
                   ` (4 preceding siblings ...)
  2018-09-05 20:35 ` [PATCHv2 05/20] PCI/ERR: Use slot reset if available Keith Busch
@ 2018-09-05 20:35 ` Keith Busch
  2018-09-05 20:35 ` [PATCHv2 07/20] PCI/ERR: Always use the first downstream port Keith Busch
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2018-09-05 20:35 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Keith Busch

We don't need to be paranoid about the topology changing while handling an
error. If the device has changed in a hotplug capable slot, we can rely
on the presence detection handling to react to a changing topology. This
patch restores the fatal error handling behavior that existed before
merging DPC with AER with 7e9084b3674 ("PCI/AER: Handle ERR_FATAL with
removal and re-enumeration of devices").

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pci.h      |  4 +--
 drivers/pci/pcie/aer.c | 12 +++++---
 drivers/pci/pcie/dpc.c |  4 +--
 drivers/pci/pcie/err.c | 79 ++++----------------------------------------------
 4 files changed, 18 insertions(+), 81 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 21bfa30db18d..5a96978d3403 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -425,8 +425,8 @@ static inline int pci_dev_specific_disable_acs_redir(struct pci_dev *dev)
 #endif
 
 /* PCI error reporting and recovery */
-void pcie_do_fatal_recovery(struct pci_dev *dev, u32 service);
-void pcie_do_nonfatal_recovery(struct pci_dev *dev);
+void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
+		      u32 service);
 
 bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
 #ifdef CONFIG_PCIEASPM
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index a96cea39ee80..b737059b58e5 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1010,9 +1010,11 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
 					info->status);
 		pci_aer_clear_device_status(dev);
 	} else if (info->severity == AER_NONFATAL)
-		pcie_do_nonfatal_recovery(dev);
+		pcie_do_recovery(dev, pci_channel_io_normal,
+				 PCIE_PORT_SERVICE_AER);
 	else if (info->severity == AER_FATAL)
-		pcie_do_fatal_recovery(dev, PCIE_PORT_SERVICE_AER);
+		pcie_do_recovery(dev, pci_channel_io_frozen,
+				 PCIE_PORT_SERVICE_AER);
 }
 
 #ifdef CONFIG_ACPI_APEI_PCIEAER
@@ -1047,9 +1049,11 @@ static void aer_recover_work_func(struct work_struct *work)
 		}
 		cper_print_aer(pdev, entry.severity, entry.regs);
 		if (entry.severity == AER_NONFATAL)
-			pcie_do_nonfatal_recovery(pdev);
+			pcie_do_recovery(pdev, pci_channel_io_normal,
+					 PCIE_PORT_SERVICE_AER);
 		else if (entry.severity == AER_FATAL)
-			pcie_do_fatal_recovery(pdev, PCIE_PORT_SERVICE_AER);
+			pcie_do_recovery(pdev, pci_channel_io_frozen,
+					 PCIE_PORT_SERVICE_AER);
 		pci_dev_put(pdev);
 	}
 }
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index f03279fc87cd..b7dfb4988931 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -169,7 +169,7 @@ static irqreturn_t dpc_handler(int irq, void *context)
 
 	reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN) >> 1;
 	ext_reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT) >> 5;
-	dev_warn(dev, "DPC %s detected, remove downstream devices\n",
+	dev_warn(dev, "DPC %s detected\n",
 		 (reason == 0) ? "unmasked uncorrectable error" :
 		 (reason == 1) ? "ERR_NONFATAL" :
 		 (reason == 2) ? "ERR_FATAL" :
@@ -186,7 +186,7 @@ static irqreturn_t dpc_handler(int irq, void *context)
 	}
 
 	/* We configure DPC so it only triggers on ERR_FATAL */
-	pcie_do_fatal_recovery(pdev, PCIE_PORT_SERVICE_DPC);
+	pcie_do_recovery(pdev, pci_channel_io_frozen, PCIE_PORT_SERVICE_DPC);
 
 	return IRQ_HANDLED;
 }
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 12c1205e1d80..644f3f725ef0 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -271,87 +271,20 @@ static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
 	return result_data.result;
 }
 
-/**
- * pcie_do_fatal_recovery - handle fatal error recovery process
- * @dev: pointer to a pci_dev data structure of agent detecting an error
- *
- * Invoked when an error is fatal. Once being invoked, removes the devices
- * beneath this AER agent, followed by reset link e.g. secondary bus reset
- * followed by re-enumeration of devices.
- */
-void pcie_do_fatal_recovery(struct pci_dev *dev, u32 service)
-{
-	struct pci_dev *udev;
-	struct pci_bus *parent;
-	struct pci_dev *pdev, *temp;
-	pci_ers_result_t result;
-
-	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)
-		udev = dev;
-	else
-		udev = dev->bus->self;
-
-	parent = udev->subordinate;
-	pci_lock_rescan_remove();
-	pci_dev_get(dev);
-	list_for_each_entry_safe_reverse(pdev, temp, &parent->devices,
-					 bus_list) {
-		pci_dev_get(pdev);
-		pci_dev_set_disconnected(pdev, NULL);
-		if (pci_has_subordinate(pdev))
-			pci_walk_bus(pdev->subordinate,
-				     pci_dev_set_disconnected, NULL);
-		pci_stop_and_remove_bus_device(pdev);
-		pci_dev_put(pdev);
-	}
-
-	result = reset_link(udev, service);
-
-	if ((service == PCIE_PORT_SERVICE_AER) &&
-	    (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)) {
-		/*
-		 * If the error is reported by a bridge, we think this error
-		 * is related to the downstream link of the bridge, so we
-		 * do error recovery on all subordinates of the bridge instead
-		 * of the bridge and clear the error status of the bridge.
-		 */
-		pci_aer_clear_fatal_status(dev);
-		pci_aer_clear_device_status(dev);
-	}
-
-	if (result == PCI_ERS_RESULT_RECOVERED) {
-		if (pcie_wait_for_link(udev, true))
-			pci_rescan_bus(udev->bus);
-		pci_info(dev, "Device recovery from fatal error successful\n");
-	} else {
-		pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
-		pci_info(dev, "Device recovery from fatal error failed\n");
-	}
-
-	pci_dev_put(dev);
-	pci_unlock_rescan_remove();
-}
-
-/**
- * pcie_do_nonfatal_recovery - handle nonfatal error recovery process
- * @dev: pointer to a pci_dev data structure of agent detecting an error
- *
- * Invoked when an error is nonfatal/fatal. Once being invoked, broadcast
- * error detected message to all downstream drivers within a hierarchy in
- * question and return the returned code.
- */
-void pcie_do_nonfatal_recovery(struct pci_dev *dev)
+void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
+		      u32 service)
 {
 	pci_ers_result_t status;
-	enum pci_channel_state state;
-
-	state = pci_channel_io_normal;
 
 	status = broadcast_error_message(dev,
 			state,
 			"error_detected",
 			report_error_detected);
 
+	if (state == pci_channel_io_frozen &&
+	    reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED)
+		goto failed;
+
 	if (status == PCI_ERS_RESULT_CAN_RECOVER)
 		status = broadcast_error_message(dev,
 				state,
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCHv2 07/20] PCI/ERR: Always use the first downstream port
  2018-09-05 20:35 [PATCHv2 00/20] PCI, error handling and hot plug Keith Busch
                   ` (5 preceding siblings ...)
  2018-09-05 20:35 ` [PATCHv2 06/20] PCI/ERR: Handle fatal error recovery Keith Busch
@ 2018-09-05 20:35 ` Keith Busch
  2018-09-05 20:35 ` [PATCHv2 08/20] PCI/ERR: Simplify broadcast callouts Keith Busch
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2018-09-05 20:35 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Keith Busch

The link reset always used the first bridge device, but AER broadcast
error handling may have reported an end device. This means the reset may
hit devices that were never notified of the impending error recovery.

This patch uses the first downstream port in the hierarchy considered
reliable. An error detected by a switch upstream port should mean it
occurred on its upstream link, so the patch selects the parent device
if the error is not a root or downstream port.

This allows two other clean-ups. First, error handling can only run
on bridges so this patch removes checks for end devices. Second, the
first accessible port does not inherit the channel error state since we
can access it, so the special cases for error detect and resume are no
longer necessary.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/err.c | 85 +++++++++++++-------------------------------------
 1 file changed, 21 insertions(+), 64 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 644f3f725ef0..0fa5e1417a4a 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -63,30 +63,12 @@ static int report_error_detected(struct pci_dev *dev, void *data)
 	if (!dev->driver ||
 		!dev->driver->err_handler ||
 		!dev->driver->err_handler->error_detected) {
-		if (result_data->state == pci_channel_io_frozen &&
-			dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
-			/*
-			 * In case of fatal recovery, if one of down-
-			 * stream device has no driver. We might be
-			 * unable to recover because a later insmod
-			 * of a driver for this device is unaware of
-			 * its hw state.
-			 */
-			pci_printk(KERN_DEBUG, dev, "device has %s\n",
-				   dev->driver ?
-				   "no AER-aware driver" : "no driver");
-		}
-
 		/*
-		 * If there's any device in the subtree that does not
-		 * have an error_detected callback, returning
-		 * PCI_ERS_RESULT_NO_AER_DRIVER prevents calling of
-		 * the subsequent mmio_enabled/slot_reset/resume
-		 * callbacks of "any" device in the subtree. All the
-		 * devices in the subtree are left in the error state
-		 * without recovery.
+		 * If any device in the subtree does not have an error_detected
+		 * callback, PCI_ERS_RESULT_NO_AER_DRIVER prevents subsequent
+		 * error callbacks of "any" device in the subtree, and will
+		 * exit in the disconnected error state.
 		 */
-
 		if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
 			vote = PCI_ERS_RESULT_NO_AER_DRIVER;
 		else
@@ -184,34 +166,23 @@ static pci_ers_result_t default_reset_link(struct pci_dev *dev)
 
 static pci_ers_result_t reset_link(struct pci_dev *dev, u32 service)
 {
-	struct pci_dev *udev;
 	pci_ers_result_t status;
 	struct pcie_port_service_driver *driver = NULL;
 
-	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
-		/* Reset this port for all subordinates */
-		udev = dev;
-	} else {
-		/* Reset the upstream component (likely downstream port) */
-		udev = dev->bus->self;
-	}
-
-	/* Use the aer driver of the component firstly */
-	driver = pcie_port_find_service(udev, service);
-
+	driver = pcie_port_find_service(dev, service);
 	if (driver && driver->reset_link) {
-		status = driver->reset_link(udev);
-	} else if (udev->has_secondary_link) {
-		status = default_reset_link(udev);
+		status = driver->reset_link(dev);
+	} else if (dev->has_secondary_link) {
+		status = default_reset_link(dev);
 	} else {
 		pci_printk(KERN_DEBUG, dev, "no link-reset support at upstream device %s\n",
-			pci_name(udev));
+			pci_name(dev));
 		return PCI_ERS_RESULT_DISCONNECT;
 	}
 
 	if (status != PCI_ERS_RESULT_RECOVERED) {
 		pci_printk(KERN_DEBUG, dev, "link reset at upstream device %s failed\n",
-			pci_name(udev));
+			pci_name(dev));
 		return PCI_ERS_RESULT_DISCONNECT;
 	}
 
@@ -243,31 +214,7 @@ static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
 	else
 		result_data.result = PCI_ERS_RESULT_RECOVERED;
 
-	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
-		/*
-		 * If the error is reported by a bridge, we think this error
-		 * is related to the downstream link of the bridge, so we
-		 * do error recovery on all subordinates of the bridge instead
-		 * of the bridge and clear the error status of the bridge.
-		 */
-		if (cb == report_error_detected)
-			dev->error_state = state;
-		pci_walk_bus(dev->subordinate, cb, &result_data);
-		if (cb == report_resume) {
-			pci_aer_clear_device_status(dev);
-			pci_cleanup_aer_uncorrect_error_status(dev);
-			dev->error_state = pci_channel_io_normal;
-		}
-	} else {
-		/*
-		 * If the error is reported by an end point, we think this
-		 * error is related to the upstream link of the end point.
-		 * The error is non fatal so the bus is ok; just invoke
-		 * the callback for the function that logged the error.
-		 */
-		cb(dev, &result_data);
-	}
-
+	pci_walk_bus(dev->subordinate, cb, &result_data);
 	return result_data.result;
 }
 
@@ -276,6 +223,14 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
 {
 	pci_ers_result_t status;
 
+	/*
+	 * Error recovery runs on all subordinates of the first downstream port.
+	 * If the downstream port detected the error, it is cleared at the end.
+	 */
+	if (!(pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT ||
+	      pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM))
+		dev = dev->bus->self;
+
 	status = broadcast_error_message(dev,
 			state,
 			"error_detected",
@@ -311,6 +266,8 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
 				"resume",
 				report_resume);
 
+	pci_aer_clear_device_status(dev);
+	pci_cleanup_aer_uncorrect_error_status(dev);
 	pci_info(dev, "AER: Device recovery successful\n");
 	return;
 
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCHv2 08/20] PCI/ERR: Simplify broadcast callouts
  2018-09-05 20:35 [PATCHv2 00/20] PCI, error handling and hot plug Keith Busch
                   ` (6 preceding siblings ...)
  2018-09-05 20:35 ` [PATCHv2 07/20] PCI/ERR: Always use the first downstream port Keith Busch
@ 2018-09-05 20:35 ` Keith Busch
  2018-09-05 20:35 ` [PATCHv2 09/20] PCI/ERR: Report current recovery status for udev Keith Busch
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2018-09-05 20:35 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Keith Busch

There is no point in having a generic broadcast function if it needs
to have special cases for each callback it broadcasts. This patch
abstracts the error broadcast to only the necessary information and
removes the now unnecessary indirection to broadcast and walk the bus.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/err.c | 107 ++++++++++++++++++-------------------------------
 1 file changed, 38 insertions(+), 69 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 0fa5e1417a4a..362a717c831a 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -19,11 +19,6 @@
 #include "portdrv.h"
 #include "../pci.h"
 
-struct aer_broadcast_data {
-	enum pci_channel_state state;
-	enum pci_ers_result result;
-};
-
 static pci_ers_result_t merge_result(enum pci_ers_result orig,
 				  enum pci_ers_result new)
 {
@@ -49,16 +44,15 @@ static pci_ers_result_t merge_result(enum pci_ers_result orig,
 	return orig;
 }
 
-static int report_error_detected(struct pci_dev *dev, void *data)
+static int report_error_detected(struct pci_dev *dev,
+				 enum pci_channel_state state,
+				 enum pci_ers_result *result)
 {
 	pci_ers_result_t vote;
 	const struct pci_error_handlers *err_handler;
-	struct aer_broadcast_data *result_data;
-
-	result_data = (struct aer_broadcast_data *) data;
 
 	device_lock(&dev->dev);
-	dev->error_state = result_data->state;
+	dev->error_state = state;
 
 	if (!dev->driver ||
 		!dev->driver->err_handler ||
@@ -75,22 +69,29 @@ static int report_error_detected(struct pci_dev *dev, void *data)
 			vote = PCI_ERS_RESULT_NONE;
 	} else {
 		err_handler = dev->driver->err_handler;
-		vote = err_handler->error_detected(dev, result_data->state);
+		vote = err_handler->error_detected(dev, state);
 		pci_uevent_ers(dev, PCI_ERS_RESULT_NONE);
 	}
 
-	result_data->result = merge_result(result_data->result, vote);
+	*result = merge_result(*result, vote);
 	device_unlock(&dev->dev);
 	return 0;
 }
 
+static int report_frozen_detected(struct pci_dev *dev, void *data)
+{
+	return report_error_detected(dev, pci_channel_io_frozen, data);
+}
+
+static int report_normal_detected(struct pci_dev *dev, void *data)
+{
+	return report_error_detected(dev, pci_channel_io_normal, data);
+}
+
 static int report_mmio_enabled(struct pci_dev *dev, void *data)
 {
-	pci_ers_result_t vote;
+	pci_ers_result_t vote, *result = data;
 	const struct pci_error_handlers *err_handler;
-	struct aer_broadcast_data *result_data;
-
-	result_data = (struct aer_broadcast_data *) data;
 
 	device_lock(&dev->dev);
 	if (!dev->driver ||
@@ -100,7 +101,7 @@ static int report_mmio_enabled(struct pci_dev *dev, void *data)
 
 	err_handler = dev->driver->err_handler;
 	vote = err_handler->mmio_enabled(dev);
-	result_data->result = merge_result(result_data->result, vote);
+	*result = merge_result(*result, vote);
 out:
 	device_unlock(&dev->dev);
 	return 0;
@@ -108,11 +109,8 @@ static int report_mmio_enabled(struct pci_dev *dev, void *data)
 
 static int report_slot_reset(struct pci_dev *dev, void *data)
 {
-	pci_ers_result_t vote;
+	pci_ers_result_t vote, *result = data;
 	const struct pci_error_handlers *err_handler;
-	struct aer_broadcast_data *result_data;
-
-	result_data = (struct aer_broadcast_data *) data;
 
 	device_lock(&dev->dev);
 	if (!dev->driver ||
@@ -122,7 +120,7 @@ static int report_slot_reset(struct pci_dev *dev, void *data)
 
 	err_handler = dev->driver->err_handler;
 	vote = err_handler->slot_reset(dev);
-	result_data->result = merge_result(result_data->result, vote);
+	*result = merge_result(*result, vote);
 out:
 	device_unlock(&dev->dev);
 	return 0;
@@ -189,39 +187,11 @@ static pci_ers_result_t reset_link(struct pci_dev *dev, u32 service)
 	return status;
 }
 
-/**
- * broadcast_error_message - handle message broadcast to downstream drivers
- * @dev: pointer to from where in a hierarchy message is broadcasted down
- * @state: error state
- * @error_mesg: message to print
- * @cb: callback to be broadcasted
- *
- * Invoked during error recovery process. Once being invoked, the content
- * of error severity will be broadcasted to all downstream drivers in a
- * hierarchy in question.
- */
-static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
-	enum pci_channel_state state,
-	char *error_mesg,
-	int (*cb)(struct pci_dev *, void *))
-{
-	struct aer_broadcast_data result_data;
-
-	pci_printk(KERN_DEBUG, dev, "broadcast %s message\n", error_mesg);
-	result_data.state = state;
-	if (cb == report_error_detected)
-		result_data.result = PCI_ERS_RESULT_CAN_RECOVER;
-	else
-		result_data.result = PCI_ERS_RESULT_RECOVERED;
-
-	pci_walk_bus(dev->subordinate, cb, &result_data);
-	return result_data.result;
-}
-
 void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
 		      u32 service)
 {
-	pci_ers_result_t status;
+	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
+	struct pci_bus *bus;
 
 	/*
 	 * Error recovery runs on all subordinates of the first downstream port.
@@ -230,21 +200,23 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
 	if (!(pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT ||
 	      pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM))
 		dev = dev->bus->self;
+	bus = dev->subordinate;
 
-	status = broadcast_error_message(dev,
-			state,
-			"error_detected",
-			report_error_detected);
+	pci_dbg(dev, "broadcast error_detected message\n");
+	if (state == pci_channel_io_frozen)
+		pci_walk_bus(bus, report_frozen_detected, &status);
+	else
+		pci_walk_bus(bus, report_normal_detected, &status);
 
 	if (state == pci_channel_io_frozen &&
 	    reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED)
 		goto failed;
 
-	if (status == PCI_ERS_RESULT_CAN_RECOVER)
-		status = broadcast_error_message(dev,
-				state,
-				"mmio_enabled",
-				report_mmio_enabled);
+	if (status == PCI_ERS_RESULT_CAN_RECOVER) {
+		status = PCI_ERS_RESULT_RECOVERED;
+		pci_dbg(dev, "broadcast mmio_enabled message\n");
+		pci_walk_bus(bus, report_mmio_enabled, &status);
+	}
 
 	if (status == PCI_ERS_RESULT_NEED_RESET) {
 		/*
@@ -252,19 +224,16 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
 		 * functions to reset slot before calling
 		 * drivers' slot_reset callbacks?
 		 */
-		status = broadcast_error_message(dev,
-				state,
-				"slot_reset",
-				report_slot_reset);
+		status = PCI_ERS_RESULT_RECOVERED;
+		pci_dbg(dev, "broadcast slot_reset message\n");
+		pci_walk_bus(bus, report_slot_reset, &status);
 	}
 
 	if (status != PCI_ERS_RESULT_RECOVERED)
 		goto failed;
 
-	broadcast_error_message(dev,
-				state,
-				"resume",
-				report_resume);
+	pci_dbg(dev, "broadcast resume message\n");
+	pci_walk_bus(bus, report_resume, &status);
 
 	pci_aer_clear_device_status(dev);
 	pci_cleanup_aer_uncorrect_error_status(dev);
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCHv2 09/20] PCI/ERR: Report current recovery status for udev
  2018-09-05 20:35 [PATCHv2 00/20] PCI, error handling and hot plug Keith Busch
                   ` (7 preceding siblings ...)
  2018-09-05 20:35 ` [PATCHv2 08/20] PCI/ERR: Simplify broadcast callouts Keith Busch
@ 2018-09-05 20:35 ` Keith Busch
  2018-09-05 20:35 ` [PATCHv2 10/20] PCI/ERR: Remove devices on recovery failure Keith Busch
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2018-09-05 20:35 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Keith Busch

A device still participates in error recovery even if it doesn't have
the error callbacks. This patch provides the status for user event
watchers.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/err.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 362a717c831a..31e8a4314384 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -70,9 +70,8 @@ static int report_error_detected(struct pci_dev *dev,
 	} else {
 		err_handler = dev->driver->err_handler;
 		vote = err_handler->error_detected(dev, state);
-		pci_uevent_ers(dev, PCI_ERS_RESULT_NONE);
 	}
-
+	pci_uevent_ers(dev, vote);
 	*result = merge_result(*result, vote);
 	device_unlock(&dev->dev);
 	return 0;
@@ -140,8 +139,8 @@ static int report_resume(struct pci_dev *dev, void *data)
 
 	err_handler = dev->driver->err_handler;
 	err_handler->resume(dev);
-	pci_uevent_ers(dev, PCI_ERS_RESULT_RECOVERED);
 out:
+	pci_uevent_ers(dev, PCI_ERS_RESULT_RECOVERED);
 	device_unlock(&dev->dev);
 	return 0;
 }
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCHv2 10/20] PCI/ERR: Remove devices on recovery failure
  2018-09-05 20:35 [PATCHv2 00/20] PCI, error handling and hot plug Keith Busch
                   ` (8 preceding siblings ...)
  2018-09-05 20:35 ` [PATCHv2 09/20] PCI/ERR: Report current recovery status for udev Keith Busch
@ 2018-09-05 20:35 ` Keith Busch
  2018-09-05 20:35 ` [PATCHv2 11/20] PCI/portdrv: Provide pci error callbacks Keith Busch
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2018-09-05 20:35 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Keith Busch

This patch removes devices connected through a bus that can't recover from
an error. After removing everything, one final enumeration from scratch
will be attempted if the bridge and its downstream link appear accessible.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/err.c | 42 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 38 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 31e8a4314384..2264001f695b 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -145,6 +145,41 @@ static int report_resume(struct pci_dev *dev, void *data)
 	return 0;
 }
 
+static int report_disconnect(struct pci_dev *dev, void *data)
+{
+	device_lock(&dev->dev);
+	pci_dev_set_disconnected(dev, NULL);
+	pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
+	device_unlock(&dev->dev);
+	return 0;
+}
+
+/**
+ * pcie_disconnect- Called when error handling ends with
+* 		    PCI_ERS_RESULT_DISCONNECT status.
+ *
+ * Reaching here means error handling has irrevocably failed. This function
+ * will ungracefully disconnect all the devices below the bus that has
+ * experienced the unrecoverable error.
+ *
+ * If the link is active after the removing all devices on the bus, this will
+ * attempt to re-enumerate the bus from scratch.
+ */
+static void pcie_disconnect(struct pci_dev *dev)
+{
+	struct pci_bus *bus = dev->subordinate;
+	struct pci_dev *child, *tmp;
+
+	pci_lock_rescan_remove();
+	list_for_each_entry_safe(child, tmp, &bus->devices, bus_list)
+		pci_stop_and_remove_bus_device(child);
+
+	if (pci_bridge_secondary_bus_reset(dev) == 0 &&
+	    pcie_wait_for_link(dev, true))
+		pci_rescan_bus(bus);
+	pci_unlock_rescan_remove();
+}
+
 /**
  * default_reset_link - default reset function
  * @dev: pointer to pci_dev data structure
@@ -238,10 +273,9 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
 	pci_cleanup_aer_uncorrect_error_status(dev);
 	pci_info(dev, "AER: Device recovery successful\n");
 	return;
-
 failed:
-	pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
-
-	/* TODO: Should kernel panic here? */
 	pci_info(dev, "AER: Device recovery failed\n");
+	pci_dbg(dev, "broadcast disconnect message\n");
+	pci_walk_bus(bus, report_disconnect, &status);
+	pcie_disconnect(dev);
 }
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCHv2 11/20] PCI/portdrv: Provide pci error callbacks
  2018-09-05 20:35 [PATCHv2 00/20] PCI, error handling and hot plug Keith Busch
                   ` (9 preceding siblings ...)
  2018-09-05 20:35 ` [PATCHv2 10/20] PCI/ERR: Remove devices on recovery failure Keith Busch
@ 2018-09-05 20:35 ` Keith Busch
  2018-09-05 20:35 ` [PATCHv2 12/20] PCI/portdrv: Restore pci state on slot reset Keith Busch
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2018-09-05 20:35 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Keith Busch

The error notification walks the whole bus, and some ports may need to
do something to prepare for error handling and recover after slot resets
too. This patch chains the error notification to the port services that
register this callback.

Since the callback is service specific, the patch is also changing the
callback data from the generic 'struct pci_dev' to the service specific
'struct pcie_device'. This is so the service doesn't need to search for
its service specific data, and the callback can always reference the
port if they only wanted the generic pci_dev.

And while we're here, remove the misleading comments referring to AER
root ports in the generic port services driver, as the port may be
involved with other services.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/portdrv.h     | 10 ++++++++--
 drivers/pci/pcie/portdrv_pci.c | 43 ++++++++++++++++++++++++++++++++++++++----
 2 files changed, 47 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index d59afa42fc14..36b6e7fb199f 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -53,10 +53,16 @@ struct pcie_port_service_driver {
 	int (*resume_noirq) (struct pcie_device *dev);
 	int (*resume) (struct pcie_device *dev);
 
+	/* Device driver notification of error handling */
+	void (*error_detected)(struct pcie_device *dev);
+
+	/* Device driver notification of slot reset */
+	void (*slot_reset)(struct pcie_device *dev);
+
 	/* Device driver may resume normal operations */
-	void (*error_resume)(struct pci_dev *dev);
+	void (*error_resume)(struct pcie_device *dev);
 
-	/* Link Reset Capability - AER service driver specific */
+	/* Link Reset Capability - service driver specific */
 	pci_ers_result_t (*reset_link) (struct pci_dev *dev);
 
 	int port_type;  /* Type of the port this driver can handle */
diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
index eef22dc29140..93ab8be21a64 100644
--- a/drivers/pci/pcie/portdrv_pci.c
+++ b/drivers/pci/pcie/portdrv_pci.c
@@ -139,13 +139,49 @@ static void pcie_portdrv_remove(struct pci_dev *dev)
 	pcie_port_device_remove(dev);
 }
 
+static int detected_iter(struct device *device, void *data)
+{
+	struct pcie_device *pcie_device;
+	struct pcie_port_service_driver *driver;
+
+	if (device->bus == &pcie_port_bus_type && device->driver) {
+		driver = to_service_driver(device->driver);
+		if (driver && driver->error_detected) {
+			pcie_device = to_pcie_device(device);
+			driver->error_detected(pcie_device);
+		}
+	}
+	return 0;
+}
+
 static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
 					enum pci_channel_state error)
 {
-	/* Root Port has no impact. Always recovers. */
+	device_for_each_child(&dev->dev, dev, detected_iter);
 	return PCI_ERS_RESULT_CAN_RECOVER;
 }
 
+static int slot_reset_iter(struct device *device, void *data)
+{
+	struct pcie_device *pcie_device;
+	struct pcie_port_service_driver *driver;
+
+	if (device->bus == &pcie_port_bus_type && device->driver) {
+		driver = to_service_driver(device->driver);
+		if (driver && driver->slot_reset) {
+			pcie_device = to_pcie_device(device);
+			driver->slot_reset(pcie_device);
+		}
+	}
+	return 0;
+}
+
+static pci_ers_result_t pcie_portdrv_slot_reset(struct pci_dev *dev)
+{
+	device_for_each_child(&dev->dev, dev, slot_reset_iter);
+	return PCI_ERS_RESULT_RECOVERED;
+}
+
 static pci_ers_result_t pcie_portdrv_mmio_enabled(struct pci_dev *dev)
 {
 	return PCI_ERS_RESULT_RECOVERED;
@@ -160,9 +196,7 @@ static int resume_iter(struct device *device, void *data)
 		driver = to_service_driver(device->driver);
 		if (driver && driver->error_resume) {
 			pcie_device = to_pcie_device(device);
-
-			/* Forward error message to service drivers */
-			driver->error_resume(pcie_device->port);
+			driver->error_resume(pcie_device);
 		}
 	}
 
@@ -185,6 +219,7 @@ static const struct pci_device_id port_pci_ids[] = { {
 
 static const struct pci_error_handlers pcie_portdrv_err_handler = {
 	.error_detected = pcie_portdrv_error_detected,
+	.slot_reset = pcie_portdrv_slot_reset,
 	.mmio_enabled = pcie_portdrv_mmio_enabled,
 	.resume = pcie_portdrv_err_resume,
 };
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCHv2 12/20] PCI/portdrv: Restore pci state on slot reset
  2018-09-05 20:35 [PATCHv2 00/20] PCI, error handling and hot plug Keith Busch
                   ` (10 preceding siblings ...)
  2018-09-05 20:35 ` [PATCHv2 11/20] PCI/portdrv: Provide pci error callbacks Keith Busch
@ 2018-09-05 20:35 ` Keith Busch
  2018-09-05 20:35 ` [PATCHv2 13/20] PCI: Make link active reporting detection generic Keith Busch
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2018-09-05 20:35 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Keith Busch

The port's config space may be reset after a link reset. We need
to restore the config space that was saved during probe in order to
successfully access downstream devices.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/portdrv_pci.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
index 93ab8be21a64..8c5f44807f33 100644
--- a/drivers/pci/pcie/portdrv_pci.c
+++ b/drivers/pci/pcie/portdrv_pci.c
@@ -178,7 +178,9 @@ static int slot_reset_iter(struct device *device, void *data)
 
 static pci_ers_result_t pcie_portdrv_slot_reset(struct pci_dev *dev)
 {
+	pci_restore_state(dev);
 	device_for_each_child(&dev->dev, dev, slot_reset_iter);
+	pci_save_state(dev);
 	return PCI_ERS_RESULT_RECOVERED;
 }
 
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCHv2 13/20] PCI: Make link active reporting detection generic
  2018-09-05 20:35 [PATCHv2 00/20] PCI, error handling and hot plug Keith Busch
                   ` (11 preceding siblings ...)
  2018-09-05 20:35 ` [PATCHv2 12/20] PCI/portdrv: Restore pci state on slot reset Keith Busch
@ 2018-09-05 20:35 ` Keith Busch
  2018-09-06 12:38   ` Lukas Wunner
  2018-09-05 20:35 ` [PATCHv2 14/20] PCI: Create recursive bus walk Keith Busch
                   ` (7 subsequent siblings)
  20 siblings, 1 reply; 44+ messages in thread
From: Keith Busch @ 2018-09-05 20:35 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Keith Busch

This patch centralizes the link active reporting capability to the
generic PCI driver, and removes the pciehp specific check.

There are existing downstream ports that are HPC and DPC capable, yet
do not implement data link layer active reporting. This is contrary to
what the spec says a driver should be allowed to expect, but since these
bridges do advertise they do not support this capability, the PCI driver
should react accordingly.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/hotplug/pciehp.h     |  1 -
 drivers/pci/hotplug/pciehp_hpc.c | 22 ++--------------------
 drivers/pci/pci.c                |  9 +++++++++
 drivers/pci/probe.c              |  1 +
 include/linux/pci.h              |  1 +
 5 files changed, 13 insertions(+), 21 deletions(-)

diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
index 39c9c8815a35..3a53a24f22f0 100644
--- a/drivers/pci/hotplug/pciehp.h
+++ b/drivers/pci/hotplug/pciehp.h
@@ -121,7 +121,6 @@ struct controller {
 	struct task_struct *poll_thread;
 	unsigned long cmd_started;	/* jiffies */
 	unsigned int cmd_busy:1;
-	unsigned int link_active_reporting:1;
 	unsigned int notification_enabled:1;
 	unsigned int power_fault_detected;
 	atomic_t pending_events;
diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 7136e3430925..9eb28a06cac6 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -217,13 +217,6 @@ bool pciehp_check_link_active(struct controller *ctrl)
 	return ret;
 }
 
-static void pcie_wait_link_active(struct controller *ctrl)
-{
-	struct pci_dev *pdev = ctrl_dev(ctrl);
-
-	pcie_wait_for_link(pdev, true);
-}
-
 static bool pci_bus_check_dev(struct pci_bus *bus, int devfn)
 {
 	u32 l;
@@ -256,18 +249,9 @@ int pciehp_check_link_status(struct controller *ctrl)
 	bool found;
 	u16 lnk_status;
 
-	/*
-	 * Data Link Layer Link Active Reporting must be capable for
-	 * hot-plug capable downstream port. But old controller might
-	 * not implement it. In this case, we wait for 1000 ms.
-	*/
-	if (ctrl->link_active_reporting)
-		pcie_wait_link_active(ctrl);
-	else
-		msleep(1000);
+	if (!pcie_wait_for_link(pdev, true))
+		return -1;
 
-	/* wait 100ms before read pci conf, and try in 1s */
-	msleep(100);
 	found = pci_bus_check_dev(ctrl->pcie->port->subordinate,
 					PCI_DEVFN(0, 0));
 
@@ -886,8 +870,6 @@ struct controller *pcie_init(struct pcie_device *dev)
 
 	/* Check if Data Link Layer Link Active Reporting is implemented */
 	pcie_capability_read_dword(pdev, PCI_EXP_LNKCAP, &link_cap);
-	if (link_cap & PCI_EXP_LNKCAP_DLLLARC)
-		ctrl->link_active_reporting = 1;
 
 	/* Clear all remaining event bits in Slot Status register. */
 	pcie_capability_write_word(pdev, PCI_EXP_SLTSTA,
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 1d3b7248e5b9..d4ba4d030026 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4485,6 +4485,15 @@ bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
 	bool ret;
 	u16 lnk_status;
 
+	/*
+	 * Some controllers might not implement link active reporting. In this
+	 * case, we wait for 1000 + 100 ms.
+	 */
+	if (!pdev->link_active_reporting) {
+		msleep(1100);
+		return true;
+	}
+
 	/*
 	 * PCIe 4.0r1 6.6.1, a component must enter LTSSM Detect within 20ms,
 	 * after which we should expect an link active if the reset was
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index ec784009a36b..40033e2776e1 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -713,6 +713,7 @@ static void pci_set_bus_speed(struct pci_bus *bus)
 
 		pcie_capability_read_dword(bridge, PCI_EXP_LNKCAP, &linkcap);
 		bus->max_bus_speed = pcie_link_speed[linkcap & PCI_EXP_LNKCAP_SLS];
+		bridge->link_active_reporting = !!(linkcap & PCI_EXP_LNKCAP_DLLLARC);
 
 		pcie_capability_read_word(bridge, PCI_EXP_LNKSTA, &linksta);
 		pcie_update_link_speed(bus, linksta);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index e72ca8dd6241..7981c9a2c63c 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -402,6 +402,7 @@ struct pci_dev {
 	unsigned int	has_secondary_link:1;
 	unsigned int	non_compliant_bars:1;	/* Broken BARs; ignore them */
 	unsigned int	is_probed:1;		/* Device probing in progress */
+	unsigned int	link_active_reporting:1;/* Device capable of reporting link active */
 	pci_dev_flags_t dev_flags;
 	atomic_t	enable_cnt;	/* pci_enable_device has been called */
 
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCHv2 14/20] PCI: Create recursive bus walk
  2018-09-05 20:35 [PATCHv2 00/20] PCI, error handling and hot plug Keith Busch
                   ` (12 preceding siblings ...)
  2018-09-05 20:35 ` [PATCHv2 13/20] PCI: Make link active reporting detection generic Keith Busch
@ 2018-09-05 20:35 ` Keith Busch
  2018-09-05 20:35 ` [PATCHv2 15/20] PCI/pciehp: Fix powerfault detection order Keith Busch
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2018-09-05 20:35 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Keith Busch

If a callback to walking the pci bus needs to walk its own subtree, this
could potentially deadlock if any other task takes the pci bus write
lock. This patch fixes that by providing a recursive safe callback for
when the read lock is already held.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/bus.c   | 14 +++++++++++---
 include/linux/pci.h |  3 +++
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 5cb40b2518f9..a1106d265bdc 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -376,16 +376,17 @@ EXPORT_SYMBOL(pci_bus_add_devices);
  *  other than 0, we break out.
  *
  */
-void pci_walk_bus(struct pci_bus *top, int (*cb)(struct pci_dev *, void *),
-		  void *userdata)
+void __pci_walk_bus(struct pci_bus *top, int (*cb)(struct pci_dev *, void *),
+		    void *userdata)
 {
 	struct pci_dev *dev;
 	struct pci_bus *bus;
 	struct list_head *next;
 	int retval;
 
+	lockdep_assert_held(&pci_bus_sem);
+
 	bus = top;
-	down_read(&pci_bus_sem);
 	next = top->devices.next;
 	for (;;) {
 		if (next == &bus->devices) {
@@ -408,6 +409,13 @@ void pci_walk_bus(struct pci_bus *top, int (*cb)(struct pci_dev *, void *),
 		if (retval)
 			break;
 	}
+}
+
+void pci_walk_bus(struct pci_bus *top, int (*cb)(struct pci_dev *, void *),
+		  void *userdata)
+{
+	down_read(&pci_bus_sem);
+	__pci_walk_bus(top, cb, userdata);
 	up_read(&pci_bus_sem);
 }
 EXPORT_SYMBOL_GPL(pci_walk_bus);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 7981c9a2c63c..47cdb30df55e 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1319,6 +1319,9 @@ int pci_scan_bridge(struct pci_bus *bus, struct pci_dev *dev, int max,
 
 void pci_walk_bus(struct pci_bus *top, int (*cb)(struct pci_dev *, void *),
 		  void *userdata);
+void __pci_walk_bus(struct pci_bus *top, int (*cb)(struct pci_dev *, void *),
+		    void *userdata);
+
 int pci_cfg_space_size(struct pci_dev *dev);
 unsigned char pci_bus_max_busnr(struct pci_bus *bus);
 void pci_setup_bridge(struct pci_bus *bus);
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCHv2 15/20] PCI/pciehp: Fix powerfault detection order
  2018-09-05 20:35 [PATCHv2 00/20] PCI, error handling and hot plug Keith Busch
                   ` (13 preceding siblings ...)
  2018-09-05 20:35 ` [PATCHv2 14/20] PCI: Create recursive bus walk Keith Busch
@ 2018-09-05 20:35 ` Keith Busch
  2018-09-06 19:36   ` Bjorn Helgaas
  2018-09-05 20:35 ` [PATCHv2 16/20] PCI/pciehp: Implement error handling callbacks Keith Busch
                   ` (5 subsequent siblings)
  20 siblings, 1 reply; 44+ messages in thread
From: Keith Busch @ 2018-09-05 20:35 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Keith Busch

A device add in a power controller controlled slot will power on and
clear power fault slot events, but this was happening before the interrupt
handler attempted to set the sticky status and attention indicators. The
wrong status will be set if a hot-add and power fault are handled in
one interrupt. This patch fixes that by checking for power faults before
checking for new devices.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Lukas Wunner <lukas@wunner.de>
---
 drivers/pci/hotplug/pciehp_hpc.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 9eb28a06cac6..52a18a7ec2a2 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -630,6 +630,14 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
 		pciehp_handle_button_press(slot);
 	}
 
+	/* Check Power Fault Detected */
+	if ((events & PCI_EXP_SLTSTA_PFD) && !ctrl->power_fault_detected) {
+		ctrl->power_fault_detected = 1;
+		ctrl_err(ctrl, "Slot(%s): Power fault\n", slot_name(slot));
+		pciehp_set_attention_status(slot, 1);
+		pciehp_green_led_off(slot);
+	}
+
 	/*
 	 * Disable requests have higher priority than Presence Detect Changed
 	 * or Data Link Layer State Changed events.
@@ -641,14 +649,6 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
 		pciehp_handle_presence_or_link_change(slot, events);
 	up_read(&ctrl->reset_lock);
 
-	/* Check Power Fault Detected */
-	if ((events & PCI_EXP_SLTSTA_PFD) && !ctrl->power_fault_detected) {
-		ctrl->power_fault_detected = 1;
-		ctrl_err(ctrl, "Slot(%s): Power fault\n", slot_name(slot));
-		pciehp_set_attention_status(slot, 1);
-		pciehp_green_led_off(slot);
-	}
-
 	pci_config_pm_runtime_put(pdev);
 	wake_up(&ctrl->requester);
 	return IRQ_HANDLED;
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCHv2 16/20] PCI/pciehp: Implement error handling callbacks
  2018-09-05 20:35 [PATCHv2 00/20] PCI, error handling and hot plug Keith Busch
                   ` (14 preceding siblings ...)
  2018-09-05 20:35 ` [PATCHv2 15/20] PCI/pciehp: Fix powerfault detection order Keith Busch
@ 2018-09-05 20:35 ` Keith Busch
  2018-09-06 18:23   ` Thomas Tai
  2018-09-10 13:20   ` Lukas Wunner
  2018-09-05 20:35 ` [PATCHv2 17/20] PCI/pciehp: Ignore link events during DPC event Keith Busch
                   ` (4 subsequent siblings)
  20 siblings, 2 replies; 44+ messages in thread
From: Keith Busch @ 2018-09-05 20:35 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Keith Busch

Error handling may trigger spurious link events, which may trigger
hotplug handling to re-enumerate the topology. This patch temporarily
disables notification during such error handling and checks the topology
for changes after reset.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/hotplug/pciehp.h      |  1 +
 drivers/pci/hotplug/pciehp_core.c | 39 +++++++++++++++++++++++++++++++++++++++
 drivers/pci/hotplug/pciehp_hpc.c  |  9 +++++++++
 3 files changed, 49 insertions(+)

diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
index 3a53a24f22f0..f43bcf291c4e 100644
--- a/drivers/pci/hotplug/pciehp.h
+++ b/drivers/pci/hotplug/pciehp.h
@@ -194,6 +194,7 @@ void pciehp_get_attention_status(struct slot *slot, u8 *status);
 void pciehp_set_attention_status(struct slot *slot, u8 status);
 void pciehp_get_latch_status(struct slot *slot, u8 *status);
 void pciehp_get_adapter_status(struct slot *slot, u8 *status);
+void pciehp_get_adapter_changed(struct slot *slot, u8 *changed);
 int pciehp_query_power_fault(struct slot *slot);
 void pciehp_green_led_on(struct slot *slot);
 void pciehp_green_led_off(struct slot *slot);
diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
index ec48c9433ae5..7181fd991068 100644
--- a/drivers/pci/hotplug/pciehp_core.c
+++ b/drivers/pci/hotplug/pciehp_core.c
@@ -301,6 +301,43 @@ static void pciehp_remove(struct pcie_device *dev)
 	pciehp_release_ctrl(ctrl);
 }
 
+static void pciehp_error_detected(struct pcie_device *dev)
+{
+	struct controller *ctrl = get_service_data(dev);
+
+	/*
+	 * Shutdown notification to ignore hotplug events during error
+	 * handling. We'll recheck the status when error handling completes.
+	 *
+	 * It is possible link event related to this error handling may have
+	 * triggered a the hotplug interrupt ahead of this notification, but we
+	 * can't do anything about that race.
+	 */
+	pcie_shutdown_notification(ctrl);
+}
+
+static void pciehp_slot_reset(struct pcie_device *dev)
+{
+	struct controller *ctrl = get_service_data(dev);
+	u8 changed;
+
+	/*
+	 * Cache presence detect change, but ignore other hotplug events that
+	 * may occur during error handling. Ports that implement only in-band
+	 * presence detection may inadvertently believe the device has changed,
+	 * so those devices will have to be re-enumerated.
+	 */
+	pciehp_get_adapter_changed(ctrl->slot, &changed);
+	pcie_clear_hotplug_events(ctrl);
+	pcie_init_notification(ctrl);
+
+	if (changed) {
+		pciehp_request(ctrl, PCI_EXP_SLTSTA_PDC);
+		__pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
+	} else if (atomic_read(&ctrl->pending_events) && !pciehp_poll_mode)
+		irq_wake_thread(ctrl->pcie->irq, ctrl);
+}
+
 #ifdef CONFIG_PM
 static int pciehp_suspend(struct pcie_device *dev)
 {
@@ -340,6 +377,8 @@ static struct pcie_port_service_driver hpdriver_portdrv = {
 
 	.probe		= pciehp_probe,
 	.remove		= pciehp_remove,
+	.error_detected	= pciehp_error_detected,
+	.slot_reset	= pciehp_slot_reset,
 
 #ifdef	CONFIG_PM
 	.suspend	= pciehp_suspend,
diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 52a18a7ec2a2..5ec2bc871a9c 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -382,6 +382,15 @@ void pciehp_get_adapter_status(struct slot *slot, u8 *status)
 	*status = !!(slot_status & PCI_EXP_SLTSTA_PDS);
 }
 
+void pciehp_get_adapter_changed(struct slot *slot, u8 *changed)
+{
+	struct pci_dev *pdev = ctrl_dev(slot->ctrl);
+	u16 slot_status;
+
+	pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &slot_status);
+	*changed = !!(slot_status & PCI_EXP_SLTSTA_PDC);
+}
+
 int pciehp_query_power_fault(struct slot *slot)
 {
 	struct pci_dev *pdev = ctrl_dev(slot->ctrl);
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCHv2 17/20] PCI/pciehp: Ignore link events during DPC event
  2018-09-05 20:35 [PATCHv2 00/20] PCI, error handling and hot plug Keith Busch
                   ` (15 preceding siblings ...)
  2018-09-05 20:35 ` [PATCHv2 16/20] PCI/pciehp: Implement error handling callbacks Keith Busch
@ 2018-09-05 20:35 ` Keith Busch
  2018-09-05 20:35 ` [PATCHv2 18/20] PCI/DPC: Wait for link active after reset Keith Busch
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2018-09-05 20:35 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Keith Busch

This patch adds a channel state to a subordinate bus. When a DPC event is
triggered, the DPC driver will set the channel state to frozen, and the
pciehp driver will ignore link events if the subordinate bus is being
managed by DPC error handling.

Depending on the order port interrupts are processed, it is possible the
pciehp may not observe the status, either starting after DPC handling
completes, or before DPC interrupt was started. If this happens, the
link down will be processed as normal.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/hotplug/pciehp_hpc.c | 9 +++++++++
 drivers/pci/pcie/dpc.c           | 8 ++++++--
 include/linux/pci.h              | 6 ++++++
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 5ec2bc871a9c..3487a1c3f280 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -627,6 +627,15 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
 
 	synchronize_hardirq(irq);
 	events = atomic_xchg(&ctrl->pending_events, 0);
+
+	/*
+	 * Ignore link events on the suborinate bus if error handling is
+	 * active, as link down may be expected. We'll continue to handle
+	 * presence detect changes.
+	 */
+	if (pdev->subordinate && pci_bus_offline(pdev->subordinate))
+		events &= ~PCI_EXP_SLTSTA_DLLSC;
+
 	if (!events) {
 		pci_config_pm_runtime_put(pdev);
 		return IRQ_NONE;
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index b7dfb4988931..70cb1e65a311 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -92,7 +92,8 @@ static pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
 
 	pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
 			      PCI_EXP_DPC_STATUS_TRIGGER);
-
+	if (pdev->subordinate)
+		pdev->subordinate->error_state = pci_channel_io_normal;
 	return PCI_ERS_RESULT_RECOVERED;
 }
 
@@ -204,8 +205,11 @@ static irqreturn_t dpc_irq(int irq, void *context)
 
 	pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
 			      PCI_EXP_DPC_STATUS_INTERRUPT);
-	if (status & PCI_EXP_DPC_STATUS_TRIGGER)
+	if (status & PCI_EXP_DPC_STATUS_TRIGGER) {
+		if (pdev->subordinate)
+			pdev->subordinate->error_state = pci_channel_io_frozen;
 		return IRQ_WAKE_THREAD;
+	}
 	return IRQ_HANDLED;
 }
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 47cdb30df55e..4667cc26939a 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -574,9 +574,15 @@ struct pci_bus {
 	struct device		dev;
 	struct bin_attribute	*legacy_io;	/* Legacy I/O for this bus */
 	struct bin_attribute	*legacy_mem;	/* Legacy mem */
+	pci_channel_state_t error_state;	/* Current connectivity state */
 	unsigned int		is_added:1;
 };
 
+static inline int pci_bus_offline(struct pci_bus *bus)
+{
+	return (bus->error_state != pci_channel_io_normal);
+}
+
 #define to_pci_bus(n)	container_of(n, struct pci_bus, dev)
 
 /*
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCHv2 18/20] PCI/DPC: Wait for link active after reset
  2018-09-05 20:35 [PATCHv2 00/20] PCI, error handling and hot plug Keith Busch
                   ` (16 preceding siblings ...)
  2018-09-05 20:35 ` [PATCHv2 17/20] PCI/pciehp: Ignore link events during DPC event Keith Busch
@ 2018-09-05 20:35 ` Keith Busch
  2018-09-05 20:35 ` [PATCHv2 19/20] PCI/DPC: Link reset code cleanup Keith Busch
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2018-09-05 20:35 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Keith Busch

The DPC service only provides error handling now, so we don't get to
rely on the PCIe hotplug driver to handling the timing requirements for
when it is okay access the downstream devices. This patch adds the check
for the active link before returning control to the pcie error handler
after the slot reset completes.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/dpc.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index 70cb1e65a311..0aa0d01562bf 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -92,6 +92,10 @@ static pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
 
 	pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
 			      PCI_EXP_DPC_STATUS_TRIGGER);
+
+	if (!pcie_wait_for_link(pdev, true))
+		return PCI_ERS_RESULT_DISCONNECT;
+
 	if (pdev->subordinate)
 		pdev->subordinate->error_state = pci_channel_io_normal;
 	return PCI_ERS_RESULT_RECOVERED;
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCHv2 19/20] PCI/DPC: Link reset code cleanup
  2018-09-05 20:35 [PATCHv2 00/20] PCI, error handling and hot plug Keith Busch
                   ` (17 preceding siblings ...)
  2018-09-05 20:35 ` [PATCHv2 18/20] PCI/DPC: Wait for link active after reset Keith Busch
@ 2018-09-05 20:35 ` Keith Busch
  2018-09-05 20:35 ` [PATCHv2 20/20] PCI: Unify device inaccessible Keith Busch
  2018-09-06 17:30 ` [PATCHv2 00/20] PCI, error handling and hot plug Thomas Tai
  20 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2018-09-05 20:35 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Keith Busch

This is just a code clean up patch reording declarations based on their
dependency. No functional change here.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/dpc.c | 23 +++++++----------------
 1 file changed, 7 insertions(+), 16 deletions(-)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index 0aa0d01562bf..25d1b1445484 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -66,24 +66,15 @@ static int dpc_wait_rp_inactive(struct dpc_dev *dpc)
 
 static pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
 {
-	struct dpc_dev *dpc;
-	struct pcie_device *pciedev;
-	struct device *devdpc;
-
-	u16 cap;
-
-	/*
-	 * DPC disables the Link automatically in hardware, so it has
-	 * already been reset by the time we get here.
-	 */
-	devdpc = pcie_port_find_device(pdev, PCIE_PORT_SERVICE_DPC);
-	pciedev = to_pcie_device(devdpc);
-	dpc = get_service_data(pciedev);
-	cap = dpc->cap_pos;
+	struct device *devdpc = pcie_port_find_device(pdev, PCIE_PORT_SERVICE_DPC);
+	struct pcie_device *pciedev = to_pcie_device(devdpc);
+	struct dpc_dev *dpc = get_service_data(pciedev);
+	u16 cap = dpc->cap_pos;
 
 	/*
-	 * Wait until the Link is inactive, then clear DPC Trigger Status
-	 * to allow the Port to leave DPC.
+	 * DPC disables the Link automatically in hardware, so it has already
+	 * been reset by the time we get here. Wait until the Link is inactive,
+	 * then clear DPC Trigger Status to allow the Port to leave DPC.
 	 */
 	pcie_wait_for_link(pdev, false);
 
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCHv2 20/20] PCI: Unify device inaccessible
  2018-09-05 20:35 [PATCHv2 00/20] PCI, error handling and hot plug Keith Busch
                   ` (18 preceding siblings ...)
  2018-09-05 20:35 ` [PATCHv2 19/20] PCI/DPC: Link reset code cleanup Keith Busch
@ 2018-09-05 20:35 ` Keith Busch
  2018-09-06  4:20   ` Benjamin Herrenschmidt
  2018-09-06 17:30 ` [PATCHv2 00/20] PCI, error handling and hot plug Thomas Tai
  20 siblings, 1 reply; 44+ messages in thread
From: Keith Busch @ 2018-09-05 20:35 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza,
	Lukas Wunner, Christoph Hellwig, Keith Busch

This patch brings surprise removals and permanent failures together so
we no longer need separate flags. The error handling will not be able to
override a surprise removal's permanent channel failure by doing an atomic
compare and exchange operation on the error state.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pci.h      | 11 +++++------
 drivers/pci/pcie/err.c | 15 +++++++++++----
 2 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 5a96978d3403..1ac43621adad 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -295,21 +295,20 @@ struct pci_sriov {
 	bool		drivers_autoprobe; /* Auto probing of VFs by driver */
 };
 
-/* pci_dev priv_flags */
-#define PCI_DEV_DISCONNECTED 0
-#define PCI_DEV_ADDED 1
-
 static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
 {
-	set_bit(PCI_DEV_DISCONNECTED, &dev->priv_flags);
+	dev->error_state = pci_channel_io_perm_failure;
 	return 0;
 }
 
 static inline bool pci_dev_is_disconnected(const struct pci_dev *dev)
 {
-	return test_bit(PCI_DEV_DISCONNECTED, &dev->priv_flags);
+	return dev->error_state == pci_channel_io_perm_failure;
 }
 
+/* pci_dev priv_flags */
+#define PCI_DEV_ADDED 0
+
 static inline void pci_dev_assign_added(struct pci_dev *dev, bool added)
 {
 	assign_bit(PCI_DEV_ADDED, &dev->priv_flags, added);
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 2264001f695b..1ee7206689b5 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -52,9 +52,9 @@ static int report_error_detected(struct pci_dev *dev,
 	const struct pci_error_handlers *err_handler;
 
 	device_lock(&dev->dev);
-	dev->error_state = state;
 
-	if (!dev->driver ||
+	if (!cmpxchg(&dev->error_state, state, pci_channel_io_normal) ||
+		!dev->driver ||
 		!dev->driver->err_handler ||
 		!dev->driver->err_handler->error_detected) {
 		/*
@@ -130,11 +130,18 @@ static int report_resume(struct pci_dev *dev, void *data)
 	const struct pci_error_handlers *err_handler;
 
 	device_lock(&dev->dev);
-	dev->error_state = pci_channel_io_normal;
 
+	cmpxchg(&dev->error_state, pci_channel_io_normal,
+		pci_channel_io_frozen);
+
+	/*
+	 * If channel is offline, hotplug handling is taking care of this
+	 * device.
+	 */
 	if (!dev->driver ||
 		!dev->driver->err_handler ||
-		!dev->driver->err_handler->resume)
+		!dev->driver->err_handler->resume ||
+		pci_channel_offline(dev))
 		goto out;
 
 	err_handler = dev->driver->err_handler;
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCHv2 20/20] PCI: Unify device inaccessible
  2018-09-05 20:35 ` [PATCHv2 20/20] PCI: Unify device inaccessible Keith Busch
@ 2018-09-06  4:20   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 44+ messages in thread
From: Benjamin Herrenschmidt @ 2018-09-06  4:20 UTC (permalink / raw)
  To: Keith Busch, Linux PCI, Bjorn Helgaas
  Cc: Sinan Kaya, Thomas Tai, poza, Lukas Wunner, Christoph Hellwig,
	Russell Currey, Oliver OHalloran, Sam Bobroff

On Wed, 2018-09-05 at 14:35 -0600, Keith Busch wrote:
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 2264001f695b..1ee7206689b5 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -52,9 +52,9 @@ static int report_error_detected(struct pci_dev *dev,
>         const struct pci_error_handlers *err_handler;
>  
>         device_lock(&dev->dev);
> -       dev->error_state = state;
>  
> -       if (!dev->driver ||
> +       if (!cmpxchg(&dev->error_state, state, pci_channel_io_normal) ||
> +               !dev->driver ||
>                 !dev->driver->err_handler ||
>                 !dev->driver->err_handler->error_detected) {
>                 /*

(CC some of our folks)

Should we make this some kind of helper (pci_set_channel_state) ? We
should then use it in EEH and powerpc PAPR specific hotplug as well.

We could also implement the rule about the valid transition in the 
helper itself.

(Feel free to merge this patch as-is, what I suggest can be done on top
of this).

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv2 03/20] PCI: Add required waits on link active
  2018-09-05 20:35 ` [PATCHv2 03/20] PCI: Add required waits on link active Keith Busch
@ 2018-09-06 11:42   ` Lukas Wunner
  2018-09-06 14:44     ` Keith Busch
  0 siblings, 1 reply; 44+ messages in thread
From: Lukas Wunner @ 2018-09-06 11:42 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Christoph Hellwig

On Wed, Sep 05, 2018 at 02:35:29PM -0600, Keith Busch wrote:
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4485,21 +4485,33 @@ bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
>  	bool ret;
>  	u16 lnk_status;
>  
> +	/*
> +	 * PCIe 4.0r1 6.6.1, a component must enter LTSSM Detect within 20ms,
> +	 * after which we should expect an link active if the reset was
> +	 * successful. If so, software must wait a minimum 100ms before sending
> +	 * configuration requests to devices downstream this port.
> +	 *
> +	 * If the link fails to activate, either the device was physically
> +	 * removed or the link is permanently failed.
> +	 */
> +	if (active)
> +		msleep(20);

This function is also called by pciehp when bringing up the slot,
yet I assume the above delay only applies to a Fundamental Reset,
not hotplug?


> +	if (active && ret)
> +		msleep(100);

This delay is already observed by pciehp_check_link_status()
after calling pcie_wait_link_active().

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv2 13/20] PCI: Make link active reporting detection generic
  2018-09-05 20:35 ` [PATCHv2 13/20] PCI: Make link active reporting detection generic Keith Busch
@ 2018-09-06 12:38   ` Lukas Wunner
  0 siblings, 0 replies; 44+ messages in thread
From: Lukas Wunner @ 2018-09-06 12:38 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Christoph Hellwig

On Wed, Sep 05, 2018 at 02:35:39PM -0600, Keith Busch wrote:
> --- a/drivers/pci/hotplug/pciehp.h
> +++ b/drivers/pci/hotplug/pciehp.h
> @@ -121,7 +121,6 @@ struct controller {
> 	struct task_struct *poll_thread;
> 	unsigned long cmd_started;	/* jiffies */
> 	unsigned int cmd_busy:1;
> -	unsigned int link_active_reporting:1;
> 	unsigned int notification_enabled:1;
> 	unsigned int power_fault_detected;
> 	atomic_t pending_events;

Please also remove the kerneldoc for this attribute further up in the file.


> @@ -256,18 +249,9 @@ int pciehp_check_link_status(struct controller *ctrl)
>  	bool found;
>  	u16 lnk_status;
>  
> -	/*
> -	 * Data Link Layer Link Active Reporting must be capable for
> -	 * hot-plug capable downstream port. But old controller might
> -	 * not implement it. In this case, we wait for 1000 ms.
> -	*/
> -	if (ctrl->link_active_reporting)
> -		pcie_wait_link_active(ctrl);
> -	else
> -		msleep(1000);
> +	if (!pcie_wait_for_link(pdev, true))
> +		return -1;
>  
> -	/* wait 100ms before read pci conf, and try in 1s */
> -	msleep(100);

Aha, so this addresses the second objection I've just raised for
patch [03/20].  Okay.


> +	/*
> +	 * Some controllers might not implement link active reporting. In this
> +	 * case, we wait for 1000 + 100 ms.
> +	 */
> +	if (!pdev->link_active_reporting) {
> +		msleep(1100);
> +		return true;
> +	}
> +

More specifically, the Data Link Layer Link Active Reporting Capable bit
in the Link Capabilities register was added in PCIe r1.1 (2005-03-28).
The bit was marked reserved in PCIe r1.0 (2002-04-29).

So this is perfectly legal for old devices adhering only to PCIe r1.0.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv2 03/20] PCI: Add required waits on link active
  2018-09-06 11:42   ` Lukas Wunner
@ 2018-09-06 14:44     ` Keith Busch
  0 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2018-09-06 14:44 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Christoph Hellwig

On Thu, Sep 06, 2018 at 01:42:15PM +0200, Lukas Wunner wrote:
> On Wed, Sep 05, 2018 at 02:35:29PM -0600, Keith Busch wrote:
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -4485,21 +4485,33 @@ bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
> >  	bool ret;
> >  	u16 lnk_status;
> >  
> > +	/*
> > +	 * PCIe 4.0r1 6.6.1, a component must enter LTSSM Detect within 20ms,
> > +	 * after which we should expect an link active if the reset was
> > +	 * successful. If so, software must wait a minimum 100ms before sending
> > +	 * configuration requests to devices downstream this port.
> > +	 *
> > +	 * If the link fails to activate, either the device was physically
> > +	 * removed or the link is permanently failed.
> > +	 */
> > +	if (active)
> > +		msleep(20);
> 
> This function is also called by pciehp when bringing up the slot,
> yet I assume the above delay only applies to a Fundamental Reset,
> not hotplug?

Should be any conventional reset, which includes the "cold reset" applying
power to a hot add component.
 
> > +	if (active && ret)
> > +		msleep(100);
> 
> This delay is already observed by pciehp_check_link_status()
> after calling pcie_wait_link_active().

Yeah, that's gone in patch 13. This series became larger than I
originally expected, and there might be a more logical shuffling.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv2 00/20] PCI, error handling and hot plug
  2018-09-05 20:35 [PATCHv2 00/20] PCI, error handling and hot plug Keith Busch
                   ` (19 preceding siblings ...)
  2018-09-05 20:35 ` [PATCHv2 20/20] PCI: Unify device inaccessible Keith Busch
@ 2018-09-06 17:30 ` Thomas Tai
  2018-09-06 17:36   ` Keith Busch
  20 siblings, 1 reply; 44+ messages in thread
From: Thomas Tai @ 2018-09-06 17:30 UTC (permalink / raw)
  To: Keith Busch, Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, poza, Lukas Wunner,
	Christoph Hellwig

Hi Keith,
Which tag or branch is your patch based on? I can't apply it on the 
master branch of 
https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git

Thank you,
Thomas

On 09/05/2018 04:35 PM, Keith Busch wrote:
> v1 -> v2:
> 
>    * Use Dennis' patch for the incorrect slot reset detection since he
>      posted that fix first
> 
>    * I found some DPC and HPC capable ports (PLX Device 9781 to be
>      specific) that don't have data-link active reporting capabilities,
>      so I added another patch to handle that
> 
>    * If the recovery determines the precence detection has changed during
>      error handling, we need to prevent the downstream driver from
>      accessing the device under its old context. This was a little tricky
>      because of a circular dependency on the pci_bus_sem, so there is a
>      prep patch to allow recursive pci bus walking, and then we use it
>      from pciehp's slot_reset callback.
> 
>    * Make error handling not able to change the error state away from
>      pci_channel_state_perm_failure so that hotplug and error handling
>      may use the same state (suggested by Benjamin Herrenschmidt)
> 
>    * Moved the link active wait requirements into generic code
>      (suggested by Sinan Kaya)
> 
>    * Check for successful secondary bus reset on recovery failure
>      (suggested by Sinan Kaya)
> 
>    * Use pcie_device for service driver error callbacks (suggested by
>      Lukas Wunner)
> 
>    * Hold pci_slot_lock when doing a slot reset (suggested by
>      Lukas Wunner)
> 
>    * Fixed processing user orderly hotplug requests during error handling
>      suggested by Lukas Wunner)
> 
>    * Various code cleanups (suggested by Christoph Hellwig)
> 
>    * Split dpc code cleanup into separate patch
> 
>    * Changelog grammer fixes and wording clarity
> 
> Dennis Dalessandro (1):
>    PCI: Fix faulty logic in pci_reset_bus()
> 
> Keith Busch (18):
>    PCI: Add required waits on link active
>    PCI/AER: Remove dead code
>    PCI/ERR: Use slot reset if available
>    PCI/ERR: Handle fatal error recovery
>    PCI/ERR: Always use the first downstream port
>    PCI/ERR: Simplify broadcast callouts
>    PCI/ERR: Report current recovery status for udev
>    PCI/ERR: Remove devices on recovery failure
>    PCI/portdrv: Provide pci error callbacks
>    PCI/portdrv: Restore pci state on slot reset
>    PCI: Make link active reporting detection generic
>    PCI: Create recursive bus walk
>    PCI/pciehp: Fix powerfault detection order
>    PCI/pciehp: Implement error handling callbacks
>    PCI/pciehp: Ignore link events during DPC event
>    PCI/DPC: Wait for link active after reset
>    PCI/DPC: Link reset code cleanup
>    PCI: Unify device inaccessible
> 
> Lukas Wunner (1):
>    PCI: Simplify disconnected marking
> 
>   drivers/pci/bus.c                 |  14 +-
>   drivers/pci/hotplug/pciehp.h      |   2 +-
>   drivers/pci/hotplug/pciehp_core.c |  39 +++++
>   drivers/pci/hotplug/pciehp_hpc.c  |  56 +++----
>   drivers/pci/hotplug/pciehp_pci.c  |   9 +-
>   drivers/pci/pci.c                 |  68 +++++++-
>   drivers/pci/pci.h                 |  17 +-
>   drivers/pci/pcie/aer.c            |  27 ++--
>   drivers/pci/pcie/dpc.c            |  37 +++--
>   drivers/pci/pcie/err.c            | 327 +++++++++++++-------------------------
>   drivers/pci/pcie/portdrv.h        |  10 +-
>   drivers/pci/pcie/portdrv_pci.c    |  45 +++++-
>   drivers/pci/probe.c               |   1 +
>   drivers/pci/slot.c                |   2 +-
>   include/linux/pci.h               |  10 ++
>   15 files changed, 353 insertions(+), 311 deletions(-)
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv2 00/20] PCI, error handling and hot plug
  2018-09-06 17:30 ` [PATCHv2 00/20] PCI, error handling and hot plug Thomas Tai
@ 2018-09-06 17:36   ` Keith Busch
  0 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2018-09-06 17:36 UTC (permalink / raw)
  To: Thomas Tai
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	poza, Lukas Wunner, Christoph Hellwig

On Thu, Sep 06, 2018 at 01:30:47PM -0400, Thomas Tai wrote:
> Hi Keith,
> Which tag or branch is your patch based on? I can't apply it on the master
> branch of https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git

Sure, it was built on the pci/hotplug branch:

https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/log/?h=pci/hotplug

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv2 16/20] PCI/pciehp: Implement error handling callbacks
  2018-09-05 20:35 ` [PATCHv2 16/20] PCI/pciehp: Implement error handling callbacks Keith Busch
@ 2018-09-06 18:23   ` Thomas Tai
  2018-09-06 18:49     ` Keith Busch
  2018-09-10 13:20   ` Lukas Wunner
  1 sibling, 1 reply; 44+ messages in thread
From: Thomas Tai @ 2018-09-06 18:23 UTC (permalink / raw)
  To: Keith Busch, Linux PCI, Bjorn Helgaas
  Cc: Benjamin Herrenschmidt, Sinan Kaya, poza, Lukas Wunner,
	Christoph Hellwig



On 09/05/2018 04:35 PM, Keith Busch wrote:
> Error handling may trigger spurious link events, which may trigger
> hotplug handling to re-enumerate the topology. This patch temporarily
> disables notification during such error handling and checks the topology
> for changes after reset.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>   drivers/pci/hotplug/pciehp.h      |  1 +
>   drivers/pci/hotplug/pciehp_core.c | 39 +++++++++++++++++++++++++++++++++++++++
>   drivers/pci/hotplug/pciehp_hpc.c  |  9 +++++++++
>   3 files changed, 49 insertions(+)
> 
> diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
> index 3a53a24f22f0..f43bcf291c4e 100644
> --- a/drivers/pci/hotplug/pciehp.h
> +++ b/drivers/pci/hotplug/pciehp.h
> @@ -194,6 +194,7 @@ void pciehp_get_attention_status(struct slot *slot, u8 *status);
>   void pciehp_set_attention_status(struct slot *slot, u8 status);
>   void pciehp_get_latch_status(struct slot *slot, u8 *status);
>   void pciehp_get_adapter_status(struct slot *slot, u8 *status);
> +void pciehp_get_adapter_changed(struct slot *slot, u8 *changed);
>   int pciehp_query_power_fault(struct slot *slot);
>   void pciehp_green_led_on(struct slot *slot);
>   void pciehp_green_led_off(struct slot *slot);
> diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
> index ec48c9433ae5..7181fd991068 100644
> --- a/drivers/pci/hotplug/pciehp_core.c
> +++ b/drivers/pci/hotplug/pciehp_core.c
> @@ -301,6 +301,43 @@ static void pciehp_remove(struct pcie_device *dev)
>   	pciehp_release_ctrl(ctrl);
>   }
>   
> +static void pciehp_error_detected(struct pcie_device *dev)
> +{
> +	struct controller *ctrl = get_service_data(dev);
> +
> +	/*
> +	 * Shutdown notification to ignore hotplug events during error
> +	 * handling. We'll recheck the status when error handling completes.
> +	 *
> +	 * It is possible link event related to this error handling may have
> +	 * triggered a the hotplug interrupt ahead of this notification, but we
> +	 * can't do anything about that race.
> +	 */
> +	pcie_shutdown_notification(ctrl);
> +}
> +
> +static void pciehp_slot_reset(struct pcie_device *dev)
> +{
> +	struct controller *ctrl = get_service_data(dev);
> +	u8 changed;
> +
> +	/*
> +	 * Cache presence detect change, but ignore other hotplug events that
> +	 * may occur during error handling. Ports that implement only in-band
> +	 * presence detection may inadvertently believe the device has changed,
> +	 * so those devices will have to be re-enumerated.
> +	 */
> +	pciehp_get_adapter_changed(ctrl->slot, &changed);
> +	pcie_clear_hotplug_events(ctrl);
> +	pcie_init_notification(ctrl);
> +
> +	if (changed) {
> +		pciehp_request(ctrl, PCI_EXP_SLTSTA_PDC);
> +		__pci_walk_bus(parent, pci_dev_set_disconnected, NULL);

error: ‘parent’ undeclared.

Shouldn't it be __pci_walk_bus(dev->port->bus, 
pci_dev_set_disconnected, NULL);

Thanks,
Thomas

> +	} else if (atomic_read(&ctrl->pending_events) && !pciehp_poll_mode)
> +		irq_wake_thread(ctrl->pcie->irq, ctrl);
> +}
> +
>   #ifdef CONFIG_PM
>   static int pciehp_suspend(struct pcie_device *dev)
>   {
> @@ -340,6 +377,8 @@ static struct pcie_port_service_driver hpdriver_portdrv = {
>   
>   	.probe		= pciehp_probe,
>   	.remove		= pciehp_remove,
> +	.error_detected	= pciehp_error_detected,
> +	.slot_reset	= pciehp_slot_reset,
>   
>   #ifdef	CONFIG_PM
>   	.suspend	= pciehp_suspend,
> diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
> index 52a18a7ec2a2..5ec2bc871a9c 100644
> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -382,6 +382,15 @@ void pciehp_get_adapter_status(struct slot *slot, u8 *status)
>   	*status = !!(slot_status & PCI_EXP_SLTSTA_PDS);
>   }
>   
> +void pciehp_get_adapter_changed(struct slot *slot, u8 *changed)
> +{
> +	struct pci_dev *pdev = ctrl_dev(slot->ctrl);
> +	u16 slot_status;
> +
> +	pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &slot_status);
> +	*changed = !!(slot_status & PCI_EXP_SLTSTA_PDC);
> +}
> +
>   int pciehp_query_power_fault(struct slot *slot)
>   {
>   	struct pci_dev *pdev = ctrl_dev(slot->ctrl);
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv2 16/20] PCI/pciehp: Implement error handling callbacks
  2018-09-06 18:23   ` Thomas Tai
@ 2018-09-06 18:49     ` Keith Busch
  0 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2018-09-06 18:49 UTC (permalink / raw)
  To: Thomas Tai
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	poza, Lukas Wunner, Christoph Hellwig

On Thu, Sep 06, 2018 at 02:23:19PM -0400, Thomas Tai wrote:
> On 09/05/2018 04:35 PM, Keith Busch wrote:
> > +	if (changed) {
> > +		pciehp_request(ctrl, PCI_EXP_SLTSTA_PDC);
> > +		__pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
> 
> error: ‘parent’ undeclared.
> 
> Shouldn't it be __pci_walk_bus(dev->port->bus, pci_dev_set_disconnected,
> NULL);

Indeed. I've pointed git send-email to the wrong directory, but looks
like this is the only discrepency.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv2 15/20] PCI/pciehp: Fix powerfault detection order
  2018-09-05 20:35 ` [PATCHv2 15/20] PCI/pciehp: Fix powerfault detection order Keith Busch
@ 2018-09-06 19:36   ` Bjorn Helgaas
  2018-09-06 19:50     ` Keith Busch
  0 siblings, 1 reply; 44+ messages in thread
From: Bjorn Helgaas @ 2018-09-06 19:36 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Lukas Wunner, Christoph Hellwig

On Wed, Sep 05, 2018 at 02:35:41PM -0600, Keith Busch wrote:
> A device add in a power controller controlled slot will power on and
> clear power fault slot events, but this was happening before the interrupt
> handler attempted to set the sticky status and attention indicators. The
> wrong status will be set if a hot-add and power fault are handled in
> one interrupt. This patch fixes that by checking for power faults before
> checking for new devices.

Can you clarify the part about "the interrupt handler attempting to set the
sticky status and attention indicators"?  My first impression is that
you're talking about bits in the Slot Status register, but that's
obviously wrong because those bits are set by hardware (not the interrupt
handler) and they're RW1C so software clears them by writing 1 to them.

Lukas suggests that this patch should be in v4.19.  Do you agree, and if
so, can you help me justify it by describing the user-visible effect of
this?  I'm not sure what "setting the wrong status" means to a user, e.g.,
does this result in a non-functional device, an incorrect status LED on the
slot, something else?  Does it fix a regression or something we merged for
v4.19?

> Signed-off-by: Keith Busch <keith.busch@intel.com>
> Reviewed-by: Lukas Wunner <lukas@wunner.de>
> ---
>  drivers/pci/hotplug/pciehp_hpc.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
> index 9eb28a06cac6..52a18a7ec2a2 100644
> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -630,6 +630,14 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
>  		pciehp_handle_button_press(slot);
>  	}
>  
> +	/* Check Power Fault Detected */
> +	if ((events & PCI_EXP_SLTSTA_PFD) && !ctrl->power_fault_detected) {
> +		ctrl->power_fault_detected = 1;
> +		ctrl_err(ctrl, "Slot(%s): Power fault\n", slot_name(slot));
> +		pciehp_set_attention_status(slot, 1);
> +		pciehp_green_led_off(slot);
> +	}
> +
>  	/*
>  	 * Disable requests have higher priority than Presence Detect Changed
>  	 * or Data Link Layer State Changed events.
> @@ -641,14 +649,6 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
>  		pciehp_handle_presence_or_link_change(slot, events);
>  	up_read(&ctrl->reset_lock);
>  
> -	/* Check Power Fault Detected */
> -	if ((events & PCI_EXP_SLTSTA_PFD) && !ctrl->power_fault_detected) {
> -		ctrl->power_fault_detected = 1;
> -		ctrl_err(ctrl, "Slot(%s): Power fault\n", slot_name(slot));
> -		pciehp_set_attention_status(slot, 1);
> -		pciehp_green_led_off(slot);
> -	}
> -
>  	pci_config_pm_runtime_put(pdev);
>  	wake_up(&ctrl->requester);
>  	return IRQ_HANDLED;
> -- 
> 2.14.4
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv2 15/20] PCI/pciehp: Fix powerfault detection order
  2018-09-06 19:36   ` Bjorn Helgaas
@ 2018-09-06 19:50     ` Keith Busch
  2018-09-07 16:53       ` Bjorn Helgaas
  0 siblings, 1 reply; 44+ messages in thread
From: Keith Busch @ 2018-09-06 19:50 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Lukas Wunner, Christoph Hellwig

On Thu, Sep 06, 2018 at 02:36:57PM -0500, Bjorn Helgaas wrote:
> On Wed, Sep 05, 2018 at 02:35:41PM -0600, Keith Busch wrote:
> > A device add in a power controller controlled slot will power on and
> > clear power fault slot events, but this was happening before the interrupt
> > handler attempted to set the sticky status and attention indicators. The
> > wrong status will be set if a hot-add and power fault are handled in
> > one interrupt. This patch fixes that by checking for power faults before
> > checking for new devices.
> 
> Can you clarify the part about "the interrupt handler attempting to set the
> sticky status and attention indicators"?  My first impression is that
> you're talking about bits in the Slot Status register, but that's
> obviously wrong because those bits are set by hardware (not the interrupt
> handler) and they're RW1C so software clears them by writing 1 to them.

The sticky status being the pciehp driver's "power_fault_detected"
field. We set it on the first observation of a slot's PFD and do not
clear it until we have a successful board_added event.

> Lukas suggests that this patch should be in v4.19.  Do you agree, and if
> so, can you help me justify it by describing the user-visible effect of
> this?  I'm not sure what "setting the wrong status" means to a user, e.g.,
> does this result in a non-functional device, an incorrect status LED on the
> slot, something else?  Does it fix a regression or something we merged for
> v4.19?

>From a user point of view, it is possible the attention LED light could be
on after a successful hot add.

The only reason this was successful before was how everything was chained
through work queues, the work order being:

  INT_PRESENCE_ON -> INT_POWER_FAULT -> ENABLE_REQ

The ENABLE_REQ cleared the power fault at the end, but now everything
is handled inline with the interrupt thread (which was a great change,
IMO), such that the work ENABLE_REQ was doing happens before power
fault handling now.

The commit that changed that order:

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=0e94916e6091f48391b65110e71c87c583021640

 
> > Signed-off-by: Keith Busch <keith.busch@intel.com>
> > Reviewed-by: Lukas Wunner <lukas@wunner.de>
> > ---
> >  drivers/pci/hotplug/pciehp_hpc.c | 16 ++++++++--------
> >  1 file changed, 8 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
> > index 9eb28a06cac6..52a18a7ec2a2 100644
> > --- a/drivers/pci/hotplug/pciehp_hpc.c
> > +++ b/drivers/pci/hotplug/pciehp_hpc.c
> > @@ -630,6 +630,14 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
> >  		pciehp_handle_button_press(slot);
> >  	}
> >  
> > +	/* Check Power Fault Detected */
> > +	if ((events & PCI_EXP_SLTSTA_PFD) && !ctrl->power_fault_detected) {
> > +		ctrl->power_fault_detected = 1;
> > +		ctrl_err(ctrl, "Slot(%s): Power fault\n", slot_name(slot));
> > +		pciehp_set_attention_status(slot, 1);
> > +		pciehp_green_led_off(slot);
> > +	}
> > +
> >  	/*
> >  	 * Disable requests have higher priority than Presence Detect Changed
> >  	 * or Data Link Layer State Changed events.
> > @@ -641,14 +649,6 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
> >  		pciehp_handle_presence_or_link_change(slot, events);
> >  	up_read(&ctrl->reset_lock);
> >  
> > -	/* Check Power Fault Detected */
> > -	if ((events & PCI_EXP_SLTSTA_PFD) && !ctrl->power_fault_detected) {
> > -		ctrl->power_fault_detected = 1;
> > -		ctrl_err(ctrl, "Slot(%s): Power fault\n", slot_name(slot));
> > -		pciehp_set_attention_status(slot, 1);
> > -		pciehp_green_led_off(slot);
> > -	}
> > -
> >  	pci_config_pm_runtime_put(pdev);
> >  	wake_up(&ctrl->requester);
> >  	return IRQ_HANDLED;
> > -- 
> > 2.14.4
> > 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv2 15/20] PCI/pciehp: Fix powerfault detection order
  2018-09-06 19:50     ` Keith Busch
@ 2018-09-07 16:53       ` Bjorn Helgaas
  2018-09-07 20:03         ` Bjorn Helgaas
  0 siblings, 1 reply; 44+ messages in thread
From: Bjorn Helgaas @ 2018-09-07 16:53 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Lukas Wunner, Christoph Hellwig

On Thu, Sep 06, 2018 at 01:50:47PM -0600, Keith Busch wrote:
> On Thu, Sep 06, 2018 at 02:36:57PM -0500, Bjorn Helgaas wrote:
> > On Wed, Sep 05, 2018 at 02:35:41PM -0600, Keith Busch wrote:
> > > A device add in a power controller controlled slot will power on and
> > > clear power fault slot events, but this was happening before the interrupt
> > > handler attempted to set the sticky status and attention indicators. The
> > > wrong status will be set if a hot-add and power fault are handled in
> > > one interrupt. This patch fixes that by checking for power faults before
> > > checking for new devices.
> > 
> > Can you clarify the part about "the interrupt handler attempting to set the
> > sticky status and attention indicators"?  My first impression is that
> > you're talking about bits in the Slot Status register, but that's
> > obviously wrong because those bits are set by hardware (not the interrupt
> > handler) and they're RW1C so software clears them by writing 1 to them.
> 
> The sticky status being the pciehp driver's "power_fault_detected"
> field. We set it on the first observation of a slot's PFD and do not
> clear it until we have a successful board_added event.
> 
> > Lukas suggests that this patch should be in v4.19.  Do you agree, and if
> > so, can you help me justify it by describing the user-visible effect of
> > this?  I'm not sure what "setting the wrong status" means to a user, e.g.,
> > does this result in a non-functional device, an incorrect status LED on the
> > slot, something else?  Does it fix a regression or something we merged for
> > v4.19?
> 
> From a user point of view, it is possible the attention LED light could be
> on after a successful hot add.

Great, thanks!  Also, it looks like the power LED will be off even though
the power is actually on.

    pciehp_ist
      if (events & (PDC | DLLSC))
        pciehp_handle_presence_or_link_change
          case OFF_STATE:
            pciehp_enable_slot
              __pciehp_enable_slot
                board_added
                  pciehp_power_on_slot
                    ctrl->power_fault_detected = 0
                    pcie_write_cmd(ctrl, PCI_EXP_SLTCTL_PWR_ON, PCI_EXP_SLTCTL_PCC)
      if (PFD && !ctrl->power_fault_detected)
        ctrl->power_fault_detected = 1
        pciehp_set_attention_status(slot, 1)     # attention LED on
        pciehp_green_led_off(slot)               # power LED off


Tangent: how annoying that the spec refers to "Power Indicator" and
"Attention Indicator", but (a) we call them the "green_led" and
"attention_status", and (b) both can be on/off/blinking, but the interfaces
are totally different.

> The only reason this was successful before was how everything was chained
> through work queues, the work order being:
> 
>   INT_PRESENCE_ON -> INT_POWER_FAULT -> ENABLE_REQ
> 
> The ENABLE_REQ cleared the power fault at the end, but now everything
> is handled inline with the interrupt thread (which was a great change,
> IMO), such that the work ENABLE_REQ was doing happens before power
> fault handling now.
> 
> The commit that changed that order:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=0e94916e6091f48391b65110e71c87c583021640
> 
>  
> > > Signed-off-by: Keith Busch <keith.busch@intel.com>
> > > Reviewed-by: Lukas Wunner <lukas@wunner.de>
> > > ---
> > >  drivers/pci/hotplug/pciehp_hpc.c | 16 ++++++++--------
> > >  1 file changed, 8 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
> > > index 9eb28a06cac6..52a18a7ec2a2 100644
> > > --- a/drivers/pci/hotplug/pciehp_hpc.c
> > > +++ b/drivers/pci/hotplug/pciehp_hpc.c
> > > @@ -630,6 +630,14 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
> > >  		pciehp_handle_button_press(slot);
> > >  	}
> > >  
> > > +	/* Check Power Fault Detected */
> > > +	if ((events & PCI_EXP_SLTSTA_PFD) && !ctrl->power_fault_detected) {
> > > +		ctrl->power_fault_detected = 1;
> > > +		ctrl_err(ctrl, "Slot(%s): Power fault\n", slot_name(slot));
> > > +		pciehp_set_attention_status(slot, 1);
> > > +		pciehp_green_led_off(slot);
> > > +	}
> > > +
> > >  	/*
> > >  	 * Disable requests have higher priority than Presence Detect Changed
> > >  	 * or Data Link Layer State Changed events.
> > > @@ -641,14 +649,6 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
> > >  		pciehp_handle_presence_or_link_change(slot, events);
> > >  	up_read(&ctrl->reset_lock);
> > >  
> > > -	/* Check Power Fault Detected */
> > > -	if ((events & PCI_EXP_SLTSTA_PFD) && !ctrl->power_fault_detected) {
> > > -		ctrl->power_fault_detected = 1;
> > > -		ctrl_err(ctrl, "Slot(%s): Power fault\n", slot_name(slot));
> > > -		pciehp_set_attention_status(slot, 1);
> > > -		pciehp_green_led_off(slot);
> > > -	}
> > > -
> > >  	pci_config_pm_runtime_put(pdev);
> > >  	wake_up(&ctrl->requester);
> > >  	return IRQ_HANDLED;
> > > -- 
> > > 2.14.4
> > > 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv2 15/20] PCI/pciehp: Fix powerfault detection order
  2018-09-07 16:53       ` Bjorn Helgaas
@ 2018-09-07 20:03         ` Bjorn Helgaas
  2018-09-07 20:18           ` Keith Busch
  2018-09-07 20:26           ` Lukas Wunner
  0 siblings, 2 replies; 44+ messages in thread
From: Bjorn Helgaas @ 2018-09-07 20:03 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Lukas Wunner, Christoph Hellwig

On Fri, Sep 07, 2018 at 11:53:52AM -0500, Bjorn Helgaas wrote:
> On Thu, Sep 06, 2018 at 01:50:47PM -0600, Keith Busch wrote:
> > On Thu, Sep 06, 2018 at 02:36:57PM -0500, Bjorn Helgaas wrote:
> > > On Wed, Sep 05, 2018 at 02:35:41PM -0600, Keith Busch wrote:
> > > > A device add in a power controller controlled slot will power on and
> > > > clear power fault slot events, but this was happening before the interrupt
> > > > handler attempted to set the sticky status and attention indicators. The
> > > > wrong status will be set if a hot-add and power fault are handled in
> > > > one interrupt. This patch fixes that by checking for power faults before
> > > > checking for new devices.
> > > 
> > > Can you clarify the part about "the interrupt handler attempting to set the
> > > sticky status and attention indicators"?  My first impression is that
> > > you're talking about bits in the Slot Status register, but that's
> > > obviously wrong because those bits are set by hardware (not the interrupt
> > > handler) and they're RW1C so software clears them by writing 1 to them.
> > 
> > The sticky status being the pciehp driver's "power_fault_detected"
> > field. We set it on the first observation of a slot's PFD and do not
> > clear it until we have a successful board_added event.
> > 
> > > Lukas suggests that this patch should be in v4.19.  Do you agree, and if
> > > so, can you help me justify it by describing the user-visible effect of
> > > this?  I'm not sure what "setting the wrong status" means to a user, e.g.,
> > > does this result in a non-functional device, an incorrect status LED on the
> > > slot, something else?  Does it fix a regression or something we merged for
> > > v4.19?
> > 
> > From a user point of view, it is possible the attention LED light could be
> > on after a successful hot add.
> 
> Great, thanks!  Also, it looks like the power LED will be off even though
> the power is actually on.
> 
>     pciehp_ist
>       if (events & (PDC | DLLSC))
>         pciehp_handle_presence_or_link_change
>           case OFF_STATE:
>             pciehp_enable_slot
>               __pciehp_enable_slot
>                 board_added
>                   pciehp_power_on_slot
>                     ctrl->power_fault_detected = 0
>                     pcie_write_cmd(ctrl, PCI_EXP_SLTCTL_PWR_ON, PCI_EXP_SLTCTL_PCC)
>       if (PFD && !ctrl->power_fault_detected)
>         ctrl->power_fault_detected = 1
>         pciehp_set_attention_status(slot, 1)     # attention LED on
>         pciehp_green_led_off(slot)               # power LED off
> 
> 
> Tangent: how annoying that the spec refers to "Power Indicator" and
> "Attention Indicator", but (a) we call them the "green_led" and
> "attention_status", and (b) both can be on/off/blinking, but the interfaces
> are totally different.

I applied this to for-linus with the following changelog.  Let me know
if I didn't understand this correctly.  I changed the comment in
pciehp_power_on_slot() so it doesn't say "sticky" to avoid confusion
with the PCI spec concept of sticky register bits (ROS, RWS, RW1CS).


commit 342227b42fe849eb2edac38342702aff12a5491d
Author: Keith Busch <keith.busch@intel.com>
Date:   Wed Sep 5 14:35:41 2018 -0600

    PCI: pciehp: Fix hot-add vs powerfault detection order
    
    If both hot-add and power fault were observed in a single interrupt, we
    handled the hot-add first, then the power fault, in this path:
    
      pciehp_ist
        if (events & (PDC | DLLSC))
          pciehp_handle_presence_or_link_change
            case OFF_STATE:
              pciehp_enable_slot
                __pciehp_enable_slot
                  board_added
                    pciehp_power_on_slot
                      ctrl->power_fault_detected = 0
                      pcie_write_cmd(ctrl, PCI_EXP_SLTCTL_PWR_ON, PCI_EXP_SLTCTL_PCC)
                    pciehp_green_led_on(p_slot)             # power LED on
                    pciehp_set_attention_status(p_slot, 0)  # attention LED off
        if ((events & PFD) && !ctrl->power_fault_detected)
          ctrl->power_fault_detected = 1
          pciehp_set_attention_status(1)                    # attention LED on
          pciehp_green_led_off(slot)                        # power LED off
    
    This left the attention indicator on (even though the hot-add succeeded)
    and the power indicator off (even though the slot power was on).
    
    Fix this by checking for power faults before checking for new devices.
    
    Fixes: 0e94916e6091 ("PCI: pciehp: Handle events synchronously")
    Signed-off-by: Keith Busch <keith.busch@intel.com>
    [bhelgaas: changelog]
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
    Reviewed-by: Lukas Wunner <lukas@wunner.de>

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 7136e3430925..a938abdb41ce 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -496,7 +496,7 @@ int pciehp_power_on_slot(struct slot *slot)
 	u16 slot_status;
 	int retval;
 
-	/* Clear sticky power-fault bit from previous power failures */
+	/* Clear power-fault bit from previous power failures */
 	pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &slot_status);
 	if (slot_status & PCI_EXP_SLTSTA_PFD)
 		pcie_capability_write_word(pdev, PCI_EXP_SLTSTA,
@@ -646,6 +646,14 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
 		pciehp_handle_button_press(slot);
 	}
 
+	/* Check Power Fault Detected */
+	if ((events & PCI_EXP_SLTSTA_PFD) && !ctrl->power_fault_detected) {
+		ctrl->power_fault_detected = 1;
+		ctrl_err(ctrl, "Slot(%s): Power fault\n", slot_name(slot));
+		pciehp_set_attention_status(slot, 1);
+		pciehp_green_led_off(slot);
+	}
+
 	/*
 	 * Disable requests have higher priority than Presence Detect Changed
 	 * or Data Link Layer State Changed events.
@@ -657,14 +665,6 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
 		pciehp_handle_presence_or_link_change(slot, events);
 	up_read(&ctrl->reset_lock);
 
-	/* Check Power Fault Detected */
-	if ((events & PCI_EXP_SLTSTA_PFD) && !ctrl->power_fault_detected) {
-		ctrl->power_fault_detected = 1;
-		ctrl_err(ctrl, "Slot(%s): Power fault\n", slot_name(slot));
-		pciehp_set_attention_status(slot, 1);
-		pciehp_green_led_off(slot);
-	}
-
 	pci_config_pm_runtime_put(pdev);
 	wake_up(&ctrl->requester);
 	return IRQ_HANDLED;

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCHv2 15/20] PCI/pciehp: Fix powerfault detection order
  2018-09-07 20:03         ` Bjorn Helgaas
@ 2018-09-07 20:18           ` Keith Busch
  2018-09-18 21:46             ` Bjorn Helgaas
  2018-09-07 20:26           ` Lukas Wunner
  1 sibling, 1 reply; 44+ messages in thread
From: Keith Busch @ 2018-09-07 20:18 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Lukas Wunner, Christoph Hellwig

On Fri, Sep 07, 2018 at 03:03:32PM -0500, Bjorn Helgaas wrote:
> I applied this to for-linus with the following changelog.  Let me know
> if I didn't understand this correctly.  I changed the comment in
> pciehp_power_on_slot() so it doesn't say "sticky" to avoid confusion
> with the PCI spec concept of sticky register bits (ROS, RWS, RW1CS).

Perfect! Thanks for queueing this up. I'll drop this one from the rest
of the series, which will need at least a v3 to fix a dumb mistake in
pointed out in review, and I'll get the order to better sense (or maybe
split into independent patch sets).
 
> commit 342227b42fe849eb2edac38342702aff12a5491d
> Author: Keith Busch <keith.busch@intel.com>
> Date:   Wed Sep 5 14:35:41 2018 -0600
> 
>     PCI: pciehp: Fix hot-add vs powerfault detection order
>     
>     If both hot-add and power fault were observed in a single interrupt, we
>     handled the hot-add first, then the power fault, in this path:
>     
>       pciehp_ist
>         if (events & (PDC | DLLSC))
>           pciehp_handle_presence_or_link_change
>             case OFF_STATE:
>               pciehp_enable_slot
>                 __pciehp_enable_slot
>                   board_added
>                     pciehp_power_on_slot
>                       ctrl->power_fault_detected = 0
>                       pcie_write_cmd(ctrl, PCI_EXP_SLTCTL_PWR_ON, PCI_EXP_SLTCTL_PCC)
>                     pciehp_green_led_on(p_slot)             # power LED on
>                     pciehp_set_attention_status(p_slot, 0)  # attention LED off
>         if ((events & PFD) && !ctrl->power_fault_detected)
>           ctrl->power_fault_detected = 1
>           pciehp_set_attention_status(1)                    # attention LED on
>           pciehp_green_led_off(slot)                        # power LED off
>     
>     This left the attention indicator on (even though the hot-add succeeded)
>     and the power indicator off (even though the slot power was on).
>     
>     Fix this by checking for power faults before checking for new devices.
>     
>     Fixes: 0e94916e6091 ("PCI: pciehp: Handle events synchronously")
>     Signed-off-by: Keith Busch <keith.busch@intel.com>
>     [bhelgaas: changelog]
>     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>     Reviewed-by: Lukas Wunner <lukas@wunner.de>
> 
> diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
> index 7136e3430925..a938abdb41ce 100644
> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -496,7 +496,7 @@ int pciehp_power_on_slot(struct slot *slot)
>  	u16 slot_status;
>  	int retval;
>  
> -	/* Clear sticky power-fault bit from previous power failures */
> +	/* Clear power-fault bit from previous power failures */
>  	pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &slot_status);
>  	if (slot_status & PCI_EXP_SLTSTA_PFD)
>  		pcie_capability_write_word(pdev, PCI_EXP_SLTSTA,
> @@ -646,6 +646,14 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
>  		pciehp_handle_button_press(slot);
>  	}
>  
> +	/* Check Power Fault Detected */
> +	if ((events & PCI_EXP_SLTSTA_PFD) && !ctrl->power_fault_detected) {
> +		ctrl->power_fault_detected = 1;
> +		ctrl_err(ctrl, "Slot(%s): Power fault\n", slot_name(slot));
> +		pciehp_set_attention_status(slot, 1);
> +		pciehp_green_led_off(slot);
> +	}
> +
>  	/*
>  	 * Disable requests have higher priority than Presence Detect Changed
>  	 * or Data Link Layer State Changed events.
> @@ -657,14 +665,6 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
>  		pciehp_handle_presence_or_link_change(slot, events);
>  	up_read(&ctrl->reset_lock);
>  
> -	/* Check Power Fault Detected */
> -	if ((events & PCI_EXP_SLTSTA_PFD) && !ctrl->power_fault_detected) {
> -		ctrl->power_fault_detected = 1;
> -		ctrl_err(ctrl, "Slot(%s): Power fault\n", slot_name(slot));
> -		pciehp_set_attention_status(slot, 1);
> -		pciehp_green_led_off(slot);
> -	}
> -
>  	pci_config_pm_runtime_put(pdev);
>  	wake_up(&ctrl->requester);
>  	return IRQ_HANDLED;

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv2 15/20] PCI/pciehp: Fix powerfault detection order
  2018-09-07 20:03         ` Bjorn Helgaas
  2018-09-07 20:18           ` Keith Busch
@ 2018-09-07 20:26           ` Lukas Wunner
  1 sibling, 0 replies; 44+ messages in thread
From: Lukas Wunner @ 2018-09-07 20:26 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Keith Busch, Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt,
	Sinan Kaya, Thomas Tai, poza, Christoph Hellwig

On Fri, Sep 07, 2018 at 03:03:32PM -0500, Bjorn Helgaas wrote:
> I applied this to for-linus with the following changelog.  Let me know
> if I didn't understand this correctly.  I changed the comment in
> pciehp_power_on_slot() so it doesn't say "sticky" to avoid confusion
> with the PCI spec concept of sticky register bits (ROS, RWS, RW1CS).

The edited changelog and patch look perfectly fine to me, thanks a lot
and sorry for missing this when doing the rework.  (I missed it because
Thunderbolt doesn't have a power controller, hence can't signal a power
fault.)

The "sticky" refers to the property that if a power fault occurs and
the Power Fault Detected bit is cleared to acknowledge receipt of the
event, and if the power fault persists, the bit is immediately set
again and another interrupt is signaled.  In that sense, the bit is
"sticky" and that's what the code comment was referring to.  It's
basically level-triggered as long as the power fault persists.

pciehp does not clear the bit on receipt of a PFD event, but only sets
a flag in its internal struct.  This avoids an interrupt storm.
Both the bit and the internal flag are cleared when attempting to bring
the slot up again, either through an unplug-replug operation by the user
or an enable request via sysfs or an Attention Button press.  In either
case user intervention is required.  If the power fault is still not gone,
bringup of the slot is aborted.

The problem here was not only that the LED is turned off despite the slot
being brought up, but that the internal flag ctrl->power_fault_detected
was incorrectly set to 1 even though it had just been set to 0 when
successfully bringing up the slot.

There are some oddities with the power fault handling code, such as a
"TBD" code comment in pcie_enable_notification() where it's unclear if
there's really anything left "to be done".  I collected this and other
oddities in this e-mail:
https://www.spinics.net/lists/linux-pci/msg75743.html

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv2 16/20] PCI/pciehp: Implement error handling callbacks
  2018-09-05 20:35 ` [PATCHv2 16/20] PCI/pciehp: Implement error handling callbacks Keith Busch
  2018-09-06 18:23   ` Thomas Tai
@ 2018-09-10 13:20   ` Lukas Wunner
  2018-09-10 14:56     ` Keith Busch
  1 sibling, 1 reply; 44+ messages in thread
From: Lukas Wunner @ 2018-09-10 13:20 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Christoph Hellwig

On Wed, Sep 05, 2018 at 02:35:42PM -0600, Keith Busch wrote:
> Error handling may trigger spurious link events, which may trigger
> hotplug handling to re-enumerate the topology. This patch temporarily
> disables notification during such error handling and checks the topology
> for changes after reset.
[...]
> --- a/drivers/pci/hotplug/pciehp_core.c
> +++ b/drivers/pci/hotplug/pciehp_core.c
> @@ -301,6 +301,43 @@ static void pciehp_remove(struct pcie_device *dev)
>  	pciehp_release_ctrl(ctrl);
>  }
>  
> +static void pciehp_error_detected(struct pcie_device *dev)
> +{
> +	struct controller *ctrl = get_service_data(dev);
> +
> +	/*
> +	 * Shutdown notification to ignore hotplug events during error
> +	 * handling. We'll recheck the status when error handling completes.
> +	 *
> +	 * It is possible link event related to this error handling may have
> +	 * triggered a the hotplug interrupt ahead of this notification, but we
> +	 * can't do anything about that race.
> +	 */
> +	pcie_shutdown_notification(ctrl);

Unfortunately this patch does not take into account my comment on
patch [13/16] in v1 of this series that pcie_shutdown_notification()
can't be used here because the sysfs user interface to enable/disable
the slot is still present but no longer functions once the IRQ is
released:
https://patchwork.ozlabs.org/patch/964715/

My suggestion was:

   "I think what you want to do instead is "down_write(&ctrl->reset_lock)",
    see commit 5b3f7b7d062b ("PCI: pciehp: Avoid slot access during reset").
    And possibly mask PDCE/DLLSCE like pciehp_reset_slot() does."


> +static void pciehp_slot_reset(struct pcie_device *dev)
> +{
> +	struct controller *ctrl = get_service_data(dev);
> +	u8 changed;
> +
> +	/*
> +	 * Cache presence detect change, but ignore other hotplug events that
> +	 * may occur during error handling. Ports that implement only in-band
> +	 * presence detection may inadvertently believe the device has changed,
> +	 * so those devices will have to be re-enumerated.
> +	 */
> +	pciehp_get_adapter_changed(ctrl->slot, &changed);
> +	pcie_clear_hotplug_events(ctrl);
> +	pcie_init_notification(ctrl);
> +
> +	if (changed) {
> +		pciehp_request(ctrl, PCI_EXP_SLTSTA_PDC);
> +		__pci_walk_bus(parent, pci_dev_set_disconnected, NULL);

The __pci_walk_bus() seems superfluous because the devices are also
marked disconnected when bringing down the slot as a result of the
synthesized PCI_EXP_SLTSTA_PDC event.


> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -382,6 +382,15 @@ void pciehp_get_adapter_status(struct slot *slot, u8 *status)
> +void pciehp_get_adapter_changed(struct slot *slot, u8 *changed)
> +{
> +	struct pci_dev *pdev = ctrl_dev(slot->ctrl);
> +	u16 slot_status;
> +
> +	pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &slot_status);
> +	*changed = !!(slot_status & PCI_EXP_SLTSTA_PDC);
> +}

When adding new functions like this I usually let them return a bool
instead of passing a call-by-reference parameter.  The only reason
pciehp has these call-by-reference pciehp_get_*() functions is because
some of them assign a u8 value that is then directly passed to user space
via sysfs.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv2 16/20] PCI/pciehp: Implement error handling callbacks
  2018-09-10 13:20   ` Lukas Wunner
@ 2018-09-10 14:56     ` Keith Busch
  2018-09-10 16:09       ` Lukas Wunner
  0 siblings, 1 reply; 44+ messages in thread
From: Keith Busch @ 2018-09-10 14:56 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Christoph Hellwig

On Mon, Sep 10, 2018 at 06:20:33AM -0700, Lukas Wunner wrote:
> On Wed, Sep 05, 2018 at 02:35:42PM -0600, Keith Busch wrote:
> > Error handling may trigger spurious link events, which may trigger
> > hotplug handling to re-enumerate the topology. This patch temporarily
> > disables notification during such error handling and checks the topology
> > for changes after reset.
> [...]
> > --- a/drivers/pci/hotplug/pciehp_core.c
> > +++ b/drivers/pci/hotplug/pciehp_core.c
> > @@ -301,6 +301,43 @@ static void pciehp_remove(struct pcie_device *dev)
> >  	pciehp_release_ctrl(ctrl);
> >  }
> >  
> > +static void pciehp_error_detected(struct pcie_device *dev)
> > +{
> > +	struct controller *ctrl = get_service_data(dev);
> > +
> > +	/*
> > +	 * Shutdown notification to ignore hotplug events during error
> > +	 * handling. We'll recheck the status when error handling completes.
> > +	 *
> > +	 * It is possible link event related to this error handling may have
> > +	 * triggered a the hotplug interrupt ahead of this notification, but we
> > +	 * can't do anything about that race.
> > +	 */
> > +	pcie_shutdown_notification(ctrl);
> 
> Unfortunately this patch does not take into account my comment on
> patch [13/16] in v1 of this series that pcie_shutdown_notification()
> can't be used here because the sysfs user interface to enable/disable
> the slot is still present but no longer functions once the IRQ is
> released:
> https://patchwork.ozlabs.org/patch/964715/
> 
> My suggestion was:
> 
>    "I think what you want to do instead is "down_write(&ctrl->reset_lock)",
>     see commit 5b3f7b7d062b ("PCI: pciehp: Avoid slot access during reset").
>     And possibly mask PDCE/DLLSCE like pciehp_reset_slot() does."

The sysfs entries still function. Their actions are only temporarily
stalled during error handling. Once the slot reset is called, the
ctrl->pending_events is queried to take requested actions.

> > +static void pciehp_slot_reset(struct pcie_device *dev)
> > +{
> > +	struct controller *ctrl = get_service_data(dev);
> > +	u8 changed;
> > +
> > +	/*
> > +	 * Cache presence detect change, but ignore other hotplug events that
> > +	 * may occur during error handling. Ports that implement only in-band
> > +	 * presence detection may inadvertently believe the device has changed,
> > +	 * so those devices will have to be re-enumerated.
> > +	 */
> > +	pciehp_get_adapter_changed(ctrl->slot, &changed);
> > +	pcie_clear_hotplug_events(ctrl);
> > +	pcie_init_notification(ctrl);
> > +
> > +	if (changed) {
> > +		pciehp_request(ctrl, PCI_EXP_SLTSTA_PDC);
> > +		__pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
> 
> The __pci_walk_bus() seems superfluous because the devices are also
> marked disconnected when bringing down the slot as a result of the
> synthesized PCI_EXP_SLTSTA_PDC event.

Right, but we can't do that inline with the slot reset because of the
circular locks that creates with the pci_bus_sem. We still want to
fence off drivers for downstream devices from mucking with config space
in a topology that is reported to be different than the one we're
recoverying from. You can get undefined results that way.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv2 16/20] PCI/pciehp: Implement error handling callbacks
  2018-09-10 14:56     ` Keith Busch
@ 2018-09-10 16:09       ` Lukas Wunner
  2018-09-10 16:18         ` Keith Busch
  2018-09-10 16:45         ` Keith Busch
  0 siblings, 2 replies; 44+ messages in thread
From: Lukas Wunner @ 2018-09-10 16:09 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Christoph Hellwig

On Mon, Sep 10, 2018 at 08:56:42AM -0600, Keith Busch wrote:
> On Mon, Sep 10, 2018 at 06:20:33AM -0700, Lukas Wunner wrote:
> > On Wed, Sep 05, 2018 at 02:35:42PM -0600, Keith Busch wrote:
> > > +static void pciehp_error_detected(struct pcie_device *dev)
> > > +{
> > > +	struct controller *ctrl = get_service_data(dev);
> > > +
> > > +	/*
> > > +	 * Shutdown notification to ignore hotplug events during error
> > > +	 * handling. We'll recheck the status when error handling completes.
> > > +	 *
> > > +	 * It is possible link event related to this error handling may have
> > > +	 * triggered a the hotplug interrupt ahead of this notification, but we
> > > +	 * can't do anything about that race.
> > > +	 */
> > > +	pcie_shutdown_notification(ctrl);
> > 
> > Unfortunately this patch does not take into account my comment on
> > patch [13/16] in v1 of this series that pcie_shutdown_notification()
> > can't be used here because the sysfs user interface to enable/disable
> > the slot is still present but no longer functions once the IRQ is
> > released:
> > https://patchwork.ozlabs.org/patch/964715/
> > 
> > My suggestion was:
> > 
> >    "I think what you want to do instead is "down_write(&ctrl->reset_lock)",
> >     see commit 5b3f7b7d062b ("PCI: pciehp: Avoid slot access during reset").
> >     And possibly mask PDCE/DLLSCE like pciehp_reset_slot() does."
> 
> The sysfs entries still function. Their actions are only temporarily
> stalled during error handling. Once the slot reset is called, the
> ctrl->pending_events is queried to take requested actions.

Okay I see.  Still, releasing the IRQ and requesting it again seems fairly
heavy-wheight.  Why not just acquire ctrl->reset_lock?


> > > +static void pciehp_slot_reset(struct pcie_device *dev)
> > > +{
> > > +	struct controller *ctrl = get_service_data(dev);
> > > +	u8 changed;
> > > +
> > > +	/*
> > > +	 * Cache presence detect change, but ignore other hotplug events that
> > > +	 * may occur during error handling. Ports that implement only in-band
> > > +	 * presence detection may inadvertently believe the device has changed,
> > > +	 * so those devices will have to be re-enumerated.
> > > +	 */
> > > +	pciehp_get_adapter_changed(ctrl->slot, &changed);
> > > +	pcie_clear_hotplug_events(ctrl);
> > > +	pcie_init_notification(ctrl);
> > > +
> > > +	if (changed) {
> > > +		pciehp_request(ctrl, PCI_EXP_SLTSTA_PDC);
> > > +		__pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
> > 
> > The __pci_walk_bus() seems superfluous because the devices are also
> > marked disconnected when bringing down the slot as a result of the
> > synthesized PCI_EXP_SLTSTA_PDC event.
> 
> Right, but we can't do that inline with the slot reset because of the
> circular locks that creates with the pci_bus_sem. We still want to
> fence off drivers for downstream devices from mucking with config space
> in a topology that is reported to be different than the one we're
> recoverying from. You can get undefined results that way.

I don't quite follow.  I meant that __pci_walk_bus() is unnecessary
because the pciehp_request() call indirectly triggers it.  So the
devices are marked disconnected twice.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv2 16/20] PCI/pciehp: Implement error handling callbacks
  2018-09-10 16:09       ` Lukas Wunner
@ 2018-09-10 16:18         ` Keith Busch
  2018-09-10 16:45         ` Keith Busch
  1 sibling, 0 replies; 44+ messages in thread
From: Keith Busch @ 2018-09-10 16:18 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Christoph Hellwig

On Mon, Sep 10, 2018 at 06:09:26PM +0200, Lukas Wunner wrote:
> On Mon, Sep 10, 2018 at 08:56:42AM -0600, Keith Busch wrote:
> > On Mon, Sep 10, 2018 at 06:20:33AM -0700, Lukas Wunner wrote:
> > > The __pci_walk_bus() seems superfluous because the devices are also
> > > marked disconnected when bringing down the slot as a result of the
> > > synthesized PCI_EXP_SLTSTA_PDC event.
> > 
> > Right, but we can't do that inline with the slot reset because of the
> > circular locks that creates with the pci_bus_sem. We still want to
> > fence off drivers for downstream devices from mucking with config space
> > in a topology that is reported to be different than the one we're
> > recoverying from. You can get undefined results that way.
> 
> I don't quite follow.  I meant that __pci_walk_bus() is unnecessary
> because the pciehp_request() call indirectly triggers it.  So the
> devices are marked disconnected twice.

Right, but pciehhp_request handling happens too late to fence off drivers
bound to devices downstream this hotplug port. The disconnect setting
needs to happen before pciehp's slot_reset returns.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv2 16/20] PCI/pciehp: Implement error handling callbacks
  2018-09-10 16:09       ` Lukas Wunner
  2018-09-10 16:18         ` Keith Busch
@ 2018-09-10 16:45         ` Keith Busch
  2018-09-10 17:08           ` Lukas Wunner
  1 sibling, 1 reply; 44+ messages in thread
From: Keith Busch @ 2018-09-10 16:45 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Christoph Hellwig

On Mon, Sep 10, 2018 at 06:09:26PM +0200, Lukas Wunner wrote:
> On Mon, Sep 10, 2018 at 08:56:42AM -0600, Keith Busch wrote:
> > The sysfs entries still function. Their actions are only temporarily
> > stalled during error handling. Once the slot reset is called, the
> > ctrl->pending_events is queried to take requested actions.
> 
> Okay I see.  Still, releasing the IRQ and requesting it again seems fairly
> heavy-wheight.  Why not just acquire ctrl->reset_lock?

That was looking like a nice way to handle it, but it introduces
circular locking between ctrl->reset_lock and pci_bus_sem:

CPU A                               CPU B
---------------------------------   ------------------------
pci_walk_bus                        pciehp_ist
 down_read(&pci_bus_sem)             down_read(&ctrl->reset_lock);
  pcie_portdrv_error_detected         pciehp_handle_presence_or_link_change
   pciehp_error_detected               pciehp_unconfigure_device
    down_write(&ctrl->reset_lock)       pci_stop_and_remove_bus_devicea
                                         down_write(&pci_bus_sem);

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv2 16/20] PCI/pciehp: Implement error handling callbacks
  2018-09-10 16:45         ` Keith Busch
@ 2018-09-10 17:08           ` Lukas Wunner
  2018-09-10 17:22             ` Keith Busch
  0 siblings, 1 reply; 44+ messages in thread
From: Lukas Wunner @ 2018-09-10 17:08 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Christoph Hellwig

On Mon, Sep 10, 2018 at 10:45:28AM -0600, Keith Busch wrote:
> On Mon, Sep 10, 2018 at 06:09:26PM +0200, Lukas Wunner wrote:
> > On Mon, Sep 10, 2018 at 08:56:42AM -0600, Keith Busch wrote:
> > > The sysfs entries still function. Their actions are only temporarily
> > > stalled during error handling. Once the slot reset is called, the
> > > ctrl->pending_events is queried to take requested actions.
> > 
> > Okay I see.  Still, releasing the IRQ and requesting it again seems fairly
> > heavy-wheight.  Why not just acquire ctrl->reset_lock?
> 
> That was looking like a nice way to handle it, but it introduces
> circular locking between ctrl->reset_lock and pci_bus_sem:
> 
> CPU A                               CPU B
> ---------------------------------   ------------------------
> pci_walk_bus                        pciehp_ist
>  down_read(&pci_bus_sem)             down_read(&ctrl->reset_lock);
>   pcie_portdrv_error_detected         pciehp_handle_presence_or_link_change
>    pciehp_error_detected               pciehp_unconfigure_device
>     down_write(&ctrl->reset_lock)       pci_stop_and_remove_bus_devicea
>                                          down_write(&pci_bus_sem);

Why is pciehp bringing down the slot?  Is that in reaction to a
Link Down caused by the error?  Can this be solved with Sinan's
approach to check in pciehp whether PCI_EXP_DEVSTA_FED is set
and if so, waiting for it to be handled?

FWIW you can use synchronize_irq() in pciehp_error_detected()
if you need to wait for the IRQ thread to stop before taking
the reset_lock.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv2 16/20] PCI/pciehp: Implement error handling callbacks
  2018-09-10 17:08           ` Lukas Wunner
@ 2018-09-10 17:22             ` Keith Busch
  0 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2018-09-10 17:22 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Christoph Hellwig

On Mon, Sep 10, 2018 at 07:08:48PM +0200, Lukas Wunner wrote:
> On Mon, Sep 10, 2018 at 10:45:28AM -0600, Keith Busch wrote:
> > On Mon, Sep 10, 2018 at 06:09:26PM +0200, Lukas Wunner wrote:
> > > On Mon, Sep 10, 2018 at 08:56:42AM -0600, Keith Busch wrote:
> > > > The sysfs entries still function. Their actions are only temporarily
> > > > stalled during error handling. Once the slot reset is called, the
> > > > ctrl->pending_events is queried to take requested actions.
> > > 
> > > Okay I see.  Still, releasing the IRQ and requesting it again seems fairly
> > > heavy-wheight.  Why not just acquire ctrl->reset_lock?
> > 
> > That was looking like a nice way to handle it, but it introduces
> > circular locking between ctrl->reset_lock and pci_bus_sem:
> > 
> > CPU A                               CPU B
> > ---------------------------------   ------------------------
> > pci_walk_bus                        pciehp_ist
> >  down_read(&pci_bus_sem)             down_read(&ctrl->reset_lock);
> >   pcie_portdrv_error_detected         pciehp_handle_presence_or_link_change
> >    pciehp_error_detected               pciehp_unconfigure_device
> >     down_write(&ctrl->reset_lock)       pci_stop_and_remove_bus_devicea
> >                                          down_write(&pci_bus_sem);
> 
> Why is pciehp bringing down the slot?  Is that in reaction to a
> Link Down caused by the error?  

This could just be something that happens if a hotplug and error event
occur about the same time.

> Can this be solved with Sinan's
> approach to check in pciehp whether PCI_EXP_DEVSTA_FED is set
> and if so, waiting for it to be handled?

That only helps if the downstream port detected a fatal error, but
error handling happens for any device reported error.

> FWIW you can use synchronize_irq() in pciehp_error_detected()
> if you need to wait for the IRQ thread to stop before taking
> the reset_lock.

That would just be a different dead lock: pciehp_error_detected is
holding a read lock on pci_bus_sem, and irq thread may request a
write lock, so both threads are dead locked on each other.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv2 15/20] PCI/pciehp: Fix powerfault detection order
  2018-09-07 20:18           ` Keith Busch
@ 2018-09-18 21:46             ` Bjorn Helgaas
  2018-09-18 22:11               ` Keith Busch
  0 siblings, 1 reply; 44+ messages in thread
From: Bjorn Helgaas @ 2018-09-18 21:46 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Lukas Wunner, Christoph Hellwig

On Fri, Sep 07, 2018 at 02:18:19PM -0600, Keith Busch wrote:
> On Fri, Sep 07, 2018 at 03:03:32PM -0500, Bjorn Helgaas wrote:
> > I applied this to for-linus with the following changelog.  Let me know
> > if I didn't understand this correctly.  I changed the comment in
> > pciehp_power_on_slot() so it doesn't say "sticky" to avoid confusion
> > with the PCI spec concept of sticky register bits (ROS, RWS, RW1CS).
> 
> Perfect! Thanks for queueing this up. I'll drop this one from the rest
> of the series, which will need at least a v3 to fix a dumb mistake in
> pointed out in review, and I'll get the order to better sense (or maybe
> split into independent patch sets).

Are you still planning a v3?  I really want to get this in for v4.20
and I think there's probably some integration to be done with Lukas'
series (which I haven't applied yet either).

I rebased my branches to v4.19-rc4 to avoid a merge conflict Lukas
pointed out.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv2 15/20] PCI/pciehp: Fix powerfault detection order
  2018-09-18 21:46             ` Bjorn Helgaas
@ 2018-09-18 22:11               ` Keith Busch
  0 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2018-09-18 22:11 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya,
	Thomas Tai, poza, Lukas Wunner, Christoph Hellwig

On Tue, Sep 18, 2018 at 02:46:50PM -0700, Bjorn Helgaas wrote:
> On Fri, Sep 07, 2018 at 02:18:19PM -0600, Keith Busch wrote:
> > On Fri, Sep 07, 2018 at 03:03:32PM -0500, Bjorn Helgaas wrote:
> > > I applied this to for-linus with the following changelog.  Let me know
> > > if I didn't understand this correctly.  I changed the comment in
> > > pciehp_power_on_slot() so it doesn't say "sticky" to avoid confusion
> > > with the PCI spec concept of sticky register bits (ROS, RWS, RW1CS).
> > 
> > Perfect! Thanks for queueing this up. I'll drop this one from the rest
> > of the series, which will need at least a v3 to fix a dumb mistake in
> > pointed out in review, and I'll get the order to better sense (or maybe
> > split into independent patch sets).
> 
> Are you still planning a v3?  I really want to get this in for v4.20
> and I think there's probably some integration to be done with Lukas'
> series (which I haven't applied yet either).
> 
> I rebased my branches to v4.19-rc4 to avoid a merge conflict Lukas
> pointed out.

I'll send something out today, and I think I'll split it into multiple
independent sets.

I had to trim down what this is trying to accomplish due to existing
deadlocking bugs I've found in testing: there are several circular
dependencies on tasks holding the single pci_rescan_remove_lock. I don't
think I'll be able to fix that in time for 4.20, but I'll send the parts
that I believe are an improvement that don't break anything else.

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2018-09-19  3:21 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-05 20:35 [PATCHv2 00/20] PCI, error handling and hot plug Keith Busch
2018-09-05 20:35 ` [PATCHv2 01/20] PCI: Simplify disconnected marking Keith Busch
2018-09-05 20:35 ` [PATCHv2 02/20] PCI: Fix faulty logic in pci_reset_bus() Keith Busch
2018-09-05 20:35 ` [PATCHv2 03/20] PCI: Add required waits on link active Keith Busch
2018-09-06 11:42   ` Lukas Wunner
2018-09-06 14:44     ` Keith Busch
2018-09-05 20:35 ` [PATCHv2 04/20] PCI/AER: Remove dead code Keith Busch
2018-09-05 20:35 ` [PATCHv2 05/20] PCI/ERR: Use slot reset if available Keith Busch
2018-09-05 20:35 ` [PATCHv2 06/20] PCI/ERR: Handle fatal error recovery Keith Busch
2018-09-05 20:35 ` [PATCHv2 07/20] PCI/ERR: Always use the first downstream port Keith Busch
2018-09-05 20:35 ` [PATCHv2 08/20] PCI/ERR: Simplify broadcast callouts Keith Busch
2018-09-05 20:35 ` [PATCHv2 09/20] PCI/ERR: Report current recovery status for udev Keith Busch
2018-09-05 20:35 ` [PATCHv2 10/20] PCI/ERR: Remove devices on recovery failure Keith Busch
2018-09-05 20:35 ` [PATCHv2 11/20] PCI/portdrv: Provide pci error callbacks Keith Busch
2018-09-05 20:35 ` [PATCHv2 12/20] PCI/portdrv: Restore pci state on slot reset Keith Busch
2018-09-05 20:35 ` [PATCHv2 13/20] PCI: Make link active reporting detection generic Keith Busch
2018-09-06 12:38   ` Lukas Wunner
2018-09-05 20:35 ` [PATCHv2 14/20] PCI: Create recursive bus walk Keith Busch
2018-09-05 20:35 ` [PATCHv2 15/20] PCI/pciehp: Fix powerfault detection order Keith Busch
2018-09-06 19:36   ` Bjorn Helgaas
2018-09-06 19:50     ` Keith Busch
2018-09-07 16:53       ` Bjorn Helgaas
2018-09-07 20:03         ` Bjorn Helgaas
2018-09-07 20:18           ` Keith Busch
2018-09-18 21:46             ` Bjorn Helgaas
2018-09-18 22:11               ` Keith Busch
2018-09-07 20:26           ` Lukas Wunner
2018-09-05 20:35 ` [PATCHv2 16/20] PCI/pciehp: Implement error handling callbacks Keith Busch
2018-09-06 18:23   ` Thomas Tai
2018-09-06 18:49     ` Keith Busch
2018-09-10 13:20   ` Lukas Wunner
2018-09-10 14:56     ` Keith Busch
2018-09-10 16:09       ` Lukas Wunner
2018-09-10 16:18         ` Keith Busch
2018-09-10 16:45         ` Keith Busch
2018-09-10 17:08           ` Lukas Wunner
2018-09-10 17:22             ` Keith Busch
2018-09-05 20:35 ` [PATCHv2 17/20] PCI/pciehp: Ignore link events during DPC event Keith Busch
2018-09-05 20:35 ` [PATCHv2 18/20] PCI/DPC: Wait for link active after reset Keith Busch
2018-09-05 20:35 ` [PATCHv2 19/20] PCI/DPC: Link reset code cleanup Keith Busch
2018-09-05 20:35 ` [PATCHv2 20/20] PCI: Unify device inaccessible Keith Busch
2018-09-06  4:20   ` Benjamin Herrenschmidt
2018-09-06 17:30 ` [PATCHv2 00/20] PCI, error handling and hot plug Thomas Tai
2018-09-06 17:36   ` Keith Busch

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.