All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] PCI reset delay fixes
@ 2022-12-31 18:33 Lukas Wunner
  2022-12-31 18:33 ` [PATCH 1/3] PCI/PM: Observe reset delay irrespective of bridge_d3 Lukas Wunner
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Lukas Wunner @ 2022-12-31 18:33 UTC (permalink / raw)
  To: Bjorn Helgaas, linux-pci
  Cc: Keith Busch, Ashok Raj, Sathyanarayanan Kuppuswamy,
	Ravi Kishore Koppuravuri, Mika Westerberg, Sheng Bi,
	Stanislav Spassov, Yang Su

When recovering from a DPC reset, we neglect to observe the delays
prescribed by PCIe r6.0 sec 6.6.1 before accessing devices on the
secondary bus.  As a result, devices which take a little longer to
recover remain inaccessible because their config space is restored
too early.

One affected device is Intel's Ponte Vecchio HPC GPU.  Ravi Kishore
kindly tested that this series solves the issue.


As a byproduct, the series fixes a similar delay issue for Secondary
Bus Resets.  Sheng Bi proposed a patch last May, a variation of which
is contained herein:

https://patchwork.kernel.org/project/linux-pci/patch/20220523171517.32407-1-windy.bi.enflame@gmail.com/


A second byproduct of this series is an optimization for Secondary
Bus Resets whereby the delay after reset is reduced on modern PCIe
systems.  Yang Su and Stanislav Spassov proposed a patch in August
which is subsumed by the present series:

https://patchwork.kernel.org/project/linux-pci/patch/4315990a165dd019d970633713cf8e06e9b4c282.1660746147.git.yang.su@linux.alibaba.com/


If the present series is accepted, the two above-linked patches
can be closed in patchwork.  (For some reason, Sheng Bi's patch
is in "New" state, but marked "Archived".)

Thanks!


Lukas Wunner (3):
  PCI/PM: Observe reset delay irrespective of bridge_d3
  PCI: Unify delay handling for reset and resume
  PCI/DPC: Await readiness of secondary bus after reset

 drivers/pci/pci-driver.c |  2 +-
 drivers/pci/pci.c        | 55 +++++++++++++++++-----------------------
 drivers/pci/pci.h        |  6 ++++-
 drivers/pci/pcie/dpc.c   |  4 +--
 4 files changed, 31 insertions(+), 36 deletions(-)

-- 
2.39.0


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/3] PCI/PM: Observe reset delay irrespective of bridge_d3
  2022-12-31 18:33 [PATCH 0/3] PCI reset delay fixes Lukas Wunner
@ 2022-12-31 18:33 ` Lukas Wunner
  2023-01-03 19:50   ` Sathyanarayanan Kuppuswamy
  2022-12-31 18:33 ` [PATCH 2/3] PCI: Unify delay handling for reset and resume Lukas Wunner
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Lukas Wunner @ 2022-12-31 18:33 UTC (permalink / raw)
  To: Bjorn Helgaas, linux-pci
  Cc: Keith Busch, Ashok Raj, Sathyanarayanan Kuppuswamy,
	Ravi Kishore Koppuravuri, Mika Westerberg, Sheng Bi,
	Stanislav Spassov, Yang Su

If a PCI bridge is suspended to D3cold upon entering system sleep,
resuming it entails a Fundamental Reset per PCIe r6.0 sec 5.8.

The delay prescribed after a Fundamental Reset in PCIe r6.0 sec 6.6.1
is sought to be observed by:

  pci_pm_resume_noirq()
    pci_pm_bridge_power_up_actions()
      pci_bridge_wait_for_secondary_bus()

However, pci_bridge_wait_for_secondary_bus() bails out if the bridge_d3
flag is not set.  That flag indicates whether a bridge is allowed to
suspend to D3cold at *runtime*.

Hence *no* delay is observed on resume from system sleep if runtime
D3cold is forbidden.  That doesn't make any sense, so drop the bridge_d3
check from pci_bridge_wait_for_secondary_bus().

The purpose of the bridge_d3 check was probably to avoid delays if a
bridge remained in D0 during suspend.  However the sole caller of
pci_bridge_wait_for_secondary_bus(), pci_pm_bridge_power_up_actions(),
is only invoked if the previous power state was D3cold.  Hence the
additional bridge_d3 check seems superfluous.

Fixes: ad9001f2f411 ("PCI/PM: Add missing link delays required by the PCIe spec")
Tested-by: Ravi Kishore Koppuravuri <ravi.kishore.koppuravuri@intel.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: stable@vger.kernel.org # v5.5+
Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
---
 drivers/pci/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index fba95486caaf..f43f3e84f634 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4964,7 +4964,7 @@ void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
 	if (pci_dev_is_disconnected(dev))
 		return;
 
-	if (!pci_is_bridge(dev) || !dev->bridge_d3)
+	if (!pci_is_bridge(dev))
 		return;
 
 	down_read(&pci_bus_sem);
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/3] PCI: Unify delay handling for reset and resume
  2022-12-31 18:33 [PATCH 0/3] PCI reset delay fixes Lukas Wunner
  2022-12-31 18:33 ` [PATCH 1/3] PCI/PM: Observe reset delay irrespective of bridge_d3 Lukas Wunner
@ 2022-12-31 18:33 ` Lukas Wunner
  2023-01-03 19:50   ` Sathyanarayanan Kuppuswamy
  2023-01-12 22:31   ` Bjorn Helgaas
  2022-12-31 18:33 ` [PATCH 3/3] PCI/DPC: Await readiness of secondary bus after reset Lukas Wunner
  2023-01-03 11:09 ` [PATCH 0/3] PCI reset delay fixes Mika Westerberg
  3 siblings, 2 replies; 13+ messages in thread
From: Lukas Wunner @ 2022-12-31 18:33 UTC (permalink / raw)
  To: Bjorn Helgaas, linux-pci
  Cc: Keith Busch, Ashok Raj, Sathyanarayanan Kuppuswamy,
	Ravi Kishore Koppuravuri, Mika Westerberg, Sheng Bi,
	Stanislav Spassov, Yang Su

Sheng Bi reports that pci_bridge_secondary_bus_reset() may fail to wait
for devices on the secondary bus to become accessible after reset:

Although it does call pci_dev_wait(), it erroneously passes the bridge's
pci_dev rather than that of a child.  The bridge of course is always
accessible while its secondary bus is reset, so pci_dev_wait() returns
immediately.

Sheng Bi proposes introducing a new pci_bridge_secondary_bus_wait()
function which is called from pci_bridge_secondary_bus_reset():

https://lore.kernel.org/linux-pci/20220523171517.32407-1-windy.bi.enflame@gmail.com/

However we already have pci_bridge_wait_for_secondary_bus() which does
almost exactly what we need.  So far it's only called on resume from
D3cold (which implies a Fundamental Reset per PCIe r6.0 sec 5.8).
Re-using it for Secondary Bus Resets is a leaner and more rational
approach than introducing a new function.

That only requires a few minor tweaks:

- Amend pci_bridge_wait_for_secondary_bus() to await accessibility of
  the first device on the secondary bus by calling pci_dev_wait() after
  performing the prescribed delays.  pci_dev_wait() needs two parameters,
  a reset reason and a timeout, which callers must now pass to
  pci_bridge_wait_for_secondary_bus().  The timeout is 1 sec for resume
  (PCIe r6.0 sec 6.6.1) and 60 sec for reset (commit 821cdad5c46c ("PCI:
  Wait up to 60 seconds for device to become ready after FLR")).

- Amend pci_bridge_wait_for_secondary_bus() to return 0 on success or
  -ENOTTY on error for consumption by pci_bridge_secondary_bus_reset().

- Drop an unnecessary 1 sec delay from pci_reset_secondary_bus() which
  is now performed by pci_bridge_wait_for_secondary_bus().  A static
  delay this long is only necessary for Conventional PCI, so modern
  PCIe systems benefit from shorter reset times as a side effect.

Fixes: 6b2f1351af56 ("PCI: Wait for device to become ready after secondary bus reset")
Reported-by: Sheng Bi <windy.bi.enflame@gmail.com>
Tested-by: Ravi Kishore Koppuravuri <ravi.kishore.koppuravuri@intel.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: stable@vger.kernel.org # v4.17+
Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
---
 drivers/pci/pci-driver.c |  2 +-
 drivers/pci/pci.c        | 50 ++++++++++++++++++----------------------
 drivers/pci/pci.h        |  3 ++-
 3 files changed, 25 insertions(+), 30 deletions(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index a2ceeacc33eb..02e84c87f41a 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -572,7 +572,7 @@ static void pci_pm_default_resume_early(struct pci_dev *pci_dev)
 
 static void pci_pm_bridge_power_up_actions(struct pci_dev *pci_dev)
 {
-	pci_bridge_wait_for_secondary_bus(pci_dev);
+	pci_bridge_wait_for_secondary_bus(pci_dev, "resume", 1000);
 	/*
 	 * When powering on a bridge from D3cold, the whole hierarchy may be
 	 * powered on into D0uninitialized state, resume them to give them a
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index f43f3e84f634..b0b49243a908 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4948,24 +4948,31 @@ static int pci_bus_max_d3cold_delay(const struct pci_bus *bus)
 /**
  * pci_bridge_wait_for_secondary_bus - Wait for secondary bus to be accessible
  * @dev: PCI bridge
+ * @reset_type: reset type in human-readable form
+ * @timeout: maximum time to wait for devices on secondary bus
  *
  * Handle necessary delays before access to the devices on the secondary
- * side of the bridge are permitted after D3cold to D0 transition.
+ * side of the bridge are permitted after D3cold to D0 transition
+ * or Conventional Reset.
  *
  * For PCIe this means the delays in PCIe 5.0 section 6.6.1. For
  * conventional PCI it means Tpvrh + Trhfa specified in PCI 3.0 section
  * 4.3.2.
+ *
+ * Return 0 on success or -ENOTTY if the first device on the secondary bus
+ * failed to become accessible.
  */
-void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
+int pci_bridge_wait_for_secondary_bus(struct pci_dev *dev, char *reset_type,
+				      int timeout)
 {
 	struct pci_dev *child;
 	int delay;
 
 	if (pci_dev_is_disconnected(dev))
-		return;
+		return 0;
 
 	if (!pci_is_bridge(dev))
-		return;
+		return 0;
 
 	down_read(&pci_bus_sem);
 
@@ -4977,14 +4984,14 @@ void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
 	 */
 	if (!dev->subordinate || list_empty(&dev->subordinate->devices)) {
 		up_read(&pci_bus_sem);
-		return;
+		return 0;
 	}
 
 	/* Take d3cold_delay requirements into account */
 	delay = pci_bus_max_d3cold_delay(dev->subordinate);
 	if (!delay) {
 		up_read(&pci_bus_sem);
-		return;
+		return 0;
 	}
 
 	child = list_first_entry(&dev->subordinate->devices, struct pci_dev,
@@ -4993,14 +5000,12 @@ void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
 
 	/*
 	 * Conventional PCI and PCI-X we need to wait Tpvrh + Trhfa before
-	 * accessing the device after reset (that is 1000 ms + 100 ms). In
-	 * practice this should not be needed because we don't do power
-	 * management for them (see pci_bridge_d3_possible()).
+	 * accessing the device after reset (that is 1000 ms + 100 ms).
 	 */
 	if (!pci_is_pcie(dev)) {
 		pci_dbg(dev, "waiting %d ms for secondary bus\n", 1000 + delay);
 		msleep(1000 + delay);
-		return;
+		return 0;
 	}
 
 	/*
@@ -5017,11 +5022,11 @@ void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
 	 * configuration requests if we only wait for 100 ms (see
 	 * https://bugzilla.kernel.org/show_bug.cgi?id=203885).
 	 *
-	 * Therefore we wait for 100 ms and check for the device presence.
-	 * If it is still not present give it an additional 100 ms.
+	 * Therefore we wait for 100 ms and check for the device presence
+	 * until the timeout expires.
 	 */
 	if (!pcie_downstream_port(dev))
-		return;
+		return 0;
 
 	if (pcie_get_speed_cap(dev) <= PCIE_SPEED_5_0GT) {
 		pci_dbg(dev, "waiting %d ms for downstream link\n", delay);
@@ -5032,14 +5037,11 @@ void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
 		if (!pcie_wait_for_link_delay(dev, true, delay)) {
 			/* Did not train, no need to wait any further */
 			pci_info(dev, "Data Link Layer Link Active not set in 1000 msec\n");
-			return;
+			return -ENOTTY;
 		}
 	}
 
-	if (!pci_device_is_present(child)) {
-		pci_dbg(child, "waiting additional %d ms to become accessible\n", delay);
-		msleep(delay);
-	}
+	return pci_dev_wait(child, reset_type, timeout - delay);
 }
 
 void pci_reset_secondary_bus(struct pci_dev *dev)
@@ -5058,15 +5060,6 @@ void pci_reset_secondary_bus(struct pci_dev *dev)
 
 	ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
 	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
-
-	/*
-	 * Trhfa for conventional PCI is 2^25 clock cycles.
-	 * Assuming a minimum 33MHz clock this results in a 1s
-	 * delay before we can consider subordinate devices to
-	 * be re-initialized.  PCIe has some ways to shorten this,
-	 * but we don't make use of them yet.
-	 */
-	ssleep(1);
 }
 
 void __weak pcibios_reset_secondary_bus(struct pci_dev *dev)
@@ -5085,7 +5078,8 @@ int pci_bridge_secondary_bus_reset(struct pci_dev *dev)
 {
 	pcibios_reset_secondary_bus(dev);
 
-	return pci_dev_wait(dev, "bus reset", PCIE_RESET_READY_POLL_MS);
+	return pci_bridge_wait_for_secondary_bus(dev, "bus reset",
+						 PCIE_RESET_READY_POLL_MS);
 }
 EXPORT_SYMBOL_GPL(pci_bridge_secondary_bus_reset);
 
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 9ed3b5550043..40758248dd80 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -86,8 +86,9 @@ void pci_msi_init(struct pci_dev *dev);
 void pci_msix_init(struct pci_dev *dev);
 bool pci_bridge_d3_possible(struct pci_dev *dev);
 void pci_bridge_d3_update(struct pci_dev *dev);
-void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev);
 void pci_bridge_reconfigure_ltr(struct pci_dev *dev);
+int pci_bridge_wait_for_secondary_bus(struct pci_dev *dev, char *reset_type,
+				      int timeout);
 
 static inline void pci_wakeup_event(struct pci_dev *dev)
 {
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 3/3] PCI/DPC: Await readiness of secondary bus after reset
  2022-12-31 18:33 [PATCH 0/3] PCI reset delay fixes Lukas Wunner
  2022-12-31 18:33 ` [PATCH 1/3] PCI/PM: Observe reset delay irrespective of bridge_d3 Lukas Wunner
  2022-12-31 18:33 ` [PATCH 2/3] PCI: Unify delay handling for reset and resume Lukas Wunner
@ 2022-12-31 18:33 ` Lukas Wunner
  2023-01-03 19:49   ` Sathyanarayanan Kuppuswamy
  2023-01-12 22:35   ` Bjorn Helgaas
  2023-01-03 11:09 ` [PATCH 0/3] PCI reset delay fixes Mika Westerberg
  3 siblings, 2 replies; 13+ messages in thread
From: Lukas Wunner @ 2022-12-31 18:33 UTC (permalink / raw)
  To: Bjorn Helgaas, linux-pci
  Cc: Keith Busch, Ashok Raj, Sathyanarayanan Kuppuswamy,
	Ravi Kishore Koppuravuri, Mika Westerberg, Sheng Bi,
	Stanislav Spassov, Yang Su

We're calling pci_bridge_wait_for_secondary_bus() after performing a
Secondary Bus Reset, but neglect to do the same after coming out of a
DPC-induced Hot Reset.  As a result, we're not observing the delays
prescribed by PCIe r6.0 sec 6.6.1 and may access devices on the
secondary bus before they're ready.  Fix it.

Tested-by: Ravi Kishore Koppuravuri <ravi.kishore.koppuravuri@intel.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: stable@vger.kernel.org
---
 drivers/pci/pci.c      | 3 ---
 drivers/pci/pci.h      | 3 +++
 drivers/pci/pcie/dpc.c | 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b0b49243a908..19fe0ef0e583 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -167,9 +167,6 @@ static int __init pcie_port_pm_setup(char *str)
 }
 __setup("pcie_port_pm=", pcie_port_pm_setup);
 
-/* Time to wait after a reset for device to become responsive */
-#define PCIE_RESET_READY_POLL_MS 60000
-
 /**
  * pci_bus_max_busnr - returns maximum PCI bus number of given bus' children
  * @bus: pointer to PCI bus structure to search
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 40758248dd80..b72fd888379b 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -7,6 +7,9 @@
 /* Number of possible devfns: 0.0 to 1f.7 inclusive */
 #define MAX_NR_DEVFNS 256
 
+/* Time to wait after a reset for device to become responsive */
+#define PCIE_RESET_READY_POLL_MS 60000
+
 #define PCI_FIND_CAP_TTL	48
 
 #define PCI_VSEC_ID_INTEL_TBT	0x1234	/* Thunderbolt */
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index f5ffea17c7f8..a5d7c69b764e 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -170,8 +170,8 @@ pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
 	pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
 			      PCI_EXP_DPC_STATUS_TRIGGER);
 
-	if (!pcie_wait_for_link(pdev, true)) {
-		pci_info(pdev, "Data Link Layer Link Active not set in 1000 msec\n");
+	if (pci_bridge_wait_for_secondary_bus(pdev, "DPC",
+					      PCIE_RESET_READY_POLL_MS)) {
 		clear_bit(PCI_DPC_RECOVERED, &pdev->priv_flags);
 		ret = PCI_ERS_RESULT_DISCONNECT;
 	} else {
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/3] PCI reset delay fixes
  2022-12-31 18:33 [PATCH 0/3] PCI reset delay fixes Lukas Wunner
                   ` (2 preceding siblings ...)
  2022-12-31 18:33 ` [PATCH 3/3] PCI/DPC: Await readiness of secondary bus after reset Lukas Wunner
@ 2023-01-03 11:09 ` Mika Westerberg
  3 siblings, 0 replies; 13+ messages in thread
From: Mika Westerberg @ 2023-01-03 11:09 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Bjorn Helgaas, linux-pci, Keith Busch, Ashok Raj,
	Sathyanarayanan Kuppuswamy, Ravi Kishore Koppuravuri, Sheng Bi,
	Stanislav Spassov, Yang Su

Hi Lukas,

On Sat, Dec 31, 2022 at 07:33:36PM +0100, Lukas Wunner wrote:
> When recovering from a DPC reset, we neglect to observe the delays
> prescribed by PCIe r6.0 sec 6.6.1 before accessing devices on the
> secondary bus.  As a result, devices which take a little longer to
> recover remain inaccessible because their config space is restored
> too early.
> 
> One affected device is Intel's Ponte Vecchio HPC GPU.  Ravi Kishore
> kindly tested that this series solves the issue.
> 
> 
> As a byproduct, the series fixes a similar delay issue for Secondary
> Bus Resets.  Sheng Bi proposed a patch last May, a variation of which
> is contained herein:
> 
> https://patchwork.kernel.org/project/linux-pci/patch/20220523171517.32407-1-windy.bi.enflame@gmail.com/
> 
> 
> A second byproduct of this series is an optimization for Secondary
> Bus Resets whereby the delay after reset is reduced on modern PCIe
> systems.  Yang Su and Stanislav Spassov proposed a patch in August
> which is subsumed by the present series:
> 
> https://patchwork.kernel.org/project/linux-pci/patch/4315990a165dd019d970633713cf8e06e9b4c282.1660746147.git.yang.su@linux.alibaba.com/
> 
> 
> If the present series is accepted, the two above-linked patches
> can be closed in patchwork.  (For some reason, Sheng Bi's patch
> is in "New" state, but marked "Archived".)
> 
> Thanks!
> 
> 
> Lukas Wunner (3):
>   PCI/PM: Observe reset delay irrespective of bridge_d3
>   PCI: Unify delay handling for reset and resume
>   PCI/DPC: Await readiness of secondary bus after reset

All look good to me,

Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] PCI/DPC: Await readiness of secondary bus after reset
  2022-12-31 18:33 ` [PATCH 3/3] PCI/DPC: Await readiness of secondary bus after reset Lukas Wunner
@ 2023-01-03 19:49   ` Sathyanarayanan Kuppuswamy
  2023-01-12 22:35   ` Bjorn Helgaas
  1 sibling, 0 replies; 13+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2023-01-03 19:49 UTC (permalink / raw)
  To: Lukas Wunner, Bjorn Helgaas, linux-pci
  Cc: Keith Busch, Ashok Raj, Ravi Kishore Koppuravuri,
	Mika Westerberg, Sheng Bi, Stanislav Spassov, Yang Su

HI,

On 12/31/22 10:33 AM, Lukas Wunner wrote:
> We're calling pci_bridge_wait_for_secondary_bus() after performing a
> Secondary Bus Reset, but neglect to do the same after coming out of a
> DPC-induced Hot Reset.  As a result, we're not observing the delays
> prescribed by PCIe r6.0 sec 6.6.1 and may access devices on the
> secondary bus before they're ready.  Fix it.

Please remove "we" usage. Otherwise, looks good.

> 
> Tested-by: Ravi Kishore Koppuravuri <ravi.kishore.koppuravuri@intel.com>
> Signed-off-by: Lukas Wunner <lukas@wunner.de>
> Cc: stable@vger.kernel.org
> ---
>  drivers/pci/pci.c      | 3 ---
>  drivers/pci/pci.h      | 3 +++
>  drivers/pci/pcie/dpc.c | 4 ++--
>  3 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index b0b49243a908..19fe0ef0e583 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -167,9 +167,6 @@ static int __init pcie_port_pm_setup(char *str)
>  }
>  __setup("pcie_port_pm=", pcie_port_pm_setup);
>  
> -/* Time to wait after a reset for device to become responsive */
> -#define PCIE_RESET_READY_POLL_MS 60000
> -
>  /**
>   * pci_bus_max_busnr - returns maximum PCI bus number of given bus' children
>   * @bus: pointer to PCI bus structure to search
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 40758248dd80..b72fd888379b 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -7,6 +7,9 @@
>  /* Number of possible devfns: 0.0 to 1f.7 inclusive */
>  #define MAX_NR_DEVFNS 256
>  
> +/* Time to wait after a reset for device to become responsive */
> +#define PCIE_RESET_READY_POLL_MS 60000
> +
>  #define PCI_FIND_CAP_TTL	48
>  
>  #define PCI_VSEC_ID_INTEL_TBT	0x1234	/* Thunderbolt */
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index f5ffea17c7f8..a5d7c69b764e 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -170,8 +170,8 @@ pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
>  	pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
>  			      PCI_EXP_DPC_STATUS_TRIGGER);
>  
> -	if (!pcie_wait_for_link(pdev, true)) {
> -		pci_info(pdev, "Data Link Layer Link Active not set in 1000 msec\n");
> +	if (pci_bridge_wait_for_secondary_bus(pdev, "DPC",
> +					      PCIE_RESET_READY_POLL_MS)) {
>  		clear_bit(PCI_DPC_RECOVERED, &pdev->priv_flags);
>  		ret = PCI_ERS_RESULT_DISCONNECT;
>  	} else {

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] PCI: Unify delay handling for reset and resume
  2022-12-31 18:33 ` [PATCH 2/3] PCI: Unify delay handling for reset and resume Lukas Wunner
@ 2023-01-03 19:50   ` Sathyanarayanan Kuppuswamy
  2023-01-12 22:31   ` Bjorn Helgaas
  1 sibling, 0 replies; 13+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2023-01-03 19:50 UTC (permalink / raw)
  To: Lukas Wunner, Bjorn Helgaas, linux-pci
  Cc: Keith Busch, Ashok Raj, Ravi Kishore Koppuravuri,
	Mika Westerberg, Sheng Bi, Stanislav Spassov, Yang Su



On 12/31/22 10:33 AM, Lukas Wunner wrote:
> Sheng Bi reports that pci_bridge_secondary_bus_reset() may fail to wait
> for devices on the secondary bus to become accessible after reset:
> 
> Although it does call pci_dev_wait(), it erroneously passes the bridge's
> pci_dev rather than that of a child.  The bridge of course is always
> accessible while its secondary bus is reset, so pci_dev_wait() returns
> immediately.
> 
> Sheng Bi proposes introducing a new pci_bridge_secondary_bus_wait()
> function which is called from pci_bridge_secondary_bus_reset():
> 
> https://lore.kernel.org/linux-pci/20220523171517.32407-1-windy.bi.enflame@gmail.com/
> 
> However we already have pci_bridge_wait_for_secondary_bus() which does
> almost exactly what we need.  So far it's only called on resume from
> D3cold (which implies a Fundamental Reset per PCIe r6.0 sec 5.8).
> Re-using it for Secondary Bus Resets is a leaner and more rational
> approach than introducing a new function.
> 
> That only requires a few minor tweaks:
> 
> - Amend pci_bridge_wait_for_secondary_bus() to await accessibility of
>   the first device on the secondary bus by calling pci_dev_wait() after
>   performing the prescribed delays.  pci_dev_wait() needs two parameters,
>   a reset reason and a timeout, which callers must now pass to
>   pci_bridge_wait_for_secondary_bus().  The timeout is 1 sec for resume
>   (PCIe r6.0 sec 6.6.1) and 60 sec for reset (commit 821cdad5c46c ("PCI:
>   Wait up to 60 seconds for device to become ready after FLR")).
> 
> - Amend pci_bridge_wait_for_secondary_bus() to return 0 on success or
>   -ENOTTY on error for consumption by pci_bridge_secondary_bus_reset().
> 
> - Drop an unnecessary 1 sec delay from pci_reset_secondary_bus() which
>   is now performed by pci_bridge_wait_for_secondary_bus().  A static
>   delay this long is only necessary for Conventional PCI, so modern
>   PCIe systems benefit from shorter reset times as a side effect.
> 
> Fixes: 6b2f1351af56 ("PCI: Wait for device to become ready after secondary bus reset")
> Reported-by: Sheng Bi <windy.bi.enflame@gmail.com>
> Tested-by: Ravi Kishore Koppuravuri <ravi.kishore.koppuravuri@intel.com>
> Signed-off-by: Lukas Wunner <lukas@wunner.de>
> Cc: stable@vger.kernel.org # v4.17+
> Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
> ---

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

>  drivers/pci/pci-driver.c |  2 +-
>  drivers/pci/pci.c        | 50 ++++++++++++++++++----------------------
>  drivers/pci/pci.h        |  3 ++-
>  3 files changed, 25 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index a2ceeacc33eb..02e84c87f41a 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -572,7 +572,7 @@ static void pci_pm_default_resume_early(struct pci_dev *pci_dev)
>  
>  static void pci_pm_bridge_power_up_actions(struct pci_dev *pci_dev)
>  {
> -	pci_bridge_wait_for_secondary_bus(pci_dev);
> +	pci_bridge_wait_for_secondary_bus(pci_dev, "resume", 1000);
>  	/*
>  	 * When powering on a bridge from D3cold, the whole hierarchy may be
>  	 * powered on into D0uninitialized state, resume them to give them a
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index f43f3e84f634..b0b49243a908 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4948,24 +4948,31 @@ static int pci_bus_max_d3cold_delay(const struct pci_bus *bus)
>  /**
>   * pci_bridge_wait_for_secondary_bus - Wait for secondary bus to be accessible
>   * @dev: PCI bridge
> + * @reset_type: reset type in human-readable form
> + * @timeout: maximum time to wait for devices on secondary bus
>   *
>   * Handle necessary delays before access to the devices on the secondary
> - * side of the bridge are permitted after D3cold to D0 transition.
> + * side of the bridge are permitted after D3cold to D0 transition
> + * or Conventional Reset.
>   *
>   * For PCIe this means the delays in PCIe 5.0 section 6.6.1. For
>   * conventional PCI it means Tpvrh + Trhfa specified in PCI 3.0 section
>   * 4.3.2.
> + *
> + * Return 0 on success or -ENOTTY if the first device on the secondary bus
> + * failed to become accessible.
>   */
> -void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
> +int pci_bridge_wait_for_secondary_bus(struct pci_dev *dev, char *reset_type,
> +				      int timeout)
>  {
>  	struct pci_dev *child;
>  	int delay;
>  
>  	if (pci_dev_is_disconnected(dev))
> -		return;
> +		return 0;
>  
>  	if (!pci_is_bridge(dev))
> -		return;
> +		return 0;
>  
>  	down_read(&pci_bus_sem);
>  
> @@ -4977,14 +4984,14 @@ void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
>  	 */
>  	if (!dev->subordinate || list_empty(&dev->subordinate->devices)) {
>  		up_read(&pci_bus_sem);
> -		return;
> +		return 0;
>  	}
>  
>  	/* Take d3cold_delay requirements into account */
>  	delay = pci_bus_max_d3cold_delay(dev->subordinate);
>  	if (!delay) {
>  		up_read(&pci_bus_sem);
> -		return;
> +		return 0;
>  	}
>  
>  	child = list_first_entry(&dev->subordinate->devices, struct pci_dev,
> @@ -4993,14 +5000,12 @@ void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
>  
>  	/*
>  	 * Conventional PCI and PCI-X we need to wait Tpvrh + Trhfa before
> -	 * accessing the device after reset (that is 1000 ms + 100 ms). In
> -	 * practice this should not be needed because we don't do power
> -	 * management for them (see pci_bridge_d3_possible()).
> +	 * accessing the device after reset (that is 1000 ms + 100 ms).
>  	 */
>  	if (!pci_is_pcie(dev)) {
>  		pci_dbg(dev, "waiting %d ms for secondary bus\n", 1000 + delay);
>  		msleep(1000 + delay);
> -		return;
> +		return 0;
>  	}
>  
>  	/*
> @@ -5017,11 +5022,11 @@ void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
>  	 * configuration requests if we only wait for 100 ms (see
>  	 * https://bugzilla.kernel.org/show_bug.cgi?id=203885).
>  	 *
> -	 * Therefore we wait for 100 ms and check for the device presence.
> -	 * If it is still not present give it an additional 100 ms.
> +	 * Therefore we wait for 100 ms and check for the device presence
> +	 * until the timeout expires.
>  	 */
>  	if (!pcie_downstream_port(dev))
> -		return;
> +		return 0;
>  
>  	if (pcie_get_speed_cap(dev) <= PCIE_SPEED_5_0GT) {
>  		pci_dbg(dev, "waiting %d ms for downstream link\n", delay);
> @@ -5032,14 +5037,11 @@ void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
>  		if (!pcie_wait_for_link_delay(dev, true, delay)) {
>  			/* Did not train, no need to wait any further */
>  			pci_info(dev, "Data Link Layer Link Active not set in 1000 msec\n");
> -			return;
> +			return -ENOTTY;
>  		}
>  	}
>  
> -	if (!pci_device_is_present(child)) {
> -		pci_dbg(child, "waiting additional %d ms to become accessible\n", delay);
> -		msleep(delay);
> -	}
> +	return pci_dev_wait(child, reset_type, timeout - delay);
>  }
>  
>  void pci_reset_secondary_bus(struct pci_dev *dev)
> @@ -5058,15 +5060,6 @@ void pci_reset_secondary_bus(struct pci_dev *dev)
>  
>  	ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
>  	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
> -
> -	/*
> -	 * Trhfa for conventional PCI is 2^25 clock cycles.
> -	 * Assuming a minimum 33MHz clock this results in a 1s
> -	 * delay before we can consider subordinate devices to
> -	 * be re-initialized.  PCIe has some ways to shorten this,
> -	 * but we don't make use of them yet.
> -	 */
> -	ssleep(1);
>  }
>  
>  void __weak pcibios_reset_secondary_bus(struct pci_dev *dev)
> @@ -5085,7 +5078,8 @@ int pci_bridge_secondary_bus_reset(struct pci_dev *dev)
>  {
>  	pcibios_reset_secondary_bus(dev);
>  
> -	return pci_dev_wait(dev, "bus reset", PCIE_RESET_READY_POLL_MS);
> +	return pci_bridge_wait_for_secondary_bus(dev, "bus reset",
> +						 PCIE_RESET_READY_POLL_MS);
>  }
>  EXPORT_SYMBOL_GPL(pci_bridge_secondary_bus_reset);
>  
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 9ed3b5550043..40758248dd80 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -86,8 +86,9 @@ void pci_msi_init(struct pci_dev *dev);
>  void pci_msix_init(struct pci_dev *dev);
>  bool pci_bridge_d3_possible(struct pci_dev *dev);
>  void pci_bridge_d3_update(struct pci_dev *dev);
> -void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev);
>  void pci_bridge_reconfigure_ltr(struct pci_dev *dev);
> +int pci_bridge_wait_for_secondary_bus(struct pci_dev *dev, char *reset_type,
> +				      int timeout);
>  
>  static inline void pci_wakeup_event(struct pci_dev *dev)
>  {

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/3] PCI/PM: Observe reset delay irrespective of bridge_d3
  2022-12-31 18:33 ` [PATCH 1/3] PCI/PM: Observe reset delay irrespective of bridge_d3 Lukas Wunner
@ 2023-01-03 19:50   ` Sathyanarayanan Kuppuswamy
  0 siblings, 0 replies; 13+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2023-01-03 19:50 UTC (permalink / raw)
  To: Lukas Wunner, Bjorn Helgaas, linux-pci
  Cc: Keith Busch, Ashok Raj, Ravi Kishore Koppuravuri,
	Mika Westerberg, Sheng Bi, Stanislav Spassov, Yang Su



On 12/31/22 10:33 AM, Lukas Wunner wrote:
> If a PCI bridge is suspended to D3cold upon entering system sleep,
> resuming it entails a Fundamental Reset per PCIe r6.0 sec 5.8.
> 
> The delay prescribed after a Fundamental Reset in PCIe r6.0 sec 6.6.1
> is sought to be observed by:
> 
>   pci_pm_resume_noirq()
>     pci_pm_bridge_power_up_actions()
>       pci_bridge_wait_for_secondary_bus()
> 
> However, pci_bridge_wait_for_secondary_bus() bails out if the bridge_d3
> flag is not set.  That flag indicates whether a bridge is allowed to
> suspend to D3cold at *runtime*.
> 
> Hence *no* delay is observed on resume from system sleep if runtime
> D3cold is forbidden.  That doesn't make any sense, so drop the bridge_d3
> check from pci_bridge_wait_for_secondary_bus().
> 
> The purpose of the bridge_d3 check was probably to avoid delays if a
> bridge remained in D0 during suspend.  However the sole caller of
> pci_bridge_wait_for_secondary_bus(), pci_pm_bridge_power_up_actions(),
> is only invoked if the previous power state was D3cold.  Hence the
> additional bridge_d3 check seems superfluous.
> 
> Fixes: ad9001f2f411 ("PCI/PM: Add missing link delays required by the PCIe spec")
> Tested-by: Ravi Kishore Koppuravuri <ravi.kishore.koppuravuri@intel.com>
> Signed-off-by: Lukas Wunner <lukas@wunner.de>
> Cc: stable@vger.kernel.org # v5.5+
> Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
> ---

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

>  drivers/pci/pci.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index fba95486caaf..f43f3e84f634 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4964,7 +4964,7 @@ void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
>  	if (pci_dev_is_disconnected(dev))
>  		return;
>  
> -	if (!pci_is_bridge(dev) || !dev->bridge_d3)
> +	if (!pci_is_bridge(dev))
>  		return;
>  
>  	down_read(&pci_bus_sem);

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] PCI: Unify delay handling for reset and resume
  2022-12-31 18:33 ` [PATCH 2/3] PCI: Unify delay handling for reset and resume Lukas Wunner
  2023-01-03 19:50   ` Sathyanarayanan Kuppuswamy
@ 2023-01-12 22:31   ` Bjorn Helgaas
  1 sibling, 0 replies; 13+ messages in thread
From: Bjorn Helgaas @ 2023-01-12 22:31 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: linux-pci, Keith Busch, Ashok Raj, Sathyanarayanan Kuppuswamy,
	Ravi Kishore Koppuravuri, Mika Westerberg, Sheng Bi,
	Stanislav Spassov, Yang Su

On Sat, Dec 31, 2022 at 07:33:38PM +0100, Lukas Wunner wrote:
> Sheng Bi reports that pci_bridge_secondary_bus_reset() may fail to wait
> for devices on the secondary bus to become accessible after reset:
> 
> Although it does call pci_dev_wait(), it erroneously passes the bridge's
> pci_dev rather than that of a child.  The bridge of course is always
> accessible while its secondary bus is reset, so pci_dev_wait() returns
> immediately.
> 
> Sheng Bi proposes introducing a new pci_bridge_secondary_bus_wait()
> function which is called from pci_bridge_secondary_bus_reset():
> 
> https://lore.kernel.org/linux-pci/20220523171517.32407-1-windy.bi.enflame@gmail.com/
> 
> However we already have pci_bridge_wait_for_secondary_bus() which does
> almost exactly what we need.  So far it's only called on resume from
> D3cold (which implies a Fundamental Reset per PCIe r6.0 sec 5.8).
> Re-using it for Secondary Bus Resets is a leaner and more rational
> approach than introducing a new function.
> 
> That only requires a few minor tweaks:
> 
> - Amend pci_bridge_wait_for_secondary_bus() to await accessibility of
>   the first device on the secondary bus by calling pci_dev_wait() after
>   performing the prescribed delays.  pci_dev_wait() needs two parameters,
>   a reset reason and a timeout, which callers must now pass to
>   pci_bridge_wait_for_secondary_bus().  The timeout is 1 sec for resume
>   (PCIe r6.0 sec 6.6.1) and 60 sec for reset (commit 821cdad5c46c ("PCI:
>   Wait up to 60 seconds for device to become ready after FLR")).
> 
> - Amend pci_bridge_wait_for_secondary_bus() to return 0 on success or
>   -ENOTTY on error for consumption by pci_bridge_secondary_bus_reset().
> 
> - Drop an unnecessary 1 sec delay from pci_reset_secondary_bus() which
>   is now performed by pci_bridge_wait_for_secondary_bus().  A static
>   delay this long is only necessary for Conventional PCI, so modern
>   PCIe systems benefit from shorter reset times as a side effect.
> 
> Fixes: 6b2f1351af56 ("PCI: Wait for device to become ready after secondary bus reset")
> Reported-by: Sheng Bi <windy.bi.enflame@gmail.com>
> Tested-by: Ravi Kishore Koppuravuri <ravi.kishore.koppuravuri@intel.com>
> Signed-off-by: Lukas Wunner <lukas@wunner.de>
> Cc: stable@vger.kernel.org # v4.17+
> Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
> ---
>  drivers/pci/pci-driver.c |  2 +-
>  drivers/pci/pci.c        | 50 ++++++++++++++++++----------------------
>  drivers/pci/pci.h        |  3 ++-
>  3 files changed, 25 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index a2ceeacc33eb..02e84c87f41a 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -572,7 +572,7 @@ static void pci_pm_default_resume_early(struct pci_dev *pci_dev)
>  
>  static void pci_pm_bridge_power_up_actions(struct pci_dev *pci_dev)
>  {
> -	pci_bridge_wait_for_secondary_bus(pci_dev);
> +	pci_bridge_wait_for_secondary_bus(pci_dev, "resume", 1000);

It sounds like this 1000 ms value is prescribed by sec 6.6.1, so we
should have a #define for it.  I know we didn't use one even before,
but this seems like a a good opportunity to add it.

>  /**
>   * pci_bridge_wait_for_secondary_bus - Wait for secondary bus to be accessible
>   * @dev: PCI bridge
> + * @reset_type: reset type in human-readable form
> + * @timeout: maximum time to wait for devices on secondary bus

I think we should mention here that the timeout is in milliseconds.

Bjorn

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] PCI/DPC: Await readiness of secondary bus after reset
  2022-12-31 18:33 ` [PATCH 3/3] PCI/DPC: Await readiness of secondary bus after reset Lukas Wunner
  2023-01-03 19:49   ` Sathyanarayanan Kuppuswamy
@ 2023-01-12 22:35   ` Bjorn Helgaas
       [not found]     ` <15135d89-0515-d965-567b-79b3eca236e6@linux.alibaba.com>
  2023-01-13  9:10     ` Lukas Wunner
  1 sibling, 2 replies; 13+ messages in thread
From: Bjorn Helgaas @ 2023-01-12 22:35 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: linux-pci, Keith Busch, Ashok Raj, Sathyanarayanan Kuppuswamy,
	Ravi Kishore Koppuravuri, Mika Westerberg, Sheng Bi,
	Stanislav Spassov, Yang Su

On Sat, Dec 31, 2022 at 07:33:39PM +0100, Lukas Wunner wrote:
> We're calling pci_bridge_wait_for_secondary_bus() after performing a
> Secondary Bus Reset, but neglect to do the same after coming out of a
> DPC-induced Hot Reset.  As a result, we're not observing the delays
> prescribed by PCIe r6.0 sec 6.6.1 and may access devices on the
> secondary bus before they're ready.  Fix it.
> 
> Tested-by: Ravi Kishore Koppuravuri <ravi.kishore.koppuravuri@intel.com>

I assume this patch is the one that makes the difference for the
Intel Ponte Vecchio HPC GPU?  Is there a URL to a problem report, or
at least a sentence or two we can include here to connect the patch
with the problem users may see?  Most people won't know how to
recognize accesses to devices on the secondary bus before they're
ready.

Bjorn

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] PCI/DPC: Await readiness of secondary bus after reset
       [not found]     ` <15135d89-0515-d965-567b-79b3eca236e6@linux.alibaba.com>
@ 2023-01-13  3:06       ` Bjorn Helgaas
  2023-01-13 10:18         ` Lukas Wunner
  0 siblings, 1 reply; 13+ messages in thread
From: Bjorn Helgaas @ 2023-01-13  3:06 UTC (permalink / raw)
  To: Yang Su
  Cc: Bjorn Helgaas, Lukas Wunner, linux-pci, Keith Busch, Ashok Raj,
	Sathyanarayanan Kuppuswamy, Ravi Kishore Koppuravuri,
	Mika Westerberg, Sheng Bi, Stanislav Spassov

On Thu, Jan 12, 2023 at 8:39 PM Yang Su <yang.su@linux.alibaba.com> wrote:
>
> Hi Bjorn,
>
> This email is I discussed with Lukas previously, but now I found I forget to CC you.
>
> I also test pci_bridge_wait_for_secondary_bus in NVIDIA GPU T4 which bind vfio
>
> passthrough in VMM, I found the pci_bridge_wait_for_secondary_bus can not wait
>
> the enough time as pci spec requires, the reasons are described as below.

Hi Yang,

When you're discussing Linux patches, please always keep the cc list
so the rest of us can see what's happening.  Also please use plain
ASCII email (not HTML, etc) because the list archives often discard
fancy emails.  Hints:

http://vger.kernel.org/majordomo-info.html
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
https://people.kernel.org/tglx/notes-about-netiquette

I'll wait for you and Lukas to continue this conversation on the mailing list.

Bjorn

> -------- 转发的消息 --------
> 主题: Re: [PATCH 2/3] PCI: Unify delay handling for reset and resume
> 日期: Wed, 4 Jan 2023 12:46:17 +0800
> From: Yang Su <yang.su@linux.alibaba.com>
> 收件人: Lukas Wunner <lukas@wunner.de>
>
>
> Hi Lukas,
>
> When some device such as NVIDIA GPU T4 bind vfio to passthrough in VMM, the vfio will do reset for the divice, but some device
>
> such as NVIDIA GPU T4 only support secondary bus reset (SBR), and then will call pci_reset_secondary_bus.
>
> The call path is below as figure shows,
>
>
> There are there reasons I think your modify for pci_reset_secondary_bus is not ok.
>
> 1. As I describe for you, if the device not bridge call pci_reset_secondary_bus,
>
> the device will enter pci_bridge_wait_for_secondary_bus, but in pci_bridge_wait_for_secondary_bus
>
> the device is not judged as pci bridge, will directly return.
>
> if (!pci_is_bridge(dev) || !dev->bridge_d3) {
>
> above code will do the judge.
>
>
> 2. In pci_bridge_wait_for_secondary_bus, pci_bus_max_d3cold_delay will take count of wrong time delay,
>
> such as NVIDIA GPU T4 is not pci bridge, so the subordinate is none, pci_bus_max_d3cold_delay
>
> set the min_delay is 100, max_delay is 0, here is the bug, after list_for_each_entry() in pci_bus_max_d3cold_delay,
>
> the min_delay will be 0, the max_delay also 0, the pci_bus_max_d3cold_delay return is surely 0.
>
>
> 3. pci_dev_wait must be saved, and cannot be replaced as pci_bridge_wait_for_secondary_bus.
>
> Because the pcie spec requires the device to check CRS, and pci_dev_wait do this thing.
>
> "
>
> Note: Software should use 100 ms wait periods only if software enables CRS Software Visibility.
>
> Otherwise, Completion timeouts, platform timeouts, or lengthy processor instruction stalls may result.
>
> See the Configuration Request Retry Status Implementation Note in Section 2.3.1.
>
> "
>
>
> void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
> {
>     struct pci_dev *child;
>     int delay;
>
>     if (pci_dev_is_disconnected(dev))
>         return;
>
>     if (!pci_is_bridge(dev) || !dev->bridge_d3)
>         return;
>
> ...
>
>     /* Take d3cold_delay requirements into account */
>     delay = pci_bus_max_d3cold_delay(dev->subordinate);
>
> }
>
>
>   void pci_reset_secondary_bus(struct pci_dev *dev)
> @@ -5058,15 +5060,6 @@ void pci_reset_secondary_bus(struct pci_dev *dev)
>
>   ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
>   pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
> -
> - /*
> - * Trhfa for conventional PCI is 2^25 clock cycles.
> - * Assuming a minimum 33MHz clock this results in a 1s
> - * delay before we can consider subordinate devices to
> - * be re-initialized.  PCIe has some ways to shorten this,
> - * but we don't make use of them yet.
> - */
> - ssleep(1);
>   }
>
>   void __weak pcibios_reset_secondary_bus(struct pci_dev *dev)
> @@ -5085,7 +5078,8 @@ int pci_bridge_secondary_bus_reset(struct pci_dev *dev)
>   {
>   pcibios_reset_secondary_bus(dev);
>
> - return pci_dev_wait(dev, "bus reset", PCIE_RESET_READY_POLL_MS);
> + return pci_bridge_wait_for_secondary_bus(dev, "bus reset",
> + PCIE_RESET_READY_POLL_MS);
>   }
>
>
> 在 2023/1/4 01:25, Lukas Wunner 写道:
>
> Hi Yang Su!
>
> Thanks for reaching out.
>
> On Tue, Jan 03, 2023 at 07:45:25PM +0800, Yang Su wrote:
>
> I think your modify in pci_reset_secondary_bus is not ok. I test pci_bridge_secondary_bus_reset used in pci_reset_secondary_bus,
> there will be an error is a device is not pci bridge device in pci_reset_secondary_buscall pci_bridge_secondary_bus_reset,
> pci_bridge_secondary_bus_reset judge the device is not pci bridge device and just return and will do nothing.
> So just use pci_bridge_secondary_bus_reset to replace ssleep(1) is not ok.
>
> Hm, I don't quite understand.
>
> How do you invoke pci_bridge_secondary_bus_reset() with a non-bridge device?
> I followed all the code paths leading to pci_bridge_secondary_bus_reset()
> and it seems to me none of them is calling the function with an endpoint
> device.  What am I missing?
>
> Thanks,
>
> Lukas
>
> 在 2023/1/13 06:35, Bjorn Helgaas 写道:
>
> On Sat, Dec 31, 2022 at 07:33:39PM +0100, Lukas Wunner wrote:
>
> We're calling pci_bridge_wait_for_secondary_bus() after performing a
> Secondary Bus Reset, but neglect to do the same after coming out of a
> DPC-induced Hot Reset.  As a result, we're not observing the delays
> prescribed by PCIe r6.0 sec 6.6.1 and may access devices on the
> secondary bus before they're ready.  Fix it.
>
> Tested-by: Ravi Kishore Koppuravuri <ravi.kishore.koppuravuri@intel.com>
>
> I assume this patch is the one that makes the difference for the
> Intel Ponte Vecchio HPC GPU?  Is there a URL to a problem report, or
> at least a sentence or two we can include here to connect the patch
> with the problem users may see?  Most people won't know how to
> recognize accesses to devices on the secondary bus before they're
> ready.
>
> Bjorn

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] PCI/DPC: Await readiness of secondary bus after reset
  2023-01-12 22:35   ` Bjorn Helgaas
       [not found]     ` <15135d89-0515-d965-567b-79b3eca236e6@linux.alibaba.com>
@ 2023-01-13  9:10     ` Lukas Wunner
  1 sibling, 0 replies; 13+ messages in thread
From: Lukas Wunner @ 2023-01-13  9:10 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Keith Busch, Ashok Raj, Sathyanarayanan Kuppuswamy,
	Ravi Kishore Koppuravuri, Mika Westerberg, Sheng Bi,
	Stanislav Spassov, Yang Su

On Thu, Jan 12, 2023 at 04:35:33PM -0600, Bjorn Helgaas wrote:
> On Sat, Dec 31, 2022 at 07:33:39PM +0100, Lukas Wunner wrote:
> > We're calling pci_bridge_wait_for_secondary_bus() after performing a
> > Secondary Bus Reset, but neglect to do the same after coming out of a
> > DPC-induced Hot Reset.  As a result, we're not observing the delays
> > prescribed by PCIe r6.0 sec 6.6.1 and may access devices on the
> > secondary bus before they're ready.  Fix it.
> > 
> > Tested-by: Ravi Kishore Koppuravuri <ravi.kishore.koppuravuri@intel.com>
> 
> I assume this patch is the one that makes the difference for the
> Intel Ponte Vecchio HPC GPU?

Right.


> Is there a URL to a problem report, or
> at least a sentence or two we can include here to connect the patch
> with the problem users may see?

There's no public problem report.  My understanding is that Ponte Vecchio
was formally launched this Tuesday and mass distribution starts only now:

https://www.tomshardware.com/news/intel-launches-sapphire-rapids-fourth-gen-xeon-cpus-and-ponte-vecchio-max-gpu-series

The idea is to get the issue in the kernel fixed early so that users will
never even see it.


> Most people won't know how to
> recognize accesses to devices on the secondary bus before they're
> ready.

With Ponte Vecchio, the GPU is located below a PCIe switch and the
Downstream Port Containment happens at the Root Port.  So the Root
Port needs to wait for the Switch Upstream Port to re-appear.

Because config space is currently restored too early on the Switch
Upstream Port, it remains in D0uninitialized once it comes out of
reset, so all its registers, in particular the bridge windows,
are in power-on reset state.  As a result, anything downstream of it
(including the GPU) remains inaccessible and the user-visible
error messages look like this:

i915 0000:8c:00.0: can't change power state from D3cold to D0 (config space inaccessible)
intel_vsec 0000:8e:00.1: can't change power state from D3cold to D0 (config space inaccessible)

Where intel_vsec is a sibling of the GPU which is used for
telemetry I believe.

I'll be sure to include that additional information in the commit
message when respinning.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] PCI/DPC: Await readiness of secondary bus after reset
  2023-01-13  3:06       ` Bjorn Helgaas
@ 2023-01-13 10:18         ` Lukas Wunner
  0 siblings, 0 replies; 13+ messages in thread
From: Lukas Wunner @ 2023-01-13 10:18 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Yang Su, Bjorn Helgaas, linux-pci, Keith Busch, Ashok Raj,
	Sathyanarayanan Kuppuswamy, Ravi Kishore Koppuravuri,
	Mika Westerberg, Sheng Bi, Stanislav Spassov

On Thu, Jan 12, 2023 at 09:06:15PM -0600, Bjorn Helgaas wrote:
> On Thu, Jan 12, 2023 at 8:39 PM Yang Su <yang.su@linux.alibaba.com> wrote:
> > I also test pci_bridge_wait_for_secondary_bus in NVIDIA GPU T4
> > which bind vfio passthrough in VMM, I found the
> > pci_bridge_wait_for_secondary_bus can not wait the enough time
> > as pci spec requires, the reasons are described as below.
[...]
> I'll wait for you and Lukas to continue this conversation on the
> mailing list.

Yang Su reports that pci_bridge_secondary_bus_reset() is called for
an Nvidia T4 GPU.  That's an endpoint device but the function should
only ever be called for bridges.  It's unclear to me how that can happen.

The call stack looks like this:

vfio_pci_open()
  pci_set_power_state()
  pci_clear_master()
  pci_enable_device()
  pci_try_reset_function()
    pci_dev_trylock()
    pci_dev_save_and_disable()
    pci_dev_specific_reset()
    pcie_has_flr()
    pcie_af_flr()
    pci_read_config_word()
    pci_dev_reset_slot_function()
    pci_parent_bus_reset()
      pci_bridge_secondary_bus_reset()
        pcibios_reset_secondary_bus()
	  pci_reset_secondary_bus()
	    pci_read_config_word()
	    pci_write_config_word()
	    pci_write_config_word()
	    pcie_wait_for_link_delay()
	pci_dev_wait()
    pci_dev_restore()
    pci_cfg_access_unlock()
  pci_save_state()
  pci_store_saved_state()
  vfio_config_init()

Note that vfio_pci_open() was renamed to vfio_pci_open_device() with
commit 2cd8b14aaa66, which went into v5.15.  So apparently this call
stack is from an earlier kernel.

The GPU is located below a PEX9797 switch, which is connected to a
SkyLake-E Root Port.  So pci_bridge_secondary_bus_reset() should be
called for the Switch Downstream Port but Yang Su says it's called
for the GPU endpoint device.

pci_parent_bus_reset() finds the parent by following dev->bus->self.
I've suggested to Yang Su to replace that with pci_upstream_bridge(dev)
and see if it fixes the issue.  It does make a difference in SR-IOV
scenarios (see the comment in include/linux/pci.h above pci_is_root_bus())
though Yang Su reports that SR-IOV is not used on this machine,
only vfio passthrough.

I believe that endpoint devices don't have a Bridge Control Register
(Type 0 Config Space has Max_Lat / Min_Gnt instead), so setting the
Secondary Bus Reset bit should have no effect.  But apparently it
does have an effect because Yang Su is witnessing issues with the delay
after reset.

Specifically, Yang Su says that pci_bridge_wait_for_secondary_bus()
bails out because the GPU endpoint device fails the !pci_is_bridge()
check at the top of the function.  Also, the calculation of
pci_bus_max_d3cold_delay() fails because the GPU lacks a subordinate bus.
Apparently the unconditional 1 second delay in pci_reset_secondary_bus()
papers over the issue because it waits long enough for the GPU endpoint
device to come out of reset.

Maybe the information that pci_bridge_secondary_bus_reset() is executed
for an endpoint device is not correct and it's in fact executed for
the Downstream Port of the PEX9797 switch, but perhaps config space
of the port is broken and contains an incorrect Header Type field,
which would cause pci_is_bridge() to return false for it.  Though we
wouldn't scan the secondary bus below the switch in that case,
so the GPU wouldn't be enumerated.  I've requested "lspci -vvv" output
for the GPU and the Downstream Port off-list but haven't received it yet.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2023-01-13 10:28 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-31 18:33 [PATCH 0/3] PCI reset delay fixes Lukas Wunner
2022-12-31 18:33 ` [PATCH 1/3] PCI/PM: Observe reset delay irrespective of bridge_d3 Lukas Wunner
2023-01-03 19:50   ` Sathyanarayanan Kuppuswamy
2022-12-31 18:33 ` [PATCH 2/3] PCI: Unify delay handling for reset and resume Lukas Wunner
2023-01-03 19:50   ` Sathyanarayanan Kuppuswamy
2023-01-12 22:31   ` Bjorn Helgaas
2022-12-31 18:33 ` [PATCH 3/3] PCI/DPC: Await readiness of secondary bus after reset Lukas Wunner
2023-01-03 19:49   ` Sathyanarayanan Kuppuswamy
2023-01-12 22:35   ` Bjorn Helgaas
     [not found]     ` <15135d89-0515-d965-567b-79b3eca236e6@linux.alibaba.com>
2023-01-13  3:06       ` Bjorn Helgaas
2023-01-13 10:18         ` Lukas Wunner
2023-01-13  9:10     ` Lukas Wunner
2023-01-03 11:09 ` [PATCH 0/3] PCI reset delay fixes Mika Westerberg

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.