linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] Improve PCI device post-reset readiness polling
@ 2020-02-23 12:20 Stanislav Spassov
  2020-02-23 12:20 ` [PATCH 1/3] PCI: Make PCIE_RESET_READY_POLL_MS configurable Stanislav Spassov
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Stanislav Spassov @ 2020-02-23 12:20 UTC (permalink / raw)
  To: linux-pci
  Cc: Stanislav Spassov, Bjorn Helgaas, Thomas Gleixner, Andrew Morton,
	Jan H . Schönherr, Wei Wang

From: Stanislav Spassov <stanspas@amazon.de>

Enable the global maximum polling time to be configured on the kernel
command line, and make per-device overrides possible. This allows the
default timeout to be lowered while accomodating devices that require
more time to finish initialization after a reset.

When Configuration Request Retry Status Software Visibility is enabled
on the parent PCIe Root Port, it is better to poll the PCI_VENDOR_ID
register to get the special CRS behavior specified in the PCI Express
Base Specification. Polling a different register can result in system
crashes due to core timeouts when the Root Port autonomously keeps
retrying the Configuration Read without reporting back to the CPU.

Wei Wang (1):
  PCI: Make PCIE_RESET_READY_POLL_MS configurable

Stanislav Spassov (2):
  PCI: Introduce per-device reset_ready_poll override
  PCI: Add CRS handling to pci_dev_wait()

 .../admin-guide/kernel-parameters.txt         |   5 +
 drivers/pci/pci.c                             | 157 +++++++++++++++---
 drivers/pci/probe.c                           |   2 +
 include/linux/pci.h                           |   1 +
 4 files changed, 138 insertions(+), 27 deletions(-)


base-commit: bb6d3fb354c5ee8d6bde2d576eb7220ea09862b9
-- 
2.25.1




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879




^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/3] PCI: Make PCIE_RESET_READY_POLL_MS configurable
  2020-02-23 12:20 [PATCH 0/3] Improve PCI device post-reset readiness polling Stanislav Spassov
@ 2020-02-23 12:20 ` Stanislav Spassov
  2020-02-24 14:15   ` Bjorn Helgaas
  2020-02-23 12:20 ` [PATCH 2/3] PCI: Introduce per-device reset_ready_poll override Stanislav Spassov
  2020-02-23 12:20 ` [PATCH 3/3] PCI: Add CRS handling to pci_dev_wait() Stanislav Spassov
  2 siblings, 1 reply; 15+ messages in thread
From: Stanislav Spassov @ 2020-02-23 12:20 UTC (permalink / raw)
  To: linux-pci
  Cc: Stanislav Spassov, Bjorn Helgaas, Thomas Gleixner, Andrew Morton,
	Jan H . Schönherr, Wei Wang, Jonathan Corbet, linux-doc

From: Wei Wang <wawei@amazon.de>

The resonable value for the maximum time to wait for a PCI device to be
ready after reset varies depending on the platform and the reliability
of its set of devices.

Signed-off-by: Wei Wang <wawei@amazon.de>
Signed-off-by: Stanislav Spassov <stanspas@amazon.de>
---
 .../admin-guide/kernel-parameters.txt         |  5 +++++
 drivers/pci/pci.c                             | 22 ++++++++++++++-----
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index dbc22d684627..5e4dade9acc8 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3653,6 +3653,11 @@
 		nomsi	Do not use MSI for native PCIe PME signaling (this makes
 			all PCIe root ports use INTx for all services).
 
+	pcie_reset_ready_poll_ms= [PCI,PCIE]
+			Specifies timeout for PCI(e) device readiness polling
+			after device reset (in milliseconds).
+			Default: 60000 = 60 seconds
+
 	pcmv=		[HW,PCMCIA] BadgePAD 4
 
 	pd_ignore_unused
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index d828ca835a98..db9b58ab6c68 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -149,7 +149,19 @@ static int __init pcie_port_pm_setup(char *str)
 __setup("pcie_port_pm=", pcie_port_pm_setup);
 
 /* Time to wait after a reset for device to become responsive */
-#define PCIE_RESET_READY_POLL_MS 60000
+#define PCIE_RESET_READY_POLL_MS_DEFAULT 60000
+
+int __read_mostly pcie_reset_ready_poll_ms = PCIE_RESET_READY_POLL_MS_DEFAULT;
+
+static int __init pcie_reset_ready_poll_ms_setup(char *str)
+{
+	int timeout;
+
+	if (!kstrtoint(str, 0, &timeout))
+		pcie_reset_ready_poll_ms = timeout;
+	return 1;
+}
+__setup("pcie_reset_ready_poll_ms=", pcie_reset_ready_poll_ms_setup);
 
 /**
  * pci_bus_max_busnr - returns maximum PCI bus number of given bus' children
@@ -4506,7 +4518,7 @@ int pcie_flr(struct pci_dev *dev)
 	 */
 	msleep(100);
 
-	return pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
+	return pci_dev_wait(dev, "FLR", pcie_reset_ready_poll_ms);
 }
 EXPORT_SYMBOL_GPL(pcie_flr);
 
@@ -4551,7 +4563,7 @@ static int pci_af_flr(struct pci_dev *dev, int probe)
 	 */
 	msleep(100);
 
-	return pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS);
+	return pci_dev_wait(dev, "AF_FLR", pcie_reset_ready_poll_ms);
 }
 
 /**
@@ -4596,7 +4608,7 @@ static int pci_pm_reset(struct pci_dev *dev, int probe)
 	pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, csr);
 	pci_dev_d3_sleep(dev);
 
-	return pci_dev_wait(dev, "PM D3hot->D0", PCIE_RESET_READY_POLL_MS);
+	return pci_dev_wait(dev, "PM D3hot->D0", pcie_reset_ready_poll_ms);
 }
 
 /**
@@ -4826,7 +4838,7 @@ int pci_bridge_secondary_bus_reset(struct pci_dev *dev)
 {
 	pcibios_reset_secondary_bus(dev);
 
-	return pci_dev_wait(dev, "bus reset", PCIE_RESET_READY_POLL_MS);
+	return pci_dev_wait(dev, "bus reset", pcie_reset_ready_poll_ms);
 }
 EXPORT_SYMBOL_GPL(pci_bridge_secondary_bus_reset);
 
-- 
2.25.1




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879




^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/3] PCI: Introduce per-device reset_ready_poll override
  2020-02-23 12:20 [PATCH 0/3] Improve PCI device post-reset readiness polling Stanislav Spassov
  2020-02-23 12:20 ` [PATCH 1/3] PCI: Make PCIE_RESET_READY_POLL_MS configurable Stanislav Spassov
@ 2020-02-23 12:20 ` Stanislav Spassov
  2020-02-24 14:34   ` Bjorn Helgaas
  2020-02-23 12:20 ` [PATCH 3/3] PCI: Add CRS handling to pci_dev_wait() Stanislav Spassov
  2 siblings, 1 reply; 15+ messages in thread
From: Stanislav Spassov @ 2020-02-23 12:20 UTC (permalink / raw)
  To: linux-pci
  Cc: Stanislav Spassov, Bjorn Helgaas, Thomas Gleixner, Andrew Morton,
	Jan H . Schönherr, Wei Wang

From: Stanislav Spassov <stanspas@amazon.de>

A broken device may never become responsive after reset, hence the need
for a timeout. However, waiting for too long can have unintended side
effects such as triggering hung task timeouts for processes waiting on
a lock held during the reset. Locks that are shared across multiple
devices, such as VFIO's per-bus reflck, are especially problematic,
because a single broken VF can cause hangs for processes working with
other VFs on the same bus.

To allow lowering the global default post-reset timeout, while still
accommodating devices that require more time, this patch introduces
a per-device override that can be configured via a quirk.

Signed-off-by: Stanislav Spassov <stanspas@amazon.de>
---
 drivers/pci/pci.c   | 19 ++++++++++++++-----
 drivers/pci/probe.c |  2 ++
 include/linux/pci.h |  1 +
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index db9b58ab6c68..a554818968ed 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1033,8 +1033,17 @@ void pci_wakeup_bus(struct pci_bus *bus)
 		pci_walk_bus(bus, pci_wakeup, NULL);
 }
 
-static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
+static int pci_dev_get_reset_ready_poll_ms(struct pci_dev *dev)
 {
+	if (dev->reset_ready_poll_ms >= 0)
+		return dev->reset_ready_poll_ms;
+
+	return pcie_reset_ready_poll_ms;
+}
+
+static int pci_dev_wait(struct pci_dev *dev, char *reset_type)
+{
+	int timeout = pci_dev_get_reset_ready_poll_ms(dev);
 	int delay = 1;
 	u32 id;
 
@@ -4518,7 +4527,7 @@ int pcie_flr(struct pci_dev *dev)
 	 */
 	msleep(100);
 
-	return pci_dev_wait(dev, "FLR", pcie_reset_ready_poll_ms);
+	return pci_dev_wait(dev, "FLR");
 }
 EXPORT_SYMBOL_GPL(pcie_flr);
 
@@ -4563,7 +4572,7 @@ static int pci_af_flr(struct pci_dev *dev, int probe)
 	 */
 	msleep(100);
 
-	return pci_dev_wait(dev, "AF_FLR", pcie_reset_ready_poll_ms);
+	return pci_dev_wait(dev, "AF_FLR");
 }
 
 /**
@@ -4608,7 +4617,7 @@ static int pci_pm_reset(struct pci_dev *dev, int probe)
 	pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, csr);
 	pci_dev_d3_sleep(dev);
 
-	return pci_dev_wait(dev, "PM D3hot->D0", pcie_reset_ready_poll_ms);
+	return pci_dev_wait(dev, "PM D3hot->D0");
 }
 
 /**
@@ -4838,7 +4847,7 @@ int pci_bridge_secondary_bus_reset(struct pci_dev *dev)
 {
 	pcibios_reset_secondary_bus(dev);
 
-	return pci_dev_wait(dev, "bus reset", pcie_reset_ready_poll_ms);
+	return pci_dev_wait(dev, "bus reset");
 }
 EXPORT_SYMBOL_GPL(pci_bridge_secondary_bus_reset);
 
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 512cb4312ddd..eeb79a45d504 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2166,6 +2166,8 @@ struct pci_dev *pci_alloc_dev(struct pci_bus *bus)
 	if (!dev)
 		return NULL;
 
+	dev->reset_ready_poll_ms = -1;
+
 	INIT_LIST_HEAD(&dev->bus_list);
 	dev->dev.type = &pci_dev_type;
 	dev->bus = pci_bus_get(bus);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 3840a541a9de..049a41b9412b 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -468,6 +468,7 @@ struct pci_dev {
 	size_t		romlen;		/* Length if not from BAR */
 	char		*driver_override; /* Driver name to force a match */
 
+	int		reset_ready_poll_ms;
 	unsigned long	priv_flags;	/* Private flags for the PCI driver */
 };
 
-- 
2.25.1




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879




^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/3] PCI: Add CRS handling to pci_dev_wait()
  2020-02-23 12:20 [PATCH 0/3] Improve PCI device post-reset readiness polling Stanislav Spassov
  2020-02-23 12:20 ` [PATCH 1/3] PCI: Make PCIE_RESET_READY_POLL_MS configurable Stanislav Spassov
  2020-02-23 12:20 ` [PATCH 2/3] PCI: Introduce per-device reset_ready_poll override Stanislav Spassov
@ 2020-02-23 12:20 ` Stanislav Spassov
  2020-02-24 20:41   ` Bjorn Helgaas
  2 siblings, 1 reply; 15+ messages in thread
From: Stanislav Spassov @ 2020-02-23 12:20 UTC (permalink / raw)
  To: linux-pci
  Cc: Stanislav Spassov, Bjorn Helgaas, Thomas Gleixner, Andrew Morton,
	Jan H . Schönherr, Wei Wang

From: Stanislav Spassov <stanspas@amazon.de>

The PCI Express specification dictates minimal amounts of time that the
host needs to wait after triggering different kinds of resets before it
is allowed to attempt accessing the device. After this waiting period,
devices are required to be responsive to Configuration Space reads.
However, if a device needs more time to actually complete the reset
operation internally, it may respond to the read with a Completion
Request Retry Status (CRS), and keep doing so on subsequent reads
for as long as necessary. If the device is broken, it may even keep
responding with CRS indefinitely.

The specification also mandates that any Root Port that supports CRS
and has CRS Software Visibility (CRS SV) enabled will synthesize the
special value 0x0001 for the Vendor ID and set any other bits to 1
upon receiving a CRS Completion for a Configuration Read Request that
includes both bytes of the Vendor ID (offset 0).

IF CRS is supported by Root Port but CRS SV is not enabled, the request
is retried autonomosly by the Root Port. Platform-specific configuration
registers may exist to limit the number of or time taken by such retries.

If CRS is not supported, or a different register (not Vendor ID) is
polled, or the device is responding with CA/UR Completions (rather than
CRS), the behavior is platform-dependent, but generally the Root Port
synthesizes ~0 to complete the software read.

Previously, pci_dev_wait() avoided taking advantage of CRS. However,
on platforms where no limit/timeout can be configured as explained
above, a device responding with CRS for too long (e.g. because it is
stuck and cannot complete its reset) may trigger more severe error
conditions (e.g. TOR timeout, 3-strike CPU CATERR), because the Root
Port never reports back to the lower-level component requesting the
transaction.

This patch introduces special handling when CRS is available, and
otherwise falls back to the previous behavior of polling COMMAND.

Signed-off-by: Stanislav Spassov <stanspas@amazon.de>
---
 drivers/pci/pci.c | 128 +++++++++++++++++++++++++++++++++++++---------
 1 file changed, 105 insertions(+), 23 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index a554818968ed..e8bce8da9402 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1041,44 +1041,126 @@ static int pci_dev_get_reset_ready_poll_ms(struct pci_dev *dev)
 	return pcie_reset_ready_poll_ms;
 }
 
-static int pci_dev_wait(struct pci_dev *dev, char *reset_type)
+static bool pci_crs_sv_enabled(struct pci_dev *dev)
+{
+	struct pci_dev *root_port = pcie_find_root_port(dev);
+	u16 root_control;
+
+	if (!root_port)
+		return false;
+
+	if (pcie_capability_read_word(root_port, PCI_EXP_RTCTL, &root_control))
+		return false;
+
+	return root_control & PCI_EXP_RTCTL_CRSSVE;
+}
+
+static int pci_dev_poll_until_ready(struct pci_dev *dev,
+				    int where,
+				    u32 mask,
+				    u32 bad_value,
+				    char *event,
+				    int timeout, int *waited,
+				    u32 *final_value)
 {
-	int timeout = pci_dev_get_reset_ready_poll_ms(dev);
 	int delay = 1;
-	u32 id;
+	u32 value;
 
-	/*
-	 * After reset, the device should not silently discard config
-	 * requests, but it may still indicate that it needs more time by
-	 * responding to them with CRS completions.  The Root Port will
-	 * generally synthesize ~0 data to complete the read (except when
-	 * CRS SV is enabled and the read was for the Vendor ID; in that
-	 * case it synthesizes 0x0001 data).
-	 *
-	 * Wait for the device to return a non-CRS completion.  Read the
-	 * Command register instead of Vendor ID so we don't have to
-	 * contend with the CRS SV value.
-	 */
-	pci_read_config_dword(dev, PCI_COMMAND, &id);
-	while (id == ~0) {
+	if (!event)
+		event = "<unknown event>";
+
+	if (pci_read_config_dword(dev, where, &value))
+		return -ENOTTY;
+
+	while ((value & mask) == bad_value) {
 		if (delay > timeout) {
 			pci_warn(dev, "not ready %dms after %s; giving up\n",
-				 delay - 1, reset_type);
-			return -ENOTTY;
+				 delay - 1, event);
+			return -ETIMEDOUT;
 		}
 
 		if (delay > 1000)
 			pci_info(dev, "not ready %dms after %s; waiting\n",
-				 delay - 1, reset_type);
+				 delay - 1, event);
 
 		msleep(delay);
 		delay *= 2;
-		pci_read_config_dword(dev, PCI_COMMAND, &id);
+
+		if (pci_read_config_dword(dev, where, &value))
+			return -ENOTTY;
 	}
 
 	if (delay > 1000)
-		pci_info(dev, "ready %dms after %s\n", delay - 1,
-			 reset_type);
+		pci_info(dev, "ready %dms after %s\n", delay - 1, event);
+
+	if (waited)
+		*waited = delay - 1;
+
+	if (final_value)
+		*final_value = value;
+
+	return 0;
+}
+
+static int pci_dev_wait_crs(struct pci_dev *dev, char *reset_type,
+			    int timeout, int *waited, u32 *id)
+{
+	return pci_dev_poll_until_ready(dev, PCI_VENDOR_ID, 0xffff, 0x0001,
+					reset_type, timeout, waited, id);
+}
+
+static int pci_dev_wait(struct pci_dev *dev, char *reset_type)
+{
+	int timeout = pci_dev_get_reset_ready_poll_ms(dev);
+	int waited = 0;
+
+	/*
+	 * After reset, the device should not silently discard config
+	 * requests, but it may still indicate that it needs more time by
+	 * responding to them with CRS completions. For such completions:
+	 * - If CRS SV is enabled on the Root Port, and the read request
+	 *   covers both bytes of the Vendor ID register, the Root Port
+	 *   will synthesize the value 0x0001 (and set any extra requested
+	 *   bytes to 0xff)
+	 * - If CRS SV is not enabled on the Root Port, the Root Port must
+	 *   re-issue the Configuration Request as a new Request.
+	 *   Depending on platform-specific Root Complex configurations,
+	 *   the Root Port may stop retrying after a set number of attempts,
+	 *   or a configured timeout is hit, or continue indefinitely
+	 *   (ultimately resulting in non-PCI-specific errors, such as a
+	 *   TOR timeout).
+	 */
+	if (pci_crs_sv_enabled(dev)) {
+		u32 id;
+
+		if (pci_dev_wait_crs(dev, reset_type, timeout, &waited, &id))
+			return -ENOTTY;
+
+		timeout -= waited;
+
+		/*
+		 * If Vendor/Device ID is valid, the device must be ready.
+		 * Note: SR-IOV VFs return ~0 for reads to Vendor/Device
+		 * ID and will not be recognized as ready by this check.
+		 */
+		if (id != 0x0000ffff && id != 0xffff0000 &&
+		    id != 0x00000000 && id != 0xffffffff)
+			return 0;
+	}
+
+	/*
+	 * Root Ports will generally indicate error scenarios (e.g.
+	 * internal timeouts, or received Completion with CA/UR) by
+	 * synthesizing an 'all bits set' value (~0).
+	 * In case CRS is not supported/enabled, as well as for SR-IOV VFs,
+	 * fall back to polling a different register that cannot validly
+	 * contain ~0. As of PCIe 5.0, bits 11-15 of COMMAND are still RsvdP
+	 * and must return 0 when read.
+	 * XXX: These bits might become meaningful in the future
+	 */
+	if (pci_dev_poll_until_ready(dev, PCI_COMMAND, ~0, ~0,
+				     reset_type, timeout, NULL, NULL))
+		return -ENOTTY;
 
 	return 0;
 }
-- 
2.25.1




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879




^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] PCI: Make PCIE_RESET_READY_POLL_MS configurable
  2020-02-23 12:20 ` [PATCH 1/3] PCI: Make PCIE_RESET_READY_POLL_MS configurable Stanislav Spassov
@ 2020-02-24 14:15   ` Bjorn Helgaas
  2020-02-24 17:52     ` Spassov, Stanislav
  0 siblings, 1 reply; 15+ messages in thread
From: Bjorn Helgaas @ 2020-02-24 14:15 UTC (permalink / raw)
  To: Stanislav Spassov
  Cc: linux-pci, Stanislav Spassov, Thomas Gleixner, Andrew Morton,
	Jan H . Schönherr, Wei Wang, Jonathan Corbet, linux-doc,
	Ashok Raj, Alex Williamson, Sinan Kaya, Rajat Jain

[+cc Ashok, Alex, Sinan, Rajat]

On Sun, Feb 23, 2020 at 01:20:55PM +0100, Stanislav Spassov wrote:
> From: Wei Wang <wawei@amazon.de>
> 
> The resonable value for the maximum time to wait for a PCI device to be
> ready after reset varies depending on the platform and the reliability
> of its set of devices.

There are several mechanisms in the spec for reducing these times,
e.g., Readiness Notifications (PCIe r5.0, sec 6.23), the Readiness
Time Reporting capability (sec 7.9.17), and ACPI _DSM methods (PCI
Firmware Spec r3.2, sec 4.6.8, 4.6.9).

I would much rather use standard mechanisms like those instead of
adding kernel parameters.  A user would have to use trial and error
to figure out the value to use with a parameter like this, and that
doesn't feel like a robust approach.

> Signed-off-by: Wei Wang <wawei@amazon.de>
> Signed-off-by: Stanislav Spassov <stanspas@amazon.de>
> ---
>  .../admin-guide/kernel-parameters.txt         |  5 +++++
>  drivers/pci/pci.c                             | 22 ++++++++++++++-----
>  2 files changed, 22 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index dbc22d684627..5e4dade9acc8 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3653,6 +3653,11 @@
>  		nomsi	Do not use MSI for native PCIe PME signaling (this makes
>  			all PCIe root ports use INTx for all services).
>  
> +	pcie_reset_ready_poll_ms= [PCI,PCIE]
> +			Specifies timeout for PCI(e) device readiness polling
> +			after device reset (in milliseconds).
> +			Default: 60000 = 60 seconds
> +
>  	pcmv=		[HW,PCMCIA] BadgePAD 4
>  
>  	pd_ignore_unused
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index d828ca835a98..db9b58ab6c68 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -149,7 +149,19 @@ static int __init pcie_port_pm_setup(char *str)
>  __setup("pcie_port_pm=", pcie_port_pm_setup);
>  
>  /* Time to wait after a reset for device to become responsive */
> -#define PCIE_RESET_READY_POLL_MS 60000
> +#define PCIE_RESET_READY_POLL_MS_DEFAULT 60000
> +
> +int __read_mostly pcie_reset_ready_poll_ms = PCIE_RESET_READY_POLL_MS_DEFAULT;
> +
> +static int __init pcie_reset_ready_poll_ms_setup(char *str)
> +{
> +	int timeout;
> +
> +	if (!kstrtoint(str, 0, &timeout))
> +		pcie_reset_ready_poll_ms = timeout;
> +	return 1;
> +}
> +__setup("pcie_reset_ready_poll_ms=", pcie_reset_ready_poll_ms_setup);
>  
>  /**
>   * pci_bus_max_busnr - returns maximum PCI bus number of given bus' children
> @@ -4506,7 +4518,7 @@ int pcie_flr(struct pci_dev *dev)
>  	 */
>  	msleep(100);
>  
> -	return pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
> +	return pci_dev_wait(dev, "FLR", pcie_reset_ready_poll_ms);
>  }
>  EXPORT_SYMBOL_GPL(pcie_flr);
>  
> @@ -4551,7 +4563,7 @@ static int pci_af_flr(struct pci_dev *dev, int probe)
>  	 */
>  	msleep(100);
>  
> -	return pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS);
> +	return pci_dev_wait(dev, "AF_FLR", pcie_reset_ready_poll_ms);
>  }
>  
>  /**
> @@ -4596,7 +4608,7 @@ static int pci_pm_reset(struct pci_dev *dev, int probe)
>  	pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, csr);
>  	pci_dev_d3_sleep(dev);
>  
> -	return pci_dev_wait(dev, "PM D3hot->D0", PCIE_RESET_READY_POLL_MS);
> +	return pci_dev_wait(dev, "PM D3hot->D0", pcie_reset_ready_poll_ms);
>  }
>  
>  /**
> @@ -4826,7 +4838,7 @@ int pci_bridge_secondary_bus_reset(struct pci_dev *dev)
>  {
>  	pcibios_reset_secondary_bus(dev);
>  
> -	return pci_dev_wait(dev, "bus reset", PCIE_RESET_READY_POLL_MS);
> +	return pci_dev_wait(dev, "bus reset", pcie_reset_ready_poll_ms);
>  }
>  EXPORT_SYMBOL_GPL(pci_bridge_secondary_bus_reset);

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] PCI: Introduce per-device reset_ready_poll override
  2020-02-23 12:20 ` [PATCH 2/3] PCI: Introduce per-device reset_ready_poll override Stanislav Spassov
@ 2020-02-24 14:34   ` Bjorn Helgaas
  2020-02-24 18:05     ` Spassov, Stanislav
  0 siblings, 1 reply; 15+ messages in thread
From: Bjorn Helgaas @ 2020-02-24 14:34 UTC (permalink / raw)
  To: Stanislav Spassov
  Cc: linux-pci, Stanislav Spassov, Thomas Gleixner, Andrew Morton,
	Jan H . Schönherr, Wei Wang, Ashok Raj, Alex Williamson,
	Sinan Kaya, Rajat Jain

[+cc Ashok, Alex, Sinan, Rajat]

On Sun, Feb 23, 2020 at 01:20:56PM +0100, Stanislav Spassov wrote:
> From: Stanislav Spassov <stanspas@amazon.de>
> 
> A broken device may never become responsive after reset, hence the need
> for a timeout. However, waiting for too long can have unintended side
> effects such as triggering hung task timeouts for processes waiting on
> a lock held during the reset. Locks that are shared across multiple
> devices, such as VFIO's per-bus reflck, are especially problematic,
> because a single broken VF can cause hangs for processes working with
> other VFs on the same bus.
> 
> To allow lowering the global default post-reset timeout, while still
> accommodating devices that require more time, this patch introduces
> a per-device override that can be configured via a quirk.
> 
> Signed-off-by: Stanislav Spassov <stanspas@amazon.de>
> ---
>  drivers/pci/pci.c   | 19 ++++++++++++++-----
>  drivers/pci/probe.c |  2 ++
>  include/linux/pci.h |  1 +
>  3 files changed, 17 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index db9b58ab6c68..a554818968ed 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1033,8 +1033,17 @@ void pci_wakeup_bus(struct pci_bus *bus)
>  		pci_walk_bus(bus, pci_wakeup, NULL);
>  }
>  
> -static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
> +static int pci_dev_get_reset_ready_poll_ms(struct pci_dev *dev)
>  {
> +	if (dev->reset_ready_poll_ms >= 0)
> +		return dev->reset_ready_poll_ms;
> +
> +	return pcie_reset_ready_poll_ms;
> +}
> +
> +static int pci_dev_wait(struct pci_dev *dev, char *reset_type)
> +{
> +	int timeout = pci_dev_get_reset_ready_poll_ms(dev);

I like the factoring out of the timeout, since all callers of
pci_dev_wait() supply the same value.  That could be its own separate
preliminary patch, e.g., simply

  -static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
  +static int pci_dev_wait(struct pci_dev *dev, char *reset_type)
   {
  +  int timeout = PCIE_RESET_READY_POLL_MS;
  ...
  -	return pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
  +	return pci_dev_wait(dev, "FLR");

I'm a little wary of "lowering the global default post-reset timeout"
because that's not safe in general.  For example, a hot-added device
that is completely spec compliant regarding post-reset timing may not
work correctly if we've lowered a global timeout.

>  	int delay = 1;
>  	u32 id;
>  
> @@ -4518,7 +4527,7 @@ int pcie_flr(struct pci_dev *dev)
>  	 */
>  	msleep(100);
>  
> -	return pci_dev_wait(dev, "FLR", pcie_reset_ready_poll_ms);
> +	return pci_dev_wait(dev, "FLR");
>  }
>  EXPORT_SYMBOL_GPL(pcie_flr);
>  
> @@ -4563,7 +4572,7 @@ static int pci_af_flr(struct pci_dev *dev, int probe)
>  	 */
>  	msleep(100);
>  
> -	return pci_dev_wait(dev, "AF_FLR", pcie_reset_ready_poll_ms);
> +	return pci_dev_wait(dev, "AF_FLR");
>  }
>  
>  /**
> @@ -4608,7 +4617,7 @@ static int pci_pm_reset(struct pci_dev *dev, int probe)
>  	pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, csr);
>  	pci_dev_d3_sleep(dev);
>  
> -	return pci_dev_wait(dev, "PM D3hot->D0", pcie_reset_ready_poll_ms);
> +	return pci_dev_wait(dev, "PM D3hot->D0");
>  }
>  
>  /**
> @@ -4838,7 +4847,7 @@ int pci_bridge_secondary_bus_reset(struct pci_dev *dev)
>  {
>  	pcibios_reset_secondary_bus(dev);
>  
> -	return pci_dev_wait(dev, "bus reset", pcie_reset_ready_poll_ms);
> +	return pci_dev_wait(dev, "bus reset");
>  }
>  EXPORT_SYMBOL_GPL(pci_bridge_secondary_bus_reset);
>  
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 512cb4312ddd..eeb79a45d504 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2166,6 +2166,8 @@ struct pci_dev *pci_alloc_dev(struct pci_bus *bus)
>  	if (!dev)
>  		return NULL;
>  
> +	dev->reset_ready_poll_ms = -1;
> +
>  	INIT_LIST_HEAD(&dev->bus_list);
>  	dev->dev.type = &pci_dev_type;
>  	dev->bus = pci_bus_get(bus);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 3840a541a9de..049a41b9412b 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -468,6 +468,7 @@ struct pci_dev {
>  	size_t		romlen;		/* Length if not from BAR */
>  	char		*driver_override; /* Driver name to force a match */
>  
> +	int		reset_ready_poll_ms;
>  	unsigned long	priv_flags;	/* Private flags for the PCI driver */
>  };

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] PCI: Make PCIE_RESET_READY_POLL_MS configurable
  2020-02-24 14:15   ` Bjorn Helgaas
@ 2020-02-24 17:52     ` Spassov, Stanislav
  2020-02-27 21:45       ` Bjorn Helgaas
  0 siblings, 1 reply; 15+ messages in thread
From: Spassov, Stanislav @ 2020-02-24 17:52 UTC (permalink / raw)
  To: helgaas
  Cc: corbet, alex.williamson, ashok.raj, okaya, tglx, Wang, Wei, akpm,
	Schoenherr, Jan H.,
	rajatja, linux-pci, linux-doc

On Mon, 2020-02-24 at 08:15 -0600, Bjorn Helgaas wrote:
> [+cc Ashok, Alex, Sinan, Rajat]
> 
> On Sun, Feb 23, 2020 at 01:20:55PM +0100, Stanislav Spassov wrote:
> > From: Wei Wang <wawei@amazon.de>
> > 
> > The resonable value for the maximum time to wait for a PCI device to be
> > ready after reset varies depending on the platform and the reliability
> > of its set of devices.
> 
> There are several mechanisms in the spec for reducing these times,
> e.g., Readiness Notifications (PCIe r5.0, sec 6.23), the Readiness
> Time Reporting capability (sec 7.9.17), and ACPI _DSM methods (PCI
> Firmware Spec r3.2, sec 4.6.8, 4.6.9).
> 
> I would much rather use standard mechanisms like those instead of
> adding kernel parameters.  A user would have to use trial and error
> to figure out the value to use with a parameter like this, and that
> doesn't feel like a robust approach.
> 

I agree that supporting the standard mechanisms is highly desirable,
but some sort of fallback poll timeout value is necessary on platforms
where none of those mechanisms are supported. Arguably, some kernel
configurability (whether via a kernel parameter, as proposed here,
or some other means) is still desirable.

I also agree there is no robust method to determine a "good value", but
then again - how was the current value for the constant determined? As
suggested in PATCH 2, the idea is to lower the global timeout to avoid
hung tasks when devices break and never come back after reset.

Linux already (partially) supports the _DSM methods you mention:
- acpi_pci_add_bus() queries Function 8 (described in 4.6.8) to set
  ignore_reset_delay on the host bridge
- pci_acpi_optimize_delay() uses Function 9 to set d3cold_delay and d3_delay
  in struct pci_dev, but does not currently store the DL_Up Time,
  FLR Reset Time and VF Enable Time
I guess we can extend pci_struct akin to what PATCH 2 does to store all
relevant delay values on a per-device basis, with some default value and
overriding them as appropriate from Readiness Time Reporting, _DSM, or a quirk.

Unfortunately, the hardware and firmware at my disposal does not support
any of the discussed mechanisms, so I do not feel comfortable making the changes
without being able to test them.



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] PCI: Introduce per-device reset_ready_poll override
  2020-02-24 14:34   ` Bjorn Helgaas
@ 2020-02-24 18:05     ` Spassov, Stanislav
  0 siblings, 0 replies; 15+ messages in thread
From: Spassov, Stanislav @ 2020-02-24 18:05 UTC (permalink / raw)
  To: helgaas
  Cc: alex.williamson, ashok.raj, okaya, tglx, Wang, Wei, akpm,
	Schoenherr, Jan H.,
	rajatja, linux-pci

On Mon, 2020-02-24 at 08:34 -0600, Bjorn Helgaas wrote:
> I like the factoring out of the timeout, since all callers of
> pci_dev_wait() supply the same value.  That could be its own separate
> preliminary patch...

Agreed. The next version of this patch series will do the refactor separately.

I am thinking we might also want to replace the "reset_type" with an enum
that indexes into an array to get the string, but -- more importantly --
indexes into an array of per-device overrides for the various reset types.
As per discussion on PATCH 1, I noticed the ACPI _DSM method detailed in
PCI Firmware Spec r3.2, 4.6.9 can provide individual delay values for five
different scenarios (Conventional Reset, DL_Up, FLR, D3hot to D0, VF Enable),
so we should probably store each of them in struct pci_dev.

> I'm a little wary of "lowering the global default post-reset timeout"
> because that's not safe in general.  For example, a hot-added device
> that is completely spec compliant regarding post-reset timing may not
> work correctly if we've lowered a global timeout.
> 

That makes sense. However, the timeout is currently 1 minute.
The only user of this value is pci_dev_wait(), which is itself
only invoked as part of various resets. Are there any scenarios
where that much time is truly needed after a device reset?





Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] PCI: Add CRS handling to pci_dev_wait()
  2020-02-23 12:20 ` [PATCH 3/3] PCI: Add CRS handling to pci_dev_wait() Stanislav Spassov
@ 2020-02-24 20:41   ` Bjorn Helgaas
  0 siblings, 0 replies; 15+ messages in thread
From: Bjorn Helgaas @ 2020-02-24 20:41 UTC (permalink / raw)
  To: Stanislav Spassov
  Cc: linux-pci, Stanislav Spassov, Thomas Gleixner, Andrew Morton,
	Jan H . Schönherr, Wei Wang, Ashok Raj, Alex Williamson,
	Sinan Kaya, Rajat Jain

[+cc Ashok, Alex, Sinan, Rajat]

On Sun, Feb 23, 2020 at 01:20:57PM +0100, Stanislav Spassov wrote:
> From: Stanislav Spassov <stanspas@amazon.de>
> 
> The PCI Express specification dictates minimal amounts of time that the
> host needs to wait after triggering different kinds of resets before it
> is allowed to attempt accessing the device. After this waiting period,
> devices are required to be responsive to Configuration Space reads.
> However, if a device needs more time to actually complete the reset
> operation internally, it may respond to the read with a Completion
> Request Retry Status (CRS), and keep doing so on subsequent reads
> for as long as necessary. If the device is broken, it may even keep
> responding with CRS indefinitely.
> 
> The specification also mandates that any Root Port that supports CRS
> and has CRS Software Visibility (CRS SV) enabled will synthesize the
> special value 0x0001 for the Vendor ID and set any other bits to 1
> upon receiving a CRS Completion for a Configuration Read Request that
> includes both bytes of the Vendor ID (offset 0).
> 
> IF CRS is supported by Root Port but CRS SV is not enabled, the request
> is retried autonomosly by the Root Port. Platform-specific configuration
> registers may exist to limit the number of or time taken by such retries.

s/autonomosly/autonomously/

> If CRS is not supported, or a different register (not Vendor ID) is
> polled, or the device is responding with CA/UR Completions (rather than
> CRS), the behavior is platform-dependent, but generally the Root Port
> synthesizes ~0 to complete the software read.
> 
> Previously, pci_dev_wait() avoided taking advantage of CRS. However,
> on platforms where no limit/timeout can be configured as explained
> above, a device responding with CRS for too long (e.g. because it is
> stuck and cannot complete its reset) may trigger more severe error
> conditions (e.g. TOR timeout, 3-strike CPU CATERR), because the Root
> Port never reports back to the lower-level component requesting the
> transaction.
> 
> This patch introduces special handling when CRS is available, and
> otherwise falls back to the previous behavior of polling COMMAND.
> 
> Signed-off-by: Stanislav Spassov <stanspas@amazon.de>
> ---
>  drivers/pci/pci.c | 128 +++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 105 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index a554818968ed..e8bce8da9402 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1041,44 +1041,126 @@ static int pci_dev_get_reset_ready_poll_ms(struct pci_dev *dev)
>  	return pcie_reset_ready_poll_ms;
>  }
>  
> -static int pci_dev_wait(struct pci_dev *dev, char *reset_type)
> +static bool pci_crs_sv_enabled(struct pci_dev *dev)
> +{
> +	struct pci_dev *root_port = pcie_find_root_port(dev);
> +	u16 root_control;
> +
> +	if (!root_port)
> +		return false;
> +
> +	if (pcie_capability_read_word(root_port, PCI_EXP_RTCTL, &root_control))
> +		return false;
> +
> +	return root_control & PCI_EXP_RTCTL_CRSSVE;

Can we just cache this bit somewhere when we enable CRS SV?

> +}
> +
> +static int pci_dev_poll_until_ready(struct pci_dev *dev,
> +				    int where,
> +				    u32 mask,
> +				    u32 bad_value,
> +				    char *event,
> +				    int timeout, int *waited,
> +				    u32 *final_value)

Follow formatting conventions (fill lines instead of each arg on a
separate line).

>  {
> -	int timeout = pci_dev_get_reset_ready_poll_ms(dev);
>  	int delay = 1;
> -	u32 id;
> +	u32 value;
>  
> -	/*
> -	 * After reset, the device should not silently discard config
> -	 * requests, but it may still indicate that it needs more time by
> -	 * responding to them with CRS completions.  The Root Port will
> -	 * generally synthesize ~0 data to complete the read (except when
> -	 * CRS SV is enabled and the read was for the Vendor ID; in that
> -	 * case it synthesizes 0x0001 data).
> -	 *
> -	 * Wait for the device to return a non-CRS completion.  Read the
> -	 * Command register instead of Vendor ID so we don't have to
> -	 * contend with the CRS SV value.
> -	 */
> -	pci_read_config_dword(dev, PCI_COMMAND, &id);
> -	while (id == ~0) {
> +	if (!event)
> +		event = "<unknown event>";
> +
> +	if (pci_read_config_dword(dev, where, &value))
> +		return -ENOTTY;
> +
> +	while ((value & mask) == bad_value) {
>  		if (delay > timeout) {
>  			pci_warn(dev, "not ready %dms after %s; giving up\n",
> -				 delay - 1, reset_type);
> -			return -ENOTTY;
> +				 delay - 1, event);
> +			return -ETIMEDOUT;
>  		}
>  
>  		if (delay > 1000)
>  			pci_info(dev, "not ready %dms after %s; waiting\n",
> -				 delay - 1, reset_type);
> +				 delay - 1, event);
>  
>  		msleep(delay);
>  		delay *= 2;
> -		pci_read_config_dword(dev, PCI_COMMAND, &id);
> +
> +		if (pci_read_config_dword(dev, where, &value))
> +			return -ENOTTY;
>  	}
>  
>  	if (delay > 1000)
> -		pci_info(dev, "ready %dms after %s\n", delay - 1,
> -			 reset_type);
> +		pci_info(dev, "ready %dms after %s\n", delay - 1, event);
> +
> +	if (waited)
> +		*waited = delay - 1;
> +
> +	if (final_value)
> +		*final_value = value;
> +
> +	return 0;
> +}
> +
> +static int pci_dev_wait_crs(struct pci_dev *dev, char *reset_type,
> +			    int timeout, int *waited, u32 *id)
> +{
> +	return pci_dev_poll_until_ready(dev, PCI_VENDOR_ID, 0xffff, 0x0001,
> +					reset_type, timeout, waited, id);
> +}
> +
> +static int pci_dev_wait(struct pci_dev *dev, char *reset_type)
> +{
> +	int timeout = pci_dev_get_reset_ready_poll_ms(dev);
> +	int waited = 0;
> +
> +	/*
> +	 * After reset, the device should not silently discard config
> +	 * requests, but it may still indicate that it needs more time by
> +	 * responding to them with CRS completions. For such completions:
> +	 * - If CRS SV is enabled on the Root Port, and the read request
> +	 *   covers both bytes of the Vendor ID register, the Root Port
> +	 *   will synthesize the value 0x0001 (and set any extra requested
> +	 *   bytes to 0xff)
> +	 * - If CRS SV is not enabled on the Root Port, the Root Port must
> +	 *   re-issue the Configuration Request as a new Request.
> +	 *   Depending on platform-specific Root Complex configurations,
> +	 *   the Root Port may stop retrying after a set number of attempts,
> +	 *   or a configured timeout is hit, or continue indefinitely
> +	 *   (ultimately resulting in non-PCI-specific errors, such as a
> +	 *   TOR timeout).
> +	 */
> +	if (pci_crs_sv_enabled(dev)) {
> +		u32 id;
> +
> +		if (pci_dev_wait_crs(dev, reset_type, timeout, &waited, &id))
> +			return -ENOTTY;
> +
> +		timeout -= waited;
> +
> +		/*
> +		 * If Vendor/Device ID is valid, the device must be ready.
> +		 * Note: SR-IOV VFs return ~0 for reads to Vendor/Device
> +		 * ID and will not be recognized as ready by this check.
> +		 */
> +		if (id != 0x0000ffff && id != 0xffff0000 &&
> +		    id != 0x00000000 && id != 0xffffffff)
> +			return 0;
> +	}
> +
> +	/*
> +	 * Root Ports will generally indicate error scenarios (e.g.
> +	 * internal timeouts, or received Completion with CA/UR) by
> +	 * synthesizing an 'all bits set' value (~0).
> +	 * In case CRS is not supported/enabled, as well as for SR-IOV VFs,
> +	 * fall back to polling a different register that cannot validly
> +	 * contain ~0. As of PCIe 5.0, bits 11-15 of COMMAND are still RsvdP
> +	 * and must return 0 when read.
> +	 * XXX: These bits might become meaningful in the future
> +	 */
> +	if (pci_dev_poll_until_ready(dev, PCI_COMMAND, ~0, ~0,
> +				     reset_type, timeout, NULL, NULL))

I think the use of PCI_COMMAND in the existing pci_dev_wait() is a
little questionable.  I *think* it's there because we assumed that VFs
did not have to respond with CRS, but I don't think that assumption
was correct -- VFs *should* respond with CRS if they're not ready.

This use of PCI_COMMAND was added by 5adecf817dd6 ("PCI: Wait for up
to 1000ms after FLR reset") [1], and I'd like to resolve this question
before further complicating this path.  5adecf817dd6 refers to an
Intel integrated graphics device.  If that device doesn't handle CRS
correctly, maybe we need some sort of device-specific quirk for it?

[1] https://git.kernel.org/linus/5adecf817dd6

> +		return -ENOTTY;
>  
>  	return 0;
>  }

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] PCI: Make PCIE_RESET_READY_POLL_MS configurable
  2020-02-24 17:52     ` Spassov, Stanislav
@ 2020-02-27 21:45       ` Bjorn Helgaas
  2020-02-27 23:44         ` Sinan Kaya
  0 siblings, 1 reply; 15+ messages in thread
From: Bjorn Helgaas @ 2020-02-27 21:45 UTC (permalink / raw)
  To: Spassov, Stanislav
  Cc: corbet, alex.williamson, ashok.raj, okaya, tglx, Wang, Wei, akpm,
	Schoenherr, Jan H.,
	rajatja, linux-pci, linux-doc

On Mon, Feb 24, 2020 at 05:52:14PM +0000, Spassov, Stanislav wrote:
> On Mon, 2020-02-24 at 08:15 -0600, Bjorn Helgaas wrote:
> > On Sun, Feb 23, 2020 at 01:20:55PM +0100, Stanislav Spassov wrote:
> > > From: Wei Wang <wawei@amazon.de>
> > > 
> > > The reasonable value for the maximum time to wait for a PCI
> > > device to be ready after reset varies depending on the platform
> > > and the reliability of its set of devices.
> > 
> > There are several mechanisms in the spec for reducing these times,
> > e.g., Readiness Notifications (PCIe r5.0, sec 6.23), the Readiness
> > Time Reporting capability (sec 7.9.17), and ACPI _DSM methods (PCI
> > Firmware Spec r3.2, sec 4.6.8, 4.6.9).
> > 
> > I would much rather use standard mechanisms like those instead of
> > adding kernel parameters.  A user would have to use trial and
> > error to figure out the value to use with a parameter like this,
> > and that doesn't feel like a robust approach.
> 
> I agree that supporting the standard mechanisms is highly desirable,
> but some sort of fallback poll timeout value is necessary on
> platforms where none of those mechanisms are supported. Arguably,
> some kernel configurability (whether via a kernel parameter, as
> proposed here, or some other means) is still desirable.

IIUC we basically have two issues: 1) the default 60 second timeout is
too long, and 2) you'd like to reduce the delays further à la the
Device Readiness _DSM even for firmware that doesn't support that.

The 60 second timeout came from 821cdad5c46c ("PCI: Wait up to 60
seconds for device to become ready after FLR") and is probably too
long.  We probably should pick a smaller value based on numbers from
the spec and make quirks for devices that needed more time.

As far as reducing delays for specific devices, if we can identify
them via Vendor/Device ID, can we make per-device values that could be
set either by the _DSM or by a quirk?

I'm trying to wriggle out of adding yet more PCI kernel parameters
because people tend to stumble across them and pass them around on
bulletin boards as ways to "fix" or "speed up" things that really
should be addressed in better ways.

> I also agree there is no robust method to determine a "good value", but
> then again - how was the current value for the constant determined? As
> suggested in PATCH 2, the idea is to lower the global timeout to avoid
> hung tasks when devices break and never come back after reset.

I don't remember exactly how we came up with 60 seconds; I suspect it
was just a convenient large number.

Bjorn

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] PCI: Make PCIE_RESET_READY_POLL_MS configurable
  2020-02-27 21:45       ` Bjorn Helgaas
@ 2020-02-27 23:44         ` Sinan Kaya
  2020-02-28  2:18           ` Raj, Ashok
  0 siblings, 1 reply; 15+ messages in thread
From: Sinan Kaya @ 2020-02-27 23:44 UTC (permalink / raw)
  To: Bjorn Helgaas, Spassov, Stanislav
  Cc: corbet, alex.williamson, ashok.raj, tglx, Wang, Wei, akpm,
	Schoenherr, Jan H.,
	rajatja, linux-pci, linux-doc

On 2/27/2020 4:45 PM, Bjorn Helgaas wrote:
> The 60 second timeout came from 821cdad5c46c ("PCI: Wait up to 60
> seconds for device to become ready after FLR") and is probably too
> long.  We probably should pick a smaller value based on numbers from
> the spec and make quirks for devices that needed more time.

If I remember right, there was no time mention about how long to
wait. Spec says device should send CRS as long as it is not ready.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] PCI: Make PCIE_RESET_READY_POLL_MS configurable
  2020-02-27 23:44         ` Sinan Kaya
@ 2020-02-28  2:18           ` Raj, Ashok
  2020-03-02 16:39             ` Sinan Kaya
  0 siblings, 1 reply; 15+ messages in thread
From: Raj, Ashok @ 2020-02-28  2:18 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Bjorn Helgaas, Spassov, Stanislav, corbet, alex.williamson, tglx,
	Wang, Wei, akpm, Schoenherr, Jan H.,
	rajatja, linux-pci, linux-doc, Ashok Raj

On Thu, Feb 27, 2020 at 06:44:56PM -0500, Sinan Kaya wrote:
> On 2/27/2020 4:45 PM, Bjorn Helgaas wrote:
> > The 60 second timeout came from 821cdad5c46c ("PCI: Wait up to 60
> > seconds for device to become ready after FLR") and is probably too
> > long.  We probably should pick a smaller value based on numbers from
> > the spec and make quirks for devices that needed more time.
> 
> If I remember right, there was no time mention about how long to
> wait. Spec says device should send CRS as long as it is not ready.

Not exactly.. there are some requirements to follow for rules after
a conventional reset. 

Look for "The second set of rules addresses requirements placed on the system"

i'm looking a the 5.0 spec (around page 553) :-). 

In general 1s seems good enough for most cases. For ports that support
> 5gt/s transfer speed, it says 100ms after link training completes.
I'm not sure if this means 100ms after getting a DLLSC event?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] PCI: Make PCIE_RESET_READY_POLL_MS configurable
  2020-02-28  2:18           ` Raj, Ashok
@ 2020-03-02 16:39             ` Sinan Kaya
  2020-03-02 17:37               ` Raj, Ashok
  0 siblings, 1 reply; 15+ messages in thread
From: Sinan Kaya @ 2020-03-02 16:39 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Bjorn Helgaas, Spassov, Stanislav, corbet, alex.williamson, tglx,
	Wang, Wei, akpm, Schoenherr, Jan H.,
	rajatja, linux-pci, linux-doc

On 2/27/2020 9:18 PM, Raj, Ashok wrote:
>> If I remember right, there was no time mention about how long to
>> wait. Spec says device should send CRS as long as it is not ready.
> Not exactly.. there are some requirements to follow for rules after
> a conventional reset. 

Yes, but CRS happens after functional reset, D3hot etc. too not just
conventional reset.

1 second is too aggressive. We already have proof that several PCIe
cards take their time during FLR especially FPGA cards in the orders
of 10 seconds.

Current code is waiting up to 60 seconds. If card shows up before that
we happily return.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] PCI: Make PCIE_RESET_READY_POLL_MS configurable
  2020-03-02 16:39             ` Sinan Kaya
@ 2020-03-02 17:37               ` Raj, Ashok
  2020-03-02 18:30                 ` Sinan Kaya
  0 siblings, 1 reply; 15+ messages in thread
From: Raj, Ashok @ 2020-03-02 17:37 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Bjorn Helgaas, Spassov, Stanislav, corbet, alex.williamson, tglx,
	Wang, Wei, akpm, Schoenherr, Jan H.,
	rajatja, linux-pci, linux-doc, Ashok Raj

On Mon, Mar 02, 2020 at 11:39:39AM -0500, Sinan Kaya wrote:
> On 2/27/2020 9:18 PM, Raj, Ashok wrote:
> >> If I remember right, there was no time mention about how long to
> >> wait. Spec says device should send CRS as long as it is not ready.
> > Not exactly.. there are some requirements to follow for rules after
> > a conventional reset. 
> 
> Yes, but CRS happens after functional reset, D3hot etc. too not just
> conventional reset.
> 
> 1 second is too aggressive. We already have proof that several PCIe
> cards take their time during FLR especially FPGA cards in the orders
> of 10 seconds.

Aren't the rules specified in 7.9.17 Rediness Time Reporting Extended Capability
sufficient to cover this?

If a device doesn't have them we could let the driver supply this value
as a device specific value to override the default.

> 
> Current code is waiting up to 60 seconds. If card shows up before that
> we happily return.
> 

But in 7.9.17.2 Readiness Time Reporting 1 Register, says devices
can defer reporting by not setting the valid bit, but if it remains
clear after 60s system software can assume no valid values will be reported.

Maybe keep the system default to something more reasonable (after some
testing in the community), and do this insane 60s for devices that 
support the optional reporting capability?

Cheers,
Ashok

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] PCI: Make PCIE_RESET_READY_POLL_MS configurable
  2020-03-02 17:37               ` Raj, Ashok
@ 2020-03-02 18:30                 ` Sinan Kaya
  0 siblings, 0 replies; 15+ messages in thread
From: Sinan Kaya @ 2020-03-02 18:30 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Bjorn Helgaas, Spassov, Stanislav, corbet, alex.williamson, tglx,
	Wang, Wei, akpm, Schoenherr, Jan H.,
	rajatja, linux-pci, linux-doc

On 3/2/2020 12:37 PM, Raj, Ashok wrote:
>> 1 second is too aggressive. We already have proof that several PCIe
>> cards take their time during FLR especially FPGA cards in the orders
>> of 10 seconds.
> Aren't the rules specified in 7.9.17 Rediness Time Reporting Extended Capability
> sufficient to cover this?
> 

Yes, readiness would work too but it is not a mandatory feature.
Readiness is yet another alternative to CRS advertised as the new
flashy way of doing things. Problem is backwards compatibility.

Do you want to break all existing/legacy drivers for an optional/new
spec defined feature?

> If a device doesn't have them we could let the driver supply this value
> as a device specific value to override the default.
> 
>> Current code is waiting up to 60 seconds. If card shows up before that
>> we happily return.
>>
> But in 7.9.17.2 Readiness Time Reporting 1 Register, says devices
> can defer reporting by not setting the valid bit, but if it remains
> clear after 60s system software can assume no valid values will be reported.
> 
> Maybe keep the system default to something more reasonable (after some
> testing in the community), and do this insane 60s for devices that 
> support the optional reporting capability?

I think we should add readiness capability on top of CRS but not
exclusively.

If readiness is supported, do the spec defined behavior.
If not supported, fallback to CRS for legacy devices.

I remember Bjorn mentioning about readiness capability. Maybe, this is
the time to implement it.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2020-03-02 18:30 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-23 12:20 [PATCH 0/3] Improve PCI device post-reset readiness polling Stanislav Spassov
2020-02-23 12:20 ` [PATCH 1/3] PCI: Make PCIE_RESET_READY_POLL_MS configurable Stanislav Spassov
2020-02-24 14:15   ` Bjorn Helgaas
2020-02-24 17:52     ` Spassov, Stanislav
2020-02-27 21:45       ` Bjorn Helgaas
2020-02-27 23:44         ` Sinan Kaya
2020-02-28  2:18           ` Raj, Ashok
2020-03-02 16:39             ` Sinan Kaya
2020-03-02 17:37               ` Raj, Ashok
2020-03-02 18:30                 ` Sinan Kaya
2020-02-23 12:20 ` [PATCH 2/3] PCI: Introduce per-device reset_ready_poll override Stanislav Spassov
2020-02-24 14:34   ` Bjorn Helgaas
2020-02-24 18:05     ` Spassov, Stanislav
2020-02-23 12:20 ` [PATCH 3/3] PCI: Add CRS handling to pci_dev_wait() Stanislav Spassov
2020-02-24 20:41   ` Bjorn Helgaas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).