linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 0/3] Improve PCI device post-reset readiness polling
@ 2020-03-07 17:20 Stanislav Spassov
  2020-03-07 17:20 ` [PATCH v4 1/3] PCI: Refactor polling loop out of pci_dev_wait Stanislav Spassov
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Stanislav Spassov @ 2020-03-07 17:20 UTC (permalink / raw)
  To: linux-pci
  Cc: Stanislav Spassov, Bjorn Helgaas, Thomas Gleixner, Andrew Morton,
	Jan H . Schönherr, Jonathan Corbet, Ashok Raj,
	Alex Williamson, Sinan Kaya, Rajat Jain

From: Stanislav Spassov <stanspas@amazon.de>

The first version of this patch series can be found here:
https://lore.kernel.org/linux-pci/20200223122057.6504-1-stanspas@amazon.com

The goal of this patch series is to solve an issue where pci_dev_wait
can cause system crashes. After a reset, a hung device may keep
responding with CRS completions indefinitely. If CRS Software Visibility
is enabled on the Root Port, attempting to read any register other than
PCI_VENDOR_ID will cause the Root Port to autonomously retry the request
without reporting back to the CPU core. Unless the number of retries or
the amount of time spent retrying is limited by platform-specific means,
this scenario leads to low-level platform timeouts (such as a TOR
Timeout), which can easily escalate to a crash.

Feedback on the v1 inspired a lot of additional improvements all around the
device reset codepaths and reducing post-reset delays. These improvements
were published as part of v2 (v3 is just small build fixes).

It looks like there is immediate demand specifically for the CRS work,
so I am once again reducing the series to just that. The reset will be
posted as a separate patch series that will likely require more time and
iterations to stabilize.

Changes since v3:
- In pci_dev_wait(), added "timeout -= waited" to account the time spent
  polling PCI_VENDOR_ID before falling back to polling PCI_COMMAND if
  device readiness could not be positively established via CRS (i.e.,
  if we stopped receiving CRS completions but did not receive a valid
  vendor ID due to dealing with an SR-IOV VF, or due to a different error)
- Simplified the commit message of "PCI: Add CRS handling to pci_dev_wait()"
  to avoid confusion as to when Root Ports will autonomously retry requests
  that resulted in CRS completions.

Stanislav Spassov (3):
  PCI: Refactor polling loop out of pci_dev_wait
  PCI: Cache CRS Software Visibiliy in struct pci_dev
  PCI: Add CRS handling to pci_dev_wait()

 drivers/pci/pci.c   | 109 +++++++++++++++++++++++++++++++++++---------
 drivers/pci/probe.c |   8 +++-
 include/linux/pci.h |   3 ++
 3 files changed, 98 insertions(+), 22 deletions(-)


base-commit: bb6d3fb354c5ee8d6bde2d576eb7220ea09862b9
-- 
2.25.1




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879




^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v4 1/3] PCI: Refactor polling loop out of pci_dev_wait
  2020-03-07 17:20 [PATCH v4 0/3] Improve PCI device post-reset readiness polling Stanislav Spassov
@ 2020-03-07 17:20 ` Stanislav Spassov
  2020-03-07 17:20 ` [PATCH v4 2/3] PCI: Cache CRS Software Visibiliy in struct pci_dev Stanislav Spassov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 19+ messages in thread
From: Stanislav Spassov @ 2020-03-07 17:20 UTC (permalink / raw)
  To: linux-pci
  Cc: Stanislav Spassov, Bjorn Helgaas, Thomas Gleixner, Andrew Morton,
	Jan H . Schönherr, Jonathan Corbet, Ashok Raj,
	Alex Williamson, Sinan Kaya, Rajat Jain

From: Stanislav Spassov <stanspas@amazon.de>

This patch does not (intentionally) introduce any observable difference
in runtime behavior.

Signed-off-by: Stanislav Spassov <stanspas@amazon.de>
---
 drivers/pci/pci.c | 70 +++++++++++++++++++++++++++++++++--------------
 1 file changed, 49 insertions(+), 21 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index d828ca835a98..44f5d4907db6 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1021,47 +1021,75 @@ void pci_wakeup_bus(struct pci_bus *bus)
 		pci_walk_bus(bus, pci_wakeup, NULL);
 }
 
-static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
+/*
+ * Performs DWORD Configuration Reads at a specific offset until the value read
+ * (with mask applied) is not equal to bad_value.
+ */
+static inline int pci_dev_poll_until_not_equal(struct pci_dev *dev, int where,
+					       u32 mask, u32 bad_value,
+					       const char *event_name,
+					       int timeout, int *waited,
+					       u32 *final_value)
 {
 	int delay = 1;
-	u32 id;
+	u32 value;
 
-	/*
-	 * After reset, the device should not silently discard config
-	 * requests, but it may still indicate that it needs more time by
-	 * responding to them with CRS completions.  The Root Port will
-	 * generally synthesize ~0 data to complete the read (except when
-	 * CRS SV is enabled and the read was for the Vendor ID; in that
-	 * case it synthesizes 0x0001 data).
-	 *
-	 * Wait for the device to return a non-CRS completion.  Read the
-	 * Command register instead of Vendor ID so we don't have to
-	 * contend with the CRS SV value.
-	 */
-	pci_read_config_dword(dev, PCI_COMMAND, &id);
-	while (id == ~0) {
+	if (!event_name)
+		event_name = "<unknown event>";
+
+	if (waited)
+		delay = *waited + 1;
+
+	pci_read_config_dword(dev, where, &value);
+
+	while ((value & mask) == bad_value) {
 		if (delay > timeout) {
 			pci_warn(dev, "not ready %dms after %s; giving up\n",
-				 delay - 1, reset_type);
+				 delay - 1, event_name);
 			return -ENOTTY;
 		}
 
 		if (delay > 1000)
 			pci_info(dev, "not ready %dms after %s; waiting\n",
-				 delay - 1, reset_type);
+				 delay - 1, event_name);
 
 		msleep(delay);
 		delay *= 2;
-		pci_read_config_dword(dev, PCI_COMMAND, &id);
+
+		pci_read_config_dword(dev, where, &value);
 	}
 
 	if (delay > 1000)
-		pci_info(dev, "ready %dms after %s\n", delay - 1,
-			 reset_type);
+		pci_info(dev, "ready %dms after %s\n", delay - 1, event_name);
+
+	if (waited)
+		*waited = delay - 1;
+
+	if (final_value)
+		*final_value = value;
 
 	return 0;
 }
 
+static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
+{
+	/*
+	 * After reset, the device should not silently discard config
+	 * requests, but it may still indicate that it needs more time by
+	 * responding to them with CRS completions.  The Root Port will
+	 * generally synthesize ~0 data to complete the read (except when
+	 * CRS SV is enabled and the read was for the Vendor ID; in that
+	 * case it synthesizes 0x0001 data).
+	 *
+	 * Wait for the device to return a non-CRS completion.  Read the
+	 * Command register instead of Vendor ID so we don't have to
+	 * contend with the CRS SV value.
+	 */
+	return pci_dev_poll_until_not_equal(dev, PCI_COMMAND, ~0, ~0,
+					    reset_type, timeout, NULL,
+					    NULL);
+}
+
 /**
  * pci_power_up - Put the given device into D0
  * @dev: PCI device to power up
-- 
2.25.1




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879




^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v4 2/3] PCI: Cache CRS Software Visibiliy in struct pci_dev
  2020-03-07 17:20 [PATCH v4 0/3] Improve PCI device post-reset readiness polling Stanislav Spassov
  2020-03-07 17:20 ` [PATCH v4 1/3] PCI: Refactor polling loop out of pci_dev_wait Stanislav Spassov
@ 2020-03-07 17:20 ` Stanislav Spassov
  2021-09-12 13:32   ` Bjorn Helgaas
  2020-03-07 17:20 ` [PATCH v4 3/3] PCI: Add CRS handling to pci_dev_wait() Stanislav Spassov
  2021-01-22  8:54 ` [PATCH v4 0/3] Improve PCI device post-reset readiness polling David Woodhouse
  3 siblings, 1 reply; 19+ messages in thread
From: Stanislav Spassov @ 2020-03-07 17:20 UTC (permalink / raw)
  To: linux-pci
  Cc: Stanislav Spassov, Bjorn Helgaas, Thomas Gleixner, Andrew Morton,
	Jan H . Schönherr, Jonathan Corbet, Ashok Raj,
	Alex Williamson, Sinan Kaya, Rajat Jain

From: Stanislav Spassov <stanspas@amazon.de>

Arguably, since CRS SV is a capability of Root Ports and determines
how Root Ports will handle incoming CRS Completions, it makes more
sense to store this flag in the struct pci_bus that represents the
port's secondary bus, and to cache it in any buses further down the
hierarchy.

However, storing the flag in struct pci_dev allows individual devices
to be marked as not supporting CRS properly even when CRS SV is enabled
on their parent Root Port. This way, future code that relies on the new
flag will not be misled that it is safe to probe a device by relying on
CRS SV to not cause platform timeouts (See the note on CRS SV on p. 553
of PCI Express Base Specification r5.0 from May 22, 2019).

Note: Endpoints integrated into the Root Complex (RCiEP) may also be
capable of using CRS. In that case, the software visibility is
controlled using a Root Complex Register Block (RCRB). This is currently
not supported by the kernel. The code introduced here would simply not
set the newly introduced flag for RCiEP as for those bus->self is NULL.

Signed-off-by: Stanislav Spassov <stanspas@amazon.de>
---
 drivers/pci/probe.c | 8 +++++++-
 include/linux/pci.h | 3 +++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 512cb4312ddd..eeff8a07f330 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1080,9 +1080,11 @@ static void pci_enable_crs(struct pci_dev *pdev)
 
 	/* Enable CRS Software Visibility if supported */
 	pcie_capability_read_word(pdev, PCI_EXP_RTCAP, &root_cap);
-	if (root_cap & PCI_EXP_RTCAP_CRSVIS)
+	if (root_cap & PCI_EXP_RTCAP_CRSVIS) {
 		pcie_capability_set_word(pdev, PCI_EXP_RTCTL,
 					 PCI_EXP_RTCTL_CRSSVE);
+		pdev->crssv_enabled = true;
+	}
 }
 
 static unsigned int pci_scan_child_bus_extend(struct pci_bus *bus,
@@ -2414,6 +2416,10 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
 	list_add_tail(&dev->bus_list, &bus->devices);
 	up_write(&pci_bus_sem);
 
+	/* Propagate CRS Software Visibility bit from the parent bridge */
+	if (bus->self)
+		dev->crssv_enabled = bus->self->crssv_enabled;
+
 	ret = pcibios_add_device(dev);
 	WARN_ON(ret < 0);
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 3840a541a9de..1c222c7c2572 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -354,6 +354,9 @@ struct pci_dev {
 						      user sysfs */
 	unsigned int	clear_retrain_link:1;	/* Need to clear Retrain Link
 						   bit manually */
+	unsigned int	crssv_enabled:1;	/* Configuration Request Retry
+						   Status Software Visibility
+						   enabled on (parent) bridge */
 	unsigned int	d3_delay;	/* D3->D0 transition time in ms */
 	unsigned int	d3cold_delay;	/* D3cold->D0 transition time in ms */
 
-- 
2.25.1




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879




^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v4 3/3] PCI: Add CRS handling to pci_dev_wait()
  2020-03-07 17:20 [PATCH v4 0/3] Improve PCI device post-reset readiness polling Stanislav Spassov
  2020-03-07 17:20 ` [PATCH v4 1/3] PCI: Refactor polling loop out of pci_dev_wait Stanislav Spassov
  2020-03-07 17:20 ` [PATCH v4 2/3] PCI: Cache CRS Software Visibiliy in struct pci_dev Stanislav Spassov
@ 2020-03-07 17:20 ` Stanislav Spassov
  2020-03-09 15:55   ` Sinan Kaya
                     ` (2 more replies)
  2021-01-22  8:54 ` [PATCH v4 0/3] Improve PCI device post-reset readiness polling David Woodhouse
  3 siblings, 3 replies; 19+ messages in thread
From: Stanislav Spassov @ 2020-03-07 17:20 UTC (permalink / raw)
  To: linux-pci
  Cc: Stanislav Spassov, Bjorn Helgaas, Thomas Gleixner, Andrew Morton,
	Jan H . Schönherr, Jonathan Corbet, Ashok Raj,
	Alex Williamson, Sinan Kaya, Rajat Jain

From: Stanislav Spassov <stanspas@amazon.de>

The PCI Express specification dictates minimal amounts of time that the
host needs to wait after triggering different kinds of resets before it
is allowed to attempt accessing the device. After this waiting period,
devices are required to be responsive to Configuration Space reads.
However, if a device needs more time to actually complete the reset
operation internally, it may respond to the read with a Completion
Request Retry Status (CRS), and keep doing so on subsequent reads
for as long as necessary. If the device is broken, it may even keep
responding with CRS indefinitely.

The specification also mandates that any Root Port that supports CRS
and has CRS Software Visibility (CRS SV) enabled will synthesize the
special value 0x0001 for the Vendor ID and set any other bits to 1
upon receiving a CRS Completion for a Configuration Read Request that
includes both bytes of the Vendor ID (offset 0).

If CRS SV is disabled or a different register (not Vendor ID) is being
read, the request is retried autonomously by the Root Port.
Platform-specific configuration registers may exist to limit the number
of or time taken by such retries.

If CRS is not supported, or a device is responding with CA/UR
Completions (rather than CRS), the behavior is platform-dependent, but
generally the Root Port synthesizes ~0 to complete the software read.

Previously, pci_dev_wait() avoided taking advantage of CRS. However,
on platforms where no retry limit/timeout can be configured, a device
responding with CRS for too long (e.g. because it is stuck and cannot
complete its reset) may trigger more severe error conditions (e.g. TOR
timeout, 3-strike CPU CATERR), because the Root Port never reports back
to the lower-level component requesting the transaction.

This patch introduces special handling when CRS is available, and
otherwise falls back to the previous behavior of polling COMMAND.

Signed-off-by: Stanislav Spassov <stanspas@amazon.de>
---
 drivers/pci/pci.c | 55 ++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 47 insertions(+), 8 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 44f5d4907db6..a028147f4471 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1073,17 +1073,56 @@ static inline int pci_dev_poll_until_not_equal(struct pci_dev *dev, int where,
 
 static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
 {
+	int waited = 0;
+	int rc = 0;
+
+
 	/*
 	 * After reset, the device should not silently discard config
 	 * requests, but it may still indicate that it needs more time by
-	 * responding to them with CRS completions.  The Root Port will
-	 * generally synthesize ~0 data to complete the read (except when
-	 * CRS SV is enabled and the read was for the Vendor ID; in that
-	 * case it synthesizes 0x0001 data).
-	 *
-	 * Wait for the device to return a non-CRS completion.  Read the
-	 * Command register instead of Vendor ID so we don't have to
-	 * contend with the CRS SV value.
+	 * responding to them with CRS completions. For such completions:
+	 * - If CRS SV is enabled on the Root Port, and the read request
+	 *   covers both bytes of the Vendor ID register, the Root Port
+	 *   will synthesize the value 0x0001 (and set any extra requested
+	 *   bytes to 0xff)
+	 * - If CRS SV is not enabled on the Root Port, the Root Port must
+	 *   re-issue the Configuration Request as a new Request.
+	 *   Depending on platform-specific Root Complex configurations,
+	 *   the Root Port may stop retrying after a set number of attempts,
+	 *   or a configured timeout is hit, or continue indefinitely
+	 *   (ultimately resulting in non-PCI-specific platform errors, such as
+	 *   a TOR timeout).
+	 */
+	if (dev->crssv_enabled) {
+		u32 id;
+
+		rc = pci_dev_poll_until_not_equal(dev, PCI_VENDOR_ID, 0xffff,
+						  0x0001, reset_type, timeout,
+						  &waited, &id);
+		if (rc)
+			return rc;
+
+		timeout -= waited;
+
+		/*
+		 * If Vendor/Device ID is valid, the device must be ready.
+		 * Note: SR-IOV VFs return ~0 for reads to Vendor/Device
+		 * ID and will not be recognized as ready by this check.
+		 */
+		if (id != 0x0000ffff && id != 0xffff0000 &&
+		    id != 0x00000000 && id != 0xffffffff)
+			return 0;
+	}
+
+	/*
+	 * Root Ports will generally indicate error scenarios (e.g.
+	 * internal timeouts, or received Completion with CA/UR) by
+	 * synthesizing an 'all bits set' value (~0).
+	 * In case CRS is not supported/enabled, as well as for SR-IOV VFs,
+	 * fall back to polling a different register that cannot validly
+	 * contain ~0. As of PCIe 5.0, bits 11-15 of COMMAND are still RsvdP
+	 * and must return 0 when read.
+	 * XXX: These bits might become meaningful in the future
 	 */
 	return pci_dev_poll_until_not_equal(dev, PCI_COMMAND, ~0, ~0,
 					    reset_type, timeout, NULL,
-- 
2.25.1




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879




^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 3/3] PCI: Add CRS handling to pci_dev_wait()
  2020-03-07 17:20 ` [PATCH v4 3/3] PCI: Add CRS handling to pci_dev_wait() Stanislav Spassov
@ 2020-03-09 15:55   ` Sinan Kaya
  2020-03-09 16:19     ` Raj, Ashok
  2021-09-11 14:03   ` Bjorn Helgaas
  2021-09-13 16:07   ` Bjorn Helgaas
  2 siblings, 1 reply; 19+ messages in thread
From: Sinan Kaya @ 2020-03-09 15:55 UTC (permalink / raw)
  To: Stanislav Spassov, linux-pci
  Cc: Stanislav Spassov, Bjorn Helgaas, Thomas Gleixner, Andrew Morton,
	Jan H . Schönherr, Jonathan Corbet, Ashok Raj,
	Alex Williamson, Rajat Jain

On 3/7/2020 12:20 PM, Stanislav Spassov wrote:
> +		rc = pci_dev_poll_until_not_equal(dev, PCI_VENDOR_ID, 0xffff,
> +						  0x0001, reset_type, timeout,
> +						  &waited, &id);
> +		if (rc)
> +			return rc;
> +

If I remember right, this doesn't work for VF sending CRS because VF
always returns 0xffff for VENDOR_ID register.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 3/3] PCI: Add CRS handling to pci_dev_wait()
  2020-03-09 15:55   ` Sinan Kaya
@ 2020-03-09 16:19     ` Raj, Ashok
  2020-03-09 16:38       ` Spassov, Stanislav
  0 siblings, 1 reply; 19+ messages in thread
From: Raj, Ashok @ 2020-03-09 16:19 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Stanislav Spassov, linux-pci, Stanislav Spassov, Bjorn Helgaas,
	Thomas Gleixner, Andrew Morton, Jan H . Schönherr,
	Jonathan Corbet, Alex Williamson, Rajat Jain, Ashok Raj

On Mon, Mar 09, 2020 at 11:55:11AM -0400, Sinan Kaya wrote:
> On 3/7/2020 12:20 PM, Stanislav Spassov wrote:
> > +		rc = pci_dev_poll_until_not_equal(dev, PCI_VENDOR_ID, 0xffff,
> > +						  0x0001, reset_type, timeout,
> > +						  &waited, &id);
> > +		if (rc)
> > +			return rc;
> > +
> 
> If I remember right, this doesn't work for VF sending CRS because VF
> always returns 0xffff for VENDOR_ID register.

Is this required by the PCIe spec? i think the only requirement is 
the 1s wait after PF has done the VF enable. See Implementation Note
right above section 2.3.1.1 in the Base spec 5.0. 

If this behavior is different for maybe a specific SRIOV device we should
probably quirk the standard behavior?

The rules are mentioned in so many places, but looking through the 
SRIOV section's doesn't seem to specify special rules for VF's other than
the wait time after VF enable.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 3/3] PCI: Add CRS handling to pci_dev_wait()
  2020-03-09 16:19     ` Raj, Ashok
@ 2020-03-09 16:38       ` Spassov, Stanislav
  2020-03-09 17:33         ` Sinan Kaya
  0 siblings, 1 reply; 19+ messages in thread
From: Spassov, Stanislav @ 2020-03-09 16:38 UTC (permalink / raw)
  To: ashok.raj, okaya
  Cc: corbet, alex.williamson, tglx, akpm, Schoenherr, Jan H.,
	rajatja, linux-pci, bhelgaas

On Mon, 2020-03-09 at 09:19 -0700, Raj, Ashok wrote:
> On Mon, Mar 09, 2020 at 11:55:11AM -0400, Sinan Kaya wrote:
> > On 3/7/2020 12:20 PM, Stanislav Spassov wrote:
> > > +           rc = pci_dev_poll_until_not_equal(dev, PCI_VENDOR_ID, 0xffff,
> > > +                                             0x0001, reset_type, timeout,
> > > +                                             &waited, &id);
> > > +           if (rc)
> > > +                   return rc;
> > > +
> > 
> > If I remember right, this doesn't work for VF sending CRS because VF
> > always returns 0xffff for VENDOR_ID register.
> 
> Is this required by the PCIe spec? i think the only requirement is
> the 1s wait after PF has done the VF enable. See Implementation Note
> right above section 2.3.1.1 in the Base spec 5.0.
> 
> If this behavior is different for maybe a specific SRIOV device we should
> probably quirk the standard behavior?
> 
> The rules are mentioned in so many places, but looking through the
> SRIOV section's doesn't seem to specify special rules for VF's other than
> the wait time after VF enable.

PCI Express Base Specification Revision 5.0 Version 1.0 (May 22, 2019)
on pages 1139 and 1140 within section 9.3.4 PF/VF Configuration Space Header
describes:

"9.3.4.1.1 Vendor ID Register Changes (Offset 00h)
...
This field in all VFs returns FFFFh when read. VI software should return the Vendor ID value from the associated PF as the
Vendor ID value for the VF."

(and similar for the Device ID)

Right after that is an Implemention Note "Legacy PCI Probing Software" that explains:

"Returning FFFFh for Device ID and Vendor ID values allows some legacy software to ignore VFs."

So, whenever a VF is providing an actual response to a vid/did read, it will respond with a Successful Completion and the data
will be 0xFFFF. However, when the VF is not ready yet after a reset, I would expect it to be returning CRS completions just like
any other device (nothing in the spec explicitly confirms or denies this, as far as I can tell). Then, the root port has no idea
if a device that it received a CRS completion from is a PF or VF, so it has to treat them equivalently, and therefore (for CRS SV enabled)
synthesize 0x0001 for the VID.



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 3/3] PCI: Add CRS handling to pci_dev_wait()
  2020-03-09 16:38       ` Spassov, Stanislav
@ 2020-03-09 17:33         ` Sinan Kaya
  0 siblings, 0 replies; 19+ messages in thread
From: Sinan Kaya @ 2020-03-09 17:33 UTC (permalink / raw)
  To: Spassov, Stanislav, ashok.raj
  Cc: corbet, alex.williamson, tglx, akpm, Schoenherr, Jan H.,
	rajatja, linux-pci, bhelgaas

On 3/9/2020 12:38 PM, Spassov, Stanislav wrote:
> So, whenever a VF is providing an actual response to a vid/did read, it will respond with a Successful Completion and the data
> will be 0xFFFF. However, when the VF is not ready yet after a reset, I would expect it to be returning CRS completions just like
> any other device (nothing in the spec explicitly confirms or denies this, as far as I can tell). Then, the root port has no idea
> if a device that it received a CRS completion from is a PF or VF, so it has to treat them equivalently, and therefore (for CRS SV enabled)
> synthesize 0x0001 for the VID.

Looking closer, I see you brought bad_value to the function parameter.
Yes, this should work as long as device responds with 0x0001. Previous
code used to bail out on ~0x0 immediately.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 0/3] Improve PCI device post-reset readiness polling
  2020-03-07 17:20 [PATCH v4 0/3] Improve PCI device post-reset readiness polling Stanislav Spassov
                   ` (2 preceding siblings ...)
  2020-03-07 17:20 ` [PATCH v4 3/3] PCI: Add CRS handling to pci_dev_wait() Stanislav Spassov
@ 2021-01-22  8:54 ` David Woodhouse
  2021-09-10  9:32   ` David Woodhouse
  3 siblings, 1 reply; 19+ messages in thread
From: David Woodhouse @ 2021-01-22  8:54 UTC (permalink / raw)
  To: Stanislav Spassov, linux-pci
  Cc: Stanislav Spassov, Bjorn Helgaas, Thomas Gleixner, Andrew Morton,
	Jan H . Schönherr, Jonathan Corbet, Ashok Raj,
	Alex Williamson, Sinan Kaya, Rajat Jain

[-- Attachment #1: Type: text/plain, Size: 1419 bytes --]

On Sat, 2020-03-07 at 18:20 +0100, Stanislav Spassov wrote:
> From: Stanislav Spassov <stanspas@amazon.de>
> 
> The first version of this patch series can be found here:
> https://lore.kernel.org/linux-pci/20200223122057.6504-1-stanspas@amazon.com
> 
> The goal of this patch series is to solve an issue where pci_dev_wait
> can cause system crashes. After a reset, a hung device may keep
> responding with CRS completions indefinitely. If CRS Software Visibility
> is enabled on the Root Port, attempting to read any register other than
> PCI_VENDOR_ID will cause the Root Port to autonomously retry the request
> without reporting back to the CPU core. Unless the number of retries or
> the amount of time spent retrying is limited by platform-specific means,
> this scenario leads to low-level platform timeouts (such as a TOR
> Timeout), which can easily escalate to a crash.
> 
> Feedback on the v1 inspired a lot of additional improvements all around the
> device reset codepaths and reducing post-reset delays. These improvements
> were published as part of v2 (v3 is just small build fixes).
> 
> It looks like there is immediate demand specifically for the CRS work,
> so I am once again reducing the series to just that. The reset will be
> posted as a separate patch series that will likely require more time and
> iterations to stabilize.

Hm, what happened to this?

Bjorn?



[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 0/3] Improve PCI device post-reset readiness polling
  2021-01-22  8:54 ` [PATCH v4 0/3] Improve PCI device post-reset readiness polling David Woodhouse
@ 2021-09-10  9:32   ` David Woodhouse
  0 siblings, 0 replies; 19+ messages in thread
From: David Woodhouse @ 2021-09-10  9:32 UTC (permalink / raw)
  To: Stanislav Spassov, linux-pci
  Cc: Stanislav Spassov, Bjorn Helgaas, Thomas Gleixner, Andrew Morton,
	Jan H . Schönherr, Jonathan Corbet, Ashok Raj,
	Alex Williamson, Sinan Kaya, Rajat Jain

[-- Attachment #1: Type: text/plain, Size: 1557 bytes --]

On Fri, 2021-01-22 at 08:54 +0000, David Woodhouse wrote:
> On Sat, 2020-03-07 at 18:20 +0100, Stanislav Spassov wrote:
> > From: Stanislav Spassov <
> > stanspas@amazon.de
> > >
> > 
> > The first version of this patch series can be found here:
> > https://lore.kernel.org/linux-pci/20200223122057.6504-1-stanspas@amazon.com
> > 
> > 
> > The goal of this patch series is to solve an issue where pci_dev_wait
> > can cause system crashes. After a reset, a hung device may keep
> > responding with CRS completions indefinitely. If CRS Software Visibility
> > is enabled on the Root Port, attempting to read any register other than
> > PCI_VENDOR_ID will cause the Root Port to autonomously retry the request
> > without reporting back to the CPU core. Unless the number of retries or
> > the amount of time spent retrying is limited by platform-specific means,
> > this scenario leads to low-level platform timeouts (such as a TOR
> > Timeout), which can easily escalate to a crash.
> > 
> > Feedback on the v1 inspired a lot of additional improvements all around the
> > device reset codepaths and reducing post-reset delays. These improvements
> > were published as part of v2 (v3 is just small build fixes).
> > 
> > It looks like there is immediate demand specifically for the CRS work,
> > so I am once again reducing the series to just that. The reset will be
> > posted as a separate patch series that will likely require more time and
> > iterations to stabilize.
> 
> Hm, what happened to this?
> 
> Bjorn?

Ping?

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 3/3] PCI: Add CRS handling to pci_dev_wait()
  2020-03-07 17:20 ` [PATCH v4 3/3] PCI: Add CRS handling to pci_dev_wait() Stanislav Spassov
  2020-03-09 15:55   ` Sinan Kaya
@ 2021-09-11 14:03   ` Bjorn Helgaas
  2021-09-13 16:29     ` Spassov, Stanislav
  2021-09-13 16:07   ` Bjorn Helgaas
  2 siblings, 1 reply; 19+ messages in thread
From: Bjorn Helgaas @ 2021-09-11 14:03 UTC (permalink / raw)
  To: Stanislav Spassov
  Cc: linux-pci, Stanislav Spassov, Bjorn Helgaas, Thomas Gleixner,
	Andrew Morton, Jan H . Schönherr, Jonathan Corbet,
	Ashok Raj, Alex Williamson, Sinan Kaya, Rajat Jain

I apologize for completely dropping the ball on this one.  I don't
remember why, but I *do* remember one issue that we should clear up:

On Sat, Mar 07, 2020 at 06:20:44PM +0100, Stanislav Spassov wrote:
> From: Stanislav Spassov <stanspas@amazon.de>
> 
> The PCI Express specification dictates minimal amounts of time that the
> host needs to wait after triggering different kinds of resets before it
> is allowed to attempt accessing the device. After this waiting period,
> devices are required to be responsive to Configuration Space reads.
> However, if a device needs more time to actually complete the reset
> operation internally, it may respond to the read with a Completion
> Request Retry Status (CRS), and keep doing so on subsequent reads
> for as long as necessary. If the device is broken, it may even keep
> responding with CRS indefinitely.
> 
> The specification also mandates that any Root Port that supports CRS
> and has CRS Software Visibility (CRS SV) enabled will synthesize the
> special value 0x0001 for the Vendor ID and set any other bits to 1
> upon receiving a CRS Completion for a Configuration Read Request that
> includes both bytes of the Vendor ID (offset 0).
> 
> If CRS SV is disabled or a different register (not Vendor ID) is being
> read, the request is retried autonomously by the Root Port.
> Platform-specific configuration registers may exist to limit the number
> of or time taken by such retries.

I think the Root Complex may eventually complete the transaction as
failed *regardless* of whether CRS SV is enabled.  This is unclear in
PCIe r5.0, sec 2.3.2, because the text formatting was broken between
r4.0 and r5.0.  The r4.0 text is formatted like this:

  Root Complex handling of a Completion with Configuration Request
  Retry Status for a Configuration Request is implementation specific,
  except for the period following system reset (see Section 6.6). For
  Root Complexes that support CRS Software Visibility, the following
  rules apply:

    * If CRS Software Visibility is not enabled, the Root Complex must
      re-issue the Configuration Request as a new Request.

    * If CRS Software Visibility is enabled (see below):

      - For a Configuration Read Request that includes both bytes of
        the Vendor ID field of a device Function’s Configuration Space
        Header, the Root Complex must complete the Request to the host
        by returning a read-data value of 0001h for the Vendor ID
        field and all ‘1’s for any additional bytes included in the
        request. This read-data value has been reserved specifically
        for this use by the PCI-SIG and does not correspond to any
        assigned Vendor ID.

      - For a Configuration Write Request or for any other
        Configuration Read Request, the Root Complex must re-issue the
        Configuration Request as a new Request.

  A Root Complex implementation may choose to limit the number of
  Configuration Request/CRS Completion Status loops before determining
  that something is wrong with the target of the Request and taking
  appropriate action, e.g., complete the Request to the host as a
  failed transaction.

I reported this to the PCI-SIG, and I think the formatting did get
fixed for the upcoming PCIe r6 spec, so the meaning will be the same
as r4.0

Probably doesn't affect this patch, but we can clarify some of the
commentary.

Bjorn

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 2/3] PCI: Cache CRS Software Visibiliy in struct pci_dev
  2020-03-07 17:20 ` [PATCH v4 2/3] PCI: Cache CRS Software Visibiliy in struct pci_dev Stanislav Spassov
@ 2021-09-12 13:32   ` Bjorn Helgaas
  2021-09-13 16:06     ` Spassov, Stanislav
  0 siblings, 1 reply; 19+ messages in thread
From: Bjorn Helgaas @ 2021-09-12 13:32 UTC (permalink / raw)
  To: Stanislav Spassov
  Cc: linux-pci, Stanislav Spassov, Bjorn Helgaas, Thomas Gleixner,
	Andrew Morton, Jan H . Schönherr, Jonathan Corbet,
	Ashok Raj, Alex Williamson, Sinan Kaya, Rajat Jain

On Sat, Mar 07, 2020 at 06:20:43PM +0100, Stanislav Spassov wrote:
> From: Stanislav Spassov <stanspas@amazon.de>
> 
> Arguably, since CRS SV is a capability of Root Ports and determines
> how Root Ports will handle incoming CRS Completions, it makes more
> sense to store this flag in the struct pci_bus that represents the
> port's secondary bus, and to cache it in any buses further down the
> hierarchy.
> 
> However, storing the flag in struct pci_dev allows individual devices
> to be marked as not supporting CRS properly even when CRS SV is enabled
> on their parent Root Port. This way, future code that relies on the new
> flag will not be misled that it is safe to probe a device by relying on
> CRS SV to not cause platform timeouts (See the note on CRS SV on p. 553
> of PCI Express Base Specification r5.0 from May 22, 2019).

If we find devices that don't support CRS properly, I think we should
quirk them directly with something other than "crssv_enabled".

> Note: Endpoints integrated into the Root Complex (RCiEP) may also be
> capable of using CRS. In that case, the software visibility is
> controlled using a Root Complex Register Block (RCRB). This is currently
> not supported by the kernel. The code introduced here would simply not
> set the newly introduced flag for RCiEP as for those bus->self is NULL.
> 
> Signed-off-by: Stanislav Spassov <stanspas@amazon.de>
> ---
>  drivers/pci/probe.c | 8 +++++++-
>  include/linux/pci.h | 3 +++
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 512cb4312ddd..eeff8a07f330 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1080,9 +1080,11 @@ static void pci_enable_crs(struct pci_dev *pdev)
>  
>  	/* Enable CRS Software Visibility if supported */
>  	pcie_capability_read_word(pdev, PCI_EXP_RTCAP, &root_cap);
> -	if (root_cap & PCI_EXP_RTCAP_CRSVIS)
> +	if (root_cap & PCI_EXP_RTCAP_CRSVIS) {
>  		pcie_capability_set_word(pdev, PCI_EXP_RTCTL,
>  					 PCI_EXP_RTCTL_CRSSVE);
> +		pdev->crssv_enabled = true;

I'd use "crssv_enabled = 1" instead of mixing the "unsigned int" and
"bool" ideas.

> +	}
>  }
>  
>  static unsigned int pci_scan_child_bus_extend(struct pci_bus *bus,
> @@ -2414,6 +2416,10 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
>  	list_add_tail(&dev->bus_list, &bus->devices);
>  	up_write(&pci_bus_sem);
>  
> +	/* Propagate CRS Software Visibility bit from the parent bridge */
> +	if (bus->self)
> +		dev->crssv_enabled = bus->self->crssv_enabled;

I think we should use pcie_find_root_port() instead of propagating it.
SV is a property of the Root Port, and devices below it have no idea
whether it's enabled at the Root Port.

>  	ret = pcibios_add_device(dev);
>  	WARN_ON(ret < 0);
>  
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 3840a541a9de..1c222c7c2572 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -354,6 +354,9 @@ struct pci_dev {
>  						      user sysfs */
>  	unsigned int	clear_retrain_link:1;	/* Need to clear Retrain Link
>  						   bit manually */
> +	unsigned int	crssv_enabled:1;	/* Configuration Request Retry
> +						   Status Software Visibility
> +						   enabled on (parent) bridge */
>  	unsigned int	d3_delay;	/* D3->D0 transition time in ms */
>  	unsigned int	d3cold_delay;	/* D3cold->D0 transition time in ms */
>  
> -- 
> 2.25.1
> 
> 
> 
> 
> Amazon Development Center Germany GmbH
> Krausenstr. 38
> 10117 Berlin
> Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
> Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
> Sitz: Berlin
> Ust-ID: DE 289 237 879
> 
> 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 2/3] PCI: Cache CRS Software Visibiliy in struct pci_dev
  2021-09-12 13:32   ` Bjorn Helgaas
@ 2021-09-13 16:06     ` Spassov, Stanislav
  0 siblings, 0 replies; 19+ messages in thread
From: Spassov, Stanislav @ 2021-09-13 16:06 UTC (permalink / raw)
  To: helgaas
  Cc: corbet, alex.williamson, ashok.raj, okaya, tglx, akpm,
	Schönherr, Jan H.,
	rajatja, linux-pci, bhelgaas

On Sun, 2021-09-12 at 08:32 -0500, Bjorn Helgaas wrote:
> On Sat, Mar 07, 2020 at 06:20:43PM +0100, Stanislav Spassov wrote:
> > However, storing the flag in struct pci_dev allows individual devices
> > to be marked as not supporting CRS properly even when CRS SV is enabled
> > on their parent Root Port. This way, future code that relies on the new
> > flag will not be misled that it is safe to probe a device by relying on
> > CRS SV to not cause platform timeouts (See the note on CRS SV on p. 553
> > of PCI Express Base Specification r5.0 from May 22, 2019).
> 
> If we find devices that don't support CRS properly, I think we should
> quirk them directly with something other than "crssv_enabled".

I am definitely open to suggestions here. Based on precedent such as the
broken_intx_masking field in struct pci_dev and how it is set from
pci/quirks.c, we could have a new field "broken_crs".

In hindsight, the code from PATCH 3/3 which is conditionally executed based
on this flag, should be okay to execute unconditionally: polling the Vendor
ID is safer than polling Command when CRS SV is enabled but it is still not
any more dangerous when CRS SV is disabled. This consideration allows us
to drop the current PATCH 2/3 altogether. I will implement this approach in
the next reversion of the series (it is old and needs rebasing anyway).



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 3/3] PCI: Add CRS handling to pci_dev_wait()
  2020-03-07 17:20 ` [PATCH v4 3/3] PCI: Add CRS handling to pci_dev_wait() Stanislav Spassov
  2020-03-09 15:55   ` Sinan Kaya
  2021-09-11 14:03   ` Bjorn Helgaas
@ 2021-09-13 16:07   ` Bjorn Helgaas
  2021-09-13 16:39     ` Spassov, Stanislav
  2 siblings, 1 reply; 19+ messages in thread
From: Bjorn Helgaas @ 2021-09-13 16:07 UTC (permalink / raw)
  To: Stanislav Spassov
  Cc: linux-pci, Stanislav Spassov, Bjorn Helgaas, Thomas Gleixner,
	Andrew Morton, Jan H . Schönherr, Jonathan Corbet,
	Ashok Raj, Alex Williamson, Sinan Kaya, Rajat Jain

On Sat, Mar 07, 2020 at 06:20:44PM +0100, Stanislav Spassov wrote:
> From: Stanislav Spassov <stanspas@amazon.de>
> 
> The PCI Express specification dictates minimal amounts of time that the
> host needs to wait after triggering different kinds of resets before it
> is allowed to attempt accessing the device. After this waiting period,
> devices are required to be responsive to Configuration Space reads.
> However, if a device needs more time to actually complete the reset
> operation internally, it may respond to the read with a Completion
> Request Retry Status (CRS), and keep doing so on subsequent reads
> for as long as necessary. If the device is broken, it may even keep
> responding with CRS indefinitely.
> 
> The specification also mandates that any Root Port that supports CRS
> and has CRS Software Visibility (CRS SV) enabled will synthesize the
> special value 0x0001 for the Vendor ID and set any other bits to 1
> upon receiving a CRS Completion for a Configuration Read Request that
> includes both bytes of the Vendor ID (offset 0).
> 
> If CRS SV is disabled or a different register (not Vendor ID) is being
> read, the request is retried autonomously by the Root Port.
> Platform-specific configuration registers may exist to limit the number
> of or time taken by such retries.
> 
> If CRS is not supported, or a device is responding with CA/UR
> Completions (rather than CRS), the behavior is platform-dependent, but
> generally the Root Port synthesizes ~0 to complete the software read.
> 
> Previously, pci_dev_wait() avoided taking advantage of CRS. However,
> on platforms where no retry limit/timeout can be configured, a device
> responding with CRS for too long (e.g. because it is stuck and cannot
> complete its reset) may trigger more severe error conditions (e.g. TOR
> timeout, 3-strike CPU CATERR), because the Root Port never reports back
> to the lower-level component requesting the transaction.
> 
> This patch introduces special handling when CRS is available, and
> otherwise falls back to the previous behavior of polling COMMAND.

There's a lot to figure out here.

1) It sounds like this *might* be a workaround for a device defect?
   Should we infer that there's a device where:

    - We reset the device
    - We read PCI_COMMAND until it is not ~0, for up to 60 seconds
    - The device returns CRS status for each read, until ...
    - The platform hardware times out before 60 seconds and fails the
      transaction, causing a system crash?

   But reading PCI_VENDOR_ID instead of PCI_COMMAND somehow avoids the
   platform timeout?

2) I think this should somehow be integrated with pci_bus_wait_crs(),
   which also loops looking for CRS status.

3) pci_bus_wait_crs() is used in the enumeration path, and we do a
   32-bit read there, which reads both the Vendor ID and the Device
   ID.  Maybe that's some sort of micro-optimization, but apparently
   there are devices that don't implement CRS SV correctly (see
   89665a6a7140 ("PCI: Check only the Vendor ID to identify
   Configuration Request Retry"), and doing a 16-bit read would avoid
   that issue.

   For pci_dev_wait(), I don't think there's any point in doing a
   32-bit read, so maybe we should just do a 16-bit read.

> Signed-off-by: Stanislav Spassov <stanspas@amazon.de>
> ---
>  drivers/pci/pci.c | 55 ++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 47 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 44f5d4907db6..a028147f4471 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1073,17 +1073,56 @@ static inline int pci_dev_poll_until_not_equal(struct pci_dev *dev, int where,
>  
>  static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
>  {
> +	int waited = 0;
> +	int rc = 0;
> +
> +
>  	/*
>  	 * After reset, the device should not silently discard config
>  	 * requests, but it may still indicate that it needs more time by
> -	 * responding to them with CRS completions.  The Root Port will
> -	 * generally synthesize ~0 data to complete the read (except when
> -	 * CRS SV is enabled and the read was for the Vendor ID; in that
> -	 * case it synthesizes 0x0001 data).
> -	 *
> -	 * Wait for the device to return a non-CRS completion.  Read the
> -	 * Command register instead of Vendor ID so we don't have to
> -	 * contend with the CRS SV value.
> +	 * responding to them with CRS completions. For such completions:
> +	 * - If CRS SV is enabled on the Root Port, and the read request
> +	 *   covers both bytes of the Vendor ID register, the Root Port
> +	 *   will synthesize the value 0x0001 (and set any extra requested
> +	 *   bytes to 0xff)
> +	 * - If CRS SV is not enabled on the Root Port, the Root Port must
> +	 *   re-issue the Configuration Request as a new Request.
> +	 *   Depending on platform-specific Root Complex configurations,
> +	 *   the Root Port may stop retrying after a set number of attempts,
> +	 *   or a configured timeout is hit, or continue indefinitely
> +	 *   (ultimately resulting in non-PCI-specific platform errors, such as
> +	 *   a TOR timeout).
> +	 */
> +	if (dev->crssv_enabled) {
> +		u32 id;
> +
> +		rc = pci_dev_poll_until_not_equal(dev, PCI_VENDOR_ID, 0xffff,
> +						  0x0001, reset_type, timeout,
> +						  &waited, &id);
> +		if (rc)
> +			return rc;
> +
> +		timeout -= waited;
> +
> +		/*
> +		 * If Vendor/Device ID is valid, the device must be ready.
> +		 * Note: SR-IOV VFs return ~0 for reads to Vendor/Device
> +		 * ID and will not be recognized as ready by this check.
> +		 */
> +		if (id != 0x0000ffff && id != 0xffff0000 &&
> +		    id != 0x00000000 && id != 0xffffffff)
> +			return 0;
> +	}
> +
> +	/*
> +	 * Root Ports will generally indicate error scenarios (e.g.
> +	 * internal timeouts, or received Completion with CA/UR) by
> +	 * synthesizing an 'all bits set' value (~0).
> +	 * In case CRS is not supported/enabled, as well as for SR-IOV VFs,
> +	 * fall back to polling a different register that cannot validly
> +	 * contain ~0. As of PCIe 5.0, bits 11-15 of COMMAND are still RsvdP
> +	 * and must return 0 when read.
> +	 * XXX: These bits might become meaningful in the future
>  	 */
>  	return pci_dev_poll_until_not_equal(dev, PCI_COMMAND, ~0, ~0,
>  					    reset_type, timeout, NULL,
> -- 
> 2.25.1
> 
> 
> 
> 
> Amazon Development Center Germany GmbH
> Krausenstr. 38
> 10117 Berlin
> Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
> Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
> Sitz: Berlin
> Ust-ID: DE 289 237 879
> 
> 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 3/3] PCI: Add CRS handling to pci_dev_wait()
  2021-09-11 14:03   ` Bjorn Helgaas
@ 2021-09-13 16:29     ` Spassov, Stanislav
  2021-09-13 16:38       ` Bjorn Helgaas
  0 siblings, 1 reply; 19+ messages in thread
From: Spassov, Stanislav @ 2021-09-13 16:29 UTC (permalink / raw)
  To: helgaas
  Cc: corbet, alex.williamson, ashok.raj, okaya, tglx, akpm,
	Schönherr, Jan H.,
	rajatja, linux-pci, bhelgaas

On Sat, 2021-09-11 at 09:03 -0500, Bjorn Helgaas wrote:

> I think the Root Complex may eventually complete the transaction as
> failed *regardless* of whether CRS SV is enabled.  This is unclear in
> PCIe r5.0, sec 2.3.2, because the text formatting was broken between
> r4.0 and r5.0.  [...]
>
>   A Root Complex implementation may choose to limit the number of
>   Configuration Request/CRS Completion Status loops before determining
>   that something is wrong with the target of the Request and taking
>   appropriate action, e.g., complete the Request to the host as a
>   failed transaction.

I can provide a bit more background:

The issue that prompted me to implement this patch involved a device that
used CRS Completions to signal post-reset (non-)readiness. In some cases,
the device would get stuck and continue issuing CRS Completions for all
requests indefinitely.

The device was attached directly to a Root Port on a server-grade Intel CPU,
and CRS SV was enabled on that Root Port. The original pci_dev_wait()
implementation, by virtue of polling the Command register rather than the
Vendor ID, would always cause a TOR timeout and associated host crash.

I later understood the specific CPU did have a proprietary register for
"limiting the number of loops" that the PCIe spec talks about, and indeed
that register was set to "no limit". Coupled with the stuck device, these
indefinite retries eventually triggered TOR timeout.

Granted, there are surely Root Complexes that behave differently, since the
PCIe spec leaves this up to the implementation. Still, this patch increases
robustness by polling the safer Vendor ID register, which is safer at least
in some situations, and not any less safe generally. However, it is not a
simple matter of switching which register is polled due to the SR-IOV
considerations that require a fallback to Command.



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 3/3] PCI: Add CRS handling to pci_dev_wait()
  2021-09-13 16:29     ` Spassov, Stanislav
@ 2021-09-13 16:38       ` Bjorn Helgaas
  2021-09-13 18:04         ` Spassov, Stanislav
  0 siblings, 1 reply; 19+ messages in thread
From: Bjorn Helgaas @ 2021-09-13 16:38 UTC (permalink / raw)
  To: Spassov, Stanislav
  Cc: corbet, alex.williamson, ashok.raj, okaya, tglx, akpm,
	Schönherr, Jan H.,
	rajatja, linux-pci, bhelgaas

On Mon, Sep 13, 2021 at 04:29:51PM +0000, Spassov, Stanislav wrote:
> On Sat, 2021-09-11 at 09:03 -0500, Bjorn Helgaas wrote:
> 
> > I think the Root Complex may eventually complete the transaction as
> > failed *regardless* of whether CRS SV is enabled.  This is unclear in
> > PCIe r5.0, sec 2.3.2, because the text formatting was broken between
> > r4.0 and r5.0.  [...]
> >
> >   A Root Complex implementation may choose to limit the number of
> >   Configuration Request/CRS Completion Status loops before determining
> >   that something is wrong with the target of the Request and taking
> >   appropriate action, e.g., complete the Request to the host as a
> >   failed transaction.
> 
> I can provide a bit more background:
> 
> The issue that prompted me to implement this patch involved a device that
> used CRS Completions to signal post-reset (non-)readiness. In some cases,
> the device would get stuck and continue issuing CRS Completions for all
> requests indefinitely.
> 
> The device was attached directly to a Root Port on a server-grade Intel CPU,
> and CRS SV was enabled on that Root Port. The original pci_dev_wait()
> implementation, by virtue of polling the Command register rather than the
> Vendor ID, would always cause a TOR timeout and associated host crash.
> 
> I later understood the specific CPU did have a proprietary register for
> "limiting the number of loops" that the PCIe spec talks about, and indeed
> that register was set to "no limit". Coupled with the stuck device, these
> indefinite retries eventually triggered TOR timeout.

"No limit" sounds like a pretty bad choice, given that it means the
CPU will essentially hang forever because of a defective I/O device.
There should be a timeout so software can recover (the *device* may
never recover, but that's no reason why the kernel must crash).

> Granted, there are surely Root Complexes that behave differently, since the
> PCIe spec leaves this up to the implementation. Still, this patch increases
> robustness by polling the safer Vendor ID register, which is safer at least
> in some situations, and not any less safe generally. However, it is not a
> simple matter of switching which register is polled due to the SR-IOV
> considerations that require a fallback to Command.

Yes.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 3/3] PCI: Add CRS handling to pci_dev_wait()
  2021-09-13 16:07   ` Bjorn Helgaas
@ 2021-09-13 16:39     ` Spassov, Stanislav
  0 siblings, 0 replies; 19+ messages in thread
From: Spassov, Stanislav @ 2021-09-13 16:39 UTC (permalink / raw)
  To: helgaas
  Cc: corbet, alex.williamson, ashok.raj, okaya, tglx, akpm,
	Schönherr, Jan H.,
	rajatja, linux-pci, bhelgaas

On Mon, 2021-09-13 at 11:07 -0500, Bjorn Helgaas wrote:
> 
> 1) It sounds like this *might* be a workaround for a device defect?
>    Should we infer that there's a device where:
> 
>     - We reset the device
>     - We read PCI_COMMAND until it is not ~0, for up to 60 seconds
>     - The device returns CRS status for each read, until ...
>     - The platform hardware times out before 60 seconds and fails the
>       transaction, causing a system crash?

Yes. As detailed in my other reply (which raced with this mail), I
implemented this patch because I encountered a device (and a platform)
that behaved exactly as described.

>    But reading PCI_VENDOR_ID instead of PCI_COMMAND somehow avoids the
>    platform timeout?

Correct. More specifically, that "somehow" is the CRS SV mechanism
standardized by the PCIe specification. This mechanism relies
specifically on the target offset being PCI_VENDOR_ID.

> 2) I think this should somehow be integrated with pci_bus_wait_crs(),
>    which also loops looking for CRS status.

Good point. I will see if that can be incorporated in next version, or
explain why not if that turns out to be the case.

> 3) pci_bus_wait_crs() is used in the enumeration path, and we do a
>    32-bit read there, which reads both the Vendor ID and the Device
>    ID.  Maybe that's some sort of micro-optimization, but apparently
>    there are devices that don't implement CRS SV correctly (see
>    89665a6a7140 ("PCI: Check only the Vendor ID to identify
>    Configuration Request Retry"), and doing a 16-bit read would avoid
>    that issue.
> 
>    For pci_dev_wait(), I don't think there's any point in doing a
>    32-bit read, so maybe we should just do a 16-bit read.

I agree, and will change it to a 16-bit read in next version.

> > Signed-off-by: Stanislav Spassov <stanspas@amazon.de>
> > ---
> >  drivers/pci/pci.c | 55 ++++++++++++++++++++++++++++++++++++++++-------
> >  1 file changed, 47 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index 44f5d4907db6..a028147f4471 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -1073,17 +1073,56 @@ static inline int pci_dev_poll_until_not_equal(struct pci_dev *dev, int where,
> > 
> >  static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
> >  {
> > +     int waited = 0;
> > +     int rc = 0;
> > +
> > +
> >       /*
> >        * After reset, the device should not silently discard config
> >        * requests, but it may still indicate that it needs more time by
> > -      * responding to them with CRS completions.  The Root Port will
> > -      * generally synthesize ~0 data to complete the read (except when
> > -      * CRS SV is enabled and the read was for the Vendor ID; in that
> > -      * case it synthesizes 0x0001 data).
> > -      *
> > -      * Wait for the device to return a non-CRS completion.  Read the
> > -      * Command register instead of Vendor ID so we don't have to
> > -      * contend with the CRS SV value.
> > +      * responding to them with CRS completions. For such completions:
> > +      * - If CRS SV is enabled on the Root Port, and the read request
> > +      *   covers both bytes of the Vendor ID register, the Root Port
> > +      *   will synthesize the value 0x0001 (and set any extra requested
> > +      *   bytes to 0xff)
> > +      * - If CRS SV is not enabled on the Root Port, the Root Port must
> > +      *   re-issue the Configuration Request as a new Request.
> > +      *   Depending on platform-specific Root Complex configurations,
> > +      *   the Root Port may stop retrying after a set number of attempts,
> > +      *   or a configured timeout is hit, or continue indefinitely
> > +      *   (ultimately resulting in non-PCI-specific platform errors, such as
> > +      *   a TOR timeout).
> > +      */
> > +     if (dev->crssv_enabled) {
> > +             u32 id;
> > +
> > +             rc = pci_dev_poll_until_not_equal(dev, PCI_VENDOR_ID, 0xffff,
> > +                                               0x0001, reset_type, timeout,
> > +                                               &waited, &id);
> > +             if (rc)
> > +                     return rc;
> > +
> > +             timeout -= waited;
> > +
> > +             /*
> > +              * If Vendor/Device ID is valid, the device must be ready.
> > +              * Note: SR-IOV VFs return ~0 for reads to Vendor/Device
> > +              * ID and will not be recognized as ready by this check.
> > +              */
> > +             if (id != 0x0000ffff && id != 0xffff0000 &&
> > +                 id != 0x00000000 && id != 0xffffffff)
> > +                     return 0;
> > +     }
> > +
> > +     /*
> > +      * Root Ports will generally indicate error scenarios (e.g.
> > +      * internal timeouts, or received Completion with CA/UR) by
> > +      * synthesizing an 'all bits set' value (~0).
> > +      * In case CRS is not supported/enabled, as well as for SR-IOV VFs,
> > +      * fall back to polling a different register that cannot validly
> > +      * contain ~0. As of PCIe 5.0, bits 11-15 of COMMAND are still RsvdP
> > +      * and must return 0 when read.
> > +      * XXX: These bits might become meaningful in the future
> >        */
> >       return pci_dev_poll_until_not_equal(dev, PCI_COMMAND, ~0, ~0,
> >                                           reset_type, timeout, NULL,
> > --
> > 2.25.1
> > 
> > 
> > 
> > 
> > Amazon Development Center Germany GmbH
> > Krausenstr. 38
> > 10117 Berlin
> > Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
> > Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
> > Sitz: Berlin
> > Ust-ID: DE 289 237 879
> > 
> > 
> > 



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 3/3] PCI: Add CRS handling to pci_dev_wait()
  2021-09-13 16:38       ` Bjorn Helgaas
@ 2021-09-13 18:04         ` Spassov, Stanislav
  2021-09-14 17:53           ` Rajat Jain
  0 siblings, 1 reply; 19+ messages in thread
From: Spassov, Stanislav @ 2021-09-13 18:04 UTC (permalink / raw)
  To: helgaas
  Cc: corbet, alex.williamson, ashok.raj, okaya, tglx, akpm,
	Schönherr, Jan H.,
	rajatja, linux-pci, bhelgaas

On Mon, 2021-09-13 at 11:38 -0500, Bjorn Helgaas wrote:
> On Mon, Sep 13, 2021 at 04:29:51PM +0000, Spassov, Stanislav wrote:
> > On Sat, 2021-09-11 at 09:03 -0500, Bjorn Helgaas wrote:
> > 
> > I later understood the specific CPU did have a proprietary register for
> > "limiting the number of loops" that the PCIe spec talks about, and indeed
> > that register was set to "no limit". Coupled with the stuck device, these
> > indefinite retries eventually triggered TOR timeout.
> 
> "No limit" sounds like a pretty bad choice, given that it means the
> CPU will essentially hang forever because of a defective I/O device.
> There should be a timeout so software can recover (the *device* may
> never recover, but that's no reason why the kernel must crash).
> 

Correct. "No limit" is definitely a bad choice for that register,
and fixing the value would be preferable to any software solution.

Unfortunately, at least in the case I worked on, that register was
not accessible by the kernel. Intel exposes many CPU configuration
registers in terms of virtual PCI devices residing directly on Root
Buses, and the system/platform firmware is able to use vendor-provided
means to completely hide some of these pseudo-devices from the OS.

Additionally, the way the PCIe spec is phrased, not every Root Complex
implementation is required to even have such a limiting register, while
all implementations that advertise CRS SV capability are required to
behave as prescribed when PCI_VENDOR_ID is read. Hence why I believe
this patch is a general robustness improvement, rather than a workaround
for a specific device/platform.



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4 3/3] PCI: Add CRS handling to pci_dev_wait()
  2021-09-13 18:04         ` Spassov, Stanislav
@ 2021-09-14 17:53           ` Rajat Jain
  0 siblings, 0 replies; 19+ messages in thread
From: Rajat Jain @ 2021-09-14 17:53 UTC (permalink / raw)
  To: Spassov, Stanislav
  Cc: helgaas, corbet, alex.williamson, ashok.raj, okaya@kernel.org,
	tglx, akpm, Schönherr, Jan H.,
	linux-pci, bhelgaas

On Mon, Sep 13, 2021 at 11:04 AM Spassov, Stanislav <stanspas@amazon.de> wrote:
>
> On Mon, 2021-09-13 at 11:38 -0500, Bjorn Helgaas wrote:
> > On Mon, Sep 13, 2021 at 04:29:51PM +0000, Spassov, Stanislav wrote:
> > > On Sat, 2021-09-11 at 09:03 -0500, Bjorn Helgaas wrote:
> > >
> > > I later understood the specific CPU did have a proprietary register for
> > > "limiting the number of loops" that the PCIe spec talks about, and indeed
> > > that register was set to "no limit". Coupled with the stuck device, these
> > > indefinite retries eventually triggered TOR timeout.
> >
> > "No limit" sounds like a pretty bad choice, given that it means the
> > CPU will essentially hang forever because of a defective I/O device.
> > There should be a timeout so software can recover (the *device* may
> > never recover, but that's no reason why the kernel must crash).
> >
>
> Correct. "No limit" is definitely a bad choice for that register,
> and fixing the value would be preferable to any software solution.
>
> Unfortunately, at least in the case I worked on, that register was
> not accessible by the kernel.

I can acknowledge that I have across exactly the same issue (no limit
on retries, results in CPU hang) on another old Intel root port too in
the past:
https://lore.kernel.org/linux-pci/53FFA54D.9000907@gmail.com/
https://lkml.org/lkml/2014/8/1/186

and had the same problem (no way to limit the number of retries). I'd
be interested and will keep a lookout for the next patch Stanislav
sends out!

Thanks!

Rajat

> Intel exposes many CPU configuration
> registers in terms of virtual PCI devices residing directly on Root
> Buses, and the system/platform firmware is able to use vendor-provided
> means to completely hide some of these pseudo-devices from the OS.
>
> Additionally, the way the PCIe spec is phrased, not every Root Complex
> implementation is required to even have such a limiting register, while
> all implementations that advertise CRS SV capability are required to
> behave as prescribed when PCI_VENDOR_ID is read. Hence why I believe
> this patch is a general robustness improvement, rather than a workaround
> for a specific device/platform.
>
>
>
> Amazon Development Center Germany GmbH
> Krausenstr. 38
> 10117 Berlin
> Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
> Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
> Sitz: Berlin
> Ust-ID: DE 289 237 879
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-09-14 17:54 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-07 17:20 [PATCH v4 0/3] Improve PCI device post-reset readiness polling Stanislav Spassov
2020-03-07 17:20 ` [PATCH v4 1/3] PCI: Refactor polling loop out of pci_dev_wait Stanislav Spassov
2020-03-07 17:20 ` [PATCH v4 2/3] PCI: Cache CRS Software Visibiliy in struct pci_dev Stanislav Spassov
2021-09-12 13:32   ` Bjorn Helgaas
2021-09-13 16:06     ` Spassov, Stanislav
2020-03-07 17:20 ` [PATCH v4 3/3] PCI: Add CRS handling to pci_dev_wait() Stanislav Spassov
2020-03-09 15:55   ` Sinan Kaya
2020-03-09 16:19     ` Raj, Ashok
2020-03-09 16:38       ` Spassov, Stanislav
2020-03-09 17:33         ` Sinan Kaya
2021-09-11 14:03   ` Bjorn Helgaas
2021-09-13 16:29     ` Spassov, Stanislav
2021-09-13 16:38       ` Bjorn Helgaas
2021-09-13 18:04         ` Spassov, Stanislav
2021-09-14 17:53           ` Rajat Jain
2021-09-13 16:07   ` Bjorn Helgaas
2021-09-13 16:39     ` Spassov, Stanislav
2021-01-22  8:54 ` [PATCH v4 0/3] Improve PCI device post-reset readiness polling David Woodhouse
2021-09-10  9:32   ` David Woodhouse

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).