linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH V10 1/3] PCI: introduce pci_bus_wait_crs() function
@ 2017-08-11 16:56 Sinan Kaya
  2017-08-11 16:56 ` [PATCH V10 2/3] PCI: handle CRS returned by device after FLR Sinan Kaya
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Sinan Kaya @ 2017-08-11 16:56 UTC (permalink / raw)
  To: linux-pci, timur, alex.williamson
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Bjorn Helgaas, linux-kernel

Kernel is hiding Configuration Request Retry Status (CRS) inside
pci_bus_read_dev_vendor_id() function. We are looking to add support for
Function Level Reset (FLR) where vendor id read returns ~0.

Move CRS handling into its own function so that it can be called from other
places as well.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/pci/pci.h   |  2 ++
 drivers/pci/probe.c | 44 ++++++++++++++++++++++++++++++--------------
 2 files changed, 32 insertions(+), 14 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 22e0617..1bbe851 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -235,6 +235,8 @@ enum pci_bar_type {
 	pci_bar_mem64,		/* A 64-bit memory BAR */
 };
 
+bool pci_bus_wait_crs(struct pci_bus *bus, int devfn, u32 *l,
+		      int crs_timeout);
 bool pci_bus_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *pl,
 				int crs_timeout);
 int pci_setup_device(struct pci_dev *dev);
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index c31310d..b1cb7bd 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1824,29 +1824,17 @@ struct pci_dev *pci_alloc_dev(struct pci_bus *bus)
 }
 EXPORT_SYMBOL(pci_alloc_dev);
 
-bool pci_bus_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *l,
-				int crs_timeout)
+bool pci_bus_wait_crs(struct pci_bus *bus, int devfn, u32 *l, int crs_timeout)
 {
 	int delay = 1;
 
-	if (pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l))
-		return false;
-
-	/* some broken boards return 0 or ~0 if a slot is empty: */
-	if (*l == 0xffffffff || *l == 0x00000000 ||
-	    *l == 0x0000ffff || *l == 0xffff0000)
-		return false;
-
 	/*
 	 * Configuration Request Retry Status.  Some root ports return the
 	 * actual device ID instead of the synthetic ID (0xFFFF) required
 	 * by the PCIe spec.  Ignore the device ID and only check for
 	 * (vendor id == 1).
 	 */
-	while ((*l & 0xffff) == 0x0001) {
-		if (!crs_timeout)
-			return false;
-
+	do {
 		msleep(delay);
 		delay *= 2;
 		if (pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l))
@@ -1858,6 +1846,34 @@ bool pci_bus_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *l,
 			       PCI_FUNC(devfn));
 			return false;
 		}
+	} while ((*l & 0xffff) == 0x0001);
+
+	return true;
+}
+EXPORT_SYMBOL(pci_bus_wait_crs);
+
+bool pci_bus_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *l,
+				int crs_timeout)
+{
+	if (pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l))
+		return false;
+
+	/* some broken boards return 0 or ~0 if a slot is empty: */
+	if (*l == 0xffffffff || *l == 0x00000000 ||
+	    *l == 0x0000ffff || *l == 0xffff0000)
+		return false;
+
+	/*
+	 * Configuration Request Retry Status.  Some root ports return the
+	 * actual device ID instead of the synthetic ID (0xFFFF) required
+	 * by the PCIe spec.  Ignore the device ID and only check for
+	 * (vendor id == 1).
+	 */
+	if ((*l & 0xffff) == 0x0001) {
+		if (!crs_timeout)
+			return false;
+
+		return pci_bus_wait_crs(bus, devfn, l, crs_timeout);
 	}
 
 	return true;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH V10 2/3] PCI: handle CRS returned by device after FLR
  2017-08-11 16:56 [PATCH V10 1/3] PCI: introduce pci_bus_wait_crs() function Sinan Kaya
@ 2017-08-11 16:56 ` Sinan Kaya
  2017-08-18 21:01   ` Bjorn Helgaas
  2017-08-11 16:56 ` [PATCH V10 3/3] PCI: display not responding message while device is unreachable Sinan Kaya
  2017-08-18  3:00 ` [PATCH V10 1/3] PCI: introduce pci_bus_wait_crs() function Bjorn Helgaas
  2 siblings, 1 reply; 10+ messages in thread
From: Sinan Kaya @ 2017-08-11 16:56 UTC (permalink / raw)
  To: linux-pci, timur, alex.williamson
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Bjorn Helgaas, linux-kernel

Sporadic reset issues have been observed with Intel 750 NVMe drive while
assigning the physical function to the guest machine. The sequence of
events observed is as follows:

- perform a Function Level Reset (FLR)
- sleep up to 1000ms total
- read ~0 from PCI_COMMAND
- warn that the device didn't return from FLR
- touch the device before it's ready
- drop register read and writes performing register settings restore
- incomplete reset operation and partial register restoration
- second time device probe fails in the guest machine as HW is left in
limbo.

An endpoint is allowed to issue Configuration Request Retry Status (CRS)
following a FLR request to indicate that it is not ready to accept new
requests. CRS is defined in PCIe r3.1, sec 2.3.1. Request Handling Rules
and CRS usage in FLR context is mentioned in PCIe r3.1, sec 6.6.2.
Function-Level Reset.

A CRS indication will only be given if the address to be read is vendor ID
register. pci_bus_read_dev_vendor_id() knows how to deal with CRS returned
0xFFFF0001 value and will continue polling until a value other than
0xFFFF0001 is returned within a given timeout.

Try to discover device presence via CRS first. If device is not found, fall
through to old behavior.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/pci/pci.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index af0cc34..c853551 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3821,17 +3821,36 @@ static void pci_flr_wait(struct pci_dev *dev)
 {
 	int i = 0;
 	u32 id;
+	bool ret;
+
+	/*
+	 * Don't touch the HW before waiting 100ms. HW has to finish
+	 * within 100ms according to PCI Express Base Specification
+	 * Revision 3.1 Section 6.6.2: Function-Level Reset (FLR).
+	 */
+	msleep(100);
+
+	if (pci_bus_read_config_dword(dev->bus, dev->devfn, PCI_VENDOR_ID,
+				      &id))
+		return;
+
+	/* See if we can find a device via CRS first. */
+	if ((id & 0xffff) == 0x0001) {
+		ret = pci_bus_wait_crs(dev->bus, dev->devfn, &id, 60000);
+		if (ret)
+			return;
+	}
 
 	do {
 		msleep(100);
 		pci_read_config_dword(dev, PCI_COMMAND, &id);
-	} while (i++ < 10 && id == ~0);
+	} while (i++ < 9 && id == ~0);
 
 	if (id == ~0)
 		dev_warn(&dev->dev, "Failed to return from FLR\n");
 	else if (i > 1)
 		dev_info(&dev->dev, "Required additional %dms to return from FLR\n",
-			 (i - 1) * 100);
+			 i * 100);
 }
 
 /**
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH V10 3/3] PCI: display not responding message while device is unreachable
  2017-08-11 16:56 [PATCH V10 1/3] PCI: introduce pci_bus_wait_crs() function Sinan Kaya
  2017-08-11 16:56 ` [PATCH V10 2/3] PCI: handle CRS returned by device after FLR Sinan Kaya
@ 2017-08-11 16:56 ` Sinan Kaya
  2017-08-18  3:00 ` [PATCH V10 1/3] PCI: introduce pci_bus_wait_crs() function Bjorn Helgaas
  2 siblings, 0 replies; 10+ messages in thread
From: Sinan Kaya @ 2017-08-11 16:56 UTC (permalink / raw)
  To: linux-pci, timur, alex.williamson
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Bjorn Helgaas, linux-kernel

Adding a print statement into pci_bus_wait_crs() so that user observes
the progress of device polling instead of silently waiting for timeout
to be reached.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/pci/probe.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index b1cb7bd..2f4cf7d 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1839,11 +1839,17 @@ bool pci_bus_wait_crs(struct pci_bus *bus, int devfn, u32 *l, int crs_timeout)
 		delay *= 2;
 		if (pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l))
 			return false;
+
+		if (delay >= 1000)
+			pr_info("pci %04x:%02x:%02x.%d: not responding since %dms still polling\n",
+				pci_domain_nr(bus), bus->number,
+				PCI_SLOT(devfn), PCI_FUNC(devfn), delay);
+
 		/* Card hasn't responded in 60 seconds?  Must be stuck. */
 		if (delay > crs_timeout) {
-			printk(KERN_WARNING "pci %04x:%02x:%02x.%d: not responding\n",
+			pr_warn("pci %04x:%02x:%02x.%d: not responding %dms timeout reached\n",
 			       pci_domain_nr(bus), bus->number, PCI_SLOT(devfn),
-			       PCI_FUNC(devfn));
+			       PCI_FUNC(devfn), crs_timeout);
 			return false;
 		}
 	} while ((*l & 0xffff) == 0x0001);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH V10 1/3] PCI: introduce pci_bus_wait_crs() function
  2017-08-11 16:56 [PATCH V10 1/3] PCI: introduce pci_bus_wait_crs() function Sinan Kaya
  2017-08-11 16:56 ` [PATCH V10 2/3] PCI: handle CRS returned by device after FLR Sinan Kaya
  2017-08-11 16:56 ` [PATCH V10 3/3] PCI: display not responding message while device is unreachable Sinan Kaya
@ 2017-08-18  3:00 ` Bjorn Helgaas
  2017-08-18 13:33   ` Sinan Kaya
  2 siblings, 1 reply; 10+ messages in thread
From: Bjorn Helgaas @ 2017-08-18  3:00 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: linux-pci, timur, alex.williamson, linux-arm-msm,
	linux-arm-kernel, Bjorn Helgaas, linux-kernel

On Fri, Aug 11, 2017 at 12:56:34PM -0400, Sinan Kaya wrote:
> Kernel is hiding Configuration Request Retry Status (CRS) inside
> pci_bus_read_dev_vendor_id() function. We are looking to add support for
> Function Level Reset (FLR) where vendor id read returns ~0.
> 
> Move CRS handling into its own function so that it can be called from other
> places as well.

I think this is a much better idea than what I proposed.  I still have
a few questions proposals.  I'll post a v11 to show what I'm thinking.

> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
> ---
>  drivers/pci/pci.h   |  2 ++
>  drivers/pci/probe.c | 44 ++++++++++++++++++++++++++++++--------------
>  2 files changed, 32 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 22e0617..1bbe851 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -235,6 +235,8 @@ enum pci_bar_type {
>  	pci_bar_mem64,		/* A 64-bit memory BAR */
>  };
>  
> +bool pci_bus_wait_crs(struct pci_bus *bus, int devfn, u32 *l,
> +		      int crs_timeout);
>  bool pci_bus_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *pl,
>  				int crs_timeout);
>  int pci_setup_device(struct pci_dev *dev);
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index c31310d..b1cb7bd 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1824,29 +1824,17 @@ struct pci_dev *pci_alloc_dev(struct pci_bus *bus)
>  }
>  EXPORT_SYMBOL(pci_alloc_dev);
>  
> -bool pci_bus_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *l,
> -				int crs_timeout)
> +bool pci_bus_wait_crs(struct pci_bus *bus, int devfn, u32 *l, int crs_timeout)
>  {
>  	int delay = 1;
>  
> -	if (pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l))
> -		return false;
> -
> -	/* some broken boards return 0 or ~0 if a slot is empty: */
> -	if (*l == 0xffffffff || *l == 0x00000000 ||
> -	    *l == 0x0000ffff || *l == 0xffff0000)
> -		return false;
> -
>  	/*
>  	 * Configuration Request Retry Status.  Some root ports return the
>  	 * actual device ID instead of the synthetic ID (0xFFFF) required
>  	 * by the PCIe spec.  Ignore the device ID and only check for
>  	 * (vendor id == 1).
>  	 */
> -	while ((*l & 0xffff) == 0x0001) {
> -		if (!crs_timeout)
> -			return false;
> -
> +	do {
>  		msleep(delay);
>  		delay *= 2;
>  		if (pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l))
> @@ -1858,6 +1846,34 @@ bool pci_bus_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *l,
>  			       PCI_FUNC(devfn));
>  			return false;

While staring at this, I think I found a pre-existing bug in
pci_bus_read_dev_vendor_id().  It looks like this:

  while ((*l & 0xffff) == 0x0001) {
    pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l);
    if (delay > crs_timeout)
      return false;
  }

The problem is that the config read may have *succeeded* that last
time before we time out.  I think the correct sequence is:

  - check for timeout
  - read PCI_VENDOR_ID
  - check for CRS

>  		}
> +	} while ((*l & 0xffff) == 0x0001);
> +
> +	return true;
> +}
> +EXPORT_SYMBOL(pci_bus_wait_crs);

I don't think we need EXPORT_SYMBOL here, do we?

> +bool pci_bus_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *l,
> +				int crs_timeout)
> +{
> +	if (pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l))
> +		return false;
> +
> +	/* some broken boards return 0 or ~0 if a slot is empty: */
> +	if (*l == 0xffffffff || *l == 0x00000000 ||
> +	    *l == 0x0000ffff || *l == 0xffff0000)
> +		return false;
> +
> +	/*
> +	 * Configuration Request Retry Status.  Some root ports return the
> +	 * actual device ID instead of the synthetic ID (0xFFFF) required
> +	 * by the PCIe spec.  Ignore the device ID and only check for
> +	 * (vendor id == 1).
> +	 */
> +	if ((*l & 0xffff) == 0x0001) {
> +		if (!crs_timeout)
> +			return false;

One thing I don't like is that every caller of pci_bus_wait_crs() has
to know about the 0x0001 value.

> +		return pci_bus_wait_crs(bus, devfn, l, crs_timeout);
>  	}
>  
>  	return true;
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V10 1/3] PCI: introduce pci_bus_wait_crs() function
  2017-08-18  3:00 ` [PATCH V10 1/3] PCI: introduce pci_bus_wait_crs() function Bjorn Helgaas
@ 2017-08-18 13:33   ` Sinan Kaya
  0 siblings, 0 replies; 10+ messages in thread
From: Sinan Kaya @ 2017-08-18 13:33 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, timur, alex.williamson, linux-arm-msm,
	linux-arm-kernel, Bjorn Helgaas, linux-kernel

On 8/17/2017 11:00 PM, Bjorn Helgaas wrote:
> On Fri, Aug 11, 2017 at 12:56:34PM -0400, Sinan Kaya wrote:
>> Kernel is hiding Configuration Request Retry Status (CRS) inside
>> pci_bus_read_dev_vendor_id() function. We are looking to add support for
>> Function Level Reset (FLR) where vendor id read returns ~0.
>>
>> Move CRS handling into its own function so that it can be called from other
>> places as well.
> 
> I think this is a much better idea than what I proposed.  I still have
> a few questions proposals.  I'll post a v11 to show what I'm thinking.
> 

Sure, let me know. I can test it.

>> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
>> ---
>>  drivers/pci/pci.h   |  2 ++
>>  drivers/pci/probe.c | 44 ++++++++++++++++++++++++++++++--------------
>>  2 files changed, 32 insertions(+), 14 deletions(-)
>>

>>  		msleep(delay);
>>  		delay *= 2;
>>  		if (pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l))
>> @@ -1858,6 +1846,34 @@ bool pci_bus_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *l,
>>  			       PCI_FUNC(devfn));
>>  			return false;
> 
> While staring at this, I think I found a pre-existing bug in
> pci_bus_read_dev_vendor_id().  It looks like this:
> 
>   while ((*l & 0xffff) == 0x0001) {
>     pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l);
>     if (delay > crs_timeout)
>       return false;
>   }
> 
> The problem is that the config read may have *succeeded* that last
> time before we time out.  I think the correct sequence is:
> 
>   - check for timeout
>   - read PCI_VENDOR_ID
>   - check for CRS
> 

Yeah, it makes sense.

>>  		}
>> +	} while ((*l & 0xffff) == 0x0001);
>> +
>> +	return true;
>> +}
>> +EXPORT_SYMBOL(pci_bus_wait_crs);
> 
> I don't think we need EXPORT_SYMBOL here, do we?

copy/paste mistake. 

> 
>> +bool pci_bus_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *l,
>> +				int crs_timeout)
>> +{
>> +	if (pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l))
>> +		return false;
>> +
>> +	/* some broken boards return 0 or ~0 if a slot is empty: */
>> +	if (*l == 0xffffffff || *l == 0x00000000 ||
>> +	    *l == 0x0000ffff || *l == 0xffff0000)
>> +		return false;
>> +
>> +	/*
>> +	 * Configuration Request Retry Status.  Some root ports return the
>> +	 * actual device ID instead of the synthetic ID (0xFFFF) required
>> +	 * by the PCIe spec.  Ignore the device ID and only check for
>> +	 * (vendor id == 1).
>> +	 */
>> +	if ((*l & 0xffff) == 0x0001) {
>> +		if (!crs_timeout)
>> +			return false;
> 
> One thing I don't like is that every caller of pci_bus_wait_crs() has
> to know about the 0x0001 value.

Another helper function? I was trying to poll for CRS completion only
if we know that we are observing CRS.

> 
>> +		return pci_bus_wait_crs(bus, devfn, l, crs_timeout);
>>  	}
>>  
>>  	return true;
>> -- 
>> 1.9.1
>>
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V10 2/3] PCI: handle CRS returned by device after FLR
  2017-08-11 16:56 ` [PATCH V10 2/3] PCI: handle CRS returned by device after FLR Sinan Kaya
@ 2017-08-18 21:01   ` Bjorn Helgaas
  2017-08-21 13:44     ` Sinan Kaya
  0 siblings, 1 reply; 10+ messages in thread
From: Bjorn Helgaas @ 2017-08-18 21:01 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: linux-pci, timur, alex.williamson, linux-arm-msm,
	linux-arm-kernel, Bjorn Helgaas, linux-kernel

On Fri, Aug 11, 2017 at 12:56:35PM -0400, Sinan Kaya wrote:
> Sporadic reset issues have been observed with Intel 750 NVMe drive while
> assigning the physical function to the guest machine. The sequence of
> events observed is as follows:
> 
> - perform a Function Level Reset (FLR)
> - sleep up to 1000ms total
> - read ~0 from PCI_COMMAND
> - warn that the device didn't return from FLR
> - touch the device before it's ready
> - drop register read and writes performing register settings restore
> - incomplete reset operation and partial register restoration
> - second time device probe fails in the guest machine as HW is left in
> limbo.

Pardon me while I think out loud for a while.  It's taking me a while
to untangle my confusion about how CRS works.

Prior to this patch, you saw the "Failed to return from FLR" warning.
That means we did this:

     0ms  assert FLR
   100ms  read PCI_COMMAND, get ~0 (i == 0)
   200ms  read PCI_COMMAND, get ~0 (i == 1)
   ...
  1000ms  read PCI_COMMAND, get ~0 (i == 9)

If we did not get completions for those config reads, the root port
would complete the read as a failed transaction and probably
synthesize ~0 data.  But if that were the case, this patch would make
no difference: it reads PCI_VENDOR_ID instead of PCI_COMMAND the first
time around, that would also be a failed transaction, and we wouldn't
interpret it as a CRS status, so the sequence would be exactly the
above except the first read would be of PCI_VENDOR_ID, not
PCI_COMMAND.

Since this patch *does* make a difference, CRS Software Visibility
must be supported and enabled, and we must have gotten completions
with CRS status for the PCI_COMMAND reads.  Per the completion
handling rules in sec 2.3.2, the root complex should transparently
retry the read (since we're not reading PCI_VENDOR_ID, CRS Software
Visibility doesn't apply, so this is invisible to software).  But the
root complex *is* allowed to limit the number of retries and if the
limit is reached, complete the request as a failed transaction.

That must be what happened before this patch.  If that's the case, we
should see an error logged from the failed transaction.  We should be
able to verify this if we collected "lspci -vv" output for the system
before and after the FLR.

If CRS SV is not present (or not enabled), the PCI_VENDOR_ID read will
still get a CRS completion, but the root port will retry it instead of
returning the 0x0001 ID.  When we hit the retry limit, the root port
will synthesize ~0 data to complete the read.

That means we won't call pci_bus_wait_crs() at all, and we'll fall
back to the original PCI_COMMAND reads.  That path will only wait
1000ms, just like the original code before this patch.  So I *think*
that if you disable CRS SV on this system (or if you put the device in
a system that doesn't support CRS SV), you'll probably see the same
"failed to return from FLR" warning.

You could easily test this by using setpci to turn off
PCI_EXP_RTCTL_CRSSVE in the root port leading to the NVMe device
before the FLR.

I think the implication is that we need to generalize
pci_bus_wait_crs() to be more of a "wait for device ready" interface
that can use CRS SV (if supported and enabled) or something like the
PCI_COMMAND loop (if no CRS SV).  It should use the same timeout
either way, since CRS SV is a property of the root port, not of the
device we're waiting for.

It's "interesting" that the PCI core error handling code doesn't look
at the Master Abort bit at all, even though it's pretty fundamental to
conventional PCI error reporting.  I suspect Master Abort gets set
whenever a root port synthesizes ~0 data, but (with the exception of
some arch code) we never look at it and never clear it.

> An endpoint is allowed to issue Configuration Request Retry Status (CRS)
> following a FLR request to indicate that it is not ready to accept new
> requests. CRS is defined in PCIe r3.1, sec 2.3.1. Request Handling Rules
> and CRS usage in FLR context is mentioned in PCIe r3.1, sec 6.6.2.
> Function-Level Reset.
> 
> A CRS indication will only be given if the address to be read is vendor ID
> register. pci_bus_read_dev_vendor_id() knows how to deal with CRS returned
> 0xFFFF0001 value and will continue polling until a value other than
> 0xFFFF0001 is returned within a given timeout.
> 
> Try to discover device presence via CRS first. If device is not found, fall
> through to old behavior.
> 
> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
> ---
>  drivers/pci/pci.c | 23 +++++++++++++++++++++--
>  1 file changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index af0cc34..c853551 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -3821,17 +3821,36 @@ static void pci_flr_wait(struct pci_dev *dev)
>  {
>  	int i = 0;
>  	u32 id;
> +	bool ret;
> +
> +	/*
> +	 * Don't touch the HW before waiting 100ms. HW has to finish
> +	 * within 100ms according to PCI Express Base Specification
> +	 * Revision 3.1 Section 6.6.2: Function-Level Reset (FLR).
> +	 */
> +	msleep(100);
> +
> +	if (pci_bus_read_config_dword(dev->bus, dev->devfn, PCI_VENDOR_ID,
> +				      &id))
> +		return;
> +
> +	/* See if we can find a device via CRS first. */
> +	if ((id & 0xffff) == 0x0001) {
> +		ret = pci_bus_wait_crs(dev->bus, dev->devfn, &id, 60000);
> +		if (ret)
> +			return;
> +	}
>  
>  	do {
>  		msleep(100);
>  		pci_read_config_dword(dev, PCI_COMMAND, &id);
> -	} while (i++ < 10 && id == ~0);
> +	} while (i++ < 9 && id == ~0);
>  
>  	if (id == ~0)
>  		dev_warn(&dev->dev, "Failed to return from FLR\n");
>  	else if (i > 1)
>  		dev_info(&dev->dev, "Required additional %dms to return from FLR\n",
> -			 (i - 1) * 100);
> +			 i * 100);
>  }
>  
>  /**
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V10 2/3] PCI: handle CRS returned by device after FLR
  2017-08-18 21:01   ` Bjorn Helgaas
@ 2017-08-21 13:44     ` Sinan Kaya
  2017-08-21 18:00       ` Bjorn Helgaas
  0 siblings, 1 reply; 10+ messages in thread
From: Sinan Kaya @ 2017-08-21 13:44 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, timur, alex.williamson, linux-arm-msm,
	linux-arm-kernel, Bjorn Helgaas, linux-kernel

Hi Bjorn,

On 8/18/2017 5:01 PM, Bjorn Helgaas wrote:
> On Fri, Aug 11, 2017 at 12:56:35PM -0400, Sinan Kaya wrote:
>> Sporadic reset issues have been observed with Intel 750 NVMe drive while
>> assigning the physical function to the guest machine. The sequence of
>> events observed is as follows:
>>
>> - perform a Function Level Reset (FLR)
>> - sleep up to 1000ms total
>> - read ~0 from PCI_COMMAND
>> - warn that the device didn't return from FLR
>> - touch the device before it's ready
>> - drop register read and writes performing register settings restore
>> - incomplete reset operation and partial register restoration
>> - second time device probe fails in the guest machine as HW is left in
>> limbo.
> 
> Pardon me while I think out loud for a while.  It's taking me a while
> to untangle my confusion about how CRS works.
> 

no problem,

> Prior to this patch, you saw the "Failed to return from FLR" warning.
> That means we did this:
> 
>      0ms  assert FLR
>    100ms  read PCI_COMMAND, get ~0 (i == 0)
>    200ms  read PCI_COMMAND, get ~0 (i == 1)
>    ...
>   1000ms  read PCI_COMMAND, get ~0 (i == 9)
> 
> If we did not get completions for those config reads, the root port
> would complete the read as a failed transaction and probably
> synthesize ~0 data.  

That's correct. This root port returns ~0 when destination is unreachable.

> But if that were the case, this patch would make
> no difference: it reads PCI_VENDOR_ID instead of PCI_COMMAND the first
> time around, that would also be a failed transaction, and we wouldn't
> interpret it as a CRS status, so the sequence would be exactly the
> above except the first read would be of PCI_VENDOR_ID, not
> PCI_COMMAND.
> 
> Since this patch *does* make a difference, CRS Software Visibility
> must be supported and enabled, and we must have gotten completions
> with CRS status for the PCI_COMMAND reads.  

That's right, CRS visibility is supported and enabled. Root port returns
0xFFFF0001 when vendor ID is read as per the spec.

> Per the completion
> handling rules in sec 2.3.2, the root complex should transparently
> retry the read (since we're not reading PCI_VENDOR_ID, CRS Software
> Visibility doesn't apply, so this is invisible to software).  But the
> root complex *is* allowed to limit the number of retries and if the
> limit is reached, complete the request as a failed transaction.

This version of the root port doesn't support retry mechanism. This is a
TO-DO for the hardware team. This root port returns ~0 for all accesses
other than vendor id. This means we waited 1 seconds and the device
was not accessible. Command register was not reachable at the end of
1 seconds.

> 
> That must be what happened before this patch.  If that's the case, we
> should see an error logged from the failed transaction.  We should be
> able to verify this if we collected "lspci -vv" output for the system
> before and after the FLR.
> 

See above.

> If CRS SV is not present (or not enabled), the PCI_VENDOR_ID read will
> still get a CRS completion, but the root port will retry it instead of
> returning the 0x0001 ID.  When we hit the retry limit, the root port
> will synthesize ~0 data to complete the read.

This is what is missing from the HW. I'm trying to get this corrected
to be complete.

> 
> That means we won't call pci_bus_wait_crs() at all, and we'll fall
> back to the original PCI_COMMAND reads.  That path will only wait
> 1000ms, just like the original code before this patch.  So I *think*
> that if you disable CRS SV on this system (or if you put the device in
> a system that doesn't support CRS SV), you'll probably see the same
> "failed to return from FLR" warning.
> 
> You could easily test this by using setpci to turn off
> PCI_EXP_RTCTL_CRSSVE in the root port leading to the NVMe device
> before the FLR.
> 
> I think the implication is that we need to generalize
> pci_bus_wait_crs() to be more of a "wait for device ready" interface
> that can use CRS SV (if supported and enabled) or something like the
> PCI_COMMAND loop (if no CRS SV).  It should use the same timeout
> either way, since CRS SV is a property of the root port, not of the
> device we're waiting for.

I saw your comment about timeout. I was trying not to change the behavior
for systems that do not support CRS visibility. we can certainly increase
the timeout if you think it is better.

I saw your V11, I'll review them in a minute.

> 
> It's "interesting" that the PCI core error handling code doesn't look
> at the Master Abort bit at all, even though it's pretty fundamental to
> conventional PCI error reporting.  I suspect Master Abort gets set
> whenever a root port synthesizes ~0 data, but (with the exception of
> some arch code) we never look at it and never clear it.
> 
>> An endpoint is allowed to issue Configuration Request Retry Status (CRS)
>> following a FLR request to indicate that it is not ready to accept new
>> requests. CRS is defined in PCIe r3.1, sec 2.3.1. Request Handling Rules
>> and CRS usage in FLR context is mentioned in PCIe r3.1, sec 6.6.2.
>> Function-Level Reset.
>>
>> A CRS indication will only be given if the address to be read is vendor ID
>> register. pci_bus_read_dev_vendor_id() knows how to deal with CRS returned
>> 0xFFFF0001 value and will continue polling until a value other than
>> 0xFFFF0001 is returned within a given timeout.
>>
>> Try to discover device presence via CRS first. If device is not found, fall
>> through to old behavior.
>>
>> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
>> ---
>>  drivers/pci/pci.c | 23 +++++++++++++++++++++--
>>  1 file changed, 21 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index af0cc34..c853551 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -3821,17 +3821,36 @@ static void pci_flr_wait(struct pci_dev *dev)
>>  {
>>  	int i = 0;
>>  	u32 id;
>> +	bool ret;
>> +
>> +	/*
>> +	 * Don't touch the HW before waiting 100ms. HW has to finish
>> +	 * within 100ms according to PCI Express Base Specification
>> +	 * Revision 3.1 Section 6.6.2: Function-Level Reset (FLR).
>> +	 */
>> +	msleep(100);
>> +
>> +	if (pci_bus_read_config_dword(dev->bus, dev->devfn, PCI_VENDOR_ID,
>> +				      &id))
>> +		return;
>> +
>> +	/* See if we can find a device via CRS first. */
>> +	if ((id & 0xffff) == 0x0001) {
>> +		ret = pci_bus_wait_crs(dev->bus, dev->devfn, &id, 60000);
>> +		if (ret)
>> +			return;
>> +	}
>>  
>>  	do {
>>  		msleep(100);
>>  		pci_read_config_dword(dev, PCI_COMMAND, &id);
>> -	} while (i++ < 10 && id == ~0);
>> +	} while (i++ < 9 && id == ~0);
>>  
>>  	if (id == ~0)
>>  		dev_warn(&dev->dev, "Failed to return from FLR\n");
>>  	else if (i > 1)
>>  		dev_info(&dev->dev, "Required additional %dms to return from FLR\n",
>> -			 (i - 1) * 100);
>> +			 i * 100);
>>  }
>>  
>>  /**
>> -- 
>> 1.9.1
>>
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V10 2/3] PCI: handle CRS returned by device after FLR
  2017-08-21 13:44     ` Sinan Kaya
@ 2017-08-21 18:00       ` Bjorn Helgaas
  2017-08-23  3:21         ` Sinan Kaya
  0 siblings, 1 reply; 10+ messages in thread
From: Bjorn Helgaas @ 2017-08-21 18:00 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: linux-pci, timur, alex.williamson, linux-arm-msm,
	linux-arm-kernel, Bjorn Helgaas, linux-kernel

On Mon, Aug 21, 2017 at 09:44:09AM -0400, Sinan Kaya wrote:
> Hi Bjorn,
> 
> On 8/18/2017 5:01 PM, Bjorn Helgaas wrote:
> > On Fri, Aug 11, 2017 at 12:56:35PM -0400, Sinan Kaya wrote:
> >> Sporadic reset issues have been observed with Intel 750 NVMe drive while
> >> assigning the physical function to the guest machine. The sequence of
> >> events observed is as follows:
> >>
> >> - perform a Function Level Reset (FLR)
> >> - sleep up to 1000ms total
> >> - read ~0 from PCI_COMMAND
> >> - warn that the device didn't return from FLR
> >> - touch the device before it's ready
> >> - drop register read and writes performing register settings restore
> >> - incomplete reset operation and partial register restoration
> >> - second time device probe fails in the guest machine as HW is left in
> >> limbo.
> > 
> > Pardon me while I think out loud for a while.  It's taking me a while
> > to untangle my confusion about how CRS works.
> > 
> 
> no problem,
> 
> > Prior to this patch, you saw the "Failed to return from FLR" warning.
> > That means we did this:
> > 
> >      0ms  assert FLR
> >    100ms  read PCI_COMMAND, get ~0 (i == 0)
> >    200ms  read PCI_COMMAND, get ~0 (i == 1)
> >    ...
> >   1000ms  read PCI_COMMAND, get ~0 (i == 9)
> > 
> > If we did not get completions for those config reads, the root port
> > would complete the read as a failed transaction and probably
> > synthesize ~0 data.  
> 
> That's correct. This root port returns ~0 when destination is unreachable.
> 
> > But if that were the case, this patch would make
> > no difference: it reads PCI_VENDOR_ID instead of PCI_COMMAND the first
> > time around, that would also be a failed transaction, and we wouldn't
> > interpret it as a CRS status, so the sequence would be exactly the
> > above except the first read would be of PCI_VENDOR_ID, not
> > PCI_COMMAND.
> > 
> > Since this patch *does* make a difference, CRS Software Visibility
> > must be supported and enabled, and we must have gotten completions
> > with CRS status for the PCI_COMMAND reads.  
> 
> That's right, CRS visibility is supported and enabled. Root port returns
> 0xFFFF0001 when vendor ID is read as per the spec.
> 
> > Per the completion
> > handling rules in sec 2.3.2, the root complex should transparently
> > retry the read (since we're not reading PCI_VENDOR_ID, CRS Software
> > Visibility doesn't apply, so this is invisible to software).  But the
> > root complex *is* allowed to limit the number of retries and if the
> > limit is reached, complete the request as a failed transaction.
> 
> This version of the root port doesn't support retry mechanism. This is a
> TO-DO for the hardware team. This root port returns ~0 for all accesses
> other than vendor id. This means we waited 1 seconds and the device
> was not accessible. Command register was not reachable at the end of
> 1 seconds.

OK.  Since the spec says the root complex must re-issue the request,
but also says it may limit the number of retries (without saying what
that limit should be), I don't think we can assume any.  I think zero
is a valid number of retries, so I'd say this HW conforms
to the spec as-is.

> > That must be what happened before this patch.  If that's the case, we
> > should see an error logged from the failed transaction.  We should be
> > able to verify this if we collected "lspci -vv" output for the system
> > before and after the FLR.
> 
> See above.

It would still be interesting to see the lspci output.  Likely <MAbort
in the upstream bridge will be set even before the FLR because of
aborts during enumeration.  The PCI core should probably clear that
after enumeration.  I expect that if you manually clear it with
setpci, do the lspci, do the FLR, do another lspci, you will probably
see <MAbort being set by this pci_flr_wait() loop.  We might want to
consider clearing it here as well.

> > If CRS SV is not present (or not enabled), the PCI_VENDOR_ID read will
> > still get a CRS completion, but the root port will retry it instead of
> > returning the 0x0001 ID.  When we hit the retry limit, the root port
> > will synthesize ~0 data to complete the read.
> 
> This is what is missing from the HW. I'm trying to get this corrected
> to be complete.

>From what you've described I don't see a HW issue here.  It might be
nice for other reasons for the HW to reissue config accesses that
receive a CRS completion, but the spec seems to allow the current
behavior.

> > That means we won't call pci_bus_wait_crs() at all, and we'll fall
> > back to the original PCI_COMMAND reads.  That path will only wait
> > 1000ms, just like the original code before this patch.  So I *think*
> > that if you disable CRS SV on this system (or if you put the device in
> > a system that doesn't support CRS SV), you'll probably see the same
> > "failed to return from FLR" warning.
> > 
> > You could easily test this by using setpci to turn off
> > PCI_EXP_RTCTL_CRSSVE in the root port leading to the NVMe device
> > before the FLR.
> > 
> > I think the implication is that we need to generalize
> > pci_bus_wait_crs() to be more of a "wait for device ready" interface
> > that can use CRS SV (if supported and enabled) or something like the
> > PCI_COMMAND loop (if no CRS SV).  It should use the same timeout
> > either way, since CRS SV is a property of the root port, not of the
> > device we're waiting for.
> 
> I saw your comment about timeout. I was trying not to change the behavior
> for systems that do not support CRS visibility. we can certainly increase
> the timeout if you think it is better.

I think it's important that we use the same timeout.  If we leave them
different, on systems that support CRS SV, we'll wait up to 60 sec and
the FLR will work fine, but on systems that don't, we'll only wait 1 sec
and the FLR will fail.

Bjorn

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V10 2/3] PCI: handle CRS returned by device after FLR
  2017-08-21 18:00       ` Bjorn Helgaas
@ 2017-08-23  3:21         ` Sinan Kaya
  2017-08-23 20:01           ` Bjorn Helgaas
  0 siblings, 1 reply; 10+ messages in thread
From: Sinan Kaya @ 2017-08-23  3:21 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, timur, alex.williamson, linux-arm-msm,
	linux-arm-kernel, Bjorn Helgaas, linux-kernel

On 8/21/2017 2:00 PM, Bjorn Helgaas wrote:
> On Mon, Aug 21, 2017 at 09:44:09AM -0400, Sinan Kaya wrote:
>> Hi Bjorn,
>>
>> On 8/18/2017 5:01 PM, Bjorn Helgaas wrote:

> 
> It would still be interesting to see the lspci output.  Likely <MAbort
> in the upstream bridge will be set even before the FLR because of
> aborts during enumeration.  The PCI core should probably clear that
> after enumeration.  I expect that if you manually clear it with
> setpci, do the lspci, do the FLR, do another lspci, you will probably
> see <MAbort being set by this pci_flr_wait() loop.  We might want to
> consider clearing it here as well.


     Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 345
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        I/O behind bridge: 00002000-00002fff
        Memory behind bridge: 00100000-002fffff
        Prefetchable memory behind bridge: 0000090400000000-00000904001fffff
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-

FLR reset here with CRS to the NVMe drive.
	
     Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 345
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        I/O behind bridge: 00002000-00002fff
        Memory behind bridge: 00100000-002fffff
        Prefetchable memory behind bridge: 0000090400000000-00000904001fffff
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
	
I don't see <MAbort getting set. 




-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V10 2/3] PCI: handle CRS returned by device after FLR
  2017-08-23  3:21         ` Sinan Kaya
@ 2017-08-23 20:01           ` Bjorn Helgaas
  0 siblings, 0 replies; 10+ messages in thread
From: Bjorn Helgaas @ 2017-08-23 20:01 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: linux-pci, timur, alex.williamson, linux-arm-msm,
	linux-arm-kernel, Bjorn Helgaas, linux-kernel

On Tue, Aug 22, 2017 at 11:21:24PM -0400, Sinan Kaya wrote:
> On 8/21/2017 2:00 PM, Bjorn Helgaas wrote:
> > On Mon, Aug 21, 2017 at 09:44:09AM -0400, Sinan Kaya wrote:
> >> Hi Bjorn,
> >>
> >> On 8/18/2017 5:01 PM, Bjorn Helgaas wrote:
> 
> > 
> > It would still be interesting to see the lspci output.  Likely <MAbort
> > in the upstream bridge will be set even before the FLR because of
> > aborts during enumeration.  The PCI core should probably clear that
> > after enumeration.  I expect that if you manually clear it with
> > setpci, do the lspci, do the FLR, do another lspci, you will probably
> > see <MAbort being set by this pci_flr_wait() loop.  We might want to
> > consider clearing it here as well.
> 
> 
>      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 0
>         Interrupt: pin A routed to IRQ 345
>         Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
>         I/O behind bridge: 00002000-00002fff
>         Memory behind bridge: 00100000-002fffff
>         Prefetchable memory behind bridge: 0000090400000000-00000904001fffff
>         Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
> 
> FLR reset here with CRS to the NVMe drive.
> 	
>      Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 0
>         Interrupt: pin A routed to IRQ 345
>         Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
>         I/O behind bridge: 00002000-00002fff
>         Memory behind bridge: 00100000-002fffff
>         Prefetchable memory behind bridge: 0000090400000000-00000904001fffff
>         Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
> 	
> I don't see <MAbort getting set. 

Thanks for checking.  One of the many things I don't understand, I guess.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2017-08-23 20:01 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-11 16:56 [PATCH V10 1/3] PCI: introduce pci_bus_wait_crs() function Sinan Kaya
2017-08-11 16:56 ` [PATCH V10 2/3] PCI: handle CRS returned by device after FLR Sinan Kaya
2017-08-18 21:01   ` Bjorn Helgaas
2017-08-21 13:44     ` Sinan Kaya
2017-08-21 18:00       ` Bjorn Helgaas
2017-08-23  3:21         ` Sinan Kaya
2017-08-23 20:01           ` Bjorn Helgaas
2017-08-11 16:56 ` [PATCH V10 3/3] PCI: display not responding message while device is unreachable Sinan Kaya
2017-08-18  3:00 ` [PATCH V10 1/3] PCI: introduce pci_bus_wait_crs() function Bjorn Helgaas
2017-08-18 13:33   ` Sinan Kaya

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).