linux-pci.vger.kernel.org archive mirror
* [PATCH] PCI/ERR: Resolve regression in pcie_do_recovery
@ 2020-04-30  0:42 Jay Vosburgh
  2020-04-30  1:15 ` Kuppuswamy, Sathyanarayanan
  2020-05-09  8:34 ` Yicong Yang
  0 siblings, 2 replies; 9+ messages in thread
From: Jay Vosburgh @ 2020-04-30  0:42 UTC
  To: linux-pci; +Cc: Bjorn Helgaas, Kuppuswamy Sathyanarayanan

	Commit 6d2c89441571 ("PCI/ERR: Update error status after
reset_link()") introduced a regression: pcie_do_recovery now discards
the status result from report_frozen_detected.  This can cause a
failure to recover if report_frozen_detected returns _NEED_RESET,
because report_slot_reset is then not invoked.

	Such an event can be induced for testing purposes by reducing
the Max_Payload_Size of a PCIe bridge to less than that of a device
downstream from the bridge, and then initiating I/O through the device,
resulting in oversize transactions.  In the presence of DPC, this
results in a containment event and attempted reset and recovery via
pcie_do_recovery.  After 6d2c89441571 report_slot_reset is not invoked,
and the device does not recover.

	Inspection shows a similar path is plausible for a return of
_CAN_RECOVER and the invocation of report_mmio_enabled.

	Resolve this by preserving the result of report_frozen_detected if
reset_link does not return _DISCONNECT.

Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>

---
 drivers/pci/pcie/err.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 14bb8f54723e..e4274562f3a0 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -164,10 +164,17 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 
 	pci_dbg(dev, "broadcast error_detected message\n");
 	if (state == pci_channel_io_frozen) {
+		pci_ers_result_t status2;
+
 		pci_walk_bus(bus, report_frozen_detected, &status);
-		status = reset_link(dev);
-		if (status != PCI_ERS_RESULT_RECOVERED) {
+		/* preserve status from report_frozen_detected to
+		 * ensure report_mmio_enabled or report_slot_reset are
+		 * invoked even if reset_link returns _RECOVERED.
+		 */
+		status2 = reset_link(dev);
+		if (status2 != PCI_ERS_RESULT_RECOVERED) {
 			pci_warn(dev, "link reset failed\n");
+			status = status2;
 			goto failed;
 		}
 	} else {
-- 
2.7.4
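
The effect of the overwritten status can be modeled in user space. The following is a minimal sketch, not kernel code: the enum and the three function names are hypothetical stand-ins for pci_ers_result_t and the frozen branch of pcie_do_recovery, and the driver callbacks are reduced to a single "vote" value.

```c
#include <stdbool.h>

/* Simplified stand-ins for the kernel's pci_ers_result_t values. */
enum ers_result {
	ERS_CAN_RECOVER,
	ERS_NEED_RESET,
	ERS_DISCONNECT,
	ERS_RECOVERED,
};

/*
 * Frozen path after 6d2c89441571: the reset_link() result is assigned
 * straight to status, so the drivers' vote is discarded.
 */
enum ers_result frozen_status_regressed(enum ers_result vote,
					enum ers_result link)
{
	(void)vote;	/* overwritten, never consulted */
	return link;
}

/*
 * Frozen path with the patch: the link result goes to a second
 * variable and replaces the vote only when the link reset failed.
 */
enum ers_result frozen_status_patched(enum ers_result vote,
				      enum ers_result link)
{
	enum ers_result status = vote;
	enum ers_result status2 = link;

	if (status2 != ERS_RECOVERED)
		status = status2;
	return status;
}

/* The slot_reset broadcast runs only while status is _NEED_RESET. */
bool slot_reset_invoked(enum ers_result status)
{
	return status == ERS_NEED_RESET;
}
```

Under this model, a vote of _NEED_RESET combined with a successful link reset leaves status at _RECOVERED in the regressed variant, so the slot reset broadcast is skipped; the patched variant keeps _NEED_RESET.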



* Re: [PATCH] PCI/ERR: Resolve regression in pcie_do_recovery
  2020-04-30  0:42 [PATCH] PCI/ERR: Resolve regression in pcie_do_recovery Jay Vosburgh
@ 2020-04-30  1:15 ` Kuppuswamy, Sathyanarayanan
  2020-04-30 19:35   ` Kuppuswamy, Sathyanarayanan
  2020-05-09  8:34 ` Yicong Yang
  1 sibling, 1 reply; 9+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2020-04-30  1:15 UTC
  To: Jay Vosburgh, linux-pci; +Cc: Bjorn Helgaas



On 4/29/20 5:42 PM, Jay Vosburgh wrote:
> 	Commit 6d2c89441571 ("PCI/ERR: Update error status after
> reset_link()"), introduced a regression, as pcie_do_recovery will
> discard the status result from report_frozen_detected.  This can cause a
> failure to recover if _NEED_RESET is returned by report_frozen_detected
> and report_slot_reset is not invoked.
> 
> 	Such an event can be induced for testing purposes by reducing
> the Max_Payload_Size of a PCIe bridge to less than that of a device
> downstream from the bridge, and then initiating I/O through the device,
> resulting in oversize transactions.  In the presence of DPC, this
> results in a containment event and attempted reset and recovery via
> pcie_do_recovery.  After 6d2c89441571 report_slot_reset is not invoked,
> and the device does not recover.

I think this issue is related to the one discussed in the following
thread (DPC non-hotplug support):

https://lkml.org/lkml/2020/3/28/328

If my assumption is correct, you are dealing with devices that are
not hotplug capable. If the devices were hotplug capable, you would not
need to proceed to report_slot_reset(), since the hotplug handler would
remove and re-enumerate the devices correctly.



* Re: [PATCH] PCI/ERR: Resolve regression in pcie_do_recovery
  2020-04-30  1:15 ` Kuppuswamy, Sathyanarayanan
@ 2020-04-30 19:35   ` Kuppuswamy, Sathyanarayanan
  2020-04-30 20:41     ` Jay Vosburgh
  0 siblings, 1 reply; 9+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2020-04-30 19:35 UTC
  To: Jay Vosburgh, linux-pci; +Cc: Bjorn Helgaas

Hi Jay,

On 4/29/20 6:15 PM, Kuppuswamy, Sathyanarayanan wrote:
> 
> 
> On 4/29/20 5:42 PM, Jay Vosburgh wrote:
>>     Commit 6d2c89441571 ("PCI/ERR: Update error status after
>> reset_link()"), introduced a regression, as pcie_do_recovery will
>> discard the status result from report_frozen_detected.  This can cause a
>> failure to recover if _NEED_RESET is returned by report_frozen_detected
>> and report_slot_reset is not invoked.
>>
>>     Such an event can be induced for testing purposes by reducing
>> the Max_Payload_Size of a PCIe bridge to less than that of a device
>> downstream from the bridge, and then initiating I/O through the device,
>> resulting in oversize transactions.  In the presence of DPC, this
>> results in a containment event and attempted reset and recovery via
>> pcie_do_recovery.  After 6d2c89441571 report_slot_reset is not invoked,
>> and the device does not recover.
> 
> I think this issue is related to the issue discussed in following
> thread (DPC non-hotplug support).
> 
> https://lkml.org/lkml/2020/3/28/328
> 
> If my assumption is correct, you are dealing with devices which are
> not hotplug capable. If the devices are hotplug capable then you don't
> need to proceed to report_slot_reset(), since hotplug handler will
> remove/re-enumerate the devices correctly.

Can you check whether the following fix works for you?

It includes support for bus reset in the recovery function itself.

index 14bb8f54723e..c9eaab68ab7a 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -165,13 +165,23 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
         pci_dbg(dev, "broadcast error_detected message\n");
         if (state == pci_channel_io_frozen) {
                 pci_walk_bus(bus, report_frozen_detected, &status);
-               status = reset_link(dev);
-               if (status != PCI_ERS_RESULT_RECOVERED) {
+               status = PCI_ERS_RESULT_NEED_RESET;
+       } else {
+               pci_walk_bus(bus, report_normal_detected, &status);
+       }
+
+       if (status == PCI_ERS_RESULT_NEED_RESET) {
+               if (reset_link)
+                       if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED)
+                               status = PCI_ERS_RESULT_DISCONNECT;
+               else
+                       if (pci_bus_error_reset(dev))
+                               status = PCI_ERS_RESULT_DISCONNECT;
+
+               if (status == PCI_ERS_RESULT_DISCONNECT) {
                         pci_warn(dev, "link reset failed\n");
                         goto failed;
                 }
-       } else {
-               pci_walk_bus(bus, report_normal_detected, &status);
         }

         if (status == PCI_ERS_RESULT_CAN_RECOVER) {
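
The detection and reset stages of this proposal can be sketched in user space as follows. This is a simplified model under stated assumptions, not kernel code: the enum values, channel_state, and status_after_reset_stage are hypothetical names, the driver callbacks are collapsed into one "vote", and the link or bus reset is reduced to a boolean outcome.

```c
#include <stdbool.h>

/* Simplified stand-ins for the kernel's pci_ers_result_t values. */
enum ers_result {
	ERS_CAN_RECOVER,
	ERS_NEED_RESET,
	ERS_DISCONNECT,
	ERS_RECOVERED,
};

/* Stand-ins for pci_channel_io_normal and pci_channel_io_frozen. */
enum channel_state {
	CHANNEL_NORMAL,
	CHANNEL_FROZEN,
};

/*
 * Status handed to the later mmio_enabled/slot_reset stages.
 * vote is the merged driver answer from the error_detected broadcast;
 * reset_ok says whether the link (or bus) reset succeeded.
 */
enum ers_result status_after_reset_stage(enum channel_state state,
					 enum ers_result vote,
					 bool reset_ok)
{
	enum ers_result status = vote;

	/* A frozen channel always requests a reset, whatever the vote. */
	if (state == CHANNEL_FROZEN)
		status = ERS_NEED_RESET;

	/* The reset is attempted only when a reset was requested. */
	if (status == ERS_NEED_RESET && !reset_ok)
		status = ERS_DISCONNECT;

	return status;
}
```

In this model a frozen channel leaves the stage with _NEED_RESET when the reset succeeds, which is the value the later slot reset check tests for, and a non-frozen channel whose drivers ask for _NEED_RESET now also goes through the reset.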




* Re: [PATCH] PCI/ERR: Resolve regression in pcie_do_recovery
  2020-04-30 19:35   ` Kuppuswamy, Sathyanarayanan
@ 2020-04-30 20:41     ` Jay Vosburgh
  2020-05-06 18:08       ` Jay Vosburgh
  2020-05-09  6:35       ` Yicong Yang
  0 siblings, 2 replies; 9+ messages in thread
From: Jay Vosburgh @ 2020-04-30 20:41 UTC
  To: Kuppuswamy, Sathyanarayanan; +Cc: linux-pci, Bjorn Helgaas

"Kuppuswamy, Sathyanarayanan" wrote:

>Hi Jay,
>
>On 4/29/20 6:15 PM, Kuppuswamy, Sathyanarayanan wrote:
>>
>>
>> On 4/29/20 5:42 PM, Jay Vosburgh wrote:
>>>     Commit 6d2c89441571 ("PCI/ERR: Update error status after
>>> reset_link()"), introduced a regression, as pcie_do_recovery will
>>> discard the status result from report_frozen_detected.  This can cause a
>>> failure to recover if _NEED_RESET is returned by report_frozen_detected
>>> and report_slot_reset is not invoked.
>>>
>>>     Such an event can be induced for testing purposes by reducing
>>> the Max_Payload_Size of a PCIe bridge to less than that of a device
>>> downstream from the bridge, and then initiating I/O through the device,
>>> resulting in oversize transactions.  In the presence of DPC, this
>>> results in a containment event and attempted reset and recovery via
>>> pcie_do_recovery.  After 6d2c89441571 report_slot_reset is not invoked,
>>> and the device does not recover.
>>
>> I think this issue is related to the issue discussed in following
>> thread (DPC non-hotplug support).
>>
>> https://lkml.org/lkml/2020/3/28/328
>>
>> If my assumption is correct, you are dealing with devices which are
>> not hotplug capable. If the devices are hotplug capable then you don't
>> need to proceed to report_slot_reset(), since hotplug handler will
>> remove/re-enumerate the devices correctly.

	Correct, this particular device (a network card) is in a
non-hotplug slot.

>Can you check whether following fix works for you?

	Yes, it does.

	I fixed up the whitespace and made a minor change to add braces
in what look like the correct places around the "if (reset_link)" block;
the patch I tested with is below.  I'll also install this on another
machine with hotplug capable slots to test there as well.

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 14bb8f54723e..db80e1ecb2dc 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -165,13 +165,24 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 	pci_dbg(dev, "broadcast error_detected message\n");
 	if (state == pci_channel_io_frozen) {
 		pci_walk_bus(bus, report_frozen_detected, &status);
-		status = reset_link(dev);
-		if (status != PCI_ERS_RESULT_RECOVERED) {
+		status = PCI_ERS_RESULT_NEED_RESET;
+	} else {
+		pci_walk_bus(bus, report_normal_detected, &status);
+	}
+
+	if (status == PCI_ERS_RESULT_NEED_RESET) {
+		if (reset_link) {
+			if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED)
+				status = PCI_ERS_RESULT_DISCONNECT;
+		} else {
+			if (pci_bus_error_reset(dev))
+				status = PCI_ERS_RESULT_DISCONNECT;
+		}
+
+		if (status == PCI_ERS_RESULT_DISCONNECT) {
 			pci_warn(dev, "link reset failed\n");
 			goto failed;
 		}
-	} else {
-		pci_walk_bus(bus, report_normal_detected, &status);
 	}
 
 	if (status == PCI_ERS_RESULT_CAN_RECOVER) {


	-J
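
The braces around the "if (reset_link)" block change behavior, not just layout: in C an else pairs with the nearest unmatched if. The sketch below is a user-space illustration only; fake_bus_reset, fake_reset_link_ok, and the other names are hypothetical stand-ins (fake_bus_reset plays the role of pci_bus_error_reset). reset_as_parsed spells out how a compiler reads the unbraced version, and reset_as_braced mirrors the braced version tested above.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel's pci_ers_result_t values. */
enum ers_result {
	ERS_NEED_RESET,
	ERS_DISCONNECT,
	ERS_RECOVERED,
};

typedef enum ers_result (*reset_link_fn)(void);
typedef enum ers_result (*variant_fn)(reset_link_fn);

int bus_reset_calls;	/* how often the fallback reset ran */

/* Hypothetical fallback bus reset; returns 0 on success. */
int fake_bus_reset(void)
{
	bus_reset_calls++;
	return 0;
}

/* Hypothetical link reset callback that always succeeds. */
enum ers_result fake_reset_link_ok(void)
{
	return ERS_RECOVERED;
}

/* Unbraced version as the compiler sees it: else binds to the inner if. */
enum ers_result reset_as_parsed(reset_link_fn reset_link)
{
	enum ers_result status = ERS_NEED_RESET;

	if (reset_link) {
		if (reset_link() != ERS_RECOVERED)
			status = ERS_DISCONNECT;
		else if (fake_bus_reset())
			status = ERS_DISCONNECT;
	}
	return status;
}

/* Braced version: the bus reset is the fallback for a NULL callback. */
enum ers_result reset_as_braced(reset_link_fn reset_link)
{
	enum ers_result status = ERS_NEED_RESET;

	if (reset_link) {
		if (reset_link() != ERS_RECOVERED)
			status = ERS_DISCONNECT;
	} else {
		if (fake_bus_reset())
			status = ERS_DISCONNECT;
	}
	return status;
}

/* Run one variant and report how many bus resets it triggered. */
int count_bus_resets(variant_fn variant, reset_link_fn reset_link)
{
	bus_reset_calls = 0;
	variant(reset_link);
	return bus_reset_calls;
}
```

With the unbraced parse, a NULL callback triggers no reset at all, while a successful link reset is followed by an extra bus reset; the braced version performs exactly one reset in either case.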


---
	-Jay Vosburgh, jay.vosburgh@canonical.com


* Re: [PATCH] PCI/ERR: Resolve regression in pcie_do_recovery
  2020-04-30 20:41     ` Jay Vosburgh
@ 2020-05-06 18:08       ` Jay Vosburgh
  2020-05-09  6:35       ` Yicong Yang
  1 sibling, 0 replies; 9+ messages in thread
From: Jay Vosburgh @ 2020-05-06 18:08 UTC
  Cc: Kuppuswamy, Sathyanarayanan, linux-pci, Bjorn Helgaas

Jay Vosburgh <jay.vosburgh@canonical.com> wrote:

>"Kuppuswamy, Sathyanarayanan" wrote:
>
>>Hi Jay,
>>
>>On 4/29/20 6:15 PM, Kuppuswamy, Sathyanarayanan wrote:
>>>
>>>
>>> On 4/29/20 5:42 PM, Jay Vosburgh wrote:
>>>>     Commit 6d2c89441571 ("PCI/ERR: Update error status after
>>>> reset_link()"), introduced a regression, as pcie_do_recovery will
>>>> discard the status result from report_frozen_detected.  This can cause a
>>>> failure to recover if _NEED_RESET is returned by report_frozen_detected
>>>> and report_slot_reset is not invoked.
>>>>
>>>>     Such an event can be induced for testing purposes by reducing
>>>> the Max_Payload_Size of a PCIe bridge to less than that of a device
>>>> downstream from the bridge, and then initiating I/O through the device,
>>>> resulting in oversize transactions.  In the presence of DPC, this
>>>> results in a containment event and attempted reset and recovery via
>>>> pcie_do_recovery.  After 6d2c89441571 report_slot_reset is not invoked,
>>>> and the device does not recover.
>>>
>>> I think this issue is related to the issue discussed in following
>>> thread (DPC non-hotplug support).
>>>
>>> https://lkml.org/lkml/2020/3/28/328
>>>
>>> If my assumption is correct, you are dealing with devices which are
>>> not hotplug capable. If the devices are hotplug capable then you don't
>>> need to proceed to report_slot_reset(), since hotplug handler will
>>> remove/re-enumerate the devices correctly.
>
>	Correct, this particular device (a network card) is in a
>non-hotplug slot.
>
>>Can you check whether following fix works for you?
>
>	Yes, it does.
>
>	I fixed up the whitespace and made a minor change to add braces
>in what look like the correct places around the "if (reset_link)" block;
>the patch I tested with is below.  I'll also install this on another
>machine with hotplug capable slots to test there as well.

	We've tested the below patch on a couple of different machines
and devices (network card, NVMe device) and it appears to solve the
recovery issue in our testing.

	Is there anything further we need to do, or can this be
considered for inclusion upstream at this time?

	-J

>diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>index 14bb8f54723e..db80e1ecb2dc 100644
>--- a/drivers/pci/pcie/err.c
>+++ b/drivers/pci/pcie/err.c
>@@ -165,13 +165,24 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> 	pci_dbg(dev, "broadcast error_detected message\n");
> 	if (state == pci_channel_io_frozen) {
> 		pci_walk_bus(bus, report_frozen_detected, &status);
>-		status = reset_link(dev);
>-		if (status != PCI_ERS_RESULT_RECOVERED) {
>+		status = PCI_ERS_RESULT_NEED_RESET;
>+	} else {
>+		pci_walk_bus(bus, report_normal_detected, &status);
>+	}
>+
>+	if (status == PCI_ERS_RESULT_NEED_RESET) {
>+		if (reset_link) {
>+			if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED)
>+				status = PCI_ERS_RESULT_DISCONNECT;
>+		} else {
>+			if (pci_bus_error_reset(dev))
>+				status = PCI_ERS_RESULT_DISCONNECT;
>+		}
>+
>+		if (status == PCI_ERS_RESULT_DISCONNECT) {
> 			pci_warn(dev, "link reset failed\n");
> 			goto failed;
> 		}
>-	} else {
>-		pci_walk_bus(bus, report_normal_detected, &status);
> 	}
> 
> 	if (status == PCI_ERS_RESULT_CAN_RECOVER) {
>
>
>	-J


* Re: [PATCH] PCI/ERR: Resolve regression in pcie_do_recovery
  2020-04-30 20:41     ` Jay Vosburgh
  2020-05-06 18:08       ` Jay Vosburgh
@ 2020-05-09  6:35       ` Yicong Yang
  2020-05-09 17:55         ` Kuppuswamy, Sathyanarayanan
  1 sibling, 1 reply; 9+ messages in thread
From: Yicong Yang @ 2020-05-09  6:35 UTC
  To: Jay Vosburgh, Kuppuswamy, Sathyanarayanan; +Cc: linux-pci, Bjorn Helgaas

Hi Jay, Kuppuswamy

On 2020/5/1 4:41, Jay Vosburgh wrote:
> "Kuppuswamy, Sathyanarayanan" wrote:
>
>> Hi Jay,
>>
>> On 4/29/20 6:15 PM, Kuppuswamy, Sathyanarayanan wrote:
>>>
>>> On 4/29/20 5:42 PM, Jay Vosburgh wrote:
>>>>     Commit 6d2c89441571 ("PCI/ERR: Update error status after
>>>> reset_link()"), introduced a regression, as pcie_do_recovery will
>>>> discard the status result from report_frozen_detected.  This can cause a
>>>> failure to recover if _NEED_RESET is returned by report_frozen_detected
>>>> and report_slot_reset is not invoked.
>>>>
>>>>     Such an event can be induced for testing purposes by reducing
>>>> the Max_Payload_Size of a PCIe bridge to less than that of a device
>>>> downstream from the bridge, and then initiating I/O through the device,
>>>> resulting in oversize transactions.  In the presence of DPC, this
>>>> results in a containment event and attempted reset and recovery via
>>>> pcie_do_recovery.  After 6d2c89441571 report_slot_reset is not invoked,
>>>> and the device does not recover.
>>> I think this issue is related to the issue discussed in following
>>> thread (DPC non-hotplug support).
>>>
>>> https://lkml.org/lkml/2020/3/28/328
>>>
>>> If my assumption is correct, you are dealing with devices which are
>>> not hotplug capable. If the devices are hotplug capable then you don't
>>> need to proceed to report_slot_reset(), since hotplug handler will
>>> remove/re-enumerate the devices correctly.
> 	Correct, this particular device (a network card) is in a
> non-hotplug slot.
>
>> Can you check whether following fix works for you?
> 	Yes, it does.
>
> 	I fixed up the whitespace and made a minor change to add braces
> in what look like the correct places around the "if (reset_link)" block;
> the patch I tested with is below.  I'll also install this on another
> machine with hotplug capable slots to test there as well.
>
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 14bb8f54723e..db80e1ecb2dc 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -165,13 +165,24 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>  	pci_dbg(dev, "broadcast error_detected message\n");
>  	if (state == pci_channel_io_frozen) {
>  		pci_walk_bus(bus, report_frozen_detected, &status);
> -		status = reset_link(dev);
> -		if (status != PCI_ERS_RESULT_RECOVERED) {
> +		status = PCI_ERS_RESULT_NEED_RESET;
> +	} else {
> +		pci_walk_bus(bus, report_normal_detected, &status);
> +	}
> +
> +	if (status == PCI_ERS_RESULT_NEED_RESET) {
> +		if (reset_link) {
> +			if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED)
> +				status = PCI_ERS_RESULT_DISCONNECT;
> +		} else {
> +			if (pci_bus_error_reset(dev))
> +				status = PCI_ERS_RESULT_DISCONNECT;
> +		}
> +

PCI_ERS_RESULT_NEED_RESET may indicate that the driver requires a *slot* reset.
With this patch, it seems the later slot reset broadcast may not be performed:

    if (status == PCI_ERS_RESULT_NEED_RESET) {
        status = PCI_ERS_RESULT_RECOVERED;
        pci_dbg(dev, "broadcast slot_reset message\n");
        pci_walk_bus(bus, report_slot_reset, &status);
    }

One minor question: the callers of pcie_do_recovery() currently always pass a
reset_link pointer, so is the condition necessary?

Yicong

> +		if (status == PCI_ERS_RESULT_DISCONNECT) {
>  			pci_warn(dev, "link reset failed\n");
>  			goto failed;
>  		}
> -	} else {
> -		pci_walk_bus(bus, report_normal_detected, &status);
>  	}
>  
>  	if (status == PCI_ERS_RESULT_CAN_RECOVER) {



* Re: [PATCH] PCI/ERR: Resolve regression in pcie_do_recovery
  2020-04-30  0:42 [PATCH] PCI/ERR: Resolve regression in pcie_do_recovery Jay Vosburgh
  2020-04-30  1:15 ` Kuppuswamy, Sathyanarayanan
@ 2020-05-09  8:34 ` Yicong Yang
  1 sibling, 0 replies; 9+ messages in thread
From: Yicong Yang @ 2020-05-09  8:34 UTC
  To: Jay Vosburgh, linux-pci, Bjorn Helgaas, Kuppuswamy Sathyanarayanan
  Cc: liudongdong 00290354, Linuxarm

[ +cc dongdong as we met the issue ]

Hi,

The regression happened with our Intel 82599 network adapter. A DPC event
occurred on the device; recovery reported success, but the device was
not re-enabled, because we only reset the link and don't call the
driver-specific handlers (mmio enable/slot reset).

If we hit an error with I/O blocked, the logic should be:
1. try to reset the link
2. broadcast report_frozen_detected on the bus
3. do what the drivers suggest (mmio enable/slot reset)

Currently only step 1 is performed, and the function then returns
directly. I tried the patch below and it solves the issue. The
difference from Jay's patch is that it merges the return value into the
status, and it matches the logic I described above.

Regards,
Yicong

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 14bb8f5..6f8870c 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -164,12 +164,12 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 
        pci_dbg(dev, "broadcast error_detected message\n");
        if (state == pci_channel_io_frozen) {
-               pci_walk_bus(bus, report_frozen_detected, &status);
                status = reset_link(dev);
                if (status != PCI_ERS_RESULT_RECOVERED) {
                        pci_warn(dev, "link reset failed\n");
                        goto failed;
                }
+               pci_walk_bus(bus, report_frozen_detected, &status);
        } else {
                pci_walk_bus(bus, report_normal_detected, &status);
        }
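
A rough user-space model of the two orderings discussed above may help.
Everything in it is an assumption made for illustration: the enum, the
*_stub functions and the one-letter trace are not kernel code, the driver
is assumed to vote _NEED_RESET, and reset_link is assumed to succeed. It
only records which steps run in each ordering.

```c
#include <assert.h>
#include <string.h>

/* Local stand-ins for pci_ers_result_t values (illustration only). */
enum ers_result { ERS_CAN_RECOVER, ERS_NEED_RESET, ERS_DISCONNECT, ERS_RECOVERED };

/* One letter per step: R = reset_link, F = frozen_detected, S = slot_reset. */
static char trace[16];

static void note(char c)
{
	size_t n = strlen(trace);

	if (n + 1 < sizeof(trace)) {
		trace[n] = c;
		trace[n + 1] = '\0';
	}
}

/* Hypothetical stubs: the link reset succeeds, the driver asks for a reset. */
static enum ers_result reset_link_stub(void) { note('R'); return ERS_RECOVERED; }
static void frozen_detected_stub(enum ers_result *status) { note('F'); *status = ERS_NEED_RESET; }
static void slot_reset_stub(enum ers_result *status) { note('S'); *status = ERS_RECOVERED; }

/* Later stage: the slot_reset broadcast runs only for _NEED_RESET. */
static enum ers_result finish(enum ers_result status)
{
	if (status == ERS_NEED_RESET)
		slot_reset_stub(&status);
	return status;
}

/* Ordering after 6d2c89441571: the broadcast result is overwritten. */
static enum ers_result recover_broadcast_then_reset(void)
{
	enum ers_result status = ERS_CAN_RECOVER;

	trace[0] = '\0';
	frozen_detected_stub(&status);
	status = reset_link_stub();
	if (status != ERS_RECOVERED)
		return status;
	return finish(status);
}

/* Ordering proposed in this mail: reset first, then let the drivers vote. */
static enum ers_result recover_reset_then_broadcast(void)
{
	enum ers_result status;

	trace[0] = '\0';
	status = reset_link_stub();
	if (status != ERS_RECOVERED)
		return status;
	frozen_detected_stub(&status);
	return finish(status);
}

void ordering_demo(void)
{
	assert(recover_broadcast_then_reset() == ERS_RECOVERED);
	assert(strcmp(trace, "FR") == 0);	/* slot_reset never ran */
	assert(recover_reset_then_broadcast() == ERS_RECOVERED);
	assert(strcmp(trace, "RFS") == 0);	/* slot_reset ran */
}
```

Both orderings report _RECOVERED in this model, which matches the symptom
described above: recovery looks successful either way, but only the second
trace includes the slot_reset step.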

On 2020/4/30 8:42, Jay Vosburgh wrote:
> 	Commit 6d2c89441571 ("PCI/ERR: Update error status after
> reset_link()"), introduced a regression, as pcie_do_recovery will
> discard the status result from report_frozen_detected.  This can cause a
> failure to recover if _NEED_RESET is returned by report_frozen_detected
> and report_slot_reset is not invoked.
>
> 	Such an event can be induced for testing purposes by reducing
> the Max_Payload_Size of a PCIe bridge to less than that of a device
> downstream from the bridge, and then initiating I/O through the device,
> resulting in oversize transactions.  In the presence of DPC, this
> results in a containment event and attempted reset and recovery via
> pcie_do_recovery.  After 6d2c89441571 report_slot_reset is not invoked,
> and the device does not recover.
>
> 	Inspection shows a similar path is plausible for a return of
> _CAN_RECOVER and the invocation of report_mmio_enabled.
>
> 	Resolve this by preserving the result of report_frozen_detected if
> reset_link does not return _DISCONNECT.
>
> Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
> Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
>
> ---
>  drivers/pci/pcie/err.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 14bb8f54723e..e4274562f3a0 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -164,10 +164,17 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>  
>  	pci_dbg(dev, "broadcast error_detected message\n");
>  	if (state == pci_channel_io_frozen) {
> +		pci_ers_result_t status2;
> +
>  		pci_walk_bus(bus, report_frozen_detected, &status);
> -		status = reset_link(dev);
> -		if (status != PCI_ERS_RESULT_RECOVERED) {
> +		/* preserve status from report_frozen_detected to
> +		 * insure report_mmio_enabled or report_slot_reset are
> +		 * invoked even if reset_link returns _RECOVERED.
> +		 */
> +		status2 = reset_link(dev);
> +		if (status2 != PCI_ERS_RESULT_RECOVERED) {
>  			pci_warn(dev, "link reset failed\n");
> +			status = status2;
>  			goto failed;
>  		}
>  	} else {



* Re: [PATCH] PCI/ERR: Resolve regression in pcie_do_recovery
  2020-05-09  6:35       ` Yicong Yang
@ 2020-05-09 17:55         ` Kuppuswamy, Sathyanarayanan
  0 siblings, 0 replies; 9+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2020-05-09 17:55 UTC (permalink / raw)
  To: Yicong Yang, Jay Vosburgh; +Cc: linux-pci, Bjorn Helgaas



On 5/8/20 11:35 PM, Yicong Yang wrote:
> Hi Jay, Kuppuswamy
> 
> On 2020/5/1 4:41, Jay Vosburgh wrote:
>> "Kuppuswamy, Sathyanarayanan" wrote:
>>
>>> Hi Jay,
>>>
>>> On 4/29/20 6:15 PM, Kuppuswamy, Sathyanarayanan wrote:
>>>>
>>>> On 4/29/20 5:42 PM, Jay Vosburgh wrote:
>>>>>      Commit 6d2c89441571 ("PCI/ERR: Update error status after
>>>>> reset_link()"), introduced a regression, as pcie_do_recovery will
>>>>> discard the status result from report_frozen_detected.  This can cause a
>>>>> failure to recover if _NEED_RESET is returned by report_frozen_detected
>>>>> and report_slot_reset is not invoked.
>>>>>
>>>>>      Such an event can be induced for testing purposes by reducing
>>>>> the Max_Payload_Size of a PCIe bridge to less than that of a device
>>>>> downstream from the bridge, and then initiating I/O through the device,
>>>>> resulting in oversize transactions.  In the presence of DPC, this
>>>>> results in a containment event and attempted reset and recovery via
>>>>> pcie_do_recovery.  After 6d2c89441571 report_slot_reset is not invoked,
>>>>> and the device does not recover.
>>>> I think this issue is related to the issue discussed in following
>>>> thread (DPC non-hotplug support).
>>>>
>>>> https://lkml.org/lkml/2020/3/28/328
>>>>
>>>> If my assumption is correct, you are dealing with devices which are
>>>> not hotplug capable. If the devices are hotplug capable then you don't
>>>> need to proceed to report_slot_reset(), since hotplug handler will
>>>> remove/re-enumerate the devices correctly.
>> 	Correct, this particular device (a network card) is in a
>> non-hotplug slot.
>>
>>> Can you check whether following fix works for you?
>> 	Yes, it does.
>>
>> 	I fixed up the whitespace and made a minor change to add braces
>> in what look like the correct places around the "if (reset_link)" block;
>> the patch I tested with is below.  I'll also install this on another
>> machine with hotplug capable slots to test there as well.
>>
>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>> index 14bb8f54723e..db80e1ecb2dc 100644
>> --- a/drivers/pci/pcie/err.c
>> +++ b/drivers/pci/pcie/err.c
>> @@ -165,13 +165,24 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>   	pci_dbg(dev, "broadcast error_detected message\n");
>>   	if (state == pci_channel_io_frozen) {
>>   		pci_walk_bus(bus, report_frozen_detected, &status);
>> -		status = reset_link(dev);
>> -		if (status != PCI_ERS_RESULT_RECOVERED) {
>> +		status = PCI_ERS_RESULT_NEED_RESET;
>> +	} else {
>> +		pci_walk_bus(bus, report_normal_detected, &status);
>> +	}
>> +
>> +	if (status == PCI_ERS_RESULT_NEED_RESET) {
>> +		if (reset_link) {
>> +			if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED)
>> +				status = PCI_ERS_RESULT_DISCONNECT;
>> +		} else {
>> +			if (pci_bus_error_reset(dev))
>> +				status = PCI_ERS_RESULT_DISCONNECT;
>> +		}
>> +
> 
> The PCI_ERS_RESULT_NEED_RESET may indicate that the driver requires a *slot* reset.
> With this patch, seems later slot reset broadcast may not be performed.
A slot reset is skipped only if reset_link or pci_bus_error_reset
returns an error. Otherwise, we still broadcast slot_reset
(report_slot_reset) later.
> 
>      if (status == PCI_ERS_RESULT_NEED_RESET) {
>          status = PCI_ERS_RESULT_RECOVERED;
>          pci_dbg(dev, "broadcast slot_reset message\n");
>          pci_walk_bus(bus, report_slot_reset, &status);
>      }
> 
> One minor question, currently the callers of pcie_do_recovery() will always pass a
> reset_link pointer, so is the condition necessary?
Yes, currently we don't need it. I added it to cover future use cases.
But we can remove it if not needed.
> 
> Yicong
> 
>> +		if (status == PCI_ERS_RESULT_DISCONNECT) {
>>   			pci_warn(dev, "link reset failed\n");
>>   			goto failed;
>>   		}
>> -	} else {
>> -		pci_walk_bus(bus, report_normal_detected, &status);
>>   	}
>>   
>>   	if (status == PCI_ERS_RESULT_CAN_RECOVER) {
>>
>> 	-J
>>
>>> This includes support for bus_reset in recovery function itself.
>>>
>>> index 14bb8f54723e..c9eaab68ab7a 100644
>>> --- a/drivers/pci/pcie/err.c
>>> +++ b/drivers/pci/pcie/err.c
>>> @@ -165,13 +165,23 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>>         pci_dbg(dev, "broadcast error_detected message\n");
>>>         if (state == pci_channel_io_frozen) {
>>>                 pci_walk_bus(bus, report_frozen_detected, &status);
>>> -               status = reset_link(dev);
>>> -               if (status != PCI_ERS_RESULT_RECOVERED) {
>>> +               status = PCI_ERS_RESULT_NEED_RESET;
>>> +       } else {
>>> +               pci_walk_bus(bus, report_normal_detected, &status);
>>> +       }
>>> +
>>> +       if (status == PCI_ERS_RESULT_NEED_RESET) {
>>> +               if (reset_link)
>>> +                       if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED)
>>> +                               status = PCI_ERS_RESULT_DISCONNECT;
>>> +               else
>>> +                       if (pci_bus_error_reset(dev))
>>> +                               status = PCI_ERS_RESULT_DISCONNECT;
>>> +
>>> +               if (status == PCI_ERS_RESULT_DISCONNECT) {
>>>                         pci_warn(dev, "link reset failed\n");
>>>                         goto failed;
>>>                 }
>>> -       } else {
>>> -               pci_walk_bus(bus, report_normal_detected, &status);
>>>         }
>>>
>>>         if (status == PCI_ERS_RESULT_CAN_RECOVER) {
>>>
>>>
>>>>>      Inspection shows a similar path is plausible for a return of
>>>>> _CAN_RECOVER and the invocation of report_mmio_enabled.
>>>>>
>>>>>      Resolve this by preserving the result of report_frozen_detected if
>>>>> reset_link does not return _DISCONNECT.
>>>>>
>>>>> Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
>>>>> Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
>>>>>
>>>>> ---
>>>>>    drivers/pci/pcie/err.c | 11 +++++++++--
>>>>>    1 file changed, 9 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>>>>> index 14bb8f54723e..e4274562f3a0 100644
>>>>> --- a/drivers/pci/pcie/err.c
>>>>> +++ b/drivers/pci/pcie/err.c
>>>>> @@ -164,10 +164,17 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>>>>        pci_dbg(dev, "broadcast error_detected message\n");
>>>>>        if (state == pci_channel_io_frozen) {
>>>>> +        pci_ers_result_t status2;
>>>>> +
>>>>>            pci_walk_bus(bus, report_frozen_detected, &status);
>>>>> -        status = reset_link(dev);
>>>>> -        if (status != PCI_ERS_RESULT_RECOVERED) {
>>>>> +        /* preserve status from report_frozen_detected to
>>>>> +         * insure report_mmio_enabled or report_slot_reset are
>>>>> +         * invoked even if reset_link returns _RECOVERED.
>>>>> +         */
>>>>> +        status2 = reset_link(dev);
>>>>> +        if (status2 != PCI_ERS_RESULT_RECOVERED) {
>>>>>                pci_warn(dev, "link reset failed\n");
>>>>> +            status = status2;
>>>>>                goto failed;
>>>>>            }
>>>>>        } else {
>>>>>
>> ---
>> 	-Jay Vosburgh, jay.vosburgh@canonical.com
>> .
>>
> 


* Re: [PATCH] PCI/ERR: Resolve regression in pcie_do_recovery
       [not found] <20200506203249.GA453633@bjorn-Precision-5520>
@ 2020-05-07  0:56 ` Jay Vosburgh
  0 siblings, 0 replies; 9+ messages in thread
From: Jay Vosburgh @ 2020-05-07  0:56 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Kuppuswamy, Sathyanarayanan, linux-pci

Bjorn Helgaas <helgaas@kernel.org> wrote:

>On Wed, May 06, 2020 at 11:08:35AM -0700, Jay Vosburgh wrote:
>> Jay Vosburgh <jay.vosburgh@canonical.com> wrote:
>> 
>> >"Kuppuswamy, Sathyanarayanan" wrote:
>> >
>> >>Hi Jay,
>> >>
>> >>On 4/29/20 6:15 PM, Kuppuswamy, Sathyanarayanan wrote:
>> >>>
>> >>>
>> >>> On 4/29/20 5:42 PM, Jay Vosburgh wrote:
[...]
>> >>> I think this issue is related to the issue discussed in following
>> >>> thread (DPC non-hotplug support).
>> >>>
>> >>> https://lkml.org/lkml/2020/3/28/328
>> >>>
>> >>> If my assumption is correct, you are dealing with devices which are
>> >>> not hotplug capable. If the devices are hotplug capable then you don't
>> >>> need to proceed to report_slot_reset(), since hotplug handler will
>> >>> remove/re-enumerate the devices correctly.
>> >
>> >	Correct, this particular device (a network card) is in a
>> >non-hotplug slot.
>> >
>> >>Can you check whether following fix works for you?
>> >
>> >	Yes, it does.
>> >
>> >	I fixed up the whitespace and made a minor change to add braces
>> >in what look like the correct places around the "if (reset_link)" block;
>> >the patch I tested with is below.  I'll also install this on another
>> >machine with hotplug capable slots to test there as well.
>> 
>> 	We've tested the below patch on a couple of different machines
>> and devices (network card, NVMe device) and it appears to solve the
>> recovery issue in our testing.
>> 
>> 	Is there anything further we need to do, or can this be
>> considered for inclusion upstream at this time?
>
>Can somebody please post a clean version of what we should merge?
>There was the initial patch plus a follow-up fix, so it's not clear
>where we ended up.
>
>Bjorn

	Below is the patch we tested, from Sathyanarayanan's test patch
(slightly edited to clarify ambiguous "if else" nesting), along with an
edited version of the commit log from my original patch.  I have not
seen a Signed-off-by from Sathyanarayanan, so I didn't include one here.
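
	For readers wondering about the "if else" nesting mentioned above:
in C an else pairs with the nearest unmatched if, so the unbraced form in
the earlier test patch does not fall back to pci_bus_error_reset when
reset_link is NULL.  A small sketch spells out both readings with explicit
braces.  The stand-in functions (link_reset_ok, bus_reset) and the call
counter are made up here, and 0 means success for both, which is a
simplification of the kernel's return values.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*reset_fn)(void);

static int bus_reset_calls;

/* Hypothetical stand-ins; both return 0 on success. */
static int link_reset_ok(void) { return 0; }
static int bus_reset(void) { bus_reset_calls++; return 0; }

/*
 * How the compiler reads the unbraced form
 *
 *	if (reset_link)
 *		if (reset_link() != 0)
 *			failed = 1;
 *	else
 *		if (bus_reset())
 *			failed = 1;
 *
 * The else attaches to the inner if, whatever the indentation suggests.
 */
static int reset_as_parsed(reset_fn reset_link)
{
	int failed = 0;

	if (reset_link) {
		if (reset_link() != 0)
			failed = 1;
		else if (bus_reset())
			failed = 1;
	}
	return failed;
}

/* The braced form from the tested patch: bus reset only without a callback. */
static int reset_as_intended(reset_fn reset_link)
{
	int failed = 0;

	if (reset_link) {
		if (reset_link() != 0)
			failed = 1;
	} else {
		if (bus_reset())
			failed = 1;
	}
	return failed;
}

void nesting_demo(void)
{
	bus_reset_calls = 0;
	assert(reset_as_parsed(link_reset_ok) == 0);
	assert(bus_reset_calls == 1);	/* extra bus reset after a good link reset */
	assert(reset_as_parsed(NULL) == 0);
	assert(bus_reset_calls == 1);	/* no fallback when the callback is NULL */

	bus_reset_calls = 0;
	assert(reset_as_intended(link_reset_ok) == 0);
	assert(bus_reset_calls == 0);
	assert(reset_as_intended(NULL) == 0);
	assert(bus_reset_calls == 1);	/* fallback runs only without a callback */
}
```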

	One question I have is that, after the patch is applied, the
"status" filled in by pci_walk_bus(... report_frozen_detected ...) is
discarded regardless of its value.  Is that correct behavior in all
cases?  The original issue I was trying to solve was that the status set
by report_frozen_detected was thrown away starting with 6d2c89441571
("PCI/ERR: Update error status after reset_link()"), causing a
_NEED_RESET to be lost.  With the below patch, all cases of
pci_channel_io_frozen will call reset_link unconditionally.
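
	To make the question concrete, here is a minimal user-space sketch
of how the frozen branch settles on a status in the three variants
discussed in this thread.  The enum and function names are local stand-ins,
not the kernel's definitions, and the bus walk is reduced to a single
driver vote, which is a simplifying assumption.

```c
#include <assert.h>

/* Local stand-ins for pci_ers_result_t values (illustration only). */
enum ers_result { ERS_CAN_RECOVER, ERS_NEED_RESET, ERS_DISCONNECT, ERS_RECOVERED };

/* After 6d2c89441571: the reset_link result replaces the driver vote. */
static enum ers_result frozen_after_6d2c(enum ers_result vote, enum ers_result link)
{
	(void)vote;		/* the vote is collected, then discarded */
	return link;
}

/* Original patch in this thread: keep the vote unless the link reset fails. */
static enum ers_result frozen_preserve_vote(enum ers_result vote, enum ers_result link)
{
	if (link != ERS_RECOVERED)
		return link;
	return vote;
}

/* Tested patch: force _NEED_RESET for frozen, fail if the reset fails. */
static enum ers_result frozen_force_reset(enum ers_result vote, enum ers_result link)
{
	(void)vote;		/* discarded regardless of its value */
	if (link != ERS_RECOVERED)
		return ERS_DISCONNECT;
	return ERS_NEED_RESET;
}

void frozen_status_demo(void)
{
	/* Driver asks for a reset and the link reset succeeds. */
	assert(frozen_after_6d2c(ERS_NEED_RESET, ERS_RECOVERED) == ERS_RECOVERED);
	assert(frozen_preserve_vote(ERS_NEED_RESET, ERS_RECOVERED) == ERS_NEED_RESET);
	assert(frozen_force_reset(ERS_NEED_RESET, ERS_RECOVERED) == ERS_NEED_RESET);

	/* Driver says it can recover: the two fixes now disagree. */
	assert(frozen_preserve_vote(ERS_CAN_RECOVER, ERS_RECOVERED) == ERS_CAN_RECOVER);
	assert(frozen_force_reset(ERS_CAN_RECOVER, ERS_RECOVERED) == ERS_NEED_RESET);

	/* A failed link reset ends recovery in both fixes. */
	assert(frozen_preserve_vote(ERS_NEED_RESET, ERS_DISCONNECT) == ERS_DISCONNECT);
	assert(frozen_force_reset(ERS_NEED_RESET, ERS_DISCONNECT) == ERS_DISCONNECT);
}
```

	In this model a driver vote of _CAN_RECOVER would lead the
preserve variant on to report_mmio_enabled, while the force-reset variant
goes to the slot_reset broadcast instead.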

	-J

Subject: [PATCH] PCI/ERR: Resolve regression in pcie_do_recovery

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
 
	Commit 6d2c89441571 ("PCI/ERR: Update error status after
reset_link()") introduced a regression, as pcie_do_recovery will
discard the status result from report_frozen_detected.  This can cause a
failure to recover if _NEED_RESET is returned by report_frozen_detected
and report_slot_reset is not invoked.

	Such an event can be induced for testing purposes by reducing
the Max_Payload_Size of a PCIe bridge to less than that of a device
downstream from the bridge, and then initiating I/O through the device,
resulting in oversize transactions.  In the presence of DPC, this
results in a containment event and attempted reset and recovery via
pcie_do_recovery.  After 6d2c89441571 report_slot_reset is not invoked,
and the device does not recover.

Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")


diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 14bb8f54723e..db80e1ecb2dc 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -165,13 +165,24 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 	pci_dbg(dev, "broadcast error_detected message\n");
 	if (state == pci_channel_io_frozen) {
 		pci_walk_bus(bus, report_frozen_detected, &status);
-		status = reset_link(dev);
-		if (status != PCI_ERS_RESULT_RECOVERED) {
+		status = PCI_ERS_RESULT_NEED_RESET;
+	} else {
+		pci_walk_bus(bus, report_normal_detected, &status);
+	}
+
+	if (status == PCI_ERS_RESULT_NEED_RESET) {
+		if (reset_link) {
+			if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED)
+				status = PCI_ERS_RESULT_DISCONNECT;
+		} else {
+			if (pci_bus_error_reset(dev))
+				status = PCI_ERS_RESULT_DISCONNECT;
+		}
+
+		if (status == PCI_ERS_RESULT_DISCONNECT) {
 			pci_warn(dev, "link reset failed\n");
 			goto failed;
 		}
-	} else {
-		pci_walk_bus(bus, report_normal_detected, &status);
 	}
 
 	if (status == PCI_ERS_RESULT_CAN_RECOVER) {


---
	-Jay Vosburgh, jay.vosburgh@canonical.com


end of thread, last message 2020-05-09 17:55 UTC

Thread overview: 9+ messages
2020-04-30  0:42 [PATCH] PCI/ERR: Resolve regression in pcie_do_recovery Jay Vosburgh
2020-04-30  1:15 ` Kuppuswamy, Sathyanarayanan
2020-04-30 19:35   ` Kuppuswamy, Sathyanarayanan
2020-04-30 20:41     ` Jay Vosburgh
2020-05-06 18:08       ` Jay Vosburgh
2020-05-09  6:35       ` Yicong Yang
2020-05-09 17:55         ` Kuppuswamy, Sathyanarayanan
2020-05-09  8:34 ` Yicong Yang
     [not found] <20200506203249.GA453633@bjorn-Precision-5520>
2020-05-07  0:56 ` Jay Vosburgh
