* [PATCH v2 1/2] PCI/ERR: Fix fatal error recovery for non-hotplug capable devices
@ 2020-06-04 21:50 sathyanarayanan.kuppuswamy
  2020-06-04 21:50 ` [PATCH v2 2/2] PCI/ERR: Add reset support for non fatal errors sathyanarayanan.kuppuswamy
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: sathyanarayanan.kuppuswamy @ 2020-06-04 21:50 UTC (permalink / raw)
  To: bhelgaas
  Cc: linux-pci, linux-kernel, ashok.raj, sathyanarayanan.kuppuswamy,
	Jay Vosburgh

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

Fatal (DPC) error recovery is currently broken for non-hotplug
capable devices. With the current implementation, the state of a
non-hotplug capable device is not restored properly after a
successful fatal error recovery. Related reports are at the
following links.

https://lkml.org/lkml/2020/5/27/290
https://lore.kernel.org/linux-pci/12115.1588207324@famine/
https://lkml.org/lkml/2020/3/28/328

The current fatal error recovery implementation relies on the hotplug
handler to detach and re-enumerate the affected devices/drivers on
DLLSC state changes. So for non-hotplug capable devices, the recovery
code does not restore the state of the affected devices correctly.
A correct implementation should call report_slot_reset() after
resetting the link to restore the state of the device/driver.

So use PCI_ERS_RESULT_NEED_RESET as the error status when reset_link()
succeeds and PCI_ERS_RESULT_DISCONNECT when it fails. The
PCI_ERS_RESULT_NEED_RESET status ensures that ->slot_reset() is called
after the link reset, which fixes the issue described above.

[original patch is from jay.vosburgh@canonical.com]
[original patch link https://lore.kernel.org/linux-pci/12115.1588207324@famine/]
Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 drivers/pci/pcie/err.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 14bb8f54723e..5fe8561c7185 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -165,8 +165,28 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 	pci_dbg(dev, "broadcast error_detected message\n");
 	if (state == pci_channel_io_frozen) {
 		pci_walk_bus(bus, report_frozen_detected, &status);
-		status = reset_link(dev);
-		if (status != PCI_ERS_RESULT_RECOVERED) {
+		/*
+		 * After resetting the link using reset_link() call, the
+		 * possible value of error status is either
+		 * PCI_ERS_RESULT_DISCONNECT (failure case) or
+		 * PCI_ERS_RESULT_NEED_RESET (success case).
+		 * So ignore the return value of report_error_detected()
+		 * call for fatal errors. Instead use
+		 * PCI_ERS_RESULT_NEED_RESET as initial status value.
+		 *
+		 * Ignoring the status return value of report_error_detected()
+		 * call will also help in case of EDR mode based error
+		 * recovery. In EDR mode AER and DPC Capabilities are owned by
+		 * firmware and hence report_error_detected() call will possibly
+		 * return PCI_ERS_RESULT_NO_AER_DRIVER. So if we don't ignore
+		 * the return value of report_error_detected() then
+		 * pcie_do_recovery() would report incorrect status after
+		 * successful recovery. Ignoring PCI_ERS_RESULT_NO_AER_DRIVER
+		 * in non EDR case should not have any functional impact.
+		 */
+		status = PCI_ERS_RESULT_NEED_RESET;
+		if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED) {
+			status = PCI_ERS_RESULT_DISCONNECT;
 			pci_warn(dev, "link reset failed\n");
 			goto failed;
 		}
-- 
2.17.1



* [PATCH v2 2/2] PCI/ERR: Add reset support for non fatal errors
  2020-06-04 21:50 [PATCH v2 1/2] PCI/ERR: Fix fatal error recovery for non-hotplug capable devices sathyanarayanan.kuppuswamy
@ 2020-06-04 21:50 ` sathyanarayanan.kuppuswamy
  2020-06-28 12:57   ` Yicong Yang
  2020-06-05  4:47 ` [PATCH v2 1/2] PCI/ERR: Fix fatal error recovery for non-hotplug capable devices Jay Vosburgh
  2020-07-14 23:08 ` Bjorn Helgaas
  2 siblings, 1 reply; 8+ messages in thread
From: sathyanarayanan.kuppuswamy @ 2020-06-04 21:50 UTC (permalink / raw)
  To: bhelgaas; +Cc: linux-pci, linux-kernel, ashok.raj, sathyanarayanan.kuppuswamy

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

The PCI_ERS_RESULT_NEED_RESET error status implies that the device is
requesting a slot reset and a notification about slot reset
completion via the ->slot_reset() callback.

But in the non-fatal error case, if report_error_detected() or
report_mmio_enabled() requests PCI_ERS_RESULT_NEED_RESET, the current
pcie_do_recovery() implementation does not perform the requested slot
reset; it just calls the ->slot_reset() callback on all affected
devices. Notifying drivers about slot reset completion without
actually resetting the slot is incorrect. So add this support.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 drivers/pci/pcie/err.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 5fe8561c7185..94d1c2ff7b40 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -206,6 +206,9 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 		 * functions to reset slot before calling
 		 * drivers' slot_reset callbacks?
 		 */
+		if (state != pci_channel_io_frozen)
+			pci_reset_bus(dev);
+
 		status = PCI_ERS_RESULT_RECOVERED;
 		pci_dbg(dev, "broadcast slot_reset message\n");
 		pci_walk_bus(bus, report_slot_reset, &status);
-- 
2.17.1



* Re: [PATCH v2 1/2] PCI/ERR: Fix fatal error recovery for non-hotplug capable devices
  2020-06-04 21:50 [PATCH v2 1/2] PCI/ERR: Fix fatal error recovery for non-hotplug capable devices sathyanarayanan.kuppuswamy
  2020-06-04 21:50 ` [PATCH v2 2/2] PCI/ERR: Add reset support for non fatal errors sathyanarayanan.kuppuswamy
@ 2020-06-05  4:47 ` Jay Vosburgh
  2020-06-24 18:52   ` Jay Vosburgh
  2020-07-14 23:08 ` Bjorn Helgaas
  2 siblings, 1 reply; 8+ messages in thread
From: Jay Vosburgh @ 2020-06-05  4:47 UTC (permalink / raw)
  To: sathyanarayanan.kuppuswamy; +Cc: bhelgaas, linux-pci, linux-kernel, ashok.raj

sathyanarayanan.kuppuswamy@linux.intel.com wrote:

>From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>
>Fatal (DPC) error recovery is currently broken for non-hotplug
>capable devices. With current implementation, after successful
>fatal error recovery, non-hotplug capable device state won't be
>restored properly. You can find related issues in following links.
>
>https://lkml.org/lkml/2020/5/27/290
>https://lore.kernel.org/linux-pci/12115.1588207324@famine/
>https://lkml.org/lkml/2020/3/28/328
>
>Current fatal error recovery implementation relies on hotplug handler
>for detaching/re-enumerating the affected devices/drivers on DLLSC
>state changes. So when dealing with non-hotplug capable devices,
>recovery code does not restore the state of the affected devices
>correctly. Correct implementation should call report_slot_reset()
>function after resetting the link to restore the state of the
>device/driver.
>
>So use PCI_ERS_RESULT_NEED_RESET as error status for successful
>reset_link() operation and use PCI_ERS_RESULT_DISCONNECT for failure
>case. PCI_ERS_RESULT_NEED_RESET error state will ensure slot_reset()
>is called after reset link operation which will also fix the above
>mentioned issue.
>
>[original patch is from jay.vosburgh@canonical.com]
>[original patch link https://lore.kernel.org/linux-pci/12115.1588207324@famine/]
>Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
>Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
>Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

	I've tested this patch set on one of our test machines, and it
resolves the issue.  I plan to test with other systems tomorrow.

	-J

>---
> drivers/pci/pcie/err.c | 24 ++++++++++++++++++++++--
> 1 file changed, 22 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>index 14bb8f54723e..5fe8561c7185 100644
>--- a/drivers/pci/pcie/err.c
>+++ b/drivers/pci/pcie/err.c
>@@ -165,8 +165,28 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> 	pci_dbg(dev, "broadcast error_detected message\n");
> 	if (state == pci_channel_io_frozen) {
> 		pci_walk_bus(bus, report_frozen_detected, &status);
>-		status = reset_link(dev);
>-		if (status != PCI_ERS_RESULT_RECOVERED) {
>+		/*
>+		 * After resetting the link using reset_link() call, the
>+		 * possible value of error status is either
>+		 * PCI_ERS_RESULT_DISCONNECT (failure case) or
>+		 * PCI_ERS_RESULT_NEED_RESET (success case).
>+		 * So ignore the return value of report_error_detected()
>+		 * call for fatal errors. Instead use
>+		 * PCI_ERS_RESULT_NEED_RESET as initial status value.
>+		 *
>+		 * Ignoring the status return value of report_error_detected()
>+		 * call will also help in case of EDR mode based error
>+		 * recovery. In EDR mode AER and DPC Capabilities are owned by
>+		 * firmware and hence report_error_detected() call will possibly
>+		 * return PCI_ERS_RESULT_NO_AER_DRIVER. So if we don't ignore
>+		 * the return value of report_error_detected() then
>+		 * pcie_do_recovery() would report incorrect status after
>+		 * successful recovery. Ignoring PCI_ERS_RESULT_NO_AER_DRIVER
>+		 * in non EDR case should not have any functional impact.
>+		 */
>+		status = PCI_ERS_RESULT_NEED_RESET;
>+		if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED) {
>+			status = PCI_ERS_RESULT_DISCONNECT;
> 			pci_warn(dev, "link reset failed\n");
> 			goto failed;
> 		}
>-- 
>2.17.1
>

---
	-Jay Vosburgh, jay.vosburgh@canonical.com


* Re: [PATCH v2 1/2] PCI/ERR: Fix fatal error recovery for non-hotplug capable devices
  2020-06-05  4:47 ` [PATCH v2 1/2] PCI/ERR: Fix fatal error recovery for non-hotplug capable devices Jay Vosburgh
@ 2020-06-24 18:52   ` Jay Vosburgh
  2020-06-28 12:59     ` Yicong Yang
  0 siblings, 1 reply; 8+ messages in thread
From: Jay Vosburgh @ 2020-06-24 18:52 UTC (permalink / raw)
  To: sathyanarayanan.kuppuswamy
  Cc: bhelgaas, linux-pci, linux-kernel, ashok.raj, Yicong Yang

Jay Vosburgh <jay.vosburgh@canonical.com> wrote:

>sathyanarayanan.kuppuswamy@linux.intel.com wrote:
>
>>From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>>
>>Fatal (DPC) error recovery is currently broken for non-hotplug
>>capable devices. With current implementation, after successful
>>fatal error recovery, non-hotplug capable device state won't be
>>restored properly. You can find related issues in following links.
>>
>>https://lkml.org/lkml/2020/5/27/290
>>https://lore.kernel.org/linux-pci/12115.1588207324@famine/
>>https://lkml.org/lkml/2020/3/28/328
>>
>>Current fatal error recovery implementation relies on hotplug handler
>>for detaching/re-enumerating the affected devices/drivers on DLLSC
>>state changes. So when dealing with non-hotplug capable devices,
>>recovery code does not restore the state of the affected devices
>>correctly. Correct implementation should call report_slot_reset()
>>function after resetting the link to restore the state of the
>>device/driver.
>>
>>So use PCI_ERS_RESULT_NEED_RESET as error status for successful
>>reset_link() operation and use PCI_ERS_RESULT_DISCONNECT for failure
>>case. PCI_ERS_RESULT_NEED_RESET error state will ensure slot_reset()
>>is called after reset link operation which will also fix the above
>>mentioned issue.
>>
>>[original patch is from jay.vosburgh@canonical.com]
>>[original patch link https://lore.kernel.org/linux-pci/12115.1588207324@famine/]
>>Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
>>Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
>>Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>
>	I've tested this patch set on one of our test machines, and it
>resolves the issue.  I plan to test with other systems tomorrow.

	I've done testing on two different systems that exhibit the
original issue and this patch set appears to behave as expected.

	Has anyone else (Yicong?) had an opportunity to test this?

	Can this be considered for acceptance, or is additional feedback
or review needed?

	-J

>>---
>> drivers/pci/pcie/err.c | 24 ++++++++++++++++++++++--
>> 1 file changed, 22 insertions(+), 2 deletions(-)
>>
>>diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>>index 14bb8f54723e..5fe8561c7185 100644
>>--- a/drivers/pci/pcie/err.c
>>+++ b/drivers/pci/pcie/err.c
>>@@ -165,8 +165,28 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>> 	pci_dbg(dev, "broadcast error_detected message\n");
>> 	if (state == pci_channel_io_frozen) {
>> 		pci_walk_bus(bus, report_frozen_detected, &status);
>>-		status = reset_link(dev);
>>-		if (status != PCI_ERS_RESULT_RECOVERED) {
>>+		/*
>>+		 * After resetting the link using reset_link() call, the
>>+		 * possible value of error status is either
>>+		 * PCI_ERS_RESULT_DISCONNECT (failure case) or
>>+		 * PCI_ERS_RESULT_NEED_RESET (success case).
>>+		 * So ignore the return value of report_error_detected()
>>+		 * call for fatal errors. Instead use
>>+		 * PCI_ERS_RESULT_NEED_RESET as initial status value.
>>+		 *
>>+		 * Ignoring the status return value of report_error_detected()
>>+		 * call will also help in case of EDR mode based error
>>+		 * recovery. In EDR mode AER and DPC Capabilities are owned by
>>+		 * firmware and hence report_error_detected() call will possibly
>>+		 * return PCI_ERS_RESULT_NO_AER_DRIVER. So if we don't ignore
>>+		 * the return value of report_error_detected() then
>>+		 * pcie_do_recovery() would report incorrect status after
>>+		 * successful recovery. Ignoring PCI_ERS_RESULT_NO_AER_DRIVER
>>+		 * in non EDR case should not have any functional impact.
>>+		 */
>>+		status = PCI_ERS_RESULT_NEED_RESET;
>>+		if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED) {
>>+			status = PCI_ERS_RESULT_DISCONNECT;
>> 			pci_warn(dev, "link reset failed\n");
>> 			goto failed;
>> 		}
>>-- 
>>2.17.1

---
	-Jay Vosburgh, jay.vosburgh@canonical.com


* Re: [PATCH v2 2/2] PCI/ERR: Add reset support for non fatal errors
  2020-06-04 21:50 ` [PATCH v2 2/2] PCI/ERR: Add reset support for non fatal errors sathyanarayanan.kuppuswamy
@ 2020-06-28 12:57   ` Yicong Yang
  0 siblings, 0 replies; 8+ messages in thread
From: Yicong Yang @ 2020-06-28 12:57 UTC (permalink / raw)
  To: sathyanarayanan.kuppuswamy, bhelgaas; +Cc: linux-pci, linux-kernel, ashok.raj

Hi Sathy,

One minor comment below.

On 2020/6/5 5:50, sathyanarayanan.kuppuswamy@linux.intel.com wrote:
> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>
> PCI_ERS_RESULT_NEED_RESET error status implies the device is
> requesting a slot reset and a notification about slot reset
> completion via ->slot_reset() callback.
>
> But in non-fatal errors case, if report_error_detected() or
> report_mmio_enabled() functions requests PCI_ERS_RESULT_NEED_RESET
> then current pcie_do_recovery() implementation does not do the
> requested explicit slot reset, instead just calls the ->slot_reset()
> callback on all affected devices. Notifying about the slot reset
> completion without resetting it incorrect. So add this support.
>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  drivers/pci/pcie/err.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 5fe8561c7185..94d1c2ff7b40 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -206,6 +206,9 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>  		 * functions to reset slot before calling
>  		 * drivers' slot_reset callbacks?
>  		 */
> +		if (state != pci_channel_io_frozen)
> +			pci_reset_bus(dev);
> +

If this is the implementation that resets the slot, should we remove the TODO comment?
JYI.

Thanks,
Yicong


>  		status = PCI_ERS_RESULT_RECOVERED;
>  		pci_dbg(dev, "broadcast slot_reset message\n");
>  		pci_walk_bus(bus, report_slot_reset, &status);
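
The effect of this hunk on the NEED_RESET phase can be sketched as a
small userspace model. This is not kernel code: enum ers_result,
enum channel_state, struct reset_trace and model_need_reset_phase()
are hypothetical names that only mirror the flow in the diff, with
pci_reset_bus() and the ->slot_reset() walk reduced to counters.

```c
#include <assert.h>	/* for callers that assert on the results */
#include <stdbool.h>

/* Stand-in for pci_ers_result_t; the numeric values are arbitrary. */
enum ers_result {
	ERS_NONE,
	ERS_CAN_RECOVER,
	ERS_NEED_RESET,
	ERS_DISCONNECT,
	ERS_RECOVERED,
	ERS_NO_AER_DRIVER
};

/* Stand-in for the channel state passed to pcie_do_recovery(). */
enum channel_state {
	CHANNEL_IO_NORMAL,	/* non-fatal error */
	CHANNEL_IO_FROZEN	/* fatal error, link already reset */
};

struct reset_trace {
	int bus_resets;		/* models calls to pci_reset_bus() */
	int slot_reset_calls;	/* models the report_slot_reset walk */
};

/*
 * Model of the "status == NEED_RESET" phase only.
 * patched = false: callbacks are notified without any reset.
 * patched = true:  a bus reset is done first for non-frozen channels.
 */
struct reset_trace model_need_reset_phase(enum channel_state state,
					  enum ers_result status,
					  bool patched)
{
	struct reset_trace t = { 0, 0 };

	if (status != ERS_NEED_RESET)
		return t;

	if (patched && state != CHANNEL_IO_FROZEN)
		t.bus_resets++;

	t.slot_reset_calls++;
	return t;
}
```

In this model a non-fatal NEED_RESET request produces one bus reset
followed by one callback round when patched, and a callback round with
no reset when unpatched. The frozen case is unchanged by the hunk.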



* Re: [PATCH v2 1/2] PCI/ERR: Fix fatal error recovery for non-hotplug capable devices
  2020-06-24 18:52   ` Jay Vosburgh
@ 2020-06-28 12:59     ` Yicong Yang
  0 siblings, 0 replies; 8+ messages in thread
From: Yicong Yang @ 2020-06-28 12:59 UTC (permalink / raw)
  To: Jay Vosburgh, sathyanarayanan.kuppuswamy
  Cc: bhelgaas, linux-pci, linux-kernel, ashok.raj

Hi Jay,

I've tested the patches on my board, and they work well.

Thanks,
Yicong


On 2020/6/25 2:52, Jay Vosburgh wrote:
> Jay Vosburgh <jay.vosburgh@canonical.com> wrote:
>
>> sathyanarayanan.kuppuswamy@linux.intel.com wrote:
>>
>>> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>>> Fatal (DPC) error recovery is currently broken for non-hotplug
>>> capable devices. With current implementation, after successful
>>> fatal error recovery, non-hotplug capable device state won't be
>>> restored properly. You can find related issues in following links.
>>>
>>> https://lkml.org/lkml/2020/5/27/290
>>> https://lore.kernel.org/linux-pci/12115.1588207324@famine/
>>> https://lkml.org/lkml/2020/3/28/328
>>>
>>> Current fatal error recovery implementation relies on hotplug handler
>>> for detaching/re-enumerating the affected devices/drivers on DLLSC
>>> state changes. So when dealing with non-hotplug capable devices,
>>> recovery code does not restore the state of the affected devices
>>> correctly. Correct implementation should call report_slot_reset()
>>> function after resetting the link to restore the state of the
>>> device/driver.
>>>
>>> So use PCI_ERS_RESULT_NEED_RESET as error status for successful
>>> reset_link() operation and use PCI_ERS_RESULT_DISCONNECT for failure
>>> case. PCI_ERS_RESULT_NEED_RESET error state will ensure slot_reset()
>>> is called after reset link operation which will also fix the above
>>> mentioned issue.
>>>
>>> [original patch is from jay.vosburgh@canonical.com]
>>> [original patch link https://lore.kernel.org/linux-pci/12115.1588207324@famine/]
>>> Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
>>> Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
>>> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>> 	I've tested this patch set on one of our test machines, and it
>> resolves the issue.  I plan to test with other systems tomorrow.
> 	I've done testing on two different systems that exhibit the
> original issue and this patch set appears to behave as expected.
>
> 	Has anyone else (Yicong?) had an opportunity to test this?
>
> 	Can this be considered for acceptance, or is additional feedback
> or review needed?
>
> 	-J
>
>>> ---
>>> drivers/pci/pcie/err.c | 24 ++++++++++++++++++++++--
>>> 1 file changed, 22 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>>> index 14bb8f54723e..5fe8561c7185 100644
>>> --- a/drivers/pci/pcie/err.c
>>> +++ b/drivers/pci/pcie/err.c
>>> @@ -165,8 +165,28 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>> 	pci_dbg(dev, "broadcast error_detected message\n");
>>> 	if (state == pci_channel_io_frozen) {
>>> 		pci_walk_bus(bus, report_frozen_detected, &status);
>>> -		status = reset_link(dev);
>>> -		if (status != PCI_ERS_RESULT_RECOVERED) {
>>> +		/*
>>> +		 * After resetting the link using reset_link() call, the
>>> +		 * possible value of error status is either
>>> +		 * PCI_ERS_RESULT_DISCONNECT (failure case) or
>>> +		 * PCI_ERS_RESULT_NEED_RESET (success case).
>>> +		 * So ignore the return value of report_error_detected()
>>> +		 * call for fatal errors. Instead use
>>> +		 * PCI_ERS_RESULT_NEED_RESET as initial status value.
>>> +		 *
>>> +		 * Ignoring the status return value of report_error_detected()
>>> +		 * call will also help in case of EDR mode based error
>>> +		 * recovery. In EDR mode AER and DPC Capabilities are owned by
>>> +		 * firmware and hence report_error_detected() call will possibly
>>> +		 * return PCI_ERS_RESULT_NO_AER_DRIVER. So if we don't ignore
>>> +		 * the return value of report_error_detected() then
>>> +		 * pcie_do_recovery() would report incorrect status after
>>> +		 * successful recovery. Ignoring PCI_ERS_RESULT_NO_AER_DRIVER
>>> +		 * in non EDR case should not have any functional impact.
>>> +		 */
>>> +		status = PCI_ERS_RESULT_NEED_RESET;
>>> +		if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED) {
>>> +			status = PCI_ERS_RESULT_DISCONNECT;
>>> 			pci_warn(dev, "link reset failed\n");
>>> 			goto failed;
>>> 		}
>>> -- 
>>> 2.17.1
> ---
> 	-Jay Vosburgh, jay.vosburgh@canonical.com
> .
>



* Re: [PATCH v2 1/2] PCI/ERR: Fix fatal error recovery for non-hotplug capable devices
  2020-06-04 21:50 [PATCH v2 1/2] PCI/ERR: Fix fatal error recovery for non-hotplug capable devices sathyanarayanan.kuppuswamy
  2020-06-04 21:50 ` [PATCH v2 2/2] PCI/ERR: Add reset support for non fatal errors sathyanarayanan.kuppuswamy
  2020-06-05  4:47 ` [PATCH v2 1/2] PCI/ERR: Fix fatal error recovery for non-hotplug capable devices Jay Vosburgh
@ 2020-07-14 23:08 ` Bjorn Helgaas
  2020-07-16  1:54   ` Kuppuswamy, Sathyanarayanan
  2 siblings, 1 reply; 8+ messages in thread
From: Bjorn Helgaas @ 2020-07-14 23:08 UTC (permalink / raw)
  To: sathyanarayanan.kuppuswamy
  Cc: bhelgaas, linux-pci, linux-kernel, ashok.raj, Jay Vosburgh

On Thu, Jun 04, 2020 at 02:50:01PM -0700, sathyanarayanan.kuppuswamy@linux.intel.com wrote:
> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> 
> Fatal (DPC) error recovery is currently broken for non-hotplug
> capable devices. With current implementation, after successful
> fatal error recovery, non-hotplug capable device state won't be
> restored properly. You can find related issues in following links.
> 
> https://lkml.org/lkml/2020/5/27/290
> https://lore.kernel.org/linux-pci/12115.1588207324@famine/
> https://lkml.org/lkml/2020/3/28/328

Can you please convert these all to lore.kernel.org links?  lkml.org
is not quite as useful or reliable.

> Current fatal error recovery implementation relies on hotplug handler
> for detaching/re-enumerating the affected devices/drivers on DLLSC
> state changes. 

Can you remind us exactly how this relies on hotplug?  I know it
*does*, but I can't remember how.  It would sure be nice if we could
decouple this from pciehp somehow.

> So when dealing with non-hotplug capable devices,
> recovery code does not restore the state of the affected devices
> correctly. Correct implementation should call report_slot_reset()
> function after resetting the link to restore the state of the
> device/driver.

We don't restore the state correctly.  What does this look like to the
user?  Does the device not work?

> So use PCI_ERS_RESULT_NEED_RESET as error status for successful
> reset_link() operation and use PCI_ERS_RESULT_DISCONNECT for failure
> case. PCI_ERS_RESULT_NEED_RESET error state will ensure slot_reset()
> is called after reset link operation which will also fix the above
> mentioned issue.

I think PCI_ERS_RESULT_NEED_RESET results in calling driver
->slot_reset() callbacks, right?  Where does the state restoration
happen?

No, I guess it must be something in the hotplug driver that restores
the state, because you said devices below hotplug-capable ports work
correctly, but others don't.

> [original patch is from jay.vosburgh@canonical.com]
> [original patch link https://lore.kernel.org/linux-pci/12115.1588207324@famine/]
> Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
> Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  drivers/pci/pcie/err.c | 24 ++++++++++++++++++++++--
>  1 file changed, 22 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 14bb8f54723e..5fe8561c7185 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -165,8 +165,28 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>  	pci_dbg(dev, "broadcast error_detected message\n");
>  	if (state == pci_channel_io_frozen) {
>  		pci_walk_bus(bus, report_frozen_detected, &status);
> -		status = reset_link(dev);
> -		if (status != PCI_ERS_RESULT_RECOVERED) {
> +		/*
> +		 * After resetting the link using reset_link() call, the
> +		 * possible value of error status is either
> +		 * PCI_ERS_RESULT_DISCONNECT (failure case) or
> +		 * PCI_ERS_RESULT_NEED_RESET (success case).
> +		 * So ignore the return value of report_error_detected()
> +		 * call for fatal errors. Instead use
> +		 * PCI_ERS_RESULT_NEED_RESET as initial status value.
> +		 *
> +		 * Ignoring the status return value of report_error_detected()
> +		 * call will also help in case of EDR mode based error
> +		 * recovery. In EDR mode AER and DPC Capabilities are owned by
> +		 * firmware and hence report_error_detected() call will possibly
> +		 * return PCI_ERS_RESULT_NO_AER_DRIVER. So if we don't ignore
> +		 * the return value of report_error_detected() then
> +		 * pcie_do_recovery() would report incorrect status after
> +		 * successful recovery. Ignoring PCI_ERS_RESULT_NO_AER_DRIVER
> +		 * in non EDR case should not have any functional impact.

I can't make sense out of the comment.  We already ignore the "status"
from pci_walk_bus(bus, report_frozen_detected, &status).

No idea what to make of the second paragraph.  If we make the commit
log make sense, maybe some summary of that would be useful here.

I think this code is equivalent and makes the patch much clearer:

  status = reset_link(dev);
  if (status == PCI_ERS_RESULT_RECOVERED) {
    status = PCI_ERS_RESULT_NEED_RESET;
  } else {
    status = PCI_ERS_RESULT_DISCONNECT;
    goto failed;
  }
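
One way to check the equivalence claim is a small userspace model that
runs both shapes over every possible reset_link() result. This is only
a sketch: enum ers_result, struct outcome, status_v2(), status_review()
and shapes_agree() are hypothetical names, "goto failed" is modelled as
a boolean, and the pci_warn() call (present in the v2 hunk, not shown
in the snippet above) is left out of the comparison.

```c
#include <assert.h>	/* for callers that assert on the results */
#include <stdbool.h>

/* Stand-in for pci_ers_result_t; the numeric values are arbitrary. */
enum ers_result {
	ERS_NONE,
	ERS_CAN_RECOVER,
	ERS_NEED_RESET,
	ERS_DISCONNECT,
	ERS_RECOVERED,
	ERS_NO_AER_DRIVER,
	ERS_COUNT		/* number of values, not a real status */
};

struct outcome {
	enum ers_result status;
	bool failed;		/* models taking "goto failed" */
};

/* Shape used in the v2 patch: assume success, downgrade on failure. */
struct outcome status_v2(enum ers_result link_result)
{
	struct outcome o = { ERS_NEED_RESET, false };

	if (link_result != ERS_RECOVERED) {
		o.status = ERS_DISCONNECT;
		o.failed = true;
	}
	return o;
}

/* Shape suggested in review: branch on the reset_link() result. */
struct outcome status_review(enum ers_result link_result)
{
	struct outcome o = { link_result, false };

	if (o.status == ERS_RECOVERED) {
		o.status = ERS_NEED_RESET;
	} else {
		o.status = ERS_DISCONNECT;
		o.failed = true;
	}
	return o;
}

/* Returns true if both shapes agree for every modelled input. */
bool shapes_agree(void)
{
	for (int r = ERS_NONE; r < ERS_COUNT; r++) {
		struct outcome a = status_v2((enum ers_result)r);
		struct outcome b = status_review((enum ers_result)r);

		if (a.status != b.status || a.failed != b.failed)
			return false;
	}
	return true;
}
```

A caller can simply do assert(shapes_agree()); in this model the two
shapes differ only in how readable the resulting diff is.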

> +		 */
> +		status = PCI_ERS_RESULT_NEED_RESET;
> +		if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED) {
> +			status = PCI_ERS_RESULT_DISCONNECT;
>  			pci_warn(dev, "link reset failed\n");
>  			goto failed;
>  		}
> -- 
> 2.17.1
> 


* Re: [PATCH v2 1/2] PCI/ERR: Fix fatal error recovery for non-hotplug capable devices
  2020-07-14 23:08 ` Bjorn Helgaas
@ 2020-07-16  1:54   ` Kuppuswamy, Sathyanarayanan
  0 siblings, 0 replies; 8+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2020-07-16  1:54 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: bhelgaas, linux-pci, linux-kernel, ashok.raj, Jay Vosburgh



On 7/14/20 4:08 PM, Bjorn Helgaas wrote:
> On Thu, Jun 04, 2020 at 02:50:01PM -0700, sathyanarayanan.kuppuswamy@linux.intel.com wrote:
>> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>>
>> Fatal (DPC) error recovery is currently broken for non-hotplug
>> capable devices. With current implementation, after successful
>> fatal error recovery, non-hotplug capable device state won't be
>> restored properly. You can find related issues in following links.
>>
>> https://lkml.org/lkml/2020/5/27/290
>> https://lore.kernel.org/linux-pci/12115.1588207324@famine/
>> https://lkml.org/lkml/2020/3/28/328
> 
> Can you please convert these all to lore.kernel.org links?  lkml.org
> is not quite as useful or reliable.
Ok. I will fix it in the next version.
https://lore.kernel.org/linux-pci/20200527083130.4137-1-Zhiqiang.Hou@nxp.com/
https://lore.kernel.org/linux-pci/12115.1588207324@famine/
https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.1585000084.git.sathyanarayanan.kuppuswamy@linux.intel.com/
> 
>> Current fatal error recovery implementation relies on hotplug handler
>> for detaching/re-enumerating the affected devices/drivers on DLLSC
>> state changes.
> 
> Can you remind us exactly how this relies on hotplug?  I know it
> *does*, but I can't remember how.  It would sure be nice if we could
> decouple this from pciehp somehow.
On a platform that supports PCIe native hotplug, once the fatal
error disables the link, we get a DLLSC state change interrupt. On
the DLLSC_DOWN event the pciehp driver removes the affected device
and detaches the driver.

For platforms that do not support PCIe hotplug, fatal error recovery
is currently broken. After reset and recovery, the device config
space is not restored properly. We expect the call to ->slot_reset()
to fix this issue.
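
To make the difference concrete, here is a rough userspace model of
the frozen (fatal) path through pcie_do_recovery(), before and after
this patch. It is not kernel code: enum ers_result, struct
recovery_trace and model_fatal_recovery() are hypothetical names, the
driver's ->slot_reset() is assumed to succeed, and the report_*
bus walks are reduced to a counter.

```c
#include <assert.h>	/* for callers that assert on the results */
#include <stdbool.h>

/* Stand-in for pci_ers_result_t; the numeric values are arbitrary. */
enum ers_result {
	ERS_NONE,
	ERS_CAN_RECOVER,
	ERS_NEED_RESET,
	ERS_DISCONNECT,
	ERS_RECOVERED,
	ERS_NO_AER_DRIVER
};

struct recovery_trace {
	enum ers_result final_status;
	int slot_reset_calls;	/* times the ->slot_reset() phase ran */
	bool failed;		/* models reaching the "failed" label */
};

/*
 * link_result: what reset_link() is assumed to return.
 * patched = false: status is taken directly from reset_link().
 * patched = true:  success maps to NEED_RESET, failure to DISCONNECT.
 */
struct recovery_trace model_fatal_recovery(enum ers_result link_result,
					   bool patched)
{
	struct recovery_trace t = { ERS_NONE, 0, false };
	enum ers_result status;

	if (patched)
		status = (link_result == ERS_RECOVERED) ?
			 ERS_NEED_RESET : ERS_DISCONNECT;
	else
		status = link_result;

	/* Link reset failed: both variants bail out here. */
	if (link_result != ERS_RECOVERED) {
		t.final_status = status;
		t.failed = true;
		return t;
	}

	/* Only NEED_RESET leads to the ->slot_reset() callbacks. */
	if (status == ERS_NEED_RESET) {
		status = ERS_RECOVERED;	/* assumed driver result */
		t.slot_reset_calls++;
	}

	t.final_status = status;
	t.failed = (status != ERS_RECOVERED);
	return t;
}
```

In this model, model_fatal_recovery(ERS_RECOVERED, false) finishes as
recovered with slot_reset_calls == 0, so the driver never gets a
chance to restore state. With patched == true the same input gives
slot_reset_calls == 1.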
> 
>> So when dealing with non-hotplug capable devices,
>> recovery code does not restore the state of the affected devices
>> correctly. Correct implementation should call report_slot_reset()
>> function after resetting the link to restore the state of the
>> device/driver.
> 
> We don't restore the state correctly.  What does this look like to the
> user?  Does the device not work?
The device will not be accessible. AFAIK, any I/O to it should fail.
> 
>> So use PCI_ERS_RESULT_NEED_RESET as error status for successful
>> reset_link() operation and use PCI_ERS_RESULT_DISCONNECT for failure
>> case. PCI_ERS_RESULT_NEED_RESET error state will ensure slot_reset()
>> is called after reset link operation which will also fix the above
>> mentioned issue.
> 
> I think PCI_ERS_RESULT_NEED_RESET results in calling driver
> ->slot_reset() callbacks, right?  Where does the state restoration
> happen?
For fatal errors, since the reset is not triggered by the OS, we cannot
save the state of the device before it is reset. So we assume state
restoration is the driver's responsibility and expect drivers to restore
the state in the ->slot_reset() callback. But I am not sure whether this
works for all devices (since it is driver dependent).
For non-fatal errors, the slot_reset or bus_reset function will handle
the save/restore of device config space.
> 
> No, I guess it must be something in the hotplug driver that restores
> the state, because you said devices below hotplug-capable ports work
> correctly, but others don't.
For hotplug capable devices, the driver is removed and reattached (on
DLLSC state change). So state initialization happens during
re-enumeration.

> 
>> [original patch is from jay.vosburgh@canonical.com]
>> [original patch link https://lore.kernel.org/linux-pci/12115.1588207324@famine/]
>> Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
>> Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
>> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>> ---
>>   drivers/pci/pcie/err.c | 24 ++++++++++++++++++++++--
>>   1 file changed, 22 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>> index 14bb8f54723e..5fe8561c7185 100644
>> --- a/drivers/pci/pcie/err.c
>> +++ b/drivers/pci/pcie/err.c
>> @@ -165,8 +165,28 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>   	pci_dbg(dev, "broadcast error_detected message\n");
>>   	if (state == pci_channel_io_frozen) {
>>   		pci_walk_bus(bus, report_frozen_detected, &status);
>> -		status = reset_link(dev);
>> -		if (status != PCI_ERS_RESULT_RECOVERED) {
>> +		/*
>> +		 * After resetting the link using reset_link() call, the
>> +		 * possible value of error status is either
>> +		 * PCI_ERS_RESULT_DISCONNECT (failure case) or
>> +		 * PCI_ERS_RESULT_NEED_RESET (success case).
>> +		 * So ignore the return value of report_error_detected()
>> +		 * call for fatal errors. Instead use
>> +		 * PCI_ERS_RESULT_NEED_RESET as initial status value.
>> +		 *
>> +		 * Ignoring the status return value of report_error_detected()
>> +		 * call will also help in case of EDR mode based error
>> +		 * recovery. In EDR mode AER and DPC Capabilities are owned by
>> +		 * firmware and hence report_error_detected() call will possibly
>> +		 * return PCI_ERS_RESULT_NO_AER_DRIVER. So if we don't ignore
>> +		 * the return value of report_error_detected() then
>> +		 * pcie_do_recovery() would report incorrect status after
>> +		 * successful recovery. Ignoring PCI_ERS_RESULT_NO_AER_DRIVER
>> +		 * in non EDR case should not have any functional impact.
> 
> I can't make sense out of the comment.  We already ignore the "status"
> from pci_walk_bus(bus, report_frozen_detected, &status).
Yes, but I am trying to explain why we ignore the status.
> 
> No idea what to make of the second paragraph.  If we make the commit
> log make sense, maybe some summary of that would be useful here.
Here are more details on the second part of the comment. Let me
know if it does not make sense.

In EDR mode, pcie_do_recovery() is triggered by edr_handle_event(),
and control of the AER and DPC capabilities is owned by firmware. When
firmware owns them, the AER and DPC PCIe service drivers are not
enumerated, and hence report_frozen_detected() can return
PCI_ERS_RESULT_NO_AER_DRIVER as status. If report_error_detected()
returns PCI_ERS_RESULT_NO_AER_DRIVER then, as per the current
pcie_do_recovery() implementation, recovery will be reported as a
failure.

So ignoring the status of report_error_detected() helps the case of
recovery triggered by the EDR driver.
> 
> I think this code is equivalent and makes the patch much clearer:
Ok. I will change to this logic in the next version.
> 
>    status = reset_link(dev);
>    if (status == PCI_ERS_RESULT_RECOVERED) {
>      status = PCI_ERS_RESULT_NEED_RESET;
>    } else {
>      status = PCI_ERS_RESULT_DISCONNECT;
>      goto failed;
>    }

> 
>> +		 */
>> +		status = PCI_ERS_RESULT_NEED_RESET;
>> +		if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED) {
>> +			status = PCI_ERS_RESULT_DISCONNECT;
>>   			pci_warn(dev, "link reset failed\n");
>>   			goto failed;
>>   		}
>> -- 
>> 2.17.1
>>

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

