linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 1/2] PCI/ERR: Call pci_bus_reset() before calling ->slot_reset() callback
@ 2020-10-12  5:03 sathyanarayanan.nkuppuswamy
  2020-10-12  5:03 ` [PATCH v4 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling sathyanarayanan.nkuppuswamy
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: sathyanarayanan.nkuppuswamy @ 2020-10-12  5:03 UTC (permalink / raw)
  To: bhelgaas, okaya
  Cc: linux-pci, linux-kernel, ashok.raj, sathyanarayanan.kuppuswamy

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

Currently if report_error_detected() or report_mmio_enabled()
functions requests PCI_ERS_RESULT_NEED_RESET, current
pcie_do_recovery() implementation does not do the requested
explicit device reset, but instead just calls the
report_slot_reset() on all affected devices. Notifying about the
reset via report_slot_reset() without doing the actual device
reset is incorrect. So call pci_bus_reset() before triggering
->slot_reset() callback.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 drivers/pci/pcie/err.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index c543f419d8f9..067c58728b88 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -181,11 +181,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 	}
 
 	if (status == PCI_ERS_RESULT_NEED_RESET) {
-		/*
-		 * TODO: Should call platform-specific
-		 * functions to reset slot before calling
-		 * drivers' slot_reset callbacks?
-		 */
+		pci_reset_bus(dev);
 		status = PCI_ERS_RESULT_RECOVERED;
 		pci_dbg(dev, "broadcast slot_reset message\n");
 		pci_walk_bus(bus, report_slot_reset, &status);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v4 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling
  2020-10-12  5:03 [PATCH v4 1/2] PCI/ERR: Call pci_bus_reset() before calling ->slot_reset() callback sathyanarayanan.nkuppuswamy
@ 2020-10-12  5:03 ` sathyanarayanan.nkuppuswamy
  2020-10-12 14:50   ` Sinan Kaya
                     ` (2 more replies)
  2020-10-12 14:51 ` [PATCH v4 1/2] PCI/ERR: Call pci_bus_reset() before calling ->slot_reset() callback Sinan Kaya
  2020-10-12 21:05 ` Raj, Ashok
  2 siblings, 3 replies; 14+ messages in thread
From: sathyanarayanan.nkuppuswamy @ 2020-10-12  5:03 UTC (permalink / raw)
  To: bhelgaas, okaya
  Cc: linux-pci, linux-kernel, ashok.raj, sathyanarayanan.kuppuswamy

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
merged fatal and non-fatal error recovery paths, and also made
recovery code depend on hotplug handler for "remove affected
device + rescan" support. But this change also complicated the
error recovery path and which in turn led to the following
issues.

1. We depend on hotplug handler for removing the affected
devices/drivers on DLLSC LINK down event (on DPC event
trigger) and DPC handler for handling the error recovery. Since
both handlers operate on same set of affected devices, it leads
to race condition, which in turn leads to  NULL pointer
exceptions or error recovery failures.You can find more details
about this issue in following link.

https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.zhao@intel.com/T/#t

2. For non-hotplug capable devices fatal (DPC) error recovery
is currently broken. Current fatal error recovery implementation
relies on PCIe hotplug (pciehp) handler for detaching and
re-enumerating the affected devices/drivers. So when dealing with
non-hotplug capable devices, recovery code does not restore the state
of the affected devices correctly. You can find more details about
this issue in the following links.

https://lore.kernel.org/linux-pci/20200527083130.4137-1-Zhiqiang.Hou@nxp.com/
https://lore.kernel.org/linux-pci/12115.1588207324@famine/
https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.1585000084.git.sathyanarayanan.kuppuswamy@linux.intel.com/

In order to fix the above two issues, we should stop relying on hotplug
handler for cleaning the affected devices/drivers and let error recovery
handler own this functionality. So this patch reverts Commit bdb5ac85777d
("PCI/ERR: Handle fatal error recovery") and re-introduce the  "remove
affected device + rescan"  functionality in fatal error recovery handler.

Also holding pci_lock_rescan_remove() will prevent the race between hotplug
and DPC handler.

Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 Documentation/PCI/pci-error-recovery.rst | 47 ++++++++++------
 drivers/pci/pcie/err.c                   | 71 +++++++++++++++++++-----
 2 files changed, 87 insertions(+), 31 deletions(-)

diff --git a/Documentation/PCI/pci-error-recovery.rst b/Documentation/PCI/pci-error-recovery.rst
index 84ceebb08cac..830c8af5838b 100644
--- a/Documentation/PCI/pci-error-recovery.rst
+++ b/Documentation/PCI/pci-error-recovery.rst
@@ -115,7 +115,7 @@ The actual steps taken by a platform to recover from a PCI error
 event will be platform-dependent, but will follow the general
 sequence described below.
 
-STEP 0: Error Event
+STEP 0: Error Event: ERR_NONFATAL
 -------------------
 A PCI bus error is detected by the PCI hardware.  On powerpc, the slot
 is isolated, in that all I/O is blocked: all reads return 0xffffffff,
@@ -160,10 +160,10 @@ particular, if the platform doesn't isolate slots), and recovery
 proceeds to STEP 2 (MMIO Enable).
 
 If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET),
-then recovery proceeds to STEP 4 (Slot Reset).
+then recovery proceeds to STEP 3 (Slot Reset).
 
 If the platform is unable to recover the slot, the next step
-is STEP 6 (Permanent Failure).
+is STEP 5 (Permanent Failure).
 
 .. note::
 
@@ -198,7 +198,7 @@ reset or some such, but not restart operations. This callback is made if
 all drivers on a segment agree that they can try to recover and if no automatic
 link reset was performed by the HW. If the platform can't just re-enable IOs
 without a slot reset or a link reset, it will not call this callback, and
-instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
+instead will have gone directly to STEP 3 (Slot Reset)
 
 .. note::
 
@@ -233,18 +233,12 @@ The driver should return one of the following result codes:
 
 The next step taken depends on the results returned by the drivers.
 If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
-proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
+proceeds to STEP 4 (Resume Operations).
 
 If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
-proceeds to STEP 4 (Slot Reset)
+proceeds to STEP 3 (Slot Reset)
 
-STEP 3: Link Reset
-------------------
-The platform resets the link.  This is a PCI-Express specific step
-and is done whenever a fatal error has been detected that can be
-"solved" by resetting the link.
-
-STEP 4: Slot Reset
+STEP 3: Slot Reset
 ------------------
 
 In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
@@ -322,7 +316,7 @@ PCI card types::
 	+		pdev->needs_freset = 1;
 	+
 
-Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent
+Platform proceeds either to STEP 4 (Resume Operations) or STEP 5 (Permanent
 Failure).
 
 .. note::
@@ -332,7 +326,7 @@ Failure).
    However, it probably should.
 
 
-STEP 5: Resume Operations
+STEP 4: Resume Operations
 -------------------------
 The platform will call the resume() callback on all affected device
 drivers if all drivers on the segment have returned
@@ -344,7 +338,7 @@ a result code.
 At this point, if a new error happens, the platform will restart
 a new error recovery sequence.
 
-STEP 6: Permanent Failure
+STEP 5: Permanent Failure
 -------------------------
 A "permanent failure" has occurred, and the platform cannot recover
 the device.  The platform will call error_detected() with a
@@ -367,6 +361,27 @@ errors. See the discussion in powerpc/eeh-pci-error-recovery.txt
 for additional detail on real-life experience of the causes of
 software errors.
 
+STEP 0: Error Event: ERR_FATAL
+--------------------
+PCI bus error is detected by the PCI hardware. On powerpc, the slot is
+isolated, in that all I/O is blocked: all reads return 0xffffffff, all
+writes are ignored.
+
+STEP 1: Remove devices
+---------------------
+Platform removes the devices depending on the error agent, it could be
+this port for all subordinates or upstream component (likely downstream
+port)
+
+STEP 2: Reset link
+---------------------
+The platform resets the link.  This is a PCI-Express specific step and is
+done whenever a fatal error has been detected that can be "solved" by
+resetting the link.
+
+STEP 3: Re-enumerate the devices
+---------------------
+Initiates the re-enumeration.
 
 Conclusion; General Remarks
 ---------------------------
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 067c58728b88..c2ae4d08801a 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -79,11 +79,6 @@ static int report_error_detected(struct pci_dev *dev,
 	return 0;
 }
 
-static int report_frozen_detected(struct pci_dev *dev, void *data)
-{
-	return report_error_detected(dev, pci_channel_io_frozen, data);
-}
-
 static int report_normal_detected(struct pci_dev *dev, void *data)
 {
 	return report_error_detected(dev, pci_channel_io_normal, data);
@@ -146,6 +141,58 @@ static int report_resume(struct pci_dev *dev, void *data)
 	return 0;
 }
 
+static pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev,
+			pci_ers_result_t (*reset_link)(struct pci_dev *pdev))
+{
+	struct pci_dev *udev;
+	struct pci_bus *parent;
+	struct pci_dev *pdev, *temp;
+	pci_ers_result_t result;
+
+	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)
+		udev = dev;
+	else
+		udev = dev->bus->self;
+
+	parent = udev->subordinate;
+	pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
+
+        pci_lock_rescan_remove();
+        pci_dev_get(dev);
+        list_for_each_entry_safe_reverse(pdev, temp, &parent->devices,
+					 bus_list) {
+		pci_stop_and_remove_bus_device(pdev);
+	}
+
+	result = reset_link(udev);
+
+	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
+		/*
+		 * If the error is reported by a bridge, we think this error
+		 * is related to the downstream link of the bridge, so we
+		 * do error recovery on all subordinates of the bridge instead
+		 * of the bridge and clear the error status of the bridge.
+		 */
+		pci_aer_clear_fatal_status(dev);
+		if (pcie_aer_is_native(dev))
+			pcie_clear_device_status(dev);
+	}
+
+	if (result == PCI_ERS_RESULT_RECOVERED) {
+		if (pcie_wait_for_link(udev, true))
+			pci_rescan_bus(udev->bus);
+		pci_info(dev, "Device recovery from fatal error successful\n");
+        } else {
+		pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
+		pci_info(dev, "Device recovery from fatal error failed\n");
+        }
+
+	pci_dev_put(dev);
+	pci_unlock_rescan_remove();
+
+	return result;
+}
+
 pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 			pci_channel_state_t state,
 			pci_ers_result_t (*reset_link)(struct pci_dev *pdev))
@@ -153,6 +200,9 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
 	struct pci_bus *bus;
 
+	if (state == pci_channel_io_frozen)
+		return pcie_do_fatal_recovery(dev, reset_link);
+
 	/*
 	 * Error recovery runs on all subordinates of the first downstream port.
 	 * If the downstream port detected the error, it is cleared at the end.
@@ -163,16 +213,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 	bus = dev->subordinate;
 
 	pci_dbg(dev, "broadcast error_detected message\n");
-	if (state == pci_channel_io_frozen) {
-		pci_walk_bus(bus, report_frozen_detected, &status);
-		status = reset_link(dev);
-		if (status != PCI_ERS_RESULT_RECOVERED) {
-			pci_warn(dev, "link reset failed\n");
-			goto failed;
-		}
-	} else {
-		pci_walk_bus(bus, report_normal_detected, &status);
-	}
+	pci_walk_bus(bus, report_normal_detected, &status);
 
 	if (status == PCI_ERS_RESULT_CAN_RECOVER) {
 		status = PCI_ERS_RESULT_RECOVERED;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling
  2020-10-12  5:03 ` [PATCH v4 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling sathyanarayanan.nkuppuswamy
@ 2020-10-12 14:50   ` Sinan Kaya
  2020-10-13 11:56   ` Christoph Hellwig
  2020-10-14  5:44   ` Ethan Zhao
  2 siblings, 0 replies; 14+ messages in thread
From: Sinan Kaya @ 2020-10-12 14:50 UTC (permalink / raw)
  To: sathyanarayanan.nkuppuswamy, bhelgaas
  Cc: linux-pci, linux-kernel, ashok.raj, sathyanarayanan.kuppuswamy

On 10/12/2020 1:03 AM, sathyanarayanan.nkuppuswamy@gmail.com wrote:
> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> 
> Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
> merged fatal and non-fatal error recovery paths, and also made
> recovery code depend on hotplug handler for "remove affected
> device + rescan" support. But this change also complicated the
> error recovery path and which in turn led to the following
> issues.
> 
> 1. We depend on hotplug handler for removing the affected
> devices/drivers on DLLSC LINK down event (on DPC event
> trigger) and DPC handler for handling the error recovery. Since
> both handlers operate on same set of affected devices, it leads
> to race condition, which in turn leads to  NULL pointer
> exceptions or error recovery failures.You can find more details
> about this issue in following link.
> 
> https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.zhao@intel.com/T/#t
> 
> 2. For non-hotplug capable devices fatal (DPC) error recovery
> is currently broken. Current fatal error recovery implementation
> relies on PCIe hotplug (pciehp) handler for detaching and
> re-enumerating the affected devices/drivers. So when dealing with
> non-hotplug capable devices, recovery code does not restore the state
> of the affected devices correctly. You can find more details about
> this issue in the following links.
> 
> https://lore.kernel.org/linux-pci/20200527083130.4137-1-Zhiqiang.Hou@nxp.com/
> https://lore.kernel.org/linux-pci/12115.1588207324@famine/
> https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.1585000084.git.sathyanarayanan.kuppuswamy@linux.intel.com/
> 
> In order to fix the above two issues, we should stop relying on hotplug
> handler for cleaning the affected devices/drivers and let error recovery
> handler own this functionality. So this patch reverts Commit bdb5ac85777d
> ("PCI/ERR: Handle fatal error recovery") and re-introduce the  "remove
> affected device + rescan"  functionality in fatal error recovery handler.
> 
> Also holding pci_lock_rescan_remove() will prevent the race between hotplug
> and DPC handler.
> 
> Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  Documentation/PCI/pci-error-recovery.rst | 47 ++++++++++------
>  drivers/pci/pcie/err.c                   | 71 +++++++++++++++++++-----
>  2 files changed, 87 insertions(+), 31 deletions(-)

I'm not sure about locks involved but I do like the concept.
Current fatal error handling is best effort.

There is no way to recover if link is down by the time we
reach to fatal error handling routine.

This change will make the solution more reliable.

Reviewed-by: Sinan Kaya <okaya@kernel.org>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 1/2] PCI/ERR: Call pci_bus_reset() before calling ->slot_reset() callback
  2020-10-12  5:03 [PATCH v4 1/2] PCI/ERR: Call pci_bus_reset() before calling ->slot_reset() callback sathyanarayanan.nkuppuswamy
  2020-10-12  5:03 ` [PATCH v4 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling sathyanarayanan.nkuppuswamy
@ 2020-10-12 14:51 ` Sinan Kaya
  2020-10-12 21:05 ` Raj, Ashok
  2 siblings, 0 replies; 14+ messages in thread
From: Sinan Kaya @ 2020-10-12 14:51 UTC (permalink / raw)
  To: sathyanarayanan.nkuppuswamy, bhelgaas
  Cc: linux-pci, linux-kernel, ashok.raj, sathyanarayanan.kuppuswamy

On 10/12/2020 1:03 AM, sathyanarayanan.nkuppuswamy@gmail.com wrote:
> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> 
> Currently if report_error_detected() or report_mmio_enabled()
> functions requests PCI_ERS_RESULT_NEED_RESET, current
> pcie_do_recovery() implementation does not do the requested
> explicit device reset, but instead just calls the
> report_slot_reset() on all affected devices. Notifying about the
> reset via report_slot_reset() without doing the actual device
> reset is incorrect. So call pci_bus_reset() before triggering
> ->slot_reset() callback.
> 
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  drivers/pci/pcie/err.c | 6 +-----
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index c543f419d8f9..067c58728b88 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -181,11 +181,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>  	}
>  
>  	if (status == PCI_ERS_RESULT_NEED_RESET) {
> -		/*
> -		 * TODO: Should call platform-specific
> -		 * functions to reset slot before calling
> -		 * drivers' slot_reset callbacks?
> -		 */
> +		pci_reset_bus(dev);
>  		status = PCI_ERS_RESULT_RECOVERED;
>  		pci_dbg(dev, "broadcast slot_reset message\n");
>  		pci_walk_bus(bus, report_slot_reset, &status);
> 

Reviewed-by: Sinan Kaya <okaya@kernel.org>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 1/2] PCI/ERR: Call pci_bus_reset() before calling ->slot_reset() callback
  2020-10-12  5:03 [PATCH v4 1/2] PCI/ERR: Call pci_bus_reset() before calling ->slot_reset() callback sathyanarayanan.nkuppuswamy
  2020-10-12  5:03 ` [PATCH v4 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling sathyanarayanan.nkuppuswamy
  2020-10-12 14:51 ` [PATCH v4 1/2] PCI/ERR: Call pci_bus_reset() before calling ->slot_reset() callback Sinan Kaya
@ 2020-10-12 21:05 ` Raj, Ashok
  2020-10-12 21:47   ` Kuppuswamy, Sathyanarayanan
  2 siblings, 1 reply; 14+ messages in thread
From: Raj, Ashok @ 2020-10-12 21:05 UTC (permalink / raw)
  To: sathyanarayanan.nkuppuswamy
  Cc: bhelgaas, okaya, linux-pci, linux-kernel,
	sathyanarayanan.kuppuswamy, Ashok Raj

On Sun, Oct 11, 2020 at 10:03:40PM -0700, sathyanarayanan.nkuppuswamy@gmail.com wrote:
> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> 
> Currently if report_error_detected() or report_mmio_enabled()
> functions requests PCI_ERS_RESULT_NEED_RESET, current
> pcie_do_recovery() implementation does not do the requested
> explicit device reset, but instead just calls the
> report_slot_reset() on all affected devices. Notifying about the
> reset via report_slot_reset() without doing the actual device
> reset is incorrect. So call pci_bus_reset() before triggering
> ->slot_reset() callback.
> 
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  drivers/pci/pcie/err.c | 6 +-----
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index c543f419d8f9..067c58728b88 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -181,11 +181,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>  	}
>  
>  	if (status == PCI_ERS_RESULT_NEED_RESET) {
> -		/*
> -		 * TODO: Should call platform-specific
> -		 * functions to reset slot before calling
> -		 * drivers' slot_reset callbacks?
> -		 */
> +		pci_reset_bus(dev);

pci_reset_bus() returns an error, do you need to consult that before
unconditionally setting PCI_ERS_RESULT_RECOVERED?

>  		status = PCI_ERS_RESULT_RECOVERED;
>  		pci_dbg(dev, "broadcast slot_reset message\n");
>  		pci_walk_bus(bus, report_slot_reset, &status);
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 1/2] PCI/ERR: Call pci_bus_reset() before calling ->slot_reset() callback
  2020-10-12 21:05 ` Raj, Ashok
@ 2020-10-12 21:47   ` Kuppuswamy, Sathyanarayanan
  2020-10-14  8:00     ` Ethan Zhao
  0 siblings, 1 reply; 14+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2020-10-12 21:47 UTC (permalink / raw)
  To: Raj, Ashok, sathyanarayanan.nkuppuswamy
  Cc: bhelgaas, okaya, linux-pci, linux-kernel



On 10/12/20 2:05 PM, Raj, Ashok wrote:
> On Sun, Oct 11, 2020 at 10:03:40PM -0700, sathyanarayanan.nkuppuswamy@gmail.com wrote:
>> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>>
>> Currently if report_error_detected() or report_mmio_enabled()
>> functions requests PCI_ERS_RESULT_NEED_RESET, current
>> pcie_do_recovery() implementation does not do the requested
>> explicit device reset, but instead just calls the
>> report_slot_reset() on all affected devices. Notifying about the
>> reset via report_slot_reset() without doing the actual device
>> reset is incorrect. So call pci_bus_reset() before triggering
>> ->slot_reset() callback.
>>
>> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>> ---
>>   drivers/pci/pcie/err.c | 6 +-----
>>   1 file changed, 1 insertion(+), 5 deletions(-)
>>
>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>> index c543f419d8f9..067c58728b88 100644
>> --- a/drivers/pci/pcie/err.c
>> +++ b/drivers/pci/pcie/err.c
>> @@ -181,11 +181,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>   	}
>>   
>>   	if (status == PCI_ERS_RESULT_NEED_RESET) {
>> -		/*
>> -		 * TODO: Should call platform-specific
>> -		 * functions to reset slot before calling
>> -		 * drivers' slot_reset callbacks?
>> -		 */
>> +		pci_reset_bus(dev);
> 
> pci_reset_bus() returns an error, do you need to consult that before
> unconditionally setting PCI_ERS_RESULT_RECOVERED?
Good point. I will fix this in next version.
> 
>>   		status = PCI_ERS_RESULT_RECOVERED;
>>   		pci_dbg(dev, "broadcast slot_reset message\n");
>>   		pci_walk_bus(bus, report_slot_reset, &status);
>> -- 
>> 2.17.1
>>

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling
  2020-10-12  5:03 ` [PATCH v4 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling sathyanarayanan.nkuppuswamy
  2020-10-12 14:50   ` Sinan Kaya
@ 2020-10-13 11:56   ` Christoph Hellwig
  2020-10-13 15:17     ` Kuppuswamy, Sathyanarayanan
  2020-10-14  5:44   ` Ethan Zhao
  2 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2020-10-13 11:56 UTC (permalink / raw)
  To: sathyanarayanan.nkuppuswamy
  Cc: bhelgaas, okaya, linux-pci, linux-kernel, ashok.raj,
	sathyanarayanan.kuppuswamy

You might want to split out pcie_do_fatal_recovery and get rid of the
state argument:

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index fa12f7cbc1a095..eec0d3fe9fd967 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -556,7 +556,8 @@ static inline int pci_dev_specific_disable_acs_redir(struct pci_dev *dev)
 
 /* PCI error reporting and recovery */
 pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
-			pci_channel_state_t state,
+			pci_ers_result_t (*reset_link)(struct pci_dev *pdev));
+pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev,
 			pci_ers_result_t (*reset_link)(struct pci_dev *pdev));
 
 bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 65dff5f3457ac0..4bf7ebb34cf854 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -947,9 +947,9 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
 		if (pcie_aer_is_native(dev))
 			pcie_clear_device_status(dev);
 	} else if (info->severity == AER_NONFATAL)
-		pcie_do_recovery(dev, pci_channel_io_normal, aer_root_reset);
+		pcie_do_recovery(dev, aer_root_reset);
 	else if (info->severity == AER_FATAL)
-		pcie_do_recovery(dev, pci_channel_io_frozen, aer_root_reset);
+		pcie_do_fatal_recovery(dev, aer_root_reset);
 	pci_dev_put(dev);
 }
 
@@ -985,11 +985,9 @@ static void aer_recover_work_func(struct work_struct *work)
 		}
 		cper_print_aer(pdev, entry.severity, entry.regs);
 		if (entry.severity == AER_NONFATAL)
-			pcie_do_recovery(pdev, pci_channel_io_normal,
-					 aer_root_reset);
+			pcie_do_recovery(pdev, aer_root_reset);
 		else if (entry.severity == AER_FATAL)
-			pcie_do_recovery(pdev, pci_channel_io_frozen,
-					 aer_root_reset);
+			pcie_do_fatal_recovery(pdev, aer_root_reset);
 		pci_dev_put(pdev);
 	}
 }
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index daa9a4153776ce..74e7d1da3cf054 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -233,7 +233,7 @@ static irqreturn_t dpc_handler(int irq, void *context)
 	dpc_process_error(pdev);
 
 	/* We configure DPC so it only triggers on ERR_FATAL */
-	pcie_do_recovery(pdev, pci_channel_io_frozen, dpc_reset_link);
+	pcie_do_fatal_recovery(pdev, dpc_reset_link);
 
 	return IRQ_HANDLED;
 }
diff --git a/drivers/pci/pcie/edr.c b/drivers/pci/pcie/edr.c
index a6b9b479b97ad0..87379bc566f691 100644
--- a/drivers/pci/pcie/edr.c
+++ b/drivers/pci/pcie/edr.c
@@ -183,7 +183,7 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)
 	 * or ERR_NONFATAL, since the link is already down, use the FATAL
 	 * error recovery path for both cases.
 	 */
-	estate = pcie_do_recovery(edev, pci_channel_io_frozen, dpc_reset_link);
+	estate = pcie_do_fatal_recovery(edev, dpc_reset_link);
 
 send_ost:
 
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index c2ae4d08801a4d..11fcff16b17303 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -141,7 +141,7 @@ static int report_resume(struct pci_dev *dev, void *data)
 	return 0;
 }
 
-static pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev,
+pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev,
 			pci_ers_result_t (*reset_link)(struct pci_dev *pdev))
 {
 	struct pci_dev *udev;
@@ -194,15 +194,11 @@ static pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev,
 }
 
 pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
-			pci_channel_state_t state,
 			pci_ers_result_t (*reset_link)(struct pci_dev *pdev))
 {
 	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
 	struct pci_bus *bus;
 
-	if (state == pci_channel_io_frozen)
-		return pcie_do_fatal_recovery(dev, reset_link);
-
 	/*
 	 * Error recovery runs on all subordinates of the first downstream port.
 	 * If the downstream port detected the error, it is cleared at the end.

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling
  2020-10-13 11:56   ` Christoph Hellwig
@ 2020-10-13 15:17     ` Kuppuswamy, Sathyanarayanan
  2020-10-15  6:43       ` Christoph Hellwig
  0 siblings, 1 reply; 14+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2020-10-13 15:17 UTC (permalink / raw)
  To: Christoph Hellwig, sathyanarayanan.nkuppuswamy
  Cc: bhelgaas, okaya, linux-pci, linux-kernel, ashok.raj



On 10/13/20 4:56 AM, Christoph Hellwig wrote:
> You might want to split out pcie_do_fatal_recovery and get rid of the
> state argument:
This is how it was before Keith merged fatal and non-fatal error recovery
paths. When the comparison is between additional-parameter vs new-interface
, I choose the former. But I can merge your change in next version.

> 
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index fa12f7cbc1a095..eec0d3fe9fd967 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -556,7 +556,8 @@ static inline int pci_dev_specific_disable_acs_redir(struct pci_dev *dev)
>   
>   /* PCI error reporting and recovery */
>   pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> -			pci_channel_state_t state,
> +			pci_ers_result_t (*reset_link)(struct pci_dev *pdev));
> +pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev,
>   			pci_ers_result_t (*reset_link)(struct pci_dev *pdev));
>   
>   bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 65dff5f3457ac0..4bf7ebb34cf854 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -947,9 +947,9 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>   		if (pcie_aer_is_native(dev))
>   			pcie_clear_device_status(dev);
>   	} else if (info->severity == AER_NONFATAL)
> -		pcie_do_recovery(dev, pci_channel_io_normal, aer_root_reset);
> +		pcie_do_recovery(dev, aer_root_reset);
>   	else if (info->severity == AER_FATAL)
> -		pcie_do_recovery(dev, pci_channel_io_frozen, aer_root_reset);
> +		pcie_do_fatal_recovery(dev, aer_root_reset);
>   	pci_dev_put(dev);
>   }
>   
> @@ -985,11 +985,9 @@ static void aer_recover_work_func(struct work_struct *work)
>   		}
>   		cper_print_aer(pdev, entry.severity, entry.regs);
>   		if (entry.severity == AER_NONFATAL)
> -			pcie_do_recovery(pdev, pci_channel_io_normal,
> -					 aer_root_reset);
> +			pcie_do_recovery(pdev, aer_root_reset);
>   		else if (entry.severity == AER_FATAL)
> -			pcie_do_recovery(pdev, pci_channel_io_frozen,
> -					 aer_root_reset);
> +			pcie_do_fatal_recovery(pdev, aer_root_reset);
>   		pci_dev_put(pdev);
>   	}
>   }
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index daa9a4153776ce..74e7d1da3cf054 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -233,7 +233,7 @@ static irqreturn_t dpc_handler(int irq, void *context)
>   	dpc_process_error(pdev);
>   
>   	/* We configure DPC so it only triggers on ERR_FATAL */
> -	pcie_do_recovery(pdev, pci_channel_io_frozen, dpc_reset_link);
> +	pcie_do_fatal_recovery(pdev, dpc_reset_link);
>   
>   	return IRQ_HANDLED;
>   }
> diff --git a/drivers/pci/pcie/edr.c b/drivers/pci/pcie/edr.c
> index a6b9b479b97ad0..87379bc566f691 100644
> --- a/drivers/pci/pcie/edr.c
> +++ b/drivers/pci/pcie/edr.c
> @@ -183,7 +183,7 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)
>   	 * or ERR_NONFATAL, since the link is already down, use the FATAL
>   	 * error recovery path for both cases.
>   	 */
> -	estate = pcie_do_recovery(edev, pci_channel_io_frozen, dpc_reset_link);
> +	estate = pcie_do_fatal_recovery(edev, dpc_reset_link);
>   
>   send_ost:
>   
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index c2ae4d08801a4d..11fcff16b17303 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -141,7 +141,7 @@ static int report_resume(struct pci_dev *dev, void *data)
>   	return 0;
>   }
>   
> -static pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev,
> +pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev,
>   			pci_ers_result_t (*reset_link)(struct pci_dev *pdev))
>   {
>   	struct pci_dev *udev;
> @@ -194,15 +194,11 @@ static pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev,
>   }
>   
>   pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> -			pci_channel_state_t state,
>   			pci_ers_result_t (*reset_link)(struct pci_dev *pdev))
>   {
>   	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
>   	struct pci_bus *bus;
>   
> -	if (state == pci_channel_io_frozen)
> -		return pcie_do_fatal_recovery(dev, reset_link);
> -
>   	/*
>   	 * Error recovery runs on all subordinates of the first downstream port.
>   	 * If the downstream port detected the error, it is cleared at the end.
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling
  2020-10-12  5:03 ` [PATCH v4 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling sathyanarayanan.nkuppuswamy
  2020-10-12 14:50   ` Sinan Kaya
  2020-10-13 11:56   ` Christoph Hellwig
@ 2020-10-14  5:44   ` Ethan Zhao
  2020-10-14  5:51     ` Kuppuswamy, Sathyanarayanan
  2 siblings, 1 reply; 14+ messages in thread
From: Ethan Zhao @ 2020-10-14  5:44 UTC (permalink / raw)
  To: sathyanarayanan.nkuppuswamy
  Cc: Bjorn Helgaas, Sinan Kaya, linux-pci, Linux Kernel Mailing List,
	Ashok Raj, Kuppuswamy, Sathyanarayanan

On Mon, Oct 12, 2020 at 1:10 PM <sathyanarayanan.nkuppuswamy@gmail.com> wrote:
>
> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>
> Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
> merged fatal and non-fatal error recovery paths, and also made
> recovery code depend on hotplug handler for "remove affected
> device + rescan" support. But this change also complicated the
> error recovery path and which in turn led to the following
> issues.
>
> 1. We depend on hotplug handler for removing the affected
> devices/drivers on DLLSC LINK down event (on DPC event
> trigger) and DPC handler for handling the error recovery. Since
> both handlers operate on same set of affected devices, it leads
> to race condition, which in turn leads to  NULL pointer
> exceptions or error recovery failures.You can find more details
> about this issue in following link.
>
> https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.zhao@intel.com/T/#t
>
> 2. For non-hotplug capable devices fatal (DPC) error recovery
> is currently broken. Current fatal error recovery implementation
> relies on PCIe hotplug (pciehp) handler for detaching and
> re-enumerating the affected devices/drivers. So when dealing with
> non-hotplug capable devices, recovery code does not restore the state
> of the affected devices correctly. You can find more details about
> this issue in the following links.
>
> https://lore.kernel.org/linux-pci/20200527083130.4137-1-Zhiqiang.Hou@nxp.com/
> https://lore.kernel.org/linux-pci/12115.1588207324@famine/
> https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.1585000084.git.sathyanarayanan.kuppuswamy@linux.intel.com/
>
> In order to fix the above two issues, we should stop relying on hotplug
> handler for cleaning the affected devices/drivers and let error recovery
> handler own this functionality. So this patch reverts Commit bdb5ac85777d
> ("PCI/ERR: Handle fatal error recovery") and re-introduce the  "remove
> affected device + rescan"  functionality in fatal error recovery handler.

This patch only reverts the commit bdb5ac85777d ?
or you'd better separate the revert and code you added.

Thanks,
Ethan

>
> Also holding pci_lock_rescan_remove() will prevent the race between hotplug
> and DPC handler.
>
> Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  Documentation/PCI/pci-error-recovery.rst | 47 ++++++++++------
>  drivers/pci/pcie/err.c                   | 71 +++++++++++++++++++-----
>  2 files changed, 87 insertions(+), 31 deletions(-)
>
> diff --git a/Documentation/PCI/pci-error-recovery.rst b/Documentation/PCI/pci-error-recovery.rst
> index 84ceebb08cac..830c8af5838b 100644
> --- a/Documentation/PCI/pci-error-recovery.rst
> +++ b/Documentation/PCI/pci-error-recovery.rst
> @@ -115,7 +115,7 @@ The actual steps taken by a platform to recover from a PCI error
>  event will be platform-dependent, but will follow the general
>  sequence described below.
>
> -STEP 0: Error Event
> +STEP 0: Error Event: ERR_NONFATAL
>  -------------------
>  A PCI bus error is detected by the PCI hardware.  On powerpc, the slot
>  is isolated, in that all I/O is blocked: all reads return 0xffffffff,
> @@ -160,10 +160,10 @@ particular, if the platform doesn't isolate slots), and recovery
>  proceeds to STEP 2 (MMIO Enable).
>
>  If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET),
> -then recovery proceeds to STEP 4 (Slot Reset).
> +then recovery proceeds to STEP 3 (Slot Reset).
>
>  If the platform is unable to recover the slot, the next step
> -is STEP 6 (Permanent Failure).
> +is STEP 5 (Permanent Failure).
>
>  .. note::
>
> @@ -198,7 +198,7 @@ reset or some such, but not restart operations. This callback is made if
>  all drivers on a segment agree that they can try to recover and if no automatic
>  link reset was performed by the HW. If the platform can't just re-enable IOs
>  without a slot reset or a link reset, it will not call this callback, and
> -instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
> +instead will have gone directly to STEP 3 (Slot Reset)
>
>  .. note::
>
> @@ -233,18 +233,12 @@ The driver should return one of the following result codes:
>
>  The next step taken depends on the results returned by the drivers.
>  If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
> -proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
> +proceeds to STEP 4 (Resume Operations).
>
>  If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
> -proceeds to STEP 4 (Slot Reset)
> +proceeds to STEP 3 (Slot Reset)
>
> -STEP 3: Link Reset
> -------------------
> -The platform resets the link.  This is a PCI-Express specific step
> -and is done whenever a fatal error has been detected that can be
> -"solved" by resetting the link.
> -
> -STEP 4: Slot Reset
> +STEP 3: Slot Reset
>  ------------------
>
>  In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
> @@ -322,7 +316,7 @@ PCI card types::
>         +               pdev->needs_freset = 1;
>         +
>
> -Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent
> +Platform proceeds either to STEP 4 (Resume Operations) or STEP 5 (Permanent
>  Failure).
>
>  .. note::
> @@ -332,7 +326,7 @@ Failure).
>     However, it probably should.
>
>
> -STEP 5: Resume Operations
> +STEP 4: Resume Operations
>  -------------------------
>  The platform will call the resume() callback on all affected device
>  drivers if all drivers on the segment have returned
> @@ -344,7 +338,7 @@ a result code.
>  At this point, if a new error happens, the platform will restart
>  a new error recovery sequence.
>
> -STEP 6: Permanent Failure
> +STEP 5: Permanent Failure
>  -------------------------
>  A "permanent failure" has occurred, and the platform cannot recover
>  the device.  The platform will call error_detected() with a
> @@ -367,6 +361,27 @@ errors. See the discussion in powerpc/eeh-pci-error-recovery.txt
>  for additional detail on real-life experience of the causes of
>  software errors.
>
> +STEP 0: Error Event: ERR_FATAL
> +--------------------
> +PCI bus error is detected by the PCI hardware. On powerpc, the slot is
> +isolated, in that all I/O is blocked: all reads return 0xffffffff, all
> +writes are ignored.
> +
> +STEP 1: Remove devices
> +---------------------
> +Platform removes the devices depending on the error agent, it could be
> +this port for all subordinates or upstream component (likely downstream
> +port)
> +
> +STEP 2: Reset link
> +---------------------
> +The platform resets the link.  This is a PCI-Express specific step and is
> +done whenever a fatal error has been detected that can be "solved" by
> +resetting the link.
> +
> +STEP 3: Re-enumerate the devices
> +---------------------
> +Initiates the re-enumeration.
>
>  Conclusion; General Remarks
>  ---------------------------
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 067c58728b88..c2ae4d08801a 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -79,11 +79,6 @@ static int report_error_detected(struct pci_dev *dev,
>         return 0;
>  }
>
> -static int report_frozen_detected(struct pci_dev *dev, void *data)
> -{
> -       return report_error_detected(dev, pci_channel_io_frozen, data);
> -}
> -
>  static int report_normal_detected(struct pci_dev *dev, void *data)
>  {
>         return report_error_detected(dev, pci_channel_io_normal, data);
> @@ -146,6 +141,58 @@ static int report_resume(struct pci_dev *dev, void *data)
>         return 0;
>  }
>
> +static pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev,
> +                       pci_ers_result_t (*reset_link)(struct pci_dev *pdev))
> +{
> +       struct pci_dev *udev;
> +       struct pci_bus *parent;
> +       struct pci_dev *pdev, *temp;
> +       pci_ers_result_t result;
> +
> +       if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)
> +               udev = dev;
> +       else
> +               udev = dev->bus->self;
> +
> +       parent = udev->subordinate;
> +       pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
> +
> +        pci_lock_rescan_remove();
> +        pci_dev_get(dev);
> +        list_for_each_entry_safe_reverse(pdev, temp, &parent->devices,
> +                                        bus_list) {
> +               pci_stop_and_remove_bus_device(pdev);
> +       }
> +
> +       result = reset_link(udev);
> +
> +       if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
> +               /*
> +                * If the error is reported by a bridge, we think this error
> +                * is related to the downstream link of the bridge, so we
> +                * do error recovery on all subordinates of the bridge instead
> +                * of the bridge and clear the error status of the bridge.
> +                */
> +               pci_aer_clear_fatal_status(dev);
> +               if (pcie_aer_is_native(dev))
> +                       pcie_clear_device_status(dev);
> +       }
> +
> +       if (result == PCI_ERS_RESULT_RECOVERED) {
> +               if (pcie_wait_for_link(udev, true))
> +                       pci_rescan_bus(udev->bus);
> +               pci_info(dev, "Device recovery from fatal error successful\n");
> +        } else {
> +               pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
> +               pci_info(dev, "Device recovery from fatal error failed\n");
> +        }
> +
> +       pci_dev_put(dev);
> +       pci_unlock_rescan_remove();
> +
> +       return result;
> +}
> +
>  pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>                         pci_channel_state_t state,
>                         pci_ers_result_t (*reset_link)(struct pci_dev *pdev))
> @@ -153,6 +200,9 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>         pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
>         struct pci_bus *bus;
>
> +       if (state == pci_channel_io_frozen)
> +               return pcie_do_fatal_recovery(dev, reset_link);
> +
>         /*
>          * Error recovery runs on all subordinates of the first downstream port.
>          * If the downstream port detected the error, it is cleared at the end.
> @@ -163,16 +213,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>         bus = dev->subordinate;
>
>         pci_dbg(dev, "broadcast error_detected message\n");
> -       if (state == pci_channel_io_frozen) {
> -               pci_walk_bus(bus, report_frozen_detected, &status);
> -               status = reset_link(dev);
> -               if (status != PCI_ERS_RESULT_RECOVERED) {
> -                       pci_warn(dev, "link reset failed\n");
> -                       goto failed;
> -               }
> -       } else {
> -               pci_walk_bus(bus, report_normal_detected, &status);
> -       }
> +       pci_walk_bus(bus, report_normal_detected, &status);
>
>         if (status == PCI_ERS_RESULT_CAN_RECOVER) {
>                 status = PCI_ERS_RESULT_RECOVERED;
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling
  2020-10-14  5:44   ` Ethan Zhao
@ 2020-10-14  5:51     ` Kuppuswamy, Sathyanarayanan
  0 siblings, 0 replies; 14+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2020-10-14  5:51 UTC (permalink / raw)
  To: Ethan Zhao, sathyanarayanan.nkuppuswamy
  Cc: Bjorn Helgaas, Sinan Kaya, linux-pci, Linux Kernel Mailing List,
	Ashok Raj



On 10/13/20 10:44 PM, Ethan Zhao wrote:
> This patch only reverts the commit bdb5ac85777d ?
> or you'd better separate the revert and code you added.

We cannot revert the commit as it is. pcie_do_recovery()
function and Documentation/* folder changed a lot since
fatal and non-fatal error recovery paths were merged. So I
modified the revert so that it can be applied to the latest
kernel version.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 1/2] PCI/ERR: Call pci_bus_reset() before calling ->slot_reset() callback
  2020-10-12 21:47   ` Kuppuswamy, Sathyanarayanan
@ 2020-10-14  8:00     ` Ethan Zhao
  2020-10-14  8:19       ` Kuppuswamy, Sathyanarayanan
  0 siblings, 1 reply; 14+ messages in thread
From: Ethan Zhao @ 2020-10-14  8:00 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Raj, Ashok, sathyanarayanan.nkuppuswamy, Bjorn Helgaas,
	Sinan Kaya, linux-pci, Linux Kernel Mailing List

Please fix the building issue.

drivers/pci/pcie/err.c:144:25: error: static declaration of
‘pcie_do_fatal_recovery’ follows non-static declaration
 static pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev,
                         ^~~~~~~~~~~~~~~~~~~~~~
In file included from drivers/pci/pcie/err.c:21:
drivers/pci/pcie/../pci.h:560:18: note: previous declaration of
‘pcie_do_fatal_recovery’ was here
 pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev,
                  ^~~~~~~~~~~~~~~~~~~~~~
drivers/pci/pcie/err.c:144:25: warning: ‘pcie_do_fatal_recovery’
defined but not used [-Wunused-function]
 static pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev,


Thanks,
Ethan

On Tue, Oct 13, 2020 at 10:18 PM Kuppuswamy, Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
>
>
> On 10/12/20 2:05 PM, Raj, Ashok wrote:
> > On Sun, Oct 11, 2020 at 10:03:40PM -0700, sathyanarayanan.nkuppuswamy@gmail.com wrote:
> >> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> >>
> >> Currently if report_error_detected() or report_mmio_enabled()
> >> functions requests PCI_ERS_RESULT_NEED_RESET, current
> >> pcie_do_recovery() implementation does not do the requested
> >> explicit device reset, but instead just calls the
> >> report_slot_reset() on all affected devices. Notifying about the
> >> reset via report_slot_reset() without doing the actual device
> >> reset is incorrect. So call pci_bus_reset() before triggering
> >> ->slot_reset() callback.
> >>
> >> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> >> ---
> >>   drivers/pci/pcie/err.c | 6 +-----
> >>   1 file changed, 1 insertion(+), 5 deletions(-)
> >>
> >> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> >> index c543f419d8f9..067c58728b88 100644
> >> --- a/drivers/pci/pcie/err.c
> >> +++ b/drivers/pci/pcie/err.c
> >> @@ -181,11 +181,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> >>      }
> >>
> >>      if (status == PCI_ERS_RESULT_NEED_RESET) {
> >> -            /*
> >> -             * TODO: Should call platform-specific
> >> -             * functions to reset slot before calling
> >> -             * drivers' slot_reset callbacks?
> >> -             */
> >> +            pci_reset_bus(dev);
> >
> > pci_reset_bus() returns an error, do you need to consult that before
> > unconditionally setting PCI_ERS_RESULT_RECOVERED?
> Good point. I will fix this in next version.
> >
> >>              status = PCI_ERS_RESULT_RECOVERED;
> >>              pci_dbg(dev, "broadcast slot_reset message\n");
> >>              pci_walk_bus(bus, report_slot_reset, &status);
> >> --
> >> 2.17.1
> >>
>
> --
> Sathyanarayanan Kuppuswamy
> Linux Kernel Developer

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 1/2] PCI/ERR: Call pci_bus_reset() before calling ->slot_reset() callback
  2020-10-14  8:00     ` Ethan Zhao
@ 2020-10-14  8:19       ` Kuppuswamy, Sathyanarayanan
  0 siblings, 0 replies; 14+ messages in thread
From: Kuppuswamy, Sathyanarayanan @ 2020-10-14  8:19 UTC (permalink / raw)
  To: Ethan Zhao
  Cc: Raj, Ashok, sathyanarayanan.nkuppuswamy, Bjorn Helgaas,
	Sinan Kaya, linux-pci, Linux Kernel Mailing List



On 10/14/20 1:00 AM, Ethan Zhao wrote:
> Please fix the building issue.
> 
> drivers/pci/pcie/err.c:144:25: error: static declaration of
> ‘pcie_do_fatal_recovery’ follows non-static declaration
>   static pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev,
>                           ^~~~~~~~~~~~~~~~~~~~~~
> In file included from drivers/pci/pcie/err.c:21:
> drivers/pci/pcie/../pci.h:560:18: note: previous declaration of
> ‘pcie_do_fatal_recovery’ was here
>   pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev,
>                    ^~~~~~~~~~~~~~~~~~~~~~
> drivers/pci/pcie/err.c:144:25: warning: ‘pcie_do_fatal_recovery’
> defined but not used [-Wunused-function]
>   static pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev,
Fixed in v6. Please use that version.
> 
> 
> Thanks,
> Ethan
> 
> On Tue, Oct 13, 2020 at 10:18 PM Kuppuswamy, Sathyanarayanan
> <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>>
>>
>>
>> On 10/12/20 2:05 PM, Raj, Ashok wrote:
>>> On Sun, Oct 11, 2020 at 10:03:40PM -0700, sathyanarayanan.nkuppuswamy@gmail.com wrote:
>>>> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>>>>
>>>> Currently if report_error_detected() or report_mmio_enabled()
>>>> functions requests PCI_ERS_RESULT_NEED_RESET, current
>>>> pcie_do_recovery() implementation does not do the requested
>>>> explicit device reset, but instead just calls the
>>>> report_slot_reset() on all affected devices. Notifying about the
>>>> reset via report_slot_reset() without doing the actual device
>>>> reset is incorrect. So call pci_bus_reset() before triggering
>>>> ->slot_reset() callback.
>>>>
>>>> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>>>> ---
>>>>    drivers/pci/pcie/err.c | 6 +-----
>>>>    1 file changed, 1 insertion(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>>>> index c543f419d8f9..067c58728b88 100644
>>>> --- a/drivers/pci/pcie/err.c
>>>> +++ b/drivers/pci/pcie/err.c
>>>> @@ -181,11 +181,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>>>       }
>>>>
>>>>       if (status == PCI_ERS_RESULT_NEED_RESET) {
>>>> -            /*
>>>> -             * TODO: Should call platform-specific
>>>> -             * functions to reset slot before calling
>>>> -             * drivers' slot_reset callbacks?
>>>> -             */
>>>> +            pci_reset_bus(dev);
>>>
>>> pci_reset_bus() returns an error, do you need to consult that before
>>> unconditionally setting PCI_ERS_RESULT_RECOVERED?
>> Good point. I will fix this in next version.
>>>
>>>>               status = PCI_ERS_RESULT_RECOVERED;
>>>>               pci_dbg(dev, "broadcast slot_reset message\n");
>>>>               pci_walk_bus(bus, report_slot_reset, &status);
>>>> --
>>>> 2.17.1
>>>>
>>
>> --
>> Sathyanarayanan Kuppuswamy
>> Linux Kernel Developer

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling
  2020-10-13 15:17     ` Kuppuswamy, Sathyanarayanan
@ 2020-10-15  6:43       ` Christoph Hellwig
  2020-10-15  6:49         ` Sathyanarayanan Kuppuswamy Natarajan
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2020-10-15  6:43 UTC (permalink / raw)
  To: Kuppuswamy, Sathyanarayanan
  Cc: Christoph Hellwig, sathyanarayanan.nkuppuswamy, bhelgaas, okaya,
	linux-pci, linux-kernel, ashok.raj, Keith Busch

On Tue, Oct 13, 2020 at 08:17:39AM -0700, Kuppuswamy, Sathyanarayanan wrote:
> 
> 
> On 10/13/20 4:56 AM, Christoph Hellwig wrote:
> > You might want to split out pcie_do_fatal_recovery and get rid of the
> > state argument:
> This is how it was before Keith merged fatal and non-fatal error recovery
> paths. When the comparison is between additional-parameter vs new-interface
> , I choose the former. But I can merge your change in next version.

But now you split the implementation.  Keith merged made complete sense
when the code was mostly identical.  But now that the code is separate
again it doesn't make sense to hide it under a common interface that
uses a flags value to call different functions.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling
  2020-10-15  6:43       ` Christoph Hellwig
@ 2020-10-15  6:49         ` Sathyanarayanan Kuppuswamy Natarajan
  0 siblings, 0 replies; 14+ messages in thread
From: Sathyanarayanan Kuppuswamy Natarajan @ 2020-10-15  6:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kuppuswamy, Sathyanarayanan, Bjorn Helgaas, okaya, linux-pci,
	linux-kernel, Ashok, Keith Busch

On Wed, Oct 14, 2020 at 11:43 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Tue, Oct 13, 2020 at 08:17:39AM -0700, Kuppuswamy, Sathyanarayanan wrote:
> >
> >
> > On 10/13/20 4:56 AM, Christoph Hellwig wrote:
> > > You might want to split out pcie_do_fatal_recovery and get rid of the
> > > state argument:
> > This is how it was before Keith merged fatal and non-fatal error recovery
> > paths. When the comparison is between additional-parameter vs new-interface
> > , I choose the former. But I can merge your change in next version.
>
> But now you split the implementation.  Keith merged made complete sense
> when the code was mostly identical.  But now that the code is separate
> again it doesn't make sense to hide it under a common interface that
> uses a flags value to call different functions.
Agreed. Already included this change in v6.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2020-10-15  6:49 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-12  5:03 [PATCH v4 1/2] PCI/ERR: Call pci_bus_reset() before calling ->slot_reset() callback sathyanarayanan.nkuppuswamy
2020-10-12  5:03 ` [PATCH v4 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling sathyanarayanan.nkuppuswamy
2020-10-12 14:50   ` Sinan Kaya
2020-10-13 11:56   ` Christoph Hellwig
2020-10-13 15:17     ` Kuppuswamy, Sathyanarayanan
2020-10-15  6:43       ` Christoph Hellwig
2020-10-15  6:49         ` Sathyanarayanan Kuppuswamy Natarajan
2020-10-14  5:44   ` Ethan Zhao
2020-10-14  5:51     ` Kuppuswamy, Sathyanarayanan
2020-10-12 14:51 ` [PATCH v4 1/2] PCI/ERR: Call pci_bus_reset() before calling ->slot_reset() callback Sinan Kaya
2020-10-12 21:05 ` Raj, Ashok
2020-10-12 21:47   ` Kuppuswamy, Sathyanarayanan
2020-10-14  8:00     ` Ethan Zhao
2020-10-14  8:19       ` Kuppuswamy, Sathyanarayanan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).