All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] powerpc/eeh: Delay slot presence check once driver is notified about the pci error.
@ 2021-11-23 13:05 Mahesh Salgaonkar
  2021-11-23 23:14 ` Michael Ellerman
  2021-11-24 12:01 ` Oliver O'Halloran
  0 siblings, 2 replies; 7+ messages in thread
From: Mahesh Salgaonkar @ 2021-11-23 13:05 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Oliver O'Halloran

When certain PHB HW failure causes phyp to recover PHB, it marks the PE
state as temporarily unavailable until recovery is complete. This also
triggers an EEH handler in Linux which needs to notify drivers, and perform
recovery. But before notifying the driver about the pci error it uses
get_adapter_state()->get-sesnor-state() operation of the hotplug_slot to
determine if the slot contains a device or not. if the slot is empty, the
recovery is skipped entirely.

However on certain PHB failures, the rtas call get-sesnor-state() returns
extended busy error (9902) until PHB is recovered by phyp. Once PHB is
recovered, the get-sensor-state() returns success with correct presence
status. The rtas call interface rtas_get_sensor() loops over the rtas call
on extended delay return code (9902) until the return value is either
success (0) or error (-1). This causes the EEH handler to get stuck for ~6
seconds before it could notify that the pci error has been detected and
stop any active operations. Hence with running I/O traffic, during this 6
seconds, the network driver continues its operation and hits a timeout
(netdev watchdog). On timeouts, network driver go into ffdc capture mode
and reset path assuming the PCI device is in fatal condition. This causes
EEH recovery to fail and sometimes it leads to system hang or crash.

------------
[52732.244731] DEBUG: ibm_read_slot_reset_state2()
[52732.244762] DEBUG: ret = 0, rets[0]=5, rets[1]=1, rets[2]=4000, rets[3]=0x0
[52732.244798] DEBUG: in eeh_slot_presence_check
[52732.244804] DEBUG: error state check
[52732.244807] DEBUG: Is slot hotpluggable
[52732.244810] DEBUG: hotpluggable ops ?
[52732.244953] DEBUG: Calling ops->get_adapter_status
[52732.244958] DEBUG: calling rpaphp_get_sensor_state
[52736.564262] ------------[ cut here ]------------
[52736.564299] NETDEV WATCHDOG: enP64p1s0f3 (tg3): transmit queue 0 timed out
[52736.564324] WARNING: CPU: 1442 PID: 0 at net/sched/sch_generic.c:478 dev_watchdog+0x438/0x440
[...]
[52736.564505] NIP [c000000000c32368] dev_watchdog+0x438/0x440
[52736.564513] LR [c000000000c32364] dev_watchdog+0x434/0x440
------------

To fix this issue, delay the slot presence check after notifying the driver
about the pci error.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
---
 arch/powerpc/kernel/eeh_driver.c |   42 ++++++++++++++++++++------------------
 1 file changed, 22 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 350dab18e1373..a4a80451d50f7 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -851,26 +851,6 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 		return;
 	}
 
-	/*
-	 * When devices are hot-removed we might get an EEH due to
-	 * a driver attempting to touch the MMIO space of a removed
-	 * device. In this case we don't have a device to recover
-	 * so suppress the event if we can't find any present devices.
-	 *
-	 * The hotplug driver should take care of tearing down the
-	 * device itself.
-	 */
-	eeh_for_each_pe(pe, tmp_pe)
-		eeh_pe_for_each_dev(tmp_pe, edev, tmp)
-			if (eeh_slot_presence_check(edev->pdev))
-				devices++;
-
-	if (!devices) {
-		pr_debug("EEH: Frozen PHB#%x-PE#%x is empty!\n",
-			pe->phb->global_number, pe->addr);
-		goto out; /* nothing to recover */
-	}
-
 	/* Log the event */
 	if (pe->type & EEH_PE_PHB) {
 		pr_err("EEH: Recovering PHB#%x, location: %s\n",
@@ -942,6 +922,28 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 			result = PCI_ERS_RESULT_NEED_RESET;
 	}
 
+	/*
+	 * When devices are hot-removed we might get an EEH due to
+	 * a driver attempting to touch the MMIO space of a removed
+	 * device. In this case we don't have a device to recover
+	 * bail out early as there is nothing to recover.
+	 *
+	 * The hotplug driver should take care of tearing down the
+	 * device itself.
+	 */
+	eeh_for_each_pe(pe, tmp_pe)
+		eeh_pe_for_each_dev(tmp_pe, edev, tmp)
+			if (eeh_slot_presence_check(edev->pdev))
+				devices++;
+
+	if (!devices) {
+		pr_info("EEH: Frozen PHB#%x-PE#%x is empty!\n",
+			pe->phb->global_number, pe->addr);
+		pr_info("EEH: Surprise hotplug remove. nothing to recover.\n");
+		eeh_set_channel_state(pe, pci_channel_io_perm_failure);
+		goto out; /* nothing to recover */
+	}
+
 	/* Get the current PCI slot state. This can take a long time,
 	 * sometimes over 300 seconds for certain systems.
 	 */



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] powerpc/eeh: Delay slot presence check once driver is notified about the pci error.
  2021-11-23 13:05 [PATCH] powerpc/eeh: Delay slot presence check once driver is notified about the pci error Mahesh Salgaonkar
@ 2021-11-23 23:14 ` Michael Ellerman
  2021-11-24  8:45   ` Mahesh J Salgaonkar
  2021-11-24 12:01 ` Oliver O'Halloran
  1 sibling, 1 reply; 7+ messages in thread
From: Michael Ellerman @ 2021-11-23 23:14 UTC (permalink / raw)
  To: Mahesh Salgaonkar, linuxppc-dev; +Cc: Oliver O'Halloran

Mahesh Salgaonkar <mahesh@linux.ibm.com> writes:
> When certain PHB HW failure causes phyp to recover PHB, it marks the PE
> state as temporarily unavailable until recovery is complete. This also
> triggers an EEH handler in Linux which needs to notify drivers, and perform
> recovery. But before notifying the driver about the pci error it uses
> get_adapter_state()->get-sesnor-state() operation of the hotplug_slot to
> determine if the slot contains a device or not. if the slot is empty, the
> recovery is skipped entirely.
>
> However on certain PHB failures, the rtas call get-sesnor-state() returns
> extended busy error (9902) until PHB is recovered by phyp. Once PHB is
> recovered, the get-sensor-state() returns success with correct presence
> status. The rtas call interface rtas_get_sensor() loops over the rtas call
> on extended delay return code (9902) until the return value is either
> success (0) or error (-1). This causes the EEH handler to get stuck for ~6
> seconds before it could notify that the pci error has been detected and
> stop any active operations. Hence with running I/O traffic, during this 6
> seconds, the network driver continues its operation and hits a timeout
> (netdev watchdog). On timeouts, network driver go into ffdc capture mode
> and reset path assuming the PCI device is in fatal condition. This causes
> EEH recovery to fail and sometimes it leads to system hang or crash.
>
> ------------
> [52732.244731] DEBUG: ibm_read_slot_reset_state2()
> [52732.244762] DEBUG: ret = 0, rets[0]=5, rets[1]=1, rets[2]=4000, rets[3]=0x0
> [52732.244798] DEBUG: in eeh_slot_presence_check
> [52732.244804] DEBUG: error state check
> [52732.244807] DEBUG: Is slot hotpluggable
> [52732.244810] DEBUG: hotpluggable ops ?
> [52732.244953] DEBUG: Calling ops->get_adapter_status
> [52732.244958] DEBUG: calling rpaphp_get_sensor_state
> [52736.564262] ------------[ cut here ]------------
> [52736.564299] NETDEV WATCHDOG: enP64p1s0f3 (tg3): transmit queue 0 timed out
> [52736.564324] WARNING: CPU: 1442 PID: 0 at net/sched/sch_generic.c:478 dev_watchdog+0x438/0x440
> [...]
> [52736.564505] NIP [c000000000c32368] dev_watchdog+0x438/0x440
> [52736.564513] LR [c000000000c32364] dev_watchdog+0x434/0x440
> ------------
>
> To fix this issue, delay the slot presence check after notifying the driver
> about the pci error.

How does this interact with the commit that put the slot presence check
there in the first place:

  b104af5a7687 ("powerpc/eeh: Check slot presence state in eeh_handle_normal_event()")


It seems like delaying the slot presence check will effectively revert
that commit?

cheers

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] powerpc/eeh: Delay slot presence check once driver is notified about the pci error.
  2021-11-23 23:14 ` Michael Ellerman
@ 2021-11-24  8:45   ` Mahesh J Salgaonkar
  2021-11-24 11:57     ` Oliver O'Halloran
  0 siblings, 1 reply; 7+ messages in thread
From: Mahesh J Salgaonkar @ 2021-11-24  8:45 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev, Oliver O'Halloran

On 2021-11-24 10:14:30 Wed, Michael Ellerman wrote:
> Mahesh Salgaonkar <mahesh@linux.ibm.com> writes:
> > When certain PHB HW failure causes phyp to recover PHB, it marks the PE
> > state as temporarily unavailable until recovery is complete. This also
> > triggers an EEH handler in Linux which needs to notify drivers, and perform
> > recovery. But before notifying the driver about the pci error it uses
> > get_adapter_state()->get-sesnor-state() operation of the hotplug_slot to
> > determine if the slot contains a device or not. if the slot is empty, the
> > recovery is skipped entirely.
> >
> > However on certain PHB failures, the rtas call get-sesnor-state() returns
> > extended busy error (9902) until PHB is recovered by phyp. Once PHB is
> > recovered, the get-sensor-state() returns success with correct presence
> > status. The rtas call interface rtas_get_sensor() loops over the rtas call
> > on extended delay return code (9902) until the return value is either
> > success (0) or error (-1). This causes the EEH handler to get stuck for ~6
> > seconds before it could notify that the pci error has been detected and
> > stop any active operations. Hence with running I/O traffic, during this 6
> > seconds, the network driver continues its operation and hits a timeout
> > (netdev watchdog). On timeouts, network driver go into ffdc capture mode
> > and reset path assuming the PCI device is in fatal condition. This causes
> > EEH recovery to fail and sometimes it leads to system hang or crash.
> >
> > ------------
> > [52732.244731] DEBUG: ibm_read_slot_reset_state2()
> > [52732.244762] DEBUG: ret = 0, rets[0]=5, rets[1]=1, rets[2]=4000, rets[3]=0x0
> > [52732.244798] DEBUG: in eeh_slot_presence_check
> > [52732.244804] DEBUG: error state check
> > [52732.244807] DEBUG: Is slot hotpluggable
> > [52732.244810] DEBUG: hotpluggable ops ?
> > [52732.244953] DEBUG: Calling ops->get_adapter_status
> > [52732.244958] DEBUG: calling rpaphp_get_sensor_state
> > [52736.564262] ------------[ cut here ]------------
> > [52736.564299] NETDEV WATCHDOG: enP64p1s0f3 (tg3): transmit queue 0 timed out
> > [52736.564324] WARNING: CPU: 1442 PID: 0 at net/sched/sch_generic.c:478 dev_watchdog+0x438/0x440
> > [...]
> > [52736.564505] NIP [c000000000c32368] dev_watchdog+0x438/0x440
> > [52736.564513] LR [c000000000c32364] dev_watchdog+0x434/0x440
> > ------------
> >
> > To fix this issue, delay the slot presence check after notifying the driver
> > about the pci error.
> 
> How does this interact with the commit that put the slot presence check
> there in the first place:
> 
>   b104af5a7687 ("powerpc/eeh: Check slot presence state in eeh_handle_normal_event()")
> 
> 
> It seems like delaying the slot presence check will effectively revert
> that commit?

No it doesn't. We will still do a presence check before the recovery
process starts. This patch moves the check after notifying the driver to
stop active I/O operations. If a presence check finds the device isn't
present, we will skip the EEH recovery. However, on a surprise hotplug,
the user will see the EEH messages on the console before it finds there
is nothing to recover.

Current EEH behaviour:

EEH event -> eeh_handle_normal_event
		/* Check for adapter status */
		eeh_slot_presence_check()
			if (!present)
				bail out early

		/* Report the error */
		eeh_report_error() <- notify driver about error
			driver->err_handler->error_detected()
			/* Any active I/O will be stopped now */

		/* Start the recovery process */
		eeh_reset_device()
		eeh_report_resume()
		/* Recovery done */

With this patch:

EEH event -> eeh_handle_normal_event
		/* Report the error */
		eeh_report_error() <- notify driver about error
			driver->err_handler->error_detected()
			/* Any active I/O will be stopped now */

		/* Check for adapter status */
		eeh_slot_presence_check()
			if (!present)
				bail out early

		/* Start the recovery process */
		eeh_reset_device()
		eeh_report_resume()
		/* Recovery done */

Thanks,
-Mahesh.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] powerpc/eeh: Delay slot presence check once driver is notified about the pci error.
  2021-11-24  8:45   ` Mahesh J Salgaonkar
@ 2021-11-24 11:57     ` Oliver O'Halloran
  2021-11-25  5:34       ` Mahesh J Salgaonkar
  0 siblings, 1 reply; 7+ messages in thread
From: Oliver O'Halloran @ 2021-11-24 11:57 UTC (permalink / raw)
  To: Mahesh J Salgaonkar; +Cc: linuxppc-dev

On Wed, Nov 24, 2021 at 7:45 PM Mahesh J Salgaonkar
<mahesh@linux.ibm.com> wrote:
>
> No it doesn't. We will still do a presence check before the recovery
> process starts. This patch moves the check after notifying the driver to
> stop active I/O operations. If a presence check finds the device isn't
> present, we will skip the EEH recovery. However, on a surprise hotplug,
> the user will see the EEH messages on the console before it finds there
> is nothing to recover.

Suppressing the spurious EEH messages was part of why I added that
check in the first place. If you want to defer the presence check
until later you should move the stack trace printing, etc to after
we've confirmed there are still devices present. Considering the
motivation for this patch is to avoid spurious warnings from the
driver I don't think printing spurious EEH messages is much of an
improvement.

The other option would be returning an error from the pseries hotplug
driver. IIRC that's what pnv_php / OPAL does if the PHB is fenced and
we can't check the slot presence state.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] powerpc/eeh: Delay slot presence check once driver is notified about the pci error.
  2021-11-23 13:05 [PATCH] powerpc/eeh: Delay slot presence check once driver is notified about the pci error Mahesh Salgaonkar
  2021-11-23 23:14 ` Michael Ellerman
@ 2021-11-24 12:01 ` Oliver O'Halloran
  2021-11-29  8:14   ` Mahesh J Salgaonkar
  1 sibling, 1 reply; 7+ messages in thread
From: Oliver O'Halloran @ 2021-11-24 12:01 UTC (permalink / raw)
  To: Mahesh Salgaonkar; +Cc: linuxppc-dev

On Wed, Nov 24, 2021 at 12:05 AM Mahesh Salgaonkar <mahesh@linux.ibm.com> wrote:
>
> *snip*
>
> This causes the EEH handler to get stuck for ~6
> seconds before it could notify that the pci error has been detected and
> stop any active operations. Hence with running I/O traffic, during this 6
> seconds, the network driver continues its operation and hits a timeout
> (netdev watchdog).On timeouts, network driver go into ffdc capture mode
> and reset path assuming the PCI device is in fatal condition. This causes
> EEH recovery to fail and sometimes it leads to system hang or crash.

Whatever is causing that crash is the real issue IMO. PCI error
reporting is fundamentally asynchronous and the driver always has to
tolerate some amount of latency between the error occuring and being
reported. Six seconds is admittedly an eternity, but it should not
cause a system crash under any circumstances. Printing a warning due
to a timeout is annoying, but it's not the end of the world.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] powerpc/eeh: Delay slot presence check once driver is notified about the pci error.
  2021-11-24 11:57     ` Oliver O'Halloran
@ 2021-11-25  5:34       ` Mahesh J Salgaonkar
  0 siblings, 0 replies; 7+ messages in thread
From: Mahesh J Salgaonkar @ 2021-11-25  5:34 UTC (permalink / raw)
  To: Oliver O'Halloran; +Cc: linuxppc-dev

On 2021-11-24 22:57:13 Wed, Oliver O'Halloran wrote:
> On Wed, Nov 24, 2021 at 7:45 PM Mahesh J Salgaonkar
> <mahesh@linux.ibm.com> wrote:
> >
> > No it doesn't. We will still do a presence check before the recovery
> > process starts. This patch moves the check after notifying the driver to
> > stop active I/O operations. If a presence check finds the device isn't
> > present, we will skip the EEH recovery. However, on a surprise hotplug,
> > the user will see the EEH messages on the console before it finds there
> > is nothing to recover.
> 
> Suppressing the spurious EEH messages was part of why I added that
> check in the first place. If you want to defer the presence check
> until later you should move the stack trace printing, etc to after
> we've confirmed there are still devices present. Considering the

That will help suppressing the spurious EEH messages.

> motivation for this patch is to avoid spurious warnings from the
> driver I don't think printing spurious EEH messages is much of an
> improvement.

Agree.

> 
> The other option would be returning an error from the pseries hotplug
> driver. IIRC that's what pnv_php / OPAL does if the PHB is fenced and
> we can't check the slot presence state.

Yeah. I can change rpaphp_get_sensor_state() to use
rtas_get_sensor_fast() variant which will return immediately with an
error on extended busy error. That way we don't need to move the slot
presence check at all. I did test that and it does fix the problem. But
I wasn't sure if that would have any implications on hotplug driver
behaviour. If pnv_php / OPAL does the same thing then this would be a
cleaner approach to fix this issue. Let me send out the patch with this
other option to fix the pseries hotplug driver instead.

Thanks,
-Mahesh.

-- 
Mahesh J Salgaonkar

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] powerpc/eeh: Delay slot presence check once driver is notified about the pci error.
  2021-11-24 12:01 ` Oliver O'Halloran
@ 2021-11-29  8:14   ` Mahesh J Salgaonkar
  0 siblings, 0 replies; 7+ messages in thread
From: Mahesh J Salgaonkar @ 2021-11-29  8:14 UTC (permalink / raw)
  To: Oliver O'Halloran; +Cc: linuxppc-dev

On 2021-11-24 23:01:45 Wed, Oliver O'Halloran wrote:
> On Wed, Nov 24, 2021 at 12:05 AM Mahesh Salgaonkar <mahesh@linux.ibm.com> wrote:
> >
> > *snip*
> >
> > This causes the EEH handler to get stuck for ~6
> > seconds before it could notify that the pci error has been detected and
> > stop any active operations. Hence with running I/O traffic, during this 6
> > seconds, the network driver continues its operation and hits a timeout
> > (netdev watchdog).On timeouts, network driver go into ffdc capture mode
> > and reset path assuming the PCI device is in fatal condition. This causes
> > EEH recovery to fail and sometimes it leads to system hang or crash.
> 
> Whatever is causing that crash is the real issue IMO. PCI error

I have seen crash only once but that was triggered by HTX tool and may
not be related. However, the major concern here is EEH failure. I will
correct the above statement in my next patch.

> reporting is fundamentally asynchronous and the driver always has to
> tolerate some amount of latency between the error occuring and being
> reported. Six seconds is admittedly an eternity, but it should not
> cause a system crash under any circumstances. Printing a warning due
> to a timeout is annoying, but it's not the end of the world.

Yeah, but due to timeout sometimes the driver gets into a situation
where when EEH recovery kicks-in, the driver is unable to recover the
device. Thus EEH recovery fails and disconnects the pci device even when
it could have recovered. To recover, we need to either reboot the lpar
or re-assign the I/O adapter from HMC to get it back in working
condition.

[16532.212197] EEH: PCI-E AER 30: 00000000 00000000
[16532.213207] EEH: Reset without hotplug activity
[16534.229469] bnx2x: [bnx2x_clean_tx_queue:1203(enP22p1s0f1)]timeout waiting for queue[2]: txdata->tx_pkt_prod(37003) != txdata->tx_pkt_cons(36996)
[16534.385484] EEH: Beginning: 'slot_reset'
[16534.385489] PCI 0016:01:00.0#10000: EEH: Invoking bnx2x->slot_reset()
[16536.229469] bnx2x: [bnx2x_clean_tx_queue:1203(enP22p1s0f1)]timeout waiting for queue[4]: txdata->tx_pkt_prod(64894) != txdata->tx_pkt_cons(64891)
o[...]
[16623.571502] bnx2x: [bnx2x_nic_load_request:2342(enP22p1s0f1)]MCP response failure, aborting
[16623.571507] bnx2x: [bnx2x_acquire_hw_lock:2019(enP22p1s0f1)]lock_status 0xffffffff  resource_bit 0x800
[16623.571881] bnx2x: [bnx2x_io_slot_reset:14359(enP22p1s0f0)]IO slot reset initializing...
[16623.571976] bnx2x 0016:01:00.0: enabling device (0140 -> 0142)
[16623.576169] bnx2x: [bnx2x_io_slot_reset:14375(enP22p1s0f0)]IO slot reset --> driver unload
[16623.576174] PCI 0016:01:00.0#10000: EEH: bnx2x driver reports: 'disconnect'
[16623.576177] PCI 0016:01:00.1#10000: EEH: Invoking bnx2x->slot_reset()
[16623.576179] bnx2x: [bnx2x_io_slot_reset:14359(enP22p1s0f1)]IO slot reset initializing...
[16623.576239] bnx2x 0016:01:00.1: enabling device (0140 -> 0142)
[16623.580241] bnx2x: [bnx2x_io_slot_reset:14375(enP22p1s0f1)]IO slot reset --> driver unload
[16623.580245] PCI 0016:01:00.1#10000: EEH: bnx2x driver reports: 'disconnect'
[16623.580246] EEH: Finished:'slot_reset' with aggregate recovery state:'disconnect'
[16623.580250] EEH: Unable to recover from failure from PHB#16-PE#10000.

Thanks,
-Mahesh.

-- 
Mahesh J Salgaonkar

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-11-29  8:15 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-23 13:05 [PATCH] powerpc/eeh: Delay slot presence check once driver is notified about the pci error Mahesh Salgaonkar
2021-11-23 23:14 ` Michael Ellerman
2021-11-24  8:45   ` Mahesh J Salgaonkar
2021-11-24 11:57     ` Oliver O'Halloran
2021-11-25  5:34       ` Mahesh J Salgaonkar
2021-11-24 12:01 ` Oliver O'Halloran
2021-11-29  8:14   ` Mahesh J Salgaonkar

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.