All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mahesh J Salgaonkar <mahesh@linux.ibm.com>
To: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev <linuxppc-dev@ozlabs.org>,
	Oliver O'Halloran <oohall@gmail.com>
Subject: Re: [PATCH] powerpc/eeh: Delay slot presence check once driver is notified about the pci error.
Date: Wed, 24 Nov 2021 14:15:39 +0530	[thread overview]
Message-ID: <20211124084539.issrrg2lxq3mp6mj@in.ibm.com> (raw)
In-Reply-To: <875ysiqxbd.fsf@mpe.ellerman.id.au>

On 2021-11-24 10:14:30 Wed, Michael Ellerman wrote:
> Mahesh Salgaonkar <mahesh@linux.ibm.com> writes:
> > When certain PHB HW failure causes phyp to recover PHB, it marks the PE
> > state as temporarily unavailable until recovery is complete. This also
> > triggers an EEH handler in Linux which needs to notify drivers, and perform
> > recovery. But before notifying the driver about the pci error it uses
> > get_adapter_state()->get-sesnor-state() operation of the hotplug_slot to
> > determine if the slot contains a device or not. if the slot is empty, the
> > recovery is skipped entirely.
> >
> > However on certain PHB failures, the rtas call get-sesnor-state() returns
> > extended busy error (9902) until PHB is recovered by phyp. Once PHB is
> > recovered, the get-sensor-state() returns success with correct presence
> > status. The rtas call interface rtas_get_sensor() loops over the rtas call
> > on extended delay return code (9902) until the return value is either
> > success (0) or error (-1). This causes the EEH handler to get stuck for ~6
> > seconds before it could notify that the pci error has been detected and
> > stop any active operations. Hence with running I/O traffic, during this 6
> > seconds, the network driver continues its operation and hits a timeout
> > (netdev watchdog). On timeouts, network driver go into ffdc capture mode
> > and reset path assuming the PCI device is in fatal condition. This causes
> > EEH recovery to fail and sometimes it leads to system hang or crash.
> >
> > ------------
> > [52732.244731] DEBUG: ibm_read_slot_reset_state2()
> > [52732.244762] DEBUG: ret = 0, rets[0]=5, rets[1]=1, rets[2]=4000, rets[3]=0x0
> > [52732.244798] DEBUG: in eeh_slot_presence_check
> > [52732.244804] DEBUG: error state check
> > [52732.244807] DEBUG: Is slot hotpluggable
> > [52732.244810] DEBUG: hotpluggable ops ?
> > [52732.244953] DEBUG: Calling ops->get_adapter_status
> > [52732.244958] DEBUG: calling rpaphp_get_sensor_state
> > [52736.564262] ------------[ cut here ]------------
> > [52736.564299] NETDEV WATCHDOG: enP64p1s0f3 (tg3): transmit queue 0 timed out
> > [52736.564324] WARNING: CPU: 1442 PID: 0 at net/sched/sch_generic.c:478 dev_watchdog+0x438/0x440
> > [...]
> > [52736.564505] NIP [c000000000c32368] dev_watchdog+0x438/0x440
> > [52736.564513] LR [c000000000c32364] dev_watchdog+0x434/0x440
> > ------------
> >
> > To fix this issue, delay the slot presence check after notifying the driver
> > about the pci error.
> 
> How does this interact with the commit that put the slot presence check
> there in the first place:
> 
>   b104af5a7687 ("powerpc/eeh: Check slot presence state in eeh_handle_normal_event()")
> 
> 
> It seems like delaying the slot presence check will effectively revert
> that commit?

No it doesn't. We will still do a presence check before the recovery
process starts. This patch moves the check after notifying the driver to
stop active I/O operations. If a presence check finds the device isn't
present, we will skip the EEH recovery. However, on a surprise hotplug,
the user will see the EEH messages on the console before it finds there
is nothing to recover.

Current EEH behaviour:

EEH event -> eeh_handle_normal_event
		/* Check for adapter status */
		eeh_slot_presence_check()
			if (!present)
				bail out early

		/* Report the error */
		eeh_report_error() <- notify driver about error
			driver->err_handler->error_detected()
			/* Any active I/O will be stopped now */

		/* Start the recovery process */
		eeh_reset_device()
		eeh_report_resume()
		/* Recovery done */

With this patch:

EEH event -> eeh_handle_normal_event
		/* Report the error */
		eeh_report_error() <- notify driver about error
			driver->err_handler->error_detected()
			/* Any active I/O will be stopped now */

		/* Check for adapter status */
		eeh_slot_presence_check()
			if (!present)
				bail out early

		/* Start the recovery process */
		eeh_reset_device()
		eeh_report_resume()
		/* Recovery done */

Thanks,
-Mahesh.


  reply	other threads:[~2021-11-24  8:46 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-23 13:05 [PATCH] powerpc/eeh: Delay slot presence check once driver is notified about the pci error Mahesh Salgaonkar
2021-11-23 23:14 ` Michael Ellerman
2021-11-24  8:45   ` Mahesh J Salgaonkar [this message]
2021-11-24 11:57     ` Oliver O'Halloran
2021-11-25  5:34       ` Mahesh J Salgaonkar
2021-11-24 12:01 ` Oliver O'Halloran
2021-11-29  8:14   ` Mahesh J Salgaonkar

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20211124084539.issrrg2lxq3mp6mj@in.ibm.com \
    --to=mahesh@linux.ibm.com \
    --cc=linuxppc-dev@ozlabs.org \
    --cc=mpe@ellerman.id.au \
    --cc=oohall@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.