All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Oliver O'Halloran" <oohall@gmail.com>
To: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Cc: linuxppc-dev <linuxppc-dev@ozlabs.org>
Subject: Re: [PATCH] powerpc/eeh: skip slot presence check when PE is temporarily unavailable.
Date: Fri, 7 May 2021 10:41:46 +1000	[thread overview]
Message-ID: <CAOSf1CG4H2GrxV5C=55vcNue4taSAkgFOUg32yVspgw9+meDAg@mail.gmail.com> (raw)
In-Reply-To: <162032297784.225551.1220900342102038880.stgit@jupiter>

On Fri, May 7, 2021 at 3:43 AM Mahesh Salgaonkar <mahesh@linux.ibm.com> wrote:
>
> When certain PHB HW failure causes phyp to recover PHB, it marks the PE
> state as temporarily unavailable. In this case, per PAPR, rtas call
> ibm,read-slot-reset-state2 returns a PE state as temporarily unavailable(5)
> and OS has to wait until that recovery is complete. During this state the
> slot presence check 'get-sensor-state(dr-entity-sense)' returns as DR
> connector empty which leads to assumption that the device has been
> hot-removed. This results into no EEH recovery on this device and it stays
> in failed state forever.
>
> This patch fixes this issue by skipping slot presence check only if device
> PE state is temporarily unavailable(5).
>
> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
> ---
> * snip*
>
>         /*
>          * It should be corner case that the parent PE has been
> diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
> index 3eff6a4888e79..a0913768f33de 100644
> --- a/arch/powerpc/kernel/eeh_driver.c
> +++ b/arch/powerpc/kernel/eeh_driver.c
> @@ -851,6 +851,17 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
>                 return;
>         }
>
> +       /*
> +        * When PE's state is temporarily unavailable, the slot
> +        * presence check returns as DR connector empty.

That sounds like a bug in either RTAS or the hotplug slot driver (or
both). The presence check is there largely to filter out events that
we can guarantee are not recoverable (i.e. surprise hot-unplug). In
every other case (especially if we can't determine the state) we
should be going down the recovery path. If the hotplug slot driver is
incorrectly reporting the card has been removed then you should be
fixing the slot driver.

> +        * to assumption that the device is hot-removed and causes EEH
> +        * recovery to stop leaving the device in failed state forever.
> +        * Hence skip the slot presence check if PE's state is
> +        * temporarily unavailable and go down EEH recovery path.
> +        */
> +       if (pe->state & EEH_PE_TEMP_UNAVAIL)
> +               goto skip_slot_presence_check;

There's a time-of-check-vs-time-of-use error here. You're setting this
flag at the point of detection, but there can be a significant lag
time between when an EEH is initially detected and when it's handled
by the recovery thread (usually due to other events being recovered).
Transitioning the PE into and out of PE_TEMP_UNAVAILABLE is handled
autonomously by the hypervisor so by the time we get to recovery the
PE may be back into a normal state where the slot check works fine.

  reply	other threads:[~2021-05-07  0:43 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-06 17:43 [PATCH] powerpc/eeh: skip slot presence check when PE is temporarily unavailable Mahesh Salgaonkar
2021-05-07  0:41 ` Oliver O'Halloran [this message]
2021-10-18 17:28   ` Mahesh J Salgaonkar

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAOSf1CG4H2GrxV5C=55vcNue4taSAkgFOUg32yVspgw9+meDAg@mail.gmail.com' \
    --to=oohall@gmail.com \
    --cc=linuxppc-dev@ozlabs.org \
    --cc=mahesh@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.