linuxppc-dev.lists.ozlabs.org archive mirror
From: Mahesh Salgaonkar <mahesh@linux.ibm.com>
To: linuxppc-dev <linuxppc-dev@ozlabs.org>
Cc: Oliver O'Halloran <oohall@gmail.com>
Subject: [PATCH] powerpc/eeh: skip slot presence check when PE is temporarily unavailable.
Date: Thu, 06 May 2021 23:13:15 +0530
Message-ID: <162032297784.225551.1220900342102038880.stgit@jupiter>

When certain PHB hardware failures cause phyp to recover the PHB, phyp
marks the PE state as temporarily unavailable. In this case, per PAPR, the
RTAS call ibm,read-slot-reset-state2 returns a PE state of temporarily
unavailable (5), and the OS has to wait until that recovery is complete.
While the PE is in this state, the slot presence check
'get-sensor-state(dr-entity-sense)' reports the DR connector as empty,
which leads to the assumption that the device has been hot-removed. As a
result, no EEH recovery is attempted on the device and it stays in the
failed state forever.

Fix this by skipping the slot presence check only when the device's PE
state is temporarily unavailable (5).

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
---
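Note (not part of the commit message): the intended flow may be easier to
follow as a stand-alone user-space model. This is only a sketch under
stated assumptions, not the kernel implementation. Every model_/MODEL_
name below is hypothetical, the enum values are placeholders rather than
the real EEH_STATE_* values, and the presence sensor is reduced to a
single boolean. Only the (1 << 4) flag bit is taken from the patch.

```c
#include <stdbool.h>

/* Same bit position as EEH_PE_TEMP_UNAVAIL in this patch. */
#define MODEL_PE_TEMP_UNAVAIL	(1u << 4)

/* Hypothetical stand-ins for the PE state returned by the platform. */
enum model_pe_state {
	MODEL_STATE_ACTIVE,
	MODEL_STATE_NOT_SUPPORT,
	MODEL_STATE_UNAVAILABLE,	/* "temporarily unavailable (5)" case */
	MODEL_STATE_FROZEN,
};

struct model_pe {
	unsigned int state;		/* bitmask of MODEL_PE_* flags */
	bool slot_reports_present;	/* what the presence check would say */
};

/*
 * Models the eeh_dev_check_failure() hunk: punt on false positives,
 * otherwise remember whether the PE was temporarily unavailable.
 * Returns true when a recovery event should be queued.
 */
static bool model_check_failure(struct model_pe *pe, int ret)
{
	if (ret < 0 || ret == MODEL_STATE_NOT_SUPPORT ||
	    ret == MODEL_STATE_ACTIVE)
		return false;

	if (ret == MODEL_STATE_UNAVAILABLE)
		pe->state |= MODEL_PE_TEMP_UNAVAIL;

	return true;
}

/*
 * Models the eeh_handle_normal_event() hunks: the presence check is
 * consulted only when the flag is clear. Returns true when recovery ran.
 */
static bool model_handle_event(struct model_pe *pe)
{
	bool skip_presence = (pe->state & MODEL_PE_TEMP_UNAVAIL) != 0;

	if (!skip_presence && !pe->slot_reports_present)
		return false;	/* treated as hot-removed, nothing to recover */

	/* ... wait for the PE, reset, notify drivers (elided) ... */

	/* Waiting is done, so drop the temporary marker again. */
	pe->state &= ~MODEL_PE_TEMP_UNAVAIL;
	return true;
}
```

In this model, a PE that reports MODEL_STATE_UNAVAILABLE still reaches
recovery even though its slot reads as empty, while a PE that is frozen
with an empty slot is still treated as hot-removed, which is the
behaviour the patch aims to preserve.
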
 arch/powerpc/include/asm/eeh.h   |    1 +
 arch/powerpc/kernel/eeh.c        |   14 ++++++++++++--
 arch/powerpc/kernel/eeh_driver.c |   18 ++++++++++++++++++
 3 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index b1a5bba2e0b94..5dc5538e39b62 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -64,6 +64,7 @@ struct pci_dn;
 #define EEH_PE_RECOVERING	(1 << 1)	/* Recovering PE	*/
 #define EEH_PE_CFG_BLOCKED	(1 << 2)	/* Block config access	*/
 #define EEH_PE_RESET		(1 << 3)	/* PE reset in progress */
+#define EEH_PE_TEMP_UNAVAIL	(1 << 4)	/* PE is temporarily unavailable */
 
 #define EEH_PE_KEEP		(1 << 8)	/* Keep PE on hotplug	*/
 #define EEH_PE_CFG_RESTRICTED	(1 << 9)	/* Block config on error */
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 7040e430a1249..7fcbf3df18583 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -405,7 +405,8 @@ static int eeh_phb_check_failure(struct eeh_pe *pe)
 	    (ret == EEH_STATE_NOT_SUPPORT) || eeh_state_active(ret)) {
 		ret = 0;
 		goto out;
-	}
+	} else if (ret == EEH_STATE_UNAVAILABLE)
+		eeh_pe_state_mark(phb_pe, EEH_PE_TEMP_UNAVAIL);
 
 	/* Isolate the PHB and send event */
 	eeh_pe_mark_isolated(phb_pe);
@@ -519,14 +520,23 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
 	 * We will punt with the following conditions: Failure to get
 	 * PE's state, EEH not support and Permanently unavailable
 	 * state, PE is in good state.
+	 *
+	 * Certain PHB HW failures cause phyp/hypervisor to recover the PHB,
+	 * and until that recovery completes the PE's state is temporarily
+	 * unavailable (EEH_STATE_UNAVAILABLE). In this state the slot
+	 * presence check must be avoided since it may not return valid
+	 * status. Mark this PE as temporarily unavailable so that we can
+	 * check it later.
 	 */
+
 	if ((ret < 0) ||
 	    (ret == EEH_STATE_NOT_SUPPORT) || eeh_state_active(ret)) {
 		eeh_stats.false_positives++;
 		pe->false_positives++;
 		rc = 0;
 		goto dn_unlock;
-	}
+	} else if (ret == EEH_STATE_UNAVAILABLE)
+		eeh_pe_state_mark(pe, EEH_PE_TEMP_UNAVAIL);
 
 	/*
 	 * It should be corner case that the parent PE has been
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 3eff6a4888e79..a0913768f33de 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -851,6 +851,17 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 		return;
 	}
 
+	/*
+	 * When the PE's state is temporarily unavailable, the slot
+	 * presence check reports the DR connector as empty. This leads
+	 * to the assumption that the device was hot-removed and stops
+	 * EEH recovery, leaving the device in failed state forever.
+	 * Hence skip the slot presence check if the PE's state is
+	 * temporarily unavailable and go down the EEH recovery path.
+	 */
+	if (pe->state & EEH_PE_TEMP_UNAVAIL)
+		goto skip_slot_presence_check;
+
 	/*
 	 * When devices are hot-removed we might get an EEH due to
 	 * a driver attempting to touch the MMIO space of a removed
@@ -871,6 +882,7 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 		goto out; /* nothing to recover */
 	}
 
+skip_slot_presence_check:
 	/* Log the event */
 	if (pe->type & EEH_PE_PHB) {
 		pr_err("EEH: Recovering PHB#%x, location: %s\n",
@@ -953,6 +965,12 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 		}
 	}
 
+	/*
+	 * Now that we have finished waiting for the PE state as per
+	 * PAPR, clear the PE temporarily unavailable state.
+	 */
+	eeh_pe_state_clear(pe, EEH_PE_TEMP_UNAVAIL, true);
+
 	/* Since rtas may enable MMIO when posting the error log,
 	 * don't post the error log until after all dev drivers
 	 * have been informed.




Thread overview: 3+ messages
2021-05-06 17:43 Mahesh Salgaonkar [this message]
2021-05-07  0:41 ` [PATCH] powerpc/eeh: skip slot presence check when PE is temporarily unavailable Oliver O'Halloran
2021-10-18 17:28   ` Mahesh J Salgaonkar
