[PATCH] powerpc/eeh: skip slot presence check when PE is temporarily unavailable.

* [PATCH] powerpc/eeh: skip slot presence check when PE is temporarily unavailable.
@ 2021-05-06 17:43 Mahesh Salgaonkar
  2021-05-07  0:41 ` Oliver O'Halloran
  0 siblings, 1 reply; 3+ messages in thread
From: Mahesh Salgaonkar @ 2021-05-06 17:43 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Oliver O'Halloran

When certain PHB HW failure causes phyp to recover PHB, it marks the PE
state as temporarily unavailable. In this case, per PAPR, rtas call
ibm,read-slot-reset-state2 returns a PE state as temporarily unavailable(5)
and OS has to wait until that recovery is complete. During this state the
slot presence check 'get-sensor-state(dr-entity-sense)' returns as DR
connector empty which leads to assumption that the device has been
hot-removed. This results into no EEH recovery on this device and it stays
in failed state forever.

This patch fixes this issue by skipping slot presence check only if device
PE state is temporarily unavailable(5).

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
---
 arch/powerpc/include/asm/eeh.h   |    1 +
 arch/powerpc/kernel/eeh.c        |   14 ++++++++++++--
 arch/powerpc/kernel/eeh_driver.c |   18 ++++++++++++++++++
 3 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index b1a5bba2e0b94..5dc5538e39b62 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -64,6 +64,7 @@ struct pci_dn;
 #define EEH_PE_RECOVERING	(1 << 1)	/* Recovering PE	*/
 #define EEH_PE_CFG_BLOCKED	(1 << 2)	/* Block config access	*/
 #define EEH_PE_RESET		(1 << 3)	/* PE reset in progress */
+#define EEH_PE_TEMP_UNAVAIL	(1 << 4)	/* PE is temporarily unavailable */
 
 #define EEH_PE_KEEP		(1 << 8)	/* Keep PE on hotplug	*/
 #define EEH_PE_CFG_RESTRICTED	(1 << 9)	/* Block config on error */
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 7040e430a1249..7fcbf3df18583 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -405,7 +405,8 @@ static int eeh_phb_check_failure(struct eeh_pe *pe)
 	    (ret == EEH_STATE_NOT_SUPPORT) || eeh_state_active(ret)) {
 		ret = 0;
 		goto out;
-	}
+	} else if (ret == EEH_STATE_UNAVAILABLE)
+		eeh_pe_state_mark(phb_pe, EEH_PE_TEMP_UNAVAIL);
 
 	/* Isolate the PHB and send event */
 	eeh_pe_mark_isolated(phb_pe);
@@ -519,14 +520,23 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
 	 * We will punt with the following conditions: Failure to get
 	 * PE's state, EEH not support and Permanently unavailable
 	 * state, PE is in good state.
+	 *
+	 * Certain PHB HW failure causes phyp/hypervisor to recover PHB and
+	 * until that recovery completes, the PE's state is temporarily
+	 * unavailable (EEH_STATE_UNAVAILABLE). In this state the slot
+	 * presence check must be avoided since it may not return valid
+	 * status. Mark this PE status as temporarily unavailable so
+	 * that we can check it later.
 	 */
+
 	if ((ret < 0) ||
 	    (ret == EEH_STATE_NOT_SUPPORT) || eeh_state_active(ret)) {
 		eeh_stats.false_positives++;
 		pe->false_positives++;
 		rc = 0;
 		goto dn_unlock;
-	}
+	} else if (ret == EEH_STATE_UNAVAILABLE)
+		eeh_pe_state_mark(pe, EEH_PE_TEMP_UNAVAIL);
 
 	/*
 	 * It should be corner case that the parent PE has been
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 3eff6a4888e79..a0913768f33de 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -851,6 +851,17 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 		return;
 	}
 
+	/*
+	 * When PE's state is temporarily unavailable, the slot
+	 * presence check returns as DR connector empty. This leads
+	 * to assumption that the device is hot-removed and causes EEH
+	 * recovery to stop leaving the device in failed state forever.
+	 * Hence skip the slot presence check if PE's state is
+	 * temporarily unavailable and go down EEH recovery path.
+	 */
+	if (pe->state & EEH_PE_TEMP_UNAVAIL)
+		goto skip_slot_presence_check;
+
 	/*
 	 * When devices are hot-removed we might get an EEH due to
 	 * a driver attempting to touch the MMIO space of a removed
@@ -871,6 +882,7 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 		goto out; /* nothing to recover */
 	}
 
+skip_slot_presence_check:
 	/* Log the event */
 	if (pe->type & EEH_PE_PHB) {
 		pr_err("EEH: Recovering PHB#%x, location: %s\n",
@@ -953,6 +965,12 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 		}
 	}
 
+	/*
+	 * Now that we finished waiting for PE state as per PAPR,
+	 * clear the PE temporarily unavailable state.
+	 */
+	eeh_pe_state_clear(pe, EEH_PE_TEMP_UNAVAIL, true);
+
 	/* Since rtas may enable MMIO when posting the error log,
 	 * don't post the error log until after all dev drivers
 	 * have been informed.



^ permalink raw reply related	[flat|nested] 3+ messages in thread