LinuxPPC-Dev Archive on lore.kernel.org
* [PATCH 0/6] powerpc/eeh: Improve recovery of passed-through devices
@ 2018-11-29  3:16 Sam Bobroff
  2018-11-29  3:16 ` [PATCH 1/6] powerpc/eeh: Cleanup eeh_pe_clear_frozen_state() Sam Bobroff
                   ` (5 more replies)
  0 siblings, 6 replies; 8+ messages in thread
From: Sam Bobroff @ 2018-11-29  3:16 UTC
  To: linuxppc-dev

Hello,

Here are changes that allow EEH to successfully recover after a failure that
affects both host and guest devices. This happens, for example, when a PHB
containing passed-through devices is fenced. (Failures that include only
passed-through devices are ignored by the host.)

Currently, when an error affects both passed-through and un-passed-through
devices, the passed-through devices are treated as if their driver was not EEH
aware. This causes them to be hot-unplugged as part of recovery.

The hot unplug request is forwarded to the guest which checks the device status
before releasing the device. Because the host is recovering the device, it
reports the device status as EEH_STATE_UNAVAILABLE which causes the guest to
wait for the device to become available. This deadlocks the recovery process.

This change causes the host to recover its own devices but leave
passed-through devices frozen until the guest performs its own recovery. (They
are not removed.) If the guest detects the error and begins recovery itself,
it waits for the device state to change away from EEH_STATE_UNAVAILABLE, which
happens only once the host has finished its recovery, so the guest's subsequent
recovery can then succeed.

Note that resetting a PE may implicitly thaw both it and child PEs, and to
prevent the device from being accidentally used by the guest (which may be
unaware of the failure and reset) when in this state, we re-freeze those
devices. This does leave a small window between the thaw and the re-freeze, but
closing it will need a firmware change.

I've also included a fix to the reset function (the last patch), because
without it some scenarios still fail. An example is injecting an error into
a PHB and then exiting a guest that contains passed-through devices from that
PHB so that an EEH event is raised during the process of passing the device
back to the host.

Cheers,
Sam.

Sam Bobroff (6):
  powerpc/eeh: Cleanup eeh_pe_clear_frozen_state()
  powerpc/eeh: remove sw_state from eeh_unfreeze_pe()
  powerpc/eeh: Add include_passed to eeh_pe_state_clear()
  powerpc/eeh: Add include_passed to eeh_clear_pe_frozen_state()
  powerpc/eeh: Improve recovery of passed-through devices
  powerpc/eeh: Correct retries in eeh_pe_reset_full()

 arch/powerpc/include/asm/eeh.h     |   4 +-
 arch/powerpc/include/asm/ppc-pci.h |   4 +-
 arch/powerpc/kernel/eeh.c          | 103 +++++++++++++++++++----------
 arch/powerpc/kernel/eeh_driver.c   |  86 ++++++++++--------------
 arch/powerpc/kernel/eeh_pe.c       |  68 ++++++++-----------
 arch/powerpc/kernel/eeh_sysfs.c    |   3 +-
 drivers/vfio/vfio_spapr_eeh.c      |   6 +-
 7 files changed, 140 insertions(+), 134 deletions(-)

-- 
2.19.0.2.gcad72f5712



* [PATCH 1/6] powerpc/eeh: Cleanup eeh_pe_clear_frozen_state()
  2018-11-29  3:16 [PATCH 0/6] powerpc/eeh: Improve recovery of passed-through devices Sam Bobroff
@ 2018-11-29  3:16 ` Sam Bobroff
  2019-02-08 13:02   ` [1/6] " Michael Ellerman
  2018-11-29  3:16 ` [PATCH 2/6] powerpc/eeh: remove sw_state from eeh_unfreeze_pe() Sam Bobroff
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 8+ messages in thread
From: Sam Bobroff @ 2018-11-29  3:16 UTC
  To: linuxppc-dev

The 'clear_sw_state' parameter for eeh_clear_pe_frozen_state() is
redundant because it has no effect (except in the rare case of a
hardware error part way through unfreezing a tree of PEs, where it
would dangerously allow partial de-isolation before returning
failure).

It is passed down to __eeh_pe_clear_frozen_state(), and from there to
eeh_unfreeze_pe(), where it causes EEH_PE_ISOLATED to be removed
from the state of each PE during the traversal.  However, when the
traversal finishes, EEH_PE_ISOLATED is unconditionally removed by a
call to eeh_pe_state_clear() regardless of the parameter's value.

So remove the flag and pass false to eeh_unfreeze_pe() (to avoid the
rare case described above, as it was before the flag was introduced).
Also, perform the recursion directly in the function and eliminate a
bit of boilerplate.

There should be no change in functionality, except as mentioned above.
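
As an illustration only, the retry-then-de-isolate ordering described above
can be sketched outside the kernel. Everything here is hypothetical: struct
toy_pe, toy_unfreeze() and the flat array stand in for struct eeh_pe,
eeh_unfreeze_pe() and the PE tree walk, and are not the kernel's code. The
point it tries to show is that no PE is de-isolated unless every PE was thawed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_UNFREEZE_TRIES 3

/* Hypothetical stand-in for struct eeh_pe; a flat array plays the PE tree. */
struct toy_pe {
	bool isolated;		/* models the EEH_PE_ISOLATED software flag */
	int fails_left;		/* how many more unfreeze attempts will fail */
};

/* Hypothetical stand-in for eeh_unfreeze_pe(): returns 0 on success. */
static int toy_unfreeze(struct toy_pe *pe)
{
	if (pe->fails_left > 0) {
		pe->fails_left--;
		return -EIO;
	}
	return 0;
}

/* Try each PE up to three times; de-isolate only if all of them thawed. */
static int toy_clear_frozen_state(struct toy_pe *pes, size_t n)
{
	size_t k;
	int i;

	assert(pes != NULL);
	for (k = 0; k < n; k++) {
		for (i = 0; i < MAX_UNFREEZE_TRIES; i++)
			if (!toy_unfreeze(&pes[k]))
				break;
		if (i >= MAX_UNFREEZE_TRIES)
			return -EIO;	/* nothing has been de-isolated yet */
	}
	for (k = 0; k < n; k++)
		pes[k].isolated = false;
	return 0;
}
```

In this model a failure part way through the walk leaves every PE still
marked isolated, which is the partial de-isolation hazard the message says is
being avoided.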

Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
---
 arch/powerpc/kernel/eeh_driver.c | 40 +++++++++++---------------------
 1 file changed, 13 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 9446248eb6b8..aa86a42d98f2 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -591,34 +591,20 @@ static void *eeh_pe_detach_dev(struct eeh_pe *pe, void *userdata)
  * PE reset (for 3 times), we try to clear the frozen state
  * for 3 times as well.
  */
-static void *__eeh_clear_pe_frozen_state(struct eeh_pe *pe, void *flag)
+static int eeh_clear_pe_frozen_state(struct eeh_pe *root)
 {
-	bool clear_sw_state = *(bool *)flag;
-	int i, rc = 1;
-
-	for (i = 0; rc && i < 3; i++)
-		rc = eeh_unfreeze_pe(pe, clear_sw_state);
+	struct eeh_pe *pe;
+	int i;
 
-	/* Stop immediately on any errors */
-	if (rc) {
-		pr_warn("%s: Failure %d unfreezing PHB#%x-PE#%x\n",
-			__func__, rc, pe->phb->global_number, pe->addr);
-		return (void *)pe;
+	eeh_for_each_pe(root, pe) {
+		for (i = 0; i < 3; i++)
+			if (!eeh_unfreeze_pe(pe, false))
+				break;
+		if (i >= 3)
+			return -EIO;
 	}
-
-	return NULL;
-}
-
-static int eeh_clear_pe_frozen_state(struct eeh_pe *pe,
-				     bool clear_sw_state)
-{
-	void *rc;
-
-	rc = eeh_pe_traverse(pe, __eeh_clear_pe_frozen_state, &clear_sw_state);
-	if (!rc)
-		eeh_pe_state_clear(pe, EEH_PE_ISOLATED);
-
-	return rc ? -EIO : 0;
+	eeh_pe_state_clear(root, EEH_PE_ISOLATED);
+	return 0;
 }
 
 int eeh_pe_reset_and_recover(struct eeh_pe *pe)
@@ -643,7 +629,7 @@ int eeh_pe_reset_and_recover(struct eeh_pe *pe)
 	}
 
 	/* Unfreeze the PE */
-	ret = eeh_clear_pe_frozen_state(pe, true);
+	ret = eeh_clear_pe_frozen_state(pe);
 	if (ret) {
 		eeh_pe_state_clear(pe, EEH_PE_RECOVERING);
 		return ret;
@@ -716,7 +702,7 @@ static int eeh_reset_device(struct eeh_pe *pe, struct pci_bus *bus,
 	eeh_pe_restore_bars(pe);
 
 	/* Clear frozen state */
-	rc = eeh_clear_pe_frozen_state(pe, false);
+	rc = eeh_clear_pe_frozen_state(pe);
 	if (rc) {
 		pci_unlock_rescan_remove();
 		return rc;
-- 
2.19.0.2.gcad72f5712



* [PATCH 2/6] powerpc/eeh: remove sw_state from eeh_unfreeze_pe()
  2018-11-29  3:16 [PATCH 0/6] powerpc/eeh: Improve recovery of passed-through devices Sam Bobroff
  2018-11-29  3:16 ` [PATCH 1/6] powerpc/eeh: Cleanup eeh_pe_clear_frozen_state() Sam Bobroff
@ 2018-11-29  3:16 ` Sam Bobroff
  2018-11-29  3:16 ` [PATCH 3/6] powerpc/eeh: Add include_passed to eeh_pe_state_clear() Sam Bobroff
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Sam Bobroff @ 2018-11-29  3:16 UTC
  To: linuxppc-dev

eeh_unfreeze_pe() performs two operations: unfreezing a PE (which may
cause firmware to unfreeze child PEs as well) and de-isolating the PE
and its children.

To simplify this and support future work, separate out the
de-isolation and perform it at the call sites (when necessary).

There should be no change in behaviour.
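
A minimal userspace sketch of the split, for illustration only. The names
struct model_pe, model_unfreeze_pe() and model_reenable() are hypothetical
stand-ins and do not exist in the kernel; the hw_fault field is an assumption
used to simulate a failing firmware call.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct eeh_pe; not the kernel structure. */
struct model_pe {
	bool frozen;	/* hardware freeze state */
	bool isolated;	/* models the software EEH_PE_ISOLATED flag */
	bool hw_fault;	/* when set, the modelled firmware call fails */
};

/* After the split: only thaws, never touches the software flag. */
static int model_unfreeze_pe(struct model_pe *pe)
{
	assert(pe != NULL);
	if (pe->hw_fault)
		return -EIO;
	pe->frozen = false;
	return 0;
}

/* A call site that wants de-isolation does it explicitly, on success only. */
static int model_reenable(struct model_pe *pe)
{
	int ret = model_unfreeze_pe(pe);

	if (!ret)
		pe->isolated = false;
	return ret;
}
```

Callers that previously passed false would, in this model, simply call
model_unfreeze_pe() and leave the isolated flag alone.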

Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
---
 arch/powerpc/include/asm/eeh.h   |  2 +-
 arch/powerpc/kernel/eeh.c        | 18 ++++++++++--------
 arch/powerpc/kernel/eeh_driver.c |  2 +-
 arch/powerpc/kernel/eeh_sysfs.c  |  3 ++-
 4 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 8b596d096ebe..2ff123f745cc 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -293,7 +293,7 @@ void eeh_add_device_late(struct pci_dev *);
 void eeh_add_device_tree_late(struct pci_bus *);
 void eeh_add_sysfs_files(struct pci_bus *);
 void eeh_remove_device(struct pci_dev *);
-int eeh_unfreeze_pe(struct eeh_pe *pe, bool sw_state);
+int eeh_unfreeze_pe(struct eeh_pe *pe);
 int eeh_pe_reset_and_recover(struct eeh_pe *pe);
 int eeh_dev_open(struct pci_dev *pdev);
 void eeh_dev_release(struct pci_dev *pdev);
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 6cae6b56ffd6..ac8e69ee93a7 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -823,7 +823,7 @@ int pcibios_set_pcie_reset_state(struct pci_dev *dev, enum pcie_reset_state stat
 	switch (state) {
 	case pcie_deassert_reset:
 		eeh_ops->reset(pe, EEH_RESET_DEACTIVATE);
-		eeh_unfreeze_pe(pe, false);
+		eeh_unfreeze_pe(pe);
 		if (!(pe->type & EEH_PE_VF))
 			eeh_pe_state_clear(pe, EEH_PE_CFG_BLOCKED);
 		eeh_pe_dev_traverse(pe, eeh_restore_dev_state, dev);
@@ -1309,7 +1309,7 @@ void eeh_remove_device(struct pci_dev *dev)
 	edev->mode &= ~EEH_DEV_SYSFS;
 }
 
-int eeh_unfreeze_pe(struct eeh_pe *pe, bool sw_state)
+int eeh_unfreeze_pe(struct eeh_pe *pe)
 {
 	int ret;
 
@@ -1327,10 +1327,6 @@ int eeh_unfreeze_pe(struct eeh_pe *pe, bool sw_state)
 		return ret;
 	}
 
-	/* Clear software isolated state */
-	if (sw_state && (pe->state & EEH_PE_ISOLATED))
-		eeh_pe_state_clear(pe, EEH_PE_ISOLATED);
-
 	return ret;
 }
 
@@ -1382,7 +1378,10 @@ static int eeh_pe_change_owner(struct eeh_pe *pe)
 		}
 	}
 
-	return eeh_unfreeze_pe(pe, true);
+	ret = eeh_unfreeze_pe(pe);
+	if (!ret)
+		eeh_pe_state_clear(pe, EEH_PE_ISOLATED);
+	return ret;
 }
 
 /**
@@ -1639,7 +1638,10 @@ static int eeh_pe_reenable_devices(struct eeh_pe *pe)
 	}
 
 	/* The PE is still in frozen state */
-	return eeh_unfreeze_pe(pe, true);
+	ret = eeh_unfreeze_pe(pe);
+	if (!ret)
+		eeh_pe_state_clear(pe, EEH_PE_ISOLATED);
+	return ret;
 }
 
 
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index aa86a42d98f2..0109d5d7fe63 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -598,7 +598,7 @@ static int eeh_clear_pe_frozen_state(struct eeh_pe *root)
 
 	eeh_for_each_pe(root, pe) {
 		for (i = 0; i < 3; i++)
-			if (!eeh_unfreeze_pe(pe, false))
+			if (!eeh_unfreeze_pe(pe))
 				break;
 		if (i >= 3)
 			return -EIO;
diff --git a/arch/powerpc/kernel/eeh_sysfs.c b/arch/powerpc/kernel/eeh_sysfs.c
index deed906dd8f1..0731d2f01dd9 100644
--- a/arch/powerpc/kernel/eeh_sysfs.c
+++ b/arch/powerpc/kernel/eeh_sysfs.c
@@ -82,8 +82,9 @@ static ssize_t eeh_pe_state_store(struct device *dev,
 	if (!(edev->pe->state & EEH_PE_ISOLATED))
 		return count;
 
-	if (eeh_unfreeze_pe(edev->pe, true))
+	if (eeh_unfreeze_pe(edev->pe))
 		return -EIO;
+	eeh_pe_state_clear(edev->pe, EEH_PE_ISOLATED);
 
 	return count;
 }
-- 
2.19.0.2.gcad72f5712



* [PATCH 3/6] powerpc/eeh: Add include_passed to eeh_pe_state_clear()
  2018-11-29  3:16 [PATCH 0/6] powerpc/eeh: Improve recovery of passed-through devices Sam Bobroff
  2018-11-29  3:16 ` [PATCH 1/6] powerpc/eeh: Cleanup eeh_pe_clear_frozen_state() Sam Bobroff
  2018-11-29  3:16 ` [PATCH 2/6] powerpc/eeh: remove sw_state from eeh_unfreeze_pe() Sam Bobroff
@ 2018-11-29  3:16 ` Sam Bobroff
  2018-11-29  3:16 ` [PATCH 4/6] powerpc/eeh: Add include_passed to eeh_clear_pe_frozen_state() Sam Bobroff
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Sam Bobroff @ 2018-11-29  3:16 UTC
  To: linuxppc-dev

Add a parameter to eeh_pe_state_clear() that allows passed-through PEs
to be excluded. Update callers to always pass true so that there is no
change in behaviour.

Also refactor to use direct traversal, to allow the removal of some
boilerplate.

This is to prepare for follow-up work for passed-through devices.
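
To make the intended filtering concrete, here is a small standalone sketch.
It is an assumption-laden model, not the kernel code: struct demo_pe, the
DEMO_* flag values and the flat array are hypothetical, and the 'passed'
field stands in for whatever eeh_pe_passed() reports.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values; the real EEH_PE_* constants differ. */
#define DEMO_ISOLATED	0x1
#define DEMO_REMOVED	0x2
#define DEMO_RECOVERING	0x4

/* Hypothetical stand-in for struct eeh_pe. */
struct demo_pe {
	int state;		/* bitmask of DEMO_* flags */
	bool passed;		/* models eeh_pe_passed(pe) */
	int check_count;
};

/* Clear 'state' bits on each PE, optionally skipping passed-through PEs. */
static void demo_state_clear(struct demo_pe *pes, size_t n, int state,
			     bool include_passed)
{
	size_t k;

	assert(pes != NULL);
	for (k = 0; k < n; k++) {
		struct demo_pe *pe = &pes[k];

		/* Permanently removed PEs keep their state intact */
		if (pe->state & DEMO_REMOVED)
			continue;
		if (!include_passed && pe->passed)
			continue;

		pe->state &= ~state;

		/* Clearing isolation also resets the check count */
		if (state & DEMO_ISOLATED)
			pe->check_count = 0;
	}
}
```

With include_passed set to true the model behaves like the old function,
which matches the "no change in behaviour" claim for the updated callers.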

Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
---
 arch/powerpc/include/asm/ppc-pci.h |  2 +-
 arch/powerpc/kernel/eeh.c          | 18 ++++----
 arch/powerpc/kernel/eeh_driver.c   | 20 ++++-----
 arch/powerpc/kernel/eeh_pe.c       | 68 +++++++++++++-----------------
 arch/powerpc/kernel/eeh_sysfs.c    |  2 +-
 5 files changed, 50 insertions(+), 60 deletions(-)

diff --git a/arch/powerpc/include/asm/ppc-pci.h b/arch/powerpc/include/asm/ppc-pci.h
index f67da277d652..08e094eaeccf 100644
--- a/arch/powerpc/include/asm/ppc-pci.h
+++ b/arch/powerpc/include/asm/ppc-pci.h
@@ -59,7 +59,7 @@ int rtas_write_config(struct pci_dn *, int where, int size, u32 val);
 int rtas_read_config(struct pci_dn *, int where, int size, u32 *val);
 void eeh_pe_state_mark(struct eeh_pe *pe, int state);
 void eeh_pe_mark_isolated(struct eeh_pe *pe);
-void eeh_pe_state_clear(struct eeh_pe *pe, int state);
+void eeh_pe_state_clear(struct eeh_pe *pe, int state, bool include_passed);
 void eeh_pe_state_mark_with_cfg(struct eeh_pe *pe, int state);
 void eeh_pe_dev_mode_mark(struct eeh_pe *pe, int mode);
 
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index ac8e69ee93a7..052512e58b05 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -825,13 +825,13 @@ int pcibios_set_pcie_reset_state(struct pci_dev *dev, enum pcie_reset_state stat
 		eeh_ops->reset(pe, EEH_RESET_DEACTIVATE);
 		eeh_unfreeze_pe(pe);
 		if (!(pe->type & EEH_PE_VF))
-			eeh_pe_state_clear(pe, EEH_PE_CFG_BLOCKED);
+			eeh_pe_state_clear(pe, EEH_PE_CFG_BLOCKED, true);
 		eeh_pe_dev_traverse(pe, eeh_restore_dev_state, dev);
-		eeh_pe_state_clear(pe, EEH_PE_ISOLATED);
+		eeh_pe_state_clear(pe, EEH_PE_ISOLATED, true);
 		break;
 	case pcie_hot_reset:
 		eeh_pe_mark_isolated(pe);
-		eeh_pe_state_clear(pe, EEH_PE_CFG_BLOCKED);
+		eeh_pe_state_clear(pe, EEH_PE_CFG_BLOCKED, true);
 		eeh_ops->set_option(pe, EEH_OPT_FREEZE_PE);
 		eeh_pe_dev_traverse(pe, eeh_disable_and_save_dev_state, dev);
 		if (!(pe->type & EEH_PE_VF))
@@ -840,7 +840,7 @@ int pcibios_set_pcie_reset_state(struct pci_dev *dev, enum pcie_reset_state stat
 		break;
 	case pcie_warm_reset:
 		eeh_pe_mark_isolated(pe);
-		eeh_pe_state_clear(pe, EEH_PE_CFG_BLOCKED);
+		eeh_pe_state_clear(pe, EEH_PE_CFG_BLOCKED, true);
 		eeh_ops->set_option(pe, EEH_OPT_FREEZE_PE);
 		eeh_pe_dev_traverse(pe, eeh_disable_and_save_dev_state, dev);
 		if (!(pe->type & EEH_PE_VF))
@@ -848,7 +848,7 @@ int pcibios_set_pcie_reset_state(struct pci_dev *dev, enum pcie_reset_state stat
 		eeh_ops->reset(pe, EEH_RESET_FUNDAMENTAL);
 		break;
 	default:
-		eeh_pe_state_clear(pe, EEH_PE_ISOLATED | EEH_PE_CFG_BLOCKED);
+		eeh_pe_state_clear(pe, EEH_PE_ISOLATED | EEH_PE_CFG_BLOCKED, true);
 		return -EINVAL;
 	};
 
@@ -936,7 +936,7 @@ int eeh_pe_reset_full(struct eeh_pe *pe)
 			__func__, state, pe->phb->global_number, pe->addr, (i + 1));
 	}
 
-	eeh_pe_state_clear(pe, reset_state);
+	eeh_pe_state_clear(pe, reset_state, true);
 	return ret;
 }
 
@@ -1380,7 +1380,7 @@ static int eeh_pe_change_owner(struct eeh_pe *pe)
 
 	ret = eeh_unfreeze_pe(pe);
 	if (!ret)
-		eeh_pe_state_clear(pe, EEH_PE_ISOLATED);
+		eeh_pe_state_clear(pe, EEH_PE_ISOLATED, true);
 	return ret;
 }
 
@@ -1640,7 +1640,7 @@ static int eeh_pe_reenable_devices(struct eeh_pe *pe)
 	/* The PE is still in frozen state */
 	ret = eeh_unfreeze_pe(pe);
 	if (!ret)
-		eeh_pe_state_clear(pe, EEH_PE_ISOLATED);
+		eeh_pe_state_clear(pe, EEH_PE_ISOLATED, true);
 	return ret;
 }
 
@@ -1668,7 +1668,7 @@ int eeh_pe_reset(struct eeh_pe *pe, int option)
 	switch (option) {
 	case EEH_RESET_DEACTIVATE:
 		ret = eeh_ops->reset(pe, option);
-		eeh_pe_state_clear(pe, EEH_PE_CFG_BLOCKED);
+		eeh_pe_state_clear(pe, EEH_PE_CFG_BLOCKED, true);
 		if (ret)
 			break;
 
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 0109d5d7fe63..b2687c14dc40 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -603,7 +603,7 @@ static int eeh_clear_pe_frozen_state(struct eeh_pe *root)
 		if (i >= 3)
 			return -EIO;
 	}
-	eeh_pe_state_clear(root, EEH_PE_ISOLATED);
+	eeh_pe_state_clear(root, EEH_PE_ISOLATED, true);
 	return 0;
 }
 
@@ -624,14 +624,14 @@ int eeh_pe_reset_and_recover(struct eeh_pe *pe)
 	/* Issue reset */
 	ret = eeh_pe_reset_full(pe);
 	if (ret) {
-		eeh_pe_state_clear(pe, EEH_PE_RECOVERING);
+		eeh_pe_state_clear(pe, EEH_PE_RECOVERING, true);
 		return ret;
 	}
 
 	/* Unfreeze the PE */
 	ret = eeh_clear_pe_frozen_state(pe);
 	if (ret) {
-		eeh_pe_state_clear(pe, EEH_PE_RECOVERING);
+		eeh_pe_state_clear(pe, EEH_PE_RECOVERING, true);
 		return ret;
 	}
 
@@ -639,7 +639,7 @@ int eeh_pe_reset_and_recover(struct eeh_pe *pe)
 	eeh_pe_dev_traverse(pe, eeh_dev_restore_state, NULL);
 
 	/* Clear recovery mode */
-	eeh_pe_state_clear(pe, EEH_PE_RECOVERING);
+	eeh_pe_state_clear(pe, EEH_PE_RECOVERING, true);
 
 	return 0;
 }
@@ -730,11 +730,11 @@ static int eeh_reset_device(struct eeh_pe *pe, struct pci_bus *bus,
 			eeh_add_virt_device(edev);
 		} else {
 			if (!driver_eeh_aware)
-				eeh_pe_state_clear(pe, EEH_PE_PRI_BUS);
+				eeh_pe_state_clear(pe, EEH_PE_PRI_BUS, true);
 			pci_hp_add_devices(bus);
 		}
 	}
-	eeh_pe_state_clear(pe, EEH_PE_KEEP);
+	eeh_pe_state_clear(pe, EEH_PE_KEEP, true);
 
 	pe->tstamp = tstamp;
 	pe->freeze_count = cnt;
@@ -886,7 +886,7 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 			 * is still in frozen state. Clear it before
 			 * resuming the PE.
 			 */
-			eeh_pe_state_clear(pe, EEH_PE_ISOLATED);
+			eeh_pe_state_clear(pe, EEH_PE_ISOLATED, true);
 			result = PCI_ERS_RESULT_RECOVERED;
 		}
 	}
@@ -963,7 +963,7 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 			eeh_pe_dev_traverse(pe, eeh_rmv_device, NULL);
 			eeh_pe_dev_mode_mark(pe, EEH_DEV_REMOVED);
 		} else {
-			eeh_pe_state_clear(pe, EEH_PE_PRI_BUS);
+			eeh_pe_state_clear(pe, EEH_PE_PRI_BUS, true);
 			eeh_pe_dev_mode_mark(pe, EEH_DEV_REMOVED);
 
 			pci_lock_rescan_remove();
@@ -973,7 +973,7 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 			return;
 		}
 	}
-	eeh_pe_state_clear(pe, EEH_PE_RECOVERING);
+	eeh_pe_state_clear(pe, EEH_PE_RECOVERING, true);
 }
 
 /**
@@ -1055,7 +1055,7 @@ void eeh_handle_special_event(void)
 					continue;
 
 				/* Notify all devices to be down */
-				eeh_pe_state_clear(pe, EEH_PE_PRI_BUS);
+				eeh_pe_state_clear(pe, EEH_PE_PRI_BUS, true);
 				eeh_set_channel_state(pe, pci_channel_io_perm_failure);
 				eeh_pe_report(
 					"error_detected(permanent failure)", pe,
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index 6fa2032e0594..8b578891f27c 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -657,62 +657,52 @@ void eeh_pe_dev_mode_mark(struct eeh_pe *pe, int mode)
 }
 
 /**
- * __eeh_pe_state_clear - Clear state for the PE
+ * eeh_pe_state_clear - Clear state for the PE
  * @data: EEH PE
- * @flag: state
+ * @state: state
+ * @include_passed: include passed-through devices?
  *
  * The function is used to clear the indicated state from the
  * given PE. Besides, we also clear the check count of the PE
  * as well.
  */
-static void *__eeh_pe_state_clear(struct eeh_pe *pe, void *flag)
+void eeh_pe_state_clear(struct eeh_pe *root, int state, bool include_passed)
 {
-	int state = *((int *)flag);
+	struct eeh_pe *pe;
 	struct eeh_dev *edev, *tmp;
 	struct pci_dev *pdev;
 
-	/* Keep the state of permanently removed PE intact */
-	if (pe->state & EEH_PE_REMOVED)
-		return NULL;
+	eeh_for_each_pe(root, pe) {
+		/* Keep the state of permanently removed PE intact */
+		if (pe->state & EEH_PE_REMOVED)
+			continue;
 
-	pe->state &= ~state;
+		if (!include_passed && eeh_pe_passed(pe))
+			continue;
 
-	/*
-	 * Special treatment on clearing isolated state. Clear
-	 * check count since last isolation and put all affected
-	 * devices to normal state.
-	 */
-	if (!(state & EEH_PE_ISOLATED))
-		return NULL;
+		pe->state &= ~state;
 
-	pe->check_count = 0;
-	eeh_pe_for_each_dev(pe, edev, tmp) {
-		pdev = eeh_dev_to_pci_dev(edev);
-		if (!pdev)
+		/*
+		 * Special treatment on clearing isolated state. Clear
+		 * check count since last isolation and put all affected
+		 * devices to normal state.
+		 */
+		if (!(state & EEH_PE_ISOLATED))
 			continue;
 
-		pdev->error_state = pci_channel_io_normal;
-	}
-
-	/* Unblock PCI config access if required */
-	if (pe->state & EEH_PE_CFG_RESTRICTED)
-		pe->state &= ~EEH_PE_CFG_BLOCKED;
+		pe->check_count = 0;
+		eeh_pe_for_each_dev(pe, edev, tmp) {
+			pdev = eeh_dev_to_pci_dev(edev);
+			if (!pdev)
+				continue;
 
-	return NULL;
-}
+			pdev->error_state = pci_channel_io_normal;
+		}
 
-/**
- * eeh_pe_state_clear - Clear state for the PE and its children
- * @pe: PE
- * @state: state to be cleared
- *
- * When the PE and its children has been recovered from error,
- * we need clear the error state for that. The function is used
- * for the purpose.
- */
-void eeh_pe_state_clear(struct eeh_pe *pe, int state)
-{
-	eeh_pe_traverse(pe, __eeh_pe_state_clear, &state);
+		/* Unblock PCI config access if required */
+		if (pe->state & EEH_PE_CFG_RESTRICTED)
+			pe->state &= ~EEH_PE_CFG_BLOCKED;
+	}
 }
 
 /*
diff --git a/arch/powerpc/kernel/eeh_sysfs.c b/arch/powerpc/kernel/eeh_sysfs.c
index 0731d2f01dd9..3fa04dda1737 100644
--- a/arch/powerpc/kernel/eeh_sysfs.c
+++ b/arch/powerpc/kernel/eeh_sysfs.c
@@ -84,7 +84,7 @@ static ssize_t eeh_pe_state_store(struct device *dev,
 
 	if (eeh_unfreeze_pe(edev->pe))
 		return -EIO;
-	eeh_pe_state_clear(edev->pe, EEH_PE_ISOLATED);
+	eeh_pe_state_clear(edev->pe, EEH_PE_ISOLATED, true);
 
 	return count;
 }
-- 
2.19.0.2.gcad72f5712



* [PATCH 4/6] powerpc/eeh: Add include_passed to eeh_clear_pe_frozen_state()
  2018-11-29  3:16 [PATCH 0/6] powerpc/eeh: Improve recovery of passed-through devices Sam Bobroff
                   ` (2 preceding siblings ...)
  2018-11-29  3:16 ` [PATCH 3/6] powerpc/eeh: Add include_passed to eeh_pe_state_clear() Sam Bobroff
@ 2018-11-29  3:16 ` Sam Bobroff
  2018-11-29  3:16 ` [PATCH 5/6] powerpc/eeh: Improve recovery of passed-through devices Sam Bobroff
  2018-11-29  3:16 ` [PATCH 6/6] powerpc/eeh: Correct retries in eeh_pe_reset_full() Sam Bobroff
  5 siblings, 0 replies; 8+ messages in thread
From: Sam Bobroff @ 2018-11-29  3:16 UTC
  To: linuxppc-dev

Add a parameter to eeh_clear_pe_frozen_state() that allows
passed-through PEs to be excluded. Update callers to always pass true
so that there is no change in behaviour.

This is to prepare for follow-up work for passed-through devices.

Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
---
 arch/powerpc/kernel/eeh_driver.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index b2687c14dc40..61c177ebb230 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -591,19 +591,21 @@ static void *eeh_pe_detach_dev(struct eeh_pe *pe, void *userdata)
  * PE reset (for 3 times), we try to clear the frozen state
  * for 3 times as well.
  */
-static int eeh_clear_pe_frozen_state(struct eeh_pe *root)
+static int eeh_clear_pe_frozen_state(struct eeh_pe *root, bool include_passed)
 {
 	struct eeh_pe *pe;
 	int i;
 
 	eeh_for_each_pe(root, pe) {
-		for (i = 0; i < 3; i++)
-			if (!eeh_unfreeze_pe(pe))
-				break;
-		if (i >= 3)
-			return -EIO;
+		if (include_passed || !eeh_pe_passed(pe)) {
+			for (i = 0; i < 3; i++)
+				if (!eeh_unfreeze_pe(pe))
+					break;
+			if (i >= 3)
+				return -EIO;
+		}
 	}
-	eeh_pe_state_clear(root, EEH_PE_ISOLATED, true);
+	eeh_pe_state_clear(root, EEH_PE_ISOLATED, include_passed);
 	return 0;
 }
 
@@ -629,7 +631,7 @@ int eeh_pe_reset_and_recover(struct eeh_pe *pe)
 	}
 
 	/* Unfreeze the PE */
-	ret = eeh_clear_pe_frozen_state(pe);
+	ret = eeh_clear_pe_frozen_state(pe, true);
 	if (ret) {
 		eeh_pe_state_clear(pe, EEH_PE_RECOVERING, true);
 		return ret;
@@ -702,7 +704,7 @@ static int eeh_reset_device(struct eeh_pe *pe, struct pci_bus *bus,
 	eeh_pe_restore_bars(pe);
 
 	/* Clear frozen state */
-	rc = eeh_clear_pe_frozen_state(pe);
+	rc = eeh_clear_pe_frozen_state(pe, true);
 	if (rc) {
 		pci_unlock_rescan_remove();
 		return rc;
-- 
2.19.0.2.gcad72f5712



* [PATCH 5/6] powerpc/eeh: Improve recovery of passed-through devices
  2018-11-29  3:16 [PATCH 0/6] powerpc/eeh: Improve recovery of passed-through devices Sam Bobroff
                   ` (3 preceding siblings ...)
  2018-11-29  3:16 ` [PATCH 4/6] powerpc/eeh: Add include_passed to eeh_clear_pe_frozen_state() Sam Bobroff
@ 2018-11-29  3:16 ` Sam Bobroff
  2018-11-29  3:16 ` [PATCH 6/6] powerpc/eeh: Correct retries in eeh_pe_reset_full() Sam Bobroff
  5 siblings, 0 replies; 8+ messages in thread
From: Sam Bobroff @ 2018-11-29  3:16 UTC
  To: linuxppc-dev

Currently, the EEH recovery process considers passed-through devices
as if they were not EEH-aware, which can cause them to be removed as
part of recovery.  Because device removal requires cooperation from
the guest, this may lead to the process stalling or deadlocking.
Also, if devices are removed on the host side, they will be removed
from their IOMMU group, making recovery in the guest impossible.

Therefore, alter the recovery process so that passed-through devices
are not removed but are instead left frozen (and marked isolated)
until the guest performs its own recovery.  If firmware thaws a
passed-through PE because its parent PE has been thawed (because it
was not passed through), re-freeze it.
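
The re-freeze step can be sketched as a toy model, purely for illustration.
All names here (struct mock_pe, mock_firmware_reset(), mock_refreeze_passed(),
mock_reset_full()) are hypothetical, and the model assumes the worst case in
which a reset thaws every PE under the root.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct eeh_pe. */
struct mock_pe {
	bool passed;	/* models eeh_pe_passed(pe) */
	bool frozen;	/* hardware freeze state */
};

/* Assumed worst case: the reset implicitly thaws every PE. */
static void mock_firmware_reset(struct mock_pe *pes, size_t n)
{
	size_t k;

	for (k = 0; k < n; k++)
		pes[k].frozen = false;
}

/* Re-freeze any passed-through PE that the reset thawed; return how many. */
static size_t mock_refreeze_passed(struct mock_pe *pes, size_t n)
{
	size_t k, refrozen = 0;

	for (k = 0; k < n; k++) {
		if (pes[k].passed && !pes[k].frozen) {
			pes[k].frozen = true;
			refrozen++;
		}
	}
	return refrozen;
}

/* Reset, then re-freeze passed-through PEs unless the caller includes them. */
static size_t mock_reset_full(struct mock_pe *pes, size_t n,
			      bool include_passed)
{
	assert(pes != NULL);
	mock_firmware_reset(pes, n);
	if (!include_passed)
		return mock_refreeze_passed(pes, n);
	return 0;
}
```

In this model the host-owned PE ends up thawed while the passed-through PE
stays frozen for the guest to recover, which is the outcome described above.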

Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
---
 arch/powerpc/include/asm/eeh.h     |  2 +-
 arch/powerpc/include/asm/ppc-pci.h |  2 +-
 arch/powerpc/kernel/eeh.c          | 47 +++++++++++++++++++++++-------
 arch/powerpc/kernel/eeh_driver.c   | 32 +++++++++-----------
 drivers/vfio/vfio_spapr_eeh.c      |  6 ++--
 5 files changed, 55 insertions(+), 34 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 2ff123f745cc..0b655810f32d 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -300,7 +300,7 @@ void eeh_dev_release(struct pci_dev *pdev);
 struct eeh_pe *eeh_iommu_group_to_pe(struct iommu_group *group);
 int eeh_pe_set_option(struct eeh_pe *pe, int option);
 int eeh_pe_get_state(struct eeh_pe *pe);
-int eeh_pe_reset(struct eeh_pe *pe, int option);
+int eeh_pe_reset(struct eeh_pe *pe, int option, bool include_passed);
 int eeh_pe_configure(struct eeh_pe *pe);
 int eeh_pe_inject_err(struct eeh_pe *pe, int type, int func,
 		      unsigned long addr, unsigned long mask);
diff --git a/arch/powerpc/include/asm/ppc-pci.h b/arch/powerpc/include/asm/ppc-pci.h
index 08e094eaeccf..f191ef0d2a0a 100644
--- a/arch/powerpc/include/asm/ppc-pci.h
+++ b/arch/powerpc/include/asm/ppc-pci.h
@@ -53,7 +53,7 @@ void eeh_addr_cache_rmv_dev(struct pci_dev *dev);
 struct eeh_dev *eeh_addr_cache_get_dev(unsigned long addr);
 void eeh_slot_error_detail(struct eeh_pe *pe, int severity);
 int eeh_pci_enable(struct eeh_pe *pe, int function);
-int eeh_pe_reset_full(struct eeh_pe *pe);
+int eeh_pe_reset_full(struct eeh_pe *pe, bool include_passed);
 void eeh_save_bars(struct eeh_dev *edev);
 int rtas_write_config(struct pci_dn *, int where, int size, u32 val);
 int rtas_read_config(struct pci_dn *, int where, int size, u32 *val);
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 052512e58b05..df02f55fdfa1 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -877,6 +877,24 @@ static void *eeh_set_dev_freset(struct eeh_dev *edev, void *flag)
 	return NULL;
 }
 
+static void eeh_pe_refreeze_passed(struct eeh_pe *root)
+{
+	struct eeh_pe *pe;
+	int state;
+
+	eeh_for_each_pe(root, pe) {
+		if (eeh_pe_passed(pe)) {
+			state = eeh_ops->get_state(pe, NULL);
+			if (state &
+			   (EEH_STATE_MMIO_ACTIVE | EEH_STATE_MMIO_ENABLED)) {
+				pr_info("EEH: Passed-through PE PHB#%x-PE#%x was thawed by reset, re-freezing for safety.\n",
+					pe->phb->global_number, pe->addr);
+				eeh_pe_set_option(pe, EEH_OPT_FREEZE_PE);
+			}
+		}
+	}
+}
+
 /**
  * eeh_pe_reset_full - Complete a full reset process on the indicated PE
  * @pe: EEH PE
@@ -889,7 +907,7 @@ static void *eeh_set_dev_freset(struct eeh_dev *edev, void *flag)
  *
  * This function will attempt to reset a PE three times before failing.
  */
-int eeh_pe_reset_full(struct eeh_pe *pe)
+int eeh_pe_reset_full(struct eeh_pe *pe, bool include_passed)
 {
 	int reset_state = (EEH_PE_RESET | EEH_PE_CFG_BLOCKED);
 	int type = EEH_RESET_HOT;
@@ -911,11 +929,11 @@ int eeh_pe_reset_full(struct eeh_pe *pe)
 
 	/* Make three attempts at resetting the bus */
 	for (i = 0; i < 3; i++) {
-		ret = eeh_pe_reset(pe, type);
+		ret = eeh_pe_reset(pe, type, include_passed);
 		if (ret)
 			break;
 
-		ret = eeh_pe_reset(pe, EEH_RESET_DEACTIVATE);
+		ret = eeh_pe_reset(pe, EEH_RESET_DEACTIVATE, include_passed);
 		if (ret)
 			break;
 
@@ -936,6 +954,12 @@ int eeh_pe_reset_full(struct eeh_pe *pe)
 			__func__, state, pe->phb->global_number, pe->addr, (i + 1));
 	}
 
+	/* Resetting the PE may have unfrozen child PEs. If those PEs have been
+	 * (potentially) passed through to a guest, re-freeze them:
+	 */
+	if (!include_passed)
+		eeh_pe_refreeze_passed(pe);
+
 	eeh_pe_state_clear(pe, reset_state, true);
 	return ret;
 }
@@ -1611,13 +1635,12 @@ int eeh_pe_get_state(struct eeh_pe *pe)
 }
 EXPORT_SYMBOL_GPL(eeh_pe_get_state);
 
-static int eeh_pe_reenable_devices(struct eeh_pe *pe)
+static int eeh_pe_reenable_devices(struct eeh_pe *pe, bool include_passed)
 {
 	struct eeh_dev *edev, *tmp;
 	struct pci_dev *pdev;
 	int ret = 0;
 
-	/* Restore config space */
 	eeh_pe_restore_bars(pe);
 
 	/*
@@ -1638,9 +1661,13 @@ static int eeh_pe_reenable_devices(struct eeh_pe *pe)
 	}
 
 	/* The PE is still in frozen state */
-	ret = eeh_unfreeze_pe(pe);
+	if (include_passed || !eeh_pe_passed(pe)) {
+		ret = eeh_unfreeze_pe(pe);
+	} else
+		pr_info("EEH: Note: Leaving passthrough PHB#%x-PE#%x frozen.\n",
+			pe->phb->global_number, pe->addr);
 	if (!ret)
-		eeh_pe_state_clear(pe, EEH_PE_ISOLATED, true);
+		eeh_pe_state_clear(pe, EEH_PE_ISOLATED, include_passed);
 	return ret;
 }
 
@@ -1654,7 +1681,7 @@ static int eeh_pe_reenable_devices(struct eeh_pe *pe)
  * indicated type, either fundamental reset or hot reset.
  * PE reset is the most important part for error recovery.
  */
-int eeh_pe_reset(struct eeh_pe *pe, int option)
+int eeh_pe_reset(struct eeh_pe *pe, int option, bool include_passed)
 {
 	int ret = 0;
 
@@ -1668,11 +1695,11 @@ int eeh_pe_reset(struct eeh_pe *pe, int option)
 	switch (option) {
 	case EEH_RESET_DEACTIVATE:
 		ret = eeh_ops->reset(pe, option);
-		eeh_pe_state_clear(pe, EEH_PE_CFG_BLOCKED, true);
+		eeh_pe_state_clear(pe, EEH_PE_CFG_BLOCKED, include_passed);
 		if (ret)
 			break;
 
-		ret = eeh_pe_reenable_devices(pe);
+		ret = eeh_pe_reenable_devices(pe, include_passed);
 		break;
 	case EEH_RESET_HOT:
 	case EEH_RESET_FUNDAMENTAL:
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 61c177ebb230..ad7be478750f 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -510,22 +510,11 @@ static void *eeh_rmv_device(struct eeh_dev *edev, void *userdata)
 	 * support EEH. So we just care about PCI devices for
 	 * simplicity here.
 	 */
-	if (!dev || (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE))
-		return NULL;
-
-	/*
-	 * We rely on count-based pcibios_release_device() to
-	 * detach permanently offlined PEs. Unfortunately, that's
-	 * not reliable enough. We might have the permanently
-	 * offlined PEs attached, but we needn't take care of
-	 * them and their child devices.
-	 */
-	if (eeh_dev_removed(edev))
+	if (!eeh_edev_actionable(edev) ||
+	    (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE))
 		return NULL;
 
 	if (rmv_data) {
-		if (eeh_pe_passed(edev->pe))
-			return NULL;
 		driver = eeh_pcid_get(dev);
 		if (driver) {
 			if (driver->err_handler &&
@@ -539,8 +528,8 @@ static void *eeh_rmv_device(struct eeh_dev *edev, void *userdata)
 	}
 
 	/* Remove it from PCI subsystem */
-	pr_debug("EEH: Removing %s without EEH sensitive driver\n",
-		 pci_name(dev));
+	pr_info("EEH: Removing %s without EEH sensitive driver\n",
+		pci_name(dev));
 	edev->mode |= EEH_DEV_DISCONNECTED;
 	if (rmv_data)
 		rmv_data->removed_dev_count++;
@@ -624,7 +613,7 @@ int eeh_pe_reset_and_recover(struct eeh_pe *pe)
 	eeh_pe_dev_traverse(pe, eeh_dev_save_state, NULL);
 
 	/* Issue reset */
-	ret = eeh_pe_reset_full(pe);
+	ret = eeh_pe_reset_full(pe, true);
 	if (ret) {
 		eeh_pe_state_clear(pe, EEH_PE_RECOVERING, true);
 		return ret;
@@ -664,6 +653,11 @@ static int eeh_reset_device(struct eeh_pe *pe, struct pci_bus *bus,
 	time64_t tstamp;
 	int cnt, rc;
 	struct eeh_dev *edev;
+	struct eeh_pe *tmp_pe;
+	bool any_passed = false;
+
+	eeh_for_each_pe(pe, tmp_pe)
+		any_passed |= eeh_pe_passed(tmp_pe);
 
 	/* pcibios will clear the counter; save the value */
 	cnt = pe->freeze_count;
@@ -676,7 +670,7 @@ static int eeh_reset_device(struct eeh_pe *pe, struct pci_bus *bus,
 	 * into pci_hp_add_devices().
 	 */
 	eeh_pe_state_mark(pe, EEH_PE_KEEP);
-	if (driver_eeh_aware || (pe->type & EEH_PE_VF)) {
+	if (any_passed || driver_eeh_aware || (pe->type & EEH_PE_VF)) {
 		eeh_pe_dev_traverse(pe, eeh_rmv_device, rmv_data);
 	} else {
 		pci_lock_rescan_remove();
@@ -693,7 +687,7 @@ static int eeh_reset_device(struct eeh_pe *pe, struct pci_bus *bus,
 	 * config accesses. So we prefer to block them. However, controlled
 	 * PCI config accesses initiated from EEH itself are allowed.
 	 */
-	rc = eeh_pe_reset_full(pe);
+	rc = eeh_pe_reset_full(pe, false);
 	if (rc)
 		return rc;
 
@@ -704,7 +698,7 @@ static int eeh_reset_device(struct eeh_pe *pe, struct pci_bus *bus,
 	eeh_pe_restore_bars(pe);
 
 	/* Clear frozen state */
-	rc = eeh_clear_pe_frozen_state(pe, true);
+	rc = eeh_clear_pe_frozen_state(pe, false);
 	if (rc) {
 		pci_unlock_rescan_remove();
 		return rc;
diff --git a/drivers/vfio/vfio_spapr_eeh.c b/drivers/vfio/vfio_spapr_eeh.c
index 38edeb4729a9..1a742fe8f6db 100644
--- a/drivers/vfio/vfio_spapr_eeh.c
+++ b/drivers/vfio/vfio_spapr_eeh.c
@@ -74,13 +74,13 @@ long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
 			ret = eeh_pe_get_state(pe);
 			break;
 		case VFIO_EEH_PE_RESET_DEACTIVATE:
-			ret = eeh_pe_reset(pe, EEH_RESET_DEACTIVATE);
+			ret = eeh_pe_reset(pe, EEH_RESET_DEACTIVATE, true);
 			break;
 		case VFIO_EEH_PE_RESET_HOT:
-			ret = eeh_pe_reset(pe, EEH_RESET_HOT);
+			ret = eeh_pe_reset(pe, EEH_RESET_HOT, true);
 			break;
 		case VFIO_EEH_PE_RESET_FUNDAMENTAL:
-			ret = eeh_pe_reset(pe, EEH_RESET_FUNDAMENTAL);
+			ret = eeh_pe_reset(pe, EEH_RESET_FUNDAMENTAL, true);
 			break;
 		case VFIO_EEH_PE_CONFIGURE:
 			ret = eeh_pe_configure(pe);
-- 
2.19.0.2.gcad72f5712


* [PATCH 6/6] powerpc/eeh: Correct retries in eeh_pe_reset_full()
  2018-11-29  3:16 [PATCH 0/6] powerpc/eeh: Improve recovery of passed-through devices Sam Bobroff
                   ` (4 preceding siblings ...)
  2018-11-29  3:16 ` [PATCH 5/6] powerpc/eeh: Improve recovery of passed-through devices Sam Bobroff
@ 2018-11-29  3:16 ` Sam Bobroff
  5 siblings, 0 replies; 8+ messages in thread
From: Sam Bobroff @ 2018-11-29  3:16 UTC (permalink / raw)
  To: linuxppc-dev

Currently, eeh_pe_reset_full() will only attempt to reset a PE more
than once if activating the reset state and deactivating it both
succeed, but later polling shows that it hasn't become active.

Change this so that it will try up to three times for any reason other
than an unrecoverable slot error and adjust the message generation so
that it's clear whether the reset has ultimately succeeded or failed.
This allows the reset to succeed in some situations where it would
currently fail.
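
As a rough illustration of the retry policy described above, here is a
minimal userspace sketch. It is not the kernel code: struct attempt and
reset_full_model() are hypothetical names, the scripted outcomes stand in
for eeh_pe_reset() and eeh_wait_state(), and "state > 0" is an assumed
stand-in for eeh_state_active().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical scripted outcome of one reset attempt (not a kernel type). */
struct attempt {
	int reset_ret;      /* result of activating the reset, 0 on success */
	int deactivate_ret; /* result of deactivating the reset, 0 on success */
	int state;          /* polled state: < 0 unrecoverable, > 0 active, 0 inactive */
};

/*
 * Model of the loop: make up to three attempts, retrying after any failure
 * except an unrecoverable slot error. *attempts receives the number of
 * attempts that were made.
 */
static int reset_full_model(const struct attempt script[3], int *attempts)
{
	int i, ret = 0;

	assert(script != NULL && attempts != NULL);

	for (i = 0; i < 3; i++) {
		*attempts = i + 1;

		/* Deactivate only if activating the reset succeeded. */
		ret = script[i].reset_ret;
		if (!ret)
			ret = script[i].deactivate_ret;
		if (ret) {
			/* Either step failed: record an I/O error and retry. */
			ret = -EIO;
			continue;
		}

		/* An unrecoverable slot error is the one case that is not retried. */
		if (script[i].state < 0) {
			ret = -ENOTRECOVERABLE;
			break;
		}

		/* Assumed stand-in for eeh_state_active(): the slot is usable. */
		if (script[i].state > 0)
			break;

		/* Slot still inactive: loop again; ret is left as it is, as in the patch. */
	}

	return ret;
}
```

In this model a failed first attempt followed by a clean second attempt
returns 0 after two tries, whereas the previous behaviour would have
stopped at the first failure.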

Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
---
 arch/powerpc/kernel/eeh.c | 32 ++++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index df02f55fdfa1..528caf857428 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -912,7 +912,7 @@ int eeh_pe_reset_full(struct eeh_pe *pe, bool include_passed)
 	int reset_state = (EEH_PE_RESET | EEH_PE_CFG_BLOCKED);
 	int type = EEH_RESET_HOT;
 	unsigned int freset = 0;
-	int i, state, ret;
+	int i, state = 0, ret;
 
 	/*
 	 * Determine the type of reset to perform - hot or fundamental.
@@ -930,28 +930,32 @@ int eeh_pe_reset_full(struct eeh_pe *pe, bool include_passed)
 	/* Make three attempts at resetting the bus */
 	for (i = 0; i < 3; i++) {
 		ret = eeh_pe_reset(pe, type, include_passed);
-		if (ret)
-			break;
-
-		ret = eeh_pe_reset(pe, EEH_RESET_DEACTIVATE, include_passed);
-		if (ret)
-			break;
+		if (!ret)
+			ret = eeh_pe_reset(pe, EEH_RESET_DEACTIVATE,
+					   include_passed);
+		if (ret) {
+			ret = -EIO;
+			pr_warn("EEH: Failure %d resetting PHB#%x-PE#%x (attempt %d)\n\n",
+				state, pe->phb->global_number, pe->addr, i + 1);
+			continue;
+		}
+		if (i)
+			pr_warn("EEH: PHB#%x-PE#%x: Successful reset (attempt %d)\n",
+				pe->phb->global_number, pe->addr, i + 1);
 
 		/* Wait until the PE is in a functioning state */
 		state = eeh_wait_state(pe, PCI_BUS_RESET_WAIT_MSEC);
 		if (state < 0) {
-			pr_warn("%s: Unrecoverable slot failure on PHB#%x-PE#%x",
-				__func__, pe->phb->global_number, pe->addr);
+			pr_warn("EEH: Unrecoverable slot failure on PHB#%x-PE#%x",
+				pe->phb->global_number, pe->addr);
 			ret = -ENOTRECOVERABLE;
 			break;
 		}
 		if (eeh_state_active(state))
 			break;
-
-		/* Set error in case this is our last attempt */
-		ret = -EIO;
-		pr_warn("%s: Failure %d resetting PHB#%x-PE#%x\n (%d)\n",
-			__func__, state, pe->phb->global_number, pe->addr, (i + 1));
+		else
+			pr_warn("EEH: PHB#%x-PE#%x: Slot inactive after reset: 0x%x (attempt %d)\n",
+				pe->phb->global_number, pe->addr, state, i + 1);
 	}
 
 	/* Resetting the PE may have unfrozen child PEs. If those PEs have been
-- 
2.19.0.2.gcad72f5712


* Re: [1/6] powerpc/eeh: Cleanup eeh_pe_clear_frozen_state()
  2018-11-29  3:16 ` [PATCH 1/6] powerpc/eeh: Cleanup eeh_pe_clear_frozen_state() Sam Bobroff
@ 2019-02-08 13:02   ` " Michael Ellerman
  0 siblings, 0 replies; 8+ messages in thread
From: Michael Ellerman @ 2019-02-08 13:02 UTC (permalink / raw)
  To: Sam Bobroff, linuxppc-dev

On Thu, 2018-11-29 at 03:16:37 UTC, Sam Bobroff wrote:
> The 'clear_sw_state' parameter for eeh_pe_clear_frozen_state() is
> redundant because it has no effect (except in the rare case of a
> hardware error part way through unfreezing a tree of PEs, where it
> would dangerously allow partial de-isolation before returning
> failure).
> 
> It is passed down to __eeh_pe_clear_frozen_state(), and from there to
> eeh_unfreeze_pe(), where it causes EEH_PE_ISOLATED to be removed
> from the state of each PE during the traversal.  However, when the
> traversal finishes, EEH_PE_ISOLATED is unconditionally removed by a
> call to eeh_pe_state_clear() regardless of the parameter's value.
> 
> So remove the flag and pass false to eeh_unfreeze_pe() (to avoid the
> rare case described above, as it was before the flag was introduced).
> Also, perform the recursion directly in the function and eliminate a
> bit of boilerplate.
> 
> There should be no change in functionality, except as mentioned above.
> 
> Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/3376cb91ed908eb0728900894a77d820

cheers

Thread overview: 8+ messages
2018-11-29  3:16 [PATCH 0/6] powerpc/eeh: Improve recovery of passed-through devices Sam Bobroff
2018-11-29  3:16 ` [PATCH 1/6] powerpc/eeh: Cleanup eeh_pe_clear_frozen_state() Sam Bobroff
2019-02-08 13:02   ` [1/6] " Michael Ellerman
2018-11-29  3:16 ` [PATCH 2/6] powerpc/eeh: remove sw_state from eeh_unfreeze_pe() Sam Bobroff
2018-11-29  3:16 ` [PATCH 3/6] powerpc/eeh: Add include_passed to eeh_pe_state_clear() Sam Bobroff
2018-11-29  3:16 ` [PATCH 4/6] powerpc/eeh: Add include_passed to eeh_clear_pe_frozen_state() Sam Bobroff
2018-11-29  3:16 ` [PATCH 5/6] powerpc/eeh: Improve recovery of passed-through devices Sam Bobroff
2018-11-29  3:16 ` [PATCH 6/6] powerpc/eeh: Correct retries in eeh_pe_reset_full() Sam Bobroff
